JPH0855093A

JPH0855093A - Multiple vector parallel computer

Info

Publication number: JPH0855093A
Application number: JP6211813A
Authority: JP
Inventors: Tatsuo Nogi; 野木達夫
Original assignee: Individual
Current assignee: Individual
Priority date: 1994-08-11
Filing date: 1994-08-11
Publication date: 1996-02-27
Anticipated expiration: 2012-09-17
Also published as: JP2655243B2

Abstract

PURPOSE: To perform calculation at extra-high speed by coupling, through a network part, a main storage part, which consists of a three-dimensional arrangement of unit main storage parts each having its own three-dimensional arrangement of memory banks, and a processor part, which consists of a square arrangement of unit processor parts each consisting of a processor unit for control and vector units for parallel calculation. CONSTITUTION: The main storage part consists of a three-dimensional arrangement {ME(I, J, K), I, J, K = 1, 2, ..., M} of M×M×M unit main storage parts 101, each of which consists of a three-dimensional arrangement {MB(i, j, k), i, j, k = 1, 2, ..., N} of N×N×N memory banks. The processor part consists of a square arrangement {PE(I, J), I, J = 1, 2, ..., M} of M×M unit processor parts 100, each of which consists of a processor unit for control and N vector units for parallel calculation {VU(i), i = 1, 2, ..., N}. The main storage part and the processor part are coupled by the network part between them. Thus, calculation at extra-high speed is realized.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to constructing a high-performance parallel computer by combining present-day vector computers to a high degree, in order to meet the demand for ultra-high-speed scientific and engineering computation in the near future. Directly, it provides a driving force for the computer industry, including mainframe makers. Indirectly, it supports supercomputing work such as CAD in various industrial fields. Its field of application is therefore extremely broad.

[0002]

[Prior Art] To meet increasingly advanced supercomputing demand, development has proceeded on one side with massively parallel distributed-memory systems and on the other with parallel systems that use multiple vector processors on a shared memory. The former include machines using a hypercube network, machines with torus interconnection (such as the Cray T3D), and the ADENA parallel computer previously proposed by the present inventor (Japanese Patent No. 1404753). The latter include, at medium scale, the Cray Y-MP and the NEC SX3, and, at large scale, the Fujitsu VPP500. Among the former, apart from ADENA, there is a general trend of returning to the simplest possible grid interconnection, but that does not readily lead to higher performance. In the latter, an omega network or a crossbar switch is used to connect the processors to the memory block array so as to maintain a completely shared memory, and an enormous amount of hardware is therefore required as the number of processors increases. What has become clear from benchmark tests and the like under these circumstances is that higher performance comes from machines that are as close as possible to the shared-memory type or, if they are of the distributed-memory type, from hardware as elaborate as that of ADENA. The previously proposed "Vector parallel computer" (Japanese Patent Laid-Open No. 6-16227) was therefore put forward as a way of realizing the ADENA computation scheme with a partially shared memory. It permits the memory accesses that are indispensable for high performance while eliminating superfluous access paths to reduce hardware, and thus offers a very natural direction.

[0003]

[Problems to Be Solved by the Invention] First, the present invention addresses the problem of building an even higher-performance parallel computer by constructing a composite system that uses the previously filed "Vector parallel computer" (Japanese Patent Laid-Open No. 6-16227) as a unit. Second, it addresses the problem of achieving scalability, that is, of being able to construct systems ranging from relatively small to large on the same principle.

[0004] The prior application described a back-end computer system comprising a main storage unit consisting of a cubic array of N³ memory banks, a processor unit consisting of N processor units, and a vector latch unit consisting of N vector latches, located between the main storage unit and the processor unit, for temporarily holding vector data. It enabled the following operations. For each of the three directions along the edges of the cubic memory-bank array (x direction, y direction, z direction), one cross section (section) normal to that direction is selected. Among the N² memory banks on that section, N bank rows lying in the corresponding direction are accessed in parallel, each by a vector access that takes one element from each of the N banks of the row. For example, on a section whose normal is the x direction, a vector is formed by collecting one element from each of the N banks lined up in the y direction, and N such vectors, lined up in the z direction, can be accessed in parallel. This is called x access. The same holds for the other directions, which are called y access and z access. The basic feature is that any of these three access directions can be selected as needed. In addition, it is possible to designate any one of the bank rows lined up in the z direction and to access in parallel N vectors of length N, one from each bank in that row. This is called d access. The prior invention showed that these four access methods are possible.
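The x access of the unit system can be illustrated with a short Python sketch. It only enumerates which banks take part in one access. The function name `x_access_vectors` and the 1-based `(x, y, z)` tuple convention are assumptions made for illustration and are not notation from the prior application.

```python
def x_access_vectors(n, x):
    """Return the N vectors touched by one x access on section `x`.

    Each vector collects one element from each of the N banks lined up
    in the y direction; the N vectors are lined up in the z direction.
    Banks are identified by hypothetical 1-based (x, y, z) coordinates.
    """
    return [[(x, y, z) for y in range(1, n + 1)] for z in range(1, n + 1)]


# Example for N = 4 on the section x = 2.
vectors = x_access_vectors(4, x=2)
banks = [bank for vector in vectors for bank in vector]

# Every bank of the section is used exactly once, so the N vectors
# can be read in parallel without two vectors sharing a bank.
print(len(vectors), len(vectors[0]), len(set(banks)))
```

The point of the sketch is that the N vectors of one access partition the N² banks of the section, which is what allows them to be fetched simultaneously.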

[0005] To expand the system and secure scalability, the problem to be solved is to ensure that the four access types described above remain available unchanged when the system of the prior application is used as a unit and combined into a larger system.

[0006]

[Means for Solving the Problems] To solve the above problems, the present invention first separates the processor unit of the prior "Vector parallel computer" from the remainder. The remainder consists of the vector latch unit and the main storage unit; the former is regarded as attached to the latter, and the two together are hereinafter called the main storage unit. These units of processor and main storage are called the unit processor unit and the unit main storage unit and are denoted PE and ME, respectively. It is assumed that a PE contains N sets, each set consisting of several vector arithmetic units and a scalar arithmetic unit, and that an ME contains a cubic array of N×N×N memory blocks. Next, consider a composite system with M×M PEs, {PE(I, J), I, J = 1, 2, ..., M}, and M×M×M MEs, {ME(I, J, K), I, J, K = 1, 2, ..., M}. The above problems are solved by providing a suitable interconnection network between the PE group and the ME group.

[0007] In doing so, for X access, the x accesses of the individual PEs are performed simultaneously in parallel, so that the access to the ME group as a whole is analogous to the x access of a single unit. This is called X access to the system as a whole. Y access, Z access, and D access are realized in the same way.

[0008] To describe X access in detail: in the standard case, on the cross section of the ME group corresponding to the selected index I, a vector of length N (or N×M) is x-accessed from each element ME of the ME matrix indexed by J, K = 1, 2, ..., M, and the PE group accesses M×M vectors in total simultaneously in parallel (with repetitions according to M if the length is N×M). Here PE(J, K) performs the individual x access to the vector of ME(I, J, K). For each K, the PE column {PE(J, K), J = 1, 2, ..., M} acts as a group and accesses in parallel a vector of total length N×M formed by concatenating the N-length vectors for J = 1, 2, ..., M, and this is done in parallel over K = 1, 2, ..., M. X access is thereby realized for the system as a whole.

[0009] To describe Y access in detail: on the cross section corresponding to the selected index J, an N-length vector is y-accessed from each element ME of the ME matrix indexed by K, I = 1, 2, ..., M, and the PE group accesses M×M vectors in total simultaneously in parallel. Here PE(K, I) performs the individual y access to the vector of ME(I, J, K). For each I, the PE column {PE(K, I), K = 1, 2, ..., M} acts as a group and accesses in parallel a vector of total length N×M formed by concatenating the N-length vectors for K = 1, 2, ..., M, and this is done in parallel over I = 1, 2, ..., M. Y access is thereby realized for the system as a whole.

[0010] To describe Z access in detail: in the standard case, on the cross section corresponding to the selected index K, an N-length vector is z-accessed from each element ME of the ME matrix indexed by I, J = 1, 2, ..., M, and the PE group accesses M×M vectors in total simultaneously in parallel. Here PE(I, J) performs the individual z access to the vector of ME(I, J, K). For each J, the PE column {PE(I, J), I = 1, 2, ..., M} acts as a group and accesses in parallel a vector of total length N×M formed by concatenating the N-length vectors for I = 1, 2, ..., M, and this is done in parallel over J = 1, 2, ..., M. Z access is thereby realized for the system as a whole.

[0011] Finally, D access differs slightly from the preceding ones. In the standard case, a vector of length N×M is d-accessed from each element ME of the column of MEs, K = 1, 2, ..., M, lined up on the line segment corresponding to the selected indices I and J, and the PE group accesses M×M vectors in total simultaneously in parallel. Here PE(L, K) performs the individual d access to the vector of ME(I, J, K). That is, for each pair (I, J), each N×M-length vector for K = 1, 2, ..., M is distributed, N elements at a time, to {PE(L, K), L = 1, 2, ..., M}. Such access requires splitting an N×M-length vector for distribution and, conversely, gathering N-length vectors into an N×M-length vector.
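The splitting and gathering required by D access can be sketched in Python as follows. The function names `d_read` and `d_write` are hypothetical, and the sketch assumes that the L-th contiguous run of N elements goes to PE(L, K); the description states only that the long vector is allotted N elements at a time, so the contiguous ordering is an assumption.

```python
def d_read(long_vector, n, m, k):
    """Split an N*M-length vector read from ME(I, J, k) into M chunks.

    Chunk l (1-based) holds N elements and is delivered to PE(l, k).
    The contiguous chunk order is an assumption of this sketch.
    """
    if len(long_vector) != n * m:
        raise ValueError("expected a vector of length N*M")
    return {(l, k): long_vector[(l - 1) * n : l * n] for l in range(1, m + 1)}


def d_write(chunks, m, k):
    """Gather the N-length vectors held by PE(1..M, k) into one long vector."""
    long_vector = []
    for l in range(1, m + 1):
        long_vector.extend(chunks[(l, k)])
    return long_vector


# Example with N = 4 and M = 2 for the ME column element K = 1.
data = list(range(8))
chunks = d_read(data, n=4, m=2, k=1)
restored = d_write(chunks, m=2, k=1)
```

Reading and then writing restores the original long vector, which mirrors the statement that the write path is the reverse of the read path.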

[0012] To allow any of the above four access types to be selected, a network consisting of data buses and access-path selection logic is assembled between the processor unit and the main storage unit. In the part connected directly to the processor unit, type selection circuits are placed to select different bus paths according to the access type X/D, Y, or Z. On one side, each connects to the N sets of edges {PEE(J, K)(L), L = 1, 2, ..., N} coming out of one processor of the group {PE(J, K), J, K = 1, 2, ..., M}; on the other side, it connects to 3N sets of buses. M² such circuits are provided. X access and D access share the bus near the processors and need not be distinguished there. The circuits are written {SA(J, K)(L), J, K = 1, 2, ..., M, L = 1, 2, ..., N}.

[0013] In the part connected directly to the main storage unit, bus selection circuits are placed for each access type. Viewing the main storage unit as a whole, each selects the one cross-sectional array of the memory block array that is to be accessed (for D access, additionally one column on that section) and selects the bus coming from it. Because the access types are mutually exclusive, the processor-side terminals of every section selection circuit connect to the type selection circuit through a common bus. Each section selection circuit selects one of M memory blocks on the main storage side. N×M×M of them are placed for each access type. They are written {aSB(J, K)(L), J, K = 1, 2, ..., M, L = 1, 2, ..., N}, where a stands for one of X/D, Y, or Z and identifies the bus selection circuit.
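As a rough check on the size of the network, the counts stated above can be tabulated with a small Python helper. The function name `network_counts` and the dictionary keys are hypothetical labels chosen for this sketch; the formulas are the ones given in the description (M² type selection circuits with N edge sets and 3N bus sets each, and N×M×M section selection circuits per access type, each choosing among M inputs).

```python
def network_counts(n, m):
    """Tabulate selector counts for a composite system with parameters N and M."""
    return {
        "type_selection_circuits": m * m,
        "edge_sets_per_type_selection_circuit": n,
        "bus_sets_per_type_selection_circuit": 3 * n,
        "section_selectors_per_access_type": n * m * m,
        # Three selector groups exist: X/D (shared), Y, and Z.
        "section_selectors_total": 3 * n * m * m,
        "inputs_per_section_selector": m,
    }


# The small embodiment uses N = 4 and M = 2.
counts = network_counts(4, 2)
for name, value in counts.items():
    print(f"{name}: {value}")
```

For N = 4 and M = 2 this gives four type selection circuits with 4 processor-side and 12 memory-side sets each, which agrees with the multiplexers described for the embodiment.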

[0014] In X access, each X/DSB(J, K)(L) selects one of the buses coming from the row of M memory blocks {ME(I, J, K), I = 1, 2, ..., M}. In Y access, each YSB(K, I)(L) selects one of the buses coming from the row of M memory blocks {ME(I, J, K), J = 1, 2, ..., M}. In Z access, each ZSB(I, J)(L) selects one of the buses coming from the row of M memory blocks {ME(I, J, K), K = 1, 2, ..., M}. In D access, when data is read from main storage, each X/DSB(J, K)(L) connects to the data bus from the one memory block selected by specifying the index value I. While passing the long data (normally of length N×M), it distributes the data as M vectors of length N to the buses leading to the corresponding {SA(j, K)(L), j = 1, 2, ..., M}. When writing to main storage, the reverse is done: the data coming from {SA(j, K)(L), j = 1, 2, ..., M} is assembled into long data and sent from each X/DSB(J, K)(L) to ME(I, J, K).

[0015] The bus connections secured for each access type can be represented schematically as follows.

X access: I is a fixed value; J, K = 1, 2, ..., M; L = 1, 2, ..., N
  PEE(J,K)(L) - SA(J,K)(L) - X/DSB(J,K)(L) - ME(I,J,K)
Y access: J is a fixed value; K, I = 1, 2, ..., M; L = 1, 2, ..., N
  PEE(K,I)(L) - SA(K,I)(L) - YSB(K,I)(L) - ME(I,J,K)
Z access: K is a fixed value; I, J = 1, 2, ..., M; L = 1, 2, ..., N
  PEE(I,J)(L) - SA(I,J)(L) - ZSB(I,J)(L) - ME(I,J,K)
D access: I, J are fixed values; j, K = 1, 2, ..., M; L = 1, 2, ..., N
  PEE(j,K)(L) - SA(j,K)(L) - X/DSB(J,K)(L) - ME(I,J,K)

More specific details of how these access states are secured are described in the following embodiment.
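The four connection patterns can be generated mechanically for concrete index values. The following Python sketch formats the chain for one bus; the function name `bus_path` and the `dest` parameter (standing for the lowercase index j of D access) are hypothetical names introduced here.

```python
def bus_path(access, i, j, k, l, dest=None):
    """Format the bus chain from a processor edge to ME(i, j, k).

    `l` is the index L of the vector unit within the PE. For D access,
    `dest` is the first index of the receiving PE (the lowercase j of
    the description); the selector keeps the indices (J, K).
    """
    if access == "D":
        if dest is None:
            raise ValueError("D access needs the destination index `dest`")
        pe, sel_idx, selector = (dest, k), (j, k), "X/DSB"
    else:
        table = {
            "X": ((j, k), "X/DSB"),
            "Y": ((k, i), "YSB"),
            "Z": ((i, j), "ZSB"),
        }
        pe, selector = table[access]
        sel_idx = pe
    p = f"({pe[0]},{pe[1]})({l})"
    s = f"({sel_idx[0]},{sel_idx[1]})({l})"
    return f"PEE{p} - SA{p} - {selector}{s} - ME({i},{j},{k})"


# Paths that reach ME(1,2,1) through vector unit L = 3.
for access in ("X", "Y", "Z"):
    print(access, bus_path(access, 1, 2, 1, 3))
print("D", bus_path("D", 1, 2, 1, 3, dest=1))
```

The same ME is reached from a different PE under each access type, which is why the type selection circuit must sit between the processor edges and the section selectors.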

[0016]

[Embodiments] Embodiment 1. To keep the description and the drawings simple, a small composite system is presented. The unit processor unit 100 used for the combination is one with four sets of vector arithmetic units 16 (this corresponds to N = 4). Each set of vector arithmetic units contains several pipelined arithmetic units and a scalar arithmetic unit. The unit main storage unit 101 consists of a cubic array of 4×4×4 memory blocks 1 and a daughter board 102 carrying the vector data latch unit. Connecting a unit processor unit directly to a unit main storage unit yields the smallest vector parallel computer, which is the embodiment of the prior application. FIG. 1 shows that system, and FIG. 2 shows how the data buses 103 coming from the main memory boards are connected to the vector latches on the daughter board 102 for x/d, y, and z access. For x/d and y access the wiring is fixed as shown in the figure, but for z access it changes according to the selected main memory board.

[0017] The embodiment here is a composite system using 2×2 = 4 unit processor units and 2×2×2 = 8 unit main storage units (this corresponds to M = 2). It provides functionality equivalent to a unit system with N = M × (N of the original unit system) = 8 (scalability). Because few processors and memory units are used, the symbols PA, PB, PC, PD are used in place of PE(1,1), PE(2,1), PE(1,2), PE(2,2), and the processor edges are written simply as AE, BE, CE, DE. Likewise, ME1, ME2, ME3, ME4 are used in place of ME(1,1,1), ME(2,1,1), ME(1,2,1), ME(2,2,1), and ME5, ME6, ME7, ME8 in place of ME(1,1,2), ME(2,1,2), ME(1,2,2), ME(2,2,2). The index (L) used in the preceding section to distinguish the N sets is written directly as a digit following the name. For example, the four sets of arithmetic units in PA are written PA1, PA2, PA3, PA4 for L = 1, 2, 3, 4. FIG. 3 shows the whole of the processor unit and the main storage unit of this system.

Examples of the data sets carried to the processor unit by X access, Y access, Z access, and D access are given in FIGS. 4, 5, 6, and 7, respectively. The memory blocks in FIGS. 4, 5, and 6 show elements of an array A, typically with {A(I, J, K), I = 1 to 9, J = 1 to 7, K = 1 to 5} stored. The shaded strips are the vectors accessed at one time from a unit main storage unit, and all such strips are accessed in parallel. FIG. 4 shows X access to {A(1, J, K), J = 1 to 7, K = 1 to 5}; FIG. 5 shows Y access to {A(I, 1, K), K = 1 to 5, I = 1 to 9} (in practice, two accesses are needed); FIG. 6 shows Z access to {A(I, J, 1), I = 1 to 9, J = 1 to 7} (in practice, two accesses are needed).

In X access and Z access, the relation between the placement of the strips in the main storage unit and their placement in the registers of the processor unit is direct, and the correspondence between strips is clear. In Y access, the strips in the main storage unit do not correspond directly to those in the registers of the processor unit; what actually corresponds is the data stored in the memory blocks in the form of the vertical strips drawn in the processor unit. What is read directly in the unit storage unit, however, is a strip as shown in the figure, and the data carried to the processor unit is that which has been edited through the vector latch group placed in the unit main storage unit (this point relates to the content of the prior application). The D access of FIG. 7 shows the state in which two strip vectors are accessed from each of the memory blocks lined up in a column.
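The remark that Y and Z access of the example array need two accesses can be reproduced with a small calculation. The Python sketch below assumes that one section access of the composite system covers N×M elements in each of the two in-plane directions, which follows from the statement that the system behaves like a unit system with N = 8; the function name `passes_needed` and the tiling rule itself are assumptions of this sketch and are not stated in the description.

```python
from math import ceil


def passes_needed(extent_a, extent_b, n, m):
    """Estimate how many section accesses cover an extent_a x extent_b plane.

    Assumes one access covers n*m elements along each in-plane direction.
    """
    span = n * m
    return ceil(extent_a / span) * ceil(extent_b / span)


N, M = 4, 2
# Array A has extents 9, 7 and 5 in I, J and K.
x_passes = passes_needed(7, 5, N, M)  # X access: plane over J and K
y_passes = passes_needed(5, 9, N, M)  # Y access: plane over K and I
z_passes = passes_needed(9, 7, N, M)  # Z access: plane over I and J
print(x_passes, y_passes, z_passes)
```

Under this assumption the extent of 9 in the I direction exceeds the span of 8, so any plane that contains the I direction needs a second access, while the J-K plane fits in one.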

[0018] The motherboard that sits between the processor unit and the main storage unit and carries the network unit connecting them is the central part of the present application. It is shown in FIG. 8.

[0019] At the bottom of FIG. 8, connected directly to the processor edges, are the multiplexers 104 that implement the type selection circuits, which select buses according to the access type. Four are provided, one connected to each of the four unit processor units 105. On the processor side, each has four sets of terminals 106 for the buses connected directly to the unit processor unit (N = 4). For the one connected to processor edge AE, for example, these terminals are labelled A1, A2, A3, A4. For the other multiplexers as well, the processor-side terminals are given labels of the same kind regardless of the purpose of the multiplexer, because the only thing that matters is which processor they lead to. The terminal name is also used to denote the multiplexer itself. On the main storage side of each type selection multiplexer there are 12 terminals 107, which secure four sets of bus paths for each of the three access types X/D, Y, and Z.

[0020] In the middle of FIG. 8, three kinds of section selection circuits are arranged from bottom to top: the Z access section selection circuit 108, the Y access section selection circuit 109, and the X/D access selection circuit 110. In each of them the processor-side terminals carry the names explained above, but the wiring between those terminals and the terminals of the type selection circuits is omitted to keep the figure legible. Since M = 2 here, each of the X access, Y access, and Z access selection circuits need only choose one of the two unit main storage units lined up in the respective direction. In Z access, the four groups of multiplexers whose names begin with A, B, C, D choose, as the connection bus on the main storage side, the bus coming from one member of the unit main storage edge 111 pairs (ME1 or ME5), (ME2 or ME6), (ME3 or ME7), (ME4 or ME8), respectively. In Y access, the four groups of multiplexers beginning with A, B, C, D likewise choose to connect to one of (ME1 or ME3), (ME5 or ME7), (ME2 or ME4), (ME6 or ME8), respectively. In X access, the four groups of multiplexers beginning with A, B, C, D choose to connect to one of (ME1 or ME2), (ME3 or ME4), (ME5 or ME6), (ME7 or ME8), respectively. X access uses the X/D bus selection circuit, which is somewhat more complex than those for Y access and Z access because it is shared with D access.
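The selector pairs listed for the M = 2 embodiment follow from the index rules PE(J, K) for X access, PE(K, I) for Y access, and PE(I, J) for Z access. The Python sketch below derives them; the helper names `me_name`, `pe_name`, and `selector_candidates` are hypothetical, and the numbering formulas are inferred from the naming table given for the embodiment (ME1 to ME8 and PA to PD).

```python
M = 2  # the embodiment uses 2 x 2 PEs and 2 x 2 x 2 MEs


def me_name(i, j, k):
    """Name of ME(i, j, k): the first index varies fastest, as in ME1..ME8."""
    return f"ME{1 + (i - 1) + M * (j - 1) + M * M * (k - 1)}"


def pe_name(a, b):
    """Name of PE(a, b): PE(1,1)=PA, PE(2,1)=PB, PE(1,2)=PC, PE(2,2)=PD."""
    return "P" + "ABCD"[(a - 1) + M * (b - 1)]


def selector_candidates(access):
    """Map each PE name to the MEs its section selector chooses between."""
    result = {}
    rng = range(1, M + 1)
    for a in rng:
        for b in rng:
            if access == "X":    # PE(J, K): the selector ranges over I
                mes = [me_name(i, a, b) for i in rng]
            elif access == "Y":  # PE(K, I): the selector ranges over J
                mes = [me_name(b, j, a) for j in rng]
            elif access == "Z":  # PE(I, J): the selector ranges over K
                mes = [me_name(a, b, k) for k in rng]
            else:
                raise ValueError(f"unknown access type: {access!r}")
            result[pe_name(a, b)] = mes
    return result


for access in ("Z", "Y", "X"):
    print(access, selector_candidates(access))
```

The derived pairs agree with those enumerated for FIG. 8, which suggests that the wiring of the embodiment is a direct instance of the general index rules.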

[0021] For D access, a bus transceiver with a multiplexer function is placed directly on the processor side of the X/D bus selection circuit. It splits a vector of length 2N into two vectors of length N, passing one half to the bus that leads straight through and the other half to the bus that leads to the corresponding processor. Conversely, it combines the N-length vectors from those buses into one of length 2N and passes it to a single X/D bus selection circuit. For X access, the data simply passes through.

[0022] Embodiment 2. In Embodiment 1, to simplify the explanation, the width of the data bus was not specifically mentioned, and a width of about one double-length word (64 bits plus correction code) was implicitly assumed. When high-performance vector arithmetic units are used, however, it is usual within a sequence of computations to perform concurrently at least the fetches of the two operand vectors needed for each operation and the store of the result. Otherwise, memory access cannot keep pace with the arithmetic speed. Therefore, the vector latches, access latches, and various selection circuits of each unit main storage unit (these belong to the prior application) and also the bus selection circuits on the motherboard are all provided in triplicate. In practice, three motherboards of Embodiment 1 are provided, each connecting one of the triple edges coming out of the main storage unit to one of the triple edges coming out of the processor unit.

[0023] Summary. Even as a composite system, the system of the present invention can operate in exactly the same way as the unit system, and the combination yields higher performance. This is explained here.

[0024] First, to the extent needed for the explanation, the operation realized in the unit system is briefly reviewed. Consider, for example, the following construct, written in FORTRAN for an ordinary sequential computer:

      DO 10 K=1,4
      DO 10 J=1,4
      DO 10 I=1,4
   10 V(I,J,K)=(U(I-1,J,K)+U(I+1,J,K)+U(I,J-1,K)+U(I,J+1,K)
     &         +U(I,J,K-1)+U(I,J,K+1))/6.0

If the way this is computed in parallel on the unit system is expressed in a suitable language, it can be given, for example, as follows:

      PDO I,J=1,4
      DO 10 K=1,4
   10 Z(/I,J/,K)=U(/I,J/,K-1)+U(/I,J/,K+1)
      PEND
      PDO K,I=1,4
      DO 20 J=1,4
   20 Y(I/,J,/K)=U(I/,J-1,/K)+U(I/,J+1,/K)
      PEND
      PDO J,K=1,4
      DO 30 I=1,4
   30 V(I,/J,K/)=(U(I-1,/J,K/)+U(I+1,/J,K/)+Y(I,/J,K/)+Z(I,/J,K/))/6.0
      PEND

Here, for the assignment statement labelled 10, the sequential computation over index K (the DO statement) is directed to be carried out in parallel over the whole range of indices I and J (the PDO statement). Within the assignment statement, the indices corresponding to parallel computation are enclosed in a pair of slashes for readability. This part of the computation can be performed by z access. For example, when K = 2, {U(/I,J/,1)} and {U(/I,J/,3)} are fetched. This means that, from the cross-sectional arrays {MB(/I,J/,1)} and {MB(/I,J/,3)} of the memory block array, one block row is taken for each J = 1, 2, 3, 4, a vector of length 4 is formed by taking one element from each of its blocks corresponding to I = 1, 2, 3, 4, and the four such vectors are fetched in parallel. That is, vectors extending in the I direction are fetched in parallel over J. The computation itself is vector-processed in the I direction and parallel-processed in the J direction. In the program above this is expressed as a PDO over I and J. In storing the result {Z(/I,J/,2)} as well, vectors extending in the I direction are written simultaneously in parallel over J into the cross-sectional array {MB(/I,J/,2)} of memory blocks. The first PDO paragraph carries out this computation for K = 1, 2, 3, 4 in turn.

Next, the assignment statement labelled 20 relies on y access. In this case, vectors extending in the K direction are fetched, computed, and stored in parallel over I = 1, 2, 3, 4, and this is repeated for J = 1, 2, 3, 4. Finally, the assignment statement labelled 30 uses x access: vectors extending in the J direction are fetched, computed, and stored in parallel over K = 1, 2, 3, 4, and this is repeated in the order I = 1, 2, 3, 4.

The system of the prior application was proposed as hardware capable of parallel computation while selecting suitable access directions in this way. Besides the directions associated with vector processing and parallel processing, one further direction remains, and sequential computation is assumed along it. That direction is deliberately retained so that a variety of refined algorithms can be expressed in the sequential processing. Seen differently, if vector processing and parallel processing are not distinguished and both are regarded as parallel processing, the computation above performs sequential computation on one-dimensional subarrays (segments) extending in the direction of the bare index without slashes, in parallel across the slashed indices. Moreover, it proceeds while changing the direction in which the segments are cut out. This parallel computation scheme is named ADEPS (Alternating Direction Execution of "Parallel over Segments").
cution of “Parallel over Segments”).
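As an illustration only, the following Python sketch emulates the three PDO paragraphs with ordinary loops and checks that they reproduce the six-point average of the original DO nest. It is not part of the patent: the function names `make_u`, `zeros`, `direct` and `adeps`, the halo layout and the test data are all assumptions made for the example. The parallel and vector directions are written as plain loops, so the sketch shows only the ordering of the sweeps, not any parallel hardware.

```python
# Illustrative sketch (assumption, not the patent's implementation):
# the three PDO paragraphs emulated with plain loops on a 4 x 4 x 4 interior.
N = 4

def make_u(n):
    # Test array with a one-cell halo, so indices 0 and n + 1 are valid.
    # Integer-valued floats keep every sum exact.
    return [[[float(i * i + 2 * j + 3 * k * k) for k in range(n + 2)]
             for j in range(n + 2)] for i in range(n + 2)]

def zeros(n):
    return [[[0.0] * (n + 2) for _ in range(n + 2)] for _ in range(n + 2)]

def direct(u, n):
    # The original serial DO nest (label 10 of the first listing).
    v = zeros(n)
    for k in range(1, n + 1):
        for j in range(1, n + 1):
            for i in range(1, n + 1):
                v[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k]
                              + u[i][j - 1][k] + u[i][j + 1][k]
                              + u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return v

def adeps(u, n):
    z, y, v = zeros(n), zeros(n), zeros(n)
    rng = range(1, n + 1)
    # Paragraph 1 (z access): sequential in K, "parallel" over (I, J).
    for k in rng:
        for j in rng:
            for i in rng:
                z[i][j][k] = u[i][j][k - 1] + u[i][j][k + 1]
    # Paragraph 2 (y access): sequential in J, "parallel" over (K, I).
    for j in rng:
        for i in rng:
            for k in rng:
                y[i][j][k] = u[i][j - 1][k] + u[i][j + 1][k]
    # Paragraph 3 (x access): sequential in I, "parallel" over (J, K).
    for i in rng:
        for k in rng:
            for j in rng:
                v[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k]
                              + y[i][j][k] + z[i][j][k]) / 6.0
    return v
```

Calling `adeps(make_u(N), N)` and `direct(make_u(N), N)` yields the same array, which is the sense in which the three sweeps, each with its own sequential direction, replace the single serial nest.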

【００２５】 The example above dealt with a problem that exactly fits a small system with vector length N = 4 and four vector operations processed in parallel, as assumed for the unit system of embodiment 1. Supercomputing, however, does not stop at such small problems; it is usual to handle data of a size such as the array {U(I,J,K), I, J, K = 1, 2, ..., 256}. Such data must of course be folded into the 4 × 4 × 4 memory blocks for storage. With i = (I − 1) mod N + 1, j = (J − 1) mod N + 1, k = (K − 1) mod N + 1, each U(I,J,K) is allocated in memory block (i, j, k). If the processor has vector registers of length 256, accessing a vector of length 256 means repeating a length-4 access 64 times. Nor can 256 vector operations be processed in parallel; instead, parallel processing of four at a time is repeated 64 times. These operations are, of course, easily supported by system software, so that the user can work as though there were 256 × 256 × 256 memory blocks and 256 vector operations of length 256 were processed in parallel.
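The folding rule can be sketched in a few lines of Python. This is only an illustration under the formulas given in the text; the function names `bank_of` and `passes` are hypothetical.

```python
def bank_of(I, J, K, n=4):
    """1-based memory block (i, j, k) that holds element U(I, J, K),
    using i = (I - 1) mod n + 1 and likewise for j and k."""
    return ((I - 1) % n + 1, (J - 1) % n + 1, (K - 1) % n + 1)

def passes(length, n=4):
    """Number of length-n accesses needed to cover a vector of `length`
    (ceiling division, written without importing math)."""
    return -(-length // n)
```

For example, `bank_of(5, 6, 256)` gives block `(1, 2, 4)`, and `passes(256)` gives the 64 repetitions mentioned above.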

【００２６】 To obtain a higher-performance system, therefore, the vector length must be increased physically and the number of processors working in parallel must be increased as well. The present application presents a method of achieving this by combining as many unit main storage units and unit processor units of the above unit system as required. For example, the M = 2 extension of embodiment 1 yields a processor part four times as large and a main storage part eight times as large, which is equivalent to a unit system with N = 8, that is, with vector length 8 and eight vector operations processed in parallel. For the array {U(I,J,K), I, J, K = 1, 2, ..., 256} above, a vector of length 256 is then accessed in 32 repetitions, and the parallel processing is likewise carried out eight at a time in 32 repetitions. With an extension to M = 64, the same data can be accessed as 256 vectors of length 256 in a single access. If the unit system is built with N = 16, then M = 16 secures the same access performance.
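The growth of a composite system with M can be tabulated with a short Python sketch. It assumes only what the paragraph states (M × M unit processor units, M × M × M unit main storage units, effective vector length M × N); the function names and dictionary keys are hypothetical.

```python
def composite_shape(m, n):
    """Sizes of a composite system built from a unit system of vector length n."""
    return {
        "processor_units": m * m,     # M x M unit processor units
        "storage_units": m ** 3,      # M x M x M unit main storage units
        "vector_length": m * n,       # equivalent unit-system N
        "parallel_vectors": m * n,    # vector operations processed in parallel
    }

def repetitions(data_length, m, n):
    """Accesses of length m * n needed to cover one vector of data_length."""
    return -(-data_length // (m * n))
```

With these definitions, `repetitions(256, 2, 4)` is 32, while `repetitions(256, 64, 4)` and `repetitions(256, 16, 16)` are both 1, matching the three cases in the text.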

【００２７】 Both the unit system of the prior application and the composite system of the present application certainly have an architecture best suited to three-dimensional array processing, but they are also designed so that two-dimensional arrays can be processed successfully, which is extremely important in practice. The way this is done is as follows. Every two-dimensional array is replaced by a four-dimensional one with three main dimensions and one sub-dimension. Take, for example, the two-dimensional array {U(I,J), I, J = 1, 2, ..., 256}. With L = M × N, in a system having L × L × L memory blocks in total, the relations I = (q − 1)L + r and J = (s − 1)L + t associate the pair (r, q) with I and the pair (s, t) with J, and the array itself is replaced by {u(r,q,t)(s)}. Furthermore, corresponding to the access states in the two directions, {U(/I/,J)} and {U(I,/J/)}, consider {u(/r,q/,t)(s)} and {u(r,q,/t/)(/s/)}. Here (r, q, t) is the main-dimension part and (s) is the sub-dimension part. The main-dimension part identifies the memory block in which an element is to be stored, and the sub-dimension part identifies the sub-block within that memory block. Z access of the system is used for {u(/r,q/,t)(s)}, and D access for {u(r,q,/t/)(/s/)}.
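The index substitution for two-dimensional arrays can be checked with a small Python sketch. It follows the relations I = (q − 1)L + r and J = (s − 1)L + t from the text and assumes that r and t run from 1 to L; the function names `split_2d` and `join_2d` are hypothetical, and the sketch does not model how a main-dimension index is placed onto the physical memory blocks.

```python
def split_2d(I, J, L):
    """Map U(I, J) to the main-dimension part (r, q, t) and sub-dimension s."""
    q0, r0 = divmod(I - 1, L)   # I = (q - 1) * L + r, with 1 <= r <= L
    s0, t0 = divmod(J - 1, L)   # J = (s - 1) * L + t, with 1 <= t <= L
    return (r0 + 1, q0 + 1, t0 + 1), s0 + 1

def join_2d(main, s, L):
    """Inverse of split_2d: recover (I, J) from (r, q, t) and s."""
    r, q, t = main
    return (q - 1) * L + r, (s - 1) * L + t
```

For L = 8 (M = 2, N = 4), `split_2d(10, 3, 8)` returns `((2, 2, 3), 1)`, and `join_2d((2, 2, 3), 1, 8)` returns `(10, 3)` again.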

【００２８】[0028]

[Effects of the Invention] Let us see what performance the machine of embodiment 2 achieves. Take a machine cycle (one period) of 2.5 ns, which is typical of present-day supercomputers, and a unit processor unit PE with four vector arithmetic units. If two operands are fetched and one result is stored in each period, the required memory access speed is 4 (vectors) × 4 (words per vector) × 3 (streams) = 48 words per period, that is, 19.2 G words/sec. If the memory used has a cycle time of 10 ns, then in that interval, that is, in four periods, the unit system (which has threefold access paths) can deliver at most 48 words. There is thus a fourfold gap between the maximum bandwidth demanded by the processor part and what the main storage part can deliver. In effect, 4 (vector pairs) × 4 (element operations) = 16 floating-point operations are performed per 10 ns, and the effective computation speed stays at 1.6 GFLOPS. If a composite system with M = 2 is built instead, 8 (vectors) × 8 (words per vector) × 3 (streams) = 192 words can be accessed per 10 ns, which secures a memory access bandwidth of 19.2 G words/sec. In this case 8 (vector pairs) × 8 (element operations) = 64 floating-point operations are performed per 10 ns, giving an effective computation speed of 6.4 GFLOPS, four times that of the unit system.
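The arithmetic of this paragraph can be reproduced with a short Python sketch. The 2.5 ns machine cycle, the 10 ns memory cycle and the three streams (two operand fetches and one store) are taken from the text; the function and constant names are hypothetical, and the model is only the simple words-per-interval count used above.

```python
CYCLE_NS = 2.5       # machine cycle (one period), from the text
MEM_CYCLE_NS = 10.0  # memory cycle time, from the text

def required_gwords_per_sec(vectors, length, streams=3):
    """Bandwidth the processor asks for: words per machine cycle.
    Words per nanosecond equals G words per second."""
    return vectors * length * streams / CYCLE_NS

def supplied_gwords_per_sec(vectors, length, streams=3):
    """Bandwidth the memory delivers: the same words per memory cycle."""
    return vectors * length * streams / MEM_CYCLE_NS

def effective_gflops(vectors, length):
    """One floating-point operation per element per memory cycle."""
    return vectors * length / MEM_CYCLE_NS
```

For the unit system (4 vectors of length 4) this gives 19.2 G words/sec required against 4.8 supplied and 1.6 GFLOPS; for M = 2 (8 vectors of length 8) it gives 19.2 G words/sec supplied and 6.4 GFLOPS.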

【００２９】 If the unit system is instead built with N = 8, with the same 2.5 ns machine cycle and 10 ns memory chips as above, it has an effective computation speed of 6.4 GFLOPS from the outset. If, further, a composite system with M = 16 is built from it, the result is an ultracomputer with an effective computation speed of 6.4 GFLOPS × 16² ≈ 1.6 TFLOPS.

【００３０】 The present invention thus opens wide prospects for higher performance. Systems with different values of M can be conceived at many levels, forming a large family series whose members are all used in exactly the same way, so the invention is excellent in scalability. Although not specifically described above, a single system can also be divided into two or three systems for use by adding a further auxiliary selection circuit on the motherboard.

[Brief description of drawings]

【図１】 FIG. 1 is a configuration diagram of the unit system that is the building element of the composite system of the present invention.

【図２】 FIG. 2 is a schematic diagram showing how the main storage board edges on the daughter board in the unit system are connected to the data buses leading to the vector latch units.

【図３】 FIG. 3 is an overview of the composite system described in embodiment 1 of the present invention.

【図４】 FIG. 4 is a diagram showing the group of vector data accessed at one time by X access in the configuration of embodiment 1. Each shaded strip represents one vector.

【図５】 FIG. 5 is a diagram showing the group of vector data accessed at one time by Y access in the configuration of embodiment 1.

【図６】 FIG. 6 is a diagram showing the group of vector data accessed at one time by Z access in the configuration of embodiment 1.

【図７】 FIG. 7 is a diagram showing the group of vector data accessed at one time by D access in the configuration of embodiment 1. Each shaded portion extending obliquely in the main storage part indicates vector data inside the memory bank that its starting point touches.

【図８】 FIG. 8 is a diagram showing the data bus wiring and the selection circuits on the motherboard that carries the network connecting the processor part and the main storage part in embodiment 1.

[Explanation of symbols]

1: memory block
16: vector arithmetic unit
100, 105: unit processor unit
101: unit main storage unit
102: daughter board
103: data bus
104: multiplexer
106, 107: terminal
108: Z access cross-section selection circuit
109: Y access cross-section selection circuit
110: X/D access cross-section selection circuit

Claims

[Claims]

1. A composite vector parallel computer system, being a back-end computer system comprising: a main storage part in which unit main storage units, each a cubic array of N × N × N memory banks {MB(i, j, k), i, j, k = 1, 2, ..., N}, are further arranged in a cubic array {ME(I, J, K), I, J, K = 1, 2, ..., M}; a processor part in which unit processor units, each consisting of a control processor unit and N parallel-computation vector units {VU(i), i = 1, 2, ..., N}, are further arranged in a square array {PE(I, J), I, J = 1, 2, ..., M}; and a network part connecting the two, wherein:

(a) each unit main storage unit selectively permits four types of access, called x, y, z and d access, such that
(i) in x access, on the jk cross-section square array of the memory bank cubic array specified by a value of the index i, vector data of length N, formed by designating one element from each memory bank of the memory bank column {MB(i, j, k), j = 1, 2, ..., N} corresponding to each of k = 1, 2, ..., N, are accessed in parallel over k,
(ii) in y access, on the ki cross-section square array of the memory bank cubic array specified by a value of the index j, vector data of length N, formed by designating one element from each memory bank of the memory bank column {MB(i, j, k), k = 1, 2, ..., N} corresponding to each of i = 1, 2, ..., N, are accessed in parallel over i,
(iii) in z access, on the ij cross-section square array of the memory bank cubic array specified by a value of the index k, vector data of length N, formed by designating one element from each memory bank of the memory bank column {MB(i, j, k), i = 1, 2, ..., N} corresponding to each of j = 1, 2, ..., N, are accessed in parallel over j, and
(iv) in d access, on the memory bank column passing through a designated position (i, j) of the ij cross-section, vector data of length N, formed by designating elements within a memory bank, are accessed from the memory bank corresponding to each of k = 1, 2, ..., N, in parallel over k;

(b) each unit processor unit can process in parallel, in its corresponding vector units, the N vector data obtained from the unit main storage unit by any of these accesses, and can write the vector data of the results obtained in the N vector units into the unit main storage unit in parallel, each vector unit further having a scalar memory and a scalar arithmetic function so that it can also handle scalar data;

(c) in the overall system combining the main storage part of M × M × M unit main storage units with the processor part of M × M unit processor units, in order to realize X, Y, Z and D accesses of the same form as the four types of access realized in the unit system, the network part connecting the two has, on the processor side, type selection circuits {SA(J, K)(L), J, K = 1, 2, ..., M, L = 1, 2, ..., N} that select a different set of data buses according to whether the access is X/D, Y or Z, the processor side of each SA(J, K)(L) being connected to the L-th data bus from PE(J, K) and the main storage side having 3N sets of data input/output terminals connected to an X/D access network, a Y access network and a Z access network, of which
(i) in the X/D access network, for X access, so that, on the JK cross-section square array of the unit main storage cubic array specified by the index I, N, or periodically n × N, x-access vector data from the unit main storage row {ME(I, J, K), J = 1, 2, ..., M} corresponding to each of K = 1, 2, ..., M can be accessed in parallel over K, cross-section selection circuits {xSB(J, K)(L), L = 1, 2, ..., N} are provided for each (J, K), which select, by designating I, one of the data buses, N sets each, coming from {ME(I, J, K), I = 1, 2, ..., M} and connect it to the X/D data buses of {SA(J, K)(L), L = 1, 2, ..., N}; and, for D access, a common bus joining the processor-side input/output terminals of the cross-section selection circuits is provided for each (K)(L) and connected to the X/D buses of the set {SA(j, K)(L), j = 1, 2, ..., M, L = 1, 2, ..., N}, with a scatter-and-gather control circuit attached to the common bus, so that the vector data of length N × M obtained by d access from the unit main storage unit ME(I, J, K) thereby selected are divided into M parts of length N, or, with a further integer multiple n, of length (n ×) N, and read into {PE(j, K), j = 1, ..., M}, or are stored along the reverse path, in parallel over K,
(ii) in the Y access network, so that, on the KI cross-section square array of the unit main storage cubic array specified by the index J, N, or periodically n × N, y-access vector data from the unit main storage row {ME(I, J, K), K = 1, 2, ..., M} corresponding to each of I = 1, 2, ..., M can be accessed in parallel over I, cross-section selection circuits {ySB(K, I)(L), L = 1, 2, ..., N} are provided for each (K, I), which select, by designating J, one of the data buses, N sets each, coming from {ME(I, J, K), J = 1, 2, ..., M} and connect it to the Y data buses of {SA(K, I)(L), L = 1, 2, ..., N}, and
(iii) in the Z access network, so that, on the IJ cross-section square array of the unit main storage cubic array specified by the index K, N, or periodically n × N, z-access vector data from the unit main storage row {ME(I, J, K), I = 1, 2, ..., M} corresponding to each of J = 1, 2, ..., M can be accessed in parallel over J, cross-section selection circuits {zSB(I, J)(L), L = 1, 2, ..., N} are provided for each (I, J), which select, by designating K, one of the data buses, N sets each, coming from {ME(I, J, K), K = 1, 2, ..., M} and connect it to the Z data buses of {SA(I, J)(L), L = 1, 2, ..., N};

whereby any of X access, Y access, Z access and D access can be selected in the overall system, and parallel vector computation can be performed while switching among them.

2. The computer system according to claim 1, wherein, in order to exploit the parallel processing capability within each vector unit of the unit processor units constituting the processor part, m sets of the type selection circuits are provided so as to permit a plurality m (for example, m = 3) of accesses at a time, and m sets of the three kinds of cross-section selection circuits and of the D-access common bus are likewise provided, so that m accesses, each of an arbitrarily selected access form, can be made simultaneously and freely, independently of one another.