JPH0675930A

JPH0675930A - Parallel processor system

Info

Publication number: JPH0675930A
Application number: JP4228263A
Authority: JP
Inventors: Seigo Suzuki; 清吾鈴木
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-08-27
Filing date: 1992-08-27
Publication date: 1994-03-18

Abstract

PURPOSE: To provide a parallel processor system capable of greatly improving the capability of communication among multiple PEs and of network communication with a host computer. CONSTITUTION: The parallel processor system has an element array composed of a plurality of two-dimensionally arranged processors (PEs) 1 and multi-port memories 2, each memory having at least three input/output ports. The multi-port memories 2 are arranged in a mesh; the first and second ports of each multi-port memory 2 are connected to mesh data buses 4 and 5, respectively, and the third port is connected to the corresponding processor 1, so that a mesh network is formed on a single chip. A parallel/massively parallel supercomputer inherently suited to VLSI computers can therefore be built, and the capability of communication among multiple PEs and of network communication with a host computer can be greatly improved.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a parallel processor system suitable for parallel and massively parallel computers.

[0002]

[Prior Art] First, the need for the parallel processor system of the present invention is described.

[0003] In general, improvements in VLSI computer performance owe much to increases in the integration density of the VLSI itself and the performance gains that accompany them. For example, if a VLSI computer built with 1.0 μm process technology is compared in general terms with one built with 0.2 μm process technology, the latter should have the better performance/price ratio.

[0004] However, if the computer architecture is left basically unchanged and only miniaturization through process technology is applied, the integration density of a 0.2 μm LSI relative to a 1.0 μm LSI is simply (1.0)² / (0.2)² = 25 times. If the LSI chip cost is unchanged, the total LSI cost ratio between the two is 25:1.

[0005] As for performance, a system that was split across 25 chips is integrated into one chip, so the losses associated with inter-chip transmission (I/F) fall sharply, and an efficiency (performance) gain of 1.5 to 2.0 times is expected. In addition, each element inside the chip becomes faster with miniaturization, and an LSI made with 0.2 μm process technology is expected to reach a clock frequency 3 to 4 times that of an LSI made with 1.0 μm process technology.

[0006] Therefore, in a single-processor computer architecture, for example, higher LSI integration can bring the LSI cost down to as little as 1/25, but the performance gain is at most (1.5 to 2.0) × (3 to 4) = 3 to 8 times.
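The scaling arithmetic in paragraphs [0004] to [0006] can be reproduced with a short sketch. The function names below are illustrative and do not come from the patent. Multiplying the stated ranges endpoint by endpoint gives 4.5 to 8, whose upper end agrees with the "at most 8 times" figure in the text.

```python
def density_ratio(old_um: float, new_um: float) -> float:
    """Integration-density gain from a linear shrink (area scales with the square)."""
    return (old_um / new_um) ** 2


def combined_gain(io_gain: tuple, clock_gain: tuple) -> tuple:
    """Multiply two (low, high) gain ranges endpoint by endpoint."""
    return (io_gain[0] * clock_gain[0], io_gain[1] * clock_gain[1])


# Figures taken from the text: 1.0 um vs 0.2 um, 1.5-2.0x from integration,
# 3-4x from the higher clock frequency.
ratio = density_ratio(1.0, 0.2)
low, high = combined_gain((1.5, 2.0), (3.0, 4.0))

print(f"density gain: {ratio:.0f}x, LSI cost as low as 1/{ratio:.0f}")
print(f"single-processor speedup: {low} to {high}x")
```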

[0007] Meanwhile, as for the pace of LSI integration, suppose it takes ten years or more to move from the 1.0 μm era to the 0.2 μm era. With higher integration alone and no change of architecture, VLSI computer performance would then improve by only 3 to 8 times over ten years at most. In other words, a single processor cannot be expected to deliver a dramatic improvement.

[0008] Changing the architecture from a single processor to parallel or massively parallel processors has therefore been regarded as an effective way to raise performance. For example, a 32-bit processor fabricated with 1.0 μm process technology occupies one chip by itself, whereas 0.2 μm process technology allows 25 such 32-bit processors to be placed on one chip.

[0009] Even if it is difficult for all 25 of these 32-bit processors to run efficiently in parallel at all times, assume a degree of parallelism of 1/2, so that 12 processors are always running in parallel. Taking the speedup of each element (at clock level) from higher integration as 3 to 4 times, as stated above, the performance gain from the parallel architecture over ten years is 12 × (3 to 4) = 36 to 48 times in total.
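The parallel estimate in paragraph [0009] follows the same pattern. The sketch below is illustrative only; `parallel_gain` and its parameters are hypothetical names, and the truncation of 12.5 to 12 active processors mirrors the rounding used in the text.

```python
def parallel_gain(pes_per_chip: int, utilization: float, clock_gain: tuple) -> tuple:
    """Return (active PEs, low gain, high gain) for a chip holding several PEs."""
    # 25 * 0.5 = 12.5, truncated to 12 active processors as in the text.
    active = int(pes_per_chip * utilization)
    return active, active * clock_gain[0], active * clock_gain[1]


# 25 processors per chip, half of them busy, 3-4x clock speedup.
active, low, high = parallel_gain(25, 0.5, (3, 4))
print(f"{active} active PEs give a {low} to {high}x gain")
```

Compared with the single-processor bound of 8 times, the same process shrink yields roughly 4.5 to 6 times more performance under these assumptions.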

[0010] Thus, compared with a single processor, a parallel or massively parallel processor architecture contributes greatly to performance, such as processing speed, but many problems remain in communication and data exchange between the divided, parallelized processors. For connections between PEs (Processor Elements) in particular, many connection topologies exist. The connection form has been chosen mainly according to the character of the application, that is, the balance among average communication frequency, amount (length) of data per transfer, randomness of the communication partner, synchronous versus asynchronous communication, average communication distance (wide-area communication), and so on. The computer could be modified for each application and each program so that the inter-PE connection topology is optimized for each, but in practice this is not a sound approach economically.

[0011] On the other hand, considering VLSI chips and modules from the standpoint of ease of realization (ease of manufacture), it is important to keep both confined to two-dimensional space as far as possible. For example, compare a cube structure such as a hyper N-cube (hereinafter "N-cube") with a two-dimensional mesh structure (hereinafter "mesh structure"). Realizing the inter-PE connections of the former on a two-dimensional surface becomes extremely complex once the number of PEs (N) exceeds 100. Because of this complexity, the amount (length) of wiring for inter-chip connections in an N-cube also becomes large: for N > 1000 it is 50 to 100 times that of a mesh structure, in which only adjacent chips are connected. That is, the ratio of the inter-PE wiring length in the N-cube to that in the mesh structure reaches as much as 50.

[0012] Now suppose that this difference in wiring length (amount) makes inter-chip transmission five times slower. For example, the transmission time to an adjacent PE is 10 ns (100 Mb/s) in the mesh structure and five times that, 50 ns (20 Mb/s), in the N-cube structure. With 1024 PEs (= 32 × 32), the longest transfer time in the N-cube structure is log₂N × τ₁ = log₂1024 × 50 ns = 500 ns, which requires 50 cycles of a 100 MHz (10 ns) clock. In the mesh structure, by contrast, the worst case is N^(1/2) × τ₂ = 1024^(1/2) × 10 ns = 320 ns, which requires 32 cycles of a 100 MHz clock. That is, although the N-cube structure needs fewer transfer stages, the total transfer time is shorter in the mesh structure. As detailed above, a two-dimensional mesh structure is found to be effective for the architecture of parallel and massively parallel computers.
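The comparison in paragraph [0012] can be checked numerically. This is a minimal sketch that takes the text's own hop-count estimates at face value (log₂N hops for the N-cube, √N hops for the mesh); the function names are illustrative.

```python
import math


def hypercube_worst_ns(num_pes: int, hop_ns: float) -> float:
    """Worst case on an N-cube: log2(N) hops, each costing hop_ns."""
    return math.log2(num_pes) * hop_ns


def mesh_worst_ns(num_pes: int, hop_ns: float) -> float:
    """Worst case on a square mesh, using the text's sqrt(N)-hop estimate."""
    return math.isqrt(num_pes) * hop_ns


def cycles(time_ns: float, clock_period_ns: float = 10.0) -> int:
    """Clock cycles needed at the given period (10 ns corresponds to 100 MHz)."""
    return math.ceil(time_ns / clock_period_ns)


n = 1024                                       # 32 x 32 PEs
cube_ns = hypercube_worst_ns(n, hop_ns=50.0)   # long wires: 50 ns per hop
mesh_ns = mesh_worst_ns(n, hop_ns=10.0)        # neighbor links: 10 ns per hop

print(f"N-cube: {cube_ns:.0f} ns, {cycles(cube_ns)} cycles")
print(f"mesh:   {mesh_ns:.0f} ns, {cycles(mesh_ns)} cycles")
```

The per-hop cost dominates: ten slow hops take longer than thirty-two fast ones.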

[0013] As shown in FIG. 6, the parallel processor system that constitutes this conventional parallel/massively parallel computer with a two-dimensional mesh structure consists of a plurality of processor elements (PEi, PEj, ...), each having a memory (Mi, Mj, ...), arranged in parallel and connected via X buses (Xi, Xj, ...) and Y buses (Yi, Yj, ...).

[0014] The key point of this conventional system is that all inter-PE communication takes place between the PEs themselves. A memory (M) is attached to each PE, and communication between memories always goes through the PEs.

[0015] This is because the memory in the conventional system is mainly a single-port input/output device dedicated to exchanges with its own PE. In essence, however, the main reason for communication between PEs is to reference and exchange the data held in the memory belonging to each PE.

[0016] In such a conventional parallel processor system for a parallel/massively parallel computer with a two-dimensional mesh structure, the access port of each memory (M) is monopolized by its processor (PE). The memories therefore have no path for exchanging data with one another, and every exchange is carried out under the control of the processor in charge. Each processor has its own work to do, and memory exchanges are also handled by the processor's software, so communication between memory Mi and memory Mj follows the path memory Mi → processor PEi → processor PEj → memory Mj. Exchanges between memories thus always need a certain communication time to pass through the PEs, and each step also needs a suitable waiting time, so the overall time required ends up several tens of times longer, or more, than in a single-processor system. Moreover, PEi and PEj must interrupt the work they are executing for every transfer, and this penalty is also very large. Although it depends on the communication frequency, PEi, which waits for the data, is slowed down several-fold, and even PEj, which merely arranges for the data to pass, is slowed by a factor of about 1.5 to 3.

[0017]

[Problems to Be Solved by the Invention] The present invention remedies the drawbacks of the prior art described above. Its object is to provide, in a parallel processor system for a parallel/massively parallel computer inherently suited to VLSI computers, a parallel computer that can greatly improve the capability of communication among multiple PEs and of network communication with a host computer.

[0018]

[Means for Solving the Problems] The present invention solves the problems of the prior art described above by the following configurations.

[0019] That is, the present invention is a parallel processor system comprising an element array in which a plurality of processor units, each consisting of a processor and a memory, are arranged two-dimensionally, wherein the memory is a multi-port memory having at least three input/output ports, the multi-port memories are arranged in a mesh, and the first and second ports of each multi-port memory are connected to mesh data buses and the third port to the corresponding processor, so that a mesh network is formed on a single chip. It is further a parallel processor system in which the input/output Xi and Yi buses of the multi-port memory are connected directly to off-chip terminals, toward the four sides, via peripheral circuits.
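The arrangement described in paragraph [0019] can be pictured with a small data-structure sketch. Everything here is an illustration rather than the patent's implementation: the class and function names are hypothetical, and the mapping of port 1 to the row's X bus and port 2 to the column's Y bus is an assumption (the text only says the first and second ports attach to the mesh data buses).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MultiPortMemory:
    """A 3-port memory at mesh position (row, col)."""
    row: int
    col: int
    words: dict = field(default_factory=dict)

    @property
    def ports(self) -> dict:
        # Assumed mapping: port 1 -> X bus, port 2 -> Y bus, port 3 -> local PE.
        return {
            "port1": f"X{self.row}",
            "port2": f"Y{self.col}",
            "port3": f"PE({self.row},{self.col})",
        }


def build_mesh(side: int) -> list:
    """Create a side x side mesh of multi-port memories."""
    return [[MultiPortMemory(r, c) for c in range(side)] for r in range(side)]


def shared_bus(a: MultiPortMemory, b: MultiPortMemory) -> Optional[str]:
    """Name of the bus both memories sit on, or None if they share neither."""
    if a.row == b.row:
        return f"X{a.row}"
    if a.col == b.col:
        return f"Y{a.col}"
    return None


def copy_word(src: MultiPortMemory, dst: MultiPortMemory, addr: int) -> str:
    """Copy one word memory-to-memory over a shared bus, without involving a PE."""
    bus = shared_bus(src, dst)
    if bus is None:
        raise ValueError("no shared bus; the data would need an intermediate memory")
    dst.words[addr] = src.words[addr]
    return bus


mesh = build_mesh(4)
mesh[0][0].words[0] = 42
used = copy_word(mesh[0][0], mesh[0][3], addr=0)
print(used, mesh[0][3].words[0])
```

The point of the sketch is that the transfer touches only the two memories and a bus; the processors on port 3 are not part of the path.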

[0020] Furthermore, it is a parallel processor system configured so that transfers to the input/output Xi and Yi buses of the multi-port memory are performed as group transfers of multiple data items in row units, matching the row structure inside the multi-port memory, and so that the address counter for the Xi and Yi buses is shared, allowing transfers to the Xi and Yi buses, or through those buses to the peripheral circuits and external terminals, to take place simultaneously on the Xi and Yi buses.

[0021] Furthermore, it is a parallel processor system in which a plurality of the multi-port memories are arranged, and data exchange between the multi-port memories is carried out independently, in parallel with the operation of the processors arranged in the mesh.

[0022] Furthermore, it is a parallel processor system comprising: a plurality of processors arranged two-dimensionally in a mesh; a package housing the processors; input/output Xi, Yi, and Zi buses that connect directly to the processors; and light-receiving and light-emitting elements, arranged on one face of the package, that exchange Z-direction data between the processor group and external terminals via the Zi bus and transfer multiple data items.

[0023] Furthermore, it is a parallel processor system comprising an element array in which a plurality of processor units, each consisting of a processor and a memory, are arranged two-dimensionally, wherein the memory is a multi-port memory having at least three input/output ports, the multi-port memories are arranged in a mesh, and the first and second ports of each multi-port memory are connected to mesh data buses and the third port to the corresponding processor to form a mesh network; the input/output Xi and Yi buses are each time-divided, so that during the CLOCK period the Xi bus is connected to the memory and the Yi bus to the external terminals and peripheral circuits, and during the opposite-phase (/CLOCK) period the Yi bus is connected to the memory and the Xi bus to the external terminals and peripheral circuits.

[0024]

[Operation] The parallel processor system of the present invention comprises an element array consisting of a plurality of two-dimensionally arranged processors and multi-port memories having at least three input/output ports. The multi-port memories are arranged in a mesh, the first port of each multi-port memory is connected to the mesh data bus and the second port to the corresponding processor, and the mesh network is formed on a single chip. This greatly improves the capability of communication among multiple PEs and of network communication with a host computer.

[0025]

[Embodiment] An embodiment of the present invention is described below. FIG. 1 shows a parallel processor system that is an embodiment of the present invention. As shown in FIG. 1(a), the system consists of a plurality of PE units (3) arranged in a mesh. Each PE unit (3) comprises a 32-bit PE (1) and a 16 KB 3-port high-speed RAM (2) that is connected to the PE (1) and to the 32-bit X and Y buses (4), (5). The PE (1) is connected to the high-speed RAM (2) and also to an external host computer (6) via a broadcast buffer (7) and the B (broadcast) port (8) of an optical broadcast network. The PE units (3), each a PE + RAM pair, are laid out in a mesh of (2^N)² units (here N = 2, so 16 units are arranged) and, as shown in FIG. 1(b), are formed as one chip (10) inside a single LSI device (11). The X and Y buses (4), (5), which run vertically and horizontally through the chip, are brought out (or in) to the ports (I/O pins) (13) outside the chip (10) via the X and Y I/O port buffers (12). The number of I/O pins (13) is 32 × 4 = 128 per channel (1 ch), for a total of 128 × 4 = 512 pins over the four sides.
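The chip-level figures in paragraph [0025] can be recomputed as follows. This sketch assumes that the "× 4" in 32 × 4 is the number of 32-bit buses reaching one chip edge, which equals the mesh side 2^N; that reading is an interpretation, and the function names are illustrative.

```python
BUS_WIDTH_BITS = 32  # width of each X or Y bus, from the text


def mesh_side(n: int) -> int:
    """PE units along one edge of the chip: 2^N."""
    return 2 ** n


def pe_units_per_chip(n: int) -> int:
    """Total PE units on the chip: (2^N)^2."""
    return mesh_side(n) ** 2


def pins_per_side(n: int, bus_width: int = BUS_WIDTH_BITS) -> int:
    """I/O pins on one chip edge, assuming one full-width bus per row or column."""
    return bus_width * mesh_side(n)


def total_io_pins(n: int, bus_width: int = BUS_WIDTH_BITS) -> int:
    """I/O pins over all four edges (N, E, W, S)."""
    return 4 * pins_per_side(n, bus_width)


n = 2
print(pe_units_per_chip(n), pins_per_side(n), total_io_pins(n))
```

Under this reading, the pin count grows linearly with the mesh side while the PE count grows with its square, which is one reason the mesh layout suits a single chip.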

[0026] A Z bus (14), extending from the top of each PE (1), that is, in the Z direction, is connected to a broadcast ring bus (15) inside the chip (10) and, beyond that, to an external broadcast bus common to the whole system. This external system-wide broadcast bus is provided within the optical broadcast network (8) and consists of optically coupled links using optical cables (16). One or more optical cables from each chip (10) connect to the host computer (6) via the broadcast buffer (7). The detailed structure of the PE unit (3) is shown in FIG. 2. FIG. 2(a) is an enlarged view of one of the PE units arranged in the mesh. FIG. 2(b) shows schematically, with the X and Y buses omitted, how the memory (2) of a PE (processor element) (1) in a PE unit (3) exchanges data directly with other memories (2). Because the memory (2) is a high-speed multi-port RAM, the memories can exchange data with one another directly and at high speed, so data communication with PEs in other PE units can be carried out quickly.

[0027] Next, a configuration in which the parallel processor system of the present invention is applied to a large-scale parallel supercomputer is described. FIG. 3 shows a case in which 64 chips of the LSI device (11) of the parallel processor system of the present invention are combined. That is, the LSI devices (11), each carrying one chip made up of 16 PE units (3) of the kind shown in FIG. 1, each unit consisting of a PE (1) and a 3-port memory (2), are arranged in 8 rows and 8 columns (8 × 8 = 64 chips), although not all are shown. The system thus comprises 1024 PE units.
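The system scale in paragraph [0027] follows from 8 × 8 chips of 4 × 4 PE units each. The sketch below also shows one way a global PE coordinate could be split into a chip position and a position within the chip; this row-major addressing scheme and the names used are assumptions for illustration, since the patent does not specify an addressing scheme.

```python
CHIPS_PER_SIDE = 8      # 8 x 8 = 64 LSI devices, from the text
PES_PER_CHIP_SIDE = 4   # 4 x 4 = 16 PE units per chip, from the text


def total_pe_units() -> int:
    """PE units in the whole system."""
    return (CHIPS_PER_SIDE * PES_PER_CHIP_SIDE) ** 2


def locate(gx: int, gy: int) -> tuple:
    """Split a global mesh coordinate into (chip position, position inside the chip).

    Hypothetical addressing: global coordinates run 0..31 on each axis.
    """
    side = CHIPS_PER_SIDE * PES_PER_CHIP_SIDE
    if not (0 <= gx < side and 0 <= gy < side):
        raise ValueError("coordinate outside the 32 x 32 mesh")
    chip_x, local_x = divmod(gx, PES_PER_CHIP_SIDE)
    chip_y, local_y = divmod(gy, PES_PER_CHIP_SIDE)
    return (chip_x, chip_y), (local_x, local_y)


print(total_pe_units())
print(locate(5, 9))
```

A transfer between PE units whose chip positions differ would cross the external buses between LSI devices, while one with equal chip positions stays on the on-chip buses.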

[0028] In each LSI device (11), the Xi and Yi buses of the chip (10) shown in FIG. 1(b) are connected to the external pins (13) via the buffers (12) provided in each of the N, E, W, and S directions and via I/F circuits (not shown). Through these external pins (13), each LSI device (11) is connected to the xi and yi buses of the external buses (17, 18) provided outside the LSI devices (11). That is, the 64 chips (10) in the LSI devices (11) are connected in parallel, in a mesh, by the external buses (17, 18). The Z-direction input/output of each LSI device (11) is connected by an optical cable (16) to the host computer (6), via the system-wide broadcast bus (8) described above (FIG. 1) and the broadcast buffer (7).

[0029] The relationship between the Xi and Yi buses of the on-chip buses (4, 5) and the xi and yi buses of the external buses (17, 18) is shown in FIG. 4. The operating clock CLK is split by phase into two phases, CLK and /CLK, so that the buses on the X and Y axes do not collide. For example, the on-chip bus forms a bus path in the order xi → Xi → xi+1 and also serves as the connection bus between PEs within the chip. Dividing the whole transfer into two phases and alternating the operations in this way simplifies the system configuration without lowering overall transfer efficiency.
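The two-phase operation can be sketched as a simple schedule. The routing below follows the time division stated in paragraph [0023] (Xi to the memory and Yi to the external side in one phase, the reverse in the other), on the assumption that the second phase is the /CLK phase named in paragraph [0029]. The function names and the string labels are hypothetical.

```python
def bus_routing(clk_high: bool) -> dict:
    """Which side each on-chip bus is connected to during one clock phase."""
    if clk_high:   # CLK phase
        return {"Xi": "memory", "Yi": "external"}
    # /CLK phase: the roles swap
    return {"Xi": "external", "Yi": "memory"}


def schedule(num_phases: int) -> list:
    """Routing for consecutive phases, starting with CLK."""
    return [bus_routing(t % 2 == 0) for t in range(num_phases)]


for t, routing in enumerate(schedule(4)):
    phase = "CLK " if t % 2 == 0 else "/CLK"
    print(phase, routing)
```

In every phase exactly one of the two buses is attached to the memory, so the X and Y transfers never contend for the same port, and each bus reaches the memory every other phase.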

[0030] A massively parallel supercomputer is configured in this way. Viewed in outline, as shown in FIG. 5, the Z-direction input/output from each chip (10) is connected to the host computer and other equipment by optical cables (16) via an optical connection device (19).

[0031]

[Effects of the Invention] According to the present invention, a parallel/massively parallel supercomputer inherently suited to VLSI computers can be built, and the capability of communication among multiple PEs and of network communication with a host computer can be greatly improved.

[Brief description of drawings]

FIG. 1 is a block diagram illustrating the system of the present invention.

FIG. 2 is a block diagram illustrating a main part of the present invention.

FIG. 3 is a system configuration diagram for the case where the system of the present invention is applied to a massively parallel supercomputer.

FIG. 4 is a diagram explaining the operation of the system shown in FIG. 4.

FIG. 5 is a partial external view of a massively parallel supercomputer to which the system of the present invention is applied.

FIG. 6 is a configuration diagram of a conventional parallel processor system.

[Explanation of symbols]

1: PE; 2: memory; 3: PE unit; 4: X bus; 5: Y bus

Claims

[Claims]

1. A parallel processor system comprising an element array in which a plurality of processor units, each consisting of a processor and a memory, are arranged two-dimensionally, characterized in that the memory is a multi-port memory having at least three input/output ports, the multi-port memories are arranged in a mesh, and the first and second ports of each multi-port memory are connected to mesh data buses and the third port to the corresponding processor, so that a mesh network is formed on a single chip.

2. A parallel processor system in which the input/output Xi and Yi buses of the multi-port memory according to claim 1 are connected directly to off-chip terminals, toward the four sides, via peripheral circuits.

3. A parallel processor system characterized in that transfers to the input/output Xi and Yi buses of the multi-port memory according to claim 1 are performed as group transfers of multiple data items in row units, matching the row structure inside the multi-port memory, and in that the address counter for the input/output Xi and Yi buses is shared, so that transfers to the Xi and Yi buses, or through those buses to the peripheral circuits and external terminals, take place simultaneously on the Xi and Yi buses.

4. A parallel processor system characterized in that a plurality of the multi-port memories according to claim 1 are arranged, and data exchange between the multi-port memories is carried out independently, in parallel with the operation of the processors arranged in the mesh.

5. A parallel processor system characterized by comprising: a plurality of processors arranged two-dimensionally in a mesh; a package housing the processors; input/output Xi, Yi, and Zi buses that connect directly to the processors; and light-receiving and light-emitting elements, arranged on one face of the package, that exchange Z-direction data between the processor group and external terminals via the Zi bus and transfer multiple data items.

6. A parallel processor system comprising an element array in which a plurality of processor units, each consisting of a processor and a memory, are arranged two-dimensionally, characterized in that the memory is a multi-port memory having at least three input/output ports, the multi-port memories are arranged in a mesh, the first and second ports of each multi-port memory are connected to mesh data buses and the third port to the corresponding processor to form a mesh network, and the input/output Xi and Yi buses are each time-divided, so that during the CLOCK period the Xi bus is connected to the memory and the Yi bus to the external terminals and peripheral circuits, and during the opposite-phase (/CLOCK) period the Yi bus is connected to the memory and the Xi bus to the external terminals and peripheral circuits.