JP2557175B2

JP2557175B2 - Computer system

Info

Publication number: JP2557175B2
Application number: JP5117135A
Authority: JP
Inventors: Paul Amba Wilkinson; James Warren Dieffenderfer; Peter Michael Kogge; Nicholas Jerome Schoonover
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1992-05-22
Filing date: 1993-05-19
Publication date: 1996-11-27
Anticipated expiration: 2011-11-27
Also published as: JPH0652125A

Description

Detailed Description of the Invention

[0001] FIELD OF THE INVENTION
This invention relates to computers and computer systems, and in particular to parallel array processors. According to the invention, a parallel array processor can be incorporated on a single semiconductor silicon chip. This chip forms the basis of systems capable of massively parallel processing of complex scientific and business applications.

[0002] DESCRIPTION OF THE RELATED ART
First, the terms used in this specification are explained.

[0003] ALU
ALU is the arithmetic logic circuit portion of a processor.

[0004] Array
Array refers to an arrangement of elements in one or more dimensions. An array can include an ordered set of data items (array elements), which in languages such as FORTRAN are identified by a single name. In other languages, the name of an ordered set of data items refers to an ordered set of data elements that all have identical attributes. A program array has its dimensions specified, generally by a number or a dimension attribute. In some languages the array declarator also specifies the size of each dimension, and in some languages an array is an arrangement of elements in a table. In a hardware sense, an array is a collection of structures (functional elements) that are generally identical in a massively parallel architecture. Array elements in data parallel computing are elements that can be assigned operations and that, when operating in parallel, can each execute the required operations independently. Generally, an array can be thought of as a grid of processing elements. Sections of the array can be assigned sectional data, so that the sectional data can be moved around in a regular grid pattern. However, data can also be indexed or assigned to an arbitrary location in the array.

[0005] Array Director
An array director is a unit programmed as a controller for an array. It performs the function of a master controller for a group of functional elements arranged in an array.

[0006] Array Processor
There are two principal types of array processor: multiple instruction multiple data (MIMD) and single instruction multiple data (SIMD). In a MIMD array processor, each processing element in the array executes its own unique instruction stream with its own data. In a SIMD array processor, each processing element in the array is restricted to the same instruction via a common instruction stream; however, the data associated with each processing element is unique. The preferred array processor of this invention has other characteristics as well. It is referred to herein as the Advanced Parallel Array Processor, abbreviated APAP.
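The difference between the two types of array processor can be illustrated with a minimal Python sketch. It is only an illustration of the definitions above, not the patent's hardware: the names `simd_step` and `mimd_step` are hypothetical, and each processing element is modeled as one list position.

```python
from typing import Callable, List

Instruction = Callable[[int], int]


def simd_step(instruction: Instruction, data: List[int]) -> List[int]:
    """SIMD: one common instruction is applied to each element's own data."""
    return [instruction(x) for x in data]


def mimd_step(instructions: List[Instruction], data: List[int]) -> List[int]:
    """MIMD: each element applies its own instruction to its own data."""
    if len(instructions) != len(data):
        raise ValueError("need exactly one instruction per processing element")
    return [op(x) for op, x in zip(instructions, data)]


data = [1, 2, 3, 4]
simd_result = simd_step(lambda x: x + 10, data)
mimd_result = mimd_step(
    [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x * x], data
)
```

In both cases the data is unique per element; only the source of the instruction differs.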

[0007] Asynchronous
Asynchronous means without a regular time relationship: the execution of a function is unpredictable with respect to the execution of other functions, and no regular or predictable time relationship exists between function executions. In control situations, a controller addresses the location to which control is passed when data is waiting for an idle element that is being addressed. This allows operations to remain in sequence even though they do not coincide in time with any event.

[0008] BOPS/GOPS
BOPS and GOPS are abbreviations with the same meaning: billions of operations per second. See GOPS.

[0009] Circuit Switched/Store-and-Forward
These terms refer to two mechanisms for moving data packets through a network of nodes. Store-and-forward is a mechanism in which a data packet is received by each intermediate node, stored in its memory, and then forwarded toward its destination. Circuit switching is a mechanism in which an intermediate node is commanded to logically connect its input port to an output port, so that data packets pass directly through the node toward their destination without entering the intermediate node's memory.
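The contrast between the two mechanisms can be sketched in Python as follows. This is a toy model under stated assumptions: the `Node` class and its fields `packets_buffered` and `port_connected` are hypothetical bookkeeping, used only to show that store-and-forward touches every intermediate node's memory while circuit switching does not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    packets_buffered: int = 0     # packets that passed through this node's memory
    port_connected: bool = False  # True while the input port is tied to an output port


def store_and_forward(path: List[Node], packet: bytes) -> bytes:
    """Each intermediate node receives the packet into memory, then forwards it."""
    for node in path[1:-1]:
        node.packets_buffered += 1
    return packet


def circuit_switched(path: List[Node], packet: bytes) -> bytes:
    """Intermediate nodes connect input to output; the packet bypasses their memory."""
    intermediates = path[1:-1]
    for node in intermediates:
        node.port_connected = True   # set up the circuit
    delivered = packet               # data flows straight through the connected ports
    for node in intermediates:
        node.port_connected = False  # release the circuit
    return delivered
```

For a four-node path A, B, C, D, `store_and_forward` leaves `packets_buffered == 1` on B and C, whereas `circuit_switched` leaves every counter at zero.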

[0010] Cluster
A cluster is a station (or functional unit) consisting of a control unit (cluster controller) and the hardware attached to it (terminals, functional units, or virtual components). In this specification, a cluster includes an array of processor memory elements (PMEs), also called a node array. A cluster usually has 512 PMEs.

[0011] The entire PME node array of the invention consists of a set of clusters, each supported by one cluster controller (CC).

[0012] Cluster Controller
A cluster controller is a device that controls the input/output operations of more than one device or functional unit connected to it. A cluster controller is usually controlled by a program stored and executed in the unit, as in the IBM 3601 Finance Communication Controller, but it can also be controlled entirely by hardware, as in the IBM 3272 Control Unit.

[0013] Cluster Synchronizer
A cluster synchronizer is a functional unit that manages the operations of all or part of a cluster to maintain synchronous operation of the elements, so that the functional units maintain a particular time relationship with the execution of a program.

[0014] Controller
A controller is a device that directs the transmission of data and instructions over the links of an interconnection network. Its operation is controlled by a program executed by a processor to which the controller is connected, or by a program executed within the controller itself.

[0015] CMOS
CMOS is an abbreviation for complementary metal-oxide semiconductor technology. It is widely used to manufacture dynamic random access memories (DRAMs). NMOS is another technology used to manufacture DRAMs. The invention preferably uses CMOS, but the technology used to manufacture the Advanced Parallel Array Processor (APAP) is not intended to limit the scope of the semiconductor technology that may be employed.

[0016] Dotting
Dotting refers to joining three or more leads by physically connecting them together. Most backpanel buses use this connection approach. The term relates to the OR DOTS of the past, but it is used here to identify multiple data sources that can be combined onto a bus by a very simple protocol.

[0017] The I/O zipper concept of the invention can be used to implement the idea that the input port of a node can be driven either by the output port of another node or by data coming from the system bus. Conversely, data output from a node is available as input both to another node and to the system bus. Note that data is not output to the system bus and to another node simultaneously; the two outputs occur in different cycles.

[0018] Dotting is used in the H-DOT discussions, where taking advantage of dotting allows two-ported PEs, PMEs, or pickets to be used in arrays of various organizations. Several topologies are discussed, including 2D and 3D meshes, the base 2 N-cube, the sparse base 4 N-cube, and the sparse base 8 N-cube.

[0019] DRAM
DRAM is an abbreviation for dynamic random access memory, the common storage that computers use for main memory. However, the term DRAM also applies to use as a cache or as memory that is not main memory.

[0020] Floating-Point
A floating-point number is expressed in two parts: a fixed-point or fraction part, and an exponent part relative to an assumed radix or base. The exponent indicates the actual position of the decimal point. In a typical floating-point representation, the real number 0.0001234 is written as 0.1234-3, where 0.1234 is the fraction part and -3 is the exponent. In this example the floating-point radix or base is 10, an implicit fixed positive integer base greater than one. Raising this base to the power given explicitly by the exponent (or represented by the characteristic) in the floating-point representation, and then multiplying by the fraction part, yields the real number represented. Numeric literals can be expressed either in floating-point notation or as real numbers.
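The relationship between fraction, base, and exponent described above can be checked with a short Python sketch. The function names are hypothetical and the normalization convention (fraction magnitude in the range from 1/base up to, but not including, 1) is an assumption chosen to match the 0.1234-3 example; results are subject to ordinary binary floating-point rounding.

```python
from typing import Tuple


def float_value(fraction: float, exponent: int, base: int = 10) -> float:
    """Real number represented by a fraction part and an exponent part."""
    if base <= 1:
        raise ValueError("base must be a fixed integer greater than one")
    return fraction * float(base) ** exponent


def to_fraction_exponent(value: float, base: int = 10) -> Tuple[float, int]:
    """Split a value into (fraction, exponent) with 1/base <= |fraction| < 1."""
    if value == 0:
        return 0.0, 0
    fraction, exponent = abs(value), 0
    while fraction >= 1:          # shift the point left
        fraction /= base
        exponent += 1
    while fraction < 1 / base:    # shift the point right
        fraction *= base
        exponent -= 1
    return (fraction if value > 0 else -fraction), exponent


fraction, exponent = to_fraction_exponent(0.0001234)
```

Here `exponent` is -3 and `fraction` is approximately 0.1234, matching the example in the text.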

[0021] FLOPS
This term refers to floating-point instructions per second. Floating-point operations include ADD (addition), SUB (subtraction), MPY (multiplication), DIV (division), and often many others. The floating-point instructions per second parameter is often calculated using add or multiply instructions and can generally be considered a 50/50 mix. An operation includes generation of the exponent and fraction and any required normalization of the fraction. The invention can handle 32-bit or 48-bit floating-point formats (longer formats are possible, but they were not counted in the mix). A floating-point operation implemented with fixed-point instructions (normal or RISC) requires multiple instructions. Some use a 10 to 1 ratio when calculating performance, while some studies have shown a ratio of 6.25 to be more appropriate. The ratio differs from architecture to architecture.
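As a worked illustration of the ratios mentioned above, the following Python sketch converts a fixed-point instruction rate into an estimated floating-point rate. The function name and the sample rate of one billion instructions per second are assumptions for illustration only; the text gives the ratios but no such figure.

```python
def estimated_flops(instructions_per_second: float,
                    instructions_per_flop: float) -> float:
    """Estimate FLOPS when each floating-point operation costs several fixed-point instructions."""
    if instructions_per_flop <= 0:
        raise ValueError("ratio must be positive")
    return instructions_per_second / instructions_per_flop


rate = 1e9                                  # hypothetical fixed-point instruction rate
conservative = estimated_flops(rate, 10)    # 10 to 1 ratio
optimistic = estimated_flops(rate, 6.25)    # 6.25 to 1 ratio
```

With these inputs the 10 to 1 ratio gives 100 MFLOPS and the 6.25 ratio gives 160 MFLOPS, which shows how strongly the quoted figure depends on the ratio chosen.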

[0022] Functional Unit
A functional unit is an entity of hardware, software, or both that is capable of accomplishing a purpose.

[0023] Gbytes
Gbytes refers to one billion bytes. Gbytes/s is one billion bytes per second.

[0024] GIGAFLOPS
10⁹ floating-point instructions per second.

[0025] GOPS and PETAOPS
GOPS and BOPS have the same meaning: billions of operations per second. PETAOPS means a trillion operations per second, a potential of the current machine. For the APAP machine of the invention, these terms are nearly the same as BIPs/GIPs, meaning billions of instructions per second. In some machines a single instruction can cause two or more operations (for example, both an add and a multiply), but the invention does not do that. Conversely, a single operation can take many instructions; for example, the invention uses multiple instructions to perform 64-bit arithmetic. In counting operations, however, logarithmic operations were not counted. GOPS is the preferred term for describing performance, but it has not been used consistently. MIPs/MOPs, then BIPs/BOPs, and MegaFLOPS/GigaFLOPS/TeraFLOPS/PetaFLOPS are all used.

[0026] ISA
ISA stands for Instruction Set Architecture.

[0027] Link
A link is an element that may be physical or logical. A physical link is the physical connection that joins elements or units, whereas in computer programming a link is an instruction or address that passes control and parameters between separate portions of a program. In multisystems, a link is the connection between two systems, specified by program code that identifies the link by a real or virtual address. Thus a link generally includes the physical medium, any protocol, and the associated devices and programming; it is both logical and physical.

[0028] MFLOPS
MFLOPS means 10⁶ floating-point instructions per second.

[0029] MIMD
MIMD refers to a processor array architecture in which each processor in the array has its own instruction stream (hence multiple instruction streams) and executes multiple data streams located one per processing element.

[0030] Module
A module is a program unit that is discrete and identifiable, or a functional unit of hardware designed for use with other components. A collection of PEs contained in a single electronic chip is also called a module.

[0031] Node
Generally, a node is the junction of links. In a generic array of PEs, one PE can be a node. A node can also contain a collection of PEs called a module. In the invention, a node is formed from an array of PMEs, and this set of PMEs is referred to as a node. A node is preferably 8 PMEs.

[0032] Node Array
A collection of modules made up of PMEs is sometimes referred to as a node array; it is an array of nodes made up of modules. A node array is usually more than a few PMEs, but the term encompasses any plurality.

[0033] PDE
A PDE is a partial differential equation.

[0034] PDE Relaxation Solution Process
The PDE relaxation solution process is a way to solve a PDE (partial differential equation). Solving PDEs consumes the bulk of known supercomputing capacity, and it is therefore a good example of the relaxation process. There are many ways to solve a PDE, and more than one of the numerical methods includes the relaxation process. For example, when a PDE is solved by finite element methods, relaxation consumes most of the computing time.

Consider an example from the field of heat transfer. Given hot gas inside a chimney and a cold wind blowing outside, how does the temperature gradient within the chimney bricks develop? Treating the bricks as small segments, and writing an equation that expresses how heat flows between segments as a function of their temperature differences, converts the heat transfer PDE into a finite element problem. If all elements except those on the inside and outside are then set to room temperature, while the boundary segments are set to the temperatures of the hot gas and the cold wind, the problem is set up to begin relaxation. The computer program then models time by updating the temperature variable in each segment according to the amount of heat that flows into or out of the segment. It takes many cycles of processing all the segments in the model before the set of temperature variables across the chimney relaxes to represent the actual temperature distribution that would occur in the physical chimney. If the objective were to model gas cooling in the chimney, the elements would have to be extended to the gas equations, the boundary conditions on the inside would be linked to another finite element model, and the process would continue.

Note that the heat flow depends on the temperature difference between a segment and its neighbors. The inter-PE communication paths are therefore used to distribute the temperature variables. It is this near-neighbor communication pattern that makes PDE relaxation so well suited to parallel computing.
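The chimney example can be made concrete with a minimal one-dimensional Python sketch. It is an assumption-laden illustration rather than the patent's method: the wall is reduced to a single row of segments, the update is a simple explicit nearest-neighbor rule, and the names `relax_wall`, `alpha`, and `tolerance` are hypothetical.

```python
from typing import List, Tuple


def relax_wall(n_segments: int, hot: float, cold: float, room: float,
               alpha: float = 0.25, tolerance: float = 1e-6,
               max_sweeps: int = 100_000) -> Tuple[List[float], int]:
    """Relax segment temperatures across a wall; returns (temperatures, sweeps used)."""
    temps = [room] * n_segments
    temps[0], temps[-1] = hot, cold            # boundary segments stay fixed
    for sweep in range(1, max_sweeps + 1):
        new = temps[:]
        for i in range(1, n_segments - 1):
            # Heat flow depends only on the differences with the two neighbors,
            # which is why each segment needs only near-neighbor communication.
            flow_in = (temps[i - 1] - temps[i]) + (temps[i + 1] - temps[i])
            new[i] = temps[i] + alpha * flow_in
        largest_change = max(abs(a - b) for a, b in zip(new, temps))
        temps = new
        if largest_change < tolerance:         # the model has relaxed
            return temps, sweep
    return temps, max_sweeps


temps, sweeps = relax_wall(n_segments=9, hot=200.0, cold=0.0, room=20.0)
```

Every interior update reads only `temps[i - 1]` and `temps[i + 1]`, so in a parallel array each segment could live on its own processing element and exchange values only with adjacent elements. In this sketch the temperatures settle toward a straight-line profile between the hot and cold boundaries.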

[0035] Picket
A picket is an element in the array of elements that makes up an array processor. It consists of data flow (ALU REGS), memory, control, and the portion of the communication matrix associated with the element. The unit refers to 1/nth of an array processor made up of parallel processor and memory elements together with their control and their portion of the array intercommunication mechanism. A picket is one form of processor memory element (PME). The processor logic of the PME chip design of the invention can implement the picket logic described in the related applications, or it can have the logic for an array of processors formed as a node. The term picket is similar to the commonly used array term PE (processing element). A picket is an element of the processing array, preferably consisting of a combined processing element and local memory for processing bit-parallel bytes of information in a clock cycle. The preferred embodiment consists of a byte-wide data flow processor, 32 bytes or more of memory, primitive controls, and ties for communication with other pickets.

[0036] The term "picket" comes from Tom Sawyer and his white fence, although it will also be understood that, functionally, the analogy with a military picket line fits quite well.

[0037] Picket Chip
A picket chip contains a plurality of pickets on a single silicon chip.

[0038] Picket Processor System (or Subsystem)
A picket processor is a total system consisting of an array of pickets, a communication network, an I/O system, and a SIMD controller made up of a microprocessor, a canned routine processor, and a microcontroller that runs the array.

[0039] Picket Architecture
The picket architecture is the preferred embodiment of the SIMD architecture, with features that accommodate several diverse kinds of problems, including:
- set associative processing
- parallel numerically intensive processing
- physical array processing similar to images

[0040] Picket Array
A picket array is a collection of pickets arranged in a geometric order; it is a regular array.

[0041] PME or Processor Memory Element
PME stands for processor memory element. The term PME is used herein to refer to a single processor, memory, and I/O capable system element or unit that forms one of the parallel array processors of the invention. PME is a term that encompasses a picket. A PME is 1/nth of a processor array and comprises a processor, its associated memory, a control interface, and a portion of the array communication network mechanism. This element can be a PME with the connectivity of a regular array, as in a picket processor, or a PME that is part of a subarray, as in the multi-PME node described above.

[0042] Routing
Routing is the assignment of a physical path by which a message reaches its destination. A routing assignment requires a source and a destination. These elements or addresses have a temporary relationship or affinity. Message routing is often based on a key obtained by reference to a table of assignments. In a network, a destination is any station or network addressable unit that is addressed as the destination of transmitted information by a path control address identifying the link. The destination field identifies the destination with a message header destination code.
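Table-based routing as described above can be sketched in a few lines of Python. The message layout (`header`, `destination`) and the table contents are hypothetical; the text specifies only that a destination code in the message header is used as a key into a table of assignments.

```python
from typing import Dict


def route(message: dict, routing_table: Dict[str, str]) -> str:
    """Return the link assigned to the destination code in the message header."""
    destination_code = message["header"]["destination"]
    try:
        return routing_table[destination_code]
    except KeyError:
        raise LookupError(f"no route assigned for {destination_code!r}") from None


# Hypothetical table of assignments: destination code -> outgoing link.
routing_table = {"node-7": "link-east", "node-2": "link-north"}
message = {"header": {"source": "node-0", "destination": "node-7"}, "body": "data"}
chosen_link = route(message, routing_table)
```

The table can be changed between messages, which reflects the temporary nature of the relationship between source and destination.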

[0043] SIMD
A processor array architecture in which all processors in the array are commanded from a single instruction stream to execute multiple data streams located one per processing element.

[0044] SIMDMIMD or SIMD/MIMD
SIMDMIMD or SIMD/MIMD refers to a machine with a dual function that can switch from MIMD to SIMD for a period of time to handle a complex instruction, and that therefore has two modes. The Thinking Machines, Inc. Connection Machine model CM-2, when placed as the front end or back end of a MIMD machine, allowed programmers to use different modes to execute different parts of a problem; these are sometimes called dual modes. Such machines have existed since ILLIAC and have used a bus to interconnect a master CPU with other processors. The master control processor has the capability of interrupting the processing of the other CPUs. The other CPUs can run independent program code. During an interruption, some provision must be made for checkpointing (closing and saving the current status of the controlled processors).

[0045] SIMIMD
SIMIMD is a processor array architecture in which all processors in the array are commanded from a single instruction stream to execute multiple data streams located one per processing element. Within this construct, data-dependent operations within each picket that mimic instruction execution are controlled by the SIMD instruction stream.

[0046] This is a single instruction stream machine with the ability to sequence multiple instruction streams (one per picket) using the SIMD instruction stream and to operate on multiple data streams (one per picket). SIMIMD can be executed by a PME system.
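One way to picture SIMIMD is the following Python sketch, in which a single broadcast stream steps every picket through its own locally stored program. This is an illustrative assumption about how such sequencing could be emulated, not the mechanism disclosed in the patent; the opcode set `OPS`, the `Picket` class, and `simimd_cycle` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical opcode set shared by all pickets.
OPS: Dict[str, Callable[[int], int]] = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}


@dataclass
class Picket:
    program: List[str]  # local instruction stream, held as data in local memory
    value: int          # local data stream (one value for simplicity)
    pc: int = 0         # local program counter


def simimd_cycle(pickets: List[Picket]) -> None:
    """One controller cycle: broadcast every opcode; each picket acts on a match only."""
    fetched = [p.program[p.pc] if p.pc < len(p.program) else None for p in pickets]
    for opcode, operation in OPS.items():      # the single, common instruction stream
        for picket, local_opcode in zip(pickets, fetched):
            if local_opcode == opcode:         # data-dependent enable per picket
                picket.value = operation(picket.value)
    for picket, local_opcode in zip(pickets, fetched):
        if local_opcode is not None:
            picket.pc += 1


pickets = [Picket(["inc", "dbl"], 3), Picket(["neg", "inc"], 3)]
simimd_cycle(pickets)
simimd_cycle(pickets)
```

After two cycles the first picket has computed (3 + 1) * 2 and the second has computed -3 + 1, even though the controller issued the same broadcast sequence to both. The cost is that each cycle walks through every opcode, which is the usual trade-off when one instruction stream sequences many.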

[0047] SISD
SISD is an abbreviation for single instruction single data.

[0048] Swapping
Swapping interchanges the data content of one storage area with that of another storage area.

[0049] Synchronous Operation
Synchronous operation in a MIMD machine is a mode of operation in which each action is related to an event (usually a clock). The event can be a specified event that occurs regularly in a program sequence. An operation is dispatched to a number of processing elements, each of which then performs the function independently. Control is not returned to the controller until the operation is completed.

[0050] If the request is to an array of functional units, the controller issues the request to the elements in the array, and those elements must complete their operation before control is returned to the controller.
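The dispatch-and-wait behavior described above can be sketched with Python's standard `concurrent.futures` module. Threads stand in for processing elements here purely for illustration, and the function name `dispatch_synchronously` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def dispatch_synchronously(operation: Callable[[int], int],
                           element_data: List[int]) -> List[int]:
    """Dispatch one operation to every element and return only when all have finished."""
    workers = max(1, len(element_data))        # the executor needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(operation, item) for item in element_data]
        # result() blocks, so control returns to the caller (the "controller")
        # only after every element has completed its operation.
        return [future.result() for future in futures]


squares = dispatch_synchronously(lambda x: x * x, [1, 2, 3, 4])
```

The elements run independently, yet the caller observes a single completion point, which is the property the definition emphasizes.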

[0051] TERAFLOPS
TERAFLOPS means 10¹² floating-point instructions per second.

[0052] VLSI
VLSI is an abbreviation for very large scale integration (as applied to integrated circuits).

[0053] Zipper
A zipper is a newly provided function that allows links to be established from devices that are external to the normal interconnection of an array configuration.

[0054] The prior art that forms the background of the invention is described below. In the never-ending quest for faster computers, engineers are linking hundreds, and even thousands, of low-cost microprocessors together in parallel to build super supercomputers that can solve complex problems beyond the reach of today's machines. Such machines are called massively parallel machines. The inventors have developed a new way to build massively parallel systems. The many improvements made by the inventors should be considered against the background of the many works of others.

[0055] Multiple computers operating in parallel have existed for decades. Early parallel machines include the ILLIAC, which was started in the 1960s. The ILLIAC IV was built in the 1970s. Other multiprocessors (see the summary in U.S. Pat. No. 4,975,834) include the Cedar, Sigma-1, the Butterfly and the Monarch, the Intel ipsc, the Connection Machines, the Caltech COSMIC, the N Cube, IBM's RP3, IBM's GF11, the NYU Ultra Computer, and the Intel Delta and Touchstone.

[0056] Large multiprocessors, beginning with ILLIAC, have been considered supercomputers. The supercomputers with the greatest commercial success have been based on multiple vector processors, represented by the Cray Research Y-MP systems, the IBM 3090, and the machines of other manufacturers including Amdahl, Hitachi, Fujitsu, and NEC.

[0057] Massively parallel processors (MPPs) are now considered capable of becoming supercomputers. These computer systems aggregate a large number of microprocessors with an interconnection network and program them to operate in parallel. These computers have two modes of operation: some are MIMD mode machines and some are SIMD mode machines. The most commercially successful of these machines have been the Connection Machines series 1 and 2 of Thinking Machines, Inc., which are essentially SIMD machines. Many massively parallel machines use microprocessors interconnected in parallel to obtain their concurrency, that is, their capability for parallel operation. Intel microprocessors such as the i860 have been used by Intel and others. N Cube has built massively parallel machines with Intel '386 microprocessors. Other machines have been built with so-called "transputer" chips; the Inmos Transputer IMS T800 is one example. The Inmos Transputer T800 is a 32-bit device with an integral high-speed floating-point processor.

[0058] As an example of the kind of system that is built, several Inmos Transputer T800 chips each have 32 communication link inputs and 32 link outputs. Each chip has a single processor, a small amount of memory, communication links to the local memory, and communication links to external interfaces. To complete the system, communication link adapters such as the IMS C011 and C012 are connected. Switches such as the IMS C004 are used as, for example, a crossbar switch between 32 link inputs and 32 link outputs to provide point-to-point connections between additional transputer chips. There are also special circuits and interface chips for transputers, which adapt them to special purposes tailored to the requirements of special devices, graphics controllers, or disk controllers. The Inmos IMS M212 is a 16-bit processor with on-chip memory and multiple communication links. It contains the hardware and logic to control disk drives and can be used as a programmable disk controller or as a general-purpose interface. To exploit concurrency (parallel operation), Inmos developed Occam, a special language for the transputer. The programmer must describe the network of transputers directly in the Occam program.

[0059] Some of these massively parallel machines use parallel processor arrays of processor chips interconnected in various topologies. The transputer forms a crossbar network with the addition of IMS C004 chips. Other systems use a hypercube connection. Others use a bus or a mesh to connect the microprocessors and their associated circuitry. Some are interconnected by circuit-switched processors and use the switches as processor-addressable networks. As with the 14 RISC/6000 machines that were interconnected last fall at Lawrence Livermore by wiring the machines together, a processor-addressable network is regarded as a coarse-grained multiprocessor.

[0060] Several large machines have been built by Intel, N Cube, and others to attack the so-called "grand challenges" of data processing. These computers, however, are extremely expensive. Recent cost estimates for a computer whose development the US government is funding to attack the "grand challenges" are about $30 million to $75 million (the Tera Computer). The "grand challenges" include climate modeling, turbulence, pollution dispersion, mapping of the human genome and of ocean currents, quantum chromodynamics, semiconductor and supercomputer modeling, combustion systems, and visual recognition.

[0061] As a further note on the background of the invention, one of the early massively parallel machines developed by IBM should be recognized. In this specification we use the term PME, rather than "transputer", to describe one of the eight or more memory units with processor and I/O capability that make up the array of PMEs within a chip, that is, a node. The prior-art "transputer" referred to has on a chip one processor, a FORTRAN coprocessor, a small memory, and an I/O interface. Our term PME applies generally to the transputer and to the PME of the RP3. As will be seen below, however, our small chip differs significantly in many respects and has the many functions described later. We nevertheless recognize that the term PME was first coined to describe another, now more typical, PME, which formed the basis of the massively parallel machine known as the RP3.

The RP3 (the IBM Research Parallel Processing Prototype) was an experimental parallel processor based on a multiple-instruction multiple-data (MIMD) architecture. It was designed and built at IBM's T. J. Watson Research Center in cooperation with the New York University Ultracomputer project. The work was sponsored in part by the Defense Advanced Research Projects Agency of the US Department of Defense. The RP3 comprised 64 processor memory elements (PMEs) interconnected by a high-speed omega network. Each PME contained a 32-bit IBM "PC Scientific" microprocessor, a 32 KB cache, a 4 MB segment of system memory, and an I/O port. The PME I/O port hardware and software supported initialization, status acquisition, and memory and processor communication through a shared I/O support processor (ISP). Each ISP supported eight PMEs through the extended I/O adapter (ETIO), independently of the system network. Each ISP interfaced to an IBM S/370 channel and to an IBM Token Ring network, and provided operator monitor services. Each extended I/O adapter was attached as a device to a PME ROMP storage channel (RSC) and provided programmable PME control/status signal I/O via the ETIO channel. The ETIO channel was a 32-bit bus that interconnected the ISP to the eight adapters. It used a custom interface protocol supported by hardware on the ETIO adapter and software on the ISP.

[0062]

[Problems to Be Solved by the Invention] The machine that we call the Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor, which we believe is needed to address the problems of prior designs. As noted above, many fine-grained (and coarse-grained) processors have been built from point designs and off-the-shelf processors, using dedicated and shared memory and one of the many possible interconnection schemes. To date, all of these approaches have run into design and performance limitations. Each "solution" aims in a different direction, and each has its problems. Existing parallel machines are difficult to program. Each is generally not adaptable to machines of various sizes that remain compatible across a range of applications. Each is limited in its design by physical design, interconnection, and architectural issues.

[0063] Physical problems: Some approaches use a separate chip design for each of the various functions that a horizontal structure requires. These approaches suffer performance limitations due to chip-crossing delays.

[0064] Other approaches integrate the various functions vertically on a single chip. These suffer performance limitations because of the physical limit on the number of logic gates that can be integrated on a producible chip.

[0065] Interconnection problems: For fine-grained parallel processors, the network that interconnects the various processing functions is important. Processor designs have been developed with buses, meshes, and hypercubes. Each of these networks has inherent limitations on processing capability. Bus designs limit both the number of processors that can be physically interconnected and the network performance. Mesh designs limit network performance because the network diameter becomes large. Hypercube designs require each node to have a large number of interconnection ports, so the number of processors that can be interconnected is limited by the physical I/O pins at the node. The hypercube structure is nevertheless considered to offer significantly better performance than conventional bus and mesh structures.
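The mesh-versus-hypercube trade-off described above can be made concrete with a small calculation. The following Python sketch is illustrative only and is not taken from the specification: it assumes a square two-dimensional mesh without wraparound links and a plain binary hypercube, and the function names are hypothetical.

```python
def mesh_2d(side):
    """Square 2-D mesh without wraparound: side x side nodes (assumed topology)."""
    return {
        "nodes": side * side,
        "ports_per_node": 4,          # north, south, east, west for an interior node
        "diameter": 2 * (side - 1),   # corner-to-corner hop count
    }


def hypercube(dim):
    """Binary hypercube: one port per dimension, diameter equal to the dimension."""
    return {"nodes": 2 ** dim, "ports_per_node": dim, "diameter": dim}


if __name__ == "__main__":
    for dim in (4, 6, 10):
        side = 2 ** (dim // 2)        # mesh with the same node count as the hypercube
        print(dim, mesh_2d(side), hypercube(dim))
```

Under these assumptions, 1,024 nodes arranged as a 32 × 32 mesh need 62 hops corner to corner with 4 ports per node, while a 10-dimensional hypercube needs at most 10 hops but 10 ports per node. This is the diameter and pin-count tension the paragraph describes.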

[0066] Architectural problems: Processes suited to fine-grained parallel processors fall into two classes. Processes that are functionally partitionable tend to run better on multiple-instruction multiple-data (MIMD) architectures. Processes that are not functionally partitionable but have multiple data streams tend to run better on single-instruction multiple-data (SIMD) architectures. Any given application contains some processes of each kind. System trade-offs are required to choose the architecture that best suits a particular application, and no single solution has been satisfactory.

[0067]

[Means for Solving the Problems] We have developed a new way to build massively parallel processors and other computer systems by creating new "chips" and systems designed with new concepts. The present invention is directed to such systems. The components described herein can be combined in our systems to build new systems; they can also be combined with existing technology.

[0068] Our small CMOS DRAM chip, about 14 × 14 mm, can be put together much as bricks are laid to build the walls of a building or to pave a road. By connecting replicas of itself, the chip provides the structure needed to build a "house", that is, a complex computer system.

[0069] In overview, four identical small chips, each with eight or more processors embedded in memory and each with internal array capability and an external I/O BCI, provide the memory and processing power of 36 or more complex computers. With compact hybrid packaging, all of these chips can be placed in something about the size of a wristwatch, and because each chip dissipates only about 2 W, they can operate at very low power. With this chip we have developed many new concepts; those we consider to be separate inventions are described in detail in the embodiments and in the claims. Systems that can be built with our computer system range from small devices to massive machines with PETAOP capability. Details of such chips appear in the related patent applications. This specification covers some of them, including features that apply to multiple processor memory element (PME) parallel processors, features that apply specifically to our multiprocessor with respect to chip design, and some features that apply to less compact processing elements and to pickets.

[0070] Because this design allows a processor to switch dynamically between MIMD and SIMD modes, no trade-off between SIMD and MIMD is required in our system. This eliminates many of the problems that application programmers of "hybrid" machines encounter. In addition, the design allows a subset of the processors to be in SIMD mode or in MIMD mode.

[0071] The Advanced Parallel Array Processor (APAP) is a fine-grained parallel processor. It consists of control and processing sections that can be partitioned so as to satisfy configurations ranging from supercomputing to personal computing applications. In most configurations it is attached to a host processor and supports the offloading of segments of the host's workload. Because the APAP array processing elements are general-purpose computers, the type of workload offloaded varies with the capabilities of the host. For example, our APAP can be a module of an IBM 3090 vector processor mainframe. When attached to a mainframe with high-performance vector floating-point capability, the offloaded task might be sparse-to-dense matrix transformation. When attached to a personal computer, the offloaded task might be numerically intensive three-dimensional graphics processing.

[0072] US patent application Ser. No. 07/611,594, entitled "Parallel Associative Processor System", describes the idea of integrating computer memory and control logic within a single chip, replicating the combination within the chip, and building a processor system out of replications of that single chip; it may be consulted as needed. That approach, continued and extended here, leads to a system in which massively parallel processing capability can be implemented by developing and manufacturing only a single chip type, and in which performance is improved by fewer chip-boundary crossings and shorter line lengths.

[0073] US patent application Ser. No. 07/611,594, filed November 13, 1990, describes the use of a one-dimensional I/O structure (essentially a linear I/O) with multiple SIMD PMEs attached to that structure within the chip; it may be consulted as needed. The present embodiment extends these concepts to two or more dimensions. Below we describe a four-dimensional I/O structure with eight SIMD/MIMD PMEs per chip. However, as discussed later with reference to FIGS. 4, 10, 11, 17, and 18, both the number of dimensions and the number of PMEs per dimension can be increased. Our processing element has a complete I/O system, including data-transfer interrupts and program interrupts. The description of the preferred embodiment focuses mainly on the preferred four-dimensional I/O structure with eight SIMD/MIMD PMEs per chip, which we currently consider particularly advantageous. For most applications we have preferred, and have made inventions in, the higher-dimensional domain with hypercube interconnection, in particular the modified hypercube described later. In some applications, however, a two-dimensional mesh interconnection of the chips is suited to the task at hand. For example, in certain military computers a two-dimensional mesh is appropriate and cost-effective.

[0074] This specification and the related patent applications describe in detail the picket processor and what is referred to here as the Advanced Parallel Array Processor (APAP). Note that the picket processor can use a processor memory element (PME). The picket processor is particularly useful in military applications, where a very compact array processor is desired. In this respect the picket processor differs somewhat from the preferred embodiment associated with the APAP. There are, however, commonalities, and the aspects and features we provide can be used in the different machines.

[0075] The term picket refers to a 1/n element of an array processor, made up of a processor and memory together with the communication elements they contain for array intercommunication.

[0076] The picket concept also applies to 1/n of an APAP processing array.

[0077] The picket concept can differ from the APAP in data width, memory size, and number of registers. In the massively parallel embodiment that is an alternative to the APAP, it differs in that it is configured to have connectivity for 1/n of a regular array, whereas a PME in the APAP is part of a subarray. Both systems can execute SIMD. However, because the picket processor is configured as a SIMD machine with MIMD in the PEs, it can execute SIMIMD directly, whereas the MIMD APAP configuration executes SIMIMD by using MIMD PEs that are controlled to emulate SIMD. Both machines use a PME.

[0078] Both systems can be configured as a parallel array processor comprising an array processing unit for an array having N elements interconnected by an array communication network, in which 1/N of the processor array is a processing element, its associated memory, a control bus interface, and a portion of the array communication network.

[0079] The parallel array processor has a dual-mode capability: the processing unit can be commanded to operate in either or both of two modes, one for SIMD operation and one for MIMD operation, and can move freely between them. When SIMD is the mode of its organization, the processing unit can command each element to execute its own instructions in SIMIMD mode. When MIMD is the implementation mode of the processing-unit organization, selected elements of the array can be synchronized to simulate MIMD execution. This is called MIMD-SIMD.
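The difference between the two modes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the PME design: the `PE` class, its two-instruction set (`add`, `mul`), and the `run` helper are all hypothetical. In SIMD mode every element executes the instruction broadcast by the controller; in MIMD mode each element fetches the next instruction from its own local program.

```python
class PE:
    """Toy processing element with an accumulator and a local program (hypothetical)."""

    def __init__(self, value, program):
        self.acc = value
        self.program = list(program)   # local instructions, used only in MIMD mode
        self.pc = 0
        self.mode = "SIMD"

    def execute(self, op):
        name, arg = op
        if name == "add":
            self.acc += arg
        elif name == "mul":
            self.acc *= arg
        else:
            raise ValueError(f"unknown op: {name}")

    def step(self, broadcast=None):
        if self.mode == "SIMD":
            # SIMD: run the single instruction broadcast to every element.
            if broadcast is not None:
                self.execute(broadcast)
        elif self.pc < len(self.program):
            # MIMD: run the next instruction of this element's own program.
            self.execute(self.program[self.pc])
            self.pc += 1


def run():
    pes = [PE(1, [("add", 10)]), PE(2, [("mul", 10)])]
    for pe in pes:                     # one SIMD step: the same op on different data
        pe.step(("add", 1))
    for pe in pes:                     # switch modes without reloading anything
        pe.mode = "MIMD"
    for pe in pes:                     # one MIMD step: each element runs its own op
        pe.step()
    return [pe.acc for pe in pes]


if __name__ == "__main__":
    print(run())
```

The point of the sketch is only that the mode is per-element state that can change between steps, so a program can alternate between broadcast and locally fetched instructions.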

[0080] The parallel array processor in both systems provides an array communication network with paths for passing information between elements of the array. The movement of information can be directed in either of two ways. In the first, the array controller directs all messages to move in the same direction at the same time, so the data being moved does not define its destination. In the second, each message is self-routed, and a header at the beginning of the message defines its destination.
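The two ways of directing data movement can be illustrated with a short Python sketch. It is a simplification under assumptions that are not in the specification: the elements are arranged in a ring with wraparound, a message header is just a destination index, and both function names are hypothetical.

```python
def shift_all(values, step):
    """Controller-directed move: every element passes its datum `step`
    positions around the ring at once; the data carries no destination."""
    n = len(values)
    return [values[(i - step) % n] for i in range(n)]


def self_route(src, dest, n):
    """Self-routed move: the header (`dest`) decides each hop. Returns the
    list of nodes visited, taking the shorter way around a ring of n nodes."""
    path = [src]
    node = src
    while node != dest:
        forward = (dest - node) % n
        node = (node + 1) % n if forward <= n - forward else (node - 1) % n
        path.append(node)
    return path


if __name__ == "__main__":
    print(shift_all(["a", "b", "c", "d"], 1))
    print(self_route(0, 6, 8))
    print(self_route(1, 3, 8))
```

In `shift_all` the direction comes entirely from the controller's `step` argument, whereas in `self_route` the same network is steered by information carried in the message itself.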

[0081] A segment of the parallel array processor array has multiple copies of the processing unit on a single semiconductor chip. Each copy is a portion of the array and includes the portion of the array communication network associated with that segment, together with buffers, drivers, multiplexers, and control functions, so that the segment can be connected seamlessly to other segments of the array to extend the array communication network.

[0082] A control bus or control path from the controller is provided for each processing unit, and the control bus extends to each element of the array to control its activities.

[0083] Each processing-element segment of the parallel array contains multiple copies of a processor memory element. The processor memory elements are contained within a single semiconductor chip, which holds one segment of the array and includes a portion of the array control bus and register buffers to support the communication of control to the array segment contained in the chip.

[0084] Both can implement mesh moves or routed moves. Normally the APAP implements a dual interconnect structure, in which the eight elements on a chip are related to one another in one way and the chips are related to one another in another way. Programmable routing on the chip establishes links between PMEs as described above, but the nodes can be, and usually are, related in another way. On the chip, the normal APAP configuration is essentially a 2 × 4 mesh, and the node interconnection can be a routed sparse octal N-cube. Both systems have inter-PE communication paths between PEs (PMEs), so that a matrix can be built up from point-to-point paths.
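As a rough illustration of the on-chip 2 × 4 mesh, the Python sketch below lists the mesh neighbors of each of the eight PMEs. The row-major numbering 0 to 7 and the absence of wraparound links are assumptions made for the example; the specification does not fix either here, and the node-to-node (N-cube) links are not modeled.

```python
ROWS, COLS = 2, 4   # on-chip arrangement of the eight PMEs


def neighbors(pme):
    """Return the mesh neighbors of a PME, using hypothetical row-major ids 0..7."""
    r, c = divmod(pme, COLS)
    found = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < ROWS and 0 <= cc < COLS:   # no wraparound assumed
            found.append(rr * COLS + cc)
    return sorted(found)


if __name__ == "__main__":
    for pme in range(ROWS * COLS):
        print(pme, neighbors(pme))
```

With this numbering a corner PME has two on-chip neighbors and the others have three, which is why additional off-chip links are needed to relate the nodes to one another in a different way.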

[0085] With this background and perspective, the features and aspects of our invention as they relate to the preferred embodiment will now be described in detail with reference to the drawings.

[0086]

[Embodiments] The present invention is now described in detail. FIGS. 1 and 2 illustrate the existing state of the art, exemplified by the Transputer T800 chip and representative of similar chips for machines such as the Touchstone Delta (i860) and the N Cube ('386). Comparing FIGS. 1 and 2 with the systems we have developed shows that our invention not only greatly improves systems like the conventional ones but also, as described later, enables powerful new systems to be built. The conventional present-day microprocessor technology of FIGS. 1 and 2 consumes pins and memory heavily. Bandwidth is limited, and inter-chip communication drags down system performance.

[0087] The innovative new technology shown in FIG. 3 merges processor, memory, and I/O into multiple PMEs (eight or more 16-bit processors, each with no memory access delay and each using all of the pins for networking) formed on a single low-power CMOS DRAM chip. The system can make use of the concepts of the disclosures referenced above and of the inventive concepts that are described separately in our co-filed applications and are applicable to the system described here. Those disclosures and applications are therefore incorporated herein by reference. Our concepts of grouping, autonomy, transparency, zipper interaction, asynchronous SIMD, SIMIMD, and SIMD/MIMD can all be used with the new technology. With less benefit, they can also be used in prior-art systems or in combination with our earlier multiple-picket processor.

[0088] Our picket system can use this processor. Our basic concept is that we have provided a new basic building block, a replicable brick, for systems built with our new memory processor: a memory unit having an embedded processor, a router, and I/O. This basic building block is scalable. The basic system in which we have implemented the invention uses a 4-megabit CMOS DRAM. It can be expanded to larger memory configurations with 16-megabit DRAMs and 64-megabit chips. Each processor is a gate array. With denser deposition, more processors with faster clock rates can be placed on the same chip, and using the added gates and memory increases the performance of each PME. Scaling a single part type provides a system framework and architecture whose performance can extend well into the PETAOP range.

[0089] FIG. 3 shows the memory processor of the preferred embodiment, which we call the PME or processor memory element. The chip has eight or more processors; in the illustrated embodiment it has eight. The chip can be expanded (horizontally) to add more processors. The chip can hold the logic as it is while the DRAM memory is expanded linearly (vertically) by adding cells, and this is preferred. The figure shows sixteen 32-kilobit × 9-bit sections of DRAM memory surrounding a field of CMOS gate-array gates, in which eight replications of a 16-bit-wide data-flow processor are implemented.
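The memory organization in FIG. 3 can be checked against the 4-megabit figure with simple arithmetic. The Python sketch below assumes that "32 kilobit" means 32 × 1024 addressable words per section and that the ninth bit of each word is a parity bit; both are assumptions for illustration, as is the even split of the sixteen sections across the eight processors.

```python
SECTIONS = 16                      # DRAM sections shown in FIG. 3
WORDS_PER_SECTION = 32 * 1024      # assumed meaning of "32 kilobit" per section
BITS_PER_WORD = 9                  # assumed: 8 data bits plus 1 parity bit
PMES_PER_CHIP = 8

total_bits = SECTIONS * WORDS_PER_SECTION * BITS_PER_WORD
data_bits = SECTIONS * WORDS_PER_SECTION * (BITS_PER_WORD - 1)
sections_per_pme = SECTIONS // PMES_PER_CHIP
data_bytes_per_pme = sections_per_pme * WORDS_PER_SECTION   # one data byte per word

if __name__ == "__main__":
    print("total bits on chip:", total_bits)
    print("data bits on chip :", data_bits)
    print("sections per PME  :", sections_per_pme)
    print("data bytes per PME:", data_bytes_per_pme)
```

Under these assumptions the sixteen sections hold 4,718,592 bits in total, of which 4,194,304 (exactly 4 megabits) are data, and each of the eight processors has two sections, or 64 KB of data storage, which fits a 16-bit-wide data flow.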

【００９０】This processor uses IBM's low-power, sub-micron CMOS-on-silicon deposition technology, with selected silicon and trenches, to provide a large amount of storage on the surface of a small chip. IBM's advanced semiconductor chip manufacturing technology interconnects the memory and the multiple processors organized on the chip. Note that the small chip described here has about 4 megabits of memory. It is designed to migrate, without any change to the logic, to larger memory sizes, each 9 bits wide, once 16-megabit memory technology stabilizes, yields improve, and methods for dealing with defects are established. Advances in photolithography and X-ray lithography have pushed minimum feature sizes well below 0.5 micron, and the design anticipates further progress. These advances will allow very large memories with processing capability to be placed on a single silicon chip.

【００９１】The device of the present invention is a 4-megabit CMOS DRAM, believed to be the first general-purpose memory chip with a large amount of space for logic. Sixteen replications of a 32K × 9-bit DRAM macro make up the memory array. The DRAM has a large surface area allocated to on-chip application logic, with 120K cells and triple-level metal wiring. The processor logic cells are preferably gate array cells. The DRAM access time is 35 nanoseconds or less, which matches the processor cycle time. This CMOS implementation provides the logic density for a very effective processing element (picket) while the logic dissipates 1.3 W. Each separate memory section of the chip is 32K × 9 bits (expandable without changing the logic) and surrounds the field of CMOS gate array gates. That field represents 120K cells and contains the logic described with respect to the other figures. The memory is barriered and, with a separate power supply, dissipates 9 W. Combining a large amount of logic with a large memory on the same silicon substrate required solving the electrical noise incompatibility between logic and DRAM. Logic tends to be noisy, whereas memory needs fairly low noise in order to sense the millivolt-scale signals produced by reading DRAM cells. The present invention uses trenched, triple-metal-layer silicon deposition, dedicates separate barriered portions of the memory chip to memory and to processor logic, and provides power distribution and barriers separately for each, thereby achieving compatibility between logic and DRAM.
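
The memory figures quoted above can be cross-checked with a little arithmetic. The sketch below is illustrative only. It assumes that "32K × 9 bits" means 32 × 1024 words of 9 bits, that the ninth bit of each word is parity, and that the sixteen macros are shared evenly among the eight PMEs (two macros per processor, as the description states further on). The function and constant names are hypothetical.

```python
# Hedged cross-check of the memory organisation described in the text.
WORDS_PER_MACRO = 32 * 1024   # assumption: "32K" means 32 * 1024 words
BITS_PER_WORD = 9             # from the text: 9-bit-wide sections
DATA_BITS_PER_WORD = 8        # assumption: the ninth bit is parity
MACROS_PER_CHIP = 16          # from the text
PMES_PER_CHIP = 8             # from the text


def chip_memory():
    """Return derived capacities for one chip and one PME."""
    raw_bits = MACROS_PER_CHIP * WORDS_PER_MACRO * BITS_PER_WORD
    data_bits = MACROS_PER_CHIP * WORDS_PER_MACRO * DATA_BITS_PER_WORD
    macros_per_pme = MACROS_PER_CHIP // PMES_PER_CHIP
    bytes_per_pme = macros_per_pme * WORDS_PER_MACRO * DATA_BITS_PER_WORD // 8
    return {
        "raw_bits": raw_bits,                       # including the ninth bit
        "data_megabits": data_bits // 2**20,        # usable data capacity
        "macros_per_pme": macros_per_pme,
        "kilobytes_per_pme": bytes_per_pme // 1024,
        "words16_per_pme": bytes_per_pme // 2,      # 16-bit words per PME
    }


if __name__ == "__main__":
    for name, value in chip_memory().items():
        print(f"{name}: {value}")
```

Under these assumptions the sixteen macros give 4 megabits of data storage per chip, and each PME sees 64 KB, that is, 32K sixteen-bit words, which agrees with the per-PME figures given elsewhere in the description.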

【００９２】Overview of the APAP system of the preferred embodiment: The new technology is introduced here in the following order:
1. Technology
2. Description of the chip hardware
3. Networking and system building
4. Software
5. Applications

【００９３】The first few sections describe how the 4-megabit DRAM low-power CMOS chip is made to carry, on and as part of the manufactured PME DRAM chip, eight processors that each support the following:
1. A 16-bit, 5-MIPS data flow
2. An independent instruction stream and interrupt processing
3. An 8-bit-wide (plus parity and control) external port, and interconnection to three other on-chip processors

【００９４】The present invention integrates multiple functions into a single chip design. The chip provides PME functions that are powerful and flexible enough for a scalable chip to perform processing, routing, storage, and three classes of I/O effectively. Memory and control logic are integrated within a single chip to form a PME, and this combination is replicated within the chip. A processor system is built by replicating the single chip.

【００９５】This approach partitions the low-power CMOS DRAM. The DRAM is formed as sections of 32K words, each word of multiple-word-length (16-bit) width, and each section is associated with one processor (the term PME refers to a single system unit comprising a processor, memory, and I/O capability). This partitioning makes each DRAM chip an 8-way "cube-connected" MIMD parallel processor with eight independent, byte-wide interconnection ports (see FIG. 7 for an example of replicating the fine-grained parallel technology, showing replication and the ring torus possibilities).

【００９６】The software description covers several distinct program types. At the lowest level, processes interface the user's program (or services called by the application) to the detailed hardware requirements. This level includes the tasks needed to manage I/O and interprocessor synchronization, and it could be called a microprogram for the MPP. Intermediate-level services provide both for mapping applications (developed with vector and matrix operations) onto the MPP and for control, synchronization, startup, and diagnostic functions. At the host level, high-level languages are supported by library functions that support vectorized programs with either simple automatic data allocation to the MPP or user-tuned data allocation. The multilevel software approach lets an application exploit different degrees of control and optimization within a single program. A user can therefore code an application program without understanding the architectural details, while an optimizer tunes only the small, heavily used kernels of the program at the microcode level.

【００９７】The sections describing a 1024-element, 5-GIPS unit and a 32768-element, 164-GIPS unit illustrate the range of possible systems. They are not limits; both smaller and larger units are feasible. These particular sizes were chosen as examples because the small unit suits microprocessor (accelerator), personal computer, workstation, and military applications (using different packaging techniques, of course), while the larger unit illustrates a mainframe application as a module or a complete supercomputer system. The software description gives examples of other challenging work that can be programmed effectively on each of the example systems.

【００９８】PME DRAM CMOS - Basis for the multiprocessor PME: FIG. 3 shows the improvement of the present invention at the chip technology level. Because this extendable computer organization uses only one chip type, it is very cost- and performance-efficient over a wide range of system sizes. Combining memory and processing on one chip eliminates the pins dedicated to the memory bus, along with the reliability and performance penalties those pins bring. Replicating the design within the chip makes custom logic design for the processor subsection economically feasible. Replicating the chip within the system greatly reduces manufacturing cost. Finally, CMOS technology needs little power per MIPS, which minimizes power supply and cooling requirements. Because the chip architecture can be programmed for multiple word lengths, the system can perform operations that would otherwise require a processor with a much longer word. Together these attributes permit a wide range of system performance.

【００９９】The new technology of the present invention can be compared with an extension of the older technology with which it has some features in common. It is clear that processor designers have used shrinking feature sizes to increase chip complexity, while memory designers have used them to replicate simple elements on a larger scale. If this trend continues, memories can be expected to become four times larger and processors to use the added density to:
1. Provide multiple execution units with an instruction router.
2. Increase cache size and associated function.
3. Increase instruction look-ahead and improve computational capability.

【０１００】However, pursuing these approaches with the older technology shown in FIG. 1 leads only to dead ends. Replicating processors raises the required pin count proportionally, but the number of pins per chip stays fixed. Improved cache operation can raise performance only as far as the application's data reuse pattern allows; beyond that, memory bandwidth is the limit. Application data dependencies and branching limit the potential benefit of look-ahead schemes. Nor is it clear that fine-grained parallel MPP applications need 1, 4, or 16 megawords of memory per processing unit. Trying to share memories of that size among multiple processors makes the memory bandwidth limitation severe.

【０１０１】The new approach of the present invention has no such dead end. As the figures and description from FIG. 3 onward show, the invention combines a large memory and I/O on a single chip. The approach reduces the part count and eliminates the delays associated with chip crossings. More importantly, it allows all of a chip's I/O pins to be dedicated to interprocessor communication, which maximizes network bandwidth.

【０１０２】The preferred embodiment shown in FIG. 3 is implemented in IBM low-power CMOS technology using currently available processes. It can be realized in complementary metal oxide semiconductor (CMOS) technology at CMOS DRAM densities, and it can be carried into denser CMOS. As CMOS density increases, the 32K memory of each of the eight PMEs on the chip can grow. The embodiment uses the chip real estate and process technology of a 4-megabit CMOS DRAM and extends it by associating replicated processors with 32K of memory on the chip itself. In each chip package of the cluster shown in FIG. 4, the chip contains processors, memory, and I/O. Each package holds memory with embedded processor elements, a router, and I/O, all within a 4-megabit CMOS DRAM believed to be the first general-purpose memory chip with a large amount of space for logic. The chip uses selected silicon with trenches to provide a large amount of storage on a small chip surface. Alternatively, each processor of the design can be built with a memory array made of 16 replications of a 32K × 9-bit DRAM macro (35/80 nanoseconds) and 0.87-micron CMOS logic. The device is unique in allocating surface area on the chip for 120K cells of application logic and supporting it with triple-level metal wiring. The left side of FIG. 4 shows prior-art cards, marked with an X.

【０１０３】The replicable building-block brick technology of the present invention is the answer to the older technology. The technology marked with an "X" on the left side of FIG. 4 has too many chips and cards and is wasteful. For example, the teraflop machines that others propose today have literally a million chips or more. With today's other technologies, at most a few percent of those chips do real work; the rest are "overhead" (typically memory, network interfaces, and so on).

【０１０４】Packaging that many chips of that kind into anything that must operate in a physically constrained environment is clearly impossible (how many would fit in a small cockpit?). Moreover, the teraflop machines proposed by others are already large, yet they must scale up a further 1000 times to reach the petaflop range. The present inventors have a solution that dramatically reduces the proportion of non-working chips, and it does so within a reasonable network dimensionality. With the brick technology the memory becomes the operator, the network is used to pass control, and the share of working chips rises sharply. In addition, the upgrade dramatically reduces the number of chip types. The system of the present invention is designed to scale up without special packaging, cooling, power, or environmental constraints.

【０１０５】Instead of keeping the processor, the memory, and the network functions separate, the brick technology of the present invention combines them, as memory with embedded processors, in the arrangement shown in FIG. 4. The arrangement is a card whose chips are pin compatible, at the connector level, with current 4-megabit DRAM cards. One such card can hold 32 chips; at the basic design point of 40 MIPS per chip, that gives 1280 MIPS. Four such cards provide 5 GIPS. The workstation configuration shown preferably has such a PE memory array, a cluster controller, and an IBM RISC System/6000 with enough performance to run and monitor array processor applications developed on the workstation.
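
The performance figures quoted in this description follow from simple multiplication. The sketch below recomputes them from the per-PME rate of 5 MIPS, eight PMEs per chip, and 32 chips per card, all taken from the text. The function names are hypothetical, and peak rates are assumed to add linearly.

```python
# Hedged recomputation of the quoted peak performance figures.
MIPS_PER_PME = 5      # from the text: 16-bit, 5-MIPS data flow
PMES_PER_CHIP = 8     # from the text
CHIPS_PER_CARD = 32   # from the text


def chip_mips():
    """Peak MIPS of one chip (8 PMEs)."""
    return MIPS_PER_PME * PMES_PER_CHIP


def card_mips(cards=1):
    """Peak MIPS of a number of 32-chip cards."""
    return cards * CHIPS_PER_CARD * chip_mips()


def system_mips(pme_count):
    """Peak MIPS of a system with the given number of PMEs."""
    return pme_count * MIPS_PER_PME


if __name__ == "__main__":
    print("chip:", chip_mips(), "MIPS")
    print("one card:", card_mips(1), "MIPS")
    print("four cards:", card_mips(4) / 1000, "GIPS")
    print("1024 PMEs:", system_mips(1024) / 1000, "GIPS")
    print("32768 PMEs:", system_mips(32768) / 1000, "GIPS")
```

Four cards hold 128 chips, or 1024 PMEs, so the four-card figure and the 1024-element figure are the same system; the 164-GIPS figure is the rounded value for 32768 PMEs.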

【０１０６】The processor portion can use a processor with very high gate efficiency. Such designs have been used in processors, but never within memory. The invention also provides the ability to mix MIMD and SIMD basic operations. The chip provides a "broadcast bus" that offers an alternative path to the instruction buffer of each CPU. The cluster controller of the present invention issues commands to each processing element in the PMEs. Storing these commands in the PME allows the operation of the processing element to be controlled in several modes. A PME need not store an entire program; it can store only the portions that apply to a given task at various times during the processing of an application.

【０１０７】Given the basic device, a single processor and memory combination could be developed. Alternatively, by using a simpler processor and a subset of the memory macros, the design could provide 2, 4, 8, or 16 replications of the PME. The PME can be simplified further by adjusting the data flow bandwidth or by substituting processor cycles for functional accelerators. For most embodiments, eight replications of the basic PME described above are preferred.

【０１０８】According to the inventors' application studies, the currently preferred approach is eight replications of a 16-bit-wide data flow with 32K words of memory. The reasons for this conclusion are as follows:
1. A 16-bit word permits instructions and addresses to be fetched in a single cycle.
2. An external port on each of the eight PMEs permits a 4-dimensional torus interconnection. Using 4 or 8 PMEs on each ring gives modules suited to the targeted range of system performance.
3. The eight external ports need about 50% of the chip pins, leaving enough pins for power, ground, and common control signals.
4. Eight processors implemented with 64 KB of main store each:
a. Allow a register-based architecture rather than a memory-mapped architecture.
b. Force some desirable but not required accelerators to be implemented by multiple processor cycles.

【０１０９】This last attribute is important because it lets the design take advantage of logic density as it develops. New accelerators of the present invention (for example, a floating-point arithmetic unit for the PME) are added as chip hardware without affecting system design, pins and cables, or application code.

【０１１０】The resulting chip layout and size (14.59 × 14.63 mm) are shown in FIG. 3, and FIG. 4 shows a cluster of such chips. As later figures show, the chip can be packaged in systems for standalone units, for workstations (placed beside the workstation host on a connecting bus), for AWAC applications, and for supercomputers. The chip technology offers several advantages at the system level. It permits development of a scalable MPP by basic replication of one part type. Two DRAM macros per processor provide enough storage for both data and program. An SRAM of equal size could consume ten times the power or more. This advantage permits use of the MIMD machine model rather than the more restrictive SIMD model typical of machines with single-chip processor/memory designs. The DRAM access time of 35 nanoseconds or less matches the expected processor cycle time. The CMOS logic provides the density that makes the PME very effective while dissipating only 1.3 W (total chip power is 1.3 + 0.9 (memory) = 2.2 W). These characteristics allow the chip to be used in military applications that require conduction cooling (air cooling in non-military applications is much easier). Air-cooled embodiments can, however, be used for workstations and other environments. A standalone processor can be configured with an 80 A, 5 V power supply.

【０１１１】The building blocks of the Advanced Parallel Array Processor (APAP) are shown in FIGS. 5 and 6. FIG. 5 is a functional block diagram of the APAP. Multiple application interfaces 150, 160, 170, 180 exist for the application processor 100 or processors 110, 120, 130. FIG. 6 shows the basic building blocks, which can be configured into various system block diagrams. In its maximum configuration the APAP can incorporate 32768 identical PMEs. The APAP consists of PME arrays 280, 290, 300, 310, an array director 250, and an application processor interface 260 for the application processor 200 or processors 220, 230. The array director 250 consists of three functional units: the application processor interface 260, the cluster synchronizer 270, and the cluster controller 270. The array director 250 performs the functions of the array controller of the inventors' earlier linear picket system for SIMD operation, with MIMD capability added. The cluster controller 270, together with a set of 64 array clusters 280, 290, 300, 310 (each a cluster of 512 PMEs), is the basic building block of the APAP computer system. The elements of the array director 250 permit systems to be configured with a wide range of cluster replications. This modularity, based on strict replication of both processing and control elements, is a unique characteristic of this massively parallel computer system. In addition, the application processor interface 260 supports a test/debug device 240, which performs important design, debug, and monitoring functions.

【０１１２】The controllers are assembled with a well-defined interface, such as the IBM Microchannel, that is used in other systems today, including controllers with i860 processors. Field-programmable gate arrays add functions to the controller that can be changed to meet the requirements of a particular configuration (the number of PMEs present, their couplings, and so on).

【０１１３】The PME arrays 280, 290, 300, 310 contain the functions needed to operate as either SIMD or MIMD devices. They also allow the complete set of PMEs to be divided into 1 to 256 distinct subsets. When the set is divided into subsets, the array director 250 interleaves between the subsets. The order of the interleave process and the degree of control exerted over each subset are program controlled. This ability to operate some subsets of the array in one mode, MIMD, with different programs, while other sets operate in tightly synchronized SIMD mode under the control of the array director, is an advance in the art. Several examples presented below illustrate the advantages of the concept.

【０１１４】Array architecture: The set of nodes forming the array is connected as an n-dimensional modified hypercube. In this interconnection scheme each node is directly connected to 2n other nodes. The connections can be simplex, half-duplex, or full-duplex paths. In any number of dimensions greater than three, the modified hypercube is a new concept in interconnection techniques (in two dimensions the modified hypercube yields a torus, and in three dimensions it yields an orthogonally connected lattice whose edge surfaces wrap around to the opposing surfaces).
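
The "2n direct neighbours" property can be illustrated with a short sketch. It assumes that a node is addressed by a tuple of n coordinates and that every dimension is a simple ring with wraparound; the function name `neighbors` and this addressing scheme are illustrative choices, not taken from the specification.

```python
# Hedged sketch: direct neighbours of a node in an n-dimensional
# modified hypercube, modelled here as a torus with simple rings.
def neighbors(node, sizes):
    """Return the 2n direct neighbours of `node`.

    node  -- tuple of n coordinates (hypothetical addressing scheme)
    sizes -- tuple of n ring lengths, one per dimension
    """
    if len(node) != len(sizes):
        raise ValueError("node and sizes must have the same dimension")
    result = []
    for dim, size in enumerate(sizes):
        for step in (1, -1):
            coords = list(node)
            # The modulo wraps the edge of the lattice to the opposite side.
            coords[dim] = (coords[dim] + step) % size
            result.append(tuple(coords))
    return result


if __name__ == "__main__":
    # Four dimensions with ring length 8, as in the FIG. 7 example.
    for n in neighbors((0, 0, 0, 7), (8, 8, 8, 8)):
        print(n)
```

With ring lengths of at least 3 the 2n neighbours are all distinct; shorter rings would make the +1 and -1 neighbours coincide.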

【０１１５】Describing the interconnection scheme for more than three dimensions requires an inductive description. A set of m_1 nodes can be interconnected as a ring (the ring could be "simply connected", "braided", "cross connected", "fully connected", and so on; anything other than a simple ring requires additional node ports, but the added complexity does not affect the modified hypercube structure). A set of m_2 rings can then be linked together by connecting each equivalent node across the m_2 rings. The result at this point is a torus. To construct an (i+1)-dimensional modified hypercube from an i-dimensional modified hypercube, take m_{i+1} sets of i-dimensional modified hypercubes and interconnect all of the equivalent m_i-level nodes into rings.
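
The inductive construction can be written out directly. The following sketch builds the edge set one dimension at a time, assuming simple rings only and ring lengths of at least 3; the names `extend`, `modified_hypercube`, and `degrees` are hypothetical.

```python
# Hedged sketch of the inductive construction, simple rings only.
from collections import Counter


def extend(nodes, edges, m):
    """Build the (i+1)-dimensional structure from m copies of an i-dimensional one."""
    new_nodes = [node + (c,) for c in range(m) for node in nodes]
    new_edges = set()
    # Step 1: each copy keeps the internal links of the i-dimensional structure.
    for c in range(m):
        for edge in edges:
            a, b = tuple(edge)
            new_edges.add(frozenset((a + (c,), b + (c,))))
    # Step 2: equivalent nodes of the m copies are linked into a ring.
    for node in nodes:
        for c in range(m):
            new_edges.add(frozenset((node + (c,), node + ((c + 1) % m,))))
    return new_nodes, new_edges


def modified_hypercube(sizes):
    """Return (nodes, edges) for ring lengths sizes = (m_1, ..., m_n), each >= 3."""
    m1 = sizes[0]
    nodes = [(k,) for k in range(m1)]
    edges = {frozenset(((k,), ((k + 1) % m1,))) for k in range(m1)}
    for m in sizes[1:]:
        nodes, edges = extend(nodes, edges, m)
    return nodes, edges


def degrees(edges):
    """Count the links attached to every node."""
    count = Counter()
    for edge in edges:
        for node in edge:
            count[node] += 1
    return count


if __name__ == "__main__":
    nodes, edges = modified_hypercube((8, 8, 8, 8))
    print(len(nodes), "nodes,", len(edges), "links")
    print("links per node:", set(degrees(edges).values()))
```

Each induction step adds exactly two links per node, so after n steps every node has 2n links, which matches the direct-connection count stated for the modified hypercube.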

【０１１６】FIG. 7 illustrates this process for a 4-dimensional modified hypercube using m_i = 8 (i = 1..4). Compare this description, given in terms of the node topology, with FIGS. 7, 10, 11, 17, and 18.

【０１１７】FIG. 7 shows the fine-grained parallel technology path from the single processor element 300, made up of 32K 16-bit words of memory and a 16-bit processor, to a network node 310 consisting of eight processors 312 with their associated memory 311, the fully distributed I/O routers 313 that accompany the memory, and signal I/O ports 314, 315, then through groups of nodes labeled clusters 320, and on to the cluster configuration 360 and to various applications 330, 340, 350, 370. The 2-dimensional level structure is the cluster 320, and 64 clusters are integrated to form the 4-dimensional modified hypercube 360 of 32768 processing elements.
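
The counts along this path can be checked with a few lines. The sketch assumes a ring length of 8 in every dimension, as in the FIG. 7 example, and assumes that two of the four dimensions lie inside a cluster while the other two link the clusters; that split is an interpretation of the text, and the names are hypothetical.

```python
# Hedged count of the FIG. 7 hierarchy: PME -> node -> cluster -> system.
RING_LENGTH = 8     # m_i = 8 for i = 1..4 (from the FIG. 7 example)
PMES_PER_NODE = 8   # from the text: eight processors per network node


def hierarchy(ring_length=RING_LENGTH, pmes_per_node=PMES_PER_NODE):
    """Return node, cluster, and PME counts for a 4-dimensional array."""
    nodes_per_cluster = ring_length ** 2   # assumption: 2 dimensions inside a cluster
    clusters = ring_length ** 2            # assumption: 2 dimensions across clusters
    return {
        "nodes_per_cluster": nodes_per_cluster,
        "pmes_per_cluster": nodes_per_cluster * pmes_per_node,
        "clusters": clusters,
        "nodes": nodes_per_cluster * clusters,
        "pmes": nodes_per_cluster * clusters * pmes_per_node,
    }


if __name__ == "__main__":
    for name, value in hierarchy().items():
        print(f"{name}: {value}")
```

Under these assumptions a cluster holds 64 nodes and 512 PMEs, and 64 clusters hold 32768 PMEs, which agrees with the cluster and system sizes quoted in the description.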

【０１１８】Preferred embodiment of the processor memory element (PME): As shown in FIGS. 3 and 12, the preferred APAP has a basic building block consisting of a one-chip node. Each node contains eight identical processor memory elements (PMEs) and one broadcast and control interface (BCI). Parts of the invention can be realized without all of the functions being on the same chip, but for performance and cost reduction it is important that the chip be formed as a one-chip node with eight PMEs, using the advanced technology described above, which can be implemented today.

【０１１９】ＰＭＥの好ましい実施態様は、６４ＫＢの
主記憶装置、８つのプログラム割込みレベルのそれぞれ
に関する１６個の１６ビット汎用レジスタ、全機能論理
演算機構（ＡＬＵ）、作業用レジスタ、状況レジスタ、
および４つのプログラマブル両方向入出力ポートを有す
る。さらに、この好ましい実施態様は、同報通信／制御
インタフェース（ＢＣＩ）によってＳＩＭＤモード同報
通信インタフェースを提供する。このインタフェース
は、外部制御装置（本発明の原出願と、クラスタを備え
たノーダル・アレイおよびシステムの現在好ましい実施
例についての説明を参照）が、ＰＭＥ命令解読、メモリ
・アドレス、およびＡＬＵデータ入力を駆動できるよう
にする。このチップは、その内部で複数の並列動作を実
行できるようにするマイクロコンピュータの機能を実行
でき、かつ複数ノードのシステム内で他のチップに結合
できる。その場合の結合方法は、相互接続ネットワー
ク、メッシュ・ネットワークまたはハイパーキューブ・
ネットワーク、先進的でスケーリング可能な、本発明の
好ましい実施例のいずれでもよい。The preferred implementation of a PME has 64 KB of main memory, 16 16-bit general-purpose registers for each of 8 program interrupt levels, a full-function arithmetic/logic unit (ALU), working registers, status registers, and four programmable bidirectional input/output ports. In addition, the preferred implementation provides a SIMD-mode broadcast interface through the broadcast/control interface (BCI). This interface allows an external controller (see the parent application of the present invention and the description of the currently preferred embodiment of the nodal array and system with clusters) to drive PME instruction decoding, memory addresses, and ALU data inputs. The chip can perform the functions of a microcomputer that carries out multiple parallel operations internally, and it can be coupled to other chips within a multi-node system. That coupling may be an interconnection network, a mesh or hypercube network, or the advanced, scalable preferred embodiment of the invention.

【０１２０】ＰＭＥは、スケーリング可能な本発明の好
ましい実施例では、一連のリングまたはトーラスとして
相互接続できる。適用例によっては、ノードをメッシュ
として相互接続することもできる。本発明の好ましい実
施例では、各ノードが、４つのトーラスのそれぞれにＰ
ＭＥを２個ずつ備えている。トーラスはＷ、Ｘ、Ｙ、お
よびＺ（図７参照）で示してある。図１２は、ノード内
でのＰＭＥの相互接続を示している。各トーラス内の２
個のＰＭＥは、その外部入出力ポート（＋Ｗ、−Ｗ、＋
Ｘ、−Ｘ、＋Ｙ、−Ｙ、＋Ｚ、−Ｚ）で指定してある。
ノード内には、４＋ｎ個および４−ｎ個のＰＭＥを相互
接続する２つのリングもある。これらの内部リングは、
メッセージを外部トーラス間で移動するための経路とし
て働く。本発明の好ましい実施例では、ＡＰＡＰを４次
元直交アレイにすることができるので、内部リングによ
りアレイ全体にわたってあらゆる次元でメッセージを移
動することが可能である。In the scalable preferred embodiment of the invention, the PMEs are interconnected as a series of rings or tori. In some applications the nodes can instead be interconnected as a mesh. In the preferred embodiment, each node contains two PMEs in each of four tori. The tori are denoted W, X, Y, and Z (see FIG. 7). FIG. 12 shows the interconnection of the PMEs within a node. The two PMEs in each torus are designated by their external input/output ports (+W, -W, +X, -X, +Y, -Y, +Z, -Z). Within the node there are also two rings, one interconnecting the four +n PMEs and one interconnecting the four -n PMEs. These internal rings serve as the path for moving messages between the external tori. Because the APAP in the preferred embodiment can be a four-dimensional orthogonal array, the internal rings allow messages to be moved throughout the array in every dimension.
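
The node layout described here (eight PMEs named after their external ports, joined by two internal rings) can be modeled in a few lines. This is a minimal sketch: the ring order W, X, Y, Z and the function names are assumptions made for illustration, since the text does not say in which order the PMEs sit on each internal ring.

```python
DIMENSIONS = ["W", "X", "Y", "Z"]  # the four tori named in the text

def build_node():
    """Return the eight PME labels and the two internal rings of one node."""
    plus_ring = [f"+{d}" for d in DIMENSIONS]   # ring of the four +n PMEs
    minus_ring = [f"-{d}" for d in DIMENSIONS]  # ring of the four -n PMEs
    return plus_ring + minus_ring, {"+": plus_ring, "-": minus_ring}

def ring_neighbors(ring, pme):
    """Previous and next PME on an internal ring (assumed ring order)."""
    i = ring.index(pme)
    return ring[(i - 1) % len(ring)], ring[(i + 1) % len(ring)]

def ring_hops(ring, src, dst):
    """Fewest internal-ring hops needed to move a message between two tori."""
    distance = abs(ring.index(src) - ring.index(dst))
    return min(distance, len(ring) - distance)

pmes, rings = build_node()
print(pmes)
print(ring_neighbors(rings["+"], "+W"))
print(ring_hops(rings["+"], "+W", "+Y"))
```

A message arriving on the W torus at the +W PME would, in this model, take two internal hops to reach the +Y PME and continue along the Y torus.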

【０１２１】ＰＭＥは、主記憶装置、ローカル記憶装
置、命令解読器、論理演算機構（ＡＬＵ）、作業用レジ
スタ、および入出力ポートを備える、自己完結型プログ
ラム記憶式マイクロコンピュータである。ＰＭＥは、Ｍ
ＩＭＤ動作では、それ自体の主記憶装置から格納されて
いる命令を取り出して実行し、ＳＩＭＤモードでは、Ｂ
ＣＩインタフェースを介してコマンドを取り出し実行す
る能力を有する。このインタフェースにより、複数のチ
ップから成るシステム内の、制御装置、ＰＭＥ、その他
のＰＭＥの間での相互通信が可能になる。The PME is a self-contained stored-program microcomputer comprising main memory, local storage, an instruction decoder, an arithmetic/logic unit (ALU), working registers, and input/output ports. In MIMD operation the PME fetches and executes stored instructions from its own main memory; in SIMD mode it fetches and executes commands received through the BCI interface. This interface permits intercommunication among the controller, the PME, and other PMEs in a system made up of multiple chips.

【０１２２】ＢＣＩは、外部アレイ制御装置要素および
アレイ・ディレクタへの、そのノードのインタフェース
である。ＢＣＩは、タイマやクロックなどの共通ノード
機能を提供する。また、各ノードＰＭＥごとの同報通信
機能のマスキングと、同報通信バスとＰＭＥ間のデータ
転送用の物理インタフェースおよびバッファリング、さ
らにシステム状況ならびにモニタ要素およびデバッグ要
素とのノーダル・インタフェースを提供する。The BCI is the node's interface to the external array controller element and to the array director. The BCI provides common node functions such as timers and clocks. It also provides broadcast function masking for each PME in the node, the physical interface and buffering for data transfers between the broadcast bus and the PMEs, and the nodal interface to system status, monitoring, and debug elements.

【０１２３】各ＰＭＥは、その各２地点間インタフェー
スおよび同報通信インタフェースをサポートする、別々
の割込みレベルを備えている。データは、直接メモリ・
アクセス（ＤＭＡ）制御機構の下で、ＰＭＥ主記憶装置
に入力され、あるいは該記憶装置から出力される。"ｉ
ｎｐｕｔｔｒａｎｓｆｅｒｃｏｍｐｌｅｔｅ"割込
みは、各インタフェースが、データの存在を伝える信号
をＰＭＥソフトウェアに送るのに使用できる。状況情報
は、ＰＭＥソフトウェアが、データ出力動作の完了を判
定するのに使用できる。Each PME has a separate interrupt level to support each of its point-to-point interfaces and the broadcast interface. Data is input to or output from PME main memory under direct memory access (DMA) control. An "input transfer complete" interrupt is available for each interface to signal the PME software that data is present. Status information is available for the PME software to determine the completion of data output operations.

【０１２４】各ＰＭＥには入出力の「回線交換モード」
があり、ＰＭＥ主記憶装置にデータを入力せずに、４つ
の入力ポートの１つを４つの出力ポートのいずれかに直
接切り換えることができる。「回線交換」の発信元およ
び宛先の選択は、ＰＭＥ上で実行されるソフトウェアの
制御に従う。他の３つの入力ポートは引き続きＰＭＥ主
記憶装置の諸機能にアクセスでき、４番目の入力は出力
ポートに切り換えられる。Each PME has a "circuit switched mode" of input/output in which one of its four input ports can be switched directly to any of its four output ports without the data entering PME main memory. Selection of the source and destination of the "circuit switch" is under the control of the software executing on the PME. The other three input ports continue to have access to PME main memory functions while the fourth input is switched to an output port.

【０１２５】もう１つの種類の入出力は、ＰＭＥすべて
に同報通信し、ＰＭＥすべてから収集しなければならな
いデータと、特殊すぎて標準バスに適合できないデータ
を有する。同報通信データには、ＳＩＭＤコマンド、Ｍ
ＩＭＤプログラム、およびＳＩＭＤデータが含まれる。
収集されるデータは主として、状況機能およびモニタ機
能である。診断機能およびテスト機能は特殊データ要素
である。各ノードは、組み込まれた１組のＰＭＥの他
に、ＢＣＩを１つ備えている。ＢＣＩセクションは動作
中、ＢＣＩを監視し、アドレスされるＰＭＥに同報通信
データを送り、該ＰＭＥから同報通信データを収集す
る。ＢＣＩは、エネーブル・マスクとアドレス指定タグ
を組み合わせて使用して、どの同報通信情報がどのＰＭ
Ｅを対象としているかを判定する。A further type of input/output covers data that must be broadcast to, or gathered from, all PMEs, as well as data that is too specialized to fit on the standard buses. Broadcast data includes SIMD commands, MIMD programs, and SIMD data. Gathered data is primarily status and monitor functions. Diagnostic and test functions are the specialized data elements. Each node has one BCI in addition to its set of embedded PMEs. During operation the BCI section monitors the broadcast interface, steers broadcast data to the addressed PMEs, and collects broadcast data from them. The BCI uses a combination of enable masks and addressing tags to determine which broadcast information is intended for which PMEs.
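
The combination of enable masks and addressing tags can be pictured as a simple filter over the PMEs of a node. The sketch below is purely illustrative: the text does not give the mask width, the tag format, or how a tag is matched, so the one-bit-per-PME mask, the per-PME tag list, and the names `BroadcastWord` and `selected_pmes` are all assumptions.

```python
from dataclasses import dataclass

PMES_PER_NODE = 8  # eight PMEs per node, as stated in the text

@dataclass
class BroadcastWord:
    tag: int   # hypothetical addressing tag carried with the broadcast word
    data: int  # 16-bit payload (SIMD command, MIMD program word, or data)

def selected_pmes(enable_mask, pme_tags, word):
    """Indices of the PMEs that should receive `word`.

    Assumed rule: a PME is selected when its bit in `enable_mask` is set
    and its assigned tag equals the tag carried by the broadcast word.
    """
    return [
        i for i in range(PMES_PER_NODE)
        if (enable_mask >> i) & 1 and pme_tags[i] == word.tag
    ]

# PMEs 0-3 enabled; tags split the node into two hypothetical groups.
tags = [1, 1, 2, 2, 1, 1, 2, 2]
word = BroadcastWord(tag=1, data=0x1234)
print(selected_pmes(0b00001111, tags, word))
```

With a scheme of this kind, different groups of PMEs could be loaded with different MIMD programs from the same broadcast bus.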

【０１２６】本発明の好ましい実施例では、各ＰＭＥは
ＳＩＭＤモードまたはＭＩＭＤモードで動作できる。Ｓ
ＩＭＤモードでは、各命令がＢＣＩを介して同報通信バ
スからＰＭＥに送られる。ＢＣＩは、選択されたすべて
のノードＰＭＥが該インタフェースを使用し終わるま
で、各同報通信データ・ワードをバッファし続ける。こ
の同期化によって、ＳＩＭＤコマンドの実行に関連する
データ・タイミング依存性に対処でき、非同期動作がＰ
ＭＥで実行できるようになる。ＭＩＭＤモードでは、各
ＰＭＥがそれ自体の主記憶装置からそれ自体のプログラ
ムを実行する。ＰＭＥは、初期設定ではＳＩＭＤモード
になる。ＭＩＭＤ動作の場合、外部制御装置は通常、Ｐ
ＭＥがＳＩＭＤモードのとき各要素にプログラムを同報
通信し、その後にＭＩＭＤモードに切り替えて実行を開
始するようＰＭＥに指令する。同報通信情報をマスクま
たはタグ付けすると、ＰＭＥの異なる組が異なるＭＩＭ
Ｄプログラムを含むか、ＰＭＥの特定の組が、他の組の
ＰＭＥがＳＩＭＤモードで実行する間にＭＩＭＤモード
で実行するか、あるいはその両方を行うことができるよ
うになる。各種のソフトウェア・クラスタまたは区画に
おいて、これらの機能はそれぞれ、他のクラスタまたは
区画での動作から独立して動作することができる。In the preferred embodiment of the invention, each PME can operate in SIMD or MIMD mode. In SIMD mode, each instruction is fed to the PME from the broadcast bus through the BCI. The BCI buffers each broadcast data word until all selected PMEs in the node have used it. This synchronization accommodates the data timing dependencies associated with executing SIMD commands and allows asynchronous operations to be performed by a PME. In MIMD mode, each PME executes its own program from its own main memory. The PMEs are initialized to SIMD mode. For MIMD operation, the external controller normally broadcasts the program to each PME while the PMEs are in SIMD mode and then commands the PMEs to switch to MIMD mode and begin executing. Masking or tagging the broadcast information allows different sets of PMEs to hold different MIMD programs, allows selected sets of PMEs to execute in MIMD mode while other sets execute in SIMD mode, or both. In the various software clusters or partitions, each of these functions can operate independently of the operations in other clusters or partitions.

【０１２７】ＰＭＥの命令セット・アーキテクチャ（Ｉ
ＳＡ）の動作は、ＰＭＥがＳＩＭＤモードであるかＭＩ
ＭＤモードであるかによってわずかに異なる。大部分の
ＩＳＡ命令は、モードとは無関係に同一の動作を実行す
る。しかし、ＰＭＥはＳＩＭＤモードでは分岐やその他
の制御機能を実行しないので、それらのＭＩＭＤ命令専
用の一部のコード点がＳＩＭＤモードで再解釈されて、
主記憶装置における同報通信データ値との一致データの
探索や、ＭＩＭＤモードへの切換えなどの特殊動作をＰ
ＭＥが実行できるようになる。そのため、アレイのシス
テム柔軟性がさらに拡大する。The operation of the PME instruction set architecture (ISA) differs slightly depending on whether the PME is in SIMD or MIMD mode. Most ISA instructions perform the same operation regardless of mode. However, because a PME in SIMD mode does not perform branching or other control functions, some of the code points dedicated to those MIMD instructions are reinterpreted in SIMD mode to let the PME perform special operations, such as searching main memory for data that matches a broadcast data value or switching to MIMD mode. This further extends the system flexibility of the array.

【０１２８】ＰＭＥアーキテクチャ：基本的には、本発
明の好ましいアーキテクチャは、１６ビット幅のデータ
・フロー、３２Ｋの１６ビット・メモリ、特殊入出力ポ
ートおよび入出力切換え経路と、本発明の命令セット・
アーキテクチャによって提供される１６ビット命令セッ
トを各ＰＭＥが取り出し、復号して、実行できるように
するのに必要な制御論理回路を有する、ＰＭＥを含んで
いる。好ましいＰＭＥは、仮想ルータの機能を実行し、
したがって処理機能とデータ・ルータ機能の両方を実行
する。このメモリ編成では、ＰＭＥ間でのメモリの相互
アドレス指定により、大規模ランダム・アクセス・メモ
リおよびＰＭＥ用の直接メモリへのアクセスが可能にな
る。個々のＰＭＥメモリはすべてローカル側とすること
もでき、プログラムによりローカル領域と共用大域領域
に分割することもできる。本明細書に記載する特殊制御
および機能を用いると、タスクの迅速な切換えと、各Ｐ
ＭＥ割込み実行レベルでのプログラム状態情報の保持が
可能になる。本発明によって提供される機能の一部は他
のプロセッサにも存在したが、大規模並列マシンでプロ
セッサ間入出力の管理に適用されている例は他にない。
その例として、メッセージ・ルータ機能のＰＭＥ自体へ
の統合がある。これにより、特殊ルータ・チップや、特
殊ＶＬＳＩルータの開発が不要になる。また、本発明で
は単一のチップ上に提供されている機能を、メタライゼ
ーション層などによって相互接続された複数のチップ上
に分配して、大規模並列マシンを改良することができる
ことに留意されたい。さらに、本発明のアーキテクチャ
は単一のノードから大規模並列スーパーコンピュータ・
レベルのマシンまでスケーリング可能なので、本発明の
概念の一部を様々なレベルで利用することが可能であ
る。たとえば、以下に示すとおり、本発明のＰＭＥデー
タ・フローは非常に強力であるが、スケーリング可能な
設計が有効になるように働く。PME Architecture: Basically, the preferred architecture of the invention comprises a PME that has a 16-bit-wide data flow, 32K of 16-bit memory, specialized input/output ports and input/output switching paths, and the control logic needed to let each PME fetch, decode, and execute the 16-bit instruction set provided by the instruction set architecture of the invention. The preferred PME performs the functions of a virtual router and thus performs both the processing function and the data router function. The memory organization allows, through cross-addressing of memory between PMEs, access to a large random access memory and to direct memory for the PMEs. Each PME's memory can be entirely local, or it can be divided under program control into a local region and a shared global region. The specialized controls and functions described here permit rapid task switching and the retention of program state information at each PME interrupt execution level. Some of the capabilities provided by the invention have existed in other processors, but no other massively parallel machine applies them to the management of interprocessor input/output. One example is the integration of the message router function into the PME itself. This eliminates the need for specialized router chips or the development of specialized VLSI routers. Note also that the functions provided on a single chip in this invention could be distributed over several chips interconnected by metallization layers or the like while still yielding an improved massively parallel machine. Furthermore, because the architecture scales from a single node up to massively parallel supercomputer-level machines, some of its concepts can be used at different levels. For example, as shown below, the PME data flow is very powerful, yet it operates so as to make the scalable design effective.

【０１２９】処理メモリ要素（ＰＭＥ）は、１つのノー
ドの複数のＰＭＥのそれぞれごとに、完全分散型アーキ
テクチャを形成する。あらゆるＰＭＥが、１６ビット・
データ・フローによる処理機能、６４ＫＢのローカル記
憶域、蓄積交換／回線交換論理回路、ＰＭＥ間通信、Ｓ
ＩＭＤ／ＭＩＭＤ切換え機能、プログラマブル経路指
定、および専用浮動小数点援助論理回路から構成されて
いる。これらの諸機能は、ＰＭＥによって独立に操作す
ることができ、また同一のチップ内で他のＰＭＥと統合
して、チップ交差遅延を最小限に抑えることができる。
図８および図９に、ＰＭＥデータ・フローを示す。ＰＭ
Ｅは、１６ビット幅のデータ・フロー４２５、４３５、
４４５、４５５、４６５、３２キロビット×１６ビット
・メモリ４２０、特殊入出力ポート４００、４１０、４
８０、４９０、および入出力切替え経路４２５と、１６
ビット縮小命令セットをＰＭＥが取り出し、復号して、
実行できるようにするのに必要な制御論理回路４３０、
４４０、４５０、４６０から構成されている。また、特
殊論理機能により、ＰＭＥは処理装置４６０としてもデ
ータ・ルータとしても実行できる。特殊制御機構４０
５、４０６、４０７、４０８および諸機能は、タスクを
迅速に切り替え、ＰＭＥの各割込み実行レベルでプログ
ラム情報命令を保持できるようにするために組み込まれ
ている。そのような機能は他のプロセッサにも組み込ま
れていたが、大規模並列マシンでプロセッサ間入出力の
管理に適用されている例は他にない。具体的に言うと、
それによって、特殊チップやＶＬＳＩ開発マクロなし
で、ルータ機能をＰＭＥに統合することが可能になる。The processing memory element (PME) forms a fully distributed architecture for each of the multiple PMEs of a node. Every PME consists of processing capability with a 16-bit data flow, 64 KB of local storage, store-and-forward/circuit-switch logic, PME-to-PME communication, SIMD/MIMD switching capability, programmable routing, and dedicated floating-point assist logic. These functions can be operated independently by the PME and can be integrated with other PMEs within the same chip to minimize chip-crossing delays. FIGS. 8 and 9 show the PME data flow. The PME consists of a 16-bit-wide data flow 425, 435, 445, 455, 465, a 32K by 16-bit memory 420, specialized input/output ports 400, 410, 480, 490 and input/output switching paths 425, and the control logic 430, 440, 450, 460 needed to let the PME fetch, decode, and execute the 16-bit reduced instruction set. In addition, special logic functions allow the PME to act both as a processing unit 460 and as a data router. Specialized controls 405, 406, 407, 408 and functions are incorporated to allow rapid task switching and the retention of program state information at each interrupt execution level of the PME. Such capabilities have been included in other processors, but no other massively parallel machine applies them to the management of interprocessor input/output. Specifically, they allow the router function to be integrated into the PME without requiring specialized chips or VLSI development macros.

【０１３０】１６ビット内部データ・フローおよび制
御：処理要素の内部データ・フローの重要な部分を図８
に示す。図８は、処理要素の内部データ・フローを示し
ている。この処理要素は、全１６ビット幅内部データ・
フロー４２５、４３５、４４５、４５５、４６５を有す
る。これらの内部データ・フローの重要な経路では、Ｏ
Ｐレジスタ４５０、Ｍレジスタ４４０、ＷＲレジスタ４
７０、プログラム・カウンタＰＣレジスタ４３０など１
２個のナノ秒ハード・レジスタを使用している。すべて
の動作において、これらのレジスタから完全分散型ＡＬ
Ｕ４６０および入出力ルータ・レジスタと、特殊制御機
構４０５、４０６、４０７、４０８にデータが流れる。
現在のＶＬＳＩ技術を用いると、プロセッサは２５ＭＨ
ｚでメモリ動作と命令ステップを実行でき、ＯＰレジス
タ４５０、Ｍレジスタ４４０、ＷＲレジスタ４７０、プ
ログラム・カウンタＰＣレジスタ４３０などの重要な要
素を１２ナノ秒ハード・レジスタで構築できる。他の必
要なレジスタは、メモリ位置にマップされる。16-Bit Internal Data Flow and Control: FIG. 8 shows the important parts of the internal data flow of the processing element. The processing element has a full 16-bit-wide internal data flow 425, 435, 445, 455, 465. The important paths of this internal data flow use 12-nanosecond hard registers such as the OP register 450, M register 440, WR register 470, and program counter (PC) register 430. In every operation, data flows from these registers to the fully distributed ALU 460, to the input/output router registers, and to the specialized controls 405, 406, 407, 408. With current VLSI technology the processor can execute memory operations and instruction steps at 25 MHz, and the important elements, such as the OP register 450, M register 440, WR register 470, and program counter (PC) register 430, can be built from 12-nanosecond hard registers. The other required registers are mapped to memory locations.

【０１３１】図９に示すように、ＰＭＥの内部データ・
フローは、２つのＤＲＡＭマクロの形の３２キロビット
×１６ビットの主記憶装置を有する。データ・フローの
残りの部分は、ＣＭＯＳゲート・アレイ・マクロから構
成されている。メモリはすべて、低出力ＣＭＯＳＤＲ
ＡＭ付着技術により論理回路と一体形成され、超大規模
集積ＰＭＥチップ・ノードを形成している。ノード・チ
ップの好ましい実施例では、ＰＭＥが８回複製される。
ＰＭＥデータ・フローは、１６ワード×１６ビット汎用
レジスタ・スタック、メモリ・アドレスをバッファする
ための多機能論理演算機構（ＡＬＵ）、作業用レジス
タ、メモリ出力レジスタ、ＡＬＵ出力レジスタ、演算／
コマンド入出力レジスタ、ならびにＡＬＵおよびレジス
タへの入力を選択するためのマルチプレクサから構成さ
れている。４ＭＢＤＲＡＭメモリと本発明の論理回路
に現行のＣＭＯＳＶＬＳＩ技術を使用すると、ＰＭＥ
は２５ＭＨｚで命令ステップを実行できるようになる。
本発明では、ＯＰレジスタ４５０、Ｍレジスタ４４０、
ＷＲレジスタ４７０、および汎用レジスタ・スタックを
１２個のナノ秒ハード・レジスタで形成する。他の必要
なレジスタは、ＰＭＥ内のメモリ位置にマップされる。As shown in FIG. 9, the internal data flow of the PME has its 32K by 16-bit main memory in the form of two DRAM macros. The remainder of the data flow consists of CMOS gate array macros. All of the memory is formed integrally with the logic using low-power CMOS DRAM deposition technology to form a very large scale integrated PME chip node. In the preferred embodiment of the node chip, the PME is replicated eight times. The PME data flow consists of a 16-word by 16-bit general-purpose register stack, a multifunction arithmetic/logic unit (ALU), working registers to buffer memory addresses, memory output registers, ALU output registers, operation/command input/output registers, and multiplexers to select the inputs to the ALU and registers. Using current CMOS VLSI technology for the 4-Mbit DRAM memory and the logic of the invention, the PME can execute instruction steps at 25 MHz. In the invention, the OP register 450, M register 440, WR register 470, and general-purpose register stack are built from 12-nanosecond hard registers. The other required registers are mapped to memory locations within the PME.
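
The memory and timing figures quoted for the PME and the node chip fit together arithmetically, as the following sketch shows. It uses only numbers from the text (32K words of 16 bits, eight PMEs per chip, 25 MHz, 12-nanosecond registers); the conclusion that the chip-level DRAM figure is best read as 4 megabits (512 KB) is an inference from those numbers rather than a statement in the source.

```python
WORDS_PER_PME = 32 * 1024   # 32K words of main memory per PME
BITS_PER_WORD = 16
PMES_PER_CHIP = 8           # the PME is replicated eight times per node chip
CLOCK_HZ = 25_000_000       # instruction steps execute at 25 MHz
REGISTER_NS = 12            # access time of the hard registers

bits_per_pme = WORDS_PER_PME * BITS_PER_WORD
bytes_per_pme = bits_per_pme // 8
bits_per_chip = bits_per_pme * PMES_PER_CHIP
cycle_ns = 1_000_000_000 // CLOCK_HZ

print("KB per PME:", bytes_per_pme // 1024)
print("Mbit per chip:", bits_per_chip // 2**20)
print("KB per chip:", bits_per_chip // 8 // 1024)
print("cycle time (ns):", cycle_ns)
print("register access fits in one cycle:", REGISTER_NS < cycle_ns)
```

Each PME therefore holds 64 KB, a node chip holds 4 Mbit (512 KB) of DRAM, and a 25 MHz step lasts 40 ns, comfortably longer than a 12 ns register access.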

【０１３２】ＰＭＥデータ・フローは、１６ビット整数
演算プロセッサとして設計されている。ｎ×１６ビット
浮動小数点演算（ｎ≧１）のサブルーチン・エミュレー
ションを最適化するため、特殊マルチプレクサ経路が追
加されている。この１６ビット・データ・フローによっ
て、浮動小数点演算の効果的なエミュレーションが可能
になる。データ・フロー内の特殊経路は、浮動小数点演
算が１０サイクルでできるようにするために組み込まれ
ている。ＩＳＡは、拡張（１６ビットより長い）オペラ
ンド演算用のサブルーチンを使用可能にする特殊コード
点を備えている。それ以後の浮動小数点性能は、固定し
た浮動小数点性能の約２０分の１である。この性能は、
他の大規模並列マシンに特有な、ＰＭＥを補助する特殊
浮動小数点チップが不要になるのに十分である。他のプ
ロセッサには、単一のプロセッサと同じチップ上に特殊
浮動小数点プロセッサを備えているものもある（図１参
照）。ＰＭＥを備えたチップ上で特殊浮動小数点ハード
ウェア・プロセッサを使用することもできるが、そうす
るには、現在のところ、好ましい実施例で必要なセルの
他に追加のセルが必要である。（なお、浮動小数点演算
については、上記でＩＥＥＥ標準の改良に関して参照し
た同時出願の"floating point Implementation on a SI
MD Machine"と題する米国特許出願を必要により参照さ
れたい。）The PME data flow is designed as a 16-bit integer arithmetic processor. Special multiplexer paths have been added to optimize subroutine emulation of n × 16-bit floating-point operations (n ≥ 1). The 16-bit data flow thus permits effective emulation of floating-point operations. Specialized paths within the data flow are included so that floating-point operations can be performed in 10 cycles. The ISA includes special code points that enable subroutines for extended (longer than 16 bits) operand operations. The resulting floating-point performance is approximately one-twentieth of the fixed-point performance. This performance is sufficient to eliminate the need for the special floating-point chips that augment the PMEs in other massively parallel machines. Some other processors include a special floating-point processor on the same chip as a single processor (see FIG. 1). A special floating-point hardware processor could be used on the chip with the PMEs, but at present this would require cells in addition to those needed for the preferred embodiment. (For floating-point operations, see, if needed, the concurrently filed U.S. patent application entitled "floating point Implementation on a SIMD Machine", referenced above in connection with improvements to the IEEE standard.)
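
To illustrate what an extended (longer than 16 bits) operand operation involves on a 16-bit data flow, the sketch below adds two n × 16-bit integers one word at a time, propagating the carry. It is a generic multi-word addition written for illustration, not the subroutine or code points of the PME ISA, and the function name and little-endian word order are assumptions.

```python
MASK16 = 0xFFFF  # width of the PME data flow

def add_extended(a_words, b_words):
    """Add two equal-length little-endian lists of 16-bit words.

    Returns (sum_words, carry_out). Each loop iteration corresponds to one
    16-bit add-with-carry step on a 16-bit data path.
    """
    if len(a_words) != len(b_words):
        raise ValueError("operands must have the same number of words")
    result, carry = [], 0
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        result.append(total & MASK16)  # low 16 bits stay in this word
        carry = total >> 16            # overflow moves to the next word
    return result, carry

# 0x0001FFFF + 0x00000001 = 0x00020000, as two 16-bit words each.
print(add_extended([0xFFFF, 0x0001], [0x0001, 0x0000]))
```

Mantissa arithmetic in an emulated floating-point operation would be built from word-serial steps of this general kind, which is why the per-operation cost grows with n.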

【０１３３】本発明で開発した手法は、ＶＬＳＩ技術性
能の通常の増大をそのまま利用することができる。回路
の小型化が進み、パッケージ密度が増してくると、現在
メモリにマップされている基底レジスタやインデックス
・レジスタなどのデータ要素をハードウェアに移すこと
が可能になる。同様に、ハードウェアを増設することに
よって浮動小数点サブステップが加速される。これは、
信頼できる密度レベルが高くなるので、開発中のＣＭＯ
ＳＤＲＡＭ技術にとって好ましい。非常に重要なこと
であるが、このハードウェア手法はソフトウェアに影響
を与えない。The approach developed here can take direct advantage of the normal increases in VLSI technology performance. As circuits become smaller and packaging becomes denser, data elements that are now mapped to memory, such as base and index registers, can be moved into hardware. Likewise, floating-point substeps are accelerated by additional hardware. This suits the developing CMOS DRAM technology as reliable density levels increase. Very importantly, this hardware approach has no effect on software.

【０１３４】ＰＭＥは、割込みが禁止されたＳＩＭＤモ
ードに初期設定される。コマンドは、ＢＣＩからＰＭＥ
命令解読バッファに送られる。命令動作が完了するたび
に、ＰＭＥはＢＣＩに新規のコマンドを要求する。同様
に、命令実行サイクルの適切な時点で、ＢＣＩに即値デ
ータが要求される。ＩＳＡの大部分の命令は、ＰＭＥが
ＳＩＭＤモードであろうとＭＩＭＤモードであろうと同
じ動作を実行する。ただし、ＳＩＭＤ命令および即値デ
ータをＢＣＩから取り出す場合はこの限りでない。ＭＩ
ＭＤモードでは、ＰＭＥはプログラム・カウンタ（Ｐ
Ｃ）を維持し、それをそれ自体のメモリ内のアドレスと
して使用して１６ビット命令を取り出す。プログラム・
カウンタに明示的にアドレスする"Ｂｒａｎｃｈ"などの
命令は、ＳＩＭＤモードでは意味がなく、それらのコー
ド点の一部は再解釈されて、即値データと主記憶装置の
領域の比較などの特殊ＳＩＭＤ機能が実行される。The PME is initialized to SIMD mode with interrupts disabled. Commands are fed from the BCI into the PME instruction decode buffer. Each time an instruction operation completes, the PME requests a new command from the BCI. Similarly, immediate data is requested from the BCI at the appropriate point in the instruction execution cycle. Most instructions of the ISA operate identically whether the PME is in SIMD or MIMD mode, except that SIMD instructions and immediate data are taken from the BCI. In MIMD mode, the PME maintains a program counter (PC) and uses it as the address within its own memory from which to fetch a 16-bit instruction. Instructions such as "Branch", which explicitly address the program counter, have no meaning in SIMD mode, so some of their code points are reinterpreted to perform special SIMD functions such as comparing immediate data against an area of main memory.

【０１３５】ＰＭＥ命令解読論理回路により、ＳＩＭＤ
動作モードまたはＭＩＭＤ動作モードのどちらかが使用
可能になり、ＰＭＥはモード間を動的に移行できる。Ｓ
ＩＭＤモードでは、ＰＭＥが解読済み命令情報を受け取
り、次のクロック・サイクルでそのデータを実行する。
ＭＩＭＤモードでは、ＰＭＥがプログラム・カウンタ
（ＰＣ）アドレスを維持し、それをそれ自体のメモリ内
のアドレスとして使って１６ビット命令を取り出す。命
令の解読および実行は、他の大部分のＲＩＳＣ型マシン
と同様に進行する。ＳＩＭＤモードのＰＭＥは、解読分
岐を表す情報を与えられるとＭＩＭＤモードに入る。Ｍ
ＩＭＤモードのＰＭＥは、移行用の特定の命令を実行す
るとＳＩＭＤモードになる。The PME instruction decode logic permits either the SIMD or the MIMD operating mode, and the PME can transition dynamically between the two. In SIMD mode the PME receives decoded instruction information and executes it in the next clock cycle. In MIMD mode the PME maintains a program counter (PC) address and uses it as the address within its own memory from which to fetch a 16-bit instruction. Instruction decoding and execution then proceed as in most other RISC-type machines. A PME in SIMD mode enters MIMD mode when it is given information that represents a decoded branch. A PME in MIMD mode enters SIMD mode when it executes a specific instruction for the transition.

【０１３６】ＰＭＥがＳＩＭＤモードとＭＩＭＤモード
との動的移行を行う際には、ＳＩＭＤ"ｗｒｉｔｅｃ
ｏｎｔｒｏｌｒｅｇｉｓｔｅｒ"（制御レジスタ読取
り）命令を実行するとＭＩＭＤモードに入り、当該の制
御ビットが"１"に設定される。ＳＩＭＤ命令が完了する
と、ＰＭＥはＭＩＭＤモードに入り、割込みを可能に
し、その汎用レジスタＲ０で指定された主記憶装置位置
からそのＭＩＭＤ命令を取り出して実行を開始する。Ｍ
ＩＭＤ制御ビット設定時の割込みマスクの状態に応じ
て、割込みがマスクされあるいはマスク解除される。Ｐ
ＭＥは、外部から初期設定されるか、あるいはＭＩＭ
Ｄ"ｗｒｉｔｅｃｏｎｔｒｏｌｒｅｇｉｓｔｅｒ"
（制御レジスタ書込み）命令を実行して当該の制御ビッ
トが０に設定されると、ＳＩＭＤモードに戻る。When the PME transitions dynamically between SIMD and MIMD modes, MIMD mode is entered by executing a SIMD "write control register" instruction with the appropriate control bit set to "1". When that SIMD instruction completes, the PME enters MIMD mode, enables interrupts, and begins fetching and executing its MIMD instructions from the main memory location specified by its general-purpose register R0. Interrupts are masked or unmasked according to the state of the interrupt masks at the time the MIMD control bit is set. The PME returns to SIMD mode either when it is initialized externally or when it executes a MIMD "write control register" instruction with the appropriate control bit set to 0.
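
The mode transitions described in this paragraph amount to a two-state machine, sketched below. The class and method names are hypothetical, interrupt masking is omitted, and the choice to leave the interrupt-enable flag unchanged on the return to SIMD mode is an assumption, since the text does not say what happens to it.

```python
from enum import Enum

class Mode(Enum):
    SIMD = 0
    MIMD = 1

class PmeModeControl:
    """Toy model of the SIMD/MIMD control bit of one PME."""

    def __init__(self):
        self.external_reset()

    def external_reset(self):
        """External initialization: SIMD mode with interrupts disabled."""
        self.mode = Mode.SIMD
        self.interrupts_enabled = False
        self.pc = None

    def write_control_register(self, mimd_bit, r0=0):
        """Model the "write control register" instruction in either mode."""
        if self.mode is Mode.SIMD and mimd_bit == 1:
            self.mode = Mode.MIMD
            self.interrupts_enabled = True
            self.pc = r0          # MIMD fetch starts at the address in R0
        elif self.mode is Mode.MIMD and mimd_bit == 0:
            self.mode = Mode.SIMD
            self.pc = None        # assumption: no local PC is used in SIMD mode

pme = PmeModeControl()
pme.write_control_register(1, r0=0x0240)
print(pme.mode, hex(pme.pc), pme.interrupts_enabled)
pme.write_control_register(0)
print(pme.mode)
```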

【０１３７】データ通信経路および制御：図８に戻る
と、各ＰＭＥは、オンチップ通信用の３つの入力ポート
４００および３つの出力ポート４８０と、オフチップ通
信用の１つの入出力ポート４１０、４９０を有する。こ
の概念以外の既存の技術では、オフチップ・ポートをバ
イト幅半２重式にする必要がある。入力ポートは、デー
タが入力からメモリに、あるいは入力ＡＲレジスタ４０
５から直接１６ビット・データ経路４２５を介して出力
レジスタ４０８に経路指定できるように接続される。メ
モリは、ＰＭＥ宛のメッセージまたは「蓄積交換」モー
ドで移動されたメッセージのデータ・シンクとなる。特
定のＰＭＥ宛でないメッセージは、所望の出力ポートに
直接送られ、ブロッキングが発生していないときは「回
線交換」モードを開始する。ＰＭＥソフトウェアは、経
路指定の実行と選択された伝送モードの決定を担当す
る。これにより、「回線交換」モードと「蓄積交換」モ
ードの間の動的選択が可能になる。これは、ＰＭＥ設計
のもう１つの独特な特徴である。Data Communication Paths and Control: Returning to FIG. 8, each PME has three input ports 400 and three output ports 480 for on-chip communication, plus one input/output port 410, 490 for off-chip communication. With existing technology, as opposed to the concept itself, the off-chip port must be byte-wide and half-duplex. The input ports are connected so that data can be routed from an input to memory, or from the input AR register 405 directly to the output register 408 over the 16-bit data path 425. Memory is the data sink for messages addressed to the PME and for messages moved in "store and forward" mode. Messages not addressed to the particular PME are sent directly to the required output port, providing a "circuit switched" mode when no blocking has occurred. The PME software is responsible for performing the routing and for determining the transmission mode selected. This makes dynamic selection between "circuit switched" and "store and forward" modes possible, which is another unique feature of the PME design.

【０１３８】このように、本発明の好ましいノードは８
個のＰＭＥを有し、各ＰＭＥは４つの出力ポート（左、
右、垂直、および外部）を有する。入力ポートのうち３
つと出力ポートのうち３つは、チップ上の他のＰＭＥへ
の１６ビット幅全２重２地点間接続である。４番目のポ
ートは、好ましい実施例では、組み合わされて、オフチ
ップＰＭＥへの半２重２地点間接続を提供する。本発明
で低密度ＣＭＯＳを利用するために課されるピンおよび
電源の制約により、実際のオフチップ・インタフェース
は、ＰＭＥ間データ・ワードのハーフワード２つを多重
化するのに使用されるバイト幅経路である。モード間リ
ングを動的、一時的、かつ論理的に破壊し、データをア
レイに入れあるいはアレイから出すことを可能にする、
特殊「ジッパ」回路を用いる場合、これらの外部ＰＭＥ
ポートは、ＡＰＡＰ外部入出力アレイ機能を提供する。Thus the preferred node has eight PMEs, and each PME has four output ports (left, right, vertical, and external). Three of the input ports and three of the output ports are 16-bit-wide full-duplex point-to-point connections to other PMEs on the chip. The fourth ports are combined in the preferred embodiment to provide a half-duplex point-to-point connection to an off-chip PME. Because of the pin and power constraints imposed by the use of low-density CMOS, the actual off-chip interface is a byte-wide path used to multiplex the two halfwords of an inter-PME data word. With a special "zipper" circuit that dynamically, temporarily, and logically breaks the inter-node rings so that data can enter or leave the array, these external PME ports provide the APAP external input/output array function.

【０１３９】ＰＭＥメモリに経路指定されるデータにつ
いては、ＰＭＥ命令ストリームがメッセージの始めと終
りだけ入出力処理に関与すればよいように、正規ＤＭＡ
がサポートされる。最後に、内部出力ポートに回線交換
されるデータは、クロッキングなしで転送される。この
ため、チップ内での単一サイクルのデータ転送が可能に
なり、もっとも高速であるが依然として確実な通信を行
うことができるチップ交差がいつ発生するかが検出され
る。高速転送には順方向データ経路と逆方向制御経路が
使用され、転送はすべて透過モードで行われる。要する
に、発信元は、ＤＭＡまたはオンチップ転送を実行する
ＰＭＥから肯定応答を受けるまでに、複数の段階を経
る。For data routed into PME memory, normal DMA is supported so that the PME instruction stream needs to be involved in input/output processing only at the beginning and end of a message. Finally, data that is circuit switched to an internal output port is forwarded without clocking. This permits single-cycle data transfers within a chip, and it means that chip crossings are detected so that the fastest possible yet still reliable communication can take place. Fast forwarding uses forward data paths and backward control paths, all operating in transparent mode. In essence, the source goes through several stages before it receives the acknowledgment from the PME performing the DMA or on-chip transfer.

【０１４０】図８および図９から分かるように、ＰＭＥ
入力ポート上のデータは、ローカルＰＭＥ宛、またはリ
ングをさらに下ったＰＭＥ宛にすることができる。リン
グをさらに下ったＰＭＥ宛のデータを、ローカルＰＭＥ
主記憶装置に格納した後、ローカルＰＭＥからターゲッ
トＰＭＥに向かって転送する（蓄積交換）ことも、ロー
カル入力ポートを特定のローカル出力ポートに論理的に
接続して（回線交換）、データがローカルＰＭＥを「透
過的に」通過してターゲットＰＭＥに向うようにするこ
ともできる。ローカルＰＭＥソフトウェアが、４つの入
力および４つの出力のいずれについてもローカルＰＭＥ
を「蓄積交換」モードと「回線交換」モードのどちらに
するかを動的に制御する。回線交換モードでは、ＰＭＥ
が、回線交換と関連付けられた入出力を除くすべての機
能を同時に処理する。蓄積交換モードでは、ＰＭＥが他
のすべての処理機能を中断して、入出力転送プロセスを
開始する。As FIGS. 8 and 9 show, data on a PME input port may be destined for the local PME or for a PME further down the ring. Data destined for a PME further down the ring may be stored in local PME main memory and then forwarded by the local PME toward the target PME (store and forward), or the local input port may be logically connected to a particular local output port (circuit switched) so that the data passes "transparently" through the local PME on its way to the target PME. Local PME software dynamically controls whether the local PME is in "store and forward" mode or "circuit switched" mode for each of the four inputs and four outputs. In circuit switched mode, the PME concurrently processes all functions except the input/output associated with the circuit switch. In store and forward mode, the PME suspends all other processing functions to begin the input/output forwarding process.
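
The per-port choice between the two modes can be sketched as a small routing table held by each PME. This is an illustrative model only: the class and method names are hypothetical, the port names reuse the labels given for the output ports (left, right, vertical, external) on the assumption that the inputs are named the same way, and DMA, interrupts, and blocking are not modeled.

```python
PORTS = ("left", "right", "vertical", "external")

class PmePorts:
    """Toy model of one PME's four input and four output ports."""

    def __init__(self):
        self.circuit = {}  # input port -> output port, set by PME software
        self.memory = []   # stands in for PME main memory

    def circuit_switch(self, in_port, out_port):
        """Logically connect an input port straight to an output port."""
        if in_port not in PORTS or out_port not in PORTS:
            raise ValueError("unknown port")
        if out_port in self.circuit.values():
            raise ValueError("output port is already circuit switched")
        self.circuit[in_port] = out_port

    def store_and_forward(self, in_port):
        """Return an input port to store-and-forward handling."""
        self.circuit.pop(in_port, None)

    def receive(self, in_port, word):
        """Pass the word through transparently, or buffer it in memory.

        Returns (out_port, word) for a circuit-switched input, else None.
        """
        if in_port in self.circuit:
            return self.circuit[in_port], word
        self.memory.append((in_port, word))
        return None

pme = PmePorts()
pme.circuit_switch("left", "right")
print(pme.receive("left", 0x0007))      # passes straight through
print(pme.receive("vertical", 0x0009))  # buffered in main memory
print(pme.memory)
```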

【０１４１】データは、（外部制御装置により）アレイ
の外部の共用メモリまたはＤＡＳＤに格納できるが、Ｐ
ＭＥが提供するメモリのどこかに格納することもでき
る。ローカルＰＭＥ宛の入力データ、または「蓄積交
換」動作中にローカルＰＭＥにバッファされた入力デー
タは、各入力ポートと結合された直接メモリ・アクセス
（アドレス）機構を介してローカルＰＭＥ主記憶装置に
格納される。プログラム割込みによって、ＰＭＥ主記憶
装置にメッセージがロードされたことを示すことができ
る。ローカルＰＭＥプログラムは、ヘッダ・データを解
釈して、ローカルＰＭＥ宛のデータが別のＰＭＥへの回
線交換経路の設定に使用できる制御メッセージであるか
否か、あるいは別のＰＭＥに転送するメッセージである
か否かを判定する。回線交換経路は、ローカルＰＭＥソ
フトウェアによって制御される。回線交換経路は、介在
する緩衝記憶装置を通過せずに、ＰＭＥ入力経路を出力
経路と論理的に直接結合する。同一のチップ上のＰＭＥ
間の出力経路には介在する緩衝記憶装置がないので、デ
ータを単一のクロック・サイクルで、チップに入れ、チ
ップ上の多数のＰＭＥを通過させ、ターゲットＰＭＥの
主記憶装置にロードすることができる。中間に緩衝記憶
装置が必要なのは、回線交換結合がチップから離れると
きだけである。このため、ＡＰＡＰアレイの有効直径が
非バッファ回線交換経路の数だけ減少する。その結果、
経路内にあるＰＭＥの数とは無関係に、介在するチップ
と同数の少数のクロック・サイクルでＰＭＥからターゲ
ットＰＭＥにデータを送ることができる。この種の経路
指定は、各ノード・サイクルでデータを次のノードに転
送するのに数サイクル必要な交換環境と比較することが
できる。本発明のノードはそれぞれ８個のＰＭＥを持
つ。Data may be stored in shared memory or DASD external to the array (through an external controller), but it may also be stored anywhere in the memory provided by the PMEs. Input data destined for the local PME, or buffered in the local PME during a "store and forward" operation, is placed in local PME main memory through the direct memory access (address) mechanism associated with each input port. A program interrupt is available to indicate that a message has been loaded into PME main memory. The local PME program interprets the header data to determine whether the data destined for the local PME is a control message that can be used to set up a circuit-switched path to another PME, or a message to be forwarded to another PME. Circuit-switched paths are controlled by local PME software. A circuit-switched path logically couples a PME input path directly to an output path without passing through any intervening buffer storage. Because the output paths between PMEs on the same chip have no intervening buffer storage, data can enter a chip, pass through a number of PMEs on the chip, and be loaded into the main memory of a target PME in a single clock cycle. Intermediate buffer storage is needed only when a circuit-switched connection leaves the chip. This reduces the effective diameter of the APAP array by the number of unbuffered circuit-switched paths. As a result, data can be sent from one PME to a target PME in as few clock cycles as there are intervening chips, regardless of the number of PMEs in the path. This kind of routing can be compared with a switched environment in which several cycles are needed at each node to pass data on to the next node. Each node of the invention has eight PMEs.
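
The latency argument can be made concrete with a simple cost model. The sketch below assumes one clock cycle per chip traversed on a circuit-switched path (buffering only at chip crossings) and compares it with a hypothetical network that spends a fixed number of cycles at every PME hop. Both cost models, the function names, and the example path are assumptions for illustration; the text gives no cycle counts beyond the single-cycle on-chip transfer.

```python
def chip_crossings(path_chips):
    """Count transitions between different chips along a path of PMEs.

    `path_chips` lists the chip id of each PME on the path, source first.
    """
    return sum(1 for a, b in zip(path_chips, path_chips[1:]) if a != b)

def circuit_switched_cycles(path_chips):
    """Assumed cost: one clock per chip the data passes through."""
    return chip_crossings(path_chips) + 1

def per_hop_cycles(path_chips, cycles_per_hop):
    """Assumed cost for a network that spends cycles at every PME hop."""
    return (len(path_chips) - 1) * cycles_per_hop

# Eight PMEs on the path, spread over three chips.
path = [0, 0, 0, 1, 1, 2, 2, 2]
print(circuit_switched_cycles(path))
print(per_hop_cycles(path, cycles_per_hop=3))
```

In this model the circuit-switched cost depends only on the number of chips on the path, not on the number of PMEs, which is the sense in which the effective diameter of the array is reduced.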

【０１４２】メモリおよび割込みレベル：ＰＭＥは、メ
モリ４２０に３２キロビット×１６ビット・ワードを格
納する。この記憶域は完全に汎用であり、データとプロ
グラムの両方を入れることができる。ＳＩＭＤ動作で
は、メモリすべてをデータとすることができる。これ
は、他のＳＩＭＤ大規模並列マシンで特徴的である。Ｍ
ＩＭＤモードでは、メモリはまったく通常どおりである
が、大部分の大規模並列ＭＩＭＤマシンと異なり、ＰＭ
Ｅと同じチップ上にあるため、ただちに使用可能であ
る。このため、他の大規模並列ＭＩＭＤマシンに特有の
キャッシュ動作およびキャッシュ・コヒーレンシ技術は
不要である。インモス社のチップの場合、チップ上に常
駐するのは４Ｋだけであり、外部メモリ・インタフェー
ス・バスおよびピンが必要である。本発明ではこれらは
不要となる。Memory and Interrupt Levels: The PME holds 32K 16-bit words in memory 420. This storage is completely general and can hold both data and programs. In SIMD operation all of memory can be data, as is characteristic of other SIMD massively parallel machines. In MIMD mode the memory is quite conventional, but unlike most massively parallel MIMD machines it is on the same chip as the PME and is therefore immediately available. This eliminates the caching and cache coherency techniques characteristic of other massively parallel MIMD machines. In the case of the Inmos chip, only 4K resides on the chip, and an external memory interface bus and pins are required. The present invention eliminates these.

【０１４３】最低位記憶域位置は、各割込みレベル用の
１組の汎用レジスタを設けるために使用される。ＰＭＥ
用に開発された特定のＩＳＡは、これらのレジスタ参照
に短いアドレス・フィールドを使用する。割込みは、処
理、入出力活動、およびソフトウェア指定機能の管理に
使用される（すなわち、ＰＭＥは、通常の処理中に着信
入出力が開始したとき、割込みレベルに切り替わる）。
割込みレベルがマスクされていない場合、レジスタが最
低位メモリの新規セクションからアクセスされるように
ハードウェアのポインタを変更し、単一のＰＣ値をスワ
ップすることにより、この切替えが実行される。この技
術では、高速レベル切替えが可能であり、ソフトウェア
は通常のレジスタ・セーブ動作を回避するとともに、割
込みレベル・レジスタ内に状況をセーブすることができ
る。The lowest storage locations are used to provide a set of general purpose registers for each interrupt level. The particular ISA developed for the PME uses short address fields for these register references. Interrupts are used to manage processing, I/O activity, and software-specified functions (that is, the PME switches to an interrupt level when incoming I/O starts during normal processing). If the interrupt level is not masked, the switch is made by changing a hardware pointer so that registers are accessed from a new section of low memory, and by swapping a single PC value. This technique allows fast level switching and lets software avoid the usual register save operations while keeping status in the interrupt-level registers.

【０１４４】ＰＭＥプロセッサは、８つのプログラム割
込みレベルのうちの１つに作用する。メモリのアドレス
指定により、メモリの下位５７６ワードを割込みの８つ
のレベルに区分できる。このメモリの５７６ワードのう
ちの６４ワードは、８つのレベルのいずれかで実行中の
プログラムによって直接アドレス可能である。他の５１
２ワードは、８つの６４ワード・セグメントに区分され
る。各６４ワード・セグメントに直接アクセスできるの
は、それと関連する割込みレベルで実行中のプログラム
だけである。直接アドレス指定技術を使用することによ
り、すべてのプログラムが、ＰＭＥメモリの全３２Ｋワ
ードにアクセスできるようになる。The PME processor operates at one of eight program interrupt levels. Memory addressing partitions the lower 576 words of memory among the eight interrupt levels. Sixty-four of these 576 words are directly addressable by programs executing at any of the eight levels. The other 512 words are divided into eight 64-word segments. Each 64-word segment is directly accessible only by programs executing at its associated interrupt level. The use of direct addressing techniques allows all programs to access all 32K words of PME memory.
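The partitioning of low memory into per-level register banks can be illustrated with a short sketch. The text fixes only the sizes (64 shared words plus eight 64-word segments, 576 words in total). The ordering of the regions, the 7-bit register field, and the names `register_address`, `SHARED_WORDS`, and `SEGMENT_WORDS` are assumptions made purely for illustration.

```python
SHARED_WORDS = 64     # words addressable from every interrupt level
SEGMENT_WORDS = 64    # private words per interrupt level
LEVELS = 8
LOW_MEMORY_WORDS = SHARED_WORDS + LEVELS * SEGMENT_WORDS  # 576 words in total


def register_address(level, field):
    """Map a short register field to a low-memory word address.

    Assumed layout (not stated in the text): the shared words come first,
    followed by the eight private segments in level order.
    """
    if not 0 <= level < LEVELS:
        raise ValueError("interrupt level must be 0..7")
    if not 0 <= field < SHARED_WORDS + SEGMENT_WORDS:
        raise ValueError("register field must be 0..127")
    if field < SHARED_WORDS:
        return field  # same word regardless of the current level
    # Switching level only changes this base, which plays the role of the
    # hardware pointer: no register contents have to be saved or restored.
    base = SHARED_WORDS + level * SEGMENT_WORDS
    return base + (field - SHARED_WORDS)
```

Under these assumptions a level switch costs one pointer change, which is the property the text attributes to the scheme.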

【０１４５】割込みレベルは、入力ポート、ＢＣＩ、お
よびエラー処理機構に割り当てられる。「通常」レベル
があるが、「特権」レベルも「スーパバイザ」レベルも
ない。プログラム割込みにより、文脈の切替えが行われ
て、ＰＣプログラム・カウンタ、状況／制御レジスタ、
および特定の汎用レジスタの内容が、指定された主記憶
装置位置に格納され、これらのレジスタの新しい値が、
他の指定された主記憶装置位置から取り出される。Interrupt levels are assigned to the input ports, the BCI, and error handling. There is a "normal" level, but no "privileged" or "supervisor" level. A program interrupt causes a context switch in which the contents of the PC (program counter), the status/control register, and selected general purpose registers are stored in specified main storage locations, and new values for these registers are fetched from other specified main storage locations.

【０１４６】図８および図９を参照して説明したＰＭＥ
データ・フローは、以下の数節を参照して拡張すること
ができる。複合システムでは、ＰＭＥデータ・フロー
が、アレイ・ノードとしてのチップと、メモリ、プロセ
ッサ、および入出力機構の組合せを使用する。入出力機
構は、本発明のＡＰＡＰで構築されたＭＭＰの基本的構
成単位として複製されるＢＣＩを使ってメッセージをや
り取りする。ＭＭＰは多数のワード長を処理することが
できる。The PME data flow described with reference to FIGS. 8 and 9 can be elaborated with reference to the following sections. In a complex system, the PME data flow uses a combination of the chip as an array node with memory, processor, and I/O facilities. The I/O facilities send and receive messages using the BCI, which is replicated as a basic building block of an MMP built with the APAP of the present invention. The MMP can handle many word lengths.

【０１４７】ＰＭＥ複数倍長データ・フロー処理：本明
細書に記載するシステムは、ＰＭＥ内の１６ビット幅の
データ・フローにより、現行のプロセッサで処理される
演算を実行することができる。そうするために、１６ビ
ットの倍数であるデータ長に対して演算を実行する。そ
のために、１６ビットの断片として演算を行う。各断片
の結果を知っていなければならない場合がある（すなわ
ち、結果が０だったか、合計の上位ビットが繰り上げら
れたか）。PME Multiple-Length Data Flow Processing: The system described herein can perform the operations handled by current processors using the 16-bit-wide data flow within the PME. To do so, it operates on data lengths that are multiples of 16 bits, performing the operation in 16-bit pieces. The result of each piece may need to be known (that is, whether the result was zero, or whether there was a carry out of the high-order bit of the sum).

【０１４８】データ・フローの例として、４８ビットの
２つの数の加算を挙げることができる。この例では、ハ
ードウェアで以下の演算を実行することにより、４８ビ
ットの２つの数（ａ（０〜４７）およびｂ（０〜４
７））を加算する。As an example of the data flow, consider the addition of two 48-bit numbers. In this example, two 48-bit numbers (a(0-47) and b(0-47)) are added by performing the following operations in hardware.

【０１４９】 a(32-47) + b(32-47)->ans(32-47) - ステップ１１）合計の上位ビットの実行結果をセーブする。２）部分結果が０だったかどうかを記憶する。a(32-47) + b(32-47) -> ans(32-47) - Step 1: 1) Save the carry out of the high-order bit of the sum. 2) Remember whether the partial result was zero.

【０１５０】a(16-31) + b(16-31) + save carry->ans
(16-31) - ステップ２１）合計の上位ビットの実行結果をセーブする。２）この結果および前回のステップで部分結果が０だっ
たかどうかを記憶する。両方とも０の場合、０を記憶する。いずれかが０以外の
場合、非０を記憶する。a(16-31) + b(16-31) + saved carry -> ans(16-31) - Step 2: 1) Save the carry out of the high-order bit of the sum. 2) Remember whether this result and the partial result of the previous step were both zero: if both are zero, remember zero; if either is nonzero, remember nonzero.

【０１５１】a(0-15) + b(0-15) + saved carry->ans(0
-15) - 最終ステップ１）この断片が０で最後の断片が０だった場合、答は０
である。２）この断片が０で最後の断片が非０だった場合、答は
非０である。３）この断片が非０の場合、答は合計の符号に基づき正
または負になる（桁あふれはないものとする）。４）実行する答の符号が実行した答の符号と等しくない
場合、答は符号が間違っており、結果は桁あふれとなる
（使用可能なビット単位で適切に表すことができな
い）。a(0-15) + b(0-15) + saved carry -> ans(0-15) - Final step: 1) If this piece is zero and the previous pieces were zero, the answer is zero. 2) If this piece is zero and the previous pieces were nonzero, the answer is nonzero. 3) If this piece is nonzero, the answer is positive or negative according to the sign of the sum (assuming no overflow). 4) If the carry into the sign bit of the answer differs from the carry out of the sign bit, the answer has the wrong sign and the result is an overflow (it cannot be properly represented in the available bits).

【０１５２】中間の第２ステップを必要な回数繰り返す
と、長さを拡張することができる。長さが３２の場合、
第２ステップは実行されない。長さが４８より大きい場
合、ステップ２は複数回実行される。長さがちょうど１
６の場合、ステップ１における動作が、最終ステップの
条件３および４つきで実行される。オペランドの長さを
データ・フローの長さの複数倍に拡張すると、データ・
フローの幅が狭い場合、通常、命令を実行するのにかか
る時間が長くなる。すなわち、３２ビットのデータ・フ
ローで３２ビットを加算する場合、加算器論理回路を１
回通過するだけでよいが、１６ビット・データ・フロー
でやはり３２ビットを加算する場合は、加算器論理回路
を２回通過する必要がある。The length can be extended by repeating the intermediate second step as many times as necessary. If the length is 32, the second step is not performed. If the length is greater than 48, step 2 is performed multiple times. If the length is exactly 16, the operation in step 1 is performed together with conditions 3 and 4 of the final step. Extending the operand length to multiples of the data flow width usually means that an instruction takes longer to execute when the data flow is narrow. That is, adding 32 bits on a 32-bit data flow requires only one pass through the adder logic, whereas adding 32 bits on a 16-bit data flow requires two passes through the adder logic.
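The three steps above can be modelled in software as a loop over 16-bit pieces. The following is a minimal sketch, assuming two's-complement operands passed as Python integers; the function name `multiword_add` and its return tuple are illustrative and do not come from the specification. The text numbers bits from the most significant end, so piece (32-47) is the least significant and is processed first.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def multiword_add(a, b, words):
    """Add two operands of `words` 16-bit pieces, one piece per adder pass.

    Returns (result_bits, is_zero, is_negative, overflow).
    """
    carry = 0            # carry saved between passes
    carry_into_sign = 0  # carry entering the sign bit in the final pass
    all_zero = True      # stays True only while every piece so far is zero
    result = 0
    for i in range(words):  # least significant piece first
        shift = i * WORD_BITS
        pa = (a >> shift) & WORD_MASK
        pb = (b >> shift) & WORD_MASK
        if i == words - 1:
            # Carry out of the low 15 bits of the last piece is the carry
            # into the sign bit of the whole operand.
            carry_into_sign = ((pa & 0x7FFF) + (pb & 0x7FFF) + carry) >> 15
        total = pa + pb + carry
        piece = total & WORD_MASK
        carry = total >> WORD_BITS
        all_zero = all_zero and piece == 0
        result |= piece << shift
    overflow = carry_into_sign != carry  # condition 4 of the final step
    is_negative = bool(result >> (words * WORD_BITS - 1))
    return result, all_zero, is_negative, overflow
```

With `words=1` the loop reduces to the 16-bit case, with `words=2` the middle step never runs, and with `words=3` it matches the 48-bit example. The pass count equals `words`, which is the cost of the narrow data flow that the paragraph describes.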

【０１５３】本発明の興味深い点として、マシンの現実
施態様では、長さ１〜８ワード（長さは、命令の一部と
して定義される）のオペランドに対して加算／減算／比
較／移動を実行できる単一の命令がある。プログラマが
使用できる個々の命令は、ステップ１、ステップ２、お
よび最終ステップで示した動作と同じ種類の動作を実行
する（ただし、プログラマにとってオペランド長は長く
なる。すなわち、１６〜１２８ビット）。基本ハードウ
ェア・レベルでは一度に１６ビットに作用するが、プロ
グラマは一度に１６〜１２８ビットを処理していると考
える。An interesting point is that, in the current implementation of the machine, there are single instructions that can add, subtract, compare, or move operands 1 to 8 words long (the length is defined as part of the instruction). The individual instructions available to the programmer perform the same kinds of operations as shown in step 1, step 2, and the final step (except that, to the programmer, the operand is longer, that is, 16 to 128 bits). At the basic hardware level the machine works on 16 bits at a time, but the programmer thinks in terms of processing 16 to 128 bits at a time.

【０１５４】これらの命令を組み合わせて使用すると、
プログラマは任意の長さのオペランドを扱うことができ
る。すなわち、２つの命令を使用すると、長さ２５６ビ
ットまでの２つの数を加算することができる。Using combinations of these instructions, the programmer can handle operands of any length. For example, two instructions can add two numbers up to 256 bits long.

【０１５５】ＰＭＥプロセッサ：本発明のＰＭＥプロセ
ッサは、ＭＰＰアプリケーションに現在使用されている
現在のマイクロプロセッサと異なる。プロセッサ部分の
違いとしては以下の点が挙げられる。PME Processor: The PME processor of the present invention differs from the microprocessors currently used in MPP applications. The differences in the processor portion include the following.

【０１５６】１．プロセッサは、完全にプログラミング
可能なハードワイヤ式コンピュータである（命令セット
の概要に関しては、ＩＳＡについての説明を参照された
い）・左上隅に示す完全なメモリ・モジュールを有する（図
９参照）。・（左上隅に示す）各割込みレベルに対して別々のレジ
スタ・セットをエミュレートするのに必要な制御機構を
備えたハードウェア・レジスタを有する。・その論理演算機構が、効果的な多重サイクル整数およ
び浮動小数点演算を可能にするのに必要なレジスタおよ
び制御機構を有する。・右上隅に示す２地点間リンクで相互接続されたＰＭＥ
間のパケットまたは回線交換データ移動をサポートする
のに必要な入出力切替経路を有する。1. The processor is a fully programmable hardwired computer (see the ISA description for an overview of the instruction set) that: has a complete memory module, shown in the upper left corner (see FIG. 9); has hardware registers with the controls needed to emulate a separate register set for each interrupt level (shown in the upper left corner); has a logic unit with the registers and controls needed for effective multi-cycle integer and floating point arithmetic; and has the I/O switching paths needed to support packet-switched or circuit-switched data movement between PMEs interconnected by point-to-point links, shown in the upper right corner.

【０１５７】２．これは、ＣＭＯＳＤＲＡＭ技術によ
り１チップ当たりＰＭＥの複製を８個作成できるプロセ
ッサ設計のための最小の手法である。2. This is a minimal approach to processor design that allows eight replications of the PME per chip with CMOS DRAM technology.

【０１５８】３．ＰＭＥのこのプロセッサ部分は、本発
明のＭＭＰの効果的なＭＩＭＤ動作またはＳＩＭＤ動作
を可能にするのに必要な高速命令セット・アーキテクチ
ャ（ＩＳＡ）（表を参照）をコーディングするのに必要
なほぼ最小のデータ・フロー幅を提供する。3. This processor portion of the PME provides approximately the minimum data flow width needed to encode the fast instruction set architecture (ISA) (see table) that is required for effective MIMD or SIMD operation of the MMP of the present invention.

【０１５９】ＰＭＥ常駐ソフトウェア：ＰＭＥは、格納
されたプログラムを実行できる、ＡＰＡＰの最小要素で
ある。ＰＭＥは、一定の外部制御要素中に常駐し、ＳＩ
ＭＤモードで同報通信／制御インタフェース（ＢＣＩ）
によってＰＭＥに送られるプログラムを実行し、あるい
はそれ自体の主記憶装置に常駐するプログラムを実行す
ることができる（ＭＩＭＤモード）。ＰＭＥは、ＳＩＭ
ＤモードとＭＩＭＤモードの間で動的に切り替えること
ができる。これは、ＳＩＭＤ／ＭＩＭＤモード２重機能
であり、システムはこの２重機能を同時に実行できる
（ＳＩＭＩＭＤモード）。特定のＰＭＥは、制御レジス
タ中のビットをセットまたはリセットするだけで、この
動的切替えを行うことができる。ＳＩＭＤＰＭＥソフ
トウェアは実際には外部制御要素に常駐するので、これ
についての詳細はアレイ・ディレクタに関する考察の所
および関連出願に記載されている。PME Resident Software: The PME is the smallest element of the APAP that can execute a stored program. It can execute a program that resides in an external control element and is sent to the PME by the broadcast/control interface (BCI) in SIMD mode, or a program that resides in its own main memory (MIMD mode). The PME can switch dynamically between SIMD mode and MIMD mode. This is a dual SIMD/MIMD mode capability, and the system can perform both at the same time (SIMIMD mode). A particular PME can make this dynamic switch simply by setting or resetting a bit in a control register. Because SIMD PME software actually resides in the external control element, further details can be found in the discussion of the array director and in the related applications.

【０１６０】ＭＩＭＤソフトウェアは、ＰＭＥがＳＩＭ
Ｄモードのとき、ＰＭＥ主記憶装置に格納される。これ
が可能なのは、ＰＭＥの多くが、同様なデータを非同期
的に処理するので、同一のプログラムを備えているから
である。ここでは、これらのプログラムは固定されてい
ず、他の演算の処理中に外部源からＭＩＭＤプログラム
をロードすることによって修正できることを指摘してお
く。MIMD software is stored in PME main memory while the PME is in SIMD mode. This is feasible because many of the PMEs hold identical programs, since they process similar data asynchronously. Note that these programs are not fixed: they can be modified by loading MIMD programs from an external source while other operations are being processed.

【０１６１】表に示すＰＭＥ命令セット・アーキテクチ
ャはマイクロコンピュータのアーキテクチャなので、こ
のアーキテクチャではＰＭＥが実行できる機能に対する
制限はほとんどない。基本的に、各ＰＭＥはＲＩＳＣマ
イクロプロセッサと同様に機能できる。典型的なＭＩＭ
ＤＰＭＥソフトウェア・ルーチンを以下に列挙する。Because the PME instruction set architecture shown in the table is that of a microcomputer, the architecture places few restrictions on the functions a PME can perform. Essentially, each PME can function like a RISC microprocessor. Typical MIMD PME software routines are listed below.

【０１６２】１．各種の常駐ルーチンをディスパッチし
優先順位をつけるための基本制御プログラム。1. Basic control program for dispatching various resident routines and prioritizing them.

【０１６３】２．ＰＭＥ間でデータおよび、制御メッセ
ージをやり取りするための通信ソフトウェア。このソフ
トウェアは、特定のＰＭＥが「回線交換」モードにいつ
入りそこからいつ出るかを決定する。通信ソフトウェア
は適宜「蓄積交換」機能を実行する。また、それ自体の
主記憶装置と別のＰＭＥの主記憶装置の間でのメッセー
ジの開始、送信、受信、および終了を行う。2. Communication software for exchanging data and control messages between PMEs. This software determines when a particular PME enters and exits "circuit switched" mode. The communication software suitably performs a "store and forward" function. It also initiates, sends, receives, and ends messages between its own main memory and another PME's main memory.

【０１６４】３．割込み処理ソフトウェアは、文脈の切
替えを完了し、その割込みを発生させた事象に応答す
る。これらのソフトウェアには、フェール・セイフ・ル
ーチンと、ＰＭＥのアレイへの再経路指定または再割当
てが含まれる。3. Interrupt handling software completes the context switch and responds to the event that caused the interrupt. This software can include fail-safe routines and the rerouting or reassignment of PMEs within the array.

【０１６５】４．後述の拡張命令セット・アーキテクチ
ャを実施するルーチン。これらの再経路指定では、拡張
精密固定小数点演算、浮動小数点演算、ベクトル演算な
どのマクロ・レベル命令が実行される。そのため、複雑
な演算が扱えるだけでなく、多次元（２次元または３次
元イメージ）および複数媒体プロセスでイメージ・デー
タを表示するためのイメージ処理活動も可能になる。4. Routines that implement the extended instruction set architecture described below. These routines execute macro-level instructions such as extended-precision fixed point arithmetic, floating point arithmetic, and vector arithmetic. This makes it possible not only to handle complex mathematics but also to perform image processing for displaying image data in multiple dimensions (two-dimensional or three-dimensional images) and multimedia processing.

【０１６６】５．標準の数学ライブラリ関数を組み込む
ことができる。これらの関数にはＬＩＮＰＡＫルーチン
およびＶＰＳＳルーチンを含めることが好ましい。各プ
ロセッサ・メモリ要素は、ベクトルまたは行列の異なる
要素に対して作用できるので、様々なＰＭＥがすべて、
異なるルーチン、または同一の行列の異なる部分を一時
に実行できる。5. Standard math library functions can be included. These functions preferably include LINPAK and VPSS routines. Because each processor memory element can operate on a different element of a vector or matrix, the various PMEs can all execute different routines, or different parts of the same matrix, at the same time.

【０１６７】６．ＡＰＡＰノード相互接続構造を利用
し、動的多次元経路指定を可能にする、分散／収集機能
または分類機能を実行するための特殊ルーチンが提供さ
れる。これらのルーチンは、様々なＰＭＥ間で実現され
る一定程度の同期化を効果的に利用しながら、非同期動
作を継続できるようにする。分類用には、分類ルーチン
がある。ＡＰＡＰはバッチャ分類によく適している。な
ぜなら、このような分類では、非常に短い比較サイクル
と比較する特定の要素を決定するために大量の計算が必
要だからである。プログラム同期化は入出力ステートメ
ントによって管理される。このプログラムを使うと、１
ＰＭＥ当たり複数のデータ要素が可能であり、非常に大
規模な並列分類を非常に簡単な形で行うことができる。6. Special routines are provided to perform scatter/gather or sort functions that exploit the APAP node interconnect structure and allow dynamic multidimensional routing. These routines make effective use of the degree of synchronization achieved among the various PMEs while allowing asynchronous operation to continue. For sorting, there are sort routines. The APAP is well suited to a Batcher sort, because such a sort requires a large amount of computation to determine the particular elements to compare, relative to a very short compare cycle. Program synchronization is managed by I/O statements. The program allows multiple data elements per PME, and very large parallel sorts can be done in a very simple manner.
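To make the remark about Batcher sorting concrete, the sketch below uses Batcher's bitonic sorting network, one of his two classic networks. The text does not say which variant the APAP routines use, so this choice and the names `bitonic_stages` and `bitonic_sort` are assumptions. The example separates the index arithmetic that decides which elements to compare from the compare-exchange itself, which is the split between computation and compare cycle that the paragraph mentions.

```python
def bitonic_stages(n):
    """Yield the stages of a bitonic sorting network for n = 2**m elements.

    Each stage is a list of (i, j, ascending) triples. The pairs within one
    stage are disjoint, so they could all be processed in parallel.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # distance between compared elements
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    stage.append((i, partner, (i & k) == 0))
            yield stage
            j //= 2
        k *= 2


def bitonic_sort(values):
    """Sort a power-of-two-length sequence by applying the network stages."""
    data = list(values)
    for stage in bitonic_stages(len(data)):
        for i, j, ascending in stage:
            # Swap when the pair is out of order for this stage's direction.
            if (data[i] > data[j]) == ascending:
                data[i], data[j] = data[j], data[i]
    return data
```

In a parallel setting each element (or block of elements) would live in a different PME, and every stage would become one exchange of data between partner PMEs. That mapping is only suggested by the text and is not modelled here.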

【０１６８】各ＰＭＥはそれ自体の常駐ソフトウェアを
有するが、これらのマイクロコンピュータから構築され
るシステムは、スカラー並列マシン用に設計されたより
高水準の言語プロセスを実行できる。すなわち、このシ
ステムは、ＦＯＲＴＲＡＮ、Ｃ、Ｃ＋＋、ＦＯＲＴＲＡ
ＮＤなどの高水準言語で、ＵＮＩＸマシン用に書かれ
たアプリケーション・プログラムまたは他のオペレーテ
ィング・システムのアプリケーション・プログラムが実
行できる。Although each PME has its own resident software, systems built from these microcomputers can run higher-level language processes designed for scalar parallel machines. That is, the system can run application programs written for UNIX machines, or for other operating systems, in high-level languages such as FORTRAN, C, C++, and FORTRAN D.

【０１６９】本発明のプロセッサ概念が、きわめて古い
プロセッサ設計の手法を使用しているのは興味深い。Ｉ
ＢＭの軍用プロセッサではおそらく、同様な命令セット
・アーキテクチャ（ＩＳＡ）設計が３０年来使用されて
いる。この種の設計を使用して、本発明のＰＭＥ設計全
体と組み合わせれば、行き詰まった現在のマイクロプロ
セッサ設計に活路を見いだし、当該技術が次の世紀に使
用できる新しい道を開くことができることを認識したの
は本発明者等が初めてである。It is interesting that the processor concept of the present invention uses a very old approach to processor design. A similar instruction set architecture (ISA) design has been used in IBM military processors for perhaps 30 years. The present inventors are the first to recognize that this type of design, combined with the overall PME design of the present invention, can provide a way out of the impasse of current microprocessor design and open a new path for the technology into the next century.

【０１７０】このプロセッサの設計の諸特徴は他の現在
のマイクロプロセッサとまったく異なるが、同様なゲー
ト制約式軍用および宇宙航空用プロセッサは６０年代か
らこの設計を使用している。このプロセッサは、簡単な
コンパイラの開発に十分な命令およびレジスタを提供
し、汎用処理アプリケーションと信号処理アプリケーシ
ョンのどちらもこの設計で効果的に実行される。本発明
の設計はゲート要件が最小であり、ＩＢＭは、組込みチ
ップ設計が汎用処理に必要であった数年前から同様な概
念を実施している。今回は旧式のＩＳＡ設計の一部を採
用したため、多くのプログラマが該設計概念について既
存のベースおよび知識を持っているので、本発明のシス
テムを迅速に採用できるようにする多数のユーティリテ
ィおよびその他のソフトウェア・ビークルの使用が可能
である。Although the design features of this processor are quite different from those of other current microprocessors, similar gate-constrained military and aerospace processors have used this design since the 1960s. The processor provides enough instructions and registers for straightforward compiler development, and both general purpose and signal processing applications run effectively on this design. The design of the present invention has minimal gate requirements, and IBM has implemented similar concepts for some years, whenever an embedded chip design was needed for general purpose processing. Because part of the older ISA design is adopted here, many programmers already have a base of experience and knowledge of the design concept, and numerous utilities and other software vehicles are available that allow the system of the present invention to be adopted quickly.

【０１７１】ＰＭＥ入出力：ＰＭＥは、図９の経路ＢＣ
Ｉを介して同報通信／制御インタフェース（ＢＣＩ）バ
スから論理演算機構にデータを読み込み、あるいは該バ
スから解読論理回路（図示せず）に直接命令を取り込む
ことにより、該バスと相互接続する。ＰＭＥは、ＳＩＭ
Ｄモードでパワー・アップし、分岐にぶつかるまでその
モードで命令を読み取り、解読して、実行する。ＳＩＭ
Ｄモードの同報通信コマンドは、ＭＩＭＤへの移行を行
わせ、ローカル側で命令を取り出させる。同報通信ＰＭ
Ｅ命令^ＩＮＴＥＲＮＡＬＤＩＯＷ^は状態を反転させ
る。PME Input/Output: The PME interconnects with the broadcast/control interface (BCI) bus either by reading data from the bus into the logic unit via path BCI in FIG. 9, or by fetching instructions from the bus directly into the decode logic (not shown). The PME powers up in SIMD mode and, in that mode, reads, decodes, and executes instructions until it encounters a branch. A broadcast command in SIMD mode causes the transition to MIMD mode and causes instructions to be fetched locally. A broadcast PME instruction, ^INTERNAL DIOW^, reverses the state.

【０１７２】ＰＭＥ入出力は、データの送信、引渡しま
たは受信とすることができる。ＰＭＥは、データ送信時
にＣＴＬレジスタをセットして、ＸＭＩＴをＬ、Ｒ、
Ｖ、Ｘのいずれかに接続させる。次にハードウェア・サ
ービスが、ＡＬＵマルチプレクサおよびＸＭＩＴレジス
タを介してメモリからターゲットにデータのブロックを
引き渡す。この処理は、通常の命令動作とインタリーブ
する。アプリケーションの要件に応じて、伝送されるデ
ータのブロックは、定義済みＰＭＥ用の生データまたは
経路を確立するためのコマンド、あるいはその両方を含
むことができる。データを受け取ったＰＭＥは、入力を
メモリに格納し、活動状態の下位処理に割り込む。割込
みレベルにおける解釈タスクは、この割込み事象を使っ
て、タスク同期化を実行し、あるいは透過性入出力動作
を開始することができる（データが他の場所でアドレス
されるとき）。ＰＭＥは、透過性入出力動作中、自由に
実行を継続できる。ＰＭＥのＣＴＬレジスタがＰＭＥを
ブリッジにする。データは、ゲート処理なしにＰＭＥを
通過し、ＰＭＥは、命令またはデータ・ストリームによ
ってＣＴＬレジスタがリセットされるまでそのモードの
ままである。ＰＭＥは、データの引渡し中、データ源と
なることはできないが、別のメッセージのデータ・シン
クとなることはできる。PME I/O can be sending data, passing data, or receiving data. When sending data, the PME sets the CTL register to connect XMIT to L, R, V, or X. Hardware services then pass a block of data from memory to the target via the ALU multiplexer and the XMIT register. This processing interleaves with normal instruction operation. Depending on application requirements, the block of data transmitted can contain raw data for a predefined PME, commands to establish paths, or both. A PME that receives data stores the input in memory and interrupts the active lower-level processing. The interpretation task at the interrupt level can use the interrupt event to perform task synchronization or to initiate a transparent I/O operation (when the data is addressed elsewhere). During a transparent I/O operation the PME is free to continue execution. Its CTL register makes it a bridge: data passes through it without gating, and the PME remains in that mode until an instruction or the data stream resets the CTL register. While passing data, a PME cannot be a data source, but it can be the data sink for another message.

【０１７３】ＰＭＥ同報通信セクション：これは、チッ
プと共通制御デバイスの間のインタフェースである。こ
のインタフェースは、入出力を指令しまたは完全なチッ
プをテストし診断する制御装置として働くデバイスが使
用できる。PME Broadcast Section: This is the interface between the chip and a common control device. It can be used by a device that acts as a controller to direct I/O or to test and diagnose the complete chip.

【０１７４】入力は、ＰＭＥのサブセットが使用可能な
ワード・シーケンス（命令またはデータ）である。各ワ
ードには、どのＰＭＥがそのワードを使用するかを示す
コードが関連付けられている。ＢＣＩは、ワードを使用
して、該インタフェースへのアクセスを制限するととも
に、必要なすべてのＰＭＥがデータを受け取るようにす
る。このことは、ＢＣＩを非同期ＰＭＥ動作に調節する
のに役立つ（ＰＭＥは、ＳＩＭＤモードのときでも、入
出力および割込み処理のために非同期的である）。この
機構により、ＰＭＥを、ＢＣＩを介して受け取ったコマ
ンド／データ・ワードのインタリーブ・セットによって
制御されるグループに形成することができる。The input is a sequence of words (instructions or data) that is available to a subset of the PMEs. Associated with each word is a code indicating which PMEs will use the word. The BCI uses the word to limit access to the interface and to ensure that all required PMEs receive the data. This helps adapt the BCI to asynchronous PME operation (PMEs are asynchronous, even in SIMD mode, because of I/O and interrupt processing). The mechanism allows PMEs to be formed into groups that are controlled by interleaved sets of command/data words received over the BCI.

【０１７５】ＢＣＩは、ＰＭＥにデータを引き渡すだけ
でなく、ＰＭＥから要求コードを受け入れ、それらのコ
ードを組み合わせ、統合された要求を送り出す。この機
構は、いくつかの形で使用できる。ＭＩＭＤ処理は、す
べて出力信号で終了するプロセッサのグループ中で開始
できる。信号が^ＡＮＤ^されると、制御装置は新規プロ
セスを開始する。多くの場合、アプリケーションがＰＭ
Ｅメモリに必要なすべてのソフトウェアをロードできる
とは限らない。制御装置へのコード化された要求を使っ
て、おそらくホストの記憶システムからソフトウェア・
オーバレイを取り出す。In addition to passing data to the PMEs, the BCI accepts request codes from the PMEs, combines them, and sends out an integrated request. This mechanism can be used in several ways. MIMD processing can be started in a group of processors that all end with an output signal. When the signals are ANDed, the controller starts a new process. In many cases an application cannot load all required software into PME memory. Coded requests to the controller are then used to fetch software overlays, perhaps from the host's storage system.
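The ANDing of completion signals described above amounts to a barrier: the controller proceeds only once every PME in the group has raised its output signal. A minimal sketch follows. The dictionary of per-PME signals and the names `all_done` and `controller_step` are hypothetical, and the real combining logic is hardware in the BCI rather than software.

```python
def all_done(done_signals):
    """AND the per-PME completion signals of one group into a single request."""
    # An empty group never raises the combined signal.
    return bool(done_signals) and all(done_signals.values())


def controller_step(done_signals, start_next):
    """Start the next process only when the ANDed signal is raised."""
    if all_done(done_signals):
        start_next()
        return True
    return False
```

A usage example under the same assumptions: with signals `{0: True, 1: False}` the controller does nothing, and once PME 1 also reports completion the next process is started exactly once per call.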

【０１７６】制御装置は、多数のチップを通る直列走査
ループを使って、個々のチップまたはＰＭＥ上の情報を
取り出す。これらのループは最初ＢＣＩと相互接続され
ているが、該インタフェースにおいて個々のＰＭＥとブ
リッジできる。The controller uses a serial scan loop through multiple chips to retrieve information on individual chips or PMEs. These loops are initially interconnected with the BCI but can bridge to individual PMEs at the interface.

【０１７７】ＢＣＩ：各チップ上に設けられた同報通信
／制御インタフェース（ＢＣＩ）は、データまたは命令
をノードに送信できるような並列入力インタフェースを
実現する。着信データはサブセット識別子でタグ付けさ
れる。ＢＣＩは、サブセット内で動作する、ノード内の
すべてのＰＭＥにデータまたは命令が提供されるように
するのに必要な機能を備えている。ＢＣＩの並列インタ
フェースは、すべてのＰＭＥにデータを同報通信できる
ようにするポートとしても、ＳＩＭＤ動作中の命令イン
タフェースとしても働く。両方の要件を満たすととも
に、それらの要件をサブセット動作のサポートにまで拡
張する機能は、本発明の設計手法以外には全く例を見な
い。BCI: The Broadcast / Control Interface (BCI) on each chip implements a parallel input interface that allows data or instructions to be sent to the nodes. Incoming data is tagged with a subset identifier. The BCI provides the functionality necessary to ensure that all PMEs in a node operating in the subset are provided with data or instructions. The BCI's parallel interface acts both as a port that allows data to be broadcast to all PMEs and as an instruction interface during SIMD operation. The ability to satisfy both requirements and extend those requirements to support subset operation is unprecedented, except for the design approach of the present invention.

【０１７８】本発明のＢＣＩ並列入力インタフェースに
より、ノードの外部の制御要素からデータまたは命令を
送信することが可能になる。ＢＣＩは、各ＰＭＥと結合
された「グループ割当て」レジスタを備えている（グル
ープ化の概念については、同時出願のgrouping of SIMD
picketsと題する米国特許出願を参照されたい）。着信
データ・ワードはグループ識別子でタグ付けされる。Ｂ
ＣＩは、専用グループに割り当てられたノード内のすべ
てのＰＭＥにデータまたは命令が提供されるようにする
のに必要な機能を備えている。ＢＣＩの並列インタフェ
ースは、ＭＩＭＤ動作中にＰＭＥにデータを同報通信で
きるようにするポートとしても、ＳＩＭＤ動作中の命令
／即値オペランド・インタフェースとしても働く。The BCI parallel input interface of the present invention allows data or instructions to be sent from a control element external to the node. The BCI contains a "group assignment" register associated with each PME (for the grouping concept, see the co-pending U.S. patent application entitled "Grouping of SIMD Pickets"). Incoming data words are tagged with a group identifier. The BCI includes the functions needed to ensure that data or instructions are presented to all PMEs in the node that are assigned to the designated group. The BCI parallel interface serves both as a port for broadcasting data to the PMEs during MIMD operation and as the instruction/immediate-operand interface during SIMD operation.
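The tagged broadcast can be pictured as a filter on each PME's group-assignment register. The sketch below is a software model only. The class names `PME` and `BCI`, the `broadcast` method, and the use of small integers as group identifiers are assumptions introduced for illustration, and the actual tag encoding is not given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    group: int                                    # group-assignment register
    received: list = field(default_factory=list)  # words delivered so far


class BCI:
    """Model of the per-node broadcast/control interface."""

    def __init__(self, pmes):
        self.pmes = pmes

    def broadcast(self, group_id, word):
        """Deliver `word` to every PME whose group register matches the tag."""
        hits = 0
        for pme in self.pmes:
            if pme.group == group_id:
                pme.received.append(word)
                hits += 1
        return hits
```

Interleaving calls with different `group_id` values on one node of eight PMEs then drives two groups from a single word stream, which is the behaviour the text describes for interleaved command/data words.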

【０１７９】ＢＣＩは、２つの直列インタフェースも備
えている。高速直列ポートは、各ＰＭＥに、限られた量
の状況情報を出力する能力を与える。このデータの目的
は以下のとおりである。The BCI also has two serial interfaces. The high speed serial port gives each PME the ability to output a limited amount of status information. The purpose of this data is to:

【０１８０】１．ＰＭＥたとえば５００が読み取る必要
のあるデータを有すること、またはＰＭＥが何らかの動
作を完了したことを示す信号をアレイ・ディレクタ６１
０に送る。アレイ・ディレクタ６１０は、それが代表す
る外部制御要素にメッセージを渡す。２．外部テストおよびモニタ要素がシステム全体の状況
を示すことができるように活動状況を提供する。1. To signal the array director 610 that a PME, for example 500, has data that needs to be read, or that the PME has completed some operation. The array director 610 passes the message on to the external control element it represents. 2. To provide activity status so that external test and monitor elements can show the status of the entire system.

【０１８１】標準直列ポートは、外部制御要素が監視お
よび制御の目的で特定のＰＭＥに選択的にアクセスでき
るようにする。このインタフェースを介して渡されるデ
ータは、ＢＣＩ並列インタフェースから特定のＰＭＥレ
ジスタにデータを送り、あるいは特定のＰＭＥレジスタ
からデータを選択してそれを高速直列ポートに経路指定
することができる。これらの制御点は、外部制御要素が
初期パワー・アップおよび診断フェーズ中に、個々のＰ
ＭＥを監視し制御できるようにする。これによって、ア
レイ・ディレクタは、特定のＰＭＥおよびノード内部レ
ジスタならびにアクセス点をポートの出力先とするよう
に、制御データを入力することができる。これらのレジ
スタは、ノードのＰＭＥがアレイ・ディレクタにデータ
を出力できるように経路を提供し、かつアレイ・ディレ
クタが初期パワー・アップおよび診断フェーズ中に、装
置にデータを入力できるようにする。アクセス点へのデ
ータ入力を使用して、テストおよび診断動作、すなわ
ち、単一の命令ステップの実行、比較時停止、区切り点
などを制御することができる。The standard serial port allows the external control element to selectively access a particular PME for monitoring and control purposes. Data passed over this interface can direct data from the BCI parallel interface to a particular PME register, or select data from a particular PME register and route it to the high speed serial port. These control points allow the external control element to monitor and control individual PMEs during the initial power-up and diagnostic phases. The array director can thus input control data that directs the port to particular PME and node internal registers and access points. These registers provide a path for the PMEs of a node to output data to the array director, and allow the array director to input data to the units during the initial power-up and diagnostic phases. Data input to the access points can be used to control test and diagnostic operations, such as single instruction step, stop on compare, and breakpoints.

【０１８２】ノード・トポロジー：本発明の修正ハイパ
ーキューブ・トポロジー接続は大規模並列システムにも
っとも有効であるが、性能の劣る他の接続を本発明の基
本ＰＭＥと併用することもできる。本発明者によるＶＬ
ＳＩチップの初期実施例では、８個のＰＭＥと、完全分
散型ＰＭＥ内部ハードウェア接続が使用されている。内
部ＰＭＥ間チップ構成は、４個のＰＭＥから成るリング
２つであり、各ＰＭＥがさらに他のリングのＰＭＥへの
１つの接続を有している。ＶＬＳＩチップに８個のＰＭ
Ｅがある場合、これは３次元バイナリ・ハイパーキュー
ブである。しかし、本発明の手法では一般に、チップ内
でハイパーキューブ編成を使用しない。また、各ＰＭＥ
では１本のバスのエスケープが可能である。初期実施例
では、一方のリングからエスケープされたバスを＋Ｘ、
＋Ｙ、＋Ｗ、および＋Ｚと呼び、他方のリングからエス
ケープされたリングには−（マイナス）の同様なラベル
を付ける。Node Topology: The modified hypercube topology connection of the present invention is most useful for massively parallel systems, but other, less capable connections can also be used with the basic PME of the present invention. The inventors' initial VLSI chip embodiment uses eight PMEs with fully distributed PME internal hardware connections. The internal PME-to-PME chip configuration is two rings of four PMEs, with each PME also having one connection to a PME in the other ring. With eight PMEs on the VLSI chip, this is a three-dimensional binary hypercube. In general, however, the approach of the present invention does not use a hypercube organization within the chip. In addition, each PME provides for the escape of one bus. In the initial embodiment, the buses escaping from one ring are called +X, +Y, +W, and +Z, and those escaping from the other ring are labeled similarly with a - (minus).
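The claim that two four-PME rings with one cross link per PME form a three-dimensional binary hypercube can be checked mechanically. The sketch below builds the on-chip edges and relabels the PMEs with a Gray code so that every edge joins labels differing in exactly one bit. Pairing each PME with the same-position PME in the other ring, and all function names, are assumptions made for illustration. The escape buses are not modelled.

```python
RING_SIZE = 4
GRAY = [0b00, 0b01, 0b11, 0b10]  # consecutive ring positions differ in one bit


def node_edges():
    """Edges inside one node: two rings of four plus one cross link per PME."""
    edges = set()
    for ring in range(2):
        for pos in range(RING_SIZE):
            here = (ring, pos)
            edges.add(frozenset({here, (ring, (pos + 1) % RING_SIZE)}))  # ring link
            edges.add(frozenset({here, (1 - ring, pos)}))                # cross link
    return edges


def cube_label(pme):
    """Assign a 3-bit hypercube address: ring bit followed by Gray-coded position."""
    ring, pos = pme
    return (ring << 2) | GRAY[pos]


def is_binary_3cube(edges):
    """True if there are 12 edges and each joins labels at Hamming distance 1."""
    if len(edges) != 12:
        return False
    for edge in edges:
        a, b = tuple(edge)
        if bin(cube_label(a) ^ cube_label(b)).count("1") != 1:
            return False
    return True
```

Since the eight labels are distinct and a 3-cube has exactly twelve edges, passing this check means the on-chip graph is the 3-cube itself under the assumed pairing.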

【０１８３】特定のチップ編成をアレイのノードと呼
び、ノードはアレイのクラスタ中に入れることができ
る。ノードは＋-Ｘおよび＋-Ｙを使ってアレイとして接
続され、クラスタを形成する。アレイの次元数は任意で
あり、一般に、バイナリ・ハイパーキューブの開発の必
要条件である２より多い。クラスタは＋-Ｗ、＋-Ｚを使
ってさらに接続され、クラスタのアレイとなる。ここで
もアレイの次元数は任意である。この結果、ノードの４
次元ハイパーキューブが得られる。５次元ハイパーキュ
ーブに拡張するには、ＰＭＥノードが１０個必要であ
り、２本の追加バス、たとえば＋-Ｅ１を使って４次元
ハイパーキューブを接続し、ハイパーキューブのベクト
ルにする。本発明ではさらに、奇数または偶数の基数ハ
イパーキューブへの拡張のパターンを示した。この修正
トポロジーでは、クラスタがクラスタ配線に限定される
が、ハイパーキューブ接続の利点が維持される。A particular chip organization is called a node of the array, and nodes can be placed in clusters of the array. Nodes are connected as an array using ±X and ±Y to form a cluster. The dimensionality of the array is arbitrary and is generally greater than 2, which is the condition required to develop a binary hypercube. Clusters are further connected using ±W and ±Z to form an array of clusters. Again, the dimensionality of the array is arbitrary. The result is a four-dimensional hypercube of nodes. Extending to a five-dimensional hypercube requires a node of ten PMEs and uses two additional buses, for example ±E1, to connect the four-dimensional hypercubes into a vector of hypercubes. The present invention thus also shows the pattern of extension to hypercubes of odd or even radix. This modified topology limits cluster-to-cluster wiring while retaining the advantages of hypercube connectivity.

【０１８４】本発明の、大規模並列マシン用の配線可能
性およびトポロジー構成には、本発明のクラスタ・レベ
ルのパッケージング内でＸ次元およびＹ次元が維持で
き、すべての隣接クラスタにＷバス接続およびＺバス接
続が配分できるという利点がある。上述の技術を実施し
た後、定義されたトポロジーの固有の特性を維持しなが
ら製品を配線し製造することができる。The wireability and topology configuration of the present invention for massively parallel machines has the advantage that the X and Y dimensions can be kept within the cluster-level packaging, while the W and Z bus connections are distributed to all adjacent clusters. With the techniques described above, a product can be wired and manufactured while retaining the inherent characteristics of the defined topology.

【０１８５】ノードは、Ｋ^*ｎ個のＰＭＥと、同報通信
／制御インタフェース（ＢＣＩ）セクションから構成さ
れる。ここで、"ｎ"は、修正ハイパーキューブを特徴付
ける次元またはリングの数を表し、"ｋ"はノードを特徴
付けるリングの数を表す。ノードはｋ個のリングを備え
ることができるが、それらのリングのうち２個だけが、
エスケープ・バスを提供することがこの概念の特徴であ
る。好ましい実施例では、物理チップ・パッケージによ
り、"ｎ"および"ｋ"がＮ＝４およびｋ＝２に制限されて
いる。この制限は物理的なものであり、別のチップ・セ
ットを使用すれば、アレイの次元数を増やすことができ
る。本発明の好ましい実施例は、物理チップ・パッケー
ジの一部であるだけでなく、修正ハイパーキューブ中の
１組のリングを相互接続するＰＭＥのグループ化を可能
にする。各ノードには、ＰＭＥアーキテクチャを有し、
処理機能およびデータ・ルータ機能を実行できる、８個
のＰＭＥがある。したがって、ｎは修正ハイパーキュー
ブの次元数（次節参照）である。すなわち、４次元修正
ハイパーキューブのノード要素はＰＭＥ８個であり、５
次元修正ハイパーキューブのノードはＰＭＥ１０個であ
る。本発明で使用できるノードについては図７を、相互
接続については図１０および図１１を、各ノードのブロ
ック図については図１２を参照されたい。図１７および
図１８は、ＡＰＡＰの可能な相互接続の詳細を示したも
のである。A node consists of k*n PMEs plus the broadcast/control interface (BCI) section. Here "n" represents the number of dimensions or rings that characterize the modified hypercube, and "k" represents the number of rings that characterize the node. A node can contain k rings, but it is a feature of the concept that only two of those rings provide escape buses. In the preferred embodiment, the physical chip package limits "n" and "k" to n = 4 and k = 2. This limitation is physical, and a different chip set would allow the dimensionality of the array to be increased. The preferred embodiment of the present invention provides a grouping of PMEs that is not only part of the physical chip package but also interconnects a set of rings in the modified hypercube. Each node has eight PMEs, each with the PME architecture and able to perform processing and data router functions. Thus n is the dimensionality of the modified hypercube (see the next section): the node of a four-dimensional modified hypercube has eight PMEs, and the node of a five-dimensional modified hypercube has ten PMEs. See FIG. 7 for the nodes that can be used in the present invention, FIGS. 10 and 11 for the interconnections, and FIG. 12 for a block diagram of each node. FIGS. 17 and 18 show details of possible APAP interconnections.

【０１８６】１９９１年５月１３日に出願された"Metho
d for Interconnecting and Systemof Interconnected
Processing Elements"と題する米国特許出願第０７／６
９８８６６号に、本発明のＡＰＡＰＭＭＰに使用する
のが好ましい修正ハイパーキューブ基準が記載されてお
り、必要により参照されたい。上記出願には、１要素当
たりの接続数とネットワーク直径（最悪例経路長）との
バランスを取ることができるように処理要素を相互接続
する方法が記載されている。そうするには、ハイパーキ
ューブの、周知の好ましいトポロジー特性の多くを持つ
トポロジーを作成すると同時に、基底を変えることがで
きる数体系でネットワークのノードを列挙することによ
ってトポロジーの柔軟性を向上させる。この方法で基底
２の数体系を使用すると、ハイパーキューブ・トポロジ
ーが得られる。本発明では、ハイパーキューブの一様な
接続よりも相互接続の数が少なく、かつハイパーキュー
ブの特性を維持する。こうした特性としては次の３つが
ある。１）代替経路が多い。２）総合帯域幅がきわめて
大きい。３）周知の既存の方法を使って、他の共通問題
トポロジーをネットワークのトポロジーでマップでき
る。その結果、密度の低い非バイナリ・ハイパーキュー
ブが得られる。本発明では修正ハイパーキューブ手法を
優先しているが、従来のハイパーキューブを使用できる
アプリケーションもあることに留意されたい。ノードの
接続にあたり、トポロジーの他の手法も使用できる。し
かし、本明細書に記載する手法は、斬新で高度であると
考えられるので、これを優先する。U.S. Patent Application No. 07/698,866, filed May 13, 1991 and entitled "Method for Interconnecting and System of Interconnected Processing Elements", describes the modified hypercube criteria that are preferably used for the APAP MMP of the present invention, and may be consulted as needed. That application describes a method of interconnecting processing elements such that the number of connections per element can be balanced against the network diameter (worst-case path length). This is done by creating a topology that keeps many of the well-known and desirable topological properties of a hypercube, while improving the flexibility of the topology by enumerating the nodes of the network in a number system whose base can be varied. When a base-2 number system is used, the method yields the hypercube topology. The present invention uses fewer interconnections than the uniform connections of a hypercube, yet retains the properties of the hypercube. These properties are: 1) a large number of alternate paths, 2) very high aggregate bandwidth, and 3) the ability to map other common problem topologies onto the topology of the network using well-known existing methods. The result is a less dense, non-binary hypercube. Note that although the present invention prefers the modified hypercube approach, some applications can use a conventional hypercube. Other topological approaches can also be used to connect the nodes. However, the approach described herein is considered novel and advanced, and is therefore preferred.

【０１８７】ＰＭＥのネットワーク中で複数のノードを
相互接続するための、修正ハイパーキューブ・トポロジ
ー用の相互接続方法について以下に説明する。An interconnection method for the modified hypercube topology, used to interconnect multiple nodes in a network of PMEs, is described below.

【０１８８】１．１組の整数ｅ１、ｅ２、ｅ３．．．の
組を次のように定義する。すべての要素の積がネットワ
ーク内のＰＭＥの数Ｍと等しくなり、一方ｅ１およびｅ
２を除く、該組のすべての要素の積がノードの数Ｎであ
り、該組の要素の数ｍが、関係式ｎ＝ｍ−２によってネ
ットワークの次元数を定義する。1. Define a set of integers e1, e2, e3, ... such that the product of all elements equals the number M of PMEs in the network, while the product of all elements of the set except e1 and e2 equals the number N of nodes, and the number m of elements in the set defines the dimensionality n of the network by the relation n = m - 2.

【０１８９】２．１組のインデックスａ１、ａ２．．．
ａｍによって位置指定されたＰＭＥにアドレスする。こ
こで各インデックスは、等価な展開レベルでのＰＭＥ位
置であり、インデックスａｉは、公式(....(a(m)^*e(m-
1) + a(m-2))e(m-1) ... a(2)^*e(1))+a(1)によって、ｉ
が１、２、．．．ｍのとき、０からｅｉ−１の範囲に収
まる。この公式で、a(i)という表記は通常通り、要素の
リストａ中のi番目であることを意味する。ｅについて
も同様である。2. Address a PME located by a set of indexes a1, a2, ... am, where each index is the PME's position at the equivalent level of expansion, such that index ai lies in the range 0 to ei - 1 for i = 1, 2, ... m, by the formula (....(a(m)*e(m-1) + a(m-2))e(m-1) ... a(2)*e(1))+a(1). In this formula, the notation a(i) has its usual meaning of the i-th element in the list a; the same applies to e.

【０１９０】３．次の２つの条件のいずれかが成り立つ
場合にかぎり、２つのＰＭＥ（アドレスがｆおよびｇ）
を接続する。ａ．r/(e1 ^* e2)の整数部分がs/(e1 ^* e2)の整数部分と
等しい。１）r/e1の剰余部分がs/e1の剰余部分と１だけ異なる。
あるいは２）r/e2の剰余部分がs/e2の剰余部分と１またはe2-1だ
け異なる。ｂ．r/eiの剰余部分とs/eiの剰余部分が、ｉが３、
４、．．．ｍの範囲にあるとき異なり、r/e1の剰余部分
が、ｉ−３に等しいs/e2の剰余部分と等しく、r/e2の剰
余部分がs/e2の剰余部分とe2−１だけ異なる。3. Connect two PMEs (with addresses f and g) if and only if either of the following two conditions holds. a. The integer part of r/(e1*e2) equals the integer part of s/(e1*e2), and either 1) the remainder of r/e1 differs from the remainder of s/e1 by 1, or 2) the remainder of r/e2 differs from the remainder of s/e2 by 1 or e2-1. b. The remainder of r/ei differs from the remainder of s/ei for i in the range 3, 4, ... m, the remainder of r/e1 equals the remainder of s/e2, which equals i-3, and the remainder of r/e2 differs from the remainder of s/e2 by e2-1.
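The addressing rule in step 2 can be illustrated with a short sketch. It assumes the usual mixed-radix (Horner) reading of the address formula, in which a(1) is the least significant index and a(m) the most significant; that reading, the function names `encode` and `decode`, and the example extents are illustrative assumptions and do not appear in the specification.

```python
from math import prod

def encode(indexes, extents):
    """Combine indexes (a1..am) into one PME address, a1 least significant.

    Assumed reading of the formula: (...(a(m)*e(m-1) + a(m-1))*e(m-2) ...)*e(1) + a(1).
    """
    address = 0
    for a, e in zip(reversed(indexes), reversed(extents)):
        if not 0 <= a < e:
            raise ValueError("each index a(i) must lie in 0 .. e(i)-1")
        address = address * e + a
    return address

def decode(address, extents):
    """Split a PME address back into its indexes (a1..am)."""
    indexes = []
    for e in extents:
        address, a = divmod(address, e)
        indexes.append(a)
    return indexes

# Hypothetical example: e1 = 2, e2 = 4 (8 PMEs per node), e3 = e4 = 4.
EXTENTS = [2, 4, 4, 4]
M = prod(EXTENTS)        # number of PMEs in the network
N = prod(EXTENTS[2:])    # number of nodes (product excluding e1 and e2)
n = len(EXTENTS) - 2     # dimensionality, n = m - 2
```

With these example extents the network would hold M = 128 PMEs in N = 16 nodes, and every address in 0 .. M-1 decodes to exactly one index set.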

【０１９１】この結果、コンピュータ・システム・ノー
ドは、各次元で基数が異なる可能性がある非バイナリ・
ハイパーキューブを形成する。ノードは、それによって
提供されるポートが修正ハイパーキューブの次元数要件
と一致するような２^*ｎ個のポートをサポートするＰＭ
Ｅのアレイと定義される。特定の修正ハイパーキューブ
の各次元の特定の範囲を定義する１組の整数ｅ３、ｅ
４、．．．ｅｍがすべて等しい、たとえばｂとみなし、
ｅ１およびｅ２をａ１とすると、アドレス可能性および
接続についての直前の公式は下記のようになる。As a result, the computer system nodes form a non-binary hypercube whose radix may differ in each dimension. A node is defined as an array of PMEs supporting 2*n ports, such that the ports provided by the node match the dimensionality requirements of the modified hypercube. If the set of integers e3, e4, ... em, which define the specific extent of each dimension of a particular modified hypercube, are all taken to be equal, say b, and e1 and e2 are taken to be 1, the preceding formulas for addressability and connections reduce to the following.

【０１９２】１．N = b**n 1. N = b**n

【０１９３】２．ここでは、ＰＭＥが基底ｂ数体系を表
す数でアドレス指定される。2. Here, a PME is addressed by a number expressed in the base-b number system.

【０１９４】３．ｆのアドレスがｇのアドレスと、１基
底ｂ桁だけが異なる場合にかぎり２つの計算要素（ｆお
よびｇ）が接続される。０とｂ−１が１だけ離れている
という規則が使用される。3. Two computing elements (f and g) are connected if and only if the address of f differs from the address of g in exactly one base-b digit, using the rule that 0 and b-1 are separated by 1.

【０１９５】４．各ＰＭＥでサポートされる接続の数は
２*ｎである。4. The number of connections supported by each PME is 2*n.

【０１９６】これは基本アプリケーションでの記載通り
であり、非隣接ＰＭＥを接続する通信バスの数は０と選
択されている。This is exactly as described in the basic application, with the number of communication buses connecting non-adjacent PMEs chosen to be 0.
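The simplified rules above (N = b**n, base-b addresses, and connection when exactly one digit differs by 1 with wraparound between 0 and b-1) can be sketched as follows. This is a minimal illustration for the simplified case only; the function name `neighbors` and the example values of b and n are assumptions made for the sketch.

```python
def neighbors(address, b, n):
    """Addresses connected to `address` in an n-digit, base-b modified hypercube.

    Two elements are connected when exactly one base-b digit differs by 1,
    treating 0 and b-1 as one step apart (ring wraparound).
    """
    found = set()
    for d in range(n):
        weight = b ** d
        digit = (address // weight) % b
        for step in (1, -1):
            new_digit = (digit + step) % b
            found.add(address + (new_digit - digit) * weight)
    found.discard(address)  # only matters in the degenerate case b = 1
    return sorted(found)

# Hypothetical example: b = 8, n = 4 gives N = 8**4 = 4096 addresses,
# each with 2*n = 8 connections.
EXAMPLE = neighbors(0, 8, 4)
```

For b = 2 the +1 and -1 steps land on the same digit, so each element has n connections rather than 2*n, which is the ordinary binary hypercube mentioned earlier in the text.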

【０１９７】ノード内ＰＭＥ相互接続：ＰＭＥは、ノー
ド内で２×ｎアレイとして構成される。各ＰＭＥは、１
組の入出力ポートを使って３つの隣接ＰＭＥと相互接続
されるため、ＰＭＥ間には全２重通信機能が提供され
る。各ＰＭＥ外部入出力ポートは、ノード入出力ピンに
接続される。入出力ポートは、ピンを、半２重通信用に
共用できるように接続することも、全２重機能用に分離
できるように接続することも可能である。４次元修正ハ
イパーキューブ・ノードの相互接続を図１０および図１
１に示す（ｎが偶数の場合、ノードは２×２×ｎ／２ア
レイとみなせることに留意されたい）。Intra-node PME interconnection: The PMEs are configured as a 2 × n array within the node. Each PME is interconnected with three adjacent PMEs using a set of input/output ports, which provides full-duplex communication between PMEs. Each PME's external input/output port is connected to the node input/output pins. The input/output ports can be wired so that the pins are shared for half-duplex communication, or kept separate for full-duplex capability. The interconnections of a 4-dimensional modified hypercube node are shown in FIGS. 10 and 11 (note that when n is even, the node can be regarded as a 2 × 2 × n/2 array).

【０１９８】図１０は、ノード内の８つの処理要素５０
０、５１０、５２０、５３０、５４０、５５０、５６
０、５７０を示している。ＰＭＥは、バイナリ・ハイパ
ーキューブ通信ネットワーク中で接続される。このバイ
ナリ・ハイパーキューブは、ＰＭＥ間のノード内接続を
３つ示している（５０１、５１１、５２１、５３１、５
４１、５５１、５６１、５７１、５９０、５９１、５９
２、５９３）。ＰＭＥ間の通信は、処理要素の制御下で
入出力レジスタによって制御される。この図は、８つの
方向、＋-ｗ５２５、５６５、＋-ｘ５１５、５５
５、＋-ｙ５０５、５４５、＋−ｚ５３５、５７５の
いずれかから入出力をエスケープするとき使用できる様
々な経路を示している。望むなら、データをメモリに格
納せずに通信を実行できる。FIG. 10 shows the eight processing elements 500, 510, 520, 530, 540, 550, 560, and 570 within a node. The PMEs are connected in a binary hypercube communication network. This binary hypercube shows three intra-node connections between PMEs (501, 511, 521, 531, 541, 551, 561, 571, 590, 591, 592, 593). Communication between PMEs is controlled by input/output registers under the control of the processing elements. The figure shows the various paths that can be used to escape input/output in any of the eight directions: ±w 525, 565; ±x 515, 555; ±y 505, 545; ±z 535, 575. If desired, communication can be carried out without storing the data in memory.
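The intra-node wiring described above, eight PMEs in a binary hypercube with three on-chip neighbors each, can be sketched as below. Numbering the PMEs 0 to 7 and linking those whose numbers differ in one bit is an assumption made for illustration; the mapping of these links to the reference numerals in FIG. 10 is not given here.

```python
def intra_node_links(dims=3):
    """Undirected links of a binary hypercube with 2**dims PMEs."""
    links = set()
    for pme in range(2 ** dims):
        for bit in range(dims):
            other = pme ^ (1 << bit)  # flip one address bit
            links.add((min(pme, other), max(pme, other)))
    return sorted(links)

def degree(pme, links):
    """Number of intra-node links that touch the given PME."""
    return sum(pme in link for link in links)

LINKS = intra_node_links()
```

Under this numbering the node has 12 intra-node links, and each PME keeps one further port free for its external (escape) connection.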

【０１９９】ネットワーク切替えチップを使用すれば、
それぞれ本発明のチップを持つ各種カードを、システム
の他のチップと接続できるが、ネットワーク切替えチッ
プを使用しなくてもかまわず、またそうすることが望ま
しいことに留意されたい。「４次元トーラス」として記
述する本発明のＰＭＥ間ネットワークは、ＰＭＥ間通信
に使用する機構である。ＰＭＥは、このインタフェース
上のアレイ内の任意の他のＰＭＥにアクセスできる（間
にあるＰＭＥは、蓄積交換または回線交換できる）。Note that although a network switching chip could be used to connect the various cards, each carrying chips of the present invention, to the other chips of the system, the network switching chip need not be used, and it is preferable not to use one. The inter-PME network of the present invention, described as a "four-dimensional torus", is the mechanism used for inter-PME communication. A PME can reach any other PME in the array over this interface (the intervening PMEs can operate in store-and-forward or circuit-switched mode).

【０２００】相互接続のチップ関係：チップについて説
明してきた。図１２は、ＰＭＥプロセッサ／メモリ・チ
ップのブロック図である。このチップは、下記の要素か
ら構成されている。以下に、これらの要素のそれぞれに
ついて説明する。Chip relationships for interconnection: The chip has been described above. FIG. 12 is a block diagram of the PME processor/memory chip. The chip consists of the following elements, each of which is described below.

【０２０１】１．それぞれ１６ビットのプログラマブル
・プロセッサおよび３２Ｋワードのメモリ（６４ＫＢ）
から成るＰＭＥ８個。1. Eight PMEs, each consisting of a 16-bit programmable processor and 32K words of memory (64 KB).

【０２０２】２．制御装置がすべてのＰＭＥまたはその
サブセットを動作させＰＭＥ要求を累積できるようにす
るＢＣＩ。2. A BCI that allows the controller to run all PMEs or a subset thereof and accumulate PME requests.

【０２０３】３．相互接続レベルａ．各ＰＭＥは、８ビット幅ＰＭＥ間通信経路４つをサ
ポートする。これらは、チップ上の隣接ＰＭＥ３個およ
びオフチップＰＭＥ１個と接続される。ｂ．同報通信ＰＭＥ間バス接続。データまたは命令を使
用可能にする。ｃ．任意のＰＭＥが制御装置にコードを送信できるよう
にするサービス要求線。ＢＣＩは、要求を組み合わせ、
要約を転送する。ｄ．シリアル・サービス・ループは、制御装置が機能ブ
ロックに関するすべての詳細を読み取れるようにする。
このレベルの相互接続は、同報通信インタフェースから
すべてのＰＭＥまで延びる（図１２を参照すれば分かる
ので、詳細は省略する）。3. Interconnection levels: a. Each PME supports four 8-bit-wide inter-PME communication paths. These connect to three adjacent PMEs on the chip and one off-chip PME. b. A broadcast-to-PME bus connection, which makes data or instructions available. c. A service request line that allows any PME to send a code to the controller. The BCI combines the requests and forwards a summary. d. A serial service loop that allows the controller to read all details about the functional blocks. This level of interconnection extends from the broadcast interface to all PMEs (details are omitted because they can be seen in FIG. 12).

【０２０４】相互接続および経路指定：ＭＰＰは、ＰＭ
Ｅを複製することによって実施される。複製の程度は、
使用する相互接続および経路指定方式に影響を与えな
い。図７は、ネットワーク相互接続方式の概要を示して
いる。チップはＰＭＥを８個備えており、ＰＭＥはすぐ
隣のＰＭＥに相互接続されている。この相互接続パター
ンにより、図１１に示す３次元キューブ構造が得られ
る。キューブ内の各プロセッサに、チップのピンへの専
用両方向バイト・ポートがある。８個１組のＰＭＥをノ
ードと呼ぶ。Interconnection and routing: The MPP is implemented by replicating the PME. The degree of replication does not affect the interconnection and routing scheme used. FIG. 7 gives an overview of the network interconnection scheme. The chip contains eight PMEs, each interconnected to its immediate neighbors. This interconnection pattern produces the three-dimensional cube structure shown in FIG. 11. Each processor in the cube has a dedicated bidirectional byte port to the pins of the chip. A set of eight PMEs is called a node.

【０２０５】ノードのｎ×ｎアレイがクラスタである。
＋ｘポートと−ｘポート間、＋ｙポートと−ｙポート間
を単にブリッジするだけで、クラスタ・ノード相互接続
が実現される。この場合、本発明の好ましいチップまた
はノードにはＰＭＥが８個あり、それぞれ単一の外部ポ
ートを管理する。このおかげで、ネットワーク制御機能
が分散され、ポートに対するボトルネックがなくなる。
外部エッジをブリッジすると、クラスタは論理トーラス
となる。本発明ではｎ＝４およびｎ＝８のクラスタを検
討したが、商用アプリケーションではｎ＝８の方が適切
であり、一方軍事伝導冷却式アプリケーションではｎ＝
４の方が適切であると考える。本発明の概念では、変更
不能なクラスタ・サイズは使用しない。逆に、変形を使
用するアプリケーションも想定している。An n × n array of nodes is a cluster. Cluster node interconnection is achieved simply by bridging between the +x and -x ports and between the +y and -y ports. Here the preferred chip, or node, of the present invention has eight PMEs, each of which manages a single external port. This distributes the network control function and eliminates port bottlenecks. Bridging the outer edges makes the cluster a logical torus. We considered clusters with n = 4 and n = 8, and believe that n = 8 is better suited to commercial applications, while n = 4 is better suited to military conduction-cooled applications. The concept of the present invention does not impose a fixed cluster size; on the contrary, applications using variations are envisioned.

【０２０６】クラスタをアレイにすると、図１１に示す
４次元トーラスまたはハイパーキューブ構造が得られ
る。＋ｗポートと−ｗポートの間、＋ｚポートと−ｚポ
ートの間をブリッジすると、４次元トーラス相互接続が
実現される。その結果、クラスタ内の各ノードは、すべ
ての隣接クラスタ内の等価なノードと接続される（これ
で、２つの隣接クラスタ間にポートが６４個提供され
る。より大型のクラスタの場合は８個である）。クラス
タ・サイズの場合同様、この方式は特定のサイズのアレ
イを必要としない。本発明では、ワークステーションお
よび軍事アプリケーションに好ましい２×１アレイと、
メインフレーム・アプリケーション用の４×４アレイ、
４×８アレイ、および８×８アレイを考慮した。Arraying the clusters yields the four-dimensional torus or hypercube structure shown in FIG. 11. Bridging between the +w and -w ports and between the +z and -z ports provides the four-dimensional torus interconnection. As a result, each node in a cluster is connected to the equivalent node in every adjacent cluster (this provides 64 ports between two adjacent clusters, or 8 in the case of larger clusters). As with the cluster size, this scheme does not require an array of any particular size. We considered a 2 × 1 array, preferred for workstation and military applications, and 4 × 4, 4 × 8, and 8 × 8 arrays for mainframe applications.

【０２０７】４次元トーラスの開発は、本発明の好まし
いチップのゲート、ピン、およびコネクタの限界を超え
ている。しかし、本発明の代替オンチップ光学式ドライ
バ／レシーバを使用すれば、この限界はなくなる。この
実施例では、本発明のネットワークは１ＰＭＥ当たり１
本の光路を使用できる。その場合、１チップ当たりのＰ
ＭＥは８個でなく１２個になり、マルチＴｆｌｏｐ（テ
ラフロップ）性能の４次元トーラスのアレイができる。
そのようなマシンをワークステーション環境で使用する
ことは可能であり、かつ経済的に実現可能であるように
思われる。そのような代替マシンは、本発明の現在の好
ましい実施例用に開発されたアプリケーション・プログ
ラムを使用することに留意されたい。The development of the four-dimensional torus exceeds the gate, pin, and connector limits of the preferred chip of the present invention. With the alternative on-chip optical driver/receiver of the present invention, however, this limit disappears. In that embodiment, the network of the present invention can use one optical path per PME. There would then be 12 PMEs per chip rather than 8, giving an array of four-dimensional tori with multi-Tflop (teraflop) performance. Using such a machine in a workstation environment appears to be both possible and economically feasible. Note that such an alternative machine would use the application programs developed for the present preferred embodiment.

【０２０８】４次元クラスタ編成：図７および図１１に
示す４次元修正ハイパーキューブ３６０を構築するに
は、８つの外部ポート３１５をサポートするノードが必
要である。外部ポートを＋Ｘ、＋Ｙ、＋Ｚ、＋Ｗ、−
Ｘ、−Ｙ、−Ｚ、−Ｗで表すものとする。次に、ｍ₁ノ
ードを使って、＋Ｘポートと−Ｘポートの間を接続する
と、リングが構築できる。さらに、ｍ₂を使用し、＋Ｙ
ポートと−Ｙポート間を相互接続すると、そのようなリ
ングが相互接続され、リングのリングが構築できる。こ
のレベルの構造をクラスタ３２０と呼ぶ。ｍ₁＝ｍ₂＝８
の場合、図７にｍ＝８として示すように、５１２個のＰ
ＭＥが使用可能であり、そのようなクラスタは複数のサ
イズのシステム（３３０、３４０、３５０）の構成単位
となる。4-dimensional cluster organization: Building the 4-dimensional modified hypercube 360 shown in FIGS. 7 and 11 requires a node that supports eight external ports 315. Let the external ports be denoted +X, +Y, +Z, +W, -X, -Y, -Z, and -W. A ring can then be built from m₁ nodes by connecting between the +X and -X ports. Further, m₂ such rings can be interconnected into a ring of rings by interconnecting the +Y and -Y ports. This level of structure is called a cluster 320. With m₁ = m₂ = 8, shown as m = 8 in FIG. 7, 512 PMEs are available, and such a cluster serves as the building block for systems of several sizes (330, 340, 350).

【０２０９】４次元アレイ編成：大規模な密システムを
構築するときは、＋Ｚポートおよび−Ｚポートを使用
し、ｍ₃クラスタの組を列として相互接続する。ｍ₄列を
さらに、＋Ｗポートおよび−Ｗポートを使って相互接続
する。ｍ₁＝．．．ｍ₄＝８の場合、こうすると３２７６
８すなわち８⁴⁺¹個のＰＭＥを備えたシステムが得られ
る。この編成では、あらゆる次元が図７のように（大規
模な密並列プロセッサ３７０）等密度である必要はな
い。小規模な密プロセッサの場合、１つのクラスタだけ
を使用し、未使用のＺポートおよびＷポートをカード上
で相互接続することができる。この技術によって、カー
ド・コネクタ・ピンが節約され、コネクタ・ピンの制限
のある、ワークステーション３４０、３５０およびアビ
オニックス・アプリケーション３３０にこのスケーリン
グ可能プロセッサが適用できるようになる。ＺとＷを対
にして＋／−ポートを接続すると、デバッグ、テスト、
および大規模マシン・ソフトウェア開発が可能なワーク
ステーション編成が得られる。4-dimensional array organization: When building large dense systems, sets of m₃ clusters are interconnected as columns using the +Z and -Z ports. The m₄ columns are then interconnected using the +W and -W ports. With m₁ = ... = m₄ = 8, this yields a system with 32,768, or 8^(4+1), PMEs. In this organization, not every dimension needs to be equally dense, as FIG. 7 shows (large dense parallel processor 370). For a small dense processor, only one cluster might be used, with the unused Z and W ports interconnected on the card. This technique saves card connector pins and makes this scalable processor applicable to workstations 340, 350 and avionics applications 330, which have connector pin limitations. Connecting the +/- ports of Z and W as pairs yields a workstation organization that allows debugging, testing, and large-machine software development.
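The PME counts quoted for the cluster and for the full four-dimensional array follow from multiplying the ring sizes by the eight PMEs in each node. A minimal sketch of that arithmetic is given below; the function name `pme_count` is illustrative.

```python
from math import prod

PMES_PER_NODE = 8  # from the text: eight PMEs per node

def pme_count(ring_sizes):
    """Total PMEs when nodes are arranged in rings of sizes m1, m2, ..."""
    return PMES_PER_NODE * prod(ring_sizes)

CLUSTER_PMES = pme_count([8, 8])        # m1 = m2 = 8: one cluster
SYSTEM_PMES = pme_count([8, 8, 8, 8])   # m1 = ... = m4 = 8: full 4D array
```

The same function covers unequally populated dimensions, for example `pme_count([8, 8, 2, 1])` for a 2 × 1 array of clusters.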

【０２１０】さらに、ｍ＝８より小さな値で構造を生成
すれば、はるかに小さなスケール・バージョンの構造を
開発できる。こうすれば、ＰＳ／２またはＲＩＳＣシス
テム６０００ワークステーション３５０におけるアクセ
レレータの要件に適合する単一のカード・プロセッサが
構築可能である。Furthermore, much smaller-scale versions of the structure can be developed by generating the structure with values smaller than m = 8. This makes it possible to build a single-card processor that meets the requirements for an accelerator in a PS/2 or RISC System/6000 workstation 350.

【０２１１】入出力性能：入出力性能には、セットアッ
プ転送に対するオーバヘッドと、実バースト速度データ
移動が含まれる。セットアップのオーバヘッドは、アプ
リケーション機能入出力の複雑性と、ネットワークの争
奪によって決まる。たとえば、アプリケーションは、バ
ッファリングによって回線交換トラフィックをプログラ
ミングして衝突を解決することも、すべてのＰＭＥを左
に移して同期化することもできる。最初の例では、入出
力が主要なタスクであり、そのサイズ決めには詳細な分
析が使用される。本発明者等の見積りでは、単純な例の
セットアップ・オーバヘッドで、２０〜３０クロック・
サイクルまたは０．８〜１．２マイクロ秒である。Input/output performance: Input/output performance includes the overhead to set up transfers and the actual burst-rate data movement. Setup overhead depends on the complexity of the application's input/output and on network contention. For example, an application can program circuit-switched traffic with buffering to resolve conflicts, or it can shift all PMEs to the left and synchronize. In the first case, input/output is a major task and detailed analysis would be used to size it. We estimate the setup overhead for simple cases at 20 to 30 clock cycles, or 0.8 to 1.2 microseconds.

【０２１２】バースト速度入出力は、ＰＭＥがデータを
転送できる最高速度である（チップ上の隣接ＰＭＥまた
は外部の隣接ＰＭＥへの転送）。メモリ・アクセス限界
により、データ速度は１バイト当たり１４０ナノ秒に設
定されている。これは、７．１４ＭＢ／秒に該当する。
この性能には、バッファ・アドレスおよびカウント処理
と、データ読取り／書込みが含まれる。この性能では、
転送される１６ビット・ワード当たり７つの４０ナノ秒
サイクルが使用される。Burst-rate input/output is the maximum rate at which a PME can transfer data (to an on-chip neighbor or to an external neighbor). Memory access limits set the data rate at 140 nanoseconds per byte, which corresponds to 7.14 MB/s. This performance includes buffer address and count processing as well as data read/write. It uses seven 40-nanosecond cycles per 16-bit word transferred.

【０２１３】このバースト速度性能は、最高転送速度が
３．６５ＧＢ／秒であるクラスタに該当する。すなわ
ち、クラスタの行または列に沿った８個１組のノード
が、８個１組の使用可能なポートを使用して５７ＭＢ／
秒のバースト・データ速度を実現することになる。ラッ
プされたクラスタのエッジを、論理的に「ジップ解除」
し、外部システム・バスと接続することにより、外部と
の入出力が行われるので、この数字は重要である。This burst-rate performance corresponds to a cluster with a maximum transfer rate of 3.65 GB/s. It also means that a set of eight nodes along a row or column of the cluster achieves a burst data rate of 57 MB/s using a set of eight available ports. This number is significant because input/output with the outside world is done by logically "unzipping" an edge of the wrapped cluster and attaching it to the external system bus.
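The burst-rate figures above can be reproduced from the stated cycle time and cycle count. The sketch below is only a check of that arithmetic; the constant and function names are illustrative, and 1 MB is taken as 10^6 bytes.

```python
CYCLE_NS = 40          # clock cycle time from the text
CYCLES_PER_WORD = 7    # cycles used per 16-bit word transferred
BYTES_PER_WORD = 2

def ns_per_byte():
    """Transfer time per byte, in nanoseconds."""
    return CYCLE_NS * CYCLES_PER_WORD / BYTES_PER_WORD

def pme_rate_mb_s():
    """Burst rate of one PME port in MB/s (1 MB = 10**6 bytes)."""
    return 1000 / ns_per_byte()

def edge_rate_mb_s(ports=8):
    """Burst rate of one cluster edge using `ports` ports in parallel."""
    return ports * pme_rate_mb_s()

def cluster_rate_gb_s(pmes=512):
    """Aggregate burst rate of a 512-PME cluster in GB/s."""
    return pmes * pme_rate_mb_s() / 1000
```

Seven 40 ns cycles per two-byte word give 140 ns per byte, which is where the per-PME, per-edge, and per-cluster figures in the text come from.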

【０２１４】ＰＭＥ間経路指定プロトコル：ＳＩＭＤ／
ＭＩＭＤＰＭＥは、外部入出力機構とのプロセッサ間
通信機構、ＢＣＩ、ならびに同一のＰＭＥ内でＳＩＭＤ
動作とＭＩＭＤ動作をどちらも可能にする切替え機能を
備えている。ＰＭＥには、プロセッサ通信およびＰＭＥ
間のデータ転送用の、完全分散型プログラマブル入出力
ルータが組み込まれている。Inter-PME routing protocol: The SIMD/MIMD PME comprises interprocessor communication to the external input/output facilities, a BCI, and a switching feature that allows both SIMD and MIMD operation within the same PME. The PME incorporates a fully distributed programmable input/output router for processor communication and data transfer between PMEs.

【０２１５】ＰＭＥは、オンチップＰＭＥ、ならびに修
正ハイパーキューブ構成内の相互接続されたＰＭＥに接
続された外部入出力機構への、完全分散型プロセッサ間
通信ハードウェアを有している。このハードウェアは、
ソフトウェアを介して入出力活動を制御する、ＰＭＥの
柔軟なプログラミング可能性で補足されている。プログ
ラマブル入出力ルータ機能によって、データ・パケット
およびパケット・アドレスの生成が可能になる。ＰＭＥ
は、ＰＭＥのネットワークを介して指定の方法で、また
はフォールト・トレランス要件によって決定される複数
の経路によって、この情報を送信できる。The PME has fully distributed interprocessor communication hardware to the on-chip PMEs as well as to the external input/output facilities that connect to the interconnected PMEs in the modified hypercube configuration. This hardware is complemented by the flexible programmability of the PME, which controls input/output activity through software. The programmable input/output router function makes it possible to generate data packets and packet addresses. A PME can send this information through the network of PMEs in a directed manner, or over multiple paths as determined by fault tolerance requirements.

【０２１６】分散型フォールト・トレランス・アルゴリ
ズムまたはプログラム・アルゴリズムは、プログラミン
グ可能性や、ＰＭＥのサポートされる回線交換モードな
どの利点を利用できる。この性能組合せモードによっ
て、プログラマブル入出力ルータを介して、オフライン
ＰＭＥ、最適経路データ構造などあらゆるものを実現で
きる。A distributed fault tolerance algorithm or program algorithm can take advantage of the programmability and the circuit switched modes supported by the PME. This performance combination mode makes it possible to implement all things such as an offline PME, an optimal path data structure, etc. via a programmable input / output router.

【０２１７】本発明者等のアプリケーションの調査か
ら、ＰＭＥ間で生データを送信するのがもっとも効率的
なことがあることが分かっている。アプリケーションが
データおよび経路指定情報を必要とする場合もある。さ
らに、ネットワークの衝突が発生しないように通信を計
画できることもある。アプリケーションによっては、中
間ノードにメッセージをバッファするための機構を設け
ないかぎり、デッドロックの可能性がなくならない。極
端な例を２つ示す。ＰＤＥ法の緩和フェーズでは、各格
子点をノードに割り振ることができる。隣接ノードから
データを獲得する内部ループ・プロセスは、ノード全体
にわたって容易に同期化できる。また、イメージ変換で
は、ローカル・データ・パラメータを使用して通信の発
信先または発信元の識別子がが決定される。その結果、
複数のＰＭＥ間でデータが移動され、各ＰＭＥが各パケ
ットの経路指定タスクに関与するようになる。そのよう
なトラフィックの事前計画は一般的に不可能である。Our study of applications shows that it is sometimes most efficient to send raw data between PMEs. At other times, applications need data together with routing information. In addition, communication can sometimes be planned so that no network conflicts occur. In some applications, the possibility of deadlock cannot be eliminated unless a mechanism is provided for buffering messages at intermediate nodes. Two examples illustrate the extremes. In the relaxation phase of a PDE solution, each grid point can be allocated to a node. The inner-loop process of acquiring data from neighboring nodes can easily be synchronized across all nodes. In image transformation, by contrast, local data parameters are used to determine the identifier of the destination or source of a communication. As a result, data moves among multiple PMEs, and each PME becomes involved in the routing task for each packet. Planning such traffic in advance is generally impossible.

【０２１８】ネットワークがあらゆる種類の転送要件に
対して効率的になるように、本発明では、ハードウェア
とソフトウェアの間で、ＰＭＥ間のデータ経路指定の責
任を分割している。ソフトウェアは、大部分のタスク順
序付け機能を実行する。本発明では、内部ループ転送を
行い、外部ループ上でのソフトウェア・オーバヘッドを
最小限に抑える特殊機能をハードウェアに追加した。To make the network efficient for all types of transfer requirements, we divide the responsibility for routing data between PMEs between hardware and software. Software performs most of the task sequencing functions. We added special features to the hardware that perform the inner-loop transfers and minimize software overhead on the outer loops.

【０２１９】専用割込みレベルの入出力プログラムがネ
ットワークを管理する。大部分のアプリケーションで
は、ＰＭＥが４つの割込みレベルを４個の隣接ＰＭＥか
らのデータ受信専用にしている。各レベルのバッファを
開く際は、そのレベルのレジスタをロードし、ＩＮ命令
（バッファ・アドレスおよび転送カウントを使用する
が、データ受信を待たない）とＲＥＴＵＲＮ命令の対を
実行する。ハードウェアはさらに、特定の入力バスから
ワードを受け入れ、バッファに格納する。さらに、バッ
ファ満杯条件によって割込みが発生し、ＲＥＴＵＲＮ後
の命令にプログラム・カウンタが復元される。割込みレ
ベルのためのこの手法では、割込みの原因が何かをテス
トする必要がない入出力プログラムを書くことができ
る。プログラムは、データを読み取り、リターンを行っ
てから、読み取ったデータの処理に移る。たいていの状
況ではレジスタのセーブがほとんどあるいはまったく必
要ないので、転送オーバヘッドは最低になる。ＰＤＥの
例の場合のように、アプリケーションが入出力に対して
同期化を使用する場合、プログラムによってその機能を
提供できる。Input/output programs at dedicated interrupt levels manage the network. In most applications, a PME dedicates four interrupt levels to receiving data from its four neighbors. A buffer is opened at each level by loading the registers of that level and executing an IN instruction (which uses the buffer address and transfer count but does not wait for data to arrive) followed by a RETURN instruction. The hardware then accepts words from the particular input bus and stores them in the buffer. The buffer-full condition then raises an interrupt and restores the program counter to the instruction after the RETURN. This use of interrupt levels allows input/output programs to be written that do not need to test what caused the interrupt. The program reads data, returns, and then goes on to process the data it has read. Because most situations require little or no register saving, transfer overhead is minimized. When an application uses synchronization on input/output, as in the PDE example, the program can provide that function.
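The receive mechanism described above can be modeled in software to show the control flow: a buffer is opened without waiting, the hardware fills it word by word, and the buffer-full condition triggers the handler for that level alone. The sketch below is a behavioral model only, not PME code; the class `ReceiveLevel` and its method names are hypothetical.

```python
class ReceiveLevel:
    """Behavioral model of one interrupt level dedicated to one input bus."""

    def __init__(self, on_full):
        self.on_full = on_full  # handler run on the buffer-full interrupt
        self.buffer = None      # None means no buffer is open
        self.count = 0

    def open_buffer(self, count):
        """Model of the IN/RETURN pair: arm the level and return at once."""
        self.buffer = []
        self.count = count

    def hardware_word(self, word):
        """Model of the hardware storing one incoming word in the buffer."""
        if self.buffer is None:
            raise RuntimeError("no buffer open on this level")
        self.buffer.append(word)
        if len(self.buffer) == self.count:
            data, self.buffer = self.buffer, None
            self.on_full(data)  # the handler need not test the interrupt cause

def receive_all(words, count):
    """Feed words to one level, reopening the buffer after each interrupt."""
    blocks = []
    level = ReceiveLevel(blocks.append)
    for word in words:
        if level.buffer is None:
            level.open_buffer(count)
        level.hardware_word(word)
    return blocks
```

Because each level serves exactly one neighbor, the handler in this model never inspects where the data came from, which mirrors the point made in the text about not testing the interrupt cause.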

【０２２０】書込み動作は複数の方法で開始できる。Ｐ
ＤＥの例の場合、隣接ＰＭＥに結果を送信する時点で、
アプリケーション・レベルのプログラムが書込み呼出し
を実行する。この呼出しで、バッファ位置、ワード・カ
ウント、および宛先アドレスが提供される。書込みサブ
ルーチンは、レジスタ・ロードと、ハードウェアを起動
しアプリケーションに戻るのに必要なＯＵＴ命令を含ん
でいる。ハードウェアは、実際のバイトごとのデータ転
送を実行する。さらに複雑な出力要件では、最上位割込
みレベルの出力サービス機能が使用される。アプリケー
ション・レベルのタスクと割込みレベルのタスクはどち
らも、ソフトウェア割込みを介してそのサービスにアク
セスする。Write operations can be started in several ways. In the PDE example, when it is time to send results to a neighboring PME, the application-level program executes a write call. The call supplies the buffer location, word count, and destination address. The write subroutine contains the register loads and OUT instructions needed to start the hardware and return to the application. The hardware performs the actual byte-by-byte data transfer. More complex output requirements use an output service function at the highest interrupt level. Both application-level and interrupt-level tasks access that service through a software interrupt.

【０２２１】回線交換経路のセットアップは、これらの
単純な読取り動作および書込み動作に基づいている。本
発明ではまず、すべてのＰＭＥが、パケット・ヘッダを
受け入れるが、データは受け入れないようにサイズ決め
されたバッファを開く。データ送信の必要があるＰＭＥ
は、アドレスが宛先のアドレスにより近い隣接ＰＭＥに
アドレス／データ・ブロックを送信して転送を始める。
隣接ＰＭＥでは、アドレス情報が格納される。バッファ
満杯条件によって、割込みが発生する。割込みソフトウ
ェアは、宛先アドレスをテストし、バッファを拡張して
データを受け入れるか、あるいは出力ポートに宛先アド
レスを書き込み、さらにＣＴＬレジスタを透過データ移
動用に設定する（これにより、ＰＭＥは、アプリケーシ
ョンの実行を回線交換ブリッジング動作とオーバラップ
させることができる）。ＣＴＬレジスタはビジー状態に
なり、データ・ストリームの終りにある信号によってリ
セットされるか、ＰＭＥプログラミングで異常なリセッ
トを実行されるまで透過的である。以上の内容には、任
意の数の変更が可能である。Setting up a circuit-switched path builds on these simple read and write operations. First, every PME opens a buffer sized to accept a packet header but not the data. A PME that needs to send data starts the transfer by sending an address/data block to a neighboring PME whose address is closer to the destination address. The address information is stored in the neighboring PME. The buffer-full condition raises an interrupt. The interrupt software tests the destination address and either extends the buffer to accept the data, or writes the destination address to an output port and then sets the CTL register for transparent data movement (this allows the PME to overlap execution of its application with the circuit-switched bridging operation). The CTL register goes busy and remains transparent until it is reset by a signal at the end of the data stream or by an abnormal reset from PME programming. Any number of variations on the above are possible.
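The path set-up described above forwards the header hop by hop to a neighbor that is closer to the destination. One way to picture this is the sketch below, which works on the simplified base-b addresses and assumes that "closer" means one ring step nearer in the lowest digit that still differs. That distance rule and the function names `next_hop` and `setup_path` are assumptions for illustration, not the routing software of the specification.

```python
def next_hop(current, destination, b, n):
    """Neighbor of `current` that is one ring step nearer to `destination`."""
    for d in range(n):
        weight = b ** d
        cur_digit = (current // weight) % b
        dst_digit = (destination // weight) % b
        if cur_digit != dst_digit:
            forward = (dst_digit - cur_digit) % b
            step = 1 if forward <= b - forward else -1  # shorter way round
            new_digit = (cur_digit + step) % b
            return current + (new_digit - cur_digit) * weight
    return current  # already at the destination

def setup_path(source, destination, b, n):
    """List of addresses visited while the header travels to the destination.

    Intermediate addresses model PMEs that would switch to transparent mode;
    the last address models the PME that extends its buffer for the data.
    """
    path = [source]
    while path[-1] != destination:
        path.append(next_hop(path[-1], destination, b, n))
    return path
```

In this model every address strictly between the source and the destination corresponds to a PME that bridges the stream transparently while continuing its own application.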

【０２２２】システム入出力およびアレイ・ディレク
タ：図１３は、好ましい実施例のアレイ・ディレクタを
示している。このアレイ・ディレクタは、システム・バ
スとアレイの接続について記述した図１４および１５の
制御装置の機能を実行できる。図１４はクラスタとやり
取りするバスを示し、図１５はＰＭＥとやり取りするバ
ス上での情報の通信を示している。アレイのロードまた
はアンロードは、クラスタのエッジをシステム・バスに
接続することによって行われる。複数のクラスタで複数
のシステム・バスをサポートできる。各クラスタは、５
０〜５７ＭＢ／秒の帯域幅をサポートする。並列アレイ
をロードまたはアンロードするには、すべてのＰＭＥま
たはそのサブセットと標準バス（すなわち、マイクロチ
ャネル、ＶＭＥバス、フューチャーバス）の間でデータ
を移動する必要がある。それらのバスは、ホスト・プロ
セッサまたはアレイ制御装置の一部であり、厳密に指定
されていると想定される。したがって、ＰＭＥアレイは
バスに適合しなければならない。ＰＭＥアレイは、ｎ個
のＰＭＥ上にバス・データをインタリーブすることによ
り任意のバスの帯域幅に一致させることができる。ここ
で、ｎはＰＭＥの入出力および処理時間の両方が可能に
なるように選択する。図１４および１５は、クラスタの
２つのエッジでシステム・バスをＰＭＥに接続する方法
を示している。そのような手法により、１１４ＭＢ／秒
をサポートできる。また、半分のピーク速度で２つのエ
ッジにデータを同時にロードすることもできる。これに
より帯域幅は１クラスタ当り５７ＭＢ／秒に減少する
が、アレイ内で直交データ移動が可能になり、２つのバ
ス間でデータの引渡しを行うことができるようになる
（この利点を使用して、高速転置および行列乗算動作を
提供する）。System input/output and array director: FIG. 13 shows the array director of the preferred embodiment. The array director can perform the functions of the controller of FIGS. 14 and 15, which describe how the system bus is connected to the array. FIG. 14 shows the bus to and from a cluster, and FIG. 15 shows the communication of information on the bus to and from a PME. The array is loaded or unloaded by connecting the edges of clusters to the system bus. Multiple system buses can be supported with multiple clusters. Each cluster supports a bandwidth of 50 to 57 MB/s. Loading or unloading the parallel array requires moving data between all or a subset of the PMEs and a standard bus (i.e., Micro Channel, VMEbus, Futurebus). Those buses are part of the host processor or array controller and are assumed to be rigidly specified. The PME array must therefore be adapted to the bus. The PME array can be matched to the bandwidth of any bus by interleaving the bus data onto n PMEs, where n is chosen to allow for both PME input/output and processing time. FIGS. 14 and 15 show how the system bus is connected to the PMEs at two edges of a cluster. Such an approach can support 114 MB/s. Alternatively, data can be loaded into the two edges simultaneously at half the peak rate. This reduces the bandwidth to 57 MB/s per cluster, but it allows orthogonal data movement within the array and lets data be passed between the two buses (we use this advantage to provide fast transpose and matrix multiply operations).

【０２２３】図１４に示すように、バスはクラスタのエ
ッジ上のすべての経路に接続され、制御装置は、各経路
に対し必要なインタリービング・タイミングでゲート信
号を生成する。５７ＭＢ／秒を上回る速度でシステム・
バスに接続する必要がある場合、データは複数のクラス
タにわたってインタリーブされる。たとえば、２００Ｍ
Ｂ／秒のシステム・バスが必要なシステムでは、２個ま
たは４個のクラスタから成るグループを使用する。大規
模なＭＰＰには、１６ないし６４個のそのようなバスを
ｘｙネットワーク経路に接続する機能がある。ｘ経路お
よびｙ経路だけでなくｗ経路およびｚ経路を使用すれ
ば、この数を２倍にできる。As shown in FIG. 14, the bus connects to all the paths on the edge of the cluster, and the controller generates gate signals for each path with the required interleave timing. When a system bus faster than 57 MB/s must be connected, the data is interleaved across multiple clusters. For example, a system requiring a 200 MB/s system bus would use groups of two or four clusters. A large MPP has the capability of connecting 16 to 64 such buses to its xy network paths. Using the w and z paths in addition to the x and y paths can double this number.
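The group sizes in the 200 MB/s example are consistent with dividing the bus rate by the per-cluster rate and rounding up. The sketch below shows that arithmetic under the assumption that a cluster delivers 57 MB/s through one edge or 114 MB/s through two; the function name `clusters_needed` is illustrative.

```python
import math

def clusters_needed(bus_mb_s, cluster_mb_s):
    """Smallest number of clusters whose combined rate covers the bus rate."""
    return math.ceil(bus_mb_s / cluster_mb_s)

ONE_EDGE = clusters_needed(200, 57)    # one 57 MB/s edge per cluster
TWO_EDGES = clusters_needed(200, 114)  # two edges per cluster, 114 MB/s
```

Read this way, a group of four clusters covers the bus with one edge each, and a group of two covers it when both edges of each cluster are attached.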

【０２２４】図１５は、データが個々のＰＭＥにどのよ
うに経路指定されるかを示す。この図は、７．１３ＭＢ
／秒でバースト・モードで動作できる１つの特定のｗ経
路、ｘ経路、ｙ経路、またはｚ経路を示している。シス
テム・バス上のデータがバースト単位で発生する場合
と、ＰＭＥメモリに完全なバーストを含めることができ
る場合は、必要なＰＭＥは１つだけである。本発明で
は、これらの条件がいずれも必要にならないようにＰＭ
Ｅ入出力構造を設計した。バッファ満杯条件が発生する
まで、データを最高速度でＰＭＥ×０にゲート入力でき
る。データ満杯条件が発生した瞬間、ＰＭＥ×０が透過
モードに変わり、ＰＭＥ×１がデータの受入れを開始す
る。ＰＭＥ×０内で、入力データ・バッファの処理が開
始できる。データを取りそれを処理したＰＭＥには、透
過モードの間結果を伝送できなくなるという制限があ
る。本設計では、データ・ストリームを間隔を置いて対
向側端部に切り替えることによってこれを解決してい
る。図１５に示すとおり、ソフトウェアの制御下で、Ｐ
ＭＥ×１２〜ＰＭＥ×１５が結果をアンロードする間、
ＰＭＥ×０〜ＰＭＥ×３をデータの受入れ専用にし、Ｐ
ＭＥ×０〜ＰＭＥ×３が結果をアンロードする間、ＰＭ
Ｅ×１２〜ＰＭＥ×１５をデータの受入れ専用にするこ
とができる。制御装置がワードをカウントし、データ・
ストリームにブロック終了信号を追加すると、方向が切
り替わる。制御装置によってサポートされるすべての経
路に１つのカウントが適用されるので、制御装置の作業
負荷は妥当である。FIG. 15 shows how data is routed to individual PMEs. The figure shows one particular w, x, y, or z path that can operate in burst mode at 7.13 MB/s. If the data on the system bus arrived in bursts, and if the PME memory could hold a complete burst, only one PME would be required. The PME input/output structure of the invention was designed so that neither condition is necessary. Data can be gated into PME×0 at full speed until a buffer-full condition occurs. At that moment PME×0 switches to transparent mode and PME×1 begins accepting data. Processing of the input data buffer can then begin within PME×0. A PME that has taken data and processed it is restricted in that it cannot transmit results while it is in transparent mode. The design solves this by switching the data stream to the opposite end of the path at intervals. As shown in FIG. 15, under software control, PME×0 through PME×3 can be dedicated to accepting data while PME×12 through PME×15 unload results, and PME×12 through PME×15 can be dedicated to accepting data while PME×0 through PME×3 unload results. The direction switches when the controller counts the words and adds a block-end signal to the data stream. Because a single count applies to all of the paths supported by the controller, the controller workload is reasonable.
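The fill-then-go-transparent behaviour described above can be sketched in a few lines. This is a minimal model under stated assumptions, not the patented hardware: the `PME` class, the `capacity` field (buffer size in words), and the `gate_in` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    name: str
    capacity: int                 # buffer size in words (hypothetical value)
    buffer: list = field(default_factory=list)
    transparent: bool = False     # True once the buffer is full


def gate_in(chain, words):
    """Feed each word into the first PME on the path that is not yet transparent."""
    for word in words:
        target = next((p for p in chain if not p.transparent), None)
        if target is None:
            raise OverflowError("every PME on the path is full")
        target.buffer.append(word)
        if len(target.buffer) == target.capacity:
            # Buffer full: later words pass through to the next PME.
            target.transparent = True


if __name__ == "__main__":
    chain = [PME(f"PMEx{i}", capacity=4) for i in range(4)]
    gate_in(chain, range(10))
    for pme in chain:
        state = "transparent" if pme.transparent else "accepting"
        print(pme.name, pme.buffer, state)
```

The sketch models only the input direction; the periodic switch of the data stream to the opposite end of the path is left out.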

【０２２５】代替コンピュータ用システム：図２０は、
単一のアプリケーション・プロセッサ・インタフェース
（ＡＰＩ）を備えた、ホスト接続大規模システムのシス
テム・ブロック図を示す。複数のアプリケーション・プ
ロセッサ・インタフェース（図示せず）を使用するスタ
ンドアロン・システムで本発明を使用できるという了解
のもとでこの図を見ることもできる。この構成は、すべ
てのクラスタまたは多数のクラスタ上でＤＡＳＤ／グラ
フィックスをサポートする。ワークステーション・アク
セレレータを使用すると、図のホスト、アプリケーショ
ン・プロセッサ・インタフェース（ＡＰＩ）、およびク
ラスタ・シンクロナイザ（ＣＳ）は不要になる。クラス
タ・シンクロナイザはかならずしも必要ではない。必要
かどうかは、実行する処理の種類と、本発明を使用する
特定のアプリケーションに提供される物理ドライブまた
は電源によって決まる。主としてＭＩＭＤ処理を実行す
るアプリケーションが制御装置に課す作業負荷要求はそ
れほど高くないので、この場合、制御バスのパルス立上
り時間が非常に長くなることがある。逆に、多数の独立
したグループ化により主として非同期のＡ−ＳＩＭＤ動
作を実行するシステムでは、より高速の制御バス機能が
必要になることがある。この場合、クラスタ・シンクロ
ナイザが好ましい。Alternative computer systems: FIG. 20 shows a system block diagram of a large host-attached system with a single application processor interface (API). The figure can also be read with the understanding that the invention can be used in a stand-alone system that uses multiple application processor interfaces (not shown). This configuration supports DASD/graphics on all clusters or on many clusters. With a workstation accelerator, the host, application processor interface (API), and cluster synchronizer (CS) shown in the figure are not needed. A cluster synchronizer is not always required. Whether it is needed depends on the type of processing to be performed and on the physical drive or power supply provided for the particular application of the invention. An application that performs mainly MIMD processing places only modest workload demands on the controller, so in that case the control bus can tolerate very slow pulse rise times. Conversely, a system that performs mainly asynchronous A-SIMD operations with many independent groupings may need a faster control bus. In that case a cluster synchronizer is preferable.

【０２２６】図２０のシステム・ブロック図は、システ
ムがホスト、アレイ制御装置、およびＰＭＥアレイから
構成できることを示している。ＰＭＥアレイは、１組の
クラスタ制御装置（ＣＣ）によってサポートされる１組
のクラスタである。各クラスタごとに１つのクラスタ制
御装置を示しているが、この関係が必ず必要なわけでは
ない。クラスタとクラスタ制御装置との実際の比率には
柔軟性がある。クラスタ制御装置は、６４個のＢＣＩ／
クラスタへの再駆動と、それからの累積を提供する。し
たがって、物理パラメータを使用して、最大比率を確立
することができる。さらに、クラスタ制御装置はＰＭＥ
アレイの複数のサブセットを個別に制御でき、このサー
ビスがゲート処理要件となることもある。調査を行っ
て、本発明の特定のアプリケーションに対するこれらの
要件を決定することができる。使用するクラスタ制御装
置のバージョンは２つある。システム・バスに接続する
クラスタでは、クラスタ制御装置がインタリーブ制御機
構（「システム入出力」および図２０参照）および３状
態ドライバを提供する必要がある。３状態バス機能を省
略したより簡単なバージョンも使用できる。大規模シス
テムの場合は、第２段階の再駆動および累積が追加され
る。このレベルはクラスタ・シンクロナイザ（ＣＳ）で
ある。１組のクラスタ制御装置と、クラスタ・シンクロ
ナイザおよびアプリケーション・プロセッサ・インタフ
ェース（ＡＰＩ）からアレイ制御装置が構成される。プ
ログラミング可能な単位はアプリケーション・プロセッ
サ・インタフェースだけである。The system block diagram of FIG. 20 shows that a system can consist of a host, an array controller, and a PME array. The PME array is a set of clusters supported by a set of cluster controllers (CCs). Although one cluster controller is shown per cluster, this one-to-one relationship is not required; the actual ratio of clusters to cluster controllers is flexible. The cluster controller provides redrive to, and accumulation from, the 64 BCIs of each cluster, so physical parameters can be used to establish the maximum ratio. In addition, the cluster controller can individually control multiple subsets of the PME array, and this service can also become the gating requirement. A study can be made to determine these requirements for a particular application of the invention. Two versions of the cluster controller are used. For clusters that connect to the system bus, the cluster controller must provide the interleave controls (see "System I/O" and FIG. 20) and tri-state drivers. A simpler version that omits the tri-state bus function can also be used. For large systems, a second stage of redrive and accumulation is added. This level is the cluster synchronizer (CS). The set of cluster controllers, together with the cluster synchronizer and the application processor interface (API), makes up the array controller. The application processor interface is the only programmable unit.

【０２２７】このシステム統合方式には複数の変形が可
能である。これらの変形によって、各種のアプリケーシ
ョン用の様々なハードウェア構成を形成できるが、その
ようにしてもサポート・ソフトウェアに大きな影響が及
ぶことはない。Several variations of this system integration scheme are possible. They allow different hardware configurations to be built for different applications without significantly affecting the supporting software.

【０２２８】ワークステーション・アクセレレータで
は、クラスタ制御装置がワークステーション・システム
・バスに直接接続される。アプリケーション・プロセッ
サ・インタフェースの機能はワークステーションによっ
て実行される。ＲＩＳＣ／６０００の場合、システム・
バスはマイクロチャネルであり、クラスタ制御装置がワ
ークステーション内のスロットに直接挿入できる。この
構成では、アレイをロード／アンロードするのと同一の
バス上に入出力装置（ＤＡＳＤ、ＳＣＳＩ、およびディ
スプレイ・インタフェース）を配置する。そのため、並
列アレイがリアルタイム・イメージ生成や処理など入出
力中心のタスクに使用できる。他のバス・システム（Ｖ
ＭＥバス、ヒューチャーバスなど）を使用するワークス
テーションでは、ゲートウェイ・インタフェースを使用
する。そのようなモジュールは、市場で容易に入手可能
である。これらの最小規模システムでは、特定の数のク
ラスタ間で単一のクラスタ制御装置を共用でき、クラス
タ・シンクロナイザもアプリケーション・プロセッサ・
インタフェースも必要でない。In the workstation accelerator, the cluster controller attaches directly to the workstation system bus, and the workstation performs the function of the application processor interface. In the RISC/6000, the system bus is the Micro Channel, and the cluster controller can plug directly into a slot in the workstation. This configuration places the I/O devices (DASD, SCSI, and display interfaces) on the same bus that loads and unloads the array. The parallel array can therefore be used for I/O-intensive tasks such as real-time image generation and processing. Workstations that use other bus systems (VMEbus, Futurebus, and so on) use a gateway interface. Such modules are readily available on the market. In these smallest systems, a single cluster controller can be shared among a certain number of clusters, and neither a cluster synchronizer nor an application processor interface is required.

【０２２９】ＭＩＬアビオニクス・アプリケーション
は、ワークステーションとサイズは同程度かもしれない
が、異なるインタフェースが必要である。通常の軍事状
況とはどのようなものか考えていただきたい。既存のプ
ラットフォームを追加の処理機能で拡張しなければなら
ないが、資金の都合で処理システムを完全に再設計する
ことはできない。このため、本発明ではＡＰＡＰアレイ
にスマート・メモリ補助プロセッサを接続する。この場
合、ホストにはメモリに見える特殊アプリケーション・
プログラム・インタフェース（ＡＰＩ）が提供される。
そうすると、ホスト・メモリにアドレスされるデータ
は、クラスタ制御装置を介してアレイまで移動される。
その後メモリに書込みを行うと、アプリケーション・プ
ログラム・インタフェースによってそれが検出され、コ
マンドと解釈され、その結果、アクセレレータはメモリ
・マップされた補助プロセッサに見えるようになる。MIL avionics applications may be similar in size to workstations, but they require different interfaces. Consider the typical military situation: an existing platform must be enhanced with additional processing capability, but funding does not allow the processing system to be completely redesigned. For this case, the invention connects a smart memory coprocessor to the APAP array, and a special application program interface (API) that looks like memory to the host is provided. Data addressed to the host's memory is then moved to the array through the cluster controller. Subsequent writes to memory are detected by the application program interface and interpreted as commands, so that the accelerator appears as a memory-mapped coprocessor.
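The memory-mapped coprocessor idea can be illustrated with a small model. Everything specific here is an assumption made for the example: the address constants `DATA_BASE` and `COMMAND_ADDR`, the class name `SmartMemoryAPI`, and the use of a dictionary to stand in for data moved to the array through the cluster controller. The text does not define an address map.

```python
DATA_BASE = 0x1000      # hypothetical start of the window forwarded to the array
COMMAND_ADDR = 0x0FFC   # hypothetical address watched for commands


class SmartMemoryAPI:
    """Looks like plain memory to the host; routes writes by address."""

    def __init__(self):
        self.array_data = {}   # stands in for data moved via the cluster controller
        self.commands = []     # host writes that were interpreted as commands

    def write(self, address: int, value: int) -> None:
        if address == COMMAND_ADDR:
            # A write to the watched address is treated as a command.
            self.commands.append(value)
        elif address >= DATA_BASE:
            # A write into the data window is forwarded to the array.
            self.array_data[address - DATA_BASE] = value
        else:
            raise ValueError(f"address {address:#x} is not mapped")


if __name__ == "__main__":
    api = SmartMemoryAPI()
    api.write(DATA_BASE + 0, 11)
    api.write(DATA_BASE + 4, 22)
    api.write(COMMAND_ADDR, 1)     # e.g. a hypothetical "start" command code
    print(api.array_data, api.commands)
```

From the host's point of view every operation is an ordinary store, which is what lets an existing platform use the accelerator without a redesign.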

【０２３０】大規模システムは、ホスト接続構成として
もスタンドアロン構成としても開発できる。ホスト接続
システムでは、図２０に示す構成が有用である。ホスト
は入出力を実行し、アプリケーション・プログラム・イ
ンタフェースはディスパッチされたタスク・マネージャ
として作用する。しかし、特別の状況でも大規模スタン
ドアロン・システムが可能である。たとえば、データベ
ース探索システムは、ホストを不要にし、あらゆるクラ
スタのマイクロチャネルにＤＡＳＤを接続するととも
に、複数のアプリケーション・プログラム・インタフェ
ースを、ＰＭＥにスレーブ接続されたバス・マスタとし
て使用する。Large systems can be developed in either host-attached or stand-alone configurations. For a host-attached system, the configuration shown in FIG. 20 is useful: the host performs the input/output, and the application program interface acts as a dispatched task manager. Large stand-alone systems are also possible in special situations. For example, a database search system could eliminate the host, attach DASD to the Micro Channel of every cluster, and use multiple application program interfaces as bus masters slaved to the PMEs.

【０２３１】外部入出力を備えたジッパ・アレイ・イン
タフェース：本発明のジッパは、高速入出力接続方式を
提供するものであり、２つのアレイ・ノード間にスイッ
チを配置することによって実現される。このスイッチに
より、アレイとの並列通信が可能になる。高速入出力
は、アレイ・リングの１つのエッジに沿って実施され、
Ｘリング、Ｙリング、Ｗリング、Ｚリングへの大規模な
ジッパとして機能する。この高速入出力に「ジッパ接
続」という名前が付けられている。ネットワークとの間
でデータを転送できるようにしながら、スイッチ遅延を
追加するだけでプロセッサ間のデータ転送を行う方法
は、他に例のないロード技術である。この切替え方式は
Ｘバス、Ｙバス、Ｗバス、Ｚバスによって作成されたリ
ング・トポロジーに影響を及ぼすことはなく、特殊サポ
ート・ハードウェアにより、処理要素がデータを処理ま
たは経路指定している間にジッパ動作が実行できるよう
になる。Zipper array interface with external I/O: The zipper of the invention provides a high-speed input/output connection scheme and is implemented by placing a switch between two array nodes. This switch permits parallel communication into and out of the array. The high-speed I/O is implemented along one edge of the array rings and acts as a large zipper into the X, Y, W, and Z rings; hence the name "zipper connection". Allowing data to be transferred into and out of the network while adding only a switch delay to transfers between processors is a unique loading technique. The switching scheme does not disturb the ring topology created by the X, Y, W, and Z buses, and special support hardware allows zipper operation to proceed while the processing elements are processing or routing data.

【０２３２】大規模並列システムとの間でデータを高速
にやり取りできれば、システム全体の性能は大幅に改善
される。プロセッサの数やアレイ・ネットワークの次元
を減らさずに高速入出力を実施する本発明の方法は、大
規模並列環境の分野では例がないと本発明者等は考えて
いる。If data can be exchanged at high speed with a large-scale parallel system, the performance of the entire system will be greatly improved. The inventors believe that the method of the present invention for performing high-speed input / output without reducing the number of processors or the dimension of the array network is unprecedented in the field of massively parallel environment.

【０２３３】修正ハイパーキューブ配列を拡張すると、
リング内にリングを備えたトポロジーが可能になる。外
部入出力へのインタフェースをサポートするときは、リ
ングのいずれかまたはすべてを論理的に破壊することが
できる。次いで、破壊されたリングの２つの端部を、外
部入出力バスに接続することができる。リングを破壊す
るのは論理的な動作なので、一定の時間間隔で規則的な
ＰＭＥ間通信が可能であり、同時に他の時間間隔で入出
力が可能になる。修正ハイパーキューブ内のあるレベル
のリングを破壊するこのプロセスにより、リングは入出
力の目的で効果的に「ジップ解除」される。高速入出力
「ジッパ」は、アレイへの別のインタフェースを提供す
る。このジッパは、修正ハイパーキューブの１〜ｎ個の
エッジ上に存在でき、アレイの複数の次元への並列入
力、またはアレイの複数の次元への同報通信のいずれか
をサポートできる。アレイとの間でのその後のデータ転
送は、ジッパに直接接続された２つのノード間で交互に
行うことができる。この入出力手法は他に例がなく、こ
の方法により、特定のアプリケーション要件を満たす様
々なジッパ・サイズが開発できる。たとえば、大規模密
プロセッサ３６０と称する、図７に示す特定の構成で
は、ＺバスおよびＷバスのジッパがＭＣＡバスに接続さ
れる。この手法では、行列転置時間が最適化され、プロ
セッサの特定のアプリケーション要件が満たされる。
「ジッパ」構造についての詳細は、同時出願された"APA
P I/O Zipper Connection"と題する米国特許を参照され
たい。ジッパを図１６に示す。Extending the modified hypercube arrangement allows a topology of rings within rings. To support an interface to external I/O, any or all of the rings can be logically broken, and the two ends of a broken ring can then be connected to the external I/O bus. Because breaking a ring is a logical operation, regular inter-PME communication remains possible during some time intervals while I/O takes place during others. This process of breaking the rings at one level of the modified hypercube effectively "unzips" them for I/O purposes. The high-speed I/O "zipper" provides a separate interface to the array. The zipper can exist on 1 to n edges of the modified hypercube and can support either parallel input to multiple dimensions of the array or broadcast to multiple dimensions of the array. Subsequent data transfers into or out of the array can alternate between the two nodes directly attached to the zipper. This I/O approach is unique, and it allows zippers of different sizes to be developed to meet particular application requirements. For example, in the particular configuration shown in FIG. 7, referred to as the large dense processor 360, the zippers for the Z and W buses are connected to the MCA bus. This approach optimizes the matrix transposition time and so meets that processor's particular application requirement. For more detail on the "zipper" structure, see the co-filed US patent application entitled "APAP I/O Zipper Connection". The zipper is shown in FIG. 16.

【０２３４】構成と、プログラムが個々の処理要素との
間でデータおよびプログラムをやり取りする必要とに応
じて、ジッパのサイズを変えることができる。入出力ジ
ッパの実速度は、接続されたリングの数と、ＰＭＥバス
幅と、ＰＭＥクロック速度を掛けて２で割った値にほぼ
等しい（この除算により、受信ＰＭＥ時間にデータが転
送できる。入出力ジッパはｎ個の場所のどこにでもデー
タを送信できるので、入出力の争奪がアレイ全体で完全
に吸収される）。ＰＭＥ転送速度が５ＭＢ／秒で、ジッ
パ上に６４個のリングを備え、２つのノードのインタリ
ーブを使用する既存の技術では、３２０ＭＢ／秒のアレ
イ転送速度が可能である（図１６の典型的なジッパ構成
を参照）。図１６は、高速入出力、すなわち、アレイへ
の別のインタフェースとして存在するいわゆる「ジッパ
接続」７００、７１０を示している。このジッパは、ア
レイ７５１、７５２、７５３、７５４内の複数のノード
で複数の方向７７０、７８０、７９０、７５１、７５
５、７５７で同報通信バス７２０、７３０、７４０、７
５０上に接続することにより、ハイパーキューブ・ネッ
トワークの１つのエッジ７００または２つのエッジ７０
０、７１０上に存在できる。The size of the zipper can be varied according to the configuration and to the program's need to move data and programs into and out of the individual processing elements. The actual speed of the I/O zipper is approximately the number of attached rings, multiplied by the PME bus width and the PME clock rate, and divided by two (the division leaves the receiving PME time to move the data onward; because the I/O zipper can send data to any of n locations, I/O contention is completely absorbed across the array). With existing technology, that is, a PME transfer rate of 5 MB/s, 64 rings on the zipper, and interleaving across two nodes, an array transfer rate of 320 MB/s is possible (see the typical zipper configuration in FIG. 16). FIG. 16 shows the high-speed I/O, the so-called "zipper connection" 700, 710, which exists as a separate interface to the array. The zipper can exist on one edge 700 or on two edges 700, 710 of the hypercube network by connecting to the broadcast buses 720, 730, 740, 750 at multiple nodes in the arrays 751, 752, 753, 754 and in multiple directions 770, 780, 790, 751, 755, 757.
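The zipper rate estimate can be worked through numerically. The sketch below takes the PME transfer rate (bus width times clock rate) as a single input and treats two-node interleaving as a multiplier on the halved per-node rate. That treatment is one reading that reconciles the stated formula with the quoted 320 MB/s figure; it is an interpretation, not something the text states outright, and the function name `zipper_rate_mb_s` is hypothetical.

```python
def zipper_rate_mb_s(rings: int, pme_rate_mb_s: float, interleaved_nodes: int = 1) -> float:
    """Approximate zipper transfer rate in MB/s.

    pme_rate_mb_s is the PME bus width multiplied by the PME clock rate.
    The division by two reserves half of each PME's time for moving data onward.
    """
    per_node = rings * pme_rate_mb_s / 2
    return per_node * interleaved_nodes


if __name__ == "__main__":
    # Figures quoted in the text: 64 rings, 5 MB/s per PME, two-node interleave.
    print(zipper_rate_mb_s(64, 5.0, interleaved_nodes=2))
    # Same zipper without interleaving, for comparison.
    print(zipper_rate_mb_s(64, 5.0))
```

With the quoted figures the interleaved case reproduces 320 MB/s, and the non-interleaved case gives half of that.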

【０２３５】今日のＭＣＡバスは、毎秒８０〜１６０Ｍ
Ｂの転送速度をサポートするので、単純モードまたは非
インタリーブ・モードの単一ジッパに非常に適してい
る。ただし、実際の転送速度にはチャネル・オーバヘッ
ドがあるので、効率はそれよりもいくらか下がる。これ
よりも入出力要件がはるかに厳しいシステムには、複数
のジッパおよびＭＣＡバスを使用することができる。デ
ータベース・マシンに特徴的な、ノードまたはクラスタ
と結合された大規模な外部記憶域をサポートするプロセ
ッサにとって、これらの技術は重要であると思われる。
このような入出力拡張能力は、まったくこのようなマシ
ン特有のものであり、従来、大規模並列プロセッサや従
来のシングル・プロセッサや疎並列マシンには組み込ま
れていなかった。Today's MCA bus supports transfer rates of 80 to 160 MB/s, which makes it well suited to a single zipper in simple or non-interleaved mode. The actual transfer rate is somewhat lower, however, because of channel overhead. Systems with much more demanding I/O requirements can use multiple zippers and multiple MCA buses. These techniques are expected to be important for processors that support large external storage attached to nodes or clusters, as is characteristic of database machines. This capacity for I/O growth is unique to this kind of machine and has not previously been built into massively parallel processors, conventional single processors, or coarse-grained parallel machines.

【０２３６】アレイ・ディレクタ・アーキテクチャ：本
発明の大規模並列システムは、多重プロセッサ・ノード
のノード構成要素、ノード・クラスタ、およびクラスタ
にすでにパッケージされたＰＭＥのアレイから構成され
ている。これらのパッケージ・システムを制御するた
め、本発明では、ハードウェア制御装置により大規模並
列処理環境でプロセッサ・メモリ要素（ＰＭＥ）アレイ
制御装置機能全体を実行する、システム・アレイ・ディ
レクタを提供する。アレイ・ディレクタは、アプリケー
ション・インタフェース、クラスタ・シンクロナイザ、
および通常はクラスタ制御装置という３つの機能領域か
ら構成されている。アレイ・ディレクタは、同報通信バ
スおよび本発明のジッパ接続を使ってデータおよびコマ
ンドをすべてのＰＭＥに送信して、ＰＭＥアレイを総合
的に制御する。アレイ・ディレクタは、ハードウェアと
の相互作用によりオペレーティング・システムのシェル
としての役割を果す、ソフトウェア・システムとして機
能する。アレイ・ディレクタは、この役割を果す際に、
アプリケーション・インタフェースからコマンドを受信
し、適切なアレイ命令およびハードウェア・シーケンス
を発行して、指定のタスクを実行する。アレイ・ディレ
クタの主な機能は、ＰＭＥに命令を連続的に送り、最適
のシーケンスでデータを経路指定することにより、通信
量を最大に、衝突を最小に保つことである。Array director architecture: The massively parallel system of the invention is built from node building blocks of multiprocessor nodes, clusters of nodes, and arrays of PMEs already packaged in clusters. To control these packaged systems, the invention provides a system array director, which performs the overall processor memory element (PME) array controller functions in the massively parallel processing environment by means of hardware controllers. The array director consists of three functional areas: the application interface, the cluster synchronizer, and, normally, the cluster controller. The array director has overall control of the PME array, using the broadcast bus and the zipper connection of the invention to send data and commands to all of the PMEs. The array director functions as a software system that interacts with the hardware to perform the role of the shell of the operating system. In this role, it receives commands from the application interface and issues the appropriate array instructions and hardware sequences to carry out the designated task. The array director's main function is to feed instructions to the PMEs continuously and to route data in optimal sequences, so that communication throughput stays at a maximum and conflicts at a minimum.

【０２３７】図７に示したＡＰＡＰコンピュータ・シス
テムを、図１３により詳細に示してある。図１３は、図
１４および図１５ならびに図１９および図２０に示すよ
うに、制御装置またはアレイ制御装置として機能できる
アレイ・ディレクタを示している。図１３に示すこのア
レイ・ディレクタ６１０は、ｎ個の同一のアレイ・クラ
スタ６６５、６７０、６８０、６９０と、５１２個のＰ
ＭＥのクラスタ用のアレイ・ディレクタ６１０と、アプ
リケーション・プロセッサ６００用のアプリケーション
・プロセッサ・インタフェース６３０とから成る典型的
な構成のＡＰＡＰの好ましい実施例として示してある。
クラスタ・シンクロナイザ６５０は、クラスタ制御装置
６４０に必要なシーケンスを提供し、クラスタ・シンク
ロナイザ６５０とクラスタ制御装置６４０とで「アレイ
・ディレクタ」６１０を構成している。アプリケーショ
ン・プロセッサ・インタフェース６３０は、ホスト・プ
ロセッサ６００およびテスト／デバッグ・ワークステー
ションをサポートする。１つまたは複数のホストに接続
されたＡＰＡＰ装置では、アレイ・ディレクタ６１０
は、ユーザとＰＭＥのアレイの間のインタフェースとし
て働く。スタンドアロン並列処理マシンとして機能する
ＡＰＡＰでは、アレイ・ディレクタ６１０は、ホスト装
置となり、したがって装置入出力活動に関与するように
なる。The APAP computer system of FIG. 7 is shown in more detail in FIG. 13. FIG. 13 shows the array director, which can function as a controller or array controller as shown in FIGS. 14, 15, 19, and 20. The array director 610 of FIG. 13 is shown in the preferred embodiment of an APAP in a typical configuration, which consists of n identical array clusters 665, 670, 680, 690, the array director 610 for the clusters of 512 PMEs, and an application processor interface 630 for the application processor 600. The cluster synchronizer 650 supplies the required sequences to the cluster controller 640, and the cluster synchronizer 650 and cluster controller 640 together make up the "array director" 610. The application processor interface 630 supports the host processor 600 and a test/debug workstation. In an APAP unit attached to one or more hosts, the array director 610 serves as the interface between the user and the array of PMEs. In an APAP that functions as a stand-alone parallel processing machine, the array director 610 becomes the host unit and is therefore involved in unit I/O activity.

【０２３８】アレイ・ディレクタ６１０は、次の４つの
機能領域から構成されている（図１３の機能ブロック図
参照）。１．アプリケーション・プロセッサ・インタフェース
（ＡＰＩ）６００２．クラスタ・シンクロナイザ（ＣＳ）６５０（クラス
タの８×８アレイ）３．クラスタ制御装置（ＣＣ）６４０（ノードの８×１
アレイ）４．高速入出力（ジッパ接続）６２０The array director 610 consists of the following four functional areas (see the functional block diagram of FIG. 13):
1. Application processor interface (API) 600
2. Cluster synchronizer (CS) 650 (8 × 8 array of clusters)
3. Cluster controller (CC) 640 (8 × 1 array of nodes)
4. High-speed I/O (zipper connection) 620

【０２３９】アプリケーション・プロセッサ・インタフ
ェース（ＡＰＩ）６３０：接続モードで動作するときは、各ホストごとにそれぞれ
１つのアプリケーション・プロセッサ・インタフェース
６３０が使用される。アプリケーション・プロセッサ・
インタフェース６３０は、着信データ・ストリームを監
視して、アレイ・クラスタ６６５、６７０、６８０、６
９０への命令はどれか、および高速入出力（ジッパ）６
２０用のデータはどれかを決定する。スタンドアロン・
モードでは、アプリケーション・プロセッサ・インタフ
ェース６３０は、一次ユーザ・プログラム・ホストとし
て働く。Application processor interface (API) 630: When operating in attached mode, one application processor interface 630 is used for each host. The application processor interface 630 monitors the incoming data stream to determine which items are instructions for the array clusters 665, 670, 680, 690 and which are data for the high-speed I/O (zipper) 620. In stand-alone mode, the application processor interface 630 acts as the primary user program host.

【０２４０】これらの各種要件をサポートするために、
アプリケーション・プロセッサ・インタフェース６３０
は、アレイ・ディレクタ６１０内の唯一のプロセッサ
と、ＡＰＩプログラムおよびコマンドの専用記憶域とを
備えている。ホストから受信される命令は、ＡＰＩサブ
ルーチンの実行、追加機能のＡＰＩメモリへのロード、
または新規ソフトウェアのＣＣメモリおよびＰＭＥメモ
リへのロードを要求できる。ソフトウェアの概要の節で
述べたように、アプリケーション・プロセッサ・インタ
フェース６３０にロードされる初期プログラムを介し
て、これらの各種の要求を一部のユーザだけに制限する
ことができる。すなわち、ロードされるオペレーティン
グ・プログラムによって、提供されるサポートの種類が
決まる。このサポートは、アプリケーション・プロセッ
サ・インタフェース６３０の性能機能に適合するように
調節可能である。したがって、管理され十分テストされ
たサービスを必要とする複数のユーザ、または特定のア
プリケーションに対してピーク性能を実現したい個々の
ユーザのニーズに合わせてＡＰＡＰをさらに調節するこ
とができる。To support these various requirements, the application processor interface 630 contains the only processor in the array director 610, together with dedicated storage for API programs and commands. Instructions received from the host can call for execution of API subroutines, loading of additional functions into API memory, or loading of new software into CC memory and PME memory. As described in the software overview section, these various kinds of request can be restricted to a subset of users through the initial program loaded into the application processor interface 630. In other words, the operating program that is loaded determines the type of support provided. This support can be tailored to match the performance capability of the application processor interface 630. The APAP can thus be further tuned to the needs of multiple users who require managed, well-tested services, or of an individual user who wants peak performance for a particular application.

【０２４１】アプリケーション・プロセッサ・インタフ
ェース６３０はまた、入出力ジッパとの間の経路の管理
を行う。接続モードのホスト・システムまたはスタンド
アロン・モードのデバイスから受け取ったデータは、ア
レイに転送される。この種の動作が開始される前に、入
出力を管理するアレイ内のＰＭＥが開始される。ＭＩＭ
Ｄモードで動作するＰＭＥは、高速割込み機能と、標準
ソフトウェアまたはこの転送用の特殊機能を使用でき、
ＳＩＭＤモードで動作するＰＭＥには、詳細な制御命令
を提供する必要がある。入出力ジッパから送られたデー
タには、これとほぼ逆の調節が必要である。ＭＩＭＤモ
ードで動作するＰＭＥは、高速直列・インタフェースを
介してアプリケーション・プロセッサ・インタフェース
６３０に信号を送り、アプリケーション・プロセッサ・
インタフェース６３０からの応答を待つ必要があり、一
方ＳＩＭＤモードのＰＭＥはすべてアプリケーション・
プロセッサ・インタフェース６３０と既に同期している
ため、データをただちに出力することができる。システ
ムでモードの切替えが可能なので、プログラムをアプリ
ケーションに合わせて調節できる他に例を見ない能力が
提供される。The application processor interface 630 also manages the paths to and from the I/O zipper. Data received from the host system in attached mode, or from devices in stand-alone mode, is forwarded to the array. Before an operation of this kind starts, the PMEs in the array that will manage the I/O are initiated. PMEs operating in MIMD mode can use their fast interrupt capability together with either standard software or special functions for this transfer, whereas PMEs operating in SIMD mode must be supplied with detailed control instructions. Data sent out through the I/O zipper requires nearly the reverse arrangement. PMEs operating in MIMD mode must signal the application processor interface 630 over the high-speed serial interface and wait for its response, whereas PMEs in SIMD mode are already synchronized with the application processor interface 630 and can therefore output data immediately. The system's ability to switch between modes gives a unique ability to tailor a program to the application.

【０２４２】クラスタ・シンクロナイザ（ＣＳ）６５
０：クラスタ・シンクロナイザ６５０は、アプリケーシ
ョン・プロセッサ・インタフェース６３０とクラスタ制
御装置６４０の間のブリッジを提供する。クラスタ・シ
ンクロナイザ６５０は、アプリケーション・プロセッサ
・インタフェース６３の出力をＦＩＦＯスタックに格納
し、クラスタ制御装置６４０から返される状況（並列入
力肯定応答と高速シリアル・バス・データの両方）を監
視して、開始する必要がある所望のルーチンまたは動作
をクラスタ制御装置６４０に適時に提供する。クラスタ
・シンクロナイザ６５０は、クラスタ内で様々なクラス
タ制御装置６４０および様々なＰＭＥをサポートする機
能を提供し、アレイをサブセットに分割できるようにす
る。これを実行するときは、アレイを区分した後、所望
の動作を選択的に転送するように関連クラスタ制御装置
６４０に指令する。クラスタ・シンクロナイザ６５０の
主な機能は、オーバヘッド時間が最小限になるか、また
はＰＭＥ実行時間の中に埋まってしまうように、すべて
のクラスタを動作させ、かつ編成することである。以
上、Ａ−ＳＩＭＤ構成でクラスタ・シンクロナイザ６５
０を使用することが特に好ましい理由について説明し
た。Cluster synchronizer (CS) 650: The cluster synchronizer 650 provides the bridge between the application processor interface 630 and the cluster controllers 640. It stores the output of the application processor interface 630 in a FIFO stack and monitors the status returned from the cluster controllers 640 (both parallel input acknowledgments and high-speed serial bus data) so that it can supply the cluster controllers 640, in a timely manner, with the desired routines or operations to be started. The cluster synchronizer 650 provides the capability to support different cluster controllers 640 and different PMEs within the clusters, so that the array can be divided into subsets. It does this by partitioning the array and then directing the relevant cluster controllers 640 to forward the desired operations selectively. The main function of the cluster synchronizer 650 is to run and organize all of the clusters so that overhead time is minimized or hidden within PME execution time. The reasons why using the cluster synchronizer 650 in an A-SIMD configuration is particularly preferable have been described above.
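The buffering role of the cluster synchronizer can be sketched as a first-in, first-out queue that releases work only when a cluster controller reports that it is ready. The class and method names below (`ClusterSynchronizer`, `accept`, `on_status`) and the single boolean status are hypothetical simplifications; the real status path carries parallel acknowledgments and serial bus data.

```python
from collections import deque


class ClusterSynchronizer:
    """Queues API output and hands it to a cluster controller on demand."""

    def __init__(self):
        self.fifo = deque()

    def accept(self, routine: str) -> None:
        """Store one routine or operation produced by the API."""
        self.fifo.append(routine)

    def on_status(self, controller_ready: bool):
        """Return the oldest queued routine if the controller is ready, else None."""
        if controller_ready and self.fifo:
            return self.fifo.popleft()
        return None


if __name__ == "__main__":
    cs = ClusterSynchronizer()
    cs.accept("load_data")
    cs.accept("run_kernel")
    print(cs.on_status(False))   # controller busy, nothing released
    print(cs.on_status(True))    # oldest routine released first
    print(cs.on_status(True))
```

Keeping the queue between the API and the controllers is what lets dispatch overhead overlap with PME execution time in this simplified picture.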

【０２４３】クラスタ制御装置（ＣＣ）６４０：クラス
タ制御装置６４０は、アレイ・クラスタ６６５中の１組
のノード用のＢＣＩ６０５と相互接続する（１リング当
たり８個のノードを備えた４次元修正ハイパーキューブ
の場合、これは、クラスタ制御装置６４０が８×８ノー
ド・アレイの６４個のＢＣＩ６０５に接続され、５１２
個のＰＭＥを制御していることを示す。やはり８×８ア
レイのそのようなクラスタが６４個あると、３２７６８
個のＰＭＥを備えた完全なシステムとなる）。クラスタ
制御装置６４０は、ＭＩＭＤモードで動作する際、クラ
スタ・シンクロナイザ６５０から供給されたコマンドお
よびデータをＢＣＩ並列ポートに送信し、クラスタ・シ
ンクロナイザ６５０に肯定応答データを返す。ＳＩＭＤ
モードでは、インタフェースは同期的に動作し、ステッ
プごとの肯定応答は必要でない。クラスタ制御装置６４
０はまた、高速直列ポートを管理および監視して、ノー
ド内のＰＭＥがサービスをいつ要求するかを決定する。
そのような要求は、高速シリアル・インタフェースから
の生データが状況ディスプレイ・インタフェースに使用
可能な間に、クラスタ・シンクロナイザ６５０に渡され
る。クラスタ制御装置６４０は、標準速度のシリアル・
インタフェースを介して、クラスタ内の特定のノードへ
のインタフェースをクラスタ・シンクロナイザ６５０に
提供する。Cluster controller (CC) 640: The cluster controller 640 interconnects with the BCIs 605 of a set of nodes in an array cluster 665. (For a four-dimensional modified hypercube with eight nodes per ring, this means that the cluster controller 640 is attached to the 64 BCIs 605 of an 8 × 8 node array and controls 512 PMEs. Sixty-four such clusters, again in an 8 × 8 array, make a complete system of 32768 PMEs.) When operating in MIMD mode, the cluster controller 640 sends the commands and data supplied by the cluster synchronizer 650 to the BCI parallel ports and returns the acknowledgment data to the cluster synchronizer 650. In SIMD mode the interface operates synchronously, and step-by-step acknowledgments are not required. The cluster controller 640 also manages and monitors the high-speed serial ports to determine when PMEs within the nodes are requesting service. Such requests are passed on to the cluster synchronizer 650, while the raw data from the high-speed serial interface is made available to the status display interface. Through the standard-speed serial interface, the cluster controller 640 gives the cluster synchronizer 650 an interface to specific nodes within the cluster.
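The system sizes quoted above follow from simple multiplication, as the short calculation below shows. The value of eight PMEs per node is not stated in this paragraph; it is inferred here from 512 PMEs spread over 64 nodes. The function name `system_size` is hypothetical.

```python
def system_size(nodes_per_ring: int = 8, pmes_per_node: int = 8) -> dict:
    """Sizes of one cluster and of the full system for the 4-D modified hypercube."""
    nodes_per_cluster = nodes_per_ring ** 2      # 8 x 8 node array, one BCI per node
    pmes_per_cluster = nodes_per_cluster * pmes_per_node
    clusters = nodes_per_ring ** 2               # 8 x 8 array of clusters
    return {
        "bcis_per_cluster": nodes_per_cluster,
        "pmes_per_cluster": pmes_per_cluster,
        "clusters": clusters,
        "total_pmes": clusters * pmes_per_cluster,
    }


if __name__ == "__main__":
    for key, value in system_size().items():
        print(key, value)
```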

【０２４４】ＳＩＭＤモードでは、クラスタ制御装置６
４０は、同報通信バス上のすべてのＰＭＥに命令または
アドレスを送るよう指令される。ＳＩＭＤモードのと
き、クラスタ制御装置６４０は、４０ナノ秒ごとにすべ
てのＰＭＥ１６ビット命令をディスパッチできる。ＰＭ
Ｅに固有命令のグループを同報通信することにより、エ
ミュレートされた命令セットが形成される。In SIMD mode, the cluster controller 640 is directed to send instructions or addresses to all of the PMEs over the broadcast bus. In this mode, the cluster controller 640 can dispatch a 16-bit instruction to all PMEs every 40 nanoseconds. An emulated instruction set is formed by broadcasting groups of native instructions to the PMEs.
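The dispatch figures above imply a broadcast issue rate that can be computed directly. The sketch assumes one instruction is issued in every 40 ns interval with no idle slots, which is a best case; the function names are hypothetical.

```python
INSTRUCTION_BITS = 16        # instruction width, from the text
DISPATCH_INTERVAL_NS = 40    # dispatch interval, from the text


def simd_issue_rate() -> int:
    """Instructions broadcast per second at one instruction per interval."""
    return 1_000_000_000 // DISPATCH_INTERVAL_NS


def broadcast_bandwidth_bytes() -> int:
    """Bytes per second carried on the broadcast bus by instructions alone."""
    return simd_issue_rate() * INSTRUCTION_BITS // 8


if __name__ == "__main__":
    print(simd_issue_rate(), "instructions/s")
    print(broadcast_bandwidth_bytes(), "bytes/s")
```

Under that assumption the controller issues 25 million instructions per second, or 50 million bytes per second of instruction traffic.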

【０２４５】ＭＩＭＤモードのとき、クラスタ制御装置
６４０は、ｅｎｄｏｐ信号を待ち、該信号を受信後、Ｐ
ＭＥに新規命令を発行する。ＭＩＭＤモードの概念は、
ＰＭＥに常駐する固有命令でマイクロルーチンの文字列
を構築することである。これらの文字列をまとめて、エ
ミュレートされた命令を形成することができ、かつこれ
らのエミュレートされた命令を組み合わせて、サービス
／キャンド・ルーチンまたはライブラリ関数を作成する
ことができる。In MIMD mode, the cluster controller 640 waits for an endop signal and, after receiving it, issues a new instruction to the PME. The idea behind MIMD mode is to build strings of microroutines out of native instructions that reside in the PME. These strings can be grouped to form emulated instructions, and the emulated instructions can in turn be combined to produce service/canned routines or library functions.

【０２４６】ＳＩＭＤ／ＭＩＭＤ（ＳＩＭＩＭＤ）モー
ドでは、クラスタ制御装置６４０が、ＳＩＭＤモードの
場合と同様に命令を発行し、一定のＰＭＥからのｅｎｄ
ｏｐ信号があるかどうかを検査する。ＭＩＭＤモードの
ＰＭＥは、同報通信命令に応答せず、これらの指定され
た動作を継続する。独特の状況標識が、クラスタ制御装
置６４０がこの動作を管理し、その後の命令をいつどこ
に提供するかを決定するのを助ける。In SIMD/MIMD (SIMIMD) mode, the cluster controller 640 issues instructions as in SIMD mode and checks for endop signals from certain PMEs. PMEs that are in MIMD mode do not respond to the broadcast instructions and continue with their designated operations. Unique status indicators help the cluster controller 640 manage this operation and determine when and where to present subsequent instructions.
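The three issue modes can be contrasted in a small model. This is a sketch under stated assumptions rather than the controller's actual logic: the `PME` class, its `mode` and `endop` fields, and the functions `broadcast` and `service_endops` are hypothetical names, and instructions are represented as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class PME:
    mode: str = "SIMD"                 # "SIMD" or "MIMD"
    executed: list = field(default_factory=list)
    endop: bool = False                # raised when a MIMD routine finishes

    def finish_routine(self) -> None:
        self.endop = True


def broadcast(pmes, instruction):
    """SIMD-style issue: only PMEs in SIMD mode act on the broadcast."""
    for pme in pmes:
        if pme.mode == "SIMD":
            pme.executed.append(instruction)


def service_endops(pmes, next_instruction):
    """MIMD-style issue: a PME receives a new instruction only after endop."""
    served = []
    for index, pme in enumerate(pmes):
        if pme.mode == "MIMD" and pme.endop:
            pme.executed.append(next_instruction)
            pme.endop = False
            served.append(index)
    return served


if __name__ == "__main__":
    pmes = [PME(), PME(mode="MIMD"), PME(mode="MIMD")]
    broadcast(pmes, "ADD")             # ignored by the two MIMD-mode PMEs
    pmes[2].finish_routine()
    print(service_endops(pmes, "NEXT_ROUTINE"))
    print([p.executed for p in pmes])
```

Calling `broadcast` and `service_endops` in the same control loop corresponds to the mixed SIMIMD case, where broadcast issue continues while endop signals from MIMD-mode PMEs are checked.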

【０２４７】オペレーショナル・ソフトウェア・レベ
ル：本明細書では、各種ハードウェア構成要素によって
実行されるサービスについての詳細な説明を行うため
に、オペレーショナル・ソフトウェア・レベルの概要を
示す。Operational software levels: This section gives an overview of the operational software levels so that the services performed by the various hardware components can be described in more detail.

【０２４８】一般的に使用されるコンピュータ・システ
ムはオペレーティング・システムを有する。大部分の大
規模ＭＩＭＤマシンでは比較的完全なオペレーティング
・システム・カーネルを備えねばならない。このような
マシンでは、ワークステーション・クラスのＣＰＵチッ
プがＭａｃｈなどのカーネルを実行する。オペレーティ
ング・システム・カーネルは、メッセージ引渡しまたは
メモリ・コヒーレンシをサポートする。ＳＩＭＤモデル
に基づく他の大規模並列システムはアレイ中に知能をほ
とんどもたない。アレイ中に「プログラム・カウンタ」
がないので、ローカル側で実行されるプログラムはな
い。すべての命令は同報通信される。Computer systems in general use have an operating system. Most large MIMD machines must provide a relatively complete operating system kernel. In such machines, a workstation-class CPU chip runs a kernel such as Mach. The operating system kernel supports message passing or memory coherency. Other massively parallel systems, based on the SIMD model, have almost no intelligence in the array. Because there is no "program counter" in the array, no program executes locally; all instructions are broadcast.

【０２４９】クラスタ・アレイの基礎として本発明のＰ
ＭＥを使用したシステムでは、各チップ、すなわちノー
ドにオペレーティング・システムは必要でない。本発明
では、各処理要素（ＰＭＥ）内に、上位レベルで呼び出
すことができる演算または通信あるいはまたはその両方
用の重要関数のライブラリを提供する。特定の１組のＰ
ＭＥをそれぞれ設定するため、ＳＩＭＤ型の命令がアレ
イに同報通信される。そうすると、これらのＰＭＥは、
完全ＭＩＭＤモードでこれらのライブラリ関数の１つま
たは複数を実行することができる。さらに、各ＰＭＥに
は、ＰＭＥが通信を動的に処理できるようにする、基本
割込みハンドラおよび通信ルーチンが常駐している。既
存のＭＩＭＤマシンと異なり、ＡＰＡＰ構造ではＰＭＥ
メモリにプログラム全体を含める必要はない。本質的に
シリアルであるそのようなコードすべての代わりに、ク
ラスタ制御装置６４０がある。したがって、（通常は）
９０％が空間に関係し１０％が時間に関係するそのよう
なコードを、ＳＩＭＤモードでＰＭＥのアレイに同報通
信することができる。真に並列な内部ループだけがＰＭ
Ｅに動的に分配される。これらのループは、他の「ライ
ブラリ」ルーチンの場合と同様、開始時にＭＩＭＤモー
ドとなる。このため、単一プログラム複数データ型のプ
ログラム・モデルが使用できるようになる。これは、同
一のプログラムが、組込み同期化コードとともに各ＰＭ
Ｅノードにロードされ、ローカルＰＭＥで実行される場
合に使用される。設計パラメータが各種リンク上で使用
可能な帯域幅に影響を及ぼし、システム経路がプログラ
ムによって構成可能なので、ターゲット・ネットワーク
上で高帯域幅リンクが使用でき、しかもオフチップ型の
ＰＭＥ間リンクの動的区画が特定の経路上に提供できる
帯域幅を、特定のアプリケーションのニーズを満たすよ
うに広げることができる。チップから出るリンクは直接
相互に連係され、外部論理は必要でない。リンクが十分
あり、それらを他のどのリンクに接続できるかについて
既定の制約がないので、システムは様々な相互接続トポ
ロジーを持つことができ、経路指定をプログラムによっ
て動的に行うことができる。In a system that uses the PME of the invention as the basis of the cluster array, no operating system is needed at each chip, that is, at each node. Within each processing element (PME), the invention provides a library of key functions for computation, communication, or both, which can be invoked from a higher level. SIMD-type instructions are broadcast to the array to set up each PME of a particular set. Those PMEs can then execute one or more of the library functions in full MIMD mode. In addition, each PME has resident basic interrupt handlers and communication routines that allow it to handle communication dynamically. Unlike existing MIMD machines, the APAP structure does not need to hold the entire program in PME memory. All of the code that is essentially serial resides in the cluster controller 640 instead. Such code, which is (typically) 90% by space and 10% by time, can therefore be broadcast to the array of PMEs in SIMD mode. Only the truly parallel inner loops are distributed dynamically to the PMEs. Like the other "library" routines, these loops are then started in MIMD mode. This makes a single-program multiple-data programming model available, which is used when the same program is loaded into every PME node, with embedded synchronization code, and executed on the local PME. Because design parameters affect the bandwidth available on the various links, and because the system paths are program-configurable, high-bandwidth links can be used on the target network, and dynamic partitioning of the off-chip PME-to-PME links can widen the bandwidth available on a particular path to meet the needs of a particular application. The links that leave the chip connect directly to one another, and no external logic is required. Because there are enough links, and because there is no predefined constraint on which link can connect to which, the system can take on a variety of interconnection topologies, and routing can be done dynamically under program control.

【０２５０】With this system, existing compilers and parsers can be used to create parallel programs that execute on a host or workstation, depending on the configuration. Sequential source code for a single program, multiple data (SPMD) system is first put through program analysis, which examines dependencies, data, and control and augments the program source with call graphs, dependency tables, aliases, usage tables, and the like. Program transformation then produces a modified version of the program that increases the degree of parallelism, either by combining sequences or by pattern recognition that creates explicit compiler directives. The next step is data allocation and partitioning, including message generation. Here data usage patterns are analyzed, and data are allocated so that elements that are combined share common indexing and addressing patterns. These operations yield embedded program compiler directives and calls to communication services. At this point the program passes to the level partitioning step, which separates it into the parts that execute on the array, on the array controller (array director 610 or cluster controller 640), and on the host. The array part is interleaved, section by section, with the required message-passing synchronization functions. Level processing can then proceed. The host source is passed to the level compiler (FORTRAN) for assembly compilation. The controller source is passed to the microprocessor controller compiler. Items that a single PME needs but that are not available through library calls are passed to a parser (FORTRAN or C) and converted into an intermediate-level language representation from which optimized PME code and array controller code are generated. The PME code is generated at the PME machine level and includes library extensions, and the resulting load is passed to PME memory. During execution, the PME parallel program, running as an SPMD process, can call already-coded assembly service functions from the runtime library kernel. Because the APAP can function either as an attached unit that is tightly or loosely coupled to its host or as a stand-alone processor, the upper-level software model has several variations. These variations nevertheless serve to unify the applications, so that a single set of lower-level functions satisfies all three. The attached version of the software is described first, followed by the modifications required for stand-alone mode.

【０２５１】As shown in FIG. 20, in a system in which the APAP is attached to a host processor, the user's main program resides in the host, which hands off to the APAP unit the tasks and associated data needed to achieve the desired load balance. The user chooses whether the program for a dispatched task is interpreted in the host or in the array director 610. Host-level interpretation lets the array director 610 work among interleaved users who do not need tight control of the array. APAP interpretation minimizes latency at control branches but tends to limit the APAP time available for multi-user management tasks. This leads to the concept that the APAP and the host can be either tightly or loosely coupled.

【０２５２】Two extreme examples illustrate this.

【０２５３】1. When the APAP is attached to a 3090-class machine with a floating-point vector facility, user data can be kept in compressed form within the APAP. A host program that calls for a vector operation on two vectors with different sparseness characteristics then sends commands to the APAP to realign the data into element-by-element matching pairs, output the result to the vector facility, read the answer back from the vector facility, and finally reconstruct the data into its final sparse form. Some segments of the APAP interpret and build the sparse-matrix bit maps, while other sections calculate how data must move between PMEs so that it is properly aligned for the zipper.

【０２５４】2. When attached to a small Air Force computer, the APAP can carry the entire workload associated with sensor fusion processing. In this case the host starts the process, passes the sensor data it receives to the APAP, and then waits for results. The array director must then schedule and sequence the PME array through the perhaps dozens of processing steps required to carry out the process.

【０２５５】The APAP supports the following three levels of user control:

【０２５６】1. Casual user. This user works with supplied routines and library functions. The routines are maintained at the host level or the API level and are invoked through subroutine calls in the user's program.

【０２５７】2. Customizer user. This user can write special functions that run within the application processor interface 630 and directly invoke routines supplied by the application processor interface or services supplied by the cluster controller or the PMEs.

【０２５８】3. Development user. This user writes programs that execute in the cluster controller or the PMEs, relying on API services for program load and status feedback.

【０２５９】Satisfying these three levels of users in both tightly and loosely coupled systems requires partitioning the hardware/software control tasks.

【０２６０】API software tasks: The application processor interface 630 provides a software service that tests the leading word of received data to determine whether the data should be interpreted by the application processor interface, loaded into storage within the array director or the PMEs, or passed to the I/O zipper.
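The leading-word test described above can be pictured as a small dispatch routine. The following is only a minimal sketch: the source does not give the header encoding, so the 16-bit word size, the use of the top four bits as a tag, the tag values, and the names `Route`, `TAG_TO_ROUTE`, and `classify` are all assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    INTERPRET = "interpret in the API"
    LOAD = "load into array director or PME storage"
    ZIPPER = "pass to the I/O zipper"


# Hypothetical tag values; the patent text does not specify the encoding.
TAG_TO_ROUTE = {0x1: Route.INTERPRET, 0x2: Route.LOAD, 0x3: Route.ZIPPER}


def classify(words):
    """Return (route, payload) from the top 4 bits of the leading 16-bit word."""
    if not words:
        raise ValueError("empty transfer")
    tag = (words[0] >> 12) & 0xF
    try:
        route = TAG_TO_ROUTE[tag]
    except KeyError:
        raise ValueError(f"unknown header tag {tag:#x}") from None
    # Everything after the leading word is treated as the payload.
    return route, words[1:]
```

A caller would branch on the returned route, for example handing the payload to a loader when the route is `Route.LOAD`.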

【０２６１】When data is to be interpreted, the application processor interface 630 determines the required operation and invokes the function. The most common type of operation asks the array to perform a function, which is executed as a result of API writes to the cluster synchronizer 650 (and, indirectly, to the cluster controller 640). The actual data written to the cluster synchronizer 650/cluster controller 640 is generally constructed by an API operational routine from parameters that the host passes to the application processor interface 630. Data sent to the cluster synchronizer 650/cluster controller 640 is in turn forwarded to the PMEs through the node BCI.

【０２６２】Data can be loaded into API storage, CC storage, or PME memory. Data destined for PME memory can be loaded through either the I/O zipper or the node BCI. Data to be stored in API memory is read from the incoming bus and then written to storage. Data addressed to the cluster controller 640 is likewise read and then written to CC memory. Finally, data for PME memory (in this case usually new or additional MIMD programs) can be sent to all PMEs or to selected PMEs through the cluster synchronizer 650/cluster controller 640/node BCI, or to a subset of the PMEs through the I/O zipper for selective redistribution.

【０２６３】When data is sent to the I/O zipper, it can be preceded by inline commands that let the PME MIMD programs determine its final destination. It can also be preceded by a call to an API service function that performs a MIMD start or a SIMD transmission.

【０２６４】The API program responds not only to service requests received through the host interface but also to requests from the PMEs. Such requests are generated on the high-speed serial port and routed through the cluster controller 640/cluster synchronizer 650 combination. On a request of this kind, the API program can decide whether to service the PME's request directly or to access the PME through the standard-speed serial port to obtain data that further qualifies the service request.

【０２６５】PME software: The software plan includes the following:
・Generation of PME-resident service routines (that is, an "extended ISA") for complex operations and I/O management.
・Definition and development of controller-executed subroutines that create control data and parameter data and pass them to the PMEs over the BCI bus. These subroutines:
1. Cause a set of PMEs to perform operations on distributed objects.
2. Provide I/O data management and synchronization services for interaction between the PME array and the system bus.
3. Provide initial program load of processor memory, program overlay, and program task management.
・Development of data allocation support services for host-level programs.
・Development of a programming support system, including an assembler, a simulator, and a hardware monitor and debug workstation.

【０２６６】Based on studies of military sensor fusion, optimization, image transformation, U.S. Postal Service optical character recognition, and FBI fingerprint matching applications, the inventors concluded that a parallel processor programmed with vector and array commands (for example, BLAS calls) would be effective. The underlying programming model must match the characteristics of the PME array that today's technology can realize. Specifically:
・Each PME can be an independent stored-program processor.
・The array can contain thousands of PMEs and is suited to fine-grained parallelism.
・The inter-PME network has very high aggregate bandwidth and a small "logical diameter".
・However, by the standards of network-connected microprocessor MIMD machines, each PME is memory limited.

【０２６７】Conventional programming of MIMD parallel processors uses a task dispatch methodology, which requires each PME to have access to a portion of a large program. This characteristic, combined with the non-shared memory of the hardware, would consume much of the PME memory on any significant problem. The present invention therefore targets a new programming model referred to as "asynchronous SIMD" (A-SIMD) processing. In this regard, see U.S. Patent Application No. 798788, filed November 27, 1991, which is incorporated herein.

【０２６８】A-SIMD programming in the APAP design means that a group of PMEs is directed by commands broadcast to it, as in the SIMD model. A broadcast command starts execution of a MIMD function within each PME. This execution may involve data-dependent branching and addressing within the PME, as well as synchronization based on I/O with other PMEs or with the BCI. Normally a PME completes its processing and synchronizes by reading the next command from the BCI.
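The A-SIMD control flow just described (broadcast a command, let each PME run a resident routine with its own data-dependent path, then synchronize by fetching the next command) can be modeled in a few lines. This is a toy software model, not the hardware mechanism: the `PME` class, the routine names `lib_sum` and `lib_count_positive`, and the sequential inner loop standing in for parallel execution are all illustrative assumptions.

```python
class PME:
    """Toy processor memory element with a small library of resident routines."""

    def __init__(self, pme_id, data):
        self.pme_id = pme_id
        self.data = list(data)
        self.result = None

    # Hypothetical library routines; each follows a data-dependent path.
    def lib_sum(self):
        self.result = sum(self.data)

    def lib_count_positive(self):
        count = 0
        for value in self.data:  # loop length and branches depend on local data
            if value > 0:
                count += 1
        self.result = count


def run_asimd(pmes, command_stream):
    """Broadcast each command; every PME runs it to completion on its own data.

    Returning to the outer loop models 'reading the next command from the BCI'.
    """
    trace = []
    for command in command_stream:
        for pme in pmes:  # conceptually parallel and asynchronous
            getattr(pme, command)()
        trace.append((command, [p.result for p in pmes]))
    return trace
```

The controller holds only the short command stream, while each PME holds only the library routines, which is the memory-saving point the text makes.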

【０２６９】The A-SIMD approach includes both MIMD and SIMD operating modes. Because the approach places no actual time limit on command execution, a command can start a PME operation that synchronizes on data transfers and runs indefinitely. Such functions are very effective in data filtering, DSP, and systolic operations (they can be ended by a BCI interrupt or by a command on the serial control bus). SIMD operation results from an A-SIMD control stream that contains no MIMD-mode commands. Such a control stream can include any of the PME's native instructions, which are passed directly to the PME's instruction decode logic. Eliminating the PME instruction fetch provides a higher-performance mode for tasks that involve no data-dependent branching.

【０２７０】This programming model (supported by hardware features) extends to allow the PME array to be divided into independent sections, each controlled by a separate A-SIMD command stream. The inventors' studies show that the programs of interest divide into distinct phases (input, input buffering, several processing steps, and output formatting) that are suited to pipelined data processing. Fine-grained parallelism results from applying the n processor memories (PMEs) of a section to a program phase. Applying coarse partitioning to an application often yields small repetitive tasks suited to MIMD processing or memory-bandwidth-limited tasks suited to SIMD processing. In the present invention, the MIMD portions are programmed using conventional techniques, and the remaining phases are programmed as A-SIMD sections, coded with vectorized commands and sequenced by the array controller. This makes the large controller memory the program store. The number of PMEs per section can be varied to balance the workload, and the size of the dispatched tasks can be varied to match BCI bus bandwidth to the control requirements.
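As a rough illustration of varying the number of PMEs per section to balance a pipeline, the sketch below splits a fixed PME budget across phases in proportion to an estimated amount of work per phase. The source does not describe an allocation algorithm; the proportional rule, the function name `allocate_pmes`, and the work estimates are assumptions. The sketch also assumes the total estimated work is positive.

```python
def allocate_pmes(phase_work, total_pmes):
    """Split total_pmes across phases in proportion to estimated work.

    Every phase gets at least one PME; PMEs left over after rounding down
    go to the phases with the largest fractional remainders.
    """
    if total_pmes < len(phase_work):
        raise ValueError("need at least one PME per phase")
    total_work = sum(phase_work.values())
    spare = total_pmes - len(phase_work)
    shares = {name: spare * work / total_work for name, work in phase_work.items()}
    alloc = {name: 1 + int(share) for name, share in shares.items()}
    leftover = total_pmes - sum(alloc.values())
    by_remainder = sorted(
        shares, key=lambda name: shares[name] - int(shares[name]), reverse=True
    )
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc
```

For a three-phase pipeline with relative work 1 : 5 : 2 and 16 PMEs, this gives the heaviest phase the largest section while keeping every phase populated.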

【０２７１】The programming model also addresses the allocation of data elements to PMEs. The approach is to distribute data elements evenly over the PMEs. In early versions of the software, this will be done by the programmer or by software. The inventors believe that IBM parallelizing compiler technology can be applied to this problem and plan to investigate its use. However, the inter-PME bandwidth provided tends to make this allocation less critical. This links data allocation to I/O mechanism performance.
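One simple way to distribute data elements evenly is a block distribution in which section sizes differ by at most one element. The sketch below is an illustration of that idea only; the source does not say which distribution is used, and the helper names `block_distribute` and `owner` are hypothetical.

```python
def block_distribute(n_elements, n_pmes):
    """Return one (start, stop) index range per PME; sizes differ by at most 1."""
    base, extra = divmod(n_elements, n_pmes)
    ranges, start = [], 0
    for pme in range(n_pmes):
        size = base + (1 if pme < extra else 0)  # first 'extra' PMEs get one more
        ranges.append((start, start + size))
        start += size
    return ranges


def owner(index, n_elements, n_pmes):
    """Return the PME that holds element 'index' under block_distribute."""
    base, extra = divmod(n_elements, n_pmes)
    boundary = extra * (base + 1)  # elements held by the larger blocks
    if index < boundary:
        return index // (base + 1)
    return extra + (index - boundary) // base
```

Because `owner` is a closed-form calculation, any PME or controller can locate an element without a lookup table.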

【０２７２】The hardware requires that a PME initiate data transfers out of its own memory, and it supports controlled writes into PME memory without involvement of the PME program. Input control is exercised at the receiving PME, which supplies an input buffer address and a maximum length. If I/O to a PME causes the buffer to overflow, the hardware interrupts the receiving PME. The low-level I/O functions developed for the PMEs build on this service. The invention supports both movement of raw data between adjacent PMEs and movement of addressed data between PMEs. The latter capability relies on circuit-switched and store-and-forward mechanisms. The address-interrupt and forwarding operation is important to performance, and the hardware and software are optimized to support it. With a one-word buffer, an interrupt occurs when the address header is received. Comparing the target ID with the local ID selects the output path. The data words that follow are then transferred in circuit-switched mode. A slight variation on this process, using a larger buffer, yields a store-and-forward mechanism.
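The header-handling step (compare the target ID in the address header with the local ID, then pick an output path) can be sketched as follows. The topology here is an assumption chosen for brevity: a hypothetical ring of `n_pmes` nodes with two ports, `cw` and `ccw`. The actual APAP interconnection is not modeled, and the function names are illustrative.

```python
def select_output(local_id, target_id, n_pmes):
    """Decide what to do with a message from its one-word address header.

    Assumes a hypothetical ring of n_pmes nodes with ports 'cw' and 'ccw'.
    """
    if target_id == local_id:
        return "local"  # header matches: the data words belong to this PME
    forward = (target_id - local_id) % n_pmes
    return "cw" if forward <= n_pmes - forward else "ccw"


def route(source_id, target_id, n_pmes):
    """Follow the header hop by hop and return the list of PMEs visited."""
    path, here = [source_id], source_id
    while True:
        port = select_output(here, target_id, n_pmes)
        if port == "local":
            return path
        here = (here + (1 if port == "cw" else -1)) % n_pmes
        path.append(here)
```

In circuit-switched operation, only the header would trigger this decision at each hop; the data words that follow would use the path already selected.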

【０２７３】Because of the high inter-PME bandwidth, careful placement of data elements in the PME array is not always necessary, nor is it always desirable. Consider the movement of vector data elements distributed across PMEs. The architecture can send data without an address header, which allows very fast I/O. In many applications, however, optimizing a data structure for movement in one direction has been found to slow movement of the data in the orthogonal direction. The time penalty in such cases usually approaches the average time needed to route data randomly through the network. An application that stores data sequentially or randomly, rather than in an aligned arrangement, can therefore have a shorter average processing time.

【０２７４】Many applications can be synchronized so as to take advantage of average access time (for example, a PDE relaxation process obtains data from its neighbors and can therefore average access over at least four I/O operations). Considering the factors that apply to vector and array processes such as scatter/gather and row/column operations, many users are likely to find that crude, brute-force data allocation suits their applications. There are, however, examples in which application characteristics tend to force a particular data allocation pattern (such as a required synchronization of shift direction or the use of bias). Because of this, the tools and techniques developed must support either manual tuning of data placement or simple non-optimal data allocation. (The invention supports non-optimal data allocation through host-level macros in order to provide a nearly transparent port of vectorized host programs to the MPP. The user can examine the resulting performance at the hardware monitor workstation.)

【０２７５】Note that Gaussian elimination with standard pivoting requires shifting rows but not columns. The performance difference obtained by arranging the data so that the columns lie in the fast-shift direction exceeds 2:1. Even so, there is no benefit in aligning the rows to have any particular relationship to the bus.

【０２７６】FIG. 21 shows the general software development and usage environment. The host application processor is optional, because program execution can be controlled from either the host or the monitor. In some cases it is also effective to use the monitor in place of the array controller. The environment supports program execution on real or simulated MPP hardware. Because the monitor is scenario driven, developers performing test and debug operations can create procedures that let them work effectively at any level of abstraction. FIG. 22 shows the levels of hardware supported within the MPP and the user interfaces to those levels.

【０２７７】The inventors see two possible application programming techniques for the MPP. In the least programmer-intensive approach, the application is written in a vectorized high-level language. A user who does not expect the problem to benefit from tuned data placement then uses a compile-time service to allocate data to the PME array. The application uses vector calls, such as BLAS, which are passed to the controller for interpretation and execution on the PME array. Special calls are used to move data between the host and the PME array. In short, the user need not know how the MPP organizes or processes the data. Two optimization techniques are supported for this type of application:

【０２７８】1. Data allocation can be modified by creating a data allocation table, which lets the program force a particular data placement.

【０２７９】2. Additional vector commands can be generated for execution by the array controller, which allows tuning of subordinate functions (for example, invoking Gaussian elimination as a single operation).

【０２８０】The inventors also expect the processor to be applied to specialized applications such as those referenced at the beginning of this section. In that case, code tuned to the application is used. Even in such applications, however, the degree of tuning depends on how important a particular task is to the application. In this situation, tasks individually suited to SIMD, MIMD, or A-SIMD mode are likely to be needed. These programs use a combination of the following:

【０２８１】1. Sequences of PME native instructions passed to an emulator function within the array controller. The emulator broadcasts each instruction and its parameters to its set of PMEs. The PMEs, in SIMD mode, pass the instruction to the decode function, simulating a memory fetch operation.

【０２８２】2. Tight inner loops that can be synchronized with I/O use PME native ISA programs. These loops begin with a switch out of SIMD mode and then run continuously in MIMD mode (with the option of returning to SIMD mode through a "RETURN" instruction).

【０２８３】3. More complex programs written in the vectorized command set execute subroutines in the array controller, which invoke PME native functions. For example, a simple array controller program that executes the BLAS "SAXPY" command on vectors loaded sequentially across the PMEs starts sequences within the PMEs that do the following:
a. Enable the PMEs that hold the required x elements, by comparing the processor ID with the broadcast "incx" and "X_addr" values.
b. Compress the x values by writing them to consecutive PMEs.
c. Compute, from the broadcast data, the addresses of the PMEs that hold the y elements.
d. Send the compressed x data to the y PMEs.
e. Perform the single-precision floating-point operation in each PME that received an x value, completing the operation.
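Steps a through e can be traced in a small simulation. This is a toy model under stated assumptions, not the PME implementation: each PME holds at most one x element and one y element in a dictionary, strides are positive, the compressed x values are parked in PMEs 0 to n-1, and the names `saxpy_on_pmes`, `packed_x`, and `recv_x` are invented for illustration.

```python
def saxpy_on_pmes(memory, n, a, x_addr, incx, y_addr, incy):
    """Simulate y = a*x + y with one vector element per PME.

    memory[pid] is a dict of named values held by PME 'pid'.
    """
    # a. Enable PMEs whose ID matches the broadcast x_addr/incx pattern.
    enabled = [
        pid
        for pid in range(len(memory))
        if pid >= x_addr
        and (pid - x_addr) % incx == 0
        and (pid - x_addr) // incx < n
    ]
    # b. Compress: write the selected x values into consecutive PMEs 0..n-1.
    for i, pid in enumerate(enabled):
        memory[i]["packed_x"] = memory[pid]["x"]
    # c. Compute the PME address of each y element from the broadcast data.
    y_pmes = [y_addr + i * incy for i in range(n)]
    # d. Send the compressed x data to the PMEs that hold y.
    for i, pid in enumerate(y_pmes):
        memory[pid]["recv_x"] = memory[i].pop("packed_x")
    # e. Each receiving PME computes a*x + y locally.
    for pid in y_pmes:
        memory[pid]["y"] = a * memory[pid].pop("recv_x") + memory[pid]["y"]
    return [memory[pid]["y"] for pid in y_pmes]
```

With eight PMEs holding x values 0.0 to 7.0 and y values of 10.0, a call with n=3, a=2.0, x_addr=1, incx=2, y_addr=0, incy=3 updates the y elements in PMEs 0, 3, and 6 and leaves the others untouched.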

【０２８４】Finally, the SAXPY example illustrates another aspect of executing vectorized application programs. The major steps execute in the application processor interface 630 and can be programmed by either an optimizer or a product developer. A vectorized application normally calls this level of code rather than including it. These steps are written in C or FORTRAN and use memory-mapped reads and writes to control the PME array over the BCI bus. Such a program operates the PME array as a series of MIMD steps synchronized by returns to the API program. The minor steps, such as the single-precision floating-point routines, are developed by the customizer or product developer. These operations are coded in the native PME ISA and tuned to the characteristics of the machine. In general this is the domain of the product developer, because coding, testing, and optimizing at this level require the complete product development tool set.

【０２８５】Applications for the APAP can also be written in sequential FORTRAN. The path in this case is quite different. FIG. 23 outlines a FORTRAN compiler that could be used. In the first step, the FORTRAN compiler uses a portion of an existing parallelizing compiler to derive program dependencies. The source and these tables become inputs to a process that enhances parallelism based on a characterization of the APAP MMP and of the source.

【０２８６】Because this MMP is a non-shared-memory machine, data must be allocated among the PMEs for local memory and for global memory. Very fast data transfer times and high network bandwidth reduce the time impact of data allocation, but it must still be addressed. The approach taken here treats part of memory as global memory and uses software service functions. A second approach, which uses the dependency information to perform the data allocation, is also possible. The final step, converting the source into multiple sequential programs, is performed by the level partitioning step. This partitioning step is similar to the Fortran³ work being funded by the U.S. Defense Advanced Research Projects Agency (DARPA). The last process in compilation is the generation of executable code at each of the individual functional levels. For the PMEs, this is done by programming the code generator on an existing compiler system. The host code compiler and the API code compiler generate code targeted to the corresponding machines.

【０２８７】A PME can execute MIMD software from its own memory. In general, the PMEs do not each run a completely different program; rather, they run the same small program asynchronously. Three basic types of software can be considered, although this design approach does not limit the APAP to these alone.

【０２８８】1. Through dedicated emulation functions, the PME array emulates a set of services provided by standard user libraries such as LINPACK or VPSS. In such an emulation package, the PME array can use its multiple sets of devices to perform one of the operations required by a normal vector call. When attached to a vector processor, an emulation package of this kind can use the vector unit for some operations while performing others internally.

【０２８９】2. The parallelism of the PME array can be exploited by running a set of software that provides a new set of operations and service functions in the PMEs. This set of primitives is the code that a customizing user uses to build an application. The earlier example of performing sensor fusion on an APAP attached to a military platform uses such an approach. Using the supplied set of function names, the customizer writes routines that perform Kalman Filter, Track Optimum Assignment, and Threat Assessment. The application is a series of API calls, each of which starts a set of PMEs and performs a basic operation, such as matrix multiplication, on data stored in the PME array.

【０２９０】3. Where no effective approach exists that meets the performance goals or application needs, custom software can be developed and executed within the PMEs. A concrete example is sorting. Many methods exist for sorting data, but the aim is always to tune the process and the program to fit the machine architecture. The modified hypercube is well suited to a Batcher sort. However, that sort requires extensive calculation to determine which particular elements to compare, relative to its very short compare cycle. The computer program of FIG. 19 shows a simple example 1100 of a PME program for executing a Batcher sort 1000 with one element per PME. Each line of the program description expands into three to six PME machine-level instructions, and all PMEs execute the program in MIMD mode. Program synchronization is managed through the I/O statements. The program extends quite simply to multiple data elements per PME and to very large parallel sorts.
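
The program of FIG. 19 is not reproduced in this text, so the following is only a minimal sketch of the idea, assuming Batcher's bitonic sorting network with one value held by each simulated PME. The function name and the sequential simulation loop are illustrative assumptions; on an ordinary binary hypercube the partner `pme ^ j` is a direct neighbor along one dimension, and the mapping onto the modified hypercube described here is not modeled.

```python
def bitonic_sort_one_per_pme(values):
    """Simulate Batcher's bitonic sort with one element per PME.

    The number of PMEs (len(values)) must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of PMEs must be a power of two")
    data = list(values)

    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance to the compare-exchange partner
        while j >= 1:
            # Every PME would perform this step at the same time; here the
            # PMEs are visited one after another.
            for pme in range(n):
                partner = pme ^ j          # flip one address bit
                if partner > pme:          # handle each pair once
                    ascending = (pme & k) == 0
                    if (data[pme] > data[partner]) == ascending:
                        data[pme], data[partner] = data[partner], data[pme]
            j //= 2
        k *= 2
    return data


if __name__ == "__main__":
    print(bitonic_sort_one_per_pme([7, 3, 6, 1, 8, 2, 5, 4]))
```

Each pass of the inner loop is one compare-exchange cycle; the bookkeeping on `k`, `j`, and the sort direction is the calculation that the text contrasts with the very short compare itself.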

【０２９１】CC storage contents: Data in CC storage is used by the PME array in one of two ways. When the PMEs are operating in SIMD mode, the cluster controller 640 can fetch a sequence of instructions and pass them to the node BCIs, thereby reducing the load on the application processor interface 630 and the cluster synchronizer 650. Alternatively, infrequently needed functions, such as PME fault reconfiguration software, PME diagnostics, and perhaps conversion routines, can be stored in CC memory. Such functions can then be requested by a running PME MIMD program, or moved to the PMEs at the request of an API program directive.

【０２９２】Packaging of the 8-way modified hypercube: The packaging technique of the present invention uses eight PMEs packaged on a single chip and arranged in an N-dimensional modified hypercube configuration. This chip-level package, or array node, is the smallest building block in the APAP design. These nodes are in turn packaged in an 8×8 array, in which ±X and ±Y form rings within the array, or cluster, and ±W and ±Z extend to the adjacent clusters. The clusters together make up the array. This step greatly reduces the number of data and control wires for the array. The W and Z buses connect to adjacent clusters, forming W and Z rings that provide total connectivity across completed arrays of various sizes. A massively parallel system is built from these cluster building blocks to form a large array of PMEs. The APAP consists of an 8×8 array of clusters, each cluster having its own controller. All controllers are synchronized by the array director 610.

【０２９３】Many trade-offs between wireability and topology were considered, and having weighed them, the inventors prefer the configuration illustrated here. The concept disclosed herein has the advantage of keeping the X and Y dimensions within the cluster packaging level and distributing the W and Z bus connections to all adjacent clusters. Implementing the techniques described herein yields a product that is wireable and manufacturable while retaining the inherent characteristics of the defined topology.

【０２９４】The concept used here is to mix, match, and modify topologies at different packaging levels to obtain the desired result in terms of wire count.

【０２９５】For a method of defining the actual degree of modification of the hypercube, see the above-referenced U.S. patent application Ser. No. 07/698,866. For simplicity, this preferred embodiment is described with two packaging levels; the approach can be extended.

【０２９６】The first step is the chip design or chip package shown in FIGS. 4 and 12. Eight processing elements, with their associated memory and communication logic, are contained in a single chip, which is defined as a node. The internal configuration is classified as a binary hypercube, or second-order hypercube, in which every PME is connected to two adjacent PMEs. See the PME-to-PME communication diagram of FIG. 10, in particular 500, 510, 520, 530, 540, 550, 560, and 570.

【０２９７】In the second step, the nodes are configured as an 8×8 array to form a cluster. A fully populated machine consists of an 8×8 array of clusters, providing a maximum capacity of 32,768 PMEs. These 4,096 nodes are connected to form an eighth-order hypercube network in which inter-node communication is programmable. Because the various routes can be programmed in this way, flexibility in transmitting messages of differing lengths is increased. These programmability features accommodate not only differences in message length but also algorithm optimization.
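
The node and PME counts quoted above follow directly from the packaging hierarchy. The sketch below simply multiplies the levels out; the constant and function names are illustrative, not taken from the patent.

```python
# Packaging hierarchy as described in the text (names are illustrative).
PMES_PER_NODE = 8            # eight PMEs on one chip (node)
NODES_PER_CLUSTER = 8 * 8    # nodes arranged as an 8x8 array
CLUSTERS_PER_SYSTEM = 8 * 8  # clusters arranged as an 8x8 array


def capacity(clusters=CLUSTERS_PER_SYSTEM):
    """Return node and PME counts for a system with the given cluster count."""
    nodes = clusters * NODES_PER_CLUSTER
    return {
        "pmes_per_cluster": NODES_PER_CLUSTER * PMES_PER_NODE,
        "nodes": nodes,
        "pmes": nodes * PMES_PER_NODE,
    }


if __name__ == "__main__":
    print(capacity())    # fully populated machine
    print(capacity(2))   # a smaller, two-cluster configuration
```

For the fully populated machine this gives 512 PMEs per cluster, 4,096 nodes, and 32,768 PMEs, matching the figures in the text.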

【０２９８】This packaging concept is intended to greatly reduce the number of off-page wires for each cluster. The concept takes a cluster defined as an 8×8 array 820 of nodes, each node 825 having eight processing elements for a total of 512 PMEs, then confines the X and Y rings within the cluster, and finally extends the W and Z buses to all clusters. The physical picture can be envisioned as a sphere configuration 800, 810 made up of 64 smaller spheres 830. For a picture of future packaging, see FIG. 17, which shows the full-up packaging technique. There, the X and Y rings 800 are confined within the cluster, and the W and Z buses extend to all clusters 810.

【０２９９】The actual connections of a single node to its adjacent X and Y PMEs lie within the same cluster. Wiring is saved when the Z and W buses are extended to adjacent clusters, as shown in FIG. 18. FIG. 18 also shows a set of chips, or nodes, that can be configured as a sparsely connected four-dimensional hypercube or torus 900, 905, 910, 915. Denote the eight external ports as +X, +Y, +Z, +W, −X, −Y, −Z, and −W (950, 975). A ring can then be built by connecting the +X ports to the −X ports. Further, m such rings can be interconnected into a ring of rings by interconnecting the corresponding +Y and −Y ports. This level of structure is called a cluster. With 512 PMEs, it is the building block for systems of several sizes. FIG. 18 shows two such connections (950, 975).
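
The ring-of-rings wiring can be sketched as modular arithmetic on node addresses. The sketch assumes every ring has length 8 (8×8 nodes per cluster, 8×8 clusters) and addresses a node as `(w, z, x, y)`, where `(w, z)` selects the cluster and `(x, y)` the node inside it; this addressing scheme and the helper names are illustrative assumptions, and the exact wiring of FIG. 18 is not reproduced.

```python
RING = 8  # assumed ring length in every dimension

# Port name -> step in (w, z, x, y); W/Z select the cluster, X/Y the node in it.
PORTS = {
    "+W": (1, 0, 0, 0), "-W": (-1, 0, 0, 0),
    "+Z": (0, 1, 0, 0), "-Z": (0, -1, 0, 0),
    "+X": (0, 0, 1, 0), "-X": (0, 0, -1, 0),
    "+Y": (0, 0, 0, 1), "-Y": (0, 0, 0, -1),
}


def neighbors(w, z, x, y):
    """Map each of the eight external ports to the neighboring node address."""
    here = (w, z, x, y)
    return {
        port: tuple((c + d) % RING for c, d in zip(here, step))
        for port, step in PORTS.items()
    }


def leaves_cluster(port):
    """True for ports whose link crosses a cluster boundary (W and Z)."""
    return port[1] in "WZ"


if __name__ == "__main__":
    for port, address in neighbors(0, 0, 7, 0).items():
        where = "other cluster" if leaves_cluster(port) else "same cluster"
        print(port, address, where)
```

Under these assumptions only four of a node's eight ports leave the cluster, which is the wiring saving the text attributes to confining the X and Y rings to the cluster level.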

【０３００】Applications for the deskside MPP: A deskside MPP at a workstation can be applied effectively in several application areas, including the following.

【０３０１】1. Small production tasks that depend on compute-intensive processes. The U.S. Postal Service needs a processor that can accept a fax image of a machine-printed envelope and then find and read the ZIP code. This process is needed at every regional sorting facility and is an example of a highly repetitive, compute-intensive process. For the present invention, an APL-language version of a sample of the required program was implemented. These models emulate the vector and array processes that would be used to do the work on an MPP. Based on this test, the task was found to be an excellent match for the processing architecture.

【０３０２】2. Tasks in which an analyst requests a sequence of data transformations as a result of the previous output or of anticipated needs. In an example from the U.S. Defense Mapping Agency, satellite images are transformed and smoothed pixel by pixel and converted to other coordinate systems. In such situations, the transformation parameters of the image vary from region to region as a result of the elevation and slope of the terrain. The analyst must therefore add fixed control points and reprocess the transformations. A similar need arises when users of scientific simulation results require near-real-time rotation or perceptible changes.

【０３０３】3. Program development for production versions of an MPP uses a workstation-sized MPP. Consider a tuning process that requires analysis of processor and network performance. Such a task is an interaction between machine and analyst, and it involves periods when the machine is idle while the analyst works. Doing this on a supercomputer is very costly. However, giving an affordable workstation MPP the same characteristics as the supercomputer MPP, on a smaller scale, offsets the cost and eases the test and debug process, because it removes the programmer inefficiency associated with accessing a remote processor.

【０３０４】FIG. 24 is a diagram of a workstation accelerator. The workstation accelerator uses an enclosure of the same size as the RISC/6000 model 530. Two swing-out gates are shown, each holding a complete cluster. Together the two clusters give the array 5 GOPS of fixed-point performance, 530 MFLOPS of processing power, and about 100 MB of I/O bandwidth. The unit is suitable for any of the applications described above. Produced in volume and including a host RISC/6000, it would be comparable in value to a high-performance workstation rather than to comparable machines built with older technology.

【０３０５】Description of AWACS sensor fusion: The military environment provides a series of examples demonstrating the need for enhanced compute-intensive processors.

【０３０６】Communication in a targeted, noisy environment requires digitally coded communication such as that used in the ICNIA system. The process of encoding data for transmission and recovering the information after reception is compute-intensive. The task can be performed by special-purpose signal processing modules, but where communication coding represents bursts of activity, such modules are mostly idle. With an MPP, several such tasks can be allocated to a single module, saving weight, power, volume, and cost.

【０３０７】Fusion of sensor data offers a particularly clear example of enhancing an existing platform with the compute power gained by adding an MPP. The Air Force E3 AWACS has more than four sensors on the platform, but at present there is no way to generate tracks by fusing all the available data. Moreover, the tracks that are generated today are of rather poor quality because of their sampling characteristics. Fusion is therefore needed to raise the effective sample rate.

【０３０８】The inventors have studied this sensor fusion problem in detail and can propose a verifiable and effective analytical solution, but the compute power available in the AWACS data processor is insufficient for this approach. FIG. 25 shows the conventional track fusion process. The conventional process is flawed in that the individual processes tend to be error-prone, and the final merge tends to collect those errors rather than eliminate them. The process is also characterized by very long latency, because the merge cannot complete until the slowest sensor finishes. FIG. 26 shows the improvement offered by the present approach and the compute-intensive problem that results. Although NP-hard problems cannot be solved exactly, the inventors have developed a good method of approximating the solution. Details of that application are described by the inventors in another patent application. Because that application can run on a variety of machines, such as an Intel Touchstone with 512 i860 (80860) processors or IBM's Scientific Visualization System, it can also serve as an application suited to an MMP that uses the APAP design described herein with, for example, 128,000 PMEs, greatly exceeding the performance of those other systems. Experiments with the application show that the quality of the approximation falls below the level of sensor noise, so the answers are applicable to applications such as AWACS. FIG. 27 shows the processing loop for the proposed Lagrangian reduction n-dimensional assignment algorithm. The problem uses very controlled iterations of the well-known two-dimensional assignment problem, the same algorithm used in conventional sensor fusion processing.
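
The processing loop of FIG. 27 is not reproduced in this text. As a point of reference, the sketch below shows only the two-dimensional assignment subproblem that the loop iterates: choosing one column per row of a cost matrix so that the total cost is minimal. The brute-force search and the function name are illustrative assumptions suitable for tiny examples; a practical system would use a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations


def solve_assignment(cost):
    """Return (columns, total) for the cheapest one-to-one row/column pairing.

    cost is a square list of lists; columns[i] is the column given to row i.
    Brute force over all permutations, so only usable for small matrices.
    """
    n = len(cost)

    def total(columns):
        return sum(cost[row][columns[row]] for row in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


if __name__ == "__main__":
    # Rows could stand for existing tracks, columns for new observations.
    cost = [
        [4, 1, 3],
        [2, 0, 5],
        [3, 2, 2],
    ]
    print(solve_assignment(cost))
```

In a track-fusion setting the cost entries would typically measure how poorly an observation fits a track, but that interpretation is an assumption here rather than something the text specifies.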

【０３０９】For example, suppose the n-dimensional algorithm is applied to the seven sets of observations shown in FIG. 26 onward, and suppose that each pass of the reduction process requires four iterations of the two-dimensional assignment process. The resulting eight-dimensional assignment problem then requires 4,000 iterations of the two-dimensional assignment problem. The AWACS workload is currently about 90% of machine capacity. Fusion requires perhaps 10% of the total capacity, but even that small amount, scaled up 4,000 times, results in a total utilization of 370 times the capacity of an AWACS. Not only does this workload exceed the capability of the existing processor, it is marginal even for existing sparsely parallel processing systems compatible with the new MIL environment, or for those expected in the next few years. If the algorithm requires an average of five rather than four iterations per step, it exceeds the capability of even the hypothetical systems. Conversely, the MPP solution can provide this compute power, even at the five-iteration level.
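
The text does not state how the figure of 4,000 iterations is obtained. One reading that comes close, offered here only as an assumption, is that reducing an eight-dimensional problem to a two-dimensional one takes six nested passes, each multiplying the work by the number of iterations per pass. The function below is a hypothetical illustration of that reading.

```python
def two_d_solves(dimensions, iterations_per_pass):
    """Estimate 2-D assignment solves for an n-dimensional problem.

    Assumes (not stated in the text) one nested reduction pass per
    dimension above two, each repeating the inner work a fixed number
    of times.
    """
    passes = dimensions - 2
    return iterations_per_pass ** passes


if __name__ == "__main__":
    four = two_d_solves(8, 4)
    five = two_d_solves(8, 5)
    print(four, five, round(five / four, 1))
```

Under this assumption, four iterations per pass give 4,096 solves, close to the quoted 4,000, and five iterations per pass give 15,625, which shows how strongly the workload depends on the per-pass iteration count.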

【０３１０】Mechanical packaging: As shown in FIG. 4 and other figures, the preferred chip of the present invention is built in a quad flat pack form. The chips can therefore be brick-walled into various two-dimensional and three-dimensional configurations within a package. One chip of eight or more PMEs is a first-level package module, just as a single DRAM memory chip is a first-level package module for the foundry that packages the chip. The quad flat pack form, however, allows four-way interconnection. Each connection is point-to-point (one chip in its first-level package is a module to the foundry). This feature makes it possible to build PE arrays large enough to achieve the performance goals of the present invention. In practice, these chips can be connected point-to-point, that is, between multiprocessor nodes, across three, four, or even five feet while retaining proper control and without fiber optics.

【０３１１】This benefits the driver/receiver circuits required on the module. Because the present invention has no bus system that daisy-chains from module to module, high performance can be achieved with low power dissipation. The present invention does broadcast between nodes, but this need not be a high-performance path. Because most data operations can be done within a node, the data path requirements are reduced. The broadcast path is used essentially as a controller routing tool. The data streams attach to, and flow through, the ZWXY communication path system.

【０３１２】Power dissipation in the present invention is 2.2 W per node module for the commercial workstation, which allows air-cooled packaging. The power requirements of the system are therefore also reasonable. In the illustrated power system, the supply is sized by multiplying the number of modules supported by about 2.5 W per module. Such a 5 V power supply is very cost-effective. As for power consumption, remarkably, 32 microcomputers can run on less power than is consumed by the light needed to read a book.
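
The power figures above can be worked through in a few lines. The sketch assumes that the "32 microcomputers" are 32 PMEs packaged eight to a node, so that four node modules are involved; that reading, and the helper names, are assumptions rather than statements from the text.

```python
import math

NODE_DISSIPATION_W = 2.2   # per node module, from the text
SUPPLY_BUDGET_W = 2.5      # per module when sizing the supply, from the text
PMES_PER_NODE = 8          # eight PMEs per node chip


def nodes_for(pmes):
    """Number of node modules needed to hold the given number of PMEs."""
    return math.ceil(pmes / PMES_PER_NODE)


def dissipation_w(nodes):
    """Heat dissipated by the given number of node modules."""
    return nodes * NODE_DISSIPATION_W


def supply_rating_w(nodes):
    """Power supply rating using the per-module sizing rule."""
    return nodes * SUPPLY_BUDGET_W


if __name__ == "__main__":
    nodes = nodes_for(32)
    print(nodes, dissipation_w(nodes))   # node count and watts for 32 PMEs
    print(supply_rating_w(64))           # supply for one 64-node cluster
```

On this reading, 32 PMEs dissipate under 9 W, and a full 64-node cluster calls for a supply of about 160 W.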

【０３１３】The thermal design of the present invention is improved by this packaging. Hot spots caused by mixing high-dissipation and low-dissipation parts are avoided. This is reflected directly in assembly cost.

【０３１４】The cost of the system of the present invention is very attractive compared with approaches that place superscalar processors on cards. Its level of performance per part type, per assembly, per watt, per connector, and per dollar is outstanding.

【０３１５】Furthermore, the present invention does not need as many packaging levels as other technologies. The full hierarchy of module, card, backplane, and cables is not required, and the card level can be omitted if desired. As shown in the workstation module of the present invention, the brick-wall approach omits the card level.

【０３１６】Furthermore, as illustrated, each node housing brick-walled within the workstation module can even contain multiple replicated dies within the same chip housing, as shown in FIG. 4. Normally one die is placed in an air-cooled package, but with a multi-chip module approach eight dies can be placed on a substrate. The dream of a wristwatch with 32 or more processors thus becomes feasible, along with many other applications. Thanks to the packaging, power, and flexibility, the number of possible applications is unlimited. In a house, every controllable instrument could be monitored and coordinated using a very small part. The chips used so heavily in automobiles for engine monitoring, brake adjustment, and so on could all have a monitor within the housing. In addition, using hybrid technology, a 386 microprocessor chip with fully programmable functions and memory (all in one chip) can be mounted on the same substrate and used as the array controller for the substrate package.

【０３１７】Many system configurations have been shown, from the control system of FIG. 4 up to much larger systems. Because a chip with eight or more PMEs can be packaged in a DIP, with a pin array that fits a standard DRAM memory module such as a SIMM, the possible applications are unlimited. They range from controls to wall-sized video displays that have a repetition rate of 30 frames, rather than the 15 or so frames that existing technology barely manages today, and in which a processor can be assigned to monitor a node of one pixel or a few pixels. The brick-wall quad flat pack of the present invention makes it easy to replicate the same part over and over. Moreover, a replicated processor is in effect memory with processor interchange. Part of the memory can be assigned to a specific monitoring task, while another part, whose size is defined by the program, can serve as a massive global memory that is addressed point-to-point and has broadcast to all functions.

【０３１８】The basic workstation, supercomputer, controller, and AWACS of the present invention are all examples of packages that can employ the new technology. An array of memory with built-in CPU chips and I/O functions as the PMEs for massively parallel applications, and for more limited applications as well. The flexibility of packaging and programming stirs the imagination, and with the technology of this application one part can be assigned to many ideas and images.

【０３１９】Military avionics application: The cost advantage of building a MIL MPP is especially clear in the AWACS. The AWACS enclosure was developed 20 years ago, and it has gained free space as newer-technology memory modules have replaced the original core memory. FIG. 28 shows two MIL-compliant cluster systems that fit directly into the rack's free space and use the existing memory bus for interconnection.

【０３２０】The AWACS example is especially favorable because free space already exists, but space can be made in other systems as well. It is usually easy to replace existing memory with a small MPP or with a gateway to a separate MPP. In such cases, a quarter cluster and an adapter module would provide 8 MB of memory and 640 MIPS, and would probably use two slots.
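
The quarter-cluster figures can be turned into per-PME numbers and scaled to other configurations. The sketch assumes a quarter cluster is 16 of the 64 nodes with eight PMEs each, that performance and memory scale linearly with PME count, and that 1 MB is 1,024 KB; the names are illustrative.

```python
PMES_PER_NODE = 8
NODES_PER_CLUSTER = 64
QUARTER_CLUSTER_PMES = (NODES_PER_CLUSTER // 4) * PMES_PER_NODE  # 128 PMEs
QUARTER_CLUSTER_MIPS = 640            # from the text
QUARTER_CLUSTER_MEMORY_KB = 8 * 1024  # 8 MB, assuming 1 MB = 1024 KB


def per_pme():
    """Per-PME throughput and memory implied by the quarter-cluster figures."""
    return {
        "mips": QUARTER_CLUSTER_MIPS / QUARTER_CLUSTER_PMES,
        "memory_kb": QUARTER_CLUSTER_MEMORY_KB / QUARTER_CLUSTER_PMES,
    }


def scaled(clusters):
    """Linearly scale the per-PME figures to a system of whole clusters."""
    pmes = clusters * NODES_PER_CLUSTER * PMES_PER_NODE
    unit = per_pme()
    return {
        "pmes": pmes,
        "gips": pmes * unit["mips"] / 1000,
        "memory_mb": pmes * unit["memory_kb"] / 1024,
    }


if __name__ == "__main__":
    print(per_pme())
    print(scaled(64))
```

Under these assumptions each PME contributes 5 MIPS and 64 KB, and a 64-cluster system comes to roughly 164 GIPS, which agrees with the integer performance figure quoted elsewhere in this description.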

【０３２１】Supercomputer application: A 64-cluster MPP is a 13.6 GFLOPS supercomputer. It can be configured as the system shown in FIG. 29. In this system, node chips are brick-walled onto cluster cards, as shown in FIG. 29, to build a system with substantial cost and size advantages. Adding extra chips such as network switches to the system would increase cost, and there is no need to do so.

【０３２２】The interconnect system of the present invention, built from "brick wall" chips, allows systems to be constructed much like large DRAM memories. The interconnect system has defined bus adapters that conform to rigid bus specifications, for example a Micro Channel bus adapter. Each system has a smaller power system and cooling design than other systems based on many of today's microprocessors.

【０３２３】Unlike most supercomputers, the preferred APAP of the present invention, which has floating-point emulation, is much faster at integer arithmetic (164 GIPS) than at floating-point arithmetic. The processor is therefore most effective when used for applications that are heavily character- or integer-oriented. In addition to the other applications discussed above, three program problems that need to be solved were also examined. Applications that may matter more to everyday life than some of the "grand challenges" include the following.

【０３２４】1. The 3090 vector processor has a very high-performance floating-point unit. Like most vectorized floating-point units, it requires pipelined operations on dense vectors. Applications that make heavy use of irregular sparse matrices (matrices described by bit maps rather than by orthogonal coordinates) waste the performance capability of the floating-point unit. The MPP solves this problem by providing storage for the data and by using its computing power and network bandwidth not to perform the calculations but to build dense vectors and to decompress dense results. The vector unit is kept busy by a continuous stream of operations on dense vectors supplied by the MPP. If the MPP is sized so that it can compress and decompress at the same rate at which the vector facility processes, both units can be kept fully busy.
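The compress and decompress roles described above can be illustrated with a minimal sketch. It assumes a sparse row described by a bit map (1 marks a stored element); the function names `gather_dense` and `scatter_dense` and the sample data are hypothetical, and the scaling step merely stands in for whatever the vector unit would compute.

```python
def gather_dense(values, bitmap):
    """Pack the elements whose bit-map entry is 1 into a dense list."""
    return [v for v, present in zip(values, bitmap) if present]


def scatter_dense(dense, bitmap, fill=0.0):
    """Expand a dense result back into the layout described by the bit map."""
    it = iter(dense)
    return [next(it) if present else fill for present in bitmap]


# Hypothetical sparse row and its bit map.
bitmap = [1, 0, 0, 1, 1, 0, 1, 0]
sparse_row = [2.0, 0.0, 0.0, 4.0, 6.0, 0.0, 8.0, 0.0]

dense = gather_dense(sparse_row, bitmap)      # work done on the MPP side
scaled = [0.5 * x for x in dense]             # stand-in for the vector unit
restored = scatter_dense(scaled, bitmap)      # work done on the MPP side
```

In this arrangement the vector unit only ever sees the dense list, which is the point the paragraph makes about keeping its pipeline full.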

【０３２５】2. Another host-attached system the inventors considered is a solution to the FBI fingerprint matching problem. Here a machine with 65 or more clusters was considered. The problem was how to match about 6000 fingerprints per hour against the entire database of fingerprint history. Using large DASD and the full bandwidth of the MPP-to-host attachment, the entire database can be cycled against the incoming fingerprints in about 20 minutes. Using about 75% of the MPP in a SIMD-mode coarse matching operation balances the processing against the required throughput rate. The inventors estimate that 15% of the machine, in A-SIMD processing mode, would then complete the match by verifying an unknown print against a file print in detail whenever a candidate passes the coarse filter. Meanwhile, the remainder of the machine runs in MIMD mode and is allocated to spare capacity, work queue management, and output formatting.

【０３２６】3. Application of the MPP to database operations was considered. This work is very preliminary, but the fit appears to be good. Two aspects of the MPP support this premise.
a. The connection between the cluster controller 640 and the application processor interface 630 is a Micro Channel. DASD can therefore be installed that is dedicated to a cluster and accessed directly from that cluster. A 64-cluster system with six 640 MB hard drives per cluster provides 246 GB of storage. Moreover, the entire database can be searched sequentially in 10 to 20 seconds.
b. Databases are generally not searched sequentially; instead, many levels of pointers are used. Indexing of the database can be done within the cluster. Each bank of DASD is supported by 2.5 GIPS of processing power and 32 MB of storage, which is sufficient for searching and storing the index. Because indexes are today often stored in DASD, performance would improve substantially. Using such an approach, with DASD distributed over SCSI interfaces attached to the cluster Micro Channel, databases of practically unlimited size can be built.

【０３２７】FIG. 29 shows a supercomputer-scale MPP built using the APAP. This approach again uses replication of units, but here the unit replicated is the enclosure that houses 16 clusters. This replication approach … The basic (operand) size is the 16-bit word. In PME storage, operands are located on integral word boundaries. Besides the word operand size, other operand sizes that are multiples of 16 bits can be used to support additional functions.

【０３３０】Within any operand length, the bit positions of the operand are numbered consecutively from left to right, starting with 0. A reference to the high-order or most significant bit always means the leftmost bit position. A reference to the low-order or least significant bit always means the rightmost bit position.
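Because this numbering (bit 0 is the leftmost, most significant bit) is the reverse of the shift-count convention used in most programming languages, a small helper makes the mapping explicit. This is a sketch only; the names `get_bit` and `get_field` are illustrative and do not come from the specification.

```python
WORD_BITS = 16


def get_bit(word, position, width=WORD_BITS):
    """Return bit `position`, where position 0 is the leftmost (most significant) bit."""
    if not 0 <= position < width:
        raise ValueError("bit position out of range")
    return (word >> (width - 1 - position)) & 1


def get_field(word, first, last, width=WORD_BITS):
    """Return bits first..last (inclusive, bit 0 leftmost) as an unsigned integer."""
    if not 0 <= first <= last < width:
        raise ValueError("bit range out of range")
    length = last - first + 1
    return (word >> (width - 1 - last)) & ((1 << length) - 1)
```

For example, `get_field(0xA5C3, 0, 3)` selects the leftmost four bits of the word and `get_field(0xA5C3, 12, 15)` the rightmost four.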

【０３３１】Instruction Format: Instruction formats are 16 or 32 bits long. In PME storage, instructions must be located on 16-bit boundaries.

【０３３２】The general instruction formats shown in Table 1 are used. Normally the first four bits of an instruction define the operation code and are called the OP bits. Additional bits may be needed to extend the definition of the instruction or to define unique conditions that apply to it. These bits are called the OPX bits.

【表１】[Table 1]

【０３３３】One field is common to all formats. The field and its interpretation are as follows.

【０３３４】Bits 0-3: Operation Code. The operation code, sometimes together with the operation code extension field, defines the operation to be performed.

【０３３５】Detailed figures of the individual formats, with the interpretation of their fields, are given in the following sections. For some instructions, two formats are combined to form variants of the instruction. These relate mainly to the addressing mode of the instruction. For example, a storage-to-storage instruction may have a form that involves direct addressing and a form that involves register addressing.

【０３３６】RR Format: The register-register (RR) format provides two general register addresses and is 16 bits long, as shown in FIG. 30.

【０３３７】In addition to the operation code field, the RR format contains the following fields.

【０３３８】Bits 4-7: Register Address 1. The RA field specifies which of the 16 general registers is used as an operand, as the destination, or as both.

【０３３９】Bits 8-11: Zeros. When bit 8 is 0, the format is defined as the RR or DA format; when bits 9-11 are also 0, the operation is defined as a register-to-register operation (a special case of the direct address format).

【０３４０】Bits 12-15: Register Address 2. The RB field specifies which of the 16 general registers is used as an operand.

【０３４１】DA Format: The direct address (DA) format provides one general register address and one direct storage address, as shown in FIG. 31.

【０３４２】In addition to the operation code field, the DA format contains the following fields.

【０３４３】Bits 4-7: Register Address 1. The RA field specifies which of the 16 general registers is used as an operand, as the destination, or as both.

【０３４４】Bit 8: Zero. When bit 8 is 0, the operation is defined as a direct address operation or a register-to-register operation.

【０３４５】Bits 9-15: Direct Storage Address. The direct storage address field is used as an address into the level-specific storage block or the common storage block. Bits 9-11 of the direct address field must be non-zero to define the direct address form.

【０３４６】RS Format: The register storage (RS) format provides one general register address and an indirect storage address, as shown in FIG. 32.

【０３４７】In addition to the operation code field, the RS format contains the following fields.

【０３４８】Bits 4-7: Register Address 1. The RA field specifies which of the 16 general registers is used as an operand, as the destination, or as both.

【０３４９】Bit 8: One. When this bit is 1, the operation is defined as a register storage operation.

【０３５０】Bits 9-11: Register Delta. These bits are treated as a signed value used to modify the contents of the register specified by the RB field.

【０３５１】Bits 12-15: Register Address 2. The RB field specifies which of the 16 general registers is used as an operand.

【０３５２】RI Format: The register immediate (RI) format provides one general register address and 16 bits of immediate data. The RI format is 32 bits long, as shown in FIG. 33.

【０３５３】In addition to the operation code field, the RI format contains the following fields.

【０３５４】Bits 4-7: Register Address 1. The RA field specifies which of the 16 general registers is used as an operand, as the destination, or as both.

【０３５５】Bit 8: One. When this bit is 1, the operation is defined as a register storage operation.

【０３５６】Bits 9-11: Register Delta. These bits are treated as a signed value used to modify the contents of the program counter. In the register immediate format this field normally has the value 1.

【０３５７】Bits 12-15: Zeros. A zero in this field specifies that the updated program counter, which points to the immediate data field, is used as the storage address of the operand.

【０３５８】Bits 16-31: Immediate Data. This field serves as the 16-bit immediate data operand of a register immediate instruction.
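The field definitions for the RR, DA, RS, and RI formats can be collected into a small decoder. This is a sketch under stated assumptions, not the patent's decode logic: it ignores the operation code (so it does not recognize the SS or SPC formats), it reads the 3-bit delta as two's complement, and it treats an RS-style word whose RB field is 0 as an RI instruction. The names `field`, `signed3`, and `decode` are hypothetical.

```python
def field(word, first, last):
    """Extract bits first..last (inclusive) of a 16-bit word; bit 0 is leftmost."""
    return (word >> (15 - last)) & ((1 << (last - first + 1)) - 1)


def signed3(value):
    """Read a 3-bit field as two's complement (an assumption, not from the text)."""
    return value - 8 if value & 0b100 else value


def decode(word, next_word=None):
    """Classify a first instruction word as RR, DA, RS, or RI."""
    op, ra = field(word, 0, 3), field(word, 4, 7)
    if field(word, 8, 8) == 0:
        if field(word, 9, 11) == 0:
            # Bits 8-11 all zero: register-to-register.
            return {"format": "RR", "op": op, "ra": ra, "rb": field(word, 12, 15)}
        # Bit 8 zero, bits 9-11 non-zero: direct address in bits 9-15.
        return {"format": "DA", "op": op, "ra": ra, "address": field(word, 9, 15)}
    delta, rb = signed3(field(word, 9, 11)), field(word, 12, 15)
    if rb == 0:
        # Assumed reading: RB = 0 selects the program counter, so the
        # operand is the 16-bit immediate held in the following word.
        return {"format": "RI", "op": op, "ra": ra, "delta": delta,
                "immediate": next_word}
    return {"format": "RS", "op": op, "ra": ra, "delta": delta, "rb": rb}
```

For instance, `decode(0x1203)` yields an RR instruction with RA = 2 and RB = 3, and `decode(0x1290, 0x00FF)` yields an RI instruction with a delta of 1 and the immediate taken from the second word.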

【０３５９】SS Format: The storage-to-storage (SS) format provides two storage addresses, one explicit and one implicit. The implicit storage address is held in general register 1. Register 1 is modified during execution of the instruction. As shown in FIG. 34, SS instructions have two forms: a direct address form and a storage address form.

【０３６０】In addition to the operation code field, the SS format contains the following fields.

【０３６１】Bits 4-7: Operation Extension Code. The OPX field, together with the operation code, defines the operation to be performed. Bits 4-5 define the operation type, such as ADD or SUBTRACT. Bits 6-7 control how carry, overflow, and the condition code are set. When bit 6 = 0, overflow is ignored; when bit 6 = 1, overflow is enabled. When bit 7 = 0, the carry stat is ignored during the operation; when bit 7 = 1, the carry stat is included in the operation.
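As a sketch of how the OPX sub-fields of an SS instruction split apart, the following assumes the bit numbering defined earlier (bit 0 leftmost in a 16-bit word). The text does not give the numeric codes for ADD or SUBTRACT, so the operation type is returned as a raw 2-bit number; the function name `decode_ss_opx` is hypothetical.

```python
def decode_ss_opx(word):
    """Split bits 4-7 of an SS instruction word into its three sub-fields."""
    opx = (word >> 8) & 0xF  # bits 4-7, counting bit 0 as the leftmost bit
    return {
        "operation_type": (opx >> 2) & 0b11,    # bits 4-5 (codes not given in the text)
        "overflow_enabled": bool(opx & 0b10),   # bit 6
        "carry_included": bool(opx & 0b01),     # bit 7
    }
```

A word such as `0x0B00` (OPX = B'1011') decodes to operation type 2 with overflow enabled and the carry stat included.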

【０３６２】Bit 8: Zero defines the form as the direct address form. One defines the form as the storage address form.

【０３６３】Bits 9-15: Direct Address (direct address form). The direct storage address field is used as an address into the level-specific storage block or the common storage block. Bits 9-11 of the direct address field must be non-zero to define the direct address form.

【０３６４】Bits 9-11: Register Delta (storage address form). These bits are treated as a signed value used to modify the contents of the register specified by the RB field.

【０３６５】Bits 12-15: Register Address 2 (storage address form). The RB field specifies which of the 16 general registers is used as the storage address of the operand.

【０３６６】SPC Format 1: The special (SPC1) format provides one general register storage operand address, as shown in FIG. 35.

【０３６７】In addition to the operation code field, the SPC1 format contains the following fields.

【０３６８】Bits 4-7: OP Extension. The OPX field is used to extend the operation code.

【０３６９】Bit 8: Zero or One. When this bit is 0, the operation is defined as a register operation. When this bit is 1, the operation is defined as a register storage operation.

【０３７０】Bits 9-11: Instruction Length. These bits are treated as a signed value that specifies the length of the operand in 16-bit words. A value of 0 corresponds to a length of 0, and a value of B'111' corresponds to a length of 8.

【０３７１】Bits 12-15: Register Address 2. The RB field specifies which of the 16 general registers is used as the storage address of the operand.

【０３７２】SPC Format 2: The special (SPC2) format provides one general register storage operand address, as shown in FIG. 36.

【０３７３】In addition to the operation code field, the SPC2 format contains the following fields.

【０３７４】Bits 4-7: Register Address 1. The RA field specifies which of the 16 general registers is used as an operand, as the destination, or as both.

【表２】 [Table 2]

【表３】 [Table 3]

【表４】 [Table 4]

【表５】 [Table 5]

【表６】 [Table 6]

【表７】[Table 7]

【表８】[Table 8]

【０３７８】Functional Summary: Positioning of the APAP machine: The preceding sections have described in detail aspects of the present invention, which can be regarded as positioned technically between the CM-1 and the N-Cube. Like the APAP, the CM-1 uses a point design for its processing element and combines processing elements and memory on its basic chip. However, the CM-1 uses a 1-bit-wide serial processor, whereas the APAP series uses a 16-bit-wide processor. The CM series of machines started with 4 kilobits of memory per processor and has grown to 8 or 16 kilobits, whereas the first APAP chip provides 32 kilobits × 16 bits of memory. The CM-1 and its successors are strictly SIMD machines, and the CM-5 is a hybrid. The APAP instead makes effective use of the MIMD operating mode together with the SIMD mode. The parallel 16-bit-wide PME might look like a step toward the N-Cube, but that view is not apt. Unlike N-Cube type machines, the APAP does not separate memory and routing from the processing element. In addition, the APAP provides a 32 kilobit × 16 bit PME, whereas the N-Cube provides only a 4 kilobit × 32 bit processor.

【０３７９】Despite these superficial similarities, the APAP concept differs completely from the CM and N-Cube series in the following respects.

【０３８０】1. The modified hypercube incorporated in the APAP is a new invention with significant packaging and addressing advantages over the hypercube topology. For example, note that the 32K-PME APAP of the first preferred embodiment has a network diameter of 19 logical steps, which transparency can reduce to an effective 16 logical steps. By comparison, if a pure hypercube is used and all PMEs send data over 8-step paths, on average 2 of every 8 PMEs are active while the rest are blocked and delayed.

【０３８１】Consider also the 64K hypercube that would be needed if the CM-1 were a pure hypercube. Each PME would then need ports to 16 other PMEs, and data could be routed between the two most distant PMEs in 15 logical steps. If all PMEs tried to transfer over the average distance of 7 steps, 2 of every 7 PMEs would be active. The CM-1, however, does not use a 16-dimensional hypercube. It interconnects the 16 nodes on a chip with a NEWS network and then provides one router function within the chip. The 4096 routers are connected to form a 12-dimensional hypercube. Without collisions the logical diameter of the hybrid is 15, but because 16 PMEs contend for the link, its effective diameter is much larger. That is, with 8-step moves, only 2 of the 16 processors are active, which means that 8 complete cycles rather than 4 are needed to complete all the data moves.

【０３８２】The N-Cube actually uses a pure hypercube, but it currently supports only 4096 PMEs and therefore uses a 12-dimensional hypercube (13-dimensional for 8192 PMEs). To extend the N-Cube to 16K processors so that it has the same processing data width as the APAP, the hardware would have to be quadrupled and the connection ports on each PME router increased by 25%. There is no hard data to support this conclusion, but the N-Cube architecture appears likely to run out of connector pins before reaching a 16K-PME machine.
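The dimensions quoted for the pure hypercubes follow from the fact that a binary hypercube of 2^d nodes has dimension d, with d links per node. A short sketch (the function name is illustrative) reproduces the figures for 4096, 8192, and 64K nodes:

```python
def hypercube_dimension(nodes):
    """Dimension d of a binary hypercube with 2**d nodes (d links per node)."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("node count must be a power of two")
    return nodes.bit_length() - 1


for count in (4096, 8192, 64 * 1024):
    print(count, "nodes ->", hypercube_dimension(count), "dimensions")
```

The per-node link count grows with the logarithm of the machine size, which is why pin count becomes the concern raised above as a pure hypercube is scaled up.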

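【０３８３】2. The fully integrated and distributed nature of the major tasks within the APAP machine is a clear advantage. As noted in the discussion of the CM and N-Cube series machines, those machines need separate units for message routing and for floating-point coprocessing, respectively. The APAP system combines integer processing, floating-point processing, message routing, and input/output control in a single point-design PME. That design is replicated eight times on a chip, and the chip is replicated 4K times to form the array. This has the following advantages.
a. Using a single chip gives the largest production runs and the lowest system element cost.
b. The regular architecture yields the most effective programming system.
c. Almost all chip pins can be dedicated to the general problem of interprocessor communication, which maximizes inter-chip input/output bandwidth, a quantity that tends to be an important limiting factor in MPP designs.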
As mentioned in the description of the Series and N-Cube series machines, they must have separate units for message routing and floating point coprocessor, respectively. The APAP system combines integer processing, floating point processing, message routing, and I / O control as a single point design PME. This design is then replicated eight times on the chip, and then the chip is replicated 4K times to form the array. This has the following advantages. a. The use of one chip maximizes production runs and minimizes system element costs. b. A regular architecture provides the most effective programming system. c. Since almost all chip pins can be dedicated to the common problem of interprocessor communication, the interchip I / O bandwidth, which tends to be an important limiting factor in MPP designs, is maximized.

【０３８４】3. The APAP has a unique design capability to take advantage of gains in chip technology and of capital investment in custom chip design.

【０３８５】Consider the question of floating-point performance. APAP PME performance on DAXPY is about 125 cycles per FLOP. By contrast, the '387 coprocessor takes about 14 cycles, and the Weitek coprocessor of the CM-1 takes about 6 cycles. In the CM, however, there is only one floating-point unit for every 16 PMEs, while in the N-Cube there is probably one '387-type chip coupled to each '386 processor. The APAP has 16 times as many PMEs and can therefore almost completely make up the single-unit performance delta.
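The claim that 16 times as many PMEs nearly offsets the per-unit gap can be checked with the cycle counts quoted above. The sketch assumes equal clock rates and perfect parallel scaling across the 16 PMEs, neither of which the text states, so the result is only a rough indication.

```python
APAP_CYCLES_PER_FLOP = 125   # per PME, on DAXPY (from the text)
WEITEK_CYCLES_PER_FLOP = 6   # per coprocessor (from the text)
PMES_PER_WEITEK = 16         # one floating-point unit per 16 CM PMEs (from the text)

# Sixteen APAP PMEs working in parallel complete 16 FLOPs every 125 cycles.
apap_group_cycles_per_flop = APAP_CYCLES_PER_FLOP / PMES_PER_WEITEK
ratio = apap_group_cycles_per_flop / WEITEK_CYCLES_PER_FLOP

print(f"{apap_group_cycles_per_flop:.2f} cycles per FLOP for a group of 16 PMEs")
print(f"{ratio:.2f}x the cycles of one shared coprocessor")
```

Under these assumptions a group of 16 PMEs needs about 7.8 cycles per FLOP against about 6 for the shared coprocessor.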

【０３８６】More importantly, the eight APAP PMEs in a chip are built from the 50K gates that are technologically available today. As memory macros shrink, the number of gates available to the logic grows. If the additional gates are used for extended floating-point normalization, APAP floating-point performance should be able to far exceed that of other units. Alternatively, custom design techniques can be applied to the PME design or to PME subsections to raise overall performance without affecting the software developed for the machine.

【０３８７】The inventors believe that the APAP design can take advantage of future developments in processing technology. By contrast, the closest machines, the CM-x and the N-Cube, which use systems like the one shown in FIG. 1, appear suited to exploiting conventional technology that seems to have reached a dead end.

【０３８８】An advantage of the APAP concept is the ability to use DASD associated with groups of PMEs. This APAP capability, along with the ability to attach displays and auxiliary storage, is a by-product of choosing the MC bus as the interface to the external input/output ports of the PME array. APAP systems are therefore configurable and can include card-mounted hard drives selected from any of the set of units compatible with PS/2 or RISC/6000 units. Furthermore, this capability should be available without designing any additional part-number modules, although it does require more replications of the backpanel and base enclosure than the APAP itself needs.

【０３８９】Picket and APAP multiple-PME MIMD/SIMD functions: The picket processor of the present invention is extremely compact, and the ability to place 1000 processors on two to eight cards appears especially advantageous for military use. The concepts in this system can, however, also be applied to processors built with less advanced technology, and could take the place of the processor memory chip. For example, some of the inventors' original concepts can be implemented on machines with workstation RISC microprocessors, even those with only one processor per card. Each processing unit is one element of the array. Each processing unit has memory and its own instruction stream and, in a MIMD implementation, can run and does run fully autonomously on its own code stream. When multiple elements execute copies of the same instruction stream and are synchronized so that they run more or less in step, the APAP or another such machine can emulate the picket SIMD architecture. The APAP machine is structured to function as MIMD elements that can emulate SIMD. By comparison, the picket processor is preferably built on a SIMD architecture, in which several data elements are controlled by one instruction-interpreting element. The picket machine emulates MIMD by having the single instruction-interpreting element command every data element to read from its own memory and to interpret what it reads as an element instruction. Each element can also keep track of its own next-instruction address. In this way the picket processor provides MIMD operation.

【０３９０】The APAP machine is a flexible machine that can implement a SIMD architecture and can perform, in commercial as well as military environments, functions such as those the inventors developed for the picket machine. It can implement a control structure that emulates SIMD with elements that have individual instruction streams. The APAP has multiple small processors on one chip, each with performance in the range of a picket machine. In some respects the APAP VLSI design (and the picket version) can achieve the function of a relatively coarse-grained processor, but it can also provide, and does provide, a much denser array design.

【０３９１】When a program needs a control network to perform "reduction" and related operations, the picket machine of the present invention provides the control network and the associated processing capability that is required.

【０３９２】Some of the features of the present invention that were advances in the art when the picket machine was developed include the SIMIMD function. With SIMIMD, picket memory can be loaded with a small amount of program code to be executed by each picket. Control remains with the controller, which can then load and execute a further small amount of code. Processing units in MIMD mode provide the ability to do independent things within each picket processing unit. It is no longer necessary to transfer an entire program to the array processor, and normally the entire program is not transferred to the pickets.

【０３９３】• For example, partition management can load the same code into every processing node of a partition. Data is distributed among the nodes. Given an array of m values and a partition of n nodes, each node processes m/n values. Because each node can independently execute the portions of the program that are in MIMD mode, branching on its own data values, no synchronization or communication is needed as long as the computation stays local. When data must be transferred between processors, for example when every processor must contribute a value to a global sum, the communication network carries the data and performs the necessary synchronization. For global combining operations such as a sum, the controller organizes the control network so that the reduction can be carried out.
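The m/n distribution and the global sum described above can be sketched in a few lines. This is an illustration on a single process, not the machine's partition manager: the node loop stands in for independent nodes, the data-dependent filter in `local_work` stands in for per-node branching, and all function names are hypothetical.

```python
def partition(values, n_nodes):
    """Split values into n_nodes contiguous blocks of nearly equal size."""
    base, extra = divmod(len(values), n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks


def local_work(block):
    """Per-node computation that branches on the node's own data values."""
    return sum(v for v in block if v % 2 == 0)


def global_sum(partials):
    """Stand-in for the reduction carried out over the control network."""
    return sum(partials)


values = list(range(16))            # m = 16 values
blocks = partition(values, 4)       # n = 4 nodes, so m / n = 4 values each
partials = [local_work(b) for b in blocks]
total = global_sum(partials)
```

Only the final `global_sum` step requires communication between nodes; everything in `local_work` stays local, matching the distinction drawn in the paragraph.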

[0394] Another feature of the picket machine of the present invention is the slide bus. The slide bus can be used to broadcast data from the array controller to the array of pickets, or from a single picket to the array controller, either for use by the controller or for rebroadcast to the array. Implementing the slide bus of the present invention can enhance the functions of many machines. Augmenting data transfer with picket processing activity allows more powerful functions to be performed on the slide bus. Functions that can be performed in a system with a slide bus include the following.

[0395] HORIZONTAL SUM: This process produces the sum of the numbers presented to it by each active picket. It is just one example of a type of command for which the slide bus is particularly useful: one that produces a single value from a sequence of numbers.

[0396] ACCUMULATE LEFT: In this process, each picket ends up containing the sum of all the numbers in the pickets to its right. Two implementations illustrate this capability.
A. The numbers are simply shifted to the right (with zero fill), and each picket simply adds to its sum.
B. The second implementation performs the process in a more parallel fashion and needs only four steps to handle 16 numbers.

[0397] FIND VALUE: The picket containing a given value of a parameter is identified.

[0398] FIND MAX: The picket containing the maximum value of a parameter is identified.
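The slide-bus functions listed above can be sketched in software as follows. This is only an illustration under stated assumptions: pickets are modeled as list positions, a higher index is taken to mean "further right", each picket's accumulated result includes its own value, and all function names are hypothetical. The two accumulate variants correspond to the serial implementation (A) and the log-step implementation (B).

```python
# Hypothetical software model of the slide-bus functions; one list slot per picket.

def horizontal_sum(values, active):
    """Sum of the values presented by active pickets only."""
    return sum(v for v, a in zip(values, active) if a)

def accumulate_serial(values):
    """Variant A: shift one position per step with zero fill; every picket adds."""
    acc, moving, steps = list(values), list(values), 0
    for _ in range(len(values) - 1):
        moving = moving[1:] + [0]                    # one-position shift, zero fill
        acc = [a + m for a, m in zip(acc, moving)]   # each picket adds what arrived
        steps += 1
    return acc, steps

def accumulate_log(values):
    """Variant B: doubling distances (1, 2, 4, ...) give log2(n) steps."""
    acc, n, distance, steps = list(values), len(values), 1, 0
    while distance < n:
        acc = [acc[i] + (acc[i + distance] if i + distance < n else 0)
               for i in range(n)]
        distance *= 2
        steps += 1
    return acc, steps

def find_value(values, target):
    """Indices of pickets holding the given value."""
    return [i for i, v in enumerate(values) if v == target]

def find_max(values):
    """Indices of pickets holding the maximum value."""
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]
```

With 16 values, the serial variant takes 15 steps and the log-step variant takes 4, which matches the step count the text gives for the second implementation.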

[0399] • As a further example, the system can divide the parallel processing nodes into groups, which can be regarded as partitions. A controller can manage each partition. A user process can run on a single partition.

[0400] • Interprocessor communication can be performed by replication, which copies data values. For example, a single value can be broadcast to all processors for use during a computation. A vector can be copied into each column or each row of a matrix; this is called a spread. A less regular pattern is the division of a collection into arbitrary subsets of various sizes, where a different value can be broadcast within each subset. If the subsets are ordered and not interleaved, they can be viewed as a collection of vectors of various sizes. This common case can be implemented in the same way as the general case.
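The three replication patterns just described (broadcast, spread, and a different value per subset) can be sketched as follows. The function names and the list-based data layout are assumptions for illustration; the subset case shown covers only ordered, non-interleaved subsets.

```python
# Hypothetical sketch of replication patterns on plain Python lists.

def broadcast(value, n_processors):
    """Copy a single value to every processor."""
    return [value] * n_processors

def spread_rows(vector, n_rows):
    """Copy a vector into each row of a matrix."""
    return [list(vector) for _ in range(n_rows)]

def spread_cols(vector, n_cols):
    """Copy a vector into each column of a matrix."""
    return [[v] * n_cols for v in vector]

def segmented_broadcast(values, sizes):
    """Broadcast a different value within each ordered, non-interleaved subset."""
    if len(values) != len(sizes):
        raise ValueError("need exactly one value per subset")
    out = []
    for value, size in zip(values, sizes):
        out.extend([value] * size)
    return out
```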

[0401] • Reduction can also be used with interprocessor communication. Reduction takes data values and reduces their number by combining them. For example, a single value can be produced by computing the sum of a set of values; in this case, addition is the combining operation. Reduction operations include taking the maximum or minimum and taking the logical AND or logical OR. All of these start from a large collection of values and reduce it to a single result.

[0402] • Permutation can be used with interprocessor communication. A permutation reorders its inputs to produce the same number of results. Every datum leaves one location and goes to one location. Examples are matrix transposition, vector reversal, shifts of a multidimensional grid, and the FFT butterfly pattern.
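The four permutation examples named above can be sketched as index mappings. The function names are hypothetical, the cyclic wraparound in the grid shift is an assumption (the text does not say how edges are handled), and the butterfly uses the usual radix-2 pairing of index `i` with `i XOR 2**stage`, which requires a power-of-two length.

```python
# Hypothetical sketch: each function moves every datum from one place to one place.

def transpose(matrix):
    """Matrix transposition: element (r, c) moves to (c, r)."""
    return [list(row) for row in zip(*matrix)]

def reverse(vector):
    """Vector reversal: element i moves to n - 1 - i."""
    return vector[::-1]

def grid_shift(grid, d_row, d_col):
    """Shift a 2-D grid by (d_row, d_col) with wraparound (assumed)."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[(r - d_row) % rows][(c - d_col) % cols]
             for c in range(cols)]
            for r in range(rows)]

def butterfly(vector, stage):
    """FFT butterfly exchange: element i swaps with element i XOR 2**stage."""
    return [vector[i ^ (1 << stage)] for i in range(len(vector))]
```

In every case the output has the same number of elements as the input and contains each input exactly once.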

[0403] • Global functions include broadcasting data or instructions from the host to the nodes, reducing data to all nodes, performing scans across the nodes, performing segmented parallel prefix operations, concatenating elements into a buffer on every node, and concatenating elements from the nodes into a buffer on the host. Reduction and parallel prefix operations can perform addition, find a maximum or minimum, or perform a bitwise AND, OR, or XOR. In the picket and APAP machines of the present invention, these operations can be performed at the node level or at the level of the individual processor memory elements. Each individual processor memory element functions like a node of the machine, providing the card with just a single processor with memory.
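The meaning of a segmented parallel prefix operation and of a reduction with different combining operators can be pinned down with a short reference model. This sketch is sequential and only defines the results a parallel implementation would have to produce; the function names and the flag-per-element segment encoding are assumptions.

```python
import operator

# Hypothetical reference semantics; a real machine would compute these in parallel.

def segmented_scan(values, segment_starts, op):
    """Inclusive prefix under op, restarting wherever segment_starts is true."""
    out = []
    for value, starts_segment in zip(values, segment_starts):
        if starts_segment or not out:
            out.append(value)               # a new segment begins here
        else:
            out.append(op(out[-1], value))  # continue the running result
    return out

def reduce_all(values, op):
    """Combine all values into one result with op (add, max, and_, or_, xor, ...)."""
    result = values[0]
    for value in values[1:]:
        result = op(result, value)
    return result
```

For example, `segmented_scan([1, 2, 3, 4, 5], [True, False, True, False, False], operator.add)` treats `[1, 2]` and `[3, 4, 5]` as separate segments.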

[0404] • The controller contains integer and logic hardware. The controller can compute parallel prefix operations and can segment parallel operations. All processors can enter synchronous mode and, when they are ready, the controller can signal readiness for SIMD operation. The processors can, however, operate in MIMD mode. A processor can overlap unrelated processing within wait times. Programs written for SIMD can be used, as in an operation in which thousands of processors all perform SIMD operations without needing to be strictly synchronized.

[0405] • Global operations, including global reduction, integer addition, finding an integer maximum, logical OR, exclusive OR, and floating-point operations, can be performed, all by a single node. Matrix operations can be performed.

[0406] • The functions of the controller and controlled network of the present invention include synchronizing the processing nodes (and the processing elements within the nodes) and combining values from every processing element to produce a single result; parallel prefix operations can also be computed. The picket processor of the present invention provides separate control and data networks. See U.S. Patent Application No. 611594 and other related applications.

[0407] • The present invention provides hardware clusters. Hardware clusters do not, however, preclude the use of UNIX software cluster partitioning for partitions. Replication can take place within a software-defined array, which may lie within one hardware cluster or span multiple hardware clusters. A broadcast to an array includes all of the processing elements defined for a given array process. Spread operations and their inverse (in the Fortran sense) are supported, and processors can be partitioned into clusters in the software sense for broadcast operations.

[0408] • The picket process of the present invention has broadcast from a node or from a processing element within a node. That is, several levels of broadcast operation with the same functions are provided, with the supervisory function reserved for the array controller. A processor can receive a broadcast while executing MIMD internal instructions. The present invention broadcasts instructions. Mask bits allow a processor to abstain from a broadcast.

[0409] Groups: A picket or the array controller can assign a picket to one or several of a plurality of groups. A picket can belong to multiple groups at the same time. Groups can be selected in SIMD mode or MIMD mode for a given part of a process and can move freely between the two.

[0410] • A partition can exchange data with processes in other partitions. Multiple users can access a partition without interfering with one another's use.

[0411] • Each processing element need not have (and in practice does not have) an entire operating system. Code can be downloaded or broadcast, together with data, to the processing elements of a node, for example for memory mapping. Each processing element and node has memory, which can be allocated to its own operations and to global memory operations. By broadcasting code to one or more processing elements, the individual processing elements can execute the supplied code, each acting on its own data and computing and branching accordingly.

[0412] • A partition can be freely configured for a single user task. Each partition can complete a task as part of a system that can be used for time-shared processing, batch processing, or both.

[0413] • Virtual network address allocation can include a protection check that prevents a user process from sending messages to destinations outside its partition. The array controller can send messages from one partition to another. Input/output can be coordinated at various levels, from the host to the nodes.
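A protection check of the kind described above can be sketched as a lookup against a partition table. Everything here is hypothetical: the class, its method names, and the rule that a send is allowed when sender and destination share at least one partition. The sketch also assumes an element may belong to more than one partition, and that the array controller is exempt from the check.

```python
# Hypothetical sketch of a partition-based send check.

class PartitionTable:
    def __init__(self):
        self._members = {}  # partition name -> set of element ids

    def assign(self, partition, element):
        """Add an element to a partition (an element may be in several)."""
        self._members.setdefault(partition, set()).add(element)

    def partitions_of(self, element):
        """All partitions the element currently belongs to."""
        return {name for name, members in self._members.items()
                if element in members}

    def may_send(self, sender, destination, is_array_controller=False):
        """User processes may only reach destinations inside a shared partition."""
        if is_array_controller:
            return True  # the controller may cross partition boundaries
        return bool(self.partitions_of(sender) & self.partitions_of(destination))
```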

[0414] • Autonomy: Each picket contains a status latch that controls processing. If this latch is set, the picket refrains from participating in processing. In addition, a picket can reactivate itself based on a test of conditions within its own memory.

[0415] • Vector elements can be handled within each processing element and node. Vector instructions can be in a short 32-bit format or a longer format.

[0416] • The local autonomy of the processor memory elements allows broadcasts to be suppressed. The latch can be set or reset, and a picket participates or does not participate according to its setting. A participating processing element is one whose participation has not been suppressed.
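The latch behavior described above can be modeled with a small class. This is a sketch under assumptions: the `Picket` class, its method names, and the use of a dictionary as picket memory are all hypothetical, and the latch is modeled as a single boolean where "set" means "do not participate".

```python
# Hypothetical model of a picket with a status latch and local reactivation.

class Picket:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.dormant = False  # status latch; set means the picket sits out

    def set_latch_if(self, predicate):
        """Set or reset the latch from a test on the picket's own memory."""
        self.dormant = bool(predicate(self.memory))

    def reactivate_if(self, predicate):
        """A dormant picket can wake itself when a local condition holds."""
        if self.dormant and predicate(self.memory):
            self.dormant = False

    def receive(self, instruction):
        """Apply a broadcast instruction unless the latch suppresses it."""
        if self.dormant:
            return False
        instruction(self.memory)
        return True

def broadcast(pickets, instruction):
    """Send one instruction to all pickets; return how many participated."""
    return sum(1 for picket in pickets if picket.receive(instruction))
```

A participating picket is simply one whose latch is not set at the moment the broadcast arrives.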

[0417] • Controller: Instructions are provided for manipulating numeric vectors. These vectors can be horizontal (distributed one element per picket) or vertical (the entire numeric vector is contained in one picket). Moreover, each picket can contain one numeric vector. The instructions provide all of the desired vector commands, such as adding two vectors, subtracting a constant from each member of a vector, various vector products, and reductions.

[0418] • The controller can provide all of the desired vector commands.

[0419] Status funnel: The picket array controller uses a "status funnel" to collect status from the active pickets and deliver a cumulative result to the array controller.

[0420] • The status funnel allows all processors to indicate that they have completed their processing step and that the next operation can proceed according to commands from the controller.
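The completion signal described above amounts to combining one status bit per active picket into a single value. The sketch below is a minimal model under assumptions: the function name and the choice of a logical AND over "done" flags are hypothetical, and inactive pickets are simply ignored.

```python
# Hypothetical sketch: funnel per-picket completion flags into one result.

def status_funnel(done_flags, active_flags):
    """True when every active picket reports that its step is complete."""
    return all(done for done, active in zip(done_flags, active_flags) if active)
```

With no active pickets the result is vacuously true, so a controller using such a check would proceed immediately; whether the hardware behaves this way is not stated in the text.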

[0421] There are other features, common to the picket processor, that can also be applied to other machines such as the APAP of the present invention. These features can be used in machines with parallel array processors.

[0422] • The inventors have provided independent processing elements that are scalable at both the processor level and the system level. Processing, communication, and input/output are therefore scalable. Floating-point and integer processing are provided. High bandwidth is available. The processor array executes high-level language programs, can run multiple jobs in time-shared and partitioned fashion, provides multi-user access with security between users, and provides high-bandwidth input/output through parallel communication among processing elements. With high reliability and fail-safe capability, the system can perform balanced scalar and parallel execution of input/output, processing, and memory.

[0423] • Single tasks and multiple tasks can be executed, and processing partitions, nodes, and elements can be time-shared. Nodes can fetch the same instruction from the same address in processor memory for execution in SIMD mode, or fetch from individually selected addresses to execute independent MIMD-mode instructions.
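The difference between the two fetch styles can be shown in a few lines. This is an illustrative model only: the function name, the string mode tags, and the representation of each node's memory as a list of instruction mnemonics are assumptions.

```python
# Hypothetical sketch: one fetch step across all nodes in either mode.

def fetch(memories, mode, common_address=None, program_counters=None):
    """Return the instruction each node fetches from its own memory."""
    if mode == "SIMD":
        # Every node reads the same address in its local memory.
        return [memory[common_address] for memory in memories]
    if mode == "MIMD":
        # Every node reads the address held in its own program counter.
        return [memory[pc] for memory, pc in zip(memories, program_counters)]
    raise ValueError("mode must be 'SIMD' or 'MIMD'")
```

For instance, with `memories = [["ADD", "SUB", "NOP"], ["ADD", "MUL", "NOP"]]`, a SIMD fetch at address 0 yields the same instruction on both nodes, while a MIMD fetch with program counters `[1, 2]` yields a different instruction on each.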

[0424] • The controller controls the execution of multiple jobs. Normally only one controller is provided, but multiple arrays can also be implemented.

[0425] • The system uses and extends VLSI to provide a RISC processor system at the nodes of a system in which each processing element has local memory and global memory.

[0426] • The system can execute programs written with data-parallel coding, and applications can be implemented for parallel processing in Fortran, C, or other high-level languages.

[0427] • A picket loads a small amount of application code into its own PE memory. Execution can make use of run-time code supplied by library functions.

[0428] • The machine supports data parallelism and provides branching and synchronization.

[0429] • Memory is provided at each point within each PE or PME, and the processors can fetch SIMD-style from the same address or MIMD-style from individually selected addresses.

[0430] • The controller can, and in practice does, broadcast blocks of instructions to the processing elements of the array. Replication, spread, reduction, and transposition functions can be performed.

[0431] • An individual PME broadcasts to the controller as the process requires.

[0432] • Only one node broadcast or PME broadcast is performed at a time, under the control of the controller.

[0433] • Processors can be partitioned into groups.

[0434] • This array processor system provides a way to route messages between processors.

[0435] • Memory address space for thousands of processors is provided. Each address space can be regarded as local to its element or as a global address for the entire array of processing elements.

[0436] • The picket processor provides conditional enablement for the autonomy of the processing elements.

[0437] Conditional processing can be performed using mask bits.

[0438] Mask bits allow individual processing elements to refrain from participating in a task.

[0439] • Each processing element can assign itself to a partition, and a partition can contain from zero up to all of the processing elements in the system. A processing element, processor memory element, or picket can belong to one or more partitions at the same time.

[0440] • The picket processor can include thousands of computer processing nodes, one or more control processors, and input/output units supporting mass storage, graphic displays, and an unlimited range of peripherals. Each processing node can be regarded as a unit that acts either as an ordinary node or as a mesh of individual processor memory elements, each of which provides a general-purpose computer capable of fetching and interpreting its own instruction stream and processing arrays of vectors.

[0441] • The status funnel is used to indicate the end of an array process.

[0442] • Global bit operations produce, for the controller, the logical OR of the status of every participating processor.

[0443] • Global operations may be synchronous or asynchronous and can be used independently.

[0444] • Grouping yields a group of elements for a partition.

[0445] • Processing elements are grouped using processing element or picket addresses assigned on the basis of hardware availability and possible failures.

[0446] • As in the preferred embodiment of the present invention, in which many other features can be implemented, each node is a multi-layer processor memory element node. A node can be formed as a RAM with embedded memory, which controls local memory and can also function as part of a distributed system.

[0447] This brief comparison is not intended to be limiting. Rather, it is meant to prompt those skilled in the art to review the foregoing description and consider how the many inventions described above could be used to advance the art of massively parallel systems to the point where programming is no longer a significant problem and the cost of such systems is much lower. Because systems of the kind described here can be built at a cost within reach of commercial department-level procurement, they can be made available to many people rather than just a few.

[Brief description of drawings]

FIG. 1 is a diagram showing a parallel processor processing element using conventional technology.

FIG. 2 is a diagram showing a parallel processor processing element using conventional technology.

FIG. 3 is a diagram showing the building block of a massively parallel processor, representing the novel chip design of the present invention.

FIG. 4 shows, on the right, the preferred chip physical cluster layout for the preferred embodiment of the chip single-node fine-grained parallel processor of the present invention and, on the left, the alternative technology. Each chip is a scalable parallel processor chip with CMOS DRAM memory and logic that provides 5 MIPS of performance and enables air-cooled implementation of massively parallel systems.

FIG. 5 is a functional block diagram of a computer processor according to the present invention.

FIG. 6 is a diagram showing a typical APAP computer system configuration.

FIG. 7 is a system overview of the fine-grained parallel processor technology of the present invention, illustrating system construction by replication of PME elements, which permits the development of systems with 40 to 193,840 MIPS of performance.

FIG. 8 shows the hardware for the processing element (PME) data flow and local memory according to the present invention.

FIG. 9 shows the PME data flow, in which the PME is configured as a hardwired general-purpose computer that provides about 5 MIPS of fixed-point processing, or 0.4 MFLOPS through program-controlled floating-point operations.

FIG. 10 shows the inter-PME connections (binary hypercube) and data paths that can be used in accordance with the present invention.

FIG. 11 shows the node interconnections for a chip or node having eight PMEs, each of which manages a single external port, allowing network control functions to be distributed and eliminating the functional hardware port bottleneck.

FIG. 12 shows a scalable parallel processor chip in which each PME is a 16-bit-wide processor with 32K words of local memory, input/output port operation is provided for a broadcast port that forms the interface between the controller and all of the facilities, and the external ports are bidirectional point-to-point interfaces permitting ring torus connections within the chip and to the outside.

FIG. 13 shows the array director of the preferred embodiment.

FIG. 14 shows the system bus to and from the cluster array coupling, which allows the array to be loaded and unloaded by connecting the edges of the clusters to the system bus (see FIG. 15).

FIG. 15 shows the bus to and from the processing element portion. FIGS. 14 and 15 show how multiple system buses are supported by multiple clusters. Each cluster can support 50 to 57 MB of bandwidth.

FIG. 16 shows the "zipper connection" for high-speed input/output connections.

FIG. 17 shows an eight-degree hypercube connection, illustrating the packaging technique of the present invention as applied to an eight-degree hypercube.

FIG. 18 shows two independent node connections in the hypercube.

FIG. 19 shows the Bitonic Sort algorithm as an example illustrating the advantages of the defined SIMD/MIMD processor system.

FIG. 20 is a system block diagram of a host-attached large system with one application processor interface. The figure can also be viewed with the understanding that the invention can be used in stand-alone systems that use multiple application processor interfaces. Such interfaces in FIG. 20 support DASD/graphics on all clusters or on many clusters. A workstation accelerator can eliminate, by emulation, the need for the host, the application processor interface (API), and the cluster synchronizer (CS) shown. The cluster synchronizer is not required in all instances.

FIG. 21 shows the software development environment for the system of the present invention. Programs can be created by, and executed from, the host application processor. Both program and machine debugging are supported by the workstation-based console shown in FIGS. 21 and 24. Both of these services support applications running on a real or a simulated MMP, so that applications can be developed at the workstation level as well as on a supercomputer-class APAP MMP. This common software environment enhances programmability and distributed use.

FIG. 22 shows the programming levels made available by the new system. Because different users need more or less detailed knowledge, software systems are developed to support these needs. At the highest level, the user need not know that the architecture is in fact an MMP. The system can be used with existing language systems for partitioning programs, such as parallel FORTRAN.

FIG. 23 shows the parallel FORTRAN compiler system for the MMP provided by the APAP configuration described above. The sequential-to-parallel compiler system uses a combination of existing compiler capability and new data allocation functions to permit the use of partitioning programs such as FORTRAN D.

FIG. 24 shows a workstation application of the APAP, in which the APAP becomes a workstation accelerator. Note that the unit has the same physical size as a RISC/6000 Model 530, but this model now contains an MMP connected to the workstation through the bus expansion module shown.

FIG. 25 shows an application of the APAP MMP module for AWACS military or commercial applications. This is one way of efficiently handling the conventional distributed sensor fusion problem shown in the figure. That problem is conventionally handled with well-known algorithms such as nearest neighbor, two-dimensional linear assignment (Munkres), probabilistic data association, and multiple hypothesis testing, but these can now be implemented in the improved manner shown in FIGS. 26 and 27.

FIG. 26 shows how the system makes it possible to handle the n-dimensional assignment problem in real time.

FIG. 27 shows the processing flow for the n-dimensional assignment problem using the APAP.

FIG. 28 shows the expansion unit provided by the system enclosure described above, showing how a unit can achieve 424 MFLOPS or 5120 MIPS using only eight to ten extended SEM-E modules, providing performance comparable to that of specialized signal processor modules in only about 0.017 m³. The system can be a large SIMD machine with 1024 parallel processors executing two billion instructions per second (GOPS), and it can be expanded by adding 1024 more processors and 32 MB of additional storage.

FIG. 29 shows the APAP packaging for a supercomputer. This is a large system that rivals the performance of other systems but has a much smaller footprint. It can be built by replicating APAP clusters within an enclosure like those used for smaller machines.

FIG. 30 is a diagram showing the register-register (RR) format.

FIG. 31 is a diagram showing the direct address (DA) format.

FIG. 32 is a diagram showing the register storage (RS) format.

FIG. 33 is a diagram showing the register immediate (RI) format.

FIG. 34 is a diagram showing the storage-to-storage (SS) format.

FIG. 35 is a diagram showing the special (SPC1) format.

FIG. 36 is a diagram showing the special (SPC2) format.

[Explanation of symbols]

200 Application processor
240 Test/debug device
250 Array director
300 Single processor unit
301 32K halfword memory
302 16-bit processor
310 Network node
313 Network router
314 Signal input/output mechanism
405 AR register
406 Multiplexer
420 Memory
460 Arithmetic logic unit
630 Application processor interface
640 Cluster controller
650 Cluster synchronizer

Continuation of the front page
(72) Inventor: James Warren Dieffenderfer, 396 Front Street, Owego, New York 13827, USA
(72) Inventor: Peter Michael Kogge, 7 Dorchester Drive, Endicott, New York 13760, USA
(72) Inventor: Nicholas Jerome Schoonover, P.O. Box 18, Tioga Center, New York 13845, USA

Claims

(57) [Claims]

1. A computer system for an array processor, comprising a plurality of processing units, each processing unit having its own memory and instruction stream, wherein each processing unit can autonomously execute its own code stream as a MIMD implementation; wherein, when more than one element is grouped under the control of a processor to execute SIMD instructions, the elements execute copies of the same instruction stream and are synchronized; and wherein, even while a plurality of processing units execute instructions in SIMD mode at the same time, other processing units can execute instructions in a mode other than SIMD mode.

2. The computer system according to claim 1, wherein the processing elements of the machine are MIMD elements capable of executing SIMD, and are thereby able to emulate a SIMD architecture.

3. The computer system of claim 1, wherein the system provides a SIMIMD function in which a small amount of program code to be executed by each picket is loaded into picket memory.

4. The computer system according to claim 1, further comprising a slide bus used to broadcast data from an array controller to an array of pickets, or from a single picket to the array controller, for use by the controller or for rebroadcast to the array.

5. The computer system of claim 1, wherein reduction is provided for use with inter-processor communication, the reduction taking data values and combining them to reduce the number of data values.

6. The computer system according to claim 1, wherein a control network and a data network are provided separately.

7. The computer system according to claim 1, wherein each processing element comprises means for broadcasting from a node or from a processing element within a node.

8. The computer system of claim 1, wherein the processing element or array controller can assign the processing element to one or several of the plurality of groups.

9. The computer system of claim 1, wherein each processing element includes a status latch that controls processing.

10. The computer system of claim 1, wherein vector elements can be handled within each processing element and node, and vector instructions can be in a short 32-bit format or a longer format.

11. The computer system according to claim 1, wherein a status funnel is provided, the array controller uses the status funnel to collect status from active processing elements, and the status funnel sends cumulative results to the array controller.

12. The computer system of claim 1, wherein a status funnel allows all processors to indicate that their processing steps have been completed and that the next operation can proceed in response to a command from the controller.

13. The computer system of claim 1, wherein single and multiple tasks can be performed and processing partitions, nodes and processing elements can be time shared.

14. The computer system of claim 1, wherein a node is capable of fetching from the same address in processor memory to execute the same instruction in SIMD mode, or from individually selected addresses to execute instructions in independent MIMD mode.

15. The computer system according to claim 1, wherein each processing element is a VLSI chip RISC processor system residing in a node of the system, each processing element including a local memory and a global memory.

16. The computer system according to claim 1, wherein each element has a memory, and the processor can fetch in SIMD style from the same address or in MIMD style from an individually selected address.

17. An array processing system comprising: a plurality of processing elements, each having a processor and a memory, each of the processing elements selectively and autonomously executing an independent instruction stream on a plurality of independent data streams, thereby providing a MIMD mode; and a controller for dispatching a single instruction stream to the plurality of processing elements to instruct said processing elements to execute on a plurality of independent data streams, one for each processing element, wherein the processing elements execute the single instruction stream without regard to a fixed time relationship between or among them.

18. An array processing system comprising: a plurality of processing elements interconnected as an array processor, each processing element having a processor and a memory connected to the processor, each of the processing elements selectively and autonomously executing independent instruction streams on independent data streams, thereby providing a MIMD mode; and a controller for dispatching a single instruction stream to the plurality of processing elements to instruct the processing elements to execute on multiple independent data streams, one for each processing element, wherein the processing elements execute the single instruction stream without regard to a fixed time relationship between or among them.

19. A parallel array processing system comprising: a plurality of processing elements, each having a processor, a memory, and a data path interconnecting the processor and the memory; a controller for dispatching a single instruction stream to the plurality of processing elements for execution on a plurality of independent data streams, one for each processing element, held in the memory of each of the processing elements, thereby providing a SIMD mode; and an interconnection network interconnecting the plurality of processing elements for communication between or among the processing elements and between the controller and the processing elements, wherein each of the processing elements selectively and autonomously executes an independent instruction stream held in its own memory on a plurality of independent data streams in its respective memory, thereby providing a MIMD mode, and each processing element has means for exercising local autonomy to selectively discontinue participation in a broadcast or task.

20. An array processing system comprising: an array controller; data processing means having a plurality of processing elements, each having a memory connected to a processor; means within the plurality of processing elements for executing, in SIMIMD mode under cycle-by-cycle control of the array controller, a plurality of independent instruction streams on a plurality of independent data streams, one for each processing element, wherein execution of the plurality of independent instruction streams by the respective processors is controlled by the SIMD instruction stream; and means for indicating to the array controller, on a cycle-by-cycle basis, that every processing element has completed execution of each of the plurality of independent instruction streams.
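
As an informal illustration only, the SIMD-style and MIMD-style fetch of claim 14, the status funnel of claims 11 and 12, and the reduction of claim 5 can be sketched in Python. Everything in the sketch is an assumption introduced for illustration: the `PE` class, the toy `add` and `halt` opcodes, and the functions `run_simd`, `run_mimd`, `status_funnel`, and `reduce_sum` are hypothetical names, and the specification does not define this instruction encoding.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy encoding: (opcode, operand). Not taken from the patent.
Instr = Tuple[str, int]


@dataclass
class PE:
    """One processing element with its own local memory (assumed model)."""
    program: List[Instr]  # instructions held in this element's own memory
    pc: int = 0           # private fetch address, used only in MIMD style
    acc: int = 0          # accumulator holding this element's data value
    done: bool = False    # completion status reported to the controller

    def step(self, common_addr: Optional[int] = None) -> None:
        """Fetch from common_addr (SIMD style) or from the private pc (MIMD style)."""
        if self.done:
            return
        addr = self.pc if common_addr is None else common_addr
        op, arg = self.program[addr]
        if op == "add":
            self.acc += arg
        elif op == "halt":
            self.done = True
        if common_addr is None:
            self.pc += 1  # only MIMD-style fetch advances the private address


def status_funnel(pes: List[PE]) -> bool:
    """Combine per-element status into one cumulative result for the controller."""
    return all(pe.done for pe in pes)


def reduce_sum(pes: List[PE]) -> int:
    """Reduction: combine many data values into a single value."""
    return sum(pe.acc for pe in pes)


def run_simd(pes: List[PE], length: int) -> None:
    """The controller supplies one address per cycle; every element fetches from it."""
    for addr in range(length):
        for pe in pes:
            pe.step(common_addr=addr)


def run_mimd(pes: List[PE]) -> None:
    """Each element follows its own address until the funnel reports completion.

    Assumes every program ends with a ("halt", 0) instruction.
    """
    while not status_funnel(pes):
        for pe in pes:
            pe.step()


if __name__ == "__main__":
    simd = [PE([("add", 1), ("add", 2), ("halt", 0)]) for _ in range(4)]
    run_simd(simd, 3)
    print(reduce_sum(simd), status_funnel(simd))

    mimd = [PE([("add", 5), ("halt", 0)]),
            PE([("add", 1), ("add", 1), ("add", 1), ("halt", 0)])]
    run_mimd(mimd)
    print(reduce_sum(mimd), status_funnel(mimd))
```

In this sketch the only difference between the two modes is where the fetch address comes from, which mirrors the wording of claim 14. The funnel is modeled as a simple logical AND over the completion flags, which is one plausible reading of "cumulative results" and not necessarily the mechanism of the actual hardware.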