JP2004531783A

JP2004531783A - Scalable interconnect structure for parallel operations and parallel memory access

Info

Publication number: JP2004531783A
Application number: JP2002536883A
Authority: JP
Inventors: Hesse, John; Reed, Coke S.
Original assignee: Interactic Holdings LLC
Current assignee: Interactic Holdings LLC
Priority date: 2000-10-19
Filing date: 2001-10-19
Publication date: 2004-10-14
Anticipated expiration: 2021-10-19
Also published as: MXPA03003528A; EP1360595A2; CN1489732A; JP4128447B2; CA2426422A1; CA2426422C; WO2002033565A2; AU2002229127A1; WO2002033565A3; CN100341014C

Abstract

Several innovations allow multiple processors to access the same data in parallel. First, multiple remote processors can request reads from the same data location, and those requests can be serviced in overlapping time periods. Second, multiple processors can access data at the same location and can read, write, or perform multiple operations on the same data in overlapping time periods. Third, one data packet can be multicast to multiple locations, and multiple packets can be multicast to multiple sets of target locations.

Description

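[0001]
(Background of the Invention)
A problem that arises in systems performing highly parallel computation is supplying sufficient data flow to the multiple processors. U.S. Patent No. 5,996,020 and U.S. Patent No. 6,289,021 disclose high-bandwidth, low-latency interconnect structures that substantially improve data flow in a network. What is needed is a system that can fully exploit such high-bandwidth, low-latency interconnect structures by supporting parallel memory access and computation within the network.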
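[0002]
(Summary of the Invention)
Several innovations allow multiple processors to access the same data in parallel. First, multiple remote processors can request reads from the same data location, and those requests can be serviced in overlapping time periods. Second, multiple processors can access data at the same location and can read, write, or perform multiple operations on the same data in overlapping time periods. Third, one data packet can be multicast to multiple locations, and multiple packets can be multicast to multiple sets of target locations.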
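[0003]
In the following description, the term "packet" refers to a unit of data, preferably in serial form. Examples of packets include Internet Protocol (IP) packets, Ethernet (registered trademark) frames, ATM cells, switch-fabric segments that form part of a larger frame or packet, supercomputer inter-processor messages, and other data message types of finite length.
[0004]
The system disclosed herein also solves similar problems that arise in communications when multiple packets arriving at a switch access data at the same location.
[0005]
Other Multiple Level Minimum Logic network structures can be used as fundamental building blocks in many highly useful devices and systems, including a wide variety of processors, computers, memory devices, and logic devices. Examples of such devices and systems include parallel random access memories (PRAMs) and parallel computational engines. These devices and systems include a network interconnect structure as a fundamental building block together with embedded storage or memory and logic circuits. The data storage can take the form of first-in-first-out (FIFO) rings.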
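[0006]
(Detailed Description)
The features of the preferred embodiments described below that are believed to be novel are set forth in the claims. However, the embodiments of the invention, relating to both structure and method of operation, are best understood by reading the following description with reference to the accompanying drawings.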
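[0007]
FIG. 1 is a schematic block diagram showing an example of a general system 100 formed from building blocks that include one or more network interconnect structures. In the illustrated example, the general system 100 has a top switch 110 and a bottom switch 112 formed from network interconnect structures. The term "network interconnect structure" can also refer to other interconnect structures. Other systems can include additional elements formed from network interconnect structures. The general system 100 is shown with various components that can be included as core elements of a basic example system. Some embodiments include other elements in addition to the core elements. The other elements can include: 1) shared memory; 2) a direct connection 130 between the top switch and the bottom switch; 3) a direct connection 140 between the bottom switch and I/O; and 4) a concentrator connected between the logic units 114 and the bottom switch 112.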
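[0008]
The general system 100 has a top switch 110 that functions as an input terminal: it receives input data packets from input line 136, or from external sources and possibly from the bottom switch over bus 130, and distributes the received packets to dynamic processor-in-memory logic modules (DPIMs) 114. The top switch 110 routes packets within the general system 100 according to communication information carried in the packet header. Packets are sent from the top switch 110 to the DPIM modules 114. Control signals from the DPIM modules 114 to the top switch 110 control the timing of packet injection to prevent collisions, which could otherwise occur with data in the DPIMs or in the bottom switch. The system can also use output lines and buses 130, 132, 134, and 136 to transfer information to additional computing, communication, storage, and other elements (not shown).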
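[0009]
Data packets enter the top switch 110 and are directed to a target DPIM 114 according to the address field of each packet. Information contained in a packet can be used, possibly together with other information, to determine the operation that the logic of the DPIM 114 performs on the packet or on data held in DPIM memory. For example, information in the packet can cause data stored in DPIM memory to be modified, cause data held in DPIM memory to be sent to the bottom switch 112, or cause other data generated by the DPIM logic module to be output through the bottom switch. Packets from the DPIMs are sent to the bottom switch. As other options, the general system 100 can include computational units, memory units, or both. A computational unit 126 can be arranged to send data packets through I/O unit 124 to destinations outside system 100, to the top switch 110, or to both. When the bottom switch sends packets to the top switch, the packets can be sent directly or through one or more interconnect modules (not shown) that handle timing and control between the integrated circuits that are subcomponents of system 100.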
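[0010]
In one embodiment of the system, data storage takes the form of first-in-first-out (FIFO) data storage rings R in the DPIMs 114 and conventional data storage associated with the computational units (CUs) 126. A FIFO ring consists of multiple single-bit shift registers connected in a loop. A FIFO ring includes two types of components. The first type, which is conventional, consists of single-bit shift registers that are connected only to adjacent single-bit shift registers and form a simple FIFO 310. The second type consists of single-bit or multi-bit registers contained in other elements of the system, such as the logic modules 114. The two types of components are connected in series to form the ring. For example, the total length F_L of a FIFO ring can be 200 bits, with 64 bits stored in the logic modules L and the remaining 136 bits stored in the serially connected registers of the FIFO. A clock supplied to the entire system is connected to the FIFO elements and shift registers and advances the data bits to the next position in "bucket brigade" fashion. The cycle period is defined as the time, expressed in clock periods, that data takes to complete exactly one cycle of the FIFO ring. The integer value of the cycle period equals the length of the FIFO ring expressed as a number of components. For example, a ring of 200 components (length 200) has a cycle period of 200 system clock periods. The system can also have local clocks or timing sources that run at different rates. Depending on the embodiment, all FIFO rings in the system can have the same length, or their lengths can vary as integer multiples of a predetermined minimum length. In another embodiment, a ring can have a bus structure with multiple parallel paths, so that the amount of data held in the ring is an integer multiple of the ring length F_L.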
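The relationship between ring length and cycle period can be illustrated with a small software model. This is only a sketch: the class name `FifoRing` and the use of a `deque` are illustrative assumptions, not part of the disclosed hardware, which is built from shift registers.

```python
from collections import deque


class FifoRing:
    """Hypothetical software model of a ring of single-bit shift registers."""

    def __init__(self, bits):
        self.cells = deque(bits)

    def tick(self):
        # One clock period: every bit advances one position and the
        # last register feeds the first ("bucket brigade").
        self.cells.rotate(1)

    @property
    def cycle_period(self):
        # The cycle period in clock periods equals the ring length F_L.
        return len(self.cells)


# A ring of length F_L = 200, as in the example in the text.
ring = FifoRing([1, 0, 1, 1] + [0] * 196)
start = list(ring.cells)
for _ in range(ring.cycle_period):
    ring.tick()
# After exactly one cycle period every bit is back where it started.
assert list(ring.cells) == start
```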
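[0011]
In the general system 100, the top switch can handle packets of various lengths up to a system maximum length. In some applications, all packets can have the same length. More generally, packets of various lengths are input to the top switch. The packet length is denoted P_L, and P_L does not exceed F_L.
[0012]
Similarly, the bottom switch can handle packets of various lengths. In a typical embodiment of the general system 100, data of differing bit lengths is generated depending on the functions and operation of the DPIM logic modules 114 and the CUs 126. The DPIMs can function independently, or there can be multiple systems (not shown) that gather data from the DPIMs and can supply data to DPIMs or other elements inside or outside system 100.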
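[0013]
FIG. 2 is a schematic block diagram showing an example of a parallel random access memory (PRAM) system 200 formed from fewer building blocks than are included in FIG. 1. The PRAM system has a top switch 110, a concentrator 150, and a bottom switch 112 formed from network interconnect structures. The system also includes multiple DPIMs 114 that store data. The DPIM units can perform ordinary READ and WRITE functions, which allows the system to be used as a parallel random access memory.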
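[0014]
In the illustrated embodiment, a data packet entering the top switch 110 has the form:
Payload | Operation Code 2 | Address 2 | Operation Code 1 | Address 1 | Timing Bit
which can be abbreviated as:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0015]
The number of bits in the PAYLOAD field is denoted PayL. The numbers of bits in OP2 and OP1 are denoted OP2L and OP1L, respectively. The numbers of bits in AD2 and AD1 are denoted AD2L and AD1L, respectively. The BIT field is one bit long in the preferred embodiment.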
[0016]
The following table gives a brief description of the packet fields.

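[0017]
The BIT field enters the switch first and is always set to 1 to indicate that a packet is present. The BIT field is also called the "traffic bit". The AD1 field is used to direct the packet through the top switch to the packet's target DPIM. The top switch 110 can be arranged in multiple hierarchical levels and columns through which packets pass. Each time a packet enters a new level of the top switch 110, one bit of the AD1 field is removed, so the AD1 field becomes shorter. System 200 uses the same technique. When the packet exits the top switch 110, nothing remains of the AD1 field. The packet therefore has the following form when it leaves the top switch:
PAYLOAD | OP2 | AD2 | OP1 | BIT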
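The way the AD1 field is consumed level by level can be sketched as follows. The `Packet` dataclass, the bit-string representation, and the choice to strip the leading (most significant) bit at each level are assumptions made for illustration; the text only states that one AD1 bit is removed per level.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet model: PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT."""
    payload: str
    op2: str
    ad2: str
    op1: str
    ad1: str
    bit: str = "1"  # traffic bit, always 1 when a packet is present


def route_through_top_switch(packet):
    """Return (target ring index J, levels traversed, packet leaving the switch)."""
    target_ring = int(packet.ad1, 2)  # J = AD1
    remaining = packet.ad1
    levels = 0
    while remaining:
        remaining = remaining[1:]  # one AD1 bit is removed per level (assumed MSB first)
        levels += 1
    return target_ring, levels, replace(packet, ad1=remaining)


pkt = Packet(payload="10110011", op2="01", ad2="0000011", op1="1", ad1="0101")
ring_j, levels, out = route_through_top_switch(pkt)
```

With a four-bit AD1 field the packet passes four levels and leaves with an empty AD1 field, matching the shortened format shown above.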
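[0018]
Systems 100 and 200 have multiple DPIM units. FIG. 3 is a schematic block diagram that illustrates an example DPIM unit and shows the data and control connections between the DPIM and the top switch 110 and bottom switch 112. FIG. 3 shows four data interconnect structures Z, C, R, and B. Interconnect structure Z can be a FIFO ring located in the top switch 110. Interconnect structures C and R are FIFO rings located in the DPIM module. In some embodiments, the DPIMs send data directly to the bottom switch. In those embodiments, interconnect structure B is a FIFO ring when the bottom switch is an interconnect structure. FIGS. 1 and 7 illustrate systems without a concentrator, while FIGS. 2, 3, 4A, and 5 show systems that include a concentrator.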
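[0019]
Data passes through the top switch 110 and reaches target output ring Z_J, where J = AD1. Ring Z = Z_J has multiple nodes 312 connected to output lines 326. The DPIM module has a packet-receiving ring C 302, called the "data communication ring", and one or more "data storage rings" R 304. The DPIM shown in FIG. 3 has a single data storage ring R. Each of the structures Z, C, R, and B includes multiple interconnected single-bit FIFO nodes. Some nodes in a structure have a single data input port and a single data output port and are interconnected to form a simple multi-node FIFO. Other nodes in a structure have additional data input ports, additional data output ports, or both. The nodes can also have control-signal output ports or control-signal input ports. Ring Z receives control signals from ring C and sends data to logic modules L 314. Rings C and R send data to and receive data from logic modules L 314. FIFO B 308 sends control signals to logic modules L and receives data from logic modules L. A DPIM can have multiple logic modules capable of sending data to multiple input ports of the interconnect structure or FIFO B. Data from a DPIM can be injected into multiple rows of the top level of system B. The number of DPIMs can equal the number of memory locations, in which case each DPIM has a single storage ring R that stores one word of data. Alternatively, a DPIM unit can contain multiple storage rings R. A particular storage ring can be identified by part of the address AD1 field or by part of the operation OP1 field.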
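[0020]
The timing of packet movement is synchronized across the four rings. As packets circulate in the rings, they are aligned with respect to the BIT field. As a beneficial result of this alignment, ring C sends control signals 328 to ring Z that permit or prohibit nodes on Z from sending packets toward C. On receiving permission from a node 330 on ring C, a node 312 on ring Z can send a packet to logic module L, and the logic module is positioned to process the packet immediately in bit-serial fashion. Similarly, packets circulating in data storage ring R are synchronized with ring C, so that the logic module L conveniently processes corresponding bits as the packets circulate in their respective rings. Data storage ring R functions as a memory element that can be used in several novel applications described below. When the DPIMs are not on the same chip as the top switch, a separate data communication ring (not shown) connecting the nodes of ring Z to the logic modules L can be used for inter-chip timing and control.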
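[0021]
Data in storage ring R is aligned with and overlaps portions of packets in Z ring 306 of the top switch and is accessible from the top switch 110 by multiple packets that occur simultaneously within a cycle period. Multiple logic modules 314 are associated with data communication ring C and data storage ring R. A logic module L can read data from rings C and R, operate on the data under certain conditions, and write to rings C and R. A logic module L can also send a packet to a node 320 of FIFO 308 in the bottom switch 112 or the concentrator. When the DPIMs are not on the same chip as the bottom switch, a separate data communication ring (not shown) connecting nodes 320 of interconnect structure B to logic modules L 314 can be used for inter-chip timing and control. A separate data communication ring can also be used for timing and control when a single device needs to access multiple bits of the communication ring in one cycle period.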
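[0022]
Packets enter communication ring C through the logic modules 314. Packets leave the logic modules L and enter the bottom switch at various angles through input channels.
[0023]
In some examples of the general system 100, all of the logic modules along rings C and R of a DPIM 114 are of the same type and perform the same logic function. In other examples, several different logic module types are used, so that multiple logic functions can be performed on data stored in ring R of a particular DPIM. A logic module L 314 can modify data as the data circulates in ring R. A logic module operates on data bits that pass serially through the module from ring C, from ring R, and from a node on ring Z. Typical logic functions include (1) data transfer operations such as load, store, read, and write; (2) logical operations such as AND, OR, NOR, NAND, exclusive OR, and bit test; and (3) arithmetic operations such as addition, subtraction, multiplication, division, and transcendental functions. Many other types of logic operations can also be included. The functions of a logic module can be hardwired into the module or implemented by software loaded into the module from packets sent to it. In some embodiments, the logic modules associated with a particular data storage ring R operate independently. In other embodiments, groups of logic modules are controlled by a separate system (not shown) that can receive data from the groups of logic modules. In still other embodiments, a logic module control system executes control instructions on data received from the logic modules.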
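[0024]
In FIGS. 1 and 2, each DPIM has one ring R and one ring C. In another embodiment of system 100, a particular DPIM 114 has multiple R rings. In such a multiple-R-ring embodiment, a logic module 314 can access data from the C ring and from all of the R rings simultaneously. Simultaneous access allows the logic module to modify data on one or more R rings based on the contents of the R rings and also on the contents of the associated communication ring C and of the received packet.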
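[0025]
A typical function performed by a logic module is execution of the operation specified in the OP1 field on data held in the PAYLOAD field of the packet in conjunction with data held in ring R. In one specific example, operation OP1 directs that the data in the packet PAYLOAD be added to the data stored in ring R at address AD1. The resulting sum is sent to the target port of the bottom switch at address AD2. As directed by instructions held in the data field of the OP1 operation, the logic module can perform several actions. For example, the data in ring R 304 can be left unchanged. The logic module can replace the data in ring R 304 with the contents of the PAYLOAD field. Alternatively, logic module L can replace the data held in the PAYLOAD field with the result of an operation performed on the contents of ring R 304 and the PAYLOAD field. In another example, a memory FIFO can store program instructions as well as data.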
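[0026]
A general system 100 that includes multiple types of logic modules 314 associated with a communication ring C and storage ring R can use one or more bits of the OP1 field to designate the specific logic module used to perform an operation. In some embodiments, multiple logic modules perform multiple operations on the same data. The set of logic modules at address AD1 = x can perform different operations than the set of logic modules at address AD1 = y.
[0027]
The efficiency of data packet movement through the general system 100 depends on the timing of the data flow. In some systems, buffers (not shown) associated with the logic modules help maintain the timing of data transfer. In many embodiments, timing is maintained without buffering data. The interconnect structures of the general system 100 advantageously have operational timing that enables efficient parallel computation on, generation of, and access to data.
[0028]
A general system 100 composed of multiple components, including at least one switch, a group of data storage rings 304, and associated logic modules 314, can be used to implement a variety of computing and communication switches. Examples of computing and communication switches include IP packet routers or switches used in Internet switching systems, special-purpose sorting engines, general-purpose computers, and many parallel computing systems with general or specific functionality.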
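[0029]
FIG. 2 is a schematic block diagram showing a parallel random access memory (PRAM) formed using network interconnect structures as fundamental elements. The PRAM stores data that can be accessed simultaneously from multiple sources and sent simultaneously to multiple destinations. The PRAM has a top switch 110 and may or may not have communication rings that receive packets from the target rings of the top switch 110. In an interconnect structure without communication rings, ring Z passes through the logic modules. The top switch 110 has T output ports 210 from each of the target rings. In a typical PRAM system 200, the number of address locations is greater than the number of system I/O ports. For example, a PRAM system has 128 I/O ports that access 64K words of data stored in the DPIMs. The AD1 field is 16 bits long to accommodate 64K DPIM addresses 114. The AD2 field is 8 bits long to accommodate 128 output ports 204: seven bits hold the address and one bit is the BIT2 portion of the address. The top switch has 128 input ports 202 and 64K Z rings (not shown), each with multiple connections to a DPIM unit through output ports 206. The concentrator 150 has 64K (65,536) input ports 208 and 128 output ports 210. The bottom switch 112 has 128 input ports and 128 output ports 204. The concentrator follows the same control timing and signaling rules for inputs and outputs as the top and bottom switches and the logic modules.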
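The field widths quoted in this example follow from the port and address counts. The short calculation below reproduces them; the helper name `address_bits` is a hypothetical choice made for this sketch.

```python
def address_bits(count):
    """Number of bits needed to give each of `count` items a distinct binary address."""
    if count < 1:
        raise ValueError("count must be positive")
    return (count - 1).bit_length()


# 64K DPIM addresses need a 16-bit AD1 field.
AD1L = address_bits(64 * 1024)

# 128 output ports need 7 address bits; the text adds one BIT2 bit,
# giving an 8-bit AD2 field.
AD2L = address_bits(128) + 1
```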
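[0030]
Alternatively, the top switch can have fewer output Z rings and associated DPIMs. The DPIM units can have multiple R rings so that the total data size is unchanged.
[0031]
The PRAM shown in FIG. 2 has DPIM units 114 that include logic modules 314 directly connected to communication ring C 302 and storage ring R 304. The DPIM units 114 are connected to a packet concentrator 150 that supplies output data to the bottom switch 112.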
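[0032]
Referring to FIG. 3, a node 330 on ring C sends a control signal to a node 312 on ring Z of the top switch, permitting the individual node 312 on ring Z to send a packet to logic module L. On receiving a packet from ring Z, logic module L can perform one of several actions. First, logic module L can begin placing the packet on the C ring. Second, logic module L can begin using the data in the packet immediately. Third, logic module L can immediately begin sending a generated packet to the concentrator 150 without placing it on the C ring. A logic module L_i can begin placing a packet P on the C ring, and after logic module L_i has placed some bits on the ring, another logic module L_k (where k > i) can begin processing and removing those bits. In some cases the entire packet P is never placed on ring C. A logic module can insert data into the C ring or the R ring, or can send data to the concentrator 150. Signals on line 324 from the concentrator are also used to control data entering the concentrator. The logic modules 314 associated with a ring R can also have additional send and receive interconnections with an auxiliary device (not shown) that can be associated with that ring R. Auxiliary devices can have various structures and perform various functions depending on the purpose and function of the system. One example of an auxiliary device is a system controller.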
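[0033]
In one embodiment, the PRAM 200 has DPIMs containing multiple logic modules 314 that all have the same logic type and perform the same function.
[0034]
In another embodiment, a first DPIM S at a particular address can have multiple logic modules of different types and functions. A second DPIM T can have logic modules of the same or different types compared with the first DPIM S. In one application of the PRAM, one data word is stored in a single storage ring R. A logic module can modify the data as it circulates in ring R. In this PRAM, the logic modules modify the contents of storage rings R, which can store program instructions as well as data.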
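[0035]
The PRAM stores and retrieves data using packets defined to include the following fields:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0036]
The BIT field, set to 1 to indicate that a packet is present, enters the general system 100. The AD1 field designates the address of the particular DPIM that contains the data storage ring R 304 holding the desired data. The top switch sends the packet to the DPIM (AD1) designated by address AD1. In the example shown, the OP1 field consists of a single bit that designates the operation to be performed. For example, a logic value of 1 designates a READ request and a logic value of 0 designates a WRITE request.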
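[0037]
For a READ request, the receiving logic module in the DPIM at location AD1 sends the data stored on ring R to address AD2 of the bottom switch 112. For a WRITE request, the PAYLOAD field of the packet is placed on ring R at address AD1. AD2 is the destination address used to route data through the bottom switch 112 for a READ request and specifies where the memory contents are to be sent. OP2 optionally describes the operation that the device located at address AD2 is to perform on the delivered data. When operation OP1 is a READ request, the logic module that executes the READ request does not use the PAYLOAD field.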
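The READ and WRITE semantics can be summarized in a word-level sketch. The class `Dpim` and its method names are hypothetical, and the model deliberately ignores the bit-serial timing, the concentrator, and the control signals described elsewhere in the text; the encoding 1 = READ, 0 = WRITE follows the example above.

```python
READ, WRITE = 1, 0  # single-bit OP1 encoding from the example in the text


class Dpim:
    """Hypothetical word-level model of one DPIM with a single storage ring R."""

    def __init__(self, data=0):
        self.ring_r = data

    def handle(self, op1, payload=None, ad2=None, op2=None):
        if op1 == READ:
            # PAYLOAD is ignored; a copy of the ring R data is sent
            # toward bottom-switch address AD2 and ring R is unchanged.
            return {"data": self.ring_r, "op2": op2, "ad2": ad2}
        if op1 == WRITE:
            # PAYLOAD replaces the contents of ring R; nothing goes downstream.
            self.ring_r = payload
            return None
        raise ValueError("unknown OP1 code")


dpim = Dpim(data=0x1234)
reply = dpim.handle(READ, payload=0xFFFF, ad2=17, op2=0)
dpim.handle(WRITE, payload=0xBEEF)
```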
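[0038]
In one embodiment, the PRAM contains only one type of logic module, a type that performs both READ and WRITE operations. Other embodiments of the PRAM use different types of logic modules, for example with separate READ elements and WRITE elements.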
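[0039]
Referring to FIGS. 2 and 3, the illustrated PRAM 200 begins an operation by receiving a packet that enters the top switch 110 at an appropriate time. Packet P is routed through the top switch and reaches the target ring Z located at address AD1. The AD1 field of the packet designates target ring Z_J 306 of the top switch, where J = AD1. A node S (not shown) and a node T (not shown) are defined to specify the message timing. Node S is defined as a node 330 on ring C_J, node T is defined as a node 312 on ring Z_J, and node S is positioned to send a control signal to node T over control line 328. Based on a global timing signal, node S 330 identifies the timing-bit arrival time at node S. If a timing bit with value 1 arrives at node S at the timing-bit arrival time, node S sends a blocking signal over line 328 to node T 312 on ring Z, prohibiting node T from sending a packet over line 326 to logic unit L. If node S does not receive a bit with value 1 at the timing-bit arrival time, no message is entering node S from ring C, and node S sends a non-blocking control signal to node T. Global timing is arranged so that the control signal arrives at node T at the same time as a message arrives at node T from ring Z or from a node U located one level above ring Z in the top switch. A packet leaving the top switch 110 travels from node 312 over line 326 to the logic module. The logic module can place the packet on communication ring C 302 or can process the packet immediately without placing it on ring C. At this point, packet P has the form:
PAYLOAD | OP2 | AD2 | OP1 | BIT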
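[0040]
Packet P is sent from ring Z over line 326 to logic module L. When packet P begins moving to logic module L, a node N_Z on ring Z sends a control signal to a node W at a higher level in the top switch to inform node W that node N_Z is not blocked. This control signal gives node W permission to route data to a node N_X that is positioned to receive data from node N_Z. Logic module L acts in the same way, with respect to timing, on a packet arriving over line 326 and on a packet arriving from ring C. When packet P enters logic module L, logic module L parses and executes the command in the OP1 field.
[0041]
In the embodiment shown, communication ring C has the same length as storage ring R. Bits move through rings C and R bit-serially at a rate governed by a common clock. The first bit of the PAYLOAD field of the packet is aligned with the first bit of the DATA field of ring R. Thus, for a READ request, the data in ring R is copied into the payload portion of the packet. For a WRITE request, the data in the payload portion of the packet can be transferred from the packet to storage ring R.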
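[0042]
READ Request
For a READ request, packet P has the form:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0043]
The packet enters the top switch. In general, a logic module of the DPIM at address AD1 identifies a READ request by examining the operation code OP1 field. The logic module replaces the PAYLOAD field of the packet with the DATA field from ring R. The updated packet is then sent through the concentrator to the bottom switch and from there to a computational unit (CU) 126 or other device at address AD2. The CU or other device can execute the instruction designated by operation code 2 (OP2) in conjunction with the PAYLOAD field.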
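[0044]
Packet P enters node T 312 on ring Z. In response to the timing bit of packet P entering node T and a non-blocking control signal from node 330 on ring C, node T begins sending packet P over data path 326 to logic module L. When the BIT and OP1 fields enter logic module L, a control signal on line 324 also arrives at logic module L, indicating whether the concentrator 150, or the bottom switch if there is no concentrator, can accept a message. If the control signal indicates that the concentrator cannot accept a message, logic module L begins transferring the packet to ring C. Packet P moves to the next logic module on ring C.
[0045]
At some point, one of the logic modules on ring C receives a not-busy control signal from lower in the hierarchy. At that time, logic module L begins transferring packet P to an input node 320 of interconnect structure B.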
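[0046]
For a READ request, the logic module strips the OP1 field from the packet and begins sending the packet over path 322 to an input node 320 of the concentrator. The logic module first sends the BIT field, then the AD2 field, and then the OP2 field. Timing is set so that the last bit of the OP2 field leaves the logic module at the same time as the first bit of the DATA field of storage ring R reaches the logic module. The logic module leaves the DATA field in storage ring R unchanged, places a copy of DATA in the PAYLOAD field of the packet sent downstream, and continues sending the packet bit-serially to the concentrator. The data in ring R remains unchanged.
[0047]
The packet is unchanged on entering and leaving the concentrator and has the following form when it enters the bottom switch 112:
DATA | OP2 | AD2 | BIT
[0048]
The PAYLOAD field now contains the DATA field from ring R. The AD2 field is removed as the packet is routed through the bottom switch. The packet exits through the output port 204 located at address AD2 of the bottom switch. On exit, the packet has the form:
DATA | OP2 | BIT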
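[0049]
The OP2 field is a code that can be used in various ways. One use is to indicate the operation that the bottom-switch output device performs on the data stored in the PAYLOAD field.
[0050]
The interconnect structures of the PRAM inherently have a circular timing that enables efficient, parallel generation of and access to data. For example, multiple external sources at different input ports 202 can request READ operations on the same DATA field in a particular DPIM 114. Multiple READ requests can enter a particular target ring Z 306 of the top switch at different nodes 312 and then enter different logic modules L of the target DPIM. The READ requests can enter different logic modules on ring C within the same cycle period. Communication ring C 302 and memory ring R 304 are always synchronized with respect to the movement of packets in the input interconnect structure B of the concentrator and in the target ring Z of the top switch.
[0051]
A READ request always arrives at a logic module at a time suitable for appending the data from ring R at the proper PAYLOAD position of the outgoing packet. As a beneficial result, multiple requests for the same data in ring R can be issued at the same time. The same DATA field is accessed by the multiple requests. The data from ring R is sent to multiple final destinations. The multiple READ operations execute in parallel, and the outgoing packets reach multiple output ports 204 simultaneously. The multiple READ requests execute in overlapped fashion, with different logic modules reading simultaneously from different locations in ring R. In addition, other READ requests execute in the same cycle period at different addresses of the PRAM memory.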
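[0052]
System timing causes multiple READ requests to execute efficiently and in parallel in overlapped fashion. FIGS. 4A, 4B, and 4C illustrate the timing for a single READ. Storage ring R is the same length as communication ring C. Ring R contains circulating data 414 of length PayL. The remaining storage elements in ring R can be zero or "blank", or can be ignored and take arbitrary values. The BLANK field 412 is the set of bits not included in the DATA field 414.
[0053]
Referring to FIG. 4A, a portion of each of rings C and R passes through a logic module of a particular DPIM. The logic module contains at least two bits of the set of shift registers that make up ring C and at least two bits of the shift registers that make up ring R. In some embodiments, a DPIM 114 contains multiple logic modules 314. The logic module is arranged to read two bits of communication ring 302 in one clock period. At a time indicated by a global signal (not shown), the logic module examines the BIT field and the OP1 field. In the example shown, the logic module reads the entire OP1 field and the BIT field together. In other embodiments, the OP1 and BIT fields can be read in multiple operations. For a READ request, an unblocked logic module 314 sends the packet to the concentrator or bottom switch at the proper time so that the packet is aligned with other bits at the input of the concentrator or bottom switch.
[0054]
For a READ request, when a blocked logic module places the packet on ring C, the packet moves to the next logic module. The next logic module may or may not be blocked. If the subsequent logic module is blocked, that blocked logic module likewise passes the packet on ring C to the next module. If the packet enters the rightmost logic module LR and LR is blocked, logic module LR sends the packet through the FIFO on ring C. On leaving the FIFO, the packet enters the leftmost logic module. The packet circulates until it encounters an unblocked logic module. The length of ring C is set so that a circulating packet always fits completely on the ring. In other words, the packet length P_L never exceeds the ring length F_L.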
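[0055]
For a READ operation, the packet has the form:
| PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT |
[0056]
The packet is inserted into the top switch. Address field AD1 indicates the target address of the ring R 304 that contains the desired data. Address field AD2 is the target address of the output port 204 of the bottom switch to which the result is sent. Operation code OP2 designates the operation to be performed by the output device.
[0057]
In a typical embodiment, the output device is the same as the input device, so a single device is connected to an input port 202 and an output port 204 of the PRAM. For a READ request, the PAYLOAD field is ignored by the logic module and can have any value. For a WRITE operation, by contrast, the PAYLOAD field contains the data to be placed on the ring R 304 associated with the DPIM at address AD1. The modified packet sent out from the logic module has the form:
| DATA | OP2 | AD2 | BIT |
[0058]
Data entering the bottom switch has the form:
| DATA | OP2 | AD2 | BIT |
[0059]
The data is sent out of the bottom switch through the output port designated by address field AD2, where DATA is the data field 414 of ring R.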
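[0060]
FIGS. 4A, 4B, and 4C show the timing among communication ring C, data storage ring R, and concentrator B. In embodiments whose rings have multiple parallel FIFOs in a bus structure, a logic module 314 can read multiple bits at a time. In the present embodiment, logic module L receives only one bit per clock period. Concentrator B includes multiple input nodes 320 on FIFO 308 that can receive packets from the logic modules. The logic modules are positioned to inject data into the top level of the concentrator through input ports 322.
[0061]
Referring to FIG. 4A, the BIT field 402 is set to 1 and arrives at the logic module at the same time as the first bit B_0 408 of the BLANK field 412 on data ring R. The relative timing of the circulating data is arranged so that the first bit of DATA in ring R is aligned (as shown by line 410) with the first bit of the payload field of the request packet in ring C.
[0062]
Data that is already in concentrator B and is about to enter node 316 from another node of the concentrator has priority over data attempting to enter node 316 from above over path 322. A global packet-arrival timing signal (not shown) informs node 316 of the times at which a packet can enter. If a packet already in the concentrator enters node 316, node 316 issues a blocking signal over path 324 to the logic module connected to it. In response to the blocking signal, logic module L sends the READ request packet to communication ring C, as described above. If no blocking signal arrives from lower in the hierarchy, logic module L sends the packet over line 322 to an input node 320 in concentrator B downstream of node 316.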
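[0063]
FIG. 4A shows a READ request at time T = 0, where T = 0 is the time at which the logic module that received the request begins processing it. At this time, the logic module has enough information to determine whether it has received a READ request and whether the request is unblocked from below. In particular, the logic module examines the BIT and OP1 fields and responds to the following three conditions:
no busy signal has been received from below over line 324,
BIT = 1, and
OP1 = READ request.
[0064]
When these three conditions are met, the logic module is ready to begin the READ operation at the next time step. If OP1 = WRITE, the logic module begins the WRITE operation at the next time step.
[0065]
FIGS. 4B and 4C show the READ request in progress when no blocking signal is sent from node 316 to the logic module.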
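[0066]
FIG. 4B shows the READ request at T = 1. All data bits in rings Z, C, and R are shifted one position to the right. The rightmost bit of each ring enters the FIFO, and the FIFO supplies one bit to the leftmost element. The logic module sends the BIT field over line 322 to the input port of the concentrator. After the shift, the C-ring registers contain the second and third bits of the packet, namely the one-bit OP1 field and the first bit of the AD2 field. The logic module contains the second and third bits of the BLANK field of ring R, namely B_1 and B_2. In typical operation of the PRAM 200, a packet from ring Z can enter a logic module (not shown) located to the left of the illustrated logic module. In that case, the entire packet is not contained in ring C. The rest of the packet can be in the top switch 110, or can be in the process of "wormholing" from an input port through the top switch, leaving ring Z and entering logic module L 314. For ease of understanding, FIGS. 4A, 4B, and 4C show a READ request packet that is contained entirely on ring C.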
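[0067]
In the next AD2L + OP2L steps, logic module L reads the AD2 and OP2 fields and copies them to input port 320. At this point, the concentrator has received the BIT field, the AD2 field, and the OP2 field bit-serially. The concentrator receives and processes this sequence in wormhole fashion before the first bit of the DATA field 414 reaches logic module L. While logic module L is reading AD2 and OP2 on ring C, the BLANK field 412 on ring R passes through logic module L and is ignored. The logic module is positioned to read the first bit of the PAYLOAD section of the packet in communication ring C at the same time (shown by line 410) as the first bit of the DATA field of ring R arrives.
[0068]
Logic module L sends output data in two directions. First, logic module L returns a zeroed packet to ring C. Second, logic module L sends the DATA field downstream. All bits returned to ring C are set to zero 430 so that subsequent logic modules on ring C do not repeat the READ operation. Alternatively, logic module L can remove the request packet from communication ring C once the request has been processed successfully, which has the advantage that other logic modules on the same ring can accept other request packets in the same cycle period. Packets are advantageously processed by the logic modules in wormhole fashion, so that several different packets can be processed by a particular DPIM in one cycle period.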
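[0069]
At time K + 1, the first bit of the payload is positioned to be replaced with zero by logic module L, and the first bit D_1 of ring R is positioned to be sent to the bottom switch or to a concentrator that forwards data to the bottom switch. The process continues as shown in FIG. 4C. The logic module sends the second DATA bit D_2 to the concentrator while reading the third DATA bit D_3 from data ring R. At the end of the process, the entire packet has been removed from communication ring C, and the packet has the form:
| DATA | OP2 | AD2 | BIT |
[0070]
The packet is sent to input port 320 of the concentrator or to the bottom switch. DATA is copied from the DATA field of ring R to the concentrator. The DATA field 414 in data ring R does not change.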
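[0071]
Referring to FIG. 5, logic modules L1 504 and L2 502 execute READ requests simultaneously. Different request packets P1 and P2 generally arrive from different input ports 202 and enter the top switch, and the multiple READ requests are processed in wormhole fashion in a single DPIM. In the illustrated example, all requests are for the same PRAM address, which is designated in the AD1 field of each request packet. Packets P1 and P2 arrive at different logic modules L1 and L2, respectively, in the target DPIM. Each logic module processes its request independently of the other. In the example shown, the READ request P2 that arrived first is being processed by module L2 502. Module L2 has already read and processed the BIT field, the OP1 field, and five bits of the AD2 field, and has already sent the BIT field and four bits of the AD2 field to input node 512 of the concentrator. Similarly, module L1 has already read and processed two bits of the AD2 field of packet P1 and has sent the first AD2 bit to node 514 below. The AD2 fields of the two packets differ, so the DATA field 414 is sent to two different output ports of the bottom switch. The second request occurs only a few clock periods after the first, and processing of the two requests overlaps. A DPIM has T logic modules and can process T READ requests in the same cycle period. As a result of processing a READ request, a logic module always places zeros 430 on ring C.
[0072]
Routing requests and responses in wormhole fashion through the top switch and the bottom switch, respectively, allows any input port to send a request packet at the same time as other input ports. In general, any input port 202 can send a READ request to any DPIM independently of requests sent simultaneously from other input ports. The PRAM 200 supports parallel, overlapped access to a single database by multiple requesters and supports multiple requests for the same data location.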
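[0073]
WRITE Request
For a WRITE request, the AD1 field of the packet is again used to route the packet through the top switch. The packet leaves node 312 of the top switch at a predetermined position and enters ring C. The OP1 field designates a WRITE request. For a WRITE request, no data is sent to the concentrator, so the logic module ignores the control signal from the concentrator. The logic module sends "0" to input port 320 of the concentrator, conveying that no packet is being sent. A WRITE request on ring Z can always enter the first logic module it encounters on ring C.
[0074]
For simplicity of explanation, the request packet is shown in ring C. In more typical operation, the request is sent through the top switch to the logic module in wormhole fashion. For a WRITE request, the logic module ignores information in all fields other than OP1 and PAYLOAD.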
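[0075]
FIG. 6 illustrates a WRITE request at time T = K + 5. The WRITE packet on ring C and the data on ring R both rotate through the logic module in synchronization. The last bit of the OP2 field is discarded by the logic module at the same time as the logic module is aligned with the last bit of the BLANK field of storage ring R. When the first bit of the PAYLOAD field of the packet reaches the logic module, logic module L removes the first bit from ring C and places it in the DATA field of ring R. The process continues until the entire PAYLOAD field has been transferred from the communication ring to the DATA field of ring R. Logic module L zeroes the packet, preferably removing the packet from ring C so that no other logic module repeats the WRITE operation.
[0076]
For ease of visualization, FIG. 6 shows the data packet moving from ring C to ring R. The data normally arrives from the top switch. More specifically, the data is spread across the top switch.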
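[0077]
In another embodiment, in which a DPIM has multiple R rings, the address of the DPIM module is stored in the AD1 field and the address of a given R ring within the DPIM module is stored as part of an extended OP1 field. In an example in which a DPIM memory module has eight R rings, the OP1 field is four bits long: the first bit indicates a READ or WRITE operation and the next three bits indicate which R ring the request addresses. When each DPIM contains multiple R rings, the number of levels in the top switch and the number of levels in the concentrator are reduced.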
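Decoding the extended four-bit OP1 field of this example might look like the sketch below. The function name `decode_op1`, the bit-string representation, and the ordering of the three ring-select bits (most significant first) are assumptions for illustration; the READ = 1, WRITE = 0 convention follows the earlier single-bit example.

```python
def decode_op1(op1_bits):
    """Split a 4-bit extended OP1 field into (operation, R-ring index).

    Assumes op1_bits[0] is the READ/WRITE bit (1 = READ, 0 = WRITE) and
    op1_bits[1:] selects one of eight R rings, most significant bit first.
    """
    if len(op1_bits) != 4 or set(op1_bits) - {"0", "1"}:
        raise ValueError("OP1 must be a 4-character bit string")
    operation = "READ" if op1_bits[0] == "1" else "WRITE"
    ring_index = int(op1_bits[1:], 2)
    return operation, ring_index
```

For instance, `decode_op1("1101")` selects a READ on ring 5 under these assumptions.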
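[0078]
Providing multiple R rings in a DPIM also enables more complex operations that require more data and more logic in the module and more complex OP1 codes. For example, a request to a DPIM can ask for the largest value among all the R rings or for the sum of the values in a subset of the R rings. A DPIM request can also ask that each copy of a word containing a specified subfield be sent to a computed address, enabling efficient searches for certain types of data.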
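[0079]
In the PRAM system shown, the BLANK field is ignored and can have any value. In other embodiments, the BLANK field can be defined to assist various operations. In one embodiment, the BLANK field is used for a scoreboard function. Suppose that a system has N processors, where N is less than B_L, and that all N processors must read the DATA field before the DATA field can be overwritten. When a new DATA value is placed on storage ring R, the BLANK field is set to all zeros. When processor W of the N processors reads the data, bit W of BLANK is set to 1. The DATA portion of ring R can be overwritten only when the appropriate N-bit subfield of BLANK is set to all ones. The BLANK field is then reset to all zeros.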
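The scoreboard rule can be sketched at word level as follows. The class `ScoreboardRing` and its method names are hypothetical; the model keeps the BLANK subfield as an integer bit mask and does not represent the bit-serial ring timing.

```python
class ScoreboardRing:
    """Hypothetical model of a storage ring R whose BLANK field is a read scoreboard."""

    def __init__(self, n_processors, data):
        self.n = n_processors
        self.data = data
        self.blank = 0  # all zeros when a new DATA value is placed on the ring

    def read(self, w):
        """Processor w reads DATA; bit w of BLANK is set to 1."""
        if not 0 <= w < self.n:
            raise IndexError("processor index out of range")
        self.blank |= 1 << w
        return self.data

    def try_write(self, new_data):
        """Overwrite DATA only if all N processors have read the current value."""
        all_read = (1 << self.n) - 1
        if self.blank != all_read:
            return False
        self.data = new_data
        self.blank = 0  # scoreboard is reset for the new value
        return True


sb = ScoreboardRing(n_processors=3, data=42)
sb.read(0)
sb.read(1)
early = sb.try_write(7)   # rejected: processor 2 has not read yet
sb.read(2)
late = sb.try_write(7)    # accepted: all three bits are set
```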
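[0080]
This scoreboard function is only one of many uses of the BLANK field. Those skilled in the art will be able to use the BLANK field effectively for a variety of applications in computing and communications.
[0081]
In some embodiments, the logic modules in a DPIM must be able to intercommunicate. An example of such an application is the leaky bucket algorithm used in asynchronous transfer mode (ATM) Internet switches. In the illustrated parallel access memory 200, a computational logic module 314 sends a signal to a local counter (not shown) on receiving a READ request entry. No two computational logic modules in a DPIM receive the first bit of a read packet at the same time, so a common DPIM bus (not shown) can conveniently be used to drive a counter connected to all of the logic modules. The counter can respond to all of the computational logic modules, so that when the "leaky bucket overflows", all of the appropriate logic modules are notified and respond to that information by modifying the AD2 and OP2 fields to generate a proper response to the proper destination.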
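[0082]
Referring to FIG. 1, a schematic block diagram shows a computational engine 100 built using network interconnect structures as fundamental elements. Various embodiments of the computational engine include the core elements of the general system 100 described above in the discussion of FIG. 1. In an example of a computational engine that is a computing system, the bottom switch 112 sends packets to computational units 126 that include one or more processors and memory or storage. Referring to FIG. 3, the computational logic modules associated with ring R perform part of the overall computing function of the system. The computational units 126, which receive data from the bottom switch 112, perform additional logic operations.
[0083]
The logic modules perform both conventional and novel processor operations, depending on the overall function desired of the computational engine.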
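[0084]
A first example of system 100 is a scalable parallel computing system. In one aspect of operation, the system performs a parallel SORT that includes a parallel compare sub-operation of the SORT operation. Logic module L receives a first data element from a packet and a second data element from storage ring R 304. The logic module places the larger of the two data elements on storage ring R, places the smaller in the PAYLOAD field, and sends the smaller value to the address given in the AD2 field of the packet. When two such logic modules are connected in series, as shown in FIG. 3, the second logic module can perform a second comparison on data coming from the first logic module within a few clock cycles. Such compare-and-replace processes are a common unit of work in many sorting algorithms, and one familiar with the prior art can use them to build larger parallel sorting engines.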
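The compare-and-replace step can be sketched at word level. The function names below are hypothetical, and the chain of modules in `stream_through` is an illustrative arrangement (one value passed through several modules, each guarding its own stored value), not a structure the text specifies in detail.

```python
def compare_exchange(ring_value, payload_value):
    """One logic-module step: the larger value stays on ring R,
    the smaller leaves in the PAYLOAD field."""
    return max(ring_value, payload_value), min(ring_value, payload_value)


def stream_through(stored, value):
    """Pass one value through a series of compare-exchange modules.

    `stored` is a list of the values currently held by each module and is
    updated in place; the value that finally leaves the chain is returned.
    """
    for i in range(len(stored)):
        stored[i], value = compare_exchange(stored[i], value)
    return value


stored = [5, 3, 9]
leaving = stream_through(stored, 7)
# Module 0 keeps 7 and forwards 5; module 1 keeps 5 and forwards 3;
# module 2 keeps 9 and forwards 3.
```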
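[0085]
Those skilled in the art will be able to design many useful logic modules 314 for a wide range of system applications. A single logic module can perform many operations, or different kinds of logic modules can be built so that each unit performs a smaller number of tasks.
[0086]
System 100 contains two kinds of processing units: the units in the DPIMs 114 and the units in the computational units (CUs) 126. The DPIMs handle bit-serial data movement and perform computations that involve moving large amounts of data. A CU contains one or more processors, such as general-purpose processors, and conventional RAM. The CUs perform "number crunching" operations on the data sets supplied to them and generate, transfer, and receive packets. One important function of the DPIMs is to supply data to the CUs with low latency, in parallel, and in a form convenient for subsequent processing.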
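[0087]
In one example of functionality, a large domain of a computational problem can be decomposed into a collection of non-overlapping subdomains. A CU can be selected to receive a predetermined type of data from each subdomain that contributes significantly to the computation performed by that CU. The DPIMs prepare the data and send the results to the appropriate CUs. For example, the domain can be all chess positions reachable in ten moves, and a subdomain can contain all positions reachable in eight moves from a given pair of moves. The DPIMs return to the CU only the promising opening move pairs, with the data ordered from most promising to least promising.
[0088]
In another application, the domain contains a representation of multiple objects in three-dimensional space, and each subdomain consists of a partitioned portion of that space. In one specific example, a condition of interest is defined as a gravitational force on a body of interest that exceeds a predetermined threshold. The DPIMs send data to the CUs from the subdomains that contain data matching the condition of interest.
[0089]
The scalable system shown in FIG. 1, and embodiments that use the core elements of the scalable system, can also be configured for supercomputer applications. In supercomputer applications, multiple CUs receive data in parallel, in the proper form and in a timely manner. The CUs process the data in parallel, send out the results, and generate requests for subsequent interactions.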
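[0090]
DPIMs are also useful as bookkeepers and task schedulers. One example is a task scheduler that uses multiple (K) computational units (CUs) in a collection H. The CUs in collection H typically perform various tasks in a parallel computation. On completing tasks, N of the K CUs are assigned new tasks. A data storage ring R that can store at least K bits of data zeroes a word W of length K. Each bit position in word W is associated with a particular CU in collection H. When a CU completes its assigned task, the CU sends a packet M to the DPIM that contains ring R. A logic module L1 on data storage ring R modifies word W by inserting a 1 at the bit position associated with the CU that sent packet M. Another logic module L2 on data storage ring R tracks the number of ones in word W. When word W contains N ones, the N idle CUs in H begin new tasks. The new tasks are started by multicasting a packet to the N processors. An efficient method of multicasting to a subcollection of collection H is described below.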
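The bookkeeping described here can be sketched as a small word-level model. The class `TaskScoreboard` and its method are hypothetical names; clearing word W after the multicast is an assumption of this sketch, since the text does not say when W is reset.

```python
class TaskScoreboard:
    """Hypothetical model of word W on ring R tracking idle CUs in collection H."""

    def __init__(self, k, n):
        self.k = k        # number of CUs in collection H
        self.n = n        # number of idle CUs needed before new tasks start
        self.word = 0     # word W, zeroed initially

    def report_done(self, cu):
        """Handle packet M from a CU that has finished its task.

        Returns the list of idle CUs to multicast new tasks to once N ones
        have accumulated, otherwise None.
        """
        if not 0 <= cu < self.k:
            raise IndexError("CU index out of range")
        self.word |= 1 << cu                        # role of logic module L1
        if bin(self.word).count("1") >= self.n:     # role of logic module L2
            idle = [i for i in range(self.k) if (self.word >> i) & 1]
            self.word = 0                           # assumption: W is cleared after dispatch
            return idle
        return None


sched = TaskScoreboard(k=8, n=3)
first = sched.report_done(2)
second = sched.report_done(5)
third = sched.report_done(7)
```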
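[0091]
Referring to FIG. 7, a schematic block diagram shows a structure and technique for performing multicast operations using indirect addressing. Multicasting a packet to multiple destinations designated by corresponding addresses is a very useful function in computing and communications applications. A single first address points to a collection of second addresses. The second addresses are the destinations of copies of the multicast packet payload.
[0092]
In some embodiments, the interconnect structure system has a collection C of output ports, and under certain conditions the system sends a given packet payload to all output ports in collection C_0. Each of the collections C_0, C_1, C_2, ..., C_J-1 consists of a set of output ports, and for a particular integer N less than J, all ports in collection C_N can receive the same particular packet as the result of a single multicast request.
[0093]
The multicast interconnect structure 700 stores the set of output addresses of collection C_N in a storage ring R 704. Each ring has the capacity to store FMAX addresses. In the example shown, the ring R of FIG. 7 has the capacity to store FMAX = 5 addresses.
[0094]
Switches of various configurations and sizes can be used. In one embodiment, the bottom switch has 64 output ports. An output port address can be stored as a 6-bit binary pattern. Ring R has five fields 702, labeled F_0, F_1, F_2, F_3, and F_4, that hold the output port locations of collection C_N. Each field is seven bits long. The first of the seven bits is set to 1 if a location of C_N is stored in the next six bits of the field. Otherwise, the first bit is set to 0.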
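[0095]
At least two kinds of packets can arrive at the multicast logic module MLM 714: MULTICAST READ packets and MULTICAST WRITE packets.
[0096]
The first type of packet, PW, has an OP1 field that designates a MULTICAST WRITE operation. The WRITE packet arrives at communication ring 302 and has the form:
| PAYLOAD | OP1 | BIT |
[0097]
PAYLOAD is equal to the concatenated fields F_0, F_1, F_2, F_3, and F_4. Packet PW arrives at communication ring 302 at a position suitable for the MLM 714 to read the first bit of F_0 at the proper time. The MLM writes the first bit of PAYLOAD to ring R in the same manner as the WRITE operation described above with reference to FIG. 6.
[0098]
FIG. 7 shows a logic module connected to special hardware DPIM 714 that supports the multicast function. In response to a WRITE request, the system performs an operation by which fields F_0, F_1, F_2, F_3, and F_4 are transferred from rings Z and C to data storage ring R 304. A packet is indicated by BIT = 1; when BIT = 0, the rest of the packet is always ignored. The operation code field OP1 follows the BIT field. For a MULTICAST WRITE operation, OP1 indicates that the payload is to be transferred from the packet to the storage ring, replacing the data then on the storage ring. The data is transferred serially from the MLM to the storage ring.
[0099]
For example, the data is transferred over the rightmost line 334. The data arrives in the proper form at the proper time and position to be placed on storage ring 704. For a MULTICAST WRITE operation, the control signal sent from the bottom switch to the MLM over line 722 can be ignored.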
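[0100]
Another type of packet, PR, which indicates a MULTICAST READ request, can also arrive at communication ring 302 and has the form:
| PAYLOAD | OP2 | BLANK | OP1 | BIT |
[0101]
The BLANK portion is, for example, six bits long. The BLANK field is replaced by a target address from one of the fields of C_N. The OP1 field may or may not be used for a particular packet or application. A group of packets enters the bottom switch 112 in the form:
| PAYLOAD | OP2 | AD2 | BIT |
[0102]
The address field AD2 comes from a ring R field. The operation field OP2 and the PAYLOAD come from the MULTICAST READ packet.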
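[0103]
In the example shown, the storage ring R 704 located at target address AD1 stores three output port addresses, for example 3, 8, and 17. Output address 3 is stored in field F_0. The most significant bit of the address appears first, followed by the next most significant bit, and so on. The standard 6-bit binary pattern for the decimal integer 3 is 000011. The header bits are used in order from most significant bit to least significant bit. It is convenient to store the header bits so that the most significant bit comes first; as a result, in field F_0 the target output 3 is represented by the 6-bit pattern 110000. The entire field F_0, including the timing bit, has the 7-bit pattern 1100001. Similarly, field F_1 stores decimal 8 as the pattern 0001001, and field F_2 stores decimal 17 as 1000101. Because no further output ports are addressed, fields F_3 and F_4 are set to all zeros (0000000).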
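The bit patterns in this example can be reproduced with a short encoder. The function names are hypothetical. The sketch assumes the written patterns follow the packet convention used throughout the text, in which the rightmost bit is the first to arrive: the flag bit is written last (rightmost), and the six address bits are written in reverse so that the most significant bit arrives immediately after the flag bit.

```python
def encode_field(port=None, width=6):
    """Encode one ring-R address field F_i as a (width + 1)-character bit string.

    An unused field is all zeros. Otherwise the address is written least
    significant bit first, followed by a flag bit of 1 at the right end.
    """
    if port is None:
        return "0" * (width + 1)
    if not 0 <= port < 2 ** width:
        raise ValueError("port does not fit in the address width")
    msb_first = format(port, "0{}b".format(width))
    return msb_first[::-1] + "1"


def decode_field(bits):
    """Inverse of encode_field; returns None for an unused field."""
    if bits[-1] == "0":
        return None
    return int(bits[:-1][::-1], 2)


fields = [encode_field(p) for p in (3, 8, 17, None, None)]
```

Under these assumptions the five fields come out as `1100001`, `0001001`, `1000101`, `0000000`, and `0000000`, matching the patterns given in the text.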
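[0104]
A control signal on line 722 indicates a non-blocked condition at the bottom switch and permits a packet to enter over line 718. When the control signal sent from the bottom switch to logic module 714 over line 722 indicates a busy condition, no data is sent to the switch. When a "not busy" control signal reaches the MLM, the data fields of the addresses in ring R are in the proper position for generating responses and sending them to reading unit 708 and the bottom switch 112. At the proper time after the "not busy" signal reaches the logic module, the MLM begins sending multiple MULTICAST READ response packets through the bottom switch 112 to the collection of addresses C_N.
[0105]
After sending a MULTICAST READ packet to the DPIM at address AD1, the system has the capability to multicast the PAYLOAD field of that packet to the multiple addresses stored in collection C_N held in ring R 704.
[0106]
Typically, a multicast system such as the one described above includes hardware capable of performing various computing and data storage tasks. In the example shown, the multicast capability is achieved by using DPIM units 700 specially configured to hold and forward multicast addresses.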
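[0107]
A generalization of the multicast function described above is a mode in which a single packet M is sent to a predetermined subset of the output ports whose addresses are members of collection C_N. The bit mask that indicates which members are to be sent to is called the send mask. In one example, addresses 3, 8, and 17 are the three members of collection C_N. The send mask 0, 0, 1, 0, 1 indicates that the first and third output ports in list C_N are to receive the packet. The response packets are multicast to output ports 3 and 17. In one example, a control signal indicates whether all input ports are able to receive packets or whether one or more input ports are blocked.
[0108]
In another example, a list of the unblocked output ports is stored. This list is a mask called the block mask. A value of 1 in the Nth position of the send mask indicates that sending to the Nth member of C_N is desired. A value of 1 in the Nth position of the block mask indicates that the Nth member of C_N is not blocked and can therefore be sent to. When the value in the Nth position of both masks is 1, packet M is sent to the Nth output port in the list.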
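[0109]
A packet to be multicast to the subset of output ports listed in C_N that corresponds to the subset indicated by the send mask has the form:
PAYLOAD | OP2 | Mask | Multicast Op | AD1 | BIT |
[0110]
The packet enters the top switch of the system. The addresses normally stored in address field AD2 are contained in the data stored at address field AD1, so address field AD2 is not used.
[0111]
Referring to FIG. 7, the BIT field and the OP1 code are sent from ring C or ring Z to logic module 714. The send mask and the block mask also enter the logic module at the same time. If the Jth bit of the send mask is 1 and the Jth bit of the block mask is also 1, the PAYLOAD is sent to address F_J. The remaining processing proceeds as in multicast mode without masks.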
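The masked multicast rule amounts to a position-wise AND of the two masks. In the sketch below, the function name and the list-based representation are illustrative assumptions; masks are indexed so that element J corresponds to field F_J, which is the reverse of the right-to-left order in which a mask such as 0, 0, 1, 0, 1 appears to be written in the text.

```python
def multicast_targets(fields, send_mask, block_mask):
    """Return the output ports that receive the PAYLOAD.

    fields[j] is the port stored in F_j (None if unused), send_mask[j] == 1
    means delivery to member j is requested, and block_mask[j] == 1 means
    member j is not blocked.
    """
    return [
        port
        for port, send, clear in zip(fields, send_mask, block_mask)
        if port is not None and send == 1 and clear == 1
    ]


fields = [3, 8, 17, None, None]          # contents of F_0 .. F_4
send_mask = [1, 0, 1, 0, 0]              # first and third members requested
all_clear = multicast_targets(fields, send_mask, [1, 1, 1, 1, 1])
one_blocked = multicast_targets(fields, send_mask, [1, 1, 0, 1, 1])
```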
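[0112]
The set of output ports in collection C_N is denoted p_0, p_1, ..., p_m. The output ports are divided into groups, each containing at most as many members of C_N as can be stored on a data storage ring R. If the data ring R has five output addresses and the collection C_N has nine output ports, the first four output ports are stored in group 0, the next four in group 1, and the remaining output port in group 2. The output-port sequence p_0, p_1, ..., p_8 can also be indexed as q_00, q_01, q_02, q_03, q_10, q_11, q_12, q_13, q_20. In this way the physical address of a target is completely described by two integers giving the group number and the address-field index.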
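The two-index scheme is ordinary integer division. The sketch below assumes four members per group, the group size used in the nine-port example above; `GROUP_SIZE`, `to_group_index` and `to_flat_index` are hypothetical names.

```python
GROUP_SIZE = 4  # members per group, taken from the nine-port example


def to_group_index(p_index: int, group_size: int = GROUP_SIZE):
    """Map the index of p to (group number, address-field index)."""
    return divmod(p_index, group_size)


def to_flat_index(group: int, member: int, group_size: int = GROUP_SIZE) -> int:
    """Recover the index of p from the two indices of q."""
    return group * group_size + member


labels = ["q_%d%d" % to_group_index(i) for i in range(9)]
print(labels)
```

The inverse mapping `to_flat_index` is what allows the index of p to be recomputed from the group number and the field index alone.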
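[0113]
In some applications the payload of the packet carries the following information:
the subscript N of C_N, indicating which set of output ports was used to locate the address,
the group of C_N in which the address is located,
the member of that group at which the address is located,
the input port of the switch at which the packet entered.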
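[0114]
The second and third items give the two indices of a member of q, from which the index of p can easily be computed. For a packet carrying this information, the PAYLOAD field has the following format:
N | first subscript of q | second subscript of q | input port number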
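[0115]
FIG. 7 also shows a system that uses indirect addressing in multicasting. A simpler operation is the indirect addressing of a single output port. In one example of indirect addressing, the data storage ring R contains a single field that represents the indirect address. For example, the storage ring R of the DPIM at address 17 contains the value 153. A packet sent to address 17 is forwarded to output port 153 of the bottom switch.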
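[0116]
In the embodiments described herein, all logic modules associated with a given ring R send data to bottom switch 112. When one DPIM sends bursts of traffic to the bottom switch and another DPIM unit sends relatively little traffic, these rings R send packets to a group of rings B rather than to the same ring. In yet another example, these rings R send packets to concentrator 150, which conveys the data to bottom switch 112.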
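[0117]
In the system described above, the information in data storage ring R 304 and communication ring C 302 circulates in the manner of a circularly connected FIFO. One variation is a system in which the information in ring R 704 is static. Data from the level-zero ring in top switch 110 can be connected to enter a static buffer. The data in the static buffer can interact in a manner logically equivalent to the circulating model described above. An advantage of the static model is that data can be stored more efficiently.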
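[0118]
In this description, data X is sent to the ring R that holds data Y. The computation ring C receives both the stream of data X and the stream of data Y as input signals, applies a mathematical function F to X and Y, and sends the result of the computation to a target output port. The target can be stored in a field of ring R or in the AD2 field of the packet. Alternatively, the target can depend on the result of F(X, Y) or be generated by another function G(X, Y).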
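[0119]
In another application, multiple operations can be performed on data X and data Y, and the results can be forwarded to multiple destinations. For example, the result of function F(X, Y) is sent to the destination specified by address AD2, and the result of function H(X, Y) can be sent to the destination specified by address AD3 of the packet. Multiple operations have the advantage of allowing system 100 to perform a wide variety of transformations efficiently in parallel.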
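[0120]
In addition to performing more elaborate arithmetic functions on the two arguments X and Y, simpler tasks can be performed in which function F is a function of only X or only Y. The result of the simple function F(X) or F(Y) is sent to a destination generated by another function G(X) or specified by address AD2.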
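[0121]
Although the invention has been described with reference to various embodiments, it should be understood that these embodiments are illustrative and do not limit the scope of the invention. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those skilled in the art can readily implement the steps necessary to provide the disclosed structures and methods, and will understand that the process parameters, materials and dimensions are given only as examples and can be varied to achieve the desired functional characteristics or modifications within the scope of the invention. Variations and modifications of the disclosed embodiments can be made on the basis of the description herein without departing from the spirit and scope of the invention as set forth in the claims.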
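[0122]
Those skilled in the art will be able to make a number of useful variations and modifications within the scope of the invention. Several examples of such variations and modifications have been listed, but they can also be applied to other systems.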
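[Brief description of the drawings]
[FIG. 1]
FIG. 1 is a schematic block diagram showing an example of a general-purpose system formed from building blocks that include a plurality of network interconnect structures.
[FIG. 2]
FIG. 2 is a schematic block diagram showing a parallel memory structure, such as a parallel random access memory (PRAM), formed using network interconnect structures as basic elements.
[FIG. 3]
FIG. 3 is a diagram of the lower levels of the top switch, showing its connections to the communication ring, a plurality of logic modules, the circulating FIFO data storage ring, and the upper levels of the bottom switch.
[FIG. 4A]
FIGS. 4A, 4B and 4C are block diagrams showing the movement of data through the communication ring and the circulating FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
[FIG. 4B]
FIGS. 4A, 4B and 4C are block diagrams showing the movement of data through the communication ring and the circulating FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
[FIG. 4C]
FIGS. 4A, 4B and 4C are block diagrams showing the movement of data through the communication ring and the circulating FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
[FIG. 5]
FIG. 5 shows a portion of the interconnect structure executing two read instructions that read from the same circulating data storage ring in overlapping time periods. The data read enters the second switch, where it is directed to different targets.
[FIG. 6]
FIG. 6 shows a portion of the interconnect structure executing a WRITE instruction.
[FIG. 7]
FIG. 7 is a schematic block diagram showing the structure and technique for performing multicast processing using indirect addressing.
[0001]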
(Background of the Invention)
A problem that arises in systems that perform highly parallel computation is supplying sufficient data flow to multiple processors. U.S. Pat. Nos. 5,996,020 and 6,289,021 disclose high-bandwidth, low-latency interconnect structures that significantly improve data flow in a network. There is a need for a system that takes full advantage of such high-bandwidth, low-latency interconnect structures by supporting parallel memory access and computation within the network.
[0002]
(Summary of the Invention)
Several innovative techniques allow multiple processors to access the same data in parallel. First, multiple remote processors can request reads from the same data location, and those requests can be processed in overlapping time periods. Second, multiple processors can access data at the same location and can read, write, or perform multiple operations on the same data during overlapping time periods. Third, one data packet can be multicast to multiple locations, and multiple packets can be multicast to multiple sets of target locations.
[0003]
In the following description, the term "packet" refers to a unit of data, preferably in serial form. Examples of packets include Internet Protocol (IP) packets, Ethernet frames, ATM cells, switch-fabric segments that include a portion of a larger frame or packet, inter-processor messages in supercomputers, and other data message types with a finite message length.
[0004]
The system disclosed herein solves a similar problem that occurs in communications when multiple packets arriving at a switch access data at the same location.
[0005]
Multi-Level Minimum Logic structures can be used as a fundamental building block in a wide variety of highly useful devices and systems, including processors, computers, memory devices and logic devices. Examples of such devices and systems include parallel random access memory (PRAM) and parallel computing engines. These devices and systems include a network interconnect structure as a basic building block, along with internal storage or memory and logic circuits. The data storage may take the form of a first-in first-out (FIFO) ring.
[0006]
(Detailed description)
The novel features of the preferred embodiment described below are set forth in the appended claims. However, embodiments of the invention relating to both structure and processing methods will be best understood from the following description read in conjunction with the accompanying drawings.
[0007]
The schematic block diagram of FIG. 1 shows an example of a general-purpose system 100 formed from building blocks that include one or more network interconnect structures. In the illustrated example, the general-purpose system 100 has a top switch 110 and a bottom switch 112 formed from network interconnect structures. The term "network interconnect structure" can also refer to other interconnect structures. Other systems may include additional elements formed from network interconnect structures. The general-purpose system 100 illustrates various components that may be included as core elements of a basic exemplary system. Some embodiments include further elements in addition to the core elements, such as 1) shared memory, 2) a direct connection 130 between the top and bottom switches, 3) a direct connection 140 between the bottom switch and I/O, and 4) a concentrator connected between the logic units 114 and the bottom switch 112.
[0008]
The general-purpose system 100 includes a top switch 110 that functions as an input terminal. The top switch receives input data packets on input lines 136 from external sources, and possibly from the bottom switch via bus 130, and distributes the received packets to dynamic processor-in-memory logic modules (DPIMs) 114. The top switch 110 routes packets within the general-purpose system 100 based on communication information stored in the packet header. Packets are sent from the top switch 110 to the DPIM modules 114. Control signals from the DPIM modules 114 to the top switch 110 control the timing of packet input, which prevents the collisions with data in the DPIM or in the bottom switch that could occur without such control. The system may also use the output lines and buses 130, 132, 134 and 136 to transfer information to additional computing, communication, storage and other elements (not shown).
[0009]
Data packets enter the top switch 110 and are sent to the target DPIM 114 based on the address field of each packet. The information contained in a packet, possibly together with other information, can be used to determine the processing that the logic of DPIM 114 performs on the data contained in the packet or in the DPIM memory. For example, the information stored in the DPIM memory may be modified according to the information in the packet, the data held in the DPIM memory may be sent to the bottom switch 112, or other data generated by the DPIM logic module may be output from the bottom switch. Packets from the DPIMs are sent to the bottom switch. As further options, the general-purpose system 100 may include computational units and/or memory units. A computational unit 126 may be arranged to send data packets out of the system 100 through the I/O unit 124 and/or to the top switch 110. When the bottom switch sends a packet to the top switch, the packet can be sent directly, or it can be sent through one or more interconnect modules (not shown) that handle timing and control between the integrated circuits that are subcomponents of the system 100.
[0010]
In one embodiment of the system, the data storage has the form of a first-in first-out (FIFO) data storage ring R in DPIM 114 and conventional data storage associated with a computational unit (CU) 126. A FIFO ring is composed of a plurality of single-bit shift registers connected in a ring, and includes two types of components. First, as is known in the art, a plurality of single-bit shift registers, each connected only to the adjacent single-bit shift registers, form a simple FIFO 310. Second, the other shift registers of the ring are single-bit or multi-bit registers contained within other elements of the system, for example logic module 114. These two types of components are connected in series to form a ring. For example, if the total length F_L of the FIFO ring is 200 bits and 64 bits are held in the plurality of logic modules L, the remaining 136 bits are held in the serially connected registers of the FIFO. A clock distributed throughout the system is connected to the FIFO elements and shift registers and advances each data bit to the next position in "bucket brigade" fashion. The cycle time is the time, expressed in clock periods, required for data to complete exactly one circuit of the FIFO ring. The integer value of the cycle time equals the length of the FIFO ring in number of components. For example, a ring of 200 components (length 200) has a cycle time of 200 system clock periods. The system may also have local clocks or timing sources operating at different rates. Depending on the embodiment, all FIFO rings of the system may have the same length, or their lengths may be integer multiples of a predetermined minimum length. In another embodiment, a ring has a bus structure with multiple parallel paths, so that the amount of data held in the ring is an integer multiple of the ring length F_L.
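The relationship between ring length and cycle time can be checked with a small software model. This is only a sketch under stated assumptions: the ring is modeled as a `collections.deque` of bits rather than as hardware shift registers, and the class name `FifoRing` is hypothetical.

```python
from collections import deque


class FifoRing:
    """A ring of single-bit registers advanced by a common clock."""

    def __init__(self, bits):
        self.cells = deque(bits)
        self.clock = 0

    def tick(self):
        # Every bit moves to the next register; the last wraps to the first.
        self.cells.rotate(1)
        self.clock += 1

    @property
    def cycle_time(self):
        # One full circuit takes as many clock periods as there are registers.
        return len(self.cells)


ring = FifoRing([1, 0, 1, 1] + [0] * 196)  # F_L = 200
start = list(ring.cells)
for _ in range(ring.cycle_time):
    ring.tick()
print(ring.clock, list(ring.cells) == start)
```

After exactly F_L clock periods every bit is back in its starting register, which is the property the logic modules rely on when they align packet fields with ring contents.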
[0011]
In the general-purpose system 100, the top switch can handle packets of various lengths up to the system maximum length. In some applications, the packets may all have the same length. More generally, packets of various lengths are input to the top switch. The packet length is denoted P_L, and P_L does not exceed F_L.
[0012]
Similarly, the bottom switch can handle packets of various lengths. In an exemplary embodiment of general-purpose system 100, data with different bit lengths is generated depending on the function and operation of the DPIM logic modules 114 and CUs 126. The DPIMs may function independently, or there may be multiple systems (not shown) that collect data from the DPIMs and provide data to the DPIMs or to other elements inside or outside system 100.
[0013]
The schematic block diagram of FIG. 2 illustrates an example of a parallel random access memory (PRAM) system 200 formed from a smaller number of building blocks than are included in the general-purpose system 100. This PRAM system has a top switch 110, a concentrator 150, and a bottom switch 112 formed from network interconnect structures. The system also includes a plurality of DPIMs 114 for storing data. These DPIM units can typically perform READ and WRITE functions, which allows the system to be used as a parallel random access memory.
[0014]
In the illustrated embodiment, the data packet input to top switch 110 has the following format:
Payload | Processing code 2 | Address 2 | Processing code 1 | Address 1 | Timing bit
This can be expressed as:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0015]
The number of bits in the PAYLOAD field is denoted PayL. The numbers of bits in OP2 and OP1 are denoted OP2L and OP1L, and the numbers of bits in AD2 and AD1 are denoted AD2L and AD1L. In the preferred embodiment the BIT field is one bit long.
[0016]
The following table gives a brief description of the packet fields.

[0017]
The BIT field enters the switch first and is always set to 1 to indicate that a packet is present. The BIT field is also called the "traffic bit". The AD1 field is used to direct the packet through the top switch to the packet's target DPIM. The top switch 110 can be arranged in multiple hierarchical levels and columns, and packets pass through these levels. Each time a packet enters a new level of the top switch 110, one bit of the AD1 field is removed, shortening the AD1 field. System 200 uses the same technique. When a packet leaves the top switch 110, nothing remains of the AD1 field. Thus, when the packet leaves the top switch, it has the following format:
PAYLOAD | OP2 | AD2 | OP1 | BIT
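The shrinking AD1 field can be sketched as follows. The sketch assumes that the leading bit of the string is the one consumed at each level and that the switch has one level per address bit; both are assumptions made for illustration, and `strip_ad1` is a hypothetical name.

```python
def strip_ad1(ad1_bits: str):
    """Return the AD1 bits still present after each level of the top switch."""
    remaining = ad1_bits
    history = []
    while remaining:
        remaining = remaining[1:]  # one routing bit is consumed per level
        history.append(remaining)
    return history


# A four-bit address is gone after four levels.
print(strip_ad1("0110"))
```

The final entry is the empty string, matching the statement that nothing remains of AD1 when the packet leaves the top switch.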
[0018]

Systems 100 and 200 have multiple DPIM units. FIG. 3 is a schematic block diagram illustrating an example of the DPIM unit and showing data and control connection paths between the DPIM and the top switch 110 and the bottom switch 112. FIG. 3 shows four data interconnect structures Z, C, R and B. Interconnect structure Z may be a FIFO ring located within top switch 110. Interconnect structures C and R consist of FIFO rings located within the DPIM module. In some embodiments, the DPIM sends data directly to the bottom switch. In those embodiments, if the bottom switch is an interconnect structure, interconnect structure B is a FIFO ring. FIGS. 1 and 7 illustrate systems without a concentrator, and FIGS. 2, 3, 4A and 5 illustrate systems with a concentrator.
[0019]
Data passes through the top switch 110 and arrives at the target output ring Z_J, where J = AD1. Ring Z = Z_J has a plurality of nodes 312 connected to output lines 326. The DPIM module has a packet-receiving ring C 302, called the "data communication ring", and one or more "data storage rings" R 304. The DPIM shown in FIG. 3 has a single data storage ring R. Each of the structures Z, C, R and B includes a plurality of interconnected single-bit FIFO nodes. Some nodes in a structure have a single data input port and a single data output port and are interconnected to form a simple multi-node FIFO. Other nodes have additional data input ports, additional data output ports, or both. These nodes may also have control signal output ports or control signal input ports. Ring Z receives control signals from ring C and sends data to logic module L 314. Rings C and R send data to and receive data from logic module L 314. FIFO B 380 sends a control signal to logic module L and receives data from logic module L. A DPIM can have multiple logic modules that send data to multiple input ports of the interconnect structure or FIFO B. The data from the DPIM can be sent in rows into the top level of structure B. The number of DPIMs can equal the number of memory locations, with each DPIM having a single storage ring R that stores one word of data. Alternatively, one DPIM unit may include multiple storage rings R, in which case a particular storage ring can be identified by part of the AD1 address field or by part of the OP1 operation field.
[0020]
The timing of packet movement is synchronized across the four rings. As packets circulate in the rings, they are aligned with respect to the BIT field. A beneficial result of this alignment is that ring C can send a control signal 328 to ring Z that permits or prohibits a node on Z from sending a packet to C. Upon receiving permission from node 330 on ring C, node 312 on ring Z can send the packet to logic module L, which is then immediately in a position to process the packet bit-serially. Similarly, packets circulating in data storage ring R are synchronized with ring C so that each bit is properly processed by logic module L as the packets circulate in their respective rings. The data storage ring R functions as a memory element that can be used in the novel applications described below. If the DPIMs are not on the same chip as the top switch, separate data communication rings (not shown) can be used between the nodes of ring Z and logic module L to provide timing and control between the chips.
[0021]
The data in the storage ring R is aligned with, and overlaps, a portion of the packets on ring Z 306 of the top switch, and can be accessed from the top switch 110 by multiple packets within the same cycle time. A plurality of logic modules 314 are associated with data communication ring C and data storage ring R. A logic module L can read data from rings C and R, perform operations on the data under certain conditions, and write data to rings C and R. In addition, the logic module L can send a packet to node 320 of the FIFO 308 located at the bottom switch 112 or the concentrator. If the DPIMs are not on the same chip as the bottom switch, a separate data communication ring (not shown) can be used between node 320 of interconnect structure B and logic module L 314 to provide timing and control between chips. A separate data communication ring can also be used for timing and control when one device needs to access multiple bits of the communication ring in one cycle time.
[0022]
The packet enters the communication ring C through the logic module 314. Packets exit the logic module L and enter the bottom switch at various angles through the input channel.
[0023]
In some examples of general-purpose system 100, all of the logic modules along rings C and R of a DPIM 114 are of the same type and perform similar logic functions. In other examples, several different logic module types are used, so that multiple logic functions can be performed on the data stored in a particular DPIM ring R. As data circulates through ring R, logic module L 314 can modify the data. The logic modules operate on data bits that pass serially through the module from rings C and R and from nodes on ring Z. Typical logic functions include (1) data transfer operations such as load, store, read and write, (2) logical operations such as AND, OR, NOR, NAND, exclusive OR and bit test, and (3) arithmetic operations such as addition, subtraction, multiplication, division and transcendental functions. Numerous other types of operations can also be included. The function of a logic module can be hardwired into the module, or it can be implemented in software loaded into the module from a packet sent to it. In one embodiment, the logic modules associated with a particular data storage ring R operate independently. In another embodiment, groups of logic modules are controlled by a separate system (not shown) that can receive data from those groups. In yet another embodiment, the logic module control system executes control instructions on data received from the logic modules.
[0024]
In FIGS. 1 and 2, each DPIM has one ring R and one ring C. In another embodiment of the system 100, a particular DPIM 114 has multiple R-rings. In such a multiple R-ring embodiment, one logic module 314 can simultaneously access data from the C-ring and all R-rings. Simultaneous access allows the logic module to transform data on one or more R-rings based on the contents of the R-rings, the contents of the received packet, and the contents of the associated communication ring C.
[0025]
A typical function of the logic module is to perform the operation specified in the OP1 field on the data held in the PAYLOAD field of the packet in combination with the data held in ring R. In one particular example, operation OP1 specifies that the data in the PAYLOAD of the packet is to be added to the data stored in the ring R at address AD1. The resulting sum is sent to the target port of the bottom switch at address AD2. As directed by the instruction held in the OP1 field, the logic module can perform several operations. For example, it can leave the data in ring R 304 unchanged, or it can replace the data in ring R 304 with the contents of the PAYLOAD field. Alternatively, the logic module L can replace the data held in the PAYLOAD field with the result of an operation performed on the contents of ring R 304 and the PAYLOAD field. In another example, the memory FIFO may store program instructions as well as data.
[0026]
A general-purpose system 100 that includes multiple types of logic modules 314 associated with the communication ring C and the storage ring R can use one or more bits of the OP1 field to specify the particular logic module that performs a given operation. In some embodiments, multiple logic modules perform multiple operations on the same data. The set of logic modules at address AD1 = x can perform different operations from the set of logic modules at address AD1 = y.
[0027]
The efficiency of moving data packets through the general-purpose system 100 depends on the timing of the data flow. In some systems, a buffer (not shown) associated with the logic module helps maintain the timing of the data transfer. In many embodiments, timing is maintained without buffering data. The interconnect structure of the general-purpose system 100 preferably has operational timing to provide efficient parallel computation, generation and access of data.
[0028]
The multi-component system 100, which includes at least one switch, a group of data storage rings 304, and associated logic modules 314, can be used to implement a variety of arithmetic and communication switches. Examples of arithmetic and communication switches include IP packet routers or switches used in Internet switching systems, special purpose sorting engines, general purpose computers, or many parallel arithmetic systems with general or specific functions.
[0029]
The schematic block diagram of FIG. 2 shows a parallel random access memory (PRAM) formed using a network interconnect structure as a basic element. The PRAM stores data that can be accessed simultaneously from a plurality of sources and sent simultaneously to a plurality of destinations. The PRAM has a top switch 110, and may or may not have communication rings that receive packets from the target rings of the top switch 110. In an interconnect structure without a communication ring, ring Z passes through the logic modules. Top switch 110 has T output ports 210 from each of the target rings. In a typical PRAM system 200, the number of address locations is greater than the number of I/O ports in the system. For example, a PRAM system has 128 I/O ports that access 64K words of data stored in DPIMs. The AD1 field is 16 bits long to address the 64K DPIMs 114. The AD2 field is 8 bits long to address the 128 output ports 204: 7 bits hold the address and 1 bit is the BIT2 portion of the address. The top switch has 128 input ports 202 and 64K Z-rings (not shown), each with multiple connections to a DPIM unit via output ports 206. Concentrator 150 has 64K (65,536) input ports 208 and 128 output ports 210. The bottom switch 112 has 128 input ports and 128 output ports 204. The concentrator follows the same control timing and signaling rules for inputs and outputs as the top and bottom switches and the logic modules.
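The field widths in this example follow from the number of addressable locations. The short sketch below reproduces the arithmetic; `address_bits` is a hypothetical helper, and the names `AD1L` and `AD2L` follow the field-length notation used earlier in this description.

```python
def address_bits(locations: int) -> int:
    """Smallest number of bits that can distinguish `locations` addresses."""
    if locations < 1:
        raise ValueError("at least one location is required")
    return (locations - 1).bit_length()


AD1L = address_bits(64 * 1024)  # 64K DPIM addresses
AD2L = address_bits(128) + 1    # 128 output ports plus the one-bit BIT2
print(AD1L, AD2L)
```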
[0030]
Alternatively, the top switch may have a lower number of output Z-rings and associated DPIM. It is also possible that the DPIM unit has a plurality of R-rings so that the total data size does not change.
[0031]
The PRAM shown in FIG. 2 has a DPIM unit 114 that includes a logic module 314 that is directly connected to a communication ring C 302 and a storage ring R 304. The DPIM unit 114 is connected to a packet concentrator 150 that supplies output data to the bottom switch 112.
[0032]
Referring to FIG. 3, node 330 on ring C sends a control signal to node 312 on ring Z of the top switch that permits the individual node 312 to send a packet to logic module L. Upon receiving a packet from ring Z, logic module L may do one of several things. First, it can start placing the packet on the C-ring. Second, it can immediately start using the data in the packet. Third, it can immediately start sending the resulting packet to the concentrator 150 without placing it on the C-ring. A logic module L_i can start placing a packet P on the C-ring, and after L_i has placed some of the bits on the ring, another logic module L_k (where k > i) can begin processing and removing the packet P. In some cases, the entire packet P is never placed on ring C. The logic module can insert data into the C-ring or R-ring, or send data to the concentrator 150. The signal on line 324 from the concentrator controls the data entering the concentrator. The logic modules 314 associated with a ring R may also have additional transmit and receive connections to auxiliary devices (not shown) associated with that ring R. An auxiliary device can have various structures and perform various functions according to the purpose and function of the system. One example of an auxiliary device is a system controller.
[0033]
In one embodiment, PRAM 200 includes a DPIM that includes a plurality of logic modules 314 that all have the same logic type and perform the same function.
[0034]
In another embodiment, the first DPIM S at a particular address may have multiple logic modules of different types and functions. The second DPIM T may have the same or a different type of logic module compared to the first DPIM S. In one application of the PRAM, one data word is stored in one storage ring R. As data circulates through ring R, the logic module can transform the data. In this PRAM, the logic module modifies the contents of the storage ring R that can store not only data but also program instructions.
[0035]
This PRAM stores and retrieves data using packets with the following fields:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0036]
The BIT field, set to 1 to indicate that a packet is present, enters general-purpose system 100. The AD1 field specifies the address of the particular DPIM that contains the data storage ring R 304 holding the desired data. The top switch sends the packet to the DPIM specified by address AD1, written DPIM(AD1). In the example shown, the OP1 field consists of a single bit that specifies the operation to be performed. For example, a logical value of 1 specifies a READ request, and a logical value of 0 specifies a WRITE request.
[0037]
In a READ request, the receiving logic module in the DPIM at address AD1 sends the data stored on ring R to address AD2 of the bottom switch 112. In a WRITE request, the PAYLOAD field of the packet is placed on the ring R at address AD1. AD2 is the destination address used to route data through the bottom switch 112 in a READ request, and specifies where the contents of the memory are to be sent. OP2 optionally describes an operation to be performed on the transmitted data by the device located at address AD2. If operation OP1 is a READ request, the logic module executing the READ request does not use the PAYLOAD field.
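The READ and WRITE behavior can be summarized in a small software model. This is a sketch under stated assumptions, not the described hardware: each DPIM is reduced to one stored word, switches and timing are omitted, and the names `Packet`, `Pram` and `handle` are hypothetical.

```python
from dataclasses import dataclass

READ, WRITE = 1, 0  # single-bit OP1 values from the example in the text


@dataclass
class Packet:
    payload: int
    op2: int
    ad2: int
    op1: int
    ad1: int
    bit: int = 1  # always 1: a packet is present


class Pram:
    """One storage ring R (one word) per DPIM address."""

    def __init__(self, words: int):
        self.rings = [0] * words

    def handle(self, pkt: Packet):
        if pkt.op1 == READ:
            # The request's PAYLOAD is ignored; ring R data is sent to AD2.
            return {"ad2": pkt.ad2, "op2": pkt.op2, "data": self.rings[pkt.ad1]}
        # WRITE: the PAYLOAD is placed on the ring R at address AD1.
        self.rings[pkt.ad1] = pkt.payload
        return None


pram = Pram(words=8)
pram.handle(Packet(payload=42, op2=0, ad2=0, op1=WRITE, ad1=5))
reply = pram.handle(Packet(payload=0, op2=3, ad2=9, op1=READ, ad1=5))
print(reply)
```

A READ leaves the stored word in place, so repeated READ requests to the same address return the same data.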
[0038]
In one embodiment, the PRAM includes only one type of logic module (the type that performs both READ and WRITE operations). In another embodiment of the PRAM, another type of logic module is used, such as with separate READ and WRITE elements.
[0039]
Referring to FIGS. 2 and 3, the illustrated PRAM 200 begins processing when a packet enters the top switch 110 at the proper time. The packet P is routed through the top switch and reaches the target ring Z located at address AD1. The AD1 field of the packet designates the target ring Z_J 306 of the top switch, where J = AD1. A node S (not shown) and a node T (not shown) are introduced to define message timing. Node S is on ring R_J and node T is on ring Z_J, and node S is arranged to send a control signal to node T via control line 328. Based on a global timing signal, node S 330 on ring R_J determines the timing-bit arrival time at node S. If a timing bit of value 1 arrives at node S at the timing-bit arrival time, node S sends a blocking signal to node T 312 on ring Z via line 328, which prohibits node T from sending a packet to logic unit L via line 326. If node S does not receive a bit of value 1 at the timing-bit arrival time, no message is arriving at node S from ring C, and node S sends a non-blocking control signal to node T. The global timing is such that the control signal arrives at node T at the same time as a message arriving at node T from within ring Z or from a node U located one level above ring Z in the top switch. Packets exiting top switch 110 travel from node 312 over line 326 to the logic module. The logic module can place the packet on communication ring C 302 or process the packet immediately without placing it on ring C. At this point the packet P has the following format:
PAYLOAD | OP2 | AD2 | OP1 | BIT
[0040]
Packet P is sent from ring Z to logic module L via line 326. When the packet P starts moving to the logic module L, a node N_Z on ring Z sends a control signal indicating a non-blocked state to a node W at a higher level in the top switch that is connected to N_Z. This control signal allows node W to send data to the node N_X that is positioned to receive data from node N_Z. Logic module L handles timing in the same way for packets arriving on line 326 and for packets arriving on ring C. When the packet P enters the logic module L, the logic module L parses and executes the command in the OP1 field.
[0041]
In the embodiment shown, communication ring C has the same length as storage ring R. The bits move through the rings C and R bit serially at a rate governed by the common clock. The first bit of the PAYLOAD field of the packet is aligned with the first bit of the Ring R DATA field. Therefore, in the case of a READ request, the data in the ring R is copied to the payload of the packet. In the case of a WRITE request, the data in the payload portion of the packet can be transferred from the packet to the storage ring R.
[0042]
READ request
For a READ request, packet P has the following format:
PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT
[0043]
Packets enter the top switch. Generally, the DPIM logic module at address AD1 identifies the READ request by examining the OP1 operation-code field. The logic module replaces the PAYLOAD field of the packet with the DATA field from ring R. The updated packet is then sent through the concentrator to the bottom switch, and from there to the computational unit (CU) 126 or another device at address AD2. The CU or other device can execute the instruction specified by operation code 2 (OP2) in connection with the PAYLOAD field.
[0044]
Packet P enters node T 312 on ring Z. Node T begins sending packet P to logic module L via data path 326 in response to the timing bit of packet P entering node T and the non-blocking control signal from node 330 on ring C. When the BIT and OP1 fields enter the logic module L, a control signal on line 324 also reaches the logic module L. This signal indicates whether the concentrator 150 (or, if there is no concentrator, the bottom switch) can receive a message. If the control signal indicates that the concentrator cannot receive the message, logic module L begins forwarding the packet onto ring C, and the packet P moves to the next logic module on ring C.
[0045]
At some point, one of the logic modules on ring C receives a not busy control signal from below the hierarchy. At that time, the logic module L starts transferring the packet P to the input node 320 of the interconnect structure B.
[0046]
In the READ request, the logic module strips the OP1 field from the packet and starts sending the packet to the input node 320 of the concentrator via path 322. The logic module first sends the BIT field, followed by the AD2 field and the OP2 field. Timing is set such that the last bit of the OP2 field leaves the logic module at the same time that the first bit of the DATA field of storage ring R reaches the logic module. The logic module leaves the DATA field in storage ring R unchanged, places a copy of the DATA in the PAYLOAD field of the packet sent downstream, and continues to send the packet bit-serially to the concentrator.
[0047]
Packets remain unchanged when entering and leaving the concentrator, and have the following form when entering the bottom switch 112:
DATA | OP2 | AD2 | BIT
[0048]
The PAYLOAD field now contains the DATA field from ring R. As packets are routed through the bottom switch, the AD2 field is removed. The packet is transmitted from the output port 204 located at the address AD2 of the bottom switch. Upon transmission, the packet has the following format:
DATA ｜ OP2 ｜ BIT
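By way of illustration only, the succession of packet formats described above may be traced with a short Python sketch. The dictionary representation, the function names `dpim_read` and `bottom_switch`, and the sample field values are assumptions introduced here and are not part of the disclosure; only the field order follows the formats shown in the text.

```python
# Illustrative sketch: a packet is a dict whose keys are listed in the same
# left-to-right order as the packet formats in the text.

def dpim_read(packet, ring_data):
    """Logic module at AD1: OP1 and AD1 no longer appear in the outgoing
    packet, and the ring R DATA field takes the place of PAYLOAD."""
    return {"DATA": ring_data, "OP2": packet["OP2"],
            "AD2": packet["AD2"], "BIT": packet["BIT"]}

def bottom_switch(packet):
    """Bottom switch: AD2 is removed while routing to output port AD2."""
    port = packet["AD2"]
    delivered = {k: v for k, v in packet.items() if k != "AD2"}
    return port, delivered

# Hypothetical request: read the ring at address 9, deliver to output port 5.
request = {"PAYLOAD": None, "OP2": "op2", "AD2": 5,
           "OP1": "READ", "AD1": 9, "BIT": 1}
after_dpim = dpim_read(request, ring_data="ring-data")
port, delivered = bottom_switch(after_dpim)
```

In this sketch the stored data is passed in as an argument and is never modified, which mirrors the statement that ring R remains unchanged by a READ.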
[0049]
The OP2 field is a code that can be used in various ways. One use is to show the processing that the bottom switch output device performs on the data stored in the PAYLOAD field.
[0050]
The interconnect structure of a PRAM inherently has circular timing to achieve efficient, parallel generation and access of data. For example, multiple external sources located at different input ports 202 may request a READ operation on the same DATA field at a particular DPIM 114. Multiple READ requests can enter a particular target ring Z 306 of the top switch at different nodes 312 and then enter different logic modules L of the target DPIM. These READ requests can enter different logic modules on ring C in the same cycle time. The communication ring C 302 and the memory ring R 304 are always synchronized with the movement of packets in the target ring Z of the top switch and in the input interconnect structure B of the concentrator.
[0051]
The READ request always arrives at the logic module at a suitable time to add data from ring R to the appropriate PAYLOAD location of the outgoing packet. The beneficial result is that multiple requests for the same data in ring R can be issued simultaneously. The same DATA field is accessed by multiple requests. Data from ring R is sent to multiple final destinations. The plurality of READ processes are executed in parallel, and the transmitted packet reaches the plurality of output ports 204 simultaneously. The plurality of READ requests are executed in an overlapping manner by simultaneously reading from different locations in the ring R by different logic modules. Further, other READ requests are executed at the same cycle time at different addresses of the PRAM memory.
[0052]
Because of the system timing, multiple READ requests overlap and are executed efficiently and in parallel. FIGS. 4A, 4B and 4C illustrate the timing for a single READ. The storage ring R is the same length as the communication ring C. Ring R includes circulating data 414 of length PayL. The remaining storage elements in ring R may be zero or "blank", or may be ignored and have any value. The BLANK field 412 is the set of bits not included in the DATA field 414.
[0053]
Referring to FIG. 4A, a portion of each of rings C and R passes through a particular DPIM logic module. The logic module includes at least two bits of the set of shift registers forming ring C and at least two bits of the shift register forming ring R. In one embodiment, DPIM 114 includes a plurality of logic modules 314. The logic module is arranged to read two bits of the communication ring 302 in one clock time. At a time indicated by a global signal (not shown), the logic module examines the BIT and OP1 fields. In the example shown, the logic module reads the entire OP1 and BIT fields together. In another embodiment, the OP1 and BIT fields can be read in multiple operations. In a READ request, an unblocked logic module 314 sends the packet to the concentrator or bottom switch at the appropriate time so that the packet is aligned with other bits at the concentrator or bottom switch input.
[0054]
In a READ request, when a blocked logic module places a packet on ring C, the packet moves to the next logic module. The next logic module may or may not be blocked. If a subsequent logic module is blocked, that blocked logic module also sends the packet on ring C to the next module. If the packet enters the rightmost logic module LR and the rightmost logic module LR is blocked, the logic module LR sends the packet through the FIFO on ring C. Upon exiting the FIFO, the packet enters the leftmost logic module. Packets circulate until a non-blocked logic module is encountered. The length of ring C is set so that the circulating packet always fits completely on the ring. In other words, the packet length P_L never exceeds the ring length F_L.
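The circulation of a blocked packet may be pictured with the following Python sketch. It is a simplification under stated assumptions: the function name `first_unblocked` is introduced here, the blocked state of each module is treated as fixed during one trip around the ring (in the described system the control signals change over time), and the FIFO is modeled simply as a wrap-around from the rightmost to the leftmost module.

```python
def first_unblocked(start, blocked):
    """Return (module index, hops) for a packet entering ring C at `start`.

    blocked[i] is True when logic module i sees a busy signal from below.
    The packet moves rightward and wraps through the FIFO to module 0.
    """
    n = len(blocked)
    for hops in range(n):
        i = (start + hops) % n
        if not blocked[i]:
            return i, hops
    # Every module was blocked on this trip; the packet keeps circulating.
    return None, n
```

For example, with four modules of which only the leftmost is free, a packet that enters at the rightmost module passes through the FIFO and is accepted after one hop.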
[0055]
In a READ operation, the packet has the following format:
| PAYLOAD | OP2 | AD2 | OP1 | AD1 | BIT |
[0056]
The packet is inserted into the top switch. Address field AD1 indicates the target address of ring R 304 containing the desired data. The address field AD2 is the target address of the output port 204 of the bottom switch to which the result is sent. The processing code OP2 specifies the processing to be performed by the output device.
[0057]
In one exemplary embodiment, the output device is the same as the input device. Thus, a single device is connected to the PRAM input 202 and output 204 ports. In a READ request, the PAYLOAD field is ignored by the logic module and can be any value. On the other hand, in a WRITE operation, the PAYLOAD field contains data to be placed on the ring R 304 associated with the DPIM at address AD1. The modified packet sent out of the logic module has the following format:
｜ DATA ｜ OP2 ｜ AD2 ｜ BIT ｜
[0058]
The data entering the bottom switch has the following format:
｜ DATA ｜ OP2 ｜ AD2 ｜ BIT ｜
[0059]
Data is sent from the bottom switch through the output port specified by address field AD2, where DATA is data field 414 of ring R.
[0060]
FIGS. 4A, 4B and 4C show the timing between communication ring C, data storage ring R and concentrator B. In embodiments that include a ring having multiple parallel FIFOs in a bus structure, logic module 314 can read multiple bits at a time. In the present embodiment, the logic module L receives only one bit per clock time. Concentrator B includes a plurality of input nodes 320 on FIFO 308 that can receive packets from the logic module. The logic module is arranged to inject data at the top level of the concentrator through input port 322.
[0061]
Referring to FIG. 4A, the BIT field 402, which is set to 1, reaches the logic module at the same time as the first bit B₀ 408 of the BLANK field 412 on the data ring R. The relative timing of the circulating data is adjusted so that the first bit of DATA in ring R is aligned with the first bit of the payload field of the request packet in ring C (as shown by line 410).
[0062]
Data already in concentrator B and trying to enter node 316 from another node of the concentrator takes precedence over data attempting to enter node 316 from above via path 322. A global packet arrival timing signal (not shown) informs node 316 about the times when a packet can enter. If a packet already in the concentrator enters node 316, node 316 issues a block signal via path 324 to the logic module connected to it. In response to the block signal, the logic module L sends a READ request packet to the communication ring C as described above. If no block signal is sent from below the hierarchy, logic module L sends a packet via line 322 to input node 320 in concentrator B downstream of node 316.
[0063]
FIG. 4A shows a READ request at time T = 0 (where T = 0 is the start time of request processing by the logic module that received the request). At this point, the logic module has enough information to determine whether a READ request has been received and whether the received request has been blocked from below. In particular, the logic module examines the BIT and OP1 fields and responds for the following three conditions:
No busy signal is received from below through line 324,
BIT = 1, and
OP1 = READ request.
[0064]
If these three conditions are met, the logic module is ready to start the READ operation at the next time step. If OP1 = WRITE, the logic module starts the WRITE process in the next time step.
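The decision made at T = 0 may be summarized by the following Python sketch. The function name `next_action` and the returned labels are hypothetical. The branches follow statements made elsewhere in this description: a blocked READ is forwarded on ring C, a WRITE ignores the control signal from the concentrator, and nothing is done when BIT = 0.

```python
def next_action(busy_from_below, bit, op1):
    """Choose what a logic module does at the next time step."""
    if bit != 1:
        return "IDLE"                  # BIT = 0: no packet is present
    if op1 == "WRITE":
        return "START_WRITE"           # control signal from below is ignored
    if op1 == "READ":
        if busy_from_below:
            return "FORWARD_ON_RING_C" # blocked: pass to the next module
        return "START_READ"            # all three conditions are met
    return "IDLE"                      # other operation codes are not modeled
```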
[0065]
FIGS. 4B and 4C illustrate an ongoing READ request when no block signal is sent from node 316 to the logic module.
[0066]
FIG. 4B shows a READ request at T = 1. All data bits in rings Z, C and R are shifted right by one. The rightmost bit of each ring enters the FIFO, and the FIFO supplies one bit to the leftmost element. The logic module sends the BIT field over line 322 to the input port of the concentrator. After the shift, the C-ring register contains the second and third bits of the packet, i.e., the one-bit OP1 field and the first bit of the AD2 field. The logic module also holds the second and third bits of the ring R BLANK field, namely B₁ and B₂. In typical operation of PRAM 200, a packet from ring Z may enter a logic module (not shown) located to the left of the logic module shown, in which case the entire packet is not contained on ring C. The rest of the packet may still be in the top switch 110, or may be "wormholing" from the input port through the top switch, out of ring Z, and into logic module L 314. For ease of understanding, FIGS. 4A, 4B and 4C show a READ request packet entirely contained on ring C.
[0067]
In the next AD2L + OP2L steps, the logic module L reads the AD2 and OP2 fields and copies them to the input port 320. At this point, the concentrator has received the BIT field, the AD2 field, and the OP2 field in a bit serial manner. The concentrator receives and processes this sequence in a wormhole manner before the first bit of the DATA field 414 reaches the logic module L. While logic module L reads AD2 and OP2 on ring C, BLANK field 412 on ring R passes through logic module L and is ignored. The logic module is arranged to read the first bit of the PAYLOAD section of the packet in communication ring C at the same time as the first bit of the DATA field of ring R arrives (indicated by line 410).
[0068]
The logic module L sends output data in two directions. First, the logic module L returns a zeroed packet to ring C. Second, logic module L sends the DATA field downstream. All bits returned to ring C are set to zero 430 to prevent subsequent logic modules on ring C from repeating the READ operation. Alternatively, once the logic module L has successfully processed the request, the request packet may be removed from the communication ring C. This has the advantage that another logic module on the same ring can accept another request packet in the same cycle time. The packets are preferably processed by the logic module in a wormhole manner, so that a plurality of different packets can be processed by one specific DPIM in one cycle time.
[0069]
At time K + 1, the first bit of the payload is in position to be replaced by zero by the logic module L, and the first bit D₁ of ring R is in position to be sent to the bottom switch or to a concentrator that transfers data to the bottom switch. The process continues as shown in FIG. 4C. The logic module sends the second DATA bit D₂ downstream while reading the third DATA bit D₃ from the data ring R. At the end of the process, the entire packet has been removed from the communication ring C and the packet has the following format:
｜ DATA ｜ OP2 ｜ AD2 ｜ BIT ｜
[0070]
The packet is sent to the concentrator input port 320 or the bottom switch. DATA is copied from the Ring R DATA field to the concentrator. The DATA field 414 in the data ring R does not change.
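The bit-serial timing of a READ may be illustrated by the following Python sketch, which advances both rings one bit per clock step. It is a simplified model under stated assumptions: the function name `serial_read` is introduced here, the two-bit window of the logic module is not modeled, and the BLANK field is assumed to be exactly as long as the BIT, OP1, AD2 and OP2 fields together, so that the first PAYLOAD bit and the first DATA bit reach the logic module in the same step.

```python
def serial_read(bit, op1, ad2, op2, payload, blank, data):
    """Model one READ at a logic module, one bit per clock step.

    All arguments except `bit` are lists of 0/1 values in arrival order.
    Returns (bits sent downstream, bits returned to ring C).
    """
    ring_c = [bit] + op1 + ad2 + op2 + payload
    ring_r = blank + data
    header_len = 1 + len(op1) + len(ad2) + len(op2)
    if len(blank) != header_len or len(data) != len(payload):
        raise ValueError("ring R is not aligned with the packet on ring C")
    downstream, back_on_c = [], []
    for t, (c_bit, r_bit) in enumerate(zip(ring_c, ring_r)):
        in_op1 = 1 <= t < 1 + len(op1)
        if t >= header_len:
            downstream.append(r_bit)   # DATA copied in place of PAYLOAD
        elif not in_op1:
            downstream.append(c_bit)   # BIT, AD2 and OP2 forwarded unchanged
        back_on_c.append(0)            # the request is zeroed on ring C
    return downstream, back_on_c
```

The `data` list is only read, never written, which corresponds to the DATA field 414 remaining unchanged.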
[0071]
Referring to FIG. 5, the logic modules L1 504 and L2 502 simultaneously execute a READ request. Different request packets P1 and P2 generally come from different input ports 202 and enter the top switch, and multiple READ requests are processed in a single DPIM in a wormhole manner. In the example shown, all requests are for the same PRAM address, which is specified in the AD1 field of each request packet. Packets P1 and P2 arrive at different logic modules L1 and L2 in the target DPIM, respectively. Each logic module processes requests independently of each other. In the example shown, the first arrived READ request P2 is being processed by module L2 502. The module L2 has already read and processed the five bits of the BIT field, OP1 field, and AD2 field. The module L2 has already sent 4 bits of the BIT field and the AD2 field to the input node 512 of the concentrator. Similarly, module L1 has already read and processed the two bits of the AD2 field of packet P1, and has sent the first AD2 bit to node 514 below. The AD2 fields of the two packets are different, so that the DATA field 414 is sent to two different output ports of the bottom switch. The second request occurs several clocks later than the first request, and the processing of these two requests is overlapped. DPIM has T logic modules and has the ability to process T READ requests in the same cycle time. As a result of processing the READ request, the logic module always places a zero 430 on ring C.
[0072]
By routing the request and response in a wormhole manner within the top and bottom switches, respectively, any input port can send request packets simultaneously with other input ports. In general, any input port 202 can send a READ request to any DPIM independently of requests coming in simultaneously from other input ports. PRAM 200 supports parallel, overlapping access to a single database from multiple requesters, and supports multiple requests for the same data location.
[0073]
WRITE request
In a WRITE request as well, the AD1 field of the packet is used for routing the packet in the top switch. The packet exits the top switch at node 312 at a predetermined location and enters ring C. The OP1 field specifies a WRITE request. In a WRITE request, no data is sent to the concentrator. Therefore, the logic module ignores the control signal from the concentrator. The logic module sends a "0" to the input port 320 of the concentrator to indicate that no packet will be sent. A WRITE request from ring Z can always enter the first logic module encountered on ring C.
[0074]
The request packet is shown in ring C for simplicity of explanation. In a more typical operation, a request is wormholed through a top switch to a logic module. In response to the WRITE request, the logic module ignores information in fields other than the OP1 and PAYLOAD fields.
[0075]
FIG. 6 illustrates a WRITE request at time T = K + 5. The WRITE packet on ring C and the data on ring R together go through the logic module in synchronization. The last bit of the OP2 field is discarded by the logic module at the same time that the logic module is matched with the last bit of the BLANK field of the storage ring R. When the first bit of the PAYLOAD field of the packet reaches the logic module, logic module L removes the first bit from ring C and places the first bit in the data field of ring R. The process continues until the entire PAYLOAD field has been transferred from the communication ring to the Ring R DATA field. Logic module L zeroes the packet and preferably removes the packet from ring C so that no other logic module repeats its WRITE operation.
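The effect of a WRITE on the two rings may be sketched in Python as follows. The function name `serial_write` and the list representation are assumptions made for illustration. The sketch assumes, as in the timing described above, that the BLANK field has the same length as the packet fields preceding PAYLOAD.

```python
def serial_write(header_len, payload, ring_r):
    """Return (new ring R contents, bits left on ring C) after a WRITE.

    ring_r is BLANK followed by DATA; header_len is the number of packet
    bits (BIT, OP1, AD2, OP2) that precede PAYLOAD on ring C.
    """
    blank, data = ring_r[:header_len], ring_r[header_len:]
    if len(payload) != len(data):
        raise ValueError("PAYLOAD and DATA must have the same length")
    new_ring_r = list(blank)           # BLANK passes through untouched
    for bit in payload:                # PAYLOAD bits replace DATA bits in step
        new_ring_r.append(bit)
    ring_c_after = [0] * (header_len + len(payload))  # packet zeroed on ring C
    return new_ring_r, ring_c_after
```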
[0076]
For ease of visual comprehension, FIG. 6 shows a data packet moving from ring C to ring R. In practice, the data usually arrives from the top switch. More specifically, the data is spread across the top switch while it is being transferred.
[0077]
In another embodiment, in which a single DPIM is provided with multiple R rings, the address of the DPIM module is stored in the AD1 field and the address of a given R ring within the DPIM module is stored as part of an extended OP1 field. In an example in which one DPIM memory module is provided with eight R rings, the OP1 field has a length of 4 bits: the first bit indicates a READ or WRITE operation, and the next three bits indicate the R ring to which the request is directed. If each DPIM includes multiple R rings, the number of levels in the top switch and the number of levels in the concentrator are reduced.
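Decoding of the 4-bit extended OP1 field in the eight-ring example may be sketched in Python as follows. The description does not state which bit value denotes READ or in what order the three ring-select bits are read; the sketch assumes 0 = READ, 1 = WRITE, and most significant bit first, and the function name `decode_op1` is hypothetical.

```python
def decode_op1(op1_bits):
    """Split a 4-bit extended OP1 field into (operation, R ring index)."""
    if len(op1_bits) != 4:
        raise ValueError("extended OP1 is 4 bits in the eight-ring example")
    operation = "WRITE" if op1_bits[0] else "READ"   # assumed encoding
    # Assumed bit order: most significant ring-select bit first.
    ring = op1_bits[1] * 4 + op1_bits[2] * 2 + op1_bits[3]
    return operation, ring
```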
[0078]
Providing multiple R rings within a single DPIM also allows for more complex operations, which require more data and more logic in the module and a more complex OP1 code. For example, the request to the DPIM can be a request to send the largest value among all the R rings, or a request to send the sum of the values of a subset of the R rings. The DPIM request can also be a request to send a copy of each word containing a specified subfield to a calculated address, to enable efficient retrieval of certain types of data.
[0079]
In the PRAM system shown, the BLANK field is ignored and can have any value. In another embodiment, the BLANK field may be defined to assist in various operations. In one embodiment, the BLANK field is used for a scoreboard function. Suppose a system has N processors, where N is less than B_L, and all N processors must read the DATA field before the DATA field can be overwritten. When a new DATA value is placed on storage ring R, the BLANK field is set to all zeros. When processor W among the N processors reads the data, bit W of BLANK is set to one. The data portion of ring R can be overwritten only when the entire N-bit subfield of BLANK is set to "1". The BLANK field is then reset to all zeros.
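The scoreboard use of the BLANK field may be sketched in Python as follows. The class name `ScoreboardRing` and its methods are hypothetical, and the sketch keeps the N scoreboard bits in a list rather than in a bit-serial ring.

```python
class ScoreboardRing:
    """Ring R whose BLANK field records which processors have read DATA."""

    def __init__(self, n_processors, data):
        self.n = n_processors
        self.blank = [0] * n_processors   # new DATA: BLANK is all zeros
        self.data = data

    def read(self, w):
        """Processor w reads DATA; bit w of BLANK is set to one."""
        self.blank[w] = 1
        return self.data

    def try_write(self, new_data):
        """Overwrite DATA only if every processor has read it."""
        if not all(self.blank):
            return False
        self.data = new_data
        self.blank = [0] * self.n         # BLANK is reset to all zeros
        return True
```

A write attempted before the last reader has finished is refused in this sketch; how a refused WRITE is handled in hardware is not addressed here.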
[0080]
Such a scoreboard function is just one of many uses for the BLANK field. Those skilled in the art will be able to effectively use the BLANK field for various applications in computing and communications.
[0081]
In some embodiments, multiple logic modules in the DPIM must be able to communicate with one another. An example of such an application is the leaky bucket algorithm used in asynchronous transfer mode (ATM) Internet switches. In the illustrated parallel access memory 200, the arithmetic logic module 314 sends a signal to a local counter (not shown) upon receiving a READ request. No two arithmetic logic modules in one DPIM receive the first bit of a read packet at the same time, so a common DPIM bus (not shown) can conveniently be used to connect the counter to all logic modules. The counter can respond to all arithmetic logic modules so that, when the "leaky bucket overflows," all appropriate logic modules are notified and respond to that information by modifying the AD2 and OP2 fields to generate the appropriate response to the appropriate destination.
[0082]
Referring to FIG. 1, a schematic block diagram illustrates a computational engine 100 constructed using a network interconnect structure as a basic element. Various embodiments of the computing engine include the core components of the general-purpose system 100 described above in the description of FIG. In the example of a computing engine that is a computing system, the bottom switch 112 sends packets to a computing unit 126 that includes one or more processors and memory or storage. Referring to FIG. 3, the arithmetic logic module associated with ring R performs some of the arithmetic functions of the overall system. The arithmetic unit 126 receiving the data from the bottom switch 112 performs further logical processing.
[0083]
The logic module performs both conventional and new processor processing depending on the overall function desired for the computing engine.
[0084]
A first example of the system 100 is a scalable parallel computing system. In one aspect of the processing, the system performs a parallel SORT that includes a parallel compare sub-operation. Logic module L receives the first data element from the packet and the second data element from storage ring R 304. The logic module places the larger of the two data elements on the storage ring R, places the smaller one in the PAYLOAD field, and sends the smaller value to the address given in the AD2 field of the packet. As shown in FIG. 3, if two such logic modules are connected in series, the second logic module can perform a second comparison on the data coming from the first logic module within a few clock cycles. Such compare-and-replace processes are a common unit of work in many sorting algorithms, and those skilled in the art can use them to form larger parallel sorting engines.
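The compare-and-replace step may be sketched in Python as follows. The function names are hypothetical. Modeling the series-connected logic modules as each holding its own stored value is an assumption made for illustration, not a statement of how FIG. 3 arranges them.

```python
def compare_exchange(ring_value, packet_value):
    """Keep the larger value on ring R; the smaller travels on in PAYLOAD."""
    return max(ring_value, packet_value), min(ring_value, packet_value)

def pass_through(ring_values, packet_value):
    """Send one packet value through a series of compare-exchange modules."""
    rings = list(ring_values)
    for i in range(len(rings)):
        rings[i], packet_value = compare_exchange(rings[i], packet_value)
    return rings, packet_value
```

Starting from stored values 5 and 2 and an incoming value 3, the first module keeps 5 and forwards 3, and the second module keeps 3 and forwards 2.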
[0085]
Those skilled in the art will be able to form many useful logic modules 314 that can be used for a wide range of system applications. A single logic module can perform many tasks, or form different types of logic modules, with each unit performing a smaller number of tasks.
[0086]
The system 100 includes two types of processing units. That is, the unit in the DPIM 114 and the unit in the arithmetic unit CU 126. DPIM handles bit-serial data movement and performs operations that involve moving large amounts of data. The CU includes a processor, such as one or more general-purpose processors, and a normal RAM. The CU performs a "number crunching" operation on the data set provided to the CU to generate, transfer, and receive packets. One of the key functions of DPIM is to provide data to the CU with low delay, in parallel, and in a form convenient for later processing.
[0087]
In one example of a function, a large domain of a computational problem can be broken down into a set of non-overlapping sub-domains. A CU may choose to receive, from each sub-domain, certain types of data that contribute significantly to the operations performed by the CU. DPIM prepares the data and sends the result to the appropriate CU. For example, the domain may be all chess positions possible within 10 moves, and a sub-domain may include all positions reachable in 8 moves from a given pair of opening moves. DPIM returns only the promising first move pairs to the CU, with the data ordered from most promising to least promising.
[0088]
In another application, a region includes a representation of a plurality of objects in a three-dimensional space, and each sub-region comprises a partitioned portion of the space. In one particular example, a state of interest is defined as a state of gravity above a predetermined threshold acting on a body of interest. DPIM sends data to the CU from sub-regions containing data consistent with the state of interest.
[0089]
The scalable system shown in FIG. 1 and the embodiment using the core elements of the scalable system can be constructed to be suitable for supercomputer applications. In supercomputer applications, multiple CUs receive data in an appropriate format and in a timely manner in parallel. These CUs process data in parallel, send processing results, and generate requests for later interaction.
[0090]
DPIM is also useful as a bookkeeper and task scheduler. One example is a task scheduler that uses a collection H of a plurality (K) of computational units (CUs). The CUs in collection H typically perform various tasks in parallel. When tasks are completed, new tasks are assigned to N of the K CUs. A data storage ring R capable of storing at least K bits of data holds a word W of length K, which is initially set to all zeros. Each bit position in word W is associated with a particular CU of collection H. When a CU completes its assigned task, it sends a packet M to the DPIM containing ring R. The logic module L1 on the data storage ring R modifies the word W by inserting a 1 into the bit position associated with the CU that sent the packet M. Another logic module L2 on data storage ring R tracks the number of ones in word W. When the word W contains N ones, the N idle CUs in H start new tasks. These new tasks are started by multicasting one packet to the N processors. An efficient method for multicasting to a subset of the set H is described below.
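The bookkeeping performed on word W may be sketched in Python as follows. The function names are hypothetical: `record_completion` plays the role attributed to logic module L1 and `idle_cus_if_ready` the role attributed to L2. Returning the list of idle CUs as the multicast target set is an assumption made for illustration.

```python
def record_completion(word, cu_index):
    """Role of L1: set the bit of the CU that reported completion."""
    word[cu_index] = 1

def idle_cus_if_ready(word, n):
    """Role of L2: count the ones in W; report the idle CUs once N are idle."""
    idle = [i for i, b in enumerate(word) if b == 1]
    return idle if len(idle) == n else []

# Hypothetical run with K = 6 CUs and N = 2.
W = [0] * 6
record_completion(W, 4)
first_check = idle_cus_if_ready(W, 2)    # only one CU is idle so far
record_completion(W, 1)
second_check = idle_cus_if_ready(W, 2)   # two CUs are idle: dispatch
```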
[0091]
Referring to FIG. 7, a schematic block diagram illustrates the structure and technique for performing a multicast operation using indirect addressing. Multicasting a packet to a plurality of destinations specified by corresponding addresses is a very useful function in computing and communication applications. A single first address points to a set of second addresses. These second addresses are the destinations of the copies of the packet payload to be multicast.
[0092]
In some embodiments, the interconnect structure system has a collection C₀, C₁, C₂, ..., C_J-1 of sets of output ports, and under certain conditions the system sends a packet to all output ports in one of these sets. For a particular integer N less than J, all output ports in the set C_N can receive the same particular packet as a result of a single multicast request.
[0093]
The multicast interconnect structure 700 stores the set of output addresses for the set C_N in the storage ring R 704. Each of the rings has the capacity to store FMAX addresses. In the example shown, the ring R shown in FIG. 7 has the capacity to store FMAX = 5 addresses.
[0094]
Various configurations and sizes of switches can be used. In one embodiment, the bottom switch has 64 output ports, so an output port address can be stored as a 6-bit binary pattern. Ring R has five fields 702, labeled F₀, F₁, F₂, F₃ and F₄, which hold the output port positions of C_N. Each field is 7 bits long. Of the seven bits, the first bit is set to 1 if a position of C_N is stored in the next 6 bits of the field. Otherwise, the first bit is set to zero.
[0095]
At least two types of packets may reach the multicast logic module MLM 714. These packets include a MULTICAST READ packet and a MULTICAST WRITE packet.
[0096]
The first type of packet, PW, has an OP1 field that specifies MULTICAST WRITE processing. The WRITE packet reaches communication ring 302 and has the following format:
｜ PAYLOAD ｜ OP1 ｜ BIT ｜
[0097]
PAYLOAD is equivalent to the concatenation of the fields F₀, F₁, F₂, F₃ and F₄. The packet PW arrives on the communication ring 302 at a time and position suitable for the MLM 714 to read the first bit of F₀. The MLM writes the first bit of PAYLOAD to the ring R in the same manner as the WRITE operation described above with reference to FIG. 6.
[0098]
FIG. 7 shows a logic module connected to special hardware DPIM 714 that supports the multicast function. In response to the WRITE request, the system performs an operation whereby the fields F₀, F₁, F₂, F₃ and F₄ are transferred from rings Z and C to data storage ring R 304. The presence of a packet is indicated by BIT = 1. When BIT = 0, the rest of the packet is always ignored. A processing code field OP1 follows the BIT field. In the MULTICAST WRITE operation, OP1 indicates that the payload is to be transferred from the packet to the storage ring, replacing the data currently on the storage ring. Data is serially transferred from the MLM to the storage ring.
[0099]
For example, data is transferred over the rightmost line 334. The data arrives at the appropriate time and location to be placed on storage ring 704 in the appropriate format. In the MULTICAST WRITE process, the control signal sent from the bottom switch to the MLM via line 722 can be ignored.
[0100]
Another type of packet, PR, indicating a MULTICAST READ request may arrive on the communication ring 302 and has the following format:
｜ PAYLOAD ｜ OP2 ｜ BLANK ｜ OP1 ｜ BIT ｜
[0101]
The BLANK section is, for example, 6 bits long. The BLANK field is replaced by a target address of C_N taken from one of the fields of ring R. The OP1 field may or may not be used for a particular packet or application. A group of packets enters the bottom switch 112 in the following format:
｜ PAYLOAD ｜ OP2 ｜ AD2 ｜ BIT ｜
[0102]
The address field AD2 comes from the ring R field. The processing fields OP2 and PAYLOAD are derived from a MULTICAST READ packet.
[0103]
In the example shown, storage ring R 704 located at target address AD1 stores three output port addresses, e.g., 3, 8, and 17. Output address 3 is stored in field F₀. The most significant bit of an address appears first, followed by the next most significant bit, and so on. The standard 6-bit binary pattern representing the decimal integer 3 is 000011. The header bits are used in order from the most significant bit to the least significant bit, so it is convenient to store them with the most significant bit first. Within field F₀, target output 3 is therefore represented by the 6-bit pattern 110000. Including the timing bit, the whole of field F₀ has the 7-bit pattern 1100001. Similarly, field F₁ stores the decimal number 8 as the pattern 0001001, and field F₂ stores the decimal number 17 as 1000101. Since no further output ports are addressed, fields F₃ and F₄ are set to all zeros (0000000).
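The three bit patterns quoted above are consistent with writing each field so that the first-arriving bit is rightmost: the timing bit on the right, then the address bits from most significant to least significant moving leftward. The following Python sketch reproduces the patterns under that reading. The function name `encode_field` is hypothetical, and the reading itself is an inference from the example values rather than a statement in the description.

```python
def encode_field(port=None, width=6):
    """Return the 7-bit field pattern for an output port, or all zeros."""
    if port is None:
        return "0" * (width + 1)            # unused field: leading bit is 0
    if not 0 <= port < 2 ** width:
        raise ValueError("port does not fit in the address width")
    binary = format(port, "0{}b".format(width))   # e.g. 3 -> "000011"
    # Reverse so the most significant bit sits next to the timing bit,
    # then append the timing bit (1) as the rightmost character.
    return binary[::-1] + "1"

fields = [encode_field(p) for p in (3, 8, 17, None, None)]
```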
[0104]
The control signal on line 722 indicates a non-blocking condition at the bottom switch, allowing packets to enter through line 718. When the control signal sent from the bottom switch to the logic module 714 over line 722 indicates a busy condition, no data is sent to the switch. When the "non-busy" control signal reaches the MLM, the address data fields in ring R are placed in the proper position to generate and send a response to reading unit 708 and bottom switch 112. At the appropriate time after the "non-busy" signal arrives at the logic module, the MLM starts sending a plurality of MULTICAST READ response packets through the bottom switch 112 to the set of addresses C_N.
[0105]
By sending a MULTICAST READ packet to the DPIM at address AD1, the system has the ability to multicast the PAYLOAD field of the packet to the multiple addresses of the set C_N stored in ring R 704.
[0106]
Typically, such multicast systems include hardware capable of performing various computation and data storage tasks. In the example shown, multicast capability is achieved by using a specially configured DPIM unit 700 to hold and forward multicast addresses.
[0107]
A generalization of the multicast function described above is a mode in which a single packet M is sent to a predetermined subset of the output ports whose addresses are members of the set C_N. The bit mask indicating which members should receive the packet is called a send mask. In one example, addresses 3, 8, and 17 are the three members of set C_N. The send mask 0, 0, 1, 0, 1 indicates that the first and third output ports of list C_N should receive the packet. The response packet is multicast to output ports 3 and 17. In one example, the control signal indicates whether all input ports are ready to receive packets, or whether one or more input ports are blocked.
[0108]
In another example, a list of unblocked output ports is stored. This list is a mask called a block mask. The value 1 at the Nth position of the send mask indicates that the packet is to be sent to the Nth member of C_N. The value 1 at the Nth position of the block mask indicates that the Nth member of C_N is not blocked and can receive the packet. When the value at the Nth position of both masks is 1, packet M is sent to the Nth output port of the list.
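The combination of the two masks amounts to a position-by-position AND, which may be sketched in Python as follows. The function name `multicast_targets` is hypothetical. In the sketch, index 0 of each list corresponds to the first member of the list of addresses, which is a representation chosen here for clarity and may differ from the order in which the mask bits are written in the text.

```python
def multicast_targets(addresses, send_mask, block_mask):
    """Return the output ports that receive packet M.

    addresses[i] is the i-th stored member of C_N (None for an empty field);
    a port is selected only when both masks hold 1 at its position.
    """
    return [addr
            for addr, send, free in zip(addresses, send_mask, block_mask)
            if addr is not None and send == 1 and free == 1]

# Hypothetical ring contents: three members and two empty fields.
stored = [3, 8, 17, None, None]
```

With the first and third members selected by the send mask and no port blocked, the packet goes to output ports 3 and 17. If the third member is blocked, only port 3 receives it.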
[0109]
Packets that are multicast to the subset of the output ports listed in C_N indicated by the send mask have the following format:
PAYLOAD | OP2 | Mask | Multicast Op | AD1 | BIT |
[0110]
Packets enter the top switch of the system. The address normally stored in the address field AD2 is included in the data stored in the address field AD1, and therefore, the address field AD2 is not used.
[0111]
Referring to FIG. 7, the BIT field and the OP1 code are sent from ring C or ring Z to the logic module 714. The send mask and the block mask also enter the logic module at the same time. When the J-th bit of the send mask is 1 and the J-th bit of the block mask is also 1, PAYLOAD is sent to the address stored in field F_J. The rest of the process proceeds as in the unmasked multicast mode.
[0112]
The output ports in set C_N are denoted p₀, p₁, ..., p_m. The output ports are divided into groups, each containing at most the number of members of C_N that can be stored on a data storage ring R. If data ring R has five output addresses and set C_N has nine output ports, the first four output ports are stored in group 0, the next four output ports are stored in group 1, and the remaining output port is stored in group 2. The output port sequence p₀, p₁, ..., p₈ can also be indexed as q₀₀, q₀₁, q₀₂, q₀₃, q₁₀, q₁₁, q₁₂, q₁₃, q₂₀. In this way, the physical address of the target can be completely described by two integers indicating the group number and the address field index.
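The grouping in the nine-port example can be sketched as below. The helper name `group_ports` and the use of strings as stand-ins for port addresses are assumptions for illustration; the group size of four is taken from the example.

```python
def group_ports(ports, group_size):
    """Split the port sequence p_0..p_m into consecutive groups of at most group_size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [ports[i:i + group_size] for i in range(0, len(ports), group_size)]


# Nine placeholder ports, four per group as in the example.
groups = group_ports([f"p{i}" for i in range(9)], group_size=4)
for g, group in enumerate(groups):
    for i, port in enumerate(group):
        # Each port gets the two-integer label (group number, index within group).
        print(f"q{g}{i} = {port}")
```

The last line printed is `q20 = p8`, matching the two-index sequence that ends in q₂₀.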
[0113]
In some applications, the packet payload carries the following information:
the subscript N of the set C_N of output ports used to locate the address,
the group of C_N in which the address is located,
the member of that group to which the address corresponds,
the input port of the switch at which the packet entered.
[0114]
The second and third pieces of information are the two subscripts of the corresponding member of the q sequence, and the index in the p sequence can be easily calculated from these two subscripts. For a packet carrying these pieces of information, the PAYLOAD field has the following format.
N | first subscript of q | second subscript of q | input port number |
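One way to recover the p index from the two q subscripts is sketched below. The class `MulticastPayload`, its field names, and the function `p_index` are hypothetical, and the formula assumes that every group except possibly the last is filled with exactly `group_size` members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MulticastPayload:
    """Illustrative stand-in for the four PAYLOAD sub-fields."""
    set_subscript: int  # N, the subscript of set C_N
    group: int          # first subscript of q
    member: int         # second subscript of q
    input_port: int     # input port at which the packet entered


def p_index(payload: MulticastPayload, group_size: int) -> int:
    """Map the pair (group, member) back to the index in p_0..p_m."""
    if not 0 <= payload.member < group_size:
        raise ValueError("member subscript must be smaller than the group size")
    return payload.group * group_size + payload.member


# q_20 with four ports per group corresponds to p_8.
payload = MulticastPayload(set_subscript=2, group=2, member=0, input_port=5)
print(p_index(payload, group_size=4))  # 8
```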
[0115]
FIG. 7 also shows a system that uses indirect addresses in multicasting. A simpler operation is to indirectly address a single output port. In one example of indirect addressing, data storage ring R includes a single field representing an indirect address. For example, the DPIM storage ring R at address 17 contains the value 153. The packet sent to the address 17 is sent to the output port 153 of the bottom switch.
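The single-port indirect addressing example amounts to a table lookup, sketched here. Modeling the storage rings as a dictionary keyed by address, and the name `resolve_output_port`, are assumptions for illustration only.

```python
def resolve_output_port(storage_rings, address):
    """Return the bottom-switch output port held in the storage ring at `address`."""
    try:
        return storage_rings[address]
    except KeyError:
        raise LookupError(f"no storage ring at address {address}") from None


# The ring at address 17 holds the indirect address 153, as in the example.
storage_rings = {17: 153}
print(resolve_output_port(storage_rings, 17))  # 153
```

A packet addressed to 17 is therefore delivered to output port 153 without the sender knowing that port number.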
[0116]
In the embodiment described herein, all logic modules associated with a given ring R send data to the bottom switch 112. If some DPIM units send bursts of traffic while other DPIM units send a relatively small amount of traffic to the bottom switch, the rings R can send packets to a group of rings B rather than to a single ring. In yet another example, the rings R send packets to concentrator 150, which conveys the data to bottom switch 112.
[0117]
In the system described above, the information in the data storage ring R 304 and the communication ring R 302 circulates in a circularly connected FIFO manner. One variation is a system in which the information in ring R 704 is stationary. Data from the level zero ring in the top switch 110 can be connected to enter a static buffer. The data in the static buffer can interact in a manner that is logically equivalent to the circular model described above. The advantage of the static model is that data can be stored more efficiently.
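The logical equivalence of the circulating and static models can be illustrated with a small sketch. Both classes and their method names are hypothetical; the sketch assumes that one word is presented to the logic module per clock tick.

```python
from collections import deque


class CirculatingRing:
    """Circular FIFO model: the data shifts one position on every tick."""

    def __init__(self, words):
        self._ring = deque(words)
        if not self._ring:
            raise ValueError("ring must hold at least one word")

    def tick(self):
        self._ring.rotate(-1)  # every word moves one slot toward the logic module

    def at_logic_module(self):
        return self._ring[0]


class StaticBuffer:
    """Static model: the data stays in place and only an index advances."""

    def __init__(self, words):
        self._cells = list(words)
        if not self._cells:
            raise ValueError("buffer must hold at least one word")
        self._head = 0

    def tick(self):
        self._head = (self._head + 1) % len(self._cells)

    def at_logic_module(self):
        return self._cells[self._head]


ring, buffer = CirculatingRing("abcde"), StaticBuffer("abcde")
for _ in range(12):
    # Both models present the same word to the logic module on every tick.
    assert ring.at_logic_module() == buffer.at_logic_module()
    ring.tick()
    buffer.tick()
```

In the static model no stored word moves, which is one way to read the efficiency advantage claimed for it.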
[0118]
In this description, data X is sent to a ring R that holds data Y. Arithmetic ring C receives both streams of data X and data Y as input signals, performs a mathematical function F on data X and Y, and sends the result of the operation to the target output port. The target can be stored in the ring R field or the AD2 field of the packet. Alternatively, the target may be based on the result of F (X, Y) or may be generated by another function G (X, Y).
[0119]
In another application, multiple operations can be performed on data X and data Y, and the results can be forwarded to multiple destinations. For example, the result of the function F (X, Y) is sent to the destination specified by the address AD2. Further, the result of the function H (X, Y) can be sent to the destination specified by the address AD3 of the packet. Multiple processing has the advantage that system 100 can efficiently perform a wide variety of transformations in parallel.
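The multiple-operation case can be sketched as a function that returns one (destination, result) pair per operation. The dictionary packet layout, the name `process`, and the choice of addition for F and multiplication for H are assumptions made only to give a runnable example.

```python
import operator


def process(packet, stored_y, f=operator.add, h=operator.mul):
    """Apply F and H to packet data X and stored data Y.

    Returns (destination, result) pairs: F(X, Y) goes to AD2 and H(X, Y) to AD3.
    """
    x = packet["payload"]
    return [
        (packet["AD2"], f(x, stored_y)),
        (packet["AD3"], h(x, stored_y)),
    ]


# Hypothetical packet: X = 6 arrives at a ring holding Y = 7.
results = process({"payload": 6, "AD2": 40, "AD3": 41}, stored_y=7)
print(results)  # [(40, 13), (41, 42)]
```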
[0120]
In addition to performing more complicated arithmetic functions on the two arguments X and Y, it is also possible to perform simpler tasks in which the function F is a function of only one of X or Y. The result of the simple function F (X) or F (Y) is sent to a destination that is specified by address AD2 or generated by another function G (X).
[0121]
Although the present invention has been described based on various embodiments, it should be understood that these embodiments are illustrative and do not limit the scope of the present invention. Various modifications, changes, additions and improvements of the described embodiments are possible. For example, those skilled in the art can readily implement the steps necessary to provide the disclosed structures and methods, and will appreciate that the process parameters, materials, and dimensions are provided only as examples and can be adjusted to achieve desired functional characteristics or variations within the scope of the invention. Modifications and changes of the disclosed embodiments are possible based on the description of the present specification without departing from the spirit and scope of the present invention described in the appended claims.
[0122]
Those skilled in the art will have the ability to make some useful variations and modifications within the scope of the invention. Some examples of such variations and modifications have been listed, but may be applied to other systems.
[Brief description of the drawings]
FIG. 1
FIG. 1 is a schematic block diagram showing an example of a general-purpose system formed from building blocks including a plurality of network interconnect structures.
FIG. 2
FIG. 2 is a schematic block diagram showing a parallel memory structure such as a parallel random access memory (PRAM) formed using a network interconnect structure as a basic element.
FIG. 3
FIG. 3 is a view of the lower levels of the top switch showing the communication ring, the plurality of logic modules, the circular FIFO data storage ring, and the connections to the upper level of the bottom switch.
FIG. 4A
FIGS. 4A, 4B, and 4C are block diagrams illustrating the movement of data through a communication ring and a circular FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
FIG. 4B
FIGS. 4A, 4B, and 4C are block diagrams illustrating the movement of data through a communication ring and a circular FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
FIG. 4C
FIGS. 4A, 4B, and 4C are block diagrams illustrating the movement of data through a communication ring and a circular FIFO data storage ring. FIG. 4A applies to both READ and WRITE requests. FIGS. 4B and 4C apply to a READ request in progress.
FIG. 5
FIG. 5 shows a portion of an interconnect structure executing two read operations on the same circular data storage ring during overlapping time periods. The read data enters a second switch, where it is directed to different targets.
FIG. 6
FIG. 6 shows a part of the interconnect structure executing the WRITE instruction.
FIG. 7
FIG. 7 is a schematic block diagram illustrating the structure and technique for performing a multicast operation using indirect addressing.

Claims

A parallel data processing device,
An interconnect structure (100) for interconnecting a plurality of locations;
One or more storage elements (114) comprising at location L a storage element W having a plurality of storage parts, connected to said interconnect structure and accessible as a location via said interconnect structure;
A plurality of operation units (126) connected to the interconnect structure and accessible as locations of the interconnect structure;
wherein the plurality of arithmetic units can access data from the one or more storage elements via the interconnect structure, the arithmetic units include an arithmetic unit C1 and an arithmetic unit C2, and the arithmetic units C1 and C2 can simultaneously read from different storage portions of the storage element W and transfer the data contents of those storage portions to different target locations.

A parallel data processing device,
An interconnect structure (100) for interconnecting a plurality of locations;
One or more storage elements (114) including storage elements W1 and W2 at locations L1 and L2, respectively, connected to said interconnect structure and accessible as locations via said interconnect structure;
A plurality of operation units (126) connected to the interconnect structure and accessible as locations of the interconnect structure;
wherein the plurality of operation units can access data from the one or more storage elements via the interconnect structure, the operation units include an operation unit C1 and an operation unit C2, the operation unit C1 can simultaneously read and operate on data from the storage elements W1 and W2, and the operation unit C2 can read and operate on data from the storage elements W1 and W2 during a time that overlaps the reading and operating by the operation unit C1.

A parallel data processing device,
An interconnect structure (100) for interconnecting a plurality of locations;
One or more storage elements (114), including a circular shift register R1 to store the word W1 having a plurality of storage portions, connected to the interconnect structure and accessible as a location via the interconnect structure;
A plurality of operation units (126) connected to the interconnect structure and accessible as locations of the interconnect structure;
Apparatus, wherein the plurality of operation units can operate simultaneously on different storage portions of the word W1.

The apparatus of claim 3, wherein the storage elements include a circular shift register R2 (302) storing a word W2 having a plurality of storage portions, and the plurality of arithmetic units can use information in the word W1 to perform an operation on the word W2.

An interconnect structure (100) including a plurality of nodes (330) interconnected in a hierarchy and logic (114) that handles data, predicts data collisions at the nodes, and resolves the data collisions based on a priority order defined by the hierarchy;
A first switch (110) connected to the interconnect structure and distributing data to the interconnect structure based on communication information included in the data;
A plurality of logic modules (114) connected to the interconnect structure and capable of executing processing on the data;
A second switch (112) connected to the plurality of logic modules and receiving data from the plurality of logic modules.

The apparatus according to claim 5, further comprising a plurality of interconnect modules connected to the plurality of logic modules and to the first switch, wherein the interconnect modules can monitor data traffic in the logic modules and control the timing of data injection by the first switch to avoid data collisions.

The apparatus of claim 5, wherein the first switch has a plurality of output ports, the apparatus further comprises a plurality of interconnect modules connected to the first switch and to the plurality of logic modules, and an interconnect module is associated with each of the plurality of output ports.

The apparatus of claim 5, wherein the plurality of logic modules includes logic that uses information in data to determine a process to be performed by one of the plurality of logic modules.

The apparatus according to claim 5, wherein the plurality of logic modules comprise a plurality of different logic element types with logic functions selected from: data transfer operations including load, store, read, and write; logical operations including AND, OR, NOR, NAND, exclusive AND, exclusive OR, and bit test; and arithmetic operations including addition, subtraction, multiplication, division, and transcendental functions.

The apparatus according to claim 5, further comprising a plurality of interconnect modules connected to the first switch and to the plurality of logic modules, wherein the interconnect modules can monitor data traffic in the logic modules, include a buffer and a concentrator for storing and concentrating data, and can control the timing of data injection by the first switch to avoid data collisions.

The apparatus according to claim 5, wherein the first and second switches, the plurality of interconnect structures, and the plurality of interconnect modules form an interconnect unit, and the apparatus further comprises one or more arithmetic units (126) that are connected to the plurality of interconnect structures and arranged to transfer data outside the interconnect unit and to the top switch.

The apparatus according to claim 5, wherein the first and second switches, the plurality of interconnect structures, and the plurality of interconnect modules form an interconnect unit, and the apparatus further comprises one or more memory units (126) arranged to transfer data outside the interconnect unit and to the top switch.

The apparatus of claim 5, wherein the top and bottom switches handle a plurality of different bit lengths of data.

The apparatus according to claim 5, wherein the logic module is a dynamic program in memory.

The apparatus according to claim 5, wherein the apparatus operates on a data message comprising data fields including: a payload field capable of carrying a data payload; a first address specifying a storage location that holds data to be processed; a first processing code specifying processing to be performed on the data held at the first address; a second address specifying an optional device that performs processing on the data from the first-address storage location; and a second processing code specifying processing to be performed by the second-address device.

The apparatus according to claim 5, wherein the apparatus operates on a data message comprising data fields including: a field indicating the presence of a data packet; a payload field capable of carrying a data payload; a first address specifying a storage location that holds data to be processed; a first processing code specifying processing to be performed on the data held at the first address; a second address specifying an optional device that performs processing on the data from the first-address storage location; and a second processing code specifying processing to be performed by the second-address device on the data from the first-address storage location.

The apparatus according to claim 5, further comprising one or more processing units (126) connected to the second switch, wherein the second switch can transfer data packets to the one or more processing units, and the apparatus forms a calculation engine.

One or more storage elements having a plurality of storage portions and connected to the interconnect structure and accessible as locations via the interconnect structure;
A plurality of arithmetic units (126) connected to the interconnect structure and accessible as locations of the interconnect structure;
The plurality of operation units can access data from the one or more storage elements via the interconnect structure, and the operation units include a first operation unit and a second operation unit; 6. The storage device according to claim 5, wherein the first and second arithmetic units are capable of simultaneously reading from different storage portions of the storage element and transferring data contents of the storage portion of the storage element to different target locations. An apparatus according to claim 1.

The apparatus according to claim 5, further comprising:
One or more storage elements (116) connected to the interconnect structure and accessible as a location via the interconnect structure;
A plurality of arithmetic units (126) connected to the interconnect structure and accessible as locations of the interconnect structure;
wherein the plurality of arithmetic units can access data from the one or more storage elements via the interconnect structure, the arithmetic units include a first arithmetic unit and a second arithmetic unit, the first arithmetic unit can simultaneously read and operate on data from two of the storage elements, and the second arithmetic unit can read and operate on data from the two storage elements during a time that overlaps the reading and operating by the first arithmetic unit.

A plurality of logic modules (114) connected in a hierarchical interconnect structure, the logic modules handling data, predicting data collisions at nodes, and resolving the data collisions based at least in part on a priority order defined by the hierarchy;
A first switch connected to the interconnect structure and distributing data to the plurality of logic modules based on communication information included in the data;
A second switch (112) connected to the plurality of logic modules and receiving data from the plurality of logic modules.

The apparatus of claim 20, wherein one of the plurality of logic modules has a data communication ring (302) and a data storage ring (304), and the data communication ring and the data storage ring comprise a circular FIFO.

The apparatus of claim 20, wherein one of the plurality of logic modules includes a data communication ring (302) and a data storage ring (304), the data communication ring and the data storage ring comprise a circular FIFO, an element of data is held in a single memory FIFO, and the data element can be modified by the logic module while it circulates in the data storage ring.

The apparatus of claim 20, wherein one of the plurality of logic modules includes a data communication ring (302) and a data storage ring (304), the data communication ring and the data storage ring comprise a circular FIFO, an element of data is held in a single memory FIFO, and the single memory FIFO is capable of holding both program instructions and data.

The apparatus of claim 20, wherein one of the plurality of logic modules has a data communication ring and a data storage ring, and the data communication ring is a mirror image of a bottom-level ring of the first switch connected to the data communication ring.

The apparatus of claim 20, further comprising a data communication ring and a plurality of data storage rings, wherein one or more of the logic modules are associated with the data communication ring and the data storage rings.

The apparatus of claim 20, further comprising a data communication ring and a plurality of data storage rings, wherein one or more of the logic modules are associated with the data communication ring and the data storage rings and have the same logic element type.

The apparatus of claim 20, further comprising a data communication ring and a plurality of data storage rings, wherein one or more of the logic modules are associated with the data communication ring and the data storage rings and have a plurality of different logic element types.

The apparatus of claim 20, further comprising a data communication ring and a plurality of data storage rings, wherein one or more of the logic modules are associated with the data communication ring and the data storage rings and have a plurality of different logic element types with logic functions selected from: data transfer operations including load, store, read, and write; logical operations including AND, OR, NOR, NAND, exclusive AND, exclusive OR, and bit test; and arithmetic operations including addition, subtraction, multiplication, division, and transcendental functions.

The apparatus of claim 20, further comprising a plurality of interconnect modules connected to the first switch and to the plurality of logic modules, wherein the interconnect modules can monitor data traffic in the logic modules, include a buffer and a concentrator for storing and concentrating data, and can control the timing of data injection by the first switch to avoid data collisions.

The apparatus of claim 20, further comprising a data communication ring (302) and a plurality of data storage rings (304), wherein the data storage rings store data that can be accessed simultaneously from a plurality of sources and transferred simultaneously to a plurality of destination locations.

The apparatus of claim 20, wherein the logic module is a dynamic program-in-memory logic module.

A multiple access memory and arithmetic unit,
Multiple logic devices, including memory devices,
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330);
A plurality of logic elements (114) associated with the plurality of nodes;
A plurality of message interconnect paths, each connecting the selected nodes to each other, so as to transfer data from the node functioning as the transmitting node to the node functioning as the receiving node;
Each having a plurality of control signal interconnect paths connecting the selected nodes to each other, so as to transfer control signals from a transmitting node toward a logic element associated with the receiving node;
The plurality of nodes are:
Nodes A, B and X different from each other;
Logic L_B associated with node B that determines routing for node B;
A message interconnect path from node B functioning as a sending node to node X functioning as a receiving node;
A message interconnect path from node A functioning as a sending node to node X functioning as a receiving node;
A control signal interconnect path from node A, functioning as a transmitting node, to the logic L_B,
The apparatus according to claim 1, wherein the control signal causes the transfer of data from the node A to the node X to be performed preferentially with respect to the transfer of data from the node B to the node X.

A multiple access memory and arithmetic unit,
A plurality of logic devices (114) including a memory device;
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330) including different nodes A, B and X;
A plurality of interconnect paths for selectively connecting the nodes of the plurality of nodes to each other,
The interconnect path has a control signal interconnect path for transferring a control signal from a control signal transmitting node to a logic associated with a control signal using node, and a data interconnect path for transferring data from a transmitting node to a receiving node. And
Node B includes a data interconnect path for transferring data to nodes X and Y,
Node A has a control signal interconnect path for transferring a control signal to logic L_B associated with node B,
and for a message M arriving at node B, node A forwards a control signal C to the logic L_B, and the logic L_B can use the control signal C to determine to which of the nodes X and Y the message M should be forwarded.

The multiple access memory and arithmetic device according to claim 33, wherein the logic L_B can determine that a message M' arriving at node B is to be forwarded to a node D different from nodes X, Y, and B.

A multiple access memory and arithmetic unit,
A plurality of logic devices (114) including a memory device;
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330) including nodes A and B and a node set P, wherein nodes A and B are distinct nodes not in the node set P, and node B can transfer data to every node of the node set P;
A plurality of interconnect paths for selectively connecting the nodes of the plurality of nodes to each other,
The node is selected as a paired node that includes a receiving node and a transmitting node, the transmitting node is adapted to transfer data to the receiving node, and the plurality of interconnect paths comprise a data interconnect path and a control interconnect path. Comprising, the control interconnect path selectively connects the nodes of the plurality of nodes to each other as a control signal transmitting node to transfer a control signal to a logic associated with a control signal using node,
wherein the plurality of control interconnect paths include a control interconnect path from node A to logic L_B associated with node B, and the logic L_B uses the control signal from node A to determine to which node of the node set P node B forwards data.

A multiple access memory and arithmetic unit,
A plurality of logic devices (114) including a memory device;
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330) including nodes A and B and a node set P, wherein nodes A and B are distinct nodes not in the node set P, and node B can transfer data to every node of the node set P;
A plurality of interconnect paths selectively connecting nodes among the plurality of nodes to each other, the nodes being selected as pairs that include a receiving node and a transmitting node, the transmitting node being adapted to transfer data to the receiving node;
Logic L_A associated with node A and capable of determining where to route data from node A;
Logic L_B associated with node B and capable of determining where to route data from node B,
wherein the logic L_B is distinct from the logic L_A, and the logic L_B uses information determined by the logic L_A to determine to which node of the node set P node B forwards data.

37. The multiple access memory and operation device according to claim 36, wherein the node B can transfer data to a node output of the node set P.

A multiple access memory and arithmetic unit,
A plurality of logic devices (114) including a memory device;
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330), each including a plurality of data input ports, a plurality of data output ports, and a logic element for controlling a flow of data passing through the nodes, and including nodes A, B, X and Y different from each other;
A plurality of interconnect paths for selectively connecting the nodes of the plurality of nodes to each other,
The interconnect paths include control signal interconnect paths for transferring a control signal from a control signal transmitting node to logic associated with a control signal using node, and data interconnect paths for transferring data from a data transmitting node to a data receiving node, wherein the data interconnect paths selectively connect the data output ports and the data input ports to each other, and the control signal interconnect paths selectively connect nodes and logic elements to each other so as to transfer a control signal from a control signal transmitting node to the logic element associated with a node whose data flow depends on that control signal,
wherein node B is associated with logic L_B that uses a control signal from node A to determine the routing of a message M passing through node B, such that a control signal C received from node A causes the message M to be transferred from node B to node X, and a control signal C′ received from node A causes the message M to be transferred from node B to node Y.

The multiple access memory and arithmetic device according to claim 38, wherein the routing of a message M' passing through node B is the same whether the control signal from node A is the control signal C or the control signal C'.

The multiple access memory and operation device according to claim 38, wherein the control signal sent to the node B is taken from a data output port of the node A.

A multiple access memory and arithmetic unit,
A plurality of logic devices (114) including a memory device;
An interconnect structure connected to the logic device for routing data and processing code to the logic device, wherein the interconnect structure further comprises:
A plurality of nodes (330) having a node set P including a node X and a plurality of nodes capable of transferring data to the node X;
A plurality of interconnect paths for selectively connecting the nodes of the plurality of nodes to each other,
wherein the interconnect paths include data interconnect paths that transfer data from a sending node to a receiving node, and the nodes in the node set P are given a priority order for transferring data to node X such that the node having the highest priority for transferring data to node X is never blocked from transferring data to node X.

The multiple access memory and operation device according to claim 41, wherein, with respect to transferring data to node X, data transferred to node X from a node B in the node set P having a lower priority than a node A cannot block node A from transferring data to node X.

The multiple access memory and arithmetic unit according to claim 41, wherein the priority order among the nodes in the node set P capable of transferring data to node X depends on the position of each node of the node set P in the interconnect structure.