JP4285877B2

JP4285877B2 - Metaaddressing architecture for dynamic reconfiguration computation and metaaddressing method for dynamic reconfiguration computation

Info

Publication number: JP4285877B2
Application number: JP2000034825A
Authority: JP
Inventors: バクスターマイケル
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1999-02-23
Filing date: 2000-02-14
Publication date: 2009-06-24
Anticipated expiration: 2020-02-14
Also published as: JP2000242613A

Description

【０００１】
【発明の属する技術分野】
本発明は、一般にコンピュータアーキテクチャに係り、特に、再構成計算のためのシステム及び方法、つまり、動的再構成計算のためのメタアドレス指定アーキテクチャ及び動的再構成計算のためのメタアドレス指定方法に関する。
【０００２】
本発明は、１９９５年４月１７日付で提出した米国特許Ｎｏ．５，７９４，０６２の分割出願である１９９８年２月２６日付で提出した「変更可能な内部ハードウェア編成を含む処理装置を用いた動的再構成計算のためのシステム及び方法」という名称の米国特許出願Ｎｏ．０９／０３１，３２３の米国一部継続出願に基づく優先権主張出願である。
【０００３】
【従来の技術】
コンピュータアーキテクチャの進展は、より優れた計算性能への要求によって推進されている。各種の計算問題を迅速かつ正確に解くには、一般に異なる種類の計算リソースが必要である。問題の種類が限られている場合には、検討中の種類の問題のために特に構築された計算リソースを用いることによって計算性能を向上させることができる。たとえば、デジタル信号処理（ＤＳＰ：Digital Signal Processing）ハードウェアを汎用コンピュータと併用すると、ある種の信号処理能力を大幅に向上させることができる。コンピュータ自体が検討中の種類の問題のために特別に構築されているときには、こうした特定の種類の問題について計算性能がさらに向上するか、または、利用可能な計算リソースと比べて、おそらくさらに最適化されたものとなる。現在の並列コンピュータ及び大規模並列コンピュータは、Ｏ（ｎ² ）またはそれ以上に複雑な特殊な種類の問題に対する処理能力が優れており、これが上記の場合の例である。
【０００４】
優れた計算性能は必要ではあるが、その一方でシステム費用を最小限に抑える必要性と均衡させなければならず、また現在及び将来考えられるできるだけ広範囲な用途においてシステム生産性を最大限に高める必要性とも均衡させなければならない。一般に、特殊なハードウェアは汎用ハードウェアより高価であるため、限られた数種類の問題専用の計算リソースをコンピュータシステムに組込むことは、システム費用を低く抑えることに悪影響を与える。専用コンピュータを設計し生産することは、エンジニアリング（工学設計）に要する時間とハードウェアの費用の点からきわめて高価なものとなる。計算性能を高めるために専用ハードウェアを用いた場合、計算性能の必要度が変化すると性能上の利点は少なくなる。先行技術では、計算性能の必要度が変化すると、新しい専用ハードウェアまたは新しい専用システムが設計され、製造され、結果として望ましくないほど高額の再活用できない設計・製造費用が繰返し支出される。したがって、特定種類の問題専用の計算リソースを用いると、計算の必要度が変化した場合、利用可能なシリコンリソースを非効率に利用することになる。したがって、上記のような理由で専用ハードウェアを用いて計算性能を向上させようとする試みは望ましくない。
【０００５】
従来、再プログラマブルハードウェアまたは再構成ハードウェアを用いて計算性能を向上させ、また問題の種類の適用可能性を最大限に高めるさまざまな試みが行われてきた。最初のこのような先行技術のアプローチは、ダウンロード可能マイクロコードコンピュータアーキテクチャによるものである。ダウンロード可能マイクロアーキテクチャでは、固定された非再構成ハードウェアリソースの機能を特定のバージョンのマイクロコードを用いることによって選択的に変化させることができる。このようなアーキテクチャの例に、ＩＢＭシステム／３６０がある。このような先行技術システムの基本的計算ハードウェア自体は再構成可能ではないので、広範囲の種類の問題について検討する場合、こうしたシステムでは最適化された計算性能は得られない。
【０００６】
計算性能を向上させ、問題の種類の適用可能性を最大限に高めるための先行技術の第２のアプローチは、非再構成ホストプロセッサまたはホストシステムに結合された再構成ハードウェアを用いることである。この先行技術のアプローチでは、非再構成ホストに結合された１個またはそれ以上の再構成プロセッサを利用することが最も一般的である。このアプローチは、ホストに付加されたプロセッサセット内のハードウェアの一部分が再構成できるような「付加再構成可能プロセッサ（ＡＲＰ：Attached Reconfigurable Processor）」アーキテクチャとして分類することができる。ホストシステムに結合された１組の再構成プロセッサを利用する現在の付加再構成可能プロセッサ（ＡＲＰ）システムの例には、Supercomputing Research Center（Bowie，メリーランド）が設計したＳＰＬＡＳＨ−１とＳＰＬＡＳＨ−２、Annapolis Micro Systems（Annapolis，メリーランド）製のWILDFIRE Custom Configurable Computer（ＳＰＬＡＳＨ−２の市販バージョン）、Virtual Computer Corporation（Reseda，カリフォルニア）製のＥＣＶ−１がある。計算を主体とした問題の多くでは、プログラムコードの比較的小さな部分の実行にかなりの時間が費やされる。一般に、付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャを用いて、プログラムコードのこのような部分のために再構成計算アクセラレータが提供される。
【０００７】
【発明が解決しようとする課題】
残念ながら、１個またはそれ以上の再構成計算アクセラレータを基礎においた計算モデルには、下記に詳細に説明するような重大な欠点がある。
【０００８】
＜第1の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャの第１の欠点は、付加再構成可能プロセッサ（ＡＲＰ）システムが特定のときに再構成ハードウェアの特定のアルゴリズムの最適実動化を実行しようと試みるために起こる。
【０００９】
たとえば、Virtual Computer CorporationのＥＣＶ−１の背後にある設計方針は、特定のアルゴリズムのために最適の計算性能を提供するよう、その特定のアルゴリズムを再構成ハードウェアソースの特定の構成に変換するというものである。再構成ハードウェアリソースは、特定のアルゴリズムのために最適の能力を提供する目的だけに用いられる。命令実行の管理などの一般的な目的のために再構成ハードウェアリソースを用いることは避けられる。したがって、所定のアルゴリズムについて、再構成ハードウェアリソースは最適の能力が得られるよう結合された個々のゲートの全体像から検討される。
【００１０】
一部の付加再構成可能プロセッサ（ＡＲＰ）システムは、「プログラム」が従来型のプログラム命令と、各種再構成ハードウェアリソースがどのように相互結合されるかを定める専用命令との両方を含むプログラミングモデルに依拠している。付加再構成可能プロセッサ（ＡＲＰ）システムは、ゲートレベルアルゴリズムに適した固有の方法で再構成ハードウェアリソースを検討するので、これらの専用命令は、用いられる各再構成ハードウェアリソースの特性に関する詳細な内容と、再構成ハードウェアリソースが他の再構成ハードウェアリソースに結合される方法を提供しなければならない。これにより、プログラムは複雑なものとなる。プログラミングの複雑さを軽減するために、プログラムに従来型の高レベルプログラミング言語命令と高レベル専用命令の両方を含めたプログラミングモデルを利用する試みが行われている。つまり、現在の付加再構成可能プロセッサ（ＡＲＰ）システムでは、高レベルプログラミング言語命令と上記の高レベル専用命令の両方をコンパイルできるコンパイルシステムを利用しようという試みがなされている。このようなコンパイルシステムの目的出力は、従来型の高レベルプログラミング言語命令についてはアセンブリ言語コードであり、専用命令についてはハードウェア記述言語（ＨＤＬ：Hardware Description Language）である。検討中の特定のアルゴリズムについて最適の計算性能を得るために１組の再構成ハードウェアリソースと相互結合スキームを自動決定することは、残念ながらＮＰハード問題である。一部の付加再構成可能プロセッサ（ＡＲＰ）システムの長期目標は、アルゴリズムを１組のゲートのための最適化相互結合スキームに直接コンパイルできるコンパイルシステムを開発することである。しかし、このようなコンパイルシステムの開発は、特に複数の種類のアルゴリズムについて検討する場合、きわめて困難な作業である。
【００１１】
＜第２の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャの第２の欠点は、付加再構成可能プロセッサ（ＡＲＰ）装置を構成するアルゴリズムに関連した計算作業を、付加再構成可能プロセッサ（ＡＲＰ）装置が多重再構成論理装置全体に分散するために起こる。
【００１２】
たとえば１組のフィールドプログラマブル論理回路（ＦＰＧＡ）を用いて実装され、また並列乗算アクセラレータを実動化するために構成された付加再構成可能プロセッサ（ＡＲＰ）装置については、並列乗算に関連した計算作業がフィールドプログラマブル論理回路（ＦＰＧＡ）全体に分散される。したがって、付加再構成可能プロセッサ（ＡＲＰ）装置を構成できるアルゴリズムの大きさは、存在する再構成論理装置の数によって制限される。同様に、付加再構成可能プロセッサ（ＡＲＰ）装置が扱うことができる最大データセットの大きさも制限される。一部のアルゴリズムにはデータ従属性があるので、ソースコードの試験を行っても、付加再構成可能プロセッサ（ＡＲＰ）装置の限界が必ずしも明示的に示されるとは限らない。一般に、データ従属性アルゴリズムは避けられる。
【００１３】
さらに、付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャが多重再構成論理装置全体に計算作業を分散することを開示しているので、新規のまたはやや修正したアルゴリズムを含めるには、再構成をひとまとめに行う必要がある。すなわち、多重再構成論理装置を再構成しなければならない。これにより、別の問題またはカスケード接続された副次的問題について再構成を行うことができる最大レートが限定される。
【００１４】
＜第３の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャの第３の欠点は、プログラムコードの１つまたはそれ以上の部分がホストで実行されるために生じる。
【００１５】
すなわち、付加再構成可能プロセッサ（ＡＲＰ）装置はそれ自体独立した計算システムではなく、プログラム全体を実行するものではない。このためホストとの相互作用が必要となる。一部のプログラムコードが非再構成ホストで実行されるので、利用可能なシリコンリソースがプログラムの実行の時間枠において最大限に利用されない。特にホストによる命令の実行中、付加再構成可能プロセッサ（ＡＲＰ）装置のシリコンリソースはアイドル状態であるか、非効率な利用状態にある。同様に、付加再構成可能プロセッサ（ＡＲＰ）装置がデータを処理するとき、ホストでのシリコンリソースの利用はおおむね非効率である。複数の全プログラムを容易に実行するためには、システム内のシリコンリソースを、容易に再利用できるリソースにグループ化しなければならない。上記のように、付加再構成可能プロセッサ（ＡＲＰ）システムは、再構成ハードウェアリソースを特定の時間における特定のアルゴリズムの実動化のために最適に相互結合された１組のゲートとして扱う。したがって、再利用できるためにはアルゴリズムがある程度の独立性をもつ必要があるので、付加再構成可能プロセッサ（ＡＲＰ）システムは再構成ハードウェアリソースの特定のセットをアルゴリズムごとに容易に再利用できるリソースとして扱うための手段は提供しない。
【００１６】
付加再構成可能プロセッサ（ＡＲＰ）装置は、現在実行しているホストプログラムをデータとして扱うことができず、一般にそれ自体を計算環境に適合させることができない。付加再構成可能プロセッサ（ＡＲＰ）装置は、それ自体のホストプログラムを実行することによって、それ自体をシミュレートするようには作られていない。さらに付加再構成可能プロセッサ（ＡＲＰ）装置は、構築される再構成ハードウェアリソースを直接用いて、それ自体に対しそれ自体のハードウェア記述言語（ＨＤＬ）またはアプリケーションプログラムをコンパイルするようには作られていない。したがって付加再構成可能プロセッサ（ＡＲＰ）装置は、ホストプロセッサからの独立性を開示する独立計算モデルに関してアーキテクチャ的に制限されている。
【００１７】
付加再構成可能プロセッサ（ＡＲＰ）装置は、計算アクセラレータとして機能するので、一般に独立した入力／出力（Ｉ／Ｏ）処理は行えない。通常は、付加再構成可能プロセッサ（ＡＲＰ）装置は入出力処理のためのホスト相互作用を必要とする。したがって、付加再構成可能プロセッサ（ＡＲＰ）装置の性能は入出力について限られているかもしれない。しかし、当業者は、付加再構成可能プロセッサ（ＡＲＰ）装置が特定の入出力問題を加速処理するために構成できることを認めるであろう。しかし、付加再構成可能プロセッサ（ＡＲＰ）装置全体は単一の特定の問題について構成されているので、付加再構成可能プロセッサ（ＡＲＰ）装置が入出力処理とデータ処理について互いに悪影響を与えずに均衡をとることはできない。さらに、付加再構成可能プロセッサ（ＡＲＰ）装置は割込み処理のための手段を提供しない。付加再構成可能プロセッサ（ＡＲＰ）装置は計算アクセラレーションを最大化するのに向けられているので、付加再構成可能プロセッサ（ＡＲＰ）に関する開示内容ではこのような割込みメカニズムは述べられておらず、割込みは計算加速に否定的な影響を与える。
【００１８】
＜第４の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャの第４の欠点は、付加再構成可能プロセッサ（ＡＲＰ）装置を用いて利用するのが困難な固有のデータ並列性を有するソフトウェアアプリケーションが存在するために生じる。
【００１９】
きわめて大規模なネットリストのネットネーム記号導出が必要とされるときには、ハードウェア記述言語（ＨＤＬ）コンパイルアプリケーションがこのような例の１つとして挙げられる。
【００２０】
＜第５の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャの第５の欠点は、これのアーキテクチャが基本的にＳＩＭＤ（Single Instruction Stream Multiple Data Stream）コンピュータアーキテクチャモデルである点である。
【００２１】
したがって、付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャは、１つまたはそれ以上の革新的先行技術の非再構成システムと比べ、アーキテクチャとして効率的ではない。付加再構成可能プロセッサ（ＡＲＰ）システムは、各特定の構成例について、利用できる再構成ハードウェアが提供できるのと同じ程度の計算能力について、プログラム実行のプロセスのほんの一部、主として算術計算のための演算ロジックしか反映していない。これに対して、１９７１年のFairchildでのＳＹＭＢＯＬマシンのシステム設計では、コンピュータ全体がプログラムの実行の各局面について一意的なハードウェアコンテクストを使用している。結果として、付加再構成可能プロセッサ（ＡＲＰ）システムが開示しているホスト部分を含めて、ＳＹＭＢＯＬはコンピュータのシステムアプリケーションについてのすべての構成部分を含むこととなった。
【００２２】
＜その他の欠点＞
付加再構成可能プロセッサ（ＡＲＰ）アーキテクチャにはその他の欠点もある。
【００２３】
たとえば、付加再構成可能プロセッサ（ＡＲＰ）装置は多重再構成論理装置に独立したタイミングを与えるための有効な手段を持っていない。同様に、カスケード接続された付加再構成可能プロセッサ（ＡＲＰ）装置には、独立してタイミング決定された装置を提供するための有効なクロック分散手段を持っていない。別の例としては、実行時間と、アクセラレーションを試みるソースコード文とを正確に相関させることが困難なことがある。ネットシステムクロックレートを正確に算出するためには、ハードウェア記述言語（ＨＤＬ）コンパイルのあと、コンピュータ支援設計（ＣＡＤ：Computer‐Aided Design）手段で付加再構成可能プロセッサ（ＡＲＰ）装置をモデル化しなければならないが、このような基本パラメータへの到達は時間のかかるプロセスである。
【００２４】
従来のアーキテクチャで同様に重要な問題は、それらが仮想または共有メモリを使用していることである。この開示内容は統合されたアドレススペースを使用するということであり、これによって、より複雑なアドレス指定演算が必要となり、そのためメモリアクセスが遅くなり、効率が低下する。たとえば仮想メモリを用いてシステム内のメモリ装置の個々のビットにアクセスするには、まずメモリの物理的アドレススペースを論理アドレスに区分し、次に仮想アドレスをその論理アドレスにマッピングしなければならない。このようにしてようやくメモリの各ビットにアクセスすることができる。さらに共有メモリシステムでは、プロセッサは通常はメモリにアクセスを許容する前にアドレス確認演算を実行するため、メモリ演算はさらに複雑になる。最後にプロセッサは、ある種の優先度決定システムを提供することにより、メモリの同じ領域に同時にアクセスしようとする複数のプロセスの間で調整（仲裁）を行わなければならない。
【００２５】
共有メモリ及び仮想メモリを使用することにより起こる多くの問題に対処するために、多くの従来型のシステムはメモリ管理システム（ＭＭＵ：Memory Management Units）を用いて、論理アドレスの仮想アドレスへの変換などのメモリ管理機能のほとんど実行している。しかし、メモリ管理システム（ＭＭＵ）／ソフトウェア相互作用により、メモリアクセス演算はさらに複雑になる。さらにメモリ管理システム（ＭＭＵ）は、実行できる演算の種類がきわめて限定されている。メモリ管理システム（ＭＭＵ）は、割込みを扱うことができず、メッセージを待ち行列に待機させることができず、また複雑なアドレス指定演算を実行することができないため、すべてプロセッサが実行しなければならない。多重並列プロセッサを有するコンピュータアーキテクチャで共有メモリまたは仮想メモリシステムを用いると、上記のような欠点がさらに拡大される。ハードウェア／ソフトウェア相互作用が上に述べたように管理されなければならないばかりでなく、共有メモリにアクセスしようとする多重プロセッサに応じてメモリ内のデータのコヒーレント性と一致性を、ソフトウェア及びハードウェアの両方によって維持しなければならない。プロセッサを追加すると、仮想アドレスの論理アドレスへの変換はさらに困難となる。メモリアクセス演算におけるこうした複雑性によってシステム能力が必然的に低下し、プロセッサを追加しシステムの規模が大きくなればなるほどこの能力低下は大幅なものとなる。
【００２６】
従来型のシステムの一例は、キャッシュコヒーレント、非一様メモリアクセス（ｃｃＮＵＭＡ：Non-Uniform Memory Access）コンピュータアーキテクチャである。非一様メモリアクセス（ｃｃＮＵＭＡ）マシンは、キャッシュ制御装置やクロスバー切替装置などの複雑で高価なハードウェアを使用し、このメモリは多重プロセッサにより実際には共有されているとしても、個々の独立したＣＰＵについて単一のアドレススペースの幻影（仮想スペース）を維持する。非一様メモリアクセス（ｃｃＮＵＭＡ）はやや拡張性があるが、この拡張性は、システム内でプロセッサを緊密に結合させるためにハードウェアを追加することによって達成される。この種のシステムは、科学的計算での有限要素グリッドの場合のように、共有メモリ入出力演算のためきわめて広い帯域幅を必要とする単一のプログラムイメージが共有されている計算環境で用いられると一層有利である。さらに非一様メモリアクセス（ｃｃＮＵＭＡ）は、プロセッサの特性が互いに類似していないようなシステムでは役に立たない。非一様メモリアクセス（ｃｃＮＵＭＡ）アーキテクチャでは、追加される各プロセッサが既存のプロセッサと同種のものでなければならない。したがって、プロセッサを異なる機能を実行させるために最適化し、互いに異なった作動を行うシステムでは、非一様メモリアクセス（ｃｃＮＵＭＡ）アーキテクチャは有効な解決策とはならない。最後に、従来型のシステムでは、標準的メモリアドレススキームのみがシステム内のメモリをアドレス指定するのに用いられる。
【００２７】
必要とされているのは、拡張性、平易性のあるアドレス指定を提供し、またシステムの処理能力にほとんど影響を与えないような、並列計算環境でのメモリをアドレス指定するための手段である。
【００２８】
【発明が解決しようとする課題】
第１の発明は、上述した課題を解決するために、動的再プログラマブル処理マシンのデータパケットのためのローカルメモリ宛先を指定するメタアドレス指定アーキテクチャであって、第１のメモリ装置と、第１のメモリ装置が結合され、所定の命令を受け取ると、受け取った命令が遠隔演算情報を含む遠隔演算を要求する命令かを判定し、命令が遠隔演算情報を含む遠隔演算を要求する命令であると判定した場合に、命令に含まれる遠隔演算情報を第１のメモリ装置に記憶し、遠隔演算情報が記憶されている第１のメモリ装置におけるメモリアドレスを含む無条件命令を生成する動的再プログラマブル処理マシンと、第１の動的再プログラマブル処理マシンから無条件命令を受け取り、無条件命令に含まれるメモリアドレスに基づいて第１のメモリ装置から遠隔演算情報を検索し、遠隔演算情報に基づき生成されるデータパケットの送信先のアドレスを示す目的地理アドレスと、送信先においてデータを書き込む第２のメモリ装置のアドレスを示す目的ローカルメモリアドレスとを含むメタアドレスを遠隔演算情報に基づいて生成し、メタアドレスを含むデータパケットを遠隔演算情報に基づいて生成するアドレス指定マシンと、第１のアドレス指定マシンおよびデータの送信先として目的地理アドレスに示される第２のアドレス指定マシンに結合され、第１のアドレス指定マシンで生成されたデータパケットを第１のアドレス指定マシンから受け取って、データパケットにおけるメタアドレスに含まれる目的地理アドレスに応じて、第１のアドレス指定マシンと第２のアドレス指定マシンとの間で相互にデータをルーティングする相互結合装置とを具備する。
【００２９】
第２の発明は、動的再プログラマブル処理マシンは、受け取った命令が遠隔演算を要求する命令でない場合は、受け取った命令に従って処理を実行する。
【００３０】
第３の発明は、第２のアドレス指定マシンは、第１のアドレス指定マシンが生成したデータパケットを受け取ると、データパケットに含まれるメタアドレスを目的地理アドレスと目的ローカルメモリアドレスとに復号し、当第２のアドレス指定マシンのアドレスを示す地理アドレスと復号した目的地理アドレスとを比較するアドレス復号器と、アドレス復号器による比較の結果、地理アドレスと目的地理アドレスとが一致する場合に、データパケットを第２の動的再プログラマブル処理マシンに伝送する制御装置とを具備する。
【００３１】
第４の発明は、第２の動的再プログラマブル処理マシンに結合され、結合される第２の動的再プログラマブル処理マシンについてのアドレスを示す地理アドレスを記憶するアーキテクチャ記述メモリ装置をさらに具備する。
【００３２】
第５の発明は、第２のアドレス指定マシンが、入出力装置に結合された割込みハンドラをさらに含み、割込みハンドラが、割込み要求を識別するための識別装置と、割込み要求の有効性を検証するために、識別した割込み要求を割込み要求の記憶されたリストと比較するためのコンパレータと、記憶された割込み処理命令に従って有効性が確認された割込み要求を処理するための割込みロジックと、を具備する。
【００３３】
第６の発明は、メタアドレスが８０ビット幅であり、目的地理アドレスが１６ビット幅であり、ローカルアドレスが６４ビット幅である。
【００３４】
第７の発明は、動的再プログラマブル処理マシンのデータパケットのためのローカルメモリ宛先を指定するメタアドレス指定方法であって、所定の命令を受け取ると、受け取った命令が遠隔演算情報を含む遠隔演算を要求する命令かを判定し、命令が遠隔演算情報を含む遠隔演算を要求する命令であると判定した場合に、命令に含まれる遠隔演算情報を第１のメモリ装置に記憶し、遠隔演算情報が記憶されている第１のメモリ装置におけるメモリアドレスを含む無条件命令を動的再プログラマブル処理マシンにより生成する無条件命令生成ステップと、無条件命令生成ステップで生成された無条件命令を動的再プログラマブル処理マシンから受け取り、無条件命令に含まれるメモリアドレスに基づいて第１のメモリ装置から遠隔演算情報を検索し、遠隔演算情報に基づき生成されるデータパケットの送信先のアドレスを示す目的地理アドレスと、送信先においてデータを書き込む第２のメモリ装置のアドレスを示す目的ローカルメモリアドレスとを含むメタアドレスを遠隔演算情報に基づいて生成し、メタアドレスを含むデータパケットを遠隔演算情報に基づいてアドレス指定マシンにより生成するデータパケット生成ステップと、データパケット生成ステップで生成されたデータパケットを第１のアドレス指定マシンから受け取って、データパケットにおけるメタアドレスに含まれる目的地理アドレスに応じて、第１のアドレス指定マシンと、データの送信先として目的地理アドレスに示される第２のアドレス指定マシンとの間で相互にデータをルーティングするステップとを具備する。
【００３５】
第８の発明は、無条件命令生成ステップで、動的再プログラマブル処理マシンは、受け取った命令が遠隔演算を要求する命令でない場合は、受け取った命令に従って処理を実行する。
【００３６】
第９の発明は、第２のアドレス指定マシンが、第１のアドレス指定マシンが生成したデータパケットを受け取ると、データパケットに含まれるメタアドレスを目的地理アドレスと目的ローカルメモリアドレスとに復号し、当第２のアドレス指定マシンのアドレスを示す地理アドレスと復号した目的地理アドレスとを比較するアドレス復号ステップと、アドレス比較ステップによる比較の結果、地理アドレスと目的地理アドレスとが一致する場合に、データパケットを第２の動的再プログラマブル処理マシンに伝送する伝送ステップとをさらに具備する。
【００３７】
第１０の発明は、第２の動的再プログラマブル処理マシンについてのアドレスを示す地理アドレスが、第２の動的再プログラマブル処理マシンに結合されるアーキテクチャ記述メモリ装置に記憶される。
【００３８】
第１１の発明は、第２のアドレス指定マシンが含む、入出力装置に結合された割込みハンドラが、割込み要求を識別する識別ステップと、識別ステップで識別された割込み要求の有効性を検証するために、識別した割込み要求を割込み要求の記憶されたリストと比較する比較ステップと、記憶された割込み処理命令に従って前期比各ステップの比較結果に基づき有効性が確認された割込み要求を処理するためのステップとを具備する。
【００３９】
第１２の発明は、メタアドレスが８０ビット幅であり、目的地理アドレスが１６ビット幅であり、ローカルアドレスが６４ビット幅である。
【００４３】
【発明の実施の形態】
＜概要＞
本発明は、１組のＳマシンと、各Ｓマシンに対応するＴマシンと、汎用相互結合マトリックス（ＧＰＩＭ：General-Purpose Interconnect Matrix）と、１組の入出力Ｔマシンと、１組の入出力装置と、マスタタイムベース装置とが、拡張性、並列、動的再構成計算のためのシステムを形成する。各Ｓマシンは、メモリと、第１ローカルタイムベース装置と、動的再構成処理装置（ＤＲＰＵ：Dynamically Reconfigurable Process unit）とを含む動的再構成コンピュータである。動的再構成処理装置（ＤＲＰＵ）は、命令取出し装置（ＩＦＵ：Instruction Fetch Unit）として構成された再プログラマブル論理装置と、データ演算装置（ＤＯＵ：Data Operate unit）と、アドレス演算装置（ＡＯＵ：Address Operate Unit）とを用いて実装され、これらはそれぞれ再構成割込みまたは１組のプログラム命令に埋込まれた再構成命令の選択に応じてプログラム実行中に選択的に再構成される。各再構成割込みと各再構成命令は、特定の命令セットアーキテクチャ（ＩＳＡ：Instruction Set Architecture）の実装のために最適化された動的再構成処理装置（ＤＲＰＵ）ハードウェア編成を指定する構成データセットを引照する。命令取出し装置（ＩＦＵ）は、再構成演算と、命令取り出し・復号演算と、メモリアクセス演算とを指示し、命令の実行を容易にするために制御信号をデータ演算装置（ＤＯＵ）とアドレス演算装置（ＡＯＵ）とに発する。データ演算装置（ＤＯＵ）はデータ演算を実行し、アドレス演算装置（ＡＯＵ）はアドレス演算を実行する。各Ｔマシンは、共通インタフェース制御装置（ＣＩＣＵ：Common Interface and Control Unit）と、１個またはそれ以上の相互結合入出力装置と、第２ローカルタイムベース装置とを含むデータ転送装置である。汎用相互結合マトリックス（ＧＰＩＭ）は、Ｔマシン相互間の並列通信を容易に行えるようにする拡張性相互結合ネットワークである。この１組のＴマシンと汎用相互結合マトリックス（ＧＰＩＭ）によって、Ｓマシン間の並列通信が容易に行われる。またＴマシンは、ネットワークのＳマシン相互間のデータの転送を制御し、要求されるアドレス指定演算を提供する。メタアドレスは、各Ｓマシンに拡張性ビットアドレス指定能力を提供するのに用いられる。
【００４４】
＜具体的態様＞
図１は、本発明に基づいて構築された、拡張性、並列、動的再構成計算のためのシステム１０の好ましい実施例の構成図である。システム１０は、少なくとも１個のＳマシン１２と、各Ｓマシン１２に対応するＴマシン１４と、汎用相互結合マトリックス（ＧＰＩＭ）１６と、少なくとも１個の入出力Ｔマシン１８と、１個またはそれ以上の入出力装置２０と、マスタタイムベース装置２２とを含んでいることが好ましい。好ましい実施例では、システム１０は、多重Ｓマシン１２と、したがって多重Ｔマシン１４と、多重入出力Ｔマシン１８と、多重入出力装置２０とを含んでいる。
【００４５】
Ｓマシン１２と、Ｔマシン１４と、入出力Ｔマシン１８とは、それぞれマスタタイムベース装置２２のタイミング出力部に結合されたマスタタイミング入力部を含んでいる。各Ｓマシン１２は、それに対応するＴマシン１４に結合された入力部と出力部とを含んでいる。各Ｔマシン１４は、それに対応するＳマシン１２に結合された入力部と出力部の他に、汎用相互結合マトリックス（ＧＰＩＭ）１６に結合されたルーティング入力部とルーティング出力部とを含んでいる。同様に、各入出力Ｔマシン１８は、入出力装置２０に結合された入力部と出力部とを含み、また汎用相互結合マトリックス（ＧＰＩＭ）１６に結合されたルーティング入力部とルーティング出力部とを含んでいる。
【００４６】
下記に詳細に説明するように、各Ｓマシン１２は動的再構成コンピュータである。汎用相互結合マトリックス（ＧＰＩＭ）１６は、Ｔマシン１４間の通信を容易に行えるようにする２点間並列相互結合手段を形成している。Ｔマシン１４と汎用相互結合マトリックス（ＧＰＩＭ）１６は、Ｓマシン１２間のデータ転送のための２点間並列相互結合手段を形成している。同様に、汎用相互結合マトリックス（ＧＰＩＭ）１６と、１組のＴマシン１４と、１組の入出力Ｔマシン１８とは、Ｓマシン１２と各入出力装置２０との間の入出力転送のための２点間並列相互結合手段を形成している。マスタタイムベース装置２２は、各Ｓマシン１２と各Ｔマシン１４にマスタタイミング信号を送る発振器を含んでいる。
【００４７】
模範実施例では、各Ｓマシン１２は、６４メガバイトのランダムアクセスメモリ（ＲＡＭ）に結合されたＸｉｌｉｎｘＸＣ４０１３（Xilinx, Inc., サンノゼ，カリフォルニア）フィールドプログラマブルゲートアレイ（ＦＰＧＡ：Field Programmable Gate Array）を用いて実装されている。各Ｔマシン１４は、各入出力Ｔマシン１８と同様に、ＸｉｌｉｎｘＸＣ４０１３フィールドプログラマブルゲートアレイ（ＦＰＧＡ）の再構成ハードウェアリソースの約５０％を用いて実装されている。汎用相互結合マトリックス（ＧＰＩＭ）１６は、環状体の相互結合メッシュとして実装されている。マスタタイムベース装置２２は、システム全体の周波数基準を提示するクロック分散回路に結合されたクロック発振器であり、米国に特許出願された「位相同期、フレキシブル周波数クロッキングとメッセージングのためのシステムと方法（System and Method for Phase-Synchronous, Flexible Frequency Clocking and Messaging）」に記載されている。Ｔマシン１４と、Ｓマシン１２と、入出力Ｔマシン１８とは、拡張性コヒーレントインタフェース（ＳＣＩ）を定めたＡＮＳＩ／ＩＥＥＥ規格１５９６−１９９２に従って情報を転送するのが好ましい。
【００４８】
好ましい実施例では、システム１０は並列で機能する多重Ｓマシン１２を含んでいる。個々のＳマシン１２の構造と機能については、図２から図１７を用いて下記に詳しく説明する。図２は、Ｓマシン１２の好ましい実施例の構成図である。Ｓマシン１２は、第１ローカルタイムベース装置３０と、プログラム命令を実行するための動的再構成処理装置（ＤＲＰＵ）３２と、メモリ３４とを含んでいる。第１ローカルタイムベース装置３０は、Ｓマシンのマスタタイミング入力部を形成するタイミング入力部を含んでいる。また第１ローカルタイムベース装置３０は、第１ローカルタイミング信号すなわちクロックを、第１タイミング信号ライン４０を経て動的再構成処理装置（ＤＲＰＵ）３２のタイミング入力部に、またメモリ３４のタイミング入力部に送るタイミング出力部を含んでいる。動的再構成処理装置（ＤＲＰＵ）３２は、メモリ制御ライン４２を経てメモリ３４の制御信号入力部に結合された制御信号出力部と、アドレスライン４４を経てメモリ３４のアドレス入力部に結合されたアドレス出力部と、メモリ入出力ライン４６を経てメモリ３４の双方向データポートに結合された双方向データポートとを含んでいる。さらに動的再構成処理装置（ＤＲＰＵ）３２は、外部制御ライン４８を経てその対応するＴマシン１４の双方向データポートに結合された双方向データポートを含んでいる。図２に示すように、メモリ制御ライン４２はＸビットであり、アドレスライン４４はＭビットであり、メモリ入出力ライン４６は（Ｎ×ｋ）ビットであり、外部制御ライン４８はＹビットである。
【００４９】
好ましい実施例では、第１ローカルタイムベース装置３０は、マスタタイムベース装置２２からマスタタイミング信号を受取る。第１ローカルタイムベース装置３０は、マスタタイミング信号から第１ローカルタイミング信号を生成し、第１ローカルタイミング信号を動的再構成処理装置（ＤＲＰＵ）３２とメモリ３４に送る。好ましい実施例では、第１ローカルタイミング信号は個々のＳマシン１２ごとに異なる。したがって、所定のＳマシン１２内の動的再構成処理装置（ＤＲＰＵ）３２とメモリ３４は、他のＳマシン１２内の動的再構成処理装置（ＤＲＰＵ）３２とメモリ３４とは独立したクロックレートで機能する。第１ローカルタイミング信号は、マスタタイミング信号と位相同期であることが好ましい。好ましい実施例では、第１ローカルタイムベース装置３０は、再構成ハードウェアリソースを用いて実装された位相ロック検出回路を含む位相ロック周波数変換回路を用いて実動化される。当業者は、別の実施例で、第１ローカルタイムベース装置３０がクロック分散ツリーの一部として実動化できることを認めるであろう。
【００５０】
メモリ３４は、ＲＡＭとして実動化され、またプログラム命令と、プログラムデータと、動的再構成処理装置（ＤＲＰＵ）３２のための構成データとを記憶することが好ましい。任意のＳマシン１２のメモリ３４は、汎用相互結合マトリックス（ＧＰＩＭ）１６を経てシステム１０内の他のＳマシン１２にアクセスできることが好ましい。さらに各Ｓマシン１２には、均一のメモリアドレススペースがあることが好ましい。好ましい実施例では、メモリ３４に記憶されたプログラム命令は、動的再構成処理装置（ＤＲＰＵ）３２へ向けられた再構成指示を選択的に含んでいる。図３は、再構成指示を含む模範プログラムリスト５０である。図３に示すように、模範プログラムリスト５０は１組の外部ループ部分５２と、第１内部ループ部分５４と、第２内部ループ部分５５と、第３内部ループ部分５６と、第４内部ループ部分５７と、第５内部ループ部分５８とを含んでいる。当業者は、「内部ループ」という用語が特定のセットの関連演算を実行するプログラムの反復部分を指し、また「外部ループ」という用語が、主として汎用演算を実行し、及び／または一つの内部ループ部分からもう一つの内部ループ部分へ制御を転送するプログラムの部分を指すことを容易に認めるであろう。一般に、プログラムの内部ループ部分５４、５５、５６、５７、５８は、潜在的に大きなデータセットについて特定の演算を実行する。たとえば画像処理アプリケーションでは、第１内部ループ部分５４は画像データについてカラーフォーマット変換演算を実行し、第２〜第５内部ループ部分５４、５５、５６、５７、５８は、線形フィルタリング演算、畳込み演算、パターン探索演算、及び圧縮演算を実行することになる。当業者は、内部ループ部分５５、５６、５７、５８の連続シーケンスがソフトウェアパイプラインとして考えられることを認めるであろう。各外部ループ部分５２は、データの入出力について責任を有し、及び／または第１内部ループ部分５４から第２内部ループ部分５５へのデータ及び制御の転送を指示する。当業者は、さらに、所定の内部ループ部分５４、５５、５６、５７、５８が一つまたはそれ以上の再構成指示を含むことを認めるであろう。一般に任意のプログラムについて、プログラムリスト５０の外部ループ部分５２は各種の汎用命令を含むが、プログラムリスト５０の内部ループ５４、５６は特定の命令セットを実行するのに用いられる比較的種類の少ない命令で構成される。
【００５１】
模範プログラムリスト５０では、第１再構成指示は第１内部ループ部分５４の開始部分に現れ、第２再構成指示は第１内部ループ部分５４の終了部分に現れる。同様に、第３再構成指示は第２内部ループ部分５５の開始部分に、また第４の再構成指示は第３内部ループ部分５６の開始部分に、第５再構成指示は第４内部ループ部分５７の開始部分に、第６及び第７再構成指示はそれぞれ第５内部ループ部分５８の開始部分と終了部分に現れる。各再構成指示は、特定の命令セットアーキテクチャ（ＩＳＡ）を実動化するためのものであり、またそれに最適化された内部動的再構成処理装置（ＤＲＰＵ）ハードウェア編成を指定する構成データセットを指示することが好ましい。命令セットアーキテクチャ（ＩＳＡ）は、コンピュータをプログラムするのに用いることができる基本的なまたは中核となる命令セットである。命令セットアーキテクチャ（ＩＳＡ）は、命令フォーマットと、操作コードと、データフォーマットと、アドレス指定モードと、実行制御フラグと、プログラムアクセス可能レジスタとを定義する。当業者は、これが命令セットアーキテクチャ（ＩＳＡ）の従来の定義に対応することを認めるであろう。本発明では、各Ｓマシンの動的再構成処理装置（ＤＲＰＵ）３２は、各所望の命令セットアーキテクチャ（ＩＳＡ）について独自の構成データセットを用いて多重命令セットアーキテクチャ（ＩＳＡ）を直接実装するよう、迅速なランタイム構成とすることができる。すなわち各命令セットアーキテクチャ（ＩＳＡ）は、対応する構成データセットによって定められる独自の内部動的再構成処理装置（ＤＲＰＵ）ハードウェア編成で実装される。したがって本発明では、第１〜第５内部ループ部分５４、５５、５６、５７、５８はそれぞれ一意の命令セットアーキテクチャ（ＩＳＡ）、すなわち命令セットアーキテクチャ（ＩＳＡ）１、命令セットアーキテクチャ（ＩＳＡ）２、命令セットアーキテクチャ（ＩＳＡ）３、命令セットアーキテクチャ（ＩＳＡ）４及び命令セットアーキテクチャ（ＩＳＡ）ｋに対応する。当業者は、連続命令セットアーキテクチャ（ＩＳＡ）がそれぞれ一意である必要はないことを認めるであろう。したがって、命令セットアーキテクチャ（ＩＳＡ）ｋは命令セットアーキテクチャ（ＩＳＡ）１、命令セットアーキテクチャ（ＩＳＡ）２、命令セットアーキテクチャ（ＩＳＡ）３、命令セットアーキテクチャ（ＩＳＡ）４であってもよく、また異なる命令セットアーキテクチャ（ＩＳＡ）であっても良い。１組の外部ループ部分５２も、一意の命令セットアーキテクチャ（ＩＳＡ）、すなわち命令セットアーキテクチャ（ＩＳＡ）０に対応する。好ましい実施例では、プログラムの実行中、連続した再構成指示の選択はデータ従属的に行われる（データに応じて異なる）。特定の再構成指示を選択すると、プログラム命令はその後、対応する構成データセットによって指定された独自の動的再構成処理装置（ＤＲＰＵ）ハードウェア構成により、対応する命令セットアーキテクチャ（ＩＳＡ）に従って実行される。
【００５２】
本発明では、特定の命令セットアーキテクチャ（ＩＳＡ）は、命令セットアーキテクチャ（ＩＳＡ）が含む命令の数と種類に従って、内部ループ命令セットアーキテクチャ（ＩＳＡ）または外部ループ命令セットアーキテクチャ（ＩＳＡ）として分類することができる。いくつかの命令を含み、汎用演算の実行に役立つ命令セットアーキテクチャ（ＩＳＡ）は外部ループ命令セットアーキテクチャ（ＩＳＡ）であり、一方、比較的少ない命令を含み、特定の種類の命令の実行に向けられている命令セットアーキテクチャ（ＩＳＡ）は内部ループ命令セットアーキテクチャ（ＩＳＡ）である。外部ループ命令セットアーキテクチャ（ＩＳＡ）は汎用演算の実行に向けられているので、プログラム命令の逐次実行が望ましい場合に最も役に立つ。外部ループ命令セットアーキテクチャ（ＩＳＡ）の実行性能は、実行される命令ごとのクロックサイクルで特徴付けられることが好ましい。これに対して、内部ループ命令セットアーキテクチャ（ＩＳＡ）は特定の種類の命令の実行に向けられているので、プログラム命令の並列実行が望ましい場合に最も役に立つ。内部ループ命令セットアーキテクチャ（ＩＳＡ）の実行性能は、クロックサイクル当たりで実行される命令で、またはクロックサイクル当たり得られる計算結果で特徴付けられることが好ましい。
【００５３】
当業者は、プログラム命令の逐次実行及び並列実行に関するこれまでの説明が単一の動的再構成処理装置（ＤＲＰＵ）３２内でのプログラム命令の実行に関連していることを認めるであろう。システム１０に多重Ｓマシン１２が存在することによって、特定の動的再構成処理装置（ＤＲＰＵ）３２によって各プログラム命令シーケンスが実行される場合、多重プログラム命令シーケンスを任意の時間に並列実行することが容易になる。各動的再構成処理装置（ＤＲＰＵ）３２は、特定の時間にそれぞれ特定の内部ループ命令セットアーキテクチャ（ＩＳＡ）または外部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するための並列ハードウェアまたは直列ハードウェアを含むように構成されている。任意の動的再構成処理装置（ＤＲＰＵ）３２の内部ハードウェア構成は、実行される一連のプログラム命令内に埋込まれた１つまたはそれ以上の再構成指示の選択に従って経時的に変化する。
【００５４】
好ましい実施例では、各命令セットアーキテクチャ（ＩＳＡ）とその対応する内部動的再構成処理装置（ＤＲＰＵ）ハードウェア編成は、１組の利用可能な再構成ハードウェアリソースに対して特定のクラスの計算上の問題について最適の計算性能を備えるよう設計されている。上に述べたように、また下記に詳しく説明するように、外部ループ命令セットアーキテクチャ（ＩＳＡ）に対応する内部動的再構成処理装置（ＤＲＰＵ）ハードウェア編成は、プログラム命令の逐次実行について最適化されるのが好ましい。また内部ループ命令セットアーキテクチャ（ＩＳＡ）に対応する内部動的再構成処理装置（ＤＲＰＵ）ハードウェア編成は、プログラム命令の並列実行について最適化されるのが好ましい。模範汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）を参考資料Ａに示し、畳込み演算専用の模範内部ループ命令セットアーキテクチャ（ＩＳＡ）を参考資料Ｂに示す。
【００５５】
各再構成指示を除いて、図３の模範プログラムリスト５０は、従来の高レベル言語文、たとえばＣプログラミング言語に従って書かれた文で構成されていることが好ましい。当業者は、一連のプログラム命令に１つまたはそれ以上の再構成指示を含むには、再構成指示に対応するために修正されたコンパイラが必要であることを認めるであろう。図４は、一連のプログラム命令のコンパイル中に実行される先行技術コンパイル演算のフローチャートである。ここで、先行技術のコンパイル演算は、Free Software Foundation（Cambridge，マサチューセッツ）によって作成されたＧＮＵＣコンパイラ（ＧＣＣ：ＧＮＵＣ Compiler）によって実行されるものにほぼ相当している。当業者は、下記に説明する先行技術コンパイル演算が他のコンパイラについて容易に一般化できることを認めるであろう。先行技術コンパイル演算はステップ５００で始まり、コンパイラフロントエンドが一連のプログラム命令から次の高レベル文を選択する。次にステップ５０２で、コンパイラフロントエンドは選択した高レベル文に対応する中間レベルのコードを生成する。これは、ＧＮＵＣコンパイラ（ＧＣＣ）の場合には、レジスタ転送レベル（ＲＴＬ：Register Transfer Level）文に相当する。ステップ５０２のあとステップ５０４で、コンパイラフロントエンドはさらに別の高レベル文を検討する必要があるかどうかを決定する。検討する必要があれば、この好適な方法はステップ５００に戻る。
【００５６】
ステップ５０４でコンパイラフロントエンドが他のどの高レベル文も検討する必要がないと決定したときは、次にステップ５０６でコンパイラバックエンドが従来のレジスタ割当て演算を実行する。ステップ５０６のあとステップ５０８で、コンパイラバックエンドは現在のレジスタ転送レベル（ＲＴＬ）文グループ内で検討するために次のレジスタ転送レベル（ＲＴＬ）文を選択する。次にステップ５１０でコンパイラバックエンドは現在のレジスタ転送レベル（ＲＴＬ）文グループが１組のアセンブリ言語文に翻訳することのできる方法を定めるルールが存在するかどうかを決定する。このようなルールが存在しないときには、この好適な方法はステップ５０８に戻り、現在のレジスタ転送レベル（ＲＴＬ）文グループに含めるためにさらに別のレジスタ転送レベル（ＲＴＬ）文を選択する。現在のレジスタ転送レベル（ＲＴＬ）文グループに対応するルールが存在するときには、ステップ５１２でコンパイラバックエンドはそのルールに従って１組のアセンブリ言語文を生成する。ステップ５１２のあと、コンパイラバックエンドは次のレジスタ転送レベル（ＲＴＬ）文グループのコンテクストにおいて次のレジスタ転送レベル（ＲＴＬ）文を検討する必要があるかどうかを決定する。検討する必要があるときには、この好適な方法はステップ５０８に戻る。必要がなければ、この好適な方法は終了する。
【００５７】
本発明は、動的再構成計算のためのコンパイラを含んでいることが好ましい。図５と図６は、動的再構成計算のためのコンパイラによって実行される好ましいコンパイル演算のフローチャートである。好ましいコンパイル演算はステップ６００から始まり、動的再構成計算のためのコンパイラのフロントエンドが一連のプログラム命令内の次の高レベル文を選択する。次にステップ６０２で動的再構成計算のためのコンパイラのフロントエンドは、選択された高レベル文が再構成指示であるかどうかを決定する。再構成指示であるときには、ステップ６０４で動的再構成計算のためのコンパイラのフロントエンドはレジスタ転送レベル（ＲＴＬ）再構成文を生成し、ステップ６００に戻る。好ましい実施例では、レジスタ転送レベル（ＲＴＬ）再構成文は命令セットアーキテクチャ（ＩＳＡ）識別を含む非標準レジスタ転送レベル（ＲＴＬ）文である。ステップ６０２で、選択した高レベルプログラム文が再構成指示ではないときには、次にステップ６０６で動的再構成計算のためのコンパイラのフロントエンドは従来の方法で１組のレジスタ転送レベル（ＲＴＬ）文を生成する。ステップ６０６のあと、ステップ６０８で動的再構成計算のためのコンパイラのフロントエンドはさらに別の高レベル文を検討する必要があるかどうかを決定する。検討する必要があるときには、この好適な方法はステップ６００に戻る。そうでないときにはこの好適な方法はステップ６１０に進み、バックエンド演算を開始する。
【００５８】
ステップ６１０で、動的再構成計算のためのコンパイラのバックエンドはレジスタ割当て演算を実行する。本発明の好ましい実施例では、各命令セットアーキテクチャ（ＩＳＡ）は命令セットアーキテクチャ（ＩＳＡ）ごとのレジスタアーキテクチャが互いに一致するように定められている。したがって、レジスタ割当て演算は従来の方法で実行される。当業者は、一般に、命令セットアーキテクチャ（ＩＳＡ）ごとのレジスタアーキテクチャが互いに一致することが絶対的要件ではないことを認めるであろう。次にステップ６１２で動的再構成計算のためのコンパイラのバックエンドは、現在検討中のレジスタ転送レベル（ＲＴＬ）文グループ内で次のレジスタ転送レベル（ＲＴＬ）文を選択する。次にステップ６１４で動的再構成計算のためのコンパイラのバックエンドは、選択したレジスタ転送レベル（ＲＴＬ）文がレジスタ転送レベル（ＲＴＬ）再構成文であるかどうかを決定する。選択したレジスタ転送レベル（ＲＴＬ）文がレジスタ転送レベル（ＲＴＬ）再構成文でないときには、ステップ６１８で動的再構成計算のためのコンパイラのバックエンドは、現在検討中のレジスタ転送レベル（ＲＴＬ）文グループについてのルールが存在するかどうかを決定する。存在しなければ、この好適な方法はステップ６１２に戻り、現在検討中のレジスタ転送レベル（ＲＴＬ）文グループに含めるために次のレジスタ転送レベル（ＲＴＬ）文グループを選択する。ステップ６１８で現在検討中のレジスタ転送レベル（ＲＴＬ）文グループについてのルールが存在するときには、次にステップ６２０で動的再構成計算のためのコンパイラのバックエンドはこのルールに従って現在検討中のレジスタ転送レベル（ＲＴＬ）文グループに対応する１組のアセンブリ言語文を生成する。ステップ６２０のあと、ステップ６２２で動的再構成計算のためのコンパイラのバックエンドは、次のレジスタ転送レベル（ＲＴＬ）文グループのコンテクストにおいて、さらに別のレジスタ転送レベル（ＲＴＬ）文を検討する必要があるかどうかを決定する。検討する必要があればこの好適な方法はステップ６１２に戻り、そうでなければこの好適な方法は終了する。
【００５９】
ステップ６１４で、選択したレジスタ転送レベル（ＲＴＬ）文がレジスタ転送レベル（ＲＴＬ）再構成文であるときには、ステップ６１６で動的再構成計算のためのコンパイラのバックエンドはレジスタ転送レベル（ＲＴＬ）再構成文内の命令セットアーキテクチャ（ＩＳＡ）識別に対応する１組のルールセットを選択する。本発明では、各命令セットアーキテクチャ（ＩＳＡ）について独自のルールが存在することが好ましい。従って各ルールセットは、特定の命令セットアーキテクチャ（ＩＳＡ）に従ってレジスタ転送レベル（ＲＴＬ）文グループをアセンブリ言語文に変換するための１つまたはそれ以上のルールを提供する。ステップ６１６のあと、好適な方法はステップ６１８に進む。任意の命令セットアーキテクチャ（ＩＳＡ）に対応するルールセットは、レジスタ転送レベル（ＲＴＬ）再構成文を、ソフトウェア割込みを生じるような１組のアセンブリ言語命令に翻訳するためのルールを含んでいることが好ましい。このソフトウェア割込みの結果、再構成ハンドラーが実行されるが、これについては下記に詳しく説明する。
【００６０】
上記に説明した方法では、動的再構成計算のためのコンパイラは選択的にまた自動的にコンパイル演算中に多重命令セットアーキテクチャ（ＩＳＡ）に従ってアセンブリ言語文を生成する。言い換えれば、コンパイル中、動的再構成計算のためのコンパイラはそれぞれ異なる命令セットアーキテクチャ（ＩＳＡ）に従って１組のプログラム命令をコンパイルする。動的再構成計算のためのコンパイラは、図５と図６を用いて上に説明したような好ましいコンパイル演算を実行するよう修正した従来型コンパイラであることが好ましい。当業者は、必要とされる修正は複雑ではないが、このような修正は先行技術コンパイル技術及び先行技術再構成計算技術から見て自明ではないことを認めるであろう。
【００６１】
図７は、動的再構成処理装置（ＤＲＰＵ）３２の好ましい実施例の構成図である。動的再構成処理装置（ＤＲＰＵ）３２は、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４とを含んでいる。命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４のそれぞれは、第１タイミング信号ライン４０に結合されたタイミング入力部を含んでいる。命令取出し装置（ＩＦＵ）６０は、メモリ制御ライン４２に結合されたメモリ制御出力部と、メモリ入出力ライン４６に結合されたデータ入力部と、外部制御ライン４８に結合された双方向制御ポートとを含んでいる。命令取出し装置（ＩＦＵ）６０はさらに、第１制御ライン７０を経てデータ演算装置（ＤＯＵ）６２の第１制御入力部に結合された第１制御出力部と、第２制御ライン７２を経てアドレス演算装置（ＡＯＵ）６４の第１制御入力部に結合された第２制御出力部とを含んでいる。命令取出し装置（ＩＦＵ）６０は、第３制御ライン７４を経てデータ演算装置（ＤＯＵ）６２の第２制御入力部とアドレス演算装置（ＡＯＵ）６４の第２制御入力部に結合された第１制御出力部を含んでいる。データ演算装置（ＤＯＵ）６２とアドレス演算装置（ＡＯＵ）６４とは、それぞれメモリ入出力ライン４６に結合された双方向データポートを含んでいる。最後にアドレス演算装置（ＡＯＵ）６４は、動的再構成処理装置（ＤＲＰＵ）のアドレス出力部を形成するアドレス出力部を含んでいる。
【００６２】
動的再構成処理装置（ＤＲＰＵ）３２は、再構成論理装置または再プログラマブル論理装置、たとえばＸｉｌｉｎｘＸＣ４０１３（Xilinx, Inc., サンノゼ，カリフォルニア）またはＡＴ＆ＴＯＲＣＡＩＣ０７（ＡＴ＆Ｔ Microelectronics, Allentown, ペンシルバニア）などのフィールドプログラマブルゲートアレイ（ＦＰＧＡ）を用いて実装されるのが好ましい。再プログラマブル論理装置は、複数の、
１）選択的再プログラマブル論理ブロック、または構成可能論理ブロック（ＣＬＢ：Selectively Reprogramable Logic Blocks or Configurable Logick Blocks）と、
２）選択的再プログラマブル入出力ブロック（ＩＯＢ：Ｉ／Ｏ Blocks）と、
３）選択的再プログラマブル相互結合構造と、
４）データ記憶リソースと、
５）３値バッファリソースと、
６）ワイヤード論理関数能力と、
を備えていることが好ましい。各論理ブロック（ＣＬＢ）は、論理関数を生成し、データを記憶し、信号のルーティングを行うための選択的再構成回路を含んでいることが好ましい。当業者は、使用中の再プログラマブル論理装置の正確な設計に応じて、再構成データ記憶回路が論理ブロック（ＣＬＢ）とは別の１個またはそれ以上のデータ記憶ブロック（ＤＳＢ：Data Storage Block）に含まれることもあることを認めるであろう。ここでは、フィールドプログラマブルゲートアレイ（ＦＰＧＡ）内の再構成データ記憶回路は、論理ブロック（ＣＬＢ）内に取入れられている。すなわち、データ記憶ブロック（ＤＳＢ）の存在は想定されていない。当業者は、上に説明した論理ブロック（ＣＬＢ）ベース再構成データ記憶回路を利用する１個またはそれ以上の構成部分が、データ記憶ブロック（ＤＳＢ）が存在する場合にはデータ記憶ブロック（ＤＳＢ）ベース回路も利用できることを認めるであろう。各入出力ブロック（ＩＯＢ）は、論理ブロック（ＣＬＢ）とフィールドプログラマブルゲートアレイ（ＦＰＧＡ）出力ピンとの間でデータを転送するための選択的再構成回路を含んでいることが好ましい。構成データセットは、論理ブロック（ＣＬＢ）内で実行される関数を指定することによって動的再構成処理装置（ＤＲＰＵ）ハードウェア構成または編成を定め、また、
１）論理ブロック（ＣＬＢ）内、
２）論理ブロック（ＣＬＢ）相互間、
３）入出力ブロック（ＩＯＢ）内、
４）入出力ブロック（ＩＯＢ）相互間、及び、
５）論理ブロック（ＣＬＢ）と入出力ブロック（ＩＯＢ）との間
の相互結合を定める。当業者は、構成データセットによって、メモリ制御ライン４２と、アドレスライン４４と、メモリ入出力ライン４６と、外部制御ライン４８のそれぞれにおけるビット数が再構成可能であることを認めるであろう。再構成データセットは、システム１０の中の１個またはそれ以上のＳマシン３４に記憶されることが好ましい。当業者は、動的再構成処理装置（ＤＲＰＵ）３２がフィールドプログラマブルゲートアレイ（ＦＰＧＡ）ベース実装に限定されないことを認めるであろう。たとえば動的再構成処理装置（ＤＲＰＵ）３２は、１つまたはそれ以上のルックアップテーブルをおそらく含むＲＡＭベース状態マシンとして実装することができる。あるいは動的再構成処理装置（ＤＲＰＵ）３２は、複合プログラマブル論理装置（ＣＰＬＤ）を用いて実装することができる。しかし当業者は、システム１０のＳマシン１２の一部が再構成可能ではない動的再構成処理装置（ＤＲＰＵ）１２を含むことができることを認めるであろう。
【００６３】
好ましい実施例では、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４はそれぞれ動的に再構成可能である。したがって、その内部ハードウェア構成はプログラム実行中に選択的に変更することができる。命令取出し装置（ＩＦＵ）６０は、命令取出し・復号演算と、メモリアクセス演算と、動的再構成処理装置（ＤＲＰＵ）再構成演算とを指示し、命令の実行を容易に行うためにデータ演算装置（ＤＯＵ）６２とアドレス演算装置（ＡＯＵ）６４に制御信号を送る。データ演算装置（ＤＯＵ）６２は、データ計算に関する演算を実行し、アドレス演算装置（ＡＯＵ）６４はアドレス計算に関する演算を実行する。命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４のそれぞれの内部構造と演算については下記に詳しく説明する。
【００６４】
図８は、命令取出し装置（ＩＦＵ）６０の好ましい実施例の構成図である。命令取出し装置（ＩＦＵ）６０は、命令状態シーケンサ（ＩＳＳ：Instruction State Sequencer）１００と、アーキテクチャ記述メモリ１０１と、メモリアクセスロジック１０２と、再構成ロジック１０４と、割込みロジック１０６と、取出し制御装置１０８と、命令バッファ１１０と、復号制御装置１１２と、命令復号器１１４と、操作コード記憶レジスタセット１１６と、レジスタファイル（ＲＦ：Register File）アドレスレジスタセット１１８と、定数レジスタセット１２０と、プロセス制御レジスタセット１２２とを含んでいる。命令状態シーケンサ（ＩＳＳ）１００は、それぞれ命令取出し装置（ＩＦＵ）６０の第１及び第２制御出力部を形成する第１及び第２制御出力部を含んでおり、また命令取出し装置（ＩＦＵ）６０のタイミング入力部を形成するタイミング入力部を含んでいる。また命令状態シーケンサ（ＩＳＳ）１００は、取出し／復号制御ライン１３０を経て取出し制御装置１０８の制御入力部と復号制御装置１１２の制御入力部とに結合された取出し／復号制御出力部を含んでいる。さらに命令状態シーケンサ（ＩＳＳ）１００は、双方向制御ライン１３２を経てメモリアクセスロジック１０２と、再構成ロジック１０４と、割込みロジック１０６のそれぞれの第１双方向制御ポートに結合された双方向制御ポートを含んでいる。また命令状態シーケンサ（ＩＳＳ）１００は、操作コードライン１４２を経て、操作コード記憶レジスタセット１１６の出力部に結合された操作コード入力部を含んでいる。最後に命令状態シーケンサ（ＩＳＳ）１００は、処理データライン１４４を経て、プロセス制御レジスタセット１２２の双方向制御ポートに結合された双方向制御ポートを含んでいる。
【００６５】
メモリアクセスロジック１０２と、再構成ロジック１０４と、割込みロジック１０６は、それぞれ外部制御ライン４８に結合された第２双方向制御ポートを含んでいる。さらにメモリアクセスロジック１０２と、再構成ロジック１０４と、割込みロジック１０６は、それぞれ実装制御ライン１３１を経てアーキテクチャ記述メモリ１０１のデータ出力部に結合されたデータ入力部を含んでいる。メモリアクセスロジック１０２は、さらに命令取出し装置（ＩＦＵ）６０のメモリ制御出力部を形成する制御出力部を含み、また割込みロジック１０６はさらに処理データライン１４４に結合された出力部を含んでいる。命令バッファ１１０は、命令取出し装置（ＩＦＵ）６０のデータ入力部を形成するデータ入力部と、取出し制御ライン１３４を経て取出し制御装置１０８の制御出力部に結合された制御入力部と、命令ライン１３６を経て命令復号器１１４の入力部に結合された出力部とを含んでいる。命令復号器１１４は、復号制御ライン１３８を経て復号制御装置１１２の制御出力部に結合された制御入力部と、復号命令ライン１４０を経て、
１）操作コード記憶レジスタ１１６の入力部と、
２）レジスタファイル（ＲＦ）アドレスレジスタセット１１８の入力部と、
３）定数レジスタセット１２０の入力部に結合された出力部と、
を含んでいる。レジスタファイル（ＲＦ）アドレスレジスタセット１１８と定数レジスタセット１２０は、それぞれ命令取出し装置（ＩＦＵ）６０の第３制御出力部７４を形成する出力部を含んでいる。
【００６６】
アーキテクチャ記述メモリ１０１は、現在の動的再構成処理装置（ＤＲＰＵ）構成を特徴付けるアーキテクチャ指定信号を記憶する。このアーキテクチャ指定信号は、
１）デフォルト構成データセットに対する基準と、
２）許容される構成データセットリストに対する基準と、
３）現在検討中の命令セットアーキテクチャ（ＩＳＡ）に対応する構成データセットに対する基準、すなわち現在の動的再構成処理装置（ＤＲＰＵ）構成を定める構成データセットに対する基準と、
４）命令取出し装置（ＩＦＵ）６０が存在するＳマシン１２に関連したＴマシン１４内の１個またはそれ以上の相互結合入出力装置３０４を識別する相互結合アドレスリスト（これについては、図１８を用いて下記に詳しく説明する）と、
５）割込み待ち時間と、命令取出し装置（ＩＦＵ）６０が割込みにどのように応答するかを定める割込み精度情報とを指定する１組の割込み応答信号と、
６）アトミックメモリアドレスインクリメントを定めるメモリアクセス定数と、を含んでいることが好ましい。好ましい実施例では、各構成データセットは、読出し専用メモリ（ＲＯＭ）として構成された１組の論理ブロック（ＣＬＢ）としてアーキテクチャ記述メモリ１０１を実動化する。アーキテクチャ記述メモリ１０１の内容を定めるアーキテクチャ指定信号は、各構成データセットに含まれることが好ましい。したがって、各構成データセットが特定の命令セットアーキテクチャ（ＩＳＡ）に対応するので、アーキテクチャ記述メモリ１０１の内容は、現在検討中の命令セットアーキテクチャ（ＩＳＡ）によって異なる。所定の命令セットアーキテクチャ（ＩＳＡ）について、アーキテクチャ記述メモリ１０１の内容へのプログラムアクセスは、命令セットアーキテクチャ（ＩＳＡ）にメモリ読出し命令を含めることによって容易に行われることが好ましい。これによってプログラム実行中に現在の動的再構成処理装置（ＤＲＰＵ）構成に関する情報をプログラムが検索することができる。
【００６７】
本発明では、再構成ロジック１０４は一連の再構成演算を制御する状態マシンであり、これによって構成データセットに応じて動的再構成処理装置（ＤＲＰＵ）３２の再構成が容易に行われる。再構成ロジック１０４は、再構成信号を受取り次第、再構成演算を開始することが好ましい。下記に詳しく説明するように、再構成信号は、外部制御ライン４８で受取った再構成割込みに応じて割込みロジック１０６が発生させた信号であるか、またはプログラムに埋込まれた再構成指示に応じて命令状態シーケンサ（ＩＳＳ）１００が発生させた信号である。再構成演算によって、アーキテクチャ記述メモリ１０１によって参照されるデフォルト構成データを用いて電源オン／リセット後の当初の動的再構成処理装置（ＤＲＰＵ）構成が得られる。また再構成演算によって、当初の動的再構成処理装置（ＤＲＰＵ）構成が確定したあとの選択的動的再構成処理装置（ＤＲＰＵ）再構成が得られる。再構成演算が完了すると再構成ロジック１０４は完了信号を発する。好ましい実施例では、再構成ロジック１０４は、再プログラマブル論理装置自体への構成データセットのローディングを制御する非再構成ロジックであり、したがって再構成演算のシーケンスは再プログラマブル論理装置のメーカーによって定められる。したがって、再構成演算は当業者に既知である。
【００６８】
各動的再構成処理装置（ＤＲＰＵ）構成は、対応する命令セットアーキテクチャ（ＩＳＡ）の実動化のための特定のハードウェア編成を定める構成データセットによって与えられるのが好ましい。好ましい実施例では、命令取出し装置（ＩＦＵ）６０は動的再構成処理装置（ＤＲＰＵ）構成に関係なく、上記の各構成部分を含んでいる。基本レベルでは、命令取出し装置（ＩＦＵ）６０内の各構成部分によって与えられる機能性は、現在検討中の命令セットアーキテクチャ（ＩＳＡ）とは無関係である。しかし、好ましい実施例では、命令取出し装置（ＩＦＵ）６０の１個またはそれ以上の構成部分の詳細な構造と機能性は、それが構成されている命令セットアーキテクチャ（ＩＳＡ）の特性に応じて異なる。好ましい実施例では、アーキテクチャ記述メモリ１０１及び再構成ロジック１０４の構造と機能性は、それぞれの動的再構成処理装置（ＤＲＰＵ）構成について一定であることが好ましい。命令取出し装置（ＩＦＵ）６０のその他の構成部分の構造と機能性について、またこれらが命令セットアーキテクチャ（ＩＳＡ）の種類によって異なることについては、下記に詳しく説明する。
【００６９】
プロセス制御レジスタセット１２２は、命令実行中に命令状態シーケンサ（ＩＳＳ）１００によって用いられる信号とデータを記憶する。好ましい実施例では、プロセス制御レジスタセット１２２は、プロセス制御ワードを記憶するためのレジスタと、割込みベクトルを記憶するためのレジスタと、構成データセットへの参照を記憶するためのレジスタとを含んでいる。プロセス制御ワードは、命令実行中に発生する状態にもとづいて選択的に設定またはリセットすることができる複数の条件フラグを含んでいることが好ましい。さらにプロセス制御ワードは、割込みを実施できる１つまたはそれ以上の方法を定める複数の遷移制御信号を含んでいる（これについては、下記に詳しく説明する）。好ましい実施例では、プロセス制御レジスタセット１２２は、データ記憶及びゲーティングロジックのために構成された１組の論理ブロック（ＣＬＢ）として実装される。
【００７０】
命令状態シーケンサ（ＩＳＳ）１００は、取出し制御装置１０８と復号制御装置１１２と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との演算を制御し、命令の実行を容易にするためにメモリ読出し信号とメモリ書込み信号をメモリアクセスロジック１０２に発信する状態マシンであることが好ましい。図９は、命令状態シーケンサ（ＩＳＳ）１００によって支援される１組の好ましい状態を示す状態図である。電源オンまたはリセット後、または再構成が行われた直後、命令状態シーケンサ（ＩＳＳ）１００は状態Ｐで演算を開始する。再構成ロジック１０４により発せられた完了信号に応じて、命令状態シーケンサ（ＩＳＳ）１００は状態Ｓに進み、命令状態シーケンサ（ＩＳＳ）は電源オン／リセットまたは再構成が行われた場合、それぞれプログラム状態情報を初期化するか、復元する。命令状態シーケンサ（ＩＳＳ）１００は次に状態Ｆに進み、命令取出し演算を実行する。命令取出し演算では、命令状態シーケンサ（ＩＳＳ）１００はメモリ読出し信号をメモリアクセスロジック１０２に発信し、取出し信号を取出し制御装置１０８に発信し、次命令プログラムアドレスレジスタ（ＮＩＰＡＲ）２３２をインクリメントするためにインクリメント信号をアドレス演算装置（ＡＯＵ）６４に発信する（これについては、図１５と図１６を用いて下記に詳しく説明する）。状態Ｆの後、命令状態シーケンサ（ＩＳＳ）１００は状態Ｄに進み、命令復号演算を開始する。状態Ｄで、命令状態シーケンサ（ＩＳＳ）１００は復号信号を復号制御１１２に発信する。状態Ｄで、命令状態シーケンサ（ＩＳＳ）１００はさらに復号命令に対応する操作コードを操作コード記憶レジスタセット１１６から検索する。検索した操作コードに基づいて、命令状態シーケンサ（ＩＳＳ）１００は状態Ｅまたは状態Ｍに進み、命令実行演算を実行する。命令が１回のクロックサイクルで実行できるときには、命令状態シーケンサ（ＩＳＳ）１００は状態Ｅに進む。それ以外の場合には、命令状態シーケンサ（ＩＳＳ）１００は複数のサイクルで命令を実行するために状態Ｍに進む。命令実行演算では、命令状態シーケンサ（ＩＳＳ）１００はデータ演算装置（ＤＯＵ）制御信号と、アドレス演算装置（ＡＯＵ）制御信号と、及び／または検索した操作コードに対応する命令の実行を容易にするためのメモリアクセスロジック１０２専用の信号とを生成する。状態ＥまたはＭのあと、命令状態シーケンサ（ＩＳＳ）１００は状態Ｗに進む。状態Ｗで、命令状態シーケンサ（ＩＳＳ）１００は、データ演算装置（ＤＯＵ）制御信号と、アドレス演算装置（ＡＯＵ）制御信号と、及び／または命令実行の結果の記憶を容易にするためのメモリ書込み信号とを生成する。したがって、状態Ｗはライトバック状態と呼ばれる。当業者は、状態Ｆ、Ｄ、Ｅ、Ｍ、Ｗが完全な命令実行サイクルを含むことを認めるであろう。状態Ｗのあと命令状態シーケンサ（ＩＳＳ）１００は、命令の実行を中断する必要があるときには状態Ｙに進む。状態Ｙは、たとえばＴマシン１４がＳマシンのメモリ３４にアクセスしなくてはならないときに必要とされるようなアイドル状態に対応している。状態Ｙのあと、または命令の実行を継続するときには状態Ｗの後、命令状態シーケンサ（ＩＳＳ）１００は状態Ｆに戻り、さらに別の命令実行サイクルを開始する。
【００７１】
図９に示すように、状態図には状態Ｉも含まれている。この状態は、割込み実施状態として定義される。本発明では、命令状態シーケンサ（ＩＳＳ）１００は割込みロジック１０６から割込み通知信号を受取る。図１０を用いて下記に詳しく説明するように、割込みロジック１０６は遷移制御信号を生成し、プロセス制御レジスタセット１２２内のプロセス制御ワード内に遷移制御信号を記憶する。遷移制御信号は、状態Ｆ、Ｄ、Ｅ、Ｍ、Ｗ、Ｙのどの状態が割込み可能かについて、また各割込み可能状態で必要とされる割込み精度のレベルについて、また状態Ｉのあとも命令の実行を継続すべき各割込み可能状態の次の状態を示すことが好ましい。命令状態シーケンサ（ＩＳＳ）１００が所定の状態で割込み通知信号を受取ったとき、遷移制御信号によって現在の状態が割込み可能であることが示されている場合には、命令状態シーケンサ（ＩＳＳ）１００は状態Ｉに進む。それ以外の場合には、命令状態シーケンサ（ＩＳＳ）１００は割込み可能状態に達するまで割込み信号を受取っていなかったかのように進む。
【００７２】
命令状態シーケンサ（ＩＳＳ）１００が状態Ｉに進むと、命令状態シーケンサ（ＩＳＳ）１００は割込みマスキングフラグを設定し、また割込みベクトルを検索するために、プロセス制御レジスタセット１２２にアクセスするのが好ましい。割込みベクトルを受取った後、命令状態シーケンサ（ＩＳＳ）１００は、割込みベクトルによって指定される割込みハンドラーに従来のようなサブルーチンジャンプを行い現在の割込みを実施するのが好ましい。
【００７３】
本発明では、動的再構成処理装置（ＤＲＰＵ）３２の再構成は、
１）外部制御ライン４８で表明される再構成割込みか、または、
２）一連のプログラム命令内の再構成指示の実行
に応じて開始される。好ましい実施例では、再構成割込みを行っても、また再構成指示を実行しても、再構成ハンドラーへのサブルーチンジャンプが行われる。再構成ハンドラーはプログラム状態情報をセーブし、構成データセットアドレスと再構成信号を再構成ロジック１０４に発信することが好ましい。
【００７４】
現在の割込みが再構成割込みでないときには、命令状態シーケンサ（ＩＳＳ）１００は、割込みが実施された場合に遷移制御信号によって示される次の状態に進み、これによって命令実行サイクルを再開し、完了し、または開始する。
【００７５】
好ましい実施例では、命令状態シーケンサ（ＩＳＳ）１００により支援される１組の状態は、動的再構成処理装置（ＤＲＰＵ）３２が構成される命令セットアーキテクチャ（ＩＳＡ）の特性に応じて異なる。したがって、典型的な内部ループ命令セットアーキテクチャ（ＩＳＡ）での場合のように、１つまたはそれ以上の命令が１回のクロックサイクルで実行できる命令セットアーキテクチャ（ＩＳＡ）について状態Ｍは存在しない。図に示すように、図９の状態図は、汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するために命令状態シーケンサ（ＩＳＳ）によって支援される状態を規定することが好ましい。内部ループ命令セットアーキテクチャ（ＩＳＡ）の実動化については、命令状態シーケンサ（ＩＳＳ）１００は複数の状態Ｆ、Ｄ、Ｅ、Ｗを並列に支援するのが好ましい。これによって当業者が容易に理解するような方法で命令実行のパイプライン制御を容易に行うことができる。好ましい実施例では、命令状態シーケンサ（ＩＳＳ）１００は現在検討中の命令セットアーキテクチャ（ＩＳＡ）に従って上記に述べた状態または状態のサブセットを支援する論理ブロック（ＣＬＢ）ベース状態マシンとして実動化される。
【００７６】
割込みロジック１０６は、遷移制御信号を生成し、外部制御ライン４８を経て受取った割込み信号に応じて割込み通知演算を実行する状態マシンを含んでいることが好ましい。図１０は、割込みロジック１０６によって支援される１組の好ましい状態を示す状態図である。割込みロジック１０６は状態Ｐで演算を開始する。状態Ｐは、電源オン、リセット、または再構成状態に対応している。再構成ロジック１０４によって発せられた完了信号に応じて、割込みロジック１０６は状態Ａに進み、アーキテクチャ記述メモリ１０１から割込み応答信号を検索する。割込みロジック１０６は、次に割込み応答信号から遷移制御信号を生成し、この遷移制御信号をプロセス制御レジスタセット１２２に記憶する。好ましい実施例では、割込みロジック１０６は、割込み応答信号を受取り遷移制御信号を生成するための論理ブロック（ＣＬＢ）ベースプログラマブル論理アレイ（ＰＬＡ）を含んでいる。状態Ａのあと、割込みロジック１０６は状態Ｂに進み割込み信号を待つ。割込み信号を受取り、プロセス制御レジスタセット１２２内の割込みマスキングフラグがリセットされた場合に割込みロジック１０６は状態Ｃに進む。状態Ｃでは、割込みロジック１０６は割込みの開始点と、割込み優先度と、割込みハンドラーアドレスとを決定する。割込み信号が再構成割込みのときには、割込みロジック１０６は状態Ｒに進み、構成データセットアドレスをプロセス制御レジスタセット１２２に記憶する。状態Ｒのあと、または割込み信号が再構成割込みではないときには状態Ｃのあと、割込みロジック１０６は状態Ｎに進み、割込みハンドラーアドレスをプロセス制御レジスタセット１２２に記憶する。割込みロジック１０６は次に状態Ｘに進み、割込み通知信号を命令状態シーケンサ（ＩＳＳ）１００に発する。状態Ｘのあと、割込みロジック１０６は状態Ｂに戻り、次の割込み信号を待つ。
【００７７】
好ましい実施例では、割込み応答信号が、したがって遷移制御信号が指定する割込み待ち時間のレベルは、動的再構成処理装置（ＤＲＰＵ）３２が構成されている現在の命令セットアーキテクチャ（ＩＳＡ）によって異なる。たとえば高性能リアルタイム動作制御用の命令セットアーキテクチャ（ＩＳＡ）では、迅速で予測可能な割込み応答能力が求められる。したがって、このような命令セットアーキテクチャ（ＩＳＡ）に対応する構成データセットは、待ち時間の短い割込みが必要であることを示す割込み応答信号を含んでいることが好ましい。対応する遷移制御信号は、複数の命令状態シーケンサ（ＩＳＳ）状態を割込み可能として識別することが好ましい。これにより、命令実行サイクルが完了する前に割込みによって命令実行サイクルを中断することができる。リアルタイム動作制御用の命令セットアーキテクチャ（ＩＳＡ）とは異なり、画像畳込み演算用の命令セットアーキテクチャ（ＩＳＡ）では、単位時間当たりに実行される畳込み演算の回数が最大となるような割込み応答能力が必要である。画像畳込み演算用命令セットアーキテクチャ（ＩＳＡ）に対応する構成データセットは、待ち時間の長い割込みが必要であることを指定する割込み応答信号を含んでいることが好ましい。対応する遷移制御信号は、状態Ｗを割込み可能として識別することが好ましい。画像畳込み演算用命令セットアーキテクチャ（ＩＳＡ）を実装するために構成され、命令状態シーケンサ（ＩＳＳ）１００が複数の状態Ｆ、Ｄ、Ｅ、Ｗを並列に支援するときには、遷移制御信号はそれぞれ状態Ｗを割込み可能として識別し、さらに各並列命令実行サイクルがその状態Ｗ演算を完了するまで割込み実施を遅延すべきであることを指定することが好ましい。これにより、割込みが実施される前にすべての命令が実行されることが保証され、これによって適切なパイプライン実行能力レベルが維持される。
【００７８】
割込み待ち時間のレベルと同様に、割込み応答信号によって指定される割込み精度のレベルも動的再構成処理装置（ＤＲＰＵ）３２が構成される命令セットアーキテクチャ（ＩＳＡ）によって異なる。たとえば、状態Ｍが割込み可能なマルチサイクル演算を支援する外部ループ命令セットアーキテクチャ（ＩＳＡ）について割込み可能状態であると定められた場合、割込み応答信号は正確な割込みが必要であることを指定することが好ましい。したがって遷移制御信号は、マルチサイクル演算がうまく再スタートできるよう状態Ｍで受取った割込みを正確な割込みとして扱うよう指定する。もう１つの例として、無欠陥パイプライン算術演算を支援する命令セットアーキテクチャ（ＩＳＡ）については、割込み応答信号は不正確な割込みが必要であると指定することが好ましい。次に遷移制御信号は、状態Ｗで受取った割込みを不正確な割込みとして扱うことを指定する。
【００７９】
任意の命令セットアーキテクチャ（ＩＳＡ）については、割込み応答信号は命令セットアーキテクチャ（ＩＳＡ）の対応する構成データセットの一部によって定められ、またプログラムされる。プログラマブル割込み応答信号によって、また対応する遷移制御信号を生成することにより、本発明では、命令セットアーキテクチャ（ＩＳＡ）ごとの最適の割込みスキームを実動化することが容易となっている。当業者は、先行技術コンピュータアーキテクチャのほとんどでは、割込み能力、すなわちプログラマブル状態遷移の有効化、プログラマブル割込み待ち時間、及びプログラマブル割込み精度を柔軟に指定できないことを認めるであろう。好ましい実施例では、割込みロジック１０６は上記のような状態を支援する論理ブロック（ＣＬＢ）ベース状態マシンとして実装される。
【００８０】
取出し制御装置１０８は、命令セットアーキテクチャ（ＩＳＡ）１００によって発せられた取出し信号に応じて命令バッファ１１０に命令をロードするよう指示する。好ましい実施例では、取出し制御装置１０８は１組の論理ブロック（ＣＬＢ）内でフリップフロップを用いた従来型のワンホット符号化状態マシンとして実装される。当業者は、別の実施例で、取出し制御装置１０８が従来型の符号化状態マシンとして、またはＲＯＭベース状態マシンとして構成できることを認めるであろう。命令バッファ１１０は、メモリ３４からロードされた命令を一時記憶する。外部ループ命令セットアーキテクチャ（ＩＳＡ）の実装については、命令バッファ１１０は多重論理ブロック（ＣＬＢ）を用いた従来型のＲＡＭベース先入れ先出し（ＦＩＦＯ）バッファとして実装されるのが好ましい。内部ループ命令セットアーキテクチャ（ＩＳＡ）の実装については、命令バッファ１１０は１組の入出力ブロック（ＩＯＢ）内で複数のフリップフロップを用いた、または入出力ブロック（ＩＯＢ）と論理ブロック（ＣＬＢ）の両方で複数のフリップフロップを用いた１組のフリップフロップレジスタとして実装されるのが好ましい。
【００８１】
復号制御装置１１２は、命令セットアーキテクチャ（ＩＳＡ）１００によって発せられた復号信号に応じて、命令を命令バッファ１１０から命令復号器１１４へ転送するよう指示する。内部ループ命令セットアーキテクチャ（ＩＳＡ）については、復号制御装置１１２は論理ブロック（ＣＬＢ）ベースレジスタに結合された論理ブロック（ＣＬＢ）ベースＲＯＭを含むＲＯＭベース状態マシンとして実装されるのが好ましい。外部ループ命令セットアーキテクチャ（ＩＳＡ）については、復号制御装置１１２は論理ブロック（ＣＬＢ）ベース符号化状態マシンとして実装されるのが好ましい。入力として受取った各命令については、命令復号器１１４は、従来の方法で対応する操作コードと、レジスタファイルアドレスと、選択的に１つまたはそれ以上の定数とを出力する。内部ループ命令セットアーキテクチャ（ＩＳＡ）については、命令復号器１１４は入力として受取った一連の命令を復号するよう構成されていることが好ましい。好ましい実施例では、命令復号器１１４は現在検討中の命令セットアーキテクチャ（ＩＳＡ）に含まれる各命令を復号するために構成された論理ブロック（ＣＬＢ）ベース復号器として実装される。
【００８２】
操作コード記憶レジスタセット１１６は、命令復号器１１４による各操作コード出力を一時記憶し、また各操作コードを命令状態シーケンサ（ＩＳＳ）１００に出力する。外部ループ命令セットアーキテクチャ（ＩＳＡ）を動的再構成処理装置（ＤＲＰＵ）３２に実装するとき、操作コード記憶レジスタセット１１６は最適数のフリップフロップレジスタバンクを用いて実装されることが好ましい。フリップフロップレジスタバンクは、命令バッファ１１０を通りすでに待ち行例を形成している命令の操作コードリテラルビットフィールドから導出されるクラスコードまたはグループコードを表す信号を命令復号器１１４から受取る。フリップフロップレジスタバンクは、命令状態シーケンサ（ＩＳＳ）の複雑性を最小限にとどめることのできる復号スキームに従って、前述のクラスコードまたはグループコードを記憶する。内部ループ命令セットアーキテクチャ（ＩＳＡ）の場合には、操作コード記憶レジスタセット１１６は、命令復号器１１４による操作コードリテラルビットフィールドから直接導出される操作コード指示信号を記憶する。内部ループ命令セットアーキテクチャ（ＩＳＡ）は小さい操作コードリテラルビットフィールドを必然的に有し、これによってそれぞれ命令バッファ１１０と、命令復号器１１４と、操作コード記憶レジスタセット１１６とによるバッファリングと、復号化と、命令シーケンシング（順序づけ）のための操作コード表示とについての実装要件を最小限にとどめる。以上をまとめると、外部ループ命令セットアーキテクチャ（ＩＳＡ）については、操作コード記憶レジスタセット１１６は操作コードリテラルサイズに等しいビット幅またはその一部として特徴付けられるフリップフロップレジスタバンクの小さな組合わせとして実装されることが好ましい。内部ループについては、操作コード記憶レジスタセット１１６は外部ループ命令セットアーキテクチャ（ＩＳＡ）の場合よりもフリップフロップレジスタバンクが小さくまた統合されていることが好ましい。内部ループで、フリップフロップレジスタバンクのサイズが小さくて済むのは、外部ループ命令セットアーキテクチャ（ＩＳＡ）と比較して内部ループ命令セットアーキテクチャ（ＩＳＡ）の命令数がきわめて少ないためである。
【００８３】
レジスタファイル（ＲＦ）アドレスレジスタセット１１８と定数レジスタセット１２０は、それぞれ命令復号器１１４による各レジスタファイルと各定数出力とを一時記憶する。好ましい実施例では、操作コード記憶レジスタセット１１６と、レジスタファイル（ＲＦ）アドレスレジスタセット１１８と、定数レジスタセット１２０とはそれぞれデータ記憶のために構成された１組の論理ブロック（ＣＬＢ）として実装される。
【００８４】
メモリアクセスロジック１０２は、アーキテクチャ記述メモリ１２２で指定されたアトミックメモリアドレスのサイズに従って、メモリ３４と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との間でデータの転送を指示し同期させるメモリ制御回路である。メモリアクセスロジック１０２はさらに、Ｓマシン１２と所定のＴマシン１４との間のデータとコマンドの転送を指示し同期させる。好ましい実施例では、メモリアクセスロジック１０２はバーストメモリアクセスを支援し、論理ブロック（ＣＬＢ）を用いた従来型のＲＡＭコントローラとして実装されることが好ましい。当業者は、再構成中に、再構成論理装置の入力ピンと出力ピンが３値であり、抵抗停止によって非表明ロジックレベルを定めることができ、したがってメモリ３４を混乱させないことを認めるであろう。別の実施例では、メモリアクセスロジック１０２は動的再構成処理装置（ＤＲＰＵ）３２の外部に実装することができる。
【００８５】
図１１は、データ演算装置６２の好ましい実施例の構成図である。データ演算装置（ＤＯＵ）６２はデータ演算装置（ＤＯＵ）制御信号と、レジスタファイル（ＲＦ）アドレスと、命令セットアーキテクチャ（ＩＳＡ）１００から受取った定数とに従ってデータについて演算を実行する。データ演算装置（ＤＯＵ）６２は、データ演算装置（ＤＯＵ）クロスバースイッチ１５０と、記憶／整列ロジック１５２と、データ演算ロジック１５４とを含んでいる。データ演算装置（ＤＯＵ）クロスバースイッチ１５０と、記憶／整列ロジック１５２と、データ演算ロジック１５４とはそれぞれ第１制御ライン７０を経て命令取出し装置（ＩＦＵ）６０の第１制御出力部に結合された制御入力部を含んでいる。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、データ演算装置（ＤＯＵ）の双方向データポートを形成する双方向データポートと、第３制御ライン７４に結合された定数入力部と、第１データライン１６０を経てデータ演算ロジック１５４のデータ出力部に結合された第１データフィードバック入力部と、第２データライン１６４を経て記憶／整列ロジック１５２のデータ出力部に結合された第２データフィードバック入力部と、第３データラインを経て記憶／整列ロジック１５２のデータ入力部に結合されたデータ出力部とを含んでいる。記憶／整列ロジック１５２は、そのデータ出力部の他に、第３制御ライン７４に結合されたアドレス入力部を含んでいる。データ演算ロジック１５４は、さらに第２データライン１６４を経て記憶／整列ロジックの出力部に結合されたデータ入力部を含んでいる。
【００８６】
データ演算ロジック１５４は、その制御入力部で受取ったデータ演算装置（ＤＯＵ）制御信号に応じて、そのデータ入力部で受取ったデータについて、算術演算、シフト演算及び／または論理演算を実行する。記憶／整列ロジック１５２は、それぞれそのアドレス入力部と制御入力部とで受取ったレジスタファイル（ＲＦ）アドレスとデータ演算装置（ＤＯＵ）制御の指示に従って、オペランドと、定数と、データ計算に関連した部分的結果とを一時記憶するデータ記憶素子を含んでいる。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、その制御入力部で受取ったデータ演算装置（ＤＯＵ）制御信号に従って、メモリ３４からのデータのローディングと、データ演算ロジック１５４による結果出力の記憶／整列ロジック１５２またはメモリ３４への転送と、命令取出し装置（ＩＦＵ）６０による定数出力の記憶／整列ロジック１５２へのローディングとを容易にするような従来型のクロスバースイッチネットワークであることが好ましい。好ましい実施例では、データ演算ロジック１５４の詳細な構造は、現在検討中の命令セットアーキテクチャ（ＩＳＡ）によって支援される演算の種類によって定まる。すなわち、データ演算ロジック１５４は、現在検討中の命令セットアーキテクチャ（ＩＳＡ）内のデータ処理命令によって指定された算術演算及び／または論理演算を実行するための回路を含んでいる。同様に、記憶／整列ロジック１５２とデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の詳細な構造は、現在検討中の命令セットアーキテクチャ（ＩＳＡ）によって定まる。命令セットアーキテクチャ（ＩＳＡ）の種類によるデータ演算ロジック１５４と、記憶／整列ロジック１５２と、データ演算装置（ＤＯＵ）クロスバースイッチ１５０との詳細な構造は、図１２及び図１３を参照して下記に詳しく説明する。
【００８７】
外部ループ命令セットアーキテクチャ（ＩＳＡ）については、データ演算装置（ＤＯＵ）６２はデータに対して逐次演算を実行するよう構成されていることが好ましい。図１２は、汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）の実動化のために構成されたデータ演算装置（ＤＯＵ）６１の第１模範実施例の構成図である。汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）では、乗算、加算、減算などの数学的演算と、ＡＮＤ、ＯＲ、ＮＯＴなどのブール演算と、シフト演算と、回転演算とを実行するために構成されたハードウェアが必要である。したがって、汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）の実装については、データ演算ロジック１５４は第１入力部と、第２入力部と、制御入力部と、出力部とを有する従来型の演算論理装置（ＡＬＵ）／シフタ１８４とを含んでいることが好ましい。記憶／整列ロジック１５２は、第１ＲＡＭ１８０と第２ＲＡＭ１８２とで構成されていることが好ましく、これはそれぞれデータ入力部と、データ出力部と、アドレス選択入力部と、イネーブル入力部とを含んでいる。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、双方向及び単方向クロスバー結合部を有し、また図１１を用いてすでに説明したような入力部と出力部とを有する従来型のクロスバースイッチネットワークを含んでいることが好ましい。当業者は、外部ループ命令セットアーキテクチャ（ＩＳＡ）のためのデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の効率的な実動化には、マルチプレクサと、３値バッファと、論理ブロック（ＣＬＢ）ベースロジックと、直接配線と、または再構成結合手段によって、いずれかの組合わせで結合された上記構成部分のサブセットが含まれることを認めるであろう。外部ループについては、データ演算装置（ＤＯＵ）クロスバースイッチ１５０は最短時間で逐次データ移動を促進するよう実動化されるが、汎用外部ループ命令を支援するために最大数の単一データ移動クロスバー結合部も提供する。
【００８８】
第１ＲＡＭ１８０のデータ入力部は第２ＲＡＭ１８２のデータ入力部と同様に、第３データライン１６２を経てデータ演算装置（ＤＯＵ）クロスバースイッチ１５０のデータ出力部に結合されている。第１ＲＡＭ１８０と第２ＲＡＭ１８２とのアドレス選択入力部は、第３制御ライン７４を経て命令取出し装置（ＩＦＵ）６０からレジスタファイルアドレスを受取るよう結合されている。同様に、第１ＲＡＭ１８０と第２ＲＡＭ１８２とのイネーブル入力部は、第１制御ライン７０を経てデータ演算装置（ＤＯＵ）制御信号を受取るよう結合されている。第１ＲＡＭ１８０と第２ＲＡＭ１８２とのデータ出力部は、それぞれＡＬＵ／シフタ１８４の第１入力部と第２入力部に結合されており、またデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の第２データフィードバック入力部にも結合されている。ＡＬＵ／シフタ１８４の制御入力部は、第１制御ライン７０を経てデータ演算装置（ＤＯＵ）制御信号を受取るよう結合されている。またＡＬＵ／シフタ１８４の出力部は、データ演算装置（ＤＯＵ）クロスバースイッチ１５０の第１データフィードバック入力部に結合されている。データ演算装置（ＤＯＵ）クロスバースイッチ１５０の残りの入力部と出力部への結合部は、図１１を用いて上記に説明したものと同一である。
【００８９】
データ演算命令の実行を容易にするために、命令取出し装置（ＩＦＵ）６０は命令状態シーケンサ（ＩＳＳ）が状態ＥまたはＭであるときに、データ演算装置（ＤＯＵ）制御信号と、レジスタファイル（ＲＦ）アドレス信号と、定数信号とをデータ演算装置（ＤＯＵ）６１に発する。第１ＲＡＭ１８０と第２ＲＡＭ１８２とは、それぞれ一時データ記憶のための第１及び第２レジスタファイルを提供する。第１ＲＡＭ１８０と第２ＲＡＭ１８２内の個々のアドレスは、各ＲＡＭのそれぞれのアドレス選択入力部で受取ったレジスタファイル（ＲＦ）アドレスに従って選択される。同様に、第１ＲＡＭ１８０と第２ＲＡＭ１８２のローディングは、その書込みイネーブル入力部で各第１ＲＡＭ１８０と第２ＲＡＭ１８２とがそれぞれ受取るデータ演算装置（ＤＯＵ）制御信号によって制御される。好ましい実施例では、第１ＲＡＭ１８０と第２ＲＡＭ１８２の少なくとも１個が、データ演算装置（ＤＯＵ）クロスバースイッチ１５０からＡＬＵ／シフタ１８４へデータを直接転送するのを容易にするための伝達（引渡し）能力を含んでいる。ＡＬＵ／シフタ１８４は、その制御入力部で受取ったデータ演算装置（ＤＯＵ）制御信号の指示に従って、第１ＲＡＭ１８０から受取った第１オペランドに基づいて、及び／または第２ＲＡＭ１８２から受取った第２オペランドに基づいて、算術演算、論理演算、またはシフト（桁送り）演算を実行する。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は選択的に、
１）メモリ３４と第１ＲＡＭ１８０及び第２ＲＡＭ１８２との間のデータのルーティングと、
２）ＡＬＵ／シフタ１８４から第１ＲＡＭ１８０及び第２ＲＡＭ１８２へ、またはメモリ３４への結果のルーティングと、
３）第１ＲＡＭ１８０または第２ＲＡＭ１８２に記憶されたデータのメモリ３４へのルーティングと、
４）命令取出し装置（ＩＦＵ）６０から第１ＲＡＭ１８０及び第２ＲＡＭ１８２への定数のルーティングと、
を行う。すでに述べたように、第１ＲＡＭ１８０か第２ＲＡＭ１８２のいずれかが伝達能力を有するときには、データ演算装置（ＤＯＵ）クロスバースイッチ１５０も選択的にメモリ３４からＡＬＵ／シフタ１８４に、またはＡＬＵ／シフタの出力部からＡＬＵ／シフタ１８４に直接戻るようデータをルーティングする。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、その制御入力部で受取ったデータ演算装置（ＤＯＵ）制御信号に従って、特定のルーティング演算を実行する。好ましい実施例では、ＡＬＵ／シフタ１８４は再構成論理装置内の数学的演算用の１組の論理ブロック（ＣＬＢ）と回路内の論理関数発生器を用いて実装される。第１ＲＡＭ１８０と第２ＲＡＭ１８２は、それぞれ１組の論理ブロック（ＣＬＢ）内に存在するデータ記憶回路を用いて実装されることが好ましい。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、すでに述べた方法で実装されることが好ましい。
【００９０】
図１３は、内部ループ命令セットアーキテクチャ（ＩＳＡ）の実動化のために構成されたデータ演算装置（ＤＯＵ）６３の第２模範実施例の構成図である。一般に内部ループ命令セットアーキテクチャ（ＩＳＡ）は比較的少ない専用演算を支援し、大きなデータセットに対して共通した演算セットを実行するのに用いることが好ましい。したがって、内部ループ命令セットアーキテクチャ（ＩＳＡ）のための最適の計算性能は、演算を並列に実行するために構成されたハードウェアによって得られる。したがってデータ演算装置（ＤＯＵ）６３の第２模範実施例では、データ演算ロジック１５４と、記憶／整列ロジック１５２と、データ演算装置（ＤＯＵ）クロスバースイッチ１５０とはパイプライン計算を実行するよう構成される。データ演算ロジック１５４は、複数の入力部と、制御入力部と、出力部とを有するパイプライン機能単位１９４を含んでいる。記憶／整列ロジック１５２は、
１）１組の従来型のフリップフロップアレイ１９２（それぞれがデータ入力部と、データ出力部と、制御入力部とを含んでいる）と、
２）データセレクタ１９０（制御入力部と、データ入力部と、フリップフロップアレイ１９２に対応する数のデータ出力部とを含んでいる）と、
を含んでいる。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は、二重単方向クロスバー結合部を有する従来型のクロスバースイッチネットワークを含んでいる。データ演算装置（ＤＯＵ）６３の第２模範実施例では、データ演算装置（ＤＯＵ）クロスバースイッチ１５０は第２データフィードバック入力部を除き、図１１を用いてすでに説明した入力部と出力部とを含んでいることが好ましい。外部ループ命令セットアーキテクチャ（ＩＳＡ）の場合と同様に、内部ループ命令セットアーキテクチャ（ＩＳＡ）のためのデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の効率的な実装には、マルチプレクサと、３値バッファと、論理ブロック（ＣＬＢ）ベースロジックと、直接配線と、または再構成可能な方法で結合された上記構成部分のサブセットとを含めることができる。内部ループ命令セットアーキテクチャ（ＩＳＡ）については、データ演算装置（ＤＯＵ）クロスバースイッチ１５０は最短時間で並列データ移動を最大にするよう実装されるのが好ましいが、高度パイプライン化内部ループ命令セットアーキテクチャ（ＩＳＡ）命令を支援するために、最小数の単一データ動作クロスバー結合部も提供する。
【００９１】
データセレクタ１９０のデータ入力部は、第１データライン１６２を経てデータ演算装置（ＤＯＵ）クロスバースイッチ１５０のデータ出力部に結合されている。データセレクタ１９０の制御入力部は、第３制御ライン７４を経てレジスタファイル（ＲＦ）アドレスを受取るよう結合されており、データセレクタ１９０の各出力部は、対応するフリップフロップアレイデータ入力部に結合されている。各フリップフロップアレイ１９２の制御入力部は、第１制御ライン７０を経てデータ演算装置（ＤＯＵ）制御信号を受取るよう結合されており、各フリップフロップアレイデータ出力部は機能単位１９４の入力部に結合されている。機能単位１９４の制御入力部は、第１制御ライン７０を経てデータ演算装置（ＤＯＵ）制御信号を受取るよう結合されており、機能単位１９４の出力部はデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の第１データフィードバック入力部に結合されている。データ演算装置（ＤＯＵ）クロスバースイッチ１５０の残りの入力部と出力部の結合部は、図１１を用いて既に説明したものと同一である。
【００９２】
演算では、機能単位１９４はその制御入力部で受取ったデータ演算装置（ＤＯＵ）制御信号に従ってそのデータ入力部で受取ったデータに対してパイプライン演算を実行する。当業者は、機能単位１９４が乗算／累算装置か、閾値決定装置か、画像回転装置か、エッジ強調装置か、または区分されたデータに対してパイプライン演算を実行するのに適したいずれかの種類の機能単位であることを認めるであろう。データセレクタ１９０は、その制御入力部で受取ったレジスタファイル（ＲＦ）アドレスに従ってデータ演算装置（ＤＯＵ）クロスバースイッチ１５０の出力部から所定のフリップフロップアレイ１９２へデータをルーティング（経路決定）する。各フリップフロップアレイ１９２は、その制御入力部で受取った制御信号の指示に従って、もう１個のフリップフロップアレイ１９２のデータ内容に対してデータを空間的、時間的に整列させるために逐次結合されたデータラッチを含んでいることが好ましい。データ演算装置（ＤＯＵ）クロスバースイッチ１５０は選択的に、
１）データをメモリ３４からデータセレクタ１９０へルーティングし、
２）結果を乗算／累算装置１９４からデータセレクタ１９０またはメモリ３４へルーティングし、
３）定数を命令取出し装置（ＩＦＵ）６０からデータセレクタ１９０へルーティングする。
当業者は、内部ループ命令セットアーキテクチャ（ＩＳＡ）が１組の「内蔵」定数を含んでいることを認めるであろう。このような内部ループ命令セットアーキテクチャ（ＩＳＡ）の実装では、記憶／整列ロジック１５４が内蔵定数を有する論理ブロック（ＣＬＢ）ベースＲＯＭを含んでいることが好ましく、これによってデータ演算装置（ＤＯＵ）クロスバースイッチ１５０を経て命令取出し装置（ＩＦＵ）６０から記憶／整列ロジック１５２へ定数をルーティングする必要性をなくすことができる。好ましい実施例では、機能単位１９４は１組の論理ブロック（ＣＬＢ）内の数学的演算用の論理関数発生器と回路とを用いて実装されるのが好ましい。各フリップフロップアレイ１９２は１組の論理ブロック（ＣＬＢ）内のフリップフロップを用いて実装されることが好ましい。データセレクタ１９０は１組の論理ブロック（ＣＬＢ）内の論理関数発生器とデータ選択回路とを用いて実装されることが好ましい。最後にデータ演算装置（ＤＯＵ）クロスバースイッチ１５０は、内部ループについてすでに説明した方法で実装されることが好ましい。
【００９３】
図１４は、アドレス演算装置（ＡＯＵ）６４の好ましい実施例の構成図である。アドレス演算装置（ＡＯＵ）６４は、アドレス演算装置（ＡＯＵ）制御信号と、レジスタファイル（ＲＦ）アドレスと、命令取出し装置（ＩＦＵ）６０から受取った定数とに従ってアドレスに対して演算を実行する。アドレス演算装置（ＡＯＵ）６４は、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００と、記憶／計数ロジック２０２と、アドレス演算ロジック２０４と、アドレスマルチプレクサ２０６とを含んでいる。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００と、記憶／計数ロジック２０２と、アドレス演算ロジック２０４と、アドレスマルチプレクサ２０６とは、それぞれ第２制御ライン７２を経て命令取出し装置（ＩＦＵ）６０の第２制御出力部に結合された制御入力部を含んでいる。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、アドレス演算装置（ＡＯＵ）の双方向データポートを形成する双方向データポートと、第１アドレスライン２１０を経てアドレス演算ロジック２０４のアドレス出力部に結合されたアドレスフィードバック入力部と、第３制御ライン７４に結合された定数入力部と、第２アドレスライン２１２を経て記憶／計数ロジック２０２のアドレス入力部に結合されたアドレス出力部とを含んでいる。記憶／計数ロジック２０２は、そのアドレス入力部と制御入力部の他に、第３制御ライン７４に結合されたレジスタファイル（ＲＦ）アドレス入力部と、第３アドレスライン２１４を経てアドレス演算ロジック２０４のアドレス入力部に結合されたアドレス出力部とを含んでいる。アドレスマルチプレクサ２０６は、第１アドレスライン２１０に結合された第１入力部と、第３アドレスライン２１４に結合された第２入力部と、アドレス演算装置（ＡＯＵ）６４のアドレス出力部を形成する出力部とを含んでいる。
【００９４】
アドレス演算ロジック２０４は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号の指示に従ってそのアドレス入力部で受取ったアドレスに対して算術演算を実行する。記憶／計数ロジック２０２は、アドレス及びアドレス計算結果を一時記憶する。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に従って、メモリ３４からのアドレスのローディングと、アドレス演算ロジック２０４の結果出力の記憶／計数ロジック２０２またはメモリ３４への転送と、命令取出し装置（ＩＦＵ）６０による定数出力の記憶／計数ロジック２０２へのローディングとを容易にする。アドレスマルチプレクサ２０６は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号の指示に従って、記憶／計数ロジック２０２またはアドレスマルチプレクサ２０６から受取ったアドレスをアドレス演算装置（ＡＯＵ）６４のアドレス出力部に選択的に出力する。好ましい実施例では、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００と、記憶／計数ロジック２０２と、アドレス演算ロジック２０４との詳細な構造は、図１５と図１６を用いて下記に説明するように、現在検討中の命令セットアーキテクチャ（ＩＳＡ）の種類により定まる。
【００９５】
図１５は、汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）の実動化のために構成されたアドレス演算装置（ＡＯＵ）６５の第１模範実施例の構成図である。汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）では、記憶／計数ロジック２０２に記憶されたプログラムカウンタとアドレスの内容に対して加算、減算、インクリメント、及びデクリメントなどの演算を実行するためのハードウェアが必要である。アドレス演算装置（ＡＯＵ）６５の第１模範実施例では、アドレス演算ロジック２０４は、入力部と、出力部と、制御入力部とを有する次命令プログラムアドレスレジスタ（ＮＩＰＡＲ）２３２と、第１入力部と、第２入力部と、第３入力部と、制御入力部と、出力部とを有する演算装置２３４と、第１入力部と、第２入力部と、制御入力部と、出力部とを有するマルチプレクサ２３０とを含んでいることが好ましい。記憶／計数ロジック２０２は、それぞれ入力部と、出力部と、アドレス選択入力部と、イネーブル入力部とを有する第３ＲＡＭ２２０と第４ＲＡＭ２２２とを含んでいることが好ましい。アドレスマルチプレクサ２０６は、第１入力部と、第２入力部と、第３入力部と、制御入力部と、出力部とを有するマルチプレクサを含んでいることが好ましい。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、二重単方向クロスバー結合部と、図１４を用いてすでに説明した入力部と出力部とを有する従来型クロスバースイッチネットワークを含んでいることが好ましい。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００の効率的な実動化には、マルチプレクサと、３値バッファと、論理ブロック（ＣＬＢ）ベースロジックと、直接配線と、または再構成結合部によって結合されたこのような構成部分のサブセットが含まれる。外部ループ命令セットアーキテクチャ（ＩＳＡ）については、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は最短時間で逐次データ移動を最大化するよう実装されることが好ましいが、汎用外部ループアドレス演算命令を支援するために最大数の一意のアドレス移動クロスバー結合部も提供する。
【００９６】
第３ＲＡＭ２２０の入力部と第４ＲＡＭ２２２との入力部とは、それぞれ第２アドレスライン２１２を経てアドレス演算装置（ＡＯＵ）クロスバースイッチ２００の出力部に結合されている。第３ＲＡＭ２２０と第４ＲＡＭ２２２とのアドレス選択入力部は、第３制御ライン７４を経て命令取出し装置（ＩＦＵ）６０からレジスタファイル（ＲＦ）アドレスを受取るよう結合されている。第３ＲＡＭ２２０と第４ＲＡＭ２２２とのイネーブル入力部は、第２制御ライン７２を経てアドレス演算装置（ＡＯＵ）制御信号を受取るよう結合されている。第３ＲＡＭ２２０の出力部は、マルチプレクサ２３０の第１入力部と、演算装置２３４の第１入力部と、アドレスマルチプレクサ２０６の第１入力部とに結合されている。同様に、第４ＲＡＭ２２２の出力部は、マルチプレクサ２３０の第２入力部と、演算装置２３４の第２入力部と、アドレスマルチプレクサ２０６の第２入力部とに結合されている。マルチプレクサ２３０と、ＮＩＰＡＲ２３２と、演算装置２３４との制御入力部は、それぞれ第２制御ライン７２に結合されている。演算装置２３４の出力部は、アドレス演算ロジック２０４の出力部を形成しており、したがって、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００のアドレスフィードバック入力部とアドレスマルチプレクサ２０６の第３入力部とに結合されている。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００とアドレスマルチプレクサ２０６との残りの入力部と出力部への結合部は、図１４を用いてすでに説明したものと同一である。
【００９７】
アドレス演算命令の実行を容易にするために、命令取出し装置（ＩＦＵ）６０は命令状態シーケンサ（ＩＳＳ）が状態ＥまたはＭの時に、アドレス演算装置（ＡＯＵ）制御信号と、レジスタファイル（ＲＦ）アドレスと、定数とをアドレス演算装置（ＡＯＵ）６４に発する。第３ＲＡＭ２２０と第４ＲＡＭ２２２とは、それぞれアドレスの一時記憶のための第１及び第２レジスタファイルを提供する。第３ＲＡＭ２２０と第４ＲＡＭ２２２内の各記憶位置は、各ＲＡＭのアドレス選択入力部で受取ったレジスタファイル（ＲＦ）アドレスに従って選択される。第３ＲＡＭ２２０と第４ＲＡＭ２２２とのローディングは、その書込みイネーブル入力部で第３ＲＡＭ２２０と第４ＲＡＭ２２２とがそれぞれ受取るアドレス演算装置（ＡＯＵ）制御によって制御される。マルチプレクサ２３０は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号の指示に従って、第３ＲＡＭ２２０と第４ＲＡＭ２２２とによるアドレス出力をＮＩＰＡＲ２３２に選択的にルーティングする。ＮＩＰＡＲ２３２は、マルチプレクサ２３０の出力部から受取ったアドレスをロードしてその制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に応じてその内容をインクリメントする。好ましい実施例では、ＮＩＰＡＲ２３２は、実行すべき次のプログラム命令のアドレスを記憶する。演算装置２３４は、第３ＲＡＭ２２０と第４ＲＡＭ２２２とから受取ったアドレスに対して、及び／またはＮＩＰＡＲ２３２の内容に対して加算、減算、インクリメント、及びデクリメントを含む算術演算を実行する。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は選択的に、
１）アドレスをメモリ３４から第３ＲＡＭ２２０と第４ＲＡＭ２２２とへルーティングし、
２）演算装置２３４によるアドレス計算出力の結果をメモリ３４または第３ＲＡＭ２２０と第４ＲＡＭ２２２とへルーティングする。
アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に従って特定のルーティング演算を実行する。アドレスマルチプレクサ２０６は、この制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御の指示に従って第３ＲＡＭ２２０によるアドレス出力と、第４ＲＡＭ２２２によるアドレス出力と、または演算装置２３４によるアドレス計算出力の結果とをアドレス演算装置（ＡＯＵ）のアドレス出力部に選択的にルーティングする。
【００９８】
好ましい実施例では、第３ＲＡＭ２２０と第４ＲＡＭ２２２とはそれぞれ１組の論理ブロック（ＣＬＢ）内に存在するデータ記憶回路を用いて実動化される。マルチプレクサ２３０とアドレスマルチプレクサ２０６とは、それぞれ１組の論理ブロック（ＣＬＢ）内に存在するデータ選択回路を用いて実動化されるのが好ましく、ＮＩＰＡＲ２３２は１組の論理ブロック（ＣＬＢ）内に存在するデータ記憶回路を用いて実装されることが好ましい。演算装置２３４は、１組の論理ブロック（ＣＬＢ）内の数学的演算用の論理関数発生器と回路とを用いて実装されることが好ましい。最後に、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、すでに説明した方法で実装されることが好ましい。
【００９９】
図１６は、内部ループ命令セットアーキテクチャ（ＩＳＡ）の実装のために構成されたアドレス演算装置（ＡＯＵ）６６の第２模範実施例の構成図である。内部ループ命令セットアーキテクチャ（ＩＳＡ）はきわめて限られた数のアドレス演算を実行するためのハードウェアを必要とし、また少なくとも１個のソースアドレスポインタと、対応する数の宛先アドレスポインタを維持するためのハードウェアを必要とする。きわめて限られた数のアドレス演算または１個のアドレス演算が必要な内部ループ処理の種類には、画像データのブロック演算、ラスター演算、またはサーペンタイン演算と、ビットリバーサル演算と、環状バッファデータに対する演算と、可変長データパーシング演算とが含まれる。ここでは、１回のアドレス演算すなわち、インクリメント演算を検討する。当業者は、インクリメント演算を実行するハードウェアが本来デクリメント演算も実行でき、これによってさらに別のアドレス演算能力が得られることを認めるであろう。アドレス演算装置（ＡＯＵ）６６の第２模範実施例では、記憶／計数ロジック２０２は、入力部と、出力部と、制御入力部とを有する少なくとも１個のソースアドレスレジスタ２５２と、入力部と、出力部と、制御入力部とを有する少なくとも１個の宛先アドレスレジスタ２５４と、入力部と、制御入力部と、現存のソースアドレスレジスタ２５２及び宛先アドレスレジスタ２５４の総数に等しい数の出力部とを有するデータセレクタ２５０とを含んでいる。ここでは、１個のソースアドレスレジスタ２５２と、１個の宛先アドレスレジスタ２５４を検討する。したがって、データセレクタ２５０は、第１出力部と第２出力部とを含んでいる。アドレス演算ロジック２０４は、入力部と、出力部と、制御入力部とを有するＮＩＰＡＲ２３２と、データセレクタと等しい数の入力部と、制御入力部と、出力部とを有するマルチプレクサ２６０とを含んでいる。ここでマルチプレクサ２６０は、第１入力部と第２入力部とを含んでいる。アドレスマルチプレクサ２０６は、データセレクタ出力部より１つ多い入力部と、制御入力部と、出力部とを有するマルチプレクサを含んでいることが好ましい。したがってここでは、アドレスマルチプレクサ２０６は、第１入力部と、第２入力部と、第３入力部とを含んでいる。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、双方向及び単方向クロスバー結合部を有し、また図１４を用いてすでに説明した入力部と出力部とを有する従来型のクロスバースイッチネットワークを含んでいることが好ましい。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００の効率的な実動化には、マルチプレクサと、３値バッファと、論理ブロック（ＣＬＢ）ベースロジックと、直接配線と、または再構成結合部によって結合されたこのような構成部分のサブセットが含まれる。内部ループ命令セットアーキテクチャ（ＩＳＡ）については、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は最短時間で並列アドレス移動を最大にするよう実動化されるのが好ましいが、内部ループ操作コードを支援するために、最大数の一意のアドレス移動クロスバー結合部も提供する。
【０１００】
データセレクタ２５０の入力部は、アドレス演算装置（ＡＯＵ）クロスバースイッチ２００の出力部に結合されている。データセレクタ２５０の第１及び第２出力部は、それぞれソースアドレスレジスタ２５２の入力部と宛先アドレスレジスタ２５４の入力部とに結合されている。ソースアドレスレジスタ２５２と宛先アドレスレジスタ２５４との制御入力部は、第２制御ライン７２を経てアドレス演算装置（ＡＯＵ）制御信号を受取るために結合されている。ソースアドレスレジスタ２５２の出力部は、マルチプレクサ２６０の第１入力部とアドレスマルチプレクサ２０６の第１入力部とに結合されている。同様に、宛先アドレスレジスタ２５４の出力部は、マルチプレクサ２６０の第２入力部とアドレスマルチプレクサ２０６の第２入力部とに結合されている。ＮＩＰＡＲ２３２の入力部は、マルチプレクサ２６０の出力部に結合されており、ＮＩＰＡＲ２３２の制御入力部は第２制御ライン７２を経てアドレス演算装置（ＡＯＵ）制御信号を受取るために結合されており、ＮＩＰＡＲ２３２の出力部はアドレス演算装置（ＡＯＵ）クロスバースイッチ２００のアドレスフィードバック入力部とアドレスマルチプレクサ２０６の第３入力部とに結合されている。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００の残りの入力部と出力部とへの結合部は、図１４を用いて上記に説明したものと同一である。
【０１０１】
演算では、データセレクタ２５０は、その制御入力部で受取ったレジスタファイル（ＲＦ）アドレスに従ってアドレス演算装置（ＡＯＵ）クロスバースイッチから受取ったアドレスをソースアドレスレジスタ２５２または宛先アドレスレジスタ２５４にルーティングする。ソースアドレスレジスタ２５２は、その制御入力部に存在するアドレス演算装置（ＡＯＵ）制御信号に応じてその入力部に存在するアドレスをロードする。宛先アドレスレジスタ２５４は、同様の方法でその入力部に存在するアドレスをロードする。マルチプレクサ２６０は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に従ってソースアドレスレジスタ２５２または宛先アドレスレジスタ２５４から受取ったアドレスをＮＩＰＡＲ２３２の入力部にルーティングする。ＮＩＰＡＲ２３２は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に応じてその入力部に存在するアドレスをロードし、その内容をインクリメントするか、またはデクリメントする。アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は選択的に、
１）アドレスをメモリ３４からデータセレクタ２５０にルーティングし、
２）ＮＩＰＡＲ２３２の内容をメモリ３４またはデータセレクタ２５０にルーティングする。
アドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号に従って特定のルーティング演算を実行する。アドレスマルチプレクサ２０６は、その制御入力部で受取ったアドレス演算装置（ＡＯＵ）制御信号の指示に従って、ソースアドレスレジスタ２５２、宛先アドレスレジスタ２５４、またはＮＩＰＡＲ２３２の内容をアドレス演算装置（ＡＯＵ）のアドレス出力部に選択的にルーティングする。
【０１０２】
好ましい実施例では、ソースアドレスレジスタ２５２と宛先アドレスレジスタ２５４とは、それぞれ１組の論理ブロック（ＣＬＢ）内に存在するデータ記憶回路を用いて実動化される。ＮＩＰＡＲ２３２は、１組の論理ブロック（ＣＬＢ）内のインクリメント／デクリメントロジック及びフリップフロップを用いて実動化されることが好ましい。データセレクタ２５０と、マルチプレクサ２３０と、アドレスマルチプレクサ２０６とは、それぞれ１組の論理ブロック（ＣＬＢ）内に存在するデータ選択回路を用いて実動化されることが好ましい。最後にアドレス演算装置（ＡＯＵ）クロスバースイッチ２００は、内部ループ命令セットアーキテクチャ（ＩＳＡ）についてすでに述べた方法で実動化されることが好ましい。当業者は、一部のアプリケーションでは、外部ループデータ演算装置（ＤＯＵ）構成を備えた内部ループアドレス演算装置（ＡＯＵ）構成に依存する命令セットアーキテクチャ（ＩＳＡ）を用いることが、またはその逆（内部ループアドレス演算装置（ＡＯＵ）構成を備えた外部ループデータ演算装置（ＤＯＵ）構成）が有利であることを認めるであろう。たとえば連想ストリング探索命令セットアーキテクチャ（ＩＳＡ）は、外部ループアドレス演算装置（ＡＯＵ）構成を備えた内部ループデータ演算装置（ＤＯＵ）構成を利用すると有利であろう。別の例として、ヒストグラム演算を実行するための命令セットアーキテクチャ（ＩＳＡ）は、内部ループアドレス演算装置（ＡＯＵ）構成を備えた外部ループデータ演算装置（ＤＯＵ）構成を利用すると有利であろう。
【０１０３】
有限の再構成ハードウェアリソースを、動的再構成処理装置（ＤＲＰＵ）３２の各構成部分間で割当てなければならない。再構成ハードウェアリソースは数が限られているので、たとえばこれらを命令取出し装置（ＩＦＵ）６０に割当てるとデータ演算装置（ＤＯＵ）６２及びアドレス演算装置（ＡＯＵ）６４によって達成可能な最大計算性能レベルに影響を与える。再構成ハードウェアリソースを、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との間で割当てる方法は、任意の瞬間で実装される命令セットアーキテクチャ（ＩＳＡ）の種類に応じて異なる。命令セットアーキテクチャ（ＩＳＡ）が複雑になると、次第に複雑になる復号演算及び制御演算を容易に行うために、より多くの再構成ハードウェアリソースを命令取出し装置（ＩＦＵ）６０に割当てなければならなくなり、データ演算装置（ＤＯＵ）６２とアドレス演算装置（ＡＯＵ）６４との間で利用できる再構成ハードウェアリソースは少なくなる。したがって、データ演算装置（ＤＯＵ）６２とアドレス演算装置（ＡＯＵ）６４とによって達成可能な最大計算性能は、命令セットアーキテクチャ（ＩＳＡ）の複雑性が増すと低下する。一般に、外部ループ命令セットアーキテクチャ（ＩＳＡ）は内部ループ命令セットアーキテクチャ（ＩＳＡ）より多くの命令を含み、したがってその実装は、復号回路と制御回路においてかなり複雑となる。たとえば汎用６４ビットプロセッサを規定する外部ループ命令セットアーキテクチャ（ＩＳＡ）は、データ圧縮のみに用いられる内部ループ命令セットアーキテクチャ（ＩＳＡ）より多くの命令を含むことになると考えられる。
【０１０４】
図１７（ａ）は、外部ループ命令セットアーキテクチャ（ＩＳＡ）のための、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との間での再構成ハードウェアリソースの模範割当てを示す図である。外部ループ命令セットアーキテクチャ（ＩＳＡ）のための再構成ハードウェアリソースの模範割当てでは、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４はそれぞれ利用できる再構成ハードウェアリソースの約３分の１を割当てられる。内部ループ命令セットアーキテクチャ（ＩＳＡ）を実装するために動的再構成処理装置（ＤＲＰＵ）３２を再構成すべきときには、内部ループ命令セットアーキテクチャ（ＩＳＡ）によって支援される命令の数とアドレス命令の種類が限られるため、命令取出し装置（ＩＦＵ）６０とアドレス演算装置（ＡＯＵ）６４とを実装するのに必要な再構成ハードウェアリソースは少なくて済む。図１７（ｂ）は、内部ループ命令セットアーキテクチャ（ＩＳＡ）のための、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との間での再構成ハードウェアリソースの模範割当てを示す図である。内部ループ命令セットアーキテクチャ（ＩＳＡ）のための再構成ハードウェアリソースの模範割当てでは、命令取出し装置（ＩＦＵ）６０は再構成ハードウェアリソースの約５〜１０％を用いて実装され、アドレス演算装置（ＡＯＵ）６４は再構成ハードウェアリソースの約１０〜２５％を用いて実装される。したがって、再構成ハードウェアリソースの約７０〜８０％はデータ演算装置（ＤＯＵ）６２の実装に利用できる。このことは、内部ループ命令セットアーキテクチャ（ＩＳＡ）に関連したデータ演算装置（ＤＯＵ）６２の内部構造が、内部ループ命令セットアーキテクチャ（ＩＳＡ）に関連したデータ演算装置（ＤＯＵ）６２の内部構造より複雑であってもよく、したがってはるかに高い性能を発揮できることを意味している。
【０１０５】
当業者は、別の実施例で動的再構成処理装置（ＤＲＰＵ）３２がデータ演算装置（ＤＯＵ）６２またはアドレス演算装置（ＡＯＵ）６４を除外できることを認めるであろう。たとえば別の実施例では、動的再構成処理装置（ＤＲＰＵ）３２はアドレス演算装置（ＡＯＵ）６４を含まなくても良い。その場合、データ演算装置（ＤＯＵ）６２はデータとアドレスの両方に対して演算を実行することになる。（上記で）検討した特定の動的再構成処理装置（ＤＲＰＵ）実施例とは無関係に、動的再構成処理装置（ＤＲＰＵ）３２の各構成部分を実装するために有限数の再構成ハードウェアリソースを割当てなければならない。再構成ハードウェアリソースは、利用できる再構成ハードウェアリソースの全スペースに対して現在検討中の命令セットアーキテクチャ（ＩＳＡ）について最適のまたは最適に近い能力が達成されるように割当てるのが好ましい。
【０１０６】
当業者は、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４との各構成部分の詳細な構造が上記に説明した実施例に限定されないことを認めるであろう。所定の命令セットアーキテクチャ（ＩＳＡ）について、対応する構成データセットは、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４内の各構成部分の内部構造が、利用できる再構成ハードウェアリソースに対して計算性能を最大限にするように定められるのが好ましい。
【０１０７】
図１８は、Ｔマシンの好ましい実施例の構成図である。Ｔマシン１４は、第２ローカルタイムベース装置３００と、共用インタフェース制御装置（ＣＩＣＵ：Common Interface and Control Unit）３０２と、１組の相互結合入出力装置３０４とを含んでいる。第２ローカルタイムベース装置３００は、Ｔマシンのマスタタイミング入力部を形成するタイミング入力部を含んでいる。共通インタフェース制御装置（ＣＩＣＵ）３０２は、第２タイミング信号ライン３１０を経て第２ローカルタイムベース装置３００のタイミング出力部に結合されたタイミング入力部と、アドレスライン４４に結合されたアドレス出力部と、メモリ入出力ライン４６に結合された第１双方向制御ポートと、外部制御ライン４８に結合された双方向制御ポートと、メッセージ転送ライン３１２を経て現存する各相互結合入出力装置３０４の双方向データポートに結合された第２双方向データポートとを含んでいる。各相互結合入出力装置３０４は、メッセージ入力ライン３１４を経て汎用相互結合マトリックス（ＧＰＩＭ）１６に結合された入力部と、メッセージ出力ライン３１６を経て汎用相互結合マトリックス（ＧＰＩＭ）１６に結合された出力部とを含んでいる。
【０１０８】
Ｔマシン１４内の第２ローカルタイムベース装置３００は、マスタタイムベース装置２２からマスタタイミング信号を受取り、第２ローカルタイミング信号を生成する。第２ローカルタイムベース装置３００は、第２ローカルタイミング信号を共通インタフェース制御装置（ＣＩＣＵ）３０２に送り、これによってそれが存在するＴマシン１４についてのタイミング基準を提供する。第２ローカルタイミング信号は、マスタタイミング信号と位相同期であることが好ましい。システム１０では、各Ｔマシンの第２ローカルタイムベース装置３００は同一の周波数で作動することが好ましい。当業者は、別の実施例では、１個またはそれ以上の第２ローカルタイムベース装置３００が異なる周波数で作動することを認めるであろう。第２ローカルタイムベース装置３００は、論理ブロック（ＣＬＢ）ベース位相ロック検出回路を含む従来型の位相ロック周波数変換回路を用いて実装されることが好ましい。当業者は、別の実施例では第２ローカルタイムベース装置３００がクロック分散ツリーの一部として実装できることを認めるであろう。
【０１０９】
共通インタフェース制御装置（ＣＩＣＵ）３０２は、その対応するＳマシン１２と特定の相互結合入出力装置３０４との間のメッセージの転送を指示し、このメッセージにはコマンドとおそらくデータとが含まれる。好ましい実施例では、指定された相互結合入出力装置３０４がシステム１０の内部または外部にあるいずれかのＴマシン１４または入出力Ｔマシン１８内に存在しても良い。好ましい実施例では、各相互結合入出力装置３０４は相互結合入出力装置３０４を一意的に識別する相互結合アドレスを割当てられているのが好ましい。所定のＴマシン内の相互結合入出力装置３０４についての相互結合アドレスは、対応するＳマシンのアーキテクチャ記述メモリ１０１に記憶される。
【０１１０】
共通インタフェース制御装置（ＣＩＣＵ）３０２は、それぞれメモリ入出力ライン４６と外部制御信号ライン４８とを経てその対応するＳマシン１２からデータとコマンドとを受取る。受取った各コマンドは、目的相互結合アドレスと実行すべき特定の種類の演算を指定するコマンドコードとを含んでいることが好ましい。好ましい実施例では、コマンドコードによって一意的に識別される種類の演算には、
１）データ読出し演算と、
２）データ書込み演算と、
３）再構成割込み転送を含む割込み信号転送と、
を含んでいる。目的相互結合アドレスは、データとコマンドとを転送すべき目的相互結合入出力装置３０４を識別する。共通インタフェース制御装置（ＣＩＣＵ）３０２は、従来の方法で１組のパケットベースメッセージとして各コマンドと関連データを転送することが好ましく、各メッセージには目的相互結合アドレスとコマンドコードとが含まれている。
【０１１１】
共通インタフェース制御装置（ＣＩＣＵ）３０２は、その対応するＳマシン１２からデータとコマンドとを受取る他に、メッセージ転送ライン３１２に結合された各相互結合入出力装置３０４からメッセージを受取る。好ましい実施例では、共通インタフェース制御装置（ＣＩＣＵ）３０２は、関連メッセージグループを単一のコマンドとデータシーケンスに変換する。コマンドがその対応するＳマシン１２内の動的再構成処理装置（ＤＲＰＵ）３２に向けられているときには、共通インタフェース制御装置（ＣＩＣＵ）３０２は外部制御信号ライン４８を経てコマンドを発する。コマンドがその対応するＳマシン１２内のメモリ３４に向けられているときには、共通インタフェース制御装置（ＣＩＣＵ）３０２は外部制御信号ライン４８を経て適切なメモリ制御信号を発し、またメモリアドレスライン４４を経てメモリアドレス信号を発する。データは、メモリ入出力ライン４６を経て転送される。好ましい実施例では、共通インタフェース制御装置（ＣＩＣＵ）３０２は、ＡＮＳＩ／ＩＥＥＥ規格１５９６−１９９２に定められた従来型のＳＣＩ切替装置によって実行される演算に類似した演算を実行するための論理ブロック（ＣＬＢ）ベース回路を含んでいる。
【０１１２】
各相互結合入出力装置３０４は、共通インタフェース制御装置（ＣＩＣＵ）３０２からメッセージを受取り、共通インタフェース制御装置（ＣＩＣＵ）３０２から受取った制御信号の指示に従って、このメッセージを汎用相互結合マトリックス（ＧＰＩＭ）１６を経て別の相互結合入出力装置３０４に転送する。好ましい実施例では、相互結合入出力装置３０４は、ＡＮＳＩ／ＩＥＥＥ規格１５９６−１９９２に定められたＳＣＩノードに基づいている。図１９は、相互結合入出力装置３０４の好ましい実施例の構成図である。相互結合入出力装置３０４は、アドレス復号器３２０と、入力ＦＩＦＯバッファ３２２と、バイパスＦＩＦＯバッファ３２４と、出力ＦＩＦＯバッファ３２６と、マルチプレクサ３２８とを含んでいる。アドレス復号器３２０は、相互結合入出力装置の入力部を形成する入力部と、入力ＦＩＦＯバッファ３２２に結合された第１出力部と、バイパスＦＩＦＯバッファ３２４に結合された第２出力部とを含んでいる。入力ＦＩＦＯバッファ３２２は、メッセージを共通インタフェース制御装置（ＣＩＣＵ）３０２に転送するためのメッセージ転送ライン３１２に結合された出力部を含んでいる。出力ＦＩＦＯバッファ３２６は、共通インタフェース制御装置（ＣＩＣＵ）３０２からメッセージを受取るためのメッセージ転送ライン３１２に結合された入力部と、マルチプレクサ３２８の第１入力部に結合された出力部とを含んでいる。バイパスＦＩＦＯバッファ３２４は、マルチプレクサ３２８の第２入力部に結合された出力部を含んでいる。最後にマルチプレクサ３２８は、メッセージ転送ライン３１２に結合された制御入力部と、相互結合入出力装置の出力部を形成する出力部とを含んでいる。
【０１１３】
相互結合入出力装置３０４は、アドレス復号器３２０の入力部でメッセージを受取る。アドレス復号器３２０は、受取ったメッセージで指定されている目的相互結合アドレスがそれが存在する相互結合入出力装置３０４の相互結合アドレスと同一であるかどうかを決定する。同一の場合には、アドレス復号器３２０はこのメッセージを入力ＦＩＦＯバッファ３２２にルーティングする。そうでなければアドレス復号器３２０は、メッセージをバイパスＦＩＦＯバッファ３２４にルーティングする。好ましい実施例では、アドレス復号器３２０は入出力ブロック（ＩＯＢ）と論理ブロック（ＣＬＢ）を用いて実装された復号器とデータセレクタとで構成される。
【０１１４】
入力ＦＩＦＯバッファ３２２は、その入力部で受取ったメッセージをメッセージ転送ライン３１２に転送する従来型のＦＩＦＯバッファである。バイパスＦＩＦＯバッファ３２４と出力ＦＩＦＯバッファ３２６とは、いずれもその入力部で受取ったメッセージをマルチプレクサ３２８に転送する従来型のＦＩＦＯバッファである。マルチプレクサ３２８は、その制御入力部で受取った制御信号に従って、バイパスＦＩＦＯバッファ３２４から受取ったメッセージまたは出力ＦＩＦＯバッファ３２６から受取ったメッセージを汎用相互結合マトリックス（ＧＰＩＭ）１６にルーティングする従来型のマルチプレクサである。好ましい実施例では、入力ＦＩＦＯバッファ３２２と、バイパスＦＩＦＯバッファ３２４と、出力ＦＩＦＯバッファ３２６とはそれぞれ１組の論理ブロック（ＣＬＢ）を用いて実装される。マルチプレクサ３２８は、１組の論理ブロック（ＣＬＢ）と入出力ブロック（ＩＯＢ）とを用いて実装されることが好ましい。
【０１１５】
図２０は、入出力Ｔマシン１８の好ましい実施例の構成図である。入出力Ｔマシン１８は、第３ローカルタイムベース装置３６０と、共通カスタムインタフェース制御装置３６２と、相互結合入出力装置３０４とを含んでいる。第３ローカルタイムベース装置３６０は、入出力Ｔマシンのマスタタイミング入力部を形成するタイミング入力部を含んでいる。相互結合入出力装置３０４は、メッセージ入力部ライン３１４を経て汎用相互結合マトリックス（ＧＰＩＭ）１６に結合された入力部と、メッセージ出力ライン３１６を経て汎用相互結合マトリックス（ＧＰＩＭ）１６に結合された出力部とを含んでいる。共通カスタムインタフェース制御装置３６２は、第３タイミング信号ライン３７０を経て第３ローカルタイムベース装置３６０のタイミング出力部に結合されたタイミング入力部と、相互結合入出力装置３０４の双方向データポートに結合された第１双方向データポートと、入出力装置２０への１組の結合部とを含んでいる。好ましい実施例では、入出力装置２０への１組の結合部は、入出力装置２０の双方向データポートに結合された第２双方向データポートと、入出力装置２０のアドレス入力部に結合されたアドレス出力部と、入出力装置２０の双方向制御ポートに結合された双方向制御ポートとを含んでいる。当業者は、共通カスタムインタフェース制御装置３６２が結合されている入出力装置２０の種類によって、入出力装置２０への結合部が定まることを容易に認めるであろう。
【０１１６】
第３ローカルタイムベース装置３６０は、マスタタイムベース装置２２からマスタタイミング信号を受取り、第３ローカルタイミング信号を生成する。第３ローカルタイムベース装置３６０は、第３ローカルタイミング信号を共通カスタムインタフェース制御装置３６２へ送り、それが配置されている入出力Ｔマシンにタイミング基準を提供する。好ましい実施例では、第３ローカルタイミング信号はマスタタイミング信号と位相同期している。各入出力Ｔマシンの第３ローカルタイムベース装置３６０は、同一の周波数で作動するのが好ましい。別の実施例では、１個またはそれ以上の第３ローカルタイムベース装置３６０は異なる周波数で作動することができる。第３ローカルタイムベース装置３６０は、論理ブロック（ＣＬＢ）ベース位相ロック検出回路を含む従来型の位相ロック周波数変換回路を用いて実装するのが好ましい。第１ローカルタイムベース装置３０及び第２ローカルタイムベース装置３００と同様の方法で、別の実施例では第３ローカルタイムベース装置３６０をクロック分散ツリーの一部として実装することができる。
【０１１７】
入出力Ｔマシン１８内の相互結合入出力装置３０４の構造と機能は、Ｔマシン１４についてすでに説明したものと同一であることが好ましい。入出力Ｔマシン１８内の相互結合入出力装置３０４は、任意のＴマシン１４内の各相互接続入出力装置３０４の場合と類似した方法でユニークな相互結合アドレスが割当てられる。
【０１１８】
共通カスタムインタフェース制御装置３６２は、それに結合された入出力装置２０と相互結合入出力装置３０４との間のメッセージの転送を指示し、このメッセージにはコマンドとおそらくデータとが含まれる。共通カスタムインタフェース制御装置３６２は、その対応する入出力装置２０からデータとコマンドとを受取る。入出力装置２０から受取った各コマンドは、目的相互結合アドレスと実行すべき特定の種類の演算を指定するコマンドコードとを含んでいることが好ましい。好ましい実施例では、コマンドコードによって一意的に識別される演算の種類には、
１）データ要求と、
２）データ転送確認と、
３）割込み信号転送と、
が含まれる。目的相互結合アドレスは、データとコマンドとを転送すべきシステム１０内の目的相互結合入出力装置３０４を識別する。共通カスタムインタフェース制御装置３６２は、従来の方法で１組のパケットベースメッセージとして各コマンドと関連データを転送することが好ましく、各メッセージには目的相互結合アドレスとコマンドコードとが含まれている。
【０１１９】
共通カスタムインタフェース制御装置３６２は、その対応する入出力装置２０からデータとコマンドとを受取る他に、その関連する入出力装置２０からメッセージを受取る。好ましい実施例では、共通カスタムインタフェース制御装置３６２は、その対応する入出力装置２０に支援される通信プロトコルに従って、関連メッセージグループを単一のコマンド及びデータシーケンスに変換する。好ましい実施例では、共通カスタムインタフェース制御装置３６２は、ＡＮＳＩ／ＩＥＥＥ規格１５９６−１９９２に定められた従来型のＳＣＩ切替装置によって実行される演算と類似した演算を実行するための論理ブロック（ＣＬＢ）ベース回路に結合された論理ブロック（ＣＬＢ）ベース入出力装置コントローラを含んでいる。
【０１２０】
汎用相互結合マトリックス（ＧＰＩＭ）１６は、相互結合入出力装置３０４の間の２点間並列メッセージルーティングを容易に行えるようにする従来型の相互結合メッシュである。好ましい実施例では、汎用相互結合マトリックス（ＧＰＩＭ）１６はワイヤーベースでｋ−ａｒｙのｎキューブの静的相互結合ネットワークである。図２１は、汎用相互結合マトリックス（ＧＰＩＭ）１６の模範実施例の構成図である。図２１では、汎用相互結合マトリックス（ＧＰＩＭ）１６は、複数の第１通信チャネル３８０と、複数の第２通信チャネル３８２とを含む、ｋ−ａｒｙの２キューブの環状体相互結合メッシュである。各第１通信チャネル３８０は、複数のノード接続部３８４を含んでおり、各第２通信チャネル３８２も同様に含んでいる。システム１０の各相互結合入出力装置３０４は、メッセージ入力ライン３１４と、メッセージ出力ライン３１６とが、所定の第１通信チャネル３８０と第２通信チャネル３８２内とで連続ノード接続部３８４と接続するように汎用相互結合マトリックス（ＧＰＩＭ）１６に結合されているのが好ましい。好ましい実施例では、各Ｔマシン１４は上記に説明した方法で第１通信チャネル３８０に結合された相互結合入出力装置３０４と、第２通信チャネル３８２に結合された相互結合入出力装置３０４とを含んでいる。Ｔマシン１４内の共通インタフェース制御装置（ＣＩＣＵ）３０２は、第１通信チャネル３８０に結合されたその相互結合入出力装置３０４と、第２通信チャネル３８２に結合されたその相互結合入出力装置３０４との間の情報のルーティングを容易に行えることが好ましい。したがって、図２１で３８０ｃと表記されている第１通信チャネル３８０に結合された相互結合入出力装置３０４と、３８２ｃと表記されている第２通信チャネル３８２に結合された相互結合入出力装置３０４とを含むＴマシン１４については、このＴマシンの共通インタフェース制御装置（ＣＩＣＵ）３０２は、第１通信チャネル３８０ｃと第２通信チャネル３８２ｃとの間の情報ルーティングを容易に行える。
【０１２１】
したがって汎用相互結合マトリックス（ＧＰＩＭ）１６は、並列に配置された相互結合入出力装置３０４間の複数のメッセージのルーティングを容易に行える。図２１の２次元汎用相互結合マトリックス（ＧＰＩＭ）１６については、各Ｔマシン１４は第１通信チャネル３８０について１個の相互結合入出力装置３０４を、また第２通信チャネル３８２について１個の相互結合入出力装置３０４を含んでいることが好ましい。当業者は、汎用相互結合マトリックス（ＧＰＩＭ）１６の次元が２次元を越える実施例では、Ｔマシン１４が２個を超える相互結合入出力装置３０４を含んでいることが好ましいことを認めるであろう。汎用相互結合マトリックス（ＧＰＩＭ）１６は、１６ビットデータパスを含むｋ−ａｒｙの２キューブとして実装されることが好ましい。
【０１２２】
上記の説明では、本発明の各種構成部分は、再構成ハードウェアリソースを用いて実装されることが好ましい。再構成論理装置のメーカーは、一般に再プログラマブルハードウェアリソースまたは再構成ハードウェアリソースを用いて従来型のデジタルハードウェアを実装するための指針を公表している。たとえば１９９４年度のＸｉｌｉｎｘプログラマブル論理装置データブック（xilinx, Inc., サンノゼ，カリフォルニア）には、次のようなアプリケーションノートが含まれている。すなわち、アプリケーションノートＸＡＰＰ００５．００２「レジスタベースＦＩＦＯ」、アプリケーションノートＸＡＰＰ０４４．００「高性能ＲＡＭベースＦＩＦＯ」、アプリケーションノートＸＡＰＰ０１３．００１「ＸＣ４０００での桁上げ専用ロジックの使用」、アプリケーションノートＸＡＰＰ０１８．００「ＸＣ４０００加算器とカウンタの性能の推定」、アプリケーションノートＸＡＰＰ０２８．００１「位相ロックループのための周波数／位相コンパレータ」、アプリケーションノートＸＡＰＰ０３１．０００「ＸＣ４０００ＲＡＭの使用」、アプリケーションノートＸＡＰＰ０３６．００１「４ポートＤＲＡＭコントロ−ラ．．．」、アプリケーションノートＸＡＰＰ０３９．００１「１８ビットパイプライン累算器」の各アプリケーションノートである。Ｘｉｌｉｎｘ社が公表している資料には、さらにＸｉｌｉｎｘプログラマブルロジックのユーザーのための季刊誌である「ＸＣＥＬＬ」に含まれる記事がある。たとえば１９９４年の第３号（通刊第１４号）には高速整数乗算器の実装に関する詳しい記事が掲載されている。
【０１２３】
この明細書で説明しているシステム１０は、動的に実装される多重命令セットアーキテクチャ（ＩＳＡ）のための拡張性、並列コンピュータアーキテクチャである。どのＳマシン１２も、別のＳマシン１２やホストコンピュータなどの外部ハードウェアリソースとは無関係に、それだけでコンピュータプログラム全体を実行することができる。どのＳマシン１２においても、多重命令セットアーキテクチャ（ＩＳＡ）は再構成割込み及び／またはプログラムに埋込まれた再構成指示に応じて、プログラム実行中に連続的に実装される。システム１０は多重Ｓマシン１２を含んでいるのが好ましいので、複数のプログラムが同時に実行されるのが好ましく、各プログラムは独立したものでも良い。したがって、システム１０が多重Ｓマシン１２を含んでいるのが好ましいので、多重命令セットアーキテクチャ（ＩＳＡ）はシステム初期化または再構成中以外はいつでも同時に（すなわち、並列に）実動化される。すなわち、任意の時間に複数のセットのプログラム命令が同時に実行され、プログラム命令の各セットは対応する命令セットアーキテクチャ（ＩＳＡ）に従って実行される。このような命令セットアーキテクチャ（ＩＳＡ）は、それぞれ一意なものである。
【０１２４】
Ｓマシン１２は、（複数の）Ｔマシン１４と、汎用相互結合マトリックス（ＧＰＩＭ）１６と、各入出力Ｔマシン１８とを経て、互いに、また入出力装置２０と通信する。各Ｓマシン１２は、独立した演算を実行できる、それ自体完全なコンピュータであるが、どのＳマシン１２もその他のＳマシン１２またはシステム１０全体についてマスタＳマシン１２として機能することができ、データ及び／またはコマンドをその他のＳマシン１２に、１個またはそれ以上のＴマシン１６に、１個またはそれ以上の入出力Ｔマシン１８に、１個またはそれ以上の入出力装置２２に送ることができる。
【０１２５】
したがって、本発明のシステム１０は、空間的及び時間的に１つまたはそれ以上のデータ並列（サブ）問題に分割できる問題、たとえば画像処理、医療用データ処理、校正済みカラーマッチング、データベース計算、ドキュメントの処理、連想探索エンジン、及びネットワークサーバについて特に有用である。オペランド列が多い計算問題については、並列計算法によって効率的な計算の高速化が得られるようにアルゴリズムを適用できるときには、データが並列していることになる。データ並列問題は既知の複雑さを含んでいて、これはＯ（ｎ^k）で表される。ｋの値は問題によって定まる。たとえば画像処理ではｋ＝２であり、医療用データ処理ではｋ＝３である。本発明では、各Ｓマシン１２はプログラム命令グループのレベルでデータの並列性を活用するのに用いられるのが好ましい。システム１０は多重Ｓマシン１２を含んでいるので、システム１０はプログラム全体のレベルでデータの並列性を活用するのに用いられるのが好ましい。
【０１２６】
任意の瞬間で必要な計算に対して、このようなハードウェアの計算性能を最適なものとするために、各Ｓマシン１２の命令処理ハードウェアを完全に再構成できるので、本発明のシステム１０によって大規模な計算力が得られる。各Ｓマシン１２は、他のＳマシン１２とは無関係に再構成することができる。システム１０は、ソフトウェアと、この明細書で説明した再構成ハードウェアとの間のプログラムされた境界、すなわちインタフェースとして各構成データセットを、したがって各命令セットアーキテクチャ（ＩＳＡ）を扱うのが有利である。さらに本発明のアーキテクチャによって、本来の場所で実際のシステムの問題を選択的に扱うために再構成ハードウェアを高レベルに構築することが容易となり、こうした問題には、割込みが命令処理に影響する方法と、リアルタイム処理とコンピュータ性能とを容易にする決定待ち時間応答の必要性と、欠陥処理に対する選択可能な応答の必要とが含まれる。
【０１２７】
その他のコンピュータアーキテクチャとは異なり、本発明はいつでもシリコンリソースを最大限に利用できることを開示している。本発明は、いつでも所望のサイズに拡大できる並列コンピュータシステムを提供し、その規模は数千個のＳマシン１２からなる大規模な並列システムでも可能である。このようなアーキテクチャの拡張性は、Ｓマシンベース命令処理がＴマシンベースデータ通信から意図的に分離されているので可能となっている。この命令処理／データ通信分離モデルは、データ並列計算にきわめて適している。Ｓマシンハードウェアの内部構造は、命令のタイムフローについて最適化されるのが好ましいが、Ｔマシンハードウェアの内部構造は、有効なデータ通信について最適化されるのが好ましい。Ｓマシン１２のセットとＴマシン１４のセットは、それぞれデータ並列計算の空間的・時間的区分において分離可能で構成可能なコンポーネントである。
【０１２８】
本発明を用いると、この明細書で説明した全体的構造を維持しながらさらに優れた計算性能を有するシステムを構築するのに将来の再構成ハードウェアを利用することができるかもしれない。言い換えれば、本発明のシステム１０は技術的に拡張可能である。現在用いられているほとんどすべての再構成論理装置は、メモリベース相補型金属酸化膜半導体（ＣＭＯＳ）技術を用いている。装置の能力の進歩は、半導体メモリ技術の流れ（傾向）に追随している。将来のシステムでは、Ｓマシン１２を構築するのに用いられる再構成論理装置は、この明細書に説明した内部ループ及び外部ループ命令セットアーキテクチャ（ＩＳＡ）に従った内部ハードウェアリソースの１区分を含むことになるであろう。大規模の再構成論理装置であっても、単一の装置内でより多くのデータ並列計算を実行する能力を単に提供するにすぎない。たとえば図１３を用いて上記に説明したデータ演算装置（ＤＯＵ）６３の第２模範実施例に含まれる機能単位１９４が大きいと、より大きなサイズの画像処理カーネルを含むことになるであろう。当業者は、本発明により提供される技術的拡張性がＣＭＯＳベース装置に限定されず、またフィールドプログラマブルゲートアレイ（ＦＰＧＡ）ベース実装にも限定されないことを認めるであろう。したがって、本発明は再構成可能性または再プログラマブル性を得るために用いられる特定の技術とは無関係に、技術的拡張性を提供する。
【０１２９】
図２２は、拡張性、並列、動的再構成計算のための好ましい方法のフローチャートである。図２２の方法は、システム１０内の各Ｓマシン１２内で実行されるのが好ましい。この好適な方法は、図２２のステップ１０００から始まり、再構成ロジック１０４が命令セットアーキテクチャ（ＩＳＡ）に対応する構成データセットを検索する。次にステップ１００２で、再構成ロジック１０４はステップ１０００で検索した構成データセットに従って、命令取出し装置（ＩＦＵ）６０と、データ演算装置（ＤＯＵ）６２と、アドレス演算装置（ＡＯＵ）６４内の各構成部分を構成し、これによって現在検討中の命令セットアーキテクチャ（ＩＳＡ）の実装のための動的再構成処理装置（ＤＲＰＵ）ハードウェア編成が得られる。ステップ１００２のあと、ステップ１００４で割込みロジック１０６はアーキテクチャ記述メモリ１０１に記憶された割込み応答信号を検索し、現在の動的再構成処理装置（ＤＲＰＵ）構成が割込みにどのように応答するかを定める遷移制御信号の対応するセットを生成する。その後、命令セットアーキテクチャ（ＩＳＡ）１００はステップ１００６でプログラム状態情報を初期化する。その後命令セットアーキテクチャ（ＩＳＡ）１００はステップ１００８で命令実行サイクルを開始する。
【０１３０】
次にステップ１０１０では、命令セットアーキテクチャ（ＩＳＡ）１００または割込みロジック１０６は再構成が必要かどうかを決定する。プログラム実行中に再構成指示が選択されるときには、命令セットアーキテクチャ（ＩＳＡ）１００は再構成が必要であると決定する。割込みロジック１０６は、再構成割込みに応じて再構成が必要であると決定する。再構成が必要なときには、この優先的方法はステップ１０１２に進み、ここで再構成ハンドラーはプログラム状態情報をセーブする。プログラム状態情報は、現在の動的再構成処理装置（ＤＲＰＵ）構成に対応した構成データセットへの引照を含んでいることが好ましい。ステップ１０１２のあと、この優先的方法はステップ１０００に戻り、再構成指示または再構成割込みによって引照される次の構成データセットを検索する。
【０１３１】
ステップ１０１０で再構成が必要とされないときには、ステップ１０１４で割込みロジック１０６は非再構成割込みを実施する必要があるかどうかを決定する。必要な場合には、次にステップ１０２０で、命令セットアーキテクチャ（ＩＳＡ）１００は命令実行サイクル内の現在の命令状態シーケンサ（ＩＳＳ）状態からの割込み実施状態への遷移が遷移制御信号に基づいて許容されるかどうかを決定する。割込み実施状態への遷移が許容されないときには、命令セットアーキテクチャ（ＩＳＡ）１００は命令実行サイクルの次の状態に進み、ステップ１０２０に戻る。遷移制御信号によって命令実行サイクル内の現在の命令状態シーケンサ（ＩＳＳ）状態からの割込み実施状態への遷移が許容されるときには、次にステップ１０２４で命令セットアーキテクチャ（ＩＳＡ）１００は割込み実施状態へと進む。ステップ１０２４で、命令セットアーキテクチャ（ＩＳＡ）１００はプログラム状態情報をセーブし、割込みを実施するためのプログラム命令を実行する。ステップ１０２４の後、この優先的方法はステップ１００８に戻り、現在の命令実行サイクルが完了していなかったときにはこれを再開し、（完了していたときには）次の命令実行サイクルを開始する。
【０１３２】
ステップ１０１４で非再構成割込みを実施する必要がないときには、この優先的方法はステップ１０１６に進み、現在のプログラムの実行が完了しているかどうかを決定する。現在のプログラムの実行を継続すべきときには、この優先的方法はステップ１００８に戻り、別の命令実行サイクルを開始する。それ以外の場合には、この優先的方法は終了する。
【０１３３】
本発明は、本発明のアーキテクチャにより必要とされるメモリ演算を実行するためのメタアドレス指定メカニズムを組入れている。本発明によれば、Ｔマシン１４はアドレス指定マシンとして用いられる。Ｔマシン１４は、割込み処理と、メッセージの待ち行列設定と、メタアドレス生成と、データパケットの全体的転送の制御とを実行する。図２３は、本発明に基づくデータパケット１８００の構成図である。データパケット１８００は、データ部分１８２４と、コマンド部分１８２０と、ソース地理アドレス１８１６と、サイズ区切り記号１８１２と、目的ローカルアドレスと、目的地理アドレス１８０４とを含んでいる。メタアドレス１８２８は、目的地理アドレス１８０４と、目的ローカルメモリアドレス１８０８とを含んでいる。目的ローカルメモリアドレス１８０８は、データパケット１８００のデータをローカルメモリ３４のどこに書込むかを指定する。目的地理アドレス、すなわち相互結合アドレス１８０４は、どのＴマシン１４がデータパケット１８００を受取るべきかを指定する。ソース地理アドレス１８１６は、データパケット１８００を生成したＴマシン１４を指定する。
【０１３４】
任意の２対のソース地理アドレス１８１６及び宛先
（目的）地理アドレス１８０４によって、２６４ビットのローカルアドレススペースへの１つの経路（パス）が一意的に決定される。しかし、システムにはこのような経路（パス）が２つ以上存在し、並列に作動することができる。Ｓマシン１２はそれに結合されたＴマシン１４を含み、その数はローカルメモリ帯域幅に相当する数まで、または持ち行列効果を考慮した任意の数までを含むことができる。したがって、本発明では２の不確定累乗分だけの拡張性が可能であり、またシステム内のプロセッサが不均一であってもよく、さらに各Ｓマシン１２へのユニークなパスの数を任意に拡張することができる。この種の拡張性は、分散画像処理など多くのアプリケーションで重要であり、動的再構成処理構成部分のピラミッドまたはツリーは、このシステムの高いレベルに対しさらに広い通信帯域幅を提供できるよう構成できるかもしれない。望む場合には、より多くの等速度Ｔマシン１４がＳマシン１２のピラミッドの高いレベルにアクセスできるようにすることによって、このピラミッドアーキテクチャを実装し、アドレス指定能力を最も必要とするＳマシン１２にこの能力を与える。これによって、システムリソースをほとんどの処理及び通信タスクに集中させることができるので、対費用効果のすぐれたシステムが得られる。
【０１３５】
好ましい実施例では、メタアドレスは８０ビット幅である。この実施例では、地理アドレスは１６ビットであり、ローカルメモリアドレスは６４ビット幅である。１６ビットの地理アドレスにより、６５５３６個の地理アドレスを指定することができる。６４ビットのローカルメモリアドレスにより、各ローカルメモリ３４内に２⁶⁴の個別のアドレス可能ビットを指定することができる。各Ｓマシン１２は、特定のＳマシン１２について構成されるローカルメモリ３４を含んでいる。Ｓマシン１２とメモリ３４が互いに分離しているので、各メモリのサイズや構造が均一である必要はなく、またメモリ全体のコヒーレント性や一致性を維持する必要もない。ソースＳマシン１２のプログラム命令が目的Ｓマシン１２のローカルメモリ３４のアーキテクチャを意識して書かれたものであり、またこのプログラム命令がメモリ位置を正確に指定する限り、そのサイズやレイアウトとは無関係に、目的Ｓマシン１２のローカルメモリ３４に容易にアドレス指定することができる。こうしたモジュール性を備えているため、問題の取扱いとは無関係に、さまざまなコンポーネントを用いてこのアーキテクチャのサイズを拡大・縮小することができる。新しいＳマシンを統合する方法も大幅に単純化されている。新しいＳマシン１２をシステムに加えるときには、そのＳマシン１２について新しい地理アドレスを選択し、新しいＳマシン１２の使用を要求するプログラムに新しいアドレスが与えられる。新しいＳマシン１２を利用するよう設計されたプログラムに新しいアドレスがいったん組入れられると、解決すべき問題は生じず、また計算を実行する必要もなく、Ｓマシン１２が統合される。
【０１３６】
図２４は、遠隔演算を要求するための本発明のＳマシン１２の処理の流れを示すフローチャートである。ステップ１９００で、Ｓマシン１２は命令を受取る。ステップ１９０４でＳマシン１２は、この命令が遠隔演算を要求しているかどうかを決定する。この命令が遠隔演算を要求していないときには、ステップ１９０６でこの命令が実行される。命令が遠隔演算を要求しているときには、ステップ１９０４で遠隔演算情報はローカルメモリに記憶される。下記に説明するようにステップ１９２０に進んだ後、Ｓマシン１２は遠隔演算が要求されているかどうかを示す命令コード内のフラグの状態を調べることによって、命令が遠隔演算を要求していると決定する。遠隔演算とは、結果を得るために異なるＳマシン１２を使用する必要がある演算である。遠隔演算情報は、Ｓマシン１２によって実行されるプログラムにより提供され、遠隔演算の実行が望まれる場合にローカルメモリ３４に記憶される。遠隔演算を記憶するにはローカルメモリ３４の一定のメモリ位置を用いるのが好ましく、このようにすれば、Ｔマシン１４はただちに情報にアクセスでき、最初にアドレスを取得する必要はない。遠隔演算情報は、一般に遠隔Ｔマシン１４の目的地理アドレス１８０４と、遠隔Ｓマシン１２にデータを記憶し、または遠隔Ｓマシン１２からデータを検索するための目的ローカルメモリアドレス１８０８と、コマンド情報１８２０と、サイズ情報１８１２と、データ１８２４とを含んでいる。命令が遠隔演算を必要とすると決定され次第、これらの情報はすべてＳマシン１２によってローカルメモリ３４に記憶される。
【０１３７】
１つの実施例では、ステップ１９１２でＳマシン１２が、遠隔演算が必要であることを示すようＴマシンに無条件命令を発する。無条件命令は、Ｔマシン１４が認識するよう設計されている一意的なコマンド列である。無条件命令は、一般に遠隔演算情報がローカルメモリ３４に納められているメモリアドレスと、アドレス指定情報のサイズを示すサイズ区切り記号とを含んでいる。これは遠隔演算情報の開始アドレスと一連のサイズ区切り記号を単に指定するだけで、Ｓマシン１２によって実行中のプログラムによって１度に複数の遠隔演算を要求することができる。そのときＴマシン１４は、情報について異なる要求を逐次処理することができる。次にステップ１９２０で、Ｓマシン１２は実行すべきその他の命令があるかどうかを決定する。命令が存在するときには、次の命令を受取り実行する。したがって、Ｓマシン１２は遠隔演算の要求があってもそれとは無関係に、命令の実行をほぼ瞬間的に行うことができる。Ｔマシン１４がデータの転送と検索を実行するので、Ｓマシン１２の処理能力は、命令の処理のみに集中することができる。図２５は、Ｓマシン１２から無条件命令を受取るＴマシン１４の処理のフローチャートである。まずステップ２０００でＴマシン１４は、制御ライン４８上でＳマシン１２から受取ったコマンドが無条件命令であるかどうかを決定する。コマンドが無条件命令であると決定したら、ステップ２００４でＴマシンはメモリ／データライン４６を経てローカルメモリ３４から遠隔演算情報を検索する。遠隔演算情報は、Ｔマシン１４がデータを検索する際に、遠隔演算情報を検索するたびに新しいメモリアドレスを決定する必要がないようにメモリ３４の一定の場所に納められるのが好ましい。遠隔演算情報は、ローカルメモリ３４の任意の場所に記憶することもできる。しかしこの場合には、情報の場所は無条件命令の一部として伝送されなければならない。遠隔演算情報を検索した後、Ｔマシン１４、特にＴマシン１４の共通インタフェース制御装置（ＣＩＣＵ）３０２コンポーネントは、ステップ２００８で情報からメタアドレス１８２８を生成する。目的ローカルメモリアドレス１８０８は、メタアドレス１８２８を形成するために、目的地理アドレス１８０４に添付される。次にステップ２１１２で、Ｔマシン１４は残りの遠隔演算情報からデータパケット１８００を生成し、要求されている宛先に伝送するためにデータパケット１８００を相互結合装置または汎用相互結合マトリックス（ＧＰＩＭ）１６に伝送する。
【０１３８】
ソース地理アドレス１８１６は、プログラム命令によって指定してもよく、したがってＴマシン１４による検索のためにローカルメモリ３４に記憶しても良い。またソース地理アドレス１８１６は、アーキテクチャ記述メモリ（ＡＤＭ：Architecture Description Memory）１０１に記憶されるのが好ましい。アーキテクチャ記述メモリ（ＡＤＭ）１０１は、それが結合されているＴマシン１４の地理アドレスを記憶する変更可能なメモリである。アーキテクチャ記述メモリ（ＡＤＭ）１０１を用いて、システム全体の地理アドレスを明白に変更することができる。このシステムの実施例では、Ｔマシン１４はアーキテクチャ記述メモリ（ＡＤＭ）１０１からソース地理アドレス１８１６を検索し、Ｔマシン１４自体の最も新しいソース地理アドレス１８１６を用いていることを確かめる。多重共通インタフェース制御装置（ＣＩＣＵ）３０２が各Ｓマシン１２に結合された実施例では、各共通インタフェース制御装置（ＣＩＣＵ）３０２の地理アドレスはアーキテクチャ記述メモリ（ＡＤＭ）１０１に記憶される。
【０１３９】
図２６は、相互結合装置を経て伝送されたデータパケットを受取るためのＴマシン１４の処理を示すフローチャートである。ステップ２１００でＴマシン１４は、相互結合装置からデータパケットを受取る。ステップ２１０４でＴマシン１４は、メタアドレス１８２８の目的地理アドレス１８０４コンポーネントを解析することによって、データパケット１８００を復号する。上記に説明したようにＴマシンのアドレス復号器３２０は、データパケット１８００を復号する。ステップ２１０８でアドレス復号器３２０は、目的地理アドレス１８０４と、関連する地理アドレスとを比較する。変更可能なアーキテクチャ記述メモリ（ＡＤＭ）１０１を用いる実施例では、アドレス復号器３２０は受取った目的地理アドレス１８０４とアーキテクチャ記述メモリ（ＡＤＭ）１０１内に記憶したアドレスとを比較する。ステップ２１１２で、地理アドレスが一致するとアドレス復号器３２０が決定したときには、データパケット１８００はローカルメモリアドレス１８０８によって指定されたメモリ３４の場所に伝送される。データパケット１８００は解析され、データはメモリ／データライン４６を経て送られ、コマンドは制御ライン４８を経て送られる。アドレス情報は、アドレスライン４４を経て送られる。各アドレスが一致しないときには、エラーメッセージがバイパスＦＩＦＯ３２４と、ＭＵＸ３２８と、汎用相互結合マトリックス（ＧＰＩＭ）１６とを通ってデータパケット１８００のソース地理アドレス１８１６コンポーネントによって識別されたＴマシン１４に伝送される。誤ってアドレスされたデータパケット１８００をＴマシン１４が受取った場合も、上記に説明したのと同じプロセスを用いる。新しいデータパケット１８００を受取った際に、ＣＩＣＵ３０４がデータパケット１８００を組立てているか解体しているときには、ＣＩＣＵ３０４がデータを受取り処理できるようになるまでＴマシン１４はデータパケット１８００を入力ＦＩＦＯ３２２に送り、待機させる。
【０１４０】
別の実施例では、Ｔマシン１４はメッセージの優先度を認識するように設計されており、Ｓマシンに新しいコマンドを処理させるのが適切であるときには、Ｓマシン１２の処理に割込む。この実施例では、図２７に示すように、共通インタフェース制御装置（ＣＩＣＵ）３０２は割込みロジック２２００と、コンパレータ（比較器）２２０４と、認識装置２２０８とを含む追加コンポーネントをさらに含んでいる。図２８は、共通インタフェース制御装置（ＣＩＣＵ）３０２の割込み処理能力を示すフローチャートである。ステップ２３００で認識装置２２０８は、アドレス復号３２０によってアドレスを確認したあと、データパケット１８００を解析し、パースコマンド１８２０を識別する。ステップ２３０４で認識装置２２０８は、コマンド１８２０が割込み要求であるかどうかを決定する。データパケット１８００が割込み要求であるときには、コマンド１８２０は割込みＩＤを含むことになる。コマンド１８２０が割込みＩＤを含んでいないときには、上記に説明したような処理を行うためにステップ２３０８でデータパケットは共通インタフェース制御装置（ＣＩＣＵ）３０２に送られる。
【０１４１】
コマンド１８２０が割込みＩＤを含んでいるときには、割込みＩＤはメモリ３４に結合されているコンパレータ２２０４に送られる。メモリ３４は割込みＩＤのリストを記憶する。各Ｓマシンには、その関連するローカルメモリ３４に記憶するようＳマシンが設計された割込みＩＤのリストを含むのが好ましい。このリストによって割込みを識別し、割込みの優先度を指定することができ、またこのリストは割込みを実行するための指示を含んでいる。ステップ２３１２で、コンパレータ２２０４は受取ったコマンドの割込みＩＤと、記憶されているＩＤのリストとを比較する。コマンドにより指定された割込みＩＤがリストのＩＤと一致しないときには、ステップ２３２０で、エラーメッセージがバイパスＦＩＦＯ３２４、ＭＵＸ３２８を経てソース地理アドレス１８１６により指定された宛先に伝送され、また信号ライン３１４を経て汎用相互結合マトリックス（ＧＰＩＭ）１６に伝送される。割込みＩＤが記憶されたＩＤと一致するときには、ステップ２３２４で割込みロジック２２００は記憶されたＩＤに関連するローカルメモリ３４に含まれている情報に従って、またはデータパケット１８００に含まれている情報に従って割込みを処理し、また得られたコマンドを制御ライン４８を経てＳマシン１２に送る。
【０１４２】
優先度の比較が可能なときには、割込みロジック２２００は、割込み要求の優先度と、現在入力ＦＩＦＯ３２２にあるデータパケット１８００の優先度とを比較する。割込み要求の優先度がＦＩＦＯ３２のデータパケット１８００より高いときには、割込み要求の優先度の低いデータパケット１８００の手前に置かれる。場合によっては、割込み要求がＳマシン１２の実行を停止するよう求めることもある。この場合、優先度レベルはＳマシン１２で実行しているプロセスに割当てられる。割込み要求の優先度が現在実行しているプロセスの優先度より高いときには、割込みロジック２２００は、Ｓマシン１２が現在の処理を終了し割込み要求の処理を開始する命令を制御ライン４８を経てＳマシン１２に発信する。したがって、完全な優先度比較と割込み処理スキームは、本発明のアーキテクチャに基づくＴマシン１４によって実動化され、Ｓマシン１２が追加の処理を行う必要はほとんどない。
【０１４３】
したがって、Ｔマシン１４がコンピュータシステムによって要求されるすべてのメモリ演算機能を実行するので、Ｓマシン１２はプログラムの主命令を実行することができる。メモリと命令実行演算を空間的・時間的に分離することによって、マルチプロセッサで構成される高度に並列のシステムの処理能力を大幅に最適化することができる。仮想メモリや共用メモリを使用しないので、ハードウェア一致性及びコヒーレンス性演算を行う必要はない。Ｓマシン１２は異なるレートで作動することができ、動的に再構成可能なＳマシン１２によって実現される命令セットアーキテクチャ（ＩＳＡ）は異なるものでも良い。さらにＳマシン１２を実動化するフィールドプログラマブルゲートアレイ（ＦＰＧＡ）も、特定のタスクについて最適化される。たとえば、埋込まれた画像を計算する場合、フロントパネルＬＣＤ画面コントローラを画像処理用に最適化したＳマシン１２とする必要はない。しかし、システムのすべてのＳマシン１２に対し、別のＳマシンと通信する必要がある各Ｓマシン１２が一様にアドレス指定できるようにすることは、それでもきわめて望ましいことであり、これは上記に説明したように本発明によって得られる。ソフトウェアは、システム全体のコヒーレント性また一致性を得るように用いられ、これにはＳマシン１２及びＴマシン１４のためのメッセージ伝達インタフェース（ＭＰＩ）実行時（ランタイム）ライブラリなど、また並列仮想マシン（ＰＶＭ）のための実行時ライブラリなどの従来の方法が用いられる。ＭＰＩもＰＶＭもハードウェア抽象化層（ＨＡＬ）として機能する。本発明によると、ＨＡＬは動的に再構成可能なＳマシン１２と固定式Ｔマシン１４のためのものである。メモリ演算はソフトウェアによって完全に制御されているので、このシステムは動的に再構成可能であり複雑なハードウェア／ソフトウェア相互作用の影響を受ない。したがって、独立して分離したメモリを用い、また別個のアドレス指定マシンと処理マシンとを有する完全に拡張性がありアーキテクチャ的に再構成可能なコンピュータシステムが、高度並列計算環境で使用するために提供される。メタアドレスを用いることにより、透過性で細かいアドレス指定が可能であり、またコンピュータシステムの通信経路をシステムの要求に応じて割当てたり再割当てすることが可能である。アドレス指定マシンと処理マシンとを分離することによって、処理マシンのリソースを処理のみに集中させることができ、処理マシンに多様な命令セットアーキテクチャを利用したりさまざまなレートで作動させることができ、またそれぞれ最適化されたハードウェアを用いて実動化することができる。これらのすべてによってシステムの処理力が大幅に向上する。
【０１４４】
本発明の開示内容は、再プログラマブル計算または再構成計算のためのその他のシステムと著しく異なっている。特に、ダウンロード可能なマイクロコードアーキテクチャは一般に非再構成制御手段と非再構成ハードウェアに依存しているので、本発明はこのようなアーキテクチャとは同等ではない。本発明は、１組の再構成ハードウェアが非再構成ホストプロセッサまたはホストシステムに結合された付加再構成プロセッサ（ＡＲＰ）システムとも明白に異なっている。付加再構成可能プロセッサ（ＡＲＰ）装置は、いくつかのプログラムを実行するホストに従属している。したがって、ホストまたは付加再構成可能プロセッサ（ＡＲＰ）装置がそれぞれデータについて演算する際に、付加再構成可能プロセッサ（ＡＲＰ）装置またはホストのシリコンリソースがアイドル状態であるかまたは非効率に用いられるので、利用できるシリコンリソースはプログラム実行の時間枠において最大限に利用されない。これに対して各Ｓマシン１２は、プログラム全体を容易に実行することができる独立したコンピュータである。多重Ｓマシン１２は、プログラムを同時に実行することが好ましい。したがって本発明は、個々のＳマシン１２上で実行する各プログラムとシステム１０全体上で実行する多重プログラムとの両方について、常にシリコンリソースを最大限に利用することを開示している。
【０１４５】
付加再構成可能プロセッサ（ＡＲＰ）装置は、特定の時間で特定のアルゴリズムについて計算アクセラレータを提供し、特定のアルゴリズムに関して最適に相互結合された１組のゲートとして実装される。命令実行の管理などの汎用演算のために再構成ハードウェアリソースを使用することは、付加再構成可能プロセッサ（ＡＲＰ）システムでは避けられている。さらに付加再構成可能プロセッサ（ＡＲＰ）システムは、所定のセットの相互結合を容易に再使用可能なリソースとしては扱わない。これに対して、本発明は特定の時間における計算の必要性に最も適した命令実行モデルによる、命令実行の効率的管理のために構成された動的再構成処理手段を開示している。Ｓマシン１２は、複数の再使用可能なリソース、たとえば命令状態シーケンサ（ＩＳＳ）１００と、割込みロジック１０６と、記憶／整列ロジック１５２とを含んでいる。本発明は、相互結合ゲートのレベルではなく、論理ブロック（ＣＬＢ）、入出力ブロック（ＩＯＢ）、及び再構成相互結合のレベルで再構成論理リソースを使用することを開示している。したがって本発明は、単一のアルゴリズムについて有用な単一のゲート接続スキームを開示するのではなく、すべてのクラスの計算問題について演算を実行するのに有用な再構成可能な高レベル論理構成部品の使用を開示している。
【０１４６】
一般に付加再構成可能プロセッサ（ＡＲＰ）システムは、特定のアルゴリズムを１組の相互結合ゲートに翻訳するためのものである。一部の付加再構成可能プロセッサ（ＡＲＰ）システムは、高レベル命令を最適のゲートレベルハードウェア構成にコンパイルするよう試みるが、これは一般にＮＰハード問題である。これに対して本発明は、きわめて簡単な方法で、可変命令セットアーキテクチャ（ＩＳＡ）に従って高レベルプログラム命令をアセンブリ言語命令にコンパイルする動的再構成計算のためのコンパイラの使用を開示している。
【０１４７】
付加再構成可能プロセッサ（ＡＲＰ）装置は、一般にそのホストプログラムをデータとして扱うことはできず、またそれ自体を計算環境に適合させることもできない。これに対して、システム１０の各Ｓマシン１２は、それ自体のプログラムをデータとして扱うことができ、したがって、容易にそれ自体を計算環境に適合させることができる。システム１０はそれ自体のプログラムを実行することにより、それ自体を容易にシミュレートすることができる。本発明はさらに、それ自体のコンパイラをコンパイルすることができる。
【０１４８】
本発明では、単一のプログラムには、第１命令セットアーキテクチャ（ＩＳＡ）に属する第１命令グループと、第２命令セットアーキテクチャ（ＩＳＡ）に属する第２命令グループと、さらに別の命令セットアーキテクチャ（ＩＳＡ）に属する第３命令グループと．．．を含んでいる。この明細書で開示したこのアーキテクチャは、命令が属する命令セットアーキテクチャ（ＩＳＡ）を実装するためにランタイム構成されているハードウェアを用いて、このような命令グループをそれぞれ実行する。先行技術のシステムや方法で同様の開示内容を提示しているものはない。
【０１４９】
本発明はさらに、割込み待ち時間と、割込み精度と、プログラマブル状態遷移イネイブリングとが、現在検討中の命令セットアーキテクチャ（ＩＳＡ）に従って変化する再構成割込みスキームを開示している。その他のコンピュータシステムでは、同様の開示内容は認められない。本発明はさらに、先行技術コンピュータシステムとは異なり、再構成データパスビット幅、アドレスビット幅、及び再構成制御ライン幅を有するコンピュータシステムを開示している。
【０１５０】
本発明はいくつかの好ましい実施例を用いて説明してきたが、当業者は、さまざまな変形例が得られることを認めるであろう。
【０１５１】
＜参考資料Ａ＞
命令セット0，汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）
１．０プログラマのアーキテクチャモデル
この節では、レジスタ、メモリモデル、高レベル言語から呼出しコンベンション及び割込みモデルを含む命令セットアーキテクチャ（ＩＳＡ）０アーキテクチャについてのプログラマの概略コンセプトを示す。
【０１５２】
１．１レジスタ
命令セットアーキテクチャ（ＩＳＡ）０は、１６個の１６ビット汎用レジスタ、１６個のアドレスレジスタ、２個のプロセッサ状態レジスタ、及び１個の割込みベクトルレジスタを含んでいる。データ及びアドレスレジスタのニーモニックは１６進数を用いており、したがって最後のデータレジスタはdf.であり、最後のアドレスレジスタはaf.である。プロセッサ状態レジスタの１つであるnipar（Next Instruction Program Address Register）は取出す（フェッチする）次命令のアドレスを指している。もう一方の状態レジスタであるpcw（Processor Control Word）はプログラムフローと割込み処理を行うために用いられるフラグと制御ビットとを含んでいる。そのビットは表２に定義されている。未定義のビットは将来の使用のため保留される。さまざまな命令の副作用として４つの条件フラグ、Ｚ、Ｎ、Ｖ及びＣが設定される。どのフラグが各命令によって影響を受けるかの概要については、２．０項を参照。
【０１５３】
Ｔ（Trace Mode）及びＩＭ（Interrupt Mask）フラグは、プロセッサが割込みに対してどのように対応するか、またトラップがいつ取扱われるかを制御する。割込みベクトルレジスタivecは、割込みサービスルーチンの６４ビットアドレスを保持する。割込みとトラップについては後述する１．４項で述べる。
【０１５４】
【表１】

【０１５５】
１．２メモリアクセス
６４ビットアドレスレジスタに記憶されている値は、１６ビット及び６４ビットインクリメントでメモリロード／ストア命令アクセスメモリにより用いられる（表７参照）。アドレスはビットアドレスである。つまり、アドレス１６はメモリ内のビット１６で始まるワード（語）を指す。ワードは１６ビット境界上でのみ読出すことができ、したがってメモリを読出すときにはアドレスレジスタの４つのＬＳＢ（最下位ビット）は無視される。Ｋ_ISAのコンセプトの詳細については［１］を参照。６４ビット値は、リトルエンディアン順（最下位１６ビットが最も下位のアドレスに記憶される順序）の１６ビットワードとして記憶される。
【０１５６】
【表２】

【０１５７】
１．３呼出しコンベンション
コンベンションによって、レジスタafはＣプログラムによりスタックポインタとして用いられ、レジスタaeはスタックフレームポインタとして用いられる。ニーモニックsp及びfpはこれらのレジスタのエイリアス（別名）として用いられることがある。他のすべてのレジスタは一般用に自由に用いられる。スタックは下に向かって増大する。
【０１５８】
intは１６ビットであり、longは６４であり、aはボイド*である。int値はd0で復帰され、longとボイド*の値はa0で復帰される。d0-d4とa0-a3はファンクションによってクロバーされ、他のすべての汎用レジスタはファンクションコール上で保持されなくてはならない。ファンクションに入ると、スタックポインタは復帰アドレスを指し、こうして最初の引数はアドレスsp＋64（１０進）で始まる。
【０１５９】
１．４トラップと割込み
命令セットアーキテクチャ（ＩＳＡ）０は１本の割込みラインに作用し、ソフトウエアは２つのソースからトラップする。すべては下記に述べる同じ制御フロー（flow-of-control）転送メカニズムを呼出す。
【０１６０】
外部的には単一のＩＮＴＲ信号入力があり、１つのiack出力がある。pcw内の割込みマスクビットが、xpcw命令でpcwをリセットすることにより、またはrti命令で割込みから復帰してpcwをその当初の値へ戻すことによってクリアされると同時に、iackはアクティブ（能動的）となる。外部装置による割込みの信号発信とプロセッサによる割込みのサービスの間の時間量は、現在実行中の命令とソフトウエアトラップの存在に応じて定まる。
【０１６１】
ソフトウエアトラップは、明示トラップ命令によって、またはＴ（トレース）フラッグセットで命令を実行することによってトリガされる。この場合、Ｔの設定に続く最初の命令のあと、コントロール（制御権）が割込みサービスルーチンへ移される。トラップ命令が実行されるときは、プロセッサはＴフラグを設定し、あたかも命令を実行する前にＴフラッグが設定されていたかのように割込みサービスルーチンに入る。Ｔフラッグが設定されている間は割込みのサービスは行われない。xpcw命令でpcwをリセットすることによって、またはrti命令で割込みからの復帰によりスタックからリセットすることによってＴフラグがクリアされるまではそれ以上トラップは起こらない。
【０１６２】
割込みは、intr外部信号でのアクティブ信号の存在によって発生する。imフラグまたはＴフラグが設定されているときは、割込みはマスクされ、未決定の割込みは無視される。imフラグとＴフラグがクリアされると、intrの表明に続く最初の命令のあと、コントロールは割込みサービスルーチンに移される。割込みサービスルーチンに入ると、imフラグがプロセッサにより設定される。xpcw命令でpcwがリセットされるか、またはrti命令による割込みからの復帰によってスタックからリセットされることによってimフラグがクリアされるまでは、それ以上は割込みは起こらない。
【０１６３】
割込みまたはトラップが起こるときに、プロセッサがとるステップは次の通りである。
１．現在実行中のすべての命令を完了する。
２．１６個のデータレジスタ（d0が先）、１６個のアドレスレジスタ（a0が先）、pcw、ivec及びniparが、この順序で（レジスタafによりポイントされた）スタックに押込まれる。スタックに押込まれるafの値は、割込みまたはトラップのサービスが始まる前のその値である。
３．これが割込みであるときには、pcw内の割込みビットが、それ以上の割込みをマスクするよう設定される。これがトラップ命令であるときには、Ｔフラッグが設定されるこれがＴフラッグにより発生したトラップであるときは、pcwは変更されない。
４．ivecレジスタ内の値をniperにロードする。
割込みハンドラ内での命令の実行が始まる。
【０１６４】
rti命令の実行時には、下記の動作が行われる。
１．レジスタは、それが書込まれたのと反対の順序でスタックから回復される。
２．実行を再開する。
【０１６５】
割込みマスクフラグがすでにクリアされていないときは、それはrti命令によってクリアされる。pcwの値がスタック上で変更されない限り、それはサービスルーチンに入ったときにクリアされていたからである。トラップ命令を実行することによってＴフラグが設定されるときは、同じ理由によってrtiの実行時にそれはクリアされる。サービスルーチンへ入る前に設定されていたＴフラッグによってトラップが発生したときは、トラップが発生したことを確認するために、それはサービスルーチンによりクリアされなくてはならない。割込みマスクフラグが何らかの手段によってクリアされるときは、外部出力信号iackは割込みが行われている外部装置に信号を送るために１クロックサイクルの間アクティブとなる。
【０１６６】
２．０機能による命令の分類
表記コンベンションは次の通りである。
【０１６７】
【表３】

【０１６８】
２．１レジスタの動き
【０１６９】
【表４】

【０１７０】
２．２論理演算
【０１７１】
【表５】

【０１７２】
２．３メモリロード／ストア
【０１７３】
【表６】

【０１７４】
２．４算術演算
【０１７５】
【表７】

【０１７６】
２．５制御フロー
【０１７７】
【表８】

【０１７８】
３．０英字参照記号
命令セットアーキテクチャ（ＩＳＡ）０のために設定された命令を下記にアルファベット順に示す。ニーモニックは短い記述で示してある。その下は命令の２進コードである。２進コードの各行は１６ビットのワードである。影響を受けるフラグを次にリストで示す。ほかに定めのない限り、フラグは宛先レジスタに記憶されたデータを用いて設定する。niparは命令実行の開始時にすでにインクリメントされたものと想定する。最後に命令の意味についてのテキスト記述を示す。
【０１７９】
２進コードに用いられている表記コンベンションを下記の表にまとめてある。条件コードは表５９に定義されている。
【０１８０】
【表９】

【０１８１】
【表１０】

【０１８２】
２個のデータレジスタを加算し、結果を宛先レジスタに残す。
【０１８３】
【表１１】

【０１８４】
２個のデータレジスタと桁上げフラグを加算し、結果を宛先レジスタに残す。
【０１８５】
【表１２】

【０１８６】
８ビット符号つき（２の補数）定数をデータレジスタに加算し、結果をレジスタに残す。
【０１８７】
【表１３】

【０１８８】
２個のデータレジスタのビットワイズANDを実行し、結果を宛先レジスタに残す。
【０１８９】
【表１４】

【０１９０】
条件が真のときは、（offset << K_isa ）をniparに加算する。
【０１９１】
【表１５】

【０１９２】
（offset << K_isa）をniparに加算する。
【０１９３】
【表１６】

【０１９４】
条件つきで８ビット右へシフトし、マスクする。ワードオフセットから読出された８ビットデータを整列するためにロード命令のあとに用いられる。ソースアドレスレジスタに含まれるアドレスが８ビット境界上にある（ビット２セットをもつ）ときには、データレジスタ内の値を８ビット右へシフトする。アドレスが８ビット境界上にないときは、レジスタの上流８ビットをクリアする。
【０１９５】
［註］負のフラグは、ビット１５でなくビット７で設定する。これによって８ビット量の符号延長が容易となる。
【０１９６】
【表１７】

【０１９７】
宛先レジスタからソースレジスタを差引くことによって、２個のデータレジスタの絶対値比較のためのフラグを設定し、フラグだけに影響を与える。
【０１９８】
【表１８】

【０１９９】
１６ビット符号つき整数により３２ビット符号つき整数の符号つき除算を行い、１６ビット符号つき商と剰余を戻す。３２ビット被除数は、宛先レジスタのインデックスから始まる2個の連続するレジスタ内に格納する（リトルエンディアン順）。１６ビット除数はソースレジスタ内にある。剰余は宛先レジスタに戻し、商は宛先レジスタ後のレジスタに戻す（モジュロ１６）。商が１６ビットを越えるときはオーバフローとなる。
【０２００】
【表１９】

【０２０１】
データレジスタをアドレスレジスタに加算し、結果をアドレスレジスタに残す。
【０２０２】
【表２０】

【０２０３】
８ビット符号つき定数をアドレスレジスタに加算し、結果をアドレスレジスタに残す。
【０２０４】
【表２１】

【０２０５】
宛先レジスタからソースレジスタを差引くことによって、２個のアドレスレジスタの絶対値比較のためのフラグを設定し、フラグだけに影響を与える。
【０２０６】
【表２２】

【０２０７】
２個のアドレスレジスタを加算し、結果を宛先レジスタに残す。
【０２０８】
【表２３】

【０２０９】
宛先レジスタからソースレジスタを差引き、結果を宛先レジスタに格納する。
【０２１０】
【表２４】

【０２１１】
ロードをアドレスレジスタ内に事後インクリメントする。ソースレジスタによりポイントされたアドレスからメモリを読出し、宛先レジスタ内に入れる。次にソースレジスタをインクリメントする。
【０２１２】
【表２５】

【０２１３】
アドレスレジスタを1ビット右へシフトする。
【０２１４】
【表２６】

【０２１５】
アドレスレジスタから格納する。ソースレジスタ内の６４ビット値を、宛先レジスタによりポイントされたメモリ位置に書込む。この値はリトルエンディアン順に配置した４つの１６ビットワードとして書込む。
【０２１６】
【表２７】

【０２１７】
アドレスレジスタからのストアを事前デクリメントする。宛先レジスタをデクリメントし、次にソースレジスタ内の値を、宛先レジスタによりポイントされたメモリ位置に書込む。この値はリトルエンディアン順に配置した４つの１６ビットワードとして書込む。
【０２１８】
【表２８】

【０２１９】
アドレスレジスタからデータレジスタを差引き、結果をアドレスレジスタに残す。
【０２２０】
【表２９】

【０２２１】
宛先レジスタへソースレジスタをビットごとに反転配置する。
【０２２２】
【表３０】

【０２２３】
絶対アドレスへ条件つきジャンプする。条件コードビットの定義については、表５９を参照。
【０２２４】
【表３１】

【０２２５】
絶対アドレスへ無条件ジャンプする。条件「常時」はjCCと同じ。
【０２２６】
【表３２】

【０２２７】
宛先レジスタをまずインクリメントし、次に宛先レジスタ（通常はスタックポインタ）によりポイントされたアドレスに、（次命令を指す）現在のniparを格納する。次に、次命令を取出す前にソースレジスタ内のアドレスでniparをロードする。
【０２２８】
【表３３】

【０２２９】
ビットの定数だけデータレジスタを左へシフトする。
【０２３０】
【表３４】

【０２３１】
ビットの定数だけデータレジスタを右へシフトする。
【０２３２】
【表３５】

【０２３３】
メモリからデータレジスタをロードする。ソースアドレスレジスタによりポイントされた値を宛先データレジスタにロードする。
【０２３４】
【表３６】

【０２３５】
データレジスタ内へロードを事後インクリメントする。ソースアドレスレジスタによりポイントされたアドレスからメモリを読出し、宛先データレジスタに入れる。次にソースレジスタをインクリメントする。
【０２３６】
【表３７】

【０２３７】
１６ビット隣接値をデータレジスタ内にロードする。
【０２３８】
【表３８】

【０２３９】
宛先レジスタを、ソースレジスタのビットワイズ反転で置換え、宛先レジスタを加える。
【０２４０】
【表３９】

【０２４１】
ソースデータレジスタ内の値を宛先データレジスタ内に入れる。
【０２４２】
【表４０】

【０２４３】
ソースレジスタ内の値を宛先レジスタ内の値により乗算した結果を、宛先レジスタで始まる2つの連続するレジスタ内に格納する（リトルエンディアン順）。
【０２４４】
【表４１】

【０２４５】
２つのデータレジスタのビットワイズORを実行し、結果を宛先レジスタに残す。
【０２４６】
【表４２】

【０２４７】
データレジスタを１ビット左へシフトする。ＬＳＢ（最下位ビット）を桁上げフラグの値で置換える。命令の終りに当初のＭＳＢ（最上位ビット）を桁上げフラグ内に入れる。
【０２４８】
【表４３】

【０２４９】
前述した１．４項を参照。ソースレジスタをスタックポインタとして用いる。
【０２５０】
【表４４】

【０２５１】
サブルーチンからの復帰。宛先レジスタ（通常はスタックポインタ）によりポイントされたメモリ位置からniparをロードする。次に、宛先レジスタをインクリメントする。
【０２５２】
【表４５】

【０２５３】
ソースレジスタ内の値により指定されたビット数だけ、宛名レジスタを左へシフトする。
【０２５４】
【表４６】

【０２５５】
ソースレジスタ内の値により指定されたビット数だけ、宛名レジスタを右へシフトする。
【０２５６】
【表４７】

【０２５７】
データレジスタから格納（ストア）する。ソース内の値を、宛先レジスタによりポイントされたメモリ位置に書込む。
【０２５８】
【表４８】

【０２５９】
データレジスタからストアを事前デクリメントする。宛先レジスタをデクリメントし、次に宛先レジスタによりポイントされたメモリ位置にソースレジスタ内の値を書込む。
【０２６０】
【表４９】

【０２６１】
宛先レジスタからソースレジスタを差引き、結果を宛先レジスタに格納する。
【０２６２】
【表５０】

【０２６３】
宛先レジスタからソースレジスタを差引き、次に桁上げビットを差引き、結果を宛先レジスタに格納する。
【０２６４】
【表５１】

【０２６５】
割込みハンドラーを実行する。１．４項を参照。宛先レジスタをスタックポインタとして用いる。
【０２６６】
【表５２】

【０２６７】
１６ビット符号つき整数による３２ビット符号つき整数の符号なし除算を行い、１６ビット符号つき商と剰余を戻す。宛先レジスタのインデックスから始まる２つの連続するレジスタ内に３２ビットを格納する（リトルエンディアン順）。除数はソースレジスタ内にある。剰余は宛先レジスタへ戻し、商は宛先レジスタ後の次レジスタへ戻す。商が１６ビットを越えるときはオーバフローとなる。
【０２６８】
【表５３】

【０２６９】
宛先レジスタ内の値によりソースレジスタ内の値を乗算した結果を、宛先レジスタで始まる２つの連続するレジスタ内に格納する（リトルエンディアン順）。
【０２７０】
【表５４】

【０２７１】
ソースアドレスレジスタ内の値を、宛先レジスタで始まる4つの連続するデータレジスタへ転送する。この値はリトルエンディアン順に格納し、宛先レジスタアドレスをモジュロ１６で計算し、宛先レジスタがどのレジスタでも良いようにする。
【０２７２】
【表５５】

【０２７３】
４つの連続するデータレジスタ内のリトルエンディアン順の６４ビット値を宛先アドレスレジスタ内へ転送する。ソースレジスタアドレスをモジュロ１６で計算し、宛先レジスタがどのレジスタでも良いようにする。
【０２７４】
【表５６】

【０２７５】
２つのデータレジスタのビットワイズ排他的ORを実行し、結果を宛先レジスタ内に残す。
【０２７６】
【表５７】

【０２７７】
ソースデータレジスタ内の値をpcwレジスタと交換する。
【０２７８】
【表５８】

【０２７９】
ソースアドレスレジスタ内の値をivecレジスタと交換する。
【０２８０】
４．０条件コード
条件コード操作コード部分フィールドには、下記の表からの値を用いる。
【０２８１】
【表５９】

【０２８２】
＜参考資料Ｂ＞
命令セット１，パイプライン乗算・累算命令セットアーキテクチャ（ＩＳＡ）
ＩＳＡ１ − ＸＣ４０１３のためのパイプライン畳込みエンジン
はじめに
命令セットアーキテクチャ（ＩＳＡ）１は、命令サイクルあたり４回の同時乗算・累算を行うことのできるパイプライン乗算・累算アレイである。４個の８ビット×８ビット乗算器への入力ごとに１個、つまり、８個の８ビットデータレジスタ（xd0-xd3及びyd0-yd3）がある。１つの最終１６ビット合計が出るまで、パイプライン加算アレイを経由して、４つの乗算器出力が合計され、４個までの１６ビットレジスタが結果を記憶できる（m0-m4）。命令セットアーキテクチャ（ＩＳＡ）１のアーキテクチャは、主メモリでフロースルーバッチ処理サイクルを仮定している。累算結果を再循環させるための乗算器累算器データパスを通るフィードバックパスはない。これはメモリデータ流量に重点が置かれているからである。オーバフロースケーリングまたは拡張有限性累算のための用意はない。畳込みフィルタリングに用いられる係数は、すべてのデータセットについて、１６ビットを超えない結果有限性を与えると、命令セットアーキテクチャ（ＩＳＡ）１は仮定している。乗算アレイは、８ビットの２の補数データ入力を受け、１６ビットの２の補数結果を出す。
【０２８３】
メモリへのアクセスは、２個の１６ビットアドレスレジスタ（a0とa1）によって管理され、これらは互換性のあるソース及び宛先ポインタと考えることができる。プログラムフローは、標準６４ビットNIPARレジスタにより管理され、６４ビット割込みベクトルレジスタは、フレームまたはデータ実行可能割込みなどの単独割込みについて支援される（IVEC）。
【０２８４】
命令セットアーキテクチャ（ＩＳＡ）１の命令セットはきわめて小さく、１６ビットのワードサイズに整列され、汎用外部ループプロセッサ命令セットアーキテクチャ（ＩＳＡ）０のためのK_ISA＝４メモリ編成に対応している。命令セットアーキテクチャ（ＩＳＡ）１での単一のクロックサイクルで7回までの算術演算を例示することができ、実動化によってクロックの小ウィンドウ上でクロックあたり１の割合で結果を保持し、新しいソースまたは宛名アドレスをインデックスする能力があり、計算と並行してメモリから、またメモリへ、レジスタデータを移す。
命令セットアーキテクチャ（ＩＳＡ）１命令セット
データ移動
ld（reg−vector）
命令ワード内に右揃えされた１４ビットのビットマップreg-vectorに従って、メモリから順次１４個までのレジスタがロードされる。
【０２８５】
st（reg-vector）
命令ワード内に右揃えされた１４ビットのビットマップreg-vectorに従って、メモリへ順次１４個までのレジスタが記憶される。
【０２８６】
ld（ivec-data）
この命令に続く６４ビットのアドレスがIVECレジスタにロードされ、次命令をポイントするNIPAR＋＝5が実行される。
【０２８７】
プログラム制御
jmp（nipar-data）
この命令に続く６４ビットのアドレスがNIPARレジスタにロードされ、これによって次命令へのポインティングが実行される。
【０２８８】
算術演算
mac（m-reg）
２ビットのm-regコードで示される乗算結果レジスタが積と和（xd0^*yd0）＋（xd1^*yd1）＋（xd2^*yd2）＋（xd3^*yd3）を受取る。
【０２８９】
macp（s-vec, d-vec）
４ビットd-vecコードの２ビットにより示される乗算結果レジスタが積と和（xd0^*yd0）＋（xd1^*yd1）＋（xd2^*yd2）＋（xd3^*yd3）を受取る。d-vecコードの１つおきのビットが選択的にこの結果レジスタのアドレス（a1）へのメモリ書込みを可能にし、d-vecコードの残りのビットが、アドレスレジスタa1が増分されるかどうかを選択する。８ビットs-vecは４つの２ビットグループに分けられ、データレジスタxd0-xd3について連続してアドレス（a0）でのメモリからの読出しが行われるかどうか、またアドレスレジスタa0が増分されるかどうかを指定する。読出しまたは書込みが指定されるときは、乗算に並行して行われる。ソフトウエアは、メモリから読出され、メモリへ記憶されるデータの各バッチについて命令処理のパイプライン整列を行わなければならない。
【０２９０】
再構成
reconf（ISA-vector）
命令セットアーキテクチャ（ＩＳＡ）１は脱コンテキストされ、Ｓマシンは命令内の命令セットアーキテクチャ（ＩＳＡ）ベクトルビットフィールドにより選択される命令セットアーキテクチャ（ＩＳＡ）について再構成される。
【０２９１】
表６０に、ＩＳＡ１のブロック構成として、ＸＣ４０１３のためのパイプライン畳込みエンジンを示す。
【０２９２】
【表６０】

【０２９３】
【発明の効果】
本発明は、拡張性、並列、動的再構成計算のためのシステム及び方法に関するものである。このシステムは、少なくとも１個のＳマシンと、各Ｓマシンに対応するＴマシンと、汎用相互結合マトリックス（ＧＰＩＭ）と、１組の入出力Ｔマシンと、１個またはそれ以上の入出力装置と、マスタタイムベース装置とを含んでいる。好ましい実施例では、このシステムは多重Ｓマシンを含んでいる。各Ｓマシンは、対応するＴマシンの出力部と入力部とにそれぞれ結合された入力部と出力部とを含んでいる。各Ｔマシンは、汎用相互結合マトリックス（ＧＰＩＭ）に結合されたルーティング入力部とルーティング出力部とを含んでおり、各入出力Ｔマシンも同様にこれらを含んでいる。入出力Ｔマシンはさらに、入出力装置に結合された入力部と出力部とを含んでいる。最後に、各Ｓマシンと、Ｔマシンと、入出力Ｔマシンとは、マスタタイムベース装置のタイミング出力部に結合されたマスタタイミング入力部を含んでいる。
【０２９４】
本発明のメタアドレス指定システムは、プロセッサ自体に処理集中アドレス操作機能を実行するよう要求することなしに、ネットワーク内のプロセッサにビットアドレス指定能力を提供する。割当てられた各機能を実行するよう最適化された個別の処理マシン及びアドレス指定マシンが開示される。処理マシンは命令を実行し、ローカルメモリにデータを記憶し、またローカルメモリからデータを検索し、いつ遠隔演算が要求されるかを決定する。アドレス指定マシンは、伝送するためのデータのパケットを組立て、このパケットの地理アドレスまたはネットワークアドレスを決定し、入ってくるパケットに対してアドレスをチェックする。さらにアドレス指定マシンは、割込み処理とその他のアドレス指定演算を実行することができる。
【０２９５】
１つの実施例では、Ｔマシンはまた本発明のメタアドレス指定メカニズムも提供する。メタアドレスは、システム内のＴマシンの地理的位置を指定し、ローカルメモリ装置内のデータの位置を指定する。メタアドレスのローカルアドレスは、装置のアドレス指定可能スペースがローカルアドレスのビット数以下である限りは、装置の実際のメモリサイズとは関係なく、新しい装置のメモリ内の各ビットをアドレス指定するのに用いられる。したがって、単一のメタアドレスを用いて異なるメモリサイズと構造とを有する装置のアドレス指定を行うことができる。さらに、メタアドレスを用いているので、マルチプロセッサ並列アーキテクチャ内のハードウェアが、システム全体のコヒーレント性と一致性とを保証する必要はない。
【０２９６】
メタアドレスによって、完全な拡張性が得られる。新しいＳマシンまたは新しい入出力装置が加えられると、新しい地理アドレスがこの新しい装置について指定される。本発明では、拡張性は不規則であってもよく、プロセッサの数の２乗の拡張を行わなければならないという条件がない。利用できるローカルメモリ帯域幅までの任意の数のアドレス指定マシンを各処理マシンに結合する能力によって、拡張性はさらに高まる。これにより、システム設計者は、各処理マシンへの経路の数を任意に指定することができる。このような柔軟性によって、システムのより高いレベルにより広い帯域幅を提供することができ、システムの最も重要な機能に最も広い帯域幅を与えるよう最適化されたピラミッド型処理アーキテクチャを構築することができる。
【０２９７】
上に説明したように、好ましい実施例によると、Ｔマシンはメタアドレスを生成し、割込みを扱い、メッセージを待ち行列に待機させるアドレス指定マシンである。したがって、Ｓマシンはその処理能力をプログラム命令の実行にのみ集中させることができ、本発明のマルチプロセッサ並列アーキテクチャの全体的な効率を大幅に最適化することができる。Ｓマシンは所望のデータを探し出すためのメタアドレスのローカルメモリコンポーネントにアクセスするだけでよく、地理アドレスは、Ｓマシンに対して透過性である。このアドレス指定アーキテクチャは、分散メモリ／分散プロセッサ並列計算システムときわめてよく相互作動する。ローカルメモリを分離するアーキテクチャ設計を選択することにより、ハードウェアを独立に、また並列に作動することができる。本発明によれば、各Ｓマシンは、１つの計算問題にすべて並列に向けられていたとしても、実行時には全く異なった再構成命令を有することができる。また、動的再構成Ｓマシンにより実現される命令セットアーキテクチャ（ＩＳＡ）が異なっていても良いばかりでなく、Ｓマシンを実現するのに用いられる実際のハードウェアが一定のタスクを実行するように最適化されていても良い。したがって、単一のシステム内のＳマシンはすべて異なるレートで作動してもよく、各Ｓマシンはシステムリソースの利用を最大限に高めながらその機能を最適に実行することができる。
【０２９８】
さらに、唯一のメモリ確認によって、正確な地理アドレスが伝送されていることが確認され、ローカルメモリアドレスの確認は提供されない。さらに、この確認は処理マシンではなくアドレス指定マシンによって実行される。仮想アドレス指定は用いられないので、仮想アドレスを論理アドレスに変換するためのハードウェア／ソフトウェア相互作動は必要ではない。メタアドレスのアドレスは、物理的アドレスである。このような予防的及び保全的機能をすべてなくすることにより、システム全体の処理速度が大幅に向上する。したがって、メタアドレス指定スキームと組合わせて、コンピュータシステムの「スペース」管理を、別個の処理マシンにより提供されるコンピュータシステムの「時間」管理から別のアドレス指定マシンに分離することにより、高度並列計算システムのための一意的なメモリ管理及びアドレス指定システムが提供される。本発明のアーキテクチャにより、Ｓマシンの作動は優れた柔軟性を有することとなり、Ｔマシンのレートを一定に保ったまま各Ｓマシンはそれぞれに最適なレートで作動することができる。システム全体のデータ通信を最も遠いスペースにも達するようにし、きわめて短時間でローカル命令処理を均衡させることができ、これによって高度並列コンピュータシステムによる複雑な問題の解決へのアプローチが改善される。
【図面の簡単な説明】
【図１】本発明に基づいて構築された、拡張性、並列、動的再構成計算のためのシステムの好ましい構成例を示すブロック図である。
【図２】本発明のＳマシンの好ましい構成例を示すブロック図である。
【図３】再構成指示を含む模範的プログラムリストの模式図である。
【図４】一連のプログラム命令のコンパイル中に実行される先行技術コンパイル作業のフローチャートである。
【図５】動的再構成計算のためにコンパイラによって実行される好ましいコンパイル作業のフローチャートである。
【図６】動的再構成計算のためにコンパイラによって実行される好ましいコンパイル作業のフローチャートである。
【図７】本発明の動的再構成処理装置（ＤＲＰＵ）の好ましい構成例を示すブロック図である。
【図８】本発明の命令取出し装置（ＩＦＵ）の好ましい構成例を示すブロック図である。
【図９】本発明の命令状態シーケンサ（ＩＳＳ）によって支援される好ましい１組の状態を示す模式図である。
【図１０】本発明の割込みロジックによって支援される好ましい１組の状態を示す模式図である。
【図１１】本発明のデータ演算装置（ＤＯＵ）の好ましい構成例を示すブロック図である。
【図１２】汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するために構成されたデータ演算装置（ＤＯＵ）の第１模範実施例の構成図である。
【図１３】内部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するために構成されたデータ演算装置（ＤＯＵ）の第２模範実施例の構成ブロック図である。
【図１４】本発明のアドレス演算装置（ＡＯＵ）の好ましい構成例を示すブロック図である。
【図１５】汎用外部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するために構成されたアドレス演算装置（ＡＯＵ）の第１模範実施例の構成ブロック図である。
【図１６】内部ループ命令セットアーキテクチャ（ＩＳＡ）を実動化するために構成されたアドレス演算装置（ＡＯＵ）の第２模範実施例の構成ブロック図である。
【図１７】（ａ）は、外部ループ命令セットアーキテクチャ（ＩＳＡ）のための命令取出し装置（ＩＦＵ）と、データ演算装置（ＤＯＵ）と、アドレス演算装置（ＡＯＵ）との間での再構成ハードウェアリソースの模範的割当てを示す模式図、（ｂ）は、内部ループ命令セットアーキテクチャ（ＩＳＡ）のための命令取出し装置（ＩＦＵ）と、データ演算装置（ＤＯＵ）と、アドレス演算装置（ＡＯＵ）との間での再構成ハードウェアリソースの模範的割当てを示す模式図である。
【図１８】本発明のＴマシンの好ましい構成例を示すブロック図である。
【図１９】本発明の相互結合入出力装置の構成例を示すブロック図である。
【図２０】本発明の入出力Ｔマシンの好ましい構成例を示すブロック図である。
【図２１】本発明の汎用相互結合マトリックス（ＧＰＩＭ）の好ましい構成例を示すブロック図である。
【図２２】本発明に基づく、拡張性、並列、動的再構成計算のための好ましい方法のフローチャートである。
【図２３】本発明に基づくデータパケットの好ましい構成例を示す模式図である。
【図２４】本発明に基づくデータ要求を発生させるための好ましい方法のフローチャートである。
【図２５】本発明に基づくデータを送るための好ましい方法のフローチャートである。
【図２６】本発明に基づくデータを受取るための好ましい方法のフローチャートである。
【図２７】本発明に基づく割込み処理演算を実行する相互結合入出力装置の好ましい構成例を示すブロック図である。
【図２８】本発明に基づく割込みを扱うための好ましい方法のフローチャートである。
【符号の説明】
１２動的プログラム処理マシン（Ｓマシン）
１４アドレス指定マシン（Ｔマシン）
１６相互結合装置（汎用相互結合マトリックス）
３４メモリ装置、ローカルメモリ（メモリ）
１０１アーキテクチャ記述メモリ
１０６，２２００割込みロジック
３２０アドレス復号器
１８００データパケット
１８１６地理アドレス
２２０８メタアドレス
２２０８認識装置
２２０４コンパレータ[0001]
BACKGROUND OF THE INVENTION
  The present invention relates generally to computer architectures, and more particularly to systems and methods for reconfiguration calculations, ie, metaaddressing architectures for dynamic reconfiguration calculations andMeta for dynamic reconfiguration computationIt relates to the addressing method.
[0002]
The present invention is disclosed in U.S. Pat. United States of America entitled "System and Method for Dynamic Reconfiguration Calculations Using Processors with Changeable Internal Hardware Organization" filed on Feb. 26, 1998, a divisional application of 5,794,062 Patent application no. This is a priority application based on the 09 / 031,323 US continuation application.
[0003]
[Prior art]
Advances in computer architecture are driven by demands for better computing performance. In order to solve various calculation problems quickly and accurately, different kinds of calculation resources are generally required. If the type of problem is limited, computational performance can be improved by using computational resources specifically constructed for the type of problem under consideration. For example, when digital signal processing (DSP) hardware is used in combination with a general-purpose computer, certain signal processing capabilities can be greatly improved. When the computer itself is specially built for the type of problem under consideration, computational performance will be further improved for that particular type of problem, or perhaps more optimized compared to the available computational resources Will be. Current parallel computers and massively parallel computers are O (n² ) Or more complex special types of problems, which is an example of the above case.
[0004]
Good computing performance is necessary, but it must be balanced with the need to minimize system costs and maximize system productivity in the widest possible range of applications now and in the future It must be balanced with gender. In general, special hardware is more expensive than general-purpose hardware, so incorporating a limited number of problem-specific computing resources into a computer system has a negative impact on keeping system costs low. Designing and producing a dedicated computer is extremely expensive in terms of engineering time and hardware costs. When dedicated hardware is used to improve computing performance, the performance advantage decreases as the need for computing performance changes. In the prior art, as the need for computing performance changes, new dedicated hardware or new dedicated systems are designed and manufactured, resulting in repetitive expenditures of undesirably high non-reusable design and manufacturing costs. Therefore, if a calculation resource dedicated to a particular type of problem is used, the available silicon resource will be used inefficiently when the degree of calculation changes. Therefore, an attempt to improve calculation performance using dedicated hardware for the reasons described above is undesirable.
[0005]
In the past, various attempts have been made to improve computational performance using reprogrammable or reconfigurable hardware and to maximize the applicability of problem types. The first such prior art approach is by a downloadable microcode computer architecture. In a downloadable microarchitecture, the functionality of fixed non-reconfigurable hardware resources can be selectively changed by using specific versions of microcode. An example of such an architecture is the IBM system / 360. Since the basic computing hardware itself of such prior art systems is not reconfigurable, such systems do not provide optimized computing performance when considering a wide variety of problems.
[0006]
A second prior art approach for improving computational performance and maximizing the applicability of problem types is to use reconfiguration hardware coupled to a non-reconfigurable host processor or host system. . In this prior art approach, it is most common to utilize one or more reconfiguration processors coupled to a non-reconfigurable host. This approach can be categorized as an “Attached Reconfigurable Processor” (ARP) architecture in which a portion of the hardware in the processor set attached to the host can be reconfigured. Examples of current additive reconfigurable processor (ARP) systems that utilize a set of reconfigurable processors coupled to a host system include SPLASH-1 and SPLASH-2 designed by the Supercomputing Research Center (Bowie, MD) And WILDFIRE Custom Configurable Computer (commercial version of SPLASH-2) manufactured by Annapolis Micro Systems (Annapolis, Maryland) and ECV-1 manufactured by Virtual Computer Corporation (Reseda, California). In many computational problems, considerable time is spent executing relatively small portions of program code. In general, a reconfiguration computation accelerator is provided for such portions of program code using an additional reconfigurable processor (ARP) architecture.
[0007]
[Problems to be solved by the invention]
Unfortunately, computational models based on one or more reconstructed computational accelerators have significant drawbacks as described in detail below.
[0008]
<First drawback>
The first drawback of the Additive Reconfigurable Processor (ARP) architecture is that the Additive Reconfigurable Processor (ARP) system attempts to perform optimal production of a specific algorithm of the reconfigurable hardware at a specific time Occur.
[0009]
For example, the design policy behind Virtual Computer Corporation's ECV-1 is to convert that particular algorithm to a particular configuration of reconfigurable hardware sources to provide optimal computational performance for that particular algorithm. Is. Reconfiguration hardware resources are used only for the purpose of providing optimal capabilities for a particular algorithm. The use of reconfigurable hardware resources for general purposes such as managing instruction execution is avoided. Thus, for a given algorithm, the reconfiguration hardware resources are considered from the overall picture of the individual gates combined for optimal performance.
[0010]
Some additional reconfigurable processor (ARP) systems are programming in which the “program” includes both conventional program instructions and dedicated instructions that define how the various reconfigurable hardware resources are interconnected. Rely on the model. Since the Additional Reconfigurable Processor (ARP) system considers the reconfigurable hardware resources in a unique way appropriate to the gate level algorithm, these dedicated instructions provide detailed information on the characteristics of each reconfigurable hardware resource used. Content and a method for reconfiguring hardware resources to be combined with other reconfigurable hardware resources must be provided. This complicates the program. In order to reduce programming complexity, attempts have been made to utilize programming models that include both conventional high-level programming language instructions and high-level dedicated instructions in the program. In other words, in current additive reconfigurable processor (ARP) systems, attempts have been made to use a compilation system that can compile both high-level programming language instructions and the above high-level dedicated instructions. The target output of such a compile system is assembly language code for conventional high-level programming language instructions, and hardware description language (HDL) for dedicated instructions. Unfortunately, it is an NP hard problem to automatically determine a set of reconfigurable hardware resources and interconnection schemes to obtain optimal computing performance for the particular algorithm under consideration. The long-term goal of some additive reconfigurable processor (ARP) systems is to develop a compilation system that can compile algorithms directly into an optimized interconnection scheme for a set of gates. However, developing such a compilation system is a very difficult task, especially when considering multiple types of algorithms.
[0011]
<Second disadvantage>
A second drawback of the additional reconfigurable processor (ARP) architecture is that the additional reconfigurable processor (ARP) device performs multiple reconfiguration logic to perform the computational work associated with the algorithms that make up the additional reconfigurable processor (ARP) device. It happens to be distributed throughout the device.
[0012]
For example, for an additional reconfigurable processor (ARP) device implemented using a set of field programmable logic circuits (FPGAs) and configured to implement a parallel multiplication accelerator, the computational work associated with parallel multiplication Are distributed throughout the field programmable logic (FPGA). Thus, the size of the algorithm that can construct an additional reconfigurable processor (ARP) device is limited by the number of reconfigurable logic devices present. Similarly, the maximum data set size that can be handled by an additional reconfigurable processor (ARP) device is also limited. Because some algorithms have data dependencies, testing the source code does not necessarily explicitly indicate the limits of an additional reconfigurable processor (ARP) device. In general, data dependency algorithms are avoided.
[0013]
In addition, since an additional reconfigurable processor (ARP) architecture is disclosed that distributes computational work across multiple reconfigurable logic units, reconfiguration is done together to include new or slightly modified algorithms. There is a need. That is, the multiple reconfiguration logic device must be reconfigured. This limits the maximum rate at which reconfiguration can be performed for another problem or cascaded secondary problems.
[0014]
<Third drawback>
A third drawback of the additive reconfigurable processor (ARP) architecture arises because one or more portions of program code are executed on the host.
[0015]
That is, an additional reconfigurable processor (ARP) device is not an independent computing system itself and does not execute the entire program. This requires interaction with the host. Because some program code is executed on a non-reconfigurable host, available silicon resources are not utilized to the maximum extent in the program execution time frame. In particular, during execution of instructions by the host, the silicon resources of the additional reconfigurable processor (ARP) device are idle or in an inefficient utilization state. Similarly, when an additional reconfigurable processor (ARP) device processes data, the use of silicon resources at the host is largely inefficient. In order to easily execute all the multiple programs, the silicon resources in the system must be grouped into resources that can be easily reused. As described above, an additional reconfigurable processor (ARP) system treats reconfigurable hardware resources as a set of gates that are optimally interconnected for the implementation of a particular algorithm at a particular time. Therefore, because an algorithm must have some degree of independence to be reusable, an additional reconfigurable processor (ARP) system is a resource that can easily reuse a specific set of reconfigurable hardware resources for each algorithm. Does not provide a means to treat
[0016]
An additional reconfigurable processor (ARP) device cannot handle a currently executing host program as data and cannot generally adapt itself to a computing environment. An additional reconfigurable processor (ARP) device is not designed to simulate itself by executing its own host program. In addition, an additional reconfigurable processor (ARP) device is designed to compile its own hardware description language (HDL) or application program against itself, directly using the reconfigurable hardware resources that are constructed. Not. Thus, additional reconfigurable processor (ARP) devices are architecturally limited with respect to an independent computation model that discloses independence from the host processor.
[0017]
An additional reconfigurable processor (ARP) device functions as a computational accelerator and generally cannot perform independent input / output (I / O) processing. Typically, additional reconfigurable processor (ARP) devices require host interaction for input / output processing. Therefore, the performance of an additional reconfigurable processor (ARP) device may be limited for input and output. However, those skilled in the art will appreciate that an additional reconfigurable processor (ARP) device can be configured to accelerate certain I / O problems. However, since the entire additional reconfigurable processor (ARP) device is configured for a single specific problem, the additional reconfigurable processor (ARP) device balances input / output processing and data processing without adversely affecting each other. I can't take it. In addition, additional reconfigurable processor (ARP) devices do not provide a means for interrupt handling. Since an additional reconfigurable processor (ARP) device is aimed at maximizing computational acceleration, the disclosure relating to an additional reconfigurable processor (ARP) does not mention such an interrupt mechanism and interrupts. Has a negative impact on computational acceleration.
[0018]
<4th fault>
A fourth disadvantage of the additive reconfigurable processor (ARP) architecture arises from the existence of software applications with inherent data parallelism that are difficult to utilize using additive reconfigurable processor (ARP) devices.
[0019]
A hardware description language (HDL) compiled application is one such example when netname symbol derivation of a very large netlist is required.
[0020]
<5th fault>
A fifth drawback of the additional reconfigurable processor (ARP) architecture is that it is basically a single instruction stream multiple data stream (SIMD) computer architecture model.
[0021]
Thus, an additive reconfigurable processor (ARP) architecture is not as efficient as an architecture compared to one or more innovative prior art non-reconfigurable systems. An additional reconfigurable processor (ARP) system, for each specific configuration example, for the same degree of computational power that available reconfigurable hardware can provide, for a fraction of the program execution process, primarily for arithmetic calculations. Only the arithmetic logic of is reflected. In contrast, in the 1971 Fairchild SYMBOL machine system design, the entire computer uses a unique hardware context for each aspect of program execution. As a result, SYMBOL, including the host portion disclosed by the Additional Reconfigurable Processor (ARP) system, contained all the components for the computer's system application.
[0022]
<Other disadvantages>
The additional reconfigurable processor (ARP) architecture has other drawbacks.
[0023]
For example, an additional reconfigurable processor (ARP) device does not have an effective means for providing independent timing to multiple reconfigurable logic devices. Similarly, cascaded additional reconfigurable processor (ARP) devices do not have an effective clock distribution means to provide independently timed devices. As another example, it may be difficult to accurately correlate execution time with source code statements attempting acceleration. To accurately calculate the net system clock rate, you must model an additional reconfigurable processor (ARP) device using computer-aided design (CAD) after hardware description language (HDL) compilation. However, reaching these basic parameters is a time consuming process.
[0024]
An equally important problem with traditional architectures is that they use virtual or shared memory. The disclosure is that it uses an integrated address space, which requires more complex addressing operations, thus slowing memory access and reducing efficiency. For example, using virtual memory to access individual bits of a memory device in the system, the physical address space of the memory must first be partitioned into logical addresses, and then the virtual addresses must be mapped to the logical addresses. In this way, each bit of the memory can finally be accessed. Furthermore, in a shared memory system, the memory operation is further complicated because the processor typically performs an address verification operation before allowing access to the memory. Finally, the processor must coordinate (arbitrate) among multiple processes attempting to access the same area of memory simultaneously by providing some sort of prioritization system.
[0025]
Many conventional systems use memory management units (MMUs) to convert logical addresses to virtual addresses, etc. to address the many problems that arise from using shared memory and virtual memory. Most of the memory management functions are running. However, memory access operations are further complicated by memory management system (MMU) / software interactions. Furthermore, the memory management system (MMU) has a very limited number of operations that can be performed. The memory management system (MMU) cannot handle interrupts, cannot queue messages in a queue, and cannot perform complex addressing operations, so all must be performed by the processor . The use of shared memory or virtual memory systems in computer architectures with multiple parallel processors further exacerbates the disadvantages described above. Not only must the hardware / software interaction be managed as described above, but the coherency and consistency of the data in the memory depending on the multiple processors attempting to access the shared memory, software and hardware Must be maintained by both. When a processor is added, it becomes more difficult to convert a virtual address to a logical address. This complexity in memory access operations inevitably reduces system capacity, and the capacity drops as the number of processors increases and the system scales up.
[0026]
An example of a conventional system is a cache coherent, non-uniform memory access (ccNUMA) computer architecture. Non-uniform memory access (ccNUMA) machines use complex and expensive hardware such as cache controllers and crossbar switchers, and this memory is independent of each other even though it is actually shared by multiple processors. The phantom (virtual space) of a single address space is maintained for the CPU. Non-uniform memory access (ccNUMA) is somewhat extensible, but this extensibility is achieved by adding hardware to tightly couple the processors in the system. This type of system is used in computing environments where a single program image is shared that requires very wide bandwidth for shared memory I / O operations, as in the case of finite element grids in scientific computing. And more advantageous. Furthermore, non-uniform memory access (ccNUMA) is not useful in systems where the processor characteristics are not similar to each other. In a non-uniform memory access (ccNUMA) architecture, each added processor must be of the same type as the existing processor. Therefore, non-uniform memory access (ccNUMA) architectures are not an effective solution in systems where the processor is optimized to perform different functions and operates differently. Finally, in conventional systems, only standard memory addressing schemes are used to address memory in the system.
[0027]
What is needed is a means for addressing memory in a parallel computing environment that provides scalable, straightforward addressing and has little impact on system performance. .
[0028]
[Problems to be solved by the invention]
  A first invention is a metaaddressing architecture for specifying a local memory destination for a data packet of a dynamic reprogrammable processing machine to solve the above-mentioned problem,When the first memory device and the first memory device are coupled to each other and receive a predetermined command, it is determined whether the received command is a command requesting a remote operation including remote operation information, and the instruction includes the remote operation information If it is determined that the command is a request for remote computation, the remote computation information included in the command is stored in the first memory device, and the memory address in the first memory device in which the remote computation information is stored is included. A dynamic reprogrammable processing machine that generates a conditional instruction and an unconditional instruction received from the first dynamic reprogrammable processing machine, and remote operation information is received from the first memory device based on a memory address included in the unconditional instruction The purpose of searching and indicating the destination address of the data packet generated based on the remote calculation informationGeographic address andThe purpose of indicating the address of the second memory device to which data is written at the destinationlocalmemoryMeta address including addressBased on remote calculation informationGenerate andGenerates a data packet containing a meta address based on remote computation informationAn addressing machine;FirstAddressing machineAnd a second addressing machine indicated in the destination geographic address as the data destinationCombined withA data packet generated by the first addressing machine is received from the first addressing machine andIncluded in meta addressthe purposeAccording to geographic addressThe firstAddressing machineAnd a second addressing machineIn betweenMutualAnd an interconnection device for routing data.
[0029]
  In the second invention, the dynamic reprogrammable processing machine executes processing in accordance with the received instruction when the received instruction is not an instruction requesting remote operation..
[0030]
  In a third aspect, when the second addressing machine receives the data packet generated by the first addressing machine, the second addressing machine decodes the meta address included in the data packet into the destination geographic address and the destination local memory address. If the address decoder that compares the geographic address indicating the address of the second addressing machine and the decoded target geographic address matches the result of the comparison by the address decoder, and the geographic address matches the target geographic address, the data A controller for transmitting the packet to a second dynamic reprogrammable processing machine;It comprises.
[0031]
  The fourth invention further comprises an architecture description memory device coupled to the second dynamic reprogrammable processing machine for storing a geographic address indicating an address for the coupled second dynamic reprogrammable processing machine.It has.
[0032]
  According to a fifth aspect, the second addressing machine further includes an interrupt handler coupled to the input / output device, and the interrupt handler verifies the validity of the interrupt request with an identification device for identifying the interrupt request A comparator for comparing the identified interrupt request with a stored list of interrupt requests, and an interrupt logic for processing interrupt requests validated according to stored interrupt processing instructionsAnd.
[0033]
  In the sixth invention, the meta address is 80 bits wide, the destination geographic address is 16 bits wide, and the local address is 64 bits wide.
[0034]
  A seventh invention is a metaaddressing method for designating a local memory destination for a data packet of a dynamic reprogrammable processing machine, wherein when a predetermined instruction is received, the received instruction includes remote operation information. If the command is a command requesting remote calculation including remote calculation information, the remote calculation information included in the command is stored in the first memory device, and the remote calculation information is stored. An unconditional instruction generation step for generating an unconditional instruction including a memory address in the first memory device in which is stored by a dynamic reprogrammable processing machine, and an unconditional instruction generated in the unconditional instruction generation step is dynamically generated Remote operation information is received from the first memory device based on the memory address received from the reprogrammable processing machine and included in the unconditional instruction. A meta address including a destination geographic address indicating a destination address of a data packet generated based on the remote calculation information and a destination local memory address indicating a second memory device address at which data is written at the destination. A data packet generation step for generating a data packet including a meta address by an addressing machine based on the remote calculation information, and a data packet generated at the data packet generation step, the first addressing machine Between the first addressing machine and the second addressing machine indicated in the destination geographic address as the destination of the data according to the destination geographic address contained in the meta-address in the data packet Data routing steps To.
[0035]
  In an eighth aspect of the present invention, in the unconditional instruction generation step, the dynamic reprogrammable processing machine executes a process according to the received instruction when the received instruction is not an instruction requesting remote operation.
[0036]
  In a ninth aspect, when the second addressing machine receives the data packet generated by the first addressing machine, the metaaddress included in the data packet is decoded into the destination geographic address and the destination local memory address. If the result of the comparison by the address decoding step for comparing the geographical address indicating the address of the second addressing machine with the decrypted target geographical address and the comparison by the address comparing step are the same, the data Transmitting the packet to a second dynamic reprogrammable processing machine;It has.
[0037]
  In a tenth aspect, a geographic address indicating an address for a second dynamic reprogrammable processing machine is stored in an architecture description memory device coupled to the second dynamic reprogrammable processing machine.
[0038]
  According to an eleventh aspect of the invention, an interrupt handler coupled to an input / output device included in the second addressing machine verifies an interrupt request identified by the identification step for identifying the interrupt request and the interrupt request. A comparison step for comparing the identified interrupt request with a stored list of interrupt requests, and for processing an interrupt request whose validity has been confirmed based on a comparison result of each step compared to the previous period according to a stored interrupt processing instruction. Steps.
[0039]
  In the twelfth aspect, the meta address is 80 bits wide, the destination geographic address is 16 bits wide, and the local address is 64 bits wide..
[0043]
DETAILED DESCRIPTION OF THE INVENTION
<Overview>
The present invention includes a set of S machines, a T machine corresponding to each S machine, a general-purpose interconnect matrix (GPIM), a set of input / output T machines, and a set of input / outputs. The device and the master time base device form a system for scalable, parallel, dynamic reconfiguration computation. Each S machine is a dynamic reconfigurable computer including a memory, a first local time base device, and a dynamically reconfigurable processing unit (DRPU). The dynamic reconfiguration processing unit (DRPU) includes a reprogrammable logic unit configured as an instruction fetch unit (IFU), a data arithmetic unit (DOU), and an address arithmetic unit (AOU). Which are selectively reconfigured during program execution in response to the selection of a reconfiguration interrupt or a reconfiguration instruction embedded in a set of program instructions, respectively. Each reconfiguration interrupt and each reconfiguration instruction is a configuration data set that specifies a dynamic reconfiguration processor (DRPU) hardware organization that is optimized for the implementation of a specific instruction set architecture (ISA). Refer back to. An instruction fetch unit (IFU) instructs a reconfiguration operation, an instruction fetch / decode operation, and a memory access operation, and transmits a control signal to a data operation unit (DOU) and an address operation unit to facilitate the execution of the instruction. (AOU). The data arithmetic unit (DOU) performs data arithmetic, and the address arithmetic unit (AOU) performs address arithmetic. Each T machine is a data transfer device including a common interface control unit (CICU), one or more interconnected input / output devices, and a second local time base device. A general purpose interconnection matrix (GPIM) is an extensible interconnection network that facilitates parallel communication between T machines. This one set of T machines and general interconnection matrix (GPIM) facilitates parallel communication between S machines. The T machine also controls the transfer of data between the S machines in the network and provides the required addressing operations. The meta address is used to provide each S machine with extensible bit addressing capabilities.
[0044]
<Specific embodiment>
FIG. 1 is a block diagram of a preferred embodiment of a system 10 for scalable, parallel, dynamic reconfiguration computation constructed in accordance with the present invention. The system 10 includes at least one S machine 12, a T machine 14 corresponding to each S machine 12, a general purpose interconnection matrix (GPIM) 16, at least one input / output T machine 18, and one or more thereof. It is preferable that the above input / output device 20 and the master time base device 22 are included. In the preferred embodiment, the system 10 includes multiple S machines 12, and thus multiple T machines 14, multiple input / output T machines 18, and multiple input / output devices 20.
[0045]
S machine 12, T machine 14, and input / output T machine 18 each include a master timing input that is coupled to a timing output of master time base device 22. Each S machine 12 includes an input section and an output section coupled to a corresponding T machine 14. Each T machine 14 includes a routing input and routing output coupled to a general purpose interconnection matrix (GPIM) 16 in addition to an input and output coupled to a corresponding S machine 12. Similarly, each input / output T machine 18 includes an input and output coupled to the input / output device 20, and includes a routing input and routing output coupled to a general purpose interconnection matrix (GPIM) 16. Contains.
[0046]
As described in detail below, each S machine 12 is a dynamic reconfiguration computer. The general purpose interconnection matrix (GPIM) 16 forms a point-to-point parallel interconnection means that facilitates communication between the T machines 14. The T machine 14 and the general interconnection matrix (GPIM) 16 form a two-point parallel interconnection means for data transfer between the S machines 12. Similarly, a general purpose interconnection matrix (GPIM) 16, a set of T machines 14, and a set of input / output T machines 18 are used for input / output transfer between the S machine 12 and each input / output device 20. The two-point parallel interconnection means is formed. The master time base device 22 includes an oscillator that sends a master timing signal to each S machine 12 and each T machine 14.
[0047]
In the exemplary embodiment, each S machine 12 uses a Xilinx XC4013 (Xilinx, Inc., San Jose, Calif.) Field Programmable Gate Array (FPGA) coupled to a 64 megabyte random access memory (RAM). Has been implemented. Each T-machine 14 is implemented using about 50% of the reconfigurable hardware resources of the Xilinx XC4013 field programmable gate array (FPGA), similar to each I / O T-machine 18. The general interconnect matrix (GPIM) 16 is implemented as an annular interconnect mesh. The master time base unit 22 is a clock oscillator coupled to a clock distribution circuit that presents a system-wide frequency reference and is a US patent application entitled “System and Method for Phase Synchronization, Flexible Frequency Clocking and Messaging ( System and Method for Phase-Synchronous, Flexible Frequency Clocking and Messaging) ”. The T machine 14, the S machine 12, and the input / output T machine 18 preferably transfer information in accordance with ANSI / IEEE standard 1596-1992, which defines an extensible coherent interface (SCI).
[0048]
In the preferred embodiment, system 10 includes multiple S machines 12 functioning in parallel. The structure and function of each S machine 12 will be described in detail below with reference to FIGS. FIG. 2 is a block diagram of a preferred embodiment of the S machine 12. The S machine 12 includes a first local time base device 30, a dynamic reconfiguration processing device (DRPU) 32 for executing program instructions, and a memory 34. The first local time base device 30 includes a timing input unit that forms a master timing input unit of the S machine. The first local time base device 30 also sends the first local timing signal, ie, the clock, to the timing input unit of the dynamic reconfiguration processor (DRPU) 32 via the first timing signal line 40, and to the timing input unit of the memory 34. Includes a timing output section. The dynamic reconfiguration processor (DRPU) 32 is coupled to the control signal output unit coupled to the control signal input unit of the memory 34 via the memory control line 42 and to the address input unit of the memory 34 via the address line 44. And an address output section and a bidirectional data port coupled to the bidirectional data port of memory 34 via memory input / output line 46. In addition, the dynamic reconfiguration processor (DRPU) 32 includes a bidirectional data port coupled to its corresponding T-machine 14 bidirectional data port via an external control line 48. As shown in FIG. 2, the memory control line 42 is X bits, the address line 44 is M bits, the memory input / output line 46 is (N × k) bits, and the external control line 48 is Y bits. .
[0049]
In the preferred embodiment, the first local time base device 30 receives a master timing signal from the master time base device 22. The first local time base device 30 generates a first local timing signal from the master timing signal, and sends the first local timing signal to the dynamic reconfiguration processor (DRPU) 32 and the memory 34. In the preferred embodiment, the first local timing signal is different for each individual S machine 12. Therefore, the dynamic reconfiguration processor (DRPU) 32 and the memory 34 in a predetermined S machine 12 are independent of the clock rate independent of the dynamic reconfiguration processor (DRPU) 32 and the memory 34 in another S machine 12. It works with. The first local timing signal is preferably phase synchronized with the master timing signal. In a preferred embodiment, the first local time base device 30 is implemented using a phase lock frequency conversion circuit that includes a phase lock detection circuit implemented using reconfigurable hardware resources. One skilled in the art will recognize that in other embodiments, the first local time base device 30 can be implemented as part of a clock distribution tree.
[0050]
The memory 34 is preferably implemented as a RAM and stores program instructions, program data, and configuration data for the dynamic reconfiguration processor (DRPU) 32. The memory 34 of any S machine 12 is preferably accessible to other S machines 12 in the system 10 via a general purpose interconnection matrix (GPIM) 16. Further, each S machine 12 preferably has a uniform memory address space. In the preferred embodiment, the program instructions stored in memory 34 optionally include reconfiguration instructions directed to dynamic reconfiguration processor (DRPU) 32. FIG. 3 is an exemplary program list 50 including a reconfiguration instruction. As shown in FIG. 3, the exemplary program list 50 includes a set of outer loop portions 52, a first inner loop portion 54, a second inner loop portion 55, a third inner loop portion 56, and a fourth inner loop portion. 57 and a fifth inner loop portion 58. One skilled in the art will recognize that the term “inner loop” refers to an iterative portion of a program that performs a particular set of related operations, and that the term “outer loop” performs primarily general purpose operations and / or one inner loop. It will be readily appreciated that it refers to the part of the program that transfers control from one part to another inner loop part. In general, the

inner loop portions

54, 55, 56, 57, 58 of the program perform certain operations on potentially large data sets. For example, in an image processing application, the first inner loop portion 54 performs color format conversion operations on the image data, and the second to fifth

inner loop portions

54, 55, 56, 57, 58 are linear filtering operations, convolution operations. The pattern search calculation and the compression calculation are executed. One skilled in the art will recognize that a continuous sequence of

inner loop portions

55, 56, 57, 58 can be considered as a software pipeline. Each outer loop portion 52 is responsible for data input and output and / or directs the transfer of data and control from the first inner loop portion 54 to the second inner loop portion 55. One skilled in the art will further appreciate that a given

inner loop portion

54, 55, 56, 57, 58 includes one or more reconfiguration instructions. In general, for any program, the outer loop portion 52 of the program list 50 contains various general purpose instructions, while the

inner loops

54 and 56 of the program list 50 are relatively few kinds of instructions that are used to execute a particular instruction set. Consists of.
[0051]
In the exemplary program list 50, the first reconfiguration instruction appears at the beginning of the first inner loop portion 54 and the second reconfiguration instruction appears at the end of the first inner loop portion 54. Similarly, the third reconfiguration instruction is at the start of the second inner loop portion 55, the fourth reconfiguration instruction is at the start of the third inner loop portion 56, and the fifth reconfiguration instruction is at the fourth inner loop portion. At the beginning of 57, the sixth and seventh reconstruction instructions appear at the beginning and end of the fifth inner loop portion 58, respectively. Each reconfiguration instruction is for activating a specific instruction set architecture (ISA), and a configuration data set that specifies an internal dynamic reconfiguration processor (DRPU) hardware organization optimized for it Is preferably indicated. The instruction set architecture (ISA) is a basic or core instruction set that can be used to program a computer. The instruction set architecture (ISA) defines an instruction format, an operation code, a data format, an addressing mode, an execution control flag, and a program accessible register. Those skilled in the art will appreciate that this corresponds to the traditional definition of instruction set architecture (ISA). In the present invention, the dynamic reconfiguration processor (DRPU) 32 of each S machine directly implements a multiple instruction set architecture (ISA) using a unique configuration data set for each desired instruction set architecture (ISA). Can be a quick runtime configuration. That is, each instruction set architecture (ISA) is implemented with its own internal dynamic reconfiguration processor (DRPU) hardware organization defined by a corresponding configuration data set. Thus, in the present invention, the first to fifth

inner loop portions

54, 55, 56, 57, 58 are each a unique instruction set architecture (ISA), ie, instruction set architecture (ISA) 1, instruction set architecture (ISA) 2, It corresponds to an instruction set architecture (ISA) 3, an instruction set architecture (ISA) 4, and an instruction set architecture (ISA) k. Those skilled in the art will appreciate that each successive instruction set architecture (ISA) need not be unique. Thus, instruction set architecture (ISA) k may be instruction set architecture (ISA) 1, instruction set architecture (ISA) 2, instruction set architecture (ISA) 3, instruction set architecture (ISA) 4 or different instructions. It may be a set architecture (ISA). A set of outer loop portions 52 also corresponds to a unique instruction set architecture (ISA), ie, instruction set architecture (ISA) 0. In the preferred embodiment, during the execution of the program, the selection of successive reconstruction instructions is made data dependent (depending on the data). When a particular reconfiguration instruction is selected, the program instructions are then executed according to the corresponding instruction set architecture (ISA) by the unique dynamic reconfiguration processor (DRPU) hardware configuration specified by the corresponding configuration data set. The
[0052]
In the present invention, a specific instruction set architecture (ISA) is classified as an inner loop instruction set architecture (ISA) or an outer loop instruction set architecture (ISA) according to the number and type of instructions that the instruction set architecture (ISA) contains. Can do. An instruction set architecture (ISA) that includes several instructions and is useful for performing general purpose operations is the outer loop instruction set architecture (ISA), while it contains relatively few instructions and is directed to the execution of certain types of instructions. The instruction set architecture (ISA) is the inner loop instruction set architecture (ISA). The outer loop instruction set architecture (ISA) is directed to the execution of general purpose operations and is most useful when sequential execution of program instructions is desired. The execution performance of the outer loop instruction set architecture (ISA) is preferably characterized by a clock cycle for each instruction executed. In contrast, the Inner Loop Instruction Set Architecture (ISA) is most useful when parallel execution of program instructions is desired because it is directed to the execution of specific types of instructions. The execution performance of an inner loop instruction set architecture (ISA) is preferably characterized by instructions executed per clock cycle, or by computational results obtained per clock cycle.
[0053]
Those skilled in the art will appreciate that the previous description of sequential and parallel execution of program instructions relates to the execution of program instructions within a single dynamic reconfiguration processor (DRPU) 32. When each program instruction sequence is executed by a specific dynamic reconfiguration processor (DRPU) 32 due to the presence of the multiple S machine 12 in the system 10, the multiple program instruction sequence can be executed in parallel at an arbitrary time. It becomes easy. Each dynamic reconfiguration processor (DRPU) 32 has parallel or serial hardware to implement a particular inner loop instruction set architecture (ISA) or outer loop instruction set architecture (ISA), respectively, at a particular time. Hardware is included. The internal hardware configuration of any dynamic reconfiguration processor (DRPU) 32 changes over time according to the selection of one or more reconfiguration instructions embedded within the sequence of program instructions being executed.
[0054]
In the preferred embodiment, each instruction set architecture (ISA) and its corresponding internal dynamic reconfiguration processor (DRPU) hardware organization is a specific class of computation for a set of available reconfigurable hardware resources. Designed to provide optimal computational performance for the above problems. As described above and as described in detail below, the internal dynamic reconfiguration processor (DRPU) hardware organization corresponding to the outer loop instruction set architecture (ISA) optimizes the sequential execution of program instructions. Preferably it is done. Also, the internal dynamic reconfiguration processor (DRPU) hardware organization corresponding to the internal loop instruction set architecture (ISA) is preferably optimized for parallel execution of program instructions. An exemplary generalized external loop instruction set architecture (ISA) is shown in reference material A, and an exemplary internal loop instruction set architecture (ISA) dedicated to convolution is shown in reference material B.
[0055]
Except for each reconfiguration instruction, the exemplary program list 50 of FIG. 3 is preferably composed of conventional high-level language sentences, for example, sentences written according to the C programming language. Those skilled in the art will appreciate that including a series of one or more reconfiguration instructions in a series of program instructions requires a modified compiler to accommodate the reconfiguration instructions. FIG. 4 is a flowchart of a prior art compile operation performed during compilation of a series of program instructions. Here, the compile operation of the prior art is substantially equivalent to that executed by the GNU C Compiler (GCC) created by the Free Software Foundation (Cambridge, Mass.). Those skilled in the art will appreciate that the prior art compilation operations described below can be easily generalized for other compilers. The prior art compilation operation begins at step 500 where the compiler front end selects the next high level statement from a series of program instructions. Next, in step 502, the compiler front end generates intermediate level code corresponding to the selected high level sentence. In the case of the GNU C compiler (GCC), this corresponds to a register transfer level (RTL) statement. After step 502, in step 504, the compiler front end determines whether further high level statements need to be considered. If so, the preferred method returns to step 500.
[0056]
If at step 504 the compiler front end determines that no other high level statements need to be considered, then at step 506 the compiler back end performs a conventional register allocation operation. After step 506, in step 508, the compiler back end selects the next register transfer level (RTL) statement for consideration within the current register transfer level (RTL) statement group. Next, at step 510, the compiler backend determines whether there are rules that define how the current register transfer level (RTL) statement group can be translated into a set of assembly language statements. If no such rule exists, the preferred method returns to step 508 to select a further register transfer level (RTL) statement for inclusion in the current register transfer level (RTL) statement group. If there is a rule corresponding to the current register transfer level (RTL) statement group, then at step 512, the compiler backend generates a set of assembly language statements according to the rule. After step 512, the compiler back end determines whether the next register transfer level (RTL) statement needs to be considered in the context of the next register transfer level (RTL) statement group. When necessary, the preferred method returns to step 508. If not, the preferred method ends.
[0057]
The present invention preferably includes a compiler for dynamic reconfiguration computation. 5 and 6 are flowcharts of preferred compilation operations performed by the compiler for dynamic reconfiguration computation. A preferred compilation operation begins at step 600, where the compiler front end for dynamic reconfiguration computation selects the next high-level statement in a series of program instructions. Next, at step 602, the compiler front end for dynamic reconfiguration computations determines whether the selected high-level sentence is a reconfiguration instruction. If it is a reconfiguration instruction, the front end of the compiler for dynamic reconfiguration calculation generates a register transfer level (RTL) reconfiguration statement in step 604 and returns to step 600. In the preferred embodiment, the register transfer level (RTL) reconfiguration statement is a non-standard register transfer level (RTL) statement that includes an instruction set architecture (ISA) identification. If, in step 602, the selected high-level program statement is not a reconfiguration instruction, then in step 606, the compiler front end for dynamic reconfiguration calculation uses a conventional method to generate a set of register transfer level (RTL) statements. Is generated. After step 606, at step 608, the compiler front end for dynamic reconfiguration computations determines whether further high level statements need to be considered. When necessary, the preferred method returns to step 600. Otherwise, the preferred method proceeds to step 610 to begin backend computation.
[0058]
At step 610, the compiler backend for dynamic reconfiguration computation performs a register allocation operation. In the preferred embodiment of the present invention, each instruction set architecture (ISA) is defined such that the register architecture for each instruction set architecture (ISA) matches each other. Thus, the register allocation operation is performed in a conventional manner. Those skilled in the art will recognize that it is generally not an absolute requirement that register architectures per instruction set architecture (ISA) match each other. Next, at step 612, the compiler back-end for dynamic reconfiguration computation selects the next register transfer level (RTL) statement within the currently considered register transfer level (RTL) statement group. Next, in step 614, the compiler backend for dynamic reconfiguration computations determines whether the selected register transfer level (RTL) statement is a register transfer level (RTL) reconfiguration statement. If the selected register transfer level (RTL) statement is not a register transfer level (RTL) reconfiguration statement, the compiler backend for dynamic reconfiguration computation at step 618 determines that the register transfer level (RTL) statement currently under consideration. Determine if there is a rule for the group. If not, the preferred method returns to step 612 to select the next register transfer level (RTL) statement group for inclusion in the currently considered register transfer level (RTL) statement group. If there is a rule for the register transfer level (RTL) statement group currently under consideration at step 618, then at step 620, the compiler backend for dynamic reconfiguration computations will follow the current register transfer under consideration according to this rule. A set of assembly language sentences corresponding to a level (RTL) sentence group is generated. After step 620, in step 622, the compiler back-end for dynamic reconfiguration computation needs to consider another register transfer level (RTL) statement in the context of the next register transfer level (RTL) statement group. Determine whether there is. If so, the preferred method returns to step 612, otherwise the preferred method ends.
[0059]
If, in step 614, the selected register transfer level (RTL) statement is a register transfer level (RTL) reconfiguration statement, the compiler backend for dynamic reconfiguration calculation in step 616 re-registers the register transfer level (RTL). A set of rules corresponding to the instruction set architecture (ISA) identification in the composition statement is selected. In the present invention, there is preferably a unique rule for each instruction set architecture (ISA). Each rule set thus provides one or more rules for converting a register transfer level (RTL) statement group into an assembly language statement according to a specific instruction set architecture (ISA). After step 616, the preferred method proceeds to step 618. A rule set corresponding to any instruction set architecture (ISA) may include rules for translating register transfer level (RTL) reconfiguration statements into a set of assembly language instructions that cause a software interrupt. preferable. As a result of this software interrupt, a reconfiguration handler is executed, which will be described in detail below.
[0060]
In the method described above, the compiler for dynamic reconfiguration computation selectively and automatically generates assembly language statements according to a multiple instruction set architecture (ISA) during a compile operation. In other words, during compilation, compilers for dynamic reconfiguration computations compile a set of program instructions according to different instruction set architectures (ISAs). The compiler for dynamic reconfiguration computation is preferably a conventional compiler that has been modified to perform the preferred compilation operations as described above with reference to FIGS. Those skilled in the art will appreciate that the required modifications are not complex, but such modifications are not obvious in view of prior art compilation techniques and prior art reconstruction computation techniques.
[0061]
FIG. 7 is a block diagram of a preferred embodiment of the dynamic reconfiguration processor (DRPU) 32. The dynamic reconfiguration processing unit (DRPU) 32 includes an instruction fetch unit (IFU) 60, a data arithmetic unit (DOU) 62, and an address arithmetic unit (AOU) 64. Each of the instruction fetch unit (IFU) 60, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 includes a timing input coupled to the first timing signal line 40. The instruction fetch unit (IFU) 60 includes a memory control output coupled to the memory control line 42, a data input coupled to the memory input / output line 46, and a bidirectional control port coupled to the external control line 48. Is included. The instruction fetch unit (IFU) 60 further includes a first control output unit coupled to the first control input unit of the data processing unit (DOU) 62 via the first control line 70, and an address calculation via the second control line 72. And a second control output coupled to a first control input of an apparatus (AOU) 64. The instruction fetch unit (IFU) 60 is connected to a second control input unit of the data arithmetic unit (DOU) 62 and a second control input unit of the address arithmetic unit (AOU) 64 via a third control line 74. Includes output. Data arithmetic unit (DOU) 62 and address arithmetic unit (AOU) 64 each include a bi-directional data port coupled to memory input / output line 46. Finally, the address arithmetic unit (AOU) 64 includes an address output unit that forms the address output unit of the dynamic reconfiguration processing unit (DRPU).
[0062]
The dynamic reconfiguration processor (DRPU) 32 is a field such as a reconfiguration logic device or a reprogrammable logic device, such as Xilinx XC4013 (Xilinx, Inc., San Jose, Calif.) Or AT & T ORCA IC07 (AT & T Microelectronics, Allentown, Pennsylvania). It is preferably implemented using a programmable gate array (FPGA). A reprogrammable logic unit can have multiple,
1) Selectively Reprogrammable Logic Blocks or Configurable Logic Blocks (CLB),
2) Selective reprogrammable input / output blocks (IOB: I / O Blocks);
3) a selective reprogrammable interconnect structure;
4) data storage resources;
5) a ternary buffer resource;
6) Wired logic function capability;
It is preferable to provide. Each logic block (CLB) preferably includes a selective reconfiguration circuit for generating logic functions, storing data, and performing signal routing. One skilled in the art will recognize that the reconfigurable data storage circuit is one or more data storage blocks (DSBs) separate from the logic block (CLB), depending on the exact design of the reprogrammable logic device in use. I will admit that it may be included. Here, the reconfiguration data storage circuit in the field programmable gate array (FPGA) is incorporated in the logic block (CLB). That is, the existence of a data storage block (DSB) is not assumed. Those skilled in the art will recognize that one or more components that utilize the logical block (CLB) based reconfiguration data storage circuit described above are data storage blocks (DSBs) if a data storage block (DSB) is present. It will be appreciated that a base circuit can also be used. Each input / output block (IOB) preferably includes a selective reconfiguration circuit for transferring data between a logic block (CLB) and a field programmable gate array (FPGA) output pin. The configuration data set defines a dynamic reconfiguration processor (DRPU) hardware configuration or organization by specifying functions to be executed within a logical block (CLB), and
1) Within a logical block (CLB)
2) Between logical blocks (CLB),
3) In the input / output block (IOB),
4) Between input / output blocks (IOBs), and
5) Between logical block (CLB) and input / output block (IOB)
Determine the mutual coupling. Those skilled in the art will recognize that the number of bits in each of the memory control lines 42, address lines 44, memory input / output lines 46, and external control lines 48 can be reconfigured by the configuration data set. The reconstructed data set is preferably stored on one or more S machines 34 in the system 10. Those skilled in the art will appreciate that the dynamic reconfiguration processor (DRPU) 32 is not limited to field programmable gate array (FPGA) based implementations. For example, the dynamic reconfiguration processor (DRPU) 32 can be implemented as a RAM-based state machine, possibly including one or more lookup tables. Alternatively, the dynamic reconfiguration processor (DRPU) 32 can be implemented using a complex programmable logic device (CPLD). However, those skilled in the art will appreciate that a portion of the S machine 12 of the system 10 can include a dynamic reconfiguration processor (DRPU) 12 that is not reconfigurable.
[0063]
In the preferred embodiment, the instruction fetch unit (IFU) 60, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 are each dynamically reconfigurable. Therefore, the internal hardware configuration can be selectively changed during program execution. The instruction fetch unit (IFU) 60 instructs a command fetch / decode operation, a memory access operation, and a dynamic reconfiguration processor (DRPU) reconfiguration operation, and a data operation unit for easily executing the instruction. Control signals are sent to (DOU) 62 and address arithmetic unit (AOU) 64. The data arithmetic unit (DOU) 62 performs an operation related to data calculation, and the address arithmetic unit (AOU) 64 executes an operation related to address calculation. The internal structures and operations of the instruction fetch unit (IFU) 60, the data operation unit (DOU) 62, and the address operation unit (AOU) 64 will be described in detail below.
[0064]
FIG. 8 is a block diagram of a preferred embodiment of an instruction fetch unit (IFU) 60. An instruction fetch unit (IFU) 60 includes an instruction state sequencer (ISS) 100, an architecture description memory 101, a memory access logic 102, a reconfiguration logic 104, an interrupt logic 106, and a fetch control unit 108. , Instruction buffer 110, decoding control device 112, instruction decoder 114, operation code storage register set 116, register file (RF) address register set 118, constant register set 120, process control register set 122. The instruction state sequencer (ISS) 100 includes first and second control outputs that form first and second control outputs of an instruction fetch unit (IFU) 60, respectively, and the instruction fetch unit (IFU) 60. The timing input unit forming the timing input unit is included. The instruction state sequencer (ISS) 100 also includes a fetch / decode control output coupled via a fetch / decode control line 130 to the control input of the fetch controller 108 and the control input of the decode controller 112. . In addition, the instruction state sequencer (ISS) 100 has a bidirectional control port coupled to the first bidirectional control port of each of the memory access logic 102, the reconfiguration logic 104, and the interrupt logic 106 via a bidirectional control line 132. Contains. The instruction state sequencer (ISS) 100 also includes an operation code input that is coupled to the output of the operation code storage register set 116 via an operation code line 142. Finally, the instruction state sequencer (ISS) 100 includes a bidirectional control port coupled to the bidirectional control port of the process control register set 122 via the processing data line 144.
[0065]
Memory access logic 102, reconfiguration logic 104, and interrupt logic 106 each include a second bidirectional control port coupled to external control line 48. Further, the memory access logic 102, the reconfiguration logic 104, and the interrupt logic 106 each include a data input unit coupled to the data output unit of the architecture description memory 101 via a mounting control line 131. Memory access logic 102 further includes a control output that forms the memory control output of instruction fetch unit (IFU) 60, and interrupt logic 106 further includes an output coupled to processing data line 144. The command buffer 110 includes a data input unit forming a data input unit of an instruction fetch unit (IFU) 60, a control input unit coupled to a control output unit of the fetch control unit 108 via a fetch control line 134, and a command line 136. And an output unit coupled to the input unit of the instruction decoder 114. The instruction decoder 114 has a control input unit coupled to a control output unit of the decoding control device 112 via a decoding control line 138, and a decoding instruction line 140.
1) An input unit of the operation code storage register 116;
2) Input section of register file (RF) address register set 118;
3) an output coupled to the input of the constant register set 120;
Is included. The register file (RF) address register set 118 and the constant register set 120 each include an output unit that forms the third control output unit 74 of the instruction fetch unit (IFU) 60.
[0066]
The architecture description memory 101 stores an architecture designation signal that characterizes the current dynamic reconfiguration processor (DRPU) configuration. This architecture-specific signal is
1) the criteria for the default configuration data set;
2) criteria for an acceptable configuration data set list;
3) Criteria for configuration data sets corresponding to the currently considered instruction set architecture (ISA), that is, criteria for configuration data sets that define the current dynamic reconfiguration processor (DRPU) configuration;
4) A cross-coupled address list identifying one or more cross-coupled I / O devices 304 in the T machine 14 associated with the S machine 12 in which the instruction fetch unit (IFU) 60 resides (see FIG. 18). And will be described in detail below)
5) a set of interrupt response signals that specify interrupt latency and interrupt accuracy information that defines how the instruction fetch unit (IFU) 60 responds to interrupts;
And 6) a memory access constant that defines an atomic memory address increment. In the preferred embodiment, each configuration data set implements the architecture description memory 101 as a set of logical blocks (CLB) configured as read-only memory (ROM). An architecture designating signal that defines the contents of the architecture description memory 101 is preferably included in each configuration data set. Accordingly, since each configuration data set corresponds to a specific instruction set architecture (ISA), the contents of the architecture description memory 101 differ depending on the instruction set architecture (ISA) currently under consideration. For a given instruction set architecture (ISA), program access to the contents of the architecture description memory 101 is preferably facilitated by including memory read instructions in the instruction set architecture (ISA). This allows the program to retrieve information regarding the current dynamic reconfiguration processor (DRPU) configuration during program execution.
[0067]
In the present invention, the reconfiguration logic 104 is a state machine that controls a series of reconfiguration operations, which facilitates the reconfiguration of the dynamic reconfiguration processor (DRPU) 32 according to the configuration data set. Reconfiguration logic 104 preferably initiates a reconfiguration operation upon receipt of a reconfiguration signal. As described in detail below, the reconfiguration signal is a signal generated by interrupt logic 106 in response to a reconfiguration interrupt received on external control line 48 or in response to a reconfiguration instruction embedded in the program. This is a signal generated by the instruction state sequencer (ISS) 100. By the reconfiguration operation, the initial dynamic reconfiguration processor (DRPU) configuration after power-on / reset is obtained using default configuration data referenced by the architecture description memory 101. Also, the selective dynamic reconfiguration processor (DRPU) reconfiguration after the initial dynamic reconfiguration processor (DRPU) configuration is determined is obtained by the reconfiguration operation. When the reconfiguration operation is complete, the reconfiguration logic 104 issues a completion signal. In the preferred embodiment, the reconfiguration logic 104 is non-reconfiguration logic that controls the loading of the configuration data set onto the reprogrammable logic device itself, so the sequence of reconfiguration operations is defined by the manufacturer of the reprogrammable logic device. Thus, reconstruction operations are known to those skilled in the art.
[0068]
Each dynamic reconfiguration processor (DRPU) configuration is preferably provided by a configuration data set that defines a specific hardware organization for production of a corresponding instruction set architecture (ISA). In the preferred embodiment, the instruction fetch unit (IFU) 60 includes the above components regardless of the dynamic reconfiguration processor (DRPU) configuration. At the basic level, the functionality provided by each component within the instruction fetch unit (IFU) 60 is independent of the instruction set architecture (ISA) currently under consideration. However, in the preferred embodiment, the detailed structure and functionality of one or more components of the instruction fetch unit (IFU) 60 will vary depending on the characteristics of the instruction set architecture (ISA) in which it is configured. . In the preferred embodiment, the structure and functionality of the architecture description memory 101 and the reconfiguration logic 104 are preferably constant for each dynamic reconfiguration processor (DRPU) configuration. The structure and functionality of the other components of the instruction fetch unit (IFU) 60 and the differences between them depending on the type of instruction set architecture (ISA) will be described in detail below.
[0069]
The process control register set 122 stores signals and data used by the instruction state sequencer (ISS) 100 during instruction execution. In the preferred embodiment, the process control register set 122 includes a register for storing a process control word, a register for storing an interrupt vector, and a register for storing a reference to a configuration data set. . The process control word preferably includes a plurality of condition flags that can be selectively set or reset based on conditions that occur during instruction execution. In addition, the process control word includes a plurality of transition control signals that define one or more ways in which interrupts can be implemented (this is described in detail below). In the preferred embodiment, the process control register set 122 is implemented as a set of logical blocks (CLB) configured for data storage and gating logic.
[0070]
The instruction state sequencer (ISS) 100 controls the operations of the fetch control unit 108, the decoding control unit 112, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 to facilitate the execution of instructions. Therefore, a state machine that transmits a memory read signal and a memory write signal to the memory access logic 102 is preferred. FIG. 9 is a state diagram illustrating a set of preferred states supported by the instruction state sequencer (ISS) 100. After power on or reset, or immediately after reconfiguration, the instruction state sequencer (ISS) 100 starts operation in state P. In response to the completion signal issued by the reconfiguration logic 104, the instruction state sequencer (ISS) 100 proceeds to state S, and the instruction state sequencer (ISS) is in the program state when power on / reset or reconfiguration occurs, respectively. Initialize or restore information. The instruction state sequencer (ISS) 100 then proceeds to state F and performs an instruction fetch operation. In an instruction fetch operation, the instruction state sequencer (ISS) 100 sends a memory read signal to the memory access logic 102, sends a fetch signal to the fetch controller 108, and increments the next instruction program address register (NIPAR) 232. An increment signal is transmitted to the address arithmetic unit (AOU) 64 (this will be described in detail below with reference to FIGS. 15 and 16). After state F, the instruction state sequencer (ISS) 100 proceeds to state D and starts an instruction decoding operation. In state D, the instruction state sequencer (ISS) 100 sends a decode signal to the decode control 112. In state D, the instruction state sequencer (ISS) 100 further retrieves the operation code corresponding to the decoded instruction from the operation code storage register set 116. Based on the retrieved operation code, the instruction state sequencer (ISS) 100 proceeds to state E or state M and executes an instruction execution operation. When the instruction can execute in one clock cycle, the instruction state sequencer (ISS) 100 proceeds to state E. Otherwise, the instruction state sequencer (ISS) 100 goes to state M to execute instructions in multiple cycles. For instruction execution operations, the instruction state sequencer (ISS) 100 facilitates execution of data arithmetic unit (DOU) control signals, address arithmetic unit (AOU) control signals, and / or instructions corresponding to retrieved operation codes. A signal dedicated to the memory access logic 102 is generated. After state E or M, instruction state sequencer (ISS) 100 proceeds to state W. In state W, instruction state sequencer (ISS) 100 writes data to facilitate storage of data arithmetic unit (DOU) control signals, address arithmetic unit (AOU) control signals, and / or instruction execution results. Signal. Therefore, the state W is called a write back state. One skilled in the art will recognize that states F, D, E, M, and W include a complete instruction execution cycle. After state W, the instruction state sequencer (ISS) 100 proceeds to state Y when execution of the instruction needs to be interrupted. State Y corresponds to an idle state that is required, for example, when T-machine 14 must access memory 34 of S-machine. After state Y, or when state execution continues, after state W, instruction state sequencer (ISS) 100 returns to state F and starts another instruction execution cycle.
[0071]
As shown in FIG. 9, the state diagram also includes state I. This state is defined as the interrupt execution state. In the present invention, the instruction state sequencer (ISS) 100 receives an interrupt notification signal from the interrupt logic 106. As will be described in greater detail below with reference to FIG. 10, interrupt logic 106 generates a transition control signal and stores the transition control signal in a process control word in process control register set 122. The transition control signal indicates which of the states F, D, E, M, W, and Y can be interrupted, the level of interrupt accuracy required in each interrupt enabled state, and after the state I It is preferable to indicate the next state of each interruptible state to continue execution. When the instruction state sequencer (ISS) 100 receives an interrupt notification signal in a predetermined state, if the transition control signal indicates that the current state can be interrupted, the instruction state sequencer (ISS) 100 Go to state I. Otherwise, the instruction state sequencer (ISS) 100 proceeds as if it had not received an interrupt signal until it reaches the interrupt enabled state.
[0072]
As the instruction state sequencer (ISS) 100 proceeds to state I, the instruction state sequencer (ISS) 100 preferably accesses the process control register set 122 to set the interrupt masking flag and retrieve the interrupt vector. After receiving the interrupt vector, the instruction state sequencer (ISS) 100 preferably performs a conventional subroutine jump to the interrupt handler specified by the interrupt vector to implement the current interrupt.
[0073]
In the present invention, the reconfiguration of the dynamic reconfiguration processor (DRPU) 32 is
1) a reconfiguration interrupt asserted on external control line 48, or
2) Execution of reconfiguration instructions in a series of program instructions
Will start according to. In the preferred embodiment, a subroutine jump to the reconfiguration handler is performed whether a reconfiguration interrupt is performed or a reconfiguration instruction is executed. The reconfiguration handler preferably saves program state information and sends a configuration data set address and a reconfiguration signal to the reconfiguration logic 104.
[0074]
When the current interrupt is not a reconfiguration interrupt, the instruction state sequencer (ISS) 100 proceeds to the next state indicated by the transition control signal when the interrupt is implemented, thereby resuming and completing the instruction execution cycle. Or start.
[0075]
In the preferred embodiment, the set of states supported by the instruction state sequencer (ISS) 100 depends on the characteristics of the instruction set architecture (ISA) in which the dynamic reconfiguration processor (DRPU) 32 is configured. Thus, state M does not exist for an instruction set architecture (ISA) in which one or more instructions can execute in one clock cycle, as in a typical inner loop instruction set architecture (ISA). As shown, the state diagram of FIG. 9 preferably defines the states supported by the instruction state sequencer (ISS) to implement the general purpose external loop instruction set architecture (ISA). For implementation of the inner loop instruction set architecture (ISA), the instruction state sequencer (ISS) 100 preferably supports multiple states F, D, E, and W in parallel. As a result, pipeline control of instruction execution can be easily performed in a manner that is easily understood by those skilled in the art. In the preferred embodiment, the instruction state sequencer (ISS) 100 is implemented as a logical block (CLB) based state machine that supports the state or subset of states described above according to the currently considered instruction set architecture (ISA). .
[0076]
The interrupt logic 106 preferably includes a state machine that generates transition control signals and performs interrupt notification operations in response to interrupt signals received via the external control line 48. FIG. 10 is a state diagram illustrating a set of preferred states supported by the interrupt logic 106. The interrupt logic 106 starts operation in state P. State P corresponds to a power-on, reset, or reconfiguration state. In response to the completion signal issued by the reconfiguration logic 104, the interrupt logic 106 proceeds to state A and retrieves an interrupt response signal from the architecture description memory 101. The interrupt logic 106 then generates a transition control signal from the interrupt response signal and stores this transition control signal in the process control register set 122. In the preferred embodiment, the interrupt logic 106 includes a logic block (CLB) based programmable logic array (PLA) for receiving an interrupt response signal and generating a transition control signal. After state A, interrupt logic 106 proceeds to state B and waits for an interrupt signal. Interrupt logic 106 proceeds to state C when an interrupt signal is received and the interrupt masking flag in process control register set 122 is reset. In state C, the interrupt logic 106 determines the interrupt start point, interrupt priority, and interrupt handler address. When the interrupt signal is a reconfiguration interrupt, the interrupt logic 106 proceeds to state R and stores the configuration data set address in the process control register set 122. After state R, or after state C when the interrupt signal is not a reconfiguration interrupt, interrupt logic 106 proceeds to state N and stores the interrupt handler address in process control register set 122. The interrupt logic 106 then proceeds to state X and issues an interrupt notification signal to the instruction state sequencer (ISS) 100. After state X, interrupt logic 106 returns to state B and waits for the next interrupt signal.
[0077]
In the preferred embodiment, the level of interrupt latency that the interrupt response signal, and thus the transition control signal specifies, depends on the current instruction set architecture (ISA) in which the dynamic reconfiguration processor (DRPU) 32 is configured. For example, an instruction set architecture (ISA) for high-performance real-time operation control requires a quick and predictable interrupt response capability. Accordingly, the configuration data set corresponding to such an instruction set architecture (ISA) preferably includes an interrupt response signal indicating that a low latency interrupt is required. The corresponding transition control signal preferably identifies a plurality of instruction state sequencer (ISS) states as interruptible. Thus, the instruction execution cycle can be interrupted by an interrupt before the instruction execution cycle is completed. Unlike the instruction set architecture (ISA) for real-time operation control, the image set convolution operation instruction set architecture (ISA) has an interrupt response capability that maximizes the number of convolution operations executed per unit time. is required. The configuration data set corresponding to the image convolution instruction set architecture (ISA) preferably includes an interrupt response signal that specifies that a long latency interrupt is required. The corresponding transition control signal preferably identifies state W as interruptible. When the instruction state sequencer (ISS) 100 is configured to implement an instruction set architecture (ISA) for image convolution operations and supports a plurality of states F, D, E, and W in parallel, the transition control signal is in each state. Preferably, W is identified as interruptible and further specifies that interrupt execution should be delayed until each parallel instruction execution cycle has completed its state W operation. This ensures that all instructions are executed before the interrupt is implemented, thereby maintaining an appropriate pipeline execution capability level.
[0078]
Similar to the interrupt latency level, the interrupt accuracy level specified by the interrupt response signal also depends on the instruction set architecture (ISA) in which the dynamic reconfiguration processor (DRPU) 32 is configured. For example, if state M is determined to be interruptible for an external loop instruction set architecture (ISA) that supports interruptable multi-cycle operations, the interrupt response signal specifies that an accurate interrupt is required. Is preferred. Thus, the transition control signal specifies that an interrupt received in state M should be treated as an accurate interrupt so that the multicycle operation can be successfully restarted. As another example, for an instruction set architecture (ISA) that supports defect-free pipeline arithmetic, the interrupt response signal preferably specifies that an inaccurate interrupt is required. The transition control signal then specifies that an interrupt received in state W should be treated as an incorrect interrupt.
[0079]
For any instruction set architecture (ISA), the interrupt response signal is defined and programmed by a portion of the corresponding configuration data set of the instruction set architecture (ISA). By generating programmable interrupt response signals and corresponding transition control signals, the present invention makes it easy to implement an optimal interrupt scheme for each instruction set architecture (ISA). Those skilled in the art will appreciate that most prior art computer architectures cannot flexibly specify interrupt capabilities, namely enabling programmable state transitions, programmable interrupt latency, and programmable interrupt accuracy. In the preferred embodiment, the interrupt logic 106 is implemented as a logic block (CLB) based state machine that supports states as described above.
[0080]
The fetch controller 108 instructs the instruction buffer 110 to load instructions in response to fetch signals issued by the instruction set architecture (ISA) 100. In the preferred embodiment, fetch controller 108 is implemented as a conventional one-hot coding state machine using flip-flops within a set of logical blocks (CLB). Those skilled in the art will recognize that in alternate embodiments, the fetch controller 108 can be configured as a conventional encoding state machine or as a ROM-based state machine. The instruction buffer 110 temporarily stores the instruction loaded from the memory 34. For external loop instruction set architecture (ISA) implementations, the instruction buffer 110 is preferably implemented as a conventional RAM-based first-in first-out (FIFO) buffer using multiple logical blocks (CLB). As for the implementation of the internal loop instruction set architecture (ISA), the instruction buffer 110 uses a plurality of flip-flops in one set of input / output blocks (IOB), or includes an input / output block (IOB) and a logic block (CLB) Both are preferably implemented as a set of flip-flop registers using multiple flip-flops.
[0081]
Decoding controller 112 directs instructions to be transferred from instruction buffer 110 to instruction decoder 114 in response to a decoded signal issued by instruction set architecture (ISA) 100. For an inner loop instruction set architecture (ISA), the decode controller 112 is preferably implemented as a ROM-based state machine that includes a logical block (CLB) based ROM coupled to a logical block (CLB) base register. For outer loop instruction set architecture (ISA), decoding controller 112 is preferably implemented as a logical block (CLB) based coding state machine. For each instruction received as input, instruction decoder 114 outputs the corresponding operation code, register file address, and optionally one or more constants in a conventional manner. For an inner loop instruction set architecture (ISA), the instruction decoder 114 is preferably configured to decode a series of instructions received as input. In the preferred embodiment, instruction decoder 114 is implemented as a logical block (CLB) based decoder configured to decode each instruction included in the currently considered instruction set architecture (ISA).
[0082]
The operation code storage register set 116 temporarily stores each operation code output by the instruction decoder 114 and outputs each operation code to the instruction state sequencer (ISS) 100. When the outer loop instruction set architecture (ISA) is implemented in the dynamic reconfiguration processor (DRPU) 32, the operation code storage register set 116 is preferably implemented using an optimal number of flip-flop register banks. The flip-flop register bank receives from the instruction decoder 114 a signal representing a class code or group code derived from the operation code literal bit field of the instruction that has already formed the waiting example through the instruction buffer 110. The flip-flop register bank stores the aforementioned class code or group code according to a decoding scheme that can minimize the complexity of the instruction state sequencer (ISS). In the case of an inner loop instruction set architecture (ISA), the operation code storage register set 116 stores an operation code indication signal derived directly from the operation code literal bit field by the instruction decoder 114. The inner loop instruction set architecture (ISA) necessarily has a small operation code literal bit field, thereby buffering and decoding by the instruction buffer 110, instruction decoder 114, and operation code storage register set 116, respectively. And implementation code display for instruction sequencing (ordering) is minimized. In summary, for the outer loop instruction set architecture (ISA), the operation code storage register set 116 is implemented as a small combination of flip-flop register banks characterized as a bit width equal to or part of the operation code literal size. It is preferable. For the inner loop, the operation code storage register set 116 is preferably smaller and integrated with a flip-flop register bank than in the outer loop instruction set architecture (ISA). The reason why the size of the flip-flop register bank is small in the inner loop is that the number of instructions in the inner loop instruction set architecture (ISA) is very small compared to the outer loop instruction set architecture (ISA).
[0083]
Register file (RF) address register set 118 and constant register set 120 temporarily store each register file and each constant output by instruction decoder 114, respectively. In the preferred embodiment, the opcode storage register set 116, register file (RF) address register set 118, and constant register set 120 are each implemented as a set of logical blocks (CLB) configured for data storage. The
[0084]
The memory access logic 102 instructs data transfer between the memory 34, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 according to the size of the atomic memory address specified in the architecture description memory 122. And a memory control circuit for synchronizing them. The memory access logic 102 further directs and synchronizes the transfer of data and commands between the S machine 12 and the predetermined T machine 14. In the preferred embodiment, the memory access logic 102 supports burst memory access and is preferably implemented as a conventional RAM controller using logical blocks (CLB). Those skilled in the art will appreciate that during reconfiguration, the input and output pins of the reconfiguration logic device are ternary and can de-assert logic levels by resistance stops, thus not confusing memory 34. In another embodiment, the memory access logic 102 can be implemented outside of the dynamic reconfiguration processor (DRPU) 32.
[0085]
FIG. 11 is a block diagram of a preferred embodiment of the data arithmetic unit 62. Data operation unit (DOU) 62 performs operations on data according to data operation unit (DOU) control signals, register file (RF) addresses, and constants received from instruction set architecture (ISA) 100. The data operation unit (DOU) 62 includes a data operation unit (DOU) crossbar switch 150, storage / alignment logic 152, and data operation logic 154. Data arithmetic unit (DOU) crossbar switch 150, storage / alignment logic 152, and data arithmetic logic 154 are each coupled to a first control output of instruction fetch unit (IFU) 60 via first control line 70. Includes control input. The data processing unit (DOU) crossbar switch 150 includes a bidirectional data port forming a bidirectional data port of the data processing unit (DOU), a constant input unit coupled to the third control line 74, and a first data line. A first data feedback input coupled to the data output of the data arithmetic logic 154 via 160; a second data feedback input coupled to the data output of the storage / alignment logic 152 via the second data line 164; And a data output coupled to the data input of storage / alignment logic 152 via a third data line. Store / align logic 152 includes an address input coupled to third control line 74 in addition to its data output. The data arithmetic logic 154 further includes a data input coupled to the output of the storage / alignment logic via the second data line 164.
[0086]
The data operation logic 154 performs an arithmetic operation, a shift operation, and / or a logical operation on the data received at the data input unit in response to a data operation unit (DOU) control signal received at the control input unit. The storage / alignment logic 152 is associated with operands, constants, and data calculations according to register file (RF) addresses and data processing unit (DOU) control instructions received at its address input and control input, respectively. A data storage element for temporarily storing the results. The data operation unit (DOU) crossbar switch 150 loads data from the memory 34 according to the data operation unit (DOU) control signal received at its control input unit, and stores / aligns logic of the result output by the data operation logic 154. A conventional crossbar switch network that facilitates the transfer to 152 or memory 34 and the loading of constant outputs to the store / align logic 152 by the instruction fetch unit (IFU) 60 is preferred. In the preferred embodiment, the detailed structure of the data arithmetic logic 154 depends on the types of operations supported by the currently considered instruction set architecture (ISA). That is, the data arithmetic logic 154 includes circuitry for performing arithmetic and / or logical operations specified by data processing instructions within the instruction set architecture (ISA) currently under consideration. Similarly, the detailed structure of the storage / alignment logic 152 and data processing unit (DOU) crossbar switch 150 is determined by the instruction set architecture (ISA) currently under consideration. The detailed structure of the data operation logic 154, the storage / alignment logic 152, and the data operation unit (DOU) crossbar switch 150 according to the type of instruction set architecture (ISA) will be described below with reference to FIGS. explain in detail.
[0087]
For the outer loop instruction set architecture (ISA), the data processing unit (DOU) 62 is preferably configured to perform sequential operations on the data. FIG. 12 is a block diagram of a first exemplary embodiment of a data processing unit (DOU) 61 configured for the implementation of a general purpose external loop instruction set architecture (ISA). In general purpose external loop instruction set architecture (ISA), hardware configured to perform mathematical operations such as multiplication, addition, subtraction, Boolean operations such as AND, OR, NOT, shift operations, and rotation operations. Wear is necessary. Thus, for the implementation of the general purpose external loop instruction set architecture (ISA), the data arithmetic logic 154 is a conventional arithmetic logic device having a first input portion, a second input portion, a control input portion, and an output portion ( ALU) / shifter 184. The storage / alignment logic 152 is preferably composed of a first RAM 180 and a second RAM 182, which each include a data input unit, a data output unit, an address selection input unit, and an enable input unit. A data operation unit (DOU) crossbar switch 150 has a bidirectional and unidirectional crossbar coupling unit and a conventional crossbar switch having an input unit and an output unit as already described with reference to FIG. It preferably includes a network. Those skilled in the art will recognize that a data processing unit (DOU) crossbar switch 150 for external loop instruction set architecture (ISA) can be efficiently implemented with a multiplexer, a ternary buffer, and a logic block (CLB) base logic. It will be appreciated that subsets of the above components that are coupled in any combination by direct wiring or by reconfiguration coupling means are included. For the outer loop, the data processing unit (DOU) crossbar switch 150 is implemented to facilitate sequential data movement in the shortest time, but the maximum number of single data movement crosses to support general purpose outer loop instructions. A bar joint is also provided.
[0088]
Similar to the data input section of the second RAM 182, the data input section of the first RAM 180 is coupled to the data output section of the data arithmetic unit (DOU) crossbar switch 150 via the third data line 162. The address select inputs of the first RAM 180 and the second RAM 182 are coupled to receive a register file address from the instruction fetch unit (IFU) 60 via the third control line 74. Similarly, the enable inputs of the first RAM 180 and the second RAM 182 are coupled to receive a data processing unit (DOU) control signal via the first control line 70. The data output sections of the first RAM 180 and the second RAM 182 are respectively coupled to the first input section and the second input section of the ALU / shifter 184, and the second data feedback input of the data arithmetic unit (DOU) crossbar switch 150. Also connected to the part. The control input of the ALU / shifter 184 is coupled to receive a data processing unit (DOU) control signal via the first control line 70. The output of the ALU / shifter 184 is coupled to a first data feedback input of a data processing unit (DOU) crossbar switch 150. The remaining input units and coupling units to the output unit of the data arithmetic unit (DOU) crossbar switch 150 are the same as those described above with reference to FIG.
[0089]
In order to facilitate the execution of data arithmetic instructions, the instruction fetch unit (IFU) 60 provides a data arithmetic unit (DOU) control signal and a register file (RF) when the instruction state sequencer (ISS) is in state E or M. ) An address signal and a constant signal are issued to the data arithmetic unit (DOU) 61. The first RAM 180 and the second RAM 182 provide first and second register files for temporary data storage, respectively. The individual addresses in the first RAM 180 and the second RAM 182 are selected according to the register file (RF) address received at the respective address selection input section of each RAM. Similarly, loading of the first RAM 180 and the second RAM 182 is controlled by data operation unit (DOU) control signals received by the first RAM 180 and the second RAM 182 at the write enable input section. In the preferred embodiment, at least one of the first RAM 180 and the second RAM 182 has a transfer capability to facilitate the direct transfer of data from the data processing unit (DOU) crossbar switch 150 to the ALU / shifter 184. Contains. The ALU / shifter 184 is based on a first operand received from the first RAM 180 and / or based on a second operand received from the second RAM 182 in accordance with a data arithmetic unit (DOU) control signal received at its control input. Perform arithmetic, logical, or shift (shift) operations. The data processing unit (DOU) crossbar switch 150 is selectively
1) routing of data between the memory 34 and the first RAM 180 and the second RAM 182;
2) routing the results from the ALU / shifter 184 to the first RAM 180 and the second RAM 182 or to the memory 34;
3) routing the data stored in the first RAM 180 or the second RAM 182 to the memory 34;
4) Routing constants from the instruction fetch unit (IFU) 60 to the first RAM 180 and the second RAM 182;
I do. As described above, when either the first RAM 180 or the second RAM 182 has a transmission capability, the data operation unit (DOU) crossbar switch 150 also selectively outputs from the memory 34 to the ALU / shifter 184 or the output of the ALU / shifter. Route data back to the ALU / shifter 184 directly from the unit. The data processing unit (DOU) crossbar switch 150 executes a specific routing operation in accordance with a data processing unit (DOU) control signal received at its control input unit. In the preferred embodiment, the ALU / shifter 184 is implemented using a set of logic blocks (CLB) for mathematical operations within the reconfigurable logic unit and a logic function generator in the circuit. Each of the first RAM 180 and the second RAM 182 is preferably implemented using a data storage circuit existing in one set of logical blocks (CLB). The data processing unit (DOU) crossbar switch 150 is preferably implemented in the manner already described.
[0090]
FIG. 13 is a block diagram of a second exemplary embodiment of a data processing unit (DOU) 63 configured for the implementation of an inner loop instruction set architecture (ISA). In general, the inner loop instruction set architecture (ISA) supports relatively few dedicated operations and is preferably used to perform a common set of operations on large data sets. Thus, optimal computational performance for an inner loop instruction set architecture (ISA) is obtained with hardware configured to perform operations in parallel. Thus, in the second exemplary embodiment of data processing unit (DOU) 63, data processing logic 154, storage / alignment logic 152, and data processing unit (DOU) crossbar switch 150 are configured to perform pipeline calculations. The The data arithmetic logic 154 includes a pipeline function unit 194 having a plurality of input units, a control input unit, and an output unit. The storage / alignment logic 152
1) a set of conventional flip-flop arrays 192 (each including a data input, a data output, and a control input);
2) a data selector 190 (including a control input unit, a data input unit, and a number of data output units corresponding to the flip-flop array 192);
Is included. Data processing unit (DOU) crossbar switch 150 includes a conventional crossbar switch network having dual unidirectional crossbar couplings. In the second exemplary embodiment of the data processing unit (DOU) 63, the data processing unit (DOU) crossbar switch 150 has the input unit and output unit already described with reference to FIG. 11 except for the second data feedback input unit. It is preferable to include. As with the outer loop instruction set architecture (ISA), an efficient implementation of the data processing unit (DOU) crossbar switch 150 for the inner loop instruction set architecture (ISA) includes a multiplexer, a ternary buffer, and , Logic block (CLB) base logic, direct wiring, or a subset of the above components coupled in a reconfigurable manner. For the inner loop instruction set architecture (ISA), the data processing unit (DOU) crossbar switch 150 is preferably implemented to maximize parallel data movement in the shortest time, but a highly pipelined inner loop instruction set architecture. A minimum number of single data operation crossbar couplings are also provided to support (ISA) instructions.
[0091]
The data input of the data selector 190 is coupled to the data output of the data processing unit (DOU) crossbar switch 150 via the first data line 162. The control input of data selector 190 is coupled to receive a register file (RF) address via third control line 74, and each output of data selector 190 is coupled to a corresponding flip-flop array data input. ing. The control input of each flip-flop array 192 is coupled to receive a data processing unit (DOU) control signal via the first control line 70, and each flip-flop array data output is coupled to the input of the functional unit 194. Has been. The control input of the functional unit 194 is coupled to receive a data processing unit (DOU) control signal via the first control line 70, and the output unit of the functional unit 194 is connected to the data processing unit (DOU) crossbar switch 150. Coupled to the first data feedback input. The remaining input unit and output unit of the data arithmetic unit (DOU) crossbar switch 150 are the same as those already described with reference to FIG.
[0092]
In operation, functional unit 194 performs a pipeline operation on the data received at the data input in accordance with a data processing unit (DOU) control signal received at the control input. Those skilled in the art will recognize whether functional unit 194 is a multiply / accumulate device, a threshold determination device, an image rotation device, an edge enhancement device, or any suitable for performing pipeline operations on partitioned data. Will admit that it is a functional unit of the kind. The data selector 190 routes (routes) data from the output unit of the data processing unit (DOU) crossbar switch 150 to a predetermined flip-flop array 192 according to the register file (RF) address received at its control input unit. Each flip-flop array 192 is sequentially coupled to spatially and temporally align the data with the data content of another flip-flop array 192 in accordance with the control signal received at its control input. A data latch is preferably included. The data processing unit (DOU) crossbar switch 150 is selectively
1) Route data from memory 34 to data selector 190;
2) Route the result from the multiplier / accumulator 194 to the data selector 190 or memory 34;
3) Route constants from instruction fetch unit (IFU) 60 to data selector 190.
Those skilled in the art will recognize that the inner loop instruction set architecture (ISA) includes a set of “built-in” constants. In such an internal loop instruction set architecture (ISA) implementation, the storage / alignment logic 154 preferably includes a logic block (CLB) -based ROM with built-in constants, thereby enabling a data processing unit (DOU) crossbar. The need to route constants from the instruction fetch unit (IFU) 60 to the storage / alignment logic 152 via the switch 150 can be eliminated. In the preferred embodiment, functional unit 194 is preferably implemented using a logic function generator and circuit for mathematical operations within a set of logic blocks (CLB). Each flip-flop array 192 is preferably implemented using flip-flops in a set of logic blocks (CLB). The data selector 190 is preferably implemented using a logic function generator and data selection circuit in a set of logic blocks (CLB). Finally, the data processing unit (DOU) crossbar switch 150 is preferably implemented in the manner already described for the inner loop.
[0093]
FIG. 14 is a block diagram of a preferred embodiment of an address arithmetic unit (AOU) 64. The address arithmetic unit (AOU) 64 performs an operation on the address according to the address arithmetic unit (AOU) control signal, the register file (RF) address, and the constant received from the instruction fetch unit (IFU) 60. The address arithmetic unit (AOU) 64 includes an address arithmetic unit (AOU) crossbar switch 200, storage / counting logic 202, address arithmetic logic 204, and address multiplexer 206. The address arithmetic unit (AOU) crossbar switch 200, the storage / counting logic 202, the address arithmetic logic 204, and the address multiplexer 206 are each controlled by the second control of the instruction fetch unit (IFU) 60 via the second control line 72. A control input coupled to the output is included. The address arithmetic unit (AOU) crossbar switch 200 is coupled to the bidirectional data port forming the bidirectional data port of the address arithmetic unit (AOU) and the address output unit of the address arithmetic logic 204 via the first address line 210. An address feedback input, a constant input coupled to the third control line 74, and an address output coupled to the address input of the storage / counting logic 202 via the second address line 212. In addition to the address input unit and the control input unit, the storage / counting logic 202 includes a register file (RF) address input unit coupled to the third control line 74, and the third address line 214. And an address output unit coupled to the address input unit. Address multiplexer 206 has a first input coupled to first address line 210, a second input coupled to third address line 214, and an output forming an address output of address arithmetic unit (AOU) 64. Part.
[0094]
The address calculation logic 204 performs an arithmetic operation on the address received at the address input unit in accordance with an instruction of an address arithmetic unit (AOU) control signal received at the control input unit. The storage / counting logic 202 temporarily stores addresses and address calculation results. The address arithmetic unit (AOU) crossbar switch 200 stores the address from the memory 34 and stores / counts logic of the result output of the address arithmetic logic 204 in accordance with the address arithmetic unit (AOU) control signal received at its control input unit. It facilitates the transfer to 202 or memory 34 and the loading of constant output into the store / count logic 202 by an instruction fetch unit (IFU) 60. The address multiplexer 206 sends the address received from the storage / counting logic 202 or the address multiplexer 206 to the address output unit of the address arithmetic unit (AOU) 64 according to the instruction of the address arithmetic unit (AOU) control signal received at its control input unit. Selectively output. In the preferred embodiment, the detailed structure of the address arithmetic unit (AOU) crossbar switch 200, the storage / counting logic 202, and the address arithmetic logic 204 will be described below with reference to FIGS. It depends on the type of instruction set architecture (ISA) under consideration.
[0095]
FIG. 15 is a block diagram of a first exemplary embodiment of an address arithmetic unit (AOU) 65 configured for the implementation of a general purpose external loop instruction set architecture (ISA). The general purpose external loop instruction set architecture (ISA) requires hardware to perform operations such as addition, subtraction, increment, and decrement on the contents of the program counter and address stored in the storage / counting logic 202. is there. In the first exemplary embodiment of the address arithmetic unit (AOU) 65, the address arithmetic logic 204 includes a next instruction program address register (NIPAR) 232 having an input unit, an output unit, and a control input unit, and a first input unit. A computing device 234 having a second input unit, a third input unit, a control input unit, and an output unit, a first input unit, a second input unit, a control input unit, and an output unit. It is preferable that the multiplexer 230 is included. The storage / counting logic 202 preferably includes a third RAM 220 and a fourth RAM 222 each having an input unit, an output unit, an address selection input unit, and an enable input unit. Address multiplexer 206 preferably includes a multiplexer having a first input section, a second input section, a third input section, a control input section, and an output section. The address arithmetic unit (AOU) crossbar switch 200 may include a conventional crossbar switch network having a dual unidirectional crossbar coupling section and an input section and an output section already described with reference to FIG. preferable. Effective implementation of the address arithmetic unit (AOU) crossbar switch 200 is coupled by multiplexers, ternary buffers, logic block (CLB) base logic, direct wiring, or reconfiguration couplers. A subset of such components is included. For external loop instruction set architecture (ISA), the address arithmetic unit (AOU) crossbar switch 200 is preferably implemented to maximize sequential data movement in the shortest time, but supports general purpose external loop address arithmetic instructions. A maximum number of unique address movement crossbar couplings is also provided for this purpose.
[0096]
The input unit of the third RAM 220 and the input unit of the fourth RAM 222 are respectively coupled to the output unit of the address arithmetic unit (AOU) crossbar switch 200 via the second address line 212. The address select inputs of the third RAM 220 and the fourth RAM 222 are coupled to receive a register file (RF) address from the instruction fetch unit (IFU) 60 via the third control line 74. The enable inputs of the third RAM 220 and the fourth RAM 222 are coupled to receive an address arithmetic unit (AOU) control signal via the second control line 72. The output of the third RAM 220 is coupled to the first input of the multiplexer 230, the first input of the arithmetic unit 234, and the first input of the address multiplexer 206. Similarly, the output of the fourth RAM 222 is coupled to the second input of the multiplexer 230, the second input of the arithmetic unit 234, and the second input of the address multiplexer 206. Control input units of the multiplexer 230, the NIPAR 232, and the arithmetic unit 234 are respectively coupled to the second control line 72. The output of arithmetic unit 234 forms the output of address arithmetic logic 204 and is therefore coupled to the address feedback input of address arithmetic unit (AOU) crossbar switch 200 and the third input of address multiplexer 206. Has been. The remaining input unit and output unit of the address arithmetic unit (AOU) crossbar switch 200 and address multiplexer 206 are the same as those already described with reference to FIG.
[0097]
In order to facilitate the execution of address arithmetic instructions, the instruction fetch unit (IFU) 60 provides an address arithmetic unit (AOU) control signal and a register file (RF) address when the instruction state sequencer (ISS) is in state E or M. And a constant are issued to the address arithmetic unit (AOU) 64. The third RAM 220 and the fourth RAM 222 provide first and second register files for temporary storage of addresses, respectively. Each storage location in the third RAM 220 and the fourth RAM 222 is selected according to the register file (RF) address received by the address selection input unit of each RAM. The loading of the third RAM 220 and the fourth RAM 222 is controlled by address arithmetic unit (AOU) control received by the third RAM 220 and the fourth RAM 222 at the write enable input section. The multiplexer 230 selectively routes the address output from the third RAM 220 and the fourth RAM 222 to the NIPAR 232 in accordance with an instruction of an address arithmetic unit (AOU) control signal received at its control input unit. NIPAR 232 loads the address received from the output of multiplexer 230 and increments its contents in response to an address arithmetic unit (AOU) control signal received at its control input. In the preferred embodiment, NIPAR 232 stores the address of the next program instruction to be executed. The arithmetic unit 234 performs arithmetic operations including addition, subtraction, increment, and decrement on the addresses received from the third RAM 220 and the fourth RAM 222 and / or on the contents of the NIPAR 232. The address arithmetic unit (AOU) crossbar switch 200 is selectively
1) Route addresses from memory 34 to third RAM 220 and fourth RAM 222;
2) The result of the address calculation output by the arithmetic unit 234 is routed to the memory 34 or the third RAM 220 and the fourth RAM 222.
The address arithmetic unit (AOU) crossbar switch 200 executes a specific routing operation according to an address arithmetic unit (AOU) control signal received by the control input unit. The address multiplexer 206 performs an address calculation on the address output by the third RAM 220, the address output by the fourth RAM 222, or the result of the address calculation output by the arithmetic device 234 according to the instruction of the address arithmetic unit (AOU) control received by the control input unit. Route selectively to the address output of the device (AOU).
[0098]
In the preferred embodiment, the third RAM 220 and the fourth RAM 222 are each implemented using data storage circuits residing in a set of logical blocks (CLB). Multiplexer 230 and address multiplexer 206 are each preferably implemented using a data selection circuit present in a set of logical blocks (CLB), and NIPAR 232 is present in a set of logical blocks (CLB). It is preferably implemented using a data storage circuit. The computing device 234 is preferably implemented using a logic function generator and circuit for mathematical operations within a set of logic blocks (CLB). Finally, the address arithmetic unit (AOU) crossbar switch 200 is preferably implemented in the manner already described.
[0099]
FIG. 16 is a block diagram of a second exemplary embodiment of an address arithmetic unit (AOU) 66 configured for implementation of an inner loop instruction set architecture (ISA). The inner loop instruction set architecture (ISA) requires hardware to perform a very limited number of address operations and also maintains at least one source address pointer and a corresponding number of destination address pointers. Requires hardware. The types of inner loop processing that require a very limited number of address operations or one address operation include block operations, raster operations, or serpentine operations for image data, bit reversal operations, and operations on circular buffer data. Variable length data parsing operations. Here, one address operation, that is, increment operation is considered. Those skilled in the art will recognize that the hardware that performs the increment operation can also perform the decrement operation inherently, thereby providing additional addressing capabilities. In a second exemplary embodiment of address arithmetic unit (AOU) 66, storage / counting logic 202 includes at least one source address register 252 having an input, an output, and a control input, an input, At least one destination address register 254 having an output unit, a control input unit, an input unit, a control input unit, and a number of output units equal to the total number of existing source address registers 252 and destination address registers 254; And a data selector 250 having the same. Here, one source address register 252 and one destination address register 254 are considered. Therefore, the data selector 250 includes a first output unit and a second output unit. The address operation logic 204 includes a NIPAR 232 having an input unit, an output unit, and a control input unit, an input unit equal in number to the data selector, a multiplexer 260 having a control input unit and an output unit. . Here, the multiplexer 260 includes a first input unit and a second input unit. The address multiplexer 206 preferably includes a multiplexer having one more input than the data selector output, a control input, and an output. Accordingly, here, the address multiplexer 206 includes a first input unit, a second input unit, and a third input unit. The address arithmetic unit (AOU) crossbar switch 200 includes a conventional crossbar switch network having a bidirectional and unidirectional crossbar coupling unit and an input unit and an output unit already described with reference to FIG. It is preferable to include. Effective implementation of the address arithmetic unit (AOU) crossbar switch 200 is coupled by multiplexers, ternary buffers, logic block (CLB) base logic, direct wiring, or reconfiguration couplers. A subset of such components is included. For the inner loop instruction set architecture (ISA), the address arithmetic unit (AOU) crossbar switch 200 is preferably implemented to maximize parallel address movement in the shortest time, but supports the inner loop operation code. For this purpose, a maximum number of unique address movement crossbar couplings is also provided.
[0100]
The input of data selector 250 is coupled to the output of address arithmetic unit (AOU) crossbar switch 200. The first and second output portions of the data selector 250 are coupled to the input portion of the source address register 252 and the input portion of the destination address register 254, respectively. The control input of source address register 252 and destination address register 254 are coupled to receive an address arithmetic unit (AOU) control signal via second control line 72. The output of source address register 252 is coupled to a first input of multiplexer 260 and a first input of address multiplexer 206. Similarly, the output of destination address register 254 is coupled to the second input of multiplexer 260 and the second input of address multiplexer 206. The input of NIPAR 232 is coupled to the output of multiplexer 260, and the control input of NIPAR 232 is coupled to receive an address arithmetic unit (AOU) control signal via second control line 72, and the output of NIPAR 232 Is coupled to the address feedback input of the address arithmetic unit (AOU) crossbar switch 200 and the third input of the address multiplexer 206. The connection part to the remaining input part and output part of the address arithmetic unit (AOU) crossbar switch 200 is the same as that described above with reference to FIG.
[0101]
In operation, the data selector 250 routes the address received from the address arithmetic unit (AOU) crossbar switch to the source address register 252 or destination address register 254 according to the register file (RF) address received at its control input. The source address register 252 loads an address present at its input in response to an address arithmetic unit (AOU) control signal present at its control input. Destination address register 254 loads the address present at its input in a similar manner. Multiplexer 260 routes the address received from source address register 252 or destination address register 254 to the input of NIPAR 232 in accordance with an address arithmetic unit (AOU) control signal received at its control input. NIPAR 232 loads an address present at its input in response to an address arithmetic unit (AOU) control signal received at its control input and increments or decrements its contents. The address arithmetic unit (AOU) crossbar switch 200 is selectively
1) Route addresses from memory 34 to data selector 250;
2) Route the contents of NIPAR 232 to memory 34 or data selector 250.
The address arithmetic unit (AOU) crossbar switch 200 executes a specific routing operation in accordance with an address arithmetic unit (AOU) control signal received at its control input unit. The address multiplexer 206 sends the contents of the source address register 252, the destination address register 254, or the NIPAR 232 to the address output unit of the address arithmetic unit (AOU) according to the instruction of the address arithmetic unit (AOU) control signal received at the control input unit. Selectively route.
[0102]
In the preferred embodiment, source address register 252 and destination address register 254 are each implemented using data storage circuitry residing in a set of logical blocks (CLBs). NIPAR 232 is preferably implemented using increment / decrement logic and flip-flops within a set of logic blocks (CLB). The data selector 250, the multiplexer 230, and the address multiplexer 206 are preferably implemented using data selection circuits that exist in a set of logical blocks (CLB). Finally, the address arithmetic unit (AOU) crossbar switch 200 is preferably implemented in the manner already described for the inner loop instruction set architecture (ISA). Those skilled in the art may use an instruction set architecture (ISA) that relies on an internal loop address arithmetic unit (AOU) configuration with an external loop data arithmetic unit (DOU) configuration, or vice versa (internal). It will be appreciated that an external loop data arithmetic unit (DOU) configuration with a loop address arithmetic unit (AOU) configuration is advantageous. For example, an associative string search instruction set architecture (ISA) would advantageously utilize an inner loop data arithmetic unit (DOU) configuration with an outer loop address arithmetic unit (AOU) configuration. As another example, an instruction set architecture (ISA) for performing histogram operations may advantageously utilize an outer loop data arithmetic unit (DOU) configuration with an inner loop address arithmetic unit (AOU) configuration.
[0103]
A finite reconfiguration hardware resource must be allocated between each component of the dynamic reconfiguration processor (DRPU) 32. Since the number of reconfigurable hardware resources is limited, the maximum computational performance level achievable by the data arithmetic unit (DOU) 62 and address arithmetic unit (AOU) 64, for example, if they are assigned to the instruction fetch unit (IFU) 60 To affect. A method for allocating reconfigurable hardware resources between an instruction fetch unit (IFU) 60, a data arithmetic unit (DOU) 62, and an address arithmetic unit (AOU) 64 is an instruction set architecture implemented at any moment. It depends on the type of (ISA). As the instruction set architecture (ISA) becomes more complex, more reconfigurable hardware resources must be allocated to the instruction fetch unit (IFU) 60 to facilitate increasingly complex decoding and control operations, Less reconfigurable hardware resources are available between the data arithmetic unit (DOU) 62 and the address arithmetic unit (AOU) 64. Therefore, the maximum computational performance achievable by the data arithmetic unit (DOU) 62 and the address arithmetic unit (AOU) 64 decreases as the complexity of the instruction set architecture (ISA) increases. In general, the outer loop instruction set architecture (ISA) contains more instructions than the inner loop instruction set architecture (ISA), and its implementation is therefore considerably more complex in the decoding and control circuits. For example, an outer loop instruction set architecture (ISA) that defines a general purpose 64-bit processor would contain more instructions than an inner loop instruction set architecture (ISA) used only for data compression.
[0104]
FIG. 17 (a) shows the re-routing between the instruction fetch unit (IFU) 60, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 for the outer loop instruction set architecture (ISA). It is a figure which shows the model allocation of a structure hardware resource. In an exemplary allocation of reconfigurable hardware resources for an external loop instruction set architecture (ISA), an instruction fetch unit (IFU) 60, a data arithmetic unit (DOU) 62, and an address arithmetic unit (AOU) 64 can each be used. Almost one third of the reconfigured hardware resources are allocated. When the dynamic reconfiguration processor (DRPU) 32 is to be reconfigured to implement an inner loop instruction set architecture (ISA), the number of instructions and types of address instructions supported by the inner loop instruction set architecture (ISA) Therefore, less reconfigurable hardware resources are required to implement the instruction fetch unit (IFU) 60 and the address arithmetic unit (AOU) 64. FIG. 17 (b) shows the re-routing between the instruction fetch unit (IFU) 60, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 for the inner loop instruction set architecture (ISA). It is a figure which shows the model allocation of a structure hardware resource. In the exemplary allocation of reconfigurable hardware resources for an internal loop instruction set architecture (ISA), the instruction fetch unit (IFU) 60 is implemented using approximately 5-10% of the reconfigurable hardware resources and the address arithmetic unit ( AOU) 64 is implemented using approximately 10-25% of the reconfigurable hardware resources. Therefore, about 70-80% of the reconfigurable hardware resources can be used to implement the data processing unit (DOU) 62. This is because the internal structure of the data processing unit (DOU) 62 related to the inner loop instruction set architecture (ISA) is more complicated than the internal structure of the data processing unit (DOU) 62 related to the inner loop instruction set architecture (ISA). Which means that much higher performance can be achieved.
[0105]
Those skilled in the art will recognize that in another embodiment, the dynamic reconfiguration processor (DRPU) 32 can exclude the data arithmetic unit (DOU) 62 or the address arithmetic unit (AOU) 64. For example, in another embodiment, the dynamic reconfiguration processor (DRPU) 32 may not include an address arithmetic unit (AOU) 64. In that case, the data operation unit (DOU) 62 performs an operation on both the data and the address. Regardless of the particular dynamic reconfiguration processor (DRPU) embodiment discussed (above), a finite number of reconfiguration hardware to implement each component of the dynamic reconfiguration processor (DRPU) 32 You must allocate resources. The reconfiguration hardware resources are preferably allocated so that the optimal or near-optimal capability for the currently considered instruction set architecture (ISA) is achieved over the total space of available reconfiguration hardware resources.
[0106]
Those skilled in the art will recognize that the detailed structure of each component of the instruction fetch unit (IFU) 60, the data arithmetic unit (DOU) 62, and the address arithmetic unit (AOU) 64 is not limited to the embodiment described above. I will admit. For a given instruction set architecture (ISA), the corresponding configuration data set is the internal structure of each component in the instruction fetch unit (IFU) 60, data arithmetic unit (DOU) 62, and address arithmetic unit (AOU) 64. Are preferably defined to maximize computational performance for available reconfigurable hardware resources.
[0107]
FIG. 18 is a block diagram of a preferred embodiment of a T machine. The T machine 14 includes a second local time base device 300, a common interface control unit (CICU) 302, and a set of mutually coupled input / output devices 304. The second local time base apparatus 300 includes a timing input unit that forms a master timing input unit of the T machine. The common interface control unit (CICU) 302 includes a timing input unit coupled to the timing output unit of the second local time base unit 300 via the second timing signal line 310, an address output unit coupled to the address line 44, Bidirectional data of each cross-coupled I / O device 304 present via a first bidirectional control port coupled to the memory input / output line 46, a bidirectional control port coupled to the external control line 48, and a message transfer line 312. And a second bidirectional data port coupled to the port. Each cross-coupled input / output device 304 has an input coupled to a general purpose cross-coupled matrix (GPIM) 16 via a message input line 314 and an output coupled to a general purpose cross-coupled matrix (GPIM) 16 via a message output line 316. Part.
[0108]
The second local time base device 300 in the T machine 14 receives the master timing signal from the master time base device 22 and generates a second local timing signal. The second local time base unit 300 sends a second local timing signal to the common interface controller (CICU) 302, thereby providing a timing reference for the T machine 14 in which it resides. The second local timing signal is preferably phase synchronized with the master timing signal. In the system 10, the second local time base device 300 of each T machine preferably operates at the same frequency. One skilled in the art will recognize that in another embodiment, one or more second local time base devices 300 operate at different frequencies. The second local time base device 300 is preferably implemented using a conventional phase lock frequency conversion circuit including a logic block (CLB) based phase lock detection circuit. Those skilled in the art will appreciate that in alternative embodiments, the second local time base device 300 can be implemented as part of a clock distribution tree.
[0109]
The common interface controller (CICU) 302 directs the transfer of a message between its corresponding S machine 12 and a particular interconnected I / O device 304, which includes commands and possibly data. In the preferred embodiment, the designated cross-coupled I / O device 304 may reside in either the T machine 14 or the I / O T machine 18 that is internal or external to the system 10. In the preferred embodiment, each cross-coupled input / output device 304 is preferably assigned a cross-coupled address that uniquely identifies the cross-coupled input / output device 304. The mutual coupling address for the mutual coupling input / output device 304 in a given T machine is stored in the architecture description memory 101 of the corresponding S machine.
[0110]
A common interface controller (CICU) 302 receives data and commands from its corresponding S machine 12 via memory input / output line 46 and external control signal line 48, respectively. Each received command preferably includes a target interconnection address and a command code specifying a particular type of operation to be performed. In the preferred embodiment, the types of operations uniquely identified by the command code include:
1) Data read operation,
2) Data write operation,
3) Interrupt signal transfer including reconfiguration interrupt transfer;
Is included. The target interconnection address identifies the target interconnection I / O device 304 to which data and commands are to be transferred. The common interface controller (CICU) 302 preferably transfers each command and associated data as a set of packet-based messages in a conventional manner, each message including a target cross-link address and a command code. .
[0111]
In addition to receiving data and commands from its corresponding S machine 12, the common interface controller (CICU) 302 receives messages from each cross-coupled input / output device 304 coupled to the message transfer line 312. In the preferred embodiment, the Common Interface Controller (CICU) 302 converts the associated message group into a single command and data sequence. The common interface controller (CICU) 302 issues a command via the external control signal line 48 when the command is directed to the corresponding dynamic reconfiguration processor (DRPU) 32 in the S machine 12. When a command is directed to the memory 34 in its corresponding S machine 12, the common interface controller (CICU) 302 issues an appropriate memory control signal via the external control signal line 48 and also via the memory address line 44. Issue memory address signal. Data is transferred through the memory input / output line 46. In the preferred embodiment, the common interface controller (CICU) 302 is a logic block (CLB) for performing operations similar to those performed by conventional SCI switching devices defined in ANSI / IEEE Standard 1596-1992. ) Includes base circuit.
[0112]
Each mutual coupling input / output device 304 receives a message from the common interface control unit (CICU) 302, and in response to the instruction of the control signal received from the common interface control unit (CICU) 302, the mutual coupling input / output device 304 sends the message to the general interconnection matrix (GPIM) 16 Then, the data is transferred to another mutual coupling input / output device 304. In the preferred embodiment, the cross-coupled input / output device 304 is based on SCI nodes as defined in ANSI / IEEE standard 1596-1992. FIG. 19 is a block diagram of a preferred embodiment of the mutual coupling input / output device 304. The mutual coupling input / output device 304 includes an address decoder 320, an input FIFO buffer 322, a bypass FIFO buffer 324, an output FIFO buffer 326, and a multiplexer 328. Address decoder 320 includes an input that forms the input of the cross-coupled input / output device, a first output coupled to input FIFO buffer 322, and a second output coupled to bypass FIFO buffer 324. It is out. The input FIFO buffer 322 includes an output coupled to a message transfer line 312 for transferring messages to the common interface controller (CICU) 302. Output FIFO buffer 326 includes an input coupled to message transfer line 312 for receiving a message from common interface controller (CICU) 302 and an output coupled to the first input of multiplexer 328. . Bypass FIFO buffer 324 includes an output coupled to the second input of multiplexer 328. Finally, multiplexer 328 includes a control input coupled to message transfer line 312 and an output that forms the output of the cross-coupled input / output device.
[0113]
The interconnection I / O device 304 receives the message at the input of the address decoder 320. The address decoder 320 determines whether the target cross-coupling address specified in the received message is the same as the cross-coupling address of the cross-coupling input / output device 304 in which it resides. If so, the address decoder 320 routes this message to the input FIFO buffer 322. Otherwise, address decoder 320 routes the message to bypass FIFO buffer 324. In the preferred embodiment, address decoder 320 is comprised of a decoder and data selector implemented using input / output blocks (IOB) and logic blocks (CLB).
[0114]
Input FIFO buffer 322 is a conventional FIFO buffer that transfers a message received at its input to message transfer line 312. Bypass FIFO buffer 324 and output FIFO buffer 326 are both conventional FIFO buffers that transfer messages received at their inputs to multiplexer 328. Multiplexer 328 is a conventional multiplexer that routes messages received from bypass FIFO buffer 324 or messages received from output FIFO buffer 326 to a general purpose interconnection matrix (GPIM) 16 according to control signals received at its control inputs. . In the preferred embodiment, input FIFO buffer 322, bypass FIFO buffer 324, and output FIFO buffer 326 are each implemented using a set of logical blocks (CLB). Multiplexer 328 is preferably implemented using a set of logical blocks (CLB) and input / output blocks (IOB).
[0115]
FIG. 20 is a block diagram of a preferred embodiment of the input / output T machine 18. The input / output T machine 18 includes a third local time base device 360, a common custom interface control device 362, and a mutual coupling input / output device 304. The third local time base device 360 includes a timing input unit that forms a master timing input unit of the input / output T machine. The cross-coupled input / output device 304 has an input coupled to the general-purpose cross-coupling matrix (GPIM) 16 via message input line 314 and an output coupled to the general-purpose cross-coupling matrix (GPIM) 16 via message output line 316. Part. The common custom interface controller 362 is coupled to the timing input coupled to the timing output of the third local time base device 360 via the third timing signal line 370 and to the bidirectional data port of the cross-coupled input / output device 304. A first bidirectional data port and a set of couplings to the input / output device 20. In the preferred embodiment, a set of couplings to input / output device 20 is coupled to a second bidirectional data port coupled to the bidirectional data port of input / output device 20 and to an address input of input / output device 20. Address output unit and a bidirectional control port coupled to the bidirectional control port of the input / output device 20. Those skilled in the art will readily recognize that the type of I / O device 20 to which the common custom interface controller 362 is coupled determines the coupling to the I / O device 20.
[0116]
The third local time base device 360 receives the master timing signal from the master time base device 22 and generates a third local timing signal. The third local time base unit 360 sends a third local timing signal to the common custom interface controller 362 and provides a timing reference to the input / output T machine in which it is located. In the preferred embodiment, the third local timing signal is phase synchronized with the master timing signal. The third local time base device 360 of each input / output T machine preferably operates at the same frequency. In another embodiment, one or more third local time base devices 360 can operate at different frequencies. The third local time base device 360 is preferably implemented using a conventional phase lock frequency conversion circuit including a logic block (CLB) based phase lock detection circuit. In a similar manner as the first local time base device 30 and the second local time base device 300, in another embodiment, the third local time base device 360 can be implemented as part of a clock distribution tree.
[0117]
The structure and function of the interconnection I / O device 304 in the I / O T machine 18 is preferably the same as already described for the T machine 14. The interconnect I / O device 304 in the I / O T machine 18 is assigned a unique interconnect address in a manner similar to that of each interconnect I / O device 304 in any T machine 14.
[0118]
The common custom interface controller 362 directs the transfer of a message between the input / output device 20 and the interconnected input / output device 304 coupled thereto, the message including a command and possibly data. The common custom interface control device 362 receives data and commands from the corresponding input / output device 20. Each command received from the input / output device 20 preferably includes a target cross-connect address and a command code specifying a particular type of operation to be performed. In the preferred embodiment, the type of operation uniquely identified by the command code includes:
1) Data request and
2) Confirm data transfer,
3) Interrupt signal transfer,
Is included. The target interconnection address identifies the target interconnection I / O device 304 in the system 10 to which data and commands are to be transferred. The common custom interface controller 362 preferably transfers each command and associated data as a set of packet-based messages in a conventional manner, each message including a target cross-link address and a command code.
[0119]
In addition to receiving data and commands from its corresponding input / output device 20, the common custom interface controller 362 receives messages from its associated input / output device 20. In the preferred embodiment, the common custom interface controller 362 converts the associated message group into a single command and data sequence according to the communication protocol supported by its corresponding I / O device 20. In the preferred embodiment, the common custom interface controller 362 is based on a logical block (CLB) base for performing operations similar to those performed by conventional SCI switching devices defined in ANSI / IEEE Standard 1596-1992. A logic block (CLB) based input / output device controller coupled to the circuit is included.
[0120]
The general purpose interconnection matrix (GPIM) 16 is a conventional interconnection mesh that facilitates point-to-point parallel message routing between the interconnection I / O devices 304. In the preferred embodiment, the general interconnect matrix (GPIM) 16 is a wire-based, k-ary n-cube static interconnect network. FIG. 21 is a block diagram of an exemplary embodiment of a general purpose interconnection matrix (GPIM) 16. In FIG. 21, the general interconnect matrix (GPIM) 16 is a k-ary 2-cube toroid interconnect mesh that includes a plurality of first communication channels 380 and a plurality of second communication channels 382. Each first communication channel 380 includes a plurality of node connections 384, and each second communication channel 382 includes the same. Each mutual coupling input / output device 304 of the system 10 connects the message input line 314 and the message output line 316 to the continuous node connection unit 384 in the predetermined first communication channel 380 and the second communication channel 382. Are preferably bonded to a general purpose interconnection matrix (GPIM) 16. In the preferred embodiment, each T-machine 14 includes a cross-coupled input / output device 304 coupled to the first communication channel 380 and a cross-coupled input / output device 304 coupled to the second communication channel 382 in the manner described above. Contains. The common interface controller (CICU) 302 in the T-machine 14 has its cross-coupled input / output device 304 coupled to the first communication channel 380 and its cross-coupled input / output device 304 coupled to the second communication channel 382. It is preferable that information can be easily routed between the two. Accordingly, a mutual coupling input / output device 304 coupled to the first communication channel 380 labeled 380c in FIG. 21 and a mutual coupling input / output device 304 coupled to the second communication channel 382 labeled 382c. For the T machine 14 including the common interface control unit (CICU) 302 of the T machine, information routing between the first communication channel 380c and the second communication channel 382c can be easily performed.
[0121]
Therefore, the general purpose interconnection matrix (GPIM) 16 can easily route a plurality of messages between the interconnection I / O devices 304 arranged in parallel. For the two-dimensional general purpose interconnection matrix (GPIM) 16 of FIG. 21, each T-machine 14 has one interconnection I / O device 304 for the first communication channel 380 and one interconnection for the second communication channel 382. An input / output device 304 is preferably included. Those skilled in the art will recognize that in embodiments where the general interconnect matrix (GPIM) 16 dimension exceeds two dimensions, it is preferred that the T-machine 14 includes more than two interconnected input / output devices 304. . The general purpose interconnection matrix (GPIM) 16 is preferably implemented as a k-ary 2 cube containing a 16 bit data path.
[0122]
In the above description, the various components of the present invention are preferably implemented using reconfigurable hardware resources. Manufacturers of reconfigurable logic devices have published guidelines for implementing conventional digital hardware, typically using reprogrammable hardware resources or reconfigurable hardware resources. For example, the 1994 Programmable Logic Device Data Book (xilinx, Inc., San Jose, California) includes the following application notes. That is, application note XAPP005.002 “register-based FIFO”, application note XAPP044.00 “high-performance RAM-based FIFO”, application note XAPP013.001 “use of carry-only logic in XC4000”, application note XAPP018.00 “XC4000” Adder and Counter Performance Estimation ", Application Note XAPP028.001" Frequency / Phase Comparator for Phase Locked Loop ", Application Note XAPP031.000" Using XC4000RAM ", Application Note XAPP036.001" 4 Port DRAM Controller La ... ", application note XAPP039.001" 18-bit pipeline accumulator " Are the respective application notes. The materials published by Xilinx include articles included in “XCELL,” a quarterly magazine for users of Xilinx programmable logic. For example, the third issue of 1994 (the 14th issue) contains a detailed article on the implementation of fast integer multipliers.
[0123]
The system 10 described herein is an extensible, parallel computer architecture for a dynamically implemented multiple instruction set architecture (ISA). Any S machine 12 can execute the entire computer program by itself, regardless of external hardware resources such as another S machine 12 or a host computer. In any S machine 12, multiple instruction set architectures (ISAs) are implemented continuously during program execution in response to reconfiguration interrupts and / or reconfiguration instructions embedded in the program. Since system 10 preferably includes multiple S machines 12, a plurality of programs are preferably executed simultaneously, and each program may be independent. Thus, since the system 10 preferably includes multiple S machines 12, multiple instruction set architectures (ISAs) are implemented simultaneously (ie, in parallel) at any time except during system initialization or reconfiguration. That is, multiple sets of program instructions are executed simultaneously at any given time, and each set of program instructions is executed according to a corresponding instruction set architecture (ISA). Each such instruction set architecture (ISA) is unique.
[0124]
The S machines 12 communicate with each other and with the input / output devices 20 via the T machines 14, the general interconnection matrix (GPIM) 16, and the input / output T machines 18. Each S machine 12 is itself a complete computer, capable of performing independent operations, but any S machine 12 can function as a master S machine 12 for other S machines 12 or the entire system 10, and data and / Or commands can be sent to other S machines 12, to one or more T machines 16, to one or more I / O T machines 18, to one or more I / O devices 22 .
[0125]
Thus, the system 10 of the present invention can be divided into one or more data parallel (sub) problems spatially and temporally, such as image processing, medical data processing, calibrated color matching, database calculations, documents This is particularly useful for the above processing, the associative search engine, and the network server. For a computation problem with many operand strings, the data is in parallel when the algorithm can be applied so that efficient computation speed can be obtained by the parallel computation method. Data parallel problems contain a known complexity, which is O (n^k). The value of k is determined by the problem. For example, k = 2 in image processing and k = 3 in medical data processing. In the present invention, each S machine 12 is preferably used to exploit data parallelism at the level of the program instruction group. Since system 10 includes multiple S machines 12, system 10 is preferably used to take advantage of data parallelism at the level of the entire program.
[0126]
Since the instruction processing hardware of each S machine 12 can be completely reconfigured to optimize the computational performance of such hardware for the computation required at any moment, the system 10 of the present invention. Can provide large-scale computing power. Each S machine 12 can be reconfigured independently of other S machines 12. System 10 advantageously handles each configuration data set as a programmed boundary, or interface, between software and the reconfiguration hardware described herein, and thus each instruction set architecture (ISA). . In addition, the architecture of the present invention makes it easy to build reconfiguration hardware at a high level to selectively handle real system problems in their native locations, where interrupts affect instruction processing. The method includes the need for a decision latency response that facilitates real-time processing and computer performance, and the need for a selectable response to defect handling.
[0127]
Unlike other computer architectures, the present invention discloses that silicon resources can be maximized at any time. The present invention provides a parallel computer system that can be expanded to a desired size at any time, and can be a large-scale parallel system composed of several thousand S machines 12. Such extensibility of the architecture is possible because the S machine base instruction processing is intentionally separated from the T machine base data communication. This instruction processing / data communication separation model is very suitable for data parallel computation. The internal structure of the S machine hardware is preferably optimized for instruction time flow, while the internal structure of the T machine hardware is preferably optimized for effective data communication. The set of the S machine 12 and the set of the T machine 14 are separable and configurable components in the spatial and temporal division of the data parallel calculation.
[0128]
With the present invention, future reconfiguration hardware may be used to build systems with better computing performance while maintaining the overall structure described in this document. In other words, the system 10 of the present invention is technically scalable. Almost all reconfigurable logic devices in use today use memory-based complementary metal oxide semiconductor (CMOS) technology. Advances in device capabilities follow the trend (trend) in semiconductor memory technology. In future systems, the reconfiguration logic unit used to build the S machine 12 includes a section of internal hardware resources in accordance with the inner loop and outer loop instruction set architecture (ISA) described herein. It will be. Even a large reconfigurable logic device simply provides the ability to perform more data parallel computations within a single device. For example, if the functional unit 194 included in the second exemplary embodiment of the data processing unit (DOU) 63 described above with reference to FIG. 13 is large, an image processing kernel of a larger size will be included. Those skilled in the art will appreciate that the technical scalability provided by the present invention is not limited to CMOS-based devices, nor is it limited to field programmable gate array (FPGA) -based implementations. Thus, the present invention provides technical scalability regardless of the specific technology used to obtain reconfigurability or reprogrammability.
[0129]
FIG. 22 is a flowchart of a preferred method for scalable, parallel, dynamic reconfiguration computation. The method of FIG. 22 is preferably performed within each S machine 12 in the system 10. The preferred method begins at step 1000 of FIG. 22 where the reconfiguration logic 104 retrieves a configuration data set corresponding to an instruction set architecture (ISA). Next, in step 1002, the reconfiguration logic 104 determines each component in the instruction fetch unit (IFU) 60, data arithmetic unit (DOU) 62, and address arithmetic unit (AOU) 64 according to the configuration data set retrieved in step 1000. This provides a dynamic reconfiguration processor (DRPU) hardware organization for the implementation of the currently considered instruction set architecture (ISA). After step 1002, in step 1004, interrupt logic 106 retrieves an interrupt response signal stored in architecture description memory 101 to determine how the current dynamic reconfiguration processor (DRPU) configuration responds to the interrupt. Generate a corresponding set of transition control signals. Thereafter, instruction set architecture (ISA) 100 initializes program state information at step 1006. Instruction set architecture (ISA) 100 then begins an instruction execution cycle at step 1008.
[0130]
Next, at step 1010, the instruction set architecture (ISA) 100 or interrupt logic 106 determines whether reconfiguration is required. When a reconfiguration instruction is selected during program execution, the instruction set architecture (ISA) 100 determines that reconfiguration is required. The interrupt logic 106 determines that reconfiguration is required in response to the reconfiguration interrupt. When reconfiguration is required, the preferential method proceeds to step 1012 where the reconfiguration handler saves program state information. The program state information preferably includes a reference to a configuration data set corresponding to the current dynamic reconfiguration processor (DRPU) configuration. After step 1012, the preferential method returns to step 1000 to retrieve the next configuration data set that is referenced by the reconfiguration instruction or reconfiguration interrupt.
[0131]
If no reconfiguration is required at step 1010, then at step 1014, the interrupt logic 106 determines whether a non-reconfigurable interrupt needs to be implemented. If so, then in step 1020, the instruction set architecture (ISA) 100 allows the transition from the current instruction state sequencer (ISS) state to the interrupt enforcement state within the instruction execution cycle based on the transition control signal. Decide whether to be done. When the transition to the interrupt execution state is not permitted, the instruction set architecture (ISA) 100 proceeds to the next state of the instruction execution cycle and returns to step 1020. If the transition control signal allows a transition from the current instruction state sequencer (ISS) state to the interrupt enforcement state in the instruction execution cycle, then in step 1024 the instruction set architecture (ISA) 100 enters the interrupt enforcement state. move on. At step 1024, instruction set architecture (ISA) 100 saves program state information and executes program instructions to implement an interrupt. After step 1024, the preferential method returns to step 1008 to resume the current instruction execution cycle if it has not been completed and to start the next instruction execution cycle (if it has been completed).
[0132]
When it is not necessary to implement a non-reconfigurable interrupt at step 1014, the preferential method proceeds to step 1016 to determine whether execution of the current program is complete. When execution of the current program is to continue, the priority method returns to step 1008 to begin another instruction execution cycle. Otherwise, this preferential method ends.
[0133]
The present invention incorporates a metaaddressing mechanism for performing the memory operations required by the architecture of the present invention. According to the present invention, the T machine 14 is used as an addressing machine. The T machine 14 performs interrupt processing, message queuing, metaaddress generation, and overall data packet transfer control. FIG. 23 is a block diagram of a data packet 1800 according to the present invention. Data packet 1800 includes a data portion 1824, a command portion 1820, a source geographic address 1816, a size separator 1812, a destination local address, and a destination geographic address 1804. Metaaddress 1828 includes destination geographic address 1804 and destination local memory address 1808. The destination local memory address 1808 specifies where in the local memory 34 the data of the data packet 1800 is to be written. The destination geographic address, ie, the interconnection address 1804, specifies which T machine 14 should receive the data packet 1800. The source geographic address 1816 specifies the T machine 14 that generated the data packet 1800.
[0134]
Any two pairs of source geographic address 1816 and destination
(Purpose) The geographic address 1804 uniquely determines one path to a 264-bit local address space. However, the system has two or more such paths and can operate in parallel. The S machine 12 includes a T machine 14 coupled thereto, the number of which can include up to a number corresponding to the local memory bandwidth, or up to any number considering the matrix effect. Therefore, the present invention can be expanded by an indefinite power of two, the processors in the system may be non-uniform, and the number of unique paths to each S machine 12 is arbitrarily expanded. can do. This type of extensibility is important in many applications such as distributed image processing, and the dynamic reconfiguration processing component pyramid or tree can be configured to provide wider communication bandwidth for the higher levels of this system. It may be. If desired, this pyramid architecture can be implemented by allowing more constant velocity T machines 14 to access the higher levels of the S machine 12 pyramid, to the S machine 12 that most needs addressing capabilities. Give this ability. This results in a cost-effective system because system resources can be concentrated on most processing and communication tasks.
[0135]
In the preferred embodiment, the metaaddress is 80 bits wide. In this embodiment, the geographic address is 16 bits and the local memory address is 64 bits wide. With 16-bit geographic addresses, 65536 geographic addresses can be specified. 2 in each local memory 34 with a 64-bit local memory address⁶⁴Individual addressable bits can be specified. Each S machine 12 includes a local memory 34 configured for a particular S machine 12. Since the S machine 12 and the memory 34 are separated from each other, the size and structure of each memory need not be uniform, and it is not necessary to maintain coherency and consistency of the entire memory. The program instruction of the source S machine 12 is written in consideration of the architecture of the local memory 34 of the target S machine 12, and as long as this program instruction accurately specifies the memory location, it is independent of its size and layout. In addition, the local memory 34 of the target S machine 12 can be easily addressed. Because of this modularity, the size of the architecture can be scaled with various components, regardless of the problem handling. The method of integrating the new S machine has also been greatly simplified. When a new S machine 12 is added to the system, a new geographic address is selected for the S machine 12 and a new address is given to the program requesting the use of the new S machine 12. Once a new address is incorporated into a program designed to utilize the new S machine 12, no problems to solve arise and the S machine 12 is integrated without having to perform any calculations.
[0136]
FIG. 24 is a flowchart showing the flow of processing of the S machine 12 of the present invention for requesting remote computation. In step 1900, S machine 12 receives the command. In step 1904, the S machine 12 determines whether this command requires a remote operation. If this command does not require remote computation, it is executed at step 1906. When the command requires remote computation, the remote computation information is stored in local memory at step 1904. After proceeding to step 1920 as described below, the S machine 12 determines that the instruction is requesting a remote operation by examining the state of a flag in the instruction code indicating whether the remote operation is required. To do. A remote operation is an operation that needs to use a different S machine 12 to obtain a result. The remote calculation information is provided by a program executed by the S machine 12, and is stored in the local memory 34 when execution of the remote calculation is desired. It is preferred to use a fixed memory location in the local memory 34 to store remote operations, so that the T machine 14 can immediately access the information and does not need to obtain an address first. The remote computation information generally includes a target geographic address 1804 of the remote T machine 14, a target local memory address 1808 for storing or retrieving data from the remote S machine 12, command information 1820, , Size information 1812 and data 1824 are included. All of this information is stored in the local memory 34 by the S-machine 12 as soon as it is determined that the command requires a remote operation.
[0137]
In one embodiment, at step 1912, the S machine 12 issues an unconditional instruction to the T machine to indicate that remote computation is required. An unconditional instruction is a unique command sequence designed to be recognized by the T machine 14. The unconditional instruction generally includes a memory address where the remote calculation information is stored in the local memory 34, and a size delimiter indicating the size of the addressing information. By simply designating the start address of remote calculation information and a series of size separators, a plurality of remote calculations can be requested at one time by the program being executed by the S machine 12. The T machine 14 can then sequentially process different requests for information. Next, at step 1920, the S machine 12 determines whether there are other instructions to execute. When there is an instruction, the next instruction is received and executed. Therefore, the S machine 12 can execute instructions almost instantaneously regardless of the request for remote operation. Since the T machine 14 performs data transfer and retrieval, the processing capability of the S machine 12 can be concentrated only on instruction processing. FIG. 25 is a flowchart of the process of the T machine 14 that receives an unconditional instruction from the S machine 12. First, in step 2000, the T machine 14 determines whether the command received from the S machine 12 on the control line 48 is an unconditional instruction. If it is determined that the command is an unconditional instruction, in step 2004 the T machine retrieves remote computation information from the local memory 34 via the memory / data line 46. The remote computation information is preferably stored in a fixed location in the memory 34 so that when the T machine 14 retrieves data, it does not need to determine a new memory address each time the remote computation information is retrieved. The remote calculation information can be stored in any location in the local memory 34. In this case, however, the location of the information must be transmitted as part of the unconditional command. After retrieving the remote computation information, the T machine 14, in particular the common interface controller (CICU) 302 component of the T machine 14, generates a metaaddress 1828 from the information at step 2008. The destination local memory address 1808 is appended to the destination geographic address 1804 to form the meta address 1828. Next, at step 2112, T-machine 14 generates a data packet 1800 from the remaining telemetry information and sends the data packet 1800 to an interconnection device or general interconnection matrix (GPIM) 16 for transmission to the requested destination. To transmit.
[0138]
The source geographic address 1816 may be specified by a program instruction and thus stored in the local memory 34 for retrieval by the T machine 14. The source geographic address 1816 is preferably stored in an architecture description memory (ADM) 101. Architecture description memory (ADM) 101 is a changeable memory that stores the geographic address of the T-machine 14 to which it is coupled. An architectural description memory (ADM) 101 can be used to explicitly change the geographic address of the entire system. In this system embodiment, the T machine 14 retrieves the source geographic address 1816 from the architecture description memory (ADM) 101 and verifies that it uses the latest source geographic address 1816 of the T machine 14 itself. In an embodiment in which a multiple common interface controller (CICU) 302 is coupled to each S machine 12, the geographic address of each common interface controller (CICU) 302 is stored in an architecture description memory (ADM) 101.
[0139]
FIG. 26 is a flowchart showing the processing of the T machine 14 for receiving the data packet transmitted through the mutual coupling device. In step 2100, T-machine 14 receives a data packet from the interconnection device. In step 2104, T machine 14 decrypts data packet 1800 by parsing the destination geographic address 1804 component of meta address 1828. As described above, the T machine address decoder 320 decodes the data packet 1800. In step 2108, address decoder 320 compares destination geographic address 1804 with the associated geographic address. In an embodiment using a changeable architecture description memory (ADM) 101, the address decoder 320 compares the received destination geographic address 1804 with the address stored in the architecture description memory (ADM) 101. When the address decoder 320 determines that the geographic addresses match at step 2112, the data packet 1800 is transmitted to the memory 34 location specified by the local memory address 1808. Data packet 1800 is analyzed, data is sent via memory / data line 46, and commands are sent via control line 48. Address information is sent via the address line 44. When each address does not match, an error message is transmitted through bypass FIFO 324, MUX 328, and general interconnection matrix (GPIM) 16 to the T machine 14 identified by the source geographic address 1816 component of the data packet 1800. The same process as described above is used when the T-machine 14 receives a misaddressed data packet 1800. When a new data packet 1800 is received, if the CICU 304 is assembling or disassembling the data packet 1800, the T-machine 14 sends the data packet 1800 to the input FIFO 322 and waits until the CICU 304 can receive and process the data. Let
[0140]
In another embodiment, the T machine 14 is designed to recognize the priority of the message and interrupts the processing of the S machine 12 when it is appropriate to have the S machine process the new command. In this embodiment, as shown in FIG. 27, the common interface controller (CICU) 302 further includes additional components including an interrupt logic 2200, a comparator (comparator) 2204, and a recognizer 2208. FIG. 28 is a flowchart showing the interrupt processing capability of the common interface control unit (CICU) 302. In step 2300, the recognition device 2208 confirms the address by the address decoding 320 and then analyzes the data packet 1800 to identify the parse command 1820. In step 2304, the recognizer 2208 determines whether the command 1820 is an interrupt request. When data packet 1800 is an interrupt request, command 1820 will include an interrupt ID. If the command 1820 does not include an interrupt ID, the data packet is sent to the common interface controller (CICU) 302 in step 2308 to perform the processing described above.
[0141]
When command 1820 includes an interrupt ID, the interrupt ID is sent to a comparator 2204 that is coupled to memory 34. The memory 34 stores a list of interrupt IDs. Each S machine preferably includes a list of interrupt IDs that the S machine is designed to store in its associated local memory 34. Interrupts can be identified by this list, interrupt priorities can be specified, and the list includes instructions for performing interrupts. In step 2312, the comparator 2204 compares the interrupt ID of the received command with the list of stored IDs. If the interrupt ID specified by the command does not match the ID in the list, then in step 2320, an error message is transmitted via bypass FIFO 324, MUX 328 to the destination specified by source geographic address 1816 and via signal line 314. It is transmitted to the coupling matrix (GPIM) 16. When the interrupt ID matches the stored ID, at step 2324 the interrupt logic 2200 interrupts according to information contained in the local memory 34 associated with the stored ID or according to information contained in the data packet 1800. Process and send the resulting command to the S machine 12 via the control line 48.
[0142]
When priority comparison is possible, the interrupt logic 2200 compares the priority of the interrupt request with the priority of the data packet 1800 currently in the input FIFO 322. When the priority of the interrupt request is higher than the data packet 1800 of the FIFO 32, it is placed in front of the data packet 1800 having a low priority of the interrupt request. In some cases, an interrupt request may request that execution of the S machine 12 be stopped. In this case, the priority level is assigned to the process running on the S machine 12. When the priority of the interrupt request is higher than the priority of the currently executing process, the interrupt logic 2200 sends an instruction to the S machine 12 via the control line 48 to terminate the current process and start processing the interrupt request. Call 12 Thus, a complete priority comparison and interrupt handling scheme is implemented by the T machine 14 based on the architecture of the present invention, and there is little need for the S machine 12 to do additional processing.
[0143]
Therefore, since the T machine 14 performs all memory operation functions required by the computer system, the S machine 12 can execute the main instruction of the program. By separating the memory and instruction execution operations spatially and temporally, it is possible to greatly optimize the processing capability of a highly parallel system composed of multiprocessors. Since no virtual memory or shared memory is used, there is no need to perform hardware consistency and coherency calculations. The S machine 12 can operate at different rates, and the instruction set architecture (ISA) implemented by the dynamically reconfigurable S machine 12 may be different. In addition, a field programmable gate array (FPGA) that implements the S machine 12 is also optimized for a particular task. For example, when calculating embedded images, the front panel LCD screen controller need not be an S machine 12 optimized for image processing. However, it is still highly desirable to allow all S machines 12 in the system to be uniformly addressed by each S machine 12 that needs to communicate with another S machine, as described above. Obtained by the present invention as described. The software is used to obtain system-wide coherency and consistency, such as a message transfer interface (MPI) runtime library for S-machine 12 and T-machine 14, and parallel virtual machines ( Conventional methods such as runtime libraries for PVM) are used. Both MPI and PVM function as a hardware abstraction layer (HAL). According to the present invention, the HAL is for a dynamically reconfigurable S machine 12 and a fixed T machine 14. Since memory operations are completely controlled by software, the system is dynamically reconfigurable and is not affected by complex hardware / software interactions. Thus, a fully scalable and architecturally reconfigurable computer system that uses independent and separate memory and has separate addressing and processing machines is provided for use in highly parallel computing environments. Is done. By using a meta address, it is possible to specify a transparent and fine address, and it is possible to assign or reassign a communication path of a computer system according to a system request. By separating the addressing machine from the processing machine, the processing machine resources can be focused solely on processing, the processing machine can use various instruction set architectures and operate at various rates, and Each can be implemented using optimized hardware. All of these greatly improve the processing power of the system.
[0144]
The disclosure of the present invention is significantly different from other systems for reprogrammable or reconfiguration calculations. In particular, the downloadable microcode architecture generally relies on non-reconfigurable control means and non-reconfigurable hardware, so the present invention is not equivalent to such an architecture. The present invention also clearly differs from an additional reconfiguration processor (ARP) system in which a set of reconfiguration hardware is coupled to a non-reconfigurable host processor or host system. An additional reconfigurable processor (ARP) device is dependent on a host that executes several programs. Thus, as the host or additional reconfigurable processor (ARP) device operates on the data respectively, the silicon resources of the additional reconfigurable processor (ARP) device or host are idle or used inefficiently, The available silicon resources are not utilized to the maximum during the program execution time frame. In contrast, each S machine 12 is an independent computer that can easily execute the entire program. Multiple S machines 12 preferably execute programs simultaneously. Accordingly, the present invention discloses that silicon resources are always utilized to the maximum extent for both each program executed on each S machine 12 and multiple programs executed on the entire system 10.
[0145]
An additional reconfigurable processor (ARP) device is implemented as a set of gates that provide computational accelerators for a particular algorithm at a particular time and are optimally interconnected for the particular algorithm. The use of reconfigurable hardware resources for general-purpose operations such as managing instruction execution is avoided in additive reconfigurable processor (ARP) systems. In addition, additive reconfigurable processor (ARP) systems do not treat a given set of interconnections as easily reusable resources. In contrast, the present invention discloses a dynamic reconfiguration processing means configured for efficient management of instruction execution with an instruction execution model that is best suited to the need for computation at a particular time. The S machine 12 includes a plurality of reusable resources, such as an instruction state sequencer (ISS) 100, interrupt logic 106, and storage / alignment logic 152. The present invention discloses the use of reconfigurable logic resources at the logic block (CLB), input / output block (IOB), and reconfigurable interconnect levels rather than at the interconnect gate level. Thus, the present invention does not disclose a single gate connection scheme useful for a single algorithm, but rather a reconfigurable high-level logic component useful for performing operations on all classes of computational problems. Disclose use.
[0146]
In general, an additional reconfigurable processor (ARP) system is for translating a particular algorithm into a set of interconnected gates. Some additive reconfigurable processor (ARP) systems attempt to compile high level instructions into an optimal gate level hardware configuration, which is generally an NP hard problem. In contrast, the present invention discloses the use of a compiler for dynamic reconfiguration computation that compiles high-level program instructions into assembly language instructions according to a variable instruction set architecture (ISA) in a very simple manner.
[0147]
An additional reconfigurable processor (ARP) device generally cannot treat its host program as data, nor can it adapt itself to a computing environment. On the other hand, each S machine 12 of the system 10 can handle its own program as data, and therefore can easily adapt itself to the computing environment. The system 10 can easily simulate itself by executing its own program. The present invention can further compile its own compiler.
[0148]
In the present invention, a single program includes a first instruction group belonging to the first instruction set architecture (ISA), a second instruction group belonging to the second instruction set architecture (ISA), and another instruction set architecture ( ISA) and a third instruction group. . . Is included. The architecture disclosed herein executes each such group of instructions using hardware that is configured at runtime to implement the instruction set architecture (ISA) to which the instructions belong. None of the prior art systems or methods presents similar disclosures.
[0149]
The present invention further discloses a reconfiguration interrupt scheme in which interrupt latency, interrupt accuracy, and programmable state transition enabling vary according to the currently considered instruction set architecture (ISA). Other computer systems do not allow similar disclosures. The present invention further discloses a computer system having a reconfiguration data path bit width, an address bit width, and a reconfiguration control line width, unlike prior art computer systems.
[0150]
Although the present invention has been described using several preferred embodiments, those skilled in the art will recognize that various modifications may be obtained.
[0151]
<Reference Material A>
Instruction set 0, general-purpose external loop instruction set architecture (ISA)
1.0 Programmer's architecture model
This section presents the programmer's general concept for the instruction set architecture (ISA) 0 architecture, including registers, memory model, call convention from high level language, and interrupt model.
[0152]
1.1 Register
Instruction Set Architecture (ISA) 0 includes 16 16-bit general purpose registers, 16 address registers, 2 processor status registers, and 1 interrupt vector register. The data and address register mnemonics use hexadecimal numbers, so the last data register is df. And the last address register is af. Nipar (Next Instruction Program Address Register), which is one of the processor status registers, indicates the address of the next instruction to be fetched (fetched). The other status register, pcw (Processor Control Word), includes a program flow and flags and control bits used for interrupt processing. The bits are defined in Table 2. Undefined bits are reserved for future use. Four condition flags, Z, N, V, and C are set as side effects of various instructions. See Section 2.0 for an overview of which flags are affected by each instruction.
[0153]
T (Trace Mode) and IM (Interrupt Mask) flags control how the processor responds to interrupts and when traps are handled. The interrupt vector register ivec holds the 64-bit address of the interrupt service routine. Interrupts and traps are described in section 1.4 below.
[0154]
[Table 1]

[0155]
1.2 Memory access
The value stored in the 64-bit address register is used by the memory load / store instruction access memory in 16-bit and 64-bit increments (see Table 7). The address is a bit address. That is, address 16 refers to a word (word) that begins with bit 16 in memory. Words can only be read on 16-bit boundaries, so the four LSBs (least significant bits) of the address register are ignored when reading the memory. K_ISASee [1] for details of the concept. The 64-bit value is stored as a 16-bit word in little endian order (the order in which the least significant 16 bits are stored at the lowest address).
[0156]
[Table 2]

[0157]
1.3 Calling convention
By convention, the register af is used as a stack pointer by the C program, and the register ae is used as a stack frame pointer. The mnemonics sp and fp may be used as aliases for these registers. All other registers are free for general use. The stack grows downward.
[0158]
int is 16 bits, long is 64, and a is void *. The int value is restored with d0, and the long and void * values are restored with a0. d0-d4 and a0-a3 are clobbered by the function and all other general purpose registers must be retained on the function call. When entering the function, the stack pointer points to the return address, thus the first argument starts at address sp + 64 (decimal).
[0159]
1.4 Traps and interrupts
Instruction Set Architecture (ISA) 0 operates on one interrupt line, and software traps from two sources. All call the same flow-of-control transfer mechanism described below.
[0160]
Externally, there is a single INTR signal input and one iack output. iack is active at the same time that the interrupt mask bit in pcw is cleared by resetting pcw with the xpcw instruction or by returning from interrupt with the rti instruction and returning pcw to its original value It becomes. The amount of time between the interrupt signaling by the external device and the interrupt service by the processor depends on the currently executing instruction and the presence of the software trap.
[0161]
Software traps are triggered by explicit trap instructions or by executing instructions on the T (trace) flag set. In this case, after the first instruction following the setting of T, control is transferred to the interrupt service routine. When the trap instruction is executed, the processor sets the T flag and enters the interrupt service routine as if the T flag had been set before executing the instruction. Interrupts are not serviced while the T-flag is set. No further traps occur until the T flag is cleared by resetting pcw with the xpcw instruction or by resetting from the stack by return from interrupt with the rti instruction.
[0162]
An interrupt is generated by the presence of an active signal with an intr external signal. When the im flag or T flag is set, interrupts are masked and pending interrupts are ignored. When the im and T flags are cleared, control is transferred to the interrupt service routine after the first instruction following the assertion of intr. Upon entering the interrupt service routine, the im flag is set by the processor. No further interrupts occur until the im flag is cleared by resetting pcw with the xpcw instruction or by resetting from the stack by returning from the interrupt with the rti instruction.
[0163]
The steps the processor takes when an interrupt or trap occurs are as follows:
1. Complete all currently executing instructions.
2. 16 data registers (d0 first), 16 address registers (a0 first), pcw, ivec and nipar are pushed into the stack (pointed by register af) in this order. The value of af pushed onto the stack is that value before the interrupt or trap service begins.
3. When this is an interrupt, the interrupt bit in pcw is set to mask any further interrupts. When this is a trap instruction, the T flag is set. When this is a trap generated by a T flag, pcw is not changed.
4). Load the value in the ivec register into the niper.
Execution of the instruction within the interrupt handler begins.
[0164]
The following operations are performed when the rti instruction is executed.
1. The registers are recovered from the stack in the opposite order that they were written.
2. Resume execution.
[0165]
If the interrupt mask flag is not already cleared, it is cleared by the rti instruction. This is because unless the value of pcw was changed on the stack, it was cleared when the service routine was entered. When the T flag is set by executing a trap instruction, it is cleared when rti is executed for the same reason. When a trap occurs due to a T flag that was set before entering the service routine, it must be cleared by the service routine to confirm that the trap has occurred. When the interrupt mask flag is cleared by any means, the external output signal iack is active for one clock cycle to send a signal to the external device being interrupted.
[0166]
2.0 Instruction classification by function
The conventions are as follows.
[0167]
[Table 3]

[0168]
2.1 Register behavior
[0169]
[Table 4]

[0170]
2.2 Logical operations
[0171]
[Table 5]

[0172]
2.3 Memory load / store
[0173]
[Table 6]

[0174]
2.4 Arithmetic operations
[0175]
[Table 7]

[0176]
2.5 Control flow
[0177]
[Table 8]

[0178]
3.0 Letter reference symbols
The instructions set for Instruction Set Architecture (ISA) 0 are listed below in alphabetical order. The mnemonic is shown in a short description. Below that is the binary code of the instruction. Each row of binary code is a 16-bit word. The affected flags are listed below. Unless otherwise specified, the flag is set using data stored in the destination register. Assume that nipar has already been incremented at the start of instruction execution. Finally, a text description of the meaning of the command is shown.
[0179]
The notation conventions used for binary codes are summarized in the table below. Condition codes are defined in Table 59.
[0180]
[Table 9]

[0181]
[Table 10]

[0182]
Add the two data registers and leave the result in the destination register.
[0183]
[Table 11]

[0184]
Add the two data registers and the carry flag and leave the result in the destination register.
[0185]
[Table 12]

[0186]
Add an 8-bit signed (2's complement) constant to the data register and leave the result in the register.
[0187]
[Table 13]

[0188]
Perform a bitwise AND of the two data registers and leave the result in the destination register.
[0189]
[Table 14]

[0190]
If the condition is true, (offset << K_isa ) Is added to nipar.
[0191]
[Table 15]

[0192]
(Offset << K_isa) Is added to nipar.
[0193]
[Table 16]

[0194]
Conditionally shift 8 bits to the right and mask. Used after the load instruction to align the 8-bit data read from the word offset. When the address contained in the source address register is on an 8-bit boundary (with bit 2 set), the value in the data register is shifted 8 bits to the right. If the address is not on an 8-bit boundary, the upstream 8 bits of the register are cleared.
[0195]
[註] The negative flag is set by bit 7 instead of bit 15. This facilitates code extension of 8 bits.
[0196]
[Table 17]

[0197]
By subtracting the source register from the destination register, a flag for comparing the absolute values of the two data registers is set, and only the flag is affected.
[0198]
[Table 18]

[0199]
The signed division of the 32-bit signed integer is performed by the 16-bit signed integer, and the 16-bit signed quotient and the remainder are returned. The 32-bit dividend is stored in two consecutive registers starting from the destination register index (little endian order). The 16-bit divisor is in the source register. The remainder is returned to the destination register, and the quotient is returned to the register after the destination register (modulo 16). When the quotient exceeds 16 bits, overflow occurs.
[0200]
[Table 19]

[0201]
Add the data register to the address register and leave the result in the address register.
[0202]
[Table 20]

[0203]
Add an 8-bit signed constant to the address register and leave the result in the address register.
[0204]
[Table 21]

[0205]
By subtracting the source register from the destination register, a flag for comparing the absolute values of the two address registers is set, and only the flag is affected.
[0206]
[Table 22]

[0207]
Add the two address registers and leave the result in the destination register.
[0208]
[Table 23]

[0209]
Subtract the source register from the destination register and store the result in the destination register.
[0210]
[Table 24]

[0211]
Post-increment load into address register. Read the memory from the address pointed to by the source register and place it in the destination register. Next, the source register is incremented.
[0212]
[Table 25]

[0213]
Shift address register one bit to the right.
[0214]
[Table 26]

[0215]
Store from address register. Write the 64-bit value in the source register to the memory location pointed to by the destination register. This value is written as four 16-bit words arranged in little endian order.
[0216]
[Table 27]

[0217]
Pre-decrement store from address register. Decrement the destination register and then write the value in the source register to the memory location pointed to by the destination register. This value is written as four 16-bit words arranged in little endian order.
[0218]
[Table 28]

[0219]
Subtract the data register from the address register and leave the result in the address register.
[0220]
[Table 29]

[0221]
The source register is inverted and arranged bit by bit in the destination register.
[0222]
[Table 30]

[0223]
Jump conditional to an absolute address. See Table 59 for condition code bit definitions.
[0224]
[Table 31]

[0225]
Jump unconditionally to an absolute address. The condition “always” is the same as jCC.
[0226]
[Table 32]

[0227]
The destination register is first incremented and then the current nipar (pointing to the next instruction) is stored at the address pointed to by the destination register (usually the stack pointer). Next, load nipar with the address in the source register before fetching the next instruction.
[0228]
[Table 33]

[0229]
Shift data register left by bit constant.
[0230]
[Table 34]

[0231]
Shift data register right by bit constant.
[0232]
[Table 35]

[0233]
Load data register from memory. Loads the value pointed to by the source address register into the destination data register.
[0234]
[Table 36]

[0235]
Increment after load into data register. Read the memory from the address pointed to by the source address register and place it in the destination data register. Next, the source register is incremented.
[0236]
[Table 37]

[0237]
A 16-bit adjacent value is loaded into the data register.
[0238]
[Table 38]

[0239]
Replace the destination register with the bitwise inversion of the source register and add the destination register.
[0240]
[Table 39]

[0241]
Put the value in the source data register into the destination data register.
[0242]
[Table 40]

[0243]
The result of multiplying the value in the source register by the value in the destination register is stored in two consecutive registers starting with the destination register (little endian order).
[0244]
[Table 41]

[0245]
Perform a bitwise OR of the two data registers and leave the result in the destination register.
[0246]
[Table 42]

[0247]
Shift data register one bit to the left. Replace LSB (least significant bit) with the value of the carry flag. The original MSB (most significant bit) is placed in the carry flag at the end of the instruction.
[0248]
[Table 43]

[0249]
See section 1.4 above. Use source register as stack pointer.
[0250]
[Table 44]

[0251]
Return from subroutine. Load nipar from the memory location pointed to by the destination register (usually the stack pointer). Next, the destination register is incremented.
[0252]
[Table 45]

[0253]
Shift the address register to the left by the number of bits specified by the value in the source register.
[0254]
[Table 46]

[0255]
Shift the address register to the right by the number of bits specified by the value in the source register.
[0256]
[Table 47]

[0257]
Store from the data register. Write the value in the source to the memory location pointed to by the destination register.
[0258]
[Table 48]

[0259]
Pre-decrement the store from the data register. Decrement the destination register and then write the value in the source register to the memory location pointed to by the destination register.
[0260]
[Table 49]

[0261]
Subtract the source register from the destination register and store the result in the destination register.
[0262]
[Table 50]

[0263]
Subtract the source register from the destination register, then subtract the carry bit and store the result in the destination register.
[0264]
[Table 51]

[0265]
Run the interrupt handler. See section 1.4. The destination register is used as a stack pointer.
[0266]
[Table 52]

[0267]
Unsigned division of a 32-bit signed integer by a 16-bit signed integer, and returns a 16-bit signed quotient and remainder. Store 32 bits in two consecutive registers starting from the destination register index (little endian order). The divisor is in the source register. The remainder is returned to the destination register, and the quotient is returned to the next register after the destination register. When the quotient exceeds 16 bits, overflow occurs.
[0268]
[Table 53]

[0269]
The result of multiplying the value in the source register by the value in the destination register is stored in two consecutive registers starting with the destination register (little endian order).
[0270]
[Table 54]

[0271]
Transfers the value in the source address register to four consecutive data registers starting with the destination register. This value is stored in little endian order, and the destination register address is calculated by modulo 16 so that the destination register can be any register.
[0272]
[Table 55]

[0273]
Transfer the little endian order 64-bit values in the four consecutive data registers into the destination address register. The source register address is calculated modulo 16 so that the destination register can be any register.
[0274]
[Table 56]

[0275]
Perform a bitwise exclusive OR of the two data registers and leave the result in the destination register.
[0276]
[Table 57]

[0277]
Exchange the value in the source data register with the pcw register.
[0278]
[Table 58]

[0279]
Exchange the value in the source address register with the ivec register.
[0280]
4.0 Condition code
In the condition code operation code part field, a value from the following table is used.
[0281]
[Table 59]

[0282]
<Reference Material B>
Instruction set 1, pipeline multiply / accumulate instruction set architecture (ISA)
ISA1-Pipeline convolution engine for XC4013
Introduction
Instruction Set Architecture (ISA) 1 is a pipelined multiply / accumulate array that can perform four simultaneous multiply / accumulates per instruction cycle. There is one, ie, eight 8-bit data registers (xd0-xd3 and yd0-yd3) for each input to four 8-bit × 8-bit multipliers. The four multiplier outputs are summed through the pipeline adder array until one final 16-bit sum is obtained, and up to four 16-bit registers can store the results (m0-m4). The architecture of Instruction Set Architecture (ISA) 1 assumes a flow-through batch processing cycle in main memory. There is no feedback path through the multiplier accumulator data path to recirculate the accumulation result. This is because the emphasis is on memory data flow. There is no provision for overflow scaling or extended finite accumulation. Instruction set architecture (ISA) 1 assumes that the coefficients used for convolution filtering give a result finiteness that does not exceed 16 bits for all data sets. The multiply array receives an 8-bit two's complement data input and produces a 16-bit two's complement result.
[0283]
Access to the memory is managed by two 16-bit address registers (a0 and a1), which can be thought of as compatible source and destination pointers. Program flow is managed by standard 64-bit NIPAR registers, which are supported for single interrupts (IVEC) such as frame or data-executable interrupts.
[0284]
The instruction set for instruction set architecture (ISA) 1 is very small, aligned to a 16-bit word size, and K for general purpose external loop processor instruction set architecture (ISA) 0_ISA= 4 Corresponds to memory organization. Can illustrate up to 7 arithmetic operations in a single clock cycle with Instruction Set Architecture (ISA) 1 and keeps the result at a rate of 1 per clock over a small window of clocking It has the ability to index source or destination addresses and moves register data from and to memory in parallel with computation.
Instruction Set Architecture (ISA) 1 instruction set
Data movement
ld (reg-vector)
Up to 14 registers are loaded sequentially from memory according to a 14-bit bitmap reg-vector right-justified in the instruction word.
[0285]
st (reg-vector)
Up to 14 registers are sequentially stored in memory according to a 14-bit bitmap reg-vector right-justified in the instruction word.
[0286]
ld (ivec-data)
The 64-bit address following this instruction is loaded into the IVEC register and NIPAR + = 5 pointing to the next instruction is executed.
[0287]
Program control
jmp (nipar-data)
The 64-bit address following this instruction is loaded into the NIPAR register, thereby pointing to the next instruction.
[0288]
Arithmetic
mac (m-reg)
The multiplication result register indicated by the 2-bit m-reg code is the product and sum (xd0^*yd0) + (xd1^*yd1) + (xd2^*yd2) + (xd3^*yd3) is received.
[0289]
macp (s-vec, d-vec)
The multiplication result register indicated by 2 bits of the 4-bit d-vec code is the product and sum (xd0^*yd0) + (xd1^*yd1) + (xd2^*yd2) + (xd3^*yd3) is received. Every other bit of the d-vec code selectively enables memory writes to this result register address (a1), and the remaining bits of the d-vec code indicate whether the address register a1 is incremented. select. The 8-bit s-vec is divided into four 2-bit groups. Whether data register xd0-xd3 is continuously read from memory at address (a0) and whether address register a0 is incremented Is specified. When reading or writing is specified, it is performed in parallel with multiplication. The software must perform pipeline alignment of instruction processing for each batch of data that is read from and stored in memory.
[0290]
Reconstruction
reconf (ISA-vector)
Instruction set architecture (ISA) 1 is de-contexted and the S machine is reconfigured for the instruction set architecture (ISA) selected by the instruction set architecture (ISA) vector bit field in the instruction.
[0291]
Table 60 shows a pipeline convolution engine for XC4013 as a block configuration of ISA1.
[0292]
[Table 60]

[0293]
【The invention's effect】
The present invention relates to systems and methods for scalable, parallel, dynamic reconfiguration computations. The system includes at least one S machine, a T machine corresponding to each S machine, a general purpose interconnection matrix (GPIM), a set of input / output T machines, and one or more input / output devices. And a master time base device. In the preferred embodiment, the system includes multiple S machines. Each S machine includes an input section and an output section respectively coupled to the output section and input section of the corresponding T machine. Each T machine includes a routing input and a routing output coupled to a general purpose interconnection matrix (GPIM), and each input / output T machine includes these as well. The input / output T machine further includes an input and an output coupled to the input / output device. Finally, each S machine, T machine, and input / output T machine includes a master timing input coupled to the timing output of the master time base device.
[0294]
The metaaddressing system of the present invention provides bit addressing capabilities to processors in the network without requiring the processor itself to perform processing intensive addressing functions. Disclosed are individual processing machines and addressing machines that are optimized to perform each assigned function. The processing machine executes instructions, stores data in local memory, and retrieves data from local memory to determine when remote operations are required. The addressing machine assembles a packet of data for transmission, determines the geographic or network address of this packet, and checks the address against the incoming packet. In addition, the addressing machine can perform interrupt processing and other addressing operations.
[0295]
In one embodiment, the T machine also provides the metaaddressing mechanism of the present invention. The meta address specifies the geographic location of the T machine in the system and specifies the location of the data in the local memory device. The metaaddress local address is used to address each bit in the new device's memory, regardless of the device's actual memory size, as long as the device's addressable space is less than or equal to the number of bits in the local address. Used. Thus, a single metaaddress can be used to address devices having different memory sizes and structures. Furthermore, because of the use of meta-addresses, hardware in a multiprocessor parallel architecture need not guarantee the coherency and consistency of the entire system.
[0296]
Metaaddresses provide full extensibility. When a new S machine or new I / O device is added, a new geographic address is specified for this new device. In the present invention, the extensibility may be irregular, and there is no requirement that the square of the number of processors must be expanded. Scalability is further enhanced by the ability to couple any number of addressing machines up to the available local memory bandwidth to each processing machine. Thereby, the system designer can arbitrarily designate the number of paths to each processing machine. This flexibility can provide higher bandwidth to the higher levels of the system and build a pyramid processing architecture that is optimized to give the most bandwidth to the most important functions of the system. it can.
[0297]
As explained above, according to the preferred embodiment, the T machine is an addressing machine that generates meta addresses, handles interrupts, and queues messages. Thus, the S machine can concentrate its processing power only on the execution of program instructions, and can greatly optimize the overall efficiency of the multiprocessor parallel architecture of the present invention. The S machine only needs to access the local memory component of the meta address to locate the desired data, and the geographic address is transparent to the S machine. This addressing architecture interacts very well with distributed memory / distributed processor parallel computing systems. By choosing an architectural design that isolates local memory, the hardware can operate independently and in parallel. In accordance with the present invention, each S machine can have completely different reconfiguration instructions at runtime, even if all are directed to a single computational problem in parallel. Also, not only may the instruction set architecture (ISA) realized by the dynamically reconfigurable S machine be different, but the actual hardware used to implement the S machine will perform certain tasks. It may be optimized. Thus, all S machines in a single system may operate at different rates, and each S machine can optimally perform its functions while maximizing utilization of system resources.
[0298]
In addition, a unique memory confirmation confirms that the correct geographic address is being transmitted and does not provide confirmation of the local memory address. Furthermore, this confirmation is performed by the addressing machine, not the processing machine. Since virtual addressing is not used, no hardware / software interaction is required to translate virtual addresses to logical addresses. The address of the meta address is a physical address. By eliminating all such preventive and maintenance functions, the processing speed of the entire system is greatly improved. Thus, in combination with the meta-addressing scheme, highly parallel computing by separating the “space” management of a computer system from the “time” management of a computer system provided by a separate processing machine to another addressing machine A unique memory management and addressing system for the system is provided. With the architecture of the present invention, the operation of the S machine has excellent flexibility, and each S machine can operate at the optimum rate while keeping the rate of the T machine constant. Data communication of the entire system can reach the farthest space and local instruction processing can be balanced in a very short time, thereby improving the approach to solving complex problems with highly parallel computer systems.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a preferred configuration example of a system for scalable, parallel, dynamic reconfiguration computation constructed in accordance with the present invention.
FIG. 2 is a block diagram showing a preferred configuration example of an S machine of the present invention.
FIG. 3 is a schematic diagram of an exemplary program list including reconfiguration instructions.
FIG. 4 is a flowchart of a prior art compilation operation performed during compilation of a series of program instructions.
FIG. 5 is a flowchart of a preferred compilation operation performed by a compiler for dynamic reconfiguration computation.
FIG. 6 is a flowchart of a preferred compilation operation performed by a compiler for dynamic reconfiguration computation.
FIG. 7 is a block diagram showing a preferred configuration example of a dynamic reconfiguration processing apparatus (DRPU) of the present invention.
FIG. 8 is a block diagram showing a preferred configuration example of an instruction fetch unit (IFU) of the present invention.
FIG. 9 is a schematic diagram illustrating a preferred set of states supported by the instruction state sequencer (ISS) of the present invention.
FIG. 10 is a schematic diagram illustrating a preferred set of states supported by the interrupt logic of the present invention.
FIG. 11 is a block diagram showing a preferred configuration example of a data operation unit (DOU) according to the present invention.
FIG. 12 is a block diagram of a first exemplary embodiment of a data processing unit (DOU) configured to implement a general purpose external loop instruction set architecture (ISA).
FIG. 13 is a block diagram of a second exemplary embodiment of a data processing unit (DOU) configured to implement an internal loop instruction set architecture (ISA).
FIG. 14 is a block diagram showing a preferred configuration example of an address arithmetic unit (AOU) according to the present invention.
FIG. 15 is a block diagram of a first exemplary embodiment of an address arithmetic unit (AOU) configured to implement a general purpose external loop instruction set architecture (ISA).
FIG. 16 is a block diagram of a second exemplary embodiment of an address arithmetic unit (AOU) configured to implement an inner loop instruction set architecture (ISA).
FIG. 17 (a) shows a reconfiguration hardware between an instruction fetch unit (IFU), a data arithmetic unit (DOU), and an address arithmetic unit (AOU) for an outer loop instruction set architecture (ISA). Schematic diagram showing exemplary allocation of wear resources, (b) shows an instruction fetch unit (IFU), a data arithmetic unit (DOU), an address arithmetic unit (AOU) for an inner loop instruction set architecture (ISA), FIG. 3 is a schematic diagram illustrating an exemplary allocation of reconfigurable hardware resources between the two.
FIG. 18 is a block diagram showing a preferred configuration example of a T machine of the present invention.
FIG. 19 is a block diagram showing a configuration example of a mutual coupling input / output device of the present invention.
FIG. 20 is a block diagram showing a preferred configuration example of an input / output T machine of the present invention.
FIG. 21 is a block diagram showing a preferred configuration example of a general purpose interconnection matrix (GPIM) of the present invention.
FIG. 22 is a flowchart of a preferred method for scalable, parallel, dynamic reconfiguration computation according to the present invention.
FIG. 23 is a schematic diagram showing a preferred configuration example of a data packet according to the present invention.
FIG. 24 is a flowchart of a preferred method for generating a data request according to the present invention.
FIG. 25 is a flowchart of a preferred method for sending data according to the present invention.
FIG. 26 is a flowchart of a preferred method for receiving data in accordance with the present invention.
FIG. 27 is a block diagram showing a preferred configuration example of a mutual coupling input / output device that executes an interrupt processing operation according to the present invention.
FIG. 28 is a flowchart of a preferred method for handling interrupts in accordance with the present invention.
[Explanation of symbols]
12 Dynamic program processing machine (S machine)
14 Addressing machine (T machine)
16 Interconnection device (general purpose interconnection matrix)
34 Memory device, local memory (memory)
101 Architecture description memory
106,2200 interrupt logic
320 address decoder
1800 data packets
1816 Geographic address
2208 Meta address
2208 Recognition device
2204 Comparator

Claims

A metaaddressing architecture that specifies a local memory destination for data packets of a dynamic reprogrammable processing machine,
A first memory device;
When the first memory device is coupled and a predetermined command is received, it is determined whether the received command is a command requesting a remote operation including remote operation information, and the command requests a remote operation including remote operation information If the instruction is determined to be an instruction to be executed, the remote calculation information included in the instruction is stored in the first memory device, and the memory address in the first memory device in which the remote calculation information is stored is included. A dynamic reprogrammable processing machine that generates conditional instructions;
The unconditional instruction is received from the first dynamic reprogrammable processing machine, the remote calculation information is retrieved from the first memory device based on a memory address included in the unconditional instruction, and the remote calculation information is stored in the remote calculation information. The remote calculation information includes a meta-address including a destination geographic address indicating a destination address of a data packet generated based on the destination and a destination local memory address indicating a second memory device address at which the data packet is written at the destination. and addressing machine to produce on the basis of the generated, the remote operation information the data packets containing the metadata address based on,
The first addressing machine and the second addressing machine indicated by the destination geographic address as a destination of the data packet are coupled to the data packet generated by the first addressing machine. receiving from the first addressing machines, wherein in accordance with the intended geographic address included in the meta address in the data packet, the mutual data between said first addressing machine and the second addressing machine A metaaddressing architecture for dynamic reconfiguration computation comprising an interconnecting device for routing.

The dynamic reprogrammable processing machine is:
The metaaddressing architecture for dynamic reconfiguration computation according to claim 1, wherein if the received instruction is not an instruction requesting a remote operation, processing is executed according to the received instruction.

The second addressing machine is
Upon receipt of the data packet generated by the first addressing machine, the metaaddress contained in the data packet is decoded into the destination geographic address and the destination local memory address, and the address of the second addressing machine An address decoder that compares a geographical address indicating
The address comparison by the decoder results, when said geographical address and the target geographic address matches, comprising a that control device to send transferred to the data packet a second of the dynamic re-programmable processing machine Meta-addressing architecture for dynamic reconfiguration computation according to claim 1 or claim 2 .

Coupled to said second dynamic re-programmable processing machine, according to claim further comprising a luer architecture description memory device to store the geographic address indicating an address of the second dynamic re-programmable processing machine to be coupled 3 Meta-addressing architecture for the described dynamic reconfiguration computation.

The second addressing machine further includes an interrupt handler coupled to the input / output device;
The interrupt handler is
An identification device for identifying an interrupt request;
A comparator for comparing the identified interrupt request with a stored list of interrupt requests to verify the validity of the interrupt request;
Interrupt logic for processing interrupt requests validated according to stored interrupt processing instructions;
The metaaddressing architecture for dynamic reconfiguration computation of claim 3 comprising:

The metaaddressing architecture for dynamic reconfiguration computation according to claim 1, wherein the metaaddress is 80 bits wide, the destination geographic address is 16 bits wide and the local address is 64 bits wide.

A metaaddressing method for specifying a local memory destination for data packets of a dynamic reprogrammable processing machine, comprising:
When a predetermined command is received, it is determined whether the received command is a command requesting a remote calculation including remote calculation information, and when it is determined that the command is a command requesting a remote calculation including remote calculation information, Remote operation information included in the instruction is stored in a first memory device, and an unconditional instruction including a memory address in the first memory device in which the remote operation information is stored is generated by a dynamic reprogrammable processing machine An unconditional instruction generation step,
The unconditional instruction generated in the unconditional instruction generation step is received from the dynamic reprogrammable processing machine, and the remote calculation information is retrieved from the first memory device based on a memory address included in the unconditional instruction. And a meta data including a target geographic address indicating a destination address of a data packet generated based on the remote calculation information, and a target local memory address indicating an address of a second memory device to which the data is written at the destination. A data packet generation step of generating an address based on the remote calculation information and generating the data packet including the meta address by an addressing machine based on the remote calculation information;
Receiving the data packet generated in the data packet generation step from the first addressing machine, and according to the destination geographic address included in the metaaddress in the data packet, the first addressing machine; Routing data to and from the second addressing machine indicated by the destination geographic address as a destination for the data;
A metaaddressing method for dynamic reconfiguration computation comprising:

In the unconditional instruction generation step, the dynamic reprogrammable processing machine includes:
8. The metaaddressing method for dynamic reconfiguration calculation according to claim 7, wherein when the received instruction is not an instruction for requesting remote operation, processing is executed according to the received instruction.

When the second addressing machine receives the data packet generated by the first addressing machine, the metaaddress included in the data packet is decoded into the destination geographic address and the destination local memory address; An address decoding step of comparing a geographic address indicating the address of the second addressing machine with the decoded destination geographic address;
A transmission step of transmitting the data packet to a second dynamic reprogrammable processing machine if the geographic address and the destination geographic address match as a result of the comparison in the address comparison step;
9. The metaaddressing method for dynamic reconfiguration calculation according to claim 7 or 8, further comprising:

The dynamic reconfiguration of claim 9 wherein a geographic address indicating an address for the second dynamic reprogrammable processing machine is stored in an architecture description memory device coupled to the second dynamic reprogrammable processing machine. Metaaddressing method for calculation.

An identification step, wherein the second addressing machine includes an interrupt handler coupled to an input / output device for identifying an interrupt request;
A comparing step comparing the identified interrupt request with a stored list of interrupt requests to verify the validity of the interrupt request identified in the identifying step;
A step for processing an interrupt request whose validity has been confirmed based on a comparison result of each step in accordance with a stored interrupt processing instruction;
The metaaddressing method for dynamic reconfiguration calculation according to claim 9 comprising:

The metaaddressing method for dynamic reconfiguration calculation according to claim 7, wherein the metaaddress is 80 bits wide, the destination geographic address is 16 bits wide, and the local address is 64 bits wide.