JP2001175619A - Single-chip multiprocessor

Info

Publication number: JP2001175619A
Application number: JP36370299A
Authority: JP
Inventors: Hironori Kasahara; 博徳笠原; Keiji Kimura; 啓二木村
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 1999-12-22
Filing date: 1999-12-22
Publication date: 2001-06-29
Anticipated expiration: 2019-12-22
Also published as: JP4784792B2

Abstract

PROBLEM TO BE SOLVED: To provide a multiprocessor for parallel processing with an improved price/performance ratio, so that performance scales as the degree of semiconductor integration increases. SOLUTION: The single-chip multiprocessor includes a plurality of processing elements 16, each including a CPU 20, a network interface 32 connected to the CPU, an adjustable prefetch instruction cache 24 directly connected to the CPU and the network interface, and a data transfer controller 30 directly connected to the CPU; and a centralized shared memory 28 that is connected to each processing element and shared by the processing elements.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to the architecture of a single-chip processor in which a plurality of CPUs are housed on a single chip, and more specifically to a multigrain, compiler-cooperative single-chip multiprocessor architecture and to a high-performance multiprocessor system architecture in which such processors are interconnected.

[0002]

[Description of the Related Art] At present, Japanese supercomputer manufacturers possess some of the world's leading hardware technology. Peak performance currently exceeds several TFLOPS, and machines with a peak performance of several tens of TFLOPS or more are expected to be developed in the early 21st century. In current supercomputers, however, the gap between peak performance and the effective performance obtained when programs are actually executed widens as peak performance rises; that is, the price/performance ratio is not necessarily good. In terms of usability, users must extract the parallelism in a problem themselves and write programs that use the hardware effectively with extended languages or libraries such as HPF, MPI, and PVM, so the machines are difficult or impossible for ordinary users to use well. Partly for these reasons, the worldwide market for high-performance computers cannot be expanded, which is a major problem.

[0003] To solve these problems of price/performance ratio and ease of use and to expand the supercomputer market, it is important to develop an automatic parallelizing compiler that automatically parallelizes programs written in sequential languages familiar to users, such as Fortran and C.

[0004] In particular, it is important to study single-chip multiprocessors, which are expected to become one of the principal architectures in the early 21st century for general-purpose and embedded microprocessors and for multiprocessor systems ranging from home servers to supercomputers. Even for single-chip multiprocessors, however, a conventional shared-main-memory architecture cannot deliver sufficient performance or a good price/performance ratio. It is therefore important to develop a new automatic parallelizing compilation technology that extracts more parallelism from the instruction sequence that truly has to be executed, as in multigrain parallel processing, which can fully exploit the instruction-level parallelism, loop parallelism, and coarse-grain parallelism in a program. Such a technology improves the price/performance ratio of the system and makes it possible to build user-friendly systems that anyone can use. It is equally important to develop an architecture that can take advantage of this technology.

[0005]

[Problem to Be Solved by the Invention] Accordingly, an object of the present invention is to provide a compiler-cooperative single-chip multiprocessor that supports multigrain parallelization, and a high-performance multiprocessor system in which such processors are combined.

[0006]

[Means for Solving the Problem] The present invention provides a single-chip multiprocessor comprising: a plurality of processing elements, each including a CPU, a network interface, an adjustable prefetch instruction cache directly connected to the CPU and the network interface, and a data transfer controller directly connected to the CPU; and a centralized shared memory connected to each processing element and shared by the processing elements.

[0007] The present invention also provides a multiprocessor system in which a plurality of such single-chip multiprocessors are connected, according to the required memory capacity or data transfer performance, to a plurality of centralized shared memory chips consisting solely of centralized shared memory, and further to a plurality of input/output chips that perform input/output control.

[0008]

[Embodiments of the Invention] The present invention provides a single-chip multiprocessor that supports multigrain parallelization. FIG. 1 shows the architecture of a single-chip multiprocessor according to one embodiment of the invention. In FIG. 1, the following are connected by an inter-chip connection network 12: a plurality (m+1) of single-chip multiprocessors (SCM₀, SCM₁, SCM₂, ..., SCM_m, ...) 10, each including a plurality of processing elements (PE₀, PE₁, ..., PE_n); a plurality (j+1) of centralized shared memory chips (CSM₀, ..., CSM_j) consisting solely of shared memory (depending on the required system conditions, there may be no CSM chip at all); and a plurality (k+1) of input/output chips (I/O SCM₀, ..., I/O SCM_k), each constituted by a single-chip multiprocessor that performs input/output control (existing processors may also be used for input/output control). The inter-chip connection network 12 can be implemented with existing network technology such as a crossbar, a bus, or a multistage network.

[0009] In the configuration shown in FIG. 1, the I/O devices are connected to input/output control chips constituted by k+1 SCMs, according to the required input/output functions. Also connected to the inter-chip connection network 12 are j+1 centralized shared memory (CSM) chips 14, which consist solely of memory shared by all processing elements in the system. These chips supplement the centralized shared memory inside each SCM 10.

[0010] Multigrain parallel processing is a parallel processing scheme that hierarchically exploits coarse-grain parallelism among subroutines, loops, and basic blocks; medium-grain parallelism among loop iterations (loop parallelism); and (near-)fine-grain parallelism among statements or instructions. Conventional parallelization is local and uses a single granularity, as in the loop parallelization performed by automatic parallelizing compilers for commercial multiprocessor systems or the instruction-level parallelization of superscalar and VLIW processors. In contrast, this scheme enables flexible parallel processing that is global across the whole program and uses multiple granularities.

[0011] [Coarse-Grain Task Parallel Processing (Macro-Dataflow Processing)] Coarse-grain parallel processing, which exploits parallelism among subroutines, loops, and basic blocks within a single program, is also called macro-dataflow processing. A source program, for example a Fortran program, is decomposed into three kinds of coarse-grain tasks (macrotasks, MT): repetition blocks (RB), subroutine blocks (SB), and blocks of pseudo assignment statements (BPA). An RB is the outermost natural loop at each hierarchical level, an SB is a subroutine, and a BPA is a basic block that has been fused or split in consideration of scheduling overhead or parallelism. A BPA is basically an ordinary basic block, but a single basic block may be split into several blocks to extract parallelism. Conversely, when the processing time of one BPA is so short that the overhead of dynamic scheduling cannot be ignored, several BPAs are fused into one. When an outermost-loop RB is a Doall loop, it is split into several partial Doall loops by dividing the loop index range, and each resulting partial Doall loop is newly defined as an RB. Subroutines are inlined wherever possible, but a subroutine that cannot be inlined effectively in view of code length is defined as an SB as it is. Furthermore, for SBs and for RBs that cannot be processed as Doall loops, hierarchical macro-dataflow processing is applied to the parallelism inside them.
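As an illustration only, the division of a Doall RB into partial Doall loops might be sketched in Python as follows. The function name `split_doall` and the choice of an even, contiguous partition are assumptions made for this sketch; the specification states only that the loop index range is divided.

```python
def split_doall(lower, upper, parts):
    """Split the inclusive index range [lower, upper] of a Doall loop
    into at most `parts` contiguous sub-ranges (partial Doall loops)."""
    total = upper - lower + 1
    if total <= 0 or parts <= 0:
        return []
    parts = min(parts, total)            # never create an empty partial loop
    base, extra = divmod(total, parts)
    ranges = []
    start = lower
    for p in range(parts):
        size = base + (1 if p < extra else 0)   # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Each returned pair would correspond to one new RB. Because the iterations of a Doall loop are independent, the partial loops can be scheduled as separate macrotasks.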

[0012] Next, the control flow and data dependences among macrotasks are analyzed, and a macro flow graph (MFG) such as that of FIG. 2 is generated. In the MFG, each node represents a macrotask (MT), dotted edges represent control flow, solid edges represent data dependences, and a small circle inside a node represents a conditional branch statement. The loop (RB) of MT7 shows that MTs and an MFG can be defined hierarchically inside it.

[0013] Next, from the control dependences and data dependences among macrotasks, the condition under which each macrotask can start execution earliest (the earliest executable condition), that is, the parallelism among macrotasks, is detected. The macrotask graph (MTG) shown in FIG. 3 represents this parallelism as a graph. In the MTG as well, nodes represent MTs, solid edges represent data dependences, and small circles inside nodes represent conditional branch statements. Dotted edges, however, represent extended control dependences, edges with arrowheads represent branch targets in the original MFG, solid arcs represent AND relationships, and dotted arcs represent OR relationships. For example, the edges entering MT6 indicate that MT6 becomes executable at the earliest either when the conditional branch in MT2 branches toward MT4 or when the execution of MT3 finishes.
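The earliest executable condition for MT6 described above can be written as a small Boolean expression and evaluated against the run-time state. The following Python sketch is illustrative only; the tuple encoding, the function name `is_ready`, and the constant `MT6_COND` are assumptions and are not part of the specification.

```python
def is_ready(cond, finished, branches):
    """Evaluate an earliest executable condition.

    cond is a nested tuple:
      ("done", mt)            macrotask mt has finished
      ("branch", mt, target)  the conditional branch in mt went toward target
      ("and", c1, c2, ...) or ("or", c1, c2, ...)
    finished: set of names of finished macrotasks
    branches: dict mapping a macrotask to the branch target it took
    """
    kind = cond[0]
    if kind == "done":
        return cond[1] in finished
    if kind == "branch":
        return branches.get(cond[1]) == cond[2]
    if kind == "and":
        return all(is_ready(c, finished, branches) for c in cond[1:])
    if kind == "or":
        return any(is_ready(c, finished, branches) for c in cond[1:])
    raise ValueError(f"unknown condition kind: {kind}")

# MT6: the branch in MT2 goes toward MT4, OR MT3 has finished.
MT6_COND = ("or", ("branch", "MT2", "MT4"), ("done", "MT3"))
```

A dynamic scheduler could re-evaluate such a condition whenever a macrotask finishes or a branch direction is decided, and move MT6 to the ready queue as soon as the condition becomes true.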

[0014] The compiler then either assigns the MTs on the MTG to processor clusters (groups of processors realized in software by the compiler or the user) at compile time (static scheduling), or uses the dynamic CP algorithm to generate dynamic scheduling code that performs the assignment at run time, and embeds this code in the program. This avoids the overhead of several thousand to several tens of thousands of clock cycles that can arise when, as on conventional multiprocessors, the generation and scheduling of coarse-grain tasks are requested from the OS or a library. With dynamic scheduling, it is not known until run time which processor will execute a task, so data shared among tasks is allocated to the centralized shared memory, which appears equidistant from all processors.
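The specification does not describe the dynamic CP algorithm in detail. Assuming, as the name suggests, that ready macrotasks are prioritized by the length of the longest (critical) path from the task to the exit of the graph, the priority computation might be sketched in Python as follows. The function names and the dictionary-based graph encoding are assumptions made for this sketch.

```python
from functools import lru_cache

def critical_path_lengths(cost, succ):
    """For each task, return the longest path length from that task to an
    exit node, including the task's own cost.
    cost: {task: processing time}, succ: {task: [successor tasks]}."""
    @lru_cache(maxsize=None)
    def cp(task):
        return cost[task] + max((cp(s) for s in succ.get(task, ())), default=0)
    return {t: cp(t) for t in cost}

def pick_next(ready, cp_len):
    """Among the ready tasks, choose the one with the longest critical path.
    Ties are broken in favor of the smallest task name."""
    return max(sorted(ready), key=cp_len.get)
```

Because the path lengths depend only on the graph, they can be computed at compile time; the scheduling code embedded in the program then only needs to compare precomputed priorities at run time.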

[0015] When static schedules and dynamic scheduling code are generated, a data localization technique is also used to make effective use of the local memory or distributed shared memory on each processor and to minimize the amount of data transferred between processors.

[0016] Data localization first analyzes the data dependences between iterations across several different loops that are data dependent on the MTG (inter-loop data dependence analysis), and partitions the loops and data so that data transfer is minimized (loop-aligned decomposition). Then, so that those loops and data are scheduled on the same processor, it either applies a task fusion method that fuses the loops at compile time, or generates dynamic scheduling code using a partial static scheduling algorithm in which the compiler specifies that they be assigned to the same processor at run time. This data localization function allows each local memory to be used effectively.

[0017] In addition, a preload/poststore scheduling algorithm is used, which hides the overhead of the interprocessor data transfers that data localization could not eliminate by overlapping data transfer with macrotask processing. Based on the result of this scheduling, data transfers are carried out using the data transfer controller on each processor.

[0018] [Loop Parallel Processing (Medium-Grain Parallel Processing)] In multigrain parallelization, a loop (RB) assigned to a processor cluster (PC) by macro-dataflow processing is, if it is a Doall or Doacross loop, parallelized (partitioned) at the iteration level across the processing elements (PEs) in the PC.

[0019] For loop restructuring, conventional techniques such as the following can be used as they are:
(a) changing the execution order of statements
(b) loop distribution
(c) node splitting and scalar expansion
(d) loop interchange
(e) loop unrolling
(f) strip mining
(g) array privatization
(h) unimodular transformations (loop reversal, permutation, skewing)
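As one example of the conventional techniques listed above, strip mining rewrites a single loop as an outer loop over fixed-size strips and an inner loop within each strip. The Python sketch below is illustrative only; the function names `strip_mine` and `scaled` and the strip size are assumptions made for this sketch.

```python
def strip_mine(n, strip):
    """Yield half-open (start, stop) strips that together cover range(n)."""
    for start in range(0, n, strip):
        yield start, min(start + strip, n)

def scaled(values, factor, strip=4):
    """Multiply every element by `factor` using a strip-mined loop nest."""
    out = [0] * len(values)
    for start, stop in strip_mine(len(values), strip):  # outer loop over strips
        for i in range(start, stop):                    # inner loop within a strip
            out[i] = values[i] * factor
    return out
```

The result is the same as that of the original single loop; only the iteration structure changes, which gives the compiler a unit (the strip) that it can assign to a processing element or match to a memory size.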

[0020] For loops to which loop parallelization cannot be applied, either the loop body is processed by the (near-)fine-grain parallel processing described next, as in FIG. 4, or the body is hierarchically divided into macrotasks and macro-dataflow processing (coarse-grain task parallel processing) is applied.

[0021] [(Near-)Fine-Grain Parallel Processing] When the MT assigned to a PC is a BPA, or an RB or the like to which neither loop parallelization nor hierarchical macro-dataflow processing can be applied, the statements or instructions inside the BPA are processed in parallel by the processors in the PC as near-fine-grain tasks.

[0022] In near-fine-grain parallel processing on a multiprocessor system or a single-chip multiprocessor, efficient parallel processing cannot be achieved unless tasks are scheduled onto processors so as to minimize interprocessor data transfer as well as to balance the load among processors. Moreover, as the task graph of FIG. 4 shows, the tasks are subject to execution-order constraints arising from data dependences, so the scheduling required for near-fine-grain parallel processing is a very difficult, strongly NP-complete problem. The graph is a directed acyclic graph. In the figure, each task corresponds to a node. The number inside a node is the task number i, and the number beside the node is the task processing time t_i on a processing element. An edge drawn from node N_i to node N_j represents the partial-order constraint that task T_i precedes task T_j. When the data transfer time between tasks is also considered, each edge generally carries a variable weight. When tasks T_i and T_j are assigned to different processing elements, this weight t_ij is the data transfer time. FIG. 4 assumes that data transfer and synchronization take 9 clocks. Conversely, when these tasks are assigned to the same processing element, the weight t_ij is 0.
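The variable edge weight t_ij can be illustrated with a short Python sketch. The 9-clock transfer cost is the value assumed for FIG. 4; the function names and the dictionary-based encoding of the task graph and of the task-to-PE assignment are assumptions made for this sketch.

```python
TRANSFER_CLOCKS = 9  # data transfer plus synchronization, as assumed for FIG. 4

def edge_weight(pe_of, ti, tj):
    """t_ij: 0 if both tasks run on the same PE, else the transfer cost."""
    return 0 if pe_of[ti] == pe_of[tj] else TRANSFER_CLOCKS

def data_ready_time(task, preds, finish, pe_of):
    """Earliest time at which all inputs of `task` are available on its PE.
    preds: {task: [predecessor tasks]}, finish: {task: finish time}."""
    return max(
        (finish[p] + edge_weight(pe_of, p, task) for p in preds.get(task, ())),
        default=0,
    )
```

For instance, if task 3 depends on tasks 1 and 2, which finish at clocks 5 and 8 on another PE, task 3 cannot start before clock 17; placed on the same PE as its predecessors, it could start at clock 8.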

[0023] The task graph generated in this way is statically scheduled onto the processors. As the scheduling algorithm, heuristic algorithms that minimize execution time while taking data transfer overhead into account can be used; for example, the four methods CP/DT/MISF, CP/ETF/MISF, ETF/CP, and DT/CP can be applied automatically and the best schedule selected. Assigning tasks to processors statically in this way also enables various optimizations, such as optimizing the placement of the data used in the BPA in local memory, distributed shared memory, and registers, and minimizing data transfer and synchronization overhead.
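The idea of applying several heuristics and keeping the best schedule might be sketched in Python as follows. This is not an implementation of CP/DT/MISF, CP/ETF/MISF, ETF/CP, or DT/CP; it is a generic greedy list scheduler with a pluggable priority function, and every name in it is an assumption made for this sketch. The task graph is assumed to be acyclic.

```python
def list_schedule(cost, preds, priority, num_pes, transfer=9):
    """Greedy list scheduling of a DAG.
    Returns (makespan, {task: (pe, start, finish)})."""
    placed = {}
    pe_free = [0] * num_pes
    remaining = set(cost)
    while remaining:
        ready = [t for t in remaining
                 if all(p in placed for p in preds.get(t, ()))]
        task = max(sorted(ready), key=priority)   # highest-priority ready task
        best = None
        for pe in range(num_pes):
            # Inputs produced on another PE arrive `transfer` clocks later.
            arrive = max((placed[p][2] + (0 if placed[p][0] == pe else transfer)
                          for p in preds.get(task, ())), default=0)
            start = max(arrive, pe_free[pe])
            if best is None or start < best[1]:
                best = (pe, start)                # keep the earliest start
        pe, start = best
        placed[task] = (pe, start, start + cost[task])
        pe_free[pe] = start + cost[task]
        remaining.remove(task)
    makespan = max((f for _, _, f in placed.values()), default=0)
    return makespan, placed

def best_schedule(cost, preds, priorities, num_pes):
    """Run the scheduler once per priority function; keep the shortest result."""
    results = {name: list_schedule(cost, preds, fn, num_pes)
               for name, fn in priorities.items()}
    name = min(sorted(results), key=lambda n: results[n][0])
    return name, results[name]
```

With a 9-clock transfer cost and short tasks, such a scheduler often keeps dependent tasks on one PE, which reflects the point made in the text that data transfer overhead must be weighed against load balance.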

[0024] After scheduling, the compiler generates machine code for each processor by arranging the instruction sequences of the tasks assigned to each processing element in order and inserting data transfer instructions and synchronization instructions where they are needed. The version number method is used for synchronization between near-fine-grain tasks, and a synchronization flag is received by busy-waiting on the receiving processing element. Data transfer and the setting of the synchronization flag can be performed with low overhead, because the sending processor writes directly into the distributed shared memory on the receiving processor.
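A software analogy of this flag-based synchronization might look like the following Python sketch, in which threads stand in for processing elements. It assumes that the synchronization flag holds a monotonically increasing version number, so that the flag never has to be reset between transfers; the class `DSMSlot` and the functions `send`, `receive`, and `demo` are hypothetical names introduced for this sketch.

```python
import threading

class DSMSlot:
    """One data word in a receiver's distributed shared memory plus its flag."""
    def __init__(self):
        self.data = None
        self.version = 0      # synchronization flag; 0 means nothing written yet

def send(slot, value, version):
    """Sender writes the data first, then publishes it by setting the flag."""
    slot.data = value
    slot.version = version

def receive(slot, expected_version):
    """Receiver busy-waits until the flag reaches the version it expects."""
    while slot.version < expected_version:
        pass                  # spin on the receiver's own memory
    return slot.data

def demo():
    """Transfer one value from a sender thread to the calling thread."""
    slot = DSMSlot()
    sender = threading.Thread(target=send, args=(slot, 42, 1))
    sender.start()
    value = receive(slot, expected_version=1)
    sender.join()
    return value
```

The receiver polls only its own slot, which mirrors the arrangement in the text where the flag resides in the receiver's distributed shared memory and the sender writes to it directly.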

[0025] When generating machine code, the compiler can optimize the code using information from static scheduling. For example, when different tasks that use the same data are assigned to the same processing element, the data can be passed through a register. To minimize synchronization overhead, redundant synchronization can also be removed on the basis of the task assignment and execution order. In particular, on a single-chip multiprocessor, exact scheduling of code execution at code generation time lets the compiler control all instruction execution, including the timing of run-time data transfers. This enables ultimate optimizations such as synchronization-free parallelization, in which all synchronization code is removed and parallel execution is still possible.
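The simplest case of redundant synchronization removal can be sketched in Python: a dependence between two tasks on the same processing element is already enforced by the order of the instruction stream, so only dependences that cross processing elements need a synchronization flag. The function name and data encoding are assumptions made for this sketch, and the sketch does not attempt the further removal of synchronizations implied by other synchronizations.

```python
def needed_syncs(deps, pe_of):
    """Keep only the dependences that cross processing elements.
    deps: list of (producer, consumer) pairs; pe_of: {task: PE number}."""
    return [(p, c) for p, c in deps if pe_of[p] != pe_of[c]]
```

For dependences (1, 2), (1, 3), and (2, 4) with tasks 1 and 2 on PE 0 and tasks 3 and 4 on PE 1, only (1, 3) and (2, 4) would require synchronization code.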

[0026] To realize the multigrain parallel processing described above on a multiprocessor system, the single-chip multiprocessor (SCM) 10 has, as one example, the architecture shown in FIG. 1.

[0027] In the architecture shown in FIG. 1, in addition to the CPU 20, a distributed shared memory (DSM) 22 and an adjustable prefetch instruction cache 24 are provided in each SCM 10. The CPU 20 is not particularly limited, provided that it can perform integer and floating-point operations. For example, a simple single-issue RISC CPU with a load/store architecture can be used, as can a superscalar processor, a VLIW processor, and the like. The distributed shared memory 22 is a dual-port memory that other processing elements can also read and write directly, and it is used for the data transfers between near-fine-grain tasks described above.

[0028] The adjustable prefetch instruction cache 24 prefetches instructions to be executed in the future from memory or a lower-level cache, as directed by the compiler or the user. In a multi-way set-associative cache, it allows lines (instruction sequences) that will be executed in the future to be fetched into ways that are specified by software such as the compiler or fixed in advance by hardware. A fetch may also be specified as a continuous transfer of multiple lines. The adjustable prefetch instruction cache 24 is thus a cache system that the compiler can tune and control so as to minimize instruction cache misses and speed up instruction execution.

[0029] That is, unlike a local program memory, which assumes that every program (instruction sequence) is smaller than the memory size, the adjustable prefetch instruction cache 24 can also handle large programs. Depending on the characteristics of the program, it can be used as an ordinary cache with no prefetching or, conversely, as a prefetch cache entirely under compiler control, that is, as a cache with no misses (a no-miss-hit cache).

[0030] FIG. 5 shows an example of the structure of such an adjustable prefetch instruction cache. In the n-way set-associative cache shown in FIG. 5, j ways, specified by the compiler or the user according to the program, can be used as a prefetch (read-ahead) area. Prefetch instructions inserted by the compiler, which can prefetch multiple lines rather than one line at a time, make it possible for the required instructions to be present in the instruction cache before they are executed, which speeds up execution. The processing element can read from all n ways just as with an ordinary cache. Line replacement uses the ordinary LRU (least recently used) method. Ordinarily, any transferred line can be stored freely in the ways of a set, but the ways designated for prefetching store only lines transferred from the CSM by prefetch instructions. Lines are allocated to the remaining ways as in an ordinary cache. The prefetch cache controller prefetches instructions from the CSM as directed by the compiler, in units of one line to multiple lines. The compiler designates a prefetch area of j ways, and the remaining (n - j) ways are used as an ordinary cache.
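The behavior of one set of such a cache can be modeled with a short Python sketch. The class name `AdjustableSet`, its methods, and the use of LRU replacement separately within the prefetch ways and within the ordinary ways are assumptions made for this sketch; the specification states only that replacement is by LRU and that the designated ways hold prefetched lines exclusively.

```python
from collections import OrderedDict

class AdjustableSet:
    """One set of an n-way cache with j ways reserved for prefetched lines."""
    def __init__(self, n_ways, j_prefetch):
        if not 0 <= j_prefetch <= n_ways:
            raise ValueError("j must be between 0 and n")
        self.normal_cap = n_ways - j_prefetch
        self.prefetch_cap = j_prefetch
        self.normal = OrderedDict()      # tag -> None, least recently used first
        self.prefetched = OrderedDict()

    def _insert(self, ways, cap, tag):
        if cap == 0:
            return                       # no way of this kind is configured
        if tag in ways:
            ways.move_to_end(tag)
            return
        if len(ways) >= cap:
            ways.popitem(last=False)     # evict the least recently used line
        ways[tag] = None

    def prefetch(self, tag):
        """Compiler-directed prefetch: fills only the reserved ways."""
        self._insert(self.prefetched, self.prefetch_cap, tag)

    def access(self, tag):
        """Instruction fetch: hits in any way; a miss fills an ordinary way."""
        for ways in (self.prefetched, self.normal):
            if tag in ways:
                ways.move_to_end(tag)    # mark as most recently used
                return True
        self._insert(self.normal, self.normal_cap, tag)
        return False
```

Setting j to 0 gives an ordinary cache, and setting j to n gives a cache filled only by prefetch instructions, which correspond to the two extremes of use described for the adjustable prefetch instruction cache.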

[0031] Furthermore, the architecture of FIG. 1 includes a local data memory (LDM) 26. The local data memory 26 can be accessed only from within its own processing element 16, and it holds local data used among the tasks that are assigned to that processing element 16 by data localization and similar techniques. When the compiler or the user can partition the data of the target application program and place it in local memory, the local data memory 26 is used as local memory. When local memory cannot be used effectively, it is preferable that the memory can be switched to operate as a level-1 cache (D-cache). For systems devoted to real-time applications, such as game machines, it can also be designed as local memory only. Because it is basically used only within its own processing element, it consumes less chip area than shared memory, so a relatively large capacity can be provided.

[0032] In coarse-grain parallel processing, dynamic scheduling is used to cope with conditional branches. In this case, the processor on which a macrotask will execute is not known at compile time. It is therefore preferable that data shared among dynamically scheduled macrotasks can be placed in a centralized shared memory (CSM). For this reason, in this embodiment, a centralized shared memory 28 that stores data shared by the processing elements 16 is provided in each SCM, and a centralized shared memory 14 is additionally connected to the inter-chip connection network 12. The on-chip centralized shared memory 28 stores data shared by all processing elements 16 in the chip 10 and, in multi-chip configurations, by processing elements on other chips as well. The off-chip centralized shared memory 14 is likewise shared by the processing elements. Thus, in an actual design, the centralized shared memories 28 and 14 are physically distributed across the chips but can logically be shared equally by every processing element. They can be implemented so as to appear equidistant from all processing elements, or so as to appear closer to the processing elements on the same chip.

[0033] In a system consisting of a single SCM chip, the centralized shared memory 28 can be used as an equidistant shared memory shared among the processing elements (PEs) 16 in the chip. When compiler optimization is difficult, it can be used as an L2 cache. The memories 28 and 14 mainly store data shared among tasks under dynamic task scheduling. As for the centralized shared memory 14 on separate chips, when the capacity of the centralized shared memory 28 in the SCM chips 10 is insufficient, any number of large-capacity centralized shared memory chips consisting solely of memory can be connected as needed.

[0034] When static scheduling can be applied, regardless of granularity, it is known at compile time which processors need the shared data defined by a given macro task. It is therefore preferable that the producer processor be able to write the data and a synchronization flag directly into the distributed shared memory of the consumer processor.
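
The producer-to-consumer write described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the hardware mechanism itself: the class `ProcessingElement`, its methods, and the string addresses `"x"` and `"x_ready"` are hypothetical, and each distributed shared memory (DSM) is modeled as a dictionary.

```python
class ProcessingElement:
    """Hypothetical model of a PE that owns a distributed shared memory."""

    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.dsm = {}  # address -> value; writable by other PEs

    def remote_write(self, consumer, data_addr, value, flag_addr):
        # Producer side: write the data first, then set the flag,
        # so a set flag always implies valid data.
        consumer.dsm[data_addr] = value
        consumer.dsm[flag_addr] = 1

    def try_consume(self, data_addr, flag_addr):
        # Consumer side: read only from its own DSM; data is valid
        # once the synchronization flag has been set.
        if self.dsm.get(flag_addr, 0) != 1:
            return None
        return self.dsm[data_addr]

# The static schedule fixes the roles in advance: PE0 produces, PE1 consumes.
pe0, pe1 = ProcessingElement(0), ProcessingElement(1)
before = pe1.try_consume("x", "x_ready")   # flag not yet set
pe0.remote_write(pe1, "x", 42, "x_ready")
after = pe1.try_consume("x", "x_ready")    # flag set, data available
```

Because the consumer polls only its own DSM in this model, no access to a central memory is needed on the consuming side.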

[0035] The data transfer controller (DTC) 30, as directed by the compiler or the user, transfers data to and from the DSM 22 on its own processing element, the CSM 28 in its own or another SCM 10, or the DSM on another processing element. In a configuration comprising a plurality of SCMs, it transfers data to and from the CSMs and DSMs on other SCMs, or to and from an independent CSM.

[0036] The dotted line between the local data memory 26 and the data transfer controller 30 in FIG. 1 indicates that, depending on the application, the data transfer controller 30 may be configured to access the local data memory (D-cache) 26. In that case, the CPU 20 can issue transfer instructions to the data transfer controller 30, and check for transfer completion, via the local data memory 26.

[0037] Data transfer instructions to the data transfer controller 30 are issued via the local data memory 26, the DSM 22, or a dedicated buffer (not shown), and the data transfer controller 30 reports completion of a data transfer to the CPU 20 via the local memory, the DSM, or a dedicated buffer. Which of these is used is either fixed at processor design time according to the intended use of the processor, or several mechanisms are provided in hardware so that the compiler or the user can choose among them in software according to the characteristics of the program.

[0038] For data transfer instructions to the data transfer controller 30 (for example, from which address how many bytes of data are to be stored to or loaded from where, and the transfer mode, such as contiguous transfer, stride transfer, or stride-stride transfer), it is preferable that the compiler store the data transfer commands in memory or a dedicated buffer in advance and that, at run time, only an indication of which command to execute be issued, thereby reducing the overhead of driving the data transfer controller 30.
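
The idea of storing transfer commands ahead of time and selecting one by index at run time can be sketched in Python as below. This is a minimal sketch under assumptions: `TransferCommand`, `DataTransferController`, and their fields are hypothetical names, addresses are modeled as list indices, and "stride-stride transfer" is assumed here to mean a transfer that is strided on both the source and the destination side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferCommand:
    """Hypothetical descriptor prepared in advance (e.g. by a compiler)."""
    src_base: int        # first source index
    dst_base: int        # first destination index
    count: int           # number of elements to move
    src_stride: int = 1  # step between source elements
    dst_stride: int = 1  # step between destination elements

class DataTransferController:
    def __init__(self, commands):
        # Commands are stored once, before execution starts.
        self.commands = list(commands)

    def run(self, index, src, dst):
        # At run time only the command index is supplied.
        cmd = self.commands[index]
        for i in range(cmd.count):
            dst[cmd.dst_base + i * cmd.dst_stride] = \
                src[cmd.src_base + i * cmd.src_stride]

commands = [
    TransferCommand(0, 0, 4),                              # contiguous
    TransferCommand(0, 0, 4, src_stride=2),                # stride
    TransferCommand(1, 0, 3, src_stride=2, dst_stride=2),  # stride-stride (assumed meaning)
]
dtc = DataTransferController(commands)
src = list(range(10, 18))  # source buffer holding 10..17

contiguous = [0] * 4
dtc.run(0, src, contiguous)
gathered = [0] * 4
dtc.run(1, src, gathered)
scattered = [0] * 6
dtc.run(2, src, scattered)
```

Each call to `run` passes a single small index rather than a full descriptor, which is the overhead reduction the paragraph describes.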

[0039] The processing elements 16 in each SCM chip 10 are connected to one another by an intra-chip connection network 34 (a multiple bus, a crossbar, or the like) via the network interface 32 provided in each processing element, and the processing elements are connected to the common centralized shared memory 28 through this intra-chip connection network 34. The centralized shared memory 28 is connected to the inter-chip connection network 12 outside the chip. A crossbar network or a bus (including multiple buses) is particularly preferred for the inter-chip connection network, but a multistage interconnection network or the like may also be used; the choice can be made according to the budget, the number of SCMs, and the characteristics of the application. It is also possible to connect the network interface 32 to the external inter-chip connection network 12 without passing through the intra-chip connection network 34. Such a configuration allows every processing element in the system to access, on an equal footing, the centralized shared memories and distributed shared memories distributed over the chips. In addition, when there is heavy data transfer between chips, providing this direct path can greatly increase the data transfer capacity of the whole system.

[0040] The global register file 36 is a multi-port register file shared by the processing elements in the chip. It can be used, for example, for data transfer and synchronization between near-fine-grain tasks (such as when distributed shared memory is used). The global register file may be omitted depending on the intended use of the processor.

[0041] In FIG. 1, dotted lines denote communication lines that can be provided as needed. Where they are unnecessary or impractical in view of cost, pin count, or the like, the system operates without the dotted-line connections.

[0042] The present invention has been described above on the basis of a specific embodiment, but the technical scope of the invention is not limited to such an embodiment and encompasses various modifications that would be readily apparent to those skilled in the art.

【００４３】[0043]

[Effects of the Invention] As described above, the single-chip multiprocessor of the present invention improves the price/performance ratio and enables performance improvement that scales with increasing semiconductor integration. The present invention also provides a system including a plurality of such single-chip multiprocessors, and such a system enables even faster processing.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a system for multigrain parallel processing according to one embodiment of the present invention.

FIG. 2 is a graph showing an example of a macro flow graph for coarse-grain parallel processing in a compiler that can be used with the present invention.

FIG. 3 is a graph showing an example of a macro task graph for coarse-grain parallel processing in a compiler that can be used with the present invention.

FIG. 4 is a graph showing an example of a near-fine-grain task graph for near-fine-grain parallel processing in a compiler that can be used with the present invention.

FIG. 5 is a block diagram showing the configuration of an adjustable prefetch instruction cache that can be used in the present invention.

[Explanation of symbols]

10 Single-chip multiprocessor
12 Inter-chip connection network
14 Centralized shared memory (chip)
16 Processing element
20 CPU
22 Distributed shared memory
24 Adjustable prefetch instruction cache
26 Local data memory
28 Centralized shared memory
30 Data transfer controller
32 Network interface
34 Intra-chip connection network

Claims

[Claims]

1. A single-chip multiprocessor comprising: a plurality of processing elements, each including a CPU, a network interface connected to the CPU, an adjustable prefetch instruction cache directly connected to the CPU and the network interface, a data transfer controller directly connected to the CPU, a memory switchable between use as a local memory and use as a data cache, and a distributed shared memory accessible from all processing elements; and a centralized shared memory connected to each processing element and shared by the processing elements.

2. The single-chip multiprocessor according to claim 1, further comprising a global register file connected to each processing element.

3. A computer comprising a plurality of single-chip multiprocessors according to claim 1.

4. A computer comprising: a single-chip multiprocessor according to claim 1 or 2; a centralized shared memory chip including a memory shared by all the single-chip multiprocessors; and a single-chip multiprocessor for input/output control.