JP2000259498A

JP2000259498A - Instruction cache for multi-thread processor

Info

Publication number: JP2000259498A
Application number: JP2000062593A
Authority: JP
Inventors: Richard William Doing; Ronald Nick Kalla; Stephen Joseph Schwinn
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1999-03-10
Filing date: 2000-03-07
Publication date: 2000-09-22
Anticipated expiration: 2020-03-07
Also published as: CN1267024A; CN1168025C; JP3431878B2

Abstract

PROBLEM TO BE SOLVED: To reduce contention between different threads by storing, for each of a plurality of threads, at least a portion of the real address of the desired instruction to be retrieved in response to an instruction cache miss. SOLUTION: A CPU 101 that processes instructions is provided with a separate internal level 1 instruction cache 106 (L1 I-cache) and level 1 data cache 107 (L1 D-cache). The L1 I-cache 106 consists of a directory array and an array of cached instructions; both arrays are shared by all threads and are accessed by constructing a hash function from the effective address of the desired instruction. Each entry of the directory array stores at least a portion of the real address of the corresponding cache line in the array of cached instructions. In this way, at least a portion of the real address of the desired instruction to be retrieved in response to an instruction cache miss is stored for each of the plurality of threads.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

FIELD OF THE INVENTION

The present invention relates generally to digital data processing, and more particularly to an instruction cache that supplies instructions to the processing unit of a digital computer system.

[0002]

DESCRIPTION OF THE RELATED ART

A modern computer system typically comprises a central processing unit (CPU) and the supporting hardware needed to store, retrieve, and transfer information, such as communication buses and memory. It also includes the hardware needed to communicate with the outside world, such as input/output controllers or storage controllers, and the devices attached to them, such as keyboards, monitors, tape drives, disk drives, and communication lines connected to a network. The CPU is the heart of the system. It executes the instructions that make up a computer program and directs the operation of the other system components.

[0003] From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors can perform a limited set of very simple operations, such as arithmetic, logical comparisons, and data movement, but each operation is performed very quickly. Programs that direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What users perceive as a new or improved capability of a computer system is made possible by performing essentially the same simple operations, only faster. Continued improvement of computer systems therefore requires that these systems be made ever faster.

[0004] The overall speed of a computer system (also called its "throughput") can be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements is to increase the clock speeds of the various components, and in particular the clock speed of the processor. For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system performs a given task in half the time. Early computer processors, which were built from many discrete components, became significantly faster as components shrank, as the number of components fell, and eventually as the entire processor was packaged as an integrated circuit on a single chip. The reduced size made it possible to raise the processor clock speed and, with it, system speed.

[0005] Despite the enormous speed gains obtained from integrated circuitry, the demand for ever faster computer systems continues. Hardware designers have obtained further improvements through greater integration (that is, increasing the number of circuits packed onto a single chip), further reduction of circuit size, and various other techniques. However, physical size reduction cannot continue indefinitely, and there are limits to how far processor clock speeds can keep rising. Attention has therefore turned to other approaches for improving the overall speed of the computer system.

[0006] Without changing the clock speed, system throughput can be improved by using multiple processors. The modest cost of individual processors packaged on integrated circuit chips makes this practical. Although using multiple processors certainly offers potential benefits, it introduces additional architectural issues. Without examining those issues in depth, it can still be observed that there are many reasons to improve the speed of the individual CPU, whether a system uses multiple CPUs or a single CPU. For a given CPU clock speed, the speed of an individual CPU, that is, the number of operations executed per second, can be increased further by increasing the average number of operations executed per clock cycle.
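
As a rough numeric illustration of the relationship just described, the sketch below computes operations per second as the clock rate multiplied by the average number of operations per clock cycle. The clock rate and the per-cycle figures are assumed values chosen only for illustration; they do not come from this publication.

```python
def operations_per_second(clock_hz: float, ops_per_cycle: float) -> float:
    """Operations per second = clock rate x average operations per clock cycle."""
    return clock_hz * ops_per_cycle


CLOCK_HZ = 500e6  # assumed clock rate, held constant in both cases

baseline = operations_per_second(CLOCK_HZ, 0.8)  # assumed 0.8 operations per cycle
improved = operations_per_second(CLOCK_HZ, 1.2)  # assumed 1.2 operations per cycle
speedup = improved / baseline                     # gain obtained with no clock change

print(f"speedup at constant clock: {speedup:.2f}x")
```

With these assumed figures the CPU becomes about 1.5 times faster although the clock is unchanged, which is the kind of gain the following paragraphs pursue.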

[0007] To increase CPU speed, high-performance processor designs commonly employ instruction pipelining and one or more levels of cache memory. Pipelined instruction execution allows subsequent instructions to begin executing before previously issued instructions have finished. Cache memories store frequently used data and other data close to the processor, so that in most cases instruction execution can continue without waiting the full access time of main memory.

[0008] Pipelines stall under certain circumstances. An instruction that depends on the result of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For example, instructions that depend on a load/store instruction whose required data is not in the cache (a cache miss) cannot execute until the data becomes available in the cache. Keeping the data needed for continued execution in the cache, and sustaining a high hit ratio (the number of requests for which the data is immediately available in the cache, relative to the total number of data requests), is not trivial, especially for computations involving large data structures. A cache miss can stall the pipeline for several cycles, and if the data is unavailable much of the time, the total memory latency becomes a severe problem. Although the memory devices used for main memory are becoming faster, the speed gap between such memory chips and high-end processors keeps widening. As a result, a significant portion of execution time in current high-end processor designs is spent waiting for cache misses to be resolved.
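
The effect of the hit ratio on execution time can be sketched with a simple model. The code below is only an illustration under stated assumptions: the base cycles per instruction, the number of memory accesses per instruction, and the miss penalty are hypothetical values, and the model ignores overlap between misses.

```python
def hit_ratio(hits: int, requests: int) -> float:
    """Fraction of data requests satisfied immediately from the cache."""
    return hits / requests


def effective_cpi(base_cpi: float, accesses_per_instr: float,
                  ratio: float, miss_penalty_cycles: float) -> float:
    """Average cycles per instruction once cache-miss stalls are added."""
    miss_rate = 1.0 - ratio
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty_cycles


ratio = hit_ratio(hits=950, requests=1000)  # assumed counts
cpi = effective_cpi(base_cpi=1.0,           # assumed ideal pipeline
                    accesses_per_instr=0.3, # assumed
                    ratio=ratio,
                    miss_penalty_cycles=40) # assumed penalty
stall_fraction = (cpi - 1.0) / cpi          # share of time spent waiting on misses

print(f"hit ratio {ratio:.2f}, CPI {cpi:.2f}, stalled {stall_fraction:.1%}")
```

Even with a 95% hit ratio, these assumed numbers leave the processor waiting on misses for more than a third of its cycles, which illustrates why the remaining miss latency matters.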

[0009] It follows that reducing the time the processor spends waiting for some event, such as refilling a pipeline or retrieving data from memory, increases the average number of operations per clock cycle. One architectural innovation directed at this problem is called "multithreading." This technique breaks the workload into multiple independently executable sequences of instructions, called threads. At any instant, the CPU maintains the state of multiple threads. As a result, switching between threads is relatively simple and fast.

[0010] The term "multithreading" as defined in the field of computer architecture differs from its use in software, where it refers to one task being subdivided into multiple related threads. In the architectural definition, the threads may be independent. The term "hardware multithreading" is therefore often used to distinguish the two senses. Here, "multithreading" means hardware multithreading.

[0011] There are two basic forms of multithreading. In the more traditional form, sometimes called "fine-grain multithreading," the processor runs N threads concurrently by interleaving execution on a cycle-by-cycle basis. This creates a gap between the execution of successive instructions within a single thread, which removes the need for the processor to wait for certain short-latency events, such as refilling the instruction pipeline. In the second form, sometimes called "coarse-grain multithreading," multiple instructions of a single thread are executed sequentially until the processor encounters a relatively long-latency event, such as a cache miss.
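
The difference between the two forms can be sketched as two issue policies. The model below is a deliberate simplification and not the mechanism of any particular processor: each thread is a string in which "i" stands for an ordinary instruction and "M" for an instruction that triggers a long-latency event, and the duration of the latency itself is ignored. All names are hypothetical.

```python
from typing import List


def fine_grain(threads: List[str]) -> List[int]:
    """Return the thread id issued in each cycle, rotating threads every cycle."""
    pcs = [0] * len(threads)
    order = []
    t = 0
    while any(pc < len(s) for pc, s in zip(pcs, threads)):
        if pcs[t] < len(threads[t]):
            order.append(t)
            pcs[t] += 1
        t = (t + 1) % len(threads)  # always move on to the next thread
    return order


def coarse_grain(threads: List[str]) -> List[int]:
    """Stay on one thread until it issues an 'M' (long-latency event), then switch."""
    pcs = [0] * len(threads)
    order = []
    t = 0
    while any(pc < len(s) for pc, s in zip(pcs, threads)):
        if pcs[t] >= len(threads[t]):
            t = (t + 1) % len(threads)  # this thread is finished
            continue
        op = threads[t][pcs[t]]
        order.append(t)
        pcs[t] += 1
        if op == "M":
            t = (t + 1) % len(threads)  # switch only on a long-latency event
    return order


workload = ["iiMi", "iiii"]
print(fine_grain(workload))    # [0, 1, 0, 1, 0, 1, 0, 1]
print(coarse_grain(workload))  # [0, 0, 0, 1, 1, 1, 1, 0]
```

In the coarse-grain trace, thread 0 keeps the processor until its third instruction misses, at which point thread 1 runs; in the fine-grain trace the two threads alternate regardless of events.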

[0012] Multithreading typically involves replicating the processor registers for each thread in order to maintain the state of multiple threads. For example, for a processor implementing the architecture sold under the PowerPC (trademark) name to perform multithreading, the processor must maintain N states to run N threads. Accordingly, the general purpose registers, floating point registers, condition registers, floating point status and control register, count register, link register, exception register, save/restore registers, and special purpose registers are replicated N times. In addition, special buffers such as the segment lookaside buffer can be replicated, or each entry can be tagged with a thread number; if neither is done, the buffer must be flushed on every thread switch. Some branch prediction mechanisms, such as the correlation register and the return stack, should also be replicated.
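
The idea of replicating register state per thread can be sketched as follows. This is a minimal software model under assumptions, not a description of actual hardware: only a few of the listed registers are represented, the class and field names are hypothetical, and the register counts are illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadState:
    """One copy of the architected registers (a small illustrative subset)."""
    gpr: List[int] = field(default_factory=lambda: [0] * 32)      # general purpose
    fpr: List[float] = field(default_factory=lambda: [0.0] * 32)  # floating point
    condition: int = 0
    count: int = 0
    link: int = 0


class MultithreadedCore:
    """Keeps N register sets so a thread switch needs no save or restore."""

    def __init__(self, n_threads: int) -> None:
        self.states = [ThreadState() for _ in range(n_threads)]
        self.active = 0

    def switch_to(self, thread_id: int) -> None:
        # Only the selector changes; each thread's registers stay in place.
        self.active = thread_id

    @property
    def regs(self) -> ThreadState:
        return self.states[self.active]


core = MultithreadedCore(n_threads=2)
core.regs.gpr[3] = 7          # thread 0 writes a register
core.switch_to(1)
thread1_view = core.regs.gpr[3]  # thread 1 sees its own, untouched copy
core.switch_to(0)
thread0_view = core.regs.gpr[3]  # thread 0's value is still there
```

Because each thread owns a full register set, switching is just a change of the active index, which is what makes the switch fast.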

[0013] Larger hardware structures, such as the level 1 instruction cache, the level 1 data cache, functional units, and execution units, are usually not replicated. All else being equal, replicating the larger hardware structures might offer some incremental performance benefit. However, any such approach requires weighing that incremental benefit against the additional hardware required. Caches consume considerable area on a processor chip that could otherwise be devoted to other uses. The size of the caches, and their number and function, must therefore be chosen carefully.

[0014] High-performance designs often incorporate a level 1 instruction cache (L1 I-cache) on the processor chip. The L1 I-cache holds instructions considered likely to be executed in the near future.

[0015] Additional issues arise when an L1 I-cache is used in a multithreaded processor. The I-cache should support rapid thread switching without undue contention between threads. One way to avoid contention is to give each thread a separate I-cache, but this either consumes valuable hardware or makes the individual cache for a single thread unduly small. It is preferable for a single L1 I-cache to be shared by all threads without undue contention between them. It is also desirable for the cache access mechanism to avoid using slow address translation mechanisms wherever possible.

[0016] The design of the L1 I-cache is critical to high-speed processor operation. If the I-cache miss rate is high, if access is too slow, if there is excessive contention among different threads, or if cache coherency is difficult to maintain, the processor spends undue time waiting for the next instruction to execute. Continued improvement of processors requires that the L1 I-cache address these concerns efficiently, particularly in a multithreading environment.

[0017]

PROBLEM TO BE SOLVED BY THE INVENTION

It is an object of the present invention to provide an improved processor apparatus.

[0018] Another object of the present invention is to provide an improved instruction cache apparatus for use in a multithreaded processor.

[0019] Another object of the present invention is to reduce contention among the threads of a multithreaded processor when they access an instruction cache.

[0020]

MEANS FOR SOLVING THE PROBLEM

A multithreaded processor includes a level 1 instruction cache (L1 I-cache) shared by all threads. The L1 I-cache comprises a directory array and an array of cached instructions, both of which are shared by all threads and are accessed by constructing a hash function from the effective address of the desired instruction. Each entry of the directory array stores at least a portion of the real address of the corresponding cache line in the array of cached instructions, from which the complete real address of an instruction in the cache can be derived. A separate line fill sequencer exists for each thread, so that while a cache line fill request for one thread is being satisfied, another thread can access cache entries, or a line can be prefetched for the executing thread.

[0021] In the preferred embodiment, these arrays are divided into multiple sets, each set having one entry corresponding to each value of the hash function (an N-way associative cache). The processor of this embodiment maintains state information for two separate threads, and the instruction cache arrays are divided into two sets. However, the number of threads and the cache associativity may vary. Because each thread can independently access cached instructions that have the same hash value but belong to different sets, contention between different threads is reduced.
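
The organization described above can be sketched as a small software model. Everything numeric here is an assumption made for illustration (line size, number of hash values, and the hash itself, which simply takes low-order line-number bits of the effective address); the publication does not specify them at this point. The model follows the text's use of "set" for each of the two parallel arrays, a role often called a "way" elsewhere.

```python
LINE_BYTES = 128  # assumed cache line size
NUM_ROWS = 32     # assumed number of distinct hash values
NUM_SETS = 2      # two sets, as in the preferred embodiment


def row_index(effective_addr: int) -> int:
    """Hypothetical hash: low-order line-number bits of the effective address."""
    return (effective_addr // LINE_BYTES) % NUM_ROWS


class ICache:
    def __init__(self) -> None:
        # directory[set][row] holds a real page number (or None if empty);
        # lines[set][row] holds the cached instructions for that entry.
        self.directory = [[None] * NUM_ROWS for _ in range(NUM_SETS)]
        self.lines = [[None] * NUM_ROWS for _ in range(NUM_SETS)]

    def fill(self, set_id: int, effective_addr: int, real_page: int, line) -> None:
        row = row_index(effective_addr)
        self.directory[set_id][row] = real_page
        self.lines[set_id][row] = line

    def lookup(self, effective_addr: int, real_page: int):
        """Return the cached line if any set holds this real page in the hashed row."""
        row = row_index(effective_addr)
        for set_id in range(NUM_SETS):
            if self.directory[set_id][row] == real_page:
                return self.lines[set_id][row]
        return None  # cache miss


cache = ICache()
ea_thread0, ea_thread1 = 0x10000080, 0x20000080  # both hash to the same row
cache.fill(0, ea_thread0, real_page=0xAAA, line="thread 0 instructions")
cache.fill(1, ea_thread1, real_page=0xBBB, line="thread 1 instructions")
```

Both addresses hash to the same row, yet each thread's line survives because the two lines occupy different sets; with a single set, the second fill would have evicted the first.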

[0022] The I-cache preferably includes an effective-to-real address table (ERAT), which functions as a cache of the address translation tables for main memory. The ERAT contains pairs of effective address portions and corresponding real address portions. ERAT entries are accessed with a hash function of the effective address of the desired instruction. The effective address portion of the ERAT entry is then compared with the effective address of the desired instruction to verify an ERAT hit. The corresponding real address portion is compared with the real address portion in the directory array to verify a cache hit.
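
The two comparisons described above (one for the ERAT hit, one for the cache hit) can be sketched as follows. This is a simplified model under assumptions: the page size, the number of ERAT entries, and the hash are hypothetical, and the directory is reduced to the single real page number found in the row selected for the requested address.

```python
from typing import List, Optional, Tuple

PAGE_BITS = 12   # assumed page size of 4 KB
ERAT_ROWS = 16   # assumed number of ERAT entries


def erat_index(effective_addr: int) -> int:
    """Hypothetical hash: low-order bits of the effective page number."""
    return (effective_addr >> PAGE_BITS) % ERAT_ROWS


class Erat:
    def __init__(self) -> None:
        # Each entry pairs an effective page number with a real page number.
        self.entries: List[Optional[Tuple[int, int]]] = [None] * ERAT_ROWS

    def insert(self, effective_page: int, real_page: int) -> None:
        self.entries[effective_page % ERAT_ROWS] = (effective_page, real_page)

    def lookup(self, effective_addr: int) -> Optional[int]:
        """Return the real page number on an ERAT hit, otherwise None."""
        entry = self.entries[erat_index(effective_addr)]
        if entry is not None and entry[0] == effective_addr >> PAGE_BITS:
            return entry[1]
        return None


def icache_hit(erat: Erat, directory_real_page: int, effective_addr: int) -> bool:
    """Cache hit requires an ERAT hit and a matching real page in the directory."""
    real_page = erat.lookup(effective_addr)
    return real_page is not None and real_page == directory_real_page


erat = Erat()
erat.insert(effective_page=0x12345, real_page=0xABC)
ea = (0x12345 << PAGE_BITS) | 0x080          # address inside the mapped page
ea_other = (0x12355 << PAGE_BITS) | 0x080    # same ERAT row, different page
```

The second address selects the same ERAT row but fails the effective-address comparison, so it is an ERAT miss rather than a false hit.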

[0023] The line fill sequencer preferably operates in response to a cache miss for which an ERAT entry for the requested effective address exists (an ERAT hit). In that case, the complete real address of the desired instruction can be constructed from the effective address and the information in the ERAT, so there is no need to access the slower address translation mechanisms for main memory. The line fill sequencer accesses memory directly using the constructed real address.
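
One plausible way to construct the real address on an ERAT hit is to combine the real page number from the ERAT with the page offset taken from the effective address. The sketch below assumes exactly that, together with a hypothetical page size and line size; the class is a toy stand-in for a per-thread line fill sequencer and does not model its actual timing or interface.

```python
from typing import Optional

PAGE_BITS = 12    # assumed page size of 4 KB
LINE_BYTES = 128  # assumed cache line size


def real_address(real_page: int, effective_addr: int) -> int:
    """Real page number from the ERAT plus the page offset of the effective address."""
    offset = effective_addr & ((1 << PAGE_BITS) - 1)
    return (real_page << PAGE_BITS) | offset


def line_fill_address(real_page: int, effective_addr: int) -> int:
    """Real address of the start of the cache line that contains the instruction."""
    return real_address(real_page, effective_addr) & ~(LINE_BYTES - 1)


class LineFillSequencer:
    """Toy per-thread sequencer holding at most one outstanding line fill."""

    def __init__(self) -> None:
        self.pending: Optional[int] = None

    def start(self, real_page: int, effective_addr: int) -> int:
        self.pending = line_fill_address(real_page, effective_addr)
        return self.pending


sequencers = [LineFillSequencer(), LineFillSequencer()]  # one per thread
fill_addr = sequencers[0].start(real_page=0xABC, effective_addr=0x12345098)
print(hex(fill_addr))  # 0xabc080
```

Because thread 1 has its own sequencer, the fill started for thread 0 leaves thread 1's sequencer free for an independent request.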

[0024] Because there is a separate line fill sequencer for each thread, threads can satisfy cache fill requests independently, without waiting for one another. In addition, because the I-cache index stores the real page number corresponding to each entry, cache coherency is simplified. Furthermore, using the ERAT to associate effective page numbers with real page numbers avoids, in many cases, the need to access slower memory translation mechanisms. Finally, the N-way associativity of the cache allows all threads to use a common cache without undue thread contention.

[0025]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows the major hardware components of a single-CPU computer system 100 employing an instruction cache architecture according to the preferred embodiment of the present invention. CPU 101, which processes instructions, contains separate internal level 1 instruction cache 106 (L1 I-cache) and level 1 data cache 107 (L1 D-cache). L1 I-cache 106 stores instructions for execution by CPU 101. L1 D-cache 107 stores data, other than instructions, to be processed by CPU 101. CPU 101 is coupled to level 2 cache (L2 cache) 108, which is used to hold both instructions and data. Memory bus 109 transfers data between L2 cache 108 or CPU 101 on the one hand and main memory 102 on the other. CPU 101, L2 cache 108, and main memory 102 also communicate with system bus 110 via bus interface 105. Various I/O processing units (IOPs) 111 to 115 attach to system bus 110 and support communication with a variety of storage and I/O devices, such as direct access storage devices (DASD), tape drives, workstations, printers, and remote communication lines for communicating with remote devices or other computer systems.

[0026] FIG. 1 depicts the major components of system 100 at a conceptual level; the number and type of such components may vary. In particular, system 100 may contain multiple CPUs. Such a multi-CPU system is shown in FIG. 2. The system of FIG. 2 contains four CPUs 101A, 101B, 101C, and 101D, each having a respective L1 I-cache 106A, 106B, 106C, 106D and a respective L1 D-cache 107A, 107B, 107C, 107D. A separate L2 cache 108A, 108B, 108C, 108D is associated with each CPU.

[0027] In the preferred embodiment, each CPU can maintain the state of two threads and switches execution between threads on certain latency events. That is, the CPU executes a single thread (the active thread) until it encounters some latency event that would force the CPU to wait (a form of coarse-grain multithreading). However, the present invention could be practiced with a different number of thread states in each CPU, and it would also be possible to interleave the execution of instructions from each thread on a cycle-by-cycle basis (fine-grain multithreading) or to switch threads on some other basis.

[0028] FIG. 3 is a diagram of the major components of CPU 101, showing CPU 101 in greater detail than FIGS. 1 and 2, according to the preferred embodiment. In this embodiment, the components shown in FIG. 3 are packaged on a single semiconductor chip. CPU 101 includes instruction unit portion 201, execution unit portion 211, and storage control portion 221. In general, instruction unit 201 obtains instructions from L1 I-cache 106, decodes them to determine the operations to perform, and resolves branch conditions to control program flow. Execution unit 211 performs arithmetic and logical operations on data in registers, and loads or stores data. Storage control unit 221 accesses data in the L1 data cache, or interfaces with memory external to the CPU when instructions or data must be fetched or stored.

[0029] Instruction unit 201 includes branch unit 202, buffers 203, 204, and 205, and decode/dispatch unit 206. Instructions from L1 I-cache 106 are loaded into one of the three buffers from L1 I-cache instruction bus 232. Sequential buffer 203 stores 16 instructions in the current execution sequence. Branch buffer 205 stores 8 instructions from a branch destination; these are speculatively loaded into buffer 205 before branch evaluation, in case the branch is taken. Thread switch buffer 204 stores 8 instructions for the inactive thread, so that they are immediately available if a switch from the currently active thread to the inactive thread is required. Decode/dispatch unit 206 receives the current instruction to be executed from one of the buffers and decodes it to determine the operation to perform or the branch condition. Branch unit 202 controls program flow by evaluating branch conditions, and refills the buffers from L1 I-cache 106 by sending the effective address of the desired instruction on L1 I-cache address bus 231.
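
The three buffers and their stated capacities can be modeled in a few lines. The sketch below is only an illustration: the capacities (16, 8, and 8 instructions) come from the text, while the class, method names, and the rule that excess instructions are simply not accepted are assumptions made to keep the model small.

```python
from collections import deque

# Capacities taken from the description of buffers 203, 204, and 205.
CAPACITY = {"sequential": 16, "thread_switch": 8, "branch": 8}


class InstructionBuffers:
    def __init__(self) -> None:
        self.buffers = {name: deque() for name in CAPACITY}

    def load(self, name: str, instructions) -> int:
        """Load instructions until the buffer is full; return how many were accepted."""
        room = CAPACITY[name] - len(self.buffers[name])
        accepted = list(instructions)[:room]
        self.buffers[name].extend(accepted)
        return len(accepted)

    def next_instruction(self, source: str = "sequential"):
        """Hand the oldest instruction of the chosen buffer to decode/dispatch."""
        buf = self.buffers[source]
        return buf.popleft() if buf else None


bufs = InstructionBuffers()
accepted_seq = bufs.load("sequential", range(20))        # only 16 fit
accepted_branch = bufs.load("branch", ["b0", "b1"])      # speculative branch target
first_seq = bufs.next_instruction()                       # normal sequential flow
first_branch = bufs.next_instruction("branch")            # used if the branch is taken
no_thread = bufs.next_instruction("thread_switch")        # nothing preloaded yet
```

Choosing the source buffer per call mirrors the idea that decode/dispatch draws from the sequential buffer by default and from the branch or thread switch buffer when a branch is taken or a thread switch occurs.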

[0030] Execution unit 211 includes S-pipe 213, M-pipe 214, R-pipe 215, and a bank of general purpose registers 217. Registers 217 are divided into two sets, one for each thread. R-pipe 215 is a pipelined arithmetic unit that performs a subset of integer arithmetic and logic functions for simple integers. M-pipe 214 is a pipelined arithmetic unit that performs a larger set of arithmetic and logic functions. S-pipe 213 is a pipelined unit that performs load and store operations. Floating point unit (FPU) 212 and its associated floating point registers 216 are used for complex floating point operations that typically require multiple cycles. Like general purpose registers 217, floating point registers 216 are divided into two sets, one for each thread.

[0031] Storage control unit 221 includes memory management unit 222, L2 cache directory 223, L2 cache interface 224, L1 data cache 107, and memory bus interface 225. The L1 D-cache is an on-chip cache used for data (as opposed to instructions). L2 cache directory 223 is a directory of the contents of L2 cache 108. L2 cache interface 224 handles the transfer of data directly to and from L2 cache 108. Memory bus interface 225 handles the transfer of data across memory bus 109, which may be to main memory 102 or to L2 cache units associated with other CPUs. Memory management unit 222 is responsible for routing data accesses to the various units. For example, when S-pipe 213 processes a load command that requires data to be loaded into a register, the memory management unit may fetch the data from L1 D-cache 107, L2 cache 108, or main memory 102. Memory management unit 222 determines where to obtain the data. Because L1 D-cache 107 is directly accessible, as is L2 cache directory 223, unit 222 can determine whether the data is in L1 D-cache 107 or in L2 cache 108. If the data is in neither on-chip L1 D-cache 107 nor L2 cache 108, it is fetched over memory bus 109 using memory bus interface 225.
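
The routing decision described for memory management unit 222 amounts to checking the nearest level first and falling back outward. The sketch below models each level as a plain dictionary, which is an assumption made purely for illustration; membership in the `l2` dictionary stands in for a lookup in the L2 cache directory, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def fetch_data(address: int,
               l1d: Dict[int, int],
               l2: Dict[int, int],
               main_memory: Dict[int, int]) -> Tuple[int, str]:
    """Return (value, source), trying the L1 D-cache, then L2, then main memory."""
    if address in l1d:
        return l1d[address], "L1 D-cache"
    if address in l2:                # stands in for an L2 directory lookup
        return l2[address], "L2 cache"
    return main_memory[address], "main memory"  # fetched over the memory bus


main_memory = {0x100: 11, 0x200: 22, 0x300: 33}
l2 = {0x100: 11, 0x200: 22}
l1d = {0x100: 11}

for addr in (0x100, 0x200, 0x300):
    value, source = fetch_data(addr, l1d, l2, main_memory)
    print(hex(addr), value, source)
```

Each address is served by the closest level that holds it, so only the third request in this toy example reaches main memory.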

[0032] While various CPU components have been described, the CPU of the preferred embodiment may contain many other components not shown here, which are not essential to an understanding of the present invention. For example, a typical design requires additional special purpose registers, some of which must be replicated for each thread. The number, type, and arrangement of components within CPU 101 may also vary. For example, the number and configuration of buffers and caches, the number and function of execution unit pipelines, and the arrays and sets into which registers are organized may all be changed, and dedicated floating point hardware may or may not be present.

[0033] Ideally, instruction unit 201 provides a constant stream of instructions for decoding in decode/dispatch unit 206 and execution by execution unit 211. L1 I-cache 106 must respond to an access request with minimal delay. When the requested instruction is actually in the L1 I-cache, it must be possible to respond and fill the appropriate buffer without requiring decode/dispatch unit 206 to wait. When the L1 I-cache cannot respond (that is, the requested instruction is not in the L1 I-cache), a longer path via cache fill bus 233 through memory management unit 222 must be taken. In this case, the instruction may be obtained from L2 cache 108, from main memory 102, or potentially from disk or other storage. If system 100 contains multiple processors, the instruction may also be obtained from the L2 cache of another processor. In all of these cases, the delay required to fetch the instruction from a remote location may cause instruction unit 201 to switch threads. That is, the active thread becomes inactive, the previously inactive thread becomes active, and instruction unit 201 begins processing instructions of the previously inactive thread held in thread switch buffer 204.

【００３４】図４は、好適実施例に従ったＬ１Ｉキャ
ッシュ１０６の主な構成要素を、図１、図２よりも詳し
く示す。Ｌ１Ｉキャッシュ１０６は、有効／実アドレ
ス・テーブル（ＥＲＡＴ）３０１、Ｉキャッシュ・ディ
レクトリ・アレイ３０２、Ｉキャッシュ命令アレイ３０
３を含む。Ｉキャッシュ命令アレイ３０３は、実行のた
め命令ユニット２０１に送られる実際の命令を記憶す
る。Ｉキャッシュ・ディレクトリ・アレイ３０２は、命
令アレイ３０３を管理するため、特に所望の命令が実際
に命令アレイ３０３にあるかどうかを確認するため、実
ページ番号、有効ビット等の情報の集合を記憶する。Ｅ
ＲＡＴ３０１は、有効ページ番号と実ページ番号のペア
を記憶し、有効アドレスを実アドレスに関連付けるため
に使用される。FIG. 4 shows the main components of L1 I-cache 106 according to the preferred embodiment in greater detail than FIGS. 1 and 2. L1 I-cache 106 includes an effective-to-real address table (ERAT) 301, an I-cache directory array 302, and an I-cache instruction array 303. I-cache instruction array 303 stores the actual instructions that are sent to instruction unit 201 for execution. I-cache directory array 302 stores a collection of real page numbers, valid bits, and other information used to manage instruction array 303 and, in particular, to determine whether a desired instruction is actually in instruction array 303. ERAT 301 stores pairs of effective page numbers and real page numbers and is used to associate effective addresses with real addresses.

【００３５】好適実施例のＣＰＵ１０１は、図１１にロ
ジックを示す通り、複数のアドレス変換レベルをサポー
トする。基本的なアドレス指定構造体は、有効アドレス
８０１、仮想アドレス８０２、実アドレス８０３の３つ
である。"有効アドレス"は、命令を参照するために命令
ユニット２０１により生成されるアドレスをいう。つま
り、これは、ユーザの実行可能コードの観点から見たア
ドレスである。有効アドレスは、従来の様々な方法で生
成できる。例えば、専用レジスタの上位アドレス・ビッ
ト（新しいタスクの実行開始時等、頻繁には変化しな
い）と、命令からの下位アドレス・ビットの連結、汎用
レジスタのアドレスから計算されたオフセット、現在実
行中の命令からのオフセット等として生成される。この
実施例の有効アドレスは、６４ビットで０乃至６３の番
号が付けられる（０は最上位ビット）。"仮想アドレス"
は、異なるユーザのアドレス空間を分離するために用い
られるオペレーティング・システムの構造体である。つ
まり、各ユーザが、有効アドレスの全範囲を参照できる
場合、競合を避けるために、異なるユーザの有効アドレ
ス空間を、比較的大きい仮想アドレス空間にマップする
必要がある。仮想アドレスは、レジスタに記憶されると
いう意味で物理的実体ではなく、５２ビットの仮想セグ
メントＩＤ８１４と有効アドレスの下位２８ビットを
連結して得られる計８０ビットの論理構造である。"実
アドレス"は、命令が記憶されるメモリ１０２の物理的
位置をいう。実アドレスは４０ビットで、２４乃至６３
の番号が付けられる（２４は最上位ビット）。CPU 101 of the preferred embodiment supports multiple levels of address translation, as shown logically in FIG. 11. The three basic addressing constructs are effective address 801, virtual address 802, and real address 803. An "effective address" is an address generated by instruction unit 201 to reference an instruction; that is, it is the address from the point of view of the user's executable code. An effective address may be generated in any of various conventional ways: for example, as the concatenation of high-order address bits in a special-purpose register (which change infrequently, e.g., when execution of a new task begins) with low-order address bits from an instruction, as an offset computed from an address in a general-purpose register, or as an offset from the currently executing instruction. In this embodiment, an effective address is 64 bits, numbered 0 to 63 (0 being the most significant bit). A "virtual address" is an operating system construct used to isolate the address spaces of different users. That is, if each user may reference the full range of effective addresses, the effective address spaces of different users must be mapped into a larger virtual address space to avoid conflicts. A virtual address is not a physical entity in the sense of being stored in a register; it is a logical construct of 80 bits in total, obtained by concatenating the 52-bit virtual segment ID 814 with the low-order 28 bits of the effective address. A "real address" is the physical location in memory 102 where the instruction is stored. A real address is 40 bits, numbered 24 to 63 (24 being the most significant bit).

【００３６】図１１に示すように、有効アドレス８０１
は、３６ビットの有効セグメントＩＤ８１１、１６ビッ
トのページ番号８１２、１２ビットのバイト・インデッ
クス８１３を含み、有効セグメントＩＤは最上位ビット
位置を占める。仮想アドレス８０２は、３６ビット有効
セグメントＩＤ８１１を５２ビット仮想セグメントＩＤ
８１４にマップし、得られる仮想セグメントＩＤ８１４
をページ番号８１２とバイト・インデックス８１３に連
結することによって、有効アドレスから構成される。実
アドレス８０３は、仮想セグメントＩＤ８１４とページ
番号８１２を５２ビット実ページ番号８１５にマップ
し、実ページ番号をバイト・インデックス８１３と連結
することで、仮想アドレスから導かれる。メイン・メモ
リのページは４Ｋ（つまり２^12）バイトなので、バイト
・インデックス８１３（最下位の１２アドレス・ビッ
ト）は、ページ内のアドレスを指定し、そのアドレスが
有効アドレス、仮想アドレス、または実アドレスかどう
かにかかわらず同じである。上位ビットはページを指定
し、従って、"有効ページ番号"または"実ページ番号"と
も呼ばれる。As shown in FIG. 11, effective address 801 comprises a 36-bit effective segment ID 811, a 16-bit page number 812, and a 12-bit byte index 813, with the effective segment ID occupying the most significant bit positions. Virtual address 802 is constructed from the effective address by mapping the 36-bit effective segment ID 811 to the 52-bit virtual segment ID 814 and concatenating the resulting virtual segment ID 814 with page number 812 and byte index 813. Real address 803 is derived from the virtual address by mapping virtual segment ID 814 and page number 812 to a 52-bit real page number 815 and concatenating the real page number with byte index 813. Because a page of main memory is 4K (i.e., 2^12) bytes, byte index 813 (the 12 least significant address bits) specifies an address within a page and is the same whether the address is effective, virtual, or real. The higher-order bits specify the page and are therefore also referred to as the "effective page number" or "real page number".
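The field layout of the effective address and the construction of the 80-bit virtual address can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper `field`, the function names, and the dictionary standing in for segment table mechanism 821 are hypothetical and do not come from the specification. Bit 0 is treated as the most significant bit, as in the text.

```python
def field(value: int, first: int, last: int, width: int = 64) -> int:
    """Extract bits first..last (inclusive); bit 0 is the most significant bit."""
    nbits = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << nbits) - 1)


def split_effective_address(ea: int) -> dict:
    return {
        "esid": field(ea, 0, 35),         # 36-bit effective segment ID 811
        "page": field(ea, 36, 51),        # 16-bit page number 812
        "byte_index": field(ea, 52, 63),  # 12-bit byte index 813
    }


def virtual_address(ea: int, segment_table: dict) -> int:
    """80-bit virtual address: 52-bit VSID 814 followed by the low 28 bits of the EA."""
    vsid = segment_table[field(ea, 0, 35)]  # hypothetical ESID -> VSID lookup
    return (vsid << 28) | (ea & ((1 << 28) - 1))


# Example values chosen arbitrarily for illustration.
ea = (0x123456789 << 28) | (0xABCD << 12) | 0x0EF
parts = split_effective_address(ea)
va = virtual_address(ea, {0x123456789: 0x5000000000001})
```

The low-order 28 bits (page number plus byte index) pass through unchanged; only the segment ID is replaced.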

【００３７】コンピュータ・システム１００には、ＣＰ
Ｕ１０１により生成される有効アドレスをメモリ１０２
の実アドレスに変換するアドレス変換メカニズムがあ
る。このアドレス変換メカニズムは、有効セグメントＩ
Ｄ８１１を仮想セグメントＩＤ８１４にマップするセグ
メント・テーブル・メカニズム８２１と、仮想セグメン
トＩＤ８１４とページ番号８１２を実ページ番号８１５
にマップするページ・テーブル・メカニズム８２２を含
む。これらのメカニズムは、図１１では便宜上シングル
・エンティティとして示しているが、実際にはレベルの
異なる複数のテーブルやレジスタを含む。つまり完全な
ページ・テーブルと完全なセグメント・テーブルがメイ
ン・メモリ１０２にあり、これらのテーブルの比較的小
さい様々なデータ・キャッシュ部は、ＣＰＵ１０１自体
またはＬ２キャッシュに記憶される。また、一定の条件
下で有効アドレスから実アドレスに直接変換する変換メ
カニズムもある（図示せず）。Computer system 100 includes an address translation mechanism that translates effective addresses generated by CPU 101 into real addresses in memory 102. This mechanism includes a segment table mechanism 821, which maps effective segment ID 811 to virtual segment ID 814, and a page table mechanism 822, which maps virtual segment ID 814 and page number 812 to real page number 815. Although shown as single entities in FIG. 11 for convenience, these mechanisms in fact comprise multiple tables and registers at different levels. That is, a complete page table and a complete segment table reside in main memory 102, while various smaller cached portions of the data in these tables are stored in CPU 101 itself or in the L2 cache. There is also a translation mechanism (not shown) that, under certain conditions, translates directly from an effective address to a real address.

【００３８】ＣＰＵ１０１は、図１１のようにアドレス
変換をサポートするが、より簡単なアドレス指定もサポ
ートする。具体的には、好適実施例のＣＰＵ１０１
は、"タグ・アクティブ"・モードまたは"タグ非アクテ
ィブ"・モードのいずれかで動作する。モードが異なる
のは、アドレス指定の違いを示し、サポートするオペレ
ーティング・システムも異なる。マシン状態レジスタ
（専用レジスタ）のビットは、現在の動作モードを記録
する。上に述べた完全アドレス指定変換は"タグ非アク
ティブ"・モードで使用される。"タグ・アクティブ"・
モードでは、有効アドレスは仮想アドレスと同じである
（つまり、有効セグメントＩＤ８１１は、ルックアップ
なしに仮想セグメントＩＤ８１３に直接マップされるの
で、仮想セグメントＩＤの上位１６ビットは常に０であ
る）。ＣＰＵ１０１はまた、有効＝実アドレスのアドレ
ス指定モードでも動作する（後述）。Although CPU 101 supports address translation as shown in FIG. 11, it also supports simpler addressing. Specifically, CPU 101 of the preferred embodiment operates in either a "tag active" mode or a "tag inactive" mode. The two modes use different addressing and support different operating systems. A bit in the machine state register (a special-purpose register) records the current operating mode. The full address translation described above is used in "tag inactive" mode. In "tag active" mode, the effective address is the same as the virtual address (i.e., effective segment ID 811 maps directly to virtual segment ID 814 without a lookup, so the high-order 16 bits of the virtual segment ID are always 0). CPU 101 can also operate in an effective=real addressing mode (described later).

【００３９】有効アドレスから実アドレスへの変換に
は、複数のレベルのテーブル・ルックアップが必要であ
る。更にアドレス変換メカニズムの各部は、ＣＰＵチッ
プから離れたところに位置し、メモリ１０２に関連付け
られるのでこのメカニズムへのアクセスは、オンチップ
のキャッシュ・メモリへのアクセスよりかなり遅い。Ｅ
ＲＡＴ３０１は、アドレス変換メカニズムにより維持さ
れる情報の一部を記憶し、有効アドレスを実アドレスに
マップするので、ほとんどの場合、アドレス変換メカニ
ズムにアクセスする必要なしに、Ｌ１Ｉキャッシュ内
で有効アドレスと実アドレスの関連付けを高速に行える
小さいキャッシュと考えることができる。Translating an effective address to a real address requires multiple levels of table lookup. In addition, because parts of the address translation mechanism are located off the CPU chip and are associated with memory 102, access to this mechanism is considerably slower than access to on-chip cache memory. ERAT 301 stores a portion of the information maintained by the address translation mechanism and maps effective addresses to real addresses, so it can be regarded as a small cache that, in most cases, allows an effective address to be associated with a real address quickly within the L1 I-cache without accessing the address translation mechanism.

【００４０】命令ユニット２０１が、Ｉキャッシュ１０
６に命令を要求し、要求された命令の有効アドレスを提
供するとき、Ｉキャッシュは、要求された命令が実際に
キャッシュにあるかどうかを高速に判定し、存在する場
合は命令を返し、存在しない場合は命令をどこかから
（Ｌ２キャッシュ、メイン・メモリ等）取得する処理を
開始する必要がある。通常の場合、命令は実際にＬ１
Ｉキャッシュ１０６にあり、図４に示すように、Ｉキャ
ッシュ内で以下の処理が並行して発生する。ａ）命令装置２０１からの有効アドレスにより、ＥＲＡ
Ｔ３０１のエントリがアクセスされ、有効ページ番号と
これに関連する実ページ番号が導かれる。ｂ）命令装置２０１からの有効アドレスにより、ディレ
クトリ・アレイ３０２のエントリがアクセスされ、実ペ
ージ番号のペアが導かれる。ｃ）命令装置２０１からの有効アドレスにより、命令ア
レイ３０３のエントリがアクセスされ、命令を含むキャ
ッシュ・ラインのペアが導かれる。When instruction unit 201 requests an instruction from I-cache 106 and supplies the effective address of the requested instruction, the I-cache must quickly determine whether the requested instruction is actually in the cache, return the instruction if it is, and otherwise initiate action to obtain the instruction from elsewhere (L2 cache, main memory, etc.). In the normal case, where the instruction is actually in L1 I-cache 106, the following operations occur concurrently within the I-cache, as shown in FIG. 4:
a) The effective address from instruction unit 201 is used to access an entry in ERAT 301, yielding an effective page number and its associated real page number.
b) The effective address from instruction unit 201 is used to access an entry in directory array 302, yielding a pair of real page numbers.
c) The effective address from instruction unit 201 is used to access an entry in instruction array 303, yielding a pair of cache lines containing instructions.

【００４１】前記のいずれの場合でも、ＥＲＡＴ３１
０、ディレクトリ・アレイ３０２、命令アレイ３０３の
いずれか１つへの入力は、これらの構成要素のうち他の
いずれか１つの出力に依存しないので、前記の処理はい
ずれも、開始する前に他の完了を待機する必要がない。
次に、ＥＲＡＴ３０１、ディレクトリ・アレイ３０２、
命令アレイ３０３の出力は以下のように処理される。In each of the above cases, the input to any one of ERAT 301, directory array 302, and instruction array 303 does not depend on the output of any other of these components, so none of the above operations needs to wait for another to complete before it begins. The outputs of ERAT 301, directory array 302, and instruction array 303 are then processed as follows.

【００４２】ａ）ＥＲＡＴ３０１からの有効ページ番号
が、比較器３０４で、命令装置２０１からの有効アドレ
スの同じアドレス・ビットと比較される。一致する場合
はＥＲＡＴ"ヒット"がある。ｂ）ＥＲＡＴ３０１からの実ページ番号が、比較器３０
５、３０６で、ディレクトリ・アレイ３０２からの実ペ
ージ番号それぞれと比較される。いずれかが一致する場
合、また、ＥＲＡＴヒットがある場合、Ｉキャッシュ
・"ヒット"がある。つまり要求された命令は実際にＩキ
ャッシュ１０６に、具体的には命令アレイ３０３にあ
る。ｃ）ＥＲＡＴ３０１とディレクトリ・アレイ３０２から
の実ページ番号の比較の出力により、命令アレイ３０３
からのキャッシュ・ラインのペアのうち、所望の命令の
あるペアが選択される（選択マルチプレクサ３０７を使
用）。a) The effective page number from ERAT 301 is compared in comparator 304 with the same address bits of the effective address from instruction unit 201. If they match, there is an ERAT "hit".
b) The real page number from ERAT 301 is compared in comparators 305 and 306 with each of the real page numbers from directory array 302. If either matches and there is also an ERAT hit, there is an I-cache "hit"; that is, the requested instruction is actually in I-cache 106, specifically in instruction array 303.
c) The results of comparing the real page numbers from ERAT 301 and directory array 302 are used (via selection multiplexer 307) to select, from the pair of cache lines from instruction array 303, the one containing the desired instruction.

【００４３】これらの処理を並行して実行することで、
所望の命令が実際にＩキャッシュにある場合の遅延が最
小になる。所望の命令がＩキャッシュにあるかどうかに
かかわらず、命令装置２０１へのＩキャッシュの出力に
データが多少とも存在する。独立したＩキャッシュ・ヒ
ット信号により、出力データに実際に所望の命令がある
ことが命令装置２０１に示される。Ｉキャッシュ・ヒッ
ト信号がないとき、命令装置２０１は出力データを無視
する。キャッシュ・ミスの場合にＩキャッシュ１０６に
より実行される処理については後述する。Performing these operations concurrently minimizes the delay when the desired instruction is actually in the I-cache. Whether or not the desired instruction is in the I-cache, some data is always presented at the I-cache output to instruction unit 201. A separate I-cache hit signal indicates to instruction unit 201 that the output data actually contains the desired instruction. In the absence of the I-cache hit signal, instruction unit 201 ignores the output data. The actions taken by I-cache 106 in the event of a cache miss are described later.

【００４４】図５、図６は、ＥＲＡＴ３０１とこれに関
連する制御構造を詳しく示す。ＥＲＡＴ３０１は８２ビ
ット×１２８アレイである（つまり１２８のエントリが
あり、各エントリは８２ビットである）。ＥＲＡＴエン
トリはそれぞれ、有効アドレスの一部（ビット０乃至４
６）、実アドレスの一部（ビット２４乃至５１）、及び
追加ビットを含む（後述）。FIGS. 5 and 6 show ERAT 301 and its associated control structures in detail. ERAT 301 is an 82-bit × 128 array (i.e., it has 128 entries of 82 bits each). Each ERAT entry contains a portion of an effective address (bits 0 to 46), a portion of a real address (bits 24 to 51), and additional bits (described later).

【００４５】ＥＲＡＴ３０１は、制御ラインと共に、有
効アドレス（ＥＡ）のビット４５乃至５１のハッシュ関
数を構成することによってアクセスされる。制御ライン
は、マルチスレッドがアクティブかどうかを示す（好適
実施例のＣＰＵ設計では、マルチスレッドをオフにする
ことができる）マルチスレッド制御ライン（ＭＴ）と、
２つのスレッドのどちらがアクティブかを示すアクティ
ブ・スレッド・ライン（ＡｃｔＴ）の２つである。ハッ
シュ関数は以下の通りである。ERAT 301 is accessed by constructing a hash function of bits 45 to 51 of the effective address (EA) together with two control lines: a multithread control line (MT), which indicates whether multithreading is active (in the CPU design of the preferred embodiment, multithreading can be turned off), and an active thread line (ActT), which indicates which of the two threads is active. The hash function is as follows.

【数１】 (Equation 1)

【００４６】これは７ビット関数で、ＥＲＡＴの１２８
エントリを指定するには充分である。選択ロジック４０
１は、前記のハッシュ関数に従って対応するＥＲＡＴエ
ントリを選択する。This is a 7-bit function, sufficient to specify any of the 128 entries in the ERAT. Selection logic 401 selects the corresponding ERAT entry according to the above hash function.

【００４７】比較器３０４は、命令装置２０１により生
成される有効アドレスのビット０乃至４６を、選択され
たＥＲＡＴエントリの有効アドレス部と比較する。命令
装置２０１からの有効アドレスのビット４７乃至５１
は、ハッシュ関数を構成するために用いられたので、ビ
ット０乃至４６が一致すると、アドレスの完全有効ペー
ジ番号部、つまりビット０乃至５１の一致を保証するに
は充分である。これら２つのアドレス部の一致は、ＥＲ
ＡＴの実ページ番号（ＲＡ２４：５１）が、実際には、
命令装置２０１により指定される有効アドレスのページ
番号（ＥＡ０：５１）に対応する実ページ番号であるこ
とを意味する。そのため、ＥＲＡＴエントリに記憶され
た有効アドレス部は、厳密な意味ではなく有効ページ番
号と呼ばれることもある。ただし、好適実施例では、有
効ページ番号のビット０乃至４６しか含まれない。Comparator 304 compares bits 0 to 46 of the effective address generated by instruction unit 201 with the effective address portion of the selected ERAT entry. Because bits 47 to 51 of the effective address from instruction unit 201 were used to construct the hash function, a match of bits 0 to 46 is sufficient to guarantee a match of the full effective page number portion of the address, i.e., bits 0 to 51. A match of these two address portions means that the real page number in the ERAT entry (RA24:51) is in fact the real page number corresponding to the effective page number (EA0:51) specified by instruction unit 201. For this reason, the effective address portion stored in an ERAT entry is sometimes loosely called the effective page number, although in the preferred embodiment it contains only bits 0 to 46 of the effective page number.

【００４８】ＣＰＵ１０１は、場合によっては、有効＝
実モード（Ｅ＝Ｒ）等と表記される特別なアドレス指定
モードで動作する。このモードで動作しているとき、命
令装置２０１により生成される有効アドレスの下位４０
ビット（つまりＥＡ２４：６３）は実アドレス（ＲＡ２
４：６３）と同じである。通常このモードは、同じ実ア
ドレス位置に常に記憶される場合は比較的効率的に働く
低レベルのオペレーティング・システム機能用に予約さ
れている。図５、図６に示すように、制御ラインＥ＝Ｒ
がアクティブなとき、ＥＲＡＴ３０１は効果的にバイパ
スされる。つまり、選択マルチプレクサ４０２は、Ｅ＝
Ｒが偽のとき、選択されたＥＲＡＴエントリからＲＡ２
４：５１を実ページ番号（ＲＰＮ）として選択し、Ｅ＝
Ｒが真のときは、命令装置２０１からＥＡ２４：５１を
選択する。また、Ｅ＝Ｒが真なら、比較器３０４の比較
結果にかかわらず、ＥＲＡＴはヒットとみなされる。In some circumstances, CPU 101 operates in a special addressing mode called effective=real mode (E=R). When operating in this mode, the low-order 40 bits of the effective address generated by instruction unit 201 (i.e., EA24:63) are the same as the real address (RA24:63). This mode is typically reserved for certain low-level operating system functions that operate more efficiently if they are always stored at the same real address locations. As shown in FIGS. 5 and 6, ERAT 301 is effectively bypassed when control line E=R is active. That is, selection multiplexer 402 selects RA24:51 from the selected ERAT entry as the real page number (RPN) when E=R is false, and selects EA24:51 from instruction unit 201 when E=R is true. In addition, when E=R is true, the ERAT is deemed to hit regardless of the result of the comparison in comparator 304.
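The behaviour of selection multiplexer 402 in effective=real mode can be sketched as follows. This is a minimal model for illustration only; the function names and the `field` helper are hypothetical, and the 12-bit byte index is appended to form the 40-bit real address as described for FIG. 11.

```python
def field(value: int, first: int, last: int, width: int = 64) -> int:
    """Extract bits first..last (inclusive); bit 0 is the most significant bit."""
    nbits = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << nbits) - 1)


def select_rpn(ea: int, erat_entry_rpn: int, e_eq_r: bool) -> int:
    """Model of multiplexer 402: EA24:51 when E=R is true, else RA24:51 from the ERAT entry."""
    return field(ea, 24, 51) if e_eq_r else erat_entry_rpn


def real_address(ea: int, erat_entry_rpn: int, e_eq_r: bool) -> int:
    """40-bit real address: 28-bit real page number followed by byte index EA52:63."""
    return (select_rpn(ea, erat_entry_rpn, e_eq_r) << 12) | field(ea, 52, 63)


ea = 0x000000ABCDEF1234
bypassed = real_address(ea, erat_entry_rpn=0x1234567, e_eq_r=True)
translated = real_address(ea, erat_entry_rpn=0x1234567, e_eq_r=False)
```

With E=R true the result equals the low-order 40 bits of the effective address, which is the property the paragraph describes.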

【００４９】ＥＲＡＴは、先に図８とあわせて述べたア
ドレス変換メカニズムを効果的にバイパスするので、通
常のアドレス変換メカニズムに含まれるアクセス制御情
報を一部複製する。つまり有効アドレスから実アドレス
への変換は、通常は、セグメント・テーブル・メカニズ
ム８２１、ページ・テーブル・メカニズム８２２等に含
まれる追加情報によりアクセス権を確認する。ＥＲＡＴ
３０１は、この情報の一部をキャッシュして、これらア
ドレス変換メカニズムを参照する必要をなくす。ＥＲＡ
Ｔの動作について詳しくは、１９９７年１１月１０日付
米国特許出願第０８／９６６７０６号、"Effective-To-
Real Address Cache Managing Apparatus and Method"
を参照されたい。Because the ERAT effectively bypasses the address translation mechanism described above in connection with FIG. 11, it duplicates some of the access control information contained in the normal address translation mechanism. That is, translation of an effective address to a real address normally verifies access rights using additional information contained in segment table mechanism 821, page table mechanism 822, or elsewhere. ERAT 301 caches some of this information so that these address translation mechanisms need not be consulted. For further details on the operation of the ERAT, see U.S. patent application Ser. No. 08/966,706 of Nov. 10, 1997, "Effective-To-Real Address Cache Managing Apparatus and Method".

【００５０】ＥＲＡＴエントリは、パリティ・ビット、
保護ビット、アクセス制御ビットを含む。特にＥＲＡＴ
エントリはそれぞれ、キャッシュ禁止ビット、問題状態
ビット、アクセス制御ビットを含む。また、個別アレイ
４０３（１ビット×１２８）は、各ＥＲＡＴエントリに
関連付けられる１つの有効ビットを含む。更に、タグ・
モード・ビットのペアが個別レジスタ４０４に記憶され
る。アレイ４０３からの有効ビットは、対応するＥＲＡ
Ｔエントリが有効かどうかを記録する。様々な条件によ
り、プロセッサ・ロジック（図示せず）が有効ビットを
リセットする結果、対応するＥＲＡＴエントリへの後の
アクセスによってエントリが再ロードされる。キャッシ
ュ禁止ビットは、要求された命令をＩキャッシュ命令ア
レイ３０３に書込むことを禁止するために用いられる。
つまり、アドレス範囲には、ＥＲＡＴのエントリが含ま
れることがあるが、このアドレス範囲でＩキャッシュに
命令をキャッシュすることを避けたい場合がある。その
場合、このアドレス範囲の命令に対するリクエストによ
って、ライン・フィル・シーケンス・ロジック（後述）
が要求された命令を取得するが、命令はアレイ３０３に
書込まれない（ディレクトリ・アレイ３０２が更新され
ることもない）。問題状態ビットは、ＥＲＡＴエントリ
がロードされる時点で、実行中のスレッド（つまりスー
パバイザかユーザ）の"問題状態"を記録する。スーパバ
イザ状態で実行中のスレッドは、一般には、問題状態の
スレッドよりもアクセス権が大きい。ある状態でＥＲＡ
Ｔエントリがロードされた場合、問題状態はその後に変
更され、現在実行中のスレッドは、ＥＲＡＴエントリの
範囲のアドレスにはアクセスできない恐れが生じ、この
情報は、ＥＲＡＴがアクセスされるときに確認する必要
がある。アクセス制御ビットはまた、ＥＲＡＴエントリ
がロードされた時点でアクセス情報を記録し、また、ア
クセスの時点でチェックされる。タグ・モード・ビット
４０４は、プロセッサのタグ・モード（タグ・アクティ
ブかタグ非アクティブ）を、ＥＲＡＴがロードされたと
き記録する。ＥＲＡＴの各半分（６４エントリ）にタグ
・モード・ビット１つが関連付けられる。これはＥＲＡ
ＴＨＡＳＨ関数の０ビットを使って選択される。タグ
・モードは、有効アドレスの解釈に影響を与えるので、
タグ・モードの変更はつまり、ＥＲＡＴエントリの実ペ
ージ番号が信頼できるとみなされないことを意味する。
タグ・モードは、変更される場合は、あまり頻繁には変
更されないと想定される。従って、変更が検出された場
合、ＥＲＡＴの対応する半分の全てのエントリが無効と
マークされ、最終的には再ロードされる。ERAT entries contain parity bits, protection bits, and access control bits. In particular, each ERAT entry contains a cache inhibit bit, a problem state bit, and an access control bit. In addition, a separate array 403 (1 bit × 128) contains one valid bit associated with each ERAT entry. Further, a pair of tag mode bits is stored in a separate register 404. The valid bit from array 403 records whether the corresponding ERAT entry is valid. Under various conditions, processor logic (not shown) resets the valid bit, with the result that a later access to the corresponding ERAT entry causes the entry to be reloaded. The cache inhibit bit is used to inhibit writing the requested instruction to I-cache instruction array 303. That is, an address range may have an entry in the ERAT, yet it may be desirable to avoid caching instructions from that range in the I-cache. In that case, a request for an instruction in the address range causes the line fill sequence logic (described later) to obtain the requested instruction, but the instruction is not written to array 303 (nor is directory array 302 updated). The problem state bit records the "problem state" of the executing thread (i.e., supervisor or user) at the time the ERAT entry is loaded. A thread executing in supervisor state generally has greater access rights than a thread in problem state. If an ERAT entry is loaded in one state and the problem state subsequently changes, the currently executing thread may not be entitled to access addresses in the range of the ERAT entry, so this information must be checked when the ERAT is accessed. The access control bit likewise records access information at the time the ERAT entry is loaded and is checked at the time of access. Tag mode bits 404 record the processor's tag mode (tag active or tag inactive) at the time the ERAT was loaded. One tag mode bit is associated with each half (64 entries) of the ERAT and is selected using bit 0 of the ERAT hash function. Because the tag mode affects how effective addresses are interpreted, a change of tag mode means that the real page numbers in ERAT entries can no longer be considered reliable. The tag mode is expected to change infrequently, if at all. Therefore, when a change is detected, all entries in the corresponding half of the ERAT are marked invalid and are eventually reloaded.
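The per-half tag mode check can be pictured with a small Python sketch. Several details here are assumptions rather than statements from the text: that the 7-bit hash value directly indexes the 128 entries, so each half is a contiguous block of 64 entries; that hash bit 0 is the most significant bit of that value; and that the new tag mode is recorded at the moment the mismatch is detected. The function and variable names are hypothetical.

```python
ERAT_ENTRIES = 128
HALF = ERAT_ENTRIES // 2


def check_tag_mode(valid: list, tag_bits: list, hash7: int, msr_ta: int) -> bool:
    """Return True if the tag mode recorded for this half matches the current mode."""
    half = (hash7 >> 6) & 1      # hash bit 0, taken as the MSB of the 7-bit value
    if tag_bits[half] == msr_ta:
        return True
    start = half * HALF
    for i in range(start, start + HALF):
        valid[i] = False         # entries in this half get reloaded on later access
    tag_bits[half] = msr_ta      # assumption: the new mode is recorded here
    return False


valid = [True] * ERAT_ENTRIES    # stand-in for valid-bit array 403
tag_bits = [0, 0]                # stand-in for the two tag mode bits in register 404
ok = check_tag_mode(valid, tag_bits, hash7=0b1000101, msr_ta=1)
```

Only the half selected by the hash is invalidated; the other half keeps its entries until an access through it detects the same mismatch.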

【００５１】ＥＲＡＴロジック４０５は、セレクタ３０
４の出力、有効＝実モード、先に述べた様々なビット、
及びＣＰＵのマシン状態レジスタ（図示せず）のビット
をもとに、選択マルチプレクサのＲＰＮ出力の使用状態
とＥＲＡＴのメンテナンスを制御する信号を生成する。
特に、ロジック４０５は、ＥＲＡＴヒット信号４１０、
保護例外信号４１１、ＥＲＡＴミス信号４１２、及びキ
ャッシュ禁止信号４１３を生成する。ERAT logic 405 generates signals that control the use of the RPN output of the selection multiplexer and the maintenance of the ERAT, based on the output of comparator 304, the effective=real mode, the various bits described above, and bits in the CPU's machine state register (not shown). In particular, logic 405 generates ERAT hit signal 410, protection exception signal 411, ERAT miss signal 412, and cache inhibit signal 413.

【００５２】ＥＲＡＴヒット信号４１０は、選択マルチ
プレクサ４０２のＲＰＮ出力が、要求された有効アドレ
スに対応する真の実ページ番号として使用できることを
示す。この信号は、有効＝実のとき（ＥＲＡＴをバイパ
ス）、または比較器３０４が一致を検出し、保護例外が
なく、ＥＲＡＴミスを強制する特定条件がないときは、
アクティブである。これは以下のロジックで表せる。ERAT hit signal 410 indicates that the RPN output of selection multiplexer 402 can be used as the true real page number corresponding to the requested effective address. This signal is active when effective=real mode is in effect (bypassing the ERAT), or when comparator 304 detects a match, there is no protection exception, and no specific condition forces an ERAT miss. This can be expressed by the following logic.

【００５３】Ｍａｔｃｈ＿３０４は、比較器３０４から
の信号で命令装置２０１からのＥＡ０：４６がＥＲＡＴ
エントリのＥＡ０：４６と一致することを示し、Ｖａｌ
ｉｄはアレイ４０３からの有効ビットの値である。Match_304 is the signal from comparator 304 indicating that EA0:46 from instruction unit 201 matches EA0:46 of the ERAT entry, and Valid is the value of the valid bit from array 403.

【数２】 (Equation 2)
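The conditions stated in prose for ERAT hit signal 410 can be sketched as a boolean function. The sketch follows only the verbal description (effective=real mode, or a comparator match with a valid entry, no protection exception, and no forced miss); it is not a transcription of Equation 2, and the parameter names `protection_exception` and `force_miss` are hypothetical stand-ins for the conditions the text mentions.

```python
def erat_hit(e_eq_r: bool, match_304: bool, valid: bool,
             protection_exception: bool, force_miss: bool) -> bool:
    """Sketch of ERAT hit signal 410 as described in the text."""
    if e_eq_r:
        return True  # ERAT bypassed: treated as a hit regardless of comparator 304
    return match_304 and valid and not protection_exception and not force_miss


hit_bypass = erat_hit(True, False, False, False, False)
hit_normal = erat_hit(False, True, True, False, False)
miss_invalid = erat_hit(False, True, False, False, False)
miss_protect = erat_hit(False, True, True, True, False)
```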

【００５４】保護例外信号４１１は、ＥＲＡＴエントリ
は有効なデータを含むが、現在実行中のプロセスは所望
の命令へのアクセスを許可されないことを示す。ＥＲＡ
Ｔミス信号４１２は、要求されたＥＲＡＴエントリに所
望の実ページ番号がないか、または信頼できるとみなさ
れないことを示し、いずれの場合も、ＥＲＡＴエントリ
を再ロードする必要がある。キャッシュ禁止信号４１３
は、要求された命令が命令アレイ３０３にキャッシュさ
れるのを防ぐ。これらの信号は以下のロジックで導かれ
る。Protection exception signal 411 indicates that the ERAT entry contains valid data but the currently executing process is not permitted to access the desired instruction. ERAT miss signal 412 indicates that the requested ERAT entry either does not contain the desired real page number or cannot be considered reliable; in either case, the ERAT entry must be reloaded. Cache inhibit signal 413 prevents the requested instruction from being cached in instruction array 303. These signals are derived by the following logic.

【数３】 (Equation 3)

【００５５】ここで、ＥＲＡＴ（Ｐｒ）はＥＲＡＴエントリからの問題状態ビ
ットＥＲＡＴ（ＡＣ）はＥＲＡＴエントリからのアクセス制
御ビットＥＲＡＴ（ＣＩ）はＥＲＡＴエントリからのキャッシュ
禁止ビットＭＳＲ（ＴＡ）はマシン状態レジスタからのタグ・アク
ティブ・ビットＭＳＲ（Ｕｓ）はマシン状態レジスタからのユーザ状態
ビットＴａｇ＿４０４はレジスタ４０４からの選択済みタグ・
ビットHere:
ERAT(Pr) is the problem state bit from the ERAT entry;
ERAT(AC) is the access control bit from the ERAT entry;
ERAT(CI) is the cache inhibit bit from the ERAT entry;
MSR(TA) is the tag active bit from the machine state register;
MSR(Us) is the user state bit from the machine state register;
Tag_404 is the selected tag bit from register 404.

【００５６】図７は、Ｉキャッシュ・ディレクトリ・ア
レイ３０２とこれに関連する制御構造を詳しく示す。Ｉ
キャッシュ・ディレクトリ・アレイは、実ページ番号と
特定の制御ビットを保持する６６ビット×５１２アレイ
５０２と、ＭＲＵ（most-recently-used）ビットを記憶
する１ビット×５１２アレイ５０３を含む。アレイ５０
２及び５０３は物理的に独立しているが、論理的には１
つのアレイとして扱える。アレイ５０２は、論理的に２
セットに分けられる。各アレイ・エントリの最初の３３
ビットは最初のセット（セット０）に属し、各エントリ
の最後の３３ビットは第２セット（セット１）に属す
る。アレイ５０２の各エントリは、セット０に対応する
２８ビットの実ページ番号（つまり実アドレス・ビット
２４乃至５１）、セット０の４つの有効ビット、セット
０のパリティ・ビット、セット１の２８ビットの実ペー
ジ番号、セット１の４つの有効ビット、及びセット１の
パリティ・ビットを含む。FIG. 7 shows I-cache directory array 302 and its associated control structures in detail. The I-cache directory array comprises a 66-bit × 512 array 502, which holds real page numbers and certain control bits, and a 1-bit × 512 array 503, which stores most-recently-used (MRU) bits. Arrays 502 and 503 are physically separate but can be treated logically as a single array. Array 502 is logically divided into two sets: the first 33 bits of each array entry belong to the first set (set 0), and the last 33 bits of each entry belong to the second set (set 1). Each entry of array 502 contains a 28-bit real page number for set 0 (i.e., real address bits 24 to 51), four valid bits for set 0, a parity bit for set 0, a 28-bit real page number for set 1, four valid bits for set 1, and a parity bit for set 1.
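The 66-bit directory entry can be modeled as two 33-bit halves, each holding a 28-bit real page number, four valid bits, and one parity bit (28 + 4 + 1 = 33). The sketch below is illustrative only: the ordering of fields within each half and the treatment of the "first" 33 bits as the most significant are assumptions, and the type and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class DirSet(NamedTuple):
    rpn: int                   # 28-bit real page number (RA24:51)
    valid: Tuple[bool, ...]    # one valid bit per cache subline (4 sublines)
    parity: int                # parity bit


def unpack_set(bits33: int) -> DirSet:
    """Assumed field order within a 33-bit half: RPN, 4 valid bits, parity."""
    rpn = (bits33 >> 5) & ((1 << 28) - 1)
    nibble = (bits33 >> 1) & 0xF
    valid = tuple(bool((nibble >> (3 - i)) & 1) for i in range(4))
    return DirSet(rpn, valid, bits33 & 1)


def unpack_entry(entry66: int) -> Tuple[DirSet, DirSet]:
    """Split a 66-bit entry into set 0 (first 33 bits) and set 1 (last 33 bits)."""
    return unpack_set(entry66 >> 33), unpack_set(entry66 & ((1 << 33) - 1))


# Build an example entry and unpack it again.
half0 = (0xABCDEF1 << 5) | (0b1010 << 1) | 1
half1 = (0x1234567 << 5) | (0b0001 << 1) | 0
set0, set1 = unpack_entry((half0 << 33) | half1)
```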

【００５７】図８は、Ｉキャッシュ命令アレイ３０３と
これに関連する制御構造を詳しく示す。Ｉキャッシュ命
令アレイ３０３は、６４バイト×２０４８アレイを含
み、これは、ディレクトリ・アレイ５０２と同様、論理
的に２つのセットに分けられる。各アレイ・エントリの
最初の３２バイトはセット０に属し、残り３２バイトは
セット１に属する。命令アレイ３０３の各エントリは、
セット０に対応しプロセッサで実行できる８つの命令
（それぞれ４バイト）と、セット１に対応しプロセッサ
で実行できる８つの命令（それぞれ４バイト）を含む。FIG. 8 shows I-cache instruction array 303 and its associated control structures in detail. I-cache instruction array 303 comprises a 64-byte × 2048 array which, like directory array 502, is logically divided into two sets. The first 32 bytes of each array entry belong to set 0, and the remaining 32 bytes belong to set 1. Each entry of instruction array 303 contains eight processor-executable instructions (4 bytes each) for set 0 and eight processor-executable instructions (4 bytes each) for set 1.

【００５８】ディレクトリ・アレイ５０２の各エントリ
は、命令アレイ３０３の４つのエントリで構成される連
続したグループに関連付けられる。１つのセット（セッ
ト０または１）に含まれる４エントリのこの連続グルー
プは、キャッシュ・ラインと呼ばれ、いずれかのセット
に含まれるシングル・エントリはキャッシュ・サブライ
ンと呼ばれる。選択ロジック６０１は、各エントリ（つ
まり、キャッシュ・サブラインのペアで、セット０、及
びセット１それぞれから１つ）に独立にアクセスできる
が、各キャッシュ・ラインまたは４つのサブラインのグ
ループに対応する、ディレクトリ・アレイ５０２の実ペ
ージ番号は１つだけである。そのため、キャッシュ・ラ
インを構成する４つのキャッシュ・サブラインは、１回
のキャッシュ・ライン・フィル動作でグループとして埋
められる（後述）。Each entry in directory array 502 is associated with a contiguous group of four entries in instruction array 303. Such a contiguous group of four entries within one set (set 0 or set 1) is called a cache line, and a single entry within either set is called a cache subline. Selection logic 601 can access each entry (i.e., a pair of cache sublines, one from each of set 0 and set 1) independently, but directory array 502 holds only one real page number for each cache line, i.e., for each group of four sublines. For this reason, the four cache sublines making up a cache line are filled as a group in a single cache line fill operation (described later).

【００５９】好適実施例では、命令アレイ３０３のキャ
ッシュ・ラインは、１２８バイトを含み、キャッシュ・
ラインの空間内でバイトを指定するために７アドレス・
ビット（アドレス・ビット５７乃至６３）を必要とす
る。アドレス・ビット５７及び５８は、キャッシュ・ラ
イン内の４つのキャッシュ・サブラインのうち１つを指
定する。キャッシュ・ラインの実アドレスは、実アドレ
ス・ビット２４乃至５６で指定される。有効アドレス・
ビット４８乃至５６（キャッシュ・ラインの下位アドレ
ス・ビットに対応する）は、アレイ５０２及び５０３の
エントリを選択するのに用いられる。選択ロジック５０
１は、これらアドレス・ビットの直接デコードである。
事実上これは簡単なハッシュ関数である。つまり、有効
アドレス・ビット４８乃至５６に可能な組み合わせは２^9
あるが、キャッシュ・ラインに可能な実アドレス２^33
個（実アドレス・ビット２４乃至５６の組み合わせ）が
このアレイにマップされる。同様に、有効アドレス・ビ
ット４８乃至５８（キャッシュ・サブラインの下位アド
レス・ビットに対応する）は、命令アレイ３０３のエン
トリを選択するのに用いられ、選択ロジック６０１は、
これらアドレス・ビットの直接デコードである。命令ア
レイ３０３のキャッシュ・サブラインの実アドレスは、
有効アドレス・ビット５２乃至５８（ＥＡ５２：５８）
と連結された、ディレクトリ・アレイ５０２の対応する
エントリとセットの実ページ番号（ＲＡ２４：５１）で
ある。In the preferred embodiment, a cache line in instruction array 303 contains 128 bytes, so seven address bits (address bits 57 to 63) are needed to specify a byte within the space of a cache line. Address bits 57 and 58 specify one of the four cache sublines within a cache line. The real address of a cache line is specified by real address bits 24 to 56. Effective address bits 48 to 56 (corresponding to the low-order address bits of the cache line) are used to select an entry in arrays 502 and 503. Selection logic 501 is a direct decode of these address bits. In effect, this is a simple hash function: there are 2^9 possible combinations of effective address bits 48 to 56, but 2^33 possible cache line real addresses (combinations of real address bits 24 to 56) map onto this array. Similarly, effective address bits 48 to 58 (corresponding to the low-order address bits of the cache subline) are used to select an entry in instruction array 303, and selection logic 601 is a direct decode of these address bits. The real address of a cache subline in instruction array 303 is the real page number (RA24:51) from the corresponding entry and set of directory array 502, concatenated with effective address bits 52 to 58 (EA52:58).
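The index arithmetic in this paragraph can be checked with a short sketch: nine bits (EA48:56) give 2^9 = 512 directory entries, and eleven bits (EA48:58) give 2^11 = 2048 instruction array entries, so each directory entry covers four consecutive instruction array entries. The function names and the `field` helper are hypothetical; bit 0 is the most significant bit.

```python
def field(value: int, first: int, last: int, width: int = 64) -> int:
    """Extract bits first..last (inclusive); bit 0 is the most significant bit."""
    nbits = last - first + 1
    return (value >> (width - 1 - last)) & ((1 << nbits) - 1)


def directory_index(ea: int) -> int:
    return field(ea, 48, 56)   # selects one of 512 entries in arrays 502 and 503


def instruction_index(ea: int) -> int:
    return field(ea, 48, 58)   # selects one of 2048 entries in instruction array 303


def subline_in_line(ea: int) -> int:
    return field(ea, 57, 58)   # one of four 32-byte sublines in a 128-byte line


def subline_real_address(rpn: int, ea: int) -> int:
    """RA24:51 from the directory concatenated with EA52:58 (identifies a 32-byte subline)."""
    return (rpn << 7) | field(ea, 52, 58)


ea = 0xBEEF
d = directory_index(ea)
i = instruction_index(ea)
s = subline_in_line(ea)
```

For any effective address, the instruction array index equals four times the directory index plus the subline number, which is the grouping of four sublines per cache line described earlier.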

【００６０】各エントリに２つの実ページ番号（セット
０及びセット１から）があるので、Ｉキャッシュ・ディ
レクトリには、有効アドレス・ビット４８乃至５６のそ
れぞれ９ビットの組み合わせに対応する２つの実ページ
番号（及び命令アレイ３０３に２つのキャッシュ・ライ
ン）がある。この特性から、スレッド間のＩキャッシュ
の競合を避けることができる。Because each entry holds two real page numbers (one from set 0 and one from set 1), the I-cache directory contains two real page numbers (and instruction array 303 contains two cache lines) for each 9-bit combination of effective address bits 48 to 56. This property makes it possible to avoid I-cache contention between threads.

【００６１】選択ロジック５０１は、疎なハッシュ関数
として機能するのでアレイ５０２のエントリに含まれる
実ページ番号のいずれかが、所望の命令の完全有効アド
レス・ページ番号に対応することの保証はない。対応を
確認するため、選択された両方の実ページ番号が、比較
器３０５及び３０６を使って、ＥＲＡＴ３０１のページ
番号出力４１１と同時に比較される。この比較と同時に
有効アドレス・ビット５７、５８により、アレイ５０２
の選択されたエントリから、セット０の４つの有効ビッ
トのうち対応する１つ（セレクタ５０４）と、セット１
の４つの有効ビットのうち１つ（セレクタ５０５）が選
択される。選択される有効ビットは、所望の命令のキャ
ッシュ・サブラインに対応する。これらは、対応する比
較器３０５、３０６の出力とのＡＮＤが取られ、それぞ
れのセットの一致を示す信号のペアが生成される。これ
らの信号の論理ＯＲは、ＥＲＡＴヒット信号４１０との
ＡＮＤが取られ、所望の命令が実際にＬ１Ｉキャッシ
ュにあることを示すＩキャッシュ・ヒット信号５１０が
生成される。Because selection logic 501 acts as a sparse hash function, there is no guarantee that either real page number contained in the selected entry of array 502 corresponds to the full effective page number of the desired instruction. To verify correspondence, both selected real page numbers are compared simultaneously with real page number output 411 of ERAT 301 using comparators 305 and 306. Concurrently with this comparison, effective address bits 57 and 58 are used to select, from the selected entry of array 502, the corresponding one of the four valid bits of set 0 (selector 504) and the corresponding one of the four valid bits of set 1 (selector 505). The selected valid bits correspond to the cache subline of the desired instruction. Each is ANDed with the output of the corresponding comparator 305 or 306 to produce a pair of signals indicating a match in the respective set. The logical OR of these signals is ANDed with ERAT hit signal 410 to produce I-cache hit signal 510, which indicates that the desired instruction is actually in the L1 I-cache.
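The combination of comparators 305 and 306, valid-bit selectors 504 and 505, and the ERAT hit signal can be sketched as below. This is a software model for illustration, not the hardware logic itself; the function signature and the representation of the valid bits as 4-element sequences indexed by subline number are assumptions.

```python
from typing import Sequence, Tuple


def icache_hit(erat_rpn: int, erat_hit: bool,
               dir_rpn0: int, valid0: Sequence[bool],
               dir_rpn1: int, valid1: Sequence[bool],
               subline: int) -> Tuple[bool, bool, bool]:
    """Return (I-cache hit 510, set 0 match, set 1 match).

    `subline` stands for EA57:58 and picks one of the four valid bits per set.
    """
    match0 = (erat_rpn == dir_rpn0) and valid0[subline]  # comparator 305 AND selector 504
    match1 = (erat_rpn == dir_rpn1) and valid1[subline]  # comparator 306 AND selector 505
    hit = erat_hit and (match0 or match1)                # OR of matches, ANDed with 410
    return hit, match0, match1


result_hit = icache_hit(0x1234567, True,
                        0x0000001, (True, True, True, True),
                        0x1234567, (False, False, True, False), subline=2)
result_invalid = icache_hit(0x1234567, True,
                            0x0000001, (True, True, True, True),
                            0x1234567, (False, False, True, False), subline=0)
result_erat_miss = icache_hit(0x1234567, False,
                              0x1234567, (True, True, True, True),
                              0x0000002, (True, True, True, True), subline=1)
```

A matching real page number alone is not enough: the valid bit for the addressed subline must also be set, and the ERAT must have hit.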

【００６２】先に説明したように、選択ロジック６０１
は、命令装置により与えられる所望の命令の有効アドレ
スを使用して、命令アレイ３０３のエントリ（"サブラ
イン"のペア）にアクセスする。セレクタ６０２は、ア
レイ３０３のセット０からのサブラインか、または、キ
ャッシュ書込みバス６０４からのバイパス・サブライン
値を選択する。バイパス・サブライン値は、キャッシュ
・ラインがキャッシュ・ミスの後に埋められているとき
に使用される。その場合、新しいキャッシュ・サブライ
ン値が外部ソースから使用できるとすぐにキャッシュ書
込みバス６０４から得られるので、最初に命令アレイ３
０３に書込む必要がない。従って、キャッシュ・フィル
動作中に命令アレイをバイパスすることで、少しの時間
が節約される。バイパス・サブライン値はまた、キャッ
シュ禁止ライン４１３がアクティブなときにも使用され
る。As described above, selection logic 601 uses the effective address of the desired instruction, supplied by the instruction unit, to access an entry (a pair of "sublines") in instruction array 303. Selector 602 selects either the subline from set 0 of array 303 or a bypass subline value from cache write bus 604. The bypass subline value is used while a cache line is being filled after a cache miss. In that case the new cache subline value is taken from cache write bus 604 as soon as it becomes available from the external source, so it does not first have to be written to instruction array 303. Bypassing the instruction array during a cache fill operation therefore saves a small amount of time. The bypass subline value is also used when cache inhibit line 413 is active.

【００６３】セレクタ６０３は、セット選択ライン５１
１の値に応じて、セレクタ６０２の出力かまたはアレイ
３０３のセット１からのサブラインを選択する。セット
選択ライン５１１は、キャッシュのセット１の半分でキ
ャッシュ・ヒットがあった場合はＨＩＧＨである。つま
り比較器３０６は、ＥＲＡＴからの実ページ番号４１１
とディレクトリ・アレイ５０２の選択されたエントリか
らのセット１実ページ番号との一致を検出する。セレク
タ５０５により選択される対応するサブライン有効ビッ
トは有効で、セット選択ライン５１１はＨＩＧＨにな
り、セレクタ６０３は、アレイ３０３のセット１からサ
ブラインを選択する。他の場合では（キャッシュ・ミス
を含む）、セレクタ６０２の出力が選択される。セレク
タ６０３の出力は、連続メモリ位置からの８つの命令を
表す３２バイトのデータである。これは、順次バッファ
２０３、スレッド・バッファ２０４、または分岐バッフ
ァに書込むために命令装置２０１に送られる。キャッシ
ュ・ミスが生じた場合、Ｉキャッシュ・ヒット・ライン
５１０はＬＯＷになり、セレクタ６０３の出力は無視さ
れる（つまり、命令装置２０１のバッファの１つに書込
まれない）。キャッシュ・ヒットがあった場合（ライン
５１０がアクティブ）、選択されたディレクトリ・エン
トリに対応するアレイのＭＲＵビットが、セット選択ラ
イン５１１の値で更新される。Selector 603 selects either the output of selector 602 or the subline from set 1 of array 303, according to the value of set select line 511. Set select line 511 is HIGH when there is a cache hit in the set 1 half of the cache. That is, when comparator 306 detects a match between real page number 411 from the ERAT and the set 1 real page number from the selected entry of directory array 502, and the corresponding subline valid bit selected by selector 505 is valid, set select line 511 goes HIGH and selector 603 selects the subline from set 1 of array 303. In all other cases (including a cache miss), the output of selector 602 is selected. The output of selector 603 is 32 bytes of data representing eight instructions from consecutive memory locations. It is sent to instruction unit 201 to be written to sequential buffer 203, thread buffer 204, or the branch buffer. If a cache miss occurs, I-cache hit line 510 goes LOW and the output of selector 603 is ignored (that is, it is not written to any of the buffers in instruction unit 201). If there is a cache hit (line 510 active), the MRU bit of the array corresponding to the selected directory entry is updated with the value of set select line 511.
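The two-stage selection performed by selectors 602 and 603 can be modelled as a pair of multiplexers. This is a minimal sketch; the function name and the `use_bypass` control input are hypothetical, standing in for whatever condition (a line fill in progress, or cache inhibit line 413 active) routes the bypass value through selector 602.

```python
def select_subline(set0_subline: bytes, set1_subline: bytes,
                   bypass_subline: bytes, use_bypass: bool,
                   set_select: bool) -> bytes:
    """Model selectors 602 and 603 for one 32-byte subline."""
    # Selector 602: set 0 subline, or the bypass value from cache write bus 604.
    stage_602 = bypass_subline if use_bypass else set0_subline
    # Selector 603: set 1 subline when set select line 511 is HIGH,
    # otherwise the output of selector 602.
    return set1_subline if set_select else stage_602
```

The result is forwarded to the instruction unit only when I-cache hit line 510 is active; that gating is left to the caller in this sketch.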

【００６４】上の説明は、検索される命令が実際にＩキ
ャッシュにある状況に関する。Ｉキャッシュ・ミスがあ
るとき、可能性は２つある。ａ）ＥＲＡＴヒットがある
が、命令は命令アレイにないか、ｂ）ＥＲＡＴミスがあ
る。ＥＲＡＴヒットがある場合、所望のキャッシュ・ラ
インをかなり高速に埋めることができる。実ページ番号
はＥＲＡＴにあるので、所望のデータはメイン・メモリ
にあることがわかる（また、Ｌ２キャッシュにある可能
性もある）。Ｌ１Ｉキャッシュ１０６のロジックで、
ＥＲＡＴデータから所望の命令の完全実アドレスを構成
することは、外部のアドレス変換メカニズムにアクセス
することなしに可能であり、このデータはＬ２キャッシ
ュまたはメモリから直接フェッチすることもできる。Ｅ
ＲＡＴミスがあった場合、所望の命令の実アドレスを構
成するために、また必要に応じて新しい実ページ番号で
ＥＲＡＴを更新するために、外部のアドレス変換メカニ
ズムにアクセスする必要がある。その場合、所望のデー
タはメイン・メモリには全く存在しない可能性があり、
ディスク・ドライブ等の２次記憶装置から読出す必要が
ある。理論的には、所望の命令が実際に命令アレイ３０
３にあるときにＥＲＡＴミスの可能性があるが、これは
希な事例とみなされる。従って、ＥＲＡＴミスがあった
ときには、命令アレイのライン・フィルが同時に開始さ
れる。The above description concerns the situation in which the instruction being sought is actually in the I-cache. When there is an I-cache miss, there are two possibilities: a) there is an ERAT hit but the instruction is not in the instruction array, or b) there is an ERAT miss. If there is an ERAT hit, the desired cache line can be filled considerably faster. Because the real page number is in the ERAT, the desired data is known to be in main memory (and possibly in the L2 cache). The logic of L1 I-cache 106 can construct the full real address of the desired instruction from the ERAT data without accessing an external address translation mechanism, and the data can be fetched directly from the L2 cache or from memory. If there is an ERAT miss, an external address translation mechanism must be accessed to construct the real address of the desired instruction and, where necessary, to update the ERAT with the new real page number. In that case the desired data may not be in main memory at all and may have to be read from secondary storage such as a disk drive. In theory an ERAT miss can occur even when the desired instruction is actually in instruction array 303, but this is regarded as a rare case. Accordingly, when there is an ERAT miss, a line fill of the instruction array is initiated at the same time.
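The three outcomes distinguished in this paragraph can be summarised as a small decision function. The enum and function names are hypothetical and serve only to restate the cases.

```python
from enum import Enum


class FetchOutcome(Enum):
    HIT = "instruction supplied from the I-cache"
    FAST_LINE_FILL = "ERAT hit, I-cache miss: fill using the ERAT real page number"
    TRANSLATE_AND_FILL = "ERAT miss: external translation, then line fill"


def classify_fetch(erat_hit: bool, icache_hit: bool) -> FetchOutcome:
    """Classify one fetch from ERAT hit signal 410 and I-cache hit signal 510."""
    if not erat_hit:
        # An I-cache hit cannot be signalled without an ERAT hit, because
        # signal 510 is ANDed with signal 410.
        return FetchOutcome.TRANSLATE_AND_FILL
    if icache_hit:
        return FetchOutcome.HIT
    return FetchOutcome.FAST_LINE_FILL
```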

【００６５】図９、図１０は、主な高速ライン・フィル
・シーケンサのロジック、つまり、ＥＲＡＴヒットがあ
ったとき、また、キャッシュ・ミスの場合にキャッシュ
・ライン・フィルを生成する制御ロジックを示す。高速
ライン・フィル・シーケンサ・ロジックは、ライン・フ
ィル開始ロジック７０１と、ライン・フィル操作の完了
を保留するライン・フィル・リクエスト・パラメータを
記憶するレジスタ７１０、７１１のペア（それぞれＬＦ
Ａｄｄｒ０、ＬＦＡｄｄｒ１と示す）を含む。FIGS. 9 and 10 show the logic of the main fast line fill sequencer, that is, the control logic that generates a cache line fill when there is an ERAT hit together with a cache miss. The fast line fill sequencer logic includes line fill start logic 701 and a pair of registers 710, 711 (designated LFAddr0 and LFAddr1, respectively) that store the parameters of line fill requests pending completion of the line fill operation.

【００６６】ＬＦＡｄｄｒレジスタ７１０、７１１はそ
れぞれ２つのスレッドの一方に対応する。つまりＬＦＡ
ｄｄｒ０７１０はスレッド０に、ＬＦＡｄｄｒ１７
１１はスレッド１に対応する。命令装置２０１が、スレ
ッド０の実行中に命令リクエストを出すと、リクエスト
・パラメータがＬＦＡｄｄｒ０レジスタ７１０に記憶さ
れ、同様に、スレッド１の実行中のリクエストはＬＦＡ
ｄｄｒ１レジスタ７１１に記憶される。（マルチスレッ
ドがオフの場合、ＬＦＡｄｄｒ０レジスタ７１０しか用
いられない。）ＬＦＡｄｄｒレジスタ７１０、７１１は
それぞれ、１つのライン・フィル・リクエストしか記憶
しない。従って、あるスレッドで、同じスレッドについ
て未決のライン・フィル・リクエストが保留されている
間、ＥＲＡＴヒットとＩキャッシュ・ミスがあった場
合、２つ目のリクエストは遅らせる必要がある。The LFAddr registers 710, 711 each correspond to one of the two threads: LFAddr0 710 corresponds to thread 0 and LFAddr1 711 to thread 1. When instruction unit 201 issues an instruction request while thread 0 is executing, the request parameters are stored in LFAddr0 register 710; likewise, a request issued while thread 1 is executing is stored in LFAddr1 register 711. (When multithreading is off, only LFAddr0 register 710 is used.) Each of LFAddr registers 710, 711 stores only one line fill request. Therefore, if a thread experiences an ERAT hit and an I-cache miss while a line fill request for that same thread is still pending, the second request must be delayed.

【００６７】ＬＦＡｄｄｒレジスタはそれぞれ、有効ア
ドレス・ビット４８乃至５８（ＥＡ４８：５８）、実ア
ドレス・ビット２４乃至５１（ＲＡ２４：５１）、セッ
ト・ビット、及びリクエスト未決（"Ｒ"）ビットを含
む。アドレス・ビットは、埋められるキャッシュ・ライ
ンのメモリの実アドレスを生成し、キャッシュ・ライン
が返されたときにディレクトリ・アレイ５０２と命令ア
レイ３０３に書込むために用いられる。セット・ビット
は、ディレクトリ・アレイ５０２と命令アレイ３０３の
いずれのセット（セット０またはセット１）に書込まれ
るかを判定する。リクエスト未決（"Ｒ"）ビットは、Ｌ
ＦＡｄｄｒレジスタに未決リクエストが入ったときに１
に設定され、ライン・フィル・リクエストが完了すると
リセットされる（リセット・ロジックは図示せず）。Each LFAddr register contains effective address bits 48-58 (EA48:58), real address bits 24-51 (RA24:51), a set bit, and a request pending ("R") bit. The address bits are used to generate the real memory address of the cache line to be filled and, when the cache line is returned, to write to directory array 502 and instruction array 303. The set bit determines which set (set 0 or set 1) of directory array 502 and instruction array 303 is written. The request pending ("R") bit is set to 1 when a pending request is placed in the LFAddr register and is reset when the line fill request completes (reset logic not shown).
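The per-thread line fill registers can be pictured as small records holding exactly one outstanding request each. The sketch below is a hypothetical software model: the class name, field names, and the convention of returning `False` when a request must be delayed are assumptions, while the field widths follow the bit ranges given in the text.

```python
from dataclasses import dataclass


@dataclass
class LFAddr:
    """Hypothetical model of one line fill address register (710 or 711)."""
    ea_48_58: int = 0       # effective address bits 48-58 (11 bits)
    ra_24_51: int = 0       # real address bits 24-51 (28-bit real page number)
    set_bit: int = 0        # set 0 or set 1
    pending: bool = False   # request pending ("R") bit

    def load(self, ea_48_58: int, ra_24_51: int, set_bit: int) -> bool:
        """Store a request; return False if one is pending and the caller must wait."""
        if self.pending:
            return False
        self.ea_48_58 = ea_48_58 & 0x7FF
        self.ra_24_51 = ra_24_51 & 0xFFFFFFF
        self.set_bit = set_bit & 1
        self.pending = True
        return True

    def complete(self) -> None:
        """Reset the "R" bit when the line fill finishes."""
        self.pending = False


# One register per thread: index 0 models LFAddr0 710, index 1 models LFAddr1 711.
lf_addr = [LFAddr(), LFAddr()]
```

Because each thread has its own register, a pending fill for thread 0 does not prevent thread 1 from starting its own fill.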

【００６８】ライン・フィル開始ロジック７０１は、入
力としてＥＲＡＴヒット・ライン４１０、Ｉキャッシュ
・ヒット・ライン５１０、どのスレッドがアクティブか
を指定するアクティブ・スレッド制御ライン（Ａｃｔ
Ｔ）、及びＬＦＡｄｄｒ０レジスタ７１０とＬＦＡｄｄ
ｒ１レジスタ７１１からのリクエスト未決ビット（それ
ぞれ"Ｒ０"、"Ｒ１"と示す）を受け取る。ライン・フィ
ル・リクエストは、ＥＲＡＴヒットがあるとき、Ｉキャ
ッシュ・ミスがあるとき、また、現在アクティブなスレ
ッドに対応するＬＦＡｄｄｒレジスタにライン・フィル
・リクエストが現在保留されていないときに開始される
（ライン・フィル・リクエスト・ライン７０３がアクテ
ィブになる）。ＥＲＡＴヒットとＩキャッシュ・ミスが
あり、現在アクティブなスレッドに対応するＬＦＡｄｄ
ｒレジスタにライン・フィル・リクエストが保留されて
いる場合、Ｉキャッシュは、保留中のライン・フィル・
リクエストが完了する（"Ｒ"ビットをリセットする）ま
で待機してから、新しいライン・フィル・リクエストを
開始する。これらの入力と出力の論理関係は以下のよう
に表せる。Line fill start logic 701 receives as inputs ERAT hit line 410, I-cache hit line 510, an active thread control line (ActT) specifying which thread is active, and the request pending bits from LFAddr0 register 710 and LFAddr1 register 711 (designated "R0" and "R1", respectively). A line fill request is initiated (line fill request line 703 becomes active) when there is an ERAT hit, there is an I-cache miss, and no line fill request is currently pending in the LFAddr register corresponding to the currently active thread. If there is an ERAT hit and an I-cache miss but a line fill request is pending in the LFAddr register corresponding to the currently active thread, the I-cache waits until the pending line fill request completes (resetting the "R" bit) before initiating the new line fill request. The logical relationship between these inputs and the output can be expressed as follows.

【数４】 (Equation 4)
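The equation itself is not reproduced in this text. As a reading aid only, the condition stated in the preceding paragraph can be written as a Boolean function; the function and parameter names below are hypothetical, and the expression is inferred from the prose rather than copied from the original equation.

```python
def line_fill_request(erat_hit: bool, icache_hit: bool,
                      act_t: int, r0: bool, r1: bool) -> bool:
    """Sketch of line fill request line 703, inferred from the description.

    act_t is the active thread (0 or 1); r0 and r1 are the request pending
    bits of LFAddr0 and LFAddr1.
    """
    pending_for_active_thread = r1 if act_t == 1 else r0
    return erat_hit and not icache_hit and not pending_for_active_thread
```

Under this reading, a pending request for the other thread does not block a new request.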

【００６９】ライン・フィル・リクエストが開始される
と、ライン・フィル開始ロジックが、書込み信号７０
４、７０５を生成し、リクエスト・パラメータがＬＦＡ
ｄｄｒレジスタ７１０、７１１に書込まれる。書込み信
号７０４、７０５のいずれか１つは常にアクティブであ
り得る。書込み信号７０４、７０５のいずれか１つがア
クティブになると、ＥＡ４８：５８（Ｌ１Ｉキャッシ
ュ・アドレス・バス２３１から）、ＲＡ２４：５１（パ
ス４１１、ＥＲＡＴ３０１から）、及びセット・ロジッ
ク７２０からのセット・ビットが、現在アクティブなス
レッドに対応するＬＦＡｄｄｒレジスタに記憶される。
同時に、レジスタのリクエスト未決ビットが１に設定さ
れる。書込み信号は、論理的には以下のように導かれ
る。When a line fill request is initiated, the line fill start logic generates write signals 704, 705, which cause the request parameters to be written to LFAddr registers 710, 711. Only one of write signals 704, 705 can be active at any given time. When either write signal 704 or 705 becomes active, EA48:58 (from L1 I-cache address bus 231), RA24:51 (from ERAT 301 via path 411), and the set bit from set logic 720 are stored in the LFAddr register corresponding to the currently active thread. At the same time, the request pending bit of that register is set to 1. The write signals are derived logically as follows.

【数５】 (Equation 5)
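The equation itself is not reproduced in this text. A sketch consistent with the preceding paragraph gates the line fill request with the active thread. The names are hypothetical, and the assignment of signal 704 to LFAddr0 and signal 705 to LFAddr1 is an assumption, since the text does not state which signal drives which register.

```python
from typing import Tuple


def write_signals(line_fill_request: bool, act_t: int) -> Tuple[bool, bool]:
    """Return (write to LFAddr0, write to LFAddr1) for the active thread act_t.

    Assumption: the first element models signal 704, the second signal 705.
    """
    wr0 = line_fill_request and act_t == 0
    wr1 = line_fill_request and act_t == 1
    return wr0, wr1
```

With this formulation the two write signals can never be active together, matching the statement in the text.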

【００７０】ディレクトリ・アレイ５０２と命令アレイ
３０３は２セット（セット０とセット１）に分けられ、
それぞれ同じハッシュ関数で索引が付けられるので、ラ
イン・フィル・リクエストからのキャッシュ・ライン
は、論理的にいずれかのセットに書込める。キャッシュ
・ラインが書込まれるセットは、ライン・フィル・リク
エストが出された時点でセット・ロジック７２０により
判定され、対応するＬＦＡｄｄｒレジスタのセット・ビ
ットに記憶される。一般に、選択されるセットは、埋め
られるキャッシュ・ラインのＬＲＵセットである。つま
りセットは、ハッシュ関数により索引が付けられるディ
レクトリ・アレイ５０２のエントリに対応するＭＲＵビ
ットの反転である。ただし、アクティブではないスレッ
ドで未決ライン・フィル・リクエストがあり、この未決
ライン・フィルで同じキャッシュ・ラインが埋められる
特別な場合では、選択されるセットは、アクティブでは
ないスレッドに対する未決ライン・フィル・リクエスト
に選択されるセットとは反対である。従って、ライン・
フィル・リクエストが開始される時点でセットを固定す
ると、ライブ・ロック（つまり、２つの未決ライン・フ
ィル・リクエストが同じセットに書込もうとする状況）
が発生する可能性は回避される。Because directory array 502 and instruction array 303 are each divided into two sets (set 0 and set 1) indexed by the same hash function, a cache line returned by a line fill request could logically be written to either set. The set to which the cache line is written is determined by set logic 720 at the time the line fill request is issued and is stored in the set bit of the corresponding LFAddr register. In general, the set chosen is the LRU set for the cache line being filled; that is, the set is the inverse of the MRU bit corresponding to the entry of directory array 502 indexed by the hash function. However, in the special case where a line fill request is pending for the inactive thread and that pending line fill will fill the same cache line, the set chosen is the opposite of the set chosen for the inactive thread's pending line fill request. Fixing the set at the time the line fill request is initiated thus avoids the possibility of livelock (that is, a situation in which two pending line fill requests try to write to the same set).
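The set choice made by set logic 720 can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that "the same cache line" means the same array index (EA48:56) as the inactive thread's pending request.

```python
def choose_set(mru_bit: int, index: int,
               other_pending: bool, other_index: int, other_set: int) -> int:
    """Pick the set (0 or 1) that a new line fill will write.

    mru_bit       -- MRU bit of the directory entry at `index`
    other_pending -- "R" bit of the inactive thread's LFAddr register
    other_index   -- EA48:56 held in the inactive thread's LFAddr register
    other_set     -- set bit held in the inactive thread's LFAddr register
    """
    if other_pending and other_index == index:
        # Special case: avoid the set already claimed by the other thread.
        return 1 - other_set
    # General case: the least recently used set is the inverse of the MRU bit.
    return 1 - mru_bit
```

Fixing the result in the LFAddr set bit when the request is issued means the two threads can never target the same set of the same entry.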

【００７１】図９、図１０は、レジスタ７１０に記憶さ
れる情報の使用方法を示す。便宜上、レジスタ７１１か
らの同様のデータ・パスは図示していない。要求された
命令を含むキャッシュ・サブラインのアドレスは、対応
するＬＲＡｄｄｒレジスタに記憶されたアドレス情報か
ら導かれる。具体的には、実ページ番号（ＲＡ２４：５
１）がビットＥＡ５２：５８と連結されて、キャッシュ
・サブラインの実アドレスが得られる。これは図９、図
１０ではフィーチャ７１２として示してある。これは必
ずしも個別レジスタではなく、ＬＦＡｄｄｒレジスタの
１つの対応するビットからのアドレスの組み合わせの表
現にすぎない。ライン・フィル・リクエスト・ライン７
０３は、メモリ管理装置２２２に対するデータ・リクエ
ストを開始し、キャッシュ・フィル・バス２３３で７１
２として示したアドレスを転送する。スレッド・タグ・
ビットも転送され、Ｌ１Ｉキャッシュ制御ロジックは
その後、返された命令に関連付けるＬＦＡｄｄｒレジス
タを判定できる。次に、メモリ管理装置は、要求された
命令をＬ２キャッシュ１０８、メイン・メモリ１０２、
または他のソースのいずれから取得するかを判定する。
要求された命令がメモリ管理装置２２２から使用できる
場合は、バス２３３でＬ１Ｉキャッシュに、スレッド
・タグ・ビットと共に転送される。FIGS. 9 and 10 show how the information stored in register 710 is used. For convenience, the similar data paths from register 711 are not shown. The address of the cache subline containing the requested instruction is derived from the address information stored in the corresponding LFAddr register. Specifically, the real page number (RA24:51) is concatenated with bits EA52:58 to obtain the real address of the cache subline. This is shown as feature 712 in FIGS. 9 and 10. It is not necessarily a separate register, but merely represents the combination of address bits taken from the corresponding bits of one of the LFAddr registers. Line fill request line 703 initiates a data request to memory management unit 222 and transfers the address shown as 712 on cache fill bus 233. A thread tag bit is also transferred, so that the L1 I-cache control logic can later determine which LFAddr register is associated with the returned instructions. The memory management unit then determines whether to obtain the requested instructions from L2 cache 108, main memory 102, or another source. When the requested instructions become available from memory management unit 222, they are transferred to the L1 I-cache on bus 233 together with the thread tag bit.
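The concatenation shown as feature 712 can be illustrated with a few bit operations. The function name is hypothetical, and the sketch assumes a 4 KB page (EA52:63 is the page offset) and returns a byte address whose low five bits (the offset within a 32-byte subline) are zero.

```python
def subline_real_address(ra_24_51: int, ea_48_58: int) -> int:
    """Combine RA24:51 with EA52:58 into the real byte address of a subline."""
    ea_52_58 = ea_48_58 & 0x7F          # low 7 of the 11 stored EA bits
    # 28-bit real page number, 7 bits selecting the subline within the page,
    # then 5 zero bits for the byte offset within the subline.
    return (ra_24_51 << 12) | (ea_52_58 << 5)
```

For example, real page number 1 with EA52:58 equal to 3 gives address 0x1060, the fourth 32-byte subline of the page starting at 0x1000.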

【００７２】バス２３３で要求された命令が返される
と、制御信号が生成され、データがディレクトリ・アレ
イ５０２と命令アレイ３０３に書込まれる。具体的に
は、対応するＬＦＡｄｄｒレジスタ７１０、７１１から
のＥＡ４８：５６は、アレイ５０２のエントリを選択す
るために使用される。ＬＦＡｄｄｒレジスタからのセッ
ト・ビットは、制御信号と共に、書込み信号ライン７０
６、７０７の１つのアレイ５０２の半分に対して書込み
信号を生成するために使用され、セット・ビットの状態
は、アレイ５０２のどちらか半分に書込まれるか（つま
り、書込み信号ライン７０６、７０７のどちらがアクテ
ィブか）を決定する。ＬＦＡｄｄｒレジスタからの実ペ
ージ番号（ＲＡ２４：５１）は、セット・ビットで決定
されるアレイ５０２の半分の、ＥＡ４８：５１により選
択されるエントリに書込まれる。ディレクトリ・アレイ
のＭＲＵビットは同時に更新される。When the requested instructions are returned on bus 233, control signals are generated and the data is written to directory array 502 and instruction array 303. Specifically, EA48:56 from the corresponding LFAddr register 710, 711 is used to select an entry of array 502. The set bit from the LFAddr register is used, together with the control signals, to generate a write signal to one half of array 502 on one of write signal lines 706, 707; the state of the set bit determines which half of array 502 is written (that is, which of write signal lines 706, 707 is active). The real page number (RA24:51) from the LFAddr register is written to the entry selected by EA48:56, in the half of array 502 determined by the set bit. The MRU bit in the directory array is updated at the same time.

【００７３】上の操作と並行して、ＬＦＡｄｄｒレジス
タからのＥＡ４８：５６は、命令アレイ３０３のエント
リを選択するために用いられ、ＬＦＡｄｄｒレジスタか
らのセット・ビットは、同様に、アレイの半分に対する
書込み信号を生成するために用いられる。この場所に書
込まれるデータは、バス２３３からのデータ（一連の命
令）であり、図８に示すＬＦデータ・バス６０４に送ら
れる。ただし、命令アレイ３０３を埋める場合には、一
度に１つのサブラインしか書込めない。ＬＦデータ・バ
ス６０４は、一度に１つのサブライン（３２バイト）を
送る。完全なサブラインは、選択ロジック６０１により
ＬＦＡｄｄｒレジスタからのＥＡ４８：５６と、シーケ
ンス・ロジック（図示せず）により与えられる２つのア
ドレス・ビット５７、５８を使って選択される。従っ
て、キャッシュ・ライン全体を埋めるには４回の書込み
サイクルが必要である。In parallel with the above operation, EA48:56 from the LFAddr register is used to select an entry of instruction array 303, and the set bit from the LFAddr register is likewise used to generate a write signal to one half of the array. The data written to this location is the data (a sequence of instructions) from bus 233, which is presented on LF data bus 604 shown in FIG. 8. When instruction array 303 is filled, however, only one subline can be written at a time. LF data bus 604 delivers one subline (32 bytes) at a time. The subline to be written is selected by selection logic 601 using EA48:56 from the LFAddr register together with the two address bits 57, 58 supplied by sequence logic (not shown). Four write cycles are therefore required to fill an entire cache line.

【００７４】更新された命令アレイ・エントリの実ペー
ジ番号がディレクトリ・アレイ５０２に書込まれると、
４つの有効ビット（各サブラインに１つ）が最初は無効
と設定される。連続したサブラインがそれぞれ命令アレ
イ３０３に書込まれるとき、ディレクトリ・アレイ５０
２の対応する有効ビットが更新されてデータが有効にな
ったことが示される。上に述べたように連続した書込み
サイクルでのキャッシュ・ラインの書込みを、どのよう
な理由であれ解釈する必要がある場合、ディレクトリ・
アレイ５０２は正しい情報を含む。When the real page number of the updated instruction array entry is written to directory array 502, the four valid bits (one per subline) are initially set to invalid. As each successive subline is written to instruction array 303, the corresponding valid bit in directory array 502 is updated to indicate that the data is now valid. Thus, if the writing of the cache line in successive write cycles as described above has to be interrupted for any reason, directory array 502 still contains correct information.
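The write sequence of the last three paragraphs (directory update first, then up to four subline writes, each setting its own valid bit) can be modelled as below. The class and method names are hypothetical, and the arrays are simplified to plain Python lists.

```python
SUBLINES_PER_LINE = 4
SUBLINE_BYTES = 32


class ICacheArrays:
    """Hypothetical model of directory array 502 and instruction array 303."""

    def __init__(self, entries: int = 512) -> None:
        self.rpn = [[0, 0] for _ in range(entries)]
        self.valid = [[[False] * SUBLINES_PER_LINE for _ in range(2)]
                      for _ in range(entries)]
        self.data = [[[bytes(SUBLINE_BYTES)] * SUBLINES_PER_LINE for _ in range(2)]
                     for _ in range(entries)]

    def start_fill(self, index: int, set_bit: int, rpn: int) -> None:
        """Write the real page number and mark all four sublines invalid."""
        self.rpn[index][set_bit] = rpn
        self.valid[index][set_bit] = [False] * SUBLINES_PER_LINE

    def write_subline(self, index: int, set_bit: int,
                      subline: int, payload: bytes) -> None:
        """One write cycle: store 32 bytes and set that subline's valid bit."""
        if len(payload) != SUBLINE_BYTES:
            raise ValueError("a subline is exactly 32 bytes")
        self.data[index][set_bit][subline] = payload
        self.valid[index][set_bit][subline] = True
```

If the fill stops after two write cycles, only those two sublines are marked valid, so a later lookup of the remaining sublines misses instead of returning stale data.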

【００７５】ＥＲＡＴミスの場合、選択マルチプレクサ
４０２の実ページ番号出力は信頼性がない。他の何らか
の処理を行う前に、命令装置２０１からの有効アドレス
のページ番号部を、実ページ番号に変換する必要があ
る。ＥＲＡＴ＿Ｍｉｓｓライン４１２は、図１１に示し
たアドレス変換メカニズムをトリガする。実際にこの変
換を行うハードウェアは、Ｉキャッシュ１０６の一部で
はない。ハードウェアの一部はＣＰＵ２０１に組み込
め、他のハードウェアはメイン・メモリ１０２等に置け
る。このアドレス変換は、通常、先に述べたライン・フ
ィル操作よりも多くのサイクルを必要とする。ＥＲＡＴ
ミスに続いて、変換された実ページ番号が返されると、
これと並行して実ページ番号がＥＲＡＴ３０１の更新に
使用され、対応するＬＦＡｄｄｒレジスタ（７１０また
は７１１）に書込まれ、ライン・フィル操作が開始され
る。その場合、理論上、要求された命令は、ＥＲＡＴミ
スにかかわらず既にキャッシュにあるが、これは充分に
希な事例とみなされ、ＥＲＡＴエントリの埋め込みを待
つのではなく、ライン・フィル操作をすぐに要求するこ
とでパフォーマンスが改良される。In the case of an ERAT miss, the real page number output of selection multiplexer 402 is not reliable. Before anything else can be done, the page number portion of the effective address from instruction unit 201 must be translated to a real page number. ERAT_Miss line 412 triggers the address translation mechanism shown in FIG. 11. The hardware that actually performs this translation is not part of I-cache 106; some of it may be incorporated in CPU 101, while other hardware may reside in main memory 102 or elsewhere. This address translation normally takes many more cycles than the line fill operation described above. When the translated real page number is returned following an ERAT miss, it is used to update ERAT 301 and, in parallel, is written to the corresponding LFAddr register (710 or 711), and a line fill operation is initiated. In theory the requested instruction could already be in the cache despite the ERAT miss, but this is regarded as sufficiently rare that performance is improved by requesting the line fill operation immediately rather than waiting for the ERAT entry to be filled.

【００７６】本発明の理解に必要不可欠ではないロジッ
ク回路の図や説明は、ここでは便宜上省略してある。例
えば、アレイ５０２にＭＲＵビットを維持するロジッ
ク、パリティ・エラーを検出して補正処置を取るロジッ
ク等は省略してある。Illustrations and descriptions of logic circuits that are not essential for understanding the present invention are omitted here for convenience. For example, logic to maintain the MRU bits in array 502, logic to detect parity errors and take corrective action, etc. are omitted.

【００７７】好適実施例では、キャッシュ・ヒットを確
認する目的で、ディレクトリ・アレイの実ページ番号と
比較するよう実アドレス（実ページ番号の一部）を提供
するためにＥＲＡＴが用いられる。この設計が望ましい
のは、ＥＲＡＴが実ページ番号への高速変換を行うから
である。高速変換は、基本的なアドレス変換メカニズム
の応答時間に依存しない。これにより、システム設計者
は一定の制約を免れる。つまり、Ｉキャッシュで１サイ
クルの応答時間をサポートするため必要な高速度でアド
レスを変換するための基本アドレス変換メカニズムは不
要になる。ただし、別の実施例では、ここで説明してい
る通り、ＥＲＡＴなしに命令キャッシュを構成すること
も可能である。その場合、ディレクトリ・アレイの実ペ
ージ番号と比較する実ページ番号を提供するために、基
本アドレス変換メカニズムが用いられる。また別の実施
例では、Ｌ１Ｉキャッシュの内部または外部の別のメ
カニズムを使用して実ページ番号を提供することもでき
よう。In the preferred embodiment, the ERAT supplies the real address portion (the real page number) that is compared with the real page numbers in the directory array to confirm a cache hit. This design is preferred because the ERAT provides a fast translation to a real page number that does not depend on the response time of the underlying address translation mechanism. This frees the system designer from certain constraints: the underlying address translation mechanism need not translate addresses at the speed required to support the single-cycle response time of the I-cache. In an alternative embodiment, however, an instruction cache as described here could be constructed without an ERAT. In that case, the underlying address translation mechanism would be used to provide the real page number that is compared with the real page numbers in the directory array. In a further alternative embodiment, some other mechanism, internal or external to the L1 I-cache, could be used to provide the real page number.

【００７８】キャッシュ連想性の数は、好適実施例では
スレッド数と同じである。これは、共通のキャッシュに
対するスレッドの競合を避けるのに有益である。ただ
し、これに代えて、ここで述べているように、スレッド
数がキャッシュ連想性と同じではないキャッシュを設計
することも可能である。例えば、プロセッサによりサポ
ートされるスレッド数が多い場合、スレッド数と同じ程
度のキャッシュ連想性は、競合を避けるためには必ずし
も必要ない。その場合、理論上、連想性がスレッド数よ
りも少ないときに競合の可能性はあるが、その可能性は
充分に小さいと考えられるので、キャッシュ連想性を少
なくすることも許容範囲内である。更に、何らかの競合
の可能性はあるとしても、キャッシュ連想性を１として
も許容できる場合がある。In the preferred embodiment, the cache associativity equals the number of threads. This helps avoid contention between threads for the shared cache. Alternatively, however, a cache as described here could be designed in which the number of threads differs from the cache associativity. For example, if the processor supports a large number of threads, associativity as high as the number of threads is not necessarily required to avoid contention. In that case contention is theoretically possible when the associativity is lower than the number of threads, but the likelihood may be small enough that a lower associativity is acceptable. Indeed, an associativity of 1 may be acceptable in some cases, even though some contention is possible.

【００７９】まとめとして、本発明の構成に関して以下
の事項を開示する。In summary, the following matters are disclosed regarding the configuration of the present invention.

【００８０】（１）マルチスレッドのコンピュータ処理
装置であって、複数のスレッドの実行をサポートし、そ
れぞれ該複数のスレッドのそれぞれに対応する複数のレ
ジスタ・セットと、命令をデコードするデコード・ロジ
ックと、実行される命令の有効アドレスを生成するシー
ケンス・ロジックを含む命令装置と、前記命令装置によ
り生成される所望の有効アドレスに応答して命令を提供
する、命令キャッシュとを含み、該命令キャッシュは、ａ）複数のエントリがあり、それぞれ命令の実アドレス
の一部を含み、前記所望の有効アドレスを使ってエント
リが選択される、ディレクトリ・アレイと、ｂ）複数のエントリがあり、それぞれ前記ディレクトリ
・アレイのエントリに関連付けられ、少なくとも１つの
命令を含み、前記ディレクトリ・アレイのエントリが前
記所望の有効アドレスを使って選択される、命令アレイ
と、ｃ）それぞれ前記複数のスレッドに対応し、命令キャッ
シュ・ミスに応答して検索される所望の命令の実アドレ
スの少なくとも一部を記憶する、複数のライン・フィル
・レジスタとを含む、マルチスレッド・コンピュータ処
理装置。（２）前記命令キャッシュは、ｄ）複数のエントリを含み、各エントリが有効アドレス
の一部と実アドレスの一部を含み、前記所望の有効アド
レスを使ってエントリが選択される、有効／実アドレス
変換アレイと、前記ライン・フィル・レジスタに記憶さ
れる所望の命令の実アドレスの前記一部は、前記有効／
実アドレス変換アレイのエントリから取得される、前記
（１）記載のマルチスレッド・コンピュータ処理装置。（３）前記命令キャッシュは、ｅ）前記有効／実アドレス変換アレイのエントリからの
有効アドレスの前記一部を、前記所望の有効アドレスの
対応する一部と比較して、有効／実アドレス変換アレイ
・ヒットを判定する比較器を含む、前記（２）記載のマ
ルチスレッド・コンピュータ処理装置。（４）前記ディレクトリ・アレイはＮセット（Ｎ＞１）
に分けられ、前記ディレクトリ・アレイ・エントリはそ
れぞれ、命令の複数の実アドレスの一部を含み、該実ア
ドレス部はそれぞれ、前記ディレクトリ・アレイの該Ｎ
セットのうち対応するセットに属し、前記命令アレイは
Ｎセットに分けられ、各セットは前記ディレクトリ・ア
レイのセットに対応し、前記命令アレイ・エントリはそ
れぞれ複数の命令を含み、各命令は前記命令アレイの該
Ｎセットに属する、前記（１）記載のマルチスレッド・
コンピュータ処理装置。（５）前記マルチスレッド・コンピュータ処理装置はＮ
スレッドの実行をサポートする、前記（４）記載のマル
チスレッド・コンピュータ処理装置。（６）前記ライン・フィル・レジスタはそれぞれセット
・フィールドを含み、該セット・フィールドは、検索さ
れる所望の命令が検索後に記憶される前記Ｎセットのう
ちのセットを指定する、前記（４）記載のマルチスレッ
ド・コンピュータ処理装置。（７）前記命令キャッシュは、ｅ）それぞれ前記ディレクトリ・アレイのセットに関連
付けられ、前記ディレクトリ・アレイの選択されたエン
トリの関連付けられた部分からの命令の実アドレスの前
記部分を、前記所望の有効アドレスに関連付けられた実
アドレスの共通部分と比較して、キャッシュ・ヒットを
判定する、Ｎ個の比較器を含む、前記（４）記載のマル
チスレッド・コンピュータ処理装置。（８）前記命令キャッシュは、ｄ）複数のエントリがあり、各エントリは、有効アドレ
スの一部と実アドレスの一部を含み、前記所望の有効ア
ドレスを使ってエントリが選択される、有効／実アドレ
ス変換アレイを含み、前記ライン・フィル・レジスタに
記憶された所望の命令の実アドレスの前記部分が、前記
有効／実アドレス変換アレイのエントリから取得され
る、前記（４）記載のマルチスレッド・コンピュータ処
理装置。（９）前記命令キャッシュは、ｅ）それぞれ前記ディレクトリ・アレイのセットに関連
付けられ、前記ディレクトリ・アレイの選択されたエン
トリの関連付けられた部分からの命令の実アドレスの前
記部分を、前記所望の有効アドレスに関連付けられた実
アドレスの共通部分と比較して、キャッシュ・ヒットを
判定し、キャッシュ・ヒットを判定するために比較され
る前記所望の有効アドレスに関連付けられた実アドレス
の前記共通部分は、前記有効／実アドレス変換アレイの
エントリから取得される、Ｎ個の比較器を含む、前記
（８）記載のマルチスレッド・コンピュータ処理装置。（１０）マルチスレッド・コンピュータ処理装置であっ
て、それぞれスレッドに対応する複数のレジスタ・セッ
トと、命令をデコードするデコード・ロジックと、実行
される命令の有効アドレスを生成するシーケンス・ロジ
ックとを含む命令装置と、前記命令装置により生成され
る所望の有効アドレスに応答して命令を与える命令キャ
ッシュとを含み、該命令キャッシュは、複数のエントリ
があり、Ｎセット（Ｎ＞１）に分けられ、該エントリは
それぞれＮ部を含み、各エントリ部は該Ｎセットのうち
のセットに関連付けられ、命令の実アドレスの一部を含
み、前記所望の有効アドレスを使ってエントリが選択さ
れる、ディレクトリ・アレイと、複数のエントリがあ
り、各エントリは前記ディレクトリ・アレイのエントリ
に関連付けられ、複数の命令を含み、Ｎセットに分けら
れ、各セットは前記ディレクトリ・アレイのセットに対
応し、Ｎ部を含み、各エントリ部は前記命令アレイの前
記Ｎセットのうちのセットに関連付けられ、少なくとも
１つの命令を含み、前記所望の有効アドレスを使ってエ
ントリが選択される、命令アレイと、それぞれ前記ディ
レクトリ・アレイのセットに関連付けられ、前記ディレ
クトリ・アレイの選択されたエントリの関連付けられた
部分からの命令の実アドレスの前記部分を、前記所望の
有効アドレスに関連付けられた実アドレスの共通部分と
比較して、キャッシュ・ヒットを判定する、Ｎ個の比較
器と、を含む、マルチスレッド・コンピュータ処理装
置。（１１）前記マルチスレッド・コンピュータ処理装置
は、Ｎスレッドの実行をサポートする、前記（１０）記
載のマルチスレッド・コンピュータ処理装置。（１２）前記命令キャッシュは、ｄ）複数のエントリがあり、各エントリは有効アドレス
の一部と実アドレスの対応する一部とを含み、前記所望
の有効アドレスを使ってエントリが選択される、有効／
実アドレス変換アレイを含み、キャッシュ・ヒットを判
定するために前記比較器により比較される、前記所望の
有効アドレスに関連付けられた実アドレスの前記共通部
分は、前記有効／実アドレス変換アレイのエントリから
取得される、前記（１０）記載のマルチスレッド・コン
ピュータ処理装置。(1) A multithreaded computer processing apparatus comprising: a plurality of register sets supporting the execution of a plurality of threads, each register set corresponding to a respective one of the plurality of threads; an instruction unit including decode logic for decoding instructions and sequence logic for generating effective addresses of instructions to be executed; and an instruction cache that provides instructions in response to a desired effective address generated by the instruction unit, the instruction cache including: a) a directory array having a plurality of entries, each entry containing a portion of a real address of an instruction, an entry being selected using the desired effective address; b) an instruction array having a plurality of entries, each entry being associated with an entry of the directory array and containing at least one instruction, an entry being selected using the desired effective address; and c) a plurality of line fill registers, each corresponding to a respective one of the plurality of threads and storing at least a portion of the real address of a desired instruction to be retrieved in response to an instruction cache miss.
(2) The multithreaded computer processing apparatus according to (1), wherein the instruction cache further includes: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a portion of a real address, an entry being selected using the desired effective address; and wherein the portion of the real address of the desired instruction stored in the line fill register is obtained from an entry of the effective/real address translation array.
(3) The multithreaded computer processing apparatus according to (2), wherein the instruction cache further includes: e) a comparator that compares the portion of the effective address from the entry of the effective/real address translation array with the corresponding portion of the desired effective address to determine an effective/real address translation array hit.
(4) The multithreaded computer processing apparatus according to (1), wherein the directory array is divided into N sets (N > 1), each directory array entry contains respective portions of a plurality of real addresses of instructions, and each real address portion belongs to a corresponding one of the N sets of the directory array; and wherein the instruction array is divided into N sets, each set corresponding to a set of the directory array, each instruction array entry contains a plurality of instructions, and each instruction belongs to one of the N sets of the instruction array.
(5) The multithreaded computer processing apparatus according to (4), wherein the apparatus supports the execution of N threads.
(6) The multithreaded computer processing apparatus according to (4), wherein each line fill register includes a set field that specifies the one of the N sets in which the desired instruction to be retrieved is stored after retrieval.
(7) The multithreaded computer processing apparatus according to (4), wherein the instruction cache further includes: e) N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated part of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit.
(8) The multithreaded computer processing apparatus according to (4), wherein the instruction cache further includes: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a portion of a real address, an entry being selected using the desired effective address; and wherein the portion of the real address of the desired instruction stored in the line fill register is obtained from an entry of the effective/real address translation array.
(9) The multithreaded computer processing apparatus according to (8), wherein the instruction cache further includes: e) N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated part of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit; and wherein the common portion of the real address associated with the desired effective address that is compared to determine a cache hit is obtained from an entry of the effective/real address translation array.
(10) A multithreaded computer processing apparatus comprising: a plurality of register sets, each corresponding to a thread; an instruction unit including decode logic for decoding instructions and sequence logic for generating effective addresses of instructions to be executed; and an instruction cache that provides instructions in response to a desired effective address generated by the instruction unit, the instruction cache including: a directory array having a plurality of entries and divided into N sets (N > 1), each entry containing N parts, each part being associated with a respective one of the N sets and containing a portion of a real address of an instruction, an entry being selected using the desired effective address; an instruction array having a plurality of entries, each entry being associated with an entry of the directory array and containing a plurality of instructions, the instruction array being divided into N sets, each set corresponding to a set of the directory array, each entry containing N parts, each part being associated with a respective one of the N sets of the instruction array and containing at least one instruction, an entry being selected using the desired effective address; and N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated part of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit.
(11) The multithreaded computer processing apparatus according to (10), wherein the apparatus supports the execution of N threads.
(12) The multithreaded computer processing apparatus according to (10), wherein the instruction cache further includes: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a corresponding portion of a real address, an entry being selected using the desired effective address; and wherein the common portion of the real address associated with the desired effective address that is compared by the comparators to determine a cache hit is obtained from an entry of the effective/real address translation array.

[Brief description of the drawings]

【図１】本発明の好適実施例に従った、ＣＰＵが１つの
コンピュータ・システムの主なハードウェア構成要素の
図である。FIG. 1 is a diagram of the main hardware components of a single CPU computer system in accordance with a preferred embodiment of the present invention.

【図２】本発明の好適実施例に従った、ＣＰＵが複数の
コンピュータ・システムの主なハードウェア構成要素の
図である。FIG. 2 is a diagram of the main hardware components of a computer system with multiple CPUs, according to a preferred embodiment of the present invention.

【図３】好適実施例に従ったコンピュータ・システムの
中央処理装置の図である。FIG. 3 is a diagram of a central processing unit of a computer system according to a preferred embodiment.

【図４】好適実施例に従ったＬ１命令キャッシュの主な
構成要素の図である。FIG. 4 is a diagram of the main components of an L1 instruction cache according to a preferred embodiment.

【図５】好適実施例に従った有効／実アドレス・テーブ
ルとこれに関連する制御構造の図である。FIG. 5 is a diagram of the effective/real address table and its associated control structures according to the preferred embodiment.

【図６】好適実施例に従った有効／実アドレス・テーブ
ルとこれに関連する制御構造の図である。FIG. 6 is a diagram of the effective/real address table and its associated control structures according to the preferred embodiment.

FIG. 7 is a diagram of an L1 instruction cache and its associated control structures according to a preferred embodiment.

FIG. 8 is a diagram of the instruction array of an L1 instruction cache and its associated control structures according to a preferred embodiment.

FIG. 9 is a diagram of the main control logic for generating a cache line fill according to a preferred embodiment.

FIG. 10 is a diagram of the main control logic for generating a cache line fill according to a preferred embodiment.

FIG. 11 is a diagram of address translation according to a preferred embodiment.

[Explanation of symbols]

100 System
101, 101A, 101B, 101C, 101D CPU
102 Main memory
105 Bus interface
106, 106A, 106B, 106C, 106D Level 1 instruction cache (L1 I-cache)
107, 107A, 107B, 107C, 107D Level 1 data cache (L1 D-cache)
108, 108A, 108B, 108C, 108D Level 2 cache (L2 cache)
109 Memory bus
110 System bus
111, 112, 113, 114, 115 I/O processing unit (IOP)
201 Instruction unit
202 Branch unit
203 Sequential buffer
204 Switching buffer
205 Branch buffer
206 Decode/dispatch unit
211 Execution unit
212 Floating-point unit (FPU)
213 S pipe
214 M pipe
215 R pipe
216 Floating-point registers
217 General-purpose registers
221 Storage controller
222 Memory management unit
223 L2 cache directory
224 L2 cache interface
225 Memory bus interface
232 L1 I-cache instruction bus
233 Cache fill bus
301 Effective/real address table (ERAT)
302 I-cache directory array
303 I-cache instruction array
304, 305, 306 Comparator
310 ERAT
403 Individual array
404 Individual register
405 ERAT logic
410 ERAT hit signal
411 Protection exception signal
412 ERAT miss signal
413 Cache inhibit signal
501, 601 Selection logic
502 66-bit × 512 array
503 1-bit × 512 array
505, 602 Selector
510 I-cache hit signal
511 Set select line
604 Cache write bus
801 Effective address
701 Line fill start logic
704, 705 Write signal
710, 711 LFAddr register
720 Set logic
802 Virtual address
803 Real address
814 Virtual segment ID
811 36-bit effective segment ID
812 16-bit page number
813 12-bit byte index
814 52-bit virtual segment ID
815 52-bit real page number
821 Segment table mechanism
822 Page table mechanism
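The field widths listed for symbols 811, 812, and 813 (36 + 16 + 12 bits) add up to 64 bits. A minimal sketch of splitting an address into those fields follows, assuming they are the fields of a 64-bit effective address (801) packed from most significant to least significant in that order. The packing order and both function names are assumptions made for illustration and are not stated in the symbol list.

```python
# Field widths taken from the symbol list (811, 812, 813); their packing order
# within a 64-bit effective address is an assumption made for this sketch.
ESID_BITS = 36  # effective segment ID
PAGE_BITS = 16  # page number
BYTE_BITS = 12  # byte index within the page


def split_effective_address(ea):
    """Split a 64-bit effective address into (segment ID, page number, byte index)."""
    if not 0 <= ea < 1 << (ESID_BITS + PAGE_BITS + BYTE_BITS):
        raise ValueError("effective address must fit in 64 bits")
    byte_index = ea & ((1 << BYTE_BITS) - 1)
    page_number = (ea >> BYTE_BITS) & ((1 << PAGE_BITS) - 1)
    segment_id = ea >> (BYTE_BITS + PAGE_BITS)
    return segment_id, page_number, byte_index


def join_effective_address(segment_id, page_number, byte_index):
    """Inverse of split_effective_address; rejects fields that are too wide."""
    if not 0 <= segment_id < 1 << ESID_BITS:
        raise ValueError("segment ID does not fit in 36 bits")
    if not 0 <= page_number < 1 << PAGE_BITS:
        raise ValueError("page number does not fit in 16 bits")
    if not 0 <= byte_index < 1 << BYTE_BITS:
        raise ValueError("byte index does not fit in 12 bits")
    return (segment_id << (PAGE_BITS + BYTE_BITS)) | (page_number << BYTE_BITS) | byte_index
```

Under this layout, translating an address replaces only the upper fields, while the 12-bit byte index passes through unchanged.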

Continued from the front page

(51) Int. Cl.⁷: G06F 12/12; FI: G06F 12/12 A

(72) Inventor: Richard William Doing, 2532 Fifty-Ninth Street North West, Rochester, Minnesota 55901, United States

(72) Inventor: Ronald Nick Carla, Rail Road 1, Box 77A, East Ryans Bay Road, Zumbro Falls, Minnesota 55991, United States

(72) Inventor: Stephen Joseph Schwein, 17902-A Jubilee Way, Lakeville, Minnesota 55044, United States

Claims

[Claims]

1. A multi-threaded computer processing device, comprising: a plurality of register sets for supporting execution of a plurality of threads, each register set corresponding to a respective one of the plurality of threads; an instruction unit including decode logic for decoding instructions and sequence logic for generating effective addresses of instructions to be executed; and an instruction cache for providing an instruction in response to a desired effective address generated by the instruction unit, the instruction cache comprising: a) a directory array having a plurality of entries, each entry containing a portion of a real address of an instruction, wherein an entry of the directory array is selected using the desired effective address; b) an instruction array having a plurality of entries, each entry associated with an entry of the directory array and containing at least one instruction, wherein an entry of the instruction array is selected using the desired effective address; and c) a plurality of line fill registers, each corresponding to a respective one of the plurality of threads, each storing at least a portion of the real address of a desired instruction to be retrieved in response to an instruction cache miss.

2. The multi-threaded computer processing device of claim 1, wherein the instruction cache further comprises: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a portion of a real address, wherein an entry is selected using the desired effective address; and wherein the portion of the real address of the desired instruction stored in the line fill register is obtained from an entry of the effective/real address translation array.

3. The multi-threaded computer processing device of claim 2, wherein the instruction cache further comprises: e) a comparator that compares the portion of the effective address from the entry of the effective/real address translation array with a corresponding portion of the desired effective address to determine a translation array hit.

4. The multi-threaded computer processing device of claim 1, wherein the directory array is divided into N sets (N > 1), each entry of the directory array contains a plurality of real address portions of instructions, and each real address portion belongs to a respective one of the N sets of the directory array; and wherein the instruction array is divided into N sets, each set corresponding to a respective set of the directory array, each entry of the instruction array contains a plurality of instructions, and each instruction belongs to a respective one of the N sets of the instruction array.

5. The multi-threaded computer processing device of claim 4, wherein said multi-threaded computer processing device supports N-thread execution.

6. The multi-threaded computer processing device of claim 4, wherein each of the line fill registers includes a set field, the set field specifying the set, among the N sets, in which the desired instruction to be retrieved is stored after retrieval.

7. The multi-threaded computer processing device of claim 4, wherein the instruction cache further comprises: e) N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated portion of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit.

8. The multi-threaded computer processing device of claim 4, wherein the instruction cache further comprises: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a portion of a real address, wherein an entry is selected using the desired effective address; and wherein the portion of the real address of the desired instruction stored in the line fill register is obtained from an entry of the effective/real address translation array.

9. The multi-threaded computer processing device of claim 8, wherein the instruction cache further comprises: e) N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated portion of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit; and wherein the common portion of the real address associated with the desired effective address, which is compared by the comparators to determine the cache hit, is obtained from the entry of the effective/real address translation array.

10. A multi-threaded computer processing device, comprising: a plurality of register sets, each corresponding to a thread; an instruction unit including decode logic for decoding instructions and sequence logic for generating effective addresses of instructions to be executed; and an instruction cache for providing an instruction in response to a desired effective address generated by the instruction unit, the instruction cache comprising: a directory array having a plurality of entries and divided into N sets (N > 1), wherein each entry includes N portions, each entry portion is associated with a respective one of the N sets and contains a portion of a real address of an instruction, and an entry is selected using the desired effective address; an instruction array having a plurality of entries, each entry associated with an entry of the directory array, the instruction array being divided into N sets, each set corresponding to a set of the directory array, wherein each entry includes N portions, each entry portion is associated with a respective one of the N sets of the instruction array and contains at least one instruction, and an entry is selected using the desired effective address; and N comparators, each associated with a respective set of the directory array, each comparing the portion of the real address of an instruction from the associated portion of the selected entry of the directory array with a common portion of the real address associated with the desired effective address to determine a cache hit.

11. The multi-threaded computer processing device of claim 10, wherein the multi-threaded computer processing device supports execution of N threads.

12. The multi-threaded computer processing device of claim 10, wherein the instruction cache further comprises: d) an effective/real address translation array having a plurality of entries, each entry containing a portion of an effective address and a corresponding portion of a real address, wherein an entry is selected using the desired effective address; and wherein the common portion of the real address associated with the desired effective address, which is compared by the comparators to determine a cache hit, is obtained from the entry of the effective/real address translation array.
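The per-thread line fill registers of claims 1 and 6 can be pictured with the following sketch. It is only an illustration under stated assumptions: two threads and two sets are assumed, the `entry_index` field and the dictionary standing in for the directory and instruction arrays are hypothetical, and none of the class or method names come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

N_SETS = 2     # assumption: two sets
N_THREADS = 2  # assumption: two hardware threads


@dataclass
class LineFillRegister:
    """Per-thread record of one outstanding instruction cache miss."""
    busy: bool = False
    real_address: int = 0  # portion of the real address of the missing line
    entry_index: int = 0   # hypothetical: array entry the line will occupy
    set_number: int = 0    # set field: which of the N sets receives the line


class LineFillUnit:
    def __init__(self) -> None:
        self.registers: List[LineFillRegister] = [
            LineFillRegister() for _ in range(N_THREADS)
        ]
        # Stand-in for the directory and instruction arrays:
        # (entry, set) -> (real address portion, line contents).
        self.arrays: Dict[Tuple[int, int], Tuple[int, bytes]] = {}

    def start_fill(self, thread: int, real_address: int,
                   entry_index: int, set_number: int) -> None:
        """Record a miss for one thread without disturbing the others."""
        if not 0 <= set_number < N_SETS:
            raise ValueError("set_number out of range")
        if self.registers[thread].busy:
            raise RuntimeError("thread already has a fill outstanding")
        self.registers[thread] = LineFillRegister(
            True, real_address, entry_index, set_number)

    def complete_fill(self, thread: int, line: bytes) -> None:
        """Write the returned line into the set named by the set field."""
        reg = self.registers[thread]
        if not reg.busy:
            raise RuntimeError("no fill outstanding for this thread")
        self.arrays[(reg.entry_index, reg.set_number)] = (reg.real_address, line)
        reg.busy = False
```

Because each thread has its own register, a miss by one thread can remain outstanding while another thread misses and is serviced, and the fills may complete in either order.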