JP3795757B2

JP3795757B2 - High data density RISC processor

Info

Publication number: JP3795757B2
Application number: JP2000582881A
Authority: JP
Inventors: Earl A. Killian; Ricardo E. Gonzalez; Ashish B. Dixit; Monica Lam; Walter D. Lichtenstein; Christopher Rowen; John C. Ruttenberg; Robert P. Wilson
Original assignee: Tensilica Inc
Current assignee: Tensilica Inc
Priority date: 1998-11-13
Filing date: 1999-11-10
Publication date: 2006-07-12
Anticipated expiration: 2019-11-10
Also published as: CN1348560A; TW452693B; JP2003521753A; KR100412920B1; CN1204490C; EP1129402A1; US6282633B1; KR20010092736A; AU1346500A; JP2006185462A; WO2000029938A1

Description

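[0001]
[Technical field of the invention]
The present invention relates to microprocessors, and more particularly to a high-performance reduced instruction set computer (RISC) architecture processor that makes highly efficient use of instruction width.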
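[0002]
[Prior art]
The design of processor instruction sets is a well-established art. Most instruction set features are not new in themselves. However, individual features can be combined in new and unique ways that advance the state of the art. In particular, when an instruction set is designed to be optimal for a use different from that of prior instruction sets, significant improvements can result when a processor implementing that instruction set is used in the target application.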
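[0003]
Instruction set design must balance many competing goals, including: the size of the machine code required to encode various algorithms; the extensibility and adaptability of the instruction set to new algorithms and applications; the performance and power consumption of processors that implement the instruction set on such algorithms; the cost of processors that implement the instruction set; the suitability of the instruction set for multiple processor implementations over time; the complexity of designing processors that implement the instruction set; and the suitability of the instruction set as a target for compilation from high-level programming languages.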
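[0004]
The instruction set has one direct and two indirect influences on processor performance. It directly determines IE, the number of instructions required to implement a given algorithm, although the suitability of the instruction set as a compilation target is a factor here as well. The other components of processor performance are the clock period CP and the average clocks per instruction CPI. These are primarily attributes of the implementation of the instruction set, but instruction set features can affect the implementor's ability to meet time-per-clock and clocks-per-instruction goals simultaneously. For example, an encoding choice might mandate additional logic in series with the rest of instruction execution, which an implementor would address either by increasing the time per clock or by adding a pipeline stage, which usually increases the clocks per instruction.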
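[0005]
In the 1980s and 1990s a new instruction set architecture called RISC was developed. It grew out of the recognition of the trade-offs above, namely
T = IE * CPI * CP
where T is the program execution time in seconds and the other variables are as described above. RISC instruction sets allow implementors to improve CPI and CP significantly without increasing IE much. RISC instruction sets improved processor performance, lowered design complexity, lowered the cost of a processor implementation at a given performance level, and were well suited to compilation from high-level programming languages.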
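The relationship T = IE * CPI * CP can be illustrated with a short calculation. The following is a minimal sketch; the function name and all numeric figures are hypothetical and are not taken from the specification.

```python
def execution_time(ie: int, cpi: float, cp_seconds: float) -> float:
    """Program execution time in seconds: T = IE * CPI * CP."""
    return ie * cpi * cp_seconds


# Hypothetical figures, for illustration only.
# A design with fewer but slower instructions:
baseline = execution_time(ie=1_000_000, cpi=2.0, cp_seconds=10e-9)
# A RISC-style design: somewhat higher IE, but much better CPI and CP.
risc_like = execution_time(ie=1_200_000, cpi=1.25, cp_seconds=5e-9)

print(f"baseline : {baseline * 1e3:.2f} ms")
print(f"RISC-like: {risc_like * 1e3:.2f} ms")
```

With these assumed figures, a 20% increase in IE is more than offset by the improvements in CPI and CP, which is the trade-off the paragraph describes.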
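[0006]
The processor architecture community has never agreed on a fully satisfactory definition of RISC, but it generally includes most of the following attributes: fixed-size instruction words; arithmetic and other computational operations performed on operands read from a general-purpose register file with 16 or more registers, with results written to the same register file; fixed positions in the instruction word for source register fields, so that register file access can proceed in parallel with instruction decode; memory access performed primarily by loads from memory to registers and stores from registers to memory (as opposed to memory operands in computational instructions); a small number of methods for computing memory addresses (usually four or fewer, and often just one); avoidance of features that make pipelined execution of instructions difficult (for example, use of a hardware resource more than once by a given instruction); and avoidance of features that require microcode or its equivalent. Not all processors considered to be RISC contain all of the above elements, but all contain most of them.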
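[0007]
However, early RISC instruction sets were not particularly efficient at producing compact machine code. In particular, RISC instruction sets usually required more bits to encode an application than pre-RISC instruction sets. The size of the machine code for an application is often more important than the cost of the processor itself in the total solution cost, because a larger memory is needed to hold the application. RISC remains acceptable in many applications where performance is most important, but an instruction set that has the advantages of RISC together with reduced code size would be useful in many other processor applications.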
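[0008]
[Problems to be solved by the invention]
Some of the earliest processor instruction sets (IBM 7090(TM), CDC 6600(TM), DEC PDP-6(TM), GE 635(TM)) had some of the characteristics of RISC because, like RISC, they were designed to be executed directly by hardware without microcode. Most of these instruction sets are not well suited to modern high-level languages and applications because of features such as word (rather than byte) addressing, limited address space, and peculiar combinations of operations. In fact, most were intended for assembly language programming. Several were based on a 36-bit data word and instruction width, and 36-bit instructions are not very good for code density. Several were based on an accumulator-and-memory model of computation, which limits performance. Some individual features of the present invention trace back to this generation of machines, but none of those machines had the desired characteristics as a whole.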
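[0009]
The use of microcode to implement processors made more complex instruction sets feasible (IBM 360(TM), DEC PDP-11(TM), DEC VAX(TM), Intel x86(TM), LLNL S-1(TM), Motorola 68000(TM)). The next generation of processors therefore had complex instruction sets with good code density, owing in part to complex variable-length instruction encodings. However, microcoded processors and complex instruction sets were often poorly suited to high performance. Complex instructions were implemented by iterating a micro-engine instead of by direct execution in a hardware pipeline, which increased CPI. The various styles of instruction set that emerged in this era tended to move from one or two accumulators toward either general-register architectures or stack architectures. The implementation cost of registers or stacks had become low enough that instruction sets could use these advantageous styles.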
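[0010]
As noted above, despite its significant performance improvement, RISC was a step backward for code density. Most RISC instruction sets are based on fixed-length 32-bit instructions, and 32 bits is more than is necessary. Moreover, some form of variable-length encoding is needed to achieve the best code density. Stack architectures, despite their code size advantage, dropped out at this point because of their low performance, which shows how important it is for an instruction set to achieve both performance and code size goals.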
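[0011]
To compensate for the code size disadvantage of RISC, processor designers introduced compact encodings of their instruction sets. ARM(TM) Thumb and MIPS16 are examples. Both use predominantly 16-bit instructions, with a small number of 32-bit instructions. The 16-bit encodings (which give smaller code by halving the number of bits per instruction) give poor performance because they have only eight registers (raising IE), use implied source register operands (raising CP or CPI), restrict the range of constants in the instruction word (raising IE), and restrict the number of distinct register operands (two or fewer for most instructions, raising IE).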
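[0012]
The Hitachi SH instruction set is another RISC-like instruction set that targets code size. It started as a 16-bit instruction set, but 32-bit instructions were added later when they were found necessary. It has 16 registers, but still has at most two register fields per instruction (raising IE) and severely limited branch offsets.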
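[0013]
What is needed is an instruction set design that provides the performance and other advantages of RISC while also providing small, cost-effective machine code. To allow high-performance implementations without undue complexity, the instruction set must be directly executable, without microcode, by a simple short pipeline. There must be a sufficient number of general registers to achieve good performance and to be a suitable target for optimizing compilers. Other techniques may be used to reduce code size further.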
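[0014]
In view of the above problems of the prior art, an object of the present invention is to provide a processor that implements a 24-bit encoding of a full-featured RISC instruction set.
Another object of the invention is to provide a processor that implements an instruction set whose instructions work synergistically to keep the static number of instructions required to represent a program low when an instruction encoding with a limited average number of bits per instruction is used. The invention also provides techniques for efficiently encoding instruction constants in a narrow instruction word.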
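[0015]
Another object of the invention is to provide a processor that implements a RISC instruction set in a 24-bit instruction word encoding, using compound compare-and-branch instructions that provide the most useful comparisons, with forms that have a longer target specifier for the common cases.
Yet another object of the invention is to provide a processor that implements a general-purpose (as opposed to special-purpose, such as DSP) instruction set with a reduced-overhead loop capability that lowers the static number of instructions and cycles required to represent a program and the number of instructions required to execute it.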
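[0016]
[Means for solving the problems]
The above objects are achieved by providing a RISC processor that implements an instruction set according to a first preferred embodiment of the invention. The instruction set is designed around the following code size equation, which is balanced against the equation T = IE * CPI * CP given above:
S = IS * BI
where S is the size of the program in bits,
IS is the static number of instructions required to represent the program (not the number required to execute it, as before), and
BI is the average number of bits per instruction.
Compared with RISC, the invention lowers BI and IS with minimal increases in CP and CPI. It has features that both increase and decrease IE.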
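The code size equation S = IS * BI can be illustrated in the same way. The following is a minimal sketch; the function name and the instruction counts are hypothetical.

```python
def code_size_bits(static_instructions: int, bits_per_instruction: float) -> float:
    """Program size in bits: S = IS * BI."""
    return static_instructions * bits_per_instruction


# Hypothetical program: 10,000 instructions in a 32-bit encoding, and
# (assumed) 6% more instructions when re-encoded in 24 bits.
size_32 = code_size_bits(10_000, 32)
size_24 = code_size_bits(10_600, 24)

print(size_32, size_24, f"{size_24 / size_32:.3f}")
```

Under these assumptions the narrower encoding still yields the smaller program, because the reduction in BI outweighs the increase in IS.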
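[0017]
This aspect of the invention is based on the recognition that good code density must be provided in a fixed-length, high-performance encoding based on RISC principles, including a general-register load/store architecture. To achieve exemplary code density, the embodiment adds a simple variable-length encoding that does not compromise performance. This embodiment also optimizes the cost of processor implementation.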
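[0018]
[Embodiments of the invention]
These and other objects of the invention will become readily apparent from the following detailed description and the accompanying drawings.
FIGS. 1 to 4 show a processor suitable for implementing an instruction set according to a preferred embodiment of the invention. In general, the processor has 2^32 bytes, i.e. 4 GB, of virtual memory for instructions and data; a 32-bit program counter IPC; sixteen or more 32-bit general registers; a shift amount register SAR; a 32-bit loop begin address register LBEG; a 32-bit loop end address register LEND; and a 32-bit loop count register LCOUNT. The last three registers are all used by the reduced-overhead loop instructions described in detail below.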
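[0019]
Specifically, the processor pipeline has five basic stages: instruction fetch, or I stage 100; instruction decode and register access, or R stage 200; execution and address calculation, or E stage 300; memory access, or M stage 400; and writeback, or W stage 500. In the I stage 100, program memory is accessed to retrieve the instruction to be executed. In the R stage 200, the fetched instruction is decoded and the registers it uses, if any, are accessed. Then, in the E stage 300, the register contents and constants decoded in the R stage 200 are processed by the processor's ALU 332 according to the instruction operands. In the M stage 400, any necessary memory accesses such as loads and stores are performed. Finally, in the W stage 500, the result of executing the instruction is written back to the general registers as dictated by the instruction operands.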
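[0020]
In the I stage 100, a word is fetched from the instruction cache based on the address held in I-stage program counter IPC 104. The word is read from instruction cache RAM IC 102 (which, together with other components described below, forms instruction cache 116) and is combined by alignment unit ALIGN 108 with the last word fetched, which is held in last-fetch register 106, and stored in R-stage instruction register RINST 202. Cache misses are handled by a memory fetch from main program memory through staging register IREFILL 110 into cache RAM IC 102, with the tags adjusted accordingly using register IMISSA 112 and tag cache RAM ITAG 114. Multiplexer 118 selects either the output of cache RAM IC 102 or an instruction fetched directly from main memory and outputs the selected data to alignment unit 108, which concatenates it with the last fetched word stored in last-fetch register 106 and, if necessary, selects a subset of it to adjust for instruction length variation. Tag comparator 122 detects cache misses and signals them to I-stage controller 124, which controls the overall operation of that stage.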
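[0021]
Although the circuit shown as element 104 is called a program counter here, it should be recognized that I-stage program counter 104 is not actually used to count instructions but to count the words to be fetched. Subsequent program counters such as R-stage program counter 204, however, do count actual instructions in the preferred embodiment. Those skilled in the art will also readily understand that, in addition to I-stage controller 124, corresponding R-stage controller 224, E-stage controller 324 and M-stage controller 424 each control the overall operation of their respective stages. Likewise, R-stage status register 203, E-stage status register 303, M-stage status register 403 and W-stage status register 503 each supply their controllers with relevant status information about the instruction in the respective pipeline stage, such as whether the data is valid. Certain features, such as the multiplexer select lines running from the stage controllers to their multiplexers, clock signals, and exception vector addresses, have been omitted for ease of explanation; those skilled in the art will readily recognize their arrangement.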
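[0022]
While the instruction is being supplied to R-stage instruction register 202, adder 128 in next-address generation section 126 increments the current word address to point to the next word to be fetched and supplies it to multiplexer 130, which feeds it back to instruction program counter 104. When a loop instruction (described in more detail below) is executed, it loads the loop begin address into loop begin register LBEG 132, after which multiplexer 130 can supply the begin address to program counter 104. Loop end register 134, as used by the loop instructions, supplies a value that comparator 136 compares with the current address in order to detect the end-of-loop condition and decrement loop index register LCOUNT 138. Comparator 140 signals instruction controller 124 when the count is zero so that execution continues outside the loop. Otherwise, LCOUNT 138 is decremented by decrementer 142 and passed through multiplexer 144 (which is also used to load the register). Finally, R-stage PC select multiplexer 146 selects the address value to be supplied to the E stage 300, as described in more detail below.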
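[0023]
The instruction stored in R-stage instruction register 202 is decoded by decoder 201, which extracts predetermined parameter fields and decodes immediate or constant fields according to the instruction opcode. The decoded instruction is pipelined to E-stage decoded instruction register 302 for execution. In parallel with instruction decode, fields from the instruction are sent to register file 206 via adders 208-212, which add the window base value to them for windowed register operation, as described in detail below. One adder is used for each of the two source register fields and the one destination register field that may be present in an instruction.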
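[0024]
Values in register file 206 are read and supplied to multiplexers 214 and 216 and then to E-stage S and T registers 304 and 306. Multiplexers 214 and 216 can supply values from register file 206 or, if the required data has not yet been written to file 206, they may use values supplied from the E stage, as described below. Multiplexer 214 may also receive a constant value from instruction decoder 204.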
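[0025]
Adder 218 computes an indexed target address from the contents of R-stage program counter 201 and an index constant from instruction decoder 204, and stores the result in E-stage branch register 308. Adder 220 computes the next instruction address by adding 2 or 3, depending on the instruction length, via multiplexer 222, to the value in R-stage program counter 201, and sends the result to next-PC register 310 for use when a branch is not taken.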
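[0026]
Moving to the E stage 300, bypass multiplexers 318 and 320 select the operands for the various functional units (branch unit 326, shift/mask unit 330, ALU 332, AGEN 334 and store align unit 336). The multiplexer selects were computed in the R stage 200 by bypass block EBYP 314, based on the instructions currently in the E stage 300, M stage 400 and W stage 500 as recorded in registers 228, 230 and 232 respectively, and were pipelined through bypass block EBYP 226. When the result is available from the R stage 200, each of multiplexers 318 and 320 selects ET register 312 or ES register 316. The other inputs to multiplexers 318 and 320 come from the M stage 400 and the W stage 500.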
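[0027]
Branch unit 326 uses the two operands from multiplexers 318 and 320 to produce a conditional branch taken/not-taken decision, which is supplied to controllers 124 and 224 in the I stage 100 and R stage 200 respectively, where it drives the multiplexer selects. Shift/mask unit 330 implements the shift and extract instructions based on the output of multiplexer 328. It takes not only two operands from multiplexers 318 and 320 but also a mask input from decoded instruction register EINSTD 302, which also feeds M-stage instruction register MINSTD 402. The shift amount is selected from EINSTD 302 for shifts by a constant, or from ESAR 322 for shifts by a variable amount. ESAR 322 holds the ISA state SAR for the E stage 300.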
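[0028]
ALU 332 implements arithmetic and logic functions including ADD, ADDI, ADDX2, SUB, AND, OR and XOR. The outputs of shift/mask unit 330 and ALU 332 are multiplexed in multiplexer 338, based on the instruction type, and supplied to MALU register 406. Address generation unit AGEN 334 computes the sum of a register operand and an offset from the decoded instruction in EINSTD 302. Its output is sent to M-stage virtual address register MVA 408. Store align unit 336 shifts the output of ET multiplexer 318 by 0, 8, 16 or 24 bit positions to align the store data to the appropriate byte position for memory. Its output is sent to M-stage store data register MSD 410.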
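[0029]
As in the preceding pipe stages, ECTL 324 handles control of the E stage 300 and updating the status of the instruction executing there. E-stage instruction address program counter EPC 304 is pipelined to M-stage instruction address program counter MPC 404 for exception handling.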
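[0030]
The M stage 400 of the pipeline handles the second half of load and store instructions and exception determination for all stages. The output of MPC 404 is sent to WPC register 504. If the instruction in the M stage 400 is invalidated by an exception or interrupt, the output of WPC 504 is loaded into one of the ISA-specified exception instruction address registers EPC[i] (not shown), which are distinct from E-stage program counter EPC 304. If the instruction in the M stage 400 must be retried (for example, because of a cache miss), the contents of WPC register 504 are sent back to the I stage 100 to restart instruction fetch.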
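[0031]
A shift or ALU instruction is simply passed from MALU 406 to WALU 506 in this stage. The output of MALU 406 is also supplied to bypass multiplexers 318 and 320 in this stage, so that the result of a shift or ALU instruction can be used by a following instruction before it is written to the register file. Load instructions in the W stage 500 read both the data cache RAM and the data tag RAM. Store instructions in the W stage 500 read only the data tag RAM. The data cache RAM write is delayed until the tag compare is complete. Non-load instructions write any pending store data to the data cache RAM. A store followed by a load from the same address requires a special bypass, because the store data has not yet been written to the data cache RAM.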
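[0032]
Load instructions in the W stage send the index portion of virtual address MVA 408 to the address input of the data tag RAM and, through multiplexer 422, to the address input of direct-mapped data cache RAM DC 434. In parallel with the read of DC 434, the address is compared with the pending store virtual index and valid bit in STVI 416. If, based on the output of comparator 428, the read is to the cache index of the pending store, multiplexer 432 selects the contents of pending store data buffer 418. Otherwise the DC read data is selected. Multiplexer 432 feeds load align circuit 436, which shifts the load data by 0, 8, 16 or 24 based on the two low-order bits of the virtual address; the result is then zero-extended from bit 7 or bit 15 for the L8UI and L16UI instructions respectively, and sign-extended from bit position 15 for the L16SI instruction. The result is latched by WLOAD 508. The output of the data tag RAM is compared by comparator 430 with the high-order bits of the M-stage virtual address from MVA 408, and the hit/miss result is sent to M-stage control logic MCTL 424, which handles cache misses and exceptions. Finally, the load virtual address is captured in WMA 510 to handle cache misses.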
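The behaviour of the load align circuit can be sketched in software. The following is a minimal sketch that assumes little-endian byte numbering within the 32-bit word (an assumption, since the paragraph does not state the byte order); the function name and the sample word are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def load_align(word: int, vaddr: int, size_bits: int, signed: bool = False) -> int:
    """Shift a 32-bit cache word by 0, 8, 16 or 24 and extend the result.

    The two low-order address bits select the byte lane. The value is then
    zero-extended (as for L8UI, L16UI) or sign-extended (as for L16SI).
    """
    shift = (vaddr & 0x3) * 8                      # 0, 8, 16 or 24
    value = (word >> shift) & ((1 << size_bits) - 1)
    if signed and value & (1 << (size_bits - 1)):  # sign bit set: extend with ones
        value -= 1 << size_bits
    return value & MASK32


word = 0x8001F2A5
print(hex(load_align(word, 0, 8)))                # byte at offset 0
print(hex(load_align(word, 2, 16)))               # zero-extended halfword
print(hex(load_align(word, 2, 16, signed=True)))  # sign-extended halfword
```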
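[0033]
A load cache miss invalidates the instructions in stages I through M of the pipeline. The load address from WMA 510 is sent to external memory. Data read from that memory is written into data cache RAM 424 through multiplexer 412 and STDATA 418, using the low-order bits of WMA 510 as the address. Data tag RAM 426 is written, through multiplexer 414 and STADDR 420, from the high-order miss address captured in WMA 510, and DTAG 420 is addressed by the low-order bits from MVA 408.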
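[0034]
Store instructions in the W stage 500 place the store address and data in STADDR 418 and STDATA 420. In addition, data tag RAM 426 is accessed and its output is compared with the high-order bits of MVA 408 to determine whether the store address is a hit or a miss. If the store hits in the cache, the contents of STDATA 418 are written, in the first non-load cycle, into data cache RAM 424 at the address held in STADDR 420. When the cache miss refill is complete, the instruction fetch unit begins fetching instructions again, starting with the load instruction that missed. The data cache of this embodiment is write-through, so the store address and data are also sent from STADDR 420 and STDATA 418 to write buffer 438, where they are held until written to external memory.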
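[0035]
The outputs of WALU and WLOAD registers 506 and 508 are selected by multiplexer 512 and written into register file 206 in the R stage 200, provided that the instruction is still valid at this point and is an instruction with an A-register result.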
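[0036]
The processor also has a 6-bit shift amount register. Conventional immediate shifts, such as logical left, logical right and arithmetic right, are provided, but single-instruction shifts in which the shift amount is a register operand are not, because a direct variable shift is likely to be a critical timing path, and simple shifts do not extend efficiently to larger widths. Funnel shifts can be extended, but they require too many operands. The processor according to the preferred embodiment solves these problems by providing funnel shifts whose shift amount is taken from the SAR register. Variable shifts are synthesized by the compiler using an instruction that computes SAR from the shift amount in a general register, followed by a funnel shift. The legal range of values for SAR is 0 to 32 rather than 0 to 31, so six bits are used for the register.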
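The synthesis of variable shifts from a funnel shift can be sketched as follows. This is a minimal sketch under stated assumptions: the funnel shift is modelled as a right shift of a 64-bit concatenation of two 32-bit operands, and the left shift is assumed to set SAR to 32 minus the shift amount, which is one way to explain why SAR must be able to hold the value 32. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def funnel_shift_right(high: int, low: int, sar: int) -> int:
    """Shift the 64-bit value high:low right by SAR and keep the low 32 bits."""
    if not 0 <= sar <= 32:
        raise ValueError("SAR must be in the range 0..32")
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> sar) & MASK32


def variable_shift_right_logical(value: int, amount: int) -> int:
    sar = amount & 31                      # SAR computed from a general register
    return funnel_shift_right(0, value, sar)


def variable_shift_left(value: int, amount: int) -> int:
    sar = 32 - (amount & 31)               # 1..32: the value 32 needs a sixth bit
    return funnel_shift_right(value, 0, sar)


print(hex(variable_shift_left(0x80000001, 4)))
print(hex(variable_shift_right_logical(0xF0000000, 28)))
```

A left shift by zero maps to SAR = 32 in this model, which is why a 5-bit register would not suffice.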
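[0037]
Of course, given the description of the instruction set set out in detail below, various other processor architectures according to the invention will be readily apparent to those skilled in the art. Such structures are also intended to fall within the scope of the claims.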
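[0038]
Various pipeline structures may be used within the processor. However, certain features of the instruction set work best with certain classes of implementation. One such type is shown generally in FIG. 5. This type of architecture can be used to advantage with major computational units such as floating-point units and DSPs. One notable point of this pipeline architecture is that such units are placed after the Dcache (at the position labeled DReg/DALU in FIG. 5), which allows instructions for such units to include a memory reference as one source operand. This makes it possible to perform a data cache reference and an operation every cycle without the need to fetch and execute multiple instructions per cycle.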
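[0039]
[General instruction set design considerations]
Many instruction set features improve performance (by lowering IE) and code size (by lowering IS) at the expense of increased processor implementation cost. For example, "auto-increment" addressing modes (in which the base address register is read and then written back with the incremented address) require a second register file write port for loads. "Indexed" addressing modes (in which the sum of two registers forms the virtual address) require three register file read ports for stores. The preferred embodiment is tailored to a register file with two read ports and one write port, the minimum needed for reasonable performance.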
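[0040]
The preferred embodiment has some features that increase implementation cost, but features that would cost as much as an additional register file port are avoided. This is especially important when an implementation executes multiple instructions per cycle, because the number of ports is multiplied by the processor's maximum execution capability (for example, two to eight).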
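[0041]
To maintain performance, the instruction set must support at least two source register fields and one distinct destination register field; otherwise both IE and IS increase. General-register instruction sets that optimize only for code density are often designed around two register fields, one used only as a source and one used as both source and destination (for example, Hitachi SH). This reduces code size when the increase in IS is offset by the decrease in BI, but there is no way to compensate for the increase in IE. Instruction sets that specify fewer registers use narrower register fields and therefore have lower BI, but they increase IE and IS by forcing more variables and temporaries to live in memory and hence requiring additional load and store instructions. When code density is the only priority, the increase in IS may be more than offset by the decrease in BI for a net saving, but when good performance is also required there is no way to compensate for the increase in IE.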
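[0042]
As the number of registers increases, the reductions in IE and IS diminish and level off. The instruction set must provide enough registers to at least reach the point of diminishing returns, that is, the point at which a further increase in register count yields no correspondingly significant decrease in IE. In particular, at least sixteen general registers are needed for RISC performance levels. Three 4-bit register fields require at least 12 bits to encode. Bits for the opcode and constant fields are also required, so a 16-bit encoding, as used by some processors, is insufficient.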
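[0043]
[24-bit encoding]
One reason that most of the prior art has failed to strike an appropriate balance between code size and performance is that instruction set designers have felt constrained to certain instruction sizes, such as 16 or 32 bits. There are indeed advantages to using an instruction size that is a simple ratio to the processor's data word width. However, relaxing the restriction somewhat has significant advantages.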
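[0044]
The preferred embodiment uses a 24-bit fixed-length encoding as its starting point. Twenty-four bits is not only sufficient for high performance but also leaves room for extensibility and for instructions that reduce IE. Alternative embodiments may use encodings in the range of 18 to 28 bits, but those smaller than 24 bits would have limited extensibility and branch range. The 24-bit encoding typically represents a 25% reduction in BI, and hence in code size, relative to most 32-bit RISC instruction sets. Finally, 24 bits is fairly easy to accommodate in a processor with a 32-bit datapath width.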
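[0045]
The preferred embodiment uses 4-bit register fields, which is the minimum required for acceptable performance and the maximum that fits well within a 24-bit instruction word. Many RISC instruction sets use 32 registers (5-bit register fields). After three 5-bit register fields, a 24-bit instruction leaves only 9 bits for opcode and constant fields. Short constant fields are likely to result in inadequate range for branches, calls and other PC-relative references. Too few opcode bits give inadequate extensibility. For these two reasons, a 24-bit instruction word with 5-bit register fields is undesirable. The performance difference (due to the difference in IE) between 16 and 32 general registers (about 6%) is not as large as that between 8 and 16 general registers, and is small enough that other features can be introduced to make up the lost performance (for example, compound instructions and register windows, as seen below). The increase in IS (also about 6%) is more than offset by the difference between 24-bit and 32-bit encodings.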
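[0046]
It should also be recognized that many instruction sets with 5-bit register fields do not provide 32 general registers for compilation. Most dedicate a register to hold zero, but the need for a zero register can easily be removed by providing a few extra instruction opcodes. Other registers are often given specific uses, which can also be avoided by including other features in the instruction set. For example, MIPS uses two of its 31 general registers for exception handling code and one for a global area pointer. It therefore really has only 28 registers for variables and temporaries, just 12 more than an instruction set with 4-bit register fields and appropriate instruction set features. The division of general registers into caller-saved and callee-saved registers by software convention is common and further reduces the utility of larger register files. The preferred embodiment includes features that avoid this, as described in more detail below.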
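[0047]
[Compound instructions]
To lower IS and IE, the preferred embodiment also provides single instructions that combine the functions of multiple instructions commonly found in RISC and other instruction sets. One example of a simple compound instruction is left shift and add/subtract. HP PA-RISC(TM) and DEC Alpha(TM) are examples of instruction sets that provide these operations. Address arithmetic and multiplication by small constants often use these combinations, and providing these operations reduces IE and IS at the potential cost of increasing CP (because an additional logic level is added in series in the computational pipeline stage). However, various implementations have demonstrated that when the shift is limited to 0 to 3, the additional logic is not the most critical constraint on CP. Conversely, the ARM instruction set provides arbitrary shift and add, and its implementations have had comparatively poor CP.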
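[0048]
Right shifts are often used to extract a field from a larger word. To extract an unsigned field, two instructions are generally used (either a left shift followed by a right shift, or a right shift followed by an AND with a constant). In the preferred embodiment a single compound instruction, extui, performs this function. It is implemented as a shift followed by an AND with a mask that is specified by an encoding of just 4 bits in the instruction word. The AND portion of extui is logically so trivial that its inclusion in the instruction set is unlikely to increase the CP of an implementation. The same is not true of an instruction to extract signed fields, and so none is included.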
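The effect of extui can be sketched as a right shift followed by an AND with a mask of low-order ones. The following is a minimal sketch; it assumes that the 4-bit field encodes the mask width minus one (so widths 1 to 16), which the text does not state, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def extui(value: int, shift: int, width_field: int) -> int:
    """Extract an unsigned field: shift right, then AND with a mask of ones.

    Assumption: the 4-bit field holds (width - 1), giving widths of 1..16 bits.
    """
    width = (width_field & 0xF) + 1
    mask = (1 << width) - 1
    return ((value & MASK32) >> (shift & 31)) & mask


def extract_with_two_shifts(value: int, shift: int, width: int) -> int:
    """The conventional two-instruction equivalent: left shift, then right shift."""
    left = ((value & MASK32) << (32 - shift - width)) & MASK32
    return left >> (32 - width)


print(hex(extui(0x12345678, 8, 7)))                     # 8-bit field at bit 8
print(hex(extract_with_two_shifts(0x12345678, 8, 8)))   # same field, two shifts
```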
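[0049]
Most instruction sets, RISC and otherwise (for example ARM, DEC PDP-11, DEC VAX, Intel x86, Motorola 68000, Sun SPARC, Motorola 88000), use a compare instruction that sets condition codes, followed by a conditional branch instruction that tests the condition codes to determine the flow of control. Conditional branches make up 10-20% of the instructions in most RISC instruction sets, and each is usually paired with a compare instruction, so this style of instruction set is wasteful. Even older instruction sets were often based on a compare-and-skip style of conditional, which has the same drawback as separate compare and branch.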
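[0050]
Some instruction sets (for example Cray-1, MIPS, DEC Alpha, HP PA-RISC, and the later V9 version of Sun SPARC) provide compound compare-and-branch capability of varying flexibility. Cray and DEC Alpha provide only compare of a register with zero and branch. MIPS provides register-zero comparisons and register-register equality and inequality with branch. HP PA-RISC provides a fairly complete set of register-register compare-and-branch instructions.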
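[0051]
The preferred embodiment provides the most useful compound compare-and-branch instructions. Choosing the exact set requires balancing the utility of each compare-and-branch against the opcode space it consumes, especially when a 24-bit (as opposed to 32-bit) encoding is the goal. Other instruction sets fail this test. For example, HP PA-RISC provides several compound compare-and-branch opcodes of almost no utility (for example, never, and overflow after add) and omits some that are useful.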
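[0052]
The set of compound compare-and-branch instructions chosen for the preferred embodiment is:
A == 0, A != 0, A <S 0, A >=S 0,
A == B, A != B, A <S B, A <U B, A >=S B, A >=U B,
(A & B) == 0, (A & B) != 0, (~A & B) == 0, (~A & B) != 0,
A == I, A != I, A <S I, A <U I, A >=S I, A >=U I,
bit B of A == 0, bit B of A != 0,
bit I of A == 0, bit I of A != 0
where A and B denote the contents of registers, the suffixes "U" and "S" on relational operators applied to registers denote "unsigned" and "signed" comparisons of the register contents respectively, the suffix on a relational operator applied to zero (for example A <S 0) likewise denotes an unsigned or signed comparison against zero, and I denotes an index constant.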
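[0053]
Compound compare-and-branch reduces IE and IS compared with instruction sets that have separate compare and branch, and even compared with partial compare-and-branch instruction sets such as MIPS and DEC Alpha. The preferred embodiment may require an increase in CPI to implement compound compare-and-branch, but the overall performance effect is still an improvement.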
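[0054]
The main advantage of separate compare and branch instruction sets is that two instruction words are available for specifying the comparison operator, the comparison operands and the branch target, which allows generous field width allocations for each. In contrast, compound compare-and-branch instruction sets must pack all of these into a single instruction word, which results in smaller fields and the need for a mechanism to handle values that do not fit (for example, branches with a longer range). The preferred embodiment packs a compare opcode, two source register fields and an 8-bit PC-relative offset into a 24-bit instruction word. The 8-bit target specifier will be insufficient in some cases, and the compiler or assembler must then use a conditional branch of the opposite sense around an unconditional branch with a longer range, which the preferred embodiment provides. This of course increases IE and IS, which is undesirable. For this reason the preferred embodiment also provides a series of compound compare-and-branch instructions that test against zero, the most common case. These instructions have a 12-bit PC-relative offset, which provides a much greater range than their counterparts. The extra complexity of providing both forms is balanced by the improvement in IE and IS. Unlike MIPS and DEC Alpha, the preferred embodiment does not provide all comparisons against zero (register less-than-or-equal-to zero and register greater-than zero are omitted); here again, it provides a set of instructions that balances program needs against opcode space.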
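[0055]
One consequence of using only 24 bits to encode all instructions is that the constant fields in the instruction word are limited in size. This could potentially increase IS and IE (although the increase in IE can be reduced by loading the constants into registers outside of loops). The preferred embodiment addresses this problem in several ways. First, it provides small constant fields to capture the most common constants. To make the most of narrow (for example 4-bit) constant fields, the instruction set uses the field to encode a constant value rather than to specify it directly. The encoded values are chosen as the N (for example 16) most frequent constants from a broad set of program statistics. The preferred embodiment uses this technique in the addi.n instruction, where the 16 values are chosen to be -1 and 1 to 15 rather than 0 to 15. Adding 0 is of no use (there is a separate mov.n instruction), and adding -1 is common. The beqi, bnei, blti and bgei instructions also use a 4-bit field that encodes various common constants. The bltui and bgeui instructions use a different encoding, because unsigned comparisons have a different set of useful values.
The processor has at least one instruction with a constant field that specifies a constant value in a lookup table. The test includes a comparison between a source register and a constant formed by referencing the lookup table location specified by the field value.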
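The idea of encoding, rather than directly specifying, a constant in a 4-bit field can be sketched as follows. This is a minimal sketch: the text states only that the addi.n values are -1 and 1 to 15, so the choice of field value 0 as the code for -1 is an assumption, and the branch constant table below is purely illustrative and is not taken from the specification. The function and table names are hypothetical.

```python
def decode_addi_n_immediate(field: int) -> int:
    """Decode a 4-bit addi.n field into one of -1, 1, 2, ..., 15.

    Assumption: field value 0 is the one remapped to -1.
    """
    field &= 0xF
    return -1 if field == 0 else field


# Illustrative lookup table of "frequent" comparison constants (hypothetical).
ILLUSTRATIVE_BRANCH_CONSTANTS = (-1, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 16, 32, 64, 128, 256)


def branch_if_equal_immediate(register_value: int, field: int) -> bool:
    """Compare a register with the constant selected by a 4-bit table index."""
    return register_value == ILLUSTRATIVE_BRANCH_CONSTANTS[field & 0xF]


print([decode_addi_n_immediate(f) for f in range(16)])
print(branch_if_equal_immediate(16, 11))
```

A table of this kind lets a 4-bit field reach constants such as 256 that a directly specified 4-bit value could not express.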
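[0056]
The most common constants are typically quite small, and narrow fields capture most of the desired values. However, the constants used in bitwise logical operations (for example AND, OR, XOR) represent bit masks of various kinds and often do not fit in small constant fields. For example, constants with a single bit set to 1 in any position, or a single bit set to 0 in any position, are common. Bit patterns consisting of a sequence of 0s followed by a sequence of 1s, or a sequence of 1s followed by a sequence of 0s, are also common. For this reason the preferred embodiment has instructions that avoid the need to put a mask directly into the instruction word. Examples in the preferred embodiment are the bbci and bbsi instructions, which branch according to whether a specified bit of a register is 0 or 1 respectively. The bit is given as a bit number rather than as a mask. The extui instruction (described above) performs a shift followed by a mask consisting of a series of 0s followed by a series of 1s, where the number of 1s is a constant field in the instruction.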
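[0057]
[Coprocessor Boolean registers and branches]
Because compound compare-and-branch packs so much into an instruction word narrower than 32 bits, the instructions listed above consume a significant fraction of the available instruction word. This is a worthwhile trade-off for these branches because of their frequency and the resulting savings.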
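[0058]
Among the other constraints on instruction set design, the instruction set must be extensible (allowing the addition of new data types), a feature to be exploited in tightly coupled coprocessors. However, a short instruction word may lack the space to add compound compare-and-branch instructions for other data types such as floating point and DSP. Furthermore, it may not be feasible for each coprocessor to implement its own compound compare-and-branch. Even where implementing individual compound compare-and-branch instructions is feasible, it may be wasteful, since compare-and-branch on such data types is less frequent than on integer data in many applications.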
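[0059]
For this reason, the preferred embodiment of the invention uses a different method for coprocessor conditional branches. In the preferred embodiment, the instruction set includes an optional package that is a prerequisite for any coprocessor package. This package adds sixteen single-bit Boolean registers and the BF (branch if false) and BT (branch if true) instructions, which test these Boolean registers and branch accordingly. Coprocessors then provide instructions that set the Boolean registers based on, for example, comparisons of their supported data types. The Boolean registers and the BF and BT instructions are shared by all coprocessors, which makes efficient use of the short instruction word.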
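[0060]
This is a new variation on the condition-code-based compare and branch found in many earlier instruction sets, as discussed above. Earlier instruction sets have had multiple multi-bit condition codes shared between the processor and its coprocessors (for example PowerPC), or a single-bit condition code for each of multiple coprocessors (for example MIPS). The preferred embodiment of the invention uses multiple shared single-bit condition codes.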
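[0061]
Providing multiple destinations for comparisons (as in the preferred embodiment of the invention, MIPS and PowerPC) gives the compiler more freedom to schedule code and allows instructions that compare multiple data values in a single instruction and produce multiple results (for example MIPS MDMX).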
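[0062]
Sharing the comparison result registers among multiple coprocessors (the present invention), or between the processor and its coprocessors (as in PowerPC), saves on the number of opcodes needed to test comparison results. It also increases the feasibility of providing instructions that perform logical operations on the comparison result registers (as in the preferred embodiment of the invention and PowerPC).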
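[0063]
Using single-bit comparison result registers (the preferred embodiment of the invention, MIPS) instead of multi-bit ones (most other ISAs) increases the number of compare opcodes required but decreases the number of branch opcodes required. The preferred embodiment uses single-bit comparison result (Boolean) registers because branch instructions must also provide a PC-relative target address, so that adding branch opcodes is the more expensive option unless there is a very large number of coprocessors.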
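[0064]
In summary, although compound compare-and-branch is an important technique for minimizing code size, the need to keep BI small means that a split approach is appropriate for coprocessor compare and branch, owing to its different frequency and the number of distinct coprocessor opcodes required. Within the range of split compare-and-branch options, the use of multiple single-bit comparison result registers shared among coprocessors makes the most efficient use of opcode space.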
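[0065]
[Load and store instructions]
The load and store instructions of the preferred embodiment use an instruction format with an 8-bit constant offset that is added to a base address from a register. The preferred embodiment first makes the most of these 8 bits and then provides a simple extension method for cases in which they are insufficient. The load/store offsets of the preferred embodiment are zero-extended rather than sign-extended (as is common in many other instruction sets), because the values 128 to 255 are more common than the values -128 to -1. Also, because most references are to aligned addresses from aligned base registers, the offset is shifted left appropriately for the reference size. The offset for 32-bit loads and stores is shifted by 2, the offset for 16-bit loads and stores is shifted by 1, and the offset for 8-bit loads and stores is not shifted. Since most loads and stores are 32-bit, this technique provides two additional bits of range.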
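The address calculation described above can be sketched as follows. This is a minimal sketch; the function name and the example base address are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# Left shift applied to the 8-bit offset, keyed by access size in bytes.
OFFSET_SHIFT = {1: 0, 2: 1, 4: 2}


def effective_address(base: int, offset_field: int, size_bytes: int) -> int:
    """Base register plus zero-extended 8-bit offset scaled by the access size."""
    offset = (offset_field & 0xFF) << OFFSET_SHIFT[size_bytes]
    return (base + offset) & MASK32


base = 0x1000
for size in (1, 2, 4):
    reach = effective_address(base, 0xFF, size) - base
    print(f"{size * 8:2d}-bit access: maximum offset {reach} bytes")
```

For 32-bit accesses the largest offset is 1020 bytes, compared with 255 bytes for an unscaled field, which corresponds to the two additional bits of range mentioned in the text.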
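[0066]
When the 8-bit constant offset specified in a load/store instruction (or addi instruction) is insufficient, the preferred embodiment provides the addmi instruction, which adds its 8-bit constant shifted left by 8. A two-instruction sequence thus has 16 bits of range: 8 from addmi and 8 from the load/store/addi. Constants that cannot be encoded by one of the above methods must be loaded into a register by separate instructions (this technique is not applicable to load/store instructions, which take only a single register operand rather than two and therefore need the addmi solution above). The preferred embodiment provides two methods for loading constants into registers. The first is the movi instruction (and movi.n in the short instruction format described below). movi specifies its constant in a 12-bit sign-extended pair of fields in the instruction word. Assigning a constant value to a register variable is also common in its own right.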
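[0067]
No instruction in an instruction format of 32 bits or fewer can encode an arbitrary 32-bit constant, so some other method is needed to set a register to an arbitrary constant value. At least two methods have been used in other instruction sets, and either may be used together with the above techniques to provide a solution. The first solution is to provide a pair of instructions that together synthesize a 32-bit constant from multiple constants, one in each instruction (for example MIPS LUI/ADDI; DEC Alpha and IBM PowerPC have instructions that specify the high 16 bits and the low 16 bits in two separate instructions). The second solution (for example MIPS floating-point constants, MIPS16 and ARM Thumb) is to provide a simple way to read the constant from memory with a load instruction.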
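[0068]
Using a load instruction to reference constants can provide lower IS and IE than using a sequence of instructions, provided that the load itself requires only a single instruction. For example, MIPS compilers dedicate one of the 31 general registers to hold a pointer to a constant pool in which (among other things) 4-byte and 8-byte floating-point constants are kept. If the area addressed by this register is smaller than 64 KB, the constants can be referenced by a single load instruction, because MIPS has a 64 KB offset range in loads. For a constant that is referenced once, the total size of a 32-bit load instruction plus a 32-bit constant is the same as that of two instruction words. If the constant is referenced twice or more, the constant pool gives a smaller total size. The trade-off is different for other instruction lengths, such as the 24-bit size of the preferred embodiment, where the constant pool plus load is 56 bits versus 48 bits for a pair of 24-bit instructions. Nevertheless, when a constant is used multiple times, the constant pool is almost always the better solution in terms of total size.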
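The size comparison in this paragraph can be reproduced with a short calculation. The following is a minimal sketch; the function names are hypothetical, and it assumes that the pool entry is stored once and that each reference costs one load, or two inline instructions, respectively.

```python
def pool_bits(references: int, instruction_bits: int, constant_bits: int = 32) -> int:
    """One pool entry plus one load instruction per reference."""
    return constant_bits + references * instruction_bits


def inline_bits(references: int, instruction_bits: int, instructions_per_constant: int = 2) -> int:
    """A two-instruction synthesis sequence repeated at every reference."""
    return references * instructions_per_constant * instruction_bits


for width in (32, 24):
    for refs in (1, 2, 3):
        print(f"{width}-bit, {refs} reference(s): "
              f"pool {pool_bits(refs, width)} bits, inline {inline_bits(refs, width)} bits")
```

With 24-bit instructions a single reference favours the inline pair (48 bits against 56), while two or more references favour the constant pool.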
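[0069]
The MIPS technique of dedicating a register to address constants and other values is undesirable for the preferred embodiment and other embodiments of the invention because, as discussed above, narrower instruction words generally provide fewer than 32 registers, so each register is more valuable. Also, the offset available from a register in a narrow instruction set is limited, so a single register gives access to only a small constant pool (too small to be practical). The preferred embodiment adopts the solution of a number of instruction sets (for example PDP-11, Motorola 68000, MIPS16, ARM Thumb) in providing a PC-relative load that can be used to access constant pools.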
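[0070]
Either technique for loading arbitrary constants is applicable to the present invention. The preferred embodiment uses the second technique, while an alternative embodiment uses multiple instructions, each containing part of the complete constant. A specific example of an alternative embodiment for a 24-bit instruction word has one instruction that places a 16-bit instruction constant in the upper part of a register (16-bit constant + 4-bit register destination + 4-bit opcode = 24 bits) and a second instruction that adds a 16-bit signed constant to a register (16-bit constant + 4-bit register source and destination + 4-bit opcode = 24 bits).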
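The two-instruction alternative can be sketched by splitting a 32-bit constant into an upper 16-bit part and a signed 16-bit addend. This is a minimal sketch; the function names are hypothetical, and the adjustment of the upper half when the lower half is negative is an assumption that follows from the second instruction adding a signed constant.

```python
MASK32 = 0xFFFFFFFF


def split_constant(value: int) -> tuple[int, int]:
    """Return (upper16, signed_lower16) such that (upper16 << 16) + lower == value."""
    value &= MASK32
    low = value & 0xFFFF
    if low >= 0x8000:                  # the second instruction sign-extends its constant
        low -= 0x10000
    high = ((value - low) >> 16) & 0xFFFF
    return high, low


def materialize(high: int, low: int) -> int:
    """Model the pair: load the upper half, then add the signed lower half."""
    register = (high << 16) & MASK32   # first instruction
    return (register + low) & MASK32   # second instruction


high, low = split_constant(0x1234ABCD)
print(hex(high), low, hex(materialize(high, low)))
```

When the low half has its top bit set, the upper half is incremented by one so that the signed addition lands on the intended value.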
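[0071]
[Reduced-overhead loop instructions]
The preferred embodiment also provides a loop feature that is found in some digital signal processors (DSPs) but not in RISC processors. Most RISC processors use their existing conditional branch instructions to create loops rather than providing new features for the purpose. This economy keeps the processor simple but increases IE and IS. For example, the C loop

[C source listing not reproduced in this text]

would be compiled, in the preferred embodiment, as

[assembly listing not reproduced in this text]

Every iteration contains two instructions of "loop overhead", an add and a conditional branch. (Without the compare-and-branch feature of the preferred embodiment, three overhead instructions would be needed.) This clearly increases IE. In addition, taken conditional branches in some processor implementations may require more cycles to execute than other instructions because of pipelining and/or branch prediction, so CPI may increase as well. Some instruction sets add a single instruction to increment or decrement a register, compare, and branch (for example DEC PDP-6, DEC PDP-11, IBM PowerPC), which lowers IE in this case. (The implementation of the IBM PowerPC instruction also aims to lower CPI.)
When the loop body is small, the performance impact of loop overhead is higher still. Many compilers use an optimization called loop unrolling in this case to spread the loop overhead over more than one iteration. In C, the loop above would be transformed, for example, into

[unrolled C source listing not reproduced in this text]

The i + constant expressions can be folded into the instructions of the body (for example into the offsets of load and store instructions), so that only one increment is needed per pass through the unrolled loop.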
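[0072]
Loop unrolling by factors greater than 2 is quite common, with 4 and 8 being typical (powers of two have some advantages). The thing to note even about unrolling by a factor of 2 is the resulting increase in code size (the body occurs three times in the example above). The use of this technique in RISC processors to achieve performance is consistent with their emphasis on performance and simplicity over code size.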
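[0073]
Many DSPs, and some general-purpose processors, provide other ways to implement certain kinds of loops. The first method is to provide an instruction that repeats a second instruction a fixed number of times (for example TI TMS320C2x, Intel x86). This has the advantage of being quite simple to implement. Where it is applicable, it eliminates loop overhead and saves power by removing the need to fetch the same instruction repeatedly. Some instruction sets with repeat instructions require that the processor not take interrupts during the loop, which is a significant limitation. Also, single-instruction loops are useful only in limited circumstances, and only when the repeated instruction is complex enough to have multiple effects, so that it operates on different data in each iteration.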
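[0074]
An improvement on simple repeat instructions is the ability to iterate a block of instructions multiple times with reduced or zero loop overhead (for example TI TMS320C5x). The preferred embodiment provides this capability through its loop, loopgtz and loopnez instructions. The first C loop above would be compiled into the following instructions:

[assembly listing not reproduced in this text]

The LCOUNT, LBEG and LEND registers are made explicit in the instruction set so that loops are interruptible. This also allows these registers to be read and written in parallel with the execution of other instructions (if general registers were used, the register file read/write ports would have to be increased). The preferred embodiment specifies that the LCOUNT register is decremented immediately after it is tested, to give the maximum time to affect instruction fetch. The loop instructions are expected to allow the preferred embodiment to avoid the taken-branch penalty associated with compiling loops as conditional branches.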
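The fetch behaviour implied by LBEG, LEND and LCOUNT can be sketched with a small simulation. This is a minimal sketch under stated assumptions: instructions are numbered by index rather than by byte address, LCOUNT is assumed to hold the number of iterations remaining after the current one, and the function names are hypothetical.

```python
def simulate_zero_overhead_loop(trip_count: int, body_length: int) -> list[int]:
    """Return the sequence of body instruction indices that are executed."""
    trace: list[int] = []
    if trip_count <= 0:              # a loopgtz-style instruction skips the body
        return trace
    lbeg, lend = 0, body_length      # body occupies indices lbeg .. lend - 1
    lcount = trip_count - 1          # assumption: iterations still to come
    pc = lbeg
    while pc != lend:
        trace.append(pc)             # "execute" the instruction at pc
        pc += 1
        if pc == lend and lcount != 0:
            lcount -= 1              # decremented immediately after the test
            pc = lbeg                # fetch continues at the loop start
    return trace


def branch_loop_instruction_count(trip_count: int, body_length: int) -> int:
    """Executed instructions when each iteration adds an increment and a branch."""
    return trip_count * (body_length + 2)


body_only = len(simulate_zero_overhead_loop(100, 3))
print(body_only, branch_loop_instruction_count(100, 3))
```

In this model the loop hardware executes only the body instructions, whereas the branch-based loop executes two extra instructions per iteration; a separate increment of the induction variable, where needed, would be counted as part of the body.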
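[0075]
The increment of a3 (i) is not performed automatically by the loop instructions. It is left as a separate instruction because, as noted above, many loops require the induction variable to be incremented or decremented by different amounts, especially after strength-reduction optimization. In addition, in some cases these increments can be folded into coprocessor addressing modes such as auto-increment. Finally, incrementing a general register would require an extra port on the general register file.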
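[0076]
As can be seen from the examples and discussion above, the loop instructions reduce both IE and IS and facilitate implementations that reduce CPI. The impact on IS is greatest when the loop instructions remove the need for loop unrolling, but it is present even in the unrolled case. However, as will be readily apparent to those skilled in the art, the presence of these instructions in the preferred embodiment requires additional processor implementation cost (for example, special registers and special instruction fetch logic).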
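[0077]
[Hazards]
Most instruction sets are implemented by pipelined hardware. The use of pipelines often creates hazards during instruction execution that must be avoided in either hardware or software. For example, many pipelines write the register file at the end of the pipeline (or at least late in it). For correct operation, a following instruction that uses the register being written as a source operand must either wait to read the register file until the value has been written, or the value to be written must be bypassed or forwarded to the dependent instruction and the register file contents ignored.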
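[0078]
Most processors provide hardware dependency detection for the general register file, both delaying dependent instructions until the result is available and then bypassing the result to the dependent operation before it is written to the register file. Delaying instructions in software (usually by inserting NOPs) would significantly increase code size (by increasing IS), and not bypassing would greatly reduce performance. The detection, stall and bypass hardware is therefore well worth its cost.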
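[0079]
For processor state other than the general register file, however, the trade-off is different, because such registers are often referenced infrequently. Some instruction sets (for example MIPS) therefore switch to software management of special-register hazards (for example, by inserting NOPs to separate the write from the use). This unfortunately requires that knowledge of the pipeline be built into the instruction stream.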
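[0080]
An alternative is to have special register writes delay all following instructions in order to avoid hazards. This is simple and solves the problem, but it is inefficient, because special register writes often occur in groups (for example, to restore state after a context switch or interrupt) and there is usually no reason to delay the other special register writes, only the instructions that depend on them. The preferred embodiment of the invention adopts a hybrid approach. It provides the ISYNC, RSYNC, ESYNC and DSYNC instructions, which software must insert to avoid hazards that are not detected and prevented by hardware. Unlike NOPs, these instructions stall until all special register writes have completed. This allows a single implementation-independent instruction to do what would otherwise require a potentially large number of implementation-specific NOPs. It also allows the programmer to group special register writes together without stalls, in order to maximize performance.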
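[0081]
[Code density option]
The instruction set of the preferred embodiment consists of a core set of instructions that is preferably present in all implementations of the instruction set, and a set of optional instruction packages that may or may not be present in a given implementation. One such package is a short instruction format that provides significant code size reduction by reducing BI, the average bits per instruction. When these short-format instructions are present, the preferred embodiment changes from a fixed-length (24-bit) instruction set to a variable-length one with two instruction sizes (24 bits and 16 bits). Alternative embodiments may choose a different set of instruction sizes. For example, one alternative with code density similar to that of the 24/16 encoding is 24/12, in which the short form has two register fields instead of three.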
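[0082]
Because the short instruction forms are optional, they are used only to improve code size; these instructions provide no new functionality. The set of instructions that can be encoded in 16 bits is chosen as the statistically most frequent instructions that fit (or that can be altered to fit, for example by reducing the constant field width). The most frequent instructions in most instruction sets are loads, stores, branches, adds and moves, and these are exactly the instructions present in the 16-bit encoding of the preferred embodiment. The use of a short format purely to reduce BI contrasts with other variable-length instruction sets such as the Motorola 68000, Intel x86 and DEC VAX, in which each instruction has an encoding that depends primarily on the number and kind of its operands and not on its static frequency of use.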
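[0083]
The only known instruction set with properties similar to those of the present invention is the Siemens TriCore(TM), which has a 32-bit primary format and a 16-bit short format for reducing BI. Unlike the present invention, its primary format is too long to achieve exemplary BI, and its short form is not as functional, because it provides only two register fields, which forces one of the source registers and the destination register to be the same, or forces one of the source or destination registers to be implied by the opcode. As noted earlier, the use of implied source registers tends to increase either the CP or the CPI of an implementation.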
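[0084]
It was shown earlier that a 16-bit-only instruction set gives inadequate performance and functionality. A 16-bit encoding of only the most frequent instructions avoids this pitfall. Because only the most frequent instructions need short encodings, three register fields can be provided, and narrow constant fields can capture a large fraction of the uses. Nearly half of the instructions needed to represent an application can be encoded with just 6 of the 16 opcodes available in a 16-bit encoding after three 4-bit fields are reserved for register specifiers or constants.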
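[0085]
The 16-bit dense instruction option includes the l32i.n instruction (load 32 bits, 4-bit offset); s32i.n (store 32 bits, 4-bit offset); mov.n (move the contents of one register to another); add.n (add the contents of two registers); addi.n (add a register and an immediate, where the immediate is -1 or in the range 1 to 15); movi.n (load a register with an immediate, where the immediate is in the range -32 to 95); nop.n (no operation); break.n (break); ret.n and retw.n (ret and retw); beqz.n (forward branch with a 6-bit unsigned offset if the register is zero); and bnez.n (forward branch with a 6-bit unsigned offset if the register is non-zero).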
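[0086]
An alternative embodiment uses the 12-bit short form mentioned above. The 12-bit form supports only two 4-bit fields in addition to a 4-bit major opcode. This supports only loads and stores with no offset (sometimes called register-indirect addressing in the field) and add instructions in which the destination and one source register are the same. These restrictions are not a performance limitation, as they would be in other circumstances, because the compiler is free to use the longer three-operand forms when appropriate. The restrictions do prevent the 12-bit form from being used as often, but its reduced size partially compensates. At 30% 12-bit and 70% 24-bit, BI would be 20.4 bits, almost the same as the 20.0 bits achieved with 50% 16-bit and 50% 24-bit. Some implementation simplifications arise when one format is half the size of the other, but some implementation problems arise when the greatest common divisor (gcd) of the instruction sizes and the data width is small (it is 4 for 24, 12 and 32, and 8 for 24, 16 and 32). Overall the two are about equal in implementation cost, and so the preferred embodiment is the one that gives the better code size, which is 24/16.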
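[0087]
The 24/16 encoding has one additional code size disadvantage compared with 24/12. Branch offsets (instruction constants that specify a target instruction as a difference of instruction addresses) must be multiples of the gcd of all the instruction sizes. This is 12 for 24/12 and 8 for 24/16. The larger this number, the further (in bits) a branch can reach. Branches that exceed this reach require multi-instruction sequences, which increase IS.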
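The BI and gcd figures quoted in the two paragraphs above can be checked with a short calculation. This is a minimal sketch; the function names are hypothetical, and the instruction mixes are the ones stated in the text.

```python
from functools import reduce
from math import gcd


def average_bits(mix_percent: dict[int, int]) -> float:
    """Average bits per instruction for a mix given as {bits: percent}."""
    return sum(bits * percent for bits, percent in mix_percent.items()) / 100


def common_divisor(*sizes: int) -> int:
    """Greatest common divisor of several sizes."""
    return reduce(gcd, sizes)


print("BI 24/12:", average_bits({12: 30, 24: 70}))
print("BI 24/16:", average_bits({16: 50, 24: 50}))
print("gcd with 32-bit data, 24/12:", common_divisor(24, 12, 32))
print("gcd with 32-bit data, 24/16:", common_divisor(24, 16, 32))
print("branch offset unit, 24/12:", common_divisor(24, 12))
print("branch offset unit, 24/16:", common_divisor(24, 16))
```

A larger branch offset unit lets an offset field of a given width span more bits of code, which is the advantage of 24/12 noted in the text.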
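[0088]
The most significant advantage of fixed-length instructions, as found in most RISCs, arises when a processor implementation executes multiple instructions per cycle. In this situation, instructions are usually decoded in parallel. With variable-length instructions, enough decoding must be done on the first instruction to find the start of the second so that its decoding can begin, enough decoding must be done on the second instruction to find the start of the third, and so on. This can increase CP. Adding a pipeline stage to avoid the increase in CP is likely to increase CPI. Some implementations get an early start by decoding at every potential instruction start and then selecting the actual instructions once that information is available from the decoding of earlier instructions. This clearly increases implementation cost. Adding a pipeline stage to separate instructions likewise increases cost. Still other possibilities exist, such as predecoding into the instruction cache, but all of them increase implementation cost.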
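[0089]
The preferred embodiment does not eliminate the variable-length decoding problem, but it makes it as simple as possible, first by using only two instruction lengths and second by using a single instruction bit to distinguish the two lengths. This minimizes the implementation cost and any effect on CP. Finally, by making the short forms optional, the preferred embodiment allows the cost and CP effects to be avoided when code size is not the first priority.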
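[0090]
Many instruction sets operate with either little-endian or big-endian byte ordering. Techniques for accomplishing this are described, for example, in U.S. Pat. No. 4,959,779 to Weber. However, instruction sets with variable-size instructions require additional care. The MIPS instruction set uses the same instruction format for big-endian and little-endian byte orders, which works only because the instructions are all one size. The preferred embodiment specifies different instruction words for big-endian and little-endian byte orders in order to maintain the property that the bits needed to determine the instruction size are in the lowest-numbered addressed byte (the smallest addressable unit in the preferred embodiment).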
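[0091]
[Windowed register option]
Another optional package is the windowed register option. It is provided to lower IE and IS further. The performance gain from the lowered IE also compensates for the increase in IE that results from having 16 registers instead of 32. Register windows are found in some other processors, such as Sun SPARC. For a complete introduction to the subject, see the Sun SPARC documentation. The name "register window" refers to the typical implementation, in which the register field in an instruction specifies a register in the current window onto a larger register file. The position of the window is given by a window base register.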
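[0092]
Register windows avoid the need to save and restore registers at procedure entry and exit (which reduces IS and IE). This is done by changing a pointer at these points, which hides some registers from view and exposes new ones. The exposed registers usually do not contain valid data and can be used directly. However, when the exposed registers do contain valid data (because the window has moved so far that it has wrapped around to the registers of an earlier call frame), hardware detects this and stores the valid registers to memory before execution continues (this is usually accomplished by a trap to a software handler). This is called register window overflow. When a call returns to a frame whose registers have been stored to memory, register window underflow occurs, and the processor must load the values from memory (this too is usually accomplished by a trap to a software handler).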
【００９３】
呼者と、被呼者との間の物理的レジスタファイルの視野中でオーバーラップするレジスタウィンドウはまた、処理手順へのアーギュメントがレジスタで通過されるときに生じるアーギュメントシャッフリングを防止する（アーギュメントシャッフリングはＩＳとＩＥを増加する）。最後に、レジスタウィンドウは可変および一時的な値をレジスタに割当てるブレークイーブン点を変更し、したがってレジスタの使用を助長し、これはメモリ位置を使用するよりも高速度で小型である（またＩＳとＩＥを減少する）。
【００９４】
本発明のレジスタウィンドウとSPARC との主な差は、（１）SPARC はウィンドウポインタに対して固定したインクレメント１６を有し、（２）SPARC はウィンドウを付けられたレジスタに加えてグローバルレジスタを有し、本発明の好ましい実施形態ではそれを具備しておらず、（３）SPARC は現在のウィンドウが先のウィンドウとオーバーラップする状況のときウィンドウのオーバーフローを検出し、一方、好ましい実施形態は先のウィンドウの一部であるレジスタを参照するときウィンドウのオーバーフローを検出することである。
【００９５】
固定したインクレメントから可変のインクレメントへの変更は構造価格を低くするのに重要である。これは非常に小さい物理的レジスタファイルの使用を可能にする。例えば多数のSun SPARC 構造は１３６エントリの物理的レジスタを使用し、好ましい実施形態は類似のウィンドウ性能を実現するために６４のみのエントリのレジスタファイルを必要とするに過ぎない。可変インクレメントの複雑性の増加が存在するが、プロセッサ構造価格の差は３０％以上である（これはさらに簡単な固定したインクレメントSPARC 方法により必要とされるさらに大きいレジスタの価格である）。好ましい実施形態はオーバーフローとアンダーフローを検出し、スタックフレームを組織するための新しい方法を特定する。
【００９６】
表面上において、レジスタウィンドウ機構はレジスタファイルの読取りと直列して付加（短いにもかかわらず）を必要とすることによりＣＰ（またはＣＰＩ）を増加するように見える（パイプラインに付加を行うための１サイクルが存在するので、レジスタの書込みは問題ではない）。しかしながら、レジスタファイルへのウィンドウを付けられていないレジスタのアクセスに対して類似したタイミングとウィンドウサイズを有する方法でレジスタウィンドウアクセスを実行することが可能である。例えば、６４レジスタの物理的レジスタファイルと、任意の所定の命令に可視である１６のウィンドウを考慮する。この場合、１６の６４：１多重化はウィンドウポインタにのみ基づいて１６の可視レジスタを選択するために使用され、その後これらの１６の結果は１６エントリレジスタファイルのようにアクセスされる。１６の６４：１多重化の使用は、高い構造価格を要する。この理由で、好ましい実施形態は４の倍数に制限されるウィンドウポインタを特定し、この価格を係数４だけ減少する。連続した加算の使用を選択する構成においてさえも、これはレジスタ数の２ビットが即座にレジスタファイルアクセスの開始に使用されることができ、さらに低速度の合計ビット（４ビットと２ビット入力の合計）がアクセスにおける後の点で使用される。最後に、これらの２つの構成間のハイブリッドが可能であり、構造価格は中間になる。
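The 64-register, 16-visible example above can be sketched as follows. This is an illustrative model only: the wrap-around (modulo) addressing, the function names, and the convention that the window base counts units of four registers are assumptions, not details taken from the text.

```python
PHYSICAL_REGS = 64   # physical register file size from the example
WINDOW_REGS = 16     # registers visible to one instruction
WINDOW_STEP = 4      # window pointer restricted to multiples of four registers

def physical_index(window_base, reg):
    """Full add: window_base counts units of four registers, reg is the 4-bit field."""
    assert 0 <= window_base < PHYSICAL_REGS // WINDOW_STEP
    assert 0 <= reg < WINDOW_REGS
    return (window_base * WINDOW_STEP + reg) % PHYSICAL_REGS

def physical_index_split(window_base, reg):
    """Same result, with the two low bits bypassing the adder."""
    low = reg & 0b11                                   # usable immediately
    high = (window_base + (reg >> 2)) % (PHYSICAL_REGS // WINDOW_STEP)  # 4-bit + 2-bit sum
    return (high << 2) | low

print(physical_index(15, 15), physical_index_split(15, 15))
```

Because the base is a multiple of four, the two low bits of the physical index equal the two low bits of the register field, so only a 4-bit plus 2-bit sum sits on the slower path.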
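[0097]
Modifications and variations of the preferred embodiment will be readily apparent to those skilled in the art. Such variations are within the technical scope of the present invention as set forth in the claims.
[Brief description of the drawings]
[FIG. 1] A block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
[FIG. 2] A block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
[FIG. 3] A block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
[FIG. 4] A block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
[FIG. 5] A block diagram of a pipeline used in a processor according to the preferred embodiment.
[0001]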
BACKGROUND OF THE INVENTION
The present invention relates to microprocessors, and more particularly to a high performance, reduced instruction set computer (RISC) architecture processor that can use instruction widths highly efficiently.
[0002]
[Prior art]
Processor instruction set design is a well-established art. Many instruction set features are not new in themselves. However, individual features can be combined in new and unique ways that advance the art. In particular, when an instruction set is designed to be optimal for a use different from that of earlier instruction sets, a significant improvement is obtained when a processor implementing that instruction set is used in the target application.
[0003]
Instruction set design must balance many competing goals, including the size of the machine code required to encode various algorithms; the extensibility and adaptability of the instruction set to new algorithms and applications; the performance and power consumption of processors that execute the instruction set on such algorithms; the cost of processors that execute the instruction set; the stability of the instruction set across many processor implementations over a long period; the complexity of designing processors that execute the instruction set; and the suitability of the instruction set as a target for compilation from high-level programming languages.
[0004]
The instruction set has one direct and two indirect effects on processor performance. The instruction set directly determines IE, the number of instructions required to execute a given algorithm, although the suitability of the instruction set as a compilation target also plays a role here. The other components of processor performance are the clock period CP and the average clocks per instruction CPI. These are primarily attributes of the implementation of the instruction set, but instruction set features can affect the implementer's ability to meet time-per-clock and clocks-per-instruction goals simultaneously. For example, an encoding choice may require additional logic in series with the rest of instruction execution, which the implementer must address either by increasing the time per clock or by adding a pipeline stage, which usually increases the clocks per instruction.
[0005]
A new style of instruction set architecture called RISC was developed in the 1980s and 1990s. It grew out of the recognition of the tradeoff described above, namely
T = IE * CPI * CP
where T is the program execution time in seconds and the other variables are as defined above. RISC instruction sets allow implementers to improve CPI and CP significantly without significantly increasing IE. RISC instruction sets improve processor performance, reduce design complexity, reduce the cost of a processor implementation at a given performance level, and are well suited to compilation from high-level programming languages.
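As a worked illustration of the execution time relation T = IE * CPI * CP, the sketch below evaluates it for one set of made-up values. The function name `execution_time` and all numbers are hypothetical and chosen only to show the units.

```python
def execution_time(ie, cpi, cp_seconds):
    """T = IE * CPI * CP: instructions executed x clocks per instruction x seconds per clock."""
    return ie * cpi * cp_seconds

# Hypothetical program: one million instructions executed,
# 1.5 clocks per instruction on average, 5 ns clock period.
t = execution_time(ie=1_000_000, cpi=1.5, cp_seconds=5e-9)
print(f"{t * 1e3:.2f} ms")
```

With these values T is 7.5 ms; halving any one factor while holding the others fixed halves T, which is why a feature that lowers IE but raises CP or CPI has to be judged on the product.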
[0006]
The processor architecture community has not agreed on a fully satisfactory definition of RISC, but it generally includes many of the following attributes: fixed-size instruction words; arithmetic and other computational operations performed on operands read from a general register file of 16 or more registers, with results written to the same register file; fixed positions in the instruction word for the source register fields, so that register file access can proceed in parallel with instruction decoding; memory access performed primarily by loads from memory to registers and stores from registers to memory (as opposed to memory operands in computational instructions); a small number of ways of computing memory addresses (usually four or fewer, and often one); avoidance of features that make pipelined execution of instructions difficult (for example, use of a hardware resource more than once by a given instruction); and avoidance of features that require microcode or its equivalent. Not all processors considered to be RISCs contain all of these elements, but all contain many of them.
[0007]
However, early RISC instruction sets were not particularly efficient at producing compact machine code. In particular, RISC instruction sets typically require more bits to encode an application than pre-RISC instruction sets. In the overall cost of a solution, the size of an application's machine code is often more important than the cost of the processor itself, because a larger memory is required to hold the application. RISC remains acceptable in many applications where performance matters most, but an instruction set that keeps the advantages of RISC while reducing code size would be useful in many other processor applications.
[0008]
[Problems to be solved by the invention]
Some early processor instruction sets (IBM 7090™, CDC6600™, DEC PDP6™, GE635™) have some of the features of RISC, because they were designed to be executed directly by hardware without microcode, as RISC is. Most of these instruction sets are not well suited to modern high-level languages and applications, owing to features such as word (rather than byte) addressing, limited address space, and peculiar combinations of operations. In fact, many were intended for assembly language programming. Some are based on 36-bit data words and instruction widths, and 36-bit instructions are not very good for code density. Some are based on an accumulator-and-memory model of computation, which limits performance. Some of the individual features of the present invention existed in this generation of machines, but none of those machines has the desired characteristics.
[0009]
The use of microcode to implement processors made more complex instruction sets feasible (IBM 360™, DEC PDP11™, DEC VAX™, Intel x86™, LLNL S-1™, Motorola 68000™). The next generation of processors therefore had complex instruction sets with good code density, achieved in part through complex variable-length instruction encodings. However, microcoded processors and complex instruction sets are often not a good match for high performance. Complex instructions are executed by iterating a micro-engine instead of directly in a hardware pipeline, which increases CPI. Instruction sets of various styles appeared in this generation, with a tendency to move from one or two accumulators to either general purpose register architectures or stack architectures. The implementation cost of registers or stacks had become low enough that instruction sets could use these styles to advantage.
[0010]
As noted above, RISC brought a significant improvement in performance, but it was a step backward in code density. Many RISC instruction sets are based on fixed-length 32-bit instructions, and 32 bits is more than is necessary. Some form of variable-length encoding is also needed to obtain the best code density. Stack architectures, despite their code size advantage, dropped out at this point because of their low performance, which shows how important it is for an instruction set to meet both performance and code size goals.
[0011]
To compensate for the code size disadvantage of RISC, processor designers have introduced compact encodings of their instruction sets. ARM™ Thumb and MIPS16 are examples. Both include a small number of 32-bit instructions but use primarily 16-bit instructions. The 16-bit encodings (which give smaller code by halving the number of bits per instruction) give poor performance because they have only eight registers (which increases IE), use implied source register operands (which increases CP or CPI), limit the range of constants in the instruction word (which increases IE), and limit the number of distinct register operands (two or fewer in most cases, which increases IE).
[0012]
Hitachi's SH instruction set is RISC-like and targets code size. It began as a 16-bit instruction set, with 32-bit instructions added later as the need arose. It has 16 registers but usually only two register fields per instruction (which increases IE), and its branch offsets are severely limited.
[0013]
There is a need for an instruction set design that provides the performance and other advantages of RISC and yet produces small, cost-effective machine code. To allow high-performance implementations that are not overly complicated, the instruction set must be directly executable, without microcode, by a simple, short pipeline. It must provide enough general purpose registers to achieve good performance and to be a good target for optimizing compilers. Other techniques may be used to reduce code size further.
[0014]
In view of the problems of the prior art described above, it is an object of the present invention to provide a processor that implements a 24-bit encoding of a full-featured RISC instruction set.
Another object of the present invention is to provide a processor that executes an instruction set whose instructions work synergistically to keep the static number of instructions required to represent a program small when using an instruction encoding that limits the average number of bits per instruction. The present invention also provides efficient encoding of instruction constants in a narrow instruction word.
[0015]
Another object of the present invention is to provide a processor that implements a RISC instruction set in which a 24-bit instruction word encodes compare-and-branch instructions, using forms for the most useful comparisons that would normally require longer target specifiers.
Yet another object of the present invention is to provide a processor that executes a general purpose instruction set (as opposed to a special purpose one, such as that of a DSP) with a reduced overhead loop capability that reduces both the static number of instructions required to represent a program and the number of instructions and cycles required to execute it.
[0016]
[Means for Solving the Problems]
The above objects are accomplished by providing a RISC processor that executes the instruction set of the first preferred embodiment of the present invention. The instruction set is designed with the following code size equation in mind, in addition to the equation T = IE * CPI * CP described above:
S = IS * BI
where S is the program size in bits,
IS is the static number of instructions needed to represent the program (not the number executed, as in IE above), and
BI is the average number of bits per instruction.
Compared with RISC, the present invention lowers both BI and IS while keeping increases in CP and CPI minimal. It has features that both increase and decrease IE.
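The code size relation S = IS * BI can be made concrete with a short sketch. The instruction count and the two BI values below are hypothetical inputs used only to show how the two factors interact; `program_size_bits` is an illustrative name.

```python
def program_size_bits(static_instructions, bits_per_instruction):
    """S = IS * BI: static instruction count x average bits per instruction."""
    return static_instructions * bits_per_instruction

# Hypothetical program of 10,000 static instructions.
s_fixed_32 = program_size_bits(10_000, 32)    # fixed 32-bit encoding
s_average_20 = program_size_bits(10_000, 20)  # an encoding averaging 20 bits

print(s_fixed_32 // 8, "bytes versus", s_average_20 // 8, "bytes")
```

In practice a narrower encoding tends to raise IS somewhat, so the saving in S is smaller than the saving in BI alone.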
[0017]
This aspect of the present invention was designed with the recognition that it must provide good code density in a fixed-length, high-performance encoding based on RISC principles, including a general register, load/store architecture. To achieve exemplary code density, the embodiment adds a simple variable-length encoding that does not compromise performance. The embodiment also optimizes the cost of the processor implementation.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
These and other objects of the present invention will become readily apparent from the following detailed description and accompanying drawings.
FIGS. 1-4 show a processor suitable for implementing an instruction set according to a preferred embodiment of the present invention. In general, the processor has a 2^32-byte (4 GB) virtual memory for instructions and data, a 32-bit program counter IPC, 16 or more 32-bit general purpose registers, a shift amount register SAR, a 32-bit loop begin address register LBEG, a 32-bit loop end address register LEND, and a 32-bit loop count register LCOUNT; the last three are used by the reduced overhead loop instructions described in detail below.
[0019]
In particular, the processor pipeline has five basic stages: instruction fetch, or I stage 100; instruction decode and register access, or R stage 200; execution and address calculation, or E stage 300; memory access, or M stage 400; and write back, or W stage 500. In I stage 100, program memory is accessed to retrieve the instruction to be executed. In R stage 200, the fetched instruction is decoded and any registers it uses are accessed. Then, in E stage 300, the register contents and constants decoded in R stage 200 are processed by the processor ALU 332 according to the instruction operand. In M stage 400, any necessary memory accesses, such as loads and stores, are performed. Finally, in W stage 500, the result of executing the instruction is written back to the general purpose registers as directed by the instruction operand.
[0020]
In particular, in I stage 100, a word is fetched from the instruction cache based on the address held in I stage program counter IPC 104. The word is read from instruction cache RAM IC 102 (which, together with other components described below, forms instruction cache 116), combined by alignment unit ALIGN 108 with the last word fetched, which is held in last fetch register 106, and stored in R stage instruction register RINST 202. Cache misses are handled by a memory fetch from main program memory through staging register IREFILL 110 to cache RAM IC 102, and the tags are adjusted accordingly using register IMISSA 112 and tag cache RAM ITAG 114. Multiplexer 118 selects either the output of cache RAM IC 102 or an instruction fetched directly from main memory and passes the selected data to alignment unit 108, which concatenates it with the last fetched word stored in last fetch register 106 and selects a subset as needed to adjust for variations in instruction length. Tag comparison unit 122 detects cache misses and signals I stage controller 124, which controls all operations of that stage.
[0021]
Although the circuit shown as element 104 is referred to herein as a program counter, it should be recognized that I stage program counter 104 does not actually count the instructions to be fetched; it counts words. Subsequent program counters in the preferred embodiment, such as R stage program counter 204, do count actual instructions. Those skilled in the art will also readily understand that, in addition to I stage controller 124, corresponding R stage controller 224, E stage controller 324 and M stage controller 424 each control the overall operation of their respective stages. R stage status register 203, E stage status register 303, M stage status register 403 and W stage status register 503 each supply their controller with status information about the instruction in the corresponding pipeline stage, such as whether its data is valid. In addition, certain features, such as the multiplexer select lines running from the stage controllers to their respective multiplexers, clock signals and exception vector addresses, have been omitted for ease of explanation; those skilled in the art will readily recognize their placement.
[0022]
While the instruction is fed into R stage instruction register 202, adder 128 in next address generation section 126 increments the current word address to point to the next word to be fetched and passes it to multiplexer 130, which feeds it back to program counter 104. When a loop instruction (described in further detail below) is executed, it loads the loop begin address into loop begin register LBEG 132, after which multiplexer 130 can supply that address to program counter 104. Also used by the loop instruction, loop end register 134 provides a value that comparator 136 compares with the current address to detect the loop end condition and decrement loop count register LCOUNT 138. Comparator 140 signals I stage controller 124 when the count is zero so that execution continues past the loop. Otherwise, LCOUNT 138 is decremented by decrementer 142 and passed through multiplexer 144 (which is also used to load the register). Finally, R stage PC selection multiplexer 146 selects the address value to be supplied to E stage 300, as described in further detail below.
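The interplay of the loop begin, loop end and loop count registers in next-address generation can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions: addresses advance by one unit per fetch, the loop end value is the address just past the loop body, and the function names are hypothetical.

```python
def next_fetch_address(pc_next, lbeg, lend, lcount):
    """Return (address, lcount) once the sequential address pc_next has been formed.

    If pc_next reaches the loop end and the count is non-zero, fetch resumes at
    the loop begin address and the count is decremented; a zero count falls through.
    """
    if pc_next == lend and lcount != 0:
        return lbeg, lcount - 1
    return pc_next, lcount

def trace(start, stop, lbeg, lend, lcount):
    """List the addresses fetched from start until stop is reached."""
    pc, visited = start, []
    while pc != stop:
        visited.append(pc)
        pc, lcount = next_fetch_address(pc + 1, lbeg, lend, lcount)
    return visited

# Two-word loop body at addresses 10 and 11, loop end at 12, count of 2.
print(trace(start=10, stop=13, lbeg=10, lend=12, lcount=2))
```

In this model a count of 2 yields three passes through the body, and no branch instruction is fetched or executed at the end of each pass.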
[0023]
The instruction stored in R stage instruction register 202 is decoded by decoder 201, which extracts predetermined parameter fields and decodes immediate or constant fields according to the instruction opcode. The decoded instruction is pipelined to E stage instruction decode register 302 for execution. In parallel with instruction decoding, fields from the instruction are sent to register file 206 via adders 208-212, which add the window base value for windowed register operation, as described in detail below. One adder is used for each of the two source register fields and the one destination register field that may be present in an instruction.
[0024]
The values in register file 206 are read and provided to multiplexers 214 and 216 and then to E stage S and T registers 304 and 306. Multiplexers 214 and 216 can supply values from register file 206 or, if the required data has not yet been written into file 206, values supplied from the E stage, as described below. Multiplexer 214 may also receive a constant value from instruction decoder 204.
[0025]
Adder 218 computes an indexed target address from the contents of R stage program counter 201 and an index constant from instruction decoder 204, and stores the result in E stage branch register 308. Adder 220 computes the next instruction address by adding 2 or 3, selected by multiplexer 222 according to the instruction length, to the value in R stage program counter 201; the result is sent to next PC register 310 for use when a branch is not taken.
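The two address computations described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the index constant is a signed byte offset added directly to the program counter value; the text does not state the exact base used for the offset.

```python
ADDRESS_MASK = 0xFFFFFFFF   # 32-bit addresses

def next_instruction_address(pc, length_bytes):
    """Fall-through address: add 2 for a 16-bit or 3 for a 24-bit instruction."""
    assert length_bytes in (2, 3)
    return (pc + length_bytes) & ADDRESS_MASK

def branch_target_address(pc, offset):
    """Taken-branch address: program counter plus a signed constant (assumed base)."""
    return (pc + offset) & ADDRESS_MASK

print(hex(next_instruction_address(0x1000, 3)), hex(branch_target_address(0x1000, -8)))
```

Both values are produced in parallel, so the later stage only has to choose between them once the branch decision is known.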
[0026]
Proceeding to E stage 300, bypass multiplexers 318 and 320 select operands for the various functional units (branch unit 326, shift/mask unit 330, ALU 332, AGEN 334, and storage alignment unit 336). The multiplexer selection is computed in R stage 200 by bypass block EBYP 314, using respective registers 228, 230 and 232 and based on the instructions currently in E stage 300, M stage 400 and W stage 500, and is pipelined through bypass block EBYP 226. When the result comes from R stage 200, multiplexers 318 and 320 select ET register 312 or ES register 316, respectively. The other inputs to multiplexers 318 and 320 come from M stage 400 and W stage 500.
[0027]
Branch unit 326 uses the two operands from multiplexers 318 and 320 to generate conditional branch taken/not-taken decisions, which are fed to controllers 124 and 224 in I stage 100 and R stage 200, respectively. The multiplexer makes a selection. Shift/mask unit 330 performs shift and extract instructions based on the output of multiplexer 328. It takes not only two operands from multiplexers 318 and 320 but also a mask input from decoded instruction register EINSTD 302, which also supplies M stage instruction register MINSTD 402. The shift amount is taken from EINSTD 302 for shifts by a constant, or from ESAR 322 for shifts by a variable amount. ESAR 322 contains the ISA state SAR for E stage 300.
[0028]
ALU 332 performs arithmetic and logical functions including ADD, ADDI, ADDX2, SUB, AND, OR, and XOR. The outputs of shift/mask unit 330 and ALU 332 are multiplexed in multiplexer 338 based on the instruction type and provided to MALU register 406. Address generation unit AGEN 334 computes the sum of a register operand and the offset from the decoded instruction in EINSTD 302. The output is sent to M stage virtual address register MVA 408. Storage alignment unit 336 shifts the output of ET multiplexer 318 by 0, 8, 16 or 24 bit positions to align the store data into the byte positions appropriate for memory. The output is sent to M stage store data register MSD 410.
[0029]
Like the previous pipe stage, ECTL 324 handles the control over E stage 300 and the update of the state of the instructions executed there. The E stage instruction address program counter EPC 304 is pipelined to the M stage instruction address program counter MPC 404 for exception handling.
[0030]
M stage 400 of the pipeline handles the second half of load and store instructions and exception determination for all stages. The output of MPC 404 is sent to WPC register 504. If an instruction in M stage 400 is invalidated by an exception or interrupt, the output of WPC 504 is loaded into one of the ISA-specified exception instruction address registers EPC[i] (not shown; distinct from E stage program counter EPC 304). If the instruction in M stage 400 must be retried (for example, because of a cache miss), the contents of WPC register 504 are sent back to I stage 100 to restart instruction fetch.
[0031]
Shift or ALU instructions simply transfer data from MALU 406 to WALU 506 at this stage. The output of MALU 406 is also fed to bypass multiplexers 318 and 320 at this stage, so that the result of a shift or ALU instruction can be used by subsequent instructions before it is written to the register file. The load instruction in the W stage 500 reads both the data cache RAM and the data tag RAM. The store instruction in the W stage 500 reads only the data tag RAM. Data cache RAM writes are delayed until the tag comparison is complete. Non-load instructions write any pending store data to the data cache RAM. A store followed by a load from the same address requires a special bypass, because the store data has not yet been written to the data cache RAM.
[0032]
A load instruction in the W stage also sends the index portion of virtual address MVA 408 to the address input of the data tag RAM and, through multiplexer 422, to the address input of direct-mapped data cache RAM DC 434. In parallel with the read of DC 434, the address is compared with the pending store virtual index and valid bit in STVI 416. Based on the output of comparator 428, multiplexer 432 selects the contents of pending store data buffer 418 if the read is for the pending cache index; otherwise it selects the DC read data. Multiplexer 432 feeds load alignment circuit 436, which shifts the load data by 0, 8, 16 or 24 bit positions based on the two low order bits of the virtual address, then zero-extends from bit 7 or bit 15 for the L8UI and L16UI instructions, respectively, or sign-extends from bit 15 for the L16SI instruction. The result is latched in WLOAD 508. The output of the data tag RAM is compared by comparator 430 with the high order bits of the M stage virtual address from MVA 408, and the hit/miss result is sent to M stage control logic MCTL 424, which handles cache misses and exceptions. Finally, the load virtual address is captured in WMA 510 to handle a cache miss.
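The shift-then-extend behavior of the load alignment circuit can be sketched as below. The sketch assumes little-endian byte numbering within the 32-bit word and an aligned access; `load_align` is a hypothetical name, while L8UI, L16UI and L16SI are the instruction names used in the text.

```python
WORD_MASK = 0xFFFFFFFF

def load_align(word, vaddr, op):
    """Shift a 32-bit cache word by 0, 8, 16 or 24 bits, then extend to 32 bits."""
    shift = 8 * (vaddr & 0b11)             # two low address bits pick the shift
    shifted = (word & WORD_MASK) >> shift
    if op == "L8UI":                       # zero-extend from bit 7
        return shifted & 0xFF
    if op == "L16UI":                      # zero-extend from bit 15
        return shifted & 0xFFFF
    if op == "L16SI":                      # sign-extend from bit 15
        half = shifted & 0xFFFF
        return half | 0xFFFF0000 if half & 0x8000 else half
    raise ValueError(f"unsupported load: {op}")

word = 0x80017234
print(hex(load_align(word, 1, "L8UI")), hex(load_align(word, 2, "L16SI")))
```

The only difference between the signed and unsigned 16-bit loads is how the upper half of the result is filled after the shift.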
[0033]
A load cache miss invalidates the instructions in pipeline stages I through M. The load address from WMA 510 is sent to external memory. Data read from memory is written into data cache RAM 424 through multiplexer 412 and STDATA 418, using the low order bits of WMA 510 as the address. Data tag RAM 426 is written, through multiplexer 414 and STADDR 420, with the high order bits of the miss address captured in WMA 510, and DTAG 420 is addressed by the low order bits from MVA 408.
[0034]
A store instruction in W stage 500 places the store address and data into STADDR 420 and STDATA 418. In addition, data tag RAM 426 is accessed and the result is compared with the high order bits of MVA 408 to determine whether the store address hits or misses. If the store hits in the cache, the contents of STDATA 418 are written into data cache RAM 424, at the address held in STADDR 420, in the first non-load cycle. When the cache miss refill completes, the instruction fetch unit restarts instruction fetch beginning at the load instruction that missed. The data cache in this embodiment is write-through, so the store address and data are also sent from STADDR 420 and STDATA 418 to write buffer 438 and held there until written to external memory.
[0035]
The outputs of WALU and WLOAD registers 506 and 508 are selected by multiplexer 512 and, if the instruction is still valid at this point and is one that produces an A register result, written into register file 206 in R stage 200.
[0036]
The processor also has a 6-bit shift amount register. It provides conventional immediate shifts such as logical left, logical right and arithmetic right, but it does not provide single-instruction shifts in which the shift amount is a register operand, because a direct variable shift is likely to be a critical timing path and simple shifts do not extend efficiently to larger widths. Funnel shifts can be extended, but they require too many operands. The processor according to the preferred embodiment of the present invention solves these problems by providing a funnel shift whose shift amount is taken from the SAR register. The compiler synthesizes variable shifts using an instruction that computes SAR from the shift amount in a general purpose register, followed by a funnel shift. The legal range of values for SAR is 0 to 32, not 0 to 31, so 6 bits are used for the register.
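The way a variable shift can be synthesized from "compute SAR, then funnel shift" is sketched below. The function names and the exact rule for deriving SAR from the shift amount are assumptions made for illustration; the text specifies only that SAR ranges from 0 to 32 and that the funnel shift takes its amount from SAR.

```python
MASK32 = 0xFFFFFFFF

def funnel_shift_right(high, low, sar):
    """Shift the 64-bit value high:low right by sar (0..32) and keep the low 32 bits."""
    if not 0 <= sar <= 32:
        raise ValueError("SAR must be in the range 0..32")
    combined = ((high & MASK32) << 32) | (low & MASK32)
    return (combined >> sar) & MASK32

def logical_right(value, amount):
    """Variable right shift: SAR = amount, zeros funnelled in from the top."""
    sar = amount & 31
    return funnel_shift_right(0, value, sar)

def logical_left(value, amount):
    """Variable left shift: SAR = 32 - amount, which needs the value 32 when amount is 0."""
    sar = 32 - (amount & 31)
    return funnel_shift_right(value, 0, sar)

print(hex(logical_left(0x80000001, 1)), hex(logical_right(0x80000000, 31)))
```

The left-shift case shows why SAR needs a sixth bit: a left shift by zero maps to a funnel shift by exactly 32.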
[0037]
Of course, various other processor architectures in accordance with the present invention will be readily apparent to those skilled in the art given the instruction set description set forth in detail below. These structures are also intended to be included within the scope of the claims.
[0038]
Various pipeline structures may be used within the processor. However, certain features of the instruction set work best with certain classes of implementation. One such class is shown generally in FIG. 5. This type of architecture can be used to advantage with major computational units such as floating point units and DSPs. One notable aspect of this pipeline architecture is that placing such units after the D-cache (in the location labeled DReg/DALU in FIG. 5) allows instructions for those units to include a memory reference as one source operand. This allows a data cache reference and an operation to be performed every cycle without requiring the fetch and execution of multiple instructions per cycle.
[0039]
[General instruction set design considerations]
A number of instruction set features improve performance (by lowering IE) and code size (by lowering IS) at the price of a higher processor implementation cost. For example, an "auto-increment" addressing mode (the base address register is read and then rewritten with the incremented address) requires a second register file write port for loads. An "indexed" addressing mode (the sum of two registers forms the virtual address) requires three register file read ports for stores. The preferred embodiment is tailored to a register file with two read ports and one write port, which is the minimum required for reasonable performance.
[0040]
The preferred embodiment has several features that increase implementation cost, but it avoids features that would require an increase as costly as additional register file ports. This is particularly important when the implementation executes multiple instructions per cycle, because the number of ports is multiplied by the maximum number of instructions the processor can execute per cycle (for example, 2 to 8).
[0041]
In order to maintain performance, the instruction set must support at least two source register fields and one distinct destination register field. Otherwise, both IE and IS increase. General purpose register instruction sets that optimize only for code density are often designed around two register fields, one used only as a source and one used as both source and destination (e.g., Hitachi SH). This reduces code size when the increase in IS is offset by the decrease in BI, but there is no way to compensate for the increase in IE. An instruction set that specifies a small number of registers can use narrow register fields and therefore has a lower BI, but it forces many variables and temporary values to be kept in memory, and the additional load and store instructions this requires increase IE and IS. If only code density is a priority, the increase in IS may be more than offset by the decrease in BI for a net saving, but if good performance is also required there is no way to compensate for the increase in IE.
[0042]
As the number of registers increases, the reductions in IE and IS diminish and level off. The instruction set must provide enough registers to reach at least the point of diminishing returns, that is, the point at which a further increase in the register count no longer produces a correspondingly significant decrease in IE. In particular, at least 16 general purpose registers are required for RISC performance levels. Three 4-bit register fields require at least 12 bits to encode. Bits are also needed for the opcode and constant fields, so a 16-bit encoding, as used by some processors, is not sufficient.
[0043]
[24-bit encoding]
One reason that most of the prior art has not struck a proper balance between code size and performance is that instruction set designers have felt constrained to certain instruction sizes, such as 16 or 32 bits. There are indeed advantages to using a simple ratio between instruction size and processor data word width. However, there are significant advantages to relaxing this restriction slightly.
[0044]
The preferred embodiment uses a 24-bit fixed-length encoding as its starting point. 24 bits is not only sufficient for high performance but also leaves room for extensibility and for instructions that reduce IE. Other embodiments may use encodings in the range of 18 to 28 bits, but encodings of fewer than 24 bits limit extensibility and branch range. A 24-bit encoding represents a 25% reduction in BI, and thus in code size, relative to most 32-bit RISC instruction sets. Finally, 24 bits is fairly easily accommodated in a processor with a 32-bit data path width.
[0045]
The preferred embodiment uses 4-bit register fields, which is the minimum required for acceptable performance and the maximum that fits satisfactorily within a 24-bit instruction word. Many RISC instruction sets use 32 registers (5-bit register fields). After three 5-bit register fields, a 24-bit instruction leaves only 9 bits for the opcode and constant fields. Short constant fields are likely to give insufficient range for branches, calls and other PC-relative references. Too few opcode bits give insufficient extensibility. For these two reasons, a 24-bit instruction word with 5-bit register fields is undesirable. The difference in performance (due to the difference in IE) between 16 and 32 general purpose registers (approximately 6%) is not as great as the difference between 8 and 16 general purpose registers, and is small enough that other features (e.g., compound instructions and register windows) can be introduced to make up the lost performance. The increase in IS (also about 6%) is more than offset by the difference between 24-bit and 32-bit encodings.
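The bit budget behind this argument can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function name `bits_left` is hypothetical and not part of the described instruction set.

```python
def bits_left(word_bits, reg_fields, reg_field_bits):
    """Bits remaining for opcode and constant fields after the register fields."""
    return word_bits - reg_fields * reg_field_bits


# Three register fields (two sources, one destination) in each word size.
for word_bits in (16, 24, 32):
    for field_bits in (4, 5):
        left = bits_left(word_bits, 3, field_bits)
        print(f"{word_bits}-bit word, three {field_bits}-bit fields: {left} bits left")
```

With three 5-bit fields a 24-bit word keeps 9 bits for opcode and constants, with three 4-bit fields it keeps 12, and a 16-bit word with three 4-bit fields keeps only 4.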
[0046]
It should also be recognized that many instruction sets with 5-bit register fields do not actually provide 32 general purpose registers for compilation. Most dedicate a register to hold zero, but the need for a zero register can easily be eliminated by providing a small number of additional opcodes. Other registers are often given specific uses, which can also be avoided by including other features in the instruction set. For example, MIPS uses two of its 31 general purpose registers for exception handling code and one for a global area pointer. It therefore has in effect only 28 registers for variables and temporaries, which is only 12 more than an instruction set with 4-bit register fields and appropriate instruction set features. It is also common to divide the general purpose registers into caller-saved and callee-saved registers by software convention, which further reduces the utility of a large register file. The preferred embodiment includes features that avoid this, as described in more detail below.
[0047]
[Compound instructions]
To lower IS and IE, the preferred embodiment also provides single instructions that combine the functions of multiple instructions commonly found in RISC and other instruction sets. An example of a simple compound instruction is left shift and add/subtract. HP PA-RISC™ and DEC Alpha™ are examples of instruction sets that provide these operations. Address arithmetic and multiplication by small constants often use these combinations, and providing these operations reduces IE and IS at the potential cost of increasing CP (because additional logic is placed in series in the computation pipeline stage). However, various implementations have demonstrated that when the shift is limited to 0-3, the additional logic is not the most critical constraint on CP. Conversely, the ARM instruction set provides arbitrary shift and add, and as a result its implementations have had quite poor CP.
[0048]
Right shifts are often used to extract a field from a larger word. Two instructions (either a left shift followed by a right shift, or a right shift followed by an AND with a constant) are typically used to extract an unsigned field. The preferred embodiment provides a single compound instruction, extui, that performs this function. It is implemented as a shift followed by an AND with a mask that is specified in encoded form in just 4 bits of the instruction word. Because the AND portion of the extui instruction is logically so trivial, its inclusion in the instruction set is unlikely to increase the CP of implementations. The same is not true of an instruction that extracts a signed field, and so no such instruction is included.
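The equivalence between the single shift-and-mask operation and the conventional two-shift sequence can be sketched as follows. This is an illustrative model only: the function names, the 32-bit register width, and the assumption that the 4-bit encoded field selects a mask width of 1 to 16 bits are not taken from the text above.

```python
MASK32 = 0xFFFFFFFF


def extui(value, shift, width):
    """Extract an unsigned field: shift right, then AND with `width` low ones."""
    if not 1 <= width <= 16:  # assumed range of a 4-bit encoded mask width
        raise ValueError("width must be in 1..16")
    mask = (1 << width) - 1   # a run of zeros followed by `width` ones
    return ((value & MASK32) >> shift) & mask


def extract_two_shifts(value, shift, width):
    """The conventional two-instruction form: left shift, then logical right shift."""
    left = 32 - (shift + width)
    return ((value << left) & MASK32) >> (left + shift)


word = 0xABCD1234
print(hex(extui(word, 8, 8)), hex(extract_two_shifts(word, 8, 8)))
```

Both functions return the same field whenever `shift + width` does not exceed 32; the single-operation form simply avoids the second instruction.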
[0049]
Most instruction sets, both RISC and others (e.g., ARM, DEC PDP11, DEC VAX, Intel x86, Motorola 68000, Motorola 88000, Sun SPARC), use a compare instruction that sets condition codes, followed by a conditional branch instruction that tests the condition codes to determine the flow of control. Conditional branches make up 10-20% of the instructions in most RISC instruction sets, and each conditional branch is usually paired with a compare instruction, so this style of instruction set is wasteful. Older instruction sets were often based on a compare-and-skip style of conditional, which has the same drawback as separate compare and branch.
[0050]
Some instruction sets (e.g., Cray-1, MIPS, DEC Alpha, HP PA-RISC and the later V9 version of Sun SPARC) provide a compound compare and branch facility of varying flexibility. Cray and DEC Alpha provide only comparison of a register with zero and branch. MIPS provides register-to-zero comparisons and register-to-register equality and inequality comparisons and branch. HP PA-RISC provides a fairly complete set of register-to-register compare and branch instructions.
[0051]
The preferred embodiment provides the most useful compound compare and branch instructions. Choosing the right set requires that the utility of each compare and branch be balanced against the opcode space it consumes, especially when a 24-bit (as opposed to 32-bit) encoding is the goal. Other instruction sets fail this test. For example, HP PA-RISC provides several compound compare and branch opcodes that have almost no utility (e.g., never and overflow after add) and omits some that are useful.
[0052]
The set of compound compare and branch instructions selected for the preferred embodiment is
A == 0, A != 0, A <s 0, A >=s 0,
A == B, A != B, A <s B, A <u B, A >=s B, A >=u B,
(A & B) == 0, (A & B) != 0, (~A & B) == 0, (~A & B) != 0,
A == I, A != I, A <s I, A <u I, A >=s I, A >=u I,
bit B of A == 0, bit B of A != 0,
bit I of A == 0, bit I of A != 0
where A and B denote the contents of registers; the suffixes “u” and “s” on a relational operator applied to register contents denote an “unsigned” or “signed” comparison, respectively; a suffix on a relational operator with zero (e.g., A <s 0) denotes an unsigned or signed comparison to zero; and I denotes an index constant.
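The difference between the signed and unsigned variants, and the mask and bit tests, can be illustrated with a small model of a subset of these conditions. This is a hedged sketch: the dictionary keys are this example's own notation, the 32-bit register width is assumed, and the choice to use only the low 5 bits of B as a bit index is an assumption rather than something stated above.

```python
MASK32 = 0xFFFFFFFF


def signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def unsigned32(x):
    """Interpret the low 32 bits of x as an unsigned value."""
    return x & MASK32


def bit_of(a, index):
    # Assumption: only the low 5 bits of the index select the bit.
    return (unsigned32(a) >> (index & 31)) & 1


# A representative subset of the compare-and-branch conditions.
CONDITIONS = {
    "A == 0":          lambda a, b: unsigned32(a) == 0,
    "A <s 0":          lambda a, b: signed32(a) < 0,
    "A == B":          lambda a, b: unsigned32(a) == unsigned32(b),
    "A <s B":          lambda a, b: signed32(a) < signed32(b),
    "A <u B":          lambda a, b: unsigned32(a) < unsigned32(b),
    "(A & B) == 0":    lambda a, b: unsigned32(a & b) == 0,
    "(~A & B) == 0":   lambda a, b: unsigned32(~a & b) == 0,
    "bit B of A == 0": lambda a, b: bit_of(a, b) == 0,
}


def branch_taken(condition, a, b=0):
    """Return True if a branch on `condition` would be taken for values a and b."""
    return CONDITIONS[condition](a, b)


# 0xFFFFFFFF is -1 when signed but the largest value when unsigned.
print(branch_taken("A <s B", 0xFFFFFFFF, 1), branch_taken("A <u B", 0xFFFFFFFF, 1))
```

The same register contents give opposite outcomes for the signed and unsigned comparison, which is why both forms consume opcode space.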
[0053]
Compound compare and branch reduces IE and IS compared with separate compare and branch instruction sets, and even compared with partial compare and branch instruction sets such as MIPS and DEC Alpha. Although the preferred embodiment may require an increase in CPI to implement compound compare and branch, the overall effect on performance is still an improvement.
[0054]
The main advantage of separate compare and branch instruction sets is that two instruction words are available to specify the comparison operator, the comparison operands, and the branch target, which allows a generous field width for each. In contrast, compound compare and branch instruction sets must pack all of these into a single instruction word, resulting in smaller fields and the need for a mechanism to handle values that do not fit (e.g., branches with longer ranges). The preferred embodiment packs the comparison opcode, two source register fields and an 8-bit PC-relative offset into a 24-bit instruction word. The 8-bit target specifier will be insufficient in some cases, and the compiler or assembler must then use a conditional branch of the opposite sense around an unconditional branch with a longer range, which the preferred embodiment provides. This situation naturally increases IE and IS, which is undesirable. For this reason, the preferred embodiment also provides a series of compound compare and branch instructions that test against zero, which is the most common case. These instructions have a 12-bit PC-relative offset, which provides a much greater range than their two-register counterparts. The extra complexity of providing both forms is balanced by the improvement in IE and IS. Unlike MIPS and DEC Alpha, the preferred embodiment does not provide all comparisons against zero (it omits register less than or equal to zero and register greater than zero); here again, the preferred embodiment provides the set of instructions that balances program needs against opcode space.
[0055]
One consequence of using only 24 bits to encode all instructions is that the constant fields in the instruction word are limited in size. This could potentially increase IS and IE (although the increase in IE can be reduced by loading the constant into a register outside the loop). The preferred embodiment addresses this problem in several ways. First, it provides small constant fields that capture the most common constants. To make the best use of a narrow (e.g., 4-bit) constant field, the instruction set does not specify the constant value directly, but uses the field to encode it. The encoded values are chosen, from a wide range of program statistics, as the N (e.g., 16) most frequent constants. The preferred embodiment uses this technique in the addi.n instruction, where the 16 values are chosen to be -1 and 1-15 rather than 0-15. Adding 0 has no utility (there is a separate mov.n instruction), and adding -1 is common. The beqi, bnei, blti and bgei instructions also use a 4-bit field that encodes various common constants. The bltui and bgeui instructions use a different encoding, because unsigned comparisons have a different set of useful values.
The processor has at least one instruction with a constant field that specifies a constant value in a lookup table. The test compares a source register with the constant obtained from the lookup table position specified by the value of the instruction's constant field.
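The idea of encoding a constant through a small table, rather than storing it directly, can be sketched as follows. This is a minimal illustration: the set of values (-1 and 1 to 15) comes from the text, but the assignment of field code 0 to the value -1, and the names `ADDI_N_TABLE` and `decode_addi_n_imm`, are assumptions made for this example.

```python
# Hypothetical lookup table for a 4-bit encoded constant field.
# Assumption: code 0 stands for -1; codes 1..15 stand for themselves.
ADDI_N_TABLE = [-1] + list(range(1, 16))


def decode_addi_n_imm(field):
    """Map a 4-bit field value to the constant it encodes."""
    if not 0 <= field <= 15:
        raise ValueError("field must fit in 4 bits")
    return ADDI_N_TABLE[field]


def encode_addi_n_imm(value):
    """Return the 4-bit field for a constant, or None if it is not encodable."""
    return ADDI_N_TABLE.index(value) if value in ADDI_N_TABLE else None


print([decode_addi_n_imm(f) for f in range(16)])
print(encode_addi_n_imm(0))  # None: adding 0 is not encodable in this scheme
```

A different table per instruction (for example for the unsigned comparisons) fits the same pattern; only the list of 16 values changes.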
[0056]
The most common constants are typically very small, and a narrow field captures most of the desired values. However, the constants used in bitwise logical operations (e.g., AND, OR, XOR) represent bit masks of various kinds and often do not fit in small constant fields. For example, constants with a single bit set to 1 at any position, or a single bit set to 0 at any position, are common. Bit patterns consisting of a sequence of 0s followed by a sequence of 1s, or a sequence of 1s followed by a sequence of 0s, are also common. For this reason, the preferred embodiment has instructions that avoid the need to put the mask directly in the instruction word. Examples in the preferred embodiment are the bbci and bbsi instructions, which branch depending on whether the specified bit of a register is 0 or 1, respectively. The bit is given as a bit number, not as a mask. The extui instruction (described above) performs a shift followed by an AND with a mask consisting of a series of 0s followed by a series of 1s, where the number of 1s is a constant field in the instruction.
[0057]
[Coprocessor Boolean registers and branches]
The instructions listed above consume a significant portion of the available instruction word, because compound compare and branch packs so much into an instruction word that is less than 32 bits wide. This is a good tradeoff for these branches because of their frequency and the savings that result.
[0058]
In addition to the other constraints on instruction set design, the instruction set must be extensible (allowing the addition of new data types), a feature that is exploited by tightly coupled coprocessors. However, a short instruction word may not have room to add compound compare and branch instructions for other data types such as floating point or DSP. Furthermore, it may not be feasible for each coprocessor to implement its own compound compare and branch. Even where individual compound compare and branch instructions can be implemented, they may be wasteful, because compare and branch on such data types is less frequent in many applications than on integer data.
[0059]
For this reason, the preferred embodiment of the present invention uses a different method for coprocessor conditional branches. In the preferred embodiment, the instruction set includes an optional package that is a prerequisite for any coprocessor package. This package adds 16 single-bit Boolean registers and the BF (branch if false) and BT (branch if true) instructions, which test these Boolean registers and branch accordingly. The coprocessors then provide instructions that set a Boolean register based on, for example, a comparison of their supported data types. The Boolean registers and the BF and BT instructions are shared by all coprocessors, thereby making efficient use of the short instruction word.
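The split between a coprocessor compare that writes a Boolean register and a shared branch that tests it can be modeled as below. This is only a sketch under stated assumptions: the class `BooleanRegisters`, the methods `bt` and `bf` (named after the BT and BF instructions), and the coprocessor compare `fp_less_than` are hypothetical illustrations, not instruction definitions.

```python
class BooleanRegisters:
    """Sixteen single-bit registers shared by all coprocessors."""

    def __init__(self):
        self.bits = [False] * 16

    def set(self, index, value):
        self.bits[index] = bool(value)

    def bt(self, index):
        """Branch-if-true test: True means the branch is taken."""
        return self.bits[index]

    def bf(self, index):
        """Branch-if-false test: True means the branch is taken."""
        return not self.bits[index]


def fp_less_than(bregs, dest, x, y):
    """Hypothetical floating-point coprocessor compare writing Boolean register `dest`."""
    bregs.set(dest, x < y)


bregs = BooleanRegisters()
fp_less_than(bregs, 3, 1.5, 2.0)   # compare supplied by a coprocessor
print(bregs.bt(3), bregs.bf(3))    # shared branch tests on the result
```

Each coprocessor adds only compare-style operations; the two branch tests are defined once and reused, which is the opcode saving the paragraph describes.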
[0060]
This is a new variant of the condition-code-based compare and branch found in many of the earlier instruction sets described above. Earlier instruction sets have used multiple multi-bit condition codes shared between the processor and its coprocessors (e.g., PowerPC), or multiple single-bit condition codes dedicated to each coprocessor (e.g., MIPS). The preferred embodiment of the present invention uses multiple shared single-bit condition codes.
[0061]
Providing multiple destinations for comparisons (as in the preferred embodiment of the present invention, MIPS and PowerPC) allows the compiler to schedule code more freely, and allows instructions that compare multiple data values in a single instruction and produce multiple results (e.g., MIPS MDMX).
[0062]
Sharing the comparison result registers among multiple coprocessors (as in the present invention) or between the processor and its coprocessors (as in PowerPC) saves on the number of opcodes needed to test comparison results. It also makes it more feasible to provide instructions that perform logical operations on the comparison result registers (as in the preferred embodiment of the present invention and PowerPC).
[0063]
The use of single-bit comparison result registers (as in the preferred embodiment of the invention and MIPS) instead of multi-bit registers (most other ISAs) increases the number of comparison opcodes required but reduces the number of branch opcodes required. The preferred embodiment uses single-bit comparison result (Boolean) registers because branch instructions must also provide a PC-relative target address, so adding branch opcodes is the more expensive choice unless there are a large number of coprocessors.
[0064]
In summary, compound compare and branch is an important technique for minimizing code size, but because BI must be kept small, a split approach is appropriate for coprocessor compare and branch, given its different frequency and the number of different coprocessor opcodes required. Within the range of split compare and branch options, opcode space is used most efficiently by multiple single-bit comparison result registers shared among the coprocessors.
[0065]
[Load and store instructions]
The load and store instructions of the preferred embodiment use an instruction format with an 8-bit constant offset that is added to a base address from a register. The preferred embodiment first makes the most of these 8 bits, and second provides a simple extension method for cases where they are insufficient. The load/store offsets of the preferred embodiment are zero-extended rather than sign-extended (as is common in many other instruction sets), because the values 128 to 255 are more common than the values -128 to -1. In addition, because most references are to aligned addresses from aligned base registers, the offset is shifted left as appropriate for the reference size. The offset for 32-bit loads and stores is shifted by 2, the offset for 16-bit loads and stores is shifted by 1, and the offset for 8-bit loads and stores is not shifted. Since most loads and stores are 32-bit, this technique provides 2 additional bits of range.
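The effect of zero-extending and scaling the 8-bit offset can be sketched as follows. The function names and the 32-bit address wrap-around are assumptions of this example; the shift amounts are those given in the paragraph above.

```python
# Shift applied to the 8-bit offset for each access size in bytes.
OFFSET_SHIFT = {1: 0, 2: 1, 4: 2}


def effective_address(base, imm8, access_bytes):
    """Base register plus the zero-extended, size-scaled 8-bit offset."""
    if not 0 <= imm8 <= 255:
        raise ValueError("offset field must fit in 8 unsigned bits")
    return (base + (imm8 << OFFSET_SHIFT[access_bytes])) & 0xFFFFFFFF


def max_offset(access_bytes):
    """Largest byte offset reachable for the given access size."""
    return 255 << OFFSET_SHIFT[access_bytes]


for size in (1, 2, 4):
    print(f"{size}-byte access: offsets 0..{max_offset(size)} in steps of {size}")
```

For 32-bit accesses the reachable offsets run up to 1020 bytes, four times the range of an unscaled 8-bit field.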
[0066]
If the 8-bit constant offset specified in a load/store instruction (or an addi instruction) is insufficient, the preferred embodiment provides the addmi instruction, which adds its 8-bit constant shifted left by 8. A two-instruction sequence thus has 16 bits of range, 8 from the addmi and 8 from the load/store/addi. Constants that cannot be encoded by one of the methods described above must be loaded into a register by a separate instruction (this technique is not applicable to load/store instructions, which take only a single register operand instead of two and therefore require the addmi solution described above). The preferred embodiment provides two methods for loading constants into registers. The first is the movi instruction (and movi.n in the short instruction format described below). movi specifies its constant in a pair of fields in the instruction word that together form a 12-bit sign-extended value. Assigning a constant value to a register variable is also common in its own right.
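How a 16-bit offset divides between the two instructions can be sketched as below. This is an illustration under assumptions that the text does not state: the addmi constant is treated as a signed 8-bit value, and the second instruction's offset is treated as an unscaled, zero-extended 8-bit value (as for a byte load). The function name `split_offset` is hypothetical.

```python
def split_offset(offset):
    """Split an offset into (addmi constant, load/store offset field).

    Assumptions: the addmi constant is signed 8-bit and is shifted left by 8
    when added; the second offset is an unscaled, zero-extended 8-bit value.
    """
    low = offset & 0xFF            # part carried by the load/store/addi
    high = (offset - low) >> 8     # part carried by addmi, before its shift
    if not -128 <= high <= 127:
        raise ValueError("offset outside the assumed 16-bit range")
    return high, low


high, low = split_offset(0x1234)
print(hex(high), hex(low), hex((high << 8) + low))
```

Recombining the two parts as `(high << 8) + low` reproduces the original offset, including negative offsets under the signed assumption.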
[0067]
In an instruction format of 32 bits or less, no single instruction can encode an arbitrary 32-bit constant, so some other method is required to set a register to an arbitrary constant value. At least two methods have been used in other instruction sets, and either may be used together with the techniques described above to provide a solution. The first solution is to provide a pair of instructions that together synthesize a 32-bit constant, each instruction carrying part of the constant (e.g., MIPS LUI/ADDI, DEC Alpha and IBM PowerPC have instructions that specify the upper 16 bits and the lower 16 bits in two separate instructions). The second solution (e.g., MIPS floating point constants, MIPS16 and ARM Thumb) is to provide a simple way to read the constant from memory with a load instruction.
[0068]
If the load itself requires only a single instruction, using a load instruction to reference a constant can provide lower IS and IE than using a sequence of instructions. For example, MIPS compilers dedicate one of the 31 general purpose registers to hold a pointer to a constant pool in which (among other things) 4-byte and 8-byte floating point constants are kept. If the area addressed by this register is smaller than 64 KB, a constant can be referenced by a single load instruction, because MIPS has 64 KB of offset range in its loads. For a constant that is referenced once, the total size of a 32-bit load instruction plus a 32-bit constant is the same as that of two instruction words. If a constant is referenced more than once, the constant pool gives a smaller total size. The tradeoff is different for other instruction lengths, such as the 24-bit size of the preferred embodiment, where the constant pool plus load is 56 bits versus 48 bits for a pair of 24-bit instructions. Nevertheless, if a constant is used multiple times, the constant pool is almost always the better solution in total size.
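The size comparison in this paragraph can be reproduced with simple arithmetic. The sketch below assumes one 32-bit pool entry shared by all references and one load instruction per reference; the function names are hypothetical.

```python
def pair_bits(instr_bits, uses):
    """Total bits when every use synthesizes the constant with two instructions."""
    return 2 * instr_bits * uses


def pool_bits(instr_bits, uses, const_bits=32):
    """Total bits when every use is one load and the constant is stored once."""
    return instr_bits * uses + const_bits


for instr_bits in (32, 24):
    for uses in (1, 2, 3):
        print(f"{instr_bits}-bit instructions, {uses} use(s): "
              f"pair={pair_bits(instr_bits, uses)} pool={pool_bits(instr_bits, uses)}")
```

With 24-bit instructions the pool costs 56 bits against 48 for a single use, but 80 against 96 for two uses, so the pool wins as soon as a constant is referenced twice.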
[0069]
The MIPS technique of dedicating a register to addressing constants and other values is undesirable for the preferred embodiment and other embodiments of the present invention because, as described above, narrower instruction words generally provide fewer than 32 registers, so each register is more valuable. In addition, the offset available from a register in a narrow instruction set is limited, so a single register gives access to only a small constant pool (too small to be practical). The preferred embodiment adopts the solution of many instruction sets (e.g., PDP11, Motorola 68000, MIPS16, ARM Thumb) in providing a PC-relative load that can be used to access the constant pool.
[0070]
Either technique for loading arbitrary constants is applicable to the present invention. The preferred embodiment uses the second technique, while an alternative embodiment would use multiple instructions that each contain part of the complete constant. A specific example of an alternative embodiment for a 24-bit instruction word has one instruction that places a 16-bit instruction constant into the upper part of a register (16-bit constant + 4-bit register destination + 4-bit opcode = 24 bits) and a second instruction that adds a 16-bit signed constant to a register (16-bit constant + 4-bit register source and destination + 4-bit opcode = 24 bits).
[0071]
[Reduced overhead loop instructions]
The preferred embodiment also provides a loop feature that is found in some digital signal processors (DSPs) but not in RISC processors. Most RISC processors use their existing conditional branch instructions to create loops rather than providing additional features for loops. This economy keeps the processor simple, but increases IE and IS. For example, the C loop

is compiled in the preferred embodiment as

There are two “loop overhead” instructions in every iteration, an add and a conditional branch. (Without the compound compare and branch feature of the preferred embodiment, three overhead instructions would be required.) This clearly increases IE. In addition, a taken conditional branch in some processor implementations may require more cycles to execute than other instructions because of pipelining and/or branch prediction, so CPI may increase. Some instruction sets add a single instruction that increments or decrements a register, compares, and branches (e.g., DEC PDP6, DEC PDP11, IBM PowerPC), which lowers IE in this case. (The implementation of the IBM PowerPC instruction is also aimed at lowering CPI.)
When the loop body is small, the performance impact of loop overhead is even higher. In this case many compilers use an optimization called loop unrolling to spread the loop overhead over two or more iterations. In C, the loop above could be transformed, for example, as follows.

In some cases the i + constant terms can be folded into the body instructions (e.g., into load and store instruction offsets), so that only one increment is required per iteration.
[0072]
Loop unrolling by factors greater than 2 is very common, with 4 and 8 being typical (powers of 2 have some advantages). The point to note even for unrolling by a factor of 2 is the resulting increase in code size (the body occurs three times in the previous example). The use of this technique in RISC processors to achieve performance is consistent with their emphasis on performance and simplicity over code size.
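The code size cost of unrolling can be seen in a small sketch. This is an illustrative Python rendering of unrolling by a factor of 2, not the C listing referred to above; the function names are hypothetical and `body` stands for the loop body.

```python
def process_rolled(data, body):
    """Plain loop: one increment and one loop test per element."""
    i = 0
    while i < len(data):
        body(data[i])
        i += 1


def process_unrolled2(data, body):
    """Unrolled by 2: the body appears three times in the source."""
    n = len(data)
    i = 0
    if n & 1:                 # peel one iteration when the count is odd
        body(data[0])
        i = 1
    while i < n:              # one increment and one test per two elements
        body(data[i])
        body(data[i + 1])
        i += 2


seen_rolled, seen_unrolled = [], []
process_rolled([1, 2, 3, 4, 5], seen_rolled.append)
process_unrolled2([1, 2, 3, 4, 5], seen_unrolled.append)
print(seen_rolled == seen_unrolled)
```

The unrolled form halves the loop overhead per element but carries three copies of the body, which is the code size increase the paragraph points out.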
[0073]
Many DSPs and some general purpose processors provide other ways of performing certain kinds of loops. The first is to provide an instruction that repeats a second instruction a fixed number of times (e.g., TI TMS320C2x, Intel x86). This has the advantage of being fairly simple to implement. Where it is applicable, it eliminates loop overhead and saves power by eliminating the need to fetch the same instruction repeatedly. Some instruction sets with repeat instructions require that the processor take no interrupts during the loop, which is a significant restriction. In addition, single-instruction loops are useful only in limited circumstances, and only when the repeated instruction is complex enough to have multiple effects, so that it operates on different data in each iteration.
[0074]
An improvement over simple repeat instructions is the ability to repeat a block of instructions multiple times with reduced or zero loop overhead (e.g., TI TMS320C5x). The preferred embodiment provides this capability through its loop, loopgtz and loopnez instructions. The first C loop above is compiled into the following instructions.

The LCOUNT, LBEG, and LEND registers are exposed in the instruction set so that loops can be interrupted. This also allows these registers to be read and written in parallel with the execution of other instructions (if general purpose registers were used, the number of register file read/write ports would have to be increased). The preferred embodiment specifies that the LCOUNT register is decremented immediately after it is tested, to give the maximum time to affect instruction fetch. The loop instructions are expected to allow the preferred embodiment to avoid the taken-branch penalty that would be associated with compiling loops as conditional branches.
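The role of the three loop registers can be illustrated with a small behavioral model. This is a sketch under assumptions, not a definition of the instructions: it models the body as a list of callables indexed by a program counter, assumes that the count register is initialized to the trip count minus one, and assumes that a non-positive count skips the body, as the name loopgtz suggests.

```python
def run_loopgtz(count, body):
    """Execute `body` (a list of callables) `count` times with no branch instruction.

    Returns the number of body instructions executed.
    """
    if count <= 0:
        return 0                      # assumed: non-positive count skips the loop
    lbeg, lend = 0, len(body)         # first body instruction / one past the last
    lcount = count - 1                # assumed initial value of the count register
    pc, executed = lbeg, 0
    while pc != lend:
        body[pc]()
        executed += 1
        pc += 1
        # Loop-back check performed by fetch logic: test, then decrement.
        if pc == lend and lcount != 0:
            lcount -= 1
            pc = lbeg
    return executed


trace = []
n = run_loopgtz(3, [lambda: trace.append("body"), lambda: trace.append("addi")])
print(n, trace)
```

Every executed instruction belongs to the body (including the separate increment); no compare or branch instruction is executed per iteration in this model.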
[0075]
The increment of a3 (i) is not performed automatically by the loop instructions. It is left as a separate instruction, as shown above, because many loops require induction variables to be incremented or decremented by different amounts, especially after strength reduction optimization. Furthermore, in some cases these increments can be folded into a coprocessor addressing mode such as auto-increment. Finally, incrementing a general purpose register automatically would require an extra port on the general purpose register file.
[0076]
As can be seen from the previous examples and description, the loop instructions reduce both IE and IS and facilitate implementations that reduce CPI. The impact on IS is greatest when the loop instructions eliminate the need for loop unrolling, but it is present even in the unrolled case. However, as will be readily apparent to those skilled in the art, the presence of these instructions in the preferred embodiment entails additional processor implementation cost (e.g., special registers, special instruction fetch logic).
[0077]
[Hazards]
Most instruction sets are now implemented by pipelined hardware. The use of pipelines often creates hazards during instruction execution that must be avoided in hardware or software. For example, many pipelines write the register file at the end of the pipeline (or at least late in it). For correct operation, a subsequent instruction that uses the register being written as a source operand must either wait to read the register file until the value has been written, or the value to be written must be bypassed or forwarded to the dependent instruction and the contents of the register file ignored.
[0078]
Most processors provide hardware dependency detection for the general register file, both delaying dependent instructions until the result is available and bypassing the result to the dependent operation before it is written to the register file. Delaying instructions in software (usually by inserting NOPs) would significantly increase code size (by increasing IS), and not bypassing would greatly reduce performance. Detection, stall, and bypass hardware are therefore well worth their cost.
[0079]
However, for processor state other than the general purpose register file, the tradeoff is different, because such registers are often not referenced very frequently. Some instruction sets (e.g., MIPS) therefore switch to software management of special register hazards (e.g., by inserting NOPs to separate the write from the use). Unfortunately, this requires knowledge of the pipeline to be built into the instruction stream.
[0080]
An alternative is to have every special register write delay all subsequent instructions to avoid hazards. This is simple and solves the problem, but it is inefficient, because special register writes are often done in groups (e.g., to restore state after a context switch) and there is often no reason to delay the other special register writes and the instructions they depend on. The preferred embodiment of the present invention adopts a hybrid approach. It provides the ISYNC, RSYNC, ESYNC, and DSYNC instructions, which software must insert to avoid hazards that the hardware does not detect and avoid. Unlike NOPs, these instructions stall until all special register writes have completed. This allows a single implementation-independent instruction to accomplish what would otherwise require a potentially large number of implementation-specific NOPs. It also allows programmers to group special register writes together without stalls, for maximum performance.
[0081]
[Code density option]
The instruction set of the preferred embodiment consists of a core set of instructions that are preferably present in all implementations of the instruction set, and a set of optional instruction packages that may or may not be present in a given implementation. One such package is a short instruction format that provides a significant code size reduction by reducing BI, the average number of bits per instruction. When these short format instructions are present, the preferred embodiment changes from a fixed-length (24-bit) instruction set to one with two instruction sizes (24 bits and 16 bits). Alternative embodiments may choose a different set of instruction sizes. For example, one alternative with a code density similar to the 24/16 encoding is 24/12, where the short form has two register fields instead of three.
[0082]
Because the short instruction forms are optional, they are used solely to improve code size, and these instructions add no new functionality. The set of instructions that can be encoded in 16 bits is chosen as the statistically most frequent instructions that fit (or can be altered to fit, e.g., by reducing constant field widths). The most frequent instructions in most instruction sets are loads, stores, branches, adds and moves, and these are exactly the instructions present in the 16-bit encodings of the preferred embodiment. The use of a short format purely to reduce BI contrasts with other variable-length instruction sets such as Motorola 68000, Intel x86 and DEC VAX, in which the encoding of each instruction depends mainly on the number and kind of its operands and not on its static frequency of use.
[0083]
The only instruction set known to have a property similar to the present invention is Siemens Tricore™, which has a 32-bit main format and a 16-bit short format for reducing BI. Unlike the present invention, its main format is too long to achieve an exemplary BI, and its short form is less functional because it provides only two register fields, which forces either one of the source registers to be the same as the destination register or one of the source or destination registers to be implied by the opcode. As mentioned above, the use of implied source registers tends to increase either the CP or the CPI of implementations.
[0084]
As shown earlier, a 16-bit-only instruction set provides insufficient performance and functionality. A 16-bit encoding of only the most frequent instructions avoids this pitfall. Because only the most frequent instructions need short encodings, three register fields are available and narrow constant fields can capture a large fraction of uses. Approximately half of the instructions needed to represent an application can be encoded in just 6 of the 16 opcodes available in a 16-bit encoding after three 4-bit fields are reserved for register specifiers or constants.
[0085]
The 16-bit code density option adds the following instructions: l32i.n (load 32 bits, 4-bit offset); s32i.n (store 32 bits, 4-bit offset); mov.n (move the contents of one register to another register); add.n (add the contents of two registers); addi.n (add a register and an immediate value, where the immediate value is -1 or in the range 1...15); movi.n (load a register with an immediate value, where the immediate value is in the range -32...95); nop.n (no operation); break.n (break); ret.n and retw.n (ret and retw); beqz.n (forward branch with a 6-bit unsigned offset if the register is zero); and bnez.n (forward branch with a 6-bit unsigned offset if the register is non-zero).
[0086]
An alternative embodiment uses the 12-bit short form described above. The 12-bit form supports only two 4-bit fields in addition to a 4-bit major opcode. This supports only loads and stores with no offset (sometimes called register indirect addressing in the field) and an add instruction in which the destination and one source register are the same. These restrictions are not a limitation on performance, as they would be in other circumstances, because the compiler is free to use the longer three-operand forms when appropriate. The restrictions do prevent the 12-bit form from being used as often, but its reduced size partially compensates. With 30% 12-bit and 70% 24-bit instructions, BI is 20.4 bits, which is approximately the same as the 20.0 bits achieved with 50% 16-bit and 50% 24-bit instructions. Some implementation simplifications occur when one format is half the size of the other, but there are some implementation problems when the greatest common divisor (GCD) of the instruction sizes and the data width is small (it is 4 for 24, 12 and 32, and 8 for 24, 16 and 32). Overall the two are approximately equal in implementation cost, so the preferred embodiment is the one that gives the better code size, which is 24/16.
[0087]
There is one additional code size disadvantage of 24/16 compared to 24/12. Branch offsets (instruction constants that specify the target instruction as a difference of instruction addresses) must be a multiple of the gcd of all instruction sizes. This is 12 for 24/12 and 8 for 24/16. The larger this number, the farther (in bits) a branch can reach. A branch beyond this reach requires a multiple-instruction sequence, which increases IS.
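As a rough illustration of the reach argument, the Python sketch below computes how far a signed offset field of a given width can reach when each offset unit equals the gcd of the instruction sizes. The 8-bit field width in the example and the helper name `branch_reach_bits` are assumptions chosen for illustration, not values taken from the text.

```python
from functools import reduce
from math import gcd


def branch_reach_bits(offset_field_bits, instruction_sizes):
    """Backward reach, in bits, of a signed offset counted in gcd-sized units.

    A signed field of n bits spans -2**(n-1) .. 2**(n-1) - 1 units, so the
    farthest backward target lies 2**(n-1) units away.
    """
    granule = reduce(gcd, instruction_sizes)
    return (2 ** (offset_field_bits - 1)) * granule


# Hypothetical 8-bit signed offset field under each encoding.
reach_24_12 = branch_reach_bits(8, [24, 12])
reach_24_16 = branch_reach_bits(8, [24, 16])
```

With the same field width, the 24/12 encoding reaches 1.5 times as far as 24/16 because its offset unit is 12 bits rather than 8.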
[0088]
The most important advantage of fixed-length instructions is obtained when the processor implementation executes multiple instructions per cycle, as seen in most RISCs. In this situation, instructions are usually decoded in parallel. With variable-length instructions, enough decoding must be performed on the first instruction to find the start of the second instruction so that decoding can begin there, and enough decoding must be performed on the second instruction to find the start of the next instruction, and so on. This increases CP. Adding a pipeline stage to prevent an increase in CP is likely to increase CPI. Some implementations get an early start by decoding at every potential instruction start and selecting the actual instructions when that information becomes available from the decode of the previous instruction. This obviously increases the implementation cost. Adding a pipeline stage to separate the instructions also increases the cost. Other approaches, such as pre-decoding into the instruction cache, are possible, but all of them increase the implementation cost.
[0089]
The preferred embodiment does not eliminate the variable-length decoding problem, but it makes the problem as simple as possible, first by using only two instruction lengths and second by using a single instruction bit to distinguish between the two lengths. This minimizes the implementation cost and any impact on CP. Finally, by making the short form optional, the preferred embodiment makes it possible to reduce the cost and the CP effect when code size is not the first priority.
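The length decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the choice of bit (`SHORT_FORM_BIT`, bit 3 of the first byte) and the names `instruction_length` and `split_instructions` are hypothetical, and the byte values in the example are arbitrary rather than real opcodes.

```python
SHORT_FORM_BIT = 0x08  # hypothetical: the one bit that marks a 16-bit instruction


def instruction_length(first_byte: int) -> int:
    """Return the instruction length in bytes from a single bit of its first byte."""
    return 2 if first_byte & SHORT_FORM_BIT else 3


def split_instructions(code: bytes) -> list:
    """Walk a byte stream and cut it into 16-bit and 24-bit instructions."""
    instructions = []
    pos = 0
    while pos < len(code):
        length = instruction_length(code[pos])
        if pos + length > len(code):
            raise ValueError("truncated instruction at offset %d" % pos)
        instructions.append(code[pos:pos + length])
        pos += length
    return instructions


# Arbitrary example bytes: a 24-bit, a 16-bit, then another 24-bit instruction.
stream = bytes([0x00, 0x11, 0x22, 0x08, 0x33, 0x04, 0x55, 0x66])
parts = split_instructions(stream)
```

Because the length depends on one bit only, finding the start of the next instruction needs no further decoding of the current one, which is the simplification the paragraph describes.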
[0090]
Many instruction sets operate with either little endian or big endian byte ordering. Techniques for accomplishing this are described, for example, in Weber US Pat. No. 4,959,779. However, instruction sets with variable-size instructions require additional care. The MIPS instruction set uses the same instruction format in big and little endian byte order, which works only because the instructions are all one size. The preferred embodiment specifies different instruction words for big and little endian byte order, in order to maintain the property that the bits required to determine the instruction size are present in the lowest-numbered addressed byte (the smallest addressable unit in the preferred embodiment).
[0091]
[Windowed register option]
Another optional package is the windowed register option. It is provided to lower IE and IS. The performance gained from the lowered IE also compensates for the increase in IE that results from having 16 registers instead of 32. Register windows are found in some other processors, such as Sun SPARC. Refer to the Sun SPARC documentation for a complete introduction to the subject. The name "register window" describes the typical implementation, in which the register field in the instruction specifies a register in the current window onto a larger register file. The position of the window is given by a window base register.
[0092]
Register windows eliminate the need to save and restore registers at procedure entry and exit (which reduces IS and IE). This is done by changing the pointer at these points, which hides some registers from view and exposes new registers. The exposed registers usually do not contain valid data and can be used directly. However, when the exposed registers do contain valid data (because the window has moved so far that it has wrapped around to the registers of a previous call frame), the hardware detects this and stores the valid registers to memory before execution continues (this is usually accomplished by a trap to a software handler). This is called register window overflow. When a call returns to a frame whose registers have been stored to memory, a register window underflow occurs and the processor must load the values from memory (this is also usually accomplished by a trap to a software handler).
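The overflow and underflow behavior described above can be illustrated with a deliberately simplified Python model. It is a sketch under several assumptions that the text does not make: the window pointer moves by a fixed increment on every call, frames are tracked as whole units, and only counts are kept rather than register contents. The class name `WindowedRegisterFile` and its fields are hypothetical.

```python
class WindowedRegisterFile:
    """Counts window overflows and underflows for a chain of calls and returns."""

    def __init__(self, physical_registers=64, increment=16):
        # Number of call frames that fit in the physical file at once.
        self.capacity = physical_registers // increment
        self.resident = 1    # frames currently held in registers
        self.spilled = 0     # frames whose registers were stored to memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        if self.resident == self.capacity:
            # The window has wrapped onto the oldest live frame:
            # its registers are stored to memory first (overflow).
            self.resident -= 1
            self.spilled += 1
            self.overflows += 1
        self.resident += 1

    def ret(self):
        self.resident -= 1   # the callee's registers are no longer needed
        if self.resident == 0:
            if self.spilled == 0:
                raise RuntimeError("return from the outermost frame")
            # The caller's registers are in memory: reload them (underflow).
            self.spilled -= 1
            self.resident = 1
            self.underflows += 1


# Six nested calls followed by six returns on a file that holds four frames.
rf = WindowedRegisterFile(physical_registers=64, increment=16)
for _ in range(6):
    rf.call()
depth_resident, depth_spilled = rf.resident, rf.spilled
for _ in range(6):
    rf.ret()
```

In this model, shallow call chains never touch memory for register saves; only the calls that exceed the capacity of the physical file pay for a store and a later reload.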
[0093]
Register windows whose views of the physical register file overlap between the caller and the callee also avoid the argument shuffling that can occur when arguments to a procedure are passed in registers (argument shuffling increases IS and IE). Finally, register windows change the break-even point for allocating a variable or temporary value to a register, and thus encourage the use of registers, which is faster and smaller than using memory locations (which also reduces IS and IE).
[0094]
The main differences between the register windows of the present invention and those of SPARC are that (1) SPARC has a fixed increment of 16 for the window pointer, (2) SPARC has global registers in addition to the windowed registers, and the preferred embodiment of the present invention does not, and (3) SPARC detects window overflow when the current window overlaps a previous window, whereas the preferred embodiment detects window overflow when a register that is part of a previous window is referenced.
[0095]
Changing from a fixed increment to a variable increment is important for lowering the implementation cost. It allows a much smaller physical register file to be used. For example, many Sun SPARC implementations use a physical register file of 136 entries, whereas the preferred embodiment requires a register file of only 64 entries to achieve similar window performance. Variable increments add some complexity, but the difference in processor implementation cost is 30% or more (this is the cost of the larger register file required by the simpler fixed-increment SPARC method). The preferred embodiment specifies new methods for detecting overflow and underflow and for organizing the stack frame.
[0096]
On the surface, the register window mechanism appears to increase CP (or CPI) by requiring an add (albeit a short one) in series with the register file read (register writes are not a problem because there is one cycle in the pipeline in which to do the add). However, it is possible to implement register window access in a manner that has timing similar to that of a non-windowed access to a register file the size of the window. For example, consider a physical register file of 64 registers and a window of 16 registers that is visible to any given instruction. In this case, sixteen 64:1 multiplexers can be used to select the 16 visible registers based solely on the window pointer, after which these 16 results are accessed like a 16-entry register file. Using sixteen 64:1 multiplexers has a high implementation cost. For this reason, the preferred embodiment specifies a window pointer that is restricted to multiples of 4, which reduces this cost by a factor of 4. Even in implementations that choose to use the series add, this means that 2 bits of the register number can be used immediately to start the register file access, and the slower sum bits (the sum of a 4-bit input and a 2-bit input) are used at a later point in the access. Finally, hybrids between these two implementations are possible, with intermediate implementation cost.
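The address calculation implied by a window pointer restricted to multiples of 4 can be sketched in Python as follows. The sketch assumes a 64-entry physical file, a 4-bit window base counted in units of 4 registers, and wraparound modulo the file size; the constant and function names are hypothetical and chosen only for illustration.

```python
PHYSICAL_REGISTERS = 64
WINDOW_GRANULE = 4  # the window pointer is restricted to multiples of 4


def physical_index(window_base: int, reg: int) -> int:
    """Map a 4-bit register field to a physical register index.

    window_base is the 4-bit window pointer in units of 4 registers.
    """
    if not 0 <= window_base < PHYSICAL_REGISTERS // WINDOW_GRANULE:
        raise ValueError("window_base must fit in 4 bits")
    if not 0 <= reg < 16:
        raise ValueError("reg must fit in 4 bits")
    low = reg & 0x3                          # available immediately, no add needed
    high = (window_base + (reg >> 2)) & 0xF  # 4-bit + 2-bit add, wraps around
    return (high << 2) | low


# Window based at physical register 60; register 5 wraps around to physical 1.
wrapped = physical_index(15, 5)
```

The two low bits of the register field pass through unchanged, so only the short add on the upper bits sits in series with the register file read.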
[0097]
Modifications and variations of the preferred embodiment will be readily apparent to those skilled in the art. Such variations are within the scope of the invention as set forth in the appended claims.
[Brief description of the drawings]
FIG. 1 is a block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
FIG. 3 is a block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
FIG. 4 is a block diagram of a processor that implements an instruction set according to a preferred embodiment of the present invention.
FIG. 5 is a block diagram of a pipeline used in a processor according to a preferred embodiment.

Claims

More than 16 general purpose registers,
Means for accessing memory to exchange data with those registers,
An arithmetic unit for processing instructions from memory, all instructions having a length not exceeding 28 bits, and the instructions comprising a full-featured RISC instruction set,
At least one instruction includes an opcode field, a field that specifies a constant operand for the instruction, one source register field that can specify any of the general purpose registers as a source register, and a destination register field that can specify any of the general purpose registers as a destination register,
At least one instruction includes an opcode field, a plurality of source register fields that can specify any of the general purpose registers as source registers, and one destination register field that can specify any of the general purpose registers as a destination register,
At least one instruction causes the arithmetic unit to perform a plurality of compound operations, a first one of which is one of a first arithmetic operation and a logical operation, and a second one of which is one of a second arithmetic operation and a conditional branch operation,
In addition, one special purpose register, and
State indicating means for selectively indicating that execution of a write to the special purpose register has not yet completed or that execution of all pending writes to the special purpose register has completed,
A processor in which the set of instructions includes an instruction that delays execution of the next instruction by the processing unit until the state indicating means indicates that execution of all pending writes has completed.

Including a processing unit for processing instructions,
The instructions include a first group of instructions each having the same first uncompressed fixed length, and a second group of instructions each having the same second uncompressed fixed length, which is different from the first uncompressed fixed length,
A given bit field in the opcode field common to both groups indicates the group to which the instruction having that bit field belongs, and the second group of instructions is a subset, in operational behavior, of the first group of instructions,
A processor in which the number, width and position of the register specifiers are the same for both the first and second groups of instructions.

The processor of claim 2, wherein the first and second groups of instructions occupy the same opcode map.

The processor of claim 2, wherein the register specifiers of the first and second groups of instructions can be decoded without changing the fixed lengths of the first and second uncompressed instructions, respectively.

The processor of claim 2, wherein all instructions have a length not exceeding 28 bits.

The processor of claim 5, wherein all instructions comprise a full-featured RISC instruction set.