JP3544334B2 - Instruction stream conversion method

Info

Publication number: JP3544334B2
Application number: JP2000007264A
Authority: JP
Inventors: Brett Coon; Yoshiyuki Miyayama; Le Trong Nguyen; Johannes Wang
Original assignee: Transmeta Corporation
Priority date: 1992-03-31
Filing date: 2000-01-17
Publication date: 2004-07-21
Anticipated expiration: 2019-07-21
Also published as: JP2000215052A; JP2000215049A; JP3544333B2; JP2000215053A; JP3544335B2; JP2000215054A; JP2000215048A; JP3544332B2; JP3544331B2; JP2000215047A; JP3544330B2

Description

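【０００１】
【Technical Field of the Invention】
The present invention relates generally to superscalar RISC microprocessors, and more specifically to a CISC-to-RISC microprocessor instruction alignment unit and decode unit that allow complex instructions to run on RISC-based hardware.
【０００２】
【Prior Art and Problems to Be Solved by the Invention】
Cross-Reference to Related Applications
The following are commonly owned, co-pending applications:
U.S. application Ser. No. 07/802,816, filed December 6, 1991 (attorney docket No. SP024), entitled "A ROM with RAM Cell and Cyclic Redundancy Check Circuit"; U.S. application Ser. No. 07/817,810, filed January 8, 1992 (attorney docket No. SP015), entitled "High Performance RISC Microprocessor Architecture"; and U.S. application Ser. No. 07/817,809, filed January 8, 1992 (attorney docket No. SP021), entitled "Extensible RISC Microprocessor Architecture".
【０００３】
The disclosures of the above applications are incorporated herein by reference.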
【０００４】
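Related Art
All complex instruction set computers (CISC computers) that use variable-length instructions face the problem of determining the length of each instruction encountered in the instruction stream. Instructions are packed into memory as successive bytes of data. Thus, given the address of an instruction, the starting address of the next instruction can be determined if the length of the first instruction is known.
【０００５】
In conventional processors, this length determination does not significantly affect performance compared with other stages of instruction stream processing, such as the actual execution of each instruction. As a result, fairly simple circuits are typically used. Superscalar reduced instruction set computers (RISC computers), on the other hand, can process instructions at a much higher rate, but instructions must then be extracted from memory at a much higher rate so that multiple instructions can execute in parallel. The limiting factor imposed by the rate at which instructions can be extracted from memory is called the Flynn Bottleneck.
【０００６】
The task of determining the length of each instruction and extracting that instruction from the instruction stream is performed by a functional unit called the instruction alignment unit (IAU). This block must contain decoder logic to determine the instruction length and a shifter to align the instruction data with that decoder logic.
【０００７】
In the Intel 80386 microprocessor, the first byte of an instruction implies a great deal about the overall instruction length, but additional bytes may have to be checked before the final length is known. Furthermore, those additional bytes may in turn specify further additional bytes. Because the process is inherently sequential, it is extremely difficult to determine the length of an x86 instruction quickly.
【０００８】
Based on the information provided in the i486 Programmer's Reference Guide, several conclusions can be drawn about the alignment unit used in the i486. The i486 IAU is designed to look only at the first few bytes of an instruction. If those bytes do not fully specify the length, they are extracted and the process is repeated on the remaining bytes. Each iteration of this process takes a full cycle. In the worst case, therefore, it may take several cycles to align an instruction completely.
【０００９】
The i486 IAU requires additional cycles when, for example, prefixes or extended (2-byte) opcodes are used. Both are common in i486 programs. In addition, complex instructions may contain displacement and immediate data, and the i486 requires additional time to extract this data.
【００１０】
An example CISC processor instruction format is shown in FIG. 22. The example depicts the possible bytes of a variable-length i486 CISC instruction. Instructions are stored in memory on byte boundaries. The minimum instruction length is 1 byte, and the maximum, including prefixes, is 15 bytes. The total length of an instruction is determined by the Prefixes, Opcode, ModR/M and SIB bytes.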
【００１１】
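【Means for Solving the Problems】
The present invention is a subsystem and method for a microprocessor having a superscalar reduced instruction set computer (RISC) processor designed to emulate a complex instruction set computer (CISC), such as an Intel 80x86 microprocessor or another CISC processor.
【００１２】
The CISC-to-RISC translation process of the present invention has two basic steps. CISC instructions must first be extracted from the instruction stream, and then decoded to generate nano-instructions that can be processed by the RISC processor. These steps are performed by the instruction alignment unit (IAU) and the instruction decode unit (IDU), respectively.
【００１３】
The IAU extracts individual CISC instructions from the instruction stream by examining the oldest 23 bytes of instruction data. The IAU extracts 8 contiguous bytes starting at any byte in the bottom line of the instruction FIFO. During each clock phase, the IAU determines the length of the current instruction and uses this information to control two shifters that shift out the current instruction, leaving the next sequential instruction in the stream. The IAU therefore outputs an aligned instruction during each clock phase, for a peak rate of two instructions per cycle. Exceptions to this best-case performance are discussed in sections 2.0 and 2.1 below.
【００１４】
After CISC instructions have been extracted from memory, the IDU converts these aligned instructions into equivalent sequences of RISC instructions called nano-instructions. The IDU takes each aligned instruction as output by the IAU and decodes it to determine various factors, such as the number and type of nano-instructions required, the size of the data operands, and whether a memory access is required to complete the aligned instruction. Simple instructions are translated directly into nano-instructions by decoder hardware, whereas more complex CISC instructions are emulated by subroutines written in a special instruction set, called microcode routines, which are then decoded into nano-instructions. This information is collected for two instructions in one full cycle and then combined to form an instruction bucket, which contains the nano-instructions corresponding to both source instructions. The bucket is then transferred to the instruction execution unit (IEU) for execution by the RISC processor. Execution of the nano-instruction buckets is outside the scope of the present invention.
【００１５】
The foregoing and other features and advantages of the present invention will be apparent from the following more detailed description of the preferred embodiments of the invention, as illustrated in the accompanying drawings.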
【００１６】
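【Embodiments of the Invention】
Embodiments of the present invention are described below with reference to the drawings.
Table of Contents
1.0 Instruction Fetch Unit
2.0 Overview of the Instruction Alignment Unit
2.1 Block Diagram of the Instruction Alignment Unit
3.0 Overview of the Instruction Decode Unit
3.1 Microcode Dispatch Logic
3.2 Mailboxes
3.3 Nano-Instruction Format
3.4 Special Instructions
3.5 Block Diagram of the Instruction Decode Unit
4.0 Decoded Instruction FIFO
Detailed Description of the Preferred Embodiments
The basic concepts discussed in this section are described in more detail in the following references: "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991; "Computer Architecture: A Quantitative Approach" by John L. Hennessy et al., Morgan Kaufmann Publishers, San Mateo, California, 1990; and the "i486 Microprocessor Programmer's Reference Manual" and "i486 Microprocessor Hardware Reference Manual", Intel Corporation, Santa Clara, California, 1990, order numbers 240486 and 240552, respectively. The disclosures of these publications are incorporated herein by reference.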
【００１７】
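1.0 Instruction Fetch Unit
The instruction fetch unit (IFU) of the present invention fetches instruction bytes from an instruction stream stored in instruction memory, an instruction cache or the like, and supplies those bytes to the decoder section for execution. The instructions to be aligned by the instruction alignment unit are therefore supplied by the IFU. FIG. 1 is a block diagram of the three instruction prefetch buffers 200 within the IFU: a main instruction buffer (MBUF) 204, an emulation instruction buffer (EBUF) 202, and a target instruction buffer (TBUF) 206. The instruction prefetch buffers can load 128 bits (16 bytes) of the instruction stream from the instruction cache in a single cycle. This data is held in one of the three buffers for use by the IAU.
【００１８】
During normal program execution, the MBUF 204 supplies instruction bytes to the IAU. When conditional control flow (i.e., a conditional branch instruction) is encountered, the instructions at the branch target address are stored in the TBUF 206 while execution continues from the MBUF 204. Once the branch is resolved, the TBUF 206 is either discarded (branch not taken) or transferred to the MBUF (branch taken). In either case, execution continues from the MBUF. The EBUF 202 operates somewhat differently. When emulation mode is entered, whether through an emulated instruction or an exception, instruction fetching and execution are transferred to the EBUF 202. (Emulation mode and exception handling are both described in detail below.) Execution continues from the EBUF 202 for as long as the processor is in emulation mode. When the emulation routine ends, execution resumes from the instruction data remaining in the MBUF 204. This eliminates the need to refetch the main instruction data after an emulation routine has executed.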
【００１９】
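2.0 Overview of the Instruction Alignment Unit
The instruction alignment unit, in combination with the present invention, follows the RISC strategy of making the common case fast by exploiting the superior per-cycle instruction throughput of a superscalar processor.
【００２０】
In the present invention, the term "align" means positioning the bytes of an instruction so that they can be distinguished from adjacent bytes in the instruction stream for later decoding. The IAU distinguishes the end of the current instruction from the start of the next by determining the number of bytes in the current instruction. The IAU then aligns the current instruction so that the least significant byte presented to the IDU is the first byte of the current instruction. The bytes may also be supplied to the IDU in other orders.
【００２１】
The IAU subsystem of the present invention can align most common instructions at a rate of two per cycle at all clock rates, and most other instructions at the same rate at reduced clock speeds. Instructions that contain prefixes require an additional half cycle to align. Immediate data and displacement fields are extracted in parallel and therefore require no extra time.
【００２２】
Furthermore, the worst-case alignment time of the IAU is only 2.0 cycles per instruction, which is less than the time conventional CISC processors need to align many common instructions. The worst case occurs when the instruction has one or more prefixes (a half cycle in total to align), the instruction belongs to the set that requires a full cycle to determine its length, and the instruction (not counting prefixes) is longer than 8 bytes (requiring one more half cycle, for a total of two full cycles).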
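The timing rules in the two paragraphs above can be tallied in a few lines. This is only an illustrative sketch: the function name `alignment_phases` and its parameters are hypothetical, and the model simply counts half-cycle phases as described in the text (it ignores stalls such as waiting for a bucket to become valid).

```python
def alignment_phases(has_prefix: bool, in_fast_subset: bool, length: int) -> int:
    """Count the half-cycle phases needed to align one instruction.

    has_prefix     -- one or more prefix bytes precede the opcode
    in_fast_subset -- length can be decoded in a single phase
    length         -- instruction length in bytes, prefixes excluded
    """
    phases = 1 if in_fast_subset else 2  # length decode: half or full cycle
    if has_prefix:
        phases += 1                      # prefixes cost one extra half cycle
    if length > 8:
        phases += 1                      # more than 8 bytes needs another shift
    return phases


best = alignment_phases(has_prefix=False, in_fast_subset=True, length=3)
worst = alignment_phases(has_prefix=True, in_fast_subset=False, length=11)
print(best / 2, worst / 2)  # cycles per instruction: 0.5 2.0
```

The best case of one phase corresponds to the peak rate of two instructions per cycle, and the worst case of four phases to the stated 2.0 cycles.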
【００２３】
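Several architectural features make this performance possible. First, the IAU is designed to perform a complete alignment operation during each phase of the clock by using alternating phase latches and multiplexers in the alignment circuit. Second, the decode logic divides CISC instructions into two categories according to the number of bits that must be considered to determine each instruction's length: instructions whose length is specified by a small number of bits are aligned in a single phase (half cycle), whereas other instructions typically require one full clock cycle. Finally, the IAU can extract up to 8 bytes from the instruction stream in a single shift. This allows long instructions (up to 15 bytes on the i486) to be aligned with a small number of shift operations, and allows most instructions to be aligned with a single shift.
【００２４】
The IAU performs the following tasks in order to decode CISC instructions quickly and accurately:
Detect the presence and length of prefix bytes
Isolate the Opcode, ModR/M and SIB (scale, index, base) bytes
Detect the length of the instruction (which indicates the location of the next instruction)
Send the following information to the instruction decode unit (IDU):
− The opcode, i.e., 8 bits plus 3 optional extension bits. For 2-byte opcodes, the first byte is always 0F hex, so the second byte is sent as the opcode.
− The ModR/M byte, SIB byte, displacement and immediate data.
− Information about the number and type of prefixes.
【００２５】
The opcode byte specifies the operation performed by the instruction. The ModR/M byte specifies the addressing form to be used when the instruction references a memory operand. The ModR/M byte can also refer to a second addressing byte, the SIB (scale, index, base) byte, which may be needed to fully specify the addressing form.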
【００２６】
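2.1 Block Diagram of the Instruction Alignment Unit
A block diagram of the IAU is shown in FIG. 2. The diagram is divided into two parts: a main data path 302 (enclosed by a dashed line) and a predecoder 304 (enclosed by a dashed line). Instruction shifting and extraction take place in the main data path 302, while length determination and data path control are handled by the predecoder 304.
【００２７】
The main data path 302 consists of several shifters, latches and multiplexers. An extract shifter 306 receives instruction data, organized in bytes, from the IFU. Two buses (shown generally at 303), IFI0b_Bus[127:0] and IFI1b_Bus[55:0], represent the instruction data outputs of the IFU. The IFU updates this instruction information in response to a request from the IAU on an advance buffer request (ADVBUFREQ) line 308. Generation of the ADVBUFREQ signal is described below. Eight bytes of data corresponding to the current instruction are output from the extract shifter and sent to an align shifter 310 on a bus 307. The align shifter holds a total of 16 bytes of instruction data and can shift up to 8 bytes per phase. When prefixes are detected, the align shifter is used to separate them from the instruction by shifting them out. The align shifter is also used to align the instruction to the lower-order bytes and to shift out the entire instruction after it has been aligned.
【００２８】
The 8 bytes are also sent over a bus 309 to an immediate data shifter (IMM shifter 312) and a displacement shifter (DISP shifter 314). The IMM shifter 312 extracts immediate data from the current instruction, and the DISP shifter 314 extracts displacement data. Data to these two shifters is delayed by a ½ cycle delay element 316 to keep it synchronized with the aligned instruction.
【００２９】
The align shifter 310 outputs the next aligned instruction on a bus 311 to one of two ALIGN_IR latches, 318 or 320. These latches operate on opposite phases of the system clock, so that two instructions are latched per cycle. The ALIGN_IR latches 318 and 320 output aligned instructions on two output buses 321. During the phase in which one latch is receiving a new value, the output of the other latch (the current aligned instruction) is selected by a multiplexer (MUX 322). MUX 322 outputs the current aligned instruction on an aligned instruction bus 323. Output 323 is the primary output of the IAU. It is used by the predecoder 304 to determine the length of the current instruction, and it is fed back to the align shifter 310 as the data from which the next instruction is extracted. The current aligned instruction is fed back to the align shifter 310 via a bus 325, a stack 334, and then a bus 305. Bus 305 also sends the current aligned instruction to the ½ cycle data delay 316.
【００３０】
The IMM shifter 312 and the DISP shifter 314 can shift the immediate data and displacement data, respectively, because they require a total of 16 bytes to shift. The ½ cycle data delay 316 outputs instruction bytes to the shifters on a bus. The IMM shifter 312 outputs the immediate data corresponding to the current instruction on an immediate data bus 340. The DISP shifter 314 outputs the displacement data corresponding to the current instruction on a displacement data bus 342.
【００３１】
The predecoder 304 consists of three decoder blocks: a next instruction detector (NID) 324, an immediate data and displacement detector (IDDD) 326, and a prefix detector (PD) 328. The NID and PD control the align shifter and the extract shifter, and the IDDD controls the IMM shifter 312 and the DISP shifter 314.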
【００３２】
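The PD 328 is designed to detect the presence of prefixes in an instruction. The PD 328 determines the number of prefixes present and supplies shift control signals to the align shifter 310 and a counter shifter 332, via a line 331, a MUX 330 and a line 333, so that the prefixes are extracted from the instruction stream in the next half cycle. In addition, the PD 328 decodes the prefixes themselves and supplies this prefix information on an output line 329 to the IDU.
【００３３】
The basic architecture of the PD 328 consists of four identical detection units (to detect up to four prefixes) and a second block of logic that decodes the prefixes themselves. Although the CISC format defines the order in which prefixes may occur, the present invention checks for the presence of every prefix in each of the first four byte positions. Furthermore, the functions of detecting the presence of prefixes and decoding the prefixes are separated in order to take advantage of the lower speed requirement on the decoder. The architecture of the PD 328 is described in more detail below.
【００３４】
The IDDD 326 is designed to extract immediate data and displacement data from each instruction. The IDDD 326 always attempts to extract both fields, whether or not they are present. The IDDD 326 controls the IMM shifter 312 and the DISP shifter 314 on a pair of lines 344 and 346, respectively. The IDU needs half a cycle to process the aligned instruction and has no use for the immediate and displacement data during that time. The immediate and displacement data are therefore delayed by the ½ cycle data delay 316 to give the IDDD 326 more time to compute the shift amounts: unlike the NID 324, which decodes and shifts in the same phase, the IDDD's shift takes place in the following phase.
【００３５】
The NID 324 is the heart of the predecoder. Once the prefixes have been removed, the NID 324 determines the length of each instruction. The NID 324 controls the align shifter 310 and the counter shifter 332 via a control line 327, MUX 330 and line 333. The NID consists of two sub-blocks, a subset next instruction detector (SNID 702) and a remaining next instruction detector (RNID 704), which are described in connection with FIGS. 6 and 7.
【００３６】
As its name implies, the SNID 702 determines the length of a subset of the CISC instruction set. Instructions in the subset are aligned by the SNID at a rate of two instructions per cycle.
【００３７】
The RNID 704 determines the length of all remaining instructions and requires an additional half cycle, bringing the total decode time to one full cycle. The SNID determines whether an instruction is in the subset, and this signal is used within the NID to select the output of either the SNID or the RNID.
【００３８】
When a new instruction is being aligned, it is initially assumed to be in the subset, and the output of the SNID is therefore selected. If the SNID determines (during this same half cycle) that the instruction must be handled by the RNID, a signal is asserted and the IAU loops the current instruction, holding it for another half cycle. During this second half cycle, the output of the RNID is selected and the instruction is properly aligned.
【００３９】
This NID architecture has several advantages. One, already mentioned above, is that if the cycle time is long enough, the selection between the SNID and the RNID can be made within a single half cycle, so that all instructions are aligned in a single phase (not counting the time needed to extract prefixes and instructions longer than 8 bytes). This improves per-cycle performance at low cycle rates without additional hardware.
【００４０】
A second advantage is that the selection signal can be used as an alignment cancel signal, because it causes the IAU to ignore the SNID shift output and hold the current instruction for another half cycle. The SNID can be designed to predict certain instruction combinations or lengths and then generate the cancel signal if the prediction turns out to be wrong. This technique could be used, for example, to align multiple instructions in a single half cycle, further improving performance.
【００４１】
The IAU also includes the counter shifter 332. The counter shifter 332 is used to determine the shift amount of the extract shifter 306 via a line 335, and to request additional CISC instruction bytes from the IFU using the ADVBUFREQ line 308. The function of the counter shifter 332 is best understood by reviewing the following IAU operation flowchart and timing diagram example.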
【００４２】
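FIG. 3 is a general flowchart of the instruction byte extraction and alignment performed by the IAU of the present invention. As shown at step 402, when new data enters the lowest line 205 of the IFU's MBUF 204 (called BUCKET_#0), the extract shifter 306 extracts 8 bytes starting with the first instruction. As shown at step 404, these 8 instruction bytes bypass the align shifter 310 and are passed to the ALIGN_IR latches 318 and 320. As shown at step 406, the IAU then waits for the next clock phase while holding the aligned instruction in the ALIGN_IR latches.
【００４３】
During the next clock phase, the IAU outputs the aligned instruction to the IDU, the stack 334, the IDDD 326, the NID 324, the PD 328 and the ½ cycle data delay 316. Immediate data and displacement information are then output to the IDU on buses 340 and 342, respectively. This data, if present, corresponds to the instruction aligned during the previous phase. These operations are shown generally at step 408 of FIG. 3.
【００４４】
The IAU then enters a conditional statement 409 to determine whether prefixes are present. This determination is made by the PD (prefix detector) 328. If one or more prefixes are detected by the PD, as indicated by the "Yes" arrow leaving conditional statement 409, the process proceeds to step 410, where the IAU selects the output of the PD at MUX 330. As shown at step 412, the decoded prefix information is then latched so that it can be sent to the IDU in the next phase together with the corresponding aligned instruction. If no prefix bytes are detected, as indicated by the "No" arrow leaving conditional statement 409, the output of the NID 324 is selected at MUX 330, as shown at step 414.
【００４５】
Once step 412 or 414 is complete, the current output of the counter shifter 332 is used, as shown at block 416, to control the extract shifter 306 so that it supplies the next 8 bytes of instruction data to the align shifter 310 and the ½ cycle data delay 316. The IAU then uses the output of MUX 330 as a variable called Shift_A. This variable controls the align shifter 310 in order to align the next instruction. Shift_A is also added to the current extract shifter shift amount (called BUF_Count) to compute the shift amount to be used during the next phase. This addition is performed in the counter shifter 332, as shown at step 418.
【００４６】
The next operation performed by the IAU, shown at step 420, is to latch the output of the align shifter in the ALIGN_IR latches. As shown at step 422, the positions of the immediate data and displacement data are computed in the IDDD 326, and this shift amount is delayed by ½ cycle. Then, as shown at step 424, the IAU uses the shift amounts computed during the previous half cycle to shift the data currently entering the IMM shifter 312 and the DISP shifter 314. Finally, the process repeats beginning at step 406, waiting for the next clock phase. Steps 408 through 424 are repeated for the remaining instruction bytes in the instruction stream.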
【００４７】
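FIG. 4 is a timing diagram associated with the IAU of FIG. 2. Two instruction buckets are shown at the top of FIG. 4. These buckets, labeled Bucket_#0 and Bucket_#1, each consist of 16 instruction bytes supplied by the IFU (from an instruction memory, not shown) to the IAU of FIG. 2. Instruction alignment always starts from the right of Bucket_#0 (i.e., the bottom bucket). In this example, Bucket_#0 and Bucket_#1 are the bottom two buckets of the IFU's MBUF 204. Other arrangements are also possible.
【００４８】
In this example, the first three instructions sent to the IAU are OP0, OP1 and OP2, with lengths of 5, 3 and 11 bytes, respectively. Note that only the first 8 bytes of instruction OP2 fit in Bucket_#0. The remaining 3 bytes wrap into the beginning of Bucket_#1. To simplify the example, these three instructions are assumed to have no prefix bytes. If prefixes were detected, one additional phase would be needed to align that instruction.
【００４９】
An instruction can start at any position in a bucket. Instructions are extracted up to 8 bytes at a time, starting at any position in the bottom bucket. The IAU examines two buckets in order to handle instructions that wrap into the second bucket, such as OP2 in this example.
【００５０】
Trace "1" in the timing diagram is one of two system clocks, CLK0. In this example, the system clock has a 6 nanosecond half cycle. CLK0, which has the opposite phase to the other system clock CLK1, rises at T6 and falls at T0, where T0 is the rising edge of CLK1 and T6 is the rising edge of CLK0. For clarity, the three main clock phases in FIG. 4 are labeled F1, F2 and F3.
【００５１】
Traces "2" and "3" in the timing diagram represent the instruction data on input buses IFI1B and IFI0B. As shown at 502, a new Bucket_#0 becomes available on IFI0B at the start of F1. Shortly afterwards, the first 8 bytes beginning with OP0 (B#0; 7-0) are extracted by the extract shifter 306 at 504. Bytes 7-0 of Bucket_#0 are shown as valid. The extract shifter timing is shown in trace "4".
【００５２】
When CISC-to-RISC decoding of the instruction stream begins, the counter shifter 332 controls the extract shifter 306 to extract the first 8 bytes from Bucket_#0. As instruction alignment proceeds, the counter shifter signals the extract shifter to shift and extract further bytes from the buckets. When Bucket_#0 has been emptied of instruction bytes, the contents of Bucket_#1 are shifted into Bucket_#0 and Bucket_#1 is refilled from the instruction stream. After the first 8 bytes have been extracted, the extract shifter extracts and shifts bytes under control of the counter shifter on line 335, based on the instruction length, the prefix length and the previous shift information.
【００５３】
In this example, however, the counter shifter signals the extract shifter to shift by zero in order to align the first instruction. The extract shifter therefore shifts out the first 8 bytes of the first instruction to the align shifter 310. The timing of the signals at the align shifter is shown in trace "5" of the timing diagram. These 8 bytes become valid at the align shifter during the F1 period indicated by reference number 506.
【００５４】
The first 8 bytes of Bucket_#0 bypass the align shifter and are stored in one of the two ALIGN_IR latches, 318 or 320 (as shown in traces "6" and "7" of FIG. 4). The ALIGN_IR latches receive instruction bytes alternately, according to the timing of clock signals CLK0 and CLK1. ALIGN_IR0 318 is a CLK0 latch, meaning that it latches when clock signal CLK0 is high. ALIGN_IR1 320 is a CLK1 latch and latches when clock signal CLK1 is high. As shown at reference number 508 toward the end of F1, the first 8 bytes become valid in ALIGN_IR0 before the end of the first CLK0 phase.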
【００５５】
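MUX 322 selects the latch that was latching during the previous phase. In this example, therefore, MUX 322 outputs the first 8 bytes of OP0 during the second full phase, F2.
【００５６】
The first 8 bytes of OP0 then flow to the NID 324 and the stack 334. The NID 324 detects that the first instruction is 5 bytes long and sends this information back to the align shifter and the counter shifter via line 327, MUX 330 and line 333. At the same time, as described above, the first 8 bytes flow through the stack and are fed back to the align shifter. The align shifter thus receives instruction bytes from the extract shifter and, indirectly, from itself. This is because the align shifter needs 16 bytes of input in order to shift by up to 8 bytes per cycle. When the align shifter shifts right by X bytes, it discards the X least significant bytes and passes the next 8 bytes of data to latches 318 and 320. In this case, the stack 334 supplies bytes 0 to 7 to the align shifter 310.
【００５７】
A bypass 336 around the align shifter is used in the initial case, when the extract shifter extracts the first instruction from the instruction stream. Apart from any prefix bytes, the first instruction is already aligned, so the align shifter does not need to shift in the initial case.
【００５８】
During F2 of the timing diagram, the extract shifter shifts out 8 bytes, bytes 15 to 8 of Bucket_#0. See 510 in FIG. 4. These bytes are sent to the align shifter, which now has a total of 16 sequential bytes to work with. The align shifter looks at the output of the extract shifter and the valid output of latches 318 and 320 during F2.
【００５９】
Near the end of F2, the align shifter shifts bytes 12 to 5 of Bucket_#0 to its output, based on the signal from the NID, which instructs the align shifter to shift right by 5 bytes. The 5 least significant bytes, corresponding to instruction OP0, are thereby discarded. See the Shift_5_Bytes signal 512 in trace "8" of the timing diagram. The remaining 8 bytes of instruction data, bytes 12 to 5, then flow through the align shifter. Note that byte 5 is the first byte of the next instruction, OP1.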
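The behavior of the align shifter in this example can be modeled as a sliding 8-byte window over 16 bytes. The sketch below is a software analogy only, not the hardware design; the function name `align_shift` and the use of Python `bytes` (index 0 being the least significant byte) are assumptions made for illustration.

```python
def align_shift(latched: bytes, extracted: bytes, shift: int) -> bytes:
    """Model one align-shifter step.

    latched   -- the 8 bytes fed back from the ALIGN_IR latches (lower bytes)
    extracted -- the next 8 bytes supplied by the extract shifter
    shift     -- number of low-order bytes to discard (0 to 8)
    """
    if len(latched) != 8 or len(extracted) != 8:
        raise ValueError("both inputs must be exactly 8 bytes")
    if not 0 <= shift <= 8:
        raise ValueError("shift must be between 0 and 8 bytes")
    window = latched + extracted      # 16-byte input, lowest byte first
    return window[shift:shift + 8]    # next 8 bytes after the discarded ones


# Bucket_#0 with each byte holding its own index, as in the timing example.
bucket0 = bytes(range(16))
aligned = align_shift(bucket0[0:8], bucket0[8:16], 5)  # OP0 is 5 bytes long
print(list(aligned))  # [5, 6, 7, 8, 9, 10, 11, 12]
```

The output starts at byte 5, the first byte of OP1, which matches bytes 12 to 5 in the description above.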
【００６０】
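The counter shifter 332 then shifts the extract shifter 306 by 8 bytes, because the first 8 bytes are now available from the ALIGN_IR latches and the next bytes are therefore needed. When phase F3 begins, the counter shifter signals the extract shifter to increase its shift amount by the number of bytes shifted out by the align shifter 310 during the previous phase. The counter shifter must therefore contain logic to store the previous extract shifter shift amount and to add the align shifter shift amount to that value.
【００６１】
Each time a new value is produced for the align shifter, the counter shifter adds that amount to the old shift amount. In this example, the counter shifter shifted 8 bytes during F2. During F3, therefore, the counter shifter must instruct the extract shifter to shift 8 + 5, or 13, bytes. The bytes output by the extract shifter are bytes 20 to 13. Note that the ALIGN_IR latches output bytes 12 to 5 during F3, so bytes 20 to 5 are available at the align shifter.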
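The running total kept by the counter shifter can be followed with a short worked example. This is a hedged sketch: `extract_shift_amounts` is a hypothetical helper, the initial 8-byte fetch is taken from the example in the text, and the wrap from Bucket_#0 into Bucket_#1 (where the count would be reduced by 16) is deliberately not modeled.

```python
def extract_shift_amounts(instruction_lengths, first_fetch=8):
    """Return the extract-shifter shift amount used in each successive phase.

    The counter shifter stores the previous amount and adds the align
    shifter's shift (the length of the instruction just aligned) to it.
    """
    buf_count = 0
    amounts = [buf_count]        # F1: first 8 bytes extracted with shift 0
    buf_count += first_fetch     # F2: the next 8 bytes are needed
    amounts.append(buf_count)
    for length in instruction_lengths:
        buf_count += length      # add the align shifter's shift amount
        amounts.append(buf_count)
    return amounts


# OP0 is 5 bytes and OP1 is 3 bytes in the example of FIG. 4.
print(extract_shift_amounts([5, 3]))  # [0, 8, 13, 16]
```

The third value, 13, is the 8 + 5 shift for F3; with a shift of 13 the extract shifter outputs bytes 20 to 13.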
【００６２】
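During F3, the extract shifter outputs bytes 20 to 13. However, Bucket_#0 contains only bytes 15 to 0, so bytes 20 to 16 must be taken from Bucket_#1. As shown at 514 in the timing diagram, Bucket_#1 becomes valid at the start of F3. As shown at 516, the extract shifter then shifts bytes 4 to 0 of Bucket_#1 and bytes 15 to 13 of Bucket_#0. If Bucket_#1 were not valid at this point, the IAU would have to wait until it became valid.
【００６３】
As noted above, the Shift_5_Bytes signal was generated by the NID during F2. In response to this signal, bytes 12 to 5 of Bucket_#0 are shifted out by the align shifter, as shown at 518, and are latched into ALIGN_IR1 shortly afterwards, as shown at 520.
【００６４】
Bytes 12 to 5 are sent by MUX 322 to the stack 334 and the NID 324 at the start of F3. The stack feeds bytes 12 to 5 back to the align shifter, as shown at 305, and the NID determines that OP1 is 3 bytes long and outputs a Shift_3_Bytes signal in the latter part of F3, as shown at 522 in trace "9". The align shifter shifts by 3 bytes (yielding bytes 15 to 8), and this amount is added in the counter shifter.
【００６５】
The process described above then repeats. When an instruction passes the end of Bucket_#0 (i.e., Bucket_#0 has been completely consumed), Bucket_#1 becomes Bucket_#0, and a new Bucket_#1 subsequently becomes valid.
【００６６】
Trace "10" in the timing diagram shows the timing of byte extraction from the instruction stream. The Buf_Count#0 block represents the stored extract shift amount. Each phase, the align shift amount is added to Buf_Count#0, and the result becomes the extract shift amount for the next phase (see the blocks labeled Counter_Shift).
【００６７】
Trace "11" in the timing diagram shows the instruction alignment timing. The blocks labeled IR_Latch_#0 and IR_Latch_#1 represent the periods during which the instruction in the corresponding ALIGN_IR latch is valid. The small blocks labeled MUX1 represent the time at which MUX 322 begins to select that valid alignment latch. The small blocks labeled MUX2 represent the time at which MUX 330 begins to select the shift amount determined by the NID 324. Finally, the blocks labeled Align_Shift represent the time at which the align shifter begins to output the instruction.
【００６８】
Prefixes are extracted using the same technique by which instructions are aligned, except that MUX 330 selects the output of the PD 328 rather than the output of the NID 324.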
【００６９】
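A block diagram of a portion of the stack 334 is shown in FIG. 5. The stack consists of 64 one-bit stacks arranged in parallel. Each one-bit stack 600 consists of two latches 602 and 604 and a three-input MUX 606. The aligned instruction is input to the latches and to the MUX on a bus 607 labeled IN. The two latches can be loaded independently on either clock phase. In addition, MUX 606 has three MUX control lines 608 to select the output of either latch, or to bypass the IN data directly to an output 610 labeled OUT.
【００７０】
The IAU may periodically transfer to a different instruction stream. The stack allows the IAU to store two sets of 8 bytes of instruction data from MUX 322. This feature is typically used for CISC instruction emulation. When the IAU must branch to process a microcode routine that emulates a complex CISC instruction, the state of the IAU is saved so that it can be restarted once emulation of the CISC instruction is complete.
【００７１】
The ½ cycle data delay 316 is used to delay the immediate data and displacement information. Placing the delay in the IAU ahead of the shifters pipelines the immediate data and displacement logic so that the shift is performed in the following phase, rather than the instruction length being determined and the shift performed in the same half cycle. Because these operations are spread across the cycle, the timing requirements on that logic are easier to meet. The IDDD block 326 controls the IMM shifter 312 and the DISP shifter 314 to extract the immediate data and displacement data from the instruction. For example, if the first 3 bytes of an instruction are opcode, followed by 4 bytes of displacement and 4 bytes of immediate data, the shifters are enabled to shift out the appropriate bytes.
【００７２】
The shifters 312 and 314 always output 32 bits, whether the actual data size is 8, 16 or 32 bits, with the immediate data and displacement data right-justified in the low-order bits of the 32-bit output. The IDU determines whether the immediate data and displacement data are valid and, if so, how much of the data is valid.
【００７３】
Determining the length of the prefixes, immediate data and displacement data, and the actual length of the instruction, is a function of the actual CISC instruction set being aligned and decoded. Those skilled in the art can obtain this information by studying the CISC instruction set itself, the manufacturer's user manuals, or other common reference material. Those skilled in the art will readily understand how to do this, how to convert the information into random logic to implement the IAU subsystem described above, how to implement the IDU subsystem described below, and how to generate the control logic and control signals used to control the flow of data. Furthermore, once such random logic has been generated, commercially available engineering software applications (for example, Verilog from Cadence Design Systems, San Jose, California) can be used to verify the logic, and such applications help define the timing and generation of the control signals and the associated random logic. Other commercially available engineering software applications can be used to generate gate and cell layouts in order to optimize the implementation of such functional blocks and control logic.
【００７４】
The i486 instruction set supports 11 prefixes, which have a defined order when used together in one instruction. The format defines that up to four prefixes can be included in a single instruction. The prefix detector 328 of the present invention therefore has four identical prefix detection circuits. Each circuit looks for any of the 11 prefix codes. The first four bytes passed to the prefix detector are evaluated, and the outputs of the four prefix detection circuits are combined to determine the total number of prefixes present. The result is used as the shift amount that is passed to MUX 330.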
【００７５】
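Block diagrams of the NID are shown in FIGS. 6 and 7. The following description of the NID is specific to the alignment of i486 instructions. Alignment of other CISC instruction sets would likely call for a different NID architecture. The techniques described below should therefore serve as a guide to those skilled in the art, but should not be regarded as limiting the scope of the present invention.
【００７６】
Only four bytes are needed to determine the length of an instruction (as described above, the four bytes comprise two opcode bytes, one optional ModR/M byte and one SIB byte).
【００７７】
FIG. 6 shows a 4-byte (32-bit) bus 701 representing the first four bytes of the instruction received from MUX 322. The first 2 bytes are sent to the SNID 702 on a bus 703. The SNID determines the length of a first subset of instructions which are, by definition, identified on the basis of the first 2 bytes. The SNID can determine the length of this subset of instructions in a half cycle. The length of a subset instruction is output by the SNID on a bus 705. The width of the bus corresponds to the maximum number of instruction bytes detected by the SNID. The SNID also has a 1-bit MOD detect (MOD_DET) output line 707 to indicate whether a ModR/M byte is present in the instruction. In addition, the SNID has a 1-bit NID_WAIT line 709 to signal the control logic that the instruction is not in the subset (i.e., that the output of the RNID is to be used instead). When NID_WAIT is true, therefore, the IAU must wait a half cycle for the RNID to decode the instruction.
【００７８】
The subset of instructions decoded by the SNID consists of those CISC instructions that can be decoded in a half cycle using a minimum of 1-, 2- and 3-input gates (NAND, NOR and inverter), with a maximum of 5 gate delays, based on a 16x16 Karnaugh map of the 256 instructions. Blocks of the Karnaugh map that contain mostly 1-byte opcode instructions can be implemented in this way. The remaining instructions are decoded by the RNID using a logic array with a longer gate delay.
【００７９】
The RNID 704 receives the first 4 bytes on bus 701. The RNID decodes in order to determine the length of the remaining instructions, which take more than one phase to decode. The RNID has outputs similar to those of the SNID.
【００８０】
The RNID detects the instruction length and outputs the result on a bus 711. A 1-bit OVER8 output 712 indicates that the instruction is more than 8 bytes long. The RNID also has a 1-bit MOD_DET output 714 that indicates whether the instruction contains a ModR/M byte.
【００８１】
The length decoded by either the SNID or the RNID is selected by a MUX 706. A control line 708 for MUX 706, called select decoder for current instruction (SELDECIR), switches MUX 706 between the two decoders to obtain the actual length, which is from 1 to 11 bytes. An 11-byte instruction, for example, causes the RNID to output the OVER8 signal and a 3 on bus 711. The instruction length (ln) is sent to MUX 330 on a bus 716 and is used by the align shifter 310 and the counter shifter 332. The 8 bits output by the top MUX 706 are used as shift controls (enables) for the align shifter and the counter shifter.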
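The length encoding in the 11-byte example can be written out as a small sketch. The function names are hypothetical, and representing the bus value as a plain integer is an assumption (the text suggests the 8 lines act as individual shift enables, which is not modeled here).

```python
def encode_length(length: int) -> tuple:
    """Split an instruction length (1 to 11 bytes, prefixes excluded) into
    an OVER8 flag and the amount carried on the length bus."""
    if not 1 <= length <= 11:
        raise ValueError("length must be between 1 and 11 bytes")
    over8 = length > 8
    return over8, (length - 8 if over8 else length)


def decode_length(over8: bool, amount: int) -> int:
    """Recover the full length from the OVER8 flag and the bus value."""
    return amount + 8 if over8 else amount


print(encode_length(11))       # (True, 3): OVER8 asserted, 3 on the bus
print(decode_length(True, 3))  # 11
```

An instruction longer than 8 bytes is therefore shifted out in two steps, 8 bytes followed by the remainder, which is the extra half cycle mentioned in section 2.0.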
【００８２】
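The ModR/M indication is selected in a similar way. The SELDECIR signal 708 controls a second MUX 710 to select the appropriate MOD line, indicating whether a ModR/M byte is present. The MOD line output 718 is used by the IDDD.
【００８３】
The SELDECIR signal 708 is generated from the NID_WAIT signal 709. The output of the SNID is selected during the first clock phase, because its result will be complete by then. If the NID_WAIT signal 709 indicates that the instruction was not decoded, MUXes 706 and 710 are switched to select the output 711 of the RNID, which becomes available at the start of the next clock phase.
【００８４】
The RNID 704 essentially contains two parallel decoders, one of which decodes the instruction as if it had a 1-byte opcode and the other as if it had a 2-byte opcode. An escape detect (ESC_DET) input signal indicates whether the opcode is 1 or 2 bytes long. In the i486 instruction set, for example, the first byte of every 2-byte opcode (called the escape byte) has the value 0F hex, which indicates that the instruction has a 2-byte opcode. The RNID outputs a valid instruction length based on the ESC_DET signal. This signal indicates that the first opcode byte is an escape (0F hex), i.e., that the opcode is 2 bytes long, and thereby enables the second-byte decoder. The decode logic for generating the ESC_DET signal should be apparent to those skilled in the art.
【００８５】
A block diagram of the RNID is shown in FIG. 7. The RNID comprises an RNID_1OP decoder 752, which decodes the first opcode byte; an RNID_2OP decoder 754, which decodes the second opcode byte; two identical RNID_MOD decoders 756 and 758, which decode the ModR/M byte in either of the two positions determined by the number of opcode bytes present; and an RNID_SUM summer 760. Based on the outputs of the four RNID decoders 752 to 758, the RNID_SUM summer 760 outputs the total length of the instruction on a bus 762. The RNID_SUM summer 760 has a further output line 764, labeled OVER8, to indicate whether the instruction is more than 8 bytes long.
【００８６】
The first opcode byte of the instruction and 3 bits of the ModR/M byte (bits [5:3], called the extension bits) are input to the RNID_1OP 752 on a bus 766. A further input line 768 to the RNID_1OP, called DATA_SZ, indicates whether the operand size of the instruction is 16 or 32 bits. The data size is determined by the memory protection scheme in use and by whether a prefix is present that overrides the default data size. The RNID_1OP assumes that the instruction has a 1-byte opcode and attempts to determine the length of the instruction from that information and the 3 extension bits.
【００８７】
The RNID_MOD decoder 756 decodes the ModR/M byte of the instruction, input on a bus 770. The RNID_MOD decoder has a further input bus 772, labeled ADD_SZ, which indicates whether the address size is 16 or 32 bits. The address size is independent of the data size.
【００８８】
The ESC_DET signal 774 is also input to block 760. For example, when the ESC_DET signal is a logic HIGH, the RNID_SUM block knows that the opcode is actually in the second byte.
【００８９】
The RNID_2OP decoder 754 assumes that the opcode is 2 bytes long and therefore decodes the second byte of the opcode (see bus 776). The RNID_2OP decoder also has the input 768 that identifies the data size.
【００９０】
Because the decoders themselves do not know the length of the opcode, i.e., whether it is 1 or 2 bytes, and because the ModR/M byte always follows the opcode, the second RNID_MOD decoder 758 is used to decode the byte that follows a 2-byte opcode (see bus 778), again on the assumption that the opcode is 2 bytes long. The two RNID_MOD decoders are identical but decode different bytes of the instruction stream.
【００９１】
Again based on the ESC_DET signal 774, the RNID_SUM 760 selects the outputs of the appropriate opcode and ModR/M byte decoders and outputs the length of the instruction on bus 762. The output 764 labeled OVER8 indicates whether the instruction is more than 8 bytes long. If the instruction is more than 8 bytes long, the IR_NO[7:0] bus 762 indicates the number of instruction bytes beyond 8.
【００９２】
The RNID_1OP decoder 752 has a 9-bit-wide output bus 780. One line indicates whether the instruction is 1 byte long. A second line indicates that the instruction is 1 byte long and that a ModR/M byte is present, so that information from the ModR/M decoder must be included in determining the length of the instruction. Similarly, the remaining output lines of bus 780 indicate the following byte counts: 2, 2/MOD, 3, 3/MOD, 4, 5 and 5/MOD. If the instruction is 4 bytes long, a ModR/M byte cannot be present. This is specific to the i486 instruction set. However, the present invention is in no way limited to any particular CISC instruction set. Those skilled in the art can apply the features of the present invention to align and decode any CISC instruction set.
【００９３】
The RNID_2OP decoder 754 has a 6-bit-wide output bus 782. One line indicates whether the instruction is 1 byte long. A second line indicates that the instruction is 1 byte long and contains a ModR/M byte, which must be included in determining the length of the instruction. Similarly, the remaining output lines of bus 782 indicate 2, 2/MOD, 3 and 5/MOD. No other instruction lengths are supported by the i486 instruction set when the opcode is 2 bytes long.
【００９４】
Outputs 784 and 786 of the two RNID_MOD decoders 756 and 758 inform the RNID_SUM 760 of the five possible additional lengths specified by the ModR/M byte. Each RNID_MOD decoder has a 5-bit-wide output bus. The five possible additional lengths are 1, 2, 3, 5 and 6 bytes. The ModR/M byte itself is included in these lengths. The remaining bytes consist of immediate data or displacement data.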
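The five additional lengths follow from the standard x86 ModR/M and SIB encoding rules. The sketch below is a software restatement of those rules, not the gate-level decoder of the patent. The names `modrm_extra_length` and `rnid_sum` are hypothetical, and treating `base_len` as "opcode bytes plus immediate bytes" is an assumption about how the opcode decoder outputs such as 2/MOD are meant to combine with the ModR/M decoder output.

```python
def modrm_extra_length(modrm: int, sib: int = 0, addr32: bool = True) -> int:
    """Bytes contributed by the ModR/M byte itself, an optional SIB byte
    and any displacement, following the x86 addressing-form encoding."""
    mod, rm = modrm >> 6, modrm & 7
    if mod == 3:                        # register operand: ModR/M byte only
        return 1
    if addr32:
        has_sib = rm == 4               # rm == 100b selects a SIB byte
        if mod == 0:
            # disp32 only when the base field (or rm, without SIB) is 101b
            no_base = (sib & 7) == 5 if has_sib else rm == 5
            disp = 4 if no_base else 0
        else:
            disp = 1 if mod == 1 else 4  # mod 01: disp8, mod 10: disp32
        return 1 + int(has_sib) + disp
    # 16-bit addressing: no SIB byte exists
    if mod == 0:
        return 1 + (2 if rm == 6 else 0)  # rm == 110b means disp16 only
    return 1 + mod                         # mod 01: disp8, mod 10: disp16


def rnid_sum(base_len: int, uses_modrm: bool, modrm: int = 0,
             sib: int = 0, addr32: bool = True) -> int:
    """Total length = opcode/immediate bytes + ModR/M group, if any."""
    extra = modrm_extra_length(modrm, sib, addr32) if uses_modrm else 0
    return base_len + extra


print(rnid_sum(1, True, 0x43))  # 8B 43 10 (mov eax, [ebx+10h]) -> 3
print(rnid_sum(5, True, 0x05))  # 81 05 disp32 imm32 -> 10
```

Enumerating every ModR/M and SIB value with this function yields only the lengths 1, 2, 3, 5 and 6, in agreement with the five-line output bus described above.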
【００９５】
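FIG. 8 is a block diagram of the IDDD 326. The IDDD 326 determines the shift amounts for the IMM shifter 312 and the DISP shifter 314. The shift amounts are determined by the ModR/M byte of the instruction.
【００９６】
The i486 instruction set includes two special instructions, the enter_detect instruction and the jump_call_detect instruction. The IDDD 326 therefore has a block called the immediate special detector (ISD) 802 to handle the decoding of these instructions. An input 803 to the ISD is the first byte of the instruction. Two output lines, EN_DET and JMP_CL_DET (820 and 822), indicate that one of the corresponding instructions has been detected.
【００９７】
MOD_DEC decoders 804 and 806 are identical and decode the immediate data and displacement data. Based on ADD_SZ 772, decoder 804 examines the ModR/M byte on the assumption of a 1-byte opcode, and decoder 806 examines the ModR/M byte on the assumption of a 2-byte opcode. The instruction byte inputs to MOD_DEC 804 and 806 are 805 and 807, respectively. These decoders determine the position of the displacement and the position of the immediate data in the instruction stream. Two 7-line outputs 824 and 826 indicate where the displacement and the immediate data start: the displacement can start at position 2 or position 3, and the immediate data can start at position 2, 3, 4, 6 or 7.
【００９８】
The MOD_DET lines 707 and 714 are also input to a select block 812.
【００９９】
The select block 812 combines the EN_DET and JMP_CL_DET signals, the MOD_DET and MOD_DEC results, and ADD_SZ, and outputs the results on four buses 832 to 838. A displacement 1 (DISP_1) bus 832 outputs the displacement shift result on the assumption of a 1-byte opcode. A displacement 2 (DISP_2) bus 834 outputs the displacement shift result on the assumption of a 2-byte opcode. Immediate 1 and 2 (IMM_1 and IMM_2) buses 836 and 838 output the immediate data shift information on the assumption of a 1-byte and a 2-byte opcode, respectively.
【０１００】
A final block 814, labeled MOD_SEL/DLY, actually selects the appropriate shift amounts and delays the results by a half cycle. The half-cycle delay performed by MOD_SEL/DLY 814 represents the delay 316 shown in FIG. 2. The ESC_DET signal 774 described above is used by the MOD_SEL/DLY block to make the shift selection. The results are clocked out of MOD_SEL/DLY 814 by clock signals CLK0 and CLK1 after a half-cycle delay. The displacement shift control signals and the immediate data shift control signals are sent to the DISP shifter and the IMM shifter via a Shift_D[3:0] bus 840 and a Shift_I[7:0] bus 842, respectively. The number of possible positions of the immediate data and displacement data within a CISC instruction defines the number of bits needed to specify the shift amounts.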
【０１０１】
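A block diagram of the prefix detector 328 is shown in FIG. 9. The prefix detector 328 comprises a prefix number decoder (PRFX_NO) 902, four prefix detector decoders (PRFX_DEC 904 to 910) and a prefix decoder (PRFX_SEL) 912.
【０１０２】
The i486 instruction set, for example, includes 11 possible prefixes. Because some prefix combinations are invalid, a total of four prefixes can be included in one instruction. The order of the four prefixes is also defined by the instruction set. However, rather than detecting only legal prefix orderings, the prefix detector uses the four prefix detectors 904 to 910 to decode each of the first 4 bytes of the instruction. The first 4 bytes of the instruction are input to the prefix detector on a bus 901. Detectors 904 to 910 each have a 12-bit-wide output bus (905, 907, 909 and 911). If a prefix is actually decoded, the 12 outputs indicate which prefix is present. The twelfth prefix is called unlock; it is the functional complement of the i486 lock prefix, but it is available only to microcode routines in emulation mode.
【０１０３】
An ALIGN_RUN control signal 920 may be included to enable or disable the prefix decoders, and is used to mask out all prefixes. A HOLD_PRFX control signal 922 is used to latch and hold the prefix information. In general, when an instruction is being aligned and the prefix detector 328 indicates that prefixes are present, the control logic must latch the prefix information. The prefix information is then used by the align shifter 310 to shift out the prefixes. In the following cycle, the IAU determines the length of the instruction, aligns it, and passes it to the IDU.
【０１０４】
The PRFX_NO decoder 902 indicates where and how many prefixes are present by decoding the first 4 bytes of the instruction. A logic diagram of the PRFX_NO decoder 902 is shown in FIG. 10. The PRFX_NO decoder comprises four identical decoders 1002 to 1008 and a set of logic gates 1010. The four decoders 1002 to 1008 each examine one of the first four bytes (1010 to 1013) to determine whether a prefix is present. Because a prefix byte can follow an opcode byte, the logic gates 1010 are used to output a result indicating the total number of prefixes that precede the first opcode byte, since a prefix that follows an opcode applies only to the opcode of the next instruction.
【０１０５】
If the first byte (position) is a prefix and there is no prefix in the second position, the total number of prefixes is one. As another example, if there are no prefixes in the first three positions, a prefix in the fourth position is a don't care. A logic HIGH (1) output from the bottom NAND gate 1014 indicates that four prefixes are present, a HIGH output from the second NAND gate from the bottom 1015 indicates that three prefixes are present, and so on. The outputs of the four NAND gates are combined to form a PREFIX_NO bus 1018, which represents the total number of valid prefixes preceding the first opcode, i.e., the shift amount output of the prefix detector 328.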
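The counting rule can be expressed as "count the unbroken run of prefix bytes at the start of the first four bytes". The following sketch is a behavioral model, not the NAND-gate network of FIG. 10. The byte values in `I486_PREFIXES` are not given in this document; they are the 11 prefix encodings from Intel's instruction set documentation (lock, repeat, segment override, operand-size and address-size prefixes), and the microcode-only unlock prefix is not modeled. The function names are hypothetical.

```python
# The 11 i486 prefix bytes (values taken from Intel documentation, not
# from this specification): LOCK, REPNE, REP, six segment overrides,
# operand-size override and address-size override.
I486_PREFIXES = frozenset({0xF0, 0xF2, 0xF3,
                           0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,
                           0x66, 0x67})


def prefix_present(first4: bytes) -> list:
    """Per-position flags: is there a prefix code at each of the first
    four byte positions, regardless of the other positions?"""
    return [b in I486_PREFIXES for b in first4[:4]]


def prefix_count(first4: bytes) -> int:
    """Number of prefixes preceding the first opcode byte (0 to 4).
    A prefix code found after a non-prefix byte is ignored."""
    count = 0
    for present in prefix_present(first4):
        if not present:
            break
        count += 1
    return count


stream = bytes([0x66, 0x8B, 0xC3, 0xF3])
print(prefix_present(stream))  # [True, False, False, True]
print(prefix_count(stream))    # 1
```

In the example, the prefix code in the fourth position is ignored because it follows an opcode byte, which matches the don't-care case described above.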
【０１０６】
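The PRFX_NO decoder 902 also includes a Prefix_Present (PRFX_P) output bus 1020 (also 4 bits wide). The four PRFX_P output lines 1020 to 1023 indicate whether there is a prefix in a particular position, regardless of what the outputs for the other positions are. The PRFX_P outputs are taken directly from the outputs of the four decoders (1002 to 1008).
【０１０７】
The results of the PRFX_NO decoder (described in connection with FIG. 10) and the information from the PRFX_DEC detectors 904 to 910 are combined by the PRFX_SEL decoder 912. The prefix information is combined to form one 13-bit output bus 924, which indicates whether any prefix signals are present and which prefixes are present.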
【０１０８】
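3.0 Overview of the Instruction Decode Unit
All instructions are passed from the IAU to the instruction decode unit (IDU), where they are translated into RISC instructions. Every instruction to be executed by the IEU is first processed by the IDU. The IDU determines whether each instruction is an emulated instruction or a basic instruction. If it is emulated, a microcode emulation routine consisting entirely of basic instructions is processed. If it is basic, it is translated directly by hardware into one to four nano-instructions and sent to the IEU. It is these nano-instructions, rather than the original CISC or microcode instructions, that the IEU actually executes.
【０１０９】
This partitioning of instructions has two major advantages. First, the hardware can be kept small, because it needs to support only simple operations. Second, bugs are less troublesome, because they are most likely to occur in the complex microcode routines, which are easy to change.
【０１１０】
The IDU's microcode support hardware in connection with the present invention has several unique features. Typically, microcode instructions consist of control bits for the various data paths present in a processor, with little or no encoding. In contrast, the microcode of the present invention is a relatively high-level machine language designed to emulate a specific complex instruction set. Whereas typical microcode is routed directly to the functional units of the processor, the microcode of the present invention is processed by the same decoder logic that is used for the target CISC (e.g., 80x86) instructions. This gives the microcode of the present invention much better code density than typical microcode achieves, and makes the microcode easier to develop because of its similarity to the target CISC instruction set. Furthermore, the present invention provides hardware support for microcode revisions: on-chip ROM-based microcode can be replaced, partially or entirely, by external RAM-based microcode under software control. (See commonly owned, co-pending U.S. application Ser. No. 07/802,816, filed December 6, 1991, entitled "A ROM with RAM Cell and Cyclic Redundancy Check Circuit", attorney docket No. SP024, the disclosure of which is incorporated herein by reference.)
The microcode routine language is designed to be a set of instructions executable by the RISC core that performs the functions required by all emulated complex instructions, together with the various control and maintenance functions associated with exception handling. Although emulated instructions are typically less performance-sensitive than non-emulated (basic) instructions, and exceptions (which are handled by microcode routines) occur infrequently, efficient handling of both is still critical to overall system throughput. This goal is achieved through the use of various forms of microcode support hardware. The present invention comprises four areas of microcode support hardware: dispatch logic, mailboxes, the nano-instruction format, and special instructions.
【０１１１】
The microcode dispatch logic controls the efficient transfer of program control from the target CISC instruction stream to a microcode routine and back to the target instruction stream. This is handled with a small amount of hardware and in a manner that is transparent to the instruction execution unit (IEU) of the RISC core. (The IEU executes the RISC instructions. The "RISC core" referred to above is synonymous with the IEU. The details of the IEU are not necessary for one skilled in the art to practice the present invention. The features of the present invention are applicable to RISC processors in general.)
The mailboxes comprise a system of registers used to transfer information from the instruction decode hardware to the microcode routines in a systematic way. This allows the hardware to pass instruction operands and similar data to the microcode routines, saving them the task of extracting this data from the instruction.
【０１１２】
The nano-instruction format describes the information passed from the IDU to the IEU. The format was chosen so that it can be extracted efficiently from the source CISC instruction, while still providing the IEU with sufficient information for dependency checking and functional unit control.
【０１１３】
Finally, the special instructions are an additional set of instructions provided to allow complete control of the RISC hardware and to support hardware-specific emulation tasks, and they are specific to the CISC instruction set.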
【０１１４】
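3.1 Microcode Dispatch Logic
The first step in dispatching to microcode is to determine the address of the microcode routine. This step has two important requirements: each microcode routine must have a unique starting address, and these addresses must be generated quickly. This is fairly easy to achieve for exception handling routines, because the number of cases is small enough for the hardware to store the addresses as constants and simply select among them. Determining the addresses for emulated instructions is harder, however, because there are too many of them to make storing all possible addresses feasible.
【０１１５】
The microcode dispatch logic meets these requirements by basing each instruction's dispatch address directly on its opcode. For example, 1-byte opcodes are mapped into the address space from 0H to 1FFFH, which requires the upper 3 bits of the 16-bit dispatch address to be zero. These microcode entry points are spaced 64 bytes apart, which requires the 6 least significant bits of each entry point address to be zero. This leaves 7 bits undetermined, which can be taken directly from 7 bits of the opcode. As will be apparent to those skilled in the art, generating addresses in this way requires very little logic. For example, only a multiplexer is needed to select the proper bits from the opcode.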
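The bit layout described above (3 zero bits, 7 opcode bits, 6 zero bits) can be illustrated as follows. The text does not say which 7 of the opcode's bits are used, so taking the low 7 bits here is purely an assumption for the sake of a runnable example, and `dispatch_address` is a hypothetical name.

```python
def dispatch_address(opcode: int) -> int:
    """Form a 16-bit microcode entry address for a 1-byte opcode.

    Layout: [15:13] = 0, [12:6] = seven opcode bits, [5:0] = 0.
    ASSUMPTION: the low seven opcode bits are used; the source only says
    that seven bits are taken directly from the opcode.
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    seven_bits = opcode & 0x7F
    return seven_bits << 6        # entry points are 64 bytes apart


print(hex(dispatch_address(0x01)))  # 0x40
print(hex(dispatch_address(0x7F)))  # 0x1fc0, the last entry below 0x1FFF
```

Whatever bit selection is used, every resulting address is a multiple of 64 and lies between 0H and 1FFFH, as the paragraph requires.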
【０１１６】
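Once the dispatch address of the microcode routine has been determined, the microcode must be fetched from memory. Typically, microcode resides in on-chip ROM, but this is not necessarily the case. As detailed in the above-cited U.S. application Ser. No. 07/802,816, each entry point is associated with a ROM-invalid bit that indicates whether the ROM routine is correct. This bit is fetched in parallel with the ROM access and works much like a conventional cache-hit indicator. If the bit indicates that the ROM entry is valid, the microcode routine continues to be fetched from ROM and executes normally. If the bit indicates that the ROM is invalid, however, the microcode is fetched from external memory, such as RAM.
【０１１７】
Addressing of the on-chip microcode routines is performed by the IDU itself. The IDU generates a 16-bit address for accessing the microcode ROM. If the ROM-invalid bit corresponding to the ROM entry being addressed indicates that the microcode is invalid, the address of the external microcode, which resides off-chip in main memory, is computed. A U_Base register holds the upper 16 address bits (called the starting address) of the external microcode residing in main memory. The 16-bit address decoded by the IDU is concatenated with the upper 16 bits in the U_Base register to access the external microcode residing in main memory. If the location of the external microcode in main memory changes, the contents of the U_Base register can be modified to reflect the new main memory location.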
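The fallback from ROM to external memory amounts to a bit concatenation guarded by the ROM-invalid bit. The sketch below assumes a 32-bit U_Base value whose upper 16 bits are significant; the function names and the tuple return format are hypothetical and chosen only for illustration.

```python
def external_microcode_address(u_base: int, rom_address: int) -> int:
    """Concatenate the upper 16 bits of U_Base with the 16-bit address
    generated by the IDU (assumes 32-bit main-memory addresses)."""
    return (u_base & 0xFFFF0000) | (rom_address & 0xFFFF)


def microcode_fetch_target(rom_invalid: bool, u_base: int, rom_address: int):
    """Choose where a routine is fetched from, in the manner of a
    cache-hit check: ROM when the entry is valid, external memory if not."""
    if rom_invalid:
        return "external", external_microcode_address(u_base, rom_address)
    return "rom", rom_address & 0xFFFF


kind, addr = microcode_fetch_target(True, 0x00120000, 0x0040)
print(kind, hex(addr))  # external 0x120040
```

Relocating the external microcode then only requires writing a new value to U_Base; the 16-bit entry addresses are unchanged.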
【０１１８】
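This feature allows the microcode to be updated by replacing a particular routine with another one in external memory, without imposing the performance penalty of external memory accesses on all microcode. It also makes it possible to remove all ROM from the RISC chip and place the entire microcode in external RAM, in order to reduce the area requirements of the RISC chip or to aid microcode development.
【０１１９】
It is also the dispatch logic that provides the means for a microcode routine to return to the main instruction stream when its task is complete. For this purpose, separate program counters (PCs) and instruction buffers are maintained. During normal operation, the main PC determines the address of each CISC instruction in external memory. The sections of memory containing these instructions are fetched by the IFU and stored in the MBUF.
【０１２０】
When an emulated instruction or an exception is detected, the PC value and the length of the current instruction are stored in temporary buffers. Meanwhile, the microcode dispatch address is computed as described above, and instructions are fetched from that address into the EBUF. Microcode executes from the EBUF until a microcode "return" instruction is detected. When the return instruction is detected, the saved PC value is reloaded and execution continues from the MBUF. Because the MBUF and all other related registers are preserved while control is transferred to the microcode routine, the transfer back to the CISC program is very fast.
【０１２１】
Two return instructions are used by microcode routines, to accommodate the difference between instruction emulation routines and exception handling routines. When a microcode routine is entered to handle an exception, it is important that the processor return to the exact state in which it was interrupted once the routine has finished. When a microcode routine is entered to emulate an instruction, however, the routine needs to return to the instruction following the emulated instruction. Otherwise, the emulation routine would be executed a second time. These two functions are handled by two return instructions, aret and eret. The aret instruction returns the processor to the state it was in when microcode was entered, whereas the eret instruction updates the main PC and returns control to the next instruction in the target stream.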
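The difference between the two return instructions comes down to which program counter value is restored. This is a minimal sketch under the assumption that eret resumes at the saved PC plus the saved instruction length (the text states that both values are saved on entry, and that eret continues with the next instruction). `SavedContext` and `return_pc` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SavedContext:
    """Values placed in temporary buffers when microcode is entered."""
    pc: int       # address of the instruction current at entry
    length: int   # length of that instruction in bytes


def return_pc(instruction: str, ctx: SavedContext) -> int:
    """Main PC value after a microcode return instruction."""
    if instruction == "aret":
        return ctx.pc               # exception: resume exactly where interrupted
    if instruction == "eret":
        return ctx.pc + ctx.length  # emulation: continue with the next instruction
    raise ValueError(f"not a microcode return instruction: {instruction}")


ctx = SavedContext(pc=0x1000, length=3)
print(hex(return_pc("aret", ctx)), hex(return_pc("eret", ctx)))  # 0x1000 0x1003
```

Returning with aret after an emulation routine would dispatch to the same emulated instruction again, which is the repeated execution that the paragraph above warns about.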
【０１２２】
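3.2 Mailboxes
For an emulation routine to perform the function of a complex CISC instruction successfully, the microcode needs convenient access to the operands referenced by the emulated instruction. In the present invention, this is accomplished through the use of four mailbox registers. These registers are special only in how they are used: they are defined to be the first four of a set of sixteen temporary registers in the integer register file that are available to microcode. Each emulation routine that requires operands or other information from the original instruction can expect to find these values stored in one or more of the mailbox registers on entry to the routine. When the IDU detects an emulated instruction, it generates instructions that are used by the IEU to load the registers with the values that the microcode expects, before execution of the microcode routine itself begins.
【０１２３】
Consider, for example, emulation of the Load Machine Status Word (lmsw) instruction, which specifies any one of the general-purpose registers as its operand. Assume that the specific instruction to be emulated is lmsw ax, which loads a 16-bit status word from the "ax" register. The same microcode routine is used regardless of which register is actually specified in the instruction, so for this instruction mailbox #0 is loaded with the status word before microcode entry. When the IDU detects this instruction, it generates a mov u0, ax instruction so that the IEU moves the status word from the "ax" register to the "u0" register, which is defined to be mailbox #0. After this mov instruction has been sent to the IEU, the microcode routine is fetched and sent. The microcode can therefore be written as if the emulated instruction were lmsw u0, and it will correctly handle every possible operand that can be specified in the original CISC instruction.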
【０１２４】
３．３ナノ命令フォーマット
上述したように、ＣＩＳＣ型命令はＩＤＵによってナノ命令にデコードされるのであるが、その処理はＩＥＵと呼ばれるＲＩＳＣ型プロセッサ・コアによって行なわれる。ナノ命令は「バケット」と呼ばれる４つのグループに分けてＩＤＵからＩＥＵに渡される。バケットの一つを図１１に示す。各バケットは２個のパケットとそのバケット全体に関する一般的な情報とで構成されている。パケット＃０には常に順序通りに実行される３つのナノ命令が入っている。その３つのナノ命令はロード命令１１０２、ＡＬＵタイプ命令１１０４、格納命令１１０６である。パケット＃１は単一のＡＬＵタイプ命令１１０８から成る。
【０１２５】
ＩＥＵはサイクル当たり１個のピーク・レートでＩＤＵからバケットを受け入れることができる。ＩＤＵはサイクル当たり２個のピーク・レートで基本命令を処理する。ほとんどの基本命令は単一のパケットに変換されているため、通常二つの基本命令は１個のバケットに入れられて一緒にＩＥＵに渡される。このレートの一番大きな制約は基本命令がバケットの要件に適合していなければならないということである。その要件とは以下の通りである。
【０１２６】
二つの基本命令のうち一つしかメモリ・オペランドを参照することはできない（バケット毎にロード／格納動作は一つしかない）、さらに両命令ともに単一のＡＬＵタイプ演算（二つのＡＬＵタイプ演算を要する一つの命令と対照して）から成っていなければならない。
【０１２７】
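If either or both of these constraints are not met, a bucket containing the nano-instructions for only one of the basic instructions is sent to the IEU, and the remaining instruction is sent later in another bucket. These constraints accurately reflect the capabilities of the IEU: because the IEU has two ALUs and one load/store unit, the requirements do not actually limit performance. An example of this type of IEU is disclosed in co-pending U.S. Patent Application No. 07/817,810 of the same assignee, entitled "High Performance RISC Microprocessor Architecture", filed January 8, 1992 (Attorney Docket No. SP015/1397.0280001), and in U.S. Patent Application No. 07/817,809, entitled "Extensible RISC Microprocessor Architecture", filed January 8, 1992 (Attorney Docket No. SP021/1397.0300001). The disclosures of these applications are incorporated herein by reference.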
【０１２８】
３．４特殊命令
汎用命令を用いて実行するのが困難であったり不十分であるマイクロコード・ルーチンによって実行されなければならない機能は数多くある。さらに、従来のＣＩＳＣ型プロセッサに比べ当ＲＩＳＣ型プロセッサのアーキテクチャは拡張されているため、特定の機能が有効である。かといって、そうした機能はＣＩＳＣ型プロセッサには何の意味もないし、従ってＣＩＳＣ型命令のどんな組み合わせを用いても実行できない。合わせて、こうした状況から「特殊命令」が生まれた。
【０１２９】
特殊命令の第１カテゴリーの例はｅｘｔｒａｃｔ＿ｄｅｓｃ＿ｂａｓｅ命令である。この命令によって２個のマイクロコードの汎用レジスタから様々なビット・フィールドが抽出され、それらは連結され、さらにその結果がマイクロコードによる使用のために第３の汎用レジスタに入れられる。この命令を利用しないで同じ動作を実行するには、マイクロコードが幾つかのマスキングとシフトの動作を実行しなければならない上、一時的値を保持するために追加のレジスタの使用が必要となる。特殊命令によって、単一サイクルで１命令によってしかもスクラッチ・レジスタを使わずに、実行されるのと同じ機能が果たせるようになる。
【０１３０】
特殊命令の第２カテゴリーの二つの例については既に述べた。即ち、マイクロコード・ルーチンを終了させるために用いられる二つのリターン命令、ａｒｅｔとｅｒｅｔである。これらの命令はマイクロコード環境でのみ意味があり、従ってＣＩＳＣ型のアーキテクチャには同等の命令とか命令順序といったものはない。本件において、特殊命令は性能上の理由だけでなく、機能補正の点からも必要だった。
【０１３１】
特殊命令はマイクロコード・ルーチンにのみ使用可能であり、さらにエミュレートされた命令は目標のＣＩＳＣ型命令ストリームにしか発生しないから、エミュレートされた命令の演算コードは特殊命令のマイクロコード・モード時に再使用される。従って、目標のＣＩＳＣ型命令ストリームにこれらの演算コードの一つが発生する時、それはその命令のマイクロコード・エミュレーション・ルーチンが実行されるべきであるということを表しているにすぎない。しかしながら、その同じ演算コードがマイクロコード命令ストリームに発生する時、それは特殊命令の一つとして全く異なった機能を有している。この演算コードの再使用に対応するために、ＩＤＵは現在のプロセッサの状態を記録し、さらに命令を適正にデコードする。この演算コード再使用はＩＥＵには見えない。
【０１３２】
ＩＤＵは各ＣＩＳＣ型命令（例えば、ｉ４８６命令セットの）をデコードして各命令を幾つかのＲＩＳＣ型プロセッサ・ナノ命令に変換する。上述したように、複雑性や機能性いかんによって、各命令は０から４つのナノ命令に変換される。ＩＤＵは最高で１サイクルの割合で２個のＣＩＳＣ型命令をデコードして変換する。ＩＤＵの基本機能を要約すると以下の通りである。
＊半サイクルにつき１個のＣＩＳＣ型命令をデコードする。
＊第１フェーズで第１ＣＩＳＣ型命令をデコードする。
＊第１ＣＩＳＣ型命令のデコードされた結果を有効なものであるとして第２フェーズ終了まで保持する。
＊第２フェーズで第２ＣＩＳＣ型命令をデコードする。
＊第３フェーズで可能ならば、二つの命令の出力を結合する。
＊サイクル毎に４つのナノ命令から成るバケットを１個出力する。
【０１３３】
３．５命令デコード・ユニットのブロック図
ＩＤＵのブロック図は図１２に示す通りである。ＩＡＵからのアライメントされた命令は３２ビット幅（〔３１：０〕か４バイト）のバス１２０１上のＩＤＵに到達する。そのアライメントされた命令は命令デコーダ１２０２によって受け取られる。ＩＤＵ１２０２はＣＩＳＣ型からＲＩＳＣ型への変換を行なうためにアライメントされた命令の最初の４バイトを調べるだけである。
【０１３４】
命令デコーダ１２０２は１クロック・フェーズ（半サイクル）で作動する。アライメントされた命令はそのデコーダを通り、そしてそこを出るデコードされた情報は多重化され、バス１２０３を介して半サイクル遅延ラッチ１２０４にフェッチされる。従って、そのデコードされた情報は１フェーズ・パイプライン遅延と同じことを経験することになる。
【０１３５】
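After the half-cycle delay, the decoded information is sent via bus 1205 to MUX 1206 to determine the actual register codes used. At this stage of decoding, the decoded information is formatted into nano-instructions. The nano-instructions are then latched. Two complete nano-instruction buckets are latched per cycle. The latching of the two nano-instruction buckets is shown schematically as the first IR bucket 1208 and the second IR bucket 1210, respectively.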
【０１３６】
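The IDU attempts to combine buckets 1208 and 1210 into a single bucket 1212. A set of control gates 1214 performs the merging. The IDU first examines the type of each nano-instruction to determine whether the types can be combined. Note that the load (LD) operation of either latched instruction can be placed in the LD location 1216 of the single bucket 1212, the store (ST) operation of either latched instruction can be placed in the ST location of the single bucket, the A0 operation of either can be placed in the A0 location 1220, and any A0 or A1 operation can be placed in the A1 location 1222.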
【０１３７】
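The IDU treats instructions as wholes. If the IDU cannot pack two instructions into one bucket, it leaves one complete instruction behind. For example, if the first IR latch holds only an A0 operation and the second IR latch holds all four operations, the IDU does not take the A1 operation from the second IR latch and merge it with the A0 operation. The A0 operation is sent by itself, and the second IR latch's set of operations is transferred to the first IR latch and sent on the next phase, during which the second IR latch is reloaded. In other words, the operations stored in the first IR latch are always sent, and the operations stored in the second IR latch are combined with those of the first IR latch whenever possible. If the first and second IR cannot be combined, the preceding pipeline stages of the IDU and IAU must wait. The IDU can merge the operations of the first and second IR latches in the following situations: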
【０１３８】
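1. both use only A0, or
2. one uses only A0 and the other uses only A0, LD, and ST.
Based on the functionality described above and on basic logic design practice, those skilled in the art can readily design the combinational logic that generates the control signals the control gates need to merge the contents of the first and second IR latches.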
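The two merge conditions listed above can be written as a small Python predicate. This is only a sketch of the rule as stated, not the control-gate logic itself: operations are modeled as sets of the names "LD", "A0", "ST", and "A1", and "uses only A0, LD, and ST" is interpreted here, as an assumption, to mean any non-empty subset of those three operations.

```python
A0_ONLY = frozenset({"A0"})
PACKET0_OPS = frozenset({"LD", "A0", "ST"})


def can_merge(first_ops, second_ops):
    """Return True if the two IR latches may share one bucket."""
    first_ops = frozenset(first_ops)
    second_ops = frozenset(second_ops)
    # Condition 1: both instructions use only A0.
    both_a0 = first_ops == A0_ONLY and second_ops == A0_ONLY

    def fits_packet0(ops):
        # Non-empty and limited to LD, A0, ST (no A1 operation).
        return bool(ops) and ops <= PACKET0_OPS

    # Condition 2: one uses only A0, the other only A0, LD, and ST.
    one_a0 = (first_ops == A0_ONLY and fits_packet0(second_ops)) or (
        second_ops == A0_ONLY and fits_packet0(first_ops)
    )
    return both_a0 or one_a0
```

When the predicate is false, the first latch's instruction is sent alone, as in the A0-plus-four-operations example given earlier.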
【０１３９】
ＩＤＵがエミュレーションを要する命令のサブセットに属する命令を識別するとエミュレーション・モードになる。エミュレーション・モードになると、エミュレーション・モード制御信号（ＥＭＵＬ＿ＭＯＤＥ）がＩＤＵのデコーダに送られる。ＣＩＳＣ型命令の直接デコーディングは中断し、識別された命令に対応するマイクロコード・ルーチンがデコーディングのためＩＤＵに送られる。マイクロコード・ルーチンがサブセット命令のエミュレーションを終えると、ＩＤＵデコーダはＣＩＳＣ型命令のデコーディングを続けるため基本モードに戻る。基本的に、ＩＤＵは基本ＣＩＳＣ型命令及びマイクロコード命令を同様に取り扱う。演算コードの解釈だけが変わる。
【０１４０】
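Karnaugh maps of the default (basic) mode for one-byte and two-byte opcode instructions are shown in FIGS. 13 through 17. The numbers along the left side and top of each Karnaugh map are opcode bits. For example, the one-byte opcode coded hex 0F corresponds to the first row and eleventh column, which is the "2-byte escape" instruction.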
【０１４１】
図１３〜図１７のカルノー図で影をつけたグレーの命令ボックスは基本命令で、白のボックスはエミュレートされなければならない命令である。
【０１４２】
ＩＤＵの命令デコーダ１２０２のブロック図を図１８に示す。命令デコーダ１２０２はＣＩＳＣ型命令とマイクロコード・ルーチンをデコードするために用いられる複数のデコーダを含んでいる。
【０１４３】
タイプジェネレータ（ＴＹＰＥ＿ＧＥＮ）デコーダ１４０２は整列＿ＩＲバス上の完全にアライメントされた最初の命令を受取り、命令のタイプフィールドを識別するために命令を一つずつデコードする。
【０１４４】
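The identified type field corresponds to the nano-instruction operations described above in connection with the IDU. The type is represented by a 4-bit field, one bit for each operation in a bucket (load, ALU0, store, ALU1). The TYPE_GEN decoder 1402 specifies which of these four operations are needed to execute the instruction. Depending on the instruction received, anywhere from one to four of these operations are needed to carry out the CISC-type instruction.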
【０１４５】
例えば、１個のレジスタの内容をもう１個のレジスタの内容と合計する、加算演算はＡＬＵナノ命令を一回実行するだけでいい。一方、レジスタの内容と記憶場所の内容を足さなければならない命令では、ロード、ＡＬＵの動作と、続いて格納動作とを合わせて３つのナノ命令の動作が必要となる。（データはメモリから読み出され、レジスタに加算され、さらにメモリに格納されなければならない。）より複雑なＣＩＳＣ型命令では４つのナノ命令全てが必要になる。
【０１４６】
ＴＹＰＥ＿ＧＥＮデコーダ１４０２は３個のタイプデコーダを備えている。第１デコーダタイプ１は命令はＭｏｄＲ／Ｍバイトの前に１バイトの演算コードを有していると仮定し、その仮定に基づいてタイプを計算する。第２デコーダタイプ２はその命令には２バイトの演算コードがあると仮定する。第１バイトはエスケープバイトであるが、それは演算コードである第２バイトとＭｏｄＲ／Ｍバイトである第３バイトとの前にくる。第３デコーダタイプＦはその命令は浮動小数点命令であると仮定し、その仮定に基づき命令をデコードする。
【０１４７】
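The TYPE_GEN decoder has three 4-bit-wide type output buses (TYPE1, TYPE2, and TYPEF). Each bit corresponds to one of the four nano-instruction operations in a bucket. A particular type field specifies which nano-instructions are needed to execute the CISC-type instruction. For example, if all four bits are logic HIGH, the CISC-type instruction requires one load operation, one store operation, and two ALU operations.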
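As a rough illustration of the 4-bit type field, the following Python sketch expands a type value into the list of operations it enables. The bit ordering (most significant bit = load, then ALU0, store, ALU1) is an assumption made for this example; the text gives the four operations but not their bit positions, and `decode_type` is a hypothetical helper name.

```python
# Assumed bit order, most significant bit first.
OPERATIONS = ("LD", "A0", "ST", "A1")


def decode_type(type_bits):
    """List the nano-instruction operations enabled by a 4-bit type field."""
    if not 0 <= type_bits <= 0b1111:
        raise ValueError("type field must fit in 4 bits")
    enabled = []
    for position, name in enumerate(OPERATIONS):
        mask = 1 << (3 - position)  # bit 3 is LD, bit 0 is A1
        if type_bits & mask:
            enabled.append(name)
    return enabled
```

Under this ordering a register-to-register add would carry the value 0b0100 (ALU0 only), and an add to a memory location would carry 0b1110 (load, ALU0, store).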
【０１４８】
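The remaining decoders in FIG. 18 that include sections labeled 1, 2, and F decode on the assumption of a one-byte opcode, a two-byte opcode, and a floating-point instruction, respectively. The invalid results are simply not selected; a multiplexer selects the output of the correct decoder.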
【０１４９】
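Each of the two ALU operations (ALU0 and ALU1) has an 11-bit opcode field. The 11 bits consist of the 8 opcode bits and 3 opcode extension bits from the adjacent ModR/M byte. For most CISC-type instructions processed by the IDU, the opcode bits are copied directly into the nano-instruction operations. Some CISC-type instructions, however, require opcode substitution; in those cases the IDU does not merely pass the CISC opcode through to the instruction execution unit (IEU). This will be apparent to those skilled in the art, because the type and number of functional units in the IEU dictate whether opcode substitution within the IDU is required for a particular CISC-type instruction.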
【０１５０】
ＩＥＵがＡＬＵ動作を処理するためには、指定されたＡＬＵ動作を処理するのにどの機能ユニットが必要であるかという情報を受け取らなければならない。従って、ＩＤＵはＦ＿０ＵＮＩＴ１、Ｆ＿０ＵＮＩＴ２、及びＦ＿０ＵＮＩＴＦの３個のデコーダから成る機能ゼロユニット（Ｆ０ＵＮＩＴ）デコーダ１４１０を含んでいる。デコーダの出力はＡ０のＡＬＵ動作を処理するのにどの機能ユニットが必要かを表す複数バイトのフィールドである。Ａ１のＡＬＵ動作のためのデコーディングをする機能ユニットは同一ではあるが、別個のデコーダＦ＿１ユニット１４１２によって取り扱われる。
【０１５１】
ＣＩＳＣ型命令は演算コードによって暗示されるレジスタを用いてオペレーションを実行することが多い。例えば、多くの命令がアキュムレータとしてＡＸレジスタを用いるべきであると暗示している。従って、そのＣＩＳＣ型命令の演算コードに基づいたレジスタ・インデックスを生成するために定数ジェネレータ（ＣＳＴ＿ＧＥＮ）デコーダ１４１４が含まれている。ＣＳＴ＿ＧＥＮデコーダは特定の演算コードに基づいて、どのレジスタが暗示されているかを明らかにする。ナノ命令の正しいソースやデスティネーション・レジスタ・インデックスを生成するための多重化については図１９との関連において以下に説明する。
【０１５２】
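An additional 2-bit control signal, TempCount (TC), is input to the CST_GEN decoder. The TC control signal is a 2-bit counter that cycles through four temporary registers for the IEU to use as dummy registers. The temporary (or dummy) registers represent another kind of register value that the CST_GEN decoder can pass along, in addition to the implied registers. Because there are two ALU operations, each with two registers, the constant generator decoder passes along four constant fields. Each constant register bus is 20 bits wide, and each constant is 5 bits, so that it can select one of the 32 registers in the IEU.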
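The arithmetic in this paragraph (four 5-bit constants on a 20-bit bus, and a 2-bit counter cycling through four temporary registers) can be checked with a short Python sketch. The field order on the bus (field 0 in the least significant bits) and all function names are assumptions made for illustration.

```python
FIELD_BITS = 5   # each constant selects one of 32 registers
FIELD_COUNT = 4  # two ALU operations, two registers each


def pack_constants(fields):
    """Pack four 5-bit register indices into one 20-bit bus value."""
    if len(fields) != FIELD_COUNT:
        raise ValueError("expected four constant fields")
    word = 0
    for i, value in enumerate(fields):
        if not 0 <= value < (1 << FIELD_BITS):
            raise ValueError("register index must fit in 5 bits")
        word |= value << (FIELD_BITS * i)  # field 0 in the low bits (assumed)
    return word


def unpack_constants(word):
    """Split a 20-bit bus value back into its four 5-bit fields."""
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (FIELD_BITS * i)) & mask for i in range(FIELD_COUNT)]


def next_temp_count(tc):
    """Advance the 2-bit TempCount, wrapping after four temporaries."""
    return (tc + 1) & 0b11
```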
【０１５３】
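Next, the select generator (SEL_GEN) decoder, shown generally as block 1416, is described. The SEL_GEN decoder includes a flag need/modify (FG_NM) decoder 1418. The FG_NM decoder decodes for one-byte opcodes, two-byte opcodes, and floating-point instructions. The i486 instruction set, for example, has a total of six flags. An instruction may modify these flags, but the flags must be valid before execution of the instruction begins. The FG_NM decoder outputs two signals per flag: one bit indicates whether the flag is needed to execute the instruction, and the other indicates whether the instruction actually modifies the flag.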
【０１５４】
ＡＬＵ０とＡＬＵ１の動作に関するレジスタの無効情報はそれぞれ１４２０と１４２２で表したＩＮＶＤ１とＩＮＶＤ２のデコーダによってデコードされる。ＩＮＶＤ１及びＩＮＶＤ２デコーダはＳＥＬ＿ＧＥＮデコーダ１４１６の一部でもある。ＩＮＶＤ１及びＩＮＶＤ２のデコーダはＩＥＵ用の制御信号を生成する。これらの信号はＡＬＵレジスタを使用すべきか否かを示す。３個の考えられるレジスタ・インデックスは各ＡＬＵ動作により指定される。その一つはソース及び／またはデスティネーション・レジスタとして使用し、残りの二つはソース・レジスタ指定だけに限定される。動作にはどのレジスタが必要かを指定するために４ビットのフィールドが使われる。
【０１５５】
ＳＥＬ＿ＧＥＮデコーダ１４１６はさらにＣＩＳＣ命令にはレジスタ・フィールドのどれが必要かを示すＦＬＤ＿ＣＮＴデコーダ１４２４を含んでいる。ＦＬＤ＿ＣＮＴデコーダは二つのフィールドのどちらがソース・レジスタでどちらがデスティネーション・レジスタであるかを指定する。
【０１５６】
ナノ命令ジェネレータ（ＮＩＲ＿ＧＥＮ）デコーダは概ねブロック１４２６として示す通りである。データ・サイズ（ＤＡＴＡ＿ＳＺ）及びアドレス・サイズ（ＡＤＤＲ＿ＳＺ）の入力制御信号はシステムが動作しているデフォルトの状態に対応している。最終のアドレス並びにオペランドのサイズをデコードするためには、デフォルト・モードが分かっていなければならないし、プレフィックス（ＩＡＵとの関連において先に説明した）の存在も分かっていなければならない。ＥＭＵＬ＿ＭＯＤＥ制御信号はＮＩＲ＿ＧＥＮデコーダへ入力されるが、他のデコーダによっても使用される。
【０１５７】
エスケープ検出（ＥＳＣ＿ＤＥＴ）入力制御信号は、命令が２バイトの演算コードを有しているかを表すために、ＮＩＲ＿ＧＥＮデコーダに送り込まれる。さらに、エミュレーション命令が検出されるとメールボックス・レジスタのローディングを起こすために、選択演算コード拡張（ＳＥＬ＿ＯＰ＿ＥＸＴ）入力制御信号が使われる。
【０１５８】
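The floating-point register (FP_REG) input control signal passes the translated floating-point register index to the IDU. The i486 floating-point format, for example, has eight registers for floating-point numbers, but these registers are accessed like a stack. They are accessed using a stack-relative scheme in which register 0 is the top of the stack, register 1 is the next from the top, and so on. This register stack is emulated using eight linear registers with fixed indices. When an incoming instruction specifies register 0, a translation block (not shown) converts the stack-relative register index into the register index of a linear register in a well-known manner. This allows the IDU to keep track of which register is at the top of the stack.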
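The text leaves the stack-to-linear translation to "a well-known manner". One common way to do it, shown here as an assumption rather than as the patent's own circuit, is to add the stack-relative index to the linear index of the current top of stack, modulo eight. The function and parameter names are hypothetical.

```python
FP_REGISTER_COUNT = 8


def stack_to_linear(top, stack_index):
    """Translate a stack-relative index ST(i) into a fixed linear register index.

    top         -- linear index of the register currently at the top of stack
    stack_index -- i in ST(i); 0 is the top of the stack, 1 is the next, ...
    """
    if not 0 <= top < FP_REGISTER_COUNT:
        raise ValueError("top must be in 0..7")
    if not 0 <= stack_index < FP_REGISTER_COUNT:
        raise ValueError("stack index must be in 0..7")
    # Wrap around the eight-register file.
    return (top + stack_index) % FP_REGISTER_COUNT
```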
【０１５９】
システムがエミュレーション・モードに分岐すると、ＩＤＵはエミュレートされている命令についての情報を保存する。ＩＤＵは、デスティネーションのレジスタインデックス（ＥＭ＿ＲＤＥＳＴ）、ソース（ＥＭ＿ＲＤＥＳＴ２）、ベースインデックス情報（ＥＭ＿ＢＳＩＤＸ）に加えて、命令のデータサイズ（ＥＭ＿ＤＳＩＺＥ）及びアドレスサイズ（ＥＭ＿ＡＳＩＺＥ）も保存する。この保存された情報は命令を適切にエミュレートするためにマイクロコード・ルーチンによって使用される。例えば、加算命令のエミュレーションを考えてみよう。マイクロコード・ルーチンは、どのアドレス・サイズをエミュレートするかを知るために、加算命令のアドレス・サイズを確定するのにＥＭ＿ＡＳＩＺＥをチェックすることがある。
【０１６０】
ＮＩＲ＿ＧＥＮデコーダ１４２６はサイズデコーダ１４２８を含む。ＳＩＺＥデコーダ（即ち、ＳＩＺＥ１、ＳＩＺＥ２、ＳＩＺＥＦ）によって生成されたフィールドは命令のアドレス・サイズ、オペランド・サイズ、さらにイミディエト・データ・サイズを表す。１６ビットか３２ビットのアドレス・サイズ、８ビットか１６ビットか３２ビットかのオペランド・サイズ、８ビットか１６ビットか３２ビットかのイミディエト・データ・フィールド・サイズが各命令用に抽出される。
【０１６１】
もう一つのＮＩＲ＿ＧＥＮデコーダはロード情報（ＬＤ＿ＩＮＦ）デコーダ１４３０と呼ばれる。ＬＤ＿ＩＮＦデコーダはロード及び格納の動作に対応する情報をデコードする。ロード情報は効果的なアドレス計算を行なうために使用される。ＣＩＳＣ命令セットは通常多くの様々に異なるアドレス指定モードを支援するから、ロード情報のフィールド（ＬＤ＿ＩＮＦ１、ＬＤ＿ＩＮＦ２、ＬＤ＿ＩＮＦＦ）はＣＩＳＣ命令によってどのアドレス指定モードが使われているかを指定するために使用される。
【０１６２】
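The basic i486 addressing mode includes a segment field and an offset that are added together to determine the address. An index register can be specified, along with a scale for the index register (for example, when the index register selects an element in an array); the elements can be specified as 1, 2, 4, or 8 bytes long. The index register can therefore be scaled by 1, 2, 4, or 8 before it is added in to determine the address. The base and index are also specified by the LD_INF fields.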
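The address arithmetic described here can be written out as a small Python function. This is a sketch of the standard segment-plus-offset calculation with a scaled index, assuming 32-bit wrap-around; the function and parameter names are illustrative and not taken from the patent.

```python
def effective_address(segment_base, base=0, index=0, scale=1, disp=0):
    """Compute segment base + (base + index * scale + displacement)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    # The offset is formed first and wraps at 32 bits (assumed width).
    offset = (base + index * scale + disp) & 0xFFFFFFFF
    # The segment base is then added to the offset.
    return (segment_base + offset) & 0xFFFFFFFF
```

For an array of 4-byte elements, `index` holds the element number and `scale` is 4, so the index register never needs to be multiplied by a separate instruction.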
【０１６３】
ナノ命令演算コード（ＮＩＲ＿ＯＰＣ）デコーダ１４３２はＡ１オペレーション（パケット１）用の演算コードを転送する。デコードされたフィールド（ＮＩＲ＿ＯＰＣ１、ＮＩＲ＿ＯＰＣ２、ＮＩＲ＿ＯＰＣＦ）は第１命令バイト（８ビット）と第２バイトからの３つの拡張ビットから成る。
【０１６４】
雑演算コード（ＭＩＳＣ＿ＯＰＣ）デコーダ１４３４は、命令が浮動小数点であるか、及びロード命令が実際に存在しているかどうかを表す。ＭＩＳＣ＿ＯＰＣデコーダによって生成されたフィールドは、浮動データの変換が必要かを示すことになる。この情報は命令のフォーマットに係わらず簡単に抽出されるから、このデコーダは多重化する必要がない。
【０１６５】
パケット０のＡ０動作用の演算コードは演算コードデコーダ１４３６により指定される。Ａ０演算コードは通常ｉ４８６の入力演算コードから直接コピーされるが、命令によっては演算コードが別の演算コードで置き換えられることがある。（上記のように、ＮＩＲ＿ＧＥＮデコーダにより生成された信号の機能性はデコードされているＣＩＳＣ型命令セットに特有であり、よってＣＩＳＣ型命令セット並びに本発明のナノ命令フォーマットを検討すると当業者には明確になるはずである。）
ＥＸＴ＿ＣＯＤＥデコーダ１４４０はＭｏｄＲ／Ｍバイトから３ビットの演算コード拡張子を抽出する。
【０１６６】
ＩＮ＿ＯＲＤＥＲデコーダ１４４２は命令が「順序正しく」実行されなければならないかを確定するために命令をデコードする。これによって、全ての先行命令の実行終了までこの命令に対して何もしないようにＩＥＵに指示が出される。一度命令の実行が完了すると、それに続く命令の実行が開始される。
【０１６７】
制御フロージャンプサイズデコーダ１４４４はアドレスを指定するジャンプのディスプレースメント・サイズを表す。ＣＦ＿ＪＶ＿ＳＩＺＥとラベルをつけた、このフィールドはジャンプのアドレス・サイズを指定する。これはＣＩＳＣ型命令セットに使用されるアドレス指定方式のタイプに特有のものである。
【０１６８】
ＤＥＣ＿ＭＤＥＳＴ１４４６とラベルをつけた１ビットのデコーダは命令のデスティネーションがメモリ・アドレスであるか否かを表す。
【０１６９】
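Finally, the instruction decoder includes three register code decoders 1438 for selecting register codes (indices). The i486 instruction format encodes the indices of register fields at various places within the instruction. These indices are extracted by the RC decoders. The ModR/M byte also holds two register indices, which are used as the destination/source as specified by the opcode itself. The register code decoder 1438 generates three RC fields: RC1, RC2, and RC3. When the processor is not in emulation mode and the instruction is not a floating-point instruction, the fields are extracted as follows: RC1 = bits [2:0] of the ModR/M byte, RC2 = bits [5:3] of the ModR/M byte, and RC3 = bits [2:0] of the opcode. For floating-point instructions in basic (non-emulation) mode, RC1, RC2, and RC3 are assigned as follows: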
【０１７０】
ＲＣ１：ＳＴ（０）＝スタックの１番上
ＲＣ２：ＳＴ（１）＝スタックの２番目のアイテム＝スタックの上から２番目
ＲＣ３：ＳＴ（ｉ）＝スタックからｉ番目のアイテムで、そこにおいて、ｉは演算コードの中に指定されている。
エミュレーション・モードでは、ＲＣ１、ＲＣ２、ＲＣ３は以下のように割り当てられる。
【０１７１】
ＲＣ１：バイト３のビット〔４：０〕
ＲＣ２：バイト２のビット〔１：０〕及びバイト３のビット〔７：５〕
ＲＣ３：バイト２のビット〔６：１〕
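The basic-mode, non-floating-point extraction (RC1 = ModR/M bits [2:0], RC2 = ModR/M bits [5:3], RC3 = opcode bits [2:0]) can be sketched in Python as below. Only that one case is covered; the floating-point and emulation-mode assignments listed above are deliberately left out, and the function name is a hypothetical choice.

```python
def extract_rc_basic(opcode, modrm):
    """Extract (RC1, RC2, RC3) for a basic-mode, non-floating-point instruction."""
    rc1 = modrm & 0b111          # ModR/M bits [2:0]
    rc2 = (modrm >> 3) & 0b111   # ModR/M bits [5:3]
    rc3 = opcode & 0b111         # opcode bits [2:0]
    return rc1, rc2, rc3
```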
図１９はＣＳＴ＿ＧＥＮ、ＮＩＲ＿ＧＥＮ、ＳＥＬ＿ＧＥＮの各デコーダ（１４１４、１４３８、１４２４）の代表的なブロック並びに論理ゲート図を表すものである。この図１９は、ナノ命令オペレーションＡ０及びＡ１のソース並びにデスティネーション・レジスタ・インデックス、さらにロード命令のデスティネーション・レジスタ・インデックスを生成するために、１バイトの演算コード、２バイトの演算コード及び浮動小数点のデコードされた結果がどのように選択され、遅延させられ、さらに結合されるかを示す実施例であると理解されるべきものである。選択、遅延、さらに多重化の技法は、１バイトの演算コード、２バイトの演算コード及び浮動小数点の結果を個別に生成しない信号を除く、命令デコーダ１２０２により生成される全ての信号に適用される。さらに、言い換えれば、この実施例により生成された結果はアプリケーション専用であり、ｉ４８６命令を本発明のナノ命令フォーマットにデコードすることに適用される。しかしながら、これらの実施例を通してこれまでに説明してきた原理はＣＩＳＣ型からＲＩＳＣ型への命令のアライメント及びデコーディングに概ね適用可能である。
【０１７２】
先に説明したようにＣＳＴ＿ＧＥＮデコーダ１４１４はＣＳＴ１、ＣＳＴ２及びＣＳＴＦの３つの出力を生成し、その各々は４つの定数５ビットレジスタ・フィールド（計２０ビット）から成り立っている。ＳＥＬ＿ＧＥＮはもっと先の部分ＭＵＸ１５１２でのマルチプレクサの選択のためにレジスタ・フィールド制御信号（ＦＬＤ１、ＦＬＤ２、ＦＬＤ３）を生成する。ＣＳＴ１、ＣＳＴ２かＣＳＴＦの結果並びにＦＬＤ１、ＦＬＤ２、及びＦＬＤＦの結果の選択についてはマルチプレクサ・ブロック１５０２に概ね示す通りである。３ビットのＭＵＸセレクト線１５０４は、命令が１バイトの演算コード、２バイトの演算コード、或いは浮動小数点命令を有しているかどうかで結果を選択するために使用される。
【０１７３】
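A half-cycle pipeline delay latch 1506 is used to delay the results selected by multiplexer 1502 and the three register control fields RC1, RC2, and RC3. Each input to the half-cycle pipeline delay latch 1506 is sent to a pair of oppositely clocked latches 1508. The contents of the latches are selected by multiplexer 1510. This arrangement is similar to the half-cycle data delay 316 described above in connection with the IAU.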
【０１７４】
さらにその先の多重化のステージはブロック１５１２に示す通りである。マルチプレクサ１５０２によって選択された定数レジスタ・フィールドは、１５１４に概ね示すように、ｒｅｇｃ１からｒｅｇｃ４まで個々にラベルをつけた４つの個別のフィールドとしてマルチプレクサ１５１２へ入力される。ブロック１５１２への入力としても示したのは、演算コード及びＭｏｄＲ／Ｍバイトからの抽出レジスタフィールド、ＲＣ１、ＲＣ２及びＲＣ３である。概ね１５１８に示した動作Ａ１用のソース及びデスティネーションのレジスタ・インデックスａ１＿ｒｄ及びａ１＿ｒｓだけでなく、概ね１５１６に表わした動作Ａ０用のソース及びデスティネーションのレジスタ・インデックスａ０＿ｒｄ及びａ０＿ｒｓを生成するためにＦＬＤ制御信号１５２０の制御の下ブロック１５１２の論理により、ｒｅｇｃフィールド並びにＲＣフィールドが結合される。ロード命令のデスティネーション・レジスタ・インデックスである、インデックス１ｄ＿ｒｄもブロック１５１２で選択される。
【０１７５】
４．０デコードされた命令ＦＩＦＯ
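A block diagram of the decode FIFO (DFIFO) of the present invention is shown in FIG. 20A. The DFIFO holds four complete buckets, each of which contains four nano-instructions, two immediate data fields, and one displacement field. Each bucket corresponds to one level of pipeline register in the DFIFO. The buckets are generated in the IDU and pushed into the DFIFO during each cycle in which the IEU requests a new bucket. The nano-instructions in a bucket are divided into two groups, called packet 0 and packet 1. Packet 0 consists of load, ALU, and/or store operations, corresponding to one, two, or three nano-instructions. Packet 1 is only an ALU operation, corresponding to one nano-instruction. As a result of this division, a bucket can contain only two ALU operations, and only one of them can reference memory. If two successive instructions both require memory operands, they must be placed in separate buckets.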
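The DFIFO organization described in this paragraph can be modeled with a short Python sketch. It is a software analogy only: the `Bucket` and `DecodeFifo` names are hypothetical, nano-instructions are represented as plain strings, and the sketch assumes a bucket holds at most four nano-instructions, consistent with the bucket format described in section 3.3.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    """One DFIFO level: nano-instructions plus immediate/displacement data."""
    nano_instructions: List[str]
    imm0: Optional[int] = None  # immediate data for packet 0
    imm1: Optional[int] = None  # immediate data for packet 1
    disp: Optional[int] = None  # displacement, used by packet 0 only


class DecodeFifo:
    DEPTH = 4  # the DFIFO holds four complete buckets

    def __init__(self):
        self._levels = deque()

    def push(self, bucket):
        """Add a bucket from the IDU; return False if the FIFO is full."""
        if len(bucket.nano_instructions) > 4:
            raise ValueError("a bucket holds at most four nano-instructions")
        if len(self._levels) >= self.DEPTH:
            return False
        self._levels.append(bucket)
        return True

    def pop(self):
        """Hand the oldest bucket to the IEU, or None if the FIFO is empty."""
        return self._levels.popleft() if self._levels else None
```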
【０１７６】
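As can be seen from FIG. 20B, there is a fair amount of general information associated with each packet and with the bucket as a whole. This information is stored in the general information FIFO. By default, the four nano-instructions in a bucket are executed in order from NIR0 to NIR3. One of the bucket's general information bits can be set to indicate that NIR3 must be executed before NIR0 through NIR2. This feature makes it easier to combine consecutive instructions into a single bucket, because their order no longer affects their ability to meet the bucket requirements.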
【０１７７】
図２０Ｃはバケット０〜バケット４のイミディエト・データ及びディスプレースメントＦＩＦＯを示す。ＩＭＭ０はパケット０に対応するイミディエト・データを表し、ＩＭＭ１はパケット１に対応するイミディエト・データを表している。ＤＩＳＰはパケット０に対応するディスプレースメントを表わしている。ＤＩＳＰフィールドはアドレス計算の一部としてしか使用されないから、パケット１はＤＩＳＰ情報を使用しない。
【０１７８】
上述の３タイプのナノ命令の具体例を図２１に示す。これらの表は各バケットの内容についての情報を提供するものである。
【０１７９】
本発明に基づく様々な実施例を先に記述してきたが、あくまで例として提示したものであり、それにより限定されるものではないことが理解されるはずである。従って、本発明の広さ並びに範囲については上記の例としての実施例によって制限されるべきものではなく、特許請求の範囲及びそれに相当するものに従ってのみ定められるべきことである。
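[Brief description of the drawings]
FIG. 1 is a block diagram of the instruction prefetch buffer of the present invention.
FIG. 2 is a block diagram of the instruction alignment unit of the present invention.
FIG. 3 is a representative flowchart of the instruction extraction and alignment method of the IAU of the present invention.
FIG. 4 is a simplified timing diagram associated with the block diagram of FIG. 2 and the flowchart of FIG. 3.
FIG. 5 is a block diagram of the STACK of the present invention.
FIG. 6 is a block diagram of the next instruction detector (NID) of the present invention.
FIG. 7 is a block diagram of the remaining next instruction detector (RNID) of the present invention.
FIG. 8 is a block diagram of the immediate data and displacement detector (IDDD) of the present invention.
FIG. 9 is a block diagram of the prefix detector (PD) of the present invention.
FIG. 10 is a block diagram of the prefix number (PRFX_NO) decoder of the present invention.
FIG. 11 is a block diagram of a nano-instruction bucket of the present invention.
FIG. 12 is a representative block diagram of the instruction decode unit (IDU) of the present invention.
FIG. 13 is a diagram showing an instruction bit map of the present invention.
FIG. 14 is a diagram showing an instruction bit map of the present invention.
FIG. 15 is a diagram showing an instruction bit map of the present invention.
FIG. 16 is a diagram showing an instruction bit map of the present invention.
FIG. 17 is a diagram showing an instruction bit map of the present invention.
FIG. 18 is a block diagram showing an example of a section of the instruction decoder of the IDU of the present invention.
FIG. 19 is a representative block and logic diagram of a set of decoders of the instruction decoder shown in FIG. 18.
FIG. 20 is a conceptual block diagram of the decode FIFO of the present invention.
FIG. 21 is a diagram showing examples of the field formats of the nano-instructions of the present invention.
FIG. 22 is a diagram showing the data structure format of a conventional CISC-type instruction.
[0001]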
TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to superscalar RISC-type microprocessors, and more particularly to a CISC-to-RISC microprocessor instruction alignment unit and decode unit that allow complex instructions to be executed on RISC-based hardware.
[0002]
Prior Art and Problems to Be Solved by the Invention
Cross-reference to related applications
The following are co-pending applications of the same assignee.
U.S. Application No. 07/802,816, filed December 6, 1992 (Attorney Docket No. SP024), entitled "A ROM with RAM Cell and Cyclic Redundancy Check Circuit"; U.S. Application No. 07/817,810, filed January 8, 1992 (Attorney Docket No. SP015), entitled "High Performance RISC Microprocessor Architecture"; and U.S. Application No. 07/817,809, filed January 8, 1992 (Attorney Docket No. SP021), entitled "Extensible RISC Microprocessor Architecture".
[0003]
The disclosure of the above application is incorporated herein by reference.
[0004]
Related art
Complex instruction set computers (CISC type computers) that use variable length instructions all face the problem of determining the length of each instruction that occurs in the instruction stream. Instructions are packed into memory as data consisting of consecutive bytes. Thus, given the address of an instruction, it is possible to determine the start address of the next instruction if the length of the first instruction is known.
[0005]
In conventional processors, determining this length does not significantly affect performance compared with other stages of instruction stream processing, such as the actual execution of each instruction. As a result, fairly simple circuits are typically used. Superscalar reduced instruction set computers (RISC computers), on the other hand, can process instructions much faster, but instructions must be fetched from memory much faster in order to execute multiple instructions in parallel. This limiting factor, imposed by the rate at which instructions can be fetched from memory, is called the Flynn bottleneck.
[0006]
The task of determining the length of each instruction and extracting it from the instruction stream is performed by a functional unit called an instruction alignment unit (IAU). This block must include decoder logic to determine the length of the instruction and a shifter to align the instruction data to the decoder logic.
[0007]
In the Intel 80386 microprocessor, the first byte of an instruction implies a great deal about the overall instruction length, but additional bytes may need to be checked before the final length is known. Moreover, those additional bytes may themselves specify further additional bytes. It is therefore extremely difficult to determine the length of an x86 instruction immediately, because the process is inherently sequential.
[0008]
Based on the information provided in the i486 Programmer's Reference Guide, some conclusions can be drawn regarding the alignment units employed in the i486. The i486 IAU is designed to look only at the first few bytes of an instruction. If these bytes do not sufficiently specify their length, these initial bytes are extracted and the process is repeated for the remaining bytes. Each iteration of this process requires a full cycle. Thus, in the worst case, it may take several cycles for the instructions to be fully aligned.
[0009]
The cases in which the i486 IAU requires additional cycles include the use of prefixes or extended (two-byte) opcodes, both of which are common in i486 programs. In addition, complex instructions may also include displacement and immediate data, and the i486 requires additional time to extract this data.
[0010]
An example of the format of a CISC processor instruction is shown in FIG. 22. This example shows the possible bytes of a variable-length i486 CISC-type instruction. Instructions are stored in memory on byte boundaries. An instruction is at least 1 byte long and at most 15 bytes long, including prefixes. The total length of the instruction is determined by the prefix, opcode, ModR/M, and SIB bytes.
[0011]
[Means for Solving the Problems]
The present invention is a subsystem and method for a superscalar reduced instruction set computer (RISC) microprocessor designed to emulate a complex instruction set computer (CISC) such as an Intel 80x86 microprocessor, or other CISC-type processors.
[0012]
There are two basic steps in the CISC-to-RISC-type translation process of the present invention. CISC-type instructions must first be extracted from the instruction stream and then decoded to generate nanoinstructions that can be processed by a RISC-type processor. These steps are performed by an instruction alignment unit (IAU) and an instruction decode unit (IDU), respectively.
[0013]
The IAU extracts individual CISC-type instructions from the instruction stream by examining the oldest 23 bytes of instruction data. The IAU extracts eight consecutive bytes starting at any byte in the bottom line of the instruction FIFO. During each clock phase, the IAU determines the length of the current instruction and uses this information to control two shifters, which shift out the current instruction and leave the next instruction in the stream. As a result, the IAU outputs an aligned instruction during each clock phase, for a peak rate of two instructions per cycle. Exceptions to this best-case performance are described in Sections 2.0 and 2.1 below.
[0014]
After the CISC-type instructions have been extracted from memory, the IDU translates these aligned instructions into equivalent sequences of RISC-type instructions called nano-instructions. The IDU examines each aligned instruction as it is output from the IAU and decodes it to determine various factors, such as the number and type of nano-instructions required, the size of the data operands, and whether a memory access is required to complete the aligned instruction. Simple instructions are translated directly into nano-instructions by decoder hardware, while more complex CISC-type instructions are emulated by subroutines written in a special instruction set, called microcode routines, which are in turn decoded into nano-instructions. This information is collected for two instructions in one complete cycle and then combined to form an instruction bucket containing the nano-instructions that correspond to both source instructions. The bucket is then transferred to an instruction execution unit (IEU) for execution by the RISC-type processor. Execution of nano-instruction buckets is outside the scope of the present invention.
[0015]
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
[0016]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Table of Contents
1.0 Instruction Fetch Unit
2.0 Outline of Instruction Alignment Unit
2.1 Block Diagram of Instruction Alignment Unit
3.0 Outline of Instruction Decode Unit
3.1 Microcode Dispatch Logic
3.2 Mailbox
3.3 Nano-Instruction Format
3.4 Special Instructions
3.5 Block Diagram of Instruction Decode Unit
4.0 Decoded Instruction FIFO
Detailed Description of the Preferred Embodiment
The basic concepts described in this section are discussed in more detail in the following references: "Superscalar Microprocessor Design", by Mike Johnson, published in 1991 by Prentice-Hall, Inc., Englewood Cliffs, NJ; "Computer Architecture - A Quantitative Approach", by John L. Hennessy et al., published in 1990 by Morgan Kaufmann Publishers, San Mateo, California; and the "i486 Microprocessor Programmer's Reference Manual" and "i486 Microprocessor Hardware Reference Manual", published in 1990 by Intel Corporation, Santa Clara, California, with order numbers 2406 and 2406 respectively. The disclosures of these publications are incorporated herein by reference.
[0017]
1.0 Instruction Fetch Unit
The instruction fetch unit (IFU) of the present invention fetches instruction bytes from an instruction stream stored in an instruction memory, instruction cache, or the like, and supplies them to a decoder unit for execution. The instructions to be aligned by the instruction alignment unit are thus provided by the IFU. FIG. 1 is a block diagram of the three instruction prefetch buffers 200 in the IFU: a main instruction buffer (MBUF) 204, an emulation instruction buffer (EBUF) 202, and a target instruction buffer (TBUF) 206. The instruction prefetch buffers can load 128 bits (16 bytes) of the instruction stream from the instruction cache in a single cycle. This data is held in one of the three buffers for use by the IAU.
[0018]
During normal program execution, MBUF 204 is used to supply instruction bytes to the IAU. When a conditional control flow instruction (that is, a conditional branch instruction) is encountered, the instructions at the target address of the branch are stored in TBUF 206 while execution from MBUF 204 continues. Once the branch is resolved, TBUF 206 is discarded if the branch is not taken, or transferred to the MBUF if the branch is taken. In either case, execution continues from the MBUF. The operation of EBUF 202 is slightly different. Emulation mode is entered either by an emulated instruction or by an exception, whereupon instruction fetching and execution are transferred to EBUF 202. (Both emulation mode and exception handling are described in detail below.) Execution continues from EBUF 202 as long as the processor is in emulation mode. At the end of the emulation routine, execution continues with the instruction data remaining in MBUF 204. This eliminates the need to fetch the main instruction data again after the emulation routine has executed.
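The buffer hand-offs described in this paragraph can be summarized in a small Python model. It is a behavioral sketch only: the class and method names are hypothetical, each buffer is modeled as a single opaque value rather than 16 bytes of instruction data, and timing is ignored.

```python
class PrefetchBuffers:
    """Toy model of the MBUF, TBUF, and EBUF roles in the IFU."""

    def __init__(self, main_data):
        self.mbuf = main_data  # main instruction stream
        self.tbuf = None       # prefetched branch-target stream
        self.ebuf = None       # microcode (emulation) stream

    def prefetch_target(self, target_data):
        # A conditional branch was seen: hold its target in TBUF
        # while execution from MBUF continues.
        self.tbuf = target_data

    def resolve_branch(self, taken):
        # Taken: TBUF becomes the new MBUF. Not taken: TBUF is discarded.
        if taken and self.tbuf is not None:
            self.mbuf = self.tbuf
        self.tbuf = None

    def enter_emulation(self, microcode_data):
        # MBUF is left untouched so it need not be refetched later.
        self.ebuf = microcode_data

    def leave_emulation(self):
        self.ebuf = None

    def active(self):
        """Data currently supplied to the IAU."""
        return self.ebuf if self.ebuf is not None else self.mbuf
```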
[0019]
2.0 Outline of Instruction Alignment Unit
The instruction alignment unit of the present invention applies the RISC strategy of making the common case fast, using the superior per-cycle instruction throughput of a superscalar processor.
[0020]
In the context of the present invention, the term "align" means to position the bytes of an instruction so that they can be distinguished from adjacent bytes in the instruction stream for later decoding. The IAU distinguishes the end of the current instruction from the beginning of the next instruction by determining the number of bytes in the current instruction. The IAU then aligns the current instruction such that the least significant byte put into the IDU is the first byte of the current instruction. The bytes may be provided to the IDU in various different orders.
[0021]
The IAU subsystem of the present invention can align most common instructions at a rate of two instructions per cycle at any clock rate, and can align most other instructions at the same rate at reduced clock speeds. Instructions containing prefixes require an extra half cycle for alignment. Immediate data and displacement fields are extracted in parallel and therefore require no extra time.
[0022]
Furthermore, the IAU's worst-case alignment time is only 2.0 cycles per instruction, which is less than the time required to align many common instructions in conventional CISC-type processors. The worst case occurs when the instruction has one or more prefixes (adding a half cycle to alignment), the instruction is from the set that requires a full cycle to determine its length, and the instruction (excluding prefixes) is longer than 8 bytes (requiring an extra half cycle, for a total of two full cycles).
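The cycle counts in this paragraph can be tallied with a short Python sketch. It assumes, as a simplification, that the three penalties simply add: half a cycle as the base alignment time for simple instructions, a full cycle for instructions whose length takes a full cycle to determine, half a cycle for any prefixes, and half a cycle for instructions longer than 8 bytes. The function and parameter names are hypothetical.

```python
def alignment_cycles(has_prefix, needs_full_cycle_length, length_without_prefix):
    """Estimate the alignment time of one instruction, in clock cycles."""
    # Simple instructions align in one phase (half a cycle); the
    # others need a full cycle to determine their length.
    cycles = 1.0 if needs_full_cycle_length else 0.5
    if has_prefix:
        cycles += 0.5  # prefixes are stripped in an extra half cycle
    if length_without_prefix > 8:
        cycles += 0.5  # more than 8 bytes needs a second shift
    return cycles
```

With all three penalties present the sketch gives 2.0 cycles, matching the stated worst case, and with none it gives the half-cycle best case behind the peak rate of two instructions per cycle.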
[0023]
Several structural features provide such performance. First, the IAU is designed to perform a full alignment operation for each phase of the clock by using alternating phase latches and multiplexers in the alignment circuit. Second, decode logic divides CISC-type instructions into two categories based on the number of bits that must be taken into account to determine the length of each instruction. That is, instructions of a length specified by a few bits are aligned in a single phase (half cycle), while other instructions typically require one more clock cycle. Finally, the IAU can extract up to 8 bytes from the instruction stream with a single shift. This makes it possible to align long instructions (up to 15 bytes in i486) with a few shift instructions, and to align most instructions with only one shift.
[0024]
The IAU performs the following tasks in order to decode CISC-type instructions quickly and accurately:
- Detect the presence and length of prefix bytes.
- Separate the opcode, ModR/M, and SIB (scale, index, base) bytes.
- Detect the instruction length (which indicates the location of the next instruction).
- Send the following information to the instruction decode unit (IDU):
  - The opcode: 8 bits plus 3 optional extension bits. For a two-byte opcode, the first byte is always 0F hex, so the second byte is sent as the opcode.
  - The ModR/M byte, the SIB byte, the displacement, and the immediate data.
[0025]
  - Information on the number and type of prefixes.
The opcode byte specifies the operation performed by the instruction. The ModR/M byte specifies the address format to be used when the instruction refers to a memory operand. The ModR/M byte may also refer to a second addressing byte, the SIB (scale, index, base) byte, which may be needed to fully specify the addressing format.
[0026]
2.1 Block Diagram of Instruction Alignment Unit
A block diagram of the IAU is shown in FIG. 2. The diagram is divided into two parts: a main data bus 302 (enclosed by a dashed line) and a predecoder 304 (also enclosed by a dashed line). Instruction shifting and extraction take place on the main data bus 302, while length determination and data bus control are handled by the predecoder 304.
[0027]
The main data bus 302 is composed of several shifters, latches, and multiplexers. The extraction shifter 306 receives instruction data, organized in bytes, from the IFU. Two buses (shown generally at 303), IFI0b_bus [127:0] and IFI1b_bus [55:0], carry the instruction data output of the IFU. The IFU updates this instruction information in response to a request from the IAU on the advance buffer request (ADVBUFREQ) line 308. The generation of the ADVBUFREQ signal is described below. Eight bytes of data corresponding to the current instruction are output from the extraction shifter and sent to the alignment shifter 310 on bus 307. The alignment shifter holds a total of 16 bytes of instruction data and can shift up to 8 bytes per phase. If prefixes are detected, the alignment shifter separates them from the instruction by shifting them out. The alignment shifter is also used to align the instruction to its lowest-order byte and to shift out the entire instruction once it has been aligned.
[0028]
The eight bytes are also sent via bus 309 to an immediate data shifter (IMM shifter 312) and a displacement shifter (DISP shifter 314). IMM shifter 312 extracts immediate data from the current instruction, and DISP shifter 314 extracts displacement data. The data to these two shifters is delayed by a half-cycle delay element 316 to maintain synchronization with the aligned instruction.
[0029]
Alignment shifter 310 outputs the next aligned instruction on bus 311 to one of two alignment_IR latches, 318 or 320. These latches operate on opposite phases of the system clock, so that two instructions are latched per cycle. The alignment_IR latches 318 and 320 output the aligned instructions on two output buses 321. During the phase in which one of the latches receives a new value, the output of the other latch (the aligned current instruction) is selected by a multiplexer (MUX 322). MUX 322 outputs the aligned current instruction on aligned instruction bus 323. Output 323 is the primary output of the IAU. This output is used by predecoder 304 to determine the length of the current instruction, and it is fed back to alignment shifter 310 as the data from which the next instruction is extracted. The aligned current instruction is fed back to the alignment shifter 310 via bus 325, stack 334, and bus 305. Bus 305 also sends information about the aligned current instruction to the half-cycle data delay 316.
[0030]
The IMM shifter 312 and DISP shifter 314 shift the immediate data and displacement data, respectively; they likewise require a total of 16 bytes to shift. The half-cycle data delay 316 outputs the instruction bytes to the shifters on one bus. IMM shifter 312 outputs the immediate data corresponding to the current instruction on immediate data bus 340. DISP shifter 314 outputs the displacement data corresponding to the current instruction on displacement data bus 342.
[0031]
The predecoder 304 comprises three decoder blocks: a next instruction detector (NID) 324, an immediate data and displacement detector (IDDD) 326, and a prefix detector (PD) 328. NID and PD control the alignment and extraction shifters, and IDDD controls the IMM shifter 312 and DISP shifter 314.
[0032]
PD 328 is designed to detect the presence of a prefix in an instruction. PD 328 determines the number of prefixes present and provides shift control signals to alignment shifter 310 and counter shifter 332 via line 331, MUX 330, and line 333, so that the prefixes are extracted from the instruction stream in the next half cycle. In addition, PD 328 decodes the prefix itself and provides this prefix information on output line 329 to the IDU.
[0033]
The basic architecture of PD 328 consists of four identical detectors (to detect up to four prefixes) and a second block of logic to decode the prefix itself. The CISC-type format defines the order in which prefixes occur, but the present invention checks for the presence of any prefix in each of the first four byte positions. Further, the function of detecting the presence of a prefix and the function of decoding the prefix are kept separate in order to take advantage of the relaxed speed requirement of the decoder. The architecture of PD 328 is described in further detail below.
[0034]
IDDD 326 is designed to extract immediate data and displacement data from each instruction. IDDD 326 always attempts to extract these two fields, regardless of whether they are present. IDDD 326 controls IMM shifter 312 and DISP shifter 314 on a pair of lines 344 and 346, respectively. The IDU takes half a cycle to process an aligned instruction and has no use for the immediate and displacement data during that time. The immediate data and the displacement data are therefore delayed by the half-cycle data delay 316, which gives IDDD 326 more time to calculate the shift amounts. Unlike NID 324, which performs decoding and shifting in the same phase, the shifting controlled by IDDD 326 occurs in the following phase.
[0035]
NID 324 is the heart of the predecoder. Once the prefix is removed, NID 324 determines the length of each instruction. The NID 324 controls the alignment shifter 310 and the counter shifter 332 via the control line 327, the MUX 330, and the line 333. The NID is composed of two sub-blocks, a subset next instruction detector (SNID 702) and a remaining next instruction detector (RNID 704). The RNID 704 will be described with reference to FIGS.
[0036]
As the name implies, SNID 702 determines the length of a subset of the CISC-type instruction set. The instructions in the subset are aligned by SNID at a rate of two instructions per cycle.
[0037]
The RNID 704 determines the length of all remaining instructions and requires another half cycle, so that the total decode time is one complete cycle. The determination of whether an instruction is in the subset is made by the SNID, and this signal is used in the NID to select either the SNID or the RNID output.
[0038]
When a new instruction is aligned, it is initially assumed to be in the subset, and the output of the SNID is selected. If the SNID determines (during this same half cycle) that the instruction must be processed by the RNID, a signal is asserted and the IAU loops over the current instruction, holding it for another half cycle. During this second half cycle, the output of the RNID is selected and the instruction is properly aligned.
[0039]
This architecture of the NID has several advantages. One, already described above, is that if the cycle time is long enough, the selection between SNID and RNID can be performed in one half cycle, so that all instructions can be aligned in a single phase (excluding the time needed to extract prefixes and instructions longer than 8 bytes). This can improve per-cycle performance at low cycle rates without additional hardware.
[0040]
A second advantage is that the selection signal can be used as an alignment cancellation signal. This is because the select signal causes the IAU to ignore the SNID shift output and hold the current instruction for another half cycle. The SNID can be designed to predict the combination or length of a particular instruction and subsequently generate a cancellation signal if the prediction is incorrect. For example, the method can be used to align multiple instructions in one half cycle, which further improves performance.
[0041]
The IAU also includes a counter shifter 332. The counter shifter 332 is used to determine the shift amount of the extract shifter 306 via line 335 and to request an additional CISC type instruction byte from the IFU using the ADVBUFREQ line 308. The function of the counter shifter 332 may be better understood by examining the following IAU operation flowchart and example timing diagram.
[0042]
FIG. 3 is a schematic flowchart of instruction byte extraction and alignment performed by the IAU of the present invention. As shown in step 402, when new data is input to the lowest line 205 of the IFU's MBUF 204 (referred to as BUCKET_#0), the extraction shifter 306 extracts 8 bytes starting from the first instruction. The eight instruction bytes are passed to the alignment_IR latches 318 and 320, bypassing the alignment shifter 310, as shown in step 404. As shown in step 406, the IAU next waits for the next clock phase while holding the aligned instruction in the alignment_IR latch.
[0043]
During the next clock phase, the IAU outputs the aligned instruction to the IDU, stack 334, IDDD 326, NID 324, PD 328, and half-cycle data delay 316. Immediate data and displacement information are then output to the IDU on buses 340 and 342, respectively. This data, if present, corresponds to the instruction that was aligned in the previous phase. These operations are generally as shown in step 408 of FIG.
[0044]
Next, the IAU enters conditional 409 to determine whether a prefix is present. This determination is made by PD (prefix detector) 328. If one or more prefixes are detected by the PD, as indicated by the "Yes" arrow exiting conditional 409, the process proceeds to step 410, where the IAU selects the output of the PD at MUX 330. As shown in step 412, the decoded prefix information is latched so that it can be sent to the IDU in the next phase together with the corresponding aligned instruction. If no prefix instruction byte is detected, as indicated by the "No" arrow exiting conditional 409, MUX 330 selects the output of NID 324, as shown in step 414.
[0045]
Once step 412 or 414 is complete, the current output of the counter shifter 332 is used to control the extraction shifter 306 so that it provides the next 8 bytes of instruction data to the alignment shifter 310 and the half-cycle data delay 316, as shown in block 416. Next, the IAU uses the output of MUX 330 as a variable called shift_A. This variable is used to control the alignment shifter 310 to align the next instruction. Shift_A is also added to the current extraction shift amount (called BUF_count) to calculate the shift amount to use during the next phase. This addition is performed in the counter shifter 332, as shown in step 418.
[0046]
The next operational step performed by the IAU is to latch the output of the alignment shifter in the alignment_IR latch, as shown in step 420. As shown in step 422, the positions of the immediate data and the displacement data are calculated in IDDD 326, and this shift amount is delayed by a half cycle. Next, as shown in step 424, the IAU uses the shift amount calculated during the previous half cycle to shift the data currently being input to the IMM shifter 312 and DISP shifter 314. Finally, the process repeats from step 406, waiting for the next clock phase. Steps 408 to 424 are repeated for the remaining instruction bytes in the instruction stream.
[0047]
FIG. 4 is a timing diagram associated with the IAU of FIG. Two instruction buckets are displayed at the top of FIG. These two instruction buckets, labeled bucket_#0 and bucket_#1, each consist of 16 instruction bytes supplied by the IFU (from an instruction memory not shown) to the IAU shown in FIG. Instruction alignment is performed from the right of bucket_#0 (i.e., the bottom bucket). In this embodiment, bucket_#0 and bucket_#1 are the bottom two buckets of the IFU MBUF 204. Other arrangements are possible.
[0048]
In the present example, the first three instructions sent to the IAU are OP0, OP1, and OP2, and their lengths are 5, 3, and 11 bytes, respectively. Note that only the first 8 bytes of instruction OP2 fit in bucket_#0. The remaining three bytes are latched at the beginning of bucket_#1. To simplify the example, assume that these three instructions have no prefix bytes. If a prefix is detected, one additional phase is needed to align that instruction.
[0049]
Instructions can start at any position in the bucket. Instructions are extracted up to 8 bytes at a time starting from any location in the bottom bucket. The IAU examines both buckets in order to handle an instruction that extends into the second bucket, such as OP2 in this example.
[0050]
Trace "1" in this timing diagram is one of the two system clocks, CLK0. In the present embodiment, this system clock has a half cycle of 6 nanoseconds. CLK0, which has the opposite phase as compared to another system clock CLK1, rises at T6 and falls at T0. In that case, T0 is the rising edge of CLK1 and T6 is the rising edge of CLK0. For clarity, the three main clock phases are labeled F1, F2, and F3 in FIG.
[0051]
Traces "2" and "3" in this timing chart represent instruction data on input buses IFI1B and IFI0B. As shown at 502, the new bucket_#0 becomes available on IFI0B at the start of F1. Shortly afterward, the first eight bytes starting at OP0 (B#0; 7-0) are extracted at 504 by extraction shifter 306, and bytes 7-0 of bucket_#0 are shown as valid. The timing of the extraction shifter is as shown in trace "4".
[0052]
When the decoding of the instruction stream from the CISC type to the RISC type starts, the counter shifter 332 controls the extraction shifter 306 to extract the first 8 bytes from bucket_#0. The counter shifter signals the extraction shifter to shift and extract more bytes from the bucket as the instruction alignment progresses. When the instruction bytes in bucket_#0 are exhausted, the contents of bucket_#1 are shifted into bucket_#0 and bucket_#1 is replenished from the instruction stream. After the first eight bytes are extracted, the extraction shifter extracts and shifts bytes under the control of the counter shifter on line 335 based on the instruction length, prefix length, and previous shift information.
[0053]
In this example, however, the counter shifter initially signals the extraction shifter to shift by zero, because the first instruction is already aligned. Thus, the extraction shifter shifts out the first 8 bytes of the first instruction to the alignment shifter 310. The timing of the signals of the alignment shifter is as shown in trace "5" of the timing diagram. These 8 bytes are valid in the alignment shifter during the F1 time period indicated by reference numeral 506.
[0054]
The first 8 bytes of bucket_#0 are stored in one of the two alignment_IR latches, 318 or 320 (as shown in traces "6" and "7" in FIG. 4), bypassing the alignment shifter. Based on the timing of clock signals CLK0 and CLK1, these alignment_IR latches alternately receive instruction bytes. Alignment_IR0 318 is the latch for clock signal CLK0, that is, it latches when clock signal CLK0 is high. Alignment_IR1 320 is the latch for clock signal CLK1, which latches when clock signal CLK1 is high. As indicated by reference numeral 508 near the end of F1, the first 8 bytes become valid at alignment_IR0 before the end of the phase of the first clock signal CLK0.
[0055]
The MUX 322 selects the latch that performed the latch in the previous phase. In this embodiment, therefore, MUX 322 outputs the first 8 bytes of OP0 during the second full phase, F2.
[0056]
Then, the first 8 bytes of OP0 flow to NID 324 and stack 334. The NID 324 detects that the first instruction is 5 bytes long and sends this information back to the alignment shifter and counter shifter via line 327, MUX 330, and line 333. As described above, at the same time, the first eight bytes flow through the stack and are fed back to the alignment shifter. As a result, the alignment shifter receives instruction bytes both from the extraction shifter and, indirectly, from itself. This is because the alignment shifter requires 16 bytes of input in order to shift by up to 8 bytes per cycle. When the alignment shifter shifts X bytes to the right, it discards the least significant X bytes and passes the next 8 bytes of data to latches 318 and 320. In this case, stack 334 supplies bytes 0-7 to alignment shifter 310.
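The 16-byte window described above can be illustrated with a minimal Python sketch. The function name `alignment_shift` and the use of lists indexed from the least significant byte are assumptions made for illustration; the patent describes hardware, not software.

```python
def alignment_shift(fed_back, extracted, x):
    """Hypothetical model of alignment shifter 310.

    fed_back  -- 8 bytes returned through the stack (low-order half)
    extracted -- next 8 bytes from the extraction shifter (high-order half)
    x         -- shift amount in bytes, 0 to 8
    Index 0 of each list is the least significant byte.
    """
    if len(fed_back) != 8 or len(extracted) != 8:
        raise ValueError("each input must supply exactly 8 bytes")
    if not 0 <= x <= 8:
        raise ValueError("shift amount must be between 0 and 8 bytes")
    window = list(fed_back) + list(extracted)  # 16 consecutive bytes
    # Discard the x least significant bytes and pass on the next 8.
    return window[x:x + 8]
```

With bytes 0-7 fed back, bytes 8-15 newly extracted, and a shift of 5, the result is bytes 5-12, which matches the OP0 example in the timing diagram.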
[0057]
The bypass 336 surrounding the alignment shifter is used in the initial case where the extraction shifter extracts the first instruction from the instruction stream. With the exception of the prefix byte, the first instruction is aligned so that the alignment shifter does not need to shift in the initial case.
[0058]
During period F2 of the timing diagram, the extraction shifter shifts out the eight bytes 15-8 of bucket_#0. See 510 in FIG. These bytes are sent to the alignment shifter, which now has a total of 16 consecutive bytes to process. The alignment shifter examines the output of the extraction shifter as well as the valid outputs of latches 318 and 320 during F2.
[0059]
Near the end of F2, the alignment shifter shifts bytes 12-5 of bucket_#0 to the output based on the signal from the NID. The signal from the NID instructs the alignment shifter to shift right by 5 bytes. Thereby, the least significant 5 bytes corresponding to the instruction OP0 are discarded. See shift_5_byte signal 512 in trace "8" of the timing diagram. The remaining eight bytes of instruction data, bytes 12-5, then flow through the alignment shifter. Note that byte 5 is the first byte of the next instruction OP1.
[0060]
The counter shifter 332 then shifts the extraction shifter 306 by eight bytes, because the first 8 bytes are now available from the alignment_IR latch and the following bytes are needed. When phase F3 starts, the counter shifter signals the extraction shifter to increase the shift amount by the number of bytes shifted out by the alignment shifter 310 in the previous phase. Therefore, the counter shifter must contain logic for storing the previous shift amount of the extraction shifter and adding the shift amount of the alignment shifter to this value.
[0061]
Each time a new shift amount is produced for the alignment shifter, the counter shifter adds that amount to the old shift amount. In the present example, the extraction shifter was shifted 8 bytes during F2. Thus, during F3, the counter shifter must instruct the extraction shifter to shift 8 + 5 = 13 bytes. The bytes output by the extraction shifter are bytes 20-13. Note that the alignment_IR latch outputs bytes 12-5 during F3, thus making bytes 20-5 available to the alignment shifter.
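The running sum kept by the counter shifter can be sketched in Python as follows. The function name `extraction_trace` and the (high byte, low byte) tuple convention are illustrative assumptions only; the sketch ignores bucket boundaries and simply tracks the accumulated shift amount.

```python
def extraction_trace(shift_amounts):
    """Hypothetical trace of the extraction shifter output per phase.

    shift_amounts -- shift_A values added by the counter shifter, one per phase
    Returns a list of (high_byte, low_byte) pairs naming the 8 bytes
    output by the extraction shifter in each successive phase.
    """
    buf_count = 0  # stored extraction shift amount (BUF_count)
    trace = [(buf_count + 7, buf_count)]
    for shift_a in shift_amounts:
        buf_count += shift_a  # old shift amount plus the new shift_A
        trace.append((buf_count + 7, buf_count))
    return trace
```

For the example in the text, `extraction_trace([8, 5])` yields bytes 7-0, then 15-8, then 20-13, corresponding to phases F1, F2, and F3.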
[0062]
During F3, the extraction shifter outputs bytes 20-13. However, since bucket_#0 contains only bytes 15-0, bytes 20-16 must be fetched from bucket_#1. As shown at 514 in the timing diagram, bucket_#1 becomes valid at the beginning of F3. As shown at 516, the extraction shifter then shifts bytes 4-0 of bucket_#1 and further shifts bytes 15-13 of bucket_#0. At this point, if bucket_#1 is not valid, the IAU must wait until it becomes valid.
[0063]
As described above, the shift_5_byte signal was generated by the NID during F2. In accordance with this signal, as shown at 518, bytes 12-5 of bucket_#0 are shifted out by the alignment shifter, and shortly thereafter latched into alignment_IR1, as shown at 520.
[0064]
Bytes 12-5 are sent by MUX 322 to stack 334 and NID 324 at the beginning of F3. The stack feeds bytes 12-5 back to the alignment shifter, as shown at 305. As shown at 522 in trace "9", the NID determines that the length of OP1 is 3 bytes and outputs the corresponding 3-byte shift signal during the second half of F3. The alignment shifter shifts 3 bytes, outputting bytes 15-8, and this amount is added to the counter shifter.
[0065]
The above process is then repeated. Once an instruction passes the end of bucket_#0 (i.e., bucket_#0 has been completely consumed), bucket_#1 becomes bucket_#0, and a new bucket_#1 becomes valid thereafter.
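The bucket handoff can be modeled with a short Python sketch. The class name `BucketPair`, its methods, and the idea of supplying buckets from an iterator are assumptions made for illustration; in hardware the IAU would stall until the next bucket is valid, whereas this model simply requires that another bucket be available.

```python
class BucketPair:
    """Hypothetical model of the two lowest MBUF buckets seen by the extraction shifter."""

    BUCKET_BYTES = 16

    def __init__(self, buckets):
        self._supply = iter(buckets)           # stands in for the IFU
        self.bucket0 = list(next(self._supply))
        self.bucket1 = list(next(self._supply))
        self.shift = 0                         # extraction shift amount within bucket_#0

    def extract(self):
        # Up to 8 bytes may start anywhere in bucket_#0 and spill into bucket_#1.
        window = self.bucket0 + self.bucket1
        return window[self.shift:self.shift + 8]

    def advance(self, nbytes):
        self.shift += nbytes
        if self.shift >= self.BUCKET_BYTES:
            # bucket_#0 is used up: bucket_#1 becomes bucket_#0 and is refilled.
            self.bucket0 = self.bucket1
            self.bucket1 = list(next(self._supply))
            self.shift -= self.BUCKET_BYTES
```

Advancing by 13 bytes makes `extract()` return bytes 13-15 of the first bucket followed by bytes 0-4 of the second, as in phase F3 of the example; advancing 3 more bytes triggers the handoff.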
[0066]
Trace "10" in the timing diagram shows the timing of byte extraction from the instruction stream. The Buf_count#0 block indicates the stored extraction shift amount. The alignment shift amount for each phase is added to Buf_count#0, and the result becomes the extraction shift amount in the next phase (see the block labeled counter_shift).
[0067]
Trace "11" in the timing diagram shows the timing of instruction alignment. The blocks labeled IR_latch_#0 and IR_latch_#1 represent the time periods during which the instructions in the corresponding alignment_IR latch are valid. The small block labeled MUX1 indicates when MUX 322 begins to select the valid alignment latch. The small block labeled MUX2 represents when the MUX 330 begins to select the shift amount determined by the NID 324. Finally, the block labeled Align_Shift indicates when the alignment shifter will start outputting instructions.
[0068]
Prefixes are extracted using the same technique by which instructions are aligned, except that MUX 330 selects the output of PD 328 instead of the output of NID 324.
[0069]
A block diagram of a portion of the stack 334 is as shown in FIG. This stack consists of 64 1-bit stacks arranged in parallel. Each 1-bit stack 600 comprises two latches, 602 and 604, and a three-input MUX 606. The aligned instruction is input to the latches as well as to the MUX on bus 607, labeled IN. The two latches can be loaded independently in either clock phase. In addition, MUX 606 has three MUX control lines 608 to select the output of either latch or to bypass the IN data and send it directly to output 610, labeled OUT.
[0070]
The IAU may periodically transfer to a different instruction stream. The stack allows the IAU to store two 8-byte sets of instruction data from the MUX 322. This feature is commonly used in CISC-type instruction emulation. When the IAU must branch to process a microcode routine for emulation of a complex CISC-type instruction, the state of the IAU is stored and then restored upon completion of the CISC-type instruction emulation.
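The save-and-resume behavior of the stack can be sketched in Python. The class name `Stack`, the `clock` method, and its keyword arguments are hypothetical; the sketch treats the 64 parallel 1-bit stacks as a single 8-byte value and models only the two latches and the three-way selection.

```python
class Stack:
    """Hypothetical model of stack 334: two latches plus a bypass path."""

    LATCH_A, LATCH_B, BYPASS = 0, 1, 2  # the three MUX 606 selections

    def __init__(self):
        self.latches = [None, None]  # stand-ins for latches 602 and 604

    def clock(self, data_in, load_a=False, load_b=False, select=BYPASS):
        # The two latches can be loaded independently.
        if load_a:
            self.latches[self.LATCH_A] = data_in
        if load_b:
            self.latches[self.LATCH_B] = data_in
        # The MUX returns either latch or passes IN straight to OUT.
        if select == self.BYPASS:
            return data_in
        return self.latches[select]
```

In normal operation the bypass path is selected. Before branching to an emulation routine, the current 8 bytes would be loaded into a latch, and that latch would be selected again when the routine completes.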
[0071]
The half-cycle data delay 316 is used to delay the immediate data and displacement information. Placing the delay in the IAU ahead of the shifters pipelines the immediate data and displacement logic, so that the shift is performed in the following phase rather than determining the instruction length and performing the shift in the same half cycle. Because these operations are spread over the cycle, the timing requirements of this logic are easier to meet. The IDDD block 326 controls the IMM shifter 312 and the DISP shifter 314 to extract immediate data and displacement data from the instruction. For example, if the first three bytes of the instruction are opcode, followed by four bytes of displacement and four bytes of immediate data, the shifters are enabled to shift out the appropriate bytes.
[0072]
Shifters 312 and 314 always output 32 bits, whether the actual data size is 8, 16, or 32 bits, with the immediate data and displacement data aligned to the low-order bits of the 32-bit output. The IDU determines whether the immediate data and displacement data are valid, and if so, how much valid data is present.
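A minimal Python sketch of this low-order alignment follows. The function names `shift_out_field` and `valid_part`, the zero-based byte positions, and the little-endian byte order (the usual x86 convention) are assumptions for illustration, and the instruction bytes in the usage note are arbitrary placeholder values.

```python
def shift_out_field(instruction, start):
    """Hypothetical model of IMM shifter 312 or DISP shifter 314.

    Returns a 32-bit value whose low-order bits hold the field that
    begins at zero-based byte position `start`. Four bytes are always
    output, regardless of the real field size.
    """
    field = bytes(instruction[start:start + 4]).ljust(4, b"\x00")
    return int.from_bytes(field, "little")


def valid_part(value, size_bits):
    """Hypothetical IDU-side step: keep only the bits known to be valid."""
    return value & ((1 << size_bits) - 1)
```

For three placeholder opcode bytes followed by a 4-byte displacement and a 4-byte immediate, `shift_out_field(instr, 3)` returns the displacement and `shift_out_field(instr, 7)` returns the immediate. The IDU would then apply something like `valid_part` when the field is only 8 or 16 bits wide.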
[0073]
Determining the lengths of the prefix, immediate data, and displacement data, as well as the actual length of the instruction, is a function of the particular CISC-type instruction set being aligned and decoded. One skilled in the art can obtain such information by consulting the CISC instruction set itself, the manufacturer's user manual, or other general reference material. Those skilled in the art will readily understand how to do this, how to convert the information into random logic to implement the above-described IAU subsystem and the IDU subsystem described below, and how to generate the control logic and control signals used to control the data flow. Further, once such random logic is generated, it can be verified using a commercially available engineering software application (e.g., Verilog from Cadence Design Systems, San Jose, Calif.); such an application is also useful for defining the timing and generation of the control signals and the associated random logic. Other off-the-shelf engineering software applications can be used to generate gate and cell layouts and optimize the implementation of such functional blocks and control logic.
[0074]
The i486 instruction set supports 11 prefixes whose order is defined when used together in one instruction. The format is defined to include up to four prefixes in a single instruction. Therefore, the prefix detector 328 of the present invention has four identical prefix detection circuits. Each circuit looks for one of the 11 prefix codes. The first four bytes passed to the prefix detector are evaluated, and the outputs of the four prefix detection circuits are combined together to determine the total number of prefixes present. The result is used as the shift amount passed to MUX 330.
[0075]
A block diagram of the NID is shown in FIGS. The following description of the NID is specific to i486 instruction alignment. Alignment of other CISC-type instructions would likely use a different NID architecture. The techniques described below are therefore a guide for those skilled in the art, but should not be considered as limiting the scope of the invention.
[0076]
Only four bytes are required to determine the length of one instruction (as described above, the four bytes consist of two opcode bytes, one optional ModR/M byte, and one SIB byte).
[0077]
Shown in FIG. 6 is a 4-byte (32-bit) bus 701 representing the first 4 bytes of the instruction received from MUX 322. The first two bytes are sent to SNID 702 on bus 703. The SNID, by definition, determines the length of a first subset of instructions identified based on their first two bytes. The SNID can determine the length of this subset of instructions in half a cycle. The length of the subset instruction is output by the SNID on bus 705. The width of the bus corresponds to the maximum number of instruction bytes detected by the SNID. The SNID also has a 1-bit MOD detect (MOD_DET) output line 707 to indicate whether a ModR/M byte is in the instruction. In addition, the SNID has a 1-bit NID_wait line 709 to signal the control logic that the instruction is not in the subset (i.e., that the output of the RNID should be used instead). Thus, if NID_wait is true, the IAU must wait another half cycle for the RNID to decode the instruction.
[0078]
The subset of instructions decoded by the SNID consists of those CISC-type instructions that can be decoded in half a cycle using a minimum of one-, two-, and three-input gates (NAND gates, NOR gates, and inverters), with a maximum of 5 gate delays, based on a 16 × 16 Karnaugh map of the 256 instructions. Blocks of the Karnaugh map containing mostly one-byte opcode instructions can be implemented in this manner. The remaining instructions are decoded by the RNID using a logic array with a longer gate delay.
[0079]
RNID 704 receives the first four bytes on bus 701. The RNID performs decoding to determine the length of the remaining instructions, which require more than one phase to decode. The RNID has outputs similar to those of the SNID.
[0080]
The RNID detects the instruction length and outputs the result on the bus 711. A 1-bit over8 output 712 indicates that the instruction is 8 bytes or more in length. The RNID also has a 1-bit MOD_DET output 714 that indicates whether the instruction contains a ModR/M byte.
[0081]
The length decoded by either the SNID or the RNID is selected by MUX 706. A control line 708 for the MUX 706, called the Select Decoder for the current instruction (SELDECIR), switches the MUX 706 between the two decoders to obtain the actual length, which is 1 to 11 bytes. For example, an instruction that is 11 bytes long causes the RNID to output an over8 signal and a value of 3 on bus 711. The instruction length (1n) is sent to MUX 330 on bus 716 and used by alignment shifter 310 and counter shifter 332. The 8 bits output by the top MUX 706 are used as shift control (enable) for the alignment shifter and counter shifter.
[0082]
ModR/M bytes are selected similarly. The SELDECIR signal 708 selects the appropriate MOD line and controls the second MUX 710 to indicate whether a ModR/M byte is present. MOD line output 718 is used by IDDD.
[0083]
The SELDECIR signal 708 is generated based on the NID_wait signal 709. The output of the SNID is selected during the first clock phase because its result is complete by then. If the NID_wait signal 709 indicates that the instruction has not been decoded, the MUXs 706 and 710 are switched to select the output 711 of the RNID, which is available at the beginning of the next clock phase.
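The two-speed selection can be sketched in Python as follows. The names `SUBSET`, `snid`, and `nid` are hypothetical, the three single-byte opcodes in the table are placeholders rather than the subset actually defined by the invention, and the RNID is passed in as an arbitrary callable because its decode tables are not reproduced here.

```python
# Hypothetical subset table: opcode byte -> instruction length in bytes.
# Which opcodes really belong to the fast subset is an assumption.
SUBSET = {0x90: 1, 0x50: 1, 0xC3: 1}


def snid(first_two):
    """Fast decoder: returns (length, nid_wait)."""
    opcode = first_two[0]
    if opcode in SUBSET:
        return SUBSET[opcode], False
    return 0, True  # not in the subset: assert NID_wait


def nid(instruction, rnid):
    """Returns (length, half_cycles_used) for one aligned instruction."""
    length, nid_wait = snid(instruction[:2])
    if not nid_wait:
        return length, 1             # SNID result is used in the first phase
    return rnid(instruction[:4]), 2  # hold the instruction, use the RNID result
```

Calling `nid(bytes([0x90, 0, 0, 0]), rnid)` completes in one half cycle, while an opcode outside the table costs a second half cycle and returns whatever length the supplied `rnid` callable reports.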
[0084]
The RNID 704 basically has two parallel decoders, one that decodes the instruction as if it had a 1-byte opcode and another that decodes the instruction as if it had a 2-byte opcode. The escape detection (ESC_DET) input signal indicates whether the length of the opcode is 1 byte or 2 bytes. For example, in the i486 instruction set, the first byte of every 2-byte opcode (called the escape byte) has the value 0F hex, indicating that the instruction has a 2-byte opcode. The RNID outputs a valid instruction length based on the ESC_DET signal. This signal indicates that the first opcode byte is the escape (0F hex), which indicates a 2-byte opcode, thereby enabling the second byte decoder. The decode logic for generating the ESC_DET signal should be apparent to those skilled in the art.
[0085]
The block diagram of the RNID is as shown in FIG. The RNID comprises an RNID_1OP decoder 752 for decoding the first opcode byte, an RNID_2OP decoder 754 for decoding the second opcode byte, two identical RNID_MOD decoders 756 and 758 for decoding the ModR/M byte in one of two positions determined by the number of opcode bytes present, and an RNID_SUM adder 760. Based on the outputs of the four RNID decoders 752-758, RNID_SUM adder 760 outputs the total length of the instruction on bus 762. The RNID_SUM adder 760 has another output line 764, labeled OVER8, to indicate whether the length of the instruction is 8 bytes or more.
[0086]
The first opcode byte of the instruction and three bits of the ModR/M byte (bits [5:3], called the extension bits) are input to RNID_1OP 752 on bus 766. Another input line 768 to RNID_1OP, called Data_SZ, indicates whether the instruction's operand size is 16 or 32 bits. The data size is determined based on the memory protection scheme in use and on whether a prefix that overrides the default data size is present. RNID_1OP assumes that the instruction has a 1-byte opcode and attempts to determine the length of the instruction based on that information and the 3 extension bits.
[0087]
The RNID_MOD decoder 756 decodes the ModR/M byte input on the bus 770. The RNID_MOD decoder has another input bus 772, labeled ADD_SZ, indicating whether the address size is 16 bits or 32 bits. Address size is independent of data size.
[0088]
ESC_DET signal 774 is also input to block 760. For example, if the ESC_DET signal is logic HIGH, the RNID_SUM block knows that the opcode is actually the second byte.
[0089]
The RNID_2OP decoder 754 assumes that the opcode is two bytes, and therefore decodes the second byte of the opcode (see bus 776). The RNID_2OP decoder also has an input 768 that recognizes the data size.
[0090]
The decoders themselves do not know whether the opcode is 1 byte or 2 bytes long, and the ModR/M byte always follows the opcode. A second RNID_MOD decoder 758 is therefore used to decode the byte following a 2-byte opcode (see bus 778). The two RNID_MOD decoders are identical, but decode different bytes in the instruction stream.
[0091]
Further, based on the ESC_DET signal 774, the RNID_SUM 760 selects the outputs of the appropriate opcode decoder and ModR/M byte decoder and outputs the length of the instruction on bus 762. Output 764, labeled over8, indicates whether the instruction is 8 bytes or more. If the instruction length is 8 bytes or more, the IR_NO [7:0] bus 762 indicates the number of instruction bytes exceeding 8.
[0092]
The RNID_1OP decoder 752 has an output bus 780 that is 9 bits wide. One line indicates whether the instruction is one byte long. The second line indicates that the instruction is 1 byte long and that a ModR/M byte is present, and therefore that information from the ModR/M decoder should be included in determining the instruction length. Similarly, the remaining output lines of bus 780 indicate the following numbers of bytes: 2, 2/MOD, 3, 3/MOD, 4, 5, and 5/MOD. If the instruction is 4 bytes long, a ModR/M byte cannot be present. This is unique to the i486 instruction set. However, the invention is not limited in any way to a particular CISC-type instruction set. One skilled in the art can apply the features of the present invention to align and decode any CISC-type instruction set.
[0093]
The RNID_2OP decoder 754 has an output bus 782 that is 6 bits wide. One line indicates whether the instruction is one byte long. The second line indicates whether the instruction is one byte long and contains a ModR/M byte, which should be included in determining the length of the instruction. Similarly, the remaining output lines on bus 782 indicate lengths of 2, 2/MOD, 3, and 5/MOD. When the opcode is 2 bytes long, no other instruction lengths are supported by the i486 instruction set.
[0094]
The outputs 784 and 786 of the two RNID_MOD decoders 756 and 758 inform RNID_SUM 760 of the five possible additional lengths specified by the ModR/M byte. Each RNID_MOD decoder has a 5-bit wide output bus. The five possible additional lengths are 1, 2, 3, 5, and 6 bytes. The ModR/M byte itself is included in determining the total length. All the remaining bytes consist of immediate data or displacement data.
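How the adder combines these decoder outputs can be sketched in Python. The function name `rnid_sum` and its argument layout are hypothetical. The sketch assumes that each opcode decoder reports a base length together with a flag saying whether a ModR/M contribution applies, and that the base length already accounts for every opcode byte; the text does not spell out that accounting, so treat it as an assumption.

```python
def rnid_sum(esc_det, op1, op2, mod1, mod2):
    """Hypothetical model of RNID_SUM adder 760.

    esc_det -- True when the first opcode byte is the escape byte
    op1     -- (base_length, uses_mod) from RNID_1OP
    op2     -- (base_length, uses_mod) from RNID_2OP
    mod1    -- additional length from RNID_MOD 756 (1, 2, 3, 5, or 6)
    mod2    -- additional length from RNID_MOD 758 (1, 2, 3, 5, or 6)
    Returns (length_value, over8).
    """
    # ESC_DET picks the decoder pair that looked at the right bytes.
    base, uses_mod = op2 if esc_det else op1
    extra = (mod2 if esc_det else mod1) if uses_mod else 0
    total = base + extra
    over8 = total >= 8
    # At 8 bytes or more, only the count beyond 8 is reported.
    return (total - 8 if over8 else total), over8
```

Under these assumptions, a 5/MOD opcode result combined with a 6-byte ModR/M contribution gives 11 bytes, reported as the value 3 with over8 set, which agrees with the 11-byte example given for bus 711.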
[0095]
FIG. 8 is a block diagram of the IDDD 326. The IDDD 326 determines the shift amount of the IMM shifter 312 and the DISP shifter 314. The shift amount is determined by the ModR/M byte of the instruction.
[0096]
The i486 instruction set includes two special instructions, an enter_detect instruction and a jump_call_detect instruction. Therefore, IDDD 326 has a block called Immediate Special Detector (ISD) 802 to decode these instructions. Input 803 to the ISD is the first byte of the instruction. Two output lines EN_DET and JMP_CL_DET (820 and 822) indicate that one of the relevant instructions has been detected.
[0097]
The MOD_DEC decoders 804 and 806 are identical and decode the immediate data and displacement data. Based on ADD_SZ 772, decoder 804 examines the ModR/M byte assuming a 1-byte opcode, and decoder 806 examines the ModR/M byte assuming a 2-byte opcode. The instruction byte inputs to MOD_DECs 804 and 806 are 805 and 807, respectively. These decoders determine the position of the displacement and the position of the immediate data in the instruction stream. Two 7-line outputs 824 and 826 indicate where the displacement and immediate data start. That is, the displacement starts at position 2 or 3, and the immediate data starts at position 2, 3, 4, 6, or 7.
[0098]
MOD_DET lines 707 and 714 are also input to select block 812.
[0099]
The selection block 812 combines the EN_DET and JMP_CL_DET signals, the MOD_DET result, the MOD_DEC result, and ADD_SZ, and outputs the result on the four buses 832 to 838. The displacement 1 (DISP_1) bus 832 outputs the displacement shift result assuming a 1-byte opcode. The displacement 2 (DISP_2) bus 834 outputs the displacement shift result assuming a 2-byte opcode. The immediate 1 and 2 (IMM_1 and IMM_2) buses 836 and 838 output immediate data shift information assuming 1-byte and 2-byte opcodes, respectively.
[0100]
The last block 814, labeled MOD_SEL/DLY, actually selects the appropriate shift amount and delays the result by a half cycle. The half-cycle delay implemented by MOD_SEL/DLY 814 represents delay 316 shown in FIG. The ESC_DET signal 774 described above is used by the MOD_SEL/DLY block to make the shift selection. The result is clocked out of MOD_SEL/DLY 814 by clock signals CLK0 and CLK1 with a half-cycle delay. The displacement and immediate-data shift control signals are sent to the DISP shifter and the IMM shifter via the shift_D[3:0] bus 840 and the shift_I[7:0] bus 842, respectively. The number of possible positions of immediate data and displacement data within a CISC-type instruction defines the number of bits required to specify the amount of shift.
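The select-then-delay behavior can be sketched in Python. The class name `ModSelDly` and the `clock` method are hypothetical, and the shift amounts are modeled as plain integers rather than as the one-hot bus lines described above.

```python
class ModSelDly:
    """Hypothetical model of the MOD_SEL/DLY block: select, then delay one phase."""

    def __init__(self):
        self._pending = None  # selection made in the previous phase

    def clock(self, esc_det, disp_1, disp_2, imm_1, imm_2):
        # Output the (displacement, immediate) shift amounts chosen last phase.
        out = self._pending
        # ESC_DET chooses between the 1-byte and 2-byte opcode assumptions.
        self._pending = (disp_2, imm_2) if esc_det else (disp_1, imm_1)
        return out
```

Each call represents one clock phase: the shift amounts selected in one phase reach the DISP and IMM shifters only on the following call, mirroring the half-cycle delay.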
[0101]
The block diagram of the prefix detector 328 is as shown in FIG. The prefix detector 328 includes a prefix_number decoder (PRFX_NO) 902, four prefix_detector decoders (PRFX_DECs 904 to 910), and a prefix_decoder (PRFX_SEL) 912.
[0102]
For example, the i486 instruction set contains 11 possible prefixes. Because some prefix combinations are invalid, at most four prefixes can be included per instruction. The order of the four prefixes is also defined by the instruction set. However, rather than detecting only the valid prefix permutations, the prefix detector uses four prefix detectors 904-910 to decode each of the first four bytes of the instruction. The first four bytes of the instruction are input to the prefix detector on bus 901. The detectors 904 to 910 each have a 12-bit wide output bus (905, 907, 909 and 911). If a prefix is actually decoded, the 12-bit output indicates which prefix is present. The twelfth prefix is called unlock; it is the functional complement of the i486 lock prefix, but is available only to microcode routines in emulation mode.
[0103]
The Align_RUN control signal 920 may be implemented to enable or disable the prefix decoder and is used to mask out all prefixes. The HOLD_PRFX control signal 922 is used to latch and hold prefix information. In general, when the prefix detector 328 indicates the presence of a prefix, aligning the instruction requires control logic to latch the prefix information. The prefix information is then used by alignment shifter 310 to shift out the prefixes. In the next cycle, the IAU determines the length of the instruction, aligns it, and passes it on to the IDU.
[0104]
The PRFX_NO decoder 902 indicates where prefixes are present, and how many, by decoding the first four bytes of the instruction. The logic diagram of the PRFX_NO decoder 902 is as shown in FIG. The PRFX_NO decoder comprises four identical decoders 1002-1008 and a set of logic gates 1010. Each of the four decoders 1002 to 1008 examines one of the first four bytes (1010 to 1013) to determine whether it is a prefix. Because a prefix byte can follow an opcode byte, logic gates 1010 are used to output a result indicating the total number of prefixes before the first opcode byte. Prefixes that follow an opcode are excluded from this count because they can apply only to the opcode of the next instruction.
[0105]
If the first byte (position) is a prefix and there is no prefix in the second position, the total number of prefixes is one. As another example, if there is no prefix in any of the first three positions, a prefix in the fourth position does not matter. A logic HIGH (1) output from the bottom NAND gate 1014 indicates that there are four prefixes, a HIGH output from the second NAND gate from the bottom 1015 indicates the presence of three prefixes, and so on. The outputs of the four NAND gates are combined to form the PREFIX_NO bus 1018, which represents the total number of valid prefixes preceding the first opcode, i.e., the shift amount output of prefix detector 328.
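The counting rule described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the gate-level circuit: the function names `prefix_present` and `prefix_count` are hypothetical, the byte values are the standard i486 prefix encodings, and the microcode-only unlock prefix is not modelled.

```python
# Illustrative model of the PRFX_NO decoder (names are hypothetical).
# The 11 i486 prefix byte values; the microcode-only "unlock" prefix is omitted.
I486_PREFIXES = {
    0xF0,                                 # LOCK
    0xF2, 0xF3,                           # REPNE, REP/REPE
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,   # segment overrides CS, SS, DS, ES, FS, GS
    0x66,                                 # operand-size override
    0x67,                                 # address-size override
}


def prefix_present(first_four):
    """Model of PRFX_P: one flag per byte position, independent of other positions."""
    return [b in I486_PREFIXES for b in first_four[:4]]


def prefix_count(first_four):
    """Model of PREFIX_NO: number of prefixes that precede the first opcode byte."""
    count = 0
    for is_prefix in prefix_present(first_four):
        if not is_prefix:
            break  # a prefix after an opcode byte belongs to the next instruction
        count += 1
    return count
```

In this sketch a prefix byte that appears after the first non-prefix byte is still reported by `prefix_present` but does not contribute to `prefix_count`, which mirrors the distinction between the PRFX_P lines and the PREFIX_NO bus.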
[0106]
The PRFX_NO decoder 902 also includes a Prefix_Present (PRFX_P) output bus 1020 (also 4 bits wide). The four PRFX_P output lines 1020-1023 indicate whether there is a prefix at a particular location, regardless of what the output of the other location is. The PRFX_P output is taken directly from the outputs of the four decoders (1002-1008).
[0107]
The results of the PRFX_NO decoder (described in connection with FIG. 10) and the information from the PRFX_DEC detectors 904-910 are combined by the PRFX_SEL decoder 912. The prefix information is combined to form one 13-bit output bus 924, which indicates whether any prefix is present and which prefixes are present.
[0108]
3.0 Outline of instruction decode unit
All instructions are passed from the IAU to an instruction decode unit (IDU), where they are converted directly into RISC-type instructions. Every instruction executed by the IEU is first processed by the IDU. The IDU determines whether each instruction is an emulated instruction or a basic instruction. If it is an emulated instruction, a microcode emulation routine consisting entirely of basic instructions is processed. If it is a basic instruction, it is converted directly by hardware into one to four nano-instructions and sent to the IEU. What the IEU actually executes are these nano-instructions, rather than the original CISC-type or microcode instructions.
[0109]
Partitioning instructions in this way has two main advantages. The first is that the hardware is kept small, because it only needs to support simple operations. The second is that bugs are less troublesome, because they are most likely to occur in the complex microcode routines, which are easy to change.
[0110]
The IDU microcode routine support hardware associated with the present invention has several unique features. Typical microcode instructions consist of control bits for the various data buses present in the processor, and are only lightly encoded or not encoded at all. In contrast, the microcode of the present invention is a relatively high-level machine language designed to emulate a particular complex instruction set. Typical microcode is sent directly to the functional units of the processor, whereas the microcode of the present invention is processed by the same decoder logic used for the target CISC-type (e.g., 80x86) instructions. This makes the code density of the microcode of the present invention much better than that achieved by typical microcode, and, because the microcode is similar to the target CISC-type instruction set, it makes microcode development easier. Further, the present invention provides hardware support for microcode revision. That is, the on-chip ROM-based microcode can be partially or entirely replaced with external RAM-based microcode under software control. (See co-pending U.S. application Ser. No. 07/802,816 of the same assignee, filed on Dec. 6, 1991, entitled "ROM with RAM Cell and Cyclic Redundancy Check Circuit", Attorney Docket No. SP024. The disclosure of this application is incorporated herein by reference.)
The microcode routine language is designed to be the instruction set executed by the RISC-type core to perform the functions required by every emulated complex instruction, in addition to the various control and maintenance functions related to exception handling. Emulated instructions are typically less performance-sensitive than non-emulated (basic) instructions, and exceptions (handled by microcode routines) rarely occur, but processing them efficiently is still very important to overall system throughput. This goal is achieved by using hardware that supports microcode routines in various ways. The present invention has four areas of microcode support hardware: dispatch logic, mailboxes, nano-instruction format, and special instructions.
[0111]
The microcode dispatch logic controls the efficient transfer of program control from the target CISC-type instruction stream to a microcode routine and back to the target instruction stream. It uses little hardware, and the transfer is handled in a way that is invisible to the instruction execution unit (IEU) of the RISC-type core. (The IEU executes RISC-type instructions. The "RISC core" mentioned above is a synonym for the IEU. Details of the IEU are not necessary for one of ordinary skill in the art to practice the invention, and the features of the invention are applicable to processors generally.)
The mailboxes comprise a system of registers used to transfer information from the instruction decode hardware to the microcode routines in a systematic way. This allows the hardware to pass instruction operands and similar data to a microcode routine, thereby relieving the routine of the task of extracting this data from the instruction.
[0112]
The nanoinstruction format describes the information passed from the IDU to the IEU. This format is selected to be efficiently extracted from the source CISC-type instructions, but provides the IEU with sufficient information for dependency checking and functional unit control.
[0113]
Finally, the special instructions are an additional set of instructions that are provided to allow complete control over the RISC-type hardware and to support certain emulation tasks in hardware, and that are specific to the CISC-type instruction set.
[0114]
3.1 Microcode dispatch logic
The first step in dispatching to microcode is to determine the address of the microcode routine. This step has two important requirements: each microcode routine must have a unique starting address, and those addresses must be generated quickly. When the number of cases to be handled is small, as with exception handling routines, this is fairly easy to achieve, because the hardware can store the addresses as constants and merely select between them. However, determining the addresses of emulated instructions is more difficult, because there are too many of them to store every possible address.
[0115]
The microcode dispatch logic satisfies these requirements by basing the dispatch address of each instruction directly on its opcode. For example, 1-byte opcodes are mapped into the address space from 0H to 1FFFH. In that case, the upper 3 bits of the 16-bit dispatch address must be zero. The entry points of these microcode routines are 64 bytes apart, so the least significant 6 bits of each entry point address must also be zero. This leaves seven bits undetermined, and these can be taken directly from seven bits of the opcode. As will be apparent to those skilled in the art, address generation in this manner requires little logic. For example, only a multiplexer is needed to select the proper bits from the opcode.
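The address arithmetic for the 1-byte opcode case can be illustrated with a minimal sketch. The text does not say which seven opcode bits are used, so the choice of the low seven bits below is purely an assumption, as is the function name `dispatch_address`.

```python
ENTRY_SPACING_BITS = 6  # entry points are 64 bytes apart, so the low 6 bits are zero


def dispatch_address(opcode):
    """Form a 16-bit microcode dispatch address for a 1-byte opcode.

    ASSUMPTION: the seven variable address bits are the low seven opcode bits.
    The source only states that seven opcode bits are used, not which ones.
    """
    seven_bits = opcode & 0x7F
    # Upper 3 bits stay zero, middle 7 bits come from the opcode, low 6 bits are zero.
    return seven_bits << ENTRY_SPACING_BITS
```

Under this assumption every result falls in the range 0H to 1FFFH and is a multiple of 64, which matches the two constraints stated above.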
[0116]
Once the dispatch address of the microcode routine has been determined, the microcode must be fetched from memory. Typically, but not necessarily, the microcode resides in on-chip ROM. As described in detail in U.S. Application No. 07/802,816, cited above, each entry point has an associated ROM invalid bit that indicates whether the ROM routine is correct. This bit is fetched in parallel with the ROM access and works like a conventional cache hit indicator. If this bit indicates that the ROM entry is valid, the microcode routine continues to be fetched from the ROM and is executed normally. However, if the bit indicates that the ROM entry is invalid, the microcode is fetched from an external memory such as RAM.
[0117]
The addressing of the on-chip microcode routines is performed by the IDU itself. The IDU generates a 16-bit address for accessing the microcode ROM. If the ROM invalid bit corresponding to the addressed ROM entry indicates that the microcode is invalid, the address of the external microcode residing off-chip in main memory is calculated. The U_Base register holds the upper 16 address bits (referred to as the start address) of the external microcode residing in main memory. The 16-bit address decoded by the IDU is concatenated with the upper 16 bits held in the U_Base register to access the external microcode in main memory. If the location of the external microcode in main memory is changed, the contents of the U_Base register can be modified to reflect the new location.
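The ROM-or-external selection and the address concatenation can be sketched as below. This is a simplified model, assuming a 32-bit main memory address formed from two 16-bit halves; the function name and the returned tuple are illustrative and not taken from the source.

```python
def microcode_fetch_address(dispatch_addr, rom_invalid, u_base):
    """Choose where a microcode routine is fetched from (illustrative model).

    dispatch_addr: 16-bit address generated by the IDU
    rom_invalid:   ROM invalid bit for the addressed entry point
    u_base:        upper 16 address bits of the external microcode (U_Base)
    """
    low = dispatch_addr & 0xFFFF
    if not rom_invalid:
        return ("rom", low)
    # Concatenate U_Base (upper 16 bits) with the IDU address (lower 16 bits).
    return ("external", ((u_base & 0xFFFF) << 16) | low)
```

For example, with U_Base holding 00F0H, an invalid ROM entry at 0040H would be fetched from external address 00F00040H in this model.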
[0118]
This feature allows the microcode to be updated by replacing a particular routine with another routine in external memory, without forcing all of the microcode to suffer the performance penalty of external memory access. It also makes it possible to remove all of the ROM from the RISC chip and store the entire microcode in external RAM, in order to reduce the area requirement of the RISC chip or to assist in microcode development.
[0119]
The dispatch logic also provides the means for a microcode routine to return to the main instruction stream when its task is completed. Separate program counters (PCs) and instruction buffers are maintained for this purpose. During normal operation, the main PC determines the address of each CISC-type instruction in external memory. The section of memory containing these instructions is fetched by the IFU and stored in the MBUF.
[0120]
When an emulated instruction or an exception is detected, the PC value and the length of the current instruction are stored in a temporary buffer. Meanwhile, the microcode dispatch address is calculated as described above, and instructions are fetched from this address into the EBUF. The microcode is executed from the EBUF until a microcode "return" instruction is detected. When the return instruction is detected, the saved PC value is reloaded, and execution continues from the MBUF. Because the MBUF and all other related registers are preserved during the transfer of control to the microcode routine, the return to the CISC-type program occurs very quickly.
[0121]
There are two return instructions used by microcode routines, to accommodate the differences between instruction emulation routines and exception handling routines. When a microcode routine is entered for exception handling, it is important that, after the routine finishes, the processor return to the exact state in which it was interrupted. However, when a microcode routine is entered to emulate an instruction, the routine must return to the instruction following the emulated instruction; otherwise, the emulation routine would be executed a second time. These two functions are handled by two return instructions, aret and eret. The aret instruction returns the processor to the state it was in when microcode was entered, while the eret instruction updates the main PC and returns control to the next instruction in the target stream.
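The difference between the two return instructions can be illustrated with a small state model. The sketch assumes that eret advances the main PC by the saved instruction length; the source says only that the PC value and length are saved and that eret updates the main PC, so this arithmetic and all names below are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    """Illustrative model of the saved state used on a microcode return."""
    main_pc: int
    saved_pc: int = 0
    saved_length: int = 0
    in_microcode: bool = False

    def enter_microcode(self, instr_length):
        # Save the PC and length of the current instruction in a temporary buffer.
        self.saved_pc = self.main_pc
        self.saved_length = instr_length
        self.in_microcode = True

    def aret(self):
        # Exception return: resume at the instruction that was interrupted.
        self.main_pc = self.saved_pc
        self.in_microcode = False

    def eret(self):
        # Emulation return (assumed arithmetic): skip past the emulated instruction.
        self.main_pc = self.saved_pc + self.saved_length
        self.in_microcode = False
```

With a 3-byte instruction at 1000H, `aret` in this model resumes at 1000H, whereas `eret` resumes at 1003H so that the emulated instruction is not executed again.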
[0122]
3.2 Mailbox
For an emulation routine to successfully perform the functions of a complex CISC-type instruction, the microcode must have easy access to the operands referenced by the emulated instruction. In the present invention, this is done by using four mailbox registers. These registers are unique only in their use; they are defined to be the first four of a set of 16 temporary registers in the integer register file that are available to microcode. Each emulation routine that requires operands or other information from the original instruction will find these values stored in one or more mailbox registers when the routine is entered. When the IDU detects an emulated instruction, it generates instructions that are used by the IEU to load the registers with the values that the microcode expects, before execution of the microcode routine itself begins.
[0123]
For example, consider the emulation of a Load Machine Status Word (lmsw) instruction, which can specify any of the general purpose registers as its operand. Assume that the particular instruction to be emulated is lmsw ax, which loads a 16-bit status word from the "ax" register. The same microcode routine is used regardless of the register actually specified in the instruction, so mailbox #0 is loaded with the status word before the microcode is entered for this instruction. When the IDU detects this instruction, it generates a mov u0, ax instruction so that the IEU moves the status word from the "ax" register to the "u0" register, which is defined as mailbox #0. After the mov instruction is sent to the IEU, the microcode routine is fetched and sent. Thus, the microcode can be written as if the emulated instruction were lmsw u0, and it will correctly handle all possible operands specified in the original CISC-type instruction.
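The mailbox loading step can be sketched as a simple generator of mov instructions. This is an illustration only: the function name `mailbox_loads` and the textual instruction format are hypothetical, and the sketch assumes that operands are assigned to mailboxes u0 to u3 in order.

```python
NUM_MAILBOXES = 4  # the first four of the 16 temporary registers


def mailbox_loads(operand_registers):
    """Return the mov instructions issued before a microcode routine is entered.

    ASSUMPTION: operands are placed in mailboxes u0..u3 in the order given.
    """
    if len(operand_registers) > NUM_MAILBOXES:
        raise ValueError("only four mailbox registers are available")
    return [f"mov u{i}, {reg}" for i, reg in enumerate(operand_registers)]
```

For the lmsw ax example, `mailbox_loads(["ax"])` yields a single `mov u0, ax`, and the same routine would serve lmsw bx by producing `mov u0, bx` instead.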
[0124]
3.3 Nano-instruction format
As described above, a CISC-type instruction is decoded into nano-instructions by the IDU, and these are processed by the RISC-type processor core called the IEU. Nano-instructions are passed from the IDU to the IEU in groups of four called "buckets". One of the buckets is shown in FIG. Each bucket consists of two packets and general information about the whole bucket. Packet #0 contains three nano-instructions that are always executed in order: a load instruction 1102, an ALU-type instruction 1104, and a store instruction 1106. Packet #1 consists of a single ALU-type instruction 1108.
[0125]
The IEU can accept buckets from the IDU at a peak rate of one per cycle. The IDU processes basic instructions at a peak rate of two per cycle. Because most basic instructions are translated into a single packet, two basic instructions can usually be placed together in one bucket and passed to the IEU. The main constraint on this rate is that the basic instructions must meet the requirements of the bucket, which are as follows.
[0126]
Only one of the two basic instructions may reference a memory operand (there is only one load/store operation per bucket), and both instructions must consist of a single ALU-type operation (as opposed to one instruction requiring two ALU-type operations).
[0127]
If one or both of these constraints are not met, a bucket containing the nano-instructions corresponding to only one of the basic instructions is sent to the IEU, and the remaining instruction is sent later in another bucket. These constraints accurately reflect the capabilities of the IEU. That is, since the IEU has two ALUs and one load/store unit, performance is not actually limited by these requirements. For an example of this type of IEU, see co-pending U.S. patent application Ser. No. 07/817,810, entitled "High Performance RISC Microprocessor Architecture", filed January 8, 1992 (Attorney Docket No. SP015/1397.0280001), and U.S. patent application Ser. No. 07/817,809, entitled "Extensible RISC Microprocessor Architecture", filed January 8, 1992 (Attorney Docket No. SP021/1397.0300001). These disclosures are incorporated herein by reference.
[0128]
3.4 Special instructions
There are many functions that must be performed by microcode routines but that are difficult or inefficient to perform using general-purpose instructions. Further, because the architecture of the RISC-type processor is extended relative to conventional CISC-type processors, certain additional functions are useful. However, such functions have no meaning for CISC-type processors and therefore cannot be performed using any combination of CISC-type instructions. Together, these situations led to the creation of the "special instructions".
[0129]
An example of the first category of special instruction is the extract_desc_base instruction. This instruction extracts various bit fields from two of the microcode general registers, concatenates them, and places the result in a third general register for use by the microcode. Performing the same operation without this instruction would require the microcode to perform several masking and shifting operations and to use additional registers to hold temporary values. The special instruction allows the same function to be performed by one instruction in a single cycle, without using any scratch registers.
[0130]
Two examples of the second category of special instructions have already been mentioned, namely the two return instructions, aret and eret, used to terminate microcode routines. These instructions are meaningful only in the microcode environment, so there is no equivalent instruction or instruction sequence in the CISC-type architecture. In this case, special instructions were needed not only for performance reasons but also for functional correctness.
[0131]
Because special instructions are available only to microcode routines, and because emulated instructions occur only in the target CISC-type instruction stream, the opcodes of the emulated instructions are reused for the special instructions. Thus, when one of these opcodes occurs in the target CISC-type instruction stream, it merely indicates that the microcode emulation routine for that instruction should be executed. However, when that same opcode occurs in the microcode instruction stream, it has a completely different function as one of the special instructions. To accommodate this reuse of opcodes, the IDU keeps track of the current state of the processor and decodes the instructions appropriately. This opcode reuse is invisible to the IEU.
[0132]
The IDU decodes each CISC-type instruction (e.g., of the i486 instruction set) and translates each instruction into several RISC-type processor nano-instructions. As described above, depending on its complexity and functionality, each instruction is converted into zero to four nano-instructions. The IDU decodes and converts up to two CISC-type instructions per cycle. The basic functions of the IDU are summarized as follows.
* Decode one CISC-type instruction per half cycle.
* Decode the first CISC type instruction in the first phase.
* Hold the decoded result of the first CISC type instruction as valid until the end of the second phase.
* Decode the second CISC type instruction in the second phase.
* Combine the outputs of the two instructions, if possible in the third phase.
* One bucket of four nano-instructions is output per cycle.
[0133]
3.5 Instruction decode unit block diagram
A block diagram of the IDU is as shown in FIG. Aligned instructions from the IAU arrive at the IDU on bus 1201, which is 32 bits wide ([31:0], or 4 bytes). The aligned instruction is received by instruction decoder 1202. The instruction decoder 1202 looks only at the first four bytes of the aligned instruction to perform the CISC-to-RISC conversion.
[0134]
Instruction decoder 1202 operates in one clock phase (half cycle). The aligned instruction passes through the decoder, and the decoded information leaving it is multiplexed and fed via bus 1203 to half-cycle delay latch 1204. Thus, the decoded information experiences the equivalent of a one-phase pipeline delay.
[0135]
After a half cycle delay, the decoded information is sent to MUX 1206 via bus 1205 to determine the actual register code used. At this stage of the decoding, the decoded information is formatted into nano-instructions. The nanoinstruction is then latched. Two complete nanoinstruction buckets are latched every cycle. The latches of the two nanoinstruction buckets are schematically illustrated by a first IR bucket 1208 and a second IR bucket 1210, respectively.
[0136]
The IDU attempts to combine the buckets 1208 and 1210 into one bucket 1212. A set of control gates 1214 performs the actual combining. The IDU first examines the type of each nano-instruction to determine whether the types can be combined. Note that the load (LD) operation of either of the two latched instructions can be placed in the LD location 1216 of the single bucket 1212, the store (ST) operation of either latched instruction can be placed in the ST location of the single bucket, either of the A0 operations may enter the A0 location 1220, and any of the A0 or A1 operations may enter the A1 location 1222.
[0137]
The IDU handles each instruction as a whole. If the IDU fails to pack two instructions into one bucket, it leaves one complete instruction behind. For example, if the first IR latch has only an A0 operation and the second IR latch has all four operations, the IDU will not take the A1 operation from the second IR latch and merge it with the A0 operation. The A0 operation is sent alone, and the set of operations in the second IR latch is transferred to the first IR latch and sent in the next phase, during which time the second IR latch is reloaded. In other words, the operations stored in the first IR latch are always sent, and the operations stored in the second IR latch are combined with those of the first IR latch if possible. If the first IR and the second IR cannot be combined, the preceding IDU and IAU pipeline stages must wait. The IDU can combine the first and second IR latch operations in the following situations.
[0138]
1. Both use only A0, or
2. One uses only A0, the other uses only A0, LD and ST
Based on the functionality described above and on basic logic design practice, those skilled in the art can readily design the combinational logic that generates the control signals needed by the control gates to merge the contents of the first and second IR latches.
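The two combining conditions listed above can be expressed as a short predicate. This is a sketch of one reading of the rule, not the patented control logic: it assumes that "uses only A0, LD and ST" means the instruction's operations are drawn from that set, and the names `can_combine`, `A0_ONLY` and `A0_LD_ST` are hypothetical.

```python
A0_ONLY = frozenset({"A0"})
A0_LD_ST = frozenset({"A0", "LD", "ST"})


def can_combine(first_ops, second_ops):
    """Decide whether two latched instructions fit in a single bucket.

    Each argument is the set of operations ("LD", "A0", "ST", "A1") that one
    decoded instruction needs. ASSUMPTION: "uses only A0, LD and ST" is read
    as "uses a non-empty subset of {A0, LD, ST}".
    """
    first, second = frozenset(first_ops), frozenset(second_ops)
    if not first or not second:
        return False
    # Condition 1: both instructions use only A0.
    if first == A0_ONLY and second == A0_ONLY:
        return True
    # Condition 2: one uses only A0, the other uses only A0, LD and ST.
    if first == A0_ONLY and second <= A0_LD_ST:
        return True
    if second == A0_ONLY and first <= A0_LD_ST:
        return True
    return False
```

Applied to the example in the text, a first instruction needing only A0 and a second needing all four operations gives `False`, so the A0 operation would be sent alone.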
[0139]
The emulation mode is entered when the IDU identifies an instruction belonging to a subset of the instructions requiring emulation. When the emulation mode is set, an emulation mode control signal (EMUL_MODE) is sent to the decoder of the IDU. Direct decoding of CISC-type instructions is discontinued and the microcode routine corresponding to the identified instruction is sent to the IDU for decoding. When the microcode routine finishes emulating the subset instructions, the IDU decoder returns to basic mode to continue decoding CISC type instructions. Basically, IDUs treat basic CISC type instructions and microcode instructions in a similar manner. Only the interpretation of the operation code changes.
[0140]
FIGS. 13 to 17 show Karnaugh maps of the default (basic) mode of the 1-byte and 2-byte opcode instructions. The numbers shown on the left and top of each Karnaugh map are the opcode bits. For example, a one-byte opcode with the hex code 0F corresponds to the first row and the eleventh column, which is the "2-byte escape" instruction.
[0141]
The gray shaded instruction boxes in the Karnaugh maps of FIGS. 13-17 are basic instructions, and the white boxes are instructions that must be emulated.
[0142]
FIG. 18 is a block diagram of the instruction decoder 1202 of the IDU. Instruction decoder 1202 includes a plurality of decoders used to decode CISC type instructions and microcode routines.
[0143]
A type generator (TYPE_GEN) decoder 1402 receives the first fully aligned instruction on the Align_IR bus and decodes the instructions one by one to identify the type field of the instruction.
[0144]
The identified type field corresponds to the nano-instruction operations described above in relation to the IDU. The type is represented by a 4-bit field, with one bit for each operation (load, ALU0, store, ALU1) in the bucket. The TYPE_GEN decoder 1402 specifies which of these four operations are required to execute the instruction. Depending on the instruction received, anywhere from one to four of these operations may be required to carry out the CISC-type instruction.
[0145]
For example, an add operation that sums the contents of one register with the contents of another register requires only a single ALU nano-instruction. On the other hand, an instruction that adds the contents of a register to the contents of a memory location requires three nano-instruction operations: a load operation, an ALU operation, and a subsequent store operation. (The data must be read from memory, added to the register, and stored back in memory.) More complex CISC-type instructions require all four nano-instructions.
[0146]
The TYPE_GEN decoder 1402 has three type decoders. The first decoder, type 1, assumes that the instruction has a 1-byte opcode followed by the ModR/M byte and calculates the type based on that assumption. The second decoder, type 2, assumes that the instruction has a 2-byte opcode: the first byte is an escape byte, followed by the second byte, which is the opcode, and the third byte, which is the ModR/M byte. The third decoder, type F, assumes that the instruction is a floating point instruction and decodes the instruction based on that assumption.
[0147]
The TYPE_GEN decoder has three 4-bit type instruction output buses (type 1, type 2, and type F). Each bit corresponds to one of the four nanoinstruction operations in the bucket. A particular type field specifies which nanoinstructions are required to execute a CISC type instruction. For example, when all four bits are logic HIGH, the CISC instruction requires one load and one store operation and two ALU operations.
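The 4-bit type field can be illustrated with a small bit-mask sketch. The source states that each bit corresponds to one of the four operations but does not give the bit positions, so the ordering below is an assumption, as are the constant and function names.

```python
# ASSUMED bit positions; the source does not specify the ordering.
LOAD, ALU0, STORE, ALU1 = 0b0001, 0b0010, 0b0100, 0b1000


def type_field(*operations):
    """Combine the required operations into a 4-bit type field."""
    field = 0
    for op in operations:
        field |= op
    return field


def operation_count(field):
    """Number of nano-instruction operations the type field calls for."""
    return bin(field & 0b1111).count("1")


# A register-to-register add needs one ALU operation.
REG_ADD = type_field(ALU0)
# A register-to-memory add needs a load, an ALU operation and a store.
MEM_ADD = type_field(LOAD, ALU0, STORE)
```

A field with all four bits HIGH corresponds to the case described above in which one load, one store and two ALU operations are required.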
[0148]
The remaining decoders in FIG. 18 each include sections labeled 1, 2, and F, which decode on the assumption of a 1-byte opcode, a 2-byte opcode, and a floating point instruction, respectively. The invalid results are simply not selected; a multiplexer selects the correct decoder output.
[0149]
The two ALU operations (ALU0 and ALU1) each have an 11-bit opcode field. The 11 bits consist of the 8 bits of the opcode and 3 opcode extension bits from the adjacent ModR/M byte. For most CISC-type instructions processed by the IDU, the opcode bits are copied directly into the nano-instruction operation. However, some CISC-type instructions require the opcode to be replaced. In this case, the IDU does not merely pass the CISC-type opcode through to the instruction execution unit (IEU). As will be clear to those skilled in the art, the type and number of functional units in the IEU determine whether replacement of the opcode in the IDU is necessary for a particular CISC-type instruction.
[0150]
In order for the IEU to process ALU operations, it must receive information about which functional units are required to process the specified ALU operation. Thus, the IDU includes a functional zero unit (F_0UNIT) decoder 1410 consisting of three decoders, F_0UNIT1, F_0UNIT2, and F_0UNITF. The output of the decoder is a multi-byte field that indicates which functional unit is needed to handle the A0 ALU operation. Functional unit decoding for the A1 ALU operation is identical but is handled by a separate decoder, F_1UNIT 1412.
[0151]
CISC-type instructions often perform operations using registers implied by the opcode. For example, many instructions imply that the AX register should be used as an accumulator. Therefore, a constant generator (CST_GEN) decoder 1414 is included to generate a register index based on the opcode of the CISC instruction. The CST_GEN decoder determines which registers are implied based on the particular opcode. Multiplexing to generate the correct source and destination register index of the nanoinstruction is described below in connection with FIG.
[0152]
An additional 2-bit control signal, TempCount (TC), is input to the CST_GEN decoder. The TC control signal is a 2-bit counter that represents four circulating temporary registers for use by the IEU as dummy registers. The temporary (or dummy) registers represent another set of register values that can be passed on by the CST_GEN decoder, in addition to the implied registers. Because there are two ALU operations with two registers for each operation, the constant generator decoder delivers four constant fields. Each of the constant register buses is 20 bits wide, with each constant being 5 bits in total, so that one of the 32 registers in the IEU can be selected.
[0153]
Next, the selection generator (SEL_GEN) decoder, shown generally at block 1416, will be described. The SEL_GEN decoder includes a flag need/modify (FG_NM) decoder 1418. The FG_NM decoder decodes for 1-byte opcodes, 2-byte opcodes, and floating point instructions. For example, the i486 instruction set has a total of six flags. Flags that an instruction depends on must be valid before execution of the instruction begins, and flags may also be changed by the instruction. The FG_NM decoder outputs two signals for each flag: one bit indicates whether the flag is needed to execute the instruction, and the other bit indicates whether the instruction actually changes the flag.
[0154]
Register invalidation information relating to the ALU0 and ALU1 operations is decoded by the INVD1 and INVD2 decoders indicated by 1420 and 1422, respectively. The INVD1 and INVD2 decoders are also part of the SEL_GEN decoder 1416. The INVD1 and INVD2 decoders generate control signals for the IEU. These signals indicate whether the ALU registers are to be used. Three possible register indices are specified by each ALU operation. One is used as a source and/or destination register, and the other two are limited to source register designation only. A 4-bit field is used to specify which registers are needed for the operation.
[0155]
The SEL_GEN decoder 1416 further includes a FLD_CNT decoder 1424 that indicates which of the register fields are required for the CISC instruction. The FLD_CNT decoder specifies which of the two fields is the source register and which is the destination register.
[0156]
The nanoinstruction generator (NIR_GEN) decoder is generally shown as block 1426. The input control signals for data size (DATA_SZ) and address size (ADDR_SZ) correspond to the default state in which the system is operating. In order to decode the final address as well as the size of the operand, the default mode must be known and the presence of the prefix (described above in connection with the IAU) must be known. The EMUL_MODE control signal is input to the NIR_GEN decoder, but is also used by other decoders.
[0157]
An escape detect (ESC_DET) input control signal is sent to the NIR_GEN decoder to indicate whether the instruction has a 2-byte opcode. Further, the selection opcode extension (SEL_OP_EXT) input control signal is used to cause the loading of the mailbox register when an emulation instruction is detected.
[0158]
The floating point register (FP_REG) input control signal passes the converted floating point register index to the IDU. For example, the floating point format of the i486 has eight registers for floating point numbers, which are accessed using a stack access scheme, i.e., register 0 is at the top of the stack, register 1 is the second from the top, and so on. This register stack is emulated by using eight linear registers with fixed indices. When the input instruction specifies a stack-relative register such as register 0, a conversion block (not shown) converts the stack-relative register index to the index of a linear register in a well-known manner. This allows the IDU to keep track of which register is at the top of the stack.
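The conversion block is not shown in the figures, so the following is only a sketch of one conventional way to do it, following the x87 top-of-stack pointer scheme: the linear index is the stack-relative index plus the current top, modulo 8. The class name `FpStackMap` and its methods are hypothetical.

```python
class FpStackMap:
    """Maps stack-relative ST(i) indices onto eight fixed linear registers."""

    def __init__(self):
        self.top = 0  # linear index currently acting as ST(0)

    def linear_index(self, st_i):
        """Linear register index for stack-relative register ST(st_i)."""
        return (self.top + st_i) % 8

    def push(self):
        """A push moves the top of stack down by one register (modulo 8)."""
        self.top = (self.top - 1) % 8

    def pop(self):
        """A pop moves the top of stack up by one register (modulo 8)."""
        self.top = (self.top + 1) % 8
```

Because only `top` changes on a push or pop, no register contents need to be moved to emulate the stack.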
[0159]
When the system branches to emulation mode, the IDU saves information about the instruction being emulated. The IDU stores the instruction data size (EM_DSIZE) and address size (EM_ASIZE) in addition to the destination register index (EM_RDEST), source (EM_RDEST2), and base index information (EM_BSIDX). This stored information is used by microcode routines to properly emulate instructions. For example, consider emulation of an add instruction. The microcode routine may check EM_ASIZE to determine the address size of the add instruction to know which address size to emulate.
[0160]
NIR_GEN decoder 1426 includes size decoder 1428. The fields generated by the SIZE decoder (i.e., SIZE1, SIZE2, SIZEF) represent the instruction's address size, operand size, and immediate data size. A 16-bit or 32-bit address size, an 8-bit, 16-bit or 32-bit operand size, and an 8-bit, 16-bit or 32-bit immediate data field size are extracted for each instruction.
[0161]
Another decoder within the NIR_GEN decoder is the load information (LD_INF) decoder 1430. The LD_INF decoder decodes information corresponding to load and store operations. The load information is used to perform the effective address calculation. Since the CISC instruction set typically supports many different addressing modes, the load information fields (LD_INF1, LD_INF2, LD_INFF) are used to specify which addressing mode is being used by the CISC instruction.
[0162]
The basic addressing mode of the i486 includes a segment field and an offset that are added together to determine the address. An index register can also be specified, together with a scale for the index register. For example, if the index register selects an element of an array, the element can be specified as 1, 2, 4, or 8 bytes in length. The index register can thus be scaled by 1, 2, 4, or 8 before it is added in to determine the address. The base and index can also be specified in the LD_INF fields.
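The address arithmetic described above can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, addresses are truncated to 32 bits, and the segment contribution is modeled as a plain base value to be added.

```python
ADDRESS_MASK = 0xFFFFFFFF  # assumption: 32-bit address arithmetic

def effective_offset(base=0, index=0, scale=1, displacement=0):
    """Offset = base + index * scale + displacement, with scale in {1, 2, 4, 8}."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & ADDRESS_MASK

def linear_address(segment_base, offset):
    """Add the segment base to the offset to obtain the final address."""
    return (segment_base + offset) & ADDRESS_MASK
```

For an array of 4-byte elements starting at `base`, element `index` is reached with `scale=4`, which is the case the scaling feature is meant to serve.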
[0163]
The nano-instruction opcode (NIR_OPC) decoder 1432 transfers the opcode for A1 operation (packet 1). The decoded fields (NIR_OPC1, NIR_OPC2, NIR_OPCF) consist of a first instruction byte (8 bits) and three extension bits from the second byte.
[0164]
A miscellaneous opcode (MISC_OPC) decoder 1434 indicates whether the instruction is floating point and whether a load instruction actually exists. The field generated by the MISC_OPC decoder will indicate whether floating data conversion is required. Since this information is easily extracted regardless of the format of the instruction, the decoder does not need to be multiplexed.
[0165]
The operation code for the A0 operation of packet 0 is specified by the operation code decoder 1436. The A0 operation code is usually copied directly from the input i486 operation code, but for some instructions it is replaced with another operation code. (As noted above, the function of the signals generated by the NIR_GEN decoder is specific to the CISC-type instruction set being decoded, and should therefore be apparent to one of ordinary skill in the art upon reviewing that instruction set together with the nano-instruction format of the present invention.)
The EXT_CODE decoder 1440 extracts a 3-bit operation code extension from the ModR/M byte.
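A minimal sketch of the extension extraction follows. In the x86 instruction format the opcode extension occupies bits [5:3] of the ModR/M byte; packing it above the 8-bit opcode to form an 11-bit value, as `nir_opcode` does, is an assumed layout chosen for illustration, and both function names are hypothetical.

```python
def ext_code(modrm):
    """3-bit opcode extension: bits [5:3] of the ModR/M byte."""
    return (modrm >> 3) & 0b111

def nir_opcode(opcode_byte, modrm):
    """Pack 8 opcode bits and 3 extension bits into 11 bits (layout assumed)."""
    return (ext_code(modrm) << 8) | (opcode_byte & 0xFF)
```

For instance, the byte pair 0x83 0xE8 (opcode 0x83 with extension 5) yields an extension of 5 and a packed value of 0x583 under this layout.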
[0166]
IN_ORDER decoder 1442 decodes the instruction to determine whether it must be executed "in order." If so, the IEU is instructed to do nothing with this instruction until the execution of all preceding instructions is completed. Once the execution of the instruction is completed, the execution of the subsequent instructions is started.
[0167]
The control flow jump size decoder 1444 indicates the displacement size of the jump specifying the address. This field, labeled CF_JV_SIZE, specifies the address size of the jump. This is specific to the type of addressing scheme used for the CISC type instruction set.
[0168]
A 1-bit decoder labeled DEC_MDEST 1446 indicates whether the destination of the instruction is a memory address.
[0169]
Finally, the instruction decoder includes three register code decoders 1438 for register code (index) selection. The i486 instruction format encodes the indices of register fields at various locations within the instruction. The indices of these fields are extracted by the RC decoder. The ModR/M byte also holds two register indices, which are used as the destination and/or source as specified by the opcode itself. Register code decoder 1438 generates three RC fields, RC1, RC2, and RC3. If the processor is not in emulation mode and the instruction is not a floating point instruction, the fields are extracted as follows: RC1 = ModR/M byte bits [2:0], RC2 = ModR/M byte bits [5:3], and RC3 = operation code bits [2:0]. For floating point instructions in basic (non-emulation) mode, RC1, RC2, and RC3 are assigned as follows.
[0170]
RC1: ST(0) = top of stack
RC2: ST(1) = second item from the top of the stack
RC3: ST(i) = the i-th item from the top of the stack, where i is specified in the opcode.
In emulation mode, RC1, RC2, and RC3 are assigned as follows.
[0171]
RC1: Byte 3 bits [4:0]
RC2: Byte 2 bits [1:0] and Byte 3 bits [7:5]
RC3: Byte 2 bits [6:1]
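For the basic case (not in emulation mode, not a floating point instruction), the bit-field extraction given above can be sketched in a few lines. The function name `register_codes` is hypothetical; the bit positions are those stated in the text. The emulation-mode and floating point assignments are not modeled here.

```python
def register_codes(opcode_byte, modrm):
    """Return (RC1, RC2, RC3) for a non-emulation, non-floating-point instruction."""
    rc1 = modrm & 0b111          # ModR/M byte bits [2:0]
    rc2 = (modrm >> 3) & 0b111   # ModR/M byte bits [5:3]
    rc3 = opcode_byte & 0b111    # operation code bits [2:0]
    return rc1, rc2, rc3
```

Which of the three fields is actually meaningful depends on the instruction; the decoder produces all three and lets later selection logic choose among them.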
FIG. 19 shows a representative block and logic gate diagram for the CST_GEN, NIR_GEN, and SEL_GEN decoders (1414, 1438, 1424). It should be understood that FIG. 19 is an example of how the one-byte operation code, two-byte operation code, and floating-point decoded results are selected, delayed, and multiplexed to generate the source and destination register indexes of the nano-instruction operations A0 and A1 and the destination register index of the load instruction. The same selection, delay, and multiplexing technique applies to all signals generated by the instruction decoder 1202, except that some signals do not generate separate one-byte opcode, two-byte opcode, and floating-point results. Furthermore, the results generated by this embodiment are application specific, in that they apply to decoding i486 instructions into the nanoinstruction format of the present invention. The principles described through these embodiments, however, are generally applicable to the alignment and decoding of instructions from CISC to RISC.
[0172]
As described above, the CST_GEN decoder 1414 produces three outputs, CST1, CST2, and CSTF, each of which consists of four constant 5-bit register fields (20 bits total). SEL_GEN generates register field control signals (FLD1, FLD2, FLDF) for multiplexer selection in the later multiplexing section, MUX 1512. The selection among the CST1, CST2, and CSTF results and among the FLD1, FLD2, and FLDF results is shown generally at multiplexer block 1502. The 3-bit MUX select line 1504 is used to select a result depending on whether the instruction has a one-byte operation code, has a two-byte operation code, or is a floating-point instruction.
[0173]
A half-cycle pipeline delay latch 1506 is used to delay the result selected by multiplexer 1502 and the three register code fields RC1, RC2, and RC3. Each input to the half-cycle pipeline delay latch 1506 is sent to a pair of oppositely clocked latches 1508. The contents of these latches are selected by multiplexer 1510. This arrangement is similar to the half-cycle data delay 316 described above in connection with the IAU.
[0174]
A further stage of multiplexing is shown in block 1512. The constant register field selected by multiplexer 1502 is input to multiplexer 1512 as four individual fields, labeled regc1 through regc4, as shown generally at 1514. Also shown as inputs to block 1512 are the register fields RC1, RC2, and RC3 extracted from the opcode and the ModR/M byte. Under the control of the FLD control signals 1520, the logic in block 1512 combines the regc fields and the RC fields to generate the source and destination register indexes a0_rd and a0_rs for operation A0, shown generally at 1516, as well as the source and destination register indexes a1_rd and a1_rs for operation A1, shown generally at 1518. The index ld_rd, the destination register index of the load instruction, is also selected in block 1512.
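The first selection stage and the split of the 20-bit constant result into four 5-bit fields can be sketched as below. The one-hot encoding of the 3-bit select, the placement of regc1 in the least significant bits, and the function names are all assumptions for illustration; the text does not specify them.

```python
def select_result(sel, one_byte, two_byte, floating):
    """Choose one decoded result using an (assumed) one-hot 3-bit select."""
    choices = {0b001: one_byte, 0b010: two_byte, 0b100: floating}
    if sel not in choices:
        raise ValueError("select must be one-hot")
    return choices[sel]

def split_regc(cst):
    """Split a 20-bit constant result into (regc1, regc2, regc3, regc4)."""
    # Assumed layout: regc1 occupies bits [4:0], regc4 occupies bits [19:15].
    return tuple((cst >> (5 * k)) & 0x1F for k in range(4))
```

A caller would first pick CST1, CST2, or CSTF with `select_result` and then pass the chosen 20-bit value to `split_regc` to obtain the four constant register indexes.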
[0175]
4.0 Decoded Instruction FIFO
A block diagram of the decoded instruction FIFO (DFIFO) of the present invention is shown in FIG. 20A. The DFIFO holds four complete buckets, each containing four nano-instructions, two immediate data fields, and one displacement field. Each bucket corresponds to one level of pipeline register in the DFIFO. These buckets are generated in the IDU and pushed into the DFIFO during each cycle in which the IEU requests a new bucket. The nano-instructions in a bucket are divided into two groups called packet 0 and packet 1. Packet 0 may consist of load, ALU, and/or store operations, which correspond to one, two, or three nano-instructions. Packet 1 can only be an ALU operation, corresponding to one nano-instruction. As a result of this division, a bucket can contain at most two ALU operations, only one of which can reference memory. If two consecutive instructions both require memory operands, they must be placed in separate buckets.
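The pairing constraint can be sketched with a simple greedy grouping. This is a simplified model, not the IDU's actual algorithm: it assumes every input instruction fits in a single packet, and the names `Insn` and `group_into_buckets` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Insn:
    name: str
    needs_memory: bool  # True if the instruction has a memory operand

def group_into_buckets(insns):
    """Greedily pair consecutive instructions; a pair may hold at most one memory reference."""
    buckets = []
    i = 0
    while i < len(insns):
        first = insns[i]
        has_next = i + 1 < len(insns)
        if has_next and not (first.needs_memory and insns[i + 1].needs_memory):
            buckets.append([first.name, insns[i + 1].name])
            i += 2
        else:
            # Alone in its bucket: last instruction, or both would reference memory.
            buckets.append([first.name])
            i += 1
    return buckets
```

In this model two back-to-back memory instructions always end up in different buckets, which mirrors the restriction stated above.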
[0176]
As can be seen from FIG. 20B, there is also some general information associated with each packet and with the bucket as a whole. This information is stored in the general information FIFO. By default, the four nano-instructions in a bucket are executed in order from NIR0 to NIR3. One of the bucket's general information bits can be set to indicate that NIR3 must be executed before NIR0 through NIR2. This feature makes it easier to group consecutive instructions into a single bucket, because their order no longer affects whether they can satisfy the bucket requirements.
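One reading of this ordering bit is sketched below: since packet 1 (taken here to be NIR3) cannot reference memory, a register-only instruction followed by a memory instruction can still share a bucket if the first instruction is placed in packet 1 and flagged to execute first. This interpretation, and the names `execution_order` and `place_pair`, are assumptions rather than something the text states outright.

```python
def execution_order(nir3_first):
    """Order in which the four nano-instructions of a bucket execute."""
    order = ["NIR0", "NIR1", "NIR2", "NIR3"]
    if nir3_first:
        order = ["NIR3"] + order[:3]
    return order

def place_pair(first_needs_memory, second_needs_memory):
    """Return (packet0_source, packet1_source, nir3_first) for two consecutive instructions."""
    if first_needs_memory and second_needs_memory:
        raise ValueError("two memory operands cannot share a bucket")
    if second_needs_memory:
        # The memory instruction must use packet 0, so the earlier
        # instruction goes to packet 1 and is flagged to execute first.
        return ("second", "first", True)
    return ("first", "second", False)
```

Program order is preserved in both cases; only the physical slot assignment changes.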
[0177]
FIG. 20C shows the immediate data and the displacement FIFO of buckets 0 to 4. IMM0 represents immediate data corresponding to packet 0, and IMM1 represents immediate data corresponding to packet 1. DISP represents the displacement corresponding to packet 0. Packet 1 does not use DISP information because the DISP field is only used as part of the address calculation.
[0178]
FIG. 21 shows specific examples of the above three types of nano-instructions. These tables provide information about the contents of each bucket.
[0179]
While various embodiments in accordance with the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Therefore, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
[Brief description of the drawings]
FIG. 1 is a block diagram of an instruction prefetch buffer of the present invention.
FIG. 2 is a block diagram of the instruction alignment unit of the present invention.
FIG. 3 is a representative flowchart showing an IAU instruction extraction and alignment method of the present invention.
FIG. 4 is a simplified timing diagram associated with the block diagram of FIG. 2 and the flowchart of FIG. 3.
FIG. 5 is a block diagram of a STACK of the present invention.
FIG. 6 is a block diagram of a next instruction detector (NID) of the present invention.
FIG. 7 is a block diagram of a residual next instruction detector (RNID) of the present invention.
FIG. 8 is a block diagram of the immediate data and displacement detector (IDDD) of the present invention.
FIG. 9 is a block diagram of a prefix detector (PD) of the present invention.
FIG. 10 is a block diagram of a prefix number (PRFX_NO) decoder according to the present invention.
FIG. 11 is a block diagram of the nanoinstruction bucket of the present invention.
FIG. 12 is a representative block diagram of an instruction decode unit (IDU) of the present invention.
FIG. 13 illustrates an instruction bit map of the present invention.
FIG. 14 illustrates an instruction bit map of the present invention.
FIG. 15 illustrates an instruction bit map of the present invention.
FIG. 16 illustrates an instruction bit map of the present invention.
FIG. 17 illustrates an instruction bit map of the present invention.
FIG. 18 is a block diagram showing an example of the instruction decoder section of the IDU of the present invention.
FIG. 19 is a representative block and logic diagram of one type of decoder of the instruction decoder shown in FIG. 18.
FIG. 20 is a conceptual block diagram of a decode FIFO according to the present invention.
FIG. 21 is a diagram showing an example of a field format of a nanoinstruction of the present invention.
FIG. 22 is a diagram showing a data structure format of a conventional CISC type instruction.

Claims

A method for converting a stream of non-native instructions for processing on a host processor, the method comprising:
(1) converting a stream of non-native instructions into less than a predetermined number of native instructions;
(2) storing at least two groups of said native instructions in at least two intermediate buckets capable of storing up to said predetermined number of native instructions;
(3) combining a subset of the at least two groups of the native instructions into a final bucket having a maximum capacity of the predetermined number of native instructions, so that the subset of the native instructions in the final bucket can be output for processing on the host processor.

The method of claim 1, wherein the at least two intermediate buckets are capable of storing up to four native instructions.

The method of claim 1, wherein the step of combining combines four native instructions at a time into the final bucket.

The method of claim 1, wherein the stream of non-native instructions includes at least two non-native instructions.

A method for converting a stream of non-native instructions for processing on a host processor, the method comprising:
(1) converting a stream of non-native instructions into native instructions;
(2) storing at least two groups of said native instructions in at least two intermediate buckets capable of storing up to four native instructions;
(3) combining a subset of the at least two groups of the native instructions into a final bucket, so that the subset of the native instructions in the final bucket can be output for processing on the host processor.

The method of claim 5, wherein the step of combining combines four native instructions at a time into the final bucket.

The method of claim 5, wherein the stream of non-native instructions includes at least two non-native instructions.