JP3732233B2 - Method and apparatus for predecoding variable byte length instructions in a superscalar microprocessor

Info

Publication number: JP3732233B2
Application number: JP50595198A
Authority: JP
Inventors: Tran, Thang M.
Original assignee: Advanced Micro Devices Inc
Current assignee: Advanced Micro Devices Inc
Priority date: 1996-07-16
Filing date: 1996-07-16
Publication date: 2006-01-05
Anticipated expiration: 2016-07-16
Also published as: JP2000515274A; EP0912923A1; WO1998002797A1

Description

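Background of the Invention
1. Field of the Invention
This invention relates to superscalar microprocessors and, more particularly, to the predecoding of variable byte length computer instructions within high performance, high frequency superscalar microprocessors.
2. Description of the Related Art
EP-A-0651322 discloses a known instruction cache for a superscalar microprocessor having a variable byte length instruction format. The superscalar microprocessor described there employs a method for predecoding variable byte length instructions in accordance with the preamble of appended claim 1, and has the features of the preamble of appended claim 14.
Superscalar microprocessors can achieve performance exceeding that of conventional scalar processors by allowing the parallel execution of multiple instructions. Because x86 family microprocessors are widely accepted, microprocessor manufacturers have made efforts to develop superscalar microprocessors that execute x86 instructions. Such superscalar microprocessors are advantageous in that they achieve relatively high performance while maintaining backward compatibility with the large body of existing software developed for earlier generations of microprocessors such as the 8086, 80286, 80386, and 80486.
The x86 instruction set is relatively complex and is characterized by a plurality of variable byte length instructions. A generic format illustrative of the x86 instruction set is shown in FIG. 1A. As shown in the figure, an x86 instruction consists of from one to five optional prefix bytes 102, followed by an operation code (opcode) field 104, an optional addressing mode (Mod R/M) byte 106, an optional scale-index-base (SIB) byte 108, an optional displacement field 110, and an optional immediate data field 112.
The opcode field 104 defines the basic operation of a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size of an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat a string operation a number of times. The opcode field 104 follows the prefix bytes 102, if any, and is one or two bytes in length. The addressing mode (MODRM) byte 106 specifies the registers used as well as the memory addressing mode. The base field of the scale-index-base (SIB) byte 108 specifies which register contains the base value for the address calculation, and the index field specifies which register contains the index value. The scale field specifies the power of two by which the index value is multiplied before it is added, along with any displacement, to the base value. The next instruction field is the optional displacement field 110, which is from one to four bytes in length. The displacement field 110 contains a constant used in address calculations. The optional immediate field 112, which is also from one to four bytes in length, contains a constant used as an instruction operand. The 80286 sets the maximum instruction length at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.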
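The field sizes given above determine the total length of an instruction. The following is a minimal sketch of that arithmetic; the class name `X86Layout` and its attribute names are hypothetical and chosen only for illustration, and the 15-byte ceiling is the 80386/80486 limit stated in the text.

```python
from dataclasses import dataclass


@dataclass
class X86Layout:
    """Byte counts of the fields in the generic format of FIG. 1A."""
    prefixes: int = 0      # prefix bytes 102
    opcode: int = 1        # opcode field 104 (one or two bytes)
    modrm: int = 0         # MODRM byte 106 (absent or one byte)
    sib: int = 0           # SIB byte 108 (absent or one byte)
    displacement: int = 0  # displacement field 110 (absent or 1-4 bytes)
    immediate: int = 0     # immediate field 112 (absent or 1-4 bytes)

    def length(self) -> int:
        """Total instruction length in bytes, checked against the stated limits."""
        if self.opcode not in (1, 2):
            raise ValueError("opcode field is one or two bytes")
        if self.modrm not in (0, 1) or self.sib not in (0, 1):
            raise ValueError("MODRM and SIB are at most one byte each")
        if not 0 <= self.displacement <= 4 or not 0 <= self.immediate <= 4:
            raise ValueError("displacement and immediate are at most four bytes")
        if self.prefixes < 0:
            raise ValueError("prefix count cannot be negative")
        total = (self.prefixes + self.opcode + self.modrm + self.sib
                 + self.displacement + self.immediate)
        if total > 15:
            raise ValueError("longer than the 15-byte limit of the 80386/80486")
        return total


print(X86Layout().length())                                   # lone opcode byte
print(X86Layout(modrm=1, sib=1, displacement=4).length())     # opcode + MODRM + SIB + disp32
```

Because every field after the opcode is optional, the length of an instruction cannot be known without examining its bytes, which is the alignment difficulty discussed in the following paragraphs.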
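Referring to FIG. 1B, several different variable byte length x86 instruction formats are shown. The shortest x86 instruction is only one byte long and comprises a single opcode byte, as shown in format (a). For certain instructions, the byte containing the opcode field also contains a register field, as shown in formats (b), (c), and (e). Format (j) shows an instruction with two opcode bytes. An optional MODRM byte follows the opcode bytes in formats (d), (f), (h), and (j). Immediate data follows the opcode bytes in formats (e), (g), (i), and (k), and follows the MODRM byte in formats (f) and (h). FIG. 1C illustrates several possible addressing mode formats (a)-(h). Formats (c), (d), (e), (g), and (h) contain MODRM bytes with offset (i.e., displacement) information. An SIB byte is used in formats (f), (g), and (h).
The complexity of the x86 instruction set makes it difficult to implement high performance x86-compatible superscalar microprocessors. One difficulty arises from the fact that instructions must be aligned with respect to the parallel-coupled instruction decoders of such processors before proper decode can take place. In contrast to most RISC instruction formats, the x86 instruction set consists of variable byte length instructions, so the start bytes of successive instructions within a line are not necessarily equally spaced, and the number of instructions per line is not fixed. As a result, simple fixed-length shifting logic cannot by itself solve the instruction alignment problem.
To help solve the problem of quickly aligning, decoding, and executing a plurality of variable byte length instructions in parallel, superscalar microprocessors employing instruction predecoding techniques have been proposed. In one such superscalar microprocessor, when instructions are written into the instruction cache from external main memory, a predecoder appends several predecode bits (referred to collectively as a predecode tag) to each byte. These bits indicate whether a byte is the start and/or end byte of an x86 instruction, the number of microinstructions required to implement the x86 instruction, and the location of the opcode and prefixes. When instructions are fetched from the cache, the superscalar microprocessor converts each instruction into one or more microinstructions referred to as ROPs. ROPs resemble RISC instructions in that they have a fixed length and a simple, consistent encoding. Because the x86 instructions are already tagged in the instruction cache with predecode bits indicating where instructions begin and end and how many ROPs each requires, it is a relatively simple task for a byte queue to locate instruction boundaries, convert each x86 instruction into one or more ROPs, and provide a fixed number of ROPs to the parallel instruction decoders.
Although the predecoding technique described above has been largely successful, more than 50% of the available storage space within the instruction cache array must be allocated to predecode bits. This limits the amount of storage in the instruction cache available for instruction code and/or increases the cost of the processor because of the larger die size.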
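Summary of the Invention
The problems outlined above are in large part solved by a method for predecoding variable byte length instructions in accordance with this invention. In one embodiment, a predecode unit is provided that can predecode variable byte length instructions before they are stored in an instruction cache. The predecode unit is configured to generate a plurality of predecode bits for each instruction byte. The predecode bits associated with each instruction byte are referred to collectively as a predecode tag. An instruction alignment unit then uses the predecode tags to dispatch the variable byte length instructions to a plurality of decode units, which form fixed issue positions within the superscalar microprocessor.
In one implementation, the predecode unit generates three predecode bits associated with each byte of instruction code: a "start" bit, an "end" bit, and a "functional" bit. The start bit is set if the associated byte is the first byte of an instruction. Similarly, the end bit is set if the byte is the last byte of an instruction. Rather than associating a dedicated meaning with the functional bit, the predecode unit is configured such that the meaning conveyed by the functional bit depends on its state (i.e., whether the functional bit is set) and on the state of the start bit for that byte. The meaning of the functional bit further depends on the state of the start bit of the preceding instruction byte.
For example, in one implementation, if the start bit for a particular byte is set, the functional bit indicates whether the instruction is a directly decodable "fast path" instruction or an MROM instruction (an instruction to be serialized via microcode). On the other hand, if the start bit for a particular byte is clear and the byte immediately follows a start byte (an instruction byte whose start bit is set), the functional bit indicates whether the opcode is the first byte of the instruction or whether a prefix is the first byte of the instruction. If the start bit for the byte is clear and the byte does not immediately follow a start byte, the functional bit indicates whether the associated byte is a MODRM or SIB byte, or is displacement or immediate data.
By utilizing the predecode information from the predecode unit, the instruction alignment unit can be implemented with a relatively small number of cascaded levels of logic gates, thereby accommodating very high frequency operation. Alignment of instructions from the instruction alignment unit to the decode units can further be accomplished in relatively few pipeline stages. In addition, the plurality of decode units to which the variable byte length instructions are aligned utilize the predecode tags to achieve relatively fast instruction decoding. Finally, since the predecode unit is configured such that the meaning of the functional bit of a particular predecode tag depends on the state of the start bit, a relatively large amount of predecode information can be conveyed with a relatively small number of predecode bits. This in turn allows the size of the instruction cache to be reduced without sacrificing performance.
Furthermore, with the information conveyed by the functional bits, the decode units know the exact locations of the opcode, displacement, immediate, register, and scale-index bytes. Accordingly, the decode units need not scan the instruction bytes serially. In addition, the functional bits allow the decode units to quickly calculate 8-bit linear addresses (via adder circuits) for use by other subunits within the superscalar microprocessor. Relatively fast decoding is therefore achieved, accommodating high performance.
Broadly speaking, this invention contemplates a method for predecoding variable byte length instructions in a superscalar microprocessor, comprising the steps of generating a start bit indicative of whether a byte of an instruction is a start byte, generating an end bit indicative of whether said byte of said instruction is an end byte, and generating a functional bit whose meaning depends on the value of said start bit.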
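Brief Description of the Drawings
Other objects and advantages of the invention will become apparent upon reading the following detailed description with reference to the accompanying drawings.
FIG. 1A is a diagram illustrating the generic x86 instruction set format.
FIG. 1B is a diagram illustrating several different variable byte length x86 instruction formats.
FIG. 1C is a diagram illustrating several possible x86 addressing mode formats.
FIG. 2 is a block diagram of a superscalar microprocessor that includes an instruction alignment unit for forwarding multiple instructions to six decode units.
FIG. 3 is a block diagram of the instruction alignment unit and the six decode units.
FIGS. 4A-4C are block diagrams illustrating the execution of an MROM instruction.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail below. The drawings and detailed description are not, however, intended to limit the invention to the particular form disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.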
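Detailed Description of the Invention
Referring next to FIG. 2, a block diagram of a superscalar microprocessor 200 including a predecode unit 202 that operates in accordance with the method of this invention is shown. As illustrated in the embodiment of FIG. 2, superscalar microprocessor 200 includes predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. A prefetch unit 203 is coupled to predecode unit 202. An instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218, and a load/store unit 222. Finally, a data cache 224 is shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions before they are sent to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from main memory (not shown) through prefetch unit 203. For each byte of instruction code, instruction cache 204 further stores an associated predecode tag. It is noted that instruction cache 204 may be implemented in a set-associative, fully associative, or direct-mapped configuration.
Prefetch unit 203 is provided to prefetch instruction code from main memory for storage within instruction cache 204. In one embodiment, prefetch unit 203 is configured to burst 64-bit wide code from main memory into instruction cache 204. It will be understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch unit 203.
As prefetch unit 203 fetches instructions from main memory, predecode unit 202 generates three predecode bits associated with each byte of instruction code: a "start" bit, an "end" bit, and a "functional" bit.
The start and end bits for each byte indicate the boundaries of instructions. The functional bit for each byte conveys further information about the byte or the instruction, such as whether the instruction can be decoded directly by decode units 208 or must be executed by invoking a microcode procedure controlled by MROM unit 209 (described in more detail below), whether the byte is a MODRM or SIB byte, or whether the byte is displacement or immediate data. The functional bit may also be used to indicate the location of the opcode byte. As will be understood from the description below, the meaning encoded in the functional bit of a particular instruction byte depends on the associated start bit.
Table 1 shows one example encoding of the predecode tags as implemented by predecode unit 202. As indicated in the table, if a given byte is the first byte of an instruction, predecode unit 202 sets the start bit for that byte when the byte is fetched from main memory and stored in instruction cache 204. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be decoded directly by the decode units, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be decoded directly by decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte and is set if the opcode is the second byte. It is noted that where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or SIB byte, or whether the byte contains displacement or immediate data.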
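The encoding rules just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the circuit of predecode unit 202: the function name `predecode` and its parameters (`is_mrom`, `has_prefix`, `data_from`) are hypothetical, the caller is assumed to already know the instruction's layout, and the polarity chosen for bytes 3-8 (functional bit set for displacement or immediate data, clear for MODRM or SIB) is an assumption, since the text states only that the bit distinguishes the two cases.

```python
from typing import List, NamedTuple, Optional


class Tag(NamedTuple):
    start: int
    end: int
    functional: int


def predecode(length: int, is_mrom: bool = False, has_prefix: bool = False,
              data_from: Optional[int] = None) -> List[Tag]:
    """Build one predecode tag per byte of a single instruction.

    data_from is the 0-based index of the first displacement/immediate
    byte, or None if the instruction has none (hypothetical parameter).
    """
    tags = []
    for i in range(length):
        if i == 0:
            # First byte: functional bit set means the instruction needs MROM.
            functional = int(is_mrom)
        elif i == 1:
            # Second byte: functional bit set means the opcode is this byte
            # and the first byte is a prefix.
            functional = int(has_prefix)
        else:
            # Later bytes: assumed polarity, set for displacement/immediate
            # data and clear for MODRM/SIB.
            functional = int(data_from is not None and i >= data_from)
        tags.append(Tag(start=int(i == 0), end=int(i == length - 1),
                        functional=functional))
    return tags


# A five-byte fast path instruction: opcode followed by four data bytes.
for tag in predecode(5, data_from=1):
    print(tag)
```

Note that the second byte's functional bit always reports the opcode location, even when that byte itself holds data, because its meaning is fixed by the start bit of the byte before it.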

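In accordance with Table 1 above, it is noted that predecode unit 202 of superscalar microprocessor 200 is configured to generate a functional bit for each byte of instruction code. The meaning of the functional bit depends on the value of the start bit associated with that byte. For the encoding scheme illustrated in Table 1, the meaning of the functional bit further depends on the value of the start bit associated with the preceding instruction byte.
For the specific implementation described above, it will be understood that if the start bit for a byte is set, the functional bit indicates whether the instruction is a directly decodable instruction or an MROM instruction (discussed further below). If the start bit associated with a particular byte of instruction code is clear and the byte immediately follows a byte of instruction code whose start bit is set, the functional bit indicates whether the opcode is the first byte or a prefix is the first byte. Further, if the start bit for a byte of instruction code is clear and the start bit of the preceding byte is also clear, the functional bit indicates whether the byte is a MODRM or SIB byte, or is displacement or immediate data. Among the subsequent bytes of a particular instruction, the second functional bit set in bytes 3-8 indicates immediate data.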
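Read in the other direction, a consumer of the tags needs only three bits to interpret a functional bit: the byte's own start bit, the preceding byte's start bit, and the functional bit itself. A minimal sketch follows; the function name and the returned strings are illustrative only, and the polarity for the last case is assumed, since the text says only that the bit distinguishes MODRM/SIB bytes from displacement/immediate data.

```python
def functional_bit_meaning(start: int, prev_start: int, functional: int) -> str:
    """Interpret a functional bit from the two start bits that qualify it."""
    if start:
        # The byte begins an instruction.
        return "MROM instruction" if functional else "fast path instruction"
    if prev_start:
        # The byte is the second byte of an instruction.
        return ("opcode is this byte; first byte is a prefix" if functional
                else "opcode is the first byte")
    # The byte is the third or a later byte of an instruction (assumed polarity).
    return "displacement or immediate data" if functional else "MODRM or SIB byte"


print(functional_bit_meaning(start=1, prev_start=0, functional=1))
print(functional_bit_meaning(start=0, prev_start=1, functional=0))
print(functional_bit_meaning(start=0, prev_start=0, functional=1))
```

One bit thus carries three different two-way distinctions, which is how the scheme keeps the tag to three bits per byte.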
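In accordance with the predecoding scheme employed by superscalar microprocessor 200 as described above, a predecode tag associated with each byte of instruction code is generated. The predecode tags are stored along with the instruction code in instruction cache 204 and are subsequently processed by the superscalar microprocessor. Since the meaning of the functional bit depends on the start bit of a particular byte and on the start bit of the preceding byte, a relatively large amount of predecode information can be conveyed to instruction alignment unit 206 and decode units 208, achieving relatively fast alignment and decoding of instructions. Because the number of bits required within the predecode tag is relatively small, the required size of instruction cache 204 can be reduced without sacrificing performance.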
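The storage cost of a three-bit tag can be worked out from the figures given for the embodiment (32 kilobytes of code, 16-byte lines, 8-bit bytes). The calculation below is a derived illustration, not a figure from the patent; it assumes the array holds only code bytes and their tags, ignoring address tags, valid bits, and any other per-line state.

```python
from fractions import Fraction

CACHE_BYTES = 32 * 1024   # instruction code capacity in the embodiment
LINE_BYTES = 16           # bytes per cache line
BITS_PER_BYTE = 8
TAG_BITS_PER_BYTE = 3     # start, end, and functional bits

lines = CACHE_BYTES // LINE_BYTES
tag_bits_per_line = LINE_BYTES * TAG_BITS_PER_BYTE
tag_storage_bytes = CACHE_BYTES * TAG_BITS_PER_BYTE // BITS_PER_BYTE

# Share of all stored bits (code plus tags) taken by the predecode tags.
tag_fraction = Fraction(TAG_BITS_PER_BYTE, BITS_PER_BYTE + TAG_BITS_PER_BYTE)

print(lines, tag_bits_per_line, tag_storage_bytes)
print(tag_fraction, float(tag_fraction))
```

Under these assumptions the tags occupy 3/11 of the stored bits, roughly 27%, compared with the more than 50% cited in the background section for the earlier predecode scheme.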
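Furthermore, with the information conveyed by the functional bits, the decode units know the exact locations of the opcode, displacement, immediate, register, and scale-index bytes. Accordingly, the decode units need not scan the instruction bytes sequentially. In addition, the functional bits allow the decode units to quickly calculate 8-bit linear addresses (via adder circuits) for use by other subunits within the superscalar microprocessor. Relatively fast decoding is therefore achieved, accommodating high performance.
As stated previously, in one embodiment certain instructions within the x86 instruction set may be decoded directly by decode units 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. When an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate the desired operation. Examples of x86 instructions categorized as fast path instructions, and a description of the manner in which fast path and MROM instructions are handled, are provided further below.
Instruction alignment unit 206 is provided to channel or "funnel" variable byte length instructions from instruction cache 204 to the fixed issue positions formed by decode units 208A-208F. As described in conjunction with FIGS. 3-5, instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending on the locations of the start bytes of instructions within a line as delimited by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction is dispatched depends on the location of the start byte of that instruction as well as the location of the start byte of the preceding instruction, if any. Instructions starting at certain byte locations are further restricted to being issued to only one predetermined issue position. Specific details follow.
Before proceeding with the description of the alignment of instructions from instruction cache 204 to decode units 208, general aspects of the other subsystems employed within the exemplary superscalar microprocessor 200 of FIG. 2 are described. In the embodiment of FIG. 2, each of decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208F routes displacement and immediate data to the corresponding reservation station unit 210A-210F. Output signals from decode units 208 include bit-encoded execution instructions for functional units 212, as well as operand address information, immediate data, and/or displacement data.
The superscalar microprocessor of FIG. 2 supports out-of-order execution and therefore employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and recovery from branch mispredictions, and to facilitate precise exceptions. As will be appreciated by those skilled in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register, thereby storing speculative register state. Reorder buffer 216 may be implemented in a first-in-first-out configuration in which speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer. Other specific configurations of reorder buffer 216 are also possible, as described below. If a branch prediction is incorrect, the results of speculatively executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to the respective reservation station units 210A-210F. In one embodiment, each reservation station unit 210A-210F can hold instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags, and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that in the embodiment of FIG. 2, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F. Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210, and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and on to functional unit 212B, and so on.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those skilled in the art will appreciate that the x86 register file includes eight 32-bit real registers (typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP), as described in detail below. Reorder buffer 216 contains temporary storage locations for results that change the contents of these registers, thereby allowing out-of-order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction that, upon decode, is determined to modify the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations that contain the speculatively executed contents of a given register. If, following decode of a given instruction, it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, reorder buffer 216 forwards to the corresponding reservation station either 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.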
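The register operand lookup described above reduces to a three-way choice: a value from the reorder buffer, a tag from the reorder buffer, or a value from the register file. The sketch below models only that choice; `RobEntry`, `read_operand`, the use of list indices as tags, and the list-ordered-oldest-first convention are all assumptions made for illustration, not details of reorder buffer 216.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    dest: str                    # register the instruction will update
    value: Optional[int] = None  # None until a functional unit produces it


def read_operand(reg: str, rob: List[RobEntry],
                 regfile: Dict[str, int]) -> Tuple[str, int]:
    """Return ("value", v) or ("tag", index) for a register source operand.

    rob is ordered oldest first, so the most recently assigned location
    for a register is the last matching entry.
    """
    for tag in range(len(rob) - 1, -1, -1):
        entry = rob[tag]
        if entry.dest == reg:
            if entry.value is not None:
                return ("value", entry.value)  # speculative result is ready
            return ("tag", tag)                # result still outstanding
    return ("value", regfile[reg])             # no reserved location


regfile = {"EAX": 1, "EBX": 2, "ECX": 7}
rob = [RobEntry("EAX", 5), RobEntry("EBX"), RobEntry("EAX")]
print(read_operand("EAX", rob, regfile))  # newest EAX entry has no value yet
print(read_operand("ECX", rob, regfile))  # falls through to the register file
```

Searching from the newest entry backward is what makes the lookup return the most recently assigned location when several pending instructions write the same register.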
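Details regarding suitable reorder buffer implementations may be found in "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and in the co-pending, commonly assigned Japanese Patent Application No. Hei 6-263317 (Japanese Laid-Open Publication No. Hei 7-18260) by Witt et al., entitled "Superscalar Microprocessor" and filed October 29, Heisei 6 (1994). These documents are incorporated herein by reference in their entirety.
Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F can store instruction information for up to three pending instructions.
Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit, along with the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operands are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210F has been tagged with the location of a previous result value within reorder buffer 216, corresponding to an instruction that modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.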
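The tag-and-forward behavior described above can be sketched as follows. The names `Pending` and `forward_result` and the tuple encoding of operands are hypothetical; the model shows only how a broadcast result replaces matching tags so that a waiting instruction becomes eligible for issue.

```python
from dataclasses import dataclass
from typing import List, Tuple

Operand = Tuple[str, int]  # ("value", v) or ("tag", reorder buffer location)


@dataclass
class Pending:
    op: str
    operands: List[Operand]

    def ready(self) -> bool:
        """An instruction may issue once every operand is a value."""
        return all(kind == "value" for kind, _ in self.operands)


def forward_result(stations: List[List[Pending]], tag: int, value: int) -> None:
    """Deliver a result to every pending instruction waiting on its tag."""
    for station in stations:
        for instr in station:
            instr.operands = [("value", value) if operand == ("tag", tag) else operand
                              for operand in instr.operands]


# Six stations, one per issue position; the embodiment allows up to
# three pending instructions in each.
stations: List[List[Pending]] = [[] for _ in range(6)]
stations[1].append(Pending("ADD", [("value", 1), ("tag", 3)]))

print(stations[1][0].ready())        # still waiting on tag 3
forward_result(stations, tag=3, value=42)
print(stations[1][0].ready(), stations[1][0].operands)
```

Because an instruction issues as soon as its own operands are ready, independent instructions in other stations can run ahead of it, which is the source of the out-of-order execution noted in the text.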
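In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes the instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. In such situations, the results of instructions in the original program sequence that occur after the mispredicted branch instruction are discarded, including those that were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
Results produced by functional units 212 are sent to reorder buffer 216 if a register value is being updated, and to load/store unit 222 if the contents of a memory location are being changed. If the result is to be stored in a register, reorder buffer 216 stores the result in the location that was reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F, where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.
Generally speaking, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, load/store unit 222 is configured with a store buffer having eight storage locations for data and address information for pending loads or stores. Functional units 212 arbitrate for access to load/store unit 222. When the buffer is full, a functional unit must wait until load/store unit 222 has room for the pending load or store request information. Load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.
Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of up to eight kilobytes of data. It will be understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set-associative configuration.
Details regarding the dispatch of instructions from instruction cache 204 through instruction alignment unit 206 to decode units 208 are considered next. FIG. 3 is a block diagram illustrating internal portions of one embodiment of instruction alignment unit 206 and of decode units 208A-208F, in relation to a line of instruction code provided from instruction cache 204. As stated previously, instruction alignment unit 206 is configured to channel variable byte length instructions (in this case, certain x86 instructions referred to as fast path instructions) to decode units 208A-208F.
As illustrated in FIG. 3, a latching unit 302 is incorporated as part of an output buffer section 301 of instruction cache 204. Latching unit 302 is capable of storing a line of instruction code provided from the storage array (not shown in FIG. 3) of instruction cache 204 before the line is sent to decode units 208.
Instruction alignment unit 206 of FIG. 3 includes a plurality of multiplexer circuits, referred to as multiplexer channels 304A-304G, coupled between latching unit 302 and decode units 208. A multiplexer control circuit 306 is further shown coupled to each of the multiplexer channels 304A-304G. In this embodiment, each decode unit 208A-208F includes an associated instruction decoder 318A-318F whose inputs are coupled to the respective multiplexer channel 304A-304F. Each decode unit 208A-208F further includes a respective displacement/immediate data buffer 330A-330F and a respective instruction issue unit 340A-340F.
During operation, a line of instruction code to be executed is provided to latching unit 302 from the storage array of instruction cache 204. Each byte of instruction code within instruction cache 204 is associated with a corresponding predecode tag comprising a start bit, an end bit, and a functional bit. When a line of instruction code is provided to latching unit 302, the predecode tag associated with each byte is provided to an input of multiplexer control circuit 306. As described in detail below, depending on the predecode tags corresponding to each line of instruction code within latching unit 302, multiplexer control circuit 306 controls multiplexer channels 304A-304G such that instruction bytes are selectively routed to designated instruction decoders 318A-318F. The instruction paths formed by decode units 208A-208F are referred to as issue positions. The channeling of instruction code through multiplexer channels 304A-304G depends on the location of the start byte of each instruction in the line held in latching unit 302. In the embodiment of FIG. 3, each of the first six multiplexer channels 304A-304F routes four contiguous bytes of instruction code from latching unit 302 to the respective instruction decoder 318A-318F. Multiplexer channel 304G is capable of channeling up to three contiguous bytes of instruction code to instruction decoder 318A.
Table 2 below illustrates the possible multiplexer channels 304A-304G through which a start byte may be channeled. As stated above, the channeling of instruction code depends on the location of the start bytes within a given line. It is noted that each of multiplexer channels 304A-304F is configured to route the lowest-order start byte among those allocated to it, unless that start byte has been selected for routing by a lower-order multiplexer channel.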

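Referring to Table 2, multiplexer channel 304A is capable of routing start bytes located at byte positions 0-2 to instruction decoder 318A. Multiplexer channel 304B is capable of routing start bytes located at byte positions 1-4 to instruction decoder 318B. Multiplexer channel 304C is capable of forwarding start bytes at byte positions 3-8 to decode unit 208C. Similarly, multiplexer channel 304D is capable of forwarding start bytes located at byte positions 6-10 to decode unit 208D, and multiplexer channel 304E is capable of forwarding start bytes located at byte positions 9-12 to decode unit 208E. Finally, multiplexer channel 304F is capable of forwarding start bytes located at byte positions 12-15 to instruction decoder 318F. Start bytes located at byte positions 13-15 may alternatively be routed through multiplexer channel 304G to a seventh issue position, which is used to wrap incomplete instructions (i.e., instructions that extend into the next line) around to the next cache line for decode. As explained further below, instruction bytes routed through multiplexer channel 304G are provided to instruction decoder 318A on the next clock cycle, when the remaining bytes of the instruction become available in latching unit 302.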
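The start byte ranges listed above can be captured as a small lookup table. In this sketch the dictionary and function names are hypothetical, and the wraparound channel 304G is left out because it serves incomplete instructions rather than a fixed issue position.

```python
# Start byte positions each multiplexer channel can route, as listed above.
CHANNEL_START_BYTES = {
    "304A": range(0, 3),    # byte positions 0-2
    "304B": range(1, 5),    # byte positions 1-4
    "304C": range(3, 9),    # byte positions 3-8
    "304D": range(6, 11),   # byte positions 6-10
    "304E": range(9, 13),   # byte positions 9-12
    "304F": range(12, 16),  # byte positions 12-15
}


def candidate_channels(start_byte: int) -> list:
    """Channels able to route an instruction starting at start_byte."""
    return [name for name, positions in CHANNEL_START_BYTES.items()
            if start_byte in positions]


for position in range(16):
    print(position, candidate_channels(position))
```

Every byte position in a 16-byte line is covered by at least one channel, and most are covered by two, which bounds the width of each multiplexer while still leaving some freedom in placing instructions.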
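If an instruction wraps to a subsequent cache line, the dispatch of the instruction to a designated issue position depends on the nature of the remaining bytes of the instruction that appear in the next line. In situations where only displacement or immediate data wraps around to the next cache line, this immediate or displacement data is provided through multiplexer channel 304A to displacement/immediate data buffer 330F. It is noted that in this situation, the preceding bytes of the instruction (those appearing in the previous cache line) will have been dispatched to instruction decoder 318F during the previous clock cycle. In situations where prefix, opcode, MODRM, and/or SIB bytes wrap around to the next cache line, the instruction information from the preceding line is routed through multiplexer channel 304G to instruction decoder 318A and is merged with the remainder of the instruction code during the next clock cycle.
It will be appreciated that by limiting the number of issue positions to which a given instruction within a line may be dispatched, the number of cascaded levels of logic required to implement instruction alignment unit 206 is advantageously reduced. Moreover, by limiting the dispatch of instructions whose start bytes fall at a selected subset of byte locations within a line (byte positions 5 and 11) to only a single issue position, the number of cascaded logic levels for instruction alignment is further reduced. Accordingly, instruction alignment unit 206 described above allows the implementation of a superscalar microprocessor with relatively few gates per pipeline stage, thereby accommodating very high frequency operation. For relatively long instructions, issue positions are skipped, but relatively high performance can still be achieved since other issue positions remain available for the remaining instructions within the cache line.
The defined fast path instructions are up to eight bytes in length and may include a single prefix byte. It is noted that by limiting the defined fast path instructions to only a single prefix byte, bytes 4 through 7 of any fast path instruction, if present, can contain only displacement and/or immediate data. Accordingly, in situations where an instruction exceeds four bytes, the first four bytes of the instruction are routed through the multiplexer channel allocated to the start byte of that instruction. The remaining bytes of the instruction are channeled through the multiplexer channel of the next issue position. In this situation, the instruction decoder of the issue position receiving the remaining bytes of the instruction detects the absence of a start bit at its first byte position, and accordingly passes the data to the displacement/immediate data buffer 330 of the preceding issue position and issues a NOOP instruction.
Thus, if the start byte of an instruction is located at byte position 0 of latching unit 302, that byte, along with the next three contiguous bytes at byte positions 1, 2, and 3, is provided to decode unit 208A. If the next start byte is at position 2 (i.e., if the first instruction is two bytes long), bytes 2-5 are routed through multiplexer channel 304B to decode unit 208B. In the embodiment of FIG. 3, each instruction decoder 318A-318F can decode only one instruction at a time. Thus, even if the start bytes of more than one instruction are provided to, for example, instruction decoder 318A, only the first instruction is decoded. Bytes beyond the first end byte, which correspond to further instructions within a given instruction decoder, are extraneous and are effectively ignored. It is noted that multiplexer channels 304 of instruction alignment unit 206 could alternatively be configured such that only a single instruction (or portion thereof) is channeled to a given instruction decoder 318, in accordance with the start and end predecode bits of the instructions.
In accordance with the foregoing, if the first instruction starts at byte position 0, bytes 0-3 are provided to instruction decoder 318A. If the instruction is longer than four bytes, bytes 4-7 of latching unit 302 are provided through multiplexer channel 304B to instruction decoder 318B, which then passes the data to displacement/immediate data buffer 330A. In this situation, multiplexer channel 304C routes the next start byte appearing in the code to instruction decoder 318C. On the other hand, if the first instruction starting at byte location 0 is four bytes or less, the next instruction, beginning with the start byte of the second instruction, is routed through multiplexer channel 304B. If that instruction is more than four bytes long, the immediate or displacement data corresponding to it is routed through multiplexer channel 304C to displacement/immediate data buffer 330B. The remaining multiplexer channels operate similarly.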
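The handling of instructions longer than four bytes can be sketched as a function that returns the four-byte windows handed to each issue position. The function name and the "decode"/"data" role labels are hypothetical, and the sketch assumes the instruction lies entirely within one line, so wraparound through channel 304G is not modeled.

```python
from typing import List, Tuple


def route_instruction(start: int, length: int,
                      position: int) -> List[Tuple[int, List[int], str]]:
    """Return (issue_position, byte_positions, role) windows for one instruction.

    Each channel passes four contiguous bytes; bytes past the instruction's
    end byte are extraneous and ignored by the receiving decoder.
    """
    if not 1 <= length <= 8:
        raise ValueError("fast path instructions are one to eight bytes long")
    windows = [(position, list(range(start, start + 4)), "decode")]
    if length > 4:
        # The next issue position sees no start bit in its first byte, so it
        # forwards the bytes to the previous position's data buffer and
        # issues a NOOP.
        windows.append((position + 1, list(range(start + 4, start + 8)), "data"))
    return windows


print(route_instruction(start=0, length=6, position=0))
print(route_instruction(start=0, length=2, position=0))
```

A long instruction therefore consumes two issue positions, which is why the text notes that issue positions are skipped for relatively long instructions.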
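If immediate or displacement data wraps around to a subsequent line from an instruction that started in the previous line, the data is provided through multiplexer channel 304A to displacement/immediate data buffer 330F when the immediate or displacement data is available in latching unit 302. It is further noted that no instruction decode is performed, since displacement and immediate data require no decoding. The first instruction of the subsequent line is therefore routed through multiplexer channel 304B to instruction decoder 318B.
It is similarly noted that if prefix, opcode, MODRM, and/or SIB information wraps around from an instruction that started on the previous line, multiplexer channel 304G routes the preceding portion of the instruction to instruction decoder 318A, in which case the next instruction (corresponding to the first start byte within latching unit 302 during the next clock cycle) is routed through multiplexer channel 304B to instruction decoder 318B.
As will be better understood from the example below, situations arise in which none of the issue positions to which a given start byte may be provided is available, because those issue positions are occupied by previous instructions. When this occurs, the instruction and any subsequent instructions must be held until the next clock cycle for dispatch.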
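The hold condition can be illustrated with a small simulation of in-order dispatch from a single line. This is a simplified model under stated assumptions: every instruction is at most four bytes long, no instruction wraps to the next line, and the start byte positions in the example are hypothetical values chosen to trigger a hold (they are not taken from the patent's tables). The function name `dispatch_line` is likewise illustrative.

```python
from typing import Dict, List

# Start byte positions accepted by issue positions 0-5 (channels 304A-304F).
ISSUE_POSITION_RANGES = [range(0, 3), range(1, 5), range(3, 9),
                         range(6, 11), range(9, 13), range(12, 16)]


def dispatch_line(start_bytes: List[int]) -> List[Dict[int, int]]:
    """Return one {issue_position: start_byte} mapping per clock cycle."""
    cycles = []
    pending = list(start_bytes)
    while pending:
        assigned: Dict[int, int] = {}
        next_free = 0  # program order requires ascending issue positions
        while pending:
            start = pending[0]
            position = next((p for p in range(next_free, 6)
                             if start in ISSUE_POSITION_RANGES[p]), None)
            if position is None:
                break  # hold this instruction and all later ones
            assigned[position] = start
            next_free = position + 1
            pending.pop(0)
        if not assigned:
            raise ValueError("start byte outside the 16-byte line")
        cycles.append(assigned)
    return cycles


# Hypothetical line with start bytes at positions 0, 2, 4, 6, 9, 11, and 13.
for cycle, mapping in enumerate(dispatch_line([0, 2, 4, 6, 9, 11, 13]), start=1):
    print(cycle, mapping)
```

In this example the instruction starting at byte 11 can go only to issue position 4, which is already taken in the first cycle, so it and the instruction after it are dispatched in a second cycle while the unused positions would issue NOOPs.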
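Table 3 below shows an example sequence of x86 instructions. Instructions 1 through 7 and the first byte of instruction 8 are shown within cache line 1. Cache line 2 begins with the second byte of instruction 8 and further includes instructions 9 through 16.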

Table 4 below illustrates how the sequence of instructions of Table 3 above is dispatched by instruction alignment unit 206 to decode units 208A-208F.

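Instructions 1-5 are dispatched during the first clock cycle to issue positions 0-4, corresponding to instruction decoders 318A-318E, respectively. Instruction 6, which begins at byte position 11 of latching unit 302, can be channeled only to issue position 4, corresponding to instruction decoder 318E. However, since issue position 4 is already occupied by instruction 5, instruction 6 cannot be dispatched during this cycle. Accordingly, multiplexer control circuit 306 causes the issue position of instruction decoder 318F to issue a NOOP (no operation) during the decode stage in which instructions 1-5 are decoded.
During clock cycle 2, instruction 6 is dispatched to issue position 4 and instruction 7 is dispatched to issue position 5. It is noted that when these instructions are decoded, multiplexer control circuit 306 causes the issue positions of instruction decoders 318A-318D to issue NOOP instructions. Since instruction 8 wraps around to the next cache line, the first byte of this instruction is wrapped around through multiplexer channel 304G to instruction decoder 318A on the next clock cycle.
During clock cycle 3, instruction 8 is dispatched to issue position 0. It is noted that the first byte of instruction 8 is wrapped around from byte position 15 of the preceding line. Instructions 9 and 10 are further dispatched through multiplexer channels 304B and 304C to issue positions 1 and 2, respectively. Upon decode of instructions 8-10, instruction issue units 340D-340E issue NOOP instructions.
Instructions 11 and 12 are dispatched to issue positions 2 and 3 during clock cycle 4. Instruction 13 begins at byte 7 and cannot be routed to issue position 4. Dispatch of instruction 13 must therefore be held until the next clock cycle.
During clock cycle 5, instructions 13 through 16 are dispatched to issue positions 2 through 5, respectively. As described above, during the decode of instructions 13-16, instruction issue units 340A and 340B cause NOOP instructions to be issued for issue positions 0 and 1.
Referring again to FIG. 2, x86 instructions that are not designated as fast path instructions are executed under the control of MROM unit 209 using stored microcode. MROM unit 209 parses such instructions into a series of fast path instructions, which are dispatched over one or more clock cycles. As stated previously, predecode unit 202 is configured such that, upon encountering a predefined MROM instruction, it sets the functional bit associated with the first byte of that instruction. As described in further detail below, this condition is readily detectable by MROM unit 209 and enables serialization of the instruction.
When an MROM instruction within a line of code in latching unit 302 is detected by MROM unit 209, that instruction and any instructions following it are not dispatched during the current cycle. Any instructions preceding it are dispatched in accordance with the description above.
During subsequent clock cycles, MROM unit 209 provides a series of fast path instructions through instruction alignment unit 206 to decode units 208 in accordance with the microcode for that particular MROM instruction. Once all of the microcoded instructions have been dispatched through alignment unit 206 to decode units 208 to effectuate the desired MROM operation, the instructions following the MROM instruction can be dispatched.
Table 5 below shows an example of an x86 assembly language code segment that includes an MROM instruction (REP MOVSB).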

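FIGS. 4A-4C are block diagrams of portions of superscalar microprocessor 200 illustrating the dispatch and decode of the instructions of Table 5 during successive clock cycles. During the first clock cycle, shown in FIG. 4A, the first two instructions (MOV ECX, S_LEN and CLD) are routed through multiplexer channels 304A and 304B to issue positions 0 and 1 (i.e., instruction decoders 318A and 318B). Upon decode, MROM unit 209 further causes decode units 208C-208F to issue NOOP instructions.
The microcoded instructions that effectuate the REP MOVSB instruction are dispatched during cycles 2 through N, as shown in FIG. 4B. During these cycles, a set of fast path instructions in accordance with the microcode stored within MROM unit 209 is dispatched through instruction alignment unit 206 to decode units 208A-208F. It is noted that this MROM sequence may take several cycles to complete.
Following complete dispatch of the MROM instruction, the remaining instructions of the line that follow the MROM instruction can be dispatched through multiplexer channels 304D-304F to issue positions 3-5. Upon decode of these instructions, MROM unit 209 causes decode units 208A-208C to issue NOOP instructions.
While instruction alignment unit 206 described above in conjunction with FIGS. 2-4 is configured to selectively route instructions to the particular issue positions shown in Table 2, it will be understood that other configurations are also possible. That is, the particular issue positions to which a given instruction within a line of memory may be dispatched may differ from those described above. In addition, the number of issue positions provided within a superscalar microprocessor employing a decode unit in accordance with this invention may also vary. Other configurations of an instruction alignment unit for providing instructions to parallel decode units are possible, as are other configurations of the decode units.
The particular predecoding scheme employed by predecode unit 202 may also differ from that shown in Table 1. For example, the particular meanings associated with particular combinations of the start bit and functional bit values for a particular byte of instruction code may differ from those shown in Table 1. Further, while instruction alignment unit 206 and decode units 208 of the embodiment described above are configured to forward and decode certain raw x86 instructions (i.e., fast path instructions) directly, implementations of a superscalar microprocessor are also possible in which the instruction alignment unit is configured to convert raw x86 instructions into one or more fixed-length instructions, such as ROPs. In such a configuration, the plurality of decode units would be configured to receive and decode the converted instructions.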
上の開示が十分に理解されれば、当業者には多くの変形や修正が明らかとなるであろう。以下のクレームは、このような変更や修正のすべてを包含すると解釈されるべきである。Background of the Invention
1. Field of Invention
The present invention relates to superscalar microprocessors, and more particularly to predecoding variable byte length computer instructions within a high performance, high frequency superscalar microprocessor.
2. Explanation of related technology
EP-A-0651322 discloses a known instruction cache for a superscalar microprocessor having a variable byte length instruction format. The described superscalar microprocessor employs a method for predecoding variable byte length instructions according to the preamble of the appended claim 1 and has the preamble features of the appended claim 14.
Superscalar microprocessors can achieve performance that surpasses traditional scalar processors by enabling parallel execution of multiple instructions. Due to the wide acceptance of x86 family microprocessors, efforts have been made by microprocessor manufacturers to develop superscalar microprocessors that execute x86 instructions. Such a superscalar microprocessor achieves relatively high performance while maintaining previous version compatibility with a large amount of existing software developed for previous generations of microprocessors such as 8086, 80286, 80386 and 80486. This is advantageous.
The x86 instruction set is relatively complex and features multiple variable byte length instructions. A general format illustrating the x86 instruction set is shown in FIG. 1A. As shown, the x86 instruction is an optional prefix byte 102 of 1 to 5, followed by an operation code (opcode) field 104, an optional address mode (Mod r / M) byte 106, an optional scale-index-base. (SIB) consists of byte 108, optional displacement field 110 and optional immediate data field 112.
The opcode field 104 defines the basic operation of a particular instruction. The default behavior of certain opcodes can be modified by one or more prefix bytes. For example, one prefix byte can be used to change the instruction address or operand size and override the default segment used for the memory address, or to instruct the processor to repeat a series of operations a number of times. The opcode field 104 follows the prefix byte 102, if any, and has a length of 1 or 2 bytes. Address mode (MODRM) byte 106 identifies the register and memory address mode used. The scale-index-base (SIB) byte defines which register contains the base value for the address calculation, and the index field specifies which register contains the index value. The scale field specifies the power of 2 to be multiplied before the index value is added, and if there is any displacement in the base value. The next instruction field is an optional displacement field 110, which is 1 to 4 bytes in length. The displacement field 110 contains constants used for address calculation. The optional immediate field 112 is also 1 to 4 bytes long and contains a constant used as an instruction operand. 80286 sets the maximum instruction length to 10 bytes, while 80386 and 80486 both allow instruction lengths up to 15 bytes.
Referring to FIG. 1B, several different variable byte length x86 instruction formats are shown. The shortest x86 instruction is only 1 byte in length and contains a single opcode byte as shown in format (a). In some instructions, the byte containing the opcode field also contains a register field as shown in formats (b), (c) and (e). Format (j) indicates an instruction with two opcode bytes. In formats (d), (f), (h) and (j), an optional MODRM byte follows the opcode byte. Immediate data follows the opcode byte in formats (e), (g), (i) and (k), and follows the MODRM byte in formats (f) and (h). FIG. 1C illustrates some possible address mode formats (a)-(h). Formats (c), (d), (e), (g) and (h) contain MODRM bytes with offset (ie displacement) information. In the formats (f), (g) and (h), SIB bytes are used.
Due to the complexity of the x86 instruction set, it is difficult to implement a high performance x86 compatible superscalar microprocessor. One problem arises from the fact that instructions must be aligned with respect to such processor's parallel instruction decoder before proper decoding can take place. In contrast to most RISC instruction formats, the x86 instruction set consists of variable byte length instructions, and the start bytes of consecutive instructions within a line are not necessarily evenly spaced, and the number of instructions per line is also fixed. It has not been. As a result, using simple fixed-length shift logic itself cannot solve the instruction alignment problem.
In order to help solve the problem of quickly aligning, decoding and executing a plurality of variable byte length instructions in parallel, superscalar microprocessors employing instruction predecoding techniques have been proposed. In one such superscalar microprocessor, when an instruction is written from external main memory into the instruction cache, the predecoder sets several predecode bits (collectively referred to as predecode tags) to each Append to byte. These bits indicate whether a byte is the start and / or end byte of an x86 instruction, the number of microinstructions required to implement the x86 instruction, and the location of the opcode and prefix. As instructions are fetched from the cache, the superscalar microprocessor converts each instruction into one or more microinstructions called ROPS. ROPS are similar to RISC instructions in that they are fixed length and involve simple consistent coding. Since x86 instructions are already tagged in the instruction cache as pre-decoded bits that indicate where and where instructions start and end, and how many ROPS each need, It is a relatively simple task to locate and convert each x86 instruction to one or more ROPS and provide a fixed number of ROPS to the parallel instruction decoder.
Although the predecode techniques described above are largely successful, over 50% of the available storage space in the instruction cache array must be allocated to predecode bits. This limits the amount of storage in the instruction cache for instruction codes and / or increases the cost of the processor due to increased die size.
Summary of invention
Many of the problems outlined above are solved by the method for predecoding variable byte length instructions according to the present invention. In one embodiment, a predecode unit is provided that can predecode variable byte length instructions prior to them being stored in the instruction cache. The predecode unit is configured to generate a plurality of predecode bits for each instruction byte. A plurality of predecode bits associated with each instruction byte are collectively referred to as a predecode tag. The instruction alignment unit then dispatches variable byte length instructions to the plurality of decode units using predecode tags, which form a fixed issue location within the superscalar microprocessor.
In one embodiment, the predecode unit generates three predecode bits associated with each byte of the instruction code: a “start” bit, an “end” bit, and a “function” bit. The start bit is set if the associated byte is the first byte of the instruction. Similarly, the end bit is set if the byte is the last byte of the instruction. Rather than associating a dedicated meaning with a function bit, the predecode unit determines whether the meaning of the function bit has or is associated with that state (ie whether the function bit is set) and the start bit state of that byte. It is configured to depend. The meaning of the function bit also depends on the state of the start bit of the preceding instruction byte.
For example, if the start bit of a particular byte is set in some implementations, the function bit is a “fast path” instruction or MROM instruction (which should be serialized via microcode) that the instruction can directly decode. Command). On the other hand, if the start bit of a particular byte is cleared and that byte follows directly to the start byte (the instruction byte with that start bit set), the function bit is Indicates whether this is a byte or whether the prefix is the first byte of the instruction. If the start bit of the byte is cleared and the byte does not follow the start byte, the function bit indicates that the associated byte is a MODRAM or SIB byte, or displacement or immediate data.
By utilizing predecode information from the predecode unit, the instruction alignment unit can be implemented with a relatively small number of cascaded logic gate levels and can handle very high frequency operations. Decoding units can be completed with relatively few pipeline stages from instruction alignment units. In addition, multiple decode units in which variable byte length instructions are aligned utilize predecode tags to achieve relatively fast instruction decode. Finally, the predecode unit is configured so that the meaning of the functional bits of a particular predecode tag depends on the state of the start bit, so a relatively small number of predecode bits can result in a relatively large amount of predecode. Can carry information. This therefore allows the instruction cache size to be reduced without sacrificing performance.
Furthermore, with the information held by the function bits, the decode unit can know the exact position of the opcode, displacement, immediate value, register and scale index byte. Thus, there is no need for the decode unit to scan instruction bytes serially. In addition, the function bits allow the decode unit to quickly calculate (via an adder circuit) an 8-bit linear address that can be used by other subunits in the superscalar microprocessor. Therefore, relatively quick decoding can be achieved and high performance can be accommodated.
Broadly speaking, the present invention contemplates a method for predecoding variable byte length instructions in a superscalar microprocessor, the method comprising: generating a start bit indicating whether a byte of an instruction is a start byte; generating an end bit indicating whether the byte of the instruction is an end byte; and generating a function bit whose meaning depends on the value of the start bit.
[Brief description of the drawings]
Other objects and advantages of the invention will become apparent upon reading the following detailed description with reference to the accompanying drawings.
FIG. 1A is a diagram illustrating a generic x86 instruction set format.
FIG. 1B is a diagram illustrating several different variable byte length x86 instruction formats.
FIG. 1C is a diagram illustrating some possible x86 address mode formats.
FIG. 2 is a block diagram of a superscalar microprocessor including an instruction alignment unit that sends multiple instructions to six decode units.
FIG. 3 is a block diagram of an instruction alignment unit and six decode units.
FIGS. 4A-4C are block diagrams illustrating the execution of an MROM instruction.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will be described in detail below. However, the drawings and detailed description are not intended to limit the invention to the particular forms disclosed. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.
Detailed Description of the Invention
With reference now to FIG. 2, a block diagram of a superscalar microprocessor 200 including a predecode unit 202 operating in accordance with the method of the present invention is shown. As illustrated in the embodiment of FIG. 2, superscalar microprocessor 200 includes a predecode unit 202 and a branch prediction unit 220 coupled to instruction cache 204. A prefetch unit 203 is coupled to the predecode unit 202. An instruction alignment unit 206 is coupled between the instruction cache 204 and a plurality of decode units 208A-208F (collectively referred to as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station 210A-210F (collectively referred to as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (collectively referred to as functional units 212). Decode units 208, reservation stations 210 and functional units 212 are further coupled to reorder buffer 216, register file 218 and load/store unit 222. Finally, data cache 224 is shown coupled to load/store unit 222, and MROM unit 209 is shown coupled to instruction alignment unit 206.
In general, instruction cache 204 is a high speed cache memory provided for temporarily storing instructions before they are sent to the decode units 208. In one embodiment, the instruction cache 204 is configured to cache up to 32 kilobytes of instruction code, organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to the instruction cache 204 by prefetching code from main memory (not shown) through the prefetch unit 203. For each byte of instruction code, the instruction cache 204 further stores an associated predecode tag. Note that the instruction cache 204 can be implemented in a set associative, fully associative, or direct-mapped configuration.
The prefetch unit 203 is provided for prefetching an instruction code from the main memory and storing it in the instruction cache 204. In one embodiment, prefetch unit 203 is configured to burst a 64-bit wide code from main memory to instruction cache 204. It will be appreciated that various specific code prefetch techniques and algorithms can be employed in the prefetch unit 203.
When prefetch unit 203 fetches an instruction from main memory, predecode unit 202 generates three predecode bits associated with each byte of the instruction code: a “start” bit, an “end” bit, and a “function” bit.
The start and end bits of each byte indicate instruction boundaries. The function bit of each byte conveys further information about that byte or instruction, such as whether the instruction can be decoded directly by the decode units 208 or must be executed by invoking a microcode procedure controlled by the MROM unit 209 (described in more detail below), whether the byte is a MODRM or SIB byte, or whether the byte is displacement or immediate data. Function bits may also be used to indicate the location of the opcode byte. From the following description, it will be understood that the meaning encoded in the function bit of a particular instruction byte depends on the associated start bit.
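The way start and end bits delimit instructions can be illustrated with a minimal Python sketch. The function name `split_instructions` and the bit patterns are hypothetical and chosen only for illustration; the sketch models the tags as lists of 0/1 values, which is an assumption and not a description of the actual circuitry.

```python
def split_instructions(start_bits, end_bits):
    """Return (first, last) byte indices of each complete instruction.

    start_bits[i] is 1 if byte i is the first byte of an instruction,
    end_bits[i] is 1 if byte i is the last byte of an instruction.
    """
    spans = []
    first = None
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            first = i
        if e and first is not None:
            spans.append((first, i))
            first = None
    return spans


# Hypothetical 8-byte fragment: a 1-byte, a 3-byte and a 2-byte instruction,
# followed by the first two bytes of an instruction that continues elsewhere.
start = [1, 1, 0, 0, 1, 0, 1, 0]
end = [1, 0, 0, 1, 0, 1, 0, 0]
print(split_instructions(start, end))  # [(0, 0), (1, 3), (4, 5)]
```

Because every byte carries its own start and end bit, the boundaries are known without examining the instruction bytes themselves.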
Table 1 shows an example of the predecode tag encoding implemented by the predecode unit 202. As shown in the table, if a given byte is the first byte of an instruction, the start bit of that byte is set by predecode unit 202 when the byte is fetched from main memory and stored in instruction cache 204. If the byte is the last byte of an instruction, the end bit of that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the function bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be decoded directly by the decode units 208, the function bit associated with the first byte of the instruction is cleared. The function bit of the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. Note that in situations where the opcode is the second byte, the first byte is a prefix byte. The function bit values of instruction bytes 3-8 indicate whether the byte is a MODRM or SIB byte or whether the byte contains displacement or immediate data.

In accordance with Table 1 above, it is noted that the predecode unit 202 of the superscalar microprocessor 200 is configured to generate a function bit for each byte of the instruction code. The meaning of the function bit depends on the value of the start bit associated with that byte. For the encoding scheme illustrated in Table 1, the meaning of the function bit also depends on the value of the start bit associated with the preceding instruction byte.
For the specific example described above, it will be understood that if the start bit of a byte is set, the function bit indicates whether the instruction is a directly decodable instruction or an MROM instruction (described further below). If the start bit associated with a particular byte of instruction code is cleared and that byte directly follows a byte whose start bit is set, the function bit indicates whether the opcode is the first byte or a prefix is the first byte. In addition, if the start bit of a byte of instruction code is cleared and the start bit of the preceding byte is also cleared, the function bit indicates whether the byte is a MODRM or SIB byte or whether the byte is displacement or immediate data. Among the subsequent bytes of a particular instruction, the second function bit set in bytes 3-8 indicates immediate data.
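The dependence of the function bit on the two start bits can be sketched in Python as follows. The function name `function_bit_meaning` is hypothetical. The polarities for the first and second bytes follow the description above; the polarity used for the third and later bytes is an assumption made for illustration, since the text states only that the bit distinguishes the two cases.

```python
def function_bit_meaning(start, prev_start, func):
    """Interpret the function bit of one instruction byte.

    start      -- start bit of this byte (1 = first byte of an instruction)
    prev_start -- start bit of the preceding instruction byte
    func       -- function bit of this byte
    """
    if start:
        # First byte: set means the instruction cannot be decoded directly.
        return "MROM instruction" if func else "fast path instruction"
    if prev_start:
        # Second byte: set means the opcode is this byte, after a prefix.
        return "opcode is second byte (first byte is a prefix)" if func else "opcode is first byte"
    # Third or later byte. ASSUMPTION: this polarity is illustrative only.
    return "displacement or immediate data" if func else "MODRM or SIB byte"


print(function_bit_meaning(start=1, prev_start=0, func=0))  # fast path instruction
print(function_bit_meaning(start=0, prev_start=1, func=1))  # opcode is second byte (first byte is a prefix)
```

A single bit thus selects between six meanings, because the start bit of the byte and the start bit of the preceding byte supply the context.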
According to the predecode scheme employed by the superscalar microprocessor 200 as described above, a predecode tag associated with each byte of the instruction code is generated. Both the predecode tag and the instruction code are stored in the instruction cache 204 and subsequently processed by the superscalar microprocessor. Since the meaning of the function bit depends on the start bit of a particular byte and the start bit of the preceding byte, a relatively large amount of predecode information can be conveyed to the instruction alignment unit 206 and the decode units 208, and relatively fast instruction alignment and decoding can be achieved. Since the number of bits required in the predecode tag is relatively small, the required size of the instruction cache 204 can be reduced without sacrificing performance.
Furthermore, with the information held in the function bits, the decode unit knows the exact position of the opcode, displacement, immediate value, register and scale index bytes. Therefore, it is not necessary for the decode unit to scan the instruction bytes sequentially. In addition, the function bits allow the decode unit to quickly calculate an 8-bit linear address (via an adder circuit) for use by other subunits in the superscalar microprocessor. Accordingly, relatively fast decoding is achieved and high performance can be accommodated.
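The storage cost of the predecode tags follows from the figures given for this embodiment (32 kilobytes of instruction code, 16-byte lines, three predecode bits per byte). The short Python calculation below is a sketch of that arithmetic; the 5-bit tag used for comparison is a hypothetical alternative introduced only to show how the overhead scales and is not taken from the text.

```python
CACHE_BYTES = 32 * 1024   # instruction code capacity of the embodiment
LINE_BYTES = 16           # bytes per cache line
TAG_BITS_PER_BYTE = 3     # start, end and function bits


def tag_overhead_kib(bits_per_byte, cache_bytes=CACHE_BYTES):
    """Predecode tag storage, in kilobytes, for a given tag width."""
    return cache_bytes * bits_per_byte / 8 / 1024


lines = CACHE_BYTES // LINE_BYTES
print(lines)                                # 2048 lines
print(tag_overhead_kib(TAG_BITS_PER_BYTE))  # 12.0 kilobytes of tag storage
print(tag_overhead_kib(5))                  # 20.0 (hypothetical 5-bit tag)
```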
As noted above, in one embodiment, certain instructions in the x86 instruction set can be decoded directly by the decode units 208. These instructions are referred to as “fast path” instructions. The remaining instructions in the x86 instruction set are referred to as “MROM instructions”. An MROM instruction is executed by invoking the MROM unit 209. Upon encountering an MROM instruction, the MROM unit 209 parses the instruction and serializes it into a defined subset of fast path instructions that carry out the desired operation. Examples of x86 instructions classified as fast path instructions, and a description of how fast path and MROM instructions are handled, are given further below.
Instruction alignment unit 206 is provided for channeling or “funneling” variable byte length instructions from instruction cache 204 to the fixed issue positions formed by decode units 208A-208F. As described in connection with FIGS. 3-5, the instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending on the locations of the start bytes of instructions within a line delineated by the instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction is dispatched depends on the location of the start byte of that instruction, as well as the location of the start byte of the preceding instruction, if any. Instructions starting at certain byte locations are further restricted to being issued to only one predetermined issue position. Specific details follow.
Prior to describing the alignment of instructions from the instruction cache 204 to the decode units 208, general aspects of the other subsystems employed in the exemplary superscalar microprocessor 200 of FIG. 2 will be described. In the embodiment of FIG. 2, each of the decode units 208 includes decode circuitry for decoding the predetermined fast path instructions described above. In addition, each decode unit 208A-208F routes displacement and immediate data to the corresponding reservation station unit 210A-210F. The output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212, as well as operand address information, immediate data, and/or displacement data.
The superscalar microprocessor of FIG. 2 supports out-of-order execution, and therefore includes a reorder buffer 216 to preserve the original program sequence for register read and write operations, to implement register renaming, to allow speculative instruction execution and recovery from branch mispredictions, and to facilitate precise exceptions. As will be appreciated by those skilled in the art, a temporary storage location in reorder buffer 216 is reserved upon decoding an instruction that involves a register update, thereby storing speculative register state. The reorder buffer 216 may be implemented in a first-in first-out configuration in which speculative results move toward the “end” of the buffer as they are validated and written to the register file, thereby making room for new entries at the “head” of the buffer. Other specific configurations of reorder buffer 216 are also possible, as described below. If a branch prediction is incorrect, the results of speculatively executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
Immediate data and bit-encoded execution instructions provided at the outputs of the decode units 208A-208F are routed directly to the respective reservation station units 210A-210F. In one embodiment, each reservation station unit 210A-210F can hold instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags, and/or immediate data) for up to three pending instructions waiting to be issued to the corresponding functional unit. It is noted that in the embodiment of FIG. 2, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F. Accordingly, six dedicated “issue positions” are formed by the decode units 208, the reservation station units 210, and the functional units 212. Instructions aligned and dispatched to issue position 0 via decode unit 208A pass through reservation station unit 210A and are then passed to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and then to functional unit 212B, and so on.
When a particular instruction is decoded, if a required operand is a register location, register address information is routed to the reorder buffer 216 and the register file 218 simultaneously. As will be appreciated by those skilled in the art, the x86 register file contains eight 32-bit real registers (typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP), as described in more detail below. Reorder buffer 216 contains temporary storage locations for results that change the contents of these registers, thereby allowing out-of-order execution. A temporary storage location in reorder buffer 216 is reserved for each instruction that, upon decoding, is determined to modify the contents of one of the real registers. Thus, at various points during execution of a program, one or more locations of reorder buffer 216 may contain the speculatively executed contents of a given register. If, following the decoding of a given instruction, it is determined that the reorder buffer 216 has one or more previous locations assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from the reorder buffer 216 rather than from the register file 218. If there is no location reserved for the required register in reorder buffer 216, the value is obtained directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit via the load/store unit 222.
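The operand lookup rule described above (most recent reorder buffer value, else a tag, else the register file) can be sketched as follows. The class `ReorderBuffer`, its method names, and the dictionary-based register file are hypothetical constructs used only to illustrate the rule; they are assumptions and do not describe the actual hardware organization.

```python
class ReorderBuffer:
    """Toy model: entries are kept oldest first, one per pending register update."""

    def __init__(self):
        self.entries = []   # each entry: {"tag": int, "reg": str, "value": int or None}
        self.next_tag = 0

    def allocate(self, reg):
        """Reserve a location for an instruction that will modify `reg`."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "reg": reg, "value": None})
        return tag

    def write_result(self, tag, value):
        """Record the result produced by a functional unit."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value

    def lookup(self, reg, register_file):
        """Return ("value", v) or ("tag", t) for an operand register."""
        for entry in reversed(self.entries):       # most recent assignment first
            if entry["reg"] == reg:
                if entry["value"] is not None:
                    return ("value", entry["value"])
                return ("tag", entry["tag"])
        return ("value", register_file[reg])       # no pending update


regs = {"EAX": 1, "EBX": 2}
rob = ReorderBuffer()
t0 = rob.allocate("EAX")
print(rob.lookup("EAX", regs))   # ('tag', 0): result not yet produced
print(rob.lookup("EBX", regs))   # ('value', 2): taken from the register file
rob.write_result(t0, 42)
print(rob.lookup("EAX", regs))   # ('value', 42): speculative value from the buffer
```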
Details on implementing a suitable reorder buffer can be found in “Superscalar Microprocessor Design” by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and in pending Japanese Patent Application No. 6-263317 (Japanese Patent Application Laid-Open No. 7-18260) entitled “Superscalar Microprocessor”, filed on October 29, 1994 by Witt et al. These documents are incorporated herein by reference in their entirety.
The reservation station units 210A-210F are provided for temporarily storing instruction information to be executed speculatively by the corresponding functional units 212A-212F. As previously mentioned, each reservation station unit 210A-210F can store instruction information for up to three outstanding instructions.
Each of the six reservation stations 210A-210F includes locations for storing bit-encoded execution instructions and operand values to be executed speculatively by the corresponding functional unit. If a particular operand is not available, a tag for that operand is provided from the reorder buffer 216 and stored in the corresponding reservation station until the result is produced (i.e., until execution of the previous instruction is complete). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of the instruction is passed directly to any reservation station unit 210A-210F waiting for that result at the same time as the result is passed to update the reorder buffer 216 (this technique is commonly referred to as “result forwarding”). An instruction is issued to the functional unit for execution once all of its required operand values are available. That is, if an operand associated with a pending instruction in one of the reservation station units 210A-210F has been tagged with the location in reorder buffer 216 of a previous result value, corresponding to an earlier instruction that modifies the required operand, the pending instruction is not issued to the corresponding functional unit 212 until the result of that earlier instruction has been obtained. Thus, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. The reorder buffer 216 ensures that data consistency is maintained even in situations where read-after-write dependencies occur.
In one embodiment, each of the functional units 212 is configured to perform the integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations and branch operations. It is noted that a floating point unit (not shown) can also be employed to handle floating point operations.
Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction is incorrect, the branch prediction unit 220 flushes the instructions that entered the instruction execution pipeline after the mispredicted branch and causes the prefetch/predecode unit 202 to fetch the required instructions from the instruction cache 204 or main memory. In such a situation, the results of instructions in the original program sequence that occur after the mispredicted branch instruction are discarded, including those that were speculatively executed and temporarily stored in the load/store unit 222 and the reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
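The behavior of a reservation station holding tagged operands can be sketched in Python as follows. The class `ReservationStation` and its methods are hypothetical names; the capacity of three pending instructions comes from the embodiment described above, while the list-based bookkeeping is an assumption made for illustration.

```python
class ReservationStation:
    """Toy model of one reservation station with room for three instructions."""

    CAPACITY = 3

    def __init__(self):
        self.pending = []   # each: {"op": str, "operands": [("value", v) | ("tag", t)]}

    def add(self, op, operands):
        """Store a decoded instruction; return False if the station is full."""
        if len(self.pending) >= self.CAPACITY:
            return False
        self.pending.append({"op": op, "operands": list(operands)})
        return True

    def forward(self, tag, value):
        """A functional unit produced `value` for `tag`: fill in waiting operands."""
        for ins in self.pending:
            ins["operands"] = [
                ("value", value) if operand == ("tag", tag) else operand
                for operand in ins["operands"]
            ]

    def issue_ready(self):
        """Remove and return the oldest instruction whose operands are all values."""
        for i, ins in enumerate(self.pending):
            if all(kind == "value" for kind, _ in ins["operands"]):
                return self.pending.pop(i)
        return None


rs = ReservationStation()
rs.add("add", [("value", 1), ("tag", 7)])     # waits for the result tagged 7
rs.add("sub", [("value", 5), ("value", 3)])   # all operands available
print(rs.issue_ready()["op"])                 # sub (issued ahead of the older add)
rs.forward(7, 10)                             # result for tag 7 arrives
print(rs.issue_ready())                       # {'op': 'add', 'operands': [('value', 1), ('value', 10)]}
```

The example shows the younger instruction issuing first, which is the reordering that the reorder buffer later resolves.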
The result generated by the functional unit 212 is sent to the reorder buffer 216 if the register value has been updated, and sent to the load / store unit 222 if the contents of the memory location have changed. If the result is to be stored in a register, reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As previously mentioned, the result is also broadcast to the reservation station units 210A-210F if the pending instruction is waiting for the result of execution of the previous instruction to obtain the required operand value.
In general, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, the load/store unit 222 is configured with a store buffer having 8 storage locations for data and address information for pending loads or stores. The functional units 212 arbitrate for access to the load/store unit 222. If the buffer is full, a functional unit must wait until the load/store unit has room for the pending load or store request information. The load/store unit 222 also performs dependency checks of load instructions against pending store instructions to ensure that data consistency is maintained.
The data cache 224 is a high-speed cache memory provided for temporarily storing data transferred between the load/store unit 222 and the main memory subsystem. In one embodiment, the data cache 224 has a capacity to store up to 8 kilobytes of data. It will be appreciated that the data cache 224 may be implemented with a variety of specific memory configurations, including a set associative configuration.
Details of instruction dispatch from the instruction cache 204 through the instruction alignment unit 206 to the decode unit 208 are discussed below. FIG. 3 is a block diagram illustrating the interior of one embodiment of instruction alignment unit 206 and the interior of decode units 208A-208F for instruction code lines provided from instruction cache 204. As previously mentioned, the instruction alignment unit 206 is configured to channel variable byte length instructions (a type of x86 instruction in this case referred to as a fast path instruction) to the decode units 208A-208F.
As shown in FIG. 3, the latch unit 302 is incorporated as part of the output buffer section 301 of the instruction cache 204. Latch unit 302 may store a line of instruction code provided from a storage array of instruction cache 204 (not shown in FIG. 3) before it is sent to decode unit 208.
The instruction alignment unit 206 of FIG. 3 includes a plurality of multiplexer circuits, referred to as multiplexer channels 304A-304G, coupled between the latch unit 302 and the decode units 208. A multiplexer control circuit 306 is further shown coupled to each of the multiplexer channels 304A-304G. In this embodiment, each of decode units 208A-208F includes an associated instruction decoder 318A-318F whose input is coupled to the respective multiplexer channel 304A-304F. Each decode unit 208A-208F further includes a respective displacement/immediate data buffer 330A-330F and a respective instruction issue unit 340A-340F.
During operation, a line of instruction code to be executed is provided to the latch unit 302 from the storage array of the instruction cache 204. Each byte of instruction code in instruction cache 204 is associated with a corresponding predecode tag that includes a start bit, an end bit, and a function bit. When a line of instruction code is provided to latch unit 302, the predecode tag associated with each byte is provided to the input of multiplexer control circuit 306. As described in detail below, depending on the predecode tags corresponding to the line of instruction code in latch unit 302, multiplexer control circuit 306 controls multiplexer channels 304A-304G so that the instruction bytes are selectively routed to designated instruction decoders 318A-318F. The instruction paths formed by the decode units 208A-208F are called issue positions. Channeling of instruction code through multiplexer channels 304A-304G depends on the location of the start byte of each instruction within the line held in latch unit 302. In the embodiment of FIG. 3, each of the first six multiplexer channels 304A-304F routes four consecutive bytes of instruction code from latch unit 302 to a respective instruction decoder 318A-318F. Multiplexer channel 304G can channel up to three consecutive bytes of instruction code to instruction decoder 318A.
Table 2 below illustrates the multiplexer channels 304A-304G through which a start byte at each position can be channeled. As mentioned above, instruction code channeling depends on the location of the start bytes within a given line. It is noted that each of the multiplexer channels 304A-304F is configured to route the lowest-order start byte assigned to it, unless that start byte has been selected for routing by a lower-order multiplexer channel.

Referring to Table 2, multiplexer channel 304A can route a start byte located at byte positions 0-2 to instruction decoder 318A. Multiplexer channel 304B can route a start byte at byte positions 1-4 to instruction decoder 318B. Multiplexer channel 304C can route a start byte at byte positions 3-8 to instruction decoder 318C. Similarly, multiplexer channel 304D can route a start byte at byte positions 6-10 to instruction decoder 318D, and multiplexer channel 304E can route a start byte at byte positions 9-12 to instruction decoder 318E. Finally, multiplexer channel 304F can route a start byte at byte positions 12-15 to instruction decoder 318F. A start byte located at byte positions 13-15 may alternatively be routed through multiplexer channel 304G to a seventh issue position, which is used to wrap around incomplete instructions (i.e., instructions that continue into the next line) so that they can be decoded together with the next cache line. As will be described further below, the instruction bytes routed through multiplexer channel 304G are provided to instruction decoder 318A in the next clock cycle, when the remaining bytes of that instruction are available in latch unit 302.
If an instruction wraps around to a subsequent cache line, the issue position to which the instruction is dispatched depends on the nature of the remaining bytes of the instruction that appear on the next line. In situations where only displacement or immediate data wraps around to the next cache line, this immediate or displacement data is provided to the displacement/immediate data buffer 330F via multiplexer channel 304A. It is noted that in this situation, the preceding bytes of the instruction (which appear in the previous cache line) will have been dispatched to instruction decoder 318F during the previous clock cycle. In situations where prefix, opcode, MODRM and/or SIB bytes wrap around to the next cache line, the instruction information from the previous line is routed to instruction decoder 318A via multiplexer channel 304G in the next clock cycle and merged with the rest of the instruction code.
It will be appreciated that by limiting the number of issue positions to which a given instruction within a line can be dispatched, the number of cascaded logic levels required to implement the instruction alignment unit 206 can be advantageously reduced. In addition, by restricting instructions whose start byte lies in one of a selected subset of byte locations within a line (byte positions 5 and 11) to dispatch through only a single issue position, the number of cascaded logic levels needed for instruction alignment is reduced further. Thus, the instruction alignment unit 206 described above enables the implementation of a superscalar microprocessor with a relatively small number of gates per pipeline stage, thereby allowing very high frequency operation. Although issue positions may be skipped for relatively long instructions, relatively high performance can still be achieved because other issue positions remain available for the remaining instructions in the cache line.
Each defined fast path instruction is up to 8 bytes in length and may contain at most a single prefix byte. It is noted that by limiting the defined fast path instructions to a single prefix byte, bytes 4-7 of any fast path instruction are guaranteed to contain only displacement and/or immediate data, if any. Thus, in situations where an instruction exceeds 4 bytes, the first 4 bytes of the instruction are routed through the multiplexer channel assigned to the start byte of the instruction. The remaining bytes of the instruction are channeled through the multiplexer channel of the next issue position. In such a situation, the instruction decoder at the issue position that receives the remaining bytes of the instruction detects the absence of a start bit at its first byte position, passes the data to the displacement/immediate data buffer 330 of the preceding issue position accordingly, and issues a NOOP instruction.
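The byte ranges listed above can be captured in a small lookup, sketched here in Python. The names `CHANNEL_RANGES` and `candidate_channels` are hypothetical; the ranges themselves are taken directly from the preceding paragraph.

```python
# Start-byte positions (inclusive) that each multiplexer channel can route.
CHANNEL_RANGES = {
    "304A": (0, 2),
    "304B": (1, 4),
    "304C": (3, 8),
    "304D": (6, 10),
    "304E": (9, 12),
    "304F": (12, 15),
    "304G": (13, 15),   # wrap-around channel for incomplete instructions
}


def candidate_channels(position):
    """Return the channels that may route a start byte at `position` (0-15)."""
    return [name for name, (lo, hi) in CHANNEL_RANGES.items() if lo <= position <= hi]


print(candidate_channels(7))    # ['304C', '304D']
print(candidate_channels(5))    # ['304C']
print(candidate_channels(11))   # ['304E']
print(candidate_channels(13))   # ['304F', '304G']
```

Most positions are reachable by two channels, while a start byte at position 5 or 11 has a single possible channel.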
Thus, if the start byte of an instruction is located at byte position 0 of latch unit 302, that byte is provided to decode unit 208A along with the next three consecutive bytes at byte positions 1, 2 and 3. If the next start byte is at position 2 (i.e., the first instruction is 2 bytes in length), bytes 2-5 are routed through multiplexer channel 304B to decode unit 208B. In the embodiment of FIG. 3, each instruction decoder 318A-318F can decode only one instruction at a time. Thus, even if the start bytes of more than one instruction are provided to instruction decoder 318A, for example, only the first instruction is decoded. Bytes beyond the first end byte, which correspond to further instructions, are extraneous to a given instruction decoder and are effectively ignored. Note that the multiplexer channels 304 of the instruction alignment unit 206 may alternatively be configured to channel only a single instruction (or a portion thereof) to a given instruction decoder 318, in accordance with the start and end predecode bits of the instruction.
In accordance with the above, if the first instruction starts at byte position 0, bytes 0-3 are provided to instruction decoder 318A. If the instruction is longer than 4 bytes, bytes 4-7 of latch unit 302 are provided to instruction decoder 318B via multiplexer channel 304B, which then passes the data to displacement/immediate data buffer 330A. In this situation, multiplexer channel 304C routes the next start byte that appears in the code to instruction decoder 318C. On the other hand, if the first instruction starting at byte location 0 is 4 bytes or less in length, the next instruction is routed through multiplexer channel 304B, beginning with the start byte of the second instruction. If that second instruction is longer than 4 bytes, the immediate or displacement data corresponding to it is routed through multiplexer channel 304C to the displacement/immediate data buffer 330B. The remaining multiplexer channels operate similarly.
If immediate or displacement data wraps around to a subsequent line from an instruction starting on the previous line, this data is sent through multiplexer channel 304A to the displacement/immediate data buffer 330F when the immediate or displacement data becomes available in latch unit 302. It is further noted that no instruction decoding is performed on this data, since displacement and immediate data need not be decoded. The first instruction in the subsequent line is therefore routed to instruction decoder 318B via multiplexer channel 304B.
It is also noted that if prefix, opcode, MODRM, and/or SIB information wraps around from an instruction starting on the previous line, multiplexer channel 304G routes the preceding part of the instruction to instruction decoder 318A. In that case, the next instruction (corresponding to the first start byte in latch unit 302 for the next clock cycle) is routed through multiplexer channel 304B to instruction decoder 318B.
As will be better understood from the following example, it may happen that none of the issue positions to which a given start byte can be routed is available, because these issue positions are occupied by previous instructions. When such a situation occurs, the instruction and any subsequent instructions must be held until the next clock cycle for dispatch.
Table 3 below shows an exemplary sequence of x86 instructions. Instructions 1 through 7 and the first byte of instruction 8 are located in cache line 1. Cache line 2 begins with the second byte of instruction 8 and further includes instructions 9-16.
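The dispatch rules discussed so far can be approximated by a simplified Python model. This is a sketch under stated assumptions, not the actual multiplexer control logic: the function `dispatch_cycle` is hypothetical, instructions are considered strictly in program order, an instruction longer than 4 bytes is assumed to consume the next issue position for its remaining bytes, and wrap-around through multiplexer channel 304G is not modeled. The byte ranges are those given in the discussion of Table 2.

```python
# Start-byte ranges (inclusive) for issue positions 0-5 (channels 304A-304F).
ISSUE_RANGES = [(0, 2), (1, 4), (3, 8), (6, 10), (9, 12), (12, 15)]


def dispatch_cycle(pending):
    """Assign instructions to issue positions for one clock cycle.

    pending -- list of (start_byte, length) tuples in program order.
    Returns (assignments, held), where assignments maps an issue position
    to the start byte routed there and held lists the instructions that
    must wait for the next clock cycle.
    """
    assignments = {}
    next_free = 0
    for index, (start, length) in enumerate(pending):
        position = next(
            (p for p in range(next_free, len(ISSUE_RANGES))
             if ISSUE_RANGES[p][0] <= start <= ISSUE_RANGES[p][1]),
            None,
        )
        if position is None:
            # No usable issue position: hold this and all later instructions.
            return assignments, pending[index:]
        assignments[position] = start
        # Bytes 4-7 of a long instruction occupy the following issue position.
        next_free = position + 2 if length > 4 else position + 1
    return assignments, []


# Hypothetical line of short instructions (start bytes chosen for illustration).
line = [(0, 2), (2, 2), (4, 3), (7, 2), (9, 2), (11, 2), (13, 2)]
first, held = dispatch_cycle(line)
print(first)                    # {0: 0, 1: 2, 2: 4, 3: 7, 4: 9}
print(held)                     # [(11, 2), (13, 2)]
print(dispatch_cycle(held)[0])  # {4: 11, 5: 13}
```

In this hypothetical line, the instruction starting at byte 11 can use only issue position 4, which is already occupied, so it and the following instruction are held for one cycle.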

Table 4 below illustrates how the sequence of instructions in Table 3 above is dispatched by the instruction alignment unit 206 to the decode units 208A-208F.

Instructions 1-5 are dispatched to issue locations 0-4 corresponding to decode units 318A-318E, respectively, during the first clock cycle. Instruction 6 starting at byte position 11 of latch unit 302 may be channeled only to issue position 4 corresponding to decode unit 318E. However, since issue position 4 is already occupied by instruction 5, instruction 6 cannot be dispatched during this cycle. Accordingly, the multiplexer control circuit 306 causes the decode unit 318F to issue a NOOP (no operation) during the decode phase decoded by the instruction 1-4.
During clock cycle 2, instruction 6 is dispatched to issue position 4 and instruction 7 is dispatched to issue position 5. It is noted that while these instructions are decoded, the multiplexer control circuit 306 causes decode units 318A-318D to issue NOOP instructions. Since instruction 8 wraps around to the next cache line, the first byte of this instruction is routed to instruction decoder 318A via multiplexer channel 304G in the next clock cycle.
During clock cycle 3, instruction 8 is dispatched to issue position 0. It is noted that the first byte of instruction 8 wraps around from byte position 15 of the preceding line. Instructions 9 and 10 are further dispatched to issue positions 1 and 2 via multiplexer channels 304B and 304C, respectively. While instructions 8-10 are decoded, instruction issue units 304D-E issue NOOP instructions.
Instructions 11 and 12 are dispatched to issue positions 2 and 3 during clock cycle 4. Instruction 13 begins at byte 7 and cannot be routed to issue position 4. Therefore, the dispatch of instruction 13 must be held until the next clock cycle.
During clock cycle 5, instructions 13-16 are dispatched to issue positions 2-5, respectively. As described above, instruction issue units 340A and 340B cause NOOP instructions to be issued to issue positions 0 and 1 during decoding of instructions 13-16.
Referring back to FIG. 2, instructions included in the x86 instruction subset that are not designated as fast path instructions are executed under the control of the MROM unit 209 using stored microcode. The MROM unit 209 parses such instructions into a series of fast path instructions, which are dispatched in one or more clock cycles. As previously mentioned, predecode unit 202 is configured such that when a pre-designated MROM instruction is encountered, a function bit associated with the first byte of the instruction is set. As will be explained in more detail below, this condition is easily detectable by the MROM unit 209 and allows instruction serialization.
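The detection step mentioned above can be illustrated with a short sketch. It assumes, for illustration only, that a byte begins an MROM instruction when both its start bit and its function bit have the value 1; the encoding actually used is the one given in Table 1, and the function name `find_first_mrom` is hypothetical.

```python
def find_first_mrom(start_bits, function_bits):
    """Return the byte index of the first MROM instruction in a line, or None.

    Assumption: a byte starts an MROM instruction when its start bit and its
    function bit are both 1. The real encoding is defined by the predecode
    scheme and may differ.
    """
    for index, (start, function) in enumerate(zip(start_bits, function_bits)):
        if start == 1 and function == 1:
            return index
    return None

# Example line of 8 bytes: instructions start at bytes 0, 3 and 5.
start_bits    = [1, 0, 0, 1, 0, 1, 0, 0]
function_bits = [0, 1, 0, 0, 0, 1, 1, 0]
print(find_first_mrom(start_bits, function_bits))  # 5
```

Because the check involves only two predecode bits per byte, no opcode decoding is needed to locate the instruction at which serialization must begin.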
When the MROM unit 209 detects an MROM instruction within a line of code in latch unit 302, neither that instruction nor any subsequent instruction is dispatched during the current cycle. Any preceding instructions are dispatched as described above.
During subsequent clock cycles, the MROM unit 209 provides a series of fast path instructions to the decode unit 208 via the instruction alignment unit 206 according to the microcode for that particular MROM instruction. Once all the microcoded instructions are dispatched to the decode unit 208 via the alignment unit 206 to perform the desired MROM operation, the instructions following the MROM instruction can be dispatched.
Table 5 below shows an example of an x86 assembly language code segment containing an MROM instruction (REP MOVSB).

FIGS. 4A-4C are block diagrams of a portion of superscalar microprocessor 200 illustrating the dispatch and decoding of the instructions in Table 5 during successive clock cycles. During the first clock cycle, shown in FIG. 4A, the first two instructions (MOV CX, S_LEN and CLD) are routed to issue positions 0 and 1 via multiplexer channels 304A and 304B (i.e., to decode units 318A and 318B). During decoding, the MROM unit 209 further causes decode units 208C-208F to issue NOOP instructions.
The microcoded instructions that carry out the REP MOVSB instruction are dispatched during cycles 2 through N, as shown in FIG. 4B. During these cycles, a set of fast path instructions according to the microcode stored in the MROM unit 209 is dispatched to the decode units 208A-208F via the instruction alignment unit 206. It is noted that this MROM sequence can take several cycles to complete.
Following complete dispatch of the MROM instruction, the remaining instructions on the line following the MROM instruction can be dispatched to issue locations 3-5 via multiplexer channels 304D-304F. When these instructions are decoded, MROM unit 209 causes decode units 208A-208C to issue NOOP instructions.
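The three phases of this example (instructions preceding the MROM instruction, the microcode sequence, and the instructions following it) can be sketched as below. This is a simplified model under stated assumptions: the function name `serialize_line` is hypothetical, the microcode contents `uop_a` to `uop_c` and the names `insn_4` to `insn_6` are placeholders rather than the contents of Table 5 or of the MROM unit, and each group of fast path instructions is assumed to fit in a single cycle, so issue-position constraints are ignored.

```python
def serialize_line(instructions, microcode):
    """Return the operations dispatched in each cycle for one line of code.

    instructions: instruction names in program order.
    microcode: dict mapping an MROM instruction name to its microcode sequence,
               given as a list of cycles, each a list of fast path operations
               (hypothetical contents).
    """
    cycles = []
    group = []  # fast path instructions waiting to be dispatched together
    for name in instructions:
        if name in microcode:
            if group:
                cycles.append(group)  # instructions preceding the MROM instruction
                group = []
            # The microcode sequence occupies one or more cycles on its own.
            cycles.extend(list(cycle) for cycle in microcode[name])
        else:
            group.append(name)
    if group:
        cycles.append(group)  # instructions following the last MROM instruction
    return cycles

line = ["MOV CX, S_LEN", "CLD", "REP MOVSB", "insn_4", "insn_5", "insn_6"]
microcode = {"REP MOVSB": [["uop_a", "uop_b"], ["uop_c"]]}
for number, ops in enumerate(serialize_line(line, microcode), start=1):
    print(number, ops)
# 1 ['MOV CX, S_LEN', 'CLD']
# 2 ['uop_a', 'uop_b']
# 3 ['uop_c']
# 4 ['insn_4', 'insn_5', 'insn_6']
```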
The instruction alignment unit 206 described above in connection with FIGS. 2-4 is configured to selectively route instructions to the specific issue positions shown in Table 2, but it will be understood that other configurations are possible. That is, the particular issue position to which a given instruction within a line of memory is dispatched may differ from that described above. Furthermore, the number of issue positions provided in a superscalar microprocessor employing a decode unit according to the invention can also vary. Other configurations of instruction alignment units for providing instructions to the parallel decode units are also possible, as are other configurations of the decode units.
The particular predecode scheme employed by the predecode unit 202 may also differ from that shown in Table 1. For example, the specific meaning of a particular combination of the start bit and function bit values of a particular byte of an instruction code may differ from the specific meanings shown in Table 1. Further, although the instruction alignment unit 206 and decode units 208 of the above-described embodiment are configured to directly transfer and decode certain raw x86 instructions (i.e., fast path instructions), an implementation of the superscalar microprocessor in which the instruction alignment unit is configured to convert raw x86 instructions into one or more fixed-length instructions, e.g., ROPs, is also possible. In such a configuration, a plurality of decode units would be configured to receive and decode the converted instructions.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. The following claims should be construed to include all such changes and modifications.

Claims

1. A method for predecoding variable byte length instructions in a superscalar microprocessor (200), comprising:
predecoding a variable byte length instruction to generate a boundary bit indicating whether a byte of the variable byte length instruction is a boundary byte; and
predecoding the variable byte length instruction to generate a function bit having a meaning that depends on the value of the boundary bit.

2. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to claim 1, wherein the boundary bit is a start bit, the method further comprising:
generating an end bit indicating whether the byte of the instruction is an end byte.

3. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to claim 1 or 2, wherein the meaning of the function bit also depends on the value of a predecode bit associated with another byte of the instruction.

4. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to claim 1 or 2, wherein the meaning of the function bit further depends on whether or not adjacent bytes of the instruction are boundary bytes.

5. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to claim 1 or 2, wherein the meaning of the function bit further depends on the value of the corresponding start bit of the previous instruction byte.

6. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to any of claims 1 to 5, wherein the function bit indicates whether the operation code is the first byte of the instruction.

7. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to any of the preceding claims, further comprising:
providing the start bit, the end bit and the function bit to an instruction decoder (208).

8. The method for predecoding variable byte length instructions in a superscalar microprocessor (200) according to claim 7, further comprising:
detecting the start bit, the end bit and the function bit in the instruction decoder (208) to determine a boundary of the instruction.

9. A superscalar microprocessor (200), comprising:
a predecode unit (202) for fetching a plurality of variable byte length instructions and generating a predecode tag associated with the bytes of the variable byte length instructions;
an instruction cache (204) for storing the fetched variable byte length instructions and the generated predecode tag; and
a plurality of decode units (208) for receiving a plurality of the variable byte length instructions stored in the instruction cache according to the predecode tag,
wherein the predecode tag includes a boundary bit having a value indicating whether the byte is a boundary of the instruction, and further includes a function bit having a meaning that depends on the value of the boundary bit.

10. The superscalar microprocessor (200) of claim 9, wherein the meaning of the function bits further depends on a value of a predecode bit associated with another byte of the instruction.

11. The superscalar microprocessor (200) of claim 9, wherein the meaning of the function bits further depends on the value of the corresponding start bit of a previous instruction byte.

12. The superscalar microprocessor (200) of claim 9, wherein the meaning of the function bits further depends on whether adjacent bytes of the instruction are boundary bytes.

13. The superscalar microprocessor (200) of claim 9, wherein the function bit indicates whether the operation code is the first byte of an instruction.

14. The superscalar microprocessor (200) according to any of claims 9 to 13, wherein the boundary bit is a start bit having a value indicating whether the byte is a start byte of the instruction, and the plurality of decode units (208) are for decoding designated instructions corresponding to the plurality of variable byte length instructions, the superscalar microprocessor further comprising an instruction alignment unit (206) coupled between the instruction cache (204) and the plurality of decode units (208) to provide decodable instructions to the plurality of decode units (208).

15. The superscalar microprocessor (200) of claim 14, wherein the instruction alignment unit (206) is configured to provide the instructions to one of the plurality of decode units (208).

16. The superscalar microprocessor (200) of claim 14 or 15, wherein each of the plurality of decode units (208) is configured to decode a predetermined subset of an x86 instruction set.

17. The superscalar microprocessor (200) of claim 14, 15 or 16, wherein the predecode tag further includes an end bit indicating whether the byte is an end byte of the instruction.

18. The superscalar microprocessor (200) according to claim 14, 15, 16 or 17, wherein the instruction alignment unit (206) is configured to provide the predecode tag to at least one of the plurality of decode units (208).

19. The superscalar microprocessor (200) of claim 18, wherein the at least one of the plurality of decode units (208) is configured to detect the predecode tag to determine a boundary of the instruction.

20. The superscalar microprocessor (200) according to any of claims 9 to 19, wherein the plurality of variable byte length instructions are organized as lines in the instruction cache (204), and wherein one line includes a predetermined number of bytes.

21. The superscalar microprocessor (200) according to any of claims 9 to 20, further comprising a plurality of functional units (212) configured to receive output signals from the plurality of decode units (208).

22. The superscalar microprocessor (200) of claim 21, wherein the output signals from the plurality of decode units (208) include bit-encoded execution instructions.

23. The superscalar microprocessor (200) according to claim 21 or 22, further comprising a plurality of reservation stations (210) coupled to the plurality of decode units (208) and the plurality of functional units (212), wherein the plurality of reservation stations (210) are configured to temporarily store the output signals from the plurality of decode units (208) before they are issued to the plurality of functional units (212).

24. A superscalar microprocessor (200) according to claim 21, 22 or 23, wherein a dedicated functional unit (212) is associated with each of the plurality of decode units (208).

25. The superscalar microprocessor (200) according to any of claims 9 to 24, further comprising a reorder buffer (216) coupled to the plurality of decode units (208) for storing the results of speculatively executed instructions.