JP2000515275A

JP2000515275A - Superscalar microprocessor including high-speed instruction alignment unit

Info

Publication number: JP2000515275A
Application number: JP10505952A
Authority: JP
Inventors: Thang M. Tran; David B. Witt; William M. Johnson
Original assignee: Advanced Micro Devices Inc
Current assignee: Advanced Micro Devices Inc
Priority date: 1996-07-16
Filing date: 1996-07-16
Publication date: 2000-11-14
Also published as: WO1998002798A1; EP0912924A1

Abstract

(57)【要約】命令整列ユニット、命令キャッシュ、複数のデコードユニットおよびプリデコードユニットを有するスーパースカラマイクロプロセッサが提供される。命令整列ユニットは命令キャッシュから固定数の命令を複数のデコードユニットの各々に転送する。命令はプリデコードユニットによってもたらされたプリデコードタグに従って複数バイトから選択される。プリデコードタグは複数バイトのうちいずれのバイトが命令の第１のバイトであるかを示す開始バイトビットを含む。命令整列ユニットは複数のグループの命令バイトを個別にスキャンし、複数の発行位置の各々に関する開始バイトおよび複数の連続バイトを選択する。命令整列ユニットはまず、複数のグループの命令の各々に関する発行位置のグループを選択する。次に、命令整列ユニットは個別にもたらされた発行位置をシフトしてマージし、最終組の発行位置をもたらし、複数のデコードユニットに転送する。
(57) Abstract: A superscalar microprocessor is provided having an instruction alignment unit, an instruction cache, a plurality of decode units, and a predecode unit. The instruction alignment unit transfers a fixed number of instructions from the instruction cache to each of the plurality of decode units. The instructions are selected from a plurality of bytes according to predecode tags provided by the predecode unit. The predecode tags include start-byte bits that indicate which of the plurality of bytes is the first byte of an instruction. The instruction alignment unit independently scans a plurality of groups of instruction bytes and selects a start byte and a plurality of contiguous bytes for each of a plurality of issue positions. The instruction alignment unit first selects a group of issue positions for each of the plurality of groups of instruction bytes. It then shifts and merges the independently produced issue positions to yield a final set of issue positions, which is transferred to the plurality of decode units.

Description

【発明の詳細な説明】名称：高速命令整列ユニットを含むスーパースカラマイクロプロセッサ発明の背景１．発明の分野この発明はスーパースカラマイクロプロセッサに関し、特に、可変バイト長の命令をスーパースカラマイクロプロセッサ内の複数個の命令デコードユニットへとディスパッチするための高速命令整列ユニットに関する。２．関連技術の説明スーパースカラマイクロプロセッサは、多命令の並列実行を可能にすることによって従来のスカラプロセッサをしのぐ性能を達成することができる。ｘ８６系マイクロプロセッサが広く受入れられているために、マイクロプロセッサ製造業者はｘ８６命令を実行するスーパースカラマイクロプロセッサを開発する努力を行なっている。このようなスーパースカラマイクロプロセッサは、比較的高い性能を達成しつつ、８０８６、８０２８６、８０３８６および８０４８６のような前世代のマイクロプロセッサのために開発された非常に多量の既存のソフトウェアとの旧版互換性を有利に保つ。ｘ８６命令セットは比較的複雑であり、複数個の可変バイト長命令によって特徴付けられる。ｘ８６命令セットを示す一般的なフォーマットが図１に示される。図に示すように、ｘ８６命令は１個から５個のオプションのプリフィックスバイト１０２と、それに続くオペレーションコード（opcoad）フィールド１０４と、オプションのアドレスモード（ＭｏｄＲ／Ｍ）バイト１０６と、オプションのスケールーインデックスーベース（ＳＩＢ）バイト１０８と、オプションの変位フィールド１１０と、オプションの即値データ１１２とからなる。 opcoadフィールド１０４はある特定の命令のための基本動作を規定する。ある特定のopcoadのデフォルト動作は１または２以上のプリフィックスバイトによって変更され得る。たとえば、プリフィックスバイトは、命令のためのアドレスまたはオペランドサイズを変更し、メモリアドレスにおいて用いられるデフォルトセグメントをオーバライドし、または、一続きの動作を何回か繰返すようにプロセッサに命令するために用いられ得る。opcoadフィールド１０４にはもしあればプリフィックスバイト１０２が従い、opcoadフィールド１０４は１バイトまたは２バイトの長さであり得る。アドレスモード（ＭｏｄＲ／Ｍ）バイト１０６は使用されるレジスタおよびメモリアドレスモードを特定する。スケール−インデックス−ベース（ＳＩＢ）バイト１０８はスケール係数およびインデックス係数を用いる３２ビットのベース相対アドレシングにおいてのみ用いられる。ＳＩＢバイトのベースフィールドはどのレジスタがアドレス計算のための基底値を含むかを特定し、インデックスフィールドはどのレジスタがインデックス値を含むかを特定する。スケールフィールドは、インデックス値フィールドがどの変位とも共に基底値に付加される前にそれによって乗算される２の累乗を特定する。次の命令フィールドはオプションの変位フィールド１１０であり、これは１バイトから４バイトの長さであり得る。変位フィールド１１０はアドレス計算において用いられる定数を含む。オプションの即値フィールド１１２もまた１バイトから４バイトの長さであってもよく、命令オペランドとして用いられる定数を含む。最短のｘ８６命令はわずか１バイトの長さであり、単一のopcoadバイトを含む。８０２８６は命令のための最大長を１０バイトに設定するが、８０３８６および３０４８６は共に１５バイトまでの命令長を可能にする。ｘ８６命令セットが複雑であるために高性能なｘ８６互換スーパースカラマイクロプロセッサの実現が困難になる。難点の１つは、適切なデコードが実行され得るまでにこのようなプロセッサの並列接続命令デコーダに対して命令が整列させられなければならないという事実から生じる。ほとんどのＲＩＳＣ命令フォーマットとは対照的に、ｘ８６命令セットが可変バイト長の命令からなるので、１ライン内の連続的な命令の開始バイトが必ずしも等しく間隔をあけられず、１ライン当りの命令の数が固定されない。結果として、簡潔で固定した長さのシフトロジックの採用はそれ自体では命令整列の問題を解決できない。スキャンロジックがプロセッサの単数（複数）のデコードパイプライン段の間に命令の境界を動的およびシーケンシャルに求めるために提案されているが、このような解決法は一般に、プロセッサのデコードパイプライン段が、スキャン動作を行なうために比較的多数のカスケード接続された論理ゲートのレベルおよび／または数クロックサイクルの割当てを伴って実現されることを必要とする。ｘ８６互換スーパースカラマイクロプロセッサ内の命令整列およびデコードに対するさらなる解決法が、同時係属中であり、共通の譲受人に譲渡された特許出願
、すなわち、ウイット他（witt et al.）により１９９３年１０月２９日に出願され、その開示全体が引用によりここに援用される連続番号第０８／１４６，３８３号「スーパースカラ命令デコーダ」(“Supersca1ar Instruction Decoder ”)内に説明されている。このような解決法は、命令が命令キャッシュ内にストアされるときに各可変バイト長命令のためのプリデコード情報が引出されるプリデコード技術を用いる。プリデコード情報は特に各命令の境界を示す。プロセッサのデコード段にディスパッチする前に、（バイトキューと称される）整列機構がシーケンシャルに各命令を配置する。命令を配置する際、整列機構は命令を「ＲＯＰ」と称される１以上の固定長のＲＩＳＣ的命令に変換する。次にこれら固定長のＲＯＰが割当てられた命令デコーダへと与えられる。後の命令も同様に扱われる。この解決法はかなり成功しているが、これもまた一般にカスケード接続された比論理ゲートの比較的多数のレベルおよび／またはパイプライン段を必要とする。これはしたがってスーパースカラマイクロプロセッサの最大の全体クロック周波数および性能を制限する。発明の概要上に概要を述べた問題はこの発明に従うスーパースカラマイクロプロセッサによって大部分が解決される。ある実施例においては、スーパースカラマイクロプロセッサは固定数のバイトを命令キャッシュから複数個のデコードユニットの各々に転送する命令整列ユニットを用いる。これらバイトは、プリデコードユニットによって発生されるプリデコードタグに従って、予め定められたバイトグループから選択される。プリデコードタグ（各バイトに異なる１つが関連付けられる）は予め定められたグループ内のどのバイトが命令のための開始バイトであるかを示す。ある具体的な実施例において、命令整列ユニットは８バイトの連続する命令コードの３つの異なったグループの中で開始バイトを同時に独立して検出する。命令コードの各グループ内で予め定められた数の開始バイトを独立して求めると、命令整列ユニットは開始バイトを各開始バイトに従う隣接した７バイトと共に各グループに関連したそれぞれの「仮の」発行チャネルへと独立して送る。仮の発行チャネルは次に上述の複数個のデコードユニットと結合される１組の「最終的な」発行チャネルへとシフトおよび／またはマージされる。別の実施例において、命令バイトのグループが１対の命令チャネリングユニットへと転送されるスーパースカラマイクロプロセッサが提供されろ。命令チャネリングユニットは独立して命令バイトから最大４バイトまでの開始バイトを選択し、選択された開始バイトと、開始バイトに隣接し、かつその後に続く多数のバイトとを仮の発行位置に配置する。２組の仮の発行位置を通してチャネリングされた命令バイトは次に、第１の命令チャネリングユニットの発行位置内に含まれる有効命令の数の表示と共に、第３の命令チャネリングユニットへと転送される。第２の命令チャネリングユニットによって転送された発行位置は次に、第１の命令チャネリングユニットによって表示された有効命令の数だけシフトされる。次に、最終的な発行位置が第１の命令チャネリングユニットからの発行位置において転送される対応する有効命令から選択される。残りの最終的な発行位置はいずれも第２のチャネリングユニットからのシフトされた組の発行位置の対応の発行位置から選択される。最終的な発行位置が１組のデコードユニットに結合され、これは命令をデコードし、それらを実行のために機能ユニットへとディスパッチする。別の実施例においては、命令整列ユニットが選択するバイト数は２４であり、その最後の８バイトは前にフェッチされた命令キャッシュラインのものであり、１６バイトは現在の命令キャッシュラインのものであるスーパースカラマイクロプロセッサが提供される。開始バイトがディスパッチのために選択されるとき、対応の開始ビットは無効にされる。この実施例において、１クロックサイクル当り４つまでの命令がディスパッチされ得る。前にフェッチされたキャッシュラインの最後の８バイトと現在のキャッシュラインの最初の８バイトとが有効開始バイトを含まない場合、現在のキャッシュラインが前にフェッチされた命令キャッシュライン位置へと移動され、次の命令キャッシュがフェッチされる。各８バイト部分が開始バイトを見つけるため独立して調べられ、見つけられた開始バイトおよびその後の７バイトが発行位置に割当てられる。第１のレベルの多重化がこれを達成するために実装される。３組の発行グループ（ここでは、前のキャッシュラインの最後の８バイトに対しては発行グループ１、現在のキャッシュラインの最初の８バイトに対しては発行グループ
２と、現在のキャッシュラインの最後の８バイトに対しては発行グループ３と呼ぶ）が次に第２のレベルの多重化へと導かれる。このレベルで、発行グループ１に含まれる有効命令の数だけ発行グループ２をシフトすることによって発行グループ１および発行グループ２がマージされる。発行グループ３における命令もまたこのレベルでの発行グループ１内の有効命令の数だけシフトされる。マージおよびシフトされた発行グループは次に第３のレベルの多重化へと導かれる。前にシフトされた発行グループ３はさらに発行グループ２に含まれる有効命令の数だけシフトされる。二重にシフトされた発行グループ３は次に前にマージされた発行グループ１および２とマージされる。結果として生じる発行グループは命令デコードユニットへと転送され、転送された命令に対する対応の開始ビットがリセットされる。第３の多重化レベルにはＭＲＯＭユニットおよびプリデコードユニットからの入力も含まれる。この発明に従うスーパースカラマイクロプロセッサは命令整列ユニットを用いることができる。命令整列ユニットは、開始バイトを見つけるために同時にバイトのいくつかの小さいフィールドをスキャンし、次に、小さいフィールド内に見つけられた開始バイトの数だけ、見つけれた命令を独立してシフトすることによって、少数のカスケード接続されたゲートで実現され得る。計算値を組合せることは必要ではなく、実装がさらに速やかとなる。概して、この発明は、命令キャッシュと、複数個のデコードユニットと、第１、第２および第３の命令チャネリングユニットを含む命令整列ユニットとを用いるスーパースカラマイクロプロセッサを目指している。第１および第２の命令チャネリングユニットは入力ポートに結合される。入力ポートは命令キャッシュからの命令バイトの複数個のグループを含む。第１の命令チャネリングユニットは第１の複数個の命令バイトを選択し、第２の命令チャネリングユニットはディスパッチのために複数個のグループの命令から第２の複数個の命令バイトを選択する。第１の複数個の命令バイトは次に第３の命令チャネルリングユニットによって第２の複数個の命令バイトとマージされ、マージされた複数個の命令バイトを形成する。このマージされた複数個の命令バイトは次に出力ポートを介して複数個の命令デコードユニットへとディスパッチされる。図面の簡単な説明この発明の他の目的および利点は、以下の詳細な説明を読み、添付の図面を参照することによって明らかとなるであろう。図１は、一般的なｘ８６命令フォーマットのブロック図である。図２は、この発明に従う命令整列ユニットを含むスーパースカラマイクロプロセッサのブロック図である。図３Ａは、この発明に従う命令整列ユニットの１つの実施例のブロック図である。図３Ｂは、この発明に従う命令整列ユニットの別の実施例の図であり、第１のレベルの多重化への開始バイトの接続のみを示す。図４は、隣接した１５命令バイトと、１５命令バイトの組内の隣接した８バイトを選択するために必要な多重化接続とを示す図である。この発明はさまざまに変更され、代替的な形をとり得るが、その具体的な実施例は例としてのみ示され、ここに詳細に説明される。しかしながら、その図面および詳細な説明はこの発明を開示される特定の形態に限定せず、逆に、添付の請求項によって規定されるようなこの発明の趣旨および範疇内に含まれるすべての変更、均等物および代替例に及ぶことが理解されるべきである。発明の詳細な説明ここで図２を参照すると、この発明に従う命令整列ユニット２０６を含んだスーパースカラマイクロプロセッサ２００のブロック図が示される。図２の実施例に示すように、スーパースカラマイクロプロセッサ２００はプリフェッチ／プリデコードユニット２０２と命令キャッシュ２０４に結合された分岐予測ユニット２２０とを含む。命令整列ユニット２０６は命令キャッシュ２０４と（デコードユニット２０８と総称される）複数個のデコードユニット２０８Ａ−２０８Ｄとの間に結合される。デコードユニット２０８Ａ−２０８Ｄの各々は（リザベーションステーション２１０と総称される）それぞれのリザベーションステーションユニット２１０Ａ−２１０Ｄに結合され、リザベーションステーション２１０Ａ 
−２１０Ｄの各々は（機能ユニット２１２と総称される）それぞれの機能ユニット２１２Ａ−２１２Ｄに結合される。デコードユニット２０８、リザベーションステーション２１０および機能ユニット２１２はリオーダバッファ２１６、レジスタファイル２１８およびロード／ストアユニット２２２にさらに結合される。データキャッシュ２２４は最後にロード／ストアユニット２２２に結合されて示され、ＭＲＯＭユニット２０９は命令整列ユニット２０６に結合されて示される。概して、命令キャッシュ２０４はデコードユニット２０８へのディスパッチの前に命令を一時的にストアするために設けられる高速キャッシュメモリである。１つの実施例では、命令キャッシュ２０４が最大で３２キロバイトの、各々が１６バイトのラインで編成される（ここで各バイトは８ビットからなる）、命令コードをキャッシュするように構成される。動作の間、命令コードはメインメモリ（図示せず）からプリフェッチ／プリデコードユニット２０２を介してコードをプリフェッチすることによって命令キャッシュ２０４に与えられる。命令キャッシュ２０４がセットアソシアティブ構成、フルアソシアティブ構成またはダイレクトマップ構成に実現され得ることに注目されたい。プリフェッチ／プリデコードユニット２０２はメインメモリから命令コードをプリフェッチして命令キャッシュ２０４内にストアするために設けられる。１つの実施例では、プリフェッチ／プリデコードユニット２０２はメインメモリから命令キャッシュ２０４へと６４ビット幅のコードをバーストするように構成される。さまざまな具体的なコードプリフェッチ技術およびアルゴリズムがプリフェッチ／プリデコードユニット２０２によって用いられ得ることが理解される。プリフェッチ／プリデコードユニット２０２がメインメモリから命令をフェッチするとき、これは命令コードの各バイトに関連した３つのプリデコードビット、すなわち、開始ビット、終了ビットおよび「機能」ビットを発生する。プリデコードビットは各命令の境界を示すタグを形成する。以下により詳細に説明するように、プリデコードタグはまた、所与の命令がデコードユニット２０８によって直接的にデコードされ得るか、または命令がＭＲＯＭユニット２０９によって制御されるマイクロコード手順を起動することによって実行されなければならないかのような付加的な情報を伝えることができる。表１はプリデコードタグのエンコード一例を示ず。表に示すように、所与のバイトが命令の最初のバイトであれば、そのバイトの開始ビットがセットされる。そのバイトが命令の最後のバイトであれば、そのバイトの終了ビットがセットされる。ある特定の命令がデコードユニット２０８によって直接デコードできなければ、その命令の最初のビットに関連した機能ビットがセットされる。他方、その命令がデコードユニット２０８によって直接デコードできれば、その命令の最初のビットに関連した機能ビットがクリアされる。ある特定の命令の２番目のバイトのための機能ビットは、opcoadが第１のバイトである場合にクリアされ、op 
coadが第２のバイトである場合にセットされる。opcoadが第２のバイトである状況では最初のバイトがプリフィックスバイトであることに注目される。命令バイト番号３−８に対する機能ビット値は、そのバイトがＭＯＤＲＮバイトまたはＳＩＢバイトであるか、またはそのバイトが変位データまたは即値データを含むかを示す。表１開始ビット、終了ビットおよび機能ビットのエンコード上述のように、１つの実施例ではｘ８６命令セット内のある命令がデコードユニット２０８によって直接デコードされ得る。これらの命令は「ファストパス」命令と称される。ｘ８６命令セットの残りの命令は「ＭＲＯＭ命令」と称される。ＭＲＯＭ命令はＭＲＯＭユニット２０９を起動することによって実行される。より具体的には、ＭＲＯＭ命令に遭遇すると、ＭＲＯＭユニット２０９はその命令を規定されたファストパス命令のサブセットへと構文解析し、逐次化して所望の動作を実行する。ファストパス命令として分類される例示的なｘ８６命令の例と、ファストパス命令およびＭＲＯＭ命令の両方を扱う方法の説明とが以下に示される。可変バイト長命令を命令キャッシュ２０４からデコードユニット２０８Ａ−２０８Ｄによって形成される固定した発行位置へとチャネリングするために命令整列ユニット２０６が設けられている。図２−４に関連して説明するように、命令整列ユニット２０６は指定されたデコードユニット２０８Ａ−２０８Ｄに命令バイトをチャネリングするように構成される。命令整列ユニット２０６は独立してかつ並行して、命令キャッシュ２０４によって与えられる３つのグループの命令バイトから命令を選択し、これらのバイトを３つのグループの仮の発行位置へと配列する。発行位置の各グループは３つのグループの命令バイトの１つと関連付けられる。仮の発行位置は次に共にマージされて最終的な発行位置を形成し、その各々がデコードユニット２０８の１つに結合される。命令キャッシュ２０４からデコードユニット２０８への命令整列の詳細な説明を行なう前に、図２の例示的なスーパースカラマイクロプロセッサ２００内で用いられる他のサブシステムに関した一般的な局面を説明する。図２の実施例では、デコードユニット２０８の各々が上述の予め定められたファストパス命令をデコードするためのデコード回路を含む。さらに、各デコードユニット２０８Ａ− ２０８Ｄが変位データおよび即値データを対応のリザベーションステーションユニット２１０Ａ−２１０Ｄへと経路付ける。デコードユニット２０８からの出力信号は機能ユニット２１２のためのビット−エンコード実行命令と、オペランドアドレス情報と、即値データおよび／または変位データとを含む。図２のスーパースカラマイクロプロセッサは追越し実行を支持し、したがって、レジスタの読出動作および書込動作のためのもともとのプログラムシーケンスを守るために、レジスタのリネームを実現するために、投機的な命令実行および分岐予測の誤りからの回復を行なうために、そして正確な例外を容易にするために、リオーダバッファ２１６を含む。当業者には認識されるように、リオーダバッファ２１６内の一時的な記憶場所はレジスタの更新を含む命令のデコード時に予約されてそれによって投機的なレジスタの状態をストアする。リオーダバッファ２１６は先入れ先出し構成に実現でき、ここで、投機的な結果は有効にされレジスタファイルに書込まれるときバッファの「最後部」に移動し、こうしてバッファの「頭部」に新たなエントリのための余地を与える。リオーダバッファ２１６の他の具体的な構成も以下にさらに説明するように可能である。分岐予測が正確でなければ、予測誤り経路に沿う投機的に実行された命令の結果がレジスタファイル２１８に書込まれる前にバッファにおいて無効にされ得る。デコードユニット２０８Ａ−２０８Ｄの出力で与えられるビット−エンコード実行命令および即値データはそれぞれのリザベーションステーションユニット２１０Ａ−２１０Ｄへと直接経路付け（route）られる。１つの実施例では、各リザベーションステーションユニット２１０Ａ−２１０Ｄが、対応の機能ユニットへの発行を待つ３つまでの未決（pending）命令に対して命令情報（すなわち、ビットエンコード化実行ビットならびにオペランド値、オペランドタグおよび／または即値データ）を保持することができる。図２の実施例では、各デコードユニット２０８Ａ−２０８Ｄが専用のリザベーションステーションユニット２１０Ａ−２１０Ｄと関連付けられ、各リザベーションステーションユニット２１０Ａ 
−２１０Ｄが専用の機能ユニット２１２Ａ−２１２Ｄに同様に関連付けられることが注目される。したがって、４つの専用の「発行位置」がデコードユニット２０８、リザベーションステーションユニット２１０および機能ユニット２１２によって形成される。デコードユニット２０８Ａを介して整列させられ、発行位置０にディスパッチされた命令がリザベーションステーションユニット２１０Ａに送られ、続いて実行のために機能ユニット２１２Ａに送られる。同様に、デコードユニット２０８Ｂに整列させられ、ディスパッチされた命令がリザベーションステーションユニット２１０Ｂ、機能ユニット２１２Ｂに送られ、以下同様である。ある特定の命令のデコード時に、要求されたオペランドがレジスタ場所であればレジスタアドレス情報が同時にリオーダバッファ２１６およびレジスタファイル２１８へと経路付けられる。当業者はｘ８６レジスタファイルが８個の３２ビットリアルレジスタ（すなわち、典型的にはＥＡＸ、ＥＢＸ、ＥＣＸ、ＥＤＸ、ＥＢＰ、ＥＳＩ、ＥＤＩおよびＥＳＰと称される）を含むことを認識するであろう。リオーダバッファ２１６はこれらのレジスタの内容を変更する結果のための一時的な記憶場所を含み、それによって追越し実行を可能にする。リオーダバッファ２１６の一時的な記憶場所は各命令に対して予約されており、これはデコード時にリアルレジスタの１つの内容を変更する。したがって、特定のプログラムの実行中のさまざまな点で、リオーダバッファ２１６は所与のレジスタの投機的に実行された内容を含む１または２以上の場所を含み得る。所与の命令のデコードに続いて、リオーダバッファ２１６が所与の命令におけるオペランドとして用いられるレジスタに割当てられた前の単数または複数の場所を有すると判断されれば、リオーダバッファ２１６は対応のリザベーションステーションに、１）最も最近に割当てられた場所の値か、２）最終的に前の命令を実施する機能ユニットによって値がまだ生み出されていなければ最も最近に割当てられた場所のためのタグを送る。リオーダバッファが所与のレジスタのために予約された場所を有していれば、オペランド値（またはタグ）がレジスタファイル２１８ではなくリオーダバッファ２１６から与えられる。リオーダバッファ２１６に要求されるレジスタのために予約された場所がなければ、その値はレジスタファイル２１８から直接取出される。オペランドがメモリ場所に対応すれば、オペランド値がロード／ストアユニット２２２を介してリザベーションステーションユニットへと与えられる。適切なリオーダバッファの実現に関する詳細は、マイク・ジョンソン（Mike J ohnson）による出版物「スーパースカラマイクロプロセッサ設計」（“Supersca lar Microprocessor Design”)、Prentice-Hall,Englewood Cliffs，New Jersey ，1991と、同時係属中であり、共通に譲渡された特許出願、すなわち、ウイット他（ｗitt，etal.）によって１９９３年１０月２９日に出願された連続番号第０８／１４６，３８２号「高性能スーパースカラマイクロプロセッサ」（“High Performance Superscalar 
Microprocessor”）とに見られる。これらの文書は引用によりその全体をここに援用する。リザベーションステーションユニット２１０Ａ−２１０Ｄは対応の機能ユニット２１２Ａ−２１２Ｄによって投機的に実行されるべき命令情報を一時的にストアするために設けられる。上述のように、各リザベーションステーションユニット２１０Ａ−２１０Ｄが３つまでの未決命令について命令情報をストアできる。４つの命令ステーション２１０Ａ−２１０Ｄの各々が対応の機能ユニットによって投機的に実行されるべきビットエンコード化実行命令とオペランドの値とをストアするための場所を含む。特定のオペランドが利用可能でなければ、そのオペランドのためのタグがリオーダバッファ２１６から与えられ、結果が発生される（すなわち、前の命令の実行を完了することによって）まで対応のリザベーションステーション内にストアされる。命令が機能ユニット２１２Ａ−２１２Ｄの１つによって実行されるとき、その命令の結果がその結果を待っているリザベーションステーションユニット２１０Ａ−２１０Ｄへと直接渡され、同時にその結果がリオーダバッファ２１６を更新するために送られることに注目される（この技術は通常「結果フォワーディング」と称される）。命令は、いずれかの要求されるオペランドの値が利用可能とされた後に実行のために機能ユニットへと発行される。すなわち、リザベーションステーションユニット２１０Ａ−２１０Ｄの１つの中の未決命令と関連したオペランドが、要求されるオペランドを変更する命令に対応するリオーダバッファ２１６内の前の結果値の場所のタグを付けられていれば、前の命令のためのオペランド結果が得られるまで命令は対応の機能ユニット２１２に発行されない。したがって、命令が実行される順序は元のプログラム命令シーケンスの順序とは同じではないかもしれない。リオーダバッファ２１６は書込後読出の依存が起こる状況でデータの一貫性が維持されることを確実とする。１つの実施例では、機能ユニット２１２の各々が加算および減算の整数算術演算ならびにシフト、回転、論理演算および分岐演算を行なうように構成される。浮動小数点に対処するために浮動小数点ユニット（図示せず）も用いられ得ることに注目される。機能ユニット２１２の各々はまた条件付分岐命令の実行に関する情報を分岐予測ユニット２２０に与える。分岐予測が正確でなければ、分岐予測ユニット２２０は予測誤り分岐に後の命令処理パイプラインに入っている命令をフラッシュし、プリフェッチ／プリデコードユニット２０２に必要とされる命令を命令キャッシュ２０４またはメインメモリからフェッチさせる。このような状況では、投機的に実行され、ロード／ストアユニット２２２およびリオーダバッファ２１６に一時的に記憶されたものを含め、予測誤り分岐命令の後に生じる、元のプログラムシーケンスが廃棄されることに注目される。適切な分岐予測機構の例示的構成は周知である。機能ユニット２１２によって生じる結果は、レジスタ値が更新されていればリオーダバッファ２１６に送られ、メモリ場所の内容が変更されていればロード／ストアユニット２２２に送られる。結果がレジスタに記憶されるべきであれば、リオーダバッファ２１６は命令がデコードされたときのレジスタの値のために予約された場所に結果をストアする。上述のように、結果は、未決命令が要求されるオペランド値を得るために前の命令実行の結果を待っている場合、リザベーションステーションユニット２１０Ａ−２１０Ｄにブロードキャストされる。一般に、ロード／ストアユニット２２２は機能ユニット２１２Ａ−２１２Ｄとデータキャッシュ２２４との間にインターフェイスを与える。１つの実施例では、ロード／ストアユニット２２２は未決のロードまたはストアのためのデータおよびアドレス情報に対する８つの記憶場所を有するストアバッファを伴って構成される。機能ユニット２１２はロード／ストアユニット２２２へのアクセスの調停を行なう。バッファが一杯であれば（full）、機能ユニットはロード／ストアユニット２２２が未決のロードまたはストア要求情報のためのあきを有するようになるまで待たなければならない。ロード／ストアユニット２２２はまた未決のストア情報に対してロード命令のための依存性チェックを行なって、データの一貫性が保たれることを確実とする。データキャッシュ２２４は、ロード／ストアユニット２２２とメインメモリサブシステムとの間で転送されるデータを一時的にストアするために与えられる高速キャッシュメモリである。１つの実施例では、データキャッシュ２２４は８キロバイトまでのデータをストアする容量を有する。データキャッシュ２２４がセットアソシアティブ構成を含むさま
ざまな具体的メモリ構成で実現され得ることが理解される。命令キャッシュ２０４から命令整列ユニット２０６を介してデコードユニット２０８に至る命令のディスパッチに関する詳細を以下に検討する。図３Ａは、命令整列ユニット２０６の一実施例の内部とデコードユニット２０８への入力レジスタとを示すブロック図である。この実施例は（命令バイトバス２５０と総称される）２つの命令バイトバス２５０Ａおよび２５０Ｂを用いて構成される。命令バイトは命令キャッシュ２０４によって命令バイトバス２５０上に出力され、各命令バイトバスは８バイトを転送する。命令バイトバス２５０Ａは命令チャネリングユニット２５１に結合され、命令バイトバス２５０Ｂは命令チャネリングユニット２５２に結合される。図３Ａには、プリデコードタグバス２５４上の入力情報を受取り、制御出力バス２５６、２５７および２５８を有する制御ユニ２５５も示される。制御出力バス２５６は命令チャネリングユニット２５２に結合される。同様に、制御出力バス２５７は命令チャネリングユニット２５１に結合され、制御出力バス２５８は命令チャネリングユニット２５３に結合される。命令チャネリングユニット２５１は４つの仮の発行位置、すなわち、仮の発行位置Ａ、仮の発行位置Ｂ、仮の発行位置Ｃ、仮の発行位置Ｄを生じる。同様に、命令チャネリングユニット２５２は仮の発行位置Ａ'、仮の発行位置Ｂ'、発行位置Ｃ' および仮の発行位置Ｄ'を生じる。発行位置Ａ−ＤおよびＡ'−Ｄ'の各々は命令チャネリングユニット２５３に結合される。命令チャネリングユニット２５３は４つの最終的な発行位置２６７，２６８，２６９および２７０を生じ、これらはデコードユニット２０８Ａ、２０８Ｂ、２０８Ｃおよび２０８Ｄにそれぞれ結合される。この実施例では、仮の発行位置または最終的な発行位置が最大で１つの有効命令を伝え、有効命令を含む固定数のバイトを伝える。一般に、命令チャネリングユニット２５１および２５２は独立してかつ並行してそれぞれ命令バイトバス２５０Ａおよび２５０Ｂから命令を選択する。選択された命令は命令チャネリングユニット２５１および２５２に接続された仮の発行位置を占める。命令チャネリングユニット２５３は仮の発行位置Ａ−Ｄにおいて伝えられる命令数だけ仮の発行位置Ａ'−Ｄ'において伝えられる命令をシフトする。命令チャネリングユニット２５３は次に２組の仮の発行位置からの命令を最終的な発行位置２６７−２７０へとマージする。命令選択およびシフティングプロセスは以下の段落でより詳細に説明される。この実施例では、制御ユニット２５５が命令バイトバス２５０上で転送される命令バイトと関連した開始バイトビットを（バス２５４によって）受取る。制御ユニット２５５は命令バイトバス２５０Ａのために開始バイト情報をスキャンし、セットされた開始バイトを探す。開始バイトビットがセットされていると、命令バイトバス２５０Ａ上の対応のバイトが命令の始まりである。制御ユニット２５５は（制御出力バス２５７上の信号によって）、入力命令バイトバス２５０Ａ上の対応のバイトとそれに続く７バイトとを選択するように命令チャネリングユニット２５１に指示する。選択されたバイトが次の利用可能な仮の発行位置を占める。仮の発行位置Ａが最初に使われ、次に仮の発行位置Ｂが続き、以下同様である。制御ユニット２５５は、命令チャネリングユニット２５１の発行位置が占められるか命令バイトバス２５０Ａに関連した開始バイトビットがなくなるまで、命令バイトバス２５０Ａに関連した開始バイトビットをスキャンし続ける。同様に、並行して、制御ユニット２５５は命令バイトバス２５０Ｂに関連した開始バイトビットを処理し、制御出力バス２５６上で命令チャネリングユニット２５２へと発行位置選択情報を伝える。図３Ａの実施例では、命令バイトビット２５０Ａ上で転送される命令が命令バイトバス２５０Ｂ上で転送される命令よりも優先される。したがって、仮の発行位置Ａ−Ｄにおいて伝えられる有効命令が制御ユニット２５５の指示のもと命令チャネリングユニット２５３によって最終的な発行位置２６７−２７０へと向けられる。有効命令を伝えるとき、仮の発行位置Ａは発行位置２６７に向けられる。同様に、仮の発行位置Ｂは有効命令を伝えるときに発行位置２６８に向けられ、以下同様である。さらに、命令チャネリングユニット２５３は命令チャネリングユニット２５１によって選択される有効命令の数（すなわち、発行位置Ａ−Ｄにおいて伝えられる有効命令の数）だけ仮の発行位置Ａ'−Ｄ'をシフトする。その後、シフトされた仮の発行位置は仮の発行位置Ａ−Ｄからの命令で占められていなかったこれらの最終的な発行位置２６７−２７０を占める。したがって、デ
コードユニット２０８は命令バイトバス２５０内で配置され得る最大数の命令（４まで）を受取る。この実施例の動作を例を用いてさらに説明する。命令バイトバス２５０Ａがあるクロックサイクルで２つの有効命令を転送し、命令バイトバス２５０Ｂもまた同じクロックサイクルで２つの有効命令を転送すると仮定する。制御ユニット２５５の指示の下、命令チャネリングユニット２５１は命令バイトバス２５０Ａから最初の開始バイトとそれに続く７バイトとを選択し、選択されたバイトを仮の発行位置Ａに与える。制御ユニット２５５は次に命令バイトバス２５０Ａの第２の開始バイトを検出し、第２の開始バイトとそれに続く７バイトとが仮の発行位置Ｂを占めるように命令チャネリングユニット２５１に指示を出す。独立してかつ上と並行して、制御ユニット２５５が命令バイトバス２５０Ｂ上に与えられた命令バイトと関連した開始バイトビットをスキャンし、第１の開始バイトを検出する。検出された開始バイトとそれに続く７バイトとが仮の発行位置Ａ'を占める。スキャンプロセスを続け、制御ユニット２５５は命令バイトバス２５０Ｂ上で伝えられる第２の開始バイトを検出する。第２の開始バイトとそれに続く７バイトとが命令チャネリングユニット２５２によって仮の発行位置Ｂ'へと選択される。なお制御ユニット２５５のスキャン機構もまた仮の発行位置Ｃ'およびＤ' に送られる命令バイトバス２５０Ｂ上の後の命令を見つけることができる。しかしながら、上から明らかであるように、発行位置Ｃ'およびＤ'は命令チャネリングユニット２５３によって実質的に無視されるであろう。次に、制御ユニット２５５は制御出力２５８によって命令チャネリングユニット２５３に指示する。２つの有効命令が仮の発行位置Ａ−Ｂに存在するので、仮の発行位置Ａおよび仮の発行位置Ｂがそれぞれ最終的な発行位置２６７および２６８を占める。また、２つの有効命令が命令チャネリングユニット２５１において選択されたので、仮の発行位置Ａ'−Ｄ'が２つ分位置をシフトされる。このシフトによって、発行位置Ａ'において伝えられる命令が最終的な発行位置２６９と整列させられる。同様に、発行位置Ｂ'が最終的な発行位置２７０と整列させられる。したがって、もともとは仮の発行位置Ａ'およびＢ'におけるものである２つの有効命令がそれぞれ発行位置２６９および２７０を占めろ。デコードユニット２０８の各々がこのサイクルで命令を受取る。別の実施例では、命令チャネリングユニット２５１および２５２の出力で１つの仮の発行位置を占めるように選択されたバイトが別の仮の発行位置を占めるように選択されたバイトと重複する。仮の発行位置または最終的な発行位置を占めるバイト数は固定されており、いくつかの命令は発行位置内のバイト数全部を占めることができないかも知れない。したがって、後続の命令の開始バイトと恐らくは他のバイトとが現在の命令位置内のバイト位置を占める。デコードユニット２０８の各々がデコードユニットに転送された命令と関連した開始バイトおよび終了バイトビットを受取る。デコードユニット２０８は転送されたどのバイトが完全な有効命令を含むかを判断するために開始バイトビットおよび終了バイトビットを検出する。他の実施例では、異なった数の発行位置およびデコードユニットを用いることができることは言うまでもない。図３Ａと関連して説明される実施例は少数のカスケード接続された論理レベルで実現でき、したがってこの実施例は高速で動作が可能となる。この実施例はさまざまな理由のため少数のカスケード接続された論理レベルで実現できる。第１に、命令バイトバス２５０上を転送される多数の命令が互いに独立した小さいグループごとに処理される。この多数の命令と関連した開始ビット情報中を線形的にスキャンする代わりに小さいグループが並行に処理され得る。第２に、小さいグループがそのうちの１つで見つけられた有効命令の数に基づいて共に組合せられる（この実施例では命令バイトバス２５０Ａ）。ここで図３Ｂを参照すると、命令整列ユニット２０６の別の実施例が示される。この実施例の命令チャネリングユニットはマルチプレクサを含み、マルチプレクサ制御バス３１１、３１２および３１３を介して出力制御ユニット３０２によって制御される。（ここでは命令バイトバス３００と総称される）３つの命令バイトバス３００Ａ、３００Ｂおよび３００Ｃがさらに示される。命令バイトバス３００Ａは「前に」フェッチされた命令キャッシュラインから最後の８命令バイトバスを伝える。入力命令バイトバス３００Ｂは「最も最近の」命令キャッシュラインの最初の８バイトを伝え、命令バイトバス３００Ｃは「最も最近の」命令キャッシュラインの最後の８バイトを伝える。前にフェ
ッチされたキャッシュの最後の８バイトと最も最近のキャッシュラインの最初の８バイトとからの命令がデコードユニット２０８に転送されると、最も最近のキャッシュラインの最後の８バイトが前にフェッチされた命令キャッシュラインの最後の８バイトへと（すなわち、命令バイトバス３００Ａへと）移動され、新しいキャッシュラインがフェッチされる（そして、命令バイトバス３００Ｂおよび３００Ｃ上を伝えられる）。図３Ｂを参照すると、入力命令バイトバス３００と第１のレベルのマルチプレクサ３０１Ａ、３０１Ｂ、３０１Ｃ、３０１Ｄ、３０４Ａ、３０４Ｂ、３０４Ｃ、３０４Ｄ、３０５Ａ、３０５Ｂ、３０５Ｃおよび３０５Ｄ（それぞれマルチプレクサ３０１，３０４および３０５と総称される）との間の信号経路が示される。２つの第１のレベルの命令チャネリングユニットを有する前の実施例に対して、この実施例はそれぞれマルチプレクサ３０１，３０４および３０５によって表わされるように３つの第１レベルの命令チャネリングユニットを有する。第１のレベルの命令チャネリングユニットはそれらと関連した発行位置１Ａ−１Ｄ、１Ａ'−１Ｄ'、および１Ａ"−１Ｄ"を有する。図３Ｂはまた第１のレベルのマルチプレクサ３０１，３０４および３０５と第２のレベルのマルチプレクサ３０６Ａ、３０６Ｂ、３０６Ｃ、３０６Ｄ、３０７Ａ、３０７Ｂ、３０７Ｃおよび３０７Ｄ（それぞれマルチプレクサ３０６および３０７とここで総称される）との間の信号経路を示す。マルチプレクサ３０６および３０７は２つの第２のレベルの命令チャネリングユニットを形成する。第２のレベルの命令チャネリングユニットはそれらと関連した発行位置２Ａ−２Ｄおよび２Ａ’−２Ｄ’を有する。最後に、第２のレベルのマルチプレクサ３０６および３０７と第３のレベルのマルチプレクサ３０８Ａ、３０８Ｂ、３０８Ｃおよび３０８Ｄ（マルチプレクサ３０８とここで総称される）との間の信号経路が示される。マルチプレクサ３０８は第３のレベルの命令チャネリングユニットを形成する。第３のレベルの命令チャネリングユニットがそれと関連した発行位置３Ａ−３Ｄを有する。大まかに言うと、マルチプレクサ３０１、３０４および３０５によって形成された第１のレベルの命令チャネリングユニットの各々は、関連したそれらの命令バイトバス３００Ａ−３００Ｃから個別にかつ並行して命令を選択し、発行位置１Ａ−１Ｄ、ＩＡ’−１Ｄ’および１Ａ”−１Ｄ”にそれぞれ送る。マルチプレクサ３０６および３０７によって形成された第２のレベルの命令チャネリングユニットは、発行位置１Ａ−１Ｄ内の有効命令の数だけ発行位置１Ａ’−１Ｄ’および１Ａ”−１Ｄ”をそれぞれシフトする。さらに、マルチブレクサ３０６は、発行位置１Ａ−１Ｄと、発行位置１Ａ’−１Ｄ’に関連したシフトされた発行位置とをマージする。マルチプレクサ３０８によって形成された第３のレベルの命令チャネリングユニットは、発行位置１Ａ’−１Ｄ’における命令数だけ発行位置２Ａ’−２Ｄ’をシフトする。マルチプレクサ３０８はさらに、発行位置２Ａ 
−２Ｄと、発行位置２Ａ’−２Ｄ’に関連したシフトされた発行位置とをマージする。次に、この実施例をさらに完全に説明する。図３Ｂには開始バイトを多重化するための信号経路しか示されていない。しかしながら、第１のレベルのマルチプレクサの出力上のスラッシュで示されるように、各マルチプレクサによって複数バイトが選択される。所与のマルチプレクサに対して選択される他のバイトの多重化を以下に図４に関して記載する。第１のレベルのマルチプレクサは、それらが結合された命令バイトバス３００に従ってグループ化される。したがって、マルチプレクサ３０１は命令バイトバス３００Ａに結合され、マルチプレクサ３０４は命令バイトバス３００Ｂに結合され、マルチプレクサ３０５は命令バイトバス３００Ｃに結合される。１つの実施例において、マルチプレクサ３０１Ａは命令バイトバス３００Ａの８つの命令バイトに結合される。これにより、命令バイトバス３００Ａ内に伝えられたすべてのバイトから開始バイトが選択できるようになる。マルチプレクサ３０１Ｂは、最初のバイトを除く、命令バイトバス３００Ａのバイトの各々に結合される。マルチプレクサ３０１Ｂは最初のバイトに結合される必要はなく、そのバイトが開始バイトであれば、それはマルチプレクサ３０１Ａによって選択されることとなる。同様に、マルチプレクサ３０１Ｃは最初の２つのバイトに結合される必要はない。両方のバイトが開始バイトである場合、第１のバイトはマルチプレクサ３０１Ａによって選択され、第２のバイトはマルチプレクサ３０１Ｂによって選択されることとなる。最後に、マルチプレクサ３０１Ｄは、最初の３バイトを除く、命令バイトバス３００Ａのバイトの各々に結合して示される。したがって、マルチプレクサ３０１Ａ、３０１Ｂ、３０１Ｃおよび３０１Ｄと、命令バイトバス３００Ａからの対応する信号経路との組合せにより、４つまでの開始バイトが命令バス３００Ａから選択できるようになる。図３Ｂにさらに示されるように、命令バイトバス３００Ａからマルチプレクサ３０１までに描かれる。同様の信号経路が入力命令バイトバス３００Ｂとマルチプレクサ３０４との間に示される。これらのマルチプレクサはマルチプレクサ３０１に似た構成であり、マルチプレクサ３０４Ａは３０１Ａに似ており、３０４Ｂは３０１Ｂに似ており、３０４Ｃは３０１Ｃに似ており、３０４Ｄは３０１Ｄに似ている。また、マルチプレクサ３０４の動作はマルチプレクサ３０１の動作からは独立しており、かつそれと並行して行なわれる。命令バイトバス３００Ｃとマルチプレクサ３０５との間の信号経路も、命令バイトバス３００Ａとマルチプレクサ３０１との間のものに似ている。制御ユニット３０２はマルチプレクサ制御バス３１１を介してマルチプレクサ３０１、３０４および３０５に結合される。制御ユニット３０２はプリデコードタグ入力ポート３００をさらに備えた構成である。入力ポート３００は制御ユニット３０２が使用する情報を伝え、命令バイトバス３００からの命令バイトをマルチプレクサ３０１、３０４および３０５が選択するようにする。１つの実施例において、入力ポート３０３に伝えられた情報は、命令バイトバス３００に与えられたバイトに関連した開始バイトビットを含む。開始バイト情報は制御ユニット３０２によってスキャンされ、マルチプレクサ制御バス３１１に伝えられる信号を生成するために用いられる。命令バイトバス３００Ａ上に伝えられる命令バイトに関連した開始バイトビットをスキャンすることにより検出された第１の開始バイトが、その後に続く７バイトとともにマルチプレクサ３０１Ａによって選択される。マルチプレクサ３０１Ａによって選択されたバイトは、必要に応じて、命令バイトバス３００Ｂで伝えられた命令バイトにまで食い込むこともある。同様に、検出された第２の開始バイトは、後に続く７バイトとともにマルチプレクサ３０１Ｂによって選択される。ここでもまた、マルチプレクサ３０１Ｂによって選択されたバイトは、必要に応じて、命令バイトバス３００Ｂで伝えられた命令バイトまでに食い込むことがある。制御ユニット３０２は、４つの開始バイトが検出されるか、または命令バイトバス３００Ａに伝えられた命令バイトに関連した開始バイトビットがなくなるまでスキャンを続ける。制御ユニット３０２は、前述のスキャンと並行して、かつそれからは独立して、命令バイトバス３００Ｂに伝えられた命令バイトに関連した開始バイトビットと、命令バイトバス３００Ｃに伝えられた命令バイトに関連した開始バイトビットとをスキャンする。その後、それぞれマルチプレクサ３０４および３０５を用いて命令バイトバス３００Ｂおよび命令バイトバス３００Ｃからバイトを選択する類似した手順が行な
われる。先に規定した発行位置を用いて、第２のレベルのマルチプレクサ３０６および３０７の機能を説明することができる。大まかに言えば、マルチプレクサ３０６は、制御ユニット３０２の指示下で発行位置１Ａ−１Ｄと発行位置１Ａ’−１Ｄ ’とをマージして、発行位置２Ａ−２Ｄを形成するように構成される。マージ機能は、発行位置１Ａ−１Ｄにおける有効命令数だけ発行位置１Ａ’−１Ｄ’をシフトし、発行位置２Ａ−２Ｄを発行位置１Ａ−１Ｄからの任意の有効命令で埋め、発行位置１Ａ’−１Ｄ’からもたらされたシフトされた発行命令によって残りの発行位置２Ａ−２Ｄを埋めることによって行なわれる。マルチプレクサ３０７は制御ユニット３０２の指示下で発行位置１Ａ−１Ｄにおける有効命令の数だけ発行位置１Ａ”−１Ｄ”をシフトし、それにより発行位置２Ａ’−２Ｄ’を埋める。ここで述べたとおり、マルチプレクサ３０６および３０７に対するマルチプレクサ制御バス３１２は発行位置１Ａ−１Ｄにおける有効命令数に依存する。マルチプレクサ３０８は、制御ユニット３０２の指示下で発行位置２Ａ−２Ｄと２Ａ’−２Ｄ’とをマージして発行位置３Ａ−３Ｄにするように構成される。マルチプレクサ３０８によって行なわれるマージ機能は、発行位置１Ａ’−１Ｄ ’における有効命令数だけ発行位置２Ａ’−２Ｄ’をシフトし、発行位置２Ａ− ２Ｄにおける任意の有効命令によって発行位置３Ａ−３Ｄを埋め、発行位置２Ａ ’−２Ｄ’からもたらされたシフトされた発行位置によって残りの発行位置３Ａ −３Ｄを埋めることによって行なわれる。発行位置３Ａ−３Ｄに含まれる命令はデコードユニット２０８に転送される。デコードユニット２０８に転送された命令に対応する開始バイトビットがリセットされ、これによりさらなる命令を次のサイクルにおいて処理することができる。別の実施例において、テークンと予測された分岐命令に続く命令の開始ビットが分岐予測ユニット２２０によってリセットされる。このため、１つの場合においては（命令がデコードユニット２０８にディスパッチ済のため）命令バイトバス３００Ａに伝えられた命令バイトに関連した開始ビットがリセットされ、（命令バイトバス３００Ｂに伝えられた命令バイトが、テークンと予測された分岐命令を含むため）命令バイトバス３００Ｃに伝えられた命令バイトに関連した開始ビットがリセットされる。この場合、命令バイトバス３００Ｂに伝えられた命令バイトは命令バイトバス３００Ａまで移動され、新しいキャッシュラインが、予測された分岐命令のターゲットからフェッチされる。１つの実施例において、マルチプレクサ３０８はプリデコードユニット２０２およびＭＲＯＭユニット２０９からの入力をさらに有する。プリデコードユニット２０２からの入力は図３Ｂに３０９として示される。ＭＲＯＭユニット２０９からの入力は図３Ｂに３１０として示される。ＭＲＯＭ入力３１０は、ＭＲＯＭユニット２０９によってＭＲＯＭ命令をデコードユニット２０８に転送することができるようにするために用いられる。プリデコード入力３０９は、命令キャッシュ２０４において命令フェッチのミスが発生した場合に用いられる。この場合、命令はメインメモリから読出され、（１つのクロックサイクルにつき１つの命令が）プリデコードユニット２０２によってプリデコードされる。マイクロプロセッサ２００は、命令キャッシュラインがプリデコードを終え、命令キャッシュにストアされるまで待機するのではなく、プリデコード入力３０９を用いてプリデコード命令をデコードユニット２０８に経路付ける。有効命令は、任意のグループの発行命令内で、Ａで示される位置が最初に埋められ、次にＢで示される位置が埋められるという具合になるような態様で発行位置を埋める。たとえば、発行位置１Ｂは発行位置１Ａが有効命令を含まないならば有効命令を含まない。さらに、発行位置２Ｂ’は発行位置２Ａ’が有効命令を含まないならば有効命令を含まない。一例によって、マルチプレクサ３０６、３０７および３０８によって行なわれるマージおよびシフト動作をさらに明らかにする。この例では、発行位置１Ａおよび１Ｂは有効命令を伝え、発行位置１Ｃおよび１Ｄは有効命令を伝えない。さらに、発行位置１Ａ’は有効命令を伝え、発行位置１Ｂ’，１Ｃ’および１Ｄ’ 
は有効命令を伝えない。最後に、発行位置１Ａ”は有効命令を伝え、発行位置１Ｂ”、１Ｃ”および１Ｄ”は有効命令を伝えない。この例において、発行位置１Ａ’−１Ｄ’および１Ａ”−１Ｄ”は、発行位置１Ａ−１Ｄにおける有効命令数である２だけシフトされる。発行位置１Ａ’−１Ｄ’および１Ａ”−１Ｄ”に対するシフトはそれぞれマルチブレクサ３０６および３０７によって行なわれる。したがって、制御ユニット３０２は、マルチプレクサ制御バス３１２を介して、マルチブレクサ３０６Ａがマルチプレクサ３０１Ａ（発行位置１Ａ）からのバイトを選択し、マルチプレクサ３０６Ｂがマルチプレクサ３０１Ｂ（発行位置１Ｂ）からのバイトを選択し、かつマルチプレクサ３０６Ｃがマルチプレクサ３０４Ａ（発行位置１Ａ’）からのバイトを選択するようにする。マルチプレクサ３０６Ｄはこの例では有効命令を選択しない。こうして、発行位置１Ａ−１Ｄおよび１Ａ’−１Ｄ’がマージされる。発行位置２Ａ− ２Ｄには３つの有効命令が存在する。さらに、制御ユニット３０２はマルチプレクサ３０７Ａ、３０７Ｂおよび３０７Ｄが有効命令を選択しないようにする。制御ユニット３０２はマルチプレクサ３０７Ｃがマルチプレクサ３０５Ａ（発行位置１Ａ”）からのバイトを選択するようにする。この態様で、発行位置２Ａ’− ２Ｄ’は、発行位置１Ａ−１Ｄにおける有効命令数だけシフトされた発行位置１Ａ”−１Ｄ”を含む。例を続けると、制御ユニット３０２はさらに、マルチプレクサ３０８Ａ、３０８Ｂ、３０８Ｃおよび３０８Ｄがそれぞれマルチプレクサ３０６Ａ（発行位置２Ａ）、３０６Ｂ（発行位置２Ｂ）、３０６Ｃ（発行位置２Ｃ）および３０７Ｃ（発行位置２Ｃ’）からのバイトを選択するようにする。この態様で、発行位置２Ａ’−２Ｄ’は、発行位置１Ａ’−１Ｄ’における有効命令数だけ（すなわち１だけ）シフトされる。最後の組のデコード位置３Ａ−３Ｄがもたらされている。この例からわかるように、異なった３組の命令バイトからの４つの有効命令がこのサイクルのデコードのために選択された。有利なことに、４つのデコード位置が用いられた。さまざまなマルチプレクサ３０１、３０４および３０５によって選択されたバイトは重複し得ることに留意されたい。たとえば、マルチプレクサ３０１Ａは制御ユニット３０２によって、命令バイトバス３００Ａで伝えられる８バイトを選択するようにされてもよい。しかしながら、命令バイトバス３００Ａの第２のバイトが開始バイトであることもあり得る。この場合、制御ユニット３０２は、マルチプレクサ３０１Ｂが命令バイトバス３００Ａの第２バイトから第８バイトまでと、命令バイトバス３００Ｂの第１のバイトとを選択するようにする。したがって、命令バイトバス３００Ａの第２バイトから第８バイトまではマルチプレクサ３０１Ａおよび３０１Ｂの両方によって選択される。開始バイトおよび終了バイト情報がデコードユニット２０８に伝えられ、これによりそれらは受取られた８つのバイトのうちいずれが命令を表わすかを決定し得る。開始バイトと終了バイトとの間（両端を含む）に含まれるバイトは、選択されたバイトを受取るデコードユニットよってデコードされることとなる。デコードユニット２０８によって開始バイトおよび／または終了バイトが検出されない場合、バイトはプリデコードユニット２０２（図２）まで転送されて戻され、プリデコードされる。先に規定した機能ビットが当該命令がＭＲＯＭ命令であることを示す場合、このバイトはＭＲＯＭユニット２０９（図２）に転送されてさらに処理される。シフトの効果は、入力がマルチプレクサのグループに結合される態様と、マルチプレクサ制御バスに伝えられる選択信号が発生する態様とによってもたらされることに留意されたい。たとえば、図３Ｂに示されるようなマルチプレクサ３０６Ｂを想定する。マルチプレクサ３０６Ｂは３つの入力、すなわちマルチプレクサ３０１Ｂ、３０４Ａおよび３０４Ｂの出力を有する構成である。したがって、マルチプレクサ３０６Ｂは発行位置１Ｂと、１Ａ’と１Ｂ’から選択する。発行位置１Ａ−１Ｄにおいて１つの命令が有効である場合、マルチプレクサ３０６Ｂは発行位置１Ａ’を選択するようにされる。したがって、マルチプレクサ３０４の第１の発行位置はマルチプレクサ３０６の第２の発行位置までシフトされている。図３Ｂの実施例はまず、命令バイトバス３００Ａから有効命令を選択し、次に命令バイトバス３００Ｂから選択し、最後に命令バイトバス３００Ｃから選択して最後の発行位置３Ａ−３Ｄに送る。この方法論が採用されたのは、入力命令バイトバス３００Ａが最も古い未決命令を含むために、新しい命令をデコード機構が認識できるようになるようにこれらの命令を
最初にデコードする（かつ後に実行する）ことが一般的に有利だからである。他の実施例において、入力命令バイトバス３００は異なった構成を有してもよく、命令を選択するために種々の機構が採用され得る。入力命令バイトのグループの数およびサイズは実施例によっても異なるが、必ずしも命令キャッシュラインに関連することは必ずしもない。実際に、入力命令バイトバス３００には、関係のないグループの命令バイトを与えてもよい。他の実施例では、異なった数の命令チャネリングユニットが設けられてもよいことを理解されたい。さらに、命令バイトバスから選択された開始バイト数（および命令数）が実施例によって異なり得ることを理解されたい。次に図４を参照して、命令バイトバス３００（図４）からの１組の隣接したバイトをデコードユニットに転送する信号経路が示される。上述のとおり、図３Ｂには開始バイト信号経路のみが示された。図３Ｂの場合と同様に、図４には３つのレベルのマルチプレクサが示される。第１のレベルのマルチプレクサ４００Ａ、４００Ｂ、４００Ｃ、４００Ｄ、４００Ｅ、４００Ｆ、４００Ｇおよび４００Ｈ（ここで包括的にマルチプレクサ４００と呼ぶ）が１組の連続した命令バイト４０１に結合される。命令バイト４０１は命令バス３００上で発生する。マルチプレクサ制御バス４０２（制御バス３１１のサブセット）はマルチプレクサ４００に結合される。マルチプレクサ４００Ａにおいて開始バイトが選択され、マルチプレクサ４００Ｂにおいて隣接した次のバイトが選択されるという具合に続く。たとえば、命令バイト１が開始バイトであれば、命令バイト１がマルチプレクサ４００Ａによって選択され、命令バイト２がマルチプレクサ４００Ｂによって選択されるという具合に続く。図４には、第２のレベルのマルチプレクサがマルチプレクサ４０３Ａ、４０３Ｂ、４０３Ｃ、４０３Ｄ、４０３Ｅ、４０３Ｆ、４０３Ｇおよび４０３Ｈ（ここで包括的にマルチプレクサ４０３と呼ぶ）として示される。マルチプレクサ４０３にはマルチプレクサ４００の出力が入力として結合される。さらに、マルチプレクサ４０３の入力として入力４０５が結合される。入力４０５はマルチプレクサ４００に似たマルチプレクサ回路（図示せず）に結合され、これらは制御バス４０２と似ているが、異なったバイトを命令バス３００から選択する種々の制御バスに結合される。たとえば、このような選択制御は、制御バス４０２で発生する開始バイトビットとは異なった開始バイトビットを見出すことによりもたらすことができる。マルチプレクサ４０３はマルチプレクサ制御バス４０４にさらに結合され、これは図３Ｂに示される制御バス３１２のサブセットである。マルチプレクサ４０３の出力は入力として４０７Ａ、４０７Ｂ、４０７Ｃ、４０７Ｄ、４０７Ｅ、４０７Ｆ、４０７Ｇおよび４０７Ｈ（ここでは包括的にマルチプレクサ４０７と呼ぶ）に結合される。マルチプレクサ４０７への入力として入力４０８がさらに結合される。入力４０８は、（制御バス４０４に似た種々の制御バスに結合された）マルチプレクサ４０３に似たマルチブレクサ回路（図示せず）に結合される。１つの実施例において、入力４０８はＭＲＯＭユニット２０９（図２）からのＭＲＯＭ入力と、プリデコードユニット２０２（図２）からの入力とをさらに含む。マルチプレクサ４０７にはマルチプレクサ制御バス４０６がさらに結合され、これは図３Ｂに示される制御バス３１３のサブセットである。マルチプレクサ４０７の出力はデコードユニット２０８のうちの１つの入力バイトに結合される。以上の説明により、高性能の命令整列ユニットを開示した。命令整列ユニットは独立した多数のスキャンおよびシフトユニット（命令チャネリングユニット）を採用して、命令を選択してディスパッチするようにする。ここに記載した方法および装置によりカスケード接続された少数のレベルの論理ゲートでの実装が可能になり、このユニットを高速設計において特に有用とする。さらに、命令整列ユニットは、実行すべき命令に関する広範囲のバイトをスキャンすることにより高性能を達成する。前掲の開示が十分に認められると当業者には多くの変形および修正が明らかとなるであろう。以下のクレームはこのような変形および修正のすべてを包含するものと解されるように意図される。DETAILED DESCRIPTION OF THE INVENTION Name: Superscalar microprocessor including high-speed instruction alignment unit Background of the Invention 1.Field of the invention The 
present invention relates to superscalar microprocessors and, more particularly, to a high-speed instruction alignment unit for dispatching variable-byte-length instructions to a plurality of instruction decode units within a superscalar microprocessor.

2. Description of the Related Art

Superscalar microprocessors can achieve performance exceeding that of conventional scalar processors by allowing multiple instructions to execute in parallel. Because the x86 family of microprocessors is so widely accepted, microprocessor manufacturers have worked to develop superscalar microprocessors that execute x86 instructions. Such superscalar microprocessors achieve relatively high performance while advantageously maintaining backward compatibility with the very large body of existing software developed for previous-generation microprocessors such as the 8086, 80286, 80386, and 80486.

The x86 instruction set is relatively complex and is characterized by a plurality of variable-byte-length instructions. A general format illustrating the x86 instruction set is shown in FIG. 1. As shown in the figure, an x86 instruction consists of one to five optional prefix bytes 102, followed by an operation code (opcode) field 104, an optional addressing mode (Mod R/M) byte 106, an optional scale-index-base (SIB) byte 108, an optional displacement field 110, and optional immediate data 112.

The opcode field 104 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size for an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat a string operation a number of times. The opcode field 104 follows the prefix bytes 102, if any, and may be one or two bytes long. The addressing mode (Mod R/M) byte 106 specifies the registers used and the memory addressing mode.
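The field order of the general format can be sketched as a small data structure. This is an illustrative model only: the class name `X86InstructionLayout`, its attribute names, and the `fields` helper are hypothetical, and the size ranges are simply the ones stated in this description (zero to five prefix bytes, a one- or two-byte opcode, and displacement and immediate fields of at most four bytes each).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class X86InstructionLayout:
    """Byte counts for each field of the general format in FIG. 1 (hypothetical model)."""
    prefix_bytes: int = 0        # optional prefix bytes 102
    opcode_bytes: int = 1        # opcode field 104
    has_modrm: bool = False      # optional Mod R/M byte 106
    has_sib: bool = False        # optional SIB byte 108
    displacement_bytes: int = 0  # optional displacement field 110
    immediate_bytes: int = 0     # optional immediate data 112

    def __post_init__(self) -> None:
        # Ranges taken from the description; they are not a full x86 validity check.
        if not 0 <= self.prefix_bytes <= 5:
            raise ValueError("prefix_bytes must be between 0 and 5")
        if self.opcode_bytes not in (1, 2):
            raise ValueError("opcode_bytes must be 1 or 2")
        if not 0 <= self.displacement_bytes <= 4:
            raise ValueError("displacement_bytes must be between 0 and 4")
        if not 0 <= self.immediate_bytes <= 4:
            raise ValueError("immediate_bytes must be between 0 and 4")

    def fields(self) -> List[Tuple[str, int, int]]:
        """Return (name, offset, size) for each field that is present, in byte order."""
        sizes = [
            ("prefix", self.prefix_bytes),
            ("opcode", self.opcode_bytes),
            ("modrm", int(self.has_modrm)),
            ("sib", int(self.has_sib)),
            ("displacement", self.displacement_bytes),
            ("immediate", self.immediate_bytes),
        ]
        present = []
        offset = 0
        for name, size in sizes:
            if size:
                present.append((name, offset, size))
            offset += size
        return present

    def total_length(self) -> int:
        """Total instruction length in bytes."""
        return sum(size for _, _, size in self.fields())


layout = X86InstructionLayout(prefix_bytes=1, opcode_bytes=2, has_modrm=True,
                              has_sib=True, displacement_bytes=4, immediate_bytes=1)
for name, offset, size in layout.fields():
    print(f"{name:<12} offset={offset} size={size}")
print("total length:", layout.total_length())
```

Because every field except the opcode is optional and several have variable size, the byte offset of the next instruction cannot be known without examining the current one, which is the root of the alignment problem this patent addresses.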
The scale-index-base (SIB) byte 108 is used only in 32-bit base-relative addressing that uses scale and index factors. The base field of the SIB byte specifies which register contains the base value for the address calculation, and the index field specifies which register contains the index value. The scale field specifies the power of two by which the index value is multiplied before it is added, along with any displacement, to the base value. The next instruction field is the optional displacement field 110, which may be from one to four bytes long. The displacement field 110 contains a constant used in address calculations. The optional immediate field 112, which may also be from one to four bytes long, contains a constant used as an instruction operand. The shortest x86 instructions are only one byte long and consist of a single opcode byte. The 80286 sets the maximum instruction length at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.

The complexity of the x86 instruction set makes it difficult to implement high-performance x86-compatible superscalar microprocessors. One difficulty arises from the fact that instructions must be aligned with respect to the parallel-coupled instruction decoders of such a processor before proper decoding can be performed. In contrast to most RISC instruction formats, the x86 instruction set consists of variable-byte-length instructions, so the start bytes of successive instructions within a line are not necessarily equally spaced, and the number of instructions per line is not fixed. As a result, simple, fixed-length shift logic cannot by itself solve the instruction alignment problem.

Scanning logic has been proposed to find instruction boundaries dynamically and sequentially during the decode pipeline stage (or stages) of the processor. Such a solution, however, generally requires that the decode pipeline stage of the processor be implemented with a relatively large number of cascaded levels of logic gates and/or the allocation of several clock cycles to perform the scanning operation.

A further solution to instruction alignment and decoding within x86-compatible superscalar microprocessors is described in the co-pending, commonly assigned patent application Serial No. 08/146,383, "Superscalar Instruction Decoder," filed October 29, 1993 by Witt et al., the disclosure of which is incorporated herein by reference in its entirety. That solution uses a predecode technique in which predecode information for each variable-byte-length instruction is derived when the instruction is stored in the instruction cache. The predecode information indicates, among other things, the boundaries of each instruction. Before dispatch to the decode stage of the processor, an alignment mechanism (referred to as a byte queue) places each instruction sequentially. In placing an instruction, the alignment mechanism converts it into one or more fixed-length RISC-like instructions referred to as "ROPs". These fixed-length ROPs are then provided to an assigned instruction decoder. Subsequent instructions are handled similarly. Although this solution has been quite successful, it too generally requires a relatively large number of cascaded levels of logic gates and/or pipeline stages. It therefore limits the maximum overall clock frequency and performance of the superscalar microprocessor.

Summary of the Invention

The problems outlined above are in large part solved by a superscalar microprocessor according to the present invention. In one embodiment, the superscalar microprocessor employs an instruction alignment unit that transfers a fixed number of bytes from the instruction cache to each of a plurality of decode units. These bytes are selected from a predetermined group of bytes according to predecode tags generated by a predecode unit.
A predecode tag (a different one is associated with each byte) indicates which bytes within the predefined group are the start bytes of instructions. In one specific embodiment, the instruction alignment unit performs start-byte detection simultaneously and independently within three different groups of eight contiguous bytes of instruction code. Having independently determined a predetermined number of start bytes within each group of instruction code, the instruction alignment unit independently routes each start byte, along with the seven bytes adjacent to it, to a "temporary" issue channel associated with that group. The temporary issue channels are then shifted and/or merged into a set of "final" issue channels.
In another embodiment, a superscalar microprocessor is provided in which groups of instruction bytes are conveyed to a pair of instruction channeling units. Each instruction channeling unit independently selects up to four start bytes from its instruction bytes and places each selected start byte, together with a number of bytes adjacent to and following it, into a temporary issue position. The instruction bytes channeled into the two sets of temporary issue positions are then conveyed to a third instruction channeling unit, along with an indication of the number of valid instructions contained in the issue positions of the first instruction channeling unit. The issue positions conveyed by the second instruction channeling unit are then shifted by the number of valid instructions indicated for the first instruction channeling unit. Final issue positions are then selected from the corresponding valid instructions conveyed in the issue positions from the first instruction channeling unit. Any remaining final issue positions are selected from the corresponding issue positions of the shifted set from the second instruction channeling unit.
The final issue positions are coupled to a set of decode units, which decode the instructions and dispatch them to functional units for execution.
In another embodiment, a superscalar microprocessor is provided in which the instruction alignment unit selects from 24 bytes: the last eight bytes of the previously fetched instruction cache line and the 16 bytes of the current instruction cache line. When a start byte is selected for dispatch, the corresponding start bit is invalidated. In this embodiment, up to four instructions can be dispatched in one clock cycle. Once the last eight bytes of the previously fetched cache line and the first eight bytes of the current cache line contain no further instructions to dispatch, the current cache line is moved to the position of the previously fetched cache line and the next instruction cache line is fetched. Each eight-byte portion is examined independently to find start bytes, and each start byte found, together with the seven bytes that follow it, is assigned to an issue position. A first level of multiplexing is implemented to achieve this. The three issue groups (referred to here as issue group 1 for the last eight bytes of the previous cache line, issue group 2 for the first eight bytes of the current cache line, and issue group 3 for the last eight bytes of the current cache line) are then conveyed to a second level of multiplexing. At this level, issue group 1 and issue group 2 are merged by shifting issue group 2 by the number of valid instructions contained in issue group 1. The instructions in issue group 3 are also shifted at this level by the number of valid instructions in issue group 1. The merged and shifted issue groups are then conveyed to a third level of multiplexing. There, the previously shifted issue group 3 is shifted further by the number of valid instructions contained in issue group 2. The twice-shifted issue group 3 is then merged with the previously merged issue groups 1 and 2.
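The shift-and-merge sequence for the three issue groups can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware described here: each issue group is modeled as a list of four slots with valid instructions packed at the front and `None` in empty slots, and the names `shift`, `merge`, `valid_count` and `align` are hypothetical.

```python
from typing import List, Optional

ISSUE_WIDTH = 4  # the text dispatches at most four instructions per clock cycle

Group = List[Optional[str]]


def shift(group: Group, count: int) -> Group:
    """Shift a group toward higher issue positions by `count` slots.

    Entries pushed past the last issue position are dropped for this cycle.
    """
    return ([None] * count + group)[:ISSUE_WIDTH]


def merge(first: Group, second: Group) -> Group:
    """Per-position merge: `first` has priority, `second` fills empty slots."""
    return [a if a is not None else b for a, b in zip(first, second)]


def valid_count(group: Group) -> int:
    """Number of valid instructions held in a group."""
    return sum(entry is not None for entry in group)


def align(group1: Group, group2: Group, group3: Group) -> Group:
    """Model of the second and third multiplexing levels."""
    n1 = valid_count(group1)
    n2 = valid_count(group2)
    # Second level: merge groups 1 and 2, and shift group 3 by n1.
    merged_12 = merge(group1, shift(group2, n1))
    shifted_3 = shift(group3, n1)
    # Third level: shift group 3 again by n2, then merge it in.
    return merge(merged_12, shift(shifted_3, n2))
```

For instance, with one valid instruction in group 1 and two each in groups 2 and 3, `align` fills the four final slots with the instruction from group 1, both from group 2, and the first from group 3. Note that neither shift amount depends on the result of another shift, which is the property the text relies on to keep the logic shallow.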
The resulting issue group is conveyed to the instruction decode units, and the start bits corresponding to the conveyed instructions are reset. The third level of multiplexing also includes inputs from the MROM unit and the predecode unit.
A superscalar microprocessor according to the present invention employs an instruction alignment unit that can be implemented with a small number of cascaded gates, because the instruction alignment unit scans several small fields of start-byte information simultaneously and independently shifts the instructions found by the number of start bytes found ahead of them. No combining of computed values is necessary, which makes the implementation faster.
Broadly speaking, the present invention contemplates a superscalar microprocessor comprising an instruction cache, a plurality of decode units, and an instruction alignment unit that includes first, second and third instruction channeling units. The first and second instruction channeling units are coupled to input ports. The input ports convey multiple groups of instruction bytes from the instruction cache. From these groups of instruction bytes, the first instruction channeling unit selects a first plurality of instruction bytes for dispatch and the second instruction channeling unit selects a second plurality of instruction bytes for dispatch. The third instruction channeling unit then merges the first plurality of instruction bytes with the second plurality of instruction bytes to form merged instruction bytes. The merged instruction bytes are then dispatched through output ports to the instruction decode units.
Brief Description of the Figures
Other objects and advantages of the invention will become apparent upon reading the following detailed description with reference to the accompanying drawings.
FIG. 1 is a block diagram of a generic x86 instruction format.
FIG. 2 is a block diagram of a superscalar microprocessor including an instruction alignment unit according to the present invention.
FIG. 3A is a block diagram of one embodiment of an instruction alignment unit according to the present invention.
FIG. 3B is a diagram of another embodiment of the instruction alignment unit according to the present invention. Only the connections of the start bytes to the levels of multiplexing are shown.
FIG. 4 is a diagram showing 15 adjacent instruction bytes and the multiplexing connections needed to select eight adjacent bytes from the set of 15 instruction bytes.
While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example and are described in detail herein. The drawings and detailed description are not intended to limit the invention to the particular form disclosed. On the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the invention as defined by the claims.
Detailed Description of the Invention
Referring now to FIG. 2, a block diagram of a superscalar microprocessor 200 including an instruction alignment unit 206 according to the present invention is shown. As illustrated in the embodiment of FIG. 2, superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. Instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208D (referred to collectively as decode units 208). Each of decode units 208A-208D is coupled to a respective reservation station unit 210A-210D (referred to collectively as reservation stations 210), and each reservation station unit 210A-210D is coupled to a respective functional unit 212A-212D (referred to collectively as functional units 212).
Decode units 208, reservation stations 210 and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.
Generally speaking, instruction cache 204 is a high-speed cache memory provided to temporarily store instructions before they are dispatched to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from main memory (not shown) through prefetch/predecode unit 202. It is noted that instruction cache 204 could be implemented in a set-associative, fully associative, or direct-mapped configuration.
Prefetch/predecode unit 202 is provided to prefetch instruction code from main memory for storage in instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.
As prefetch/predecode unit 202 fetches instructions from main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags that indicate the boundaries of each instruction. As described in more detail below, the predecode tags may also convey additional information, such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209.
Table 1 shows an example of the encoding of the predecode tags. As shown in the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a ModR/M or an SIB byte, or whether the byte contains displacement or immediate data.
Table 1: Encoding of start, end and functional bits
As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode units 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. More specifically, when an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to carry out the desired operation. A listing of exemplary x86 instructions categorized as fast path instructions, along with a description of how both fast path and MROM instructions are handled, is provided further below.
Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to the fixed issue positions formed by decode units 208A-208D. As described below in connection with the figures, instruction alignment unit 206 is configured to channel instruction bytes to designated decode units 208A-208D. Instruction alignment unit 206 independently and in parallel selects instructions from three groups of instruction bytes provided by instruction cache 204 and arranges these bytes into three groups of temporary issue positions. Each group of issue positions is associated with one of the three groups of instruction bytes. The temporary issue positions are then merged together to form the final issue positions, each of which is coupled to one of decode units 208.
Before proceeding with a detailed description of the alignment of instructions from instruction cache 204 to decode units 208, general aspects of the other subsystems employed within the exemplary superscalar microprocessor 200 of FIG. 2 will be described. In the embodiment of FIG. 2, each of decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208D routes displacement and immediate data to a corresponding reservation station unit 210A-210D. Output signals from decode units 208 include bit-encoded execution instructions for functional units 212 as well as operand address information, immediate data and/or displacement data.
The superscalar microprocessor of FIG. 2 supports out-of-order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register, so that speculative register states can be stored. Reorder buffer 216 may be implemented in a first-in, first-out configuration in which speculative results move toward the "tail" end of the buffer as they are validated and written to the register file, thus making room for new entries at the "head" of the buffer. Other specific configurations of reorder buffer 216 are also possible, as described further below. If a branch prediction is incorrect, the results of speculatively executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.
The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208D are routed directly to the respective reservation station units 210A-210D. In one embodiment, each reservation station unit 210A-210D is capable of holding instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that, in the embodiment of FIG. 2, each decode unit 208A-208D is associated with a dedicated reservation station unit 210A-210D, and each reservation station unit 210A-210D is similarly associated with a dedicated functional unit 212A-212D. Accordingly, four dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and then to functional unit 212B, and so on.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously.
Those of skill in the art will appreciate that the x86 register file includes eight 32-bit real registers (typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). Reorder buffer 216 contains temporary storage locations for results that change the contents of these registers, thereby allowing out-of-order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction that, upon decode, is determined to modify the contents of one of the real registers. Therefore, at various points during the execution of a particular program, reorder buffer 216 may have one or more locations that contain the speculatively executed contents of a given register. If, following decode of a given instruction, it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, reorder buffer 216 forwards to the corresponding reservation station either 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If no location is reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
Details regarding suitable reorder buffer implementations may be found in the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and in the co-pending, commonly assigned patent application "High Performance Superscalar Microprocessor", Serial No. 08/146,382, filed October 29, 1993 by Witt et al.
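The operand lookup rule just described (value from the most recent reorder buffer location, else a tag for it, else the register file) can be sketched as follows. This is a minimal illustrative model, not the patented circuit: the `RobEntry` class, the `read_operand` function, and the use of a list index as the tag are all assumptions made for the example, with list order standing in for allocation order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    dest_reg: str          # register that the pending instruction will update
    value: Optional[int]   # None until a functional unit produces the result


def read_operand(reg: str,
                 rob: List[RobEntry],
                 register_file: Dict[str, int]) -> Tuple[str, int]:
    """Return ("value", v) or ("tag", rob_index) for a register operand.

    Assumption: entries later in `rob` were assigned more recently.
    """
    # Search from the most recently assigned location backwards.
    for index in range(len(rob) - 1, -1, -1):
        entry = rob[index]
        if entry.dest_reg == reg:
            if entry.value is not None:
                return ("value", entry.value)   # case 1: speculative value
            return ("tag", index)               # case 2: result still pending
    # No location reserved for this register: read the register file.
    return ("value", register_file[reg])
```

With two pending writes to EAX, only the newer one matters: if its result is not yet available the reservation station receives a tag, even when the older entry already holds a value.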
These documents are incorporated herein by reference in their entirety.
Reservation station units 210A-210D are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212D. As stated previously, each reservation station unit 210A-210D may store instruction information for up to three pending instructions. Each of the four reservation stations 210A-210D contains locations to store the bit-encoded execution instructions to be speculatively executed by the corresponding functional unit, along with the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212D, the result of that instruction is passed directly to any reservation station units 210A-210D that are waiting for that result, at the same time that the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operands are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210D has been tagged with the location of a previous result value within reorder buffer 216 that corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.
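The interplay of operand tags, result forwarding and issue can be illustrated with a small Python model. It is a sketch under stated assumptions rather than a description of the actual hardware: the `ReservationStation` class and its methods are hypothetical, operands are modeled as `("value", v)` or `("tag", t)` pairs, and the policy of issuing the first ready instruction is chosen only for illustration (the text does not specify a selection policy within a station).

```python
from typing import List, Optional, Tuple

Operand = Tuple[str, int]  # ("value", v) or ("tag", reorder_buffer_location)


class ReservationStation:
    CAPACITY = 3  # per the text: up to three pending instructions

    def __init__(self) -> None:
        self.pending: List[dict] = []

    def add(self, name: str, operands: List[Operand]) -> bool:
        """Store a decoded instruction; refuse it if the station is full."""
        if len(self.pending) >= self.CAPACITY:
            return False
        self.pending.append({"name": name, "operands": list(operands)})
        return True

    def forward_result(self, tag: int, value: int) -> None:
        """Result forwarding: replace every matching tag with the value."""
        for instr in self.pending:
            instr["operands"] = [
                ("value", value) if op == ("tag", tag) else op
                for op in instr["operands"]
            ]

    def issue_ready(self) -> Optional[str]:
        """Remove and return the first instruction whose operands are all values."""
        for i, instr in enumerate(self.pending):
            if all(kind == "value" for kind, _ in instr["operands"]):
                return self.pending.pop(i)["name"]
        return None
```

In this model an instruction waiting on a tag stays in the station while a later instruction with complete operands issues ahead of it, which mirrors the statement that execution order may differ from program order.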
In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.
Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes the instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, the results of instructions in the original program sequence that occur after the mispredicted branch instruction are discarded, including those that were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.
Results produced by functional units 212 are sent to reorder buffer 216 if a register value is being updated, and to load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, reorder buffer 216 stores the result in the location that was reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210D, where pending instructions may be waiting for the results of previous instruction executions in order to obtain their required operand values.
Generally speaking, load/store unit 222 provides an interface between functional units 212A-212D and data cache 224. In one embodiment, load/store unit 222 is configured with a store buffer having eight storage locations for data and address information for pending loads or stores. Functional units 212 arbitrate for access to load/store unit 222. When the buffer is full, a functional unit must wait until load/store unit 222 has room for the pending load or store request information. Load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.
Data cache 224 is a high-speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations.
Details regarding the dispatch of instructions from instruction cache 204 through instruction alignment unit 206 to decode units 208 are discussed next. FIG. 3A is a block diagram showing an embodiment of instruction alignment unit 206 and the input registers of decode units 208. This embodiment employs two instruction byte buses 250A and 250B (referred to collectively as instruction byte buses 250). Instruction bytes are driven by instruction cache 204 onto instruction byte buses 250, each of which conveys eight bytes. Instruction byte bus 250A is coupled to instruction channeling unit 251, and instruction byte bus 250B is coupled to instruction channeling unit 252. Also shown in FIG. 3A is a control unit 255, which receives input information on predecode tag bus 254 and has control output buses 256, 257 and 258. Control output bus 256 is coupled to instruction channeling unit 252. Similarly, control output bus 257 is coupled to instruction channeling unit 251, and control output bus 258 is coupled to instruction channeling unit 253.
Instruction channeling unit 251 produces four temporary issue positions: temporary issue position A, temporary issue position B, temporary issue position C and temporary issue position D. Similarly, instruction channeling unit 252 produces temporary issue positions A', B', C' and D'. Each of issue positions A-D and A'-D' is coupled to instruction channeling unit 253. Instruction channeling unit 253 produces four final issue positions 267, 268, 269 and 270, which are coupled to decode units 208A, 208B, 208C and 208D, respectively. In this embodiment, each temporary or final issue position conveys at most one valid instruction, and conveys a fixed number of bytes in which the valid instruction is contained.
Generally speaking, instruction channeling units 251 and 252 independently and in parallel select instructions from instruction byte buses 250A and 250B, respectively. The selected instructions occupy the temporary issue positions coupled to instruction channeling units 251 and 252. Instruction channeling unit 253 shifts the instructions conveyed in temporary issue positions A'-D' by the number of instructions conveyed in temporary issue positions A-D. Instruction channeling unit 253 then merges the instructions from the two sets of temporary issue positions into final issue positions 267-270. The instruction selection and shifting processes are described in more detail in the following paragraphs.
In this embodiment, control unit 255 receives (via bus 254) the start byte bits associated with the instruction bytes conveyed on instruction byte buses 250. Control unit 255 scans the start byte information for instruction byte bus 250A, looking for a set start byte bit. If a start byte bit is set, the corresponding byte on instruction byte bus 250A is the start of an instruction.
Control unit 255 then directs instruction channeling unit 251 (via signals on control output bus 257) to select the corresponding byte on instruction byte bus 250A together with the seven bytes that follow it. The selected bytes occupy the next available temporary issue position: temporary issue position A is used first, then temporary issue position B, and so on. Control unit 255 continues scanning the start byte bits associated with instruction byte bus 250A until either the issue positions of instruction channeling unit 251 are all occupied or the start byte bits associated with instruction byte bus 250A are exhausted. Similarly and in parallel, control unit 255 processes the start byte bits associated with instruction byte bus 250B and conveys issue position selection information to instruction channeling unit 252 on control output bus 256.
In the embodiment of FIG. 3A, instructions conveyed on instruction byte bus 250A take priority over instructions conveyed on instruction byte bus 250B. Valid instructions conveyed in temporary issue positions A-D are therefore directed by instruction channeling unit 253, under the direction of control unit 255, to final issue positions 267-270. When conveying a valid instruction, temporary issue position A is directed to issue position 267. Similarly, temporary issue position B is directed to issue position 268 when conveying a valid instruction, and so on. Furthermore, instruction channeling unit 253 shifts temporary issue positions A'-D' by the number of valid instructions selected by instruction channeling unit 251, that is, by the number of valid instructions conveyed in issue positions A-D. The shifted temporary issue positions then occupy those final issue positions 267-270 that are not occupied by instructions from temporary issue positions A-D. In this way, decode units 208 receive up to the maximum number of instructions (four).
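The scan-and-select step and the priority merge for the two-bus arrangement can be modeled in software as below. This is a hedged sketch, not the circuit itself: the function names are hypothetical, and the instruction bytes are modeled as one flat `bytes` object so that the seven bytes following a start byte near the end of a group can be taken from the bytes that come after the group (an assumption made for the example).

```python
from typing import List, Optional, Sequence

GROUP_BYTES = 8     # bytes conveyed per instruction byte bus
WINDOW_BYTES = 8    # a start byte plus the seven bytes that follow it
MAX_POSITIONS = 4   # temporary issue positions per channeling unit


def select_temporary_positions(stream: bytes,
                               start_bits: Sequence[int],
                               base: int) -> List[bytes]:
    """Scan the start bits of the 8-byte group beginning at stream[base].

    Each set start bit fills the next free temporary issue position with
    the start byte and the following seven bytes.
    """
    positions: List[bytes] = []
    for offset in range(GROUP_BYTES):
        if start_bits[base + offset]:
            begin = base + offset
            positions.append(bytes(stream[begin:begin + WINDOW_BYTES]))
            if len(positions) == MAX_POSITIONS:
                break  # all temporary issue positions are occupied
    return positions


def merge_final_positions(first: List[bytes],
                          second: List[bytes]) -> List[Optional[bytes]]:
    """The first group has priority; the second is shifted by len(first)."""
    final: List[Optional[bytes]] = [None] * MAX_POSITIONS
    for i, window in enumerate(first):
        final[i] = window
    for i, window in enumerate(second):
        target = i + len(first)
        if target >= MAX_POSITIONS:
            break  # no final issue position left this cycle
        final[target] = window
    return final
```

The two calls to `select_temporary_positions` (one per group) do not depend on each other, which is the software analogue of scanning both buses independently and in parallel. Only the merge needs the count of valid instructions from the first group.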
The operation of this embodiment is further illustrated with an example. Assume that instruction byte bus 250A conveys two valid instructions in a clock cycle and that instruction byte bus 250B also conveys two valid instructions in the same clock cycle. Under the direction of control unit 255, instruction channeling unit 251 selects the first start byte and the following seven bytes from instruction byte bus 250A and provides the selected bytes to temporary issue position A. Control unit 255 then finds the second start byte on instruction byte bus 250A and directs instruction channeling unit 251 to occupy temporary issue position B with the second start byte and the following seven bytes. Independently of and in parallel with the above, control unit 255 scans the start byte bits associated with the instruction bytes provided on instruction byte bus 250B and detects the first start byte. The detected start byte and the following seven bytes occupy temporary issue position A'. Continuing the scanning process, control unit 255 detects the second start byte on instruction byte bus 250B. The second start byte and the following seven bytes are selected by instruction channeling unit 252 into temporary issue position B'. It is noted that the scanning mechanism of control unit 255 could also find instructions on instruction byte bus 250B for temporary issue positions C' and D'. However, as will be seen, issue positions C' and D' would in this case effectively be ignored by instruction channeling unit 253.
Next, control unit 255 controls instruction channeling unit 253 using control output bus 258. Since two valid instructions are present in temporary issue positions A and B, those instructions occupy final issue positions 267 and 268, respectively. Also, because two valid instructions were selected in instruction channeling unit 251, temporary issue positions A'-D' are shifted by two positions.
As a result, the instruction conveyed in issue position A' is aligned with final issue position 269. Similarly, issue position B' is aligned with final issue position 270. Thus the two valid instructions originally in temporary issue positions A' and B' occupy issue positions 269 and 270, respectively, and each of decode units 208 receives an instruction in this cycle.
In another embodiment, the bytes selected to occupy one temporary issue position at the outputs of instruction channeling units 251 and 252 overlap with the bytes selected to occupy another temporary issue position. The number of bytes occupying a temporary or final issue position is fixed, and some instructions will not occupy all of the bytes in an issue position. Therefore, the start byte, and possibly other bytes, of a subsequent instruction may occupy byte positions within the current issue position. Decode units 208 receive the start byte bits and end byte bits associated with the instructions conveyed to them, and examine those bits to determine which of the conveyed bytes make up the complete valid instruction. It goes without saying that other embodiments may employ different numbers of issue positions and decode units.
The embodiment described in connection with FIG. 3A can be implemented in a small number of cascaded levels of logic, and can therefore operate at high speed. There are several reasons why so few cascaded logic levels are needed. First, the instruction bytes conveyed on instruction byte buses 250 are processed in small groups that are independent of each other. Instead of linearly scanning through the start bit information associated with this many instruction bytes, the small groups can be processed in parallel.
Second, the small groups are combined based on the number of valid instructions found in one of them (in this embodiment, instruction byte bus 250A).
Referring now to FIG. 3B, another embodiment of instruction alignment unit 206 is shown. The instruction channeling units of this embodiment consist of multiplexers, which are controlled by output control unit 302 via multiplexer control buses 311, 312 and 313. Also shown are three instruction byte buses 300A, 300B and 300C (referred to collectively herein as instruction byte buses 300). Instruction byte bus 300A conveys the last eight instruction bytes from the "previously" fetched instruction cache line. Instruction byte bus 300B conveys the first eight bytes of the "most recent" instruction cache line, and instruction byte bus 300C conveys the last eight bytes of the "most recent" instruction cache line. Once the instructions from the last eight bytes of the previously fetched cache line and the first eight bytes of the most recent cache line have been conveyed to decode units 208, the last eight bytes of the most recent cache line become the last eight bytes of the previously fetched instruction cache line (that is, they are moved to instruction byte bus 300A), and a new cache line is fetched (and conveyed on instruction byte buses 300B and 300C).
Still referring to FIG. 3B, signal paths are shown between input instruction byte buses 300 and first-level multiplexers 301A, 301B, 301C, 301D, 304A, 304B, 304C, 304D, 305A, 305B, 305C and 305D (referred to collectively as multiplexers 301, 304 and 305, respectively). In contrast to the previous embodiment, which has two first-level instruction channeling units, this embodiment has three first-level instruction channeling units, represented by multiplexers 301, 304 and 305. The first-level instruction channeling units have associated with them issue positions 1A-1D, 1A'-1D' and 1A"-1D". FIG. 3B also shows signal paths between first-level multiplexers 301, 304 and 305 and second-level multiplexers 306A, 306B, 306C, 306D, 307A, 307B, 307C and 307D (referred to collectively herein as multiplexers 306 and 307, respectively). Multiplexers 306 and 307 form two second-level instruction channeling units, which have associated with them issue positions 2A-2D and 2A'-2D'. Finally, signal paths are shown between second-level multiplexers 306 and 307 and third-level multiplexers 308A, 308B, 308C and 308D (referred to collectively herein as multiplexers 308). Multiplexers 308 form a third-level instruction channeling unit, which has associated with it issue positions 3A-3D.
Broadly speaking, each of the first-level instruction channeling units formed by multiplexers 301, 304 and 305 selects instructions independently and in parallel from its associated instruction byte bus 300A-300C into issue positions 1A-1D, 1A'-1D' and 1A"-1D", respectively. The second-level instruction channeling units formed by multiplexers 306 and 307 shift issue positions 1A'-1D' and 1A"-1D", respectively, by the number of valid instructions in issue positions 1A-1D. In addition, multiplexers 306 merge issue positions 1A-1D with the shifted issue positions associated with issue positions 1A'-1D'. The third-level instruction channeling unit formed by multiplexers 308 shifts issue positions 2A'-2D' by the number of instructions in issue positions 1A'-1D'. Multiplexers 308 further merge issue positions 2A-2D with the shifted issue positions associated with issue positions 2A'-2D'. This embodiment is now described more fully.
FIG. 3B shows only the signal paths for multiplexing the start bytes. However, as indicated by the slashes on the outputs of the first-level multiplexers, multiple bytes are selected by each multiplexer.
The multiplexing of the other bytes selected for a given multiplexer is described below. The first-level multiplexers are grouped according to the instruction byte bus 300 to which they are coupled: multiplexers 301 are coupled to instruction byte bus 300A, multiplexers 304 to instruction byte bus 300B, and multiplexers 305 to instruction byte bus 300C.

In one embodiment, multiplexer 301A is coupled to all eight instruction bytes on instruction byte bus 300A, so that a start byte can be selected from any byte conveyed on that bus. Multiplexer 301B is coupled to every byte of instruction byte bus 300A except the first. Multiplexer 301B need not be coupled to the first byte, because if that byte is a start byte it is selected by multiplexer 301A. Similarly, multiplexer 301C need not be coupled to the first two bytes: if both are start bytes, the first is selected by multiplexer 301A and the second by multiplexer 301B. Finally, multiplexer 301D is shown coupled to every byte of instruction byte bus 300A except the first three. Multiplexers 301A, 301B, 301C and 301D, in combination with the corresponding signal paths from instruction byte bus 300A, can therefore select up to four start bytes from instruction byte bus 300A.

As further shown in FIG. 3B, signal paths are drawn from instruction byte bus 300A to multiplexers 301. Similar signal paths are shown between input instruction byte bus 300B and multiplexers 304. These multiplexers are similar to multiplexers 301: multiplexer 304A is similar to 301A, 304B to 301B, 304C to 301C, and 304D to 301D. The operation of multiplexers 304 is independent of the operation of multiplexers 301 and is performed in parallel with it.
The signal paths between instruction byte bus 300C and multiplexers 305 are similar to those between instruction byte bus 300A and multiplexers 301.

Control unit 302 is coupled to multiplexers 301, 304 and 305 via multiplexer control bus 311. Control unit 302 further includes a predecode tag input port 303. Input port 303 conveys the information that control unit 302 uses to cause multiplexers 301, 304 and 305 to select instruction bytes from instruction byte bus 300. In one embodiment, the information conveyed on input port 303 includes the start byte bits associated with the bytes provided on instruction byte bus 300. Control unit 302 scans the start byte information and uses it to generate the signals conveyed on multiplexer control bus 311.

The first start byte detected by scanning the start byte bits associated with the instruction bytes on instruction byte bus 300A is selected by multiplexer 301A, together with the seven bytes that follow it. If necessary, the bytes selected by multiplexer 301A may extend into the instruction bytes conveyed on instruction byte bus 300B. Similarly, the second start byte detected is selected by multiplexer 301B, together with the seven bytes that follow it. Again, the bytes selected by multiplexer 301B may extend into the instruction bytes conveyed on instruction byte bus 300B if necessary. Control unit 302 continues scanning until four start bytes have been detected, or until no start byte bits remain among those associated with the instruction bytes conveyed on instruction byte bus 300A. In parallel with this scan, and independently of it, control unit 302 scans the start byte bits associated with the instruction bytes conveyed on instruction byte bus 300B and those associated with the instruction bytes conveyed on instruction byte bus 300C.
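The first-level scan described above can be illustrated with a short Python sketch. It is a behavioural model under stated assumptions, not the circuit itself: the names `scan_group` and `select_windows` are hypothetical, start byte bits are modelled as a list of 0/1 values, and the group size of eight bytes and the limit of four start bytes per group follow the embodiment of FIG. 3B.

```python
from typing import List

GROUP_SIZE = 8   # bytes conveyed on one instruction byte bus (assumed from the text)
MAX_STARTS = 4   # issue positions per first-level channeling unit
WINDOW = 8       # a start byte plus the seven bytes that follow it


def scan_group(start_bits: List[int]) -> List[int]:
    """Return the offsets of the first MAX_STARTS start bytes in one group."""
    offsets = [i for i, bit in enumerate(start_bits) if bit]
    return offsets[:MAX_STARTS]


def select_windows(group: bytes, next_group: bytes,
                   start_bits: List[int]) -> List[bytes]:
    """Model one first-level channeling unit.

    For each detected start byte, pick that byte and the following seven.
    A window that runs past the end of `group` continues into `next_group`,
    as when bytes selected from bus 300A extend into bus 300B.
    """
    stream = group + next_group
    return [stream[off:off + WINDOW] for off in scan_group(start_bits)]
```

Because the k-th start byte found can never lie before byte k of the group, the k-th multiplexer does not need a path to the first k bytes, which matches the reduced coupling described for multiplexers 301B to 301D. Each group is scanned by its own call, so the three groups can be handled independently of one another.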
Thereafter, a similar procedure is performed to select bytes from instruction byte bus 300B and instruction byte bus 300C using multiplexers 304 and 305, respectively.

Using the issue positions defined above, second-level multiplexers 306 and 307 can now be described. Generally speaking, multiplexers 306 are configured, under the direction of control unit 302, to merge issue positions 1A-1D and 1A'-1D' into issue positions 2A-2D. The merge is performed by shifting issue positions 1A'-1D' by the number of valid instructions in issue positions 1A-1D, filling issue positions 2A-2D with any valid instructions from issue positions 1A-1D, and filling the remaining issue positions 2A-2D with the shifted issue positions derived from issue positions 1A'-1D'. Multiplexers 307, under the direction of control unit 302, shift issue positions 1A"-1D" by the number of valid instructions in issue positions 1A-1D and thereby fill issue positions 2A'-2D'. As this description shows, multiplexer control bus 312 for multiplexers 306 and 307 depends on the number of valid instructions in issue positions 1A-1D.

Multiplexers 308, under the direction of control unit 302, merge issue positions 2A-2D and 2A'-2D' into issue positions 3A-3D. The merge performed by multiplexers 308 shifts issue positions 2A'-2D' by the number of valid instructions in issue positions 1A'-1D', fills issue positions 3A-3D with any valid instructions in issue positions 2A-2D, and fills the remaining issue positions 3A-3D with the shifted issue positions derived from issue positions 2A'-2D'. The instructions contained in issue positions 3A-3D are transferred to decode units 208. The start byte bits corresponding to the instructions transferred to decode units 208 are reset, so that further instructions can be selected in subsequent cycles.
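The two merge levels can be modelled in a few lines of Python. This is a sketch under assumptions rather than a description of the multiplexer wiring: each issue position is represented by an instruction label or `None`, and the helper names `count_valid`, `shift`, `merge` and `align` are hypothetical.

```python
from typing import List, Optional, Sequence

ISSUE_WIDTH = 4  # issue positions A to D in each group (per the embodiment)

Slot = Optional[str]  # an instruction label, or None for an empty issue position


def count_valid(positions: Sequence[Slot]) -> int:
    """Number of issue positions that hold a valid instruction."""
    return sum(p is not None for p in positions)


def shift(positions: Sequence[Slot], n: int) -> List[Slot]:
    """Move every entry n issue positions later; entries pushed past D are dropped."""
    return ([None] * n + list(positions))[:ISSUE_WIDTH]


def merge(older: Sequence[Slot], shifted: Sequence[Slot]) -> List[Slot]:
    """Keep each valid older entry; fill the remaining positions from `shifted`."""
    return [a if a is not None else b for a, b in zip(older, shifted)]


def align(first: Sequence[Slot], second: Sequence[Slot],
          third: Sequence[Slot]) -> List[Slot]:
    """Model the second and third levels (multiplexers 306, 307 and 308)."""
    n = count_valid(first)    # valid instructions in 1A-1D
    m = count_valid(second)   # valid instructions in 1A'-1D'
    level2 = merge(first, shift(second, n))        # issue positions 2A-2D
    level2_prime = shift(third, n)                 # issue positions 2A'-2D'
    return merge(level2, shift(level2_prime, m))   # issue positions 3A-3D
```

Instructions that do not fit into the four final issue positions are simply dropped by this model; in the text they keep their start byte bits set and are picked up in a later cycle.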
In another embodiment, the start bits of instructions that follow a predicted-taken branch instruction are reset by branch prediction unit 220. A case can therefore arise in which the start bits associated with the instruction bytes conveyed on an instruction byte bus are reset because the instructions have been dispatched to decode units 208, and the start bits associated with the instruction bytes conveyed on instruction byte bus 300C are reset (because the instruction bytes conveyed on instruction byte bus 300B contain a predicted-taken branch instruction). In this case, the instruction bytes conveyed on instruction byte bus 300B are moved to instruction byte bus 300A, and a new cache line is fetched from the target of the predicted branch instruction.

In one embodiment, multiplexers 308 also receive inputs from predecode unit 202 and from MROM unit 209. The input from predecode unit 202 is shown as 309 in FIG. 3B, and the input from MROM unit 209 is shown as 310 in FIG. 3B. MROM input 310 allows MROM unit 209 to transfer MROM instructions to decode units 208. Predecode input 309 is used when an instruction fetch misses in instruction cache 204. In this case, instructions are read from main memory and predecoded by predecode unit 202 (one instruction per clock cycle). Instead of waiting until the instruction cache line has been completely predecoded and stored in the instruction cache, microprocessor 200 uses predecode input 309 to route the predecoded instructions to decode units 208.

Within any group of issue positions, valid instructions fill the positions in order: the position designated A is filled first, then the position designated B, and so on. For example, issue position 1B does not contain a valid instruction if issue position 1A does not. Likewise, issue position 2B' does not contain a valid instruction if issue position 2A' does not.

An example further clarifies the merge and shift operations performed by multiplexers 306, 307 and 308. In this example, issue positions 1A and 1B convey valid instructions, and issue positions 1C and 1D do not. Issue position 1A' conveys a valid instruction, and issue positions 1B', 1C' and 1D' do not. Finally, issue position 1A" conveys a valid instruction, and issue positions 1B", 1C" and 1D" do not.

In this example, issue positions 1A'-1D' and 1A"-1D" are shifted by two, the number of valid instructions in issue positions 1A-1D. The shifts for issue positions 1A'-1D' and 1A"-1D" are performed by multiplexers 306 and 307, respectively. Accordingly, control unit 302, via multiplexer control bus 312, causes multiplexer 306A to select the bytes from multiplexer 301A (issue position 1A), multiplexer 306B to select the bytes from multiplexer 301B (issue position 1B), and multiplexer 306C to select the bytes from multiplexer 304A (issue position 1A'). Multiplexer 306D does not select a valid instruction in this example. In this manner, issue positions 1A-1D and 1A'-1D' are merged, and issue positions 2A-2D contain three valid instructions. Control unit 302 also causes multiplexers 307A, 307B and 307D not to select valid instructions, and causes multiplexer 307C to select the bytes from multiplexer 305A (issue position 1A"). In this manner, issue positions 2A'-2D' contain issue positions 1A"-1D" shifted by the number of valid instructions in issue positions 1A-1D.

Continuing the example, control unit 302 further causes multiplexers 308A, 308B, 308C and 308D to select the bytes from multiplexers 306A (issue position 2A), 306B (issue position 2B), 306C (issue position 2C) and 307C (issue position 2C'), respectively.
In this manner, issue positions 2A'-2D' are shifted by the number of valid instructions in issue positions 1A'-1D' (that is, by one). The final set of decode positions 3A-3D is thus provided. As the example shows, four valid instructions from three different sets of instruction bytes are selected for decode in this cycle. Advantageously, all four decode positions are used.

Note that the bytes selected by the various multiplexers 301, 304 and 305 can overlap. For example, control unit 302 may direct multiplexer 301A to select the eight bytes conveyed on instruction byte bus 300A. The second byte on instruction byte bus 300A may, however, also be a start byte. In this case, control unit 302 causes multiplexer 301B to select bytes two through eight of instruction byte bus 300A and the first byte of instruction byte bus 300B. Bytes two through eight of instruction byte bus 300A are therefore selected by both multiplexer 301A and multiplexer 301B. The start byte and end byte bits are conveyed to decode units 208, so that each decode unit can determine which of the eight bytes it receives represent the instruction. The bytes between the start byte and the end byte (inclusive) are decoded by the decode unit that receives the selected bytes. If a decode unit 208 does not find a start byte and/or an end byte, the bytes are transferred back to predecode unit 202 (FIG. 2) and predecoded. If the first function bit indicates that the instruction is an MROM instruction, the bytes are transferred to MROM unit 209 (FIG. 2) for further processing.

Note that the effect of the shift is achieved by the way the inputs are coupled to the groups of multiplexers and by the way the select signals conveyed on the multiplexer control buses are generated. Consider, for example, multiplexer 306B as shown in FIG. 3B. Multiplexer 306B has three inputs, namely the outputs of multiplexers 301B, 304A and 304B.
Multiplexer 306B therefore selects among issue positions 1B, 1A' and 1B'. If one instruction is valid in issue positions 1A-1D, multiplexer 306B selects issue position 1A'. The first issue position of multiplexers 304 has thus been shifted to the second issue position of multiplexers 306.

This embodiment selects valid instructions first from instruction byte bus 300A, next from instruction byte bus 300B, and finally from instruction byte bus 300C to fill final issue positions 3A-3D. This ordering is used because input instruction byte bus 300A contains the oldest pending instructions, and it is generally advantageous to decode (and later execute) those instructions first so that newer instructions can then be decoded. In other embodiments, input instruction byte bus 300 may be configured differently, and various mechanisms for selecting instructions may be employed. The number and size of the groups of input instruction bytes may also differ between embodiments, and the groups need not be associated with instruction cache lines. Indeed, input instruction byte bus 300 may provide unrelated groups of instruction bytes. It is understood that other embodiments may provide different numbers of instruction channeling units. Furthermore, the number of start bytes (and therefore the number of instructions) selected from an instruction byte bus may vary from embodiment to embodiment.

Referring next to FIG. 4, the signal path for transferring a set of contiguous bytes from instruction byte bus 300 to a decode unit is shown. As noted above, FIG. 3B shows only the start byte signal paths. Like FIG. 3B, FIG. 4 shows three levels of multiplexers. First-level multiplexers 400A, 400B, 400C, 400D, 400E, 400F, 400G and 400H (collectively referred to herein as multiplexers 400) are coupled to a set of contiguous instruction bytes 401. Instruction bytes 401 are conveyed on instruction byte bus 300.
Multiplexer control bus 402 (a subset of control bus 311) is coupled to multiplexers 400. The start byte is selected by multiplexer 400A, the next contiguous byte by multiplexer 400B, and so on. For example, if instruction byte 1 is the start byte, instruction byte 1 is selected by multiplexer 400A, instruction byte 2 is selected by multiplexer 400B, and so on.

The second-level multiplexers are shown as multiplexers 403A, 403B, 403C, 403D, 403E, 403F, 403G and 403H (collectively referred to herein as multiplexers 403). The outputs of multiplexers 400 are coupled to multiplexers 403 as inputs. Inputs 405 are also coupled as inputs of multiplexers 403. Inputs 405 are coupled to multiplexer circuits (not shown) similar to multiplexers 400. These circuits are coupled to various control buses that are similar to control bus 402 but select different bytes from instruction bus 300. Such select controls can be generated, for example, by detecting a start byte bit different from the one that produced the controls on control bus 402. Multiplexers 403 are further coupled to multiplexer control bus 404, which is a subset of control bus 312 shown in FIG. 3B.

The outputs of multiplexers 403 are coupled as inputs to multiplexers 407A, 407B, 407C, 407D, 407E, 407F, 407G and 407H (collectively referred to herein as multiplexers 407). Inputs 408 are also coupled as inputs to multiplexers 407. Inputs 408 are coupled to multiplexer circuits (not shown) similar to multiplexers 403, which are in turn coupled to various control buses similar to control bus 404. In one embodiment, inputs 408 also include the inputs from MROM unit 209 (FIG. 2) and from predecode unit 202 (FIG. 2). Multiplexers 407 are further coupled to multiplexer control bus 406, which is a subset of control bus 313. The outputs of multiplexers 407 are coupled to the input bytes of one of the decode units 208.

In accordance with the above description, a high performance instruction alignment unit has been disclosed.
The instruction alignment unit employs a number of independent scan and shift units (instruction channeling units) to select and dispatch instructions. Because the method described here can be implemented with few cascaded levels of logic gates, the unit is particularly useful in high-speed designs. In addition, the instruction alignment unit achieves high performance by scanning a wide range of bytes for instructions to be executed.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
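The decode-side check described earlier, in which a decode unit uses the start byte and end byte bits to find the instruction inside the eight bytes it receives, can be sketched as follows. This is an illustrative Python model only: the function name `extract_instruction` is hypothetical, the bit vectors are modelled as lists of 0/1 values, and returning `None` stands in for sending the bytes back to the predecode unit.

```python
from typing import Optional, Sequence


def extract_instruction(window: bytes,
                        start_bits: Sequence[int],
                        end_bits: Sequence[int]) -> Optional[bytes]:
    """Return the bytes from the first start byte through its end byte.

    Returns None when no start byte or no matching end byte is present in
    the window, which models the case where the bytes have to be predecoded
    again.
    """
    first = next((i for i, bit in enumerate(start_bits) if bit), None)
    if first is None:
        return None  # no start byte found
    for i in range(first, len(window)):
        if end_bits[i]:
            return window[first:i + 1]  # start and end bytes are both included
    return None  # no end byte found at or after the start byte
```

Bytes after the end byte belong to later instructions; because the windows may overlap, the same bytes can appear in more than one issue position and are ignored here by every decode unit except the one whose window starts at them.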

【手続補正書】【提出日】平成１１年１月２６日（１９９９．１．２６）【補正内容】（１）明細書第３頁第４行から第８行までの「同時係属中であり、共通の譲受人に譲渡された特許出願、すなわち、ウィット他(Witt et al.)により１９９３年１０月２９日に出願され、その開示全体が引用によりここに援用される連続番号第０８／１４６，３８３号「スーパースカラ命令デコーダ（“Superscalar Instruction Decoder”）」を「発明者らの同時係属中のＥＰ−Ａ−０６５１３２０「スーパースカラ命令デコーダ」(““Superscalar Instruction Decoder ”)」に補正する。（２）明細書第３頁第１８行と第１９行の間に下記の文章を挿入する。記ＧＢ−Ａ−２２６３９８７には、可変長さであり、命令間の区別なしに命令ストリーム内にシーケンシャルに現れる命令の長さを決定するための装置を説明しており、その装置は各命令の長さがその点で終了することを示すための終了ビットを与える。第１のチャネルはシーケンス内の第１の命令を処理し、第２のチャネルは第１の命令に続く命令を処理し、命令の終了ビットはその命令の終了点と次の命令の始まりとを命令ストリームから決定するために第１のチャネルによって処理される。（３）明細書第３頁第２０行から第４頁第５行までを下記のように補正する。記上に概要を述べた問題はこの発明に従うスーパースカラマイクロプロセッサによって大部分が解決される。この発明は、命令キャッシュから複数個のデコードユニットへと命令を転送するための命令整列ユニットであって、複数グループの命令バイトを転送するように構成された入力ポートと、前記入力ポートに結合された第１の命令チャネリングユニットとを含み、前記第１の命令チャネリングユニットは、前記入力ポートによって転送される第１の前記複数グループの命令バイトから第１の複数個の命令バイトを選択し、転送するように構成され、第１の複数個の命令バイトは開始バイトと固定数の連続バイトとを含み、さらに、前記入力ポートに結合された第２の命令チャネリングユニットを含み、前記第２の命令チャネリングユニットは、前記入力ポートによって転送される第２の前記複数グループの命令バイトから第２の複数個の命令バイトを選択し、転送するように構成され、第２の命令チャネリングユニットは第１の命令チャネリングによる第１の複数個の命令バイトの選択および転送と並行して第２の複数個の命令バイトを選択し、転送し、第２の複数個の命令バイトは開始バイトと固定数の連続バイトとを含み、さらに、前記第１の命令チャネリングユニットおよび前記第２の命令チャネリングユニットに結合された第３の命令チャネリングユニットを含み、前記第３の命令チャネリングユニットは前記第１の複数個の命令バイトと前記第２の複数個の命令バイトとをマージされた複数個の命令バイトへとマージするように構成され、さらに、前記第３の命令チャネリングユニットに結合された出力ポートを含み、前記出力ポートは前記複数個のデコードユニットに複数個の命令バイトを転送するように構成される、命令整列ユニットを提供する。この発明はまた、複数グループの命令バイトから可変長命令を選択するための方法であって、前記複数グループの命令の１つから開始バイトと固定数の連続バイトとを含む第１の複数個の命令バイトを選択するステップと、前記複数グループの命令の別のものから開始バイトと固定数の連続バイトとを含む第２の複数個の命令バイトを選択するステップとを含み、第２の複数個の命令バイトは第１の複数個の命令バイトの選択と並行して選択され、前記方法は、前記第２の複数個の命令バイトを前記第１の複数個の命令バイトにおけるバイト数だけシフトし、それによってシフトされた複数個の命令バイトを生じるステップと、前記第１の複数個の命令バイトを前記シフトされた複数個の命令バイトとマージし、それによってマージされた複数個の命令バイトを生じるステップとを含み、前記マージするステップは、前記シフトされた複数個の命令バイトが前記マージされた複数個の命令バイト内で前記第１の複数個の命令バイトに続くように行なわれることを特徴とする、方法を提供する。ある実施例において、スーパースカラマイクロプロセッサは固定数のバイトを命令キャッシュから複数個のデコードユニットの各々に転送する命令整列ユニットを用いる。これらバイトは、プリデコードユニットによって発生されるプリデコードタグに従って、予め定められたバイトグループから選択される。プリデコードタグ（各バイトに異なる１つが関連付けられる）は予め定められたグループ内のどのバイトが命令のための開始バイトであるかを示す。ある具体的な実施
例において、命令整列ユニットは８バイトの連続する命令コードの３つの異なったグループの中で開始バイトを同時に独立して検出する。命令コードの各グループ内で予め定められた数の開始バイトを独立して求めると、命令整列ユニットは開始バイトを各開始バイトに従う連続した７バイトとともに各グループに関連したそれぞれの「仮の」発行チャネルへと独立して送る。仮の発行チャネルは次に上述の複数個のデコードユニットと結合される１組の「最終的な」発行チャネルへとシフトおよび／またはマージされる。（４）請求の範囲を別紙のとおり補正する。請求の範囲１．命令キャッシュからの命令を複数のデコードユニットに転送するための命令整列ユニットであって、複数のグループの命令バイトを転送するように構成された入力ポートと、前記入力ポートに結合された第１の命令チャネリングユニット（２５１，３０１Ａ−Ｄ）とを備え、前記第１の命令チャネリングユニットは、前記入力ポートによって転送された前記複数のグループの命令バイトのうちの第１のものから第１の複数の命令バイトを選択するように構成され、前記第１の複数の命令バイトは開始バイトと固定数の連続バイトとを含み、さらに前記入力ポートに結合された第２の命令チャネリングユニット（２５２，３０４Ａ−Ｄ）を備え、前記第２の命令チャネリングユニットは、前記入力ポートによって転送された前記複数のグループの命令バイトのうちの第２のものから第２の複数の命令バイトを選択して転送するように構成され、前記第２の命令チャネリングユニットは、前記第１の複数の命令バイトを選択して転送する前記第１の命令チャネリングユニットと並行して前記第２の複数の命令バイトを選択して転送し、前記第２の複数の命令バイトは開始バイトと固定数の連続バイトとを含み、さらに前記第１の命令チャネリングユニットおよび前記第２の命令チャネリングユニットに結合された第３の命令チャネリングユニット（２５３，３０６Ａ−Ｄ）を備え、前記第３の命令チャネリングユニットは、前記第１の複数の命令バイトと前記第２の複数の命令バイトとをマージして、マージされた複数の命令バイトをもたらすように構成され、さらに前記第３の命令チャネリングユニットに結合された出力ポートを備え、前記出力ポートは、複数の命令バイトを前記複数のデコードユニットに転送するように構成される、命令整列ユニット。２．前記第１の命令チャネリングユニット、前記第２の命令チャネリングユニットおよび前記第３の命令チャネリングユニットが複数のマルチプレクサをさらに含む、請求項１に記載の命令整列ユニット。３．前記第１の複数の命令バイト、前記第２の複数の命令バイトおよび前記出力ポートによって転送された前記複数の命令バイトの数が等しい、請求項２に記載の命令整列ユニット。４．前記マージされた複数の命令バイトが、後に前記第２の複数の命令バイトが続く前記第１の複数の命令バイトを含み、それにより前記第２の複数の命令バイトは、前記第１の複数の命令バイトにおけるバイト数だけシフトされている、請求項３に記載の命令整列ユニット。５．前記出力ポートによって転送された前記複数の命令バイトが、前記マージされた複数の命令バイトである、請求項４に記載の命令整列ユニット。６．前記第１の命令チャネリングユニット、前記第２の命令チャネリングユニットおよび前記第３の命令チャネリングユニットに結合された制御ユニット（２５５）をさらに含み、前記制御ユニットは、前記第１の命令チャネリングユニットが前記第１の複数の命令バイトを選択するように構成される、請求項５に記載の命令整列ユニット。７．前記制御ユニットが、前記第２の命令チャネリングユニットに前記第２の複数の命令バイトを選択させるようにさらに構成される、請求項６に記載の命令整列ユニット。８．前記制御ユニットが、前記第３の命令チャネリングユニットに前記マージされた複数の命令バイトを選択させるようにさらに構成される、請求項７に記載の命令整列ユニット。９．前記制御ユニットが制御入力ポートをさらに含み、前記制御ユニットは、前記制御入力ポートに与えられる情報に従って前記第１の命令チャネリングユニット、前記第２の命令チャネリングユニットおよび前記第３の命令チャネリングユニットに指示を与えるようにさらに構成される、請求項８に記載の命令整列ユニット。１０．前記制御入力ポートに与えられる前記情報が、前記入力ポートの前記複数のグループの命令バイト内にある開始命令バイトおよび終了命令バイトを特定する開始バイトおよび終了バイトビットである、請求項９に記載の命令整列ユニット。１１．スーパースカラマイクロプロセッサであって、請求項１から１０のいずれかに記載の命令整列ユニットを備え、前記マイクロプ
ロセッサはさらに、前記命令整列ユニットに結合された既にフェッチ済みの命令ブロックをストアするための命令キャッシュ（２０４）を含み、前記命令キャッシュは複数のブロックのメモリを含み、前記マイクロプロセッサはさらに前記命令整列ユニットに結合された、前記命令整列ユニットから転送された前記複数の命令バイトをデコードするための複数のデコードユニット（２０８Ａ− ２０８Ｄ）を備える、スーパースカラマイクロプロセッサ。１２．前記命令キャッシュに結合され、メインメモリからの命令をプリフェッチしてプリデコードするためのプリフェッチ／プリデコードユニット（２０２）と、前記命令キャッシュに結合され、分岐命令のターゲットアドレスを予測するための分岐予測ユニット（２２０）と、前記命令整列ユニットに結合され、困難な命令をマイクロコード化するためのＭＲＯＭユニット（２０９）と、前記複数のデコードユニットに結合され、デコードされた命令を実行するために複数の機能ユニットのうちの１つが利用できるようになり、かつ前記デコードされた命令にそれらのオペランドが与えられるまで、前記デコードされた命令をストアするための複数のリザベーションステーション（２１０）と、前記複数のリザベーションステーションに結合され、前記複数のリザベーションステーションにストアされた前記デコードされた命令を実行するための前記複数の機能ユニット（２１２）と、前記複数の機能ユニットおよび前記複数のデコードユニットに結合され、ロード／ストア命令を実行するためのロード／ストアユニット（２２２）と、前記ロード／ストアユニットに結合され、既にフェッチされているデータメモリロケーションをストアするためのデータキャッシュ（２２４）と、前記複数の機能ユニット、前記ロード／ストアユニットおよび前記複数のデコードユニットに結合されたリオーダバッファ（２１６）とを含み、前記リオーダバッファは、結果が投機的でなくなるまで、投機的に実行された結果をストアし、さらに前記複数のデコードユニットおよび前記リオーダバッファに結合され、レジスタセットの非投機的な状態をストアするためのレジスタファイル（２１８）とを含む、請求項１１に記載のスーパースカラマイクロプロセッサ。１３．複数のグループの命令バイトから可変長命令を選択するための方法であって、開始バイトと固定数の連続バイトとを含む第1の複数の命令バイトを前記複数のグループの命令のうちの１つから選択するステップと、開始バイトと固定数の連続バイトとを含む第２の複数の命令バイトを前記複数のグループの命令のうちの別のものから選択するステップとを含み、前記第２の複数の命令バイトは前記第1の複数の命令バイトの選択と並行して選択され、前記第1の複数の命令バイトのバイト数だけ前記第２の複数の命令バイトをシフトして、シフトされた複数の命令バイトをもたらすようにするステップと、前記第１の複数の命令バイトと前記シフトされた複数の命令バイトとをマージし、マージされた複数の命令バイトをもたらすようにするステップとを含み、前記マージするステップは、前記シフトされた複数の命令バイトが前記マージされた複数の命令バイト内の前記第1の複数の命令バイトに続くように行なわれることを特徴とする、方法。１４．前記マージされた複数の命令バイトを複数のデコードユニットに転送するステップをさらに含む、請求項１３に記載の方法。 [Procedural Amendment] [Date of Submission] January 26, 1999 [Content of Amendment] (1) On page 3, lines 4 to 8 of the specification, the passage "the co-pending, commonly assigned patent application Serial No. 08/146,383, "Superscalar Instruction Decoder", filed October 29, 1993 by Witt et al., the entire disclosure of which is incorporated herein by reference" is amended to "the inventors' co-pending EP-A-0651320, "Superscalar Instruction Decoder"".
(2) The following text is inserted between lines 18 and 19 of page 3 of the specification.

GB-A-2 263 987 describes an apparatus for determining the length of instructions that are of variable length and appear sequentially in an instruction stream without differentiation between instructions. The apparatus provides an end bit for each instruction to indicate the point at which that instruction ends. A first channel processes the first instruction in the sequence and a second channel processes the instruction following the first instruction; the end bit of an instruction is processed by the first channel in order to determine, from the instruction stream, the end point of that instruction and the beginning of the next instruction.

(3) Page 3, line 20 through page 4, line 5 of the specification is amended as follows.

The problems outlined above are in large part solved by a superscalar microprocessor in accordance with the present invention.

The present invention provides an instruction alignment unit for transferring instructions from an instruction cache to a plurality of decode units, comprising: an input port configured to convey a plurality of groups of instruction bytes; a first instruction channeling unit coupled to said input port, wherein said first instruction channeling unit is configured to select and transfer a first plurality of instruction bytes from a first of said plurality of groups of instruction bytes conveyed by said input port, the first plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes; a second instruction channeling unit coupled to said input port, wherein said second instruction channeling unit is configured to select and transfer a second plurality of instruction bytes from a second of said plurality of groups of instruction bytes conveyed by said input port, the second instruction channeling unit selecting and transferring the second plurality of instruction bytes in parallel with the selection and transfer of the first plurality of instruction bytes by the first instruction channeling unit, the second plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes; a third instruction channeling unit coupled to said first instruction channeling unit and said second instruction channeling unit, wherein said third instruction channeling unit is configured to merge said first plurality of instruction bytes and said second plurality of instruction bytes into a merged plurality of instruction bytes; and an output port coupled to said third instruction channeling unit, wherein said output port is configured to transfer a plurality of instruction bytes to said plurality of decode units.

The present invention also provides a method for selecting variable-length instructions from a plurality of groups of instruction bytes, comprising: selecting, from one of said plurality of groups of instructions, a first plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes; and selecting, from another of said plurality of groups of instructions, a second plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes, the second plurality of instruction bytes being selected in parallel with the selection of the first plurality of instruction bytes; the method further comprising shifting said second plurality of instruction bytes by the number of bytes in said first plurality of instruction bytes, thereby producing a shifted plurality of instruction bytes, and merging said first plurality of instruction bytes with said shifted plurality of instruction bytes, thereby producing a merged plurality of instruction bytes; characterized in that said merging is performed such that said shifted plurality of instruction bytes follows said first plurality of instruction bytes within said merged plurality of instruction bytes.

In one embodiment, the superscalar microprocessor employs an instruction alignment unit that transfers a fixed number of bytes from the instruction cache to each of a plurality of decode units. These bytes are selected from predetermined groups of bytes in accordance with predecode tags generated by a predecode unit. The predecode tags (a different one is associated with each byte) indicate which bytes within a predetermined group are start bytes of instructions. In one specific embodiment, the instruction alignment unit detects start bytes simultaneously and independently within three different groups of eight contiguous bytes of instruction code. Having independently located a predetermined number of start bytes within each group of instruction code, the instruction alignment unit independently routes each start byte, together with the seven contiguous bytes that follow it, to a respective "provisional" issue channel associated with each group. The provisional issue channels are then shifted and/or merged into a set of "final" issue channels, which are coupled to the plurality of decode units described above.

(4) The claims are amended as shown in the attached sheet.

Claims

1.
An instruction alignment unit for transferring instructions from an instruction cache to a plurality of decode units, comprising: an input port configured to convey a plurality of groups of instruction bytes; a first instruction channeling unit (251, 301A-D) coupled to said input port, wherein said first instruction channeling unit is configured to select a first plurality of instruction bytes from a first one of said plurality of groups of instruction bytes conveyed by said input port, said first plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes; a second instruction channeling unit (252, 304A-D) coupled to said input port, wherein said second instruction channeling unit is configured to select and transfer a second plurality of instruction bytes from a second one of said plurality of groups of instruction bytes conveyed by said input port, said second instruction channeling unit selecting and transferring said second plurality of instruction bytes in parallel with said first instruction channeling unit selecting and transferring said first plurality of instruction bytes, said second plurality of instruction bytes comprising a start byte and a fixed number of contiguous bytes; a third instruction channeling unit (253, 306A-D) coupled to said first instruction channeling unit and said second instruction channeling unit, wherein said third instruction channeling unit is configured to merge said first plurality of instruction bytes and said second plurality of instruction bytes to produce a merged plurality of instruction bytes; and an output port coupled to said third instruction channeling unit, wherein said output port is configured to transfer a plurality of instruction bytes to said plurality of decode units.

2. The instruction alignment unit as recited in claim 1, wherein said first instruction channeling unit, said second instruction channeling unit and said third instruction channeling unit further comprise a plurality of multiplexers.

3. The instruction alignment unit as recited in claim 2, wherein said first plurality of instruction bytes, said second plurality of instruction bytes and said plurality of instruction bytes transferred by said output port are equal in number.

4. The instruction alignment unit as recited in claim 3, wherein said merged plurality of instruction bytes comprises said first plurality of instruction bytes followed by said second plurality of instruction bytes, whereby said second plurality of instruction bytes is shifted by the number of bytes in said first plurality of instruction bytes.

5. The instruction alignment unit as recited in claim 4, wherein said plurality of instruction bytes transferred by said output port is said merged plurality of instruction bytes.

6. The instruction alignment unit as recited in claim 5, further comprising a control unit (255) coupled to said first instruction channeling unit, said second instruction channeling unit and said third instruction channeling unit, wherein said control unit is configured to cause said first instruction channeling unit to select said first plurality of instruction bytes.

7. The instruction alignment unit as recited in claim 6, wherein said control unit is further configured to cause said second instruction channeling unit to select said second plurality of instruction bytes.

8. The instruction alignment unit as recited in claim 7, wherein said control unit is further configured to cause said third instruction channeling unit to select said merged plurality of instruction bytes.

9. The instruction alignment unit as recited in claim 8, wherein said control unit further comprises a control input port, and wherein said control unit is further configured to direct said first instruction channeling unit, said second instruction channeling unit and said third instruction channeling unit in accordance with information provided on said control input port.

10. The instruction alignment unit as recited in claim 9, wherein said information provided on said control input port comprises start byte and end byte bits identifying the start instruction bytes and end instruction bytes within said plurality of groups of instruction bytes on said input port.

11. A superscalar microprocessor comprising an instruction alignment unit as recited in any one of claims 1 to 10, said microprocessor further comprising: an instruction cache (204), coupled to said instruction alignment unit, for storing previously fetched blocks of instructions, said instruction cache comprising a plurality of blocks of memory; and a plurality of decode units (208A-208D), coupled to said instruction alignment unit, for decoding said plurality of instruction bytes transferred from said instruction alignment unit.

12. The superscalar microprocessor as recited in claim 11, further comprising: a prefetch/predecode unit (202), coupled to said instruction cache, for prefetching and predecoding instructions from main memory; a branch prediction unit (220), coupled to said instruction cache, for predicting target addresses of branch instructions; an MROM unit (209), coupled to said instruction alignment unit, for microcoding difficult instructions; a plurality of reservation stations (210), coupled to said plurality of decode units, for storing decoded instructions until one of a plurality of functional units becomes available to execute said decoded instructions and said decoded instructions have been provided with their operands; said plurality of functional units (212), coupled to said plurality of reservation stations, for executing said decoded instructions stored in said plurality of reservation stations; a load/store unit (222), coupled to said plurality of functional units and said plurality of decode units, for executing load/store instructions; a data cache (224), coupled to said load/store unit, for storing previously fetched data memory locations; a reorder buffer (216) coupled to said plurality of functional units, said load/store unit and said plurality of decode units, wherein said reorder buffer stores speculatively executed results until the results are no longer speculative; and a register file (218), coupled to said plurality of decode units and said reorder buffer, for storing the non-speculative state of a register set.

13. A method for selecting variable-length instructions from a plurality of groups of instruction bytes, comprising: selecting a first plurality of instruction bytes, comprising a start byte and a fixed number of contiguous bytes, from one of said plurality of groups of instructions; selecting a second plurality of instruction bytes, comprising a start byte and a fixed number of contiguous bytes, from another of said plurality of groups of instructions, wherein said second plurality of instruction bytes is selected in parallel with the selection of said first plurality of instruction bytes; shifting said second plurality of instruction bytes by the number of bytes in said first plurality of instruction bytes
Shifting to result in a plurality of shifted instruction bytes; Merging the first plurality of instruction bytes with the shifted plurality of instruction bytes And providing a plurality of instruction bytes that have been merged. Merging the shifted instruction bytes with the merged instruction bytes; Being performed following the first plurality of instruction bytes in the plurality of instruction bytes. And a method. 14． Transferring the merged instruction bytes to a plurality of decode units 14. The method of claim 13, further comprising a step.
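The select, shift, and merge steps recited in the method claim can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the claimed hardware: the names `select` and `shift_and_merge`, the constant `FOLLOWING`, and the byte values are all hypothetical, and the two selections are simply computed one after the other, whereas the claim recites that they occur in parallel.

```python
# Illustrative sketch only; all names and values are assumptions.
FOLLOWING = 2  # assumed fixed number of contiguous bytes taken after the start byte


def select(group: bytes, start: int) -> bytes:
    """Select a start byte plus a fixed number of contiguous bytes from one group."""
    return group[start:start + 1 + FOLLOWING]


def shift_and_merge(first: bytes, second: bytes) -> bytes:
    """Shift `second` by len(first) bytes and merge it behind `first`."""
    merged = bytearray(len(first) + len(second))
    merged[:len(first)] = first
    merged[len(first):] = second  # the shift amount is the byte count of `first`
    return bytes(merged)


# Two hypothetical groups of instruction bytes. Each selection depends only on
# its own group, which is what allows the two selections to proceed in parallel.
group_a = bytes([0x90, 0x89, 0xD8, 0x40, 0x00, 0x00, 0x00, 0x00])
group_b = bytes([0x01, 0xC3, 0x5B, 0x00, 0x00, 0x00, 0x00, 0x00])

first = select(group_a, start=1)
second = select(group_b, start=0)
merged = shift_and_merge(first, second)
```

In this sketch the second selection ends up at an offset equal to the length of the first selection, which mirrors the claim language that the shifted bytes follow the first plurality of instruction bytes in the merged result.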

Continuation of front page
(72) Inventor: Witt, David B., 6318 Pathfinder Drive, Austin, Texas 78759, United States
(72) Inventor: Johnson, William M., 606 Coquina Lane, Austin, Texas 78746, United States

Claims

[Claims]
1. A superscalar microprocessor comprising an instruction alignment unit for transferring instructions from an instruction cache to a plurality of decode units, wherein the instruction alignment unit comprises: an input port configured to transfer a plurality of groups of instruction bytes from the instruction cache; a first instruction channeling unit coupled to the input port, wherein the first instruction channeling unit is configured to select a first plurality of instruction bytes from a first one of the plurality of groups of instruction bytes transferred by the input port; a second instruction channeling unit coupled to the input port, wherein the second instruction channeling unit is configured to select a second plurality of instruction bytes from a second one of the plurality of groups of instruction bytes transferred by the input port; a third instruction channeling unit coupled to the first instruction channeling unit and the second instruction channeling unit, wherein the third instruction channeling unit is configured to merge the first plurality of instruction bytes and the second plurality of instruction bytes into a merged plurality of instruction bytes; and an output port coupled to the third instruction channeling unit, wherein the output port is configured to transfer a plurality of instruction bytes to the plurality of decode units; and wherein the superscalar microprocessor further comprises: an instruction cache, coupled to the instruction alignment unit, for storing previously fetched instruction blocks, the instruction cache including a plurality of blocks; and a plurality of decode units, coupled to the instruction alignment unit, for decoding the plurality of instruction bytes transferred from the instruction alignment unit.
2. The superscalar microprocessor of claim 1, wherein the input port is further configured to transfer a plurality of groups of instruction bytes stored in a plurality of blocks of memory, and a first of the plurality of blocks of memory and a second of the plurality of blocks of memory are contiguous.
3. The superscalar microprocessor of claim 1, wherein the first instruction channeling unit of the instruction alignment unit and the second instruction channeling unit of the instruction alignment unit are further configured to individually select the first plurality of instruction bytes and the second plurality of instruction bytes.
4. The superscalar microprocessor of claim 3, wherein the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit of the instruction alignment unit further comprise a plurality of multiplexers.
5. The superscalar microprocessor of claim 1, wherein the merged plurality of instruction bytes is the first plurality of instruction bytes followed by the second plurality of instruction bytes, whereby the second plurality of instruction bytes is shifted by the number of bytes in the first plurality of instruction bytes.
6. The superscalar microprocessor of claim 5, wherein the plurality of instruction bytes transferred by the output port is the merged plurality of instruction bytes.
7. The superscalar microprocessor of claim 6, wherein the instruction alignment unit further comprises a control unit coupled to the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit, and the control unit is configured to cause the first instruction channeling unit to select the first plurality of instruction bytes.
8. The superscalar microprocessor of claim 7, wherein the control unit of the instruction alignment unit is further configured to cause the second instruction channeling unit to select the second plurality of instruction bytes.
9. The superscalar microprocessor of claim 8, wherein the control unit of the instruction alignment unit is further configured to cause the third instruction channeling unit to select the merged plurality of instruction bytes.
10. The superscalar microprocessor of claim 9, wherein the control unit further includes a control input port, and the control unit is further configured to direct the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit according to information provided to the control input port.
11. The superscalar microprocessor of claim 10, wherein the information provided to the control input port is start byte and end byte bits that identify the start and end instruction bytes within the plurality of groups of instruction bytes of the input port.
12. The superscalar microprocessor of claim 11, wherein the control unit is further configured to cause the first instruction channeling unit to select a start byte, within the first one of the plurality of groups of instruction bytes, that is included in the first plurality of instruction bytes.
13. The superscalar microprocessor of claim 12, wherein the control unit is further configured to cause the first instruction channeling unit to select a plurality of bytes adjacent to the start byte that are included in the first plurality of instruction bytes.
14. The superscalar microprocessor of claim 13, wherein the output port of the instruction alignment unit is configured to transfer the start byte and the adjacent bytes to one of the plurality of decode units.
15. The superscalar microprocessor of claim 1, wherein the instruction alignment unit further comprises a fourth instruction channeling unit coupled to the input port, and the fourth instruction channeling unit is configured to select a third plurality of instruction bytes from a third one of the plurality of groups of instruction bytes transferred by the input port.
16. The superscalar microprocessor of claim 15, wherein the instruction alignment unit further comprises a fifth instruction channeling unit coupled to the fourth instruction channeling unit, and the fifth instruction channeling unit is configured to shift the third plurality of instruction bytes by the number of bytes in the first plurality of instruction bytes, thereby forming a shifted plurality of instruction bytes.
17. The superscalar microprocessor of claim 16, wherein the instruction alignment unit further comprises a sixth instruction channeling unit coupled to the fifth instruction channeling unit and further coupled to the third instruction channeling unit, the sixth instruction channeling unit is configured to merge the merged plurality of instruction bytes with the shifted plurality of instruction bytes to provide a second merged plurality of instruction bytes, and the second merged plurality of instruction bytes is the merged plurality of instruction bytes followed by the third plurality of instruction bytes, whereby the shifted plurality of instruction bytes is further shifted by the number of bytes in the second plurality of instruction bytes.
18. The superscalar microprocessor of claim 17, wherein the plurality of instruction bytes transferred by the output port is the second merged plurality of instruction bytes.
19. The superscalar microprocessor of claim 18, further comprising: a prefetch/predecode unit, coupled to the instruction cache, for prefetching instructions from main memory and predecoding them; a branch prediction unit, coupled to the instruction cache, for predicting a target address of a branch instruction; an MROM unit, coupled to the instruction alignment unit, for microcoding difficult instructions; a plurality of reservation stations, coupled to the plurality of decode units, for storing the decoded instructions until one of a plurality of functional units becomes available to execute the decoded instructions and until the decoded instructions are given their operands; a plurality of functional units, coupled to the plurality of reservation stations, for executing the decoded instructions stored in the plurality of reservation stations; a load/store unit, coupled to the plurality of functional units and the plurality of decode units, for executing load/store instructions; a data cache, coupled to the load/store unit, for storing previously fetched data memory locations; a reorder buffer coupled to the plurality of functional units, the load/store unit, and the plurality of decode units, wherein the reorder buffer stores speculatively executed results until the results are no longer speculative; and a register file, coupled to the plurality of decode units and the reorder buffer, for storing the non-speculative state of the register set.
20. An instruction alignment unit for transferring instructions from an instruction cache to a plurality of decode units, comprising: an input port configured to transfer a plurality of groups of instruction bytes; a first instruction channeling unit coupled to the input port, wherein the first instruction channeling unit is configured to select a first plurality of instruction bytes from a first one of the plurality of groups of instruction bytes transferred by the input port; a second instruction channeling unit coupled to the input port, wherein the second instruction channeling unit is configured to select a second plurality of instruction bytes from a second one of the plurality of groups of instruction bytes transferred by the input port; a third instruction channeling unit coupled to the first instruction channeling unit and the second instruction channeling unit, wherein the third instruction channeling unit is configured to merge the first plurality of instruction bytes with the second plurality of instruction bytes; and an output port coupled to the third instruction channeling unit, wherein the output port is configured to transfer a plurality of instruction bytes to the plurality of decode units.
21. The instruction alignment unit of claim 20, wherein the input port is further configured to transfer a plurality of groups of instruction bytes stored in a plurality of blocks of memory, and the plurality of blocks of memory is stored in the instruction cache.
22. The instruction alignment unit of claim 21, wherein the input port is further configured to transfer a plurality of groups of instruction bytes stored in the plurality of blocks of memory, and a first of the plurality of blocks of memory and a second of the plurality of blocks of memory are contiguous.
23. The instruction alignment unit of claim 20, wherein the first instruction channeling unit and the second instruction channeling unit are further configured to individually select the first plurality of instruction bytes and the second plurality of instruction bytes.
24. The instruction alignment unit of claim 23, wherein the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit further comprise a plurality of multiplexers.
25. The instruction alignment unit of claim 24, wherein the number of bytes in the first plurality of instruction bytes, in the second plurality of instruction bytes, and in the plurality of instruction bytes transferred by the output port is equal.
26. The instruction alignment unit of claim 25, wherein the merged plurality of instruction bytes is the first plurality of instruction bytes followed by the second plurality of instruction bytes, whereby the second plurality of instruction bytes is shifted by the number of bytes in the first plurality of instruction bytes.
27. The instruction alignment unit of claim 26, wherein the plurality of instruction bytes transferred by the output port is the merged plurality of instruction bytes.
28. The instruction alignment unit of claim 27, further comprising a control unit coupled to the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit, wherein the control unit is configured to cause the first instruction channeling unit to select the first plurality of instruction bytes.
29. The instruction alignment unit of claim 28, wherein the control unit is further configured to cause the second instruction channeling unit to select the second plurality of instruction bytes.
30. The instruction alignment unit of claim 29, wherein the control unit is further configured to cause the third instruction channeling unit to select the merged plurality of instruction bytes.
31. The instruction alignment unit of claim 30, wherein the control unit further includes a control input port, and the control unit is further configured to direct the first instruction channeling unit, the second instruction channeling unit, and the third instruction channeling unit according to information provided to the control input port.
32. The instruction alignment unit of claim 31, wherein the information provided to the control input port is start byte and end byte bits that identify the start instruction byte and the end instruction byte within the plurality of instruction bytes of the input port.
33. The instruction alignment unit of claim 32, wherein the control unit is further configured to cause the first instruction channeling unit to select a start byte, within the first one of the plurality of groups of instruction bytes, that is included in the first plurality of instruction bytes.
34. The instruction alignment unit of claim 33, wherein the control unit is further configured to cause the first instruction channeling unit to select a plurality of bytes adjacent to the start byte that are included in the first plurality of instruction bytes.
35. The instruction alignment unit of claim 34, wherein the output port is configured to transfer the start byte and the adjacent bytes to one of the plurality of decode units.
36. The instruction alignment unit of claim 20, further comprising a fourth instruction channeling unit coupled to the input port, wherein the fourth instruction channeling unit is configured to select a third plurality of instruction bytes from a third one of the plurality of groups of instruction bytes transferred by the input port.
37. The instruction alignment unit of claim 36, further comprising a fifth instruction channeling unit coupled to the fourth instruction channeling unit, wherein the fifth instruction channeling unit is configured to shift the third plurality of instruction bytes by the number of bytes in the first plurality of instruction bytes, thereby forming a shifted plurality of instruction bytes.
38. The instruction alignment unit of claim 37, further comprising a sixth instruction channeling unit coupled to the fifth instruction channeling unit and further coupled to the third instruction channeling unit, wherein the sixth instruction channeling unit is configured to merge the merged plurality of instruction bytes with the shifted plurality of instruction bytes to provide a second merged plurality of instruction bytes, and the second merged plurality of instruction bytes is the merged plurality of instruction bytes followed by the third plurality of instruction bytes, whereby the shifted plurality of instruction bytes is further shifted by the number of bytes in the second plurality of instruction bytes.
39. The instruction alignment unit of claim 38, wherein the plurality of instruction bytes transferred by the output port is the second merged plurality of instruction bytes.
40. A method for selecting variable-length instructions from a plurality of groups of instruction bytes, comprising: selecting a first plurality of instruction bytes, including a start byte and a fixed number of contiguous bytes, from one of the groups of instruction bytes; selecting a second plurality of instruction bytes, including a start byte and a fixed number of contiguous bytes, from another of the groups of instruction bytes; shifting the second plurality of instruction bytes by the number of bytes in the first plurality of instruction bytes to yield a shifted plurality of instruction bytes; and merging the first plurality of instruction bytes with the shifted plurality of instruction bytes to provide a merged plurality of instruction bytes, wherein the shifted plurality of instruction bytes follows the first plurality of instruction bytes in the merged plurality of instruction bytes.
41. The method of claim 40, wherein the step of selecting the first plurality of instruction bytes and the step of selecting the second plurality of instruction bytes are performed independently and in parallel.
42. The method of claim 40, further comprising the step of transferring the merged plurality of instruction bytes to a plurality of decode units.
43. An instruction alignment unit for transferring instructions from an instruction cache to a plurality of decode units, comprising: a first instruction channeling unit configured to select a first plurality of instruction bytes from a first one of a plurality of groups of instruction bytes; and a second instruction channeling unit configured to select a second plurality of instruction bytes from a second one of the plurality of groups of instruction bytes.
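Several of the claims above describe the control information as start byte and end byte bits that identify where instructions begin and end within a group of instruction bytes. As a rough software illustration of that idea, the sketch below scans one group's bits and reports the byte range of each complete instruction. The function name `find_instructions`, the one-bit-per-byte list representation, and the example bit patterns are assumptions made for illustration; the claims do not specify this encoding or any scanning procedure.

```python
from typing import List, Optional, Tuple


def find_instructions(start_bits: List[int], end_bits: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of each complete instruction in one group."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            start = i  # this byte is the first byte of an instruction
        if e and start is not None:
            spans.append((start, i))  # this byte is the last byte of that instruction
            start = None
    return spans


# Hypothetical 8-byte group: a 1-byte, a 3-byte, and a 2-byte instruction,
# then the first two bytes of an instruction that ends in a later group.
start_bits = [1, 1, 0, 0, 1, 0, 1, 0]
end_bits = [1, 0, 0, 1, 0, 1, 0, 0]
spans = find_instructions(start_bits, end_bits)
```

An instruction whose end bit does not appear in the group (the last two bytes here) is left out of the result, since this sketch only looks at a single group at a time.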