JP3629551B2

JP3629551B2 - Microprocessor using basic cache block

Info

Publication number: JP3629551B2
Application number: JP2000391300A
Authority: JP
Inventors: ジェームス・アラン・カール
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2000-01-06
Filing date: 2000-12-22
Publication date: 2005-03-16
Anticipated expiration: 2020-12-22
Also published as: KR20010070434A; KR100402820B1; CN1116638C; HK1035946A1; CN1303044A; JP2001229024A

Description

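[0001]
[Technical Field of the Invention]
The present invention relates generally to the field of microprocessor architectures, and more particularly to a microprocessor that utilizes an instruction group architecture, a corresponding cache mechanism, and useful extensions thereof.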
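[0002]
[Prior Art]
As microprocessor technology delivers gigahertz-class performance, microprocessor designers face the major challenge of exploiting the latest technology while maintaining compatibility with the large installed base of software designed to run on a particular instruction set architecture (ISA). To address this problem, designers have implemented "layered architecture" microprocessors, which receive instructions formatted according to an existing ISA and convert them to an internal ISA suited to operation in a gigahertz execution pipeline. Referring to FIG. 4, a portion of a layered architecture microprocessor 401 is shown. In this design, the instruction cache 410 of microprocessor 401 receives and stores instructions fetched from main memory by fetch unit 402. The instructions stored in instruction cache 410 are formatted according to a first ISA (that is, the ISA in which the program being executed by processor 401 is written). Instructions are then retrieved from instruction cache 410 and converted to a second ISA by ISA converter 412. Because converting an instruction from the first ISA to the second ISA requires multiple cycles, the conversion process is typically pipelined, so multiple instructions may be undergoing conversion at any given time. The converted instructions are then forwarded to the execution pipeline 422 of processor 401 for execution. Fetch unit 402 includes branch prediction logic 406, which attempts to determine the address of the instruction to be executed after a branch instruction by predicting the outcome of the branch decision. Instructions are then speculatively issued and executed based on the branch prediction. When a branch is mispredicted, however, the instructions pending between instruction cache 410 and the final stage 432 of processor 401 must be flushed. The performance penalty incurred when a mispredicted branch causes a flush is a function of the length of the pipeline: the more pipeline stages that must be flushed, the larger the penalty. Because a layered architecture lengthens the processor pipeline and can increase the number of instructions "in flight" at any given time, the branch misprediction penalty associated with a layered architecture can become a limiting factor in processor performance.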
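[0003]
It is therefore highly desirable to implement a layered architecture microprocessor that addresses the performance penalty of branch misprediction. It is further desirable that the implemented solution at least partially address the repeated occurrence of exception conditions caused by repeated execution of a piece of code. It is still further desirable that the implemented solution enable an effectively larger issue queue without sacrificing the ability to search the issue queue for the next instruction to execute.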
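[0004]
[Problems to Be Solved by the Invention]
An object of the present invention is to provide a microprocessor that utilizes instruction groups and a cache mechanism matched to the instruction group format.
[0005]
A further object of the present invention is to provide a processor, a data processing system, and a method that utilize instruction history information in conjunction with a basic cache block to improve performance.
[0006]
A further object of the present invention is to provide a processor, a data processing system, and a method that utilize a primary issue queue and a secondary issue queue.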
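[0007]
[Means for Solving the Problems]
Embodiments of the present invention contemplate a microprocessor and an associated method and data processing system. The microprocessor includes an instruction cracking unit configured to receive a first set of instructions. The cracking unit organizes the set of instructions as an instruction group, in which each instruction shares an instruction group tag. The processor also includes a basic cache block facility that is organized in the instruction group format and configured to cache the instruction groups generated by the cracking unit. An execution unit of the processor is suitable for executing the instructions of an instruction group. In one embodiment, when an exception generated during execution of an instruction in an instruction group causes a flush, only those instructions that have been dispatched from the basic cache block are flushed. By flushing only instructions that have reached the basic cache block, the processor spares the instructions pending in the cracking unit pipeline. Because fewer instructions are flushed, the performance penalty of an exception is reduced. In another embodiment, the received instructions are formatted according to a first instruction format, and the second set of instructions is formatted according to a second instruction format that is wider than the first. The basic cache block is suitably configured to store each instruction group in a corresponding entry of the basic cache block. In some embodiments, each entry of the basic cache block includes an entry field identifying the corresponding basic cache block entry and a pointer predicting the next instruction group to be executed. The processor is preferably configured to update the pointer of the cache entry corresponding to a mispredicted branch.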
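[0008]
The processor is suitable for receiving a set of instructions and organizing it into instruction groups. The instruction groups are dispatched for execution. After an instruction group executes, instruction history information indicating an exception event associated with the group is recorded. Thereafter, execution of the instructions is modified in response to the instruction history information to prevent the exception event from recurring during a subsequent execution of the group. The processor includes a storage facility such as an instruction cache, an L2 cache, or system memory; a cracking unit; and a basic cache block. The cracking unit is configured to receive a set of instructions from the storage facility and is adapted to organize the set into instruction groups. The cracking unit may convert the set of instructions from a first instruction format to a second instruction format. The architecture of the basic cache block is suitable for storing instruction groups. The basic cache block includes an instruction history field corresponding to each of its entries. The instruction history information indicates an exception event associated with the instruction group. In the preferred embodiment, each entry of the basic cache block corresponds to one instruction group generated by the cracking unit. The processor may further include completion table control logic configured to store information in the instruction history field when execution of the instruction group completes. The instruction history information may indicate whether an instruction in the group depends on another instruction, or whether a previous execution of the group resulted in a store-forwarding exception. In this embodiment, the processor is configured to operate in an in-order mode in response to detecting that a previous execution of the instruction group resulted in a store-forwarding exception.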
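[0009]
The processor is suitable for dispatching instructions to an issue unit. The issue unit includes a primary issue queue and a secondary issue queue. An instruction is stored in the primary issue queue if it is currently eligible to issue for execution, and in the secondary issue queue if it is not. The processor determines which of the instructions in the primary issue queue to issue next. An instruction that depends on the result of another instruction is moved from the primary issue queue to the secondary issue queue. In one embodiment, an instruction may be moved from the primary issue queue to the secondary issue queue after it has been issued for execution. In this embodiment, the instruction may be held in the secondary issue queue for a specified duration. Thereafter, if the instruction has not been rejected, the secondary issue queue entry containing it is deallocated. The microprocessor includes an instruction cache, a dispatch unit configured to receive instructions from the instruction cache, and an issue unit configured to receive instructions from the dispatch unit. The issue unit allocates dispatched instructions that are currently eligible to execute to the primary issue queue, and dispatched instructions that are not currently eligible to execute to the secondary issue queue.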
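[0010]
[Embodiments of the Invention]
Referring to FIG. 1, an embodiment of a data processing system 100 according to the present invention is shown. System 100 includes central processing units (processors) 101a, 101b, 101c, etc. (collectively referred to herein as processors 101). In one embodiment, each processor 101 is a reduced instruction set computer (RISC) microprocessor. For general information on RISC processors, see C. May et al., The PowerPC Architecture: A Specification for a New Family of RISC Processors (Morgan Kaufmann, 1994, 2d edition). Processors 101 are connected to system memory 250 and various other components via system bus 113. Read-only memory (ROM) 102 is connected to system bus 113 and may include a basic input/output system (BIOS), which controls basic functions of system 100. FIG. 1 also shows an I/O adapter 107 and a network adapter 106 connected to system bus 113. I/O adapter 107 links mass storage devices, such as hard disk 103 and tape storage device 105, to system bus 113. Network adapter 106 interconnects bus 113 with an external network, enabling data processing system 100 to communicate with other systems. Display monitor 136 is connected to system bus 113 by display adapter 112, which may include a graphics adapter to improve the performance of graphics-intensive applications and a video controller. In some embodiments, adapters 107, 106, and 112 may be connected to an I/O bus that is connected to system bus 113 via an intermediate bus bridge (not shown). Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters include the Peripheral Component Interconnect (PCI) bus as specified in the PCI Local Bus Specification, Rev. 2.2, of the PCI Special Interest Group (Hillsboro, Oregon). Additional input/output devices are shown connected to system bus 113 via user interface adapter 108. Keyboard 109, mouse 110, and speaker 111 are all linked to bus 113 via user interface adapter 108, which may be, for example, a Super I/O chip that integrates multiple device adapters into a single integrated circuit. For information on such a chip, see the PC87338/PC97338 ACPI 1.0 and PC98/99 Compliant SuperI/O data sheet (November 1998) from National Semiconductor Corporation, www.national.com. As configured in FIG. 1, system 100 includes processing means in the form of processors 101; storage means including system memory 250 and mass storage 104; input means such as keyboard 109 and mouse 110; and output means including speaker 111 and display 136. In some embodiments, a portion of system memory 250 and mass storage 104 collectively store an operating system, such as IBM's AIX or another suitable operating system, to coordinate the functions of the various components shown in FIG. 1. For details of the AIX operating system, see IBM's AIX Version 4.3 Technical Reference: Base Operating System and Extensions, Volumes 1 and 2 (SC23-4159, SC23-4160); AIX Version 4.3 System User's Guide: Communications and Networks (SC23-4122); and AIX Version 4.3 System User's Guide: Operating System and Devices (SC23-4121).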
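[0011]
Referring to FIG. 2, a simplified block diagram of processor 101 according to one embodiment of the invention is shown. Processor 101 includes an instruction fetch unit 202 suitable for generating the address of the next instruction to be fetched. Fetch unit 202 may include branch prediction logic which, as its name implies, is adapted to predict, from predetermined information, the outcome of decisions that affect the program execution flow. The ability to predict branch decisions correctly is an important factor in the ability of processor 101 to improve performance by executing instructions speculatively and out of order. The instruction address generated by fetch unit 202 is provided to instruction cache 210, which holds a portion of the contents of system memory in a high-speed storage facility. The instructions stored in instruction cache 210 are preferably formatted according to a first ISA, typically a legacy ISA such as the PowerPC or an x86-compatible instruction set. For details of the PowerPC instruction set, see the PowerPC 620 RISC Microprocessor User's Manual (MPC620UM/AD) from Motorola. If the instruction address generated by fetch unit 202 corresponds to a system memory location that is currently replicated in instruction cache 210, instruction cache 210 forwards the corresponding instruction to cracking unit 212. If the corresponding instruction is not currently resident in instruction cache 210 (that is, the instruction address provided by fetch unit 202 misses in instruction cache 210), the instruction must be fetched from an L2 cache (not shown) or system memory before it can be forwarded to cracking unit 212.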
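[0012]
Cracking unit 212 is adapted to modify the incoming instruction stream to produce a set of instructions optimized for execution at high operating frequencies (above 1 GHz) in the underlying execution pipeline. In one embodiment, for example, cracking unit 212 receives instructions in a 32-bit-wide ISA, such as the instruction set supported by PowerPC microprocessors, and converts them to a second, preferably wider, ISA that facilitates execution in high-speed execution units operating in the gigahertz range and beyond. The wider format of the instructions generated by cracking unit 212 may include, for example, explicit fields that hold information (such as operand values) that is merely implied or referenced in the first-format instructions received by cracking unit 212. In one embodiment, for example, the ISA of the instructions generated by the cracking unit is 64 bits wide or more.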
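[0013]
In another embodiment, cracking unit 212, in addition to converting instructions from the first format to a second, preferably wider, format, is designed to organize a set of fetched instructions into instruction groups 302. Examples of instruction groups are shown in FIG. 3. Each instruction group 302 includes a set of instruction slots 304a, 304b, etc. (collectively referred to herein as instruction slots 304). Organizing a set of instructions into instruction groups facilitates high-speed execution by, among other things, simplifying the logic needed to maintain rename register mapping tables and completion tables for a large number of in-flight instructions. FIG. 3 shows three examples of instruction grouping that may be performed by cracking unit 212.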
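[0014]
In Example 1, the set of instructions indicated by reference numeral 301 is converted by cracking unit 212 into a single instruction group 302. In the depicted embodiment of the invention, each instruction group 302 includes five slots, denoted 304a, 304b, 304c, 304d, and 304e. Each slot 304 may contain a single instruction, so in this embodiment each instruction group may contain at most five instructions. In one embodiment, the instructions of set 301 received by cracking unit 212 are formatted according to the first ISA, as discussed above, and the instructions stored in group 302 are formatted according to the wider second format. Using instruction groups reduces the number of instructions that must be individually tagged and tracked, which simplifies the rename recovery and completion table logic. The use of instruction groups thus sacrifices some information about each individual instruction in order to simplify the tracking of pending instructions in an out-of-order processor.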
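[0015]
Example 2 of FIG. 3 shows a second example of the instruction grouping performed by cracking unit 212 according to one embodiment of the invention. This example demonstrates the ability of cracking unit 212 to break a complex instruction into a group of simple instructions for faster execution. In the depicted example, a sequence of two load-with-update (LDU) instructions is broken into an instruction group containing a pair of load instructions in slots 304a and 304c and a pair of ADD instructions in slots 304b and 304d. Because group 302 in this example contains no branch instruction, the last slot 304e of instruction group 302 contains no instruction. The PowerPC LDU instruction, like analogous instructions in other instruction sets, is complex in the sense that it affects the contents of multiple general purpose registers (GPRs). Specifically, an LDU instruction can be broken into a load instruction that affects the contents of a first GPR and an ADD instruction that affects the contents of a second GPR. Thus, in instruction group 302 of Example 2 of FIG. 3, the instructions in two or more instruction slots 304 correspond to a single instruction received by cracking unit 212.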
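[0016]
In Example 3, a single instruction entering cracking unit 212 is broken into a set of instructions that occupies multiple groups 302. Specifically, FIG. 3 shows a load multiple (LM) instruction. The LM instruction (according to the PowerPC instruction set) loads the contents of consecutive locations in memory into consecutively numbered GPRs. In the depicted example, an LM of six consecutive memory locations is broken into six load instructions. Because each group 302 in the depicted embodiment of processor 101 holds at most five instructions, and the fifth slot 304e is reserved for branch instructions, the LM of six registers is split across two groups 302a and 302b. Four of the load instructions are stored in the first group 302a, and the remaining two load instructions are stored in the second group 302b. Thus, in Example 3, a single instruction is broken into a set of instructions that spans multiple instruction groups 302.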
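The grouping rules of Examples 2 and 3 can be sketched in a few lines of Python. This is only an illustrative model, not the cracking unit's actual logic: the tuple encoding of instructions, the helper names `crack`, `pad`, and `form_groups`, and the rule that a branch always closes its group are assumptions made for the sketch.

```python
NON_BRANCH_SLOTS = 4     # slots 304a-304d; the fifth slot (304e) is reserved for a branch


def crack(instr):
    """Break one first-ISA instruction into simple operations (hypothetical encoding)."""
    op = instr[0]
    if op == "LDU":                      # load-with-update -> load + add
        _, target, base = instr
        return [("LOAD", target), ("ADD", base)]
    if op == "LM":                       # load multiple -> one load per register
        _, first_reg, count = instr
        return [("LOAD", first_reg + i) for i in range(count)]
    return [instr]


def pad(ops, branch=None):
    """Fill unused slots with None and place the branch (if any) in the last slot."""
    return ops + [None] * (NON_BRANCH_SLOTS - len(ops)) + [branch]


def form_groups(instrs):
    """Pack cracked operations into five-slot groups; a branch closes its group."""
    groups, current = [], []
    for instr in instrs:
        for op in crack(instr):
            if op[0] == "BR":
                groups.append(pad(current, branch=op))
                current = []
                continue
            if len(current) == NON_BRANCH_SLOTS:
                groups.append(pad(current))
                current = []
            current.append(op)
    if current:
        groups.append(pad(current))
    return groups
```

Under these assumptions, `form_groups([("LDU", 3, 9), ("LDU", 4, 10)])` yields one group with load, add, load, add and an empty fifth slot, and `form_groups([("LM", 5, 6)])` yields two groups holding four and two loads.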
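[0017]
Returning to FIG. 2, the instruction groups 302 generated by the preferred embodiment of cracking unit 212 are forwarded to basic cache block 213, where they are stored pending execution. Referring to FIG. 5, an embodiment of basic cache block 213 is shown. In the depicted embodiment, basic cache block 213 includes a set of entries 502a through 502n (collectively referred to herein as basic cache block entries 502). In one embodiment, each entry 502 of basic cache block 213 stores a single instruction group 302. Each entry 502 also includes an entry ID 504, a pointer 506, and an instruction address (IA) field 507. The instruction address field 507 of each entry 502 is analogous to the IA field of completion table 218. In one embodiment, each entry of basic cache block 213 corresponds to an entry of completion table 218, and instruction address field 507 indicates the instruction address of the first instruction in the corresponding instruction group 302. In one embodiment, pointer 506 indicates the entry ID of the instruction group 302 to be executed next, as determined by a branch prediction algorithm, a branch history table, or another suitable branch prediction mechanism. As indicated above, the preferred embodiment of forming instruction groups 302 in cracking unit 212 assigns branch instructions to the last slot of each group 302. In addition, the preferred embodiment of cracking unit 212 generates instruction groups 302 that contain at most one branch instruction. In this arrangement, each instruction group 302 can be regarded as a "leg" of the branch tree 600 shown in FIG. 6, in which each instruction group 302 is represented by the value of its corresponding entry ID 504. The first instruction group 302a, for example, is indicated by its entry number (1), and so forth. As an example, assume that the branch prediction mechanism of processor 101 predicts that leg 2 (corresponding to second group 302b) will execute after leg 1, and that leg 3 will execute after leg 2. In one embodiment of the invention, basic cache block 213 reflects these branch predictions by setting pointer 506 to indicate the next group 302 to be executed. The pointer 506 of each entry 502 of basic cache block 213 can be used to determine the next group 302 to be dispatched.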
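[0018]
Basic cache block 213 works in conjunction with block fetch unit 215 in the same way that fetch unit 202 works with instruction cache 210. Specifically, block fetch unit 215 is responsible for generating the instruction address that is provided to basic cache block 213. The instruction address provided by block fetch unit 215 is compared against the addresses in the instruction address fields 507 of basic cache block 213. If the instruction address provided by block fetch unit 215 hits in basic cache block 213, the corresponding instruction group is forwarded to issue queue 220. If the address provided by block fetch unit 215 misses in basic cache block 213, the instruction address is passed back to fetch unit 202, and the corresponding instructions are retrieved from instruction cache 210. In an embodiment suited to conserving area (die size), basic cache block 213 makes it possible to eliminate instruction cache 210. In this embodiment, instructions are retrieved from a suitable storage facility, such as the L2 cache or system memory, and provided directly to cracking unit 212. If an instruction address generated by block fetch unit 215 misses in basic cache block 213, the corresponding instructions are retrieved from the L2 cache or system memory rather than from instruction cache 210.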
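The hit/miss behavior of the block fetch path can be modeled with a short sketch. The class and function names below, the dictionary keyed by instruction address, and the plain lists standing in for the issue queue and the upstream fetch path are all assumptions made for illustration; the specification does not describe the lookup hardware at this level.

```python
class BasicCacheBlock:
    """Toy model: entries keyed by the address of each group's first instruction."""

    def __init__(self):
        self.entries = {}                 # instruction address (field 507) -> instruction group

    def fill(self, address, group):
        self.entries[address] = group

    def lookup(self, address):
        return self.entries.get(address)  # None models a miss


def block_fetch(cache, address, issue_queue, refetch):
    """Forward the group on a hit; on a miss, hand the address back upstream."""
    group = cache.lookup(address)
    if group is not None:
        issue_queue.append(group)
        return "hit"
    # On a miss the fetch unit (or, without an instruction cache,
    # the L2 cache or system memory) has to supply the instructions.
    refetch.append(address)
    return "miss"
```

On a hit the group bypasses fetch and cracking entirely, which is the property that keeps the cracking stages out of the refetch path.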
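[0019]
The depicted embodiment of processor 101 further includes a dispatch unit 214. Dispatch unit 214 ensures that all necessary resources are available before the instructions of each instruction group are forwarded to the corresponding issue queue 220. Dispatch unit 214 also communicates with dispatch and completion control logic 216 to track the order in which instructions were issued and the completion status of those instructions, which facilitates out-of-order execution. In the embodiment of processor 101 in which cracking unit 212 organizes incoming instructions into instruction groups, as discussed above, each instruction group 302 is assigned a group tag (GTAG) by completion control logic 216 that conveys the ordering of the issued instruction groups. As an example, dispatch unit 214 may assign monotonically increasing values to consecutive instruction groups. In this arrangement, an instruction group with a lower GTAG value is known to have issued before (that is, to be older than) an instruction group with a larger GTAG value. Although the depicted embodiment of processor 101 shows dispatch unit 214 as a distinct functional block, the instruction group organization of basic cache block 213 lends itself to incorporating the functions of dispatch unit 214. Thus, in one embodiment, dispatch unit 214 is incorporated within basic cache block 213, which is connected directly to issue queue 220.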
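[0020]
In association with dispatch and completion control logic 216, a completion table 218 is used in one embodiment of the invention to track the status of issued instruction groups. Referring to FIG. 7, an embodiment of completion table 218 is shown. In the depicted embodiment, completion table 218 includes a set of entries 702a through 702n (collectively referred to herein as completion table entries 702). In this embodiment, each entry 702 of completion table 218 includes an instruction address (IA) field 704 and a status bit field 706. The GTAG value of each instruction group 302 identifies the entry 702 of completion table 218 in which the completion information for that instruction group is stored. Thus, the instruction group 302 stored in entry 1 of completion table 218 has a GTAG value of 1, and so forth. In this embodiment, completion table 218 may further include a wrap-around bit to indicate that an instruction group with a lower GTAG value is actually younger than an instruction group with a higher GTAG value. In one embodiment, instruction address field 704 contains the address of the instruction in the first slot 304a of the corresponding instruction group 302. Status field 706 may contain one or more status bits indicating, for example, whether the corresponding entry 702 of completion table 218 is available or whether the entry has been allocated to a pending instruction group.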
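One common way to realize the wrap-around bit is to toggle it each time tag allocation wraps past the end of the table and to compare tags differently when the bits differ. The sketch below assumes that scheme, and also assumes that no more groups are in flight than the table has entries; the names `GtagAllocator` and `is_older` are hypothetical, and the specification does not say how the bit is actually generated or compared.

```python
class GtagAllocator:
    """Hands out (wrap_bit, gtag) pairs for a completion table of fixed size."""

    def __init__(self, table_size):
        self.table_size = table_size
        self.next_tag = 0
        self.wrap = 0

    def allocate(self):
        tag = (self.wrap, self.next_tag)
        self.next_tag += 1
        if self.next_tag == self.table_size:
            self.next_tag = 0
            self.wrap ^= 1               # toggle on every wrap-around
        return tag


def is_older(a, b):
    """Return True if group `a` was issued before group `b`."""
    wrap_a, gtag_a = a
    wrap_b, gtag_b = b
    if wrap_a == wrap_b:
        return gtag_a < gtag_b           # same pass through the table: lower tag is older
    return gtag_a > gtag_b               # one side has wrapped: higher tag is older
```

With a four-entry table, the fourth group gets tag 3 and the fifth gets tag 0 with the wrap bit flipped, so the comparison still orders them correctly even though the newer group has the numerically smaller tag.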
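[0021]
In the embodiment of processor 101 shown in FIG. 2, instructions are issued from dispatch unit 214 to issue queue 220, where they await execution in the corresponding execution pipes 222. Processor 101 may include a variety of execution pipes, each designed to execute a subset of the processor's instruction set. In one embodiment, execution pipes 222 include a branch unit pipeline 224, a load/store pipeline 226, a fixed-point arithmetic unit 228, and a floating-point unit 230. Each execution pipe 222 may comprise two or more pipeline stages. Instructions stored in issue queue 220 may be issued to execution pipes 222 using any of a variety of issue priority algorithms. In one embodiment, for example, the oldest pending instruction in issue queue 220 is the next instruction issued to execution pipes 222. In this embodiment, the GTAG values assigned by dispatch unit 214 are used to determine the relative age of the instructions pending in issue queue 220. Prior to issue, the destination register operand of an instruction is assigned to an available rename GPR. When an instruction is ultimately forwarded from issue queue 220 to the appropriate execution pipe, the execution pipe performs the operation indicated by the instruction's opcode and, when the instruction reaches the final stage 232 of the pipeline, writes the instruction's result to the instruction's rename GPR. A mapping is maintained between the rename GPRs and their corresponding architected registers. When all instructions in an instruction group (and all instructions in older instruction groups) have finished without generating an exception, the completion pointer in completion table 218 is incremented to the next instruction group. When the completion pointer advances to a new instruction group, the rename registers associated with the instructions in the old instruction group are released, thereby committing the results of those instructions. If an instruction older than a finished but not yet committed instruction generates an exception, the instruction that generated the exception and all younger instructions are flushed, and a rename recovery routine is invoked to return the GPR mapping to the last known valid state.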
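[0022]
If a predicted branch is not taken (a branch misprediction), the instructions pending in execution pipes 222 and issue queue 220 are flushed. In addition, the pointer 506 of the basic cache block entry 502 associated with the mispredicted branch is updated to reflect the most recent branch taken. An example of this update process is shown in FIG. 5 for the case in which program execution results in a branch from leg 1 (instruction group 302a) to leg 4 (instruction group 302d). Because pointer 506 of entry 502a had previously predicted a branch to the instruction group in entry number 2 of basic cache block 213 (that is, group 302b), the actual branch from instruction group 302a to group 302d was mispredicted. The mispredicted branch is detected and fed back to block fetch unit 215, the instructions pending between basic cache block 213 and the final stage 232 of each pipeline 222 are flushed, and execution resumes with instruction group 302d in entry 4 of basic cache block 213. In addition, pointer 506 of basic cache block entry 502a is changed from its previous value of 2 to its new value of 4 to reflect the most recent branch information. By incorporating basic cache block 213 and block fetch unit 215 in close proximity to execution pipelines 222, the invention reduces the performance penalty of a mispredicted branch. More specifically, by implementing basic cache block 213 on the "downstream" side of instruction cracking unit 212, the invention removes the instructions pending in cracking unit 212 from the branch misprediction flush path, thereby reducing the number of pipeline stages that must be purged after a misprediction and reducing the performance penalty. In addition, basic cache block 213 provides a caching mechanism whose structure matches the organization of dispatch and completion control logic 216 and completion table 218, which simplifies the organization of the intervening logic and, as described above, facilitates the implementation of useful extensions to basic cache block 213.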
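The leg 1 to leg 4 example can be traced with a minimal sketch. The `Entry` dataclass, the `resolve_branch` function, and the list standing in for the instructions between the basic cache block and the final pipeline stage are illustrative assumptions, not structures named in the specification.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    entry_id: int        # entry ID 504
    next_id: int         # pointer 506: predicted next entry
    group: str           # stand-in for the stored instruction group


def resolve_branch(entries, current_id, actual_next_id, in_flight):
    """Check the prediction; on a misprediction, flush downstream work and retrain."""
    entry = entries[current_id]
    if entry.next_id == actual_next_id:
        return False                      # prediction was correct, nothing to do
    in_flight.clear()                     # only work downstream of the cache is flushed
    entry.next_id = actual_next_id        # pointer now reflects the latest outcome
    return True
```

In this model nothing upstream of the cache (the cracking stages) appears in `in_flight`, so a misprediction never discards it; execution would simply resume from `entries[actual_next_id]`.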
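[0023]
In one embodiment, basic cache block 213 further includes instruction history information that can advantageously improve processor performance by recording information that may be used during subsequent executions of the same instruction group to avoid scenarios likely to result in exceptions, flushes, interrupts, or other performance-limiting events (collectively referred to herein as exception events). In the embodiment of basic cache block 213 shown in FIG. 8, the instruction history information is stored in an instruction history field 508 of each entry 502. As an example of the type of information that might be stored in instruction history field 508, consider an instruction group containing a particular load instruction that resulted in a store-forwarding exception the last time it was executed. A store-forwarding exception occurs when a load instruction that follows (in program order) a store instruction with which it shares a common memory reference executes before the store instruction in an out-of-order machine. Because a load that executes before the store retrieves an invalid value from the register, an exception is generated that results in an instruction flush. The parallelism between the structures of basic cache block 213 and completion control logic 216 facilitates the task of conveying information learned by dispatch and completion control logic 216 about how instructions executed and completed to the corresponding entry of basic cache block 213. In the absence of this parallelism, completion information from dispatch and completion control logic 216 would typically have to pass through some form of intermediate hash table or other suitable mechanism to associate the group information with its component instructions. In the store-forwarding example, upon detecting the store-forwarding exception, dispatch and completion control logic 216 writes a bit indicating the exception into the instruction history field 508 of the corresponding entry of basic cache block 213. When the instruction group is subsequently executed, the instruction history information indicating the previous store-forwarding exception can be used, for example, to place processor 101 in an in-order mode that prevents the load from executing before the store completes. Thus, this embodiment of the invention contemplates recording instruction history information indicating an exception event associated with an instruction group and thereafter modifying execution of the group to prevent the exception event from occurring when the group is subsequently executed. Although illustrated with the store-forwarding example, instruction history field 508 is suitable for recording information about a variety of instruction history events that could enable the processor to avoid recurring exception conditions, such as information about the accuracy of a prediction mechanism, predicted operand values, and cache hit/miss information.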
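A minimal sketch of the store-forwarding case follows. The bit assignment `STORE_FORWARD`, the class `HistoryEntry`, and the two functions are hypothetical; the specification states only that a bit indicating the exception is written to history field 508 and later used to select an in-order mode.

```python
STORE_FORWARD = 0x1      # hypothetical bit position within history field 508


class HistoryEntry:
    def __init__(self, group):
        self.group = group
        self.history = 0                 # instruction history field 508


def record_exception(entry, event_bit):
    """Completion logic writes the observed event into the entry's history field."""
    entry.history |= event_bit


def choose_mode(entry):
    """Pick an execution mode for the group's next dispatch."""
    if entry.history & STORE_FORWARD:
        return "in-order"                # keep the load behind the store this time
    return "out-of-order"
```

Because the history lives in the same entry as the group, the second execution of the group can consult it at dispatch time without any separate lookup structure.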
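[0024]
One example of the information that may be recorded in the instruction history field 508 of basic cache block 213 is highlighted by the embodiment shown in FIG. 9, in which issue queue 220 is divided into a primary issue queue 902 and a secondary issue queue 904. The optimum size or depth of issue queue 220 represents a balance between competing considerations. On the one hand, it is desirable to implement a very large and deep issue queue to take maximum advantage of the processor's ability to execute instructions out of order. The ability to issue instructions out of order is limited by the number of instructions pending in issue queue 220: the larger the issue queue, the more instructions are candidates for out-of-order processing. On the other hand, as the depth of the issue queue increases, the processor's ability to determine the next instruction to issue within the cycle time constraints of the processor decreases. In other words, the more instructions pending in issue queue 220, the longer it takes to determine the next instruction to issue. For this reason, issue queues such as issue queue 220 are frequently limited to a depth of approximately 20 or fewer. One embodiment of the invention seeks to realize the benefits of a deep issue queue without significantly increasing the logic required to search the issue queue for the next issuable instruction. The invention takes advantage of the fact that instructions pending in issue queue 220 are frequently not eligible for immediate issue, either because they have already been issued and are pending in the execution pipelines 222 of processor 101, or because they are awaiting the completion of another instruction on which they depend for an operand value.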
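[0025]
Referring to FIG. 9, an issue queue 220 according to one embodiment of the invention includes a primary issue queue 902 and a secondary issue queue 904. Primary issue queue 902 holds instructions that are eligible for immediate issue. In one embodiment, instructions dispatched from dispatch unit 214 are initially stored in an available entry of primary issue queue 902. If it is subsequently determined that an instruction depends on another instruction, the dependent instruction is moved to secondary issue queue 904 until the instruction on which it depends has retrieved the required information. If, for example, an add instruction that follows a load instruction requires the result of the load, both instructions may initially be dispatched to primary issue queue 902. Once it is determined that the add instruction depends on the load instruction, the add instruction is transferred from primary issue queue 902 to secondary issue queue 904. In an embodiment that uses the instruction history field 508 discussed with respect to FIG. 8, the dependency of the add instruction may be recorded so that, during a subsequent execution, the add instruction can be stored directly in secondary issue queue 904. Secondary issue queue 904 may also be used to hold instructions that were recently issued and are still pending in the processor's execution pipelines. In this embodiment, an instruction is transferred to secondary issue queue 904 after it issues from primary issue queue 902. In one embodiment, the instruction may remain in secondary issue queue 904 until it is determined that the instruction has not been rejected. One way to determine that an instruction has not been rejected is to implement a timer/counter (not shown) associated with each entry of secondary issue queue 904. The counter/timer is initialized when the instruction is first transferred from primary issue queue 902 to secondary issue queue 904. In one embodiment, the counter/timer counts the number of clock cycles that have elapsed since it was initialized. If the counter/timer counts for a predetermined number of cycles without a rejection of the instruction being detected, the instruction is presumed to have completed successfully, and its entry in secondary issue queue 904 is deallocated. By using an issue queue that includes a primary issue queue dedicated to instructions that are currently eligible to issue for execution, together with a secondary issue queue for instructions that have been dispatched but are not currently eligible to execute (either because of an instruction dependency or because the instruction was recently issued from the primary issue queue), the effective size or depth of the issue queue is increased without significantly increasing the time (that is, the number of logic levels) required to determine the next instruction to issue.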
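The two-queue arrangement and the per-entry counter can be sketched as follows. This is a simplified model under stated assumptions: the class `IssueQueue`, its method names, the `reject_window` parameter, the oldest-first issue policy, and the `requeue` step that returns an instruction to the primary queue are all illustrative; the specification does not describe how a rejected or newly ready instruction is moved back. Passing `depends_on` at dispatch models the history-guided case in which a known dependency sends the instruction straight to the secondary queue.

```python
class IssueQueue:
    def __init__(self, reject_window=3):
        self.primary = []                 # eligible to issue now (queue 902)
        self.secondary = {}               # queue 904: name -> cycles since issue,
                                          # or None while waiting on a dependency
        self.reject_window = reject_window

    def dispatch(self, name, depends_on=None):
        if depends_on is None:
            self.primary.append(name)
        else:
            self.secondary[name] = None   # parked until its producer has a result

    def issue(self):
        """Issue the oldest primary entry and park it in the secondary queue."""
        name = self.primary.pop(0)
        self.secondary[name] = 0          # initialize the per-entry counter
        return name

    def requeue(self, name):
        """Assumed recovery path: a rejected or newly ready instruction becomes eligible."""
        del self.secondary[name]
        self.primary.append(name)

    def tick(self):
        """Advance one cycle; deallocate entries that survived the reject window."""
        for name, age in list(self.secondary.items()):
            if age is None:
                continue                  # dependency waiters are not timed
            if age + 1 >= self.reject_window:
                del self.secondary[name]  # presumed to have completed successfully
            else:
                self.secondary[name] = age + 1
```

Only `primary` is searched when choosing the next instruction, so the selection logic scales with the small queue even though the two queues together hold many more instructions.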
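[0026]
It will be apparent to those skilled in the art having the benefit of this disclosure that the invention contemplates various embodiments of a microprocessor including a cache facility suitable for storing grouped instructions (that is, instructions that have been converted from a first format to a second format) to reduce the latency associated with mispredicted branches. The forms of the invention shown and described in detail with the drawings are to be taken merely as presently preferred examples. The claims should be interpreted broadly to embrace all variations of the preferred embodiments disclosed herein.
[0027]
In summary, the following matters are disclosed with respect to the configuration of the invention.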
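[0028]
(1) A method of processing instructions in a microprocessor, comprising:
converting a first set of received instructions into instruction groups;
storing the instruction groups in a basic cache block organized such that each entry of the cache facility stores a corresponding instruction group;
issuing the instructions of the instruction groups for execution; and
in response to an exception generated during execution of an instruction of the instruction groups, flushing only the instructions pending between the basic cache block and the final stage.
(2) The method of (1), wherein the generated exception includes a branch misprediction exception.
(3) The method of (1), wherein the received instructions are formatted according to a first instruction format and the instructions of the instruction groups are formatted according to a second instruction format.
(4) The method of (3), wherein the second instruction format is wider than the first instruction format.
(5) The method of (4), further comprising assigning a pointer to each entry of the cache facility, the pointer indicating the next instruction to be executed.
(6) The method of (5), further comprising, in response to detecting a mispredicted branch during execution of one of the instruction groups, updating the pointer of the cache entry corresponding to the mispredicted branch.
(7) A microprocessor comprising:
an instruction cracking unit configured to receive a first set of microprocessor instructions and to organize the set of instructions as instruction groups;
a basic cache block facility configured to cache the instruction groups generated by the cracking unit; and
an execution unit suitable for executing the instructions of the instruction groups,
wherein an exception that is generated during execution of an instruction of the instruction groups and that causes a flush results in flushing only those instructions that have been dispatched from the basic cache block.
(8) The processor of (7), further comprising a dispatch unit configured to retrieve instructions from the instruction groups in the basic cache block and forward the instructions to an issue queue.
(9) The processor of (7), wherein the received instructions are formatted according to a first instruction format and a second set of instructions is formatted according to a second instruction format, the second instruction format being wider than the first instruction format.
(10) The processor of (7), wherein the basic cache block is configured to store each instruction group in a corresponding entry of the basic cache block.
(11) The processor of (10), wherein the basic cache block includes an entry field indicating the corresponding basic cache block entry.
(12) The processor of (11), wherein each entry of the basic cache block includes a pointer indicating the next instruction group to be executed.
(13) The processor of (12), wherein the processor is configured to update the pointer of a cache entry in response to a mispredicted branch.
(14) A data processing system including a processor, memory, input means, and a display, comprising:
an instruction cracking unit configured to receive a first set of microprocessor instructions and to organize the set of instructions as instruction groups;
a basic cache block facility configured to cache the instruction groups generated by the cracking unit; and
an execution unit suitable for executing the instructions of the instruction groups,
wherein an exception that is generated during execution of an instruction of the instruction groups and that causes a flush results in flushing only those instructions that have been dispatched from the basic cache block.
(15) The data processing system of (14), further comprising a dispatch unit configured to retrieve instructions from the instruction groups in the basic cache block and forward the instructions to an issue queue.
(16) The data processing system of (14), wherein the received instructions are formatted according to a first instruction format and a second set of instructions is formatted according to a second instruction format, the second instruction format being wider than the first instruction format.
(17) The data processing system of (14), wherein the basic cache block is configured to store each instruction group in a corresponding entry of the basic cache block.
(18) The data processing system of (17), wherein the basic cache block includes an entry field indicating the corresponding basic cache block entry.
(19) The data processing system of (18), wherein each entry of the basic cache block includes a pointer indicating the next instruction group to be executed.
(20) The data processing system of (19), wherein the processor is configured to update the pointer of each entry in response to a mispredicted branch.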
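[Brief Description of the Drawings]
[FIG. 1] A diagram showing selected components of a data processing system including a microprocessor according to one embodiment of the invention.
[FIG. 2] A diagram showing selected components of a microprocessor according to one embodiment of the invention.
[FIG. 3] A diagram showing examples of the instruction cracking function performed by one embodiment of the processor of FIG. 2.
[FIG. 4] A diagram showing selected components of a microprocessor.
[FIG. 5] A diagram showing the basic cache block of the microprocessor of FIG. 2.
[FIG. 6] A diagram showing various branches that the processor of FIG. 2 may encounter.
[FIG. 7] A diagram showing a completion table suitable for use with the invention.
[FIG. 8] A diagram showing a basic cache block that includes instruction history information.
[FIG. 9] A diagram showing an issue queue including a primary issue queue and a secondary issue queue according to one embodiment of the invention.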
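[Description of Reference Numerals]
100 Data processing system
101 Central processing unit (processor)
102 ROM (read-only memory)
103 Hard disk
104 Mass storage device
105 Tape storage device
106 Network adapter
107 I/O adapter
108 User interface adapter
109 Keyboard
110 Mouse
111 Speaker
112 Display adapter
113 System bus
136 Display monitor
202 Instruction fetch unit
210 Instruction cache
212 Cracking unit
213 Basic cache block
214 Dispatch unit
215 Block fetch unit
216 Dispatch and completion control logic
218 Completion table
220 Issue queue
222 Execution pipe
224 Branch unit pipeline
226 Load/store pipeline
228 Fixed-point arithmetic unit
230 Floating-point unit
232 Final stage
250 System memory
302 Instruction group
304 Instruction slot
401 Layered architecture microprocessor
402 Fetch unit
406 Branch prediction logic
410 Instruction cache
412 ISA converter
422 Execution pipeline
432 Final stage
502 Basic cache block entry
504 Entry ID
506 Pointer
507 Instruction address (IA) field
508 Instruction history field
600 Branch tree
702 Completion table entry
704 Instruction address (IA) field
706 Status bit field
902 Primary issue queue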
９０４２次発行キュー[0001]
BACKGROUND OF THE INVENTION
The present invention relates generally to the field of microprocessor architectures, and more particularly to microprocessors that utilize instruction group architectures, corresponding cache mechanisms, and their useful extensions.
[0002]
[Prior art]
While microprocessor technology delivers gigahertz-class performance, microprocessor designers are compatible with a large number of software that is designed to run on specific instruction set architectures (ISAs) and is already in practical use. Faced with the big challenge of maintaining the functionality while using the latest technology. To solve this problem, designers receive instructions that are formatted according to existing ISAs, and a “layered architecture” tailored to convert the instruction format to an internal ISA suitable for operation in a gigahertz execution pipeline.・ A microprocessor is installed. Please refer to FIG. A portion of the hierarchical architecture microprocessor 401 is shown. With this design, instruction cache 410 of microprocessor 401 receives and stores instructions fetched from main memory by fetch unit 402. The instructions stored in the instruction cache 410 are formatted according to the first ISA (that is, the ISA written with the program being executed by the processor 401). The instruction is then retrieved from instruction cache 410 and converted to a second ISA by ISA converter 412. Since the conversion of instructions from the first ISA to the second ISA requires multiple cycles, the conversion process is usually pipelined, so multiple instructions must be converted from the first ISA to the second ISA at any point in time. There is. The converted instruction is then transferred to the execution pipeline 422 of the processor 401 for execution. The fetch unit 402 includes branch prediction logic 406 that attempts to determine the address of an instruction to be executed following a branch instruction by predicting the result of the branch decision. The instruction is then issued and executed speculatively based on the branch prediction. However, when the branch is unpredicted, it is necessary to flush the instruction held between the instruction cache 410 of the processor 401 and the final stage 432. 
The performance penalty that occurs when a mispredicted branch result in the system is flushed is a function of the length of the pipeline. The more pipeline stages that need to be flushed, the greater the performance penalty when branch prediction is missed. In a hierarchical architecture, the processor pipeline becomes long and the number of “in flight” instructions can increase at a given time, so the mispredicted branch penalty associated with a hierarchical architecture is a factor that limits processor performance. become.
[0003]
Therefore, there is a strong demand to implement a hierarchical architecture microprocessor that supports the performance penalty of branch misprediction. There is also a need for the implemented solution to at least partially resolve the repeated occurrence of exceptional conditions caused by repeated execution of code fragments. Furthermore, the implemented solution is also required to enable a large issue queue in effect without sacrificing the ability to search the issue queue for instructions to be executed next.
[0004]
[Problems to be solved by the invention]
It is an object of the present invention to provide a microprocessor utilizing a cache mechanism that matches the instruction group and instruction group format.
[0005]
It is a further object of the present invention to provide a processor, data processing system, and method for utilizing instruction history information with basic cache blocks to improve performance.
[0006]
It is another object of the present invention to provide a processor, a data processing system, and a method using a primary issue queue and a secondary issue queue.
[0007]
[Means for Solving the Problems]
Embodiments of the present invention contemplate a microprocessor and associated method and data processing system. The microprocessor includes an instruction cracking unit configured to receive the first instruction set. The cracking unit organizes a set of instructions as an instruction group. Each instruction in the group shares an instruction group tag. The processor also includes a basic cache block mechanism organized in an instruction group format and configured to cache instruction groups generated by the cracking unit. The execution unit of the processor is suitable for executing instructions of the instruction group. In an embodiment, when an exception occurs during the execution of an instruction group instruction and this causes a flush, only the instructions dispatched from the base cache block are flushed. The processor only flushes instructions that have arrived in the basic cache block so that instructions pending in the cracking unit pipeline are not flushed. Since fewer instructions are flushed, the performance penalty when an exception occurs is also reduced. In other embodiments, the received instructions are formatted according to a first instruction format and the second instruction set is formatted according to a second instruction format. The second instruction format is wider than the first instruction format. The basic cache block is configured to facilitate storing each instruction group of the corresponding entry of the basic cache block. In some embodiments, each entry in a basic cache block includes an entry field that indicates the corresponding basic cache block entry and a pointer that predicts the next instruction group to be executed. The processor is preferably configured to update the pointer of the cache entry corresponding to the mispredicted branch.
[0008]
The processor is suitable for receiving an instruction set and organizing the instruction set into instruction groups. Instruction groups are dispatched for execution. After execution of the instruction group, instruction history information indicating an exception event related to the instruction group is recorded. Thereafter, the execution of the instruction is changed in response to the instruction history information, and the occurrence of an exception event during the subsequent execution of the instruction group is prevented. The processor includes a stage mechanism such as an instruction cache, an L2 cache or system memory, a cracking unit, and a basic cache block. The cracking unit is configured to receive an instruction set from the stage mechanism. The cracking unit is coordinated to organize the instruction set into instruction groups. The cracking unit can change the format of the instruction set from the first instruction format to the second instruction format. The basic cache block architecture is suitable for storing instruction groups. The basic cache block includes an instruction history field corresponding to each entry of the basic cache block. The instruction history information indicates an exception event related to the instruction group. In the preferred embodiment, each entry in the basic cache block corresponds to one instruction group generated by the cracking unit. The processor can further include completion table control logic configured to store information in the instruction history field when execution of the instruction group is completed. The instruction history information can indicate whether instructions in the instruction group have a dependency with other instructions or whether execution of the instruction group has previously resulted in a store-forward exception. 
In this embodiment, the processor is configured to operate in an in-order-mode in response to detecting that execution of the instruction group has previously resulted in a store-forward exception.
[0009]
The processor is suitable for dispatching instructions to the issuing unit. The issue unit includes a primary issue queue and a secondary issue queue. An instruction is stored in the primary issue queue if it is currently issued for execution. If the current issue is not permitted for execution, it is stored in the secondary issue queue. The processor determines an instruction to be issued next among a plurality of instructions in the primary issue queue. An instruction is moved from the primary issue queue to the secondary issue queue if it depends on the result from another instruction. In an embodiment, after an instruction is issued for execution, it can move from the primary issue queue to the secondary issue queue. In this embodiment, the instructions can be kept in the secondary issue queue for a specified time. Thereafter, if the instruction has not been rejected, the secondary issue queue entry containing the instruction is deallocated. The microprocessor includes an instruction cache, a dispatch unit configured to receive instructions from the instruction cache, and an issue unit configured to receive instructions from the dispatch unit. The issue unit assigns instructions dispatched and currently allowed to execute to the primary issue queue, and assigns instructions dispatched and not currently allowed to execute to the secondary issue queue.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
Please refer to FIG. An embodiment of a data processing system 100 according to the present invention is shown. The system 100 includes central processing units (processors) 101a, 101b, 101c and the like (herein collectively referred to as a processor 101). In the embodiment, each processor 101 is a RISC (Reduced Instruction Set Computer) microprocessor or the like. For general RISC processors, see C.I. See May, et al. Power PC Architecture: A Specification for a New Family of RISC Processors (Morgan Kaufmann, 1994 2d edition). The processor 101 is connected to the system memory 250 and various other components through the system bus 113. A ROM (read only memory) 102 is connected to the system bus 113 and includes a BIOS (basic input / output system) and the like, and the BIOS controls basic functions of the system 100. FIG. 1 also includes an I / O adapter 107 and a network adapter 106 connected to the system bus 113. The I / O adapter 107 links mass storage devices such as the hard disk 103 and the tape storage device 105 to the system bus 113. The network adapter 106 interconnects the bus 113 with an external network and allows the data processing system 100 to communicate with other systems. Display monitor 136 is connected to system bus 113 by display adapter 112, which includes graphics adapters and the like to improve the performance of graphics-intensive applications and video controllers. In some embodiments, adapters 107, 106, and 112 can be connected to an I / O bus that is connected to system bus 113 via an intermediate bus bridge (not shown). I / O buses suitable for connecting peripheral devices such as hard disk controllers, network adapters, graphics adapters, etc. are PCI local bus specifications of PCI SIG (Subcommittee) (Hillsboro, Oregon). PCI (Peripheral Components Interface) bus designated according to the second edition. Other input / output devices are shown as being connected to the system bus 113 through the user interface adapter 108. 
The keyboard 109, mouse 110, and speaker 111 are all linked to the bus 113 through the user interface adapter 108. The adapter 108 is, for example, a Super I / O chip that integrates a plurality of device adapters into one circuit. For information on such chips, see National Semiconductor Corporation, www. national. com PC87338 / PC97338 ACPI 1.0 and PC98 / 99 Compliant SuperI / O data sheet (November 1998). As shown in FIG. 1, the system 100 includes processing means in the form of a processor 101, stage means including a system memory 250 and a mass storage device 104, input means such as a keyboard 109, a mouse 110, and a speaker 111, a display 136. Including output means. Some embodiments of system memory 250 and mass storage device 104 collectively store an operating system such as IBM AIX or other suitable operating system, and the functionality of the various components shown in FIG. Adjust. For details on the AIX operating system, see AIX's AIX Version 4.3 Technical Reference: Base Operating System and Extensions, Volumes 1 and 2 (SC23-4159, SC23-4160), AIX Vers. See Communications and Networks (SC23-4122) and AIX Version 4.3 System User's Guide: Operating System and Devices (SC23-4121).
[0011]
Please refer to FIG. 2, which shows a simplified diagram of a processor 101 according to an embodiment of the present invention. The processor 101 includes an instruction fetch unit 202 suitable for generating the address of the next instruction to be fetched. The instruction address generated by the fetch unit 202 is provided to the instruction cache 210. The fetch unit 202 may include branch prediction logic which, as its name suggests, is adapted to predict, on the basis of available information, the outcome of decisions that affect the execution flow of the program. The ability to predict branch decisions correctly is an important factor in allowing the processor 101 to improve performance by executing instructions speculatively and out of order. The instruction cache 210 stores a portion of the contents of the system memory in a high-speed storage mechanism. The instructions stored in the instruction cache 210 are preferably formatted according to a first ISA. The first ISA is typically a legacy ISA such as the PowerPC or an x86-compatible instruction set. For more information on the PowerPC instruction set, refer to Motorola's PowerPC 620 RISC Microprocessor User's Manual (MPC620UM/AD). If the instruction address generated by the fetch unit 202 corresponds to a system memory location that is currently replicated in the instruction cache 210, the instruction cache 210 forwards the corresponding instruction to the cracking unit 212. If the instruction corresponding to the instruction address generated by the fetch unit 202 does not currently reside in the instruction cache 210 (that is, the instruction address provided by the fetch unit 202 misses in the instruction cache 210), the instruction must be fetched from an L2 cache (not shown) or from system memory before it can be forwarded to the cracking unit 212.
[0012]
The cracking unit 212 is adapted to modify the incoming instruction stream and produce a set of instructions optimized for execution at high operating frequencies (above 1 GHz) in the underlying execution pipelines. For example, in one embodiment the cracking unit 212 receives instructions in a 32-bit-wide ISA, such as the instruction set supported by PowerPC microprocessors, and converts them to a second, preferably wider ISA that facilitates execution in high-speed execution units operating in the gigahertz range and beyond. The wider format of the instructions generated by the cracking unit 212 may include, for example, explicit fields that hold information (such as operand values) that is merely implied or referenced in the instructions received by the cracking unit 212, which are formatted according to the first format. In one embodiment, for example, the ISA of the instructions generated by the cracking unit 212 is 64 bits wide or wider.
[0013]
In another embodiment, the cracking unit 212 is designed to organize a set of fetched instructions into instruction groups 302 in addition to converting instructions from the first format to the second, preferably wider format. FIG. 3 shows examples of instruction groups. Each instruction group 302 includes a set of instruction slots 304a, 304b, and so on (collectively referred to herein as instruction slots 304). Organizing sets of instructions into instruction groups simplifies the logic required to maintain the rename register mapping tables and completion tables, especially when a large number of instructions are in flight, and thereby facilitates high-speed execution. FIG. 3 shows three examples of the instruction grouping that can be performed by the cracking unit 212.
[0014]
In Example 1, the instruction set denoted 301 is converted into one instruction group 302 by the cracking unit 212. In the illustrated embodiment of the invention, each instruction group 302 includes five slots denoted 304a, 304b, 304c, 304d, and 304e. Each slot 304 can contain one instruction, so in this embodiment each instruction group can include up to five instructions. In some embodiments, the instructions in the instruction set 301 received by the cracking unit 212 are formatted according to the first ISA as described above, and the instructions stored in the group 302 are formatted according to the wider second format. Using instruction groups reduces the number of instructions that must be individually tagged and tracked, which simplifies the rename recovery and completion table logic. The use of instruction groups thus trades some per-instruction information for a simpler process of tracking pending instructions in an out-of-order processor.
[0015]
Example 2 of FIG. 3 shows a second example of the instruction grouping performed by the cracking unit 212 according to an embodiment of the present invention. This example illustrates the ability of the cracking unit 212 to break complex instructions down into groups of simple instructions for faster execution. In the example shown, a sequence of two LDU (load-with-update) instructions is broken down into an instruction group containing a pair of load instructions in slots 304a and 304c and a pair of ADD instructions in slots 304b and 304d. In this example, because the group 302 contains no branch instruction, the last slot 304e of the instruction group 302 contains no instruction. The PowerPC LDU instruction, like similar instructions in other instruction sets, is a complex instruction in the sense that it affects the contents of multiple GPRs (general-purpose registers). Specifically, the LDU instruction can be broken down into a load instruction that affects the contents of a first GPR and an ADD instruction that affects the contents of a second GPR. Accordingly, in the instruction group 302 of Example 2 of FIG. 3, the instructions in two or more instruction slots 304 correspond to a single instruction received by the cracking unit 212.
[0016]
In Example 3, a single instruction entering the cracking unit 212 is broken down into a set of instructions that occupies multiple groups 302. Specifically, FIG. 3 shows an LM (load multiple) instruction. The LM instruction (according to the PowerPC instruction set) loads the contents of consecutive locations in memory into consecutively numbered GPRs. In the illustrated example, an LM of six consecutive memory locations is broken down into six load instructions. Because each group 302 in the illustrated embodiment of the processor 101 contains at most five instructions, and because the fifth slot 304e is reserved for branch instructions, the LM of six registers is divided between two groups 302a and 302b. Four of the load instructions are stored in the first group 302a, and the remaining two load instructions are stored in the second group 302b. Thus, in Example 3, one instruction is broken down into a set of instructions that spans multiple instruction groups 302.
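The grouping behavior of Examples 2 and 3 can be illustrated with a minimal software sketch. It is not the hardware implementation of the cracking unit 212; the tuple encoding of instructions and the names `crack` and `form_groups` are assumptions made only for illustration. The sketch assumes five slots per group with the last slot reserved for a branch, as in the illustrated embodiment.

```python
SLOTS_PER_GROUP = 5                  # slots 304a-304e in the illustrated embodiment
BRANCH_SLOT = SLOTS_PER_GROUP - 1    # last slot (304e) is reserved for a branch


def crack(instr):
    """Break one first-format instruction into simple operations.

    The tuple encoding is hypothetical: ("LDU", rt, ra), ("LM", first_reg, count).
    """
    op = instr[0]
    if op == "LDU":                  # load-with-update -> load + add
        _, rt, ra = instr
        return [("LOAD", rt), ("ADD", ra)]
    if op == "LM":                   # load multiple -> one load per register
        _, first_reg, count = instr
        return [("LOAD", first_reg + i) for i in range(count)]
    return [instr]                   # simple instructions pass through unchanged


def form_groups(instrs):
    """Pack cracked operations into groups; a branch always ends its group."""
    groups = []
    current = [None] * SLOTS_PER_GROUP
    used = 0

    def close():
        nonlocal current, used
        groups.append(current)
        current = [None] * SLOTS_PER_GROUP
        used = 0

    for instr in instrs:
        for op in crack(instr):
            if op[0] == "BRANCH":
                current[BRANCH_SLOT] = op
                close()
            else:
                if used == BRANCH_SLOT:   # non-branch slots are full
                    close()
                current[used] = op
                used += 1
    if used > 0:
        close()
    return groups
```

With this sketch, `form_groups([("LM", 1, 6)])` yields two groups holding four and two loads, and `form_groups([("LDU", 1, 2), ("LDU", 3, 4)])` yields a single group with loads in the first and third slots, adds in the second and fourth, and an empty fifth slot.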
[0017]
Please refer again to FIG. 2. In the preferred embodiment, the instruction groups 302 generated by the cracking unit 212 are forwarded to the basic cache block 213 and stored there pending execution. Please refer to FIG. 5, which shows an embodiment of the basic cache block 213. In the illustrated embodiment, the basic cache block 213 includes a set of entries 502a through 502n (collectively referred to herein as basic cache block entries 502). In some embodiments, each entry 502 in the basic cache block 213 stores one instruction group 302. Each entry 502 includes an entry ID 504, a pointer 506, and an instruction address (IA) field 507. The instruction address field 507 of each entry 502 is similar to the IA field of the completion table 218. In one embodiment, each entry 502 in the basic cache block 213 corresponds to an entry in the completion table 218, and the instruction address field 507 indicates the instruction address of the first instruction in the corresponding instruction group 302. In one embodiment, the pointer 506 indicates the entry ID of the next instruction group 302 to be executed according to a branch prediction algorithm, branch history table, or other branch prediction mechanism. As described above, the preferred embodiment of forming instruction groups 302 with the cracking unit 212 assigns branch instructions to the last slot of each group 302. The preferred embodiment of the cracking unit 212 also generates instruction groups 302 that each contain at most one branch instruction. In this configuration, each instruction group 302 can be regarded as one leg of the branch tree 600 shown in FIG. 6, in which each instruction group 302 is represented by the value of its entry ID 504. For example, the first instruction group 302a is indicated by its entry number (1).
As an example, assume that the branch prediction mechanism of the processor 101 predicts that leg 2 (corresponding to the second group 302b) is executed following leg 1, and that leg 3 is executed following leg 2. In some embodiments of the present invention, the basic cache block 213 reflects these branch predictions by setting the pointers 506 to indicate the group 302 to be executed next. The pointer 506 of each entry 502 in the basic cache block 213 can thus be used to determine the next group 302 to be dispatched.
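The entry layout and the way the pointers 506 chain predicted legs together can be sketched as a small data structure. This is only an illustration under assumed names (`Entry`, `predicted_path`) and an assumed dictionary keyed by entry ID; the specification does not prescribe any software representation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    entry_id: int            # entry ID 504
    pointer: Optional[int]   # pointer 506: entry ID of the predicted next group
    ia: int                  # instruction address field 507 (first instruction)
    group: str               # stand-in for the stored instruction group 302


def predicted_path(block: Dict[int, Entry], start: int, limit: int = 8) -> List[int]:
    """Follow pointers 506 from `start`, returning the predicted sequence of legs."""
    path: List[int] = []
    eid: Optional[int] = start
    while eid is not None and eid in block and len(path) < limit:
        path.append(eid)
        eid = block[eid].pointer
    return path


# Leg 1 is predicted to be followed by leg 2, and leg 2 by leg 3.
block = {
    1: Entry(1, 2, 0x1000, "group 302a"),
    2: Entry(2, 3, 0x1020, "group 302b"),
    3: Entry(3, None, 0x1040, "group 302c"),
    4: Entry(4, None, 0x1060, "group 302d"),
}
path = predicted_path(block, 1)
```

The `limit` argument is an assumption added so that a pointer cycle (a loop in the program) does not make the walk unbounded.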
[0018]
The basic cache block 213 works with the block fetch unit 215 in the same way that the instruction cache 210 works with the fetch unit 202. Specifically, the block fetch unit 215 is responsible for generating the instruction address provided to the basic cache block 213. The instruction address provided by the block fetch unit 215 is compared with the addresses in the instruction address fields 507 of the basic cache block 213. If the instruction address provided by the block fetch unit 215 hits in the basic cache block 213, the corresponding instruction group is forwarded to the issue queue 220. If the address provided by the block fetch unit 215 misses in the basic cache block 213, the instruction address is sent back to the fetch unit 202 and the corresponding instruction is retrieved from the instruction cache 210. In an embodiment suited to saving die area, the basic cache block 213 makes it possible to eliminate the instruction cache 210. In this embodiment, instructions are retrieved from an appropriate storage mechanism such as the L2 cache or system memory and provided directly to the cracking unit 212. If the instruction address generated by the block fetch unit 215 misses in the basic cache block 213, the corresponding instruction is retrieved from the L2 cache or system memory rather than from the instruction cache 210.
[0019]
The illustrated embodiment of the processor 101 further includes a dispatch unit 214. The dispatch unit 214 ensures that all necessary resources are available before forwarding the instructions of each instruction group to the corresponding issue queues 220. The dispatch unit 214 also communicates with the dispatch/completion control logic 216 to track the order in which instructions are issued and the completion status of these instructions, which facilitates out-of-order execution. As described above, in the embodiment of the processor 101 in which the cracking unit 212 organizes incoming instructions into instruction groups, each instruction group 302 is assigned a group tag (GTAG) by the completion control logic 216 that conveys the order of the issued instruction groups. As an example, the dispatch unit 214 can assign monotonically increasing values to consecutive instruction groups. In this configuration, an instruction group with a lower GTAG value is known to have been issued before (that is, to be older than) an instruction group with a higher GTAG value. Although the illustrated embodiment of the processor 101 shows the dispatch unit 214 as an independent functional block, the group-oriented organization of the basic cache block 213 lends itself to incorporating the functionality of the dispatch unit 214. Thus, in some embodiments, the dispatch unit 214 is incorporated into the basic cache block 213, and the basic cache block 213 is connected directly to the issue queues 220.
[0020]
In embodiments of the present invention, a completion table 218 is used in conjunction with the dispatch/completion control logic 216 to track the state of issued instruction groups. Please refer to FIG. 7, which shows an example of the completion table 218. In the illustrated embodiment, the completion table 218 includes a set of entries 702a through 702n (collectively referred to herein as completion table entries 702). Each entry 702 in the completion table 218 in this example includes an instruction address (IA) field 704 and a status bit field 706. In this embodiment, the GTAG value of each instruction group 302 identifies the entry 702 in the completion table 218 in which the completion information for that instruction group 302 is stored. Thus, the instruction group 302 stored in entry 1 of the completion table 218 has a GTAG value of 1, and so on. In this example, the completion table 218 may further include a wrap-around bit indicating that an instruction group with a lower GTAG value is actually younger than an instruction group with a higher GTAG value. In one embodiment, the instruction address field 704 contains the address of the instruction in the first slot 304a of the corresponding instruction group 302. The status field 706 can include one or more status bits that indicate, for example, whether the corresponding entry 702 in the completion table 218 is available or has already been allocated to a pending instruction group.
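One common way to realize the age comparison that the wrap-around bit makes possible is sketched here. The text states only that such a bit may exist; the table size, the function names `allocate` and `is_older`, and the toggling scheme are assumptions for illustration. The comparison is valid only for groups that are in flight at the same time, that is, no more than one table length apart.

```python
TABLE_SIZE = 8   # assumed number of completion table entries 702


def allocate(dispatch_count):
    """Return (gtag, wrap) for the n-th dispatched group.

    gtag selects the completion table entry; wrap toggles each time
    the GTAG counter wraps around the end of the table.
    """
    return dispatch_count % TABLE_SIZE, (dispatch_count // TABLE_SIZE) % 2


def is_older(a, b):
    """True if group a = (gtag, wrap) was dispatched before group b."""
    (gtag_a, wrap_a), (gtag_b, wrap_b) = a, b
    if wrap_a == wrap_b:
        return gtag_a < gtag_b   # same pass through the table: lower GTAG is older
    return gtag_a > gtag_b       # different pass: the lower GTAG is the younger group
```

For example, the groups dispatched sixth and ninth receive `(6, 0)` and `(1, 1)`; although the second has the lower GTAG, the differing wrap bits mark it as the younger group.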
[0021]
In the embodiment of the processor 101 shown in FIG. 2, instructions are forwarded from the dispatch unit 214 to the issue queues 220, where they wait for execution in the corresponding execution pipes 222. The processor 101 can include a variety of execution pipes, each designed to execute a subset of the processor's instruction set. In one embodiment, the execution pipes 222 include a branch unit pipeline 224, a load/store pipeline 226, a fixed-point arithmetic unit 228, and a floating-point unit 230. Each execution pipe 222 may consist of two or more pipeline stages. Instructions stored in the issue queues 220 can be issued to the execution pipes 222 using any of a variety of issue priority algorithms. In some embodiments, for example, the oldest pending instruction in an issue queue 220 is the next instruction issued to the execution pipes 222. In this embodiment, the GTAG values assigned by the dispatch unit 214 are used to determine the relative age of the instructions pending in the issue queues 220. Prior to issue, the destination register operand of an instruction is assigned to an available rename GPR. When the instruction is finally forwarded from the issue queue 220 to the corresponding execution pipe, the execution pipe performs the operation indicated by the instruction's opcode and, when the instruction reaches the final stage 232 of the pipeline, writes the result to the instruction's rename GPR. A mapping is maintained between the rename GPRs and the corresponding architected registers. When all instructions in an instruction group (and all instructions in older instruction groups) have completed without an exception, the completion pointer in the completion table 218 is incremented to the next instruction group. When the completion pointer is incremented to a new instruction group, the rename registers associated with the instructions of the old instruction group are released, thereby committing the results of those instructions.
If one or more instructions older than a completed but not yet committed instruction raise an exception, the instruction that caused the exception and all younger instructions are flushed, and a rename recovery routine is invoked to return the GPR mapping to the last known valid state.
[0022]
If the predicted branch is not taken (a branch misprediction), the instructions pending in the execution pipes 222 and the issue queues 220 are flushed. In addition, the pointer 506 of the basic cache block entry 502 associated with the mispredicted branch is updated to reflect the most recently taken branch. An example of this update process is shown in FIG. 5 for the case in which program execution results in a branch from leg 1 (instruction group 302a) to leg 4 (instruction group 302d). Because the pointer 506 of entry 502a originally predicted a branch to the instruction group in entry number 2 of the basic cache block 213 (that is, group 302b), the actual branch from instruction group 302a to group 302d was mispredicted. The mispredicted branch is detected and reported back to the block fetch unit 215, the instructions pending between the basic cache block 213 and the final stage 232 of each pipeline 222 are flushed, and execution resumes from the instruction group 302d in entry 4 of the basic cache block 213. In addition, the pointer 506 of the basic cache block entry 502a is changed from its previous value of 2 to the new value of 4 to reflect the latest branch information. By incorporating the basic cache block 213 and the block fetch unit 215 close to the execution pipelines 222, the present invention reduces the performance penalty incurred on a branch misprediction. Specifically, because the basic cache block 213 is implemented on the "downstream" side of the instruction cracking unit 212, instructions pending in the cracking unit 212 are removed from the branch-misprediction flush path, so the number of pipeline stages that must be purged after a branch misprediction is reduced and the performance penalty is reduced with it.
The basic cache block 213 also provides a cache mechanism whose structure matches the organization of the dispatch/completion control logic 216 and the completion table 218, which simplifies the organization of the intervening logic and, as described below, facilitates the implementation of useful extensions to the basic cache block 213.
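The pointer update on a mispredicted branch can be sketched in a few lines. The dictionary of pointers, the `in_flight` list of dispatched entry IDs, and the function name `resolve_branch` are hypothetical and chosen only to mirror the leg 1 to leg 4 example; the sketch models the flush as discarding every group dispatched after the group containing the branch.

```python
def resolve_branch(pointers, in_flight, entry_id, actual_next):
    """Apply a resolved branch outcome for the group stored in `entry_id`.

    pointers:  dict mapping entry ID -> predicted next entry ID (pointer 506)
    in_flight: entry IDs of dispatched groups, oldest first
    Returns True if the branch was mispredicted.
    """
    if pointers[entry_id] == actual_next:
        return False                        # prediction was correct, nothing to do
    idx = in_flight.index(entry_id)
    del in_flight[idx + 1:]                 # flush groups dispatched after the branch
    pointers[entry_id] = actual_next        # record the latest branch outcome
    in_flight.append(actual_next)           # execution resumes from the actual target
    return True


# Entry 1 predicted entry 2, but the branch actually goes to entry 4.
pointers = {1: 2, 2: 3, 3: None, 4: None}
in_flight = [1, 2, 3]
mispredicted = resolve_branch(pointers, in_flight, 1, 4)
```

After the call, the pointer of entry 1 holds 4 rather than 2, and only groups 1 and 4 remain in flight. Nothing upstream of the basic cache block appears in the sketch, which reflects the point that instructions in the cracking unit are outside the flush path.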
[0023]
In a further embodiment, the basic cache block 213 includes instruction history information that advantageously improves processor performance by recording information that can be used during subsequent executions of the same instruction group to avoid scenarios likely to lead to performance-limiting events such as exceptions, flushes, and interrupts (referred to herein as exception events). In the embodiment of the basic cache block 213 shown in FIG. 8, the instruction history information is stored in the instruction history field 508 of each entry 502. As an example of the type of information stored in the instruction history field 508, consider an instruction group containing a particular load instruction that caused a store-forward exception the last time it was executed. A store-forward exception occurs when, on an out-of-order machine, a load instruction that follows (in program order) a store instruction with a common memory reference is executed before the store instruction. If the load instruction is executed before the store instruction, the load retrieves an invalid value, and an instruction flush results. Because the structure of the basic cache block 213 parallels that of the completion control logic 216, it is straightforward to transfer information that the dispatch/completion control logic 216 learns about how instructions executed and completed to the corresponding entry of the basic cache block 213. Without this parallelism, completion information from the dispatch/completion control logic 216 would typically have to pass through some form of intermediate hash table or other suitable mechanism to associate instruction group information with the component instructions.
In the store-forward example, after detecting a store-forward exception, the dispatch/completion control logic 216 writes one or more bits indicating the store-forward exception into the instruction history field 508 of the corresponding entry in the basic cache block 213. When the instruction group is executed again later, the instruction history information indicating that a store-forward exception occurred previously can be used, for example, to place the processor 101 in an in-order mode that prevents the load from being executed before the store completes. This embodiment of the present invention thus records instruction history information indicating an exception event associated with an instruction group and subsequently alters the execution of that instruction group to prevent the exception event from recurring when the group is executed later. Although illustrated with the store-forward example, the instruction history information field 508 is suitable for recording information on a variety of instruction history events that allow the processor to avoid recurring exception conditions, such as information on the accuracy of prediction mechanisms, predicted operand values, and cache hit/miss information.
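The record-then-avoid behavior of the instruction history field 508 can be sketched as a bit field per entry. The bit position `STORE_FORWARD`, the class name, and the two mode strings are assumptions for illustration; the text says only that one or more bits are written and that execution is later altered.

```python
STORE_FORWARD = 0x1   # assumed bit position within the instruction history field 508


class HistoryFields:
    """Per-entry instruction history bits, keyed by basic cache block entry ID."""

    def __init__(self):
        self.bits = {}

    def record(self, entry_id, event_bit):
        """Called when the completion logic detects an exception event."""
        self.bits[entry_id] = self.bits.get(entry_id, 0) | event_bit

    def execution_mode(self, entry_id):
        """Choose how to run the group the next time it is dispatched."""
        if self.bits.get(entry_id, 0) & STORE_FORWARD:
            return "in-order"        # keep the load behind the store
        return "out-of-order"


history = HistoryFields()
first_run = history.execution_mode(3)    # no history yet for entry 3
history.record(3, STORE_FORWARD)         # store-forward exception detected
second_run = history.execution_mode(3)   # later execution avoids the exception
```

Because the history is keyed by the same entry ID that identifies the group, no intermediate lookup table is needed in the sketch, which mirrors the parallelism argument made above.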
[0024]
One example of information recorded in the instruction history field 508 of the basic cache block 213 is highlighted by the embodiment shown in FIG. 9. In this embodiment, the issue queue 220 is divided into a primary issue queue 902 and a secondary issue queue 904. The optimal size or depth of the issue queue 220 represents a balance between competing considerations. On the one hand, it is desirable to implement a very large and deep issue queue in order to take full advantage of the processor's ability to execute instructions out of order. The ability to issue instructions out of order is limited by the number of instructions held in the issue queue 220: the more entries the issue queue has, the more instructions are candidates for out-of-order processing. On the other hand, as the issue queue becomes deeper, the processor's ability to determine the next instruction to issue within its cycle time constraints decreases. In other words, the more instructions are held in the issue queue 220, the longer it takes to determine the next instruction to issue. For this reason, issue queues such as the issue queue 220 are often limited to a depth of about 20 or less. Embodiments of the present invention seek to realize the benefits of a deep issue queue without requiring excessive logic to search the issue queue for the next issuable instruction. The present invention takes advantage of the fact that many instructions pending in the issue queue 220 cannot be issued immediately, either because they have already been issued and are pending in the execution pipelines 222 of the processor 101 or because they are waiting for the completion of other instructions on which they depend for operand values.
[0025]
Please refer to FIG. 9. The issue queue 220 according to this embodiment of the present invention includes a primary issue queue 902 and a secondary issue queue 904. The primary issue queue 902 stores instructions that can be issued immediately. In one embodiment, instructions dispatched from the dispatch unit 214 are initially stored in available entries in the primary issue queue 902. If it is later determined that an instruction depends on another instruction, the dependent instruction is moved to the secondary issue queue 904 until the instruction on which it depends has retrieved the necessary information. For example, if an add instruction following a load instruction requires the result of the load, both instructions can initially be dispatched to the primary issue queue 902. Once it is determined that the add instruction depends on the load instruction, the add instruction is moved from the primary issue queue 902 to the secondary issue queue 904. In the embodiment using the instruction history field 508 described with reference to FIG. 8, the dependency of the add instruction can be recorded so that the add instruction is stored directly in the secondary issue queue 904 when the instructions are executed later. The secondary issue queue 904 can also be used to store recently issued instructions that are still pending in the processor's execution pipelines. In this embodiment, an instruction is issued from the primary issue queue 902 and then moved to the secondary issue queue 904. In some embodiments, an instruction remains in the secondary issue queue 904 until it is determined that the instruction has not been rejected. One way to determine that an instruction has not been rejected is to implement a timer/counter (not shown) associated with each entry in the secondary issue queue 904.
When an instruction is first moved from the primary issue queue 902 to the secondary issue queue 904, the counter/timer is initialized. In one embodiment, the counter/timer counts the number of clock cycles that have elapsed since it was initialized. If the counter/timer continues to count for a predetermined number of cycles without the instruction being detected as rejected, the instruction is deemed to have completed successfully and its entry in the secondary issue queue 904 is deallocated. By using an issue queue that includes a primary issue queue dedicated to currently issuable instructions together with a secondary issue queue for instructions that have been dispatched but are not currently eligible for issue, either because of an instruction dependency or because the instruction was recently issued from the primary issue queue, the effective size or depth of the issue queue is increased without significantly increasing the time (that is, the number of logic levels) required to determine the next instruction to issue.
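The interplay between the two queues and the per-entry counter/timer can be sketched as follows. The class name, the `reject_window` parameter, and the choice to return a rejected instruction to the head of the primary queue are assumptions; the text states only that an instruction stays in the secondary queue until it is known not to have been rejected.

```python
class SplitIssueQueue:
    """Sketch of a primary queue 902 plus a secondary queue 904 with counters."""

    def __init__(self, reject_window=3):
        self.primary = []             # instructions eligible to issue now, oldest first
        self.secondary = {}           # issued instruction -> cycles since issue
        self.reject_window = reject_window

    def dispatch(self, instr):
        self.primary.append(instr)

    def issue(self):
        """Issue the oldest primary instruction; only the primary queue is searched."""
        if not self.primary:
            return None
        instr = self.primary.pop(0)
        self.secondary[instr] = 0     # counter/timer initialized on the move
        return instr

    def reject(self, instr):
        """Assumed policy: a rejected instruction returns to the primary queue."""
        if instr in self.secondary:
            del self.secondary[instr]
            self.primary.insert(0, instr)

    def tick(self):
        """Advance one clock cycle and deallocate entries that were not rejected."""
        for instr in list(self.secondary):
            self.secondary[instr] += 1
            if self.secondary[instr] >= self.reject_window:
                del self.secondary[instr]


q = SplitIssueQueue(reject_window=2)
q.dispatch("load r1")
q.dispatch("add r2")
issued = q.issue()        # "load r1" moves to the secondary queue
q.tick()                  # one cycle elapses, entry still held
q.reject("load r1")       # rejection detected inside the window
```

The selection logic in `issue` looks only at the primary queue, so the cost of choosing the next instruction does not grow with the number of entries held in the secondary queue.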
[0026]
As will be apparent to those skilled in the art having the benefit of this disclosure, the present invention contemplates various embodiments of a microprocessor that includes a cache mechanism suitable for storing grouped instructions (that is, instructions converted from the first format to the second format) in order to reduce the latency associated with mispredicted branches. The forms of the invention described in detail with reference to the drawings are merely presently preferred examples. The claims should be construed broadly to encompass all variations of the preferred embodiments disclosed herein.
[0027]
In summary, the following matters are disclosed regarding the configuration of the present invention.
[0028]
(1) A method of processing instructions in a microprocessor, comprising:
converting a first set of received instructions into instruction groups;
storing the instruction groups in a basic cache block organized such that each instruction group is stored in a corresponding entry of the cache mechanism;
issuing instructions of the instruction groups for execution; and
flushing, in response to an exception generated during execution of an instruction of the instruction groups, only the instructions pending between the basic cache block and the final stage.
(2) The method according to (1), wherein the generated exception includes a branch misprediction exception.
(3) The method according to (1), wherein the received instructions are formatted according to a first instruction format, and instructions of the instruction group are formatted according to a second instruction format.
(4) The method according to (3), wherein the second instruction format is wider than the first instruction format.
(5) The method according to (4), wherein a pointer is assigned to each entry of the cache mechanism, and the pointer indicates an instruction to be executed next.
(6) The method according to (5), including the step of updating a cache entry pointer corresponding to the mispredicted branch in response to detection of a mispredicted branch during execution of one of the instruction groups.
(7) A microprocessor comprising:
an instruction cracking unit configured to receive a first set of microprocessor instructions and to organize the instruction set into instruction groups;
a basic cache block mechanism configured to cache the instruction groups generated by the cracking unit; and
an execution unit suitable for executing instructions of the instruction groups,
wherein an exception that is generated during execution of an instruction of the instruction groups and causes a flush results in flushing only those instructions dispatched from the basic cache block.
(8) The processor according to (7), including a dispatch unit configured to retrieve an instruction from the instruction group of the basic cache block and transfer the instruction to an issue queue.
(9) The processor according to (7), wherein the received instructions are formatted according to a first instruction format, a second instruction set is formatted according to a second instruction format, and the second instruction format is wider than the first instruction format.
(10) The processor according to (7), wherein the basic cache block is configured to store each instruction group in a corresponding entry of the basic cache block.
(11) The processor according to (10), wherein the basic cache block includes an entry field indicating the corresponding basic cache block entry.
(12) The processor according to (11), wherein each entry of the basic cache block includes a pointer indicating an instruction group to be executed next.
(13) The processor according to (12), wherein the processor is configured to update a cache entry pointer in response to a mispredicted branch.
(14) A data processing system including a processor, a memory, an input means, and a display, the processor comprising:
an instruction cracking unit configured to receive a first set of microprocessor instructions and to organize the instruction set into instruction groups;
a basic cache block mechanism configured to cache the instruction groups generated by the cracking unit; and
an execution unit suitable for executing instructions of the instruction groups,
wherein an exception that is generated during execution of an instruction of the instruction groups and causes a flush results in flushing only those instructions dispatched from the basic cache block.
(15) The data processing system according to (14), further including a dispatch unit configured to retrieve an instruction from an instruction group of the basic cache block and transfer the instruction to an issue queue.
(16) The data processing system according to (14), wherein the received instructions are formatted according to a first instruction format, a second instruction set is formatted according to a second instruction format, and the second instruction format is wider than the first instruction format.
(17) The data processing system according to (14), wherein the basic cache block is configured to store each instruction group in a corresponding entry of the basic cache block.
(18) The data processing system according to (17), wherein the basic cache block includes an entry field indicating the corresponding basic cache block entry.
(19) The data processing system according to (18), wherein each entry of the basic cache block includes a pointer indicating an instruction group to be executed next.
(20) The data processing system according to (19), wherein the processor is configured to update a pointer of each entry in response to a mispredicted branch.
[Brief description of the drawings]
FIG. 1 illustrates certain components of a data processing system that includes a microprocessor in accordance with an embodiment of the present invention.
FIG. 2 illustrates certain components of a microprocessor in accordance with an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of an instruction cracking function performed by an embodiment of the processor of FIG. 2.
FIG. 4 illustrates certain components of a microprocessor.
FIG. 5 is a diagram showing a basic cache block of the microprocessor of FIG. 2;
FIG. 6 is a diagram illustrating various branches expected for the processor of FIG. 2;
FIG. 7 is a diagram showing a completion table suitable for the present invention.
FIG. 8 is a diagram showing a basic cache block including instruction history information.
FIG. 9 is a diagram illustrating an issue queue including a primary issue queue and a secondary issue queue according to an embodiment of the present invention.
[Explanation of symbols]
100 Data processing system
101 Central processing unit (processor)
102 ROM (read only memory)
103 Hard disk
104 Mass storage device
105 Tape storage device
106 Network adapter
107 I/O adapter
108 User interface adapter
109 Keyboard
110 Mouse
111 Speaker
112 Display adapter
113 System bus
136 Display monitor
202 Instruction fetch unit
210 Instruction cache
212 Cracking unit
213 Basic cache block
214 Dispatch unit
215 Block fetch unit
216 Dispatch/completion control logic
218 Completion table
220 Issue queue
222 Execution pipe
224 Branch unit pipeline
226 Load/store pipeline
228 Fixed-point arithmetic unit
230 Floating point unit
232 Final stage
250 System memory
302 Instruction group
304 Instruction slot
401 Layered architecture microprocessor
402 Fetch unit
406 Branch prediction logic
410 Instruction cache
412 ISA converter
422 Execution pipeline
432 Final stage
502 Basic cache block entry
504 Entry ID
506 Pointer
507 Instruction address (IA) field
508 Instruction history field
600 Branch tree
702 Completion table entry
704 Instruction address (IA) field
706 Status bit field
902 Primary issue queue
904 Secondary issue queue

Claims

A method of processing instructions in a microprocessor including an instruction cache, an instruction cracking unit, a basic cache block, a dispatch unit, an execution unit, an execution queue, and dispatch/completion logic, the method comprising:
the instruction cache receiving a plurality of instructions including a store instruction and a load instruction;
the instruction cracking unit forming an instruction group from the plurality of instructions received by the instruction cache and from instruction history information that stores information on the order in which these instructions are processed, and transferring the instruction group to the basic cache block;
the basic cache block storing the transferred instruction group and transferring it to the dispatch unit;
the dispatch unit determining, from the instruction history information, the instruction group to be executed next from among a plurality of instruction groups transferred from the basic cache block, and transferring each instruction of the instruction group determined to be executed next to the execution unit through the execution queue;
the execution unit executing each instruction of the instruction group;
when the instruction executed by the execution unit is a load instruction corresponding to a store instruction that has not yet been executed, the dispatch/completion logic recording in the instruction history information that the instruction group including the load instruction has caused a store-forward exception; and
the dispatch unit transferring, based on the recorded instruction history information, each instruction of the instruction group including the store instruction to the execution unit before each instruction of the instruction group including the load instruction, and the execution unit executing the load instruction again.
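The store-forward steps of this claim can be sketched as a small Python simulation. It is an illustration under stated assumptions, not the claimed implementation: a lower group ID is assumed to mean earlier program order, the dispatch unit's default order is assumed to place the load group first, and the names `Group`, `must_follow`, `dispatch_order`, and `execute` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Group:
    gid: int                            # assumption: lower gid = earlier in program order
    instrs: List[Tuple[str, int]]       # (opcode, memory address), toy encoding
    must_follow: Optional[int] = None   # history: store group to dispatch first


def dispatch_order(default_order: List[int], groups: Dict[int, Group]) -> List[int]:
    """Return group IDs in dispatch order, honouring recorded history."""
    order = list(default_order)
    for gid in default_order:
        dep = groups[gid].must_follow
        if dep is not None and order.index(dep) > order.index(gid):
            # Move the store group directly in front of the load group.
            order.remove(dep)
            order.insert(order.index(gid), dep)
    return order


def execute(order: List[int], groups: Dict[int, Group]) -> bool:
    """Run groups in order; record a store-forward exception if one occurs."""
    done = set()
    for gid in order:
        for op, addr in groups[gid].instrs:
            if op != "load":
                continue
            for other in groups.values():
                older_pending = other.gid < gid and other.gid not in done
                if older_pending and ("store", addr) in other.instrs:
                    groups[gid].must_follow = other.gid  # history update
                    return True                          # group must run again
        done.add(gid)
    return False


groups = {1: Group(1, [("store", 0x40)]), 2: Group(2, [("load", 0x40)])}
default = [2, 1]                              # load group happens to go first

first = dispatch_order(default, groups)       # no history yet
faulted = execute(first, groups)              # load sees an unexecuted store
second = dispatch_order(default, groups)      # history now reorders the groups
faulted_again = execute(second, groups)
```

On the first pass the load group raises the exception and the history is updated; on the second pass the store group is dispatched first, so the load executes without the exception recurring.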

The method of claim 1, further comprising:
after the step of the dispatch unit transferring each instruction of the instruction group to the execution unit, recording in a completion table of the dispatch/completion logic that each instruction of the instruction group has been transferred to the execution unit.
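A minimal Python sketch of such a completion table follows. It assumes one entry per instruction group, with an instruction address field and one status bit per instruction slot, loosely following the completion table entry (702), instruction address field (704), and status bit field (706) in the symbol list; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CompletionEntry:
    instruction_address: int   # assumed to be the address of the group's first instruction
    dispatched: List[bool]     # one status bit per instruction slot


class CompletionTable:
    def __init__(self) -> None:
        self.entries: Dict[int, CompletionEntry] = {}

    def allocate(self, group_id: int, instruction_address: int, slots: int) -> None:
        """Create an entry with all status bits cleared."""
        self.entries[group_id] = CompletionEntry(instruction_address, [False] * slots)

    def record_transfer(self, group_id: int, slot: int) -> None:
        """Record that one instruction of the group was sent to the execution unit."""
        self.entries[group_id].dispatched[slot] = True

    def group_transferred(self, group_id: int) -> bool:
        """True once every instruction of the group has been transferred."""
        return all(self.entries[group_id].dispatched)


table = CompletionTable()
table.allocate(group_id=7, instruction_address=0x1000, slots=3)
table.record_transfer(7, 0)
partial = table.group_transferred(7)    # False: two slots still pending
for slot in (1, 2):
    table.record_transfer(7, slot)
complete = table.group_transferred(7)   # True: all slots recorded
```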

The method of claim 1, wherein the step of forming the instruction group includes assigning an ID to the instruction group and forming the instruction group from the ID, the plurality of instructions, a pointer indicating the ID of the instruction group to be executed next, and the instruction history information.

The method of claim 3, wherein
the step of the instruction cracking unit forming the instruction group includes, when the received instruction is a branch instruction, recording in the pointer the ID of the instruction group to be speculatively executed following the branch instruction as the ID of the instruction group to be executed next, and forming the instruction group; and
the step of the dispatch unit determining the instruction group to be executed next and transferring each instruction of the determined instruction group to the execution unit includes the dispatch unit determining the instruction group to be executed next based on the pointer and the instruction history information, and transferring each instruction of the instruction group determined to be executed next to the execution unit.

The method of claim 4, further comprising:
when the branch instruction is executed and the speculatively predicted and executed instruction proves to be a misprediction, the basic cache block changing the ID of the mispredicted instruction group.

The method of claim 1, further comprising:
after the instruction cache receives the plurality of instructions, converting each of the received instructions to a second instruction format in which the bit width of the instruction data is widened.

The method of claim 6, wherein
the step of converting to the second instruction format includes converting to a second instruction format that includes explicit operand reference information.
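The conversion to a wider format with explicit operand references can be illustrated with a toy encoding in Python. Both formats here are invented for illustration and are not taken from the specification: a hypothetical 16-bit first format in which a compare instruction implicitly writes a condition register, and a hypothetical 32-bit second format that names every operand, including that condition register, in its own field.

```python
CMP = 0x1    # hypothetical opcode whose destination is implicit in the first format
CR0 = 0x80   # hypothetical register number assigned to the condition register


def widen(narrow: int) -> int:
    """Convert a 16-bit first-format word to a 32-bit second-format word.

    First format (assumed):  [15:12] opcode, [11:8] ra, [7:4] rb, [3:0] unused.
    Second format (assumed): [31:24] opcode, [23:16] dest, [15:8] src a, [7:0] src b.
    """
    opcode = (narrow >> 12) & 0xF
    ra = (narrow >> 8) & 0xF
    rb = (narrow >> 4) & 0xF
    # The implicit destination of CMP becomes an explicit field; other
    # opcodes are treated as "ra = ra op rb" in this toy encoding.
    dest = CR0 if opcode == CMP else ra
    return (opcode << 24) | (dest << 16) | (ra << 8) | rb


wide = widen(0x1230)   # CMP r2, r3 in the assumed first format
```

With these assumed layouts, `widen(0x1230)` yields `0x01800203`: the opcode, the now-explicit destination `CR0`, and the two source registers each occupy their own byte.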

A microprocessor comprising:
an instruction cache that receives a plurality of instructions including a store instruction and a load instruction;
an instruction cracking unit that forms an instruction group from the plurality of instructions received by the instruction cache and from instruction history information that stores information on the order in which these instructions are processed, and transfers the instruction group;
a basic cache block that stores instruction groups transferred from the instruction cracking unit and transfers them to a dispatch unit;
the dispatch unit, which determines, from the instruction history information, the instruction group to be executed next from among the instruction groups transferred from the basic cache block, and transfers each instruction of the instruction group determined to be executed next;
an execution unit that receives each instruction of the instruction group from the dispatch unit and executes each received instruction; and
dispatch/completion logic that, when the instruction executed by the execution unit is a load instruction corresponding to a store instruction that has not yet been executed, records in the instruction history information that the instruction group including the load instruction has caused a store-forward exception, since that instruction group needs to be executed again.

The microprocessor of claim 8, wherein
the dispatch/completion logic, after the dispatch unit has transferred each instruction of the instruction group to the execution unit, records in a completion table that each instruction of the instruction group has been transferred to the execution unit.

The microprocessor of claim 8, wherein
the instruction cracking unit, when forming the instruction group, forms an instruction group that includes, in addition to the plurality of instructions and the instruction history information, an ID of the instruction group and a pointer indicating the ID of the instruction group to be executed next.

The microprocessor of claim 10, wherein
when the instruction received by the instruction cache is a branch instruction, the instruction cracking unit records in the pointer the ID of the instruction group to be speculatively executed following the branch instruction as the ID of the instruction group to be executed next, and forms the instruction group with the recorded pointer; and
the dispatch unit determines the instruction group to be executed next based on the pointer and the instruction history information, and transfers each instruction of the instruction group determined to be executed next.

The microprocessor of claim 11, wherein
the basic cache block changes the ID of the mispredicted instruction group when the execution unit executes the branch instruction and the speculatively predicted and executed instruction proves to be a misprediction.

The microprocessor of claim 8, wherein
the microprocessor, after receiving the plurality of instructions, further converts each of the received instructions to a second instruction format in which the bit width of the instruction data is widened.

The microprocessor of claim 13, wherein
the microprocessor, when converting to the second instruction format, converts to a second instruction format that includes explicit operand reference information.

A data processing system comprising a microprocessor and a memory, the microprocessor including:
an instruction cache that receives a plurality of instructions including a store instruction and a load instruction;
an instruction cracking unit that forms an instruction group from the plurality of instructions received by the instruction cache and from instruction history information that stores information on the order in which these instructions are processed, and transfers the instruction group;
a basic cache block that stores instruction groups transferred from the instruction cracking unit and transfers them to a dispatch unit;
the dispatch unit, which determines, from the instruction history information, the instruction group to be executed next from among the instruction groups transferred from the basic cache block, and transfers each instruction of the instruction group determined to be executed next;
an execution unit that receives each instruction of the instruction group from the dispatch unit and executes each received instruction; and
dispatch/completion logic that, when the instruction executed by the execution unit is a load instruction corresponding to a store instruction that has not yet been executed, records in the instruction history information that the instruction group including the load instruction has caused a store-forward exception, since that instruction group needs to be executed again.

The data processing system of claim 15, wherein
the dispatch/completion logic, after the dispatch unit has transferred each instruction of the instruction group to the execution unit, records in a completion table that each instruction of the instruction group has been transferred to the execution unit.

The data processing system of claim 15, wherein
the instruction cracking unit, when forming the instruction group, forms an instruction group that includes, in addition to the plurality of instructions and the instruction history information, an ID of the instruction group and a pointer indicating the ID of the instruction group to be executed next.

The data processing system of claim 17, wherein
when the instruction received by the instruction cache is a branch instruction, the instruction cracking unit records in the pointer the ID of the instruction group to be speculatively executed following the branch instruction as the ID of the instruction group to be executed next, and forms the instruction group with the recorded pointer; and
the dispatch unit determines the instruction group to be executed next based on the pointer and the instruction history information, and transfers each instruction of the instruction group determined to be executed next.

The data processing system of claim 18, wherein
the basic cache block changes the ID of the mispredicted instruction group when the execution unit executes the branch instruction and the speculatively predicted and executed instruction proves to be a misprediction.

The data processing system of claim 15, wherein
the instruction cache, after receiving the plurality of instructions, further converts each received instruction to a second instruction format in which the bit width of the instruction data is widened.