JP4030314B2

JP4030314B2 - Arithmetic processing unit

Info

Publication number: JP4030314B2
Application number: JP2002020244A
Authority: JP
Inventors: 周史山村; 耕一久門; 充佐藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-01-29
Filing date: 2002-01-29
Publication date: 2008-01-09
Anticipated expiration: 2022-01-29
Also published as: JP2003223359A

Description

【０００１】
【発明の属する技術分野】
本発明は、キャッシュメモリを備えるとともにメモリアクセスレイテンシ(memory access latency)を隠蔽するためにプリフェッチ処理を行う演算処理装置に関し、より詳細には、実行時情報に基づいてプリフェッチ処理を行うことでその効果の改善を図った演算処理装置に関する。
【０００２】
【従来の技術】
現在の計算機システムにおいて、メモリアクセスレイテンシの増大は、システム全体の性能向上を妨げる大きな原因となっている。一般の計算機システムは、メモリアクセスの時間的・空間的な局所性に着目し、プロセッサ（レジスタ）とメインメモリとの間に高速にアクセスすることができるキャッシュメモリを装備しており、これによってメモリアクセス性能の向上を図っている。
【０００３】
しかし、メモリアクセスが発生した際、キャッシュメモリにデータが存在しない場合すなわちキャッシュミスが発生した場合には、メインメモリからデータを転送する必要がある。このとき、データ転送に多くの時間を要する（キャッシュミスペナルティ）。現在の演算処理装置においては、キャッシュメモリへのアクセスは１〜十数サイクル程度であるが、キャッシュミスペナルティは、数十サイクルに達する。したがって、プログラムの実行時間はキャッシュミスペナルティによって律速される部分が大きい。プロセッサの動作速度とメモリアクセス速度とのギャップは年々大きくなっており、上記の問題は、将来、より深刻なダメージを計算機システムに対して与えることになる。
【０００４】
この問題に対して、従来技術においては、データプリフェッチ（データ先読み、以下単にプリフェッチと呼ぶ）と呼ばれる方式が広く提案・利用されている。データプリフェッチとは、将来アクセスされるであろうと予想されるデータを、前もってメモリの上位階層（キャッシュメモリ）に転送しておくという処理である。適切にデータプリフェッチを行うことで、メインメモリへのアクセス遅延時間が隠蔽され、見かけ上キャッシュにデータが常に存在しているかのようにプログラムの実行が行われる。
【０００５】
現在まで数多くのプリフェッチ方式が提案されているが、大きく分けて以下の２つの方式に分けられる。
【０００６】
(1) ソフトウェアプリフェッチ。一般に、コンパイラなどが静的にプログラムの挙動解析を行い、プリフェッチ命令をプログラム中に挿入する。プログラムを実行したプロファイル情報を元に、コンパイラなどがプリフェッチ命令を挿入する場合もある。
【０００７】
(2) ハードウェアプリフェッチ。プロセッサの内部あるいは外部に、プリフェッチ専用のハードウェア機構を備える。一般に、動的に（プログラム実行中に）、この専用ハードウェアがキャッシュミスの発生を認識し、予め定められたアルゴリズムに基づいてプリフェッチ処理を行う。既存のハードウェアプリフェッチ方式は、キャッシュミスが数回発生したことを認識し、これをトリガとしてプリフェッチを開始する。
【０００８】
上記(1)の例として、特開平１１−２１２８０２号公報では、原始プログラムを中間テキストに変換し、変換した中間テキストに対して最適化を行い、データプリフェッチを行うアドレスを算出してプリフェッチ命令を挿入するというコンパイル技術が提案されている。
【０００９】
また、上記(2)については、例えば、あるキャッシュブロックがアクセスされた場合に、次のキャッシュブロックに対して近い将来アクセスされるであろうと予測し、そのアドレスに対してプリフェッチを行うというものや、その他にも、ある一定の値を加算して次回アクセスするアドレスを決定するストライド予測によるプリフェッチなど、数多くの方式が提案されている。また、実際に、商用で利用されているものとして、Pentium （登録商標）４においては、実際にアクセスされているアドレスから常に２５６バイト先のデータを先読みするといった機構が備わっている。
【００１０】
【発明が解決しようとする課題】
一般に、プリフェッチ処理を行う場合には、以下のような問題点に留意する必要がある。
【００１１】
(1) キャッシュ汚染が発生すること。プリフェッチを行うことで、キャッシュメモリに格納されていたデータが追い出されてしまう。プリフェッチ処理を行わなければ、追い出されずに本来キャッシュにヒットしたであろうデータが存在した場合には、プリフェッチによって性能低下を招くことがある。
【００１２】
(2) プリフェッチしたデータが、後続のアクセスによって使用前に追い出されること。プリフェッチしたデータに実際にアクセスするときよりも長時間前にプリフェッチ処理を行ってしまうと、実際のアクセスの前に、先行するメモリアクセスによってプリフェッチしたデータが追い出されてしまうことがある。
【００１３】
(3) 過剰なプリフェッチ処理によってバス帯域幅を浪費してしまうこと。必要以上にプリフェッチ処理を行うことで、メモリバス帯域幅を浪費してしまい、逆に性能低下を招く。
【００１４】
上記(1)及び(2)については、プリフェッチ挿入位置とプリフェッチされたデータを実際に使用する命令との距離（プリフェッチ距離）を適切に決定する必要がある。また、上記(3)については、キャッシュミスが発生するメモリアクセス全てについてプリフェッチを行うのではなく、プリフェッチを行えば効果的であると判断できる場合のみ、プリフェッチ命令を挿入しなければならない。
【００１５】
これに対して、従来のプリフェッチ方式では、実行時の状況によりプログラムの挙動が変化する場合や、あるいは、システム構成の変更（プロセッサあるいはメモリシステムの構成の変更）が行われる場合があるため、プリフェッチ距離を適切に設定することは困難である。
【００１６】
まず、ソフトウェアプリフェッチでは、上述のような環境の変更が発生した場合、原始プログラムを再コンパイルするなどの作業が必要になる。このようにシステム毎に実行ファイルを用意することは現実的でない。また、プログラムへの入力によって挙動が変わるようなプログラムの場合には対処できない。
【００１７】
一方、ハードウェアプリフェッチにおいても、単純にロードミスが発生した場合に即座にプリフェッチを行うという方法では、効果的なプリフェッチ距離を保つことができない。また、ハードウェアプリフェッチを行うかどうかの決定方法に関する複雑な処理については、その実装が困難である。実際、従来のハードウェアプリフェッチでは、キャッシュミスが数回発生した段階でプリフェッチを実行したり、アクセスされたブロックの次のブロックに対してプリフェッチし続けるなど一般に単純である。
【００１８】
本発明は、上述した問題点に鑑みてなされたものであり、その目的は、キャッシュメモリを備えた演算処理装置であって、実行時情報に基づいてプリフェッチ命令を命令ストリーム中に挿入することにより、従来のプリフェッチ方式に伴うキャッシュ汚染等の問題を最小限に抑制しつつ効率的にプリフェッチを実行し得るものを提供することにある。
【００１９】
【課題を解決するための手段】
上記目的を達成するために、本発明の第１の面によれば、予めメインメモリからキャッシュメモリへデータを転送するように指示するプリフェッチ命令を動的に命令列中に挿入して実行する演算処理装置であって、キャッシュミスを起こす命令のうちプリフェッチ処理の対象とすべき命令を選択するプリフェッチ対象選択手段と、前記プリフェッチ対象選択手段によってプリフェッチ処理の対象とされた命令の実行時におけるメモリアクセスアドレスを予測するアドレス予測手段と、前記プリフェッチ対象選択手段によってプリフェッチ処理の対象とされた命令に対応するプリフェッチ命令の命令列中への挿入位置を決定するプリフェッチ命令挿入位置決定手段と、前記アドレス予測手段によって予測されたメモリアクセスアドレスをオペランドに有するプリフェッチ命令を、前記プリフェッチ命令挿入位置決定手段によって決定された挿入位置に、挿入するプリフェッチ命令挿入手段と、を具備する演算処理装置が提供される。
【００２０】
また、本発明の第２の面によれば、好ましくは、前記プリフェッチ対象選択手段は、命令アドレスごとにキャッシュミスが発生する回数を計測するカウンタを備え、該カウンタの値が一定値を超えた場合に、キャッシュミス情報を処理するハンドラを起動して、該ハンドラによってプリフェッチ処理の対象とすべき命令を選択する。
【００２１】
また、本発明の第３の面によれば、好ましくは、前記プリフェッチ命令挿入位置決定手段は、過去に実行が完了した命令列について命令アドレス及び実行完了時刻を保持するテーブルを備え、該テーブルの情報に基づいてプリフェッチ命令の命令列中への挿入位置を決定する。
【００２２】
また、本発明の第４の面によれば、前記プリフェッチ対象選択手段、前記プリフェッチ命令挿入位置決定手段、前記アドレス予測手段及び前記プリフェッチ命令挿入手段をキャッシュメモリ内に備えるようにしてもよい。
【００２３】
また、本発明の第５の面によれば、複数の実行命令流が制御可能な演算処理装置の場合、前記キャッシュミス情報を処理するハンドラが独立した命令流として主命令流とは独立に実行される。
【００２４】
【発明の実施の形態】
以下、添付図面を参照して本発明の実施形態について説明する。
【００２５】
図１は、本発明によりプリフェッチ命令の挿入処理を行う演算処理装置の動作の概要を説明するための図である。この図１は、本発明で関係のある部分のみについて示している。命令フェッチ部および命令デコーダ部をまとめ、命令の実行を制御する部分とし、この部分を制御部と呼ぶことにする。本発明の対象となる演算処理装置では、制御部から次に実行すべき命令を指示する命令アドレスが指定され、命令列が格納されているメモリ（またはキャッシュメモリ）から、命令が読み出され、制御部に入力される。その後、命令は、単一あるいは複数の演算部において実行される。メモリから読み出される命令の並びのことを、本発明では「命令ストリーム（命令流）」と呼ぶ。
【００２６】
図１の制御部は、指定された命令アドレスにプリフェッチ命令を挿入する。実行時、プログラムそれ自体はメインメモリに格納されており、これを直接修正することは行われない。指定された命令アドレスにプリフェッチ命令を挿入して実行する方法の一例においては、命令フェッチユニット（図１では、制御部内に含まれる）が、フェッチする命令アドレスと、挿入する命令アドレスとを比較し、それらが一致すれば、メインメモリから取得した命令ストリーム中にプリフェッチ命令を割り込ませ、その後、そのプリフェッチ命令を演算部が実行する。
【００２７】
本発明による演算処理装置では、キャッシュミスが発生する命令アドレス（ロード／ストア命令、メモリアクセスを伴う演算命令など）が検出される。検出された命令アドレスに対して、実際にデータプリフェッチを行うかどうかを決定するために、様々なアルゴリズムを適用することが考えられる。そして、プリフェッチを行う場合には、システムにおけるキャッシュミスペナルティを考慮して、これに応じたプリフェッチ命令挿入位置（命令アドレス）が指定される。さらに、プリフェッチ対象とするデータアドレスについては、現在までにすでに提案されている様々なデータアドレス予測方式が適用され得る。かくして、プログラムに対して再コンパイルを行ったりせず、また、実行時環境に依存せず、動的で効果的なプリフェッチを行うことができる。
【００２８】
図２は、本発明による演算処理装置の一構成例を示す図である。図２に示される実施形態では、メインメモリ１０に対して、命令キャッシュ１２及びデータキャッシュ１４が設けられている。命令フェッチユニット２０は、プログラムカウンタ（ＰＣ）２４の値（フェッチされるべき命令が格納されているアドレス）に基づいて命令キャッシュ１２から命令をフェッチしデコーダ３０に送出する。デコーダ３０にてデコードされた命令は、分岐命令処理ユニット３２、算術命令処理ユニット３４又はロード／ストア命令処理ユニット３６において処理される。
【００２９】
本実施形態における命令フェッチユニット２０の内部には、プリフェッチ命令の挿入アドレスを指定する複数のレジスタ（プリフェッチ命令挿入アドレスレジスタ）２２ａ〜２２ｄが装備されている。命令フェッチユニット２０は、命令フェッチ時にプログラムカウンタ（ＰＣ）２４の値とプリフェッチ命令挿入アドレスレジスタ２２ａ〜２２ｄの値とを、比較器（コンパレータ）２６ａ〜２６ｄにおいて並列に比較する。そして、もし一致していれば、デコーダ３０に命令を投入するための命令キュー２８の所定位置に、プリフェッチ命令が割り込み挿入される。
【００３０】
データキャッシュミスを起こす命令のうちプリフェッチ処理の対象とすべき命令を選択するに際しては、キャッシュミスが発生した命令アドレスを記録し、統計的に処理することで、プリフェッチ処理により充分な効果が得られる命令アドレスを選択することができるようにしておくことが望ましい。
【００３１】
本実施形態においては、キャッシュミスが発生する命令アドレスの統計データを効率的に取得するために、キャッシュミスの発生を計測するカウンタが装備される。すなわち、図２におけるレジスタファイル４０中のイベントカウンタ４２がそのカウンタである。イベントカウンタ４２が、指定された回数だけカウントすると、演算処理装置内部であらかじめ指定されたハンドラが実行される。このような機構を用意することで、キャッシュミスが発生するたびに検出機能が起動されるというオーバヘッドを避けつつ、ソフトウェアによる柔軟なキャッシュミス情報の処理を行うことができる。
【００３２】
起動されたハンドラは、図３に示されるような、命令アドレスごとにキャッシュミスが発生した回数を記憶するテーブルに基づいて、プリフェッチ処理を行うべきか否かを判定する。なお、このテーブルは、メインメモリ上に設けられている。
【００３３】
最適なプリフェッチ距離を判断するために、例えば、１命令あたりの実行サイクル数（Cycle Per Instruction，ＣＰＩ）を仮定し、対象となるシステム構成におけるキャッシュミスペナルティを考慮して、適切なプリフェッチ距離を算出することができる。
【００３４】
しかし、この場合、実行時において対象システム（対象プログラム）のＣＰＩを知ることは一般に難しく、また、それがプリフェッチ対象としている命令付近の実行状況を反映しているとは必ずしも言えない。
【００３５】
そこで、過去に実行が完了した命令の実行履歴を、ある固定された命令数だけ保持しておくテーブルを演算処理装置に備えることが好ましい。この場合、プリフェッチ命令を挿入する際、このテーブルを参照することで、ＣＰＩを仮定するよりも、精度良い挿入アドレスを知ることができる。これにより、プリフェッチによる悪影響を最小限に抑えることができる。
【００３６】
図２に示される実行履歴テーブル５０は、命令実行履歴を保持するテーブルであって、上述のハンドラが、プリフェッチ処理の対象とされた命令に対応するプリフェッチ命令の命令列中への挿入位置を決定する際に参照するものである。この実行履歴テーブル５０の構成例が図４に示される。この図に示されるように、テーブル内の各エントリは、命令アドレスと実行命令完了時の時刻（クロック単位）とで構成される。このテーブルは、複数のエントリからなり、最近実行完了した命令から数十命令程度以前の完了情報を保持することができるものである。
【００３７】
キャッシュミスイベント発生時に起動されるハンドラは、プリフェッチ命令を挿入する場合、実行履歴テーブル５０を参照して、挿入位置を決定し、プリフェッチ命令挿入アドレスレジスタ２２ａ〜２２ｄのいずれかにセットする。なお、このハンドラは、レジスタ２２ａ〜２２ｄに対して、一般レジスタと同様にアクセスすることができる。このように、かかるハンドラは、プリフェッチ命令を挿入する際、実行履歴テーブル５０にアクセスすることで、プリフェッチ命令を挿入する位置を精度良く決定することができる。なお、この場合、実行履歴テーブルにアクセスするための命令を追加するか、あるいは、このテーブルをメモリマップする等が必要となる。
【００３８】
データプリフェッチ対象となるメモリアドレス、すなわちプリフェッチ命令のオペランドの決定には、図２中のアドレス予測ユニット６０が使用される。上述のハンドラは、このユニットに対して、予めプリフェッチ処理が行われるべき命令（ロード命令）の命令アドレスを送信する。なお、アドレス予測ユニット６０は、すでに一般的に知られているアドレス予測技術を実装したものである。例えば、特開平4-206917号公報、M.H. Lipasti, et al. “Value Locality and Load Value Prediction” in ASPLOS-VII, 1996、等に開示された技術が利用可能である。
【００３９】
アドレス予測ユニット６０の内部には、アドレス予測テーブルが用意されている。このテーブルの各エントリは、ロード命令アドレス（インデックス）、前回及び前々回にロード対象とされたメモリアドレスが保持されている。このアドレス予測ユニット６０に対して、ロード命令アドレスを送信すると、このユニットは、(1) アドレス予測テーブル内に当該アドレスが存在するか、インデックスとの比較を行い、(2) もしあれば、前回及び前々回のメモリアドレスの差分（ストライド）から次回アクセスされるアドレスを予測する。
【００４０】
以下、実際に簡単なサンプルプログラムを用いてプリフェッチ命令を挿入するシーケンスを例示する。そのサンプルプログラムを図５に示す。このプログラムは、あるプログラム全体から、配列の指定範囲内についてループ演算処理を行う部分のみを切り出したものである。ここでは、特定のアーキテクチャの命令ではなく、擬似命令を用いて記述している。
【００４１】
プログラム中、符号７０で示されるロード命令（ｌｄ）がキャッシュミスを多数発生するロード命令である。なお、このロード命令は、レジスタｒ６に格納された値とレジスタｒ７に格納された値とを加算して得られる値でアドレッシングされるメモリの内容をレジスタｒ１へロードするものである。以下、そのシーケンスを示す。
【００４２】
（１）プログラム実行中、符号７０で示されるロード命令の実行時にキャッシュミスが発生する。このとき、イベントカウンタ４２（図２）がカウントアップされるが、そのカウンタが溢れた場合にはハンドラが起動される。
【００４３】
（２）起動されたハンドラは、前述した図３に示されるテーブルに基づいて、この命令アドレスでキャッシュミスが多数発生していてプリフェッチ処理を行うべき状況に至っているか否かを判定する。
【００４４】
（３）プリフェッチ処理を行う場合、ハンドラは、実行履歴テーブル５０（図４）を参照する。このサンプルプログラムの場合、実行履歴テーブル５０の内容は、図６に示されるようになっている。なお、図６は、実行履歴テーブル５０とともに、対応する命令ストリームをも示している。本システムにおいて、キャッシュミスペナルティが０ｘ６０（１６進表示）プロセッササイクルであった場合、このテーブルから２ループ前のロード命令の前後にプリフェッチ命令を挿入すれば良いことがわかる。
【００４５】
（４）ハンドラは、命令フェッチユニット２０内のプリフェッチ命令挿入アドレスを指定するためのレジスタ２２ａ〜２２ｄの一つにロード命令の命令アドレス（０ｘ０００１００３０）をセットし、同時に、アドレス予測ユニット６０に対してオペランド（メモリアドレス）の生成を指示する。また、何ループ先の予測アドレスを取得したいか（この例の場合、２）も指示する。
【００４６】
（５）アドレス予測ユニット６０は、当該命令アドレスがユニット内のアドレス予測テーブルに存在すれば、当該エントリの前回及び前々回のメモリアドレスの差分を求める。この例の場合、１ループ実行されるたびにａｄｄ命令でレジスタｒ７に３２が加算されていることからわかるように、メモリアドレスの差分は３２バイトとなる。
【００４７】
（６）アドレス予測ユニット６０は、３２（バイト）×２（ループ）先のデータをプリフェッチの対象メモリアドレスとして、命令フェッチユニット２０に送信する。
【００４８】
なお、上記（３）におけるキャッシュミスペナルティ（０ｘ６０サイクル）は、システム構成が変わった場合に変更されるものである。以後、命令フェッチユニット２０は、再度当該ロード命令前後の実行にさしかかった場合、そのアドレスに対するプリフェッチ命令を繰り返し挿入する。このとき、アドレス予測ユニット６０は、その都度プリフェッチ対象となるメモリアドレスの生成を行う。
【００４９】
上述した実施形態においては、演算処理装置における命令フェッチユニット２０（すなわち、図１における制御部）においてプリフェッチ命令の挿入が行われているが、プリフェッチ命令の挿入をキャッシュメモリで行うこともできる。多くの演算処理装置においては、命令ワードを格納するための専用キャッシュメモリ（命令キャッシュ）あるいは、データとともに命令ワードを格納する共有キャッシュメモリが装備されている。このキャッシュメモリ装置内部に、上述したプリフェッチ命令挿入装置を持たせ、制御部からの命令取得要求があったときにプリフェッチ命令を挿入したものを返すようにすることができる。
【００５０】
このようにキャッシュメモリにプリフェッチ挿入機能を持たせることにより、演算処理装置の内部（制御部、演算部）に何ら修正を加えることなく、機能を追加することが可能となる。
【００５１】
ところで、制御部が単一の命令ストリームの制御しか行えない場合（スーパスカラプロセッサ等）においては、ソフトウェアハンドラを実行するときに、一旦、現在実行中のプログラムの実行を中断し、その実行コンテキストを退避した後に当該ハンドラの実行を開始する。また、ハンドラの実行後は、退避した実行コンテキストを復帰させてプログラムの実行を再開する。ハンドラの起動に際しては、以上のようなオーバヘッドが生じる。
【００５２】
一方、ＳＭＴ（Simultaneous Multi-Threading）あるいはＣＭＰ（on Chip Multi Processor）のように、独立した複数の命令流を同時に制御することができるようなアーキテクチャの場合には、ハンドラを主命令流とは並列に実行することで、上述したハンドラの実行に伴うオーバヘッドを削減することができる。
【００５３】
また、上記ＳＭＴ等のアーキテクチャを採用する場合、複数の命令制御部が備えられる。しかし、プロセッサ動作中においては必ずしも全ての命令制御部が稼動しているわけではない。そこで、命令制御部がアイドリングしている場合に、上記ハンドラの実行を行うようにすることで、実行オーバヘッドを削減しつつ、演算処理装置（プロセッサ）の稼働率を向上させることができる。
【００５４】
図７は、ＳＭＴ（Simultaneous Multi-Threading）アーキテクチャを採用するシステムに本発明を適用した例について、その基本的な構成を示す図である。図中、ＤＣはデコーダ、ＡＬＵは算術論理ユニット、ＲＥＧはレジスタである。また、ＳＵはスケジューリングユニットである。また、ＦＰＡは浮動小数点加算器、ＦＰＭは浮動小数点乗算器、ＦＰＤは浮動小数点割算器、ＬＤ／ＳＴはロードストアユニットである。
【００５５】
命令フェッチユニット１２０は、図２に関して述べたと同様に、プリフェッチ命令挿入アドレスを格納するレジスタおよび比較器を備えている。また、各スレッドを制御するためのシーケンサ１２１ａ、１２１ｂ及び１２１ｃは、それぞれ独立したレジスタを備えている。なお、符号１２８ａ、１２８ｂ及び１２８ｃは、命令キューを示す。
【００５６】
かかる構成において、キャッシュミスイベントカウンタ（図示せず）がオーバフローした際に、そのイベントが割り込みとして通知される場合、その割り込みは、ある特定のシーケンサ（例えば、１２１ｃ）にのみ通知される。かくして、キャッシュミス情報を処理するためのハンドラは、ある特定のシーケンサにて実行されることとなる。
【００５７】
主命令流は、その他のシーケンサを用いて実行されるので、ハンドラの実行に際して主命令流の実行は妨げられない。また、この図の場合、シーケンサ毎にレジスタ等の実行コンテキスト（レジスタ）が保持されているので、ハンドラの起動・終了時に退避・復帰処理のオーバヘッドを伴うことがない。
【００５８】
以上、本発明を特にその好ましい実施の形態を参照して詳細に説明した。本発明の容易な理解のため、本発明の具体的な形態を以下に付記する。
【００５９】
（付記１）予めメインメモリからキャッシュメモリへデータを転送するように指示するプリフェッチ命令を動的に命令列中に挿入して実行する演算処理装置であって、
キャッシュミスを起こす命令のうちプリフェッチ処理の対象とすべき命令を選択するプリフェッチ対象選択手段と、
前記プリフェッチ対象選択手段によってプリフェッチ処理の対象とされた命令の実行時におけるメモリアクセスアドレスを予測するアドレス予測手段と、
前記プリフェッチ対象選択手段によってプリフェッチ処理の対象とされた命令に対応するプリフェッチ命令の命令列中への挿入位置を決定するプリフェッチ命令挿入位置決定手段と、
前記アドレス予測手段によって予測されたメモリアクセスアドレスをオペランドに有するプリフェッチ命令を、前記プリフェッチ命令挿入位置決定手段によって決定された挿入位置に、挿入するプリフェッチ命令挿入手段と、
を具備する演算処理装置。
【００６０】
（付記２）前記プリフェッチ対象選択手段は、命令アドレスごとにキャッシュミスが発生する回数を計測するカウンタを備え、該カウンタの値が一定値を超えた場合に、キャッシュミス情報を処理するハンドラを起動して、該ハンドラによってプリフェッチ処理の対象とすべき命令を選択する、付記１に記載の演算処理装置。
【００６１】
（付記３）前記プリフェッチ命令挿入位置決定手段は、過去に実行が完了した命令列について命令アドレス及び実行完了時刻を保持するテーブルを備え、該テーブルの情報に基づいてプリフェッチ命令の命令列中への挿入位置を決定する、付記１又は付記２に記載の演算処理装置。
【００６２】
（付記４）前記プリフェッチ対象選択手段、前記プリフェッチ命令挿入位置決定手段、前記アドレス予測手段及び前記プリフェッチ命令挿入手段をキャッシュメモリ内に備える、付記１に記載の演算処理装置。
【００６３】
（付記５）複数の実行命令流が制御可能であり、前記キャッシュミス情報を処理するハンドラが独立した命令流として主命令流とは独立に実行される、付記３に記載の演算処理装置。
【００６４】
（付記６）予めメインメモリからキャッシュメモリへデータを転送するように指示するプリフェッチ命令を動的に命令列中に挿入して実行する演算処理方法であって、
(a) キャッシュミスを起こす命令のうちプリフェッチ処理の対象とすべき命令を選択するステップと、
(b) ステップ(a)によってプリフェッチ処理の対象とされた命令の実行時におけるメモリアクセスアドレスを予測するステップと、
(c) ステップ(a)によってプリフェッチ処理の対象とされた命令に対応するプリフェッチ命令の命令列中への挿入位置を決定するステップと、
(d) ステップ(b)によって予測されたメモリアクセスアドレスをオペランドに有するプリフェッチ命令を、ステップ(c)によって決定された挿入位置に、挿入するステップと、
を具備する演算処理方法。
【００６５】
（付記７）ステップ(a)は、命令アドレスごとにキャッシュミスが発生する回数を計測し、該回数が一定値を超えた場合に、キャッシュミス情報を処理するハンドラを起動して、該ハンドラによってプリフェッチ処理の対象とすべき命令を選択する、付記６に記載の演算処理方法。
【００６６】
（付記８）ステップ(c)は、過去に実行が完了した命令列について命令アドレス及び実行完了時刻を保持するテーブルを備え、該テーブルの情報に基づいてプリフェッチ命令の命令列中への挿入位置を決定する、付記６又は付記７に記載の演算処理方法。
【００６７】
（付記９）複数の実行命令流を制御可能であり、前記キャッシュミス情報を処理するハンドラを独立した命令流として主命令流とは独立に実行させるステップを更に具備する、付記８に記載の演算処理方法。
【００６８】
【発明の効果】
以上説明したように、本発明によれば、キャッシュメモリを備えた演算処理装置において、実行時情報に基づいてプリフェッチ命令を命令ストリーム中に挿入することにより、従来のプリフェッチ方式に伴うキャッシュ汚染等の問題を最小限に抑制しつつ効率的にプリフェッチを実行することが可能となる。
【図面の簡単な説明】
【図１】本発明によりプリフェッチ命令の挿入処理を行う演算処理装置の動作の概要を説明するための図である。
【図２】本発明による演算処理装置の一構成例を示す図である。
【図３】命令アドレスごとにキャッシュミスが発生した回数を記憶するテーブルの構成を示す図である。
【図４】実行履歴テーブルの構成を示す図である。
【図５】サンプルプログラムを示す図である。
【図６】サンプルプログラムを実行するときの実行履歴テーブルの内容を命令ストリームとともに示す図である。
【図７】ＳＭＴアーキテクチャを採用するシステムに本発明を適用した例について、その基本的な構成を示す図である。
【符号の説明】
１０…メインメモリ
１２…命令キャッシュ
１４…データキャッシュ
２０…命令フェッチユニット
２２ａ〜２２ｄ…プリフェッチ命令挿入アドレスレジスタ
２４…プログラムカウンタ（ＰＣ）
２６ａ〜２６ｄ…比較器（コンパレータ）
２８…命令キュー
３０…デコーダ
３２…分岐命令処理ユニット
３４…算術命令処理ユニット
３６…ロード／ストア命令処理ユニット
４０…レジスタファイル
４２…イベントカウンタ
５０…実行履歴テーブル
６０…アドレス予測ユニット
１２０…命令フェッチユニット
１２１ａ，１２１ｂ，１２１ｃ…シーケンサ
１２８ａ，１２８ｂ，１２８ｃ…命令キュー

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an arithmetic processing unit that includes a cache memory and performs prefetch processing to hide memory access latency, and more particularly to an arithmetic processing unit that improves the effectiveness of prefetching by performing it based on runtime information.
[0002]
[Prior art]
In current computer systems, growing memory access latency is a major obstacle to improving overall system performance. A typical computer system exploits the temporal and spatial locality of memory accesses by placing a cache memory that can be accessed at high speed between the processor (registers) and the main memory, thereby improving memory access performance.
[0003]
However, when a memory access occurs and the data is not in the cache memory, that is, when a cache miss occurs, the data must be transferred from the main memory. This transfer takes a long time (the cache miss penalty). In current arithmetic processing units, an access to the cache memory takes from one to a little over ten cycles, whereas the cache miss penalty reaches several tens of cycles. Program execution time is therefore largely limited by the cache miss penalty. The gap between processor operating speed and memory access speed widens every year, so this problem will affect computer systems even more seriously in the future.
[0004]
To address this problem, a technique called data prefetching (data read-ahead; hereinafter simply "prefetching") has been widely proposed and used in the prior art. Data prefetching transfers data that is expected to be accessed in the future to an upper level of the memory hierarchy (the cache memory) in advance. When data prefetching is performed appropriately, the access latency of the main memory is hidden, and the program executes as if the data were always present in the cache.
[0005]
Many prefetch schemes have been proposed to date, but they fall broadly into the following two categories.
[0006]
(1) Software prefetch. In general, a compiler or the like statically analyzes a program behavior and inserts a prefetch instruction into the program. A compiler or the like may insert a prefetch instruction based on profile information obtained by executing a program.
[0007]
(2) Hardware prefetch. A hardware mechanism dedicated to prefetching is provided inside or outside the processor. In general, this dedicated hardware detects cache misses dynamically (during program execution) and performs prefetch processing according to a predetermined algorithm. Existing hardware prefetch schemes detect that cache misses have occurred several times and use this as the trigger to start prefetching.
[0008]
As an example of (1) above, Japanese Unexamined Patent Publication No. 11-212802 proposes a compilation technique in which a source program is converted into intermediate text, the intermediate text is optimized, the address for data prefetching is calculated, and a prefetch instruction is inserted.
[0009]
As for (2) above, many schemes have been proposed. In one, when a cache block is accessed, the next cache block is predicted to be accessed in the near future and is prefetched. In another, prefetching by stride prediction, the address to be accessed next is determined by adding a fixed value. As an example in commercial use, the Pentium (registered trademark) 4 has a mechanism that always reads ahead the data 256 bytes beyond the address actually being accessed.
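The two simplest address rules mentioned above, next-block prefetching and a fixed read-ahead offset, can be illustrated with a short sketch. The function names and the 64-byte block size are assumptions chosen for the example; only the 256-byte offset comes from the text.

```python
def next_block_address(accessed_addr, block_size=64):
    """Start address of the cache block after the one containing accessed_addr.

    block_size=64 is an assumed value; the text does not give a block size.
    """
    return (accessed_addr // block_size + 1) * block_size


def fixed_offset_address(accessed_addr, offset=256):
    """Address a fixed number of bytes beyond the accessed address."""
    return accessed_addr + offset


if __name__ == "__main__":
    print(hex(next_block_address(0x1004)))
    print(hex(fixed_offset_address(0x1000)))
```

Neither rule takes the cache miss penalty or the behavior of the running program into account, which is the limitation the following sections discuss.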
[0010]
[Problems to be solved by the invention]
Generally, when performing prefetch processing, it is necessary to pay attention to the following problems.
[0011]
(1) Cache pollution occurs. Prefetching evicts data that was stored in the cache memory. If some of that data would have stayed in the cache and produced hits had no prefetching been performed, prefetching can degrade performance.
[0012]
(2) Prefetched data is evicted before use by subsequent accesses. If prefetching is performed too long before the prefetched data is actually accessed, the data may be evicted by intervening memory accesses before that access takes place.
[0013]
(3) Bus bandwidth is wasted by excessive prefetching. Performing more prefetching than necessary wastes memory bus bandwidth and ends up degrading performance.
[0014]
Regarding (1) and (2) above, the distance between the position where the prefetch is inserted and the instruction that actually uses the prefetched data (the prefetch distance) must be determined appropriately. Regarding (3) above, prefetching should not be performed for every memory access that causes a cache miss; a prefetch instruction must be inserted only when prefetching can be judged to be effective.
[0015]
With conventional prefetch schemes, however, it is difficult to set the prefetch distance appropriately, because program behavior may change with runtime conditions and the system configuration (the configuration of the processor or the memory system) may be changed.
[0016]
First, with software prefetching, a change of environment such as those described above requires work such as recompiling the source program. Preparing a separate executable file for each system in this way is not realistic. Moreover, software prefetching cannot cope with programs whose behavior changes with their input.
[0017]
Hardware prefetching, for its part, cannot maintain an effective prefetch distance if it simply prefetches immediately whenever a load miss occurs. In addition, complex processing for deciding whether to perform a hardware prefetch is difficult to implement. In practice, conventional hardware prefetching is generally simple: it prefetches once cache misses have occurred several times, or keeps prefetching the block after the one that was accessed.
[0018]
The present invention has been made in view of the problems described above. Its object is to provide an arithmetic processing unit with a cache memory that inserts prefetch instructions into the instruction stream based on runtime information, and can thereby prefetch efficiently while minimizing problems associated with conventional prefetch schemes, such as cache pollution.
[0019]
[Means for Solving the Problems]
To achieve the above object, a first aspect of the present invention provides an arithmetic processing unit that dynamically inserts into an instruction sequence, and executes, a prefetch instruction instructing that data be transferred in advance from the main memory to the cache memory. The arithmetic processing unit comprises: prefetch target selecting means for selecting, from among instructions that cause cache misses, an instruction to be subjected to prefetch processing; address predicting means for predicting the memory access address at the time of execution of the instruction selected for prefetch processing by the prefetch target selecting means; prefetch instruction insertion position determining means for determining the position in the instruction sequence at which the prefetch instruction corresponding to the selected instruction is to be inserted; and prefetch instruction insertion means for inserting a prefetch instruction, having as its operand the memory access address predicted by the address predicting means, at the insertion position determined by the prefetch instruction insertion position determining means.
[0020]
According to a second aspect of the present invention, the prefetch target selecting means preferably includes a counter that measures the number of times a cache miss occurs for each instruction address. When the value of the counter exceeds a certain value, a handler that processes cache miss information is activated, and the handler selects the instruction to be subjected to prefetch processing.
[0021]
According to a third aspect of the present invention, the prefetch instruction insertion position determining means preferably includes a table that holds the instruction address and execution completion time of each instruction in a sequence of instructions whose execution has already completed, and determines the insertion position of the prefetch instruction in the instruction sequence based on the information in that table.
[0022]
According to the fourth aspect of the present invention, the prefetch target selection means, the prefetch instruction insertion position determination means, the address prediction means, and the prefetch instruction insertion means may be provided in a cache memory.
[0023]
According to a fifth aspect of the present invention, in an arithmetic processing unit capable of controlling a plurality of execution instruction streams, the handler that processes the cache miss information is executed as a separate instruction stream, independently of the main instruction stream.
[0024]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
[0025]
FIG. 1 is a diagram for explaining the outline of the operation of an arithmetic processing unit that performs prefetch instruction insertion processing according to the present invention. FIG. 1 shows only the portions relevant to the present invention. The instruction fetch unit and the instruction decoder are treated together as the part that controls instruction execution, and this part is referred to as the control unit. In the arithmetic processing unit to which the present invention applies, the control unit specifies an instruction address indicating the next instruction to be executed, the instruction is read from the memory (or cache memory) in which the instruction sequence is stored, and the instruction is input to the control unit. The instruction is then executed in one or more arithmetic units. In the present invention, the sequence of instructions read from the memory is called the “instruction stream”.
[0026]
The control unit in FIG. 1 inserts a prefetch instruction at a designated instruction address. At run time, the program itself is stored in the main memory and is not modified directly. In one example of a method for inserting and executing a prefetch instruction at a designated instruction address, the instruction fetch unit (included in the control unit in FIG. 1) compares the address of the instruction being fetched with the instruction address designated for insertion. If they match, it inserts the prefetch instruction into the instruction stream obtained from the main memory, and the arithmetic unit then executes that prefetch instruction.
[0027]
In the arithmetic processing unit according to the present invention, the instruction address of an instruction that causes a cache miss (a load/store instruction, an arithmetic instruction involving memory access, and so on) is detected. Various algorithms can be applied to decide whether to actually perform data prefetching for the detected instruction address. When prefetching is to be performed, a prefetch instruction insertion position (instruction address) is specified that takes the cache miss penalty of the system into account. For the data address to be prefetched, any of the various data address prediction schemes proposed to date can be applied. In this way, dynamic and effective prefetching can be performed without recompiling the program and without depending on the runtime environment.
[0028]
FIG. 2 is a diagram showing a configuration example of an arithmetic processing apparatus according to the present invention. In the embodiment shown in FIG. 2, an instruction cache 12 and a data cache 14 are provided for the main memory 10. The instruction fetch unit 20 fetches an instruction from the instruction cache 12 based on the value of the program counter (PC) 24 (address where the instruction to be fetched is stored) and sends the instruction to the decoder 30. The instruction decoded by the decoder 30 is processed in the branch instruction processing unit 32, the arithmetic instruction processing unit 34 or the load / store instruction processing unit 36.
[0029]
The instruction fetch unit 20 in this embodiment contains a plurality of registers (prefetch instruction insertion address registers) 22a to 22d that specify the addresses at which prefetch instructions are to be inserted. At instruction fetch time, the instruction fetch unit 20 compares the value of the program counter (PC) 24 with the values of the prefetch instruction insertion address registers 22a to 22d in parallel, using comparators 26a to 26d. If a match is found, a prefetch instruction is inserted at a predetermined position in the instruction queue 28, which feeds instructions to the decoder 30.
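The comparison and insertion performed by the instruction fetch unit can be modeled in software as follows. This is only an illustrative sketch, not the circuit of FIG. 2: the class name `FetchUnit`, the per-register operand list, and the choice to place the prefetch immediately before the matching instruction are assumptions made for the example.

```python
from collections import deque


class FetchUnit:
    """Toy model of a fetch unit with prefetch-insertion address registers."""

    def __init__(self, program, num_registers=4):
        self.program = program                       # instruction address -> instruction text
        self.insert_regs = [None] * num_registers    # stand-ins for registers 22a-22d
        self.operands = [None] * num_registers       # assumed: prefetch target per register
        self.queue = deque()                         # stand-in for instruction queue 28

    def set_insertion(self, slot, insn_addr, data_addr):
        """Arm one register with an insertion address and a prefetch operand."""
        self.insert_regs[slot] = insn_addr
        self.operands[slot] = data_addr

    def fetch(self, pc):
        """Fetch the instruction at pc, inserting a prefetch first on a match."""
        # Hardware compares all registers in parallel; a loop gives the same result.
        for reg, data_addr in zip(self.insert_regs, self.operands):
            if reg is not None and reg == pc:
                self.queue.append(f"prefetch 0x{data_addr:x}")
        self.queue.append(self.program[pc])


if __name__ == "__main__":
    program = {0x10030: "ld [r6+r7], r1", 0x10034: "add r7, 32, r7"}
    unit = FetchUnit(program)
    unit.set_insertion(0, 0x10030, 0x2040)
    for pc in (0x10030, 0x10034):
        unit.fetch(pc)
    print(list(unit.queue))
```

The program held in memory is never modified in this model; only the queue seen by the decoder contains the extra instruction.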
[0030]
When selecting, from among the instructions that cause data cache misses, the instructions to be subjected to prefetch processing, it is desirable to record the instruction addresses at which cache misses occur and process them statistically, so that instruction addresses for which prefetching will be sufficiently effective can be selected.
[0031]
In this embodiment, a counter that counts cache miss occurrences is provided so that statistical data on the instruction addresses at which cache misses occur can be obtained efficiently. This counter is the event counter 42 in the register file 40 in FIG. 2. When the event counter 42 has counted a designated number of events, a handler designated in advance is executed inside the arithmetic processing unit. Such a mechanism allows cache miss information to be processed flexibly in software while avoiding the overhead of invoking the detection function on every cache miss.
[0032]
The activated handler determines whether or not to perform prefetch processing based on a table that stores the number of times a cache miss has occurred for each instruction address as shown in FIG. This table is provided on the main memory.
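A minimal sketch of this counter-plus-handler arrangement is shown below. The text does not say how the per-address table of FIG. 3 is updated, so the sketch assumes that the handler records the instruction address of the miss that caused the counter to overflow (a sampling approach). The names `MissMonitor`, `overflow_every`, and `select_threshold` are hypothetical.

```python
from collections import Counter


class MissMonitor:
    """Counts cache-miss events and runs a handler only when the counter overflows."""

    def __init__(self, overflow_every, select_threshold):
        self.overflow_every = overflow_every      # designated count for the event counter
        self.select_threshold = select_threshold  # table count needed to select an address
        self.event_counter = 0                    # stand-in for event counter 42
        self.miss_table = Counter()               # stand-in for the table of FIG. 3

    def on_cache_miss(self, insn_addr):
        """Called on every miss; returns an address chosen for prefetching, or None."""
        self.event_counter += 1
        if self.event_counter < self.overflow_every:
            return None                           # cheap path: no handler invocation
        self.event_counter = 0
        return self._handler(insn_addr)

    def _handler(self, insn_addr):
        # Assumption: the handler samples the address that triggered the overflow.
        self.miss_table[insn_addr] += 1
        if self.miss_table[insn_addr] >= self.select_threshold:
            return insn_addr
        return None


if __name__ == "__main__":
    monitor = MissMonitor(overflow_every=4, select_threshold=2)
    print([monitor.on_cache_miss(0x10030) for _ in range(8)])
```

With these settings the handler runs on every fourth miss, so most misses cost only a counter increment.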
[0033]
To determine the optimal prefetch distance, one can, for example, assume a number of execution cycles per instruction (Cycles Per Instruction, CPI) and calculate an appropriate prefetch distance that takes into account the cache miss penalty of the target system configuration.
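As a rough illustration of this CPI-based estimate, the prefetch distance in instructions can be taken as the miss penalty divided by the assumed CPI, rounded up. The formula and the function name are assumptions for the example; the text does not give an explicit formula.

```python
import math


def prefetch_distance(miss_penalty_cycles, assumed_cpi):
    """Number of instructions ahead of the load at which to insert the prefetch."""
    if assumed_cpi <= 0:
        raise ValueError("assumed_cpi must be positive")
    return math.ceil(miss_penalty_cycles / assumed_cpi)


if __name__ == "__main__":
    # 0x60 cycles is the penalty used in the patent's example; the CPI is assumed.
    print(prefetch_distance(0x60, 1.5))
```

The result is only as good as the assumed CPI, which is a single program-wide figure rather than a measurement taken near the load in question.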
[0034]
However, in this case, it is generally difficult to know the CPI of the target system (target program) at the time of execution, and it cannot necessarily be said that it reflects the execution status near the instruction to be prefetched.
[0035]
Therefore, it is preferable to provide the arithmetic processing unit with a table that holds the execution history of a fixed number of previously completed instructions. When a prefetch instruction is to be inserted, referring to this table gives a more accurate insertion address than assuming a CPI does. This minimizes the adverse effects of prefetching.
[0036]
The execution history table 50 shown in FIG. 2 holds the instruction execution history. The handler described above refers to it when determining the position in the instruction sequence at which to insert the prefetch instruction corresponding to the instruction selected for prefetch processing. A configuration example of the execution history table 50 is shown in FIG. 4. As shown in that figure, each entry in the table consists of an instruction address and the time (in clock cycles) at which execution of the instruction completed. The table consists of a plurality of entries and can hold completion information going back several tens of instructions from the most recently completed instruction.
[0037]
When inserting a prefetch instruction, the handler activated when a cache miss event occurs refers to the execution history table 50 to determine the insertion position and sets it to any of the prefetch instruction insertion address registers 22a to 22d. Note that this handler can access the registers 22a to 22d in the same manner as the general registers. In this manner, such a handler can accurately determine the position to insert the prefetch instruction by accessing the execution history table 50 when inserting the prefetch instruction. In this case, it is necessary to add an instruction for accessing the execution history table, or to memory map this table.
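One way the handler might use the execution history table is sketched below: starting from the most recent completion of the missing load, it walks backward until it finds an instruction that completed at least one miss penalty earlier. The search rule, the function name, and the sample history are assumptions for illustration; the history values are hypothetical and are not the contents of FIG. 6.

```python
def choose_insertion_address(history, load_addr, miss_penalty_cycles):
    """Pick an insertion address from (instruction_address, completion_cycle) pairs.

    history is ordered oldest first. Returns None if the load is not in the
    table or the table does not reach back far enough.
    """
    # Find the most recent completion of the load that missed.
    for index in range(len(history) - 1, -1, -1):
        if history[index][0] == load_addr:
            load_index = index
            break
    else:
        return None

    load_time = history[load_index][1]
    # Walk backward to the latest instruction that is far enough ahead in time.
    for addr, completed_at in reversed(history[:load_index]):
        if load_time - completed_at >= miss_penalty_cycles:
            return addr
    return None


if __name__ == "__main__":
    # Hypothetical three-instruction loop taking 0x30 cycles per iteration.
    history = [
        (0x10030, 0x100), (0x10034, 0x110), (0x10038, 0x120),
        (0x10030, 0x130), (0x10034, 0x140), (0x10038, 0x150),
        (0x10030, 0x160),
    ]
    print(hex(choose_insertion_address(history, 0x10030, 0x60)))
```

The returned address is the value the handler would write into one of the prefetch instruction insertion address registers.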
[0038]
The address prediction unit 60 in FIG. 2 is used to determine the memory address to be prefetched, that is, the operand of the prefetch instruction. The handler described above sends this unit the instruction address of the instruction (a load instruction) for which prefetch processing is to be performed in advance. The address prediction unit 60 implements an address prediction technique that is already generally known. For example, the techniques disclosed in Japanese Unexamined Patent Publication No. 4-206917 and in M. H. Lipasti, et al., “Value Locality and Load Value Prediction,” in ASPLOS-VII, 1996, can be used.
[0039]
An address prediction table is provided inside the address prediction unit 60. Each entry in this table holds a load instruction address (the index) and the memory addresses that were loaded the last time and the time before last. When a load instruction address is sent to the address prediction unit 60, the unit (1) checks, by comparison with the indexes, whether that address exists in the address prediction table, and (2) if it does, predicts the address to be accessed next from the difference (stride) between the last memory address and the one before it.
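The table lookup and stride calculation can be sketched as follows. The class and method names are hypothetical, and the `lookahead` parameter (how many strides ahead to predict) is an assumption introduced so the same sketch can predict more than one access ahead.

```python
class AddressPredictor:
    """Stride predictor keyed by load instruction address."""

    def __init__(self):
        # load instruction address -> (last address, address before last)
        self.table = {}

    def observe(self, insn_addr, mem_addr):
        """Record the memory address accessed by the load at insn_addr."""
        last, _ = self.table.get(insn_addr, (None, None))
        self.table[insn_addr] = (mem_addr, last)

    def predict(self, insn_addr, lookahead=1):
        """Predict the address accessed `lookahead` executions from now, or None."""
        entry = self.table.get(insn_addr)
        if entry is None or entry[1] is None:
            return None                      # no entry, or only one observation so far
        last, before_last = entry
        stride = last - before_last
        return last + stride * lookahead


if __name__ == "__main__":
    predictor = AddressPredictor()
    predictor.observe(0x10030, 0x2000)
    predictor.observe(0x10030, 0x2020)
    print(hex(predictor.predict(0x10030)))
```

A prediction needs two observations of the same load, so the first two executions of a load cannot be covered.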
[0040]
The sequence for inserting a prefetch instruction is illustrated below using a simple sample program, which is shown in FIG. 5. This program is an excerpt from a larger program, containing only the part that performs loop processing over a specified range of an array. It is written in pseudo-instructions rather than in the instructions of a specific architecture.
[0041]
In the program, a load instruction (ld) indicated by reference numeral 70 is a load instruction that causes many cache misses. This load instruction is to load the contents of the memory addressed with the value obtained by adding the value stored in the register r6 and the value stored in the register r7 into the register r1. The sequence is shown below.
[0042]
(1) During execution of the program, a cache miss occurs when the load instruction indicated by reference numeral 70 is executed. At this time, the event counter 42 (FIG. 2) is incremented. If the counter overflows, the handler is activated.
[0043]
(2) Based on the table shown in FIG. 3 described above, the activated handler determines whether enough cache misses have occurred at this instruction address that prefetch processing should be performed.
[0044]
(3) When performing prefetch processing, the handler refers to the execution history table 50 (FIG. 4). For this sample program, the contents of the execution history table 50 are as shown in FIG. 6, which shows the corresponding instruction stream together with the execution history table 50. In this system, where the cache miss penalty is 0x60 (hexadecimal) processor cycles, the table shows that the prefetch instruction should be inserted around the load instruction two loop iterations earlier.
[0045]
(4) The handler sets the instruction address (0x00010030) of the load instruction in one of the registers 22a to 22d, which designate the prefetch instruction insertion address in the instruction fetch unit 20, and at the same time instructs the address prediction unit 60 to generate the operand (memory address). It also specifies how many loop iterations ahead the predicted address should be (two in this example).
[0046]
(5) If the instruction address is present in its address prediction table, the address prediction unit 60 obtains the difference between the previous memory address of that entry and the one before it. In this example, since the add instruction adds 32 to the register r7 each time one loop iteration is executed, the memory address difference is 32 bytes.
[0047]
(6) The address prediction unit 60 transmits the address 32 (bytes) × 2 (loop iterations) = 64 bytes ahead to the instruction fetch unit 20 as the prefetch target memory address.
[0048]
Note that the cache miss penalty (0x60 cycles) in (3) changes when the system configuration changes. Thereafter, each time execution again reaches the vicinity of the load instruction, the instruction fetch unit 20 inserts a prefetch instruction for the predicted address. The address prediction unit 60 generates the memory address to be prefetched each time.
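Steps (1) to (6) can be traced with a short Python sketch. Everything not stated in the text is an assumption made for illustration: the names `MissMonitor` and `loops_to_cover_penalty`, the counter overflow value and miss threshold, the way the per-address miss table is updated, the instruction addresses other than 0x00010030, the completion times in `history` (a loop taking 0x38 cycles per iteration), and the last accessed address 0x2040. The penalty of 0x60 cycles, the 32-byte stride, and the resulting two-iteration distance follow the example above.

```python
from collections import Counter
from typing import List, Optional, Tuple

MISS_PENALTY_CYCLES = 0x60  # value used in the text; depends on the system configuration


class MissMonitor:
    """Models the event counter and a per-address miss table (steps 1 and 2)."""

    def __init__(self, overflow_at: int, threshold: int) -> None:
        self.overflow_at = overflow_at  # hypothetical overflow point of the event counter
        self.threshold = threshold      # hypothetical miss count that justifies prefetching
        self.event_counter = 0
        self.miss_table: Counter = Counter()

    def on_cache_miss(self, pc: int) -> bool:
        """Return True when the handler runs and selects pc for prefetching."""
        self.miss_table[pc] += 1
        self.event_counter += 1
        if self.event_counter < self.overflow_at:
            return False
        self.event_counter = 0  # overflow: the handler is activated
        return self.miss_table[pc] >= self.threshold


def loops_to_cover_penalty(history: List[Tuple[int, int]], load_pc: int,
                           penalty: int = MISS_PENALTY_CYCLES) -> Optional[int]:
    """Step 3: find how many iterations back the prefetch must be issued.

    history holds (instruction address, completion time in cycles) pairs in
    execution order, modelled on the execution history table.
    """
    times = [t for pc, t in history if pc == load_pc]
    if not times:
        return None
    latest = times[-1]
    # Walk back over earlier executions of the same load instruction.
    for loops_back, earlier in enumerate(reversed(times[:-1]), start=1):
        if latest - earlier >= penalty:
            return loops_back
    return None  # the recorded history is too short to cover the penalty


LOAD_PC, ADD_PC, BRANCH_PC = 0x10030, 0x10034, 0x10038  # only LOAD_PC appears in the text
monitor = MissMonitor(overflow_at=4, threshold=3)
selected = [monitor.on_cache_miss(LOAD_PC) for _ in range(4)]

history: List[Tuple[int, int]] = []
for i in range(3):  # three hypothetical iterations, 0x38 cycles apart
    base = 0x100 + i * 0x38
    history += [(LOAD_PC, base), (ADD_PC, base + 4), (BRANCH_PC, base + 8)]

loops = loops_to_cover_penalty(history, LOAD_PC)  # steps 3 and 4
stride = 32                                       # step 5: r7 advances by 32 per iteration
last_addr = 0x2040                                # hypothetical most recent load address
prefetch_addr = last_addr + stride * loops        # step 6: 64 bytes ahead
```

With these assumed timings, one iteration (0x38 cycles) is shorter than the penalty and two iterations (0x70 cycles) cover it, which matches the two-iteration distance in the example. A different miss penalty or loop length would change the result, as the note above points out.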
[0049]
In the above-described embodiment, the prefetch instruction is inserted in the instruction fetch unit 20 (that is, the control unit in FIG. 1) of the arithmetic processing unit. However, the prefetch instruction can also be inserted in the cache memory. Many arithmetic processing units are equipped with a dedicated cache memory (instruction cache) for storing instruction words, or with a shared cache memory that stores instruction words together with data. The cache memory device can be provided with the above-described prefetch instruction insertion mechanism so that, when the control unit issues an instruction fetch request, the cache memory returns the instruction stream with the prefetch instruction inserted.
[0050]
By providing the cache memory with the prefetch insertion function in this way, the function can be added without any modification to the inside of the arithmetic processing unit (control unit, arithmetic unit).
[0051]
When the control unit can control only a single instruction stream (as in a superscalar processor), executing a software handler requires that the currently running program be temporarily suspended and its execution context saved before the handler starts. After the handler finishes, the saved execution context is restored and program execution resumes. This overhead is incurred each time the handler is started.
[0052]
On the other hand, in an architecture such as SMT (Simultaneous Multi-Threading) or CMP (on-Chip Multi-Processor), where multiple independent instruction streams can be controlled simultaneously, the handler can be executed in parallel with the main instruction stream. As a result, the overhead associated with executing the handler can be reduced.
[0053]
When an architecture such as the above SMT is adopted, a plurality of instruction control units are provided. However, not all of the instruction control units are necessarily busy while the processor is operating. Therefore, by executing the handler on an instruction control unit that is idle, the utilization of the arithmetic processing unit (processor) can be improved while the execution overhead is reduced.
[0054]
FIG. 7 is a diagram showing a basic configuration of an example in which the present invention is applied to a system employing an SMT (Simultaneous Multi-Threading) architecture. In the figure, DC is a decoder, ALU is an arithmetic logic unit, and REG is a register. SU is a scheduling unit. FPA is a floating point adder, FPM is a floating point multiplier, FPD is a floating point divider, and LD / ST is a load store unit.
[0055]
The instruction fetch unit 120 includes registers for storing prefetch instruction insertion addresses and comparators, as described with reference to FIG. 2. The sequencers 121a, 121b, and 121c, which control the respective threads, each have independent registers. Reference numerals 128a, 128b, and 128c indicate instruction queues.
[0056]
In such a configuration, when a cache miss event counter (not shown) overflows and the event is signaled as an interrupt, the interrupt is delivered only to a specific sequencer (for example, 121c). Thus, the handler that processes cache miss information is executed by that specific sequencer.
[0057]
Since the main instruction stream is executed using another sequencer, its execution is not hindered while the handler runs. Further, in the configuration of this figure, an execution context such as the registers is held for each sequencer, so there is no save/restore overhead when the handler starts or finishes.
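The interrupt routing described for FIG. 7 can be modelled in a few lines of Python. This is an idealized sketch under stated assumptions: the names `Sequencer`, `SmtFrontEnd`, and `on_counter_overflow`, the string used to represent the handler, and the `pending` work list are hypothetical, and contention for shared execution resources is ignored. The sequencer labels and the choice of 121c as the handler sequencer follow the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sequencer:
    name: str
    registers: Dict[str, int] = field(default_factory=dict)  # per-thread execution context
    pending: List[str] = field(default_factory=list)         # work queued for this thread


class SmtFrontEnd:
    """Toy model: cache miss interrupts go to one designated sequencer only."""

    def __init__(self, names: List[str], handler_index: int) -> None:
        self.sequencers = [Sequencer(n) for n in names]
        self.handler_index = handler_index

    def on_counter_overflow(self, miss_pc: int) -> None:
        """Deliver the interrupt only to the handler sequencer."""
        target = self.sequencers[self.handler_index]
        target.pending.append(f"miss_handler({miss_pc:#x})")


front = SmtFrontEnd(["121a", "121b", "121c"], handler_index=2)
front.sequencers[0].registers["r7"] = 64          # main stream context stays in place
front.sequencers[0].pending.append("main program")
front.on_counter_overflow(0x10030)
```

Because each sequencer keeps its own `registers`, the main stream's context is never copied out or restored in this model, which is the property the paragraph above relies on.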
[0058]
The present invention has been described in detail above with particular reference to preferred embodiments. To facilitate understanding of the present invention, specific forms of the invention are listed below as supplementary notes.
[0059]
(Supplementary Note 1) An arithmetic processing unit that dynamically inserts, into an instruction sequence, and executes a prefetch instruction instructing that data be transferred in advance from a main memory to a cache memory, the arithmetic processing unit comprising:
prefetch target selection means for selecting, from among instructions causing cache misses, an instruction to be subjected to prefetch processing;
address prediction means for predicting a memory access address at the time of execution of the instruction selected for prefetch processing by the prefetch target selection means;
prefetch instruction insertion position determination means for determining an insertion position, in the instruction sequence, of a prefetch instruction corresponding to the instruction selected for prefetch processing by the prefetch target selection means; and
prefetch instruction insertion means for inserting, at the insertion position determined by the prefetch instruction insertion position determination means, a prefetch instruction having the memory access address predicted by the address prediction means as an operand.
[0060]
(Supplementary Note 2) The arithmetic processing unit according to Supplementary Note 1, wherein the prefetch target selection means includes a counter that counts the number of times a cache miss occurs for each instruction address, activates a handler that processes cache miss information when the counter value exceeds a certain value, and selects, by means of the handler, the instruction to be subjected to prefetch processing.
[0061]
(Supplementary Note 3) The arithmetic processing unit according to Supplementary Note 1 or Supplementary Note 2, wherein the prefetch instruction insertion position determination means includes a table that holds an instruction address and an execution completion time for each instruction of an instruction sequence whose execution completed in the past, and determines the insertion position of the prefetch instruction in the instruction sequence based on the information in this table.
[0062]
(Supplementary Note 4) The arithmetic processing unit according to Supplementary Note 1, wherein the prefetch target selection means, the prefetch instruction insertion position determination means, the address prediction means, and the prefetch instruction insertion means are provided in a cache memory.
[0063]
(Supplementary Note 5) The arithmetic processing unit according to Supplementary Note 3, wherein a plurality of execution instruction streams can be controlled, and the handler that processes the cache miss information is executed as an instruction stream independent of a main instruction stream.
[0064]
(Supplementary Note 6) An arithmetic processing method of dynamically inserting, into an instruction sequence, and executing a prefetch instruction instructing that data be transferred in advance from a main memory to a cache memory, the method comprising:
(a) a step of selecting, from among instructions causing cache misses, an instruction to be subjected to prefetch processing;
(b) a step of predicting a memory access address at the time of execution of the instruction selected for prefetch processing in step (a);
(c) a step of determining an insertion position, in the instruction sequence, of a prefetch instruction corresponding to the instruction selected for prefetch processing in step (a); and
(d) a step of inserting, at the insertion position determined in step (c), a prefetch instruction having the memory access address predicted in step (b) as an operand.
[0065]
(Supplementary Note 7) The arithmetic processing method according to Supplementary Note 6, wherein step (a) counts the number of times a cache miss occurs for each instruction address, starts a handler that processes cache miss information when the count exceeds a certain value, and selects, by means of the handler, the instruction to be subjected to prefetch processing.
[0066]
(Supplementary Note 8) The arithmetic processing method according to Supplementary Note 6 or Supplementary Note 7, wherein step (c) uses a table that holds an instruction address and an execution completion time for each instruction of an instruction sequence whose execution completed in the past, and determines the insertion position of the prefetch instruction in the instruction sequence based on the information in the table.
[0067]
(Supplementary Note 9) The arithmetic processing method according to Supplementary Note 8, further comprising a step of controlling a plurality of execution instruction streams and causing the handler that processes the cache miss information to be executed as an instruction stream independent of the main instruction stream.
[0068]
[Effect of the invention]
As described above, according to the present invention, in an arithmetic processing unit equipped with a cache memory, a prefetch instruction is inserted into the instruction stream based on run-time information. This makes it possible to perform prefetching efficiently while minimizing problems associated with conventional prefetch methods, such as cache pollution.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the outline of the operation of an arithmetic processing unit that performs prefetch instruction insertion processing according to the present invention.
FIG. 2 is a diagram illustrating a configuration example of an arithmetic processing apparatus according to the present invention.
FIG. 3 is a diagram illustrating a configuration of a table that stores the number of times a cache miss has occurred for each instruction address.
FIG. 4 is a diagram showing a configuration of an execution history table.
FIG. 5 is a diagram showing a sample program.
FIG. 6 is a diagram showing contents of an execution history table together with an instruction stream when a sample program is executed.
FIG. 7 is a diagram showing a basic configuration of an example in which the present invention is applied to a system adopting an SMT architecture.
[Explanation of symbols]
10 ... Main memory
12 ... Instruction cache
14 ... Data cache
20 ... Instruction fetch unit
22a to 22d ... Prefetch instruction insertion address register
24 ... Program counter (PC)
26a to 26d ... Comparator
28 ... Instruction queue
30 ... Decoder
32 ... Branch instruction processing unit
34 ... Arithmetic instruction processing unit
36 ... Load / store instruction processing unit
40 ... Register file
42 ... Event counter
50 ... Execution history table
60 ... Address prediction unit
120 ... Instruction fetch unit
121a, 121b, 121c ... Sequencer
128a, 128b, 128c ... Instruction queue

Claims

An arithmetic processing unit that employs an architecture capable of simultaneously controlling a plurality of execution instruction streams, and that dynamically inserts, into an instruction sequence, a prefetch instruction instructing that data be transferred in advance from a main memory to a cache memory, the arithmetic processing unit comprising:
prefetch target selection means for selecting, from among instructions causing cache misses, an instruction to be subjected to prefetch processing;
address prediction means for predicting a memory access address at the time of execution of the instruction selected for prefetch processing by the prefetch target selection means;
prefetch instruction insertion position determination means for determining an insertion position, in the instruction sequence, of a prefetch instruction corresponding to the instruction selected for prefetch processing by the prefetch target selection means; and
prefetch instruction insertion means for inserting, at the insertion position determined by the prefetch instruction insertion position determination means, a prefetch instruction having the memory access address predicted by the address prediction means as an operand,
wherein the prefetch target selection means includes a counter that counts the number of times a cache miss occurs for each instruction address, activates a handler that processes cache miss information when the value of the counter exceeds a certain value, and selects, by means of the handler, the instruction to be subjected to prefetch processing, and
wherein the prefetch target selection means causes the handler that processes the cache miss information to be executed as an instruction stream independent of a main instruction stream.

The arithmetic processing unit according to claim 1, wherein the prefetch instruction insertion position determination means includes a table that holds an instruction address and an execution completion time for each instruction of an instruction sequence whose execution completed in the past, and determines the insertion position of the prefetch instruction in the instruction sequence based on the information in the table.

The arithmetic processing unit according to claim 1, wherein the prefetch target selection means, the prefetch instruction insertion position determination means, the address prediction means, and the prefetch instruction insertion means are provided in a cache memory.