JP4066838B2 - Shared resource conflict detector and shared resource conflict detection method

Info

Publication number: JP4066838B2
Application number: JP2003041575A
Authority: JP
Inventors: 周史山村; 耕一久門
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-02-19
Filing date: 2003-02-19
Publication date: 2008-03-26
Anticipated expiration: 2023-02-19
Also published as: JP2004252670A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の論理ＣＰＵから構成されるマルチスレッドプロセッサにおいて、論理ＣＰＵ間で共有リソースの競合状態を発見するための検出器および検出方法に関する。
【０００２】
【従来の技術】
一般に、マルチプロセッサシステムにおいてプロセッサ間で共有するリソースに対して排他的にアクセスするために、「クリティカルリージョン」と呼ぶプログラム領域を設けることが多い。クリティカルリージョンは、同時に1 つのプロセッサしか実行することができないプログラム領域である。この領域内でプロセッサ間の共有リソースにアクセスすることで、並列プログラムを実行中のデータの整合性を保つ。
【０００３】
実際には、複数のプロセッサを搭載したシステムにおいて、プロセッサがクリティカルリージョンに入って良いかどうか、すなわち他のプロセッサがクリティカルリージョンを走行していないかどうかを判断するために、「スピンロック変数」と呼ぶ変数を設ける。
例えば、スピンロック変数には、
・あるプロセッサがクリティカルリージョンを走行中の場合は「１」
・いずれのプロセッサも走行していない場合は「０」
がセットされる。
【０００４】
この実装では、あるプロセッサがクリティカルリージョンを走行している間に他のプロセッサがクリティカルリージョンに入ろうとした場合には、スピンロック変数が「０」になるまでスピンロック変数値を繰り返しチェックする必要が生じる。このループ処理を「スピンループ」と呼ぶ。スピンループは、排他制御の簡潔な実装方法として、マルチプロセッサシステム上で動作するソフトウェアにおいて多用されている。
【０００５】
しかし、マルチスレッドプロセッサにおいては大きな問題が発生する（マルチスレッドプロセッサについては、非特許文献１参照）。即ち、ある論理ＣＰＵ上でスピンループしているスレッドが論理ＣＰＵ間で共有している演算リソースを奪ってしまうために、計算処理を行っている他のスレッドの実効性能が大きく低下してしまうことがある（例えば、非特許文献２、３参照）。
【０００６】
また、マルチスレッドプロセッサでは、スピンロックのような演算リソースの奪い合い（競合）だけに留まらず、論理ＣＰＵ間で共有しているその他のリソース（例えば、Ｉｎｔｅｌ製Ｘｅｏｎプロセッサ（非特許文献４参照）の場合であれば、１次２次キャッシュメモリやＴＬＢ（Translation Look-aside Buffer ）が論理ＣＰＵ間で共有される) における競合もまた性能低下を招く原因となる（例えば、非特許文献５参照）。
【０００７】
次に、リソースの競合部分の発見について、従来の技術を述べる。
プログラムの実行において、「プログラム中のどの部分で最も時間を消費したか」といった統計情報を採取する作業を性能プロファイリングという。性能プロファイリングを行う最も基本的で広く利用されている手法として、ＰＣサンプリングが挙げられる（例えば、非特許文献６参照）。
【０００８】
ＰＣ（プログラムカウンタ) サンプリングとは、ある一定間隔ごとにプログラムのどの部分を実行していたかを記録し、プログラム実行後にそれらサンプリングデータについて統計処理を施すことで性能プロファイリングを行う。実際には、イベント計測カウンタとカウンタのオーバフロー割り込みとを組み合わせることでＰＣサンプリングを既存プロセッサ上で実現している。
【０００９】
例えば、Ｉｎｔｅｌプロセッサに搭載されている性能モニタリングカウンタ（例えば、非特許文献５参照）は上記のような機能を有しているイベント計測カウンタである。しかし、従来のイベント計測カウンタを用いた場合では、ある特定のイベント（例えばタイムベースや実行命令数など）によるサンプリングを行うことは可能であるが、スピンループのような複数の命令の動作の組み合せによって引き起こされるイベントを計測することには対応できない。
【００１０】
また、マルチプロセッサ／マルチスレッドプロセッサ向きのイベント計測カウンタが提案されている（例えば、特許文献１、２参照）。しかしこれらの提案は、いずれもプロセッサ上で走行するスレッド別のカウントを可能とするものであったり、すべてのスレッドでの走行時間の合計を記録する、といった機能しか持たない。このような機能を利用したサンプリング測定では、すべてのスレッドが活動している部分を特定することは可能であると考えられるが、性能プロファイリングで重要なのは、すベてのスレッドが活動しているとして、それらがどのような動作（例えばスピンロック）を行っていたかを判断することである。この点において、上記のいずれの手法も、単純に実行された命令数を数えたり、キャッシュミスイベントの発生回数をカウントしたりといった、単一イベントの発生回数をカウントすることしか行うことができず、論理ＣＰＵ間での関連性を考察する上では不十分と言える。
【００１１】
以上の方式の他にも、命令そのものをプロファイルする技術「ＰｒｏｆｉｌｅＭｅ」（例えば、非特許文献７参照）と呼ぶ方法も提案されている。しかし、この方式は、命令一つずつに識別子をセットして命令そのものの実行遅延を測定するものであり、スピンループのような複数の命令から構成されるループ処理をチェックすることはできない。
【００１２】
さらに、ある一定時間毎にプロセッサの動作状態をチェックする「ＷａｔｃｈＤｏｇタイマ」と呼ぶ機能が知られている。この機能を応用すれば、スピンループについては発生場所を特定できる可能性がある。しかし、この手法では、プログラム中に出現するスピンループ以外のループ処理との識別が困難であり、また、検出できたとしても現在スピンループが発生している１箇所を特定するのみで、性能プロファイリングのような統計処理には適用できない。加えて、上記の手法はループ処理に対してのみ利用可能であり、本発明での課題とする共有リソース競合を検出するために応用することは困難である。
【００１３】
なお、ここで論理ＣＰＵについて定義を簡単に行う。マルチスレッドプロセッサ内部には、独立した複数の命令流を制御するために、
（１）命令制御部および命令実行ステートを保持するレジスタ群
（２）上記（１）の間で共有されている演算器等
が存在する。ここでは独立した命令流を実行する上で必要となる（１）と（２）の組合せを論理ＣＰＵと呼ぶこととする。一方、物理的なプロセッサ全体のことを「物理ＣＰＵ」と呼ぶこととする。
【００１４】
また、「スレッド」とはＯＳあるいはハードウェアで認識できる実行コンテキストを持つ一連の実行命令列のこととする。
【００１５】
【非特許文献１】
"Simultaneous Multithreading: Maximizing On-Chip Parallelism", Dean M.Tullsen,Susan J.Eggers,and Henry M.Levy,In Proc. of 22nd Annual Interna-tional Symposium on Computer Architecture,pp.392-403,June 1995.
【００１６】
【非特許文献２】
"Using Spin-Loops on Intel Pentium4 Processor and Intel Xeon ProcessorVersion 2.1",May 2001,Order Number 248674-002.
【００１７】
【非特許文献３】
"Introduction to Next Generation Multiprocessing:Hyper-Threading Tech-nology",http://www.intel.com/technology/hyperthread/intro nexgen/.
【００１８】
【非特許文献４】
"Hyper-Threading Technology Architecture and Microarchitecture",
Deborah T.Marr, et al.,Intel Technology Journal,Volume.6,Issue.1,Febru- ary 2002.
【００１９】
【非特許文献５】
"IA-32 Intel Architecture Software Developer's Manual Volume 3 System Programming Guide",September,2002,Order Number 245472-009, p.7-40.
【００２０】
【非特許文献６】
"Measuring Computer Performance A Practitioner's Guide",David J.Lilja,Cambridge University Press,New York,NY,2000.
【００２１】
【非特許文献７】
"ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors",Jeffrey Dean,James E.Hicks,Carl A.Waldspurger,WilliamE.Weihl,George Chrysos,International Symposium on Microarchitecture,1997
【００２２】
【特許文献１】
特開平１０−２７５１００号公報（第１頁、図１）
【００２３】
【特許文献２】
特開平９−２３７２０３号公報（第１頁）
【００２４】
【発明が解決しようとする課題】
前述のように、マルチスレッドプロセッサにおいては、ある論理プロセッサ上でのプログラムの実行速度は、他の論理プロセッサ上でのプログラムの動作状況に大きな影響を受ける。特に、一方の論理プロセッサ上でスピンループが実行されていた場合には、他方の論理プロセッサ上におけるプログラムの実効性能が大きく低下してしまう可能性がある。しかし、このようなスピンループがプログラム実行中の何時、どの部分で発生しているかを検出することはこれまで困難であると考えられていた。
【００２５】
また、マルチスレッドプロセッサ内部に装備されている共有リソース( 例えばキャッシュメモリ) において、各論理ＣＰＵ間で頻繁に競合が発生する場合には、大幅な性能低下を招く危険がある。しかし、このような競合が頻繁に発生していると考えられる場所を特定することもまた困難であつた。
マルチスレッドプロセッサの性能を十分に発揮するためには、上記のような性能劣化の要因となるスピンループおよび共有リソースでの競合を簡単、かつ的確に発見することが課題である。
【００２６】
【課題を解決するための手段】
上記課題を解決するため、本発明のマルチスレッドプロセッサにおける共有リソースの競合検出器は以下のように構成される。
（１）第１の発明
第１の発明の原理を図１を用いて説明する。本発明の検出器は、イベント取得手段１、カウント手段２および割り込み手段３から構成する。
【００２７】
イベント取得手段１は、マルチスレッドプロセッサの中で実行中の複数のＣＰＵ（論理ＣＰＵ）からコマンドの実行に伴って発生するイベント（実行したイベントの種類）を取得する。
カウント手段２は、取得したイベントが予め登録してあるイベントパターンと比較して等しいときカウンタのカウントアップを行う。例えば、登録してあるイベントパターンが、イベントＡ、イベントＢ、イベントＣの順に登録してあるとき、イベント取得手段１で取得したイベントの順序とイベントの種類が登録のイベントパターンと同一であればカウンタを１つカウントアップするものである。
【００２８】
割り込み手段３は、カウント手段２によってカウントアップされたカウント値が所定の値となったとき、共有リソースの競合が発生したと判断してイベントを発生しているＣＰＵに割り込みを掛けることを行う。
第１の発明は、論理ＣＰＵ間で共有リソースの競合が起こった場合に、特徴的なイベントパターンが発生することに着目して競合の発生していることを検出するものである。本発明の構成は前述のとおりであるが、本発明を図２のように模式的に示すことができる。図２の論理ＣＰＵｘで実行中のプログラムの命令は、例えば既存のＩｎｔｅｌ製プロセッサにおける命令セットアーキテクチャを用いて図３の（ａ）に示されるようなものである。また図２の登録したイベントパターン内のイベントは同様にＩｎｔｅｌ製プロセッサの性能モニタリング用イベントを用いて示している。図３（ａ）のプログラムを実行することにより発生するイベントは、図３の（ｂ）に示されるもので、このイベントと登録したイベントパターンとを比較し、一致した場合にはカウンタをインクリメントする。カウンタ値が所定の値になった場合にイベントを発生しているＣＰＵ＃ｘに割り込みを発生させる。
【００２９】
第１の発明によれば、マルチスレッドプロセッサにおける共有リソース競合の発生を検知できる。
（２）第２の発明
登録するイベントパターンは、イベントとこのイベントに対応付けてイベント発生元（スピンループを発生している論理ＣＰＵ）とするものである。これにより、論理ＣＰＵ間での競合状態を検出することが可能となる。例えば、図４（ａ）のように、論理ＣＰＵ＃０がキャッシュミスを発生した直後に、論理ＣＰＵ＃１がキャッシュミスを発生させた場合には互いにキャッシュメモリヘのアクセスについて競合状態にある可能性が高い。このようにイベントの発生元を特定することで、共有リソースでの競合がプログラム実行時のどの部分において発生し、悪影響を及ぼしているかを判断することができる。例えば論理ＣＰＵ＃０と論理ＣＰＵ＃１は、図４（ｂ）に示すイベントを発生することが予想され、これをイベントパターンとして登録しておく。
（３）第３の発明
割り込み手段において、カウンタが所定値になったとき、即ち共有リソースの競合状態を検出したとき、論理ＣＰＵの割り込みを掛けると共に当該論理ＣＰＵのプログラムカウンタの値をサンプリングするものである。これにより、プログラムのどの部分で他方の論理ＣＰＵでスピンループが発生していたか、というＰＣサンプリングによるプロファイリングを行うことができる。
（４）第４の発明
割り込み手段において、カウンタが所定値になったとき、即ち共有リソースの競合状態を検出したとき、競合を発生している当該スレッドとは別の実行状態にあるスレッドを優先的にスケジューリングする。または、競合を発生する原因となっているスレッドの実行を休止（停止）させる。これにより、共有リソース競合の発生を抑えることができる。
（５）第５の発明
本発明の共有リソースの検出方法、イベント取得手順、カウント手順および割り込み手順から構成する。イベント取得手順は、マルチスレッドプロセッサの中で実行中の複数の論理ＣＰＵからコマンドの実行に伴って発生するイベントを取得する。カウント手順は、取得したイベントが予め登録してあるイベントパターンと比較して等しいときカウンタのカウントアップを行う。割り込み手順は、カウント手順によってカウントアップされたカウント値が所定の値となったとき、そのイベントを発生しているＣＰＵに割り込みを掛けることを行う。これにより、マルチスレッドプロセッサにおける共有リソース競合の発生を検知できる。
【００３０】
【発明の実施の形態】
次に、本発明について図面を参照して実施形態を説明する。
（実施形態その１）
実施形態その１は、2 つの論理ＣＰＵで構成されたマルチスレッドプロセッサにおいて、一方の論理ＣＰＵが５つの関数（Ａ，Ｂ，Ｃ，Ｄ，Ｅ）から構成されているプログラムを実行しているものとする。このとき、当該プログラムのどの部分が、論理ＣＰＵ＃１上で実行されたスピンループの影響を受けているかを検出する例を示す。
【００３１】
図５に、本発明のリソース競合の検出機能を有するマルチスレッドプロセッサの基本的な構成を示す。この例は、論理ＣＰＵが２つで構成されているマルチスレッドプロセッサである。各構成要素は、プログラム４０からコマンド取り出す命令フェッチユニット１１、スレッドを制御する命令シーケンサ１２、演算器を選択するＳＵ１３、算術／論理演算ユニットであるＡＬＵ１４、浮動小数点加算器のＦＰＡ１５、乗算器ＦＰＭ１６、割算器ＦＰＤ１７、ロ一ドストアユニットＬＤ／ＳＴ１８、命令シーケンサ１２に対応したレジスタセットＲＥＧ２０、命令の終了処理を行うＲｅｔｉｒｅｍｅｎｔＵｎｉｔ１９、リソースの競合を検出するイベント比較ユニット３０から成る。本発明の中心は、イベント比較ユニット３０にあるので、マルチスレッドプロセッサを構成する上での他の部分については省略してある。
【００３２】
イベント比較ユニットは、イベントパターンを格納するレジスタＰＴＲＮＲＥＧＩＳＴＥＲＳ３５を持ち、ここに検出すべきイベントパターンが登録されている。この例の場合、最大６個のイベント発生シーケンスを検出することができ、イベント発生元とイベントとを登録している。イベント比較ユニット内部には、ＰＴＲＮＩＮＤＥＸＲＥＧＩＳＴＥＲ３４があり、これを用いてＰＴＲＮＲＥＧＩＳＴＥＲＳ３５内のどのイベントを現在比較しているかを示す。発生したイベントは、イベントフェッチユニット３１を通してイベント比較ユニットに投入され、登録されたイベントパターンと比較器３２で比較される。一致した場合は、カウンタ３３をカウントアップし、カウンタ３３がオーバーフローしたとき、競合が発生したと判断して割り込み信号を発生させる。カウンタ３３は例えば４０ビットで構成する。
【００３３】
また図５には、スピンループを行うプログラム４０の例を表示している。このプログラムが論理ＣＰＵ＃１で実行されているとし、この場合、図５のイベント発生パターンのようなシーケンスでイベントが発生する。この実施例では、図６のように論理ＣＰＵ＃１で発生したカウンタのオーバフロー割り込みを論理ＣＰＵ＃０に発生させることにする。そして、論理ＣＰＵ＃０に対して割り込みが発生した時に実行される割り込みハンドラ内部でＰＣサンプリングを行う。同時に、従来の技術で実現可能なタイムベース（クロックベース) および実行完了命令数べ一スでのＰＣサンプリングも行うこととする。
【００３４】
次に、実施形態その１の処理フローについて図７をもとに説明する。まず、カウンタ３３のカウント値ＣＮＴ、および登録したイベントパターンの項目番号Ｉを初期化のため「０」にセットしておく。ＣＰＵ＃１でプログラム４０のコマンドを実行し、その実行にともなって発生するイベントをイベントフェッチユニット３１から取得する。（Ｓ１１〜Ｓ１４）。
【００３５】
登録のイベント項目番号Ｉをカウントアップし、取得したイベントがＩ番目の登録イベントパターンと一致するかを調べ、一致すればそれが６番目のイベントかどうかを調べる。６番目のイベントであれば６個の発生シーケンスからなる登録イベントパターンと一致したことになるので、カウンタ値”ＣＮＴ”をカウントアップする。次にカウンタがオーバーフローしていなければ、Ｓ１２に戻りイベントの取得を行うことを繰り返す。６番目のイベントでなかった場合もＳ１２に戻りイベントの取得の繰り返しを行う。（Ｓ１５〜Ｓ１９）。
【００３６】
カウンタ３３がオーバーフローした場合に競合発生と判断し、ＣＰＵ＃０に割り込みを掛けるとともにレジスタセットＲＥＧ２０の一つであるＰＣカウンタの値をサンプリングする。（Ｓ２０）。
図８にＰＣサンプリングの結果例を示す。これは、１ＧＨｚで動作するＣＰＵにおいて論理ＣＰＵ＃０で１秒間ＰＣサンプリングを行った結果である。図７には、従来の（ａ）タイムベース（クロックベース) サンプリングを行った場合、（ｂ）実行完了命令数べ一スによるサンプリングを行った場合、（ｃ）スピンロック検出イベントによるサンプリングを行った場合のそれぞれについて、ＰＣサンプリングによる各関数の出現比率を示したものである。簡単のため、イベントサンプリングは、各イベントが発生するたびに行われるものとする。
【００３７】
（ａ）、（ｂ）は、従来のイベント計測カウンタによるプロファイリング結果である。通常ソフトウェア開発者は、（ａ）のプロファイリング結果から、関数Ｃが最も多くの時間を費やしているので、これが性能向上のボトルネックとなっていると判断する。この場合、関数ＣのＣＰＩ（Cycles Per Instructions)は、
【００３８】
【数１】

【００３９】
となる。ここで、（ｃ）より論理ＣＰＵ＃０が関数Ｃを実行中に、論理ＣＰＵ＃１においてスピンロックが
【００４０】
【数２】

【００４１】
回発生していることがわかる。検出される１つのスピンロックは６命令から構成されているので、総計
【００４２】
【数３】

【００４３】
命令が論理ＣＰＵ＃１においてスピンロックのために実行されたこととなる。したがって、大まかにいえば、論理ＣＰＵ＃１でスピンロック処理が行われていないと仮定すると、論理ＣＰＵ＃０での実行命令数がＩｓ増えることが期待できる。したがって、スピンロックの影響を取り除いた関数ＣのＣＰＩは、
【００４４】
【数４】

【００４５】
となり、関数Ｃについては約３２％の高速化が見込める。
このように、本発明によりスピンロックの影響を受けて実行時間が大幅に増加している部分を検出することが可能となる。
（実施形態その２）
２つの論理ＣＰＵで構成されたマルチスレッドプロセッサにおいて、論理ＣＰＵ上で実行されているスピンループを検出し、それに対応してスレッドスケジューリングを行うことでプロセッサの実効性能を向上するシステム例を示す。
【００４６】
図９にスレッドスケジューリングを行うマルチスレッドプロセッサの構成例を示す。図９の構成要素は図５と同様で、イベント比較ユニット３０からスケジューリングユニット１３および命令シーケンサ１２に対して比較結果を転送するためのデータパスが設けられている。
ここで、論理ＣＰＵ＃０においては通常の計算処理を行うスレッド、また論理ＣＰＵ＃１においてスピンロックを行うスレッドが実行されるものとする。このとき、それぞれのスレッドに含まれる命令は対応する命令シーケンサＡおよびＢから発行される。
【００４７】
通常時、スケジューリングユニット１３は２つの命令シーケンサから発行された命令を、同一優先度の基で実行ユニットに対して投入するものとする。ここで、イベント比較ユニット３０によってスピンロックが検出された場合には、スピンロックを実行しているスレッドよりも、他方スレッドを優先的にスケジューリングする。すなわち、スピンロックを行う命令シーケンサＢから発行された命令よりも、命令シーケンサＡが発行した命令を優先的に実行ユニットに投入する。
【００４８】
本発明により、上記のようなプログラムの実行状況に対応した動的な命令スケジューリング装置が実現可能であり、マルチスレッドプロセッサの命令実効性能を向上させることができる。
（付記１）複数のＣＰＵを有するマルチスレッドプロセッサの共有リソースの競合検出器であって、
前記複数のＣＰＵがコマンドの実行によって発生するイベントを取得するイベント取得手段と、
取得した前記イベントと、予め登録したイベントパターンとを比較し、一致したときカウンタをカウントアップするカウント手段と、
前記カウンタのカウント値が所定の値になったとき、前記イベントを発生したＣＰＵに割り込みを掛ける割り込み手段と
を有することを特徴とする共有リソースの競合検出器。
【００４９】
（付記２）前記登録したイベントパターンは、イベントと前記イベントに対応付けたイベント発生元であり、
前記割り込み手段は、前記カウンタのカウント値が所定の値になったとき、前記イベントを発生したＣＰＵ、または登録した前記イベント発生元のＣＰＵに割り込みを掛ける
ことを特徴とする付記１に記載の共有リソースの競合検出器。
【００５０】
（付記３）前記割り込み手段は、前記カウンタのカウント値が所定の値になったとき、前記イベントを発生したＣＰＵに割り込みを掛け、プログラムカウンタの値をサンプリングする
ことを特徴とする付記１または付記２に記載の共有リソースの競合検出器。
（付記４）前記割り込み手段は、前記カウンタのカウント値が所定の値になったとき、スレッドのスケジューリング処理を行う
ことを特徴とする付記１または付記２に記載の共有リソースの競合検出器。
【００５１】
（付記５）複数のＣＰＵを有するマルチスレッドプロセッサの共有リソースの競合検出方法であって、
前記複数のＣＰＵがコマンドの実行によって発生するイベント発生パターンを取得するイベントパターン取得手順と、
取得した前記イベント発生パターンと、予め登録した登録イベントパターンとを比較し、一致したときカウンタをカウントアップするカウント手順と、
前記カウンタのカウント値が所定の値になったとき、前記イベント発生パターンを発生したＣＰＵに割り込みを掛ける割り込み手順と
を有することを特徴とする共有リソースの競合検出方法。
【００５２】
（付記６）前記登録したイベントパターンは、イベントと前記イベントの発生時間間隔である
ことを特徴とする付記１または付記２に記載の共有リソースの競合検出器。
（付記７）前記割り込み手段は、前記カウンタのカウント値が所定の値になったとき、前記イベントを発生したＣＰＵを一時停止するように割り込みを掛ける
ことを特徴とする付記１、付記２または付記６に記載の共有リソースの競合検出器。
【００５３】
【発明の効果】
本発明は、マルチスレッドプロセッサにおいて、イベント発生パターンを認識することができるカウンタを使用することで、特徴的なイベント発生パターンを持つスピンループによる演算リソース競合やキャッシュメモリの利用競合、といった論理ＣＰＵ間での共有リソースの奪い合いが発生する場所を効率的に特定することが可能となる。この情報は、マルチスレッドプロセッサ上で動作するプログラムの最適化を行う技術をサポートする。
【図面の簡単な説明】
【図１】本発明の原理図である。
【図２】第１の発明の模式図である。
【図３】スピンループ実装プログラム例と発生したイベントパターン例である。
【図４】キャッシュメモリアクセスにおける競合の検出例である。
【図５】実施形態その１の構成例である。
【図６】イベント発生元の識別による競合検出例である。
【図７】実施形態その１のフロー例である。
【図８】ＰＣサンプリング例である。
【図９】実施形態その２の構成例である。
【符号の説明】
１：イベント取得手段
２：カウント手段
３：割り込み手段
１０：マルチスレッドプロセッサ
１１：命令フェッチユニット
１２：命令シーケンサ
１３：選択ユニット
１４：算術／論理演算ユニット
１５：浮動小数点加算機
１６：乗算器
１７：割算器
１８：ロード／ストアユニット
１９：リタイアメントユニット
２０：レジスタセット
３０：イベント比較ユニット
３１：イベントフェッチユニット
３２：比較器
３３：カウンタ
３４：パターン索引レジスタ
３５：パターンレジスタ
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a detector and a detection method for finding contention for shared resources between logical CPUs in a multi-thread processor composed of a plurality of logical CPUs.
[0002]
[Prior art]
In general, a program area called a “critical region” is often provided so that the processors in a multiprocessor system can access shared resources exclusively. A critical region is a program area that only one processor can execute at a time. Accessing the resources shared between processors only inside this region maintains data consistency while a parallel program is running.
[0003]
In practice, in a system equipped with multiple processors, a variable called a “spinlock variable” is provided to determine whether a processor may enter the critical region, that is, whether another processor is currently running in the critical region.
For example, the spinlock variable is set to:
・ “1” if some processor is running in the critical region
・ “0” if no processor is running in it
[0004]
In this implementation, when one processor is running in the critical region and another processor tries to enter it, the latter must repeatedly check the spinlock variable until its value becomes “0”. This loop is called a “spin loop”. Spin loops are widely used in software running on multiprocessor systems as a simple way to implement mutual exclusion.
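The spinlock variable and spin loop described above can be sketched as follows. This is a minimal illustration, not code from the patent: the class `SpinLock`, the helper `run`, and the use of a `threading.Lock` as a stand-in for an atomic test-and-set instruction are all assumptions made for the example.

```python
import threading


class SpinLock:
    """Spinlock variable: 1 while a thread is in the critical region, else 0."""

    def __init__(self):
        self.value = 0                    # the spinlock variable
        self._atomic = threading.Lock()   # assumption: stands in for an atomic test-and-set
        self.spins = 0                    # approximate number of spin-loop iterations

    def _test_and_set(self):
        # Atomically read the old value and set the variable to 1.
        with self._atomic:
            old = self.value
            self.value = 1
            return old

    def acquire(self):
        # Spin loop: keep checking until the variable was 0 when we set it.
        while self._test_and_set() == 1:
            self.spins += 1

    def release(self):
        # Leave the critical region by resetting the variable to 0.
        with self._atomic:
            self.value = 0


def run(workers=2, iterations=200):
    """Have several threads update a shared counter inside the critical region."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1     # critical region: access to the shared resource
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0], lock.spins


total, spins = run()
```

Every iteration counted in `spins` is work that produces no result, which is why a spinning thread is costly when it shares execution resources with another thread.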
[0005]
However, a serious problem arises in a multi-thread processor (see Non-Patent Document 1 for multi-thread processors). A thread spinning on one logical CPU takes up the execution resources shared between the logical CPUs, so the effective performance of other threads performing computation can drop significantly (see, for example, Non-Patent Documents 2 and 3).
[0006]
Furthermore, in a multi-thread processor the problem is not limited to contention for execution resources, as with spin locks; contention for the other resources shared between logical CPUs also degrades performance (see, for example, Non-Patent Document 5). In the Intel Xeon processor (see Non-Patent Document 4), for example, the primary and secondary cache memories and the TLB (Translation Look-aside Buffer) are shared between logical CPUs.
[0007]
Next, conventional techniques for finding where resource contention occurs will be described.
In the execution of a program, the work of collecting statistical information such as “where in the program the most time was consumed” is called performance profiling. The most basic and widely used technique for performing performance profiling is PC sampling (see Non-Patent Document 6, for example).
[0008]
PC (program counter) sampling records, at fixed intervals, which part of the program was being executed, and performs performance profiling by applying statistical processing to the sampled data after the program has finished. In practice, PC sampling is realized on existing processors by combining an event measurement counter with the overflow interrupt of that counter.
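The following sketch illustrates the idea of PC sampling driven by a counter overflow. It is a simplified model under stated assumptions: the per-cycle trace, the function names A, B, and C, and the helpers `pc_sample` and `profile` are hypothetical, and a real implementation would rely on hardware counters and an interrupt handler.

```python
from collections import Counter


def pc_sample(trace, interval):
    """Sample the executing function each time the counter reaches `interval`."""
    samples = []
    counter = 0
    for pc in trace:
        counter += 1
        if counter == interval:   # counter "overflow": the handler records the PC
            samples.append(pc)
            counter = 0
    return samples


def profile(samples):
    """Statistical processing: the share of samples that fell in each function."""
    counts = Counter(samples)
    return {name: counts[name] / len(samples) for name in counts}


# Hypothetical trace: function A runs for 20 cycles, B for 60, C for 120.
trace = ["A"] * 20 + ["B"] * 60 + ["C"] * 120
shares = profile(pc_sample(trace, interval=10))
```

With this trace, the sampled shares reflect the fraction of cycles spent in each function, which is the kind of per-function ratio a time-base profile reports.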
[0009]
For example, the performance monitoring counters built into Intel processors (see, for example, Non-Patent Document 5) are event measurement counters with these capabilities. With a conventional event measurement counter, however, sampling can be driven by one specific event (for example, a time base or the number of executed instructions), but events that arise from the combined behavior of multiple instructions, such as a spin loop, cannot be measured.
[0010]
Event measurement counters for multiprocessors and multi-thread processors have also been proposed (see, for example, Patent Documents 1 and 2). However, these proposals only make it possible to count separately for each thread running on the processor, or to record the total running time over all threads. Sampling with such functions can probably identify the parts where all threads are active, but what matters in performance profiling is to determine, given that all threads are active, what operations (for example, spin locks) they were performing. In this respect, all of the above methods can only count occurrences of a single event, such as the number of executed instructions or the number of cache miss events, and are insufficient for examining the relationship between logical CPUs.
[0011]
In addition to the above method, a method called “ProfileMe” (see, for example, Non-Patent Document 7) that profiles an instruction itself has also been proposed. However, in this method, an identifier is set for each instruction to measure the execution delay of the instruction itself, and it is not possible to check a loop process composed of a plurality of instructions such as a spin loop.
[0012]
Furthermore, a function called a “WatchDog timer”, which checks the operating state of the processor at fixed intervals, is known. Applying this function might make it possible to locate where a spin loop occurs. However, with this method it is difficult to distinguish spin loops from the other loops that appear in a program, and even when a spin loop is detected, the method only identifies the one location where it is currently occurring, so it cannot be applied to statistical processing such as performance profiling. In addition, the method can be used only for loop processing and is difficult to adapt to detecting the shared resource contention addressed by the present invention.
[0013]
Here, the term “logical CPU” is briefly defined. To control multiple independent instruction streams, a multi-thread processor contains:
(1) instruction control units and register groups that hold the instruction execution state
(2) arithmetic units and the like that are shared among the units in (1)
Here, the combination of (1) and (2) needed to execute one independent instruction stream is called a logical CPU. The physical processor as a whole, on the other hand, is called a “physical CPU”.
[0014]
A “thread” is a sequence of executed instructions that has an execution context recognizable by the OS or the hardware.
[0015]
[Non-Patent Document 1]
"Simultaneous Multithreading: Maximizing On-Chip Parallelism", Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy, In Proc. of 22nd Annual International Symposium on Computer Architecture, pp. 392-403, June 1995.
[0016]
[Non-Patent Document 2]
"Using Spin-Loops on Intel Pentium4 Processor and Intel Xeon Processor Version 2.1", May 2001, Order Number 248674-002.
[0017]
[Non-Patent Document 3]
"Introduction to Next Generation Multiprocessing: Hyper-Threading Technology", http://www.intel.com/technology/hyperthread/intro nexgen/.
[0018]
[Non-Patent Document 4]
"Hyper-Threading Technology Architecture and Microarchitecture",
Deborah T. Marr, et al., Intel Technology Journal, Volume.6, Issue.1, February 2002.
[0019]
[Non-Patent Document 5]
"IA-32 Intel Architecture Software Developer's Manual Volume 3 System Programming Guide", September, 2002, Order Number 245472-009, p.7-40.
[0020]
[Non-Patent Document 6]
"Measuring Computer Performance A Practitioner's Guide", David J. Lilja, Cambridge University Press, New York, NY, 2000.
[0021]
[Non-Patent Document 7]
"ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors", Jeffrey Dean, James E. Hicks, Carl A. Waldspurger, William E. Weihl, George Chrysos, International Symposium on Microarchitecture, 1997.
[0022]
[Patent Document 1]
Japanese Patent Laid-Open No. 10-275100 (first page, FIG. 1)
[0023]
[Patent Document 2]
Japanese Patent Laid-Open No. 9-237203 (first page)
[0024]
[Problems to be solved by the invention]
As described above, in a multi-thread processor, the execution speed of a program on a certain logical processor is greatly influenced by the operation status of the program on another logical processor. In particular, when a spin loop is executed on one logical processor, the effective performance of the program on the other logical processor may be greatly reduced. However, it has been considered difficult to detect when and where such a spin loop occurs during program execution.
[0025]
Further, when contention between logical CPUs occurs frequently in a shared resource (for example, the cache memory) provided inside the multi-thread processor, there is a risk of significant performance degradation. However, it has also been difficult to identify the places where such contention is thought to occur frequently.
To make full use of the performance of a multi-thread processor, the problem to be solved is to find, easily and accurately, the spin loops and the shared resource contention that cause the performance degradation described above.
[0026]
[Means for Solving the Problems]
In order to solve the above problems, the shared resource conflict detector in the multi-thread processor of the present invention is configured as follows.
(1) First invention
The principle of the first invention will be described with reference to FIG. 1. The detector of the present invention comprises event acquisition means 1, counting means 2, and interrupt means 3.
[0027]
The event acquisition means 1 acquires the events (the types of events that occurred) generated as commands are executed by the plurality of CPUs (logical CPUs) running in the multi-thread processor.
The counting means 2 counts up a counter when the acquired events match a previously registered event pattern. For example, when the registered event pattern is event A, event B, and event C in that order, the counter is incremented by one if the order and the types of the events acquired by the event acquisition means 1 are identical to the registered event pattern.
[0028]
When the count value counted up by the counting means 2 reaches a predetermined value, the interrupt means 3 determines that shared resource contention has occurred and interrupts the CPU that is generating the events.
The first invention detects contention by exploiting the fact that a characteristic event pattern arises when logical CPUs contend for a shared resource. The configuration is as described above, and it can be illustrated schematically as in FIG. 2. The instructions of the program being executed by logical CPU #x in FIG. 2 are, for example, those shown in FIG. 3A, written using the instruction set architecture of existing Intel processors. The events in the registered event pattern in FIG. 2 are likewise expressed using the performance monitoring events of Intel processors. Executing the program in FIG. 3A generates the events shown in FIG. 3B; these events are compared with the registered event pattern, and the counter is incremented when they match. When the counter value reaches the predetermined value, an interrupt is raised on CPU #x, which is generating the events.
[0029]
According to the first aspect, it is possible to detect the occurrence of shared resource contention in the multithread processor.
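The interplay of the three means can be sketched in software as follows. This is only an illustrative model of a mechanism the text describes as hardware: the class name `EventPatternCounter`, the callback `on_interrupt`, and the rule for restarting after a mismatch are assumptions, since the text does not spell out how a partial match is abandoned.

```python
class EventPatternCounter:
    """Counts complete occurrences of a registered event pattern."""

    def __init__(self, pattern, threshold, on_interrupt):
        self.pattern = list(pattern)       # registered event pattern, in order
        self.threshold = threshold         # the "predetermined value"
        self.on_interrupt = on_interrupt   # stands in for the interrupt means
        self.index = 0                     # position in the pattern expected next
        self.count = 0                     # number of complete matches so far

    def feed(self, event):
        """Event acquisition: take one event and update the match state."""
        if event == self.pattern[self.index]:
            self.index += 1
        else:
            # Assumption: on a mismatch, restart, letting this event
            # begin a new match if it equals the first registered event.
            self.index = 1 if event == self.pattern[0] else 0
        if self.index == len(self.pattern):    # whole pattern seen in order
            self.index = 0
            self.count += 1                    # counting means
            if self.count == self.threshold:
                self.count = 0
                self.on_interrupt()            # interrupt means


fired = []
detector = EventPatternCounter(["A", "B", "C"], threshold=2,
                               on_interrupt=lambda: fired.append("interrupt"))
for event in "ABCABXABC":
    detector.feed(event)
```

In this run the sequence A, B, C appears twice in full, while the partial sequence A, B, X is discarded, so the interrupt callback is invoked once.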
(2) Second invention
Each entry of the registered event pattern consists of an event and the event source associated with it (for example, the logical CPU that is executing a spin loop). This makes it possible to detect contention between logical CPUs. For example, as shown in FIG. 4A, if logical CPU #1 incurs a cache miss immediately after logical CPU #0 incurs a cache miss, the two are likely to be contending for access to the cache memory. Identifying the event source in this way makes it possible to determine at which point in program execution contention for a shared resource occurs and degrades performance. For example, logical CPU #0 and logical CPU #1 are expected to generate the events shown in FIG. 4B, and these are registered as an event pattern.
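A pattern whose entries carry the event source can be illustrated as below. The sketch assumes each event is represented as a `(logical CPU number, event type)` pair; the function `count_pattern` and the event name `"cache_miss"` are hypothetical and are not taken from FIG. 4.

```python
def count_pattern(events, pattern):
    """Count non-overlapping, contiguous occurrences of `pattern` in `events`."""
    matches, index = 0, 0
    for event in events:
        if event == pattern[index]:
            index += 1
        else:
            # Assumption: a mismatching event may itself start a new match.
            index = 1 if event == pattern[0] else 0
        if index == len(pattern):
            matches += 1
            index = 0
    return matches


# Registered pattern: a cache miss on CPU #0 followed at once by one on CPU #1.
pattern = [(0, "cache_miss"), (1, "cache_miss")]

interleaved = [
    (0, "cache_miss"), (1, "cache_miss"),   # CPU #1 misses right after CPU #0
    (0, "cache_miss"), (0, "cache_miss"),   # same CPU twice: not the pattern
    (1, "cache_miss"),                      # follows a CPU #0 miss: pattern again
    (1, "branch"),
]
single_cpu = [(0, "cache_miss")] * 4        # misses from one CPU only

contended = count_pattern(interleaved, pattern)
uncontended = count_pattern(single_cpu, pattern)
```

Because the source is part of each entry, a run of cache misses from a single logical CPU does not match, whereas alternating misses from the two logical CPUs do.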
(3) Third invention
When the counter reaches the predetermined value, that is, when contention for a shared resource is detected, the interrupt means interrupts the logical CPU and samples the value of that logical CPU's program counter. This enables profiling by PC sampling that shows in which part of the program a spin loop was occurring on the other logical CPU.
(4) Fourth invention
When the counter reaches the predetermined value, that is, when contention for a shared resource is detected, the interrupt means preferentially schedules a thread in the running state other than the thread that is causing the contention. Alternatively, it suspends (stops) execution of the thread that is causing the contention. This suppresses the occurrence of shared resource contention.
(5) Fifth invention
The shared resource contention detection method of the present invention comprises an event acquisition procedure, a counting procedure, and an interrupt procedure. The event acquisition procedure acquires the events generated as commands are executed by the plurality of logical CPUs running in the multi-thread processor. The counting procedure counts up a counter when the acquired events match a previously registered event pattern. The interrupt procedure interrupts the CPU that is generating the events when the count value counted up by the counting procedure reaches a predetermined value. This makes it possible to detect the occurrence of shared resource contention in the multi-thread processor.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
(Embodiment 1)
In Embodiment 1, in a multi-thread processor composed of two logical CPUs, one logical CPU is assumed to be executing a program composed of five functions (A, B, C, D, E). An example is shown of detecting which part of that program is affected by a spin loop executed on logical CPU #1.
[0031]
FIG. 5 shows a basic configuration of a multi-thread processor having a resource contention detection function according to the present invention. This example is a multi-thread processor having two logical CPUs. Each component includes an instruction fetch unit 11 that fetches a command from the program 40, an instruction sequencer 12 that controls a thread, an SU 13 that selects an arithmetic unit, an ALU 14 that is an arithmetic / logical arithmetic unit, an FPA 15 that is a floating-point adder, a multiplier FPM 16, A divider FPD 17, a load store unit LD / ST 18, a register set REG 20 corresponding to the instruction sequencer 12, a Retirement Unit 19 that performs instruction termination processing, and an event comparison unit 30 that detects resource contention. Since the center of the present invention lies in the event comparison unit 30, other parts of the multi-thread processor are omitted.
[0032]
The event comparison unit has registers, PTRN REGISTERS 35, that store the event pattern, and the event pattern to be detected is registered there. In this example, a sequence of up to six events can be detected, and each entry registers an event source and an event. Inside the event comparison unit is a PTRN INDEX REGISTER 34, which indicates which event in the PTRN REGISTERS 35 is currently being compared. Generated events are fed into the event comparison unit through the event fetch unit 31 and compared with the registered event pattern by the comparator 32. If they match, the counter 33 is counted up; when the counter 33 overflows, it is determined that contention has occurred and an interrupt signal is generated. The counter 33 is, for example, 40 bits wide.
[0033]
FIG. 5 also shows an example of a program 40 that performs a spin loop. Assume that this program is executed on logical CPU #1; in that case, events occur in a sequence like the event generation pattern shown in FIG. 5. In this embodiment, as shown in FIG. 6, the counter overflow interrupt caused by the events on logical CPU #1 is delivered to logical CPU #0. PC sampling is then performed inside the interrupt handler that runs when logical CPU #0 is interrupted. At the same time, PC sampling based on the time base (clock base) and on the number of completed instructions, both of which are possible with conventional technology, is also performed.
[0034]
Next, the processing flow of Embodiment 1 will be described with reference to FIG. 7. First, the count value CNT of the counter 33 and the item number I of the registered event pattern are initialized to “0”. CPU #1 executes the commands of the program 40, and the events generated by that execution are acquired from the event fetch unit 31 (S11-S14).
[0035]
The registered event item number I is incremented, and the acquired event is checked against the I-th entry of the registered event pattern; if it matches, it is checked whether it is the sixth event. If it is the sixth event, the events have matched the registered event pattern consisting of a six-event sequence, so the counter value CNT is counted up. If the counter has not overflowed, the process returns to S12 and event acquisition is repeated. If the event was not the sixth, the process likewise returns to S12 and event acquisition is repeated (S15-S19).
[0036]
When the counter 33 overflows, it is determined that contention has occurred; CPU #0 is interrupted and the value of the program counter (PC), one of the registers in the register set REG 20, is sampled (S20).
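The flow of steps S11 to S20 can be approximated by the sketch below. It is a model under several assumptions: the six event names are placeholders rather than the events shown in FIG. 5, the counter is made 1 bit wide instead of 40 so that it overflows quickly, the index I is zero-based, and the handling of a mismatch is a guess.

```python
def run_detector(events_cpu1, pc_trace_cpu0, pattern, counter_bits):
    """Return the PCs sampled on CPU #0 each time the match counter overflows."""
    cnt = 0        # CNT: value of counter 33
    i = 0          # I: index into the registered event pattern
    samples = []
    for step, event in enumerate(events_cpu1):       # acquire the next event
        if event == pattern[i]:                      # compare with the I-th entry
            i += 1
        else:
            i = 1 if event == pattern[0] else 0      # assumption: restart on mismatch
        if i == len(pattern):                        # the sixth event matched
            i = 0
            cnt += 1
            if cnt == 1 << counter_bits:             # counter overflow
                cnt = 0
                samples.append(pc_trace_cpu0[step])  # interrupt CPU #0, sample its PC
    return samples


# Hypothetical six-event spin-loop pattern and traces.
pattern = ["load", "compare", "branch", "pause", "compare", "jump"]
events_cpu1 = pattern * 8                  # CPU #1 spins eight times
pc_trace_cpu0 = ["A"] * 12 + ["C"] * 36    # function running on CPU #0 at each step

samples = run_detector(events_cpu1, pc_trace_cpu0, pattern, counter_bits=1)
```

With a 1-bit counter, every second complete match causes an overflow, so the eight spin-loop iterations yield four PC samples on logical CPU #0, most of which fall in function C in this hypothetical trace.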
FIG. 8 shows an example of PC sampling results, obtained by performing PC sampling for 1 second on logical CPU #0 in a CPU operating at 1 GHz. FIG. 8 shows the proportion of samples that fall in each function for (a) conventional time-base (clock-base) sampling, (b) sampling based on the number of completed instructions, and (c) sampling based on the spin lock detection event. For simplicity, it is assumed that a sample is taken each time the respective event occurs.
[0037]
(a) and (b) are profiling results obtained with conventional event measurement counters. From the profiling result in (a), a software developer would normally conclude that function C consumes the most time and is therefore the bottleneck for performance improvement. In this case, the CPI (Cycles Per Instruction) of function C is
[0038]
[Expression 1]

[0039]
Here, (c) shows that, while logical CPU #0 was executing function C, the number of spin locks that occurred on logical CPU #1 was
[0040]
[Expression 2]

[0041]
Since one detected spin lock consists of six instructions, a total of
[0042]
[Expression 3]

[0043]
instructions were executed on logical CPU #1 for spin locking. Roughly speaking, therefore, if no spin lock processing had been performed on logical CPU #1, the number of instructions executed on logical CPU #0 could be expected to increase by Is, the number of spin lock instructions above. The CPI of function C with the influence of the spin lock removed is therefore
[0044]
[Expression 4]

[0045]
Thus, about 32% speedup can be expected for the function C.
As described above, according to the present invention, it is possible to detect a portion where the execution time is significantly increased due to the influence of the spin lock.
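The reasoning in this estimate can be written out as a small calculation. The figures below are purely hypothetical and are not the values from FIG. 8 or from the omitted expressions; the function name `cpi_without_spinlock` and the choice to report the improvement as the relative reduction in CPI are also assumptions.

```python
def cpi_without_spinlock(cycles, instructions, spin_locks, instr_per_spin_lock=6):
    """Estimate the CPI of a function if the other CPU had not been spinning.

    cycles:        cycles spent in the function on logical CPU #0
    instructions:  instructions completed in the function on logical CPU #0
    spin_locks:    spin locks detected on logical CPU #1 in the same period
    """
    spin_instructions = spin_locks * instr_per_spin_lock    # Is
    cpi_before = cycles / instructions
    # Assumption from the text: CPU #0 could have executed Is more instructions.
    cpi_after = cycles / (instructions + spin_instructions)
    reduction = 1 - cpi_after / cpi_before
    return cpi_before, cpi_after, reduction


# Hypothetical numbers for illustration only.
cpi_before, cpi_after, reduction = cpi_without_spinlock(
    cycles=400_000_000, instructions=200_000_000, spin_locks=10_000_000)
```

For these made-up inputs the CPI falls from 2.0 to about 1.54, a reduction of roughly 23%; the same calculation with the measured values is what leads to the figure quoted in the text.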
(Embodiment 2)
An example of a system that improves the effective performance of a processor by detecting a spin loop being executed on a logical CPU and performing thread scheduling in response to the detection in a multi-thread processor composed of two logical CPUs will be described.
[0046]
FIG. 9 shows a configuration example of a multi-thread processor that performs thread scheduling. The components in FIG. 9 are the same as those in FIG. 5, and a data path is provided for transferring the comparison result from the event comparison unit 30 to the scheduling unit 13 and the instruction sequencer 12.
Here, it is assumed that a thread that performs normal calculation processing is executed in the logical CPU # 0, and a thread that performs spin lock is executed in the logical CPU # 1. At this time, the instructions included in each thread are issued from the corresponding instruction sequencers A and B.
[0047]
Normally, the scheduling unit 13 feeds the instructions issued from the two instruction sequencers to the execution unit with equal priority. When a spin lock is detected by the event comparison unit 30, the other thread is scheduled with priority over the thread executing the spin lock. That is, instructions issued from instruction sequencer A are fed to the execution unit in preference to instructions issued from instruction sequencer B, which is performing the spin lock.
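The selection policy described above can be sketched as follows. This is a minimal software model, not the hardware of FIG. 9: the function name `select_sequencer`, its arguments, and the use of round-robin alternation to model "equal priority" are assumptions made for illustration.

```python
def select_sequencer(ready, spinning, last_selected):
    """Choose which instruction sequencer feeds the execution unit this cycle.

    ready:         dict mapping "A"/"B" to True if that sequencer has an
                   instruction to issue.
    spinning:      set of sequencer names for which a spin lock was detected.
    last_selected: name chosen in the previous cycle, or None.
    """
    candidates = [name for name in ("A", "B") if ready[name]]
    if not candidates:
        return None
    # Prefer threads that are not spinning; fall back to a spinning thread
    # only when nothing else is ready, so the execution unit is not left idle.
    non_spinning = [name for name in candidates if name not in spinning]
    pool = non_spinning or candidates
    # Equal priority is modeled here as simple alternation (an assumption).
    if len(pool) == 2 and last_selected == pool[0]:
        return pool[1]
    return pool[0]


# No spin lock detected: the two sequencers alternate.
normal = select_sequencer({"A": True, "B": True}, set(), "A")
# Spin lock detected on B: A is selected even though it was just served.
prioritized = select_sequencer({"A": True, "B": True}, {"B"}, "A")
```

Letting the spinning thread run when the other thread has nothing to issue is a design choice of this sketch; the specification only states that the other thread is given priority.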
[0048]
According to the present invention, a dynamic instruction scheduling apparatus corresponding to the program execution state as described above can be realized, and the instruction execution performance of the multithread processor can be improved.
(Supplementary Note 1) A shared resource contention detector for a multi-thread processor having a plurality of CPUs, comprising:
event acquisition means for acquiring an event generated by execution of an instruction by the plurality of CPUs;
count means for comparing the acquired event with a pre-registered event pattern and counting up a counter when they match; and
interrupt means for interrupting the CPU that generated the event when a count value of the counter reaches a predetermined value.
[0049]
(Supplementary Note 2) The shared resource contention detector according to Supplementary Note 1, wherein the registered event pattern consists of an event and an event generation source associated with the event, and,
when the count value of the counter reaches a predetermined value, the interrupt means interrupts the CPU that generated the event or the CPU registered as the event generation source.
[0050]
(Supplementary Note 3) The shared resource contention detector according to Supplementary Note 1 or 2, wherein, when the count value of the counter reaches a predetermined value, the interrupt means interrupts the CPU that generated the event and samples the value of the program counter.
(Supplementary Note 4) The shared resource contention detector according to Supplementary Note 1 or 2, wherein the interrupt means performs thread scheduling when the count value of the counter reaches a predetermined value.
[0051]
(Supplementary Note 5) A shared resource contention detection method for a multi-thread processor having a plurality of CPUs, comprising:
an event pattern acquisition procedure for acquiring an event occurrence pattern generated by execution of an instruction by the plurality of CPUs;
a count procedure for comparing the acquired event occurrence pattern with a pre-registered event pattern and counting up a counter when they match; and
an interrupt procedure for interrupting the CPU that generated the event occurrence pattern when a count value of the counter reaches a predetermined value.
[0052]
(Supplementary Note 6) The shared resource contention detector according to Supplementary Note 1 or 2, wherein the registered event pattern consists of an event and an occurrence time interval of the event.
(Supplementary Note 7) The shared resource contention detector according to Supplementary Note 1, 2, or 6, wherein, when the count value of the counter reaches a predetermined value, the interrupt means interrupts the CPU that generated the event.
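The acquire, compare, count, and interrupt sequence recited in the supplementary notes above can be modeled in software as a minimal sketch. The class name `PatternCounter`, the per-CPU event windows, the single shared counter, and the event names "load", "compare", and "branch" are all assumptions made for illustration; the actual event pattern of a spin loop is the one shown in FIG. 3.

```python
from collections import defaultdict, deque


class PatternCounter:
    """Counts matches of a registered event pattern and raises an interrupt
    (modeled as a callback) when the count reaches a threshold."""

    def __init__(self, pattern, threshold, on_interrupt):
        self.pattern = tuple(pattern)        # pre-registered event pattern
        self.threshold = threshold           # predetermined count value
        self.on_interrupt = on_interrupt     # called with the CPU to interrupt
        self.count = 0
        # One sliding window of recent events per CPU, so that events from
        # different logical CPUs do not interleave within a pattern.
        self.windows = defaultdict(lambda: deque(maxlen=len(self.pattern)))

    def observe(self, cpu, event):
        """Acquire one event generated by the given CPU."""
        window = self.windows[cpu]
        window.append(event)
        if tuple(window) != self.pattern:
            return
        window.clear()
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0
            self.on_interrupt(cpu)


interrupted = []
detector = PatternCounter(["load", "compare", "branch"],
                          threshold=2,
                          on_interrupt=interrupted.append)
for _ in range(2):                           # two iterations of a spin loop
    for event in ("load", "compare", "branch"):
        detector.observe(cpu=1, event=event)
```

In this model the callback stands in for the interrupt means; it could record the program counter for PC sampling or trigger thread scheduling, depending on the embodiment.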
[0053]
[Effect of the invention]
By using a counter capable of recognizing event occurrence patterns in a multi-thread processor, the present invention makes it possible to efficiently identify locations where contention for resources shared between logical CPUs occurs, such as contention for computation resources caused by a spin loop having a characteristic event occurrence pattern, or contention for cache memory use. This information supports techniques for optimizing programs that run on multi-thread processors.
[Brief description of the drawings]
FIG. 1 is a principle diagram of the present invention.
FIG. 2 is a schematic diagram of the first invention.
FIG. 3 is an example of a spin loop implementation program and an example of an event pattern that has occurred.
FIG. 4 is an example of contention detection in cache memory access.
FIG. 5 is a configuration example of Embodiment 1.
FIG. 6 is an example of conflict detection by identification of an event generation source.
FIG. 7 is a flow example of Embodiment 1.
FIG. 8 is an example of PC sampling.
FIG. 9 is a configuration example of Embodiment 2.
[Explanation of symbols]
1: Event acquisition means
2: Count means
3: Interrupt means
10: Multithread processor
11: Instruction fetch unit
12: Instruction sequencer
13: Selection unit
14: Arithmetic / logical operation unit
15: Floating point adder
16: Multiplier
17: Divider
18: Load / store unit
19: Retirement unit
20: Register set
30: Event comparison unit
31: Event fetch unit
32: Comparator
33: Counter
34: Pattern index register
35: Pattern register

Claims

A shared resource contention detector for a multi-thread processor having a plurality of logical CPUs, comprising:
event acquisition means for acquiring an event generated by execution of an instruction by the plurality of logical CPUs;
count means for comparing the acquired event with a pre-registered event pattern that appears when a spin lock is executed, and counting up a counter when they match; and
interrupt means for performing thread scheduling processing on one or more logical CPUs when a count value of the counter reaches a predetermined value.

The shared resource contention detector according to claim 1, wherein the registered event pattern consists of an event and an event generation source associated with the event, and,
when the count value of the counter reaches a predetermined value, the interrupt means interrupts the logical CPU that generated the event.

The shared resource contention detector according to claim 1 or 2, wherein, when the count value of the counter reaches a predetermined value, the interrupt means interrupts one or more logical CPUs and samples the value of the program counter corresponding to the logical CPU.

A shared resource contention detection method for a multi-thread processor having a plurality of logical CPUs, comprising:
an event pattern acquisition procedure for acquiring an event occurrence pattern generated by execution of an instruction by the plurality of logical CPUs;
a count procedure for comparing the acquired event occurrence pattern with a pre-registered event pattern that appears when a spin lock is executed, and counting up a counter when they match; and
an interrupt procedure for performing thread scheduling processing on one or more logical CPUs when the count value of the counter reaches a predetermined value.