JP4209959B2 - Apparatus and method for high frequency sampling of performance counter

Info

Publication number: JP4209959B2
Application number: JP05806598A
Authority: JP
Inventors: Lance M. Berc; Sanjay Ghemawat; Richard L. Sites; Monika H. Henzinger; Carl A. Waldspurger; William Weihl
Original assignee: Hewlett Packard Co
Current assignee: HP Inc
Priority date: 1997-03-10
Filing date: 1998-03-10
Publication date: 2009-01-14
Anticipated expiration: 2018-03-10
Also published as: DE69816686D1; US5796939A; EP0864978A3; CA2231584A1; DE69816686T2; EP0864978A2; EP0864978B1; JPH10260869A

Abstract

An apparatus collects performance data in a computer system that includes a plurality of processors for concurrently executing instructions of a program. A plurality of performance counters are coupled to each processor. The performance counters store performance data generated by each processor while executing the instructions. An interrupt handler executes on each processor and samples the performance data of that processor in response to interrupts. A first memory includes a hash table associated with each interrupt handler; the hash table stores the performance data sampled by the interrupt handler executing on the processor. A second memory includes an overflow buffer, which stores the performance data while portions of the hash tables are active or full. A third memory includes a user buffer, and means are provided for periodically flushing the performance data from the hash tables and the overflow buffer to the user buffer.

Description

【０００１】
【発明の属する技術分野】
本発明は、一般に、コンピュータシステムに係り、より詳細には、コンピュータシステムにおける性能データの収集に係る。
【０００２】
【従来の技術】
動作しているコンピュータシステムにおける性能データの収集は、ハードウェア及びソフトウェアのエンジニアにより行われる頻繁で且つ非常に重要な作業である。ハードウェアのエンジニアは、新しいコンピュータハードウェアが既存のオペレーティングシステム及びアプリケーションプログラムといかに動作するかを決定するために性能データを必要とする。
【０００３】
プロセッサ、メモリ及びキャッシュのようなハードウェア構造体の特定の設計は、同じ１組のプログラムに対し著しく異なるそして時には予想し得ない使い方をする。ハードウェアの欠陥を識別し、それらを将来の設計において修正できることが重要である。性能データは、ソフトウェアがいかに効率的にハードウェアを使用するかを識別できると共に、改良されたシステムを設計する上で助けとなる。
【０００４】
ソフトウェアのエンジニアは、プログラムの重大な部分を識別することが必要である。例えば、コンパイラーの著者は、コンパイラーが命令を実行に対しいかにスケジュールするか、又はソフトウェアを最適化する入力を与えるために条件分岐の実行がいかに良好に予想されるかを見出さねばならない。同様に、オペレーティングシステムのカーナル、デバイスドライバ、及びアプリケーションソフトウェアプログラムの性能を理解することが必要である。
【０００５】
コンピュータシステムの動作環境を妨げることなくハードウェア及びソフトウェアシステムの性能を正確に監視することが問題となる。特に、何日又は何週間といった長い期間にわたって性能データが収集される場合に問題となる。多くの場合に、性能監視システムは、手作りである。システムのオペレーションが監視システムにより影響されないように保証するためには、コストのかかるハードウェア及びソフトウェア変更が必要となる。
【０００６】
コンピュータシステムの性能を監視できる１つの方法は、性能カウンタを使用することによるものである。性能カウンタは、システムにおける重大な事象の発生を「カウント」する。重大な事象は、例えば、キャッシュミス、命令の実行、Ｉ／Ｏデータ転送要求、等々を含む。性能カウンタを周期的にサンプリングすることにより、システムの性能を推論することができる。
【０００７】
【発明が解決しようとする課題】
コンピュータシステムの性能は、ソフトウェアの変更を伴わずに監視できることが望ましい。又、サンプリング率を固定又は可変とし、そしてその率を非常に高くできることが望ましい。更に、高頻度のサンプリング中に、サンプリングのオーバーヘッドを最小に保持し、性能データがシステムのオペレーションを正確に反映するのが望ましい。オーバーヘッドを低く保つことは、データアクセスを同期させる必要のあるマルチプロセッサシステムでは特に困難であり、サンプリング率は、非常に高いものとなり、例えば、毎秒５０，０００ないし１００，０００サンプルとなる。
【０００８】
【課題を解決するための手段】
本発明は、コンピュータシステムにおいて性能データを収集する装置を提供する。コンピュータシステムは、プログラムの命令を同時に実行するための複数のプロセッサを備えている。本発明の装置は、複数セットの性能カウンタを備えている。１セットの性能カウンタが各プロセッサに接続される。性能カウンタは、命令を実行する間に各プロセッサにより発生される性能データを記憶するためのものである。
本発明は、その広い形態においては、コンピュータシステムの性能データをサンプリングするための請求項１に記載の装置、及び請求項９に記載のその方法に係る。
【０００９】
各プロセッサにおいて割り込みハンドラーが実行される。割り込みハンドラーは、割り込みに応答して、プロセッサの性能データをサンプリングするためのものである。第１メモリは、各割り込みハンドラーに関連したハッシュテーブルを含む。ハッシュテーブルは、プロセッサ上で実行される割り込みハンドラーによりサンプリングされた性能データを記憶する。第２メモリは、オーバーフローバッファを含み、このオーバーフローバッファは、ハッシュテーブルの一部分が非アクティブでありそしてそれらの部分がいっぱいである間に性能データを記憶するためのものである。第３メモリは、ユーザバッファを含む。更に、ハッシュテーブル及びオーバーフローバッファからユーザバッファへ性能データを周期的にフラッシュするための手段が設けられる。
【００１０】
好ましくは、以下に述べるように、第１メモリのハッシュテーブルは、多方セット連想(multi-way set-associative) キャッシュとして編成される。更に、多方セット連想キャッシュは、複数のチャンクを含み、各チャンクは、第１メモリと第３メモリとの間のデータ転送の単位である。更に、各チャンクは、複数のラインを含み、そして各チャンクには、対応するチャンクが非アクティブ及びいっぱいである場合を各々指示するためのａｃｔｉｖｅｃｈｕｎｋフラグ及びｆｌｕｓｈｃｈｕｎｋフラグが関連される。各チャンクのラインは、更に、複数のエントリーに仕切られ、そして各エントリーは、性能データを記憶するための複数のフィールドを含む。性能データのフィールドは、プロセッサ識別、プログラムカウンタ、プロセッサ事象識別、及びプロセッサ事象カウンタフィールドを含む。
【００１１】
プロセッサ識別、プログラムカウンタ及びプロセッサ事象識別からハッシュインデックスを発生するための手段が設けられるのが好都合である。ハッシュインデックスは、ヒット又はミス信号を発生するために特定のプロセッサに関連したハッシュテーブルのラインを探知するのに使用される。
【００１２】
ミス信号に応答して、現在ハッシュインデックスにより指示された現在エントリーに記憶された性能データがハッシュテーブルからオーバーフローバッファへ移動される。現在エントリーは、割り込みハンドラーによりサンプリングされた性能データでオーバーライトされる。ヒット信号の場合には、現在ハッシュインデックスにより指示された現在エントリーに記憶されたプロセッサ事象カウンタが増加される。性能データは、圧縮形態でエントリーに記憶される。
【００１３】
以下に示すように、第２メモリのオーバーフローバッファは、二重バッファとして編成された第１及び第２のバッファを備えている。各バッファは、ハッシュテーブルのエントリーの性能データを記憶するための複数のスロットを含む。
【００１４】
【発明の実施の形態】
本発明は、添付図面を参照した好ましい実施形態の以下の詳細な説明から容易に理解されよう。
システムの概要
図１に示すように、コンピュータシステム１００は、バス１４０により互いに接続されたプロセッサ１１０と、メモリサブシステム（メモリ）１２０と、入力／出力インターフェイス（Ｉ／Ｏ）１３０とを備えている。システム１００は、埋め込まれたシステム、ＰＣ、ワークステーション、メインフレーム、又はネットワークでリンクされたシステムのクラスターのメンバーである。
【００１５】
プロセッサ１１０は、ＣＩＳＣ又はＲＩＳＣのいずれかのアーキテクチャーを有する１つ以上の個々のプロセッサチップ１１１として構成することができる。各プロセッサ１１１には、性能カウンタ１１２のセットが組み合わされる。性能カウンタ１１２の各セットは、複数のレジスタとして実施することができる。これらレジスタは、システムにおいてシステムの性能を表す重大な事象の発生をカウントすることができる。
【００１６】
メモリ１２０は、個別の静的、動的、ランダム、逐次、揮発性、及び永続的な記憶エレメント、又はその組合せを含むことができる。幾つかのメモリは、ハイアラーキ構成とすることができる。記憶エレメントは、レジスタ、キャッシュ、ＤＲＡＭ、ディスク、テープ等である。メモリ１２０は、マシンで実行できる命令の形態のソフトウェアプログラム１２１と、命令によりアクセスされるデータ１２２とを記憶する。ソフトウェアプログラムは、オペレーティングシステム、デバイスドライバ、及びアプリケーションプログラムを含む。
【００１７】
Ｉ／Ｏ１３０は、プリンタ、ターミナル及びキーボードのような入力及び出力デバイスへのインターフェイスを含むことができる。又、Ｉ／Ｏ１３０は、他のコンピュータシステムとデータを通信するためにライン１５０を経てネットワーク（ＮＥＴ）１６０に接続することもできる。バス１４０は、通常、アドレス、データ、制御及びタイミング信号を種々の要素間に搬送するために複数のラインとして実施される。
【００１８】
オペレーションの概要
システム１００のオペレーション中に、プログラム１２１の命令は、１つ以上のプロセッサ１１１により実行される。命令は、一般に、プログラムの実行流を制御するか、又はデータ１２２をアクセス（ロード、記憶、読み取り及び書き込み）する。通常のオペレーション環境を著しく妨げることなくシステム１００の性能データを収集することが望まれる。性能データの分析は、システム１００のハードウェア及びソフトウェア要素の設計を最適化するのに使用できる。又、性能データは、パイプラインストールのようなオペレーションの問題を決定するのにも使用できる。
【００１９】
データ収集サブシステム
図２に示す１つの実施形態では、性能データ収集サブシステム２００は、種々のハードウェアデバイス２２０と、カーナルモードプログラム及びデータ構造体２３０と、ユーザモードプログラム及びデータ構造体２６０とを備えている。プログラム及びデータ２３０及び２６０は、図１のメモリ１２０に記憶することができる。
ハードウェア２２０は、複数の個々のプロセッサ２２１−２２３を含むことができる。各プロセッサには、性能カウンタ又はレジスタ２０１のセットが組み合わされる。レジスタ２０１のセットは、同じ半導体ダイ上に関連プロセッサと共通に常駐することができる。各セット２０１は、複数のレジスタ２３１−２３４を含むことができる。
【００２０】
レジスタ２３１−２３４の各々は、性能事象のカウントを記憶するために多数のビットを含むことができる。１つのカウンタに累積することのできる特定事象の全発生数は、レジスタ２３１−２３４のサイズ即ちビット数に依存する。
特定の実施形態に基づき、レジスタ２０１のセットは、イネーブル、ディスエイブル、休止、再開、リセット（クリア）、読み取り及び書き込みすることができる。典型的に、レジスタは、オーバーフロー（ある特定ビット位置の桁上げ）時に割り込み２５１−２５３を発生する。割り込みに応答して、レジスタ２０１をサンプリング（読み取り）することができる。ある実施形態においては、２の整数累乗に対応する頻度をカウントするために１つのレジスタの特定ビットの設定時に割り込みを発生することができる。
【００２１】
ハードウェア２２０のオペレーション中に、システムの種々のオペレーション特性を表す信号又は「事象」、例えば、（Ｅ₁、・・・Ｅ_M）２４１−２４４がそれに対応するレジスタを増加させる。信号送信される厳密な事象は、通常は、システムに特有のものである。レジスタを増加できる典型的な事象は、例えば、キャッシュミス、分岐予想ミス、パイプラインストール、命令発生、演算動作、プロセッササイクル、再生トラップ、変換バッファミス、Ｉ／Ｏ要求、プロセスアクティブ、等々を含むことができる。一度に１つ以上の特定の事象数をサンプリングのために選択することができる。
【００２２】
サブシステム２００のカーナルモード成分２３０は、デバイスドライバ又は割り込みハンドラー（プログラム）２３１−２３２と、テーブル２３４−２３５とを含む。各プロセッサごとに、１つのハンドラー及びそれに関連したテーブルがある。効果として、各プロセッサごとに１つのハッシュテーブルをもつと、種々のプロセス間のほとんどの処理活動を同期する必要性が排除される。更に、１つの実施形態において、全てのハンドラー２３１−２３３により共用されるオーバーフローバッファ２３８がある。バッファ２３８へのアクセスは、ロック（Ｌ）２３９により制御される。別の実施形態では、各ハンドラーは、同期する必要のある事象の数を更に減少するために専用バッファを関連させることができる。
【００２３】
カーナルモード成分２３０のオペレーション中に、割り込みがそれに対応するハンドラーをアクチベートする。ハンドラー２３１−２３３は、多数の同時実行プロセス又はスレッドとして動作することができる。ハンドラーは、レジスタから性能データを読み取り、そして以下に詳細に述べるように、それに関連したテーブル２３４−２３６の１つにデータを記憶する。いずれかのテーブルがいっぱいになるか又はアクセス不能になる場合には、オーバーフローデータをオーバーフローバッファに記憶することができる。バッファ２３８へ書き込むためには、プロセスは、先ず、データの一貫性を確保するようにロック２３９を得なければならない。
【００２４】
ユーザモード成分２６０は、デーモンプロセス２６１、１つ以上のユーザバッファ２６２、及び処理された性能データ２５１を含む。データ２５１は、ディスク２５０に記憶することもできるし、又は図１のメモリ１２０の一部分である他の形式の不揮発性記憶装置に記憶することもできる。
オペレーション中に、デーモン２６１は、ハッシュテーブル及びオーバーフローバッファをユーザバッファ２６２へ周期的にフラッシュする（空にする）ことができる。更に、デーモン２６１は、累積された性能データを処理し、例えば、システム１００の性能を分析するのに有用な実行プロファイル、ヒストグラム、及び他の統計学的データを発生することもできる。
【００２５】
ハッシュテーブル
好ましい実施形態として図３に示すように、テーブル２３４−２３６の各々はハッシュテーブル３００として構成される。ハッシュテーブルは、ハッシュインデックスによりアクセスされるデータ構造体である。典型的に、ハッシュインデックスは、ある範囲のアドレスにわたってデータをランダムに分配する傾向のある決定論的に計算されたあるアドレスである。
【００２６】
ハッシュファンクションは、良く知られている。性能データを収集するためのこの実施形態においては、ハッシュテーブルは、カーナルプロセスとユーザプロセスとの間にデータを転送するのに必要なメモリ帯域巾を減少するのに使用される。より詳細には、以下に述べるように、各ハッシュテーブルは、メモリサブシステムの帯域巾を減少するために多方セット連想キャッシュとして実施することができる。本発明の好ましい実施形態は、ハッシュテーブル３００において４方連想を使用する。
【００２７】
ハッシュテーブル３００は、複数のチャンク３０１−３０３に仕切られる。チャンクは、カーナルモード成分とユーザモード成分との間のデータ転送の単位である。ハッシュテーブル３００の各チャンクは、更に、複数のライン３１１−３１４に仕切られる。ラインとは、通常、システムのメモリ転送ロジック及びキャッシュにより効率的に取り扱われるデータ転送のコヒレントな単位である。
【００２８】
ハッシュテーブル３００の（４方）連想性は、ハードウェアキャッシュライン３１１−３１４のサイズに合致するように入念に選択され、例えば、各キャッシュラインは、６４バイトを含む。又、チャンク３０１−３０３の各々には、ａｃｔｉｖｅｃｈｕｎｋ及びｆｌｕｓｈｃｈｕｎｋフラグ３１５及び３１６が組み合わされる。ａｃｔｉｖｅｃｈｕｎｋフラグ３１５は、ハンドラーの１つがハッシュテーブルのチャンクの１つに記憶されたデータを変更（更新）するときにセットすることができる。フラグがクリアされると、それに対応するチャンクが非アクティブとなり、即ち書き込まれない。ｆｌｕｓｈｃｈｕｎｋフラグ３１６は、データがハッシュテーブルからユーザバッファにコピーされるときにセットすることができる。テーブル及びバッファの非同期の取り扱いについては、以下に詳細に述べる。
【００２９】
各ラインは、複数のセット連想エントリー３２０を含む。各エントリー３２０は、複数のフィールド３２１−３２４を含む。例えば、フィールド３２１−３２４は、各々、プロセス識別（ｐｉｄ）と、プログラムカウンタ（ｐｃ）と、事象識別（ｅｖｅｎｔ）と、カウンタ値（ｃｏｕｎｔ）を記憶する。ライン３１１−３１４のエントリー３２０に記憶されるデータは、キャッシュミスの数を減少するために高度に圧縮される。連想性をキャッシュラインのサイズに合致することにより、単一キャッシュミスの場合に、エントリーの最大数を探知することができる。サンプルデータを圧縮されたキャッシュラインとして記憶すると、メモリサブシステム１２０の帯域巾及びストレスが減少される。
【００３０】
割り込みハンドラープロセス
図４は、図２のハンドラーのプロセス４００を示す。プロセス４００は、図３のハッシュテーブル３００のラインのエントリー３２０を更新するためのものである。このプロセスは、ステップ４１０において、通常は、割り込みに応答して開始される。ステップ４２０において、ハッシュインデックスＨ_iを決定する。このインデックスは、ビットを合成するあるハッシュ関数ｆ_hash、例えば、ｐｉｄ_i、ｐｃ_i及びｅｖｅｎｔ_iのこのインスタンスの排他的オア、例えば、Ｈ_i＝ｆ_hash（ｐｉｄ_i、ｐｃ_i、ｅｖｅｎｔ_i）により決定することができる。ステップ４３０において、テーブル３００におけるインデックスＨ_iの全てのエントリーをチェックし、それらがｐｉｄ_i、ｐｃ_i及びｅｖｅｎｔ_iについて合致するかどうか決定する。ヒットの場合には、ステップ４４０において、カウントフィールド３２４を増加し、即ちｃｏｕｎｔ_i＝ｃｏｕｎｔ_i＋１となる。
【００３１】
ミスの場合、即ちインデックスＨ_iのエントリーにおいてヒットがない場合には、ステップ４５０において、インデックスＨ_iのエントリーを図２のオーバーフローバッファ２３８へ移動する。ステップ４６０において、ハッシュテーブルのインデックスＨ_iに新たなエントリーを記憶し、そしてカウントを１にセットする。ヒット又はミスのいずれの場合にも、プロセスは、ステップ４９０で完了する。
【００３２】
特定の割り込みハンドラーは、グローバルデータの多数の断片をアクセスしなければならない。グローバルデータは、ハンドラーが実行されているプロセッサに対するハッシュテーブルのポインタと、新たなエントリーに対しハッシュ関数の値でインデックスされるハッシュテーブルのラインと、オーバーフローバッファのポインタと、データ構造体の状態を制御するのに使用される多数のグローバル変数（例えば、アクティブなオーバーフローバッファへ挿入するための次のインデックス、及びハッシュテーブルにおいて次のミスに対し所与のラインのどのエントリーを退かせるかを指示するカウンタ）と、その他の種々のグローバル変数とを含む。
【００３３】
このグローバルデータは、全て、データを記憶するのに使用されるハードウェア構造体に合致するために入念にレイアウトしなければならない。例えば、６４バイトのキャッシュラインをもつプロセッサにおいては、データが単一の６４バイト構造体にパックされる。これは、このデータのいずれをアクセスするのにもせいぜい１つのキャッシュミスしか生じないよう確保する。キャッシュミスは、約百サイクル以上で生じ、そして割り込みハンドラーは、その完了のために全体的に数百サイクル以下しか消費してはならないので、キャッシュミスの数を最小にすることは、性能データ収集の影響を最小にするために重要である。
【００３４】
更に、多数のプロセッサ間に共用されたデータを含むキャッシュラインにマルチプロセッサが書き込みするのは、不経済である。グローバルデータは、異なるプロセッサに入念に複製されて、各プロセッサがそれ自身のコピーをもつようにされ、共用キャッシュラインへの書き込みの必要性が回避される。
別の実施形態においては、割り込みを取り扱う時間長さは、特定の状態に対して各々最適化された多数の異なるハンドラーを使用することにより減少することができる。例えば、以下に述べる多数のオーバーフローバッファをもつ１つの実施形態では、割り込みハンドラーが、その開始時に、ハッシュテーブルをバイパスすべきかどうかチェックする。ほとんどの時間、このチェックは、偽となる。
【００３５】
しかしながら、ハッシュテーブルをバイパスすべきかどうか調べるためにフラグをチェックしない形態の割り込みハンドラーを形成することができる。そうではなく、ハンドラーは、ハッシュテーブルをアクセスすべきであると仮定する。ハッシュテーブルをバイパスすべきかどうか指示するためにフラグを変更するときには、適当な割り込みハンドラーを指すようにシステムレベルの割り込みベクトルを変更することができる。これは、通常の場合に、ハッシュテーブルをバイパスすべきでないときに、チェックが必要でないよう確保し、従って、多数のプロセッササイクルがセーブされる。通常の場合に全ての異なるフラグ及びそれらの設定を入念に分析し、そしてその通常の場合に特殊な形態の割り込みハンドラーを使用することにより、著しい高速性能が得られる。
ハッシュテーブルに記憶されたデータを操作する間に他のプロセッサとの同期をとることについて、以下に詳細に説明する。
【００３６】
同期
ハッシュテーブル及びオーバーフローバッファへのアクセスの同期は、次のように管理される。先ず、各プロセッサごとに個別のハッシュテーブル２３４−２３６があり、従って、異なるプロセッサで実行される割り込みハンドラー２３１−２３３は、ハッシュテーブルを操作する間に互いに同期する必要がない。しかしながら、ハンドラーは、オーバーフローバッファ２３８を共用し、従って、オーバーフローバッファへのアクセスは、以下に述べるように同期することが必要となる。
【００３７】
更に、ユーザレベルデーモン２６１がカーナルのテーブル及びオーバーフローバッファからユーザバッファ２６２へ全てのサンプルデータを検索して、デーモン２６１が現在情報を有するよう確保できるようにするために、カーナルモードデバイスドライバにより手順が与えられる。ハッシュテーブル及びオーバーフローバッファから各々データを検索するために「ｆｌｕｓｈｈａｓｈ」及び「ｆｌｕｓｈｏｖｅｒｆｌｏｗ」と称する個別のルーチンが与えられる。これらのルーチンは、以下に述べるように、ハンドラー２３１−２３３と同期する必要がある。
【００３８】
ハッシュテーブル同期
所与のプロセッサのハッシュテーブルへのアクセスを同期させる必要性のあるアクティビティは、２つあり、そのプロセッサに対する割り込みハンドラーと、ｆｌｕｓｈｈａｓｈルーチンである。同期される必要のある考えられる事象のタイミングが図５に示されている。割り込みハンドラー及びｆｌｕｓｈｈａｓｈルーチンにより共用されるグローバルな変数が図６に示されている。割り込みハンドラーの擬似コードが図７に示され、そしてｆｌｕｓｈｈａｓｈルーチンの擬似コードが図８に示されている。
【００３９】
特定のプロセッサのハッシュテーブルがフラッシュされる間は、割り込みハンドラーがそのテーブルにアクセスすることができない。それ故、この時間中に、割り込みハンドラーは、サンプルデータをオーバーフローバッファ２３８に直接的に記憶する。オーバーフローバッファ２３８に直接書き込まれるサンプルの数を最小にするために、ハッシュテーブルは、そのときのチャンクがフラッシュされる。ハッシュインデックスが、フラッシュされるチャンクに入るサンプルは、オーバーフローバッファに直接書き込まれる。しかしながら、チャンクがフラッシュされるサンプルは、上記のようにハッシュテーブルに書き込まれる。
【００４０】
ａｃｔｉｖｅｃｈｕｎｋ及びｆｌｕｓｈｃｈｕｎｋフラグ３１５−３１６は、各々、割り込みハンドラー及びｆｌｕｓｈｈａｓｈルーチンによりどのチャンクが使用中であるか指示する。各フラグは、チャンクの名前を記録し、チャンクは、その第１エントリーのインデックスにより命名される。値−１は、ａｃｔｉｖｅｃｈｕｎｋ及びｆｌｕｓｈｃｈｕｎｋフラグにおいて、チャンクが使用されず、例えば、チャンクが非アクティブであることを指示するのに使用される。
【００４１】
オーバーフローバッファにサンプルデータを記憶するための次の空きスロットを決定し、そしてオーバーフローバッファの空きスロットにエントリーを書き込むための手順は、オーバーフローバッファ２３８へのアクセスを同期する説明の一部分として以下に述べる。データ形式「ｓｌｏｔｉｎｄｅｘ」についても述べる。
ｆｌｕｓｈｈａｓｈルーチンの同期は、そのルーチンが多数の高速プロセッサをもつシステムに使用されるよう意図されるので、メモリモデルが順次一貫しない場合には、部分的に繊細である。これについては、「マルチプロセスプログラムを正しく実行するマルチプロセッサコンピュータを構成する方法(How to make a multiprocessor computer that correctly executes multiprocess programs) 」、ＩＥＥＥトランザクション・オン・コンピュータ、Ｃ−２８（１９７９年、９月）、第６９０−６９１ページを参照されたい。
【００４２】
オペレーションの順序がある条件のもとでしか保証されない場合には、メモリモデルがシーケンシャルに一貫しない。プロセッサは、第１順序で命令を発生しそして実行できるが、メモリサブシステムは、第２順序（メモリ順序）でアクセスを完了する。更に、メモリ順序は、推移的であり、ＡがＢの前に生じそしてＢがＣの前に生じる場合には、ＡがＣの前に生じる。オペレーションが実際に生じる順序は、次の制約に基づきメモリ順序によって決定される。
１．単一プロセッサによる単一メモリ位置でのメモリアクセスオペレーションは、アクセス命令がそのプロセッサにより発生される順序で生じる。
２．単一プロセッサによる異なるメモリ位置でのメモリアクセスオペレーションは、それらオペレーションが「メモリバリア」（ＭＢ）命令により分離されない限り、いかなる順序で発生することもできる。この場合に、メモリバリア命令の前の全てのオペレーションは、メモリバリア命令の後の全てのオペレーションの前に生じる。
３．２つのプロセッサが同じメモリ位置を１つのプロセッサ読み取りによりアクセスし、そして他のプロセッサが書き込みし、更に、読み取りオペレーションがその書き込まれる値を通知する場合には、その読み取りオペレーションが書き込みオペレーションの後に生じる。
【００４３】
メモリバリア命令は、比較的長い実行時間を必要とすることに注意されたい。それ故、頻繁に横断される実行経路におけるＭＢ命令の使用を最小にすることが望ましい。
図７に示す割り込みハンドラーにとられる特定の経路に基づき、３つのケースが考えられる。ハンドラーは、ライン７０４、７０７又はライン７０９−７２２で始まるコードを実行することができる。これらケースの各々は、以下に説明する。
【００４４】
先ず、ハンドラーが、ライン７０４で始まるコードを実行する場合には、ハンドラーは、ハッシュテーブルにアクセスする必要が全くない。サンプルは、オーバーフローバッファ２３８に直接書き込まれ、このバッファは、以下に述べるように、サンプルの損失又は二重カウントを回避するために同期される。この場合に、たとえ変数ｆｌｕｓｈｃｈｕｎｋ〔ｉ〕がｆｌｕｓｈｈａｓｈルーチン（異なるプロセッサで実行される）によりｃ以外の値にセットされた場合でも、ハンドラーは、ライン７０４を実行するためにライン７０３において判断することが考えられる。
【００４５】
このため、図１のメモリサブシステム１２０は、変数ｆｌｕｓｈｃｈｕｎｋ〔ｉ〕の新たな値が全ての他のプロセッサ上に直ちに見えるように保証しない。これは、誤った振る舞いを生じるものではなく、幾つかのサンプルがハッシュテーブルに記憶できたであろうときにそれらがオーバーフローバッファに書き込まれることを意味するに過ぎない。従って、性能は若干低下するが、データの正しさは、影響がない。
第２に、ハンドラーがライン７０７のコードを実行する場合には、高価な同期オペレーションは、実行されない。ハンドラーがライン７０７にあるときには、ハッシュテーブルの一致するエントリーが見つかり、これは、そのエントリーに対してカウントを増加する必要があることだけを意味する。しかしながら、この場合に、非常に稀であるが、単一のサンプルが失われるおそれがあることに注意されたい。
【００４６】
図５は、ハッシュテーブルアクセスの考えられるタイミングを示す。図５において、ライン５６０は、左から右へ増加する全時間を示し、ライン５７０は、１つのプロセッサ（ｃｐｕｊ）の割り込みハンドラー５３１の事象のタイミングを示し、そしてライン５８０は、別のプロセッサ（ｃｐｕｉ）５３２のｆｌｕｓｈｈａｓｈルーチンの事象のタイミングを示す。
割り込みハンドラー（ｃｐｕｊ）５３１は、多数の動作を実行し、即ちｆｌｕｓｈｃｈｕｎｋフラグを読み取り（事象Ｃ５７１）、ハッシュテーブルにおいてヒットを見出し、ハッシュテーブルにおいて一致するエントリーのカウントを読み取り（事象Ｄ５７２）、そして増加されたカウントをハッシュテーブルに書き込む（事象Ｅ５７３）。
【００４７】
ｆｌｕｓｈｈａｓｈルーチンがオーバーフローバッファ２３８をフラッシュするように同時に実行されない場合には、増加されたカウントが正しく書き戻される。しかしながら、たとえ割り込みハンドラーがａｃｔｉｖｅｃｈｕｎｋフラグ３１５をチェックし、そしてそれが必要とされるチャンクに対してセットされたものでないと決定されても、ｆｌｕｓｈｈａｓｈルーチンが実際にそのチャンクを同時にフラッシュすることが考えられる。この稀なケースにおいては、以下に述べるように、タイミングの入念な分析により、単一のサンプルが失われることがあるが、サンプルが二重カウントされることはないことが示される。
【００４８】
図５において、ｆｌｕｓｈｈａｓｈルーチンに対して次の事象が生じる。ｆｌｕｓｈｃｈｕｎｋフラグ３１６は、チャンクが次にフラッシュされるようにセットされる（事象Ａ５６１）。次いで、ルーチンは、割り込みハンドラーにより使用されるハッシュテーブルエントリーをユーザバッファへコピーする（事象Ｇ５８３）。次いで、ルーチンは、ハッシュテーブルのエントリーをゼロにし（事象Ｈ５８４）、そしてｆｌｕｓｈｃｈｕｎｋフラグ３１６をクリアすることにより完了を指示する（事象Ｉ５８５）。
【００４９】
２つの他の事象に対する時間、即ちｆｌｕｓｈｃｈｕｎｋフラグの更新値が他の全てのプロセッサに伝播されるよう保証される（事象Ｂ５８２）時間と、割り込みハンドラーによりハッシュテーブルに書き戻される増加カウントが至る所に伝播されるように保証される（事象Ｆ５７４）時間とが図５に示されている。これらの時間は、特定のプロセッサ実施形態に基づくもので、アーキテクチャー仕様により予め決定されるものではないことに注意されたい。
【００５０】
事象Ｅ５７３が事象Ｇ５８３の前に発生する（メモリ順序で）場合には、増加カウントが事象Ｇ５８３の時間にユーザバッファにコピーされ、そして事象Ｈ５８４の時間にカウントがゼロにリセットされる。このケースにおいて、サンプルは、厳密に一度だけカウントされる。これは、稀である。
事象Ｅ５７３が事象Ｈ５８４の後に生じる場合には、エントリーのカウントがゼロにセットされた後に増加カウントがハッシュテーブルに書き戻される。これは、このエントリーの元のカウントで表されたサンプルが、事象Ｇの時間にユーザバッファへコピーされることにより一度、そしてこのエントリーがハッシュテーブルから追い立てられ即ちフラッシュされる次のときにもう一度、の２回カウントされるようにする。これは、受け入れられない。というのは、単一のハッシュテーブルエントリーは、多数のサンプルを表す大きなカウントをもつことができるからである。
【００５１】
二重カウントは、事象Ｅ５７３が事象Ｈ５８４の前に生じる限り起こり得ない。これは、次の変数について以下に述べる制約により保証される。
記憶された値を全てのプロセッサに伝播するのに必要な最大時間を「ｍａｘｐｒｏｐ」とする。図７のライン７０７を実行するときに割り込みルーチンに対する最大時間を「ｍａｘｉｎｔｒ」とする。同じエントリーに対し事象Ａ（５６１）から事象Ｈ（５８４）までの最小時間（即ち、ｆｌｕｓｈｃｈｕｎｋフラグ３１６がセットされるときから、時間フラグ３１６がｆｌｕｓｈｈａｓｈルーチンによってクリアされるまでの最小時間）を「ｍｉｎｆｌｕｓｈ」とする。
【００５２】
次の制約は、事象Ｅが事象Ｈの前に生じるように確保する。
（ｍａｘｉｎｔｒ＋（２＊ｍａｘｐｒｏｐ））＜ｍｉｎｆｌｕｓｈ
特定のプロセッサ実施形態におけるタイミングは、ｍａｘｐｒｏｐ及びｍａｘｉｎｔｒを決定するように測定することができる。次いで、チャンクサイズは、ｍｉｎｆｌｕｓｈが充分大きくなるよう確保するに充分な大きさに選択することができる。
【００５３】
第３に、ハンドラーがライン７０９−７２２を実行する場合には、ハッシュテーブルからオーバーフローバッファへエントリーを移動し、そしてカウント１の新たなエントリーをハッシュテーブルに書き込まねばならない。
ハッシュテーブルからオーバーフローバッファへ移動されるエントリーの損失又は二重カウントを回避するために、２つのフラグ（ａｃｔｉｖｅｃｈｕｎｋ〔ｉ〕及びｆｌｕｓｈｃｈｕｎｋ〔ｉ〕）との入念な同期と、３つまでのメモリバリアオペレーションとが使用される。この同期は比較的高価であるが、これは、ハッシュテーブルにミスがあるときしか生じず、これは、比較的稀である。
【００５４】
このコードの重要な特性は、ライン７１９及び７２０がライン７０７と相互に排他的なことである。このアルゴリズムは、メモリバリア命令と共に使用される標準的な相互除外技術の変形である。この技術は、オペレーションを強制的に順序付けし、そして割り込みハンドラーがロックを待機しないように確保する。
ｆｌｕｓｈｈａｓｈルーチンが所望のチャンクで行われるまで待機するのではなく、割り込みハンドラーは、所望のチャンクが得られないときにライン７１４−７１６のハッシュテーブルを単にバイパスする。又、オーバーフローバッファがいっぱいになる非常に稀な事象においては、割り込みハンドラーが単に復帰し、サンプルを破棄する。その結果、１つの効果として、この場合には、ハンドラーがライン７０９−７２２を実行するときに、サンプルが失われることも二重カウントされることもない。
【００５５】
この解決策の正味の作用は、ハッシュテーブルにおいてヒットする通常の場合に、ハンドラーは、高価な同期オペレーションを実行せず、非常に稀なケースにおいてサンプルを失う。
ライン７１０、７１４及び７１８を最適化することができ、ライン７１１においてオーバーフローバッファにスロットを得るには、ロックを得そして解除することが必要となる。ロックを得るには、それを解除するときと同様に、メモリバリアを伴い、従って、ライン７１０のメモリバリアは排除できる。又、ロック解除がライン７１３の条件テストの後に移動される（即ち、ロック解除がライン７１４及び７１８で行われる）場合にも、ライン７１４及び７１８のメモリバリアを排除することができる。
【００５６】
オーバーフローバッファの同期
好ましい実施形態においては、オーバーフローバッファ２３８は、実際には、２つの部分に分割され、一方の部分がハンドラーによりアクセスされる間に他方の部分をフラッシュできるようにされる。この形式の技術は、「二重バッファリング」とも称される。更に、オーバーフローバッファ２３８のロック２３９は、バッファのポインタが操作される間にのみ保持され、エントリーがそれに書き込まれる間には保持されない。これは、ハンドラーがエントリーを同じオーバーフローバッファに並列に書き込みできるようにすることにより効率を改善する。
【００５７】
明瞭化のために、上記のオーバーフローバッファ２３８の説明は、あたかもそれが単一バッファであるかのように簡単化され、そしてスロットは、形式ｓｌｏｔｉｎｄｅｘの値により識別される。ここでは、バッファ及びｓｌｏｔｉｎｄｅｘがいかに表されるかを詳細に説明する。
２つのバッファが存在する。バッファに対するインデックス（ｓｌｏｔｉｎｄｅｘ）は、バッファｉｄ（０又は１）と、そのバッファのスロットのインデックスとで構成される。ハンドラー及びｆｌｕｓｈｏｖｅｒｆｌｏｗにより共用されるグローバル変数が図９に示されており、次の空きスロットを決定しそしてオーバーフローバッファにエントリーを書き込むためにハンドラーにより使用される手順が図１０及び１１に示されており、そしてｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンが図１２に示されている。
【００５８】
ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンは、単一のオーバーフローバッファをフラッシュする。満杯のバッファが読み取りを待機する場合には、それがフラッシュされ、さもなくば、現在部分的に満杯のバッファがフラッシュされる一方、オーバーフローが他のバッファへ送られる。
単一のロック（ｏｖｅｒｆｌｏｗｌｏｃｋ）は、図９の変数へのアクセスを同期し、そしてバッファ（インデックス、完了、ｃｕｒｒｅｎｔｏｖｅｒｆｌｏｗ及びｆｕｌｌｏｖｅｒｆｌｏｗ）を管理するのに使用される。これら変数に対する全ての更新は、ｏｖｅｒｆｌｏｗｌｏｃｋを保持する間に行われる。バッファｉについては、ｉｎｄｅｘ〔ｉ〕は、書き込まれるべき次のスロットのインデックスである。
【００５９】
エントリーは、ロックを保持せずにオーバーフローバッファに書き込まれ、むしろ、ｉｎｄｅｘ〔ｉ〕を増加することによりバッファｉにスロットが指定される。スロットを指定したプロセッサのみが、それに書き込むことが許される。プロセッサは、スロットへの書き込みを行うと、図１１について説明するように、ｃｏｍｐｌｅｔｅｄ〔ｉ〕を増加する。従って、スロットは、いかなる順序で書き込むこともできる（特定の順序で指定されたが）。
【００６０】
ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンのワイル(while) ループは、ｃｏｍｐｌｅｔｅｄ〔ｉ〕がｉｎｄｅｘ〔ｉ〕に等しくなるまで待機する。これは、指定された全てのスロットへの書き込みが完了したことを意味する。ｉｎｄｅｘ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕は、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンがワイルループにある間は変更できないことに注意されたい。というのは、満杯のオーバーフローバッファに対してスロットが指定されないからである。又、ｃｏｍｐｌｅｔｅｄ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕は、単調に増加され、即ち読み取りオペレーションは、原子的なものであるから、ロックを保持せずにｃｏｍｐｌｅｔｅｄ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕を読み取ることができ、ｉｎｄｅｘ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕に等しい値が見えたときから、指定のスロットへの全ての書き込みを完了しなければならない。
【００６１】
図８のライン８０５のワイルループの後のメモリバリアは、満杯のオーバーフローバッファをユーザバッファにコピーするのに伴うメモリオペレーションが、ｃｏｍｐｌｅｔｅｄ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕をｉｎｄｅｘ〔ｆｕｌｌｏｖｅｒｆｌｏｗ〕に比較するテストの後に、換言すれば、バッファへの書き込みが実際に完了した後に生じるよう確保するために必要となる。
【００６２】
上記技術においては、全てのプロセッサが共通バッファへと退去し、これは、各退去のたびにプロセッサ間にある程度の整合を必要とする。実際には、一対のオーバーフローバッファがあって、その一方が満杯で、ユーザスペースへフラッシュされねばならないときに、その他方のバッファへ退去できるようにする。事象のタイミングを入念に分析することに基づく付加的な同期は、ハッシュテーブルへのアクセスを適切に同期させるよう確保する。
【００６３】
プロセッサごとのローカルオーバーフローバッファ
別の実施形態においては、プロセッサごとのハッシュテーブル及び単一の共用のオーバーフローバッファを伴う上記方法を、各プロセッサごとに付加的な小さなオーバーフローバッファをもつように拡張することができる。この方法では、割り込みハンドラーがオーバーフローバッファに書き込もうとするとき、先ず、そのローカルオーバーフローバッファをチェックする。そのバッファがいっぱいでなければ、単にそのバッファに書き込むだけである。そのバッファがいっぱいであれば、共用オーバーフローバッファにロックを得、そしてその全ローカルバッファを共用バッファにコピーする。これは、ロックを得る頻度、及び共用キャッシュラインに書き込む頻度を減少し、従って、マルチプロセッサの性能を改善する。
【００６４】
プロセッサごとのローカルオーバーフローバッファを伴うこの方法の変形として、ハッシュテーブルを完全に排除することによりこの方法が更に変更される。これは、上記した他の方法よりも高いメモリトラフィックを有するが、共用のオーバーフローバッファへの書き込みが依然として低い頻度であるので、マルチプロセッサの共用メモリへのアクセス及びロックのためのオーバーヘッドは依然として小さい。
【００６５】
多数のオーバーフローバッファ
別の実施形態においては、異なる同期技術を使用し、全てのプロセッサに共用される単一の二重バッファではなくて、付加的なオーバーフローバッファ、即ちプロセッサ当たり２つのバッファを使用することにより、ハッシュテーブル同期及びオーバーフローバッファアクセスのコストを低減することができる。
この技術では、各プロセッサは、専用のハッシュテーブル（前記のような）と一対の専用の（二重）オーバーフローバッファとを「所有」する。この場合に、「所有」とは、アクティブなバッファ及びハッシュテーブルの状態に対する全ての変更（１つを除く）がそのプロセッサにより行われ、上記の単一の二重バッファ技術に存在するメモリ同期オペレーションの多くを排除することを意味する。
【００６６】
第１の技術に対し２つの主たる変更がある。第１に、ハッシュテーブルをユーザスペースへとフラッシュする間に、性能カウンタ事象がアクティブなオーバーフローバッファに付加され、ハッシュテーブルがバイパスされる。これは、フラッシュオペレーション中にハッシュテーブルが変更されないよう確保するために必要な同期を排除する。第１の技術の場合と同様に、ハッシュテーブルは、チャンクにおいてフラッシュすることができ、事象をオーバーフローバッファに直接付加しなければならない頻度を減少する。
【００６７】
第２に、各プロセッサは、専用の一対のオーバーフローバッファを有し、全てのプロセッサにわたって単一のアクティブなオーバーフローバッファを共用するのに必要な同期が排除される。
以下のデータ状態は、プロセッサごとに維持される。

【００６８】
ハッシュテーブルの同期
特定のプロセッサに対してハッシュテーブルをアクセスする２つのアクティビティがある。
１）割り込みハンドラーは、新たなサンプルをハッシュテーブルに記憶する；そして
２）ｆｌｕｓｈｈａｓｈルーチンは、ハッシュテーブルをユーザスペースにコピーする。
図１３に示す割り込み中にオーバーフローバッファ事象を取り扱うための擬似コードにおいては、「ｂｙｐａｓｓｈａｓｈ」変数を用いて、ハッシュテーブルへのアクセスが制御される。ライン１３０１−１３０３は、この変数が「真」にセットされた場合に、割り込みハンドラーがハッシュテーブルを完全にスキップし、そして新たなサンプルをオーバーフローバッファに直接書き込む。
【００６９】
ライン１３０５−１３１０は、割り込みハンドラーを通る他の経路を示す。新たなサンプルがハッシュテーブルのエントリーの１つに一致する場合は（ライン１３０５−１３０６）、ハンドラーは、その一致するエントリーに関連したカウントを単に増加する。さもなくば（ライン１３０８−１３１０）、ハンドラーは既存のエントリーの１つを退去のために取り上げる。このエントリーは、ハッシュテーブルから除去され、そしてオーバーフローバッファへ書き込まれる。新たなサンプルは、ハッシュテーブルの空いたスロットに書き込まれる。
【００７０】
「ｆｌｕｓｈｈａｓｈ」ルーチンの擬似コードが図１４に示されている。各プロセスに対し、このルーチンは、「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」フラグを真にセットし（ライン１４０４−１４０８）、ハッシュテーブルをユーザスペースにコピーし、そして「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」フラグを偽にリセットする（ライン１４０６及び１４１０）。正しい同期のために、「ｆｌｕｓｈｈａｓｈ」ルーチンは、「ｃｐｕ」という番号のプロセッサにおいて「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」への変更を入念に実行する。
【００７１】
「ｆｌｕｓｈｈａｓｈ」ルーチンが実行されているプロセッサが、ハッシュテーブルがコピーされるプロセッサと同じである（ライン１４０３−１４０６）場合には、このルーチンは、ローカルオペレーションにより「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」フラグを単にセットしそしてクリアする。さもなくば、「ｆｌｕｓｈｈａｓｈ」ルーチンは、プロセッサ間割り込みを使用して、「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」への変更を「ｃｐｕ」において実行させる。
【００７２】
割り込みハンドラー及びｆｌｕｓｈｈａｓｈルーチンは、正しく同期する。というのは、「ｂｙｐａｓｓｈａｓｈ〔ｃｐｕ〕」が読み取られそして「ｃｐｕ」という番号のプロセッサのみに書き込まれるからである。プロセッサ間割り込みは、性能カウンタのオーバーフローに対し割り込みハンドラーと同じ割り込み優先順位レベルで実行するようにセットされ、それらが互いに原子的に実行するよう確保する。メッセージを送信するような他の通信メカニズムも、同じ原子的レベルが確保される限り、プロセッサ間割り込みに代わって使用できる。
【００７３】
オーバーフローバッファの同期
特定のプロセッサに関連した一対のオーバーフローバッファを使用する２つのアクティビティがある。第１に、割り込みハンドラーは、エントリーをオーバーフローバッファに時々書き込む。第２に、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンは、オーバーフローバッファの内容をユーザスペースへ周期的にコピーする。これら２つのアクティビティは、バッファの一方を「アクティブ」バッファとしてマークすることにより同期される。割り込みハンドラーは、「アクティブ」バッファへ書き込むことだけが許され、そしてｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンは、非アクティブバッファから読み取ることだけが許される。
【００７４】
アクティブバッファがいっぱいであるときは、割り込みハンドラーは、古いアクティブバッファを非アクティブとマークしそして古い非アクティブバッファをアクティブとマークすることによりバッファをスイッチ（フリップ）するように試みる。フラグ「ａｌｌｏｗｆｌｉｐ〔ｃｐｕ〕」は、「ｆｌｕｓｈｏｖｅｒｆｌｏｗ」が古い非アクティブバッファをユーザスペースにコピーする間にこの切り換えが起きるのを防止するのに使用される。
【００７５】
特定のプロセッサに対しオーバーフローバッファにエントリーを追加するルーチンが図１５に示されている。アクティブバッファがいっぱいでない（ライン１５０２）場合には、このルーチンは、サンプルをアクティブバッファに単に付加するだけである。アクティブバッファがいっぱいである場合には、このルーチンは、バッファをフリップ（切り換える）ように試みる。
【００７６】
このルーチンは、先ず、フリップが許されたかどうか決定するようにチェックする（ライン１５０５）。フリップが許された場合は、バッファをフリップし、そして新たなアクティブバッファから全てのサンプルをドロップすることにより新たなアクティブバッファを書き込みのために準備する（ライン１５０８）。バッファがフリップされた後に、ルーチンは、満杯の非アクティブバッファを読み出すようにユーザレベルデーモンに通知する（ライン１５０７）。新たなサンプルが新たなアクティブバッファに追加される。フリップが許されない場合には、ルーチンが新たなサンプルをドロップし、そして復帰する（ライン１５１１）。
【００７７】
ｗｒｉｔｅｔｏｏｖｅｒｆｌｏｗルーチンは、２つのケースにおいてサンプルをドロップする。第１のケースでは、フリップ後に、新たなアクティブバッファの読み取られなかったサンプルがドロップされる（ライン１５０５）。サンプルが実際にドロップされるのは非常に稀である。というのは、これらサンプルは、最後のフリップのとき以来、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンが読み取りのために得ることができ、そしてフリップは、非常に低い頻度で生じるからである。
【００７８】
第２のケースは、アクティブバッファがいっぱいであって、フリップが許されないときである。このケースも、非常に稀である。フリップは、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンが非アクティブなバッファを読み出すときだけ許可されない。非アクティブなバッファは、最後のフリップのときに読み取りの準備がなされ、そしてこのシステムではフリップが生じる頻度が非常に低いので、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンが非アクティブなバッファを依然コピーすることはほとんどあり得ない。前記ケースの両方において、いずれかのサンプルがドロップした場合には、ｆｌｕｓｈｏｖｅｒｆｌｏｗルーチンが、オーバーフローバッファの充満に応答して、ユーザレベルデーモンにより充分迅速にコールされないという指示である。
【００７９】
「ｆｌｕｓｈｏｖｅｒｆｌｏｗ」ルーチンの擬似コードが図１６に示されている。このルーチンは、非アクティブなバッファを特定のプロセッサのユーザスペースへコピーする。このルーチンは、「ａｌｌｏｗｆｌｉｐ〔ｃｐｕ〕」フラグを使用して、非アクティブなバッファがユーザスペースへコピーされる間にｗｒｉｔｅｔｏｏｖｅｒｆｌｏｗルーチンが非アクティブなバッファをアクセスするのを防止する。上記のｆｌｕｓｈｈａｓｈルーチンの場合と同様に、「ａｌｌｏｗｆｌｉｐ〔ｃｐｕ〕」への全てのアクセスが、「ｃｐｕ」という番号のプロセッサにおいて生じ、それ故、正しく同期されるように確保するために、プロセッサ間割り込みが使用される。
【００８０】
収集される情報
性能カウンタの厳密な実施に基づき、選択された命令に対してプログラムカウンタ値をサンプリングすることができる。更に、汎用レジスタを特定するオペランドを有するメモリアクセス（ロード及び記憶）及びジャンプ命令に対して、ベースアドレスも収集することができる。
【００８１】
以上に述べた性能監視システムは、カーナルソフトウェア、入力／出力デバイスドライバ、アプリケーションプログラム及び共用ライブラリーを含むコンピュータシステムのオペレーションの多くの観点で性能データを収集することができる。これは、システムの通常のオペレーションを不当に妨げることなく非常に高いレートでサンプリング割り込みを発生しそして処理することによって達成される。
当業者であれば、本発明の範囲から逸脱せずに本発明に種々の変更がなされ得ることが明らかであろう。
【図面の簡単な説明】
【図１】本発明の好ましい実施形態による性能監視サブシステムによって性能データを収集することのできるコンピュータシステムのブロック図である。
【図２】データ収集サブシステムのブロック図である。
【図３】収集された性能データを記憶するためのハッシュテーブルを示すブロック図である。
【図４】図３のハッシュテーブルを更新するための流れ線図である。
【図５】ハッシュテーブルの非同期更新を示すタイミング図である。
【図６】割り込みハンドラー及びハッシュテーブルフラッシュルーチンにより共用される変数を示す図である。
【図７】割り込みハンドラールーチンの擬似コードを示す図である。
【図８】ハッシュテーブルをフラッシュするための擬似コードを示す図である。
【図９】共用同期変数を示す図である。
【図１０】オーバーフローバッファに空きスロットを得るための擬似コードを示す図である。
【図１１】オーバーフローバッファの空きスロットにエントリーを書き込むための擬似コードを示す図である。
【図１２】オーバーフローバッファをフラッシュするための擬似コードを示す図である。
【図１３】サンプル事象割り込みの間にオーバーフローバッファ事象を取り扱うための擬似コードを示す図である。
【図１４】ハッシュテーブルをフラッシュするための擬似コードを示す図である。
【図１５】オーバーフローバッファへエントリーするルーチンのための擬似コードを示す図である。
【図１６】オーバーフローバッファをユーザバッファへフラッシュするルーチンのための擬似コードを示す図である。
【符号の説明】
１００コンピュータシステム
１１０プロセッサ
１１２性能カウンタ
１２０メモリサブシステム
１２１プログラム
１２２データ
１３０入力／出力インターフェイス
１４０バス
１６０ネットワーク
２００性能データ収集サブシステム
２０１レジスタ
２２０ハードウェアデバイス
２２１−２２３個々のプロセッサ
２３０カーナルモードプログラム及びデータ構造体
２３１−２３２割り込みハンドラー
２３４−２３５テーブル
２３８オーバーフローバッファ
２３９ロック
２５０ディスク
２５１性能データ
２６０ユーザモードプログラム及びデータ構造体
２６１デーモンプロセス
２６２ユーザバッファ
３００ハッシュテーブル
３０１−３０３チャンク

[0001]
BACKGROUND OF THE INVENTION
The present invention relates generally to computer systems, and more particularly to collecting performance data in computer systems.
[0002]
[Prior art]
Collecting performance data in an operating computer system is a frequent and very important task performed by hardware and software engineers. Hardware engineers need performance data to determine how new computer hardware will work with existing operating systems and application programs.
[0003]
Specific designs of hardware structures such as processors, memories, and caches can be used in significantly different, and sometimes unpredictable, ways by the same set of programs. It is important to be able to identify hardware defects so that they can be corrected in future designs. Performance data can show how efficiently software uses the hardware and can help in designing improved systems.
[0004]
Software engineers need to identify the critical parts of their programs. For example, compiler writers need to find out how the compiler schedules instructions for execution, or how well the execution of conditional branches is predicted, in order to provide input for optimizing the software. Similarly, it is necessary to understand the performance of operating system kernels, device drivers, and application software programs.
[0005]
The problem is to accurately monitor the performance of hardware and software systems without disturbing the operating environment of the computer system. This is particularly difficult when performance data are collected over long periods, such as days or weeks. In many cases, performance monitoring systems are hand-crafted. Costly hardware and software modifications may be required to ensure that the operation of the system is not affected by the monitoring system.
[0006]
One way to monitor the performance of a computer system is to use performance counters. Performance counters “count” occurrences of significant events in the system. Significant events include, for example, cache misses, instructions executed, and I/O data transfer requests. By periodically sampling the performance counters, the performance of the system can be inferred.
[0007]
[Problems to be solved by the invention]
It is desirable that the performance of a computer system can be monitored without modifying software. It is also desirable that the sampling rate can be fixed or variable, and that the rate can be very high. Furthermore, during high-frequency sampling, it is desirable to keep the sampling overhead to a minimum so that the performance data accurately reflect the operation of the system. Keeping the overhead low is particularly difficult in a multiprocessor system, where data accesses need to be synchronized and the sampling rate can be very high, for example, 50,000 to 100,000 samples per second.
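To make the overhead requirement concrete, the following sketch computes the fraction of a processor's cycles consumed by sampling. Only the sampling rates come from the text; the function name, the 200-cycle handler cost, and the 500 MHz clock are assumptions chosen for illustration.

```python
def sampling_overhead(samples_per_second, cycles_per_sample, clock_hz):
    """Fraction of one processor's cycles spent in the sampling handler."""
    return samples_per_second * cycles_per_sample / clock_hz


# 50,000 and 100,000 samples/s are the rates quoted in the text; the
# 200-cycle handler cost and the 500 MHz clock are assumed values.
low = sampling_overhead(50_000, 200, 500_000_000)
high = sampling_overhead(100_000, 200, 500_000_000)
print(f"{low:.1%} to {high:.1%} of each processor's cycles")
```

Under these assumed figures, every additional hundred cycles in the handler adds one to two percentage points of overhead, which is why the description is so concerned with cache misses and synchronization on the sampling path.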
[0008]
[Means for Solving the Problems]
The present invention provides an apparatus for collecting performance data in a computer system. The computer system includes a plurality of processors for concurrently executing instructions of a program. The apparatus comprises a plurality of sets of performance counters, with one set of performance counters coupled to each processor. The performance counters store performance data generated by each processor while executing the instructions.
The invention, in its broad form, relates to an apparatus according to claim 1 and a method according to claim 9 for sampling performance data of a computer system.
[0009]
An interrupt handler is executed in each processor. The interrupt handler is for sampling processor performance data in response to an interrupt. The first memory includes a hash table associated with each interrupt handler. The hash table stores performance data sampled by an interrupt handler executed on the processor. The second memory includes an overflow buffer, which is for storing performance data while portions of the hash table are inactive and those portions are full. The third memory includes a user buffer. In addition, means are provided for periodically flushing performance data from the hash table and overflow buffer to the user buffer.
[0010]
Preferably, as described below, the hash table of the first memory is organized as a multi-way set-associative cache. The multi-way set-associative cache includes a plurality of chunks, each chunk being a unit of data transfer between the first memory and the third memory. Each chunk includes a plurality of lines and has an associated activechunk flag and flushchunk flag that indicate, respectively, when the corresponding chunk is inactive and when it is full. The lines of each chunk are further partitioned into a plurality of entries, and each entry includes a plurality of fields for storing performance data. The performance data fields include a processor identification, a program counter, a processor event identification, and a processor event counter.
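The entry layout can be illustrated with a small packing sketch. The description elsewhere states that the preferred embodiment is four-way associative and matches a 64-byte hardware cache line; the individual field widths below (32-bit identification, 64-bit program counter, 16-bit event identification, 16-bit count) and the helper names are assumptions, since the text says only that entries are stored in compressed form.

```python
import struct

# Assumed field widths: 32-bit id, 64-bit pc, 16-bit event id, 16-bit count.
# "<" selects little-endian packing with no padding, so one entry is 16 bytes.
ENTRY = struct.Struct("<IQHH")
WAYS = 4          # four-way associativity, as in the preferred embodiment
LINE_BYTES = 64   # hardware cache line size given in the description


def pack_line(entries):
    """Pack up to WAYS (id, pc, event, count) tuples into one cache line."""
    if len(entries) > WAYS:
        raise ValueError("a line holds at most WAYS entries")
    padded = list(entries) + [(0, 0, 0, 0)] * (WAYS - len(entries))
    return b"".join(ENTRY.pack(*e) for e in padded)


def unpack_line(line):
    """Return the WAYS entries stored in a packed line."""
    return [ENTRY.unpack_from(line, i * ENTRY.size) for i in range(WAYS)]
```

With these widths, four entries fill exactly one 64-byte line, so all candidate entries for a hash index can be examined at the cost of at most one cache miss. A 16-bit count saturates at 65,535, so an implementation with these widths would have to evict an entry before its count wraps.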
[0011]
Conveniently, means are provided for generating a hash index from the processor identification, program counter and processor event identification. The hash index is used to locate the hash table line associated with a particular processor to generate a hit or miss signal.
[0012]
In response to the miss signal, the performance data stored in the current entry indicated by the current hash index is moved from the hash table to the overflow buffer. The current entry is overwritten with performance data sampled by the interrupt handler. In the case of a hit signal, the processor event counter stored in the current entry pointed to by the current hash index is incremented. Performance data is stored in entries in compressed form.
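The hit and miss handling described above can be sketched as follows. This is a minimal single-processor model, not the patented implementation: the class and method names are hypothetical, the exclusive-OR hash is the example the description gives, and the round-robin choice of the entry to evict is an assumption based on the description's mention of a counter that selects which entry of a line to evict on the next miss.

```python
class SampleTable:
    """Per-processor set-associative table of (key, count) entries."""

    def __init__(self, num_lines=1024, ways=4):
        self.num_lines = num_lines
        self.ways = ways
        self.lines = [[None] * ways for _ in range(num_lines)]
        self.victim = [0] * num_lines   # next way to evict, per line
        self.overflow = []              # stands in for the overflow buffer

    def index(self, pid, pc, event):
        # Exclusive OR of the three identifiers, reduced to a line number.
        return (pid ^ pc ^ event) % self.num_lines

    def record(self, pid, pc, event):
        key = (pid, pc, event)
        h = self.index(pid, pc, event)
        line = self.lines[h]
        for entry in line:
            if entry is not None and entry[0] == key:
                entry[1] += 1           # hit: increment the event count
                return "hit"
        way = self.victim[h]            # miss: choose an entry to replace
        self.victim[h] = (way + 1) % self.ways
        if line[way] is not None:
            old_key, old_count = line[way]
            self.overflow.append((*old_key, old_count))
        line[way] = [key, 1]            # overwrite with the new sample
        return "miss"
```

Because repeated samples at the same program counter only increment a count, the table aggregates samples in place and only evicted entries generate traffic toward the overflow buffer.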
[0013]
As shown below, the overflow buffer of the second memory includes first and second buffers organized as a double buffer. Each buffer includes a plurality of slots for storing performance data of hash table entries.
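A double-buffered overflow buffer can be sketched as below. This is a single-threaded illustration under stated assumptions: the class and attribute names are hypothetical, and the lock and memory barriers that a multiprocessor implementation requires are omitted. The flush policy follows the description: a full buffer is flushed if one is waiting, otherwise the partially filled current buffer is flushed while new entries go to the other buffer. A sample is discarded when both buffers are full.

```python
class DoubleOverflowBuffer:
    """Two fixed-size buffers: one receives entries, one awaits flushing."""

    def __init__(self, slots_per_buffer):
        self.slots = slots_per_buffer
        self.buffers = [[], []]
        self.current = 0     # buffer receiving new entries
        self.full = None     # buffer waiting to be flushed, if any
        self.dropped = 0     # samples discarded because both were full

    def write(self, entry):
        if len(self.buffers[self.current]) == self.slots:
            if self.full is not None:
                self.dropped += 1       # both buffers full: drop the sample
                return False
            self.full = self.current    # hand the full buffer to the flusher
            self.current = 1 - self.current
        self.buffers[self.current].append(entry)
        return True

    def flush(self):
        """Return and clear the contents of one buffer."""
        if self.full is not None:
            target = self.full
            self.full = None
        else:
            target = self.current       # flush the partially filled buffer
            self.current = 1 - self.current
        drained = self.buffers[target]
        self.buffers[target] = []
        return drained
```

Writers are never blocked by a flush in this arrangement: while one buffer is being copied out, entries continue to accumulate in the other.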
[0014]
DETAILED DESCRIPTION OF THE INVENTION
The invention will be readily understood from the following detailed description of preferred embodiments, given with reference to the accompanying drawings.
System overview
As shown in FIG. 1, the computer system 100 includes a processor 110, a memory subsystem (memory) 120, and an input/output interface (I/O) 130 connected to each other by a bus 140. The system 100 can be an embedded system, a PC, a workstation, a mainframe, or a member of a cluster of systems linked by a network.
[0015]
The processor 110 may be configured as one or more individual processor chips 111 having either a CISC or a RISC architecture. Each processor 111 is associated with a set of performance counters 112. Each set of performance counters 112 can be implemented as a plurality of registers. These registers can count occurrences of significant events in the system that are indicative of system performance.
[0016]
The memory 120 may include separate static, dynamic, random, sequential, volatile, and persistent storage elements, or combinations thereof. Some of the memories can be arranged hierarchically. The storage elements can be registers, caches, DRAM, disk, tape, and the like. The memory 120 stores software programs 121, in the form of machine-executable instructions, and data 122 accessed by the instructions. The software programs include operating system, device driver, and application programs.
[0017]
The I/O 130 can include interfaces to input and output devices such as printers, terminals, and keyboards. The I/O 130 can also be connected to a network (NET) 160 via a line 150 to communicate data with other computer systems. The bus 140 is typically implemented as a plurality of lines that carry address, data, control, and timing signals between the various components.
[0018]
Overview of operation
During operation of the system 100, the instructions of the programs 121 are executed by one or more processors 111. The instructions generally either control the execution flow of the programs or access (load, store, read and write) the data 122. It is desirable to collect performance data of the system 100 without significantly disturbing the normal operating environment. Analysis of the performance data can be used to optimize the design of the hardware and software elements of the system 100. The performance data can also be used to identify operational problems such as pipeline stalls.
[0019]
Data collection subsystem
In one embodiment shown in FIG. 2, the performance data collection subsystem 200 includes various hardware devices 220, kernel mode programs and data structures 230, and user mode programs and data structures 260. The programs and data 230 and 260 may be stored in the memory 120 of FIG. 1.
The hardware 220 can include a plurality of individual processors 221-223. Each processor is associated with a set of performance counters or registers 201. The set of registers 201 can reside on the same semiconductor die as the associated processor. Each set 201 can include a plurality of registers 231-234.
[0020]
Each of registers 231-234 may include a number of bits to store a performance event count. The total number of specific events that can be accumulated in one counter depends on the size of the register 231-234, that is, the number of bits.
Depending on the particular embodiment, the set of registers 201 can be enabled, disabled, paused, resumed, reset (cleared), read and written. Typically, the registers generate interrupts 251-253 on overflow (a carry out of a particular bit position). In response to an interrupt, the registers 201 can be sampled (read). In some embodiments, an interrupt can be generated when a particular bit of one of the registers becomes set, so that interrupts occur at a frequency corresponding to an integer power of two.
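The overflow-driven sampling described above can be illustrated with a minimal sketch. The class name, the 16-bit register width, and the practice of preloading the register so that it overflows after a chosen number of events are assumptions made for illustration only; the description does not specify them.

```python
class PerformanceCounter:
    """Toy model of one event-counting register (bit width is an assumption)."""

    def __init__(self, bits=16, on_overflow=None):
        self.mask = (1 << bits) - 1
        self.value = 0
        self.on_overflow = on_overflow

    def write(self, value):
        self.value = value & self.mask

    def event(self):
        """Count one event; call the interrupt callback when the register wraps to zero."""
        self.value = (self.value + 1) & self.mask
        if self.value == 0 and self.on_overflow is not None:
            self.on_overflow(self)


def count_interrupts(total_events, period, bits=16):
    """Deliver total_events events and return how many overflow interrupts fired."""
    fired = []

    def handler(counter):
        fired.append(counter)                  # a real handler would sample pid, pc, event here
        counter.write((1 << bits) - period)    # re-arm: overflow again after `period` events

    counter = PerformanceCounter(bits, handler)
    counter.write((1 << bits) - period)
    for _ in range(total_events):
        counter.event()
    return len(fired)
```

With a period that is a power of two, such as 1024, one interrupt is taken for every 1024 counted events.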
[0021]
During operation of the hardware 220, signals or "events" representing various operational characteristics of the system, e.g., (E_1 ... E_M) 241-244, increment the corresponding registers. The exact events that are signaled are usually system specific. Typical events that can increment the registers include, for example, cache misses, branch mispredictions, pipeline stalls, instructions issued, arithmetic operations, processor cycles, replay traps, translation buffer misses, I/O requests, active processes, and so forth. One or more specific events can be selected for sampling at any one time.
[0022]
The kernel mode component 230 of the subsystem 200 includes device drivers or interrupt handlers (programs) 231-233 and tables 234-236. There is one handler and one associated table for each processor. As an advantage, having one hash table per processor eliminates the need to synchronize most processing activities among the various processors. In addition, in one embodiment, there is an overflow buffer 238 that is shared by all of the handlers 231-233. Access to the buffer 238 is controlled by a lock (L) 239. In another embodiment, each handler can be associated with a private buffer to further reduce the number of events that need to be synchronized.
[0023]
During operation of the kernel mode component 230, an interrupt activates the corresponding handler. The handlers 231-233 can operate as multiple concurrent processes or threads. The handler reads the performance data from the register and stores the data in one of its associated tables 234-236, as described in detail below. If any table is full or inaccessible, overflow data can be stored in the overflow buffer. In order to write to the buffer 238, the process must first obtain a lock 239 to ensure data consistency.
[0024]
The user mode component 260 includes a daemon process 261, one or more user buffers 262, and processed performance data 251. The data 251 can be stored on a disk 250 or in another type of non-volatile storage that is part of the memory 120 of FIG. 1.
During operation, the daemon 261 can periodically flush (empty) the hash table and overflow buffer to the user buffer 262. In addition, daemon 261 can process accumulated performance data and generate, for example, execution profiles, histograms, and other statistical data useful for analyzing system 100 performance.
[0025]
Hash table
As shown in FIG. 3 as a preferred embodiment, each of the tables 234-236 is configured as a hash table 300. A hash table is a data structure accessed by a hash index. A hash index is typically a deterministically calculated address that tends to randomly distribute data over a range of addresses.
[0026]
Hash functions are well known. In this embodiment for collecting performance data, the hash table is used to reduce the memory bandwidth required to transfer data between the kernel processes and the user processes. More specifically, as described below, each hash table can be implemented as a multi-way set-associative cache to reduce the bandwidth demand on the memory subsystem. The preferred embodiment of the present invention uses four-way associativity in the hash table 300.
[0027]
The hash table 300 is partitioned into a plurality of chunks 301-303. A chunk is the unit of data transfer between the kernel mode component and the user mode component. Each chunk of the hash table 300 is further partitioned into a plurality of lines 311-314. A line is typically a coherent unit of data transfer that is handled efficiently by the memory transfer logic and the caches of the system.
[0028]
The (four-way) associativity of the hash table 300 is carefully selected to match the size of the hardware cache lines 311-314; for example, each cache line includes 64 bytes. Also associated with each of the chunks 301-303 are active_chunk and flush_chunk flags 315 and 316. The active_chunk flag 315 can be set when one of the handlers is modifying (updating) data stored in one of the chunks of the hash table. When the flag is clear, the corresponding chunk is inactive, i.e., not being written. The flush_chunk flag 316 can be set when data are copied from the hash table to a user buffer. The asynchronous handling of the tables and buffers is described in detail below.
[0029]
Each line includes a plurality of set-associative entries 320. Each entry 320 includes a plurality of fields 321-324. For example, the fields 321-324 respectively store a process identification (pid), a program counter (pc), an event identification (event), and a counter value (count). The data stored in the entries 320 of the lines 311-314 are highly compressed to reduce the number of cache misses. By matching the associativity to the size of the cache lines, the maximum number of entries can be probed at the cost of a single cache miss. Storing the sample data as compressed cache lines reduces the bandwidth demand and the load on the memory subsystem 120.
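As a rough illustration of how four entries can be made to fill exactly one 64-byte cache line, the following sketch packs the pid, pc, event and count fields into 16 bytes each. The field widths (32, 64, 16 and 16 bits) and the helper names are hypothetical; the description states only that the entries are compressed so that a four-way set fits a cache line.

```python
import struct

# Hypothetical field widths: pid (32 bits), pc (64 bits), event (16 bits), count (16 bits).
ENTRY = struct.Struct("<IQHH")
WAYS = 4            # four-way set associativity, as in the preferred embodiment
LINE_BYTES = 64     # cache line size given as an example in the description


def pack_line(entries):
    """Pack up to WAYS (pid, pc, event, count) tuples into one cache-line-sized block."""
    if len(entries) > WAYS:
        raise ValueError("a line holds at most %d entries" % WAYS)
    padded = list(entries) + [(0, 0, 0, 0)] * (WAYS - len(entries))
    return b"".join(ENTRY.pack(*entry) for entry in padded)


def unpack_line(line):
    """Return the WAYS entries stored in a packed line."""
    return [ENTRY.unpack_from(line, way * ENTRY.size) for way in range(WAYS)]
```

Because every way of a set lies in the same line, a lookup that misses in the hardware cache pays for one line fill and can then compare all four entries.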
[0030]
Interrupt handler process
FIG. 4 shows the process 400 of the handlers of FIG. 2. The process 400 updates the entries 320 of a line of the hash table 300 of FIG. 3. The process begins at step 410, typically in response to an interrupt. In step 420, a hash index H_i is determined. The index can be determined by a hash function f_hash that combines bits, for example, the exclusive OR of pid_i, pc_i and event_i of this instance, e.g., H_i = f_hash(pid_i, pc_i, event_i). In step 430, all entries at index H_i of the table 300 are checked to determine whether they match pid_i, pc_i and event_i. In the case of a hit, the count field 324 is incremented in step 440, i.e., count_i = count_i + 1.
[0031]
In the case of a miss, i.e., when no entry at index H_i matches, an entry at index H_i is moved to the overflow buffer 238 of FIG. 2 in step 450. In step 460, the new entry is stored at index H_i of the hash table and its count is set to 1. In either case, hit or miss, the process completes at step 490.
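The hit and miss paths of process 400 can be sketched as follows. This is a simplified, single-processor model: the table size, the use of a free way before evicting, and the round-robin choice of victim are assumptions for illustration, and the names are hypothetical.

```python
WAYS = 4
NUM_LINES = 256   # hypothetical number of lines in the table


class HashTable:
    def __init__(self):
        # Each line holds up to WAYS entries of the form [pid, pc, event, count].
        self.lines = [[] for _ in range(NUM_LINES)]
        # Per-line pointer to the next way to evict (round-robin is an assumption).
        self.victim = [0] * NUM_LINES


def hash_index(pid, pc, event):
    """Combine the identifying fields with exclusive OR, then fold into the table."""
    return (pid ^ pc ^ event) % NUM_LINES


def handle_sample(table, overflow, pid, pc, event):
    h = hash_index(pid, pc, event)              # step 420
    line = table.lines[h]
    for entry in line:                          # step 430
        if entry[:3] == [pid, pc, event]:
            entry[3] += 1                       # step 440: hit
            return "hit"
    if len(line) < WAYS:                        # free way available (assumption)
        line.append([pid, pc, event, 1])
        return "miss"
    way = table.victim[h]
    overflow.append(tuple(line[way]))           # step 450: move old entry to overflow
    line[way] = [pid, pc, event, 1]             # step 460: store new entry, count = 1
    table.victim[h] = (way + 1) % WAYS
    return "miss"
```

Repeated samples from the same instruction therefore cost one increment, and only a conflicting fifth key in a line produces traffic to the overflow buffer.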
[0032]
A particular interrupt handler must access several pieces of global data. The global data include a pointer to the hash table for the processor on which the handler is running, the line of the hash table indexed by the value of the hash function for the new entry, pointers to the overflow buffers, several global variables used to control the state of the data structures (for example, the next index at which to insert into the active overflow buffer, and a counter indicating which entry in a given line should be evicted on the next miss in the hash table), and various other global variables.
[0033]
All of this global data must be carefully laid out to match the hardware structures used to store the data. For example, on a processor with 64-byte cache lines, the data are packed into a single 64-byte structure. This ensures that at most one cache miss is incurred to access any of this data. Minimizing the number of cache misses is important for minimizing the impact of performance data collection, because a cache miss can cost on the order of a hundred cycles or more, and the interrupt handler should generally consume no more than a few hundred cycles in total.
[0034]
Furthermore, on a multiprocessor it is expensive to write to a cache line that contains data shared among multiple processors. The global data are carefully replicated so that each processor has its own copy, which avoids the need to write to shared cache lines.
In another embodiment, the time needed to handle an interrupt can be reduced by using several different handlers, each optimized for a particular state. For example, in one embodiment with multiple overflow buffers as described below, the interrupt handler checks at its start whether it should bypass the hash table. Most of the time, this check evaluates to false.
[0035]
However, an interrupt handler can be built that does not check the flag to see whether the hash table should be bypassed. Instead, this handler assumes that it should access the hash table. Whenever the flag indicating whether the hash table should be bypassed is changed, the system-level interrupt vector can be changed to point to the appropriate interrupt handler. In the normal case, when the hash table should not be bypassed, no check is needed, which saves a number of processor cycles. By carefully analyzing all of the different flags and their normal-case settings, and by using a specialized interrupt handler for the normal case, a significant speedup is obtained.
Synchronization with other processors while manipulating the data stored in the hash tables is described in detail below.
[0036]
Synchronization
Synchronization of access to the hash tables and the overflow buffer is managed as follows. First, there is a separate hash table 234-236 for each processor, so the interrupt handlers 231-233 running on different processors need not synchronize with each other while manipulating the hash tables. However, the handlers share the overflow buffer 238, and thus access to the overflow buffer needs to be synchronized as described below.
[0037]
In addition, the user-level daemon 261 must retrieve all of the sample data from the kernel's hash tables and overflow buffer into the user buffer 262 so that the daemon 261 has current information. Separate routines named "flush_hash" and "flush_overflow" are provided to retrieve data from the hash tables and the overflow buffer, respectively. These routines need to be synchronized with the handlers 231-233 as described below.
[0038]
Hash table synchronization
There are two activities that need to synchronize access to the hash table of a given processor: the interrupt handler for that processor, and the flush_hash routine. The timing of the possible events that need to be synchronized is shown in FIG. 5. The global variables shared by the interrupt handler and the flush_hash routine are shown in FIG. The pseudo code for the interrupt handler is shown in FIG. The pseudo code for the flush_hash routine is shown in FIG.
[0039]
While the hash table for a particular processor is being flushed, the interrupt handler cannot be allowed to access that table. Therefore, during this time, the interrupt handler stores sample data directly in the overflow buffer 238. To minimize the number of samples written directly into the overflow buffer 238, the hash table is flushed one chunk at a time. Samples whose hash index falls into the chunk being flushed are written directly into the overflow buffer. However, samples whose chunk is not being flushed are written to the hash table as described above.
[0040]
The active_chunk and flush_chunk flags 315-316 indicate which chunks are in use by the interrupt handler and the flush_hash routine, respectively. Each flag records the name of a chunk; a chunk is named by the index of its first entry. The value -1 is used in the active_chunk and flush_chunk flags to indicate that no chunk is in use, i.e., that the chunk is inactive.
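The chunk naming convention and the routing decision it enables can be sketched briefly. The chunk size and function names are hypothetical, and the sketch deliberately ignores the memory-ordering questions discussed later; it shows only how a sample's hash index is compared against the chunk currently being flushed.

```python
CHUNK_ENTRIES = 64   # hypothetical number of entries per chunk
NO_CHUNK = -1        # flag value meaning that no chunk is in use


def chunk_name(index):
    """A chunk is named by the index of its first entry."""
    return (index // CHUNK_ENTRIES) * CHUNK_ENTRIES


def route_sample(index, flush_chunk):
    """Decide where a sample with the given hash index is stored."""
    if chunk_name(index) == flush_chunk:
        return "overflow"      # its chunk is being flushed: bypass the hash table
    return "hash_table"
```

For example, with 64-entry chunks, an index of 70 belongs to the chunk named 64, so it bypasses the table only while that one chunk is being flushed.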
[0041]
The procedure for determining the next free slot for storing sample data in the overflow buffer, and for writing an entry into the free slot, is described below as part of the description of synchronizing access to the overflow buffer 238. The data type "slot_index" is also described there.
Synchronization of the flush_hash routine is somewhat subtle because the routine is intended for use in a system with multiple high-speed processors whose memory model is not sequentially consistent. For sequential consistency, see L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", IEEE Transactions on Computers, C-28 (September 1979), pages 690-691.
[0042]
If the order of operations can only be guaranteed under certain conditions, the memory model is not sequentially consistent. The processor can issue and execute instructions in a first order, but the memory subsystem completes access in a second order (memory order). Furthermore, the memory order is transitive, so that A occurs before C if A occurs before B and B occurs before C. The order in which operations actually occur is determined by the memory order based on the following constraints:
1. Memory access operations to a single memory location by a single processor occur in the order in which the access instructions are issued by that processor.
2. Memory access operations to different memory locations by a single processor can occur in any order unless they are separated by a "memory barrier" (MB) instruction. In that case, all operations before the memory barrier instruction occur before all operations after the memory barrier instruction.
3. If two processors access the same memory location, one reading and the other writing, and the read observes the value being written, then the read operation occurs after the write operation.
[0043]
Note that memory barrier instructions require a relatively long execution time. Therefore, it is desirable to minimize the use of MB instructions in execution paths that are frequently traversed.
Three cases are considered based on the specific path taken by the interrupt handler shown in FIG. The handler can execute code beginning at line 704, 707 or line 709-722. Each of these cases is described below.
[0044]
First, if the handler executes the code beginning at line 704, the handler does not need to access the hash table at all. The sample is written directly into the overflow buffer 238, which is synchronized to avoid lost or double-counted samples, as described below. Note that in this case the handler may decide on line 703 to execute line 704 even though the variable flush_chunk[i] has already been set to a value other than c by the flush_hash routine (executing on a different processor).
[0045]
This is because the memory subsystem 120 of FIG. 1 does not guarantee that a new value of flush_chunk[i] is immediately visible on all other processors. This does not cause incorrect behavior; it only means that some samples are written to the overflow buffer when they could have been stored in the hash table. Performance is therefore slightly reduced, but the correctness of the data is not affected.
Second, if the handler executes the code on line 707, no expensive synchronization operations are performed. When the handler is on line 707, a matching entry has been found in the hash table, which means only that the count of that entry needs to be incremented. Note, however, that in this case a single sample can be lost, although very rarely.
[0046]
FIG. 5 shows possible timings for hash table accesses. In FIG. 5, line 560 shows overall time increasing from left to right, line 570 shows the timing of events in the interrupt handler 531 of one processor (cpu j), and line 580 shows the timing of events in the flush_hash routine of another processor (cpu i) 532.
The interrupt handler (cpu j) 531 performs several operations: the flush_chunk flag is read (event C 571), a hit is found in the hash table, the count of the matching entry in the hash table is read (event D 572), and the incremented count is written back to the hash table (event E 573).
[0047]
If the flush_hash routine is not executing concurrently, the incremented count is written back correctly. However, even if the interrupt handler has checked the active_chunk flag 315 and determined that it is not set for the required chunk, it is possible that the flush_hash routine is in fact flushing that chunk concurrently. In this rare case, as described below, careful analysis of the timing shows that a single sample may be lost, but that samples are never double counted.
[0048]
In FIG. 5, the following events occur in the flush_hash routine. The flush_chunk flag 316 is set to the chunk to be flushed next (event A 561). The routine then copies the hash table entry used by the interrupt handler to the user buffer (event G 583). The routine then zeros the hash table entry (event H 584) and indicates completion by clearing the flush_chunk flag 316 (event I 585).
[0049]
The times of two other events are also shown in FIG. 5: the time by which the updated value of the flush_chunk flag is guaranteed to have propagated to all other processors (event B 582), and the time by which the incremented count written back to the hash table by the interrupt handler is guaranteed to have propagated everywhere (event F 574). Note that these times depend on the particular processor implementation and are not determined by the architecture specification.
[0050]
If event E 573 occurs (in memory order) before event G 583, the incremented count is copied to the user buffer at event G 583 and the count is reset to zero at event H 584. In this case, the sample is counted exactly once. This is rare.
If event E 573 occurs after event H 584, the incremented count is written back to the hash table after the count of the entry has been set to zero. The samples represented by the original count of this entry would then be counted twice: once when copied to the user buffer at event G, and once again the next time this entry is evicted or flushed from the hash table. This is unacceptable, because a single hash table entry can have a large count representing a large number of samples.
[0051]
A double count cannot occur as long as event E 573 occurs before event H 584. This is guaranteed by the constraints described below for the following variables:
The maximum time required to propagate a stored value to all processors is "max_prop". The maximum time for the interrupt routine when executing line 707 is "max_intr". The minimum time from event A (561) to event H (584) for the same entry (i.e., the minimum time from when the flush_chunk flag 316 is set until it is cleared by the flush_hash routine) is "min_flush".
[0052]
The following constraint ensures that event E occurs before event H.
(max_intr + (2 * max_prop)) < min_flush
The timing of a particular processor implementation can be measured to determine max_prop and max_intr. The chunk size can then be chosen large enough to ensure that min_flush is sufficiently large.
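A short worked check of this constraint follows. All cycle counts are hypothetical, and the assumption that min_flush grows linearly with the number of entries in a chunk (a fixed cost per entry copied and zeroed) is introduced here only to show how a chunk size might be derived.

```python
def double_count_impossible(max_intr, max_prop, min_flush):
    """True when the constraint max_intr + 2 * max_prop < min_flush holds."""
    return max_intr + 2 * max_prop < min_flush


def min_chunk_entries(max_intr, max_prop, cycles_per_entry):
    """Smallest chunk size (in entries) whose flush time exceeds the bound.

    Assumes, for illustration only, that flushing a chunk takes
    cycles_per_entry * entries cycles.
    """
    bound = max_intr + 2 * max_prop
    return bound // cycles_per_entry + 1
```

For instance, with a hypothetical max_intr of 300 cycles, max_prop of 100 cycles and 20 cycles per flushed entry, the bound is 500 cycles, so a chunk of 26 entries (520 cycles) satisfies the constraint while 25 entries (500 cycles) does not.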
[0053]
Third, if the handler executes line 709-722, it must move the entry from the hash table to the overflow buffer and write a new entry with a count of 1 to the hash table.
To avoid losing or double counting the entry moved from the hash table to the overflow buffer, careful synchronization using two flags (active_chunk[i] and flush_chunk[i]) and up to three memory barrier operations is used. Although this synchronization is relatively expensive, it is needed only when there is a miss in the hash table, which is relatively rare.
[0054]
An important property of this code is that lines 719 and 720 are mutually exclusive with line 707. The algorithm is a variation of a standard mutual exclusion technique, used together with memory barrier instructions. The technique forces the operations to be ordered and ensures that the interrupt handler never waits for a lock.
Rather than waiting for the flush_hash routine to finish with the desired chunk, the interrupt handler simply bypasses the hash table on lines 714-716 when the desired chunk is not available. Also, in the very rare event that the overflow buffer is full, the interrupt handler simply returns, discarding the sample. As a result, in this case no samples are lost or double counted when the handler executes lines 709-722.
[0055]
The net effect of this solution is that in the normal case of a hit in the hash table, the handler performs no expensive synchronization operations, at the cost of losing a sample in very rare cases.
Lines 710, 714 and 718 can be optimized. Obtaining a slot in the overflow buffer on line 711 requires acquiring and releasing a lock. Acquiring a lock involves a memory barrier, as does releasing it, so the memory barrier on line 710 can be eliminated. The memory barriers on lines 714 and 718 can also be eliminated if the release of the lock is moved after the conditional test on line 713 (i.e., the lock is released on lines 714 and 718).
[0056]
Overflow buffer synchronization
In the preferred embodiment, overflow buffer 238 is actually divided into two parts so that one part can be flushed while the other is accessed by the handler. This type of technique is also referred to as “double buffering”. Further, the lock 239 of the overflow buffer 238 is only held while the buffer pointer is manipulated, not while the entry is written to it. This improves efficiency by allowing the handler to write entries to the same overflow buffer in parallel.
[0057]
For clarity, the description of the overflow buffer 238 above was simplified as if it were a single buffer, with a slot identified by a value of type slot_index. The way the buffers and a slot_index are represented is now described in detail.
There are two buffers. An index into the buffers (a slot_index) is composed of a buffer id (0 or 1) and the index of a slot within that buffer. The global variables shared by the handlers and flush_overflow are shown in FIG. 9, the procedures used by a handler to determine the next free slot and to write an entry to the overflow buffer are shown in FIGS. 10 and 11, and the flush_overflow routine is shown in FIG.
[0058]
The flush_overflow routine flushes a single overflow buffer. If a full buffer is waiting to be read, that buffer is flushed; otherwise, the current partially full buffer is flushed while overflows are directed to the other buffer.
A single lock (overflow_lock) synchronizes access to the variables of FIG. 9 and to the buffers (index, completed, current_overflow and full_overflow). All updates to these variables are performed while holding overflow_lock. For buffer i, index[i] is the index of the next slot to be written.
[0059]
Entries are written to the overflow buffer without holding the lock; instead, a slot in buffer i is reserved by incrementing index[i]. Only the processor that reserved a slot is allowed to write to it. When the processor has finished writing the slot, it increments completed[i] as described in FIG. Thus, slots can be written in any order (although they are reserved in a particular order).
[0060]
The while loop of the flush_overflow routine waits until completed[i] is equal to index[i], which means that the writes to all reserved slots have completed. Note that index[full_overflow] cannot change while the flush_overflow routine is in the while loop, because no slots are reserved in a full overflow buffer. Also, completed[full_overflow] increases monotonically and read operations are atomic, so completed[full_overflow] can be read without holding the lock; once a value equal to index[full_overflow] is observed, all writes to the reserved slots must have completed.
[0061]
The memory barrier after the while loop, on line 805 of FIG. 8, is needed to ensure that the memory operations involved in copying the full overflow buffer to the user buffer occur after the test comparing completed[full_overflow] with index[full_overflow]; in other words, that the buffer is copied only after the writes to it have actually completed.
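The reserve, write and complete protocol can be sketched with threads standing in for processors. This is a simplified model under stated assumptions: it uses a single buffer rather than the double buffer, the class and method names are hypothetical, and the flush routine takes the lock on each test instead of relying on atomic reads and a memory barrier as the description does.

```python
import threading

BUF_SLOTS = 1024   # hypothetical buffer capacity


class OverflowBuffer:
    def __init__(self):
        self.lock = threading.Lock()    # plays the role of overflow_lock
        self.slots = [None] * BUF_SLOTS
        self.index = 0                  # next slot to reserve
        self.completed = 0              # number of reserved slots fully written

    def reserve(self):
        """Reserve a slot while holding the lock; None means the buffer is full."""
        with self.lock:
            if self.index == BUF_SLOTS:
                return None
            slot = self.index
            self.index += 1
            return slot

    def write(self, slot, entry):
        self.slots[slot] = entry        # the entry itself is written without the lock
        with self.lock:
            self.completed += 1         # counter updates are made under the lock

    def flush(self):
        """Wait until all reserved slots are written, then return their contents."""
        while True:
            with self.lock:
                if self.completed == self.index:
                    out = self.slots[:self.index]
                    self.index = self.completed = 0
                    return out
```

Because only pointer manipulation happens under the lock, several handlers can fill distinct slots of the same buffer in parallel, and the flush cannot observe a reserved but unwritten slot.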
[0062]
In the above technique, all processors evict to a common buffer, which requires some degree of coordination among the processors on each eviction. In practice there is a pair of overflow buffers, so that when one of them is full and must be flushed to user space, evictions can go to the other buffer. Additional synchronization, based on careful analysis of event timing, ensures that access to the hash tables is properly synchronized.
[0063]
Local overflow buffer per processor
In another embodiment, the above method with a per-processor hash table and a single shared overflow buffer can be extended with an additional small overflow buffer for each processor. In this method, when an interrupt handler needs to write to an overflow buffer, it first checks its local overflow buffer. If the local buffer is not full, the handler simply writes to it. If the local buffer is full, the handler acquires the lock on the shared overflow buffer and copies its entire local buffer to the shared buffer. This reduces the frequency of acquiring locks and of writing to shared cache lines, and thus improves multiprocessor performance.
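A minimal sketch of the per-processor local overflow buffer follows. The capacity, the names, and the choice to store the new entry locally after spilling are assumptions for illustration; the description specifies only that a full local buffer is copied to the shared buffer under the lock.

```python
import threading

LOCAL_SLOTS = 8   # hypothetical capacity of a local overflow buffer

shared_overflow = []              # stands in for the shared overflow buffer
shared_lock = threading.Lock()    # lock protecting the shared buffer


class LocalOverflow:
    """Small private overflow buffer owned by one processor."""

    def __init__(self):
        self.entries = []

    def write(self, entry):
        if len(self.entries) == LOCAL_SLOTS:
            # Local buffer full: take the shared lock once and spill everything.
            with shared_lock:
                shared_overflow.extend(self.entries)
            self.entries.clear()
        self.entries.append(entry)    # common case: no lock, no shared cache line
```

With eight local slots, the shared lock is taken once per eight evictions instead of once per eviction.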
[0064]
As a variation of this method with local overflow buffers per processor, the method is further modified by completely eliminating the hash table. This has higher memory traffic than the other methods described above, but since the writes to the shared overflow buffer are still less frequent, the overhead for accessing and locking the multiprocessor shared memory is still small.
[0065]
Multiple overflow buffers
In another embodiment, the cost of hash table synchronization and of overflow buffer access can be reduced by using a different synchronization technique together with additional overflow buffers, namely two buffers per processor rather than a single double buffer shared by all processors.
In this technique, each processor "owns" a private hash table (as described above) and a private pair of (double) overflow buffers. Here, "owns" means that all changes (except one) to the state of the active buffer and the hash table are made by that processor, which eliminates many of the memory synchronization operations present in the single double buffer technique described above.
[0066]
There are two major changes relative to the first technique. First, while the hash table is being flushed to user space, performance counter events are added to the active overflow buffer, bypassing the hash table. This eliminates the synchronization required to ensure that the hash table is not changed during the flush operation. As with the first technique, the hash table can be flushed in chunks, reducing the frequency with which events must be added directly to the overflow buffer.
[0067]
Second, each processor has a dedicated pair of overflow buffers, eliminating the synchronization necessary to share a single active overflow buffer across all processors.
The following data states are maintained for each processor.

[0068]
Hash table synchronization
There are two activities that access the hash table for a particular processor.
1) The interrupt handler stores the new sample in the hash table; and
2) The flush_hash routine copies the hash table to user space.
The pseudo code for handling overflow events during an interrupt is shown in FIG. Access to the hash table is controlled using the "bypass_hash" variable. Lines 1301-1303 show that if this variable is set to "true", the interrupt handler skips the hash table completely and writes the new sample directly into the overflow buffer.
[0069]
Lines 1305-1310 show the other paths through the interrupt handler. If the new sample matches one of the entries in the hash table (lines 1305-1306), the handler simply increments the count associated with the matching entry. Otherwise (lines 1308-1310), the handler selects one of the existing entries for eviction. This entry is removed from the hash table and written to the overflow buffer. The new sample is then written into the emptied slot of the hash table.
[0070]
The pseudo code for the "flush_hash" routine is shown in FIG. For each processor, this routine sets the "bypass_hash[cpu]" flag to true (lines 1404-1408), copies the hash table to user space, and then resets the "bypass_hash[cpu]" flag to false (lines 1406 and 1410). For proper synchronization, the "flush_hash" routine takes care to perform the changes to "bypass_hash[cpu]" on the processor numbered "cpu".
[0071]
If the processor on which the "flush_hash" routine is running is the same processor whose hash table is being copied (lines 1403-1406), the routine simply sets and clears the "bypass_hash[cpu]" flag as local operations. Otherwise, the "flush_hash" routine uses interprocessor interrupts to cause the changes to "bypass_hash[cpu]" to be performed on the processor "cpu".
[0072]
The interrupt handler and the flush_hash routine synchronize correctly because "bypass_hash[cpu]" is read and written only on the processor numbered "cpu". The interprocessor interrupts are set to execute at the same interrupt priority level as the interrupt handler for performance counter overflows, which ensures that they execute atomically with respect to each other. Other communication mechanisms, such as sending messages, can be used in place of interprocessor interrupts as long as the same level of atomicity is ensured.
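The interplay between the handler and the flush routine in this embodiment can be sketched in a single-threaded model. The dictionary standing in for the hash table, the capacity, the choice of victim, the clearing of the table after copying, and the during_copy callback (which simulates a sample arriving on the same processor while the table is being copied) are all assumptions for illustration. Interprocessor interrupts are not modeled.

```python
CAPACITY = 4   # hypothetical number of entries the table can hold


class CpuState:
    """Per-processor state owned by one processor."""

    def __init__(self):
        self.bypass_hash = False
        self.table = {}       # sample key -> count; stands in for the hash table
        self.overflow = []    # stands in for the active overflow buffer


def handle_interrupt(cpu, key):
    if cpu.bypass_hash:                        # skip the hash table entirely
        cpu.overflow.append((key, 1))
    elif key in cpu.table:                     # hit: increment the count
        cpu.table[key] += 1
    else:                                      # miss: evict if needed, then insert
        if len(cpu.table) == CAPACITY:
            victim = next(iter(cpu.table))     # oldest inserted key (assumption)
            cpu.overflow.append((victim, cpu.table.pop(victim)))
        cpu.table[key] = 1


def flush_hash(cpu, user_buffer, during_copy=None):
    cpu.bypass_hash = True                     # new samples now go to the overflow buffer
    if during_copy is not None:
        during_copy()                          # simulated sample arriving mid-flush
    user_buffer.extend(cpu.table.items())
    cpu.table.clear()                          # emptying the table is an assumption
    cpu.bypass_hash = False
```

A sample that arrives while the flag is set is recorded once in the overflow buffer and never touches the table being copied, so it is neither lost nor counted twice.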
[0073]
Overflow buffer synchronization
There are two activities that use the pair of overflow buffers associated with a particular processor. First, the interrupt handler sometimes writes entries to the overflow buffers. Second, the flush_overflow routine periodically copies the contents of the overflow buffers to user space. These two activities are synchronized by marking one of the buffers as the "active" buffer. The interrupt handler is only allowed to write to the "active" buffer, and the flush_overflow routine is only allowed to read from the inactive buffer.
[0074]
When the active buffer is full, the interrupt handler attempts to switch (flip) the buffers by marking the old active buffer as inactive and the old inactive buffer as active. The flag "allow_flip[cpu]" is used to prevent this switch from occurring while "flush_overflow" is copying the old inactive buffer to user space.
[0075]
A routine for adding an entry to the overflow buffer for a particular processor is shown in FIG. If the active buffer is not full (line 1502), the routine simply appends the sample to the active buffer. If the active buffer is full, this routine attempts to flip the buffer.
[0076]
The routine first checks to determine if a flip is allowed (line 1505). If flipping is allowed, the new active buffer is prepared for writing by flipping the buffer and dropping all samples from the new active buffer (line 1508). After the buffer is flipped, the routine notifies the user level daemon to read a full inactive buffer (line 1507). New samples are added to the new active buffer. If flipping is not allowed, the routine drops a new sample and returns (line 1511).
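The append, flip and drop behavior of this routine can be sketched as follows. The buffer capacity, the class layout, and the use of counters in place of an actual notification to the daemon are assumptions for illustration.

```python
BUF_SLOTS = 4   # hypothetical capacity of each overflow buffer


class CpuOverflow:
    """Pair of overflow buffers owned by one processor."""

    def __init__(self):
        self.buffers = [[], []]
        self.active = 0          # index of the active buffer
        self.allow_flip = True   # cleared while the inactive buffer is being copied
        self.dropped = 0         # samples lost (should be very rare)
        self.notified = 0        # stand-in for notifying the user-level daemon


def write_to_overflow(cpu, entry):
    buf = cpu.buffers[cpu.active]
    if len(buf) < BUF_SLOTS:         # active buffer not full: just append
        buf.append(entry)
        return True
    if not cpu.allow_flip:           # full and flipping not allowed: drop the sample
        cpu.dropped += 1
        return False
    cpu.active ^= 1                  # flip the buffers
    new_active = cpu.buffers[cpu.active]
    cpu.dropped += len(new_active)   # unread samples in the new active buffer are lost
    new_active.clear()
    cpu.notified += 1                # ask the daemon to read the full inactive buffer
    new_active.append(entry)
    return True
```

In normal operation the daemon empties the inactive buffer well before the next flip, so both drop paths are expected to be exercised only when flushing falls behind.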
[0077]
The write_to_overflow routine drops samples in two cases. In the first case, after a flip, any unread samples in the new active buffer are dropped (line 1505). It is very rare that samples are actually dropped, because these samples have been available for the flush_overflow routine to read since the last flip, and flips occur very infrequently.
[0078]
The second case is when the active buffer is full and flipping is not allowed. This case is also very rare. Flipping is disallowed only while the flush_overflow routine is reading the inactive buffer. The inactive buffer was ready to be read at the time of the last flip, and because flips are very infrequent in this system, it is unlikely that the flush_overflow routine is still copying the inactive buffer. In both cases, if any samples are dropped, this is an indication that the flush_overflow routine is not being called quickly enough by the user level daemon in response to the overflow buffer filling.
[0079]
The pseudo code for the flush_overflow routine is shown in FIG. 16. This routine copies the inactive buffer of a particular processor to user space. While the inactive buffer is being copied to user space, the routine uses the allow_flip[cpu] flag to prevent the write_to_overflow routine from accessing the inactive buffer. As with the flush_hash routine described above, all accesses to allow_flip[cpu] are made on the processor numbered cpu, and interprocessor interrupts are used to ensure that they are properly synchronized.
[0080]
Information collected
Depending on the exact implementation of the performance counters, program counter values can be sampled for selected instructions. In addition, base addresses can be collected for memory access (load and store) instructions and for jump instructions whose operands specify general purpose registers.
[0081]
The performance monitoring system described above can collect performance data on many aspects of computer system operation, including kernel software, input/output device drivers, application programs, and shared libraries. This is accomplished by generating and handling sampling interrupts at a very high rate without unduly hampering normal operation of the system.
It will be apparent to those skilled in the art that various modifications can be made to the present invention without departing from the scope of the invention.
[Brief description of the drawings]
FIG. 1 is a block diagram of a computer system capable of collecting performance data by a performance monitoring subsystem according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram of a data collection subsystem.
FIG. 3 is a block diagram showing a hash table for storing collected performance data.
FIG. 4 is a flow diagram for updating the hash table of FIG. 3.
FIG. 5 is a timing diagram showing asynchronous update of a hash table.
FIG. 6 is a diagram showing variables shared by an interrupt handler and a hash table flush routine.
FIG. 7 is a diagram illustrating pseudo code of an interrupt handler routine.
FIG. 8 is a diagram illustrating pseudo code for flushing a hash table.
FIG. 9 is a diagram showing a shared synchronization variable.
FIG. 10 is a diagram illustrating pseudo code for obtaining an empty slot in an overflow buffer.
FIG. 11 is a diagram illustrating pseudo code for writing an entry into an empty slot of an overflow buffer.
FIG. 12 is a diagram illustrating pseudo code for flushing an overflow buffer.
FIG. 13 shows pseudo code for handling overflow buffer events during sample event interrupts.
FIG. 14 is a diagram illustrating pseudo code for flushing a hash table.
FIG. 15 shows pseudo code for a routine for entry into an overflow buffer.
FIG. 16 shows pseudo code for a routine for flushing an overflow buffer to a user buffer.
[Explanation of symbols]
100 computer system
110 processor
112 Performance counter
120 memory subsystem
121 program
122 data
130 Input / Output Interface
140 Bus
160 network
200 Performance data collection subsystem
201 registers
220 hardware devices
221-223 Individual processor
230 Kernel mode program and data structure
231-232 Interrupt handler
234-235 table
238 Overflow buffer
239 lock
250 disk
251 Performance data
260 User Mode Program and Data Structure
261 daemon process
262 User buffer
300 hash table
301-303 chunks

Claims

An apparatus for collecting performance data in a computer system including a plurality of processors for executing program instructions simultaneously,
Wherein the performance data is the number of occurrences of a specific event generated by each processor while executing instructions,
The apparatus comprising:
A plurality of performance counters connected to each of the processors;
An interrupt handler executed on the processor, one for each of the processors;
A first memory including one hash table for each of the interrupt handlers;
A second memory including an overflow buffer;
A third memory including a user buffer; and
Means for flushing,
Wherein the performance counters store the performance data and generate an interrupt when a specific state of the performance counter occurs,
The hash table is partitioned into a plurality of chunks, and each chunk is partitioned into a plurality of entries,
The interrupt handler reads the performance data from the performance counters in response to the interrupt, and stores the performance data in either an entry of the hash table corresponding to the interrupt handler or the overflow buffer,
The means for flushing periodically flushes the performance data stored in the entries of the hash table and in the overflow buffer to the user buffer, and
When storing specific performance data, the interrupt handler stores the performance data in the overflow buffer if the chunk including the entry that is the storage destination of the performance data is being flushed by the means for flushing.

The apparatus according to claim 1, wherein each chunk includes:
A plurality of lines;
A flag indicating whether the chunk is being written by the interrupt handler; and
A flag indicating whether or not the chunk is being flushed by the means for flushing,
And wherein each line further includes a plurality of the entries, and each entry includes a plurality of fields for storing the performance data and related information of the performance data.

The apparatus of claim 2, wherein the plurality of fields of each entry further includes a process identification, a program counter, a processor event identification, and a processor event counter.

The apparatus of claim 3, wherein the interrupt handler:
Generates a hash index from the process identification, program counter, and processor event identification to determine whether they are associated with a current entry in the plurality of lines;
In response to a miss signal, moves the performance data stored in the current entry indicated by the hash index from the hash table to the overflow buffer, and overwrites the current entry with the performance data read from the performance counter; and
In response to a hit signal, increments the processor event counter stored in the current entry indicated by the hash index.

The apparatus of claim 1, wherein the overflow buffer of the second memory comprises first and second buffers organized as a double buffer,
Each buffer includes a plurality of slots for storing performance data of hash table entries,
The overflow buffer is further divided into double buffers, one for each of the plurality of processors, and
The performance data is read from the inactive first buffer while the performance data is written to the active second buffer.

The apparatus of claim 1, wherein the program includes application, operating system, and shared library program components, and the apparatus further collects performance data from the application, operating system, and shared library program components.

A computerized method for collecting performance data in a computer system comprising a plurality of processors for simultaneously executing instructions of a program, each processor having a set of performance counters connected thereto,
Storing in the plurality of sets of performance counters performance data that is the number of occurrences of a specific event generated by each processor during execution of an instruction;
Generating an interrupt upon occurrence of a specific state of the performance counter;
Sampling performance data stored in a set of performance counters in response to the interrupt;
Storing the sampled performance data either in one of the entries of a hash table or in an overflow buffer of a second memory, the hash table being one of a plurality of hash tables in a first memory, one for each processor, each hash table being partitioned into a plurality of chunks, and each chunk being partitioned into a plurality of entries; and
Periodically flushing the performance data stored in the entries of the hash table and in the overflow buffer to a user buffer,
Wherein the step of storing the sampled performance data includes storing the performance data in the overflow buffer if the chunk including the entry that is the storage destination of the specific performance data is being flushed by the step of flushing.