JP2004102594A

JP2004102594A - Method for analyzing number of times of execution of parallel execution program and its program

Info

Publication number: JP2004102594A
Application number: JP2002262872A
Authority: JP
Inventors: Masaharu Nakazawa; 中澤　正治
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-09-09
Filing date: 2002-09-09
Publication date: 2004-04-02

Abstract

PROBLEM TO BE SOLVED: To confirm numerically the load distribution amount for parallelization, and to support the parallelization tuning of a parallel execution program.
SOLUTION: An analysis target program execution count measuring means 12 measures the execution counts of executable statements for each parallel execution unit by accumulating execution counts in a count area in response to count instructions inserted into the analysis target program at translation time. An execution count measurement result analyzing means 13 obtains the average iteration count of each loop from the execution count measurement results and calculates the performance improvement rate when the degree of parallelism is changed. It also calculates the execution counts of executable statements in each procedure of the program executed in parallel and the execution counts of executable statements in each loop for each parallel execution unit, displays the results in descending order of count, and displays information indicating the balance of execution counts among the parallel execution units.
COPYRIGHT: (C)2004,JPO

Description

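[0001]
[Technical Field of the Invention]
The present invention relates to a function for supporting the parallelization tuning of programs in the field of large-scale scientific and technical computing, HPC (High Performance Computing System), and in particular to an execution count analysis method for parallel execution programs, and an execution count analysis program therefor, that support the process of improving the performance of a sequentially executed program written in a high-level language such as Fortran through load distribution by parallelization.
[0002]
The program parallelization to be tuned covers the shared-memory parallel method and the distributed-memory parallelization method.
[0003]
In the shared-memory parallel method, a program is executed in parallel on a plurality of parallel execution units called threads. To specify parallelization, the user uses OpenMP parallelization syntax, at program execution time, to designate the range of the source code to be parallelized and the parallel load division amount, which indicates how the loop load is distributed among the parallel execution units in that range.
[0004]
For example, in the parallelization target program shown in FIG. 15, the range from basic block 1 to basic block 3 is designated for parallelization by the parallelization directives ("!$OMP PARALLEL", "!$OMP END PARALLEL"). Here, a basic block is a sequence of executable statements whose execution order cannot change. For the portion corresponding to a loop, the compiler creates a generated function containing object code that reflects the parallel load division amount specified by the user, and that function is executed in parallel. With the parallelization designation of FIG. 15, basic blocks 1, 2, and 3 are executed redundantly, with every parallel execution unit performing the same processing, while the load of the loop portion is distributed according to the parallel load division amount.
[0005]
In the distributed-memory parallelization method, data is divided among a plurality of parallel execution units called processes, and the program is executed in parallel. To specify parallelization, MPI (Message Passing Interface) library calls are inserted at the positions in the source code to be parallelized, and the parallel load division amount is specified.
[0006]
The present invention outputs information that supports the parallelization tuning of programs parallelized by either method and of parallelized programs in which both methods are mixed.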
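[0007]
[Prior Art]
To support shared-memory parallel and distributed-memory parallel programming, it is particularly necessary to measure the average iteration count of the loops targeted for load distribution, and a function is needed that allows the parallelization status of procedures, loops, and individual executable statements to be checked for each parallel execution unit.
[0008]
As a technique for optimizing the parallel processing of a program, there has been a technique for a parallel computer composed of multiple processors in which data and programs are allocated to each processor element, the processing time is measured at execution, and reallocation is performed dynamically so that the variation evens out across the processor elements (see, for example, Patent Document 1). Techniques for measuring the average iteration count of each loop of a sequentially executed program have also long existed (see, for example, Non-Patent Document 1).
[0009]
[Patent Document 1]
JP-A-2-132526 (page 1)
[Non-Patent Document 1]
FUJITSU, "UXP/V Analyzer User's Guide, for V20" (J2U5-0132-02), Fujitsu Limited, September 1999, 2nd edition, pp. 55-56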
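[0010]
[Problems to Be Solved by the Invention]
However, conventional program execution count analysis functions targeted sequentially executed programs and offered no support for program parallelization. That is, there was no function for measuring the average iteration count of each loop of a parallel execution program for each parallel unit.
[0011]
Also, for example, the technique described in JP-A-2-132526 (Patent Document 1) does not measure loop counts and display them for tuning, so the user cannot perform parallelization tuning after checking how the load is distributed in parallel execution.
[0012]
Specifically, there was conventionally no function for predicting the performance gained by parallelizing the tuning target program, no function for checking the balance of procedure and loop execution counts among parallel execution units, and no function for displaying the parallel status of each executable statement in a program executed in parallel. It was therefore difficult to tune a parallelized program on the basis of the load distribution in parallel execution.
[0013]
An object of the present invention is to solve the above problems of the prior art and to provide a tuning support technique that lets the user tune a parallelized program after checking the load distribution in parallel execution.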
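[0014]
[Means for Solving the Problems]
To solve the above problems, the present invention focuses on the following points.
[0015]
(1) The execution load of a program is expressed by execution time. Procedures, loops, and executable statements with higher execution counts tend to require more execution time. By using this tendency to express execution load as execution counts, a tuning support function is realized that expresses execution load as numerical values that are easy to feed back into the source code. In addition, because the load distribution amount for parallelization (mainly the loop iteration count) is often specified numerically, obtaining the program's execution load as numbers is useful for tuning. Furthermore, if the load of each loop iteration is the same regardless of the iteration number, a performance gain can be obtained by finding the average loop iteration count and dividing that value evenly among the parallel execution units.
[0016]
(2) Tuning can proceed effectively if performance improvement prediction information is available for the case where the loop iteration count is divided among parallel execution units in proportion to the increase in the number of parallel execution units.
[0017]
(3) When a program is executed in parallel, tuning information is easy to understand if the parallel execution status can be checked numerically for each procedure, loop, and executable statement.
[0018]
Based on these points, the present invention accurately measures the average loop iteration count, which receives the most attention for load distribution in program parallelization.
[0019]
The present invention also realizes a performance improvement prediction function for the case where the loop iteration count is divided more finely in proportion to the increase in the number of parallel execution units. It displays the execution counts of procedures and loops so that they can be compared among parallel execution units. It displays procedures and loops in descending order of load so that tuning targets can be selected easily. It further displays execution counts and parallel execution status in correspondence with the executable statements of the source code.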
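[0020]
That is, the present invention is an execution count analysis method for a parallel execution program, characterized by comprising: a program translation step of translating an analysis target program and, at translation time, inserting instructions that accumulate execution counts into each procedure; a program execution count measurement step of accumulating, for each parallel unit, the execution counts of the units set at translation time; and an execution count measurement result analysis step of analyzing the execution count measurement results of the parallel execution units and displaying the analysis results on a terminal.
[0021]
In the present invention, the program translation step inserts, at the start position of each procedure, a call instruction for a function that secures a count area for accumulating the execution counts of the executable statements in that procedure, and inserts count instructions that accumulate execution counts at the head position of each basic block (a sequence of executable statements in the procedure whose execution order cannot change), at the start position of each loop, and at the position of the first executable statement in each loop; the program execution count measurement step accumulates execution counts in the count area by means of the count instructions encountered while the instructions of each procedure are executed and, when the whole program ends, outputs the execution counts in the count area to an execution information file for each parallel execution unit; and the execution count measurement result analysis step extracts the execution count measurement results of each parallel execution unit from the execution information file and calculates, for each parallel execution unit, the average iteration count of each loop, which is the average execution count of the in-loop executable statements per loop execution.
[0022]
By using the present invention, the state of load distribution in parallel execution can be made clear. Accordingly, if the load of each loop iteration is the same regardless of the iteration number, a performance gain can be obtained by dividing the average loop iteration count evenly among the parallel execution units.
[0023]
In the present invention, the execution count measurement result analysis step further comprises: a step of calculating, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count of executable statements in the program converted to sequential execution; a step of calculating, from that sequential-equivalent total, the sum of the in-loop executable statement execution counts of the parallel execution units, the degree of parallelism, and the overhead for parallel execution, the total execution count of executable statements in a parallel execution unit when the sequential-equivalent total is divided among the parallel execution units according to the degree of parallelism; and a step of calculating, from the calculated total for a parallel execution unit and the calculated sequential-equivalent total, the performance improvement rate when the degree of parallelism is changed.
[0024]
By using the present invention, a performance improvement prediction function can be realized for the case where the loop iteration count is divided more finely in proportion to the increase in the degree of parallelism, and tuning can proceed effectively.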
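[0025]
In the present invention, the execution count measurement result analysis step further calculates, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the execution counts of executable statements in each procedure of the program executed in parallel and the execution counts of executable statements in each loop for each parallel execution unit, displays the calculated results on a terminal in descending order of count, and displays the balance of execution counts among the parallel execution units on the terminal.
[0026]
By using the present invention, the load balance among parallel units can be checked for each procedure and loop, and parallelization candidates can be selected.
[0027]
In the present invention, the execution count measurement result analysis step further calculates, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count and the degree of parallelism for each executable statement in association with the source code, and displays them on a terminal.
[0028]
By using the present invention, the parallel execution status becomes easy to recognize, which promotes feeding performance improvements back into the source code.
[0029]
The program of the present invention is a program for causing a computer to execute an execution count analysis method for a parallel execution program, namely an execution count analysis program for parallel execution programs that causes a computer to execute a method comprising: a translation step of inserting into each procedure, when the parallel execution program to be analyzed is translated, instructions for accumulating execution counts at least for each parallel execution unit; a measurement step of accumulating, for each parallel unit, the execution counts of the instructions set at translation time by executing the translated parallel execution program; and an analysis step of analyzing the execution count measurement results of the parallel execution units and outputting analysis results that indicate at least the load distribution in parallel execution.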
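[0030]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows an example of a hardware configuration that realizes the execution count analysis method for parallel execution programs of the present invention.
[0031]
In the figure, 1 is a parallel execution program execution result analysis device, such as a workstation (WS), that analyzes the execution results of a parallel execution program. 2 and 3 are node A and node B, each of which has one or more CPUs 21, 31 and a memory 22, 32 and executes the parallel execution program. 4 is an execution information file in which the execution count measurement results of the parallel execution program executed on node A 2 and node B 3 are recorded, and 5 is a high-speed bus for inter-processor communication that connects node A 2 and node B 3. Although the execution information file 4 is a shared file in FIG. 1, the present invention can also be realized when the execution information file 4 is not shared.
[0032]
The execution count analysis method for parallel execution programs of the present invention is used to support parallelization tuning of the shared-memory parallelization method performed within node A 2 or node B 3, and also to support parallelization tuning of the distributed-memory parallelization method performed between node A 2 and node B 3.
[0033]
FIG. 2 shows an outline of the execution count analysis method for parallel execution programs of the present invention. 6 is an analysis result output device such as a display or printer; 7 is the source program of the analysis target program, in which parallelization directives are embedded; 7' is the load module of the analysis target program, in which instructions for measuring execution counts have been embedded by translation; 11 is an analysis target program translating means that translates (including linking) the analysis target program; 12 is an analysis target program execution count measuring means; and 13 is an execution count measurement result analyzing means that analyzes the execution count measurement results. The analysis target program translating means 11 and the execution count measurement result analyzing means 13 need not be realized on the same device and may be realized on separate devices.
[0034]
The processing performed by the parallel execution program execution result analysis device 1 is described below. FIG. 3 shows an example of the execution count analysis processing flow for parallel execution programs of the present invention.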
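[0035]
First, the analysis target program translating means (compiler) 11 translates the analysis target program 7, which is written in a high-level language such as Fortran (step S1). In this translation, an instruction that calls a function for securing an area (count area) in which the execution counts of the executable statements in a procedure are accumulated is inserted at the start position of each procedure in the object program.
[0036]
The analysis target program translating means 11 also inserts instructions that accumulate execution counts (count instructions) at the following positions within the procedures in the object program.
(1) The head position of each basic block (a sequence of executable statements whose execution order cannot change)
(2) The start position of each loop
(3) The first executable statement in each loop
Expressed in source code form, each of these count instructions is, for example, an instruction of the following form:
area_name(x) = area_name(x) + 1
[0037]
FIG. 4 shows an example of inserting count instructions into the object program when the analysis target program 7 is translated. In the example of FIG. 4, a function call instruction for count area management is first inserted at the start position of the procedure, count instructions that count executions are inserted at the head position of each of basic blocks 1 to 3, a loop start count instruction is inserted at the start position of the loop, and an in-loop first statement count instruction is inserted at the first executable statement in the loop. These instructions are inserted when a predetermined compiler option is selected at the start of translation.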
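[0038]
Next, the analysis target program execution count measuring means 12 measures the execution counts of the analysis target program 7 (step S2). Specifically, the load module (analysis target program 7') obtained from the object program produced by translating the analysis target program 7 is executed by the CPUs 21, 31 of nodes A 2 and B 3 shown in FIG. 1, and execution counts are accumulated for each parallel execution unit using the count instructions set at translation time.
[0039]
Execution counts are accumulated as follows. At the start of each procedure, the function that secures a count area is called. This function manages count areas using the name of the calling function and the parallel execution unit as the key. When called, it notifies the caller of the count area corresponding to that key.
[0040]
When a count instruction inserted at translation time is encountered while the instructions of a procedure are being executed, that count instruction accumulates the execution count in the count area prepared at the start of the procedure. When the whole program ends, the execution counts in the count areas are recorded in the execution information file 4 for each parallel execution unit. FIG. 5 shows an example of the execution count measurement method for the analysis target program 7 shown in FIG. 3. In the execution count measurement shown in FIG. 5, based on the generated function that the compiler created for a parallelization designation made in the same way as that shown in FIG. 15, the loop portion in particular is divided among the parallel execution units and executed in parallel, and execution counts are measured for each parallel execution unit.
[0041]
Next, the execution count measurement result analyzing means 13 analyzes the execution count measurement results (step S3). Specifically, it retrieves the execution count measurement results of each parallel execution unit from the execution information file 4, analyzes them as follows, and outputs the results to the analysis result output device 6, such as a display or printer.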
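[0042]
First, the loop iteration count of each parallel execution unit, and the average over those units, is calculated for each loop by the following formula.
[0043]
Average loop iteration count
= (execution count of the first executable statement in the loop) / (execution count of the loop start position)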
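A minimal Python sketch of the formula above, assuming the two counters have already been read from the execution information file for each parallel execution unit. The function name, the dictionary layout, and the counter values are hypothetical and are not taken from the patent.

```python
def average_iteration_count(first_stmt_count: int, loop_start_count: int) -> float:
    """Average loop iteration count = first-statement count / loop-start count."""
    if loop_start_count == 0:
        return 0.0  # this unit never entered the loop
    return first_stmt_count / loop_start_count


# Hypothetical counters for one loop:
# unit id -> (count of first in-loop statement, count of loop start position)
measured = {0: (160, 8), 1: (160, 8), 2: (176, 8), 3: (144, 8)}

# Average iteration count for each parallel execution unit
per_unit_average = {
    unit: average_iteration_count(first, start)
    for unit, (first, start) in measured.items()
}

# Average over the parallel execution units
overall_average = sum(per_unit_average.values()) / len(per_unit_average)
```

With these made-up counters, unit 2 averages more iterations per loop execution than unit 3, which is the kind of imbalance the per-unit figure is meant to expose.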
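In addition, the following formula is used to calculate, from the sequential-equivalent total execution count for the tuning target loop, the total execution count of executable statements in a parallel execution unit when the load is divided among the parallel execution units according to the degree of parallelism.
[0044]
PTOTAL
= STOTAL - LOOP + (LOOP / Para) + POVER
Here, STOTAL is the total execution count of executable statements in the program converted to sequential execution and, as shown in FIG. 6, consists of the basic block execution cost and LOOP. LOOP is the total execution count of executable statements in the tuning target loop converted to sequential execution, and is the sum of the in-loop executable statement execution counts of the parallel execution units.
[0045]
PTOTAL is the total executable statement execution count of a parallel execution unit and consists of the redundant execution cost, LOOP/Para, and POVER. The redundant execution cost is the execution cost of the basic blocks for which every parallel execution unit performs the same processing. FIG. 6 shows the case where the redundant execution cost equals the basic block execution cost.
[0046]
Para is the degree of parallelism, and LOOP/Para is the parallel execution cost, that is, the load of in-loop executable statement executions in each parallel execution unit. POVER is the overhead for parallel execution and is a fixed value.
[0047]
Here, STOTAL is calculated from the count values in the count areas of the individual parallel execution units shown in FIG. 5. That is, LOOP is first calculated by summing the in-loop executable statement execution counts of the parallel execution units. Then, for example, when every parallel execution unit performs the same processing for basic blocks 1 through 3, the redundant execution cost is calculated from the count values of a representative parallel execution unit for those redundantly executed basic blocks, and STOTAL is obtained by adding the redundant execution cost to the calculated LOOP.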
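The calculation of STOTAL described above can be sketched in Python as follows. This is only an illustration under assumptions: the per-unit record layout (`basic_blocks`, `loop_body`), the choice of unit 0 as the representative unit, and all count values are hypothetical.

```python
def sequential_totals(per_unit: dict, representative: int = 0) -> tuple:
    """Return (STOTAL, LOOP) computed from per-unit count values."""
    # LOOP: sum of in-loop executable statement counts over all units
    loop = sum(counts["loop_body"] for counts in per_unit.values())
    # Redundant execution cost: basic block counts of one representative unit,
    # because every unit executes the same basic blocks
    redundant_cost = sum(per_unit[representative]["basic_blocks"])
    return loop + redundant_cost, loop


# Hypothetical count values for four parallel execution units
per_unit = {
    0: {"basic_blocks": [10, 10, 10], "loop_body": 200},
    1: {"basic_blocks": [10, 10, 10], "loop_body": 200},
    2: {"basic_blocks": [10, 10, 10], "loop_body": 190},
    3: {"basic_blocks": [10, 10, 10], "loop_body": 210},
}

stotal, loop = sequential_totals(per_unit)
```

Counting the basic blocks only once keeps the redundantly executed work from being multiplied by the number of units in the sequential-equivalent total.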
[0048]
As shown in FIG. 6, parallel execution introduces POVER as overhead, but because LOOP is reduced to LOOP/Para, PTOTAL becomes smaller than STOTAL.
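Rearranging the PTOTAL formula above makes explicit when this relation holds. The rearrangement is a direct consequence of the formula and is given here only as a reading aid:

```latex
\begin{align*}
\mathrm{STOTAL} - \mathrm{PTOTAL}
  &= \mathrm{LOOP} - \frac{\mathrm{LOOP}}{\mathrm{Para}} - \mathrm{POVER} \\
  &= \mathrm{LOOP}\left(1 - \frac{1}{\mathrm{Para}}\right) - \mathrm{POVER}
\end{align*}
```

PTOTAL is therefore smaller than STOTAL exactly when POVER is less than LOOP(1 - 1/Para). When the overhead exceeds that saving, for instance at a low degree of parallelism, the difference is negative and the estimated performance is worse than sequential execution.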
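[0049]
Next, the performance improvement rate when the degree of parallelism is changed is calculated from STOTAL and PTOTAL by the following formula. The unit is percent.
[0050]
Performance improvement rate = {(STOTAL - PTOTAL) / STOTAL} × 100
In addition, for the tuning target loop, estimated values of the executable statement execution count within a parallel execution unit, the average loop iteration count, and the performance improvement rate are displayed for varying degrees of parallelism. These estimates assume that the load of the loop is constant regardless of the iteration number.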
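The two formulas can be combined into a short Python sketch that estimates the improvement rate for several degrees of parallelism. The function names and the values of STOTAL, LOOP, and POVER are hypothetical; they were chosen only so that the overhead outweighs the gain at a degree of parallelism of 2, and they are not the values behind any figure in the patent.

```python
def parallel_total(stotal: float, loop: float, para: int, pover: float) -> float:
    """PTOTAL = STOTAL - LOOP + (LOOP / Para) + POVER."""
    return stotal - loop + loop / para + pover


def improvement_rate(stotal: float, ptotal: float) -> float:
    """Performance improvement rate in percent."""
    return (stotal - ptotal) / stotal * 100


# Hypothetical inputs (execution counts), not taken from the patent
STOTAL, LOOP, POVER = 1000, 300, 160

estimates = {
    para: improvement_rate(STOTAL, parallel_total(STOTAL, LOOP, para, POVER))
    for para in (2, 4, 8)
}

for para, rate in estimates.items():
    print(f"Para={para}: {rate:+.2f}%")
```

With these inputs the estimate is slightly negative at a degree of parallelism of 2 and turns positive at 4 and 8, because the fixed overhead is spread against a growing saving in LOOP/Para.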
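[0051]
FIG. 7 shows, as an example of the analysis result output of the present invention, an example display format for the estimated executable statement execution count of a specific loop, average loop iteration count, and performance improvement rate against varying degrees of parallelism.
[0052]
In the example of FIG. 7, when the degree of parallelism goes from 1 (sequential execution) to 2, the average loop iteration count (Iteration-count) goes from 80 to 40 and the total execution count of executable statements in the loop being estimated decreases, but because the parallel execution overhead increases, performance is slightly worse than with sequential execution. When the degree of parallelism increases to 4 and 8, however, the average loop iteration count becomes 20 and 10, respectively, and the performance improvement rate rises to 14.5% and 22.2%.
[0053]
In this embodiment, the average loop iteration count in sequential execution is calculated, and the average iteration count of the parallel loop derived from it is used to estimate the parallel load division amount. The average loop iteration count in sequential execution is the measured average loop iteration count multiplied by the degree of parallelism; the average loop iteration count of 80 displayed for a degree of parallelism of 1 in FIG. 7 is the average iteration count in sequential execution.
[0054]
FIG. 8 shows the method of estimating the parallel load division amount from the average loop iteration count in sequential execution. As shown in FIG. 8, the parallel load division amount is calculated by dividing the average loop iteration count in sequential execution by the degree of parallelism.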
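The estimate described above reduces to one multiplication and one division, sketched here in Python. The function names are hypothetical, and the measured value (an average of 20 iterations observed on 4 units) is an assumed input chosen to be consistent with the sequential average of 80 quoted for FIG. 7.

```python
def sequential_average(measured_average: float, measured_parallelism: int) -> float:
    """Sequential average = measured average iteration count x degree of parallelism."""
    return measured_average * measured_parallelism


def load_division(sequential_avg: float, parallelism: int) -> float:
    """Parallel load division amount = sequential average / degree of parallelism."""
    return sequential_avg / parallelism


# Assumed measurement: 20 iterations on average when run on 4 units
seq_avg = sequential_average(20, 4)

# Estimated division amount for each candidate degree of parallelism
divisions = {p: load_division(seq_avg, p) for p in (1, 2, 4, 8)}
# divisions == {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0}
```

The resulting values 80, 40, 20, and 10 match the average iteration counts the text cites from FIG. 7 for degrees of parallelism 1, 2, 4, and 8.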
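[0055]
In addition, the total execution count and the balance of execution counts across the parallel execution units are displayed for each procedure. Procedure information is displayed in descending order of execution count. When the distributed-memory and shared-memory parallelization methods are mixed, each parallel execution unit of the distributed-memory method is displayed, and the shared-memory parallelization is displayed as detailed information for that parallel execution unit.
[0056]
FIG. 9 shows an example display format for total procedure execution counts. FIG. 10 shows an example format for the balance display of procedure execution counts in each parallel execution unit.
[0057]
For example, FIG. 9 shows that the procedure "main._OMP_1_" has a total execution count (Total-count) of 1042, that this total accounts for 0.88% of the whole program (Run), and that the procedure was executed 8 times.
[0058]
FIG. 10 shows this procedure "main._OMP_1_" executed in parallel on four threads, giving for each parallel execution unit the deviation of its execution count from the average execution count as a percentage. The example of FIG. 10 shows that the fourth thread (Thread 3) carries a larger load than the other threads, at +11%. The user, having seen the balance display of FIG. 10, can therefore tell that reducing the load of the fourth thread is the way to even out the load balance divided among the parallel execution units.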
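A minimal Python sketch of such a balance figure, assuming it is the signed deviation of each unit's execution count from the mean, expressed as a percentage of the mean. The function name and the per-thread counts are hypothetical; the counts were picked so that the last thread comes out at +11%, and they are not the data behind FIG. 10.

```python
def balance_percent(counts: list) -> list:
    """Deviation of each unit's execution count from the mean, in percent."""
    mean = sum(counts) / len(counts)
    return [(count - mean) * 100 / mean for count in counts]


# Hypothetical execution counts for Thread 0 to Thread 3
thread_counts = [90, 100, 99, 111]

deviations = balance_percent(thread_counts)

# The most heavily loaded unit is the first candidate for load reduction
heaviest = max(range(len(deviations)), key=lambda unit: deviations[unit])
```

Here `heaviest` is 3, so the sketch points at the fourth thread in the same way the balance display does.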
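[0059]
When the behavior of a loop varies with the iteration number, as in a loop containing an if statement, the load usually varies from one parallel execution unit to another. In such cases the balance display of FIG. 10 is useful in that the user can change the load balance after seeing it.
[0060]
Loop execution counts are displayed in the same way as for procedures. FIG. 11 shows an example display format for total loop execution counts. FIG. 12 shows an example format for the balance display of loop execution counts in each parallel execution unit.
[0061]
In the display of FIG. 11, Mark identifies parallel or sequential execution: OMP indicates parallel execution and S indicates sequential execution.
[0062]
In addition, the execution count and degree of parallelism of each executable statement are extracted from the execution information file and displayed in association with the source code. The location of the source code is passed to this analysis processing as an argument. FIGS. 13 and 14 show an example format for displaying the execution count and degree of parallelism per executable statement.
[0063]
The processing performed by the execution count measurement result analyzing means 13 and the other means described above can be realized by a computer and a software program, and the program can be stored on a suitable recording medium, such as a computer-readable portable medium memory, semiconductor memory, or hard disk, and read from it to be executed by the computer.
[0064]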
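As can be understood from the above, the features of the embodiments of the present invention are as follows.
[0065]
(Supplementary note 1) An execution count analysis method for a parallel execution program, characterized by comprising:
a translation step of inserting into each procedure, when the parallel execution program to be analyzed is translated, instructions for accumulating execution counts at least for each parallel execution unit;
a measurement step of accumulating, for each parallel unit, the execution counts of the instructions set at translation time by executing the translated parallel execution program; and
an analysis step of analyzing the execution count measurement results of the parallel execution units and outputting analysis results that indicate at least the load distribution in parallel execution.
[0066]
(Supplementary note 2) The execution count analysis method for a parallel execution program according to supplementary note 1, characterized in that:
in the translation step, a call instruction for a function that secures a count area for accumulating the execution counts of the executable statements in a procedure is inserted at the start position of each procedure, and count instructions that accumulate execution counts are inserted at the head position of each basic block (a sequence of executable statements in the procedure whose execution order cannot change), at the start position of each loop, and at the position of the first executable statement in each loop;
in the measurement step, execution counts are accumulated in the count area by the count instructions encountered while the instructions of each procedure are executed, and the execution counts in the count area are recorded in an execution information file when the whole program ends; and
in the analysis step, the execution count measurement results of each parallel execution unit are extracted from the execution information file, and the average iteration count of each loop, which is the average execution count of the in-loop executable statements per loop execution, is calculated for each parallel execution unit.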
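[0067]
(Supplementary note 3) The execution count analysis method for a parallel execution program according to supplementary note 2, characterized in that
the analysis step comprises:
a step of calculating, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count of executable statements in the program converted to sequential execution;
a step of calculating, from that sequential-equivalent total, the sum of the in-loop executable statement execution counts of the parallel execution units, the degree of parallelism, and the overhead for parallel execution, the total execution count of executable statements in a parallel execution unit when the sequential-equivalent total is divided among the parallel execution units according to the degree of parallelism; and
a step of calculating, from the calculated total for a parallel execution unit and the calculated sequential-equivalent total, the performance improvement rate when the degree of parallelism is changed, and outputting the calculated result.
[0068]
(Supplementary note 4) The execution count analysis method for a parallel execution program according to supplementary note 2 or 3, characterized in that
the analysis step comprises:
a step of calculating, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the execution counts of executable statements in each procedure of the program executed in parallel and the execution counts of executable statements in each loop for each parallel execution unit, and outputting the calculated results in descending order of count; and
a step of outputting information indicating the balance of execution counts among the parallel execution units.
[0069]
(Supplementary note 5) The execution count analysis method for a parallel execution program according to any one of supplementary notes 2 to 4, characterized in that
the analysis step comprises
a step of calculating, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count and the degree of parallelism for each executable statement in association with the source code, and outputting the calculated results.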
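[0070]
(Supplementary note 6) A program for causing a computer to execute an execution count analysis method for a parallel execution program, being an execution count analysis program for parallel execution programs that causes a computer to execute:
translation processing of inserting into each procedure, when the parallel execution program to be analyzed is translated, instructions for accumulating execution counts at least for each parallel execution unit;
measurement processing of accumulating, for each parallel unit, the execution counts of the instructions set at translation time by executing the translated parallel execution program; and
analysis processing of analyzing the execution count measurement results of the parallel execution units and outputting analysis results that indicate at least the load distribution in parallel execution.
[0071]
(Supplementary note 7) A recording medium on which is recorded a program for causing a computer to execute an execution count analysis method for a parallel execution program, the recorded program being an execution count analysis program for parallel execution programs that causes a computer to execute:
translation processing of inserting into each procedure, when the parallel execution program to be analyzed is translated, instructions for accumulating execution counts at least for each parallel execution unit;
measurement processing of accumulating, for each parallel unit, the execution counts of the instructions set at translation time by executing the translated parallel execution program; and
analysis processing of analyzing the execution count measurement results of the parallel execution units and outputting analysis results that indicate at least the load distribution in parallel execution.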
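[0072]
[Effects of the Invention]
With the execution count analysis method for parallel execution programs according to the present invention, measuring the average loop iteration count for each parallel execution unit makes it possible to check the load distribution accurately. It also provides a concrete numerical guide for deciding the number of load divisions. In addition, displaying the predicted parallel effect when the degree of parallelism is increased helps set a target for performance improvement and contributes to more efficient tuning.
[0073]
Displaying procedures and loops in descending order of execution count makes it easy to identify parallelization candidates. Displaying the balance among parallel execution units makes it possible to check the load on each unit. Displaying the execution count and degree of parallelism of each executable statement in association with the source code makes the parallel execution status easy to recognize and promotes feeding performance improvements back into the source code.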
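[Brief Description of the Drawings]
[FIG. 1] A diagram showing an example of a hardware configuration that realizes the execution count analysis method for parallel execution programs of the present invention.
[FIG. 2] A diagram showing an outline of the execution count analysis method for parallel execution programs of the present invention.
[FIG. 3] A diagram showing an example of the execution count analysis processing flow for parallel execution programs of the present invention.
[FIG. 4] A diagram showing an example of the insertion of count instructions at translation time.
[FIG. 5] A diagram showing an example of the execution count measurement method for the analysis target program.
[FIG. 6] A diagram showing the relationship between the total execution count of executable statements in the program converted to sequential execution (STOTAL) and the total execution count of executable statements of a parallel execution unit (PTOTAL).
[FIG. 7] A diagram showing an example display format for the estimated executable statement execution count of a specific loop, average loop iteration count, and performance improvement rate against varying degrees of parallelism.
[FIG. 8] A diagram showing the method of estimating the parallel load division amount from the average loop iteration count in sequential execution.
[FIG. 9] A diagram showing an example display format for total procedure execution counts.
[FIG. 10] A diagram showing an example format for the balance display of procedure execution counts in each parallel execution unit.
[FIG. 11] A diagram showing an example display format for total loop execution counts.
[FIG. 12] A diagram showing an example format for the balance display of loop execution counts in each parallel execution unit.
[FIG. 13] A diagram showing an example format for displaying the execution count and degree of parallelism per executable statement.
[FIG. 14] A diagram showing an example format for displaying the execution count and degree of parallelism per executable statement.
[FIG. 15] A diagram showing an example of a parallelization designation.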
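[Explanation of Reference Signs]
1 Parallel execution program execution result analysis device
2 Node A
3 Node B
4 Execution information file
5 High-speed bus
6 Analysis result output device
7, 7' Analysis target program
11 Analysis target program translating means
12 Analysis target program execution count measuring means
13 Execution count measurement result analyzing means
21, 31 CPU
22, 32 Memory
[0001]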
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a function for supporting the parallelization tuning of programs in the field of large-scale scientific and technical computing, HPC (High Performance Computing System), and more particularly to an execution count analysis method for parallel execution programs, and an execution count analysis program therefor, that support the process of improving the performance of a sequentially executed program written in a high-level language such as Fortran through load distribution by parallelization.
[0002]
The program parallelization to be tuned covers the shared-memory parallel method and the distributed-memory parallelization method.
[0003]
In the shared-memory parallel method, a program is executed in parallel on a plurality of parallel execution units called threads. To specify parallelization, the user uses OpenMP parallelization syntax, at program execution time, to designate the range of the source code to be parallelized and the parallel load division amount, which indicates how the loop load is distributed among the parallel execution units in that range.
[0004]
For example, in the parallelization target program shown in FIG. 15, the range from basic block 1 to basic block 3 is designated for parallelization by the parallelization directives ("!$OMP PARALLEL", "!$OMP END PARALLEL"). Here, a basic block is a sequence of executable statements whose execution order cannot change. For the portion corresponding to a loop, the compiler creates a generated function containing object code that reflects the parallel load division amount specified by the user, and that function is executed in parallel. With the parallelization designation of FIG. 15, basic blocks 1, 2, and 3 are executed redundantly, with every parallel execution unit performing the same processing, while the load of the loop portion is distributed according to the parallel load division amount.
[0005]
In the distributed-memory parallelization method, data is divided among a plurality of parallel execution units called processes, and the program is executed in parallel. To specify parallelization, MPI (Message Passing Interface) library calls are inserted at the positions in the source code to be parallelized, and the parallel load division amount is specified.
[0006]
The present invention outputs information that supports the parallelization tuning of programs parallelized by either method and of parallelized programs in which both methods are mixed.
[0007]
[Prior art]
To support shared-memory parallel and distributed-memory parallel programming, it is particularly necessary to measure the average iteration count of the loops targeted for load distribution, and a function is needed that allows the parallelization status of procedures, loops, and individual executable statements to be checked for each parallel execution unit.
[0008]
As a technique for optimizing the parallel processing of a program, there has been a technique for a parallel computer composed of multiple processors in which data and programs are allocated to each processor element, the processing time is measured at execution, and reallocation is performed dynamically so that the variation evens out across the processor elements (see, for example, Patent Document 1). Techniques for measuring the average iteration count of each loop of a sequentially executed program have also long existed (see, for example, Non-Patent Document 1).
[0009]
[Patent Document 1]
JP-A-2-132526 (page 1)
[Non-patent document 1]
FUJITSU, "UXP/V Analyzer User's Guide, for V20" (J2U5-0132-02), Fujitsu Limited, September 1999, 2nd edition, pp. 55-56
[0010]
[Problems to be solved by the invention]
However, conventional program execution count analysis functions targeted sequentially executed programs and offered no support for program parallelization. That is, there was no function for measuring the average iteration count of each loop of a parallel execution program for each parallel unit.
[0011]
Also, for example, the technique described in JP-A-2-132526 (Patent Document 1) does not measure loop counts and display them for tuning, so the user cannot perform parallelization tuning after checking how the load is distributed in parallel execution.
[0012]
Specifically, there was conventionally no function for predicting the performance gained by parallelizing the tuning target program, no function for checking the balance of procedure and loop execution counts among parallel execution units, and no function for displaying the parallel status of each executable statement in a program executed in parallel. It was therefore difficult to tune a parallelized program on the basis of the load distribution in parallel execution.
[0013]
An object of the present invention is to solve the above problems of the prior art and to provide a tuning support technique that lets the user tune a parallelized program after checking the load distribution in parallel execution.
[0014]
[Means for Solving the Problems]
In order to solve the above problems, the present invention focuses on the following points.
[0015]
(1) The execution load of a program is expressed by execution time. Procedures, loops, and executable statements with higher execution counts tend to require more execution time. By using this tendency to express execution load as execution counts, a tuning support function is realized that expresses execution load as numerical values that are easy to feed back into the source code. In addition, because the load distribution amount for parallelization (mainly the loop iteration count) is often specified numerically, obtaining the program's execution load as numbers is useful for tuning. Furthermore, if the load of each loop iteration is the same regardless of the iteration number, a performance gain can be obtained by finding the average loop iteration count and dividing that value evenly among the parallel execution units.
[0016]
(2) Tuning can proceed effectively if performance improvement prediction information is available for the case where the loop iteration count is divided among parallel execution units in proportion to the increase in the number of parallel execution units.
[0017]
(3) When a program is executed in parallel, tuning information is easy to understand if the parallel execution status can be checked numerically for each procedure, loop, and executable statement.
[0018]
Based on these points, the present invention accurately measures the average loop iteration count, which receives the most attention for load distribution in program parallelization.
[0019]
The present invention also realizes a performance improvement prediction function for the case where the loop iteration count is divided more finely in proportion to the increase in the number of parallel execution units. It displays the execution counts of procedures and loops so that they can be compared among parallel execution units. It displays procedures and loops in descending order of load so that tuning targets can be selected easily. It further displays execution counts and parallel execution status in correspondence with the executable statements of the source code.
[0020]
That is, the present invention is an execution count analysis method for a parallel execution program, characterized by comprising: a program translation step of translating an analysis target program and, at translation time, inserting instructions that accumulate execution counts into each procedure; a program execution count measurement step of accumulating, for each parallel unit, the execution counts of the units set at translation time; and an execution count measurement result analysis step of analyzing the execution count measurement results of the parallel execution units and displaying the analysis results on a terminal.
[0021]
In the present invention, the program translation step inserts, at the start position of each procedure, a call instruction for a function that secures a count area for accumulating the execution counts of the executable statements in that procedure, and inserts count instructions that accumulate execution counts at the head position of each basic block (a sequence of executable statements in the procedure whose execution order cannot change), at the start position of each loop, and at the position of the first executable statement in each loop; the program execution count measurement step accumulates execution counts in the count area by means of the count instructions encountered while the instructions of each procedure are executed and, when the whole program ends, outputs the execution counts in the count area to an execution information file for each parallel execution unit; and the execution count measurement result analysis step extracts the execution count measurement results of each parallel execution unit from the execution information file and calculates, for each parallel execution unit, the average iteration count of each loop, which is the average execution count of the in-loop executable statements per loop execution.
[0022]
By using the present invention, the state of load distribution in parallel execution can be made clear. Accordingly, if the load of each loop iteration is the same regardless of the iteration number, a performance gain can be obtained by dividing the average loop iteration count evenly among the parallel execution units.
[0023]
In the present invention, the execution count measurement result analysis step further comprises: a step of calculating, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count of executable statements in the program converted to sequential execution; a step of calculating, from that sequential-equivalent total, the sum of the in-loop executable statement execution counts of the parallel execution units, the degree of parallelism, and the overhead for parallel execution, the total execution count of executable statements in a parallel execution unit when the sequential-equivalent total is divided among the parallel execution units according to the degree of parallelism; and a step of calculating, from the calculated total for a parallel execution unit and the calculated sequential-equivalent total, the performance improvement rate when the degree of parallelism is changed.
[0024]
By using the present invention, a performance improvement prediction function can be realized for the case where the loop iteration count is divided more finely in proportion to the increase in the degree of parallelism, and tuning can proceed effectively.
[0025]
In the present invention, the execution count measurement result analysis step further calculates, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the execution counts of executable statements in each procedure of the program executed in parallel and the execution counts of executable statements in each loop for each parallel execution unit, displays the calculated results on a terminal in descending order of count, and displays the balance of execution counts among the parallel execution units on the terminal.
[0026]
By using the present invention, the load balance among parallel units can be checked for each procedure and loop, and parallelization candidates can be selected.
[0027]
In the present invention, the execution count measurement result analysis step further calculates, from the execution count measurement results of each parallel execution unit extracted from the execution information file, the total execution count and the degree of parallelism for each executable statement in association with the source code, and displays them on a terminal.
[0028]
By using the present invention, the parallel execution status becomes easy to recognize, which promotes feeding performance improvements back into the source code.
[0029]
The program of the present invention is a program for causing a computer to execute an execution count analysis method for a parallel execution program, namely an execution count analysis program for parallel execution programs that causes a computer to execute a method comprising: a translation step of inserting into each procedure, when the parallel execution program to be analyzed is translated, instructions for accumulating execution counts at least for each parallel execution unit; a measurement step of accumulating, for each parallel unit, the execution counts of the instructions set at translation time by executing the translated parallel execution program; and an analysis step of analyzing the execution count measurement results of the parallel execution units and outputting analysis results that indicate at least the load distribution in parallel execution.
[0030]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing an example of a hardware configuration for realizing a method for analyzing the number of executions of a parallel execution program according to the present invention.
[0031]
In the figure, reference numeral 1 denotes a parallel execution program execution result analysis device, such as a workstation (WS), that analyzes the execution results of a parallel execution program. 2 and 3 are node A and node B, each of which has one or more CPUs 21, 31 and a memory 22, 32 and executes the parallel execution program. Reference numeral 4 denotes an execution information file in which the execution count measurement results of the parallel execution program executed on node A 2 and node B 3 are recorded, and 5 denotes a high-speed bus for inter-processor communication that connects node A 2 and node B 3. Although the execution information file 4 is a shared file in FIG. 1, the present invention can also be realized when the execution information file 4 is not shared.
[0032]
The execution count analysis method for parallel execution programs of the present invention is used to support parallelization tuning of the shared-memory parallelization method performed within node A 2 or node B 3, and also to support parallelization tuning of the distributed-memory parallelization method performed between node A 2 and node B 3.
[0033]
FIG. 2 is a diagram showing an outline of the execution count analysis method for parallel execution programs of the present invention. 6 is an analysis result output device such as a display or printer; 7 is the source program of the analysis target program, in which parallelization directives are embedded; 7' is the load module of the analysis target program, in which instructions for measuring execution counts have been embedded by translation; 11 is an analysis target program translating means that translates (including linking) the analysis target program; 12 is an analysis target program execution count measuring means; and 13 is an execution count measurement result analyzing means that analyzes the execution count measurement results. The analysis target program translating means 11 and the execution count measurement result analyzing means 13 need not be realized on the same device and may be realized on separate devices.
[0034]
Hereinafter, processing performed by the parallel execution program execution result analysis device 1 will be described. FIG. 3 is a diagram showing an example of the execution number analysis processing flow of the parallel execution program of the present invention.
[0035]
First, the analysis target program translating means (compiler) 11 translates the analysis target program 7 described in a high-level language such as Fortran (step S1). In the translation of the analysis target program 7, an instruction for calling a function having a function of securing an area (count area) for accumulating the execution count of each execution statement in the procedure is inserted at a start position of each procedure in the object program. I do.
[0036]
Further, the analysis target program translating means 11 inserts an instruction (count instruction) for accumulating the number of executions at each of the following positions in the procedure in the object program.
(1) The start position of each basic block (a sequence of executable statements whose execution order cannot change)
(2) The start position of each loop
(3) The first executable statement in each loop
Expressed in source code form, such a count instruction is, for example, an instruction of the following form:
Area name (x) = Area name (x) + 1
[0037]
FIG. 4 shows an example of insertion of count instructions into an object program when the analysis target program 7 is translated. In the example shown in FIG. 4, a function call instruction for managing the count area is inserted at the start position of the procedure, and count instructions for counting the number of executions are inserted at the head position of each of the basic blocks 1 to 3, at the loop start position, and at the first executable statement in the loop. The insertion of these instructions is enabled by selecting a predetermined compiler option at the start of the translation process.
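The insertion positions described above can be illustrated with a minimal sketch. It is written in Python rather than in the Fortran object code of the embodiment, and the procedure `run_instrumented` and the counter names are hypothetical names chosen only for this illustration.

```python
from collections import defaultdict


def run_instrumented(n):
    """Hypothetical procedure with count instructions written in by hand."""
    count = defaultdict(int)           # count area secured at the procedure start

    count["block1"] += 1               # (1) start of the basic block before the loop
    total = 0

    count["loop_start"] += 1           # (2) start position of the loop
    for i in range(n):
        count["loop_first_stmt"] += 1  # (3) first executable statement in the loop
        total += i

    count["block2"] += 1               # (1) start of the basic block after the loop
    return total, dict(count)


total, counts = run_instrumented(5)
```

Each count instruction has the form "area(x) = area(x) + 1", so the loop-start counter records how many times the loop was entered and the first-statement counter records how many iterations were executed.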
[0038]
Next, the analysis target program execution count measuring means 12 performs the execution count measurement process for the analysis target program 7 (step S2). Specifically, the CPUs 21 and 31 of the nodes A2 and B3 shown in FIG. 1 execute the load module (analysis target program 7') obtained from the object program produced by translating the analysis target program 7, and the number of executions is accumulated for each parallel execution unit by the count instructions inserted at translation.
[0039]
The accumulation of the number of executions is performed as follows. At the start of each procedure, a function that allocates a count area is called. This function manages the count area using the function name of the caller and the parallel execution unit as keys. When this function is called, it notifies the caller of the count area corresponding to the key.
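A count area function of this kind might be sketched as follows. This is an assumption-laden illustration in Python: the names `get_count_area` and `work`, the dictionary-based registry, and the use of Python threads as parallel execution units are all hypothetical and are not taken from the embodiment.

```python
import threading
from collections import defaultdict

_count_areas = {}          # (procedure name, parallel execution unit) -> count area
_lock = threading.Lock()   # protects the registry itself, not the individual areas


def get_count_area(procedure_name, unit_id):
    """Return the count area for the given key, allocating it on first use."""
    key = (procedure_name, unit_id)
    with _lock:
        if key not in _count_areas:
            _count_areas[key] = defaultdict(int)
        return _count_areas[key]


def work(unit_id, iterations):
    """Hypothetical procedure executed by one parallel execution unit."""
    area = get_count_area("work", unit_id)   # call inserted at the procedure start
    area["loop_start"] += 1
    for _ in range(iterations):
        area["loop_first_stmt"] += 1


threads = [threading.Thread(target=work, args=(u, 10 * (u + 1))) for u in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each parallel execution unit receives its own count area, the count instructions themselves need no synchronization in this sketch; only the lookup in the shared registry is locked.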
[0040]
If a count instruction inserted during translation is reached during execution of the instructions of a procedure, the count instruction accumulates the number of executions in the count area prepared at the start of the procedure. Then, at the end of the entire program, the execution counts in the count areas are recorded in the execution information file 4 for each parallel execution unit. FIG. 5 shows an example of a method of measuring the number of executions of the analysis target program 7 shown in FIG. In the execution count measurement shown in FIG. 5, the number of executions is measured for each parallel execution unit on the basis of the generated function that the compiler creates for the parallelization specification, in the same manner as in the parallelization specification method shown in FIG.
[0041]
Next, the execution count measurement result analyzing means 13 performs the execution count measurement result analysis process (step S3). More specifically, the execution count measurement result of each parallel execution unit is extracted from the execution information file 4, analyzed as follows, and output to the analysis result output device 6 such as a display or a printer.
[0042]
First, the average rotation number (average iteration count) of each loop is calculated for each parallel execution unit by the following formula.
[0043]
Average rotation number of the loop
= Number of executions of the first executable statement in the loop / Number of executions of the loop start position
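As a small worked sketch of this formula (Python, with hypothetical count values and a hypothetical function name `average_rotation_number`):

```python
def average_rotation_number(first_stmt_count, loop_start_count):
    """Average iterations per loop entry for one parallel execution unit."""
    if loop_start_count == 0:
        return 0.0   # the loop was never entered by this unit
    return first_stmt_count / loop_start_count


# Hypothetical measured counts per parallel execution unit:
# (executions of the first in-loop statement, executions of the loop start)
counts = {0: (400, 10), 1: (400, 10)}
averages = {unit: average_rotation_number(f, s) for unit, (f, s) in counts.items()}
```

With these assumed counts each unit entered the loop 10 times and executed 400 iterations in total, giving an average rotation number of 40 per unit.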
In addition, the following expression is used to calculate the total number of executions of executable statements in a parallel execution unit when the total number of executions of the tuning target loop, converted to sequential execution, is divided among the parallel execution units in accordance with the degree of parallelism.
[0044]
PTOTAL
= STOTAL - LOOP + (LOOP / Para) + POVER
Here, STOTAL is the total number of executions of the execution statement in the program converted to sequential execution, and is composed of a basic block execution cost and LOOP as shown in FIG. LOOP is the total number of executions of the execution statement in the loop to be tuned converted into the sequential execution, and is a total value of the execution number of execution statements in the loop in each parallel execution unit.
[0045]
PTOTAL is the total number of executions of an executable statement in a parallel execution unit, and is composed of redundant execution cost, LOOP / Para, and POVER. The redundant execution cost is the execution cost of a basic block in which each parallel execution unit performs the same processing. FIG. 6 shows a case where the redundant execution cost and the execution cost of the basic block are the same.
[0046]
Para is the degree of parallelism, and LOOP / Para is the parallel execution cost, that is, the load borne by each parallel execution unit in terms of the number of executions of in-loop executable statements. POVER is the overhead for parallel execution and is a fixed value.
[0047]
Here, STOTAL is calculated from the count values in the count area of each parallel execution unit shown in FIG. That is, first, LOOP is obtained as the total of the numbers of executions of in-loop executable statements in the parallel execution units. Next, for example, when every parallel execution unit performs the same processing for the basic blocks 1 to 3, the redundant execution cost is obtained from the count values that a representative parallel execution unit recorded for the redundantly executed basic blocks 1 to 3. The sum of the calculated LOOP and the redundant execution cost is then obtained as STOTAL.
[0048]
As shown in FIG. 6, POVER as overhead occurs due to parallel execution, but since LOOP decreases to LOOP / Para, PTOTAL becomes a value smaller than STOTAL.
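The calculation of STOTAL and PTOTAL described above can be sketched as follows. The function names and all numeric values are hypothetical and serve only to make the arithmetic concrete.

```python
def compute_stotal(loop_counts_per_unit, redundant_cost):
    """Return (STOTAL, LOOP) from per-unit in-loop counts and the redundant cost."""
    loop = sum(loop_counts_per_unit)   # LOOP: in-loop executions summed over units
    return loop + redundant_cost, loop


def compute_ptotal(stotal, loop, para, pover):
    """PTOTAL = STOTAL - LOOP + (LOOP / Para) + POVER."""
    return stotal - loop + loop / para + pover


# Hypothetical measurement: 4 units with 250 in-loop executions each,
# and a redundant basic block cost of 100 taken from a representative unit.
stotal, loop = compute_stotal([250, 250, 250, 250], redundant_cost=100)
ptotal = compute_ptotal(stotal, loop, para=4, pover=50)
```

Under these assumed values LOOP is 1000 and STOTAL is 1100, and at a degree of parallelism of 4 with an overhead of 50 the estimate for PTOTAL is 100 + 250 + 50 = 400.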
[0049]
Next, the performance improvement rate when the degree of parallelism is changed is calculated by the following equation using STOTAL and PTOTAL. The unit is a percentage.
[0050]
Performance improvement rate = {(STOTAL-PTOTAL) / STOTAL} × 100
Also, for the loop to be tuned, the number of executions of executable statements in the parallel execution unit, the average rotation number of the loop, and the estimated value of the performance improvement rate with respect to a change in the degree of parallelism are displayed. This estimate is based on the assumption that the load of one iteration of the loop is constant and does not depend on which iteration is being executed.
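How the estimate behaves as the degree of parallelism changes can be sketched with the two expressions above. The values of STOTAL, LOOP, and POVER below are hypothetical and do not reproduce the figures of the embodiment; the function names are likewise assumptions.

```python
def improvement_rate(stotal, ptotal):
    """Performance improvement rate in percent."""
    return (stotal - ptotal) / stotal * 100.0


def estimate(stotal, loop, pover, degrees):
    """Return (Para, PTOTAL, improvement rate) for each degree of parallelism."""
    rows = []
    for para in degrees:
        ptotal = stotal - loop + loop / para + pover
        rows.append((para, ptotal, improvement_rate(stotal, ptotal)))
    return rows


# Hypothetical values: a large fixed overhead makes low parallelism unprofitable.
rows = estimate(stotal=1100, loop=1000, pover=600, degrees=(2, 4, 8))
for para, ptotal, rate in rows:
    print(f"Para={para}  PTOTAL={ptotal:.0f}  improvement={rate:.1f}%")
```

With these assumed inputs the rate is negative at a degree of parallelism of 2, because the overhead outweighs the reduction of LOOP, and becomes positive at 4 and 8. This is the same qualitative pattern as the one described for the display example.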
[0051]
FIG. 7 shows, as an example of the output of the analysis result according to the present invention, an example of the display format of the number of executions of executable statements of a specific loop, the average rotation number of the loop, and the estimated value of the performance improvement rate with respect to a change in the degree of parallelism.
[0052]
In the example shown in FIG. 7, when the degree of parallelism changes from 1 (sequential execution) to 2, the average rotation number (Iteration-count) of the loop changes from 80 to 40, and the total number of executions of executable statements in the loop whose performance is to be estimated decreases. Nevertheless, because the overhead of parallel execution increases, the performance is slightly degraded compared with sequential execution. When the degree of parallelism increases to 4 and 8, however, the average rotation number of the loop becomes 20 and 10, respectively, and the performance improvement rate rises to 14.5% and 22.2%.
[0053]
Further, in the present embodiment, the average rotation number of the loop in sequential execution is calculated, and this information is then used for estimating the parallel load division amount of the parallel loop. The average rotation number of the loop in sequential execution is obtained by multiplying the measured average rotation number of the loop by the degree of parallelism; the average rotation number 80 of the loop at the degree of parallelism shown in FIG. is this average rotation number.
[0054]
FIG. 8 shows a method of estimating the parallel load division amount based on the average rotation number of the loop in sequential execution. As shown in FIG. 8, the parallel load division amount is calculated by dividing the average rotation number of the loop in sequential execution by the degree of parallelism.
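The two steps, converting the measured average rotation number back to sequential execution and then dividing by a candidate degree of parallelism, can be sketched as follows. The function names are hypothetical; the values 40, 2, 80, and 20 follow the example given above for the degrees of parallelism 2 and 4.

```python
def sequential_average(measured_average, measured_para):
    """Average rotation number of the loop converted to sequential execution."""
    return measured_average * measured_para


def load_division_amount(seq_average, target_para):
    """Estimated parallel load division amount per parallel execution unit."""
    return seq_average / target_para


# Measured with 2 parallel execution units: 40 iterations per unit on average.
seq = sequential_average(40, 2)
amount = load_division_amount(seq, 4)
```

Here the sequential average is 80 iterations, so a degree of parallelism of 4 would assign about 20 iterations to each parallel execution unit.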
[0055]
Also, the total value of the number of executions for each procedure and the balance of the number of executions among the parallel execution units are displayed. Procedure information is displayed in descending order of the number of executions. In the case of a mixture of the distributed-memory parallelization method and the shared-memory parallelization method, each parallel execution unit of the distributed-memory parallelization method is displayed, and the shared-memory parallelization is displayed as detailed information of that parallel execution unit.
[0056]
FIG. 9 shows an example of a display format of the procedure execution count total value. FIG. 10 shows an example of a form of a balance display of the number of procedure executions in each parallel execution unit.
[0057]
For example, in FIG. 9, for the procedure “main._OMP_1_”, the total execution count (Total-count) is 1042, the ratio (Run) of the total execution count to that of the entire program is 0.88%, and the procedure is executed eight times.
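A ranking of this kind, totals per procedure in descending order together with each procedure's share of the whole program, might be computed as in the following sketch. The procedure names and counts are hypothetical and are not the values of the display example.

```python
def rank_procedures(counts_by_procedure):
    """Return (name, total count, percent of program) in descending order of count."""
    grand_total = sum(counts_by_procedure.values())
    ranked = sorted(counts_by_procedure.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, total, 100.0 * total / grand_total) for name, total in ranked]


# Hypothetical execution count totals per procedure.
ranked = rank_procedures({"main": 500, "solve": 9000, "init": 500})
for name, total, ratio in ranked:
    print(f"{name:<10} {total:>8} {ratio:6.2f}%")
```

Listing procedures in this order puts the candidates that dominate the execution count at the top, which is where parallelization is most likely to pay off.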
[0058]
This procedure “main._OMP_1_” is executed in parallel by four threads, and FIG. 10 shows, for each parallel execution unit, the ratio by which its number of executions deviates from the average number of executions. From the example of FIG. 10, it can be seen that the load of the fourth thread (Thread 3) is +11%, which is larger than that of the other threads. Therefore, from the balance display shown in FIG. 10, the user can understand that the load on the fourth thread needs to be reduced in order to tune the program so that the load divided among the parallel execution units is balanced.
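The deviation ratio used in such a balance display can be sketched as follows, assuming it is defined as the difference from the mean divided by the mean. The function name and the per-thread counts are hypothetical.

```python
def balance(counts_per_unit):
    """Deviation of each unit's execution count from the mean, in percent."""
    mean = sum(counts_per_unit) / len(counts_per_unit)
    return [100.0 * (count - mean) / mean for count in counts_per_unit]


# Hypothetical execution counts of one procedure on four threads.
deviations = balance([240, 250, 250, 260])
for thread, dev in enumerate(deviations):
    print(f"Thread {thread}: {dev:+.1f}%")
```

A thread with a clearly positive deviation carries more than its share of the load and is the natural starting point for rebalancing.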
[0059]
For example, when the behavior of the loop changes depending on the iteration, as in a loop that includes an if statement, the load usually varies among the parallel execution units. The balance display shown in FIG. 10 is useful in that it allows the user to recognize the imbalance and adjust the load balance.
[0060]
Also, the number of times of execution of the loop is displayed in the same manner as in the case of the procedure. FIG. 11 shows an example of the display format of the loop execution count total value. FIG. 12 shows an example of a form of a balance display of the number of loop executions in each parallel execution unit.
[0061]
In the display shown in FIG. 11, Mark identifies whether a loop is executed in parallel or sequentially: OMP indicates parallel execution and S indicates sequential execution.
[0062]
In addition, the number of executions and the degree of parallelism of each executable statement are extracted from the execution information file, and the values are displayed in association with the source code. The location of the source code is passed as an argument to this analysis process. FIGS. 13 and 14 show examples of a format for displaying the number of executions and the degree of parallelism for each executable statement.
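One way to associate the values with the source code is sketched below. The column layout, the function name `annotate`, and the sample statistics are assumptions made for illustration and do not reproduce the display formats of FIGS. 13 and 14.

```python
def annotate(source_lines, stats):
    """Prefix each source line with its execution count and degree of parallelism.

    stats maps a 1-based line number to (execution count, degree of parallelism);
    lines without an entry (non-executable lines) get blank columns.
    """
    annotated = []
    for lineno, text in enumerate(source_lines, start=1):
        if lineno in stats:
            count, para = stats[lineno]
            annotated.append(f"{count:>8} {para:>4}  {text}")
        else:
            annotated.append(f"{'':>8} {'':>4}  {text}")
    return annotated


# Hypothetical Fortran fragment and hypothetical measured values.
source = ["do i = 1, n", "  a(i) = b(i) + c(i)", "end do"]
stats = {1: (1, 1), 2: (80, 4)}
annotated = annotate(source, stats)
for line in annotated:
    print(line)
```

Placing the counts next to the statements lets the reader see at a glance which statements run in parallel and how often.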
[0063]
The processing performed by the execution count measurement result analyzing means 13 and the like can be realized by a computer and a software program. The program can be recorded on a computer-readable recording medium such as a portable medium memory, a semiconductor memory, or a hard disk, and read from there to be executed by the computer.
[0064]
As can be understood from the above, the features of the embodiment of the present invention are as follows.
[0065]
(Supplementary Note 1) A method of analyzing the number of executions of a parallel execution program,
A translation process of inserting, into each procedure, an instruction for accumulating the number of executions at least for each parallel execution unit when translating the parallel execution program to be analyzed;
A measurement process of accumulating, by executing the translated parallel execution program, the number of executions for each parallel execution unit by means of the instructions inserted at the time of translation; and
An analysis process of analyzing the execution count measurement result of each parallel execution unit and outputting at least an analysis result indicating the load distribution in the parallel execution.
A method for analyzing the number of executions of a parallel execution program, characterized in that:
[0066]
(Supplementary note 2) In the method for analyzing the number of executions of a parallel execution program according to supplementary note 1,
In the translation process, a call instruction for a function that secures a count area for accumulating the number of executions of each executable statement in the procedure is inserted at the start position of each procedure, and a count instruction for accumulating the number of executions is inserted, within the procedure, at the start position of each basic block, which is a sequence of executable statements whose execution order cannot change, at the start position of each loop, and at the position of the first executable statement in each loop.
In the measurement process, the number of executions is accumulated in the count area by each count instruction reached during execution of the instructions of each procedure, and the execution counts in the count area are recorded in an execution information file at the end of the entire program.
In the analysis process, the execution count measurement result of each parallel execution unit is extracted from the execution information file, and the average rotation number of each loop, which is the average number of executions of the in-loop executable statements per execution of the loop, is calculated for each parallel execution unit.
A method for analyzing the number of executions of a parallel execution program, characterized in that:
[0067]
(Supplementary note 3) In the method for analyzing the number of executions of a parallel execution program according to supplementary note 2,
The analysis process includes:
Calculating the total number of executions of the executable statement in the program converted to sequential execution based on the execution number measurement result of each parallel execution unit extracted from the execution information file;
Calculating, based on the calculated total number of executions of executable statements in the program converted to sequential execution, the total number of executions of in-loop executable statements in each parallel execution unit, the degree of parallelism, and the overhead for parallel execution, the total number of executions of executable statements in a parallel execution unit when the total number of executions of executable statements in the program converted to sequential execution is divided among the parallel execution units according to the degree of parallelism;
Calculating, based on the calculated total number of executions of executable statements in the parallel execution unit and the calculated total number of executions of executable statements in the program converted to sequential execution, a performance improvement rate when the degree of parallelism is changed, and outputting the calculation result.
A method for analyzing the number of executions of a parallel execution program, characterized in that:
[0068]
(Supplementary note 4) In the method for analyzing the number of executions of the parallel execution program according to supplementary note 2 or 3,
The analysis process includes:
Based on the execution count measurement result of each parallel execution unit extracted from the execution information file, the execution count of the execution statement in each procedure and the execution count of the execution statement in the parallel execution unit of each loop in the program executed in parallel Calculating and outputting the calculation result in descending order of the number of times;
Outputting information indicating the balance of the number of executions between the parallel execution units.
A method for analyzing the number of executions of a parallel execution program, characterized in that:
[0069]
(Supplementary note 5) In the method for analyzing the number of times of execution of a parallel execution program according to any one of supplementary notes 2 to 4,
The analysis process includes:
A step of calculating the total number of executions and the parallelism for each execution statement in association with the source code based on the execution count measurement result of each parallel execution unit extracted from the execution information file, and outputting the calculation result
A method for analyzing the number of executions of a parallel execution program, characterized in that:
[0070]
(Supplementary Note 6) A program for causing a computer to execute a method of analyzing the number of executions of a parallel execution program,
At the time of translating the parallel execution program to be analyzed, a translation process for inserting an instruction for accumulating the number of executions at least for each parallel execution unit into each procedure;
By executing the translated parallel execution program, a measurement process for accumulating the number of executions of the instruction set at the time of translation for each parallel unit;
An analysis process of analyzing the execution count measurement result of the parallel execution unit and outputting an analysis result indicating at least a load distribution in the parallel execution;
An execution count analysis program for a parallel execution program to be executed by a computer.
[0071]
(Supplementary Note 7) A recording medium storing a program for causing a computer to execute the method of analyzing the number of executions of a parallel execution program,
At the time of translating the parallel execution program to be analyzed, a translation process for inserting an instruction for accumulating the number of executions at least for each parallel execution unit into each procedure;
By executing the translated parallel execution program, a measurement process for accumulating the number of executions of the instruction set at the time of translation for each parallel unit;
An analysis process of analyzing the execution count measurement result of the parallel execution unit and outputting an analysis result indicating at least a load distribution in the parallel execution;
A recording medium that stores a program for analyzing the number of executions of a parallel execution program to be executed by a computer.
[0072]
[Effect of the Invention]
With the method for analyzing the number of executions of the parallel execution program according to the present invention, the load distribution situation can be accurately confirmed by measuring the average rotation number of the loop for each parallel execution unit. In addition, a specific numerical reference for determining the load division amount can be obtained. Furthermore, by displaying the predicted value of the parallelization effect when the degree of parallelism is increased, a measure of performance improvement can be determined, which contributes to improved tuning efficiency.
[0073]
Also, by displaying the number of executions of procedures and loops in descending order, extraction of parallelization candidates becomes easy. Further, by displaying the balance between the parallel execution units, the load status between the parallel execution units can be confirmed. In addition, by displaying the number of executions and the parallelism of each execution statement in association with the source code, it is easy to recognize the parallel execution status, and the feedback of the performance improvement to the source code is promoted.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an example of a hardware configuration that implements a method for analyzing the number of executions of a parallel execution program according to the present invention.
FIG. 2 is a diagram showing an outline of a method for analyzing the number of executions of a parallel execution program according to the present invention.
FIG. 3 is a diagram illustrating an example of an execution number analysis processing flow of a parallel execution program according to the present invention.
FIG. 4 is a diagram showing an example of insertion of a count command at the time of translation.
FIG. 5 is a diagram illustrating an example of a method of measuring the number of executions of an analysis target program.
FIG. 6 is a diagram showing the relationship between the total number of executions of a statement executed in a program (STOTAL) converted to sequential execution and the total number of executions of a statement executed in a parallel execution unit (PTOTAL).
FIG. 7 is a diagram illustrating an example of a display format of the number of executions of executable statements of a specific loop, the average rotation number of the loop, and the estimated value of the performance improvement rate with respect to a change in the degree of parallelism.
FIG. 8 is a diagram illustrating a method of estimating the parallel load division amount based on the average rotation number of the loop in sequential execution.
FIG. 9 is a diagram illustrating an example of a display format of a procedure execution count total value.
FIG. 10 is a diagram showing an example of a format of a balance display of the number of times of procedure execution in each parallel execution unit.
FIG. 11 is a diagram illustrating an example of a display format of a total number of times of loop execution.
FIG. 12 is a diagram illustrating an example of a form of a balance display of the number of loop executions in each parallel execution unit.
FIG. 13 is a diagram showing an example of a display format of the number of executions and the degree of parallelism for each executable statement.
FIG. 14 is a diagram showing an example of a display format of the number of executions and the degree of parallelism for each executable statement.
FIG. 15 is a diagram illustrating an example of designation of parallelization.
[Explanation of symbols]
1 Parallel execution program execution result analyzer
2 Node A
3 Node B
4 Execution information file
5 Express bus
6 Analysis result output device
7, 7' Analysis target program
11 Analysis target program translating means
12 Analysis target program execution count measuring means
13 Execution count measurement result analyzing means
21, 31 CPU
22, 32 Memory

Claims

1. An execution count analysis method for a parallel execution program, comprising:
A translation process of inserting, into each procedure, an instruction for accumulating the number of executions at least for each parallel execution unit when translating the parallel execution program to be analyzed;
A measurement process of accumulating, by executing the translated parallel execution program, the number of executions for each parallel execution unit by means of the instructions inserted at the time of translation; and
An analysis process of analyzing the execution count measurement result of each parallel execution unit and outputting at least an analysis result indicating the load distribution in the parallel execution.

2. The method for analyzing the number of executions of a parallel execution program according to claim 1, wherein:
In the translation process, a call instruction for a function that secures a count area for accumulating the number of executions of each executable statement in the procedure is inserted at the start position of each procedure, and a count instruction for accumulating the number of executions is inserted, within the procedure, at the start position of each basic block, which is a sequence of executable statements whose execution order cannot change, at the start position of each loop, and at the position of the first executable statement in each loop;
In the measurement process, the number of executions is accumulated in the count area by each count instruction reached during execution of the instructions of each procedure, and the execution counts in the count area are recorded in an execution information file at the end of the entire program; and
In the analysis process, the execution count measurement result of each parallel execution unit is extracted from the execution information file, and the average rotation number of each loop, which is the average number of executions of the in-loop executable statements per execution of the loop, is calculated for each parallel execution unit.

3. The method for analyzing the number of executions of a parallel execution program according to claim 2,
The analysis process includes:
Calculating the total number of executions of the executable statement in the program converted to sequential execution based on the execution number measurement result of each parallel execution unit extracted from the execution information file;
Calculating, based on the calculated total number of executions of executable statements in the program converted to sequential execution, the total number of executions of in-loop executable statements in each parallel execution unit, the degree of parallelism, and the overhead for parallel execution, the total number of executions of executable statements in a parallel execution unit when the total number of executions of executable statements in the program converted to sequential execution is divided among the parallel execution units according to the degree of parallelism; and
Calculating, based on the calculated total number of executions of executable statements in the parallel execution unit and the calculated total number of executions of executable statements in the program converted to sequential execution, a performance improvement rate when the degree of parallelism is changed.

4. The method for analyzing the number of executions of a parallel execution program according to claim 2 or 3,
The analysis process includes:
Based on the execution count measurement result of each parallel execution unit extracted from the execution information file, the execution count of the execution statement in each procedure and the execution count of the execution statement in the parallel execution unit of each loop in the program executed in parallel Calculating and outputting the calculation result in descending order of the number of times;
Outputting information indicating a balance of the number of executions among the parallel execution units.

5. A program for causing a computer to execute a method for analyzing the number of executions of a parallel execution program,
At the time of translating the parallel execution program to be analyzed, a translation process for inserting an instruction for accumulating the number of executions at least for each parallel execution unit into each procedure;
By executing the translated parallel execution program, a measurement process for accumulating the number of executions of the instruction set at the time of translation for each parallel unit;
An analysis process of analyzing the execution count measurement result of the parallel execution unit and outputting an analysis result indicating at least a load distribution in the parallel execution;
An execution count analysis program for a parallel execution program to be executed by a computer.