JP3698949B2

JP3698949B2 - Function allocation optimization apparatus for instruction cache, function allocation optimization method, and recording medium recording function allocation optimization procedure

Info

Publication number: JP3698949B2
Application number: JP2000089382A
Authority: JP
Inventors: 英信田中
Original assignee: Ｎｅｃマイクロシステム株式会社
Priority date: 2000-03-28
Filing date: 2000-03-28
Publication date: 2005-09-21
Anticipated expiration: 2020-03-28
Also published as: JP2001282547A

Description

【０００１】
【発明の属する技術分野】
本発明は命令キャッシュへの関数割付最適化装置、関数割付最適化方法及び関数割付最適化手順を記録した記録媒体に関し、特にキャッシュを搭載したマイクロプロセッサシステムにおける命令キャッシュへの関数割付最適化装置、関数割付最適化方法及び関数割付最適化手順を記録した記録媒体に関する。
【０００２】
【従来の技術】
この種のマイクロプロセッサシステムの処理速度においては、ＣＰＵ速度に加えて主記憶である外部メモリに対するアクセス、すなわち、メモリアクセスの処理速度（以下メモリアクセス速度）が大きく影響する。しかしながら、最近のＣＰＵ速度の著しい向上に対して、メモリアクセス速度の向上はそれほど大きくなく、ＣＰＵ速度とメモリアクセス速度との差は年々開く一方である。例えば、ＤＲＡＭのアクセス時間の向上率は年率７％程度であるのに対し、ＣＰＵ速度の向上率は５０〜１００％というデータもある（ダビド・エー・パターソン、ジョンエル・ヘネシー（ＤａｖｉｄＡ．Ｐａｔｔｅｒｓｏｎ、Ｊｏｈｎ．Ｌ．Ｈｅｎｎｅｓｓｙ）著、成田光彰訳、「コンピュータの構成と設計」、日経ＢＰ社、１９９６年４月１９日）。
【０００３】
従って、システム処理速度の向上のためには、キャッシュの有効利用が、非常に重要な問題となってきている。なお、キャッシュとは、公知の通り、外部メモリを高速にアクセスするための小容量高速メモリであるバッファから成る機構である。一般にプログラムは、メモリに配置された関数をアドレス順に処理していく。しかし、外部メモリのメモリアクセス速度はＣＰＵの処理速度に比べて非常に遅いため、結果として実行速度が遅くなるという問題がある。
【０００４】
上記の問題点を解消するため、関数実行の際に、外部メモリに格納してあるプログラムを、高速アクセス可能なバッファであるキャッシュにコピーして実行することにより、高速なプログラム実行を実現することが可能となる。この理由は、一般に、プログラムを実行すると、一度アクセスされたメモリは、近いうちに再度アクセスされる可能性が高いという性質があるからである。
【０００５】
ただし、一般的にキャッシュ用の高速メモリは外部メモリに比較して高価であり、その構成にはコストがかかるため、外部メモリに比べて非常に容量（サイズ）は小さい。このため、キャッシュを用いる処理システムは、外部メモリをキャッシュのサイズで区切った領域に分割し、分割した領域毎に外部メモリをキャッシュに割り当てる。あるアドレスに対する最初のアクセスでそのアドレスのプログラムを上記キャッシュ割り当て領域にコピーし、次に外部メモリの同一アドレスをアクセスした場合はキャッシュを直接アクセスする。このことで、高速なプログラム実行を実現する。
【０００６】
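At this time, copying from the external memory to the cache memory is performed in units of a specific size, and each area obtained by dividing the cache by this size is called a cache line. Consequently, for functions placed at external-memory addresses that are assigned to the same cache line, the program must be copied into the cache again every time execution switches from one function to another. This is called a cache conflict. If cache conflicts occur frequently, program execution slows down. To address this problem, methods have recently been studied that place functions likely to run at the same time so that they do not fall on the same cache line.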
【０００７】
なお、キャッシュには命令キャッシュとデータキャッシュがあるが、本発明は、命令キャッシュに着目するものである。
【０００８】
外部メモリのキャッシュへの割当て方式には、もっとも単純で安価なダイレクトマップ方式や、セットアソシアティブ／フルアソシアティブといった方式があるが、基本的な問題は全て同じであるため、以降、ダイレクトマップ方式を例にとり説明する。
【０００９】
また、従来の言語処理系プログラムでは、ある処理単位（関数）毎にメモリに適当に配置していた。このため、キャッシュ搭載のシステムにおいては、それが必ずしも最適に利用されているとは限らなかった。
【００１０】
最新の従来技術では、命令キャッシュに関し、キャッシュコンフリクトの回数を確率的に減らすために、関数の呼び出し回数情報に基づいて関数のメモリ配置を最適化することにより、キャッシュを有効に利用する研究がなされており、いくつかのアルゴリズムが論文発表されている。
【００１１】
For example, A. H. Hashemi, D. R. Kaeli, and B. Calder, “Efficient Procedure Mapping Using Cache Line Coloring”, ACM SIGPLAN, June 1997 (Document 1), discloses a technique that optimizes the memory placement of functions so as to avoid cache conflicts, starting from the most frequently called functions.
【００１２】
また、特開平１１−２３２１１７号公報（文献２）記載の従来の第１の命令キャッシュへの関数割付最適化方法においては、キャッシュメモリを効率よく使用するための関数配置方法の実施例として、上記文献１の方法が採用され、詳しく説明されている。
【００１３】
本従来技術のアルゴリズムでは、関数の呼出グラフを作成し、呼び出し回数をその辺（関数の組み合わせ）に対する重みとして優先順位をつけてメモリ空間に配置する。このことにより、まず関数を最初に配置した時のキャッシュコンフリクトを避けることができる。さらに、各関数が配置された「使用色」（すなわち、使用されているキャッシュライン）と、その関数が現在利用できない「利用不可色」の集合を記録しておき、後者、すなわち、利用不可色を使わないように関数を配置し、既に配置した関数についても、その関数の利用できない利用不可色を使わないという条件の下で、別の場所に移動する。これにより、直接の「親」あるいは「子」との間で発生するキャッシュコンフリクトを除去するものである。
【００１４】
次に、後述する本発明の実施の形態で適用するアプリケーションプログラムに対する従来例での動作を確認する。このことにより、従来例における問題点をより詳細に説明する。
【００１５】
従来の第１の命令キャッシュへの関数割付最適化装置をブロックで示す図１０を参照すると、この従来の第１の命令キャッシュへの関数割付最適化装置は、アプリケーションプログラム１１０から関数呼出情報を読み込み、関数呼び出し時に呼出元と呼出先の各関数情報とその呼出回数を関数呼出組合せ情報１１１に出力する関数呼出情報出力部１と、関数呼出組合せ情報１１１に基づき関数の配置を最適化してアドレス空間に配置し関数メモリ配置結果１０４を出力する関数メモリ配置最適化部１０３とを備える。
【００１６】
図１０、関数メモリ配置最適化部１０３の処理フローをフローチャートで示す図１１、及び上記アプリケーションプログラムの一例を示す図３を参照して、従来の第１の命令キャッシュへの関数割付最適化装置の動作である従来の第１の命令キャッシュへの関数割付最適化方法について説明すると、まず、図３に示すアプリケーションプログラム１１０を適用した場合、関数呼び出し情報出力部１は、プロファイルにより関数呼び出し時に呼出元と呼出先の各関数情報とその呼出回数を関数呼出組合せ情報１１１に出力する。なお、図１０及び図１１において、実線の矢印は処理の流れを示し、点線の矢印はデータの流れを示す。
【００１７】
関数呼出組合せ情報１１１の一例を示す図２を参照すると、この関数呼出組合せ情報１１１は、関数の呼出元、呼出先、呼出回数の各欄から成る。
【００１８】
次に、関数メモリ配置最適化部１０３は、関数呼出組合せ情報１１１を呼出回数の多い順にソートし、この順番にアドレス空間に配置すると同時に配置した関数が利用できないキャッシュライン対応の「利用不可色」の集合を認識し、これを避けて後続の関数を配置する。
【００１９】
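That is, in FIG. 11, in step P1 the function call graph 120 shown in FIG. 12 is created from the function call combination information 111, and in step P2 the created function call graph 120 is divided into edges with many calls and edges with few calls. Here, func-funcA, func-funcB, and func-funcC are the former (“many”), and main-func, funcA-funcD, and funcB-funcD are the latter (“few”).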
【００２０】
ここで作成した関数呼出グラフ１２０においては、関数の組合せが辺となり、その両端のノードが組合せにおける２つの関数となる。
【００２１】
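Note that the number of cache lines, that is, the number of “colors”, occupied by each function in FIG. 3 is assumed to be two for func and one each for funcA, funcB, funcC, and main.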
【００２２】
次に、ステップＰ３において、呼出回数の多いもののグループを呼出回数の多い順にソートし、その順番でステップＰ４以降の処理を行う。ステップＰ４では呼出回数の多い辺が残っているか確認し、残っているのでステップＰ５に進み、ｆｕｎｃ−ｆｕｎｃＡの辺に対して両側のノードが未配置であるかを確認する。この確認において未配置であるのでステップＰ９に進み、ｆｕｎｃとｆｕｎｃＡをメモリ空間上の任意の場所に隣接して配置し、ステップＰ１５においてｆｕｎｃとｆｕｎｃＡの利用できない「色」を利用不可能集合として認識した後、再びステップＰ４に戻る。隣接して配置されたｆｕｎｃ−ｆｕｎｃＡの辺は、複合ノードとして今後ひとつのノードとして扱われる。この時点で、既に配置済みの関数とキャッシュラインの関係および各関数の利用不可能集合の状態は、図１３（Ａ）に示すようになっている。
【００２３】
続いて、呼出回数の多い辺がまだ残っているのでステップＰ５に進み、ｆｕｎｃ−ｆｕｎｃＣの辺に対して両側のノードが未配置であるかを確認する。この確認において、ｆｕｎｃは配置済みであるのでステップＰ６に進み、２個の異なる複合ノードに属するノードを結ぶ辺かどうかを確認する。この確認において、ｆｕｎｃＣは複合ノードに属していないので、ステップＰ７に進み、一方のノードが複合ノードに属し他方のノードが未配置かどうか確認すると、条件に当てはまるのでステップＰ１１に進む。
【００２４】
ステップＰ１１では未配置のｆｕｎｃＣをｆｕｎｃに近い場所に配置し、ステップＰ１３において関数配置の際に利用不可能集合の影響で隙間が空いてないかを確認すると空いていないので、ステップＰ１５においてｆｕｎｃとｆｕｎｃＣを利用できない「利用不可色」を利用不可能集合として認識した後に、再びステップＰ４に戻る。
【００２５】
この時点で、既に配置済みの関数とキャッシュラインの関係及び各関数の利用不可能集合の状態は、図１３（Ｂ）に示すようになっている。
【００２６】
ステップＰ４において、まだ未配置の辺があるので、ステップＰ５に進み、ｆｕｎｃ−ｆｕｎｃＢの辺に対し、両側のノードが未配置であるかを確認する。この確認において、ｆｕｎｃは配置済みであるので、ステップＰ６に進み、２個の異なる複合ノードに属するノードを結ぶ辺かどうかを確認する。この確認において、ｆｕｎｃＢは複合ノードに属していないので、ステップＰ７に進み、一方のノードが複合ノードに属し他方のノードが未配置かどうか確認する。この確認において、条件に当てはまるので、ステップＰ１１に進む。
【００２７】
ステップＰ１１では、未配置のｆｕｎｃＢと対を成すｆｕｎｃの中心から複合ノードの両端までの距離が同じであるため、任意に左側に配置し、ステップＰ１３において関数配置の際、利用不可能集合の影響で隙間が空いてないか確認する。すると、空いていないので、ステップＰ１５に進み、ｆｕｎｃとｆｕｎｃＢの利用できない「利用不可色」を利用不可能集合として認識したのち、再びステップＰ４に戻る。
【００２８】
ステップＰ４において、未配置の辺が無くなったことを確認すると、ステップＰ１６に進み、未配置ノードｍａｉｎ、ｆｕｎｃＤを任意のキャッシュラインに配置する。
【００２９】
図１３（Ｃ）は最終的な関数配置とキャッシュラインの関係、及び各関数の利用不可能集合の状態、すなわち、関数メモリ配置結果１０４であるが、ｆｕｎｃＡ、ｆｕｎｃＢが同一キャッシュライン「青」を共有しており、それぞれの利用不可能集合には「青」が含まれていない。よって、呼び出し元関数と呼び出し先関数の間のキャッシュコンフリクトは削減できる。
【００３０】
上記の従来の第１の技術では、手続き、関数、あるいはサブルーチン同士が、互いを呼び出す際のキャッシュメモリ上での衝突およびキャッシュミスを防止することが目的である。この目的において、手続き、関数、あるいはサブルーチンが実際に呼び出される回数を示す情報と、手続き、関数、あるいはサブルーチン同士が、互いを呼び出す関係を示す情報とを利用している。これにより、手続き、関数、あるいはサブルーチン同士が、互いを呼び出す際のキャッシュメモリ上での衝突を防止できる。
【００３１】
このように、従来の第１の技術においては呼出元関数と呼出先関数の間のキャッシュコンフリクトは削減できるが、（１）ある関数の中で複数の関数が連続して呼ばれている場合、あるいは（２）ループの中で呼ばれている場合等には、これら複数の関数間のキャッシュコンフリクトを削減できず、極めて多くのキャッシュコンフリクトが生じてしまうという第１の問題がある。
【００３２】
つまり、ｆｕｎｃＡとｆｕｎｃＢは直接の呼出関係がないため、関数呼出組み合わせ情報を元にした関数配置を行う従来技術では、これらの関数が同一キャッシュラインに乗ってしまう場合があり得る。しかし、図３の上記アプリケーションプログラム例より明らかな通り、ｆｕｎｃＡとｆｕｎｃＢはループ中で連続して呼ばれ、さらにｆｕｎｃＢでは、ｆｕｎｃＤを経由してｆｕｎｃＡを呼び出すというプログラム記述となっており、ｆｕｎｃＡとｆｕｎｃＢが頻繁に遷移を繰り返すため、このループ処理において、極めて多くのキャッシュコンフリクトが生じてしまう。
【００３３】
また、以上の第１の問題を解決するため、特願２０００−０２７２１８号明細書（文献３）記載の従来の第２の命令キャッシュへの関数割付最適化方法は、プロファイルにより直接の関数呼出組み合わせ情報を出力する代わりに、関数実行の時系列情報を出力し、この時系列情報から、連続した関数呼出しなど直接の関数呼出し以外にキャッシュコンフリクトを発生する可能性のある関数の組み合わせ実行パターンを検出し、検出した関数間キャッシュコンフリクト組み合わせ情報に対して、従来技術の関数配置最適化を適用する。これにより、従来削減できなかった、（１）ある関数の中で複数の関数が連続して呼ばれている場合、あるいは（２）ループの中で呼ばれている場合など、これら複数の関数間のキャッシュコンフリクトを削減し、アプリケーションプログラムの実行スピードを向上する手段を提供するものである。
【００３４】
しかしながら、この従来の第２の技術は、プログラムの遷移が単純な場合は、単純な処理で実現可能であり、極めて有効な手段であるが、この例で示したように、（３）ループ中で呼ばれる関数がまた別の関数を呼んでいるような場合などでは、パターンマッチングができなくなり、最適化ができないという第２の問題がある。
【００３５】
【発明が解決しようとする課題】
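The first conventional apparatus, method, and recording medium for optimizing function allocation to an instruction cache described above have the drawback that, when a plurality of functions are called consecutively within a function, or are called within a loop, cache conflicts among those functions cannot always be reduced, and in the worst case an extremely large number of cache conflicts occur.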
【００３６】
また、上記欠点の解決を図った従来の第２の命令キャッシュへの関数割付最適化装置、関数割付最適化方法及び関数割付最適化手順を記録した記録媒体は、プログラムの遷移が単純な場合は、単純な処理で実現可能であり、極めて有効な手段であるが、ループ中で呼ばれる関数がまた別の関数を呼んでいるような場合などでは、パターンマッチングができなくなり、最適化が不可能となるという欠点があった。
【００３７】
本発明の目的は、上記第１及び第２の従来技術の欠点を除去し、複数の関数間のキャッシュコンフリクトを削減し、アプリケーションプログラムの実行スピードの向上を図った命令キャッシュへの関数割付最適化装置、関数割付最適化方法及び関数割付最適化手順を記録した記録媒体を提供することにある。
【００３８】
【課題を解決するための手段】
請求項１の発明の命令キャッシュへの関数割付最適化装置は、命令キャッシュを搭載したマイクロプロセッサシステム用の所定のアプリケーションプログラムを入力し前記命令キャッシュに関しキャッシュコンフリクトの回数を確率的に低減するように関数の呼出回数情報に基づいて関数のメモリ配置を最適化する命令キャッシュへの関数割付最適化装置において、
前記アプリケーションプログラムを入力しプロファイルによる関数呼出時に呼出元及び呼出先の各関数とその呼出回数とを関数呼出組合せとして関数呼出組合せ情報に出力する関数呼出情報出力部と、
前記アプリケーションプログラムを入力し前記プロファイルによる関数呼出に応じた関数の遷移に対して該関数のＩＤ及び該関数の基本ブロックの順番の組合せを関数遷移毎に並べた関数基本ブロック遷移情報に出力する関数基本ブロック遷移情報出力部と、
前記関数呼出組合せ情報を参照し前記関数呼出組合せ相互間で前記関数遷移の回数である遷移回数を入替えた関数呼出回数入替データから成る呼出回数入替情報を生成し、次に生成した前記呼出回数入替情報を参照して関数をメモリ空間上のアドレスに仮配置した後、前記関数基本ブロック遷移情報を参照してキャッシュコンフリクトの回数を検出し、前記呼出回数入替データの中で、前記キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定し対応する関数メモリ配置結果を出力する関数メモリ配置最適化部とを備えて構成されている。
【００３９】
また、前記関数呼出組合せ情報及び前記呼出回数入替情報の各々が、関数の呼出元の関数名を記述した呼出元欄と、
前記関数の呼出先の関数名を記述した呼出先欄と、
前記関数の呼出回数を設定する呼出回数欄とをそれぞれ有しても良い。
【００４０】
請求項３の命令キャッシュへの関数割付最適化方法は、命令キャッシュを搭載したマイクロプロセッサシステム用のアプリケーションプログラムを入力し前記命令キャッシュに関しキャッシュコンフリクトの回数を確率的に低減するように関数の呼出回数情報に基づいて関数のメモリ配置を最適化する命令キャッシュへの関数割付最適化方法において、
前記アプリケーションプログラムを入力し前記プロファイルによる関数呼出時に呼出元及び呼出先の各関数とその呼出回数とを関数呼出組合せとして関数呼出組合せ情報を生成し、
前記アプリケーションプログラムを入力し前記プロファイルにより得られた前記関数の基本ブロック単位の実行に関する関数基本ブロック遷移情報を生成した後、
前記関数呼出組合せ情報を参照し前記関数呼出組合せ相互間で前記関数遷移の回数である遷移回数を入替えた関数呼出回数入替データから成る呼出回数入替情報を生成し、前記関数基本ブロック遷移情報を参照して各関数のキャッシュコンフリクトの回数を検出し、前記呼出回数入替データの中で前記キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定する関数メモリ配置最適化工程を有することを特徴とするものである。
【００４１】
また、前記関数メモリ配置最適化工程が、前記関数呼出組合せ情報を参照して前記呼出回数入替情報を生成する呼出回数入替ステップと、
生成した前記呼出回数入替情報を参照して関数をメモリ空間上のアドレスに仮配置する関数メモリ仮配置ステップと、
前記関数基本ブロック遷移情報を参照して仮配置した各関数のキャッシュコンフリクト回数を検出するキャッシュコンフリクト回数算出処理ステップと、
前記関数呼出回数入替データの中で前記キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定する関数メモリ配置ステップとを有することを特徴としても良い。
【００４２】
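Further, the call count replacement step in the function memory arrangement optimization process may comprise: a first step of referring to the function call combination information, reading an argument that is the number of function call combinations whose call counts are to be replaced, and setting a first variable;
a second step of determining whether the first variable is 0;
a third step of outputting the call count replacement information with its current contents when the first variable is 0 in the second step;
a fourth step of, when the first variable is not 0 in the second step, performing the call count replacement processing as a recursive call with the argument set to the first variable minus 1;
a fifth step of setting a second variable to 0;
a sixth step of determining whether the second variable is smaller than the first variable minus 1, and ending the processing if it is not;
a seventh step of, if it is, exchanging the call counts at the second variable and at a first index, which is the first variable minus 1;
an eighth step of decrementing the argument by 1 and performing the call count replacement processing as a recursive call with the argument set to the first variable minus 1;
a ninth step of exchanging the call counts at a second index, which is the second variable, and at the first variable minus 1; and
a tenth step of incrementing the second variable by 1 and repeating the sixth and subsequent steps.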
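The ten steps above amount to a recursive enumeration of every ordering of the call counts. The following is a minimal Python sketch of that reading, not the patent's own implementation. The names `replace_call_counts`, `call_counts`, and `variants` are hypothetical. The sample counts 50 and 20 are the values the embodiment gives for func-funcA and func-funcB; the third value, 30, and its position at index 2 are assumptions inferred from the 30-iteration loop of the example program.

```python
def replace_call_counts(counts, n, out):
    """Recursively emit every ordering of the first n call counts.

    counts is reordered in place and restored before returning;
    each emitted ordering is appended to out as a copy.
    """
    if n == 0:
        # Steps 2-3: nothing left to reorder, so output the current contents.
        out.append(list(counts))
        return
    # Step 4: recurse with position n-1 left unchanged.
    replace_call_counts(counts, n - 1, out)
    # Steps 5-10: for i = 0 .. n-2, exchange, recurse, exchange back.
    for i in range(n - 1):
        counts[i], counts[n - 1] = counts[n - 1], counts[i]
        replace_call_counts(counts, n - 1, out)
        counts[i], counts[n - 1] = counts[n - 1], counts[i]


# Assumed index order: func-funcA, func-funcB, func-funcC.
call_counts = [50, 20, 30]
variants = []
replace_call_counts(call_counts, len(call_counts), variants)
```

With three counts the sketch produces 3! = 6 orderings. The first is the unchanged list and the second has the first two counts exchanged, which agrees with the description of the first two outputs in the embodiment.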
【００４３】
さらに、前記関数メモリ配置最適化工程における前記キャッシュコンフリクト回数算出処理ステップが、キャッシュコンフリクト回数をカウントする変数を０に初期化する第１のステップと、
前記関数基本ブロック遷移情報を順次読み込み、関数ＩＤと基本ブロックの順番情報であるＩＤ順番情報を求める第２のステップと、
前記関数基本ブロック遷移情報は終了したかを判定し諾の場合は処理を終了する第３のステップと、
前記第３のステップで否の場合は前記ＩＤ順番情報のキャッシュ上の配置を求める第４のステップと、
先頭の前記関数基本ブロック遷移情報かを判定し諾の場合は前記第２のステップに戻る第５のステップと、
前記第５のステップで否の場合以前のブロックとアドレス上の重なりがあるかを判定し否の場合は前記第２のステップに戻る第６のステップと、
前記第６のステップで諾の場合は前記キャッシュコンフリクト回数をカウントする変数を１インクリメントし前記第２のステップに戻る第７のステップとを有することを特徴としても良い。
【００４４】
請求項７の命令キャッシュへの関数割付最適化手順を記録した記録媒体は、命令キャッシュを搭載したマイクロプロセッサシステム用のアプリケーションプログラムを入力し前記命令キャッシュに関しキャッシュコンフリクトの回数を確率的に低減するように関数の呼出回数情報に基づいて関数のメモリ配置を最適化する命令キャッシュへの関数割付最適化手順を記録した記録媒体において、
前記アプリケーションプログラムを入力し前記プロファイルによる関数呼出時に呼出元及び呼出先の各関数とその呼出回数とを関数呼出組合せとして関数呼出組合せ情報を生成する手順と、
前記アプリケーションプログラムを入力し前記プロファイルにより得られた前記関数の基本ブロック単位の実行に関する関数基本ブロック遷移情報を生成する手順と、
前記関数呼出組合せ情報を参照し前記関数呼出組合せ相互間で前記関数遷移の回数である遷移回数を入替えた関数呼出回数入替データから成る呼出回数入替情報を生成し、前記関数基本ブロック遷移情報を参照して各関数のキャッシュコンフリクトの回数を検出し、前記呼出回数入替データの中で前記キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定する関数メモリ配置最適化手順とを実行させるプログラムを記録したことを特徴とするものである。
【００４５】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【００４６】
本発明は、命令キャッシュを搭載したマイクロプロセッサシステム用のアプリケーションプログラムを入力し、上記命令キャッシュに関しキャッシュコンフリクトの回数を確率的に低減するように関数の呼出回数情報に基づいて関数のメモリ配置を最適化する命令キャッシュへの関数割付最適化方法において、プロファイルによる関数呼出時に呼出元及び呼出先の各関数とその呼出回数とを関数呼出組合せとして関数呼出組合せ情報を生成し、プロファイルにより得られた関数の基本ブロック単位の実行に関する関数基本ブロック遷移情報を生成し、関数呼出組合せ情報を参照し関数呼出組合せ相互間で関数遷移の回数である遷移回数を入替えた関数呼出回数入替データから成る呼出回数入替情報を生成し、関数基本ブロック遷移情報を参照して命令キャッシュのキャッシュコンフリクト回数を検出して、呼出回数入替データの中で最もキャッシュコンフリクト回数の少なくなるように関数をメモリ空間上のアドレスに配置することにより、プログラムの実行スピードを向上させるものである。
【００４７】
本発明の実施の形態を図１０と共通の構成要素には共通の参照文字／数字を付して同様にブロックで示す図１を参照すると、この図に示す本実施の形態の命令キャッシュへの関数割付最適化装置は、従来と共通のアプリケーションプログラム１１０を入力し、プロファイルによる関数呼出時に呼出元及び呼出先の各関数とその呼出回数とを関数呼出組合せとして関数呼出組合せ情報１１１に出力する関数呼出情報出力部１に加えて、アプリケーションプログラム１１０を入力し、プロファイルによる関数呼出に応じた関数の遷移に対して、関数のＩＤ（識別）及びその関数の基本ブロックの順番の組合せを関数遷移毎に並べた関数基本ブロック遷移情報１１２に出力する関数基本ブロック遷移情報出力部２と、関数呼出組合せ情報１１１を参照して関数呼出組合せ相互間で上記関数遷移の回数である遷移回数を入替えて関数呼出回数入替データから成る呼出回数入替情報１１３を生成し、次に生成した呼出回数入替情報１１３を参照して関数をメモリ空間上のアドレスに仮配置した後、関数基本ブロック遷移情報１１２を参照してキャッシュコンフリクトの回数を検出し、関数呼出回数を入替えたもの、すなわち、関数呼出回数入替データの中で、キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定し対応する関数メモリ配置結果４を出力する関数メモリ配置最適化部３とを備える。
【００４８】
なお、図１において、実線の矢印は処理の流れを示し、点線の矢印はデータの流れを示す。
【００４９】
関数呼出組合せ情報１１１の一例を示す図２を参照すると、この関数呼出組合せ情報１１１は、関数の呼出元の関数名を記述した呼出元欄、関数の呼出先の関数名を記述した呼出先欄、及びその呼出回数を設定する呼出回数欄の各欄から成る。
【００５０】
また、呼出回数入替情報１１３も、その構成は関数呼出組合せ情報１１１と全く同じであり、関数の呼出元の関数名を記述した呼出元欄、関数の呼出先の関数名を記述した呼出先欄、及びその呼出回数を設定する呼出回数欄の各欄から成る。
【００５１】
次に、図１、図２、上記アプリケーションプログラムの一例を示す図３、及び関数メモリ配置最適化部の処理フローをフローチャートで示す図５を参照して本実施の形態の動作について説明すると、まず、関数呼出情報出力部１は、アプリケーションプログラム１１０を入力し、従来と同様に、プロファイルにおける関数呼び出し時に呼出元と呼出先の各関数情報とその呼出回数を関数呼出組合せ情報１１１に出力する。ここで、プロファイルとは、関数の基本ブロックの先頭及び呼出関数からの復帰時に、関数ＩＤと基本ブロックの順番を出力するコードをアプリケーションプログラム１１０に挿入して実行することである。
【００５２】
次に、関数基本ブロック遷移情報出力部２は、アプリケーションプログラム１１０を入力し、プロファイル、すなわち、関数の基本ブロックの先頭、及び呼出関数から復帰時に、関数ＩＤと基本ブロックの順番を出力するコードをアプリケーションプログラム１１０に挿入して実行することにより、これらの情報を関数基本ブロック遷移情報１１２に出力する。
【００５３】
図３を再度参照すると、この図に示すアプリケーションプログラム１１０はＣ言語によるアプリケーションプログラムの例であり、関数ｍａｉｎの関数ＩＤを０、関数ｆｕｎｃの関数ＩＤを１、関数ｆｕｎｃＡの関数ＩＤを２、関数ｆｕｎｃＢの関数ＩＤを３、関数ｆｕｎｃＣの関数ＩＤを４、関数ｆｕｎｃＤの関数ＩＤを５とそれぞれ想定する。
【００５４】
さらに関数ｆｕｎｃは、関数の基本処理単位である基本ブロックがループ文２つから構成され、また、関数ｍａｉｎ，ｆｕｎｃＡ，ｆｕｎｃＢ，ｆｕｎｃＣ，ｆｕｎｃＤの各々の基本ブロックが各１つの構成とする。
【００５５】
まず、先頭の関数ｍａｉｎの呼出時点でｍａｉｎのＩＤである０と基本ブロックの１番目の組合せである０−１（以下、組合せ０−１等）とを出力する。次に、関数ｍａｉｎから関数ｆｕｎｃに処理が移り、ｆｕｎｃのＩＤである１と基本ブロック１番目の組合せ１−１とを出力する。次に、関数ｆｕｎｃからｆｕｎｃＡに処理が移り、ｆｕｎｃＡの関数ＩＤである２と基本ブロック１番目の組合せ２−１とを出力する。次に、関数ｆｕｎｃＡからｆｕｎｃに処理が復帰し、ｆｕｎｃのＩＤである１と基本ブロックの１番目の組合せ１−１とを出力する。以降、同様にしてプログラムの終了まで関数ＩＤと基本ブロックの順番の組合せを出力する。
【００５６】
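The application program 110 of this example calls function func from function main. Function func first repeats 20 times a process of calling function funcA and function funcB in succession, then repeats 30 times a process of calling function funcA and function funcC in succession, and then returns to function main, where the program ends. Here, function funcB calls function funcD, and function funcD in turn calls function funcA. When profiling is executed in this way until the program ends, the function basic block transition information 112, an array of function IDs and basic block order numbers as shown in FIG. 4, is created.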
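To make the shape of the profiling output concrete, the sketch below simulates the call structure just described and records a (function ID, basic block number) pair at each function entry and at each return to the caller. It is an illustration under assumptions, not a reproduction of FIG. 3 or FIG. 4: the helper names `enter` and `call` are hypothetical, and the point at which the second loop of func emits its block number is assumed.

```python
from collections import Counter

FUNC_ID = {"main": 0, "func": 1, "funcA": 2, "funcB": 3, "funcC": 4, "funcD": 5}

trace = []          # (function ID, basic block number) in execution order
calls = Counter()   # (caller, callee) -> number of calls


def enter(name, block=1):
    """Record that execution is now in the given basic block of a function."""
    trace.append((FUNC_ID[name], block))


def call(caller, caller_block, callee, body=None):
    """Simulate one call: count it, enter the callee, then return to the caller."""
    calls[(caller, callee)] += 1
    enter(callee)
    if body is not None:
        body()
    enter(caller, caller_block)


def func_d():
    call("funcD", 1, "funcA")


def func_b():
    call("funcB", 1, "funcD", func_d)


def func():
    for _ in range(20):                 # first loop: basic block 1
        call("func", 1, "funcA")
        call("func", 1, "funcB", func_b)
    enter("func", 2)                    # head of the second loop (assumed emission point)
    for _ in range(30):                 # second loop: basic block 2
        call("func", 2, "funcA")
        call("func", 2, "funcC")


enter("main")
call("main", 1, "func", func)
```

In this simulation the trace begins 0-1, 1-1, 2-1, 1-1, 3-1, 5-1, 2-1, 5-1, 3-1, and the counted calls from func to funcA, funcB, and funcC are 50, 20, and 30 respectively.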
【００５７】
次に、関数メモリ配置最適化部３は、関数呼出組合せ情報１１１を参照して呼出回数を入替えて関数呼出回数入替データから成る呼出回数入替情報１１３を生成する。次に、生成した呼出回数入替情報１１３を参照して、関数をメモリ空間上のアドレスに仮配置する。その後、関数基本ブロック遷移情報１１２を参照して仮配置した各関数のキャッシュコンフリクトの回数を検出し、呼出回数を入替たもの、すなわち、関数呼出回数入替データの中で、キャッシュコンフリクトの最も回数の少ないものに関数のメモリ配置を決定する。
【００５８】
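That is, describing the operation of the function memory arrangement optimizing unit 3 in detail with reference also to FIG. 5, first the call count replacement processing step S1 is performed with reference to the function call combination information 111. This processing is given, as its argument, the number of combinations with many calls. The criterion for many versus few is the same as in the division of the function call graph in the conventional function memory arrangement optimizing unit. In this example, the combinations func-funcA, func-funcB, and func-funcC in the function call combination information 111 are judged to have many calls, and the other combinations are judged to have few. The number of combinations with many calls is therefore three, and this value is passed as the argument. From here on, these three combinations are the targets of call count replacement.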
【００５９】
ここで、呼出回数入替処理ステップＳ１の詳細をフローチャートで示す図６及び呼出回数入替情報１１３の一例を示す図７を併せて参照して呼出回数入替処理ステップＳ１の詳細動作について説明すると、まず、ステップＳ１０１で引数３を読み込む。次に、ステップＳ１０２で変数ｎに引数３を設定する。次に、ステップＳ１０３の条件判定「ｎが０か」を行い、変数ｎが０でないので、ステップＳ１０５に分岐し、変数ｎ−１とする呼出回数入替処理を行い、ｎ−１に対応する引数２として再帰呼出を行う。
【００６０】
次に、再度ステップＳ１０１で引数２を読み込み、ステップＳ１０２で変数ｎに２を設定する。次に、ステップＳ１０３の条件判定を行い、変数ｎが０でないので、ステップＳ１０５に分岐し、変数ｎ−１とする呼出回数入替処理を行い、引数１として再び再帰呼出を行う。
【００６１】
次に、ステップＳ１０１で引数１を読み込み、ステップＳ１０２で変数ｎに１を設定する。ステップＳ１０３の条件判定を行い、変数ｎが０でないので、ステップＳ１０５に分岐し、変数ｎ−１とする呼出回数入替処理を行い、引数０としてさらに再帰呼出を行う。
【００６２】
次に、ステップＳ１０１で引数０を読み込み、ステップＳ１０２で変数ｎに引数０を設定する。ステップＳ１０３の条件判定を行い、変数ｎが０であるので、ステップＳ１０４に進み、現在の内容である呼出回数入替情報１１３を出力する。ここまでは呼出回数は全く入替えていないので、呼出回数入替情報１１３として図７（Ａ）に示す関数呼出組合せ情報１１１と全く同じものを出力し、引数０における処理は終了する。
【００６３】
次に、引数１における処理に戻り、ステップＳ１０６で変数ｉに初期値０を設定し、ステップＳ１０７の条件判定「ｉ＜（ｎ−１）」を行う。ｎ−１は０であり従って変数ｉは（ｎ−１）より小さくはないので、引数１における処理は終了する。
【００６４】
次に、引数２における処理に戻り、ステップＳ１０６で変数ｉに初期値０を設定し、ステップＳ１０７の条件判定を行う。ｎ−１は１であり従って変数ｉは（ｎ−１）より小さいので、ステップＳ１０８に進む。ステップＳ１０８で、インデックスｎ−１及びｉの要素、すなわち、呼出回数を交換する。この例では、インデックス０（ｉ）対応のｆｕｎｃ−ｆｕｎｃＡの呼出回数５０と、インデックス１（ｎ−１）対応のｆｕｎｃ−ｆｕｎｃＢの呼出回数２０とを交換する。その後、ステップＳ１０９で引数を１として再び再帰呼出を行う。
【００６５】
次に、ステップＳ１０１で引数１を読み込み、ステップＳ１０２で変数ｎに１を設定する。ステップＳ１０３の条件判定を行い、変数ｎが０でないので、ステップＳ１０５に分岐し、呼出回数入替処理ステップＳ１０９を引数０としてさらに再帰呼出を行う。
【００６６】
次にステップＳ１０１で、引数０を読み込み、ステップＳ１０２で変数ｎに引数０を設定する。ステップＳ１０３の条件判定を行い、変数ｎが０であるので、ステップＳ１０４に進み、図７（Ｂ）に示す呼出回数入替情報１１３を出力し、引数０における処理は終了する。
【００６７】
次に、引数１における処理に戻り、ステップＳ１０６で変数ｉに初期値０を設定し、ステップＳ１０７の条件判定「ｉ＜（ｎ−１）」を行う。ｎ−１は０であり従って変数ｉは（ｎ−１）より小さくはないので、引数１における処理を終了する。
【００６８】
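Next, returning to the processing for argument 2, in step S110 the call count 20 of func-funcA at index 0 and the call count 50 of func-funcB at index 1 are exchanged, which restores the original values. Then the variable i is incremented from 0 to 1 and the condition in step S107 is evaluated again. Since n-1 is 1, the variable i is not smaller than (n-1), and the processing for argument 2 ends.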
【００６９】
以降、同様の処理を繰り返し、図７（Ｃ）、（Ｄ）、（Ｅ）、（Ｆ）をそれぞれ出力して、呼出回数入替処理ステップＳ１を終了する。
【００７０】
次に、図５に戻り、ステップＳ２で、呼出回数入替情報１１３を参照し、この情報が（Ａ）から（Ｆ）までの６個出力されていることを認識し、（Ａ）から順に以後の処理を行っていく。
【００７１】
以降、ステップＳ４からステップＳ１９まで、従来のステップＰ１〜Ｐ１６と同一処理を行う。
【００７２】
このステップＳ４〜Ｓ１９のアルゴリズムは、従来の技術で説明したように、関数の呼出グラフを作成し、呼び出し回数をその辺（関数の組み合わせ）に対する重みとして優先順位をつけてメモリ空間に配置することにより、関数を最初に配置した時のキャッシュコンフリクトを避けることができる。各関数が配置された「使用色」（すなわち、使用されているキャッシュライン）と、その関数が現在利用できない「利用不可色」の集合を記録しておき、後者、すなわち、利用不可色を使わないように関数を配置し、既に配置した関数についても、その関数の利用できない利用不可色を使わないという条件の下で、別の場所に移動する。これにより、直接の「親」あるいは「子」との間で発生するキャッシュコンフリクトを除去するものである。
【００７３】
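First, in step S4, the function call graph 120 shown in FIG. 12 is created from the function call combination information 111, and in step S5 the created function call graph 120 is divided into combinations with many calls and combinations with few calls. Here, the combinations func-funcA, func-funcB, and func-funcC have “many” calls, and main-func, funcA-funcD, and funcB-funcD have “few” calls.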
【００７４】
ここで作成した関数呼出グラフ１２０においては、関数の組合せが辺となり、その両端のノードが組合せにおける２つの関数となる。
【００７５】
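Note that the number of cache lines, that is, the number of “colors”, occupied by each function in FIG. 3 is assumed to be two for func and one each for funcA, funcB, funcC, and main.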
【００７６】
次に、ステップＳ６において、呼出回数の多いもののグループを呼出回数の多い順にソートし、その順番でステップＳ７以降の処理を行う。ステップＳ７では呼出回数の多い辺が残っているか確認し、残っているのでステップＳ８に進み、ｆｕｎｃ−ｆｕｎｃＡの辺に対して両側のノードが未配置であるかを確認する。この確認において未配置であるのでステップＳ９に進み、ｆｕｎｃとｆｕｎｃＡをメモリ空間上の任意の場所に隣接して配置し、ステップＳ１８においてｆｕｎｃとｆｕｎｃＡの利用できない「色」を利用不可能集合として認識した後、再びステップＳ７に戻る。隣接して配置されたｆｕｎｃ−ｆｕｎｃＡの辺は、複合ノードとして今後ひとつのノードとして扱われる。この時点で、既に配置済みの関数とキャッシュラインの関係および各関数の利用不可能集合の状態は、図１３（Ａ）に示すようになっている。
【００７７】
続いて、呼出回数の多い辺がまだ残っているのでステップＳ８に進み、ｆｕｎｃ−ｆｕｎｃＣの辺に対して両側のノードが未配置であるかを確認する。この確認において、ｆｕｎｃは配置済みであるのでステップＳ９に進み、２個の異なる複合ノードに属するノードを結ぶ辺かどうかを確認する。この確認において、ｆｕｎｃＣは複合ノードに属していないので、ステップＳ１０に進み、一方のノードが複合ノードに属し他方のノードが未配置かどうか確認すると、条件に当てはまるのでステップＳ１４に進む。
【００７８】
ステップＳ１４では未配置のｆｕｎｃＣをｆｕｎｃに近い場所に配置し、ステップＳ１６において関数配置の際に利用不可能集合の影響で隙間が空いてないかを確認すると空いていないので、ステップＳ１８においてｆｕｎｃとｆｕｎｃＣを利用できない「利用不可色」を利用不可能集合として認識した後に、再びステップＳ７に戻る。
【００７９】
この時点で、既に配置済みの関数とキャッシュラインの関係及び各関数の利用不可能集合の状態は、図１３（Ｂ）に示すようになっている。
【００８０】
ステップＳ７において、まだ未配置の辺があるので、ステップＳ８に進み、ｆｕｎｃ−ｆｕｎｃＢの辺に対し、両側のノードが未配置であるかを確認する。この確認において、ｆｕｎｃは配置済みであるので、ステップＳ９に進み、２個の異なる複合ノードに属するノードを結ぶ辺かどうかを確認する。この確認において、ｆｕｎｃＢは複合ノードに属していないので、ステップＳ１０に進み、一方のノードが複合ノードに属し他方のノードが未配置かどうか確認する。この確認において、条件に当てはまるので、ステップＳ１４に進む。
【００８１】
ステップＳ１４では、未配置のｆｕｎｃＢと対を成すｆｕｎｃの中心から複合ノードの両端までの距離が同じであるため、任意に左側に配置し、ステップＳ１６において関数配置の際、利用不可能集合の影響で隙間が空いてないか確認する。すると、空いていないので、ステップＳ１８に進み、ｆｕｎｃとｆｕｎｃＢの利用できない「利用不可色」を利用不可能集合として認識したのち、再びステップＳ７に戻る。
【００８２】
In step S7, when it is confirmed that no unplaced edges remain, the process proceeds to step S19, and the unplaced nodes main and funcD are placed on arbitrary cache lines.
【００８３】
以上のステップＳ４〜Ｓ１９の処理結果、図７（Ａ）に示す呼出回数入替情報１１３に対しては、従来の図１３（Ｃ）と同様の、本実施の形態の関数メモリ配置結果４を色で示す図９（Ａ）の配置となる。
【００８４】
本実施の形態の関数配置とキャッシュラインの関係及び各関数の利用不可能集合の状態すなわち、関数メモリ配置結果４を示す図９を参照すると、図７に示す呼出回数入替情報１１３（Ａ）〜（Ｆ）の各々に対応して図９（Ａ）〜（Ｆ）に示すようになる。
【００８５】
本実施の形態の例では、処理の対象とする６つの関数ｍａｉｎ、ｆｕｎｃ、ｆｕｎｃＡ、ｆｕｎｃＢ、ｆｕｎｃＣ、ｆｕｎｃＤのうち、呼出回数入替の対象とする４つの関数ｆｕｎｃ、ｆｕｎｃＡ、ｆｕｎｃＢ、ｆｕｎｃＣに利用不可能集合対応の色すなわち、利用不可能色を割り付ける。この例では、関数ｆｕｎｃに青及び黄を、関数ｆｕｎｃＡ、ｆｕｎｃＢ及びｆｕｎｃＣに赤及び緑をそれぞれ利用不可能色として割り付ける。
【００８６】
次に、ステップＳ２０に進み、関数基本ブロック遷移情報１１２を参照してキャッシュコンフリクト回数算出処理を行う。
【００８７】
以下の説明では、上述したように、関数ｍａｉｎ、ｆｕｎｃ、ｆｕｎｃＡ、ｆｕｎｃＢ、ｆｕｎｃＣ、及びｆｕｎｃＤの各々の関数ＩＤを０，１，２，３，４，５と設定してあるものとする。
【００８８】
キャッシュコンフリクト回数算出処理ステップＳ２０の詳細をフローチャートで示す図８を参照してこのキャッシュコンフリクト回数算出処理の詳細動作について説明すると、まず、ステップＳ２０１で、コンフリクト回数をカウントする変数を０に初期化する。
【００８９】
次に、関数基本ブロック遷移情報読み込みステップＳ２０２で、関数基本ブロック遷移情報１１２を順次読み込み、関数ＩＤと基本ブロックの順番情報（以下ＩＤ順番情報）０−１、すなわち、ｍａｉｎ−１番を得る。次に、ステップＳ２０３の条件判定「関数基本ブロック遷移情報は終了」で、関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４に進む。ステップＳ２０４で、ＩＤ順番情報０−１のキャッシュ上の配置を求め、図９（Ａ）を参照して、関数ｍａｉｎ対応のキャッシュラインの色「黄」を得る。次に、ステップＳ２０５の条件判定「先頭の関数基本ブロック遷移情報か」で、ＩＤ順番情報０−１は先頭の関数基本ブロック遷移情報であるため、ステップＳ２０２に戻る。
【００９０】
次に、再度ステップＳ２０２で、関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報１−１、すなわち、ｆｕｎｃ−１番を得る。次に、ステップＳ２０３で、関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４に進み、ＩＤ順番情報１−１のキャッシュ上の配置として、図９（Ａ）を参照して、関数ｆｕｎｃ対応のキャッシュラインの色「赤」を得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報１−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次にステップＳ２０６の条件判定「以前のブロックとアドレス上の重なりがあるか」で、「赤」に配置されたものは以前にはなかったので、ステップＳ２０２に戻る。
【００９１】
次に、再度ステップＳ２０２で、関数基本ブロック遷移情報１１２を順次読み込み、次のＩＤ順番情報２−１、すなわち、ｆｕｎｃＡ−１を得る。以下、上記処理と同様にステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報２−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＡ対応のキャッシュライン「青」を得る。次に、ステップＳ２０５の条件判定で、このＩＤ順番情報２−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次にステップＳ２０６の条件判定で、「青」に配置されたものは以前にはなかったので、ステップＳ２０２に戻る。
【００９２】
次に、ステップＳ２０２で、関数基本ブロック遷移情報１１２を順次読み込み、次のＩＤ順番情報１−１、すなわち、ｆｕｎｃ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報１−１のキャッシュ上の配置として、図９（Ａ）を参照して、関数ｆｕｎｃ対応のキャッシュライン「赤」を得る。次にステップＳ２０５の条件判定で、ＩＤ順番情報１−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次にステップＳ２０６の条件判定で、「赤」に配置された以前のブロックは今回と同様１−１なので、ステップＳ２０２に戻る。
【００９３】
次に、ステップＳ２０２で関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報３−１、すなわち、ｆｕｎｃＢ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報３−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＢ対応のキャッシュライン「青」を得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報３−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次に、ステップＳ２０６の条件判定で、「青」に配置された以前のブロックは今回と異なりＩＤ順番情報２−１であったので、ステップＳ２０７に進む。次に、ステップＳ２０７で、コンフリクト回数をカウントする変数（以下、コンフリクト回数）を１インクリメントし、ステップＳ２０２に戻る。
【００９４】
次にステップＳ２０２で関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報５−１、すなわち、ｆｕｎｃＤ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報５−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＤ対応のキャッシュライン「緑」と得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報５−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次にステップＳ２０６の条件判定で、「緑」に配置されたものは以前にはなかったので、ステップＳ２０２に戻る。
【００９５】
次に、ステップＳ２０２で関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報２−１、すなわち、ｆｕｎｃＡ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４で２−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＡ対応のキャッシュライン「青」と得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報２−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次に、ステップＳ２０６の条件判定で、「青」に配置された以前のブロックは今回と異なりＩＤ順番情報３−１であったので、ステップＳ２０７に進む。次に、ステップＳ２０７で、コンフリクト回数を１インクリメントし、ステップＳ２０２に戻る。
【００９６】
次にステップＳ２０２で関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報５−１、すなわち、ｆｕｎｃＤ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報５−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＤ対応のキャッシュライン「緑」を得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報５−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次に、ステップＳ２０６の条件判定で、「緑」に配置された以前のブロックは今回と同様ＩＤ順番情報５−１なので、ステップＳ２０２に戻る。
【００９７】
次に、ステップＳ２０２で関数基本ブロック遷移情報１１２を順次読み込み、ＩＤ順番情報３−１、すなわち、ｆｕｎｃＢ−１を得る。次に、ステップＳ２０３で関数基本ブロック遷移情報はまだ終了ではないと判定して、ステップＳ２０４でＩＤ順番情報３−１のキャッシュ上の配置を、図９（Ａ）を参照して、関数ｆｕｎｃＢ対応のキャッシュライン「青」を得る。次に、ステップＳ２０５の条件判定で、ＩＤ順番情報３−１は先頭の関数基本ブロック遷移情報ではないため、ステップＳ２０６に進む。次に、ステップＳ２０６の条件判定で、「青」に配置された以前のブロックは今回と異なりＩＤ順番情報２−１だったので、ステップＳ２０７に進む。次に、ステップＳ２０７で、コンフリクト回数を１インクリメントし、ステップＳ２０２に戻る。
【００９８】
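Thereafter, when similar processing is repeated through to the end of the function basic block transition information 112, the number of conflicts on cache line “blue” (hereinafter abbreviated as “blue”, and likewise for the other lines) is 79, the number on “yellow” is 2, the number on “green” is 1, and the number on “red” is 0, so that a total of 82 conflicts is calculated.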
【００９９】
This completes the cache conflict count calculation processing step S20.
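The counting procedure of step S20 can be sketched as follows. This is one reading of the flowchart, stated as an assumption: a conflict is counted whenever a basic block arrives at a cache line whose most recent occupant was a different block. The function name `count_cache_conflicts` is hypothetical. The sample data are the first nine transition entries and the cache line colors named in the walk-through above for the arrangement of FIG. 9(A).

```python
def count_cache_conflicts(transitions, line_of):
    """Count how often a block displaces a different block from its cache line.

    transitions: (function ID, basic block number) pairs in execution order.
    line_of:     maps each pair to the cache line (color) it is placed on.
    """
    conflicts = 0                       # S201: initialise the counter
    resident = {}                       # cache line -> block that last occupied it
    for entry in transitions:           # S202/S203: read until the information ends
        line = line_of[entry]           # S204: placement on the cache
        previous = resident.get(line)   # S205/S206: was another block there before?
        if previous is not None and previous != entry:
            conflicts += 1              # S207: a different block is displaced
        resident[line] = entry
    return conflicts


# Colors named in the walk-through: main, func, funcA, funcB, funcD.
line_of = {(0, 1): "yellow", (1, 1): "red", (2, 1): "blue",
           (3, 1): "blue", (5, 1): "green"}
first_entries = [(0, 1), (1, 1), (2, 1), (1, 1), (3, 1),
                 (5, 1), (2, 1), (5, 1), (3, 1)]
conflicts = count_cache_conflicts(first_entries, line_of)
```

For these nine entries the sketch counts three conflicts, all on “blue”, matching the three increments of step S207 described in the walk-through.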
【０１００】
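Next, the process returns to step S2 and the same processing is performed thereafter. The relationship between the function placement and the cache lines, and the state of each function's unavailable set (hereinafter, cache memory arrangement), for each item of the call count replacement information 113 shown in FIG. 7 (FIGS. 7(A) to (F)) is as shown in FIGS. 9(A) to (F), respectively.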
【０１０１】
For example, for the cache memory arrangement of FIG. 9(B), the same 82 conflicts as in case (A) are calculated.
【０１０２】
For the cache memory arrangements of FIGS. 9(C) and (F), the number of conflicts on “blue” is 1, the number on “yellow” is 2, the number on “green” is 1, and the number on “red” is 0, so that a total of 4 conflicts is calculated.
【０１０３】
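Further, for the cache memory arrangements of FIGS. 9(D) and (E), the number of conflicts on “blue” is 79, the number on “yellow” is 2, the number on “green” is 1, and the number on “red” is 0, so that a total of 82 conflicts is calculated.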
【０１０４】
It can therefore be seen that the number of cache conflicts is smallest for the cache memory arrangements of FIGS. 9(C) and (F).
【０１０５】
In step S3, the cache memory arrangement of the functions is finally decided to be that of FIG. 9(C), which is one of these two.
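The final selection in step S3 reduces to picking the candidate with the smallest total. The sketch below uses the totals stated above for the six arrangements. The function name `choose_arrangement` is hypothetical, and breaking the tie between (C) and (F) by taking the first candidate is an assumption; the description only says that one of the two is chosen.

```python
# Total conflicts stated in the description for the arrangements of FIG. 9(A) to (F).
conflict_totals = {"A": 82, "B": 82, "C": 4, "D": 82, "E": 82, "F": 4}


def choose_arrangement(totals):
    """Return the first arrangement, in iteration order, with the fewest conflicts."""
    best = None
    for name, total in totals.items():
        # Strict comparison keeps the earlier candidate when totals are equal.
        if best is None or total < totals[best]:
            best = name
    return best


chosen = choose_arrangement(conflict_totals)
```

Under these assumptions `chosen` is "C", the arrangement the description settles on.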
【０１０６】
従来の第１の技術では、ｆｕｎｃＡ、ｆｕｎｃＢが同一キャッシュライン「青」を共有しており、それぞれの利用不可能集合には「青」が含まれておらず、従ってこれらｆｕｎｃＡ、ｆｕｎｃＢは直接の呼出関係がないため、関数呼出組み合わせ情報を元にした関数配置を行うと、これらの関数が同一キャッシュラインに乗ってしまう場合があり得、アプリケーションプログラムによっては、必ずしもこれら両関数間のキャッシュコンフリクトを削減することができなかった。これに対し、本実施の形態では、図９（Ｃ）に示す通り、ｆｕｎｃＡは「黄」、ｆｕｎｃＢは「青」にそれぞれ配置されるため、ｆｕｎｃＡとｆｕｎｃＢとが遷移を繰り返してもキャッシュコンフリクトが起きず、アプリケーションプログラムの実行スピードを向上することができる。
【０１０７】
【発明の効果】
以上説明したように、本発明の命令キャッシュへの関数割付最適化装置、関数割付最適化方法及び関数割付最適化手順を記録した記録媒体は、関数呼出情報出力部と、関数呼出に応じた関数の遷移に対して該関数のＩＤ及び該関数の基本ブロックの順番の組合せを関数遷移毎に並べた関数基本ブロック遷移情報に出力する関数基本ブロック遷移情報出力部と、上記関数呼出組合せ情報を参照し関数呼出組合せ相互間で遷移回数を入替えた関数呼出回数入替データから成る呼出回数入替情報を生成し、次に生成した呼出回数入替情報を参照して関数をメモリ空間上のアドレスに仮配置した後、関数基本ブロック遷移情報を参照してキャッシュコンフリクトの回数を検出し、呼出回数入替データの中で、キャッシュコンフリクトの回数の最も少ないものに関数のメモリ配置を決定し対応する関数メモリ配置結果を出力する関数メモリ配置最適化部とを備えているので、以下の効果を奏する。
【０１０８】
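First, the first effect is that, even in cases such as (1) a plurality of functions being called consecutively within a function, or (2) functions being called within a loop, the execution speed of the application program can be improved by placing the functions in the memory space so as to reduce cache conflicts.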
【０１０９】
その理由は、関数が連続して呼ばれている場合やループ中で呼ばれている等の呼出回数の多いものに着目して関数の呼出情報における呼出回数を入れ替え、それぞれの情報の重み付けを変更してから従来と同様の関数の配置処理を行なって仮配置をし、変更した中でもっともキャッシュコンフリクトの回数が少ないものを、関数の最終的なメモリ配置として決定するため、ある関数の中で複数の関数が連続して呼ばれている場合、あるいはループの中で呼ばれている場合などのようなアプリケーションプログラムの複雑さには依存しない処理であるからである。
【０１１０】
また、第２の効果は、（３）ループ中で呼ばれる関数がまた別の関数を呼んでいる場合などに対しても、キャッシュコンフリクトを削減するよう関数をメモリ空間上に配置することにより、アプリケーションプログラムの実行スピードを向上できることである。
【０１１１】
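The reason is that the call counts in the function call information are interchanged so that the weighting of each item of information changes, the conventional function placement processing is then performed as a provisional placement, and the variant with the fewest cache conflicts is adopted as the final memory placement of the functions. The processing therefore does not depend on the complexity of the application program, such as a function called in a loop calling yet another function.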
【図面の簡単な説明】
【図１】本発明の命令キャッシュへの関数割付最適化装置及びその処理手順の一実施の形態を示すブロック図である。
【図２】本実施の形態の関数呼出組合せ情報の一例を示す図である。
【図３】アプリケーションプログラムの一例を示す図である。
【図４】本実施の形態の関数基本ブロック遷移情報の一例を示す図である。
【図５】本実施の形態の命令キャッシュへの関数割付最適化装置の動作である関数割付最適化方法の一例を示すフローチャートである。
【図６】図５の呼出回数入替処理ステップの詳細処理を示すフローチャートである。
【図７】本実施の形態の呼出回数入替情報の一例を示す図である。
【図８】図５のキャッシュコンフリクト回数算出処理ステップの詳細処理を示すフローチャートである。
【図９】本実施の形態の関数メモリ配置結果を示す図である。
【図１０】従来の命令キャッシュへの関数割付最適化装置及びその処理手順の一例を示すブロック図である。
【図１１】従来の命令キャッシュへの関数割付最適化装置の動作である関数割付最適化方法の一例を示すフローチャートである。
【図１２】従来の関数呼出グラフの構成を説明するための図である。
【図１３】従来の関数配置とキャッシュラインの関係及び各関数の利用不可能集合の状態及び関数メモリ配置結果の一例を示す図である。
【符号の説明】
１関数呼出情報出力部
２関数基本ブロック遷移情報出力部
３，１０３関数メモリ配置最適化部
４，１０４関数メモリ配置結果
１１０アプリケーションプログラム
１１１関数呼出組合せ情報
１１２関数基本ブロック遷移情報
１１３呼出回数入替情報
１２０関数呼出グラフ

[0001]
[Technical field of the invention]
The present invention relates to a function allocation optimization apparatus for an instruction cache, a function allocation optimization method, and a recording medium on which a function allocation optimization procedure is recorded, and more particularly to such an apparatus, method, and recording medium for a microprocessor system equipped with a cache.
[0002]
[Prior art]
The processing speed of this type of microprocessor system depends heavily not only on CPU speed but also on the speed of access to the external memory that serves as main memory (hereinafter, memory access speed). However, while CPU speed has improved remarkably in recent years, memory access speed has improved far less, and the gap between the two widens every year. For example, one source reports that DRAM access time improves by about 7% per year, whereas CPU speed improves by 50 to 100% (David A. Patterson and John L. Hennessy, translated by Mitsuaki Narita, “Computer Organization and Design”, Nikkei BP, April 19, 1996).
[0003]
Therefore, effective use of the cache has become a very important issue for improving the system processing speed. As is well known, the cache is a mechanism comprising a buffer that is a small-capacity high-speed memory for accessing an external memory at high speed. Generally, a program processes functions arranged in a memory in the order of addresses. However, since the memory access speed of the external memory is very slow compared with the processing speed of the CPU, there is a problem that the execution speed becomes slow as a result.
[0004]
To solve this problem, the program stored in the external memory is copied into the cache, a buffer that can be accessed at high speed, when a function is executed, and is executed from there. This makes high-speed program execution possible, because a memory location that has been accessed once is generally likely to be accessed again soon.
[0005]
However, in general, a high-speed cache memory is expensive compared to an external memory, and its configuration is expensive, so its capacity (size) is very small compared to an external memory. For this reason, a processing system using a cache divides the external memory into areas divided by the size of the cache, and allocates the external memory to the cache for each divided area. In the first access to an address, the program at that address is copied to the cache allocation area, and when the same address in the external memory is accessed next, the cache is directly accessed. This realizes high-speed program execution.
[0006]
At this time, copying from the external memory to the cache memory is performed in units of a specific size, and each area obtained by dividing the cache by this size is called a cache line. Consequently, for functions placed at external-memory addresses that are assigned to the same cache line, the program must be copied into the cache again every time execution switches from one function to another. This is called a cache conflict. If cache conflicts occur frequently, program execution slows down. To address this problem, methods have recently been studied that place functions likely to run at the same time so that they do not fall on the same cache line.
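As a rough illustration of why two functions can collide, the sketch below assumes a direct-mapped cache. The line size, the number of lines, the example addresses, and the function names are hypothetical values chosen for the illustration, not figures from this document.

```python
LINE_SIZE = 32    # bytes per cache line (assumed)
NUM_LINES = 128   # lines in the cache (assumed), giving a 4 KiB cache


def cache_line_index(address: int) -> int:
    """Return the cache line that a byte address maps to in a direct-mapped cache."""
    return (address // LINE_SIZE) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """True when two addresses in different memory blocks share one cache line."""
    same_line = cache_line_index(addr_a) == cache_line_index(addr_b)
    same_block = addr_a // LINE_SIZE == addr_b // LINE_SIZE
    return same_line and not same_block


# Two hypothetical function entry points exactly one cache size apart.
func_a_addr = 0x1000
func_b_addr = 0x1000 + LINE_SIZE * NUM_LINES
```

Here `conflicts(func_a_addr, func_b_addr)` is true: the two addresses lie one cache size apart, so they map to the same line and each function would displace the other whenever execution alternates between them.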
[0007]
The cache includes an instruction cache and a data cache. The present invention focuses on the instruction cache.
[0008]
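Methods for mapping external memory to the cache include the direct-mapped method, which is the simplest and cheapest, as well as set-associative and fully associative methods. Because the basic problem is the same for all of them, the direct-mapped method is used as the example in the following description.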
[0009]
In conventional language processing programs, each processing unit (function) was placed in memory arbitrarily. As a result, in a system equipped with a cache, the cache was not necessarily used optimally.
[0010]
In the latest conventional technology, in order to reduce the number of cache conflicts stochastically, research on how to effectively use the cache by optimizing the function memory allocation based on the function call count information has been made. Several algorithms have been published.
[0011]
For example, A. H. Hashemi, D. R. Kaeli, and B. Calder, “Efficient Procedure Mapping Using Cache Line Coloring”, ACM SIGPLAN, June 1997 (Document 1), discloses a technique that optimizes the memory placement of functions so as to avoid cache conflicts, starting from the most frequently called functions.
[0012]
Further, the first conventional method of optimizing function allocation to an instruction cache, described in Japanese Patent Application Laid-Open No. 11-232117 (Document 2), adopts the method of Document 1 as an example of a function placement method for using the cache memory efficiently, and explains it in detail.
[0013]
In the algorithm of this prior art, a call graph of the functions is created, and the functions are placed in the memory space in priority order, using the number of calls as the weight of each edge (each combination of functions). This avoids cache conflicts at the time the functions are first placed. In addition, for each function, the set of “used colors” (that is, the cache lines it occupies) and the set of “unavailable colors” that the function currently cannot use are recorded. Functions are placed so as not to use their unavailable colors, and functions that have already been placed are also moved to another location, provided that they still do not use their unavailable colors. This eliminates cache conflicts between a function and its direct “parent” or “child”.
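The bookkeeping of used and unavailable colors can be sketched as below. This is an illustrative reading rather than the cited algorithm itself: a function's unavailable colors are taken to be the colors used by its direct callers and callees. The function name `unavailable_colors` is hypothetical, and the color placement is an assumption chosen to be consistent with the unavailable sets this document states for func (blue and yellow) and for funcA, funcB, and funcC (red and green).

```python
from collections import defaultdict

# Call-graph edges with many calls, as named in the description (caller, callee).
heavy_edges = [("func", "funcA"), ("func", "funcB"), ("func", "funcC")]

# Assumed placement: the colors (cache lines) each function occupies.
used_colors = {
    "func": {"red", "green"},
    "funcA": {"blue"},
    "funcB": {"blue"},
    "funcC": {"yellow"},
}


def unavailable_colors(edges, used):
    """For each function, collect the colors used by its direct callers and callees."""
    unavailable = defaultdict(set)
    for caller, callee in edges:
        unavailable[caller] |= used[callee]
        unavailable[callee] |= used[caller]
    return dict(unavailable)


result = unavailable_colors(heavy_edges, used_colors)
```

Note that funcA and funcB have no edge between them, so “blue” appears in neither of their unavailable sets and nothing prevents both from being placed on it.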
[0014]
Next, the operation in the conventional example for the application program applied in the embodiment of the present invention to be described later is confirmed. Thus, problems in the conventional example will be described in more detail.
[0015]
Referring to FIG. 10, which shows the first conventional apparatus for optimizing function allocation to an instruction cache as a block diagram, this apparatus comprises a function call information output unit 1, which reads function call information from an application program 110 and outputs, for each function call, the caller and callee function information and the number of calls to function call combination information 111, and a function memory arrangement optimizing unit 103, which optimizes the placement of functions based on the function call combination information 111, places them in the address space, and outputs a function memory arrangement result 104.
[0016]
The first conventional method of optimizing function allocation to an instruction cache, which is the operation of the first conventional apparatus, will now be described with reference to FIG. 10, FIG. 11, which shows the processing flow of the function memory arrangement optimizing unit 103 as a flowchart, and FIG. 3, which shows an example of the application program. First, when the application program 110 shown in FIG. 3 is applied, the function call information output unit 1 uses profiling to output, for each function call, the caller and callee function information and the number of calls to the function call combination information 111. In FIGS. 10 and 11, solid arrows indicate the flow of processing, and dotted arrows indicate the flow of data.
[0017]
Referring to FIG. 2 showing an example of the function call combination information 111, the function call combination information 111 includes columns of function call source, call destination, and number of calls.
[0018]
Next, the function memory arrangement optimizing unit 103 sorts the function call combination information 111 in descending order of the number of calls and places the functions in the address space in that order. At the same time, it recognizes the set of "unusable colors", corresponding to the cache lines that each placed function cannot use, and places subsequent functions so as to avoid them.
[0019]
That is, in FIG. 9, in step P1, the function call graph 120 shown in FIG. 12 is created from the function call combination information 111, and in step P2, the created function call graph 120 is divided into edges with a large number of calls and edges with a small number of calls. Here, func-funcA, func-funcB, and func-funcC belong to the former ("many"), and main-func, funcA-funcD, and funcB-funcD belong to the latter ("few").
[0020]
In the function call graph 120 created here, each combination of functions is an edge, and the nodes at its two ends are the two functions of that combination.
[0021]
Note that the number of cache lines, that is, the number of "colors", occupied by each function in FIG. 3 is two for func and one each for funcA, funcB, funcC, and main.
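The notion of the "colors" occupied by a function can be illustrated with a minimal sketch. It assumes a direct-mapped cache; the line size, the number of lines, and the function addresses and sizes below are hypothetical values chosen only so that func spans two lines and the other functions one line each, and the helper name occupied_colors is not taken from the patent.

```python
def occupied_colors(start_addr, size, line_size=16, num_lines=4):
    """Return the set of cache-line indices ("colors") a function occupies.

    Assumes a direct-mapped cache: memory block k maps to line k % num_lines.
    line_size and num_lines are hypothetical illustration values.
    """
    first_block = start_addr // line_size
    last_block = (start_addr + size - 1) // line_size
    return {block % num_lines for block in range(first_block, last_block + 1)}


# Hypothetical layout: func is 32 bytes (two lines), funcA is 16 bytes (one line).
colors_func = occupied_colors(0, 32)
colors_funcA = occupied_colors(32, 16)

# A function placed one full cache size later wraps around onto func's first line.
colors_wrapped = occupied_colors(64, 16)
```

Two functions whose color sets intersect compete for the same cache line, which is the situation the unusable sets are meant to avoid.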
[0022]
Next, in step P3, the edges with a large number of calls are sorted in descending order of the number of calls, and the processing from step P4 onward is performed in that order. In step P4, it is checked whether any edge with a large number of calls remains. Since one remains, the process proceeds to step P5, where it is checked, for the edge func-funcA, whether the nodes at both ends are unplaced. Since neither is placed, the process proceeds to step P9, where func and funcA are placed adjacent to each other at an arbitrary location in the memory space. In step P15, the "unavailable" colors of func and funcA are recognized as their unusable sets, and the process returns to step P4. The edge func-funcA, whose nodes are now placed adjacent to each other, is hereafter treated as a single node, a composite node. At this point, the relationship between the functions already placed and the cache lines, and the state of the unusable set of each function, are as shown in FIG.
[0023]
Subsequently, since edges with a large number of calls still remain, the process proceeds to step P5, where it is checked, for the edge func-funcC, whether the nodes at both ends are unplaced. Since func has already been placed, the process proceeds to step P6, where it is checked whether the edge connects nodes belonging to two different composite nodes. Since funcC does not belong to a composite node, the process proceeds to step P7, where it is checked whether one node belongs to a composite node and the other node is unplaced. Since this condition is met, the process proceeds to step P11.
[0024]
In step P11, the unplaced funcC is placed at a location close to func. In step P13, it is checked whether the unusable set has left a gap in the function placement. Since there is no gap, the "unusable colors" that funcC cannot use are recognized as its unusable set, and the process returns to step P4.
[0025]
At this point, the relationship between the functions already placed and the cache lines, and the state of the unusable set of each function, are as shown in FIG.
[0026]
In step P4, since unplaced edges still remain, the process proceeds to step P5, where it is checked, for the edge func-funcB, whether the nodes at both ends are unplaced. Since func has already been placed, the process proceeds to step P6, where it is checked whether the edge connects nodes belonging to two different composite nodes. Since funcB does not belong to a composite node, the process proceeds to step P7, where it is checked whether one node belongs to a composite node and the other node is unplaced. Since this condition is met, the process proceeds to step P11.
[0027]
In step P11, because func, the partner of the unplaced funcB, is equidistant from both ends of the composite node, funcB is arbitrarily placed on the left side. In step P13, it is checked whether the unusable set has left a gap in the function placement. Since there is no gap, the process proceeds to step P15, where the "unusable colors" that func and funcB cannot use are recognized as their unusable sets, and the process returns to step P4.
[0028]
When it is confirmed in step P4 that no unplaced edges remain, the process proceeds to step P16, where the unplaced nodes main and funcD are placed in arbitrary cache lines.
[0029]
FIG. 13C shows the relationship between the final function arrangement and the cache lines, and the state of the unusable set of each function, that is, the function memory arrangement result 104. Here funcA and funcB share the same cache line "blue", and neither function's unusable set contains "blue". Therefore, the cache conflict between the caller function and the callee function can be reduced.
[0030]
The purpose of the first conventional technique is to prevent collisions and cache misses in the cache memory when procedures, functions, or subroutines call each other. For this purpose, information indicating the number of times a procedure, function, or subroutine is actually called and information indicating a relationship in which procedures, functions, or subroutines call each other are used. As a result, it is possible to prevent collisions in the cache memory when procedures, functions, or subroutines call each other.
[0031]
As described above, the first conventional technique can reduce cache conflicts between a caller function and a callee function. However, (1) when a plurality of functions are called consecutively within a certain function, or (2) when they are called in a loop, there is a first problem: cache conflicts among these functions cannot be reduced, and a very large number of cache conflicts occur.
[0032]
In other words, since funcA and funcB have no direct call relationship, the conventional technique, which arranges functions based on function call combination information, may place these functions on the same cache line. However, as is clear from the application program example in FIG. 3, funcA and funcB are called consecutively in a loop, and funcB is written so that it calls funcA via funcD. Execution therefore moves between funcA and funcB frequently, and a very large number of cache conflicts occur in this loop processing.
[0033]
To solve the first problem above, the second conventional function allocation optimization method for an instruction cache, described in Japanese Patent Application No. 2000-027218 (Document 3), outputs by profiling not the combination information of direct function calls but function execution time-series information. From this time-series information, it detects execution patterns of function combinations that may cause cache conflicts other than through direct function calls, such as consecutive function calls, and then applies the prior-art function arrangement optimization to the detected inter-function cache conflict combination information. This provides a means of reducing cache conflicts between functions in cases that could not previously be handled, namely (1) when a plurality of functions are called consecutively within a certain function, or (2) when they are called in a loop, and thereby of improving the execution speed of the application program.
[0034]
This second conventional technique can be realized by simple processing when the program transitions are simple, and in that case it is an extremely effective means. However, as in this example, (3) when a function called in the loop of case (2) calls another function, there is a second problem: pattern matching fails and optimization cannot be performed.
[0035]
[Problems to be solved by the invention]
The first conventional function allocation optimization apparatus for an instruction cache, function allocation optimization method, and recording medium recording the function allocation optimization procedure described above have the following drawback: when a plurality of functions are called consecutively within a certain function, or when they are called in a loop, cache conflicts between these functions cannot always be reduced, and in the worst case a very large number of cache conflicts occur.
[0036]
In addition, the second conventional function allocation optimization apparatus for an instruction cache, function allocation optimization method, and recording medium recording the function allocation optimization procedure, which address the above drawback, can be realized by simple processing when the program transitions are simple and are then an extremely effective means. However, they have the drawback that, when a function called in a loop calls another function, pattern matching fails and optimization becomes impossible.
[0037]
An object of the present invention is to provide a function allocation optimization apparatus for an instruction cache, a function allocation optimization method, and a recording medium recording a function allocation optimization procedure that eliminate the drawbacks of the first and second prior arts described above, reduce cache conflicts between a plurality of functions, and improve the execution speed of an application program.
[0038]
[Means for Solving the Problems]
The function allocation optimization apparatus for an instruction cache according to the first aspect of the present invention takes as input a predetermined application program for a microprocessor system equipped with an instruction cache and optimizes the memory allocation of functions based on function call count information so as to probabilistically reduce the number of cache conflicts in the instruction cache. The apparatus comprises:
A function call information output unit that takes the application program as input and, by profiling, outputs to the function call combination information the caller and callee functions at each function call and the number of calls, as function call combinations;
A function basic block transition information output unit that takes the application program as input and, by profiling, outputs function basic block transition information in which, for each function transition caused by a function call, a combination of the function ID and the order of the basic block of that function is arranged; and
A function memory allocation optimizing unit that, referring to the function call combination information, generates call count replacement information composed of function call count replacement data in which the numbers of transitions, that is, the numbers of function transitions, are exchanged between the function call combinations; temporarily places the functions at addresses in the memory space with reference to the generated call count replacement information; detects the number of cache conflicts with reference to the function basic block transition information; determines the memory allocation of the functions to be the one with the fewest cache conflicts among the call count replacement data; and outputs the corresponding function memory allocation result.
[0039]
In addition, each of the function call combination information and the call count replacement information may have:
A caller field describing the function name of the caller of the function;
A callee field describing the function name of the callee of the function; and
A call count field holding the number of calls of the function.
[0040]
The function allocation optimization method for an instruction cache according to claim 3 takes as input an application program for a microprocessor system equipped with an instruction cache and optimizes the memory allocation of functions based on function call count information so as to reduce the number of cache conflicts in the instruction cache. In this method:
The application program is input, and function call combination information is generated by profiling, using the caller and callee functions at each function call and the number of calls as function call combinations;
The application program is input, and function basic block transition information concerning the execution of the functions in units of basic blocks is generated by profiling; and
A function memory allocation optimization step is provided that, referring to the function call combination information, generates call count replacement information composed of function call count replacement data in which the numbers of transitions, that is, the numbers of function transitions, are exchanged between the function call combinations, detects the number of cache conflicts of each function with reference to the function basic block transition information, and determines the memory allocation of the functions to be the one with the fewest cache conflicts among the call count replacement data.
[0041]
Further, the function memory arrangement optimizing step may comprise:
A call count replacement processing step of generating the call count replacement information with reference to the function call combination information;
A function memory temporary placement step of temporarily placing the functions at addresses in the memory space with reference to the generated call count replacement information;
A cache conflict count calculation processing step of detecting the number of cache conflicts of each temporarily placed function with reference to the function basic block transition information; and
A function memory allocation step of determining the memory allocation of the functions to be that of the function call count replacement data having the fewest cache conflicts.
[0042]
The call count replacement processing step in the function memory arrangement optimizing step may comprise:
A first step of reading an argument, which is the number of function call combinations whose call counts are to be exchanged with reference to the function call combination information, and setting it in a first variable;
A second step of determining whether the first variable is 0;
A third step of outputting the call count replacement information with its current contents when the first variable is 0 in the second step;
A fourth step of, when the first variable is not 0 in the second step, recursively invoking the call count replacement process with an argument equal to the first variable minus 1;
A fifth step of setting a second variable to 0;
A sixth step of determining whether the second variable is smaller than the first variable minus 1, and ending the process if it is not;
A seventh step of, if it is smaller in the sixth step, exchanging the call counts at a first index, which is the first variable minus 1, and at a second index, which is the second variable;
An eighth step of recursively invoking the call count replacement process with an argument equal to the first variable minus 1;
A ninth step of exchanging again the call counts at the second index and at the first index; and
A tenth step of incrementing the second variable by 1 and repeating the sixth and subsequent steps.
[0043]
The cache conflict count calculation processing step in the function memory arrangement optimizing step may comprise:
A first step of initializing to zero a variable for counting the number of cache conflicts;
A second step of sequentially reading the function basic block transition information and obtaining ID order information, which is the function ID and basic block order information;
A third step of determining whether the end of the function basic block transition information has been reached;
A fourth step of, if the end has not been reached in the third step, obtaining the placement on the cache of the block given by the ID order information;
A fifth step of determining whether this is the first entry of the function basic block transition information and, if so, returning to the second step;
A sixth step of, if it is not the first entry in the fifth step, determining whether there is an address overlap with the previous block and, if there is none, returning to the second step; and
A seventh step of, if there is an overlap in the sixth step, incrementing the variable for counting the number of cache conflicts by 1 and returning to the second step.
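The seven steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the temporary placement is represented as a hypothetical dictionary mapping each (function ID, basic block order) pair to the set of cache lines it occupies, "address overlap" is taken to mean a shared cache line, and a block immediately followed by itself is assumed not to conflict. The names count_cache_conflicts, trace, shared_layout, and separate_layout are invented for the example.

```python
def count_cache_conflicts(transitions, layout):
    """Count transitions whose block shares a cache line with the previous block.

    transitions: sequence of (function_id, basic_block_order) pairs.
    layout: hypothetical dict mapping each pair to a set of cache-line indices.
    """
    conflicts = 0                      # first step: initialise the counter
    previous = None
    for block in transitions:          # second/third steps: read until the end
        lines = layout[block]          # fourth step: placement on the cache
        if previous is not None and block != previous:   # fifth step: skip the first entry
            if lines & layout[previous]:                 # sixth step: overlap check
                conflicts += 1                           # seventh step: count it
        previous = block
    return conflicts


# funcA (ID 2) and funcB (ID 3) alternating, as in a loop that calls both.
trace = [(2, 1), (3, 1), (2, 1), (3, 1)]
shared_layout = {(2, 1): {2}, (3, 1): {2}}     # both on the same cache line
separate_layout = {(2, 1): {2}, (3, 1): {3}}   # on different cache lines

conflicts_shared = count_cache_conflicts(trace, shared_layout)
conflicts_separate = count_cache_conflicts(trace, separate_layout)
```

With the shared placement every one of the three transitions is counted, while the separate placement yields none, which is the kind of difference the optimization uses to rank candidate placements.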
[0044]
The recording medium according to claim 7 records a function allocation optimization procedure for an instruction cache that takes as input an application program for a microprocessor system equipped with an instruction cache and optimizes the memory allocation of functions based on function call count information so as to probabilistically reduce the number of cache conflicts in the instruction cache. The recording medium records:
A procedure of inputting the application program and generating, by profiling, function call combination information using the caller and callee functions at each function call and the number of calls as function call combinations;
A procedure of inputting the application program and generating, by profiling, function basic block transition information concerning the execution of the functions in units of basic blocks; and
A function memory allocation optimization procedure of, referring to the function call combination information, generating call count replacement information composed of function call count replacement data in which the numbers of transitions, that is, the numbers of function transitions, are exchanged between the function call combinations, detecting the number of cache conflicts of each function with reference to the function basic block transition information, and determining the memory allocation of the functions to be the one with the fewest cache conflicts among the call count replacement data.
[0045]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0046]
The present invention is a function allocation optimization method for an instruction cache that takes as input an application program for a microprocessor system equipped with an instruction cache and optimizes the memory allocation of functions based on function call count information so as to reduce the number of cache conflicts in the instruction cache. Function call combination information is generated by profiling, using the caller and callee functions at each function call and the number of calls as function call combinations, and function basic block transition information concerning the execution of each basic block of the functions is also generated by profiling. Referring to the function call combination information, call count replacement information is generated, composed of function call count replacement data in which the numbers of function transitions are exchanged between function call combinations. The number of cache conflicts in the instruction cache is then detected with reference to the function basic block transition information, and the functions are placed at addresses in the memory space so that the number of cache conflicts is the smallest among the call count replacement data. The program execution speed is thereby improved.
[0047]
Referring to FIG. 1, in which the same reference characters and numerals are attached to the same components as in FIG., the function allocation optimizing device of this embodiment comprises the following. As in the conventional device, it has a function call information output unit 1 that takes the application program 110 as input and, by profiling, outputs to the function call combination information 111 the caller and callee functions at each function call and the number of calls, as function call combinations. In addition, it has a function basic block transition information output unit 2 that takes the application program 110 as input and, by profiling, outputs function basic block transition information 112 in which, for each function transition caused by a function call, a combination of the function ID (identification) and the order of the basic block of that function is arranged. It further has a function memory arrangement optimizing unit 3 that, referring to the function call combination information 111, exchanges the numbers of transitions, that is, the numbers of function transitions, between the function call combinations to generate call count replacement information 113 composed of function call count replacement data; temporarily places the functions at addresses in the memory space with reference to the generated call count replacement information 113; detects the number of cache conflicts with reference to the function basic block transition information 112; determines the memory arrangement of the functions to be the one with the fewest cache conflicts among the call count replacements, that is, among the function call count replacement data; and outputs the corresponding function memory arrangement result 4.
[0048]
In FIG. 1, solid arrows indicate the flow of processing, and dotted arrows indicate the flow of data.
[0049]
Referring to FIG. 2, which shows an example of the function call combination information 111, the function call combination information 111 consists of a caller field describing the function name of the caller, a callee field describing the function name of the callee, and a call count field holding the number of calls.
[0050]
The call count replacement information 113 has exactly the same structure as the function call combination information 111: a caller field describing the function name of the caller, a callee field describing the function name of the callee, and a call count field holding the number of calls.
[0051]
Next, the operation of the present embodiment will be described with reference to FIG. 1, FIG. 2, FIG. 3, which shows an example of the application program, and FIG. 5, which shows the processing flow of the function memory arrangement optimization unit as a flowchart. As in the conventional case, the function call information output unit 1 takes the application program 110 as input and, by profiling, outputs the caller and callee function information at each function call and the number of calls to the function call combination information 111. Here, "profiling" means inserting into the application program 110 code that outputs the function ID and the basic block order, at the head of each basic block of a function and at each return from a called function, and then executing the program.
[0052]
Next, the function basic block transition information output unit 2 takes the application program 110 as input and performs profiling. That is, it inserts into the application program 110 code that outputs the function ID and the basic block order at the head of each basic block of a function and at each return from a called function, executes the program, and outputs the result to the function basic block transition information 112.
[0053]
Referring to FIG. 3 again, the application program 110 shown in this figure is an example of an application program in C language. Assume that the function ID of the function main is 0, that of func is 1, that of funcA is 2, that of funcB is 3, that of funcC is 4, and that of funcD is 5.
[0054]
Further, the function func is composed of two basic blocks, the basic block being the basic processing unit of a function, and the functions main, funcA, funcB, funcC, and funcD each consist of one basic block.
[0055]
First, when the top-level function main is called, the combination 0-1 of main's ID 0 and its first basic block is output (hereinafter written as combination 0-1, and so on). Next, execution moves from the function main to the function func, and the combination 1-1 of func's ID 1 and its first basic block is output. Next, execution moves from the function func to funcA, and the combination 2-1 of funcA's ID 2 and its first basic block is output. Next, execution returns from the function funcA to func, and the combination 1-1 of func's ID 1 and its first basic block is output. Thereafter, combinations of function ID and basic block order are output in the same way until the end of the program.
[0056]
The application program 110 of this example calls the function func from the function main. The function func repeats 20 times (iteration) the process of calling the function funcA and the function funcB, then repeats 30 times the process of calling the function funcA and the function funcC, and finally returns to the function main, where the program ends. Here, the function funcB calls the function funcD, and the function funcD calls the function funcA. When profiling is run on this processing until the end of the program, the function basic block transition information 112 shown in FIG. 4, which is an array of function ID and basic block order information, is created.
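As a rough illustration of how the call counts in the function call combination information follow from this call structure, the sketch below models the described program in Python. It is a stand-in for the C program of FIG. 3, which is not reproduced here; the helper call and the table name call_counts are hypothetical, and only the caller/callee relationships and loop counts come from the description above.

```python
from collections import Counter

call_counts = Counter()  # (caller, callee) -> number of calls


def call(caller, callee, body):
    """Record one caller -> callee call (the profiling hook), then run the callee."""
    call_counts[(caller, callee)] += 1
    body()


def funcA():
    pass


def funcC():
    pass


def funcD():
    call("funcD", "funcA", funcA)      # funcD calls funcA


def funcB():
    call("funcB", "funcD", funcD)      # funcB calls funcD


def func():
    for _ in range(20):                # first loop: funcA then funcB, 20 times
        call("func", "funcA", funcA)
        call("func", "funcB", funcB)
    for _ in range(30):                # second loop: funcA then funcC, 30 times
        call("func", "funcA", funcA)
        call("func", "funcC", funcC)


def main():
    call("main", "func", func)


main()
```

Under this model func-funcA is called 50 times, func-funcC 30 times, and func-funcB 20 times, so these three combinations dominate the table, while funcA is also reached indirectly through funcB and funcD on every iteration of the first loop.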
[0057]
Next, the function memory arrangement optimization unit 3 refers to the function call combination information 111 and exchanges the call counts to generate the call count replacement information 113 composed of function call count replacement data. Then, with reference to the generated call count replacement information 113, it temporarily places the functions at addresses in the memory space. After that, it detects the number of cache conflicts of the temporarily placed functions with reference to the function basic block transition information 112, and determines the memory allocation of the functions to be the one with the fewest cache conflicts among the call count replacements, that is, among the function call count replacement data.
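The overall selection can be sketched as a simple search over candidates. This is only an outline under assumptions: the placement algorithm and the conflict counter are passed in as the hypothetical callables place and count_conflicts, since their details are described elsewhere, and the function name choose_best_layout is invented for the example.

```python
def choose_best_layout(candidates, place, count_conflicts):
    """Return (layout, conflicts) for the candidate with the fewest conflicts.

    candidates: iterable of call count replacement data (one entry per permutation).
    place: hypothetical callable that turns one candidate into a temporary layout.
    count_conflicts: hypothetical callable that scores a layout against the
        function basic block transition information.
    """
    best_layout, best_conflicts = None, None
    for candidate in candidates:
        layout = place(candidate)              # temporary placement
        conflicts = count_conflicts(layout)    # conflict count for this placement
        if best_conflicts is None or conflicts < best_conflicts:
            best_layout, best_conflicts = layout, conflicts
    return best_layout, best_conflicts


# Toy usage with stand-in callables, only to show the control flow.
toy_candidates = [[1], [2], [3]]
toy_result = choose_best_layout(
    toy_candidates,
    place=lambda counts: tuple(counts),
    count_conflicts=lambda layout: abs(layout[0] - 2),
)
```

The search is exhaustive over the candidates it is given, so its cost grows with the number of call count permutations that are generated.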
[0058]
The details of the operation of the function memory arrangement optimizing unit 3 will now be described with reference to FIG. 5. First, the call count replacement processing step S1 is performed with reference to the function call combination information 111. This process is given, as its argument, the number of combinations with a large number of calls. The criterion for "many" and "few" is the same as that used for dividing the function call graph in the prior-art function memory arrangement optimizing unit. In this example, in the function call combination information 111, the combinations func-funcA, func-funcB, and func-funcC are judged to have a large number of calls, and the other combinations a small number. There are therefore three combinations with a large number of calls, and 3 is passed as the argument. The call counts to be exchanged from here on are those of these three combinations.
[0059]
The detailed operation of the call count replacement processing step S1 will now be described with reference to FIG. 6, which shows the details of step S1 as a flowchart, and FIG. 7, which shows an example of the call count replacement information 113. In step S101, the argument 3 is read. Next, in step S102, the argument 3 is set to the variable n. Next, the condition "is n equal to 0" is evaluated in step S103. Since the variable n is not 0, the process branches to step S105, where the call count replacement process is invoked recursively with the argument n-1, that is, 2.
[0060]
Next, the argument 2 is read in step S101, and 2 is set to the variable n in step S102. The condition in step S103 is evaluated, and since the variable n is not 0, the process branches to step S105, where the call count replacement process is again invoked recursively with the argument n-1, that is, 1.
[0061]
Next, the argument 1 is read in step S101, and 1 is set to the variable n in step S102. The condition in step S103 is evaluated, and since the variable n is not 0, the process branches to step S105, where the call count replacement process is invoked recursively once more with the argument n-1, that is, 0.
[0062]
Next, the argument 0 is read in step S101, and 0 is set to the variable n in step S102. The condition in step S103 is evaluated, and since the variable n is 0, the process proceeds to step S104, where the call count replacement information 113 with its current contents is output. Since no call counts have been exchanged so far, the call count replacement information 113 shown in FIG. 7A, which has the same call counts as the function call combination information 111, is output, and the process for argument 0 ends.
[0063]
Next, returning to the process for argument 1, the initial value 0 is set to the variable i in step S106, and the condition "i < (n-1)" is evaluated in step S107. Since n-1 is 0, the variable i is not smaller than (n-1), and the process for argument 1 ends.
[0064]
Next, returning to the process for argument 2, the initial value 0 is set to the variable i in step S106, and the condition in step S107 is evaluated. Since n-1 is 1, the variable i is smaller than (n-1), and the process proceeds to step S108. In step S108, the elements at indexes n-1 and i, that is, their call counts, are exchanged. In this example, the call count of func-funcA at index 0 (i) is exchanged with the call count of func-funcB at index 1 (n-1). Then, in step S109, the process is invoked recursively again with the argument 1.
[0065]
Next, the argument 1 is read in step S101, and 1 is set to the variable n in step S102. The condition in step S103 is evaluated, and since the variable n is not 0, the process branches to step S105, where the call count replacement process is invoked recursively with the argument 0.
[0066]
Next, the argument 0 is read in step S101, and 0 is set to the variable n in step S102. The condition in step S103 is evaluated, and since the variable n is 0, the process proceeds to step S104, where the call count replacement information 113 shown in FIG. 7B is output, and the process for argument 0 ends.
[0067]
Next, returning to the process for argument 1, the initial value 0 is set to the variable i in step S106, and the condition "i < (n-1)" is evaluated in step S107. Since n-1 is 0, the variable i is not smaller than (n-1), and the process for argument 1 ends.
[0068]
Next, returning to the process for argument 2, in step S110 the call count 20 of func-funcA at index 0 and the call count 50 of func-funcB at index 1 are exchanged again and thereby restored. Thereafter, in step S110, the variable i is changed from 0 to 1, and the condition determination of step S107 is performed. Since n−1 is 1 and the variable i is not smaller than (n−1), the process for argument 2 ends.
[0069]
Thereafter, the same processing is repeated to output FIGS. 7C, 7D, 7E, and 7F, respectively, and the call frequency replacement processing step S1 is completed.
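The recursion walked through above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `replace_call_counts`, the list-based representation, and the three call count values are hypothetical, and the step labels in the comments follow the description of FIG. 6 rather than the flowchart itself.

```python
def replace_call_counts(counts, n, out):
    """Sketch of the recursive call count replacement (steps S101 to S110)."""
    if n == 0:                       # S103: is the variable n equal to 0?
        out.append(list(counts))     # S104: output the current contents
        return
    replace_call_counts(counts, n - 1, out)                   # S105
    i = 0                                                     # S106
    while i < n - 1:                                          # S107
        counts[i], counts[n - 1] = counts[n - 1], counts[i]   # S108: exchange
        replace_call_counts(counts, n - 1, out)               # S109
        counts[i], counts[n - 1] = counts[n - 1], counts[i]   # S110: restore
        i += 1                       # advance i, then re-test at S107


# Illustrative call counts for func-funcA, func-funcB, func-funcC (assumed values).
call_counts = [50, 20, 30]
variants = []
replace_call_counts(call_counts, len(call_counts), variants)
```

With three combinations the sketch produces six orderings, which is consistent with the six items (A) to (F) mentioned in the description; the first ordering equals the unmodified input, and the input list is left unchanged once the outermost call returns.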
[0070]
Next, returning to FIG. 5, in step S2 the call count replacement information 113 is referred to, and it is recognized that six items of information, (A) to (F), have been output, so processing continues.
[0071]
Thereafter, the same processes as those in the conventional steps P1 to P16 are performed from step S4 to step S19.
[0072]
As described in the prior art, the algorithm of steps S4 to S19 creates a function call graph, assigns the number of calls to each edge (combination of functions) as its weight, and places the functions in the memory space accordingly. This makes it possible to avoid cache conflicts when a function is first placed. For each function, the algorithm records the “used colors” where the function is placed (that is, the cache lines it uses) and the set of “unavailable colors” that the function currently cannot use. Functions are placed so that they do not use their unavailable colors, and a function that has already been placed is moved to another location only on the condition that it does not use its unavailable colors. This eliminates cache conflicts with a function's direct “parent” or “child”.
[0073]
First, in step S4, the function call graph 120 shown in FIG. 12 is created from the function call combination information 111. In step S5, the edges of the created function call graph 120 are divided into those with a high number of calls and those with a low number of calls. Here, the combinations func-funcA, func-funcB, and func-funcC have a “high” number of calls, and main-func, funcA-funcD, and funcB-funcD have a “low” number of calls.
[0074]
In the function call graph 120 created here, a combination of functions is an edge, and nodes at both ends thereof are two functions in the combination.
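The division of edges in step S5, together with the descending sort that step S6 applies to the high group, might be sketched as below. The call count values and the threshold are purely illustrative assumptions; the description does not state how the boundary between “high” and “low” is chosen.

```python
# Edges of the function call graph: (caller, callee) -> number of calls.
# All numbers here are assumed for illustration only.
call_graph = {
    ("main", "func"): 1,
    ("func", "funcA"): 50,
    ("func", "funcB"): 20,
    ("func", "funcC"): 30,
    ("funcA", "funcD"): 2,
    ("funcB", "funcD"): 2,
}

THRESHOLD = 10  # hypothetical boundary between "high" and "low"

# S5: split the edges into a high group and a low group.
high_edges = [edge for edge, calls in call_graph.items() if calls >= THRESHOLD]
low_edges = [edge for edge, calls in call_graph.items() if calls < THRESHOLD]

# S6: sort the high group in descending order of the number of calls.
high_edges.sort(key=lambda edge: call_graph[edge], reverse=True)
```

Under these assumed values the high group is processed in the order func-funcA, func-funcC, func-funcB.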
[0075]
It is assumed that the number of cache lines occupied by each function in FIG. 3, that is, the number of “colors”, is two for func and one for each of funcA, funcB, funcC, and main.
[0076]
Next, in step S6, the edges in the group with a high number of calls are sorted in descending order of the number of calls, and the processing from step S7 onward is performed in that order. In step S7, it is checked whether any edge with a high number of calls remains. Since one remains, the process proceeds to step S8, where it is checked whether the nodes at both ends of the edge func-funcA are both unplaced. Since both are unplaced, the process proceeds to step S9, where func and funcA are placed adjacent to each other at an arbitrary location in the memory space. In step S18, the “colors” that func and funcA cannot use are recorded as their unusable sets, and the process returns to step S7. The edge func-funcA, whose nodes have been placed adjacent to each other, is treated from now on as a single composite node. At this point, the relationship between the functions already placed and the cache lines and the state of the unusable set of each function are as shown in FIG.
[0077]
Subsequently, since edges with a high number of calls still remain, the process proceeds to step S8, where it is checked whether the nodes at both ends of the edge func-funcC are both unplaced. Since func has already been placed, the process proceeds to step S9, where it is checked whether the edge connects nodes belonging to two different composite nodes. Since funcC does not belong to a composite node, the process proceeds to step S10, where it is checked whether one node belongs to a composite node and the other node is unplaced. Since this condition is met, the process proceeds to step S14.
[0078]
In step S14, the unplaced funcC is placed at a location close to func. In step S16, it is checked whether the unusable set has left a gap at the time of function placement. Since there is no gap, the “unusable colors” that funcC cannot use are recorded as its unusable set, and the process returns to step S7.
[0079]
At this time, the relationship between the functions already arranged and the cache line and the state of the unusable set of each function are as shown in FIG.
[0080]
In step S7, since unplaced edges still remain, the process proceeds to step S8, where it is checked whether the nodes at both ends of the edge func-funcB are both unplaced. Since func has already been placed, the process proceeds to step S9, where it is checked whether the edge connects nodes belonging to two different composite nodes. Since funcB does not belong to a composite node, the process proceeds to step S10, where it is checked whether one node belongs to a composite node and the other node is unplaced. Since this condition is met, the process proceeds to step S14.
[0081]
In step S14, since the distance from the center of func, which is paired with the unplaced funcB, to either end of the composite node is the same, funcB is arbitrarily placed on the left side. In step S16, it is checked whether the unusable set has left a gap at the time of function placement. Since there is no gap, the process proceeds to step S18, where the “unusable colors” that func and funcB cannot use are recorded as their unusable sets, and the process returns to step S7.
[0082]
When it is confirmed in step S7 that there are no unallocated edges, the process proceeds to step S46, where the unallocated nodes main and funcD are allocated in an arbitrary cache line.
[0083]
As the result of the processing of steps S4 to S19 above, for the call count replacement information 113 shown in FIG. 7A, the function memory arrangement result 4 of the present embodiment shown in FIG. 9A is obtained, which is the same as the conventional result of FIG. 13C.
[0084]
Referring to FIG. 9, which shows the relationship between the function arrangement and the cache lines and the state of the unusable set of each function, that is, the function memory arrangement result 4 of the present embodiment, FIGS. 9A to 9F correspond respectively to the call count replacement information 113 (A) to (F) shown in FIG. 7.
[0085]
In the example of the present embodiment, of the six functions to be processed (main, func, funcA, funcB, funcC, and funcD), colors corresponding to the unusable set, that is, unavailable colors, are assigned to the four functions func, funcA, funcB, and funcC, which are subject to call count replacement. In this example, blue and yellow are assigned to the function func, and red and green are assigned to the functions funcA, funcB, and funcC, as unavailable colors.
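As a rough illustration of how the unavailable colors constrain placement, the following sketch checks a candidate placement against the unavailable sets given in this example. The helper `violations` and the candidate placements are hypothetical; in particular, the colors used by func and funcC are assumed for illustration and are not taken from the figures.

```python
# Unavailable colors as stated in the example above.
UNAVAILABLE = {
    "func": {"blue", "yellow"},
    "funcA": {"red", "green"},
    "funcB": {"red", "green"},
    "funcC": {"red", "green"},
}


def violations(placement):
    """Return the functions whose used colors intersect their unavailable set."""
    return sorted(
        name
        for name, used in placement.items()
        if used & UNAVAILABLE.get(name, set())
    )


# Assumed placement: func occupies two cache lines, the others one each.
placement_ok = {
    "func": {"red", "green"},
    "funcA": {"blue"},
    "funcB": {"blue"},
    "funcC": {"yellow"},
}
# Moving funcA onto "red" would use one of its unavailable colors.
placement_bad = dict(placement_ok, funcA={"red"})
```

Note that `placement_ok` passes the check even though funcA and funcB share “blue”, since neither function has “blue” in its unavailable set.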
[0086]
Next, the process proceeds to step S20, and cache conflict frequency calculation processing is performed with reference to the function basic block transition information 112.
[0087]
In the following description, it is assumed that the function IDs of the functions main, func, funcA, funcB, funcC, and funcD are set to 0, 1, 2, 3, 4, and 5 as described above.
[0088]
The detailed operation of the cache conflict count calculation process will be described with reference to FIG. 8, which shows the details of the cache conflict count calculation step S20 as a flowchart. First, in step S201, the variable for counting the number of conflicts is initialized to zero.
[0089]
Next, in function basic block transition information reading step S202, the function basic block transition information 112 is sequentially read to obtain function ID and basic block order information (hereinafter referred to as ID order information) 0-1, that is, main-1. Next, in the condition determination “function basic block transition information is completed” in step S203, it is determined that the function basic block transition information is not yet completed, and the process proceeds to step S204. In step S204, the arrangement of the ID order information 0-1 on the cache is obtained, and the color “yellow” of the cache line corresponding to the function “main” is obtained with reference to FIG. Next, since the ID order information 0-1 is the top function basic block transition information in the condition determination “whether it is the top function basic block transition information” in step S205, the process returns to step S202.
[0090]
Next, in step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 1-1, that is, func-1. In step S203, it is determined that the function basic block transition information has not yet ended, and the process proceeds to step S204, where the placement of the ID order information 1-1 on the cache is looked up with reference to FIG., and the cache line color “red” corresponding to the function func is obtained. In the condition determination of step S205, the ID order information 1-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, “whether there is an address overlap with the previous block”, nothing has previously been placed in “red”, so the process returns to step S202.
[0091]
Next, in step S202 again, the function basic block transition information 112 is read sequentially to obtain the next ID order information 2-1, that is, funcA-1. Thereafter, in the same manner as above, it is determined in step S203 that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 2-1 on the cache is looked up with reference to FIG. 9A, and the cache line “blue” corresponding to the function funcA is obtained. In the condition determination of step S205, the ID order information 2-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, nothing has previously been placed in “blue”, so the process returns to step S202.
[0092]
Next, in step S202, the function basic block transition information 112 is read sequentially to obtain the next ID order information 1-1, that is, func-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 1-1 on the cache is looked up with reference to FIG. 9A, and the cache line “red” corresponding to the function func is obtained. In the condition determination of step S205, the ID order information 1-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, the previous block placed in “red” is 1-1, the same as the current one, so the process returns to step S202.
[0093]
Next, in step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 3-1, that is, funcB-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 3-1 on the cache is looked up with reference to FIG. 9A, and the cache line “blue” corresponding to the function funcB is obtained. In the condition determination of step S205, the ID order information 3-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, the previous block placed in “blue” is the ID order information 2-1, which differs from the current one, so the process proceeds to step S207. In step S207, the variable for counting the number of conflicts (hereinafter, the number of conflicts) is incremented by 1, and the process returns to step S202.
[0094]
In step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 5-1, that is, funcD-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 5-1 on the cache is looked up with reference to FIG. 9A, and the cache line “green” corresponding to the function funcD is obtained. In the condition determination of step S205, the ID order information 5-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, nothing has previously been placed in “green”, so the process returns to step S202.
[0095]
Next, in step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 2-1, that is, funcA-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 2-1 on the cache is looked up with reference to FIG. 9A, and the cache line “blue” corresponding to the function funcA is obtained. In the condition determination of step S205, the ID order information 2-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, the previous block placed in “blue” is the ID order information 3-1, which differs from the current one, so the process proceeds to step S207. In step S207, the number of conflicts is incremented by 1, and the process returns to step S202.
[0096]
In step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 5-1, that is, funcD-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 5-1 on the cache is looked up with reference to FIG. 9A, and the cache line “green” corresponding to the function funcD is obtained. In the condition determination of step S205, the ID order information 5-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, the previous block placed in “green” is the ID order information 5-1, the same as the current one, so the process returns to step S202.
[0097]
Next, in step S202, the function basic block transition information 112 is read sequentially to obtain ID order information 3-1, that is, funcB-1. In step S203, it is determined that the function basic block transition information has not yet ended, and in step S204 the placement of the ID order information 3-1 on the cache is looked up with reference to FIG. 9A, and the cache line “blue” corresponding to the function funcB is obtained. In the condition determination of step S205, the ID order information 3-1 is not the head function basic block transition information, so the process proceeds to step S206. In the condition determination of step S206, the previous block placed in “blue” is the ID order information 2-1, which differs from the current one, so the process proceeds to step S207. In step S207, the number of conflicts is incremented by 1, and the process returns to step S202.
[0098]
Thereafter, when the same processing is repeated until the end of the function basic block transition information 112, the number of conflicts in the cache line “blue” (hereinafter abbreviated as “blue”, and likewise for the other lines) is 79, the number of conflicts in “yellow” is 2, the number of conflicts in “green” is 1, and the number of conflicts in “red” is 0, so a total of 82 conflicts is calculated.
[0099]
Thus, the cache conflict frequency calculation processing step S20 is completed.
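The counting loop of steps S201 to S207 can be sketched as follows. This is a simplified model under several assumptions: each function is mapped to a single cache line color, the transition list contains only the first nine entries traced above rather than the full function basic block transition information 112, and the second layout, in which only funcA is moved to “yellow”, is a hypothetical variation rather than a reproduction of any figure.

```python
def count_conflicts(transitions, line_of):
    """Count cache conflicts for a trace of (function ID, block order) pairs."""
    conflicts = 0                        # S201: initialize the counter
    last_in_line = {}                    # most recent block seen in each cache line
    for index, block in enumerate(transitions):      # S202 / S203
        color = line_of[block[0]]                    # S204: placement on the cache
        if index > 0:                                # S205: the head entry never conflicts
            previous = last_in_line.get(color)
            if previous is not None and previous != block:   # S206
                conflicts += 1                               # S207
        last_in_line[color] = block
    return conflicts


# Function IDs: main=0, func=1, funcA=2, funcB=3, funcC=4, funcD=5.
trace = [(0, 1), (1, 1), (2, 1), (1, 1), (3, 1), (5, 1), (2, 1), (5, 1), (3, 1)]

# Colors as read in the walkthrough: funcA and funcB share "blue".
layout_shared = {0: "yellow", 1: "red", 2: "blue", 3: "blue", 5: "green"}
# Hypothetical variation in which funcA is moved to "yellow".
layout_split = dict(layout_shared)
layout_split[2] = "yellow"

shared_count = count_conflicts(trace, layout_shared)
split_count = count_conflicts(trace, layout_split)
```

For this nine-entry prefix the shared layout yields three conflicts, all in “blue”, matching the three increments traced above, while the split layout yields a single conflict between main and funcA in “yellow”.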
[0100]
Next, returning to step S2, the same processing is performed thereafter. For each item of the call count replacement information 113 shown in FIGS. 7A to 7F, the relationship between the function arrangement and the cache lines and the state of the unusable sets (hereinafter referred to as the cache memory arrangement) are as shown in FIGS. 9A to 9F.
[0101]
For example, in the case of the cache memory arrangement of FIG. 9B, it is calculated that the same 82 times of conflicts as in the case of (A) occur.
[0102]
In the case of FIGS. 9C and 9F, the number of conflicts in “blue” is 1, the number of conflicts in “yellow” is 2, the number of conflicts in “green” is 1, and the number of conflicts in “red” is 0, and it is calculated that a total of 4 conflicts occur.
[0103]
In the case of FIGS. 9D and 9E, the number of conflicts in “blue” is 79, the number of conflicts in “yellow” is 2, the number of conflicts in “green” is 1, and the number of conflicts in “red” is 0, and it is calculated that a total of 82 conflicts occur.
[0104]
Therefore, it can be seen that the number of cache conflicts is smallest in the case of the cache memory arrangements of FIGS. 9C and 9F.
[0105]
In step S3, the cache memory layout of the function is finally determined to be the layout of FIG. 9C, which is one of them.
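The selection in step S3 amounts to taking the arrangement with the smallest conflict total. The sketch below reuses the per-color figures quoted above; the dictionary layout and variable names are assumptions made for illustration, and for FIG. 9B only the total is given in the description.

```python
# Per-color conflict counts quoted in the description.
many = {"blue": 79, "yellow": 2, "green": 1, "red": 0}   # FIGS. 9A, 9D, 9E
few = {"blue": 1, "yellow": 2, "green": 1, "red": 0}     # FIGS. 9C, 9F

totals = {
    "9A": sum(many.values()),
    "9B": 82,                  # only the total is stated for FIG. 9B
    "9C": sum(few.values()),
    "9D": sum(many.values()),
    "9E": sum(many.values()),
    "9F": sum(few.values()),
}

# min() returns the first key with the smallest total, so ties go to the
# arrangement listed earlier.
best = min(totals, key=totals.get)
```

Because FIGS. 9C and 9F tie at four conflicts, either could be chosen; breaking the tie in favor of the earlier arrangement happens to agree with the choice of FIG. 9C stated above, although the description does not say how the tie is resolved.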
[0106]
In the first conventional technique, funcA and funcB share the same cache line “blue”, and neither function's unavailable set includes “blue”. Because funcA and funcB have no direct calling relationship, function placement based on the function call combination information may put them on the same cache line, and depending on the application program, cache conflicts between these two functions could not necessarily be reduced. In the present embodiment, by contrast, as shown in FIG. 9C, funcA is placed in “yellow” and funcB is placed in “blue”. Therefore, even if funcA and funcB repeatedly transition between each other, no cache conflict occurs, and the execution speed of the application program can be improved.
[0107]
[Effect of the invention]
As described above, the function allocation optimization apparatus for an instruction cache, the function allocation optimization method, and the recording medium recording the function allocation optimization procedure according to the present invention comprise: a function call information output unit; a function basic block transition information output unit that, for the function transitions caused by function calls, outputs function basic block transition information in which combinations of the function ID and the basic block order of the function are arranged for each function transition; and a function memory arrangement optimization unit that refers to the function call combination information, generates call count replacement information consisting of call count replacement data in which call counts are exchanged between function call combinations, temporarily places the functions at addresses in the memory space with reference to the generated call count replacement information, then detects the number of cache conflicts by referring to the function basic block transition information, determines the memory arrangement of the functions to be the one with the smallest number of cache conflicts among the call count replacement data, and outputs the corresponding function memory arrangement result. The invention therefore has the following effects.
[0108]
The first effect is that, (1) even when a plurality of functions are called consecutively within a certain function, or (3) when they are called in a loop, the functions are placed in the memory space so as to reduce cache conflicts, and therefore the execution speed of the application program can be improved.
[0109]
The reason is as follows. The call counts in the function call information are changed so as to change the weight of each item of information, focusing on combinations with a large number of calls, such as functions called consecutively or in a loop. Function placement processing similar to the conventional one is then performed as a temporary placement, and among the variations the one with the smallest number of cache conflicts is determined as the final memory arrangement of the functions. The processing therefore does not depend on the complexity of the application program, such as a plurality of functions being called consecutively or in a loop.
[0110]
The second effect is that, (3) even when a function called in a loop calls another function, the functions are placed in the memory space so as to reduce cache conflicts, and therefore the execution speed of the program can be improved.
[0111]
The reason is that the call counts in the function call information are changed so as to change the weight of each item of information, the conventional function placement processing is then performed as a temporary placement, and the final memory arrangement of the functions is determined afterward. The processing therefore does not depend on the complexity of the application program, such as a function called in a loop calling another function.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an embodiment of a function allocation optimizing device for an instruction cache and a processing procedure thereof according to the present invention.
FIG. 2 is a diagram illustrating an example of function call combination information according to the present embodiment.
FIG. 3 is a diagram illustrating an example of an application program.
FIG. 4 is a diagram illustrating an example of function basic block transition information according to the present embodiment.
FIG. 5 is a flowchart showing an example of a function allocation optimization method, which is the operation of the function allocation optimization apparatus for an instruction cache according to the present embodiment.
FIG. 6 is a flowchart showing the detailed processing of the call count replacement processing step in FIG. 5.
FIG. 7 is a diagram illustrating an example of call count replacement information according to the present embodiment.
FIG. 8 is a flowchart showing the detailed processing of the cache conflict count calculation processing step in FIG. 5.
FIG. 9 is a diagram illustrating a result of function memory arrangement according to the present embodiment.
FIG. 10 is a block diagram illustrating an example of a conventional function allocation optimization apparatus for an instruction cache and a processing procedure thereof.
FIG. 11 is a flowchart showing an example of a function allocation optimizing method that is an operation of a function allocation optimizing apparatus for a conventional instruction cache.
FIG. 12 is a diagram for explaining the configuration of a conventional function call graph.
FIG. 13 is a diagram illustrating an example of a relationship between a conventional function arrangement and a cache line, an unusable set state of each function, and a function memory arrangement result.
[Explanation of symbols]
1 Function call information output section
2 Function basic block transition information output part
3,103 Function memory allocation optimization unit
4,104 Function memory allocation result
110 Application program
111 Function call combination information
112 Function basic block transition information
113 Number of calls replacement information
120 Function call graph

Claims

A function allocation optimization apparatus for an instruction cache, which receives a predetermined application program for a microprocessor system equipped with an instruction cache and optimizes the memory arrangement of functions based on function call count information so as to reduce the number of cache conflicts in the instruction cache, the apparatus comprising:
a function call information output unit that receives the application program and, by profiling, outputs as function call combination information, for each function call, the caller and callee functions as a function call combination together with the number of calls;
a function basic block transition information output unit that receives the application program and, by profiling, outputs function basic block transition information in which, for the function transitions caused by function calls, a combination of the function ID and the basic block order of the function is arranged for each function transition; and
a function memory arrangement optimization unit that refers to the function call combination information, generates call count replacement information consisting of call count replacement data in which the number of calls is exchanged between the function call combinations, temporarily places the functions at addresses in the memory space with reference to the generated call count replacement information, then detects the number of cache conflicts by referring to the function basic block transition information, determines the memory arrangement of the functions to be the one with the smallest number of cache conflicts among the call count replacement data, and outputs the corresponding function memory arrangement result.

Each of the function call combination information and the call count replacement information comprises a caller field describing the function name of the caller of the function;
a callee field describing the function name of the callee of the function; and
2. The function allocation optimization apparatus for an instruction cache according to claim 1, further comprising a call count field in which the number of calls of the function is set.

A function allocation optimization method for an instruction cache, which receives an application program for a microprocessor system equipped with an instruction cache and optimizes the memory arrangement of functions based on function call count information so as to probabilistically reduce the number of cache conflicts in the instruction cache, the method comprising:
receiving the application program and generating, by profiling, function call combination information in which, for each function call, the caller and callee functions and the number of calls form a function call combination, and receiving the application program and generating function basic block transition information, obtained by profiling, concerning execution of the functions in basic block units; and thereafter
a function memory arrangement optimization step of generating, with reference to the function call combination information, call count replacement information consisting of call count replacement data in which the number of calls is exchanged between the function call combinations, detecting the number of cache conflicts of each function with reference to the function basic block transition information, and determining the memory arrangement of the functions to be the one with the smallest number of cache conflicts among the call count replacement data.

The function memory arrangement optimization step comprises a call count replacement processing step of generating the call count replacement information with reference to the function call combination information;
a function memory temporary placement step of temporarily placing the functions at addresses in the memory space with reference to the generated call count replacement information;
a cache conflict count calculation processing step of detecting the number of cache conflicts of each temporarily placed function with reference to the function basic block transition information; and
4. The function allocation optimization method for an instruction cache according to claim 3, further comprising a function memory arrangement step of determining the memory arrangement of the functions to be that of the call count replacement data having the smallest number of cache conflicts.

A first step of reading an argument, which is the number of the function call combinations subject to call count replacement, with reference to the function call combination information, and setting it to a first variable;
A second step of determining whether the first variable is 0;
A third step of outputting the call count replacement information with its current contents when the first variable is 0 in the second step;
A fourth step of, when the first variable is other than 0 in the second step, recursively calling the call count replacement process with an argument set to the first variable minus 1;
A fifth step of setting a second variable to 0;
A sixth step of determining whether the second variable is smaller than the first variable minus 1, and ending the process if it is not;
A seventh step of, if the determination in the sixth step is affirmative, exchanging the call counts at a first index, which is the first variable minus 1, and a second index, which is the second variable;
An eighth step of recursively calling the call count replacement process with an argument set to the first variable minus 1;
A ninth step of exchanging the call counts at the second index, which is the second variable, and the first index, which is the first variable minus 1;
5. The method according to claim 4, further comprising a tenth step of incrementing the second variable by 1 and repeating the sixth and subsequent steps.

The cache conflict count calculation processing step comprises a first step of initializing a variable for counting the number of cache conflicts to 0;
A second step of sequentially reading the function basic block transition information and obtaining ID order information, which is the function ID and basic block order information;
A third step of determining whether the function basic block transition information has ended;
A fourth step of, if the determination in the third step is negative, obtaining the placement of the ID order information on the cache;
A fifth step of determining whether the information is the head function basic block transition information, and returning to the second step if it is;
A sixth step of, if the determination in the fifth step is negative, determining whether there is an address overlap with the previous block, and returning to the second step if there is not;
5. The function allocation optimization method for an instruction cache according to claim 4, further comprising a seventh step of, if the determination in the sixth step is affirmative, incrementing the variable for counting the number of cache conflicts by 1 and returning to the second step.

A computer-readable recording medium storing a program for causing a computer to execute function allocation optimization processing for an instruction cache, the processing receiving an application program for a microprocessor system equipped with an instruction cache and optimizing the memory arrangement of functions based on function call count information so as to probabilistically reduce the number of cache conflicts in the instruction cache, the program causing the computer to execute:
a step of receiving the application program and generating, by profiling, function call combination information in which, for each function call, the caller and callee functions and the number of calls form a function call combination;
a step of receiving the application program and generating function basic block transition information, obtained by profiling, concerning execution of the functions in basic block units; and
a function memory arrangement optimization step of generating, with reference to the function call combination information, call count replacement information consisting of call count replacement data in which the number of calls is exchanged between the function call combinations, detecting the number of cache conflicts of each function with reference to the function basic block transition information, and determining the memory arrangement of the functions to be the one with the smallest number of cache conflicts among the call count replacement data.