JP3755804B2

JP3755804B2 - Object code resynthesis method and generation method

Info

Publication number: JP3755804B2
Application number: JP2000207579A
Authority: JP
Inventors: 健三小西; 和治伊達
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-07-07
Filing date: 2000-07-07
Publication date: 2006-03-15
Anticipated expiration: 2020-07-07
Also published as: JP2002024031A

Description

【０００１】
【発明の属する技術分野】
本発明は、情報機器等においてメモリアクセスを高速に行うために用いられるキャッシュメモリに対して、その動作性能を引き出すために最適なオブジェクトコードを得ることができるオブジェクトコードの再合成方法および生成方法に関する。
【０００２】
【従来の技術】
従来、情報機器においては、図５に示すように、高級言語用のコンパイラ、またはマシンに依存するアセンブリ言語用のアセンブラを用いて、ソースプログラムからＣＰＵがそのプログラムを実行するためのオブジェクトコードを生成している。この際に、従来技術では、単にオブジェクトコードサイズが小さくなるように、またはＣＰＵの処理速度が向上するように、或いはコンパイルやアセンブルに必要な処理時間が短くなるように、コンパイル処理やアセンブル処理が行われていた。
【０００３】
しかしながら、今日のように、携帯情報機器においても大きな主記憶装置（メインメモリとも称する）やキャッシュメモリが搭載されている時代では、オブジェクトコードサイズが小さいだけでは不十分であり、ハードウェアシステムの構成についても考慮して、高速および高効率にＣＰＵ処理を実行することが必要である。
【０００４】
このためには、主記憶装置よりも高速なキャッシュメモリ上に、いかにして必要なプログラム中の命令やデータを記憶させておくか（キャッシュヒット率を向上させるか）、ということが重要となる。必要な命令やデータが、それらが必要とされるタイミングでキャッシュメモリに記憶されていない場合、ＣＰＵは速度が遅い主記憶装置に対してアクセスを行うことになり、ＣＰＵに待ち時間が必要となる。その結果、ＣＰＵの処理速度が低下してしまい、キャッシュメモリの存在意義が無くなるからである。また、外部メモリ（主記憶装置を含む）へのアクセスが多くなると、処理の高速化という観点から影響が生じるだけではなく、データバスの状態が遷移する確率が増加したり、主記憶装置へのアクセスの際に消費電力が増加するという問題もあった。
【０００５】
以下に、キャッシュメモリを搭載したシステムについて、図６を用いて簡単に説明する。キャッシュメモリ２は、ＣＰＵ１の処理速度（高速）と主記憶装置３へのアクセススピード（低速）の差を緩和するために用いられる緩衝メモリの一種であり、命令やデータを記憶するキャッシュメモリ部と、その動作制御を行うキャッシュメモリ制御回路とからなる。キャッシュメモリ２は、通常、主記憶装置３よりも動作が高速であり、ＣＰＵからのアクセスに高速応答することが可能なメモリである。キャッシュメモリ２は、ＣＰＵ１の周辺に設けても内部に設けてもよい。
【０００６】
このシステムは以下のように動作する。コンパイラやアセンブラが生成したオブジェクトコード（プログラム中の命令やデータ）は予め主記憶装置３に記憶されている。キャッシュメモリ２にデータが記憶されていない初期状態では、任意のアドレスをＣＰＵ１がアクセスした場合、キャッシュメモリ２上にデータが存在しないため、低速な主記憶装置３に対して１回目のアクセスが行われる。このとき同時に、キャッシュメモリ２は、ＣＰＵ１によりアクセスされた主記憶装置３のアドレスと主記憶装置３から出力されたデータ（命令やデータ）とをキャッシュメモリ部に記憶する。キャッシュメモリ２は、ＣＰＵ１が実行する主記憶装置３へのアクセスを常に監視しており、２回目以降に同一アドレス（キャッシュメモリ部に記憶されたアドレス）へのアクセスが行われると、キャッシュメモリ制御回路は主記憶装置３への制御信号の出力を禁止して低速な主記憶装置へのアクセスを禁止する。そして、高速なキャッシュメモリ２が該当するアドレスに対応するデータをＣＰＵ１に渡すことにより、ＣＰＵ１の主記憶装置３へのアクセス時における待ち時間を減らして処理速度を向上させることができる。
【０００７】
従って、小容量であっても高速なキャッシュメモリを設けることにより、ＣＰＵが必要とするデータがキャッシュメモリ上に存在すれば、ＣＰＵは低速な主記憶装置にアクセスすることなく、高速に命令やデータを読み出したり、書き込んだりすることが可能である。
【０００８】
最も高速なシステムを実現しようとした場合には、低速な主記憶装置と高速なキャッシュメモリを同じサイズ（記憶容量）にすればよいが、これは、キャッシュメモリが高価であることからコスト面で現実的ではない。よって、通常は、主記憶装置に比べると小さなサイズのキャッシュメモリがシステムに搭載される。
【０００９】
このようにキャッシュメモリが主記憶装置に比べて小さいサイズであるため、主記憶装置中に記憶されているデータのうちの一部のデータのみがキャッシュメモリに記憶されることになる。ここで、キャッシュメモリの有効利用のためには、これからＣＰＵが使用すると考えられるデータをＣＰＵが実際に使用する前にキャッシュメモリに記憶させ、今後ＣＰＵが使用しないと考えられるデータをキャッシュメモリから消去させるという動作が理想的である。しかし、実際にどのデータをキャッシュメモリに記憶させ、またはキャッシュメモリから消去させるかについては、プログラムの構造やキャッシュメモリ制御回路の制御方法に依存する。
【００１０】
絶対的な基準としては、（１）読み書きされたデータは近いうちに再び使用される可能性が高い（時間的局所性）および（２）読み書きされたデータの近くにあるデータは近いうちに使用される可能性が高い（空間的局所性）という２つの基準があると一般的に言われている。しかし、最近のキャッシュメモリ制御回路はこれらの経験則に基づいて設計されているにも関わらず、実際にはキャッシュメモリを設けたことによる恩恵を受けられない場合が増加してきている。
【００１１】
その最大の原因としては、プログラム開発言語の変化が挙げられる。アセンブラを用いてアセンブリ言語でプログラムを開発していたころは、アセンブリ言語がオブジェクトコードとほぼ１対１で生成される特徴を活かして、必要な命令やデータがなるべくキャッシュメモリに記憶されるように意識してプログラミングすることが可能である。よって、必要とされる命令やデータがキャッシュメモリ上に存在している確率（キャッシュヒット率）についても、ある程度は高くすることが可能であった。
【００１２】
しかし、近年では、コンパイラを用いて高級言語によりハードウェアを意識しなくても良いようなプログラミングが行われ、キャッシュメモリの動作を考えながらプログラミングすることがなくなった。また、コンパイラの生成するオブジェクトコードを、プログラム作成者がコントロールして生成すること自体が不可能になってきた。
【００１３】
さらに、コンパイラのオプティマイズ（最適化）処理についても、従来ではオブジェクトコードのコードサイズを小さくするという処理が主として行われている。このため、従来では、キャッシュメモリを設けたことによる恩恵を受け易いようなオブジェクトコードを生成可能なコンパイル技術は、システムに実装されていない。これは、コンパイラがシステム毎に異なるハードウェア構成に強く依存するため、全てのシステムに対応可能なコンパイラを作成することが困難であることも一つの原因である。
【００１４】
そこで、最近では、これらのことも考慮したコンパイル技術が報告されてきている。例えば、特開平５−１２００２９号公報では、プログラム中のループ部分について最適化されたオブジェクトコードを生成するコンパイル処理について提案されている。また、特開平１１−９６０１５号公報では、プロファイルデータを事前に生成しておき、そのプロファイルデータに基づいてコンパイル処理を行うことが提案されている。さらに、特開平５−３２４２８１号公報では、複数のサブルーチン（一つの処理を行うためのプログラムの塊）がキャッシュメモリに記憶され、各サブルーチンが繰り返し実行される場合に、それらのサブルーチンがキャッシュメモリ上の同一アドレスにロードされないように、アドレスの再割り当て（再マッピング）を行う処理について言及されている。
【００１５】
【発明が解決しようとする課題】
キャッシュメモリは、図６に示したように、近年の高速なＣＰＵと低速な主記憶装置との緩衝装置として動作する。一般に、キャッシュメモリの大きさは主記憶装置と比べて相当に小さく、必要な（必要そうな）データのみをキャッシュメモリに記憶させておくことになる。必要なデータがキャッシュメモリに記憶されていれば、ＣＰＵは主記憶装置へのアクセスは行わず、高速なキャッシュメモリがアクセスされるので高速処理が可能である。しかし、キャッシュメモリに必要なデータが記憶されていない場合には、主記憶装置にアクセスを行って必要なデータをキャッシュメモリ上に転送する必要がある。このとき、既にキャッシュメモリの全領域にデータが記憶されていると、不要なデータを主記憶装置に転送（コピー）し、その後でそのキャッシュメモリ領域に新しいデータを主記憶装置から転送する処理が必要になる。
【００１６】
このように、キャッシュメモリ上に必要なデータ（命令やデータ）が記憶されていない場合のペナルティは非常に大きなものである。このため、単にキャッシュメモリを搭載するのみではなく、キャッシュメモリの使用や動作を考慮したオブジェクトコードを生成することが必要となる。
【００１７】
それにも関わらず、現状のコンパイラやアセンブラでは、コンパクトなオブジェクトコードを高速に生成することに注力されており、また、キャッシュメモリの仕様に合ったオブジェクトコードを生成するという観点からの処理は殆ど行われていない。また、データ転送が多くなると、処理の高速化という観点から影響が生じるだけではなく、データバスの状態が遷移する確率が増加したり、主記憶装置へのアクセスの際に消費電力が増加するという問題もあった。さらに、上述した各公報の技術には、以下のような問題があった。
【００１８】
上述した特開平５−１２００２９号公報に提案されている技術は、キャッシュメモリのトータルサイズに注目しているが、実際に重要なのはキャッシュメモリのトータルサイズではなく、キャッシュメモリがどのような構造を有し、どのように動作を行うかである。つまり、キャッシュミス（必要な命令やデータがキャッシュメモリ上に存在しない）状態のときに、主記憶装置からキャッシュメモリに命令やデータがロード／ストアされるが、この従来技術では、そのときに何ワードサイズ単位で命令やデータがロード／ストアされるのかについて考慮されていない。
【００１９】
プログラム中にループが複数あり、かつ、全てのループがキャッシュメモリと同じサイズである場合には、ＣＰＵが実際にプログラムを実行したときに各ループで全てのオブジェクトが実行されることは少なく、ループ中の条件分岐によって別のループへ分岐（ジャンプ）することが殆どである。つまり、ループの全てが実行されずに各ループの一部ずつが実行される。このような場合には、上記提案の技術によってループサイズをキャッシュメモリのサイズに合わせてオブジェクトコードを生成しても、実際のキャッシュメモリのロード／ストア動作が考慮されていないために意味のないものになる。
【００２０】
キャッシュメモリ制御回路は、ＣＰＵがアクセスを行ったアドレスに対応する命令やデータがキャッシュメモリ上に存在しなければ、外部メモリ（主記憶装置）からキャッシュメモリ上にデータを転送する。この動作をラインフェッチと称し、そのときに一度に転送されるデータサイズをラインサイズと称する。この際、キャッシュメモリには、ＣＰＵがアクセスしたアドレスに対応するデータ（命令やデータ）だけではなく、そのアドレスの周囲アドレスに対応するデータもまとめて転送される。これは、上記空間的局所性に基づく動作である。通常、ラインフェッチで転送されるデータサイズ（ラインサイズ）は４ワード、８ワード、１６ワード程度である。このため、主記憶装置からキャッシュメモリにデータを転送したり、不要なデータをキャッシュメモリから消去するのも、全てこのラインサイズが基本になる。よって、オブジェクトコード生成時においては、キャッシュメモリ全体のサイズが重要なのではなく、一連のキャッシュメモリへのロード／ストア処理の単位であるラインフェッチのサイズ（ラインサイズ）に、オブジェクトコードの処理単位のサイズを納めるということが重要である。
【００２１】
また、上述した特開平１１−９６０１５号公報に提案されている技術は、組み込みシステムのように命令やデータがほぼ永久的に変化しない場合には有効であるが、今日のようにプログラムがネットワーク経由でダウンロードされたり、状況によってデータに様々なパターンが生じる場合には、プロファイルデータが意味をなさないということが多く、実用的ではない。また、実際にＣＰＵがプログラムを実行する場合、外部から入力される信号のタイミング等の条件によって、実行される命令の順番等が変化するため、万全ではない。
【００２２】
さらに、上述した特開平５−３２４２８１号公報では、従来例（その公報中の図２（ｂ））として各ページの先頭にサブルーチンを配置したものが挙げられているが、このような手法を使用しているコンパイラは少ない。一般的には、その公報中の図１に示すように、ページの先頭から詰めてサブルーチンを配置し、ページを跨ぐ場合にだけサブルーチンを次のページに配置する技術が、この提案よりも前に使用されている。これは、プログラムサイズ（主記憶装置のサイズ）を小さくするために一般的に行われている技術である。
【００２３】
また、この従来技術では、サブルーチンを詰めて配置することが提案され、ダイレクトマップ方式およびセットアソシエイティブ方式という大きな方式での動作が記載されているが、特開平５−１２００２９号公報の提案と同様に、キャッシュメモリがロード／ストアする際のデータサイズが考慮されておらず、キャッシュメモリの方式やサイズによりアドレスを決定するという概念的な記載しかないために、不十分と言える。
【００２４】
本発明は、このような従来技術の課題を解決するためになされたものであり、キャッシュメモリの仕様や動作に適したオブジェクトコードを生成してキャッシュメモリのヒット率を向上し、ＣＰＵの処理速度を向上することができるオブジェクトコードの再合成方法および生成方法を提供することを目的とする。
【００２５】
【課題を解決するための手段】
本発明のオブジェクトコードの再合成方法は、コンパイラ、アセンブラまたはリンカを用いてソースプログラムから生成されたオブジェクトコードに対して、アドレスの再割り当てを行うオブジェクトコードの再合成方法であって、ＣＰＵの周辺または内部に主記憶装置よりも動作速度が速いキャッシュメモリを有し、該ＣＰＵは、アクセスしたい該主記憶装置のアドレスに対応するデータが該キャッシュメモリに記憶されている場合には該キャッシュメモリに対してアクセスを行い、該アドレスに対応するデータが該キャッシュメモリに記憶されていない場合には該主記憶装置に対してアクセスを行うと共に、アクセスされた該主記憶装置のアドレスと該アドレスに対応するデータとを該キャッシュメモリに転送して記憶させる計算処理システムにおいて、必要な命令とデータが必要なタイミングでキャッシュメモリに記憶されている確率を向上させるべく、該主記憶装置から該キャッシュメモリにデータ転送を行う際に一度に転送されるデータサイズに基づいて、プログラム中で必ず一まとまりとして扱われる命令とデータが一度に転送されるデータに含まれるように、または少なくとも命令単位およびデータ単位で一度に転送されるデータに含まれるように、コンパイラ、アセンブラまたはリンカから出力されたオブジェクトコードの命令順序およびデータ順序を変更してアドレスの再割り当てを行い、そのことにより上記目的が達成される。
【００２６】
プログラム中の条件分岐部分において、条件の成立し易さを考慮して、成立し易い方の分岐枝が条件分岐部分の近くのアドレスに配置されるように、命令順序およびデータ順序を変更してアドレスの再割り当てを行うのが好ましい。
【００２７】
プログラム中のデータの順序を入れ換えてもプログラムの実行に影響が生じない箇所においては、データバスの状態遷移が少なくなるようにデータ順序を変更してアドレスの再割り当てを行うのが好ましい。
【００２８】
本発明のオブジェクトコードの生成方法は、コンパイラ、アセンブラまたはリンカを用いてソースプログラムからオブジェクトコードを生成する方法であって、ＣＰＵの周辺または内部に主記憶装置よりも動作速度が速いキャッシュメモリを有し、該ＣＰＵは、アクセスしたいアドレスに対応するデータが該キャッシュメモリに記憶されている場合には該キャッシュメモリに対してアクセスを行い、該アドレスに対応するデータが該キャッシュメモリに記憶されていない場合には該主記憶装置に対してアクセスを行うと共に、アクセスされた該主記憶装置のアドレスと該アドレスに対応するデータとを該キャッシュメモリに転送して記憶させる計算処理システムにおいて、必要な命令とデータが必要なタイミングでキャッシュメモリに記憶されている確率を向上させるべく、該主記憶装置から該キャッシュメモリにデータ転送を行う際に一度に転送されるデータサイズに基づいて、プログラム中で必ず一まとまりとして扱われる命令とデータが一度に転送されるデータに含まれるように、または少なくとも命令単位およびデータ単位で一度に転送されるデータに含まれるように、コンパイラ、アセンブラまたはリンカを用いてソースプログラムからオブジェクトコードを生成する際に、命令およびデータを配置してアドレスの割り当てを行い、そのことにより上記目的が達成される。
【００２９】
プログラム中の条件分岐部分において、条件の成立し易さを考慮して、成立し易い方の分岐枝が条件分岐部分の近くのアドレスに配置されるように、命令およびデータを配置してアドレスの割り当てを行うのが好ましい。
【００３０】
プログラム中のデータ順序を入れ換えてもプログラムの実行に影響が生じない箇所においては、データバスの状態遷移が少なくなるようにデータを配置してアドレスの割り当てを行うのが好ましい。
【００３１】
以下、本発明の作用について説明する。
【００３２】
本発明にあっては、コンパイラ、アセンブラまたはリンカ等が生成したオブジェクトコードに対して、主記憶装置からキャッシュメモリにデータ転送を行う際に一度に転送されるデータサイズ（ラインサイズ）に基づいて、プログラム中で必ず一まとまりとして扱われる命令とデータが一度に転送されるデータに含まれるように、または少なくとも命令単位およびデータ単位で一度に転送されるデータに含まれるように、命令順序およびデータ順序を変更してアドレスの再割り当てを行う。これにより、ターゲットシステム上で実際にプログラムが実行されたときに、プログラム中で必ず一まとまりとして扱われる命令とデータは、その命令およびデータのいずれかがアクセスされたときに（但し、最初にアクセスされるのは命令である）に他のものが記憶装置からキャッシュメモリに転送されて記憶されるので、キャッシュメモリに対するヒット率を向上させて、処理速度の向上と消費電力の低減を図ることが可能となる。
【００３３】
または、コンパイラ、アセンブラまたはリンカ等がオブジェクトコードを生成する際に、主記憶装置からキャッシュメモリにデータ転送を行う際に一度に転送されるデータサイズ（ラインサイズ）に基づいて、プログラム中で必ず一まとまりとして扱われる命令とデータが一度に転送されるデータに含まれるように、または少なくとも命令単位およびデータ単位で一度に転送されるデータに含まれるように、命令順序およびデータ順序を変更してアドレスの再割り当てを行う。これによっても、ターゲットシステム上で実際にプログラムが実行されたときにキャッシュメモリに対するヒット率を向上させて、処理速度の向上と消費電力の低減を図ることが可能となる。
【００３４】
プログラム中の条件分岐部分においては、一方が頻繁に起こり、他方があまり起こらないということが多い。よって、どちらか一方が起こり易い（成立し易い）場合、条件の成立し易さを考慮して、成立し易い方の分岐枝が条件分岐部分の近くのアドレスに配置されるように、アドレスの割り当てまたはアドレスの再割り当てを行う。これによって、さらにキャッシュヒット率を向上させることが可能である。この場合、起こり難い（成立し難い）方の分岐枝は、どこに配置しても実行時間に与える影響は小さい。
【００３５】
さらに、プログラム中のデータ順序を入れ換えてもプログラムの実行に影響が生じない箇所においては、データバスの状態遷移が少なくなるようにデータを配置してアドレスの割り当てを行う。これによって、データバスの状態遷移による消費電流を低減することが可能である。
【００３６】
【発明の実施の形態】
以下に、本発明の実施の形態について、図面を参照しながら説明する。
【００３７】
図１（ａ）および図１（ｂ）は本発明のオブジェクトコードの再合成方法およびオブジェクトコードの生成方法について説明するための図である。
【００３８】
本発明にあっては、コンパイラ、アセンブラまたはリンカ等の出力であるオブジェクトコードのアドレス配置（アドレスの割り当てまたは再割り当て）を決定する際に、ソースプログラムの構造や内容を解析するだけではなく、実際にそのプログラムが実行されるターゲットシステム（例えば図６に示したようなシステム）に搭載されるキャッシュメモリの仕様を詳細に与える。例えば、ラインサイズ、置換アルゴリズム、連想度（Ａｓｓｏｃｉａｔｉｖｉｔｙ）、書き込みアルゴリズム、アドレスのデコード情報等をキャッシュメモリの仕様として与えることができる。
【００３９】
ラインサイズは、主記憶装置からキャッシュメモリにデータ転送を行う際に一度に転送されるデータサイズである。また、置換アルゴリズムは、キャッシュメモリが一杯になったときにどのデータをキャッシュメモリから追い出すかを制御するアルゴリズムである。この置換アルゴリズムは、一般に、乱数によるランダム法と、最も最近に使用されなかったものをキャッシュメモリから追い出すＬＲＵ（Ｌｅａｓｔ　Ｒｅｃｅｎｔｌｙ　Ｕｓｅｄ）法とに分類される。また、連想度は、同じラインに入る可能性のあるデータがいくつ入ることができるかを示す。例えば、連想度が４であれば、同じラインに入るデータを４つまでキャッシュメモリに入れることができ、５つ目からはキャッシュメモリから１つ追い出すことになる。また、書き込みアルゴリズムは、書き込み時にキャッシュメモリにヒットした場合に、キャッシュメモリに対してのみ書き換えを行うＷｒｉｔｅ−Ｂａｃｋ法と、キャッシュメモリと実体の存在するメモリ（主記憶装置のメモリ）の両方に対して書き換えを行うＷｒｉｔｅ−Ｔｈｒｏｕｇｈ法とに分類される。さらに、アドレスのデコード情報は、アドレスのどの部分に相当するとキャッシュメモリ上の同じ場所に載せられるかを示す。細かい制御が必要なソフトウェアでは、命令やデータの配置は人間の手によりこのアドレスのデコード情報を元に決定される。例えば３２ｂｉｔアドレスのうち、［２０：１１］の１０ｂｉｔの情報を用いるキャッシュメモリでは、どのラインに載るかは［２０：１１］で決まる。よって、［２０：１１］が等しくて他が異なるアドレスの情報は同じラインに載ることになる。連想度が４であれば、［２０：１１］が等しくて他が異なるデータでも４つまではキャッシュメモリに載るが、５つ目からは順次必要なものがキャッシュメモリ上に残され、その段階で不要なものはキャッシュメモリから追い出されるという動作を繰り返すことになる。これを防ぐために、この情報に基づいて人間の手により配置場所を調整するか、またはコンパイラや本発明のオブジェクトコード再合成方法または本発明のオブジェクトコード生成方法によって、配置場所を調整することができる。なお、キャッシュメモリサイズは、ラインサイズ×連想度×ライン数で与えられる。
【００４０】
このようなキャッシュメモリの仕様に基づいて、ソースプログラムの構造解析を行ってアドレス配置や命令列を解析する。そして、キャッシュメモリサイズやラインフェッチのサイズ（ラインサイズ）およびラインフェッチ動作等、細かいハードウェアの動作を考慮して命令やデータの順序を決定し、アドレスの割り当てまたは再割り当てを行ってオブジェクトコードを生成または再合成する。このことにより、ターゲットシステム上で実際に実行されたときのキャッシュメモリに対するヒット率（キャッシュメモリに必要な命令やデータが記憶されている確率）を向上させて、処理速度の向上および消費電力の低減を図ることが可能である。
【００４１】
また、命令列を解析して、処理内容や実行時間に影響の無いプログラム部分において、命令の入れ替えやイミディエットデータの入れ替え等を行って、データ遷移率を減少させることが可能である。さらに、プログラム中の条件分岐部分において、分岐枝の成立し易さを考慮することにより、さらにキャッシュメモリに対するヒット率を向上させることが可能である。
【００４２】
例えば、図１（ａ）に示すように、コード再合成手段に対して、コンパイラ、アセンブラまたはリンカ等が生成したオブジェクトコードを入力すると共に、キャッシュメモリの仕様情報（ラインサイズ等）を入力する。コード再合成手段は、ソースプログラムの解析を行ってアドレスの配置場所や命令列を解析し、データや命令の順序を変更してアドレスの再割り当て（アドレスの再マッピング）を行うことができる。なお、ソースプログラムの解析はコンパイラで行うこともでき、コンパイラが最低限の解析を行った結果をコード再合成手段により利用してもよい。
【００４３】
または、図１（ｂ）に示すように、コード生成手段をコンパイラ、アセンブラまたはリンカ等に組み込んで、コンパイラ、アセンブラまたはリンカ等が生成した高級言語やアセンブリ言語を入力すると共にキャッシュメモリの仕様情報（ラインサイズ等）を入力する。コード生成手段は、ソースプログラムの構造や内容を解析してアドレスの配置場所や命令列を解析し、データや命令の順序を決定してアドレスの割り当て（アドレスのマッピング）を行うことができる。この場合にも、ソースプログラムの解析はコンパイラで行うこともでき、コンパイラが最低限の解析を行った結果をコード生成手段により利用してもよい。
【００４４】
上記コード再合成手段およびコード生成手段は、ソフトウェアにより構成することができる。例えば、（１）アセンブラまたはコンパイラによって入力となるソースコードを読み込み、ＣＰＵに依存した命令列（０と１のパターンになっている）を生成する。次に、（２）この命令列を入力とするコード再合成手段（ソフトウェア）によって、命令列から命令の抽出、順序入れ替え可能場所の抽出およびアドレスの変更等、本発明に必要な処理を行う。その後、（３）その結果をリンカに渡す。または、本発明のコード再合成方法がリンカの処理を含む場合には、アドレスマッピングが終了した、計算機により実行可能なオブジェクトコードを出力する。なお、ここではアセンブラやコンパイラに本発明のコード再合成手段やコード生成手段の機能を含まない場合の動作の流れを示したが、アセンブラやコンパイラに本発明のコード再合成手段やコード生成手段の機能を含む場合には、コンパイラやアセンブラの内部処理に上記動作が組み込まれる。
【００４５】
以下に、本発明の実施の形態について、図２を用いてさらに詳しく説明する。なお、ここでは図１（ａ）に示したようにコンパイルされたオブジェクトコードに対して、コード再合成手段を用いてアドレスを再マッピングする方法について説明するが、図１（ｂ）に示したようにコンパイラやアセンブラまたはリンカ等にコード生成手段を組み込んだ場合についても、同様にアドレスのマッピングを行うことができる。
【００４６】
オブジェクトコードには、アドレスのマッピングを変更してもプログラムの実行に影響を与えない部分が多数あるので、その部分がアドレスの再マッピングの対象となる。なお、プログラムの実行に影響を与えない部分としては、以下のようなものがある。まず、独立した処理単位、例えば高級言語ではファンクション（関数）やプロシージャと称されるものはコンパイラのレベルで抽出できている。これらは独立しているので、１まとまりのまま全体のアドレスを移動させても問題が生じない。また、命令レベルで言えば、等価なオペランド（引数）が２つまたは３つある場合に、これらの順序を変更しても影響はない。例えば、「ａ＋ｂ＋ｃをｂ＋ａ＋ｃとしても同じである」こと等が挙げられる。
【００４７】
コード再合成手段は、まず、オブジェクトコードから１命令セット（１処理を行う命令部分とデータ部分とのまとまり。プログラム中で必ず一まとまりとして扱われる命令とデータであって、１ワードの命令コードではない。命令＋データのセットであり、図２では例えば命令３とデータ３−１と３−２が１命令セットで、命令６−１と６−２とデータ６−１〜６−３が１命令セットである）を抽出する。ここで、命令であるか否かを判断するのは容易であり、また、その命令がいくつオペランドを有するかもＣＰＵの命令セットで決まっているので、１命令セットに含まれる命令とデータとを抽出することができる。
【００４８】
次に、抽出した１命令セットのアドレスがキャッシュメモリのラインサイズよりも大きいか小さいかを判断する。キャッシュメモリはラインサイズ単位でしかロード／ストアを行わないので、１命令セットのサイズがラインサイズよりも小さい場合（図２では例えば命令３とデータ３−１と３−２が１命令セット）には、この命令とデータとのセット内で命令やデータの入れ替えを行ってもキャッシュメモリの動作には影響を与えないため、そのラインサイズ内で自由に命令やデータの再配置を行い、ラインサイズ内に命令セットがマッピングされるようにオブジェクトコードを再配置してアドレスを割り当てる。
【００４９】
一方、ラインサイズよりも１命令セットが大きい場合（図２では例えば命令６−１とデータ６−１が１命令セットであり、このセットがラインサイズをまたいでいる）には、ラインサイズ以下になるまで分割し、ラインサイズ以下になった段階で上記処理を行う。１命令セットをラインサイズ以下に分割することができない場合には、例えば図２の命令６−１、６−２とデータ６−１〜６−３のように、命令とデータの境界をラインフェッチの境界に合わせて、命令またはデータ単位でキャッシュメモリに存在し易いようにする。なお、図２では、ラインサイズを４ワードとしているが、８ワードや１６ワード等、どのようなサイズであってもよい。
【００５０】
例えば、加算命令の場合には、加算命令とその引数が必ずセットになったオブジェクトコードを生成する。このため、加算命令がキャッシュメモリに記憶されているときには、一連の加算命令とデータがキャッシュメモリの扱うラインサイズの境界をまたがないように、命令やデータの順序変更およびアドレスの再割り当てを行う。または、命令とデータがラインサイズよりも大きくなる場合には、命令とデータの境界がラインサイズの境界に配置されるようにする。
【００５１】
これにより、命令セットがキャッシュメモリの同一ライン上に記憶されることになり、キャッシュヒット率を向上させて処理スピードの向上を図ることができる。
【００５２】
次に、条件分岐部分について、図３を用いて説明する。一般に、条件分岐部分では一方が頻繁に起こり、他方があまり起こらないということが多い。また、処理を実行するまでどちらが起こり易いか分からない場合もあるが、事前に分かっていることも多い。そのような場合には、起こり易い方の処理がキャッシュメモリに存在し易いようにアドレスのマッピングを行うことにより、キャッシュヒット率を向上することができる。
【００５３】
まず、コード再合成手段は、オブジェクトコード中から、図３（ａ）に示すような条件判断および分岐が行われる部分を抽出し、条件分岐のＹｅｓ（条件成立）／Ｎｏ（条件不成立）のいずれが起こり易いかを判断する。いずれの分岐枝が起こり易いかについては、様々な場合が考えられる。
【００５４】
例えば、経験則として、プログラマーは条件が成立し易い方を条件として記述するということが言われているので、その経験則を利用してもよい。図３（ｂ）に示すように、Ｙｅｓ（成立）が起こり易い場合には、Ｙｅｓに続く命令（図３では処理２）がキャッシュメモリに存在し易いように、分岐命令の近くにアドレスの再割り当てを行う。このとき、他方（ここではＮｏ、図３では処理３）は、どこに配置しても実行時間全体に与える影響は小さい。なお、この図３（ｂ）ではデータを省略して示しているが、図２と同様に含まれている。
【００５５】
さらに、一般に、プログラマーの癖として、条件分岐が成立した場合に実行するコードが複雑な場合にはその条件が成立しにくく、条件分岐が成立した場合に実行するコードが単純な場合にはその条件が成立し易いと言われているので、このような情報から判断してもよい。さらに、いずれの分岐枝が起こり易いかという情報をプログラマーが予め入力するようにしてもよい。
【００５６】
この場合にも、上記コード再合成手段およびコード生成手段は、ソフトウェアにより構成することができる。例えば、プログラマーから予め情報が与えられる場合、（１）アセンブラまたはコンパイラによって入力となるソースコードを読み込み、ＣＰＵに依存した命令列（０と１のパターンになっている）を生成する。次に、（２）この命令列を入力とするコード再合成手段またはコード生成手段（ソフトウェア）によって、命令列からジャンプ命令や条件分岐命令を抽出する。そして、（３）条件分岐が起こり易いという情報が与えられている場合には、その分岐先の処理に相当する箇所のアドレスを分岐命令の近くに割り当てる。また、分岐しやすい命令がキャッシュメモリから追い出されにくいように、アドレスのデコード情報を利用して、キャッシュメモリの同じラインに載らないようにアドレスをマッピングする。
【００５７】
または、プログラマの癖を利用する場合、（１）アセンブラまたはコンパイラによって入力となるソースコードを読み込み、ＣＰＵに依存した命令列（０と１のパターンになっている）を生成する。次に、（２）この命令列を入力とするコード再合成手段またはコード生成手段（ソフトウェア）によって、命令列からジャンプ命令や条件分岐命令を抽出する。そして、（３）分岐先の命令が単純な場合には、その分岐先の処理に相当する箇所のアドレスを分岐命令の近くに割り当てる。また、分岐しやすい命令がキャッシュメモリから追い出されにくいように、アドレスのデコード情報を利用して、キャッシュメモリの同じラインに載らないようにアドレスをマッピングする。
【００５８】
次に、キャッシュメモリにヒットすることを前提として、バスの状態遷移確率を下げて消費電流を削減する方法について、図４を用いて説明する。
【００５９】
例えば、加算（ＡＤＤ）を考えた場合、加算する数字の順序を入れ替えても得られる結果は同じになる。このとき、前後のデータバスの値が大きく変化しないように、データの順序を変更することにより、バスの充放電に関する消費電力を削減することができる。データ順序を変更する際には、データバスの状態遷移確率が小さくなるように、隣接するデータや命令のビットパターンを比較して、変化が少なくなるように配置を決定する。
【００６０】
データの順序を入れ替えることにより、データバスの各ビットの遷移が少なくなるようにすれば、消費電力を削減することができる。各ビットについて、ＨからＬ、ＬからＨに遷移する場合に充放電を行うために電力が消費されるので、ＨからＨ、ＬからＬに遷移するように（変化しないように）して、電流の充放電を行わせないようにする。さらに、多数のビットで同時に充放電が起こることによって生じる、瞬間電流の増大に伴う電源電圧の電圧降下についても起こり難くなる。
【００６１】
例えば図４では、ａ、ｂ、ｃの順序を入れ替えてａ＋ｂ＋ｃとしてもｃ＋ｂ＋ａとしても得られる結果は同じになる。このとき、図４（ａ）に示すように、データバスの遷移を考慮しなかった場合には、８つのビット遷移が発生してその分の充放電によって電力が消費される。また、一度に４ビット変化するタイミングが発生して瞬間電流が多くなり、電源電圧の降下が起こるおそれがある。
【００６２】
これに対して、図４（ｂ）に示すように、データバスの遷移を考慮した場合には、６つのビット遷移しか発生せず、充放電による電力の消費を抑えることができる。また、一度に変化するビットの数も３つに抑えられているので、瞬間電流を少なくして、電源電圧の降下を減らすことができる。
【００６３】
この場合にも、上記コード再合成手段およびコード生成手段は、ソフトウェアにより構成することができる。例えば、（１）アセンブラまたはコンパイラによって入力となるソースコードを読み込み、ＣＰＵに依存した命令列（０と１のパターンになっている）を生成する。次に、（２）この命令列を入力とするコード再合成手段またはコード生成手段（ソフトウェア）によって、ＣＰＵの動作として順序が反転しても影響が無い場所を抽出する。そして、（３）順序が反転しても影響が無い場所のリストを順次読み込んでその順序を入れ替えることを試みる。その際に、ｂｉｔ単位で変化が少なくなる場合には入れ替えを実行し、逆にｂｉｔ単位で変化が多くなるようであれば元のままの配置にしておく。
【００６４】
【発明の効果】
以上詳述したように、本発明によれば、プログラム中で必ず一まとまりとして扱われる命令とデータがラインサイズ内に含まれるように、命令順序およびデータ順序を変更または決定して、アドレスの再割り当てまたは割り当てを行うことにより、ターゲットシステム上で実際にプログラムが実行されたときにキャッシュメモリに対するヒット率を向上させて、計算処理速度を向上させることができる。また、主記憶装置へのアクセス低減によって低消費電力化を達成することができる。プログラム中で必ず一まとまりとして扱われる命令とデータがラインサイズよりも大きくなる場合には、命令単位またはデータ単位でラインサイズ内に含まれるように、命令とデータの境界がラインフェッチの境界に配置されるように、命令順序およびデータ順序を変更または決定して、アドレスの再割り当てまたは割り当てを行うことができる。
【００６５】
さらに、プログラム中の条件分岐部分において、成立し易い方の分岐枝が条件分岐部分の近くのアドレスに配置されるように、アドレスの割り当てまたはアドレスの再割り当てを行うことにより、さらにキャッシュヒット率を向上させることができる。
【００６６】
さらに、プログラム中のデータ順序を入れ換えてもプログラムの実行に影響が生じない箇所においては、データバスの状態遷移が少なくなるようにデータを配置してアドレスの割り当てを行うことにより、データバスの状態遷移による消費電流を低減することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態であるオブジェクトコードの再合成方法およびオブジェクトコードの生成方法について説明するためのフロー図である。
【図２】本発明の一実施形態であるオブジェクトコードの再合成方法について、アドレスの再マッピング方法を説明するためのフロー図である。
【図３】本発明の一実施形態であるオブジェクトコードの再合成方法について、条件分岐部分におけるアドレスの再マッピング方法を説明するためのフロー図である。
【図４】本発明の一実施形態であるオブジェクトコードの再合成方法について、データバスの状態遷移確率を減少させるためのアドレスの再マッピング方法を説明するための図である。
【図５】従来のオブジェクトコードの生成方法を説明するためのフロー図である。
【図６】キャッシュメモリを有するシステムの全体構成を説明するための図である。
【符号の説明】
１ＣＰＵ
２キャッシュメモリ
３主記憶装置（メインメモリ）

[0001]
[Technical Field of the Invention]
The present invention relates to an object code resynthesis method and an object code generation method that yield object code suited to drawing out the full performance of a cache memory, which is used in information devices and the like to speed up memory access.
[0002]
[Prior art]
Conventionally, in an information device, as shown in FIG. 5, a compiler for a high-level language or an assembler for a machine-dependent assembly language is used to generate, from a source program, the object code that the CPU executes. In the prior art, compilation and assembly have simply aimed at reducing object code size, improving CPU processing speed, or shortening the time required for compiling and assembling.
[0003]
However, now that even portable information devices carry large main storage devices (also called main memory) and cache memories, small object code alone is not sufficient. CPU processing must be executed quickly and efficiently with the configuration of the hardware system also taken into account.
[0004]
To this end, what matters is how to keep the instructions and data that the program needs in the cache memory, which is faster than the main storage device (that is, how to improve the cache hit rate). If the necessary instructions and data are not in the cache memory at the moment they are needed, the CPU must access the slow main storage device and is forced to wait. As a result, the processing speed of the CPU drops and the cache memory loses its purpose. In addition, more accesses to external memory (including the main storage device) not only slow processing but also raise the probability of data bus state transitions and increase the power consumed when the main storage device is accessed.
[0005]
A system equipped with a cache memory is briefly described below with reference to FIG. 6. The cache memory 2 is a kind of buffer memory used to bridge the gap between the processing speed of the CPU 1 (fast) and the access speed of the main storage device 3 (slow). It consists of a cache memory unit that stores instructions and data and a cache memory control circuit that controls its operation. The cache memory 2 normally operates faster than the main storage device 3 and can respond quickly to accesses from the CPU. The cache memory 2 may be provided either around or inside the CPU 1.
[0006]
This system operates as follows. The object code (the instructions and data of the program) generated by the compiler or assembler is stored in the main storage device 3 in advance. In the initial state, in which the cache memory 2 holds no data, when the CPU 1 accesses an arbitrary address, the data is not present in the cache memory 2, so the first access goes to the slow main storage device 3. At the same time, the cache memory 2 stores in the cache memory unit the address of the main storage device 3 accessed by the CPU 1 and the data (instruction or data) output from the main storage device 3. The cache memory 2 constantly monitors the accesses that the CPU 1 makes to the main storage device 3. When the same address (an address stored in the cache memory unit) is accessed a second or subsequent time, the cache memory control circuit inhibits output of the control signal to the main storage device 3, thereby preventing access to the slow main storage device. The fast cache memory 2 then passes the data corresponding to that address to the CPU 1, which reduces the time the CPU 1 would otherwise spend waiting on the main storage device 3 and improves processing speed.
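The hit and miss behaviour described in this paragraph can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the circuit of FIG. 6: the `SimpleCache` class, its method names, and the dictionary standing in for the main storage device 3 are hypothetical, and capacity limits and replacement are ignored.

```python
class SimpleCache:
    """Toy model of cache memory 2 in front of main storage device 3.

    Hypothetical sketch: unlimited capacity, one word per entry, no replacement.
    """

    def __init__(self, main_memory):
        self.main_memory = main_memory   # address -> word (the slow memory)
        self.entries = {}                # address -> word (the fast memory)
        self.main_accesses = 0           # how often the slow memory was used

    def read(self, address):
        """Return (word, hit) for a CPU read of `address`."""
        if address in self.entries:
            # Second and later accesses are served by the cache alone.
            return self.entries[address], True
        # First access: go to main memory and remember address and data.
        self.main_accesses += 1
        word = self.main_memory[address]
        self.entries[address] = word
        return word, False


memory = {0x100: "ADD", 0x101: 7}
cache = SimpleCache(memory)
first = cache.read(0x100)    # miss: the main storage device is accessed
second = cache.read(0x100)   # hit: the main storage device is not accessed
```

Only the first read reaches the slow memory; `cache.main_accesses` stays at 1 after the second read.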
[0007]
Therefore, by providing a high-speed cache memory, even one of small capacity, the CPU can read and write instructions and data at high speed without accessing the slow main storage device, as long as the data it needs is present in the cache memory.
[0008]
The fastest possible system would give the high-speed cache memory the same size (storage capacity) as the low-speed main storage device, but because cache memory is expensive this is not realistic in terms of cost. A system is therefore normally equipped with a cache memory that is small compared with the main storage device.
[0009]
Because the cache memory is smaller than the main storage device, only part of the data held in the main storage device is stored in the cache memory. To use the cache memory effectively, the ideal behaviour is to store data that the CPU is expected to use in the cache memory before the CPU actually uses it, and to erase from the cache memory data that the CPU is not expected to use again. In practice, however, which data is stored in or erased from the cache memory depends on the structure of the program and on the control method of the cache memory control circuit.
[0010]
Two criteria are generally cited as fundamental: (1) data that has been read or written is likely to be used again soon (temporal locality), and (2) data located near data that has been read or written is likely to be used soon (spatial locality). However, although recent cache memory control circuits are designed around these rules of thumb, cases in which providing a cache memory in fact yields no benefit are increasing.
[0011]
The biggest cause is the change in program development languages. When programs were developed in assembly language with an assembler, it was possible to program deliberately so that the necessary instructions and data were kept in the cache memory as far as possible, exploiting the fact that assembly language corresponds almost one-to-one to the generated object code. The probability that required instructions and data are present in the cache memory (the cache hit rate) could therefore be raised to some extent.
[0012]
In recent years, however, programs have come to be written in high-level languages with a compiler, in a way that requires no awareness of the hardware, and programmers no longer write code with the operation of the cache memory in mind. Moreover, it has become impossible for the programmer to control the object code that the compiler generates.
[0013]
Further, compiler optimization has conventionally focused mainly on reducing the code size of the object code. For this reason, no compilation technique capable of generating object code that readily benefits from a cache memory has been implemented in systems. One reason is that a compiler depends strongly on the hardware configuration, which differs from system to system, so it is difficult to create a compiler that supports every system.
[0014]
Compilation techniques that take these points into account have therefore been reported recently. For example, Japanese Patent Application Laid-Open No. 5-120029 proposes a compilation process that generates object code optimized for the loop portions of a program. Japanese Patent Application Laid-Open No. 11-96015 proposes generating profile data in advance and compiling on the basis of that profile data. Further, Japanese Patent Application Laid-Open No. 5-324281 refers to reassigning (remapping) addresses so that, when a plurality of subroutines (blocks of program code each performing one process) are stored in the cache memory and each is executed repeatedly, those subroutines are not loaded at the same address in the cache memory.
[0015]
[Problems to be solved by the invention]
As shown in FIG. 6, the cache memory operates as a buffer between a modern high-speed CPU and the low-speed main storage device. In general, the cache memory is considerably smaller than the main storage device, so only the data that is needed (or likely to be needed) is kept in it. If the necessary data is in the cache memory, the CPU accesses the fast cache memory rather than the main storage device, and high-speed processing is possible. If it is not, the main storage device must be accessed and the necessary data transferred to the cache memory. If, at that point, the entire cache memory is already occupied, unneeded data must first be transferred (copied) to the main storage device, and only then can the new data be transferred from the main storage device into the freed cache memory area.
[0016]
As described above, the penalty when the necessary data (instructions and data) is not stored in the cache memory is very large. For this reason, it is necessary not only to install a cache memory but also to generate an object code considering the use and operation of the cache memory.
[0017]
Nevertheless, current compilers and assemblers concentrate on generating compact object code quickly, and perform almost no processing aimed at generating object code that matches the specifications of the cache memory. In addition, an increase in data transfers not only slows processing but also raises the probability of data bus state transitions and increases the power consumed when the main storage device is accessed. Furthermore, the techniques of the publications cited above have the following problems.
[0018]
The technique proposed in the above-mentioned Japanese Patent Application Laid-Open No. 5-120029 focuses on the total size of the cache memory, but what actually matters is not the total size but the structure of the cache memory and how it operates. That is, on a cache miss (when a necessary instruction or data item is not present in the cache memory), instructions and data are loaded and stored between the main storage device and the cache memory, yet this prior art does not consider how many words are loaded or stored at a time.
[0019]
When a program contains several loops and every loop is the same size as the cache memory, it is rare for all of the object code in each loop to be executed when the CPU actually runs the program. In most cases a conditional branch inside a loop jumps to another loop, so that only part of each loop is executed. In such a case, even if the proposed technique generates object code with the loop size matched to the cache memory size, the result is meaningless because the actual load and store operation of the cache memory has not been considered.
[0020]
If the instruction or data corresponding to the address accessed by the CPU is not present in the cache memory, the cache memory control circuit transfers data from the external memory (main storage device) to the cache memory. This operation is called a line fetch, and the amount of data transferred at one time is called the line size. In a line fetch, not only the data (instruction or data) at the address accessed by the CPU but also the data at the surrounding addresses is transferred to the cache memory together. This behaviour is based on the spatial locality described above. The line size is typically about 4, 8, or 16 words. Both the transfer of data from the main storage device to the cache memory and the eviction of unneeded data from the cache memory therefore take place in units of this line size. Consequently, when object code is generated, what matters is not the size of the cache memory as a whole but fitting each processing unit of the object code within the line size, which is the unit of every load and store to the cache memory.
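The role of the line size can be made concrete with a short sketch. The helper names `line_index`, `line_fetch`, and `straddles_line` are hypothetical, and the sketch assumes word addressing and a 4-word line, one of the typical sizes mentioned above.

```python
def line_index(address, line_words=4):
    """Index (in main-memory order) of the line containing a word address."""
    return address // line_words


def line_fetch(address, line_words=4):
    """Word addresses transferred together when `address` misses."""
    start = line_index(address, line_words) * line_words
    return list(range(start, start + line_words))


def straddles_line(start, size, line_words=4):
    """True if a unit of `size` words placed at `start` spans more than one line."""
    return line_index(start, line_words) != line_index(start + size - 1, line_words)


neighbours = line_fetch(6)          # words 4..7 arrive together with word 6
crossing = straddles_line(3, 2)     # a 2-word unit at address 3 needs two line fetches
contained = straddles_line(4, 3)    # a 3-word unit at address 4 needs only one
```

A processing unit that straddles a line boundary can cost two line fetches, which is why the text stresses fitting each unit within one line.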
[0021]
The technique proposed in the above-mentioned Japanese Patent Application Laid-Open No. 11-96015 is effective when the instructions and data hardly ever change, as in an embedded system. Today, however, programs are downloaded over a network and the data takes on various patterns depending on the situation, so the profile data is often meaningless and the technique is not practical. In addition, when the CPU actually executes a program, the order in which instructions are executed changes with conditions such as the timing of externally input signals, so the approach is not foolproof.
[0022]
Furthermore, the above-mentioned Japanese Patent Application Laid-Open No. 5-324281 cites, as a conventional example (FIG. 2(b) of that publication), an arrangement in which a subroutine is placed at the head of each page, but few compilers use such a method. In general, as shown in FIG. 1 of that publication, the technique of packing subroutines from the top of a page, and moving a subroutine to the next page only when it would straddle a page boundary, was already in use before that proposal. It is commonly used to reduce the program size (the size of the main storage device).
[0023]
This prior art proposes packing the subroutines together and describes operation under broad schemes such as direct-mapped and set-associative caches. However, as with the proposal of Japanese Patent Application Laid-Open No. 5-120029, it does not consider the size of the data that the cache memory loads and stores, and it offers only a conceptual statement that addresses are determined by the cache scheme and size, so it is insufficient.
[0024]
The present invention has been made to solve these problems of the prior art. Its object is to provide an object code resynthesis method and an object code generation method that generate object code suited to the specifications and operation of the cache memory, thereby improving the cache hit rate and the processing speed of the CPU.
[0025]
[Means for Solving the Problems]
The object code resynthesis method of the present invention reassigns addresses in object code generated from a source program by a compiler, assembler, or linker. It applies to a computing system that has, around or inside a CPU, a cache memory that operates faster than the main storage device, and in which the CPU accesses the cache memory when the data corresponding to the main-storage address it wants to access is stored in the cache memory, and otherwise accesses the main storage device while the accessed address and the data corresponding to it are transferred to and stored in the cache memory. To raise the probability that the necessary instructions and data are in the cache memory at the moment they are needed, the method changes the instruction order and data order of the object code output from the compiler, assembler, or linker and reassigns addresses, based on the size of the data transferred at one time from the main storage device to the cache memory, so that instructions and data that are always handled as a unit in the program are contained in the data transferred at one time, or at least so that each instruction unit and each data unit is contained in the data transferred at one time. The above object is thereby achieved.
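One way to picture this address reassignment is the following sketch. It is an assumption-laden illustration, not the method as claimed: the function `assign_addresses`, the `(name, instruction_words, data_words)` tuples, and the 4-word line are hypothetical, and the sketch inserts padding where a real tool would preferably reorder units to fill the gaps.

```python
def assign_addresses(sets, line_words=4):
    """Assign a start address to each instruction set.

    sets: list of (name, instruction_words, data_words) in program order.
    A set that fits in one line is never allowed to straddle a line boundary.
    A set larger than one line is placed so that the boundary between its
    instructions and its data coincides with a line boundary.
    Returns {name: start_address}. Padding is a simplification of this sketch.
    """
    placement = {}
    cursor = 0
    for name, n_instr, n_data in sets:
        size = n_instr + n_data
        offset = cursor % line_words
        if size <= line_words:
            if offset + size > line_words:
                # Would cross a line boundary: move to the start of the next line.
                cursor += line_words - offset
        else:
            # Too large for one line: align the instruction/data boundary.
            cursor += (-(cursor + n_instr)) % line_words
        placement[name] = cursor
        cursor += size
    return placement


program = [("set1", 1, 1), ("set3", 1, 2), ("set6", 2, 3)]
placement = assign_addresses(program, line_words=4)
```

Here `set3` is moved from address 2 to address 4 so that its three words share one line, and `set6` starts at address 10 so that its data begins exactly at the line boundary at address 12.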
[0026]
In a conditional branch part of the program, it is preferable to reassign addresses by changing the instruction order and data order, taking into account how likely the condition is to be satisfied, so that the branch destination that is more likely to be taken is placed at an address near the conditional branch part.
[0027]
In places where the execution of the program is not affected even if the order of the data in the program is changed, it is preferable to reassign the address by changing the data order so that the state transition of the data bus is reduced.
[0028]
The object code generation method of the present invention generates object code from a source program using a compiler, an assembler, or a linker. The method applies to a computing system that has, in the periphery of or inside the CPU, a cache memory whose operating speed is faster than that of the main storage device; in which the CPU accesses the cache memory when the data corresponding to the address to be accessed is stored in the cache memory, and accesses the main storage device when that data is not stored in the cache memory; and in which the address of the accessed main storage device and the data corresponding to that address are transferred to the cache memory and stored there. In order to improve the probability that necessary instructions and data are stored in the cache memory when they are needed, instructions and data are arranged and addresses are assigned when the object code is generated from the source program, based on the data size transferred at one time when data is transferred from the main storage device to the cache memory, so that instructions and data that are always handled as a unit in the program are included in the data transferred at one time, or at least so that each instruction unit and each data unit is contained within the data transferred at one time. The above object is thereby achieved.
[0029]
In a conditional branch part of the program, it is preferable to arrange instructions and data and assign addresses, taking into account how likely the condition is to be satisfied, so that the branch destination that is more likely to be taken is placed at an address near the conditional branch part.
[0030]
In places where the execution of the program is not affected even if the data order in the program is changed, it is preferable to allocate data and allocate addresses so that the state transition of the data bus is reduced.
[0031]
The operation of the present invention will be described below.
[0032]
In the present invention, for object code generated by a compiler, assembler, linker, or the like, the instruction order and data order are changed and addresses are reassigned based on the data size transferred at one time when data is transferred from the main storage device to the cache memory (the line size), so that instructions and data that are always treated as a unit in the program are included in the data transferred at one time, or at least so that each instruction unit and each data unit is contained within the data transferred at one time. As a result, when the program is actually executed on the target system, whenever any one of the instructions or data that are always handled as a unit is accessed (namely, the one accessed first), the others are also transferred from the main storage device to the cache memory and stored there. The hit rate of the cache memory can thus be improved, which improves the processing speed and reduces the power consumption.
[0033]
Alternatively, when a compiler, assembler, linker, or the like generates object code, the instruction order and data order are determined and addresses are assigned based on the data size transferred at one time when data is transferred from the main storage device to the cache memory (the line size), so that instructions and data that are always treated as a unit in the program are included in the data transferred at one time, or at least so that each instruction unit and each data unit is contained within the data transferred at one time. This also improves the hit rate of the cache memory when the program is actually executed on the target system, thereby improving the processing speed and reducing the power consumption.
[0034]
In a conditional branch part of a program, it is common for one outcome to occur frequently and the other rarely. When one outcome is thus more likely to occur (its condition is more likely to be satisfied), addresses are assigned or reassigned, taking this likelihood into account, so that the branch destination that is more likely to be taken is placed at an address near the conditional branch part. This makes it possible to further improve the cache hit rate. In this case, the branch destination that is unlikely to be taken has little influence on the execution time wherever it is placed.
[0035]
Further, at locations where the execution of the program is not affected even if the data order in the program is changed, data is allocated and addresses are assigned so that the state transition of the data bus is reduced. As a result, it is possible to reduce current consumption due to state transition of the data bus.
[0036]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0037]
FIG. 1A and FIG. 1B are diagrams for explaining an object code re-synthesis method and an object code generation method according to the present invention.
[0038]
In the present invention, when determining the address arrangement (address assignment or reassignment) of the object code output by the compiler, assembler, linker, or the like, not only are the structure and contents of the source program analyzed, but the specifications of the cache memory mounted on the target system on which the program is actually executed (for example, the system shown in FIG. 6) are also taken into account in detail. Examples of cache memory specifications include the line size, replacement algorithm, associativity, write algorithm, and address decode information.
[0039]
The line size is the data size transferred at one time when data is transferred from the main storage device to the cache memory. The replacement algorithm controls which data is evicted from the cache memory when the cache memory is full. Replacement algorithms are generally classified into a random method, which uses random numbers, and an LRU (Least Recently Used) method, in which the least recently used data is evicted from the cache memory. The associativity indicates how many pieces of data that map to the same line can be held at once. For example, if the associativity is 4, up to four pieces of data that map to the same line can be held in the cache memory, and from the fifth onward one piece is evicted. The write algorithm is either a write-back method, which rewrites only the cache memory when a write hits the cache memory, or a write-through method, which rewrites both the cache memory and the real memory (the memory of the main storage device). The address decode information indicates which part of the address determines the location in the cache memory. In software that requires fine control, the arrangement of instructions and data is determined by hand based on this address decode information. For example, in a cache memory that uses the 10 bits [20:11] of a 32-bit address, the line on which data is placed is determined by [20:11]. Therefore, addresses that have the same [20:11] bits but differ elsewhere are placed on the same line. If the associativity is 4, up to four pieces of data with the same [20:11] bits but different addresses are held in the cache memory; from the fifth onward, the operation of keeping the needed data in the cache memory and evicting what is not needed is repeated.
In order to prevent this, the placement can be adjusted by hand based on this information, or it can be adjusted by a compiler, by the object code re-synthesis method of the present invention, or by the object code generation method of the present invention. The cache memory size is given by line size × associativity × number of lines.
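The address decode example above can be illustrated with a short sketch. It assumes the [20:11] line-index bits from the text, treats the bits above bit 20 as a tag (an assumption not stated in the text), and uses hypothetical function names. It computes the line selected by an address, the total cache size, and the lines that more distinct tags map to than the associativity allows.

```python
from collections import defaultdict

# Bit range selecting the cache line, taken from the [20:11] example in the text.
LINE_INDEX_HI = 20
LINE_INDEX_LO = 11


def line_index(address: int) -> int:
    """Return the cache line selected by address bits [20:11]."""
    width = LINE_INDEX_HI - LINE_INDEX_LO + 1  # 10 bits
    return (address >> LINE_INDEX_LO) & ((1 << width) - 1)


def cache_size(line_size: int, associativity: int, num_lines: int) -> int:
    """Cache size = line size x associativity x number of lines."""
    return line_size * associativity * num_lines


def oversubscribed_lines(addresses, associativity=4):
    """Return {line: tags} for lines that more distinct tags map to than
    there are ways. Treating the bits above bit 20 as the tag is an
    assumption made for this illustration."""
    tags_by_line = defaultdict(set)
    for address in addresses:
        tags_by_line[line_index(address)].add(address >> (LINE_INDEX_HI + 1))
    return {
        line: sorted(tags)
        for line, tags in tags_by_line.items()
        if len(tags) > associativity
    }
```

A placement tool could use a check of this kind to find groups of addresses that would repeatedly evict one another and then move some of them to other lines.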
[0040]
Based on such cache memory specifications, the structure of the source program is analyzed to determine the address arrangement and the instruction sequence. The order of instructions and data is then determined in consideration of fine-grained hardware behavior such as the cache memory size, the line fetch size (line size), and the line fetch operation, and addresses are assigned or reassigned to generate or re-synthesize the object code. This improves the hit rate (the probability that the required instructions and data are stored in the cache memory) when the program is actually executed on the target system, so that the processing speed can be improved and the power consumption reduced.
[0041]
Also, by analyzing the instruction sequence, the data transition rate can be reduced by swapping instructions, swapping immediate data, and so on, in parts of the program where doing so does not affect the processing content or the execution time. Furthermore, the hit rate of the cache memory can be further improved by considering how likely each branch of a conditional branch part in the program is to be taken.
[0042]
For example, as shown in FIG. 1A, the object code generated by a compiler, assembler, linker, or the like is input to the code re-synthesizing means together with the specification information of the cache memory (line size, etc.). The code re-synthesizing means can analyze the source program to determine the address arrangement and the instruction sequence, change the order of data and instructions, and reassign addresses (address remapping). The analysis of the source program can be performed by the compiler, and the code re-synthesizing means may use the results of the minimal analysis performed by the compiler.
[0043]
Alternatively, as shown in FIG. 1B, the code generation means is incorporated into a compiler, assembler, linker, or the like, and receives as input the high-level language or assembly language program handled by the compiler, assembler, linker, or the like, together with the specification information of the cache memory (line size, etc.). The code generation means can analyze the structure and contents of the source program to determine the address arrangement and the instruction sequence, decide the order of data and instructions, and assign addresses (address mapping). In this case too, the analysis of the source program can be performed by the compiler, and the code generation means may use the results of the minimal analysis performed by the compiler.
[0044]
The code re-synthesizing means and the code generation means can be implemented in software. For example, (1) the input source code is read by an assembler or a compiler, and a CPU-dependent instruction sequence (a pattern of 0s and 1s) is generated. Next, (2) the code re-synthesizing means (software) that receives this instruction sequence performs the processing required by the present invention, such as extracting instructions from the instruction sequence, extracting places where the order can be changed, and changing addresses. Thereafter, (3) the result is passed to the linker. Alternatively, when the code re-synthesis method of the present invention includes the linker processing, object code that has undergone address mapping and can be executed by a computer is output. The flow described here is for the case where the assembler or compiler does not include the functions of the code re-synthesizing means and the code generation means of the present invention. When the assembler or compiler does include those functions, the above operation is incorporated into the internal processing of the compiler or assembler.
[0045]
Hereinafter, an embodiment of the present invention will be described in more detail with reference to FIG. Here, a method of remapping the address using the code re-synthesizing means for the object code compiled as shown in FIG. 1A will be described, but as shown in FIG. 1B. In the case where the code generation means is incorporated in a compiler, assembler, linker or the like, address mapping can be performed in the same manner.
[0046]
Object code contains many portions whose address mapping can be changed without affecting the execution of the program, and these portions are the targets of address remapping. Examples of such portions are as follows. First, independent processing units, such as what are called functions or procedures in high-level languages, can be extracted at the compiler level. Because they are independent, there is no problem in moving the addresses of each one as a whole. Also, at the instruction level, when an instruction has two or three equivalent operands (arguments), changing their order has no effect. For example, the result is the same when a + b + c is replaced with b + a + c.
[0047]
The code re-synthesizing means first extracts instruction sets from the object code. Here, an instruction set is a group consisting of an instruction part and a data part that together perform one process, that is, instructions and data that are always handled as a unit in the program. In FIG. 2, for example, instruction 3 and data 3-1 and 3-2 form one instruction set, and instructions 6-1 and 6-2 and data 6-1 to 6-3 form one instruction set. It is easy to determine whether a given word is an instruction, and the number of operands of each instruction is determined by the instruction set of the CPU, so the instructions and data that belong to one instruction set can be extracted.
[0048]
Next, it is determined whether the size of each extracted instruction set is larger or smaller than the line size of the cache memory. The cache memory loads and stores only in units of the line size. Therefore, when one instruction set is smaller than the line size (in FIG. 2, for example, instruction 3 and data 3-1 and 3-2 form such a set), exchanging the instructions and data within the set does not affect the operation of the cache memory, so the instructions and data can be rearranged freely within the line. The object code is rearranged so that the instruction set is mapped within one line, and addresses are assigned accordingly.
[0049]
On the other hand, when one instruction set is larger than the line size (in FIG. 2, for example, instruction 6-1 and data 6-1 form one instruction set, and this set crosses a line boundary), the set is divided into parts no larger than the line size, and the processing described above for a set no larger than the line size is applied to each part. If an instruction set cannot be divided into parts no larger than the line size, as with instructions 6-1 and 6-2 and data 6-1 to 6-3 in FIG. 2, the boundaries between instructions and data are aligned with the line boundaries so that each instruction or data unit is likely to be present in the cache memory. In FIG. 2 the line size is 4 words, but any size, such as 8 words or 16 words, may be used.
[0050]
For example, in the case of an addition instruction, object code is generated in which the addition instruction and its arguments always form a set. For this reason, the order of instructions and data is changed and addresses are reassigned so that, when the addition instruction is stored in the cache memory, the addition instruction and its data do not cross a boundary of the line size handled by the cache memory. Alternatively, when the instruction and data together are larger than the line size, the boundary between the instruction and the data is placed at a line boundary.
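The line-boundary rule described above can be sketched as follows. This is a simplified illustration, not the procedure of the embodiment: it assumes every instruction unit or data unit occupies one word, and it avoids boundary crossings by skipping to the next line boundary (padding) rather than by searching for a reordering. The function names and unit labels are hypothetical.

```python
def assign_addresses(instruction_sets, line_size=4):
    """Assign word addresses so that no instruction set that fits in one
    line crosses a line boundary, and every set larger than one line
    starts on a line boundary.

    instruction_sets: list of lists of unit names, each unit one word
    (an assumption made for this sketch).
    Returns a dict mapping unit name -> word address.
    """
    addresses = {}
    cursor = 0
    for units in instruction_sets:
        offset = cursor % line_size
        crosses = offset + len(units) > line_size
        if offset != 0 and (crosses or len(units) > line_size):
            # Skip to the next line boundary instead of straddling it.
            cursor += line_size - offset
        for unit in units:
            addresses[unit] = cursor
            cursor += 1
    return addresses


def lines_touched(addresses, units, line_size=4):
    """Return the set of cache-line numbers occupied by the given units."""
    return {addresses[unit] // line_size for unit in units}
```

With a 4-word line, a 3-word set that would otherwise start at word 2 is moved to word 4, so a single line fetch brings in the whole set. A 5-word set is started on a line boundary so that it occupies the minimum of two lines.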
[0051]
As a result, the instruction set is stored on the same line of the cache memory, so that the cache hit rate can be improved and the processing speed can be improved.
[0052]
Next, the conditional branch portion will be described with reference to FIG. In general, in the conditional branch part, one frequently occurs and the other does not often occur. In addition, there is a case where it is not known which is likely to occur until the process is executed, but it is often known in advance. In such a case, the cache hit rate can be improved by performing address mapping so that the process that tends to occur is likely to exist in the cache memory.
[0053]
First, the code re-synthesizing means extracts from the object code a part where a condition is judged and a branch is taken, as shown in FIG. 3A, and determines which outcome of the conditional branch, Yes (condition satisfied) or No (condition not satisfied), is more likely to occur. There are various ways of judging which branch is more likely to be taken.
[0054]
For example, as an empirical rule, programmers are said to write as the condition the case that is more likely to be satisfied, so this rule may be used. As shown in FIG. 3(b), when Yes (condition satisfied) is likely to occur, addresses are reassigned so that the instructions following Yes (Process 2 in FIG. 3) are placed near the branch instruction and are therefore likely to be present in the cache memory. At this time, the other outcome (No in this case, Process 3 in FIG. 3) has little influence on the overall execution time wherever it is placed. In FIG. 3B, data is omitted, but is included in the same manner as in FIG.
[0055]
Furthermore, as a general habit of programmers, when the code executed if the condition is satisfied is complicated, the condition tends to be hard to satisfy, and when that code is simple, the condition tends to be easy to satisfy. The likelihood may be judged from such information. Alternatively, the programmer may supply in advance information on which branch is likely to be taken.
[0056]
In this case too, the code re-synthesizing means and the code generation means can be implemented in software. For example, when information is given in advance by the programmer, (1) the input source code is read by an assembler or a compiler, and a CPU-dependent instruction sequence (a pattern of 0s and 1s) is generated. Next, (2) the code re-synthesizing means or the code generation means (software) that receives this instruction sequence extracts jump instructions and conditional branch instructions from it. Then, (3) when information indicating that a conditional branch is likely to be taken is given, the addresses of the branch destination process are assigned near the branch instruction. Further, using the address decode information, addresses are mapped so as to avoid conflicting placements on the same line of the cache memory, so that the instructions that are likely to be branched to are not easily evicted from the cache memory.
[0057]
Alternatively, when the programmer's habits are used, (1) the input source code is read by an assembler or a compiler, and a CPU-dependent instruction sequence (a pattern of 0s and 1s) is generated. Next, (2) the code re-synthesizing means or the code generation means (software) that receives this instruction sequence extracts jump instructions and conditional branch instructions from it. Then, (3) if the branch destination instructions are simple, the addresses of the branch destination process are assigned near the branch instruction. Further, using the address decode information, addresses are mapped so as to avoid conflicting placements on the same line of the cache memory, so that the instructions that are likely to be branched to are not easily evicted from the cache memory.
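The two ways of judging branch likelihood described above (a hint supplied by the programmer, or the simplicity of the branch destination code) can be sketched as follows. This is a minimal illustration under stated assumptions: each unit occupies one word, "simple" is approximated by the number of units in a block, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ConditionalBranch:
    yes_block: List[str]                 # units executed when the condition holds
    no_block: List[str]                  # units executed otherwise
    yes_likely: Optional[bool] = None    # programmer-supplied hint, if any


def yes_is_likely(branch: ConditionalBranch) -> bool:
    """Use the programmer's hint when given; otherwise assume the simpler
    (here: not longer) Yes block marks the likely outcome. Measuring
    simplicity by block length is an assumption of this sketch."""
    if branch.yes_likely is not None:
        return branch.yes_likely
    return len(branch.yes_block) <= len(branch.no_block)


def layout(branch_address: int, branch: ConditionalBranch) -> Dict[str, int]:
    """Place the likely block directly after the branch instruction and
    the unlikely block after it. Returns unit name -> word address."""
    if yes_is_likely(branch):
        ordered = branch.yes_block + branch.no_block
    else:
        ordered = branch.no_block + branch.yes_block
    return {unit: branch_address + 1 + i for i, unit in enumerate(ordered)}
```

The sketch covers only the "place near the branch" step. Avoiding line conflicts with the address decode information would be a separate pass.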
[0058]
Next, a method for reducing the current consumption by lowering the bus state transition probability on the assumption that the cache memory is hit will be described with reference to FIG.
[0059]
For example, in the case of addition (ADD), the result is the same even if the order of the numbers to be added is changed. The power consumed in charging and discharging the bus can therefore be reduced by changing the data order so that the value on the data bus changes little from one transfer to the next. When changing the data order, the bit patterns of adjacent data and instructions are compared, and the order that yields fewer bit changes is chosen so as to lower the state transition probability of the data bus.
[0060]
Changing the order of the data so that fewer bits of the data bus make transitions reduces the power consumption. For each bit, power is consumed in charging or discharging when the bit changes from H to L or from L to H. Ordering the data so that bits stay at H or stay at L (do not change) avoids this charge/discharge current. Furthermore, it becomes less likely that many bits are charged or discharged simultaneously, so the drop in power supply voltage caused by an increase in instantaneous current is also less likely to occur.
[0061]
In the example of the figure, the same result is obtained whether a, b, and c are ordered as a + b + c or as c + b + a. As shown in part (a) of the figure, when data bus transitions are not taken into consideration, eight bit transitions occur and power is consumed by the corresponding charging and discharging. In addition, there can be a moment at which 4 bits change at once, which increases the instantaneous current and may lower the power supply voltage.
[0062]
In contrast, as shown in part (b) of the figure, when data bus transitions are taken into consideration, only six bit transitions occur, and the power consumed by charging and discharging can be suppressed. In addition, since the number of bits that change at one time is held to three, the instantaneous current can be reduced and the drop in the power supply voltage can be lessened.
[0063]
In this case too, the code re-synthesizing means and the code generation means can be implemented in software. For example, (1) the input source code is read by an assembler or a compiler, and a CPU-dependent instruction sequence (a pattern of 0s and 1s) is generated. Next, (2) the code re-synthesizing means or the code generation means (software) that receives this instruction sequence extracts the places where reversing the order has no effect on the operation of the CPU. Then, (3) the list of places that are unaffected by a change of order is read sequentially, and a change of order is tried at each place. If the change reduces the number of bit transitions, the swap is carried out. If it increases the number of bit transitions, the original arrangement is kept.
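Step (3) above can be sketched as follows. This is a minimal illustration, assuming that the words appear on the data bus one after another in address order and that the reorderable span is short enough to try every ordering. The function names and the sample bit patterns are hypothetical and are not the values shown in the figure.

```python
from itertools import permutations


def bit_transitions(a: int, b: int) -> int:
    """Number of bus bits that change state between two consecutive words."""
    return bin(a ^ b).count("1")


def total_transitions(words) -> int:
    """Total bit transitions over a sequence of consecutive bus words."""
    return sum(bit_transitions(x, y) for x, y in zip(words, words[1:]))


def reorder_span(words, start, end):
    """Try every ordering of words[start:end], a span whose order does not
    affect the result, and keep the one with the fewest transitions.
    The original order is kept unless a strictly better one is found."""
    best = list(words)
    best_cost = total_transitions(best)
    for candidate_span in permutations(words[start:end]):
        candidate = list(words[:start]) + list(candidate_span) + list(words[end:])
        cost = total_transitions(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best
```

For the sequence 0000, 1111, 0001, 0111 with the last three words reorderable, the original order costs 9 transitions, while the order 0000, 0001, 0111, 1111 costs 4. Trying all permutations is practical only for the two or three operands mentioned in the text. A pairwise swap test, as in step (3), would scale better for longer spans.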
[0064]
[Effects of the Invention]
As described above in detail, according to the present invention, the instruction order and data order are changed or determined so that instructions and data that are always handled as a unit in the program fit within the line size, and addresses are reassigned or assigned accordingly. This improves the hit rate of the cache memory when the program is actually executed on the target system, and improves the processing speed. In addition, power consumption can be reduced because accesses to the main storage device are reduced. When the instructions and data that are always handled as a unit in the program are larger than the line size, the instruction order and data order can be changed or determined, and addresses reassigned or assigned, so that the boundaries between instructions and data fall on line fetch boundaries and each instruction unit or data unit fits within the line size.
[0065]
Furthermore, in a conditional branch part of the program, the cache hit rate can be further improved by assigning or reassigning addresses so that the branch destination that is more likely to be taken is placed at an address near the conditional branch part.
[0066]
Furthermore, in places where changing the data order in the program does not affect program execution, current consumption due to state transitions of the data bus can be reduced by arranging data and assigning addresses so that the state transitions of the data bus are reduced.
[Brief description of the drawings]
FIG. 1 is a flowchart for explaining an object code re-synthesis method and an object code generation method according to an embodiment of the present invention;
FIG. 2 is a flowchart for explaining an address re-mapping method for an object code re-synthesis method according to an embodiment of the present invention;
FIG. 3 is a flow diagram for explaining an address re-mapping method in a conditional branch part of an object code re-synthesis method according to an embodiment of the present invention;
FIG. 4 is a diagram for explaining an address re-mapping method for reducing a state transition probability of a data bus in an object code re-synthesis method according to an embodiment of the present invention;
FIG. 5 is a flowchart for explaining a conventional object code generation method;
FIG. 6 is a diagram for explaining an overall configuration of a system having a cache memory;
[Explanation of symbols]
1 CPU
2 Cache memory
3 Main storage device (main memory)

Claims

An object code re-synthesizing method in which a computer reassigns addresses to object code generated from a source program using a compiler, an assembler, or a linker, for a computing system that includes a CPU, a main storage device accessed by the CPU, and a cache memory that is provided in the periphery of or inside the CPU and operates faster than the main storage device, wherein the CPU accesses the cache memory when the data corresponding to the address of the main storage device to be accessed is stored in the cache memory and accesses the main storage device when the data corresponding to the address is not stored in the cache memory, and wherein the address of the accessed main storage device and the data corresponding to the address are transferred to the cache memory and stored there,
wherein, when the size of an instruction set consisting of instruction units and data units that are always handled as a unit in the source program is smaller than the size of the data transferred at one time when data is transferred from the main storage device to the cache memory, the instruction set is arranged within one such data size by changing the order of instruction units or data units in the object code, and when the instruction set is larger than the data size, the instruction set is divided into instruction units or data units so as to be no larger than the data size, and each divided instruction set is arranged within one such data size by changing the order of instruction units or data units in the object code,
and wherein, further, in a conditional branch part in the source program, the instruction unit of the process for which information indicating that its condition is likely to be satisfied has been given is arranged, by changing the order of instruction units or data units in the object code, at an address closer to the conditional branch part than the instruction units of the processes for the other conditions.

The object code re-synthesizing method according to claim 1, wherein, when a plurality of data units in the source program yield the same result even if their order is changed, address reassignment is performed so that the data units are arranged in an order that reduces state transitions on the data bus to which the data of each data unit is output.

An object code generation method in which a computer assigns addresses to object code generated from a source program using a compiler, an assembler, or a linker, for a computing system that includes a CPU, a main storage device accessed by the CPU, and a cache memory that is provided in the periphery of or inside the CPU and operates faster than the main storage device, wherein the CPU accesses the cache memory when the data corresponding to the address of the main storage device to be accessed is stored in the cache memory and accesses the main storage device when the data corresponding to the address is not stored in the cache memory, and wherein the address of the accessed main storage device and the data corresponding to the address are transferred to the cache memory and stored there,
wherein, when the size of an instruction set consisting of instruction units and data units that are always handled as a unit in the source program is smaller than the size of the data transferred at one time when data is transferred from the main storage device to the cache memory, the instruction set is arranged within one such data size by changing the order of instruction units or data units in the object code, and when the instruction set is larger than the data size, the instruction set is divided into instruction units or data units so as to be no larger than the data size, and each divided instruction set is arranged within one such data size by changing the order of instruction units or data units in the object code,
and wherein, further, in a conditional branch part in the source program, the instruction unit of the process for which information indicating that its condition is likely to be satisfied has been given is arranged, by changing the order of instruction units or data units in the object code, at an address closer to the conditional branch part than the instruction units of the processes for the other conditions.

The object code generation method according to claim 3, wherein, when a plurality of data units in the source program yield the same result even if their order is changed, address reassignment is performed so that the data units are arranged in an order that reduces state transitions on the data bus to which the data of each data unit is output.