JPH10283192A - Prefetching code generation system

Info

Publication number: JPH10283192A
Application number: JP9090470A
Authority: JP
Inventors: Giichi Tanaka; 義一田中; Shinichi Ito; 信一伊藤; Yuji Tsushima; 雄次對馬
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-04-09
Filing date: 1997-04-09
Publication date: 1998-10-23

Abstract

PROBLEM TO BE SOLVED: To hide the main memory penalty when the data area accessed by a loop exceeds the cache capacity. The inner loop is split into two; prefetching of the data required in the first-half loop is started in the second-half loop of the preceding outer-loop iteration, and prefetching of the data required in the second-half loop is started in the first-half loop. SOLUTION: The inner loop is split into two, and code is generated in which the second-half loop 41 prefetches the data required at the beginning of the next outer-loop iteration, and the first-half loop 40 prefetches the data required in the second half of the same outer-loop iteration. The work is divided between the first-half loop 40 and the second-half loop 41 so that the processing time of the second-half loop 41 is at least the main memory penalty time. If the processing time of the second-half loop 41 cannot be made at least the main memory penalty, the outer loop is unrolled to increase the amount of computation so that the main memory penalty is completely hidden.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to the execution of programs that use instruction-level parallel processing, and in particular to a code generation method suited to loops whose data is too large to fit in the cache.

[0002]

[Prior Art] As one direction for supercomputers, parallel processing systems that use scalar processors as node processors are considered promising. Parallel supercomputers built from scalar processors are attractive because scalar processor performance has improved dramatically, both through higher clock frequencies enabled by advances in semiconductor technology and through instruction-level parallel processing, such as superscalar designs that make effective use of multiple functional units capable of executing in parallel. However, this high scalar performance is achieved only when the cache works effectively.

[0003] A scalar processor generally comprises an instruction processing unit, a cache, and a memory unit. It stores programs and data in memory and processes the data in memory according to the instructions in the program. The cache is a relatively small store with a short access time from the instruction processing unit, and it temporarily holds part of the program and data. Data needed to execute an instruction is read from memory, and at the same time the data block containing it is copied into a cache line; subsequent references to data in that block are served from the cache line. This transfer of a data block from memory to a cache line is called a line transfer.

[0004] When the data needed to execute an instruction is not in the cache, this is called a cache miss, and a cache miss triggers a line transfer. In a conventional scalar processor, when a line transfer occurs during instruction execution, execution of that instruction waits until the transfer completes. Frequent cache misses therefore increase program execution time because of the waiting associated with line transfers, degrading the processing performance of the scalar processor. In large-scale scientific and engineering computation in particular, data areas are large and data locality is low, so the performance loss due to cache misses is severe.

[0005] In response, recent attempts avoid this performance degradation by having the program execute, ahead of an instruction that may cause a cache miss, a special instruction that requests the data in advance. This advance fetching of data is called prefetching. For example, the PowerPC microprocessor from IBM (USA) has a prefetch instruction that reads the data at a specified operand address into the cache. If this operation causes a cache miss, the line transfer proceeds without waiting for the prefetch instruction to complete. Consequently, by the time an instruction that references the data is executed, the data is already in the cache and the performance degradation is avoided.

[0006] A code generation technique that uses such prefetch instructions to overlap data transfer with computation, thereby effectively hiding main memory latency, is described in Todd C. Mowry, Monica S. Lam, Anoop Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching", ASPLOS-V, ACM 0-89791-535-6/92/0010/0062, pp. 62-73.

[0007]

[Problems to Be Solved by the Invention] However, the conventional code generation technique described above has the following problems. When generating prefetch code for a nested loop such as the one formed by DO20 (20) and DO10 (21) in FIG. 2, only the inner loop DO10 (21) is taken into account. The problem is illustrated concretely with an example.

[0008] First, the architectural assumptions underlying the code generation below are stated. The target scalar processor is assumed to be able to execute two load/store instructions, two floating-point operations, and two integer operations simultaneously, with a cache line size of 32 bytes and a main memory penalty of 75 cycles on a cache miss.

[0009] For the source program of FIG. 2, the compiler first searches for operands to prefetch, using locality analysis or similar techniques. Suppose the result is that the reference b(i, j) 22 is the operand to prefetch. If an operand is 8 bytes, a single prefetch brings the 32 contiguous bytes containing the target operand into the cache, so executing a prefetch instruction on every loop iteration would be wasteful; in this example it suffices to issue a prefetch instruction once every four iterations. For this reason, the loop structure conversion unit of the compiler generally unrolls the innermost loop 21 four times.
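
The relationship between line size, operand size, and unroll factor can be written out as a small calculation. The following is a minimal sketch, not part of the patent: the function name `prefetch_interval` is hypothetical, and it assumes unit-stride access with the operand size dividing the line size evenly.

```python
def prefetch_interval(line_size_bytes: int, operand_size_bytes: int) -> int:
    """Number of consecutive loop iterations served by one prefetched line.

    Assumes unit-stride access and an operand size that divides the line size.
    """
    if operand_size_bytes <= 0 or line_size_bytes % operand_size_bytes != 0:
        raise ValueError("operand size must be positive and divide the line size")
    return line_size_bytes // operand_size_bytes


# 32-byte line, 8-byte operands: one prefetch per four iterations.
print(prefetch_interval(32, 8))
```

With the 32-byte line assumed in the text the result is 4, which is the unroll factor applied to loop 21.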

[0010] FIG. 3 shows the result. Here, loop 21 has been split into loop 25 and loop 26. Loop 26 is the remainder loop of the unrolling. For simplicity, remainder loops are omitted from the discussion below. Prefetch code is then generated after the usual optimization and code generation.

[0011] Prefetch code is generated as follows. First, the number of execution cycles per iteration of loop 25 is estimated. Loop 25 contains four load instructions, four store instructions, four floating-point operations, and one prefetch instruction, so the number of cycles per iteration is estimated as 5 (= max(ceil((4 + 4 + 1) / 2), ceil(4 / 2))). Here, ceil denotes the smallest integer not less than its argument. Since the main memory penalty on a cache miss is 75 cycles, the main memory penalty can be hidden by generating code that, in each iteration, prefetches the data to be used 15 iterations later (ceil(75 / 5) = 15).
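
The estimate above can be reproduced with a few lines of code. This is an illustrative sketch only: the function names are hypothetical, and it assumes, as the formula in the text does, that prefetch instructions share the two load/store issue slots and that floating-point operations use the two floating-point units.

```python
from math import ceil


def cycles_per_iteration(loads, stores, prefetches, fp_ops,
                         mem_units=2, fp_units=2):
    """Resource-bound cycle estimate for one loop iteration."""
    mem_cycles = ceil((loads + stores + prefetches) / mem_units)
    fp_cycles = ceil(fp_ops / fp_units)
    return max(mem_cycles, fp_cycles)


def prefetch_distance(penalty_cycles, cycles_per_iter):
    """How many iterations ahead a prefetch must be issued to hide the penalty."""
    return ceil(penalty_cycles / cycles_per_iter)


# Loop 25: 4 loads, 4 stores, 1 prefetch, 4 floating-point operations.
ii = cycles_per_iteration(loads=4, stores=4, prefetches=1, fp_ops=4)
print(ii, prefetch_distance(75, ii))  # 5 15
```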

[0012] FIG. 4 shows the code of FIG. 3 with prefetch instructions added. Part A 30 prefetches the data needed in the first iterations of the loop. Part B 31 is a loop containing the computation together with prefetch instructions for the data to be used 15 iterations later. Part C 32 is the loop for the final iterations, which contains no prefetch instructions.

[0013] FIG. 5 schematically shows the resulting execution. The horizontal axis is time, hatched boxes are prefetch processing, and empty boxes are computation. For drawing convenience, the figure does not match the program of FIG. 4 numerically: Part A of FIG. 4 is a prefetch loop of 15 iterations, but FIG. 5 shows only 3. The problem lies in Part A, where the main memory access penalty is exposed. This is especially problematic when the loop iteration count is small.

[0014] The performance of this code is estimated below. For a loop length of N = 100, Part A takes at least the 75 cycles of the main memory access penalty, Part B takes 50 cycles (5 cycles per iteration × 10 iterations), and Part C takes 60 cycles (4 cycles per iteration × 15 iterations), for a total of 185 cycles, of which 40% (75 / 185 × 100%) is overhead spent waiting for main memory. When the loop length is as small as 50, Part A takes at least 75 cycles, Part B takes 0 cycles because the loop is too short for it to execute, and Part C takes 48 cycles (4 cycles per iteration × 12 iterations), for a total of 133 cycles, of which 56% (75 / 133 × 100%) is overhead spent waiting for main memory, which is a serious problem. Part A is stated as taking "at least" 75 cycles because a real machine can process only a limited number of prefetch instructions concurrently; here that limit is assumed to be sufficiently large.
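
The arithmetic for the N = 100 case can be checked directly. The sketch below is illustrative only; the function name `wait_fraction` is hypothetical, and the cycle counts for Parts A, B, and C are simply the figures stated in the text.

```python
def wait_fraction(part_a_cycles, part_b_cycles, part_c_cycles):
    """Fraction of total time spent in Part A, i.e. waiting on main memory."""
    total = part_a_cycles + part_b_cycles + part_c_cycles
    return part_a_cycles / total


# N = 100 with the inner loop unrolled four times.
part_a = 75        # exposed main memory penalty
part_b = 5 * 10    # 5 cycles per iteration, 10 iterations with prefetch
part_c = 4 * 15    # 4 cycles per iteration, 15 iterations without prefetch
print(round(100 * wait_fraction(part_a, part_b, part_c), 1))  # 40.5
```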

[0015] Code generation that unrolls the inner loop to reduce the number of prefetch instructions, as described above, also has the following problem. Such loop restructuring is normally performed in the front part of the compiler, after which optimization is applied and the code generator inserts the prefetch instructions. As a result, most of the compiler's passes must handle intermediate code enlarged by the unrolling, and compilation takes a long time. The example here used a 32-byte line, so the unrolling was only fourfold, but a machine with a 128-byte line requires 16-fold unrolling, which makes the problem even more serious.

[0016] An object of the present invention is to provide a code generation method using prefetch instructions that solves the above problems.

[0017]

[Means for Solving the Problems] As shown in FIG. 6, the above object is achieved by splitting the inner loop processing into two and generating code in which the second-half loop 41 prefetches the data needed at the beginning of the next outer-loop iteration, and the first-half loop 40 prefetches the data needed in the second half of the same outer-loop iteration.

[0018] In the example, while the inner-loop iterations I = 3, 4, 5, 6, 7 of outer-loop iteration J = 1 are processed, instructions are inserted to prefetch the data needed by inner-loop iterations I = 1, 2, 3, 4, 5 of outer-loop iteration J = 2; and while inner-loop iterations I = 1, 2 of outer-loop iteration J = 1 are processed, instructions are inserted to prefetch the data needed by the second half of the inner loop, I = 6, 7. The iterations are divided between the first-half loop 40 and the second-half loop 41 so that the processing time of the second-half loop 41 is at least the main memory penalty time.
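
The pairing of computed iterations with prefetched iterations in this example can be tabulated with a short sketch. It is not the patent's code: the function name `prefetch_schedule` is hypothetical, and dropping the prefetches that would target an outer iteration past the end of the loop is an assumption made here, since the text does not say how the last outer iteration is handled.

```python
def prefetch_schedule(outer_len, inner_len, second_half_len):
    """Map each computed iteration (j, i) to the iteration whose data it prefetches.

    The first-half loop prefetches for the second half of the same outer
    iteration; the second-half loop prefetches for the first iterations of
    the next outer iteration. Indices are 1-based as in the Fortran example.
    """
    first_half_len = inner_len - second_half_len
    schedule = []
    for j in range(1, outer_len + 1):
        for i in range(1, inner_len + 1):
            if i <= first_half_len:
                schedule.append(((j, i), (j, i + second_half_len)))
            elif j < outer_len:  # assumption: no prefetch past the last outer iteration
                schedule.append(((j, i), (j + 1, i - first_half_len)))
    return schedule


for current, target in prefetch_schedule(outer_len=2, inner_len=7, second_half_len=5):
    print(current, "->", target)
```

In this model every prefetch is issued exactly `second_half_len` inner iterations before its data is used, which is why the second-half loop must run for at least the main memory penalty. The data for J = 1, I = 1 to 5 has no earlier iteration to prefetch it and must be fetched before the loop starts.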

[0019] When the inner loop contains few operations and its loop length is short, it may not be possible to make the processing time of the second-half loop 41 at least the main memory penalty. In that case, the outer loop is unrolled to increase the amount of computation, so that the main memory penalty is completely hidden whenever the loop length is at or above an assumed specified value.

[0020] The increase in compilation time caused by unrolling the inner loop to reduce the number of prefetch instructions is solved by duplicating the code in the code generation phase, in the latter part of the compiler.

[0021]

[Embodiments of the Invention] An embodiment of the present invention in a compiler is described below with reference to the drawings. FIG. 1 shows the overall structure of the compiler. The source program 1 of FIG. 1 is converted into intermediate code 3 by syntax analysis 2. The loop structure conversion unit 4 takes this as input, unrolls the outer loop so that a larger share of the main memory latency is hidden by the prefetch code for nested loops, and outputs intermediate code 5. The optimization unit 6 performs the usual well-known optimizations and outputs intermediate code 7. The code generation unit 8 generates object code 9, which contains prefetch instructions, from intermediate code 7. The present invention concerns units 4 and 8 and improves the execution efficiency of object code 9.

[0022] FIG. 7 shows the part of the loop structure conversion unit 4 of FIG. 1 that is concerned with efficiently hiding the main memory access penalty in nested loops. The FORTRAN program of FIG. 2 is used as an example of the source program 1 input to FIG. 7. DO20 (20) and DO10 (21) in FIG. 2 are loops, and together they form a doubly nested loop. For such a program, the processing of FIG. 7 proceeds as follows and produces intermediate code 5.

[0023] The processing of FIG. 7 handles the nested loops 50 in the source program one at a time. Decision 51 selects the nested loops for which prefetch code that takes the nesting into account will be emitted. Here, the selection is restricted to regular nested loops, in which one loop contains exactly one loop. The source program of FIG. 2 qualifies because the outer loop DO20 (20) contains only the single loop DO10 (21). If the outer loop DO20 (20) in this example contained several DO loops, the loop structure would be judged unsuitable, because the present invention does not emit nesting-aware prefetch instructions in that case.

[0024] Process 52 estimates the number of execution cycles per iteration of the original inner loop. The loop contains one floating-point operation, one load, and one store, so on the assumed architecture each iteration takes II = max(ceil((1 + 1) / 2), ceil(1 / 2)) = 1 cycle.

[0025] Process 53 determines MINL, the minimum loop length that completely hides the main memory access penalty for the data needed by the next outer-loop iteration. Since the main memory access penalty is 75 cycles, MINL = 75. In other words, as things stand, if the inner loop length is less than 75, the data needed to execute the next outer-loop iteration cannot be made ready without overhead, and the processor will wait for data. In real programs the loop length is often smaller than this.

[0026] Suppose, therefore, that the compiler's code generation policy is to generate code that completely hides the main memory access penalty whenever the loop length is 40 or more. The specified value in decision 54 is the value fixed by the compiler's code generation policy in this way. For the source program of FIG. 2, MINL = 75 and the specified value is 40, so processing proceeds to decision 55. Decision 55 uses data dependence analysis to check whether the program still performs the same computation as the original after the outer loop is unrolled. Whether the outer loop can be unrolled is equivalent to whether the loops of the nest can be interchanged, and this can be determined with known techniques such as the method described in Giichi Tanaka (田中義一) and Kyoko Iwasawa (岩沢京子), "Compiling Techniques for Vector Computers", Joho Shori (情報処理), Vol. 31, No. 6, pp. 736-743 (1990). In the source program of FIG. 2, interchanging the loops DO20 (20) and DO10 (21) gives the same result, so the outer loop can be unrolled.

[0027] Process 56 determines the outer-loop unroll factor and unrolls the outer loop in the intermediate code. In the example, the outer loop is unrolled ceil(75 / 40) = 2 times, giving the result shown in FIG. 8. The intermediate code is shown here in a Fortran-like notation. DO20 (60) and DO10 (61) form the nested loop with the outer loop unrolled by two, and DO11 (62) is the remainder loop of that unrolling. For simplicity, remainder loops are not discussed further. Decision 57 ensures that the above processing is applied to every nested loop.
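
Processes 53 and 56 amount to two ceiling divisions, sketched below. The function names are hypothetical. Writing MINL as ceil(penalty / II) and the unroll factor as ceil(MINL / specified value) is an assumption made for illustration that matches the worked numbers (II = 1, MINL = 75, specified value 40, factor 2); the patent gives only those numbers, not a general formula.

```python
from math import ceil


def min_hiding_loop_length(penalty_cycles, cycles_per_iter):
    """MINL: shortest inner loop whose run time covers the main memory penalty."""
    return ceil(penalty_cycles / cycles_per_iter)


def outer_unroll_factor(minl, specified_loop_length):
    """Outer-loop unroll factor so that loops of the specified length hide the penalty.

    Returns 1 (no unrolling) when the specified length already reaches MINL.
    """
    if specified_loop_length >= minl:
        return 1
    return ceil(minl / specified_loop_length)


minl = min_hiding_loop_length(penalty_cycles=75, cycles_per_iter=1)
print(minl, outer_unroll_factor(minl, specified_loop_length=40))  # 75 2
```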

[0028] Next, in FIG. 1, the optimization unit 6 performs the usual optimizations on the intermediate code, and code generation 8 follows. FIGS. 9 and 10 show the part of the code generation unit 8 that is concerned with efficiently hiding the main memory access penalty in nested loops.

[0029] Process 90 selects the load operands in the innermost loop that are to be prefetched. The selection follows, for example, the known technique of the reference cited above (ACM 0-89791-535-6/92/0010/0062, pp. 62-73). Suppose that, as a result, b(i, j) and b(i, j+1) in DO10 of FIG. 8 are selected as the operands to prefetch.

[0030] Process 91 determines, for every operand to be prefetched, the loop interval at which a prefetch instruction should be inserted, and then takes the greatest common divisor of those intervals. Because the cache line size is 32 bytes, prefetching b(i, j) fetches the 32 contiguous bytes that contain b(i, j). If b(i, j) lies at the start of a cache line, the data for b(i, j), b(i+1, j), b(i+2, j), and b(i+3, j) is prefetched, so it suffices to issue a prefetch instruction once every four iterations. Likewise, once every four iterations suffices for b(i, j+1). Since the insertion interval is 4 for both b(i, j) and b(i, j+1), the overall loop interval at which prefetch instructions are issued is their greatest common divisor, LC = 4.

[0031] Process 92 splits the iteration range of the innermost loop into two. As noted in the box describing this process, the length of the second-half loop is set to the loop length that makes its execution time at least the main memory access penalty or, if the loop is too short for its execution time to reach the main memory access penalty, to the original loop length; the first-half loop handles the remainder. The first-half loop may therefore not be executed at all.

[0032] The estimated execution time per iteration of DO10 (61) in FIG. 8 is 2 cycles (= max(ceil((2 + 2) / 2), ceil(2 / 2))), LC = 4, and prefetch instructions are issued for b(i, j) and b(i, j+1), so the prefetch instructions take 1 execution cycle. The estimated execution time for LC iterations of the innermost loop is therefore 9 cycles (2 × 4 + 1). Hence the length of the second-half loop is min(original loop length, ceil(75 / 9)). As a result, DO10 (61) of FIG. 8 can be split into the first-half loop DO10 (65) and the second-half loop DO11 (66) of FIG. 11.
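
Processes 91 and 92 can be put together in a few lines. This is a sketch under stated assumptions, with hypothetical function names: the per-operand intervals are taken as given (4 and 4), and the second-half length mirrors the expression in the text, min(original loop length, ceil(75 / 9)), without interpreting its units further.

```python
from functools import reduce
from math import ceil, gcd


def issue_interval(per_operand_intervals):
    """LC: greatest common divisor of the per-operand prefetch intervals."""
    return reduce(gcd, per_operand_intervals)


def cycles_per_group(cycles_per_iter, lc, prefetch_cycles):
    """Estimated cycles for LC iterations plus the prefetches issued with them."""
    return cycles_per_iter * lc + prefetch_cycles


def second_half_length(original_len, penalty_cycles, group_cycles):
    """Second-half loop length as given in the text."""
    return min(original_len, ceil(penalty_cycles / group_cycles))


lc = issue_interval([4, 4])                                       # b(i, j) and b(i, j+1)
group = cycles_per_group(cycles_per_iter=2, lc=lc, prefetch_cycles=1)
print(lc, group, second_half_length(100, 75, group))              # 4 9 9
```

When the original loop is shorter than ceil(75 / 9), the second-half loop takes all of it and the first-half loop is empty, as paragraph [0031] notes.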

[0033] FIG. 11 shows the first-half loop DO10 (65) and the second-half loop DO11 (66) as DO loop constructs. Because this form is still some distance from the actual object code, FIG. 12 shows an intermediate representation in which the innermost loop is expressed closer to the machine-language level. Statements 70 to 78 correspond to the first-half loop DO10 (65) of FIG. 11, and statements 80 to 88 correspond to the second-half loop DO11 (66) of FIG. 11.

[0034] Statements 71, 72, 73, 76, 77, and 78 implement the control of DO loop 10. Statement 71 sets the variable cnt to the number of iterations to be processed, statement 76 decrements it by one, and statement 77 repeats the loop. Statement 74 is the original loop body, statement 70 initializes the loop control variable, and statement 75 updates the loop control variable.

[0035] Similarly, statements 81, 82, 83, 86, 87, and 88 implement the control of DO loop 11. Statement 81 sets the variable cnt to the number of iterations to be processed, statement 86 decrements it by one, and statement 87 repeats the loop. Statement 84 is the original loop body, statement 80 initializes the loop control variable, and statement 85 updates the loop control variable.
[0036] Moving on to FIG. 10, process 93 generates the code for the first-half loop. First, prefetch instructions for the second half of the inner loop are inserted; statements 100, 101, and 102 of FIG. 13 are this code. Statement 100 computes the address of the data that the inner loop will need ceil(75 / 9) iterations ahead. Statement 101 is the prefetch instruction, and statement 102 is the address update code that allows the prefetch to be issued once every LC iterations. Next, the loop body code is duplicated 3 (= LC - 1) times and inserted: the loop body statements 74 to 77 of FIG. 12 are duplicated to give statements 103 to 106 and statements 107 to 110. The branch target of the loop termination tests in statements 77, 106, and 110 is changed from lab1 (statement 73) to lab2 (statement 78), immediately after the loop. The loop body code is then duplicated once more and inserted as statements 111 to 114.
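
The control structure that results from this duplication can be sketched as follows. FIG. 13 is not reproduced in the text, so this is only an approximation of its shape with hypothetical names: `body` and `prefetch` are caller-supplied callbacks, LC is fixed at 4, the `while` test plays the role of the branch back to lab1 in the last copy, and each `break` plays the role of a branch to lab2.

```python
def run_unrolled_by_duplication(cnt, body, prefetch):
    """Run `cnt` iterations with the body duplicated four times (LC = 4).

    One prefetch is issued per pass; the first three copies exit the loop
    as soon as the iteration count reaches zero, so no remainder loop is needed.
    """
    i = 1
    while cnt > 0:              # lab1: top of the duplicated loop
        prefetch(i)             # one prefetch covers the next LC iterations
        body(i); i += 1; cnt -= 1
        if cnt == 0:
            break               # branch to lab2
        body(i); i += 1; cnt -= 1
        if cnt == 0:
            break
        body(i); i += 1; cnt -= 1
        if cnt == 0:
            break
        body(i); i += 1; cnt -= 1
    # lab2: immediately after the loop


executed, prefetched = [], []
run_unrolled_by_duplication(10, executed.append, prefetched.append)
print(executed)     # iterations 1 through 10
print(prefetched)   # [1, 5, 9]
```

Because the copies are made at code generation time, the earlier compiler passes still see the single original body, which is the point made in paragraph [0020]. In real generated code the prefetch would target an address a fixed distance ahead rather than the current index.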

[0037] Process 94 generates the code for the second-half loop. It differs from the first-half loop in the starting address of the operands to prefetch: in the first-half loop it is the data needed in the second half of the inner loop of the current outer-loop iteration, whereas in the second-half loop it is the data needed in the first half of the inner loop of the next outer-loop iteration. First, prefetch instructions for the data needed in the next outer-loop iteration are inserted; statements 120, 121, and 122 of FIG. 14 are this code. Statement 120 computes the address of the data needed in the next outer-loop iteration. Statement 121 is the prefetch instruction, and statement 122 is the address update code that allows the prefetch to be issued once every LC iterations. Next, the loop body code is duplicated 3 (= LC - 1) times and inserted: the loop body statements 84 to 87 of FIG. 12 are duplicated to give statements 123 to 126 and statements 127 to 130. The branch target of the loop termination tests in statements 87, 126, and 130 is changed from lab3 (statement 83) to lab4 (statement 88), immediately after the loop. The loop body code is then duplicated once more and inserted as statements 131 to 134.

【００３８】処理９５は、多重ループの入り口に、初期
プリフェッチ命令を挿入する。図１３の文１４０がこれ
である。すなわち、最初の外側ループ繰り返し時におけ
る、最初の内側ループの最初に必要となるデータは、こ
こでプリフェッチをかける。しかし、このプリフェッチ
は、他の処理とオーバラップができないので、主記憶ペ
ナルティが隠蔽できない部分である。図６のｊ＝１，ｉ
＝１，２，３，４，５のプリフェッチ処理に相当する処
理である。Process 95 inserts an initial prefetch instruction at the entry of the multiple loop. This is statement 140 in FIG. 13. That is, the data required at the start of the first inner loop in the first outer loop iteration is prefetched here. However, this prefetch cannot be overlapped with other processing, so it is the one part where the main memory penalty cannot be hidden. It corresponds to the prefetch processing for j = 1, i = 1, 2, 3, 4, 5 in FIG. 6.
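The overall prefetch schedule, including the initial prefetch of process 95, can be sketched as an ordered event list. The Python below is a simplified model under stated assumptions: prefetches are grouped at the start of each half loop rather than interleaved with the computation, and the second-half loop is assumed to issue no prefetch in the last outer iteration. The names `schedule`, `all_prefetched_before_use`, and the parameter `split` (the last index of the first-half loop) are hypothetical.

```python
def schedule(m, n, split):
    """Return ("prefetch" | "use", j, i) events for an m x n double loop.

    The inner range 1..n is split into a first half 1..split and a
    second half split+1..n (an assumed split point, e.g. 5 as in FIG. 6).
    """
    # Initial prefetch at loop entry: first-half data of outer iteration 1.
    events = [("prefetch", 1, i) for i in range(1, split + 1)]
    for j in range(1, m + 1):
        # First-half loop: prefetch second-half data of the same j.
        events += [("prefetch", j, i) for i in range(split + 1, n + 1)]
        events += [("use", j, i) for i in range(1, split + 1)]
        # Second-half loop: prefetch first-half data of the next j.
        if j < m:
            events += [("prefetch", j + 1, i) for i in range(1, split + 1)]
        events += [("use", j, i) for i in range(split + 1, n + 1)]
    return events


def all_prefetched_before_use(events):
    """Check that every element is prefetched before it is used."""
    seen = set()
    for kind, j, i in events:
        if kind == "prefetch":
            seen.add((j, i))
        elif (j, i) not in seen:
            return False
    return True


if __name__ == "__main__":
    ev = schedule(3, 8, 5)
    print(all_prefetched_before_use(ev))
```

In this model only the initial prefetch has no earlier computation to overlap with, which matches the statement that it is the part where the main memory penalty cannot be hidden.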

【００３９】[0039]

【発明の効果】本発明のコード生成方式により、多重ル
ープにおいて、アクセスするデータ領域がキャッシュ容
量を超える場合も、内側ループ処理を２つのループに分
け、前半ループに必要なデータは、１つ前の外側ループ
の繰り返し時の後半ループでデータのプリフェッチを開
始し、後半ループで必要なデータは、前半ループでデー
タのプリフェッチを開始するため、主記憶ペナルティを
隠蔽することができる。[Effect of the Invention] According to the code generation method of the present invention, even when the data area accessed in a multiple loop exceeds the cache capacity, the inner loop processing is divided into two loops. Prefetching of the data required by the first-half loop is started in the second-half loop of the preceding outer loop iteration, and prefetching of the data required by the second-half loop is started in the first-half loop, so the main memory penalty can be hidden.

【００４０】例えば従来の手法では、先に述べたよう
に、外側ループ長Ｍに依存せず、内側ループ長Ｎが１０
０の場合、全体の時間の内４０％がデータ待ちの状態，
内側ループ長Ｎが５０の場合は、全体の時間の内５６％
がデータ待ちの状態であった。これに対して本発明の実
施によれば、最内側ループ長が９（＝ceil(７５／９)）
以上であれば、外側ループの１回目の処理を除き、２回
目以降の処理では、完全に主記憶ペナルティを隠蔽でき
る。即ち、内側ループ長が１００，５０のケースとも、
外側ループ長が非常に大きければ、データ待ちの状態は
発生しない。外側ループ長が５０の場合、内側ループ長
が１００，５０の場合は、全体の時間の内、１.３％，
２.６％がデータ待ちの状態で、外側ループ長が１０の
場合、内側ループ長が１００，５０の場合は、全体の時
間の内、６.２５％，１１.８％がデータ待ちの状態と改
善することができる。For example, with the conventional method, as described above, 40% of the total time was spent waiting for data when the inner loop length N was 100, and 56% when N was 50, regardless of the outer loop length M. In contrast, according to the present invention, if the innermost loop length is 9 (= ceil(75/9)) or more, the main memory penalty can be completely hidden in the second and subsequent iterations of the outer loop, that is, in every iteration except the first. In other words, for inner loop lengths of both 100 and 50, no data wait occurs if the outer loop length is very large. When the outer loop length is 50, the time spent waiting for data is reduced to 1.3% and 2.6% of the total time for inner loop lengths of 100 and 50, respectively; when the outer loop length is 10, it is reduced to 6.25% and 11.8%, respectively.
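The figures above can be checked with a short calculation. The Python sketch below rests on assumptions that this passage does not state explicitly: the expression ceil(75/9) is read as a main memory penalty of 75 cycles divided by 9 cycles per loop iteration; the only unhidden wait is one penalty in the first outer iteration; and the computation per outer iteration is 1.5 times the penalty for N = 100 (the ratio implied by the 40% conventional figure) and 0.75 times for N = 50. The function names are hypothetical.

```python
import math


def min_second_half_iterations(penalty_cycles, cycles_per_iteration):
    """Smallest iteration count whose run time covers the penalty."""
    return math.ceil(penalty_cycles / cycles_per_iteration)


def wait_fraction(outer_len, compute_per_penalty):
    """Fraction of total time spent waiting when only one penalty is exposed.

    compute_per_penalty: computation time of one outer iteration,
    expressed in units of the main memory penalty (an assumed model).
    """
    return 1.0 / (1.0 + outer_len * compute_per_penalty)


if __name__ == "__main__":
    print(min_second_half_iterations(75, 9))
    for outer_len in (50, 10):
        for inner_len, ratio in ((100, 1.5), (50, 0.75)):
            pct = 100.0 * wait_fraction(outer_len, ratio)
            print(f"M={outer_len} N={inner_len}: {pct:.2f}% waiting")
```

Under this model the wait fraction tends to zero as the outer loop length grows, consistent with the statement that no data wait occurs for very long outer loops.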

[Brief description of the drawings]

【図１】コンパイラ全体の構成図。FIG. 1 is a configuration diagram of an entire compiler.

【図２】ソースプログラム例を示す図。FIG. 2 is a diagram showing an example of a source program.

【図３】従来法による、内側ループ展開後のコード例を
示す図。FIG. 3 is a diagram showing a code example after inner loop expansion according to a conventional method.

【図４】従来法による多重ループにおけるプリフェッチ
命令を使用したコード例を示す図。FIG. 4 is a diagram showing a code example using a prefetch instruction in a multiple loop according to a conventional method.

【図５】従来法による実行の様子を示した説明図。FIG. 5 is an explanatory diagram showing a state of execution by a conventional method.

【図６】本発明による実行の様子を示した説明図。FIG. 6 is an explanatory diagram showing a state of execution according to the present invention.

【図７】本発明による多重ループの外側ループの展開方
式の処理フロー図。FIG. 7 is a processing flow diagram of an unrolling method of an outer loop of a multiple loop according to the present invention.

【図８】本発明による外側ループ展開後のコード例を示
す図。FIG. 8 is a diagram showing a code example after the outer loop is expanded according to the present invention.

【図９】本発明による多重ループにおけるプリフェッチ
コード生成処理のフロー図。FIG. 9 is a flowchart of a prefetch code generation process in a multiplex loop according to the present invention.

【図１０】本発明による多重ループにおけるプリフェッ
チコード生成処理のフロー図。FIG. 10 is a flowchart of a prefetch code generation process in a multiplex loop according to the present invention.

【図１１】本発明によるループ分割後のコード例を示す
図。FIG. 11 is a diagram showing a code example after loop division according to the present invention.

【図１２】本発明によるループ分割後の機械語に近いコ
ード例を示す図。FIG. 12 is a diagram showing a code example close to a machine language after loop division according to the present invention.

【図１３】本発明による多重ループにおけるプリフェッ
チ命令を使用したコード例を示す図。FIG. 13 is a diagram showing a code example using a prefetch instruction in a multiple loop according to the present invention.

【図１４】本発明による多重ループにおけるプリフェッ
チ命令を使用したコード例を示す図。FIG. 14 is a diagram showing a code example using a prefetch instruction in a multiple loop according to the present invention.

[Explanation of symbols]

１…ソースプログラム、２…構文解析部、３…中間語、
４…ル−プ構造変換部、５…中間語、６…最適化部、７
…中間語、８…コード生成部、９…オブジェクトコー
ド。1 ... source program, 2 ... parsing unit, 3 ... intermediate language, 4 ... loop structure conversion unit, 5 ... intermediate language, 6 ... optimization unit, 7 ... intermediate language, 8 ... code generation unit, 9 ... object code.

Claims

[Claims]

1. A code generation method for compiling a source program into an object program, wherein, for a multiple-loop portion of the source program in which each loop has a number of iterations that is determined immediately before execution of that loop, when the outer loop of the multiple loop is loop 1 and the innermost loop is loop 2, a prefetch code is generated that transfers part of the data required for the operation of loop 2 to a cache during the preceding iteration of loop 1.

2. A code generation method for compiling a source program into an object program, wherein, for a multiple-loop portion of the source program in which each loop has a number of iterations that is determined immediately before execution of that loop, when the outer loop of the multiple loop is loop 1 and the innermost loop is loop 2, the iteration range of loop 2 is divided into two ranges forming a first-half loop and a second-half loop; the code of the second-half loop comprises a prefetch code that transfers to the cache the data required at the beginning of the operation of the first-half loop in the next iteration of loop 1, together with the original code; and the code of the first-half loop comprises a prefetch code that transfers to the cache, among the data required for the operation of the first-half loop, the data other than the data transferred to the cache in the second-half loop, together with the original code.

3. The code generation method according to claim 2, wherein, in dividing the iteration range of loop 2, the processing time of the second-half loop is estimated, and the number of iterations of the second-half loop is set to the smaller of the estimated number of loop iterations at which that processing time exceeds the main memory access penalty, which is the transfer time from memory to the cache, and the total number of loop iterations.

4. The code generation method according to claim 3, wherein the processing time of the second-half loop is estimated; when the estimated number of loop iterations at which the processing time exceeds the main memory penalty, which is the transfer time from memory to the cache, is larger than a given reference value, analysis is performed to determine whether loop unrolling can be applied to loop 1; and, if unrolling is possible, the processing time of the second-half loop after unrolling is estimated and code is generated in which loop 1 is unrolled until the estimated number of loop iterations at which the processing time exceeds the main memory access penalty becomes smaller than the reference value.

5. The code generation method according to claim 2, wherein the loop iteration interval at which a prefetch is required is determined as the number of expansions; the code of the second-half loop comprises the prefetch code that transfers to the cache the data required at the beginning of the operation of the first-half loop in the next iteration of loop 1, together with the original processing code duplicated as many times as the number of expansions; and the code of the first-half loop comprises a prefetch code that transfers to the cache, among the data required for the operation of the first-half loop, the data other than the data transferred to the cache in the second-half loop, together with the original processing code duplicated as many times as the number of expansions.