JPH06290057A - Loop optimizing method

Info

Publication number: JPH06290057A
Application number: JP7787993A
Authority: JP
Inventors: Shigehisa Sato; 茂久佐藤; Keisuke Toyama; 圭介十山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-04-05
Filing date: 1993-04-05
Publication date: 1994-10-18

Abstract

PURPOSE: To enable a compiler to optimize loops for an architecture that can invalidate the execution result of a previously executed instruction depending on whether a subsequent conditional branch is taken. CONSTITUTION: A loop conversion selecting part 107 in a loop optimizing part 104 determines the loop conversion method, and when an instruction that uses the invalidating function is available, a loop conversion executing part 108 converts the loop using that instruction to optimize it. Consequently, the number of instructions or the number of registers used can be reduced without losing the effect of conventional loop conversion, and loop conversion can be performed even when the iteration count of the loop is not determined when the loop starts.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to compiler optimization for an architecture having a function that invalidates an execution result depending on whether a subsequent conditional branch is taken, and in particular to a loop optimization method that uses this invalidation function to improve execution speed.

[0002]

2. Description of the Related Art Many current computers, particularly reduced instruction set computers (RISC), use pipelining and can ideally process one instruction per cycle. In practice, however, dependencies involving data and functional units make it difficult to issue a new instruction every cycle.

[0003] An instruction and an instruction that refers to its execution result are said to have a data dependency. Such a dependency arises, for example, when a load from memory or an arithmetic operation sets a value in a register or condition code and a subsequent instruction refers to that value. Dependent instructions cannot be reordered, and in addition the referencing instruction must not start executing until the value it refers to has been set. When two instructions compete for the same functional unit, their execution order is not restricted, but one cannot execute while the other is executing.

[0004] The mechanism that delays the start of a subsequent instruction when, because of these constraints, it must not be executed immediately is called a pipeline interlock (hereinafter, interlock). While an interlock is in effect, the processor cannot start executing a new instruction, and the pipeline has empty slots in which no instruction can be issued.

[0005] RISC and pipelining are described in, for example, J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc., 1990 (Reference 1).

[0006] Program execution time can be shortened by an optimization that increases the number of instructions executed per clock cycle, namely by generating an instruction sequence that leaves fewer empty pipeline slots at run time.

[0007] Among such optimizations, local instruction scheduling reduces the empty pipeline slots caused by interlocks by reordering the instructions within a basic block without changing the meaning of the program. Here, a basic block is an instruction sequence that contains no branch instruction and no branch target in its interior. Local instruction scheduling therefore considers the instructions of only one basic block at a time and reorders instructions only within that block. This optimization is generally applied to an intermediate language that corresponds almost one-to-one to machine language, or to the machine language itself. For an intermediate language, it may be applied after physical register allocation, before it, or both.

[0008] A survey of local instruction scheduling for pipelines is given in S. M. Krishnamurthy, "A Brief Survey of Papers on Scheduling for Pipelined Processors," SIGPLAN Notices, Vol. 25, No. 7, 1990 (Reference 2).

[0009] In typical programs, however, the parallelism available within a basic block is limited, and local instruction scheduling alone cannot avoid interlocks completely. Moreover, on computers that can execute more than one instruction per cycle, such as superscalar, superpipelined, and VLIW (Very Long Instruction Word) machines, local instruction scheduling alone cannot obtain sufficient parallelism. Local instruction scheduling for superscalar processors is described in M. S. Lam, "Instruction Scheduling for Superscalar Architectures," Annual Reviews on Computing, Vol. 4, 1990 (Reference 3), pages 183 to 184, and in M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, 1991 (Reference 4), pages 177 to 202.

[0010] For these reasons, instruction scheduling that crosses basic block boundaries, called global instruction scheduling, has been studied and put into practical use in recent years. These techniques divide broadly into those for loops and those for multiple basic blocks that do not form a loop.

[0011] For loops, software pipelining and loop unrolling are the representative optimizations. Both enable instruction scheduling that spans multiple iterations of a loop. Details are given on pages 187 to 193 of Reference 3 and pages 213 to 229 of Reference 4. These optimizations have drawbacks: the static instruction count increases, and when the iteration count is small the execution time may actually increase.

[0012] Various scheduling methods have been proposed for multiple basic blocks that do not form a loop. A representative one is trace scheduling, which is described on pages 325 to 328 of Reference 1, pages 184 to 187 of Reference 3, and pages 205 to 213 and 229 to 233 of Reference 4. In trace scheduling, a loop is scheduled only within a single iteration.

[0013] Another global instruction scheduling technique is disclosed in, for example, Japanese Patent Laid-Open No. 4-263331.

[0014] An important concept in global instruction scheduling is speculative instruction movement. This means, for example, moving an instruction from the basic block at one target of a conditional branch into the basic block that contains the conditional branch instruction. The moved instruction is then executed regardless of which way the branch goes. When the branch goes the other way, an instruction that did not need to be executed is executed anyway, so an instruction whose execution would change the result in that case cannot be moved. An instruction moved in this way is called a speculative instruction.

[0015] When a basic block B1 does not offer sufficient parallelism, and the following basic block B2 contains an instruction that can be moved to B1 without changing the meaning of the program and without increasing the execution time of B1, speculatively moving that instruction to B1 can be expected to reduce the execution time of B2. Normally, however, the kinds of instructions that can be moved speculatively are limited. If an instruction that may raise an exception is moved across a branch, it can raise an exception that should not occur when the branch goes the other way at run time. Such exception-raising instructions generally include memory access instructions, floating-point arithmetic instructions, and integer division instructions. A simple approach is therefore never to move these instructions across a branch instruction, but then only a few instructions can be moved and sufficient parallelism cannot be obtained.

[0016] There are two kinds of solution: preventing the exception from occurring, and delaying detection of the exception. In the former, for each instruction that may raise an exception, an instruction is provided that performs the same operation but never raises an exception, and when the original instruction is moved speculatively it is replaced with the corresponding non-excepting instruction. This approach has the drawback that exceptions cannot be detected correctly. Methods of the latter kind, which delay exception detection, include boosting and sentinel scheduling. By adding hardware that supports the execution of speculative instructions and the handling of their exceptions, these methods allow even an exception-raising instruction to be moved speculatively while still handling an exception correctly when it occurs.
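
The idea of delaying exception detection can be illustrated with a small software analogy. The sketch below is not boosting or sentinel scheduling as implemented in hardware; it only models the behavior described above, and the names DeferredFault, speculative_load, commit, and guarded_read are hypothetical.

```python
class DeferredFault:
    """Marker for an exception that was recorded but not yet raised."""

    def __init__(self, error):
        self.error = error


def speculative_load(memory, index):
    """Load memory[index]; on a bad index, record the fault instead of raising."""
    try:
        return memory[index]
    except IndexError as error:
        return DeferredFault(error)


def commit(value):
    """Called once the branch outcome shows that the value is really needed."""
    if isinstance(value, DeferredFault):
        raise value.error
    return value


def guarded_read(memory, index):
    value = speculative_load(memory, index)  # moved above the guarding branch
    if index < len(memory):                  # the original conditional branch
        return commit(value)                 # a recorded fault would surface here
    return None                              # other path: the fault is discarded


print(guarded_read([10, 20], 1))
print(guarded_read([10, 20], 5))
```

On the path where the loaded value is not needed, the recorded fault is simply dropped, which is the property that lets an exception-raising instruction be moved across the branch.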

[0017] Speculative instruction execution of this kind is described in S. A. Mahlke et al., "Sentinel Scheduling for VLIW and Superscalar Processors," ASPLOS-V Proceedings, 1992, pages 238 to 247 (Reference 5), and M. D. Smith et al., "Efficient Superscalar Performance Through Boosting," ASPLOS-V Proceedings, 1992, pages 248 to 259 (Reference 6). These references use speculative instructions for global scheduling among multiple blocks that do not include loops.

[0018] As an example of conventional loop optimization, consider compiling the program fragment written in C shown in FIG. 2. This program multiplies each element of the array a by the value of the variable b and stores the product as the new value of that element (that is, it multiplies a vector by a scalar). The iteration count of the loop is given by the variable n, whose value is determined at run time.
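
FIG. 2 is not reproduced in this text. As a rough illustration only, the loop described above can be sketched in Python as follows; the function name scale_in_place is hypothetical, while a, b, n, and i follow the description.

```python
def scale_in_place(a, b, n):
    """Multiply the first n elements of the list a by the scalar b, in place."""
    for i in range(n):     # n is known only at run time
        a[i] = a[i] * b    # load a[i], multiply by b, store the product back
    return a


values = [1.0, 2.0, 3.0]
scale_in_place(values, 2.0, len(values))
print(values)
```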

[0019] If the compiler's object code is assembly language and optimizations other than loop transformation are applied, the program can be compiled into the code shown in FIG. 3.

[0020] The meaning of this assembly language is explained next.

[0021] First, sp and r0 through r5 are 32-bit general-purpose registers, and fr0 through fr5 are 64-bit floating-point registers; r0 and fr0 always hold the value 0.

[0022] In the program of FIG. 3, the registers are used as follows.
sp: Stack pointer. Each variable is allocated on the stack.
r0: Always holds the value 0.
r1: Preset to eight times the value of n.
r2: Holds the value of i.
r3: Holds the address of a[0].
fr1: Preset to the value of b.
fr2: Holds the value of a[i].
fr3: Holds the product of a[i] and b.

[0023] In addition, disp_a in (3) denotes the position of a[0] relative to the stack pointer, L$1 and L$2 are labels, and memory is byte-addressed.

[0024] The operations performed by the assembly-language instructions, which are also used in the embodiments, are as follows.
mov r, q: r and q are general-purpose registers. Copies the value of r to q.
ld r(q), f: r and q are general-purpose registers and f is a floating-point register. Load instruction that transfers the 8 bytes of data starting at the address obtained by adding the value of r to the value of q into the destination register f.
st f, r(q): f is a floating-point register and r and q are general-purpose registers. Store instruction that transfers the contents of f to the address obtained by adding the value of r to the value of q.
addi i, r, q: i is a positive constant and r and q are general-purpose registers. Adds i and the value of r and places the sum in q.
andi i, r, q: i is an integer constant and r and q are general-purpose registers. Places the bitwise logical AND of i and the value of r in q.
mul f, g, h: f, g, and h are floating-point registers. Places the product of the values of f and g in h.
com r, q: r and q are general-purpose registers or floating-point registers. Comparison instruction that compares the values of r and q as signed integers and sets the condition code.
ble L: L is a label. Conditional branch instruction that branches to L when r >= q in the preceding com instruction.
bgt L: L is a label. Conditional branch instruction that branches to L when r < q in the preceding com instruction.
bge L: L is a label. Conditional branch instruction that branches to L when r <= q in the preceding com instruction.
bne L: L is a label. Conditional branch instruction that branches to L when r and q are not equal in the preceding com instruction.
beq L: L is a label. Conditional branch instruction that branches to L when r and q are equal in the preceding com instruction.
bnz L: L is a label. Conditional branch instruction that branches to L when the bitwise logical AND produced by the preceding andi instruction is nonzero.
bra L: L is a label. Unconditional branch instruction that branches to L.
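
Because the branch conditions are stated relative to the operand order of com r, q, they are easy to misread. The following sketch transcribes the conditions listed above into a lookup table; the names BRANCH_TAKEN and branch_taken are hypothetical, and the table covers only the branches that follow a com instruction.

```python
# Branch predicates after "com r, q", transcribed from the list above.
BRANCH_TAKEN = {
    "ble": lambda r, q: r >= q,
    "bgt": lambda r, q: r < q,
    "bge": lambda r, q: r <= q,
    "bne": lambda r, q: r != q,
    "beq": lambda r, q: r == q,
}


def branch_taken(mnemonic, r, q):
    """Return True if the branch following 'com r, q' is taken."""
    return BRANCH_TAKEN[mnemonic](r, q)


print(branch_taken("bgt", 8, 16))
print(branch_taken("bgt", 16, 16))
```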

[0025] As an example of a processor that executes this assembly code, consider the following simplified RISC. Hardware pipelining allows one instruction to be executed per cycle. Instruction latency (the number of cycles from issuing an instruction until it completes) is 2 for load instructions, 4 for floating-point add/subtract instructions, and 1 for all others. This means, for example, that the destination register of a load instruction can be referenced two cycles later, and that if the instruction immediately following a load tries to reference the destination register, a one-cycle interlock occurs.

[0026] With n sufficiently large, the execution time of the loop in the above assembly code is 10 cycles per iteration. Each of the 6 instructions takes 1 cycle to execute, and in addition interlocks totaling 4 cycles occur: 1 cycle between the ld instruction (201) and the mul instruction (202), and 3 cycles between the mul instruction (202) and the st instruction (203). This is the case for a scalar processor; on a superscalar or VLIW processor the proportion of time lost to interlocks is even higher.
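
The 10-cycle figure can be checked with a small in-order issue model. This is a sketch under stated assumptions: FIG. 3 is not reproduced here, so the order and operands of the six loop instructions are assumed, and the mul instruction is given a latency of 4, which is what the stated 3-cycle interlock before the st instruction implies. The names LATENCY, LOOP_BODY, and count_cycles are hypothetical.

```python
LATENCY = {"ld": 2, "mul": 4}  # assumed; every other instruction takes 1 cycle


def count_cycles(instructions, latency=LATENCY):
    """Return (cycles, stall_cycles) for one pass over a straight-line sequence.

    Each instruction is (opcode, source_names, destination_name_or_None).
    One instruction issues per cycle, and an instruction waits (interlocks)
    until all of its sources have been produced.
    """
    ready = {}   # name -> first cycle in which the value may be read
    cycle = 0
    stalls = 0
    for opcode, sources, dest in instructions:
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        stalls += start - cycle
        if dest is not None:
            ready[dest] = start + latency.get(opcode, 1)
        cycle = start + 1
    return cycle, stalls


# Assumed loop body: load a[i], multiply by b, store, advance i, compare, branch.
LOOP_BODY = [
    ("ld",   ["r2", "r3"],        "fr2"),
    ("mul",  ["fr1", "fr2"],      "fr3"),
    ("st",   ["fr3", "r2", "r3"], None),
    ("addi", ["r2"],              "r2"),
    ("com",  ["r2", "r1"],        "cc"),
    ("bgt",  ["cc"],              None),
]

print(count_cycles(LOOP_BODY))  # (10, 4)
```

Under these assumptions the model reports 10 cycles with 4 stall cycles, matching the 6 + 1 + 3 breakdown given above.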

[0027] This program uses three general-purpose registers and three floating-point registers inside the loop.

[0028] Software pipelining, one of the loop optimization methods, increases instruction-level parallelism and reduces interlocks. To apply software pipelining to this program, however, loop unrolling and register renaming must be used together with it, as described on pages 225 to 229 of Reference 4.

[0029] Loop unrolling transforms a loop so that a single iteration of the new loop processes multiple iterations of the original loop.
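
As an illustration, the scalar-multiply loop of the example can be unrolled once so that each pass handles two original iterations. This is a minimal sketch, not the code of the figures; the function name scale_unrolled and the handling of a leftover iteration after the loop are assumptions.

```python
def scale_unrolled(a, b, n):
    """Scale the first n elements of a by b, two original iterations per pass."""
    i = 0
    while i + 1 < n:
        a[i] = a[i] * b          # original iteration i
        a[i + 1] = a[i + 1] * b  # original iteration i + 1
        i += 2
    if i < n:                    # leftover iteration when n is odd
        a[i] = a[i] * b
    return a


print(scale_unrolled([1.0, 2.0, 3.0], 2.0, 3))
```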

[0030] Register renaming changes the dependencies that arise through registers by reallocating registers, making scheduling easier. In particular, when the same data is reallocated to a different register partway through execution, a copy instruction from the originally allocated register to the newly allocated register is generated.

[0031] FIG. 4 (first half) and FIG. 5 (second half) show the result of applying these loop optimizations to the same source program. Here, software pipelining is combined with one level of loop unrolling and with register renaming. The software pipeline in this case consists of two stages.

[0032] With a two-stage pipeline, the first stage of the first iteration must be executed before the pipelined loop runs; the code that does this is called the (pipeline) prologue. Likewise, the second stage of the last iteration must be executed after the loop; the code that does this is called the (pipeline) epilogue.

[0033] With a two-stage pipeline unrolled once, the first-stage instructions are executed once in the prologue and twice in each iteration of the loop, and the second-stage instructions are executed twice in each iteration of the loop and once in the epilogue. The pipelined code therefore executes an odd number, three or more, of the original iterations.

[0034] If the original loop has fewer than three iterations, it cannot be executed by the pipelined loop, and if the iteration count is even, one iteration must be executed separately from the pipelined loop.

[0035] FIG. 6 is a flowchart of the processing performed by the assembly code of FIGS. 4 and 5.

[0036] The processing of each part of the assembly code and of the corresponding flowchart is outlined below. 401 transfers control to the next processing, skipping everything below, when the iteration count of the loop is 0. 402 decides between the two loops: the pipelined loop is executed when the iteration count is three or more, and the original loop when it is less than three. 403 executes the original loop without loop transformation. 404 processes one iteration when the iteration count of the original loop is even. 405 is the pipeline prologue. 406 is the pipelined loop. 407 is the pipeline epilogue.
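
The control structure of steps 401 to 407 can be sketched in Python as follows. FIGS. 4 to 6 are not reproduced here, so this is only an illustration under assumptions: stage 1 is taken to be the load and stage 2 the multiply and store, the separately processed iteration of step 404 is taken to be the first one, and the two temporaries x and y stand in for the registers introduced by register renaming. The function name scale_pipelined is hypothetical.

```python
def scale_pipelined(a, b, n):
    """Scale the first n elements of a by b using a two-stage software pipeline."""
    if n == 0:                   # 401: nothing to do
        return a
    if n < 3:                    # 402: too few iterations for the pipelined loop
        for i in range(n):       # 403: original loop
            a[i] = a[i] * b
        return a
    start = 0
    if n % 2 == 0:               # 404: even count, process one iteration separately
        a[0] = a[0] * b
        start = 1
    x = a[start]                 # 405: prologue, stage 1 of the first iteration
    i = start
    while i + 2 < n:             # 406: pipelined loop, unrolled once
        y = a[i + 1]             # stage 1 of iteration i + 1
        a[i] = x * b             # stage 2 of iteration i
        x = a[i + 2]             # stage 1 of iteration i + 2
        a[i + 1] = y * b         # stage 2 of iteration i + 1
        i += 2
    a[i] = x * b                 # 407: epilogue, stage 2 of the last iteration
    return a


print(scale_pipelined([1.0, 2.0, 3.0, 4.0], 2.0, 4))
```

The sketch shows why the pipelined part covers only odd counts of three or more: the prologue and epilogue together account for one iteration, and each pass of the loop adds two.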

[0037] Because interlocks are completely avoided, the pipelined loop takes 12 cycles to execute its 12 instructions, which is 6 cycles per original iteration. The loop uses five general-purpose registers and five floating-point registers, two more of each because of register renaming. The static instruction count of the whole loop (the number of instructions in the object code, without regard to execution frequency) increases from the original 10 instructions to 35. This includes one copy of the original loop for the case of fewer than three iterations and one for the case of an even iteration count, plus the code for the unrolled and pipelined loop. The count is therefore generally three times the original or more.

[0038] When the iteration count is small, overhead such as selecting whether to execute the transformed code means that not only the static instruction count but also the execution time may increase.

[0039] Software pipelining also often has to be combined with loop unrolling and register renaming; loop unrolling increases the static instruction count, and register renaming increases the number of registers used.

[0040] Furthermore, when the iteration count cannot be known at the start of loop execution, for example because the loop termination condition depends on the result of each iteration, these optimizations cannot be applied.

[0041]

[Problems to Be Solved by the Invention] The loop optimizations described in the related art increase the static instruction count and the number of registers used. Moreover, these optimizations could not conventionally be applied unless the iteration count was determined at the start of loop execution.

[0042] An object of the present invention is to provide a loop optimization method that, by using speculative instructions, requires fewer static instructions and fewer registers, and to make loop optimization possible even when the iteration count is not determined at the start of the loop.

[0043]

[Means for Solving the Problems] The present invention is a compiler that translates a program written in a high-level language into machine language, assembly language, or another high-level language, and it provides a method of optimizing loops using speculative instructions when the translated program is to be executed on a processor that can execute such instructions.

[0044] The prior art includes the following kind of loop transformation. Suppose the iteration count of a loop is known at the start of the loop and consists of n iterations numbered 0 through n-1. With i, j, p, and q non-negative integers, the loop is transformed so that the i-th iteration of the transformed loop (0 <= i <= n-1) performs all or part of the processing of one or more of the (q-p+1) original iterations numbered i+p through i+q. Software pipelining is an example.

[0045] In such a loop transformation, the present invention performs the following operations according to the target architecture.

[0046] First, consider an architecture in which, for a given instruction, it is possible to select whether an exception raised by executing that instruction is detected according to whether each of one or more subsequent conditional branch instructions branches. For such an architecture, proceed as follows.

[0047] When the iteration count is known at the start of loop execution, choose t with 1 <= t <= q-p in the loop transformation. For every s with s <= t, if there is an instruction that performs processing of the (i+s)-th iteration of the original loop and that instruction may raise an exception, replace it with a corresponding instruction whose exception is detected only when the conditional branch instruction that forms the loop has branched s times in succession. In addition, set the iteration count of the loop to n-q+t.

[0048] For a loop that is known to iterate at least p+1 times but whose iteration count is not known at the start of loop execution, perform the above instruction replacement with t = q-p and then carry out the loop transformation.

[0049] Next, consider an architecture in which, for a given instruction, it is possible to select, according to whether each of one or more subsequent conditional branch instructions branches, whether the execution result of that instruction is written to registers, condition codes, and memory and whether an exception raised during its execution is detected. For such an architecture, proceed as follows.

[0050] When the iteration count is known at the start of loop execution, choose t with 1 <= t <= q-p in the loop transformation. For every s with s <= p+t, replace each instruction that performs processing of the (i+s)-th iteration of the original loop with an instruction that writes its result only when the conditional branch instruction that forms the loop has branched s times in succession. At the same time, if the result of a replaced instruction can be referenced before the conditional branch is taken and an instruction in the same iteration references that result, replace the referencing instruction with one that references the result in that way. In addition, set the iteration count of the loop to n-q+t.

【００５１】また、ｐ＋１回以上ループすることはわか
っているが、何回ループするかはループの実行開始時に
はわからないループに対しては、ｔ＝ｑ−ｐであるｔに
対し、上記の命令の置き換えを行ってループ変換を実施
する。For a loop that is known to iterate p+1 or more times, but whose iteration count is not known at the start of loop execution, the above instruction replacement is performed for t = q-p and the loop transformation is carried out.

【００５２】[0052]

【作用】どちらのアーキテクチャについても、次のこと
が言える。[Operation] The following can be said for both architectures.

【００５３】変換後のループのｉ番目の反復で、ｎ−ｑ＋１＜＝ｉかつｉ＜＝ｎ−ｑ＋ｔのとき、元の反復でのｎ−１番目までの処理と本来必要
のないｎ番目以降の命令を実行してしまうが、本発明に
よれば、このｎ番目以降の処理は例外を起こしてもその
処理が行われないため影響がない。In the i-th iteration of the transformed loop, when n-q+1 <= i <= n-q+t, the loop executes the processing of the original iterations up to the (n-1)-th together with instructions for the n-th and later iterations, which are not actually needed. According to the present invention, however, this n-th and later processing has no effect, because even if it raises an exception, that exception is not processed.

【００５４】従って、従来技術では変換後のループでは
この範囲で実行される元の反復でのｎ−１番目までの反
復の処理は、エピローグとして別に生成する必要があっ
たが、本発明によりループ内でその命令の実行ができる
ため、エピローグを生成する必要がなくなった。In the prior art, therefore, the processing of the original iterations up to the (n-1)-th that is executed in this range of the transformed loop had to be generated separately as an epilogue. With the present invention those instructions can be executed inside the loop, so no epilogue needs to be generated.

【００５５】特に、実行結果の書き込みを遅らせること
のできるアーキテクチャでは、レジスタ，条件コード，
メモリへの書き込みを遅らせることにより、元のループ
の複数の反復の処理を行うために生じうる依存関係の破
壊が回避でき、ループアンローリングやレジスタリネー
ミングが不要になる。In particular, on an architecture that can delay the writing of execution results, delaying writes to registers, condition codes, and memory avoids the broken dependencies that could otherwise arise from processing several iterations of the original loop at once, so loop unrolling and register renaming become unnecessary.

【００５６】ループの反復回数がループ開始時に決定さ
れない場合には、変換後のｉ番目の反復で終了条件が成
立したときはすでに実行されている元のｉ＋１番目以降
の処理は、次の条件分岐で分岐しないために無効となる
ため、正しい結果が得られる。When the number of loop iterations is not determined at the start of the loop and the termination condition is met in the i-th iteration of the transformed loop, the already executed processing for the original (i+1)-th and later iterations is invalidated because the next conditional branch does not branch, so the correct result is obtained.

【００５７】[0057]

【実施例】まず、本発明における最適化の前提とするア
ーキテクチャの一例について説明する。この例では、従
来技術で述べた例で用いたプロセッサに、投機的実行を
サポートする機能が以下のように実現されているものと
する。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS First, an example of an architecture on which optimization according to the present invention is based will be described. In this example, it is assumed that the processor used in the example described in the related art has the function of supporting speculative execution as follows.

【００５８】レジスタファイルは汎用レジスタ，浮動小
数点レジスタの２種類があり、それぞれ３２ビット，６
４ビットのデータを保持できるものとし、それぞれを２
組用意する。その一組は、従来のレジスタセットをその
まま用い、これを表レジスタと呼ぶ。もう一組は、裏レ
ジスタと呼び、以下のように構成する。There are two kinds of register file, general-purpose registers and floating-point registers, holding 32-bit and 64-bit data respectively, and two sets of each are provided. One set is the conventional register set used unchanged and is called the front registers. The other set is called the back registers and is configured as follows.

【００５９】各レジスタは値を入れる領域に加えて、そ
のレジスタが有効であるかどうかを示すビット（有効ビ
ット）と、例外が起きたかどうかを示すビット(例外ビ
ット)を付加する。また、表と裏は同じように番号付け
する。Each register has a bit (valid bit) indicating whether or not the register is valid and a bit (exception bit) indicating whether or not an exception occurs, in addition to the area for storing a value. The front and back are numbered the same.

【００６０】命令セットを、以下のように変更する。各
命令の参照する各レジスタに、表レジスタであるか裏レ
ジスタであるかを示すビットを付加する。アセンブリコ
ードでは、それぞれのレジスタについて、レジスタ名の
後に“．Ｂ”を付加して裏レジスタを指定する。The instruction set is changed as follows. A bit indicating whether it is a front register or a back register is added to each register referred to by each instruction. In the assembly code, for each register, ".B" is added after the register name to specify the back register.

【００６１】例えば、前述の浮動小数点数ロード命令ｌｄｒ２(ｒ３)，ｆｒ１を投機的にするために、目的レジスタを裏にすると、ｌｄｒ２(ｒ３)，ｆｒ１．Ｂとなる。裏レジスタを使用する命令が投機的命令であ
る。なお、ここでは説明を簡単にするために、裏レジス
タを参照する命令は、表レジスタの値を変えることはで
きないものとする。For example, to make the floating-point load instruction ld r2(r3), fr1 described earlier speculative, its destination register is changed to a back register, giving ld r2(r3), fr1.B. An instruction that uses a back register is a speculative instruction. To simplify the explanation, it is assumed here that an instruction that references a back register cannot change the value of a front register.

【００６２】投機的な命令は、以下のように実行され
る。Speculative instructions are executed as follows.

【００６３】初期状態は、各裏レジスタの有効ビットは
無効を意味する０とし、例外ビットは例外が起きていな
いことを示す０とする。In the initial state, the valid bit of each back register is set to 0, which means invalid, and the exception bit is set to 0, which indicates that no exception has occurred.

【００６４】各命令の演算は、レジスタの参照か書き込
みかによって、以下のように処理される。The operation of each instruction is processed as follows depending on whether the register is referenced or written.

【００６５】レジスタに値を書き込むときは、表か裏か
指定されたレジスタセットに書き込む。このとき、その
演算で例外が起きたときは、表レジスタの場合はただち
に処理されるが、裏レジスタの場合は、目的レジスタに
例外の起きた命令のアドレスを書き込み、そのレジスタ
の例外ビットを１にする。例外が起きなかった場合は指
定された裏レジスタに値を書き込み、有効ビットを１と
する。When a value is written to a register, it is written to whichever register set, front or back, the instruction specifies. If the operation raises an exception, a front-register write has the exception handled immediately, whereas a back-register write stores the address of the faulting instruction in the destination register and sets that register's exception bit to 1. If no exception occurs, the value is written to the specified back register and its valid bit is set to 1.

【００６６】レジスタの参照は、命令語に表か裏か明示
されているので、指定されたレジスタセットを用いる。
裏レジスタを参照する場合、そのレジスタの例外ビット
が１の時は、その命令がレジスタに値を書き込むなら
ば、書き込むレジスタに例外ビットが１の参照レジスタ
の値である、例外の起きた命令のアドレスを複写し、例
外ビットを１にする。Because the instruction word states explicitly whether each referenced register is front or back, the specified register set is used. When a back register is referenced and its exception bit is 1, and the referencing instruction writes a value to a register, the value of the referenced register (the address of the instruction that raised the exception) is copied into the destination register, and the destination's exception bit is set to 1.

【００６７】条件分岐命令は、ハードウェアによって静
的な分岐予測を行うものとする。即ち、ｉｆ文のように
分岐先が前方の場合は分岐しないものと予測し、ループ
のように分岐先が後方の場合は分岐するものと予測す
る。条件分岐命令を実行する際に、分岐予測に基づいて
投機的に実行した命令の有効化と無効化のための処理を
行う。分岐予測が当たった場合には、裏レジスタの内、
例外ビットが１のものがあれば例外の処理を行い、一つ
もない場合には有効ビットが１のものの値を、対応する
表レジスタに複写し、全ての有効ビットと例外ビットを
０にする。この処理は分岐先の命令の実行直前に行われ
るものとみなす。従って、条件分岐命令の分岐スロット
内の命令は、裏レジスタの無効化や、表レジスタへの複
写の前にレジスタの参照、書き込みが行われることにな
る。Conditional branch instructions are assumed to use static branch prediction performed by hardware. That is, a forward branch, as in an if statement, is predicted not taken, and a backward branch, as in a loop, is predicted taken. When a conditional branch instruction is executed, the speculatively executed instructions are validated or invalidated on the basis of the branch prediction. If the prediction was correct and any back register has its exception bit set to 1, the exception is handled; if none does, the value of every back register whose valid bit is 1 is copied to the corresponding front register, and all valid bits and exception bits are cleared to 0. This processing is regarded as taking place immediately before the branch target instruction executes. Consequently, an instruction in the branch slot of the conditional branch instruction references and writes registers before the back registers are invalidated or copied to the front registers.
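The back-register behavior described in paragraphs [0063] to [0067] can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the class and method names (`BackReg`, `RegFile`, `write_back`, `propagate`, `resolve_branch`) are hypothetical, the register count is arbitrary, and clearing the bits after a misprediction or after a reported exception is an assumption, since the text only states the clearing step for the correctly predicted, exception-free case.

```python
from dataclasses import dataclass


@dataclass
class BackReg:
    value: int = 0
    valid: bool = False   # valid bit, initially 0
    exc: bool = False     # exception bit, initially 0


class RegFile:
    """Front registers plus identically numbered back registers (toy model)."""

    def __init__(self, n=8):
        self.front = [0] * n
        self.back = [BackReg() for _ in range(n)]

    def write_back(self, r, value=None, fault_addr=None):
        """Speculative write: store a value, or the address of the faulting instruction."""
        reg = self.back[r]
        if fault_addr is not None:
            reg.value, reg.exc = fault_addr, True
        else:
            reg.value, reg.valid = value, True

    def propagate(self, src, dst):
        """If a speculative instruction reads a faulted back register, copy the
        faulting address and the exception bit to its destination back register."""
        if self.back[src].exc:
            self.write_back(dst, fault_addr=self.back[src].value)
            return True
        return False

    def resolve_branch(self, prediction_hit):
        """Performed at a conditional branch, just before the target instruction."""
        faults = [b.value for b in self.back if b.exc]
        if prediction_hit and not faults:
            for i, b in enumerate(self.back):
                if b.valid:
                    self.front[i] = b.value   # speculative results become real
        # Assumption: bits are cleared in every case, so a miss simply discards.
        for b in self.back:
            b.valid = b.exc = False
        if prediction_hit and faults:
            raise RuntimeError(f"deferred exception from instruction at {faults[0]:#x}")
```

In this model a fault recorded by a speculative instruction surfaces only if the branch goes the predicted way; otherwise it disappears together with the speculative values.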

【００６８】このようなアーキテクチャを前提とする本
発明による最適化法を次に説明する。The optimization method according to the present invention, which is based on such an architecture, will be described below.

【００６９】まず、コンパイラの構成を図１に示す。コ
ンパイラは、高級言語で書かれたソースプログラムを、
目的のプロセッサのアセンブリ言語、または機械語に変
換する。また、高級言語を別の高級言語に変換するもの
もある。First, FIG. 1 shows the configuration of the compiler. A compiler converts a source program written in a high-level language into the assembly language or machine language of the target processor. Some compilers convert one high-level language into another.

【００７０】コンパイラの処理は、大きく分けて次のよ
うに分類できる。字句解析部１０１，意味解析及び中間
語生成部１０２，最適化部１０３，目的コード生成部１
１０である。最適化部は構文解析部の生成した中間語、
またはそれをさらに変換した別の中間語に対して、プロ
グラムの意味を変えずに実行速度、あるいは目的コード
の大きさが小さくなるように変換を行う。The processing of the compiler can be broadly divided into the following parts: a lexical analysis unit 101, a semantic analysis and intermediate language generation unit 102, an optimization unit 103, and an object code generation unit 110. The optimization unit transforms the intermediate language generated by the parsing unit, or another intermediate language obtained by converting it further, so as to improve execution speed or reduce object code size without changing the meaning of the program.

【００７１】最適化部で行う処理には一般に行われるル
ープ不変式の移動や、定数伝播のようなソースプログラ
ムの段階でできる最適化や、強度削減などの目的機械に
依存する最適化、あるいはその両方を含む最適化などを
含む。The processing performed by the optimization unit includes commonly used optimizations that can be done at the source program level, such as loop-invariant code motion and constant propagation; optimizations that depend on the target machine, such as strength reduction; and optimizations that involve both.

【００７２】以後、最適化部の内、本発明を実施するル
ープの最適化を行う部分について説明する。Hereinafter, of the optimizing unit, a portion for optimizing the loop for implementing the present invention will be described.

【００７３】１０４にループ最適化部の構成を示す。ま
ず、中間語からループ構造を検出し１０５、その中でイ
ンターロックの発生がどの程度あるかを評価し１０６、
ループ変換を適用すべきかどうかを決定し、適用すべき
ならば適用可能な一つまたは複数のループ変換の組み合
わせを選択し１０７、選択されたループ変換に応じて、
中間語でループの変換を実施する108。Reference numeral 104 shows the configuration of the loop optimization unit. First, loop structures are detected in the intermediate language (105) and the amount of interlock occurring in each is evaluated (106). It is then decided whether a loop transformation should be applied and, if so, one applicable loop transformation or a combination of several is selected (107). Finally, the loop is transformed in the intermediate language according to the selected transformation (108).

【００７４】ループ変換を行ったループに対して、さら
に別のループ変換を行うことも可能である１０９。A loop that has already been transformed can also be subjected to a further loop transformation (109).

【００７５】ループ変換の実施部１０８では以下の処理
も含む。選択された変換が単独では実施できない場合
に、他の変換と組み合わせた変換を実施する。変換され
たループが反復関数が特定の条件のもとでしか実行でき
ない場合に、その条件を満たさないときに元のループを
実行するためのコードや、条件を満たすようにするため
の前処理のためのコードの生成を含む。The loop transformation execution unit 108 also performs the following processing. If the selected transformation cannot be carried out on its own, it is carried out in combination with other transformations. If the transformed loop can be executed only when its iteration count satisfies a particular condition, the unit also generates code that runs the original loop when the condition is not met, or preprocessing code that makes the condition hold.

【００７６】本例では、ループ変換の選択部とループ変
換部が以上の従来方法と異なる。In this example, the selection unit for loop transformation and the loop transformation unit are different from the above conventional method.

【００７７】ループ変換の選択部１０７では、ループ選
択の際に投機的命令が使用できることを考慮する。それ
により、これまで出来なかった変換が可能になる場合が
ある。The loop conversion selecting unit 107 considers that a speculative instruction can be used when selecting a loop. This may allow conversions not previously possible.

【００７８】ループ変換部では、変換を行った命令系列
とその前処理，後処理、及び変換したループで実行でき
ない場合のための元のループの命令系列を、必要ならば
投機的命令を使用して生成する。The loop transformation unit generates the transformed instruction sequence, its preprocessing and postprocessing, and the instruction sequence of the original loop for cases in which the transformed loop cannot be used, employing speculative instructions where necessary.

【００７９】以下、従来の技術で述べたソフトウェアパ
イプライニングの例を元に、詳細に説明する。Hereinafter, a detailed description will be given based on an example of software pipelining described in the prior art.

【００８０】図２のＣ言語で書かれたプログラムをパイ
プライニングすることを考える。論理レジスタの割り付
けられた機械語と１対１に対応する中間語をループ最適
化部の入力とする。Consider pipelining the program written in C shown in FIG. 2. The input to the loop optimization unit is an intermediate language that corresponds one-to-one to machine instructions and in which logical registers have already been assigned.

【００８１】図７に本実施例におけるソフトウェアパイ
プライニングの場合の処理の流れを示す。この図では元
のループの反復回数がｎ回で、パイプラインの段数が２
段の時の処理を示す。FIG. 7 shows the flow of processing for software pipelining in this embodiment. The figure shows the processing when the original loop iterates n times and the pipeline has two stages.

【００８２】ループ内に分岐や関数呼出がなく、インタ
ーロックが一定値以上ある場合に、ループ変換の選択部
でソフトウェアパイプライニングが選択される。従来技
術ではループの反復回数がループの開始時に決定されて
いなければソフトウェアパイプライニングは実施できな
かったが、本発明ではそのような制約はない。When there is no branch or function call in the loop and the interlock is a certain value or more, the software pipelining is selected by the selection unit of the loop conversion. In the prior art, software pipelining could not be performed unless the number of loop iterations was determined at the beginning of the loop, but the present invention does not have such a restriction.
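The selection rule in paragraph [0082] can be sketched as a simple predicate. The sketch below is a minimal illustration, not the patented selection logic: `LoopInfo`, its fields, the threshold value, and the function name are all hypothetical, and a real selection unit would consider other transformations as well.

```python
from dataclasses import dataclass


@dataclass
class LoopInfo:
    has_inner_branch: bool     # branch inside the loop body
    has_call: bool             # function call inside the loop body
    interlock_cycles: int      # estimated interlock per iteration
    trip_count_known: bool     # iteration count fixed when the loop starts


INTERLOCK_THRESHOLD = 2        # hypothetical "certain value" from the text


def select_software_pipelining(loop, speculation_available):
    """Return True if software pipelining would be selected for this loop."""
    if loop.has_inner_branch or loop.has_call:
        return False
    if loop.interlock_cycles < INTERLOCK_THRESHOLD:
        return False
    # Prior art needs a known trip count; speculative instructions lift that limit.
    return loop.trip_count_known or speculation_available
```

With `speculation_available=False` the predicate reproduces the prior-art restriction, which makes the difference introduced by speculative instructions easy to see.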

【００８３】まず、７０１でループ本体の命令を、イン
ターロックが最小になるように並べ変える。図２のプロ
グラムの場合は、変換前のプログラムが図３に示すもの
となるので、図８の様に命令が並べ変えられる。このと
き、依存関係から、パイプラインの段数と、各命令がど
の段の処理を行うかが決定される。その方法の一例が
（文献４）の２２０から２２５ページに記載されてい
る。First, at 701, the loop body instructions are rearranged so that the interlock is minimized. In the case of the program of FIG. 2, the program before conversion is the one shown in FIG. 3, so the instructions can be rearranged as shown in FIG. At this time, the dependency relationship determines the number of stages of the pipeline and which stage each instruction performs. An example of the method is described on pages 220 to 225 of (Reference 4).

【００８４】図８の場合、パイプラインは２段からな
り、８０１，８０２，８０４がパイプラインの１段目を
構成し、８０３，８０５，８０６が２段目を構成するこ
とになる。つまり、図２のループのｉ番目の反復では、
８０３，８０５，８０６が元のｉ番目の反復の処理を行
うのに対し、８０１，８０２，８０４はｉ＋１番目の反
復の処理を行う。そのため、論理レジスタ番号の通りに
物理レジスタを割り付けると、８０１と８０３，８０１
と８０５の間でデータの依存関係をこわしてしまう。In the case of FIG. 8, the pipeline has two stages: 801, 802, and 804 form the first stage, and 803, 805, and 806 form the second. That is, in the i-th iteration of the loop of FIG. 2, 803, 805, and 806 perform the processing of the original i-th iteration, whereas 801, 802, and 804 perform the processing of the (i+1)-th iteration. Consequently, if physical registers are assigned exactly according to the logical register numbers, the data dependencies between 801 and 803 and between 801 and 805 are broken.

【００８５】それを避けるために従来の技術では、前述
したように、１回ループアンローリングを行い、レジス
タリネーミングを行うところである。In order to avoid this, in the conventional technique, as described above, the loop unrolling is performed once and the register renaming is performed.

【００８６】本発明では、７０２で１段目の処理を行う
命令８０１，８０２，８０４を投機的命令に置き換える
ことにより、これらの処理の結果が反映されるのを次の
反復に入る時点まで遅らせることにより、上記の問題を
解決する。ここで行う投機的命令への置き換えは、１段
目の各命令について、目的レジスタを対応する裏レジス
タに置き換え、その後続の命令で同じレジスタを参照す
る命令を対応する裏レジスタを参照する命令に置き換え
ることである。この裏レジスタの使用が、レジスタリネ
ーミングを裏レジスタに対して行うのと同じことになる
ため、表レジスタの使用量は増えない。さらに、ループ
アンローリングを行う必要がなくなり、反復回数が奇数
か偶数かも問題ではなくなる。In the present invention, this problem is solved at 702 by replacing the first-stage instructions 801, 802, and 804 with speculative instructions, so that the results of their processing do not take effect until the next iteration is entered. The replacement works as follows: for each first-stage instruction, the destination register is replaced with the corresponding back register, and any subsequent instruction that references the same register is replaced with one that references the corresponding back register. Using back registers in this way amounts to performing register renaming onto the back registers, so the number of front registers used does not increase. In addition, loop unrolling is no longer needed, and it no longer matters whether the iteration count is odd or even.
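The replacement performed at step 702 can be sketched as a rewrite over a scheduled instruction list. This Python sketch rests on several assumptions: instructions are modeled as hypothetical `(opcode, sources, destination, stage)` tuples, the opcodes and register names are invented for illustration and are not those of FIG. 8, and "subsequent instructions" is interpreted as later first-stage instructions, because second-stage instructions need the front-register value committed by the previous iteration.

```python
def speculate_first_stage(body):
    """Rewrite first-stage instructions to target back registers (suffix ".B").

    body: list of (opcode, sources, destination, stage) in scheduled order.
    """
    renamed = set()   # logical registers whose new value lives in a back register
    result = []
    for opcode, sources, dest, stage in body:
        if stage != 1:
            # Second stage keeps reading front registers from the last commit.
            result.append((opcode, sources, dest, stage))
            continue
        new_sources = tuple(s + ".B" if s in renamed else s for s in sources)
        new_dest = dest
        if dest is not None:
            renamed.add(dest)
            new_dest = dest + ".B"
        result.append((opcode, new_sources, new_dest, stage))
    return result


# Hypothetical two-stage loop body.
body = [
    ("ld",   ("r2",),        "fr1", 1),   # load the next element
    ("fneg", ("fr1",),       "fr4", 1),   # uses the value just loaded
    ("fmul", ("fr4", "fr0"), "fr2", 2),   # uses fr4 from the previous iteration
    ("st",   ("fr2", "r3"),  None,  2),
]
pipelined_body = speculate_first_stage(body)
```

Only first-stage destinations and their first-stage readers change, which matches the observation that front-register usage does not grow.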

【００８７】一般には、ｋ段のパイプラインからなると
き、２＜＝ｉ＜＝ｋである各ｉ段目に属する命令は、ｋ
−ｉ回後の反復で実行結果が有効になるように、ループ
を構成する条件分岐命令がｋ−ｉ回分岐したときに有効
になる投機的命令に置き換える。しかし、この例のアー
キテクチャでは、直後の条件分岐についてしか投機的な
実行が行えないため、３段以上のパイプライニングを行
うのであっても１段目のみを投機的に実行する。そのた
め、３段以上のパイプライニングでは、実行回数がルー
プの実行開始時に不明の場合、パイプライニングが出来
ず、その場合はループ変換を実施しない。以後、パイプ
ラインが２段の場合の処理のみ説明する。In general, when the pipeline has k stages, the instructions belonging to each i-th stage with 2 <= i <= k are replaced with speculative instructions that take effect when the conditional branch instruction forming the loop has branched k-i times, so that their results become valid k-i iterations later. In the architecture of this example, however, speculative execution is possible only with respect to the immediately following conditional branch, so even when pipelining with three or more stages, only the first stage is executed speculatively. Therefore, with three or more stages, pipelining is not possible when the iteration count is unknown at the start of loop execution, and in that case the loop transformation is not performed. Only the two-stage case is described below.

【００８８】次に、７０３，７０４で反復回数の決定と
パイプラインのプロローグのコードの生成を行う。プロ
ローグとしては、従来と同様に元のループの最初の反復
での一段目の命令８０１，８０２，８０４を生成する。
従来は変換後のループの反復回数をｎ−１回とし、もと
のｎ回目の反復での２段目の命令の内、ループのための
命令を除いた８０３をエピローグとして生成していた。
本発明では反復回数をｎとして、ｎ回目に元のｎ＋１回
目の１段目の命令を実行してもそれらは投機的であり、
次の反復がないために無効になるので、反復回数をｎ回
として２段目の命令が元のループでｎ＋１回目の反復に
対する処理を行っても問題がない。よって、エピローグ
は生成しない。Next, at 703 and 704, the iteration count is determined and the pipeline prologue code is generated. As in the conventional method, the prologue consists of the first-stage instructions 801, 802, and 804 for the first iteration of the original loop. Conventionally, the transformed loop iterated n-1 times, and 803 (the second-stage instructions of the original n-th iteration, excluding the loop-control instructions) was generated as an epilogue. In the present invention the iteration count is n: the first-stage instructions for the original (n+1)-th iteration that run in the n-th iteration are speculative and are invalidated because no further iteration follows, so iterating n times causes no problem even though the loop performs processing belonging to the (n+1)-th iteration of the original loop. Hence no epilogue is generated.

【００８９】この場合、反復回数が１回以上のときにパ
イプライン化したループで実行できる。これは、ループ
アンローリングが実施されていないために、プロローグ
にある１段目の処理と、最初の反復で実行される２段目
の処理以外は、無効化できるからである。同じ理由で、
反復回数が１回以上であれば、反復回数があらかじめわ
からなくてもよい。よって、反復回数の制約は１回以上
ということになるので、反復回数が０回の時に変換され
たループを実行しないようなコードを生成する７０５。In this case, the pipelined loop can be used whenever the iteration count is one or more. Because loop unrolling has not been applied, everything except the first-stage processing in the prologue and the second-stage processing executed in the first iteration can be invalidated. For the same reason, the iteration count need not be known in advance as long as it is at least one. The only constraint is therefore that the loop iterates at least once, so code is generated that skips the transformed loop when the iteration count is zero (705).
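The effect of steps 703 to 705 can be illustrated by simulating a two-stage pipelined loop that has a prologue, no epilogue, and a guard for zero iterations. The sketch below is an illustration under assumptions: the loop body (doubling each element), the function names, and the use of a Python `IndexError` to stand in for a load exception are all hypothetical and are not taken from FIG. 2 or FIG. 9.

```python
def original_loop(src, n):
    """Reference loop: n iterations, each loads one element and stores its double."""
    return [src[i] * 2 for i in range(n)]


def pipelined_loop(src, n):
    """Two-stage pipelined version with prologue and no epilogue (requires n >= 1)."""
    out = []
    front = src[0]                 # prologue: first-stage work of iteration 1
    i = 0
    while True:
        # Stage 1 (speculative): load for the next original iteration.
        try:
            back, back_fault = src[i + 1], False
        except IndexError:
            back, back_fault = None, True   # fault is recorded, not raised
        # Stage 2: finish the current original iteration from the front register.
        out.append(front * 2)
        i += 1
        if i < n:                  # loop branch taken: speculation is committed
            if back_fault:
                raise IndexError("deferred exception from speculative load")
            front = back
        else:                      # branch not taken: back register is discarded
            break
    return out


def run(src, n):
    """Step 705: skip the transformed loop when the iteration count is zero."""
    return pipelined_loop(src, n) if n >= 1 else []
```

In the last iteration the speculative load reads past the end of `src`, yet the fault never surfaces because the loop branch is not taken, which is why no epilogue and no reduced trip count are needed.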

【００９０】以上の処理により生成された目的プログラ
ムを図９に示す。The target program generated by the above processing is shown in FIG.

【００９１】このプログラムでは、命令数は変換前の１
０命令にプロローグの３命令を加えた１３命令であり、
インターロックは起きなくなっているため反復一回あた
り６サイクルで実行できる。This program has 13 instructions: the 10 instructions before the transformation plus the 3 prologue instructions. Because interlocks no longer occur, each iteration executes in 6 cycles.

【００９２】新たに使用するレジスタは、汎用レジスタ
と浮動小数点レジスタの裏レジスタがそれぞれ２個ずつ
である。しかし、表レジスタの使用数は増加しない。The only additional registers used are two back registers each from the general-purpose and floating-point register files. The number of front registers used does not increase.

【００９３】このプログラムの実行の様子を図１０に示
す。１００１はプロローグの命令、１００２はパイプラ
イン化されたループの命令をそれぞれオペランドを省略
して示す。FIG. 10 shows how the program is executed. Reference numeral 1001 indicates a prologue instruction, and 1002 indicates a pipelined loop instruction with operands omitted.

【００９４】元のループの反復回数が３回であるとき、
その１回目，２回目，３回目の反復での処理を１００
３，１００４，１００５で行う。１００６は元のループ
での４回目の反復の処理に相当するが、１００５の条件
分岐で分岐しないために1006の処理は無効化される。When the original loop iterates three times, the processing of its first, second, and third iterations is performed at 1003, 1004, and 1005. 1006 corresponds to the processing of a fourth iteration of the original loop, but because the conditional branch at 1005 does not branch, the processing at 1006 is invalidated.

【００９５】第二の実施例として、ループの反復回数が
ループの開始時に決定されていない場合を説明する。図
１１にそのようなソースプログラムの例を示す。従来の
技術ではこのプログラムはソフトウェアパイプライニン
グができなかった。この場合も前述の実施例で説明した
方法により、図１２のアセンブリコードに変換される。
その際、ループの終了条件は変える必要がない。反復回
数が１回以上の時にこの変換されたコードでループを実
行できる。As a second embodiment, the case in which the number of loop iterations is not determined at the start of the loop is described. FIG. 11 shows an example of such a source program. With the prior art, this program could not be software pipelined. Here too, the method described in the preceding embodiment converts the program into the assembly code shown in FIG. 12. The loop termination condition does not need to be changed. The loop can be executed with this transformed code whenever the iteration count is one or more.
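A loop whose iteration count depends on computed data can be pipelined in the same way, with its exit test left unchanged. The following Python sketch is a hypothetical example, not the program of FIG. 11: the loop (summing elements until a limit is reached), the function names, and the `FAULT` marker standing in for a recorded load exception are all assumptions made for illustration.

```python
FAULT = object()   # stands in for a back register whose exception bit is set


def original_sum_until(src, limit):
    """Loop at least once; stop as soon as the running total reaches limit."""
    total, i = 0, 0
    while True:
        total += src[i]
        i += 1
        if total >= limit:
            return total, i


def pipelined_sum_until(src, limit):
    """Same loop with the next load issued speculatively one iteration early."""
    front = src[0]                # prologue: load for the first iteration
    total, i = 0, 0
    while True:
        # Speculative load for the next iteration; a fault is only recorded.
        pending = src[i + 1] if i + 1 < len(src) else FAULT
        total += front            # second stage: work of the current iteration
        i += 1
        if total >= limit:        # unchanged exit test, loop branch not taken
            return total, i       # pending value or fault is simply dropped
        if pending is FAULT:      # branch taken: the recorded fault becomes real
            raise IndexError("deferred exception from speculative load")
        front = pending
```

Both versions return the same total and iteration count; the pipelined one touches one element beyond the last one used, and that extra access is harmless because it is discarded when the loop exits.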

【００９６】[0096]

【発明の効果】本発明によれば、第一に、静的命令数が
少なくなる。これは例で見たように、ループアンローリ
ングやレジスタリネーミングが不要になること、エピロ
ーグが不要になること、１回以上のループに適用できる
ことなどの理由により、従来の方法では元のループの命
令数の３倍以上の命令を必要としたが、本発明によれば
わずかな命令数の増加で同等の処理を実現できるように
なったためである。According to the present invention, first, the static instruction count is reduced. As the example showed, loop unrolling and register renaming become unnecessary, the epilogue becomes unnecessary, and the transformation applies to any loop that iterates one or more times. For these reasons, whereas the conventional method required three or more times the instruction count of the original loop, the present invention achieves equivalent processing with only a slight increase in the number of instructions.

【００９７】第二に、使用するレジスタ数が減少する。
ただし、これは投機的実行のために別のレジスタセット
をもつ場合の効果であり、そのとき通常のレジスタの使
用数はループ変換を行わなかった場合と変らなくなる。Secondly, the number of registers used is reduced.
However, this is an effect when another register set is provided for speculative execution, and at that time, the number of normal registers used is the same as when the loop conversion is not performed.

【００９８】第三に、ループの反復回数がループの開始
時に決定されていない場合でもループ変換を実施でき
る。これは、一部の命令を余分に実行しても、その効果
を取り消す、あるいは余分な例外を起こさないようにす
ることができるからである。Third, loop transformation can be performed even if the number of loop iterations is not determined at the beginning of the loop. This is because even if some of the instructions are executed excessively, it is possible to cancel their effect or prevent an extra exception.

[Brief description of drawings]

【図１】本発明のループ最適化を実施するコンパイラの
フローチャート。FIG. 1 is a flowchart of a compiler that implements loop optimization of the present invention.

【図２】Ｃ言語で書かれたソースプログラムの例の説明
図。FIG. 2 is an explanatory diagram of an example of a source program written in C language.

【図３】図２に示すプログラムをループ最適化を行わず
にコンパイルしたときの目的コードの例の説明図。FIG. 3 is an explanatory diagram of an example of an object code when the program shown in FIG. 2 is compiled without performing loop optimization.

【図４】図２に示すプログラムをループ最適化を行って
コンパイルしたときの目的コードの例の前半部を示す説
明図。FIG. 4 is an explanatory diagram showing the first half of an example of the target code when the program shown in FIG. 2 is compiled by performing loop optimization.

【図５】図２に示すプログラムをループ最適化を行って
コンパイルしたときの目的コードの例の後半部を示すの
説明図。FIG. 5 is an explanatory diagram showing the latter half of an example of the target code obtained when the program shown in FIG. 2 is compiled with loop optimization.

【図６】図４および図５のプログラムの処理を示すフロ
ーチャート。FIG. 6 is a flowchart showing the processing of the programs of FIGS. 4 and 5.

【図７】本発明によるソフトウェアパイプライニングの
一実施例の処理のフローチャート。FIG. 7 is a flowchart of a process of an embodiment of software pipelining according to the present invention.

【図８】図３のループ本体の命令を実行時間が最小にな
るように並べ変えたものを示す説明図。FIG. 8 is an explanatory diagram showing the instructions of the loop body of FIG. 3 rearranged so that the execution time is minimized.

【図９】本発明による図２のプログラムの目的コードの
説明図。FIG. 9 is an explanatory diagram of the object code of the program of FIG. 2 according to the present invention.

【図１０】図９のプログラムの実行の様子を示す説明
図。FIG. 10 is an explanatory diagram showing how the program of FIG. 9 is executed.

【図１１】ループの反復回数がループの実行開始時には
わからないプログラムの例を示す説明図。FIG. 11 is an explanatory diagram showing an example of a program in which the number of loop iterations is unknown at the start of loop execution.

【図１２】本発明による図１１のプログラムの目的コー
ドの説明図。FIG. 12 is an explanatory diagram of the object code of the program of FIG. 11 according to the present invention.

[Explanation of symbols]

８０１，８０２，８０４…本発明により生成された無効
化機能を利用する命令。801, 802, 804 ... Instructions using the invalidation function generated by the present invention.

Claims

[Claims]

1. A loop optimization method for a compiler that generates, from the source code of a program written in a computer language, object code to be executed on a processor having a function for invalidating part or all of the execution result of a designated instruction depending on whether one or more conditional branch instructions executed after that instruction branch, the method being characterized in that an instruction having the invalidating function is used in the object code generated for a loop in the source code, thereby suppressing the generation of part or all of the loop preprocessing instruction sequence, postprocessing instruction sequence, and loop body instruction sequence that would be generated if the invalidating function were not used.

2. The loop optimization method according to claim 1, wherein the invalidating function is realized by detecting an exception caused by execution of a designated instruction only after it has been determined whether one or more conditional branch instructions executed after that instruction branch, and wherein the method suppresses the generation of part or all of the loop preprocessing instruction sequence, postprocessing instruction sequence, and loop body instruction sequence that would be generated if the invalidating function were not used.

3. The loop optimization method according to claim 1, wherein the invalidating function is realized by performing the writes to registers, condition codes, and memory and the exception processing caused by execution of a designated instruction only after it has been determined whether one or more conditional branch instructions executed after that instruction branch, and wherein the method suppresses the generation of part or all of the loop preprocessing instruction sequence, postprocessing instruction sequence, and loop body instruction sequence.

4. The loop optimization method according to claim 1, wherein the invalidating function is realized by performing the writes to registers, condition codes, and memory and the exception processing caused by execution of a designated instruction only after it has been determined whether one or more conditional branch instructions executed after that instruction branch, and by having means for referencing the value to be written even before the write is performed, and wherein the method suppresses the generation of part or all of the loop preprocessing instruction sequence, postprocessing instruction sequence, and loop body instruction sequence.

5. An existing loop optimization method for the processor according to claim 1 that, for an integer i in a certain range and k distinct integers p1 to pk arranged in ascending order, divides the processing of the i-th iteration of the loop before optimization among the k iterations i+p1 to i+pk of the loop after optimization, wherein, for some integer q less than or equal to pk, if the (i+q)-th iteration of the loop after optimization contains an instruction sequence that performs processing of the i-th iteration of the loop before optimization, that instruction sequence is made an instruction sequence that is invalidated by the invalidation function when the loop after optimization does not iterate i+q-p1 or more times.

6. The loop according to claim 5, wherein the number of iterations of the loop before optimization is n, and the loop after optimization is a part of the (n + 1) th and subsequent iterations of the loop before optimization which are not originally necessary. Alternatively, a loop optimizing method for executing all the processes but optimizing such unnecessary processes by the invalidating means.

7. The loop optimization method according to claim 5, wherein the invalidation function according to claim 1 is realized by the method according to claim 3 or 4.

8. The invalidation function according to claim 1 is realized by the method according to claim 4, and by using the referencing means according to claim 4, the invalidation function is not used. The loop optimization method according to claim 5, wherein the number of registers used is also reduced.

9. The loop optimization method according to claim 5, wherein the loop optimization method is software pipelining.

10. The loop optimization method according to claim 5, wherein, when the loop transformation cannot be performed without the invalidation function according to claim 1 because the number of loop iterations is not determined at the start of execution of the loop, the loop optimization method is applied by using the invalidation function.

11. The loop optimization method according to claim 5, wherein the invalidation function according to claim 1 is used for a loop whose number of iterations is determined at the start of execution of the loop. When the loop optimization method is not executed when it is not repeated a certain number of times, the loop optimization is executed for a smaller number of iterations by using the invalidation function. A loop optimization method for generating a target code executed by an instruction sequence.

12. The loop optimization method according to claim 5, wherein the invalidation function according to claim 1 is used for a loop in which the number of iterations is determined at the start of execution of the loop. , The loop optimization method is not implemented by using the invalidation function when it is necessary to execute a part of the iterations by an instruction sequence that does not implement the loop optimization method depending on the number of iterations. A loop optimization method that produces less repetitive target code executed by a sequence of instructions.