JPH04307624A

JPH04307624A - Loop optimization system

Info

Publication number: JPH04307624A
Application number: JP3071978A
Authority: JP
Inventors: Atsushi Inoue; 淳井上; Kenji Shirakawa; 健治白川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1991-04-05
Filing date: 1991-04-05
Publication date: 1992-10-29
Anticipated expiration: 2015-04-10
Also published as: JP3032030B2

Abstract

PURPOSE: To ensure effective system operation on a computer that can execute multiple instructions simultaneously, by determining the number of loop unrolling stages that gives the greatest parallelization effect in view of the characteristics of each loop in a program. CONSTITUTION: A loop analysis unit 24 examines the data reference relationships between loop iterations. A loop unrolling stage number determination unit decides, from the analysis result of unit 24, a number of loop unrolling stages at which data addresses do not overlap, and a loop optimization unit performs loop unrolling and list scheduling based on that decision. In addition, unit 24 examines the number of operations and the number of registers used within each loop. The loop unrolling stage number determination unit then uses this result to decide the number of loop unrolling stages at which the operation types are best balanced and the register count stays within the available number.

Description

[Detailed description of the invention]

[Purpose of the invention]

[0001]

[Field of industrial application] The present invention relates to a loop optimization method for a computer capable of executing sequences of multiple instructions in parallel.

[0002]

[Prior art] Conventionally, computer instruction sequences were constructed on the premise of sequential processing, and computers accordingly fetched and executed instructions one at a time.

[0003] Pipelining was introduced to speed up the processing of such instruction sequences. Pipelining decomposes the execution of an instruction into multiple stages and, taking these stages as units, processes several instructions at different stages simultaneously. This shortened the cycle time and also reduced the processing time of the whole instruction sequence. Even with pipelining, however, only one instruction begins execution per cycle, and no parallel execution beyond this was performed.

[0004] In contrast, techniques that enable parallel execution at the level of individual instructions have been introduced in recent years, making it possible to process more than one instruction per cycle. The two representative approaches are VLIW (Very Long Instruction Word) technology and superscalar technology.

[0005] In VLIW technology, a predetermined number of instructions is defined as one execution unit, and the computer always executes these instructions simultaneously. Because the computer does not need to determine whether the instructions can be executed simultaneously, control is simple, so the scheme can be implemented with little hardware and the cycle time can be shortened. However, the parallelism of the processing must be determined by a compiler or by hand, and the instructions must be assigned in advance. This assignment is called "instruction scheduling."

[0006] Superscalar technology, on the other hand, provides hardware that interprets an instruction sequence written for conventional sequential execution and checks whether its instructions can be executed in parallel. When they can, the hardware executes them in parallel. Because this determination is left to dedicated hardware, even an ordinary program is guaranteed to execute compatibly with sequential execution. In practice, however, improving performance on a superscalar machine requires instruction scheduling similar to that for VLIW, using information such as hardware resources and data dependences, so that the arithmetic units are kept busy as much as possible.

[0007] A known instruction scheduling technique for these parallel execution schemes is "list scheduling," which takes one basic block (a sequence of instructions in the program containing no branches or exits) and reorders its instructions subject to constraints such as hardware resources and data dependences. This method is easy to implement, but because the number of instructions in one basic block of a typical program is limited, it has the drawback that sufficient performance improvement cannot be obtained. To overcome this drawback, when the target is a loop, list scheduling is sometimes combined with a technique called "loop unrolling," which expands the operations in the loop over multiple iterations. It is well known that processing loops in this way can yield higher parallelism than list scheduling alone.
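List scheduling as described above can be sketched in Python as follows. This is only an illustration under stated assumptions: every instruction takes one cycle, program order serves as the priority, and the names `list_schedule`, `instrs`, `deps`, and `units` are hypothetical rather than taken from the patent.

```python
def list_schedule(instrs, deps, units):
    """Greedy list scheduling of one basic block.

    instrs: dict mapping instruction name -> unit kind (e.g. "int", "fp")
    deps:   dict mapping instruction name -> set of names it must wait for
    units:  dict mapping unit kind -> number of units of that kind
    Returns a list of cycles; each cycle lists the names issued in it.
    Every instruction is assumed to complete in one cycle.
    """
    done, schedule = set(), []
    remaining = list(instrs)  # program order doubles as priority
    while remaining:
        free = dict(units)  # units available in this cycle
        issued = []
        for name in remaining:
            ready = deps.get(name, set()) <= done
            if ready and free[instrs[name]] > 0:
                free[instrs[name]] -= 1
                issued.append(name)
        if not issued:
            raise ValueError("cyclic dependence or no unit for an instruction")
        done.update(issued)  # results become visible in the next cycle
        remaining = [n for n in remaining if n not in issued]
        schedule.append(issued)
    return schedule


subs = {"s0": "int", "s1": "int", "s2": "int", "s3": "int"}
independent = list_schedule(subs, {}, {"int": 2})
chained = list_schedule(
    subs, {"s1": {"s0"}, "s2": {"s1"}, "s3": {"s2"}}, {"int": 2}
)
```

With two integer units, the four independent instructions issue in two cycles, while the four chained instructions need four cycles regardless of how many units are available.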

[0008] Loop unrolling expands several adjacent loop iterations and processes them together. For example, consider the loop written in C shown in Figure 3(a). Translating this loop directly into assembly yields Figure 3(b). Applying four-stage loop unrolling to this loop yields the loop of Figure 4(a). Loop unrolling thus combines four iterations of the loop into a single loop iteration to form an equivalent loop. Translating this loop into assembly yields Figure 4(b).

[0009] As the compiled results of these two loops show, loop unrolling increases the number of instructions subject to instruction scheduling within the loop from 7 in Figure 3(b) to 16 in Figure 4(b). In this example, the operations (sub) of each stage of the loop can be executed independently, so instruction scheduling can convert the loop into a parallelized program suited to VLIW and superscalar machines. Loop optimization that performs this kind of unrolling inside the compiler, without modifying the source program, is in common use.
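Figures 3 and 4 are not reproduced in this text, so the following Python sketch of four-stage unrolling uses a hypothetical loop body, `a[i] = b[i] - c[i]`, chosen only because the description says each stage performs a sub. The helper `instructions_after_unrolling` is likewise an assumption: the counts 7 and 16 quoted above are consistent with 3 instructions per stage plus 4 instructions of loop overhead, but the patent text here does not state that breakdown.

```python
def rolled(b, c):
    """Original loop: one subtraction per iteration."""
    a = [0] * len(b)
    for i in range(len(b)):
        a[i] = b[i] - c[i]
    return a


def unrolled_by_4(b, c):
    """Four-stage unrolling: four original iterations per loop iteration.

    Assumes the trip count is a multiple of 4; a real compiler would
    add a remainder loop for the leftover iterations.
    """
    if len(b) % 4 != 0:
        raise ValueError("trip count must be a multiple of 4 in this sketch")
    a = [0] * len(b)
    for i in range(0, len(b), 4):
        a[i] = b[i] - c[i]
        a[i + 1] = b[i + 1] - c[i + 1]
        a[i + 2] = b[i + 2] - c[i + 2]
        a[i + 3] = b[i + 3] - c[i + 3]
    return a


def instructions_after_unrolling(per_stage, overhead, stages):
    """Instruction count of the unrolled loop under the assumed breakdown."""
    return per_stage * stages + overhead
```

Both functions compute the same result; the unrolled version simply exposes four independent subtractions to the scheduler in each loop iteration.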

[0010] Reconstructing the instruction sequence by combining loop unrolling with list scheduling in this way achieves higher parallelism than earlier methods, but it has the drawback that unrolling the loop increases the size of the object program.

[0011] Moreover, the combination of loop unrolling and list scheduling is not a universal remedy. For example, when the data references within a loop are circular (data recursion), as in the loop of Figure 5(a), the result of the previous iteration is used in the next computation. Even if the loop is unrolled, the operations of consecutive iterations cannot be overlapped, and little parallelization effect is obtained. Indeed, Figure 5(b), which is the result of applying four-stage loop unrolling to Figure 5(a) and translating it into assembly, shows that although the number of instructions in the loop has increased, the four sub instructions must essentially be executed sequentially because of the data recursion. That is, the sub marked (*) takes the result of the sub marked (**) as an operand, so it must not be executed until (**) has completed. Consequently, for such loops, parallel performance cannot be made high even on a VLIW or superscalar computer with multiple arithmetic units.
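The effect of data recursion on an unrolled loop body can be illustrated with a small Python sketch. It assumes each sub takes one cycle and models only the dependence between stages; the function name `sub_chain_length` and the `distance` parameter are hypothetical and not part of the patent.

```python
def sub_chain_length(n_stages, distance=None):
    """Longest chain of dependent sub operations in an unrolled body.

    Stage j depends on stage j - distance when distance is given
    (a loop-carried dependence); with distance=None all stages are
    independent. Each sub is assumed to take one cycle, and n_stages
    must be at least 1.
    """
    depth = [1] * n_stages
    for j in range(n_stages):
        if distance is not None and j - distance >= 0:
            depth[j] = depth[j - distance] + 1
    return max(depth)
```

For four unrolled stages, `sub_chain_length(4)` is 1 (all four subs can issue together), whereas `sub_chain_length(4, 1)` is 4: with a dependence on the immediately preceding iteration, the subs form a single chain and extra arithmetic units do not help.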

[0012] Also, depending on the compiler's code generation scheme, when a loop contains an array reference accessed through an induction variable (k in this example), as in Figure 6, the induction variable computation at each stage may prevent operations from being overlapped even after loop unrolling.

[0013] In short, for loops such as those of Figures 5 and 6, on which loop unrolling has little effect, conventional loop unrolling merely increases the object code size without delivering a sufficient parallelization effect, and this impedes effective operation of the system.

[0014] Furthermore, besides the problems above, reconstructing the instruction sequence by combining loop unrolling with list scheduling has the drawback that when the types of operations in the loop are unevenly distributed, the multiple processing units of the computer cannot be used effectively and performance improves little. That is, efficient parallel execution was not possible because nothing ensured that the ratio of operation counts by type (integer operations, floating-point operations) in the loop was as close as possible to the ratio of the computer's arithmetic units.

[0015] This is explained using Figure 9 as an example. Consider executing the loop of Figure 9(a) on a computer with two integer units and one floating-point unit. As the number of loop unrolling stages is varied, the numbers of integer and floating-point operations in the loop change as shown in Figure 9(b). The closer the ratio of integer operations to floating-point operations is to the ratio of integer units to floating-point units, the more likely it is that an efficient schedule can be obtained. Indeed, when the program of Figure 9(a) is scheduled at different numbers of unrolling stages, the number of instructions per cycle in the loop varies as shown in Figure 9(b), and it is best where the integer/floating-point operation ratio is close to the arithmetic unit ratio.
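The comparison between the operation mix and the unit mix can be sketched in Python. Figure 9 is not reproduced here, so the operation counts below are hypothetical (one integer and one floating-point operation per stage, plus three integer operations paid once per loop), and measuring closeness by the integer share of operations versus the integer share of units is an illustrative choice, not a formula given in the patent.

```python
from fractions import Fraction


def mix_gap(int_ops, fp_ops, int_units, fp_units):
    """Distance between the integer share of operations and of units."""
    ops_share = Fraction(int_ops, int_ops + fp_ops)
    unit_share = Fraction(int_units, int_units + fp_units)
    return abs(ops_share - unit_share)


def ops_after_unrolling(per_stage, once_per_loop, n):
    """Operation count after n-stage unrolling (assumed linear model)."""
    return per_stage * n + once_per_loop


def best_stage_count(max_stages):
    """Stage count whose operation mix is closest to 2 int : 1 fp units."""
    def gap(n):
        int_ops = ops_after_unrolling(1, 3, n)  # hypothetical counts
        fp_ops = ops_after_unrolling(1, 0, n)   # hypothetical counts
        return mix_gap(int_ops, fp_ops, int_units=2, fp_units=1)
    return min(range(1, max_stages + 1), key=gap)
```

Under these assumed counts, three stages give 6 integer and 3 floating-point operations, which matches the 2:1 unit ratio exactly, so `best_stage_count(8)` selects 3.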

[0016] In addition, when loop unrolling is combined with list scheduling, unrolling beyond a certain number of stages can cause the number of registers used within the loop to exceed the number of registers the computer has. In that case no register is available, and values are saved to the stack (this situation is called "register overflow"). Because stack accesses are generally inefficient, such register overflow can make performance worse despite the loop unrolling. Indeed, Figure 9(b) shows that for the program of Figure 9(a), register overflow occurs and performance drops when the number of loop unrolling stages is 12 or more.

[0017] In other words, to apply loop unrolling effectively it is very important, as shown above, to determine the optimal number of unrolling stages from the number of operations in the loop and the number of registers used in the loop. Conventionally, however, flexible loop unrolling that makes use of this information was not possible.

[0018]

[Problem to be solved by the invention] As described above, with the combined loop unrolling and list scheduling optimization conventionally proposed for VLIW and superscalar machines, the instruction scheduling method cannot be switched freely according to the characteristics of the program's loops. As a result, loops may be unrolled many times even though little parallelization effect is gained, so the object code becomes very large and efficient operation of the system is impeded. Furthermore, when the ratio of operation counts by type differs greatly from the ratio of the computer's arithmetic units, the hardware cannot be used effectively and a sufficient parallelization effect is not obtained. Alternatively, when the number of loop unrolling stages is set too large, register overflow occurs and performance actually degrades. These, too, impede efficient operation of the system.

[0019] The present invention was made in view of these circumstances. Its object is to provide a loop optimization method that ensures effective operation of the system in two ways. First, it considers the characteristics of each loop in the program and, according to the reference and update relationships of the data in the loop, determines the number of loop unrolling stages that gives the greatest parallelization effect while preventing needless growth of the object code, and then processes the loop accordingly. Second, it obtains the number of integer operations, the number of floating-point operations, and the number of registers used in the loop and, according to the number of arithmetic units the computer has, determines the optimal number of loop unrolling stages that gives the greatest parallelization effect without causing register overflow, and then processes the loop accordingly.

[Structure of the invention]

[0020]

[Means for solving the problems] The loop optimization method according to the first invention comprises a loop optimization unit that performs combined optimization of loop unrolling and list scheduling; a loop analysis unit that determines, from the data dependences and array reference relationships in the loop, the minimum loop iteration difference at which addresses overlap; and a loop unrolling stage number determination unit that determines, from the detection result of the loop analysis unit, the number of loop unrolling stages that achieves the greatest parallelization effect at the smallest object code size. The method is characterized in that loop unrolling and list scheduling are performed based on the decision of the loop unrolling stage number determination unit to optimize the loop.

[0021] The loop optimization method according to the second invention comprises a loop optimization unit that performs combined optimization of loop unrolling and list scheduling; a loop analysis unit that obtains the number of integer operations, floating-point operations, integer registers, and floating-point registers in the loop, and from these the number of integer operations, floating-point operations, integer registers used, and floating-point registers used when loop unrolling is applied; and a loop unrolling stage number determination unit that, from the output of the loop analysis unit and in view of the number of integer units, floating-point units, and registers the computer has, determines the optimal number of loop unrolling stages that uses the hardware most effectively without causing register overflow. The method is characterized in that loop unrolling and list scheduling are performed based on the decision of the loop unrolling stage number determination unit to optimize the loop.

[0022]

[Operation] According to the first invention, information on data reference overlap between loop iterations is obtained in advance from the data reference relationships, array access forms, and so on in each loop of the program. The optimal number of loop unrolling stages is determined from this information, and the loop is then unrolled and its instructions reordered. Loops for which the conventional method achieves a high parallelization effect therefore keep their parallel performance, while loops for which the conventional method achieves little are processed with the growth in object code size due to unrolling kept to a minimum.

[0023] According to the second invention, the number of integer and floating-point operations in each loop of the program and the number of integer and floating-point registers used in the loop are obtained in advance. From these, and in view of the number of arithmetic units and registers the computer has, the optimal number of loop unrolling stages is determined, and the loop is then unrolled and its instructions reordered. Performance can therefore be improved even for loops for which the conventional method does not achieve a sufficient parallelization effect, and performance degradation due to register overflow can be prevented.

[0024]

[Embodiments] An embodiment of the present invention is described below with reference to the drawings.

○ Embodiment 1

[0025] Figure 1 shows the configuration of a compiler to which a loop optimization method according to one embodiment of the present invention is applied. The compiler comprises a program input unit 21, a parsing unit 22, a loop extraction unit 23, a loop analysis unit 24, a loop unrolling unit 25, an instruction scheduling unit 26, and a code generation unit 27.

[0026] The source program read by the program input unit 21 is converted into intermediate text by the parsing unit 22 and input to the loop extraction unit 23. The loop extraction unit 23 examines the intermediate text, detects and extracts the loops in the program, and passes their contents to the loop analysis unit 24.

[0027] The loop analysis unit 24 examines the reference and update addresses of each datum appearing in the loop and checks whether a data value produced in one iteration is used in a later iteration. If such an address overlap exists, it determines the difference in iteration number between the iterations whose addresses overlap (hereinafter called the "minimum address-overlap iteration difference") and notifies the loop unrolling unit 25 of this value.

[0028] This analysis can use data dependence analysis, which examines the subscript expressions of arrays that appear on both the left-hand and right-hand sides within the loop and checks whether the two can take a common value. For languages that use pointer variables, a similar analysis can be performed by tracking changes in pointer values. This data dependence analysis may be configured freely according to the language being processed and the optimization requirements.

[0029] For example, for the loop shown in Figure 7(a), the analysis recognizes that the data x[i] defined in loop iteration i is referenced two iterations later. In this case the loop analysis unit 24 notifies the loop unrolling unit 25 of the iteration difference 2, the smallest difference that produces a data address overlap. Similarly, for Figure 8, the data x[i] defined in loop iteration i is referenced in the next iteration, so the loop analysis unit 24 reports a minimum address-overlap iteration difference of 1.
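For the simple case where every subscript of an array has the form `i + constant`, the minimum address-overlap iteration difference can be computed as in the following Python sketch. Figures 7(a) and 8 are not reproduced here; loops of the form `x[i] = x[i-2] - ...` and `x[i] = x[i-1] - ...` are assumed because they match the description above. The function name and its parameters are hypothetical, and real data dependence analysis handles far more general subscripts.

```python
def min_overlap_difference(write_offsets, read_offsets):
    """Smallest d > 0 such that a value written in iteration i is read
    in iteration i + d, or None if no later iteration reads it.

    Each offset c stands for a subscript i + c on the same array.
    A write to x[i + w] in iteration i and a read of x[j + r] in
    iteration j touch the same address when j - i == w - r.
    """
    differences = [
        w - r for w in write_offsets for r in read_offsets if w - r > 0
    ]
    return min(differences) if differences else None
```

A write to `x[i]` with a read of `x[i-2]` gives `min_overlap_difference([0], [-2]) == 2`, and a read of `x[i-1]` gives 1, in line with the two cases described above.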

[0030] The loop unrolling unit 25 performs loop unrolling based on the minimum address-overlap iteration difference reported to it. If the loop analysis unit 24 has reported a minimum address-overlap iteration difference of M, the loop is unrolled to at most M stages. Accordingly, the loop of Figure 7(a) may be unrolled to one or two stages but is never unrolled to three or more. The loop of Figure 8 is limited to one stage, which means that loop unrolling is not applied to it at all.
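The cap described above amounts to a one-line rule, sketched here in Python. The `requested` parameter (the number of stages the compiler would otherwise use) is an assumption introduced for illustration, as is the convention that `None` means no address overlap was found.

```python
def max_unroll_stages(requested, min_overlap_difference):
    """Cap the requested stage count at M.

    min_overlap_difference is M, or None when the loop analysis found
    no address overlap between iterations.
    """
    if min_overlap_difference is None:
        return requested
    return min(requested, min_overlap_difference)
```

With a requested four stages, a loop with M = 2 is unrolled to two stages, and a loop with M = 1 stays at one stage, that is, it is not unrolled.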

[0031] Indeed, even if the loop of Figure 7(a) is unrolled to three or more stages, or the loop of Figure 8 to two or more stages, data dependences between iterations restrict instruction reordering, and it is difficult to raise parallelism. For example, if the loop of Figure 8 is unrolled to two stages, the two sub instructions must be executed sequentially, as in Figure 5, so they cannot be processed in parallel and a high-performance parallelized program is not obtained. In contrast, when the loop of Figure 7(a) is unrolled to two stages in accordance with the result from the loop analysis unit 24, the assembly output is as shown in Figure 7(b). Here the two sub instructions operate on independent data, so they may be executed in parallel freely, and an efficient parallel program can be constructed. The intermediate text optimized in this way is translated into machine language by the code generation unit 27 and output as an object program.

[0032] Figure 2 shows the configuration of a compiler to which a loop optimization method according to another embodiment of the present invention is applied. In addition to a program input unit 31, a parsing unit 32, a loop extraction unit 33, a loop analysis unit 34, a loop unrolling unit 35, an instruction scheduling unit 36, and a code generation unit 37, which correspond to those of Figure 1, this compiler has an induction variable analysis unit 38.

[0033] This embodiment applies to cases where, as in Figure 6, array accesses through an induction variable computed in every loop iteration prevent loop unrolling from improving parallel performance. In this case, the loop unrolling unit 35 decides whether to unroll the loop based on the result from the induction variable analysis unit 38 together with the result from the loop analysis unit 34.

[0034] The induction variable analysis unit 38 examines each datum in the in-loop intermediate text received from the loop extraction unit 33 and, if an induction variable such as k in Figure 6 appears, notifies the loop unrolling unit 35. On receiving this notification, the loop unrolling unit optimizes the loop without unrolling it.
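A crude version of this check can be sketched in Python. Figure 6 is not reproduced here, so the loop shape is assumed (a variable `k` updated from its own previous value and then used as an array subscript), and the statement representation with `target`, `reads`, and `subscripts` keys is hypothetical. A production compiler would identify induction variables far more precisely.

```python
def induction_subscripts(body, loop_var):
    """Variables, other than the loop variable, that are updated from
    their own previous value and are also used as array subscripts."""
    self_updated = {s["target"] for s in body if s["target"] in s["reads"]}
    subscripts = set()
    for s in body:
        subscripts |= s["subscripts"]
    return (self_updated & subscripts) - {loop_var}


def unrolling_allowed(body, loop_var):
    """Suppress unrolling when an induction-variable subscript is found."""
    return not induction_subscripts(body, loop_var)


# Hypothetical loop body:  k = k + m;  a[k] = b[k] - 1
with_k = [
    {"target": "k", "reads": {"k", "m"}, "subscripts": set()},
    {"target": "a", "reads": {"b", "k"}, "subscripts": {"k"}},
]
# Hypothetical loop body:  a[i] = b[i] - c[i]
plain = [{"target": "a", "reads": {"b", "c"}, "subscripts": {"i"}}]
```

Here `unrolling_allowed(with_k, "i")` is false because `k` is both self-updated and used as a subscript, while `unrolling_allowed(plain, "i")` is true.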

[0035] Because this method performs no unnecessary loop unrolling, it also has the secondary effect of shortening program compilation and optimization time, which allows the system to be used even more effectively.

○ Embodiment 2

[0036] Figure 11 shows the configuration of a compiler to which the loop optimization method of this embodiment is applied. The compiler comprises a program input unit 11, a parsing unit 12, a loop extraction unit 13, a loop analysis unit 14, a loop unrolling stage number determination unit 15, a loop unrolling unit 16, an instruction scheduling unit 17, and a code generation unit 18.

[0037] The source program read by the program input unit 11 is converted into intermediate text by the parsing unit 12 and input to the loop extraction unit 13. The loop extraction unit 13 examines the intermediate text, detects and extracts the loops in the program, and passes their contents to the loop analysis unit 14.

【００３８】本実施例におけるループ解析部１４は、ル
ープ内に現れる整数演算数、浮動小数点演算数、ループ
内で使用される整数レジスタ数、浮動小数点レジスタ数
を中間テキストを解析して求める。例えば図９（ａ）　
のループ内演算をｔｒｅｅ形式の中間テキストで表現し
たものは図１０のようになる。ループ解析部１４ではこ
れらの中間テキストを辿って各ｔｒｅｅの重複しない（
異なるデータオブジェクトに対応する）末端ノードをレ
ジスタ、中間ノードを演算として演算数、レジスタ数を
求めていく。また、この際に演算の型情報から整数／浮動小数点演算
数、整数／浮動小数点レジスタ数を個別に求める。また
ループ脱出条件評価のための演算数も別に求める。さら
にコード生成方式を考慮して必要があれば、中間ノード
に対してテンポラリレジスタが使用されるものとして整
数／浮動小数点テンポラリレジスタ数を求めても良い。図１０に対する各々の値は、整数演算数＝０浮動小数点演算数＝１６アドレス計算、分岐計算数＝４整数レジスタ数＝１（ループ変数ｋ）浮動小数点レジスタ数＝１２となる。これら算出した情報をループ展開段数決定部１
５に通知する。The loop analysis section 14 in this embodiment analyzes the intermediate text to obtain the number of integer operations and floating-point operations appearing in the loop, and the number of integer registers and floating-point registers used in the loop. For example, FIG. 10 shows the in-loop operations of FIG. 9(a) expressed as tree-format intermediate text. The loop analysis section 14 traverses this intermediate text and counts operations and registers by treating the non-duplicate leaf nodes of each tree (those corresponding to different data objects) as registers and the intermediate nodes as operations. At this time, the number of integer/floating-point operations and the number of integer/floating-point registers are obtained individually from the type information of each operation. The number of operations for evaluating the loop exit condition is also obtained separately. Furthermore, if the code generation method calls for it, the number of integer/floating-point temporary registers may be obtained on the assumption that temporary registers are used for the intermediate nodes. The values for FIG. 10 are:
Number of integer operations = 0
Number of floating-point operations = 16
Number of address calculations and branch calculations = 4
Number of integer registers = 1 (loop variable k)
Number of floating-point registers = 12
This calculated information is passed to the loop unrolling stage number determination section 15.
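The counting rule described above (intermediate nodes are operations, distinct leaf nodes are registers, both split by type) can be sketched as follows. This is only an illustrative sketch: the `Node` class, the `dtype` labels, and the example expression are hypothetical and are not taken from FIG. 10.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    dtype: str                # "int" or "float" (hypothetical type label)
    name: str                 # operator symbol or data object name
    children: list = field(default_factory=list)

def count_tree(node, ops, leaves):
    """Internal nodes count as operations; distinct leaves count as registers."""
    if node.children:
        ops[node.dtype] += 1
        for child in node.children:
            count_tree(child, ops, leaves)
    else:
        # The same data object appearing twice is counted only once.
        leaves[node.dtype].add(node.name)

def analyze(trees):
    ops = {"int": 0, "float": 0}
    leaves = {"int": set(), "float": set()}
    for tree in trees:
        count_tree(tree, ops, leaves)
    return ops, {t: len(names) for t, names in leaves.items()}

# Hypothetical tree for (q + y * z) + q, all floating point.
q, y, z = (Node("float", n) for n in "qyz")
tree = Node("float", "+", [Node("float", "+", [q, Node("float", "*", [y, z])]), q])
ops, regs = analyze([tree])
print(ops, regs)
```

In this example `q` appears in two leaves but contributes a single register, which mirrors the "non-duplicate leaf node" rule in the text.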

【００３９】ループ展開段数決定部１５はループ解析部
１４の出力情報とハードウェア情報を元にレジスタあふ
れを起こさずにかつ計算機の構成に適合した演算数が得
られる最適なループ展開段数を決定する。ハードウェア
情報としては計算機の有する整数ユニット数、浮動小数
点ユニット数、整数レジスタ数、浮動小数点レジスタ数
、整数テンポラリレジスタ数、浮動小数点テンポラリレ
ジスタ数、各演算の遅延サイクル数などを計算機の構成
、コード生成方式などに応じて与える。またハードウェ
ア情報は、特定の計算機用の情報をコンパイラ内部に登
録してもよいし、コンパイラ起動時にオプションなどで
与えてもよく、本発明ではハードウェア情報の入力形式
は任意のものを使用して構わない。The loop unrolling stage number determination section 15 uses the output of the loop analysis section 14 and hardware information to determine the optimal number of loop unrolling stages, that is, the number that yields an operation mix suited to the configuration of the computer without causing register overflow. The hardware information includes the number of integer units, floating-point units, integer registers, floating-point registers, integer temporary registers, and floating-point temporary registers of the computer, and the number of delay cycles for each operation; it is supplied according to the computer configuration, the code generation method, and so on. The hardware information for a specific computer may be registered inside the compiler, or it may be given as an option when the compiler is started; in the present invention, any input format may be used for the hardware information.

【００４０】以下、整数ユニット２個、浮動小数点ユニ
ット１個、整数レジスタ３２個、浮動小数点レジスタ３
２個、整数ロード命令の遅延が１サイクル、浮動小数点
ロード命令も整数ユニットで実行され遅延が２サイクル
、浮動小数点演算の遅延が２サイクルである計算機を仮
定して説明していく。ループ解析部１４で得られた情報
を元に最適なループ展開段数を得るには、以下の計算を
行う。The following explanation assumes a computer with two integer units, one floating-point unit, 32 integer registers, and 32 floating-point registers, in which an integer load instruction has a delay of 1 cycle, a floating-point load instruction is also executed in an integer unit with a delay of 2 cycles, and a floating-point operation has a delay of 2 cycles. To obtain the optimal number of loop unrolling stages from the information produced by the loop analysis section 14, the following calculation is performed.

【００４１】まず、整数演算数、浮動小数点演算数、整
数レジスタ数、浮動小数点レジスタ数を元にループ展開
をｎ段行った場合の演算数を計算する。レジスタ数はロ
ード命令で演算ユニットが使用されるサイクルを考慮に
入れるために使用する。一般にループ展開を行う場合は
、ループ内配列のアドレス加算はｎ段分をまとめて行う
。またループ終了条件の評価計算もｎ段分で１回に減る
ので、整数演算、浮動小数点演算が混在するループの場
合はループ展開段数が増えると全演算に占める整数演算
の比率は小さくなっていく。整数演算数ＩＯＰ　は、Ｉ
ＯＰ　＝　（整数演算数＋（整数レジスタ数）＊２＋（
浮動小数点レジスタ数）＊３）＊（ループ展開段数）＋
（アドレス、分岐演算数）で求められる。ここで（整数レジスタ数）＊２は１サイ
クル遅延で実行される整数ロード命令、（浮動小数点レ
ジスタ数）＊３は整数ユニットで２サイクル遅延実行さ
れる浮動小数点ロード命令に整数ユニットが使用される
サイクルを近似している。同様に浮動小数点演算数ＦＯ
Ｐ　は、ＦＯＰ　＝　浮動小数点演算数＊３＊（ループ展開段数
）で求められる。＊３は浮動小数点演算命令が２サイク
ル遅延で実行されることを表している。First, the number of operations when the loop is unrolled by n stages is calculated from the number of integer operations, the number of floating-point operations, the number of integer registers, and the number of floating-point registers. The register counts are used to account for the cycles in which an arithmetic unit is occupied by load instructions. In general, when a loop is unrolled, the address additions for arrays in the loop are performed once for all n stages. The evaluation of the loop termination condition is likewise reduced to once per n stages, so in a loop that mixes integer and floating-point operations, the proportion of integer operations among all operations decreases as the number of loop unrolling stages increases. The integer operation count IOP is given by
IOP = (number of integer operations + (number of integer registers) * 2 + (number of floating-point registers) * 3) * (number of loop unrolling stages) + (number of address and branch operations)
Here, (number of integer registers) * 2 approximates the cycles for which an integer unit is occupied by integer load instructions, which execute with a one-cycle delay, and (number of floating-point registers) * 3 approximates the cycles for which an integer unit is occupied by floating-point load instructions, which execute in an integer unit with a two-cycle delay. Similarly, the floating-point operation count FOP is given by
FOP = (number of floating-point operations) * 3 * (number of loop unrolling stages)
The factor 3 reflects that a floating-point operation instruction executes with a two-cycle delay.
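As a worked illustration of the IOP and FOP formulas, the sketch below evaluates them for the values the text gives for FIG. 10 (0 integer operations, 16 floating-point operations, 4 address and branch operations, 1 integer register, 12 floating-point registers). The function and parameter names are hypothetical, and the weights 2 and 3 are hard-coded for the assumed machine described above.

```python
def iop(int_ops, int_regs, fp_regs, addr_branch_ops, n):
    """Integer-unit work for n unrolling stages, per the IOP formula in the text."""
    return (int_ops + int_regs * 2 + fp_regs * 3) * n + addr_branch_ops

def fop(fp_ops, n):
    """Floating-point-unit work for n unrolling stages, per the FOP formula."""
    return fp_ops * 3 * n

for n in (1, 2, 4):
    print(n, iop(0, 1, 12, 4, n), fop(16, n))
# prints:
# 1 42 48
# 2 80 96
# 4 156 192
```

The address and branch term is added once regardless of n, which is why IOP grows more slowly than in direct proportion to the number of stages.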

【００４２】これらの値から、ＩＦｒａｔｅ＝ＩＯＰ＊２／ＦＯＰを各ループ展開段数について求めていく。ハードウェア
は２つの整数ユニットと１つの浮動小数点ユニットをも
っているので、ＩＦｒａｔｅが１に近いほど各種別の命
令分布がハードウェアに適合することになる。演算数の
評価と同時に、レジスタ数も評価する。この場合、ルー
プ解析部１４で求めたレジスタノードをさらに解析して
、・独立なスカラー変数のノード数・独立な配列変数のノード数を計算する。ここで独立な配列変数というのは、例えば
図３（ａ）　のループに対するアセンブラプログラム図
３（ｂ）　を見ると、配列ｙ［ｉ］には予めループ外で
レジスタが確保され（＄ｆ２）、ループ内でｙ［ｉ＋１
］に対するレジスタ（＄ｆ０）のみが使用される。次段
に移る際にレジスタ間移動命令（ｍｏｖ）で値をコピー
するようになっている。このようにループ制御変数ｉと
一定値オフセットでアクセスされる同一配列は１つのレ
ジスタで割付が可能になるので１レジスタに対応として
数える。配列ｘはこれとは別にレジスタが使用されるの
で独立な配列変数である。図９に示すプログラムでは独
立な配列ノードはＸ、Ｙ、Ｚ、Ｕの４つになる。このよ
うな方式で使用レジスタ数を算出すると、整数レジスタ
数ＩＲＥＧはＩＲＥＧ＝（独立整数配列ノード数）＊ループ展開段数
＋（総整数ノード数ー独立整数配列ノード数）From these values, IFrate = IOP * 2 / FOP is calculated for each candidate number of loop unrolling stages. Since the hardware has two integer units and one floating-point unit, the closer IFrate is to 1, the better the distribution of instruction types matches the hardware. The number of registers is evaluated together with the number of operations. For this purpose, the register nodes obtained by the loop analysis section 14 are analyzed further to count the number of independent scalar variable nodes and the number of independent array variable nodes. The meaning of an independent array variable can be seen from the assembler program of FIG. 3(b) for the loop of FIG. 3(a): a register ($f2) is reserved in advance outside the loop for array element y[i], and only the register ($f0) for y[i+1] is used inside the loop. On moving to the next stage, the value is copied with a register-to-register move instruction (mov). Because references to the same array at constant offsets from the loop control variable i can be allocated with one register in this way, they are counted as corresponding to one register. Array x uses a register separate from this, so it is an independent array variable. In the program shown in FIG. 9, there are four independent array nodes: X, Y, Z, and U. Calculating the number of registers used in this way, the number of integer registers IREG is
IREG = (number of independent integer array nodes) * (number of loop unrolling stages) + (total number of integer nodes - number of independent integer array nodes)
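The two quantities introduced in this paragraph can be written down directly. The sketch below implements IFrate and IREG exactly as the text states them; the function names and the sample arguments for `ireg` are hypothetical.

```python
def if_rate(iop, fop):
    """IFrate = IOP * 2 / FOP, as stated in the text."""
    return iop * 2 / fop

def ireg(indep_int_arrays, total_int_nodes, n):
    """Integer registers needed for n unrolling stages, per the IREG formula."""
    return indep_int_arrays * n + (total_int_nodes - indep_int_arrays)

# IOP = 42 and FOP = 48 follow from the FIG. 10 values at one unrolling stage.
print(if_rate(42, 48))   # 1.75
# Hypothetical loop: 2 independent integer arrays among 3 integer nodes, 4 stages.
print(ireg(2, 3, 4))     # 9
```

Only the independent array nodes scale with the number of stages; the remaining nodes contribute a fixed count.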

【００４３】独立整数配列ノードはループ展開されると別のレジ
スタを必要とするのでループ展開段数倍されるが、それ
以外のスカラー変数はループ内で変化しないので１個の
レジスタで充分であることを表している。同様に浮動小
数点レジスタ数ＦＲＥＧは、ＦＲＥＧ＝（独立浮動小数点配列ノード数）＊ループ展
開段数＋（総浮動小数点ノード数ー独立浮動小数点配列
ノード数）となる。これらのレジスタ数が計算機の持つレジスタ数
を越えない範囲でループ展開段数を決定する。以上示し
た、演算数、レジスタ数評価を併せた最適ループ展開段
数決定部分のフローチャートを図１２に示す。[0043] When the loop is unrolled, each independent integer array node requires a separate register for every stage, so its count is multiplied by the number of loop unrolling stages; the other scalar variables do not change within the loop, so one register each is sufficient. Similarly, the number of floating-point registers FREG is
FREG = (number of independent floating-point array nodes) * (number of loop unrolling stages) + (total number of floating-point nodes - number of independent floating-point array nodes)
The number of loop unrolling stages is determined within the range in which these register counts do not exceed the number of registers the computer has. FIG. 12 shows a flowchart of the optimal loop unrolling stage number determination, which combines the evaluation of the number of operations and the number of registers described above.
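The flowchart of FIG. 12 is not reproduced here, so the following is only one plausible reading of the combined evaluation: try each candidate stage count, stop once a register limit is exceeded, and keep the count whose IFrate is closest to 1. The function name, the search bound `max_n`, and the assumption that the four independent arrays X, Y, Z, U are all floating point are hypothetical.

```python
def choose_unroll(int_ops, fp_ops, addr_branch_ops,
                  int_nodes, fp_nodes, indep_int_arrays, indep_fp_arrays,
                  hw_int_regs=32, hw_fp_regs=32, max_n=16):
    """Pick the stage count n with IFrate closest to 1 that fits in registers."""
    if fp_ops <= 0:
        raise ValueError("this sketch assumes at least one floating-point operation")
    best_n, best_gap = None, None
    for n in range(1, max_n + 1):
        ireg = indep_int_arrays * n + (int_nodes - indep_int_arrays)
        freg = indep_fp_arrays * n + (fp_nodes - indep_fp_arrays)
        if ireg > hw_int_regs or freg > hw_fp_regs:
            break  # register demand never shrinks as n grows
        iop = (int_ops + int_nodes * 2 + fp_nodes * 3) * n + addr_branch_ops
        fop = fp_ops * 3 * n
        gap = abs(iop * 2 / fop - 1)
        if best_gap is None or gap < best_gap:
            best_n, best_gap = n, gap
    return best_n

# FIG. 10 values from the text, with 4 independent floating-point arrays assumed.
n = choose_unroll(int_ops=0, fp_ops=16, addr_branch_ops=4,
                  int_nodes=1, fp_nodes=12,
                  indep_int_arrays=0, indep_fp_arrays=4)
print(n)  # 6
```

Under these assumptions FREG = 4n + 8 reaches the 32-register limit at n = 6, and IFrate keeps moving toward 1 as n grows, so the register limit is what fixes the result.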

【００４４】なお本実施例では、変数レジスタのみを対
象にして評価を行う方式を述べたが、レジスタ割当方式
が確定しておりまたレジスタの使用形態も確定して一定
のテンポラリレジスタを使用することが決まっている場
合には演算ｔｒｅｅの中間ノードに対してテンポラリレ
ジスタが割り当てられるとして評価に加えてもよいし、
演算ユニットに関してもさらに細分化して、整数ユニッ
ト／浮動小数点加減算ユニット／浮動小数点乗算ユニッ
トなどを固有に評価してもよい。[0044] This embodiment has described a method that evaluates only variable registers. However, if the register allocation method and the register usage pattern are fixed, so that a fixed set of temporary registers is known to be used, temporary registers may be assumed to be assigned to the intermediate nodes of the operation tree and added to the evaluation. The arithmetic units may also be subdivided further so that integer units, floating-point addition/subtraction units, floating-point multiplication units, and so on are evaluated individually.

【００４５】ループ展開部１６は、ループ展開段数決定
部１５から通知された最適ループ展開段数でループ展開
処理を行う。通知される展開段数でループ展開を行うと
レジスタあふれを起こすことなく、かつ計算機の有する
ハードウェアリソース数の構成と展開されたループ内の
演算の構成が最も適合した命令列を得ることができ、命
令スケジュール部１７での命令並べ変えにより効率の良
い並列実行を行うことができる。このようにしてループ
最適化された中間テキストはコード生成部１８により機
械語に翻訳されオブジェクトプログラムとして出力され
る。The loop unrolling section 16 unrolls the loop by the optimal number of loop unrolling stages notified by the loop unrolling stage number determination section 15. Unrolling by the notified number of stages yields an instruction sequence in which the mix of operations in the unrolled loop best matches the hardware resources of the computer without causing register overflow, so that reordering of instructions in the instruction scheduling section 17 results in efficient parallel execution. The intermediate text optimized in this way is translated into machine language by the code generation section 18 and output as an object program.

【００４６】なお本方式は使用する計算機のハードウェ
ア構成に依らず適用可能である。即ちループ展開段数決
定部で使用するハードウェアパラメータを交換するだけ
で内部の制御を全く変えることなく最適ループ展開段数
を決定することが可能であり、非常に広範囲の計算機に
適用可能である。Note that this method is applicable regardless of the hardware configuration of the computer used. That is, by simply exchanging the hardware parameters used in the loop unrolling stage number determining section, it is possible to determine the optimum number of loop unrolling stages without changing the internal control at all, and it is applicable to a very wide range of computers.

【００４７】また、実施例１と実施例２を組み合わせて
実施することもできる。この場合、例えば、実施例１の
ループ解析部２４により求めた、参照あるいは更新デー
タの相互間でアドレスが重なり合う最小の反復段数の差
Ｍを、実施例２の展開段数決定部１５においてループ展
開段数を決定する際の上限とする。すると、展開段数決
定部１５で決定されるループ展開段数は、参照あるいは
更新データの相互間でアドレスが重なり合うことがなく
、かつ、レジスタあふれを起こすことのない範囲で、計
算機の有するユニットを最大限有効に利用できるものと
なる。[0047] Embodiment 1 and embodiment 2 can also be implemented in combination. In this case, for example, the minimum difference M in iteration stages at which addresses overlap between referenced or updated data, obtained by the loop analysis section 24 of embodiment 1, is used as the upper limit when the unrolling stage number determination section 15 of embodiment 2 determines the number of loop unrolling stages. The number of loop unrolling stages determined by the unrolling stage number determination section 15 then makes the most effective use of the units of the computer within the range in which addresses do not overlap between referenced or updated data and register overflow does not occur.
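The combination described above can be sketched by restricting the candidate stage counts to 1..M before applying the register and IFrate criteria. This is a minimal sketch under assumptions: the function name, the callable parameters, and the fallback to a single stage are hypothetical, and the example lambdas reuse the FIG. 10 values with four independent floating-point arrays assumed.

```python
def choose_unroll_capped(m, fits_registers, if_rate):
    """Pick n in 1..m that fits in registers and has IFrate closest to 1."""
    candidates = [n for n in range(1, m + 1) if fits_registers(n)]
    if not candidates:
        return 1  # assumption: fall back to a single stage (no unrolling)
    return min(candidates, key=lambda n: abs(if_rate(n) - 1))

# FREG = 4n + 8 must not exceed 32; IOP = 38n + 4 and FOP = 48n.
fits = lambda n: 4 * n + 8 <= 32
rate = lambda n: (38 * n + 4) * 2 / (48 * n)

print(choose_unroll_capped(2, fits, rate))   # 2: limited by the address overlap M
print(choose_unroll_capped(10, fits, rate))  # 6: limited by the register count
```

Whichever limit is tighter, the data dependence distance M or the register count, determines the result.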

【００４８】[0048]

【発明の効果】以上説明したように本発明によれば、予
めプログラム内の各ループ部分におけるデータ参照関係
、配列アクセス形式等から異なるループ反復間のデータ
アドレス重なりを求め、これを元に最適なループ展開段
数を決定した上で、ループの展開、及び命令の並べ替え
を行うので、従来方式で高い並列化効果を得られるルー
プの並列化性能は保持した上に、従来方式でさほど並列
化効果が上がらないループでは、ループ展開によるオブ
ジェクトコードサイズの増大を最小限にとどめてループ
最適化方式が可能となる。さらに、予めプログラム内の
各ループの整数／浮動小数点演算数、ループ内で使用す
る整数／浮動小数点レジスタ数を求め、これを元に計算
機の有する演算ユニット数やレジスタ数を考慮して最適
なループ展開段数を決定してループの展開、及び命令の
並べ替えを行うので、従来方式で充分な並列化効果を得
られないループであっても性能を向上させ、かつレジス
タあふれによる性能の低下を未然に防止することが可能
なループ最適化方式を提供できる。As explained above, according to the present invention, the overlap of data addresses between different loop iterations is determined in advance from the data reference relationships, array access forms, and so on in each loop of the program, and the optimal number of loop unrolling stages is determined on that basis before the loop is unrolled and the instructions are reordered. As a result, loops for which the conventional method already achieves a high parallelization effect retain that performance, while for loops in which the conventional method gains little from parallelization, the increase in object code size caused by loop unrolling is kept to a minimum. Furthermore, the number of integer/floating-point operations in each loop of the program and the number of integer/floating-point registers used in the loop are determined in advance, and the optimal number of loop unrolling stages is determined from them in consideration of the number of arithmetic units and registers of the computer before the loop is unrolled and the instructions are reordered. This provides a loop optimization method that improves performance even for loops in which the conventional method cannot achieve a sufficient parallelization effect, and that prevents performance degradation due to register overflow.

[Brief explanation of the drawing]

【図１】　　本発明の第１の実施例に係るループ最適化
方式を適用したコンパイラの構成例を示す図。FIG. 1 is a diagram showing an example of the configuration of a compiler to which a loop optimization method according to a first embodiment of the present invention is applied.

【図２】　　本発明の第１の実施例に係るループ最適化
方式を適用したコンパイラの別の構成例を示す図。FIG. 2 is a diagram showing another configuration example of a compiler to which the loop optimization method according to the first embodiment of the present invention is applied.

【図３】　　（ａ）　は本発明に係るループ最適化方式
の処理対象となるループを示すソースプログラム、（ｂ
）　は（ａ）　のプログラムをコンパイルした結果を示
すアセンブラプログラムを表す図。FIG. 3: (a) is a source program showing a loop to be processed by the loop optimization method according to the present invention, and (b) is an assembler program showing the result of compiling the program of (a).

【図４】　　（ａ）　は図３（ａ）　に４段のループ展
開を施したループを示すソースプログラム、（ｂ）　は
（ａ）　のプログラムをコンパイルした結果を示すアセ
ンブラプログラムを表す図。FIG. 4: (a) is a source program showing the loop of FIG. 3(a) after four-stage loop unrolling, and (b) is an assembler program showing the result of compiling the program of (a).

【図５】　　（ａ）　は内部にデータリカージョンを含
むループを示すソースプログラム、（ｂ）　は（ａ）　
のプログラムを４段のループ展開でコンパイルした結果
を示すアセンブラプログラムを表す図。FIG. 5: (a) is a source program showing a loop that contains data recursion, and (b) is an assembler program showing the result of compiling the program of (a) with four-stage loop unrolling.

【図６】　　ループ内に帰納変数アクセスを含むループ
を示すソースプログラムを表す図。FIG. 6 is a diagram illustrating a source program showing a loop that includes induction variable access within the loop.

【図７】　　（ａ）　は最小アドレス重なり反復差が２
となるループを示すソースプログラム、（ｂ）　は（ａ
）　のプログラムをループ解析部の判定結果に従って２
段ループ展開してコンパイルした結果を示すアセンブラ
プログラムを表す図。FIG. 7: (a) is a source program showing a loop whose minimum address overlap iteration difference is 2, and (b) is an assembler program showing the result of compiling the program of (a) with two-stage loop unrolling according to the determination of the loop analysis section.

【図８】　　最小アドレス重なり反復差が１となるルー
プを示すソースプログラムを表す図。FIG. 8 is a diagram representing a source program showing a loop in which the minimum address overlap iteration difference is 1.

【図９】　　（ａ）　は本発明に係るループ最適化方式
の処理対象となるループを示すソースプログラム、（ｂ
）　は（ａ）　のプログラムをループ展開段数を変えて
コンパイルした場合の整数演算数、浮動小数点演算数、
命令スケジューリングした場合のループ内の１サイクル
あたり命令数を示す図。FIG. 9: (a) is a source program showing a loop to be processed by the loop optimization method according to the present invention, and (b) is a diagram showing the number of integer operations, the number of floating-point operations, and the number of instructions per cycle in the loop after instruction scheduling when the program of (a) is compiled with different numbers of loop unrolling stages.

【図１０】　　図９（ａ）　のプログラムのループ部分
のｔｒｅｅ形式中間テキストを示す図。FIG. 10 is a diagram showing tree format intermediate text of the loop portion of the program in FIG. 9(a).

【図１１】　　本発明の第２の実施例に係るループ最適
化方式を適用したコンパイラの構成例を示す図。FIG. 11 is a diagram showing a configuration example of a compiler to which a loop optimization method according to a second embodiment of the present invention is applied.

【図１２】　　本発明の第２の実施例に係るループ最適
化方式における最適ループ展開段数を求める処理の流れ
図。FIG. 12 is a flowchart of a process for determining the optimal number of loop expansion stages in a loop optimization method according to a second embodiment of the present invention.

[Explanation of symbols]

２１　　プログラム入力部　　　　　　２２　　構文解
析部２３　　ループ抽出部　　　　　　２４　　ループ
解析部２５　　ループ展開部　　　　　　２６　　命令
スケジュール部２７　　コード生成部　　　　　　３１
　　プログラム入力部３２　　構文解析部　　　　　　
３３　　ループ抽出部３４　　ループ解析部　　　　　
　３５　　ループ展開部３６　　命令スケジュール部　
　　　　　３７　　コード生成部３８　　帰納変数解析
部
21 Program input section
22 Syntax analysis section
23 Loop extraction section
24 Loop analysis section
25 Loop unrolling section
26 Instruction scheduling section
27 Code generation section
31 Program input section
32 Syntax analysis section
33 Loop extraction section
34 Loop analysis section
35 Loop unrolling section
36 Instruction scheduling section
37 Code generation section
38 Induction variable analysis section

Claims

[Claims]

1. A loop optimization method used in an electronic computer capable of executing in parallel a plurality of instruction sequences constituting a program that includes loop processing, in which a sequence of instructions is executed repeatedly a plurality of times, the method comprising: program analysis means for determining the minimum stage difference between a first iteration stage and a second iteration stage of the loop processing for which the address of data referenced or updated in the first iteration stage overlaps the address of data referenced or updated in the second iteration stage; program conversion means for converting the instruction sequence into an instruction sequence equivalent to the instruction sequence in the loop; and program execution means for executing in parallel the instruction sequences for the number of iterations unrolled into one loop by the program conversion means.

2. The loop optimization method according to claim 1, wherein the program analysis means is additionally provided with control means that prevents the program conversion means from unrolling a loop containing an instruction sequence that references or updates data using an induction variable newly calculated at each iteration stage of the loop processing.

3. A loop optimization method used in an electronic computer capable of executing in parallel a plurality of instruction sequences constituting a program that includes loop processing, in which a sequence of instructions is executed repeatedly a plurality of times, the method comprising: first detection means for detecting, for a number of loop unrolling stages by which the loop processing is unrolled within the same loop, the number of integer operations and the number of floating-point operations included in the instruction sequence within that loop when it is unrolled by this number of stages; second detection means for detecting, for the number of loop unrolling stages, the number of integer registers and the number of floating-point registers used within the same loop when it is unrolled by this number of stages; determining means for determining the number of loop unrolling stages at which the ratio obtained from the number of integer operations and the number of floating-point operations detected by the first detection means is closest to the ratio of the number of integer arithmetic units to the number of floating-point arithmetic units of the electronic computer, within a range in which the number of integer registers and the number of floating-point registers detected by the second detection means do not exceed the number of integer registers and the number of floating-point registers of the electronic computer; program conversion means for unrolling the loop into one loop by the number of loop unrolling stages determined by the determining means and converting it into an instruction sequence equivalent to the instruction sequence in the loop; and program execution means for executing in parallel the instruction sequences for the number of loop unrolling stages unrolled into one loop by the program conversion means.