JP3953697B2 - Compiler and recording medium

Info

Publication number: JP3953697B2
Application number: JP37183499A
Authority: JP
Inventors: 延佳山地; 倫康野尻
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-12-27
Filing date: 1999-12-27
Publication date: 2007-08-08
Anticipated expiration: 2019-12-27
Also published as: JP2001184341A

Description

【０００１】
【発明の属する技術分野】
本発明は、ソースプログラムのベクトル化時に展開数の最適化を行うコンパイラおよび記録媒体に関するものである。
【０００２】
【従来の技術】
従来、ソースプログラムをベクトル化する場合に、ループの繰り返し回数を減らして高速化を図るアンローリング処理が行われている。この従来のアンローリング処理における展開数は下記のようにして決めていた。
【０００３】
（１）コンパイラディレクティブに指定された展開数とする。
（２）外側ループの繰り返し回数が陽に判る場合には、その値を展開数とする。
【０００４】
（３）上記以外で、展開後の命令数が許容範囲（例えばレジスタ不足にならない範囲）を越えなければｎ重展開とする。
【０００５】
【発明が解決しようとする課題】
上記従来の展開数では、演算数が多い場合は良いが、演算数が少ないループではレジスタを有効に使わないオブジェクトが生成されてしまうという問題があった。
【０００６】
また、アンローリング数の増加によりオブジェクト量が多量になってしまう危険性を持つという問題もあった。
本発明は、これらの問題を解決するため、ソースプログラムをベクトル化するときにループの繰り返し回数などをもとに最適化したアンローリング処理を行うと共に重複するメモリアクセスを削除し、オブジェクトプログラムの実行性能の向上を図ることを目的としている。
【０００７】
【課題を解決するための手段】
図１を参照して課題を解決するための手段を説明する。
図１において、ソースプログラム１は、ベクトル化する際に展開数の最適化を行う対象のプログラムである。
【０００８】
コンパイラ２は、ソースプログラム１を入力とし、実行可能形式のオブジェクトプログラム１０を生成するものであって、ここでは、最適化手段４などから構成されるものである。
【０００９】
最適化手段４は、ソースプログラム１をベクトル化して最適化を行うものであって、ここでは、ベクトル化手段５、およびベクトル最適化手段６などから構成されるものである。
【００１０】
ベクトル最適化手段６は、ベクトル化する際に展開数の最適化を行うものであって、ここでは、アンローリング手段７およびメモリアクセス削除手段８などから構成されるものである。
【００１１】
アンローリング手段７は、ベクトル化する際に最適な展開数で展開したりなどするものである。
メモリアクセス削除手段８は、最適な展開数で展開した後の重複するメモリアクセスをレジスタ間転送命令に変更し、削除するものである。
【００１２】
次に、動作を説明する。
ベクトル化手段５がソースプログラム１をベクトル化し、アンローリング手段７がソースプログラム１を解析して検出されたループの繰り返し回数から求めたレジスタ個数および演算に必要な最大レジスタ数をもとに仮想展開数を算出し、算出した仮想展開数について、ベクトル化されたプログラムのデータ依存関係をもとに展開数を算出して展開し、メモリアクセス削除手段８が展開後のベクトル命令列から重複するメモリアクセス命令がある場合にレジスタ間移動命令に変更するようにしている。
【００１３】
また、ベクトル化手段５がソースプログラム１をベクトル化し、アンローリング手段７がソースプログラムを解析して検出されたループの繰り返し回数が不明の場合に、ループで使用している配列のベクトル次元から求めたレジスタ個数および演算に必要な最大レジスタ数をもとに仮想展開数を算出し、算出した仮想展開数についてベクトル化されたプログラムのデータ依存関係をもとに展開数を算出し、算出した展開数をもとにベクトル化されたプログラムを展開し、メモリアクセス削除手段８が展開後のベクトル命令列から重複するメモリアクセス命令がある場合にレジスタ間移動命令に変更するようにしている。
【００１４】
これらの際に、デフォルトで展開数の最大値を設定しておき、最大値を越えない範囲内で展開数を算出するようにしている。
従って、ソースプログラム１をベクトル化するときにループの繰り返し回数などをもとに最適化したアンローリング処理を行うと共に重複するメモリアクセスを削除することにより、オブジェクトプログラム１０の実行性能の向上を図ることが可能となる。
【００１５】
【発明の実施の形態】
次に、図１から図４を用いて本発明の実施の形態および動作を順次詳細に説明する。
【００１６】
図１は、本発明のシステム構成図を示す。
図１において、ソースプログラム１は、ベクトル化する際に展開数の最適化を行う対象のプログラムであって、例えば後述する図４の（ａ）に示すような、ループの存在するプログラムである。
【００１７】
コンパイラ２は、ソースプログラム１を入力とし、実行可能形式のオブジェクトプログラム１０を生成するものであって、ここでは、ソースプログラム解析手段３、最適化手段４、およびコード生成手段９などから構成されるものである。
【００１８】
ソースプログラム解析手段３は、ソースプログラム１を形態素解析、構文解析などを行い、中間言を生成するものである。ここで、実際は、形態素解析および構文解析などを行った情報を付加した中間言をもとに以降説明するベクトル化、アンローリング処理などを行うが、説明を分かり易くするために、ソースプログラム１をベクトル化、アンローリング処理するなどとして説明を行う。
【００１９】
最適化手段４は、ソースプログラム１の最適化を行うものであって、ここでは、ベクトル化手段５、およびベクトル最適化手段６などから構成されるものである。
【００２０】
ベクトル化手段５は、ソースプログラム１をベクトル化するものであって、例えば後述する図３の（ａ）のソースプログラム１を（ｂ）に示すプログラム（ベクトル計算機で動作するベクトル命令を使ったプログラム）にするものである。
【００２１】
ベクトル最適化手段６は、ベクトル化されたプログラムを最適化するものであって、ここでは、アンローリング手段７およびメモリアクセス削除手段８などから構成されるものである。
【００２２】
アンローリング手段７は、ベクトル化されたプログラムについて、最適な展開数で展開してループの繰り返し回数を削減したりなどするものである。
メモリアクセス削除手段８は、最適な展開数で展開した後のプログラム中で重複するメモリアクセスをレジスタ間転送命令に変更し、削除したりするものである。
【００２３】
コード生成手段９は、最適化後のプログラムを、実行可能形式のオブジェクトプログラム１０にするものである。
次に、図２のフローチャートの順番に従い、図１の構成の動作を詳細に説明する。
【００２４】
図２は、本発明の動作説明フローチャートを示す。
図２において、Ｓ１は、コンパイラディレクティブ指定があるか判別する。ＹＥＳの場合には、その指定に従いＳ６以降の処理に進む。一方、ＮＯの場合には、Ｓ２に進む。
【００２５】
Ｓ２は、ループの繰り返し回数が不明か判別する。これは、ソースプログラム１のループの回数が不明、例えば後述する図３の（ａ）のソースプログラム１ではループの回数が判明と判別し、図４の（ａ）のソースプログラム１ではループの回数が不明と判別する。ＹＥＳの場合には、Ｓ３に進む。ＮＯの場合には、Ｓ９に進む。
【００２６】
Ｓ９は、Ｓ２のＮＯでループの繰り返し回数が不明でないと判別されたので、
仮想展開数＝（ループ内の繰り返し回数から見積もったレジスタ個数）／（必要なレジスタ数の最大値）
を求める。例えば後述する図３の（ｂ）のベクトル化した後のプログラムにおいて、当該プログラムの場合には（ループ内の繰り返し回数から見積もったレジスタ個数）＝２５６、必要なレジスタ数の最大値＝１であるので、仮想展開数＝２５６となるが、実行させるベクトル計算機の各種性能を考慮してデフォルトで最大が例えば４と設定されているので、仮想展開数を４と決定する。
【００２７】
Ｓ１０は、メモリアクセスの削除が可能か判別する。これは、例えば後述する図３の（ｃ−１）に示すように、展開（この場合には４重展開）した場合に重複するメモリアクセスがあって当該メモリアクセスを削除して変りにレジスタ間移動命令で置き換えることが可能か判別する。ＹＥＳの場合には、Ｓ１１に進む。一方，ＮＯの場合には、Ｓ６に進む。
【００２８】
Ｓ１１は、Ｓ１０のＹＥＳでメモリアクセス削除可能と判明したので、正式展開数を、データ依存関係を見て補正する。例えば後述する図３の（ｂ）のベクトル化したプログラムの場合には、２重展開からメモリアクセス命令の削除が可能であるので、Ｓ９で決めた仮想展開数４をそのまま正式展開数として決定する。そして、Ｓ６に進む。
【００２９】
Ｓ６は、ループアンローリングの展開処理を行う。これは、Ｓ９で決定した正式展開数、ここでは４をもとに、例えば図３の（ｂ）を図３の（ｃ−１）に示すように４重展開する。
【００３０】
Ｓ７は、メモリアクセスの削除が可能か判別する。ＹＥＳの場合には、Ｓ８に進む。ＮＯの場合には、終了する。
Ｓ８は、Ｓ７のＹＥＳでメモリアクセスの削除が可能と判明したので、メモリアクセスの削除処理を行う。例えば図３の（ｃ−１）の４重展開した後のプログラム中の重複するメモリアクセスをレジスタ間移動命令（例えばＶＭＯＶＥ命令）に置き換え、重複するメモリアクセスを削除し、図３の（ｃ−２）に示すプログラムに修正する。
【００３１】
以上のＳ１のＮＯ，Ｓ２のＮＯ，Ｓ９からＳ１１、Ｓ６からＳ８の手順によって、プログラムのループの繰り返し回数が判明する場合に、内部のループについて多重展開を行ってループの繰り返し回数を削減および重複したメモリアクセスをＶＭＯＶＥ命令に置き換えてメモリアクセスを必要最小限とし、オブジェクトプログラム１０の実行性能を向上させることが可能となる。
【００３２】
次に、ループの繰り返し回数が不明の場合について以下説明する。
図２において、Ｓ３は、Ｓ２のＹＥＳでループの繰り返し回数が不明と判明したので、
仮想展開数＝（ループ内で使用している配列のベクトル次元の添字に現れた要素数から見積もったレジスタ個数）／（必要なレジスタ数の最大値）
を求める。例えば後述する図４の（ｂ）のベクトル化した後のプログラムにおいて、当該プログラムの場合には（ループ内の繰り返し回数から見積もったレジスタ個数）＝２５６、（必要なレジスタ数の最大値）＝１．３であるので、仮想展開数＝２５６／１．３＝１９６と決定する。
【００３３】
Ｓ４は、データ依存関係から最大展開数を見積もる。例えば図４の（ｂ）のプログラムの場合には、２重展開以上してもメモリアクセスの削除は見込めないので、最大展開数＝２と決定する。
【００３４】
Ｓ５は、正式展開数が最大展開数を越えない範囲で決定する。ここでは、例えば図４の（ｂ）の場合には、上記したように、最大展開数＝２であるので、これを越えない正式展開数を２と決定する。
【００３５】
Ｓ６は、ループアンローリングの展開処理を行う。これは、Ｓ５で決定した正式展開数、ここでは２をもとに、例えば図４の（ｂ）を図４の（ｃ）に示すように２重展開する。
【００３６】
Ｓ７は、メモリアクセスの削除が可能か判別する。ＹＥＳの場合には、Ｓ８に進む。ＮＯの場合には、終了する。
Ｓ８は、Ｓ７のＹＥＳでメモリアクセスの削除が可能と判明したので、メモリアクセスの削除処理を行う。
【００３７】
以上のＳ１のＮＯ，Ｓ２のＹＥＳ，Ｓ３からＳ８の手順によって、プログラムのループの繰り返し回数が不明の場合に、内部のループで使用している配列のベクトル次元の添字に現れる要素数、必要なレジスタ数の最大値などをもとに展開数を決定して多重展開を行い、ループの繰り返し回数を削減および重複したメモリアクセスをＶＭＯＶＥ命令に置き換えてメモリアクセスを必要最小限とし、オブジェクトプログラム１０の実行性能を向上させることが可能となる。
【００３８】
図３は、本発明の説明図（その１、ループの繰り返し回数判明）を示す。ここでは、ループの繰り返し回数が判明するプログラムについて以下説明する。
図３の（ａ）は、ソースプログラム１の例を示す。このソースプログラム１は、図示のように、２重ループからなるプログラムである。
【００３９】
図３の（ｂ）は、アンローリング展開前のプログラムの例を示す。これは、図３の（ａ）のソースプログラム１をベクトル化した後のプログラムであって、図示のように、アンローリング展開前の１重展開したものである。
【００４０】
図３の（ｃ）は、アンローリング展開後のプログラムを示す。ここでは、図３の（ｂ）のアンローリング展開前のプログラムを、４重展開して（ｃ−１）のプログラムを作成し、次に、（ｃ−１）のプログラム中の重複したメモリアクセス命令を削除してレジスタ間移動命令（ＶＭＯＶＥ命令）に置き換えてメモリセスを必要最小限に削減して（ｃ−２）のプログラムを作成する。
【００４１】
図３の（ｃ−１）は、図３の（ｂ）のプログラムを４重展開（既述した図２のＳ３からＳ５で決定した展開数４）したプログラムの例を示す。尚、４重展開の場合には、ループの繰り返しに対して４で割った余りの処理に関しては、別のループを生成している（図３の（ｃ−１）の末尾のＤＯループ参照）。ここでは、図示したように、３つの重複したメモリアクセスが存在するので、これら重複するメモリアクセスを削除し、変わりにＶＭＯＶＥ命令に置き換えて図３の（ｃ−２）に示すプログラムとし、実行性能を向上させている。
【００４２】
図３の（ｃ−２）は、メモリアクセスを削除した場合のプログラムの例を示す。これは、図３の（ｃ−１）のプログラム中の重複するメモリアクセスの一方を削除し他方はＶＭＯＶＥ命令でレジスタ間転送して処理を行うように修正したものである。
【００４３】
尚、図３のプログラムの場合の展開数の決定について説明する。
図３の（ａ）のプログラムはループの繰り返し回数が判明し、ループの繰り返し回数から見積もったレジスタ個数がここでは２５６となり、必要なレジスタ数の最大値が１となるので、仮想展開数＝２５６／１＝２５６となるが、デフォルトで通常最大仮想展開数＝４と設定しているので、仮想展開数＝４と決定する。次に、メモリアクセスの削除が可能かの観点より、データ依存関係を見ると、２重展開からメモリアクセスの削除が可能と判明するので、正式展開数をそのまま４と決定する。決定された４重展開する場合には、４で割り切れる部分を図３の（ｃ−１）の上段のループ（４重展開のループ）とし、余りの部分を図３の（ｃ−１）の下段のループ（１重展開のままのループ）とする。更に、重複するメモリアクセスをレジスタ間移動命令（ＶＭＯＶＥ命令）に置き換えると図３の（ｃ−２）のプログラムを生成することが可能となる。
【００４４】
図４は、本発明の説明図（その２、ループの繰り返し回数不明）を示す。ここでは、ループの繰り返し回数が不明なプログラムについて以下説明する。
図４の（ａ）は、ソースプログラム１の例を示す。このソースプログラム１は、図示のように、２重ループからなるプログラムである。
【００４５】
図４の（ｂ）は、アンローリング展開前のプログラムの例を示す。これは、図４の（ａ）のソースプログラム１をベクトル化した後のプログラムであって、図示のように、アンローリング展開前の１重展開したものである。
【００４６】
図４の（ｃ）は、アンローリング展開後のプログラムを示す。ここでは、図４の（ｂ）のアンローリング展開前のプログラムを２重展開して作成する。
尚、図４のプログラムの場合の展開数の決定について説明する。
【００４７】
図４の（ａ）のプログラムはループの繰り返し回数が不明であるので、ループ内で使用している配列のベクトル次元の添字に現れた要素数から見積もったレジスタ個数がここでは２５６となり、必要なレジスタ数の最大値が１．３となるので、仮想展開数＝２５６／１．３＝１９６となる。データの依存関係から最大展開数を見積もると、２展開以上してもメモリアクセスの削除が見込めないので最大展開数を２と決定する。正式展開数は最大展開数を越えない範囲で決まるので、この場合には正式展開数が２と決定され、図４の（ｃ）のプログラムに示すように、２重展開する。
【００４８】
【発明の効果】
以上説明したように、本発明によれば、ソースプログラム１をベクトル化するときにループの繰り返し回数などをもとに最適化したアンローリング処理を行うと共に重複するメモリアクセスを削除する構成を採用しているため、オブジェクトプログラム１０の実行性能の向上を図ることができる。
【図面の簡単な説明】
【図１】本発明のシステム構成図である。
【図２】本発明の動作説明フローチャートである。
【図３】本発明の説明図（その１）である。
【図４】本発明の説明図（その２）である。
【符号の説明】
１：ソースプログラム
２：コンパイラ
３：ソースプログラム解析手段
４：最適化手段
５：ベクトル化手段
６：ベクトル最適化手段
７：アンローリング手段
８：メモリアクセス削除手段
９：コード生成手段
１０：オブジェクトプログラム

[0001]
[Technical field of the invention]
The present invention relates to a compiler and a recording medium for optimizing the number of expansions when a source program is vectorized.
[0002]
[Prior art]
Conventionally, when a source program is vectorized, an unrolling process is performed to increase speed by reducing the number of loop iterations. The number of expansions in this conventional unrolling process was determined as follows.
[0003]
(1) The number of expansions specified in a compiler directive is used.
(2) If the number of iterations of the outer loop is explicitly known, that value is used as the number of expansions.
[0004]
(3) Otherwise, n-fold expansion is performed as long as the number of instructions after expansion does not exceed an allowable range (for example, a range in which registers do not run short).
[0005]
[Problems to be solved by the invention]
The conventional way of determining the number of expansions works well when a loop contains many operations, but for a loop with few operations it generates object code that does not use the registers effectively.
[0006]
There is also the problem that an increase in the number of expansions risks making the object code very large.
To solve these problems, the present invention aims to improve the execution performance of an object program by performing, when a source program is vectorized, an unrolling process optimized on the basis of the number of loop iterations and the like, and by deleting duplicate memory accesses.
[0007]
[Means for Solving the Problems]
Means for solving the problem will be described with reference to FIG. 1.
In FIG. 1, a source program 1 is a target program for optimizing the number of expansions when vectorizing.
[0008]
The compiler 2 receives the source program 1 as input and generates an executable object program 10; here it comprises an optimization means 4 and the like.
[0009]
The optimization means 4 vectorizes and optimizes the source program 1; here it comprises a vectorization means 5, a vector optimization means 6, and the like.
[0010]
The vector optimization means 6 optimizes the number of expansions during vectorization; here it comprises an unrolling means 7, a memory access deletion means 8, and the like.
[0011]
The unrolling means 7 expands the loop with the optimal number of expansions during vectorization.
The memory access deletion means 8 changes memory accesses that are duplicated after expansion with the optimal number of expansions into register-to-register transfer instructions, thereby deleting the duplicate accesses.
[0012]
Next, the operation will be described.
The vectorization means 5 vectorizes the source program 1. The unrolling means 7 calculates a virtual expansion number based on the number of registers obtained from the number of loop iterations detected by analyzing the source program 1 and on the maximum number of registers required for the operation, calculates the number of expansions for that virtual expansion number based on the data dependencies of the vectorized program, and expands the program. When the expanded vector instruction sequence contains duplicate memory access instructions, the memory access deletion means 8 changes them into register-to-register move instructions.
[0013]
Further, the vectorization means 5 vectorizes the source program 1, and when the number of loop iterations detected by analyzing the source program is unknown, the unrolling means 7 calculates a virtual expansion number based on the number of registers obtained from the vector dimension of the arrays used in the loop and on the maximum number of registers required for the operation, calculates the number of expansions for that virtual expansion number based on the data dependencies of the vectorized program, and expands the vectorized program based on the calculated number of expansions. When the expanded vector instruction sequence contains duplicate memory access instructions, the memory access deletion means 8 changes them into register-to-register move instructions.
[0014]
In these cases, a maximum value for the number of expansions is set by default, and the number of expansions is calculated within a range not exceeding that maximum.
Accordingly, by performing an unrolling process optimized on the basis of the number of loop iterations and the like when the source program 1 is vectorized, and by deleting duplicate memory accesses, the execution performance of the object program 10 can be improved.
[0015]
[Embodiments of the invention]
Next, embodiments and operations of the present invention will be described in detail in order with reference to FIGS. 1 to 4.
[0016]
FIG. 1 shows a system configuration diagram of the present invention.
In FIG. 1, the source program 1 is the program whose number of expansions is to be optimized during vectorization; it is a program containing a loop, for example as shown in FIG. 4(a) described later.
[0017]
The compiler 2 receives the source program 1 as input and generates an executable object program 10; here it comprises a source program analysis means 3, an optimization means 4, a code generation means 9, and the like.
[0018]
The source program analysis means 3 performs lexical analysis, syntax analysis, and the like on the source program 1 and generates an intermediate language. In practice, the vectorization, unrolling, and other processing described below are performed on the intermediate language, to which the information obtained by the lexical and syntax analysis has been added; for ease of explanation, however, the description treats the source program 1 itself as being vectorized and unrolled.
[0019]
The optimization means 4 optimizes the source program 1; here it comprises a vectorization means 5, a vector optimization means 6, and the like.
[0020]
The vectorization means 5 vectorizes the source program 1; for example, it converts the source program 1 of FIG. 3(a), described later, into the program shown in FIG. 3(b) (a program using vector instructions that run on a vector computer).
[0021]
The vector optimization means 6 optimizes the vectorized program; here it comprises an unrolling means 7, a memory access deletion means 8, and the like.
[0022]
The unrolling means 7 expands the vectorized program with the optimal number of expansions, for example to reduce the number of loop iterations.
The memory access deletion means 8 changes memory accesses that are duplicated in the program after expansion with the optimal number of expansions into register-to-register transfer instructions, thereby deleting the duplicate accesses.
[0023]
The code generation means 9 converts the optimized program into an executable object program 10.
Next, the operation of the configuration of FIG. 1 will be described in detail, following the order of the flowchart of FIG. 2.
[0024]
FIG. 2 shows a flowchart for explaining the operation of the present invention.
In FIG. 2, S1 determines whether a compiler directive is specified. If YES, the process follows that specification and proceeds to S6 and the subsequent steps. If NO, the process proceeds to S2.
[0025]
S2 determines whether the number of loop iterations is unknown. For example, for the source program 1 of FIG. 3(a) described later the number of loop iterations is determined to be known, and for the source program 1 of FIG. 4(a) it is determined to be unknown. If YES, the process proceeds to S3. If NO, the process proceeds to S9.
[0026]
Since S2 returned NO, meaning the number of loop iterations is not unknown, S9 calculates
virtual expansion number = (number of registers estimated from the number of loop iterations) / (maximum number of registers required).
For example, for the vectorized program of FIG. 3(b) described later, (number of registers estimated from the number of loop iterations) = 256 and (maximum number of registers required) = 1, so the virtual expansion number would be 256. However, a default maximum, for example 4, is set in consideration of the various performance characteristics of the vector computer on which the program will run, so the virtual expansion number is determined to be 4.
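The calculation in S9 can be illustrated with a minimal Python sketch. The function name, the rounding down to an integer, and the lower bound of 1 are assumptions made for illustration; only the quotient and the default maximum of 4 come from the description.

```python
import math

# Assumed default cap, taken from the example value in the description.
DEFAULT_MAX_EXPANSION = 4


def virtual_expansion_number(estimated_registers, max_required_registers,
                             cap=DEFAULT_MAX_EXPANSION):
    """Hypothetical helper for S9: quotient of the two register counts,
    rounded down, limited to the cap, and never smaller than 1."""
    raw = math.floor(estimated_registers / max_required_registers)
    return max(1, min(raw, cap))


# FIG. 3 example: 256 / 1 = 256, limited to the default maximum of 4.
print(virtual_expansion_number(256, 1))
```

Passing a larger `cap` shows the uncapped quotient, which is how the value 256 arises before the default maximum is applied.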
[0027]
S10 determines whether memory accesses can be deleted. For example, as shown in FIG. 3(c-1) described later, it determines whether expansion (four-fold expansion in this case) produces duplicate memory accesses that can be deleted and replaced with register-to-register move instructions. If YES, the process proceeds to S11. If NO, the process proceeds to S6.
[0028]
Since S10 returned YES, meaning memory accesses can be deleted, S11 corrects the formal expansion number in view of the data dependencies. For example, for the vectorized program of FIG. 3(b) described later, memory access instructions can be deleted from two-fold expansion upward, so the virtual expansion number 4 determined in S9 is adopted unchanged as the formal expansion number. The process then proceeds to S6.
[0029]
S6 performs the loop unrolling expansion. Based on the formal expansion number determined in S9, here 4, the program of FIG. 3(b) is, for example, expanded four-fold as shown in FIG. 3(c-1).
[0030]
S7 determines whether memory accesses can be deleted. If YES, the process proceeds to S8. If NO, the process ends.
Since S7 returned YES, meaning memory accesses can be deleted, S8 performs the memory access deletion. For example, the duplicate memory accesses in the four-fold expanded program of FIG. 3(c-1) are replaced with register-to-register move instructions (for example, VMOVE instructions), which removes the duplicate memory accesses and yields the program shown in FIG. 3(c-2).
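The replacement performed in S8 might look like the following Python sketch over a toy instruction list. The tuple encoding, the `VLOAD` and `VSTORE` mnemonics, the register names, and the symbolic addresses are hypothetical; only the `VMOVE` mnemonic appears in the description. The sketch also assumes that two accesses touch the same memory only when their address strings are identical.

```python
def replace_duplicate_loads(instructions):
    """Replace a repeated VLOAD of an address with a VMOVE from the register
    that already holds that data. Each instruction is (op, dest, src)."""
    holder = {}  # address -> register currently holding its contents
    result = []
    for op, dest, src in instructions:
        if op == "VLOAD" and src in holder and holder[src] != dest:
            # Data is already in a register: move it instead of reloading.
            result.append(("VMOVE", dest, holder[src]))
        else:
            result.append((op, dest, src))
        if op == "VSTORE":
            # dest is an address whose contents changed: forget it.
            holder.pop(dest, None)
        else:
            # dest is a register being overwritten: drop stale entries.
            for addr in [a for a, r in holder.items() if r == dest]:
                del holder[addr]
            if op == "VLOAD":
                holder.setdefault(src, dest)
    return result


program = [
    ("VLOAD", "VR1", "B(J)"),
    ("VLOAD", "VR2", "C(J)"),
    ("VLOAD", "VR3", "B(J)"),   # duplicate access to B(J)
    ("VSTORE", "A(J)", "VR3"),
]
for instruction in replace_duplicate_loads(program):
    print(instruction)
```

A real compiler would need alias and dependence analysis before treating two accesses as duplicates; the string comparison here only stands in for that check.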
[0031]
By the above procedure of NO at S1, NO at S2, S9 to S11, and S6 to S8, when the number of loop iterations of the program is known, multiple expansion is performed on the inner loop to reduce the number of loop iterations, and duplicate memory accesses are replaced with VMOVE instructions so that memory access is kept to the minimum necessary, which makes it possible to improve the execution performance of the object program 10.
[0032]
Next, a case where the number of loop repetitions is unknown will be described below.
In FIG. 2, since S2 returned YES, meaning the number of loop iterations is unknown, S3 calculates
virtual expansion number = (number of registers estimated from the number of elements appearing in the vector-dimension subscript of the arrays used in the loop) / (maximum number of registers required).
For example, for the vectorized program of FIG. 4(b) described later, (number of registers estimated from the number of elements appearing in the vector-dimension subscript) = 256 and (maximum number of registers required) = 1.3, so the virtual expansion number is determined to be 256 / 1.3 = 196 (fraction discarded).
[0033]
S4 estimates the maximum expansion number from the data dependencies. For example, for the program of FIG. 4(b), expanding two-fold or more is not expected to allow any memory access to be deleted, so the maximum expansion number is determined to be 2.
[0034]
S5 determines the formal expansion number within a range not exceeding the maximum expansion number. For example, in the case of FIG. 4(b), the maximum expansion number is 2 as described above, so the formal expansion number is determined to be 2, which does not exceed it.
[0035]
S6 performs the loop unrolling expansion. Based on the formal expansion number determined in S5, here 2, the program of FIG. 4(b) is, for example, expanded two-fold as shown in FIG. 4(c).
[0036]
S7 determines whether memory accesses can be deleted. If YES, the process proceeds to S8. If NO, the process ends.
Since S7 returned YES, meaning memory accesses can be deleted, S8 performs the memory access deletion.
[0037]
By the above procedure of NO at S1, YES at S2, and S3 to S8, when the number of loop iterations of the program is unknown, the number of expansions is determined on the basis of the number of elements appearing in the vector-dimension subscript of the arrays used in the inner loop, the maximum number of registers required, and the like, and multiple expansion is performed. This reduces the number of loop iterations, and duplicate memory accesses are replaced with VMOVE instructions so that memory access is kept to the minimum necessary, which makes it possible to improve the execution performance of the object program 10.
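The two paths through the flowchart of FIG. 2 can be summarized in a short Python sketch. All names and the function signature are hypothetical, and the way S11 "corrects" the number is modeled here as an optional upper limit, which is an assumption; the description only gives the two worked examples that the calls at the end reproduce.

```python
import math


def choose_expansion_number(estimated_registers, max_required_registers,
                            iterations_known, directive=None,
                            deletion_possible=True, dependence_limit=None,
                            default_cap=4):
    """Sketch of steps S1 to S5 and S9 to S11 (assumed interpretation)."""
    # S1: a compiler directive, if present, decides the number directly.
    if directive is not None:
        return directive
    virtual = math.floor(estimated_registers / max_required_registers)
    if not iterations_known:
        # S3 to S5: bound the virtual number by the maximum expansion
        # number estimated from the data dependencies.
        limit = dependence_limit if dependence_limit is not None else virtual
        return max(1, min(virtual, limit))
    # S9: apply the default maximum.
    virtual = min(virtual, default_cap)
    # S10 and S11: optionally correct the number from the data dependencies.
    if deletion_possible and dependence_limit is not None:
        virtual = min(virtual, dependence_limit)
    return max(1, virtual)


# FIG. 3 case: iteration count known, 256 / 1 capped to 4.
print(choose_expansion_number(256, 1, iterations_known=True))
# FIG. 4 case: iteration count unknown, 256 / 1.3 bounded by 2.
print(choose_expansion_number(256, 1.3, iterations_known=False,
                              dependence_limit=2))
```

The returned value corresponds to the formal expansion number that S6 then uses for the actual expansion.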
[0038]
FIG. 3 is an explanatory diagram of the present invention (part 1, number of loop iterations known). Here, a program whose number of loop iterations is known is described.
FIG. 3(a) shows an example of the source program 1. As shown in the figure, this source program 1 consists of a doubly nested loop.
[0039]
FIG. 3(b) shows an example of the program before unrolling. It is the program obtained by vectorizing the source program 1 of FIG. 3(a) and, as shown, is in its single-expansion form before unrolling.
[0040]
FIG. 3(c) shows the program after unrolling. Here, the program of FIG. 3(b) before unrolling is expanded four-fold to create the program of (c-1); then the duplicate memory access instructions in the program of (c-1) are deleted and replaced with register-to-register move instructions (VMOVE instructions), reducing memory access to the minimum necessary, to create the program of (c-2).
[0041]
FIG. 3(c-1) shows an example of the program obtained by expanding the program of FIG. 3(b) four-fold (the number of expansions, 4, determined in S3 to S5 of FIG. 2 as already described). With four-fold expansion, a separate loop is generated to handle the iterations left over when the number of loop iterations is divided by 4 (see the DO loop at the end of FIG. 3(c-1)). As shown in the figure, there are three duplicate memory accesses, so these are deleted and replaced with VMOVE instructions to give the program shown in FIG. 3(c-2), improving execution performance.
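The split into a four-fold expanded loop and a separate remainder loop can be sketched in Python as follows. The loop body (accumulating rows into a vector) is a hypothetical stand-in, since the actual program of FIG. 3 is not reproduced in the text; each list comprehension plays the role of one vector operation.

```python
def add_rows_rolled(rows, width):
    """Reference version: one vector addition per outer-loop iteration."""
    acc = [0] * width
    for row in rows:
        acc = [x + y for x, y in zip(acc, row)]
    return acc


def add_rows_unrolled4(rows, width):
    """Four-fold expansion of the outer loop plus a remainder loop."""
    acc = [0] * width
    n = len(rows)
    main = n - n % 4  # part of the iteration range divisible by 4
    for j in range(0, main, 4):
        acc = [x + r0 + r1 + r2 + r3
               for x, r0, r1, r2, r3
               in zip(acc, rows[j], rows[j + 1], rows[j + 2], rows[j + 3])]
    for j in range(main, n):  # leftover iterations, not expanded
        acc = [x + y for x, y in zip(acc, rows[j])]
    return acc


rows = [[i, 2 * i] for i in range(10)]
print(add_rows_rolled(rows, 2))
print(add_rows_unrolled4(rows, 2))
```

With 10 rows, the expanded loop handles rows 0 to 7 in two passes and the remainder loop handles rows 8 and 9, so both functions return the same result.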
[0042]
FIG. 3(c-2) shows an example of the program after memory accesses have been deleted. It is the program of FIG. 3(c-1) modified so that, for each pair of duplicate memory accesses, one access is deleted and the data is instead transferred between registers with a VMOVE instruction.
[0043]
The determination of the number of expansions for the program of FIG. 3 is described below.
In the program of FIG. 3(a), the number of loop iterations is known; the number of registers estimated from the number of loop iterations is 256 here and the maximum number of registers required is 1, so the virtual expansion number would be 256 / 1 = 256. However, the maximum virtual expansion number is normally set to 4 by default, so the virtual expansion number is determined to be 4. Next, examining the data dependencies from the viewpoint of whether memory accesses can be deleted shows that memory accesses can be deleted from two-fold expansion upward, so the formal expansion number is determined to be 4 unchanged. For the four-fold expansion thus determined, the part of the iteration range divisible by 4 becomes the upper loop of FIG. 3(c-1) (the four-fold expanded loop), and the remainder becomes the lower loop of FIG. 3(c-1) (a loop left in single-expansion form). Further, replacing the duplicate memory accesses with register-to-register move instructions (VMOVE instructions) makes it possible to generate the program of FIG. 3(c-2).
[0044]
FIG. 4 is an explanatory diagram of the present invention (part 2, number of loop iterations unknown). Here, a program whose number of loop iterations is unknown is described.
FIG. 4(a) shows an example of the source program 1. As shown in the figure, this source program 1 consists of a doubly nested loop.
[0045]
FIG. 4(b) shows an example of the program before unrolling. It is the program obtained by vectorizing the source program 1 of FIG. 4(a) and, as shown, is in its single-expansion form before unrolling.
[0046]
FIG. 4(c) shows the program after unrolling. It is created by expanding the program of FIG. 4(b) before unrolling two-fold.
The determination of the number of expansions for the program of FIG. 4 is described below.
[0047]
In the program of FIG. 4(a), the number of loop iterations is unknown, so the number of registers is estimated from the number of elements appearing in the vector-dimension subscript of the arrays used in the loop, giving 256 here; the maximum number of registers required is 1.3, so the virtual expansion number is 256 / 1.3 = 196 (fraction discarded). When the maximum expansion number is estimated from the data dependencies, expanding two-fold or more is not expected to allow any memory access to be deleted, so the maximum expansion number is determined to be 2. Since the formal expansion number is determined within a range not exceeding the maximum expansion number, it is determined to be 2 in this case, and two-fold expansion is performed as shown in the program of FIG. 4(c).
[0048]
[Effects of the invention]
As described above, the present invention adopts a configuration in which, when the source program 1 is vectorized, an unrolling process optimized on the basis of the number of loop iterations and the like is performed and duplicate memory accesses are deleted, so that the execution performance of the object program 10 can be improved.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of the present invention.
FIG. 2 is a flowchart explaining the operation of the present invention.
FIG. 3 is an explanatory diagram (part 1) of the present invention.
FIG. 4 is an explanatory diagram (part 2) of the present invention.
[Explanation of symbols]
1: source program
2: compiler
3: source program analysis means
4: optimization means
5: vectorization means
6: vector optimization means
7: unrolling means
8: memory access deletion means
9: code generation means
10: object program

Claims

1. A compiler that optimizes the number of expansions when vectorizing a source program, comprising:
means for vectorizing the source program;
means for analyzing the source program and either detecting the number of loop iterations or determining that the number of loop iterations is unknown;
means for calculating a virtual expansion number based on the number of registers obtained from the detected number of loop iterations and the maximum number of registers required for the operation when the number of loop iterations is detected, and for calculating the virtual expansion number based on the number of registers obtained from the vector dimension of the array used in the loop and the maximum number of registers required for the operation when the number of loop iterations is determined to be unknown;
means for calculating the number of expansions for the calculated virtual expansion number based on the data dependencies of the vectorized program; and
means for expanding the vectorized program based on the calculated number of expansions.

2. The compiler according to claim 1, further comprising means for changing a duplicate memory access instruction, if one exists after the program is expanded, into a register-to-register move instruction.

3. The compiler according to claim 1, wherein a maximum value of the number of expansions is set by default, and the number of expansions is calculated within a range not exceeding that maximum value.

4. A computer-readable recording medium recording a program that causes a computer to function as:
means for vectorizing an input source program;
means for analyzing the source program and either detecting the number of loop iterations or determining that the number of loop iterations is unknown;
means for calculating a virtual expansion number based on the number of registers obtained from the detected number of loop iterations and the maximum number of registers required for the operation when the number of loop iterations is detected, and for calculating the virtual expansion number based on the number of registers obtained from the vector dimension of the array used in the loop and the maximum number of registers required for the operation when the number of loop iterations is determined to be unknown;
means for calculating the number of expansions for the calculated virtual expansion number based on the data dependencies of the vectorized program; and
means for expanding the vectorized program based on the calculated number of expansions.