JP3627905B2 - Program compilation method and recording medium recording the method

Info

Publication number: JP3627905B2
Application number: JP05997999A
Authority: JP
Inventors: 静香小山
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-03-08
Filing date: 1999-03-08
Publication date: 2005-03-09
Anticipated expiration: 2019-03-08
Also published as: JP2000259423A

Description

[0001]
[Technical field of the invention]
The present invention relates to a technique for speeding up program execution on a computer, such as a superscalar or VLIW machine, that has a plurality of functional units, each of which can execute instructions simultaneously and in parallel. In particular, it relates to a program compilation method that makes the workloads of the functional units comparable without adding redundancy to the hardware, and to a recording medium on which the program compilation method is recorded.
[0002]
[Prior art]
One way to speed up program execution is to improve the processing capability of the processor. There are two approaches to doing so: improving the performance (processing speed and throughput) of the individual hardware components that make up the processor, and, rather than improving the hardware itself, improving performance by executing a plurality of instructions simultaneously.
[0003]
The latter approach uses a processor that contains a plurality of functional units, each of which can execute instructions simultaneously (see Hennessy, J. L. and Patterson, D. A., "Computer Architecture: A Quantitative Approach," second edition, pp. 278-289, Morgan Kaufmann Publishers, San Francisco, California, 1996). In such a processor, the instructions that each functional unit can execute are usually fixed.
[0004]
For example, suppose a processor has two functional units, one (an integer arithmetic unit) that executes integer operations and another (a floating-point arithmetic unit) that executes floating-point operations. The processor executes integer instructions on the integer arithmetic unit and floating-point instructions on the floating-point arithmetic unit, and can therefore execute one integer instruction and one floating-point instruction simultaneously. However, for a plurality of instructions to actually execute simultaneously, the workloads of the functional units within a given unit of time must be comparable. Consequently, when the computation, particularly in a loop, is biased toward instructions belonging to one functional unit, a plurality of instructions cannot be executed simultaneously.
[0005]
Another conventional technique for speeding up program execution by distributing the workload across functional units is to add, to a functional unit that is normally little used, the capability to execute frequently occurring instructions. For example, in the method discussed in Sastry, S. S., Palacharla, S. and Smith, J. E., "Exploiting Idle Floating-Point Resources for Integer Execution," Proceedings of the ACM SIGPLAN PLDI 1998, pp. 118-129, a processor with two kinds of functional units, an integer arithmetic unit that executes integer instructions and a floating-point arithmetic unit that executes floating-point instructions, has integer arithmetic capability added to its floating-point arithmetic unit. Making the hardware redundant in this way lets programs that make heavy use of integer operations run at high speed. The compiler for such a computer separates the integer operations in the source program into those to be executed by the integer arithmetic unit and those to be executed by the integer arithmetic capability added to the floating-point arithmetic unit.
[0006]
[Problems to be solved by the invention]
The conventional method described above can be realized only on a computer with redundant hardware, in which integer arithmetic capability has been added to the floating-point arithmetic unit. Moreover, because many commonly used programs (compilers, editors, operating systems) make heavy use of integer operations, the conventional method of having the floating-point arithmetic unit take over work from the integer arithmetic unit is effective for them, but it is not effective for programs that make heavy use of floating-point operations, such as scientific and engineering calculations. In general, on a computer that has a plurality of functional units, each able to execute instructions simultaneously, the functional units actually execute instructions simultaneously only if their workloads over a given unit of time are comparable.
[0007]
An object of the present invention is to provide a program compilation method that can speed up programs making heavy use of either integer operations or floating-point operations without adding redundancy to the hardware, and a recording medium on which the compilation method is recorded.
[0008]
[Means for Solving the Problems]
To achieve the above object, the compilation method of the present invention first determines, at program translation time, whether the computation included in a certain execution section of the input program (source program) is biased toward instructions belonging to a specific functional unit. If it is determined to be biased, then, to make the workloads of the functional units comparable, some of those instructions are translated either into alternative instruction sequences belonging to another functional unit or into library calls composed of alternative instruction sequences that include instructions belonging to another functional unit.
[0009]
The recording medium of the present invention is a computer-readable recording medium in which the processing of each step of the program compilation method is recorded as a program code.
[0010]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings. However, it goes without saying that the present invention is not limited to the examples.
FIG. 1 is a diagram showing an example of a computer system to which the compiling method of the present invention is applied. The computer shown in FIG. 1 includes a processor 101, a main memory 102, a disk device 103, a reading device 104, and a bus 105. The reading device 104 can read a program or the like stored in a storage medium 106. The compiler using the present invention is stored in the disk device 103 or the storage medium 106, is taken into the main memory 102 via the bus 105, is decoded, and is executed by the processor 101.
[0011]
FIG. 2 is a diagram illustrating an example of the functional configuration of the processor 101. In the example of FIG. 2, the processor 101 has a plurality of functional units: a load/store unit 201, an integer arithmetic unit 202, a floating-point add/subtract/multiply unit 203, and a floating-point division unit 204. Each functional unit can execute instructions simultaneously and in parallel with the others.
[0012]
FIG. 3 is a diagram showing an overview of the processing flow of the compiler in this embodiment.
In the figure, reference numeral 301 denotes an input program (source program), 302 denotes a compiler, 303 denotes an optimization unit, 304 denotes a target program (object program), and 305 denotes processing to which the compiling method according to the present invention is applied.
The compiler 302 reads the input program 301, optimizes it by the optimization unit 303, and translates it into the target program 304. The compiling method of the present invention is applied to the processing 305 in the optimization unit 303. Since the processing 305 to which the compiling method of the present invention is applied is particularly effective when applied to a loop, an example of application to a loop is shown in the following description.
[0013]
FIG. 4 is a specific flowchart of the process 305.
The processing 305 in this embodiment estimates, for the computation included in a certain execution section of the program, the workload of each functional unit of the processor 101. It then calculates the number n of instructions in that computation which, if translated into alternative instruction sequences, would make the workloads of the functional units comparable. If n is 0, all instructions are translated into their original instructions; if n is 1 or more, n of them are translated into alternative instruction sequences.
[0014]
Next, the flow of the process 305 will be described in detail with reference to FIG.
First, the calculation operation of each loop execution is recognized as the processing unit to be optimized and the workload of each functional unit is estimated (step 401).
Next, to make the workloads of the functional units comparable, the number n of instructions that belong to the most heavily loaded functional unit, within the computation recognized in step 401, and that should be translated into alternative instruction sequences is calculated (step 402). If the calculated number n is 0, all instructions are translated into their original instructions (step 403). If n is 1 or more, n instructions are translated into alternative instruction sequences or library calls (step 404).
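As an illustration of the workload estimate in step 401, the following minimal sketch sums the cycles that each functional unit is occupied by one loop iteration and reports the most heavily loaded unit. The operation names, the unit names, and the cost table are hypothetical; only the 20-cycle division and the 1-cycle multiply-add follow the assumptions stated in this description.

```python
from collections import defaultdict

# Hypothetical cost table: operation -> (functional unit, cycles occupied).
# Only the 20-cycle divide and 1-cycle multiply-add come from the description;
# the remaining entries are placeholders for illustration.
COST = {
    "load": ("load_store", 1),
    "store": ("load_store", 1),
    "iadd": ("integer", 1),
    "fmadd": ("fp_addmul", 1),
    "fdiv": ("fp_div", 20),
}


def estimate_workload(ops):
    """Sum the cycles each functional unit is occupied by the given operations."""
    load = defaultdict(int)
    for op in ops:
        unit, cycles = COST[op]
        load[unit] += cycles
    return dict(load)


def heaviest_unit(load):
    """Return the name of the unit with the largest estimated workload."""
    return max(load, key=load.get)


# One iteration of a reciprocal loop: load, divide, store, bump the index.
loop_body = ["load", "fdiv", "store", "iadd"]
workload = estimate_workload(loop_body)
print(workload, heaviest_unit(workload))
```

For this loop body the division unit dominates, which is the biased situation that steps 402 to 404 are meant to correct.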
[0015]
The present invention can be applied to many kinds of computation, but in practice it is most effective to replace long-running instructions in a loop, such as division or square root, with a combination of short-running instructions such as multiplication and addition. As an example, this embodiment replaces the reciprocal operation f(x) = 1/x, a kind of division, with multiply-add instructions. It is assumed that a division instruction can be executed once every 20 cycles and a multiply-add instruction once every cycle, and that floating-point numbers are represented, as in the IEEE 754 format, by a sign part, an exponent part, and a mantissa part.
[0016]
The algorithm used here is the following approximation by Newton's method:
X_{n+1} ≈ X_n + (1 - x × X_n) × X_n
Before the calculation starts, the exponent part of the argument x is reduced to a fixed range. The zeroth approximation X_0 is obtained by looking up the reduced value of x in a table prepared in advance. Because Newton's method converges quadratically, a suitable zeroth approximation with a few bits of precision is enough for an 8-byte floating-point number to converge completely at X_3 or X_4. Alternatively, a checking operation can be applied to the approximate value so that it matches the result computed by the division instruction.
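The iteration can be sketched as follows. This is a minimal illustration, not the code of the embodiment: the function name is hypothetical, the exponent reduction uses Python's `math.frexp`, and the zeroth approximation is a linear estimate standing in for the lookup table, which the description does not reproduce. Special values (infinities, NaN, subnormal inputs) are not handled.

```python
import math


def reciprocal_newton(x, steps=4):
    """Approximate 1/x with multiplies and adds after exponent reduction."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # Reduce the exponent: x = m * 2**e with 0.5 <= |m| < 1.
    m, e = math.frexp(x)
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    # Zeroth approximation X_0. The description uses a prepared table;
    # this linear estimate is an assumption made for the sketch.
    X = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(steps):
        # X_{n+1} = X_n + (1 - x * X_n) * X_n, applied to the reduced value m.
        X = X + (1.0 - m * X) * X
    # Undo the reduction: 1/x = (1/m) * 2**(-e).
    return sign * math.ldexp(X, -e)


print(reciprocal_newton(3.0), 1.0 / 3.0)
```

Each pass through the loop is two multiply-add operations, which is why the iteration maps onto the add/subtract/multiply unit rather than the division unit.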
[0017]
One approximation step requires two multiplication instructions and two addition/subtraction instructions, which a computer with a multiply-add instruction can perform in two instructions. If the iteration converges at X_4, the whole calculation can therefore be done in eight instructions, but here it is assumed to require 10 instructions. The exponent manipulation needed for the reduction is performed with integer operations, but this workload is relatively light and is ignored in this example. Accordingly, calculating a reciprocal with the division instruction occupies the division unit (204) for 20 cycles, and calculating it with the alternative instruction sequence occupies the add/subtract/multiply unit (203) for 10 cycles.
[0018]
Under these conditions, let p be the number of division instructions and q the number of add/subtract/multiply instructions in a certain execution section, and let n be the number of division instructions to be translated into alternative instruction sequences of multiply-add instructions. When q is less than 20p, (p - n) × 20 = q + 10n, that is, n = (20p - q)/30. When q is 20p or more, n = 0. FIG. 5 shows a program example of the processing that calculates the number n of division instructions whose reciprocals should be translated into alternative instruction sequences.
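FIG. 5 is not reproduced in this text, so the following is only a sketch of how the calculation of n might look under the stated cycle counts. The function name is hypothetical, and the rounding rule is an assumption: since n = (20p - q)/30 is generally not an integer, the sketch picks whichever neighbouring integer leaves the smaller maximum load on the two units.

```python
DIV_CYCLES = 20  # cycles one division occupies the division unit
ALT_CYCLES = 10  # cycles one alternative sequence occupies the add/multiply unit


def divisions_to_replace(p, q):
    """Number n of the p divisions to translate into alternative sequences.

    q is the number of add/subtract/multiply instructions already present.
    Balances (p - n) * 20 against q + 10 * n.
    """
    if q >= DIV_CYCLES * p:
        return 0  # the division unit is not the bottleneck
    exact = (DIV_CYCLES * p - q) / (DIV_CYCLES + ALT_CYCLES)
    lo = int(exact)  # floor, since exact is positive here

    def bottleneck(n):
        # Cycles taken by the busier of the two units for a given n.
        return max((p - n) * DIV_CYCLES, q + ALT_CYCLES * n)

    # Assumption of this sketch: choose the better of floor and ceiling.
    candidates = [n for n in (lo, lo + 1) if n <= p]
    return min(candidates, key=bottleneck)


print(divisions_to_replace(4, 0))
```

With four divisions and no other floating-point work, the exact value is 80/30, and the sketch selects three replacements, leaving 20 cycles on the division unit and 30 on the add/subtract/multiply unit.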
[0019]
When division instructions are instead replaced with library calls and q is less than 20p, library calls are used for the number of cycles (20p - q) during which only the division unit would be operating. Therefore n = (20p - q)/20. When q is 20p or more, n = 0. FIG. 6 shows a program example of the processing that calculates the number n of division instructions whose reciprocals should be translated into library calls of alternative instruction sequences.
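The library-call variant of the calculation can be sketched in the same way. FIG. 6 is not reproduced in this text; the function name is hypothetical, and rounding down to a whole number of instructions is an assumption of the sketch.

```python
DIV_CYCLES = 20  # cycles one division occupies the division unit


def divisions_for_library(p, q):
    """Number n of the p divisions to hand to the reciprocal library.

    q is the number of add/subtract/multiply instructions in the section.
    (20p - q) is the number of cycles in which only the division unit is busy.
    """
    if q >= DIV_CYCLES * p:
        return 0
    # Assumption of this sketch: round down to a whole instruction.
    return (DIV_CYCLES * p - q) // DIV_CYCLES


print(divisions_for_library(1, 0), divisions_for_library(4, 30))
```

The divisor here is 20 rather than 30 because the balancing between division and multiply-add work happens inside the library, not in the calling loop.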
[0020]
This library consists of code that performs the reciprocal operations at a ratio of one division instruction to two alternative instruction sequences of multiply-add instructions, so that the workloads of the functional units are equal. Calling the library requires passing an argument array, a result array, and the number of operations, so a temporary array may have to be generated at the call. As a specific example, when the source program B(n) = 1/(5*A(n)) shown in FIG. 7 is converted to the library call, it becomes as shown in FIG. 8. In FIG. 8, LIB(TMP, B, N) is the library call and TMP is a temporary array generated by the compiler (n = 1 to N).
In either case, combining the method with suitable loop unrolling is often effective for achieving the best workload balance.
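FIG. 7 and FIG. 8 are not reproduced in this text. The sketch below shows the shape of the rewrite as described: the expression 5*A(n) is first materialized into a temporary array, and the reciprocals are then computed by one call that receives the argument array, the result array, and the count. The Python function names are hypothetical, and the body of `lib` uses plain division as a stand-in for the mixed division and multiply-add code the library would contain.

```python
def lib(tmp, b, n):
    """Stand-in for LIB(TMP, B, N): store 1/tmp[i] into b[i] for i < n.

    The library in the description mixes one division with two multiply-add
    sequences; plain division is used here only to keep the sketch short.
    """
    for i in range(n):
        b[i] = 1.0 / tmp[i]


def original_loop(a):
    # FIG. 7 form as described: B(n) = 1 / (5 * A(n)).
    return [1.0 / (5.0 * x) for x in a]


def rewritten_loop(a):
    # FIG. 8 form as described: build TMP, then call the library once.
    n = len(a)
    tmp = [5.0 * x for x in a]  # temporary array generated by the compiler
    b = [0.0] * n
    lib(tmp, b, n)
    return b


a = [1.0, 2.0, 4.0]
print(original_loop(a) == rewritten_loop(a))
```

The temporary array is needed because the library takes whole arrays, while the original loop computes the argument of each reciprocal inline.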
[0021]
Next, the process of translating the FORTRAN program fragment shown in FIG. 9 will be described along the flowchart of the process 305 shown in FIG.
The compiler recognizes the do loop of the FORTRAN program shown in FIG. 9 as an independent unit of optimization and estimates the workload of each functional unit (step 401). When alternative instruction sequences are used to make the loads of the functional units comparable, the number n of reciprocal operations to be translated into alternative instruction sequences is calculated by the method shown in FIG. 5. When the library is used, the number n of reciprocal operations to be computed by the library is calculated by the method shown in FIG. 6. In either case, n is 1 or more (step 402).
[0022]
In FIG. 9, the time taken by one loop iteration, that is, by the calculation of one reciprocal, is 20 cycles when every reciprocal is translated into the original division instruction and 10 cycles when every reciprocal is translated into the alternative instruction sequence. If the loop of FIG. 9 is unrolled four times and three of the four reciprocals are translated into alternative instruction sequences, the time is reduced to 7.5 cycles per reciprocal. If the reciprocals are translated into library calls, the time is about 6.7 cycles per reciprocal (step 404).
Under the assumptions of this embodiment, a speedup of a factor of two or more is achieved in either case. In general, library code has the advantage that it can be fully tuned for performance.
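The cycle counts quoted above can be checked with a small model. The sketch assumes, as the description does, that the division unit and the add/subtract/multiply unit run fully in parallel and that nothing else limits the loop; the function name is hypothetical.

```python
def cycles_per_reciprocal(n_div, n_alt, div_cycles=20, alt_cycles=10):
    """Cycles per reciprocal when both units work in parallel.

    n_div reciprocals use the division instruction (20 cycles each on the
    division unit); n_alt use the alternative sequence (10 cycles each on
    the add/subtract/multiply unit). The busier unit sets the total time.
    """
    total = max(n_div * div_cycles, n_alt * alt_cycles)
    return total / (n_div + n_alt)


print(cycles_per_reciprocal(1, 0))            # all divisions
print(cycles_per_reciprocal(0, 1))            # all alternative sequences
print(cycles_per_reciprocal(1, 3))            # unrolled four times, three replaced
print(round(cycles_per_reciprocal(1, 2), 1))  # library ratio of 1 to 2
```

The four cases give 20, 10, 7.5, and about 6.7 cycles per reciprocal. The 1 to 2 ratio is the best of these because both units are then busy for exactly 20 cycles.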
[0023]
The program compilation method according to the present invention has been described above in detail. If each process of the compilation method is coded as a program, recorded on a recording medium such as a CD-ROM or a flexible disk (FD), and distributed, a user who obtains and uses the recording medium can produce a target program that is optimal for the processing performance (including hardware and OS) of the user's own computer system.
[0024]
Furthermore, a software vendor can use this program compilation method to generate target programs suited to various computer systems (including hardware and OS), record them on recording media such as CD-ROMs or flexible disks (FD), and distribute them as software dedicated to each computer system. In that case, a user who obtains and uses the recording medium that matches the user's computer system can run the system with a target program that is optimal for its processing performance (including hardware and OS).
[0025]
[Effects of the invention]
According to the program compilation method of the present invention, the workloads of the functional units can be made comparable without adding redundancy to the hardware, so that a plurality of functional units can execute instructions simultaneously. This improves the processing capability of the processor and speeds up program execution. Further, by obtaining and using the recording medium according to the present invention, a user can obtain, and run, a target program that is optimal for the performance of the user's own computer system.
[Brief description of the drawings]
FIG. 1 is a diagram showing an example of a computer system to which a compiling method of the present invention is applied.
FIG. 2 is a diagram illustrating an example of the functional configuration of the processor 101.
FIG. 3 is a diagram showing an outline of the processing flow of the compiler in the present embodiment.
FIG. 4 is a specific flowchart of the processing 305.
FIG. 5 is a program example of the processing that calculates the number n of division instructions whose reciprocals should be translated into alternative instruction sequences.
FIG. 6 is a program example of the processing that calculates the number n of division instructions whose reciprocals should be translated into library calls of alternative instruction sequences.
FIG. 7 is an example of a source program that requires a temporary array when the library is called.
FIG. 8 is an example in which the source program of FIG. 7 is rewritten using a temporary array.
FIG. 9 is an example of a source program (FORTRAN) using the present invention.
[Explanation of symbols]
101: Processor
102: Main memory
103: Disk device
104: Reading device
105: Bus
106: Storage medium
201: Load/store unit
202: Integer arithmetic unit
203: Floating-point add/subtract/multiply unit
204: Floating-point division unit
301: Input program
302: Compiler
303: Optimization unit
304: Target program (object program)
305: Processing (the compilation method of the present invention)

Claims

A program compiling method for a computer having a plurality of functional units capable of executing instructions in parallel, the method comprising: a first step of estimating the workload of each functional unit for the computation included in a certain execution section of an input program, determining whether the computation is biased toward instructions belonging to a specific functional unit, and calculating either the number of instructions, among the instructions of the computation in the execution section, that make the workload comparable across the functional units when translated into alternative instruction sequences belonging to another functional unit, or the number of instructions that make the workload comparable across the functional units when translated into library calls composed of alternative instruction sequences including instructions belonging to another functional unit; and
a second step of, when the first step determines that the computation is biased, translating the number of instructions calculated in the first step, among said instructions, into alternative instruction sequences belonging to another functional unit or into library calls composed of alternative instruction sequences including instructions belonging to another functional unit.

A computer-readable storage medium in which processing of each step of the program compiling method according to claim 1 is recorded as program code.