JP2007108838A

JP2007108838A - Compile method and compile device

Info

Publication number: JP2007108838A
Application number: JP2005296243A
Authority: JP
Inventors: Satoshi Torikai; 智鳥飼
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-10-11
Filing date: 2005-10-11
Publication date: 2007-04-26

Abstract

PROBLEM TO BE SOLVED: To parallelize a loop whose loop length is unknown so that it is executed in parallel when a benefit from parallel execution can be expected, and executed sequentially when no such benefit can be expected.

SOLUTION: A compiling device takes a source program 101 as input and, using an inter-thread synchronization overhead information file 108 and a machine cycle count acquisition library 106, generates object code 107 that can be executed on a shared memory computer with the thread as the unit of parallel processing. The device comprises a syntax analysis unit 103, a parallelization unit 104, and a code generation unit 105. For a parallelizable loop, the parallelization unit 104 generates object code that executes the first iteration of the loop, acquires the number of machine cycles required for one iteration, and determines a threshold, namely the loop length above which a benefit from parallel execution can be expected. This gives more efficient parallel execution than the static determination method, in a shorter development time than the execution information collection method.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a compiling method and a compiling device, and more particularly to a compiling method and a compiling device for parallelizing a source program written in a programming language.

A compiling device that parallelizes a source program for a shared memory computer translates a source program written in a programming language or the like (hereinafter simply called the source program) into a machine-language program (hereinafter called object code) that can be processed in parallel by a plurality of threads. Parallel processing incurs parallelization overhead, such as synchronization among the threads. For a parallelized loop, for example, this overhead cannot be hidden unless the loop length exceeds a certain value, and below that value parallelization yields no benefit.

In conventional loop parallelization, when the loop length of a parallelized loop cannot be known until run time, object code is generated that selects between parallel execution and sequential execution at run time. There are two methods of generating such object code. In the first (hereinafter called the static determination method), the compiling device statically determines a loop length above which a benefit from parallel execution can be expected (hereinafter called the threshold), and the generated code compares this threshold with the actual loop length at run time. In the second (hereinafter called the execution information collection method), the threshold is determined at recompilation from information collected by executing the program.
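The run-time selection performed by the static determination method can be pictured as a guard placed in front of the loop. The following is only a minimal Python sketch: `STATIC_THRESHOLD`, its value, and `select_execution` are hypothetical names introduced for illustration, and the strict comparison is an assumption, since the text only says that the threshold is compared with the actual loop length.

```python
# Hypothetical threshold fixed by the compiler at compile time.
STATIC_THRESHOLD = 100


def select_execution(loop_length, threshold=STATIC_THRESHOLD):
    """Return the loop version a run-time guard would pick."""
    # Assumption: parallel execution is chosen only when the loop length
    # strictly exceeds the threshold.
    return "parallel" if loop_length > threshold else "sequential"
```

Because the threshold is a constant, the guard cannot react to run-time conditions such as cache contents, which is the weakness discussed in the following paragraphs.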

As an example of a method that differs from both the static determination method and the execution information collection method, Patent Document 1, for example, describes a technique (hereinafter called the dynamic determination method) that collects execution information at run time and optimizes the program at run time.
Patent Document 1: JP 2004-355144 A

The static determination method does not take run-time effects such as the cache memory into account, so it is difficult to determine an accurate threshold. The execution information collection method runs the program once to collect execution information and can therefore determine an accurate threshold, but it requires recompilation, which adds overhead to program development.

The dynamic determination method requires the user to describe explicitly, in the input source program, the regions to be optimized at run time and the parameters to be optimized. Moreover, it generates a program that optimizes itself dynamically; it does not concern parallelizing compilation that generates object code selecting dynamically between parallel execution and sequential execution.

An object of the present invention is to solve the problems of the prior art described above, namely that an accurate threshold is difficult to determine with the static determination method and that recompilation is required with the execution information collection method, and to provide a compiling method and a compiling device that parallelize a source program so that a parallel loop of unknown loop length is executed in parallel when a benefit from parallel execution can be expected and sequentially when it cannot.

According to the present invention, this object is achieved by a compiling method for generating object code for use on a shared memory computer in which, for a parallel loop whose loop length is unknown at compile time, the generated code executes only the first iteration of the loop, acquires the number of machine cycles required for one iteration, calculates from that number the loop length at which a benefit from parallel execution is obtained, and uses this value to select between parallel execution and sequential execution.

The object is also achieved by a compiling device for generating object code for use on a shared memory computer, comprising a syntax analysis unit, a parallelization unit, and a code generation unit. The parallelization unit generates an intermediate language that, when the object code is executed on the shared memory computer, acquires the number of machine cycles required for one iteration of a loop, obtains the loop length at which a benefit from parallel execution is obtained from the acquired number of machine cycles and from synchronization overhead information for the target computer read at compile time, compares that loop length with the actual loop length of the loop, and selects between parallel execution and sequential execution.

According to the present invention, object code that executes a parallelized loop sequentially when its loop length is too short for a benefit from parallelization to be expected can be generated more accurately than with the static determination method.

Embodiments of the compiling method and compiling device for parallelizing a source program according to the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a compiling device according to an embodiment of the present invention, and FIG. 5 is a block diagram showing the configuration of a shared memory computer. In FIGS. 1 and 5, 101 is a source file, 102 is a parallelizing compiling device, 103 is a syntax analysis unit, 104 is a parallelization unit, 105 is a code generation unit, 106 is a machine cycle count acquisition library, 107 is object code, 108 is an inter-thread synchronization overhead information file, 111 is a CPU, 112 is a memory, 113 is a hard disk device, 501 is a shared memory computer, 502 is a processor, 503 is a cache memory, and 504 is a main memory.

A program parallelized according to the embodiment of the present invention is executed, for example, on a shared memory computer 501 such as that shown in FIG. 5. The shared memory computer 501 comprises a plurality of processors 502, each having a cache memory 503, and a main memory 504 shared by the processors 502, all connected via a communication path such as a bus.

As shown in FIG. 1(a), the parallelizing compiling device 102 according to the embodiment takes as input a source program 101 written in a programming language such as FORTRAN and, using the inter-thread synchronization overhead information file 108 and the machine cycle count acquisition library 106, generates object code 107 that can be executed on a shared memory computer with the thread as the unit of parallel processing. It comprises the syntax analysis unit 103, the parallelization unit 104, and the code generation unit 105.

The parallelizing compiling device 102 configured as described above may be built on the shared memory computer 501 shown in FIG. 5, with one of the processors 502 performing the parallelization of the source program, or it may be built on a separate, independent computer. FIG. 1(b) shows an example configuration of a computer on which the parallelizing compiling device is built. As is well known, such a computer need only include a CPU 111, a memory 112, a hard disk device 113, and input/output devices (not shown). The functional units of the parallelizing compiling device 102 are implemented in software that is stored on the hard disk device 113, loaded into the memory 112, and executed by the CPU 111 under the OS. The source file 101, the machine cycle count acquisition library 106, and the inter-thread synchronization overhead information file 108, which hold the information the parallelizing compiling device 102 needs, need only be stored on the hard disk device 113, and the generated object code 107 may also be stored there.

The syntax analysis unit 103 of the parallelizing compiling device receives the source program in the source file 101 as input and generates an intermediate language from it. The parallelization unit 104 receives the intermediate language generated by the syntax analysis unit 103 and the synchronization overhead information for the target machine from the inter-thread synchronization overhead information file 108. For a parallel loop whose loop length is unknown, it generates an intermediate language that uses the hardware counter routine of the machine cycle count acquisition library 106 to acquire the number of machine cycles required for the first iteration of the loop, determines a threshold from that number, and compares the threshold with the loop length to select between parallel execution and sequential execution. The code generation unit 105 receives the intermediate language output by the parallelization unit 104, performs optimization, register allocation, and the like on it, links the machine cycle count acquisition library 106, and generates the object code 107, which is machine language executable by the computer.

FIG. 2 is a flowchart of the processing in which the parallelization unit 104 generates the intermediate language that selects between parallel execution and sequential execution. In this processing, for a parallel loop whose loop length is unknown, the parallelization unit 104 generates an intermediate language that executes the first iteration of the loop, acquires the required number of machine cycles using the hardware counter routine of the machine cycle count acquisition library 106, determines a threshold from that number, and compares the threshold with the loop length to select between parallel execution and sequential execution. In this processing the parallelization unit 104 therefore uses only the names (codes) it needs from the machine cycle count acquisition library 106, not the routines themselves; the routines themselves are used when the object code is generated.

(1) When the processing starts, the parallelization unit 104 first determines whether the loop length of the parallelized loop is unknown. If the loop length is known, the processing ends without doing anything (steps 201, 202, 208).

(2) If step 202 determines that the loop length of the parallelized loop is unknown, code that checks for a loop length of zero iterations is inserted before the parallel loop (step 203).

(3) Next, the first iteration of the parallel loop is peeled off, and statements (code) that call the hardware counter routine of the machine cycle count acquisition library 106 are inserted before and after the peeled loop body (step 204).

(4) Next, the inter-thread synchronization overhead information file 108 is read, and code is inserted that determines the threshold from the thread fork overhead, the inter-thread barrier overhead, and the number of machine cycles of the peeled body obtained from the hardware counter routine (steps 205, 206).

FIG. 6 shows an example of the contents of the inter-thread synchronization overhead information file 108 read in step 205. The contents are, for example, as shown at 601 in FIG. 6. In this example, the target shared memory computer has a thread fork overhead of 1000 cycles, an inter-thread barrier overhead of 2000 cycles, and 16 processors.

(5) Finally, code is generated that compares the determined threshold with the loop length and branches to the parallel loop if the condition holds and to the sequential loop if it does not, and the processing ends (steps 207, 208).
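Steps 201 to 208 can be summarized as a function that returns the statements the parallelization unit 104 inserts for one loop. This is a rough Python sketch, not the intermediate language of FIG. 4: the statement strings and the names `build_dispatch`, `counter_start`, and `counter_stop` are hypothetical, and the comparison `N > L` is an assumption about the direction of the test in step 207. The threshold expression follows expression 701 of FIG. 7.

```python
def build_dispatch(loop_length_known):
    """Return the statements inserted for one parallelized loop (steps 201-208)."""
    if loop_length_known:
        # Step 202: the loop length is known, so nothing is inserted.
        return []
    return [
        "if N <= 0 goto end",                       # step 203: zero-trip check
        "call counter_start()",                     # step 204: before the peeled body
        "<loop body for I = 1>",                    # step 204: peeled first iteration
        "M = counter_stop()",                       # step 204: after the peeled body
        "L = ((O1 + O2) * NPE) / (M * (NPE - 1))",  # steps 205-206: threshold
        "if N > L goto parallel_loop else goto sequential_loop",  # step 207
    ]
```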

The processing described above can be implemented as a program and executed by the CPU of a computer. The program can be provided stored on a recording medium such as an FD, CD-ROM, or DVD, or provided as digital information over a network.

FIG. 3 shows an example of a source program, and FIG. 4 shows an example of the intermediate language describing the parallelized loop generated from the source program of FIG. 3. The description continues on the assumption that the source program of FIG. 3 is compiled with parallelization according to the embodiment.

The example source program 301 in FIG. 3 performs the operations A(I) = A(I) + B(I), S = C(I), and D(I) = E(I) + F(I) for every I from 1 to N. The intermediate language that the parallelization unit 104 generates from this source program by the processing of FIG. 2 is shown at 401 in FIG. 4.
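For readers without access to FIG. 3, the loop described above can be rendered as follows. This is a Python transliteration of the three operations named in the text, not the FORTRAN source of the figure; the function name `source_loop` is hypothetical, and index 0 of each list is left unused so that the indices stay 1-based as in the description.

```python
def source_loop(a, b, c, d, e, f, n):
    """Perform the three operations described for FIG. 3 for I = 1..n."""
    s = None
    for i in range(1, n + 1):
        a[i] = a[i] + b[i]
        s = c[i]            # S keeps the value from the last iteration
        d[i] = e[i] + f[i]
    return s
```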

The check code inserted in step 203 of FIG. 2 checks whether the loop length N is greater than 0 (line 402). If N is 0 or less, nothing is done. If N is greater than 0, the hardware counter routine for acquiring the number of machine cycles is called before the peeled first iteration to start counting (line 403) and, once the hardware counter has finished, the number of machine cycles is acquired (line 404). The statements between lines 403 and 404 are those that execute I = 1 of the example source program 301 in FIG. 3. A threshold is determined from the number of machine cycles acquired by executing these statements (line 405), and the threshold is compared with the loop length N (line 406). If the condition on line 406 holds, control branches to parallel execution; otherwise it branches to sequential execution.

For parallel execution, code is emitted that determines the start position %L and end position %U of each thread obtained by dividing the loop length (line 407). With %L and %U specified for each thread of the parallelized loop, the source program is executed in parallel by a plurality of processors. Parallel execution is specified by the four lines that include do %I = %L, %U.
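One way to compute per-thread start and end positions such as %L and %U is a contiguous block split of the remaining iterations. The text does not say how line 407 divides the loop, so the block distribution below is an assumption, and `thread_bounds` is a hypothetical name used only for this sketch.

```python
def thread_bounds(first, last, nthreads):
    """Return one inclusive (lower, upper) index pair per thread covering first..last."""
    total = last - first + 1
    base, extra = divmod(total, nthreads)
    bounds = []
    lower = first
    for t in range(nthreads):
        # The first `extra` threads take one additional iteration each.
        size = base + (1 if t < extra else 0)
        bounds.append((lower, lower + size - 1))
        lower += size
    return bounds
```

For the peeled loop the call would be `thread_bounds(2, n, nthreads)`, since iteration 1 has already run. A thread whose pair has upper below lower simply has no iterations to execute.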

For sequential execution, the iterations I = 2 to N of the example source program 301 in FIG. 3, that is, all iterations except I = 1, are executed sequentially by one processor. Sequential execution is specified by the four lines that include do I = 2, N.

As the example intermediate language of FIG. 4 shows, when this intermediate language is compiled into object code and executed by a processor, the first of the loop iterations is actually executed, and the result of that execution determines whether the remaining iterations are executed in parallel or sequentially.
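The run-time behaviour of the generated code can be imitated in Python as below. This is an illustrative sketch under several assumptions rather than the code of FIG. 4: `run_loop` is a hypothetical name, `time.perf_counter_ns` stands in for the hardware counter routine (so the overheads `o1` and `o2` would have to be given in nanoseconds, not cycles), the work is split into contiguous blocks, parallel execution is chosen when `n` strictly exceeds the threshold, and `npe` must be at least 2.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_loop(body, n, o1, o2, npe, counter=time.perf_counter_ns):
    """Run body(i) for i = 1..n and report which path was taken."""
    if n <= 0:                                  # zero-trip check
        return "skipped"
    start = counter()                           # start counting
    body(1)                                     # peeled first iteration
    m = max(counter() - start, 1)               # measured cost, kept at least 1
    threshold = (o1 + o2) * npe / (m * (npe - 1))
    if n > threshold:                           # parallel path for i = 2..n
        chunk = -(-(n - 1) // npe)              # ceiling of (n - 1) / npe

        def worker(t):
            lower = 2 + t * chunk
            upper = min(lower + chunk - 1, n)
            for i in range(lower, upper + 1):
                body(i)

        with ThreadPoolExecutor(max_workers=npe) as pool:
            list(pool.map(worker, range(npe)))  # wait for every block to finish
        return "parallel"
    for i in range(2, n + 1):                   # sequential path for i = 2..n
        body(i)
    return "sequential"
```

The `counter` parameter is injectable so that the decision can be exercised with fixed measurements instead of real timings.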

FIG. 7 shows an example of the expression 701 used for the threshold calculation in step 206 of FIG. 2, and FIG. 8 shows an example of the numbers of machine cycles acquired by the hardware counter at run time. In FIG. 7, L is the threshold to be obtained, M is the acquired number of machine cycles of the peeled loop body, O1 is the thread fork overhead, O2 is the inter-thread synchronization overhead, and NPE is the number of executing threads.

The expected number of machine cycles for sequential execution is L × M, and the expected number for parallel execution is L × M ÷ NPE + (O1 + O2). The break-even loop length, above which the expected parallel cycle count falls below the expected sequential cycle count, is therefore, as shown in expression 701 of FIG. 7,
L = ((O1 + O2) × NPE) ÷ (M × (NPE - 1)).
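The intermediate steps are not spelled out in the text. Equating the two expected cycle counts and solving for L gives the expression, using the symbols defined for FIG. 7:

```latex
\begin{aligned}
L \cdot M &= \frac{L \cdot M}{NPE} + (O_1 + O_2) \\
L \cdot M \left(1 - \frac{1}{NPE}\right) &= O_1 + O_2 \\
L &= \frac{(O_1 + O_2) \cdot NPE}{M \cdot (NPE - 1)}
\end{aligned}
```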

For example, if the inter-thread synchronization overhead information 601 of FIG. 6 has been read in step 205 of FIG. 2, then O1 = 1000, O2 = 2000, and NPE = 16, so the threshold calculation actually performed on line 405 of FIG. 4 is L = 3200 ÷ M. If the numbers of machine cycles acquired at run time on line 404 of FIG. 4 are as in the example shown at 801 in FIG. 8, then M = 10 and L = 320 when the data is in the cache memory, and M = 200 and L = 16 when it is not.
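The arithmetic of this example can be checked with a few lines of Python. The `name = value` layout of the overhead information and the names `OVERHEAD_INFO`, `parse_overhead`, and `threshold` are assumptions made for this sketch, since the actual file format of FIG. 6 is not reproduced in the text; only the three values and the formula come from the description.

```python
from fractions import Fraction

# Hypothetical text layout for the overhead information of FIG. 6.
OVERHEAD_INFO = """
thread_fork_overhead = 1000
thread_barrier_overhead = 2000
processors = 16
"""


def parse_overhead(text):
    """Parse 'name = value' lines into a dict of integers."""
    info = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            info[name.strip()] = int(value)
    return info


def threshold(m, info):
    """Break-even loop length for a measured per-iteration cost of m cycles."""
    o1 = info["thread_fork_overhead"]
    o2 = info["thread_barrier_overhead"]
    npe = info["processors"]
    return Fraction((o1 + o2) * npe, m * (npe - 1))


info = parse_overhead(OVERHEAD_INFO)
print(threshold(10, info))    # data in cache: prints 320
print(threshold(200, info))   # data not in cache: prints 16
```

Exact fractions are used so that the two thresholds come out as the integers quoted in the text without floating-point rounding.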

In such a case, the static determination method described in the background section cannot take into account whether the data is in the cache memory when choosing the threshold, so it may select parallel execution from which no benefit can be expected, and the execution information collection method requires recompilation and re-execution. The embodiment of the present invention avoids both problems.

The embodiment described above uses FORTRAN as the language of the source program, but the present invention can also be applied when the source program is written in any other programming language, such as C.

FIG. 1 is a block diagram showing the configuration of a compiling device according to an embodiment of the present invention.
FIG. 2 is a flowchart of the processing in which the parallelization unit 104 generates the intermediate language that selects between parallel execution and sequential execution.
FIG. 3 is a diagram showing an example of a source program.
FIG. 4 is a diagram showing an example of the intermediate language describing the parallelized loop generated from the source program of FIG. 3.
FIG. 5 is a block diagram showing the configuration of a shared memory computer.
FIG. 6 is a diagram showing an example of the contents of the inter-thread synchronization overhead information file read in step 205 of FIG. 2.
FIG. 7 is a diagram showing an example of the expression used for the threshold calculation in step 206 of FIG. 2.
FIG. 8 is a diagram showing an example of the numbers of machine cycles acquired by the hardware counter at run time.

Explanation of symbols

101 Source file
102 Parallelizing compiling device
103 Syntax analysis unit
104 Parallelization unit
105 Code generation unit
106 Machine cycle count acquisition library
107 Object code
108 Inter-thread synchronization overhead information file
111 CPU
112 Memory
113 Hard disk device
501 Shared memory computer
502 Processor
503 Cache memory
504 Main memory

Claims

A compiling method for generating object code for use on a shared memory computer, characterized in that, for a parallel loop whose loop length is unknown at compile time, only the first iteration of the loop is executed, the number of machine cycles required for one iteration of the loop is acquired, the loop length at which a benefit from parallel execution is obtained is calculated from the acquired number of machine cycles, and parallel execution or sequential execution is selected on the basis of this value.

In a compiling method for generating object code for use on a shared memory computer, a parallel loop method characterized in that the number of machine cycles required for one iteration of a loop is acquired by executing the object code on the shared memory computer; the loop length at which a benefit from parallel execution is obtained is determined from the acquired number of machine cycles and from synchronization overhead information for the target computer read at compile time; and that loop length is compared with the actual loop length of the loop to select between parallel execution and sequential execution.

In a compiling device for generating object code for use on a shared memory computer, a loop parallelization device comprising a syntax analysis unit, a parallelization unit, and a code generation unit, characterized in that the parallelization unit generates an intermediate language that acquires, by executing the object code on the shared memory computer, the number of machine cycles required for one iteration of a loop; determines the loop length at which a benefit from parallel execution is obtained from the acquired number of machine cycles and from synchronization overhead information for the target computer read at compile time; and compares that loop length with the actual loop length of the loop to select between parallel execution and sequential execution.

A computer-executable compiler program for generating object code for use on a shared memory computer, characterized by comprising: a step of acquiring, by executing the object code on the shared memory computer, the number of machine cycles required for one iteration of a loop; a step of determining the loop length at which a benefit from parallel execution is obtained from the acquired number of machine cycles and from synchronization overhead information for the target computer read at compile time; and a step of comparing that loop length with the actual loop length of the loop to select between parallel execution and sequential execution.