JP2006268070A

JP2006268070A - Parallelizing compilation method and parallel computer for executing parallelized object code

Info

Publication number: JP2006268070A
Application number: JP2005080845A
Authority: JP
Inventors: Akihiro Matsui; 昭宏松居; Hideki Aoki; 秀貴青木; Naonobu Sukegawa; 直伸助川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-03-22
Filing date: 2005-03-22
Publication date: 2006-10-05

Abstract

PROBLEM TO BE SOLVED: To achieve parallelization that takes the configuration of a computer with multi-core CPUs into consideration.
SOLUTION: Parallelizing compilation is performed so that element parallelization is applied between multi-core CPU chips and section parallelization is applied between the CPU cores within each chip. When the resulting object code is executed, the threads that handle the same elements of a loop control variable are executed by the CPU cores of the same CPU chip.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a parallelizing compilation method for programs targeting a parallel computer, to a parallel computer that executes the parallelized object code, and to its operating system.

In a parallel computer equipped with multiple CPUs, a program can be run in parallel on those CPUs by extracting parallelizable instructions from it. Two such techniques are called element parallelization and section parallelization. Both are applied mainly when the program is compiled.
In element parallelization, as shown in FIG. 2, the elements of the control variable I of a loop in the program are divided equally by the number of CPUs, the resulting partial loops are executed as separate threads, and each thread is handled by one of the CPUs. In the example of FIG. 2, the range 1 to N of the control variable I is divided equally by the number of CPUs, 4; threads 0 to 3 each take one share of the elements, and CPU cores 101, 102, 103, and 104 execute the threads in parallel.
In section parallelization, as shown in FIG. 3, the instruction sequence inside a loop is divided into mutually independent instruction sequences (sections); threads 0 to 3 each take the loop for one section, and CPUs 101, 102, 103, and 104 execute the threads in parallel.
In recent years, however, advances in semiconductor processes have led to multi-core chips that carry multiple CPU cores on a single chip (Non-Patent Document 1). CPUs that implement a multithreading mechanism, which makes one physical CPU appear as several virtual CPUs and executes several threads simultaneously, are also becoming mainstream (for example Intel Pentium 4 (registered trademark); Non-Patent Document 2). A parallel computer built from such CPUs has the configuration shown in FIG. 4, in which the relationship between CPU core 0 and CPU core 1 differs from that between CPU core 0 and CPU core 2 in whether the cache is shared. On a parallel computer with this configuration, the parallelization techniques above can run into problems.
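The two conventional techniques can be contrasted with a minimal Python sketch. The function names `element_partition` and `section_partition` are illustrative assumptions, not part of the patent, and the section labels stand in for the loop body of FIG. 3, which is not reproduced here.

```python
def element_partition(n, num_threads):
    """Element parallelization: split iterations 1..n into contiguous,
    nearly equal ranges, one per thread. Every thread runs the whole loop body."""
    base, extra = divmod(n, num_threads)
    ranges, start = [], 1
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def section_partition(sections, num_threads):
    """Section parallelization: give each thread whole sections of the loop body.
    Every thread still iterates over the full range 1..n."""
    return [sections[t::num_threads] for t in range(num_threads)]


# Four threads, as in the examples of FIG. 2 and FIG. 3 (N = 100 is an assumed value).
elem = element_partition(100, 4)
sect = section_partition(["(1)", "(2)", "(3)", "(4)"], 4)
```

Under element parallelization each thread gets a quarter of the iterations; under section parallelization each thread gets one section and all iterations.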

The Power4 Processor Introduction and Tuning Guide (IBM Redbooks, 2001)

David Koufaty, Deborah T. Marr, "Hyperthreading Technology in the Netburst Microarchitecture", IEEE Micro, March-April 2003, pp. 56-65

Consider a parallel computer equipped with multi-core CPUs, as shown in FIG. 4. This computer carries two multi-core CPU chips, each with two CPU cores. When such a computer executes a program by element parallelization or section parallelization as in FIGS. 2 and 3, each technique has the following disadvantages.
First, with the element parallelization of FIG. 2, the loop length per CPU becomes shorter as the number of CPU cores grows, so the overhead of starting a loop weighs more heavily. With only about four CPU cores in total, as in FIG. 4, the effect is small, but parallel computers with dozens of CPU cores already exist and the number is likely to keep growing. The more CPU cores execute the loop, the shorter the loop each core handles, so the loop startup overhead may no longer be negligible. In addition, under element parallelization the number of arrays each CPU must load is the same as in sequential execution. For a loop with many streams, as in FIG. 2, some architectures therefore cannot issue prefetches for all of the streams, and performance may degrade significantly. This is because the number of streams for which data can be prefetched from memory is normally limited per CPU core; prefetches cannot be issued for an unlimited number of streams. The program of FIG. 2 could, for example, be split into four loops by loop division, but when the loop length is long, reusable arrays such as B1, B2, C1, D1, and F1 would have to be loaded from memory again, so performance still degrades.

With the section parallelization of FIG. 3, the first problem is that it is hard to divide the loop into sections for a large number of CPU cores. For example, when the total number of CPU cores is four, as in the parallel computer of FIG. 4, the loop of FIG. 3 can be divided into four sections and each section executed on one CPU core, but with more CPU cores than that, section parallelization becomes difficult. Parallelization efficiency also drops when the execution speeds of sections (1) to (4) differ greatly. Furthermore, multi-core chips often have an on-chip cache shared among the CPU cores, but when the execution speeds of sections (1) to (4) differ greatly, the reusable arrays B1, B2, C1, D1, and F1 may be evicted from the cache before they are reused. Even on a cache hit, accessing arrays E1 and F1, for example, causes communication between chips, which can degrade performance.

The parallelizing compilation method in a representative embodiment of the present invention is characterized in that the program is parallelized to match the configuration of the parallel computer that will execute the generated object program. More specifically, for a parallel computer equipped with multi-core CPUs or CPUs that implement multithreading, the program is parallelized by combining element parallelization and section parallelization according to the CPU configuration of the target computer.
The compilation method of the present invention is described using the parallel computer of FIG. 4 as an example. As shown in FIG. 5, a loop in the program is first element-parallelized by the number of CPU chips, 2, and each resulting loop is then divided into sections (1), (2), (3), and (4) and parallelized. When the parallelized program runs, eight threads are started. So that element-parallel execution takes place between chips and section-parallel execution within a chip, the program is parallelized such that all threads handling the same elements of the loop control variable are executed by CPU cores of the same chip.
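The combined scheme can be sketched in Python as a plan that lists, for each thread, the chip it belongs to, its share of the iteration range, and its section. The function name `hybrid_plan`, the tuple layout, and the loop length N = 1000 are assumptions made for illustration; the patent itself only fixes two chips, four sections, and eight threads.

```python
def hybrid_plan(n, num_chips, sections):
    """Element-parallelize iterations 1..n across chips, then
    section-parallelize each chip's share across its sections.
    Returns one (thread_id, chip, iteration_range, section) tuple per thread."""
    base, extra = divmod(n, num_chips)
    plan, start, tid = [], 1, 0
    for chip in range(num_chips):
        size = base + (1 if chip < extra else 0)
        iters = range(start, start + size)
        start += size
        for section in sections:
            # All threads created here share the same iterations,
            # so they are all tied to the same chip.
            plan.append((tid, chip, iters, section))
            tid += 1
    return plan


# Two chips and four sections, as in FIG. 4 and FIG. 5.
plan = hybrid_plan(n=1000, num_chips=2, sections=["(1)", "(2)", "(3)", "(4)"])
```

With these inputs the plan holds eight threads: threads 0 to 3 cover iterations 1 to 500 on chip 0, and threads 4 to 7 cover iterations 501 to 1000 on chip 1.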

Compared with applying element parallelization or section parallelization alone, the parallelization technique and parallel execution method described above can be expected to provide the following benefits.
First, as an advantage over element parallelization: unlike element parallelization across all CPU cores, the loop length becomes longer by a factor equal to the number of CPU cores per chip, which reduces the impact of loop startup overhead. Also, because the number of streams per CPU core decreases, prefetches can be expected to be issued effectively even for loops with so many streams that they would otherwise exceed the limit on prefetchable streams. For example, the source code of FIG. 5 contains ten load streams in total: B1, C1, D1, E1, F1, G1, B2, C2, D2, and D3. With the parallelization scheme of the present invention, if CPU core 0 executes the upper two of the four loops that run on CPU chip 0 and CPU core 1 executes the lower two, CPU core 0 has six load streams and CPU core 1 has four, so the number of streams per CPU core is reduced. In addition, with fewer streams per CPU core, the shared cache capacity available per stream becomes relatively larger, which can be expected to raise the probability of cache hits.
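The stream counts quoted above can be checked with a short Python sketch. FIG. 5 is not reproduced in this text, so the per-section load sets in `section_loads` are hypothetical: they were chosen only so that they agree with the figures the description states (ten load streams in total, six on CPU core 0, four on CPU core 1, and B1, B2, C1, D1, F1 as the reusable arrays). They should not be read as the actual loop body.

```python
from collections import Counter

# Hypothetical arrays loaded by each section (assumption; FIG. 5 is not available).
section_loads = {
    "(1)": {"B1", "C1", "D1", "E1", "F1", "G1"},
    "(2)": {"B1", "C1", "D1", "F1"},
    "(3)": {"B2", "C2", "D2"},
    "(4)": {"B2", "D3"},
}

# Upper two sections on CPU core 0, lower two on CPU core 1, as in the description.
core_sections = {0: ["(1)", "(2)"], 1: ["(3)", "(4)"]}


def streams_per_core(loads, assignment):
    """Distinct load streams each core has to prefetch."""
    return {
        core: set().union(*(loads[s] for s in secs))
        for core, secs in assignment.items()
    }


per_core = streams_per_core(section_loads, core_sections)
all_streams = set().union(*section_loads.values())

# Arrays loaded by more than one section are the ones with reuse.
usage = Counter(a for loads in section_loads.values() for a in loads)
reused = {a for a, count in usage.items() if count > 1}
```

Under plain element parallelization every core would have to handle all ten streams; with the sections split as above, no core handles more than six.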

As an advantage over section parallelization: because sections are divided among only the CPU cores within a chip, the loop is easier to divide than when it must be divided among all CPU cores. In addition, because section parallelization is confined to the CPU cores within a chip, it is easier to divide the sections so that execution speed is balanced among those cores, and parallelization efficiency can be expected to be higher than with section parallelization across all CPU cores. Furthermore, when the chip has a shared cache memory, the array data for the same iteration can be expected to be in the on-chip shared cache, so no unnecessary inter-chip communication occurs.
The benefit of the present invention on CPUs that implement multithreading is as follows. In general, on such a CPU, effective efficiency is higher when the virtual CPUs run dissimilar code, so that they compete for CPU resources as little as possible, than when they all run the same kind of code. Unlike simple element parallelization, the parallelization technique of the present invention uses section parallelization across the CPU cores, which makes it easy to run code with different characteristics on them. This benefit can therefore also be expected on CPUs that implement multithreading.

FIG. 1 shows the flow of a representative compilation process of the present invention. The compiler in this embodiment receives, as an argument from a compile option, an operating system configuration parameter, or an environment variable, the number of CPU cores per CPU chip in the parallel computer on which the compiled program will run. When the CPU cores implement multithreading, this number of CPU cores is taken to be the product of the number of threads one CPU core can execute simultaneously and the number of CPU cores on the CPU chip.
During program parallelization, the compiler performs the process of FIG. 1 whenever it detects a loop in the program. The flow is described below using the parallel computer of FIG. 4 as the target. The computer 200 of FIG. 4 has CPU chips 10 and 20, which contain CPU cores 101 and 102 and CPU cores 103 and 104, respectively. CPU chips 10 and 20 also have cache memories 121 and 122, respectively, each shared by the two CPU cores of its chip. The computer 200 further has a main memory 130 shared by the two CPU chips.
The compiler first checks whether the loop contains reusable arrays. In the loop of FIG. 5, arrays B1, B2, C1, D1, and F1 are reusable arrays.

If there is no reusable array and the loop length is n or more (n is either a compile-time option or a compiler default determined by the characteristics of the architecture), loop division is performed. In the program of FIG. 5, loop division means splitting (1), (2), (3), and (4) into four loops that each have the same loop control variable. If the loop length is less than n, loop division would add loops whose startup overhead takes up a larger share of execution time, so loop division is skipped and the compiler goes on to determine whether element parallelization is possible.
If the loop contains reusable arrays, loop division would destroy the data reuse and the cache could not be exploited, so the compiler proceeds to determine whether element parallelization is possible. If element parallelization is judged possible, the compiler next determines whether the number of load streams in the loop exceeds the number of streams one CPU core can prefetch (this value is either a compile-time option or a compiler default determined by the characteristics of the architecture). If the number of streams in the loop is within the prefetchable number, only element parallelization is applied. If, as in the program of FIG. 5, the number of streams exceeds the prefetchable number, the compiler next determines whether section parallelization is possible.
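The per-loop decision flow can be summarized as a small Python function. This is a sketch under stated assumptions, not the compiler of the patent: `LoopInfo`, `choose_strategy`, and the returned labels are invented names; the final check (at least as many sections as CPU cores per chip) comes from the section-splitting step of the same flow; and the text does not say what happens after loop division or when element parallelization is impossible, so the sketch simply reports those outcomes and stops.

```python
from dataclasses import dataclass


@dataclass
class LoopInfo:
    length: int                   # number of iterations
    has_reusable_arrays: bool     # some array is read by more than one section
    element_parallelizable: bool  # iterations are independent of one another
    load_streams: int             # number of load streams in the loop
    independent_sections: int     # sections found by the section-splitting step


def choose_strategy(loop, n_threshold, prefetch_limit, cores_per_chip):
    """Return the transformation chosen for one loop."""
    if not loop.has_reusable_arrays and loop.length >= n_threshold:
        return "loop division"
    if not loop.element_parallelizable:
        return "no parallelization"  # assumption: outcome not described in the text
    if loop.load_streams <= prefetch_limit:
        return "element only"
    if loop.independent_sections < cores_per_chip:
        return "element only"
    return "element + section"


# The FIG. 5 loop on the FIG. 4 machine: reusable arrays, 10 load streams,
# 4 sections, 2 cores per chip. The length, threshold, and prefetch limit of 8
# are assumed values.
fig5 = LoopInfo(length=100000, has_reusable_arrays=True,
                element_parallelizable=True, load_streams=10,
                independent_sections=4)
decision = choose_strategy(fig5, n_threshold=1000, prefetch_limit=8, cores_per_chip=2)
```

For a CPU with multithreading, `cores_per_chip` would be the product of physical cores per chip and simultaneous threads per core, as the description of the compiler input specifies.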

To determine whether section parallelization is possible, the compiler first divides the instruction sequence in the loop into the smallest sections that are independent of one another. Divisions that would introduce a new sequential dependence are not made. For example, splitting the statement A1(I) = B1(I) + C1(I) + D1(I) into A1(I) = B1(I) + C1(I) and A1(I) = A1(I) + D1(I) would make the two threads depend on each other sequentially, so they could not run in parallel. The loop of FIG. 5 can be divided into four sections, (1), (2), (3), and (4). Next, the compiler checks that the number of sections is equal to or greater than the number of CPU cores per CPU chip. If this condition is not met, section parallelization is judged impossible and only element parallelization is applied, as in the case where the number of streams in the loop is within the prefetchable number. The CPU chips in FIG. 4 each have two CPU cores, so for a loop like that of FIG. 5, which meets the condition, the element-parallelized loop is further section-parallelized by dividing it into sections (1), (2), (3), and (4).
The behavior of the parallelized object code that the compiler outputs after applying this parallelization process to every loop in the program is described in Embodiment 3.
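One way to picture the split into mutually independent sections is to group statements that touch the same written array. The Python sketch below is a simplification chosen for illustration and is not taken from the patent: it treats each statement as a pair (written array, set of read arrays), ignores index expressions and loop-carried dependences, and uses a small union-find to merge dependent statements. Arrays that are only read may be shared between sections.

```python
def split_into_sections(statements):
    """Group statement indices into independent sections.

    statements: list of (written_array, set_of_read_arrays).
    Two statements are dependent if they write the same array or if one
    reads an array that the other writes.
    """
    parent = list(range(len(statements)))

    def find(i):
        # Find the group representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (wi, ri) in enumerate(statements):
        for j in range(i + 1, len(statements)):
            wj, rj = statements[j]
            if wi == wj or wi in rj or wj in ri:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(statements)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Statements 0 and 1 are the split form of A1(I) = B1(I) + C1(I) + D1(I)
# from the text; statement 2 is a hypothetical independent statement.
stmts = [
    ("A1", {"B1", "C1"}),   # A1(I) = B1(I) + C1(I)
    ("A1", {"A1", "D1"}),   # A1(I) = A1(I) + D1(I)
    ("A2", {"B1", "B2"}),   # A2(I) = B1(I) + B2(I)
]
sections = split_into_sections(stmts)
```

The two halves of the split statement end up in the same section because both write A1, which matches the rule that such a division must not be made; the third statement shares only the read-only array B1 and forms its own section.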

The following describes a method that performs optimization by loop fusion before the parallelization process of Embodiment 1.
When the compiler determines that the loops in source code #1 of FIG. 8 can be fused into a form like source code #2, it fuses the loops and then performs the parallelization process of Embodiment 1.
This method makes the accesses to arrays B1, B2, C1, D1, and F1 reusable and thereby reduces data transfer from memory.
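Loop fusion itself can be illustrated with a small Python example. FIG. 8 is not reproduced in this text, so the two loops below are hypothetical stand-ins for source code #1 and #2; only the array names echo the description.

```python
def separate_loops(b1, c1, d1):
    """Before fusion: two loops over the same range, each reading b1."""
    n = len(b1)
    a1 = [0.0] * n
    a2 = [0.0] * n
    for i in range(n):
        a1[i] = b1[i] + c1[i]
    for i in range(n):
        a2[i] = b1[i] * d1[i]   # b1 is streamed from memory a second time
    return a1, a2


def fused_loop(b1, c1, d1):
    """After fusion: one loop; b1[i] is reused while it is still close at hand."""
    n = len(b1)
    a1 = [0.0] * n
    a2 = [0.0] * n
    for i in range(n):
        a1[i] = b1[i] + c1[i]
        a2[i] = b1[i] * d1[i]
    return a1, a2


b1 = [1.0, 2.0, 3.0]
c1 = [10.0, 20.0, 30.0]
d1 = [2.0, 2.0, 2.0]
same_result = separate_loops(b1, c1, d1) == fused_loop(b1, c1, d1)
```

Both versions compute the same values; the fused form gives array b1 the kind of reuse within one loop that the parallelization process of Embodiment 1 then preserves.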

FIG. 6 shows the runtime behavior of representative parallelized object code of the present invention. The object code of the parallelized program shown in FIG. 6 has three parallel execution parts, which are parallelized loops, with sequential execution parts at both ends of each parallel execution part.
The behavior when the object code of FIG. 6 runs on the computer of FIG. 4 is as follows. The computer 200 of FIG. 4 has CPU chips 10 and 20, which contain CPU cores 101 and 102 and CPU cores 103 and 104, respectively. CPU chips 10 and 20 also have cache memories 121 and 122, respectively, each shared by the two CPU cores of its chip. The computer 200 further has a main memory 130 shared by the two CPU chips.
When the object code is executed, the element parallelization count is given as a runtime option. On the computer of FIG. 4, to execute on all CPU cores, the object code is run with an element parallelization count of 2, the number of CPU chips.
In this object code, the loops of parallel execution parts 1, 2, and 3 are each element-parallelized. In addition, the loop bodies of parallel execution parts 1 and 2 are section-parallelized: part 1 is divided into the four sections (1) to (4), and part 2 into the five sections (5) to (9).
The object code is therefore executed in parallel by ten threads, threads #0 to #9 in the table of FIG. 6. Ten is the product of the element parallelization count, 2, and the largest section count among parallel execution parts 1, 2, and 3, namely the 5 sections of part 2. Thread #0 handles sequential execution parts 1 to 4, and the threads handle parallel execution parts 1, 2, and 3 as shown in the table of FIG. 6.
Each of threads #0 to #9 started by the object code carries a group identifier determined by the loop control variable elements it handles. The operating system running on the parallel computer of FIG. 4 reads the group identifiers of threads #0 to #9 and runs threads with the same identifier on CPU cores of the same CPU chip. In the example of FIG. 6, threads #0 to #4 run on CPU cores 0 and 1 of CPU chip 0, and threads #5 to #9 run on CPU cores 2 and 3 of CPU chip 1. Which of the two CPU cores on a chip runs a given thread is chosen freely by the operating system's time-sharing scheduler.
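The thread count and the grouping by identifier can be reproduced with a few lines of Python. The function name `thread_groups` and the rule "group = thread number divided by the section width" are assumptions that happen to match the example (threads #0 to #4 on chip 0, threads #5 to #9 on chip 1); the table of FIG. 6, which defines the real assignment, is not reproduced here. Parallel execution part 3 is element-parallelized only, so it is counted as one section.

```python
def thread_groups(element_parallelism, sections_per_part):
    """Return {thread_id: group_id} for the threads the object code starts.

    The thread count is the element parallelization count times the largest
    section count of any parallel execution part. Threads in one group handle
    the same loop control variable elements and belong on the same chip.
    """
    width = max(sections_per_part)
    total = element_parallelism * width
    return {tid: tid // width for tid in range(total)}


# Element parallelization count 2; parts 1, 2, 3 have 4, 5, and 1 sections.
groups = thread_groups(2, [4, 5, 1])

# Group identifier -> CPU cores of the chip that serves it (FIG. 4 layout).
cores_of_group = {0: [0, 1], 1: [2, 3]}
placement = {tid: cores_of_group[g] for tid, g in groups.items()}
```

An operating system scheduler would then be free to pick either core in `placement[tid]` for a given thread, which mirrors the statement that the choice within a chip is left to time sharing.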

When object code parallelized by the compiler of Embodiment 1 is executed by the method of Embodiment 3, accesses to reusable arrays may miss in the cache, depending on how far each thread has progressed. For example, suppose that during parallel execution part 1 of FIG. 6, thread #0 progresses much faster than thread #1. FIG. 7 focuses on threads #0 and #1 of parallel execution part 1 and tabulates the progress of each thread's loop control variable at times T1 to T4. Assume that the shared caches 121 and 122 of the computer of FIG. 4 each have a capacity of 512 kilobytes and that one array element is 8 bytes. At time T3, thread #0 has already advanced to control variable I = 10000, and elements I = 100 to 10000 of the seven arrays A1(I), B1(I), C1(I), D1(I), E1(I), F1(I), and G1(I) occupy more than 512 kilobytes. Consequently, by the time thread #1 accesses B1(100), C1(100), and D1(100), these elements have already been evicted from the cache and the accesses miss.
To avoid this, threads with the same group identifier can synchronize at fixed intervals to keep their progress aligned. Parallel execution part 1 has 14 load and store streams in total, so
512 kilobytes × 1024 / 8 / 14 ≈ 4681
that is, once all threads have accessed 4681 elements, the oldest data is expected to start being evicted from the cache. The threads therefore synchronize with one another each time the loop control variable reaches a multiple of 4681, which avoids cache misses on accesses to reusable arrays.
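The interval calculation and the periodic synchronization can be sketched in Python. The helper names `sync_interval` and `run_in_lockstep` are illustrative assumptions, and `threading.Barrier` from the Python standard library stands in for whatever synchronization instruction a compiler would actually insert; the loop body is left empty.

```python
import threading


def sync_interval(cache_bytes, element_bytes, num_streams):
    """Elements each stream can advance before the shared cache is full."""
    return cache_bytes // (element_bytes * num_streams)


# 512-kilobyte shared cache, 8-byte elements, 14 load and store streams.
interval = sync_interval(512 * 1024, 8, 14)


def run_in_lockstep(num_threads, n, interval):
    """Run num_threads section threads over iterations 1..n, making them
    wait for one another whenever the control variable hits a multiple of
    interval. Returns how many times each thread synchronized."""
    barrier = threading.Barrier(num_threads)
    sync_counts = [0] * num_threads

    def worker(tid):
        for i in range(1, n + 1):
            # The work of this thread's section for iteration i would go here.
            if i % interval == 0:
                barrier.wait()          # no thread runs ahead past this point
                sync_counts[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sync_counts


counts = run_in_lockstep(num_threads=2, n=10000, interval=interval)
```

With 10000 iterations and an interval of 4681, each thread passes two synchronization points, so a fast thread can never be more than one interval ahead of a slow one.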

The present invention relates to a program parallelization technique, a parallelizing compiler, parallelized object code, an operating system, and a parallel computer. It is particularly useful on computers equipped with multi-core CPUs, which carry multiple CPU cores on one CPU chip, or with CPUs that have a multithreading function, which divides the resources within a CPU core so that multiple threads can execute simultaneously.

FIG. 1 is a flowchart showing the parallelizing compilation method of one embodiment of the present invention.
FIG. 2 is a conceptual diagram showing element parallelization.
FIG. 3 is a conceptual diagram showing section parallelization.
FIG. 4 is a system configuration diagram of a parallel computer equipped with multi-core CPU chips that executes the parallelized object code generated in the above embodiment.
FIG. 5 is a conceptual diagram showing how a loop is parallelized by the parallelizing compilation method of the above embodiment.
FIG. 6 is a conceptual diagram showing the operation of the parallelized object code generated in the above embodiment.
FIG. 7 is a conceptual diagram showing synchronization between the threads started by the parallelized object code in another embodiment of the present invention.
FIG. 8 is a conceptual diagram showing loop fusion in the compilation method of yet another embodiment of the present invention.

Explanation of symbols

10, 20: CPU chips
101, 102, 103, 104: CPU cores
121, 122: Cache memories
130: Main memory

Claims

A parallelizing compilation method for converting a given source code program into object code that can be executed in parallel by a parallel computer composed of a plurality of CPU chips, each including one or more CPU cores, wherein:
each loop in the program is divided into a plurality of loops corresponding to the plurality of CPU chips by dividing the elements of the loop control variable;
each of the divided loops is further divided into a plurality of loops by dividing the instruction sequence in the loop into a plurality of parallelizable sections; and
object code that executes each of the resulting loops in parallel on different threads is output.

The parallelizing compilation method according to claim 1, wherein the output object code has means for, after starting a plurality of threads, causing those threads among the plurality of threads that handle the same elements of the loop control variable to be executed by one or more CPU cores on the same CPU chip.

The parallelizing compilation method according to claim 1, wherein, among the plurality of threads started by the output object code, the threads that handle the same elements of the loop control variable each carry an identifier indicating that they handle the same elements of the loop control variable.

The parallelizing compilation method according to claim 1, wherein, when the CPU cores in a CPU chip have a multithreading function that divides the resources within a CPU core so that a plurality of threads can be executed simultaneously, one CPU chip is regarded as having a number of CPU cores equal to the product of the number of threads one CPU core can execute simultaneously and the number of physical CPU cores on the CPU chip, and the object code causes those started threads that handle the same elements of the loop control variable to be executed by CPU cores on the same CPU chip.

The parallelizing compilation method according to claim 1, wherein, when a plurality of loops detected in the program can be fused into a single loop, the loops are fused into a single loop before parallelization is applied.

The parallelizing compilation method according to claim 1, further comprising a step of adding, when a loop is parallelized, instructions by which the threads that handle the same elements can synchronize with one another at fixed time intervals or each time the loop control variable advances by a fixed number of elements, so that the progress of those threads is kept aligned.

An object program supplied to a parallel computer composed of a plurality of CPU chips, each including one or more CPU cores, wherein a loop contained in the original program is divided into threads by element parallelization, which divides the loop by the elements of its control variable, and the threads are executed in parallel across the plurality of CPU chips, and wherein, within each CPU chip, the thread assigned to that chip is further divided into threads by section parallelization, which divides the instruction sequence constituting the loop into a plurality of sections, and those threads are executed in parallel.

The object program according to claim 7, wherein the CPU cores have a multithreading function that divides their resources so that a plurality of threads can be executed simultaneously, and wherein, within each CPU chip, the thread assigned to that chip is divided, by sections of the instruction sequence constituting the loop, into a number of threads corresponding to the product of the number of threads one CPU core can execute simultaneously and the number of physical CPU cores on one CPU chip, and those threads are executed in parallel.

An operating system that runs on a parallel computer, wherein,
when the object code to be executed is object code to which element parallelization has been applied, dividing one or more loops in it into a plurality of loops by dividing the elements of the loop control variable,
to which section parallelization has further been applied, dividing each of the divided loops into a plurality of loops by dividing the instruction sequence in the loop into a plurality of parallelizable sections,
and which has means for executing each of the resulting loops in parallel on different threads, the operating system has means for, after the plurality of threads are started, causing those threads that handle the same elements of the loop control variable to be executed by one or more CPU cores on the same CPU chip.

The operating system according to claim 9, further comprising means for, when the CPU cores in the CPU chips mounted on the parallel computer have a multithreading function that divides the resources within a CPU core so that a plurality of threads can be executed simultaneously, assigning threads at execution time on the assumption that one CPU chip has a number of CPU cores equal to the product of the number of threads one CPU core can execute simultaneously and the number of physical CPU cores on one CPU chip.

A parallel computer equipped with a plurality of CPUs, wherein,
when executing object code to which element parallelization has been applied, dividing one or more loops in it into a plurality of loops by dividing the elements of the loop control variable,
to which section parallelization has further been applied, dividing each of the divided loops into a plurality of loops by dividing the instruction sequence in the loop into a plurality of parallelizable sections,
and which has means for executing each of the resulting loops in parallel on different threads, the parallel computer has means for, after the plurality of threads are started, causing those threads that handle the same elements of the loop control variable to be executed by one or more CPU cores on the same CPU chip.