JP5227646B2 - Compiler and code generation method thereof

Info

Publication number: JP5227646B2
Application number: JP2008110835A
Authority: JP
Inventors: 敬子本川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-04-22
Filing date: 2008-04-22
Publication date: 2013-07-03
Anticipated expiration: 2028-04-22
Also published as: JP2009265708A

Description

The present invention relates to compilation techniques, and in particular to compilation techniques for using SIMD instructions, which perform memory accesses and operations on multiple data items.

Architectures with instructions that hold multiple data items in a single register and operate on them in parallel have become common. For example, the SSE multimedia instruction extension described in Non-Patent Document 1 uses 16-byte registers to handle multiple data items at once. With 4-byte single-precision floating-point data, operations on four data items can be executed in parallel in one register. Such instructions are called SIMD (single-instruction-multiple-data) instructions, and an optimization that generates code using SIMD instructions is called SIMD optimization.

There are two main approaches to SIMD optimization. The first, called the vectorization method, applies vectorization techniques to loops containing expressions that use stride-1 array element references. The second, called the instruction combination method, works independently of loops: it finds sets of adjacent elements among nearby references and combines them. The vectorization method is described in Chapter 5 of Non-Patent Document 1. The instruction combination method is described in Non-Patent Document 2.

FIG. 5 shows an example of transformation by the vectorization method. Consider the innermost k loop in the source program (written in C) of FIG. 5(a). The accesses to array a are stride-1 contiguous references because the loop control variable k is the subscript of the lowest dimension, so this loop can be vectorized. Assuming that a SIMD instruction can access four data items of the element type of array a (int) simultaneously (SIMD parallelism 4), the k loop is unrolled by a factor of 4 and converted to SIMD form. FIG. 5(b) shows the code image after the transformation. In the statement inside the k loop, the subscript 'k:k+3' of array a denotes access to the four elements k through k+3, that is, k, k+1, k+2, and k+3. The left-hand side represents an assignment to the four elements starting at a[i][j][k], and the right-hand side represents four copies of v2 packed into one register. In this way, one statement in the SIMD-converted loop does the work of four original statements. Since the original length of the k loop is 5, unrolling by 4 leaves a remainder of 1. For the remaining element a[i][j][4], code that performs an ordinary assignment is generated after the loop.
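
The split into SIMD statements and a scalar remainder described above can be sketched as follows. This is a minimal Python illustration, not code from the patent; `strip_mine` is a hypothetical helper that only lists which subscripts each SIMD statement and each ordinary assignment would cover.

```python
def strip_mine(loop_length, parallelism):
    """Split subscripts 0..loop_length-1 into full SIMD groups and leftovers.

    Hypothetical helper for illustration only.
    """
    full = loop_length - loop_length % parallelism
    # Each group stands for one statement of the form a[i][j][k:k+P-1] = ...
    vector_groups = [list(range(k, k + parallelism))
                     for k in range(0, full, parallelism)]
    # Subscripts outside a full group are handled by ordinary assignments.
    scalar_rest = list(range(full, loop_length))
    return vector_groups, scalar_rest


groups, rest = strip_mine(5, 4)
print(groups)  # [[0, 1, 2, 3]]
print(rest)    # [4]
```

With loop length 5 and parallelism 4, one SIMD statement covers subscripts 0 to 3 and subscript 4 is left for an ordinary assignment.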

Next, FIG. 7 shows an example of transformation by the instruction combination method. Consider the three statements in the loop of the source program (written in C) in FIG. 7(a). Assuming the SIMD parallelism for array a is 2, a[i][0] and a[i][1] are adjacent elements that can be accessed simultaneously, so the first two statements are combined. FIG. 7(b) shows the code image after the transformation.

Non-Patent Document 1: Aart J.C. Bik, The Software Vectorization Handbook, Intel Press, 2004.
Non-Patent Document 2: Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets, Proc. of the SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000.

When the data length of the range targeted by SIMD optimization is short and is not divisible by the SIMD parallelism, SIMD conversion is not applied to the remainder. If there is in fact a long data sequence to which SIMD conversion could be applied continuously, and the part selected for optimization (for example, the references in the innermost loop) is only a short subsequence of it, the SIMD application rate drops by the amount of the unconverted remainder and the gain in execution performance shrinks. For example, in the source program of FIG. 5(a), the access ranges of the k loop and the j loop match the declared extents of the third and second dimensions of array a, so the triple loop references all elements of a contiguously. In this case, applying SIMD conversion with parallelism 4 could reduce the number of store instructions to a to one quarter. However, with a transformation like that of FIG. 5(b), the unrolled k loop executes only one iteration and a data sequence of length 5 is handled by two instructions, so the number of store instructions is reduced only to two fifths, and the performance improvement is insufficient. Similarly, the source program of FIG. 7(a) references array a contiguously, so the store instructions could be reduced to one half, but the transformation of FIG. 7(b) reduces them only to two thirds.
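
The ratios quoted above can be checked with a short calculation. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, one store instruction is counted per full SIMD group and per leftover element, and the outer iteration counts (4 and 2) are made-up values chosen so that the whole sequence is a multiple of the parallelism.

```python
from fractions import Fraction


def store_ratio_inner_only(inner_length, parallelism):
    """Stores per element when only the short inner sequence is converted."""
    stores = inner_length // parallelism + inner_length % parallelism
    return Fraction(stores, inner_length)


def store_ratio_whole_sequence(inner_length, parallelism, outer_iterations):
    """Stores per element when the whole contiguous sequence is converted."""
    total = inner_length * outer_iterations
    stores = total // parallelism + total % parallelism
    return Fraction(stores, total)


print(store_ratio_inner_only(5, 4))         # 2/5  (FIG. 5(b) case)
print(store_ratio_inner_only(3, 2))         # 2/3  (FIG. 7(b) case)
print(store_ratio_whole_sequence(5, 4, 4))  # 1/4
print(store_ratio_whole_sequence(3, 2, 2))  # 1/2
```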

An object of the present invention is to provide a method that, when the data sequence targeted by a conventional SIMD conversion method is part of a long contiguous data sequence spanning an outer loop, generates SIMD code with no remainder by performing a SIMD conversion that takes the outer loop into account.

The present invention comprises: structure analysis means that analyzes a target loop to obtain the inner data length and the SIMD parallelism; unroll count calculation means that determines the unroll count of the outer loop enclosing the inner data portion; loop unrolling means that unrolls the loop according to the unroll count determined by the unroll count calculation means; and SIMD code conversion means that generates SIMD code for the unrolled loop.

The unroll count calculation means computes the remainder obtained when the inner data length is divided by the SIMD parallelism (called the fraction), and takes as the unroll count the value obtained by dividing a common multiple of the SIMD parallelism and the fraction by the fraction.

To obtain the inner data length, the constant loop length of the inner loop is analyzed for a loop targeted by the vectorization method, and the length of the inner data sequence formed by the contiguous references in the inner loop is analyzed for a loop targeted by the instruction combination method.

According to the present invention, when SIMD conversion is applied to a data sequence that consists of short data belonging to an inner loop and is contiguous across iterations of an outer loop, the remainder that arises when the inner short data length is not divisible by the SIMD parallelism is no longer left unconverted. Raising the application rate of SIMD instructions in this way improves the execution performance of the generated code.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 2 shows the configuration of a computer system on which the compiler according to this embodiment runs. The computer system comprises a CPU (Central Processing Unit) 201 as control means, a display device 202 as display means, a keyboard 203 as input means, a main storage device 204 as storage means, and an external storage device 205 as storage means. The keyboard 203 accepts a compiler start command from the user. Compiler completion messages and error messages are shown on the display device 202. The external storage device 205 stores a source program 209 and an object program 210. The main storage device 204 stores a compiler 206 together with intermediate code 207 and a loop table 208 needed during compilation. Compilation is carried out by the CPU 201 executing the compiler 206.

The compiler according to the present invention includes a SIMD optimization unit 211. Its characteristic processing units are an inner data length analysis unit 212, a SIMD parallelism analysis unit 213, a loop unroll count calculation unit 214, and a loop unrolling conversion unit 215 (the inner data length analysis unit 212 and the SIMD parallelism analysis unit 213 are collectively called the structure analysis unit). The processing of the SIMD optimization unit is the characteristic part of the present invention and is described in detail later.

FIG. 3 shows the processing procedure of the compiler 206 running on the system of FIG. 2.

In step 301, the compiler 206 takes the source program 209 as input, performs syntax analysis, and outputs the intermediate code 207. The subsequent compilation steps transform this intermediate code. Syntax analysis methods and examples of intermediate code are described in, for example, Aho, Sethi, and Ullman, "Compilers: Principles, Techniques, and Tools", Addison-Wesley, 1986.

Next, in the loop analysis of step 302, the compiler 206 finds the set of loops contained in the program and records them in the loop table 208. Loop analysis methods are described in, for example, Michael Wolfe, "High Performance Compilers for Parallel Computing", Addison-Wesley Publishing Company, 1996, p. 67.

Next, in step 303, the SIMD optimization unit 211 of the compiler 206 performs SIMD optimization. The processing procedure of this step is the characteristic part of the present invention and is described later.

Next, the compiler 206 performs register allocation in step 304 and then code generation in step 305, converting the intermediate code 207 into the object program 210 and outputting it. Register allocation and code generation methods are described in the reference by Aho et al. cited above.

The loop table 208 generated by the loop analysis of step 302 is now described. FIG. 8 shows an example of the loop table 208. The loop table 208 stores a loop number 801, a parent loop 802, a child loop 803, a control variable 804, an initial value 805, a final value 806, an increment 807, and a loop length 808.

The loop number 801 identifies the loop within the compiler. The parent loop 802 is the loop number of the innermost of the outer loops enclosing the loop in question; if the loop is the outermost loop, "none" is registered. The child loop 803 is the loop number of the outermost of the loops nested inside the loop in question; if the loop is the innermost loop, "none" is registered. The control variable 804 is the loop control variable that governs the iteration of the loop. The initial value 805, final value 806, and increment 807 are, respectively, the initial value, the final value, and the per-iteration increment of the control variable 804. The loop length 808 is the number of iterations of the loop, and is computed from the initial value, final value, and increment of the loop control variable.
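
One possible in-memory form of such a loop table is sketched below in Python. It is an illustration only: the class name `LoopEntry`, the use of `None` for "none", and the assumption that the final value is inclusive with a positive increment are not taken from the source. Only the k loop length of 5 comes from the text; the ranges of the other two loops are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopEntry:
    number: int                # loop number 801
    parent: Optional[int]      # parent loop 802 (None stands for "none")
    child: Optional[int]       # child loop 803 (None stands for "none")
    control_variable: str      # control variable 804
    initial: int               # initial value 805
    final: int                 # final value 806, assumed inclusive here
    increment: int             # increment 807, assumed positive here

    @property
    def length(self) -> int:
        """Loop length 808: the number of iterations."""
        if self.final < self.initial:
            return 0
        return (self.final - self.initial) // self.increment + 1


# Hypothetical table for a triple loop nest numbered from the outside in.
loop_table = {
    1: LoopEntry(1, None, 2, "i", 0, 9, 1),
    2: LoopEntry(2, 1, 3, "j", 0, 3, 1),
    3: LoopEntry(3, 2, None, "k", 0, 4, 1),
}

print(loop_table[3].length)         # 5
print(loop_table[3].child is None)  # True: loop 3 is an innermost loop
```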

The details of SIMD optimization 303, the characteristic part of the present invention, are described below. FIG. 1 shows the detailed processing procedure of SIMD optimization 303.

SIMD optimization in this embodiment is performed with the loop as the unit of processing. In step 101, the compiler 206 determines whether any unprocessed loop remains. If so, it takes one unprocessed loop as the processing target and proceeds to step 102. If no unprocessed loop remains, the processing ends.

In step 102, the compiler 206 determines whether the loop is a target of SIMD optimization. If not, the processing returns to step 101 and moves on to the next loop.

The details of step 102 are described with reference to FIG. 9. This embodiment covers the two methods described in the background section: the vectorization method and the instruction combination method.

In this embodiment, innermost loops are the processing targets. Step 901 determines whether the loop is an innermost loop. This can be determined by consulting the child loop column 803 of the loop table 208 and checking that the loop has no child loop. If it is an innermost loop, the processing proceeds to step 902; otherwise it proceeds to step 909, determines that the loop is not a SIMD optimization target, and ends.

Steps 902, 903, and 904 check the conditions of the vectorization method.

Step 902 determines, from the increment column 807 of the loop table, whether the increment of the control variable is 1.

Step 903 determines whether there is a loop-carried dependence.

Step 904 examines the code in the loop body and determines whether it satisfies the following conditions. First, the types of all operations and operands must match. The SIMD parallelism, which indicates the number of data items a SIMD operation handles simultaneously, is normally defined per type by the target architecture, so for simplicity this embodiment covers only the case where all types are identical. Next, it determines whether every operation in the loop has a corresponding SIMD instruction. For example, if the only SIMD arithmetic instructions the target architecture provides are addition and multiplication, a loop containing division is excluded from SIMD optimization. Next, if an operand is an array reference, it checks whether the reference is a stride-1 contiguous reference. Finally, operands other than array references must be loop-invariant variables.

If the conditions of steps 902, 903, and 904 are all satisfied, the processing proceeds to step 905, determines that the loop is a SIMD optimization target under the vectorization method, and ends.

If a condition in steps 902, 903, or 904 is not satisfied, the processing proceeds to step 906 and checks the conditions for SIMD optimization by the instruction combination method. The instruction combination method normally searches the loop body for sets of isomorphic expressions whose memory references, such as array references, are contiguous, and makes them targets of SIMD combination. For example, 'a[i]+b[i+1]+v0' and 'a[i+1]+b[i+2]+v1' can be combined. For simplicity, this embodiment covers only the case where all statements are isomorphic. Step 906 therefore examines the code in the loop body and checks whether all statements are isomorphic. If they are, it checks the following conditions. The requirements that the types of all operations and operands match, that every operation has a SIMD instruction, and that operands other than array references are loop-invariant are the same as in step 904. For array references, it checks whether they are stride-1 between adjacent statements; for example, the condition holds when the first statement references a[i], the second a[i+1], and the third a[i+2]. In addition, there must be no dependence between adjacent statements.
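
The isomorphism and adjacency test between two statements can be sketched as follows. This Python fragment is a simplified illustration, not the patent's procedure: the tuple encoding of expressions and the name `can_combine` are hypothetical, and the type, SIMD-instruction, loop-invariance, and dependence checks mentioned above are left out.

```python
def can_combine(stmt_a, stmt_b):
    """Return True if stmt_b has the same shape as stmt_a and every array
    reference in stmt_b is exactly one element after the matching reference
    in stmt_a.

    A statement is a list of terms: ("array", name, offset) for a reference
    name[i + offset], or ("scalar", name) for a loop-invariant variable.
    """
    if len(stmt_a) != len(stmt_b):
        return False
    for term_a, term_b in zip(stmt_a, stmt_b):
        if term_a[0] != term_b[0]:
            return False
        if term_a[0] == "array":
            _, name_a, offset_a = term_a
            _, name_b, offset_b = term_b
            if name_a != name_b or offset_b != offset_a + 1:
                return False
    return True


# a[i] + b[i+1] + v0  and  a[i+1] + b[i+2] + v1
first = [("array", "a", 0), ("array", "b", 1), ("scalar", "v0")]
second = [("array", "a", 1), ("array", "b", 2), ("scalar", "v1")]
print(can_combine(first, second))  # True
```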

If the conditions of step 906 are satisfied, the processing proceeds to step 907, determines that the loop is a SIMD optimization target under the instruction combination method, and ends. Otherwise, it proceeds to step 908, determines that the loop is not a SIMD optimization target, and ends.

Note that the SIMD optimization target determination of step 102 belongs to the prior art, and the determination procedure of FIG. 9 does not limit the scope of the present invention. For example, besides stride-1 array references, regions that are contiguous in memory, such as adjacent members of a structure, can also be targets of SIMD combination. Also, when not all operations have SIMD instructions, the code can be split into a part to which SIMD conversion is applied and a part to which it is not. SIMD optimization may thus be applicable under conditions other than those of FIG. 9.

This concludes the description of FIG. 9 and step 102.

In step 103, the SIMD parallelism analysis unit 213 of the compiler determines the SIMD parallelism. The target architecture fixes the SIMD parallelism for each data type. For example, if the data is a 4-byte integer type and the SIMD registers are 16 bytes, the parallelism is 4. The unit examines the type of the operations in the loop and obtains the parallelism for that type.
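
As a rough illustration of this lookup, the Python sketch below derives the parallelism from the register size and the element size. The table `ELEMENT_BYTES` and the function name are hypothetical, the sizes are typical values rather than guaranteed ones, and a real target defines the parallelism per type itself, so the division is only a simplification.

```python
# Assumed element sizes in bytes; actual sizes depend on the platform.
ELEMENT_BYTES = {"int": 4, "float": 4, "double": 8}


def simd_parallelism(type_name, register_bytes=16):
    """Number of elements of the given type that fit in one SIMD register."""
    return register_bytes // ELEMENT_BYTES[type_name]


print(simd_parallelism("int"))     # 4
print(simd_parallelism("double"))  # 2
```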

In step 104, the inner data length analysis unit 212 of the compiler analyzes the inner data portion and determines the inner data length N and the outer loop to be unrolled. The procedure of this step is described in detail with reference to FIG. 4.

Step 401 sets the initial value of the inner data length N to 1 and leaves the variable UNROLLLOOP, which is used to find the outer loop to be unrolled, unset.

Step 402 determines whether the loop currently being processed is vectorizable. A loop is vectorizable if, in the determination of step 102 (FIG. 9), step 905 determined it to be a SIMD optimization target under the vectorization method. If it is vectorizable, the processing proceeds to step 403 and performs the processing for SIMD conversion by the vectorization method. Otherwise, it proceeds to step 410 and performs the processing for SIMD conversion by the instruction combination method.

Step 403 handles a loop targeted by the vectorization method: it sets OUTERLOOP to the innermost loop currently being processed.

Steps 404, 405, 406, 407, 408, and 409 walk up through the parent loops and extend the outer loop to be unrolled outward.

Step 404 determines whether OUTERLOOP has a parent loop, using the parent loop column 802 of the loop table 208. If there is a parent loop, the processing proceeds to step 405. If not, the outward walk ends and the processing proceeds to step 412.

Step 405 sets LOOP to OUTERLOOP and sets OUTERLOOP to the parent loop of the current OUTERLOOP.

Step 406 examines the loop length of LOOP, which can be obtained from the loop length column 808 of the loop table 208. If the loop length is a constant, its value is assigned to N1 and the processing proceeds to step 407. Otherwise, it proceeds to step 412.

Step 407 computes the product of N and N1 and determines whether it is at most a fixed threshold. The threshold should be an upper limit that is reasonable as a short data length for the inner data portion, for example a value several times the maximum SIMD parallelism of the target architecture. If the product is at most the threshold, the processing proceeds to step 408; otherwise it proceeds to step 412.

Step 408 determines whether each array reference targeted for SIMD conversion is contiguous across iterations of OUTERLOOP. This is decided by checking whether the last reference in the i-th iteration of OUTERLOOP and the first reference in the (i+1)-th iteration are contiguous. For example, in the source program of FIG. 5(a) with the j loop as OUTERLOOP, comparing a[i][j][4] with a[i][j+1][0] shows that they are contiguous, because the lowest dimension of a has elements 0 through 4. If the references are contiguous, the processing proceeds to step 409; otherwise it proceeds to step 412.
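
The contiguity test can be illustrated by comparing row-major element offsets, which is how C lays out multi-dimensional arrays. The Python sketch below is an assumption-laden illustration: the helper names are hypothetical, and of the extents (10, 4, 5) only the last one, 5, is taken from the text.

```python
def linear_offset(index, extents):
    """Row-major (C order) element offset of a multi-dimensional index."""
    offset = 0
    for idx, extent in zip(index, extents):
        offset = offset * extent + idx
    return offset


def contiguous_across_iterations(last_ref, next_first_ref, extents):
    """True if next_first_ref is the element right after last_ref in memory."""
    return (linear_offset(next_first_ref, extents)
            == linear_offset(last_ref, extents) + 1)


extents = (10, 4, 5)  # hypothetical declaration a[10][4][5]
i, j = 2, 1
# a[i][j][4] followed by a[i][j+1][0]
print(contiguous_across_iterations((i, j, 4), (i, j + 1, 0), extents))  # True
# If the k loop stopped at 3, a[i][j][3] and a[i][j+1][0] would leave a gap.
print(contiguous_across_iterations((i, j, 3), (i, j + 1, 0), extents))  # False
```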

Step 409 moves the loop to be unrolled one level outward and goes on to process the next outer loop. That is, it updates N to N*N1 and sets UNROLLLOOP to OUTERLOOP. The processing then returns to step 404 and handles the updated OUTERLOOP.

Step 410 handles a loop targeted for SIMD conversion by the instruction combination method. It analyzes the sequence of contiguous references in the loop being processed and assigns its data length to N1.

Step 411 sets OUTERLOOP to the innermost loop currently being processed. The processing then proceeds to step 407 and continues in the same way as for a loop targeted by the vectorization method.

Step 412 determines whether UNROLLLOOP is unset.

If it is unset, the processing proceeds to step 413, sets the determination result to "no inner data portion", and ends.

If UNROLLLOOP is set, the processing proceeds to step 414, takes N as the inner data length and UNROLLLOOP as the outer loop to be unrolled, and ends.

This concludes the description of FIG. 4 and the procedure of step 104.
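
The control flow of steps 401 to 414 can be summarized in a short Python sketch. It is one possible reading of the description, not the patent's implementation: the function name, the dictionary layout of the loop table, the `is_contiguous` callback standing in for step 408, and the threshold of 16 are all assumptions, and the example loop lengths other than 5 are made up.

```python
def analyze_inner_data(loops, innermost, vectorizable, limit,
                       is_contiguous, reference_run=None):
    """Return (N, UNROLLLOOP), or None for "no inner data portion".

    loops maps a loop number to {"parent": number or None,
    "length": constant loop length or None if not constant}.
    reference_run is the length found in step 410 and is needed only
    when the loop is not vectorizable.
    """
    n = 1                                   # step 401
    unroll_loop = None                      # step 401: UNROLLLOOP unset
    outer = innermost                       # step 403 or step 411
    n1 = None if vectorizable else reference_run  # step 410
    while True:
        if n1 is None:
            parent = loops[outer]["parent"]  # step 404
            if parent is None:
                break
            loop, outer = outer, parent      # step 405
            n1 = loops[loop]["length"]       # step 406
            if n1 is None:                   # loop length is not a constant
                break
        if n * n1 > limit:                   # step 407
            break
        if not is_contiguous(outer):         # step 408
            break
        n, unroll_loop = n * n1, outer       # step 409
        n1 = None
    if unroll_loop is None:                  # steps 412 and 413
        return None
    return n, unroll_loop                    # step 414


# Hypothetical triple nest: loop 3 (length 5) inside loop 2 inside loop 1.
loops = {
    1: {"parent": None, "length": 10},
    2: {"parent": 1, "length": 4},
    3: {"parent": 2, "length": 5},
}
result = analyze_inner_data(loops, innermost=3, vectorizable=True,
                            limit=16, is_contiguous=lambda loop: True)
print(result)  # (5, 2)
```

With these made-up values the walk stops at loop 2, because including its length would give 5 * 4 = 20, which exceeds the assumed threshold of 16.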

In step 105, the loop unroll count calculation unit 214 of the compiler checks, from the result of step 104, whether there is an inner data portion, and if so proceeds to step 106. If there is no inner data portion, the present invention does not apply, so the processing proceeds to step 112, performs ordinary SIMD optimization according to the prior art, finishes the processing of this loop, and returns to step 101.

In step 106, the loop unroll count calculation unit 214 of the compiler determines whether the inner data length N obtained in step 104 is divisible by the SIMD parallelism. If it is, the processing proceeds to step 112, performs ordinary SIMD optimization, and finishes the processing of this loop. If not, it proceeds to step 107.

In step 107, the loop unroll count calculation unit 214 of the compiler checks whether the inner data portion contains a loop. Since the inner data portion lies inside the outer loop to be unrolled, this is determined by using the child loop 803 of the loop table 208 in FIG. 8 to check whether the outer loop to be unrolled has a child loop. If there is an inner loop, the processing proceeds to step 108 and eliminates the inner loop. If the inner loop is a nested loop (that is, the child loop of the outer loop to be unrolled itself has a child loop), all loops in the nest are eliminated in order from the innermost. If the inner data portion contains no loop, the processing proceeds to step 109.

In step 109, the loop unroll count calculation unit 214 of the compiler computes the unroll count of the outer loop to be unrolled. First, the inner data length N is divided by the SIMD parallelism and the remainder is taken as the "fraction". Next, a common multiple of the SIMD parallelism and the fraction is found, and the quotient of that value divided by the fraction is taken as the unroll count. Using the least common multiple minimizes the unroll count. A simpler alternative is to use the product of the SIMD parallelism and the fraction as the common multiple.
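
The calculation of step 109 can be written out as a small Python sketch. The function name and the `use_lcm` switch are hypothetical; the arithmetic follows the description above, with the least common multiple computed from the greatest common divisor.

```python
from math import gcd


def unroll_count(inner_length, parallelism, use_lcm=True):
    """Unroll count of the outer loop for inner data length N."""
    fraction = inner_length % parallelism
    if fraction == 0:
        # Step 106 routes this case to ordinary SIMD optimization instead.
        raise ValueError("inner data length is divisible by the parallelism")
    if use_lcm:
        common_multiple = parallelism * fraction // gcd(parallelism, fraction)
    else:
        common_multiple = parallelism * fraction  # simpler alternative
    return common_multiple // fraction


print(unroll_count(5, 4))                 # 4
print(unroll_count(6, 4))                 # 2
print(unroll_count(6, 4, use_lcm=False))  # 4
```

For N = 5 and parallelism 4, the fraction is 1 and the unroll count is 4, so the unrolled body covers 20 contiguous elements, a multiple of 4. In general the unroll count times N is a multiple of the parallelism, so no remainder is left.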

In step 110, the compiler's loop unrolling conversion unit 215 unrolls the outer loop to be unrolled according to the unroll count obtained in step 109.

In step 111, the compiler 206 converts the code in the innermost loop after unrolling into SIMD code. This completes the processing of one loop; processing returns to step 101 and moves on to the next loop.

This concludes the detailed description of the processing procedure of FIG. 1 and step 303.

Next, the processing procedure when this embodiment is applied to the source program of FIG. 5(a) (written in C) is described.

The loops are numbered 1, 2, and 3 from the outermost inward. The information for each loop is as shown in FIG. 8.

In the determination of the SIMD optimization target loop in step 102, loops 1 and 2 are not innermost loops, so they do not satisfy the condition. The processing of loop 3 (the k loop) is as follows. Step 902 determines that the increment of the control variable is 1, and because this loop has no dependences, step 903 determines that there is no loop-carried dependence. Step 904 examines the code in the loop. The statement in the loop body contains no operations; it consists only of an array reference and the loop-invariant variable v2. Both are of type int, so the types match. The array reference a[i][j][k] is a contiguous stride-1 reference, because the subscript k of the lowest dimension matches the loop control variable. Since the conditions are satisfied, step 905 determines that the loop is a SIMD optimization target. In step 103, assuming for example that the SIMD register size of the target architecture is 16 bytes, the size of int is 4 bytes, and the parallelism of SIMD instructions for int is 4, the SIMD parallelism is set to 4.

In step 104, the inner data portion is analyzed. In step 402, the k loop being processed is vectorizable, so processing proceeds to step 403, where OUTERLOOP is set to the k loop. After the determination in step 404, step 405 sets LOOP to the k loop and OUTERLOOP to the j loop. Step 406 determines that the loop length of the k loop is the constant 5, and step 407 determines whether the value of N*N1, which is 5, is at or below a threshold. In this embodiment, assuming for example a threshold of 32, the value is judged to be at or below the threshold and processing proceeds to step 408. Step 408 checks whether the array reference a[i][j][k] is contiguous across iterations of OUTERLOOP (the j loop). The last reference in the first iteration of the j loop is a[i][0][4], and the first reference in the second iteration of the j loop is a[i][1][0]. Checked against the declaration of array a, these are adjacent elements, so the references are judged to be contiguous. In step 409, N is set to 5 and UNROLLLOOP is set to the j loop.

Returning to step 404, the parent loop of OUTERLOOP is examined, and step 405 updates LOOP to the j loop and OUTERLOOP to the i loop. The loop length of the j loop is 3, and its product with N is 15, which is at most 32, so processing proceeds to step 408. The last reference in the first iteration of the i loop is a[0][2][4], and the first reference in the second iteration of the i loop is a[1][0][0]. Checked against the declaration of array a, these are adjacent elements, so the references are judged to be contiguous. Step 409 updates N to 15 and UNROLLLOOP to the i loop, and processing returns to step 404. The i loop has no parent loop, so processing proceeds to step 412, and because UNROLLLOOP has been set, processing proceeds to step 414. Step 104 ends with an inner data length of 15 and the i loop as the outer loop to be unrolled.
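The outward walk of steps 404 to 409 can be sketched in C as follows. This is a simplified illustration covering only the vectorizable path, not the patent's implementation: the `loop_info` record, the function `inner_data_length`, and the use of -1 for "not constant" and "no loop chosen" are all assumptions made for this sketch.

```c
#include <stddef.h>

/* Hypothetical per-loop record, stored innermost loop first. */
struct loop_info {
    int trip_count;          /* constant loop length, or -1 if not constant */
    int contiguous_outward;  /* nonzero if references stay contiguous across
                                iterations of the enclosing loop */
};

/* Walks outward from the innermost loop and accumulates the inner data
 * length N. On return, *unroll_loop holds the index of the outer loop to
 * unroll, or -1 if no inner data portion was found. */
static int inner_data_length(const struct loop_info *loops, size_t depth,
                             int limit, int *unroll_loop) {
    int n = 1;
    *unroll_loop = -1;
    /* loops[level] plays the role of LOOP, loops[level + 1] of OUTERLOOP. */
    for (size_t level = 0; level + 1 < depth; level++) {
        int len = loops[level].trip_count;
        if (len < 0)                            /* step 406: not constant */
            break;
        if (n * len > limit)                    /* step 407: too long */
            break;
        if (!loops[level].contiguous_outward)   /* step 408: gap in memory */
            break;
        n *= len;                               /* step 409: extend N */
        *unroll_loop = (int)(level + 1);
    }
    return n;
}
```

With the k loop (length 5) and the j loop (length 3) both contiguous across their enclosing loops and a threshold of 32, the walk yields N = 15 and selects index 2, the i loop, which matches the trace above.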

Step 105 determines that an inner data portion exists, and step 106 determines whether the inner data length 15 is divisible by the SIMD parallelism 4. It is not, so processing proceeds to step 107. The inner data portion contains the j loop and the k loop, so step 108 eliminates the inner loops: first the innermost k loop is eliminated, and then the j loop. FIG. 6(a) shows an example of the code after elimination.

In step 109, the fraction is 3, the remainder of 15 divided by 4. The unroll count is 4, obtained by dividing 12, the least common multiple of the SIMD parallelism 4 and the fraction 3, by the fraction 3. In step 110, the i loop is unrolled by a factor of 4. FIG. 6(b) shows the code after unrolling.

In step 111, the code is converted to SIMD code. Here the assignment statements to array a are grouped four at a time from the top, and each group is converted into a single statement. FIG. 6(c) shows an example of the converted code. In this code, a[i][0][0:3] denotes an assignment to four elements of a, and (v2,v2,v2,v2) denotes a value consisting of four copies of the variable v2. (a[i][0][4],a[i][1][0:2]) denotes a combined assignment to four elements in total: a[i][0][4] and the three elements that follow it contiguously, a[i][1][0] through a[i][1][2]. The remaining statements follow the same pattern.
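The effect of this transformation can be mimicked in portable C. The sketch below is an illustration under stated assumptions, not a reproduction of FIG. 6: the outer extent `NI = 8` is hypothetical (the text fixes only the inner extents 3 and 5), `store4` is a hypothetical stand-in for one 4-wide SIMD store, and the 15 grouped stores are written as a loop over `g` rather than as straight-line statements.

```c
#include <string.h>

enum {
    NI = 8,        /* hypothetical outer extent, a multiple of UNROLL */
    NJ = 3,
    NK = 5,
    WIDTH = 4,     /* SIMD parallelism for int */
    UNROLL = 4,    /* unroll count of the i loop from step 109 */
    GROUPS = UNROLL * NJ * NK / WIDTH   /* 60 / 4 = 15 stores per trip */
};

static int a[NI][NJ][NK];

/* Stand-in for one 4-wide SIMD store of (v,v,v,v), starting at flat
 * element offset off within a. */
static void store4(size_t off, int v) {
    int vec[WIDTH] = { v, v, v, v };
    memcpy((char *)a + off * sizeof(int), vec, sizeof vec);
}

/* Fills every element of a with v2 using only 4-wide stores. Each trip of
 * the unrolled i loop covers UNROLL * 15 = 60 contiguous elements, which
 * splits into exactly 15 groups of 4 with nothing left over. */
static void fill_simd_style(int v2) {
    for (int i = 0; i < NI; i += UNROLL) {
        size_t base = (size_t)i * NJ * NK;
        for (size_t g = 0; g < (size_t)GROUPS; g++)
            store4(base + g * WIDTH, v2);
    }
}
```

Without unrolling, each i iteration would cover 15 elements and leave a remainder of 3 that cannot fill a 4-wide store; unrolling by 4 is what removes that remainder.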

This concludes the description of the conversion example for the source code of FIG. 5(a).

Next, the processing procedure when this embodiment is applied to the source program of FIG. 7(a) (written in C) is described.

The i loop is assigned loop number 4. The information for this loop is as shown in FIG. 8.

In the determination of the SIMD optimization target loop in step 102, step 902 determines that the increment of the control variable is 1, and because this loop has no dependences, step 903 determines that there is no loop-carried dependence. In the determination of step 904, the array reference a[i][0] is not stride 1, so the condition is not satisfied and processing proceeds to step 906.

Step 906 compares the three statements in the loop. Each statement has an array reference on the left-hand side and a variable on the right-hand side, so the statements are judged to have the same form. All types are double, so the types match. Looking at the array references, the references of the three statements are contiguous. From these results, the condition of step 906 is judged to be satisfied, and step 907 marks the loop as a target of SIMD optimization by the instruction combination method.

In step 103, assuming that the parallelism of SIMD instructions for double is 2, the SIMD parallelism is set to 2.

In step 104, the inner data portion is analyzed. In step 402, the i loop being processed is not vectorizable, so processing proceeds to step 410, which examines the sequence of contiguous references in the loop being processed. The reference sequence of this loop is a[i][0], a[i][1], a[i][2], so the data length N1 is set to 3. In step 411, OUTERLOOP is set to the i loop.

Processing proceeds to step 407. N*N1 is 3, so it is judged to be at or below the threshold.

Step 408 examines contiguity across iterations of the i loop. The last element in the reference sequence of array a in the first iteration of the i loop is a[0][2]. The first element in the second iteration is a[1][0]. Checked against the declaration of array a, these are adjacent elements, so the references are judged to be contiguous.
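One way to picture the adjacency test of step 408 is through row-major linear offsets. The following is a minimal sketch under the assumption that the check reduces to comparing flattened indices; the function names are hypothetical and the patent does not specify how the comparison is implemented.

```c
/* Row-major linear offset of a[i][j] for an array whose inner extent
 * (number of elements per row) is `inner`. */
static long linear_offset(long i, long j, long inner) {
    return i * inner + j;
}

/* Nonzero if the last reference of one outer iteration, a[i][last_j], is
 * immediately followed in memory by the first reference of the next
 * iteration, a[i+1][first_j]. */
static int contiguous_across_iterations(long first_j, long last_j,
                                        long inner) {
    return linear_offset(1, first_j, inner)
         - linear_offset(0, last_j, inner) == 1;
}
```

With an inner extent of 3, a[0][2] has offset 2 and a[1][0] has offset 3, so the references are adjacent. If the array were declared with an inner extent of 4 while the loop still touched only columns 0 to 2, the offsets would be 2 and 4, and the test would fail.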

In step 409, N is set to 3 and UNROLLLOOP is set to the i loop.

Processing returns to step 404. The i loop has no parent loop, so processing proceeds to step 412, and because UNROLLLOOP has been set, processing proceeds to step 414. Step 104 ends with an inner data length of 3 and the i loop as the outer loop to be unrolled.

Step 105 determines that an inner data portion exists, and step 106 determines whether the inner data length 3 is divisible by the SIMD parallelism 2. It is not, so processing proceeds to step 107. The inner data portion contains no loop, so processing proceeds to step 109.

In step 109, the fraction is 1, the remainder of 3 divided by 2. The unroll count is 2, obtained by dividing 2, the least common multiple of the SIMD parallelism 2 and the fraction 1, by the fraction 1. In step 110, the i loop is unrolled by a factor of 2. FIG. 7(b) shows the code after unrolling.

In step 111, the code is converted to SIMD code. Here the assignment statements to array a are grouped two at a time from the top, and each group is converted into a single statement. FIG. 7(c) shows an example of the converted code.
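The pairing can be illustrated in portable C. This sketch is not a reproduction of FIG. 7: the row count `ROWS = 8` and the right-hand-side variable names `x`, `y`, `z` are hypothetical, and `store2` is a hypothetical stand-in for one 2-wide SIMD store of double values.

```c
#include <string.h>

enum { ROWS = 8, COLS = 3 };   /* ROWS is hypothetical and must be even */

static double a[ROWS][COLS];

/* Stand-in for one 2-wide SIMD store of (p, q), starting at flat element
 * offset off within a. */
static void store2(size_t off, double p, double q) {
    double vec[2] = { p, q };
    memcpy((char *)a + off * sizeof(double), vec, sizeof vec);
}

/* The i loop unrolled by 2: six contiguous elements per trip, written as
 * three pairs. The middle pair straddles rows i and i+1. */
static void fill_pairs(double x, double y, double z) {
    for (int i = 0; i < ROWS; i += 2) {
        size_t base = (size_t)i * COLS;
        store2(base + 0, x, y);   /* a[i][0],   a[i][1]   */
        store2(base + 2, z, x);   /* a[i][2],   a[i+1][0] */
        store2(base + 4, y, z);   /* a[i+1][1], a[i+1][2] */
    }
}
```

The middle store is the one that only becomes possible after unrolling, since it combines the last element of one row with the first element of the next.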

This concludes the description of the conversion example for the source code of FIG. 7(a).

The code generation method of the present invention can be used in a compiler that generates code for a target machine having SIMD instructions that perform memory access and operations on a plurality of data.

FIG. 1 is a diagram showing the SIMD optimization procedure according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a computer system on which a compiler according to an embodiment of the present invention runs.
FIG. 3 shows an example of the processing procedure of the compiler according to an embodiment of the present invention.
FIG. 4 is a diagram showing the procedure for analyzing the inner data portion and the data length in SIMD optimization according to an embodiment of the present invention.
FIG. 5 shows an example of a source program and an example of conversion by SIMD optimization according to the prior art.
FIG. 6 shows an example of the code conversion process according to the present invention.
FIG. 7 shows an example of a source program and examples of code conversion according to the prior art and the present invention.
FIG. 8 is an example of the loop table of this embodiment.
FIG. 9 is a diagram showing the procedure for determining a SIMD optimization target loop in SIMD optimization according to an embodiment of the present invention.

Explanation of symbols

204: main storage device, 206: compiler, 207: intermediate code, 208: loop table, 209: source program, 211: SIMD optimization unit, 212: inner data length analysis unit, 213: SIMD parallelism analysis unit, 214: loop unroll count calculation unit, 215: loop unrolling conversion unit

Claims

A computer code generation method for generating a code using a SIMD instruction for performing parallel operation on a plurality of data from a source program or intermediate code,
Analyzing the loop of the intermediate code, the inner data length, which is the length of the inner data section including the SIMD instruction application target data string with a constant length less than a certain value, and how many pieces of data are parallelized by the SIMD instruction A structural analysis step for obtaining a SIMD parallelism indicating whether to process,
A fraction which is a remainder when the inner data length obtained in the structural analysis step is divided by the SIMD parallelism is calculated, and a value obtained by dividing the common multiple of the SIMD parallelism and the fraction by the fraction is obtained. An expansion number calculation step for obtaining the number of loop expansions of the outer loop surrounding the inner data portion;
A loop expansion step of expanding the loop according to the number of loop expansions obtained in the expansion number calculation step;
A code generation method characterized in that a computer executes.

The code generation method according to claim 1,
wherein the common multiple is the least common multiple of the SIMD parallelism and the fraction.

The code generation method according to claim 1,
wherein the common multiple is a value obtained by multiplying the SIMD parallelism by the fraction.

The code generation method according to claim 1,
wherein the computer further executes a step of determining, with reference to a loop table that describes information on the outer loop and the inner loop of each loop of the intermediate code, whether to perform the structure analysis step, the expansion number calculation step, and the loop expansion step on the loop.

A compiler that causes a computer to generate code using a SIMD instruction that performs parallel operations on multiple data from a source program or intermediate code,
Analyze the loop of the code and process the inner data length, which is the length of the data string of the inner data section including the data string subject to SIMD instruction application with a constant length less than a certain value, and how many pieces of data are processed in parallel by the SIMD instruction A structural analysis step to obtain a SIMD parallelism representing whether to do;
A fraction that is a remainder when the inner data length obtained in the structural analysis step is divided by the SIMD parallelism is calculated, and a value obtained by dividing the common multiple of the SIMD parallelism and the fraction by the fraction is calculated. An expansion number calculation step for obtaining the number of loop expansions of the outer loop surrounding the inner data portion;
A loop expansion conversion step of expanding a loop according to the number of loop expansions determined in the expansion number calculation step;
A compiler characterized by causing a computer to execute.

The compiler of claim 5,
wherein the common multiple is the least common multiple of the SIMD parallelism and the fraction.

The compiler of claim 5,
wherein the common multiple is a value obtained by multiplying the SIMD parallelism by the fraction.

The compiler of claim 5,
wherein the compiler further causes the computer to execute a step of determining, with reference to a loop table that describes information on the outer loop and the inner loop of each loop of the intermediate code, whether to perform the structure analysis step, the expansion number calculation step, and the loop expansion conversion step on the loop.

A computer that generates code using SIMD instructions for performing parallel operations on multiple data from a source program or intermediate code,
Analyzing the loop of the intermediate code, the inner data length, which is the length of the inner data section including the SIMD instruction application target data string with a constant length less than a certain value, and how many pieces of data are parallelized by the SIMD instruction A structural analysis unit for obtaining SIMD parallelism indicating whether to process,
An expansion number calculation unit that calculates a fraction, which is the remainder when the inner data length obtained by the structural analysis unit is divided by the SIMD parallelism, and obtains, as the number of loop expansions of the outer loop surrounding the inner data portion, a value obtained by dividing a common multiple of the SIMD parallelism and the fraction by the fraction;
A loop expansion conversion unit that expands the loop according to the number of loop expansions obtained by the expansion number calculation unit;
A computer comprising: