JP2019128760A

JP2019128760A - Compiler program, compiling method, and information processing device for compiling

Info

Publication number: JP2019128760A
Application number: JP2018009576A
Authority: JP
Inventors: 優太向井; Yuta Mukai
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-01-24
Filing date: 2018-01-24
Publication date: 2019-08-01
Anticipated expiration: 2038-01-24
Also published as: JP6974722B2

Abstract

To provide a compiler program, a compiling method, and an information processing device for compiling that add a prefetch instruction to a loop that repeats a stride access and also vectorize the loop. SOLUTION: The information processing device for compiling adds, to the loop, a prefetch instruction that stores in a cache memory the data to be accessed by the stride access instruction a predetermined number of loop iterations later. Given the cache line size (C) of the cache memory, the element length between stride accesses (m), the size of one array element (Type_Size), and the prefetch address (x) of the prefetch instruction, it adds to the loop a conditional statement that executes the prefetch instruction when the remainder (x % C) of dividing the prefetch address (x) by the cache line size (C) is smaller than the address length between stride accesses (S), obtained by multiplying the element length between stride accesses (m) by the element size (Type_Size), that is, when (x % C) < S. It then converts the stride access instruction, the conditional statement, and the prefetch instruction in the loop into vector instructions. SELECTED DRAWING: Figure 11

Description

The present invention relates to a compiler program, a compiling method, and an information processing apparatus that performs compilation.

In an information processing apparatus, main memory is far slower than the processor. When the processor accesses data in main memory, processing stalls until the access completes, which lowers processor utilization. To avoid this, the processor incorporates a cache memory that can be accessed at high speed and holds part of the data in main memory, thereby shortening the effective access time to main memory.

One form of data access in a loop of a source program is stride access, in which the elements of an array are accessed not consecutively but at intervals of a predetermined number of elements. When memory accesses to array elements are repeated in a loop in this way, processor efficiency can be improved by prefetching: the future memory access is predicted, and main memory is accessed a predetermined number of iterations ahead of that access so that the data is already registered in the cache memory.

Accordingly, a compiler that converts a source program into assembly code or object code adds, as one of its optimizations, a prefetch instruction to a loop that repeatedly accesses the elements of an array in the source program.

Another known compiler optimization is vectorization of the source code of a loop in which a given instruction is repeated. Vectorization converts the repeated execution of an instruction in a loop into a vector instruction, or SIMD (Single Instruction Multiple Data) instruction, that executes the same instruction on multiple data items in parallel. By converting an instruction repeated in a loop of the source program into a SIMD instruction, the compiler lets the program exploit the efficient processing of a processor that has SIMD arithmetic units.

The following patent documents relate to prefetching for stride access and to optimization by adding prefetches.

Japanese Patent Laid-Open No. H06-274525
Japanese Patent Laid-Open No. H01-134670
Japanese National Publication (translation of PCT application) No. 2014-513340
Japanese National Publication (translation of PCT application) No. H04-505225
Japanese Patent Laid-Open No. 2008-71128
Japanese Patent Laid-Open No. 2015-153122

However, when the stride length S of the stride access (the address interval between adjacent accesses) is smaller than the cache line size C of the cache memory, consecutive prefetch instructions may access main memory more than once for data in the same cache line. Such duplicate accesses are redundant. It is therefore desirable that a prefetch instruction added by compiler optimization not be executed when the access would be redundant, which requires the compiler to add code that determines whether an access is redundant. Adding such a check, however, can make vectorization impossible.

An object of the present invention is therefore to provide a compiler program, a compiling method, and an information processing apparatus for compiling that can achieve both the addition of prefetch instructions and vectorization.

One aspect of the embodiment is a compiler program that causes a computer to execute a process comprising:
detecting, in a source program, a loop that repeats a plurality of times a stride access instruction that accesses a plurality of elements of an array in a memory at intervals of an inter-stride-access element length;
adding to the loop a prefetch instruction that accesses the memory and stores in a cache memory the data to be accessed by the stride access instruction a predetermined number of loop iterations after the current stride access instruction;
adding to the loop a conditional statement that, given the cache line size (C) of the cache memory, the inter-stride-access element length (m), the size of one element of the array (Type_Size), and the prefetch address (x) of the prefetch instruction, executes the prefetch instruction when the remainder (x % C) of dividing the prefetch address (x) by the cache line size (C) is smaller than the inter-stride-access address length (S) obtained by multiplying the inter-stride-access element length (m) by the element size (Type_Size), that is, when (x % C) < S; and
vectorizing the loop by converting the stride access instruction, the conditional statement, and the prefetch instruction in the loop into vector instructions, each of which executes a plurality of the corresponding operations in parallel.

According to the first aspect, the addition of prefetch instructions and vectorization can both be achieved.

FIG. 1 is a diagram showing a configuration example of the information processing apparatus (computer) that performs compilation in the present embodiment.
FIG. 2 is a flowchart showing an example of the processing of the compiler of the present embodiment.
FIG. 3 is a diagram showing a first example of the prefetch instruction addition process.
FIG. 4 is a diagram showing the relationship between the addresses of stride accesses to array a and the cache lines.
FIG. 5 is a diagram showing a second example of the prefetch instruction addition process.
FIG. 6 is a diagram showing an example of vectorization of source code SC_1.
FIG. 7 is a diagram showing an example that cannot be vectorized.
FIG. 8 is a diagram showing an example in which source code SC_3, to which the prefetch instruction was added in FIG. 5, is vectorized with a vector length of 2.
FIG. 9 is a diagram explaining an example of the prefetch instruction added by the prefetch instruction addition process in the present embodiment.
FIG. 10 is a diagram showing an example of source code SC_8 generated by applying the prefetch instruction addition process S3 to source code SC_1 of FIG. 1.
FIG. 11 is a flowchart of the prefetch addition process in the present embodiment.
FIG. 12 is a diagram showing an example of vector processing in the present embodiment.

FIG. 1 is a diagram showing a configuration example of the information processing apparatus (computer) that performs compilation in the present embodiment. The information processing apparatus includes a CPU (Central Processing Unit; hereinafter, processor circuit or processor) 10 having a plurality of arithmetic core circuits CORE, an L2 cache L2_CACHE, and a memory controller MAC; a main memory 12 whose access is controlled by the memory controller MAC of the CPU; and storage 20 to 27 serving as an auxiliary storage device. The information processing apparatus further includes a network interface 14 connected to a network NET, and a bus 18. An L1 cache (not shown) is provided in each arithmetic core circuit CORE.

The storage 20 to 27 stores an OS (Operating System) 20, a compiler (program) 22, an assembler (a program that converts assembly code into object code) 23, a source program 24 to be compiled by the compiler, assembly code 26 converted from the source program, and object code 27 converted from the assembly code. The processor 10 executes the compiler 22 to perform compilation, which includes optimizing the source program 24, converting it into assembly code, and further converting it into object code. The processor 10 may also execute the compiled object code.

FIG. 2 is a flowchart showing an example of the processing of the compiler of the present embodiment. The processor executes the compiler program, performs lexical analysis of the source program, and then performs syntax analysis (S1). The processor then executes various optimization processes S2 to S6 and outputs optimized assembly code converted from the source code (S7). The processor may further convert the assembly code into object code.

The optimization processes S2 to S6 include a process S3 that adds a prefetch instruction to a loop that repeats an array access instruction, and a vectorization process S5 that converts an access instruction executed repeatedly in a loop of the source program into a vector instruction. In the vectorization process, the processor converts an array access instruction that is executed repeatedly in a loop into a vector instruction that executes a plurality of those accesses in parallel. When a prefetch instruction has been added to the array access instruction in the loop, the vectorization process S5 converts the prefetch instruction, in addition to the access instruction, into a vector instruction that executes a plurality of prefetches in parallel.

The other optimization processes are common compiler processing, and their description is omitted here.

[Prefetch instruction addition processing]
The process of adding a prefetch instruction to a memory access in a loop is described below using source code examples.

FIG. 3 is a diagram showing a first example of the prefetch instruction addition process. The program of source code SC_1 contains, in lines 11 to 13, a loop that repeatedly executes the instruction in line 12 while incrementing the variable i by 1 from i=0 to i=N-1. Line 12 contains an instruction (access instruction) that writes 0 to the element a[i*m] of array a. The statement a[i*m]=0 in line 12 is a stride access to array a at an interval of m elements. m may be a constant or a variable.

Source code SC_2 has, in the loop of source code SC_1, a prefetch instruction for the access a[i*m]=0 that will occur P iterations later: prefetch(&a[(P+i)*m]) in line 23. It is an example in which executing the prefetch instruction addition process S3 on source code SC_1 simply adds the prefetch instruction of line 23. Here, prefetch(&x) means prefetching the address of x (&x). Prefetching is the process of accessing the data at the given address in main memory and storing that data in the cache memory. The prefetch instruction prefetch(&a[(P+i)*m]) is therefore an instruction that reads the data of element a[(P+i)*m] of array a from main memory and stores it in the cache memory.

In source code SC_2, the prefetch instruction in line 23 is executed in every iteration of the loop. A cache memory handles data (access, storage, replacement) in units of fixed-size contiguous regions called lines, so prefetching the same cache line more than once is a redundant access; prefetching each cache line exactly once is best for performance.

Therefore, when a stride access with an interval of m is executed repeatedly as in source code SC_2, redundant prefetches occur if the absolute value abs(m) of m is smaller than the cache line size C.

FIG. 4 is a diagram showing the relationship between the addresses of stride accesses to array a and the cache lines. In the figure, the cache line size is 4 bytes, the interval is m = 3, the type size of array a (type_size: the element size) is 1 byte, and element a[P*m] is at the start of the first cache line. The cache lines for array a are the 4-byte regions (regions of four elements) delimited by solid lines. As the horizontal axis shows, addresses increase from left to right.

FIG. 4 shows, above array a, a plurality of stride accesses at an interval of m. For these stride accesses, the addresses of elements a[P*m], a[(P+1)*m], a[(P+2)*m], a[(P+3)*m], ..., a[(P+6)*m] are prefetched. Of these, the prefetches of elements a[(P+1)*m] and a[(P+5)*m] are redundant, because each prefetches the same cache line as the prefetch of the immediately preceding element, a[P*m] and a[(P+4)*m] respectively. One could therefore add a conditional statement together with the prefetch instruction so that such redundant prefetches are not executed.
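The FIG. 4 numbers can be checked with a small simulation. The following is a minimal sketch, not code from the patent: the function name count_redundant_prefetches is hypothetical, addresses are modeled as byte offsets from a[P*m], and that element is assumed to sit at the start of a cache line, as the figure states.

```c
#include <assert.h>
#include <stddef.h>

/* Count prefetches that target the same cache line as the immediately
 * preceding prefetch. Addresses are byte offsets from a base that is
 * assumed to sit at the start of a cache line (as in FIG. 4). */
size_t count_redundant_prefetches(size_t line_size, size_t stride_bytes,
                                  size_t iterations)
{
    assert(line_size > 0);
    size_t redundant = 0;
    for (size_t i = 1; i < iterations; i++) {
        size_t prev_line = ((i - 1) * stride_bytes) / line_size;
        size_t this_line = (i * stride_bytes) / line_size;
        if (this_line == prev_line)
            redundant++;   /* same line as the previous prefetch */
    }
    return redundant;
}
```

With a 4-byte line, a 3-byte stride, and seven iterations, the offsets 0, 3, 6, 9, 12, 15, 18 fall in lines 0, 0, 1, 2, 3, 3, 4, so the second and sixth prefetches repeat the line of their predecessor, which corresponds to a[(P+1)*m] and a[(P+5)*m] in the figure.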

FIG. 5 is a diagram showing a second example of the prefetch instruction addition process. Source code SC_1 is the same as in FIG. 3. Source code SC_3, generated by the prefetch instruction addition process S3, has a conditional statement (if statement) in lines 34 and 35, instructions that compute the prefetch address in lines 31 and 36, and a prefetch instruction in line 37.

As can be understood from FIG. 3, redundant prefetches can be avoided by comparing the start address last_prefetch of the most recently prefetched cache line (see FIG. 4) with the prefetch address &a[(P+i)*m] of the current iteration, and executing the prefetch instruction when &a[(P+i)*m] is smaller than last_prefetch or is larger than last_prefetch by at least the cache line size C. That is, the prefetch instruction is executed when condition 1 below is true.
&a[(P+i)*m] < last_prefetch, or last_prefetch+C <= &a[(P+i)*m]   (Condition 1)
Here, C=4.

The condition in parentheses in the if statement in lines 34 and 35 of FIG. 5 corresponds to condition 1 above, where || denotes logical OR. Line 31 sets the previous prefetch address last_prefetch to the initial value 0. Line 36 is an arithmetic instruction that divides the address &a[(P+i)*m] by the cache line size C, truncates the fractional part, multiplies the resulting integer by C, and assigns the result to last_prefetch as the new previous prefetch address (the start address of the cache line). Line 37 is a prefetch instruction for the address last_prefetch. This value of last_prefetch serves as the previous prefetch address in the next iteration of the loop.

With source code SC_3, in the first iteration of the loop (i=0) the processor updates last_prefetch = &a[P*m] in line 36 and prefetches the cache line of element a[P*m], as shown in FIG. 4.

Next, at i=1, the prefetch target a[(P+1)*m] belongs to the same cache line as the previously prefetched element a[P*m], so prefetching a[(P+1)*m] would redundantly prefetch the same cache line. In source code SC_3 of FIG. 5, the condition of the if statement in lines 34 and 35 (condition 1 above) is false, so the processor executes neither the update of last_prefetch in line 36 nor the prefetch instruction in line 37, and the redundant prefetch is suppressed.

At i=2, the prefetch target a[(P+2)*m] is in a cache line different from the previously prefetched one, so last_prefetch+4 <= &a[(P+2)*m] in condition 1 is true and the processor executes the update in line 36 and the prefetch instruction in line 37. Similarly, the prefetch instruction is not executed at i=5 and is executed at i=3, 4, and 6. Source code SC_3 thus prevents execution of the redundant prefetch instructions shown in FIG. 4.
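The walk-through of source code SC_3 can be reproduced with a short simulation. This is a sketch under stated assumptions rather than the patent's listing: the function name filter_with_last_prefetch and the issued[] flag array are hypothetical, the actual prefetch is replaced by setting a flag, and addresses are plain integers so that the result does not depend on where an array happens to be allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Apply the SC_3-style filter to n strided byte addresses starting at
 * `base`. issued[i] is set to 1 when iteration i would prefetch, else 0.
 * Returns the number of prefetches issued. */
size_t filter_with_last_prefetch(uintptr_t base, size_t stride_bytes,
                                 size_t line_size, size_t n, int *issued)
{
    assert(line_size > 0);
    uintptr_t last_prefetch = 0;   /* initial value, as in line 31 */
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t addr = base + i * stride_bytes;
        issued[i] = 0;
        /* Condition 1 */
        if (addr < last_prefetch || last_prefetch + line_size <= addr) {
            /* Round down to the start of the cache line (line 36). */
            last_prefetch = (addr / line_size) * line_size;
            issued[i] = 1;         /* stands in for prefetch(), line 37 */
            count++;
        }
    }
    return count;
}
```

For a line-aligned base of 64, a 3-byte stride, and a 4-byte line, the flags come out as 1, 0, 1, 1, 1, 0, 1 for i = 0 to 6, matching the text. Note that each iteration reads the value of last_prefetch written by an earlier iteration.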

[Vectorization]
Next, vectorization, another compiler optimization, is described using source code examples. As noted above, vectorization converts a source program into a program that contains SIMD or vector instructions. For a loop that repeats an array access instruction, vectorization converts the access instruction into a vector instruction that executes a plurality of accesses in parallel. When the compiled program is executed by an information processing apparatus that has a SIMD arithmetic unit, a vector instruction that operates on multiple data items with a single instruction is executed in parallel by the multiple arithmetic units within the SIMD unit, which improves the execution efficiency of the program.

FIG. 6 is a diagram showing an example of vectorization of source code SC_1. Without vectorization, source code SC_1 simply executes, in the loop, an instruction that writes 0 to the address &a[i*m]. The vectorized source code SC_4 is an example with a vector length of 2: the vector instruction in line 42 writes 0 in parallel to the elements at addresses &a[i*m] and &a[(i+1)*m]. The loop variable i in line 41 is accordingly incremented in steps of 2. With a vector length of 2, the iterations left over when the iteration count N is divided by 2 are not covered by the vector instruction; in that case the store instruction a[i*m]=0 of line 12 of source code SC_1 is executed once.

Generalizing the vector length to V, the vector instruction stores 0 to each of the elements at the V addresses &a[i*m], &a[(i+1)*m], ..., &a[(i+V-1)*m]. The vector length V is a value set in the vector instruction, and it is set to a length that corresponds to or equals the SIMD length of the SIMD arithmetic unit that executes the compiled program. The iterations left over from N/V are not covered by the vector instruction and are converted into code that executes the store instruction a[i*m]=0 of the source code before vectorization.
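The split into a vector body and a scalar remainder can be sketched in plain C. This is an illustration only: the function name strided_zero and the choice V = 4 are assumptions, and the inner lane loop merely stands in for a single vector store, since real SIMD stores depend on the target instruction set.

```c
#include <assert.h>
#include <stddef.h>

#define V 4  /* assumed vector length for this sketch */

/* Zero a[i*m] for i = 0..n-1. The inner lane loop stands in for one
 * vector store; the final loop handles the n % V leftover iterations. */
void strided_zero(int *a, size_t n, size_t m)
{
    assert(a != NULL);
    size_t i = 0;
    for (; i + V <= n; i += V) {
        for (size_t lane = 0; lane < V; lane++)   /* one "vector" store */
            a[(i + lane) * m] = 0;
    }
    for (; i < n; i++)                            /* scalar remainder */
        a[i * m] = 0;
}
```

With n = 6 and V = 4, one vector step covers i = 0 to 3 and the remainder loop covers i = 4 and 5, which is the N/V leftover case described above.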

Most vector instructions cannot compute one of their target data items first and then compute another based on that result. Vectorization is therefore normally performed when there is no dependence between loop iterations and is not performed when there is one. For example, the loop of source code SC_1 in FIG. 6 does not refer to the results of other iterations, so it can be vectorized, as in the vector instruction in line 42 of source code SC_4.

FIG. 7 is a diagram showing an example that cannot be vectorized. Line 52 of source code SC_5 multiplies element a[i] by element a[i-1], which was computed in the previous iteration, and writes the product to a[i]; it is an arithmetic instruction that refers to the previous iteration's result a[i-1]. Vectorizing the instruction in line 52 of source code SC_5 with a vector length of 2 would yield, for example, the vector instruction in line 62 of source code SC_6. That vector instruction combines the following two arithmetic instructions.
a[i] = a[i] * a[i-1]
a[i+1] = a[i+1] * a[i]

Executing these two instructions in order with the initial values a[0]=2, a[1]=3, a[2]=4 gives:
a[1] = a[1] * a[0] = 3 * 2 = 6
a[2] = a[2] * a[1] = 4 * 6 = 24

With the vector instruction in line 62 of source code SC_6, on the other hand, the two operations are executed in parallel, giving:
a[1] = a[1] * a[0] = 3 * 2 = 6, a[2] = a[2] * a[1] = 4 * 3 = 12
The result a[2] = 12 is computed from the initial value a[1] = 3, before a[1] is updated, and therefore does not match the correct value 24. Vectorizing instructions that have a dependence between iterations is thus inappropriate.
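The difference between the two results can be reproduced in scalar C by loading both lanes' operands before storing either result, which is how a single vector instruction behaves. This is a minimal sketch; the function names multiply_sequential and multiply_pairwise are hypothetical, and the pairwise version emulates a vector length of 2 rather than using real SIMD instructions.

```c
#include <assert.h>

/* Scalar order: each iteration sees the value stored by the previous one. */
void multiply_sequential(int *a, int n)
{
    assert(n >= 1);
    for (int i = 1; i < n; i++)
        a[i] = a[i] * a[i - 1];
}

/* Vector-length-2 emulation: both lanes load their operands before either
 * lane stores, which is how a single vector instruction behaves. */
void multiply_pairwise(int *a, int n)
{
    assert(n >= 1);
    int i = 1;
    for (; i + 1 < n; i += 2) {
        int lane0 = a[i] * a[i - 1];
        int lane1 = a[i + 1] * a[i];   /* reads a[i] before lane 0 updates it */
        a[i] = lane0;
        a[i + 1] = lane1;
    }
    for (; i < n; i++)                 /* leftover iteration */
        a[i] = a[i] * a[i - 1];
}
```

Starting from {2, 3, 4}, the sequential version ends with a[2] = 24 and the pairwise version with a[2] = 12, the same mismatch as in the text.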

FIG. 8 is a diagram showing an example in which source code SC_3, to which the prefetch instruction was added in FIG. 5, is vectorized with a vector length of 2. In source code SC_3, line 36 computes the previous prefetch address last_prefetch that is referenced by the condition of the if statement in the next loop iteration.

When source code SC_3 is vectorized, the instruction a[i*m] = 0 in line 33 of SC_3 is parallelized as in line 73 of the vectorized source code SC_7, namely:
a[i*m:(i+1)*m] = 0

However, the condition of the if statement in lines 34 and 35 of source code SC_3 refers to the previous prefetch address last_prefetch obtained in the preceding iteration, so lines 34 to 37 cannot be vectorized. As a result, code SC_7 is not parallel in that part: lines 74 to 78 and lines 79 to 7D remain code that is processed sequentially. The vectorization is therefore insufficient.

[Present embodiment]
Next, the prefetch instruction addition process and the vectorization process for a loop that repeats a stride access in the present embodiment are described. As noted above, in compiler optimization it is desirable both to add a prefetch instruction to a loop that repeats (iterates) a stride access and to vectorize the loop.

FIG. 9 is a diagram explaining an example of the prefetch instruction added by the prefetch instruction addition process in the present embodiment. For array a, FIG. 9 shows an example with cache line size C = 8 bytes, stride access address interval S = 3 bytes (element interval m = 3, element size (type_size) 1 byte), and prefetch addresses x. The number in parentheses after each prefetch address x is the element position relative to the first element (0) of its cache line. Broken lines mark element (1-byte) boundaries, and solid lines mark cache line boundaries.

Assume that the start address X₁ of array a is at the start of a cache line. In this case, the prefetch instruction prefetch(x) is executed when the remainder (modulo) x%C of the prefetch address x with respect to the cache line size C is smaller than the stride access address interval abs(S), and is not executed otherwise. This prevents redundant prefetches. That is, as shown in FIG. 9, the compiler converts the code into source code that executes the prefetch instruction when a prefetch address X₁ to X₈ lies within 3 bytes of the start of its cache line (of size C = 8 bytes). One prefetch is then executed per cache line. Here, abs() denotes the absolute value of the expression in parentheses.

In the example of FIG. 9, only the addresses X₁, X₄, and X₇ satisfy the prefetch condition (x%C) < abs(S), so the prefetch instruction is executed only for them. Moreover, the condition of the if statement, (x%C) < abs(S), does not use results computed in other iterations, so it can be vectorized.

As described above, the prefetch condition rests on the following property: for stride accesses at an interval of S bytes, when the prefetch address x satisfies (x%C) < abs(S), x is the smallest address among the stride accesses that fall in the cache line containing x.
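The condition can be checked against the FIG. 9 numbers with a short C sketch. The helper names `should_prefetch` and `count_prefetches` are illustrative and do not appear in the patent; addresses are modeled as plain integers, and the array is assumed to start on a cache line boundary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 when (x % C) < abs(S), i.e. when x is the lowest stride-access
 * address inside its cache line. Hypothetical helper for illustration. */
int should_prefetch(uintptr_t x, uintptr_t C, long S)
{
    assert(C != 0);                       /* guard against division by zero */
    return (x % C) < (uintptr_t)labs(S);
}

/* Counts how many of the n addresses base, base+S, base+2S, ... pass the test. */
int count_prefetches(uintptr_t base, long S, int n, uintptr_t C)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        uintptr_t x = base + (uintptr_t)((long)i * S);
        if (should_prefetch(x, C, S))
            count++;
    }
    return count;
}
```

With C = 8, S = 3, and eight addresses starting at 0, the test passes for the addresses 0, 9, and 18, which correspond to X₁, X₄, and X₇ in FIG. 9, so each of the three cache lines is prefetched once.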

FIG. 10 illustrates an example of the source code SC_8 generated by applying the prefetch instruction addition process S3 to the source code SC_1 in FIG. 1. In the source code SC_8, the following if statement and prefetch instruction are added at lines 83-85.
if(abs(m*type_size) > &a[(P+i)*m]%C) {
prefetch(&a[(P+i)*m])
}

That is, the processor executes the prefetch instruction addition process S3 to add code that executes the prefetch instruction prefetch(&a[(P+i)*m]) when the remainder &a[(P+i)*m]%C of the address &a[(P+i)*m] of the prefetch target element a[(P+i)*m] with respect to the cache line size C is smaller than the absolute value abs(m*type_size) of the stride interval S = m*type_size of the stride access.
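The added lines are pseudo code: C does not allow `%` on a pointer, and `prefetch` is not a standard function. A compilable sketch of the same loop is given here under several assumptions that are not from the patent: 4-byte elements, a 64-byte cache line, a counting stub `prefetch_stub` in place of a real prefetch instruction, and a 64-byte-aligned demo array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE 64                              /* assumed cache line size C in bytes */

static _Alignas(LINE) int32_t demo_a[256];   /* starts on a cache line boundary */
static size_t prefetch_calls;                /* number of prefetches issued */

/* Stand-in for a hardware prefetch; a real build might instead use a compiler
 * builtin such as GCC/Clang's __builtin_prefetch. */
static void prefetch_stub(uintptr_t addr) { (void)addr; prefetch_calls++; }

/* Scalar loop corresponding to SC_8: store, then conditionally prefetch the
 * element that will be accessed P iterations later. Returns the prefetch count. */
size_t strided_zero(int32_t *a, long n, long m, long P)
{
    assert(m != 0);
    uintptr_t S = (uintptr_t)labs(m * (long)sizeof a[0]);   /* abs(m*type_size) */
    prefetch_calls = 0;
    for (long i = 0; i < n; i++) {
        a[i * m] = 0;
        /* Integer address arithmetic: the target may lie past the array end. */
        uintptr_t x = (uintptr_t)a + (uintptr_t)((P + i) * m) * sizeof a[0];
        if (S > x % LINE)
            prefetch_stub(x);
    }
    return prefetch_calls;
}

/* Runs the loop on the aligned demo array with n = 16, m = 3, P = 4. */
size_t strided_zero_demo(void) { return strided_zero(demo_a, 16, 3, 4); }
```

In this setting `strided_zero_demo()` issues three prefetches, one for each cache line whose lowest stride address falls inside the range of prefetch addresses generated by the loop.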

FIG. 11 is a flowchart of the prefetch addition process in the present embodiment. As described with reference to FIG. 2, the processor runs the compiler and has already performed lexical analysis and syntax analysis S1 on the source program, extracting the position and number of loops in the source program and the position and number of memory access instructions in each loop. The source program SC_1 in FIG. 10 has one loop, and that loop contains one memory access.

The processor executes the prefetch addition process of the compiler as follows. First, the processor sets the loop number n1 to its initial value 0 (S11) and repeats steps S13-S22 while n1 is smaller than the number of loops in the source program (TRUE in S12). When n1 equals the number of loops (FALSE in S12), the processor ends the prefetch addition process.

Next, the processor sets the loop with loop number n1 as the target loop L for prefetch addition (S13) and sets the prefetch distance of loop L in the variable P (S14). As described above, the prefetch distance P is a number of loop iterations, chosen according to the time that the processor of the computer running the compiled program needs to prefetch data from main memory (the time to read main memory and store the read data in the cache memory). In other words, a prefetch distance of P means that the data accessed P iterations ahead is prefetched from main memory.

The processor then sets the memory access number n2 within loop L to its initial value 0 (S15) and repeats steps S17-S21 while n2 is less than the number of memory accesses in loop L (TRUE in S16). When n2 equals the number of memory accesses in loop L (FALSE in S16), the processor increments the loop number n1 by 1 (S22) and returns to process the next loop (S12).

For the memory access number n2 being processed, the processor sets the n2-th memory access in loop L as the target memory access A (S17) and sets the address interval of access A between iterations in S (S18). That is, S is the difference between the addresses accessed when loop L advances by one iteration. Through lexical and syntax analysis, the processor can recognize the start and end of a loop from, for example, a branch instruction and its branch target, and can recognize the amount by which the memory access address increases or decreases within the loop as the address interval of access A between iterations.

When the address interval S is constant and does not change across iterations (TRUE in S19), the processor, taking x as the prefetch address of the target memory access A P iterations ahead, adds an if statement containing the prefetch condition and a prefetch instruction (S20). The added if statement and prefetch instruction are the same as those described above:
if(abs(S) > &a[(P+i)*m]%C) {
prefetch(&a[(P+i)*m])
}
Here, S = m*type_size (S: address interval between iterations, m: element interval between iterations, type_size: element size). This corresponds to the code added at lines 83-85 of the source code SC_8 in FIG. 10.

When the address interval S changes across iterations (FALSE in S19), the access is outside the scope of the optimization in the present embodiment, so the processor skips step S20 and moves on to the next memory access as the target memory access A (S17).

The processor then increments the memory access number n2 (n2 = n2 + 1) (S21) and repeats steps S16-S21 for the next memory access in the target loop L.

The source code SC_1 in FIG. 10 has one loop and one memory access in that loop (a[i*m]=0). Therefore, for the source code SC_1, the processor executes steps S13-S22 once and steps S16-S21 once in the flowchart of FIG. 11.
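The control flow of FIG. 11 can be sketched in C as two nested loops. The `Loop` and `Access` structures and their fields are hypothetical stand-ins for the compiler's internal representation, which the patent does not specify, and "adding" the if statement and prefetch instruction is modeled simply by setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one memory access found inside a loop. */
typedef struct {
    int  stride_is_constant;  /* S19: does S stay the same across iterations? */
    long stride_bytes;        /* S18: address interval S between iterations   */
    int  prefetch_added;      /* set in S20                                   */
} Access;

/* Hypothetical record of one loop found by lexical and syntax analysis. */
typedef struct {
    long    prefetch_distance;  /* P, in iterations */
    size_t  n_accesses;
    Access *accesses;
} Loop;

/* Walks every access of every loop (S11-S22) and marks those that receive a
 * conditional prefetch. Returns the number of prefetches added. */
size_t add_prefetches(Loop *loops, size_t n_loops)
{
    assert(loops != NULL || n_loops == 0);
    size_t added = 0;
    for (size_t n1 = 0; n1 < n_loops; n1++) {            /* S11, S12, S22 */
        Loop *L = &loops[n1];                            /* S13 */
        long P = L->prefetch_distance;                   /* S14 */
        (void)P;  /* P would parameterize the emitted prefetch address */
        for (size_t n2 = 0; n2 < L->n_accesses; n2++) {  /* S15, S16, S21 */
            Access *A = &L->accesses[n2];                /* S17, S18 */
            if (A->stride_is_constant) {                 /* S19 */
                A->prefetch_added = 1;                   /* S20 */
                added++;
            }
        }
    }
    return added;
}

/* Example: one loop with a constant-stride access and a varying-stride access. */
size_t add_prefetches_demo(void)
{
    Access acc[2] = { {1, 12, 0}, {0, 0, 0} };
    Loop loop = { 4, 2, acc };
    return add_prefetches(&loop, 1);
}
```

In the demo only the constant-stride access is marked, mirroring the FALSE branch of S19 that skips step S20.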

Here, the processor may run the compiler to transform the remainder operation &a[(P+i)*m]%C in the condition of the if statement above, which takes the access address &a[(P+i)*m] modulo the cache line size C, into an instruction that computes the bitwise AND of the binary representations of the address x = &a[(P+i)*m] and C-1.

That is, the cache line size C is normally a power of two, for example 2^y = 10000000 in binary, so C-1 is as follows.
C-1 = 10000000-1 = 01111111

Therefore, when the address x = 10101000, its bitwise AND with C-1 = 01111111 is as follows.
C-1 = 01111111
x = 10101000
x · (C-1) = 00101000

That is, the logical product x · (C-1) is 00101000, the value of the address x with its most significant bit cleared, which matches the remainder obtained by the remainder operation x%C.
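The equivalence can be checked directly in C. The function name `mod_pow2` is illustrative; the identity x % C == x & (C - 1) holds only when C is a power of two, which the sketch asserts.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x divided by C, computed with a bitwise AND.
 * Valid only when C is a power of two (exactly one bit set). */
uintptr_t mod_pow2(uintptr_t x, uintptr_t C)
{
    assert(C != 0 && (C & (C - 1)) == 0);
    return x & (C - 1);
}
```

For the values in the text, `mod_pow2(0xA8, 0x80)` (x = 10101000, C = 10000000) yields 0x28 (00101000), the same as 0xA8 % 0x80.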

FIG. 12 illustrates an example of vector processing in the present embodiment. FIG. 12 shows pseudo code PSC_9 obtained by vectorizing the source code SC_8 to which the prefetch instruction was added in FIG. 11. The vectorized code need not be source code; it may be the compiler's internal program code or assembly code. FIG. 12 presents it as source-like pseudo code PSC_9 so that it is easy for a human reader to follow.

The processor executes the vectorization process S5 as follows. In the vectorized pseudo code PSC_9, the processor changes the per-iteration increment of the variable i in the for statement at line 81 of the code SC_8 to the vector length V, as in the for statement at line 91. Then, as shown at line 92 of the code PSC_9, the processor sets the variable k to i+V-1, the largest value of i covered by the vector length V.

Further, the processor changes the memory access a[i*m]=0 at line 82 of the code SC_8 into a vector instruction, shown at line 93 of the code PSC_9, that writes "0" in parallel to the elements a[i*m:k*m] with element numbers i*m to k*m. The vector instruction at line 93 is as follows.
a[i*m:k*m]=0
This vector instruction writes "0" in parallel to the V elements of array a with element numbers i*m to k*m.

Then, as shown at line 94 of the code PSC_9, the processor generates a vector instruction that compares, for each of the prefetch addresses &a[(P+i)*m] to &a[(P+k)*m], whether its remainder with respect to the cache line size C (the remainder of dividing the prefetch address by C) is smaller than the absolute value abs(m*type_size) of the address difference between stride accesses, and assigns each comparison result (true: 1, false: 0) to the corresponding one of the V elements of the mask array mask[0:V-1].

In addition, as shown at line 95 of the code PSC_9, the processor generates a vector instruction that executes the prefetch instruction prefetch for the addresses corresponding to the elements of the mask array mask[0:V-1] that hold a true comparison result. However, whether each prefetch instruction is executed may also be determined by a method other than assigning values indicating execution or non-execution to a mask array.

In FIG. 12, the code PSC_9 omits the code for the remainder loop left over from N/V. The code PSC_9 additionally contains, for each of the remainder iterations INT(N/V)*V+1 to N, the if statement and the memory access a[i*m]=0 (the code at lines 83-84 of the code SC_8).

In FIG. 12, the vector instructions generated by vectorization comprise a first vector instruction at line 93 that performs the stride accesses in parallel, a second vector instruction at line 94 that evaluates the prefetch condition in parallel and stores each result in the corresponding element of the mask array, and a third vector instruction at line 95 that executes the prefetch instruction for each element of the mask array that is true.
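A scalar C emulation of the three vector instructions may make the structure of PSC_9 easier to follow. It is only a sketch under assumptions that are not from the patent: the vector length `V = 4`, a 64-byte cache line, 4-byte elements, the counting stub `prefetch_stub`, and an aligned demo array. Each inner `for` over the lanes stands for one vector instruction, and the remainder loop that FIG. 12 omits is written out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE 64   /* assumed cache line size C in bytes */
#define V    4    /* assumed vector length */

static _Alignas(LINE) int32_t demo_a[256];
static size_t prefetch_calls;
static void prefetch_stub(uintptr_t addr) { (void)addr; prefetch_calls++; }

/* Prefetch address for iteration i, computed on integers. */
static uintptr_t pf_addr(const int32_t *a, long i, long m, long P)
{
    return (uintptr_t)a + (uintptr_t)((P + i) * m) * sizeof a[0];
}

/* Emulates PSC_9 lane by lane and returns the number of prefetches issued. */
size_t strided_zero_vec(int32_t *a, long n, long m, long P)
{
    assert(m != 0);
    uintptr_t S = (uintptr_t)labs(m * (long)sizeof a[0]);
    unsigned char mask[V];
    long i;
    prefetch_calls = 0;
    for (i = 0; i + V <= n; i += V) {
        for (long l = 0; l < V; l++)            /* line 93: a[i*m:k*m] = 0 */
            a[(i + l) * m] = 0;
        for (long l = 0; l < V; l++)            /* line 94: fill mask[0:V-1] */
            mask[l] = S > pf_addr(a, i + l, m, P) % LINE;
        for (long l = 0; l < V; l++)            /* line 95: masked prefetch */
            if (mask[l])
                prefetch_stub(pf_addr(a, i + l, m, P));
    }
    for (; i < n; i++) {                        /* scalar remainder loop */
        a[i * m] = 0;
        if (S > pf_addr(a, i, m, P) % LINE)
            prefetch_stub(pf_addr(a, i, m, P));
    }
    return prefetch_calls;
}

/* n = 14 leaves two remainder iterations after three full vectors of 4. */
size_t strided_zero_vec_demo(void) { return strided_zero_vec(demo_a, 14, 3, 4); }
```

Because no lane reads a value produced by another lane, each of the three inner loops has independent iterations, which is the property that lets the condition and the prefetch be expressed as vector instructions.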

As described above, according to the present embodiment, the processor runs the compiler and, for a loop in the source program that repeatedly executes a stride access instruction on a plurality of elements of an array in memory, adds a prefetch instruction that is executed when the remainder (x%C) of its prefetch address x with respect to the cache line size C is smaller than the address length S between stride accesses. This allows the processor to vectorize the stride access instruction together with the prefetch instruction. As a result, the compiler's optimization adds a prefetch instruction to a loop that repeatedly performs stride accesses on a plurality of elements of an array and further converts the loop into vector instructions, improving the efficiency with which the information processing device executes the compiled code.

10: CPU, processor
L2 CACHE: cache memory
MAC: memory access controller
M_MEM: main memory
22: compiler
23: assembler
24: source program
26: assembly code
27: object code
a[i*m]: stride access instruction for array a
for(): loop
a: array
S: address length between stride accesses (address interval between adjacent stride accesses)
m: element length between stride accesses (element interval between adjacent stride accesses)
Type_Size: size of one element of the array (S = m * Type_Size)
x: prefetch address
C: cache line size
Prefetch: prefetch instruction
Last_prefetched: immediately preceding prefetch address

Claims

A compiler program that causes a computer to execute a process comprising:
detecting, in a source program, a loop that repeats a stride access instruction accessing a plurality of elements of an array in a memory at intervals of an inter-stride-access element length;
adding to the loop a prefetch instruction that accesses the memory and stores in a cache memory the data to be accessed by the stride access instruction a predetermined number of loop iterations later;
adding to the loop a conditional statement that executes the prefetch instruction when, given the cache line size (C) of the cache memory, the inter-stride-access element length (m), the size of one element of the array (Type_Size), and the prefetch address (x) of the prefetch instruction, the remainder (x%C) of dividing the prefetch address (x) by the cache line size (C) is smaller than the inter-stride-access address length (S) obtained by multiplying the inter-stride-access element length (m) by the size of one element (Type_Size) ((x%C) < S); and
vectorizing the loop by converting the stride access instruction, the conditional statement, and the prefetch instruction in the loop into a plurality of vector instructions to be executed in parallel.

The compiler program according to claim 1, wherein the vector instructions include a first vector instruction that executes a plurality of the stride access instructions in parallel, and a second vector instruction that, corresponding to the first vector instruction, executes in parallel a plurality of the prefetch instructions for which the conditional statement is true.

The compiler program according to claim 2, wherein the second vector instruction includes a third vector instruction that, corresponding to the first vector instruction, stores in each element of a mask array whether the conditional statement is true, and a fourth vector instruction that, corresponding to the first vector instruction, executes the prefetch instruction for each element of the mask array that is true.

The compiler program according to claim 1, wherein adding the conditional statement to the loop further comprises generating an operation instruction that computes the remainder of dividing the prefetch address (x) by the cache line size (C) as the bitwise logical product of the binary number obtained by subtracting 1 from the cache line size (C) and the binary number of the prefetch address (x).

A compiling method in which a computer executes a process comprising:
detecting, in a source program, a loop that repeats a stride access instruction accessing a plurality of elements of an array in a memory at intervals of an inter-stride-access element length;
adding to the loop a prefetch instruction that accesses the memory and stores in a cache memory the data to be accessed by the stride access instruction a predetermined number of loop iterations later;
adding to the loop a conditional statement that executes the prefetch instruction when, given the cache line size (C) of the cache memory, the inter-stride-access element length (m), the size of one element of the array (Type_Size), and the prefetch address (x) of the prefetch instruction, the remainder (x%C) of dividing the prefetch address (x) by the cache line size (C) is smaller than the inter-stride-access address length (S) obtained by multiplying the inter-stride-access element length (m) by the size of one element (Type_Size) ((x%C) < S); and
vectorizing the loop by converting the stride access instruction, the conditional statement, and the prefetch instruction in the loop into a plurality of vector instructions to be executed in parallel.

An information processing device for compiling, comprising:
a memory; and
a processor capable of accessing the memory,
wherein the processor:
detects, in a source program, a loop that repeats a stride access instruction accessing a plurality of elements of an array in the memory at intervals of an inter-stride-access element length;
adds to the loop a prefetch instruction that accesses the memory and stores in a cache memory the data to be accessed by the stride access instruction a predetermined number of loop iterations later;
adds to the loop a conditional statement that executes the prefetch instruction when, given the cache line size (C) of the cache memory, the inter-stride-access element length (m), the size of one element of the array (Type_Size), and the prefetch address (x) of the prefetch instruction, the remainder (x%C) of dividing the prefetch address (x) by the cache line size (C) is smaller than the inter-stride-access address length (S) obtained by multiplying the inter-stride-access element length (m) by the size of one element (Type_Size) ((x%C) < S); and
vectorizes the loop by converting the stride access instruction, the conditional statement, and the prefetch instruction in the loop into a plurality of vector instructions to be executed in parallel.