JPS61100862A

JPS61100862A - Sequencing system of instruction

Info

Publication number: JPS61100862A
Application number: JP21331584A
Authority: JP
Inventors: Masaki Aoki; 正樹青木; Hiroshi Nakada; 弘中田; Toshihiro Hirabayashi; 平林　俊弘
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1984-10-12
Filing date: 1984-10-12
Publication date: 1986-05-19
Also published as: JPH056712B2

Abstract

PURPOSE: To increase the parallelism between vectorized DO loops and other ranges, and thereby improve execution efficiency, by grasping data dependencies over a wide range and performing optimal serialization. CONSTITUTION: Data analyzed by a source analysis section 1 is passed to an address allocation section 2, which allocates memory areas and gives initial values to arrays and variables. A vectorization section 3 converts DO loops into vector instruction sequences, and a serialization processing section 4 grasps data dependencies over a wide range and performs optimal instruction serialization for those dependencies using pipeline IDs or synchronization instructions. An intermediate text optimization section 5 performs optimization after vectorization, and a register allocation section 6 performs processing such as allocating data to registers. An instruction generation section 7 converts the intermediate text into machine language instructions.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to a compiler that creates object modules for vector computers, and in particular to a method for serializing instructions between a plurality of vectorized DO loops.

[Prior art and problems]

In vector computers, faster arithmetic units and a data supply capability that matches those units are important keys to improving execution efficiency. For this reason, recent vector computers provide two load/store pipelines that can operate in parallel to increase the data supply capability. However, because multiple load/store pipelines operate in parallel, synchronization (used here as a synonym for serialization) of memory access instructions has become necessary. Such synchronization is difficult to achieve in hardware, and conventional systems that include a vector computer achieve it in software.

Vector computer hardware provides the following means for synchronizing memory access instructions.

(a) Pipeline ID. This specifies the pipeline in which a vector memory access instruction operates. Synchronization is achieved by running the memory access instructions whose order must be guaranteed in the same pipeline.

(b) Synchronization instructions (POST/WAIT instructions). This method guarantees the order between memory access instructions by means of synchronization instructions. With it, the memory access instructions that precede the POST instruction can be synchronized with the memory access instructions that follow the WAIT instruction.
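As a rough illustration of mechanisms (a) and (b), the following Python sketch models when two memory accesses are guaranteed to execute in order. The `MemOp` record, the integer pipeline numbering, and both helper functions are hypothetical names introduced for this sketch. The rule that accesses issued to one pipeline complete in issue order is the assumption stated in (a), not a description of any particular machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str   # "load" or "store"
    array: str  # name of the array being accessed
    pipe: int   # pipeline ID (hypothetical encoding: 0 or 1)


def needs_ordering(first: MemOp, second: MemOp) -> bool:
    """Two accesses conflict if they touch the same array and one is a store."""
    return first.array == second.array and "store" in (first.kind, second.kind)


def is_ordered(first: MemOp, second: MemOp, post_wait_between: bool) -> bool:
    """Return True if `second` is guaranteed to run after `first`."""
    # (a) Same pipeline ID: assumed to execute in issue order.
    if first.pipe == second.pipe:
        return True
    # (b) Different pipelines: only a POST/WAIT pair between them orders them.
    return post_wait_between


store_a = MemOp("store", "A", pipe=0)
load_a = MemOp("load", "A", pipe=1)
print(needs_ordering(store_a, load_a))                      # conflict exists
print(is_ordered(store_a, load_a, post_wait_between=False))  # not yet ordered
```

In this model, a conflicting pair can be made safe either by giving both accesses the same `pipe` value or by placing a POST/WAIT pair between them, which mirrors the two choices the serialization step has to weigh.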

Synchronization must not only guarantee the order of memory access instructions; it must also be performed efficiently so that execution performance does not degrade. Conventional compilers, however, did not consider the data dependencies among a plurality of vectorized DO loops. Serialization was therefore performed at the end of each individual DO loop, which was one cause of reduced execution efficiency in parallel computation.

[Purpose of the invention]

The present invention is based on the above considerations, and its purpose is to perform optimal instruction serialization between a plurality of DO loops and thereby improve execution performance.

[Means to achieve the purpose]

To that end, the instruction serialization method of the present invention is characterized in that, in a compiler that generates object modules to be executed on a vector computer, data dependencies are grasped over a wide range in the intermediate text after vectorization by referring to the subscripts appearing in arrays, and optimal instruction serialization is applied to the data that requires serialization.

[Embodiments of the invention]

The present invention is described below with reference to the drawings. FIG. 1 is a diagram showing an overview of the compiler of the present invention.

This compiler is a ■P compiler that generates object modules to be executed on a system that includes a vector computer. In FIG. 1, 1 denotes a source analysis section, 2 an address allocation section, 3 a vectorization section, 4 a serialization processing section, 5 an intermediate text optimization section, 6 a register allocation section, and 7 an instruction generation section. The source analysis section 1 detects inconsistencies between the arrays and variables defined in declaration statements and their handling in the procedure part of the source program, checks whether undefined arrays or variables are defined or referenced, and divides the source program into blocks. The address allocation section 2 allocates memory areas to data and gives initial values to arrays and variables. The vectorization section 3 converts DO loops into vector instruction sequences. The serialization processing section 4 serializes instructions. The present invention concerns the serialization processing section 4. The intermediate text optimization section 5 performs optimization and similar processing after vectorization.

The register allocation section 6 performs processing such as allocating data to registers. The instruction generation section 7 converts the intermediate text into machine language instructions.

In summary, the present invention applies vectorization techniques (the behavior of the subscripts appearing in arrays) to the intermediate text (instruction sequence) after vectorization in order to grasp data dependencies over a wide range, and then applies optimal instruction serialization to those data dependencies (the data that requires serialization) using pipeline IDs or synchronization instructions.

FIG. 2 is a diagram showing the flow of the instruction serialization processing of the present invention.

■ Extract a group of DO loops with a fixed flow of control.

FIG. 3 shows examples of DO loop groups with a fixed flow of control; the arrows A-B, C-D, E-F, and so on each indicate such a group.

■ Grasp the data dependencies within the DO loop group.

That is, the multi-dimensional subscripts are checked for overlap. If the subscripts differ in a higher dimension, there is no overlap in the lower dimensions. For example, suppose the following program is given.

    DO 10 J=1,N
    DO 10 I=1,N
    A(I,J)=A(I,J-1)+S
 10 CONTINUE

This program is expanded as follows.

    DO 10 I=1,N
 10 A(I,1)=A(I,0)+S
    DO 10 I=1,N
 10 A(I,2)=A(I,1)+S

In this example, the second-dimension subscripts differ within the inner DO loop, so the memory accesses to A do not overlap (no serialization is needed). When the outer loop is considered, however, the store to A and the load of A overlap, and serialization is necessary.
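The overlap check described above can be sketched in Python as follows. This is a minimal, deliberately conservative model and not the analysis used by the compiler in the source: each subscript is represented as a hypothetical `(loop_variable, constant_offset)` pair, and the function names and the `varying` parameter (the set of loop variables that iterate within the scope under consideration) are assumptions introduced for illustration.

```python
def dims_overlap(sub_a, sub_b, varying):
    """Can two subscripts of the same dimension take the same value?

    Each subscript is a (loop_variable, constant_offset) pair.
    `varying` is the set of loop variables that iterate in the scope examined.
    """
    var_a, off_a = sub_a
    var_b, off_b = sub_b
    if var_a != var_b:
        return True   # conservative: unrelated variables may coincide
    if var_a in varying:
        return True   # conservative: shifted ranges are assumed to intersect
    return off_a == off_b  # variable is fixed: equal only if offsets match


def refs_overlap(ref_a, ref_b, varying):
    """Two array references overlap only if every dimension can coincide."""
    return all(dims_overlap(a, b, varying) for a, b in zip(ref_a, ref_b))


store = (("I", 0), ("J", 0))    # A(I,J)
load = (("I", 0), ("J", -1))    # A(I,J-1)

# Inner loop only: J is fixed, so the second dimension always differs.
print(refs_overlap(store, load, varying={"I"}))
# Outer loop included: J iterates, so the store and the load can meet.
print(refs_overlap(store, load, varying={"I", "J"}))
```

With only `I` varying the references are reported as disjoint, which corresponds to "no serialization needed" for the inner loop; adding `J` to the scope exposes the overlap between the store and the load across outer iterations.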

■ Check whether there is a vector-scalar dependency between the DO loops. If yes, perform the processing of step ■; if no, perform the processing of step ■.

■ Perform serialization within nested DO loops; that is, serialize based on the data dependencies caused by iteration of the outer loop.

■ Perform serialization per DO loop.

■ Check efficiency; that is, examine the density of pipeline IDs. If the result is NG, perform the processing of step ■.

■ Perform serialization between DO loops; that is, serialize based on the data dependencies of the innermost dimension only (top to bottom only).

■ Check efficiency. If the result is NG, perform the processing of step ■.

■ Perform serialization within the DO loop; that is, serialize based on the data dependencies closed within the DO loop.
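The step numbers that the branches refer to are not legible in this copy, so the following Python sketch shows only one plausible reading of the flow: a vector-scalar dependency leads to per-loop serialization, and otherwise the widest scope is tried first and narrower scopes are used when an efficiency check fails. The function name, the scope labels, and the boolean inputs standing in for the efficiency checks are all assumptions made for illustration.

```python
def choose_scope(has_vector_scalar_dep, nested_ok, inter_loop_ok):
    """Pick a serialization scope (assumed reading of the flow in FIG. 2).

    has_vector_scalar_dep: a vector-scalar dependency exists between DO loops
    nested_ok:     efficiency check passed after nested-loop serialization
    inter_loop_ok: efficiency check passed after inter-loop serialization
    """
    if has_vector_scalar_dep:
        return "per_loop"      # serialize per DO loop
    if nested_ok:
        return "nested"        # dependencies from outer-loop iteration
    if inter_loop_ok:
        return "inter_loop"    # innermost dimension only, top to bottom
    return "intra_loop"        # dependencies closed within one DO loop


print(choose_scope(False, nested_ok=False, inter_loop_ok=True))
```

The ordering from wide to narrow reflects the idea that a wider scope is preferred as long as it does not force too many memory accesses onto one pipeline ID.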

Next, the present invention is explained with concrete examples. Consider the following group of DO loops.

    DO 10 J=2,100
    DO 10 I=2,100
    A(I,J)=A(I-1,J-1)+A(I,J-1)
 10 CONTINUE

In this example, iteration of the inner DO loop creates no data dependency. Iteration of the outer DO loop, however, creates data dependencies between the store to A(I,J) and each of the loads A(I-1,J-1) and A(I,J-1). If synchronization is performed over the wide scope (the data dependencies of the outer DO loop), all three memory accesses need the same pipeline ID, so parallel processing efficiency drops sharply. It is therefore better to synchronize in the local scope (the data dependencies of the inner DO loop). In that case, data dependencies in the other scope are synchronized with POST/WAIT instructions.

An example in which synchronization over the wide scope is optimal is described next. Consider the following group of DO loops.

    DO 10 J=1,100
    DO 10 I=1,100
    A(I,J)=B(I,J)+A(I,J-1)
 10 CONTINUE

This example has the same structure as the previous one, but iteration of the outer DO loop creates only one data dependency, between A(I,J) and A(I,J-1), so parallel processing efficiency remains high even when synchronization is performed over the wide scope (the data dependencies of the outer DO loop). It is therefore best to synchronize over the wide scope using pipeline IDs.
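The contrast between the two loop nests above can be made concrete with a small Python sketch. The source does not define how the "density of pipeline IDs" is measured, so the metric below (the fraction of memory accesses that are tied to a shared pipeline ID by some dependency) and the decision rule in `prefers_wide_scope` are assumptions chosen for illustration only.

```python
def pipeline_id_density(accesses, dependent_pairs):
    """Fraction of memory accesses forced to share one pipeline ID."""
    tied = set()
    for first, second in dependent_pairs:
        tied.update((first, second))
    return len(tied) / len(accesses)


def prefers_wide_scope(density):
    """Assumed rule: wide scope pays off if some access stays free."""
    return density < 1.0


# First loop nest: A(I,J) = A(I-1,J-1) + A(I,J-1)
ex1_accesses = ["A(I,J)", "A(I-1,J-1)", "A(I,J-1)"]
ex1_pairs = [("A(I,J)", "A(I-1,J-1)"), ("A(I,J)", "A(I,J-1)")]

# Second loop nest: A(I,J) = B(I,J) + A(I,J-1)
ex2_accesses = ["A(I,J)", "B(I,J)", "A(I,J-1)"]
ex2_pairs = [("A(I,J)", "A(I,J-1)")]

for name, acc, pairs in [("first", ex1_accesses, ex1_pairs),
                         ("second", ex2_accesses, ex2_pairs)]:
    d = pipeline_id_density(acc, pairs)
    print(name, round(d, 2), prefers_wide_scope(d))
```

Under this assumed metric every access in the first nest is tied to one pipeline, leaving nothing for the second pipeline to do, whereas in the second nest the load of B remains free to run in parallel.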

Another example in which synchronization over the wide scope is optimal is described next. Consider the following group of DO loops.

    DO 10 I=1,100
    A(I)=C(I)+B(I-1)
 10 CONTINUE
    DO 20 I=1,100
    B(I)=C(I)*B(I+1)
 20 CONTINUE

In this example, if synchronization is performed in the local scope, synchronization between the DO loops is achieved with POST/WAIT instructions, so parallel processing efficiency suffers. If synchronization is instead based on the data dependencies between the DO loops, pipeline IDs are needed only for the dependency between B(I-1) and B(I) and the dependency between B(I) and B(I+1), and parallel processing efficiency remains high. The labeled memory references in this example are A(I), B(I-1), B(I), and B(I+1).

Another example in which synchronization in the local scope is optimal is described next. Consider the following group of DO loops.

    DO 10 I=1,100
    A(I)=C(I)
 10 CONTINUE
    DO 20 J=1,50
    B(J)=A(50)
 20 CONTINUE

In this example there is a vector-scalar dependency between the DO loops, so serialization is performed per DO loop. The vector instruction sequence corresponding to the above DO loop group is as follows.

    VL  VR1,C(1:100)
    VST VR1,A(1:100)
    VPT
    VWT
    VL  VR2,A(50)
    VST VR2,B(1:50)

Here VL is the vector load instruction, VST is the vector store instruction, VPT is the POST instruction, VWT is the WAIT instruction, and VRx denotes a vector register.
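Per-loop serialization of this kind can be sketched as a small Python generator that places VPT (POST) and VWT (WAIT) between the instructions of consecutive loops. The mnemonics come from the listing above; the function names, the tuple layout `(register, source, destination)`, and the simplification that each loop is a single load followed by a single store are assumptions made for this sketch.

```python
def emit_loop(reg, src, dst):
    """Emit the load/store pair for one vectorized copy loop."""
    return [f"VL  {reg},{src}", f"VST {reg},{dst}"]


def serialize_per_loop(loops):
    """Insert a POST/WAIT pair between consecutive loops.

    `loops` is a list of (register, source, destination) tuples.
    """
    out = []
    for index, (reg, src, dst) in enumerate(loops):
        if index > 0:
            # Order every access of the previous loop before this loop.
            out += ["VPT", "VWT"]
        out += emit_loop(reg, src, dst)
    return out


for line in serialize_per_loop([("VR1", "C(1:100)", "A(1:100)"),
                                ("VR2", "A(50)", "B(1:50)")]):
    print(line)
```

Because the second loop reads the single element A(50) that the first loop stores as part of a vector, the POST/WAIT pair guarantees the store has completed before the load is issued, regardless of which pipelines the two loops use.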

[Effect of the invention]

As is clear from the above description, according to the present invention, grasping data dependencies over a wide range and performing optimal serialization increases the parallelism among vectorized DO loops (vector instruction sequences) and with other ranges (scalar instruction sequences), and execution efficiency improves.

[Brief explanation of drawings]

FIG. 1 is a diagram showing an overview of the compiler of the present invention, FIG. 2 is a diagram showing the flow of the instruction serialization processing of the present invention, and FIG. 3 is a diagram showing examples of DO loop groups with a fixed flow of control. 1: source analysis section, 2: address allocation section, 3: vectorization section, 4: serialization processing section, 5: intermediate text optimization section, 6: register allocation section, 7: instruction generation section.

Claims

[Claims]

An instruction serialization method characterized in that, in a compiler that generates object modules to be executed on a vector computer, data dependencies are grasped over a wide range in the intermediate text after vectorization by referring to the subscripts appearing in arrays, and optimal instruction serialization processing is applied to the data that requires serialization.