JPH0628324A

JPH0628324A - Parallel computer and compiler

Info

Publication number: JPH0628324A
Application number: JP17858992A
Authority: JP
Inventors: Hiroshige Fujii; 洋重藤井; Masashi Takahashi; 真史高橋; Shigeyoshi Kaneko; 栄美金子
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-07-06
Filing date: 1992-07-06
Publication date: 1994-02-04

Abstract

PURPOSE: To achieve optimal scheduling even when execution time differs from instruction to instruction, by forming instruction groups out of mutually dependent instructions and scheduling group by group. CONSTITUTION: The instruction scheduling part 6 of the intermediate code optimizing part 8 forms a group (instruction group) out of each set of mutually dependent instructions and records which instruction groups can be executed in parallel with one another. It then calculates, in clocks, the total execution time from the issue of the first instruction to the completion of the last instruction, and the time at which another instruction group can be issued after the first instruction of each instruction group has been issued. Next, taking the instruction groups in order from the first, it checks whether there are instruction groups that can be executed in parallel with the group under consideration (that is, groups whose issue order can be swapped or whose issue time can be moved up). If there are, the group that minimizes the total execution time is selected and the issue time is moved up for the instruction group under consideration.

Description

Detailed Description of the Invention

【０００１】[0001]

【Field of Industrial Application】 The present invention relates to a compiler that generates object programs for a computer that processes instructions in a pipeline; to a parallel computer that solves the simultaneous linear equations arising in simulations of physical phenomena; and to a parallel computer that has multiple processing elements and performs matrix multiplication with a small amount of memory.

【０００２】[0002]

【Prior Art】 Compilers for computers such as parallel computers have conventionally used an optimization called "instruction scheduling", which arranges an instruction sequence so that it executes in as short a time as possible, taking into account characteristics of the target architecture such as its pipeline. The terms used in the following description and a conventional instruction scheduling method are briefly explained below with reference to FIGS. 26 to 30.

【０００３】 First, the terms are defined and explained.

【０００４】 Pipeline processing: A hardware technique that speeds up overall processing by dividing the process from fetching a machine instruction to completing its execution into several stages and executing the stages in parallel on multiple instructions. Typical pipelines and the way instructions execute in them are shown in FIGS. 30(a), (b), and (c). FIG. 30(a) shows a 4-stage pipeline, (b) a 5-stage pipeline, and (c) a 6-stage pipeline.

【０００５】 Basic block: A sequence of instructions that is executed one at a time, in order, from the first instruction to the last. That is, an instruction sequence that contains no label definitions, branch instructions, or call instructions.

【０００６】 Resource: A register, memory, processor state, carry, and the like.

【０００７】 Data dependency: Two instructions are said to have a data dependency when they use the same resource in any of the relations "definition"-"reference", "reference"-"definition", or "definition"-"definition". Here a "definition" is an operation that sets a value, and a "reference" is an operation that uses the value in a computation.
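As a minimal sketch of the three relations above, the following Python snippet classifies the dependency between an earlier and a later instruction from the resources each one defines and references. The `Instr` record, the `dependency_kinds` function, and the register names are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    defs: frozenset  # resources the instruction defines (sets a value)
    uses: frozenset  # resources the instruction references (reads a value)


def dependency_kinds(first, second):
    """Return the data-dependency relations from `first` to a later `second`."""
    kinds = []
    if first.defs & second.uses:
        kinds.append("definition-reference")
    if first.uses & second.defs:
        kinds.append("reference-definition")
    if first.defs & second.defs:
        kinds.append("definition-definition")
    return kinds


# Hypothetical instructions: a load into r1, then an add that reads r1 and r2.
load_r1 = Instr("load", frozenset({"r1"}), frozenset({"mem"}))
add_r3 = Instr("add", frozenset({"r3"}), frozenset({"r1", "r2"}))
print(dependency_kinds(load_r1, add_r3))
```

An empty list means the two instructions share no resource in any of the three relations, so their order may be exchanged as far as data dependencies are concerned.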

【０００８】 Interlock: In pipeline processing, when the result of one instruction is used by another instruction, or when a particular resource is needed by two instructions at the same time, the second instruction must wait until the first completes. An interlock is the state in which the hardware temporarily suspends parallel execution in such a case and performs synchronization control.

【０００９】 Suppose that, on the computer of FIG. 27(a), which has a pipeline and several arithmetic units that can operate in parallel (an adder 31, a multiplier 32, and a divider 33) together with registers 34, the program of FIG. 27(b) is compiled into the instruction sequence of FIG. 27(c).

【００１０】 The pipeline consists of the five stages shown in FIG. 26(b). The load (read from memory), store (write to memory), add (addition), and mul (multiplication) instructions each execute every stage, from instruction fetch to register write, in 1 clock. The div (division) instruction executes the instruction execution stage (E stage) in 6 clocks and each of the other stages in 1 clock.

【００１１】 In pipeline processing, an instruction can be issued every clock and the pipeline stages operate in parallel. Therefore, if there are no dependencies between instructions, a division instruction completes 10 clocks after it is issued and any other instruction completes 5 clocks after it is issued.
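The clock counts above follow from simple arithmetic: four single-clock stages plus the clocks spent in the E stage. The sketch below reproduces that arithmetic and the completion time of a run of independent instructions issued one per clock. The function names are hypothetical, and the model assumes an ideal pipeline with no interlocks.

```python
STAGES = 5  # number of pipeline stages described in the text


def completion_clocks(execute_clocks, stages=STAGES):
    """Clocks from issue to completion: every stage but E takes 1 clock."""
    return (stages - 1) + execute_clocks


def ideal_total(latencies):
    """Total clocks when instruction i is issued at clock i and nothing stalls."""
    return max(issue + latency for issue, latency in enumerate(latencies))


DIV = completion_clocks(6)    # E stage takes 6 clocks
OTHER = completion_clocks(1)  # E stage takes 1 clock
print(DIV, OTHER)                          # 10 5
print(ideal_total([OTHER, OTHER, OTHER]))  # 7
```

Three independent 5-clock instructions finish after 7 clocks rather than 15 because their stages overlap; interlocks are what erode this overlap.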

【００１２】 In the instruction sequence of FIG. 27(c), however, instruction 3 uses the results of instructions 1 and 2. After its instruction fetch (F stage), instruction 3 therefore cannot start reading the registers (R stage) until the results of instructions 1 and 2 have been written to the registers, and an interlock occurs. The hatched areas in FIG. 28(a) indicate the interlock state.

【００１３】 Similarly, there are data dependencies between instruction 5 and instructions 3 and 4, between instruction 6 and instruction 5, between instruction 8 and instruction 7, and between instruction 9 and instruction 8. Until the result of the earlier instruction is available, or more precisely until its W stage ends, the later instruction cannot start its decode (R stage), and an interlock occurs.

【００１４】 Executing the instruction sequence of FIG. 27(c) proceeds as shown in FIG. 28(a): the interlocks between instructions leave the pipelined arithmetic units idle and lengthen the overall execution time.

【００１５】 For such instruction sequences there is a compiler optimization technique called "instruction scheduling", which reorders the instructions, without changing the meaning of the program, so as to minimize the number of interlocks or the idle time of the pipelined arithmetic units and thereby shorten the execution time.

【００１６】 One typical conventional instruction scheduling method is introduced here, following P. B. Gibbons and S. S. Muchnick, "Efficient Instruction Scheduling for a Pipelined Architecture", Proceedings of the SIGPLAN Symposium on Compiler Construction, Palo Alto, CA, June 1986, pp. 11-16.

【００１７】 (1) Build a dependency directed graph for the instructions in the basic block. Each instruction in the basic block is a node, and two instructions that have a dependency in their use of a resource are joined by '→'. 'a → b' means that a must be executed before b.

【００１８】 (2) Put the root nodes of the dependency directed graph (a in the example above) into the candidate set of root nodes used for scheduling.

【００１９】 (3) Repeat the following until the candidate set is empty.

【００２０】 [3-1] Select the best node from the candidate set according to the following rules.

【００２１】 ・Choose an instruction that has no dependency on the most recently scheduled instruction.

【００２２】 ・Choose an instruction that has a dependency with some instruction in the dependency graph.

【００２３】 ・Choose an instruction with many successors.

【００２４】 ・Choose an instruction whose longest path to the remaining instructions in the dependency graph is long.

【００２５】 [3-2] Remove the newly scheduled instruction from the candidate set and from the dependency graph, and add to the candidate set any instructions that have newly become root nodes of the dependency graph.

【００２６】 Here, the number of successors is the number of nodes connected to the instruction of interest by '→', and the path length is the execution time needed to finish executing the successors from the instruction of interest onward. For example, in the dependency directed graph of FIG. 28(b), instruction 1 has 3 successors, instruction 4 has 2 successors, the path length of instruction 7 is 20 clocks (the sum of the execution times of instructions 7, 8, and 9), and the path length of instruction 5 is 10 clocks. This is because a division instruction is counted as 10 clocks and every other instruction as 5 clocks.
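The two measures can be checked with a short sketch. The edge list below is assembled from the dependencies named in the text (3 on 1 and 2; 5 on 3 and 4; 6 on 5; 8 on 7; 9 on 8), since FIG. 28(b) itself is not reproduced here. Which of instructions 7, 8, and 9 is the division is not stated, so treating instruction 8 as the 10-clock division is an assumption, as are the names `EDGES`, `CLOCKS`, `successors`, and `path_length`.

```python
# a -> b: instruction a must execute before instruction b.
EDGES = {1: [3], 2: [3], 3: [5], 4: [5], 5: [6], 6: [], 7: [8], 8: [9], 9: []}

# Division counts as 10 clocks, everything else as 5 clocks.
CLOCKS = {n: 5 for n in EDGES}
CLOCKS[8] = 10  # assumption: the division is instruction 8


def successors(node):
    """All nodes reachable from `node` by following '->' edges."""
    found, stack = set(), list(EDGES[node])
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(EDGES[n])
    return found


def path_length(node):
    """Clocks of `node` plus the longest chain of successors after it."""
    return CLOCKS[node] + max((path_length(s) for s in EDGES[node]), default=0)


print(len(successors(1)), len(successors(4)))  # 3 2
print(path_length(7), path_length(5))          # 20 10
```

Under these assumptions the successor count is taken transitively, which is what makes instruction 1 count three successors (3, 5, and 6) rather than one.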

【００２７】 The instruction sequence of FIG. 27(c) contains no label definitions or branch instructions, so it is a basic block. Applying step (1) to it yields the dependency directed graph of FIG. 28(b). Steps (2) and (3) are then applied to this graph.

【００２８】 The first candidates for the root node are instructions 1, 2, and 7, which have the longest path, 20 clocks. Instructions 1 and 2 have 3 successors and instruction 7 has 2, so either instruction 1 or instruction 2 becomes the first root node. Either may be taken first; instruction 1 is chosen here as an example. Applying the selection of [3-1] after instruction 1, instructions 2 and 7, which have the longest path length, become candidates, and instruction 2, which has the larger number of successors, is selected.

【００２９】 The same selection is made after instruction 2. Instructions 3, 4, and 7, which have the largest number of successors (2), become candidates, and instruction 7, which has the longest path length (20 clocks), is selected.

【００３０】 Processing the instructions after instruction 7 in the same way yields the instruction sequence of pattern A or pattern B shown in FIG. 29. Other patterns are also possible depending on how the instructions are chosen, for example when instruction 2 is taken as the first root node.
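The selection loop walked through above can be sketched as a greedy list scheduler. This is a simplified illustration, not the procedure of the cited paper: the second selection rule is omitted, ties are broken by the lower instruction number, the edge list is assembled from the dependencies named in the text, and instruction 8 is assumed to be the 10-clock division. All identifiers are hypothetical.

```python
# a -> b: instruction a must execute before instruction b.
EDGES = {1: [3], 2: [3], 3: [5], 4: [5], 5: [6], 6: [], 7: [8], 8: [9], 9: []}
CLOCKS = {n: 5 for n in EDGES}
CLOCKS[8] = 10  # assumption: the division is instruction 8


def successors(node):
    found, stack = set(), list(EDGES[node])
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(EDGES[n])
    return found


def path_length(node):
    return CLOCKS[node] + max((path_length(s) for s in EDGES[node]), default=0)


def list_schedule():
    preds = {n: set() for n in EDGES}
    for a, targets in EDGES.items():
        for b in targets:
            preds[b].add(a)
    remaining, order = set(EDGES), []
    while remaining:
        last = order[-1] if order else None
        # Candidate set: nodes whose predecessors have all been scheduled.
        candidates = [n for n in remaining if not preds[n] & remaining]

        def priority(n):
            independent = last not in preds[n]  # no dependency on the last pick
            return (independent, len(successors(n)), path_length(n), -n)

        chosen = max(candidates, key=priority)
        order.append(chosen)
        remaining.remove(chosen)
    return order


ORDER = list_schedule()
print(ORDER[:3])  # [1, 2, 7]
```

The first three picks agree with the walk-through (instructions 1, 2, then 7). The remainder of the order depends on the tie-breaking assumed here and should not be read as pattern A or pattern B of FIG. 29.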

【００３１】 The instruction sequences of patterns A and B execute as shown in FIGS. 30(a) and (b), respectively, and both take 20 clocks. The execution time before instruction scheduling was 24 clocks (FIG. 28(a)), so conventional instruction scheduling shortens it by 4 clocks.

【００３２】 However, an instruction ordering such as pattern C, also shown in FIG. 29, is possible as well. FIG. 30(c) shows how pattern C executes.

【００３３】 Whereas patterns A and B take 20 clocks, the instruction sequence of pattern C takes 19 clocks, which is shorter still. Pattern C can therefore be regarded as the optimal instruction schedule.

【００３４】 In FIG. 30, where interlocks are shown hatched, no interlock occurs whether instruction 2 or instruction 7 is issued after instruction 1. In addition, instructions 2 and 7 both have the longest path length over the remaining instructions, 20 clocks. The instruction issued after instruction 1 is therefore decided by rule [3-1] as whichever of instructions 2 and 7 has more successors, and instruction 2 is selected first. In other words, the conventional instruction scheduling method never selects the instruction sequence of pattern C.

【００３５】 Meanwhile, parallel computers equipped with such compilers are widely used to simulate physical phenomena on the basis of partial differential equations.

【００３６】 When a parallel computer simulates a physical phenomenon on the basis of partial differential equations, as in semiconductor device simulation, the target region is divided into a grid, the partial differential equations are discretized at the grid points and linearized if necessary, and the problem is reduced to solving a system of simultaneous linear equations. The coefficient matrix of the system to be solved is a sparse matrix with a special structure.

【００３７】 A widely used method for solving such simultaneous linear equations is an iterative method called the preconditioned conjugate gradient method. It is described in Kenro Murata (村田健郎) et al., 「スーパーコンピュータ、科学技術計算への適用」 ("Supercomputers: Application to Scientific and Technical Computation"), Maruzen, 1985.

【００３８】 When the preconditioned conjugate gradient method is to be accelerated by vectorization or parallelization, the bottlenecks are the incomplete LU decomposition used as the preconditioner and the processing that applies the inverses of the approximate factors L and U. The latter is the so-called forward and backward substitution, which is inherently sequential.

【００３９】 One way to perform these processes at high speed is to vectorize them by exploiting the special structure of the matrices that arise when simulating a physical region divided into a grid. This is known as the hyperplane method and is described below.

【００４０】 For example, simulating the two-dimensional grid of FIG. 9 requires solving simultaneous linear equations whose coefficient matrix is a lower triangular matrix of the form shown in FIG. 10. In FIG. 10, (1) to (16) correspond to grid points 1 to 16 of FIG. 9, and the black circles indicate nonzero elements. A black circle means, for example, that the physical quantity at grid point 5 can be computed once the physical quantity at grid point 1 has been obtained.

【００４１】 The ordinary method solves for the grid points in numerical order starting from grid point 1, whereas the hyperplane method solves in the order 1, 2, 5, 3, 6, 9, 4, 7, 10, 13, and so on. This corresponds to solving the sets of grid points marked by broken lines in FIG. 9 in the order (1) to (7). The solutions at grid points on the same broken line do not depend on one another, so their computation can be vectorized; conventionally, however, this computation has been carried out on a single vector processor. In recent years, parallel computers that operate multiple processing elements (PEs) simultaneously to achieve high-speed processing have also been actively developed. Such a parallel computer is configured, for example, as shown in FIG. 31.
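The hyperplane ordering quoted above can be reproduced with a few lines of Python. The sketch assumes that the 16 grid points of FIG. 9 are numbered row by row, four per row; this is consistent with the order given in the text and with grid point 5 depending on grid point 1, but the figure itself is not reproduced here. The function name `hyperplanes` is hypothetical.

```python
def hyperplanes(rows, cols):
    """Group row-major grid point numbers (1-based) by anti-diagonal r + c = k."""
    planes = []
    for k in range(rows + cols - 1):
        plane = [r * cols + (k - r) + 1 for r in range(rows) if 0 <= k - r < cols]
        planes.append(plane)
    return planes


PLANES = hyperplanes(4, 4)
FLAT = [point for plane in PLANES for point in plane]
print(len(PLANES))  # 7
print(FLAT[:10])    # [1, 2, 5, 3, 6, 9, 4, 7, 10, 13]
```

Each inner list is one broken line of the figure: its points depend only on points in earlier lists, so they can be computed together.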

【００４２】 In FIG. 31, PE 41 is a processing element. Each PE 41 executes operations according to microcode from the array controller ACU 104. The PEs 41 are arranged by the network 103 in an m × n two-dimensional array, forming a PE array.

【００４３】 Each PE 41 is connected to the PEs 41 adjacent to it above, below, left, and right. In addition, the PE 41 at the right end of a row is connected to the PE 41 at the left end of the same row, and the PE 41 at the top of a column is connected to the PE 41 at the bottom of the same column. This is a so-called torus-shaped two-dimensional mesh connection. Each PE 41 has a data memory 42 and can exchange data with the PEs 41 connected to it, so that data stored in the data memory 42 of another PE 41 can be used in its computations.

【００４４】 All PEs 41 perform the same processing simultaneously according to the microcode sent from the ACU 44. This is SIMD (Single Instruction Multiple Data Stream) parallel processing. The ACU 44 is an information processing device with added functions for controlling the PEs 41. It decodes the instructions stored in the instruction memory 45; if a decoded instruction is intended for the PEs 41, it generates microcode for the PEs 41 and transfers it to all of them. If a decoded instruction concerns data processing in the ACU 44 itself, the ACU 44 executes it itself.

【００４５】 Here, m = n is assumed for convenience, and the case of computing the product of n × n matrices, C = A * B, on a PE array arranged as an n × n two-dimensional array is considered. The PE 41 located in row i and column j is denoted PE(i, j) (0 ≤ i ≤ n−1, 0 ≤ j ≤ n−1).

【００４６】 The data are laid out so that each PE 41 initially holds one element of each operand matrix. That is, PE(i, j) (0 ≤ i ≤ n−1, 0 ≤ j ≤ n−1) holds the initial data a[i][j] and b[i][j]. FIG. 32 shows the layout of the initial data for 4 × 4 matrices, and FIG. 33 shows the matrix product.

【００４７】 PE(i, j) is responsible for computing element c[i][j] of matrix C, and all PEs 41 execute simultaneously. The computation performed by PE(i, j) (0 ≤ i ≤ n−1, 0 ≤ j ≤ n−1) is as follows.

【００４８】
c[i][j] = 0; for k = 0 to n−1, repeat c[i][j] = c[i][j] + a[i][k] * b[k][j].   … (I)
Since PE(i, j) holds no data other than a[i][j] and b[i][j], each PE(i, j) must be sent a[i][k] (0 ≤ k ≤ n−1) from all the PEs 41 in the same row of the PE array and b[k][j] (0 ≤ k ≤ n−1) from all the PEs 41 in the same column.

【００４９】 The conventional multiplication is executed by the procedure shown in the flowchart of FIG. 35.

【００５０】 (1) First, data are transferred in the row direction. Every PE 41 simultaneously transfers the a it holds to the PE 41 on its right (steps 131, 132). That is, PE(i, j) (0 ≤ i ≤ n−1, 0 ≤ j ≤ n−1) transfers a[i][j] to the PE 41 on its right, receives a[i][j-1] from the PE 41 on its left, and stores it in the memory 42 (step 133).

【００５１】 (2) Each PE(i, j) then simultaneously transfers a[i][j-1], which it received in (1) from PE(i, j−1) on its left, to the PE 41 on its right. As a result, PE(i, j) receives a[i][j-2] from the PE 41 on its left and stores it in the memory 42. By repeating this n−1 times, PE(i, j) receives a[i][0] to a[i][n-1] (steps 133 to 135).

【００５２】 In other words, each of the n PEs 41 in a row has received data from the other n−1 PEs 41 in that row. This is called n-to-n broadcasting. It can be executed in parallel for all rows (steps 133 to 135).

【００５３】 (3) Data are transferred in the column direction in the same way, so that every PE(i, j) (0 ≤ i ≤ n−1, 0 ≤ j ≤ n−1) receives b[0][j] to b[n-1][j]. That is, n-to-n broadcasting is performed in the column direction (steps 136 to 140). FIG. 35 shows the data held by each PE 41 at this point for n = 4.

【００５４】 (4) Since all the data needed for the computation are now available, all PEs 41 simultaneously evaluate expression (I), and each PE 41 obtains its result (steps 141 to 145).

【００５５】 A PE array arranged as a two-dimensional mesh can perform n-to-n broadcasting in the row direction or the column direction at high speed. If transferring the same data from one of n PEs 41 to the other n−1 is called 1-to-n broadcasting, then one 1-to-n broadcast takes n−1 cycles, where one cycle is the time of one communication between adjacent PEs. One n-to-n broadcast likewise finishes in n−1 cycles.

【００５６】 Therefore, on a PE array with two-dimensional mesh connections, n-to-n broadcasting transfers data more efficiently than 1-to-n broadcasting. For the whole matrix multiplication, data transfer takes 2(n−1) cycles.

【００５７】 With the matrix multiplication method based on n-to-n broadcasting described above, data transfer is fast, but each PE 41 must hold in advance all the data needed for its computation. Because it ends up holding n data items from the row direction and n data items from the column direction, it needs enough memory to store 2n data items, as shown in FIG. 35. That is, a large amount of memory is required when n is large.
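The n-to-n broadcast scheme described above can be simulated sequentially to check the 2(n−1) transfer cycles and the 2n data items per PE. The sketch below models each row and each column as a ring in which every PE forwards what it last received to its neighbour. The function names and the dictionary representation of a PE's memory are hypothetical, and the transfer direction within a column is an assumption.

```python
def ring_all_to_all(values):
    """n-to-n broadcast on one ring of n PEs; returns (per-PE data, cycles)."""
    n = len(values)
    received = [{j: values[j]} for j in range(n)]  # each PE starts with its own item
    in_flight = list(enumerate(values))            # (origin, value) each PE sends next
    cycles = 0
    for _ in range(n - 1):
        # All PEs send at once; PE j receives from PE j-1 (with wrap-around).
        in_flight = [in_flight[(j - 1) % n] for j in range(n)]
        for j, (origin, value) in enumerate(in_flight):
            received[j][origin] = value
        cycles += 1
    return received, cycles


def matmul_n_to_n(a, b):
    """Return (C, transfer cycles, data items held per PE)."""
    n = len(a)
    rows = [ring_all_to_all(a[i]) for i in range(n)]                         # row rings
    cols = [ring_all_to_all([b[k][j] for k in range(n)]) for j in range(n)]  # column rings
    cycles = rows[0][1] + cols[0][1]  # rings of one direction run in parallel
    c = [[0] * n for _ in range(n)]
    items = 0
    for i in range(n):
        for j in range(n):
            a_row = rows[i][0][j]  # k -> a[i][k] held by PE(i, j)
            b_col = cols[j][0][i]  # k -> b[k][j] held by PE(i, j)
            items = len(a_row) + len(b_col)
            c[i][j] = sum(a_row[k] * b_col[k] for k in range(n))  # expression (I)
    return c, cycles, items


RESULT = matmul_n_to_n([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(RESULT)  # ([[19, 22], [43, 50]], 2, 4)
```

For n = 2 the simulation reports 2 transfer cycles and 4 stored items per PE, matching 2(n−1) and 2n.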

【００５８】[0058]

【Problems to Be Solved by the Invention】 As explained in the prior art section, with a conventional compiler, when the number of execution clocks differs from instruction to instruction, the issue of an instruction that depends on an earlier-issued instruction with many clocks is delayed, or an instruction with many clocks that depends on some other instruction is issued late, and the execution time of the whole instruction sequence becomes longer.

【００５９】 This is because the conventional method schedules on the basis of information about individual instructions and does not take into account the execution time of the whole set of dependent instructions.

【００６０】 In general, between dependent instructions, the later instruction cannot begin processing from the register read onward until the earlier instruction has completed, so issuing it any earlier does not shorten the execution time. In a conventional compiler, however, even a dependent instruction can become a candidate when the next node is selected. This wastes effort in selection, and when such an instruction is chosen first, a more useful instruction can no longer be issued.

【００６１】 Furthermore, on a conventional vector computer, even when the hyperplane method is used, the simultaneous linear equations are solved by a single processor, so in principle no speedup beyond the parallelism provided by the vector pipeline is possible. Consequently, when a much larger degree of parallelism is inherently available, as in the simulation of a three-dimensional region, it cannot be fully exploited. Methods with a high degree of parallelism, such as the SOR method, exist for parallel computers, but their numerical stability is sometimes poor.

【００６２】 In addition, a conventional parallel computer holds all the data that each PE needs for the matrix multiplication before the computation starts, so the computation cannot be executed when the memory of each PE is small.

【００６３】 The present invention solves these problems. The object of the first invention is to provide a compiler capable of instruction scheduling that, even when the number of execution clocks differs from instruction to instruction, selects the instruction sequence with the shortest total execution time, keeps the idle state of the pipeline as small as possible, and improves the efficiency of selecting the next instruction for instructions that have dependencies.

【００６４】 The object of the second invention is to provide a parallel computer that can solve simultaneous linear equations at high speed by arranging multiple processors in a one-dimensional or two-dimensional array and computing the hyperplane method in parallel.

【００６５】 The object of the third invention is to provide a parallel computer that can execute matrix multiplication even when the amount of memory per PE is small.

【００６６】[0066]

【課題を解決するための手段】上記目的を達成させるた
め、第１の発明は、並列に実行可能な複数の演算器を備
え、命令をパイプラインで処理する計算機で使用される
目的プログラムを生成するコンパイラであって、基本ブ
ロック内の命令列に対し、依存関係のある命令毎に命令
群を構成し、命令群を単位として、パイプラインの空き
状態が小さくなるように命令群の実行順序の入れ換え、
繰り上げ、あるいは繰り下げを行う命令スケジューリン
グ部を備えている。To achieve the above objects, the first invention is a compiler that generates an object program for a computer which has a plurality of arithmetic units that can operate in parallel and which processes instructions in a pipeline. The compiler includes an instruction scheduling unit that, for the instruction sequence in a basic block, forms an instruction group from each set of instructions having dependency relationships and, taking the instruction group as the unit, reorders, advances, or delays the execution of instruction groups so that idle pipeline slots are reduced.

【００６７】また、第２の発明は、シミュレーションの
対象となる物理領域が２次元の場合には、図６のように
複数のプロセッサが１次元状に配置された並列計算機を
用いる。各プロセッサは、格子点の物理データを計算す
る計算手段と、計算したデータを記憶する記憶手段と、
記憶しているデータを右に隣接するプロセッサへ送出す
る送信手段と、左に隣接するプロセッサから送出されて
きたデータを受信する受信手段とを有している。When the physical region to be simulated is two-dimensional, the second invention uses a parallel computer in which a plurality of processors are arranged one-dimensionally, as shown in FIG. 6. Each processor has calculation means for calculating the physical data of grid points, storage means for storing the calculated data, transmitting means for sending the stored data to the right-adjacent processor, and receiving means for receiving the data sent from the left-adjacent processor.

【００６８】あるいは第２の発明は、シミュレーション
の対象となる物理領域が３次元の場合には、図７のよう
に複数のプロセッサが２次元状に配置された並列計算機
を用いる。各プロセッサは、格子点のデータを計算する
計算手段と、計算したデータを記憶する記憶手段と、記
憶しているデータを右に隣接するプロセッサおよび下に
隣接するプロセッサへ送出する送信手段と、左に隣接す
るプロセッサおよび上に隣接するプロセッサから送出さ
れてきたデータを受信する受信手段とを有している。Alternatively, when the physical region to be simulated is three-dimensional, the second invention uses a parallel computer in which a plurality of processors are arranged two-dimensionally, as shown in FIG. 7. Each processor has calculation means for calculating the data of grid points, storage means for storing the calculated data, transmitting means for sending the stored data to the right-adjacent and lower-adjacent processors, and receiving means for receiving the data sent from the left-adjacent and upper-adjacent processors.

【００６９】さらに、第３の発明は、２次元状に配置さ
れた複数の演算要素プロセッサから構成される並列プロ
セッサで、２つの行列の積を求める並列計算機であっ
て、第１の行列データと第２の行列データを、該２つの
行列で同じ割り付けを行い、該行列の行方向のデータを
持つ演算プロセッサ間で、１つの演算プロセッサから他
の複数の演算プロセッサへデータを放送する手段と、該
行列の列方向のデータを持つ演算プロセッサ間で、１つ
の演算プロセッサから他の複数の演算プロセッサへデー
タを放送する手段と、繰り返し回数を制御する手段と、
該データ放送と演算プロセッサでの演算を制御する制御
手段とを備えている。Furthermore, the third invention is a parallel computer that obtains the product of two matrices on a parallel processor composed of a plurality of arithmetic element processors arranged two-dimensionally. The first matrix data and the second matrix data are allocated to the processors in the same way, and the computer comprises: means for broadcasting data from one arithmetic processor to a plurality of other arithmetic processors among the arithmetic processors holding data in the row direction of the matrix; means for broadcasting data from one arithmetic processor to a plurality of other arithmetic processors among the arithmetic processors holding data in the column direction of the matrix; means for controlling the number of repetitions; and control means for controlling the data broadcasts and the operations in the arithmetic processors.

【００７０】[0070]

【作用】上記手段により、第１の発明では、依存関係の
ある命令同士で命令群を構成し、スケジューリングを命
令群単位で行うことにより、命令によって実行時間が異
なる場合でも最適なスケジューリングを行うことができ
る。With the above means, the first invention forms instruction groups from instructions that depend on one another and performs scheduling in units of instruction groups, so that optimal scheduling is possible even when execution time differs from instruction to instruction.

【００７１】また、第１の発明によれば、上記の通り、
命令群を単位としてスケジューリングを行うため、ある
命令と依存関係のある命令の発行時間を固定することが
でき、それによって、依存関係のある命令が効果のない
段階で次の命令の候補に選ばれるという無駄がなくな
り、次命令を選択する際の効率を向上させている。According to the first invention, as described above, scheduling is performed in units of instruction groups, so the issue time of an instruction that depends on another instruction can be fixed. This eliminates the waste of a dependent instruction being picked as a candidate for the next instruction at a stage where selecting it has no effect, and improves the efficiency of selecting the next instruction.

【００７２】さらに、第１の発明では、並列に実行可能
な命令群を括弧などによって対応づけることで、発行順
序の入れ換えや発行時間の繰り上げの対象となる命令群
を限定するので、命令の選び方に冗長性をなくすことが
できる。Further, in the first invention, instruction groups that can be executed in parallel are associated with one another by parentheses or the like, which limits the instruction groups considered for reordering or for advancing the issue time, so redundancy in choosing instructions is eliminated.

【００７３】一方、第２の発明は、シミュレーション対
象が２次元の場合は、ｙ座標が同一のものを１つのプロ
セッサの記憶手段に保持させる。すなわち格子の１行を
１つのプロセッサに割り当てる。On the other hand, in the second invention, when the simulation target is two-dimensional, the grid points with the same y coordinate are held in the storage means of one processor. That is, one row of the grid is assigned to one processor.

【００７４】求解の処理手順は、まずプロセッサの受信
手段が左隣のプロセッサからデータを受け取り、受けと
ったデータと自プロセッサで前に計算した変数値、行列
要素、定数ベクトル要素を用いて格子点の変数値を計算
手段で計算する。次に計算結果を記憶手段で記憶すると
共に、右隣のプロセッサに送信手段から送出する。この
処理をすべてのプロセッサで、すべての変数が求められ
るまで繰り返す。The solution procedure is as follows. First, the receiving means of a processor receives data from the processor on its left, and the calculation means computes the variable value of a grid point from the received data, the variable values previously computed by the processor itself, the matrix elements, and the constant-vector elements. Next, the result is stored in the storage means and sent by the transmitting means to the processor on the right. Every processor repeats this process until all variables have been obtained.

【００７５】ただし、すべてのプロセッサが同時にこの
処理を開始するわけではなく、計算するのに必要なデー
タを左隣のプロセッサから受け取るまでは、待つか、こ
の求解処理とは無関係な処理を行なう。However, the processors do not all start this process at the same time. Until a processor has received the data it needs from the processor on its left, it either waits or performs processing unrelated to this solution process.

【００７６】ここで、計算手段で下三角行列を係数にも
つ連立一次方程式Ｌｘ＝ｂを解くことを考える。ここで、Ｌは下三角行列、ｘは変
数ベクトル、ｂは定数ベクトルである。Here, it is considered that the calculating means solves the simultaneous linear equations Lx = b having the lower triangular matrix as a coefficient. Here, L is a lower triangular matrix, x is a variable vector, and b is a constant vector.

【００７７】このとき、例えば、不完全ＬＵ分解（１，１）を用いた場合には、
x[i]＝(b[i] - L[i,i-1] * x[i-1] - L[i,i-m] * x[i-m])/L[i,i]
の式により、変数値を求める。不完全ＬＵ分解（１，２）を用いた場合には、
x[i]＝(b[i] - L[i,i-1] * x[i-1] - L[i,i-m] * x[i-m] - L[i,i-m+1] * x[i-m+1])/L[i,i]
の式により、変数値を求める。In this case, for example, when the incomplete LU decomposition (1,1) is used, the variable value is obtained by
x[i] = (b[i] - L[i,i-1] * x[i-1] - L[i,i-m] * x[i-m]) / L[i,i]
and when the incomplete LU decomposition (1,2) is used, it is obtained by
x[i] = (b[i] - L[i,i-1] * x[i-1] - L[i,i-m] * x[i-m] - L[i,i-m+1] * x[i-m+1]) / L[i,i]
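The two recurrences for x[i] amount to forward substitution on a sparse lower-triangular matrix. The following Python sketch is illustrative only: the function name `forward_substitute`, the 0-based indexing, the dense list-of-lists storage of L, and the reading of m as the number of grid points per row are assumptions made for this example and are not stated in the source.

```python
def forward_substitute(L, b, offsets):
    """Solve L x = b for a lower-triangular L whose row i has non-zeros
    only on the diagonal and at columns i - k for each k in `offsets`.

    Indices are 0-based here, unlike the 1-based numbering of grid points.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        for k in offsets:
            j = i - k
            if j >= 0:                 # skip terms that fall before the first unknown
                acc -= L[i][j] * x[j]
        x[i] = acc / L[i][i]
    return x


# Assumed example: a 4 x 4 grid, so m = 4 unknowns per grid row.
m = 4
ILU_1_1_OFFSETS = (1, m)          # terms x[i-1] and x[i-m]
ILU_1_2_OFFSETS = (1, m, m - 1)   # additionally x[i-m+1]
```

Entries of L that lie outside the actual stencil (for example L[i,i-1] at the first point of a grid row) are simply stored as zero in this sketch, so the same loop covers them.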

【００７８】あるいは、第２の発明は、シミュレーショ
ン対象が３次元の場合には、受信手段では、左隣のプロ
セッサおよび上隣のプロセッサが送出するデータを受け
取り、保持する。計算手段は、２次元の場合と同様に変
数値を計算し、この結果を記憶手段に記憶する。送信手
段は右隣のプロセッサと下隣のプロセッサに、計算手段
で計算した変数値を送出する。Alternatively, in the second aspect of the invention, when the simulation target is three-dimensional, the receiving means receives and holds the data transmitted by the left adjacent processor and the upper adjacent processor. The calculation means calculates the variable value as in the two-dimensional case, and stores the result in the storage means. The transmitting means transmits the variable value calculated by the calculating means to the processor on the right side and the processor on the lower side.

【００７９】さらに、第３の発明は、前記第１の行列
の、繰り返し回数で指定される１列のデータを保持する
各演算プロセッサが、各列データと同じ行に属するデー
タを保持する複数の演算プロセッサに対して各列データ
を同時並列に転送する。同様に、前記第２の行列の、繰
り返し回数で指定される１行のデータを保持する各演算
プロセッサが、各行データと同じ列に属するデータを保
持する複数の演算プロセッサに対して各行データを同時
並列に転送する。Further, in the third invention, each arithmetic processor holding data of the one column of the first matrix designated by the repetition count transfers its column data, simultaneously and in parallel, to the arithmetic processors holding data that belong to the same row. Likewise, each arithmetic processor holding data of the one row of the second matrix designated by the repetition count transfers its row data, simultaneously and in parallel, to the arithmetic processors holding data that belong to the same column.

【００８０】各演算プロセッサで、転送された前記第１
の行列のデータと、転送された前記第２のデータの２つ
の数の積を求め、この積を繰り返し毎に累算する。この
ような操作を、行列の大きさから得られる所定回数だけ
繰り返すことによって行列の積を求めている。Each arithmetic processor multiplies the transferred element of the first matrix by the transferred element of the second matrix and accumulates the product at every repetition. The matrix product is obtained by repeating this operation a predetermined number of times determined by the size of the matrices.
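The broadcast-and-accumulate scheme of the third invention can be sketched as a sequential emulation. This is a minimal sketch under simplifying assumptions that are not in the source: one matrix element per PE, square n x n matrices, and hypothetical names (`broadcast_matmul`, `row_bcast`, `col_bcast`). The two lists stand in for the values each PE would receive over the row and column broadcasts.

```python
def broadcast_matmul(A, B):
    """Emulate C = A * B on an n x n grid of PEs, PE(i, j) holding A[i][j] and B[i][j]."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]       # accumulator held by each PE
    for k in range(n):                      # repetition count selects column k of A, row k of B
        # PE(i, k) broadcasts A[i][k] to every PE in row i.
        row_bcast = [A[i][k] for i in range(n)]
        # PE(k, j) broadcasts B[k][j] to every PE in column j.
        col_bcast = [B[k][j] for j in range(n)]
        # Every PE multiplies the two received values and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += row_bcast[i] * col_bcast[j]
    return C
```

In this emulation each PE needs only its own elements of the two matrices, one accumulator, and the two values received in the current repetition, which is in line with the small per-PE memory the text aims for.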

【００８１】[0081]

【実施例】以下に、本発明の実施例を図面に基づいて説
明する。Embodiments of the present invention will be described below with reference to the drawings.

【００８２】第１の発明まず、第１の発明について図１〜５を用いて説明する。First Invention. First, the first invention will be described with reference to FIGS. 1 to 5.

【００８３】図１は、第１の発明のコンパイラに係わる
一実施例の構成を示すブロック図である。FIG. 1 is a block diagram showing the configuration of an embodiment relating to the compiler of the first invention.

【００８４】図１に示すコンパイラは、ソースプログラ
ム１を入力するソースプログラム入力部２と、入力され
たソースプログラム１の字句を解析する字句解析部３
と、ソースプログラム１の文法を解釈する構文解析部４
と、ソースプログラム１を中間言語プログラムに変換す
る中間コード生成部５と、第１の発明の中心となる命令
スケジューリング部６と他の最適化を行う最適化部７を
持った中間コード最適化部８と、中間コードを目的プロ
グラム１１に変換するオブジェクトコード生成部９と、
目的プログラム１１を出力して供給する目的プログラム
出力部１０とから構成される。The compiler shown in FIG. 1 comprises a source program input unit 2 that reads a source program 1; a lexical analysis unit 3 that analyzes the tokens of the input source program 1; a syntax analysis unit 4 that interprets the grammar of the source program 1; an intermediate code generation unit 5 that converts the source program 1 into an intermediate-language program; an intermediate code optimization unit 8 containing the instruction scheduling unit 6, which is the core of the first invention, and an optimization unit 7 that performs other optimizations; an object code generation unit 9 that converts the intermediate code into an object program 11; and an object program output unit 10 that outputs the object program 11.

【００８５】命令スケジューリング部６では、図２に示
すフローチャートに従って、中間コードの命令列のスケ
ジューリングを行う。The instruction scheduling unit 6 schedules the instruction sequence of the intermediate code according to the flowchart shown in FIG.

【００８６】以下に、図２のフローチャートの各ステッ
プについて説明する。The steps of the flowchart of FIG. 2 will be described below.

【００８７】ステップＡ：中間コードで表現された基本
ブロック内の命令に対して、データ依存関係のある命令
毎にグループ（以下単に“命令群”と呼ぶ）を構成し、
並列に実行可能な命令群を対応づける。並列に実行可能
な命令群は、一例として括弧を用いて対応づけることが
できる。Step A: For the instructions in a basic block expressed in intermediate code, form a group (hereinafter simply called an "instruction group") from each set of instructions that have data dependencies, and associate the instruction groups that can be executed in parallel. As one example, instruction groups that can be executed in parallel can be associated using parentheses.

【００８８】一命令だけで独立に演算可能な命令は、１
命令のみで命令群を構成する。An instruction that can be executed independently on its own forms an instruction group consisting of that single instruction.

【００８９】ステップＢ：各命令群に対し、最初に発行
される命令から最後に終了する命令までの総実行時間
と、各命令群の最初の命令が発行されてから、他の命令
群（他の命令群に属する命令）を発行することが可能な
時間が何クロック目であるかを算出する。Step B: For each instruction group, calculate the total execution time from the first instruction issued to the last instruction completed, and the clock cycles, counted from the issue of the group's first instruction, at which another instruction group (an instruction belonging to another instruction group) can be issued.

【００９０】ステップＣ：第１番目の命令群から着目
し、全命令群に対してステップＤ以下の処理を実行す
る。Step C: Starting with the first instruction group as the group of interest, execute the processing from Step D onward for every instruction group.

【００９１】ステップＤ：着目する命令群と並列に実行
可能な（発行順序を入れ換える、あるいは発行時間を繰
り上げることのできる）命令群が、着目する命令群以降
の命令群の中にあるか判断する。判断は括弧の対応をも
とに行う。あればステップＥへ、なければステップＧへ
移行する。Step D: Determine whether, among the instruction groups that follow the group of interest, there is one that can be executed in parallel with it (that is, whose issue order can be swapped with it or whose issue time can be advanced). The determination is based on the correspondence of the parentheses. If there is, go to Step E; if not, go to Step G.

【００９２】ステップＥ：着目する命令群に対し、ステ
ップＤで得た命令群の発行時間を繰り上げることによっ
て、２つの命令群の総実行時間を短縮することができる
ような命令群はあるか判断する。Step E: Determine whether, among the instruction groups found in Step D, there is one whose issue time can be advanced relative to the group of interest so that the total execution time of the two instruction groups is shortened.

【００９３】ここで、命令群の発行時間を繰り上げると
は、着目する命令群と並列に実行可能な命令群の発行時
間を、既に発行されている命令群の実行時間の間で他の
命令群を発行することが可能な時間に移動すること、あ
るいは、２つの命令群の発行順序を入れ換えることを表
す。Here, advancing the issue time of an instruction group means either moving the issue time of an instruction group that can be executed in parallel with the group of interest to a time, within the execution time of the already issued instruction group, at which another instruction group can be issued, or swapping the issue order of the two instruction groups.

【００９４】繰り上げによって実行時間が短縮できるよ
うな命令群が１つ以上ある場合には、ステップＦへ移行
する。該当する命令群が存在しない場合には、ステップ
Ｇへ移行する。If there is at least one instruction group whose advancement shortens the execution time, go to Step F. If there is none, go to Step G.

【００９５】ステップＦ：ステップＥで得た命令群が２
つ以上ある場合には、命令群のうちで、２つの命令群の
総実行時間を最も短縮できるものを選び、着目する命令
群に対する発行時間を繰り上げ、ステップＧへ移行す
る。Step F: If Step E found two or more instruction groups, select the one that shortens the total execution time of the two instruction groups the most, advance its issue time relative to the group of interest, and go to Step G.

【００９６】ステップＧ：全ての命令群に対して処理を
行ったかを判断し、行った場合は終了し、未処理の命令
群が残っている場合はステップＨへ移行する。Step G: It is judged whether or not all the instruction groups have been processed, and if they have been processed, the processing ends, and if there are unprocessed instruction groups, the processing shifts to Step H.
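The control flow of Steps C to G can be sketched as a simple loop. This is a hedged illustration, not the patented procedure itself: the names `plan_moves`, `parallel_with`, and `saving` are hypothetical, the saving in clocks is supplied by a caller-provided function (it would come from the Step B figures), and the sketch only records which group would be advanced for each group of interest without rewriting the instruction sequence. Step H is taken to mean moving the focus to the next instruction group.

```python
def plan_moves(groups, parallel_with, saving):
    """Return (focus, advanced_group, clocks_saved) for each group of interest.

    groups        -- instruction-group ids in their current issue order
    parallel_with -- dict: group id -> set of groups it may run in parallel with
                     (the parenthesis correspondence of Step A)
    saving        -- function (focus, other) -> clocks saved by advancing `other`
                     relative to `focus` (assumed to be derived from Step B data)
    """
    moves = []
    for pos, focus in enumerate(groups):                 # Step C, then Steps G/H
        later = groups[pos + 1:]
        # Step D: parallel-executable groups among those after the focus.
        candidates = [g for g in later if g in parallel_with.get(focus, set())]
        # Step E: keep only candidates that actually shorten the total time.
        gains = [(saving(focus, g), g) for g in candidates]
        gains = [(s, g) for s, g in gains if s > 0]
        if gains:
            # Step F: choose the candidate with the largest saving.
            best_saving, best = max(gains, key=lambda t: t[0])
            moves.append((focus, best, best_saving))
    return moves


# Toy data: the pairings are arbitrary and the savings are made-up numbers.
pairs = {1: {2}, 2: {1}, 3: {4}, 4: {3}, 6: {9}, 9: {6}}
made_up_savings = {(1, 2): 0, (3, 4): 1, (6, 9): 2}
moves = plan_moves(list(range(1, 10)), pairs,
                   lambda a, b: made_up_savings.get((a, b), 0))
print(moves)
```

Because only groups after the focus are examined, a pair such as (2, 1) is never reconsidered once group 1 has been the focus.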

【００９７】図３（ａ）のプログラム、図３（ｂ）の命
令列はそれぞれ、従来例で用いた図２７（ｂ）のプログ
ラム、図２７（ｃ）の命令列と同じである。本実施例
を、図３（ｂ）の命令列を用いて以下に説明する。The program of FIG. 3A and the instruction sequence of FIG. 3B are the same as the program of FIG. 27B and the instruction sequence of FIG. 27C used in the conventional example, respectively. This embodiment is described below using the instruction sequence of FIG. 3B.

【００９８】命令スケジューリング部６では、図３
（ｂ）の命令列に対し、図２のフローチャートに従って
最適化を行う。図３（ｂ）の命令列が命令スケジューリ
ング部６によって最適化されていく様子を、図２のフロ
ーチャートに従って以下に説明する。The instruction scheduling unit 6 optimizes the instruction sequence of FIG. 3B according to the flowchart of FIG. 2. How the instruction sequence of FIG. 3B is optimized by the instruction scheduling unit 6 is described below, following the flowchart of FIG. 2.

【００９９】（１）図３（ｂ）の命令列に対し、依存
関係のある命令毎に命令群を構成し、並列に実行可能な
命令群を対応づける（ステップＡ）。１命令だけで独立
に演算可能な命令は、１命令のみで命令群を構成する。
一例として、並列に実行可能な命令群を’（）’で対応
づけた図を図４（ａ）に、各命令群に番号付けした図を
図４（ｂ）に示す。並列に実行可能な命令群は、それぞ
れ’（）’の対応で表現されている。(1) For the instruction sequence of FIG. 3B, an instruction group is formed from each set of instructions having dependency relationships, and instruction groups that can be executed in parallel are associated with one another (Step A). An instruction that can be executed independently on its own forms an instruction group by itself. As an example, FIG. 4A shows the instruction groups that can be executed in parallel associated by '()', and FIG. 4B shows the instruction groups numbered. Instruction groups that can be executed in parallel are each represented by matching '()'.

【０１００】図４（ａ）において、１命令だけ
で’（）’に囲まれてる命令１、命令２、命令４、命令
７は、それぞれ１命令だけで独立に実行可能な命令を表
している。また、命令１、命令２、命令３の３命令で１
つの命令群（命令群３）を構成し、命令群３と命令４，
５で１つの命令群（命令群５）を構成していることを表
している。同様に命令群６、命令群８、命令群９も複数
の依存関係のある命令から構成されている。In FIG. 4A, instructions 1, 2, 4, and 7, each enclosed alone in '()', are instructions that can be executed independently on their own. The figure also shows that instructions 1, 2, and 3 together form one instruction group (instruction group 3), and that instruction group 3 together with instructions 4 and 5 forms one instruction group (instruction group 5). Similarly, instruction groups 6, 8, and 9 are each composed of a plurality of instructions having dependency relationships.

【０１０１】（２）各命令群に対し、最初に発行され
る命令から最後に終了する命令までの総実行時間と、各
命令群の最初の命令が発行されてから、他の命令群（他
の命令群に属する命令）を発行することが可能な時間が
何クロック目であるかを算出する（ステップＢ）。各命
令群の総実行時間と次命令群発行可能時間を図４（ｃ）
に示す。(2) For each instruction group, calculate the total execution time from the first instruction issued to the last instruction completed, and the clock cycles, counted from the issue of the group's first instruction, at which another instruction group (an instruction belonging to another instruction group) can be issued (Step B). FIG. 4C shows the total execution time of each instruction group and the times at which the next instruction group can be issued.

【０１０２】図４（ｃ）において、例えば命令群３は、
図２８（ａ）における命令３のＦステージを５クロック
目まで繰り下げることができるので、１，５クロック目
以外が次命令群発行可能時間であることを示している。In FIG. 4C, for example, the entry for instruction group 3 indicates that, because the F stage of instruction 3 in FIG. 28A can be delayed until the 5th clock, the next instruction group can be issued at any clock other than the 1st and 5th.

【０１０３】（３）命令群１に着目し、ステップＤ以
降の処理を行う。(3) Focusing on the instruction group 1, the processing after step D is performed.

【０１０４】（４）命令群１と並列に実行可能な命令
群を、命令群２以降の命令群の中から探す（ステップ
Ｄ）。ここでは命令群２が該当し、ステップＥへ移行す
る。(4) An instruction group that can be executed in parallel with the instruction group 1 is searched from the instruction groups after the instruction group 2 (step D). Here, the instruction group 2 is applicable, and the process proceeds to step E.

【０１０５】（５）命令群１に対して命令群２の発行
時間を繰り上げても、２つの命令の総実行時間は６クロ
ックで変化がない。発行時間を繰り上げることで実行時
間を短縮することのできる命令群はないので（ステップ
Ｅ）、ステップＧへ移行する。（６）命令群２以降の処理が残っているので、ステッ
プＨへ移行し、命令群２に着目する。(5) Even if the issuance time of the instruction group 2 is advanced with respect to the instruction group 1, the total execution time of the two instructions remains unchanged at 6 clocks. Since there is no instruction group whose execution time can be shortened by advancing the issue time (step E), the process moves to step G. (6) Since the processing after the instruction group 2 remains, the process proceeds to step H and the instruction group 2 is focused.

【０１０６】（７）命令群２と並列に実行可能な命令
群を、命令群３以降の命令群の中から探す（ステップ
Ｄ）。括弧の対応から、命令群２と並列に実行可能な命
令群は、命令群１のみであるが、ここでは命令群１は入
れ換えの対象ではないため、ステップＧ、ステップＨへ
移行する。(7) An instruction group that can be executed in parallel with instruction group 2 is searched for among instruction group 3 and later (Step D). According to the parentheses, the only instruction group that can be executed in parallel with instruction group 2 is instruction group 1, but instruction group 1 is not a candidate for swapping here, so the process goes to Steps G and H.

【０１０７】（８）命令群３と並列に実行可能な命令
群を、命令群４以降の命令群の中から探す（ステップ
Ｄ）。ここでは、命令群４が該当し、ステップＥへ移行
する。(8) An instruction group that can be executed in parallel with instruction group 3 is searched for among instruction group 4 and later (Step D). Here, instruction group 4 qualifies, and the process goes to Step E.

【０１０８】（９）この場合、命令群３に対し、命令
群４の発行時間を繰り上げても、命令群３と命令群４の
総実行時間は変化しない。しかし、後から命令群９をス
ケジューリングする際に、命令３のＦステージを３クロ
ック繰り下げ、命令群９を４クロック繰り上げると、命
令群９の中の命令８の発行時間が命令４の発行時間とか
ち合ってしまう。(9) In this case, advancing the issue time of instruction group 4 relative to instruction group 3 does not change the total execution time of instruction groups 3 and 4. However, when instruction group 9 is scheduled later, if the F stage of instruction 3 is delayed by 3 clocks and instruction group 9 is advanced by 4 clocks, the issue time of instruction 8 in instruction group 9 collides with the issue time of instruction 4.

【０１０９】命令８は、命令７と依存関係があるので、
発行を早めることができないため、命令４の発行時間を
１クロック早めることで、命令群９の実行時間を短縮す
ることができる。ステップＦへ移行する。Because instruction 8 depends on instruction 7, its issue cannot be advanced. Advancing the issue time of instruction 4 by one clock therefore shortens the execution time of instruction group 9. The process goes to Step F.

【０１１０】（１０）命令４の発行時間を１クロック
繰り上げ、命令３の発行時間を３クロックは繰り下げて
ステップＧ，Ｈへ移行する。(10) The issue time of instruction 4 is advanced by 1 clock, the issue time of instruction 3 is delayed by 3 clocks, and the process goes to Steps G and H.

【０１１１】（１１）命令群４と並列に実行可能な命
令群は、命令群３のみであるので、ここでは対象となら
ない。ステップＧ，Ｈへ移行する。(11) Since the instruction group 3 is the only instruction group that can be executed in parallel with the instruction group 4, it is not a target here. Go to steps G and H.

【０１１２】（１２）命令群５と並列に実行可能な命
令群はないため、ステップＧ，Ｈへ移行する。(12) Since there is no instruction group that can be executed in parallel with the instruction group 5, the process proceeds to steps G and H.

【０１１３】（１３）命令群６と並列に実行可能な命
令群は、命令群９である。該当する命令群があるので、
ステップＥへ移行する。(13) The instruction group that can be executed in parallel with instruction group 6 is instruction group 9. Since such an instruction group exists, the process goes to Step E.

【０１１４】（１４）命令群６に対し、命令群９の発
行時間を繰り上げると、２つの命令群の総実行時間は短
縮されるので、ステップＦへ移行する。(14) When the issue time of the instruction group 9 is advanced with respect to the instruction group 6, the total execution time of the two instruction groups is shortened, so that the process proceeds to step F.

【０１１５】（１５）ステップＦでは、先に発行され
た命令群に対して、後続の命令の発行時間を繰り上げる
場合と、２つの命令の発行時間を入れ換える場合のどち
らか、より総実行時間を短縮できる方を選択する。(15) In Step F, one of two options is chosen, whichever shortens the total execution time more: advancing the issue time of the subsequent instruction relative to the previously issued instruction group, or swapping the issue times of the two instructions.

【０１１６】命令群６を先に発行し、命令群９の発行時
間を繰り上げるよりも、２つの命令群の発行時間を入れ
換える方が、より２つの命令群の総実行時間を短縮する
ことができるため、このステップでは、２つの命令群の
発行時間を入れ換える。ステップＧ，Ｈへ移行する。The total execution time of two instruction groups can be shortened more by exchanging the issue times of the two instruction groups than by issuing the instruction group 6 first and advancing the issue time of the instruction group 9. Therefore, in this step, the issue times of the two instruction groups are exchanged. Go to steps G and H.

【０１１７】（１６）命令群７と並列に実行可能な命
令群はないので、ステップＧ，Ｈへ移行する。(16) Since there is no instruction group that can be executed in parallel with the instruction group 7, the process proceeds to steps G and H.

【０１１８】（１７）同様に命令群８と並列に実行可
能な命令群もないので、ステップＧ，Ｈへ移行する。(17) Similarly, since there is no instruction group that can be executed in parallel with the instruction group 8, the process shifts to steps G and H.

【０１１９】（１８）命令群９と並列に実行可能な命
令群は、命令群６であるが、ここでは対象とならないた
め、ステップＧへ移行し、全命令群に対して処理を行っ
たので終了する。(18) The instruction group that can be executed in parallel with instruction group 9 is instruction group 6, but it is not a candidate here. The process goes to Step G and, since all instruction groups have been processed, terminates.

【０１２０】以上の処理により、図３（ｂ）の命令列は
図５（ａ）の命令列に変換され、実行は図５（ｂ）にな
り、命令列の総実行時間は、命令スケジューリングを行
う前には２４クロックであったのに対し、１９クロック
に短縮され、従来の命令スケジューリング方法では２０
クロックであったのに比べても１クロック短縮すること
ができる。Through the above processing, the instruction sequence of FIG. 3B is converted into the instruction sequence of FIG. 5A, which executes as shown in FIG. 5B. The total execution time of the instruction sequence, 24 clocks before instruction scheduling, is reduced to 19 clocks, one clock shorter than the 20 clocks obtained with the conventional instruction scheduling method.

【０１２１】第２の発明次に、第２の発明について図６〜１６を用いて説明す
る。Second Invention. Next, the second invention will be described with reference to FIGS. 6 to 16.

【０１２２】まず、シミュレーションの対象となる領域
を４×４の２次元の格子状に分割し、不完全ＬＵ分解
（１，１）を適用した場合について説明する。First, a case will be described in which the area to be simulated is divided into a 4 × 4 two-dimensional lattice and the incomplete LU decomposition (1, 1) is applied.

【０１２３】不完全ＬＵ分解（１，１）で得られる下三
角行列は、図１０に示すような形となる。この下三角行
列を係数行列にもつ連立一次方程式を解く場合、図１１
に示すような変数求解の依存関係が得られる。図１１に
おいて、ｉ→ｊは、ｊを計算するためには、ｉの値が必
要であることを示している。The lower triangular matrix obtained by the incomplete LU decomposition (1,1) has the form shown in FIG. 10. Solving the simultaneous linear equations that have this lower triangular matrix as the coefficient matrix gives the dependencies among the variables shown in FIG. 11. In FIG. 11, i → j indicates that the value of i is required to calculate j.

【０１２４】図６は、この連立一次方程式を解くための
並列計算機である。この並列計算機は、格子の行数分の
プロセッサ（１）〜（４）が１次元状に配置されてい
る。第ｉプロセッサは、第（ｉ＋１）プロセッサへデー
タを転送することが可能となっている。FIG. 6 shows a parallel computer for solving the simultaneous linear equations. In this parallel computer, processors (1) to (4) corresponding to the number of rows of a grid are arranged one-dimensionally. The i-th processor can transfer data to the (i + 1) -th processor.

【０１２５】各プロセッサは、自分自身のローカルな記
憶手段となるメモリ１３と、図示していないが受信手
段、計算手段、及び送信手段をもつ。Each processor has a memory 13 serving as its own local storage means, and a reception means, a calculation means, and a transmission means (not shown).

【０１２６】この計算機上で、上述の連立一次方程式を
解くためには、まず、各格子点のデータを各格子点の計
算を担当するプロセッサのローカルメモリ１３に配置す
る。第１プロセッサ(1) には、第１行の格子点（１〜
４）のデータを配置する。第２プロセッサ(2) には、第
２行の格子点（５〜８）のデータを配置する。第３、第
４のプロセッサ(3),(4) についても同様に、第３行の格
子点（９〜１２）、第４行の格子点（１３〜１６）を配
置する。To solve the above simultaneous linear equations on this computer, the data of each grid point is first placed in the local memory 13 of the processor responsible for that grid point. The data of the grid points in the first row (1 to 4) is placed in the first processor (1), and the data of the grid points in the second row (5 to 8) in the second processor (2). Likewise, the grid points in the third row (9 to 12) and the fourth row (13 to 16) are placed in the third and fourth processors (3) and (4).

【０１２７】図８は、データが配置された並列計算機の
大まかな動作を表すフローチャートである。まずプロセ
ッサの受信手段が隣接プロセッサからデータを受け取り
（ステップ９１）、受けとったデータを用いて格子点の
変数値を計算手段で計算する（ステップ９２）。次に計
算結果を記憶手段で記憶すると共に、隣接プロセッサに
送信手段から送出する（ステップ９３）。この処理を、
このプロセッサが担当するすべての変数が計算されるま
で繰り返す（ステップ９４）。FIG. 8 is a flowchart outlining the operation of the parallel computer once the data has been placed. First, the receiving means of a processor receives data from the adjacent processor (step 91), and the calculation means uses the received data to calculate the variable value of a grid point (step 92). Next, the result is stored in the storage means and sent by the transmitting means to the adjacent processor (step 93). This is repeated until all the variables for which the processor is responsible have been calculated (step 94).
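The receive, compute, store-and-send loop of FIG. 8 can be emulated sequentially for the incomplete LU decomposition (1,1) case. The sketch below is an illustration under stated assumptions: processors advance in lockstep stages, a value sent in one stage becomes available to the neighbour in the next stage, and the names `solve_on_chain`, `diag`, `west`, and `north` are hypothetical. Here `diag[r][c]`, `west[r][c]`, and `north[r][c]` play the roles of L[i,i], L[i,i-1], and L[i,i-m] for the grid point in row r, column c (0-based).

```python
def solve_on_chain(diag, west, north, b):
    """Emulate a 1-D chain of processors, one grid row per processor.

    Returns (x, stages): the solved grid values and the number of lockstep stages.
    """
    rows, cols = len(b), len(b[0])
    x = [[0.0] * cols for _ in range(rows)]   # local memory of each processor
    inbox = [[] for _ in range(rows)]         # values received from processor r - 1
    nxt = [0] * rows                          # next column each processor will compute
    stages = 0
    while nxt[-1] < cols:                     # the last processor finishes last
        stages += 1
        outgoing = []
        for r in range(rows):
            c = nxt[r]
            if c >= cols:
                continue                      # this processor has finished its row
            if r > 0 and not inbox[r]:
                continue                      # required data not yet received: wait
            up = inbox[r].pop(0) if r > 0 else 0.0            # step 91: receive
            left = x[r][c - 1] if c > 0 else 0.0
            x[r][c] = (b[r][c] - west[r][c] * left
                       - north[r][c] * up) / diag[r][c]       # step 92: compute
            if r + 1 < rows:
                outgoing.append((r + 1, x[r][c]))             # step 93: store and send
            nxt[r] += 1                                       # step 94: next variable
        for dest, value in outgoing:          # deliveries become visible next stage
            inbox[dest].append(value)
    return x, stages
```

Because each processor sends its results in column order, the c-th value popped from an inbox is always the value for column c of the row above, so no tagging of messages is needed in this emulation.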

【０１２８】以下に、各プロセッサの具体的な動作を説
明する。The specific operation of each processor will be described below.

【０１２９】まず、第１段階では、第１プロセッサで、
第１格子点の変数を計算する。第２〜４プロセッサでは
何もしない。計算が終了した後、第１格子点の変数値を
第２プロセッサへ転送する。First, in the first stage, the first processor calculates the variable of the first grid point. The second to fourth processors do nothing. After the calculation is completed, the variable value of the first grid point is transferred to the second processor.

【０１３０】第２段階では、第１プロセッサで、第１段
階で計算された変数値を用いて第２格子点の変数を計算
する。このとき同時に、第２プロセッサでは、第５格子
点の変数を計算する。このとき、第１段階で第１プロセ
ッサから転送されてきた第１格子点の変数値を用いる。
計算が終了した後、第１のプロセッサから、第２格子点
の変数の値を第２プロセッサへ、第２プロセッサから第
５格子点の変数の値を第３プロセッサへそれぞれ転送す
る。In the second stage, the first processor calculates the variable of the second lattice point using the variable value calculated in the first stage. At the same time, the second processor calculates the variable of the fifth lattice point. At this time, the variable value of the first grid point transferred from the first processor in the first stage is used.
After the calculation is completed, the value of the variable at the second lattice point is transferred from the first processor to the second processor, and the value of the variable at the fifth lattice point is transferred from the second processor to the third processor.

【０１３１】第３段階では、第１プロセッサで、第２段
階で計算された変数値を用いて第３格子点の変数を計算
する。同時に、第２プロセッサでは第６格子点の変数、
第３プロセッサでは第９格子点の変数をそれぞれ計算す
る。それぞれのプロセッサで計算された格子点の変数の
値は、それぞれの隣接するプロセッサへ転送する。In the third stage, the first processor calculates the variable of the third grid point using the variable value calculated in the second stage. At the same time, the second processor calculates the variable of the sixth grid point and the third processor calculates the variable of the ninth grid point. The variable values calculated by each processor are transferred to its adjacent processor.

【０１３２】以降、同様の処理を、第７段階で、第１６
格子点の変数が計算されるまで繰り返す。Thereafter, the same processing is repeated until the variable of the 16th grid point is calculated in the seventh stage.

【０１３３】この計算の結果、第ｉ行の格子の変数の値
は、第ｉプロセッサのローカルメモリ１３に得られる。As a result of this calculation, the values of the variables of the grid on the i-th row are obtained in the local memory 13 of the i-th processor.

【０１３４】このときの計算の様子は、図１１から分か
るように、同一行に横方向に並んでいる格子点は、１つ
のプロセッサに割り当てられ、同一列に縦方向に並んで
いる格子点は、各プロセッサにおいて同時に計算されて
いる。全体的には、左側の列の格子点から計算が開始さ
れる。As can be seen from FIG. 11, the state of the calculation at this time is such that the lattice points arranged in the same row in the horizontal direction are assigned to one processor, and the lattice points arranged in the same column in the vertical direction are , Are calculated simultaneously in each processor. Overall, the calculation starts from the grid points in the left column.

【０１３５】各列の上に示した数字(1) 〜(7) は、その
列上の格子点の変数が第何段階に計算が行なわれるかを
示している。プロセッサ間のデータ転送は、斜め方向の
矢印がある格子点間で行なわれている。横方向の矢印
は、プロセッサ間のデータ転送を表しているのではな
い。The numbers (1) to (7) shown above each column indicate at what stage the variables of the lattice points on that column are calculated. Data transfer between processors is performed between lattice points having diagonal arrows. The horizontal arrows do not represent data transfers between processors.

【０１３６】データの転送に要する時間が無視できると
すると、１つのプロセッサだけで計算した場合に比べ、
７／１６＝０．４４倍の計算時間ですむことになる。Assuming that the time required for data transfer is negligible, the calculation takes 7/16 ≈ 0.44 times as long as when it is performed by a single processor.

【０１３７】次に、同じ２次元領域について、不完全Ｌ
Ｕ分解（１，２）を適用した場合について説明する。Next, the case where the incomplete LU decomposition (1,2) is applied to the same two-dimensional region will be described.

【０１３８】不完全ＬＵ分解（１，２）を用いた場合の
下三角行列は、図１２のようになる。この行列を係数に
もつ連立一次方程式を解くために用いる並列計算機は、
不完全ＬＵ分解（１，１）の場合と同じである。各格子
点のデータも同じように配置する。解の依存関係は、図
１３のようになる。The lower triangular matrix obtained with the incomplete LU decomposition (1,2) is as shown in FIG. 12. The parallel computer used to solve the simultaneous linear equations with this matrix as the coefficient matrix is the same as for the incomplete LU decomposition (1,1), and the data of each grid point is placed in the same way. The dependencies of the solution are as shown in FIG. 13.

【０１３９】まず、第１段階では、第１プロセッサで、
第１格子点の変数を計算する。このとき第２〜４プロセ
ッサでは何もしない。計算が終了した後、第１格子点の
変数値を第２プロセッサへ転送し、保持する。First, in the first stage, the first processor calculates the variable of the first grid point. The second to fourth processors do nothing at this time. After the calculation is completed, the variable value of the first grid point is transferred to the second processor, which holds it.

【０１４０】第２段階では、第１プロセッサで、第１段
階で計算された変数値を用いて第２格子点の変数を計算
する。このとき第２〜４プロセッサは何もしない。計算
が終了した後、計算した第２格子点の変数値を第２のプ
ロセッサへ転送し、保持する。In the second stage, the first processor calculates the variable of the second lattice point using the variable value calculated in the first stage. At this time, the second to fourth processors do nothing. After the calculation is completed, the calculated variable value of the second grid point is transferred to the second processor and held.

【０１４１】第３段階では、第１プロセッサで、第２段
階で計算された変数値を用いて第３格子点の変数を計算
する。これと同時に第２プロセッサでは、第５格子点の
変数を計算する。このとき、第１段階及び第２段階で計
算された第１及び第２格子点の変数値を用いる。In the third stage, the first processor calculates the variable of the third lattice point using the variable value calculated in the second stage. At the same time, the second processor calculates the variable of the fifth lattice point. At this time, the variable values of the first and second lattice points calculated in the first step and the second step are used.

【０１４２】 The third and fourth processors do nothing. After the calculation finishes, the first processor transfers the variable value of the third grid point to the second processor, which holds it. The second processor transfers the variable value of the fifth grid point to the third processor, which holds it.

【０１４３】 Thereafter, the same processing is repeated until the fourth processor has calculated the variable of the 16th grid point.

【０１４４】 As a result of this calculation, as in the case of incomplete LU decomposition (1,1), the values of the variables of the grid points in the i-th row are obtained in the local memory 13 of the i-th processor.

【０１４５】 As FIG. 13 shows, the calculation proceeds as follows: grid points lying side by side in the same row are assigned to a single processor, and grid points lying one above another in the same column are calculated simultaneously, each on its own processor. Overall, the calculation starts from the grid points in the left-hand column.

【０１４６】 The numbers (1) to (10) shown above the columns indicate the stage in which the variables of the grid points in that column are calculated. Data transfer between processors takes place between the grid points joined by diagonal arrows. Two arrows leave each grid point, but this does not mean that inter-processor data transfer is performed twice: if the data is transferred immediately after the result for the grid point is obtained and is stored by the destination processor, a single transfer suffices. The horizontal arrows do not represent data transfer between processors.

【０１４７】 If the time required for data transfer is negligible, the calculation takes 10/16 ≈ 0.63 times as long as when a single processor is used.
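The stage count quoted above can be reproduced with a short sketch. The following Python fragment is illustrative only and is not part of the disclosed apparatus. The function name `wavefront_stages` and the dependency offsets (same row one column back, previous row same column, previous row next column) are assumptions read off the stage-by-stage description of incomplete LU decomposition (1,2); FIG. 13 itself is not reproduced here.

```python
from itertools import product

def wavefront_stages(shape, deps):
    """Earliest stage (1-based) at which each grid point can be solved.

    shape: grid extent per axis, e.g. (rows, cols).
    deps:  offsets of the points each point depends on; every offset must
           refer to a point that comes earlier in row-major order.
    """
    stage = {}
    for p in product(*(range(s) for s in shape)):  # row-major order
        latest = 0
        for d in deps:
            q = tuple(a + b for a, b in zip(p, d))
            if all(0 <= c < s for c, s in zip(q, shape)):
                latest = max(latest, stage[q])
        stage[p] = latest + 1  # one stage after the last value it waits for
    return stage

# Assumed dependency pattern for incomplete LU (1,2), offsets as (row, col).
ilu12 = wavefront_stages((4, 4), [(0, -1), (-1, 0), (-1, 1)])
n_stages = max(ilu12.values())
print(n_stages, n_stages / 16)  # prints: 10 0.625
```

With 0-based indices, grid point 5 is `(1, 0)` and lands in stage 3, in agreement with the third stage described above.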

【０１４８】 Next, the case where incomplete LU decomposition (1,1,1) is applied to a three-dimensional simulation region of size 3 × 3 × 3, as shown in FIG. 14, is described. The lower triangular matrix obtained by incomplete LU decomposition (1,1,1) is as shown in FIG. 15. When the simultaneous linear equations having this lower triangular matrix as the coefficient matrix are solved, the dependencies among the solution variables are as shown in FIG. 16.

【０１４９】 FIG. 7 shows a parallel computer for solving these simultaneous linear equations. In this parallel computer, 3 × 3 processors (1,1) to (3,3) are arranged two-dimensionally. Processor (i,j) can transfer data to its two adjacent processors, (i+1,j) and (i,j+1). Although not shown, each processor has a memory serving as its own local storage means, a receiving means, a calculating means, and a transmitting means.

【０１５０】 To solve the above simultaneous linear equations on this computer, the data of the grid points is placed in the local memories of the processors as follows: the data of the three grid points that share the same y and z coordinates is placed on one processor. For example, the data of grid points 1 to 3 is placed on processor (1,1), the data of grid points 4 to 6 on processor (1,2), and the data of grid points 10 to 12 on processor (2,1).

【０１５１】 First, in the first stage, processor (1,1) calculates the variable of the first grid point. The other processors do nothing at this time. After the calculation finishes, the calculated variable value of the first grid point is transferred to the two adjacent processors, (1,2) and (2,1).

【０１５２】 In the second stage, processor (1,1) calculates the variable of the second grid point, processor (1,2) that of the fourth grid point, and processor (2,1) that of the tenth grid point.

【０１５３】 After the calculation finishes, each processor transfers the variable value of the grid point it calculated to its right-hand and lower adjacent processors. That is, processor (1,1) transfers its result to processors (1,2) and (2,1), processor (1,2) to processors (1,3) and (2,2), and processor (2,1) to processors (2,2) and (3,1).

【０１５４】 Thereafter, the same processing is repeated until the variable of the 27th grid point is calculated in the seventh stage.

【０１５５】 As a result of the above calculation, the variable value of each grid point is obtained on the processor on which the data of that grid point was placed.

【０１５６】 As FIG. 16 shows, the calculation proceeds as follows: grid points lying side by side in the same row are assigned to a single processor, and grid points lying one above another in the same column are calculated simultaneously, each on its own processor. Overall, the calculation starts from the grid points in the left-hand column.

【０１５７】 The numbers (1) to (7) shown above the columns indicate the stage in which the variables of the grid points in that column are calculated. Data transfer between processors takes place between the grid points joined by diagonal arrows. The horizontal arrows do not represent data transfer between processors.

【０１５８】 If the time required for data transfer is negligible, the calculation takes 7/27 ≈ 0.26 times as long as when a single processor is used. For a 40 × 40 × 40 region, 40 × 40 processors take 118/64000 ≈ 0.0018 times as long.
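The figures of 7 and 118 stages can be checked with a small simulation of the processor array. This is a sketch under stated assumptions, not the disclosed implementation: the function name `simulate_3d` is hypothetical, processor indices are 0-based rather than the 1-based (1,1) to (3,3) of the text, and each grid point is assumed to need the preceding point on its own processor plus the same-position point from the processors at (i-1, j) and (i, j-1), which forward each result as soon as it is ready.

```python
def simulate_3d(n):
    """Count stages for an n x n x n region on an n x n processor array."""
    done = set()                      # (i, j, x) points already solved
    next_x = {(i, j): 0 for i in range(n) for j in range(n)}
    stage = 0
    while any(x < n for x in next_x.values()):
        stage += 1
        solved_now = []
        for (i, j), x in next_x.items():
            if x == n:
                continue              # this processor has finished its points
            # values that must already have arrived from the two neighbours
            needs = [(a, b, x) for a, b in ((i - 1, j), (i, j - 1))
                     if a >= 0 and b >= 0]
            if all(q in done for q in needs):
                solved_now.append((i, j, x))
        # results become visible to neighbours only from the next stage on
        for i, j, x in solved_now:
            done.add((i, j, x))
            next_x[(i, j)] = x + 1
    return stage

print(simulate_3d(3), simulate_3d(40))  # prints: 7 118
```

Under these assumptions the point at position x on processor (i, j) is solved in stage i + j + x + 1, so an n × n × n region needs 3(n - 1) + 1 stages.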

【０１５９】 In the second invention, each matrix element may itself be a small matrix, as in the so-called coupled method used in semiconductor device simulation; the invention can be implemented in the same way in that case.

【０１６０】 Third Invention
Finally, the third invention is described with reference to FIGS. 17 to 25.

【０１６１】 FIG. 17 is a block diagram showing the configuration of an embodiment of the parallel computer of the third invention.

【０１６２】 In the figure, the parallel computer of the third invention consists broadly of an arithmetic control unit 21 and an arithmetic execution unit 22, which are connected by a data bus 23. The arithmetic control unit 21 is connected to an instruction memory 24. The instruction memory 24 stores a program that describes the matrix multiplication procedure.

【０１６３】 The arithmetic control unit 21 reads an instruction from the instruction memory 24 and decodes it. If the instruction specifies an operation in the arithmetic execution unit 22, it is expanded into microcode for the PEs and transferred to the arithmetic execution unit 22 via the data bus 23.

【０１６４】 If the instruction specifies processing within the arithmetic control unit 21, such as updating a loop variable, the processing is performed within the arithmetic control unit 21.

【０１６５】 The arithmetic execution unit 22 consists of a plurality of processing elements (PEs) 25. Each PE 25 has its own data memory 26 and processes the data in that memory. The PEs 25 are connected to one another by a network 27 and can communicate with other PEs 25. The PEs 25 operate according to the microcode broadcast from the arithmetic control unit 21.

【０１６６】 The network 27 connects the PEs 25 and enables communication between them. Well-known forms include the two-dimensional lattice connection shown in FIG. 18, the x-y bus connection shown in FIG. 19, and the hypercube connection (not shown).

【０１６７】 The description below takes the two-dimensional lattice connection of FIG. 18 as an example, but it also applies to other network forms such as the x-y bus connection and the hypercube connection.

【０１６８】 The arithmetic control unit 21 can control the PEs 25 of the arithmetic execution unit 22 in the following ways.

【０１６９】 (1) All PEs 25 perform the same processing simultaneously.

【０１７０】 (2) At least, all PEs 25 in one row or in one column of the two-dimensionally arranged PEs 25 are selected, and only the selected PEs 25 perform the same processing simultaneously.
The procedure for multiplying two matrices using the parallel computer described above is now explained.

【０１７１】 Here, consider computing the product C = A * B of n × n matrices on an n × n two-dimensional PE array. The PE 25 located in row i and column j is denoted PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1). As initial data, the data is arranged so that each PE 25 holds one element of each of the matrices A and B.

【０１７２】 That is, PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) holds the initial data a[i][j] and b[i][j]. Each PE(i,j) is responsible for computing element c[i][j] of matrix C, and all PEs 25 compute simultaneously. The operation performed by PE(i,j) is as follows.

【０１７３】
c[i][j] = 0;
for k = 0 to n-1, repeat c[i][j] = c[i][j] + a[i][k] * b[k][j];   ... (I)
Multiplication procedure 1 is described below with reference to the flowchart of FIG. 20.

【０１７４】 (1) Each PE(i,j) initializes c by setting c[i][j] = 0 (step 101).

【０１７５】 (2) The first operation performed by PE(i,j) is
c[i][j] = c[i][j] + a[i][0] * b[0][j];   ... (II)

【０１７６】 Here, for PE(i,j) (1 ≤ i ≤ n-1, 1 ≤ j ≤ n-1), a[i][0] and b[0][j] are not in its own data memory 26, so data transfer is required. a[i][0] is held by PE(i,0) in the first column (0 ≤ i ≤ n-1), and b[0][j] is held by PE(0,j) (1 ≤ j ≤ n-1).

【０１７７】 First, each PE(i,0) in the first column (0 ≤ i ≤ n-1) transfers a[i][0] to the other PEs in the same row, PE(i,1) to PE(i,n-1). This takes the form of a 1-to-n broadcast.

【０１７８】 The 1-to-n broadcast of a[i][0] proceeds as follows: PE(i,0) first transfers a[i][0] to the adjacent PE(i,1), PE(i,1) then transfers a[i][0] to PE(i,2), and so on, until a[i][0] is finally transferred to PE(i,n-1). Because the broadcast data a[i][0] is shifted from PE to PE in this way, the broadcast completes in n-1 cycles.

【０１７９】 This 1-to-n broadcast in the row direction can be executed simultaneously in all rows (0 ≤ i ≤ n-1) (steps 102 to 103). FIG. 21 shows the 1-to-n broadcast in the row direction for n = 4.
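The shift-based 1-to-n broadcast in the row direction can be sketched as follows. This is a minimal illustration, assuming a plain Python list of lists stands in for the PE array; the function name `broadcast_first_column` and the `recv` buffer are hypothetical and do not appear in the specification.

```python
def broadcast_first_column(a):
    """Shift a[i][0] along every row at once; return (received data, cycles).

    recv[i][j] is the value PE(i, j) holds after the broadcast.
    """
    n = len(a)
    recv = [[None] * n for _ in range(n)]
    for i in range(n):
        recv[i][0] = a[i][0]          # PE(i, 0) already owns a[i][0]
    cycles = 0
    for j in range(1, n):             # one shift cycle per column step
        for i in range(n):            # all rows shift in the same cycle
            recv[i][j] = recv[i][j - 1]
        cycles += 1
    return recv, cycles

a = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]
recv, cycles = broadcast_first_column(a)
print(cycles)    # prints: 3
print(recv[2])   # prints: [31, 31, 31, 31]
```

Because every row shifts in the same cycle, the cycle count depends only on the row length and not on the number of rows.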

【０１８０】 (3) Next, to transfer b[0][j], each PE(0,j) in the first row (0 ≤ j ≤ n-1) performs a 1-to-n broadcast in the same way to the other PEs in the same column, PE(1,j) to PE(n-1,j) (step 104). This 1-to-n broadcast can be executed simultaneously in all columns. FIG. 22 shows the 1-to-n broadcast in the column direction for n = 4.

【０１８１】 (4) Through these two 1-to-n broadcasts, each PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) has obtained a[i][0] and b[0][j], so it performs the first operation, equation (II) (step 105).

【０１８２】 (5) The second operation is
c[i][j] = c[i][j] + a[i][1] * b[1][j];   ... (III)

【０１８３】 That is, a[i][1] and b[1][j] must be transferred to PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1). To do this, each PE(i,1) in the second column (0 ≤ i ≤ n-1) broadcasts a[i][1] 1-to-n to the other PEs 25 in the same row, and then each PE(1,j) in the second row (0 ≤ j ≤ n-1) broadcasts b[1][j] 1-to-n to the other PEs 25 in the same column.

【０１８４】 FIGS. 23 and 24 show, for n = 4, the 1-to-n broadcasts from the second column (in the row direction) and from the second row (in the column direction). Each PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) then performs the operation of equation (III).

【０１８５】 (6) By repeating the above n times, each PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) obtains c[i][j], its element of the matrix product (steps 103 to 107).
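Steps (1) to (6) of multiplication procedure 1 can be sketched in a few lines. This is an illustrative model rather than the disclosed microcode: the function name `multiply_procedure_1` and the buffers `recv_a` and `recv_b` are hypothetical, and each broadcast is modelled as an instantaneous copy instead of cycle-by-cycle shifting between neighbouring PEs.

```python
def multiply_procedure_1(a, b):
    """Model of procedure 1 on an n x n PE array.

    PE(i, j) keeps only a[i][j], b[i][j], the two values received in the
    current iteration, and the running sum c[i][j].
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]                   # step (1): c = 0
    for k in range(n):
        # Row broadcast: PE(i, k) sends a[i][k] along row i.
        recv_a = [[a[i][k] for _ in range(n)] for i in range(n)]
        # Column broadcast: PE(k, j) sends b[k][j] along column j.
        recv_b = [[b[k][j] for j in range(n)] for _ in range(n)]
        # Every PE accumulates one product, as in equations (II) and (III).
        for i in range(n):
            for j in range(n):
                c[i][j] += recv_a[i][j] * recv_b[i][j]
    return c

print(multiply_procedure_1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints: [[19, 22], [43, 50]]
```

The two receive buffers are overwritten in every iteration, which is why the per-PE storage in this model does not grow with n.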

【０１８６】 With multiplication procedure 1 above, the result is accumulated in each PE 25 every time an operation is performed, so the PE does not need to hold all the data in advance, and matrix multiplication can be executed with a small amount of memory. The memory 26 required by each PE 25 is two initial data items, two data items transferred in each iteration, and one item for the result c, five data items in total, independent of n. In other words, even when n becomes large, the memory needed by each PE 25 remains small.

【０１８７】 However, in multiplication procedure 1 the time required for data broadcast is longer than in the conventional method. In this procedure, each iteration performs one 1-to-n broadcast in the row direction and one in the column direction. Since a 1-to-n broadcast takes n-1 cycles, the data transfer time for the whole matrix multiplication is 2*n*(n-1) cycles. In the conventional method, by contrast, data transfer takes 2*(n-1) cycles, so procedure 1 becomes more disadvantageous in data transfer time as n increases.

【０１８８】 The following solution is therefore conceivable. In the row direction, all the data needed for the operation is gathered and held in advance by an n-to-n broadcast, as in the conventional method. In the column direction, 1-to-n broadcasts are repeated as in multiplication procedure 1 above. This is described as multiplication procedure 2 with reference to the flowchart of FIG. 25.

【０１８９】 (1) First, all the necessary data in the row direction is acquired. That is, PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) acquires a[i][0] to a[i][n-1] from the other PEs 25 in the same row. The method is the same as the conventional one: first, all PEs 25 simultaneously transfer their own data to the PE 25 on their right (steps 111 to 113).

【０１９０】 Next, each PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) transfers a[i][j-1], which it received from PE(i,j-1) on its left, to the PE 25 on its right. As a result, PE(i,j) receives a[i][j-2] from the PE 25 on its left and stores it in the memory 26.

【０１９１】 By repeating this processing n-1 times, PE(i,j) can receive a[i][0] to a[i][n-1] (steps 114 to 116). In this way, all the necessary data in the row direction can be acquired. This n-to-n broadcast can be executed in parallel in every row, and takes n-1 cycles.
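The n-to-n broadcast within one row can be sketched as a sequence of right shifts. This is a sketch under an assumption the text does not state: the row is treated as a ring, so that the rightmost PE passes its data on to the leftmost one. Without some such wrap-around, PE(i,0) would have no left neighbour to receive from. The function name `row_all_gather` is hypothetical.

```python
def row_all_gather(row_values):
    """n-to-n broadcast in one row of PEs, assuming a ring (wrap-around).

    Returns (store, cycles), where store[j] maps column index -> value
    for every element that PE j holds afterwards.
    """
    n = len(row_values)
    store = [{j: row_values[j]} for j in range(n)]   # PE j starts with a[i][j]
    in_flight = list(range(n))        # index of the element each PE sends next
    cycles = 0
    for _ in range(n - 1):
        # every PE passes its in-flight element to its right-hand neighbour
        in_flight = [in_flight[(j - 1) % n] for j in range(n)]
        for j in range(n):
            store[j][in_flight[j]] = row_values[in_flight[j]]
        cycles += 1
    return store, cycles

store, cycles = row_all_gather([10, 20, 30, 40])
print(cycles)              # prints: 3
print(sorted(store[0]))    # prints: [0, 1, 2, 3]
```

After t shifts PE j has just received element (j - t) mod n, so n-1 shifts are enough for every PE to hold the whole row.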

【０１９２】 (2) From here on, data transfer in the column direction and computation are repeated. As the data needed for the first operation, each PE(0,j) in the first row (0 ≤ j ≤ n-1) broadcasts b[0][j] 1-to-n to the PEs 25 in its column. In the 1-to-n broadcast, b[0][j] is first transferred to the adjacent PE(1,j), PE(1,j) then transfers b[0][j] to PE(2,j), and so on, until b[0][j] is finally transferred to PE(n-1,j). Because the broadcast data b[0][j] is shifted from PE to PE in this way, the broadcast completes in n-1 cycles (steps 117 to 118).

【０１９３】 (3) Each PE(i,j) (0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1) performs the operation of equation (II) (step 119).

【０１９４】 (4) As the data needed for the second operation, each PE(1,j) in the second row (0 ≤ j ≤ n-1) broadcasts b[1][j] 1-to-n to the PEs 25 in its column (steps 120 to 118).

【０１９５】 (5) As the second operation, equation (III) is calculated (step 119).

【０１９６】 (6) By repeating the above, the matrix multiplication can be performed (steps 118 to 121).

【０１９７】 The method shown here holds all the data in the row direction in advance by an n-to-n broadcast and repeats 1-to-n broadcasts in the column direction and computation; holding all the data in the column direction in advance by an n-to-n broadcast works in exactly the same way.

【０１９８】 In multiplication procedure 2, the memory each PE 25 needs for the operation holds all the data of either one row or one column, so n items suffice, half of what the conventional method needs. As for data transfer time, the initial n-to-n broadcast in the row direction takes n-1 cycles and the subsequent 1-to-n broadcasts in the column direction take n-1 cycles each, n times, so the total number of transfer cycles is n*(n-1)+n-1, about half of that when 1-to-n broadcasts are performed in both the row and column directions.
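Multiplication procedure 2 and the transfer-cycle counts given in the text can be put side by side in a short sketch. The names `multiply_procedure_2` and `transfer_cycles` are hypothetical, the broadcasts are modelled as plain copies, and the cycle counter simply adds the per-broadcast costs stated above (n-1 cycles per shift-based broadcast on the two-dimensional lattice connection).

```python
def multiply_procedure_2(a, b):
    """Model of procedure 2; returns (product, transfer cycles)."""
    n = len(a)
    # (1) n-to-n broadcast along rows: PE(i, j) ends up with all of row i of A.
    a_row = [[list(a[i]) for _ in range(n)] for i in range(n)]
    cycles = n - 1
    c = [[0] * n for _ in range(n)]
    for k in range(n):
        # (2)/(4): row k of the PE array broadcasts b[k][j] down each column.
        cycles += n - 1
        # (3)/(5): every PE accumulates one product.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_row[i][j][k] * b[k][j]
    return c, cycles

def transfer_cycles(n):
    """Transfer-cycle formulas as stated in the text."""
    return {
        "conventional": 2 * (n - 1),
        "procedure 1": 2 * n * (n - 1),
        "procedure 2": n * (n - 1) + (n - 1),
    }

print(multiply_procedure_2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints: ([[19, 22], [43, 50]], 3)
print(transfer_cycles(4))
# prints: {'conventional': 6, 'procedure 1': 24, 'procedure 2': 15}
```

The ratio of procedure 2 to procedure 1 is (n+1)/(2n), which is 0.625 for n = 4 and approaches 1/2 as n grows.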

【０１９９】 Furthermore, as shown in FIG. 19, an x-y bus connection can also be considered, in which a PE 25 can broadcast to all PEs in its row, or to all PEs in its column, in one cycle. In the x-y bus connection, the PEs in the same row, PE(i,0) to PE(i,n-1) (0 ≤ i ≤ n-1), are connected by a bus, and the PEs in the same column, PE(0,j) to PE(n-1,j) (0 ≤ j ≤ n-1), are also connected by a bus.

【０２００】 The procedure for matrix multiplication using this x-y bus connection is the same as described above, except for the 1-to-n broadcast described in multiplication procedures 1 and 2. A data broadcast is realized by the broadcasting PE 25 placing the broadcast data on the bus and the remaining PEs 25 reading the data from the bus, so one data broadcast takes one cycle.

【０２０１】 For the matrix multiplication as a whole, data transfer takes 2*n cycles, which keeps the transfer time nearly the same as when the conventional multiplication method is used with the two-dimensional lattice connection. The amount of memory required for the operation is, of course, far smaller than in the conventional method.

【０２０２】 As described in detail above, the parallel computer of the third invention greatly reduces the memory used per PE compared with the conventional method, so multiplication can be executed even when the memory per PE is small. The data transfer time increases in the case of the two-dimensional lattice connection, but by introducing an x-y bus connection or the like it can be held to a level comparable to the conventional method.

【０２０３】 Arithmetic expression (I) used here performs the same operation in all PEs 25 in each iteration. Data transfer also performs the same processing row by row or column by column, so this can be said to be a multiplication method well suited to SIMD computers. It can of course also be executed on an MIMD (Multiple Instruction Multiple Data Stream) parallel computer. An MIMD computer is characterized in that each PE 25 has an instruction memory and an instruction decoder and can perform different processing, and the multiplication method of the third invention can be executed on it as well.

【０２０４】
[Effects of the Invention] According to the first invention, instruction groups are formed from instructions having dependency relationships and instruction scheduling is performed with the instruction group as the unit, so that even when execution time differs from instruction to instruction, the program can be compiled into the instruction sequence whose execution time is shortest. In addition, because scheduling is performed in units of instruction groups, it can be carried out efficiently.

【０２０５】 According to the parallel computer of the second invention, on a parallel computer having a plurality of processors arranged one-dimensionally or two-dimensionally, simultaneous linear equations can be solved without impairing the numerical properties and with data communication only between adjacent processors, so the solution can be carried out efficiently and at high speed.

【０２０６】 Furthermore, according to the parallel computer of the third invention, matrix multiplication can be executed even when the memory per processing element is small. This is particularly effective when a large-scale matrix multiplication is carried out on a large number of processing elements and the memory per processing element is limited.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of the compiler of the first invention.

FIG. 2 is a flowchart of the instruction scheduling unit shown in FIG. 1.

FIG. 3 shows the source program and the instruction sequence before scheduling that are used to explain the scheduling method of the compiler of the first invention.

FIG. 4 is a table listing the instruction group configuration, the execution time of each instruction group, and the time at which the next instruction can be issued, for explaining scheduling in the first invention.

FIG. 5 shows the instruction sequence after instruction scheduling according to the first invention and its execution.

FIG. 6 is a configuration diagram of the parallel computer for simulating a two-dimensional region in the second invention.

FIG. 7 is a configuration diagram of the parallel computer for simulating a three-dimensional region in the second invention.

FIG. 8 is a flowchart showing the operation of the parallel computer in the second invention.

FIG. 9 shows the grid points obtained when a physical region is divided two-dimensionally.

FIG. 10 shows the lower triangular matrix L obtained by incomplete LU decomposition (1,1) for a two-dimensional problem.

FIG. 11 shows the dependencies of the solution calculation when incomplete LU decomposition (1,1) is used for a two-dimensional problem.

FIG. 12 shows the lower triangular matrix L obtained by incomplete LU decomposition (1,2) for a two-dimensional problem.

FIG. 13 shows the dependencies of the solution calculation when incomplete LU decomposition (1,2) is used for a two-dimensional problem.

FIG. 14 shows the grid points obtained when a physical region is divided three-dimensionally.

FIG. 15 shows the lower triangular matrix L obtained by incomplete LU decomposition (1,1,1) for a three-dimensional problem.

FIG. 16 shows the dependencies of the solution calculation when incomplete LU decomposition (1,1,1) is used for a three-dimensional problem.

【図１７】第３の発明の並列計算機に係わる一実施例の
構成を示すブロック図。FIG. 17 is a block diagram showing the configuration of an embodiment of a parallel computer according to the third invention.

【図１８】図１７で示した演算実行部の一構成例。FIG. 18 is a configuration example of the arithmetic execution unit shown in FIG.

【図１９】図１８と異なる演算実行部の一構成例。FIG. 19 is a configuration example of an arithmetic execution unit different from FIG.

【図２０】第３の発明における行列乗算手順を示すフロ
ーチャート。FIG. 20 is a flowchart showing a matrix multiplication procedure in the third invention.

【図２１】第３の発明における行方向のデータ放送の様
子を示す図。FIG. 21 is a diagram showing a state of row-direction data broadcasting in the third invention.

【図２２】第３の発明における列方向のデータ放送の様
子を示す図。FIG. 22 is a diagram showing a state of column-direction data broadcasting in the third invention.

【図２３】図２１と同様な行方向のデータ放送の様子を
示す図。23 is a diagram showing a state of data broadcasting in the row direction similar to FIG.

【図２４】図２２と同様な列方向のデータ放送の様子を
示す図。FIG. 24 is a diagram showing a state of data broadcasting in the column direction similar to FIG. 22.

【図２５】図２０と異なる行列乗算手順を示すフローチ
ャート。FIG. 25 is a flowchart showing a matrix multiplication procedure different from FIG. 20.

【図２６】代表的なパイプラインと実行の様子。FIG. 26 shows a typical pipeline and a state of execution.

【図２７】従来のコンパイラを説明するための計算機の
構成、ソースプログラム、及びスケジューリング前の命
令列。FIG. 27 shows a configuration of a computer, a source program, and an instruction sequence before scheduling for explaining a conventional compiler.

【図２８】図２７で示した命令列の実行の様子及び命令
列に対する依存有向グラフ。28A and 28B are states of execution of the instruction sequence shown in FIG. 27 and a dependency directed graph for the instruction sequence.

FIG. 29 shows the patterns obtained by a conventional compiler.

FIG. 30 shows the execution of each pattern shown in FIG. 29.

FIG. 31 shows a configuration example of a conventional parallel computer corresponding to the third invention.

FIG. 32 is a diagram showing the initial data arranged in each PE shown in FIG. 31.

FIG. 33 is an equation showing the product of two 4 × 4 matrices.

FIG. 34 is a flowchart showing a conventional matrix multiplication procedure corresponding to the third invention.

FIG. 35 is a diagram showing each PE shown in FIG. 31 holding all of the data it requires.

[Explanation of symbols]

1 Source program
2 Source program input unit
3 Lexical analysis unit
4 Syntax analysis unit
5 Intermediate code generation unit
6 Instruction scheduling unit
7 Optimization unit
8 Intermediate code optimization unit
9 Object code generation unit
10 Object program output unit
11 Object program
(1) to (4), (1, 1) to (3, 3) Processor
13 Local memory
21 Arithmetic control unit
22 Arithmetic execution unit
23 Data bus
24 Instruction memory
25 PE (arithmetic processor)
26 Data memory
27 Network

Claims

[Claims]

1. A compiler that generates an object program for use in a computer that has a plurality of arithmetic units capable of parallel execution and processes instructions in a pipeline, the compiler comprising an instruction scheduling unit that, for an instruction sequence within a basic block, forms an instruction group from each set of instructions having a dependency relationship and, taking the instruction group as the unit, reorders the execution of instructions or moves their issue earlier or later so as to reduce the idle state of the pipeline, thereby optimizing the instruction sequence so that its execution time is the shortest.
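As an illustration of the grouping step in claim 1, the following minimal sketch partitions the instructions of a basic block so that any two instructions linked by a dependency fall into the same instruction group. It reads "instructions having a dependency relationship" as connected components of the dependency graph, which is one plausible interpretation and not necessarily the exact procedure of the embodiment. The function name `build_instruction_groups` and the `(producer, consumer)` pair format are assumptions made for this example.

```python
def build_instruction_groups(num_instructions, dependencies):
    """Group instructions so that any two linked by a dependency share a group.

    `dependencies` is a list of (producer, consumer) index pairs; this input
    format is an assumption made for the sketch.
    """
    parent = list(range(num_instructions))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for producer, consumer in dependencies:
        root_p, root_c = find(producer), find(consumer)
        if root_p != root_c:
            # Keep the smaller index as representative, so each group is
            # identified by its earliest instruction.
            parent[max(root_p, root_c)] = min(root_p, root_c)

    groups = {}
    for instr in range(num_instructions):
        groups.setdefault(find(instr), []).append(instr)
    # Groups are returned in order of their earliest instruction.
    return [groups[root] for root in sorted(groups)]
```

A scheduler along the lines of claim 1 would then treat each returned list as one unit when deciding whether to reorder groups or move their issue time earlier or later.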

2. A parallel computer composed of a plurality of processors for solving simultaneous linear equations, as typified by the simulation of a physical phenomenon in a two-dimensional region divided into a two-dimensional lattice, wherein the i-th processor comprises: receiving means for receiving a physical quantity sent from the (i-1)th processor; calculating means for calculating, using the received physical quantity, the physical quantity of a lattice point assigned to the i-th processor; storage means for storing the calculated physical quantity; and transmitting means for sending the calculated physical quantity to the (i+1)th processor; wherein the physical quantity calculations for the lattice points on the i-th row of the two-dimensional lattice are assigned to the i-th processor; and wherein the i-th processor, taking the assigned lattice points on the i-th row in the order in which their physical quantities become calculable, repeats the following for all lattice points on the i-th row: receiving, by the receiving means, the physical quantity transmitted by the transmitting means of the (i-1)th processor; calculating, by the calculating means, the physical quantity of the assigned lattice point using the received physical quantity; storing the calculated physical quantity in the storage means; and sending it, by the transmitting means, to the (i+1)th processor.
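The receive, calculate, store, and send cycle of claim 2 can be sketched as a sequential simulation in which processor i owns row i of the lattice. The constant coefficients `west`, `south`, and `diag`, the recurrence they define, and the function name are assumptions chosen for illustration; they stand in for the lower-triangular system that an incomplete LU decomposition would produce and are not taken from the specification.

```python
def pipelined_row_solve(b, west, south, diag):
    """Simulate processor i computing row i of a 2-D lattice, point by point.

    Each point satisfies
        diag * x[i][j] + west * x[i][j-1] + south * x[i-1][j] = b[i][j],
    a constant-coefficient stand-in (an assumption of this sketch) for the
    lower-triangular system produced by an incomplete LU decomposition.
    """
    rows, cols = len(b), len(b[0])
    x = [[0.0] * cols for _ in range(rows)]       # storage means, one row per processor
    inbox = [[None] * cols for _ in range(rows)]  # values received from processor i-1
    schedule = []                                 # (step, processor, column) log

    for step in range(rows + cols - 1):
        # Processors that work in the same step are independent of each other.
        for i in range(rows):
            j = step - i
            if not 0 <= j < cols:
                continue
            from_prev = inbox[i][j] if i > 0 else 0.0   # receive
            own_prev = x[i][j - 1] if j > 0 else 0.0
            # Calculate and store the physical quantity of point (i, j).
            x[i][j] = (b[i][j] - west * own_prev - south * from_prev) / diag
            if i + 1 < rows:
                inbox[i + 1][j] = x[i][j]               # send to processor i+1
            schedule.append((step, i, j))
    return x, schedule
```

In this simulation processor i handles column j at step i + j, so all rows finish after rows + cols - 1 steps instead of the rows × cols steps a single processor would need.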

3. A parallel computer composed of a plurality of processors for solving simultaneous linear equations, as typified by the simulation of a physical phenomenon in a three-dimensional region divided into a three-dimensional lattice, wherein the (i, j) processor comprises: receiving means for receiving physical quantities sent from the (i-1, j) processor and the (i, j-1) processor; calculating means for calculating the physical quantity of a lattice point assigned to the (i, j) processor; storage means for storing the calculated physical quantity; and transmitting means for sending the calculated physical quantity to the (i, j+1) processor and the (i+1, j) processor; wherein the physical quantity calculations for lattice points on the same axis of the three-dimensional lattice are assigned to the same processor; and wherein the (i, j) processor, taking the assigned lattice points on the same axis in the order in which their physical quantities become calculable, repeats the following for all assigned lattice points on the same axis: receiving, by the receiving means, the physical quantities transmitted by the transmitting means of the (i-1, j) processor and the (i, j-1) processor; calculating, by the calculating means, the physical quantity of the assigned lattice point using the received physical quantities; storing the calculated physical quantity in the storage means; and sending it, by the transmitting means, to the (i+1, j) processor and the (i, j+1) processor.
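To make the ordering in claim 3 concrete, the sketch below lists which lattice points can be computed at each time step. It assumes that point (i, j, k) depends only on its three lower neighbours (i-1, j, k), (i, j-1, k), and (i, j, k-1), and that processor (i, j) owns every point along the k axis. Both the dependency pattern and the name `wavefront_schedule` are assumptions made for illustration.

```python
def wavefront_schedule(nx, ny, nz):
    """Return, per time step, the lattice points that can be computed in that step.

    Point (i, j, k) is assumed to depend on (i-1, j, k), (i, j-1, k) and
    (i, j, k-1); processor (i, j) owns every point (i, j, k) along the k axis.
    """
    steps = [[] for _ in range(nx + ny + nz - 2)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                # All three predecessors have index sum i + j + k - 1, so the
                # point becomes calculable one step after them.
                steps[i + j + k].append((i, j, k))
    return steps
```

Under these assumptions each processor computes at most one point per step, and the whole lattice is covered in nx + ny + nz - 2 steps.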

4. A parallel computer for obtaining the product of two matrices, composed of a plurality of arithmetic element processors arranged two-dimensionally, wherein the data of the first matrix and the data of the second matrix are allocated to the arithmetic processors in the same manner for both matrices, the parallel computer comprising: means for broadcasting data from one arithmetic processor to a plurality of other arithmetic processors among the arithmetic processors holding data in the row direction of the matrices; means for broadcasting data from one arithmetic processor to a plurality of other arithmetic processors among the arithmetic processors holding data in the column direction of the matrices; means for controlling the number of repetitions; and control means for controlling the data broadcasts and the operation of the arithmetic processors; wherein each arithmetic processor holding data of the one column of the first matrix designated by the repetition count transfers that column data simultaneously and in parallel to the plurality of arithmetic processors holding data belonging to the same row as that column data; each arithmetic processor holding data of the one row of the second matrix designated by the repetition count transfers that row data simultaneously and in parallel to the plurality of arithmetic processors holding data belonging to the same column as that row data; each arithmetic processor obtains the product of the transferred data of the first matrix and the transferred data of the second matrix and accumulates the product at each repetition; and the product of the matrices is obtained by repeating the above operation a predetermined number of times determined from the size of the matrices.
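The repetition described in claim 4 can be sketched sequentially as follows, assuming for simplicity that the matrices are n × n and that PE(i, j) holds exactly one element of each matrix. The function name and the one-element-per-PE layout are assumptions of this example; the claim itself only requires that both matrices use the same allocation.

```python
def broadcast_matmul(a, b):
    """Multiply two n x n matrices the way an n x n PE grid would.

    PE(i, j) is assumed to hold a[i][j] and b[i][j] and to keep a single
    accumulator for element (i, j) of the product.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]   # one accumulator per PE(i, j)
    for k in range(n):                  # repetition count
        # PE(i, k) broadcasts its element of column k of A along row i.
        row_bcast = [a[i][k] for i in range(n)]
        # PE(k, j) broadcasts its element of row k of B along column j.
        col_bcast = [b[k][j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                # Every PE multiplies the two values it received and accumulates.
                acc[i][j] += row_bcast[i] * col_bcast[j]
    return acc
```

In this sketch each PE only ever needs its own two elements, one accumulator, and the two values received in the current repetition, which is consistent with the stated aim of performing the multiplication with a small memory.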