JP3542184B2

JP3542184B2 - Linear calculation method

Info

Publication number: JP3542184B2
Application number: JP31151994A
Authority: JP
Inventors: 有作山本; 俊夫大河内; 暢俊佐川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1994-12-15
Filing date: 1994-12-15
Publication date: 2004-07-14
Anticipated expiration: 2019-07-14
Also published as: JPH08166941A

Description

【０００１】
【産業上の利用分野】
本発明は並列計算機を用いて例えば連立一次方程式の求解，固有値計算などの線形計算を高速に行う方法に関する。
【０００２】
【従来の技術】
構造解析，流体計算などの科学技術計算では，数万次元から数百万次元に上る大規模行列を係数行列とする連立一次方程式の求解，固有値計算などの線形計算が必要となる。このような計算を高速に行う手段としては，並列計算機が有力である。並列計算機は数十個から数万個に上る多数の高速プロセッサをネットワークで結んだシステムであり，問題を適切に分割して各プロセッサに割り当て，同時に処理を行わせることにより，１台のプロセッサによる実行と比べて飛躍的な実行速度の向上が可能となる。
【０００３】
問題の分割と部分問題のプロセッサへの割り当てにおいては，（１）プロセッサの単体性能を発揮させること，（２）並列化効率（プロセッサを１台からｐ台に増やしたときの理想的な性能向上であるｐ倍に対し，実際にどの程度の性能向上が達成できたかの比）を高めること，の２つを考慮することが必要である。（１）の単体性能に関しては，高性能な並列計算機に使われるプロセッサはキャッシュを備えたＲＩＳＣ型プロセッサであることが多いため，キャッシュを有効に利用できる分割法が必要となる。また，（２）の並列化効率に関しては，プロセッサ間通信の回数とデータ量とを減らして通信時間を削減すること，およびプロセッサ間の負荷を均等にして負荷不均等によるプロセッサの待ち時間を削減することが必要となる。
【０００４】
連立一次方程式を解くために使われるガウス消去法（ＬＵ分解法）の場合，従来はこの２つの課題を解決するために並列化ブロックガウス法（たとえばインテル・コーポレーション：”並列処理コンピュータで超大規模連立一次方程式を解く方法”，特開平５−２５０４０１（１９９１年１２月２０日出願）参照）と呼ばれる手法が使われてきた。これは図２の係数行列４を大きさＬＢ×ＬＢ（ＬＢ：ブロックサイズ）のブロック５，６，７などに分割し，各ブロックを実数のようにみなして，通常のガウス消去法を適用する手法である。並列化にあたっては各ブロックを図２のように１６台のプロセッサ０〜１５に割り当て，各ブロックに対する消去演算を担当のプロセッサに行わせる。
【０００５】
通常のガウス消去法と比べた場合，並列化ブロックガウス法の利点は次の通りである。まず，消去演算における実数どうしの乗算が，並列化ブロックガウス法ではＬＢ×ＬＢの行列どうしの乗算になる。一般に行列乗算では，データがキャッシュ中にある間に何回も演算を行うため，キャッシュの有効利用が可能となり，ＲＩＳＣ型プロセッサ上での単体性能を向上させることが可能となる。さらに，プロセッサ間でのデータ通信もブロック単位となるため，通常のガウス消去法に比べて通信回数は減少し，通信時間の削減により並列化効率の向上が可能となる。これらの利点のため，並列化ブロックガウス法は，並列計算機上でガウス消去法により連立一次方程式を解くためのもっとも一般的な手法となっている。
【０００６】
【発明が解決しようとする課題】
並列化ブロックガウス法では，ブロックサイズＬＢは２つの異なる要請から決定される。まず（１）単体性能の向上という面からは，消去演算で現われるＣ ← Ｃ − Ａ×Ｂという形の行列乗算において，行列Ａ，Ｂ，Ｃのすべてがキャッシュに入るという条件の下で，ＬＢはできるだけ大きいことが望ましい。ＬＢが大きいほど，データがキャッシュ中にある間に多数回の演算が行われることになり，キャッシュの有効利用が計られるからである。一方，（２）並列化効率の向上という面からもＬＢの最適値がある。ＬＢが大きくなるとデータ通信の単位となるブロックは大きくなり，通信回数の減少により通信時間が減る一方，各プロセッサの担当するブロックの個数のばらつきが大きくなり，負荷の不均等による待ち時間は増大する。ＬＢが小さくなるとデータ通信回数は増加するが，各プロセッサの負荷は均等になる。したがって，これら２つの均衡により，並列化効率を最大にするＬＢが定まる。以上で述べた（１）単体性能を最大にするＬＢの値と（２）並列化効率を最大にするＬＢの値とは，一般には一致しない。例として，１６台のプロセッサで並列化ブロックガウス法を実行する場合を図３に示す。この例では単体性能を最大にするようにＬＢを決めた結果，プロセッサ０の担当ブロック５は４個，プロセッサ１５の担当ブロック７は１個と，担当ブロック数に大きな差が生じている。このため，高い並列化効率は期待できない。このように従来の並列化ブロックガウス法では，単体性能，並列化効率の一方あるいは両方を犠牲にしつつ総合的に見てもっとも高い性能を達成するようにＬＢを決める必要があったため，並列計算機の持つ性能を十分に引き出せないという問題があった。
【０００７】
本発明は，並列化ブロックガウス法を改良することにより，この問題を解決し，単体性能と並列化効率の両方を高めることにより，線形計算において並列計算機の持つ性能を十分に引き出すことを目的とする。
【０００８】
【課題を解決するための手段】
上記目的を達成するため，本発明では並列化ブロックガウス法を改良し，２種の異なるブロックサイズを導入した。従来の並列化ブロックガウス法が単体性能と並列化効率とを同時に高められないのは，この２つの目的のそれぞれに対し，最適なブロックサイズが異なるからである。そこで本発明では，まず並列化効率の面から最適なブロックサイズＬＢを決定し，このサイズにより行列をブロックに分割して，各ブロックをプロセッサに割り当てる。これにより，並列化効率の最適化が可能となる。次に消去演算においては，各ブロック段において毎回行列全体の消去を行うのではなく，ある段数ｋ段分のブロックピボット行とブロックピボット列とを作成しておき，ｋ段に一回，まとめて消去を行う。これにより，消去演算はＭＢ＝ｋ＊ＬＢの大きさの行列乗算として処理できることになり，ｋを適当に取ることにより，単体性能の最適化が可能となる。
【０００９】
【作用】
１６台のプロセッサでガウス消去法を実行する場合を例にとり，本発明により高い並列化効率と高い単体性能とが同時に達成される様子を説明する。
【００１０】
まず，図４（ａ）が全体行列の消去演算を示す。行列はブロックに分割されており，各ブロックのサイズはＬＢ×ＬＢである。行列中の影の付いたブロックは，プロセッサ０の担当ブロックを示す。ＬＢを十分小さく取ることにより，各プロセッサの担当ブロック数はほぼ均等になり，並列化効率を高めることができる。図４（ａ）では４段分のブロックピボット列とブロックピボット行とを作成しておき，まとめて４段分の消去を行う場合を示している。この場合，プロセッサ０では，ブロックピボット行・列のうちの影の付いたブロックを用いて消去演算を行う。プロセッサ０の行う演算のみを取り出して表示したのが図４（ｂ）である。図から明らかなように，４ブロック段分の消去をまとめて行うことにより，プロセッサ０での消去演算はサイズＭＢ＝４＊ＬＢの行列どうしの乗算として処理できる。これにより，従来の並列化ブロックガウス法に比べて行列乗算のサイズを大きくすることができ，キャッシュの有効利用により単体性能を高めることができる。
【００１１】
以上により，本発明の手法では並列化効率の向上と単体性能の向上とを同時に達成することができる。
【００１２】
【実施例】
（実施例１）
以下，本発明の原理および実施例を，図面により詳細に説明する。ここで実施例として挙げるのは，並列計算機を用いてガウス消去法により連立一次方程式を解く方法である。図１に示すように，本方法を適用する並列計算機システムは行列データ，右辺ベクトルなどのデータを入力するための入力装置１，それぞれがキャッシュメモリを備えたｐ台のプロセッサ３１を持つ連立一次方程式求解装置２，解を出力するための出力装置３から構成される。本実施例における処理を図５に示す。まず入力装置から係数行列Ａ，右辺ベクトルｂ，行列の次元Ｎ，２つのブロックサイズＬＢ，ＭＢを入力し（処理１３），行列データおよび右辺ベクトルのプロセッサへの割り当てを行い（処理１４），Ｋ＝１からＫ＝Ｎ／ＬＢまでの各ブロック段について消去演算を行って係数行列のＬＵ分解を求める（処理１５）。次に，得られたＬＵ分解を用いて前進消去演算（処理２２），後退代入演算（処理２３）を行って解を求め，最後に各プロセッサに分散して格納されている解を収集し（処理２４），出力する（処理２５）。本発明のもっとも大きな特徴は，処理１４のデータ割り当て部分，および処理１５の消去演算部分にあるので，以下，これらの部分について説明する。
【００１３】
（１）プロセッサへのデータ割り当て
入力ステップ（処理１３）で入力したプロセッサ割り当てのためのブロックサイズＬＢを用いて，図２のように係数行列を大きさＬＢ×ＬＢのブロックに分割し，各ブロックをプロセッサに割り当てる。図２の５，６，７はそれぞれプロセッサ０，１，１５に割り当てられたブロックを示す。ＬＢが大きいと通信の単位が大きくなるため，通信回数が減少して通信オーバーヘッドは減少するが，各プロセッサの担当ブロック数のばらつきが大きくなり，プロセッサ間の負荷不均等によりアイドル時間が増大する。一方ＬＢが小さいと負荷は均等になるが，通信の回数が多くなって通信オーバーヘッドが増大する。本発明では，これら２つの兼ね合いから，並列化効率を最大にするようにＬＢをあらかじめ決める。
【００１４】
（２）消去演算
本発明における消去演算の特徴は，各ブロック段ごとに行列全体の消去演算を行わず，ｋ本のブロックピボット列・行を予め作成してからｋブロック段分の消去を行い，これにより消去演算を大きさＭＢ＝ｋ＊ＬＢの大きな行列の演算として処理してキャッシュの有効利用を計ることにある。そこで，図５の処理１５から処理２１により，消去演算の各過程を説明する。
【００１５】
まず，第Ｋブロック段の最初では第（Ｋ，Ｋ）ブロックのＬＵ分解を行う（処理１６）。これは従来の並列化ブロックガウス法と同じである。次に，今作成したＵの逆行列を第Ｋブロック列に右からかけることにより第Ｋブロックピボット列を作成し（処理１７），Ｌの逆行列を第Ｋブロック行に左からかけることにより第Ｋブロックピボット行を作成する（処理１８）。そして第Ｋブロックピボット列と第Ｋブロックピボット行とにより，消去演算を行う（処理１９）。ただし，ここでの消去は行列全体に対しては行わず，次のｋ本のブロックピボット列・行となる部分に対してのみ行う。ｋ段分のブロックピボット列・行を作成するためには，この範囲の消去で十分である。この後，Ｋがｋ＝ＭＢ／ＬＢの倍数でなければ，第Ｋ＋１ブロック段に進む（処理２０）。Ｋがｋ＝ＭＢ／ＬＢの倍数の場合には，それまでに作成したｋ本のブロックピボット列・行を用いて，行列全体の消去を行う（処理２１）。この消去演算での処理を図４（ａ）に示す。この例ではｋ＝４であり，４本のブロックピボット列８，４本のブロックピボット行１０による消去演算を行っている。図４（ａ）は全体の係数行列４に対する演算であるが，このうちプロセッサ０の担当ブロック５（それぞれの大きさはＬＢ×ＬＢ）に対する演算を抜き出したものを図４（ｂ）に示す。図から明らかなように，ブロックピボット列のうちでプロセッサ０の演算で使われる部分９，ブロックピボット行のうちでプロセッサ０の演算で使われる部分１１を用いて，消去演算は大きさＭＢ＝４＊ＬＢの行列に対する乗算として処理できる。この結果，ｋを適当に選んでＭＢ×ＭＢの行列３個がちょうどキャッシュに入るようにすれば，キャッシュを最大限に利用することができ，単体性能が向上する。一方，従来の並列化ブロックガウス法では各ブロック段ごとに行列全体の消去演算を行うので，消去演算はＬＢ×ＬＢの行列乗算となるが，ＬＢの範囲は並列化効率の最適化との兼ね合いにより決められてしまい，単体性能を十分に発揮させることが難しい。本発明では、この点が大きく改良されている。
【００１６】
（実施例２）
実施例１では線形計算の一つであるガウス消去法の場合について説明したが、これ以外にも、ピボット列とピボット行とを用いて行列を消去していくという線形計算に対しては，本発明が適用可能である。このような線形計算の例としては、最小二乗法などに使われる行列のＱＲ分解や、固有値計算に使われるＨｏｕｓｅｈｏｌｄｅｒ変換などがある。本実施例では，最小二乗法への本発明の適用例を述べる。
【００１７】
最小二乗法の問題例として，実験により与えられたｍ個のの入力データと出力データの組（ｘ１，ｙ１），（ｘ２，ｙ２），．．．，（ｘｍ，ｙｍ）に基づき，入力ｘと出力ｙの関係をｎ−１次式ｙ＝ｆ（ｘ）で近似する問題を考える。求めたいのは，ｆ（ｘ）におけるｘの０次，１次，．．．，ｎ−１次の項の係数ａ１，ａ２，．．．，ａｎである。これを求めるには，最小二乗法の標準的な手続きにより，まず第ｉ番目の入力データｘｉのｊ−１乗を第（ｉ，ｊ）要素とする行列Ａを求め，これをＡ＝ＱＲとＱＲ分解してから，Ｒを係数行列とし，（Ｑの転置行列）×（ｙ１，ｙ２，．．．，ｙｍ）を右辺ベクトルとする連立一次方程式を解けばよい。この一連の処理のうち，もっとも計算量が多いのはＡのＱＲ分解の部分であり，これを並列機上で実行する場合に本発明が利用できる。
【００１８】
本発明の手法を用いて，このＱＲ分解を並列機上で実行する場合の処理を図６に示す。ＱＲ分解の場合も，並列化効率を最大にするように選んだブロックサイズＬＢにしたがって行列をブロックに分割し，プロセッサに割り当てる部分（処理２７）は実施例１のガウス消去法と同様である。ただし次の消去演算において，第Ｋブロックピボット行・列を作成するための前処理演算が，ガウス消去法の場合のＬＵ分解処理（図５の処理１６）からＱＲ分解処理（図６の処理２８）へと置き替わっている。また，第Ｋブロックピボット行・列の作成（図６の処理１６，１７）の詳細もガウス消去法の場合とは異なる。しかし，あらかじめ数本のブロックピボット行・列を作成しておき，ＭＢ／ＬＢ段に１回まとめて消去演算を行うという本発明の手法（処理２１）は全く同様に適用できる。また，この手法により得られる効果も，ガウス消去法の場合と同様である。
【００１９】
本実施例では他の線形演算に対する適用例として最小二乗法などに使われるＱＲ分解の場合を示したが，Ｈｏｕｓｅｈｏｌｄｅｒ変換など，さらに他の線形計算アルゴリズムに対しても，本手法は同様に適用できる。
【００２０】
【発明の効果】
以上説明したように，本発明によれば，並列計算機上でのガウス消去法などの線形計算において，単体性能と並列化効率とを共に高めることができ，並列計算機の性能を十分に引き出した計算が可能となる。
【図面の簡単な説明】
【図１】並列計算機による線形計算システムの全体を示す図である。
【図２】従来の並列化ブロックガウス法でのブロックへのプロセッサ割り当てを示す図である。
【図３】従来の並列化ブロックガウス法において，単体性能を高めるためにブロックサイズを大きく取った結果，各プロセッサの担当ブロック数が不均等になったことを示す図である。
【図４】本発明により，プロセッサ割り当てのためのブロックサイズを小さくとっても，消去演算のためのブロックサイズが大きく取れるようになったことを示す図である。
【図５】本発明をガウス消去法による連立一次方程式解法に適用した場合の手続き全体を示す図である。
【図６】本発明を行列のＱＲ分解に適用した場合の手続き全体を示す図である。
【符号の説明】
１：入力装置，２：連立一次方程式求解装置，３：出力装置，４：係数行列の全体，５：プロセッサ０の担当するブロック，６：プロセッサ１の担当するブロック，７：プロセッサ１５の担当するブロック，８：４段分のブロックピボット列の全体，９：ブロックピボット列のうちプロセッサ０の計算に使われる部分，１０：４段分のブロックピボット行の全体，１１：ブロックピボット行のうちプロセッサ０の計算に使われる部分，１２：スタート，１３：データの入力，１４：係数行列および右辺ベクトルのプロセッサへの割り当て，１５：最外側の繰り返しループ，１６：第（Ｋ，Ｋ）ブロックのＬＵ分解，１７：第Ｋブロックピボット列の作成，１８：第Ｋブロックピボット行の作成，１９：第Ｋブロックピボット列・行による部分的な消去演算，２０：ＫがＭＢ／ＬＢの倍数か否かの判定，２１：Ｋ段分の同時消去演算，２２：前進消去演算，２３：後退代入演算，２４：解ベクトルの収集，２５：解ベクトルの出力，２６：終了，２７：行列データのプロセッサへの割り当て，２８：第（Ｋ，Ｋ）ブロックのＱＲ分解，２９：解行列の収集，３０：解行列の出力，３１：プロセッサ，３２：キャッシュメモリ。
[0001]
[Industrial applications]
The present invention relates to a method for performing high-speed linear calculations, such as solving simultaneous linear equations and calculating eigenvalues, using a parallel computer.
[0002]
[Prior art]
Scientific and engineering computations such as structural analysis and fluid dynamics require linear calculations, such as solving simultaneous linear equations and computing eigenvalues, whose coefficient matrices are large, ranging from tens of thousands to several million dimensions. Parallel computers are a powerful means of performing such calculations at high speed. A parallel computer is a system in which a large number of high-speed processors, from several tens to tens of thousands, are connected by a network. By dividing a problem appropriately, assigning the parts to the processors, and having them process the parts simultaneously, execution can be made dramatically faster than on a single processor.
[0003]
In dividing the problem and assigning the subproblems to the processors, two things must be considered: (1) bringing out the single-processor performance, and (2) raising the parallelization efficiency (the ratio of the performance improvement actually achieved to the ideal p-fold improvement when the number of processors is increased from one to p). Regarding (1), the processors used in high-performance parallel computers are often RISC processors with a cache, so a partitioning method that uses the cache effectively is required. Regarding (2), it is necessary to reduce communication time by reducing both the number of inter-processor communications and the amount of data transferred, and to reduce the processor waiting time caused by load imbalance by equalizing the load among the processors.
[0004]
For Gaussian elimination (LU decomposition), which is used to solve simultaneous linear equations, a technique called the parallelized block Gaussian method has conventionally been used to address these two issues (see, for example, Intel Corporation, "Method for solving very large-scale simultaneous linear equations on a parallel processing computer", Japanese Patent Application Laid-Open No. 5-250401, filed on Dec. 20, 1991). In this technique, the coefficient matrix 4 of FIG. 2 is divided into blocks 5, 6, 7, and so on, each of size LB × LB (LB: block size), and ordinary Gaussian elimination is applied with each block treated as if it were a single real number. For parallelization, the blocks are assigned to 16 processors 0 to 15 as shown in FIG. 2, and the elimination operation on each block is performed by the processor in charge of it.
[0005]
The advantages of the parallelized block Gaussian method over ordinary Gaussian elimination are as follows. First, the multiplications between real numbers in the elimination operation become multiplications between LB × LB matrices. In matrix multiplication, many operations are generally performed while the data is in the cache, so the cache is used effectively and single-processor performance on a RISC processor improves. In addition, because data is communicated between processors in units of blocks, the number of communications is smaller than in ordinary Gaussian elimination, and the reduced communication time improves parallelization efficiency. Because of these advantages, the parallelized block Gaussian method is the most common technique for solving simultaneous linear equations by Gaussian elimination on a parallel computer.
[0006]
[Problems to be solved by the invention]
In the parallelized block Gaussian method, the block size LB is determined by two different requirements. First, (1) from the standpoint of single-processor performance, in the matrix multiplication of the form C ← C − A × B that appears in the elimination operation, LB should be as large as possible, subject to the condition that matrices A, B, and C all fit in the cache. The larger LB is, the more operations are performed while the data is in the cache, and the more effectively the cache is used. On the other hand, (2) there is also an optimum value of LB from the standpoint of parallelization efficiency. As LB increases, the block that serves as the unit of data communication becomes larger, so the number of communications and hence the communication time decrease; at the same time, the variation in the number of blocks handled by each processor grows, and the waiting time due to load imbalance increases. As LB decreases, the number of data communications increases, but the load on each processor becomes more even. The balance between these two effects determines the LB that maximizes parallelization efficiency. In general, (1) the value of LB that maximizes single-processor performance and (2) the value of LB that maximizes parallelization efficiency do not coincide. As an example, FIG. 3 shows a case where the parallelized block Gaussian method is executed on 16 processors. In this example, as a result of choosing LB to maximize single-processor performance, processor 0 is in charge of four blocks 5 while processor 15 is in charge of only one block 7, a large difference in the number of assigned blocks. High parallelization efficiency therefore cannot be expected. Thus, in the conventional parallelized block Gaussian method, LB had to be chosen to achieve the best overall performance while sacrificing single-processor performance, parallelization efficiency, or both, so the performance of the parallel computer could not be fully exploited.
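The load imbalance described for FIG. 3 can be reproduced with a small counting sketch. It assumes a two-dimensional cyclic assignment of blocks to a 4 × 4 processor grid with row-major processor numbering; the figures are not reproduced here, so that mapping, like the names `owner` and `blocks_per_processor`, is an illustrative assumption rather than the layout actually drawn in FIG. 2.

```python
def owner(I, J, grid=4):
    """Processor number for block (I, J) under an assumed 2D cyclic mapping."""
    return (I % grid) * grid + (J % grid)


def blocks_per_processor(n_blocks, grid=4):
    """Count how many LB x LB blocks each processor owns when the matrix
    is n_blocks x n_blocks blocks in size (n_blocks = N / LB)."""
    counts = {p: 0 for p in range(grid * grid)}
    for I in range(n_blocks):
        for J in range(n_blocks):
            counts[owner(I, J, grid)] += 1
    return counts


coarse = blocks_per_processor(5)    # large LB: only 5 x 5 blocks
fine = blocks_per_processor(20)     # LB four times smaller: 20 x 20 blocks
print(coarse[0], coarse[15])        # processor 0 vs processor 15
print(min(fine.values()), max(fine.values()))
```

Under this assumed mapping, a 5 × 5 block partition gives processor 0 four blocks and processor 15 one block, matching the counts quoted for FIG. 3, while a 20 × 20 partition gives every processor the same number of blocks.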
[0007]
An object of the present invention is to solve this problem by improving the parallelized block Gaussian method, raising both single-processor performance and parallelization efficiency so that the performance of a parallel computer is fully exploited in linear calculations.
[0008]
[Means for Solving the Problems]
To achieve the above object, the present invention improves the parallelized block Gaussian method by introducing two different block sizes. The conventional method cannot raise single-processor performance and parallelization efficiency at the same time because the optimum block size differs for each of these two goals. In the present invention, therefore, the optimum block size LB is first determined from the standpoint of parallelization efficiency, the matrix is divided into blocks of this size, and each block is assigned to a processor. This makes it possible to optimize parallelization efficiency. Next, in the elimination operation, instead of eliminating the entire matrix at every block stage, block pivot rows and block pivot columns for a certain number of stages k are created first, and the elimination is then performed collectively, once every k stages. As a result, the elimination operation can be processed as a matrix multiplication of size MB = k * LB, and by choosing k appropriately, single-processor performance can be optimized.
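One way to read "choosing k appropriately" is sketched below. It assumes the criterion given later in the description of Example 1, namely that three MB × MB operands should fit in the cache, and it measures the cache size in words. The function name `choose_mb`, the parameter names, and the example cache size are hypothetical.

```python
def choose_mb(cache_words, LB, operands=3):
    """Return (k, MB) with MB = k * LB, where k is the largest integer such
    that `operands` square MB x MB matrices fit in `cache_words` words.
    Falls back to k = 1 if even a single LB-sized stage does not fit."""
    k = 1
    while operands * ((k + 1) * LB) ** 2 <= cache_words:
        k += 1
    return k, k * LB


# Example: a cache of 32768 words and a distribution block size LB = 16.
k, MB = choose_mb(32768, 16)
print(k, MB)  # 6 96
```

For these example numbers, MB = 96 also satisfies the window MB × MB ≤ W ≤ 4 × MB × MB stated in claim 2 (9216 ≤ 32768 ≤ 36864).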
[0009]
[Action]
Taking as an example the case where Gaussian elimination is executed on 16 processors, the following describes how the present invention achieves high parallelization efficiency and high single-processor performance at the same time.
[0010]
First, FIG. 4(a) shows the elimination operation on the entire matrix. The matrix is divided into blocks, each of size LB × LB. The shaded blocks in the matrix are the blocks assigned to processor 0. By making LB sufficiently small, the number of blocks assigned to each processor becomes nearly equal, and parallelization efficiency can be raised. FIG. 4(a) shows the case where block pivot columns and block pivot rows for four stages are created first and the elimination for the four stages is then performed collectively. In this case, processor 0 performs the elimination operation using the shaded blocks of the block pivot rows and columns. FIG. 4(b) extracts and shows only the operations performed by processor 0. As is clear from the figure, by performing the elimination for four block stages collectively, the elimination operation on processor 0 can be processed as a multiplication between matrices of size MB = 4 * LB. The matrix multiplication is thus larger than in the conventional parallelized block Gaussian method, and single-processor performance can be raised through effective use of the cache.
[0011]
As described above, according to the method of the present invention, it is possible to simultaneously improve the parallelization efficiency and the single-unit performance.
[0012]
[Examples]
(Example 1)
Hereinafter, the principle and embodiments of the present invention will be described in detail with reference to the drawings. The embodiment described here is a method of solving simultaneous linear equations by Gaussian elimination using a parallel computer. As shown in FIG. 1, the parallel computer system to which the method is applied comprises an input device 1 for inputting data such as the matrix data and the right-hand side vector, a simultaneous linear equation solving device 2 having p processors 31 each equipped with a cache memory, and an output device 3 for outputting the solution. FIG. 5 shows the processing in this embodiment. First, the coefficient matrix A, the right-hand side vector b, the matrix dimension N, and the two block sizes LB and MB are input from the input device (process 13), the matrix data and the right-hand side vector are assigned to the processors (process 14), and the elimination operation is performed for each block stage from K = 1 to K = N / LB to obtain the LU decomposition of the coefficient matrix (process 15). Next, using the LU decomposition obtained, a forward elimination operation (process 22) and a back substitution operation (process 23) are performed to obtain the solution. Finally, the solution, which is stored in distributed form across the processors, is collected (process 24) and output (process 25). The most significant features of the present invention lie in the data allocation of process 14 and the elimination operation of process 15, so these parts are described below.
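The forward elimination and back substitution of processes 22 and 23 are the usual triangular solves that follow an LU decomposition. The sketch below runs them on a single processor and assumes that both factors are stored in one array `F`, with the unit lower triangular factor L below the diagonal and U on and above it. That storage convention and the name `solve_with_lu` are assumptions made for illustration; the distributed collection of process 24 is not modeled.

```python
def solve_with_lu(F, b):
    """Solve A x = b given A = L U packed in F (L unit lower, U upper)."""
    n = len(F)
    y = list(b)
    # Forward elimination (process 22): solve L y = b, top to bottom.
    for i in range(n):
        y[i] -= sum(F[i][t] * y[t] for t in range(i))
    # Back substitution (process 23): solve U x = y, bottom to top.
    x = y[:]
    for i in reversed(range(n)):
        x[i] = (x[i] - sum(F[i][t] * x[t] for t in range(i + 1, n))) / F[i][i]
    return x


# A = [[4, 3], [6, 3]] factors as L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]].
F = [[4.0, 3.0], [1.5, -1.5]]
print(solve_with_lu(F, [10.0, 12.0]))  # [1.0, 2.0]
```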
[0013]
(1) Data allocation to processors
Using the block size LB for processor allocation that was input in the input step (process 13), the coefficient matrix is divided into blocks of size LB × LB as shown in FIG. 2, and each block is assigned to a processor. Reference numerals 5, 6, and 7 in FIG. 2 indicate the blocks assigned to processors 0, 1, and 15, respectively. When LB is large, the unit of communication is large, so the number of communications and the communication overhead decrease; however, the variation in the number of blocks assigned to each processor grows, and idle time increases because of load imbalance among the processors. Conversely, when LB is small, the load is even, but the number of communications and the communication overhead increase. In the present invention, LB is determined in advance so as to maximize parallelization efficiency, based on the trade-off between these two effects.
[0014]
(2) Elimination operation
The elimination operation in the present invention is characterized in that the entire matrix is not eliminated at every block stage. Instead, k block pivot columns and rows are created in advance and the elimination for k block stages is then performed together, so that the elimination operation is processed as an operation on large matrices of size MB = k * LB and the cache is used effectively. Each step of the elimination operation is described below with reference to processes 15 to 21 in FIG. 5.
[0015]
First, at the beginning of the K-th block stage, the LU decomposition of the (K, K) block is performed (process 16). This is the same as in the conventional parallelized block Gaussian method. Next, the K-th block pivot column is created by multiplying the K-th block column from the right by the inverse of the U just obtained (process 17), and the K-th block pivot row is created by multiplying the K-th block row from the left by the inverse of L (process 18). Elimination is then performed using the K-th block pivot column and the K-th block pivot row (process 19). This elimination, however, is not applied to the entire matrix but only to the part that will become the next k block pivot columns and rows. Eliminating this range is sufficient to create the block pivot columns and rows for k stages. After this, if K is not a multiple of k = MB / LB, the process proceeds to the (K+1)-th block stage (process 20). If K is a multiple of k = MB / LB, the entire matrix is eliminated using the k block pivot columns and rows created so far (process 21). FIG. 4(a) shows the processing in this elimination operation. In this example k = 4, and the elimination is performed using four block pivot columns 8 and four block pivot rows 10. FIG. 4(a) shows the operation on the entire coefficient matrix 4, and FIG. 4(b) extracts the operation on the blocks 5 assigned to processor 0 (each of size LB × LB). As is clear from the figure, using the part 9 of the block pivot columns and the part 11 of the block pivot rows that are used in the operations of processor 0, the elimination operation can be processed as a multiplication of matrices of size MB = 4 * LB. Consequently, if k is chosen so that three MB × MB matrices just fit in the cache, the cache is used to the fullest and single-processor performance improves. In the conventional parallelized block Gaussian method, by contrast, the entire matrix is eliminated at every block stage, so the elimination operation is an LB × LB matrix multiplication; since the range of LB is constrained by the need to optimize parallelization efficiency, it is difficult to bring out the full single-processor performance. The present invention greatly improves on this point.
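The stage-by-stage procedure of processes 16 to 21 can be sketched as a single-process simulation. The code below is a minimal illustration under stated assumptions, not the patented parallel implementation: it runs on one processor, works on plain nested lists, performs no row exchanges (so it assumes a matrix, such as a diagonally dominant one, for which none are needed), and requires MB to be a multiple of LB. The names `rank_update` and `two_level_block_lu` are hypothetical.

```python
def rank_update(A, rows, cols, mid):
    """A[rows, cols] -= A[rows, mid] @ A[mid, cols], element by element."""
    for i in rows:
        for j in cols:
            A[i][j] -= sum(A[i][t] * A[t][j] for t in mid)


def two_level_block_lu(A, LB, MB):
    """In-place LU with distribution block size LB and update size MB.
    On return A holds L (unit lower, below the diagonal) and U."""
    assert MB % LB == 0
    n = len(A)
    for s in range(0, n, LB):
        e = min(s + LB, n)            # current block stage covers s..e-1
        ps = (s // MB) * MB           # start of the current group of k stages
        pe = min(ps + MB, n)          # end of the current group
        # Process 16: LU decomposition of the diagonal block.
        for p in range(s, e):
            for i in range(p + 1, e):
                A[i][p] /= A[p][p]
                for j in range(p + 1, e):
                    A[i][j] -= A[i][p] * A[p][j]
        # Process 17: block pivot column, (block column) * inverse(U).
        for i in range(e, n):
            for j in range(s, e):
                A[i][j] = (A[i][j] - sum(A[i][t] * A[t][j]
                                         for t in range(s, j))) / A[j][j]
        # Process 18: block pivot row, inverse(L) * (block row).
        for j in range(e, n):
            for i in range(s, e):
                A[i][j] -= sum(A[i][t] * A[t][j] for t in range(s, i))
        # Process 19: eliminate only what becomes the remaining pivot
        # columns and rows of this group.
        rank_update(A, range(e, n), range(e, pe), range(s, e))
        rank_update(A, range(e, pe), range(pe, n), range(s, e))
        # Processes 20 and 21: at the end of a group, update the rest of
        # the matrix in one multiplication with inner dimension MB.
        if e == pe:
            rank_update(A, range(pe, n), range(pe, n), range(ps, pe))
    return A
```

Calling `two_level_block_lu(A, LB, LB)` makes every stage end a group, which corresponds to the conventional method; with MB = k * LB the final `rank_update` has an inner dimension of up to MB instead of LB. Both settings should produce the same factors up to rounding.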
[0016]
(Example 2)
Example 1 described Gaussian elimination, which is one kind of linear calculation. The present invention is also applicable to other linear calculations in which a matrix is eliminated step by step using pivot columns and pivot rows. Examples of such calculations include the QR decomposition of a matrix, used in the least squares method and elsewhere, and the Householder transformation, used in eigenvalue calculations. This example describes the application of the present invention to the least squares method.
[0017]
As an example of a least squares problem, consider approximating the relationship between an input x and an output y by a polynomial y = f(x) of degree n−1, based on m experimentally obtained pairs of input and output data (x1, y1), (x2, y2), ..., (xm, ym). The quantities to be found are the coefficients a1, a2, ..., an of the terms of degree 0, 1, ..., n−1 in x in f(x). Following the standard least squares procedure, one first forms the matrix A whose (i, j) element is the (j−1)-th power of the i-th input xi, computes its QR decomposition A = QR, and then solves the simultaneous linear equations whose coefficient matrix is R and whose right-hand side vector is (transpose of Q) × (y1, y2, ..., ym). Of this series of steps, the QR decomposition of A requires the most computation, and the present invention can be used when it is executed on a parallel machine.
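The least squares procedure just described can be illustrated on a single processor. The sketch below builds the matrix of powers, factors it, and solves the triangular system. For brevity it uses modified Gram-Schmidt to obtain a thin QR factorization (Q of size m × n, R of size n × n); that is a substitution made here for illustration and is not the blocked QR procedure of FIG. 6. All function names are hypothetical.

```python
def power_matrix(xs, n):
    """Matrix A whose (i, j) element is xs[i] ** j for j = 0 .. n-1."""
    return [[x ** j for j in range(n)] for x in xs]


def mgs_qr(A):
    """Thin QR by modified Gram-Schmidt; assumes full column rank."""
    m, n = len(A), len(A[0])
    Q = [[float(v) for v in row] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for c in range(j + 1, n):
            R[j][c] = sum(Q[i][j] * Q[i][c] for i in range(m))
            for i in range(m):
                Q[i][c] -= R[j][c] * Q[i][j]
    return Q, R


def least_squares_fit(xs, ys, n):
    """Coefficients a1..an of the degree n-1 polynomial fitted to (xs, ys)."""
    Q, R = mgs_qr(power_matrix(xs, n))
    rhs = [sum(Q[i][j] * ys[i] for i in range(len(ys))) for j in range(n)]
    a = [0.0] * n
    for j in reversed(range(n)):      # back substitution on R a = Q^T y
        a[j] = (rhs[j] - sum(R[j][c] * a[c] for c in range(j + 1, n))) / R[j][j]
    return a


xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]   # data lying exactly on a quadratic
print(least_squares_fit(xs, ys, 3))         # approximately [1, 2, 3]
```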
[0018]
FIG. 6 shows the processing when this QR decomposition is executed on a parallel machine using the technique of the present invention. In the QR decomposition as well, the part in which the matrix is divided into blocks according to the block size LB chosen to maximize parallelization efficiency and assigned to the processors (process 27) is the same as in the Gaussian elimination of Example 1. In the subsequent elimination operation, however, the preprocessing operation for creating the K-th block pivot row and column is replaced: the LU decomposition (process 16 in FIG. 5) used in Gaussian elimination becomes a QR decomposition (process 28 in FIG. 6). The details of creating the K-th block pivot row and column (processes 16 and 17 in FIG. 6) also differ from those in Gaussian elimination. Nevertheless, the technique of the present invention (process 21), in which several block pivot rows and columns are created in advance and the elimination is performed collectively once every MB / LB stages, applies in exactly the same way. The effect obtained by this technique is also the same as in Gaussian elimination.
[0019]
This example has shown the QR decomposition, used in the least squares method and elsewhere, as an application to another linear calculation, but the technique can be applied in the same way to still other linear calculation algorithms, such as the Householder transformation.
[0020]
[Effects of the Invention]
As described above, according to the present invention, both single-processor performance and parallelization efficiency can be raised in linear calculations such as Gaussian elimination on a parallel computer, making possible computation that fully exploits the performance of the parallel computer.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an entire linear calculation system using a parallel computer.
FIG. 2 is a diagram showing processor assignment to blocks in a conventional parallelized block Gaussian method.
FIG. 3 is a diagram showing that, in the conventional parallelized block Gaussian method, as a result of increasing the block size in order to improve the performance of a single unit, the number of blocks in charge of each processor becomes uneven.
FIG. 4 is a diagram showing that, according to the present invention, a large block size can be used for the elimination operation even if a small block size is used for processor allocation.
FIG. 5 is a diagram showing an entire procedure when the present invention is applied to a simultaneous linear equation solution method by the Gaussian elimination method.
FIG. 6 is a diagram showing an overall procedure when the present invention is applied to a QR decomposition of a matrix.
[Explanation of symbols]
1: input device, 2: simultaneous linear equation solving device, 3: output device, 4: entire coefficient matrix, 5: block assigned to processor 0, 6: block assigned to processor 1, 7: block assigned to processor 15, 8: entire block pivot columns for four stages, 9: part of the block pivot columns used in the calculation by processor 0, 10: entire block pivot rows for four stages, 11: part of the block pivot rows used in the calculation by processor 0, 12: start, 13: data input, 14: assignment of the coefficient matrix and right-hand side vector to processors, 15: outermost iteration loop, 16: LU decomposition of the (K, K) block, 17: creation of the K-th block pivot column, 18: creation of the K-th block pivot row, 19: partial elimination operation using the K-th block pivot column and row, 20: determination of whether K is a multiple of MB / LB, 21: simultaneous elimination operation for K stages, 22: forward elimination operation, 23: back substitution operation, 24: collection of the solution vector, 25: output of the solution vector, 26: end, 27: assignment of matrix data to processors, 28: QR decomposition of the (K, K) block, 29: collection of the solution matrix, 30: output of the solution matrix, 31: processor, 32: cache memory.

Claims

1. A method for performing a linear calculation on an N-dimensional matrix A using a parallel computer comprising an input device, a processing device including a plurality of processors each having a cache memory, and an output device, wherein:
the matrix A to be calculated and two different block sizes LB and MB are input;
the coefficient matrix A is divided into a plurality of blocks of size LB × LB, and the divided blocks are distributed among and assigned to the plurality of processors;
the linear calculation is executed sequentially for each block stage of the divided blocks;
in the calculation of the K-th stage (1 ≦ K ≦ N / LB) of the sequential execution, the processor assigned to each block of the block stage takes charge of its share of the following operations:
a. the preprocessing operation required to create the K-th block pivot column and the K-th block pivot row;
b. creation of the K-th block pivot column;
c. creation of the K-th block pivot row;
d. an elimination operation, using the K-th block pivot column and the K-th block pivot row, on that portion of the part of the matrix A to be updated which is needed to create the block pivot columns and block pivot rows of the next several stages;
it is determined whether K is a multiple of MB / LB, and when it is, the following is executed in addition to the above:
e. a simultaneous update operation for MB / LB stages on the part of the matrix A to be updated, using the (K−MB/LB+1)-th, (K−MB/LB+2)-th, ..., K-th block pivot columns and block pivot rows;
and the linear calculation is executed in parallel by the above procedure.

2. The linear calculation method according to claim 1, wherein the input processing includes a process of determining MB so as to satisfy MB × MB ≦ W ≦ 4 × MB × MB, where W is the size of the cache memory in words, measured in units of the data used for the calculation.

3. The linear calculation method according to claim 1, wherein the preprocessing operation is an operation for performing an LU decomposition of the (K, K) block of the coefficient matrix A in the N-dimensional simultaneous linear equations Ax = b.

4. The linear calculation method according to claim 1, wherein the preprocessing operation is an operation for performing a QR decomposition of the (K, K) block of the N-dimensional matrix A.