JP2010122850A

JP2010122850A - Device and method for calculating matrix equation

Info

Publication number: JP2010122850A
Application number: JP2008295135A
Authority: JP
Inventors: Hideki Kawaguchi; 秀樹川口; Ippei Tahara; 一平田原
Original assignee: Muroran Institute of Technology NUC
Current assignee: Muroran Institute of Technology NUC
Priority date: 2008-11-19
Filing date: 2008-11-19
Publication date: 2010-06-03

Abstract

<P>PROBLEM TO BE SOLVED: To provide a device and method for calculating a matrix equation, which enable high-speed calculation of a large sparse matrix equation appearing as an equation to be solved finally in numerical simulation, such as electromagnetic field analysis and which can be applied to both cases that the large sparse matrix of the large sparse matrix equation is a symmetric matrix and an asymmetric matrix. <P>SOLUTION: The matrix equation calculation device 1, which solves the large matrix equation by an iterative method, includes: a memory which includes a predetermined number of memories for storing only non-zero components for each row among the components of the large sparse matrix of the large sparse matrix equation; and one or more operation sections which perform at least part of operations of the iterative method with a data flow format. The device is configured such that data required for the operations of the iterative method for each line may be loaded to the operation section from the memory at once. <P>COPYRIGHT: (C)2010,JPO&INPIT

Description

この発明は、行列方程式計算装置および行列方程式計算方法に関し、例えば、電磁界解析、流体解析、構造解析などの数値シミュレーションにおいて最終的に解くべき方程式として現れる大型（大規模）疎行列方程式の計算に適用して好適なものである。 The present invention relates to a matrix equation calculation apparatus and a matrix equation calculation method, for example, for calculating a large (large-scale) sparse matrix equation that appears as an equation to be finally solved in a numerical simulation such as electromagnetic field analysis, fluid analysis, and structural analysis. It is suitable for application.

電磁界解析、流体解析、構造解析などにおける離散化された偏微分方程式を解くための科学技術計算は、多くの場合、最終的には、求めるべき未知数に関し行列方程式を解く問題に帰着する。また、これらの科学技術計算においては、多くの場合、解くべき行列方程式の行列は、次数が数千〜数千万と大規模で、かつ、その成分（要素）がほとんどゼロとなる疎行列、すなわち大型疎行列となる。実際、例えば、電磁界解析では、図１０に示すように、静電場・静磁場、準静的場、高周波の場などの電磁場の種類ごとに解くべき行列方程式は異なり、一方、偏微分方程式の解法には、大別して、差分法（ＦＤＭ）、有限要素法（ＦＥＭ）、境界要素法（ＢＥＭ）などの数値解析法があるため、これらの組合せにより非常に多くの計算方法が存在するものの、そのほとんどの計算方法は大型疎行列方程式を解く問題に帰着する。 In many cases, scientific and technical calculations for solving a discretized partial differential equation in electromagnetic field analysis, fluid analysis, structural analysis, etc. ultimately result in a problem of solving a matrix equation with respect to unknowns to be obtained. In these scientific and technical calculations, in many cases, the matrix of the matrix equation to be solved is a sparse matrix whose order is several thousand to several tens of millions, and whose components (elements) are almost zero, That is, a large sparse matrix. Actually, in the electromagnetic field analysis, for example, as shown in FIG. 10, the matrix equation to be solved is different for each type of electromagnetic field such as electrostatic field / static magnetic field, quasi-static field, high frequency field, etc. There are numerical methods such as the difference method (FDM), the finite element method (FEM), and the boundary element method (BEM), and there are a lot of calculation methods by combining these methods. Most of the calculation methods result in the problem of solving large sparse matrix equations.

これらの科学技術計算では、その計算規模が大きくなればなるほど全体の計算時間に対する行列方程式の計算時間の占める割合は大きくなり、しばしば９５％以上にもなる。近年のスーパーコンピュータの性能向上、および、パーソナルコンピュータ（ＰＣ）レベルにおける高性能化に助長され、科学技術計算に対する要求もますます強くなり、取り扱うべき問題も大規模化している。とりわけ、短期間での製品開発が要求される産業応用の場では、ハイパフォーマンスコンピューティング（ＨＰＣ）技術が非常に大きな解決課題となっている。このような背景から、科学技術計算の大部分を占める行列方程式の計算の高速化が、科学技術計算全体の最大の課題の１つとなっているといって過言でない。 In these science and technology calculations, the larger the calculation scale, the larger the ratio of the calculation time of the matrix equation to the total calculation time, often 95% or more. The demand for scientific and technological computations has become stronger and the problems to be handled have been increased in scale due to the recent improvement in performance of supercomputers and the enhancement of performance at the personal computer (PC) level. In particular, high performance computing (HPC) technology has become a very big issue in industrial applications where product development in a short period of time is required. Against this background, it is no exaggeration to say that speeding up the calculation of matrix equations, which account for the majority of scientific and technical calculations, is one of the biggest challenges of scientific and technical calculations overall.

一般に、行列方程式の解法は、原理的には未知数を１つずつ消去しながら解を求めていくＬＵ分解法に代表されるような直接解法と、初期値を設定してそこから真の解に向かって繰返し近似解を更新し収束値を求めていく反復解法とに大別される。このうち反復解法は、共役傾斜法（ＣＧ法）などのように線形連立方程式を等価なＮ次元２次形式（Ｎは行列の次数）の極小値を求める問題に置き換える方法と、そのような問題の置き換えは行わず行列を対角成分と非対角成分とに分離し直接対角成分のみから近似解の更新を行う単純反復法とに分類される。さらに、線形連立方程式を等価なＮ次元２次形式の極小値を求める問題に置き換える方法には、対称行列を適用対象とするＣＧ法をベースに、それを非対称行列に拡張した双共役傾斜法（ＢｉＣＧ法）や、反復計算を行う前に行列を前処理したり、各反復過程で最適な探索方向ベクトルを構成するなどして反復計算を安定化したり高速化したりする、ＩＣＣＧ法、ＣＧＳ法、ＢｉＣＧ−Ｓｔａｂ法、ＢｉＣＧ−Ｓｔａｂ２法などの派生的な方法が多数提案されている。 In general, there are two methods for solving matrix equations. In principle, direct solutions such as the LU decomposition method, in which unknowns are deleted one by one and the solution is obtained, and initial values are set and a true solution is obtained. This is roughly divided into an iterative solution method in which the iterative approximate solution is updated and the convergence value is obtained. Among them, the iterative solution method is a method of replacing linear simultaneous equations with a problem of obtaining a minimum value of an equivalent N-dimensional quadratic form (N is the order of a matrix), such as a conjugate gradient method (CG method), and such a problem. The matrix is classified into a simple iterative method in which the matrix is separated into a diagonal component and a non-diagonal component, and the approximate solution is updated only from the diagonal component. Furthermore, the method of replacing the linear simultaneous equations with the problem of obtaining an equivalent N-dimensional quadratic local minimum value is based on the CG method that applies a symmetric matrix to the biconjugate gradient method that extends it to an asymmetric matrix ( BiCG method), pre-processing the matrix before performing the iterative calculation, or configuring the optimal search direction vector in each iteration process to stabilize or speed up the iterative calculation, ICCG method, CGS method, Many derivative methods such as the BiCG-Stab method and the BiCG-Stab2 method have been proposed.

従来、これらの方法のうち大型疎行列方程式の解法としては、上記の反復解法が有効であると考えられており、電磁界解析、流体解析、構造解析などの各種の数値シミュレーションに現れる大型疎行列方程式の計算にしばしば用いられている。 Conventionally, it is considered that the above iterative method is effective for solving large sparse matrix equations among these methods. Large sparse matrices appearing in various numerical simulations such as electromagnetic field analysis, fluid analysis, and structural analysis. Often used to calculate equations.

ＣＧ法などの行列方程式の反復解法においては、解くべき行列形式の線形（１次）連立方程式の近似解を次の手順で求める。まず、Ｎを未知数の数あるいは行列の次数として、解くべき線形連立方程式を、それと等価なＮ次元２次形式の極小値を求める問題に置き換える。そして、初期値として適当なＮ次元解ベクトルを設定し、この初期値（ｋ＝０）を出発点にして、ＣＧ法などのそれぞれの反復解法の方法にしたがって次の近似解（ｋ＝１）を求め、さらに、この近似解（ｋ＝１）からその次の近似解（ｋ＝２）を求めていき、このｋ番目の近似解から（ｋ＋１）番目の近似解を求める手続きを繰返す。そして、この繰返し計算を（ｋ＋１）番目の近似解があらかじめ設定された精度以内になるまで行い、この精度以内になったところ、すなわち解が収束したところで上記の反復計算を終了し、この収束した値を数値計算としての最終的な近似解とする。 In an iterative solution of a matrix equation such as the CG method, an approximate solution of a linear (primary) simultaneous equation in a matrix form to be solved is obtained by the following procedure. First, with N being the number of unknowns or the order of a matrix, the linear simultaneous equations to be solved are replaced with a problem of obtaining an equivalent N-dimensional quadratic minimum value. Then, an appropriate N-dimensional solution vector is set as an initial value. The initial value (k = 0) is used as a starting point, and the following approximate solution (k = 1) is determined according to each iterative method such as the CG method. Further, the next approximate solution (k = 2) is obtained from this approximate solution (k = 1), and the procedure for obtaining the (k + 1) th approximate solution from this kth approximate solution is repeated. Then, this iterative calculation is performed until the (k + 1) -th approximate solution is within the preset accuracy, and when it is within this accuracy, that is, when the solution has converged, the above iterative calculation is terminated and this converged Let the value be the final approximate solution as a numerical calculation.

上記の行列方程式の反復解法の高速化技術としては、ＰＣクラスタ、あるいはさらに大規模なスーパーコンピュータなどの並列処理かつ高速なＣＰＵ上での走行が最も一般的に利用されているが、これらの計算機は比較的限られた研究機関でのみ利用が可能であったり、共同利用型であったりと、製品開発での使用などのだれもが手軽に利用できる一般の環境とはなっていない。 As a technique for speeding up the iterative solution of the matrix equation, running on a parallel processing and high-speed CPU such as a PC cluster or a larger supercomputer is most commonly used. It is not a general environment that anyone can use easily because it can be used only in relatively limited research institutes, or it can be used jointly.

これに対し、近年のＦＰＧＡ、ＣＰＬＤなどの書換え可能なＬＳＩの出現およびその手軽な設計・利用環境、さらに、プリント回路基板設計・開発環境の充実化を背景に、科学技術計算のＨＰＣ技術の選択肢の一つとして、ターゲットをしぼり、計算方法に特化したアーキテクチャのハードウェアを構成することにより、安価に、かつ、身近なＰＣの環境で実現することができる専用計算機が新たに注目されつつある。 On the other hand, against the backdrop of the recent emergence of rewritable LSIs such as FPGA and CPLD and their easy design / use environment, as well as the enhancement of the printed circuit board design / development environment, HPC technology options for scientific calculation As one of them, a dedicated computer that can be realized at low cost and in a familiar PC environment by narrowing down the target and configuring hardware with an architecture specialized for the calculation method is newly attracting attention. .

現在、線形行列計算を行うための専用計算機（以下、行列計算専用計算機ともいう）には、密行列をターゲットとしたものとして、ＬＵ分解法を高速に処理すべく複雑なＬＵ分解処理はホスト計算機に行わせ前進代入処理などの部分的な単純な処理を専用のハードウェアで行うタイプ、疎行列をターゲットとしたものとして、ハードウェア構成が比較的簡単な単純反復法の反復処理と収束判定処理とを直接ハードウェア化したタイプ、および、疎行列用のＬＵ分解法あるいはＣＧ法の演算のうち、ベクトルの内積演算あるいは行列とベクトルとの乗算の部分をハードウェア化したタイプなどがある（非特許文献１〜１３参照。）。 At present, a dedicated computer for performing a linear matrix calculation (hereinafter also referred to as a matrix calculation dedicated computer) is targeted at a dense matrix, and a complicated LU decomposition process is performed in order to perform the LU decomposition method at high speed. A type that performs partial simple processing such as forward substitution processing with dedicated hardware, targeting a sparse matrix, and iterative processing and convergence determination processing of a simple iteration method with a relatively simple hardware configuration And a type in which the inner product operation of a vector or the multiplication of a matrix and a vector is implemented in hardware among the LU decomposition method or CG method for a sparse matrix (non- (See Patent Documents 1 to 13.)

浅井秀樹，浅井光男，田中衛，大規模スパース行列のＬＵ分解専用並列計算機，信学論，Vol.J69-D, No.7, pp.1044-1053 (1986)Hideki Asai, Mitsuo Asai, Mamoru Tanaka, Dedicated Parallel Computer for LU Decomposition of Large Sparse Matrix, Science, Vol.J69-D, No.7, pp.1044-1053 (1986) 田辺昇，土肥康孝，連想スイッチによる疎行列計算機の構成，信学論，Vol.J70-D, No.12, pp.2393-2401 (1987)Noboru Tanabe, Yasutaka Dohi, Construction of sparse matrix computer with associative switch, theory of theory, Vol.J70-D, No.12, pp.2393-2401 (1987) 田辺昇，石坂健一，村越英機，泉谷昭二，土肥康孝，並列パイプライン式疎行列専用計算機ＲＡＭＰの構成，信学論，Vol.J71-D, No.10, pp.1939-1948 (1988)Noboru Tanabe, Kenichi Ishizaka, Hideki Murakoshi, Shoji Izumiya, Yasutaka Dohi, Construction of Parallel Pipeline Sparse Matrix Dedicated Computer RAMP, Science Theory, Vol.J71-D, No.10, pp.1939-1948 (1988) 清木泰，福重俊幸，泰地真弘人，牧野淳一郎，小河正基，戒崎俊一，密行列専用計算機ＧＥＮＥＲＡＬ−１の開発，計算機アーキテクチャ，111-9, pp.65-72 (1995)Yasushi Kiyoki, Toshiyuki Fukushige, Masato Taiji, Shinichiro Makino, Masaki Ogawa, Shunichi Kaizaki, Development of a dedicated matrix computer GENERAL-1, Computer Architecture, 111-9, pp.65-72 (1995) 泰地真弘人，藤田実，科学技術計算に適した準並列ＣＰＵの開発に関する研究，ＮＥＤＯ平成１２年度提案公募事業成果報告会予稿集，提案公募:99S29-019-1 (2000)Masahiro Taiji, Minoru Fujita, Research on the development of quasi-parallel CPUs suitable for scientific and engineering computation, NEDO 2000 Proposal Public Proposal Report Meeting, Proposal Open: 99S29-019-1 (2000) 大野洋介，戒崎俊一，泰地真弘人，小長谷明彦，行列専用計算機ＭＡＣＥ（M Atrix Computation Engine）の実効性能Yosuke Ohno, Shunichi Kaizaki, Masahiro Taiji, Akihiko Konaseya, Effective Performance of Matrix Computation Engine (MACE) 長谷川大，回路シミュレーション用行列演算専用計算機の開発，信学技報（TECHNICAL REPORT OF IEICE ），CA2002-116, pp.51-56 (2003)University of Hasegawa, Development of a dedicated computer for matrix operation for circuit simulation, Technical Report of IEICE, CA2002-116, pp.51-56 (2003) 瀧口善貴, 松岡俊佑, 川口秀樹, ラプラス/ ポアソン方程式差分法ソルバー専用計算機の開発に関する研究，平成１７年電気・情報関係学会北海道支部連合大会 (2005), No.121.Yoshitaka Higuchi, Toshiaki Matsuoka, Hideki Kawaguchi, Research on the development of Laplace / Poisson equation finite difference solver dedicated computer, 2005 Japan Association of Electrical and Information Engineering Hokkaido Branch (2005), No.121. G.R.Morris, V.K.Prasanna, R.D.Anderson, A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer, Proc. of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), Napa, California (2006)GRMorris, VKPrasanna, RDAnderson, A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer, Proc. Of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), Napa, California (2006 ) R. Strzodka, D. Goeddeke, Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components, Extended technical report of IEEE Proc. on Field-Programmable Custom Computing Machines (FCCM'06), Napa, California (2006)R. Strzodka, D. Goeddeke, Pipelined Mixed Precision Algorithms on FPGAs for Fast and Accurate PDE Solvers from Low Precision Components, Extended technical report of IEEE Proc. On Field-Programmable Custom Computing Machines (FCCM'06), Napa, California (2006 ) O. Maslennikow, V. Lepekha, A. Sergyienko, FPGA Implementation of the Conjugate Gradient Method, LNCS 3911, pp.526-533 (2006)O. Maslennikow, V. Lepekha, A. Sergyienko, FPGA Implementation of the Conjugate Gradient Method, LNCS 3911, pp.526-533 (2006) 田原一平, 川口秀樹, 松岡俊祐，線形計算専用計算機・共役傾斜法マシンの開発に関する研究，平成１９年電気・情報関係学会北海道支部連合大会 (2007), p.147.Tahara, Ippei, Kawaguchi, Hideki, Matsuoka, Shunsuke, Research on the development of computers dedicated to linear computation and conjugate gradient method, 2007 Hokkaido Association of Electrical and Information Society (2007), p.147. D. DuBois, A. DuBois, Sparse Matrix-Vector Multiplicationand Conjugate Gradient Algorithm on Hybrid Computing Platforms, LALP-07-041 (2007)D. DuBois, A. DuBois, Sparse Matrix-Vector Multiplication and Conjugate Gradient Algorithm on Hybrid Computing Platforms, LALP-07-041 (2007)

行列方程式計算用の専用計算機においては、ソフトウェアプログラムベースの計算機で行う処理と全く同じことをハードウェア化することは困難である。このため、従来は、全体の処理の中でハードウェア化に適し、かつ、ハードウェア化により有効に高速化が図れる処理を適切に選び出してそれらを専用計算機で実行させ、ホスト計算機と連動しながら全体として処理を高速化する方法が採られてきた。 In a dedicated computer for matrix equation calculation, it is difficult to implement exactly the same processing as that performed by a software program-based computer. For this reason, conventionally, it is suitable for hardware processing in the overall processing and appropriately selects processing that can be effectively speeded up by hardware processing and executes them on a dedicated computer, while interlocking with the host computer A method for speeding up the processing as a whole has been adopted.

非特許文献４では、密行列をターゲットとしガウスの消去法のための専用計算機を提案している。とりわけ、ガウスの消去法の中で演算量のほとんどを占める前進消去の内積演算の部分をハードウェア化する方式について検討し、計算機のシステム構成を提案している。しかしながら、メモリ容量の制約から、１０００次元以下の行列を対象としており、大規模な計算を行うにはさらなる検討が必要である。 Non-Patent Document 4 proposes a dedicated computer for Gaussian elimination with a dense matrix as a target. In particular, a method for hardware implementation of the forward erasure inner product operation, which accounts for most of the computation amount in the Gaussian elimination method, is proposed, and a computer system configuration is proposed. However, due to memory capacity limitations, the target is a matrix of 1000 dimensions or less, and further studies are necessary to perform large-scale calculations.

非特許文献５では、同じく密行列をターゲットとしたＬＵ分解法の専用計算機について、ＬＵ分解法の計算ではデータ量（行列の次元）Ｎに対して、演算量がＮ²、Ｎ^1.5とメモリアクセスがボトルネックになりづらい性質に着目して、ＬＵ分解法に現れる内積演算に特化し並列に高速処理することができる専用計算機の方式検討を行い、計算機のシステム構成を提案している。 In Non-Patent Document 5, for the LU decomposition method dedicated computer that also targets a dense matrix, the calculation amount is N ² , N ^1.5 and the memory access for the data amount (matrix dimension) N in the LU decomposition method calculation. Focusing on the property that is difficult to become a bottleneck, we are investigating the method of a dedicated computer capable of high-speed processing in parallel, specializing in the inner product operation appearing in the LU decomposition method, and proposing the system configuration of the computer.

また、非特許文献６では、同じく密行列をターゲットとしたＬＵ分解法の専用計算機について、ＬＵ分解法の演算のうち行列のＬＵ分解のための行列成分の四則演算処理のみをハードウェア化し、ピボット選択や前進代入・後退代入などの処理はホスト計算機で処理する方式について検討し、計算機のシステム構成を提案している。 Further, in Non-Patent Document 6, with regard to a dedicated computer for LU decomposition method, which also targets a dense matrix, only the four arithmetic operations of matrix components for LU decomposition of a matrix among the operations of LU decomposition method are hardwareized. We consider a method of processing by selection and forward substitution / backward substitution by a host computer, and propose a system configuration of the computer.

このとき、上記の密行列を対象とした専用計算機による行列計算の高速化の試みにおいては、必然的に大容量のメモリが必要となり、また、より高速計算を実現するためにはハードウェアが大規模となるため、高々、数千次元程度の行列サイズを取り扱うのが限界であり、科学技術計算にしばしば現れるより大規模な計算への適用は困難であるという問題があった。 At this time, in the attempt to speed up the matrix calculation by the dedicated computer for the above-described dense matrix, a large amount of memory is inevitably required, and a large amount of hardware is required to realize higher speed calculation. Due to the scale, it is limited to handle a matrix size of about several thousand dimensions at most, and there is a problem that it is difficult to apply to larger-scale calculations that often appear in scientific and technical calculations.

一方、非特許文献１〜３では、疎行列をターゲットとし疎行列用ＬＵ分解法のための専用計算機を提案している。疎行列性を利用し、行列の次元Ｎに比してはるかに少ない数の非ゼロ行列成分の分だけのローカルメモリを有する演算ユニットを用意し、１行分の加減算・乗除算を並列に行える構成とし、ＬＵ分解法の全ての処理を実行する方式について検討し、計算機のシステム構成を提案している。 On the other hand, Non-Patent Documents 1 to 3 propose a dedicated computer for a sparse matrix LU decomposition method targeting a sparse matrix. Utilizing sparse matrix properties, an arithmetic unit having a local memory corresponding to a much smaller number of non-zero matrix components than the dimension N of the matrix is prepared, and addition / subtraction / multiplication / division for one row can be performed in parallel. A system configuration is studied, and a method for executing all the processes of the LU decomposition method is studied, and a system configuration of the computer is proposed.

また、非特許文献７では、同じく疎行列をターゲットとしたＬＵ分解法の専用計算機について、汎用の算術演算器を２つ用意し、かつ、メモリアクセスを２チャネルとした上でパイプライン処理が可能なハードウェアを小規模ながら実際のＰＣＩインターフェース機能を有するＦＰＧＡ評価キットに実装し、その動作確認まで行っている。 In Non-Patent Document 7, two general-purpose arithmetic operators are prepared for a dedicated LU decomposition method computer that also targets a sparse matrix, and pipeline processing is possible with two channels of memory access. A small hardware is mounted on an FPGA evaluation kit having an actual PCI interface function, but its operation is confirmed.

しかしながら、上記のＬＵ分解法をハードウェア化する試みにおいては、ピボット選択も含めた複雑な行列のＬＵ分解、そして、それに続く前進および後退代入の複数の処理が必要であり、ＣＧ法やＢｉＣＧ法などの反復解法に比べて演算量の多い計算を余儀なくされるという問題があった。 However, in an attempt to make the above LU decomposition method hardware, LU decomposition of a complex matrix including pivot selection and subsequent multiple processes of forward and backward substitution are required. The CG method and the BiCG method There was a problem that it was necessary to perform calculations with a large amount of calculation compared to the iterative solution method.

非特許文献８では、ラプラス・ポアソン方程式に特化した専用計算機を提案している。この専用計算機では、ラプラス・ポアソン方程式の差分式に現れる大型疎行列の１行ごとの非ゼロ成分の数が高々５であり、かつ、同じ行に現れる行列成分が隣り合うグリッドのアドレスとなることを利用し、この構造に特化した単純反復法のハードウェア化が提案されている。しかし、全体のシステムとしてシンプルなアーキテクチャで実現することができることが判明したものの、やはり、単純反復法は解の収束が悪く、行列解法のＨＰＣ化としては必ずしも適切なものとはなっていなかった。 Non-Patent Document 8 proposes a dedicated computer specialized for the Laplace-Poisson equation. In this dedicated computer, the number of non-zero components per row of the large sparse matrix appearing in the Laplace-Poisson differential equation is at most 5, and the matrix components appearing in the same row are the addresses of adjacent grids. A simple iterative method specializing in this structure has been proposed. However, although it has been found that the entire system can be realized with a simple architecture, the simple iterative method still has poor convergence of the solution, and is not necessarily appropriate as an HPC implementation of the matrix solution.

非特許文献９では、大型疎行列をターゲットとしＣＧ法のための専用計算機を提案している。とりわけ、ＣＧ法の中で行列と探索方向のベクトルとの乗算演算の部分のみをハードウェア化し、他の処理はホスト計算機で実行させ、ホスト計算機と専用計算機とを連動させながらＣＧ法の計算を高速化する方式について検討し、計算機のシステム構成を提案している。 Non-Patent Document 9 proposes a dedicated computer for the CG method targeting a large sparse matrix. In particular, only the multiplication operation of the matrix and the search direction vector in the CG method is hardwareized, and other processing is executed by the host computer, and the calculation of the CG method is performed while the host computer and the dedicated computer are linked. We are investigating a method for speeding up and proposing a computer system configuration.

非特許文献１０では、大型疎行列をターゲットとしたＣＧ法の専用計算機について、ＦＰＧＡへの実装を想定した場合の精度とハードウェアサイズ・計算速度のトレードオフについて検討し、パイプライン処理を併用した適切な演算方法を提案している。 In Non-Patent Document 10, we examined the trade-off between accuracy and hardware size / calculation speed when assuming implementation on FPGA for a dedicated computer of the CG method targeting a large sparse matrix, and used pipeline processing together Appropriate calculation methods are proposed.

非特許文献１１では、大型疎行列をターゲットとしたＣＧ法の専用計算機について、ＣＧ法の演算に現れる除算の効率的な計算方法について検討し、演算部分について、実際にＦＰＧＡに実装しながらその有効性を示している。 Non-Patent Document 11 examines an efficient calculation method for division that appears in the calculation of the CG method for a dedicated computer of the CG method targeting a large sparse matrix, and the calculation part is actually implemented in an FPGA and effective. Showing sex.

非特許文献１２では、大型疎行列をターゲットとしたＣＧ法の専用計算機について、ＣＧ法の１回の反復計算のハードウェア化について検討し、小規模ながら実際にＦＰＧＡに実装しながらその実現可能性を示している。 Non-Patent Document 12 discusses hardware implementation of one iteration of the CG method for a dedicated computer for the CG method targeting a large sparse matrix, and its feasibility while actually being implemented on a small scale FPGA. Is shown.

非特許文献１３では、大型疎行列をターゲットとしたＣＧ法の専用計算機について、商用のリコンフィギュレーション可能な計算機を用いて行列方程式計算のＨＰＣ化について検討し、ＣＧ法に特化したシステムを提案している。 Non-Patent Document 13 examines the HPC of matrix equation calculation using a commercially available reconfigurable computer for a dedicated computer for the CG method targeting a large sparse matrix, and proposes a system specialized for the CG method. is doing.

以上のように、従来の行列計算専用計算機は、必ずしも科学技術計算にしばしば現れる大型疎行列には適していないＬＵ分解法をハードウェア化するものや、あるいは大型疎行列に適したＣＧ法のハードウェア化である場合も、ＣＧ法の演算に除算などの複雑な処理が含まれるために、演算方法を部分的にハードウェア化し、ホスト計算機と連動して動作させる構成のものである。これらの行列計算専用計算機は、いずれも、完全な専用計算機という形態ではなく、ホスト計算機とのデータのやり取りを生じるなど動作速度、大規模計算化などの点で必ずしも十分なものではなかった。また、ＣＧ法は基本的に対称行列が対象であるが、科学技術計算では非対称行列もよく現れるため、対称行列に特化した専用計算機では適用範囲が限られてしまうという問題もあった。 As described above, the conventional computer dedicated to matrix calculation is not necessarily suitable for a large sparse matrix that often appears in scientific and technical calculations, or it is a hardware of a CG method suitable for a large sparse matrix. Even in the case of hardware, since the calculation of the CG method includes complicated processing such as division, the calculation method is partially hardwareized and operated in conjunction with the host computer. None of these dedicated computers for matrix calculation are not necessarily a complete dedicated computer, and are not necessarily sufficient in terms of operation speed and large-scale calculation such as data exchange with the host computer. The CG method is basically intended for a symmetric matrix, but an asymmetric matrix often appears in scientific and technical calculations, so that there is a problem that the scope of application is limited in a dedicated computer specialized for a symmetric matrix.

そこで、この発明が解決しようとする課題は、電磁界解析、流体解析、構造解析などの各種の数値シミュレーションにおいて最終的に解くべき方程式として現れる大型疎行列方程式の計算を高速に行うことができ、しかもこの大型疎行列方程式の大型疎行列が対称行列である場合および非対称行列である場合の双方に適用することができる行列方程式計算装置および行列方程式計算方法を提供することである。 Therefore, the problem to be solved by the present invention is that high-speed calculation of large sparse matrix equations that appear as equations to be finally solved in various numerical simulations such as electromagnetic field analysis, fluid analysis, and structural analysis, In addition, the present invention provides a matrix equation calculation apparatus and a matrix equation calculation method that can be applied to both cases where the large sparse matrix of the large sparse matrix equation is a symmetric matrix and an asymmetric matrix.

上記課題を解決するために、第１の発明は、
対象となる大型疎行列方程式を反復解法により解く行列方程式計算装置であって、
上記大型疎行列方程式の大型疎行列の成分のうち、非ゼロの成分のみを１列ごとに格納する所定の個数のメモリを有するメモリ部と、
上記反復解法の演算の少なくとも一部をデータフロー形式で実行する１つまたは複数の演算部とを有し、
上記メモリ部から１行ごとの反復解法の演算に必要なデータを上記演算部に一度にロードするように構成されていることを特徴とするものである。
この行列方程式計算装置は、典型的には、パーソナルコンピュータなどからなるホスト計算機と接続され、このホスト計算機から必要なデータを取得するが、これに限定されるものではなく、必要に応じて、ホスト計算機の機能を持たせてもよい。 In order to solve the above problem, the first invention is:
A matrix equation calculator that solves a large sparse matrix equation of interest by an iterative method,
A memory unit having a predetermined number of memories for storing only non-zero components for each column among the components of the large sparse matrix of the large sparse matrix equation;
One or a plurality of calculation units that execute at least a part of the calculation of the iterative solution method in a data flow format,
The memory unit is configured to load data necessary for the calculation of the iterative solution for each row to the calculation unit at a time.
This matrix equation calculation apparatus is typically connected to a host computer such as a personal computer and obtains necessary data from the host computer, but is not limited to this. You may give the function of a computer.

第２の発明は、
対象となる大型疎行列方程式を反復解法により解く行列方程式計算方法であって、
上記大型疎行列方程式の大型疎行列の成分のうち、非ゼロの成分のみを１列ごとにメモリ部の所定の個数のメモリに格納するステップと、
上記メモリ部から１行ごとの反復解法の演算に必要なデータを１つまたは複数の演算部に一度にロードするステップと、
上記反復解法の演算の少なくとも一部を上記演算部でデータフロー形式で実行するステップとを有することを特徴とするものである。 The second invention is
A matrix equation calculation method for solving a target large sparse matrix equation by an iterative method,
Storing only non-zero components among the components of the large sparse matrix of the large sparse matrix equation in a predetermined number of memories in the memory unit for each column;
Loading data necessary for iterative solution operation for each row from the memory unit to one or more arithmetic units at a time;
And a step of executing at least a part of the calculation of the iterative solution in a data flow format by the calculation unit.

第１および第２の発明において、大型疎行列とは、電磁界解析、流体解析、構造解析などの数値シミュレーションにおいて最終的に解くべき行列方程式に現れる数千〜数百万あるいはそれ以上の次元であるがその成分のほとんどはゼロであり、非ゼロ成分は、１行あたり数個〜十数個であるような行列を意味する。大型疎行列は対称行列であっても非対象行列であってもよい。反復解法には、例えば、ＣＧ法、ＢｉＣＧ法、ＩＣＣＧ法、ＢｉＣＧ−Ｓｔａｂ法、ＣＧＳ法、ＢｉＣＧ−Ｓｔａｂ２法などを用いることができる。 In the first and second inventions, the large sparse matrix means a dimension of several thousand to several million or more appearing in a matrix equation to be finally solved in numerical simulation such as electromagnetic field analysis, fluid analysis, and structural analysis. However, most of its components are zero, and a non-zero component means a matrix having several to a dozen or more components per row. The large sparse matrix may be a symmetric matrix or a non-target matrix. For the iterative solution method, for example, CG method, BiCG method, ICCG method, BiCG-Stab method, CGS method, BiCG-Stab2 method and the like can be used.

第１および第２の発明においては、大型疎行列の成分のうち、非ゼロの成分のみを１列ごとに格納する所定の個数（複数）（必要に応じて選ばれるが、例えば８あるいは１６）のメモリを用いるが、これは次のような理由による。図１は大型疎行列方程式の１例を示す。図１に示すように、行列Ａおよび非同次ベクトルｂからなる大型疎行列方程式

から未知数ベクトルｘを求める問題において、これをＣＧ法の計算方法

を用いて解く場合を考える。（２−２）式の反復部分の計算には、

なる行列とベクトルとの乗算がある。このベクトルｑのｉ番目の成分ｑ_iの計算には行列Ａのｉ行目の１行分の成分の値が必要であるが、行列の成分の値を１つのメモリに格納しては、必要な値を１度にこれらのメモリから取得することはできない。このため、第１および第２の発明では、メモリ部が、行列の非ゼロ成分のみを１列ごとに格納する複数のメモリを有する構成とし、１行ごとの計算に必要なデータをメモリ部から演算部に並列に一度にロードすることとしている。また、後述の複数の演算部による並列処理の場合は、これらの列ごとにそれぞれ設けられたメモリをさらに並列数分設け、それぞれの並列処理が分担する行の成分の値を格納する。 In the first and second inventions, a predetermined number (plural) of storing only non-zero components among the components of the large sparse matrix for each column (selected as necessary, for example, 8 or 16) This memory is used for the following reason. FIG. 1 shows an example of a large sparse matrix equation. As shown in FIG. 1, a large sparse matrix equation consisting of a matrix A and an inhomogeneous vector b

In the problem of obtaining the unknown vector x from the above, this is calculated by the CG method

Consider the case of solving using For calculating the repetitive part of the equation (2-2),

There is a multiplication of a matrix and a vector. The calculation of the i-th component q _i of this vector q requires the value of the component for the first row of the i-th row of the matrix A, but it is necessary to store the value of the matrix component in one memory. A single value cannot be obtained from these memories at a time. For this reason, in the first and second inventions, the memory unit has a plurality of memories for storing only the non-zero components of the matrix for each column, and data necessary for calculation for each row is stored from the memory unit. The calculation unit is loaded in parallel at a time. Further, in the case of parallel processing by a plurality of arithmetic units described later, memories provided for each of these columns are further provided for the number of parallels, and the values of the row components shared by the respective parallel processings are stored.

好適には、行列方程式計算装置は、同じハードウェア回路の複数の演算部を有する。そして、これらの演算部の数だけ、行列の行数Ｎを分割し、それぞれの演算部が連動動作しながら並列処理を行う。こうすることで、行列方程式の計算を演算部の数だけスケーラブルに並列処理により高速化することができる。
好適には、反復解法の演算を可能な限りデータフロー形式で実行する。 Preferably, the matrix equation calculation apparatus includes a plurality of arithmetic units of the same hardware circuit. Then, the number of rows N of the matrix is divided by the number of these arithmetic units, and parallel processing is performed while the respective arithmetic units operate in conjunction with each other. By doing so, the calculation of the matrix equation can be speeded up by the parallel processing in a scalable manner by the number of operation units.
Preferably, iterative solution operations are performed in data flow form as much as possible.

さらに、好適には、対称行列を対象としたＣＧ法専用計算機の場合は、偶数かつ複数の同じハードウェア回路の演算部を有する。そして、これらの演算部の２つをペアに連動させることにより、１つの非対称行列を対象としたＢｉＣＧ法専用計算機としても機能させる。また、この演算部のペアの数だけ行列の行数を分割し、それぞれの演算部が連動動作しながら並列処理を行うようにしてもよい。こうすることで、１つの専用計算機を対称行列および非対称行列両方に適用することができ、対称行列モードの場合は非対称行列モードの場合の２倍の速度で動作する計算装置を実現することができる。 Further, preferably, the CG method dedicated computer for a symmetric matrix has an even number and a plurality of arithmetic units of the same hardware circuit. Then, by linking two of these arithmetic units in pairs, it is made to function as a BiCG method dedicated computer for one asymmetric matrix. Alternatively, the number of rows of the matrix may be divided by the number of pairs of the calculation units, and the parallel processing may be performed while each calculation unit operates in conjunction. In this way, one dedicated computer can be applied to both the symmetric matrix and the asymmetric matrix, and in the case of the symmetric matrix mode, it is possible to realize a computing device that operates at twice the speed of the asymmetric matrix mode. .

この発明によれば、行列方程式計算装置にＯＳなどのソフトウェア的な制約なしに、ハードウェア的に許される限り大容量のメモリを搭載可能となるため、大規模な計算を実行することができる。また、演算部におけるＣＧ法などの反復解法のデータフローアーキテクチャ回路構成とメモリ部における行列の１列の非ゼロ成分ごとのメモリ構成とにより、スループットとしても、例えば、ＣＧ法やＢｉＣＧ法では１回の反復計算を３Ｎ＋２クロック（Ｎは行列の次数）で、ＢｉＣＧ−Ｓｔａｂ法やＣＧＳ法では５Ｎクロックで処理することができ、計算方法の限界まで動作時間を最小化した高速な動作を実現することができる。
また、複数の演算部を有する構成とし並列動作させることにより、スケーラブルな並列処理で高速化を図ることができる。
さらに、ＣＧ法の演算部を２つ連動させることでＢｉＣＧ法の演算部を実現することができるため、用途に適した効率的な計算を行うことができる。
さらに、複数の反復解法の演算部を用意し、これらの演算部に同じ行列計算を同時に行わせ、最も早く計算が完了した演算部の解をホスト計算機がアップロードすることにより、自動的に問題に最適な反復解法の方法を選択して行列方程式の計算を実行することができる。 According to the present invention, a large-capacity memory can be mounted in a matrix equation calculation apparatus as long as it is allowed by hardware without software restrictions such as an OS, so that a large-scale calculation can be executed. In addition, the throughput is one time in the CG method or BiCG method, for example, by the data flow architecture circuit configuration of the iterative solution such as the CG method in the arithmetic unit and the memory configuration for each non-zero component of one column of the matrix in the memory unit. Can be processed with 3N + 2 clocks (N is the order of the matrix) and with the BiCG-Stab method or CGS method with 5N clocks, and the high-speed operation can be realized with the operation time minimized to the limit of the calculation method. Can do.
In addition, by operating in parallel with a configuration having a plurality of arithmetic units, it is possible to achieve high speed with scalable parallel processing.
Furthermore, since the BiCG method computing unit can be realized by linking two CG method computing units, efficient calculation suitable for the application can be performed.
In addition, a plurality of iterative solution calculation units are prepared, the same matrix calculation is performed simultaneously by these calculation units, and the host computer uploads the solution of the calculation unit that completed the earliest calculation, which automatically becomes a problem. The calculation of the matrix equation can be performed by selecting the optimal iterative solution method.

以下、この発明の実施の形態について図面を参照しながら説明する。
まず、この発明の第１の実施の形態による行列方程式計算装置について説明する。
この行列方程式計算装置はＣＧ法専用計算機により構成される。この行列方程式計算装置は、（１）式の行列方程式を（２−１）式、（２−２）式および（３）式に示すＣＧ法により計算するものである。 Embodiments of the present invention will be described below with reference to the drawings.
First, a matrix equation calculation apparatus according to a first embodiment of the present invention will be described.
This matrix equation calculation device is constituted by a computer dedicated to the CG method. This matrix equation calculation device calculates the matrix equation of equation (1) by the CG method shown in equations (2-1), (2-2), and (3).

図２はこの行列方程式計算装置１の全体構成、図３はこの行列方程式計算装置１の各部の構成の詳細を示す。図２および図３に示すように、この行列方程式計算装置１は、大型疎行列Ａの全成分の値を格納するメモリモジュール１１と、（２−２）式のＣＧ法の１回分の反復計算をデータフロー形式で３Ｎクロックで実行する演算モジュール１２と、メモリモジュール１１と演算モジュール１２との間で、（３）式の大型疎行列Ａとベクトルｐ_kとの乗算をクロック遅延のない組合せ回路で実行し、この結果の値Ａｐ_kを演算モジュール１２に提供する行列−ベクトル乗算回路１３と、演算モジュール１２で得られたｋ＋１番目の反復解が収束しており最終的な解となっているかどうかを判定する収束判定部１４とを有する。メモリモジュール１１、行列−ベクトル乗算回路１３、演算モジュール１２および収束判定部１４における一連のデータの流れの動作はマスターコントローラ１５により制御される。 FIG. 2 shows the overall configuration of the matrix equation calculation apparatus 1, and FIG. 3 shows the details of the configuration of each part of the matrix equation calculation apparatus 1. As shown in FIGS. 2 and 3, the matrix equation calculation apparatus 1 includes a memory module 11 that stores the values of all components of the large sparse matrix A, and an iterative calculation for one time of the CG method expressed by equation (2-2). the arithmetic module 12 to run with 3N clock data flow format, between the memory module 11 and the calculation module 12, (3) a large sparse matrix a and vector p _k not combining circuits clock delay multiplication of The matrix-vector multiplication circuit 13 that provides the result value Ap _k to the arithmetic module 12 and the (k + 1) th iterative solution obtained by the arithmetic module 12 have converged and become the final solution. And a convergence determination unit 14 for determining whether or not. The master controller 15 controls operations of a series of data flows in the memory module 11, the matrix-vector multiplication circuit 13, the arithmetic module 12, and the convergence determination unit 14.

メモリモジュール１１は、大型疎行列Ａを非ゼロ成分のみに圧縮したＮ×ｍのサイズの行列Ａ’の成分を列ごとに格納するｍ個のメモリＭ１−１〜Ｍ１−ｍを有する。この大型疎行列Ａの圧縮は具体的には次のように行う。例えば、各行ごとの非ゼロ成分の数のうち全Ｎ行の中で最大のものをｍとしてＮ×ｍのサイズの行列を用意し、これに各行ごとに非ゼロ成分を左詰めで格納する。同時に、これらのメモリＭ１−１〜Ｍ１−ｍと全く同じサイズのメモリをもう１つずつ用意する（メモリＭ２−１〜Ｍ２−ｍ）。これらのメモリＭ２−１〜Ｍ２−ｍには、対応するメモリＭ１−１〜Ｍ１−ｍの非ゼロ成分がもともとの大型疎行列Ａのどの列に位置していたかを示すインデックスを格納する。これにより、通常、大型疎行列ではＮが数千〜数千万に対しｍは高々十数個であるため、大型疎行列Ａの情報を失うことなく大幅にメモリ容量を節約することができる。メモリモジュール１１をこのように構成することで、メモリ容量を大幅に節約することができ、かつ、一度に１行分の行列の非ゼロ成分全てに並列アクセスすることができるためスループットとしての計算性能に大きく影響する行列の成分が格納されたメモリＭ１−１〜Ｍ１−ｍへのアクセスは、最小限の１回にまで減らすことができる。 The memory module 11 includes m memories M1-1 to M1-m that store, for each column, the components of a matrix A ′ having a size of N × m obtained by compressing the large sparse matrix A into only non-zero components. Specifically, the large sparse matrix A is compressed as follows. For example, a matrix of N × m size is prepared, where m is the maximum of all N rows out of the number of non-zero components for each row, and the non-zero components are stored left-justified for each row. At the same time, another memory having exactly the same size as these memories M1-1 to M1-m is prepared (memory M2-1 to M2-m). In these memories M2-1 to M2-m, indexes indicating in which column of the original large sparse matrix A the non-zero components of the corresponding memories M1-1 to M1-m are stored. As a result, normally, in a large sparse matrix, N is several tens to several tens of millions, and m is at most a dozen, so that the memory capacity can be saved significantly without losing information of the large sparse matrix A. By configuring the memory module 11 in this way, the memory capacity can be greatly saved, and all the non-zero components of the matrix for one row can be accessed in parallel at a time. The number of accesses to the memories M1-1 to M1-m in which the matrix components that greatly affect the memory are stored can be reduced to a minimum of one.

メモリモジュール１１はさらに、ＣＧ法の反復計算実行時に一時的に値を格納しておくために、同じベクトルｐ_kの値をｍ個別々に格納するためのメモリＭ３−０〜Ｍ３−ｍ、ベクトルｐ_kの更新値ｐ_k+1を格納するためのメモリＭ３−０’、ベクトルｘ_kを格納するためのメモリＭ４、ベクトルｘ_kの更新値ｘ_k+1を格納するためのメモリＭ４’、ベクトルｒ_kを格納するためのメモリＭ５、ベクトルｒ_kの更新値ｒ_k+1を格納するためのメモリＭ５’、ベクトルｑ_kを格納するためのメモリＭ６、ベクトルｒ_kのノルムの２乗‖ｒ_k‖²＝（ｒ_k，ｒ_k）を格納するためのメモリＭ７、ノルムの２乗‖ｒ_k‖²の更新値‖ｒ_k+1‖²＝（ｒ_k+1，ｒ_k+1）を格納するためのメモリＭ７’、行列の次数Ｎ、行ごとの非ゼロ成分の最大数ｍ、非同次項ベクトルｂのノルムの２乗‖ｂ‖²＝（ｂ，ｂ）、相対残差の収束判定基準値ε、最大反復回数Ｋｍ、各反復ステップの残差Ｒｓなどの反復の収束条件を格納するためのメモリＭ８を有する。 The memory module 11 further stores memories M3-0 to M3-m for storing the values of the same vector p _k individually for each m in order to temporarily store values during execution of the CG method iterative calculation. memory M3-0 for storing the updated value p _{k + 1} of p _k ', the memory M4 for storing the vector x _k, the memory M4 for storing the updated value x _{k + 1} vector x _k', memory M5 for storing the vector r _k, memory M5 for storing the updated value r _{k + 1} vectors r _k ', the memory M6 for storing the vector q _k, 2 squared norm of the vector r _k ‖ r _k ‖ ² = (r _k, r _k) the memory M7 for storing the square ‖R _k ‖ ² update value of the norm ‖r _{k + 1} ‖ ^{_{2 = (r k + 1,}} r k + 1 ), The order N of the matrix, the maximum number m of non-zero components per row, the non-homogeneous term vector b Memory for storing iteration convergence conditions such as the square of the norm of ‖b‖ ² = (b, b), the convergence criterion value ε of the relative residual, the maximum number of iterations Km, and the residual Rs of each iteration step It has M8.

上述のメモリＭ１−１〜Ｍ１−ｍ、Ｍ２−１〜Ｍ２−ｍ、Ｍ３−０〜Ｍ３−ｍ、Ｍ３−０’、Ｍ４、Ｍ４’、Ｍ５、Ｍ５’、Ｍ６、Ｍ７、Ｍ７’、Ｍ８としては、それぞれ独立した個別のメモリを用いてもよいし、それらのうちの１つまたは２つ以上のメモリとして、独立した１つのメモリのメモリ領域を所定個数に分割したものを用い、これらのメモリ領域をメモリＭ１−１〜Ｍ１−ｍ、Ｍ２−１〜Ｍ２−ｍ、Ｍ３−０〜Ｍ３−ｍ、Ｍ３−０’、Ｍ４、Ｍ４’、Ｍ５、Ｍ５’、Ｍ６、Ｍ７、Ｍ７’、Ｍ８のうちのいずれかに用いてもよい。これらのメモリＭ１−１〜Ｍ１−ｍ、Ｍ２−１〜Ｍ２−ｍ、Ｍ３−０〜Ｍ３−ｍ、Ｍ３−０’、Ｍ４、Ｍ４’、Ｍ５、Ｍ５’、Ｍ６、Ｍ７、Ｍ７’、Ｍ８としては、例えば、ＳＲＡＭまたはＤＲＡＭを用いることができる。 Memory M1-1 to M1-m, M2-1 to M2-m, M3-0 to M3-m, M3-0 ′, M4, M4 ′, M5, M5 ′, M6, M7, M7 ′, M8 Each of which may be an independent memory, or one or two or more of them may be obtained by dividing a memory area of one independent memory into a predetermined number. The memory areas are memories M1-1 to M1-m, M2-1 to M2-m, M3-0 to M3-m, M3-0 ′, M4, M4 ′, M5, M5 ′, M6, M7, M7 ′, It may be used for any of M8. These memories M1-1 to M1-m, M2-1 to M2-m, M3-0 to M3-m, M3-0 ′, M4, M4 ′, M5, M5 ′, M6, M7, M7 ′, M8 For example, SRAM or DRAM can be used.

行列−ベクトル乗算回路１３は、上述のように列ごとにメモリＭ１−１〜Ｍ１−ｍに分割格納された大型疎行列Ａの成分を、メモリＭ２−１〜Ｍ２−ｍを参照しつつ１行ごとに１行の全ての成分に並列に同時アクセスし、また、ベクトルｐ_kを格納するメモリＭ３−１〜Ｍ３−ｍからそれぞれ（３）式の計算に必要な値のアドレスに同時アクセスし、データフロー形式で（３）式の計算を実行し、ベクトルｑ_kを１クロックで１成分ずつ計算し、その結果を演算モジュール１２に引き渡す。 The matrix-vector multiplication circuit 13 refers to the components of the large sparse matrix A divided and stored in the memories M1-1 to M1-m for each column as described above, while referring to the memories M2-1 to M2-m. simultaneously accessed in parallel to all of the components of one row every, also simultaneous access to the address of the required values for the calculation of each equation (3) from the memory M3-1~M3-m for storing the vector p _k, The calculation of the expression (3) is executed in the data flow format, the vector q _k is calculated for each component in one clock, and the result is delivered to the arithmetic module 12.

演算モジュール１２は、行列−ベクトル乗算回路１３からのベクトルｑ_k、メモリＭ３−０のベクトルｐ_k、メモリＭ４のベクトルｘ_k、メモリＭ５のベクトルｒ_k、メモリＭ６のベクトルｑ_kおよびメモリＭ７の（ｒ_k，ｒ_k）を用いて、（２−２）式（ｉ）中のベクトルｐ_kとベクトルｑ_k（＝Ａｐ_k）との内積（ｐ_k，ｑ_k）の計算を実行するための演算回路３１、（２−２）式（ｉ）中の除算（ｒ_k，ｒ_k）／（ｐ_k，ｑ_k）の計算を実行するための演算回路３２、（２−２）式（ｉｉ）のベクトルｘ_k+1の計算を実行するための演算回路３３、（２−２）式（ｉｉｉ）のベクトルｒ_k+1の計算を実行するための演算回路３４、（２−２）式（ｖ）中のベクトルｒ_k+1の２乗（ｒ_k+1，ｒ_k+1）の計算を実行するための演算回路３５、（２−２）式（ｉｖ）中の除算（ｒ_k+1，ｒ_k+1）／（ｒ_k，ｒ_k）の計算を実行するための演算回路３６、（２−２）式（ｖ）のベクトルｐ_k+1の計算を実行するための演算回路３７を有する。 Calculation module 12, a matrix - vector q _k from vector multiply circuit 13, a vector p _k of memory M3-0, vector x _k of the memory M4, the vector r _k of the memory M5, the vector q _k and the memory M7 of the memory M6 (r _k, r _k) using, for performing the calculation of (2-2) formula (i) vectors p _k and vector q _k in (= Ap _k) the inner product of the (p _k, q _k) Arithmetic circuit 31, (2-2) arithmetic circuit 32 for executing the calculation of division (r _k , r _k ) / (p _k , q _k ) in formula (i), formula (2-2) ( (ii) an arithmetic circuit 33 for executing the calculation of the vector x _{k + 1} , (2-2) an arithmetic circuit 34 for executing the calculation of the vector r _{k + 1} of the equation (iii), (2-2) operation circuit 35 for performing the calculation of the square of the vector r _{k + 1} in formula _{(v) (r k + 1} , r k + 1), (2 Vector of 2) (iv) the division in _{_{(r k + 1, r k}} + 1) / (r k, the arithmetic circuit 36 for performing the calculation of r _k), (2-2) Formula (v) An arithmetic circuit 37 for executing the calculation of p _{k + 1} is included.

（２−２）式からわかるように、演算回路３１、３３、３４、３５、３７の計算はそれぞれＮクロックで、演算回路３２、３６の計算は１クロックで実行することができるが、演算モジュール１２は全体として可能な限りこれらの計算を並列に実行することができるように構成されている。具体的には、演算回路３１、３２のＮ＋１クロックの計算が完了した後は、演算回路３３、３４、３５の計算は、演算回路３４、３５のデータフロー性を利用するとまとめてＮクロックで実行することができるため、残りの演算回路３６、３７のＮ＋１クロックの計算を合わせて全体で３Ｎ＋２クロックで実行することができるように構成されている。 As can be seen from the equation (2-2), the calculation of the arithmetic circuits 31, 33, 34, 35, and 37 can be executed in N clocks, and the calculation of the arithmetic circuits 32 and 36 can be executed in one clock. 12 is configured so that these calculations can be executed in parallel as much as possible. Specifically, after the calculation of the N + 1 clocks of the arithmetic circuits 31 and 32 is completed, the calculation of the arithmetic circuits 33, 34, and 35 are collectively executed with N clocks using the data flow property of the arithmetic circuits 34 and 35. Therefore, the calculation of the N + 1 clocks of the remaining arithmetic circuits 36 and 37 can be combined and executed in a total of 3N + 2 clocks.

収束判別部１４は、ＣＧ法の演算を実行する行列−ベクトル乗算回路１３から演算モジュール１２での１回分の反復計算で得られたｋ＋１回目の残差ベクトルｒ_k+1から相対誤差を計算し、あらかじめ設定されたメモリＭ８内の相対残差の収束判定基準値ε内に収まり反復計算が収束しているかどうかを判別する。 The convergence determination unit 14 calculates a relative error from the k + 1-th residual vector r _{k + 1} obtained from the matrix-vector multiplication circuit 13 that executes the calculation of the CG method and the iterative calculation for one time in the calculation module 12. Then, it is determined whether or not the iterative calculation is converged within the preset convergence criterion value ε of the relative residual in the memory M8.

マスターコントローラ１５は、計算が収束していればｘ_k+1を最終的な近似解として計算を終了し、ホスト計算機（図示せず）に正常に収束値を得た旨を通知する。計算が収束していなければ、あらかじめ設定されたメモリＭ８内の最大反復回数Ｋｍを超えた場合は、ホスト計算機に反復計算が収束しなかった旨を通知し、最大反復回数Ｋｍを超えていない場合は、反復計算で得られたベクトルｐ_k+1、ｘ_k+1、ｒ_k+1および（ｒ_k+1，ｒ_k+1）を次の反復計算に送り、再度、行列−ベクトル乗算回路１３から演算モジュール１２での反復計算の実行を指示する。 If the calculation has converged, the master controller 15 ends the calculation with x _{k + 1} as the final approximate solution, and notifies the host computer (not shown) that the convergence value has been obtained normally. If the calculation does not converge, if the preset maximum number of iterations Km in the memory M8 is exceeded, the host computer is notified that the iterations have not converged, and the maximum number of iterations Km is not exceeded. Sends the vectors p _{k + 1} , x _{k + 1} , r _{k + 1} and (r _{k + 1} , r _{k + 1} ) obtained in the iteration calculation to the next iteration calculation, and again, the matrix-vector multiplication circuit 13 to instruct the execution of the iterative calculation in the arithmetic module 12.

図４は行列方程式計算装置１の使用形態を示す。行列方程式計算装置１はＰＣＩインターフェース４１によりホスト計算機４２と接続されている。ＣＧ法の演算を実行する上で必要な値は、圧縮格納された大型疎行列Ａ、反復計算の初期値となるベクトルｐ₀（＝ｒ₀）、ｘ₀、非同次項の（ｂ，ｂ）、収束判定基準ε、最大反復回数Ｋｍである。これらの値はホスト計算機４２に格納され、計算開始指示の前に行列方程式計算装置１にダウンロードされてメモリモジュール１１のメモリＭ３、Ｍ４、Ｍ５、Ｍ８に格納されるようになっている。ホスト計算機４２は、これらの値を行列方程式計算装置１にダウンロードした後、行列方程式計算装置１に対し計算開始を指示する。行列方程式計算装置１は、（２−２）式の反復計算および収束判定を繰返し、最大反復回数Ｋｍ以内で解が収束判定基準ε以下になれば、メモリＭ４’のベクトルｘ_k+1を近似解としてホスト計算機４２に引渡し、最大反復回数Ｋｍまで計算しても収束しなければ、計算は未収束として、これをホスト計算機４２に通知する。 FIG. 4 shows how the matrix equation calculation apparatus 1 is used. The matrix equation calculation apparatus 1 is connected to a host computer 42 by a PCI interface 41. The values necessary for executing the calculation of the CG method are a large sparse matrix A that is compressed and stored, vectors p ₀ (= r ₀ ) and x ₀ that are initial values for iterative calculations, and (b, b ), Convergence criterion ε, and maximum number of iterations Km. These values are stored in the host computer 42, downloaded to the matrix equation calculation apparatus 1 before the calculation start instruction, and stored in the memories M3, M4, M5, and M8 of the memory module 11. The host computer 42 downloads these values to the matrix equation calculation apparatus 1 and then instructs the matrix equation calculation apparatus 1 to start calculation. The matrix equation calculation apparatus 1 repeats the iterative calculation and convergence determination of equation (2-2), and approximates the vector x _{k + 1} in the memory M4 ′ if the solution falls below the convergence criterion ε within the maximum number of iterations Km. If it is delivered to the host computer 42 as a solution and does not converge even if the maximum number of iterations Km is calculated, the calculation is not converged, and this is notified to the host computer 42.

図５Ａ〜Ｅは行列方程式計算装置１の実装例を示す。図５ＡおよびＢに示すように、この例では、１つの小プリント基板５１上に、メモリモジュール１１の２列分の圧縮行列の成分を格納するメモリ５２と、この２列の行列と対応する探索ベクトルｐ_kの成分との乗算を行う行列−ベクトル乗算回路１３とが搭載される。図５ＣおよびＤに示すように、この小プリント基板５１が合計１６個、サブプリント基板上５３に接続される。図５Ｅに示すように、メインプリント基板上５４に、演算モジュール１２、収束判定部１４およびマスターコントローラ１５が搭載され、これにサブプリント基板５３が接続される。こうして、ＣＧ法専用計算機としての行列方程式計算装置１が構成される。これらの搭載部品のほとんどは例えばＦＰＧＡまたはＡＳＩＣにより作製することができる。このように構成することにより、行列方程式計算装置１をある決められた列数ｍ用に構成した場合でも、あらかじめ十分なソケットを用意しておけば、後に小プリント基板５１を追加接続することにより、行列サイズの変更に対してフレキシビリティーを持たすことができる。 5A to 5E show implementation examples of the matrix equation calculation apparatus 1. As shown in FIGS. 5A and 5B, in this example, on one small printed circuit board 51, a memory 52 for storing compression matrix components for two columns of the memory module 11, and a search corresponding to the matrix of the two columns A matrix-vector multiplication circuit 13 that performs multiplication with the component of the vector p _k is mounted. As shown in FIGS. 5C and 5D, a total of 16 small printed circuit boards 51 are connected to the sub printed circuit board 53. As shown in FIG. 5E, the arithmetic module 12, the convergence determination unit 14, and the master controller 15 are mounted on the main printed circuit board 54, and the sub printed circuit board 53 is connected thereto. Thus, the matrix equation calculation apparatus 1 as a computer dedicated to the CG method is configured. Most of these mounted components can be manufactured by, for example, FPGA or ASIC. With this configuration, even when the matrix equation calculation device 1 is configured for a predetermined number m of columns, if a sufficient socket is prepared in advance, the small printed circuit board 51 is additionally connected later. It is possible to have flexibility for changing the matrix size.

この第１の実施の形態によれば、次のような利点を得ることができる。すなわち、行列方程式計算装置１をＣＧ法専用計算機により構成し、メモリモジュール１１と演算モジュール１２とを分離した構成としていることにより、数値シミュレーションに用いる大型疎行列に応じてメモリモジュール１１を適切な構成のものに交換することができ、比較的複雑な構成を有する演算モジュール１３は変更しないで、大型疎行列の変更に柔軟に対応することができ、フレキシビリティーが高い。また、メモリモジュール１１は、行列の１列ごとにｍ個のメモリＭ１−１〜Ｍ１−ｍを設け、これらのメモリＭ１−１〜Ｍ１−ｍから（３）式の１行分の計算に必要なデータを１クロックで行列−ベクトル乗算回路１３にロードし、（３）式の１行分の計算を並列的に実行するようにしていることにより、メモリモジュール１１へのアクセス回数を最小限にすることができ、それによって高速動作化を図ることができ、行列方程式の計算を高速かつ短時間で行うことができる。また、この行列方程式計算装置１は実装や計算の大規模化が容易である。以上により、極めて実用性が高い高性能の行列方程式計算装置１を実現することができる。
この行列方程式計算装置１は、電磁界解析、流体解析、構造解析などの各種の数値シミュレーションで最終的に解くべき方程式として現れる大型疎行列方程式を解くのに適用して好適なものである。 According to the first embodiment, the following advantages can be obtained. That is, the matrix equation calculation apparatus 1 is configured by a computer dedicated to the CG method, and the memory module 11 and the arithmetic module 12 are separated from each other, so that the memory module 11 is appropriately configured according to the large sparse matrix used for the numerical simulation. The arithmetic module 13 having a relatively complicated configuration can be replaced without changing, and can flexibly cope with the change of a large sparse matrix, and has high flexibility. Further, the memory module 11 is provided with m memories M1-1 to M1-m for each column of the matrix, and is necessary for calculating one row of the equation (3) from these memories M1-1 to M1-m. Data is loaded into the matrix-vector multiplication circuit 13 in one clock, and the calculation for one row of the equation (3) is executed in parallel, thereby minimizing the number of accesses to the memory module 11. Accordingly, high-speed operation can be achieved, and the matrix equation can be calculated at high speed in a short time. In addition, the matrix equation calculation apparatus 1 can be easily implemented and scaled up. As described above, it is possible to realize a high-performance matrix equation calculation apparatus 1 with extremely high practicality.
The matrix equation calculation apparatus 1 is suitable for application to solving a large sparse matrix equation that appears as an equation to be finally solved in various numerical simulations such as electromagnetic field analysis, fluid analysis, and structural analysis.

次に、この発明の第２の実施の形態による行列方程式計算装置について説明する。
この第２の実施の形態においては、メモリモジュール１１へのアクセス回数の最小限化に加え、さらに、演算モジュール１２における３Ｎ＋２クロック分の計算をより高速化すべく、演算モジュール１２を複数用意し、３Ｎ＋２クロック分の計算をこれらの複数の演算モジュール１２で並列に実行する方式が用いられる。演算モジュール１２における３Ｎ＋２クロックの計算はもともと全て並列処理してもよく、この部分の計算を複数の演算モジュール１２で行うことにより、１つの演算モジュール１２を用いた場合に比べ、演算モジュール１２の台数分だけ、高速化を図ることができる。上記以外のことは第１の実施の形態と同様である。
この行列方程式計算装置１の実装は、図６に示すように、図５Ｅに示すものと同様なメインプリント基板５４上にサブプリント基板５３を複数搭載することにより簡単に行うことができる。 Next explained is a matrix equation calculation apparatus according to the second embodiment of the invention.
In the second embodiment, in addition to minimizing the number of accesses to the memory module 11, a plurality of arithmetic modules 12 are prepared to further increase the calculation speed of 3N + 2 clocks in the arithmetic module 12, and 3N + 2 A method is used in which the calculation for the clock is executed in parallel by the plurality of arithmetic modules 12. The calculation of 3N + 2 clocks in the arithmetic module 12 may be all processed in parallel originally, and by performing this part of the calculation with a plurality of arithmetic modules 12, the number of arithmetic modules 12 is larger than when one arithmetic module 12 is used. The speed can be increased by that amount. Other than the above are the same as in the first embodiment.
As shown in FIG. 6, the matrix equation calculation apparatus 1 can be easily mounted by mounting a plurality of sub print boards 53 on a main print board 54 similar to that shown in FIG. 5E.

この第２の実施の形態によれば、第１の実施の形態と同様な利点に加えて、動作速度のより一層の向上を図ることができ、行列方程式の計算をより高速に行うことができるという利点を得ることができる。 According to the second embodiment, in addition to the same advantages as the first embodiment, the operation speed can be further improved, and the matrix equation can be calculated at higher speed. The advantage that can be obtained.

次に、この発明の第３の実施の形態による行列方程式計算装置について説明する。
この第３の実施の形態においては、複数の演算モジュール３の並列処理による高速化に加え、さらに、ＣＧ法は大型疎行列が対称行列である場合のみに適用が限定されているのに対し、非対称行列である場合にも適用することができる機能を装備すべく、２つの演算モジュール１２を連動動作させ、非対称行列にも適用することができるＢｉＣＧ法マシンとして動作させる方式が用いられる。ＢｉＣＧ法はもともとＣＧ法に転置行列に対するＣＧ法の処理を追加したものであり、もう１つの演算モジュール１２に転置行列を処理させることにより、２つの演算モジュール１２をペアで連動動作させることによりＢｉＣＧ法の動作を実現することができる。上記以外のことは第１の実施の形態と同様である。
この行列方程式計算装置１の実装は、図６に示すように、図５Ｅに示すものと同様なメインプリント基板５４上にサブプリント基板５３を複数搭載することにより簡単に行うことができる。 Next explained is a matrix equation calculation apparatus according to the third embodiment of the invention.
In the third embodiment, in addition to speeding up by parallel processing of a plurality of arithmetic modules 3, the CG method is limited to application only when the large sparse matrix is a symmetric matrix. In order to provide a function that can be applied even in the case of an asymmetric matrix, a method of operating two arithmetic modules 12 in an interlocked manner and operating as a BiCG method machine that can also be applied to an asymmetric matrix is used. The BiCG method is originally obtained by adding processing of the CG method to the transposed matrix to the CG method. By causing the other computing module 12 to process the transposed matrix, the two computing modules 12 are linked and operated in pairs. The behavior of the law can be realized. Other than the above are the same as in the first embodiment.
As shown in FIG. 6, the matrix equation calculation apparatus 1 can be easily mounted by mounting a plurality of sub print boards 53 on a main print board 54 similar to that shown in FIG. 5E.

次に、ハードウェア記述言語であるＶＨＤＬ（VHSIC Hardware Description Language)により行列方程式計算装置１による大型疎行列計算の論理シミュレーションを行った結果について説明する。
図７はシミュレーションの例として使用したリニアモータの界磁部のモデルを、図８はその数値モデルを示す。このモデルは、３つの励磁コイル６１〜６３で励磁された静磁場が、鉄（μ_r＝５０００）からなる磁場遮蔽材料６４によって外部にもれないよう遮蔽される現象に関するもので、系は軸対称性を持っており、図８の数値シミュレーションの解析領域は軸対称性を考慮して２次元の５０×１００グリッドサイズのものである。この磁場を記述する支配方程式は、与えられた電流密度ベクトルＪ、透磁率μに対し、ベクトルポテンシャルＡを未知数とし、

となる。この方程式を半径方向、軸方向ともにΔｌのサイズの一様なグリッドで分割した上で軸対称性を考慮して差分法に基づき差分式で表すと、半径方向にi 番目、軸方向にj
番目のグリッドでは、

となる。ただし、μ_i,j、Ａ_i,j、Ｊ_i,jはそれぞれ、半径方向にi 番目、軸方向にj 番目のグリッドでの透磁率μ、ベクトルポテンシャルＡのθ成分、電流密度ベクトルＪのθ成分である。このとき、未知数Ａ_i,jの数は、全てのグリッド数の５０×１００＝５０００であるので、差分法により数値解析する際に現れる大型疎行列は５０００×５０００のサイズとなる。この大型疎行列計算を、非対称行列用のＢｉＣＧ法に基づいて構成した行列方程式計算装置１により実行する場合のＶＨＤＬ論理回路シミュレーションにより行った。その解から得られた静磁場分布の結果を図９に示す。実際の計算では、（５）式からわかるように、１行に現れる未知数Ａ_i,jの係数は高々５個であるので、この疎行列性を利用し、行列は５０００×５のサイズに圧縮したものを用いている。図９に示す結果は、同じ数値モデルに基づいてＣ言語ソフトウェアシミュレーションを行った結果と一致しており、この行列方程式計算装置１の妥当性が保証される。このＶＨＤＬ論理回路シミュレーションでの動作を参考に、この行列方程式計算装置１がアクセス速度２６６ＭＨｚのメモリと同程度の速度で動作した場合の性能を見積ると、２．６ＧＨｚのＣＰＵを有するパソコン上でＣ言語を用いて計算した時の約５０〜６０倍の性能に匹敵する。 Next, the result of logical simulation of large sparse matrix calculation by the matrix equation calculation apparatus 1 using VHDL (VHSIC Hardware Description Language) which is a hardware description language will be described.
FIG. 7 shows a model of the magnetic field portion of the linear motor used as an example of simulation, and FIG. 8 shows its numerical model. This model relates to a phenomenon in which a static magnetic field excited by three exciting coils 61 to 63 is shielded from being exposed to the outside by a magnetic shielding material 64 made of iron (μ _r = 5000). 8 has a two-dimensional 50 × 100 grid size in consideration of axial symmetry. The governing equation describing this magnetic field is that for a given current density vector J and permeability μ, the vector potential A is an unknown,

It becomes. When this equation is divided by a uniform grid of size Δl in both radial and axial directions, and expressed by a differential equation based on the difference method in consideration of axial symmetry, it is i-th in the radial direction and j in the axial direction.
In the second grid,

It becomes. However, μ _{i, j} , A _{i, j} and J _{i, j} are the magnetic permeability μ, the θ component of the vector potential A, the current density vector J, and the i-th grid in the radial direction and the j-th grid in the axial direction, respectively. The θ component. At this time, since the number of unknowns A _{i, j} is 50 × 100 = 5000, which is the number of all grids, the large sparse matrix that appears when performing numerical analysis by the difference method has a size of 5000 × 5000. This large sparse matrix calculation was performed by VHDL logic circuit simulation in the case of executing by the matrix equation calculation apparatus 1 configured based on the BiCG method for an asymmetric matrix. The result of the static magnetic field distribution obtained from the solution is shown in FIG. In the actual calculation, as can be seen from the equation (5), since the coefficient of the unknown A _{i, j} appearing in one row is at most 5, this sparse matrix property is used, and the matrix is compressed to a size of 5000 × 5. We use what we did. The result shown in FIG. 9 matches the result of the C language software simulation based on the same numerical model, and the validity of the matrix equation calculation apparatus 1 is guaranteed. Referring to the operation in the VHDL logic circuit simulation, the performance when the matrix equation calculation apparatus 1 operates at the same speed as a memory with an access speed of 266 MHz is estimated on a personal computer having a 2.6 GHz CPU. It is comparable to about 50-60 times the performance when calculated using language.

ここでは、簡単のため、軸対称２次元となるモデルを用いたが、基本的には３次元の場合も全く同様である。すなわち、例えば同じ図７の３次元モデルでは、グリッドサイズが１００×１００×１００であるので、最終的に現れる大型疎行列は、１００００００×１００００００で、（４）式に対応する３次元の差分式では、係数は１６個となるため、最終的には、１００００００×１６のサイズに圧縮された行列を解くことに帰着する。このため、この行列方程式計算装置１では、軸対称２次元の場合が、Ｎ＝５０００、ｍ＝５であったものが、３次元では、単に、Ｎ＝１００００００、ｍ＝１６となるだけで、動作自体は全く同様である。
この第３の実施の形態によれば、第１の実施の形態と同様な利点に加えて、大型疎行列Ａが非対称行列である場合にも行列方程式計算装置１を適用することができるという利点を得ることができる。 Here, for the sake of simplicity, an axisymmetric two-dimensional model is used, but basically the same applies to a three-dimensional model. That is, for example, in the same three-dimensional model of FIG. 7, the grid size is 100 × 100 × 100, so the large sparse matrix that finally appears is 1000000 × 1000000, and the three-dimensional difference equation corresponding to equation (4) Then, since there are 16 coefficients, the result is finally to solve a matrix compressed to a size of 1000000 × 16. For this reason, in this matrix equation calculation apparatus 1, in the case of the two-dimensional axisymmetric case, N = 5000 and m = 5, but in the three-dimensional case, only N = 1000000 and m = 16. The operation itself is exactly the same.
According to the third embodiment, in addition to the advantages similar to those of the first embodiment, the matrix equation calculation apparatus 1 can be applied even when the large sparse matrix A is an asymmetric matrix. Can be obtained.

以上、この発明の実施の形態について具体的に説明したが、この発明は、上述の実施の形態に限定されるものではなく、この発明の技術的思想に基づく各種の変形が可能である。
例えば、上述の実施の形態において挙げた数値、回路構成、配置などはあくまでも例に過ぎず、必要に応じて、これらと異なる数値、回路構成、配置などを用いてもよい。また、行列方程式の解法には、ＣＧ法やＢｉＣＧ法ではなく、ＩＣＣＧ法、ＢｉＣＧ−Ｓｔａｂ法、ＣＧＳ法、ＢｉＣＧ−Ｓｔａｂ２法などの他の反復解法を採用し、演算モジュール１２もこれらの行列方程式の解法用のもので構成してもよい。 Although the embodiment of the present invention has been specifically described above, the present invention is not limited to the above-described embodiment, and various modifications based on the technical idea of the present invention are possible.
For example, the numerical values, circuit configurations, arrangements, and the like given in the above-described embodiments are merely examples, and different numerical values, circuit configurations, arrangements, and the like may be used as necessary. The matrix equation is not solved by the CG method or BiCG method, but other iterative methods such as the ICCG method, BiCG-Stab method, CGS method, BiCG-Stab2 method are adopted, and the calculation module 12 also uses these matrix equations. You may comprise for the solution of these.

大型疎行列方程式の１例を示す略線図である。It is a basic diagram which shows one example of a large sparse matrix equation. この発明の第１の実施の形態による行列方程式計算装置の全体構成を示す略線図である。1 is a schematic diagram showing an overall configuration of a matrix equation calculation apparatus according to a first embodiment of the present invention. この発明の第１の実施の形態による行列方程式計算装置の各部の構成の詳細を示す略線図である。It is a basic diagram which shows the detail of a structure of each part of the matrix equation calculation apparatus by 1st Embodiment of this invention. この発明の第１の実施の形態による行列方程式計算装置をホスト計算機と接続した状態を示す略線図である。It is a basic diagram which shows the state which connected the matrix equation calculation apparatus by 1st Embodiment of this invention with the host computer. この発明の第１の実施の形態による行列方程式計算装置の実装例を示す略線図である。It is a basic diagram which shows the example of mounting of the matrix equation calculation apparatus by 1st Embodiment of this invention. この発明の第２の実施の形態による行列方程式計算装置の実装例を示す略線図である。It is a basic diagram which shows the example of mounting of the matrix equation calculation apparatus by 2nd Embodiment of this invention. この発明の第３の実施の形態による行列方程式計算装置を用いて行った数値シミュレーションに用いた数値モデルを示す略線図である。It is a basic diagram which shows the numerical model used for the numerical simulation performed using the matrix equation calculation apparatus by the 3rd Embodiment of this invention. この発明の第３の実施の形態による行列方程式計算装置を用いて行った数値シミュレーションに用いた数値モデルを示す略線図である。It is a basic diagram which shows the numerical model used for the numerical simulation performed using the matrix equation calculation apparatus by the 3rd Embodiment of this invention. この発明の第３の実施形態による行列方程式計算装置を用いて行った数値シミュレーションの結果を示す略線図である。It is a basic diagram which shows the result of the numerical simulation performed using the matrix equation calculation apparatus by 3rd Embodiment of this invention. 電磁界解析を例に、多くの科学技術計算は最終的に大型疎行列方程式の計算に帰着することを説明するための略線図である。Taking electromagnetic field analysis as an example, this is a schematic diagram for explaining that many scientific and technical calculations ultimately result in calculations of large sparse matrix equations.

Explanation of symbols

１…行列方程式計算装置、１１…メモリモジュール、１２…演算モジュール、１３…行列−ベクトル乗算回路、１４…収束判定部、１５…マスターコントローラ、４２…ホスト計算機、５１…プリント基板、５３…サブプリント基板、５４…メインプリント基板、Ｍ１−１〜Ｍ１−ｍ、Ｍ２−１〜Ｍ２−ｍ、Ｍ３−０〜Ｍ３−ｍ、Ｍ３−０’、Ｍ４、Ｍ４’、Ｍ５、Ｍ５’、Ｍ６、Ｍ７、Ｍ７’、Ｍ８…メモリ DESCRIPTION OF SYMBOLS 1 ... Matrix equation calculation apparatus, 11 ... Memory module, 12 ... Arithmetic module, 13 ... Matrix-vector multiplication circuit, 14 ... Convergence determination part, 15 ... Master controller, 42 ... Host computer, 51 ... Printed circuit board, 53 ... Subprint Substrate, 54 ... main printed circuit board, M1-1 to M1-m, M2-1 to M2-m, M3-0 to M3-m, M3-0 ', M4, M4', M5, M5 ', M6, M7 , M7 ', M8 ... memory

Claims

A matrix equation calculator that solves a large sparse matrix equation of interest by an iterative method,
A memory unit having a predetermined number of memories for storing only non-zero components for each column among the components of the large sparse matrix of the large sparse matrix equation;
One or a plurality of calculation units that execute at least a part of the calculation of the iterative solution method in a data flow format,
A matrix equation calculation apparatus configured to load data necessary for iterative solution operation for each row from the memory unit to the operation unit at a time.

The matrix equation calculation apparatus according to claim 1, comprising a plurality of the arithmetic units.

3. The matrix equation calculation apparatus according to claim 2, wherein the plurality of arithmetic units are operated in parallel.

4. The matrix equation calculation apparatus according to claim 3, wherein the iterative solution method is a conjugate gradient method or a biconjugate gradient method.

A matrix equation calculation method for solving a target large sparse matrix equation by an iterative method,
Storing only non-zero components among the components of the large sparse matrix of the large sparse matrix equation in a predetermined number of memories in the memory unit for each column;
Loading data necessary for iterative solution operation for each row from the memory unit to one or more arithmetic units at a time;
And a step of executing at least a part of the calculation of the iterative solving method in a data flow format by the calculation unit.

6. The matrix equation calculation method according to claim 5, wherein a plurality of the arithmetic units are used.

7. The matrix equation calculation method according to claim 6, wherein the plurality of arithmetic units are operated in parallel.

8. The matrix equation calculation method according to claim 7, wherein the iterative solution is a conjugate gradient method or a biconjugate gradient method.