JPS61175774A

JPS61175774A - Computer for analysis of simultaneous equations

Info

Publication number: JPS61175774A
Application number: JP1744585A
Authority: JP
Inventors: Mamoru Tanaka; 衞田中; Hideki Asai; 秀樹浅井; Mitsuo Asai; 浅井　光男
Original assignee: Individual
Current assignee: Individual
Priority date: 1985-01-30
Filing date: 1985-01-30
Publication date: 1986-08-07

Abstract

PURPOSE: To perform LU decomposition of a sparse matrix with high efficiency by ignoring most of the zero elements of the original matrix Y and storing only the non-zero elements in the local memory of each local unit, thereby converting Y into a band matrix, and by feeding the elements row by row, in parallel, to q processors corresponding to the bandwidth. CONSTITUTION: The processor contains q local units. The local units 1-3 are connected to one another via a common data bus 10, a control line 11, a priority circuit 12, etc. Each local memory unit {Aj} stores, in the row direction of the matrix, only the non-zero elements including fill-ins, with all zero elements ignored. The operation of the processor is controlled in three modes: a division mode (a), a multiplication mode (b), and a Gaussian elimination mode (c). In the division mode, the unit {Aj} is accessed by a counter 13. Only the local bus corresponding to the pivot aii is enabled by an enabling circuit 15, and the pivot data aii are then transferred to all local processors via the common bus.

Description

Detailed Description of the Invention

[Technical Field of the Invention] The present invention relates to a method of configuring a parallel computer for LU decomposition of large-scale sparse matrices.

[Prior Art and Its Problems] In parallel with the development of circuit network analysis algorithms based on the concept of parallel processing, many dedicated computers for executing these algorithms on large-scale systems have been proposed, and the field is advancing rapidly.

The fundamental reason for the growing need for dedicated computers is that, for the analysis of large-scale systems, sequential computation on a general-purpose computer requires an enormous amount of analysis time.

In recent years, Kung [1] proposed the concept of the systolic array processor as a dedicated computer for increasing the parallelism and pipelining of the processing involved in direct methods such as Gaussian elimination and LU decomposition, and its applications have been described frequently [2-5].

In theory, an array processor is the dedicated machine that can exploit the parallelism of this data processing most effectively, but it can only be conceived under the assumption of unlimited hardware. With a direct method, efficiently analyzing a sparse matrix that has a very large number of zero elements raises the serious problem that data transfer becomes complicated. Consequently, processing in parallel every operation that can be parallelized requires far too many processor cells, even when the target is a sparse matrix.

For example, an LU decomposition processor exploits the parallelism inherent in Gaussian elimination, but the issue of sparsity remains: handling an n × n square matrix requires n² processor cells. For the analysis of large-scale circuit networks, where n is very large, this number is unrealistic.

To simplify this problem, many array processors restrict their scope to solving nodal equations whose coefficient matrix has a band structure. This greatly simplifies data communication and allows parallel and pipelined processing to be executed very efficiently.

For a band matrix with bandwidth 2p + 1, preparing approximately p² processor cells for n >> p makes it possible to use the systolic data flow scheme, so a system of n simultaneous equations can be solved in O(n) time.

However, the structure of the coefficient matrix Y of the nodal equations depends heavily not only on the topology of the graph representing the circuit network but also on how the nodes are labeled, and in general it is not a band structure.

To shorten analysis time and save memory, two labeling approaches are used: the band method and the packing method [6]. The packing method labels the nodes so as to reduce the fill-ins generated during the matrix computation; with fewer fill-ins, computations on zero elements can be skipped efficiently, which shortens the analysis time. One such method is optimal pivot ordering [7]. With this method the number of fill-ins becomes extremely small, but the positions at which non-zero elements appear in the matrix become completely irregular.

The band method, on the other hand, labels the nodes so that all non-zero elements of the coefficient matrix Y fall within a fixed bandwidth. It produces more fill-ins than the packing method, but because the positions of the non-zero elements, including fill-ins, are confined, data communication is simplified and the analysis time is ultimately shortened as well.

In a sparse matrix labeled by the packing method, non-zero elements may appear far from the diagonal. If a bandwidth 2p + 1 is nevertheless defined for such a matrix, then p ≈ n, so the array ends up requiring about p² ≈ n² processor cells.
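The effect of labeling on the half-bandwidth p can be illustrated with a minimal sketch. The function name `half_bandwidth` and the 6-node example pattern are hypothetical and are not taken from the patent; indices are 0-based.

```python
def half_bandwidth(pattern):
    """Return the smallest p such that every non-zero (i, j) satisfies |i - j| <= p."""
    return max((abs(i - j) for i, j in pattern), default=0)

# Hypothetical 6-node chain 0-1-2-3-4-5: a tridiagonal non-zero pattern.
chain = ({(i, i) for i in range(6)}
         | {(i, i + 1) for i in range(5)}
         | {(i + 1, i) for i in range(5)})

# The same chain with one extra branch between nodes 0 and 5.
with_far_branch = chain | {(0, 5), (5, 0)}

print(half_bandwidth(chain))            # 1
print(half_bandwidth(with_far_branch))  # 5, i.e. n - 1
```

A single off-diagonal element far from the diagonal is enough to push p close to n, even though the matrix remains almost as sparse as before.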

Therefore, if an array processor were applied to such a matrix, an unrealistic number of processor cells would be required, and so many of them would sit idle that efficient analysis could not be expected. A dedicated computer is thus needed that can handle large-scale simultaneous equations whose coefficient matrix is extremely sparse and has irregularly located elements.

[Object of the Invention] The present invention presents a method of configuring a parallel computer for LU decomposition of large-scale sparse matrices. In this processor, q local units operate in parallel, where q corresponds to the number of non-zero elements, including fill-ins, in each row of the coefficient matrix Y. The local memory in each local unit stores only the non-zero elements, ignoring most of the zero elements of the original matrix Y, which saves memory. The resulting misalignment of data during the Gaussian elimination step is resolved either by shifting the data or by sequential-write, parallel-read memory access. As a result, LU decomposition of a sparse matrix labeled by the packing method or the like, which is impractical on a systolic array processor, is executed as efficiently as that of a sparse matrix labeled by the band method.

[Summary of the Invention] To achieve the above object, the present invention provides a computer for analyzing simultaneous equations, comprising:

first local memories, at most q in number, where q is the maximum number of non-zero elements, including fill-ins, in any row of a sparse matrix Y with elements a_ij (i = 1, 2, ..., N; j = 1, 2, ..., N), each storing at address i, as data, the value and the column number k of each element a_ik of the column vectors obtained by left-justifying, in the row direction, the non-zero elements including fill-ins of each row i;

second local memories, fewer in number than the first local memories, each storing at address i, as data, at least the row number l of each element of the row vectors obtained by top-justifying, in the column direction, the non-zero elements including fill-ins of each column of the lower triangular part of Y excluding the diagonal;

dividing means that, in the division mode, detects the pivot element a_ii, for which k = i, among the elements a_ik and column numbers k read from the first local memories at address i, and simultaneously divides each element a_ik read out by the pivot element to obtain a_ik / a_ii;

multiplying means that, in the multiplication mode, reads the contents a_lk of the first local memories using as the address the row number l of each element read from the second local memories at address i, detects the element a_li, for which k = i, and multiplies each value a_ik / a_ii obtained by the dividing means by the element a_li to obtain a_li · a_ik / a_ii; and

Gaussian elimination means that, in the Gaussian elimination mode, uses shift means for matching the column number k of the elements a_ik read from the first local memories at address i with the column number j of the elements a_lj read from the first local memories at row number l, and subtracts each value a_li · a_ik / a_ii obtained by the multiplying means from the element a_lj.

The invention is further characterized in that the shift means uses random access memories with identical contents that are written sequentially and read simultaneously, and in that the first local memories are accessed by the contents of the second local memories. When the number p of processors executing the dividing means, the multiplying means, and the Gaussian elimination means is smaller than the maximum number q, the first local memories of the p processors are connected hierarchically in r levels so that r × p exceeds q.

[Operation] To operate in parallel the q local units corresponding to the number q of non-zero elements, including fill-ins, in each row of the coefficient matrix Y, the local memory in each local unit stores only the non-zero elements, ignoring most of the zero elements of the original matrix Y. This is based on the idea of turning the matrix Y into a band matrix and feeding its elements, one row at a time and in parallel, to the q processors corresponding to the bandwidth.

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

First, the LU decomposition method is described.

In general, the nodal equation Y v_n = j is solved for v_n using LU decomposition, forward substitution, and backward substitution, in the following steps:

1) Y = LU, solved for L and U
2) L ξ = j, solved for ξ
3) U v_n = ξ, solved for v_n

Completing these steps requires O(n³) multiplications and additions, most of which are spent in the LU decomposition. To solve the nodal equation quickly, the LU decomposition must therefore be performed efficiently.
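The three steps can be sketched for a small dense matrix as follows. This is a minimal illustration that assumes the Doolittle variant (unit diagonal in L) and no pivoting; the patent text does not specify which variant it has in mind, and the function names and the 3 × 3 example are hypothetical.

```python
def lu_decompose(Y):
    """Step 1: Y = L U (Doolittle form, no pivoting, non-zero pivots assumed)."""
    n = len(Y)
    U = [[float(x) for x in row] for row in Y]
    L = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            L[k][i] = U[k][i] / U[i][i]
            for j in range(i, n):
                U[k][j] -= L[k][i] * U[i][j]
    return L, U

def forward_substitute(L, j):
    """Step 2: solve L xi = j for xi, top row first."""
    xi = []
    for r in range(len(j)):
        xi.append((j[r] - sum(L[r][c] * xi[c] for c in range(r))) / L[r][r])
    return xi

def backward_substitute(U, xi):
    """Step 3: solve U v = xi for v, bottom row first."""
    n = len(xi)
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (xi[r] - sum(U[r][c] * v[c] for c in range(r + 1, n))) / U[r][r]
    return v

def solve(Y, j):
    L, U = lu_decompose(Y)
    return backward_substitute(U, forward_substitute(L, j))

Y = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
v = solve(Y, [1, 0, 1])  # the exact solution is [1, 1, 1]
```

The triple loop in `lu_decompose` is where the O(n³) cost arises, while the two substitution steps cost only O(n²).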

In the pivot operation for the i-th row, the i-th row and column of the matrix Y are transformed according to equation (1).

In this process, each element a_kj is updated to a'_kj according to equation (2):

$$a'_{kj} = a_{kj} - \frac{a_{ki}\,a_{ij}}{a_{ii}} \qquad (2)$$

That is, a fill-in occurs in the i-th pivot step at a position (k, j) where a_kj was zero but a_ki · a_ij is non-zero. In a band matrix, all fill-ins occur within the band.
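Where fill-ins appear can be found from the non-zero pattern alone, without numeric values. The following is a minimal sketch of such a symbolic elimination, assuming no pivoting and no accidental cancellation; the function name and the star-graph example are hypothetical, and indices are 0-based.

```python
def symbolic_fill_ins(pattern, n):
    """Return the positions (k, j) that turn non-zero while eliminating rows 0..n-1."""
    nz = set(pattern)
    fills = set()
    for i in range(n):
        rows = [k for k in range(i + 1, n) if (k, i) in nz]  # a_ki != 0
        cols = [j for j in range(i + 1, n) if (i, j) in nz]  # a_ij != 0
        for k in rows:
            for j in cols:
                if (k, j) not in nz:  # a_kj was zero, a_ki * a_ij is not
                    nz.add((k, j))
                    fills.add((k, j))
    return fills

diag = {(i, i) for i in range(4)}
# Star graph with the center labeled 0: the center is eliminated first.
center_first = diag | {(0, k) for k in (1, 2, 3)} | {(k, 0) for k in (1, 2, 3)}
# The same graph with the center labeled 3: the center is eliminated last.
center_last = diag | {(3, k) for k in (0, 1, 2)} | {(k, 3) for k in (0, 1, 2)}

print(len(symbolic_fill_ins(center_first, 4)))  # 6
print(len(symbolic_fill_ins(center_last, 4)))   # 0
```

The same graph yields six fill-ins or none depending only on how the nodes are labeled, which is the dependence on labeling that the description discusses.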

Below, a configuration is proposed for an LU decomposition processor that handles matrices with a sparse structure to which a systolic array processor is difficult to apply in practice.

The processor consists of q local units. As shown in FIG. 2, all local units 1, 2, 3 are connected by a common data bus 10, a control line 11, a priority circuit 12, and so on.

The interior of each local unit is shown in FIG. 1.

The details of the processor are described below.

1) Local memory and data structure

A) Local memory units {A_j}: Each local memory unit {A_j; j = 1, 2, 3, ..., q} stores, in the row direction of the matrix, only the non-zero elements including fill-ins, with all zero elements ignored.

Accordingly, an address in each local memory {A_j} equals the row number in the matrix Y of the data stored at that address. The data format is (k, value), where k is the column number of the datum in the original matrix and "value" corresponds, for example, to a branch conductance in a resistor network.

B) Local memory units {B_j}: Each local memory unit {B_j; j = 1, 2, 3, ..., q} stores, in the column direction of the matrix, only the non-zero elements including fill-ins of the lower triangular part of Y excluding the diagonal, with all zero elements ignored.

Accordingly, an address in each local memory {B_j} equals the column number in the matrix Y of the data stored at that address. The data format is l alone, where l is the row number of the datum in the original matrix.

The data structure of the local memories is illustrated using the graph shown in FIG. 3 as an example.

FIG. 4 shows the structure of the nodal conductance matrix for the graph of FIG. 3.

Here, the symbols "1" and "f" denote a non-zero element and a fill-in, respectively.

For the matrix structure of FIG. 4, the data structures in the local memories {A_j} and {B_j} are as shown in FIG. 5 and FIG. 6.

That is, for a non-zero element that arises from a fill-in, only the label number of the datum is stored, and its value is set to zero in accordance with the matrix Y.
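The two memory layouts can be sketched in software as follows. This is only an illustration of the packing described above, not the contents of FIG. 5 or FIG. 6: the function name `build_local_memories`, the use of `None` for unused cells, the 0-based indices, and the 4 × 4 example matrix with its fill-in positions are all assumptions.

```python
def build_local_memories(Y, fill_ins=()):
    """Pack Y into memories A (row-wise, left-justified) and B (column-wise, top-justified).

    A[j][i] is the j-th stored element of row i as (column k, value).
    B[j][i] is the row number of the j-th stored element below the diagonal in column i.
    """
    n = len(Y)
    stored = {(i, k) for i in range(n) for k in range(n) if Y[i][k] != 0} | set(fill_ins)
    rows = [[(k, Y[i][k]) for k in range(n) if (i, k) in stored] for i in range(n)]
    cols = [[l for l in range(i + 1, n) if (l, i) in stored] for i in range(n)]
    q = max(len(r) for r in rows)                   # number of A memories
    q_b = max((len(c) for c in cols), default=0)    # number of B memories
    A = [[rows[i][j] if j < len(rows[i]) else None for i in range(n)] for j in range(q)]
    B = [[cols[i][j] if j < len(cols[i]) else None for i in range(n)] for j in range(q_b)]
    return A, B

# Hypothetical conductance-like matrix; eliminating row 0 creates fill-ins at (1, 3), (3, 1).
Y = [[ 3, -1,  0, -1],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [-1,  0, -1,  3]]
A, B = build_local_memories(Y, fill_ins={(1, 3), (3, 1)})
```

Here `A` has four memories and `B` has two, so fewer B memories than A memories are needed. A fill-in such as (1, 3) is stored as `(3, 0)`: the label is present but the value is zero.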

2) Operations of the local processors

The operation of the processor is controlled by three operation modes: a) division mode (÷), b) multiplication mode (×), and c) subtraction (Gaussian elimination) mode (G).

In each operation mode, the contents [Acc] of the accumulator Acc inside each local processor are updated as follows, in accordance with equation (2):

a) $\mathrm{Acc} \leftarrow a_{ij} / a_{ii}$
b) $\mathrm{Acc} \leftarrow [\mathrm{Acc}] \times a_{ki}$
c) $\mathrm{Acc} \leftarrow a_{kj} - [\mathrm{Acc}]$

The operation of the processor in each mode is described below.

a) Division mode

In FIG. 1, the symbol "i" denotes the pivot row (column) number for the pivot a_ii. First, address i of the local memories {A_j}, which holds the pivot a_ii, is accessed by the counter 13 that stores the pivot number i, and the data (k, value) are set in the registers of each local unit. The column number k of each datum is compared with the pivot number i, and the local unit to which the pivot a_ii is assigned is detected by the exclusive-OR circuit 14. As a result, only the local bus corresponding to the pivot a_ii is enabled by the enable circuit 15, and the pivot data a_ii are transferred to all local processors via the common bus.

The processor P_j of each local unit divides the contents of register R_3j by the pivot a_ii. The result is written to the local memory {A_j} and is also set in {R_3j}.

b) Multiplication mode

In the division mode, at the same time as address i of the local memories {A_j} is accessed, address i of the local memories {B_j} is accessed by the pivot number i, and the data l read out are set in registers. In the multiplication mode, the local memories {A_j} are accessed using the contents of these registers, namely the row numbers of the non-zero elements of the lower triangular part in column i. The register contents are shifted one by one into the next register up, and the data accessed with the contents of the first register are set in registers. Then, as in the division mode, the column number held in R_2j is compared with the pivot number i to detect the column-i element. The detected value is transferred to all other processors via the common bus and multiplied by the contents of their registers.

The result of the multiplication is set in a register.

c) Gaussian elimination mode

Next, the Gaussian elimination operation is executed on the contents of the two register sets, but the label numbers of the data in a register pair do not always match. Because this processor targets the analysis of large-scale circuit networks and stores data with all zero elements ignored to save memory, at the initial stage of the elimination step, when the data have just been set in the registers, the label number (column number in the matrix Y) of the addend often differs from the label number (column number in the matrix Y) of the augend. Therefore, so that the addition (subtraction) is performed between elements at the same position in the original matrix, the contents of register R_5j are shifted and their label numbers are compared with those of the other register set.

When all pairs of label numbers match, that is, when the start signal becomes "1", the addition (subtraction) is executed in the Gaussian elimination mode. The result is written back to its original address in the local memory {A_j}, using the address held in the register.
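The interplay of the three modes can be sketched in software on row-packed data. This is a minimal sequential illustration, not a model of the hardware: a dictionary lookup by column label stands in for the register shifting and label matching, the pivot itself is left unscaled (an assumption), and the name `lu_packed` and the 3 × 3 example are hypothetical. Indices are 0-based.

```python
def lu_packed(Y):
    """LU decomposition on rows stored as {column label k: value}, zeros omitted."""
    n = len(Y)
    rows = [{k: float(v) for k, v in enumerate(row) if v != 0} for row in Y]
    for i in range(n):
        a_ii = rows[i][i]
        # (a) division mode: Acc <- a_ik / a_ii for the pivot-row elements right of the pivot
        scaled = {k: v / a_ii for k, v in rows[i].items() if k > i}
        rows[i].update(scaled)  # results are written back to the row storage
        for l in range(i + 1, n):
            if i not in rows[l]:
                continue  # row l has no element in column i, nothing to eliminate
            a_li = rows[l][i]
            for k, s in scaled.items():
                # (b) multiplication mode: Acc <- [Acc] * a_li
                product = s * a_li
                # (c) elimination mode: Acc <- a_lk - [Acc]; matching on label k
                # replaces the hardware label alignment, and a fill-in starts from 0
                rows[l][k] = rows[l].get(k, 0.0) - product
    return rows

rows = lu_packed([[2, -1, 0], [-1, 2, -1], [0, -1, 2]])
# rows[0] == {0: 2.0, 1: -0.5}; rows[1][1] == 1.5
```

After the loop, each row holds the lower-triangular factor on and left of the diagonal and the scaled upper-triangular factor to its right. In the processor described here, the inner loop over k is what the q local units would carry out in parallel.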

3) Shift operations

This section explains the shift operations mentioned in 2) and considers their effect on the computation time.

Before the shift operation starts, the two register sets {R_4j} and {R_5j} hold the two operands of the Gaussian elimination operation, the augend and the addend. The label numbers of the contents of registers R_4j and R_5j are compared, and if they do not match, the contents of register R_5j are shifted right into register R_5,j+1. In principle, this operation is repeated in sequence.

The maximum number of right shifts needed until the label numbers of all elements match clearly depends on the number of non-zero elements, including fill-ins, in each row of the matrix Y. During right shifting, the contents of the registers {R_4j} and {R_5j} have the following properties.

[Property 1] If the matrix Y is structurally symmetric, row l contains every fill-in that can arise in that row from the multiplication of row i and column i, so the set of labels held in the registers {R_5j} is a subset of the set of labels held in the registers {R_4j}.

[Property 2] Let T_j (j = 1, 2, 3, ..., r; r ≤ q) be the number of shifts applied to the initial contents of register R_5j. Then, from Property 1,

$$T_1 \le T_2 \le \dots \le T_r$$

where r is the number of non-zero elements in row i of the upper triangular part of the matrix Y.

[Property 3] From Property 2, the contents of each register R_5j are shifted at most T_r times.

[Property 4] Let NZ_i and NZ_l be the numbers of non-zero elements in rows i and l of the matrix Y, respectively. Then

$$T_r = |NZ_i - NZ_l| + 1$$

These properties also apply to band matrices. For the LU decomposition of a band matrix, NZ_i = NZ_l in most cases, so the maximum number of shifts is T_r = 1, and a single shift aligns the label numbers in all register pairs. For a dense matrix, T_r = 0.

If the label numbers in register R_5j (j = m) and register R_4j (j = m) match, only the contents of the registers {R_5j; j = m + 1, m + 2, ..., r} are shifted. The register vacated by the shift is then forcibly loaded with the value "0" and the label of the corresponding register of the other set.

Two or more register pairs in {R_4j} and {R_5j} may have matching label numbers at the same time. In that case the following property holds.

[Property 5] When the label numbers of two or more pairs match at the same time, those registers are always adjacent.

In this case, only the data of the registers of {R_5j} to the right of the matching registers need to be shifted right.

In this processor, the result of the match detection between the label numbers of registers {R_4j} and {R_5j} is latched in flip-flops as flags. The flags are passed through a right-priority circuit [8] to detect the rightmost matching register, and its output enables the registers to its right, which realizes this behavior.

Let Inc(n_i) be the number of branches connected to the node n_i corresponding to the pivot a_ii. The number of non-zero elements NZ_i in row i is then equal to Inc(n_i) + 1. Hence, when the graph of the circuit network is a regular graph, every row contains the same number of non-zero elements and the computation is performed most efficiently.
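The relation NZ_i = Inc(n_i) + 1 can be checked on a branch list with a minimal sketch. The function name and the two example graphs are hypothetical; the count covers the original matrix only (before fill-ins), treats branches to the reference node as absent, and uses 0-based node numbers.

```python
def nonzeros_per_row(n, branches):
    """Return NZ_i = Inc(n_i) + 1 for each node: its incident branches plus the diagonal."""
    incidence = [0] * n
    for a, b in branches:
        incidence[a] += 1
        incidence[b] += 1
    return [inc + 1 for inc in incidence]

ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # regular graph, every node has 2 branches
star = [(0, 1), (0, 2), (0, 3)]                  # irregular graph with a hub at node 0

print(nonzeros_per_row(5, ring))  # [3, 3, 3, 3, 3]
print(nonzeros_per_row(4, star))  # [4, 2, 2, 2]
```

For the ring every row has the same count, which is the regular-graph case described above; for the star the counts differ between the hub row and the other rows.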

In this processor, the computation time including the shift operations is determined not by the structure of the matrix (that is, by how the graph nodes are labeled) but by the number of non-zero elements per row. Consequently, even when the labeling follows not the band method but a packing method that reduces the number of fill-ins, the number of shifts does not grow and LU decomposition is executed efficiently.

In principle the label numbers are aligned by right shifts, but for general sparse matrices a left shift is sometimes required. In that case, as shown in FIG. 7, the label numbers can be aligned by a left shift operation using a left-priority circuit.

The efficiency of the processor is evaluated below.

The computational efficiency E of a computer can be expressed as

$$E = O(\mathrm{time}) \times N(\text{co-processor})$$

where O(time) and N(co-processor) are the time required to obtain the solution and the number of processor cells or the like, respectively.

For the LU decomposition of an n × n sparse matrix, the efficiencies E_a of the array processor and E_s of this processor can be expressed as

$$E_a \approx n \times p^2 = np^2$$

$$E_s = nq \times \frac{q}{2} + \alpha = \frac{nq^2}{2} + \alpha$$

where α is the time required for data communication, which is essentially the time needed for the shift operations. With a barrel shifter or the like, the label numbers could be aligned at once without stepwise shifting, but as long as |NZ_i − NZ_l| is not very large, α does not become very large either.

In a band matrix, q ≈ 2p. Because an array processor cannot exploit the sparsity of the matrix, labeling by a packing method can lead to p ≈ n; in that case q << p and therefore E_s << E_a. This processor is thus considered effective for any sparse matrix whose labeling keeps the number of fill-ins reasonably small, whether the band method or the packing method is used. Furthermore, if the data in registers R_5j are shifted to the pivot position before the Gaussian elimination operation is performed, the number of shifts increases but the number of local processors can be reduced to about half. The efficiency in this case is defined as E_s'.
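The two measures E_a = np² and E_s = nq²/2 + α can be compared numerically with a minimal sketch. Since E is a product of time and processor count, a smaller value is better. The function names and the sizes n = 1000, p = 10, q = 20 are hypothetical and are not the simulation values of the patent; α is set to zero for simplicity.

```python
def array_measure(n, p):
    """E_a = n * p^2 for an array processor on a matrix with bandwidth 2p + 1."""
    return n * p ** 2

def proposed_measure(n, q, alpha=0.0):
    """E_s = n * q^2 / 2 + alpha for q local units; alpha is the shift overhead."""
    return n * q ** 2 / 2 + alpha

n = 1000
# Band labeling: p = 10 and q = 2p = 20 non-zero elements per row.
band = (array_measure(n, 10), proposed_measure(n, 20))        # (100000, 200000.0)
# Packing-style labeling: still q = 20 per row, but the bandwidth grows to p = n.
packing = (array_measure(n, 1000), proposed_measure(n, 20))   # (1000000000, 200000.0)
```

Under these assumed sizes the two designs are within a factor of two of each other for the band case, while for the packing case E_a grows by four orders of magnitude and E_s is unchanged.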

Below, the efficiencies of this processor and of an array processor are compared by simulation on a concrete example. FIG. 8 shows the structure of the coefficient matrix Y of the circuit network given in reference [7].

With the pivot order optimization of reference [7], the structure of the matrix Y becomes as shown in FIG. 9.

Consider applying this processor to the matrices of FIG. 8 and FIG. 9. An 8086 is assumed for each local processor, and the number of clock cycles required for each operation is set as follows.

Table 2 shows the computational efficiencies E_a, E_s, and E_s' for solving nodal equations whose coefficient matrices have the structures shown in FIG. 8 and FIG. 9.

In addition, when a simple ordering, in which the number of non-zero elements in each row of the original matrix is counted and the rows are ordered starting from the one with the fewest, was applied to the matrix of FIG. 8, its computational efficiency was confirmed to be almost equal to that obtained with the optimal ordering.
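The simple ordering described here, counting the non-zero elements of each row and taking the rows in ascending order of that count, can be sketched as follows. The function name, the tie-breaking by original index, and the star-graph example are assumptions; the matrix of FIG. 8 is not reproduced, and indices are 0-based.

```python
def simple_ordering(pattern, n):
    """Return the node order obtained by sorting rows by non-zero count, fewest first."""
    counts = [0] * n
    for i, _ in pattern:
        counts[i] += 1
    # Ties are broken by the original index, which is an arbitrary choice.
    return sorted(range(n), key=lambda r: (counts[r], r))

# Star graph with the hub at node 0: row 0 has 4 non-zeros, the other rows have 2.
star = ({(i, i) for i in range(4)}
        | {(0, k) for k in (1, 2, 3)}
        | {(k, 0) for k in (1, 2, 3)})

print(simple_ordering(star, 4))  # [1, 2, 3, 0]
```

For this graph the ordering moves the hub to the end, which avoids the fill-ins that eliminating the hub first would create. The counts are taken once from the original matrix and are not updated during elimination.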

Next, another embodiment, in which the shift operation is performed using memory, is described. An embodiment of the system configuration for the case where the number of non-zero elements in one row exceeds q is then described with reference to Fig. 10.

As described later, the local memories (A11, A12, A13) and (A21, A22, A23) store the matrix packed in the row direction, with the zero elements other than fill-ins removed, as shown in Fig. 5. The data format is (k, value), where k is the column number and the value corresponds to a branch conductance.
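As an illustration of the (k, value) format, the sketch below packs each row of a small matrix to the left and then distributes the packed entries over q local memories, so that memory j holds the j-th packed entry of every row at the address given by the row number. The function names, the 1-based numbering, the `None` padding for short rows and the example values are assumptions made for this sketch; fill-in positions are passed in explicitly rather than computed.

```python
def pack_rows(Y, fill_in=()):
    """Pack each row into a list of (k, value) pairs, k being the 1-based column.

    Zero entries are dropped unless their (row, column) position is listed
    in fill_in, since fill-in positions must keep a storage slot.
    """
    keep = set(fill_in)
    packed = []
    for i, row in enumerate(Y, start=1):
        packed.append([(k, v) for k, v in enumerate(row, start=1)
                       if v != 0 or (i, k) in keep])
    return packed


def to_local_memories(packed):
    """Return q lists; list j holds the j-th packed entry of every row.

    q is the largest number of stored entries in any row. Index i of each
    list plays the role of address i + 1; shorter rows are padded with None.
    """
    q = max(len(row) for row in packed)
    return [[row[j] if j < len(row) else None for row in packed]
            for j in range(q)]


# Hypothetical 3x3 example with conductance-like values.
Y = [
    [2.0, 0, 1.0],
    [0, 3.0, 0],
    [1.0, 0, 4.0],
]
packed = pack_rows(Y)
A = to_local_memories(packed)
```

Here `A[0]` and `A[1]` stand in for two first local memories; reading both at the same address returns all stored entries of one row at once.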

The local memories Bj (j = 1, 2) store the lower triangular part of the matrix, excluding the diagonal elements, packed in the column direction with the zero elements other than fill-ins removed; the numerical values are not needed there. Taking the graph shown in Fig. 3 as an example, the data structures of the local memories are as shown in Figs. 5 and 6.

The operation of the processor is as follows, where i denotes the pivot. Local memory Ajk is accessed with address i and the data read out are set in R21; the pivot element is detected from the column number and transferred to all processors, and the division is executed. The result is written into local memory Cj at the address given by the column number and is also copied to the other local memories Cj. Local memory Bj is likewise accessed with address i, and the row number of a non-zero element in column i is set in R1j. In multiplication mode, accessing local memory Ajk with the contents of R1j detects the column-i element in the same way as in division mode. Local memory Cj is then accessed with the column number held in R4j, and its contents are multiplied by the detected value and set in R1j. In Gaussian elimination mode, the Gaussian elimination operation is performed on the contents of R%5 and Rtej.

After this operation has been executed as many times as there are non-zero elements in column i of the lower triangular matrix, i is incremented.
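The sequence of division, multiplication and Gaussian elimination for each pivot i can be sketched in software as below. This is only a sequential Python model under stated assumptions, not the hardware described here: rows are held as dictionaries keyed by 0-based column number, the dictionary `c` plays the role of the local memories Cj, the list `below` plays the role of the column-i entry of the local memories Bj and is found by scanning, fill-ins are created on demand instead of being stored in advance, and no pivoting is done. The function name `lu_three_modes` is hypothetical.

```python
def lu_three_modes(Y):
    """Factor Y in place into L (with the pivots on its diagonal) and unit-upper U.

    Returns a list of dicts; in row i, columns k <= i hold L and columns
    k > i hold U. A zero pivot raises ZeroDivisionError (no pivoting).
    """
    n = len(Y)
    rows = [{k: float(v) for k, v in enumerate(r) if v != 0} for r in Y]
    for i in range(n):
        # Division mode: divide the entries right of the pivot by a_ii.
        pivot = rows[i][i]
        c = {k: v / pivot for k, v in rows[i].items() if k > i}
        # Role of memory B: rows l > i with a non-zero element in column i.
        below = [l for l in range(i + 1, n) if i in rows[l]]
        for l in below:
            # Multiplication mode: a_li * (a_ik / a_ii) for every k.
            a_li = rows[l][i]
            products = {k: a_li * u for k, u in c.items()}
            # Gaussian elimination mode: subtract from the matching a_lk.
            for k, p in products.items():
                rows[l][k] = rows[l].get(k, 0.0) - p
        # Keep the normalized pivot row as row i of U (assumption).
        for k, u in c.items():
            rows[i][k] = u
    return rows


# Hypothetical 3x3 example.
Y = [[4, 2, 0], [2, 5, 1], [0, 1, 3]]
factors = lu_three_modes(Y)
```

In the hardware, the inner loop over k is what the q local processors carry out in parallel; the sketch performs it sequentially only to show which values are combined in each mode.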

When the number of non-zero elements in a row exceeds the number of local processors, the local memories are arranged hierarchically. For example, for q = 5, p = 2 and r = 3, as shown in Fig. 10, the local memory consisting of (A11, A12, A13) and the local memory consisting of (A21, A22, A23) are connected hierarchically in three levels (r = 3) so that they share the same addresses, and the contents of A1, A2, A3, A4 and A5 of Fig. 5 are stored in A11, A21, A12, A22 and A13, respectively. In other words, two local units (p = 2) can handle the case of q = 5 non-zero elements.
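The placement of q packed entries onto p local units with r memory levels can be illustrated with a small index calculation. The sketch below is an assumption-laden illustration: it places entries round-robin across the units, level by level, which reproduces the example of q = 5, p = 2, r = 3; the function names and the choice of the smallest r with r x p >= q are introduced here and are not taken from the source.

```python
import math


def levels_needed(q, p):
    """Smallest number of levels r such that r * p >= q (assumption)."""
    return math.ceil(q / p)


def place(m, p):
    """Map the m-th packed entry of a row (1-based) to (unit, level), both 1-based.

    Entries are dealt out round-robin: entries 1..p fill level 1 of units
    1..p, entries p+1..2p fill level 2, and so on.
    """
    unit = (m - 1) % p + 1
    level = (m - 1) // p + 1
    return unit, level


# Example from the text: q = 5 entries, p = 2 local units.
q, p = 5, 2
r = levels_needed(q, p)
layout = ["A%d%d" % place(m, p) for m in range(1, q + 1)]
```

With these values the layout list names one memory per packed entry, in the order of the entries within a row.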

[Effects of the Invention] The present invention describes one scheme for configuring a parallel computer dedicated to the LU decomposition of large-scale sparse matrices. In the systolic array processors usually proposed for this purpose, the structure of the coefficient matrix Y that can be handled is limited to a band structure. However, the coefficient matrix of a general circuit network does not have a band structure, so applying an array processor is difficult in practice.

An LU decomposition processor was therefore proposed in which q local processor units are provided, q being equal to the number of non-zero elements in one row of the matrix, so that the operations on the columns corresponding to the individual non-zero elements are processed in parallel.

The local units are connected by a common data bus, a priority circuit and so on. The local memories store only the non-zero elements of the matrix Y, with all zero elements ignored, and in the Gaussian elimination process the operation is carried out by shifting the register contents while detecting matches between the label numbers of the data. Several properties of the shift process were also shown, and it was shown that matching the label numbers by shift operations does not require much time for data communication. Finally, simulation showed that the efficiency of the present processor is superior to that of an array processor.

Thus, the present invention has the effect that an n × n large-scale sparse matrix can be LU-decomposed in approximately O(n) using a number of processor units no greater than the number of non-zero elements in each row of the matrix.

[Brief explanation of drawings]

Fig. 1 shows a local unit of the computer for analyzing simultaneous equations according to the present invention; Fig. 2 is a system configuration diagram of the present invention; Fig. 3 is an example of a graph; Fig. 4 shows the matrix structure for that graph; Fig. 5 shows the data structure in the local memories (Aj); Fig. 6 shows the data structure in the local memories (Bj); Fig. 7 is a circuit diagram for the left shift; Fig. 8 shows an example (a) of the structure of the matrix Y; Fig. 9 shows an example (b) of the structure of the matrix Y after pivot-order optimization; and Fig. 10 shows another embodiment of the local unit of the computer for analyzing simultaneous equations according to the present invention.

1, 2, 3 ... local units; 10 ... common bus; 11 ... control line; 12 ... priority circuit; 13 ... counter; 14 ... exclusive OR; 15 ... enable circuit; A1, A2 ... first local memories; B1, B2 ... second local memories; P1, P2 ... local processors.

Claims

[Claims]

(1) A computer for analyzing simultaneous equations, comprising:
first local memories, at most q in number, q being the maximum number of non-zero elements including fill-ins present in any row of a sparse matrix Y consisting of elements a_ij (i = 1, 2, ..., N; j = 1, 2, ..., N), which store at address i, as data, the value and the column number k of each element a_ik of the column vectors formed by left-justifying, in the row direction, the non-zero elements including fill-ins of each i-th row of the matrix Y;
second local memories, fewer in number than the first local memories, which store at address j, as data, at least the row number l of each element a_lj of the row vectors formed by top-justifying, in the column direction, the non-zero elements including fill-ins of each j-th column of the lower triangular part of the matrix Y excluding the diagonal elements;
division means which, in division mode, detects the pivot element a_ii, for which i = k, among the elements a_ik and column numbers k read from the first local memories by the address i, and simultaneously divides each of the read elements a_ik by the pivot element a_ii to calculate a_ik/a_ii;
multiplication means which, in multiplication mode, reads the contents a_ld of the first local memories using as an address the row number l of each element a_li read from the second local memories by the address i, detects the element a_li, for which d = i, and multiplies each value a_ik/a_ii obtained by the division means by the element a_li to obtain a_li·a_ik/a_ii; and
Gaussian elimination means which uses shift means to match the column number k of each element a_ik read from the first local memories by the address i with the column number d of the element a_ld read from the first local memories by the row number l, and subtracts each value a_li·a_ik/a_ii from the corresponding element a_ld.

(2) The computer for analyzing simultaneous equations according to claim 1, wherein the shift means uses random access memories whose contents are exactly identical, to which writing is performed sequentially and from which reading is performed simultaneously.

(3) The simultaneous equation analysis computer according to claim 1, wherein the first local memory is accessed by the contents of the second local memory.

(4) The computer for analyzing simultaneous equations according to claim 1, wherein, when the number p of processors executing the division means, the multiplication means and the Gaussian elimination means is smaller than the maximum number q, the p first local memories are connected hierarchically in r levels such that r × p exceeds q.