GB2205183A - Finite element analysis utilizing a bandwidth maximized matrix - Google Patents

Finite element analysis utilizing a bandwidth maximized matrix

Info

Publication number
GB2205183A
GB2205183A (application GB08802490A / GB8802490A)
Authority
GB
United Kingdom
Prior art keywords
matrix
cell
equations
solutions
zero
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB08802490A
Other versions
GB8802490D0 (en)
Inventor
Steven Warren Hammond
Gary Bedrosian
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
General Electric Co
Original Assignee
General Electric Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by General Electric Co filed Critical General Electric Co
Publication of GB8802490D0 (en)
Publication of GB2205183A (en)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015One dimensional arrays, e.g. rings, linear arrays, buses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/12Simultaneous equations, e.g. systems of linear equations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/23Design optimisation, verification or simulation using finite element methods [FEM] or finite difference methods [FDM]

Description

RD-17308

FINITE ELEMENT ANALYSIS METHOD UTILIZING A BANDWIDTH MAXIMIZED MATRIX

Many physical systems may be described mathematically in terms of systems of linear equations which, in turn, are solved by matrix manipulation methods. Finite element analysis is a method concerned with describing various physical systems in terms of systems of equations and developing methodologies for the solution of such systems. The term "physical system" is meant herein to refer to structures, devices, apparatus or bodies of matter (solid, liquid, gas) or simply a region of space in which a particular physical, chemical or other phenomenon is occurring. Finite element analysis had its beginnings as a method for structural analysis, but today is routinely used in the design of motors, generators, magnetic resonance imaging systems, aircraft engine ignition systems, circuit breakers and transformers, to name but a few; its techniques are used to analyze stress, temperature, molecular structure, electromagnetic
fields, current, physical forces, etc. in all sorts of physical systems. It has become a standard part of the design cycle for numerous products which are not easily analyzed by other methods. The present invention has particular application in the analysis and design of such products.
Systems of linear equations required to be solved by finite element analysis techniques are very often large and for that reason computationally difficult to solve. For example, a system of equations from a large, but not untypical, two-dimensional finite element analysis may have 25,000 unknowns. Where such equations are based on a finite element mesh having contributions from a majority of nodes, no choice exists but to use brute computational force to arrive at a solution. In some instances, however, such equations are both large and sparse and thereby afford an opportunity to pretreat or transform the equations in a manner which makes them less computationally intensive to solve. The word "sparse" is used herein to refer to the characteristic that only a very small percentage of the elements in a matrix have non-zero values. When extreme sparsity exists in a very large system, several techniques exist which may be used to transform the system of equations into one which is more easily handled from a computational standpoint. However, in spite of such transformations, standard computational techniques may be either impractical or very inefficient depending on the size and other characteristics of the resulting matrix equations.
As one can understand from the above discussion, the field of finite element analysis has developed in large measure because of the availability of larger and more powerful computing machines for use in solving such systems. There now exist a variety of high performance special purpose computer systems designed to perform special application calculations which are especially taxing to perform on general-purpose computers. One such system is based on the concept of a systolic architecture and provides a general methodology for mapping high-level computations into hardware structures. In a systolic system data flows from the computer memory in a rhythmic fashion, passing through many processing elements in a chain or pipeline manner before returning to memory, thereby permitting multiple computations for each memory access and resulting in a great increase in the speed of execution of computationally intensive problems without an associated increase in input/output requirements. Some methodologies for tailoring systolic architectures to handle matrix operations are discussed in a paper by H.T. Kung and C.E. Leiserson entitled "Systolic Arrays (for VLSI)", Sparse Matrix Proc. 1978, Society for Industrial and Applied Mathematics, 1979, pp. 256-282. Another analysis of this problem and a suggested solution is addressed in the article in IEEE Transactions on Computers, Vol. C-32, No. 3, March 1983 entitled "An Efficient Parallel Algorithm for the Solution of Large Sparse Linear Matrix Equations".
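The rhythmic, pipelined data flow described above can be illustrated with a toy software simulation. The following sketch is purely illustrative (it is not from the patent, and the stage functions are arbitrary placeholders): each cell applies its own operation every cycle and hands its result to the next cell, so that once the pipeline fills, one finished result emerges per host access even though every datum undergoes one operation per cell.

```python
def systolic_pipeline(stages, stream):
    """Toy simulation of a linear systolic array.

    Each datum from the host flows through every processing cell in pipeline
    fashion; after the pipeline fills, one result emerges per cycle even
    though each datum undergoes len(stages) operations, giving multiple
    computations per input/output access."""
    results = []
    registers = [None] * len(stages)            # registers[k]: datum inside cell k
    padded = list(stream) + [None] * len(stages)  # extra cycles to flush the pipe
    for x in padded:
        # One machine cycle: every cell works concurrently on its own datum.
        out = registers[-1]
        for k in range(len(stages) - 1, 0, -1):
            registers[k] = None if registers[k - 1] is None else stages[k](registers[k - 1])
        registers[0] = None if x is None else stages[0](x)
        if out is not None:
            results.append(out)
    return results

# Two placeholder cells: add one, then double. Three data flow through in
# pipeline fashion: [(1+1)*2, (2+1)*2, (3+1)*2] == [4, 6, 8].
print(systolic_pipeline([lambda v: v + 1, lambda v: v * 2], [1, 2, 3]))
```

In a real systolic array the "stages" would be multiply-accumulate cells operating on matrix elements, as described for the backsubstitution process later in this document.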
The applicant has considered techniques and arrangements for performing various matrix operations which are common to the finite element analysis method, among others, and utilizing a parallel processor to enhance the speed of computation. This is accomplished largely by arranging the solution so that it is carried out as multiple, iterative and substantially identical steps performed in parallel by a series of parallel processors.
For example, a method has been devised for storing a large, sparse matrix in a multiple processor architecture in a manner which makes the elements of the matrix readily available for a variety of matrix operations. One such operation, used in solving systems of linear equations, is backsubstitution, but others would suggest themselves to those skilled in the art.
The solution to linear equations generated as a result of embodiments of this invention may be implemented on a variety of parallel multiprocessor architectures, as alluded to hereinbefore. However, for the sake of illustration, the solution of equations developed as a result of the methods disclosed herein will be specifically described in connection with the systolic array architecture as generally described in the paper by E. Arnould et al. entitled "A Systolic Array Computer" presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing, March 26-29, 1985, pp. 232-235. Methods embodying the invention may be carried out on apparatus shown and described in Patent No. 4,493,048 entitled "Systolic Array Apparatuses for Matrix Computations" issued in the name of H.T. Kung et al., the disclosure of which is hereby incorporated by reference.
A disadvantage of one type of apparatus and method considered by the applicant for solving systems of linear equations by the backsubstitution or forward elimination methods results from data dependencies which force sequential, rather than concurrent or parallel, calculations in solving for the unknowns.
The present invention provides apparatus and machine-implemented methods for analyzing physical systems represented by systems of linear equations.
Embodiments of the invention enable a multiprocessor to be operated in a manner which allows the more efficient analysis of manufactured products and physical systems represented by systems of linear equations.
Methods and apparatus embodying the invention may increase the speed and efficiency of solution of systems of linear equations on parallel processors, in particular on parallel, pipeline processors.
An embodiment of the invention provides a method for processing large, sparse matrices on a systolic array computer system which supports a high degree of concurrency, yet employs only simple, regular communication and control to enable efficient implementation.
In one aspect, the invention provides a methodology for improving the speed of computing solutions to systems of linear equations by using backsubstitution and forward elimination techniques which operate on a triangular decomposed matrix which has been "bandwidth maximized", i.e. transformed to provide a band of zero elements of preselected minimum width separating the non-zero elements on the main diagonal from non-zero elements in the upper right or lower left corners of the matrix.
In another aspect, the invention provides a technique for renumbering the nodes of a finite element mesh employed in carrying out the finite element analysis method such that a decomposed matrix characterized by the above-noted band is generated.
In a further aspect, the invention provides a parallel processor and accompanying methodology for operating the parallel processor to solve a system of linear equations characterized by a system matrix of the above-noted type in a new and more efficient manner.
In one embodiment of the invention, methods and techniques for carrying out the finite element analysis method to analyze a physical system include the step of generating a triangular decomposed matrix for the system in a form having a band of zero valued elements or data of predetermined minimum bandwidth adjacent its main diagonal, and solving the system of equations associated with such a matrix on a parallel processor which operates on the data elements of the matrix stored across the memories of a plurality of processors. Because of the presence of the band of zero-valued data in the decomposed matrix as described above, calculations for unknown vector components of the system may be carried out in a new and highly concurrent and efficient manner, with the calculation of values for a plurality of components of an unknown vector taking place by backsubstitution, independent of the prior solution of other values of components for the unknown vector. Having removed the need to delay the backsubstitution process for a given vector component until earlier calculations for other components of the vector have been completed, the overall speed of solution for the system is substantially increased.
An embodiment of the present invention, given by way of example, will now be described with reference to the accompanying drawings, in which:
Figure 1 illustrates a basic systolic architecture which is utilized in accordance with one method of the invention to process large, sparse matrices; Figure 2 is a flowchart illustrating the method for storing a large, sparse matrix into memory of a parallel multiprocessor; Figure 3 illustrates a method for mapping a specific matrix, exemplary of a large, sparse, bandwidth maximized matrix, into the parallel multiprocessor array according to one implementation of the invention;
Figure 4 illustrates a method for carrying out the backsubstitution procedure in solving a triangularized system of linear equations, which system is characterized in part by the specific bandwidth maximized decomposition matrix stored in accordance with Figure 3 in a parallel multiprocessor system; Figures 4A-4M illustrate the flow of data through the multiprocessor of Figure 4 during successive machine cycles in carrying out the backsubstitution process on the exemplary stored, bandwidth maximized decomposition matrix of Figure 4; Fig. 5A illustrates the sparsity structure of a bandwidth minimized decomposition matrix according to the prior art; and
Fig. 5B illustrates the sparsity structure of a bandwidth maximized decomposition matrix according to the invention.
As noted above, an aspect of the invention relates to the finite element analysis method used to analyze physical systems. In such systems, a field variable (which may be pressure, temperature, displacement, stress or some other quantity) possesses infinitely many component values because it is a function of each geometric point in the body or solution area under investigation. In the first step of the method the problem is discretized into one involving a finite (albeit large) number of unknowns by dividing the solution area into elements and by expressing the unknown field variable in terms of approximating functions within each element. Thus, the complex problem reduces to considering a series of greatly simplified problems.
The approximating functions are defined in terms of unknown values of the field variables at specified points called nodes. Then, matrix equations are written which express the properties of the individual elements. In general, each node is characterized by a linear equation(s) which relates a resultant or forcing variable R to a field variable Y at given nodes and a "stiffness" constant K. More specifically, for a solution area under investigation having a large number of such nodes, the resultant or forcing vector variable [R] having components [R1, R2, ..., RN] (which are known at boundary nodes or have a value of zero) is expressed in terms of a vector field variable [Y] having components [Y1, Y2, ..., YN] times the stiffness matrix [K], composed of constants or coefficients of the field variables at the nodes, by the matrix equation

[K] [Y] = [R]

For example, in analyzing a linear spring system, the resultant variable components represented by R1, R2 may be values of force applied to the system at selected nodes; the field variable components may be the displacement values at the nodes, and the constants may be spring stiffness values of the spring elements being studied which relate force to displacement at the nodes. The constants form a coefficient or stiffness matrix [K].
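The stiffness relation just described can be made concrete with a small sketch. The example below is hypothetical (two unit springs in a chain, fixed at the left end; the values and helper names are not from the patent): it assembles the global stiffness matrix [K] for the chain and solves [K] [Y] = [R] for the nodal displacements.

```python
def assemble_spring_chain(k):
    """Global stiffness matrix for springs in series, left end fixed.

    k[i] is the stiffness of the spring between node i-1 and node i
    (node -1 being the fixed wall); unknowns are the free-node displacements."""
    n = len(k)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        K[i][i] += k[i]              # spring i pulls on node i
        if i + 1 < n:                # spring i+1 couples nodes i and i+1
            K[i][i] += k[i + 1]
            K[i][i + 1] -= k[i + 1]
            K[i + 1][i] -= k[i + 1]
    return K

def solve2(K, R):
    """Direct 2x2 solve by Cramer's rule (enough for this toy example)."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return [(R[0] * K[1][1] - R[1] * K[0][1]) / det,
            (R[1] * K[0][0] - R[0] * K[1][0]) / det]

K = assemble_spring_chain([1.0, 1.0])   # K == [[2.0, -1.0], [-1.0, 1.0]]
Y = solve2(K, [0.0, 1.0])               # unit force R at the end node
# Y == [1.0, 2.0]: the end node moves twice as far as the middle node
```

In a real finite element analysis [K] is assembled the same way, element by element, but is far larger and sparse, which motivates the solution techniques described below.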
The next step in the finite element analysis method is to solve the above matrix equation. The machine techniques for manipulating a matrix equation to automatically arrive at a solution rapidly and efficiently for systems of equations representing complex physical systems being analyzed by the finite element analysis method is an important aspect of this invention. A more detailed explanation of the principles of the finite element analysis method may be found in the texts entitled "The Finite Element Method for Engineers" by K.H. Huebner and "Finite Element Procedures in Engineering Analysis" by Klaus-Jurgen Bathe.
Many techniques or methods are available to solve the above noted system of equations. Such techniques are quite often complex in nature and involve the use of decompositions and transformations of the system equations which are derived from the original system equations and are either equivalent to or approximations thereof.
Thus, for example, the original system equations [K] [Y] = [R] may be transformed into the form [A] [X] = [Q], where [A] is a known decomposed matrix of dimension N by N, [Q] is a known transformed N-vector and [X] is an unknown transformed N-vector.
The purpose of such transformations and decomposition of the original system of equations is to permit the application of a variety of known matrix techniques on the transformed system in order to arrive at a final solution.
Backsubstitution and forward elimination are two such numerical techniques which are utilized in solving system equations derived from using the finite element analysis method.
The backsubstitution technique is used where the decomposed or transformed matrix [A] is in triangular (either upper or lower) form with all diagonal elements of the matrix non-zero. It may be used once to arrive at a definitive final result or it may be used iteratively as part of more complex approximating techniques to arrive at a final solution. Techniques for transforming a linear system of the general form to a system of the triangular form are well known. Such a triangular (upper) system has the form:

A11 X1 + A12 X2 + ... + A1N XN = Q1
         A22 X2 + ... + A2N XN = Q2
                  ...
A(N-1),(N-1) X(N-1) + A(N-1),N XN = Q(N-1)
                         ANN XN = QN

To solve this system of equations by backsubstitution, it is first noted that the last equation can be solved for XN immediately, since ANN (a diagonal element of the stiffness matrix) and QN (a component of the forcing vector variable) are known. Knowing XN, the second last equation can be solved, since only one unknown exists, namely X(N-1). Specifically,

X(N-1) = (Q(N-1) - A(N-1),N XN) / A(N-1),(N-1)

With XN and X(N-1) known, the third last equation may be solved, since it contains only one true unknown, namely X(N-2). Thus, generally, for i = N, N-1, ..., 1:

Xi = (Qi - sum over j = i+1 to N of Aij Xj) / Aii

It should be noted that when i = N, the summation runs from j = N+1 to N, which is interpreted as the sum over no terms and gives, by convention, the value 0.
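The recurrence above can be stated compactly in code. The following is a minimal serial sketch (illustrative only, not the patent's parallel implementation); note that the loop is inherently sequential, since X[i] uses every later X[j].

```python
def back_substitute(A, Q):
    """Solve an upper-triangular system A X = Q by backsubstitution.

    A is N x N with non-zero diagonal; components are solved from the
    last row upward. Each X[i] depends on all previously computed X[j],
    j > i, which is the data dependency discussed in the text."""
    N = len(Q)
    X = [0.0] * N
    for i in range(N - 1, -1, -1):
        # The sum over j = i+1 .. N-1 is empty when i == N-1
        # (the "sum over no terms" convention noted above).
        s = sum(A[i][j] * X[j] for j in range(i + 1, N))
        X[i] = (Q[i] - s) / A[i][i]
    return X

A = [[2.0, 1.0, 1.0],
     [0.0, 3.0, 1.0],
     [0.0, 0.0, 4.0]]
Q = [7.0, 9.0, 12.0]
print(back_substitute(A, Q))   # [1.0, 2.0, 3.0]
```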
Several characteristics of the backsubstitution process offer the potential for highly concurrent processing, which will be implemented by the multiprocessing system described herein. The first is that XN, once computed, must be multiplied by each element of the Nth column of the matrix during the process of solving for the components of [X]. Thus, in calculating X(N-1), XN is multiplied by A(N-1),N (the coefficient from the N-1 row and N column). Likewise, in calculating X(N-2), ..., X1, XN is multiplied respectively by A(N-2),N, ..., A1,N (the remaining coefficients in the Nth column). In a similar manner, the term X(N-1), once computed, is multiplied by each coefficient of the (N-1)th column.
An apparatus and method for efficiently carrying out the above noted process in a parallel processor is disclosed in the aforementioned application Ser. No. 870,566. However, the apparatus and method disclosed in the above noted application is inherently limited in that data dependencies in the required operations limit the available parallelism in solving the finite element equations and thus slow the throughput of the multiprocessor architecture for this application. Specifically, the process of obtaining X(N-1), X(N-2), etc., is inherently sequential since, from the above equation, Xi cannot be computed until all the previous X(i+1), X(i+2), ..., XN have been obtained, their associated columnwise multiplications performed and the partial results accumulated.
When the physical system being analyzed is characterized by a system equation which is large and sparse (having a large percentage of zero coefficients), the use of decompositions and transformations which are banded has become quite common.
Bandedness refers to the confinement of non-zero terms in specific portions or bands of the decomposition matrix and may be better understood by reference to Figures 5A and 5B. In Figures 5A and 5B the coefficient structures or footprints of two matrices are illustrated, with solid diagonal lines representing non-zero elements; cross-hatched areas representing areas containing both zero and non-zero data elements; and white or black areas representing only zero elements. It is thus seen that in Figure 5A the non-zero elements of the matrix are confined (along with a certain number of zero elements) to a band of width b on either side of the diagonal. Only zero elements are located in the upper right and lower left corners of the matrix. This type of bandedness is referred to as "bandwidth minimization" since non-zero terms are confined to small bands adjacent the diagonal. As is readily apparent to one skilled in the art, such a bandwidth minimized matrix reduces the number of calculations required to perform the backsubstitution process, since a known number of calculations (those involving all elements in the blank areas) will inevitably result in zero values. However, bandwidth minimization does not remove the data dependency problem noted above, since calculation of the unknown components of the field variable must nevertheless proceed in sequence.
The formation of a bandwidth minimized system of equations and its corresponding minimized decomposition matrix is achieved during the finite element analysis method as a result of known techniques for numbering or renumbering the nodes in the finite element system mesh in such a way that the differences between the numbers assigned to adjacent nodes in the finite element mesh are minimized. The techniques (including appropriate programs) for nodal numbering in order to achieve bandwidth minimization as currently practiced are discussed in detail in the following publications: R. Rosen, "Matrix Bandwidth Minimization", Proceedings, National Conference A.C.M., 1968, pp. 585-595; E.H. Cuthill and J.M. McKee, "Reducing the Bandwidth of Sparse Symmetric Matrices", Proceedings, National Conference A.C.M., 1969, pp. 157-172.
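The effect of nodal numbering on bandwidth can be shown with a small sketch (a hypothetical 2 x 3 grid mesh, not an example from the cited publications): the half-bandwidth induced by a numbering is the largest difference between the numbers assigned to any two adjacent nodes, so renumbering across the grid's short axis tightens the band.

```python
def bandwidth(adjacency, numbering):
    """Half-bandwidth induced by a node numbering: the largest difference
    between the numbers assigned to nodes joined by a mesh edge."""
    return max(abs(numbering[a] - numbering[b]) for a, b in adjacency)

# A 6-node mesh laid out as a 2 x 3 grid; edges list the nodal adjacencies.
#   0 -- 1 -- 2
#   |    |    |
#   3 -- 4 -- 5
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]

row_major = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}  # numbered along the long axis
col_major = {0: 0, 1: 2, 2: 4, 3: 1, 4: 3, 5: 5}  # numbered across the short axis

print(bandwidth(edges, row_major))  # 3: vertical neighbours differ by 3
print(bandwidth(edges, col_major))  # 2: adjacent nodes get closer numbers
```

The actual algorithms in the cited publications (e.g. Cuthill-McKee) search systematically for numberings that minimize this quantity on much larger meshes; this sketch only illustrates what is being minimized.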
In order to remove the above noted data dependencies in solving finite element equations and to permit a more efficient solution of such equations on parallel processors, there is disclosed herein a technique for "bandwidth maximization", i.e., the use of a decomposed matrix having a coefficient structure as shown in Fig. 5B. Such a matrix (and its corresponding system of equations) is characterized by only non-zero data on its main diagonal; a band of only zero data elements adjacent the diagonal; and all other data (including all other non-zero data) clustered in the upper right and lower left corners of the matrix. The production of a matrix having a bandwidth maximized footprint is easily achieved by numbering or renumbering the nodes of the finite element mesh such that adjacent nodes differ by at least some predetermined minimum value K. Thus, only zero data elements are present within the band of width K on either side of the diagonal of the matrix.
When a system of equations characterized by a decomposed matrix of the form shown in Fig. 5B is solved using the backsubstitution method on a parallel processor system operating in accordance with the methodology described in greater detail hereinbelow, efficiencies are obtained, as compared to prior art methods and apparatus. These efficiencies result from the alteration of the data dependencies when carrying out backsubstitution or forward elimination on a parallel architecture in accordance with the disclosed methods of operation.
Backsubstitution for the solution of the first K components of the unknown field variable now becomes

XN = QN / ANN
X(N-1) = Q(N-1) / A(N-1),(N-1)

and, similarly,

X(N-2) = Q(N-2) / A(N-2),(N-2), etc.

This is because A(N-1),N, A(N-2),N and A(N-2),(N-1) are all zero owing to the generalized bandwidth maximized form of the solution matrix, as shown in Fig. 5B. In general, the backsubstitution, where a bandwidth maximized matrix is employed, becomes: for i = N, N-1, ..., 1,

Xi = (Qi - sum over j = i+K+1 to N of Aij Xj) / Aii

Thus, any given Xi may be calculated independently of the previous K Xj's, such that K may be chosen so that it corresponds to the maximum amount of time that it takes to get all the data together that a given Xi depends on, as will be illustrated in greater detail with respect to the operation of the multiprocessor of Fig. 4 in carrying out a backsubstitution process using an exemplary bandwidth maximized matrix hereinbelow.
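A minimal sketch of this modified recurrence follows (illustrative data, assuming an upper triangular matrix whose K entries to the right of each diagonal element are zero). Because the inner sum skips the K entries nearest the diagonal, X[i] never depends on the K most recently computed components, which is what permits K solutions to proceed concurrently on a parallel processor.

```python
def back_substitute_banded(A, Q, K):
    """Backsubstitution for an upper triangular matrix with a zero band of
    width K beside the diagonal: the inner sum starts at j = i + K + 1,
    so X[i] is independent of X[i+1] .. X[i+K]."""
    N = len(Q)
    X = [0.0] * N
    for i in range(N - 1, -1, -1):
        s = sum(A[i][j] * X[j] for j in range(i + K + 1, N))
        X[i] = (Q[i] - s) / A[i][i]
    return X

# A 4 x 4 example with K = 1: every element immediately right of the
# diagonal is zero; remaining non-zeros sit toward the upper right corner.
A = [[2.0, 0.0, 1.0, 1.0],
     [0.0, 2.0, 0.0, 1.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 2.0]]
Q = [4.0, 3.0, 2.0, 2.0]
print(back_substitute_banded(A, Q, 1))   # [1.0, 1.0, 1.0, 1.0]
# Note X[3] = Q[3]/A[3][3] and X[2] = Q[2]/A[2][2] are computable
# immediately and independently, exactly as described above.
```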
A brief review of matrix algebra as a computational tool for solving systems of linear equations may be useful as a background to understanding the invention.
A matrix in its most usual form is a rectangular array of scalar quantities consisting of m rows arranged orthogonally with respect to n columns. The order of a matrix is defined by its number of rows times its number of columns and, accordingly, the matrix shown immediately below is referred to as an m x n matrix. The usual compact method of representing a matrix is shown below.

[A] = Amn = | A11  A12  ...  A1n |
            | A21  A22  ...  A2n |
            | ...                |
            | Am1  Am2  ...  Amn |

A row of a matrix is a horizontal line or one dimensional array of quantities, while a column is a vertical line or one dimensional array of quantities. The quantities A11, A12, etc. are said to be the elements of matrix [A]. A matrix in which n=1 is a column matrix or column vector, and a matrix in which m=1 is a row matrix or row vector. Even more generally, a vector is an ordered n-tuple of values. A line in a matrix is either a row or a column.
A square matrix is one in which the number of rows equals the number of columns. The diagonal of a square matrix consists of the elements A11, A22, A33, ..., ANN. A triangular matrix is a matrix which contains all its non-zero elements in and either above or below its main diagonal. An upper triangular matrix has all its non-zero elements in and above its main diagonal; a lower triangular matrix has all of its non-zero elements in and below its main diagonal.
In the operation [A] x [X] = [Q], where [A] is of order m x n and [X] is of order n x p, [Q] is of order m x p. Thus, where [X] is a column vector, [Q] will likewise be a column vector, having the same number of rows m as the number of rows in [A]. It should be noted that matrix multiplication is defined only where the matrix [X] in the above example has the same number of rows as the number of columns in matrix [A].
In the above multiplication, the product [Q] is obtained by multiplying the columns of [X] by the rows of [A] so that, in general,

Qij = sum over k = 1 to n of Aik Xkj

According to the above, the multiplication of a matrix [A] by a multiplication vector [X] normally proceeds as noted below:

| A11  A12  A13 |   | X1 |   | Q1 |
| A21  A22  A23 | x | X2 | = | Q2 |
| A31  A32  A33 |   | X3 |   | Q3 |

Q1 = A11 X1 + A12 X2 + A13 X3
Q2 = A21 X1 + A22 X2 + A23 X3
Q3 = A31 X1 + A32 X2 + A33 X3

From the perspective of processing time required, it would be most efficient to do all the multiplications involving each specific vector element at the same time, resulting in fewer intermediate memory accesses.
Thus, it would be more efficient from an input/output standpoint to perform the above multiplication by performing all the operations A11 X1, A21 X1 and A31 X1 concurrently or in a parallel manner. However, this procedure results in a scattering of the various contributing elements of [Q], which must later be gathered. A second characteristic of the above process is that each element Q1, Q2 and Q3 of the product vector is the result of an accumulation or summing based on the row of origin (or row index) of the matrix element. Specifically referring to the above example, it is noted that Q1 is the sum of partial results derived by multiplying different elements from row 1 of the matrix by associated or corresponding elements of vector [X].
Thus, a chief observation which is utilized in the methodology herein is that in order to enhance concurrency of operation and to minimize input/output requirements for the above multiplication, it should proceed temporally as:

1st operation: X1 x [A11, A21, A31]
2nd operation: X2 x [A12, A22, A32]
3rd operation: X3 x [A13, A23, A33]

the above to be followed by a summation of partial products based on the row index of the matrix elements involved in those partial products, as follows:

Q1 = A11 X1 + A12 X2 + A13 X3
Q2 = A21 X1 + A22 X2 + A23 X3
Q3 = A31 X1 + A32 X2 + A33 X3

The more general observation which follows from the above is that many matrix operations may be performed in a highly parallel manner by performing a first common operation across a series of different elements originating from one line of a matrix, followed by a second operation based on the origin of those elements in the matrix; matrix multiplication is but one of those operations. Even more specifically, it is observed from the above that the first operation in the multiplication process is to multiply the first element of the multiplication vector by a second vector containing the elements A11, A21 and A31, the elements of the first matrix column. Thus, the multiplication proceeds by multiplying each element of the multiplication vector by a transpose vector consisting of the elements of a column of the matrix. Vector element X2 is likewise multiplied by the transpose of the second column of the matrix containing elements A12, A22 and A32. The results of each such multiplication comprise partial results which, if accumulated based on the row of origin of the stored matrix element, will result in the resultant vector. This is seen from the prior discussion in which Q1 is generated by the accumulation of partial results based on row origin, i.e., A11 X1, A12 X2, A13 X3. The same is true for the remaining elements of the resultant vector.
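The column-ordered schedule above can be written directly: each step scales one column of [A] (the transpose vector) by its common operand X[j], and the partial products are accumulated by row index. A minimal serial sketch (the distribution of this work across processing cells is the subject of the storage procedure described later):

```python
def matvec_by_columns(A, X):
    """Matrix-vector product accumulated column by column: each X[j] scales
    the whole j-th column of A, and partial products are summed by their
    row of origin i -- the order of operations described in the text."""
    m, n = len(A), len(X)
    Q = [0.0] * m
    for j in range(n):        # one common operand X[j] per step
        for i in range(m):    # each partial product carries its row index i
            Q[i] += A[i][j] * X[j]
    return Q

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
X = [1.0, 10.0]
print(matvec_by_columns(A, X))   # [21.0, 43.0, 65.0]
# Identical to the usual row-wise dot products, but every multiplication in
# a given step shares the same operand X[j].
```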
Having recognized that the above matrix multiplication, among others, may proceed concurrently as noted above, it remains to add the following observation which is applicable to large, sparse matrices encountered in finite element analysis. For such large, sparse matrices, zero or null elements do not contribute to the final result, so they may be discarded or passed over in a procedure for mapping the matrix into storage. While zero or null elements are always passed over or discarded in mapping the matrix to be manipulated into a parallel processor, it may be equally possible to pass over or disregard other elements which, while not exactly zero or null, are computationally insignificant. For example, elements falling within a range adjacent to zero could be ignored once the determination has been made that their inclusion does not change the final results of the operations. Thus, the terms zero and null should be interpreted in this sense.
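A sketch of this discard rule (with a hypothetical tolerance value): each kept element is paired with its row index so its origin survives the compression, and a tolerance above zero additionally drops computationally insignificant entries in the sense just described.

```python
def compress_column(column, tol=0.0):
    """Keep only the significant entries of one matrix column, pairing each
    kept value with its row index so the row of origin is preserved.
    tol > 0 also discards entries within tol of zero."""
    return [(row, v) for row, v in enumerate(column) if abs(v) > tol]

col = [0.0, 5.0, 0.0, 1e-12, -3.0]
print(compress_column(col))         # [(1, 5.0), (3, 1e-12), (4, -3.0)]
print(compress_column(col, 1e-9))   # [(1, 5.0), (4, -3.0)]
```

Whether a non-zero tolerance is safe depends on the problem; as the text notes, it must first be determined that dropping such entries does not change the final results.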
0 5 is ..:avinc cpera::i--ns tne Potential fsr iein= a nignly manner, it remains t- provide a zecnnic:ue for mapp, ng a matrix onto a Parallel processor - n a manner t-- allow Ot)t..-4&izazion of the narallel -,,-ocessinz. Sucn a Process will now be descrIted witn recard to a parallel processing system comph-Ising a network of processing cells, each with izs own associared memory and intercoupled to communicate with each other, generally as shown in Figure 1.
Referring to Figure 1, there is shown a basic systolic system in which a host 10 typically receives data and outputs results to the systolic array of pipelined processing cells 15 via an interface unit 12. The host, of course, may comprise a computer, a memory, a real time device, etc., and while, as shown in Figure 1, both input and output are coupled to the host, input may be from one physical host while output is directed to another. The essential goal of the systolic architecture is to perform multiple computations for each input/output access of the array with the host. These multiple computations are accomplished in parallel by arranging for the cells to continue processing of a received input: while one cell is receiving and processing a new datum, the other cells continue to operate on data or partial products of data received in a previous input access. The challenge of using a systolic array is to decompose the problem being solved into substeps which can be worked on in parallel by the various cells of the system in a rhythmic way.
A procedure for loading and storing a large, sparse matrix into a series of parallel processing cells will now be described with reference to the process flow diagram of Figure 2 in conjunction with the systolic array architecture of Figure 1.
It should also be understood that the loading and storing of a matrix is accomplished under the control of the interface unit 12 of Figure 1, which communicates during such process with the host 10, which already has the matrix stored as an array or other suitable data structure.
The object of the mapping procedure of Figure 2 is, as alluded to hereinbefore, to store the matrix in the memory of the multiprocessor in a manner to allow a high degree of concurrency in doing matrix operations generally and the backsubstitution operation in particular. This is accomplished by storing the elements of one line of the matrix in a manner which permits concurrent operation by a common operand. More specifically, and as will be described hereinafter, in a backsubstitution operation the elements of a matrix column are stored for successive or concurrent multiplication by a common operand which is an associated and previously calculated component of the field variable vector being solved for.
A second feature of the storage process is that it generates an index for association with each stored matrix element which identifies the origin (relative to another line of the matrix) of each element. Specifically, each element from a column of the matrix has associated with it an index which identifies the row of the matrix from which it came. Moreover, the stored elements from each column are formed into transpose vectors, each of which may be associated with a common operand to provide regularity and concurrency of operation. Thus, several transpose vectors are formed, each being located in a given memory location in a plurality of cells, so that a common operand may operate concurrently on them. The origin and character of the common operand will depend in large measure on the particular matrix operation being carried out. In addition, null, zero and computationally insignificant elements are discarded in the storage process so that operations involving the stored elements may be done more rapidly. Finally, diagonal elements contained in any given transpose vector are collected for storage into a single processor cell so that they are readily available for regular and repetitive handling during certain matrix operations, the backsubstitution operation being one such matrix operation.
The above objectives are met by transposing the elements of the matrix columns into transpose vectors configured across the processing cells. Since zero elements of the matrix are passed over during the storage process, the number of processing cells needed to accomplish the above storage procedure is quite reasonable. Many physical systems are represented by matrices on the order of 25,000 x 25,000 elements in which the number of non-zero elements in any given column is on the order of twenty or less. Thus, as a result of the above storage operation, each non-zero element of a column of a matrix becomes an element of a transpose vector or array stored in the same memory location of each processing cell. In addition, all diagonal members are stored, for easy access, in a single processor cell. With the non-zero elements of the columns accessible concurrently by this procedure, the common operands may be made to operate on the transpose vectors in a highly parallel manner, as will be explained in greater detail hereinafter.
A method for mapping a matrix into a multiprocessor array to accomplish the above noted objectives will now be described with reference to Figure 2. First, starting at step 200, the constant C is set to equal the total number of processing elements in the parallel processor array to be utilized. The number of processing cells may vary, but maximum efficiency results when the number is approximately the same as the maximum number of non-zero elements in any single column of the large, sparse matrix to be mapped. The procedure variables "Row" (row index for the matrix element being processed) and "Col" (column index for the matrix element being processed) are both set to equal 1, so that processing starts at the matrix element in the first row, first column and proceeds from there. The variable "Proc" (the next processing cell to be loaded with a non-zero, non-diagonal element) is set equal to 2; "Proc" may, of course, vary between 1 and C but cell 1 has been earmarked to contain only the diagonal elements of the matrix, as will be described in more detail hereinafter. The "MemoryL" variable (the memory location in the "Proc" being loaded) is set to "Mem", the first memory location to be filled in each of the processor cells. The host also sets or has already set from an earlier process the variables M (the number of rows in the matrix) and N (the number of columns in the matrix). Finally, at step 200 a table called Table[Col], which is set up in the host, is initialized to all 1's in a preliminary step to correlate one column of the matrix (a specific transpose vector to be created) with a specific operand to be used subsequently in a matrix operation, as will be explained in greater detail below.
Next, at step 201 the matrix element A[Row][Col] (which at the beginning of the procedure is matrix element A[1][1]) is tested to determine if it is zero.
If it is zero, the element is not to be loaded into any processor cell and will be disregarded. The process moves, therefore, to steps 202 and 203, at which Row is incremented, and the value of Row is tested to determine whether it is greater than M. If it is not, the process returns to step 201 to test a new matrix element.
If Row is greater than M, a complete column of the matrix has been tested and, at step 205, zeros are stored in any remaining processor cell(s) from Proc to C at memory location(s) Mem. The purpose of this step, as alluded to earlier, is to permit each specific memory location in each processor cell to contain stored elements associated with one and only one column of the matrix. In this manner, the elements associated with a given column will define a transpose vector (each stored at a known memory location across a plurality of cells) and a plurality of such transpose vectors (one for each column of the matrix) will be created.
After all processing cells at a particular memory location have been filled, the process continues scanning the matrix starting at the top of the next column. Thus, at step 207, Row is reset equal to 1 and Col is incremented in preparation for testing matrix elements in a new column of the matrix. However, if in step 209 Col is greater than N (the number of columns in the matrix), then the process of loading the matrix ends (step 210). If Col is not greater than N, then the process reverts to step 201, at which the new matrix element is tested.
If the test at step 201 indicates that the matrix element being tested is not zero, the element is tested at step 230 to determine if it is a diagonal element, i.e., if Row=Col. If it is a diagonal, the element is stored at Proc=1 (processor cell 1) and an index is generated to accompany the stored diagonal indicating the row of the matrix from which it originated. After storage of a diagonal in cell 1, the procedure moves to steps 202 and beyond in a manner similar to that described above.
If the matrix element tested at step 230 is not a diagonal element, it is stored in the processor cell Proc at the memory location Mem, step 220. Then, an index equal to Row is generated and stored in association with the newly stored element at the same memory location Mem, step 221. Having stored a non-zero element, the process increments Proc, step 225, so that the next non-zero element will be stored in the next processor cell, then tests Proc at step 227 to determine if it is greater than C (the total number of processor cells); a greater value indicates that all of the Mem locations
in the processor cells have been filled and, at step 228, Proc is reset to 2 and Mem is incremented to a new value. In addition, the entry at Table[Col] in the previously created table is incremented so that during later processing, the host can correlate each column of the matrix with a given length of a transpose vector. This entry permits the host or interface unit to associate the non-zero elements corresponding to this current Col with a transpose vector covering more than one memory location (the exact number being contained in the table). If multiple memory locations are required, the host will take appropriate action later in the matrix operation procedure to insure that the corresponding operands are fed into the processors the correct number of times. Then the process goes on to step 202 as previously discussed.
If step 227 determines that Proc is not greater than C, the process goes directly to step 202, since additional memory locations Mem exist in processing cells. The process continues from step 202 as previously explained.
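The procedure of Figure 2 can be sketched in software, under stated assumptions: cell memories are modelled as lists of (value, row_index) pairs, index 0 of the `cells` list plays the role of "cell 1" (the diagonal store), and the function and variable names are illustrative rather than taken from the patent. This is an outline of the control flow, not a definitive implementation.

```python
# Sketch of the mapping procedure: diagonals go to the first cell,
# off-diagonal non-zeros of each column are dealt round-robin to the
# remaining cells, zeros are passed over, each column's transpose
# vector is padded with zeros, and table[col] records how many memory
# locations the column's transpose vector spans (cf. Table[Col]).

def map_matrix(A, C):
    """Map an N x N matrix into C processor-cell memories."""
    n = len(A)
    cells = [[] for _ in range(C)]
    table = [1] * n
    depth = 0                              # memory locations filled so far
    for col in range(n):
        proc = 1                           # cell 0 is reserved for diagonals
        for row in range(n):
            v = A[row][col]
            if v == 0:
                continue                   # zeros are passed over entirely
            if row == col:
                cells[0].append((v, row))  # diagonal, with row-of-origin index
            else:
                cells[proc].append((v, row))
                proc += 1
                if proc == C:              # column overflows the cell array:
                    proc = 1               # wrap around and let the transpose
                    table[col] += 1        # vector span another memory location
        depth += table[col]
        for c in cells[1:]:                # pad with zeros so each memory
            c.extend([(0.0, -1)] * (depth - len(c)))  # location maps to one column
    return cells, table

A = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 4.0],
     [0.0, 0.0, 5.0]]
cells, table = map_matrix(A, C=3)
# cells[0] collects the diagonals with their row indices; the two
# off-diagonal non-zeros of the last column land in cells 1 and 2.
```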
Referring to Figure 3, an example of the operation of the procedure of Figure 2 for mapping a specific upper triangular decomposed matrix [A], which has been bandwidth maximized as explained herein (shown in the top portion of Figure 3), into the exemplary memory locations of processor cells 1-4 of a multiprocessor similar to that of Figure 1 will now be described. As can be seen, decomposed matrix [A] is of order 12 x 12 and is of the upper triangular form. Specifically, the matrix has only zero-value elements below the diagonal and all non-zero diagonal elements. In addition, all of the non-zero data in matrix [A] is clustered into a group in the upper right corner, and a band of width K of zero elements is located between the diagonal and that cluster in the upper right corner. The technique for creating such a bandwidth maximized matrix and selecting the value K will be described in greater detail below.

It should be kept in mind when reviewing this sample mapping that matrix [A] is meant to be exemplary of a bandwidth maximized upper triangular matrix of the large, sparse variety commonly encountered in characterizing physical systems by finite element methods. Such a matrix might typically be of the order of 25,000 x 25,000 with only about 0.1 percent of the elements in any row/column being non-zero.
Starting at step 200 of Figure 2, the variables and constants are initialized to conform to the specifics of the matrix [A] being mapped as follows: C=4; Row=1; Col=1; N=12; M=12; MemoryL=Mem. Proc is set to 2 in order to set aside the first processor cell for holding the diagonal elements, as will be explained in greater detail hereinafter. The first element, A[1][1], is tested at step 201. Since its value (A) is not equal to zero, it is next tested at step 230 to determine specifically if it is a diagonal element of the matrix. Since element A is a diagonal element, it is stored at location Mem of processor cell 1 along with an index 1 designating its origin from row 1 (step 231). The process then continues to step 202, at which Row is incremented. Then at step 203, since Row is less than M, the process returns to step 201 to examine the next element, A[2][1]. Since there are no other non-zero elements in the first column of the matrix, the process will move repeatedly through steps 201, 202, 203 until Row is incremented to be greater than 12, at which time the procedure will go from step 203 to step 205. At step 205 zero's will be stored in processor cells 2-4 at Mem to signify that column 1 of the matrix has no non-zero elements other than the diagonal element A, which has been stored in cell 1. The loading of these zero's into cells 2-4 at Mem is shown in Fig. 3.
Having completed the scanning and loading of column 1 of matrix [A], Col is incremented to 2 and Row is reset to 1 to begin the scanning of column 2, since Col is less than N (step 209).
Columns 2-7 are scanned in a manner identical to column 1 and, since each also has only one non-zero element, B-G respectively, and further, since these non-zeros are diagonal elements, they are all stored in cell 1 at Mem+1 to Mem+6; zero's are stored at Mem+1 to Mem+6 in cells 2-4.
The process now returns to scan the eighth column of the matrix at step 201, where A[1][8] is tested.
Since its value (H) is not zero, and since it is not a diagonal element (step 230), it is stored at step 220 in cell 2 at Mem+7. An index 1 is also stored at the same location in cell 2, indicating its origin in row 1 of the matrix. Next, Proc is incremented and tested at steps 225, 227 and, since Proc is less than C, Row is incremented and tested and a new element A[2][8] is processed. Since this element is zero, the process loops back again to step 201 to test A[3][8]. Since zero's are located in column 8 in rows 2-7, the process loops back successively while no values are stored. Since element A[8][8] is non-zero, and further since it is a diagonal, it is stored in cell 1 at Mem+7, as shown in Figure 3, in a manner identical to that in which earlier data elements were stored in cell 1. Since all remaining elements in column 8 are zero, cells 3 and 4 are filled with zero's at Mem+7 via step 205 in preparation for scanning column 9 of the matrix.
The process continues, as above, iteratively storing each non-zero diagonal element of the matrix in different memory locations in cell 1, storing non-zero elements of each column in cells 2-4 while passing over zero elements and, finally, filling remaining memory cells in any given memory location with zero's before beginning to scan a new column. The result is that all the non-zero elements of the first column of the matrix A are reformed into a transpose vector in the memory location identified as Mem across the processor cells. Furthermore, all diagonal elements included in such a transpose vector are located in a specific processor cell, i.e. cell 1. In effect, the memory location Mem serves as an identifier for a newly created transpose vector including all the non-zero elements of column 1 of the matrix, with the diagonal element stored in cell 1. In the case of column 1 of matrix [A] of Figure 3, only one element is contained therein and, since it is a diagonal, it is stored in cell 1; all other elements being zero-valued, the remaining Mem locations are filled with zero's, as shown in Figure 3. The importance of this newly created transpose vector Mem, with the special handling of diagonal elements, in permitting the concurrent processing of matrix elements is explained in the aforementioned application Ser. No. 870,566 and will be briefly repeated hereinafter.
As indicated earlier, the number of processing cells preferably is selected to equal the maximum number of non-zero elements in any given column of the matrix A, but it may be 2 or any number greater than 2 without departing from the spirit of the invention. However, regardless of the number of processor cells employed, it may happen that the number of non-zero elements in a particular column of the matrix exceeds the number of processor cells. To handle this situation, steps 227 and 228, as alluded to above, are included in the mapping process as illustrated in the flow chart of Figure 2. If, for example, the first column of the matrix [A] had more non-zero elements than processor cells available, i.e., if Proc > C (step 227) before all elements of a column have been stored, then Proc is reset to 2 and Mem is incremented to Mem+1 (step 228) and the transpose vector identified as "Mem" is continued or expanded into the Mem+1 memory location. The result of this is that column 1 of the matrix is transformed into a transpose vector located in memory locations Mem and Mem+1. In such a situation, the Table[Col] entry at column 1 would be "2" to indicate this.
Any remaining processor cells corresponding to a given memory location which are not filled with non-zero elements after testing all the elements from a given column of the matrix are filled with zeros in accordance with step 205. For example, in column 10 of matrix [A], there are only three non-zero elements, M, N and P, leaving Mem+9 of processor cell No. 4 to be filled with zero to complete the transpose vector at Mem+9. A similar situation is illustrated in Figure 3 by the filling of the Mem+10 memory location of processor cell No. 4 with a zero as a result of step 205, since only three non-zero elements (S, R, T) are present in column 11 of the matrix.
The function of that portion of the memory locations labeled Q in Figure 3 will be explained below with reference to the carrying out of a backsubstitution operation to solve a system of linear equations characterized by the matrix A of Figure 3.
Carrying out the backsubstitution operation on the decomposed system of equations characterized by the bandwidth maximized decomposed matrix A in an embodiment of the invention will now be described with reference to Figure 4. To briefly review, it should be understood that prior to the initiation of the backsubstitution operation, a system of equations which describes the product or physical system under investigation is generated of the form [K][Y] = [R]. As will be recalled from the discussion on pages 7 and 8 above, [Y] is a vector representing an unknown field variable which describes an attribute of the physical system under investigation, such as displacement for a spring system. The vector [Y] is composed of components Y1 ... YN for which the system of equations is to be solved. The vector [R], on the other hand, has components R1 ... RN which are known and represent a resultant vector variable, such as force for the spring system being analyzed. [K] is the stiffness matrix and includes a plurality of values or constants which relate the known resultant vector variable to the unknown field vector variable at specific nodes of the finite element mesh, as discussed earlier.
In operation of the example embodying the invention, the above noted system of equations is transformed or decomposed into another series of equations [A][X] = [Q], where [A] is in the form of a so-called bandwidth maximized matrix, as described above, with [Q] being a known vector and [X] being an unknown vector. With the system transformed into this new bandwidth maximized form, solution on a parallel processor of the invention may proceed in the form shown in Figure 4, utilizing (at least for some portions of the solution process) the backsubstitution or forward elimination technique carried out on a systolic array. The object of the solution process, therefore, is to calculate the values of X1 ... XN by the backsubstitution technique in its basic form and, as mentioned, [A] and [Q] are known. The decomposed matrix A is typically both large and sparse. Specifically, it may be of order 25,000 x 25,000 elements with the number of non-zero elements being only a very small percentage of the total number of elements.
It is also, obviously, of the upper triangular, bandwidth maximized form, i.e. a form in which all the diagonal elements of the matrix are non-zero; all other non-zero elements (including some zero elements) are clustered in the upper right corner of the matrix; and a band of width K containing only zero valued data is located above the diagonal, as illustrated in Fig. 5B. Matrix [A] of Figure 3 is just such an upper triangular bandwidth maximized matrix.

Having provided a decomposed matrix of an upper triangular bandwidth maximized form, the first step in the solution process is to load and store the matrix [A] as described above with reference to Figures 2 and 3. The result of the loading and storing process for the matrix [A] of Figure 3 is shown in Figure 4, wherein the memory 56 of cell 1 is shown storing the diagonal elements A ... Y in memory 58 at Mem to Mem+11, respectively, each stored diagonal value being accompanied by an index stored in memory portion 59 designating the row origin of its associated diagonal element. A partial result store 57 is also provided in cell 1
for storing calculated values X1 ... X12 of the vector X; its operation will be explained in greater detail below. In similar fashion, the memories 56 of cells 2 through 4 are shown storing the remaining non-zero values of the matrix, each stored non-zero element contained in memory portion 58 being accompanied by an index stored in portion 59 for identifying the row of the matrix from which its accompanying non-zero element originated. Each of the cells 2-4 also includes a memory store 57 for storing partial results accumulated in accordance with the stored index accompanying the stored matrix element, as will be described in greater detail below.
It should also be noted that the elements of each column of the matrix A were reordered during the mapping process to form what may be referred to as "transpose vectors" at each of the memory locations Mem to Mem+11, with the diagonal elements collected in cell 1. For example, column 9 of the matrix A is transformed into a transpose vector at memory location Mem+8, with the diagonal element L from row 9 of the matrix located in cell 1 and the remaining non-zero elements of column 9 (J, K) contained in cells 3 and 2, respectively, along with indices (1, 2), respectively, indicating the rows of the matrix from which they originated. As explained above, a zero data value is used to complete unfilled memory locations in each transpose vector remaining after all non-zero elements in its associated column have been stored.
Given the upper triangular matrix [A] of Figure 3, the system of equations to be solved for [X] is as follows:
(1) AX1 + HX8 + JX9 + ... = Q1
(2) BX2 + KX9 + ... = Q2
(3) CX3 + ... = Q3
(4) DX4 + ... = Q4
(5) EX5 + ... = Q5
(6) FX6 = Q6
(7) GX7 = Q7
(8) IX8 = Q8
(9) LX9 = Q9
(10) PX10 = Q10
(11) TX11 = Q11
(12) YX12 = Q12

(the ellipses in equations (1) to (5) stand for the remaining off-diagonal terms of the upper right cluster, involving M and N from column 10, S and R from column 11, and U, V and W from column 12). As will be recalled from the discussion on pages 10 and 11 above, to solve the above system of equations by the backsubstitution process, X12 is first calculated by solving equation (12). Likewise, in a similar fashion X11, X10, X9, X8, X7 and X6 may be solved immediately by a simple division of an associated known component of [Q] by a corresponding diagonal element T, P, L, I, G and F, respectively. It should be noted that herein lies the power of the illustrated system, in that several components of the unknown field variable [X] may be computed concurrently and without regard to any other values of the field variable. This permits a great increase in the speed of calculation on a parallel processor, as will be described in greater detail below.
This is also in obvious contrast to the method of operation disclosed in the prior art and in the aforementioned copending application in which, except for the calculation of a single component of the field variable, the remaining values must be calculated with the successor values already known. This data dependency problem greatly reduces the speed of solving such equations by backsubstitution on a parallel processor, as can be readily appreciated by one skilled in the art.
Having calculated the values of X12 - X6 by a simple division, as outlined above, these known values and the results of operations involving these known values are then made available for computing the remaining values for the unknown components of [X], i.e., X5 ... X1 in equations (5) to (1), respectively. It should be noted again at this point that the above system of equations is meant to be exemplary of a much larger and sparser system encountered in typical physical problems being studied using the finite element analysis method. The system of equations given above is used here for illustrative purposes only and to explain the advantages of the invention resulting from the reduction of data dependencies in the solution process by use of a bandwidth maximized solution matrix of the form shown in Fig. 5B.
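The independence of those first solutions can be illustrated with a small sketch. The numeric diagonal values and right-hand sides below are placeholders, not the lettered values of the patent's example; the point is only that each division depends on no other component.

```python
# Sketch: in the bandwidth-maximized system, equations whose rows
# contain only a diagonal entry are each solved by one independent
# division, so on a parallel machine they can all proceed concurrently.

diagonals = {12: 2.0, 11: 4.0, 10: 5.0, 9: 2.5, 8: 10.0, 7: 1.0, 6: 8.0}
q = {12: 6.0, 11: 8.0, 10: 10.0, 9: 5.0, 8: 20.0, 7: 7.0, 6: 16.0}

# Each X_i = Q_i / diagonal_i, with no dependence on any other X_j:
x = {i: q[i] / diagonals[i] for i in diagonals}
```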
The actual process for carrying out the solution of the above system of equations via the backsubstitution technique on a systolic multiprocessor system will now be explained in greater detail with reference to Figure 4 and Figures 4A-4L. The multiprocessor system consists of four processor cells, cells 1-4, each of which is assigned to perform specific operations contributing to the solution. Each of the operations being performed by a processor cell is illustrated by means of a functional unit in Figure 4. It should be kept in mind that the functional units referred to in Figure 4 are not meant to refer to different hardware components of a machine, but rather to different data processing operations capable of being performed by each cell of the system, as will be easily understood by those skilled in the art. The operations performed by each cell are designed to be regular and repetitive during each cycle of the machine, but performed on different operands presented to the inputs of each unit during successive machine cycles. The flow of information through the processor will be explained with reference to Figures 4A-4L, which note the flow of operands and results through the machine during its first 12 cycles in a typical backsubstitution process on the exemplary set of equations above.
The multiprocessor system of Figure 4 is seen to include a memory 56 associated with each of the cells 1-4. Memories 56 may be thought of as comprising a first array 58 which stores the values of the elements of the solution matrix A, as previously described. Each element stored in memory array 58 has associated with it an index stored in array 59 which designates the row of the matrix from which its associated stored value originated. Thus, referring to cell 1, matrix element A originated from row 1 of the matrix, matrix element L from row 9 of the matrix, etc. Storage in cell 1 also signifies that the stored element is a diagonal element based on the storage technique of Figure 2.
Each cell also contains another memory unit 57 which stores results from indexed calculations performed by the system, as will be explained in greater detail during the description of the calculation process below. The storage unit 57 of cell 1 operates as the store for the final component values X1 to X12 of the unknown field variable. The storage units 57 of cells 2, 3 and 4 store results according to indices and, instead of inputting and outputting values in a sequential manner, respond to indexed inputs to provide indexed outputs, i.e. stored values associated with those indices. To this end, memory units 57 of cells 2-4 provide two
For this reason, the output on line 57' is controlled in rom memory 59. A second by an index input on l-. e 591 E. outiput on line 57'' from each of the memories 57 of cells 2-4 is fed to adder 65. The indexed outputs on line 5711 represent accumulations Q of earlier performed calculations which are to be summed in carrying our the backsubszizuzion process. The indexed outputs on line 57'' are selected by concurrent outputs to each of the memories 57 of cells 2-4 from a counter 61 which counts down starting at index 11 during cycle 1 to index 1 in cycle l! as shown of the attached drawings in Figures 4A-4L.
The values stored in the memories 58 of cells 1-4 are fed to functional units in the remainder of the system associated,;lzh each cell. The values stored in memories 58 are fed' ilE:oT. the zo;)mosz or higher numbered addresses (as seen In Figure 4) to zit-,e finctional units 52 and 51 via lines 55', a staged manner. Bv this is meant that the value szored az Mem + 11 in store 58 of cell 1 is fed to divider 52 on a given machine cycle which is, respectively, 1, 2 and 3 cycles prior to when the corresponding values at the same memory locations in memories 58 of cells 2, 3 and 4 are fed to their associated multipliers 51 via lines -58 as will be explained in greater detail below. Cell one includes a subtract unit 60 and a divide unit 52. The subtract unit 60 is supplied with a first operand on line 65' by adder unit 65. The second input- to subtractor 60 is the value of one of the known vector k is -)5 components i n the c),,lysical system Q 'r he divide unit 52 divides the result of the subtraction process performed at unit 60 by an appropriate value from szo.-age unit 5B of cell 1, which is a selected diaconal elerr.ent of the stored matrix.
Each result leaving the divider 52 of cell 1, therefore, represents a solution of one component of the unknown field variable X in the system of equations being solved. These results, in accordance with the backsubstitution method described above, are stored for later readout in storage unit 57 of cell 1. In addition, each result of the division in unit 52 must be multiplied by the transpose vector representing the corresponding column of the matrix. For example, X12 must be multiplied by each non-zero element from column 12 of the matrix; the results of these operations must be collected and later summed according to row index. For this purpose, each result from unit 52 is fed in a pipeline fashion to the multiply units 51 of cells 2-4, at which an indexed result is computed. The outputs of each of the units 51 are accumulated according to row index in accumulator units 53, the results of which are again stored according to row index in storage units 57 for later outputting to adder 65 on line 57'' under the control of counter 61.
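A sequential sketch of the column-oriented computation that the cells of Figure 4 perform in parallel may clarify the dataflow: each solved component is multiplied by the non-zeros of its column (the transpose vector), and the products are accumulated by row index for later subtraction. The data layout and function name are assumptions for illustration, and the loop below runs serially where the systolic array would pipeline the work across cells.

```python
# Column-oriented backsubstitution: x[col] is solved by a subtract and
# a divide (cf. units 60 and 52), then its products with the column's
# off-diagonal non-zeros are accumulated by row index (cf. units 51
# and 53, and the row-indexed stores 57).

def backsubstitute(diag, columns, q):
    """Solve an upper-triangular system given per-column non-zeros.

    diag[j] is the diagonal of column j; columns[j] lists (row, value)
    pairs for the off-diagonal non-zeros of column j; q is the known
    right-hand-side vector.
    """
    n = len(q)
    acc = [0.0] * n                  # row-indexed partial sums
    x = [0.0] * n
    for col in range(n - 1, -1, -1):
        # subtract accumulated products, then divide by the diagonal
        x[col] = (q[col] - acc[col]) / diag[col]
        for row, value in columns[col]:
            acc[row] += value * x[col]
    return x

# 3 x 3 example: [[2, 0, 1], [0, 3, 4], [0, 0, 5]] with q = [9, 13, 5]
diag = [2.0, 3.0, 5.0]
columns = [[], [], [(0, 1.0), (1, 4.0)]]
x = backsubstitute(diag, columns, [9.0, 13.0, 5.0])
# x == [4.0, 3.0, 1.0]: x2 = 5/5, x1 = (13 - 4*x2)/3, x0 = (9 - 1*x2)/2
```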
The system of Figure 4 may be better understood by an explanation of its operation in solving the simultaneous equations on page 31 by the backsubstitution process. The explanation will be keyed to Figures 4A-4L, which illustrate the flow of data in the system for this purpose and carry out the backsubstitution methodology, as outlined on page 31 above.
Generally speaking, the processor system of Figure 4 operates in a pipelined manner. At the beginning of the calculation process only the values contained in the storage units 56 and the known values Q1 to Q12 are available, the values Q1 to Q12 being stored in an accompanying processor, such as host 10 of Figure 1, for inputting as appropriate to unit 60. In successive machine cycles, selected ones of these values are made available to the multiprocessor cells to generate resultant operands. The process proceeds by moving data generally from left to right in Figure 4 through the processors to gradually fill it until each cell is doing a meaningful calculation during each cycle. The first several machine cycles serve to fill the pipeline in the manner described below while concurrently calculating the first several values for the components of the unknown field variable [X].
Referring to Figure 4A, during the first machine cycle Q12 is fed from a suitable storage unit or host computer (not shown) to one input of subtractor 60; the other input to unit 60, Q', from adder 65 is zero at this time in the process, since none of the operands to be operated on have reached the other functional units of the system at this point in the backsubstitution process, as will be explained hereinafter; the inputs to all the other functional units are zero. The output Q12 from subtractor 60 during cycle 1 is made available to divider 52 in cycle 2.
During machine cycle 2, Figure 4B, equation 12, page 31, is solved for X12. This is done by dividing the output Q12 from subtractor 60 by the first stored diagonal element, Y, of the matrix A to provide the first solution, X12, at the output of unit 52. Concurrently during cycle 2, the output Q' from adder 65 (which is still zero) is subtracted from Q11 in unit 60 to provide a new input to the divider 52 for the next cycle. Also, X12 is stored via line 64 in the memory 57 and fed on line 63 to the multiply unit 51 of cell 2.

During cycle 3, Figure 4C, equation 11 is solved for X11 by dividing Q11 by T and storing the result via line 64 in the store 57 of cell 1. Also, during cycle 3, the previously calculated X12 on line 63 at cell 2 is multiplied by W (one of the matrix elements from the 12th column of the matrix) in unit 51 of cell 2 to generate the partial product W·X12 at one input to accumulator 53 of cell 2. Also, during cycle 3, the output of adder 65 (which is still zero) is subtracted from Q10 in subtractor 60 of cell 1.
In an analogous manner, during cycle 4, equation 10 is solved for X10 by dividing Q10 by the next stored diagonal element P in divider 52. Concurrently, at cell 2, the previously calculated X11 is multiplied by S to generate a partial product S·X11, while the previously generated partial product W·X12 is accumulated with other partial products, if any, (on line 57') in accumulator 53 of cell 2 for storage on the next cycle as Q' in storage unit 57 of cell 2 via line 79. Also, X12, which has passed through the one-cycle delay 55, has now reached cell 3 and is multiplied by matrix element V to generate a partial product for later accumulation and storage on the following cycle in store 57 of cell 3.
During cycle 5, the output Q' from adder 65 (which still equals zero) is subtracted from Q8 in subtract unit 60; Q9 (the output from subtractor 60 on the previous cycle) is divided by L in divider 52 to generate X9; the previously calculated X10 is stored in store 57 of cell 1 and is also multiplied in cell 2 by N; the previous output S·X11 from multiplier 51 of cell 2 is accumulated with similarly indexed and stored values (if any) of store 57 (Q' on line 57') at accumulator 53 for storage in unit 57 on the next cycle; the previously calculated X11 is multiplied by R at multiplier unit 51 of cell 3, and V·X12 is accumulated with similarly indexed and previously stored values (if any) in accumulator 53 of cell 3 for subsequent storage in store 57 on the next cycle. Finally, X12 has moved down to cell 4 to be multiplied by U in multiplier 51, the product U·X12 being fed to accumulator 53 of cell 4 for accumulation and storage during the next cycle.
Succeeding machine cycles 6-13 for carrying out the backsubstitution process proceed in a fashion analogous with those described above. Figures 4F-4M detail the flow of data through the machine in solving the succeeding equations. Summarizing briefly, on each succeeding cycle another of the equations on page 31 is solved for one of X12 ... X1, with the value of this component being stored in memory 57 of cell 1 along with the previously calculated values. Each newly calculated value is also successively passed rightwardly, as seen in Figure 4, to each of the later stages of the processor for calculation of the partial products Q', which are accumulated by index according to the row of origin of the matrix elements associated with the operation. For example, during cycle 6, Figure 4F, three different partial products are concurrently accumulated for later summation in adder 65 as follows:
(1) A first partial product, N·X10, is accumulated (along with any other previous partial product sharing the same row index as element N) in memory 57 of cell 2. Thus, N·X10 is stored as Q3' since the origin of element N is from row 3 of the matrix.
(2) A second partial product, R·X11, is also accumulated in memory 57 of cell 3 under the index 3 as Q3'. (3) Likewise, a third partial product, V·X12, is accumulated along with other partial products sharing a common row origin as Q2', because the matrix element V is from row 2 of the matrix.
The first two of the above-noted partial products remain stored in memories 57 until the machine cycle during which counter 61 accesses all partial products associated with row 3 of the matrix. This occurs in cycle 9, Figure 4I, during which all of the previously accumulated row-3 partial products Q3' are fed to summation unit 65. During cycle 10, the summation (N·X10 + R·X11) is available for subtraction in unit 60 from the known value Q3, so that during cycle 11 the difference Q3 - Q3' may be divided by the diagonal element D to generate X3. It is thus seen that equation 3, page 31, is solvable during cycle 11 only because, during prior cycles, an accumulation was made based on row index of all products of previously calculated values of the unknown vector and matrix elements in row 3.
Several important features of the methodology of the system will be apparent to one skilled in the art. A chief feature is that it allows the solution of the 12 simultaneous equations of matrix [A] in 13 steps. This, in turn, is only possible because of the removal of data dependencies resulting from the bandwidth-maximized form of matrix used in the solution, in combination with the unique pattern for storing and accessing the matrix values in a multiprocessor as described herein. More specifically, during the machine cycles when the first several values of the unknown vector X are being calculated in one cell by a simple division step (made possible by the maximized form of matrix), matrix operations are being concurrently carried out in other processor cells to generate partial products (Q') using the previously calculated components. Proceeding in this manner, when the partial products which depend on previously calculated unknown vector components are needed, they are available for calculation. Thus, the backsubstitution process proceeds without having to delay until such partial products are available.
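The removal of data dependencies can be made concrete: in a bandwidth-maximized upper-triangular matrix, the trailing unknowns sit in rows with no off-diagonal non-zeros, so each is solvable by a single division. A small checking sketch (the helper name is ours, and a plain list-of-lists matrix is assumed):

```python
def directly_solvable_count(A):
    # Count trailing rows of an upper-triangular matrix that contain
    # no off-diagonal non-zeros; each corresponding unknown solves by
    # one division, with no dependency on any other component.
    n = len(A)
    count = 0
    for i in range(n - 1, -1, -1):
        if any(A[i][j] != 0 for j in range(i + 1, n)):
            break                 # this row needs partial products
        count += 1
    return count
```

For the 12-equation example, this count is what lets cell 1 stay busy with divisions while cells 2-4 build the partial products those later equations will need.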
As alluded to earlier, the minimum bandwidth K of the band of zeros interposed between the main diagonal of the matrix and the upper right-hand corner cluster of non-zero elements is selected based on the design of the multiprocessor system. As a general matter, K should be equal to the maximum number of cycles or steps it takes for the specific processor arrangement employed to gather and make available the data needed to calculate any given unknown vector component Xi. With respect to the system of Figure 4, it is seen that once X12, for example, is calculated during cycle 2, six additional steps or cycles (3-8) are needed to complete the calculation, accumulation and summation of the partial products associated with X12 and provide those products to subtractor unit 60 in time for use in the solution of equation 5. Thus a bandwidth K of 6 is sufficient to provide the space and time to take full advantage of the particular systolic array of Figure 4. The particular band needed for other hardware systems would, of course, depend on the specifics of such a system. If a smaller bandwidth is provided, some efficiency may be sacrificed, depending on the composition of data in the series of equations being operated on.
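The zero-band width K can be measured directly from a stored matrix. A sketch under the same illustrative list-of-lists assumption, returning the largest K such that A[i][j] is zero whenever 0 < j - i <= K:

```python
def zero_band_width(A):
    # Largest K such that every element strictly above the main
    # diagonal and within K columns of it is zero.
    n = len(A)
    k = n - 1                      # upper bound: no off-diagonal non-zeros
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j] != 0:
                k = min(k, j - i - 1)
                break              # only the nearest non-zero matters
    return k
```

A check such as `zero_band_width(A) >= 6` would confirm that a given decomposed matrix leaves enough slack for the six-cycle pipeline latency of the Figure 4 array.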
As generally understood in the art, bandedness and sparsity reflect the connectivity of the finite element mesh. Transforming a system matrix [K] to an upper-triangular, bandwidth-maximized matrix [A] having a sparsity structure as illustrated in Figure 5B is well within the skill of the art. Whereas minimizing the bandwidth, as illustrated in Figure 5A, involves renumbering the nodes of the finite element mesh so that a minimum difference between any two adjacent nodes exists, maximizing the bandwidth as in Figure 5B is accomplished by renumbering the system nodes to maximize the difference between adjacent nodes.
This may be accomplished, in use of the system and without the expenditure of undue time, by a cut-and-try technique in which a first pass assigns random numbers to the nodes. Since the typical mesh is large, this technique usually removes 90% of the non-zero data from the band adjacent the diagonal. Further refinement is made by switching numbers between nodes of non-zero elements within and outside of the band. Continuing this cut-and-try technique a few more times almost always results in a band of sufficient width to efficiently run the decomposed matrix in accordance with the technique herein disclosed. Moreover, since only a relatively small number of processors is contemplated, the size of the band is generally small relative to the overall number of finite elements in a system mesh.
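The first random pass of the cut-and-try technique can be sketched as repeated random relabelling, scoring each labelling by the number of mesh edges whose endpoint labels differ by less than the target gap. All names here are illustrative; the swap-refinement pass between nodes inside and outside the band, described above, would follow as a second stage:

```python
import random

def cut_and_try_renumber(edges, n, target_gap, tries=100):
    # Randomly relabel n mesh nodes several times, keeping the
    # labelling with the fewest "bad" edges, i.e. edges whose
    # endpoint labels differ by less than target_gap and would
    # therefore place a non-zero element inside the band.
    best_perm, best_bad = None, None
    for _ in range(tries):
        perm = list(range(n))
        random.shuffle(perm)              # node i receives label perm[i]
        bad = sum(1 for u, v in edges
                  if abs(perm[u] - perm[v]) < target_gap)
        if best_bad is None or bad < best_bad:
            best_perm, best_bad = perm, bad
    return best_perm, best_bad
```

Because each trial is independent, a few repetitions usually suffice on large meshes, matching the observation above that most of the band clears on the first random pass.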
As will be appreciated by one skilled in the art, by utilizing the bandwidth-maximized matrix A as provided herein, a plurality of unknown components XN of the unknown field variable vector may be computed without resort to other values of the field variable. If these computations are done sequentially in a single processor (cell 1 in the example above), others of the processors (cells 2-4) in the system may, during the same time period, be calculating and accumulating the partial products Q' which will be needed at a later time to complete the computation of all of the components of XN (specifically those which are dependent on previously calculated values). Thus, by increasing the number of calculations for XN which may be carried out without data dependencies, time is provided in a multiprocessor system to complete intermediate calculations without attendant idle time. All processors of the system, instead of being forced to wait for some portion of the backsubstitution process, are allowed to do meaningful calculations during each machine cycle.
While the matrix calculation process specifically addressed herein is backsubstitution, the forward elimination process would proceed in a similar manner and could be easily implemented in accordance with the principles of the invention by one skilled in the art. The forward elimination process requires, however, that the original system matrix be transformed into a lower triangular form, rather than the upper triangular form used in backsubstitution. The forward elimination technique is also described in detail in the above-noted texts.
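For comparison, a serial sketch of forward elimination on a lower-triangular matrix: the sweep simply runs top-down, and each solved component feeds later rows instead of earlier ones (function name ours, same illustrative list-of-lists assumption):

```python
def forward_eliminate(L, q):
    # Solve L x = q for a lower-triangular matrix L. The structure
    # mirrors backsubstitution: divide by the diagonal element, then
    # spread each solved component into the partial sums of later rows.
    n = len(q)
    x = [0.0] * n
    acc = [0.0] * n                       # partial sums, one per row
    for j in range(n):
        x[j] = (q[j] - acc[j]) / L[j][j]
        for i in range(j + 1, n):
            acc[i] += L[i][j] * x[j]
    return x
```

A bandwidth-maximized lower-triangular matrix would place its non-zero cluster in the lower-left corner, so the same pipelining argument applies with the roles of the first and last equations exchanged.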
It should also be emphasized again that while the technique of the invention has been explained for the sake of brevity with respect to a small matrix and system of equations, the benefits of the invention are more obviously derived from operation on large, sparse matrices which are associated with large, sparse systems of equations representative of typical physical systems to be analyzed according to the finite element analysis method.
While the invention has been described in connection with preferred embodiments, including use in a backsubstitution operation performed on a systolic array computer architecture, the invention is not limited to any specific parallel multiprocessor matrix operation. Consequently, modifications would be suggested to one skilled in the art having an understanding of the foregoing, without departing from the invention.
CLAIMS:
1. An improved method for analyzing physical phenomena comprising the steps of: generating a first system of linear equations to describe said physical phenomena, said first system of equations characterized by a known system matrix K of order N by N, an unknown N-vector Y and a known N-vector R, said system matrix relating said unknown vector to said known vector by the matrix expression KY = R, said system matrix being large and sparse; transforming said first system into a second system of equations characterized by a decomposed, triangular matrix having only non-zero elements on its main diagonal, a band of zero elements extending a preselected bandwidth from said main diagonal, an absence of non-zero elements in said band, with all remaining non-zero elements located in the portion of said decomposed matrix outside said band; loading the elements of said decomposed matrix into the memories associated with a plurality of processors, with the diagonal elements of said decomposed matrix being contained in the memory associated with one of said processors; and solving said second system of equations by calculating the values of a plurality of components of said unknown N-vector independently of any previously calculated value of any component of said unknown N-vector.
2. The method of claim 1 wherein said step of solving comprises carrying out the backsubstitution or forward elimination process on said second system to determine the components of the unknown N-vector by a series of matrix manipulations carried out concurrently in a plurality of parallel processors.
3. The method of claim 2 wherein said solving step for each component of said unknown N-vector comprises a series of operations carried out by a plurality of processors.
4. The method of claim 3 including the step of storing the data elements of said decomposed matrix in the memories associated with a plurality of processors.
5. The method of claim 1 wherein said step of transforming comprises the steps of: expressing said physical phenomena in terms of a plurality of finite elements, each defined by a plurality of nodes according to conventional finite element analysis techniques, to generate said first system of equations; and renumbering a plurality of said nodes in a manner such that adjacent nodes differ by a preselected minimum value, to thereby insure that said second system of linear equations is characterized by a decomposed matrix of the form as provided in claim 1.
6. The method of claim 5 wherein the bandwidth of said band in said decomposed matrix is sufficiently large to permit calculation of a plurality of components of said unknown N-vector independently of the values of any other components of said unknown N-vector.
7. The method of claim 1 wherein said solving step includes:
calculating a first plurality of unknown components of said field variable vector X on successive cycles of said processor by dividing known values of said vector Q by corresponding diagonal elements of said decomposed matrix in said one processor; and successively moving calculated values of said vector X into each of the other processing cells not containing said diagonal elements for multiplication by associated other elements of said stored matrix to generate partial products required for calculation of a second plurality of said unknown components of said vector X.
8. In the finite element analysis method for analyzing a physical system including the steps of dividing the solution region into a large, but finite, number of elements, expressing unknown field variables in terms of approximating functions within each element, said functions being defined in terms of values of unknown field variables at specified nodes lying on the boundaries of said elements, assembling the element functions into a first system of equations describing the behavior of the solution region, said first system of equations taking the general matrix form
KY = R, where Y represents unknown values of the field variable at nodes in the system, R represents known boundary values, and K is a large, sparse matrix, the improved step of solving said system of equations by transforming said system of equations into a second system of equations characterized by a decomposed, large, sparse, triangular matrix having only non-zero data on its main diagonal, a cluster of non-zero data located in the right-angle corner of said decomposed matrix, and a band of only zero data located intermediate said diagonal and said cluster.
9. The method of claim 7 further including the step of solving said second system of equations using backsubstitution or forward elimination to determine said unknown values of said field variable.
10. The method of claim 8 wherein said matrix is stored in an ordered manner in a plurality of parallel processors; and a plurality of said unknown values of said field variable are calculated independently of any other precalculated values of said field variable.
11. The method of claim 7 wherein said decomposed matrix is generated by renumbering said nodes in a manner to insure that adjacent nodes differ by a preselected minimum value sufficient to create said band.
12. A parallel processing apparatus usable to solve a set of simultaneous equations of triangular form and having sufficient zero-valued co-efficients that (a) at least a predetermined plural number of the equations are each directly solvable independently of the solution to any other equation of the set, and (b) when written in matrix form, all the remaining equations have at least the said predetermined plural number of zero-valued co-efficients between the leading diagonal of the matrix and any non-zero-valued co-efficients, the machine having a plurality of parallel processing cells, a first of which initially solves in turn said predetermined number of directly solvable equations and makes the solutions available to the remaining cells, the remaining cells using solutions from the first cell to provide partial solutions to remaining equations and making the partial solutions available to the first cell, and the first cell subsequently using partial solutions from the remaining cells to solve remaining equations and making the solutions available to the remaining cells for use in providing further said partial solutions, the number of machine steps required for the first cell to solve the first directly solvable equation, make the solution available to a remaining cell, and for the remaining cell to provide the first partial product and make it available to the first cell being no greater than the number of machine steps required for the first cell to solve the predetermined number of directly solvable equations.
13. Apparatus according to claim 12 in which each said parallel processing cell stores the co-efficients, which it will use to calculate solutions or partial solutions, in locations which signify to which matrix column it belongs (i.e. of which unknown factor it is a co-efficient).


14. Apparatus according to claim 12 or claim 13 in which each said parallel processing cell stores the co-efficients, which it will use to calculate solutions or partial solutions, in association with values indicating to which matrix row it belongs (i.e. in which equation of the set it is a co-efficient).
15. A method of solving a set of simultaneous equations in which the equations are transformed, if necessary, to render the set into a form in which apparatus according to any of claims 12 to 14 is usable to solve it, the co-efficients of the equations are stored in the cells of the said apparatus which will use them to provide solutions or partial solutions, and the first cell provides solutions and the remaining cells provide partial solutions until the first cell has provided a solution to each equation in the set.
16. A method according to claim 15 in which the co-efficients in the matrix leading diagonal are stored in the first cell.
17. A method of analysing physical phenomena substantially as herein described with reference to the accompanying drawings.
18. Apparatus for solving simultaneous equations substantially as herein described with reference to the accompanying drawings.
GB08802490A 1987-02-04 1988-02-04 Finite element analysis utilizing a bandwidth maximized matrix Withdrawn GB2205183A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US1065687A 1987-02-04 1987-02-04

Publications (2)

Publication Number Publication Date
GB8802490D0 GB8802490D0 (en) 1988-03-02
GB2205183A true GB2205183A (en) 1988-11-30

Family

ID=21746768

Family Applications (1)

Application Number Title Priority Date Filing Date
GB08802490A Withdrawn GB2205183A (en) 1987-02-04 1988-02-04 Finite element analysis utilizing a bandwidth maximized matrix

Country Status (7)

Country Link
JP (1) JPS63265365A (en)
AU (1) AU1113888A (en)
DE (1) DE3803183A1 (en)
FR (1) FR2610429A1 (en)
GB (1) GB2205183A (en)
IT (1) IT8819296A0 (en)
SE (1) SE8800341D0 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572109A (en) * 2015-01-19 2015-04-29 上海交通大学 Two-stage partitioned two-time polycondensation parallel computing system development method and parallel computing system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0425296A3 (en) * 1989-10-27 1992-10-14 Texas Instruments Incorporated Speedup for solution of systems of linear equations
US5442569A (en) * 1993-06-23 1995-08-15 Oceanautes Inc. Method and apparatus for system characterization and analysis using finite element methods
DE4326382A1 (en) * 1993-08-05 1995-04-20 Franz Arnulf Eberle Method and computing device for calculating the field of a physical process
JP4215977B2 (en) * 2001-11-30 2009-01-28 東京エレクトロン株式会社 Film formation control apparatus, film formation apparatus, film formation method, film thickness flow coefficient calculation method, and program

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4493048A (en) * 1982-02-26 1985-01-08 Carnegie-Mellon University Systolic array apparatuses for matrix computations

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4493048A (en) * 1982-02-26 1985-01-08 Carnegie-Mellon University Systolic array apparatuses for matrix computations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, MARCH 26-29, 1985, PP232-235 *


Also Published As

Publication number Publication date
JPS63265365A (en) 1988-11-01
FR2610429A1 (en) 1988-08-05
AU1113888A (en) 1988-08-11
DE3803183A1 (en) 1988-08-18
GB8802490D0 (en) 1988-03-02
SE8800341D0 (en) 1988-02-03
IT8819296A0 (en) 1988-02-04

Similar Documents

Publication Publication Date Title
US4787057A (en) Finite element analysis method using multiprocessor for matrix manipulations with special handling of diagonal elements
KR100311251B1 (en) 2D Discrete Cosine Converter, 2D Inverse Discrete Cosine Converter, and Digital Signal Processing Equipment
Baccelli et al. Parallel simulation of stochastic petri nets using recurrence equations
JPH04313157A (en) Dct processor
EP0425296A2 (en) Speedup for solution of systems of linear equations
US3746848A (en) Fft process and apparatus having equal delay at each stage or iteration
Bojańczyk Complexity of solving linear systems in different models of computation
EP0174995A4 (en) Computer and method for the discrete bracewell transform.
US3721812A (en) Fast fourier transform computer and method for simultaneously processing two independent sets of data
JPH08335885A (en) Two-dimensional dct circuit
GB2205183A (en) Finite element analysis utilizing a bandwidth maximized matrix
Shu et al. Sparse implementation of revised simplex algorithms on parallel computers
EP0080266B1 (en) Discrete fourier transform circuit
Lim et al. Efficient systolic arrays for FFT algorithms
KR0152802B1 (en) Idct method and apparatus for image compression
EP0588677A2 (en) Two-dimensional 4x4 descrete cosine transformation system and two-dimensional 4x4 discrete cosine inverse transormation system with simple circuit structure
Wyrzykowski et al. One-dimensional processor arrays for linear algebraic problems
JP2737933B2 (en) Division device
CN110647976B (en) Matrix convolution optimization operation method and circuit
Armon et al. Space and time efficient implementations of parallel nested dissection
Chikouche et al. Recursive filters using systolic architectures and switched capacitor techniques
Porter Concurrent forms of signal processing algorithms
Sudrastawa Conceptual and Practical Review of Gaussian Elimination and Gauss-Jordan Reduction
Wyrzykowski et al. Systolic-type implementation of matrix computations based on the Faddeev algorithm
Bhaya et al. A cooperative conjugate gradient method for linear systems permitting efficient multi-thread implementation

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)