CA1236584A - Parallel processing system - Google Patents

Parallel processing system

Info

Publication number
CA1236584A
CA1236584A CA000490758A CA490758A
Authority
CA
Canada
Prior art keywords
arithmetic units
arithmetic
vector
data
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000490758A
Other languages
French (fr)
Inventor
William E. Hall
Dale A. Stigers
Leslie F. Decker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Floating Point Systems Inc
Original Assignee
Floating Point Systems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Floating Point Systems Inc filed Critical Floating Point Systems Inc
Application granted granted Critical
Publication of CA1236584A publication Critical patent/CA1236584A/en
Expired legal-status Critical Current

Links

Landscapes

  • Complex Calculations (AREA)

Abstract

A parallel processing system utilizes a plurality of simultaneously operable arithmetic units to provide matrix-vector products, with each of the arithmetic units implementing the matrix-vector product calculations for plural rows of a matrix stored as vectors in an arithmetic unit. A column of a second matrix is broadcast to the respective arithmetic units whereby the products may be developed in all the arithmetic units simultaneously. The broadcasting of the matrix elements is accomplished via a memory bus which may be employed for selectively or simultaneously accessing registers in the various arithmetic units whereby vector information may be written into memory addresses and calculation results retrieved therefrom.

Description


PARALLEL PROCESSING SYSTEM

Background of the Invention

The present invention relates to high speed parallel arithmetic circuitry and particularly to such circuitry for providing convenient accessibility to and from parallel arithmetic units.
Many complex computing problems involve highly replicated arithmetic operations. Scientific computing typically includes large continuum models that invariably generate a large, sparse matrix to be solved, and this matrix-solving step is the bottleneck of the run for general-purpose computers. In order to solve complex problems in a reasonable time, the components of a monolithic supercomputer must be chosen for maximum speed, regardless of expense. However, computing circuitry relying upon replicated design can independently optimize performance and cost efficiency. Therefore, replicated, very large scale integrated circuits can provide the parallel solution of parts of many kinds of computationally intensive problems in a reasonable time.
An example of a fundamental operation that dominates scientific computing, the so-called matrix-vector product (MVP), is the basis of both matrix multiplication and linear equation solving.
If scientific matrix problems required only a single matrix-vector product at a time, then the only way to increase its speed would be to use faster arithmetic and memory circuits to implement a monolithic MVP unit. However, problems involving matrix-vector products require multiple MVPs to be evaluated. Thus, an alternative tactic for gaining speed is to devise parallel versions of the matrix-vector product.
The implementation of parallel arithmetic units for performing a computation such as the matrix-vector product is typically somewhat inflexible and special purpose oriented. Thus, individual units are not readily accessible from the standpoint of control and from the standpoint of data access to and from parallel units. Furthermore, the theoretically optimum speed may not be easily realized. An advantageous system would provide convenient and accessible communication with parallel computational units and at the same time take advantage of the speed possibilities of the units.
Summary of the Invention
In accordance with the present invention, in a particular embodiment thereof, a parallel processing system includes a central processor unit, a memory system, and memory bus means coupling the central processor unit and the memory system. A plurality of memory mapped arithmetic units are also coupled with the memory bus means whereby these arithmetic units are addressable for writing data into selectable units and for reading data from selectable units. A portion of the memory address space is divided into segments, one for each of the arithmetic units, and is used for reading and writing data and control information to the respective units. A special segment in the address space writes to all of the arithmetic units so that common data is broadcast thereto.
The system typically performs a matrix-vector product (MVP) defined by the expression y=A*x, where y is a vector with m elements, A is a matrix with m rows and n columns, and x is a vector with n elements. The vector x is stored in the individual arithmetic units and elements of A are provided on the memory bus to the arithmetic units. Typically, dot products are performed wherein a single element of y is accumulated in an arithmetic unit as the sum of the element by element products of a row of A against the locally stored vector x. The dot products with the same vector x can be completed with successive rows of A and a vector output is provided. The different arithmetic units locally store different vectors x, with each unit calculating the elements of a different output vector. The process is repeated for each row of A for a matrix multiply.
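As an illustration of this organization (a minimal Python model, not part of the original specification), each simulated arithmetic unit below keeps its own locally stored vector x while the elements of A are broadcast to all units:

```python
def broadcast_mvp(A, local_xs):
    """Model of the parallel MVP y = A*x: every 'arithmetic unit' k holds its
    own vector local_xs[k]; elements of A arrive one at a time on the bus."""
    m, n = len(A), len(A[0])
    ys = [[0.0] * m for _ in local_xs]        # one output vector per unit
    for i in range(m):                        # successive rows of A
        for j in range(n):
            a_ij = A[i][j]                    # broadcast: same operand to every unit
            for k, x in enumerate(local_xs):  # the units operate simultaneously
                ys[k][i] += a_ij * x[j]       # accumulate the dot-product element
    return ys

A = [[1.0, 2.0], [3.0, 4.0]]
print(broadcast_mvp(A, [[1.0, 1.0], [0.0, 1.0]]))  # [[3.0, 7.0], [2.0, 4.0]]
```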
The various registers in the arithmetic units are readily accessed as memory and can be input or output as desired. Each arithmetic unit, instead of storing only one vector x, can store a plurality of such vectors so that optimum time use is made of the elements of A received on the bus. In addition to the dot product calculation, other types of calculation are possible including VSMA (vector scalar multiply-add) and VMSA (vector multiply scalar add). The operation of the arithmetic units is controlled according to control information as addressed to memory corresponding to the arithmetic units.
It is an object of the present invention to provide an improved parallel processing system, the elements of which are readily accessible and controllable.
It is another object of the present invention to provide an improved parallel processing system for producing a matrix-vector product.

It is another object of the present invention to provide an improved parallel processing system which is rapid in operation and which makes optimum use of storage and communication facilities.
It is a further object of the present invention to provide an improved system for performing calculations involving sparse matrices.
The subject matter of the present invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. However, both the organization and method of operation thereof, together with further advantages and objects, may best be understood by reference to the following description taken in connection with accompanying drawings wherein like reference characters refer to like elements.

Drawings

FIG. 1 is an illustration of a first type of calculation suitable for parallel processing,
FIG. 2 is an illustration of a second type of calculation suitable for parallel processing,
FIG. 3 illustrates a multiple calculation form of the type carried out by the present invention,
FIG. 4 is a block diagram illustrating arithmetic units as employed according to the present invention,
FIG. 5 is an overall block diagram illustrating a parallel processing system according to the present invention,
FIG. 6 is a more detailed block diagram of an arithmetic unit according to the present invention,
FIG. 7 is a memory map according to the system of the present invention,
FIG. 8 is a block diagram illustrating the function of an arithmetic unit according to the present invention as interconnected for executing a dot product,
FIG. 9 is a block diagram illustrating the function of an arithmetic unit according to the present invention as interconnected for executing a VSMA calculation, and
FIG. 10 is a block diagram illustrating the function of an arithmetic unit according to the present invention as interconnected for executing a VMSA calculation.

Detailed Description

A calculation suitable for illustrating the operation of the processor according to the present invention comprises the matrix-vector product or MVP since it forms a basis of both matrix multiplication and linear equation solving. The MVP is a sum-of-products operation: y=A*x, where y is a vector with m elements, A is a matrix with m rows and n columns, and x is a vector with n elements.
In one computational form, the dot product form, the sum of element by element products of a row of A against the vector x is accumulated into a single element of y. Referring to FIG. 1, as a first step the first element, 10, of a row of A is multiplied with the top element, 12, of the vector x. Then the second element of a row of A is multiplied with the next to the top element of vector x and added to the product of the first step at 14. Next the third element of a row of A is multiplied with the third element of vector x and added to the previous accumulation 14, etc. The complete MVP is accomplished by m dot products involving the same vector x with successive rows of A.


Another form of computation for providing the MVP is illustrated in FIG. 2. This form accumulates into all the elements of y the products of a single element of x with respective elements of a column of A. This form is called VSMA, for vector scalar multiply-add. A complete MVP is accomplished by n such VSMA operations involving successive elements of x against successive columns of A. It is understood the terms "row" and "column" in the above discussion are somewhat interchangeable.
Thus, in the dot product form, the sum may be accumulated of element by element products of a column of A against a row vector x.
For reference purposes, the dot product of two vectors A and B may be defined as follows:

Vector A: a1, a2, a3, ... an
Vector B: b1, b2, b3, ... bn
A.B: a1*b1 + a2*b2 + a3*b3 + ... + an*bn

The VSMA can be illustrated in the following manner:

Vector A: a1, a2, a3, ... an
Vector B: b1, b2, b3, ... bn
Scalar C: c
A+c*B: a1+c*b1, a2+c*b2, a3+c*b3, ... an+c*bn

A further form of the MVP, known as VMSA, or vector multiply scalar add, is illustrated as follows:
Vector A: a1, a2, a3, ... an
Vector B: b1, b2, b3, ... bn
Scalar C: c
c+A*B: c+a1*b1, c+a2*b2, c+a3*b3, ... c+an*bn

Individual MVPs are calculated in parallel according to the present invention employing plural MVP units or arithmetic units as illustrated in FIG. 4. Each of these arithmetic units 16, 17...30 is adapted to perform an MVP calculation of the dot product type, or alternatively one of the other MVP forms as discussed above. The particular circuit locally stores four vectors illustrated as rows 32 in FIG. 3 and forms dot products 34 with the column 36 forming part of a matrix M. This provides part of a matrix multiplication illustrated in FIG. 3.
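For comparison (an illustrative Python rendering, not part of the original specification), the three computational forms defined above are:

```python
def dot_product(a, b):
    """Dot product form: accumulate a single scalar, a1*b1 + a2*b2 + ... + an*bn."""
    s = 0.0
    for ai, bi in zip(a, b):
        s += ai * bi
    return s

def vsma(a, c, b):
    """Vector scalar multiply-add: A + c*B, element by element."""
    return [ai + c * bi for ai, bi in zip(a, b)]

def vmsa(a, c, b):
    """Vector multiply scalar add: c + A*B, element by element."""
    return [c + ai * bi for ai, bi in zip(a, b)]

a, b, c = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 10.0
print(dot_product(a, b))   # 32.0
print(vsma(a, c, b))       # [41.0, 52.0, 63.0]
print(vmsa(a, c, b))       # [14.0, 20.0, 28.0]
```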
Referring further to FIG. 4, each arithmetic unit such as unit 16 is adapted to perform a multiply-add calculation and hence includes a first multiplier 38 receiving an input from bus means 40 and a second input from multiplexer 42. The product output from multiplier 38 is coupled as an input to adder 44, further receiving a second input from multiplexer 46. The arithmetic unit also includes a first set of four vector registers 48 and a first set of scalar registers 50 which selectively receive data input from bus means 40. Either the scalar registers 50 or the vector registers 48 may supply input to multiplier 38 or adder 44 through the aforementioned multiplexers 42 and 46. The output of adder 44 is coupled back as input to the registers 48 and 50.
The circuit for arithmetic unit 16 as thus far described is substantially duplicated as indicated by units identified by primed reference numerals on the right hand side of the drawing, and the right hand portion differs principally in that the output of vector registers 48' provides the only second input of multiplier 38' as well as being connected to inputs of both multiplexers 42 and 46. Furthermore, the output of scalar registers 50' provides the only second input for adder 44' while also connecting to multiplexer 42. The two substantially duplicate circuit halves of the arithmetic unit can each perform an MVP calculation of the dot product type wherein the four vectors, e.g. as illustrated at 32 in FIG. 3, are respectively stored in the registers 48 and another set of four is stored in registers 48'. Then the elements of column 36 in FIG. 3 are provided element by element on bus means 40 as input to the respective multipliers. A given input element of column 36 is retained for four cycles while multiplications are successively performed against the four rows stored in the vector registers. The multiplier, e.g. multiplier 38, and the adder, e.g. adder 44, are "pipelined" and take a number of cycles to provide an output after the respective inputs are supplied thereto. The multiplications and additions with respect to the four vectors stored in the vector registers are performed sequentially, and the sum of the product with the previous products as supplied by the adder is re-entered into the respective scalar registers. Thus, the previously accumulated sums from scalar registers 50 are added, via multiplexer 46, to the new product from multiplier 38, and the sums performed by adder 44 are re-entered into registers 50. The utilization of the vector and scalar registers in each half of the arithmetic unit enables four multiply-adds to take place in each half before further input data is required from bus means 40 and consequently the operation is not delayed by waiting for additional data. In the case of the dot product form of calculation, each half of the arithmetic unit operates substantially independently and so eight vectors are stored locally in registers 48 and 48', while providing the products for eight rows, four of which are indicated at 32 in FIG. 3.
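The behaviour of one half of an arithmetic unit can be modelled roughly as follows (an illustrative Python sketch, not part of the specification; the attribute and method names are simplifications of the registers described above):

```python
class HalfUnit:
    """Rough model of one half of an arithmetic unit: four locally stored
    row vectors and four running sums standing in for the scalar registers."""
    def __init__(self, rows):
        self.vector_regs = rows               # four rows of the stored matrix
        self.scalar_regs = [0.0] * len(rows)  # partial dot-product sums
        self.index = 0                        # position within each stored row

    def write_advance_pipe(self, element):
        """A broadcast element is held while it is multiplied against each of
        the four stored rows; each product is added back into the running sum."""
        for r, row in enumerate(self.vector_regs):
            self.scalar_regs[r] += element * row[self.index]
        self.index += 1                       # advance to the next row element

half = HalfUnit(rows=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
for column_element in [10.0, 100.0]:          # elements of the broadcast column
    half.write_advance_pipe(column_element)
print(half.scalar_regs)                       # [210.0, 430.0, 650.0, 870.0]
```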
The arithmetic unit 16 is further duplicated at 17...30 in FIG. 4, providing a total of fifteen arithmetic units or thirty half units all intercoupled to the same synchronous bus means 40 for operating substantially simultaneously. Thus, the dot product calculation is not just performed for eight vectors as illustrated for the case of arithmetic unit 16, but for eight times the number of arithmetic units. The same column elements of a matrix M (at 36 in FIG. 3) are broadcast to all the arithmetic units employing synchronous memory bus means whereby an MVP calculation of substantial size can be performed.
The overall arrangement of the parallel processor according to the present invention is illustrated in FIG. 5. FIG. 5 depicts a complete system according to the present invention comprising a central processing unit 54 provided with a main memory, in this case including a plurality of memory circuit boards 56. The central processing unit 54 is suitably an FPS-164 array processor manufactured by Floating Point Systems, Beaverton, Oregon, and includes an adder 60, multiplier 62, X registers 64, Y registers 66, table memory 68, and address unit 70 coupled together via interconnect bus circuitry 72. The processor 54 in the form of an array processor is typically connected to a front-end computer or host computer 74, and to disk mass storage 76. In a typical instance, the host computer 74 comprises a VAX-11/780* manufactured by the Digital Equipment Company. The front-end computer handles interactive time-sharing, while the processor 54 concentrates on arithmetic-intensive calculations. In the particular system organization according to the present invention, the processor 54 is used principally for scalar calculations and for controlling the operation of plural, parallel arithmetic units 16-30.

* Trade Mark
The arithmetic units 16-30 physically comprise boards each having the circuit configuration as outlined for unit 16 in FIG. 4, and are intercoupled with synchronous memory bus means 40 used in common with processor 54 and main memory boards 56. Parallel, memory-mapped processing is provided by the arithmetic units 16-30 which share the common memory address and memory data buses with main memory 56. Although fifteen arithmetic units are illustrated, it is understood a greater or lesser number can be employed as desired. The particular type of central processing unit 54 and front-end computer 74 are given by way of example and it is understood the present invention is not limited thereto.
Thus, computational units are provided which are readily accessible and wherein a "memory" write can bring about computation in one or all of the arithmetic units, with the result being subsequently retrieved by reading an address of arithmetic units in memory. As hereinbefore indicated, a common input operand may be broadcast to a plurality of the arithmetic units 16-30, with each one performing calculations with respect to data theretofore stored therein, for example representing plural vectors or rows of a matrix.
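This write-triggers-compute, read-retrieves-result style can be pictured as follows (a speculative Python model; the address offsets and names are illustrative only and do not come from the patent):

```python
class MemoryMappedUnit:
    """Toy model: a 'memory' write to the advance-pipe offset triggers a
    multiply-add, and a later read of the result offset returns the sum."""
    ADVANCE_PIPE = 0x0        # illustrative offsets, not the actual map of FIG. 7
    RESULT = 0x1

    def __init__(self, local_vector):
        self.local_vector = local_vector      # data theretofore stored in the unit
        self.sum = 0.0
        self.index = 0

    def write(self, offset, value):
        if offset == self.ADVANCE_PIPE:       # writing data brings about computation
            self.sum += value * self.local_vector[self.index]
            self.index += 1

    def read(self, offset):
        if offset == self.RESULT:             # reading an address retrieves the result
            return self.sum

unit = MemoryMappedUnit([2.0, 3.0])
unit.write(MemoryMappedUnit.ADVANCE_PIPE, 5.0)   # 5*2 accumulated
unit.write(MemoryMappedUnit.ADVANCE_PIPE, 7.0)   # 7*3 accumulated
print(unit.read(MemoryMappedUnit.RESULT))        # 31.0
```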
Referring to FIG. 6, an arithmetic unit of the type hereinbefore described with reference to FIG. 4 is illustrated in greater detail. Memory write data from memory data bus 40a is received by main data input register 78 forming the only data input path to the arithmetic unit of FIG. 6. All data written to the FIG. 6 arithmetic unit is first loaded into this register which is suitably 64 bits wide. In a particular example, the data is in a normalized floating point format used by processor 54 in FIG. 5. Data may be written to scalar registers 50, 50', to vector registers 48, 48', or to the data pipe indicated at 112. Data may also be written to index counter 92, control register 88, or status register 90, in which case the data is in integer form, right justified within the 64-bit word. The formatter 80 is utilized for floating point conversions from the format of processor 54 to the format of the devices employed in multiplier units 38, 38' and adder units 44, 44', which is an IEEE 64-bit floating point format. Of course, if the processor 54 utilizes the last mentioned format, then circuit 80 is unnecessary.
Data pipe 112 (also called advance pipe), suitably comprising a register used as input to multiplier 38 and/or multiplier 38', is utilized to provide input data to the respective multipliers for four clock cycles such that an input operand (for example an element of vector column 36 in FIG. 3) may be successively multiplied with elements of matrix rows 32. This input comprises one of the inputs to each of the multipliers 38, 38', while the remaining input of each of the multipliers is obtained from one of the registers in the arithmetic units as hereinafter more fully described. The multipliers 38 and 38' are each capable of forming a product of two 64-bit floating point numbers. The multipliers are pipelined and require a plurality of clock cycles for producing an output product. In a particular example, these multipliers comprised WTL1065 units manufactured by Weitek.
Memory address bus portion 40c is coupled to memory address register 82 and to memory address decoder 86. The address in register 82 is utilized for addressing vector registers 48, 48' and scalar registers 50, 50' by way of bus connections 114, 116, 118 and 120 respectively. Bits of the address are decoded by decoder 86 to determine which of the aforementioned registers is being accessed for a read/write operation. It is noted the various registers of the arithmetic unit are independently accessible in the same manner as other portions of memory would be. Other address bits from memory address portion 40d are received by board select circuit 84 where these bits are compared to the logical address of the particular arithmetic unit, and if a match is detected, then all the internal registers of this particular arithmetic unit are accessible for read/write operations. It should be observed that a match is available for either addressing this arithmetic unit by itself, or as a part of a group of arithmetic units for the purpose of broadcasting a common operand to more than one unit. Registers 48, 48' and 50, 50' of an arithmetic unit are not accessible when the arithmetic unit is busy completing a vector form operation.
Index counter 92 is a twelve bit counter used for addressing the locations of the vector registers via buses 114 and 116. The counter's function is either to hold a value being written to an index register 94, 96, or 98, or to increment a previous value that has been written or incremented. Data for the counter 92 is received from register 78. The index counter value is "locked down" for internal use for four cycles by any write to the data pipe 112. The index registers 94, 96 and 98 are all twelve bits wide and are locked down "copies" of the index counter value whenever any write of the data pipe occurs. Writes to the index counter do not affect the value in the index registers until a write to the data pipe takes place, in which case the index counter number is copied into the desired index register or registers.
Index register 98 provides an output for first-in first-out registers (FIFOs) 102 and 104 which function as read-write registers for the temporary storage of addresses for vector registers 48, 48'. Each FIFO is suitably four locations deep. Each vector storage register has its own index register and FIFO associated therewith for addressing the particular vector storage register. Thus, index register 94 and FIFO 102 may provide an address to vector storage register 48 by way of bus 114, while index register 96 and FIFO 104 may provide addressing to vector register 48' via bus 116. The index register 98 provides the input to FIFOs 102 and 104.
During the operation of the arithmetic unit, the processor 54 sets the counter 92 with a starting index for addressing the vector registers 48 and 48'. During an auto increment mode of operation, the index counter 92 causes registers 94 and 96 to provide successive addresses for the vector registers 48 and 48'. The auto incrementation is appropriate when operating on either full, banded, or profile matrices. However, the addresses for the vector registers may be supplied alternatively from memory address register 82 to set the index before each multiply-add of the arithmetic unit for operating on sparse matrices. The apparatus is capable of sparse matrix operation at full speed without requiring incremental operation. In the case of sparse matrices, the indices are read from a table of pointers in the processor.
The FIFOs are used to delay the index value for certain forms of computation in the arithmetic unit. The index value is held and read out for addressing the vector registers at the appropriate time, taking into consideration the pipelines presented by the multiply-add circuitry. One of the FIFOs is suitably employed for reading one of the vector registers, while the other FIFO is employed for providing an address for writing into a vector register at a later time.
The two separate vector storage registers are employed for storing two separate sets of four vectors each. Thus, vector register 48 is employed for storing A, B, C, D vectors (also indicated as 0, 1, 2, 3), and vector register 48' is employed for storing E, F, G, H vectors (also indicated as 4, 5, 6, 7). Each storage register is suitably 8K locations deep and is divided into four 2K regions, one region for each vector.
Data read from vector register 48 has three destinations: It can be supplied as an input operand to multiplier 38 (via multiplexer 42), or as an input to adder 44 (via multiplexer 46), or it may be provided to format circuit 108 (via multiplexer 42) and latch 110 for output onto the memory read data connection on memory out bus 40b. Data read from vector storage register 48' has four possible destinations: It can be supplied as an input operand to multiplier 38 (via multiplexer 42), to multiplier 38', to adder 44 (via multiplexer 46), or to format circuit 108 (via multiplexer 42) for output.
Data written into vector register 48 can come from two possible sources. It can be supplied by either the adder 44 output or as a data input from register 78 through formatter 80. Data written into vector register 48' has three possible sources. It can be provided from the same two sources as indicated for register 48 plus the output of adder 44'. The vector storage registers may be addressed either directly from the memory address register 82, or from an incremented address supplied by one of the index registers, or as an address supplied by one of the FIFO locations.
The scalar registers 50 and 50' are suitably two separate eight location stores for storing scalars designated a0, b0, c0, d0, a1, b1, c1, d1, and e0, f0, g0, h0, e1, f1, g1, h1. Each location can be read and written in any one system clock cycle. Data read from register 50 has three possible destinations: It can be supplied as an input operand to multiplier 38 (via multiplexer 42), to adder 44 (via multiplexer 46), or to the format circuit 108 (via multiplexer 42) for output. Data read from scalar register 50' has two possible destinations: It can be supplied as an input operand to adder 44', or to format circuit 108 (via multiplexer 42) for output. Data written into scalar register 50 has two possible sources. It can come from the output of adder 44 or from register 78. Data written to scalar register 50 can be simultaneously written to the corresponding scalar register 50' locations. Data written to vector register 48 can be simultaneously written to vector register 48' locations. Data written into register 50' has the same two sources plus the output of adder 44'. Addressing of the scalar registers 50 and 50' can be accomplished as follows: The address can be input via address register 82 or supplied by counters 106 controlled by sequencer 100. Two counters are present, one being employed for reads and one for writes. The address can also be supplied by the sequencer during a collapse sequence (hereinafter more fully described).
Adders 44 and 44' are suitably WTL 1064 circuits manufactured by Weitek. Both adders are capable of performing addition and subtraction, with the subtrahend for subtraction comprising the multiplier output.
The arithmetic unit is further provided with a control register 88 and a status register 90 which can receive data from input register 78. Control bits specify the form of calculation to be performed by the arithmetic unit, e.g. a dot product form, a VMSA, or a VSMA. The control and sequencer accordingly provides the proper interconnections via the multiplexers to form one of the configurations as further illustrated in FIGS. 8, 9, and 10, and sequences the arithmetic unit through the steps as indicated for performing the particular calculation. Further bits supplied to the control register specify the adder operation to be either addition or subtraction. Bits also provide control of input of data to scalar registers 50, 50' and vector registers 48, 48', i.e. for determining whether a pair of these registers are written together or separately. As well as being controlled by the control register, the sequencer 100 is responsive to each write to the data pipe 112 for bringing about the desired multiply-add sequence. The status register 90 provides an indication for such factors as overflow, underflow, invalid operation, and denormalized operands. An indication is also given of the status of formatter conversions.
The formatter out circuit performs floating point data conversions from IEEE 64-bit floating point format, i.e. the format as employed by the devices employed in the multiplier units 38, 38' and adder units 44, 44', to the floating point format appropriate for processor 54. All data read from the vector registers 48, 48' or from the scalar registers 50, 50' go through the format out circuit before temporary storage in latch 110 coupled to memory read data bus connection 40b. Latch 110 can also receive the current contents of index counter 92, status register 90 and control register 88 on the connections marked X and Y.
FIG. 7 is a memory map of the system according to the present invention and consideration thereof will be useful in further explaining the operation of the system. The processor of FIG. 5 in the illustrated embodiment has a sixteen million word address space, indicated at 58' in FIG. 7, of which fifteen million words can be used to address the physical memory 56 in FIG. 5. The highest million words of the address space, 120, is divided into sixteen 64K word segments 122, one for each of fifteen arithmetic units 16-30 in FIG. 5. The last 64K address segment, 124, is a broadcast segment recognized by all the arithmetic units so that data or control can be directed simultaneously to the arithmetic units.
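Using the figures just given, sixteen million words of address space with the highest million words split into sixteen 64K-word segments, the last of which is the broadcast segment, the segment decoding can be sketched as follows (an illustrative Python model; the helper names are not taken from the patent):

```python
WORD_SPACE = 16 * 1024 * 1024           # sixteen million word address space
AU_REGION = WORD_SPACE - 1024 * 1024    # highest million words: arithmetic units
SEGMENT = 64 * 1024                     # one 64K-word segment per arithmetic unit

def decode(address):
    """Return ('memory', addr) for main memory, ('unit', n, offset) for one of
    the fifteen arithmetic units, or ('broadcast', offset) for the last segment."""
    if address < AU_REGION:
        return ("memory", address)
    segment, offset = divmod(address - AU_REGION, SEGMENT)
    if segment == 15:                    # the sixteenth segment reaches every unit
        return ("broadcast", offset)
    return ("unit", segment, offset)

print(decode(0x1234))                          # ('memory', 4660)
print(decode(AU_REGION + 3 * SEGMENT + 42))    # ('unit', 3, 42)
print(decode(AU_REGION + 15 * SEGMENT + 7))    # ('broadcast', 7)
```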
Inside a particular arithmetic unit, the lower 32K addresses are divided into eight 4K word blocks, one for each of the locations inside each vector register (48 and 48' in FIG. 6). These blocks are designated "Vector Register 0" to "Vector Register 7" in individual memory map 126. Each arithmetic unit also has addresses 128 for the scalar registers (50, 50' in FIG. 6), addresses 130 for status register 90 in FIG. 6, and addresses 132 for control register 88 in FIG. 6. The processor program writes data into, or reads results from, the arithmetic units in the same manner as with respect to the rest of the memory. The arithmetic units are, in effect, intelligent memory units since they both have computation and storage capability.
Writing to the control register address sets the vector form for the next computation. The vector register index address 134 sets the next location in the vector registers to be operated on, i.e. sets the index in index counter 92.
Writing data to the advance pipe (pipe 112 in FIG. 6) address actually causes a multiply-add to take place, i.e. a portion of the word is used as control, causing sequencer 100 to bring about a multiply-add procedure as hereinbelow more fully described.
The various addresses indicated at 136 in FIG. 7 pertain to the advance pipe 112 and to the control of the index used to address the vector registers for an ongoing calculation. After writing the vector register index to set the starting address for the respective vector registers, writing to "advance pipe and increment index address" sequences the arithmetic unit to do a multiply-add and advance to the next vector register element. Various combinations of advance pipe and increment index addresses are provided, together with collapse dot product, which pertains to an operation as hereinafter more fully described. It will be noted that the various registers of the arithmetic unit are accessible as memory as indicated by the map, with memory mapped processing being provided.

It will be understood the broadcast segment 124 in FIG. 7 is substantially the same as the individual memory map 126 except that this segment addresses all the arithmetic units at the same time. A stream of matrix elements are broadcast in this manner to the advance pipe of each arithmetic unit which does a multiply-add with the data which is stored locally (in a vector register).
Writing to the control register address in the broadcast segment 124 selects the vector form in all the arithmetic units at the same time, i.e. for dot product computation, VSMA, or VMSA. Next, the broadcast segment's vector register index address is written, to set the starting vector address for all the arithmetic units. Then, after copying different matrix rows into the respective vector registers, a matrix column element is broadcast and written into the broadcast segment's advance pipe and increment index address. This sequence causes all the arithmetic units to do a multiply-add and advance to next vector register elements. Since a large number of arithmetic units may be simultaneously employed, and each stores up to eight vectors representing eight matrix rows, a matrix vector product involving all the rows of a matrix or a large portion thereof can be very rapidly performed.
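From the processor's side, the sequence just described might look roughly like this (a hedged Python sketch; write_broadcast and the register names are purely illustrative stand-ins for writes into the broadcast segment):

```python
def broadcast_mvp_sequence(write_broadcast, columns, form="dot_product"):
    """write_broadcast(name, value) stands for a memory write to an address in
    the broadcast segment, i.e. the same register in every arithmetic unit."""
    write_broadcast("control", form)                  # select dot product / VSMA / VMSA
    for column in columns:                            # columns of the second matrix
        write_broadcast("vector_index", 0)            # starting vector-register address
        for element in column:                        # broadcast one element at a time
            write_broadcast("advance_pipe_and_increment", element)  # multiply-add in all units
        write_broadcast("collapse_dot_product", 0)    # fold partial sums when the column ends
```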
The plural matrix rows are temporarily stored in each arithmetic unit so that a given broadcast column element can be multiplied against each of the rows in a pipelined manner before another column element needs to be broadcast. Therefore, the communication time requirements for the system are made less stringent and delay is avoided. The arithmetic unit utilizes two arithmetic pipelines and the incoming column element can be multiplied against the appropriate elements for the locally stored rows in sequence. Moreover, the pipelines for the multiplier and adder are respectively seven and six clock cycles long whereby partial sums are accumulated in a pipeline before being added back into the scalar registers, with eight scalar register addresses storing parts of the accumulated sums for four dot product calculations. Four multiplications of four locally stored vectors are accomplished with respect to an incoming column element, and then the next incoming column element is multiplied with the same four locally stored rows, but the answer sums are displaced in the pipeline. At the end of the calculation, the partial sums are "collapsed" into the proper sums, i.e. in the scalar registers the data from address a0 is added to the data from address a1, the data from address b0 is added to the data from address b1, the data from address c0 is added to the data from address c1, and the data from address d0 is added to the data from address d1. After collapse, the answers are placed back at the current position of the index or at the ends of the vector registers to which the sums pertain. Of course, the sums can be read by the processor 54.
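The collapse step itself is a simple pairwise fold of the eight scalar-register locations (a minimal Python illustration; the register values shown are arbitrary placeholders):

```python
# Partial sums left in the eight scalar-register locations by the pipelined
# multiply-adds; the values here are arbitrary placeholders.
scalar_regs = {"a0": 1.5, "a1": 2.5, "b0": 0.5, "b1": 3.0,
               "c0": 4.0, "c1": 1.0, "d0": 2.0, "d1": 6.0}

def collapse(regs):
    """Fold the two displaced partial sums for each dot product into the proper
    sum: a0+a1, b0+b1, c0+c1, d0+d1."""
    return {name: regs[name + "0"] + regs[name + "1"] for name in "abcd"}

print(collapse(scalar_regs))   # {'a': 4.0, 'b': 3.5, 'c': 5.0, 'd': 8.0}
```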
If the automatic incrementation of the index is utilized as employed above, then successive locations are used in the vector registers. This is appropriate when operating on either full, banded, or profile matrices. However, if the vector register index is set before each multiply-add then indirect addressing techniques can be used for operating on random sparse matrices. In this case, the processor program reads the indices from a table of pointers in general memory and loads the index counter. The resulting loop operates at nearly the same efficiency as the full matrix case, i.e. operating on sparse matrices at substantially full speed.
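The sparse-matrix mode amounts to the following indirect-addressing loop (an illustrative Python sketch; the pointer-table layout is an assumption, not taken from the patent):

```python
def sparse_row_dot(values, col_indices, x):
    """Indirect addressing for one sparse row: the index is set from a table of
    pointers before each multiply-add instead of being auto-incremented."""
    s = 0.0
    for a, j in zip(values, col_indices):   # j plays the role of the loaded index value
        s += a * x[j]                       # multiply-add against the selected element
    return s

# A row with nonzeros only in columns 0, 3 and 7.
print(sparse_row_dot([2.0, 5.0, -1.0], [0, 3, 7], x=[1.0] * 8))   # 6.0
```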
FIGS. 8, 9 and 10 illustrate the modes of operation for the dot product type of calculation, the VSMA, and VMSA respectively. The dot product configuration has been generally considered above, wherein the arithmetic unit as illustrated in FIGS. 4 and 6 essentially provides for two simultaneous dot product computations. Pursuant to the control data loaded to the particular arithmetic unit, multiplexer connections are made to configure each half of the unit to the circuit as illustrated in FIG. 8, wherein column matrix elements arrive on data pipe or advance pipe 112 and are multiplied with successive vector row elements from vector register 48. After multiplications, running sums are accumulated in scalar register 50 through the operation of adder 44 to form the dot product partial sums. As hereinbefore described, after collapse, the sums are stored back into vector register 48. It should be noted that the remaining half of each unit (the right half of FIG. 6) operates in the same manner, e.g. receiving the same input on data pipe 112, multiplying with row vectors from register 48' and storing results in register 48'.
On the other hand, for the VSMA form of calculation (as illustrated in FIG. 9), scalars from register 50 are successively multiplied with elements of the incoming column in multiplier 38 and added to local vectors from vector register 48. The vector results are stored in vector register 48'.
In the VMSA form, the control causes the appropriate multiplexer connections to be made so that column elements on advance pipe 112 are multiplied with locally stored vector elements from vector register 48 and the products are added in adder 44 to locally stored scalars. The resultants are then stored in vector register 48'. Since the versions illustrated in FIGS. 9 and 10 require vector registers from both halves of the arithmetic unit, the speed of calculation for the FIG. 9 and 10 forms is half that of the FIG. 8 form, i.e. eleven megaflops as compared with twenty-two megaflops for the FIG. 8 circuit.
The present system is adapted to perform arithmetic calculations with respect to complex numbers. Referring again to FIG. 6, the vector registers, for example registers A and B of register 48, are adapted to store respectively the real and imaginary parts of a particular vector. Similarly, registers C and D are adapted to store real and imaginary parts of a second vector. As hereinbefore mentioned, elements of a matrix column are provided successively to data pipe 112 from memory data bus 40a which are then retained for four clock cycles on the data pipe 112 for successive multiplications. In the case of a complex input, the real part of a data column element is first received for four cycles and then the imaginary part is received for four cycles. The desired product has a real part equalling the product of the real parts of the two operands minus the product of the imaginary parts of the two operands. Also, the imaginary part of the desired product equals the sum of the products of the real part of each operand multiplied by the imaginary part of the other. It will be seen the storage configuration facilitates the utilization of a complex element received on data pipe 112 for eight clock cycles (four cycles for the real part and four cycles for the imaginary part) whereby all of the necessary products for the final resultant are sequentially produced with the vectors stored in registers A, B, C and D.
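The complex product described above is built from four real multiplications performed in sequence (a minimal Python illustration, not part of the specification):

```python
def complex_multiply_accumulate(a_re, a_im, x_re, x_im):
    """One complex product formed from real operations, with the real part of the
    broadcast element (a_re) used first and the imaginary part (a_im) used next:
    (a_re + i*a_im) * (x_re + i*x_im)."""
    # Products formed while the real part a_re is held on the data pipe:
    real_part = a_re * x_re               # real * real
    imag_part = a_re * x_im               # real * imaginary
    # Products formed while the imaginary part a_im is held on the data pipe:
    real_part -= a_im * x_im              # minus imaginary * imaginary
    imag_part += a_im * x_re              # plus imaginary * real
    return real_part, imag_part

print(complex_multiply_accumulate(3.0, 4.0, 1.0, 2.0))   # (-5.0, 10.0)
print((3 + 4j) * (1 + 2j))                                # (-5+10j) for comparison
```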
It is again understood that the reference herein to columns and rows is substantially interchangeable and, for example, a broadcast row may be presented for calculation with locally stored column vectors if so desired.
While a preferred embodiment of the present invention has been shown and described, it will be apparent to those skilled in the art that many changes and modifications may be made without departing from the invention in its broader aspects. The appended claims are therefore intended to cover all such changes and modifications as fall within the true spirit and scope of the invention.

Claims (23)

The Embodiments of the Invention in which an Exclusive Property or Privilege is Claimed are Defined as Follows:
1. A parallel processing system comprising:
a central processor unit, a memory system, memory bus means intercoupling said memory system with said central processor unit, and a plurality of memory mapped arithmetic units also intercoupled with said memory bus means whereby said arithmetic units are addressable for writing data into selectable arithmetic units and for reading data from selectable arithmetic units.
2. The system according to claim 1 wherein writing of data to a selected arithmetic unit at a predetermined address initiates an arithmetic operation and retrieval of data from said selected arithmetic unit at a predetermined address provides the result of said arithmetic operation.
3. The system according to claim 1 wherein said arithmetic units are respectively provided with a plurality of registers that are accessible via said memory bus means according to separate addresses thereof.
4. The system according to claim 1 wherein more than one of said arithmetic units are addressed to receive the same operand at substantially the same time to perform substantially simultaneous computations with respect thereto.
5. A parallel processing system comprising:
a central processing unit, a memory system, and a plurality of memory mapped arithmetic units interconnected with said central processing unit and memory system for performing substantially concurrent calculations with respect to data received thereby.
6. The system according to claim 5 wherein more than one of said arithmetic units are addressed by the same address value for receiving identical data and performing calculations with respect thereto in conjunction with non-identical data as theretofore received by said more than one of said arithmetic units.
7. The system according to claim 6 wherein said arithmetic units are provided with storage for retaining said non-identical data and wherein said arithmetic units are adapted to perform multiple calculations on different portions of the retained data in conjunction with the said same data as provided to said more than one of said arithmetic units.
8. The system according to claim 7 wherein said arithmetic units are pipelined to provide said multiple calculations in a serial manner.
9. The system according to claim 6 wherein said arithmetic units are responsive to group addresses for accessing said more than one of said arithmetic units in order to receive the same data, and wherein the same arithmetic units are responsive to individual address for accessing selected individual arithmetic units in order to receive non-identical data.
10. The system according to claim 6 wherein said arithmetic units are respectively provided with plural registers which are individually addressable and at least some which are addressable by groups including registers in a plurality of said arithmetic units.
11. A parallel processing system comprising:
a central processor unit, a plurality of arithmetic units operable from said central processor unit, and bus means intercoupling said central processor unit with said plurality of arithmetic units, wherein said arithmetic units are selectively addressable from said central processor unit and are selectively addressable by one or more groups so that a group is enabled to receive the same data as an input operand.
12. A parallel processing system comprising:
a plurality of arithmetic units each capable of performing a multiply-add operation on input data, each of said arithmetic units also including register means for storing input data, and bus means for interconnecting said arithmetic units wherein the arithmetic units have unique addresses so that individual arithmetic units can be accessed and having group addresses for accessing plural arithmetic units.
13. The system according to claim 12 including means for providing a common operand to a group of said arithmetic units by assertion of a group address and for controlling said group of arithmetic units to perform a calculation on said common operand in conjunction with data individually stored in register means of said arithmetic units.
14. The system according to claim 13 wherein said means for providing a common operand and for controlling said group of arithmetic units comprises a central processing unit, and wherein said bus means comprises memory bus means associated with said central processing unit.
15. The system according to claim 12 wherein said arithmetic units are controllable via said bus means to perform a matrix vector product defined by the expression y=A*x where y is a vector with m elements, A is a matrix with m rows and n columns, and x is a vector with n elements, an x vector being stored in said register means of an arithmetic unit, and A being provided by rows and element by element within rows on said bus means to said arithmetic units.
16. The system according to claim 15 wherein a said register means of an arithmetic unit stores plural vectors x and wherein an arithmetic unit receives an element of A and performs calculations relative to the plural vectors x before receiving another element of A.
17. The system according to claim 15 wherein a single element of y is accumulated in an arithmetic unit as the sum of the element by element products of a row of A against the vector x, comprising the dot product, and wherein m dot products are completed involving the same vector x with successive rows of A.
18. The system according to claim 12 wherein each arithmetic unit of said plural arithmetic units is adapted to store a different x vector in said register means thereof for multiplication with rows of the A matrix supplied to said plural arithmetic units.
19. The system according to claim 18 wherein each arithmetic unit of said plural arithmetic units stores plural x vectors in said register means thereof for multiplication with rows of said A matrix.
20. The system according to claim 12 wherein the register means of each arithmetic unit comprises at least one vector register and at least one scalar register, each being coupled for receiving data from said bus means, said arithmetic unit further comprising a multiplier for receiving one input from said bus means and a second input from said register means, and an adder for receiving one input from said multiplier and one input from said register means, the output of said adder being selectively provided to said register means.
21. The system according to claim 20 wherein said arithmetic unit includes a second, similarly coupled multiplier, and a second, similarly coupled adder.
22. The system according to claim 20 wherein said register means are adapted to store complex numbers with real and imaginary parts in successive locations, said multiplier and said adder being employed for performing successive multiplications and additions to provide complex results.
23. The system according to claim 12 wherein each arithmetic unit includes an index means for addressing said register means, said index means being coupled to said bus means, said index means being selectively incremented for accessing successive vector elements, and said index means being selectively addressable to provide addresses which access other than successive vector elements.
CA000490758A 1984-12-03 1985-09-13 Parallel processing system Expired CA1236584A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US67753584A 1984-12-03 1984-12-03
US677,535 1984-12-03

Publications (1)

Publication Number Publication Date
CA1236584A true CA1236584A (en) 1988-05-10

Family

ID=24719103

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000490758A Expired CA1236584A (en) 1984-12-03 1985-09-13 Parallel processing system

Country Status (1)

Country Link
CA (1) CA1236584A (en)


Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9842271B2 (en) 2013-05-23 2017-12-12 Linear Algebra Technologies Limited Corner detection
US11605212B2 (en) 2013-05-23 2023-03-14 Movidius Limited Corner detection
US11062165B2 (en) 2013-05-23 2021-07-13 Movidius Limited Corner detection
US10521238B2 (en) 2013-08-08 2019-12-31 Movidius Limited Apparatus, systems, and methods for low power computational imaging
US11567780B2 (en) 2013-08-08 2023-01-31 Movidius Limited Apparatus, systems, and methods for providing computational imaging pipeline
US9910675B2 (en) 2013-08-08 2018-03-06 Linear Algebra Technologies Limited Apparatus, systems, and methods for low power computational imaging
US9934043B2 (en) 2013-08-08 2018-04-03 Linear Algebra Technologies Limited Apparatus, systems, and methods for providing computational imaging pipeline
US10001993B2 (en) 2013-08-08 2018-06-19 Linear Algebra Technologies Limited Variable-length instruction buffer management
US10360040B2 (en) 2013-08-08 2019-07-23 Movidius, LTD. Apparatus, systems, and methods for providing computational imaging pipeline
US11768689B2 (en) 2013-08-08 2023-09-26 Movidius Limited Apparatus, systems, and methods for low power computational imaging
US9146747B2 (en) 2013-08-08 2015-09-29 Linear Algebra Technologies Limited Apparatus, systems, and methods for providing configurable computational imaging pipeline
US10572252B2 (en) 2013-08-08 2020-02-25 Movidius Limited Variable-length instruction buffer management
US11579872B2 (en) 2013-08-08 2023-02-14 Movidius Limited Variable-length instruction buffer management
US11042382B2 (en) 2013-08-08 2021-06-22 Movidius Limited Apparatus, systems, and methods for providing computational imaging pipeline
US9727113B2 (en) 2013-08-08 2017-08-08 Linear Algebra Technologies Limited Low power computational imaging
US11188343B2 (en) 2013-08-08 2021-11-30 Movidius Limited Apparatus, systems, and methods for low power computational imaging
US9196017B2 (en) 2013-11-15 2015-11-24 Linear Algebra Technologies Limited Apparatus, systems, and methods for removing noise from an image
US9270872B2 (en) 2013-11-26 2016-02-23 Linear Algebra Technologies Limited Apparatus, systems, and methods for removing shading effect from image
US10460704B2 (en) 2016-04-01 2019-10-29 Movidius Limited Systems and methods for head-mounted display adapted to human visual mechanism
US10949947B2 (en) 2017-12-29 2021-03-16 Intel Corporation Foveated image rendering for head-mounted display devices
US11682106B2 (en) 2017-12-29 2023-06-20 Intel Corporation Foveated image rendering for head-mounted display devices

Similar Documents

Publication Publication Date Title
US5226171A (en) Parallel vector processing system for individual and broadcast distribution of operands and control information
US5081573A (en) Parallel processing system
CA1175576A (en) Data processing system for vector operations
EP0075593B1 (en) A bit slice microprogrammable processor for signal processing applications
Haviland et al. A CORDIC arithmetic processor chip
US4075704A (en) Floating point data processor for high speech operation
US4179734A (en) Floating point data processor having fast access memory means
US5175863A (en) Signal data processing system having independently, simultaneously operable alu and macu
EP0100511B1 (en) Processor for fast multiplication
US4748579A (en) Method and circuit for performing discrete transforms
EP0085520B1 (en) An array processor architecture utilizing modular elemental processors
US4228498A (en) Multibus processor for increasing execution speed using a pipeline effect
US5179714A (en) Parallel bit serial data processor
EP0211614A2 (en) Loop control mechanism for a scientific processor
JP2000503427A (en) Image processing processor
JPH0562387B2 (en)
US4769779A (en) Systolic complex multiplier
CA1236584A (en) Parallel processing system
Chung Prefix computations on a generalized mesh-connected computer with multiple buses
Senzig et al. Computer organization for array processing
Antonsson et al. PICAP–a system approach to image processing
Shively Architecture of a programmable digital signal processor
McKinney et al. A multiple-access pipeline architecture for digital signal processing
Scherson et al. Multi-operand associative arithmetic
SU1532949A1 (en) Image treating processor

Legal Events

Date Code Title Description
MKEX Expiry