WO1992000560A1

WO1992000560A1 - A generalised systolic array serial floating point adder and accumulator

Info

Publication number: WO1992000560A1
Application number: PCT/AU1991/000284
Authority: WO
Inventors: Warren Marwood
Original assignee: Luminis Pty. Ltd.
Priority date: 1990-06-29
Filing date: 1991-07-01
Publication date: 1992-01-09

Abstract

The invention is a heterogeneous array structure created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations. The output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder. In the case of a systolic ring accumulator it consists of a finite state machine (22) and a systolic de-normalisation array (21). Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude. In the preferred embodiment of an accumulator (20) sign-magnitude mantissae and two's complement exponent ordered number pairs are used. The only fixed aspects of the systolic ring are the arithmetic blocks. The length of the delay block (25) is determined by the exponent length in the number representation. The number of systolic de-normalisation cells (27) in the ring can range from a minimum of one. The number of recurrence cells and the number base of the characters in the floating point format determine the performance characteristics of the accumulator. The invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.

Description

"A GENERALISED SYSTOLIC ARRAY SERIAL FLOATING POINT

ADDER AND ACCUMULATOR"

This invention relates to floating-point accumulators and adders and in particular to serial systolic array floating point accumulators and adders.

BACKGROUND OF THE INVENTION

Fixed point adders and accumulators are implemented simply by a single adder and a carry storage register for serial implementations, and an array of adders and carry storage registers for parallel implementations. For n-bit numbers an adder is typically a factor of n less complex than a multiplier. This situation is no longer the case when floating point operations are considered. A floating point multiplier is not substantially more complex than its fixed point counterpart whereas the floating point adder is significantly more complex than the fixed point equivalent. The reason for this complexity is evident when a floating point addition or accumulation algorithm is considered and compared with a fixed point addition or accumulation algorithm. Examples are:

Fixed point accumulation of a fixed point number Z with a fixed point accumulator value at time n of A_n is carried out by the simple operation of integer addition, i.e.

A_{n+1} = A_n + Z (1)
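By way of illustration only, the serial fixed point case (a single adder and a carry storage register) can be sketched in C as follows. The names serial_adder, serial_add_bit and serial_accumulate are hypothetical and do not appear in the specification; a one-bit character and least significant bit first ordering are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* The single carry storage register of a bit-serial adder. */
typedef struct {
    unsigned carry;
} serial_adder;

/* One clock period: add one bit of each operand plus the stored carry. */
unsigned serial_add_bit(serial_adder *s, unsigned a, unsigned z)
{
    unsigned sum = a ^ z ^ s->carry;
    s->carry = (a & z) | (a & s->carry) | (z & s->carry);
    return sum;
}

/* A_{n+1} = A_n + Z, computed least significant bit first, modulo 2^nbits. */
uint32_t serial_accumulate(uint32_t a, uint32_t z, unsigned nbits)
{
    serial_adder s = { 0 };
    uint32_t result = 0;

    assert(nbits >= 1 && nbits <= 32);
    for (unsigned i = 0; i < nbits; i++) {
        unsigned bit = serial_add_bit(&s, (a >> i) & 1u, (z >> i) & 1u);
        result |= (uint32_t)bit << i;
    }
    return result;
}
```

The only state carried between clock periods is the carry bit, which is why the serial fixed point accumulator is so small compared with the floating point case discussed next.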

To discuss floating point addition or accumulation it is necessary to define the number system. A floating point number F is composed of two parts, a fractional mantissa F_m and an integral exponent F_e, and can be represented as the 2-tuple

F = {F_e, F_m} (2)

The real number representation R of this floating point representation is

R = F_m · b^{F_e} (3)

where b is the base of both F_m and F_e and

The floating point accumulation of a floating point number {Z_e, Z_m} with a floating point accumulator value at time n of {A_{e,n}, A_{m,n}} to form the new accumulator value {A_{e,n+1}, A_{m,n+1}} at time n + 1 is performed by the following algorithm:

[Equations (4) to (9)]

where Max_exp is the maximum exponent value in the particular format, Min_exp is the minimum exponent value, b is the number base of the floating-point representation, [.] denotes the integer part of its argument and sign is the sign of the operation.

This algorithm can be considered representative of the way in which addition or accumulation is performed in conventional computing hardware.

Some discussion of this algorithm is warranted. Equation (4) represents the shifting of the operand mantissa which has the smallest exponent by a number of digit places equal to the difference in the exponents, followed by the summation of the shifted operands. This temporary result A'_{m,n+1} is conditionally left or right shifted according to its value. Three possibilities exist. The first is a result which is smaller than the lower bound of the defined range for the fractional part of the representation. In this case the operation has caused a loss of precision known as catastrophic cancellation. Leading zeroes are introduced into the representation which must be removed by left shifting the result. This operation is known as post-normalisation. The second possibility is that the result is larger than the upper bound of the defined range of the mantissa. In this case the result is right shifted one place to restore it to the defined range. This condition is mantissa overflow. The final possibility for the result is that it falls within the defined range of the representation, in which case no shifts are required. Equation (5) expresses these three conditions in mathematical form. Equation (6) defines the exponent of the temporary result A'_{e,n+1}. This exponent value is modified by any shifts which are performed upon the mantissa to preserve the real value of the 2-tuple. Additive corrections to this exponent value are defined by equation (7). The corrections appear as additions for the exponents whereas multiplications, or shifts, are performed in the case of the mantissa. Equations (8) and (9) set flags which indicate whether the result has exceeded the floating point representation at either end of its dynamic range.
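The steps discussed above (alignment, summation, conditional shifting, exponent correction and range flags) can be sketched in C as follows. This is an illustrative sketch under stated assumptions and is not a transcription of equations (4) to (9): base b = 2 is assumed, the mantissa is held as a signed integer scaled by 2^M_BITS with normalised range 1/2 <= |F_m| < 1, shifted-out bits are truncated, and M_BITS, MAX_EXP, MIN_EXP, fp_t, fp_make, fp_shr and fp_add are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define M_BITS  8      /* hypothetical mantissa length, base b = 2 */
#define MAX_EXP 7      /* hypothetical exponent range */
#define MIN_EXP (-8)

typedef struct {
    int     exp;       /* integral exponent */
    int32_t mant;      /* mantissa scaled by 2^M_BITS, sign included */
    int     overflow;  /* set when the exponent exceeds MAX_EXP */
    int     underflow; /* set when the exponent falls below MIN_EXP */
} fp_t;

fp_t fp_make(int exp, int32_t mant)
{
    fp_t f = { exp, mant, 0, 0 };
    return f;
}

/* Shift the magnitude right, truncating towards zero. */
int32_t fp_shr(int32_t m, int places)
{
    int32_t mag = m < 0 ? -m : m;
    mag = places >= 31 ? 0 : mag >> places;
    return m < 0 ? -mag : mag;
}

fp_t fp_add(fp_t a, fp_t z)
{
    fp_t r = fp_make(0, 0);
    int d = a.exp - z.exp;

    /* Align the operand with the smaller exponent, then sum. */
    if (d >= 0) { r.exp = a.exp; r.mant = a.mant + fp_shr(z.mant, d); }
    else        { r.exp = z.exp; r.mant = fp_shr(a.mant, -d) + z.mant; }

    int neg = r.mant < 0;
    int32_t mag = neg ? -r.mant : r.mant;

    if (mag >= (1 << M_BITS)) {
        /* Mantissa overflow: one right shift, exponent incremented. */
        mag >>= 1;
        r.exp += 1;
    } else {
        /* Post-normalisation: remove leading zeroes by left shifts. */
        while (mag != 0 && mag < (1 << (M_BITS - 1))) {
            mag <<= 1;
            r.exp -= 1;
        }
    }
    r.mant = neg ? -mag : mag;

    /* Flags for either end of the dynamic range. */
    r.overflow  = r.exp > MAX_EXP;
    r.underflow = r.exp < MIN_EXP;
    return r;
}
```

For example, adding 0.75 and -0.625 with equal exponents leaves the small result 0.125, which this conventional scheme must left shift twice before it is stored.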

Existing techniques for the design and construction of floating point adders and accumulators are broadly categorised as parallel or serial. The parallel architectures are intended for low latency designs. An example is the work of OWEN, R.E., "A 15 nanosecond complex multiplier-accumulator for FFT's", ICASSP'87, CH2396-0/87/0000-0527 pp. 527-530, 1987. For system architectures in which longer latencies can be tolerated, serial architectures are used to advantage and an example is CHAU, P.M., KAY, C.C. and KU, W.H., "A bit-serial floating-point complex multiplier-accumulator for fault-tolerant digital signal processing arrays", ICASSP'87, CH2396-0/87/0000-0483 pp. 483-486, 1987.

SUMMARY OF THE INVENTION

In its broadest form the invention comprises a systolic array floating point adder for accepting sequential pairs of real numbers Z₁ and Z₂ in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Z_e, Z_m}, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, and the adder outputs a character sequence A which is the floating point representation {A_f, A_e, A_m} of the addition of said real numbers, wherein the adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalization array adapted to receive the output of said finite state machine and to output a denormalized floating point number, a second finite state machine adapted to receive the output from the de-normalisation array and to output the floating point sum of sequential pairs of accepted real numbers.

In a further aspect of the invention the serial floating point adder has a mode character sequence entered in parallel with the 2-tuples to identify to the adder the fields Z_e and Z_m of the floating point representations, and the de-normalisation array further comprises at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.

In yet a further aspect the invention in its broadest form comprises a systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of the real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a de-normalisation array adapted to receive the second output of the finite state machine and to output at least partially de-normalised floating point numbers to the second input of the finite state machine and in the configuration to form a ring, wherein, during the first state the finite state machine is adapted to control the ring wherein the number Z in floating point format is input to the ring through the finite state machine first input and the accumulator output A is output from the ring from the finite state machine first output, and during the second state the finite state machine is adapted to transfer at least partially de-normalised floating point numbers from its second input to its second output, to control the number of times the transfer occurs and to add aligned floating point numbers.

In yet a further aspect of the invention a systolic ring serial floating point accumulator has a finite state machine which further comprises an arithmetic logic unit ALU_1 having as first and second inputs the finite state machine first and second inputs and having as its first output the finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive the ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output the finite state machine second output, the de-normalisation array further comprising at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order, the ring comprising a character sequence path formed from a serial configuration of, the ALU_1, the linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Z_e, the ALU_2 and the de-normalisation array.

A further aspect of the invention provides a systolic ring serial floating point accumulator in which the real numbers in floating point format are represented as a triplet having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number and an initialization flag character, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Z_f, Z_e and Z_m, and the accumulator output is a character sequence A which is the floating point representation {A_f, A_e, A_m} of the accumulation of the real numbers, whereby the ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, the first field D_e being representative of the difference between the accumulator exponent A_e and the input exponent Z_e, and the second field Z_m being representative of the Z mantissa value, and a mode register which contains the mode characters.

According to a further aspect of the invention the ring of the serial floating point accumulator further comprises, a connection means to connect the ALU_1 to the ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value D_e. In a further aspect of this invention at least one delay cell is added into the ring to increase the number of data characters in the floating point representation without increasing the number of systolic cells and thereby achieve the processing of operands with either increased precision or dynamic range.

In an embodiment of the invention, a heterogeneous array structure is created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations. The output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder. In the case of a systolic ring accumulator it consists of a finite state machine and a systolic de-normalisation array.

Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude. In the preferred embodiment of an accumulator sign-magnitude mantissae and two's complement exponent ordered number pairs are used. The only fixed aspects of the systolic ring are the arithmetic blocks. The length of the delay block is determined by the exponent length in the number representation. The number of systolic de-normalisation cells in the ring can range from a minimum of one to a maximum of m where m is the number of characters in the mantissa of the number representation. The number of recurrence cells determines the performance characteristics of the accumulator.

The invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.

Further aspects of this invention include:

(i) the reduction of the complexity of the problem of constructing floating point adders and accumulators by using:

(a) replicated cell structures to implement recurrences which de-normalise mantissae;

(b) novel circuitry to implement in a serial pipelined fashion both the incrementing of an exponent difference and the conditional de-normalisation of an associated mantissa.

(ii) the depiction of the use of systolic de-normalisation cells interconnected with state memory stages in a linear array or systolic ring structure to construct either an adder capable of variable dynamic range or an accumulator capable of both variable precision and variable dynamic range.

(iii) the depiction of the construction of a systolic ring accumulator with a minimum gate complexity, consisting of an I/O multiplexer, two arithmetic logic units each containing a state machine, an array of delay stages and at least one computational cell representable by recurrences. The computational cell further comprises the registers required to store the operands, a state storage register, one control storage register and an adder.

(iv) the depiction of the design of a generic accumulator capable of providing a broad range of performance specifications by varying the number of computational cells and/or the number of delay cells in the systolic de-normalisation ring and the array of delay cells. Varying the number base of the characters in the floating point format also provides a further means for controlling the execution time of the accumulator.

To further describe the invention, preferred embodiments will now be given, however, it will be apparent that variations will be possible without departing from the inventive matter disclosed. This is especially so since such variations are within the ordinary skill of the practitioner of digital design techniques.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments are described hereunder in some detail with reference to and as illustrated in the accompanying drawings in which:

Figure 1 depicts a state diagram for the first logic element or datapath.

Figure 2 depicts a state diagram for the second logic element or datapath.

Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.

Figure 4 depicts a schematic of a systolic ring accumulator in which the elements are considered to be lumped. This clarifies the logical function of the array and highlights the distributed nature of the registers. Each register is associated with one of the re-circulating arrows. Naming conventions correspond to the simulation code of figure 12.

Figure 5 depicts a schematic of the systolic de-normalisation cell norm_cell(). Variable names in brackets refer to the nomenclature of the 'C' simulation program given in a later figure.

Figure 6 depicts a schematic of an array of delay cells which form a component of the systolic ring.

Figure 7 depicts a schematic representation of the input/output multiplexer and (as implemented) a one-bit microcoded datapath. Although implemented as a one-bit per character device, the architecture can be constructed with multi-bit characters.

Figure 8 is a schematic diagram of the state generation and storage circuitry in the first logic cell Logic_1().

Figure 9 is a schematic diagram of the control signal generation for the first logic cell Logic_1(), with naming conventions as for figure 12.

Figure 10 is a schematic diagram of both the control signal generation and a block schematic diagram for the second logic cell Logic_2().

Figure 11 is a schematic diagram of two systolic rings which have coalesced to form a single, extended precision accumulator. To extend the dynamic range, additional delay cells must be placed before Logic_2(). The multiplexer for the second ring is controlled by the controller of the first ring, and the second occurrence of Logic_2() is not included in the ring.

Figure 12 is 'C' code which simulates a systolic ring accumulator.

DETAILED DESCRIPTION OF THE INVENTION

Based upon the following addition or accumulation technique, this patent describes a simpler implementation of floating point addition or accumulation than that detailed previously. Thus there is provided according to the invention both a linear systolic array serial floating point adder and a circular systolic array serial floating point accumulator. For simplicity, only the systolic ring accumulator is described as the linear adder is obvious from the description of the ring accumulator.

Equations (10) to (13) are significantly simpler than the conventional set given in equations (4) to (9). This simplicity is partly due to the lack of testing for overflow and underflow. Put simply, the exponent register of the accumulator can be made sufficiently long to accommodate the accumulation of sequences of numbers, where the length of the sequences is less than or equal to some arbitrarily chosen maximum length, without reaching the overflow or underflow condition. It is a straightforward design exercise to provide guard digits in the exponent register to satisfy this requirement.

The second simplification is not obvious and is not part of floating point standards. It omits the post-normalisation of the sum. It is applicable to the floating point addition of two or more normalised numbers and allows post normalisation to be done only at the end of the completed summation, so effecting considerable savings in the case of long sequences.
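The effect of omitting post-normalisation can be sketched in C as follows. This is a minimal sketch under stated assumptions rather than a transcription of equations (10) to (13): base 2 is assumed, the mantissa is held as a signed integer for brevity (the preferred embodiment uses sign-magnitude), the mantissa overflow correction is applied immediately rather than on the following accumulation, and UM_BITS, ufp_t, ufp_shr, ufp_accumulate and ufp_normalise are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define UM_BITS 8   /* hypothetical mantissa length, base 2 */

typedef struct {
    int     exp;    /* exponent, assumed to have enough guard digits */
    int32_t mant;   /* mantissa scaled by 2^UM_BITS, sign included */
} ufp_t;

/* Shift the magnitude right, truncating towards zero. */
int32_t ufp_shr(int32_t m, int places)
{
    int32_t mag = m < 0 ? -m : m;
    mag = places >= 31 ? 0 : mag >> places;
    return m < 0 ? -mag : mag;
}

/* One accumulation step: align, add, correct overflow. No left shifts. */
ufp_t ufp_accumulate(ufp_t acc, ufp_t z)
{
    int d = acc.exp - z.exp;
    int32_t sum, mag;

    if (d >= 0) {
        sum = acc.mant + ufp_shr(z.mant, d);   /* de-normalise Z */
    } else {
        sum = ufp_shr(acc.mant, -d) + z.mant;  /* Z has the larger exponent */
        acc.exp = z.exp;
    }
    mag = sum < 0 ? -sum : sum;
    if (mag >= (1 << UM_BITS)) {               /* mantissa overflow */
        sum = ufp_shr(sum, 1);
        acc.exp += 1;
    }
    acc.mant = sum;
    return acc;
}

/* Post-normalisation, applied once when the summation is complete. */
ufp_t ufp_normalise(ufp_t acc)
{
    int neg = acc.mant < 0;
    int32_t mag = neg ? -acc.mant : acc.mant;

    while (mag != 0 && mag < (1 << (UM_BITS - 1))) {
        mag <<= 1;
        acc.exp -= 1;
    }
    acc.mant = neg ? -mag : mag;
    return acc;
}
```

In this sketch a long sequence costs one call to ufp_normalise() in total, instead of one normalisation per operand.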

Consider that the error-free numbers are A and Z, and that their floating point representations A* and Z* introduce errors e_A and e_z such that

The maximum relative error E when forming the sum of these numbers occurs when they have opposite sign. This worst-case relative error is approximated by

The significance of equation (16) is that the error is formed in equations (4) and (11). The post-normalisation process of equation (5) does not alter the error in the sum, and as a consequence the operation may be omitted without significantly altering the error behaviour of the accumulation process. A benefit of this approach for summation is that when the summation is complete the number of leading zeroes in the accumulator may give an estimate of the lower bound to the error in the result.

Expanding equations (10) to (13) gives the following relations:

where m is the number of characters in the mantissa of the floating point representation.

In an embodiment of the invention which reflects the previous relations a systolic ring serial floating point accumulator 20 is shown in figures 3 and 4. Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.

Figure 4 depicts a schematic representation of a systolic de-normalisation array 21 and a finite state machine 22. The array 21 de-normalises the Z mantissa by D_e characters when D_e is less than the mantissa length m, or by m characters when D_e is greater than or equal to m, in which case the value of the Z mantissa becomes zero. This alignment of the Z mantissa to the accumulator mantissa in the floating point representation, which precedes their addition, is defined by equation (21). The finite state machine 22 implements equations (17) to (24) with the exclusion of equation (21). The finite state machine 22 consists of a controller 23 and an arithmetic logic unit (ALU_1) 24 which is described in figure 12 in the form of C simulation code as the function logic_1(), a linear array of delay cells 25 described in figure 12 as shiftv() and a second arithmetic logic unit (ALU_2) 26 described in figure 12 as logic_2(). It should be noted that the nomenclature and connectivity used in figure 4 relate directly to the C simulation code of figure 12 and it is therefore apparent that the figure does not represent a minimum configuration of the invention.
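The de-normalisation performed by the array, and its relation to the number of systolic cells, can be sketched in C as follows. This sketch is not the simulation code of figure 12. It assumes one bit per character, an exponent difference held as a value that is negative while shifts remain and is incremented by a circulating constant of one, and a fixed number of circulations equal to the mantissa length divided by the number of cells, rounded up. The names z_reg, denorm_cell, denorm_circulations and denorm_ring are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int      d_e;     /* exponent difference; negative while shifts remain (assumed) */
    uint32_t z_mant;  /* magnitude of the Z mantissa, one bit per character */
} z_reg;

/* One pass through a single systolic de-normalisation cell. */
z_reg denorm_cell(z_reg z)
{
    if (z.d_e < 0) {
        z.z_mant >>= 1;   /* shift the mantissa one character towards the LSB */
        z.d_e += 1;       /* add the circulating constant, assumed to be 1 */
    }
    return z;
}

/* Circulations needed so that up to m single-character shifts can occur. */
unsigned denorm_circulations(unsigned m, unsigned cells)
{
    assert(cells >= 1);
    return (m + cells - 1) / cells;
}

/* Circulate the Z register around a ring containing `cells` systolic cells. */
z_reg denorm_ring(z_reg z, unsigned m, unsigned cells)
{
    unsigned circulations = denorm_circulations(m, cells);

    for (unsigned c = 0; c < circulations; c++)
        for (unsigned k = 0; k < cells; k++)
            z = denorm_cell(z);
    return z;
}
```

For a 6-character mantissa the sketch gives six circulations with one cell and two with three cells; adding the final circulation that adjusts for overflow yields the seven and three circulations quoted for figure 3.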

A first input to the accumulator 20 is presented sequentially with a series of floating point representations of real numbers Z consisting of triplets having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number. An initialization flag character is also part of the descriptor. However, Z_f may or may not be used in one or other of the embodiments described hereafter. Z_e is a character sequence representing the exponent of the real number Z and Z_m is a character sequence representing the mantissa of the real number Z. A mode signal entered in parallel with the triplet through a second input identifies which of the fields Z_f, Z_e and Z_m are being input at any one time. In this implementation an additional character sequence C is also entered through a third input in parallel with the triplet as a constant to be used to increment the exponent difference D_e of equation (18). A further input shown in figure 4 is reset, which is used in the C simulation program to reset the simulated controller 23 and simulated ALU_2 26.

A first output from the accumulator consists of a status signal busy used to indicate when the accumulator may or may not accept inputs. An additional output provides a character sequence A which is the floating point representation {A_f, A_e, A_m} of the accumulation of the real numbers Z. A further output consists of a mode output signal which identifies the elements of the triplets. In this embodiment there is a final output Load which is derived from the initialization flag character present in the Z_f field of the input triplets.

These inputs and outputs collectively form the first inputs and outputs of the finite state machine 22.

The second output of the finite state machine 22 connects to the input of the systolic de-normalisation array 21 whose output is connected to a second input of the finite state machine 22 to form a systolic ring of four registers: a Z register of at least two fields representative of the exponent difference D_e, equal to the difference between the accumulator exponent A_e and Z_e, and the Z mantissa value Z_m, an A register of at least two fields representative of exponent and mantissa of the A operand, in which the accumulation result is stored, a mode register which contains said mode signal, and a C register which contains a constant value which is circulated around the ring.

An internal connection between ALU_1 24 and ALU_2 26 denoted as sig in figures 4 and 12 is used as a control signal path to implement the conditional assignments in ALU_2 of equations (19) and (20).

The following table details the data structure for both the serial operands and the associated mode bit. The operands are entered into the accumulator least significant character or least significant bit (LSB) first. State machines decode the different fields within the finite state machine controller and ALU_2.

                 Mantissa                      Exponent
OPERAND:  Guard  msb . . . lsb  sign  msb . . . lsb  zero_flag  load_bit
MODE:     0      1   . . . 1    0     0   . . . 0    0          1

TABLE 1: Data format and associated MODE word.
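The format of Table 1 can be illustrated by a short C sketch which builds the serial operand and MODE sequences in the order in which they would be clocked in, least significant character first. The field lengths follow the particular case of figure 3 (6 mantissa characters, 4 exponent characters, one bit per character); the guard character value of zero, the type serial_word and the function pack_operand are assumptions introduced for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define MANT_BITS 6                            /* as in the case of figure 3 */
#define EXP_BITS  4
#define WORD_BITS (MANT_BITS + EXP_BITS + 4)   /* plus guard, sign, zero_flag, load_bit */

typedef struct {
    uint8_t data[WORD_BITS];   /* data[0] is the first character clocked in */
    uint8_t mode[WORD_BITS];   /* MODE bit accompanying each character */
} serial_word;

serial_word pack_operand(unsigned mant, unsigned sign, unsigned exp,
                         unsigned zero_flag, unsigned load_bit)
{
    serial_word w;
    unsigned n = 0;

    assert(mant < (1u << MANT_BITS) && exp < (1u << EXP_BITS));

    w.data[n] = load_bit & 1u;   w.mode[n++] = 1;   /* load_bit, MODE = 1 */
    w.data[n] = zero_flag & 1u;  w.mode[n++] = 0;   /* zero_flag, MODE = 0 */
    for (unsigned i = 0; i < EXP_BITS; i++) {       /* exponent, lsb first */
        w.data[n] = (exp >> i) & 1u;
        w.mode[n++] = 0;
    }
    w.data[n] = sign & 1u;       w.mode[n++] = 0;   /* mantissa sign */
    for (unsigned i = 0; i < MANT_BITS; i++) {      /* mantissa, lsb first */
        w.data[n] = (mant >> i) & 1u;
        w.mode[n++] = 1;
    }
    w.data[n] = 0;               w.mode[n++] = 0;   /* guard, assumed zero */
    return w;
}
```

Reading the MODE sequence in clock order gives a single one for the load bit, zeroes through the flag, exponent and sign characters, ones through the mantissa and a final zero for the guard character, which is the pattern the state machines decode.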

A state diagram which describes the operation of the controller, multiplexer and ALU_1 in the finite state machine of figure 4 is given in figure 1. The controller is a state machine shown in figure 9 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 9. The functional behaviour of the state machine is described by the C simulation code function fsm1() of figure 12.

In the following figures, all logical tests on variables are defined to be true if the value of the variable is non-zero, and false if the value of the variable is zero. The initial state State 0 as shown in figure 1 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The controller remains in the zero state until a non-zero mode bit is detected after which it enters State 1. In State 1 the load bit associated with the flag characters Z_f is sampled, logically OR-ed with the zero flag status register for the accumulator A_zf and stored in the one-bit storage register Load.

At the next clock transition, which corresponds to a zero mode bit, the controller moves to State 2 in which the zero flag character of the flag characters Z_f is stored in the internal storage register Z_zf.

If the Load register contains a non-zero value, the controller enters State 3 at the next clock period and otherwise the controller enters State 4 which will be described subsequently. In State 3 the accumulator exponent field is incremented by the contents of the overflow register from the previous computation and is output as the exponent field of the accumulated result through the finite state machine first output, and also the value of the input operand exponent field z_e is output to the ring accumulator exponent register A_e through the finite state machine second output, the exponent difference field D_e is set to zero and is entered into the ring Z register through the finite state machine second output, the sign register sig is set to zero and both the Z mantissa sign register Z_s and the accumulator sign register A_s are set equal to the sign of the input operand mantissa z_s.

When the value of the mode bit becomes non-zero, indicating the presence of mantissa characters, the controller enters either State 6 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 5 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.

In State 5 the sign-corrected mantissa value A_m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, before being output as the result mantissa through the finite state machine first output. The mantissa value input to the ring Z register is set to zero and the mantissa register A_m is set to the input mantissa value z_m. In State 6 the correctly represented mantissa value A_m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output as the result mantissa through the finite state machine first output. As in State 5 the mantissa value input to the Z register is set to zero and the mantissa register A_m is set to the input mantissa value z_m.

The controller enters State 9 when the mode bit becomes zero. It remains in this state until a non-zero signal cyn_1 is received from a counter depicted in figure 9, indicating that the mantissae A_m and Z_m are aligned or the mantissa Z_m is zero. During this state, the modulo 2 sum of the signs of the Z and A mantissae is stored in the register neg.

At the next clock period the controller enters State 10.

When the mode bit becomes non-zero the controller enters:

State 11 in which the accumulator value A_m is computed by adding the contents of the Z mantissa Z_m to the contents of the accumulator A_m,

State 12 in which the accumulator value A_m is computed by subtracting the contents of the accumulator A_m from the contents of the Z mantissa Z_m,

State 13 in which the accumulator value A_m is computed by subtracting the contents of the Z mantissa Z_m from the contents of the accumulator A_m, or

When the mode bit becomes zero the controller returns to State 0.
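The arithmetic performed in States 11 to 13 corresponds to sign-magnitude addition, which can be sketched in C as follows. The sketch is illustrative only: the text above does not state the condition under which State 12 rather than State 13 is entered, so the selection by comparison of magnitudes used here is an assumption, as are the names sm_t and sm_add. The serial hardware described elsewhere in this specification instead detects an incorrect sign-magnitude result and applies a sign reversal on the following computation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned sign;   /* 0 for positive, 1 for negative */
    uint32_t mag;    /* magnitude of the mantissa */
} sm_t;

sm_t sm_add(sm_t a, sm_t z)
{
    sm_t r;
    unsigned neg = a.sign ^ z.sign;   /* modulo 2 sum of the signs */

    if (!neg) {
        /* Like signs: add the magnitudes (compare State 11). */
        r.sign = a.sign;
        r.mag  = a.mag + z.mag;
    } else if (a.mag >= z.mag) {
        /* Unlike signs: subtract Z_m from A_m (compare State 13). */
        r.sign = a.sign;
        r.mag  = a.mag - z.mag;
    } else {
        /* Unlike signs: subtract A_m from Z_m (compare State 12). */
        r.sign = z.sign;
        r.mag  = z.mag - a.mag;
    }
    return r;
}
```

Choosing the subtraction order so that the magnitude stays non-negative keeps the result in a correct sign-magnitude representation without a later sign reversal.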

If the Load register in State 2 contains zero, the controller enters State 4 in which the accumulator exponent value A_e is incremented by the value of the previously computed overflow A_ovf and is output to the ring through the second output of the finite state machine. The exponent difference D_e is set equal to the difference of the value z_e and the incremented accumulator value A_e + A_ovf and is output to the Z register through the second output of the finite state machine. The sign register sig is set equal to the sign bit of D_e and the one bit Z mantissa sign register Z_s is set equal to the sign bit of the input mantissa and the one bit accumulator sign register A_s is left unchanged.

When the value of the mode bit becomes non-zero, indicating the presence of mantissa characters, the controller enters either State 8 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 7 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.

In State 7 the sign-corrected mantissa value A_m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output.

In State 8 the correctly represented mantissa value A_m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output. In both State 7 and State 8 the Z register contents are passed unchanged from the finite state machine second input to the finite state machine second output.

A state diagram which describes the operation of the second arithmetic logic unit ALU_2 in the finite state machine of figure 4 is given in figure 2. The ALU_2 has a state machine shown in figure 10 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 10. The functional behaviour of the state machine is described by the C simulation code function fsm2() of figure 12.

The initial state State 0 as shown in figure 2 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The ALU_2 remains in the zero state until a non-zero mode bit is detected after which it enters State 1. At the occurrence of the next clock the ALU_2 state changes to State 2. The ALU_2 state changes to State 5 if the sign control line from ALU_1 is non-zero, and changes to State 3 otherwise.

In State 3 the exponent difference D_e is negated and the contents of the accumulator exponent field are replaced by the sum of A_e and the negated D_e, so restoring the former Z_e value.

When the mode bit becomes non-zero, indicating the mantissa field, the ALU_2 changes state to State 4. In State 4 the contents of the two mantissa registers A_m and Z_m are exchanged.

When the mode bit becomes zero, the ALU_2 enters State 5.

The ALU_2 remains in State 5 until a non-zero signal cyn_1 is received from a counter depicted in figure 10, when it enters State 6; it re-enters State 0 when the signal cyn_1 becomes zero.
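The ALU_2 state sequence described above can be condensed into a next-state function. This is a sketch in ANSI C written from the prose description, not a replacement for fsm2() of figure 12: the function name is hypothetical, the two-phase clocking is omitted, and State 1 is taken to advance unconditionally to State 2 as the text states.

```c
#include <assert.h>

/* Next state of ALU_2 given the current state and its control inputs.
 * mode  : mode bit (non-zero during the first field and the mantissa)
 * sign  : sign control line from ALU_1
 * cyn_1 : signal from the circulation counter */
int alu2_next(int state, int mode, int sign, int cyn_1)
{
    assert(state >= 0 && state <= 6);
    switch (state) {
    case 0: return mode ? 1 : 0;    /* wait for a non-zero mode bit     */
    case 1: return 2;               /* next clock                       */
    case 2: return sign ? 5 : 3;    /* sign line selects pass or swap   */
    case 3: return mode ? 4 : 3;    /* exponent field, then mantissa    */
    case 4: return mode ? 4 : 5;    /* exchange mantissae until mode=0  */
    case 5: return cyn_1 ? 6 : 5;   /* wait for cyn_1                   */
    case 6: return cyn_1 ? 6 : 0;   /* return when cyn_1 falls          */
    }
    return 0;
}
```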

Equations (17) to (24) with the exclusion of equation (21) are implemented using the finite state machine 22. To implement the de-normalisation of equation (21), an array of at least one systolic cell is required in which the transfer of data between cells is described by the following recurrences

M₀(p) = M₂(p - 1) (25)

C₀(p) = C₂(p - 1) (26)

Z₀(p) = Z₂(p - 1) (27)

A₀(p) = A₂(p - 1) (28)

and the internal recurrences which are implemented in each cell are

M₂(n) = M₁(n - 1) (29)

M₁(n) = M₀(n - 1) (30)

C₂(n) = C₁(n - 1) (31)

C₁(n) = C₀(n - 1) (32)

Z₁(n) = Z₀(n - 1) (33)

Z₄(n) = Z₃(n - 1) if M₁(n - 1) = 0
      = Z₄(n - 1) if M₁(n - 1) = 1 (34)

A₂(n) = A₁(n - 1) (35)

A₁(n) = A₀(n - 1) (36)

Z₂(n) = C₁(n - 1) + Z₀(n - 1) + C_y(n - 1) if M₀(n - 1) & M₁(n - 1) & Z₄(n - 1) = TRUE
      = C₁(n - 1) + Z₁(n - 1) + C_y(n - 1)
      = C₁(n - 1) + Z₀(n - 2) + C_y(n - 1) if M₀(n - 1) & M₁(n - 1) & Z₄(n - 1) = FALSE (37)

It is assumed that C contains the value 1 in the character position corresponding to the least significant exponent character, and is zero elsewhere. An examination of recurrence (34) shows that the sign of the exponent is stored in Z₄ for the duration of the mantissa. This value is used to control, via recurrence (37), whether the mantissa output Z₂ is delayed by one or two stages when the mode values M₀ and M₁ are high. This effects a one character de-normalisation of the Z mantissa field relative to the A mantissa when the exponent difference D_e is negative. The presence of a 1 in the C character sequence can be seen to increment the exponent difference according to recurrence (37).
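The effect of the C sequence on the exponent difference can be sketched as a character-serial addition. The example below is illustrative only: it assumes base 2, a two's complement D_e presented least significant character first, and hypothetical function names; the adder step mirrors add() of figure 12.

```c
#include <assert.h>

#define BASE 2

/* One character-serial full adder step, in the manner of add() in figure 12. */
static void add_char(int a, int b, int c, int *sum, int *cy)
{
    *sum = (a + b + c) % BASE;
    *cy  = (a + b + c) / BASE;
}

/* Increment a len-bit two's complement exponent difference by feeding it
 * through the adder one character at a time, least significant first. */
int serial_increment(int d_e, int len)
{
    unsigned int in = (unsigned int)d_e;
    unsigned int out = 0;
    int i, sum, cy = 0;

    assert(len > 0 && len < 31);
    for (i = 0; i < len; i++) {
        int z = (int)((in >> i) & 1u);  /* Z character at position i        */
        int c = (i == 0);               /* C is 1 only at the LS character  */
        add_char(c, z, cy, &sum, &cy);
        out |= (unsigned int)sum << i;
    }
    if (out & (1u << (len - 1)))        /* sign-extend the len-bit result   */
        return (int)out - (1 << len);
    return (int)out;
}
```

With a four character exponent, a difference of -3 becomes -2 after one cell and a difference of -1 becomes 0, at which point the sign stored in the cell no longer requests a shift.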

Each cell which implements these recurrences in a linear structure can implement the one-character de-normalisation and sign-extension required for floating-point addition using one's complement or two's complement mantissae, and the de-normalisation without sign extension for sign-magnitude mantissae. Thus for an m-bit mantissa full de-normalisation requires the application of m recurrences. These recurrences may be applied either by connecting m cells in a linear array, or by connecting at least one cell in a systolic ring structure with sufficient delay cells to contain the operand, and circulating the operands until m recurrences have been applied, or until the mantissae are aligned as indicated by a non-negative exponent difference.
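The two stopping conditions named above (m recurrences applied, or a non-negative exponent difference) can be sketched at the level of whole operands. This is a functional illustration under stated assumptions, not the bit-serial cell: it assumes base 2, a sign-magnitude Z mantissa held as an unsigned integer, and hypothetical function names.

```c
#include <assert.h>

/* One application of the cell recurrences, seen at operand level:
 * shift Z_m by one character if D_e is negative, then increment D_e. */
static void denorm_cell(int *d_e, unsigned long *z_m)
{
    if (*d_e < 0)
        *z_m >>= 1;     /* one base-2 character, no sign extension */
    *d_e += 1;
}

/* Apply the cell until the mantissae are aligned (D_e >= 0) or until
 * m recurrences have been applied. Returns the number applied. */
int denormalise(int *d_e, unsigned long *z_m, int m)
{
    int applied = 0;

    assert(m > 0);
    while (*d_e < 0 && applied < m) {
        denorm_cell(d_e, z_m);
        applied++;
    }
    return applied;
}
```

Whether the applications come from m cells in a line or from fewer cells visited on repeated circulations does not change the result in this model; only the latency differs.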

Figure 5 represents a schematic diagram of one possible hardware implementation of a de-normalisation cell 27 implementing the above recurrence equations (29) to (37).

Figure 6 represents a schematic diagram of one possible hardware implementation of a linear array of delay stages and their interconnection denoted by the above recurrence equations (25) to (28).

Figures 7 and 8 together represent a schematic diagram of the arithmetic logic unit ALU_1 24 component of the finite state machine 22. The notation depicted in figures 7 and 8 follows that of figure 12.

Figure 9 represents a schematic diagram of the control element 23 of the finite state machine 22. The notation depicted in figure 9 follows that of figure 12.

Figure 10 represents a schematic diagram of the arithmetic logic unit ALU_2 26 component of the finite state machine 22. The notation depicted in figure 10 follows that of figure 12.

A further embodiment of the invention is provided in figure 11 which depicts a schematic diagram of the joining or coalescence of two adjacent systolic ring accumulators to form a single accumulator capable of accumulating operands of double length. In the two systolic rings which have coalesced, the multiplexer for the second ring is controlled by the controller of the first ring.

Figure 12 is a C code simulation of an embodiment of a sign-magnitude systolic ring accumulator.

Although not implemented, it must be noted that post-normalisation is possible with the architecture of the ring accumulator. Minor additional complexity would be incurred in the logic circuitry and state machine of Logic_1, and an additional recirculation would be required.

Systolic ring arithmetic units provide new possibilities for systolic array processors. Consider a simple linear array of two processors, designed to process single precision operands. If the two processors are implemented as systolic rings it is possible with appropriate multiplexer means to coalesce the two rings into a single, larger ring. This large ring can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa characters to systolic cells remains a constant. For larger order systolic arrays the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.
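The observation that a coalesced ring needs the same number of circulations can be checked with a short calculation. The sketch below assumes one character shift per cell per circulation, as in claim 6, and uses a hypothetical function name; the values 30 and 15 correspond to mant_len and cells in defs.h.

```c
#include <assert.h>

/* Circulations needed to apply one recurrence per mantissa character:
 * the ceiling of mant_chars / cells. */
int circulations_for_full_denorm(int mant_chars, int cells)
{
    assert(mant_chars > 0 && cells > 0);
    return (mant_chars + cells - 1) / cells;
}
```

A single ring with 30 mantissa characters and 15 cells needs 2 circulations, and two such rings coalesced (60 characters, 30 cells) also need 2, because the ratio of characters to cells is unchanged.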

The nature of the systolic architecture allows advantage to be taken of the statistical properties of numbers to minimise the number of systolic cells. Current studies suggest that the number of systolic cells may be minimised by matching the number of cells to the 95th percentile of the expected distribution of de-normalisation shifts. In such a processor, the use of longer mantissa lengths for increased precision would not require increased numbers of systolic cells, but only an increase in the length of the registers. For such an implementation 95% of accumulations would occur in the designed number of circulations, and the remaining 5% would require additional circulations. In a processor which is asynchronous, this computation time uncertainty would not constitute a problem, and the saving of circuitry would be valuable. The only addition to the structure would be a test of completion of de-normalisation. A successful test would cause the remaining circulations of the operands to be bypassed. The information required to reduce the number of circulations in this way is in the sign bit of the incremented exponent difference, and can be used as an input to an expanded state machine in the circuit Logic_1. When the sign bit is zero, the de-normalisation is complete, and the state machine can move to the next state.
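The completion test described here can be sketched as a loop on the sign of the exponent difference. This is an illustration only: the function name is hypothetical, each cell is assumed to increment D_e once per circulation, and the count returned is the number of circulations in which de-normalisation is still in progress.

```c
#include <assert.h>

/* Count circulations until the sign bit of the incremented exponent
 * difference is zero, for a ring containing the given number of cells. */
int circulations_until_aligned(int d_e, int cells)
{
    int n = 0;

    assert(cells > 0);
    while (d_e < 0) {   /* sign bit set: de-normalisation incomplete   */
        d_e += cells;   /* every cell increments D_e once per circuit  */
        n++;
    }
    return n;
}
```

With two cells, a required shift of five characters takes three circulations while an already aligned operand takes none, so the remaining circulations can be bypassed as soon as the sign bit clears.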

Systolic ring and linear array floating point accumulators constructed according to the details described in this patent are of interest in large order systolic arrays and neural networks, and floating point arithmetic units implemented in Gallium Arsenide. This is due to the wide range of area/time/precision/dynamic-range tradeoffs achievable with the ring architecture and its low transistor count. It is also possible to implement the architecture determined by this patent with simple optical processing techniques.

defs.h

#define base 2
#define states 4
#define states1 5
#define reg_len 12
#define recirc 3
#define recirc_m 2
#define exp_len 10
#define mant_len 30
#define cells mant_len/2

enum clock {
    ph1, ph2
};

/* Two-phase register: p1 and p2 hold the two latch values. */
typedef struct {
    int p1, p2;
} reg;

typedef struct {
    reg x1, x2, y1, y2, mode1, mode2, pp, cy;
} mult;

typedef struct {
    reg x1, x2, y1, y2, mode1, mode2, pp1, pp2, sign, cy, bypass;
} norm;

sma5.c

#include <stdio.h>
#include <math.h>
#include "defs.h"

int
and(a, b)
int a, b;
{
    return (a & b);
}

int
or(a, b)
int a, b;
{
    return (a | b);
}

/* Two-way multiplexer: returns a when sel is 0, otherwise b. */
int
mux(sel, a, b)
int sel, a, b;
{
    if (sel == 0)
        return (a);
    else
        return (b);
}

int
mux4(sel, a, b, c, d)
int sel, a, b, c, d;
{
    switch (sel) {
    case (0): return (a); break;
    case (1): return (b); break;
    case (2): return (c); break;
    case (3): return (d); break;
    }
}

/* Single-character full adder in the number base. */
void
add(a, b, c, sum, cy)
int a, b, c, *sum, *cy;
{
    *sum = (a + b + c) % base;
    *cy = (a + b + c) / base;
}

int
inv_bit(x)
int x;
{
    int xbar;
    xbar = ~x & 1;
    return (xbar);
}

int
nor(a, b)
int a, b;
{
    return (~(a | b)) & 1;
}

int
xor(a, b)
int a, b;
{
    return or(nor(inv_bit(a), b), nor(inv_bit(b), a));
}

void
add_sub(a_s, a, b, c, sum, cy)
int a_s, a, b, c, *sum, *cy;
{
    int t, ct;
    ct = inv_bit(nor(or(
            nor(inv_bit(a), inv_bit(c)),
            nor(inv_bit(c), inv_bit(b))),
            nor(inv_bit(b), inv_bit(a))));
    t = xor(b, c);
    *sum = xor(a, t);
    *cy = xor(nor(a_s, inv_bit(t)), ct);
}

/* Two-phase register cell: clock 0 loads p1, clock 1 loads p2.
   Each latch inverts, so p2 follows the input after both phases. */
int
reg_cell(clock, a, b)
int clock, a;
reg *b;
{
    if (clock == 0)
        b->p1 = ~a;
    if (clock == 1)
        b->p2 = ~b->p1;
    return (b->p2);
}

/* Shift register of len cells built from reg_cell. */
int
shiftv(cl, len, a, sreg)
int cl, len, a;
reg *sreg;
{
    int i, op;
    op = reg_cell(cl, a, &sreg[0]);
    for (i = 0; i < len - 1; i++)
        op = reg_cell(cl, sreg[i].p2, &sreg[i + 1]);
    return (op);
}

int
fsm1(cl, reset, mode, load, neg, As, cyn_1, state)
int cl, reset, mode, load, neg, As, cyn_1, state;
{
    static int p1_reset, p1_mode, p1_state, t;

    if (cl == 0) {
        p1_reset = reset;
        p1_mode = mode;
        p1_state = state;
    }
    if (cl == 1) {
        if (p1_reset == 1) {
            state = 0;
        } else {
            switch (p1_state) {
            case 0: state = mux(p1_mode, 0, 1);
                break;
            case 1:
                switch (p1_mode) {
                case 0:
                    state = 2;
                    break;
                case 1:
                    printf("Error in fsm1 s1: second bit of field one\n");
                    break;
                }
                break;
            case 2: state = mux(load, 4, 3);
                break;
            case 3: if (!p1_mode) state = 3;
                else state = mux(neg && As, 6, 5);
                break;
            case 4: if (!p1_mode) state = 4;
                else state = mux(neg && As, 8, 7);
                break;
            case 5: state = mux(p1_mode, 9, 5);
                break;
            case 6: state = mux(p1_mode, 9, 6);
                break;
            case 7: state = mux(p1_mode, 9, 7);
                break;
            case 8: state = mux(p1_mode, 9, 8);
                break;
            case 9: state = mux(cyn_1, 9, 10);
                break;
            case 10: t = (!neg && p1_mode) +
                    2 * (neg && !As && p1_mode) +
                    3 * (neg && As && p1_mode);
                state = mux4(t, 10, 11, 13, 12);
                break;
            case 11: state = mux(p1_mode, 0, 11);
                break;
            case 12: state = mux(p1_mode, 0, 12);
                break;
            case 13: state = mux(p1_mode, 0, 13);
                break;
            }
        }
    }
    return (state);
}

main(argc, argv)
int argc;
char *argv[];
{
    FILE *infp, *outfp;
    char ch;
    int index = 0, cl_gen, cl, ind;
    int busy = 0, reset = 1;
    int a[states], b[states];
    int f_eof;

    outfp = fopen("states", "w");
    if (outfp == NULL)
        fprintf(stderr,
            "%s: cannot open file %s\n", argv[0], "states");
    else {
        a[0] = 0;
        a[1] = 0;
        a[2] = 0;
        a[3] = 0;
        for (ind = 0; ind < 4; ind++)
            for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
                cl = (cl_gen & 2) >> 1;
                busy = fpad(outfp, cl_gen, cl, reset, a, b);
            }
        /* printf ("Logic reset\n"); */
        reset = 0;
        init_instructions();
        do {
            for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
                cl = (cl_gen & 2) >> 1;
                if ((busy == 0) && (cl_gen == 2)) {
                    f_eof = scanf("%1d %1d %1d %1d\n", &a[0], &a[1], &a[2], &a[3]);
                }
                busy = fpad(outfp, cl_gen, cl, reset, a, b);
                if ((busy == 0) & (cl_gen == 3)) {
                    if (b[0]) printf("%1d %1d %d %d\n", b[0], b[1], b[2], b[3]);
                }
            }
        } while (f_eof != EOF);
        fclose(outfp);
    }
}

smad5.c

#include <stdio.h>
#include "defs.h"

/************************************************
Microcode fields are:
LAzs LAs Lzs Lzf Lld func src1 src2 shft dst1 dst2 cry1 cry2
Field lengths are:
|L|L|L|L|L|f|ss|ss|s|dd|dd|c|c|
Define constants as:
func: (fadd,fsub)
src1: (s1_Z,s1_A,s1_z,s1_0)
src2: (s2_Z,s2_A,s2_z,s2_0)
shft: (shift0,shift1)
dst1: (d1_Z,d1_z,d1_f,d1_0)
dst2: (d2_Z,d2_z,d2_f,d2_0)
cry1: (noset1,set1)
cry2: (noset2,set2)
************************************************/

#define noset2 0
#define set2 1
#define noset1 0
#define set1 2
#define d2_f 0xc
#define d2_z 4
#define d2_Z 8
#define d2_I 0
#define d1_Z 0
#define d1_z 0x10
#define d1_f 0x20
#define d1_0 0x30
#define shift0 0
#define shift1 0x40
#define s2_Z 0
#define s2_z 0x80
#define s2_A 0x100
#define s2_0 0x180
#define s1_Z 0
#define s1_z 0x200
#define s1_A 0x400
#define s1_0 0x600
#define fsub 0
#define fadd 0x800
#define Lld 0x1000
#define Lzf 0x2000
#define Lzs 0x4000
#define LAs 0x8000
#define LAzs 0x10000

extern int mux();

int instr[20];

/* One microcode word per fsm1 state. */
void init_instructions()
{
    instr[0] = 0;
    instr[1] = Lld;
    instr[2] = Lzf+set1+set2;
    instr[3] = Lzs+LAzs+d1_0+d2_z;
    instr[4] = Lzs+fsub+s1_z+s2_A+d1_f+d2_I;
    instr[5] = shift1+fsub+s1_0+s2_A+d1_0+d2_z;
    instr[6] = shift1+fadd+s1_0+s2_A+d1_0+d2_z;
    instr[7] = shift1+fsub+s1_0+s2_A+d1_z+d2_f;
    instr[8] = shift1+fadd+s1_0+s2_A+d1_z+d2_f;
    instr[9] = d1_Z+d2_I;
    instr[10] = d1_Z+d2_I;
    instr[11] = LAs+fadd+s1_Z+s2_A+d2_f;
    instr[12] = LAs+fsub+s1_Z+s2_A+d2_f;
    instr[13] = LAs+fsub+s1_A+s2_Z+d2_f;
}

/* One systolic de-normalisation cell. */
void
norm_cell(cell, cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
norm *cell;
int cl, x, y, pp, mode, *x_out, *y_out, *pp_out, *mode_out;
{
    int m1, x1, y1, pp1, sum, c_out, bypass, sign;

    m1 = reg_cell(cl, mode, &cell->mode1);
    *mode_out = reg_cell(cl, m1, &cell->mode2);
    pp1 = reg_cell(cl, pp, &cell->pp1);
    *pp_out = reg_cell(cl, pp1, &cell->pp2);
    y1 = reg_cell(cl, y, &cell->y1);
    sign = reg_cell(cl, y1, &cell->sign);
    bypass = reg_cell(cl, mux(m1, sign, cell->bypass.p2), &cell->bypass);
    x1 = reg_cell(cl, x, &cell->x1);
    *x_out = reg_cell(cl, x1, &cell->x2);
    add(cell->pp1.p2, mux(and(and(m1, mode), bypass), y1, y),
        cell->cy.p2, &sum, &c_out);
    c_out = reg_cell(cl, c_out, &cell->cy);
    *y_out = reg_cell(cl, sum, &cell->y2);
}

/* Linear chain of de-normalisation cells. */
void
normalise(cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
int cl, x, y, pp, mode, *x_out, *y_out, *pp_out, *mode_out;
{
    static norm mx[cells];
    int cell_index, j, a[states], b[states];

    a[0] = x;
    a[1] = y;
    a[2] = pp;
    a[3] = mode;
    for (cell_index = 0; cell_index < cells; cell_index++) {
        norm_cell(&mx[cell_index], cl, a[0], a[1], a[2], a[3],
            &b[0], &b[1], &b[2], &b[3]);
        for (j = 0; j < states; j++) a[j] = b[j];
    }
    *x_out = b[0];
    *y_out = b[1];
    *pp_out = b[2];
    *mode_out = b[3];
}

void
delay(cl, a, b, del)
int cl, *a, *b;
reg *del;
{
    int i;
    for (i = 0; i < states; i++)
        b[i] = reg_cell(cl, a[i], &del[i]);
}

void
delay1(cl, a, b)
int cl, *a, *b;
{
    static reg del[states];
    int i;
    for (i = 0; i < states; i++)
        b[i] = reg_cell(cl, a[i], &del[i]);
}

void
logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, sig, con)
FILE *outfp;
int cl_gen, cl, reset, *a, *g, *e, *lr, *r, *sig, *con;
reg *del;
{ /* logic_1 */
    static int lAzs, lAs, lzs, lzf, lld, func, src1, src2;
    static int shft, dst1, dst2, cry1, cry2;
    static int lAc_sign, A, __A, A_, Z, z, Aovf, Ao, Zo, gshft;
    static int last_mode, count, cyn_1, load, Ld, neg, As;
    static int ed[states], edz[states];
    static int fdsum, fsum, fcy, isum, icy, zzf;
    static int Azf, ss, bb, cc, Zs, cycle, cy1, state, p1, p2, Asd;
    static reg fcy_reg, icy_reg, delz[states];
    static reg si_reg, Load_reg, zzf_reg;
    static reg Aovf_reg, Azf_reg, As_reg, Zsi_reg, fd_reg, neg_reg;
    static reg edd_reg, r_reg, Ld_reg, Asd_reg;

    if (reset == 1) {
        cycle = (recirc - 1);
        count = -1;
        last_mode = 0;
    }
    if (cl_gen == 2) {
        if (e[3] && !last_mode) {
            count = (count + 1) % 2;
            if (!(count)) {
                cycle = (cycle + 1) % recirc;
            }
        }
        cy1 = (cycle == 0) || reset;
        cyn_1 = (cycle == recirc - 1);
        if (count > -1) *con = inv_bit(cy1);
        last_mode = e[3];
    }
    e[0] = g[0];
    e[1] = mux(*con, a[1], g[1]);
    e[2] = mux(*con, a[2], g[2]);
    e[3] = mux(*con, a[3], g[3]);
    delay(cl, e, ed, del);
    delay(cl, a, edz, delz);
    state = fsm1(cl, reset, e[3], load, neg, Asd_reg.p2, cyn_1, state);
    p1 = cl == 0;
    p2 = cl == 1;
    Z = ed[1];
    z = edz[1];
    A = ed[0];
    __A = e[0];

    /* Instruction decode */
    cry2 = instr[state] & 1;
    cry1 = (instr[state] >> 1) & 1;
    dst2 = (instr[state] >> 2) & 3;
    dst1 = (instr[state] >> 4) & 3;
    shft = (instr[state] >> 6) & 1;
    src2 = (instr[state] >> 7) & 3;
    src1 = (instr[state] >> 9) & 3;
    func = (instr[state] >> 11) & 1;
    lld = (instr[state] >> 12) & 1;
    lzf = (instr[state] >> 13) & 1;
    lzs = (instr[state] >> 14) & 1;
    lAs = (instr[state] >> 15) & 1;
    lAzs = (instr[state] >> 16) & 1;
    lzs = lzs && e[3] && !ed[3];
    lAzs = lAzs && e[3] && !ed[3];
    lAc_sign = lAs && !e[3];
    gshft = shft && !e[3];

    fcy = reg_cell(cl, mux(cry1, and(fcy, inv_bit(and(e[3],
        inv_bit(ed[3])))), Aovf), &fcy_reg);
    icy = reg_cell(cl, mux(cry2, icy, Aovf), &icy_reg);
    load = reg_cell(cl, mux(lld, Load_reg.p2, ed[1] || !Azf_reg.p2),
        &Load_reg);
    Ld = reg_cell(cl, mux(lld, Ld_reg.p2, ed[1]), &Ld_reg);
    zzf = reg_cell(cl, mux(lzf, zzf_reg.p2, ed[1]), &zzf_reg);
    Zs = reg_cell(cl, mux4(lzs + 2 * (inv_bit(*sig) && shft && !e[3]),
        Zsi_reg.p2, ed[1] && !load, As_reg.p2), &Zsi_reg);
    Azf = reg_cell(cl, mux(or(lzf, lAs), Azf_reg.p2,
        and(lAs, or(fsum, Azf_reg.p2))), &Azf_reg);
    neg = reg_cell(cl, bb = mux(shft, neg_reg.p2, As ^ Zs), &neg_reg);
    Aovf = reg_cell(cl, mux(lAs, Aovf_reg.p2, fsum ^ neg), &Aovf_reg);
    A_ = mux(and(shft, Aovf), A, __A);
    add_sub(func, mux4(src1, Z, z, A_, 0), mux4(src2, Z, z, A_, 0),
        fcy_reg.p2, &fsum, &fcy);
    add_sub(1, 0, A_, icy_reg.p2, &isum, &icy);
    Zo = mux4(dst1, Z, z, fsum, 0);
    Ao = mux4(dst2, isum, z, Z, fsum);
    *sig = reg_cell(cl, mux(lzs, si_reg.p2, fd_reg.p2), &si_reg);
    fdsum = reg_cell(cl, fsum, &fd_reg);
    As = reg_cell(cl, mux4((
        lAc_sign + 2 * (lAzs) + 3 * (inv_bit(*sig) && shft && !e[3])),
        As_reg.p2, ((!neg) && As && Zs) || (neg && fsum),
        ed[1], Zsi_reg.p2), &As_reg);
    Asd = reg_cell(cl, As, &Asd_reg);
    lr[0] = Ao && !lzs;
    lr[1] = Zo && !lzs;
    lr[2] = ed[2];
    lr[3] = ed[3];
    r[0] = Ld;
    r[1] = reg_cell(cl, mux4(lzs + 2 * ed[3], isum, As_reg.p2, fsum, 0),
        &r_reg);
    r[2] = reg_cell(cl, ed[3], &edd_reg);
    r[3] = r[2];
    /* End of datapath */
}

int
fsm2(cl, reset, mode, sign, cyn_1, state)
int cl, reset, mode, sign, cyn_1, state;
{
    static int p1_reset, p1_mode, p1_state, t;

    if (cl == 0) {
        p1_reset = reset;
        p1_mode = mode;
        p1_state = state;
    }
    if (cl == 1) {
        if (p1_reset == 1) {
            state = 0;
        } else {
            switch (p1_state) {
            case 0: state = mux(p1_mode, 0, 1);
                break;
            case 1:
                switch (p1_mode) {
                case 0:
                    state = 2;
                    break;
                case 1:
                    printf("Error in fsm2 s1: second bit of field one\n");
                    break;
                }
                break;
            case 2: state = mux(sign, 3, 5);
                break;
            case 3: state = mux(p1_mode, 3, 4);
                break;
            case 4: state = mux(p1_mode, 5, 4);
                break;
            case 5: state = mux(cyn_1, 5, 6);
                break;
            case 6: state = mux(cyn_1, 0, 6);
                break;
            }
        }
    }
    return (state);
}

void
logic_2(cl_gen, cl, reset, sign, e, lr)
int cl_gen, cl, reset, *sign, *e, *lr;
{
    static int ed[states], state;
    static int dst1, dst2, cry;
    static int sub_s, sub_c, sub_cd, ad_s, ad_c, ad_cd;
    static int cycle, cy1;
    static int last_mode, count, cyn_1;
    static reg sub_cy, ad_cy;

    if (reset == 1) {
        cycle = (recirc - 1);
        count = -1;
        last_mode = 0;
    }
    if (cl_gen == 2) {
        if (e[3] && !last_mode) {
            count = (count + 1) % 2;
            if (!(count)) {
                cycle = (cycle + 1) % recirc;
            }
        }
        cy1 = (cycle == 0) || reset;
        cyn_1 = (cycle == recirc - 1);
        last_mode = e[3];
    }

    /* Register operations */
    delay1(cl, e, ed);
    state = fsm2(cl, reset, e[3], *sign, cyn_1, state);
    ad_cd = reg_cell(cl, and(ad_c, cry), &ad_cy);
    sub_cd = reg_cell(cl, and(sub_c, cry), &sub_cy);

    /* Control signal generation */
    dst1 = 0;
    dst2 = 0;
    cry = 0;
    switch (state) {
    case 3:
        dst1 = 2;
        dst2 = 3;
        cry = 1;
        break;
    case 4:
        dst1 = 1;
        dst2 = 1;
        break;
    }

    /* Arithmetic operations */
    add_sub(0, 0, ed[1], and(cry, sub_cy.p2), &sub_s, &sub_c);
    add_sub(1, ed[1], ed[0], and(cry, ad_cy.p2), &ad_s, &ad_c);

    /* set outputs from logic cell */
    lr[0] = mux4(dst2, ed[0], ed[1], sub_s, ad_s);
    lr[1] = mux4(dst1, ed[1], ed[0], sub_s, ad_s);
    lr[2] = ed[2];
    lr[3] = ed[3];
    /* End of datapath */
}

/* One clock step of the complete ring: Logic_1, delay, Logic_2, cells. */
int
fpad(outfp, cl_gen, cl, reset, a, r)
FILE *outfp;
int cl_gen, cl, reset, *a, *r;
{
    static int b[states], c[states], d[states], g[states];
    static int e[states], f[states], h[states], lr[states];
    static reg xr[reg_len], yr[reg_len], ppr[reg_len];
    static reg acxr[reg_len], acyr[reg_len], acppr[reg_len];
    static reg acmoder[reg_len], moder[reg_len];
    static int con = 0, sig = 0;
    static reg del[states];

    if (reset == 1)
        con = 0;
    logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, &sig, &con);
    f[0] = shiftv(cl, reg_len, lr[0], acxr);
    f[1] = shiftv(cl, reg_len, lr[1], acyr);
    f[2] = shiftv(cl, reg_len, lr[2], acppr);
    f[3] = shiftv(cl, reg_len, lr[3], acmoder);
    logic_2(cl_gen, cl, reset, &sig, f, h);
    normalise(cl, h[0], h[1], h[2], h[3], &g[0], &g[1], &g[2], &g[3]);
    return (con);
}


Claims

CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:

1. A systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of said real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a denormalization array adapted to receive the second output of said finite state machine and to output at least partially denormalized floating point numbers to the second input of said finite state machine and in said configuration to form a ring, wherein, during said first state said finite state machine is adapted to control said ring wherein said number Z in floating point format is input to said ring through said finite state machine first input and said accumulator output A is output from said ring from said finite state machine first output, and during said second state said finite state machine is adapted to transfer at least partially denormalised floating point numbers from its said second input to its said second output, to control the number of times said transfer occurs and to add aligned floating point numbers.

2. A systolic ring serial floating point accumulator according to claim 1 wherein said finite state machine further comprises an arithmetic logic unit ALU_1 having as first and second inputs said finite state machine first and second inputs and having as its first output said finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive said ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output said finite state machine second output, said de-normalisation array further comprising at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order, said ring comprising a character sequence path formed from a serial configuration of, said ALU_1, said linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Z_e, said ALU_2 and said de-normalisation array.

3. A systolic ring serial floating point accumulator according to claim 2 wherein said real numbers in floating point format are represented as a triplet having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number and an initialization flag character, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Z_f, Z_e and Z_m, and said accumulator outputs a character sequence A which is the floating point representation {A_f, A_e, A_m} of the accumulation of said real numbers, whereby said ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, said first field D_e being representative of the difference between the accumulator exponent A_e and the input exponent Z_e, and said second field Z_m being representative of the Z mantissa value, and a mode register which contains said mode characters.

4. A systolic ring serial floating point accumulator according to claim 3 wherein, said ring further comprises, a connection means to connect said ALU_1 to said ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value D_e.

5. A systolic ring serial floating point accumulator according to claim 4 wherein, ALU_1 further comprises, means for differencing in said finite state machine first state the exponent values A_e and Z_e to provide the value D_e which is output to the Z register of said ring and to output to the A register of said ring the value A_e, and also to output via said connection means the sign of the value D_e to ALU_2 and, said ALU_2 further comprises, test and control means to accept and test the sign of the value D_e received via said connection means, whereby if the sign of the value D_e is positive, a sign reversal means of said ALU_2 reverses the sign of D_e, an addition means of said ALU_2 adds the exponent field contents of the A register to the exponent field contents of the Z register to regenerate Z_e, and outputs Z_e to the A register and D_e to the Z register, A_m to the Z register and Z_m to the A register, otherwise to control said ALU_2 to pass unchanged the contents of the A and Z registers.

6. A systolic ring floating point accumulator according to claim 3 wherein each of said de-normalisation cells increments the contents of the exponent field, and effects a conditional de-normalisation of the contents of the Z_m field of the Z register if the contents of the unincremented exponent field of the Z register is negative, resulting in a one character de-normalisation of Z_m per systolic cell per circulation until the A and Z register mantissa field contents are aligned, or until the Z register mantissa field contents have been zeroed.

7. A systolic ring floating point accumulator according to claim 3 wherein each of said de-normalisation cells has a first and second mode of operation controlled by the current value of the mode characters in said mode register, wherein, in said de-normalisation cell first state which corresponds to the exponent fields of the Z and A registers in said cell, said de-normalization cell further comprises de-normalization addition means to increment the difference value D_e, de-normalization output means to output the incremented difference value, de-normalization storage means to store the sign of the difference value D_e, multiplexer control means, and in said de-normalisation cell second state which corresponds to the mantissa fields of the Z and A registers in said cell, said de-normalization cell further comprises a multiplexer means to bypass one delay cell in the Z register and so to effect a one-character de-normalisation of the Z_m operand if the sign of the difference value D_e stored in said de-normalisation cell storage means is negative.

8. A systolic ring floating point accumulator according to claim 1 wherein said real numbers in floating point format are represented as a triplet having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number and an initialization flag character, whereby, when said initialization flag character is present said accumulator floating point number representation {A_f, A_e, A_m} is output and zeroed prior to the input of the next sequentially input real number Z to the accumulator.

9. A systolic ring serial floating point accumulator according to claim 1 wherein the accumulator is implemented in Gallium Arsenide.

10. A systolic ring serial floating point accumulator according to claim 1 wherein varying the number base of said data characters varies the execution time of said accumulator.

11. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said delay cells varies the precision and dynamic range of said accumulator.

12. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said systolic cells varies the precision and dynamic range of said accumulator.

13. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said systolic cells and varying the number of delay cells varies the precision and dynamic range of said accumulator.

14. A systolic ring serial floating point accumulator according to claim 3 which is extensible having at least one additional multiplexer means located within said ring, each of said additional multiplexer means having connections to an adjacent ring which in a first mode both isolates each ring from the other, accepts input for and directs output from the adjacent ring and also completes the first ring, and in a second mode directs data sequences from the first ring to the input of the adjacent ring, bypasses the arithmetic logic unit ALU_2 in the controlled ring, and directs the output from the adjacent ring back into the first ring.

15. A method of accumulating successive data character sequences representing mantissas and exponents of Z operands in floating point format to output one identically formatted result data character sequence A which is the floating point accumulation of the Z operands, wherein at least one systolic cell is arranged in a ring configuration.

16. A method of adding successive pairs of data character sequences representing mantissas and exponents of Z₁ and Z₂ floating point operands in floating point format to output one identically formatted result data character sequence A which is the floating point sum of the input operands, wherein said data character sequences are operated upon by at least one systolic cell.

17. A serial floating point adder for accepting sequential pairs of real numbers Z₁ and Z₂ in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Z_e, Z_m}, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, and said adder outputs a character sequence A which is the floating point representation {A_f, A_e, A_m} of the addition of said real numbers, wherein said adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalization array adapted to receive the output of said finite state machine and to output a denormalized floating point number, a second finite state machine adapted to receive the output from the de-normalisation array and to output the floating point sum of sequential pairs of accepted real numbers.

18. A serial floating point adder according to claim 17 wherein a mode character sequence is entered in parallel with the 2-tuples to identify to the adder the fields Z_e and Z_m of said floating point representations, and said de-normalisation array further comprises at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.

19. A serial floating point adder according to claim 18 wherein there are m de-normalisation cells where m is the number of characters in the mantissa of the real number representations, and wherein said second finite state machine further comprises, an adder means at the output of said de-normalisation array to add the sum of the character sequence representing the mantissae of the pair of accepted real numbers Z₁ and Z₂.

20. A systolic ring serial floating point adder according to claim 17 wherein the adder is implemented in Gallium Arsenide.

21. A systolic ring serial floating point adder according to claim 17 wherein varying the number base of said data characters varies the execution time of said adder.

22. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said delay cells varies the precision and dynamic range of said adder.

23. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said systolic cells varies the precision and dynamic range of said adder.

24. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said systolic cells and varying the number of delay cells varies the precision and dynamic range of said adder.