WO1992000560A1: A generalised systolic array serial floating point adder and accumulator
 Publication number
 WO1992000560A1 (PCT/AU1991/000284)
 Authority
 WO
 Grant status
 Application
 Patent type
 Prior art keywords
 floating point
 ring
 systolic
 output
 number
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
 G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using noncontactmaking devices, e.g. tube, solid state device; using unspecified devices
 G06F7/483—Computations with numbers represented by a nonlinear combination of denominational numbers, e.g. rational numbers, logarithmic number system, floatingpoint numbers
 G06F7/485—Adding; Subtracting

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F15/00—Digital computers in general; Data processing equipment in general
 G06F15/76—Architectures of general purpose stored program computers
 G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
 G06F15/8046—Systolic arrays

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
 G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using noncontactmaking devices, e.g. tube, solid state device; using unspecified devices
 G06F7/50—Adding; Subtracting
 G06F7/505—Adding; Subtracting in bitparallel fashion, i.e. having a different digithandling circuit for each denomination
 G06F7/509—Adding; Subtracting in bitparallel fashion, i.e. having a different digithandling circuit for each denomination for multiple operands, e.g. digital integrators
 G06F7/5095—Adding; Subtracting in bitparallel fashion, i.e. having a different digithandling circuit for each denomination for multiple operands, e.g. digital integrators wordserial, i.e. with an accumulatorregister

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
 G06F2207/38—Indexing scheme relating to groups G06F7/38  G06F7/575
 G06F2207/3804—Details
 G06F2207/386—Special constructional features
 G06F2207/3884—Pipelining
 G06F2207/3892—Systolic array
Description
"A GENERALISED SYSTOLIC ARRAY SERIAL FLOATING POINT
ADDER AND ACCUMULATOR"
This invention relates to floating point accumulators and adders and in particular to serial systolic array floating point accumulators and adders.
BACKGROUND OF THE INVENTION
Fixed point adders and accumulators are implemented simply by a single adder and a carry storage register for serial implementations, and an array of adders and carry storage registers for parallel implementations. For n-bit numbers an adder is typically a factor of n less complex than a multiplier. This situation is no longer the case when floating point operations are considered. A floating point multiplier is not substantially more complex than its fixed point counterpart whereas the floating point adder is significantly more complex than the fixed point equivalent. The reason for this complexity is evident when a floating point addition or accumulation algorithm is considered and compared with a fixed point addition or accumulation algorithm. Examples are:
Fixed point accumulation of a fixed point number Z with a fixed point accumulator value at time n of A_{n} is carried out by the simple operation of integer addition, i.e.
A_{n+1} = A_{n} + Z (1)
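The serial case mentioned above (a single adder and a carry storage register) can be sketched in C as follows. This is only an illustrative model, assuming binary characters and a word of at most 32 bits; the names serial_add_bit and serial_accumulate are hypothetical and do not come from the simulation code of figure 12.

```c
#include <assert.h>
#include <stdint.h>

/* One clock of a bit-serial adder: a full adder plus a one-bit carry register. */
static unsigned serial_add_bit(unsigned a, unsigned z, unsigned *carry)
{
    unsigned sum = a ^ z ^ *carry;
    *carry = (a & z) | (a & *carry) | (z & *carry);
    return sum;
}

/* Equation (1), A_{n+1} = A_{n} + Z, evaluated least significant bit first. */
static uint32_t serial_accumulate(uint32_t a, uint32_t z, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);   /* assumed word length limit */
    unsigned carry = 0;                /* carry storage register, cleared per word */
    uint32_t result = 0;
    for (unsigned i = 0; i < bits; i++) {
        unsigned s = serial_add_bit((a >> i) & 1u, (z >> i) & 1u, &carry);
        result |= (uint32_t)s << i;
    }
    return result;                     /* carry out of the top bit is discarded */
}
```

For an 8-bit word, serial_accumulate(5, 9, 8) yields 14, and serial_accumulate(200, 100, 8) wraps to 44 because the final carry is dropped.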
To discuss floating point addition or accumulation it is necessary to define the number system. A floating point number F is composed of two parts, a fractional mantissa F_{m} and an integral exponent F_{e}, and can be represented as the 2-tuple
F = {F_{e}, F_{m}} (2)
The real number representation R of this floating point representation is
R = F_{m} · b^{F_{e}} (3)
where b is the base of both F_{m} and F_{e} and
The floating point accumulation of a floating point number {Z_{e}, Z_{m}} with a floating point accumulator value at time n of {A_{e,n}, A_{m,n}} to form the new accumulator value {A_{e,n+1}, A_{m,n+1}} at time n + 1 is performed by the following algorithm:
[Equations (4) to (9)]
where Maxexp is the maximum exponent value in the particular format, Minexp is the minimum exponent value, b is the number base of the floating point representation, [·] represents the integer part of its argument and sign is the sign of the operation.
This algorithm can be considered representative of the way in which addition or accumulation is performed in conventional computing hardware.
Some discussion of this algorithm is warranted. Equation (4) represents the shifting of the operand mantissa which has the smallest exponent by a number of digit places equal to the difference in the exponents, followed by the summation of the shifted operands. This temporary result A'_{m,n+1} is conditionally left or right shifted according to its value. Three possibilities exist. The first is a result which is smaller than the lower bound of the defined range for the fractional part of the representation. In this case the operation has caused a loss of precision known as catastrophic cancellation. Leading zeroes are introduced into the representation which must be removed by left shifting the result. This operation is known as post-normalisation. The second possibility is that the result is larger than the upper bound of the defined range of the mantissa. In this case the result is right shifted one place to restore it to the defined range. This condition is mantissa overflow. The final possibility for the result is that it falls within the defined range of the representation, in which case no shifts are required. Equation (5) expresses these three conditions in mathematical form. Equation (6) defines the exponent of the temporary result A'_{e,n+1}. This exponent value is modified by any shifts which are performed upon the mantissa to preserve the real value of the 2-tuple. Additive corrections to this exponent value are defined by equation (7). The corrections appear as additions for the exponents whereas multiplications, or shifts, are performed in the case of the mantissa. Equations (8) and (9) set flags which indicate whether the result has exceeded the floating point representation at either end of its dynamic range.
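The steps just discussed (alignment, summation, conditional shifting, exponent correction and range flags) may be illustrated by the following C sketch. It follows the prose only and is not a transcription of equations (4) to (9). It assumes base b = 2, an 8-bit mantissa held as a signed integer whose magnitude lies in [2^7, 2^8) when normalised, and arbitrary exponent limits; the type fp_t and the function fp_accumulate are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define M_BITS  8        /* assumed mantissa width in bits */
#define MAX_EXP 127      /* assumed exponent limits */
#define MIN_EXP (-128)

typedef struct { int exp; int64_t man; int overflow; int underflow; } fp_t;

/* Conventional accumulation: align, add, post-normalise, flag range errors. */
static fp_t fp_accumulate(fp_t a, fp_t z)
{
    fp_t r = {0, 0, 0, 0};
    int d = a.exp - z.exp;
    assert(llabs(a.man) < ((int64_t)1 << M_BITS));
    assert(llabs(z.man) < ((int64_t)1 << M_BITS));

    /* Shift the mantissa with the smaller exponent right by the difference. */
    if (d >= 0) { r.exp = a.exp; z.man = d < 63 ? z.man / ((int64_t)1 << d) : 0; }
    else        { r.exp = z.exp; a.man = -d < 63 ? a.man / ((int64_t)1 << -d) : 0; }
    r.man = a.man + z.man;

    /* Mantissa overflow: one right shift, exponent incremented. */
    if (llabs(r.man) >= ((int64_t)1 << M_BITS)) { r.man /= 2; r.exp += 1; }

    /* Post-normalisation: remove leading zeroes, exponent decremented. */
    while (r.man != 0 && llabs(r.man) < ((int64_t)1 << (M_BITS - 1))) {
        r.man *= 2;
        r.exp -= 1;
    }

    r.overflow  = r.exp > MAX_EXP;
    r.underflow = r.exp < MIN_EXP;
    return r;
}
```

With these assumptions, adding {3, 200} and {3, -192} leaves a temporary mantissa of 8, which four left shifts restore to {-1, 128}; this is the cancellation case described above.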
Existing techniques for the design and construction of floating point adders and accumulators are broadly categorised as parallel or serial. The parallel architectures are intended for low latency designs. An example is the work of OWEN, R.E., "A 15 nanosecond complex multiplier-accumulator for FFT's", ICASSP'87, CH2396-0/87/0000-0527, pp. 527-530, 1987. For system architectures in which longer latencies can be tolerated, serial architectures are used to advantage and an example is CHAU, P.M., KAY, C.C. and KU, W.H., "A bit-serial floating-point complex multiplier-accumulator for fault-tolerant digital signal processing arrays", ICASSP'87, CH2396-0/87/0000-0483, pp. 483-486, 1987.
SUMMARY OF THE INVENTION
In its broadest form the invention comprises a systolic array floating point adder for accepting sequential pairs of real numbers Z_{1} and Z_{2} in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Z_{e}, Z_{m}}, Z_{e} is a character sequence representing the exponent of the real number and Z_{m} is a character sequence representing the mantissa of the real number, and the adder outputs a character sequence A which is the floating point representation {A_{f}, A_{e}, A_{m}} of the addition of said real numbers, wherein the adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalisation array adapted to receive the output of said finite state machine and to output a denormalised floating point number, a second finite state machine adapted to receive the output from the denormalisation array and to output the floating point sum of sequential pairs of accepted real numbers.
In a further aspect of the invention the serial floating point adder has a mode character sequence entered in parallel with the 2-tuples to identify to the adder the fields Z_{e} and Z_{m} of the floating point representations, and the denormalisation array further comprises at least one systolic denormalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.
In yet a further aspect the invention in its broadest form comprises a systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of the real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a denormalisation array adapted to receive the second output of the finite state machine and to output at least partially denormalised floating point numbers to the second input of the finite state machine and in the configuration to form a ring, wherein, during the first state the finite state machine is adapted to control the ring wherein the number Z in floating point format is input to the ring through the finite state machine first input and the accumulator output A is output from the ring from the finite state machine first output, and during the second state the finite state machine is adapted to transfer at least partially denormalised floating point numbers from its second input to its second output, to control the number of times the transfer occurs and to add aligned floating point numbers.
In yet a further aspect of the invention a systolic ring serial floating point accumulator has a finite state machine which further comprises an arithmetic logic unit ALU_1 having as first and second inputs the finite state machine first and second inputs and having as its first output the finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive the ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output the finite state machine second output, the denormalisation array further comprising at least one systolic denormalisation cell and zero or more delay cells where cells of each type may be arranged in any order, the ring comprising a character sequence path formed from a serial configuration of, the ALU_1, the linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Z_{e}, the ALU_2 and the denormalisation array.
A further aspect of the invention provides a systolic ring serial floating point accumulator in which the real numbers in floating point format are represented as a triplet having the form {Z_{f}, Z_{e}, Z_{m}} wherein Z_{f} is a character sequence representing descriptors of the real number and an initialization flag character, Z_{e} is a character sequence representing the exponent of the real number and Z_{m} is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Z_{f}, Z_{e} and Z_{m}, and the accumulator output is a character sequence A which is the floating point representation {A_{f}, A_{e}, A_{m}} of the accumulation of the real numbers, whereby the ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, the first field D_{e} being representative of the difference between the accumulator exponent A_{e} and the input exponent Z_{e}, and the second field Z_{m} being representative of the Z mantissa value, and a mode register which contains the mode characters.
According to a further aspect of the invention the ring of the serial floating point accumulator further comprises, a connection means to connect the ALU_1 to the ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value D_{e}. In a further aspect of this invention at least one delay cell is added into the ring to increase the number of data characters in the floating point representation without increasing the number of systolic cells and thereby achieve the processing of operands with either increased precision or dynamic range.
In an embodiment of the invention, a heterogeneous array structure is created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations. The output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder. In the case of a systolic ring accumulator it consists of a finite state machine and a systolic denormalisation array.
Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude. In the preferred embodiment of an accumulator, sign-magnitude mantissae and two's complement exponent ordered number pairs are used. The only fixed aspects of the systolic ring are the arithmetic blocks. The length of the delay block is determined by the exponent length in the number representation. The number of systolic denormalisation cells in the ring can range from a minimum of one to a maximum of m where m is the number of characters in the mantissa of the number representation. The number of recurrence cells determines the performance characteristics of the accumulator.
The invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.
Further aspects of this invention include:
(i) the reduction of the complexity of the problem of constructing floating point adders and accumulators by using:
(a) replicated cell structures to implement recurrences which denormalise mantissae;
(b) novel circuitry to implement in a serial pipelined fashion both the incrementing of an exponent difference and the conditional denormalisation of an associated mantissa.
(ii) the depiction of the use of systolic denormalisation cells interconnected with state memory stages in a linear array or systolic ring structure to construct either an adder capable of variable dynamic range or an accumulator capable of both variable precision and variable dynamic range.
(iii) the depiction of the construction of a systolic ring accumulator with a minimum gate complexity, consisting of an I/O multiplexer, two arithmetic logic units each containing a state machine, an array of delay stages and at least one computational cell representable by recurrences. The computational cell further comprises: the registers required to store the operands, a state storage register, one control storage register and an adder.
(iv) the depiction of the design of a generic accumulator capable of providing a broad range of performance specifications by varying the number of computational cells and/or the number of delay cells in the systolic denormalisation ring and the array of delay cells. Varying the number base of the characters in the floating point format also provides a further means for controlling the execution time of the accumulator.
To further describe the invention, preferred embodiments will now be given, however, it will be apparent that variations will be possible without departing from the inventive matter disclosed. This is especially so since such variations are within the ordinary skill of the practitioner of digital design techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments are described hereunder in some detail with reference to and as illustrated in the accompanying drawings in which:
Figure 1 depicts a state diagram for the first logic element or datapath.
Figure 2 depicts a state diagram for the second logic element or datapath.
Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.
Figure 4 depicts a schematic of a systolic ring accumulator in which the elements are considered to be lumped. This clarifies the logical function of the array and highlights the distributed nature of the registers. Each register is associated with one of the recirculating arrows. Naming conventions correspond to the simulation code of figure 12.
Figure 5 depicts a schematic of the systolic denormalisation cell norm_cell(). Variable names in brackets refer to the nomenclature of the 'C' simulation program given in a later figure.
Figure 6 depicts a schematic of an array of delay cells which form a component of the systolic ring.
Figure 7 depicts a schematic representation of the input/output multiplexer and (as implemented) a one-bit microcoded datapath. Although implemented as a one-bit per character device, the architecture can be constructed with multi-bit characters.
Figure 8 is a schematic diagram of the state generation and storage circuitry in the first logic cell Logic_1().
Figure 9 is a schematic diagram of the control signal generation for the first logic cell Logic_1(), with naming conventions as for figure 12.
Figure 10 is a schematic diagram of both the control signal generation and a block schematic diagram for the second logic cell Logic_2().
Figure 11 is a schematic diagram of two systolic rings which have coalesced to form a single, extended precision accumulator. To extend the dynamic range, additional delay cells must be placed before Logic_2(). The multiplexer for the second ring is controlled by the controller of the first ring, and the second occurrence of Logic_2() is not included in the ring.
Figure 12 is 'C' code which simulates a systolic ring accumulator.
DETAILED DESCRIPTION OF THE INVENTION
Based upon the following addition or accumulation technique, this patent describes a simpler implementation of floating point addition or accumulation than that detailed previously. Thus there is provided according to the invention both a linear systolic array serial floating point adder and a circular systolic array serial floating point accumulator. For simplicity, only the systolic ring accumulator is described as the linear adder is obvious from the description of the ring accumulator.
Equations (10) to (13) are significantly simpler than the conventional set given in equations (4) to (9). This simplicity is partly due to the lack of testing for overflow and underflow. Put simply, the exponent register of the accumulator can be made sufficiently long to accommodate the accumulation of sequences of numbers, where the length of the sequences is less than or equal to some arbitrarily chosen maximum length, without reaching the overflow or underflow condition. It is a straightforward design exercise to provide guard digits in the exponent register to satisfy this requirement.
The second simplification is not obvious and is not part of floating point standards. It omits the post-normalisation of the sum. It is applicable to the floating point addition of two or more normalised numbers and allows post-normalisation to be done only at the end of the completed summation, so effecting considerable savings in the case of long sequences.
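The deferred post-normalisation just described may be illustrated by the C sketch below, which accumulates a sequence with alignment and overflow correction only and left shifts the result once at the end. It is a sketch under stated assumptions, not the relations of equations (10) to (13): base 2, an 8-bit mantissa held as a signed integer, and the hypothetical names ufp_t, acc_unnormalised, normalise_once and sum_sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define M_BITS 8   /* assumed mantissa width in bits */

typedef struct { int exp; int64_t man; } ufp_t;

/* Add one term: align and correct overflow, but never left shift the sum. */
static ufp_t acc_unnormalised(ufp_t a, ufp_t z)
{
    int d = a.exp - z.exp;
    if (d >= 0) {
        z.man = d < 63 ? z.man / ((int64_t)1 << d) : 0;
    } else {
        a.man = -d < 63 ? a.man / ((int64_t)1 << -d) : 0;
        a.exp = z.exp;
    }
    a.man += z.man;
    if (llabs(a.man) >= ((int64_t)1 << M_BITS)) { a.man /= 2; a.exp += 1; }
    return a;
}

/* Post-normalisation, applied once when the summation is complete. */
static ufp_t normalise_once(ufp_t a)
{
    while (a.man != 0 && llabs(a.man) < ((int64_t)1 << (M_BITS - 1))) {
        a.man *= 2;
        a.exp -= 1;
    }
    return a;
}

static ufp_t sum_sequence(const ufp_t *z, size_t n)
{
    assert(n >= 1);
    ufp_t a = z[0];                       /* first operand loads the accumulator */
    for (size_t i = 1; i < n; i++)
        a = acc_unnormalised(a, z[i]);
    return normalise_once(a);
}
```

For the sequence {3, 200}, {3, -192}, {0, 128} the accumulator holds the unnormalised value {3, 24} before the final step and {0, 192} after it. Only one normalisation pass is made regardless of the length of the sequence.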
Consider that error-free numbers are A and Z, and that their floating point representations A* and Z* introduce errors e_{A} and e_{Z} such that
The maximum relative error E when forming the sum of these numbers occurs when they have opposite sign. This worst-case relative error is approximated by
The significance of equation (16) is that the error is formed in equations (4) and (11). The post-normalisation process of equation (5) does not alter the error in the sum, and as a consequence the operation may be omitted without significantly altering the error behaviour of the accumulation process. A benefit of this approach for summation is that when the summation is complete the number of leading zeroes in the accumulator may give an estimate of the lower bound to the error in the result.
Expanding the equations (10) to (13) gives the following relations:
where m is the number of characters in the mantissa of the floating point representation.
In an embodiment of the invention which reflects the previous relations a systolic ring serial floating point accumulator 20 is shown in figures 3 and 4. Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.
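The two configurations quoted above are consistent with a simple count: enough circulations for the systolic cells to cover all m mantissa characters, plus one circulation for the overflow adjustment. The C sketch below encodes that count. The relation is inferred from these two data points only and is an assumption, not a formula stated in the description; the function name circulations is hypothetical.

```c
#include <assert.h>

/* Assumed relation: ceil(m / cells) alignment circulations plus one
 * final circulation for the overflow adjustment. */
static unsigned circulations(unsigned mantissa_chars, unsigned cells)
{
    assert(cells >= 1 && cells <= mantissa_chars);  /* one to m cells */
    unsigned align = (mantissa_chars + cells - 1) / cells;
    return align + 1;
}
```

With 6 mantissa characters this gives 7 circulations for one cell and 3 circulations for three cells, matching the figures quoted for figure 3.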
Figure 4 depicts a schematic representation of a systolic denormalisation array 21 and a finite state machine 22. The array 21 implements the Z mantissa denormalisation defined by equation (21), which aligns the Z mantissa to the accumulator mantissa in the floating point representation prior to their addition: the Z mantissa is denormalised by D_{e} characters when D_{e} is less than the mantissa length m, or by m characters when D_{e} is greater than or equal to m, in which case the value of the Z mantissa becomes zero. The finite state machine 22 implements equations (17) to (24) with the exclusion of equation (21). The finite state machine 22 consists of a controller 23 and an arithmetic logic unit (ALU_1) 24 which is described in figure 12 in the form of C simulation code as the function logic_1(), a linear array of delay cells 25 as described in figure 12 as shiftv() and a second arithmetic logic unit (ALU_2) 26 described in figure 12 as logic_2(). It should be noted that the nomenclature and connectivity used in figure 4 relate directly to the C simulation code of figure 12 and it is therefore apparent that the figure does not represent a minimum configuration of the invention.
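The behaviour of the denormalisation array can be sketched as a recurrence in C. The sketch assumes binary characters, a negative exponent difference that is stepped towards zero by a constant c, and one character shift per cell per circulation. The names zreg_t, denorm_cell and denorm_array are hypothetical and are not those of figure 12.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int de; uint32_t zm; } zreg_t;  /* Z register: D_e and Z_m */

/* One denormalisation cell: while D_e is still negative, shift the Z
 * mantissa right by one character and step D_e towards zero by c. */
static zreg_t denorm_cell(zreg_t z, int c)
{
    assert(c > 0);            /* assumption: the constant moves D_e upwards */
    if (z.de < 0) {
        z.zm >>= 1;
        z.de += c;
    }
    return z;
}

/* A ring of `cells` cells traversed `passes` times. */
static zreg_t denorm_array(zreg_t z, unsigned cells, unsigned passes)
{
    assert(cells >= 1);
    for (unsigned p = 0; p < passes; p++)
        for (unsigned k = 0; k < cells; k++)
            z = denorm_cell(z, 1);
    return z;
}
```

For a 6-character mantissa 101100 (decimal 44) and D_{e} = -2, six passes through one cell leave the mantissa 001011 (decimal 11) with D_{e} = 0. For D_{e} = -9, which exceeds the mantissa length, the same six passes shift every character out and the Z mantissa becomes zero.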
A first input to the accumulator 20 is presented sequentially with a series of floating point representations of real numbers Z consisting of triplets having the form {Z_{f}, Z_{e}, Z_{m}} wherein Z_{f} is a character sequence representing descriptors of the real number. An initialization flag character is also part of the descriptor. However, Z_{f} may or may not be used in one or other of the embodiments described hereafter. Z_{e} is a character sequence representing the exponent of the real number Z and Z_{m} is a character sequence representing the mantissa of the real number Z. A mode signal entered in parallel with the triplet through a second input identifies which of the fields Z_{f}, Z_{e} and Z_{m} are being input at any one time. In this implementation an additional character sequence C is also entered through a third input in parallel with the triplet as a constant to be used to increment the exponent difference D_{e} of equation (18). A further input shown in figure 4 is reset, which is used in the C simulation program to reset the simulated controller 23 and simulated ALU_2 26.
A first output from the accumulator consists of a status signal busy used to indicate when the accumulator may or may not accept inputs. An additional output provides a character sequence A which is the floating point representation {A_{f}, A_{e}, A_{m}} of the accumulation of the real numbers Z. A further output consists of a mode output signal which identifies the elements of the triplets. In this embodiment there is a final output Load which is derived from the initialization flag character present in the Z_{f} field of the input triplets.
These inputs and outputs collectively form the first inputs and outputs of the finite state machine 22.
The second output of the finite state machine 22 connects to the input of the systolic denormalisation array 21 whose output is connected to a second input of the finite state machine 22 to form a systolic ring of four registers: a Z register of at least two fields representative of the exponent difference D_{e}, equal to the difference between the accumulator exponent A_{e} and Z_{e}, and the Z mantissa value Z_{m}; an A register of at least two fields representative of exponent and mantissa of the A operand, in which the accumulation result is stored; a mode register which contains said mode signal; and a C register which contains a constant value which is circulated around the ring.
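The four registers of the ring can be pictured as a single C structure. The sketch below is illustrative only: the field widths, the type ring_t and the function ring_load are assumptions. The load behaviour shown (accumulator exponent and mantissa taken from the input operand, D_{e} and the Z mantissa cleared) follows the description of the load case given for States 3, 5 and 6 of the controller.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int      de;    /* Z register, first field: exponent difference D_e */
    uint32_t zm;    /* Z register, second field: Z mantissa Z_m */
    int      ae;    /* A register: accumulator exponent A_e */
    uint32_t am;    /* A register: accumulator mantissa A_m */
    uint32_t mode;  /* mode register: one mode bit per character */
    int      c;     /* C register: constant circulated around the ring */
} ring_t;

/* Load the first operand of a sequence into the ring. */
static ring_t ring_load(int ze, uint32_t zm, uint32_t mode, int c)
{
    ring_t r;
    assert(c > 0);   /* assumption: a positive increment constant */
    r.ae = ze;       /* A_e takes the input exponent */
    r.am = zm;       /* A_m takes the input mantissa */
    r.de = 0;        /* exponent difference cleared */
    r.zm = 0;        /* Z mantissa cleared */
    r.mode = mode;
    r.c = c;
    return r;
}
```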
An internal connection between ALU_1 24 and ALU_2 26 denoted as sig in figures 4 and 12 is used as a control signal path to implement the conditional assignments in ALU_2 of equations (19) and (20).
The following table details the data structure for both the serial operands and the associated mode bit. The operands are entered into the accumulator least significant character or least significant bit (LSB) first. State machines decode the different fields within the finite state machine controller and ALU_2.
                 Mantissa                 Exponent
OPERAND:  Guard  msb . . . lsb    sign    msb . . . lsb    zero_flag  load_bit
MODE:     0      1   . . . 1      0       0   . . . 0      0          1
TABLE 1: Data format and associated MODE word.
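The format of Table 1 can be illustrated by a C routine which serialises one operand, least significant character first, together with its mode bits. The sketch assumes one-bit characters, the 4 exponent and 6 mantissa characters of the figure 3 example, and an exponent passed as raw bits. The names serialise, E_CHARS, M_CHARS and WORD_LEN are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define E_CHARS  4                               /* exponent characters (figure 3 case) */
#define M_CHARS  6                               /* mantissa characters (figure 3 case) */
#define WORD_LEN (2 + E_CHARS + 1 + M_CHARS + 1) /* flags, exponent, sign, mantissa, guard */

/* Emit the operand in order of entry: load_bit, zero_flag, exponent
 * (lsb first), sign, mantissa (lsb first), guard. */
static void serialise(unsigned load, unsigned zero_flag, unsigned exp,
                      unsigned sign, unsigned man,
                      uint8_t data[WORD_LEN], uint8_t mode[WORD_LEN])
{
    unsigned n = 0;
    assert(exp < (1u << E_CHARS) && man < (1u << M_CHARS));
    data[n] = load & 1u;      mode[n++] = 1;     /* load_bit: mode 1 */
    data[n] = zero_flag & 1u; mode[n++] = 0;     /* zero_flag: mode 0 */
    for (unsigned i = 0; i < E_CHARS; i++) {     /* exponent: mode 0 */
        data[n] = (exp >> i) & 1u;
        mode[n++] = 0;
    }
    data[n] = sign & 1u;      mode[n++] = 0;     /* sign: mode 0 */
    for (unsigned i = 0; i < M_CHARS; i++) {     /* mantissa: mode 1 */
        data[n] = (man >> i) & 1u;
        mode[n++] = 1;
    }
    data[n] = 0;              mode[n++] = 0;     /* guard: mode 0 */
}
```

Read in order of entry, the mode word is a single 1 for the load bit, a run of zeroes covering the zero flag, exponent and sign, a run of ones covering the mantissa, and a final 0 for the guard character. These are the transitions which the state machines use to locate the fields.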
A state diagram which describes the operation of the controller, multiplexer and ALU_1 in the finite state machine of figure 4 is given in figure 1. The controller is a state machine shown in figure 9 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 9. The functional behaviour of the state machine is described by the C simulation code function fsm1() of figure 12.
In the following figures, all logical tests on variables are defined to be true if the value of the variable is nonzero, and false if the value of the variable is zero. The initial state State 0 as shown in figure 1 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The controller remains in the zero state until a nonzero mode bit is detected after which it enters State 1. In State 1 the load bit associated with the flag characters Z_{f} is sampled, logically ORed with the zero flag status register for the accumulator A_{zf} and stored in the one-bit storage register Load.
At the next clock transition, which corresponds to a zero mode bit, the controller moves to State 2 in which the zero flag character of the flag characters Z_{f} is stored in the internal storage register Z_{zf}.
If the Load register contains a nonzero value, the controller enters State 3 at the next clock period and otherwise the controller enters State 4 which will be described subsequently. In State 3 the accumulator exponent field is incremented by the contents of the overflow register from the previous computation and is output as the exponent field of the accumulated result through the finite state machine first output, and also the value of the input operand exponent field z_{e} is output to the ring accumulator exponent register A_{e} through the finite state machine second output, the exponent difference field D_{e} is set to zero and is entered into the ring Z register through the finite state machine second output, the sign register sig is set to zero and both the Z mantissa sign register Z_{s} and the accumulator sign register A_{s} are set equal to the sign of the input operand mantissa z_{s}.
When the value of the mode bit becomes nonzero, indicating the presence of mantissa characters, the controller enters either State 6 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 5 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
In State 5 the sign-corrected mantissa value A_{m} is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, before being output as the result mantissa through the finite state machine first output. The mantissa value input to the ring Z register is set to zero and the mantissa register A_{m} is set to the input mantissa value z_{m}. In State 6 the correctly represented mantissa value A_{m} is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output as the result mantissa through the finite state machine first output. As in State 5 the mantissa value input to the Z register is set to zero and the mantissa register A_{m} is set to the input mantissa value z_{m}.
The controller enters State 9 when the mode bit becomes zero. It remains in this state until a nonzero signal cyn_1 is received from a counter depicted in figure 9, indicating that the mantissae A_{m} and Z_{m} are aligned or the mantissa Z_{m} is zero. During this state, the modulo 2 sum of the signs of the Z and A mantissae is stored in the register neg.
At the next clock period the controller enters State 10.
When the mode bit becomes nonzero the controller enters:
State 11 in which the accumulator value A_{m} is computed by adding the contents of the Z mantissa Z_{m} to the contents of the accumulator A_{m},
State 12 in which the accumulator value A_{m} is computed by subtracting the contents of the accumulator A_{m} from the contents of the Z mantissa Z_{m},
State 13 in which the accumulator value A_{m} is computed by subtracting the contents of the Z mantissa Z_{m} from the contents of the accumulator A_{m}, or
When the mode bit becomes zero the controller returns to State 0.
If the Load register in State 2 contains zero, the controller enters State 4 in which the accumulator exponent value A_{e} is incremented by the value of the previously computed overflow A_{ov}f and is output to the ring through the second output of the finite state machine. The exponent difference D_{e} is set equal to the difference of the value z_{e} and the incremented accumulator value A_{e} + A_{ov}f and is output to the Z register through the second output of the finite state machine. The sign register sig is set equal to the sign bit of D_{e} and the one bit Z mantissa sign register Z_{s} is set equal to the sign bit of the input mantissa and the one bit accumulator sign register A_{s} is left unchanged.
When the value of the mode bit becomes non-zero, indicating the presence of mantissa characters, the controller enters either State 8 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 7 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
In State 7 the sign-corrected mantissa value A_{m} is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output.
In State 8 the correctly represented mantissa value A_{m} is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output. In both State 7 and State 8 the Z register contents are passed unchanged from the finite state machine second input to the finite state machine second output.
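The exponent and mantissa handling of States 4, 7 and 8 can be summarised at word level. The C sketch below is illustrative only: the type align_in and the function absorb_overflow are hypothetical names, the mantissa is treated as a non-negative magnitude, and characters are assumed to be single bits (base 2, as in defs.h of figure 12).

```c
#include <assert.h>

/* Hypothetical word-level view of the values produced in States 4, 7 and 8. */
typedef struct {
    long a_m;  /* accumulator mantissa, right shifted by the overflow */
    int  a_e;  /* accumulator exponent, incremented by the overflow */
    int  d_e;  /* exponent difference z_e - (A_e + A_ovf) */
    int  sig;  /* sign bit of d_e: 1 when d_e is negative */
} align_in;

align_in absorb_overflow(long a_m, int a_e, int a_ovf, int z_e)
{
    align_in r;
    assert(a_m >= 0 && a_ovf >= 0);
    r.a_m = a_m >> a_ovf;   /* States 7 and 8: shift by the overflow amount */
    r.a_e = a_e + a_ovf;    /* State 4: A_e + A_ovf */
    r.d_e = z_e - r.a_e;    /* State 4: D_e */
    r.sig = r.d_e < 0;      /* State 4: sig register */
    return r;
}
```

For example, an accumulator mantissa of 0x18 with exponent 4 and an overflow of 1 becomes mantissa 0xC with exponent 5 before the next operand is aligned against it.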
A state diagram which describes the operation of the second arithmetic logic unit ALU_2 in the finite state machine of figure 4 is given in figure 2. The ALU_2 has a state machine shown in figure 10 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 10. The functional behaviour of the state machine is described by the C simulation code function fsm2() of figure 12.
The initial state State 0 as shown in figure 2 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The ALU_2 remains in the zero state until a non-zero mode bit is detected, after which it enters State 1. At the occurrence of the next clock the ALU_2 state changes to State 2. The ALU_2 state changes to State 5 if the sign control line from ALU_1 is non-zero, and changes to State 3 otherwise.
In State 3 the exponent difference D_{e} is negated and the contents of the accumulator exponent field are replaced by the sum of A_{e} and the negated D_{e}, so restoring the former Z_{e} value.
When the mode bit becomes non-zero, indicating the mantissa field, the ALU_2 changes state to State 4. In State 4 the contents of the two mantissa registers A_{m} and Z_{m} are exchanged.
When the mode bit becomes zero, the ALU_2 enters State 5.
The ALU_2 remains in State 5 until a non-zero signal cyn_1 is received from a counter depicted in figure 10, when it enters State 6. It re-enters State 0 when the signal cyn_1 becomes zero.
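Taken together, States 3 and 4 of ALU_2 order the operands so that the operand to be denormalised is the one with the smaller exponent. A word-level C sketch of that ordering follows. The names fp_field and order_operands are hypothetical, and the sketch keeps the true exponent in each operand rather than modelling the D_{e} field carried around the ring.

```c
#include <assert.h>

/* Hypothetical word-level operand: exponent and mantissa magnitude. */
typedef struct {
    int  e;
    long m;
} fp_field;

/* Orders the operands so that a holds the larger exponent.
 * Returns the exponent difference after ordering, which is never positive;
 * its magnitude is the number of one-character shifts z still needs. */
int order_operands(fp_field *a, fp_field *z)
{
    int d_e;
    assert(a != 0 && z != 0);
    d_e = z->e - a->e;
    if (d_e >= 0) {          /* sign line zero: States 3 and 4 */
        fp_field t = *a;     /* exchange exponents and mantissae */
        *a = *z;
        *z = t;
        d_e = -d_e;          /* the exponent difference is negated */
    }
    return d_e;              /* sign line non-zero: operands left as they are */
}
```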
Equations (17) to (24) with the exclusion of equation (21) are implemented using the finite state machine 22. To implement the denormalisation of equation (21), an array of at least one systolic cell is required in which the transfer of data between cells is described by the following recurrences
M_{0}(p) = M_{2}(p - 1) (25)
C_{0}(p) = C_{2}(p - 1) (26)
Z_{0}(p) = Z_{2}(p - 1) (27)
A_{0}(p) = A_{2}(p - 1) (28)
and the internal recurrences which are implemented in each cell are
M_{2}(n) = M_{1}(n - 1) (29)
M_{1}(n) = M_{0}(n - 1) (30)
C_{2}(n) = C_{1}(n - 1) (31)
C_{1}(n) = C_{0}(n - 1) (32)
Z_{1}(n) = Z_{0}(n - 1) (33)
Z_{4}(n) = Z_{3}(n - 1) if M_{1}(n - 1) = 0
         = Z_{4}(n - 1) if M_{1}(n - 1) = 1 (34)
A_{2}(n) = A_{1}(n - 1) (35)
A_{1}(n) = A_{0}(n - 1) (36)
Z_{2}(n) = C_{1}(n - 1) + Z_{0}(n - 1) + C_{y}(n - 1) if M_{0}(n - 1) & M_{1}(n - 1) & Z_{4}(n - 1) = TRUE
         = C_{1}(n - 1) + Z_{1}(n - 1) + C_{y}(n - 1)
         = C_{1}(n - 1) + Z_{0}(n - 2) + C_{y}(n - 1) if M_{0}(n - 1) & M_{1}(n - 1) & Z_{4}(n - 1) = FALSE (37)
It is assumed that C contains the value 1 in the character position corresponding to the least significant exponent character, and is zero elsewhere. An examination of recurrence (34) shows that the sign of the exponent is stored in Z_{4} for the duration of the mantissa. This value is used to control, via recurrence (37), whether the mantissa output Z_{2} is delayed by one or by two stages when the mode values M_{0} and M_{1} are high. This effects a one-character denormalisation of the Z mantissa field relative to the A mantissa when the exponent difference D_{e} is negative. The presence of a 1 in the C character sequence can be seen to increment the exponent difference according to recurrence (37).
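At word level, and for sign-magnitude mantissae with base 2 characters, the effect of one cell on the Z operand can be sketched in C as follows. The function name denorm_cell and its arguments are illustrative only, and the character-serial delays of recurrences (29) to (37) are not modelled.

```c
#include <assert.h>

/* One pass of the Z operand through a denormalisation cell (word-level sketch).
 * z_m: Z mantissa magnitude; d_e: exponent difference;
 * c:   1 if the increment marker C accompanies the exponent, else 0. */
void denorm_cell(long *z_m, int *d_e, int c)
{
    assert(z_m != 0 && d_e != 0);
    assert(*z_m >= 0 && (c == 0 || c == 1));
    if (*d_e < 0)       /* Z_4 holds the sign of the incoming exponent */
        *z_m >>= 1;     /* one-character right shift of the mantissa */
    *d_e += c;          /* the 1 in C increments the exponent difference */
}
```

Once the exponent difference is no longer negative the mantissa passes through unshifted, so further passes through a cell leave the alignment intact.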
Each cell which implements these recurrences in a linear structure can implement a one-character denormalisation and sign-extension required for floating-point addition using ones-complement or two's-complement mantissae, and the denormalisation without sign extension for sign-magnitude mantissae. Thus for an m-bit mantissa full denormalisation requires the application of m recurrences. These recurrences may be applied either by connecting m cells in a linear array, or by connecting at least one cell in a systolic ring structure with sufficient delay cells to contain the operand, and circulating the operands until m recurrences have been applied, or until the mantissae are aligned as indicated by a non-negative exponent difference.
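The ring alternative can be illustrated by counting circulations. The C sketch below is a word-level model under stated assumptions: ring_circulations is a hypothetical name, each cell applies one recurrence per circulation, the mantissa is a non-negative sign-magnitude value, and circulation stops as soon as either condition named above is met.

```c
#include <assert.h>

/* Circulates the Z mantissa through a ring of `cells` denormalisation cells.
 * Stops when the exponent difference is non-negative or when m recurrences
 * have been applied. Returns the number of circulations performed. */
int ring_circulations(long *z_m, int d_e, int cells, int m)
{
    int applied = 0, circulations = 0, i;

    assert(z_m != 0 && *z_m >= 0 && cells > 0 && m >= 0);
    while (d_e < 0 && applied < m) {
        for (i = 0; i < cells && applied < m; i++) {
            if (d_e < 0)
                *z_m >>= 1;  /* shift only while the mantissae are misaligned */
            d_e += 1;        /* every cell increments the exponent difference */
            applied++;
        }
        circulations++;
    }
    return circulations;
}
```

With two cells, an exponent difference of -5 is resolved in three circulations; a linear array of m cells corresponds to the case cells = m, which never needs more than one pass.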
Figure 5 represents a schematic diagram of one possible hardware implementation of a denormalisation cell 27 implementing the above recurrence equations (29) to (37).
Figure 6 represents a schematic diagram of one possible hardware implementation of a linear array of delay stages and their interconnection denoted by the above recurrence equations (25) to (28).
Figures 7 and 8 together represent a schematic diagram of the arithmetic logic unit ALU_1 24 component of the finite state machine 22. The notation depicted in figures 7 and 8 follows that of figure 12.
Figure 9 represents a schematic diagram of the control element 23 of the finite state machine 22. The notation depicted in figure 9 follows that of figure 12.
Figure 10 represents a schematic diagram of the arithmetic logic unit ALU_2 26 component of the finite state machine 22. The notation depicted in figure 10 follows that of figure 12.
A further embodiment of the invention is provided in figure 11 which depicts a schematic diagram of the joining or coalescence of two adjacent systolic ring accumulators to form a single accumulator capable of accumulating operands of double length. In the two systolic rings which have coalesced, the multiplexer for the second ring is controlled by the controller of the first ring.
Figure 12 is a C code simulation of an embodiment of a sign-magnitude systolic ring accumulator.
Although not implemented, it must be noted that post-normalisation is possible with the architecture of the ring accumulator. Minor additional complexity would be incurred in the logic circuitry and state machine of Logic_1, and an additional recirculation would be required.
Systolic ring arithmetic units provide new possibilities for systolic array processors. Consider a simple linear array of two processors, designed to process single precision operands. If the two processors are implemented as systolic rings it is possible with appropriate multiplexer means to coalesce the two rings into a single, larger ring. This large ring can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa characters to systolic cells remains a constant. For larger order systolic arrays the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.
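The constant-ratio argument can be checked with a one-line calculation. In the C sketch below, circulations_needed is a hypothetical helper that assumes one recurrence per cell per circulation, so that full denormalisation of an m-character mantissa takes the ceiling of m divided by the number of cells.

```c
#include <assert.h>

/* Circulations needed to apply mant_chars recurrences with ring_cells cells. */
int circulations_needed(int mant_chars, int ring_cells)
{
    assert(mant_chars >= 0 && ring_cells > 0);
    return (mant_chars + ring_cells - 1) / ring_cells;  /* ceiling division */
}
```

With the values of defs.h in figure 12 (a 30-character mantissa and 15 cells) this gives 2 circulations, and a coalesced ring with a 60-character mantissa and 30 cells gives the same figure.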
The nature of the systolic architecture allows advantage to be taken of the statistical properties of numbers to minimise the number of systolic cells. Current studies suggest that the number of systolic cells may be minimised by matching the number of cells to the 95^{th} percentile of the expected distribution of denormalisation shifts. In such a processor, the use of longer mantissa lengths for increased precision would not require increased numbers of systolic cells, but only an increase in the length of the registers. For such an implementation 95% of accumulations would occur in the designed number of circulations, and the remaining 5% would require additional circulations. In a processor which is asynchronous, this computation time uncertainty would not constitute a problem, and the saving of circuitry would be valuable. The only addition to the structure would be a test of completion of denormalisation. A successful test would cause the remaining circulations of the operands to be bypassed. The information required to reduce the number of circulations in this way is in the sign bit of the incremented exponent difference, and can be used as an input to an expanded state machine in the circuit Logic_1. When the sign bit is zero, the denormalisation is complete, and the state machine can move to the next state.
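One way to pick the design point described above is to read the required shift off a histogram of observed denormalisation shifts. The C sketch below is an assumption about how such a sizing could be done, not a method given in this specification; shift_at_percentile and its histogram argument are hypothetical.

```c
#include <assert.h>

/* hist[s] is the number of observed accumulations that needed a
 * denormalisation shift of s characters, for s = 0 .. n-1.
 * Returns the smallest shift s such that at least `percent` per cent
 * of the observations need a shift of s characters or fewer. */
int shift_at_percentile(const unsigned long *hist, int n, int percent)
{
    unsigned long total = 0, running = 0;
    int s;

    assert(hist != 0 && n > 0 && percent > 0 && percent <= 100);
    for (s = 0; s < n; s++)
        total += hist[s];
    for (s = 0; s < n; s++) {
        running += hist[s];
        if (running * 100 >= total * (unsigned long)percent)
            return s;
    }
    return n - 1;
}
```

The number of cells multiplied by the designed number of circulations would then be chosen to cover the returned shift, with longer shifts handled by additional circulations.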
Systolic ring and linear array floating point accumulators constructed according to the details described in this patent are of interest in large order systolic arrays and neural networks, and floating point arithmetic units implemented in Gallium Arsenide. This is due to the wide range of area/time/precision/dynamic-range trade-offs achievable with the ring architecture and its low transistor count. It is also possible to implement the architecture determined by this patent with simple optical processing techniques.
defs.h
#define base 2
#define states 4
#define statesl 5
#define reg_len 12
#define recirc 3
#define recirc_m 2
#define exp_len 10
#define mant_len 30
#define cells mant_len/2
enum clock {
ph1, ph2
};
typedef struct {
int p1, p2;
} reg;
typedef struct {
reg x1, x2, y1, y2, mode1, mode2, pp, cy;
} mult;
typedef struct {
reg x1, x2, y1, y2, mode1, mode2, pp1, pp2, sign, cy, bypass;
} norm;
sma5.c
#include <stdio.h>
#include <math.h>
#include "defs.h"
int
and(a, b)
int a, b;
{
return (a & b);
}
int
or(a, b)
int a, b;
{
return (a | b);
}
int
mux(sel, a, b)
int sel, a, b;
{
if (sel == 0)
return (a);
else
return (b);
}
int
mux4(sel, a, b, c, d)
int sel, a, b, c, d;
{
switch (sel) {
case(0): return (a); break;
case(1): return (b); break;
case(2): return (c); break;
case(3): return (d); break;
}
}
void
add(a, b, c, sum, cy)
int a, b, c, *sum, *cy;
{
*sum = (a + b + c) % base;
*cy = (a + b + c) / base;
}
int
inv_bit(x)
int x;
{
int xbar;
xbar = ~x & 1;
return (xbar);
}
int
nor(a, b)
int a, b;
{
return (~(a | b)) & 1;
}
int
xor(a, b)
int a, b;
{
return or(nor(inv_bit(a), b), nor(inv_bit(b), a));
}
void add_sub(a_s, a, b, c, sum, cy)
int a_s, a, b, c, *sum, *cy;
{
int t, ct;
ct = inv_bit(nor(or(
nor(inv_bit(a), inv_bit(c)),
nor(inv_bit(c), inv_bit(b))),
nor(inv_bit (b), inv_bit(a))));
t = xor(b,c);
*sum = xor(a,t);
*cy = xor(nor(a_s,inv_bit(t)),ct);
}
int
reg_cell(clock, a, b)
int clock, a;
reg *b;
{
if (clock == 0)
b->p1 = ~a;
if (clock == 1)
b->p2 = ~b->p1;
return (b->p2);
}
int
shiftv(cl, len, a, sreg)
int cl, len, a;
reg *sreg;
{
int i, op;
op = reg_cell(cl, a, &sreg[0]);
for (i = 0; i < len - 1; i++)
op = reg_cell(cl, sreg[i].p2, &sreg[i + 1]);
return (op);
}
int
fsm1(cl, reset, mode, load, neg, As, cyn_1, state)
int cl, reset, mode, load, neg, As, cyn_1, state; {
static int p1_reset, p1_mode, p1_state, t;
if (cl == 0) {
p1_reset = reset;
p1_mode = mode;
p1_state = state; }
if (cl == 1) {
if (p1_reset == 1) {
state = 0;
} else {
switch (p1_state) {
case 0: state = mux(p1_mode,0,1);
break;
case 1:
switch (p1_mode) {
case 0:
state = 2;
break;
case 1:
printf ("Error in fsm1 s1: second bit of field one\n");
break;
}
break;
case 2: state = mux(load,4,3);
break;
case 3: if (!p1_mode) state = 3;
else state = mux(neg&&As,6,5);
break;
case 4: if (!p1_mode) state = 4;
else state = mux(neg&&As,8,7);
break;
case 5: state = mux(p1_mode,9, 5);
break;
case 6: state = mux (p1_mode, 9, 6);
break;
case 7: state = mux(p1_mode,9,7);
break;
case 8: state = mux(p1_mode,9,8);
break;
case 9: state = mux(cyn_1,9,10);
break;
case 10: t = (!neg&&p1_mode) +
2*(neg&&!As&&p1_mode) +
3*(neg&&As&&p1_mode);
state = mux4(t,10,11,13,12);
break;
case 11: state = mux(p1_mode,0,11);
break;
case 12: state = mux(p1_mode,0, 12);
break;
case 13: state = mux(p1_mode, 0,13);
break;
}
}
}
return (state);
}
main(argc, argv)
int argc;
char *argv[];
{
FILE *infp, *outfp;
char ch;
int index = 0, cl_gen, cl, ind;
int busy = 0, reset = 1;
int a[states], b[states];
int f_eof;
outfp = fopen("states", "w");
if (outfp == NULL)
fprintf(stderr,
"%s: cannot open file %s\n", argv[0], "states");
else {
a[0] = 0;
a[1] = 0;
a[2] = 0;
a[3] = 0;
/* Clock the accumulator with reset asserted */
for (ind = 0; ind < 4; ind++)
for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
cl = (cl_gen & 2) >> 1;
busy = fpad(outfp, cl_gen, cl, reset, a, b);
}
/* printf ("Logic reset\n"); */
reset = 0;
init_instructions();
/* Read one input character set whenever the ring is not busy */
do {
for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
cl = (cl_gen & 2) >> 1;
if ((busy == 0) && (cl_gen == 2)) {
f_eof = scanf("%1d %1d %1d %1d\n", &a[0], &a[1], &a[2], &a[3]);
}
busy = fpad(outfp, cl_gen, cl, reset, a, b);
if ((busy == 0) & (cl_gen == 3)) {
if (b[0]) printf("%1d %1d %d %d\n", b[0], b[1], b[2], b[3]);
}
}
} while (f_eof != EOF);
fclose(outfp);
}
}
smad5.c
#include <stdio.h>
#include "defs.h"
/************************************************
Microcode fields are:
LAzs LAs Lzs Lzf Lld func src1 src2 shft dst1 dst2 cry1 cry2
Field lengths are:
|L|L|L|L|L|f|ss|ss|s|dd|dd|c|c|
Define constants as:
func: (fadd,fsub)
src1: (s1_Z,s1_A,s1_z,s1_0)
src2: (s2_Z,s2_A,s2_z,s2_0)
shft: (shift0,shift1)
dst1: (d1_Z,d1_z,d1_f,d1_0)
dst2: (d2_Z,d2_z,d2_f,d2_0)
cry1: (noset1,set1)
cry2: (noset2,set2)
************************************************/
#define noset2 0
#define set2 1
#define noset1 0
#define set1 2
#define d2_f 0xc
#define d2_z 4
#define d2_Z 8
#define d2_I 0
#define d1_Z 0
#define d1_z 0x10
#define d1_f 0x20
#define d1_0 0x30
#define shift0 0
#define shift1 0x40
#define s2_Z 0
#define s2_z 0x80
#define s2_A 0x100
#define s2_0 0x180
#define s1_Z 0
#define s1_z 0x200
#define s1_A 0x400
#define s1_0 0x600
#define fsub 0
#define fadd 0x800
#define Lld 0x1000
#define Lzf 0x2000
#define Lzs 0x4000
#define LAs 0x8000
#define LAzs 0x10000
extern int mux();
int instr[20] ;
void init_instructions ()
{
instr[0] = 0 ;
instr[1] = Lld;
instr[2] = Lzf+set1+set2;
instr[3] = Lzs+LAzs+d1_0+d2_z;
instr[4] = Lzs+fsub+s1_z+s2_A+d1_f+d2_I;
instr[5] = shift1+fsub+s1_0+s2_A+d1_0+d2_z;
instr[6] = shift1+fadd+s1_0+s2_A+d1_0+d2_z;
instr[7] = shift1+fsub+s1_0+s2_A+d1_z+d2_f;
instr[8] = shift1+fadd+s1_0+s2_A+d1_z+d2_f;
instr[9] = d1_Z+d2_I;
instr[10] = d1_Z+d2_I;
instr[11] = LAs+fadd+s1_Z+s2_A+d2_f;
instr[12] = LAs+fsub+s1_Z+s2_A+d2_f;
instr[13] = LAs+fsub+s1_A+s2_Z+d2_f;
}
void
norm_cell(cell, cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
norm *cell;
int cl, x, y, pp, mode, *x_out, *y_out, *pp_out, *mode_out;
{
int m1, x1, y1, pp1, sum, c_out, bypass, sign;
m1 = reg_cell(cl, mode, &cell->mode1);
*mode_out = reg_cell(cl, m1, &cell->mode2);
pp1 = reg_cell(cl, pp, &cell->pp1);
*pp_out = reg_cell(cl, pp1, &cell->pp2);
y1 = reg_cell(cl, y, &cell->y1);
sign = reg_cell(cl, y1, &cell->sign);
bypass = reg_cell(cl, mux(m1, sign, cell->bypass.p2), &cell->bypass);
x1 = reg_cell(cl, x, &cell->x1);
*x_out = reg_cell(cl, x1, &cell->x2);
add(cell->pp1.p2, mux(and(and(m1, mode), bypass), y1, y),
cell->cy.p2, &sum, &c_out);
c_out = reg_cell(cl, c_out, &cell->cy);
*y_out = reg_cell(cl, sum, &cell->y2);
}
void
normalise(cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
int cl, x, y, pp, mode, *x_out, *y_out, *pp_out, *mode_out;
{
static norm mx[cells];
int cell_index, j, a[states], b[states];
a[0] = x;
a[1] = y;
a[2] = pp;
a[3] = mode;
for (cell_index = 0; cell_index < cells; cell_index++) {
norm_cell(&mx[cell_index], cl, a[0], a[1], a[2], a[3],
&b[0], &b[1], &b[2], &b[3]);
for (j = 0; j < states; j++) a[j] = b[j];
}
*x_out = b[0];
*y_out = b[1];
*pp_out = b[2];
*mode_out = b[3];
}
void delay(cl, a, b, del)
int cl, *a, *b;
reg *del;
{
int i;
for (i = 0; i < states; i++)
b[i] = reg_cell(cl, a[i], &del[i]);
}
void
delay1(cl, a, b)
int cl, *a, *b;
{
static reg del[states];
int i;
for (i = 0; i < states; i++)
b[i] = reg_cell(cl, a[i], &del[i]);
}
void
logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, sig, con)
FILE *outfp ;
int cl_gen, cl, reset, *a, *g, *e, *lr, *r, *sig, *con;
reg *del;
{ /* logic_1 */
static int lAzs, lAs, lzs, lzf, lld, func, src1, src2;
static int shft, dst1, dst2, cry1, cry2;
static int lAc_sign, A, __A, A_, Z, z, Aovf, Ao, Zo, gshft;
static int last_mode, count, cyn_1, load, Ld, neg, As;
static int ed[states], edz[states];
static int fdsum, fsum, fcy, isum, icy, zzf;
static int Azf, ss, bb, cc, Zs, cycle, cy1, state, p1, p2, Asd;
static reg fcy_reg, icy_reg, delz[states];
static reg si_reg, Load_reg, zzf_reg;
static reg Aovf_reg, Azf_reg, As_reg, Zsi_reg, fd_reg, neg_reg;
static reg edd_reg, r_reg, Ld_reg, Asd_reg;
if (reset == 1) {
cycle = (recirc - 1);
count = 1;
last_mode = 0;
}
if (cl_gen == 2) {
if (e[3]&&!last_mode) {
count = (count + 1)%2;
if (!(count)){
cycle = (cycle + 1) % recirc;
}
}
cy1 = (cycle == 0) || reset;
cyn_1 = (cycle == recirc - 1);
if (count>1) *con = inv_bit(cy1);
last_mode = e[3];
}
e[0] = g[0] ;
e[1] = mux(*con, a[1], g[1]);
e[2] = mux(*con, a[2], g[2]);
e[3] = mux(*con, a[3], g[3]);
delay(cl, e, ed, del);
delay(cl, a, edz, delz);
state = fsm1(cl, reset, e[3] , load, neg, Asd_reg.p2,cyn_1, state);
p1 = cl == 0;
p2 = cl == 1;
Z = ed[1];
z = edz[1];
A = ed[0];
__A = e[0];
/* Instruction decode */
cry2 = instr[state]&1;
cry1 = (instr[state]>>1)&1;
dst2 = (instr[state]>>2)&3;
dst1 = (instr[state]>>4)&3;
shft = (instr[state]>>6)&1;
src2 = (instr[state]>>7)&3;
src1 = (instr[state]>>9)&3;
func = (instr[state]>>11)&1;
lld = (instr[state]>>12)&1;
lzf = (instr[state]>>13)&1;
lzs = (instr[state]>>14)&1;
lAs = (instr[state]>>15)&1;
lAzs = (instr[state]>>16)&1;
lzs = lzs&&e[3]&&!ed[3];
lAzs = lAzs&&e[3]&&!ed[3];
lAc_sign = lAs&&!e[3];
gshft = shft&&!e[3];
fcy = reg_cell(cl, mux(cry1, and(fcy, inv_bit(and(e[3],
inv_bit(ed[3])))), Aovf), &fcy_reg);
icy = reg_cell(cl, mux(cry2, icy, Aovf), &icy_reg);
load = reg_cell(cl, mux(lld, Load_reg.p2, ed[1] || !Azf_reg.p2),
&Load_reg);
Ld = reg_cell(cl, mux(lld, Ld_reg.p2, ed[1]), &Ld_reg);
zzf = reg_cell(cl, mux(lzf, zzf_reg.p2, ed[1]), &zzf_reg);
Zs = reg_cell(cl, mux4(lzs+2*(inv_bit(*sig)&&shft&&!e[3]),
Zsi_reg.p2, ed[1]&&!load, As_reg.p2), &Zsi_reg);
Azf = reg_cell(cl, mux(or(lzf,lAs), Azf_reg.p2,
and(lAs, or(fsum, Azf_reg.p2))), &Azf_reg);
neg = reg_cell(cl, bb = mux(shft, neg_reg.p2, As^Zs), &neg_reg);
Aovf = reg_cell(cl, mux(lAs, Aovf_reg.p2, fsum^neg), &Aovf_reg);
A_ = mux(and(shft,Aovf), A, __A);
add_sub(func, mux4(src1,Z,z,A_,0), mux4(src2,Z,z,A_,0),
fcy_reg.p2, &fsum, &fcy);
add_sub(1, 0, A_, icy_reg.p2, &isum, &icy);
Zo = mux4(dst1,Z,z,fsum,0);
Ao = mux4(dst2,isum,z,Z,fsum);
*sig = reg_cell(cl, mux(lzs, si_reg.p2, fd_reg.p2), &si_reg);
fdsum = reg_cell(cl, fsum, &fd_reg);
As = reg_cell(cl, mux4((lAc_sign+2*(lAzs)+3*(inv_bit(*sig)&&shft&&!e[3])),
As_reg.p2, ((!neg)&&As&&Zs) || (neg&&fsum), ed[1], Zsi_reg.p2), &As_reg);
Asd = reg_cell(cl, As, &Asd_reg);
lr[0] = Ao&&!lzs;
lr[1] = Zo&&!lzs;
lr[2] = ed[2];
lr[3] = ed[3];
r[0] = Ld;
r[1] = reg_cell(cl, mux4(lzs+2*ed[3],isum,As_reg.p2,fsum,0),
&r_reg);
r[2] = reg_cell(cl, ed[3],&edd_reg);
r[3] = r[2];
/* End of datapath */
}
int
fsm2(cl, reset, mode, sign, cyn_1, state)
int cl, reset, mode, sign, cyn_1, state;
{
static int p1_reset, p1_mode, p1_state, t;
if (cl == 0) {
p1_reset = reset;
p1_mode = mode;
p1_state = state;
}
if (cl == 1) {
if (p1_reset == 1) {
state = 0;
} else {
switch (p1_state) {
case 0: state = mux (p1_mode, 0,1);
break;
case 1:
switch (p1_mode) {
case 0:
state = 2;
break;
case 1 :
printf("Error in fsm2 s1: second bit of field one\n");
break;
}
break;
case 2: state = mux(sign,3,5);
break;
case 3: state = mux (p1_mode, 3, 4);
break;
case 4: state = mux(p1_mode, 5,4);
break;
case 5: state = mux(cyn_1,5,6);
break;
case 6: state = mux(cyn_1,0,6);
break;
}
}
}
return (state);
}
void
logic_2(cl_gen, cl, reset, sign, e, lr)
int cl_gen, cl, reset, *sign, *e, *lr;
{
static int ed[states], state;
static int dst1, dst2, cry;
static int sub_s, sub_c, sub_cd, ad_s, ad_c, ad_cd;
static int cycle, cy1;
static int last_mode, count, cyn_1;
static reg sub_cy, ad_cy;
if (reset == 1) {
cycle = (recirc - 1);
count = 1;
last_mode = 0;
}
if (cl_gen == 2) {
if (e[3]&&!last_mode) {
count = (count + 1)%2;
if (!(count)) {
cycle = (cycle + 1) % recirc;
}
}
cy1 = (cycle == 0) || reset;
cyn_1 = (cycle == recirc - 1);
last_mode = e[3];
}
/* Register operations */
delay1(cl, e, ed);
state = fsm2(cl, reset, e[3], *sign, cyn_1, state);
ad_cd = reg_cell(cl, and(ad_c,cry), &ad_cy);
sub_cd = reg_cell(cl, and(sub_c,cry), &sub_cy);
/* Control signal generation */
dst1 = 0;
dst2 = 0;
cry = 0;
switch (state) {
case 3:
dst1 = 2;
dst2 = 3;
cry = 1;
break;
case 4:
dst1 = 1;
dst2 = 1;
break;
}
/* Arithmetic operations */
add_sub(0,0,ed[1], and(cry,sub_cy.p2),&sub_s,&sub_c);
add_sub(1,ed[1],ed[0],and(cry,ad_cy.p2),&ad_s,&ad_c);
/* set outputs from logic cell */
lr[0] = mux4(dst2, ed[0], ed[1], sub_s, ad_s);
lr[1] = mux4(dst1, ed[1], ed[0], sub_s, ad_s);
lr[2] = ed[2];
lr[3] = ed[3];
/* End of datapath */
}
int
fpad(outfp, cl_gen, cl, reset, a, r)
FILE *outfp;
int cl_gen, cl, reset, *a, *r;
{
static int b[states], c[states], d[states], g[states];
static int e[states], f[states], h[states], lr[states];
static reg xr[reg_len], yr[reg_len], ppr[reg_len];
static reg acxr[reg_len], acyr[reg_len], acppr[reg_len];
static reg acmoder[reg_len], moder[reg_len];
static int con = 0, sig = 0;
static reg del[states];
if (reset == 1)
con = 0;
logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, &sig, &con);
f[0] = shiftv(cl, reg_len, lr[0], acxr);
f[1] = shiftv(cl, reg_len, lr[1], acyr);
f[2] = shiftv(cl, reg_len, lr[2], acppr);
f[3] = shiftv(cl, reg_len, lr[3], acmoder);
logic_2(cl_gen, cl, reset, &sig, f, h);
normalise(cl, h[0], h[1], h[2], h[3], &g[0], &g[1], &g[2], &g[3]);
return (con);
}
Priority Applications (2)
Application Number  Priority Date  Filing Date  Title
AUPK0920  1990-06-29
AUPK092090  1990-06-29
Publications (1)
Publication Number  Publication Date
WO1992000560A1 (en)  1992-01-09
Family
ID=3774792
Family Applications (1)
Application Number  Title  Priority Date  Filing Date
PCT/AU1991/000284 WO1992000560A1 (en)  A generalised systolic array serial floating point adder and accumulator  1990-06-29  1991-07-01
Country Status (1)
Country  Link
WO (1)  WO1992000560A1 (en)
Legal Events
Date  Code  Title  Description
AK  Designated states  Kind code of ref document: A1 Designated state(s): AU BB BG BR CA FI HU JP KP KR LK MC MG MN MW NO PL RO SD SU US
AL  Designated countries for regional patents  Kind code of ref document: A1 Designated state(s): AT BE BF BJ CF CG CH CI CM DE DK ES FR GA GB GN GR IT LU ML MR NL SE SN TD TG
NENP  Non-entry into the national phase in:  Ref country code: CA