WO1992000560A1 - A generalised systolic array serial floating point adder and accumulator - Google Patents

A generalised systolic array serial floating point adder and accumulator

Info

Publication number
WO1992000560A1
Authority
WO
WIPO (PCT)
Prior art keywords
floating point
ring
systolic
output
accumulator
Application number
PCT/AU1991/000284
Other languages
French (fr)
Inventor
Warren Marwood
Original Assignee
Luminis Pty. Ltd.
Application filed by Luminis Pty. Ltd.
Publication of WO1992000560A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8046 Systolic arrays
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G06F7/505 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
    • G06F7/509 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination for multiple operands, e.g. digital integrators
    • G06F7/5095 Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination for multiple operands, e.g. digital integrators word-serial, i.e. with an accumulator-register
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/386 Special constructional features
    • G06F2207/3884 Pipelining
    • G06F2207/3892 Systolic array

Definitions

  • This invention relates to floating-point accumulators and adders and in particular to serial systolic array floating point accumulators and adders.
  • a floating point number F is composed of two parts, a fractional mantissa F_m and an integral exponent F_e, and can be represented as the 2-tuple
  • $R = F_m \cdot b^{F_e}$ (3) where b is the base of both F_m and F_e and
  • the floating point accumulation of a floating point number {Z_e, Z_m} with a floating point accumulator value at time n of {A_{e,n}, A_{m,n}} to form the new accumulator value {A_{e,n+1}, A_{m,n+1}} at time n + 1 is performed by the following algorithm:
  • Max_exp is the maximum exponent value in the particular format
  • Min_exp is the minimum exponent value
  • b is the number base of the floating-point representation
  • [.] represents the integer part of its argument, and sign is the sign of the operation.
  • This algorithm can be considered representative of the way in which addition or accumulation is performed in conventional computing hardware.
  • Equation (4) represents the shifting of the operand mantissa which has the smallest exponent by a number of digit places equal to the difference in the exponents, followed by the summation of the shifted operands.
  • This temporary result A'_{m,n+1} is conditionally left or right shifted according to its value.
  • Equation (5) expresses these three conditions in mathematical form.
  • Equation (6) defines the exponent of the temporary result A'_{e,n+1}. This exponent value is modified by any shifts which are performed upon the mantissa to preserve the real value of the 2-tuple. Additive corrections to this exponent value are defined by equation (7). The corrections appear as additions for the exponents whereas multiplications, or shifts, are performed in the case of the mantissa.
  • Equations (8) and (9) set flags which indicate whether the result has exceeded the floating point representation at either end of its dynamic range.
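The steps described for equations (4) to (9) can be sketched in C. This is an illustrative model only, not the patent's equations, which are not reproduced in this extract. The base-2 format, the 8-bit mantissa width, the exponent limits and all names (`fp_acc`, `fp_accumulate`, `shr_trunc`) are assumptions made for the example, and the left shift here simply repeats until the leading mantissa bit is set.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed toy format: base 2, mantissa = signed integer / 2^MANT_BITS. */
enum { MANT_BITS = 8, MAX_EXP = 15, MIN_EXP = -16 };

typedef struct {
    int  e;          /* exponent */
    long m;          /* mantissa, |m| < 2^MANT_BITS */
    int  overflow;   /* set when the result exponent exceeds MAX_EXP */
    int  underflow;  /* set when the result exponent is below MIN_EXP */
} fp_acc;

/* Right shift by n places, truncating toward zero (the "integer part"). */
static long shr_trunc(long v, int n)
{
    if (n > MANT_BITS) return 0;            /* all digits shifted out */
    return v < 0 ? -((-v) >> n) : v >> n;
}

static fp_acc fp_accumulate(fp_acc a, int ze, long zm)
{
    const long one = 1L << MANT_BITS;
    fp_acc r;
    int e = a.e > ze ? a.e : ze;            /* larger of the two exponents */
    long s;

    assert(labs(a.m) < one && labs(zm) < one);

    /* Align: shift the operand with the smaller exponent, then add. */
    s = shr_trunc(a.m, e - a.e) + shr_trunc(zm, e - ze);

    /* Post-normalise: right shift on mantissa overflow, left shift while
       the leading digit is zero; the exponent compensates each shift. */
    if (labs(s) >= one) {
        s = shr_trunc(s, 1);
        e += 1;
    } else {
        while (s != 0 && labs(s) < one / 2) {
            s *= 2;
            e -= 1;
        }
    }

    r.e = e;
    r.m = s;
    r.overflow = e > MAX_EXP;               /* range flags */
    r.underflow = e < MIN_EXP;
    return r;
}
```

With these assumptions, accumulating mantissa 128 (0.5) at exponent 0 into an accumulator holding the same value gives mantissa 128 at exponent 1, that is 1.0.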
  • the invention comprises a systolic array floating point adder for accepting sequential pairs of real numbers Z_1 and Z_2 in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Z_e, Z_m}, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, and the adder outputs a character sequence A which is the floating point representation {A_f, A_e, A_m} of the addition of said real numbers, wherein the adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalization array adapted to receive the output of said finite state machine and to output a denormalized floating point number, a second finite state machine adapted to receive the output from the de-normalisation array and to output the floating point sum of sequential pairs of accepted real numbers.
  • the serial floating point adder has a mode character sequence entered in parallel with the 2-tuples to identify to the adder the fields Z_e and Z_m of the floating point representations
  • the de-normalisation array further comprises at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.
  • the invention in its broadest form comprises a systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of the real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a denormalization array adapted to receive the second output of the finite state machine and to output at least partially denormalized floating point numbers to the second input of the finite state machine and in the configuration to form a ring, wherein.
  • the finite state machine is adapted to control the ring wherein the number Z in floating point format is input to the ring through the finite state machine first input and the accumulator output A is output from the ring from the finite state machine first output, and during the second state the finite state machine is adapted to transfer at least partially denormalised floating point numbers from its second input to its second output, to control the number of times the transfer occurs and to add aligned floating point numbers.
  • a systolic ring serial floating point accumulator has a finite state machine which further comprises an arithmetic logic unit ALU_1 having as first and second inputs the finite state machine first and second inputs and having as its first output the finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive the ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output the finite state machine second output, the de-normalisation array further comprising at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order, the ring comprising a character sequence path formed from a serial configuration of, the ALU_1, the linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Z_e, the ALU_2 and the de-normalisation array.
  • a further aspect of the invention provides a systolic ring serial floating point accumulator in which the real numbers in floating point format are represented as a triplet having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number and an initialization flag character, Z_e is a character sequence representing the exponent of the real number and Z_m is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Z_f, Z_e and Z_m, and the accumulator output is a character sequence A which is the floating point representation {A_f, A_e, A_m} of the accumulation of the real numbers, whereby the ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, the first field D_e being representative
  • the ring of the serial floating point accumulator further comprises, a connection means to connect the ALU_1 to the ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value D e .
  • at least one delay cell is added into the ring to increase the number of data characters in the floating point representation without increasing the number of systolic cells and thereby achieve the processing of operands with either increased precision or dynamic range.
  • a heterogeneous array structure created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations.
  • the output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder.
  • in the case of a systolic ring accumulator, it consists of a finite state machine and a systolic de-normalisation array.
  • Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude.
  • sign-magnitude mantissae and two's complement exponent ordered number pairs are used.
  • the only fixed aspects of the systolic ring are the arithmetic blocks.
  • the length of the delay block is determined by the exponent length in the number representation.
  • the number of systolic de-normalisation cells in the ring can range from a minimum of one to a maximum of m where m is the number of characters in the mantissa of the number representation.
  • the number of recurrence cells determines the performance characteristics of the accumulator.
  • the invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.
  • Figure 1 depicts a state diagram for the first logic element or datapath.
  • Figure 2 depicts a state diagram for the second logic element or datapath.
  • Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers.
  • the data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.
  • Figure 4 depicts a schematic of a systolic ring accumulator in which the elements are considered to be lumped. This clarifies the logical function of the array and highlights the distributed nature of the registers. Each register is associated with one of the re-circulating arrows. Naming conventions correspond to the simulation code of figure 12.
  • Figure 5 depicts a schematic of the systolic de-normalisation cell norm_cell(). Variable names in brackets refer to the nomenclature of the 'C' simulation program given in a later figure.
  • Figure 6 depicts a schematic of an array of delay cells which form a component of the systolic ring.
  • Figure 7 depicts a schematic representation of the input/output multiplexer and (as implemented) a one-bit microcoded datapath. Although implemented as a one-bit per character device, the architecture can be constructed with multi-bit characters.
  • Figure 8 is a schematic diagram of the state generation and storage circuitry in the first logic cell Logic_1().
  • Figure 9 is a schematic diagram of the control signal generation for the first logic cell Logic_1(), with naming conventions as for figure 12.
  • Figure 10 is a schematic diagram of both the control signal generation and a block schematic diagram for the second logic cell Logic_2().
  • Figure 11 is a schematic diagram of two systolic rings which have coalesced to form a single, extended precision accumulator. To extend the dynamic range, additional delay cells must be placed before Logic_2().
  • the multiplexer for the second ring is controlled by the controller of the first ring, and the second occurrence of Logic_2() is not included in the ring.
  • Figure 12 is 'C' code which simulates a systolic ring accumulator.
  • this patent describes a simpler implementation of floating point addition or accumulation than that detailed previously.
  • a linear systolic array serial floating point adder and a circular systolic array serial floating point accumulator.
  • the linear adder is obvious from the description of the ring accumulator.
  • Equations (10) to (13) are significantly simpler than the conventional set given in equations (4) to (9). This simplicity is partly due to the lack of testing for overflow and underflow. Put simply, the exponent register of the accumulator can be made sufficiently long to accommodate the accumulation of sequences of numbers, where the length of the sequences is less than or equal to some arbitrarily chosen maximum length, without reaching the overflow or underflow condition. It is a straightforward design exercise to provide guard digits in the exponent register to satisfy this requirement.
  • the second simplification is not obvious and is not part of floating point standards. It omits the post-normalisation of the sum. It is applicable to the floating point addition of two or more normalised numbers and allows post normalisation to be done only at the end of the completed summation, so effecting considerable savings in the case of long sequences.
  • The significance of equation (16) is that the error is formed in equations (4) and (11).
  • the post-normalisation process of equation (5) does not alter the error in the sum, and as a consequence the operation may be omitted without significantly altering the error behaviour of the accumulation process.
  • a benefit of this approach for summation is that when the summation is complete the number of leading zeroes in the accumulator may give an estimate of the lower bound to the error in the result.
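A minimal C sketch of unnormalised accumulation, under stated assumptions, may make the idea concrete. It models the behaviour described here rather than equations (10) to (13), which are not reproduced in this extract. The names `ufp`, `ufp_add`, `ufp_leading_zeros` and `ufp_normalise`, the base-2 format and the 8-bit mantissa are hypothetical, and the one-place right shift on mantissa overflow is applied immediately instead of being deferred to the next operand.

```c
#include <assert.h>
#include <stdlib.h>

enum { M_BITS = 8 };                      /* assumed mantissa width, base 2 */

typedef struct { int e; long m; } ufp;    /* value = m / 2^M_BITS * 2^e */

/* Right shift of the magnitude, truncating toward zero. */
static long shr_mag(long v, int n)
{
    if (n > M_BITS) return 0;
    return v < 0 ? -((-v) >> n) : v >> n;
}

/* Add without post-normalisation: leading zeros are left in place. */
static ufp ufp_add(ufp a, ufp z)
{
    ufp r;
    assert(labs(a.m) < (1L << M_BITS) && labs(z.m) < (1L << M_BITS));
    r.e = a.e > z.e ? a.e : z.e;
    r.m = shr_mag(a.m, r.e - a.e) + shr_mag(z.m, r.e - z.e);
    if (labs(r.m) >= (1L << M_BITS)) {    /* mantissa overflow only */
        r.m = shr_mag(r.m, 1);
        r.e += 1;
    }
    return r;
}

/* Number of leading zero digits in the mantissa magnitude. */
static int ufp_leading_zeros(ufp a)
{
    int n = 0;
    long bit;
    for (bit = 1L << (M_BITS - 1); bit > 0 && !(labs(a.m) & bit); bit >>= 1)
        n++;
    return n;
}

/* Single normalisation, performed once after the whole summation. */
static ufp ufp_normalise(ufp a)
{
    while (a.m != 0 && labs(a.m) < (1L << (M_BITS - 1))) {
        a.m *= 2;
        a.e -= 1;
    }
    return a;
}
```

In this model, adding mantissa -96 to mantissa 128 at the same exponent leaves mantissa 32 with two leading zeros; calling `ufp_normalise()` once at the end yields mantissa 128 at exponent -2.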
  • m is the number of characters in the mantissa of the floating point representation.
  • Figure 4 depicts a schematic representation of a systolic de-normalisation array 21 and a finite state machine 22. The array 21 de-normalises the Z mantissa by D_e characters when D_e is less than the mantissa length m, or by m characters when D_e is greater than or equal to m, in which case the value of the Z mantissa becomes zero. This aligns the Z mantissa to the accumulator mantissa in the floating point representation prior to their addition as defined by equation (21). The finite state machine 22 implements equations (17) to (24) with the exclusion of equation (21).
  • the finite state machine 22 consists of a controller 23 and an arithmetic logic unit (ALU_1) 24 which is described in figure 12 in the form of C simulation code as the function logic_1(), a linear array of delay cells 25 as described in figure 12 as shiftv() and a second arithmetic logic unit (ALU_2) 26 described in figure 12 as logic_2().
  • ALU_1 arithmetic logic unit
  • ALU_2 second arithmetic logic unit
  • a first input to the accumulator 20 is presented sequentially with a series of floating point representations of real numbers Z consisting of triplets having the form {Z_f, Z_e, Z_m} wherein Z_f is a character sequence representing descriptors of the real number. An initialization flag character is also part of the descriptor. However, Z_f may or may not be used in one or other of the embodiments described hereafter.
  • Z_e is a character sequence representing the exponent of the real number Z and Z_m is a character sequence representing the mantissa of the real number Z.
  • a mode signal entered in parallel with the triplet through a second input identifies which of the fields Z_f, Z_e and Z_m are being input at any one time.
  • an additional character sequence C is also entered through a third input in parallel with the triplet as a constant to be used to increment the exponent difference D_e of equation (18).
  • a further input shown in figure 4 is reset, which is used in the C simulation program to reset the simulated controller 23 and simulated ALU_2 26.
  • a first output from the accumulator consists of a status signal busy used to indicate when the accumulator may or may not accept inputs.
  • An additional output provides a character sequence A which is the floating point representation {A_f, A_e, A_m} of the accumulation of the real numbers Z.
  • a further output consists of a mode output signal which identifies the elements of the triplets.
  • there is a final output Load which is derived from the initialization flag character present in the Z f field of the input triplets.
  • the second output of the finite state machine 22 connects to the input of the systolic de-normalisation array 21 whose output is connected to a second input of the finite state machine 22 to form a systolic ring of four registers; a Z register of at least two fields representative of the exponent difference D e , equal to the difference between the accumulator exponent A e and Z e , and the Z mantissa value
  • A connection between ALU_1 24 and ALU_2 26, denoted as sig in figures 4 and 12, is used as a control signal path to implement the conditional assignments in ALU_2 of equations (19) and (20).
  • the following table details the data structure for both the serial operands and the associated mode bit.
  • the operands are entered into the accumulator least significant character or least significant bit (LSB) first.
  • State machines decode the different fields within the finite state machine controller and ALU_2.
  • a state diagram which describes the operation of the controller, multiplexer and ALU_1 in the finite state machine of figure 4 is given in figure 1.
  • the controller is a state machine shown in figure 9 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 9.
  • the functional behaviour of the state machine is described by the C simulation code function fsm1() of figure 12.
  • the controller moves to State 2 in which the zero flag character of the flag characters Z_f is stored in the internal storage register Z_zf.
  • the controller enters State 3 at the next clock period and otherwise the controller enters State 4 which will be described subsequently.
  • the accumulator exponent field is incremented by the contents of the overflow register from the previous computation and is output as the exponent field of the accumulated result through the finite state machine first output, and also the value of the input operand exponent field z e is output to the ring accumulator exponent register A e through the finite state machine second output, the exponent difference field D e is set to zero and is entered into the ring Z register through the finite state machine second output, the sign register sig is set to zero and both the Z mantissa sign register Z s and the accumulator sign register A s are set equal to the sign of the input operand mantissa z s .
  • the controller enters either State 6 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 5 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
  • the sign-corrected mantissa value A m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, before being output as the result mantissa through the finite state machine first output.
  • the mantissa value input to the ring Z register is set to zero and the mantissa register A m is set to the input mantissa value z m .
  • the correctly represented mantissa value A m is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output as the result mantissa through the finite state machine first output.
  • the mantissa value input to the Z register is set to zero and the mantissa register A m is set to the input mantissa value z m .
  • the controller enters State 9 when the mode bit becomes zero. It remains in this state until a non-zero signal cyn_1 is received from a counter depicted in figure 9, indicating that the mantissae A_m and Z_m are aligned or the mantissa Z_m is zero. During this state, the modulo 2 sum of the signs of the Z and A mantissae is stored in the register neg.
  • the controller enters State 4 in which the accumulator exponent value A_e is incremented by the value of the previously computed overflow A_ovf and is output to the ring through the second output of the finite state machine.
  • the exponent difference D_e is set equal to the difference of the value z_e and the incremented accumulator value A_e + A_ovf and is output to the Z register through the second output of the finite state machine.
  • the sign register sig is set equal to the sign bit of D_e and the one bit Z mantissa sign register Z_s is set equal to the sign bit of the input mantissa and the one bit accumulator sign register A_s is left unchanged.
  • the controller enters either State 8 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 7 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
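The sign handling in these states can be illustrated with a small C sketch of sign-magnitude addition. It is a simplified model under assumptions: the type `sm_num` and the function `sm_add` are hypothetical names, the operands are taken to be already aligned, and the sign reversal that the controller defers to States 5 and 7 while the next operand is loaded is applied immediately here.

```c
#include <assert.h>

typedef struct {
    int sign;            /* 0 = positive, 1 = negative */
    unsigned long mag;   /* mantissa magnitude */
} sm_num;

static sm_num sm_add(sm_num a, sm_num z)
{
    sm_num r;
    int neg = a.sign ^ z.sign;   /* modulo 2 sum of the two signs */

    assert((a.sign == 0 || a.sign == 1) && (z.sign == 0 || z.sign == 1));
    if (!neg) {                  /* like signs: add the magnitudes */
        r.sign = a.sign;
        r.mag = a.mag + z.mag;
    } else if (a.mag >= z.mag) { /* unlike signs, result keeps the A sign */
        r.sign = a.sign;
        r.mag = a.mag - z.mag;
    } else {                     /* difference is negative: reverse the sign */
        r.sign = z.sign;
        r.mag = z.mag - a.mag;
    }
    return r;
}
```

For instance, adding +3 and -5 takes the last branch and returns sign 1 with magnitude 2, which corresponds to the case where a sign reversal is required.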
  • a state diagram which describes the operation of the second arithmetic logic unit ALU_2 in the finite state machine of figure 4 is given in figure 2.
  • the ALU_2 has a state machine shown in figure 10 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 10.
  • the functional behaviour of the state machine is described by the C simulation code function fsm2() of figure 12.
  • the initial state State 0 as shown in figure 2 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated.
  • the ALU_2 remains in the zero state until a non-zero mode bit is detected after which it enters State 1.
  • the ALU_2 state changes to State 2.
  • the ALU_2 state changes to State 5 if the sign control line from ALU_1 is non-zero, and changes to State 3 otherwise.
  • the ALU_2 remains in State 5 until a non-zero signal cyn ⁇ is received from a counter depicted in figure 10, when it enters State 6, and re-enters State 0 when the signal cyn_1 becomes zero.
  • Equations (17) to (24) with the exclusion of equation (21) are implemented using the finite state machine 22.
  • an array of at least one systolic cell is required in which the transfer of data between cells is described by the following recurrences
  • C contains the value 1 in the character position corresponding to the least significant exponent character, and is zero elsewhere.
  • An examination of the recurrences (34) shows that the sign of the exponent is stored in Z 4 for the duration of the mantissa. This value is used to control via recurrence (37) whether the mantissa output Z 2 is delayed either one or two stages when the mode values M 0 and M 1 are high. This effects a one character de-normalisation of the Z mantissa field relative to the A mantissa when the exponent difference D e is negative. The presence of a 1 in the C character sequence can be seen to increment the exponent difference according to the recurrence (37).
  • Each cell which implements these recurrences in a linear structure can implement a one-character de-normalisation and sign-extension required for floating-point addition using one's complement or two's complement mantissae, and the de-normalisation without sign extension for sign-magnitude mantissae.
  • m-bit mantissa full de-normalisation requires the application of m recurrences.
  • recurrences may be applied either by connecting m-cells in a linear array, or by connecting at least one cell in a systolic ring structure with sufficient delay cells to contain the operand, and circulating the operands until m recurrences have been applied, or until the mantissae are aligned as indicated by a non-negative exponent difference.
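The circulation scheme just described can be sketched at word level in C. This is not the bit-serial recurrence set (29) to (37), which is not reproduced in this extract; it only models the net effect of one cell application. The names `z_reg`, `denorm_cell` and `denorm_ring` are hypothetical, and the sketch assumes one-bit characters, a sign-magnitude mantissa and a cell that leaves an already aligned operand untouched.

```c
#include <assert.h>

typedef struct {
    int de;             /* exponent difference D_e; negative: Z must shift */
    unsigned long zm;   /* Z mantissa magnitude, no sign extension */
} z_reg;

/* One application of the cell: a one-character de-normalisation. */
static z_reg denorm_cell(z_reg z)
{
    if (z.de < 0) {
        z.zm >>= 1;     /* shift the mantissa by one character */
        z.de += 1;      /* increment the exponent difference */
    }
    return z;
}

/* Circulate z around a ring holding `cells` de-normalisation cells until
   D_e is non-negative or m cell applications have been made.
   Returns the number of circulations used. */
static int denorm_ring(z_reg *z, int cells, int m)
{
    int applied = 0, circulations = 0, i;

    assert(cells >= 1 && cells <= m);
    while (z->de < 0 && applied < m) {
        for (i = 0; i < cells && applied < m; i++) {
            *z = denorm_cell(*z);
            applied++;
        }
        circulations++;
    }
    return circulations;
}
```

With a 6-character mantissa and a single cell, an exponent difference of -3 is resolved in three circulations; with three cells the same operand is aligned in one.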
  • Figure 5 represents a schematic diagram of one possible hardware implementation of a de-normalisation cell 27 implementing the above recurrence equations (29) to (37).
  • Figure 6 represents a schematic diagram of one possible hardware implementation of a linear array of delay stages and their interconnection denoted by the above recurrence equations (25) to (28).
  • Figures 7 and 8 together represent a schematic diagram of the arithmetic logic unit ALU_1 24 component of the finite state machine 22.
  • the notation depicted in figures 7 and 8 follows that of figure 12.
  • Figure 9 represents a schematic diagram of the control element 23 of the finite state machine 22. The notation depicted in figure 9 follows that of figure 12.
  • Figure 10 represents a schematic diagram of the arithmetic logic unit ALU_2 26 component of the finite state machine 22. The notation depicted in figure 10 follows that of figure 12.
  • Figure 11 depicts a schematic diagram of the joining or coalescence of two adjacent systolic ring accumulators to form a single accumulator capable of accumulating operands of double length.
  • the multiplexer for the second ring is controlled by the controller of the first ring.
  • Figure 12 is a C code simulation of an embodiment of a sign-magnitude systolic ring accumulator.
  • Systolic ring arithmetic units provide new possibilities for systolic array processors.
  • Consider a simple linear array of two processors designed to process single precision operands. If the two processors are implemented as systolic rings, it is possible with appropriate multiplexer means to coalesce the two rings into a single, larger ring.
  • This large ring can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa characters to systolic cells remains a constant.
  • the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.
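The relationship between mantissa length, cell count and circulations can be written as a one-line C helper. The rule used here is an inference, not a formula stated in the text: it is chosen to match the two cases quoted for Figure 3 (six mantissa characters need seven circulations with one cell and three circulations with three cells, the last circulation being the overflow adjustment). The function name `ring_circulations` is hypothetical.

```c
#include <assert.h>

/* Assumed rule: ceil(mantissa_chars / systolic_cells) de-normalising
   circulations plus one final circulation for the overflow adjustment. */
static int ring_circulations(int mantissa_chars, int systolic_cells)
{
    assert(mantissa_chars >= 1 && systolic_cells >= 1);
    return (mantissa_chars + systolic_cells - 1) / systolic_cells + 1;
}
```

Under this rule, coalescing two rings of three cells and six mantissa characters each into one ring of six cells and twelve mantissa characters leaves the count at three circulations, because the ratio of mantissa characters to cells is unchanged.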
  • the nature of the systolic architecture allows advantage to be taken of the statistical properties of numbers to minimise the number of systolic cells.
  • Current studies suggest that the number of systolic cells may be minimised by matching the number of cells to the 95th percentile of the expected distribution of denormalisation shifts.
  • the use of longer mantissa lengths for increased precision would not require increased numbers of systolic cells, but only an increase in the length of the registers. For such an implementation 95% of accumulations would occur in the designed number of circulations, and the remaining 5% would require additional circulations. In a processor which is asynchronous, this computation time uncertainty would not constitute a problem, and the saving of circuitry would be valuable.
  • the only addition to the structure would be a test of completion of denormalisation. A successful test would cause the remaining circulations of the operands to be bypassed.
  • the information required to reduce the number of circulations in this way is in the sign bit of the incremented exponent difference, and can be used as an input to an expanded state machine in the circuit Logic_1. When the sign bit is zero, the de-normalisation is complete, and the state machine can move to the next state.
  • Systolic ring and linear array floating point accumulators constructed according to the details described in this patent are of interest in large order systolic arrays and neural networks, and floating point arithmetic units implemented in Gallium Arsenide. This is due to the wide range of area/time/precision/dynamic-range tradeoffs achievable with the ring architecture and its low transistor count. It is also possible to implement the architecture determined by this patent with simple optical processing techniques.
  • reg_cell(clock, a, b)
  • op = reg_cell(cl, a, &sreg[0]);
  • op = reg_cell(cl, sreg[i].p2, &sreg[i + 1]); return (op);
  • src1 (s1_Z,s1_A,s1_z,s1_0)
  • src2 (s2_Z,s2_A,s2_z,s2_0)
  • cry1 (noset1,set1)
  • cry2 (noset2,set2)
  • instr[3] Lzs+LAzs+d1_0+d2_z ;
  • instr[4] Lzs+fsub+s1_z+s2_A+d1_f+d2_I ;
  • instr[13] LAs+f sub+s1_A+s2_Z+d2_f ;
  • m1 reg_cell(cl, mode, &cell->mode1)
  • pp1 reg_cell(cl, pp, &cell->pp1)
  • y1 reg_cell(cl, y, &cell->y1)
  • bypass reg_cell(cl,mux(m1,sign,cell->bypass.p2),&cell->bypass);
  • x1 reg.cell(cl, x, &cell->x1);
  • c_out reg_cell(cl, c_out, &cell->cy);
  • e[3] mux(*con, a[3], g[3]);
  • cry2 instr [state]&1;
  • cry1 (instr[state]!1)&1;
  • lid (instr[state]!12)&1;
  • fcy reg_cell(cl,mux(cry1,and(fcy,inv_bit(and(e[3],
  • icy reg_cell(cl, mux (cry 2, icy, Aovf ), &icy_reg);
  • Ld reg cell(cl, mux (lld, Ld_reg.p2,ed[1]),&Ld_reg);
  • zzf reg_cell(cl, mux(lzf, zzf_reg.p2,ed[1]), &zzf_reg);
  • Azf reg cell(cl, mux(or(lzf,lAs),Azf_reg.p2,
  • reg cell(cl, bb mux(shft,neg_reg.p2,As ⁇ Zs),&neg_reg);
  • Aovf reg_cell(cl, mux(lAs, Aovf_reg. p2,fsum ⁇ neg),&Aovf_reg);
  • A_ mux(and(shft,Aovf),A,__A);
  • Asd reg.cell(cl, As, &Asd_reg);
  • r[1] reg_cell(cl, mux4(lzs+2*ed[3],isum,As_reg.p2,fsum,0),
  • r[2] reg_cell(cl, ed[3],&edd_reg);
  • lr[0] mux4(dst2, ed[0], ed[1], sub_s, ad_s);
  • lr[1] mux4(dst1, ed[1], ed[0], sub_s, ad_s);
  • f[1] shiftv(cl, reg_len, lr[1], acyr);
  • logic_2 (cl_gen, cl, reset, &sig, f, h);

Abstract

The invention is a heterogeneous array structure created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations. The output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder. In the case of a systolic ring accumulator it consists of a finite state machine (22) and a systolic de-normalisation array (21). Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude. In the preferred embodiment of an accumulator (20) sign-magnitude mantissae and two's complement exponent ordered number pairs are used. The only fixed aspects of the systolic ring are the arithmetic blocks. The length of the delay block (25) is determined by the exponent length in the number representation. The number of systolic de-normalisation cells (27) in the ring can range from a minimum of one. The number of recurrence cells and the number base of the characters in the floating point format determine the performance characteristics of the accumulator. The invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.

Description

"A GENERALISED SYSTOLIC ARRAY SERIAL FLOATING POINT
ADDER AND ACCUMULATOR"
This invention relates to floating-point accumulators and adders and in particular to serial systolic array floating point accumulators and adders.
BACKGROUND OF THE INVENTION
Fixed point adders and accumulators are implemented simply by a single adder and a carry storage register for serial implementations, and an array of adders and carry storage registers for parallel implementations. For n-bit numbers an adder is typically a factor of n less complex than a multiplier. This is no longer the case when floating point operations are considered. A floating point multiplier is not substantially more complex than its fixed point counterpart whereas the floating point adder is significantly more complex than the fixed point equivalent. The reason for this complexity is evident when a floating point addition or accumulation algorithm is considered and compared with a fixed point addition or accumulation algorithm. Examples are:
Fixed point accumulation of a fixed point number $Z$ with a fixed point accumulator value at time $n$ of $A_n$ is carried out by the simple operation of integer addition, i.e.
$A_{n+1} = A_n + Z$ (1)
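As an illustration of equation (1) in the serial case, the following C sketch models a bit-serial adder as a full adder plus a one-bit carry register, with operands presented least significant bit first. The type and function names (serial_adder, serial_add_bit, serial_accumulate) are hypothetical and are not taken from the simulation code of figure 12.

```c
#include <stdint.h>

/* A bit-serial adder: one full adder and a one-bit carry storage register. */
typedef struct { unsigned carry; } serial_adder;

/* One clock period: consume one bit of each operand, produce one sum bit. */
unsigned serial_add_bit(serial_adder *s, unsigned a, unsigned z)
{
    unsigned sum = a ^ z ^ s->carry;              /* sum bit            */
    s->carry = (a & z) | (s->carry & (a ^ z));    /* carry for next bit */
    return sum;
}

/* Accumulate Z into A over n clock periods (n <= 32), LSB first.
   The result is A + Z modulo 2^n, as in equation (1). */
uint32_t serial_accumulate(uint32_t a, uint32_t z, unsigned n)
{
    serial_adder s = { 0 };
    uint32_t result = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned bit = serial_add_bit(&s, (a >> i) & 1u, (z >> i) & 1u);
        result |= (uint32_t)bit << i;
    }
    return result;
}
```

For example, serial_accumulate(5, 9, 8) yields 14, and an 8-bit accumulation of 200 and 100 wraps to 44.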
To discuss floating point addition or accumulation it is necessary to define the number system. A floating point number F is composed of two parts, a fractional mantissa Fm and an integral exponent Fe, and can be represented as the 2-tuple
$F = \{F_e, F_m\}$ (2)
The real number representation R of this floating point representation is
$R = F_m \cdot b^{F_e}$ (3)
where $b$ is the base of both $F_m$ and $F_e$ and
[Equation image imgf000004_0007 in the original publication]
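The relation between the 2-tuple of equation (2) and the real number of equation (3) can be sketched in C as follows. The struct fp_tuple and the function fp_value are hypothetical names introduced for illustration, and a double is used for the fractional mantissa purely for convenience.

```c
/* The 2-tuple {Fe, Fm}: integral exponent and fractional mantissa. */
typedef struct {
    int    e;   /* Fe */
    double m;   /* Fm */
} fp_tuple;

/* Real value R = Fm * b^Fe, evaluated by repeated multiplication or
   division so that any integer base b >= 2 can be used. */
double fp_value(fp_tuple f, int base)
{
    double r = f.m;
    for (int i = 0; i < f.e; i++) r *= base;   /* positive exponent */
    for (int i = 0; i > f.e; i--) r /= base;   /* negative exponent */
    return r;
}
```

With base 2, the tuple {3, 0.5} evaluates to 4.0 and the tuple {-1, 0.5} to 0.25.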
The floating point accumulation of a floating point number {Ze, Zm} with a floating point accumulator value at time n of {Ae,n , Am,n} to form the new accumulator value {Ae,n+1, Am,n+1} at time n + 1 is performed by the following algorithm:
[Equations (4) to (9): images imgf000004_0001 to imgf000004_0006 in the original publication]
where Max_exp is the maximum exponent value in the particular format, Min_exp is the minimum exponent value, b is the number base of the floating-point representation, [.] denotes the integer part of its argument, and sign is the sign of the operation.
This algorithm can be considered representative of the way in which addition or accumulation is performed in conventional computing hardware.
Some discussion of this algorithm is warranted. Equation (4) represents the shifting of the operand mantissa which has the smallest exponent by a number of digit places equal to the difference in the exponents, followed by the summation of the shifted operands. This temporary result A'm,n+1 is conditionally left or right shifted according to its value. Three possibilities exist. The first is a result which is smaller than the lower bound of the defined range for the fractional part of the representation. In this case the operation has caused a loss of precision known as catastrophic cancellation. Leading zeroes are introduced into the representation which must be removed by left shifting the result. This operation is known as post-normalisation. The second possibility is that the result is larger than the upper bound of the defined range of the mantissa. In this case the result is right shifted one place to restore it to the defined range. This condition is mantissa overflow. The final possibility for the result is that it falls within the defined range of the representation, in which case no shifts are required. Equation (5) expresses these three conditions in mathematical form. Equation (6) defines the exponent of the temporary result A'e,n+1. This exponent value is modified by any shifts which are performed upon the mantissa to preserve the real value of the 2-tuple. Additive corrections to this exponent value are defined by equation (7). The corrections appear as additions for the exponents whereas multiplications, or shifts, are performed in the case of the mantissa. Equations (8) and (9) set flags which indicate whether the result has exceeded the floating point representation at either end of its dynamic range.
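The steps just described (alignment, summation, conditional shifting, exponent correction and range flags) can be sketched in C as below. This is an illustrative model only and is not a transcription of equations (4) to (9). It assumes base 2, a signed integer mantissa scaled by 2^MANT_BITS with a normalised magnitude in the range 1/2 to 1, and arbitrary exponent limits; all names and constants are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical format: 8-bit mantissa magnitude, exponent range -8..7. */
enum { MANT_BITS = 8, MAX_EXP = 7, MIN_EXP = -8 };

typedef struct {
    int     e;          /* exponent                                  */
    int32_t m;          /* signed mantissa, scaled by 2^MANT_BITS    */
    int     overflow;   /* exponent exceeded MAX_EXP                 */
    int     underflow;  /* exponent fell below MIN_EXP               */
} fp_acc;

/* Shift a signed mantissa right by n places, truncating the magnitude. */
static int32_t shr_mag(int32_t m, int n)
{
    int32_t mag = (n >= 31) ? 0 : (abs(m) >> n);
    return (m < 0) ? -mag : mag;
}

fp_acc fp_accumulate(fp_acc a, int ze, int32_t zm)
{
    /* Align the operand with the smaller exponent, then sum. */
    int e = (a.e > ze) ? a.e : ze;
    int32_t sum = shr_mag(a.m, e - a.e) + shr_mag(zm, e - ze);

    if (abs(sum) >= (1 << MANT_BITS)) {
        /* Mantissa overflow: right shift one place, increment exponent. */
        sum = shr_mag(sum, 1);
        e += 1;
    } else {
        /* Post-normalisation: remove leading zeroes by left shifting. */
        while (sum != 0 && abs(sum) < (1 << (MANT_BITS - 1))) {
            sum *= 2;
            e -= 1;
        }
    }

    a.e = e;
    a.m = sum;
    a.overflow  = (e > MAX_EXP);
    a.underflow = (e < MIN_EXP);
    return a;
}
```

With an accumulator of {e = 1, m = 192} (the value 1.5), adding {0, 128} (the value 0.5) triggers mantissa overflow and gives {2, 128} (the value 2.0), while adding {1, -160} (the value -1.25) requires two post-normalisation shifts and gives {-1, 128} (the value 0.25).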
Existing techniques for the design and construction of floating point adders and accumulators are broadly categorised as parallel or serial. The parallel architectures are intended for low latency designs. An example is the work of OWEN, R.E., "A 15 nanosecond complex multiplier-accumulator for FFT's", CASSP'87, CH2396-0/87/0000-0527 pp. 527-530, 1987. For system architectures in which longer latencies can be tolerated, serial architectures are used to advantage and an example is CHAU, P.M., KAY, C.C. and KU, W.H.,"A bit-serial floating-point complex multiplier-accumulator for fault-tolerant digital signal processing arrays", CASSP'87, CH2396-0/87/0000-0483 pp. 483-486, 1987.
SUMMARY OF THE INVENTION
In its broadest form the invention comprises a systolic array floating point adder for accepting sequential pairs of real numbers Z1 and Z2 in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Ze, Zm}, Ze is a character sequence representing the exponent of the real number and Zm is a character sequence representing the mantissa of the real number, and the adder outputs a character sequence A which is the floating point representation {Af, Ae, Am} of the addition of said real numbers, wherein the adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalization array adapted to receive the output of said finite state machine and to output a denormalized floating point number, a second finite state machine adapted to receive the output from the de-normalisation array and to output the floating point sum of sequential pairs of accepted real numbers.
In a further aspect of the invention the serial floating point adder has a mode character sequence entered in parallel with the 2-tuples to identify to the adder the fields Ze and Zm of the floating point representations, and the de-normalisation array further comprises at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.
In yet a further aspect the invention in its broadest form comprises a systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of the real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a denormalization array adapted to receive the second output of the finite state machine and to output at least partially denormalized floating point numbers to the second input of the finite state machine and in the configuration to form a ring, wherein, during the first state the finite state machine is adapted to control the ring wherein the number Z in floating point format is input to the ring through the finite state machine first input and the accumulator output A is output from the ring from the finite state machine first output, and during the second state the finite state machine is adapted to transfer at least partially denormalised floating point numbers from its second input to its second output, to control the number of times the transfer occurs and to add aligned floating point numbers.
Yet in a further aspect of the invention a systolic ring serial floating point accumulator has a finite state machine which further comprises an arithmetic logic unit ALU_1 having as first and second inputs the finite state machine first and second inputs and having as its first output the finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive the ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output the finite state machine second output, the de-normalisation array further comprising at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order, the ring comprising a character sequence path formed from a serial configuration of, the ALU_1, the linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Ze, the ALU_2 and the de-normalisation array.
A further aspect of the invention provides a systolic ring serial floating point accumulator in which the real numbers in floating point format are represented as a triplet having the form {Zf, Ze, Zm} wherein Zf is a character sequence representing descriptors of the real number and an initialization flag character, Ze is a character sequence representing the exponent of the real number and Zm is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Zf, Ze and Zm , and the accumulator output is a character sequence A which is the floating point representation {Af, Ae, Am} of the accumulation of the real numbers, whereby the ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, the first field De being representative of the difference between the accumulator exponent Ae and the input exponent Ze, and the second field Zm being representative of the Z mantissa value, and a mode register which contains the mode characters.
According to a further aspect of the invention the ring of the serial floating point accumulator further comprises, a connection means to connect the ALU_1 to the ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value De. In a further aspect of this invention at least one delay cell is added into the ring to increase the number of data characters in the floating point representation without increasing the number of systolic cells and thereby achieve the processing of operands with either increased precision or dynamic range.
In an embodiment, the invention is a heterogeneous array structure created from a main logic or arithmetic block and input/output multiplexer, a k-stage delay block, a secondary logic or arithmetic block and a normalisation block comprising a systolic array constructed from cells which represent the functional equivalent of a set of recurrence relations. The output from the normalisation block is either fed back to the input of the first arithmetic block to form a systolic ring, or in a linear array is input to a further adder. In the case of a systolic ring accumulator it consists of a finite state machine and a systolic de-normalisation array.
Both structures implement unnormalised addition and can operate upon symmetric number representations for the mantissa such as one's complement or sign-magnitude. In the preferred embodiment of an accumulator sign-magnitude mantissae and two's complement exponent ordered number pairs are used. The only fixed aspects of the systolic ring are the arithmetic blocks. The length of the delay block is determined by the exponent length in the number representation. The number of systolic de-normalisation cells in the ring can range from a minimum of one to a maximum of m where m is the number of characters in the mantissa of the number representation. The number of recurrence cells determines the performance characteristics of the accumulator.
The invention provides a generic architectural basis for the use of a recurrence cell to create systolic arrays of cells which can implement a new serial pipelined floating point accumulator.
Further aspects of this invention include:
(i) the reduction of the complexity of the problem of constructing floating point adders and accumulators by using:
(a) replicated cell structures to implement recurrences which de-normalise mantissae;
(b) novel circuitry to implement in a serial pipelined fashion both the incrementing of an exponent difference and the conditional de-normalisation of an associated mantissa.
(ii) the depiction of the use of systolic de-normalisation cells interconnected with state memory stages in a linear array or systolic ring structure to construct either an adder capable of variable dynamic range or an accumulator capable of both variable precision and variable dynamic range.
(iii) the depiction of the construction of a systolic ring accumulator with a minimum gate complexity, consisting of an I/O multiplexer, two arithmetic logic units each containing a state machine, an array of delay stages and at least one computational cell representable by recurrences. The computational cell further comprises the registers required to store the operands, a state storage register, one control storage register and an adder.
(iv) the depiction of the design of a generic accumulator capable of providing a broad range of performance specifications by varying the number of computational cells and/or the number of delay cells in the systolic de-normalisation ring and the array of delay cells. Varying the number base of the characters in the floating point format also provides a further means for controlling the execution time of the accumulator.
To further describe the invention, preferred embodiments will now be given, however, it will be apparent that variations will be possible without departing from the inventive matter disclosed. This is especially so since such variations are within the ordinary skill of the practitioner of digital design techniques.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments are described hereunder in some detail with reference to and as illustrated in the accompanying drawings in which:
Figure 1 depicts a state diagram for the first logic element or datapath.
Figure 2 depicts a state diagram for the second logic element or datapath.
Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.
Figure 4 depicts a schematic of a systolic ring accumulator in which the elements are considered to be lumped. This clarifies the logical function of the array and highlights the distributed nature of the registers. Each register is associated with one of the re-circulating arrows. Naming conventions correspond to the simulation code of figure 12.
Figure 5 depicts a schematic of the systolic de-normalisation cell norm_cell(). Variable names in brackets refer to the nomenclature of the 'C' simulation program given in a later figure.
Figure 6 depicts a schematic of an array of delay cells which form a component of the systolic ring.
Figure 7 depicts a schematic representation of the input/output multiplexer and (as implemented) a one-bit microcoded datapath. Although implemented as a one-bit per character device, the architecture can be constructed with multi-bit characters.
Figure 8 is a schematic diagram of the state generation and storage circuitry in the first logic cell Logic_1().
Figure 9 is a schematic diagram of the control signal generation for the first logic cell Logic_1(), with naming conventions as for figure 12.
Figure 10 is a schematic diagram of both the control signal generation and a block schematic diagram for the second logic cell Logic_2().
Figure 11 is a schematic diagram of two systolic rings which have coalesced to form a single, extended precision accumulator. To extend the dynamic range, additional delay cells must be placed before Logic_2(). The multiplexer for the second ring is controlled by the controller of the first ring, and the second occurrence of Logic_2() is not included in the ring.
Figure 12 is 'C' code which simulates a systolic ring accumulator.
DETAILED DESCRIPTION OF THE INVENTION
Based upon the following addition or accumulation technique, this patent describes a simpler implementation of floating point addition or accumulation than that detailed previously. Thus there is provided according to the invention both a linear systolic array serial floating point adder and a circular systolic array serial floating point accumulator. For simplicity, only the systolic ring accumulator is described as the linear adder is obvious from the description of the ring accumulator.
[Equations (10) to (13): image imgf000011_0001 in the original publication]
Equations (10) to (13) are significantly simpler than the conventional set given in equations (4) to (9). This simplicity is partly due to the lack of testing for overflow and underflow. Put simply, the exponent register of the accumulator can be made sufficiently long to accommodate the accumulation of sequences of numbers, where the length of the sequences is less than or equal to some arbitrarily chosen maximum length, without reaching the overflow or underflow condition. It is a straightforward design exercise to provide guard digits in the exponent register to satisfy this requirement.
The second simplification is not obvious and is not part of floating point standards. It omits the post-normalisation of the sum. It is applicable to the floating point addition of two or more normalised numbers and allows post normalisation to be done only at the end of the completed summation, so effecting considerable savings in the case of long sequences.
Consider that error-free numbers are A and Z, and that their floating point representations A* and Z* introduce errors e_A and e_Z such that
[Equations (14) and (15): images imgf000012_0001 and imgf000012_0002 in the original publication]
The maximum relative error E when forming the sum of these numbers occurs when they have opposite sign. This worst-case relative error is approximated by
[Equation (16): image imgf000012_0003 in the original publication]
The significance of equation (16) is that the error is formed in equations (4) and (11). The post-normalisation process of equation (5) does not alter the error in the sum, and as a consequence the operation may be omitted without significantly altering the error behaviour of the accumulation process. A benefit of this approach for summation is that when the summation is complete the number of leading zeroes in the accumulator may give an estimate of the lower bound to the error in the result.
Expanding the equations (10) to (13) gives the following relations:
[Equations (17) to (24): images imgf000013_0001 to imgf000013_0008 in the original publication]
where m is the number of characters in the mantissa of the floating point representation.
In an embodiment of the invention which reflects the previous relations a systolic ring serial floating point accumulator 20 is shown in figures 3 and 4. Figure 3 depicts a schematic representation of a heterogeneous systolic ring accumulator showing major structural elements and a distributed delay and systolic cell implementation, but excluding the data driven controllers. The data format is also shown for a particular case, consisting of 6 mantissa characters and 4 exponent characters. Seven circulations of the operands are required for this minimum configuration of one systolic cell. Three circulations would be required if an alternative accumulator were constructed from three systolic cells and six delay stages. The last circulation is to adjust the accumulator for the overflow condition.
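The two circulation counts quoted above are consistent with a simple relation between the mantissa length and the number of systolic cells, sketched below in C. The formula is inferred here from those two cases only (6 characters with one cell, and 6 characters with three cells); the rounding up for cell counts that do not divide the mantissa length is an assumption, and the function name circulations is hypothetical.

```c
/* Worst-case circulations for a mantissa of mantissa_chars characters
   in a ring containing `cells` systolic de-normalisation cells.
   Assumes each cell de-normalises by at most one character per pass. */
unsigned circulations(unsigned mantissa_chars, unsigned cells)
{
    /* Passes needed to shift the whole mantissa out, rounded up. */
    unsigned shift_passes = (mantissa_chars + cells - 1) / cells;
    /* One further circulation adjusts the accumulator for overflow. */
    return shift_passes + 1;
}
```

Under these assumptions circulations(6, 1) gives 7 and circulations(6, 3) gives 3, matching the figures stated for figure 3.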
Figure 4 depicts a schematic representation of a systolic de-normalisation array 21 which implements the Z mantissa de-normalisation of either De characters when De is less than the mantissa length m, or m characters when De is greater than or equal to m, at which point the value of the Z mantissa becomes zero, to effect an alignment of the Z mantissa to the accumulator mantissa in the floating point representation prior to their addition as defined by equation (21), and a finite state machine 22 which implements equations (17) to (24) with the exclusion of equation (21). The finite state machine 22 consists of a controller 23 and an arithmetic logic unit (ALU_1) 24 which is described in figure 12 in the form of C simulation code as the function logic_1(), a linear array of delay cells 25 as described in figure 12 as shiftv() and a second arithmetic logic unit (ALU_2) 26 described in figure 12 as logic_2(). It should be noted that the nomenclature and connectivity used in figure 4 relate directly to the C simulation code of figure 12 and it is therefore apparent that the figure does not represent a minimum configuration of the invention.
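A behavioural sketch of the de-normalisation performed by array 21 is given below in C. It is not the norm_cell() function of figure 12. It assumes one-bit characters and adopts the convention that the exponent difference held in the Z register is negative while alignment is incomplete and is incremented once per single-character right shift, so that a clear sign bit indicates completion. The names z_reg, denorm_cell and denormalise are hypothetical.

```c
#include <stdint.h>

/* Contents of the Z register: exponent difference and mantissa magnitude. */
typedef struct {
    int      de;   /* exponent difference; negative while unaligned (assumed) */
    uint32_t zm;   /* Z mantissa magnitude                                    */
} z_reg;

/* One pass through a single systolic de-normalisation cell: conditionally
   shift the mantissa right one character and increment the difference. */
z_reg denorm_cell(z_reg z)
{
    if (z.de < 0) {        /* sign bit set: alignment not yet complete */
        z.zm >>= 1;
        z.de += 1;
    }
    return z;
}

/* Circulate the Z register around a ring of `cells` cells until the
   mantissae are aligned or the Z mantissa has become zero.
   Returns the number of circulations used. */
unsigned denormalise(z_reg *z, unsigned cells)
{
    unsigned n = 0;
    while (z->de < 0 && z->zm != 0) {
        for (unsigned i = 0; i < cells; i++)
            *z = denorm_cell(*z);
        n++;
    }
    return n;
}
```

With a single cell, a difference of -3 is resolved in three circulations; a difference whose magnitude exceeds the mantissa length ends early with a zero mantissa, which corresponds to the case De greater than or equal to m described above.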
A first input to the accumulator 20 is presented sequentially with a series of floating point representations of real numbers Z consisting of triplets having the form {Zf, Ze, Zm} wherein Zf is a character sequence representing descriptors of the real number. An initialization flag character is also part of the descriptor. However, Zf may or may not be used in one or other of the embodiments described hereafter. Ze is a character sequence representing the exponent of the real number Z and Zm is a character sequence representing the mantissa of the real number Z. A mode signal entered in parallel with the triplet through a second input identifies which of the fields Zf, Ze and Zm are being input at any one time. In this implementation an additional character sequence C is also entered through a third input in parallel with the triplet as a constant to be used to increment the exponent difference De of equation (18). A further input shown in figure 4 is reset, which is used in the C simulation program to reset the simulated controller 23 and simulated ALU_2 26.
A first output from the accumulator consists of a status signal busy used to indicate when the accumulator may or may not accept inputs. An additional output provides a character sequence A which is the floating point representation {Af, Ae, Am} of the accumulation of the real numbers Z. A further output consists of a mode output signal which identifies the elements of the triplets. In this embodiment there is a final output Load which is derived from the initialization flag character present in the Zf field of the input triplets.
These inputs and outputs collectively form the first inputs and outputs of the finite state machine 22.
The second output of the finite state machine 22 connects to the input of the systolic de-normalisation array 21 whose output is connected to a second input of the finite state machine 22 to form a systolic ring of four registers: a Z register of at least two fields representative of the exponent difference De, equal to the difference between the accumulator exponent Ae and Ze, and the Z mantissa value Zm; an A register of at least two fields representative of exponent and mantissa of the A operand, in which the accumulation result is stored; a mode register which contains said mode signal; and a C register which contains a constant value which is circulated around the ring.
An internal connection between ALU_1 24 and ALU_2 26 denoted as sig in figures 4 and 12 is used as a control signal path to implement the conditional assignments in ALU_2 of equations (19) and (20).
The following table details the data structure for both the serial operands and the associated mode bit. The operands are entered into the accumulator least significant character or least significant bit (LSB) first. State machines decode the different fields within the finite state machine controller and ALU_2.
                  Mantissa                 Exponent
OPERAND:  Guard   msb . . . lsb   sign     msb . . . lsb   zero_flag   load_bit
MODE:     0       1   . . . 1     0        0   . . . 0     0           1
TABLE 1: Data format and associated MODE word.
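The MODE word of Table 1 can be generated in arrival order with a short C sketch. It assumes the particular case of 6 mantissa characters and 4 exponent characters, a single guard character, and that the rightmost field of the table enters first; the constant and function names are hypothetical.

```c
#include <stddef.h>

/* Assumed field lengths for the example: 6 mantissa and 4 exponent characters. */
enum { MANT_CHARS = 6, EXP_CHARS = 4 };
/* guard + mantissa + sign + exponent + zero_flag + load_bit */
enum { WORD_CHARS = 1 + MANT_CHARS + 1 + EXP_CHARS + 2 };

/* Fill mode[] in arrival order: index 0 is the first character entered. */
void mode_word(unsigned char mode[WORD_CHARS])
{
    size_t i = 0;
    mode[i++] = 1;                                       /* load_bit             */
    mode[i++] = 0;                                       /* zero_flag            */
    for (int k = 0; k < EXP_CHARS; k++)  mode[i++] = 0;  /* exponent, lsb first  */
    mode[i++] = 0;                                       /* mantissa sign        */
    for (int k = 0; k < MANT_CHARS; k++) mode[i++] = 1;  /* mantissa, lsb first  */
    mode[i++] = 0;                                       /* guard                */
}
```

The single non-zero mode character at the start marks the load bit, and the run of non-zero mode characters marks the mantissa field, which is what allows the state machines to identify the fields from transitions of the mode bit alone.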
A state diagram which describes the operation of the controller, multiplexer and ALU_1 in the finite state machine of figure 4 is given in figure 1. The controller is a state machine shown in figure 9 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 9. The functional behaviour of the state machine is described by the C simulation code function fsm1() of figure 12.
In the following figures, all logical tests on variables are defined to be true if the value of the variable is non-zero, and false if the value of the variable is zero. The initial state State 0 as shown in figure 1 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The controller remains in the zero state until a non-zero mode bit is detected after which it enters State 1. In State 1 the load bit associated with the flag characters Zf is sampled, logically OR-ed with the zero flag status register for the accumulator Azf and stored in the one-bit storage register Load.
At the next clock transition, which corresponds to a zero mode bit, the controller moves to State 2 in which the zero flag character of the flag characters Zf is stored in the internal storage register Zzf.
If the Load register contains a non-zero value, the controller enters State 3 at the next clock period and otherwise the controller enters State 4 which will be described subsequently. In State 3 the accumulator exponent field is incremented by the contents of the overflow register from the previous computation and is output as the exponent field of the accumulated result through the finite state machine first output, and also the value of the input operand exponent field ze is output to the ring accumulator exponent register Ae through the finite state machine second output, the exponent difference field De is set to zero and is entered into the ring Z register through the finite state machine second output, the sign register sig is set to zero and both the Z mantissa sign register Zs and the accumulator sign register As are set equal to the sign of the input operand mantissa zs.
When the value of the mode bit becomes non-zero, indicating the presence of mantissa characters, the controller enters either State 6 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 5 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
In State 5 the sign-corrected mantissa value Am is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, before being output as the result mantissa through the finite state machine first output. The mantissa value input to the ring Z register is set to zero and the mantissa register Am is set to the input mantissa value zm. In State 6 the correctly represented mantissa value Am is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output as the result mantissa through the finite state machine first output. As in State 5 the mantissa value input to the Z register is set to zero and the mantissa register Am is set to the input mantissa value zm.
The controller enters State 9 when the mode bit becomes zero. It remains in this state until a non-zero signal cyn_1 is received from a counter depicted in figure 9, indicating that the mantissae Am and Zm are aligned or the mantissa Zm is zero. During this state, the modulo 2 sum of the signs of the Z and A mantissae is stored in the register neg.
At the next clock period the controller enters State 10.
When the mode bit becomes non-zero the controller enters one of:
State 11, in which the accumulator value Am is computed by adding the contents of the Z mantissa Zm to the contents of the accumulator Am;
State 12, in which the accumulator value Am is computed by subtracting the contents of the accumulator Am from the contents of the Z mantissa Zm; or
State 13, in which the accumulator value Am is computed by subtracting the contents of the Z mantissa Zm from the contents of the accumulator Am.
When the mode bit becomes zero the controller returns to State 0.
If the Load register in State 2 contains zero, the controller enters State 4 in which the accumulator exponent value Ae is incremented by the value of the previously computed overflow Aovf and is output to the ring through the second output of the finite state machine. The exponent difference De is set equal to the difference of the value ze and the incremented accumulator value Ae + Aovf and is output to the Z register through the second output of the finite state machine. The sign register sig is set equal to the sign bit of De and the one bit Z mantissa sign register Zs is set equal to the sign bit of the input mantissa and the one bit accumulator sign register As is left unchanged.
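The exponent arithmetic of State 4 can be illustrated with the C sketch below. It assumes a two's complement exponent of EXP_BITS bits, where the width of 5 (four exponent characters plus one guard digit) is chosen only for the example. The names wrap, state4_out and state4 are hypothetical and do not appear in figure 12.

```c
/* Hypothetical exponent width: 4 exponent characters plus 1 guard digit. */
enum { EXP_BITS = 5 };

/* Reduce a value to EXP_BITS-bit two's complement. */
static int wrap(int v)
{
    int mask = (1 << EXP_BITS) - 1;
    v &= mask;
    return (v & (1 << (EXP_BITS - 1))) ? v - (1 << EXP_BITS) : v;
}

typedef struct {
    int ae;    /* incremented accumulator exponent Ae + Aovf */
    int de;    /* exponent difference De                     */
    int sig;   /* sign bit of De, passed to ALU_2            */
} state4_out;

/* State 4: increment Ae by the stored overflow, form De = ze - (Ae + Aovf),
   and set sig to the sign bit of De. */
state4_out state4(int ae, int aovf, int ze)
{
    state4_out o;
    o.ae  = wrap(ae + aovf);
    o.de  = wrap(ze - o.ae);
    o.sig = (o.de < 0);
    return o;
}
```

For an accumulator exponent of 3 with a stored overflow of 1 and an input exponent of 2, the sketch gives Ae = 4, De = -2 and sig = 1.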
When the value of the mode bit becomes non-zero, indicating the presence of mantissa characters, the controller enters either State 8 if the previously computed result was a correct sign-magnitude representation of the accumulated value, or State 7 if the previously computed result was not a correct sign-magnitude representation and required a sign reversal.
In State 7 the sign-corrected mantissa value Am is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output.
In State 8 the correctly represented mantissa value Am is right shifted by an amount equal to the contents of the overflow register, set by the previous computation, and is output to the ring A register through the finite state machine second output. In both State 7 and State 8 the Z register contents are passed unchanged from the finite state machine second input to the finite state machine second output.
A state diagram which describes the operation of the second arithmetic logic unit ALU_2 in the finite state machine of figure 4 is given in figure 2. The ALU_2 has a state machine shown in figure 10 whose states change synchronously with the clock and conditional upon a number of input signals as also disclosed in figure 10. The functional behaviour of the state machine is described by the C simulation code function fsm2() of figure 12.
The initial state State 0 as shown in figure 2 is first entered when the system is initialised by the control input Reset, and successively thereafter when each operand has been accumulated. The ALU_2 remains in the zero state until a non-zero mode bit is detected after which it enters State 1. At the occurrence of the next clock the ALU_2 state changes to State 2. The ALU_2 state changes to State 5 if the sign control line from ALU_1 is non-zero, and changes to State 3 otherwise.
In State 3 the exponent difference De is negated and the contents of the accumulator exponent field are replaced by the sum of Ae and the original (un-negated) De, so restoring the former Ze value.
When the mode bit becomes non-zero, indicating the mantissa field, the ALU_2 changes state to State 4. In State 4 the contents of the two mantissa registers Am and Zm are exchanged.
When the mode bit becomes zero, the ALU_2 enters State 5.
The ALU_2 remains in State 5 until a non-zero signal cyn_1 is received from a counter depicted in figure 10, when it enters State 6. It re-enters State 0 when the signal cyn_1 becomes zero.
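As a rough word-level illustration of the ALU_2 behaviour in States 3 and 4, the sketch below orders the operands so that the A register holds the larger exponent. It assumes De = Ze - Ae as formed by ALU_1, and it follows the figure 12 simulation in regenerating Ze as the sum of the A and Z exponent fields. The type `ring_operands` and the function `alu2_order` are hypothetical names used only here.

```c
/* Hypothetical word-level operand pair as seen by ALU_2. */
typedef struct {
    int ae;        /* exponent field of the A register            */
    int de;        /* exponent difference field of the Z register */
    unsigned am;   /* A mantissa                                  */
    unsigned zm;   /* Z mantissa                                  */
} ring_operands;

/* sign is the control line from ALU_1: non-zero when De is negative. */
void alu2_order(ring_operands *r, int sign)
{
    unsigned t;

    if (sign != 0)
        return;              /* straight to State 5: pass unchanged   */

    r->ae = r->ae + r->de;   /* State 3: A exponent field becomes Ze  */
    r->de = -r->de;          /* State 3: negate the difference        */

    t = r->am;               /* State 4: exchange the two mantissae   */
    r->am = r->zm;
    r->zm = t;
}
```

After this step the difference is never positive, so the mantissa now in the Z register is the one that the de-normalisation cells shift.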
Equations (17) to (24) with the exclusion of equation (21) are implemented using the finite state machine 22. To implement the de-normalisation of equation (21), an array of at least one systolic cell is required in which the transfer of data between cells is described by the following recurrences
$$
\begin{aligned}
M_0(p) &= M_2(p-1) && (25)\\
C_0(p) &= C_2(p-1) && (26)\\
Z_0(p) &= Z_2(p-1) && (27)\\
A_0(p) &= A_2(p-1) && (28)
\end{aligned}
$$
and the internal recurrences which are implemented in each cell are
$$
\begin{aligned}
M_2(n) &= M_1(n-1) && (29)\\
M_1(n) &= M_0(n-1) && (30)\\
C_2(n) &= C_1(n-1) && (31)\\
C_1(n) &= C_0(n-1) && (32)\\
Z_1(n) &= Z_0(n-1) && (33)\\
Z_4(n) &= \begin{cases} Z_3(n-1), & M_1(n-1) = 0\\ Z_4(n-1), & M_1(n-1) = 1 \end{cases} && (34)\\
A_2(n) &= A_1(n-1) && (35)\\
A_1(n) &= A_0(n-1) && (36)\\
Z_2(n) &= \begin{cases} C_1(n-1) + Z_0(n-1) + Cy(n-1), & M_0(n-1)\,\&\,M_1(n-1)\,\&\,Z_4(n-1) = \text{TRUE}\\ C_1(n-1) + Z_1(n-1) + Cy(n-1) = C_1(n-1) + Z_0(n-2) + Cy(n-1), & M_0(n-1)\,\&\,M_1(n-1)\,\&\,Z_4(n-1) = \text{FALSE} \end{cases} && (37)
\end{aligned}
$$
It is assumed that C contains the value 1 in the character position corresponding to the least significant exponent character, and is zero elsewhere. An examination of recurrence (34) shows that the sign of the exponent is stored in Z4 for the duration of the mantissa. This value is used to control, via recurrence (37), whether the mantissa output Z2 is delayed by one or by two stages when the mode values M0 and M1 are high. This effects a one character de-normalisation of the Z mantissa field relative to the A mantissa when the exponent difference De is negative. The presence of a 1 in the C character sequence can be seen to increment the exponent difference according to recurrence (37).
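The effect of one cell on a Z operand can be modelled at word level, under the assumption of base 2 characters and a sign-magnitude mantissa held in an unsigned integer. The function name `denorm_cell` and the constant `CHAR_BASE` are hypothetical, and the sketch ignores the character-serial timing of recurrences (29) to (37).

```c
#define CHAR_BASE 2u   /* assumed number base of one mantissa character */

/* Apply one de-normalisation cell to the Z operand (word-level sketch). */
void denorm_cell(int *de, unsigned *zm)
{
    /* Sign of the unincremented difference, as held in Z4. */
    int negative = (*de < 0);

    /* The 1 carried in the C sequence increments the exponent difference. */
    *de += 1;

    /* A negative difference selects the shorter path: shift Zm right
       by one character relative to Am. */
    if (negative)
        *zm /= CHAR_BASE;
}
```

Starting from De = -2 and Zm = 12, two applications shift Zm to 3 and bring De to 0; a third application leaves Zm unchanged because De is no longer negative.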
Each cell which implements these recurrences in a linear structure can implement the one-character de-normalisation and sign-extension required for floating-point addition using one's complement or two's complement mantissae, and the de-normalisation without sign extension for sign-magnitude mantissae. Thus for an m-bit mantissa full de-normalisation requires the application of m recurrences. These recurrences may be applied either by connecting m cells in a linear array, or by connecting at least one cell in a systolic ring structure with sufficient delay cells to contain the operand, and circulating the operands until m recurrences have been applied, or until the mantissae are aligned as indicated by a non-negative exponent difference.
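The trade-off between cells and circulations can be illustrated with a small word-level model of the ring. It is a sketch under stated assumptions: base 2 characters, an unsigned sign-magnitude mantissa, and a hypothetical function `ring_denormalise` whose parameters `ncells` and `mant_chars` stand for the number of systolic cells in the ring and the mantissa length m.

```c
/* Circulate a Z operand through a ring of ncells de-normalisation cells.
   Returns the number of circulations used. Stops when the exponent
   difference is non-negative or when at least mant_chars recurrences
   have been applied. */
int ring_denormalise(int *de, unsigned *zm, int ncells, int mant_chars)
{
    int applied = 0;        /* recurrences applied so far */
    int circulations = 0;
    int i;

    while (*de < 0 && applied < mant_chars) {
        for (i = 0; i < ncells; i++) {
            int negative = (*de < 0);  /* sign seen by this cell   */
            *de += 1;                  /* increment the difference */
            if (negative)
                *zm /= 2u;             /* one-character shift      */
            applied++;
        }
        circulations++;
    }
    return circulations;
}
```

With two cells in the ring, a difference of -5 is resolved in three circulations. A linear array corresponds to the case where `ncells` equals `mant_chars`, so at most one pass is needed.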
Figure 5 represents a schematic diagram of one possible hardware implementation of a de-normalisation cell 27 implementing the above recurrence equations (29) to (37).
Figure 6 represents a schematic diagram of one possible hardware implementation of a linear array of delay stages and their interconnection denoted by the above recurrence equations (25) to (28).
Figures 7 and 8 together represent a schematic diagram of the arithmetic logic unit ALU_1 24 component of the finite state machine 22. The notation depicted in figures 7 and 8 follows that of figure 12.
Figure 9 represents a schematic diagram of the control element 23 of the finite state machine 22. The notation depicted in figure 9 follows that of figure 12.
Figure 10 represents a schematic diagram of the arithmetic logic unit ALU_2 26 component of the finite state machine 22. The notation depicted in figure 10 follows that of figure 12.
A further embodiment of the invention is provided in figure 11 which depicts a schematic diagram of the joining or coalescence of two adjacent systolic ring accumulators to form a single accumulator capable of accumulating operands of double length. In the two systolic rings which have coalesced, the multiplexer for the second ring is controlled by the controller of the first ring.
Figure 12 is a C code simulation of an embodiment of a sign-magnitude systolic ring accumulator.
Although not implemented, it must be noted that post-normalisation is possible with the architecture of the ring accumulator. Minor additional complexity would be incurred in the logic circuitry and state machine of Logic_1, and an additional recirculation would be required.
Systolic ring arithmetic units provide new possibilities for systolic array processors. Consider a simple linear array of two processors, designed to process single precision operands. If the two processors are implemented as systolic rings it is possible with appropriate multiplexer means to coalesce the two rings into a single, larger ring. This large ring can process double-length operands with the same number of circulations as the single ring, as the ratio of mantissa characters to systolic cells remains a constant. For larger order systolic arrays the ability for cells to coalesce makes possible the construction of variable dimension arrays which can be matched to both the problem size and the number representation.
The nature of the systolic architecture allows advantage to be taken of the statistical properties of numbers to minimise the number of systolic cells. Current studies suggest that the number of systolic cells may be minimised by matching the number of cells to the 95th percentile of the expected distribution of denormalisation shifts. In such a processor, the use of longer mantissa lengths for increased precision would not require increased numbers of systolic cells, but only an increase in the length of the registers. For such an implementation 95% of accumulations would occur in the designed number of circulations, and the remaining 5% would require additional circulations. In a processor which is asynchronous, this computation time uncertainty would not constitute a problem, and the saving of circuitry would be valuable. The only addition to the structure would be a test of completion of denormalisation. A successful test would cause the remaining circulations of the operands to be bypassed. The information required to reduce the number of circulations in this way is in the sign bit of the incremented exponent difference, and can be used as an input to an expanded state machine in the circuit Logic_1. When the sign bit is zero, the de-normalisation is complete, and the state machine can move to the next state.
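One way to picture the sizing rule is as a percentile lookup over a histogram of de-normalisation shifts. The sketch below is an assumption-laden illustration, not a procedure given in the patent: the functions `shift_percentile` and `cells_needed`, and the idea of supplying a measured histogram, are hypothetical.

```c
/* hist[s] counts accumulations that needed a shift of s characters.
   Returns the smallest shift s whose cumulative share reaches pct percent. */
int shift_percentile(const unsigned long *hist, int nbins, unsigned pct)
{
    unsigned long total = 0, running = 0;
    int s;

    for (s = 0; s < nbins; s++)
        total += hist[s];
    if (total == 0)
        return 0;                     /* no observations */

    for (s = 0; s < nbins; s++) {
        running += hist[s];
        if (running * 100 >= total * pct)
            return s;
    }
    return nbins - 1;                 /* pct above 100: use the largest bin */
}

/* Cells required so that `shift` characters are covered within the
   designed number of circulations (at least one cell). */
int cells_needed(int shift, int circulations)
{
    int c = (shift + circulations - 1) / circulations;
    return c < 1 ? 1 : c;
}
```

Operands whose shift exceeds the chosen percentile would simply take extra circulations, with completion detected from the sign bit of the incremented exponent difference as described above.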
Systolic ring and linear array floating point accumulators constructed according to the details described in this patent are of interest in large order systolic arrays and neural networks, and floating point arithmetic units implemented in Gallium Arsenide. This is due to the wide range of area/time/precision/dynamic-range tradeoffs achievable with the ring architecture and its low transistor count. It is also possible to implement the architecture determined by this patent with simple optical processing techniques.
defs.h
#define base 2
#define states 4
#define states1 5
#define reg_len 12
#define recirc 3
#define recirc_m 2
#define exp_len 10
#define mant_len 30
#define cells mant_len/2
enum clock {
ph1, ph2
};
typedef struct {
int p1, p2;
} reg;
typedef struct {
reg x1, x2, y1, y2, mode1, mode2, pp, cy;
} mult;
typedef struct {
reg x1, x2, y1, y2, mode1, mode2, pp1, pp2, sign, cy, bypass;
} norm;
sma5.c
#include <stdio.h>
#include <math.h>
#include "defs.h"
int
and(a, b)
int a, b;
{
return (a & b);
}
int
or(a, b)
int a, b;
{
return (a | b);
}
int
mux(sel, a, b)
int sel, a, b;
{
if (sel == 0)
return (a);
else
return (b);
}
int
mux4(sel, a, b, c, d)
int sel, a, b, c, d;
{
switch (sel) {
case(0): return (a); break;
case(1): return (b); break;
case(2): return (c); break;
case(3): return (d); break;
}
}
void
add(a, b, c, sum, cy)
int a, b, c, *sum, *cy;
{
*sum = (a + b + c) % base;
*cy = (a + b + c) / base;
}
int
inv_bit(x)
int x;
{
int xbar;
xbar = ~x & 1;
return (xbar);
}
int
nor(a, b)
int a, b;
{
return (~(a | b)) & 1;
}
int
xor(a, b)
int a, b;
{
return or(nor(inv_bit(a), b), nor(inv_bit(b),a)); }
void add_sub(a_s, a, b, c, sum, cy)
int a_s, a, b, c, *sum, *cy;
{
int t, ct;
ct = inv_bit(nor(or(
nor(inv_bit(a), inv_bit(c)),
nor(inv_bit(c), inv_bit(b))),
nor(inv_bit (b), inv_bit(a))));
t = xor(b,c);
*sum = xor(a,t);
*cy = xor(nor(a_s,inv_bit(t)),ct);
}
int
reg_cell (clock, a, b)
int clock, a;
reg *b;
{
if (clock == 0)
b->p1 = ~a;
if (clock == 1)
b->p2 = ~b->p1;
return (b->p2);
}
int
shiftv(cl, len, a, sreg)
int cl, len, a;
reg *sreg;
{
int i, op;
op = reg_cell(cl, a, &sreg[0]);
for (i = 0; i < len - 1; i++)
op = reg_cell(cl, sreg[i].p2, &sreg[i + 1]);
return (op);
}
int
fsm1(cl, reset, mode, load, neg, As, cyn_1, state)
int cl, reset, mode, load, neg, As, cyn_1, state; {
static int p1_reset, p1_mode, p1_state, t;
if (cl == 0) {
p1_reset = reset;
p1_mode = mode;
p1_state = state; }
if (cl == 1) {
if (p1_reset == 1) {
state = 0;
} else {
switch (p1_state) {
case 0: state = mux(p1_mode,0,1);
break;
case 1:
switch (p1_mode) {
case 0:
state = 2;
break;
case 1:
printf ("Error in fsm1 s1: second bit of field one\n");
break;
}
break;
case 2: state = mux(load,4,3);
break;
case 3: if (!p1_mode) state = 3;
else state = mux(neg&&As,6,5);
break;
case 4: if (!p1_mode) state = 4;
else state = mux(neg&&As,8,7);
break;
case 5: state = mux(p1_mode,9, 5);
break;
case 6: state = mux (p1_mode, 9, 6);
break;
case 7: state = mux(p1_mode,9,7);
break;
case 8: state = mux(p1_mode,9,8);
break;
case 9: state = mux(cyn_1,9,10);
break;
case 10: t = (!neg&&p1_mode) +
2*(neg&&!As&&p1_mode) +
3*(neg&&As&&p1_mode);
state = mux4(t,10,11,13,12);
break;
case 11: state = mux(p1_mode,0,11);
break;
case 12: state = mux(p1_mode,0, 12);
break;
case 13: state = mux(p1_mode, 0,13);
break;
}
}
}
return (state);
}
main(argc, argv)
int argc;
char *argv[];
{
FILE *infp, *outfp;
char ch;
int index = 0, cl_gen, cl, ind;
int busy = 0, reset = 1;
int a[states], b[states];
int f_eof;
outfp = fopen("states", "w");
if (outfp == NULL)
fprintf(stderr,
"%s: cannot open file %s\n", argv[0], "states");
else {
a[0] = 0;
a[1] = 0;
a[2] = 0;
a[3] = 0;
for (ind = 0; ind < 4; ind++)
for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
cl = (cl_gen & 2) >> 1;
busy = fpad(outfp, cl_gen, cl, reset, a, b);
}
/* printf ("Logic reset\n"); */
reset = 0;
init_instructions();
do {
for (cl_gen = 1; cl_gen <= 3; cl_gen++) {
cl = (cl_gen & 2) >> 1;
if ((busy == 0) && (cl_gen == 2)) {
f_eof = scanf("%1d %1d %1d %1d\n", &a[0], &a[1], &a[2], &a[3]);
}
busy = fpad(outfp, cl_gen, cl, reset, a, b);
if ((busy == 0) & (cl_gen == 3)) {
if (b[0]) printf("%1d %1d %d %d\n", b[0], b[1], b[2], b[3]);
}
}
} while (f_eof != EOF);
fclose(outfp); /* close only when the file was opened */
}
}
smad5.c
#include <stdio.h>
#include "defs.h"
/************************************************
Microcode fields are:
LAzs LAs Lzs Lzf Lld func src1 src2 shft dst1 dst2 cry1 cry2
Field lengths are:
|L|L|L|L|L|f|ss|ss|s|dd|dd|c|c|
Define constants as:
func: (fadd,fsub)
src1: (s1_Z,s1_A,s1_z,s1_0)
src2: (s2_Z,s2_A,s2_z,s2_0)
shft: (shift0,shift1)
dst1: (d1_Z,d1_z,d1_f,d1_0)
dst2: (d2_Z,d2_z,d2_f,d2_0)
cry1: (noset1,set1)
cry2: (noset2,set2)
************************************************/
#define noset2 0
#define set2 1
#define noset1 0
#define set1 2
#define d2_f 0xc
#define d2_z 4
#define d2_Z 8
#define d2_I 0
#define d1_Z 0
#define d1_z 0x10
#define d1_f 0x20
#define d1_0 0x30
#define shift0 0
#define shift1 0x40
#define s2_Z 0
#define s2_z 0x80
#define s2_A 0x100
#define s2_0 0x180
#define s1_Z 0
#define s1_z 0x200
#define s1_A 0x400
#define s1_0 0x600
#define fsub 0
#define fadd 0x800
#define Lld 0x1000
#define Lzf 0x2000
#define Lzs 0x4000
#define LAs 0x8000
#define LAzs 0x10000
extern int mux();
int instr[20] ;
void init_instructions ()
{
instr[0] = 0 ;
instr[1] = Lld;
instr[2] = Lzf+set1+set2;
instr[3] = Lzs+LAzs+d1_0+d2_z;
instr[4] = Lzs+fsub+s1_z+s2_A+d1_f+d2_I;
instr[5] = shift1+fsub+s1_0+s2_A+d1_0+d2_z;
instr[6] = shift1+fadd+s1_0+s2_A+d1_0+d2_z;
instr[7] = shift1+fsub+s1_0+s2_A+d1_z+d2_f;
instr[8] = shift1+fadd+s1_0+s2_A+d1_z+d2_f;
instr[9] = d1_Z+d2_I;
instr[10] = d1_Z+d2_I;
instr[11] = LAs+fadd+s1_Z+s2_A+d2_f;
instr[12] = LAs+fsub+s1_Z+s2_A+d2_f;
instr[13] = LAs+fsub+s1_A+s2_Z+d2_f;
}
void
norm_cell(cell, cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
norm *cell;
int cl, x, y, pp, mode, *x_out , *y_out, *pp_out, *mode_out; {
int m1, x1, y1, pp1, sum, c_out, bypass, sign;
m1 = reg_cell(cl, mode, &cell->mode1);
*mode_out = reg_cell(cl, m1, &cell->mode2);
pp1 = reg_cell(cl, pp, &cell->pp1);
*pp_out = reg_cell(cl, pp1, &cell->pp2);
y1 = reg_cell(cl, y, &cell->y1);
sign = reg_cell(cl, y1, &cell->sign);
bypass = reg_cell(cl, mux(m1, sign, cell->bypass.p2), &cell->bypass);
x1 = reg_cell(cl, x, &cell->x1);
*x_out = reg_cell(cl, x1, &cell->x2);
add(cell->pp1.p2, mux(and(and(m1, mode), bypass), y1, y),
cell->cy.p2, &sum, &c_out);
c_out = reg_cell(cl, c_out, &cell->cy);
*y_out = reg_cell(cl, sum, &cell->y2);
}
void
normalise(cl, x, y, pp, mode, x_out, y_out, pp_out, mode_out)
int cl, x, y, pp, mode, *x_out, *y_out, *pp_out, *mode_out;
{
static norm mx[cells];
int cell_index, j, a[states], b[states];
a[0] = x;
a[1] = y;
a[2] = pp;
a[3] = mode;
for (cell_index = 0; cell_index < cells; cell_index++) {
norm_cell(&mx[cell_index], cl, a[0], a[1], a[2], a[3],
&b[0], &b[1], &b[2], &b[3]);
for (j = 0; j < states; j++) a[j] = b[j];
}
*x_out = b[0];
*y_out = b[1];
*pp_out = b[2];
*mode_out = b[3];
}
void delay(cl, a, b, del)
int cl, *a, *b;
reg *del;
{
int i;
for (i = 0; i < states; i++)
b[i] = reg_cell(cl, a[i], &del[i]);
}
void
delay1(cl, a, b)
int cl, *a, *b;
{
static reg del[states];
int i;
for (i = 0; i < states; i++)
b[i] = reg_cell(cl, a[i], &del[i]);
}
void
logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, sig, con)
FILE *outfp ;
int cl_gen, cl, reset, *a, *g, *e, *lr, *r, *sig, *con;
reg *del;
{ /* logic_1 */
static int lAzs, lAs, lzs, lzf, lld, func, src1, src2;
static int shft, dst1, dst2, cry1, cry2;
static int lAc_sign, A, __A, A_, Z, z, Aovf, Ao, Zo, gshft;
static int last_mode, count, cyn_1, load, Ld, neg, As;
static int ed[states], edz[states];
static int fdsum, fsum, fcy, isum, icy, zzf;
static int Azf, ss, bb, cc, Zs, cycle, cy1, state, p1, p2, Asd;
static reg fcy_reg, icy_reg, delz[states];
static reg si_reg, Load_reg, zzf_reg;
static reg Aovf_reg, Azf_reg, As_reg, Zsi_reg, fd_reg, neg_reg;
static reg edd_reg, r_reg, Ld_reg, Asd_reg;
if (reset == 1) {
cycle = (recirc - 1);
count = -1;
last_mode = 0;
}
if (cl_gen == 2) {
if (e[3]&&!last_mode) {
count = (count + 1)%2;
if (!(count)){
cycle = (cycle + 1) % recirc;
}
}
cy1 = (cycle == 0) || reset;
cyn_1 = (cycle == recirc - 1);
if (count>-1) *con = inv_bit(cy1);
last_mode = e[3];
}
e[0] = g[0];
e[1] = mux(*con, a[1], g[1]);
e[2] = mux(*con, a[2], g[2]);
e[3] = mux(*con, a[3], g[3]);
delay(cl, e, ed, del);
delay(cl, a, edz, delz);
state = fsm1(cl, reset, e[3], load, neg, Asd_reg.p2, cyn_1, state);
p1 = cl == 0;
p2 = cl == 1;
Z = ed[1];
z = edz[1];
A = ed[0];
__A = e[0];
/* Instruction decode */
cry2 = instr[state]&1;
cry1 = (instr[state]>>1)&1;
dst2 = (instr[state]>>2)&3;
dst1 = (instr[state]>>4)&3;
shft = (instr[state]>>6)&1;
src2 = (instr[state]>>7)&3;
src1 = (instr[state]>>9)&3;
func = (instr[state]>>11)&1;
lld = (instr[state]>>12)&1;
lzf = (instr[state]>>13)&1;
lzs = (instr[state]>>14)&1;
lAs = (instr[state]>>15)&1;
lAzs = (instr[state]>>16)&1;
lzs = lzs&&e[3]&&!ed[3];
lAzs = lAzs&&e[3]&&!ed[3];
lAc_sign = lAs&&!e[3];
gshft = shft&&!e[3];
fcy = reg_cell(cl,mux(cry1,and(fcy,inv_bit(and(e[3],
inv_bit(ed[3])))),Aovf), &fcy_reg);
icy = reg_cell(cl, mux(cry2, icy, Aovf), &icy_reg);
load = reg_cell(cl, mux(lld, Load_reg.p2, ed[1]||!Azf_reg.p2),
&Load_reg);
Ld = reg_cell(cl, mux(lld, Ld_reg.p2, ed[1]), &Ld_reg);
zzf = reg_cell(cl, mux(lzf, zzf_reg.p2, ed[1]), &zzf_reg);
Zs = reg_cell(cl, mux4(lzs+2*(inv_bit(*sig)&&shft&&!e[3]),
Zsi_reg.p2, ed[1]&&!load, As_reg.p2), &Zsi_reg);
Azf = reg_cell(cl, mux(or(lzf,lAs), Azf_reg.p2,
and(lAs,or(fsum,Azf_reg.p2))), &Azf_reg);
neg = reg_cell(cl, bb = mux(shft, neg_reg.p2, As^Zs), &neg_reg);
Aovf = reg_cell(cl, mux(lAs, Aovf_reg.p2, fsum^neg), &Aovf_reg);
A_ = mux(and(shft,Aovf), A, __A);
add_sub(func, mux4(src1,Z,z,A_,0), mux4(src2,Z,z,A_,0),
fcy_reg.p2, &fsum, &fcy);
add_sub(1, 0, A_, icy_reg.p2, &isum, &icy);
Zo = mux4(dst1,Z,z,fsum,0);
Ao = mux4(dst2,isum,z,Z,fsum);
*sig = reg_cell(cl, mux(lzs, si_reg.p2, fd_reg.p2), &si_reg);
fdsum = reg_cell(cl, fsum, &fd_reg);
As = reg_cell(cl, mux4(
lAc_sign+2*(lAzs)+3*(inv_bit(*sig)&&shft&&!e[3]),
As_reg.p2, ((!neg)&&As&&Zs)||(neg&&fsum), ed[1], Zsi_reg.p2),
&As_reg);
Asd = reg_cell(cl, As, &Asd_reg);
lr[0] = Ao&&!lzs;
lr[1] = Zo&&!lzs;
lr[2] = ed[2];
lr[3] = ed[3];
r[0] = Ld;
r[1] = reg_cell(cl, mux4(lzs+2*ed[3],isum,As_reg.p2,fsum,0),
&r_reg);
r[2] = reg_cell(cl, ed[3],&edd_reg);
r[3] = r[2];
/* End of datapath */
}
int
fsm2(cl, reset, mode, sign, cyn_1, state)
int cl, reset, mode, sign, cyn_1, state;
{
static int p1_reset, p1_mode, p1_state, t;
if (cl == 0) {
p1_reset = reset;
p1_mode = mode;
p1_state = state;
}
if (cl == 1) {
if (p1_reset == 1) {
state = 0;
} else {
switch (p1_state) {
case 0: state = mux (p1_mode, 0,1);
break;
case 1:
switch (p1_mode) {
case 0:
state = 2;
break;
case 1 :
printf("Error in fsm2 s1: second bit of field one\n");
break;
}
break;
case 2: state = mux(sign,3,5);
break;
case 3: state = mux (p1_mode, 3, 4);
break;
case 4: state = mux(p1_mode, 5,4);
break;
case 5: state = mux(cyn_1,5,6);
break;
case 6: state = mux(cyn_1,0,6);
break;
}
}
}
return (state);
}
void
logic_2(cl_gen, cl, reset, sign, e, lr)
int cl_gen, cl, reset, *sign, *e, *lr;
{
static int ed[states], state;
static int dst1, dst2, cry;
static int sub_s, sub_c, sub_cd, ad_s, ad_c, ad_cd;
static int cycle, cy1;
static int last_mode, count, cyn_1;
static reg sub_cy, ad_cy;
if (reset == 1) {
cycle = (recirc - 1);
count = -1;
last_mode = 0;
}
if (cl_gen == 2) {
if (e[3]&&!last_mode){
count = (count + 1)%2;
if (!(count)) {
cycle = (cycle + 1) % recirc;
}
}
cy1 = (cycle == 0) || reset;
cyn_1 = (cycle == recirc - 1);
last_mode = e[3];
}
/* Register operations */
delay1(cl, e, ed);
state = fsm2(cl, reset, e[3], *sign, cyn_1, state);
ad_cd = reg_cell(cl, and(ad_c,cry), &ad_cy);
sub_cd = reg_cell(cl, and(sub_c, cry), &sub_cy);
/* Control signal generation */
dst1 = 0;
dst2 = 0;
cry = 0;
switch (state) {
case 3:
dst1 = 2;
dst2 = 3;
cry = 1;
break;
case 4:
dst1 = 1;
dst2 = 1;
break;
}
/* Arithmetic operations */
add_sub(0,0,ed[1], and(cry,sub_cy.p2),&sub_s,&sub_c);
add_sub(1,ed[1],ed[0],and(cry,ad_cy.p2), &ad_s, &ad_c);
/* set outputs from logic cell */
lr[0] = mux4(dst2, ed[0], ed[1], sub_s, ad_s);
lr[1] = mux4(dst1, ed[1], ed[0], sub_s, ad_s);
lr[2] = ed[2];
lr[3] = ed[3];
/* End of datapath */
}
int
fpad(outfp, cl_gen, cl, reset, a, r)
FILE *outfp;
int cl_gen, cl, reset, *a, *r;
{
static int b[states], c[states], d[states], g[states];
static int e[states], f[states], h[states], lr[states];
static reg xr[reg_len], yr[reg_len], ppr[reg_len];
static reg acxr[reg_len], acyr[reg_len], acppr[reg_len];
static reg acmoder[reg_len], moder[reg_len];
static int con = 0, sig = 0;
static reg del[states];
if (reset == 1)
con = 0;
logic_1(outfp, cl_gen, cl, del, reset, a, g, e, lr, r, &sig, &con);
f[0] = shiftv(cl, reg_len, lr[0], acxr);
f[1] = shiftv(cl, reg_len, lr[1], acyr);
f[2] = shiftv(cl, reg_len, lr[2], acppr);
f[3] = shiftv(cl, reg_len, lr[3], acmoder);
logic_2(cl_gen, cl, reset, &sig, f, h);
normalise(cl, h[0], h[1], h[2], h[3], &g[0], &g[1], &g[2], &g[3]);
return (con);
}

Claims

CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS: 1. A systolic ring serial floating point accumulator for accepting sequentially as input at least two real numbers Z in floating point format and outputting the floating point representation A of the accumulation of said real numbers, comprising, a finite state machine having at least first and second inputs, at least first and second states and at least first and second outputs, a denormalization array adapted to receive the second output of said finite state machine and to output at least partially denormalized floating point numbers to the second input of said finite state machine and in said configuration to form a ring, wherein, during said first state said finite state machine is adapted to control said ring wherein said number Z in floating point format is input to said ring through said finite state machine first input and said accumulator output A is output from said ring from said finite state machine first output, and during said second state said finite state machine is adapted to transfer at least partially denormalised floating point numbers from its said second input to its said second output, to control the number of times said transfer occurs and to add aligned floating point numbers.
2. A systolic ring serial floating point accumulator according to claim 1 wherein said finite state machine further comprises an arithmetic logic unit ALU_1 having as first and second inputs said finite state machine first and second inputs and having as its first output said finite state machine first output and a second output, a linear array of zero or more delay cells adapted to receive said ALU_1 second output, a second arithmetic logic unit ALU_2 having as its output said finite state machine second output, said de-normalisation array further comprising at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order, said ring comprising a character sequence path formed from a serial configuration of, said ALU_1, said linear array of delay cells arranged to have a delay equal to at least the number of characters which represent the exponent Ze, said ALU_2 and said de-normalisation array.
3. A systolic ring serial floating point accumulator according to claim 2 wherein said real numbers in floating point format are represented as a triplet having the form {Zf, Ze, Zm} wherein Zf is a character sequence representing descriptors of the real number and an initialization flag character, Ze is a character sequence representing the exponent of the real number and Zm is a character sequence representing the mantissa of the real number, mode is a character sequence entered in parallel with the triplet to identify to the accumulator the fields Zf, Ze and Zm, and said accumulator outputs a character sequence A which is the floating point representation {Af, Ae, Am} of the accumulation of said real numbers, whereby said ring forms, an A register of at least two fields representative of exponent and mantissa of the A operand, a Z register of at least a first and second field, said first field De being representative of the difference between the accumulator exponent Ae and the input exponent Ze, and said second field Zm being representative of the Z mantissa value, and a mode register which contains said mode characters.
4. A systolic ring serial floating point accumulator according to claim 3 wherein, said ring further comprises, a connection means to connect said ALU_1 to said ALU_2, whereby, ALU_1 controls ALU_2 dependent on the sign of the value De.
5. A systolic ring serial floating point accumulator according to claim 4 wherein, ALU_1 further comprises, means for differencing in said finite state machine first state the exponent values Ae and Ze to provide the value De which is output to the Z register of said ring and to output to the A register of said ring the value Ae, and also to output via said connection means the sign of the value De to ALU_2 and, said ALU_2 further comprises, test and control means to accept and test the sign of the value De received via said connection means, whereby if the sign of the value De is positive, a sign reversal means of said ALU_2 reverses the sign of De, an addition means of said ALU_2 adds the exponent field contents of the A register to the exponent field contents of the Z register to regenerate Ze, and outputs Ze to the A register and De to the Z register, Am to the Z register and Zm to the A register, otherwise to control said ALU_2 to pass unchanged the contents of the A and Z registers.
6. A systolic ring floating point accumulator according to claim 3 wherein each of said de-normalisation cells increments the contents of the exponent field, and effects a conditional de-normalisation of the contents of the Zm field of the Z register if the contents of the unincremented exponent field of the Z register is negative, resulting in a one character de-normalisation of Zm per systolic cell per circulation until the A and Z register mantissa field contents are aligned, or until the Z register mantissa field contents have been zeroed.
7. A systolic ring floating point accumulator according to claim 3 wherein each of said de-normalisation cells has a first and second mode of operation controlled by the current value of the mode characters in said mode register, wherein, in said de-normalisation cell first state which corresponds to the exponent fields of the Z and A registers in said cell, said de-normalization cell further comprises de-normalization addition means to increment the difference value De, de-normalization output means to output the incremented difference value, de-normalization storage means to store the sign of the difference value De, multiplexer control means, and in said de-normalisation cell second state which corresponds to the mantissa fields of the Z and A registers in said cell, said de-normalization cell further comprises a multiplexer means to bypass one delay cell in the Z register and so to effect a one-character de-normalisation of the Zm operand if the sign of the difference value De stored in said de-normalisation cell storage means is negative.
8. A systolic ring floating point accumulator according to claim 1 wherein said real numbers in floating point format are represented as a triplet having the form {Zf, Ze, Zm} wherein Zf is a character sequence representing descriptors of the real number and an initialization flag character, whereby, when said initialization flag character is present said accumulator floating point number representation {Af, Ae, Am} is output and zeroed prior to the input of the next sequentially input real number Z to the accumulator.
9. A systolic ring serial floating point accumulator according to claim 1 wherein the accumulator is implemented in Gallium Arsenide.
10. A systolic ring serial floating point accumulator according to claim 1 wherein varying the number base of said data characters varies the execution time of said accumulator.
11. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said delay cells varies the precision and dynamic range of said accumulator.
12. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said systolic cells varies the precision and dynamic range of said accumulator.
13. A systolic ring serial floating point accumulator according to claim 2 wherein varying the number of said systolic cells and varying the number of delay cells varies the precision and dynamic range of said accumulator.
14. A systolic ring serial floating point accumulator according to claim 3 which is extensible having at least one additional multiplexer means located within said ring, each of said additional multiplexer means having connections to an adjacent ring which in a first mode both isolates each ring from the other, accepts input for and directs output from the adjacent ring and also completes the first ring, and in a second mode directs data sequences from the first ring to the input of the adjacent ring, bypasses the arithmetic logic unit ALU_2 in the controlled ring, and directs the output from the adjacent ring back into the first ring.
15. A method of accumulating successive data character sequences representing mantissas and exponents of Z operands in floating point format to output one identically formatted result data character sequence A which is the floating point accumulation of the Z operands, wherein at least one systolic cell is arranged in a ring configuration.
16. A method of adding successive pairs of data character sequences representing mantissas and exponents of Z1 and Z2 floating point operands in floating point format to output one identically formatted result data character sequence A which is the floating point sum of the input operands, wherein said data character sequences are operated upon by at least one systolic cell.
17. A serial floating point adder for accepting sequential pairs of real numbers Z1 and Z2 in floating point format and a mode control signal wherein said real numbers are represented as 2-tuples having the form {Ze, Zm}, Ze is a character sequence representing the exponent of the real number and Zm is a character sequence representing the mantissa of the real number, and said adder outputs a character sequence A which is the floating point representation {Af, Ae, Am} of the addition of said real numbers, wherein said adder comprises, a finite state machine adapted to receive said real numbers and having an output, a denormalization array adapted to receive the output of said finite state machine and to output a denormalized floating point number, a second finite state machine adapted to receive the output from the de-normalisation array and to output the floating point sum of sequential pairs of accepted real numbers.
18. A serial floating point adder according to claim 17 wherein a mode character sequence is entered in parallel with the 2-tuples to identify to the adder the fields Ze and Zm of said floating point representations, and said de-normalisation array further comprises at least one systolic de-normalisation cell and zero or more delay cells where cells of each type may be arranged in any order and the length of the total delay is at least the length of the exponent in the real number representation.
19. A serial floating point adder according to claim 18 wherein there are m de-normalisation cells where m is the number of characters in the mantissa of the real number representations, and wherein said second finite state machine further comprises, an adder means at the output of said de-normalisation array to add the sum of the character sequence representing the mantissae of the pair of accepted real numbers Z1 and Z2.
20. A systolic ring serial floating point adder according to claim 17 wherein the adder is implemented in Gallium Arsenide.
21. A systolic ring serial floating point adder according to claim 17 wherein varying the number base of said data characters varies the execution time of said adder.
22. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said delay cells varies the precision and dynamic range of said adder.
23. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said systolic cells varies the precision and dynamic range of said adder.
24. A systolic ring serial floating point adder according to claim 18 wherein varying the number of said systolic cells and varying the number of delay cells varies the precision and dynamic range of said adder.
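The arithmetic that the claims above rely on (align the operand with the smaller exponent, add the mantissas, then renormalise) can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the claimed systolic hardware: the names `fp_add`, `fp_accumulate`, `normalise`, `shift_right`, `from_int` and `to_value`, the tuple layout `(exponent, mantissa)`, and truncation toward zero are all hypothetical choices made for the example. The `digits` parameter loosely stands in for the mantissa length m of claim 19, and `base` for the number base of claim 21.

```python
from fractions import Fraction
from functools import reduce
from typing import Iterable, Tuple

# Hypothetical representation: (exponent, mantissa), value = mantissa * base**exponent.
FP = Tuple[int, int]


def shift_right(m: int, k: int, base: int) -> int:
    """Drop the k least significant base-`base` digits of m, truncating toward zero."""
    sign = -1 if m < 0 else 1
    return sign * (abs(m) // base ** k)


def normalise(e: int, m: int, digits: int, base: int) -> FP:
    """Rescale so that base**(digits-1) <= |mantissa| < base**digits (zero stays zero)."""
    if m == 0:
        return (0, 0)
    while abs(m) >= base ** digits:      # too wide: drop a digit, raise the exponent
        m, e = shift_right(m, 1, base), e + 1
    while abs(m) < base ** (digits - 1):  # too narrow: append a zero digit
        m, e = m * base, e - 1
    return (e, m)


def fp_add(z1: FP, z2: FP, digits: int = 8, base: int = 2) -> FP:
    """Add two operands: de-normalise to the larger exponent, add, renormalise."""
    (e1, m1), (e2, m2) = z1, z2
    if m1 == 0:
        return normalise(e2, m2, digits, base)
    if m2 == 0:
        return normalise(e1, m1, digits, base)
    e = max(e1, e2)
    # De-normalisation step: the operand with the smaller exponent loses low digits.
    a = shift_right(m1, e - e1, base)
    b = shift_right(m2, e - e2, base)
    return normalise(e, a + b, digits, base)


def fp_accumulate(operands: Iterable[FP], digits: int = 8, base: int = 2) -> FP:
    """Fold fp_add over a sequence of operands, starting from zero."""
    return reduce(lambda acc, z: fp_add(acc, z, digits, base), operands, (0, 0))


def from_int(n: int, digits: int = 8, base: int = 2) -> FP:
    """Build a normalised operand from an integer (helper for the example only)."""
    return normalise(0, n, digits, base)


def to_value(z: FP, base: int = 2) -> Fraction:
    """Exact rational value of an operand, used to inspect results."""
    return Fraction(base) ** z[0] * z[1]


if __name__ == "__main__":
    print(to_value(fp_add(from_int(3, 4), from_int(2, 4), digits=4)))   # 5
    print(to_value(fp_accumulate([from_int(1), from_int(2), from_int(3)])))  # 6
```

In this model, widening the mantissa changes the result when the exponents differ: adding 8 and 0.5 with `digits=4` returns 8 because the smaller operand is shifted out entirely during alignment, whereas `digits=8` returns 8.5. That is the software analogue of the precision trade-off described in claims 22 to 24, under the assumptions listed above.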
PCT/AU1991/000284 1990-06-29 1991-07-01 A generalised systolic array serial floating point adder and accumulator WO1992000560A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPK092090 1990-06-29
AUPK0920 1990-06-29

Publications (1)

Publication Number Publication Date
WO1992000560A1 true WO1992000560A1 (en) 1992-01-09

Family

ID=3774792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU1991/000284 WO1992000560A1 (en) 1990-06-29 1991-07-01 A generalised systolic array serial floating point adder and accumulator

Country Status (1)

Country Link
WO (1) WO1992000560A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4405992A (en) * 1981-04-23 1983-09-20 Data General Corporation Arithmetic unit for use in data processing systems
EP0079471A1 (en) * 1981-11-05 1983-05-25 Ulrich Dr. Kulisch Arrangement and method for forming scalar products and sums of floating point numbers with maximum precision
EP0239737A2 (en) * 1986-02-24 1987-10-07 International Business Machines Corporation Systolic super summation device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5354807A (en) * 1992-01-24 1994-10-11 H. B. Fuller Licensing & Financing, Inc. Anionic water dispersed polyurethane polymer for improved coatings and adhesives
US5334651A (en) * 1992-03-25 1994-08-02 Hoechst Aktiengesellschaft Water-thinnable two-component coating preparation, a process for its preparation, and its use
US7681344B2 (en) 2005-07-29 2010-03-23 Cart-Tv, Llc Shopping cart device
US7895777B2 (en) 2005-07-29 2011-03-01 Cart-Tv, Llc Shopping cart device
US8336774B2 (en) 2011-04-04 2012-12-25 Shopper's Club, Llc Shopping apparatus and methods
US8727214B2 (en) 2011-04-04 2014-05-20 Shopper's Club, Llc Shopping apparatus and methods
US9053510B2 (en) 2011-04-04 2015-06-09 David L. McEwan Shopping apparatus and methods

Similar Documents

Publication Publication Date Title
US5764555A (en) Method and system of rounding for division or square root: eliminating remainder calculation
US5513132A (en) Zero latency overhead self-timed iterative logic structure and method
US4736335A (en) Multiplier-accumulator circuit using latched sums and carries
US7080111B2 (en) Floating point multiply accumulator
US6584482B1 (en) Multiplier array processing system with enhanced utilization at lower precision
US4489393A (en) Monolithic discrete-time digital convolution circuit
US5493520A (en) Two state leading zero/one anticipator (LZA)
US5016210A (en) Binary division of signed operands
US4320464A (en) Binary divider with carry-save adders
US20020194239A1 (en) Floating point overflow and sign detection
Ienne et al. Bit-serial multipliers and squarers
US6988119B2 (en) Fast single precision floating point accumulator using base 32 system
CN108897523B (en) Divider and operation method thereof and electronic equipment
US7373369B2 (en) Advanced execution of extended floating-point add operations in a narrow dataflow
George et al. Hardware design procedure: principles and practices
US5164914A (en) Fast overflow and underflow limiting circuit for signed adder
Zhou A new bit-serial systolic multiplier over GF(2^m)
WO1992000560A1 (en) A generalised systolic array serial floating point adder and accumulator
US5841683A (en) Least significant bit and guard bit extractor
US5113363A (en) Method and apparatus for computing arithmetic expressions using on-line operands and bit-serial processing
CN114201140B (en) Exponential function processing unit, method and neural network chip
EP0539010A2 (en) Method and device for generating sum information/rounding control signal
US4752904A (en) Efficient structure for computing mixed-radix projections from residue number systems
CN116127255A (en) Convolution operation circuit and related circuit or device with same
Kornerup Correcting the normalization shift of redundant binary representations

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AU BB BG BR CA FI HU JP KP KR LK MC MG MN MW NO PL RO SD SU US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BF BJ CF CG CH CI CM DE DK ES FR GA GB GN GR IT LU ML MR NL SE SN TD TG

NENP Non-entry into the national phase

Ref country code: CA