GB2287331A

GB2287331A - Electronic multiplying and adding apparatus.

Info

Publication number: GB2287331A
Application number: GB9403955A
Authority: GB
Inventors: David James Seal
Original assignee: Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 1994-03-02
Filing date: 1994-03-02
Publication date: 1995-09-13
Anticipated expiration: 2014-03-02
Also published as: GB9727525D0; GB9403955D0; JPH07271556A; JP3516503B2; GB2287331B; GB2317978A; GB2317978B; US5528529A

Abstract

A multiply-accumulate circuit in which upon each multiply iteration some of the accumulate value bits are incorporated into the result with lower order subsequently non-changing bits being latched. In this way, a wide accumulate value can be dealt with without incurring a correspondingly wide data path through the multiply accumulate circuit. Initialisation of the multiply accumulate circuit with one of the carry value and save value as the first partial summand and the other as at least part of the accumulate value is performed to reduce the total number of iterative cycles required to produce the result.

Description

ELECTRONIC MULTIPLYING AND ADDING APPARATUS AND METHOD

This invention relates to the field of electronic multiplying and adding within data processing systems.

It is known to provide integrated circuit central processing units with dedicated hardware for performing certain arithmetic operations. Such dedicated hardware is designed to provide higher speed evaluation of certain arithmetic operations than would be available if those arithmetic operations were performed under software control by the general purpose central processing unit core logic.

It is known to provide dedicated hardware units that multiply an M-bit number by another M-bit number to produce a 2M-bit result. In the case of a 32-bit based central processing unit, two 32-bit multiplicands produce a 64-bit result. A more refined arithmetic operation one might wish to perform is multiplying two numbers together and then adding a further number.

In the case of a 32-bit machine, multiplying two 32-bit numbers together and then adding a further 32-bit number can be relatively straightforwardly achieved. However, the result of the multiplication is a 64-bit number and it is desirable that it should be possible to add a 64-bit number to the 64-bit result of the multiplication operation. A problem that arises with this is that a 32-bit machine will have a 32-bit data path structure and expanding this to a 64-bit width to cope with the 64-bit addition will introduce undesirable complication and will require a disadvantageously large amount of circuit area.

The present invention is concerned with providing for the multiplication of an M-bit multiplicand with an N-bit multiplier and adding an (M+N)-bit accumulate value without incurring a disadvantageous increase in data path width.

Viewed from one aspect this invention provides an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said apparatus comprising:

means for generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; means for carry-save adding at least one of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result having active carry bits that are still changing and active save bits that are still changing, a most significant bit of said active save bits having a bit position Z and a least significant bit of said active save bits having a bit position Y; and means for adding X bits of said (M+N)-bit accumulate value to said input carry-save partial result on a carry-save cycle at bit positions above bit position Z, wherein Z and Y both increase by X between carry-save cycles.

This apparatus allows the adding of the (M+N)-bit accumulate value to proceed at the same time as the multiplication of the M-bit multiplicand by the N-bit multiplier. In this way the provision of a full (M+N)-bit data path can be avoided. Active carry bits and active save bits are still changing in the sense that new values will be calculated for them rather than allowing them to proceed to final addition unchanged.

In preferred embodiments of the invention in which X = 2, said means for adding comprises:

means for extending said output carry-save partial result on a carry-save cycle with extension save bits at bit positions Z+1 and Z+2 and an extension carry bit at bit position Z+2, said extension save bits being given by the sum of: the (Z+1)th bit of said (M+N)-bit accumulate value; the complement of the Zth bit of said active carry bits; and the complement of the (Z+1)th bit of said partial summand, and said extension carry bit being given by the complement of the (Z+2)th bit of said (M+N)-bit accumulate value.

Thus, a (Z+1)th bit and a (Z+2)th bit of the (M+N)-bit accumulate value are incorporated into the calculation on each iteration of the carry-save addition and so the need for a final (M+N)-bit wide data path to perform an addition after the multiplication operation is avoided. The apparatus of the invention has a bulge in its data path width to accommodate the extra processing to handle the (M+N)-bit accumulate value, but this bulge is relatively small compared to an increase to the full (M+N)-bit width.

It will be appreciated that the means for generating a sequence of partial summands could take many forms depending upon the degree of sophistication desired. However, the number of partial summands generated is reduced when said means for generating comprises a Booth encoder, said sequence of partial summands being a sequence of Booth summands.

Whilst the invention may be applied to any value of M and N, it is most usual to want that M should equal N to provide a general purpose arithmetic processing apparatus.

In the above context, the avoidance of excessively wide data paths becomes particularly advantageous in circumstances when M = N = 32.

In the case where M = N, the data path can be constrained to have an advantageous width of only M+3 even though a 2M-bit accumulate value is involved.

The flexibility of the system is improved in embodiments in which the (Z+1)th bit of the partial summand is either a zero or a sign extension depending upon a sign selecting input, said M-bit multiplicand and said N-bit multiplier being treated as unsigned or signed numbers respectively in dependence upon said sign selecting input.

Whilst the invention may be usefully applied in various application specific situations, the invention is particularly suited for use within an integrated circuit central processing unit having a hardware multiplier.

Viewed from another aspect, this invention provides a method of multiplying and adding within an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said method comprising the steps of:

generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; carry-save adding respective ones of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result having active carry bits that are still changing and active save bits that are still changing, a most significant bit of said active save bits having a bit position Z and a least significant bit of said active save bits having a bit position Y; and adding X bits of said (M+N)-bit accumulate value to said input carry-save partial result on a carry-save cycle at bit positions above bit position Z, wherein Z and Y both increase by X between carry-save cycles.

It is desirable that an electronic multiplying and adding apparatus should operate at a high speed. Increases in the speed of operation will reduce processing bottlenecks and improve the overall performance of the system of which the multiplying and adding apparatus forms part.

Viewed from a further aspect this invention provides an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (N+M)-bit accumulate value, said apparatus comprising:

means for generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; means for carry-save adding at least one of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result and said output carry-save partial result each comprising a carry value and a save value; and means for initialising said carry value and said save value to a respective one of:

at least part of said (N+M)-bit accumulate value; and a first partial summand.

One of the stages of operation of an electronic multiplying and adding apparatus is the initialisation of that apparatus prior to its main calculating cycle. The invention recognises and exploits that by initialising the carry value and save value appropriately, less calculation needs to be performed during the main calculation operation and so an increase in performance can be achieved. More particularly, by dealing with the first partial summand during initialisation, the number of multiplier iterations may be reduced by one. At first sight, such a reduction by one iteration out of a total number of iterations that may be typically between ten and twenty does not seem too significant. However, if one considers a multiplication operation that requires seventeen iterations, then a reduction of one iteration to sixteen iterations enables the process to be split evenly between four machine cycles with four iterations per machine cycle. The alternatives would be to perform four iterations per machine cycle then an additional iteration in one cycle increasing the total number of cycles required to five, or attempt to increase the number of possible iterations per cycle to five, which would make timing constraints tighter, possibly decrease the maximum clock frequency and make the integrated circuit larger.
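The cycle-count argument above can be checked with a line of arithmetic. The following is a minimal sketch, assuming a fixed number of multiplier iterations fits into each machine cycle; the function name `machine_cycles` is hypothetical and not taken from the specification.

```python
import math


def machine_cycles(iterations: int, iterations_per_cycle: int) -> int:
    """Machine cycles needed when each cycle performs a fixed number of iterations."""
    return math.ceil(iterations / iterations_per_cycle)


# Seventeen iterations at four per cycle spill into a fifth cycle,
# whereas sixteen iterations divide evenly into four cycles.
print(machine_cycles(17, 4))
print(machine_cycles(16, 4))
```

With these assumptions the single saved iteration removes a whole machine cycle, which is the point made in the text.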

In preferred embodiments of the invention said means for generating comprises a modified Booth encoder, said sequence of partial summands being a sequence of modified Booth summands.

In order to deal effectively with both signed and unsigned multiplication, it is preferred that when said N-bit multiplier is signed, respective bits R[a] of an internal multiplier representing said N-bit multiplier are given by:

R[(N-1) to 0] are equal to corresponding bits of said N-bit multiplier, and R[N] is equal to R[(N-1)]

and when said N-bit multiplier is unsigned, respective bits R[a] of an internal multiplier representing said N-bit multiplier are given by:

R[(N-1) to 0] are equal to corresponding bits of said N-bit multiplier, and R[N] is equal to zero.
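The two definitions of the internal multiplier can be illustrated in software. This is a sketch only, assuming the multiplier is supplied as a Python integer holding an N-bit pattern; the helper names `internal_multiplier` and `value_of` are hypothetical.

```python
def internal_multiplier(multiplier: int, n: int, signed: bool) -> list[int]:
    """Return the bits R[0], R[1], ..., R[N] of the internal multiplier."""
    bits = [(multiplier >> a) & 1 for a in range(n)]  # R[(N-1) to 0]
    bits.append(bits[n - 1] if signed else 0)         # R[N]
    return bits


def value_of(bits: list[int]) -> int:
    """Interpret R[N:0] as an (N+1)-bit two's complement number."""
    top = len(bits) - 1
    low = sum(b << a for a, b in enumerate(bits[:top]))
    return low - (bits[top] << top)


# The 8-bit pattern 0xF0 is -16 when signed and 240 when unsigned.
print(value_of(internal_multiplier(0xF0, 8, signed=True)))
print(value_of(internal_multiplier(0xF0, 8, signed=False)))
```

In both cases the (N+1)-bit internal multiplier, read as a signed number, has the intended value, which is why a single signed encoder can serve both variants.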

In the case of a Booth recoded system where the length of the number to be multiplied by is odd (e.g. N+1 in terms of the above), then particularly simple forms of the first partial summand are available. This first partial summand can accordingly be determined quickly and so avoid unduly delaying initialisation. More particularly, if R[0] is equal to zero, then said first partial summand is initialised to zero and if R[0] is equal to one, then said first partial summand is initialised to a bitwise inverse of said M-bit multiplicand.

More particularly, if R[0] is equal to zero, then said first partial summand should ideally be set to zero, and if R[0] is equal to one, then said first partial summand should ideally be set to minus the M-bit multiplicand.

A preferred initialisation is if R[0] is equal to zero, then said first partial summand is initialised to zero and if R[0] is equal to one, then said first partial summand is initialised to a bitwise inverse of said M-bit multiplicand. However, because the bitwise inverse of the M-bit multiplicand is one less than minus the M-bit multiplicand, this introduces a requirement for later adding an extra 1 if R[0] is one. Preferred embodiments comprise a final adder for adding said carry value and said save value, a carry-in bit for said final adder being set to R[0].

The provision of a final adder to sum the carry value and the save value following the multiplication iterations affords the opportunity to add in the "1" required because a 1's complement was taken rather than a 2's complement without incurring additional overhead. The carry-in bit of the final adder would otherwise be unused.
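The compensation performed by the final adder rests on the identity that the bitwise inverse of a number plus one equals its negation. The following sketch checks this on a fixed-width model; the 16-bit width and the name `final_add` are assumptions made purely for illustration.

```python
WIDTH = 16                     # illustrative width, not taken from the text
MASK = (1 << WIDTH) - 1


def final_add(carry: int, save: int, carry_in: int) -> int:
    """Model of a final adder: two operands plus a carry-in bit, modulo 2**WIDTH."""
    return (carry + save + carry_in) & MASK


d = 0x1234
inverse = ~d & MASK            # 1's complement of the multiplicand
# Feeding a carry-in of 1 turns the 1's complement into the 2's complement.
print(hex(final_add(inverse, 0, 1)), hex(-d & MASK))
```

Because the adder has a carry-in anyway, the extra 1 costs no additional addition in this model.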

Viewed from a still further aspect this invention provides a method of multiplying and adding within an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said method comprising the steps of:

generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; carry-save adding respective ones of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result and said output carry-save partial result each comprising a carry value and a save value; and initialising said carry value and said save value to a respective one of:

at least part of said (N+M)-bit accumulate value; and a first partial summand.

An embodiment will now be described, by way of example only, with reference to the accompanying drawings in which:

Figure 1 schematically illustrates a multiplier for determining a multiplication result by producing a sequence of partial summands; Figure 2 illustrates a modified version of Figure 1 in which feedback is employed to re-use a single multiplier row to determine all of the partial summands; Figure 3 illustrates a further modification whereby feedback is used in conjunction with a plurality of multiplier rows; and Figures 4A and 4B illustrate different portions of a circuit for performing a multiply-accumulate operation.

The described circuits implement a multiply-accumulator that can handle:

(1) Multiplications of N-bit and M-bit numbers to produce an (N+M)-bit product; (2) Multiply-accumulate operations which multiply together an N-bit number and an M-bit number and add an (N+M)-bit accumulate value to produce an (N+M)-bit result. The circuits can supply both of these in signed and unsigned variants. The described circuits use the exemplary specific case of N=M=32 (i.e. 32x32 -> 64 multiplications, and 32x32+64 -> 64 multiply-accumulates).

The term "multiplier" is used below both for the multiplier circuit and for one of the two operands to the multiplication. It will be clear which is intended, either from the use of "circuit" or "operand" or from the context.

Multiplication of an M-bit multiplicand D by an N-bit multiplier R is usually performed in hardware as two major steps:

(1) Form a collection of multiples X0D, X1D, ..., XkD of the multiplicand D such that (a) each multiple can be generated easily; (b) X0+X1+...+Xk = R, which ensures that the sum of X0D, X1D, ... and XkD is equal to RD, the desired product.

(2) Add the multiplicand multiples generated by step (1) together.

Stage (1): Forming the multiplicand multiples

Stage (1) can be performed in various ways. The simplest is to make k=N-1 (so that there are N multiplicand multiples in total), then let Xi=0 if bit i of R (which will be denoted by R[i] or Ri) is 0 and Xi=2^i if R[i] is 1. Because every Xi is zero or a power of two, the multiplicand multiples XiD can be formed easily, by using either zero or the result of shifting D left by i bits.
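The simplest form of stage (1) can be sketched as follows. This is an illustrative model, assuming unsigned operands held as Python integers; `simple_multiples` is a hypothetical name.

```python
def simple_multiples(d: int, r: int, n: int) -> list[int]:
    """Return X_0*D, ..., X_(N-1)*D, where X_i is 2**i if R[i] is 1 and 0 otherwise."""
    return [(d << i) if (r >> i) & 1 else 0 for i in range(n)]


# 13 x 11 with a 4-bit multiplier: multiples for bits 0, 1 and 3 are non-zero.
multiples = simple_multiples(13, 11, 4)
print(multiples, sum(multiples))
```

Summing the list reproduces the product, which is the condition X0+X1+...+Xk = R applied to the multiples.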

Since the result of the multiplication is longer than either operand, the XiD multiplicand multiples should be created with enough bits to determine all the bits of the product: this requires extending each multiplicand with bits on its left. This is the point at which one may cater for the difference between a signed and an unsigned multiplicand: a signed multiplicand is extended with copies of its sign bit, while an unsigned multiplicand is extended with zeros. This extension of the multiplicand is often physically implemented by just extending it with a single zero or copy of the sign bit, with the understanding that this single extra bit represents the (common) value of all the remaining bits.

Another technique which is sometimes used - e.g. in the multiply instructions of the ARM6 integrated circuit of Advanced RISC Machines Limited - is only to generate the bottom N bits of the product of two N-bit numbers. In this case, there is no need to extend the multiplicand in this way: the multiplicand can be treated identically for both signed and unsigned variants of the instruction. Somewhat less obviously, so can the multiplier: the net result is that there is no difference at all between a signed and an unsigned version of the instruction, and so only one instruction needs to be provided.

Dealing with the difference between an unsigned and a signed multiplier R under this scheme is somewhat trickier. First, note that the Xi sum to the unsigned value of R: this scheme is "naturally" an unsigned multiplication algorithm. So what is needed is a way of dealing with a signed multiplier. There are a number of ways of dealing with this, most of which come down to either (a) adjusting the final result by subtracting an extra 2^N D if R is negative; or (b) making Xk be -2^k rather than 2^k if R[k] = 1, i.e. if R is negative.
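Both corrections for a signed multiplier can be checked numerically. The sketch below assumes the multiplier is held as an N-bit two's complement pattern, in which case the unsigned reading exceeds the signed value by 2^N whenever the sign bit is set; the function names are hypothetical.

```python
def product_with_final_correction(d: int, r_bits: int, n: int) -> int:
    """Option (a): multiply by the unsigned pattern, then subtract 2**N * D if R is negative."""
    product = d * r_bits
    if (r_bits >> (n - 1)) & 1:
        product -= d << n
    return product


def signed_weights(r_bits: int, n: int) -> list[int]:
    """Option (b): X_i = 2**i for set bits, except that the top bit weighs -2**(N-1)."""
    weights = []
    for i in range(n):
        if (r_bits >> i) & 1:
            weights.append(-(1 << i) if i == n - 1 else (1 << i))
        else:
            weights.append(0)
    return weights


# The 8-bit pattern 0xFD represents -3 when signed.
print(product_with_final_correction(7, 0xFD, 8))
print(sum(signed_weights(0xFD, 8)))
```

Either route yields the signed result; they differ only in whether the adjustment is made once at the end or folded into the last multiplicand multiple.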

A more sophisticated technique is modified Booth encoding. This comes in two slightly different forms, depending on whether N is even or odd. If N is even, we let k = N/2 - 1 (so we will form N/2 multiplicand multiples), and then define:

\[ X_0 = -2R[1] + R[0] \]

and:

\[ X_i = \left(-2R[2i+1] + R[2i] + R[2i-1]\right) \cdot 2^{2i} \quad \text{for } i = 1, 2, \ldots, k. \]

Another way of looking at this is that we extend the multiplier with a single bit R[-1] after the binary point, and set R[-1] to be zero (which leaves the value of the multiplier unchanged). The second formula above can then be applied to the case i = 0 as well, and simplifies to the first formula above in that case, i.e. the apparent special case for X0 can be got rid of by this definition of R[-1] as 0.

The sum of the Xi is then equal to the signed value of R:

\[
\begin{aligned}
X_k + X_{k-1} + \cdots + X_2 + X_1 + X_0
={}& -2^{2k+1}R[2k+1] + 2^{2k}R[2k] + 2^{2k}R[2k-1] \\
   & -2^{2k-1}R[2k-1] + 2^{2k-2}R[2k-2] + 2^{2k-2}R[2k-3] \\
   & \qquad \vdots \\
   & -32R[5] + 16R[4] + 16R[3] \\
   & -8R[3] + 4R[2] + 4R[1] \\
   & -2R[1] + R[0] \\
={}& -2^{2k+1}R[2k+1] + 2^{2k}R[2k] + 2^{2k-1}R[2k-1] + 2^{2k-2}R[2k-2] + \cdots \\
   & + 32R[5] + 16R[4] + 8R[3] + 4R[2] + 2R[1] + R[0] \\
={}& -2^{N-1}R[N-1] + 2^{N-2}R[N-2] + 2^{N-3}R[N-3] + 2^{N-4}R[N-4] + \cdots \\
   & + 32R[5] + 16R[4] + 8R[3] + 4R[2] + 2R[1] + R[0] \\
={}& \text{signed value of } R.
\end{aligned}
\]

Furthermore, each Xi is a power of two times a number from the set {-2, -1, 0, 1, 2}, so must have a value of zero, a power of two or minus a power of two. This makes the multiplicand multiples XiD easy to form. They are not quite as easy as for the earlier method, since we have to cope with negative powers of two as well as positive ones, but we get the substantial advantage of only having N/2 multiplicand multiples to add together in the second stage, rather than N of them.
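The even-N form of modified Booth encoding can be checked exhaustively for a small N. The following is a minimal sketch, assuming the multiplier is supplied as an N-bit pattern in a Python integer and that R[-1] is taken as 0; `booth_even` is a hypothetical name.

```python
def booth_even(r: int, n: int) -> list[int]:
    """Modified Booth values X_0..X_k for an N-bit multiplier pattern r, N even."""
    assert n % 2 == 0

    def bit(j: int) -> int:
        return (r >> j) & 1 if j >= 0 else 0      # R[-1] is defined as 0

    # X_i = (-2R[2i+1] + R[2i] + R[2i-1]) * 2**(2i), for i = 0..N/2-1
    return [(-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)) << (2 * i)
            for i in range(n // 2)]


def signed_value(r: int, n: int) -> int:
    """Two's complement value of the N-bit pattern r."""
    return r - (1 << n) if (r >> (n - 1)) & 1 else r


# Every 8-bit pattern is encoded into four values that sum to its signed value.
print(all(sum(booth_even(r, 8)) == signed_value(r, 8) for r in range(256)))
print(booth_even(0b1010, 4))
```

The check covers only N = 8, but the telescoping argument in the derivation above applies to any even N.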

Forming a negative multiplicand multiple may be done by shifting the multiplicand to form the corresponding positive multiple, then negating by the "take the 1's complement and add 1" method - except that rather than performing the addition of 1 at this stage, we relegate it to the second stage. We therefore end up with N/2 multiplicand multiples and N/2 single bits (which are zero if the corresponding multiplicand multiple is positive and 1 if it is negative) to add in the second stage; this is still a substantial improvement over N full multiplicand multiples.
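The deferred "+1" can be modelled by returning each multiple as a bit pattern together with a single extra bit. This is a sketch under the assumption of a fixed-width two's complement datapath; the name `multiple_and_bit` and the 16-bit width used in the example are illustrative only.

```python
def multiple_and_bit(d: int, x: int, width: int) -> tuple[int, int]:
    """Encode x*d (x is 0 or plus/minus a power of two) as a pattern plus a deferred +1 bit."""
    mask = (1 << width) - 1
    positive = (abs(x) * d) & mask        # shifted multiplicand
    if x < 0:
        return ~positive & mask, 1        # 1's complement now, the +1 is added later
    return positive, 0


pattern, extra = multiple_and_bit(5, -8, 16)
print(hex(pattern), extra)
```

Adding `extra` to `pattern` at any later point gives the true two's complement multiple, which is why the addition of 1 can be postponed to the second stage.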

If N is odd, we do something very similar, except that k = (N-1)/2 and the formulae for the Xi are:

\[ X_0 = -R[0] \]

and:

\[ X_i = \left(-2R[2i] + R[2i-1] + R[2i-2]\right) \cdot 2^{2i-1} \quad \text{for } i = 1, 2, \ldots, k. \]

(Again, the formula for X0 is not really a special case: we just have to define R[-1] = R[-2] = 0 to make the second formula produce the right value).

As well as the fact that modified Booth encoding produces half as many multiplicand multiples to add together, it has another advantage: it naturally treats the multiplier as a signed number rather than an unsigned one. With the earlier technique, we had to deal with a signed multiplier as a special case, because no matter how long we make an unsigned number, it cannot hold a negative signed value. The converse is easier: a signed number of length N+1 bits or more can hold an unsigned value of length N bits (or indeed a signed value of length N bits). So if we want a multiplier circuit that can handle both signed and unsigned 32-bit multipliers, for instance, a 33-bit or longer modified Booth encoder will do the job: all we have to do is extend the multiplier with one or more additional bits at its left end, making these bits zero if the multiplier is to be treated as unsigned and copies of the existing sign bit if the multiplier is to be treated as signed.
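The odd-length encoding and the 33-bit extension can be combined in a short model. This is a sketch, assuming a 32-bit multiplier pattern extended by one bit; `booth_odd` and `extend_to_33` are hypothetical names, and R[-1] and R[-2] are taken as 0.

```python
def booth_odd(r: int, n: int) -> list[int]:
    """Modified Booth values X_0..X_k for an N-bit multiplier pattern r, N odd."""
    assert n % 2 == 1

    def bit(j: int) -> int:
        return (r >> j) & 1 if j >= 0 else 0      # R[-1] = R[-2] = 0

    values = [-bit(0)]                            # X_0 = -R[0]
    for i in range(1, (n - 1) // 2 + 1):
        digit = -2 * bit(2 * i) + bit(2 * i - 1) + bit(2 * i - 2)
        values.append(digit << (2 * i - 1))       # X_i = digit * 2**(2i-1)
    return values


def extend_to_33(multiplier32: int, signed: bool) -> int:
    """Extend a 32-bit pattern with a zero (unsigned) or a copy of its sign bit (signed)."""
    multiplier32 &= 0xFFFFFFFF
    top = (multiplier32 >> 31) & 1 if signed else 0
    return multiplier32 | (top << 32)


# The same pattern encodes to -1 when signed and 2**32 - 1 when unsigned.
print(sum(booth_odd(extend_to_33(0xFFFFFFFF, True), 33)))
print(sum(booth_odd(extend_to_33(0xFFFFFFFF, False), 33)))
```

A 33-bit encoder gives k = 16, so seventeen values are produced in either variant.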

Dealing with the difference between a signed and an unsigned multiplicand is still done by the same technique as before.

Other, yet more sophisticated ways to form the multiplicand multiples also exist, which reduce their number yet further at the expense of more complexity.

Stage (2): Adding the multiplicand multiples

After stage (1), we have some fairly large number of multiplicand multiples to add together - e.g. 17 in the case that we are using a 33-bit or 34-bit modified Booth encoder for a circuit that can do both signed and unsigned 32x32 multiplication.

The simplest technique is just to add two of them together, add a third to the sum of the first two, add the fourth to the resulting sum, etc., until we have got the final sum. (Incidentally, each addition can also deal with one of the extra bits generated by the modified Booth encoding, by using it as the carry input to the adder. So we don't need extra additions to cope with these bits). This is similar to the technique used by the ARM6 integrated circuit.

One difference is that we don't generate all the multiplicand multiples at once: instead, we generate them as needed. The other main difference deals with an irregularity in this technique: we have to generate two multiplicand multiples before we can do the first addition, but only one before each of the other additions. This can be (and is) exploited in order to provide a multiply-accumulate function without this irregularity by initialising the "sum so far" to be the accumulate value, then repeatedly generating one multiplicand multiple and adding it into the sum so far. (To get a simple multiplication, we do the same, except that the sum so far is initialised to zero rather than to an accumulate value).

This also completes what was suggested in the previous paragraph: there was in fact one addition too few to deal with all the extra bits from the modified Booth encoder, and now there are the right number.

The main problem with this technique is that each addition takes a substantial amount of time, because of the long carry chain it contains. A good solution to this is "carry-save" addition, which arises from the observation that although adding two numbers and a carry bit to get one number necessarily involves a long carry chain, adding three numbers and a carry bit to get two numbers needn't. Specifically, if we have three numbers X[N:0], Y[N:0] and Z[N:0] and a carry bit W, we can reduce them to two numbers S[N:0] and C[N+1:0] which add to the same value by simply doing a separate addition of 3 single bits in each bit column:

\[
\begin{array}{rrrrcrrrr}
         & X[N] & X[N-1] & X[N-2] & \cdots & X[3] & X[2] & X[1] & X[0] \\
         & Y[N] & Y[N-1] & Y[N-2] & \cdots & Y[3] & Y[2] & Y[1] & Y[0] \\
         & Z[N] & Z[N-1] & Z[N-2] & \cdots & Z[3] & Z[2] & Z[1] & Z[0] \\
         &      &        &        &        &      &      &      & W \\
\hline
\multicolumn{9}{c}{\text{have same sum as}} \\
\hline
         & S[N] & S[N-1] & S[N-2] & \cdots & S[3] & S[2] & S[1] & S[0] \\
C[N+1]   & C[N] & C[N-1] & C[N-2] & \cdots & C[3] & C[2] & C[1] & C[0]
\end{array}
\]

where:

C[0] = W and for i = 0, 1, ..., N: (C[i+1], S[i]) is the two-bit sum of X[i], Y[i] and Z[i].

Because the calculations are done separately for each column, with no carry chain, this is considerably faster than ordinary addition. (For instance, the multiplication on ARM6 used ordinary addition and managed one of them per clock cycle. An integrated circuit using carry-save addition may manage four or more additions per clock cycle.) If we have J multiplicand multiples to add, we can use this technique J-2 times to reduce them to just two numbers that we have to add to get the final result. This final addition will require an ordinary addition, but the overall total of J-2 carry-save additions and one ordinary addition is a considerable improvement over the original J-1 ordinary additions.
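The column-by-column reduction can be written out directly. The following is a minimal bit-level sketch, assuming non-negative Python integers for X, Y and Z whose bits 0 to N are significant; the function name `carry_save_add` is hypothetical.

```python
def carry_save_add(x: int, y: int, z: int, w: int, n: int) -> tuple[int, int]:
    """Reduce X[N:0], Y[N:0], Z[N:0] and carry bit W to S[N:0] and C[N+1:0]."""
    s = 0
    c = w                                   # C[0] = W
    for i in range(n + 1):
        # Each column is handled on its own: no carry ripples between columns.
        column = ((x >> i) & 1) + ((y >> i) & 1) + ((z >> i) & 1)
        s |= (column & 1) << i              # S[i] is the low bit of the column sum
        c |= (column >> 1) << (i + 1)       # C[i+1] is the high bit of the column sum
    return s, c


s, c = carry_save_add(0b1011, 0b0110, 0b1101, 1, 3)
print(s, c, s + c)
```

The loop here is sequential only because it is software; in hardware every column is evaluated at once, which is where the speed advantage comes from.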

Some big and expensive multipliers do actually generate all the multiplicand multiples at once. With these, many of the addition delays can be eliminated by doing additions in parallel, at the expense of having a large number of adder circuits. For instance, 17 multiplicand multiples can be reduced to the final product with 5 stages of ordinary addition (the first reduces them to 9 numbers, the second to 5, the third to 3, the fourth to 2 and the fifth to 1) or with 6 stages of carry-save addition and one final ordinary addition (17->12->8->6->4->3->2->1). The time advantage of carry-save addition is less here than it was before, but another advantage of carry-save addition is important here: a carry-save adder is also a lot smaller than an ordinary adder.
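The stage counts quoted above follow from simple counting: an ordinary adder turns two numbers into one, and a carry-save adder turns three numbers into two. The sketch below reproduces the two sequences under the assumption that as many adders as possible are used in each stage; the function names are hypothetical.

```python
def ordinary_stages(count: int) -> int:
    """Stages of parallel two-input additions needed to reduce count numbers to one."""
    stages = 0
    while count > 1:
        count = (count + 1) // 2            # each pair becomes one number
        stages += 1
    return stages


def carry_save_sequence(count: int) -> list[int]:
    """Number of values remaining after each parallel carry-save stage, down to two."""
    sequence = [count]
    while count > 2:
        count = 2 * (count // 3) + count % 3   # each group of three becomes two
        sequence.append(count)
    return sequence


print(ordinary_stages(17))
print(carry_save_sequence(17))
```

For 17 inputs the sequence has six carry-save steps before the single ordinary addition that forms the final result.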

However, when aiming for a small simple circuit, we don't want to do this sort of parallel addition. The normal approach is similar to that for the ordinary adders: we will initialise the "carry" and "save" values so that their sum is zero (e.g. by initialising them both to zero), then use carry-save additions to add in the multiplicand multiples one by one. At the end, we use a normal addition to form the final sum of the "carry" and "save" values. As before, we can get a multiply-accumulate operation for free, by initialising the "carry" and "save" values so that their sum is the accumulate value - e.g. make one of them zero and the other the accumulate value. Indeed, because there are two values to initialise, we could add in two accumulate values, but this is not very useful: a better way to make use of this second accumulate value slot is described later.
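The iterative scheme can be modelled with a word-wide carry-save step. This is a sketch only: it assumes unsigned operands, uses the simple one-multiple-per-bit form of stage (1) rather than Booth encoding to keep the example short, and works modulo 2**width; the names `csa` and `multiply_accumulate` are hypothetical.

```python
def csa(x: int, y: int, z: int, width: int) -> tuple[int, int]:
    """Word-wide carry-save addition: save + carry equals x + y + z modulo 2**width."""
    mask = (1 << width) - 1
    save = (x ^ y ^ z) & mask
    carry = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return save, carry


def multiply_accumulate(d: int, r: int, n: int, accumulate: int, width: int) -> int:
    """Compute d*r + accumulate by adding in one multiplicand multiple per iteration."""
    mask = (1 << width) - 1
    save, carry = accumulate & mask, 0      # their sum is the accumulate value
    for i in range(n):
        multiple = ((d << i) if (r >> i) & 1 else 0) & mask
        save, carry = csa(save, carry, multiple, width)
    return (save + carry) & mask            # single ordinary addition at the end


print(multiply_accumulate(13, 11, 4, 100, 16))
```

Passing an accumulate value of zero gives a plain multiplication, matching the observation that the accumulate comes for free.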

Everything above assumes that we are doing (N+M)-bit additions - e.g. that if we are implementing a 32x32 multiplication, we will be doing 64-bit additions. This is awkward, because it typically results in us needing a datapath section which is twice as wide as the rest of the datapath.

However, if we look at the values we are adding in, we find they only contain a slightly more than M-bit region where the value is "interesting". For instance, consider XiD for a modified Booth encoder with N odd. Xi is one of -2^(2i), -2^(2i-1), 0, 2^(2i-1) and 2^(2i), which correspond to XiD being one of:

| Xi | Top N-2i bits | Middle M+1 bits | Bottom 2i-1 bits |
|---|---|---|---|
| -2^(2i) | All inverse of multiplicand sign | Inverted multiplicand followed by 1 | All 1, with carry bit of 1 |
| -2^(2i-1) | All inverse of multiplicand sign | Inverted multiplicand sign followed by inverted multiplicand | All 1, with carry bit of 1 |
| 0 | All zero | All zero | All 0, with carry bit of 0 |
| 2^(2i-1) | All equal to multiplicand sign | Multiplicand sign followed by multiplicand | All 0, with carry bit of 0 |
| 2^(2i) | All equal to multiplicand sign | Multiplicand followed by 0 | All 0, with carry bit of 0 |

Both the top N-2i and the bottom 2i-1 bits are not very interesting. In particular, we can add the carry bit into the bottom 2i-1 bits to get all zeros and the same carry into the middle M+1 bits in each case - i.e. replace the above by:

| Xi | Top N-2i bits | Middle M+1 bits | Bottom 2i-1 bits |
|---|---|---|---|
| -2^(2i) | All inverse of multiplicand sign | Inverted multiplicand followed by 1, with carry bit of 1 | All zero |
| -2^(2i-1) | All inverse of multiplicand sign | Inverted multiplicand sign followed by inverted multiplicand, with carry bit of 1 | All zero |
| 0 | All zero | All zero, with carry bit of 0 | All zero |
| 2^(2i-1) | All equal to multiplicand sign | Multiplicand sign followed by multiplicand, with carry bit of 0 | All zero |
| 2^(2i) | All equal to multiplicand sign | Multiplicand followed by 0, with carry bit of 0 | All zero |

We then find that we don't need to do the carry-save addition on the bottom 2i-1 bits: rather than doing a carry-save addition of two values and zero, we can just leave the two values unchanged. Furthermore, provided the top N-2i bits of the "save" and "carry" values so far are all identical, all of the top N-2i column additions will be identical, and so we only need one circuit to evaluate them all. As a result, we can do everything with just M+2 column adders (M+1 for the middle M+1 bits plus one for the top N-2i bits) provided:

(a) We start with the "save" value having its top N bits identical;

(b) We start with the "carry" value having its top N bits identical;

(c) We add in the multiplicand multiples in the order X0D, X1D, X2D, ..., XkD, so that each addition requires a smaller number of identical bits at the top than the previous ones; and

(d) We shift our "area of interest" left by 2 bits each iteration, storing away the bits that drop off the bottom end. At the end of the calculation, these bits will form the low ends of the final "carry" and "save" values, while those in the last "area of interest" will form their high ends.

This allows us to implement the main part of the multiplier with just a slight "bulge" in the width of the datapath, not a doubling of its width. The final addition still has to be double width, but can be implemented by two uses of a single width adder, using the technique of using the carry-out bit from the first addition as the carry-in bit to the second addition.
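The double-width final addition from two passes through a single-width adder can be sketched as follows. The model assumes an adder that returns a sum and a carry-out; the names `add_single` and `add_double` and the 32-bit width in the example are illustrative assumptions.

```python
def add_single(a: int, b: int, carry_in: int, width: int) -> tuple[int, int]:
    """Single-width adder: returns (sum modulo 2**width, carry-out bit)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def add_double(a: int, b: int, width: int, carry_in: int = 0) -> int:
    """Add two double-width values using the single-width adder twice."""
    mask = (1 << width) - 1
    low, carry_out = add_single(a & mask, b & mask, carry_in, width)
    # The carry-out of the low half becomes the carry-in of the high half.
    high, _ = add_single((a >> width) & mask, (b >> width) & mask, carry_out, width)
    return (high << width) | low


print(hex(add_double(0x00000001FFFFFFFF, 0x0000000000000001, 32)))
```

The optional `carry_in` argument corresponds to the carry-in bit of the first pass, which remains available for other uses.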

Of the restrictions, the last two are a matter of implementing the circuit correctly. However, the first two mean that any accumulate value may be at most M bits wide if signed, rather than about N+M bits wide. A technique is described below to circumvent this restriction on the accumulate value, allowing the implementation e.g. of a 32x32+64 multiply-accumulate instruction rather than just 32x32+32.

Note that any particular multiplier may contain multiple instances of the hardware which generates a multiplicand multiple and does the carry-save addition into the current "carry" and "save" forms (this hardware will be called a "multiplier row" in what follows). At one extreme, there is the full multiplier array, with a separate multiplier row for each iteration as illustrated in Figure 1.

At the other, there is the fully iterative multiplier, with just one multiplier row which handles all the iterations as illustrated in Figure 2.

In between, there are iterative versions with more than one row - e.g. the multiplier illustrated in Figure 3 with two rows.

The choice between the various options above is essentially a space vs. time trade-off: the more multiplier rows you use, the bigger and faster the circuit becomes (it becomes faster because there are fewer multiplexer and latch delays per multiplication, and fewer times that some time may be wasted waiting for the next clock edge to occur).

In practice, a common trade-off for reasonably small multipliers is to use the iterative multiplier with the right number of multiplier rows to make the total delay around the loop just less than the required cycle time (or in some cases half the required cycle time, with the circuit being driven by a double-speed clock).

An important point about all this is that essentially the same multiplier row hardware is used in all of these designs: any improvement in multiplier row design applies equally to any of them.

Initialising the carry-save form

As noted above, we can initialise both the "carry" and the "save" part of the carry-save form. One of them is wanted for the accumulate value; the other is simply set to zero. Can we make good use of the free addition we could get by setting it to something else? A second accumulate value is an obvious possibility, but is difficult to specify in an instruction (due to the number of bits required to specify an operation with four operands and a destination register) and not very useful anyway! Another thing we could do is use it for one of the multiplicand multiples. The main problem with this is that it means the carry-save form initialiser will have to contain a multiplicand multiple generator. This costs some space: more important, it causes an extra initialisation delay.

The magnitude of this extra delay depends on the complexity of generating the multiplicand multiple concerned. Looking at the formulae for the Xi generated by modified Booth encoding as set out above, one is particularly simple - namely that for N odd, X0 = -R[0]. Typically, of course, we are interested in the N even case; however, as observed above, a good way of dealing with a requirement to multiply by both signed and unsigned N-bit numbers is in fact to multiply by a signed (N+1)-bit number.

These observations lead to the following initialisation method for an M-bit by N-bit multiply-accumulator which handles both signed and unsigned variants, using modified Booth encoding and carry-save addition, and with N even:

Initialise the internal multiplier operand R[N:0] by:

R[N-1:0] = supplied multiplier operand;
R[N] = 0 if unsigned variant wanted;
R[N] = R[N-1] if signed variant wanted.

Initialise one of the "carry" and "save" values to the supplied accumulate value (extended with zeros or copies of the sign bit according to whether it is unsigned or signed).

Initialise the other of the "carry" and "save" values to zero if R[0] = 0, and to minus the supplied multiplicand (treated as signed or unsigned as appropriate) if R[0] = 1.

This last appears slightly complicated by having to generate minus the supplied multiplicand - i.e. its 2's complement. It would be advantageous to use the trick of forming its 1's complement and adding 1 instead. The problem is: when do we add this 1 in? A good answer is at the end of the multiplication. The reason is that the final addition on the carry-save form currently just needs to add the "carry" and "save" parts of it together. Most adders will add two numbers and a carry bit, and the carry bit is therefore unused. By setting the carry bit equal to R[0], we can compensate for the difference between using the 2's complement and the 1's complement of the multiplicand during initialisation when R[0] = 1. So we get the following initialisation method:

Initialise the internal multiplier operand R[N:0] by:

R[N-1:0] = supplied multiplier operand;
R[N] = 0 if unsigned variant wanted;
R[N] = R[N-1] if signed variant wanted.

Initialise one of the "carry" and "save" values to the supplied accumulate value (treated as signed or unsigned as appropriate).

Initialise the other of the "carry" and "save" values to zero if R[0] = 0, and to the bitwise inverse (i.e. 1's complement) of the supplied multiplicand (treated as signed or unsigned as appropriate) if R[0] = 1.

Set the carry-in bit for the final addition to R[0].
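The 1's complement trick described above can be checked with a small model. This is a minimal sketch, assuming the carry-save form is held as two plain integers of a fixed `width` and that no multiplier rows run between initialisation and the final addition; the function names `init_carry_save` and `final_add` are illustrative and do not appear in the specification.

```python
def init_carry_save(acc, mcand, r0, width):
    """Initialise (save, carry, carry_in) as in the second method above.

    save  <- accumulate value
    carry <- bitwise inverse of the multiplicand if R[0] = 1, else zero
    carry_in for the final adder <- R[0]
    """
    mask = (1 << width) - 1
    save = acc & mask
    carry = (~mcand & mask) if r0 else 0
    return save, carry, r0


def final_add(save, carry, carry_in, width):
    """Final addition of the two halves plus the otherwise unused carry-in."""
    return (save + carry + carry_in) & ((1 << width) - 1)
```

With `width = 16`, initialising with `acc = 1000`, `mcand = 37` and `r0 = 1` and then calling `final_add` gives 963, i.e. the accumulate value minus the multiplicand, which is what initialising with the 2's complement would have produced.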

The advantage of this technique is that, with little extra initialisation delay and hardware, one stage less of carry-save addition will be required for the main iteration. This may be more significant than it appears at first sight, due to the fact that N is often a power of 2 in real applications. If N = 2^n, this means that (N+1)-bit modified Booth encoding will generate a total of 2^(n-1) + 1 multiplicand multiples and stages of carry-save addition. Reducing this to 2^(n-1) stages can make it a lot easier to fit the calculation into an exact number of cycles and thus avoid wasting time.

Consider a circuit that needs to do a 32x32 multiplication. With this initialisation, it needs to do 16 stages of carry-save addition, which fits in as 4 cycles, each doing 4 stages. If the carry-save form were just initialised to the accumulate value, 17 stages would be required, which is a much more difficult number to deal with sensibly.
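The stage count can be illustrated by listing the Booth digits of the extended (N+1)-bit multiplier. The sketch below assumes that digit i (for i >= 1) is -2R[2i] + R[2i-1] + R[2i-2] with weight 2^(2i-1), and that X0 = -R[0] has weight 1; the weights are an assumption chosen so that the digits sum back to the operand, and the helper names are illustrative.

```python
def extend_multiplier(value, n, signed):
    """Form R[N:0]: R[N] = R[N-1] for the signed variant, 0 for unsigned."""
    r = value & ((1 << n) - 1)
    if signed and (r >> (n - 1)) & 1:
        r |= 1 << n
    return r


def bit(r, j):
    return (r >> j) & 1


def booth_digits(r, n):
    """Return (X0, [digit_1 .. digit_{N/2}]) for an (N+1)-bit operand, N even."""
    x0 = -bit(r, 0)
    digits = [-2 * bit(r, 2 * i) + bit(r, 2 * i - 1) + bit(r, 2 * i - 2)
              for i in range(1, n // 2 + 1)]
    return x0, digits


def booth_value(x0, digits):
    """Recombine the digits using the assumed weights 2^(2i-1)."""
    return x0 + sum(d * (1 << (2 * i - 1))
                    for i, d in enumerate(digits, start=1))


def cycles_needed(stages, rows_per_cycle):
    """Whole cycles needed when each cycle performs rows_per_cycle stages."""
    return -(-stages // rows_per_cycle)
```

For N = 32 there are 16 digits besides X0, so absorbing X0 into the initialisation leaves 16 stages (`cycles_needed(16, 4)` is 4), whereas 17 stages at 4 per cycle would need a fifth, mostly idle, cycle.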

Dealing with a long accumulate value

As stated above, the accumulate value can only be about M bits in length if we are able to represent all the top N-2i bits at stage i of the multiplication with just a single bit. If the accumulate value is longer, the datapath needs to be increased in width substantially to cope with it.

Note that in the top N-2i bits, we initially only want one of the "carry" and "save" values to be able to contain non-identical bits: the other one can contain all identical bits, as can the top N-2i bits of the multiplicand multiple. Unfortunately, after the top carry-save addition, both the "carry" and "save" values can contain strings of non-identical bits in their top bits. If this were not the case, and we could arrange that the top bits of the accumulate value were left unchanged by the addition and that the top bits of every other value remained a string of identical bits, we could again do the main work in a roughly M-bit wide datapath: the only difference from the previous state of affairs is that we would have to feed 2 bits of the accumulate value into the main calculation per iteration.

The way we will deal with this is by modifying the simple carry save "add three bits in each column" technique. In the same way as above, we will split the "carry" and "save" values up into three regions:

(a) A "low" region, in which no further change is going to be made to the "carry" and "save" values. After adding XiD into them, this region contains Li bits, where Li = 2i+1 in the N odd case (illustrated in the above) and Li = 2i+2 in the N even case. In the N even case, this can also be expressed by saying that Li = 2i before XiD is added; in the N odd case, it can also be expressed by saying that Li = 2i-1 before XiD is added, provided a special case is made for i=0 (e.g. by putting the addition of X0D into the initialisation, as described in the above); SL[Li-1:0] and CL[Li-1:0] denote the low "save" and "carry" bits respectively.

(b) A "middle" or "active" region, in which the main carry-save additions are taking place. This region contains M+1 bits, denoted S[M:0] and C[M:0] for the "save" and "carry" values respectively.

(c) A "high" region, in which the "save" value contains as-yet-unused bits of the accumulate value and the "carry" value is simply a string of copies of C[M]. Before adding XiD into the carry-save form, this region is 2(k-i)+2 bits long - i.e. the number of accumulate bits we still want to bring into the active region at 2 bits per addition (recall that k is the index of the last Xi).

As a check, the total length of the "carry" and "save" values is:

when N is even:

Li + (M+1) + (2(k-i)+2) = 2i + M + 1 + 2k - 2i + 2
                        = M + 1 + 2(N/2 - 1) + 2
                        = M + N + 1

when N is odd:

Li + (M+1) + (2(k-i)+2) = 2i - 1 + M + 1 + 2k - 2i + 2
                        = -1 + M + 1 + 2((N-1)/2) + 2
                        = M + N + 1

so we will naturally deal with an accumulate value A[M+N:0] of length M+N+1 bits. (We can of course deal with shorter accumulate values simply by zero-extending them or sign-extending them as appropriate; we can also deal with longer accumulate values, though the excess bits will be completely unchanged by the main operation and simply need to be added to the corresponding "carry" value bits (i.e. copies of C[M]) during the final addition).

So before we add XiD into the carry-save form, the "save" and "carry" values will be:

               | "High" region      | "Active" region | "Low" region
"save" value:  | A[M+N:Li+M+1]      | S[M:0]          | SL[Li-1:0]
"carry" value: | C[M],C[M],...,C[M] | C[M:0]          | CL[Li-1:0]

Next, we need to look at what the multiplicand multiple is like. We start with an M-bit signed or unsigned multiplicand D[M-1:0], which we sign-extend or zero-extend respectively to form an (M+1)-bit signed multiplicand D[M:0]. When we form the multiplicand multiple as described in the above, we get the following forms for the multiplicand multiple, depending on the value of the "Booth digit" -2R[2i+1] + R[2i] + R[2i-1] (for N even) or -2R[2i] + R[2i-1] + R[2i-2] (for N odd):

Booth digit | "High" region      | "Active" region     | "Low" region
-2          | I[M],I[M],...,I[M] | I[M-1:0],1; carry=1 | 0,0,...,0
-1          | I[M],I[M],...,I[M] | I[M:0]; carry=1     | 0,0,...,0
 0          | 0,0,...,0          | 0,0,...,0; carry=0  | 0,0,...,0
 1          | D[M],D[M],...,D[M] | D[M:0]; carry=0     | 0,0,...,0
 2          | D[M],D[M],...,D[M] | D[M-1:0],0; carry=0 | 0,0,...,0

where I[M:0] is the bitwise inverse (or 1's complement) of D[M:0]. These are all of the form:

"High" region               | "Active" region  | "Low" region
X[M+1],X[M+1],...,X[M+1]    | X[M:0]; carry=XC | 0,0,...,0

for some values of X[M+1:0] and XC. So the addition we wish to perform is of the form:

"High" region            | "Active" region | "Low" region
A[M+N:Li+M+1]            | S[M:0]          | SL[Li-1:0]
C[M],C[M],...,C[M]       | C[M:0]          | CL[Li-1:0]
X[M+1],X[M+1],...,X[M+1] | X[M:0]          | 0,0,...,0
0,0,...,0                | 0,0,...,0,XC    | 0,0,...,0

We wish to end up with "carry" and "save" values of the same form, but with i one greater, and thus Li two greater. In the process, we are going to generate new values for S[M:0] and C[M:0], which will be called S'[M:0] and C'[M:0] respectively. We will also generate SL[Li+1:Li], which are two new bits of SL[], but will not disturb the existing bits SL[Li-1:0]. Similarly, we will generate two new bits CL[Li+1:Li], but will not disturb the existing bits CL[Li-1:0]. Finally, the two lowest bits of A[M+N:Li+M+1] will be consumed, so we want our modified carry-save addition to produce a result of the form:

               | new "High" region     | new "Active" region | new "Low" region
"save" value:  | A[M+N:Li+M+3]         | S'[M:0]             | SL[Li+1:0]
"carry" value: | C'[M],C'[M],...,C'[M] | C'[M:0]             | CL[Li+1:0]

Matching the addition we wish to perform up against this, we find we wish our modified carry-save addition to be of the following form:

"High" region      | Transition bits       | "Active" region | Transition bits   | "Low" region
A[M+N:Li+M+3]      | A[Li+M+2] | A[Li+M+1] | S[M:2]          | S[1]     | S[0]   | SL[Li-1:0]
C[M],...,C[M]      | C[M]      | C[M]      | C[M:2]          | C[1]     | C[0]   | CL[Li-1:0]
X[M+1],...,X[M+1]  | X[M+1]    | X[M+1]    | X[M:2]          | X[1]     | X[0]   | 0,0,...,0
0,...,0            | 0         | 0         | 0,...,0         | 0        | XC     | 0,0,...,0
-----
A[M+N:Li+M+3]      | S'[M]     | S'[M-1]   | S'[M-2:0]       | SL[Li+1] | SL[Li] | SL[Li-1:0]
C'[M],...,C'[M]    | C'[M]     | C'[M-1]   | C'[M-2:0]       | CL[Li+1] | CL[Li] | CL[Li-1:0]

We can immediately eliminate the "low" region because the contribution this region will make to the final sum is unchanged. Similarly, we can eliminate the bits A[M+N:Li+M+3] which appear in the same positions both above and below the line: their contribution to the final form is again obviously unchanged. After this, we find that our modified carry-save addition must be of the form:

"High" region      | Transition bits       | "Active" region | Transition bits
0,...,0            | A[Li+M+2] | A[Li+M+1] | S[M:2]          | S[1]     | S[0]
C[M],...,C[M]      | C[M]      | C[M]      | C[M:2]          | C[1]     | C[0]
X[M+1],...,X[M+1]  | X[M+1]    | X[M+1]    | X[M:2]          | X[1]     | X[0]
0,...,0            | 0         | 0         | 0,...,0         | 0        | XC
-----
0,...,0            | S'[M]     | S'[M-1]   | S'[M-2:0]       | SL[Li+1] | SL[Li]
C'[M],...,C'[M]    | C'[M]     | C'[M-1]   | C'[M-2:0]       | CL[Li+1] | CL[Li]

Next, we do some ordinary carry-save addition on the "active" region and the two transition bits below it. By doing the operations:

CL[Li] = XC;
(CL[Li+1],SL[Li]) = two bit sum of S[0], C[0] and X[0];
(C'[0],SL[Li+1]) = two bit sum of S[1], C[1] and X[1];
For i = 2,3,...,M:
    (C'[i-1],S'[i-2]) = two bit sum of S[i], C[i] and X[i];

we ensure that S'[M-2:0], SL[Li+1], SL[Li], C'[M-1:0], CL[Li+1] and CL[Li] make the same contribution to the final sum below the line as S[M:0], C[M-1:0], X[M:0], XC and the "active region" copy of C[M] make above the line. So we can now eliminate all of these, together with all the zeros on the line containing XC, and the remainder of our modified carry-save addition must be of the form:

"High" region      | Transition bits
0,...,0            | A[Li+M+2] | A[Li+M+1]
C[M],...,C[M]      | C[M]      | C[M]
X[M+1],...,X[M+1]  | X[M+1]    | X[M+1]
-----
0,...,0            | S'[M]     | S'[M-1]
C'[M],...,C'[M]    | C'[M]     | 0

At this point, we make some mathematical modifications to this remaining sum. First, we can replace its second line by the sum of the following two lines:

1,...,1 | 1 | 1
0,...,0 | 0 | NOT(C[M])

Proof: if C[M] is 1, this is the sum of a row of all ones and a row of all zeros, which is a row of all ones. Conversely, if C[M] is 0, this is the sum of a row of all ones and a single one at its right hand end. This produces a row of all zeros plus a carry out of the left hand end. The carry out will be ignored, because it is outside the region in which we are doing the addition. So in either case, the sum is a row of copies of C[M].
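The proof above amounts to a small identity in modular arithmetic, which can be checked directly. This is a minimal sketch; `replicate` and `ones_plus_not` are illustrative names, and `width` stands for the number of bit positions in the region where the addition is done.

```python
def replicate(b, width):
    """A row of `width` copies of the bit b, as an integer."""
    return (1 << width) - 1 if b else 0


def ones_plus_not(b, width):
    """A row of all ones plus NOT(b) in the rightmost position.

    The carry out of the left hand end is dropped by the modulo.
    """
    return (replicate(1, width) + (1 - b)) % (1 << width)
```

For either value of the bit and any width, `ones_plus_not(b, width)` equals `replicate(b, width)`, which is the replacement used for the rows of C[M] and, in the same way, for the rows of X[M+1].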

Similarly, the third line can be replaced by the sum of the following two lines:

1,...,1 | 1 | 1
0,...,0 | 0 | NOT(X[M+1])

This modifies the required carry-save addition to be of the form:

"High" region   | Transition bits
0,...,0         | A[Li+M+2] | A[Li+M+1]
1,...,1         | 1         | 1
0,...,0         | 0         | NOT(C[M])
1,...,1         | 1         | 1
0,...,0         | 0         | NOT(X[M+1])
-----
0,...,0         | S'[M]     | S'[M-1]
C'[M],...,C'[M] | C'[M]     | 0

Next, we add the two rows of ones, again ignoring the carry out because it is outside the region in which we are doing the addition. This changes the required carry-save addition to be of the form:

"High" region   | Transition bits
0,...,0         | A[Li+M+2] | A[Li+M+1]
1,...,1         | 1         | 0
0,...,0         | 0         | NOT(C[M])
0,...,0         | 0         | NOT(X[M+1])
-----
0,...,0         | S'[M]     | S'[M-1]
C'[M],...,C'[M] | C'[M]     | 0

At this point, if we do the operation:

(S'[M],S'[M-1]) = two bit sum of A[Li+M+1], NOT(C[M]) and NOT(X[M+1])

we will find that we can eliminate all these bits from the required addition, together with some zeros, which just leaves:

"High" region   | Transition bit
0,...,0         | A[Li+M+2]
1,...,1         | 1
-----
C'[M],...,C'[M] | C'[M]

And finally, if we do the operation:

C'[M] = NOT(A[Li+M+2])

we solve this remaining part of the addition sum, by the reverse of the argument which said that we could replace a row of all C[M]s by a row of all 1s and a row containing just NOT(C[M]) in its rightmost position.

Overall conclusion: by performing the set of operations:

CL[Li] = XC;
(CL[Li+1],SL[Li]) = two bit sum of S[0], C[0] and X[0];
(C'[0],SL[Li+1]) = two bit sum of S[1], C[1] and X[1];
For i = 2,3,...,M:
    (C'[i-1],S'[i-2]) = two bit sum of S[i], C[i] and X[i];
(S'[M],S'[M-1]) = two bit sum of A[Li+M+1], NOT(C[M]) and NOT(X[M+1]);
C'[M] = NOT(A[Li+M+2])

we can implement a modified carry-save addition for our desired operation:

"High" region      | Transition bits       | "Active" region | Transition bits   | "Low" region
A[M+N:Li+M+3]      | A[Li+M+2] | A[Li+M+1] | S[M:2]          | S[1]     | S[0]   | SL[Li-1:0]
C[M],...,C[M]      | C[M]      | C[M]      | C[M:2]          | C[1]     | C[0]   | CL[Li-1:0]
X[M+1],...,X[M+1]  | X[M+1]    | X[M+1]    | X[M:2]          | X[1]     | X[0]   | 0,0,...,0
0,...,0            | 0         | 0         | 0,...,0         | 0        | XC     | 0,0,...,0
-----
A[M+N:Li+M+3]      | S'[M]     | S'[M-1]   | S'[M-2:0]       | SL[Li+1] | SL[Li] | SL[Li-1:0]
C'[M],...,C'[M]    | C'[M]     | C'[M-1]   | C'[M-2:0]       | CL[Li+1] | CL[Li] | CL[Li-1:0]

Thus, provided we can initialise the "save" and "carry" values correctly, we can perform an M-bit by N-bit multiply-accumulate operation with an (M+N+1)-bit accumulate value, by initialising, adding in multiplicand multiples using the above modified carry-save addition and then doing a final addition on the carry-save value. There are M+4 operations in the list that implement the carry-save addition, so only a slight "bulge" in the datapath is required to do this, and not the doubling of its width that you would normally think was required for the desired operation.

The invention encompasses any multiplier row that implements the above equations, whether it appears in a full array or an iterative multiplier, and no matter how many multiplier rows appear in the array or the iterative loop.
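The set of operations in the overall conclusion above can be modelled bit by bit and checked against the sum they are meant to preserve. The following is a sketch only, assuming integers stand in for the bit vectors: `s` and `c` hold S[M:0] and C[M:0], `x` holds X[M+1:0], `a_high` holds the high-region accumulate bits with A[Li+M+1] as its bit 0, and `h` is the length of the high region. All function and parameter names are illustrative.

```python
def bit(v, i):
    return (v >> i) & 1


def full_add(a, b, c):
    """Two bit sum of three bits, returned as (carry, sum)."""
    t = a + b + c
    return t >> 1, t & 1


def modified_csa_row(s, c, x, xc, a1, a2, m):
    """One modified carry-save addition; a1 = A[Li+M+1], a2 = A[Li+M+2].

    Returns (S', C', SL[Li+1:Li], CL[Li+1:Li]).
    """
    cl0 = xc
    cl1, sl0 = full_add(bit(s, 0), bit(c, 0), bit(x, 0))
    c_new, sl1 = full_add(bit(s, 1), bit(c, 1), bit(x, 1))  # C'[0]
    s_new = 0
    for i in range(2, m + 1):
        cy, sm = full_add(bit(s, i), bit(c, i), bit(x, i))
        c_new |= cy << (i - 1)
        s_new |= sm << (i - 2)
    hi, lo = full_add(a1, 1 - bit(c, m), 1 - bit(x, m + 1))
    s_new |= (hi << m) | (lo << (m - 1))      # S'[M], S'[M-1]
    c_new |= (1 - a2) << m                    # C'[M] = NOT(A[Li+M+2])
    return s_new, c_new, (sl1 << 1) | sl0, (cl1 << 1) | cl0


def with_high(low, top, m, h):
    """`low` in the active region plus h copies of `top` in the high region."""
    return low + ((top * ((1 << h) - 1)) << (m + 1))


def row_preserves_value(s, c, x, xc, a_high, m, h):
    """Compare the sums above and below the line, modulo the region width."""
    mod = 1 << (m + 1 + h)
    x_low = x & ((1 << (m + 1)) - 1)
    above = ((a_high << (m + 1)) + s + with_high(c, bit(c, m), m, h)
             + with_high(x_low, bit(x, m + 1), m, h) + xc) % mod
    s2, c2, sl, cl = modified_csa_row(s, c, x, xc,
                                      bit(a_high, 0), bit(a_high, 1), m)
    upper = (((a_high >> 2) << (m + 1)) + s2
             + with_high(c2, bit(c2, m), m, h - 2))
    below = (sl + cl + (upper << 2)) % mod
    return above == below
```

`row_preserves_value` takes the weight of active bit 0 before the addition as 1, so the new active region sits two places higher afterwards. It should hold for any bit patterns, since the derivation above does not depend on how S, C and X were produced.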

Furthermore, note that no essential change occurs if bits of the same numeric significance exchange roles. For instance, we could exchange the roles of XC and C[0] and of C'[M-1] and S'[M-1] to get the following equations, which would implement the invention just as effectively:

CL[Li] = C[0];
(CL[Li+1],SL[Li]) = two bit sum of S[0], XC and X[0];
(C'[0],SL[Li+1]) = two bit sum of S[1], C[1] and X[1];
For i = 2,3,...,M-1:
    (C'[i-1],S'[i-2]) = two bit sum of S[i], C[i] and X[i];
(S'[M-1],S'[M-2]) = two bit sum of S[M], C[M] and X[M];
(S'[M],C'[M-1]) = two bit sum of A[Li+M+1], NOT(C[M]) and NOT(X[M+1]);
C'[M] = NOT(A[Li+M+2])

The invention may also be applied to some of the more sophisticated ways of generating multiplicand multiples referred to briefly in the above. For instance, another modification of Booth's algorithm might deal with 3 bits of the multiplier per cycle, by making Xn be a power of 2 times one of the numbers -4, -3, -2, -1, 0, 1, 2, 3 or 4. With such an algorithm, we would have to absorb three bits of the accumulate value into the "active" region each cycle. This could be done in a similar way to above, with the modified part of the carry-save addition being something like:

"High" region          | Transition bits
0,0,...,0              | A[Li+M+3] | A[Li+M+2] | A[Li+M+1]
C[M],...,C[M]          | C[M]      | C[M]      | C[M]
X[M+1],...,X[M+1]      | X[M+1]    | X[M+1]    | X[M+1]
=
0,...,0                | A[Li+M+3] | A[Li+M+2] | A[Li+M+1]
1,...,1                | 1         | 1         | 1
0,...,0                | 0         | 0         | NOT(C[M])
1,...,1                | 1         | 1         | 1
0,...,0                | 0         | 0         | NOT(X[M+1])
=
0,...,0                | A[Li+M+3] | A[Li+M+2] | A[Li+M+1]
1,...,1                | 1         | 1         | 0
0,...,0                | 0         | 0         | NOT(C[M])
0,...,0                | 0         | 0         | NOT(X[M+1])
=
0,...,0                | S'[M+1]   | S'[M]     | S'[M-1]
C'[M+1],...,C'[M+1]    | C'[M+1]   | C'[M]     | 0

with C'[M+1] = NOT(A[Li+M+3]), S'[M+1:M] being the 2-bit sum of A[Li+M+2], 1 and 0 (i.e. S'[M+1] = A[Li+M+2], S'[M] = NOT(A[Li+M+2])), and (C'[M],S'[M-1]) being the 2-bit sum of A[Li+M+1], NOT(C[M]) and NOT(X[M+1]), rather than:

"High" region      | Transition bits
0,...,0            | A[Li+M+2] | A[Li+M+1]
C[M],...,C[M]      | C[M]      | C[M]
X[M+1],...,X[M+1]  | X[M+1]    | X[M+1]
=
0,...,0            | A[Li+M+2] | A[Li+M+1]
1,...,1            | 1         | 1
0,...,0            | 0         | NOT(C[M])
1,...,1            | 1         | 1
0,...,0            | 0         | NOT(X[M+1])
=
0,...,0            | A[Li+M+2] | A[Li+M+1]
1,...,1            | 1         | 0
0,...,0            | 0         | NOT(C[M])
0,...,0            | 0         | NOT(X[M+1])
=
0,...,0            | S'[M]     | S'[M-1]
C'[M],...,C'[M]    | C'[M]     | 0

with C'[M] = NOT(A[Li+M+2]) and S'[M:M-1] being the 2-bit sum of A[Li+M+1], NOT(C[M]) and NOT(X[M+1]).

An example multiplier will now be described which uses the above to yield 32x32→64 multiplications and 32x32+64→64 multiply-accumulate operations in both signed and unsigned versions.

Further possible refinements not illustrated in this example would be:

Early termination, which essentially involves detecting when all the remaining multiplicand multiples will be zero (so that no more additions are actually required), then doing some rather messy rearrangement of the bits in the "low", "active" and "high" regions in order to produce the correct final additions.

Avoiding the need for the final adder in this circuit, by using another adder which may be present on the datapath anyway, e.g. as part of an ALU.

One could take advantage of the fact that the "save low latches" fill from the bottom up at 2 bits/multiplier row, while the "accumulate value high latches" empty from the bottom up at the same rate, to use the same physical register to hold both values. At the start, it contains the high part of the accumulate value; at the end, it contains the low part of the "save" value; in between, it contains the as-yet-unconsumed high accumulate value bits and the so-far-generated low "save" value bits.

This is a useful improvement to the circuit.

The multiplier uses the following inputs:

MPLIER[31:0] - the multiplier operand;
MCAND[31:0] - the multiplicand operand;
ACCVAL[63:0] - the accumulate value;
SIGNED - a single bit which is 1 if a signed operation is to be performed, 0 if an unsigned operation is to be performed;
ACCUM - a single bit which is 1 if a multiply-accumulate is to be performed, 0 if just a multiply is to be performed;

plus control inputs to cause the correct sequence of events to occur.

The circuit produces RESULT[63:0] as its result.

A basic block diagram of this multiplier is as shown in Figures 4A and 4B (for the sake of clarity, the control signals are not shown). This multiplier calculates MPLIER[31:0] * MCAND[31:0] + ACCVAL[63:0] in five cycles, with the various blocks performing the following functions on each cycle:

Multiplier latches
Cycle 1:    R[31:0] = MPLIER[31:0]
            R[32] = SIGNED AND MPLIER[31]
Cycles 2-5: No change

Multiplicand latches
Cycle 1:    D[31:0] = MCAND[31:0]
            D[32] = SIGNED AND MCAND[31]
Cycles 2-5: No change

Carry-save initialiser
Cycle 1:    For i=0,1,...,32:
                SI[i] = ACCUM AND ACCVAL[i]
                CI[i] = R[0] AND NOT(D[i])
Cycles 2-5: No change

Carry-save latches A
Cycles 1-3: On phase 2, S0[32:0] and C0[32:0] are loaded from S6[32:0] and C6[32:0] respectively
Cycles 4-5: No change

Carry-save latches B
Cycles 1-4: On phase 1, S4[32:0] and C4[32:0] are loaded from S3[32:0] and C3[32:0] respectively
Cycle 5:    No change

Multiplexer A
Cycle 1:    S1[32:0] = SI[32:0]
            C1[32:0] = CI[32:0]
Cycles 2-5: S1[32:0] = S0[32:0]
            C1[32:0] = C0[32:0]

Booth encoders
Cycle 1:    B1[4:0] = BoothEnc(R[2:0])
            B2[4:0] = BoothEnc(R[4:2])
            B4[4:0] = BoothEnc(R[6:4])
            B5[4:0] = BoothEnc(R[8:6])
Cycle 2:    B1[4:0] = BoothEnc(R[10:8])
            B2[4:0] = BoothEnc(R[12:10])
            B4[4:0] = BoothEnc(R[14:12])
            B5[4:0] = BoothEnc(R[16:14])
Cycle 3:    B1[4:0] = BoothEnc(R[18:16])
            B2[4:0] = BoothEnc(R[20:18])
            B4[4:0] = BoothEnc(R[22:20])
            B5[4:0] = BoothEnc(R[24:22])
Cycle 4:    B1[4:0] = BoothEnc(R[26:24])
            B2[4:0] = BoothEnc(R[28:26])
            B4[4:0] = BoothEnc(R[30:28])
            B5[4:0] = BoothEnc(R[32:30])
Cycle 5:    No change

where the BoothEnc function is specified by the following table:

Input bits | Output bits
2 1 0      | 4 3 2 1 0
0 0 0      | 0 0 1 0 0
0 0 1      | 0 1 0 0 0
0 1 0      | 0 1 0 0 0
0 1 1      | 1 0 0 0 0
1 0 0      | 0 0 0 0 1
1 0 1      | 0 0 0 1 0
1 1 0      | 0 0 0 1 0
1 1 1      | 0 0 1 0 0

Multiplier rows 1, 2, 4 and 5

On all cycles, multiplier row k takes inputs D[32:0], Bk[4:0], AHk[1:0], Sk[32:0] and Ck[32:0] and produces outputs S(k+1)[32:0], C(k+1)[32:0], SLk[1:0] and CLk[1:0] according to the following equations:

First use multiplexers to generate X[33:0] and XC according to the following table:

Bk[4] [3] [2] [1] [0] | X[33]      | X[32:1]      | X[0]      | XC
  0    0   0   0   1  | NOT(D[32]) | NOT(D[31:0]) | 1         | 1
  0    0   0   1   0  | NOT(D[32]) | NOT(D[32:1]) | NOT(D[0]) | 1
  0    0   1   0   0  | 0          | 0,0,...,0    | 0         | 0
  0    1   0   0   0  | D[32]      | D[32:1]      | D[0]      | 0
  1    0   0   0   0  | D[32]      | D[31:0]      | 0         | 0

Then:

CLk[0] = XC;
(CLk[1],SLk[0]) = two bit sum of Sk[0], Ck[0] and X[0];
(C(k+1)[0],SLk[1]) = two bit sum of Sk[1], Ck[1] and X[1];
For i=2,3,...,32:
    (C(k+1)[i-1],S(k+1)[i-2]) = two bit sum of Sk[i], Ck[i] and X[i];
(S(k+1)[32],S(k+1)[31]) = two bit sum of AHk[0], NOT(Ck[32]) and NOT(X[33]);
C(k+1)[32] = NOT(AHk[1])

Accumulate value high latches
These contain internal signals ACCHI[31:0].
Cycle 1:    For i=0,1,...,30:
                ACCHI[i] = ACCUM AND ACCVAL[i+33]
            ACCHI[31] = SIGNED AND ACCUM AND ACCVAL[63]
            AH1[1:0] = ACCHI[1:0]
            AH2[1:0] = ACCHI[3:2]
            AH4[1:0] = ACCHI[5:4]
            AH5[1:0] = ACCHI[7:6]
Cycle 2:    AH1[1:0] = ACCHI[9:8]
            AH2[1:0] = ACCHI[11:10]
            AH4[1:0] = ACCHI[13:12]
            AH5[1:0] = ACCHI[15:14]
Cycle 3:    AH1[1:0] = ACCHI[17:16]
            AH2[1:0] = ACCHI[19:18]
            AH4[1:0] = ACCHI[21:20]
            AH5[1:0] = ACCHI[23:22]
Cycle 4:    AH1[1:0] = ACCHI[25:24]
            AH2[1:0] = ACCHI[27:26]
            AH4[1:0] = ACCHI[29:28]
            AH5[1:0] = ACCHI[31:30]
Cycle 5:    No change

Carry and save low latches
Cycle 1:    SL[7:0] = (SL5[1:0],SL4[1:0],SL2[1:0],SL1[1:0])
            CL[7:0] = (CL5[1:0],CL4[1:0],CL2[1:0],CL1[1:0])
Cycle 2:    SL[15:8] = (SL5[1:0],SL4[1:0],SL2[1:0],SL1[1:0])
            CL[15:8] = (CL5[1:0],CL4[1:0],CL2[1:0],CL1[1:0])
Cycle 3:    SL[23:16] = (SL5[1:0],SL4[1:0],SL2[1:0],SL1[1:0])
            CL[23:16] = (CL5[1:0],CL4[1:0],CL2[1:0],CL1[1:0])
Cycle 4:    SL[31:24] = (SL5[1:0],SL4[1:0],SL2[1:0],SL1[1:0])
            CL[31:24] = (CL5[1:0],CL4[1:0],CL2[1:0],CL1[1:0])
Cycle 5:    No change

Multiplexer B
Cycles 1-4: SF[31:0] = SL[31:0]
            CF[31:0] = CL[31:0]
Cycle 5:    SF[31:0] = S6[31:0]
            CF[31:0] = C6[31:0]

Multiplexer C
Cycles 1-4: CIN = R[0]
Cycle 5:    CIN = NEWC

Carry latch
Cycles 1-3: No change
Cycle 4:    NEWC = COUT
Cycle 5:    No change

Final adder
On all cycles: (COUT, SUM[31:0]) = 33-bit sum of SF[31:0], CF[31:0] and CIN

Result latches
Cycles 1-3: No change
Cycle 4:    RESULT[31:0] = SUM[31:0]
Cycle 5:    RESULT[63:32] = SUM[31:0]

Figures 4A and 4B together illustrate a multiply-accumulate circuit for multiplying an M-bit multiplicand (MCAND[]) and an N-bit multiplier (MPLIER[]) and then adding an (M+N)-bit accumulate value (ACCVAL[]); in this example M = N = 32. The N-bit multiplier is latched within multiplier latches 2 and the M-bit multiplicand is latched within multiplicand latches 4. The lower portion of the (M+N)-bit accumulate value is fed to the carry-save initialiser 6 and the upper portion of the (M+N)-bit accumulate value is fed to accumulate value high latches 8. The carry-save initialiser 6 receives the M-bit multiplicand (D[]) and produces either a bitwise inversion of this or zero, depending on whether the value of the bottommost bit of the multiplier latches 2 is one or zero respectively. The result is fed to a multiplexer A 10 to serve as one of the carry value or the save value. The other of the carry value or save value comprises the lowermost bits of the accumulate value.

The N-bit multiplier is also fed to a series of Booth encoders 12 which produce modified Booth summands that are fed to respective ones of the subsequent multiplier rows.

As shown in Figure 4B, a sequence of multiplier rows 14, 16, 18, 20 is provided, each of which implements the multiplier algorithm previously discussed. This multiplier algorithm incorporates two bits of the accumulate value on each iteration. Input to each multiplier row 14, 16, 18, 20 on each cycle are a Booth digit (B1[], B2[], B4[], B5[]), bits from the accumulate value stored within the accumulate value high latches 8, bits of the M-bit multiplicand from the multiplicand latches 4 and the save value and carry value from the previous multiplier row either directly or indirectly.

Output from each multiplier row are the lowermost bits (SL, CL) that are no longer changing on subsequent iterations and the current save value and carry value. These lowermost bits are accumulated in the carry and save low latches 24. The save value and carry value (S6[], C6[]) are fed back to the first multiplier row 14 via carry-save latches A 22 and multiplexer A 10. When the final multiplication iteration has been completed, the carry value and save value from the last multiplier row 20 and carry and save low latches 24 are respectively fed to a final adder 26 where they are summed over two cycles (the values from the carry and save low latches 24 being fed to the final adder 26 on the first cycle and the values from the final multiplier row 20 on the second cycle) with the result being stored in result latches 28. A multiplexer C 30 serves to feed in the carry bit R[0] left over from the 1's complement initialisation during the first addition cycle of the final adder 26 and any carry bit as necessary between the first and second cycles of the final adder 26.
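The BoothEnc table and the X/XC multiplexer table of the example can be sketched in software as follows. This is an illustrative model only, assuming the one-hot output B[4:0] encodes the Booth digits +2, +1, 0, -1, -2 from bit 4 down to bit 0 (which is how the two tables line up), with D held as a 33-bit integer; `booth_enc` and `select_x` are hypothetical names.

```python
MASK33 = (1 << 33) - 1
MASK34 = (1 << 34) - 1


def booth_enc(r2, r1, r0):
    """Return (Booth digit, one-hot B[4:0]) for three multiplier bits."""
    digit = -2 * r2 + r1 + r0
    return digit, 1 << (digit + 2)


def select_x(b, d):
    """Return (X[33:0], XC) for one-hot selector b and 33-bit multiplicand d."""
    d &= MASK33
    top = (d >> 32) & 1                      # D[32]
    if b == 0b00001:                         # digit -2: NOT(2D), XC = 1
        return ~(d << 1) & MASK34, 1
    if b == 0b00010:                         # digit -1: NOT(D), XC = 1
        return (~d & MASK33) | ((1 - top) << 33), 1
    if b == 0b00100:                         # digit 0
        return 0, 0
    if b == 0b01000:                         # digit +1: D sign-extended
        return d | (top << 33), 0
    if b == 0b10000:                         # digit +2: 2D
        return (d << 1) & MASK34, 0
    raise ValueError("selector must be one-hot in 5 bits")
```

Reading X[33:0] as a signed 34-bit number and adding XC gives the Booth digit times the signed value of D, which is why XC can be carried along as a separate bit rather than being added inside the row.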


Claims

1. An electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said apparatus comprising: means for generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; means for carry-save adding at least one of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result having active carry bits that are still changing and active save bits that are still changing, a most significant bit of said active save bits having a bit position Z and a least significant bit of said active save bits having a bit position Y; and means for adding X bits of said (M+N)-bit accumulate value to said input carry-save partial result on a carry-save cycle at bit positions above bit position Z, wherein Z and Y both increase by X between carry-save cycles.

2. An electronic multiplying and adding apparatus as claimed in claim 1, wherein X = 2.

3. An electronic multiplying and adding apparatus as claimed in claim 2, wherein said means for adding comprises: means for extending said output carry-save partial result on a carry-save cycle with extension save bits at bit positions Z+1 and Z+2 and an extension carry bit at bit position Z+2, said extension save bits being given by the sum of: the Z+1th bit of said (M+N)-bit accumulate value; the complement of the Zth bit of said active carry bits; and the complement of the Z+1th bit of said partial summand, and said extension carry bit being given by the complement of the Z+2th bit of said (M+N)-bit accumulate value.

4. An electronic multiplying and adding apparatus as claimed in any one of the preceding claims, wherein said means for generating comprises a Booth encoder, said sequence of partial summands being a sequence of Booth summands.

5. An electronic multiplying and adding apparatus as claimed in any one of the preceding claims, wherein M = N.

6. An electronic multiplying and adding apparatus as claimed in claim 5, wherein M = N = 32.

7. An electronic multiplying and adding apparatus as claimed in any one of the preceding claims, said apparatus having a data path therethrough, said data path having a width of less than N+M bits.

8. An electronic multiplying and adding apparatus as claimed in claims 5 and 7, wherein said data path has a width of M+3 bits.

9. An electronic multiplying and adding apparatus as claimed in any one of the preceding claims, wherein the Z+1th bit of the partial summand is either a zero or a sign extension depending upon a sign selecting input, said M-bit multiplicand and said N-bit multiplier being treated as unsigned or signed numbers respectively in dependence upon said sign selecting input.

10. An electronic multiplying and adding apparatus as claimed in any one of the preceding claims, comprising an integrated circuit central processing unit having a hardware multiplier.

11. A method of multiplying and adding within an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said method comprising the steps of:

generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; carry-save adding respective ones of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result having active carry bits that are still changing and active save bits that are still changing, a most significant bit of said active save bits having a bit position Z and a least significant bit of said active save bits having a bit position Y; and adding X bits of said (M+N)-bit accumulate value to said input carry-save partial result on a carry-save cycle at bit positions above bit position Z, wherein Z and Y both increase by X between carry-save cycles.

12. A method of multiplying and adding as claimed in claim 11, wherein X = 2.

13. A method of multiplying and adding as claimed in any one of claims 11 and 12, wherein said step of adding comprises: extending said output carry-save partial result on a carry-save cycle with extension save bits at bit positions Z+1 and Z+2 and an extension carry bit at bit position Z+2, said extension save bits being given by the sum of: the Z+1th bit of said (M+N)-bit accumulate value; the complement of the Zth bit of said active carry bits; and the complement of the Z+1th bit of said partial summand, and said extension carry bit being given by the complement of the Z+2th bit of said (M+N)-bit accumulate value.

14. An electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (N+M)-bit accumulate value, said apparatus comprising: means for generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; means for carry-save adding at least one of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result and said output carry-save partial result each comprising a carry value and a save value; and means for initialising said carry value and said save value to a respective one of: at least part of said (N+M)-bit accumulate value; and a first partial summand.

15. An electronic multiplying and adding apparatus as claimed in claim 14, wherein said means for generating comprises a modified Booth encoder, said sequence of partial summands being a sequence of modified Booth summands.

16. An electronic multiplying and adding apparatus as claimed in claim 15, wherein, when said N-bit multiplier is signed, respective bits R[a] of an internal multiplier representing said N-bit multiplier are given by:

R[(N-1) to 0] are equal to corresponding bits of said N-bit multiplier, and R[N] is equal to R[(N-1)].

17. An electronic multiplying and adding apparatus as claimed in claim 15, wherein, when said N-bit multiplier is unsigned, respective bits R[a] of an internal multiplier representing said N-bit multiplier are given by:

18. An electronic multiplying and adding apparatus as claimed in any one of claims 16 and 17, wherein, if R[0] is equal to zero, then said first partial summand is initialised to zero.

19. An electronic multiplying and adding apparatus as claimed in any one of claims 16, 17 and 18, wherein, if R[0] is equal to one, then said first partial summand is initialised to a bitwise inverse of said M-bit multiplicand with a one being added later.

20. An electronic multiplying and adding apparatus as claimed in claim 19, comprising a final adder for adding said carry value and said save value, a carry-in bit for said final adder being set to R[0].

21. A method of multiplying and adding within an electronic multiplying and adding apparatus for multiplying an M-bit multiplicand by an N-bit multiplier and adding an (M+N)-bit accumulate value, said method comprising the steps of: generating a sequence of partial summands representing multiplication of said M-bit multiplicand and said N-bit multiplier; carry-save adding respective ones of the partial summands to an input carry-save partial result to yield an output carry-save partial result, said input carry-save partial result and said output carry-save partial result each comprising a carry value and a save value; and initialising said carry value and said save value to a respective one of: at least part of said (N+M)-bit accumulate value; and a first partial summand.

22. An electronic multiplying and adding apparatus as claimed in any one of claims 1 to 10 and as claimed in any one of claims 14 to 20.

23. A method of multiplying and adding within an electronic multiplying and adding apparatus as claimed in any one of claims 11, 12 and 13 and as claimed in claim 21.

24. An electronic multiplying and adding apparatus substantially as hereinbefore described with reference to the accompanying drawings.

25. A method of multiplying and adding within an electronic multiplying and adding apparatus substantially as hereinbefore described with reference to the accompanying drawings.