WO1986002181A1

WO1986002181A1 - A digital signal processor for single cycle multiply/accumulation

Info

Publication number: WO1986002181A1
Application number: PCT/US1985/001423
Authority: WO
Inventors: Kevin Lee Kloker
Original assignee: Motorola, Inc.
Priority date: 1984-09-28
Filing date: 1985-07-26
Publication date: 1986-04-10
Also published as: EP0197945A1; KR860700300A

Abstract

A data processor (35) capable of repeatedly multiplying two input operands (X and Y) and selectively accumulating the resulting product with a third input operand in a single clock cycle of operation. The resulting accumulated product (10) may be used as one or both multiplier input operands in an immediately following clock cycle of operation by using a feedback path (57) coupled between an output and an input of the multiplier/accumulator (49). The data processor (35) utilizes a plurality of input storage registers (36, 39) which are shared by a memory bus (38 or 40) coupled to external memory and by the data processor (35) to thereby reduce data processing time.

Description

A DIGITAL SIGNAL PROCESSOR FOR SINGLE CYCLE MULTIPLY/ACCUMULATION

Technical Field

This invention relates generally to signal processors, and more particularly, to a digital signal processor capable of a multiply/accumulation in a single clock cycle.

Background of the Invention

Signal processors which utilize an ALU for multiplying two numbers and selectively adding the product with a third number are very common in the signal processing art. Typical processors utilize two stages in which a product is formed in the first stage and an accumulation is made in the second stage. An example of such a processor is taught by Glenn Culler in U.S. Patent No. 4,287,566 entitled "Array Processor With Parallel Operations Per Instruction". Such processors require a minimum of two clock cycles to provide an output.

Summary of the Invention

Accordingly, an object of the present invention is to provide an improved digital signal processor for single cycle multiply/accumulation operations.

Another object of the present invention is to provide an improved data processor capable of complete single cycle operation.

In carrying out the above and other objects, there is provided, in one form, a digital signal processor for implementing algorithms by providing product accumulations. In the illustrated form, a product of first and second input operands is selectively accumulated with a third input operand. First input storage means having an input coupled to a first data bus are used to selectively store the first input operand. Second input storage means having an input coupled to a second data bus are used to selectively store the second input operand. A multiplier/accumulator having first and second inputs for receiving the first and second operands provides a product selectively accumulated with a third input operand coupled to a third input thereof. The accumulated product is provided in a single clock cycle of the processor in response to receipt of the first, second and third input operands. An output storage means has an input selectively coupled to the output of the multiplier/accumulator or either memory bus. An output of the output storage means is selectively coupled to at least a predetermined one of the first, second or third inputs of the multiplier/accumulator for implementing a variety of differing algorithms. Repetitive complete multiply/accumulation operations may be executed with each operation taking only one clock cycle.

Brief Description of the Drawings

FIG. 1 illustrates in block diagram form a digital signal processor structure known in the art;

FIG. 2 illustrates in block diagram form a digital signal processor structure in accordance with a preferred embodiment of the present invention;

FIG. 3 illustrates in block diagram form another embodiment of the digital signal processor of FIG. 2;

FIG. 4 illustrates in block diagram form a biquadratic digital filter structure implementable by the digital signal processors of FIGS. 2 and 3; and

FIG. 5 illustrates in block diagram form a cascaded digital filter structure implementable by the digital signal processors of FIGS. 2 and 3.

Detailed Description of the Invention

Shown in FIG. 1 is a representative data processor 10 known in the art which generally comprises stages 11 and 12. First stage 11 comprises a first input register 14 having an input connected to a first data bus 15 labeled "X Data Bus". A second input register 16 has an input connected to a second data bus 18 labeled "Y Coefficient Bus". An output of input register 14 is connected to a first input of a multiplier circuit 20, and an output of input register 16 is connected to a second input of multiplier circuit 20. Multiplier 20 has first and second outputs respectively connected to an input of a product register 22 and an input of a multiplexor circuit 21. A second input of multiplexor circuit 21 is connected to first data bus 15. An output of multiplexor 21 is connected to an input of a product register 24. Product registers 22 and 24 represent the most significant product (MSP) and least significant product (LSP), respectively, of multiplier 20.

Second stage 12 comprises a multiplexor 25, an ALU 26, a multiplexor 27, an accumulator register 28, and a bus driver circuit 30. Product registers 22 and 24 of first stage 11 each have an output connected to first and second inputs of multiplexor 25, respectively. An output of multiplexor 25 is connected to a first input of ALU 26. An output of ALU 26 is connected to a first input of multiplexor circuit 27. A second input of multiplexor circuit 27 is connected to first data bus 15. An output of multiplexor circuit 27 is connected to an input of accumulator register 28. A first output of accumulator register 28 is connected to a second input of ALU 26. A second output of accumulator register 28 is connected to an input of bus driver circuit 30. An output of bus driver circuit 30 is connected to the inputs of input registers 14 and 16, to the second inputs of multiplexors 21 and 27 and to external circuitry via first memory bus 15.

In operation, data processor 10 provides a multiply/accumulate function. Input registers 14 and 16 provide a multiplicand and a multiplier input via data busses 15 and 18. Typically, one of the inputs represents a data value and the other input represents a coefficient value. After these inputs are loaded into registers 14 and 16, the data is coupled to multiplier 20. Multiplier 20 calculates a product of the first and second input values and presents a product output at the first and second outputs thereof. Multiplier 20 may perform a data formatting function to allow both fractional and integer number representations. Another common function which multiplier 20 may additionally perform includes sign bit control to effect either signed or positive unsigned number representation. Multiplier 20 may also perform an inversion of data to provide either a positive or negative product. After multiplier 20 provides a product, the product is stored in MSP/LSP form in product registers 22 and 24, respectively. The time required to provide an output product to registers 22 and 24 is one clock cycle after the input data is loaded into registers 14 and 16.

The operation of second stage 12 of data processor 10 is centered around ALU 26 which primarily adds the value in product registers 22 or 24 to a third input value to provide a multiply/accumulate operation. The third input value is provided by accumulator register 28. ALU 26 may also perform other functions such as logical ANDing, ORing, etc. to provide conventional ALU functions as well as addition. To provide a standard addition operation without a multiplication, an addend is loaded into product register 24 via multiplexor 21 and is selectively connected to ALU 26 via multiplexor 25 in the following clock cycle. The output of accumulator register 28 is connected to the second input of ALU 26 to provide the value from which the product is added or subtracted. The accumulated product output of ALU 26 is stored in accumulator register 28 via multiplexor 27. The output of ALU 26 can be written to data bus 15 via multiplexor 27, accumulator register 28 and bus driver circuit 30. Although the described architecture readily accomplishes repetitive multiply/accumulate operations, the architecture of FIG. 1 is not efficient for performing nonrepetitive calculations. For example, if the ALU output in the accumulator register 28 is immediately needed as an input to multiplier 20, the contents of accumulator register 28 must be clocked into input register 14 before the value is available to multiplier 20. To accomplish this preliminary step will take an entire clock cycle. Therefore, the accumulated product is not available immediately to use as a multiplier or a multiplicand in the multiplication. In other words, in a two stage processor as shown in FIG. 1, data in the second stage is not immediately available for use in the first stage. Because the output of the first stage is immediately available to the second stage, a multiply/accumulation is an efficient operation. However, an accumulation operation followed by a multiplication is not efficient. 
Also, because product registers 22 and 24 are hidden and cannot be read by data busses 15 and 18, reading and writing the output product is limited. Therefore, data processor 10 is generally unavailable during interrupt processing without losing the presently held product register data.

Shown in FIG. 2 is a data processor 35 in accordance with the present invention. Data processor 35 comprises a plurality of input registers 36 having an input connected to a memory or data bus 38 labeled "X Data Bus", and a plurality of input registers 39 having an input connected to a memory or data bus 40 labeled "Y Coefficient Bus". It should be readily apparent that all register circuits shown herein are of multiple bit size and may be of variable width. A first output of input registers 36 is connected to an input of a multiplexor circuit 41. Multiplexor circuit 41 has an output which is connected to an input of a bus driver circuit 42. An output of bus driver circuit 42 is connected to data bus 38. A second output of input registers 36 is connected to a first input of a multiplexor circuit 43. A third output of input registers 36 is connected to a first input of a multiplexor circuit 45. An output of multiplexor circuit 43 is connected to a first input of a multiply/accumulator circuit 49 labeled "X". A first output of input registers 39 is connected to a second input of multiplexor circuit 43. A second output of input registers 39 is connected to a second input of multiplexor circuit 45. A third output of input registers 39 is connected to a multiplexor circuit 47. An output of multiplexor circuit 47 is connected to an input of a bus driver circuit 48 which has an output connected to a data bus 40. An output of multiplexor 45 is connected to a second input of a multiply/accumulator circuit 49 labeled "Y". An output of multiply/accumulator circuit 49 labeled "P" is connected to a first input of a multiplexor circuit 51. Second and third inputs of multiplexor circuit 51 are connected to data bus 38 and data bus 40, respectively. An output of multiplexor circuit 51 is connected to an input of a plurality of accumulator registers 54. A first output of accumulator registers 54 is connected to an input of a multiplexor circuit 55. 
An output of multiplexor circuit 55 is connected to an input of an accumulator shifter circuit 56. An output of accumulator shifter circuit 56 is connected to a third input of multiplier/accumulator 49. A second output of accumulator registers 54 is fed back to a third input of multiplexor circuits 43 and 45 via a feedback path 57. Third and fourth outputs of accumulator registers 54 are connected to an input of multiplexor circuits 58 and 59, respectively. An output of multiplexor 58 is connected to an input of a shifter/limiter circuit 60. Similarly, an output of multiplexor 59 is connected to an input of a shifter/limiter circuit 61. An output of shifter/limiter circuit 60 is connected to an input of a bus driver circuit 64 which has an output connected to a data bus 38. An output of shifter/limiter circuit 61 is connected to an input of a bus driver circuit 65 which has an output connected to data bus 40.

In operation, processor 35 is capable of performing a multiply/accumulate operation in one clock cycle where a clock cycle is defined as the time between successive processor register loads. That is, the machine state of the processor changes once per clock cycle at the end of the clock cycle. In a single clock cycle, input register data is multiplied, accumulated with accumulator register data and stored in a predetermined accumulator register. An accumulator register is loaded with the output of the multiply/accumulator 49 at the end of the clock cycle. Simultaneously, the input registers 36 and 39 may be loaded from data busses 38 and 40, respectively, at the end of the clock cycle.
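The single-cycle behavior described above can be illustrated with a minimal register-level sketch. It is not the patented circuit: the class name `MacModel`, the use of a single register per bank, and the unbounded Python integers are all assumptions made for illustration. Each call to `step()` stands for one clock cycle in which the multiply/accumulate result and the next operands are latched together at the end of the cycle.

```python
class MacModel:
    """Toy model (hypothetical): one call to step() is one clock cycle."""

    def __init__(self):
        self.x = 0    # stands in for one X input register (36)
        self.y = 0    # stands in for one Y input register (39)
        self.acc = 0  # stands in for one accumulator register (54)

    def step(self, next_x=None, next_y=None):
        # Work done during the cycle uses the values currently held
        # in the registers.
        result = self.x * self.y + self.acc
        # All register loads happen together at the end of the cycle.
        self.acc = result
        if next_x is not None:
            self.x = next_x
        if next_y is not None:
            self.y = next_y
        return self.acc


def dot_product(xs, ys):
    """Sum of products using one step() per operand pair."""
    mac = MacModel()
    mac.x, mac.y = xs[0], ys[0]        # preload the first operand pair
    for nx, ny in zip(xs[1:], ys[1:]):
        mac.step(nx, ny)               # MAC overlaps the next operand load
    return mac.step()                  # final MAC, nothing left to load
```

In this sketch an N-element sum of products takes N calls to `step()` after the initial preload, mirroring the one-cycle-per-operation claim.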

Data is initially coupled to input registers 36 and 39 from an external source, from input registers 36 and 39 or from accumulator registers 54 via busses 38 and 40, respectively. Registers 36, 39 and 54 are coupled so that contents from any two of the three pluralities of registers are coupled to the first and second inputs of multiply/accumulator 49. Multiply/accumulator circuit 49 processes the numbers coupled to the X, Y and A inputs to provide an output at the end of a clock cycle to be clocked into a predetermined accumulator register 54 thereby replacing the previous value in register 54. It should be readily understood that the X and Y inputs of multiply/accumulator 49 represent multiplier inputs which are functionally reversible. All illustrated registers 36, 39 and 54 may be implemented by conventional edge triggered D-type flip-flops to prevent possible race conditions. Simultaneous to the processing of the three input operands by multiply/accumulator 49, external circuitry may be accessed to read in additional input operands which are read into input registers 36 and 39 for use in the immediately following clock cycle. Similarly, external circuitry may be accessed to write data from input registers 36 and 39 or accumulator registers 54 out to the external circuitry. The X data multiplexor 43 and Y data multiplexor 45 provide a continuous coupling of data between processor 35 registers 36, 39 and 54. As a result, processor 35 is able to perform repetitive multiply/accumulate operations in single clock cycles.

The short-time energy over N samples of a time sampled signal is conventionally defined as:

E = Σ_{n=0}^{N-1} x²(n)

Therefore, processor 35 may readily execute energy calculations by providing the same data in one input register to both X and Y inputs of multiply/accumulator 49 via multiplexors 43 and 45. In order to perform energy calculations with processor 10 of FIG. 1, both registers 14 and 16 would have to be loaded with the same data. However, whenever the same data is routed to multiple destinations, extra instruction bits or extra clock cycles are typically required.
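The energy calculation described above, a running sum of squared samples, can be sketched as follows. The function name `short_time_energy` and the variables `x_in` and `y_in` are illustrative assumptions; the point is only that a single register load per sample feeds both multiplier inputs.

```python
def short_time_energy(samples):
    """Sum of squared samples, one multiply/accumulate per sample."""
    acc = 0                      # accumulator register, cleared first
    for x0 in samples:           # one input register load per sample
        x_in = x0                # X input selects the register
        y_in = x0                # Y input selects the same register
        acc = x_in * y_in + acc  # single multiply/accumulate
    return acc
```

For example, `short_time_energy([1, -2, 3])` accumulates 1 + 4 + 9.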

Similarly, data may be coupled to input registers 36 and 39 to allow shared use of register data by both data processor 35 and external data busses 38 and 40. Feedback paths from the output of input registers 36 and 39, to be described in further detail below, selectively couple the output of input registers 36 and 39 via bus drivers 42 and 48, respectively, to data busses 38 and 40, respectively. As a result, input data which has been read from one memory location and is stored at the end of a clock cycle in one of input registers 36 or 39 may be fed back in a following clock cycle to a respective data bus and stored in the same or a different memory location. One form of the shared use of input registers 36 and 39 is simultaneous use of the registers by multiply/accumulator 49 and external memory in the same clock cycle. The feedback paths around input registers 36 and 39 also allow implementation of functions such as time shifting sampled data in memory or replacing an element in a memory location with a new element. The latter function is commonly referred to as a "Z" delay function where the Z transform "Z^-1" represents a time delay of one data sample.

Versatility of operation of data processor 35 allows efficient implementation of non-repetitive signal processing algorithms. The Y input of multiply/accumulator 49 can receive data from any of the X or Y input registers 36 and 39 as well as any of accumulator registers 54 via accumulator feedback path 57. The X input of multiply/accumulator 49 can also receive data from any of the X or Y input registers 36 and 39 as well as any of the accumulator registers 54 via accumulator feedback path 57. Feedback path 57 provides the ability to subsequently use the accumulated product result of the previous clock cycle as a multiplier or a multiplicand in a subsequent clock cycle. Subsequent use may include immediate use of the multiplier/accumulator output operand in an immediately following clock cycle. As a result, an extra clock cycle of delay associated with data processor 10 of FIG. 1 has been eliminated. The use of feedback path 57 allows standard formulas such as a power series expansion to be implemented quickly and efficiently because the previous Nth power of a number can be immediately multiplied by that number which is still stored in one of the input registers to provide the (N + 1)th power as an output to be stored in accumulator register 54.
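The power series use of the feedback path can be sketched as below. This is an illustrative model only: the function name `powers_via_feedback` is hypothetical, word widths and overflow are ignored, and the accumulate input is assumed to be zeroed so that each cycle is a pure multiplication.

```python
def powers_via_feedback(x, n):
    """Return [x**1, ..., x**n], one multiply per additional power."""
    acc = x                 # accumulator preloaded with the first power
    x0 = x                  # the number stays in an input register
    powers = [acc]
    for _ in range(n - 1):
        x_in = acc          # previous result fed back as one operand
        y_in = x0           # other operand from the input register
        acc = x_in * y_in + 0   # accumulate input assumed zeroed
        powers.append(acc)
    return powers
```

Each pass through the loop stands for one clock cycle; no intermediate cycle is spent moving the previous power back into an input register.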

The "A" input of multiply/accumulator 49 is the previous accumulator value in one of the accumulator registers 54 which is coupled to accumulator shifter 56 via multiplexor 55. Accumulator shifter 56 can optionally pre-shift the data to the left or right for scaling purposes. Accumulator shifter 56 may also provide a zero function and couple all zeroes to the "A" input of multiply/accumulator 49 so that only a multiplication is performed by multiply/accumulator 49. The data coupled to the "A" input of multiply/accumulator 49 via accumulator shifter 56 may also be inverted by shifter 56 so that a "product minus accumulate" operation is effected. Accumulator registers 54 may also be loaded with data from the X data bus 38 and the Y data bus 40. Accumulator registers 54 may also be read out to the X and Y data busses 38 and 40, respectively, for storage in external memory via multiplexors 58 and 59 and shifter/limiter circuits 60 and 61, respectively. Generally, one shifter/limiter circuit is associated with each data bus. Multiplexors 58 and 59 select a predetermined one of accumulator registers 54 for shifter/limiter circuits 60 and 61, respectively. Shifter/limiter circuits 60 and 61 perform data shifting on the respective inputs followed by an overflow limiting function. This allows arithmetic scaling to be performed on the values provided by multiplexors 58 and 59 read from accumulator registers 54 before the values are provided to external memory via busses 38 and 40, respectively. Because the shifting operation may produce arithmetic overflows, shifter/limiter circuits 60 and 61 also provide an overflow limiting feature commonly called data saturation. If an overflow of data from accumulator registers 54 coming out of the shifter portion of either circuit 60 or 61 is detected, a limiter portion of circuits 60 or 61 substitutes a maximum positive or negative constant onto the respective data bus to limit the magnitude of the incurred error. Otherwise, passing the overflowed data on to the external busses results in a large error. Shifter/limiter circuits 60 and 61 provide for much lower errors and minimize the occurrence of an unstable condition encountered in signal processing digital filters commonly known as "limit cycles".

In one form, shifter/limiter circuits 60 and 61 may be implemented with conventional shifter circuits which shift data received from accumulator registers 54 via multiplexors 58 and 59, respectively. If a right shift is performed, no overflow can occur since the lower bits are discarded. If a left shift is performed, an overflow condition may occur if the upper bits discarded contain any significant information. An overflow detector detects if the upper bits discarded by the shifter contain significant bits or just copies of the sign bit of the data. If there is no overflow condition, all of the upper bits discarded by the shifter will equal the sign bit of the data provided to the external data bus. If there is an overflow condition, at least one of the upper bits discarded by the shifter will not equal the sign bit of the data provided to the external data bus. The overflow detector may be implemented by conventional logic circuits. If an overflow occurs, a maximum positive (01111...1) or negative (10000...0) constant is substituted onto the appropriate shifter/limiter output. The sign of the substituted constant is equal to the sign of the selected accumulator register 54. The resulting output of shifter/limiters 60 and 61 is a shifted and limited version of the selected accumulator register.
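The shift-then-limit behavior described above can be sketched in a few lines. The function name `shift_and_limit`, the default 16-bit output width, and the convention that a positive `shift` means a left shift are assumptions for illustration; the input is also assumed to already fit in `width` signed bits, so that a right shift can never overflow.

```python
def shift_and_limit(value, shift, width=16):
    """Shift a signed value, then saturate it to a signed `width`-bit word.

    shift > 0 shifts left, shift < 0 shifts right (arithmetic).
    """
    if shift >= 0:
        shifted = value << shift
    else:
        shifted = value >> -shift          # lower bits are discarded
    max_pos = (1 << (width - 1)) - 1       # 0111...1
    max_neg = -(1 << (width - 1))          # 1000...0
    # Overflow: the result no longer fits, i.e. some discarded upper bit
    # differs from the sign bit of the word that would be output.
    if shifted > max_pos or shifted < max_neg:
        # The substituted constant takes the sign of the source value.
        return max_neg if value < 0 else max_pos
    return shifted
```

Without the limiter, a left shift of 0x5000 by one bit in a 16-bit word would wrap to a large negative number; with it, the output is clamped to the largest positive value, which keeps the error small.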

Bus driver circuits 64 and 65 may be implemented using conventional driver circuits. Driver circuits 64 and 65 are controlled by external logic such that only one register or memory is utilizing the bus at any given time.

Shown in FIG. 3 is another embodiment of the present invention illustrating a data processor 35' analogous to data processor 35 of FIG. 2 and which utilizes feedback between the output of multiply/accumulator 49 and the inputs of X and Y input registers 36 and 39. The data processor of FIG. 3 is otherwise identical to the data processor of FIG. 2 and utilizes the same numbered elements for ease of illustration with the exception that feedback path 57 has been replaced by a feedback path 67 from a second output of multiply/accumulator 49 to the input of X and Y input registers 36 and 39 via multiplexor circuits 68 and 69, respectively. As illustrated in FIG. 3, accumulator register 54 now only has three outputs instead of four outputs. Additionally, multiplexors 43 and 45 only have two inputs each as illustrated in FIG. 3 rather than having three inputs each. In another form, feedback path 67 may be coupled to only one of input registers 36 or 39 via only one of multiplexor circuits 68 or 69, respectively. Feedback path 67 may selectively couple the output of multiply/accumulator 49 to either of input registers 36 or 39 or to both. From input registers 36 and 39, the output of multiply/accumulator 49 may be coupled back to the first or second input or to both inputs of multiply/accumulator 49. The output of multiply/accumulator 49 may also be written to external memory after being stored in input registers 36 and 39 as described below in further detail.

Both data processors 35 and 35' of FIGS. 2 and 3, respectively, provide distinct advantages over processor 10 of FIG. 1. Data processors 35 and 35' are more efficient and flexible in their implementation of signal processing algorithms as discussed above. Feedback paths 57 and 67 of data processors 35 and 35', respectively, allow the output of multiply/accumulator 49 to be coupled to one or both inputs thereof without the use of data busses 38 and 40. As a result, data busses 38 and 40 are simultaneously available to load new operands into input registers 36 and 39, respectively. In data processor 10, however, the same operation would require the use of data bus 15 thereby precluding the use of the bus for loading input operands.

In the illustrated form, data processor 35 of FIG. 2 provides distinct advantages over data processor 35' of FIG. 3 with respect to overflow conditions and input data storage. Multiplier products typically require two times the number of register bits for storage compared to multiplier and multiplicand operands. Therefore, the size of accumulator registers 54 is typically twice as large as that of input registers 36 and 39. Additionally, accumulator registers 54 may provide extra upper data bits to provide an accumulator extension to accommodate word growth in repetitive multiply/accumulate operations. Data processor 35 of FIG. 2 provides feedback path 57 from accumulator registers 54. The larger size of accumulator registers 54 allows the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. It is desirable to minimize errors if an accumulator overflow has occurred. Typically, accumulator registers may be tested for overflow before the accumulator register value is reused by feedback path 57. Shifter/limiters 60 and 61 also allow overflows to be detected and limited before data is written to external memory. Data processor 35' of FIG. 3 provides feedback path 67 from multiply/accumulator 49. The smaller size of input registers 36 and 39 does not allow the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. Therefore, the possibility of roundoff and overflow errors is greatly increased. Typically, processor input registers do not provide the ability to test for overflow errors. Feedback path 67 may also be used to store a multiply/accumulator 49 result in input registers 36 or 39 which is then written out to external memory via the respective multiplexor 41 or 47 and bus driver 42 or 48. Since no shifter/limiter functions are provided in either input register feedback path, overflows cannot be detected and limited before data is written to external memory.

A second advantage of processor 35 over processor 35' is due to the fact that signal processing algorithms typically require more input operands than output operands. One example is the typical multiply/accumulate operation where two input operands are required from external memory. Feedback path 57 of processor 35 does not require the use of input registers 36 or 39 to store the output of multiply/accumulator 49 thereby preserving useful storage means for input operands. Since processor 35 uses only accumulator registers 54 to store multiply/accumulator 49 results, there is no contention for input registers when input operands are required from external memory. Feedback path 67 of processor 35' however requires the use of at least some of input registers 36 and 39 which reduces the amount of useful storage registers for input operands. Processor 35' may present a contention problem since input registers 36 and 39 may be written from either the memory busses 38 and 40, respectively, or the multiply/accumulate feedback path 67. This contention problem may lessen the efficiency of processor 35' when feedback path 67 is used. Therefore, processor 35 of FIG. 2 is a preferred embodiment of the present invention over processor 35' of FIG. 3.

A common application of data processors 35 and 35' is in digital filtering. Input registers 36 and 39 are loaded with data which is typically time sampled values stored in a work space of a filter commonly implemented as a digital delay line. A plurality of consecutive stages in external memory contain consecutive time samples of data. Also stored in a consecutive time sequence in external memory are coefficient values which form an impulse response of the filter. Therefore, data describing the characteristics of the time and frequency response of the digital filter is stored in external memory along with sampled signal values. A plurality of repetitive data loads are executed by reading memory and loading input registers 36 and 39 to couple data values and accompanying coefficient values for multiplication and accumulation. When proceeding from filter output sample time N to sample time (N + 1), an effective time shift of the sampled data in external memory must be effected. A time shift of sampled data may be effected in external memory by executing a move of data from memory to a register and then back to memory at a different location. However, with the architecture of processor 10 of FIG. 1, such a time shift of sampled data in external memory will require a series of data movement operations after the filtering operation and will require at least two cycles per data sample.

In the illustrated forms, the present invention reduces overhead associated with a time shift operation on sampled data by providing the ability not only to write input registers 36 and 39 but also to read both input registers. At the time a filter operation is being performed, a data sample and a coefficient value are coupled to input registers 36 and 39, respectively. During the cycle in which newly coupled data is being multiplied and accumulated, input registers 36 and 39 may be read back to memory to an appropriate location which effects a time shift of the sampled data. Such a location is typically one address removed in sequential memory from where the data originated, thereby effecting one unit of digital time delay. As a result, the next filter calculation involves sample time (N + 1) of the filter. Using processor 10 of FIG. 1, if the multiply/accumulate throughput is one cycle per tap for an N-tap filter, the time required to calculate the filter would be approximately N cycles plus a few overhead cycles. If a time shift were effected afterwards, it would take another 2N cycles in order to perform the data shift. The complete process takes at least 3N cycles with most of the time being consumed by shifting data from sample time N to sample time (N + 1) rather than calculating the filter. This is a primary disadvantage of the structure of input registers 14 and 16 associated with processor 10 of FIG. 1 and results from the inability to read the samples stored in registers 14 and 16. Additionally, of the two overhead cycles spent on each sample, one duplicates effort: the sample is read from memory again even though it was already read during the preceding filter calculation.

In the illustrated form, the present invention provides the ability to read back the contents of input registers 36 and 39 and thereby avoid reading each memory location twice. Therefore, overhead is reduced from 2N cycles to N cycles resulting in a total filter calculation time of 2N rather than 3N. A further advantage of the present invention includes the fact that all of the input registers 36 and 39 may be read as well as written on an interrupt or a break in processing execution. Therefore, the contents of input registers 36 and 39 may be saved in external memory so that processors 35 and 35' may be used in an interrupt routine for another function. Upon completion of the interrupt, the data may then be restored from external memory and the filter calculation continued without significant additional overhead. By virtue of the system architecture of processor 10 of FIG. 1, processor 10 is generally unavailable during an interrupt because the processor registers cannot be easily saved and restored.
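The overlap of filtering and time shifting described above can be sketched as a single pass over the delay line. The function name `fir_with_overlapped_shift`, the list `mem` standing in for external sample memory (newest sample at index 0), and the oldest-first traversal order are assumptions for illustration. The sketch models data movement only and does not model bus timing or cycle counts.

```python
def fir_with_overlapped_shift(mem, coeffs):
    """Compute one filter output and shift the delay line in one pass.

    mem[k] holds the sample delayed by k; it is modified in place.
    """
    n_taps = len(coeffs)
    acc = 0
    for k in range(n_taps - 1, -1, -1):   # oldest sample first
        x0 = mem[k]             # sample read into an input register
        y0 = coeffs[k]          # coefficient read into an input register
        acc = x0 * y0 + acc     # multiply/accumulate
        if k + 1 < n_taps:
            mem[k + 1] = x0     # register read back, one address over
    return acc
```

After the call, every sample has moved one position toward the end of `mem` without being fetched from memory a second time, and index 0 is free to receive the next input sample.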

There are many algorithms in which the general capability of being able to read data present in an input register will save cycles of execution time as opposed to accessing memory for a second time to obtain the data. Typically, the overhead associated with systems such as processor 10 of FIG. 1 is apparent when accessing data in memory with addressing means. If input registers 14 and 16 cannot be read out to data busses 15 and 18, respectively, addressing means (not shown) may be required to access the data a second time from external memory. However, the addressing pointer may have already been updated so that the addressing means no longer points at the data for a second access. If the addressing pointer no longer points to the proper location to access the data for a second time, another address pointer or extra address pointer modification is required. This requires additional hardware or clock cycles. In comparison, since processors 35 and 35' of FIGS. 2 and 3, respectively, have the capability of directly reading the data in input registers 36 and 39, no need for an additional address pointer or address pointer modification exists. The flow of data from input registers 36 and 39 to external memory is controlled by multiplexors 41 and 47, respectively, and bus driver circuits 42 and 48, respectively.

An illustration of the shared use of input registers by processors 35 or 35' and a data bus will be given below for a conventional infinite impulse response (IIR) filter such as a biquadratic second order section digital filter. However, it should be apparent that the present invention applies equally to finite impulse response (FIR) filters and other digital signal processing algorithms. Shown in FIG. 4 is a conventional structure of a second order biquadratic filter 70 commonly implemented in software. Shown in Table 1 in the attached appendix is a software example of a calculation of filter 70 by either processor 35 or 35'. Filter 70 of FIG. 4 generally comprises adder circuits 71 and 72, multiplier circuits 74, 75, 76 and 77 and data memory storage locations 79 and 80. The equations which filter 70 implements are:

W(n) = X(n) - a₁W(n-1) - a₂W(n-2)
Y(n) = W(n) + b₁W(n-1) + b₂W(n-2).

An input signal X(n) is coupled to a first input of adder 71 and an output signal Y(n) is provided by an output of adder 72. An intermediate signal W(n) is formed and stored in data memory storage locations 79 and 80 with a digital time delay of one and two, respectively. Multipliers 74, 75, 76 and 77 function to multiply a respective data input with a designated coefficient value which is stored in coefficient memory storage (not shown). The coefficient values determine the impulse response of the digital filter. To implement filter 70 by data processors 35 or 35', input operands are first coupled to input registers 36 and 39. The value W(n-2) stored in location 80 is coupled to an X input register 36 labeled "X0" and coefficient (-a₂) is coupled from coefficient memory storage to a Y input register 39 labeled "Y0". The input value X(n) is assumed to be preloaded in accumulator register 54 and labeled "A". The multiply/accumulate operation is then performed by multiply/accumulator 49 and new operands are loaded into input registers 36 and 39 from external memory for use in the next clock cycle.
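The two equations of filter 70 can be restated as a short software sketch. This is only an illustration of the arithmetic, not the processor's instruction sequence; the function name `biquad_step` and its argument names are hypothetical and do not appear in the specification.

```python
def biquad_step(x, w1, w2, a1, a2, b1, b2):
    """Compute one sample of a second order biquadratic section.

    x  is the input sample X(n).
    w1 is W(n-1) (memory storage location 79 in FIG. 4).
    w2 is W(n-2) (memory storage location 80 in FIG. 4).
    Returns (y, new_w1, new_w2).
    """
    w = x - a1 * w1 - a2 * w2   # W(n) = X(n) - a1*W(n-1) - a2*W(n-2)
    y = w + b1 * w1 + b2 * w2   # Y(n) = W(n) + b1*W(n-1) + b2*W(n-2)
    # Time shift of the delay line: W(n) becomes W(n-1), W(n-1) becomes W(n-2).
    return y, w, w1
```

Each call performs four multiplications, which matches the four multiplier circuits 74 through 77 of filter 70.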

Table 1 illustrates on a step by step basis what ALU operation is being executed, what data and coefficient transfer is occurring between external memory and input registers 36 and 39 and comments to indicate what mathematical operation is occurring. Five operation cycles are required for execution of filter 70 which is the minimum number possible to preload the first operands and perform four multiplications with a single multiplier ALU. Input registers X0 and X1 of registers 36 and register Y0 of registers 39 serve as input pipeline registers. Shown in the dotted box of Table 1 is an example of the shared use feature of input register 36 by a data bus and an ALU. Initially, signal W(n-2) is read from memory storage location 80 into X input register X0. During the first ALU operation, the signal W(n-1) is read from memory storage location 79 into X input register X1. During the following clock cycle, the contents of input register X1 are written back to memory storage location 80 representing signal W(n-2) thereby effecting a time shift of data in filter 70 from memory storage location 79 to 80. Simultaneous to the use of input register X1 by memory storage location 80, the ALU operation is using input register X1 as the multiplicand input of multiply/accumulator 49. At the end of the second ALU operation, a value for signal W(n) has been calculated and stored in the accumulator register labeled "A" illustrated in Table 1. This value is also used as the third input to multiply/accumulator 49 for the third ALU operation. The value of A present in accumulator register 54 and equal to W(n) is stored away as W(n-1) in memory storage location 79 during the third ALU operation. Therefore, both values in memory storage locations 79 and 80 are read into input registers 36 and 39 and new values are written back into memory storage locations 79 and 80 to effect a time shift.
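The five cycle sequence described above can be modeled in software as follows. Table 1 is in the appendix and is not reproduced here, so this is a sketch reconstructed from the prose only. The function name `biquad_five_cycles`, the dictionary keys, and the order in which the b₂ and b₁ terms are accumulated in the third and fourth ALU operations are assumptions, as is the choice to model the store of A as taking the value held at the start of the third ALU operation.

```python
def biquad_five_cycles(x, mem, coef):
    """Model one pass of filter 70 as five cycles.

    mem  maps the reference numerals 79 and 80 to W(n-1) and W(n-2).
    coef maps "a1", "a2", "b1", "b2" to the coefficient values.
    Returns (Y(n), cycle count). mem is updated in place.
    """
    cycles = 0

    # Cycle 1: preload. X(n) is in accumulator A, W(n-2) in X0, -a2 in Y0.
    A = x
    X0 = mem[80]
    Y0 = -coef["a2"]
    cycles += 1

    # Cycle 2: first ALU operation. W(n-1) and -a1 are loaded in parallel.
    A = A + X0 * Y0
    X1 = mem[79]
    Y0 = -coef["a1"]
    cycles += 1

    # Cycle 3: second ALU operation uses X1 while the data bus writes X1
    # back to location 80 (shared use of the input register).
    A = A + X1 * Y0          # A now holds W(n)
    mem[80] = X1             # W(n-1) becomes the new W(n-2)
    Y0 = coef["b2"]          # assumed ordering of the b terms
    cycles += 1

    # Cycle 4: third ALU operation. A (= W(n)) is stored to location 79.
    mem[79] = A
    A = A + X0 * Y0          # X0 still holds the old W(n-2)
    Y0 = coef["b1"]
    cycles += 1

    # Cycle 5: fourth ALU operation completes Y(n).
    A = A + X1 * Y0
    cycles += 1
    return A, cycles
```

The point of the model is that X0 and X1 are each read from memory once and then reused, both as multiplier operands and as the source of the write back, so no second memory read of W(n-1) or W(n-2) is needed.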

Typically, to create higher order digital filters, a plurality of biquad filters such as filter 70 are cascaded as shown in FIG. 5. Repetitive software may be used to cascade filters directly. Time savings can be realized by overlapping the operand preload clock cycle with the last ALU operation clock cycle of the previous filter as shown in Table 2 in the attached appendix. As a result, an execution time of 4N+1 clock cycles is required for a cascade of N biquad filters which is the optimal time for a single multiplier ALU. Since multiply/accumulator 49 and both data busses 38 and 40 are busy all 4N cycles, optimal execution time is not possible without the ability to simultaneously use input registers 36 and 39 between an ALU and a data bus. Signal values W(n-1) and W(n-2) for each filter are stored in data memory storage locations such as storage locations 79 and 80 of filter 70. The values for each filter are indicated in Table 2 by use of subscripts such as W3(n-1) for filter F3. Similarly, coefficients -a₁, -a₂, b₁ and b₂ are illustrated for each filter by additional subscripts such as -a₃₁ representing coefficient -a₁ for filter F3. By analyzing the time required to effect the total filter operation illustrated in Table 2, it should be apparent that the clock cycles which prefetch operands for each filter after the first filter overlap execution cycles of previous filters to accomplish 4N stage filter execution in 4N+1 cycles. Again, fast throughput is realizable only because of the ability to share input registers between an ALU and a data bus.
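A cascade of biquad sections and the cycle counts discussed above can be sketched as follows. The function names are hypothetical. The 4N+1 figure is taken from the text; the 5N figure for a cascade without overlap is an inference from the five cycles stated for a single section, not a number given in the specification.

```python
def cascade_step(x, states, coefs):
    """Pass one sample through N cascaded biquad sections.

    states is a list of [W(n-1), W(n-2)] pairs, one per section,
    updated in place. coefs is a list of (a1, a2, b1, b2) tuples.
    """
    signal = x
    for state, (a1, a2, b1, b2) in zip(states, coefs):
        w1, w2 = state
        w = signal - a1 * w1 - a2 * w2    # W(n) for this section
        signal = w + b1 * w1 + b2 * w2    # Y(n) feeds the next section
        state[0], state[1] = w, w1        # time shift of the delay line
    return signal


def cycles_with_overlap(n_sections):
    """One preload cycle, then four multiply/accumulate cycles per section."""
    return 4 * n_sections + 1


def cycles_without_overlap(n_sections):
    """Assumed baseline: each section pays its own preload cycle."""
    return 5 * n_sections
```

For a cascade of 8 sections, for example, the overlapped schedule takes 33 cycles against 40 for the assumed baseline, a saving of one cycle for each section after the first.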

By now it should be apparent that shared use of input registers by a processor and a data bus allows the processor to operate at optimal speed. The processor structure of the present invention makes calculation of an accumulated product possible in a single clock cycle as opposed to multiple clock cycles. Since two data busses are coupled to each of processors 35 and 35', data values and coefficient values may be coupled to either processor 35 or 35' to ensure that processor operating speed is not adversely affected. The processor architecture of the present invention also minimizes storage register requirements. By virtue of a feedback path between the output and input of a multiplier/accumulator circuit, an accumulated product may be immediately used as an input operand for a successive multiplication without an extra overhead cycle. As a result, a very time efficient and flexible processor has been provided.
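The benefit of the feedback path can be illustrated with a small model in which the accumulator output is reused as a multiplier input on the very next cycle. This is a sketch under stated assumptions: the names `mac` and `power_by_feedback` are hypothetical, and computing a power of x is chosen only as a simple example of a calculation whose every step depends on the previous result.

```python
def mac(x_in, y_in, acc, accumulate=True):
    """Model a single cycle multiply with selective accumulation."""
    product = x_in * y_in
    return acc + product if accumulate else product


def power_by_feedback(x, n):
    """Compute x**n (n >= 1), feeding the accumulator back as an operand.

    Returns (result, cycles). Each loop iteration models one clock
    cycle in which the prior result is a multiplier input.
    """
    acc = x
    cycles = 0
    for _ in range(n - 1):
        # Feedback: the previous result is used directly as a multiplier
        # input, with no intermediate store to and reload from memory.
        acc = mac(acc, x, acc, accumulate=False)
        cycles += 1
    return acc, cycles
```

Without such a path, each iteration would need at least one additional cycle to move the result from the output register to an input register before the next multiplication could begin.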

Claims

1. A digital signal processor for implementing algorithms by providing a product of first and second input operands selectively accumulated with a third input operand, comprising: first input storage means having an input for selectively receiving and storing the first input operand, and an output; second input storage means having an input for selectively receiving and storing the second input operand, and an output; multiplier/accumulator means having a first input selectively coupled to either the output of the first input storage means or the output of the second input storage means, a second input selectively coupled to either the output of the first input storage means or the output of the second input storage means, a third input selectively coupled to a third input operand, and an output for providing the product with selective accumulation during a single clock cycle in response to receipt of said first, second and third input operands, said clock cycle being an amount of time between successive storage loads of the first and second input storage means; and output storage means having an input selectively coupled to the output of the multiplier/accumulator means, and an output selectively coupled to at least a predetermined one of the first, second or third inputs of the multiplier/accumulator means, for implementing digital signal processing algorithms.

2. The digital signal processor of claim 1 further comprising: first data shifting and limiting means having an input coupled to the output of the output storage means, and an output coupled to a first data bus, for selectively shifting data contents of the output storage means and limiting the magnitude of said data contents; and second data shifting and limiting means having an input coupled to the output of the output storage means, and an output coupled to a second data bus, for selectively shifting data contents of the output storage means and limiting the magnitude of said data contents.

3. The digital signal processor of claim 1 further comprising: data shifting means having an input coupled to the output of the output storage means and an output coupled to the third input of the multiplier/accumulator circuit, for selectively shifting predetermined bits of the output of the output storage means.

4. A method of providing a digital signal processor for performing an arithmetic operation, comprising the steps of: selectively coupling an output of at least a predetermined one of a first or second input storage means or an output storage means to at least a predetermined one of first and second inputs of a multiplier/accumulator circuit; selectively coupling an output of the output storage means to a third input of the multiplier/accumulator circuit; multiplying the first and second inputs of the multiplier/accumulator circuit to provide a product and selectively accumulating the third input with the product to provide an output; and selectively storing the output of the multiplier/accumulator circuit in an output storage means, said method being performed in a single clock cycle of the processor, said clock cycle being an amount of time between successive storage loads of the first and second input storage means.

5. The method of claim 4 wherein the selective coupling of the output of the output storage means to the third input of the multiplier/accumulator circuit is provided by shifting means coupled between the output storage means and the multiplier/accumulator, for selectively shifting predetermined bits of the output of the output storage means.

6. In a data processor for receiving an input operand to be coupled from external circuitry via a data bus to an arithmetic logic unit, circuit means for storing an operand for shared use by both the arithmetic logic unit and the data bus, comprising: input storage means having an input terminal coupled to the data bus for selectively receiving the input operand, a first output terminal coupled to the arithmetic logic unit for selectively coupling an output of the input storage means to the arithmetic logic unit, and a second output terminal coupled to the input terminal, for selectively coupling the output of the input storage means back to said data bus.

7. The circuit means of claim 6 further comprising: data bus driver means having an input coupled to the second output terminal and an output coupled to the input terminal, for selectively driving the output of the input storage means onto the data bus.

8. The circuit means of claim 6 further comprising: first multiplexor means having a first input coupled to the input terminal, and a second input coupled to an output of the arithmetic logic unit, for selectively coupling either an operand from the data bus or the output of the arithmetic logic unit to the input storage means.

9. A method of shared use of an input storage means in a data processor between an arithmetic logic unit and a data bus, comprising the steps of: selectively coupling an input operand from the data bus to the input storage means; and selectively coupling an output of the input storage means to the arithmetic logic unit while selectively coupling the output of the input storage means back to the data bus.

10. The method of claim 9 further comprising the step of: selectively storing an output of the arithmetic logic unit to the input storage means before coupling the output of the input storage means to the data bus.