WO1986002181A1 - A digital signal processor for single cycle multiply/accumulation


Info

Publication number
WO1986002181A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
output
storage means
data
coupled
Application number
PCT/US1985/001423
Other languages
French (fr)
Inventor
Kevin Lee Kloker
Original Assignee
Motorola, Inc.
Application filed by Motorola, Inc.
Priority to KR1019860700311A (KR860700300A)
Publication of WO1986002181A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804 Details
    • G06F2207/3808 Details concerning the type of numbers or the way they are handled
    • G06F2207/3856 Operand swapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/499 Denomination or exception handling, e.g. rounding or overflow
    • G06F7/49905 Exception handling
    • G06F7/4991 Overflow or underflow
    • G06F7/49921 Saturation, i.e. clipping the result to a minimum or maximum value

Definitions

  • This invention relates generally to signal processors, and more particularly, to a digital signal processor capable of a multiply/accumulation in a single clock cycle.
  • Signal processors which utilize an ALU for multiplying two numbers and selectively adding the product with a third number are very common in the signal processing art.
  • Typical processors utilize two stages in which a product is formed in the first stage and an accumulation is made in the second stage.
  • An example of such a processor is taught by Glenn Culler in U.S. Patent No. 4,287,566 entitled "Array Processor With Parallel Operations Per Instruction". Such processors require a minimum of two clock cycles to provide an output.
  • An object of the present invention is to provide an improved digital signal processor for single cycle multiply/accumulation operations.
  • Another object of the present invention is to provide an improved data processor capable of complete single cycle operation.
  • In one form, the invention provides a digital signal processor for implementing algorithms by providing product accumulations.
  • A product of first and second input operands is selectively accumulated with a third input operand.
  • First input storage means having an input coupled to a first data bus are used to selectively store the first input operand.
  • Second input storage means having an input coupled to a second data bus are used to selectively store the second input operand.
  • A multiplier/accumulator having first and second inputs for receiving the first and second operands provides a product selectively accumulated with a third input operand coupled to a third input thereof. The accumulated product is provided in a single clock cycle of the processor in response to receipt of the first, second and third input operands.
  • An output storage means has an input selectively coupled to the output of the multiplier/accumulator or either memory bus.
  • An output of the output storage means is selectively coupled to at least a predetermined one of the first, second or third inputs of the multiplier/accumulator for implementing a variety of differing algorithms. Repetitive complete multiply/accumulation operations may be executed with each operation taking only one clock cycle.
  • FIG. 1 illustrates in block diagram form a digital signal processor structure known in the art;
  • FIG. 2 illustrates in block diagram form a digital signal processor structure in accordance with a preferred embodiment of the present invention;
  • FIG. 3 illustrates in block diagram form another embodiment of the digital signal processor of FIG. 2;
  • FIG. 4 illustrates in block diagram form a biquadratic digital filter structure implementable by the digital signal processors of FIGS. 2 and 3;
  • FIG. 5 illustrates in block diagram form a cascaded digital filter structure implementable by the digital signal processors of FIGS. 2 and 3.
  • Detailed Description of the Invention
  • Shown in FIG. 1 is a representative data processor 10 known in the art which generally comprises stages 11 and 12.
  • First stage 11 comprises a first input register 14 having an input connected to a first data bus 15 labeled "X Data Bus".
  • A second input register 16 has an input connected to a second data bus 18 labeled "Y Coefficient Bus".
  • An output of input register 14 is connected to a first input of a multiplier circuit 20, and an output of input register 16 is connected to a second input of multiplier circuit 20.
  • Multiplier 20 has first and second outputs respectively connected to an input of a product register 22 and an input of a multiplexor circuit 21.
  • A second input of multiplexor circuit 21 is connected to first data bus 15.
  • An output of multiplexor 21 is connected to an input of a product register 24.
  • Product registers 22 and 24 represent the most significant product (MSP) and least significant product (LSP), respectively, of multiplier 20.
  • Second stage 12 comprises a multiplexor 25, an ALU 26, a multiplexor 27, an accumulator register 28, and a bus driver circuit 30.
  • Product registers 22 and 24 of first stage 11 each have an output connected to the first and second inputs of multiplexor 25, respectively.
  • An output of multiplexor 25 is connected to a first input of ALU 26.
  • An output of ALU 26 is connected to a first input of multiplexor circuit 27.
  • A second input of multiplexor circuit 27 is connected to first data bus 15.
  • An output of multiplexor circuit 27 is connected to an input of accumulator register 28.
  • A first output of accumulator register 28 is connected to a second input of ALU 26.
  • A second output of accumulator register 28 is connected to an input of bus driver circuit 30.
  • An output of bus driver circuit 30 is connected to the inputs of input registers 14 and 16, to the second inputs of multiplexors 21 and 27 and to external circuitry via first data bus 15.
  • Data processor 10 provides a multiply/accumulate function.
  • Input registers 14 and 16 provide a multiplicand and a multiplier input via data busses 15 and 18. Typically, one of the inputs represents a data value and the other input represents a coefficient value. After these inputs are loaded into registers 14 and 16, the data is coupled to multiplier 20.
  • Multiplier 20 calculates a product of the first and second input values and presents a product output at the first and second outputs thereof. Multiplier 20 may perform a data formatting function to allow both fractional and integer number representations.
  • Multiplier 20 may additionally perform sign bit control to effect either signed or positive unsigned number representation. Multiplier 20 may also perform an inversion of data to provide either a positive or negative product. After multiplier 20 provides a product, the product is stored in MSP/LSP form in product registers 22 and 24, respectively. The time required to provide an output product to registers 22 and 24 is one clock cycle after the input data is loaded into registers 14 and 16.
  • Second stage 12 of data processor 10 is centered around ALU 26 which primarily adds the value in product registers 22 or 24 to a third input value to provide a multiply/accumulate operation.
  • The third input value is provided by accumulator register 28.
  • ALU 26 may also perform other functions such as logical ANDing, ORing, etc. to provide conventional ALU functions as well as addition.
  • An addend is loaded into product register 24 via multiplexor 21 and is selectively connected to ALU 26 via multiplexor 25 in the following clock cycle.
  • The output of accumulator register 28 is connected to the second input of ALU 26 to provide the value to which the product is added or from which it is subtracted.
  • The accumulated product output of ALU 26 is stored in accumulator register 28 via multiplexor 27.
  • The output of ALU 26 can be written to data bus 15 via multiplexor 27, accumulator register 28 and bus driver circuit 30.
  • The structure of FIG. 1 is not efficient for performing nonrepetitive calculations. For example, if the ALU output in the accumulator register 28 is immediately needed as an input to multiplier 20, the contents of accumulator register 28 must be clocked into input register 14 before the value is available to multiplier 20. This preliminary step takes an entire clock cycle. Therefore, the accumulated product is not available immediately to use as a multiplier or a multiplicand in the multiplication. In other words, in a two stage processor as shown in FIG.
  • Data processor 35 comprises a plurality of input registers 36 having an input connected to a memory or data bus 38 labeled "X Data Bus", and a plurality of input registers 39 having an input connected to a memory or data bus 40 labeled "Y Coefficient Bus". It should be readily apparent that all register circuits shown herein are of multiple bit size and may be of variable width.
  • A first output of input registers 36 is connected to an input of a multiplexor circuit 41.
  • Multiplexor circuit 41 has an output which is connected to an input of a bus driver circuit 42.
  • An output of bus driver circuit 42 is connected to data bus 38.
  • A second output of input registers 36 is connected to a first input of a multiplexor circuit 43.
  • A third output of input registers 36 is connected to a first input of a multiplexor circuit 45.
  • An output of multiplexor circuit 43 is connected to a first input of a multiply/accumulator circuit 49 labeled "X".
  • A first output of input registers 39 is connected to a second input of multiplexor circuit 43.
  • A second output of input registers 39 is connected to a second input of multiplexor circuit 45.
  • A third output of input registers 39 is connected to a multiplexor circuit 47.
  • An output of multiplexor circuit 47 is connected to an input of a bus driver circuit 48 which has an output connected to data bus 40.
  • An output of multiplexor 45 is connected to a second input of multiply/accumulator circuit 49 labeled "Y".
  • An output of multiply/accumulator circuit 49 labeled "P" is connected to a first input of a multiplexor circuit 51.
  • Second and third inputs of multiplexor circuit 51 are connected to data bus 38 and data bus 40, respectively.
  • An output of multiplexor circuit 51 is connected to an input of a plurality of accumulator registers 54.
  • A first output of accumulator registers 54 is connected to an input of a multiplexor circuit 55.
  • An output of multiplexor circuit 55 is connected to an input of an accumulator shifter circuit 56.
  • An output of accumulator shifter circuit 56 is connected to a third input of multiply/accumulator 49.
  • A second output of accumulator registers 54 is fed back to a third input of multiplexor circuits 43 and 45 via a feedback path 57.
  • Third and fourth outputs of accumulator registers 54 are connected to an input of multiplexor circuits 58 and 59, respectively.
  • An output of multiplexor 58 is connected to an input of a shifter/limiter circuit 60.
  • An output of multiplexor 59 is connected to an input of a shifter/limiter circuit 61.
  • An output of shifter/limiter circuit 60 is connected to an input of a bus driver circuit 64 which has an output connected to data bus 38.
  • An output of shifter/limiter circuit 61 is connected to an input of a bus driver circuit 65 which has an output connected to data bus 40.
  • Processor 35 is capable of performing a multiply/accumulate operation in one clock cycle, where a clock cycle is defined as the time between successive processor register loads.
  • The machine state of the processor changes once per clock cycle at the end of the clock cycle.
  • Input register data is multiplied, accumulated with accumulator register data and stored in a predetermined accumulator register.
  • An accumulator register is loaded with the output of the multiply/accumulator 49 at the end of the clock cycle.
  • The input registers 36 and 39 may be loaded from data busses 38 and 40, respectively, at the end of the clock cycle.
  • Data is initially coupled to input registers 36 and 39 from an external source, from input registers 36 and 39 or from accumulator registers 54 via busses 38 and 40, respectively.
  • Registers 36, 39 and 54 are coupled so that contents from any two of the three pluralities of registers are coupled to the first and second inputs of multiply/accumulator 49.
  • Multiply/accumulator circuit 49 processes the numbers coupled to the X, Y and A inputs to provide an output at the end of a clock cycle to be clocked into a predetermined accumulator register 54, thereby replacing the previous value in register 54. It should be readily understood that the X and Y inputs of multiply/accumulator 49 represent multiplier inputs which are functionally reversible.
  • All illustrated registers 36, 39 and 54 may be implemented by conventional edge triggered D-type flip-flops to prevent possible race conditions. Simultaneously with the processing of the three input operands by multiply/accumulator 49, external circuitry may be accessed to read in additional input operands which are read into input registers 36 and 39 for use in the immediately following clock cycle. Similarly, external circuitry may be accessed to write data from input registers 36 and 39 or accumulator registers 54 out to the external circuitry.
  • The X data multiplexor 43 and Y data multiplexor 45 provide a continuous coupling of data between processor 35 registers 36, 39 and 54. As a result, processor 35 is able to perform repetitive multiply/accumulate operations in single clock cycles.
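The single cycle behavior described above can be sketched in software. This is only a behavioral model under stated assumptions: the function names `mac` and `dot_product` are hypothetical, operands are plain Python integers, and one loop iteration stands in for one clock cycle in which the X, Y and A inputs produce P, which is then clocked into the accumulator.

```python
def mac(x: int, y: int, a: int = 0) -> int:
    """One multiply/accumulate step: P = X * Y + A (hypothetical helper)."""
    return x * y + a


def dot_product(data, coeffs):
    """Accumulate sum(data[i] * coeffs[i]), one MAC per modeled clock cycle."""
    acc = 0      # accumulator register, assumed preloaded with zero
    cycles = 0   # modeled clock cycles, one per MAC
    for x, y in zip(data, coeffs):
        acc = mac(x, y, acc)  # result replaces the previous accumulator value
        cycles += 1
    return acc, cycles
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` accumulates three products in three modeled cycles.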
  • The short-time energy over N samples of a time sampled signal is conventionally defined as: $E = \sum_{n=0}^{N-1} x^2(n)$
  • Processor 35 may readily execute energy calculations by providing the same data in one input register to both X and Y inputs of multiply/accumulator 49 via multiplexors 43 and 45.
  • In processor 10 of FIG. 1, both registers 14 and 16 would have to be loaded with the same data.
  • To do so, extra instruction bits or extra clock cycles are typically required.
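A minimal sketch of the energy calculation, assuming the usual sum-of-squares definition of short-time energy. The function name `short_time_energy` is hypothetical; the point is that a single register value `x` feeds both multiplier inputs, so no second register load is modeled.

```python
def short_time_energy(samples):
    """Sum of squared samples, one multiply/accumulate per sample."""
    acc = 0  # accumulator register, assumed cleared before the loop
    for x in samples:
        # The same input register value drives both the X and Y inputs.
        acc = x * x + acc
    return acc
```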
  • Data may be coupled to input registers 36 and 39 to allow shared use of register data by both data processor
  • Input registers 36 or 39 may be fed back in a following clock cycle to a respective data bus and stored in the same or a different memory location.
  • One form of the shared use of input registers 36 and 39 is simultaneous use of the registers by multiply/accumulator 49 and external memory in the same clock cycle.
  • The feedback paths around input registers 36 and 39 also allow implementation of functions such as time shifting sampled data in memory or replacing an element in a memory location with a new element. The latter function is commonly referred to as a "Z" delay function, where the Z transform term $z^{-1}$ represents a time delay of one data sample.
  • The Y input of multiply/accumulator 49 can receive data from any of the X or Y input registers 36 and 39 as well as any of accumulator registers 54 via accumulator feedback path 57.
  • The X input of multiply/accumulator 49 can also receive data from any of the X or Y input registers 36 and 39 as well as any of the accumulator registers 54 via accumulator feedback path 57.
  • Feedback path 57 provides the ability to subsequently use the accumulated product result of the previous clock cycle as a multiplier or a multiplicand in a subsequent clock cycle. Subsequent use may include immediate use of the multiplier/accumulator output operand in an immediately following clock cycle.
  • Feedback path 57 allows standard formulas such as a power series expansion to be implemented quickly and efficiently because the previous Nth power of a number can be immediately multiplied by that number, which is still stored in one of the input registers, to provide the (N + 1)th power as an output to be stored in accumulator register 54.
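The power computation through the accumulator feedback path can be sketched as follows. The names `powers_via_feedback` and `power_series` are hypothetical, and the use of a second accumulator for the running sum is an assumption made for illustration (the text only states that a plurality of accumulator registers exists).

```python
def powers_via_feedback(x, n):
    """Return [x**1, ..., x**n]; each power is (previous power) * x."""
    acc = x            # accumulator register holds x**1
    out = [acc]
    for _ in range(n - 1):
        # Accumulator value fed back as multiplicand; A input forced to zero.
        acc = acc * x + 0
        out.append(acc)
    return out


def power_series(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) with two modeled accumulators."""
    total = coeffs[0]  # assumed second accumulator for the running sum
    power = 1          # accumulator holding the current power of x
    for c in coeffs[1:]:
        power = power * x + 0      # next power via the feedback path
        total = c * power + total  # ordinary multiply/accumulate
    return total
```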
  • The "A" input of multiply/accumulator 49 is the previous accumulator value in one of the accumulator registers 54 which is coupled to accumulator shifter 56 via multiplexor 55.
  • Accumulator shifter 56 can optionally pre-shift the data to the left or right for scaling purposes.
  • Accumulator shifter 56 may also provide a zero function and couple all zeroes to the "A" input of multiply/accumulator 49 so that only a multiplication is performed by multiply/accumulator 49.
  • The data coupled to the "A" input of multiply/accumulator 49 via accumulator shifter 56 may also be inverted by shifter 56 so that a "product minus accumulate" operation is effected.
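The accumulator shifter options above (pre-shift, zero, invert) can be modeled as a small function in front of the A input. This is a sketch under assumptions: the parameter names are hypothetical, inversion is modeled as arithmetic negation, and a negative `shift` means a right shift.

```python
def accumulator_shifter(a, shift=0, zero=False, invert=False):
    """Condition the accumulator value before it reaches the A input."""
    if zero:
        return 0  # all zeroes coupled to A: multiply only
    a = a << shift if shift >= 0 else a >> -shift  # optional scaling
    return -a if invert else a  # inversion gives product minus accumulate


def mac(x, y, a, **mode):
    """P = X * Y + shifter(A), with the shifter mode passed as keywords."""
    return x * y + accumulator_shifter(a, **mode)
```

With X = 3, Y = 4 and A = 5, the plain mode gives 17, the zero mode gives 12, and the invert mode gives 7.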
  • Accumulator registers 54 may also be loaded with data from the X data bus 38 and the Y data bus 40. Accumulator registers 54 may also be read out to the X and Y data busses 38 and 40, respectively, for storage in external memory via multiplexors 58 and 59 and shifter/limiter circuits 60 and 61, respectively. Generally, one shifter/limiter circuit is associated with each data bus. Multiplexors 58 and 59 select a predetermined one of accumulator registers 54 for shifter/limiter circuits 60 and 61, respectively. Shifter/limiter circuits 60 and 61 perform data shifting on the respective inputs followed by an overflow limiting function.
  • Shifter/limiter circuits 60 and 61 also provide an overflow limiting feature commonly called data saturation. If an overflow of data from accumulator registers 54 coming out of the shifter portion of either circuit 60 or 61 is detected, a limiter portion of circuits 60 or 61 substitutes a maximum positive or negative constant onto the respective data bus to limit the magnitude of the incurred error. Otherwise, passing the overflowed data on to the external busses results in a large error. Shifter/limiter circuits 60 and 61 provide for much lower errors and minimize the occurrence of an unstable condition encountered in signal processing digital filters commonly known as "limit cycles".
  • Shifter/limiter circuits 60 and 61 may be implemented with conventional shifter circuits which shift data received from accumulator registers 54 via multiplexors 58 and 59, respectively. If a right shift is performed, no overflow can occur since the lower bits are discarded. If a left shift is performed, an overflow condition may occur if the upper bits discarded contain any significant information. An overflow detector detects if the upper bits discarded by the shifter contain significant bits or just copies of the sign bit of the data. If there is no overflow condition, all of the upper bits discarded by the shifter will equal the sign bit of the data provided to the external data bus.
  • The overflow detector may be implemented by conventional logic circuits. If an overflow occurs, a maximum positive (01111...1) or negative (10000...0) constant is substituted onto the appropriate shifter/limiter output. The sign of the substituted constant is equal to the sign of the selected accumulator register 54. The resulting output of shifter/limiters 60 and 61 is a shifted and limited version of the selected accumulator register.
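The shift-then-limit behavior can be sketched in software as below. The function name `shift_and_limit` and the 16-bit default `width` are assumptions for illustration; the accumulator value is modeled as an unbounded Python integer, and comparing against the output range stands in for checking whether the discarded upper bits are all copies of the sign bit.

```python
def shift_and_limit(value: int, shift: int, width: int = 16) -> int:
    """Shift a signed value, then saturate to a `width`-bit signed range."""
    max_pos = (1 << (width - 1)) - 1  # 0111...1
    max_neg = -(1 << (width - 1))     # 1000...0
    # Positive shift is a left shift; negative shift is an arithmetic right shift.
    shifted = value << shift if shift >= 0 else value >> -shift
    if shifted > max_pos:
        return max_pos  # positive overflow: substitute maximum positive constant
    if shifted < max_neg:
        return max_neg  # negative overflow: substitute maximum negative constant
    return shifted
```

The substituted constant takes the sign of the value being shifted, which matches the rule stated above for the selected accumulator register.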
  • Bus driver circuits 64 and 65 may be implemented using conventional driver circuits. Driver circuits 64 and 65 are controlled by external logic such that only one register or memory is utilizing the bus at any given time.
  • Shown in FIG. 3 is another embodiment of the present invention illustrating a data processor 35' analogous to data processor 35 of FIG. 2 and which utilizes feedback between the output of multiply/accumulator 49 and the inputs of X and Y input registers 36 and 39.
  • The data processor of FIG. 3 is otherwise identical to the data processor of FIG. 2 and utilizes the same numbered elements for ease of illustration, with the exception that feedback path 57 has been replaced by a feedback path 67 from a second output of multiply/accumulator 49 to the input of X and Y input registers 36 and 39 via multiplexor circuits 68 and 69, respectively.
  • Accumulator register 54 now has only three outputs instead of four.
  • Feedback path 67 may be coupled to only one of input registers 36 or 39 via only one of multiplexor circuits 68 or 69, respectively.
  • Feedback path 67 may selectively couple the output of multiply/accumulator 49 to either of input registers 36 or 39 or to both. From input registers 36 and 39, the output of multiply/accumulator 49 may be coupled back to the first or second input or to both inputs of multiply/accumulator 49. The output of multiply/accumulator 49 may also be written to external memory after being stored in input registers 36 and 39 as described below in further detail.
  • Data processors 35 and 35' are more efficient and flexible in their implementation of signal processing algorithms as discussed above.
  • Feedback paths 57 and 67 of data processors 35 and 35', respectively, allow the output of multiply/accumulator 49 to be coupled to one or both inputs thereof without the use of data busses 38 and 40.
  • Data busses 38 and 40 are simultaneously available to load new operands into input registers 36 and 39, respectively.
  • In processor 10 of FIG. 1, the same operation would require the use of data bus 15, thereby precluding the use of the bus for loading input operands.
  • Data processor 35 of FIG. 2 provides distinct advantages over data processor 35' of FIG. 3 with respect to overflow conditions and input data storage.
  • Multiplier products typically require two times the number of register bits for storage compared to multiplier and multiplicand operands. Therefore, accumulator registers 54 are typically twice as large as input registers 36 and 39. Additionally, accumulator registers 54 may provide extra upper data bits to provide an accumulator extension to accommodate word growth in repetitive multiply/accumulate operations.
  • Data processor 35 of FIG. 2 provides feedback path 57 from accumulator registers 54. The larger size of accumulator registers 54 allows the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. It is desirable to minimize errors if an accumulator overflow has occurred.
  • Accumulator registers may be tested for overflow before the accumulator register value is reused by feedback path 57.
  • Shifter/limiters 60 and 61 also allow overflows to be detected and limited before data is written to external memory.
  • Data processor 35' of FIG. 3 provides feedback path 67 from multiply/accumulator 49.
  • The smaller size of input registers 36 and 39 does not allow the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. Therefore, the possibility of roundoff and overflow errors is greatly increased.
  • Processor input registers do not provide the ability to test for overflow errors.
  • Feedback path 67 may also be used to store a multiply/accumulator 49 result in input registers 36 or 39 which is then written out to external memory via the respective multiplexor 41 or 47 and bus driver 42 or 48. Since no shifter/limiter functions are provided in either input register feedback path, overflows cannot be detected and limited before data is written to external memory.
  • A second advantage of processor 35 over processor 35' is due to the fact that signal processing algorithms typically require more input operands than output operands. One example is the typical multiply/accumulate operation where two input operands are required from external memory.
  • Feedback path 57 of processor 35 does not require the use of input registers 36 or 39 to store the output of multiply/accumulator 49, thereby preserving useful storage means for input operands.
  • Since processor 35 uses only accumulator registers 54 to store multiply/accumulator 49 results, there is no contention for input registers when input operands are required from external memory.
  • Feedback path 67 of processor 35' requires the use of at least some of input registers 36 and 39 which reduces the amount of useful storage registers for input operands.
  • Processor 35' may present a contention problem since input registers 36 and 39 may be written from either the memory busses 38 and 40, respectively, or the multiply/accumulate feedback path 67. This contention problem may lessen the efficiency of processor 35' when feedback path 67 is used. Therefore, processor 35 of FIG. 2 is a preferred embodiment of the present invention over processor 35' of FIG. 3.
  • A common application of data processors 35 and 35' is in digital filtering.
  • Input registers 36 and 39 are loaded with data which is typically time sampled values stored in a work space of a filter commonly implemented as a digital delay line.
  • A plurality of consecutive stages in external memory contain consecutive time samples of data.
  • Also stored in external memory are coefficient values which form an impulse response of the filter. Therefore, data describing the characteristics of the time and frequency response of the digital filter is stored in external memory along with sampled signal values.
  • A plurality of repetitive data loads are executed by reading memory and loading input registers 36 and 39 to couple data values and accompanying coefficient values for multiplication and accumulation.
  • A time shift of sampled data may be effected in external memory by executing a move of data from memory to a register and then back to memory at a different location.
  • In processor 10 of FIG. 1, such a time shift of sampled data in external memory will require a series of data movement operations after the filtering operation and will require at least two cycles per data sample.
  • The present invention reduces overhead associated with a time shift operation on sampled data by providing the ability both to write and to read input registers 36 and 39.
  • A data sample and a coefficient value are coupled to input registers 36 and 39, respectively.
  • Input registers 36 and 39 may be read back to memory to an appropriate location which effects a time shift of the sampled data.
  • Such a location is typically one address removed in sequential memory from where the data originated, thereby effecting one unit of digital time delay.
  • The next filter calculation involves sample time (N + 1) of the filter.
  • The present invention provides the ability to read back the contents of input registers 36 and 39 and thereby avoid reading each memory location twice. Therefore, overhead is reduced from 2N cycles to N cycles resulting in a total filter calculation time of 2N rather than 3N.
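The read-back idea can be sketched as a filter step that reads each delay line sample once, uses it in the multiply/accumulate, and writes the same register value back one address later. The function name `fir_step` and the memory layout (`delay_line[k]` holds the sample delayed by k) are assumptions for illustration; only memory reads are counted, not exact processor cycles.

```python
def fir_step(delay_line, coeffs):
    """Compute one filter output and time shift the delay line in place."""
    acc = 0
    reads = 0
    last = len(delay_line) - 1
    for k in range(last, -1, -1):   # oldest sample first
        x = delay_line[k]           # single memory read into an input register
        reads += 1
        acc = x * coeffs[k] + acc   # MAC uses the register contents
        if k < last:
            delay_line[k + 1] = x   # register read back to the next address
    return acc, reads
```

After the call, slot 0 is free to receive the next input sample, and the number of reads equals the number of taps rather than twice that number.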
  • A further advantage of the present invention is that all of the input registers 36 and 39 may be read as well as written on an interrupt or a break in processing execution. Therefore, the contents of input registers 36 and 39 may be saved in external memory so that processors 35 and 35' may be used in an interrupt routine for another function. Upon completion of the interrupt, the data may then be restored from external memory and the filter calculation continued without significant additional overhead.
  • Processor 10 is generally unavailable during an interrupt because the processor registers cannot be easily saved and restored.
  • Since processors 35 and 35' of FIGS. 2 and 3 have the capability of directly reading the data in input registers 36 and 39, no need for an additional address pointer or address pointer modification exists.
  • The flow of data from input registers 36 and 39 to external memory is controlled by multiplexors 41 and 47, respectively, and bus driver circuits 42 and 48, respectively.
  • Shown in FIG. 4 is a conventional structure of a second order biquadratic filter 70 commonly implemented in software. Shown in Table 1 in the attached appendix is a software example of a calculation of filter 70 by either processor 35 or 35'.
  • Filter 70 of FIG. 4 generally comprises adder circuits 71 and 72, multiplier circuits 74, 75, 76 and 77 and data memory storage locations 79 and 80. The equations which filter 70 implements are:
  • $W(n) = X(n) - a_1 W(n-1) - a_2 W(n-2)$
  • $Y(n) = W(n) + b_1 W(n-1) + b_2 W(n-2)$
  • An input signal X(n) is coupled to a first input of adder 71 and an output signal Y(n) is provided by an output of adder 72.
  • An intermediate signal W(n) is formed and stored in data memory storage locations 79 and 80 with a digital time delay of one and two, respectively.
  • Multipliers 74, 75, 76 and 77 function to multiply a respective data input with a designated coefficient value which is stored in coefficient memory storage (not shown). The coefficient values determine the impulse response of the digital filter.
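The two filter equations can be written as a chain of four multiply/accumulate steps. This is a sketch only: the function name `biquad_step` and the `state` tuple holding W(n-1) and W(n-2) are hypothetical, and the ordering of the four products is chosen for illustration rather than taken from Table 1.

```python
def biquad_step(x, state, a1, a2, b1, b2):
    """One output sample of the biquad; state is (W(n-1), W(n-2))."""
    w1, w2 = state
    acc = x                  # accumulator assumed preloaded with X(n)
    acc = (-a2) * w2 + acc   # MAC 1
    acc = (-a1) * w1 + acc   # MAC 2: acc now holds W(n)
    w = acc
    acc = b2 * w2 + acc      # MAC 3
    acc = b1 * w1 + acc      # MAC 4: acc now holds Y(n)
    return acc, (w, w1)      # new state: W(n) becomes W(n-1), W(n-1) becomes W(n-2)
```

With all four coefficients set to zero the section passes its input through unchanged, which is a quick sanity check on the signs.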
  • Input operands are first coupled to input registers 36 and 39.
  • The value W(n-2) stored in location 80 is coupled to an X input register 36 labeled "X0" and coefficient (-a 2 ) is coupled from coefficient memory storage to a Y input register 39 labeled "Y0".
  • The input value X(n) is assumed to be preloaded in accumulator register 54 and labeled "A".
  • The multiply/accumulate operation is then performed by multiply/accumulator 49 and new operands are loaded into input registers 36 and 39 from external memory for use in the next clock cycle.
  • Table 1 illustrates on a step by step basis what ALU operation is being executed, what data and coefficient transfer is occurring between external memory and input registers 36 and 39 and comments to indicate what mathematical operation is occurring. Five operation cycles are required for execution of filter 70 which is the minimum number possible to preload the first operands and perform four multiplications with a single multiplier ALU.
  • Input registers X0 and X1 of registers 36 and register Y0 of registers 39 serve as input pipeline registers. Shown in the dotted box of Table 1 is an example of the shared use feature of input register 36 by a data bus and an ALU. Initially, signal W(n-2) is read from memory storage location 80 into X input register X0.
  • The signal W(n-1) is read from memory storage location 79 into X input register X1.
  • The contents of input register X1 are written back to memory storage location 80 representing signal W(n-2), thereby effecting a time shift of data in filter 70 from memory storage location 79 to 80.
  • The ALU operation uses input register X1 as the multiplicand input of multiply/accumulator 49.
  • A value for signal W(n) has been calculated and stored in the accumulator register labeled "A" illustrated in Table 1. This value is also used as the third input to multiply/accumulator 49 for the third ALU operation.
  • The value of A present in accumulator register 54 and equal to W(n) is stored away as W(n-1) in memory storage location 79 during the third ALU operation. Therefore, both values in memory storage locations
  • a plurality of biquad filters such as filter 70 are cascaded as shown in FIG. 5.
  • Repetitive software may be used to cascade filters directly.
  • Time savings can be realized by overlapping the operand preload clock cycle with the last ALU operation clock cycle of the previous filter as shown in Table 2 in the attached appendix.
  • an execution time of 4N+1 clock cycles is required for a cascade of N biquad filters which is the optimal time for a single multiplier ALU. Since multiply/accumulator 49 and both data busses 38 and 40 are busy all 4N cycles, optimal execution time is not possible without the ability to simultaneously use input registers 36 and 39 between an ALU and a data bus.
  • Signal values W(n-1) and W(n-2) for each filter are stored in data memory storage locations such as storage locations 79 and 80 of filter 70.
  • the values for each filter are indicated in Table 2 by use of subscripts such as W3(n-1) for filter F3.
  • coefficients $-a_1$, $-a_2$, $b_1$ and $b_2$ are illustrated for each filter by additional subscripts such as $-a_{31}$ representing coefficient $-a_1$ for filter F3.
  • the processor structure of the present invention makes calculation of an accumulated product possible in a single clock cycle as opposed to multiple clock cycles. Since two data busses are coupled to each of processors 35 and 35', data values and coefficient values may be coupled to either processor 35 or 35' to ensure that processor operating speed is not adversely affected.
  • the processor architecture of the present invention also minimizes storage register requirements. By virtue of a feedback path between the output and input of a multiplier/accumulator circuit, an accumulated product may be immediately used as an input operand for a successive multiplication without an extra overhead cycle. As a result, a very time efficient and flexible processor has been provided.
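The biquad equations and the cascade timing described above can be sketched in Python as follows. This is only an illustrative model, not the Table 1 or Table 2 microcode: the class name `Biquad`, the helper names, and the order of the last two multiply/accumulate steps are assumptions, and floating-point values stand in for the fixed-point data of the processor.

```python
from dataclasses import dataclass


@dataclass
class Biquad:
    """One second-order section (filter 70); names here are hypothetical."""
    a1: float
    a2: float
    b1: float
    b2: float
    w1: float = 0.0  # W(n-1), modeled after memory storage location 79
    w2: float = 0.0  # W(n-2), modeled after memory storage location 80

    def step(self, x: float) -> float:
        acc = x                     # accumulator "A" preloaded with X(n)
        acc += -self.a2 * self.w2   # MAC 1: A - a2*W(n-2)
        acc += -self.a1 * self.w1   # MAC 2: A now holds W(n)
        w = acc
        acc += self.b2 * self.w2    # MAC 3 (ordering of MAC 3 and 4 is assumed)
        acc += self.b1 * self.w1    # MAC 4: A now holds Y(n)
        # Time shift of the delay line: W(n-1) -> W(n-2), W(n) -> W(n-1)
        self.w2, self.w1 = self.w1, w
        return acc


def cascade(sections, x):
    """Feed the output of each section into the next, as in FIG. 5."""
    for section in sections:
        x = section.step(x)
    return x


def cascade_cycles(n_sections: int) -> int:
    """Cycle count stated in the text: 4 MACs per biquad plus one preload."""
    return 4 * n_sections + 1
```

Each call to `step` performs exactly four multiply/accumulate operations, which is where the 4N term in the 4N+1 cycle count comes from; the remaining cycle is the initial operand preload.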

Abstract

A data processor (35) capable of repeatedly multiplying two input operands (X and Y) and selectively accumulating the resulting product with a third input operand in a single clock cycle of operation. The resulting accumulated product (10) may be used as one or both multiplier input operands in an immediately following clock cycle of operation by using a feedback path (57) coupled between an output and an input of the multiplier/accumulator (49). The data processor (35) utilizes a plurality of input storage registers (36, 39) which are shared by a memory bus (38 or 40) coupled to external memory and by the data processor (35) to thereby reduce data processing time.

Description

A DIGITAL SIGNAL PROCESSOR FOR SINGLE CYCLE MULTIPLY/ACCUMULATION
Technical Field
This invention relates generally to signal processors, and more particularly, to a digital signal processor capable of a multiply/accumulation in a single clock cycle.
Background of the Invention
Signal processors which utilize an ALU for multiplying two numbers and selectively adding the product with a third number are very common in the signal processing art. Typical processors utilize two stages in which a product is formed in the first stage and an accumulation is made in the second stage. An example of such a processor is taught by Glenn Culler in U.S. Patent No. 4,287,566 entitled "Array Processor With Parallel Operations Per Instruction". Such processors require a minimum of two clock cycles to provide an output.
Summary of the Invention
Accordingly, an object of the present invention is to provide an improved digital signal processor for single cycle multiply/accumulation operations.
Another object of the present invention is to provide an improved data processor capable of complete single cycle operation.
In carrying out the above and other objects, there is provided, in one form, a digital signal processor for implementing algorithms by providing product accumulations. In the illustrated form, a product of first and second input operands is selectively accumulated with a third input operand. First input storage means having an input coupled to a first data bus are used to selectively store the first input operand. Second input storage means having an input coupled to a second data bus are used to selectively store the second input operand. A multiplier/accumulator having first and second inputs for receiving the first and second operands provides a product selectively accumulated with a third input operand coupled to a third input thereof. The accumulated product is provided in a single clock cycle of the processor in response to receipt of the first, second and third input operands. An output storage means has an input selectively coupled to the output of the multiplier/accumulator or either memory bus. An output of the output storage means is selectively coupled to at least a predetermined one of the first, second or third inputs of the multiplier/accumulator for implementing a variety of differing algorithms. Repetitive complete multiply/accumulation operations may be executed with each operation taking only one clock cycle.
Brief Description of the Drawings
FIG. 1 illustrates in block diagram form a digital signal processor structure known in the art;
FIG. 2 illustrates in block diagram form a digital signal processor structure in accordance with a preferred embodiment of the present invention;
FIG. 3 illustrates in block diagram form another embodiment of the digital signal processor of FIG. 2;
FIG. 4 illustrates in block diagram form a biquadratic digital filter structure implementable by the digital signal processors of FIGS. 2 and 3; and
FIG. 5 illustrates in block diagram form a cascaded digital filter structure implementable by the digital signal processors of FIGS. 2 and 3.
Detailed Description of the Invention
Shown in FIG. 1 is a representative data processor 10 known in the art which generally comprises stages 11 and 12. First stage 11 comprises a first input register 14 having an input connected to a first data bus 15 labeled "X Data Bus". A second input register 16 has an input connected to a second data bus 18 labeled "Y Coefficient Bus". An output of input register 14 is connected to a first input of a multiplier circuit 20, and an output of input register 16 is connected to a second input of multiplier circuit 20. Multiplier 20 has first and second outputs respectively connected to an input of a product register 22 and an input of a multiplexor circuit 21. A second input of multiplexor circuit 21 is connected to first data bus 15. An output of multiplexor 21 is connected to an input of a product register 24. Product registers 22 and 24 represent the most significant product (MSP) and least significant product (LSP), respectively, of multiplier 20.
Second stage 12 comprises a multiplexor 25, an ALU 26, a multiplexor 27, an accumulator register 28, and a bus driver circuit 30. Product registers 22 and 24 of first stage 11 each has an output connected to first and second inputs of multiplexor 25, respectively. An output of multiplexor 25 is connected to a first input of ALU 26. An output of ALU 26 is connected to a first input of multiplexor circuit 27. A second input of multiplexor circuit 27 is connected to first data bus 15. An output of multiplexor circuit 27 is connected to an input of accumulator register 28. A first output of accumulator register 28 is connected to a second input of ALU 26. A second output of accumulator register 28 is connected to an input of bus driver circuit 30. An output of bus driver circuit 30 is connected to the inputs of input registers 14 and 16, to the second inputs of multiplexors 21 and 27 and to external circuitry via first memory bus 15.
In operation, data processor 10 provides a multiply/accumulate function. Input registers 14 and 16 provide a multiplicand and a multiplier input via data busses 15 and 18. Typically, one of the inputs represents a data value and the other input represents a coefficient value. After these inputs are loaded into registers 14 and 16, the data is coupled to multiplier 20. Multiplier 20 calculates a product of the first and second input values and presents a product output at the first and second outputs thereof. Multiplier 20 may perform a data formatting function to allow both fractional and integer number representations. Another common function which multiplier 20 may additionally perform includes sign bit control to effect either signed or positive unsigned number representation. Multiplier 20 may also perform an inversion of data to provide either a positive or negative product. After multiplier 20 provides a product, the product is stored in MSP/LSP form in product registers 22 and 24, respectively.
The time required to provide an output product to registers 22 and 24 is one clock cycle after the input data is loaded into registers 14 and 16.
The operation of second stage 12 of data processor 10 is centered around ALU 26 which primarily adds the value in product registers 22 or 24 to a third input value to provide a multiply/accumulate operation. The third input value is provided by accumulator register 28. ALU 26 may also perform other functions such as logical ANDing, ORing, etc. to provide conventional ALU functions as well as addition. To provide a standard addition operation without a multiplication, an addend is loaded into product register 24 via multiplexor 21 and is selectively connected to ALU 26 via multiplexor 25 in the following clock cycle. The output of accumulator register 28 is connected to the second input of ALU 26 to provide the value from which the product is added or subtracted. The accumulated product output of ALU 26 is stored in accumulator register 28 via multiplexor 27. The output of ALU 26 can be written to data bus 15 via multiplexor 27, accumulator register 28 and bus driver circuit 30. Although the described architecture readily accomplishes repetitive multiply/accumulate operations, the architecture of FIG. 1 is not efficient for performing nonrepetitive calculations. For example, if the ALU output in the accumulator register 28 is immediately needed as an input to multiplier 20, the contents of accumulator register 28 must be clocked into input register 14 before the value is available to multiplier 20. To accomplish this preliminary step will take an entire clock cycle. Therefore, the accumulated product is not available immediately to use as a multiplier or a multiplicand in the multiplication. In other words, in a two stage processor as shown in FIG. 1, data in the second stage is not immediately available for use in the first stage. Because the output of the first stage is immediately available to the second stage, a multiply/accumulation is an efficient operation. However, an accumulation operation followed by a multiplication is not efficient. 
Also, because product registers 22 and 24 are hidden and can not be read by data busses 15 and 18, reading and writing the output product is limited. Therefore, data processor 10 is generally unavailable during interrupt processing without losing the presently held product register data.
Shown in FIG. 2 is a data processor 35 in accordance with the present invention. Data processor 35 comprises a plurality of input registers 36 having an input connected to a memory or data bus 38 labeled "X Data Bus", and a plurality of input registers 39 having an input connected to a memory or data bus 40 labeled "Y Coefficient Bus". It should be readily apparent that all register circuits shown herein are of multiple bit size and may be of variable width. A first output of input registers 36 is connected to an input of a multiplexor circuit 41. Multiplexor circuit 41 has an output which is connected to an input of a bus driver circuit 42. An output of bus driver circuit 42 is connected to data bus 38. A second output of input registers 36 is connected to a first input of a multiplexor circuit 43. A third output of input registers 36 is connected to a first input of a multiplexor circuit 45. An output of multiplexor circuit 43 is connected to a first input of a multiply/accumulator circuit 49 labeled "X". A first output of input registers 39 is connected to a second input of multiplexor circuit 43. A second output of input registers 39 is connected to a second input of multiplexor circuit 45. A third output of input registers 39 is connected to a multiplexor circuit 47. An output of multiplexor circuit 47 is connected to an input of a bus driver circuit 48 which has an output connected to a data bus 40. An output of multiplexor 45 is connected to a second input of a multiply/accumulator circuit 49 labeled "Y". An output of multiply/accumulator circuit 49 labeled "P" is connected to a first input of a multiplexor circuit 51. Second and third inputs of multiplexor circuit 51 are connected to data bus 38 and data bus 40, respectively. An output of multiplexor circuit 51 is connected to an input of a plurality of accumulator registers 54. A first output of accumulator registers 54 is connected to an input of a multiplexor circuit 55. 
An output of multiplexor circuit 55 is connected to an input of an accumulator shifter circuit 56. An output of accumulator shifter circuit 56 is connected to a third input of multiplier/accumulator 49. A second output of accumulator registers 54 is fed back to a third input of multiplexor circuits 43 and 45 via a feedback path 57. Third and fourth outputs of accumulator registers 54 are connected to an input of multiplexor circuits 58 and 59, respectively. An output of multiplexor 58 is connected to an input of a shifter/limiter circuit 60. Similarly, an output of multiplexor 59 is connected to an input of a shifter/limiter circuit 61. An output of shifter/limiter circuit 60 is connected to an input of a bus driver circuit 64 which has an output connected to a data bus 38. An output of shifter/limiter circuit 61 is connected to an input of a bus driver circuit 65 which has an output connected to data bus 40.
In operation, processor 35 is capable of performing a multiply/accumulate operation in one clock cycle where a clock cycle is defined as the time between successive processor register loads. That is, the machine state of the processor changes once per clock cycle at the end of the clock cycle. In a single clock cycle, input register data is multiplied, accumulated with accumulator register data and stored in a predetermined accumulator register. An accumulator register is loaded with the output of the multiply/accumulator 49 at the end of the clock cycle. Simultaneously, the input registers 36 and 39 may be loaded from data busses 38 and 40, respectively, at the end of the clock cycle.
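The clocking behavior just described, where the accumulator and the input registers all update together at the end of a cycle, can be modeled with a small Python sketch. The class `MacModel` and the function `dot_product` are hypothetical names introduced for illustration; the model uses one X register, one Y register and one accumulator rather than the pluralities of registers in FIG. 2.

```python
class MacModel:
    """Toy model of one X register, one Y register and one accumulator."""

    def __init__(self):
        self.x = 0
        self.y = 0
        self.a = 0

    def clock(self, bus_x=None, bus_y=None):
        # Everything below is computed from the state held *during* the cycle.
        next_a = self.x * self.y + self.a
        next_x = self.x if bus_x is None else bus_x
        next_y = self.y if bus_y is None else bus_y
        # All registers change together at the end of the clock cycle.
        self.x, self.y, self.a = next_x, next_y, next_a


def dot_product(xs, ys):
    """Accumulate sum(x*y); returns (result, clock cycles used)."""
    mac = MacModel()
    cycles = 0
    for x, y in zip(xs, ys):
        mac.clock(bus_x=x, bus_y=y)  # MAC previous operands while loading new ones
        cycles += 1
    mac.clock()                      # final MAC, no new operands to load
    cycles += 1
    return mac.a, cycles
```

In this model a run of N products takes N + 1 cycles: one cycle to preload the first operands, then one multiply/accumulate per cycle while the next operands are loaded in parallel.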
Data is initially coupled to input registers 36 and 39 from an external source, from input registers 36 and 39 or from accumulator registers 54 via busses 38 and 40, respectively. Registers 36, 39 and 54 are coupled so that contents from any two of the three pluralities of registers are coupled to the first and second inputs of multiply/accumulator 49. Multiply/accumulator circuit 49 processes the numbers coupled to the X, Y and A inputs to provide an output at the end of a clock cycle to be clocked into a predetermined accumulator register 54 thereby replacing the previous value in register 54. It should be readily understood that the X and Y inputs of multiply/accumulator 49 represent multiplier inputs which are functionally reversible. All illustrated registers 36, 39 and 54 may be implemented by conventional edge triggered D-type flip-flops to prevent possible race conditions. Simultaneous to the processing of the three input operands by multiply/accumulator 49, external circuitry may be accessed to read in additional input operands which are read into input registers 36 and 39 for use in the immediately following clock cycle. Similarly, external circuitry may be accessed to write data from input registers 36 and 39 or accumulator registers 54 out to the external circuitry. The X data multiplexor 43 and Y data multiplexor 45 provide a continuous coupling of data between processor 35 registers 36, 39 and 54. As a result, processor 35 is able to perform repetitive multiply/accumulate operations in single clock cycles. The short-time energy over N samples of a time sampled signal is conventionally defined as:
$E = \sum_{n=0}^{N-1} x^2(n)$
Therefore, processor 35 may readily execute energy calculations by providing the same data in one input register to both X and Y inputs of multiply/accumulator 49 via multiplexors 43 and 45. In order to perform energy calculations with processor 10 of FIG. 1, both registers 14 and 16 would have to be loaded with the same data. However, whenever the same data is routed to multiple destinations, extra instruction bits or extra clock cycles are typically required.
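As a rough illustration of the energy calculation, the Python sketch below squares each sample by routing one register value to both multiplier inputs, and compares the number of input register loads with a scheme that must load the same value into two registers. The function names and the load-counting convention are assumptions made for this example only.

```python
def short_time_energy(samples):
    """Sum of squared samples using a single input register per sample."""
    acc = 0
    register_loads = 0
    for sample in samples:
        x0 = sample              # one load into an X input register
        register_loads += 1
        acc += x0 * x0           # the same register feeds both X and Y inputs
    return acc, register_loads


def short_time_energy_two_registers(samples):
    """Same result when separate X and Y registers must each be loaded."""
    acc = 0
    register_loads = 0
    for sample in samples:
        x_reg = sample           # load the data register
        y_reg = sample           # load the same value again into the second register
        register_loads += 2
        acc += x_reg * y_reg
    return acc, register_loads
```

Both functions return the same energy value; only the number of register loads differs, which is the overhead the shared multiplexor routing is meant to avoid.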
Similarly, data may be coupled to input registers 36 and 39 to allow shared use of register data by both data processor 35 and external data busses 38 and 40. Feedback paths from the output of input registers 36 and 39 to be described in further detail below selectively couple the output of input registers 36 and 39 via bus drivers 42 and 48, respectively, to data busses 38 and 40, respectively. As a result, input data which has been read from one memory location and is stored at the end of a clock cycle in one of input registers 36 or 39 may be fed back in a following clock cycle to a respective data bus and stored in the same or a different memory location. One form of the shared use of input registers 36 and 39 is simultaneous use of the registers by multiply/accumulator 49 and external memory in the same clock cycle. The feedback paths around input registers 36 and 39 also allow implementation of functions such as time shifting sampled data in memory or replacing an element in a memory location with a new element. The latter function is commonly referred to as a "Z" delay function where the Z transform "$Z^{-1}$" represents a time delay of one data sample.
Versatility of operation of data processor 35 allows efficient implementation of non-repetitive signal processing algorithms. The Y input of multiply/accumulator 49 can receive data from any of the X or Y input registers 36 and 39 as well as any of accumulator registers 54 via accumulator feedback path 57. The X input of multiply/accumulator 49 can also receive data from any of the X or Y input registers 36 and 39 as well as any of the accumulator registers 54 via accumulator feedback path 57. Feedback path 57 provides the ability to subsequently use the accumulated product result of the previous clock cycle as a multiplier or a multiplicand in a subsequent clock cycle. Subsequent use may include immediate use of the multiplier/accumulator output operand in an immediately following clock cycle. As a result, an extra clock cycle of delay associated with data processor 10 of FIG. 1 has been eliminated. The use of feedback path 57 allows standard formulas such as a power series expansion to be implemented quickly and efficiently because the previous Nth power of a number can be immediately multiplied by that number which is still stored in one of the input registers to provide the (N + 1)th power as an output to be stored in accumulator register 54.
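The power series use of the accumulator feedback can be sketched as follows. In this Python model each loop iteration stands for one multiply/accumulate cycle in which the previous accumulator value is fed back as a multiplier input. The function names and the use of two accumulators in `power_series` (one for the running power, one for the running sum) are assumptions for illustration, not a description of the patent's microcode.

```python
def powers(x, n):
    """Return [x, x**2, ..., x**n], one multiplication per modeled cycle."""
    acc = x
    result = [acc]
    for _ in range(n - 1):
        # Fed-back accumulator times the input register; the "A" input is zero.
        acc = acc * x + 0
        result.append(acc)
    return result


def power_series(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... ."""
    if not coeffs:
        return 0
    total = coeffs[0]   # first accumulator: running sum
    power = x           # second accumulator: running power of x
    for c in coeffs[1:]:
        total = c * power + total  # coefficient times fed-back power, accumulated
        power = power * x          # fed-back power times x gives the next power
    return total
```

Because the fed-back value is available in the very next cycle, no intermediate power has to be written to memory or reloaded through an input register.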
The "A" input of multiply/accumulator 49 is the previous accumulator value in one of the accumulator registers 54 which is coupled to accumulator shifter 56 via multiplexor 55. Accumulator shifter 56 can optionally pre-shift the data to the left or right for scaling purposes. Accumulator shifter 56 may also provide a zero function and couple all zeroes to the "A" input of multiply/accumulator 49 so that only a multiplication is performed by multiply/accumulator 49. The data coupled to the "A" input of multiply/accumulator 49 via accumulator shifter 56 may also be inverted by shifter 56 so that a "product minus accumulate" operation is effected. Accumulator registers 54 may also be loaded with data from the X data bus 38 and the Y data bus 40. Accumulator registers 54 may also be read out to the X and Y data busses 38 and 40, respectively, for storage in external memory via multiplexors 58 and 59 and shifter/limiter circuits 60 and 61, respectively. Generally, one shifter/limiter circuit is associated with each data bus. Multiplexors 58 and 59 select a predetermined one of accumulator registers 54 for shifter/limiter circuits 60 and 61, respectively. Shifter/limiter circuits 60 and 61 perform data shifting on the respective inputs followed by an overflow limiting function. This allows arithmetic scaling to be performed on the values provided by multiplexors 58 and 59 read from accumulator registers 54 before the values are provided to external memory via busses 38 and 40, respectively. Because the shifting operation may produce arithmetic overflows, shifter/limiter circuits 60 and 61 also provide an overflow limiting feature commonly called data saturation. If an overflow of data from accumulator registers 54 coming out of the shifter portion of either circuit 60 or 61 is detected, a limiter portion of circuits 60 or 61 substitutes a maximum positive or negative constant onto the respective data bus to limit the magnitude of the incurred error.
Otherwise, passing the overflowed data on to the external busses results in a large error. Shifter/limiter circuits 60 and 61 provide for much lower errors and minimize the occurrence of an unstable condition encountered in signal processing digital filters commonly known as "limit cycles".
In one form, shifter/limiter circuits 60 and 61 may be implemented with conventional shifter circuits which shift data received from accumulator registers 54 via multiplexors 58 and 59, respectively. If a right shift is performed, no overflow can occur since the lower bits are discarded. If a left shift is performed, an overflow condition may occur if the upper bits discarded contain any significant information. An overflow detector detects if the upper bits discarded by the shifter contain significant bits or just copies of the sign bit of the data. If there is no overflow condition, all of the upper bits discarded by the shifter will equal the sign bit of the data provided to the external data bus. If there is an overflow condition, at least one of the upper bits discarded by the shifter will not equal the sign bit of the data provided to the external data bus. The overflow detector may be implemented by conventional logic circuits. If an overflow occurs, a maximum positive (01111...1) or negative (10000...0) constant is substituted onto the appropriate shifter/limiter output. The sign of the substituted constant is equal to the sign of the selected accumulator register 54. The resulting output of shifter/limiters 60 and 61 is a shifted and limited version of the selected accumulator register.
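The shift-then-limit behavior described above can be sketched in Python as below. This is a minimal model under stated assumptions: the function name `shift_limit` is hypothetical, the accumulator value is assumed to already fit in `width` bits of two's complement, and the output bus is assumed to have the same width. Overflow is detected by comparing the discarded upper bits with the sign bit of the shifted result, as the text describes.

```python
def shift_limit(acc, shift, width=16):
    """Shift a signed value and saturate on overflow.

    shift > 0 shifts left, shift <= 0 shifts right by -shift.
    Assumes -(2**(width-1)) <= acc <= 2**(width-1) - 1.
    """
    mask = (1 << width) - 1
    max_pos = (1 << (width - 1)) - 1   # 0111...1
    min_neg = -(1 << (width - 1))      # 1000...0

    if shift <= 0:
        # Right shift only discards lower bits, so no overflow can occur.
        return acc >> -shift

    wide = acc << shift                # Python integers never overflow
    kept = wide & mask                 # bits that reach the data bus
    sign = (kept >> (width - 1)) & 1   # sign bit of the shifted result
    discarded = (wide >> width) & ((1 << shift) - 1)
    expected = ((1 << shift) - 1) if sign else 0

    if discarded != expected:
        # Overflow: substitute a constant with the sign of the accumulator.
        return min_neg if acc < 0 else max_pos

    # Convert the kept bit pattern back to a signed integer.
    return kept - (1 << width) if sign else kept
```

For example, shifting 0x4000 left by one bit in a 16-bit word would flip the sign bit, so the model returns the maximum positive constant instead of a large negative value.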
Bus driver circuits 64 and 65 may be implemented using conventional driver circuits. Driver circuits 64 and 65 are controlled by external logic such that only one register or memory is utilizing the bus at any given time.
Shown in FIG. 3 is another embodiment of the present invention illustrating a data processor 35' analogous to data processor 35 of FIG. 2 and which utilizes feedback between the output of multiply/accumulator 49 and the inputs of X and Y input registers 36 and 39. The data processor of FIG. 3 is otherwise identical to the data processor of FIG. 2 and utilizes the same numbered elements for ease of illustration with the exception that feedback path 57 has been replaced by a feedback path 67 from a second output of multiply/accumulator 49 to the input of X and Y input registers 36 and 39 via multiplexor circuits 68 and 69, respectively. As illustrated in FIG. 3, accumulator register 54 now only has three outputs instead of four outputs. Additionally, multiplexors 43 and 45 only have two inputs each as illustrated in FIG. 3 rather than having three inputs each. In another form, feedback path 67 may be coupled to only one of input registers 36 or 39 via only one of multiplexor circuits 68 or 69, respectively. Feedback path 67 may selectively couple the output of multiply/accumulator 49 to either of input registers 36 or 39 or to both. From input registers 36 and 39, the output of multiply/accumulator 49 may be coupled back to the first or second input or to both inputs of multiply/accumulator 49. The output of multiply/accumulator 49 may also be written to external memory after being stored in input registers 36 and 39 as described below in further detail.
Both data processors 35 and 35' of FIGS. 2 and 3, respectively, provide distinct advantages over processor 10 of FIG. 1. Data processors 35 and 35' are more efficient and flexible in their implementation of signal processing algorithms as discussed above. Feedback paths 57 and 67 of data processors 35 and 35', respectively, allow the output of multiply/accumulator 49 to be coupled to one or both inputs thereof without the use of data busses 38 and 40. As a result, data busses 38 and 40 are simultaneously available to load new operands into input registers 36 and 39, respectively. In data processor 10, however, the same operation would require the use of data bus 15 thereby precluding the use of the bus for loading input operands.
In the illustrated form, data processor 35 of FIG. 2 provides distinct advantages over data processor 35' of FIG. 3 with respect to overflow conditions and input data storage. Multiplier products typically require two times the number of register bits for storage compared to multiplier and multiplicand operands. Therefore, the size of accumulator registers 54 is typically twice as large as that of input registers 36 and 39. Additionally, accumulator registers 54 may provide extra upper data bits to provide an accumulator extension to accommodate word growth in repetitive multiply/accumulate operations. Data processor 35 of FIG. 2 provides feedback path 57 from accumulator registers 54. The larger size of accumulator registers 54 allows the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. It is desirable to minimize errors if an accumulator overflow has occurred. Typically, accumulator registers may be tested for overflow before the accumulator register value is reused by feedback path 57. Shifter/limiters 60 and 61 also allow overflows to be detected and limited before data is written to external memory. Data processor 35' of FIG. 3 provides feedback path 67 from multiply/accumulator 49. The smaller size of input registers 36 and 39 does not allow the entire output of multiply/accumulator 49 to be stored without overflow or roundoff errors. Therefore, the possibility of roundoff and overflow errors is greatly increased. Typically, processor input registers do not provide the ability to test for overflow errors. Feedback path 67 may also be used to store a multiply/accumulator 49 result in input registers 36 or 39 which is then written out to external memory via the respective multiplexor 41 or 47 and bus driver 42 or 48. Since no shifter/limiter functions are provided in either input register feedback path, overflows cannot be detected and limited before data is written to external memory.
A second advantage of processor 35 over processor 35' is due to the fact that signal processing algorithms typically require more input operands than output operands. One example is the typical multiply/accumulate operation where two input operands are required from external memory. Feedback path 57 of processor 35 does not require the use of input registers 36 or 39 to store the output of multiply/accumulator 49 thereby preserving useful storage means for input operands. Since processor 35 uses only accumulator registers 54 to store multiply/accumulator 49 results, there is no contention for input registers when input operands are required from external memory. Feedback path 67 of processor 35' however requires the use of at least some of input registers 36 and 39 which reduces the amount of useful storage registers for input operands. Processor 35' may present a contention problem since input registers 36 and 39 may be written from either the memory busses 38 and 40, respectively, or the multiply/accumulate feedback path 67. This contention problem may lessen the efficiency of processor 35' when feedback path 67 is used. Therefore, processor 35 of FIG. 2 is a preferred embodiment of the present invention over processor 35' of FIG. 3.
A common application of data processors 35 and 35' is in digital filtering. Input registers 36 and 39 are loaded with data, typically time sampled values stored in a work space of a filter commonly implemented as a digital delay line. A plurality of consecutive stages in external memory contain consecutive time samples of data. Also stored in a consecutive time sequence in external memory are coefficient values which form an impulse response of the filter. Therefore, data describing the characteristics of the time and frequency response of the digital filter is stored in external memory along with sampled signal values. A plurality of repetitive data loads are executed by reading memory and loading input registers 36 and 39 to couple data values and accompanying coefficient values for multiplication and accumulation. When proceeding from filter output sample time N to sample time (N + 1), an effective time shift of the sampled data in external memory must be effected. A time shift of sampled data may be effected in external memory by executing a move of data from memory to a register and then back to memory at a different location. However, with the architecture of processor 10 of FIG. 1, such a time shift of sampled data in external memory will require a series of data movement operations after the filtering operation and will require at least two cycles per data sample.
In the illustrated forms, the present invention reduces overhead associated with a time shift operation on sampled data by providing the ability not only to write input registers 36 and 39 but also to read both input registers. At the time a filter operation is being performed, a data sample and a coefficient value are coupled to input registers 36 and 39, respectively. During the cycle in which newly coupled data is being multiplied and accumulated, input registers 36 and 39 may be read back to memory to an appropriate location which effects a time shift of the sampled data. Such a location is typically one address removed in sequential memory from where the data originated, thereby effecting one unit of digital time delay. As a result, the next filter calculation involves sample time (N + 1) of the filter. Using processor 10 of FIG. 1, if the multiply/accumulate throughput is one cycle per tap for an N-tap filter, the time required to calculate the filter would be approximately N cycles plus a few overhead cycles. If a time shift were effected afterwards, it would take another 2N cycles in order to perform the data shift. The complete process takes at least 3N cycles, with most of the time being consumed by shifting data from sample time N to sample time (N + 1) rather than calculating the filter. This is a primary disadvantage of the structure of input registers 14 and 16 associated with processor 10 of FIG. 1 and results from inability to read the samples stored in registers 36 and 39. Additionally, during the extra 2N cycles of overhead, effort is duplicated in half of those cycles, since every sample is read again even though it was already read during the preceding calculation of the filter.
In the illustrated form, the present invention provides the ability to read back the contents of input registers 36 and 39 and thereby avoid reading each memory location twice. Therefore, overhead is reduced from 2N cycles to N cycles resulting in a total filter calculation time of 2N rather than 3N. A further advantage of the present invention includes the fact that all of the input registers 36 and 39 may be read as well as written on an interrupt or a break in processing execution. Therefore, the contents of input registers 36 and 39 may be saved in external memory so that processors 35 and 35' may be used in an interrupt routine for another function. Upon completion of the interrupt, the data may then be restored from external memory and the filter calculation continued without significant additional overhead. By virtue of the system architecture of processor 10 of FIG. 1, processor 10 is generally unavailable during an interrupt because the processor registers cannot be easily saved and restored.
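The saving from reading back an input register can be sketched in software. The Python model below is only an illustration under stated assumptions: external memory is modelled as a list, the function names and the read/write counters are hypothetical, and only sample-memory accesses (not coefficient fetches) are counted. The first function filters and then shifts the delay line in a separate pass; the second writes each sample one address further along while it is still held in the input register, so each sample location is read once.

```python
def fir_separate_shift(delay, coeffs):
    """Filter first, then shift the delay line in a second pass.

    delay[k] holds the sample x(n-k). Returns (output, reads, writes),
    where reads and writes count accesses to the sample memory only.
    """
    reads = writes = 0
    acc = 0
    for k in range(len(coeffs)):
        acc += coeffs[k] * delay[k]   # one sample read per tap
        reads += 1
    # Separate shift pass: every sample except the oldest is read again.
    for k in range(len(delay) - 1, 0, -1):
        tmp = delay[k - 1]
        reads += 1
        delay[k] = tmp
        writes += 1
    return acc, reads, writes


def fir_readback_shift(delay, coeffs):
    """Filter and shift in one pass by writing back the value just loaded.

    Taps are visited from the oldest sample to the newest so that a
    location is always read before it is overwritten.
    """
    reads = writes = 0
    acc = 0
    for k in range(len(coeffs) - 1, -1, -1):
        sample = delay[k]             # load the "input register"
        reads += 1
        acc += coeffs[k] * sample     # multiply/accumulate
        if k + 1 < len(delay):
            delay[k + 1] = sample     # read the register back to memory
            writes += 1
    return acc, reads, writes
```

For a four-tap example both functions return the same output and leave the delay line in the same state, but the read-back version performs four sample reads where the separate-shift version performs seven.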
There are many algorithms in which the general capability of being able to read data present in an input register will save cycles of execution time as opposed to accessing memory for a second time to obtain the data. Typically, the overhead associated with systems such as processor 10 of FIG. 1 is apparent when accessing data in memory with addressing means. If input registers 14 and 16 cannot be read out to data busses 15 and 18, respectively, addressing means (not shown) may be required to access the data a second time from external memory. However, the addressing pointer may have already been updated so that the addressing means no longer points at the data for a second access. If the addressing pointer no longer points to the proper location to access the data for a second time, another address pointer or extra address pointer modification is required. This requires additional hardware or clock cycles. In comparison, since processors 35 and 35' of FIGS. 2 and 3, respectively, have the capability of directly reading the data in input registers 36 and 39, no need for an additional address pointer or address pointer modification exists. The flow of data from input registers 36 and 39 to external memory is controlled by multiplexors 41 and 47, respectively, and bus driver circuits 42 and 48, respectively.
An illustration of the shared use of input registers by processors 35 or 35' and a data bus will be given below for a conventional infinite impulse response (IIR) filter such as a biquadratic second order section digital filter. However, it should be apparent that the present invention applies equally to finite impulse response (FIR) filters and other digital signal processing algorithms. Shown in FIG. 4 is a conventional structure of a second order biquadratic filter 70 commonly implemented in software. Shown in Table 1 in the attached appendix is a software example of a calculation of filter 70 by either processor 35 or 35'. Filter 70 of FIG. 4 generally comprises adder circuits 71 and 72, multiplier circuits 74, 75, 76 and 77 and data memory storage locations 79 and 80. The equations which filter 70 implements are:
W(n) = X(n) - a1W(n-1) - a2W(n-2)
Y(n) = W(n) + b1W(n-1) + b2W(n-2).
An input signal X(n) is coupled to a first input of adder 71 and an output signal Y(n) is provided by an output of adder 72. An intermediate signal W(n) is formed and stored in data memory storage locations 79 and 80 with a digital time delay of one and two, respectively. Multipliers 74, 75, 76 and 77 function to multiply a respective data input with a designated coefficient value which is stored in coefficient memory storage (not shown). The coefficient values determine the impulse response of the digital filter. To implement filter 70 by data processors 35 or 35', input operands are first coupled to input registers 36 and 39. The value W(n-2) stored in location 80 is coupled to an X input register 36 labeled "X0" and coefficient (-a2) is coupled from coefficient memory storage to a Y input register 39 labeled "Y0". The input value X(n) is assumed to be preloaded in accumulator register 54 and labeled "A". The multiply/accumulate operation is then performed by multiply/accumulator 49 and new operands are loaded into input registers 36 and 39 from external memory for use in the next clock cycle.
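The two equations of filter 70 can be written directly as a short Python sketch. This is only a software restatement of the equations under assumed names: `biquad_step`, `run_biquad` and the list-based state (standing in for storage locations 79 and 80) are illustrative and do not appear in the disclosure.

```python
def biquad_step(x, state, a1, a2, b1, b2):
    """Advance the second order section by one sample.

    state is [W(n-1), W(n-2)], the contents of the two delay locations.
    Returns (Y(n), new_state).
    """
    w1, w2 = state
    w = x - a1 * w1 - a2 * w2    # W(n) = X(n) - a1W(n-1) - a2W(n-2)
    y = w + b1 * w1 + b2 * w2    # Y(n) = W(n) + b1W(n-1) + b2W(n-2)
    # Time shift: W(n) becomes W(n-1), and W(n-1) becomes W(n-2).
    return y, [w, w1]


def run_biquad(samples, a1, a2, b1, b2):
    """Filter a sequence of samples, starting from a zeroed delay line."""
    state = [0, 0]
    outputs = []
    for x in samples:
        y, state = biquad_step(x, state, a1, a2, b1, b2)
        outputs.append(y)
    return outputs
```

With a1 = a2 = 0 the section reduces to a three-tap FIR filter, so a unit impulse returns the sequence 1, b1, b2.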
Table 1 illustrates on a step by step basis what ALU operation is being executed, what data and coefficient transfer is occurring between external memory and input registers 36 and 39 and comments to indicate what mathematical operation is occurring. Five operation cycles are required for execution of filter 70, which is the minimum number possible to preload the first operands and perform four multiplications with a single multiplier ALU. Input registers X0 and X1 of registers 36 and register Y0 of registers 39 serve as input pipeline registers. Shown in the dotted box of Table 1 is an example of the shared use feature of input register 36 by a data bus and an ALU. Initially, signal W(n-2) is read from memory storage location 80 into X input register X0. During the first ALU operation, the signal W(n-1) is read from memory storage location 79 into X input register X1. During the following clock cycle, the contents of input register X1 are written back to memory storage location 80 representing signal W(n-2), thereby effecting a time shift of data in filter 70 from memory storage location 79 to 80. While input register X1 is being written to memory storage location 80, the ALU operation is using input register X1 as the multiplicand input of multiply/accumulator 49. At the end of the second ALU operation, a value for signal W(n) has been calculated and stored in the accumulator register labeled "A" illustrated in Table 1. This value is also used as the third input to multiply/accumulator 49 for the third ALU operation. The value of A present in accumulator register 54 and equal to W(n) is stored away as W(n-1) in memory storage location 79 during the third ALU operation. Therefore, both values in memory storage locations 79 and 80 are read into input registers 36 and 39 and new values are written back into memory storage locations 79 and 80 to effect a time shift.
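The five-cycle sequence described above can be approximated by the following Python sketch. Table 1 itself is not reproduced in this text, so several details are assumptions: the order of the last two multiplications (b2 with X0, then b1 with X1), the dictionary model of memory keyed by locations 79 and 80, and the function and variable names. Within each cycle the multiply/accumulate is written first and the bus transfers second, to mimic operands being loaded for use in the next cycle.

```python
def biquad_five_cycle(x_n, mem, coef):
    """Model one pass of filter 70 as a preload cycle plus four ALU operations.

    mem maps storage locations 79 and 80 to W(n-1) and W(n-2).
    coef maps "a1", "a2", "b1", "b2" to the coefficient values.
    Returns (Y(n), number_of_cycles) and updates mem in place.
    """
    # Cycle 1 (preload): A holds X(n); X0 and Y0 are loaded from memory.
    acc = x_n
    x0 = mem[80]
    y0 = -coef["a2"]

    # Cycle 2 (ALU operation 1): A = X(n) - a2*W(n-2); load X1 and -a1.
    acc += x0 * y0
    x1 = mem[79]
    y0 = -coef["a1"]

    # Cycle 3 (ALU operation 2): A = W(n); X1 is also written back to
    # location 80, shifting W(n-1) into the W(n-2) slot.
    acc += x1 * y0
    mem[80] = x1
    y0 = coef["b2"]

    # Cycle 4 (ALU operation 3): A, equal to W(n), feeds the third input
    # and is stored to location 79 as the new W(n-1).
    w_n = acc
    acc += x0 * y0
    mem[79] = w_n
    y0 = coef["b1"]

    # Cycle 5 (ALU operation 4): A = Y(n).
    acc += x1 * y0
    return acc, 5
```

Calling the function with X(n) = 100, W(n-1) = 3, W(n-2) = 5, a1 = 2, a2 = 1, b1 = 4 and b2 = 6 gives W(n) = 89 and Y(n) = 131, matching the filter equations, with locations 79 and 80 left holding 89 and 3.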
Typically, to create higher order digital filters, a plurality of biquad filters such as filter 70 are cascaded as shown in FIG. 5. Repetitive software may be used to cascade filters directly. Time savings can be realized by overlapping the operand preload clock cycle with the last ALU operation clock cycle of the previous filter as shown in Table 2 in the attached appendix. As a result, an execution time of 4N+1 clock cycles is required for a cascade of N biquad filters which is the optimal time for a single multiplier ALU. Since multiply/accumulator 49 and both data busses 38 and 40 are busy all 4N cycles, optimal execution time is not possible without the ability to simultaneously use input registers 36 and 39 between an ALU and a data bus. Signal values W(n-1) and W(n-2) for each filter are stored in data memory storage locations such as storage locations 79 and 80 of filter 70. The values for each filter are indicated in Table 2 by use of subscripts such as W3(n-1) for filter F3. Similarly, coefficients -a1, -a2, b1 and b2 are illustrated for each filter by additional subscripts such as -a31 representing coefficient -a1 for filter F3. By analyzing the time required to effect the total filter operation illustrated in Table 2, it should be apparent that the clock cycles which prefetch operands for each filter after the first filter overlap execution cycles of previous filters to accomplish 4N stage filter execution in 4N+1 cycles. Again, fast throughput is realizable only because of the ability to share input registers between an ALU and a data bus.
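A cascade of biquad sections and the cycle counts quoted above can be sketched as follows. This Python model is illustrative only: the function names are hypothetical, each section's state is a two-element list standing in for its pair of storage locations, and the non-overlapped count of 5N cycles is an inference from the five cycles stated for a single section rather than a figure given in the text.

```python
def biquad_section(x, state, coef):
    """One second order section; state is [W(n-1), W(n-2)]."""
    a1, a2, b1, b2 = coef
    w1, w2 = state
    w = x - a1 * w1 - a2 * w2
    y = w + b1 * w1 + b2 * w2
    return y, [w, w1]


def cascade_step(x, states, coefs):
    """Feed one sample through every section in turn.

    The output of each section is the input of the next. states is
    updated in place so the delay values persist between samples.
    """
    for i, coef in enumerate(coefs):
        x, states[i] = biquad_section(x, states[i], coef)
    return x


def cycles_without_overlap(n_sections):
    """Each section pays its own preload cycle plus four multiplies."""
    return 5 * n_sections


def cycles_with_overlap(n_sections):
    """Only the first preload is exposed; later preloads overlap the
    final multiply of the preceding section."""
    return 4 * n_sections + 1
```

For three sections the overlapped schedule takes 13 cycles against 15 without overlap, and the saving grows by one cycle for every additional section.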
By now it should be apparent that shared use of input registers by a processor and a data bus allows the processor to operate at optimal speed. The processor structure of the present invention makes calculation of an accumulated product possible in a single clock cycle as opposed to multiple clock cycles. Since two data busses are coupled to each of processors 35 and 35', data values and coefficient values may be coupled to either processor 35 or 35' to ensure that processor operating speed is not adversely affected. The processor architecture of the present invention also minimizes storage register requirements. By virtue of a feedback path between the output and input of a multiplier/accumulator circuit, an accumulated product may be immediately used as an input operand for a successive multiplication without an extra overhead cycle. As a result, a very time efficient and flexible processor has been provided.

Claims

1. A digital signal processor for implementing algorithms by providing a product of first and second input operands selectively accumulated with a third input operand, comprising: first input storage means having an input for selectively receiving and storing the first input operand, and an output; second input storage means having an input for selectively receiving and storing the second input operand, and an output; multiplier/accumulator means having a first input selectively coupled to either the output of the first input storage means or the output of the second input storage means, a second input selectively coupled to either the output of the first input storage means or the output of the second input storage means, a third input selectively coupled to a third input operand, and an output for providing the product with selective accumulation during a single clock cycle in response to receipt of said first, second and third input operands, said clock cycle being an amount of time between successive storage loads of the first and second input storage means; and output storage means having an input selectively coupled to the output of the multiplier/accumulator means, and an output selectively coupled to at least a predetermined one of the first, second or third inputs of the multiplier/accumulator means, for implementing digital signal processing algorithms.
2. The digital signal processor of claim 1 further comprising: first data shifting and limiting means having an input coupled to the output of the output storage means, and an output coupled to a first data bus, for selectively shifting data contents of the output storage means and limiting the magnitude of said data contents; and second data shifting and limiting means having an input coupled to the output of the output storage means, and an output coupled to a second data bus, for selectively shifting data contents of the output storage means and limiting the magnitude of said data contents.
3. The digital signal processor of claim 1 further comprising: data shifting means having an input coupled to the output of the output storage means and an output coupled to the third input of the multiplier/accumulator circuit, for selectively shifting predetermined bits of the output of the output storage means.
4. A method of providing a digital signal processor for performing an arithmetic operation, comprising the steps of: selectively coupling an output of at least a predetermined one of a first or second input storage means or an output storage means to at least a predetermined one of first and second inputs of a multiplier/accumulator circuit; selectively coupling an output of the output storage means to a third input of the multiplier/accumulator circuit; multiplying the first and second inputs of the multiplier/accumulator circuit to provide a product and selectively accumulating the third input with the product to provide an output; and selectively storing the output of the multiplier/accumulator circuit in an output storage means, said method being performed in a single clock cycle of the processor, said clock cycle being an amount of time between successive storage loads of the first and second input storage means.
5. The method of claim 4 wherein the selective coupling of the output of the output storage means to the third input of the multiplier/accumulator circuit is provided by shifting means coupled between the output storage means and the multiplier/accumulator, for selectively shifting predetermined bits of the output of the output storage means.
6. In a data processor for receiving an input operand to be coupled from external circuitry via a data bus to an arithmetic logic unit, circuit means for storing an operand for shared use by both the arithmetic logic unit and the data bus, comprising: input storage means having an input terminal coupled to the data bus for selectively receiving the input operand, a first output terminal coupled to the arithmetic logic unit for selectively coupling an output of the input storage means to the arithmetic logic unit, and a second output terminal coupled to the input terminal, for selectively coupling the output of the input storage means back to said data bus.
7. The circuit means of claim 6 further comprising: data bus driver means having an input coupled to the second output terminal and an output coupled to the input terminal, for selectively driving the output of the input storage means onto the data bus.
8. The circuit means of claim 6 further comprising: first multiplexor means having a first input coupled to the input terminal, and a second input coupled to an output of the arithmetic logic unit, for selectively coupling either an operand from the data bus or the output of the arithmetic logic unit to the input storage means.
9. A method of shared use of an input storage means in a data processor between an arithmetic logic unit and a data bus, comprising the steps of: selectively coupling an input operand from the data bus to the input storage means; and selectively coupling an output of the input storage means to the arithmetic logic unit while selectively coupling the output of the input storage means back to the data bus.
10. The method of claim 9 further comprising the step of: selectively storing an output of the arithmetic logic unit to the input storage means before coupling the output of the input storage means to the data bus.
PCT/US1985/001423 1984-09-28 1985-07-26 A digital signal processor for single cycle multiply/accumulation WO1986002181A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1019860700311A KR860700300A (en) 1984-09-28 1985-07-26 Input memory circuit means and its distribution method

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US65559984A 1984-09-28 1984-09-28
US65528584A 1984-09-28 1984-09-28
US655,599 1984-09-28
US655,285 1984-09-28

Publications (1)

Publication Number Publication Date
WO1986002181A1 (en) 1986-04-10

Family

ID=27096944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1985/001423 WO1986002181A1 (en) 1984-09-28 1985-07-26 A digital signal processor for single cycle multiply/accumulation

Country Status (3)

Country Link
EP (1) EP0197945A1 (en)
KR (1) KR860700300A (en)
WO (1) WO1986002181A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3761698A (en) * 1972-04-24 1973-09-25 Texas Instruments Inc Combined digital multiplication summation
US4025771A (en) * 1974-03-25 1977-05-24 Hughes Aircraft Company Pipe line high speed signal processor
US4041461A (en) * 1975-07-25 1977-08-09 International Business Machines Corporation Signal analyzer system
US4130879A (en) * 1977-07-15 1978-12-19 Honeywell Information Systems Inc. Apparatus for performing floating point arithmetic operations using submultiple storage
US4194241A (en) * 1977-07-08 1980-03-18 Xerox Corporation Bit manipulation circuitry in a microprocessor
US4202039A (en) * 1977-12-30 1980-05-06 International Business Machines Corporation Specialized microprocessor for computing the sum of products of two complex operands
US4215416A (en) * 1978-03-22 1980-07-29 Trw Inc. Integrated multiplier-accumulator circuit with preloadable accumulator register
US4339793A (en) * 1976-12-27 1982-07-13 International Business Machines Corporation Function integrated, shared ALU processor apparatus and method

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4809212A (en) * 1985-06-19 1989-02-28 Advanced Micro Devices, Inc. High throughput extended-precision multiplier
US4817047A (en) * 1985-07-09 1989-03-28 Nec Corporation Processing circuit capable of raising throughput of accumulation
US4754421A (en) * 1985-09-06 1988-06-28 Texas Instruments Incorporated Multiple precision multiplication device
EP0377994A2 (en) * 1989-01-13 1990-07-18 International Business Machines Corporation Apparatus for performing floating point arithmetic operations
EP0377994A3 (en) * 1989-01-13 1991-07-31 International Business Machines Corporation Apparatus for performing floating point arithmetic operations
EP0505884A2 (en) * 1991-03-29 1992-09-30 Hitachi, Ltd. Arithmetic circuit, and adaptive filter and echo canceler using it
EP0505884A3 (en) * 1991-03-29 1994-03-09 Hitachi Ltd
FR2685108A1 (en) * 1991-12-14 1993-06-18 Samsung Electronics Co Ltd Motion vector detection method
GB2287331B (en) * 1994-03-02 1998-04-29 Advanced Risc Mach Ltd Electronic multiplying and adding apparatus and method
GB2287331A (en) * 1994-03-02 1995-09-13 Advanced Risc Mach Ltd Electronic multiplying and adding apparatus.
US5528529A (en) * 1994-03-02 1996-06-18 Advanced Risc Machines Limited Electronic multiplying and adding apparatus and method
GB2291515B (en) * 1994-07-14 1998-11-18 Advanced Risc Mach Ltd Data processing using multiply-accumulate instructions
US5583804A (en) * 1994-07-14 1996-12-10 Advanced Risc Machines Limited Data processing using multiply-accumulate instructions
GB2291515A (en) * 1994-07-14 1996-01-24 Advanced Risc Mach Ltd Data processing using multiply-accumulate instructions.
KR100560345B1 (en) * 1996-09-13 2006-05-30 미크로나스 세미컨덕터 홀딩 아게 Digital signal processor
GB2321979A (en) * 1997-01-30 1998-08-12 Motorola Ltd Modular multiplication circuit
GB2321979B (en) * 1997-01-30 2002-11-13 Motorola Ltd Modular multiplication circuit
WO1998038582A1 (en) * 1997-02-28 1998-09-03 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive dual filter echo cancellation
US5933797A (en) * 1997-02-28 1999-08-03 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive dual filter echo cancellation
GB2341067A (en) * 1997-02-28 2000-03-01 Ericsson Telefon Ab L M Adaptive dual filter echo cancellation
GB2341067B (en) * 1997-02-28 2001-11-07 Ericsson Telefon Ab L M Adaptive dual filter echo cancellation
DE19882141B4 (en) * 1997-02-28 2009-01-02 Telefonaktiebolaget Lm Ericsson (Publ) Adaptive double-filter echo cancellation
EP1058185A1 (en) * 1999-05-31 2000-12-06 Motorola, Inc. A multiply and accumulate apparatus and a method thereof
EP3835938A1 (en) * 2019-12-11 2021-06-16 Unify Patente GmbH & Co. KG Computer-implemented method of executing an arithmetic or logic operation in combination with an accumulate operation and processor

Also Published As

Publication number Publication date
EP0197945A1 (en) 1986-10-22
KR860700300A (en) 1986-08-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP KR

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): DE FR GB IT NL SE

WWE Wipo information: entry into national phase

Ref document number: 1985903782

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1985903782

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 1985903782

Country of ref document: EP