GB2138183A

GB2138183A - Multi-processor system

Info

Publication number: GB2138183A
Application number: GB08402288A
Authority: GB
Inventors: Andrew John Mcwilliam
Original assignee: Standard Telephone and Cables PLC
Current assignee: STC PLC
Priority date: 1984-01-28
Filing date: 1984-01-28
Publication date: 1984-10-17
Also published as: GB2138183B; GB8402288D0

Abstract

Where several processors are used in a multi-processor system, e.g. in digital signal processing, they are interconnected by a System Bus. If a processor (e.g. DSP1) needs a data transfer to or from another processor, it sends out a bus request signal to a Bus Arbitration Logic (BAL), which allots the bus to the calling processor if it is free. If not, the calling processor tries again. If two processors call at once, the arbitration logic selects one to have access to the bus. When such an allocation occurs the calling processor emits the address of the wanted processor for decoding at DAD, as a result of which the calling and called processors are interconnected for transfer via direct memory access processors in the two processors included in the transfer.

Description

SPECIFICATION

Multi-processor system

The present invention relates to electrical digital computers, and especially to a multi-processor system.

According to the invention there is provided a digital processing system, which includes a plurality of electrical digital processors all having access to a common system bus, wherein a processor which requires data transfer to or from another one of the processors signals a transfer request to the system bus, wherein an arbitration logic circuit connected to the system bus receives the transfer requests from all of the processors and if the bus is available for use for such a transfer it allocates the bus therefor, wherein when such a bus allocation is made a signal to that effect is sent to the requesting processor, wherein the requesting processor in response to the said signal from the arbitration logic emits the address of the other processor involved in the desired data transfer, wherein address decoding means associated with the system bus detects the address of the wanted processor and activates an address processor therein which address processor is additional to that processor's main processing unit, wherein the address processors of the two processors now control the transfer of the data to be transferred from one of the processors to the other of the processors.

As will be seen later the processors used in the preferred embodiment of the above system are microprocessors. Further, the system is for use in digital signal processing (DSP).

A data processing system embodying the invention will now be described with reference to the accompanying drawings, in which Figure 1 is a simplified block diagram of the architecture, i.e. the internal structure, of a microprocessor embodying the invention of our Appln. No. 8310157 from which this application has been divided, Figure 2 is a block diagram of one of the data address units used in the microprocessor shown in Figure 1, and Figure 3 shows how two or more microprocessors of the type shown in Figures 1 and 2 can be used in a multiprocessor system embodying the present invention.

The microprocessor shown in Figure 1 is in a 40-pin "package", and the pins indicated in Figure 1 will be referred to in the course of the description. This microprocessor has been designed with an architecture and instruction set intended to enhance the speed of execution of a broad class of digital signal processing (DSP) tasks, although it can be programmed for other tasks. To achieve this end it makes extensive use of parallel processing and pipe-lining "on chip". The term pipe-lining as used herein means that execution of two or more instructions is overlapped in time. Thus where the instructions each involve a number of cycles which must be done in sequence, several instructions can be done by executing different cycles for different instructions simultaneously. As will be seen, these concepts are extended to multiprocessor arrangements by the method of inter-processor data transfer described below.

The overall architecture of the individual processor of Figures 1 and 2 embodies two main features which render the microprocessor especially useful for DSP applications. These are:

(a) The data address units DAU 1 and DAU 2, which generate addresses for the two internal data random access memories RAM 1 and RAM 2, can be software configured to create ring buffers of any size in the memory. A ring buffer is a number of consecutive memory locations which acts in effect as a closed loop store. No software pointer maintenance is needed. The unit DAU 1 will be described later with reference to Figure 2, DAU 2 being similar to DAU 1.

(b) Inter-processor communication uses an "on-chip" parallel direct memory access (DMA) processor to reduce the time needed for data transfer. This processor, which is additional to the main processing elements, includes a specialised DMA processor, plus a DMA buffer. Tasks which need more processing "throughput" than is conveniently attainable with one processor can thus be done efficiently by using more than one such microprocessor, as described with reference to Figure 3.

Before describing the above features, we describe the processor architecture briefly, see Figure 1. The microprocessor includes a timing generator which provides clock pulses for internal use, with a SYNC input and a CLK (clock) input from the system in which the processor is used. This unit also has an active low reset input RST, and an output 1 which is an instruction rate output suitable for input to another processor's SYNC input or for strobing an external input-output (I/O) device address decode latch.

An input IE from an external I/O device address decoder gives access to a DMA buffer. When this input is low it indicates that DMA transfer on the system bus is required, i.e. the processor shown is the slave in a data transfer. This buffer has access to the system bus S(0..7), a Bus Controller, a Program Controller, the Arithmetic/Logic Unit ALU, a multiplier and the two RAMs. The ALU is the processor's main processing unit, and is a 35-bit device. The multiplier can multiply two 16-bit numbers to give a 32-bit product; it is additional to the unit ALU in view of the large number of multiplications needed in DSP. The remainder of the block diagram includes the address generating units DAU 1, DAU 2 and the DMA CPU referred to above.

The pins and their purposes are listed in the following table.

TABLE

External ROM I/O Pins

i .7)      Supplies the least significant byte of the external ROM address.
D(0..7)    Bidirectional: this is the external ROM output data bus. In the other direction it is used to output the most significant byte of the external ROM address to an external latch.
AS         A strobe output to an external latch.
OE         Used to tristate the external ROM data output when D(0..7) is used to output from the processor chip.

Serial I/O Pins

SYIN       General purpose 1-bit input, which is the subject of two conditional branch instructions.
SYOUT      General purpose 1-bit output, which can be set high or low by instruction execution.

Parallel Data I/O Pins

S(0..7)    Bidirectional System Bus over which all data transfers between processors or peripherals take place.
RTS        Output to bus arbitration requesting to be master of the system bus (active low).
RFS        Input from bus arbitration granting request to be master of the system bus (active low).
IE         Input from an external I/O device address decode. Active low indicates that DMA transfer on the system bus is required, i.e. that this processor is the slave in a data transfer.
WR/RD      Tristate output at system bus master indicating direction of data transfer as seen by bus master. Input at slave device.
HIB        Tristate output at system bus master, input at slave. Active low indicates that the most significant byte of a 16-bit word is on the bus.
LOB        Tristate output at system bus master, input at slave. Active low indicates that the least significant byte of a 16-bit word is on the bus.

Timing Pins

CLK        Input for 10 MHz externally generated clock.
SYNC       Input for instruction cycle rate clock used to synchronise multi-processors during reset.
1          Instruction cycle rate output suitable for input to SYNC of other processors, also for strobing external I/O device address decode latch.
RST        Active low reset input.

Supply Pins

VCC        5 volts.
GND        Ground.

We now consider the data address units, Figure 2. As mentioned above, the microprocessor chip, which includes two address generation units, can read two 16-bit data words one from each RAM during a single instruction cycle. This is in spite of the fact that only 8 bits of the instruction are allocated to RAM addresses.

The two 8 bit addresses needed are generated by the address units, four instruction bits each being used to control these units.

A data address unit has a base address unit BARU which contains four 8-bit base address registers (BARs), a pointer register PR which is an 8-bit up-down counter, and a vector length register VLR which can be loaded and read under software control. Of the four instruction bits controlling the DAU, two are used to select one of the four BARs over the Address input to the unit BARU, and two are used to select one of four address modes.

These modes are: (1) Direct address, in which the instruction address causes the read-out of the contents of the selected BAR.

(2) Indexed Address, which causes read-out of the memory location defined by the sum of the contents of the selected BAR and of the pointer register PR, this being a modulo-256 sum.

(3) Incrementing in which a sequence of memory locations is read on repeated execution of instructions bearing this address mode. The addresses are derived by successively summating the contents of the pointer register PR and unity. Each such incrementation is to the modulus defined by the contents of the vector length register VLR. Thus a sequence of pointer register words is produced each of which is used to produce a RAM address. Each memory address is the sum of the contents of the selected BAR and that of the pointer register PR, modulo 256. This summation is effected by a dedicated adder AD and the result passed via a latch to the A-Bus, from which it goes to the read-out arrangements. This sequence continues until a number of memory locations appropriate to the contents of VLR have been read.

(4) Decrementing is in effect the inverse of incrementing. Here the contents of the pointer register PR for each step is the result of subtracting one from the contents of the vector length register VLR, the modulus being defined by VLR. The memory addresses are each the sum of the selected BAR and the pointer register PR, modulo 256.

Thus it is possible to set up, in either RAM, up to four ring buffers. A BAR is set to point to the word in the ring with the lowest physical address in its RAM. The register VLR is then set to a condition defining the size of ring. It is then possible, using the incrementing or decrementing mode as appropriate, to step round the ring in either direction without concern for the location of the "joint".
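The four address modes and the ring-buffer behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names are invented for illustration, the vector length register is assumed to hold the ring size directly, and the address is assumed to be emitted before the pointer register is stepped. None of these details is fixed by the description.

```python
from dataclasses import dataclass, field


@dataclass
class DataAddressUnit:
    """Toy model of one data address unit (hypothetical names, assumed timing)."""

    # Four 8-bit base address registers (BARs).
    bars: list = field(default_factory=lambda: [0, 0, 0, 0])
    pr: int = 0   # pointer register PR, an 8-bit up-down counter
    vlr: int = 1  # vector length register VLR; assumed to hold the ring size

    def direct(self, bar: int) -> int:
        # Direct mode: the address is the content of the selected BAR.
        return self.bars[bar] & 0xFF

    def indexed(self, bar: int) -> int:
        # Indexed mode: BAR + PR, summed modulo 256.
        return (self.bars[bar] + self.pr) & 0xFF

    def incrementing(self, bar: int) -> int:
        # Emit BAR + PR, then step PR upwards modulo the ring size (assumed order).
        addr = self.indexed(bar)
        self.pr = (self.pr + 1) % self.vlr
        return addr

    def decrementing(self, bar: int) -> int:
        # Emit BAR + PR, then step PR downwards modulo the ring size (assumed order).
        addr = self.indexed(bar)
        self.pr = (self.pr - 1) % self.vlr
        return addr


# A five-word ring whose lowest physical address is 0x20.
dau = DataAddressUnit(bars=[0x20, 0, 0, 0], vlr=5)
print([hex(dau.incrementing(0)) for _ in range(7)])
```

Stepping seven times round a five-word ring wraps back to the base address without any pointer maintenance in the calling program, which is the "invisible joint" property relied on in the text.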

To illustrate the value of this feature, consider an example. If we wish to implement a transversal filter (FIR) with fifty taps, we have to compute:

$$ y(n) = \sum_{i=0}^{49} a(i)\, S(n-i) $$

where the a(i) are the fifty fixed coefficients stored in a ring buffer set up in RAM 1, and the S(n-i) are the fifty most recent signal samples, stored in a ring buffer set up in RAM 2. It is desired to generate a new output every time a new signal sample is received.

The method used is that each time a new y(n) is computed, the physical addresses of the coefficient and signal word pairs to be multiplied together are skewed by one location. Skewed in this context means that where two ring buffers are being read, on a second sequence of instruction execution, in one of the sequences one of the ring buffers is in effect shifted by one location as compared with the other. Thus when a new signal sample is written in, it merely over-writes the oldest signal sample which is then redundant. No other movement of signal data in the memory is needed. This is only possible since the "invisible joint" in the ring buffer means that it is not necessary to keep track of physical addresses in memory as the algorithm progresses.
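The skewed ring-buffer method can be sketched in software as follows. This is an illustrative model only, not the microprocessor's instruction sequence: the class name RingFir and its fields are hypothetical, the tap index is assumed to run from 0 to N-1, and ordinary Python lists stand in for RAM 1 and RAM 2.

```python
class RingFir:
    """Transversal (FIR) filter using a ring buffer of samples (illustrative sketch)."""

    def __init__(self, coeffs):
        self.a = list(coeffs)          # fixed coefficients (stands in for RAM 1)
        self.s = [0.0] * len(self.a)   # most recent samples (stands in for RAM 2)
        self.newest = 0                # ring position of the most recent sample

    def step(self, sample):
        n = len(self.a)
        # The slot after the newest sample holds the oldest one; overwrite it.
        self.newest = (self.newest + 1) % n
        self.s[self.newest] = sample
        # Pair a(i) with S(n-i) by walking backwards round the ring.
        # No other sample is moved in memory.
        return sum(self.a[i] * self.s[(self.newest - i) % n] for i in range(n))


# Impulse response of a three-tap filter reproduces the coefficients.
fir = RingFir([1.0, 2.0, 3.0])
print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])
```

Each call shifts the starting position of the sample ring by one location relative to the coefficient ring, which corresponds to the skew described above.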

Thus the addressing technique used greatly simplifies the performance of DSP operations.

We now consider inter-processor data transfer, see Figure 3.

A problem often encountered with microprocessors is the relative inefficiency of inter-processor data transfer, so that systems having three or four processors achieve much less than 3 or 4 times the through-put of one such processor. A usual method is for the processor initiating the data transfer to interrupt the other processor. The interrupted processor takes a significant time to save its machine status, react to the transfer request, and restore its status so as to continue with what it was doing before the interruption. Hence much time is wasted at the slave processor, and also at the master processor as it waits for the slave to respond.

Worse, it often maintains control of the data transfer bus during this waiting time, thus reducing the bus utilization.

The method used herein does not use interrupts; in fact at the slave the data transfer is transparent as far as time is concerned. At the master, the transfer of a 16 bit word usually only uses a single 400nS instruction cycle, wait cycles only being introduced at the master if the data transfer bus is in use when transfer is requested. Wait cycles do not occur at the slave. This is possible because transfer takes place between the master's ALU and a RAM in the slave via the DMA controller in the slave which for this purpose is externally controlled. The duty cycle of the RAM is shared evenly between the slave's main processing unit and the DMA's processor, hence the time transparency.

When a processor such as DSP1, Figure 3, encounters an IN or OUT command in its instruction stream, it becomes a master in a data transfer, and signals its desire to control the System Bus by setting its request-to-send output RTS low. The external bus arbitration logic unit BAL, which is in essence a lock-out circuit with built in priorities if desired, decides within 100nS if the request can be granted. If so it signals this to the requesting processor by setting the pin RFS of the "calling" processor low. If the request cannot be granted, RFS stays high, and the requesting processor, DSP1 in this case, enters a single cycle (400nS) wait state, and maintains the bus control request. BAL is relatively simple since all processors have their instruction cycles locked together so that multiple requests for bus control reach the device BAL simultaneously. Thus the decision by BAL for that instruction cycle is based simply on task priority (where two or more processors are in the "calling" condition at once), and not on time of request.

Control of the bus is only given to a processor for a single cycle, and if a longer time is needed the "calling" processor repeatedly competes for it cycle by cycle.
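The cycle-by-cycle arbitration can be modelled as below. This is a sketch under assumptions: a fixed priority order is only one possible arrangement (the text leaves priorities optional), and the function names and the processor names used in the example are chosen for illustration.

```python
def arbitrate(rts_levels, priority):
    """Return the processor granted the bus this cycle, or None.

    rts_levels: dict of processor name -> RTS level (0 = requesting, active low)
    priority:   list of processor names, highest priority first (assumed fixed)
    """
    for name in priority:
        if rts_levels.get(name, 1) == 0:
            return name
    return None


def run_cycles(pending, priority):
    """Simulate instruction cycles until every pending transfer is done.

    pending: dict of processor name -> number of single-cycle transfers wanted.
    Every name in pending is assumed to appear in priority.
    """
    pending = dict(pending)
    schedule = []
    while any(pending.values()):
        # A processor holds RTS low for as long as it still wants the bus.
        rts = {name: 0 if left else 1 for name, left in pending.items()}
        winner = arbitrate(rts, priority)
        schedule.append(winner)  # RFS goes low at the winner for one cycle only
        pending[winner] -= 1     # the losers wait one cycle and compete again
    return schedule


print(run_cycles({"DSP1": 2, "DSP2": 1}, priority=["DSP2", "DSP1"]))
```

Because all requests arrive in the same instruction cycle, the model needs no record of which request came first; it only compares priorities among the processors requesting in that cycle.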

When control of the system bus has been granted to a master processor, the first thing it does is to send out the 8-bit address of the required slave processor. This address is sent out via the bidirectional bus S(0..7) to the system bus from which it is latched onto a device address latch DAL, where it is decoded by the device address decoder DAD to generate an interface enable IE low at the wanted slave processor. This activates that processor's DMA processor. Data can then be passed in either direction, i.e. master-to-slave or slave-to-master, in one or two eight bit bytes. The write/read WR/RD, high byte HIB and low byte LOB signals generated by the master processor control both the functions and the timing of the DMA buffer at the slave.
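The address decode and the byte-wise transfer of a 16-bit word over the 8-bit system bus can be sketched as follows. The device addresses and function names are hypothetical, chosen only for illustration; the description does not give an address map or the order in which the two bytes are sent.

```python
def decode_device_address(latched_address, address_map):
    """Model of the decoder DAD: IE level per processor (0 = selected, active low).

    address_map: dict of processor name -> 8-bit device address (hypothetical values).
    """
    return {
        name: 0 if addr == (latched_address & 0xFF) else 1
        for name, addr in address_map.items()
    }


def split_word(word):
    """Split a 16-bit word into the bytes flagged by HIB and LOB respectively."""
    return (word >> 8) & 0xFF, word & 0xFF


def join_word(high_byte, low_byte):
    """Reassemble a 16-bit word at the receiving DMA buffer."""
    return ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)


# The master addresses the slave at device address 0x02, then sends one word.
ie = decode_device_address(0x02, {"DSP1": 0x01, "DSP2": 0x02, "DSP3": 0x03})
high, low = split_word(0xABCD)
print(ie, hex(high), hex(low), hex(join_word(high, low)))
```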

Finally the "master" processor releases the system bus by returning RTS high, to which the bus arbitrator BAL responds with RFS high within 100nS. Note that there is no "handshake" from the slave to say that the transfer has been completed correctly, since that would prolong the transfer time and serves very little purpose in a well designed system.

We now consider what happens when a data word is input to the DMA Input Buffer, Figure 1. This is signalled to the DMA CPU which then generates a RAM address for the data word and also controls the timing of the transfer of that word from the DMA Input Buffer to the appropriate RAM over the busses shown.

Similarly when a data word is output by the DMA Output Buffer, this is also signalled to the DMA CPU, which generates a RAM address and controls the transfer of the word at that address to the DMA Output Buffer in readiness for the next request to be "slave" and to output data.

The DMA CPU contains eight 8-bit registers which control its operation. These registers can be loaded at any time by the program running in the processor's main CPU. These are an Instruction Register, and a Bit Reversal Register, and for each of the two RAMs a Base Address Register, Pointer Register, and Vector Length Register.

The Instruction Register allows one to enable or disable DMA input or output independently. It also controls the assignment of the input and output channels to particular RAMs. At any one time one RAM must be assigned to input and the other to output, although both directions need not necessarily be enabled. The Instruction Register also specifies an addressing mode for each RAM. The address modes allowed follow the same scheme as in the main processor Data Address Units, that is, Indexed, Incrementing, and Decrementing. This time there is only one Base Address Register per RAM. The Vector Length Register for each RAM again permits the creation of ring buffers. The remaining function of the Instruction Register is to specify normal or bit-reversed indexing for each RAM independently.

This bit-reversed indexing feature is useful when executing any of the Fast Fourier Transform (FFT) algorithms. If such an algorithm is performed on a block of 8 data samples for example, successive data samples might be required to be stored in locations 0, 4, 2, 6, 1, 5, 3, 7. Such a sequence is created by taking the normal sequence 0 to 7 and reversing the order of the bits in the 3-bit binary representation of these numbers, so that for example 4 (100) becomes 1 (001) and vice-versa. The process of arranging data in this order can be quite time-consuming if it has to be done in software, but costs no extra time when performed as part of the DMA operation.

When bit-reversal is specified by the Instruction Register, some or all of the bits of the relevant Pointer Register are reversed prior to addition to the Base Address Register. (The actual Pointer Register contents remain in normal order). The Bit-Reversal Register allows one to specify over what width of bit-field the reversal should take place, and also whether the algorithm is dealing with real or complex data. The example above relates to the requirement for real data. If a transform were being performed on eight complex data samples they would be stored successively as follows: 0, 1, 8, 9, 4, 5, 12, 13, 2, 3, 10, 11, 6, 7, 14, 15. The real and imaginary parts of each complex data sample are stored in adjacent memory locations, but otherwise the same bit-reversed indexing is performed.
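Both storage sequences quoted above can be reproduced by a short software sketch of bit-reversed indexing. The function names are hypothetical, and the treatment of complex data (reversing the sample index and leaving the real/imaginary bit in place) is an assumption that happens to reproduce the sequence given in the text; the hardware may arrive at it differently.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def real_order(index_bits):
    """Storage offsets for successive real samples."""
    return [bit_reverse(i, index_bits) for i in range(1 << index_bits)]


def complex_order(index_bits):
    """Storage offsets for successive complex samples (real part, then imaginary part)."""
    order = []
    for i in range(1 << index_bits):
        base = 2 * bit_reverse(i, index_bits)  # real part at an even offset
        order += [base, base + 1]              # imaginary part in the next location
    return order


print(real_order(3))
print(complex_order(3))
```

With a 3-bit field, the two calls give the real-data and complex-data sequences listed above, the offsets being added to the Base Address Register to form the RAM address.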

The ability of the DMA facility to write data to, or read data from RAM in such an "intelligent" manner and at virtually no cost in time greatly enhances the processing capability of the processor unit.

Claims

1. A digital processing system, which includes a plurality of electrical digital processors all having access to a common system bus, wherein a processor which requires data transfer to or from another one of the processors signals a transfer request to the system bus, wherein an arbitration logic circuit connected to the system bus receives the transfer requests from all of the processors and if the bus is available for use for such a transfer it allocates the bus therefor, wherein when such a bus allocation is made a signal to that effect is sent to the requesting processor, wherein the requesting processor in response to the said signal from the arbitration logic emits the address of the other processor involved in the desired data transfer, wherein address decoding means associated with the system bus detects the address of the wanted processor and activates an address processor therein which address processor is additional to that processor's main processing unit, wherein the address processors of the two processors now control the transfer of the data to be transferred from one of the processors to the other of the processors.

2. A system as claimed in claim 1, and wherein each said transfer is made from the accumulator of one of the processors to the memory of the other of the processors.

3. A digital processing system, substantially as described with reference to the accompanying drawings.