GB2200482A

GB2200482A - Monitoring control flow in a microprocessor

Info

Publication number: GB2200482A
Application number: GB08729325A
Authority: GB
Inventors: Amos Intrater; Donald Alpert
Original assignee: National Semiconductor Corp
Current assignee: National Semiconductor Corp
Priority date: 1987-01-22
Filing date: 1987-12-16
Publication date: 1988-08-03
Anticipated expiration: 2007-12-16
Also published as: GB8729325D0; DE3740762A1; GB2200482B; JPS63193239A

Description

MONITORING CONTROL FLOW IN A MICROPROCESSOR

The present invention relates to data processing systems and, in particular, to a method for monitoring the sequence of instructions executed by a microprocessor without adversely affecting its operation.

A microprocessor is useful only if it is possible for designers of systems incorporating the microprocessor to "debug" their systems. The system designer must have the capability to observe the operation of the system under development, identify differences between the system's operation and its defined functional specification and then correct the system's design so that its behavior matches the functional specification.

One important aspect of observing the system's operation is following the sequence of instructions executed by the microprocessor. The system designer must be able to follow the sequence of executed instructions without slowing the system or causing the instruction sequence to differ from normal system operation.

Microprocessors are commonly designed so that transfer of control from one instruction to the next is determined both by the location and type of instruction executed and by whether an exception occurs. Exceptions are events, errors and special conditions, such as an attempt to execute an illegal instruction or an interrupt request signalled by a peripheral device, which are detected by the microprocessor.

Instructions executed by a microprocessor may be classified into three types: branch instructions, jump instructions and "other" instructions.

"Branch" instructions are those instructions that potentially transfer control to an instruction at a destination address calculated by adding a displacement value encoded into the currently executing instruction to the address of the currently executing instruction. Branch instructions can be "unconditional" or "conditional"; in the latter case, a test is made to determine whether a specified condition concerning the state of the microprocessor is true. A branch instruction is said to be "taken" either if it is unconditional or if it is conditional and the specified condition is true.

"Jump" instructions are those instructions that potentially transfer control to an instruction at a destination address calculated in a general manner that depends on the definition of the particular instruction. Examples of common jump instructions are "RETURN", which transfers control to an address that is read from the top of the stack in memory, and "CASE", which transfers control to an address that is located by using an operand's value to index into a table of addresses in memory. Like branch instructions, jump instructions can be "unconditional" or "conditional" and are said to be "taken" either if unconditional or if conditional and the specified condition is true.

The significant distinction between branch and jump instructions is that, for branch instructions, it is possible to calculate the destination address knowing only the instruction's encoding and location, whereas for jump instructions, the destination address generally depends on some data value that can vary, such as the contents of a register or memory location.
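The distinction can be illustrated with a minimal sketch. The function names and the 32-bit wrap-around are assumptions made for illustration only; they are not taken from the specification.

```python
from typing import List

ADDRESS_MASK = 0xFFFFFFFF  # assumed 32-bit address space


def branch_destination(instruction_address: int, displacement: int) -> int:
    """Branch: destination depends only on the instruction's location and
    the displacement encoded in the instruction itself."""
    return (instruction_address + displacement) & ADDRESS_MASK


def return_destination(stack: List[int]) -> int:
    """RETURN-style jump: destination is a data value read from the top of
    the stack, so it cannot be derived from the instruction encoding alone."""
    return stack[-1]
```

An external observer holding a copy of the program can evaluate `branch_destination` without any help from the processor, whereas `return_destination` needs run-time data.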

For microprocessors that do not integrate a cache memory or a memory management unit "on-chip", the virtual addresses of all instruction and data references are available on the external interface of the microprocessor. Consequently, it is a straightforward matter to follow externally the sequence of instructions executed by the microprocessor.

For example, in the case of the National Semiconductor Corporation NS32032 microprocessor, after executing a taken branch or jump instruction, the next instruction is read from the destination address in memory using a special status code. The microprocessor also activates a program flow status interface signal whenever it begins executing a new instruction. It is, therefore, possible for the NS32032 microprocessor to monitor the sequence of executed instructions as follows:

1. For each taken branch or jump instruction, the destination address is available on the external interface;

2. Following the execution of a taken branch or jump instruction, the program flow status interface signal is observed to determine the sequence of instructions executed at consecutive memory addresses until the next taken branch or jump is executed; and

3. Data reads from memory locations that store the addresses of exception service procedures are detected on the external interface to determine when an exception has occurred.

However, for a microprocessor such as that of the present invention, which integrates a cache memory and a memory management unit on-chip, there are two problems that must be solved in order to externally monitor the sequence of executed instructions. First, since required instructions or data may be found in on-chip cache memory, not all memory references are observable on the microprocessor's external interface. This is because memory references that are located in the microprocessor's internal cache are performed without referring to external memory. Second, memory references observable on the microprocessor's external interface use physical addresses rather than virtual addresses. This is because the integrated memory management unit translates the virtual addresses generated by an executing program to the physical addresses used to access memory. In some circumstances, the translation will not be 1-to-1; that is, more than one virtual address can be translated to a single physical address. In such cases, it is impossible to determine the virtual addresses of memory references for an executing program by merely observing the physical addresses of the memory references on the external interface.
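The second problem can be made concrete with a small sketch. The page size and the page-table contents below are hypothetical values chosen for illustration; the specification does not state them here.

```python
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size, for illustration only


def translate(page_table: Dict[int, int], virtual_address: int) -> int:
    """Map a virtual address to a physical address via a page table."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


def candidate_virtual_addresses(page_table: Dict[int, int],
                                physical_address: int) -> List[int]:
    """All virtual addresses that translate to the given physical address."""
    frame, offset = divmod(physical_address, PAGE_SIZE)
    return sorted(page * PAGE_SIZE + offset
                  for page, mapped in page_table.items() if mapped == frame)
```

With two virtual pages mapped to one physical frame, `candidate_virtual_addresses` returns two entries, so an observer that sees only the physical address cannot tell which virtual address the program used.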

The method of the present invention for solving the above-described problems involves two aspects. First, an additional interface signal is provided which indicates whether an instruction beginning execution is sequential or non-sequential. Second, additional information is provided on the interface signals used for external memory references.

The additional interface signal utilized by the microprocessor described in the following specification is called "Internal Sequential Fetch" (ISF). The microprocessor activates the ISF signal, along with a "Program Flow Status" (PFS) signal, whenever a taken branch or jump instruction is executed. It is, therefore, possible to monitor control flow when a branch or jump instruction is executed. If the instruction is taken, which is indicated by driving the ISF signal active, then control is transferred to a destination instruction. If the instruction is not taken, which is indicated by driving the ISF signal inactive, then control is transferred to the next sequential instruction in memory.

Additional information for monitoring control flow is provided on the external interface when an exception occurs or when a taken jump instruction is executed. When an exception occurs, the microprocessor displays both a code that indicates the type of exception and the virtual address of the exception service procedure. When a taken jump instruction is executed, the microprocessor displays the virtual address of the jump destination.
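A minimal sketch of how an external monitor might combine these signals to rebuild the executed sequence follows. It assumes the monitor holds a copy of the program (instruction lengths and branch displacements); the `Instr` and `Event` types and the `reconstruct` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instr:
    length: int                          # bytes occupied by the instruction
    displacement: Optional[int] = None   # set for branch instructions only


@dataclass
class Event:
    """One PFS pulse: an instruction begins execution."""
    isf: bool                            # True = non-sequential
    shown_address: Optional[int] = None  # virtual address displayed for a
                                         # taken jump or an exception


def reconstruct(program: Dict[int, Instr], start: int,
                events: List[Event]) -> List[int]:
    """Return the virtual addresses of the executed instructions."""
    pc = start
    trace = [pc]
    for event in events:
        if not event.isf:
            pc += program[pc].length            # next sequential instruction
        elif event.shown_address is not None:
            pc = event.shown_address            # taken jump or exception
        else:
            pc += program[pc].displacement      # taken branch: computable
        trace.append(pc)
    return trace
```

Only the jump and exception cases need an address from the processor; sequential flow and taken branches are resolved from the program copy alone.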

This solution provided by the present invention to the problem of monitoring control flow is extremely efficient. Since only one new interface signal is required, the cost is small. Since the interface signals for memory references are used infrequently, interference with the microprocessor's external references is small and performance is not significantly degraded. Taken jump instructions typically comprise less than 10% of executed instructions. Exceptions typically occur at a rate of less than 1 per 100 instructions. Consequently, using the interface signals to provide additional information for monitoring control flow has little effect on the microprocessor's performance.

Figure 1 is a schematic block diagram illustrating the general architecture of a microprocessor which utilizes the control flow monitoring method of the present invention.

Figure 2 is a schematic diagram illustrating the interface signals of the microprocessor described herein.

Figure 3 is a schematic block diagram illustrating the major functional units and interconnecting buses of the microprocessor described herein.

Figure 4 is a schematic block diagram illustrating the structure of the Instruction Cache of the microprocessor described herein.

Figure 5 is a schematic block diagram illustrating the structure of the Data Cache of the microprocessor described herein.

Figure 6 is a timing diagram illustrating the timing sequence for access to the Data Cache of the microprocessor described herein.

Figure 7 is a timing diagram illustrating the relationship between the CLK input and BUSCLK output signals of the microprocessor described herein.

Figure 8 is a timing diagram illustrating the basic read cycle of the microprocessor described herein.

Figure 9 is a timing diagram illustrating the basic write cycle of the microprocessor described herein.

Figure 10 is a timing diagram illustrating a read cycle of the microprocessor described herein which has been extended with two wait cycles.

Figure 11 is a timing diagram illustrating a burst read cycle, having three transfers, which is terminated by the microprocessor described herein.

Figure 12 is a timing diagram illustrating a burst read cycle terminated by the system, the burst cycle having two transfers, the second transfer being extended by one wait state.

Figure 13 is a schematic diagram illustrating the general structure of the 4-stage instruction Pipeline of the microprocessor described herein.

Figure 14 is a timing diagram illustrating Pipeline timing for an internal Data Cache hit.

Figure 15 is a timing diagram illustrating Pipeline timing for an internal Data Cache miss.

Figure 16 is a timing diagram illustrating the effect of an address-register interlock on instruction Pipeline timing.

Figure 17 is a timing diagram illustrating the effect of correctly predicting a branch instruction to be taken in the operation of the microprocessor described herein.

Figure 18 is a timing diagram illustrating the effect of incorrectly predicting the resolution of a branch instruction in the operation of the microprocessor described herein.

Fig. 1 shows the general architecture of a microprocessor (CPU) 10 which implements a control flow monitoring method in accordance with the present invention.

CPU 10 initiates bus cycles to communicate with external memory and with other devices in the computing cluster to fetch instructions, read and write data, perform floating-point operations and respond to exception requests.

CPU 10 includes a 4-stage instruction Pipeline 12 that is capable of executing, at 20 MHz, up to 10 MIPS (millions of instructions per second). Also integrated on-chip with instruction Pipeline 12 are three storage buffers that sustain the heavy demand of Pipeline 12 for instructions and data. The storage buffers include a 512-byte Instruction Cache 14, a 1024-byte Data Cache 16 and a 64-entry translation buffer which is included within an on-chip memory management unit (MMU) 18. The primary functions of MMU 18 are to arbitrate requests for memory references and to translate virtual addresses to physical addresses. CPU 10 also includes an integrated Bus Interface Unit (BIU) 20 which controls the bus cycles for external references.

Placing the cache and memory management functions on the same chip with instruction Pipeline 12 provides excellent cost/performance by improving memory access time and bandwidth for all microprocessor applications.

CPU 10 is also compatible with available peripheral devices, such as Interrupt Control Unit (ICU) 24 (e.g., NS32202). The ICU interface to CPU 10 is completely asynchronous, so it is possible to operate ICU 24 at lower frequencies than CPU 10.

CPU 10 incorporates its own clock generator. Therefore, no timing control unit is required.

CPU 10 supports both an external cache memory 25 and a "Bus Watcher" circuit 26 which assists in maintaining internal cache coherence, as described below.

As shown in Fig. 2, CPU 10 has 114 interface signals for bus timing and control, cache control, exception requests and other functions. The following list provides a summary of the CPU 10 interface signal functions:

Input Signals

BACK  Burst Acknowledge (Active Low). When active in response to a burst request, indicates that the memory supports burst cycles.

BER  Bus Error (Active Low). Indicates to CPU 10 that an error was detected during the current bus cycle.

BRT  Bus Retry (Active Low). Indicates that CPU 10 must perform the current bus cycle again.

BW0-BW1  Bus Width (2 encoded lines). These lines define the bus width (8, 16 or 32 bits) for each data transfer, as shown in Table 1 below.

BW1   BW0   Bus Width
 0     0    reserved
 0     1    8 bits
 1     0    16 bits
 1     1    32 bits

Table 1

CIA0-CIA6  Cache Invalidation Address (7 encoded lines). The cache invalidation address is presented on the CIA bus. Table 2 presents the CIA lines relevant for each of the internal caches of CPU 10.

CIA (0:4)   Set address in DC and IC
CIA (5:6)   Reserved

Table 2

CII  Cache Inhibit In (Active High). Indicates to CPU 10 that the memory reference of the current bus cycle is not cacheable.

CINVE  Cache Invalidation Enable. Input which determines whether the External Cache Invalidation options or the Test Mode operation have been selected.

CLK  Clock. Input clock used to derive all timing for CPU 10.

DBG  Debug Trap Request (Falling-Edge Activated). High-to-low transition of this signal causes Trap (DBG).

HOLD  Hold Request (Active Low). Requests CPU 10 to release the bus for direct memory access unit (DMA) or multiprocessor purposes.

INT  Interrupt (Active Low). Maskable interrupt request.

INVSET  Invalidate Set (Active Low). When Low, only a set in the on-chip caches is invalidated; when High, the entire cache is invalidated.

INVDC  Invalidate Data Cache (Active Low). When Low, an invalidation is done in the Data Cache.

INVIC  Invalidate Instruction Cache (Active Low). When Low, an invalidation is done in the Instruction Cache.

IODEC  I/O Decode (Active Low). Indicates to CPU 10 that a peripheral device is addressed by the current bus cycle.

NMI  Nonmaskable Interrupt (Falling-Edge Activated). A High-to-Low transition of this signal requests a nonmaskable interrupt.

RDY  Ready (Active High). While this signal is not active, CPU 10 extends the current bus cycle to support a slow memory or peripheral device.

RST  Reset (Active Low). Generates reset exceptions to initialize CPU 10.

SDONE  Slave Done (Active Low). Indicates to CPU 10 that a Slave Processor has completed executing an instruction.

STRAP  Slave Trap (Active Low). Indicates to CPU 10 that a Slave Processor has detected a trap condition while executing an instruction.

Output Signals

A0-A31  Address Bus (3-state, 32 lines). Transfers the 32-bit address during a bus cycle. A0 transfers the least significant bit.

ADS  Address Strobe (Active Low, 3-state). Indicates that a bus cycle has begun and a valid address is on the address bus.

BE0-BE3  Byte Enables (Active Low, 3-state, 4 lines). Signals enabling transfer on each byte of the data bus, as shown in Table 3.

BE   Enables Bits
 0    0 - 7
 1    8 - 15
 2    16 - 23
 3    24 - 31

Table 3

BMT  Begin Memory Transaction (Active Low, 3-state). Indicates that the current bus cycle is valid, that is, the bus cycle has not been cancelled; available earlier in the bus cycle than CONF.

BP  Break Point (Active Low). Indicates that CPU 10 has detected a debug condition.

BREQ  Burst Request (Active Low, 3-state). Indicates that CPU 10 is requesting to perform burst cycles.

BUSCLK  Bus Clock. Output clock for bus timing.

CASEC  Cache Section (3-state). For cacheable data read bus cycles, indicates the section of the on-chip Data Cache 16 into which the data will be placed.

CIO  Cache Inhibit (Active High). Indication by CPU 10 that the memory reference of the current bus cycle is not cacheable; controlled by the CI bit in the level-2 Page Table Entry.

CONF  Confirm Bus Cycle (Active Low, 3-state). Indicates that a bus cycle initiated with ADS is valid; that is, the bus cycle has not been cancelled.

DDIN  Data Direction In (Active Low, 3-state). Indicates the direction of transfers on the data bus; when Low during a bus cycle, indicates that CPU 10 is reading data; when High during a bus cycle, indicates that CPU 10 is writing data.

HLDA  Hold Acknowledge (Active Low). Activated by CPU 10 in response to the HOLD input to indicate that CPU 10 has released the bus.

ILO  Interlocked Bus Cycle (Active Low). Indicates that a sequence of bus cycles with interlock protection is in progress.

IOINH  I/O Inhibit (Active Low). Indicates that the current bus cycle should be ignored if a peripheral device is addressed.

ISF  Internal Sequential Fetch. Indicates, along with PFS, that the instruction beginning execution is sequential (ISF = Low) or non-sequential (ISF = High).

PFS  Program Flow Status (Active Low). A pulse on this signal indicates the beginning of execution for each instruction.

SPC  Slave Processor Control (Active Low). Data Strobe for Slave Processor bus cycles.

ST0-ST4  Status (5 encoded lines). Bus cycle status code; ST0 is the least significant bit. The encoding is shown in Table 4.

U/S  User/Supervisor (3-state). Indicates User (U/S = High) or Supervisor (U/S = Low) Mode.

Bidirectional Signals

D0-D31  Data Bus (3-state, 32 lines). Transfers 8, 16, or 32 bits of data during a bus cycle. D0 transfers the least significant bit.

STATUS (ST4 ST3 ST2 ST1 ST0)    DESCRIPTION

0 0 0 0 0    Idle
0 0 0 0 1    Idle: Wait Instruction
0 0 0 1 0    Idle: Halted
0 0 0 1 1    Idle: Waiting for Slave
0 0 1 0 0    Interrupt acknowledge, Master
0 0 1 0 1    Interrupt acknowledge, Cascaded
0 0 1 1 0    End of Interrupt, Master
0 0 1 1 1    End of Interrupt, Cascaded
0 1 0 0 0    Sequential Instruction Fetch
0 1 0 0 1    Non-sequential Instruction Fetch
0 1 0 1 0    Data transfer
0 1 0 1 1    Read Read-Modify-Write Operand
0 1 1 0 0    Read for Effective address
0 1 1 0 1    Access PTE1 by MMU
0 1 1 1 0    Access PTE2 by MMU
0 1 1 1 1 through 1 1 1 0 0    reserved
1 1 1 0 1    Transfer Slave Processor Operand
1 1 1 1 0    Read Slave Processor Status
1 1 1 1 1    Broadcast Slave ID + Opcode

Table 4
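As an illustration of how external logic might use the status lines, the sketch below assembles ST4 through ST0 into a 5-bit code and classifies the two instruction-fetch codes listed in Table 4. The function names are hypothetical, and only the two fetch encodings are covered.

```python
from typing import Optional

# Encodings taken from Table 4 (ST4 ST3 ST2 ST1 ST0).
FETCH_KINDS = {
    0b01000: "sequential",
    0b01001: "non-sequential",
}


def status_code(st4: int, st3: int, st2: int, st1: int, st0: int) -> int:
    """Combine the five status lines; ST0 is the least significant bit."""
    return (st4 << 4) | (st3 << 3) | (st2 << 2) | (st1 << 1) | st0


def instruction_fetch_kind(code: int) -> Optional[str]:
    """Return the fetch kind, or None if the cycle is not an instruction fetch."""
    return FETCH_KINDS.get(code)
```

Note that these status codes describe external bus cycles only; fetches satisfied by the on-chip Instruction Cache produce no bus cycle, which is why the ISF and PFS signals are needed.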

Referring to Fig. 3, CPU 10 is organized internally as eight major functional units that operate in parallel to perform the following operations to execute instructions: prefetch, decode, calculate effective addresses and read source operands, calculate results and store to registers, store results to memory.

A Loader 28 prefetches instructions and decodes them for use by an Address Unit 30 and an Execution Unit 32. Loader 28 transfers instructions received from the Instruction Cache 14 on the IBUS bus into an 8-byte instruction queue. Loader 28 can extract an instruction field on each cycle, where a "field" means either an opcode (1 to 3 bytes including addressing mode specifiers), displacement or immediate value. Loader 28 decodes the opcode to generate the initial microcode address, which is passed on the LADR bus to Execution Unit 32. The decoded general addressing modes are passed on the ADMS bus to Address Unit 30. Displacement values are passed to Address Unit 30 on the DISP bus. Immediate values are available on the GCBUS bus. Loader 28 also includes a branch-prediction mechanism, which is described below.

Address Unit 30 calculates effective addresses using a dedicated 32-bit adder and reads source operands for Execution Unit 32. Address Unit 30 controls a port from a Register File 34 to the GCBUS through which it transfers base and index values to the address adder and data values to Execution Unit 32. Effective addresses for operand references are transferred to MMU 18 and Data Cache 16 on the GVA bus, which is the virtual address bus.

Execution Unit 32 includes the data path and the microcoded control for executing instructions and processing exceptions. The data path includes a 32-bit Arithmetic Logic Unit (ALU), a 32-bit barrel shifter, an 8-bit priority encoder, and a number of counters. Special-purpose hardware incorporated in Execution Unit 32 supports multiplication, retiring one bit per cycle with optimization for multipliers of small absolute value. Execution Unit 32 controls a port to Register File 34 from the GNA bus on which it stores results. The GNA bus is also used by Execution Unit 32 to read values of a number of dedicated registers, like configuration and interrupt base registers, which are included in Register File 34. A 2-entry data buffer allows Execution Unit 32 to overlap the execution of one instruction with storing results to memory for previous instructions. The GVA bus is used by Execution Unit 32 to perform memory references for complex instructions (e.g., string operations) and exception processing.

Register File 34 is dual-ported, allowing read access by Address Unit 30 on the GCBUS and read/write access by Execution Unit 32 on the GNA bus. Register File 34 holds the general-purpose registers, dedicated registers, and program counter values for Address Unit 30 and Execution Unit 32.

MMU 18 is compatible with the memory management functions of CPU 10. Instruction Cache 14, Address Unit 30 and Execution Unit 32 make requests to MMU 18 for memory references. MMU 18 arbitrates the requests, granting access to transfer a virtual address on the GVA bus. MMU 18 translates the virtual address it receives on the GVA bus to the corresponding physical address, using its internal 64-entry Translation Buffer. MMU 18 transfers the physical address on the MPA bus to either Instruction Cache 14 or Data Cache 16, depending on whether an instruction or data reference is being performed. The physical address is also transferred to BIU 20 for an external bus cycle.

Bus Interface Unit (BIU) 20 controls the bus cycles for references by Instruction Cache 14, Address Unit 30 and Execution Unit 32. BIU 20 contains a 3-entry buffer for external references. Thus, for example, BIU 20 can be performing a bus cycle for an instruction fetch while holding the information for another bus cycle to write to memory and simultaneously accepting the next data read.

Referring to Fig. 4, Instruction Cache 14 stores 512 bytes in a direct-map organization. Bits 4 through 8 of a reference instruction's address select 1 of 32 sets. Each set contains 16 bytes, i.e. 4 double-words of code and a log that holds address tags comprising the 23 most-significant bits of the physical address for the locations stored in that set. A valid bit is associated with every double-word.

Instruction Cache 14 also includes a 16-byte instruction buffer from which it can transfer 32 bits of code per cycle on the IBUS to Loader 28. In the event that the desired instruction is found in Instruction Cache 14 (a "hit"), the instruction buffer is loaded from the selected set of Instruction Cache 14. In the event of a miss, Instruction Cache 14 transfers the address of the missing double-word on the GVA bus to MMU 18, which translates the address for BIU 20. BIU 20 initiates a burst read cycle to load the instruction buffer from external memory through the GBDI bus. The instruction buffer is then written to one of the sets of Instruction Cache 14.

Instruction Cache 14 holds counters for both the virtual and physical addresses from which to prefetch the next double-word of the instruction stream. When Instruction Cache 14 must begin prefetching from a new instruction stream, the virtual address for the new stream is transferred from Loader 28 on the JBUS. When crossing to a new page, Instruction Cache 14 transfers the virtual address to MMU 18 on the GVA bus and receives back the physical address on the MPA bus.

Instruction Cache 14 supports an operating mode to lock its contents to fixed locations. This feature is enabled by setting a Lock Instruction Cache (LIC) bit in the configuration register. It can be used in real-time systems to allow fast, on-chip access to the most critical routines. Instruction Cache 14 can be enabled by setting an Instruction Cache Enable (IC) bit in the configuration register.

Data Cache 16 stores 1024 bytes of data in a two-way set associative organization, as shown in Fig. 5. Each set has two entries containing 16 bytes and two tags that hold the 23 most significant bits of physical address for the locations stored in the two entries. A valid bit is associated with every double-word.

The timing to access Data Cache 16 is shown in Fig. 6. First, virtual address bits 4 through 8 on the GVA bus are used to select the appropriate set within Data Cache 16 and read the two entries. Simultaneously, MMU 18 is translating the virtual address and transferring the physical address to Data Cache 16 and BIU 20 on the MPA bus. Then Data Cache 16 compares the two address tags with the physical address while BIU 20 initiates an external bus cycle to read the data from external memory. If the reference is a hit, then the selected data is aligned by Data Cache 16 and transferred to Execution Unit 32 on the GDATA bus, and BIU 20 cancels the external bus cycle and does not assert the BMT and CONF signals. If the reference is a miss, BIU 20 completes the external bus cycle and transfers data from external memory to Execution Unit 32 and Data Cache 16, which updates its cache entry. For references that hit, Data Cache 16 can sustain a throughput of one double-word per cycle, with a latency of 1.5 cycles.
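The set selection and tag comparison can be sketched as follows. This is a simplified model under stated assumptions: the per-double-word valid bits are omitted, an empty entry is represented by `None`, and all function names are hypothetical.

```python
from typing import List, Optional

NUM_SETS = 32      # selected by address bits 4 through 8
WAYS = 2           # two entries per set
ENTRY_BYTES = 16   # 16 bytes per entry

Cache = List[List[Optional[int]]]


def new_cache() -> Cache:
    """Empty tag store: NUM_SETS sets, each with WAYS tags."""
    return [[None] * WAYS for _ in range(NUM_SETS)]


def set_index(virtual_address: int) -> int:
    """Bits 4..8 of the virtual address select the set."""
    return (virtual_address >> 4) & 0x1F


def tag_of(physical_address: int) -> int:
    """The 23 most significant bits of the 32-bit physical address."""
    return (physical_address >> 9) & 0x7FFFFF


def fill(cache: Cache, virtual_address: int, physical_address: int,
         way: int) -> None:
    """Record the tag for an entry loaded after a miss."""
    cache[set_index(virtual_address)][way] = tag_of(physical_address)


def is_hit(cache: Cache, virtual_address: int, physical_address: int) -> bool:
    """Compare both tags of the selected set with the physical address."""
    return tag_of(physical_address) in cache[set_index(virtual_address)]
```

Because the set is chosen from the virtual address while the tag comes from the physical address, the set can be read while MMU 18 is still translating, as the timing description indicates.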

Data Cache 16 is a write-through cache. For memory write references, Data Cache 16 examines whether the reference is a hit. If so, the contents of the cache are updated. In the event of either a hit or a miss, BIU 20 writes the data through to external memory.

Like Instruction Cache 14, Data Cache 16 supports an operating mode to lock its contents to fixed locations. This feature is enabled by setting a Lock Data Cache (LDC) bit in the configuration register. It can be used in real-time systems to allow fast on-chip access to the most critical data locations. Data Cache 16 can be enabled by setting a Data Cache Enable (DC) bit in the configuration register.

CPU 10 receives a single-phase input clock CLK which has a frequency twice that of the operating rate of CPU 10. For example, the input clock's frequency is 40 MHz for a CPU 10 operating at 20 MHz. CPU 10 divides the CLK input by two to obtain an internal clock that is composed of two non-overlapping phases, PHI1 and PHI2. CPU 10 drives PHI1 on the BUSCLK output signal.

Fig. 7 shows the relationship between the CLK input and BUSCLK output signals.

As illustrated in Fig. 8, every rising edge of the BUSCLK output defines a transition in the timing state ("T-state") of CPU 10. Bus cycles occur during a sequence of T-states, labelled T1, T2, and T2B in the associated timing diagrams. There may be idle T-states (Ti) between bus cycles. The phase relationship of the BUSCLK output to the CLK input can be established at reset.

The basic bus cycles performed by CPU 10 to read from and write to external memory and peripheral devices occur during two cycles of the bus clock, called T1 and T2. The basic bus cycles can be extended beyond two clock cycles for two reasons. First, additional T2 cycles can be added to wait for slow memory and peripheral devices. Second, when reading from external memory, burst cycles (called "T2B") can be used to transfer multiple double-words from consecutive locations.

The timing for basic read and write bus cycles with no "wait" states is shown in Figs. 8 and 9, respectively. For both read and write bus cycles, CPU 10 asserts Address Strobe ADS during the first half of T1 indicating the beginning of the bus cycle. From the beginning of T1 until the completion of the bus cycle, CPU 10 drives the address bus and control signals for the Status (ST0-ST4), Byte Enables (BE0-BE3), Data Direction In (DDIN), Cache Inhibit (CIO), I/O Inhibit (IOINH), and Cache Section (CASEC) signals.

If the bus cycle is not cancelled (that is, T2 will follow on the next clock), CPU 10 asserts Begin Memory Transaction BMT during T1 and asserts Confirm Bus Cycle CONF from the middle of T1 until the completion of the bus cycle, at which time CONF is negated.

At the end of T2, CPU 10 samples that RDY is active, indicating that the bus cycle has been completed; that is, no additional T2 states should be added. Following T2 is either T1 for the next bus cycle or Ti, if CPU 10 has no bus cycles to perform.

As shown in Fig. 10, the basic read and write bus cycles previously described can be extended to support longer access times. As stated, CPU 10 samples RDY at the end of each T2 state. If RDY is inactive, then the bus cycle is extended by repeating T2 for another clock. The additional T2 states after the first are called "wait" states. Fig. 10 shows the extension of a read bus cycle with the addition of two wait states.
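The wait-state rule can be expressed as a short sketch. The function name and the representation of RDY as a list of sampled values are assumptions made for illustration.

```python
from typing import List


def bus_cycle_states(rdy_samples: List[bool]) -> List[str]:
    """Return the T-states of one non-burst bus cycle.

    rdy_samples[i] is the value of RDY sampled at the end of the i-th
    T2 state; the cycle ends at the first sample that is active.
    """
    states = ["T1"]
    for rdy in rdy_samples:
        states.append("T2")
        if rdy:
            return states
    raise ValueError("RDY never became active; bus cycle did not complete")


def wait_states(states: List[str]) -> int:
    """T2 states after the first one are wait states."""
    return states.count("T2") - 1
```

For the case of Fig. 10, RDY sampled inactive twice and then active gives the sequence T1, T2, T2, T2, i.e. two wait states.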

As shown in Fig. 11, the basic read cycles can also be extended to support burst transfers of up to four double-words from consecutive memory locations. During a burst read cycle, the initial double-word is transferred during a sequence of T1 and T2 states, like a basic read cycle. Subsequent double-words are transferred during states called "T2B". Burst cycles are used only to read from 32-bit wide memories.

The number of transfers in a burst read cycle is controlled by a handshake between the output signal Burst Request BREQ and the input signal Burst Acknowledge BACK. CPU 10 asserts BREQ during a T2 or T2B state to indicate that it requests another transfer following the current one. The memory asserts BACK to indicate that it can support another transfer. Fig. 11 shows a burst read cycle of three transfers in which CPU 10 terminates the sequence by negating BREQ after the second transfer.

Fig. 12 shows a burst cycle of two transfers terminated by the system when BACK was inactive during the second transfer.

For each transfer after the first in the burst sequence, CPU 10 increments address bits 2 and 3 to select the next double-word. As shown for the second transfer in Fig. 12, CPU 10 samples RDY at the end of each state T2B and extends the access time for the burst transfer if RDY is inactive.
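The address sequence of a burst can be sketched as below. Treating bits 2 and 3 as a 2-bit field that wraps around within the 16-byte block is an assumption of this sketch; the text states only that these two bits are incremented. The function name is hypothetical.

```python
from typing import List


def burst_addresses(first_address: int, transfers: int) -> List[int]:
    """Addresses of the double-words read in one burst cycle."""
    if not 1 <= transfers <= 4:
        raise ValueError("a burst transfers between one and four double-words")
    base = first_address & ~0xC          # clear address bits 2 and 3
    start = (first_address >> 2) & 0x3   # initial value of bits 3:2
    # Assumed behaviour: the 2-bit field wraps modulo 4.
    return [base | (((start + k) & 0x3) << 2) for k in range(transfers)]
```

Only the two incremented bits change, so every transfer of a burst stays within one aligned 16-byte block, which matches the 16-byte entry size of the on-chip caches.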

CPU 10 provides a number of techniques for maintaining data coherence between the two on-chip caches and external memory. These techniques are summarized in Table 5.

                                            SOFTWARE                                HARDWARE
Inhibit cache access for certain locations  Cache-Inhibit bit in PTE                Cache-Inhibit input signal
Invalidate certain locations in cache       CINV instruction to invalidate block    Cache invalidation request to invalidate set
Invalidate entire cache                     CINV instruction                        Cache invalidation request

Table 5

The use of the caches can be inhibited for individual pages using the CI bit in level-2 page table entries. The coherence between the two on-chip caches of CPU 10 and external memory may be ensured by using an external "Bus Watcher" 26, shown in Fig. 1. This circuit 26 interfaces to the following buses:

1. CPU 10 address bus and CASEC output, to obtain information on which internal cache entries (tags) are modified and to maintain updated copies of CPU 10 internal cache tags;

2. The System Bus, to detect which internal memory addresses are modified; and

3. CPU 10 Cache Invalidation Bus, consisting of the INVSET, INVDC, INVIC and CIA0-CIA6 signals.

Bus Watcher 26 maintains tag copies of Instruction Cache 14 and Data Cache 16 entries. If the address of a memory write cycle on the System Bus matches one of the tags inside Bus Watcher 26, a command will be issued by Bus Watcher 26 to CPU 10, via the Cache Invalidation Bus, to invalidate the corresponding entry in the appropriate internal cache. The invalidation of the internal cache entry by CPU 10 takes one clock cycle only and does not interfere with the on-going CPU bus cycle. Data Cache 16 is invalidated 32 bytes at a time, while Instruction Cache 14 is invalidated 16 bytes at a time.
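
The tag-matching behaviour of Bus Watcher 26 can be sketched in software as follows. This is an illustrative model only: the class and method names are hypothetical, and tracking tag copies at the invalidation granularity (16 bytes for Instruction Cache 14, 32 bytes for Data Cache 16) is an assumption of the sketch rather than a statement of how the circuit stores its tags.

```python
IC_BLOCK = 16   # Instruction Cache 14 is invalidated 16 bytes at a time
DC_BLOCK = 32   # Data Cache 16 is invalidated 32 bytes at a time


class BusWatcherSketch:
    """Hypothetical model of Bus Watcher 26: keeps tag copies, watches writes."""

    def __init__(self):
        self.ic_tags = set()   # block numbers believed present in Instruction Cache 14
        self.dc_tags = set()   # block numbers believed present in Data Cache 16

    def note_cpu_fill(self, addr, is_instruction):
        """Record that CPU 10 filled a cache entry covering `addr`."""
        if is_instruction:
            self.ic_tags.add(addr // IC_BLOCK)
        else:
            self.dc_tags.add(addr // DC_BLOCK)

    def on_system_write(self, addr):
        """Return the invalidation commands caused by a System Bus write."""
        commands = []
        if addr // IC_BLOCK in self.ic_tags:
            self.ic_tags.discard(addr // IC_BLOCK)
            commands.append(("invalidate_instruction_cache",
                             addr // IC_BLOCK * IC_BLOCK))
        if addr // DC_BLOCK in self.dc_tags:
            self.dc_tags.discard(addr // DC_BLOCK)
            commands.append(("invalidate_data_cache",
                             addr // DC_BLOCK * DC_BLOCK))
        return commands
```

A write to an address that matches no tag copy produces no command, so unrelated System Bus traffic does not disturb the caches.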

These techniques for maintaining coherence between the two integrated caches of CPU 10 and external memory are more fully described in copending and commonly-assigned U.S. Pat. Appln. Serial No. 006,015, "Method of Maintaining Coherence Between a Microprocessor's Integrated Cache and External Memory", filed by Shacham et al. of even date herewith, and which is hereby incorporated by reference.

To execute an instruction, CPU 10 first fetches the instruction whose address is contained in the program counter and then performs the operations for that particular instruction. After the instruction is executed, the program counter value is updated in one of two ways to contain the address of the next instruction to execute: either the current instruction explicitly loads the program counter (like JUMP) or the program counter is incremented by the length of the current instruction.

The sequence of instructions executed by a microprocessor is determined by repeatedly applying the following four rules:

1. If the microprocessor detects an exception, control is transferred to the first instruction of the appropriate service procedure for that type of exception. Depending on the definition of the microprocessor's operation, the address in memory of the exception service procedure either is fixed or is found in a table of addresses for such procedures;

2. If the microprocessor executes a taken branch instruction, then control is transferred to the instruction whose address in memory is calculated by adding the displacement value in the branch instruction to the address of the branch instruction;

3. If the microprocessor executes a taken jump instruction, then control is transferred to the instruction whose address in memory is calculated in a general manner according to the definition of that particular instruction; and

4. If the microprocessor executes an instruction and none of the above rules applies, then control is transferred to the instruction whose address in memory immediately follows that of the executed instruction.
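
The four rules can be expressed as a single function that returns the address of the next instruction. This is a minimal sketch: the function and parameter names are hypothetical, and the jump destination and exception service procedure address are passed in directly because the rules leave their calculation to the definition of the particular instruction or microprocessor.

```python
def next_instruction_address(pc, length, kind="other", taken=False,
                             displacement=0, jump_target=None,
                             exception_handler=None):
    """Apply the four control-flow rules to the instruction at address `pc`.

    kind is "branch", "jump" or "other"; exception_handler is the address of
    the service procedure when an exception is detected, otherwise None.
    """
    if exception_handler is not None:       # rule 1: exception detected
        return exception_handler
    if kind == "branch" and taken:          # rule 2: taken branch
        return pc + displacement
    if kind == "jump" and taken:            # rule 3: taken jump
        return jump_target
    return pc + length                      # rule 4: sequential execution
```

Applying the function repeatedly, starting from the first instruction, yields the sequence of executed instructions.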

CPU 10 has five operating states regarding the execution of instructions and the processing of exceptions: reset, executing instructions, processing an exception, waiting for an interrupt and halted.

While executing an instruction, if CPU 10 recognizes an exception, it saves the contents of appropriate registers, then begins executing an exception service procedure. Exceptions are conditions, events and errors that alter the sequence of instruction execution.

CPU 10 recognizes four types of exception: reset, bus errors, interrupts and traps. A reset exception occurs when the RST signal is activated; RST is used to initialize CPU 10. A bus error exception occurs when the BER signal is activated in response to an instruction fetch or data transfer required by CPU 10 to execute an instruction. An interrupt occurs in response to an event signalled by activating the NMI or INT signals; interrupts are typically requested by peripheral devices that require the attention of CPU 10. A trap occurs when certain conditions, such as a divisor of 0, are detected by CPU 10 during execution of an instruction.

Whenever the RST signal is activated, CPU 10 enters the reset state. CPU 10 remains in the reset state until RST is driven inactive, at which time it enters the state of executing instructions. While CPU 10 is in the reset state, the contents of certain dedicated registers are initialized.

While in the state of executing instructions, CPU 10 continues to execute instructions until an exception is recognized or a "wait" instruction is executed.

When an exception other than reset is recognized, CPU 10 enters the state of processing an exception.

Following execution of the "wait" instruction, CPU 10 enters the state of waiting for an interrupt.

While in the state of processing an exception, CPU 10 is saving the contents of appropriate registers and reading the program counter and module linkage values to begin execution of the exception service procedure.

For processing an interrupt, CPU 10 additionally reads one or two vector values from ICU 24. Following the successful completion of all data references required to process an exception, CPU 10 enters the state of executing instructions. If, however, a bus error or an abort is detected while CPU 10 is processing an exception, it enters the halted state.
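
The transitions among the five operating states can be summarized in a small table-driven sketch. The event names are hypothetical labels chosen for illustration; the transition out of the waiting state follows the description of that state given in the next paragraph.

```python
from enum import Enum


class State(Enum):
    RESET = "reset"
    EXECUTING = "executing instructions"
    PROCESSING_EXCEPTION = "processing an exception"
    WAITING = "waiting for an interrupt"
    HALTED = "halted"


# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.RESET, "rst_inactive"): State.EXECUTING,
    (State.EXECUTING, "exception"): State.PROCESSING_EXCEPTION,
    (State.EXECUTING, "wait_instruction"): State.WAITING,
    (State.PROCESSING_EXCEPTION, "references_complete"): State.EXECUTING,
    (State.PROCESSING_EXCEPTION, "bus_error_or_abort"): State.HALTED,
    (State.WAITING, "interrupt_or_debug"): State.PROCESSING_EXCEPTION,
}


def next_state(state, event):
    """Return the operating state that follows `state` when `event` occurs."""
    if event == "rst_active":       # activating RST always forces the reset state
        return State.RESET
    # Events with no listed transition leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```

In this model the halted state is left only through reset, because no other transition out of it is listed.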

In the state of waiting for an interrupt, CPU 10 is idle. A special status identifying this state is presented on the system interface. When an interrupt or an external debug condition is detected, CPU 10 enters the state of processing an exception.

CPU 10 enters the halted state when a bus error or abort is detected while CPU 10 is processing an exception, thereby preventing the transfer of control to an appropriate exception service procedure. A special status identifying this state is presented on the system interface.

CPU 10 performs the following operations to execute an instruction: fetch the instruction; read source operands, if any; calculate results; write result operands, if any; modify flags, if necessary; and update the program counter.

Under most circumstances, CPU 10 executes instructions by completing the operations listed above in strict sequence for one instruction and then beginning the sequence of operations for the next instruction. However, as stated above, exceptions can alter the sequence of operations to execute an instruction or to advance from one instruction to the next. Also, for enhanced performance, CPU 10 overlaps the operations for executing several instructions in a pipelined manner.

The following discussion explains the effects of exceptions in Pipeline 12 on instruction execution. In this discussion, reads of addresses from memory to calculate effective addresses for memory-relative and external addressing modes are considered like source operands, even if the effective address is being calculated for an operand with access class of write.

CPU 10 checks for exceptions at various points during the execution of an instruction. When an exception is recognized, the instruction being executed ends in one of four possible ways: it is completed, it is suspended, it is terminated or it is partially completed. Each of the four types of exception causes a particular ending.

When an exception is recognized after an instruction is completed, CPU 10 has performed all of the operations for that instruction and for all other instructions executed since the last exception occurred. Result operands have been written, flags have been modified, and the program counter value saved on the Interrupt Stack contains the address of the next instruction to execute. The exception service procedure can, at its conclusion, execute an appropriate return instruction and CPU 10 will begin executing the instruction following the completed instruction.

An instruction is suspended when one of several trapping conditions or a restartable bus error is detected during execution of the instruction. A suspended instruction has not been completed, but all other instructions executed since the last exception occurred have been completed. Result operands and flags due to be affected by the instruction may have been modified, but only modifications that allow the instruction to be executed again and completed can occur. For certain exceptions (e.g., bus errors and abort, undefined-instruction, and illegal-operation traps), CPU 10 clears the appropriate control flag in the program status register before saving the copy that is pushed on the Interrupt Stack. The program counter value saved on the Interrupt Stack contains the address of the suspended instruction.

For example, a RESTORE instruction pops up to eight general-purpose registers from the stack. If an invalid page table entry is detected on one of the references to the stack, then the instruction is suspended. The general-purpose registers due to be loaded by the instruction may have been modified, but the stack pointer still holds the same value it did when this instruction began.

To complete a suspended instruction, the exception service procedure takes either of two actions:

1. The service procedure can simulate the suspended instruction's execution. After calculating and writing the instruction's results, flags in the program status register copy saved on the Interrupt Stack are modified and the program counter value saved on the Interrupt Stack is updated to point to the next instruction to execute. The service procedure then executes an appropriate return instruction and CPU 10 begins executing the instruction following the suspended instruction. This is the action taken when floating-point instructions are simulated by software in systems without a hardware floating-point unit.

2. The suspended instruction can be executed again after the service procedure has eliminated the trapping condition that caused the instruction to be suspended. The service procedure executes an appropriate return instruction at its conclusion; then CPU 10 begins executing the suspended instruction again.

Although CPU 10 allows a suspended instruction to be executed again and completed, CPU 10 may have read a source operand or the instruction from a memory-mapped peripheral port before the exception was recognized. In such a case, the characteristics of the peripheral device may prevent correct re-execution of the instruction.

An instruction being executed is terminated when a reset or a nonrestartable bus error occurs. Any result operands and flags due to be affected by the instruction are undefined, as are the contents of the program counter. The result operands of other instructions executed since the last serializing operation may not have been written to memory. A terminated instruction cannot be completed.

When a restartable bus error, interrupt, abort, or debug condition is recognized during execution of a string instruction, the instruction is said to be partially completed. A partially completed instruction has not been completed, but all other instructions executed since the last exception have been completed. Result operands and flags due to be affected by the instruction may have been modified, but the values stored in the string pointers and other general-purpose registers used during the instruction's execution allow the instruction to be executed again and completed.

The program counter value saved on the Interrupt Stack contains the address of the partially completed instruction. The exception service procedure can, at its conclusion, simply execute the appropriate return instruction and CPU 10 will resume executing the partially completed instruction.
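
The relationship between the way an instruction ends and the program counter value saved on the Interrupt Stack can be summarized in a short sketch. The function name and the string labels for the four endings are hypothetical; `None` is used here to stand for the undefined program counter contents of a terminated instruction.

```python
def saved_program_counter(ending, instr_addr, next_addr):
    """Return the program counter value saved for the given instruction ending.

    instr_addr is the address of the instruction in progress when the
    exception was recognized; next_addr is the address of the instruction
    that would execute after it.
    """
    if ending == "completed":
        return next_addr        # execution resumes with the following instruction
    if ending in ("suspended", "partially completed"):
        return instr_addr       # the same instruction can be executed again
    if ending == "terminated":
        return None             # contents of the program counter are undefined
    raise ValueError("unknown ending: " + ending)
```

A service procedure that simulates a suspended instruction would then overwrite the saved value with the address of the next instruction before returning.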

As stated above, CPU 10 overlaps the operations to execute several instructions simultaneously in 4-stage Pipeline 12. The general structure of Pipeline 12 is shown in Fig. 13. While Execution Unit 32 is calculating the results for one instruction, Address Unit 30 can be calculating the effective addresses and reading the source operands for the following instruction, and Loader 28 can be decoding a third instruction and prefetching a fourth instruction into its 8-byte queue. Under certain circumstances, the effects of overlapped instruction execution can differ from those of strictly sequential instruction execution. More specifically, the order of memory references performed by CPU 10 may appear to differ.

While executing an instruction, CPU 10 may read some of the source operands from memory before completely fetching the instruction. CPU 10, however, always completes fetching an instruction and reading its source operands before writing its results. When more than one source operand must be read from memory to execute an instruction, the operands may be read in any order. Similarly, when more than one result operand is written to memory to execute an instruction, the operands may be written in any order.

CPU 10 begins fetching an instruction only after all previous instructions have been completely fetched. However, CPU 10 may begin fetching an instruction before all the source operands have been read and results written for previous instructions.

CPU 10 begins reading the source operands for an instruction only after all previous instructions have been fetched and their source operands read. The source operand for an instruction may be read before all results of the previous instruction have been written, except when the source operand's value depends on a result not yet written. CPU 10 compares the physical address and the length of the source operand with those of any results not yet written and delays reading the source operand until after writing all results on which the source operand depends.
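
The dependency check between a source operand and results not yet written amounts to an overlap test on address ranges. The following is a minimal sketch under that reading; the function names are hypothetical and pending results are modelled simply as (physical address, length) pairs.

```python
def ranges_overlap(addr_a, len_a, addr_b, len_b):
    """True if the byte ranges [addr_a, addr_a+len_a) and [addr_b, addr_b+len_b) intersect."""
    return addr_a < addr_b + len_b and addr_b < addr_a + len_a


def must_delay_read(read_addr, read_len, pending_writes):
    """True if the source operand depends on any result not yet written.

    pending_writes is an iterable of (physical_address, length) pairs.
    """
    return any(ranges_overlap(read_addr, read_len, w_addr, w_len)
               for w_addr, w_len in pending_writes)
```

A read that touches even one byte of a pending result is delayed; a read of an adjacent, non-overlapping location may proceed ahead of the write.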

In addition, CPU 10 identifies source operands that are located in memory-mapped peripheral ports and delays the reading of such operands until after all previous results destined for memory-mapped peripheral ports have been written. Special handling procedures ensure that read and write references to memory-mapped I/O ports are always performed in the order implied by the program. These procedures are described in copending and commonly-assigned U.S. Pat. Appln. Serial No. 006,012, "Method of Detecting and Handling Memory-Mapped I/O by a Pipelined Microprocessor", filed by Levy et al. of even date herewith, and which is hereby incorporated by reference.

CPU 10 begins writing the result operands for an instruction only after all results of previous instructions have been written.

As a consequence of overlapping the operations for several instructions, CPU 10 may fetch an instruction and read its source operands, although the instruction is not executed (for example, if the previous instruction causes a trap). Nevertheless, when CPU 10 identifies that a source operand for an instruction is located in a memory-mapped peripheral port, then it will read the source operand only if the instruction is executed.

Note that CPU 10 does not check for dependencies between the fetching of an instruction and the writing of previous instructions' results. Thus, self-modifying code must be treated specially to execute as intended.

After executing certain instructions or processing an exception, CPU 10 serializes instruction execution. Serializing instruction execution means that CPU 10 completes writing all previous instructions' results to memory, then begins fetching and executing the next instruction. Thus, when a new value is loaded into the program status register, the new program status register value determines the privilege state used to fetch and execute the next instruction.

In accordance with the present invention, CPU 10 implements a two-step procedure for monitoring the sequence of executed instructions. First, an additional interface signal is provided which indicates whether an instruction beginning execution is sequential or non-sequential. Second, additional information is displayed on the interface signals used for external memory references.

The interface signal is called "Internal Sequential Fetch" (ISF). CPU 10 activates the ISF signal, along with a Program Flow Status (PFS) signal, whenever a taken branch or jump instruction is executed. For purposes of the present invention, it is only necessary for CPU 10 to activate the ISF signal for taken branch instructions. It is, therefore, possible to monitor control flow when a branch instruction is executed. If the branch instruction is taken, which is indicated by driving the ISF signal active, then control is transferred to a destination instruction, the address of which can be calculated knowing the encoding and address of the branch instruction. If the branch is not taken, which is indicated by driving the ISF signal inactive, then control is transferred to the instruction following the branch instruction in memory.

Additional information for monitoring control flow is displayed on the external memory interface only when a taken jump instruction is executed or an exception occurs. When an exception occurs, CPU 10 displays both a code that indicates the type of exception and the virtual address of the exception service procedure. When a taken jump instruction is executed, CPU 10 displays the virtual address of the jump destination. The destination address is displayed after CPU 10 has begun fetching the instruction at the jump destination. The memory interface will typically be idle at this time while CPU 10 is decoding and preparing to execute the instruction at the jump destination. CPU 10 indicates, through status information, when it is displaying either the code for an exception or the destination address for a taken jump instruction rather than making the reference to memory.
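
From the point of view of an external monitor, the two sources of information described above are sufficient to reconstruct the sequence of executed instructions, provided the monitor knows the program image. The sketch below illustrates this under simplifying assumptions: signal timing is not modelled, each executed instruction is paired with one observation `(isf_active, displayed_address)`, and the names `follow_control_flow`, `program` and `observations` are hypothetical.

```python
def follow_control_flow(start, program, observations):
    """Reconstruct the addresses of executed instructions from monitored signals.

    program maps an instruction address to (length, branch_displacement);
    branch_displacement is None for instructions that are not branches.
    observations holds one (isf_active, displayed_address) pair per executed
    instruction; displayed_address is the jump destination or exception
    service procedure address shown on the memory interface, else None.
    """
    trace = []
    pc = start
    for isf_active, displayed_address in observations:
        trace.append(pc)
        length, displacement = program[pc]
        if displayed_address is not None:   # taken jump or exception
            pc = displayed_address
        elif isf_active:                    # taken branch: target from encoding
            pc = pc + displacement
        else:                               # sequential execution
            pc = pc + length
    return trace
```

Only jumps and exceptions require an address to be displayed; for branches the monitor derives the destination from the instruction encoding it already holds, which is why a single signal suffices in that case.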

The Address Unit 30 and Execution Unit 32 can process instructions at a peak rate of two cycles per instruction. Loader 28 can process instructions at a peak rate of one cycle per instruction, so it will typically keep a steady supply of instructions to the Address Unit 30 and Execution Unit 32. Loader 28 disrupts the throughput of Pipeline 12 only when a gap in the instruction stream arises due to a branch instruction or Instruction Cache miss.

Fig. 14 shows the execution of two memory-to-register instructions by the Address Unit 30 and Execution Unit 32. CPU 10 can sustain an execution rate of two cycles for most common instructions, typically exhibiting delays only in the following cases:

1. Storage delays due to cache and translation buffer misses and non-aligned references.

2. Resource contention between stages of instruction Pipeline 12.

3. Branches and other non-sequential instruction fetches.

4. Complex addressing modes like scaled index, and complex operations, like division.

Fig. 15 shows the effect of a Data Cache 16 miss on the timing of Pipeline 12. Execution Unit 32 is delayed by two cycles until BIU 20 completes the bus cycles to read data.

Fig. 16 shows the effect of an address-register interlock on the timing of Pipeline 12: one instruction is modifying a register while the next instruction uses that register for an address calculation. Address Unit 30 is delayed by three cycles until Execution Unit 32 completes the register's update. Note that if the second instruction had used the register for a data value rather than an address calculation (e.g., ADDD R0, R1), then bypass circuitry in Execution Unit 32 would be used to avoid any delay to Pipeline 12.

Loader 28 includes special circuitry for the handling of branch instructions. When a branch instruction is decoded, Loader 28 calculates the destination address and selects between the sequential and non-sequential instruction streams. The selection is based on the branch condition and direction. If Loader 28 predicts that the branch is taken, then the destination address is transferred to Instruction Cache 14 on the JBUS. Whether or not the branch is predicted to be taken, Loader 28 saves the address of the alternate instruction stream. Later the branch instruction reaches Execution Unit 32, where the condition is resolved. Execution Unit 32 signals Loader 28 whether or not the branch was taken. If the branch had been incorrectly predicted, Pipeline 12 is flushed, and Instruction Cache 14 begins prefetching instructions from the correct stream.

Fig. 17 shows the effect of correctly predicting a branch instruction to be taken. A 2-cycle gap occurs in the decoding of instructions by Loader 28. This gap at the very top of Pipeline 12 can often be closed because one fully decoded instruction is buffered between Loader 28 and Address Unit 30 and because other delays may arise simultaneously at later stages in Pipeline 12.

Fig. 18 shows the effect of incorrectly predicting the resolution of a branch instruction. A 4-cycle gap occurs at Execution Unit 32.

Additional information regarding the operation of CPU 10 may be found in copending and commonly-assigned U.S. Pat. Appln. Ser. No. 006,016, "High Performance Microprocessor", filed by Alpert et al. of even date herewith, and which is hereby incorporated by reference.

Claims

1. In a data processing system of the type having an external interface, the improvement comprising means for monitoring the sequence of instructions executed by the data processing system whereby an interface signal is generated which indicates whether an instruction beginning execution is sequential or non-sequential and information is displayed on the interface which indicates when a taken branch or jump instruction is executed or an exception occurs.

2. The monitoring means as in claim 1 wherein the interface signal is activated when a taken branch instruction is executed.

3. The monitoring means as in claim 2 wherein control is transferred to a destination instruction the address of which is calculated from the encoding and address of the taken branch instruction.

4. The monitoring means as in claim 1 wherein the interface signal is inactive if the branch instruction is not taken.

5. The monitoring means as in claim 4 wherein control is transferred to the instruction sequentially following the branch instruction in memory.

6. The monitoring means as in claim 1 wherein when an exception occurs, a code that indicates the type of exception is displayed.

7. The monitoring means as in claim 6 wherein the virtual address of an exception service procedure is displayed.

8. The monitoring means as in claim 1 wherein when a taken jump instruction is executed, the virtual address of a jump destination is displayed.

9. The monitoring means as in claim 8 wherein the destination address is displayed after the system has begun fetching the instruction at the jump destination.

10. A method for monitoring the sequence of instructions executed by a central processing unit having an external interface, the method comprising generating an interface signal which indicates whether an instruction beginning execution is sequential or non-sequential; and displaying information on the interface which indicates when a taken jump instruction is executed or an exception occurs.

11. A method as in claim 10 wherein the interface signal is activated when a taken branch instruction is executed.

12. A method as in claim 10 wherein when an exception occurs, a code indicating the type of exception is displayed.

13. A method as in claim 12 wherein the virtual address of a corresponding exception service procedure is displayed.

14. A method as in claim 10 wherein when a taken jump instruction is executed, the virtual address of a jump destination is displayed.

15. A method as in claim 14 wherein the destination address is displayed after the central processing unit has begun fetching the instruction at the jump destination.

16. A method for monitoring the sequence of instructions executed by a central processing unit having an external interface, the method comprising: (a) generating an interface signal representative of the execution of a branch instruction, the interface signal being of a first state if the branch is taken and being of a second state if the branch is not taken and (i) if the interface signal is of the first state, transferring central processing unit control to a destination instruction having an address calculated using encoding and address information of the branch instruction; and (ii) if the interface signal is of the second state, transferring central processing unit control to the next sequential instruction following the branch instruction; and (b) displaying information on the interface representative of the execution of a taken jump instruction or the occurrence of an exception such that (i) when an exception occurs, a code is displayed indicating the type of exception and the memory location of an associated exception service procedure; and (ii) when a taken jump instruction is executed, the memory location of the jump destination instruction is displayed.

17. A monitoring means for a data processing system as claimed in Claim 1 substantially as herein described with reference to the accompanying drawings.

18. A method for monitoring the sequence of instructions executed by a central processing unit having an external interface as claimed in Claim 10 or Claim 16 substantially as herein described with reference to the accompanying drawings.

Published 1988 at The Patent Office, State House, 66/71 High Holborn, London WC1R 4TP. Further copies may be obtained from The Patent Office, Sales Branch, St Mary Cray, Orpington, Kent BR5 3RD. Printed by Multiplex techniques ltd, St Mary Cray, Kent. Con. 1/87.