WO2003034201A2 - Late resolving instructions - Google Patents

Late resolving instructions

Info

Publication number
WO2003034201A2
WO2003034201A2 (PCT/GB2002/004555)
Authority
WO
WIPO (PCT)
Prior art keywords
instructions
instruction
processor
pipelines
pipeline
Prior art date
Application number
PCT/GB2002/004555
Other languages
French (fr)
Other versions
WO2003034201A3 (en)
Inventor
Nigel Peter Topham
Nicholas Paul Joyce
Original Assignee
Pts Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0124557A external-priority patent/GB2380828A/en
Application filed by Pts Corporation filed Critical Pts Corporation
Priority to AU2002329475A priority Critical patent/AU2002329475A1/en
Publication of WO2003034201A2 publication Critical patent/WO2003034201A2/en
Publication of WO2003034201A3 publication Critical patent/WO2003034201A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30065Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30072Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30079Pipeline control instructions, e.g. multicycle NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3861Recovery, e.g. branch miss-prediction, exception handling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute

Definitions

  • the present invention relates to parallel pipelined processors, such as very long instruction word (VLIW) processors.
  • VLIW very long instruction word
  • the present invention is particularly concerned with the way in which certain control instructions are handled in such processors.
  • control instructions may be instructions which, if executed, cause the sequence of subsequent instructions to change.
  • Such instructions are referred to herein as control transfer instructions.
  • Pipelining works by executing an instruction in several phases, with each phase being executed in a single pipeline stage. Instructions flow through successive pipeline stages, and complete execution when they reach the end of the pipeline.
  • Some processor architectures provide two or more parallel pipelines for processing different instructions, or different parts of an instruction, simultaneously.
  • VLIW processors use long instruction packets which may be divided into smaller instructions for simultaneous execution in different processor pipelines.
  • the instructions from an instruction packet normally progress in parallel through the various pipelines.
  • the instructions in the pipelines may become unaligned for one or more clock cycles. This may be due to, for example, a stall signal, or some other control signal, being asserted in one pipeline before it is asserted in another pipeline.
  • the address of an instruction packet is typically computed by one of the pipelines (the "master" pipeline), and the computed address distributed to the other pipelines (the "slave" pipelines).
  • Each pipeline then fetches its own instruction, and decodes and executes that instruction. Each of these operations is normally carried out in a separate pipeline stage.
  • the ability of the master pipeline to compute the address of an instruction relies on the fact that the next address can be predicted in advance with a fair degree of certainty. For example, if the processor is running a loop, then the address of the next instruction will, in most cases, be either the next address in memory, or the address of the first instruction in the loop. Thus the processor is able to compute the addresses of instructions and load the instructions into the pipelines in earlier pipeline stages while preceding instructions are still being decoded and executed in later pipeline stages.
  • a problem in the arrangement described above is that certain instructions, when decoded and executed, may cause the addresses of subsequent instructions to be different from those already computed by the processor. For example an "exit" instruction, if acted on, causes the processor to exit from a loop. In such a situation, some or all of the instructions which are in earlier pipeline stages may need to be removed from the pipelines ("squashed"), because they may have been loaded from the incorrect address. In parallel pipelined processors, the removal of such unwanted instructions may require a large amount of logic, which may add to the chip area of the processor and potentially slow down the operating speed of the processor.
  • a processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the processor comprising: determining means for determining whether a predetermined instruction is to be executed; deleting means for deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; and delay means for delaying the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
  • the mechanism required to delete the subsequent instructions may be simplified, since instructions may be deleted from the same pipeline stages in all pipelines. Furthermore, since the deletion of subsequent instructions is delayed, signals indicating that subsequent instructions are to be deleted may be distributed globally without involving critical time paths. In addition, advantage may be taken of existing logic (such as that in an exception handler) to delete the subsequent instructions.
  • each of the pipelines comprises determining means for determining whether a predetermined instruction in that pipeline is to be executed, and the deleting means is arranged to delete subsequent instructions from the pipelines if it is determined by any of the determining means that a predetermined instruction is to be executed. This can allow the processor to respond to the predetermined instruction whichever pipeline it occurs in.
  • the corresponding instructions in different pipelines may become unaligned, for example, in response to a stall signal.
  • the delay means may be arranged to delay the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage which does not generate a stall signal.
  • the stage before that pipeline stage also does not generate a stall signal.
  • the delay means may be arranged to delay the deletion of subsequent instructions until the predetermined instruction has reached a commit stage of the pipeline, or until the predetermined instruction has reached a final stage of the pipeline.
  • the deleting means may be arranged to delete all instructions from the pipelines.
  • the deleting means may comprise an exception handler.
  • the exception handler is arranged to delete subsequent instructions from the pipeline, but not to enter an exception handling routine, in response to the predetermined instruction.
  • the subsequent instructions in the pipelines may have caused unwanted changes in the processor's state.
  • the processor may further comprise restoring means for restoring the processor to a previous state (e.g. a previously committed state) in response to the predetermined instruction.
  • the exception handler may already include logic for returning the processor to a previous state. Use may then be made of such logic, without requiring further logic to be provided for restoring the processor to a previous state.
  • the restoring means may be part of an exception handler.
  • the predetermined instruction may be an instruction which, if executed, may cause a sequence of subsequent instructions to change.
  • the predetermined instruction may be an exit instruction, a branch instruction, a loop instruction, a return-from-VLIW-mode instruction, a subroutine call instruction, or some other instruction.
  • the processor may be arranged to treat two or more such instructions in a late resolving way, and thus the determining means may be arranged to determine whether one of a number of predetermined instructions is to be executed, and the deleting means may be arranged to delete subsequent instructions from the pipelines if it is determined that one of the predetermined instructions is to be executed.
  • the determining means may be provided, for example, in a decode stage of the pipeline, which may allow use to be made of existing logic in the pipeline.
  • the processor may comprise a plurality of pipeline clusters, each cluster comprising a plurality of pipelines.
  • the processor may be, for example, a VLIW processor.
  • Corresponding methods are also provided, and thus according to a second aspect of the invention there is provided a method of operating a processor, the processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the method comprising: determining whether a predetermined instruction is to be executed; and deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; where the deletion of subsequent instructions is delayed until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
  • a processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the processor comprising: a determining unit which determines whether a predetermined instruction is to be executed; a deleting unit which deletes subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; and a delay unit which delays the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
  • Figure 1 shows an overview of a processor embodying the present invention
  • FIG. 2 is a block diagram of a master cluster in a processor embodying the invention
  • Figure 3 is a block diagram of a slave cluster in a processor embodying the invention
  • Figures 4(a), 4(b) and 4(c) show an example of a software pipelined loop
  • Figure 5 shows the use of predicates in a software pipeline loop
  • Figure 6 shows how a predicate register may be used to produce the predicates of Figure 5;
  • Figure 7 shows various pipeline stages in a processor embodying the invention
  • Figure 8 shows an example of unaligned stale instruction packets
  • Figure 9 shows parts of a processor in accordance with an embodiment of the present invention.
  • Figure 10 illustrates the operation of the processor in the embodiment of Figure 9.
  • FIG. 1 shows an overview of a parallel pipelined processor embodying the present invention.
  • the processor 1 comprises instruction issuing unit 10, schedule storage unit 12, first, second, third and fourth processor clusters 14, 16, 18, 20 and system bus 22 connected to random access memory (RAM) 24, and input/output devices 26.
  • each of the clusters 14, 16, 18, 20 contains a number of execution units having a shared register file.
  • the processor 1 is designed to operate in two distinct modes. In the first mode, referred to as scalar mode, instructions are issued to just the first cluster 14, and the second to fourth clusters 16, 18, 20 do not perform any computational tasks. In the second mode, referred to as VLIW mode, instructions are issued in parallel to all of the clusters 14, 16, 18, 20, and these instructions are processed in parallel. A group of instructions issued in parallel to the various clusters in VLIW mode is referred to as a VLIW instruction packet. In practice, the processor architecture may be configured to include any number of slave clusters. Each VLIW instruction packet contains a number of instructions (including no-operation instructions) equal to the total number of clusters times the number of execution units in each cluster.
  • VLIW instruction packets are passed from the schedule storage unit 12 to the instruction issuing unit 10.
  • the VLIW instruction packets are stored in compressed form in the schedule storage unit 12.
  • the instruction issuing unit 10 decompresses the instruction packets and stores them in a cache memory, known as the V-cache.
  • the various constituent instructions in the instruction packets are then read out from the V-cache and fed to the clusters 14, 16, 18, 20 via the issue slots IS1, IS2, IS3, IS4 respectively.
  • the functions of the instruction issuing unit 10 may be distributed between the various clusters 14, 16, 18, 20. Further details of the instruction issuing unit 10 may be found in United Kingdom patent application number 0012839.7 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.
  • the master cluster 14 controls the overall operation of the processor 1. In addition, certain control instructions are always sequenced so that they will be executed in the master cluster.
  • the block structure of the master cluster 14 is shown in Figure 2.
  • the master cluster comprises first and second execution units 30, 32, control transfer unit (CTU) 34, instruction register 36, I-cache 38, V-cache partition 40, code decompression unit (CDU) 42, local memory 44, data cache 46, system bus interface 48, control and status registers 50, and predicate registers (P-regs) 52.
  • instructions are fetched one at a time from the I-cache 38 and placed in the instruction register 36.
  • the instructions are then executed by one of the execution units 30, 32 or the control transfer unit 34, depending on the type of instruction. If an I-cache miss occurs, a cache controller (not shown) arranges for the required cache block to be retrieved from memory.
  • When the processor is in VLIW mode, two instructions are fetched in parallel from the V-cache partition 40 and are placed in the instruction register 36.
  • the V-cache partition 40 is the part of the V-cache which stores VLIW instructions which are to be executed by the master cluster 14.
  • the two instructions in the instruction register are issued in parallel to the execution units 30, 32 and are executed simultaneously.
  • the V-cache partitions of all clusters are managed by the code decompression unit 42. If a V-cache miss occurs, the code decompression unit 42 retrieves the required cache block, which is stored in memory in compressed form, decompresses the block, and distributes the VLIW instructions to the V-cache partitions in each cluster.
  • An address in decompressed program space is referred to as an imaginary address.
  • VPC VLIW program counter
  • FIG. 3 shows the block structure of a slave cluster 16.
  • the slave cluster 16 comprises first and second execution units 60, 62, instruction register 64, V-cache partition 66, local memory 68, system bus interface 70, status registers 72, and predicate registers (P-regs) 74.
  • instruction execution is controlled by the master cluster 14, which broadcasts an address corresponding to the next instruction packet to be issued.
  • the instructions in the instruction packet are read from the V-cache partition in each cluster, and proceed in parallel through the execution units 60, 62.
  • A contiguous sequence of VLIW instruction packets is referred to as a VLIW code schedule.
  • a code schedule is entered whenever the processor executes a branch to VLIW mode (bv) instruction in scalar mode.
  • the code within a VLIW schedule consists of two types of code section: linear sections and loop sections.
  • On entry to each VLIW code schedule, the processor begins executing a linear section. This may initiate a subsequent loop section by executing a loop instruction. Loop sections iterate automatically, terminating when the number of loop iterations reaches the value defined by the loop instruction. It is also possible to force an early exit of a loop by executing an exit instruction. When the loop section terminates a subsequent linear section is always entered. This may initiate a further loop section, or terminate the VLIW schedule (and cause a return to scalar mode) by executing a return from VLIW mode (rv) instruction.
  • a loop section is entered when the loop initiation instruction (loop) is executed. This sets up the loop control context and switches the processor into VLIW loop mode. The processor then executes the loop section code repeatedly, checking that the loop continuation condition still holds true prior to the beginning of each iteration (excluding the first iteration).
  • the loop control operation involves a number of registers which are provided in the master cluster. These registers are described below.
  • LVPC - loop start VPC value. This points to the imaginary address of the first packet in the loop section. It is loaded from VPC+1 when the loop instruction is executed and is used to load the value back to VPC at the end of each loop iteration to allow VPC to return to the start of the loop.
  • VPC - VLIW program counter. This points to the imaginary address of the current packet. It is loaded from LVPC at the end of every loop iteration or is simply incremented by 1. It is also incremented by the literal from a branch instruction when the branch instruction is executed.
  • LPC - loop start PC value. This points to the start of the first compressed frame in memory in the block that contains the first packet in the loop section. It is used when refilling the V-cache.
  • This register is used to count the number of loop iterations, and is decremented for each iteration of the loop. It is loaded whenever the loop instruction is executed before entering the loop section.
  • EIC - epilogue iteration count. This register is used to count the number of loop iterations during the shutdown (epilogue) phase of a software pipelined loop (see below).
  • LSize - This register contains the number of packets in the loop sequence. It is loaded whenever the loop instruction is executed.
  • the loop instruction explicitly defines the number of packets in the loop section.
  • LCount - This register counts the number of loop packets, and is decremented with each new packet. When LCount becomes zero a new loop iteration is initiated. LCount is loaded from LSize at the beginning of a new loop iteration.
  • These registers are all "early modified", that is, they are modified before the processor has committed to a change in the processor context due to the instruction.
  • Each register has a backup register in order to be able to restore the processor to its last committed state when performing exception handling.
  • FIG. 4 shows an illustrative example of a software pipelined loop.
  • Figure 4(a) shows the loop prior to scheduling.
  • the loop contains a plurality of instructions which are to be executed a number of times (seven in this example).
  • Figure 4 (b) shows the loop scheduled into five stages, each stage containing a number of instructions.
  • the first stage contains the instructions which are required to be executed before a subsequent iteration can be started. This stage has a length referred to as the initiation interval.
  • the other stages are arranged to be of the same length.
  • Figure 4(c) shows how the various iterations of the loop schedule are sequenced in the clusters.
  • a total of seven iterations of a loop schedule are executed, and it is assumed that seven clusters are available.
  • Each iteration is executed in a different execution unit, with the start times of the iterations staggered by the initiation interval.
  • the pipeline loop schedule is arranged into a prologue (startup) phase, a kernel phase and an epilogue (shutdown) phase.
  • the prologue and epilogue phases need to be controlled in a systematic way. This can be done through use of the predicate registers 52, 74 shown in Figures 2 and 3.
  • the predicate registers 52, 74 are used to guard instructions passing through the pipelines as either true or false. If an instruction is guarded true then it is executed, while if an instruction is guarded false then it is not executed and it is converted into a no-operation (NOP) instruction.
  • NOP no-operation
  • the predicate values are stored in a shifting register, which is a subset of one of the predicate registers, as shown in Figure 6.
  • a further bit in the predicate register contains a value known as the predicate seed.
  • the shift register subset initially contains the values 00000.
  • the values in the shift register are shifted to the left, so that the shift register subset contains the values 00011. This pattern continues until the shift register subset contains the values 11111. All of the software pipeline stages are then turned on, and the loop is in the kernel phase (a sketch of this shifting mechanism is given after this list).
  • the seed predicate is then set to zero.
  • the loop enters the epilogue phase, and zeros are shifted into the shift register subset to turn off the software pipeline stages in the correct order.
  • When the shifting predicate register contains 00000 again, the loop has completed.
  • the processor then exits the loop mode and enters the subsequent linear section.
  • the loop itself can initiate an early shutdown by executing an exit instruction.
  • If an exit instruction is executed in any cluster, the effect is to clear the seed predicate in all clusters. This causes all clusters to enter the loop shutdown phase after completing the current loop iteration.
  • Processors embodying the present invention are hardware pipelined in order to maximise the rate at which they process instructions.
  • Hardware pipelining works by implementing each of a plurality of phases of instruction execution as a single pipeline stage. Instructions flow through successive pipeline stages, in a production-line fashion, with all partially-completed instructions moving one stage forward on each processor clock cycle.
  • Each of the execution units 30, 32 in Figure 2 and 60, 62 in Figure 3 is arranged as a hardware pipeline having a number of pipeline stages.
  • Figure 7 shows an example of the pipeline stages that may be present in the various clusters. For simplicity, a single pipeline is shown for each cluster, although it will be appreciated that two or more pipelines may be provided in each cluster.
  • instructions flow through the pipelines from left to right; thus a stage which is to the left of another stage in Figure 7 is referred to as being before, or earlier than, that stage.
  • the various stages in the pipelines are as follows.
  • VA - VLIW address stage. The address of the next instruction is computed in this stage in the master cluster.
  • VTIA - V-cache tags and instruction address. This stage is used to propagate the address of the next instruction from the master cluster to the slave clusters.
  • the master cluster performs a V-tag comparison to establish whether the required instruction is in the V-cache (cache hit).
  • IF - instruction fetch. The VLIW instructions are fetched from memory into the pipelines in the various clusters.
  • In the VLIW instruction set used by the present processor there are several instructions which can directly affect the sequencing of subsequent VLIW packets. These VLIW instructions are referred to as control instructions. Examples of such control instructions are as follows:
  • loop - this instruction initiates a VLIW loop. If this instruction is executed, it may be necessary to discard instructions already fetched into the pipelines if the loop body is less than three packets and the total number of iterations is greater than one.
  • Each of the above control instructions may cause changes to the sequencing of subsequent instruction packets. However, such an instruction will only execute if the guard predicate corresponding to that instruction is true. The state of the guard predicate is assessed when the instruction is in the X1 stage. By that stage, potentially unwanted instructions from instruction packets following the packet with the control instruction will have already been issued to the various pipelines. Thus, if such a control instruction is executed, it may be necessary to discard subsequent instructions that have already been loaded into the pipelines, and to undo any effects of those instructions. As will now be explained, discarding such unwanted instructions may be difficult for a variety of reasons.
  • A first difficulty in discarding any unwanted instructions arises due to the fact that corresponding instructions (i.e. instructions from the same instruction packet) in different clusters may not always be in the same pipeline stage at the same time. This may be due to, for example, the way in which stall signals are communicated in the processor. As disclosed in co-pending United Kingdom patent application number 0027294.8 in the name of Siroyan Limited, the entire contents of which are incorporated herein by reference, corresponding instructions in different pipelines may be allowed to become temporarily out of step with each other, in order to allow time for a stall signal to be distributed between pipelines.
  • a stall signal which is generated by one cluster takes effect in that cluster on the next clock edge, but does not take effect in other clusters until one clock cycle after that. This allows at least one clock cycle for the stall signal to be distributed throughout the processor. The result of this stalling mechanism is that the instructions in different pipelines may not be aligned with each other.
  • In the example of Figure 8 it is assumed that the X2 stages in clusters 0 and 2 have both generated stall signals. These signals cause clusters 0 and 2 to stall immediately, while clusters 1 and 3 are stalled one clock cycle later. As a result, the instructions in clusters 1 and 3 advance one stage ahead of the corresponding instructions in clusters 0 and 2. If a control instruction (such as an exit instruction, as shown in Figure 8) is acted on in the X1 stage of cluster 2, then it is necessary to discard the instructions in the VTIA, IF and D stages of clusters 0 and 2, and from the VTIA, IF, D and X1 stages of clusters 1 and 3. The logic required to delete the unwanted packets is therefore complex due to the fact that the instructions in the pipelines may not be aligned.
  • the number of packets which are stale and need to be deleted depends on the type of control instruction.
  • all subsequent packets that have already issued are unwanted.
  • the first unwanted packet can vary depending on factors such as the loop size, number of loop iterations and number of epilogue loop iterations. For example, if the loop size is one, and the number of loop iterations is greater than one, then the first subsequent packet could be retained but the second discarded. If the loop size is two then the first two subsequent packets could be retained. Alternatively, if the number of iterations is only one then all packets could be retained since the order of packet issue would remain unchanged.
  • the number of packets which need to be discarded depends on loop size, number of loop iterations remaining, the number of epilogue loop iterations, and the exit instruction's position relative to the end of the loop body.
  • predicate registers in other pipelines may have to be updated, to allow individual instructions in subsequent packets which are not deleted to become guarded false. This is necessary due to the mechanism of shifting predicates during the epilogue shutdown phase of a loop. It may be necessary to create additional stall cycles while globally broadcasting the information required to update the predicate registers, since the subsequent instructions will require the updated predicate information before they can continue.
  • the register files to which the execution units have access may use a mechanism known as rotation, in order to allow the program to use the same register address on subsequent iterations of a loop. If an unwanted instruction in a cluster has caused a register file to rotate, then that register file must be returned to its previous (un-rotated) state. This is also made more complicated by the packet non-alignment problem, and the additional stalls required.
  • certain control transfer instructions are allowed to progress to the C (commit) stage of the processor pipeline before any subsequent unwanted instructions are removed from the pipelines. These instructions are referred to herein as late resolving instructions.
  • the late resolving instructions may occur in any pipeline.
  • the processor is arranged such that the C stage and the X3 stage of the pipelines are not able to assert stall signals. In this way the instructions in the various clusters are guaranteed to be aligned in the C stage. Hence, by only removing unwanted instructions when the control transfer instruction has reached the C stage, difficulties due to instruction misalignment do not arise.
  • Figure 9 shows parts of a processor in accordance with an embodiment of the invention.
  • the exit instruction is taken as an example of a late resolving instruction.
  • the exit instruction causes the processor to exit from a software pipelined loop.
  • master cluster 14 comprises decode unit 150, execute units 154, 158, 162, commit unit 166, registers 170, 172, AND gate 178, registers 182, OR gate 184, register 188, AND gate 192 having one inverting and one non-inverting input, OR gate 194, registers 200, 202, local exit handling circuitry 208 and local exception handling circuitry 210.
  • Slave cluster 16 comprises decode unit 152, execute units 156, 160, 164, commit unit 168, registers 174, 176, AND gate 180, register 184, OR gate 186, register 190, AND gate 196 having one inverting and one non-inverting input, OR gate 198, registers 204, 206, local exit handling circuitry 212 and local exception handling circuitry 214.
  • each of the decode units 150, 152 determines whether an instruction in the D stage of the respective cluster is an exit instruction.
  • Each of the decode units 150, 152 outputs an EXIT signal which is true if the current instruction is an exit instruction.
  • the EXIT signals are loaded into registers 170, 174 respectively in the XI stage.
  • the guard predicates for the instructions are loaded into registers 172, 176.
  • the outputs of registers 170 and 172 are fed to AND gate 178, which produces a signal EXIT TAKEN to indicate that the exit instruction in cluster 0 is being executed.
  • the outputs of registers 174 and 176 are fed to AND gate 180, which produces a signal EXIT TAKEN to indicate that the exit instruction in cluster 1 is being executed.
  • the EXIT TAKEN signals in clusters 0 and 1 are registered in registers 182, 184 respectively.
  • the EXIT TAKEN signal from each cluster is broadcast to all other clusters, and collected in each cluster using OR gates 184, 186.
  • This arrangement means that the same logic is provided in each cluster, which simplifies the design.
  • the EXIT TAKEN signal could be collected by a single OR gate, for example in the master cluster, and then distributed to all clusters.
  • the outputs of the OR gates 184, 186 are registered in registers 188, 190 respectively. These signals are then gated with exception request signals EXCEPTION using AND gate 192 and OR gate 194 in cluster 0 and AND gate 196 and OR gate 198 in cluster 1.
  • the EXCEPTION signals are collected from the various clusters using OR gates (not shown) in a similar way to the EXIT TAKEN signals.
  • the outputs of gates 192, 194, 196, 198 are registered in registers 200, 202, 204, 206. Registers 200 and 204 output the signals exit_request_c, and registers 202 and 206 output the signals restore_committed.
  • the processor shown in Figure 9 includes distributed exception handling circuitry 210, 214.
  • the exception handling circuitry 210, 214 is used to handle exceptions which may be raised, for example, when unexpected events occur. An exception can be requested by any of the clusters by asserting the EXCEPTION signal.
  • the exception handling circuitry 210, 214 flushes all pipelines of instructions and restores the processor to its previously committed state, before entering an exception handling routine.
  • the logic required for flushing the pipelines and restoring the processor to its previously committed state is already present within the exception handling circuitry 210, 214. In the present embodiment, advantage is taken of this logic when handling late resolving control transfer instructions.
  • the exception handling circuitry 210 takes as inputs the restore_committed signals and the exit_request_c signals from cluster 0, and the exception handling circuitry 214 takes as inputs the restore_committed signals and the exit_request_c signals from cluster 1.
  • the processor When one of the restore_committed signals is asserted, the processor is restored to its previously committed state. All instructions in the various pipelines are squashed (including those in the packet containing the exit instruction), effectively flushing all pipelines. The register files and all other pipelined processor state variables are returned to their previous state, including the loop sequencing registers, with the exception of the loop iteration count register.
  • When one of the exit_request_c signals is asserted, indicating that a restore_committed signal was asserted due to an exit instruction rather than an exception, the other actions which are normally associated with exceptions are disabled. Instead of the processor jumping to an exception handling routine, the processor remains in VLIW mode and subsequently (after restoring the VPC and other loop context registers to the VTIA stage) re-issues the previous packets starting with the packet containing the exit instruction.
  • the exit handling circuitry 208 takes as an input the exit_request_c signal from cluster 0, and the exit handling circuitry 212 takes as an input the exit_request_c signal from cluster 1.
  • the exit handling circuitry 208, 212 sets the loop iteration count register to zero, indicating that no further loop iterations should be started and the processor should enter the loop epilogue shutdown phase when it reaches the end of the current iteration.
  • the loop sequencing context will have been changed so that the processor is now within the final loop iteration due to the iteration count register having been set to zero.
  • the seed predicates will then be cleared automatically in every cluster under the normal predicate control mechanism.
  • the normal mechanism for controlling predicates involves global signals broadcast to all slave clusters in the IF stage in parallel with the V-cache hit signal. These signals are as follows:
  • the instructions after the exit instruction are allowed to progress through the pipelines. If the exit instruction is guarded false, then the subsequent instructions in the pipelines are the correct instructions, and thus can be used without requiring any instructions to be deleted. Thus, where the exit instruction is guarded false, there are no wasted clock cycles, since no bubbles are inserted into the pipelines.
  • the exit instruction is normally guarded false, and only becomes guarded true when an early exit is required. It is therefore advantageous to handle it in a late resolving way so that in the most frequent cases no performance penalty is incurred. During the rare cases that the instruction does become guarded true (and the loop is not already in the final iteration) the penalty will be a full pipeline flush. Treating the exit instruction as a pseudo exception in the way described above allows existing logic in the exception handler to be used to handle the exit instruction, resulting in less logic and a smaller chip area. Since the deletion of subsequent instructions is delayed until the exit instruction has reached the commit stage, information concerning any exit to be taken can be distributed globally during the X2 and X3 stages without involving any time critical paths. Furthermore, there is no need to provide complex logic for deleting individual instructions from different clusters.
  • While the exit instruction has been used to illustrate the concept of late resolving instructions, control transfer instructions other than the exit instruction could be provided as late resolving instructions in a similar way.
  • branch instructions, loop instructions, return from VLIW mode instructions, subroutine call instructions and/or other instructions could be treated as late resolving instructions. It is particularly advantageous to treat control transfer instructions which are normally guarded false as late resolving instructions, since no processor clock cycles are wasted when a late resolving instruction is guarded false.
  • a processor embodying the present invention may be included as a processor "core" in a highly-integrated "system-on-a-chip" (SOC) for use in multimedia applications, network routers, video mobile phones, intelligent automobiles, digital television, voice recognition, 3D games, etc.
  • SOC system-on-a-chip
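
The shifting-predicate mechanism of Figures 5 and 6, referred to in the bullets above, can be pictured with the short sketch below. It is an illustrative model only and is not taken from the patent: a five-bit shift-register subset is assumed, the seed predicate is shifted in on each iteration, and clearing the seed (at the end of the loop or on a taken exit) shifts the software pipeline stages off again in order.

```python
# Illustrative model (an assumption, not the patent's logic) of the shifting
# predicate register of Figure 6: the seed predicate is shifted in each
# iteration, turning software pipeline stages on during the prologue and off
# again during the epilogue.
def shift(predicates, seed):
    """Shift left by one place, inserting the seed predicate at the right."""
    return predicates[1:] + [seed]

stages = [0, 0, 0, 0, 0]   # one guard predicate per software pipeline stage
seed = 1                   # set on loop entry; cleared at loop end or on exit

history = []
for iteration in range(12):
    if iteration == 7:     # iteration count exhausted (or an exit was taken)
        seed = 0
    stages = shift(stages, seed)
    history.append("".join(map(str, stages)))

print(history)
# Prologue: 00001, 00011, ..., 11111 (kernel); epilogue: 11110, ..., 00000.
```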

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Techniques are disclosed for handling control transfer instructions in parallel pipelined processors. Such instructions may cause the sequence of subsequent instructions to change, and thus may require subsequent instructions to be deleted from the processor's pipelines. The deletion of subsequent instructions from the pipelines is delayed until the control transfer instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other. This may allow the mechanism required to delete unwanted instructions to be simplified.

Description

LATE RESOLVING INSTRUCTIONS
The present invention relates to parallel pipelined processors, such as very long instruction word (VLIW) processors. The present invention is particularly concerned with the way in which certain control instructions are handled in such processors. Such control instructions may be instructions which, if executed, cause the sequence of subsequent instructions to change. Such instructions are referred to herein as control transfer instructions.
Modern processors use a technique known as pipelining to increase the rate at which instructions can be processed. Pipelining works by executing an instruction in several phases, with each phase being executed in a single pipeline stage. Instructions flow through successive pipeline stages, and complete execution when they reach the end of the pipeline.
Some processor architectures provide two or more parallel pipelines for processing different instructions, or different parts of an instruction, simultaneously. For example, VLIW processors use long instruction packets which may be divided into smaller instructions for simultaneous execution in different processor pipelines. The instructions from an instruction packet normally progress in parallel through the various pipelines. However, in certain circumstances, the instructions in the pipelines may become unaligned for one or more clock cycles. This may be due to, for example, a stall signal, or some other control signal, being asserted in one pipeline before it is asserted in another pipeline.
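The way corresponding instructions can drift out of step is easy to see with a small behavioural model. The sketch below is illustrative only (the stage names and four-deep pipelines are assumptions): a stall raised by one pipeline stops it immediately, while the other pipeline only stops a cycle later, so the same packet ends up one stage further along in the second pipeline.

```python
# Minimal model (illustrative assumptions only) of two pipelines becoming
# unaligned because a stall reaches them on different clock cycles.
STAGES = ["IF", "D", "X1", "X2"]               # hypothetical stage names

def advance(pipe, stalled, incoming=None):
    """Move every packet one stage later unless this pipeline is stalled."""
    return pipe if stalled else [incoming] + pipe[:-1]

pipe_a = [3, 2, 1, 0]                          # pipe[i] = packet in STAGES[i]
pipe_b = [3, 2, 1, 0]

stall_a = {1: True}                            # A stalls in cycle 1 ...
stall_b = {2: True}                            # ... B only sees it in cycle 2

for cycle in (1, 2, 3):
    pipe_a = advance(pipe_a, stall_a.get(cycle, False))
    pipe_b = advance(pipe_b, stall_b.get(cycle, False))
    print(f"cycle {cycle}: A={pipe_a} B={pipe_b}")
# After cycle 1, packet 2 sits in the D stage of pipeline A but in the X1
# stage of pipeline B: the corresponding instructions are unaligned.
```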
In a parallel pipelined processor, the address of an instruction packet is typically computed by one of the pipelines (the "master" pipeline), and the computed address distributed to the other pipelines (the "slave" pipelines). Each pipeline then fetches its own instruction, and decodes and executes that instruction. Each of these operations is normally carried out in a separate pipeline stage.
The ability of the master pipeline to compute the address of an instruction relies on the fact that the next address can be predicted in advance with a fair degree of certainty. For example, if the processor is running a loop, then the address of the next instruction will, in most cases, be either the next address in memory, or the address of the first instruction in the loop. Thus the processor is able to compute the addresses of instructions and load the instructions into the pipelines in earlier pipeline stages while preceding instructions are still being decoded and executed in later pipeline stages.
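The next-address computation performed in the early stages of the master pipeline can be sketched as follows. This is a simplified illustration rather than the patent's circuit; the register names (VPC, LVPC, LCount, LSize) follow the loop-control registers described elsewhere in this document, while the selection logic itself is an assumption.

```python
# Simplified illustration (assumed logic) of how the master pipeline can keep
# predicting the next packet address during a loop, using the VPC, LVPC,
# LCount and LSize register names described elsewhere in this document.
def next_vpc(vpc, lvpc, lcount, lsize, in_loop):
    """Return (next VPC, next LCount): wrap to the loop start when the last
    packet of an iteration has been issued, otherwise fall through to VPC+1."""
    if in_loop:
        lcount -= 1                  # one more packet of this iteration issued
        if lcount == 0:
            return lvpc, lsize       # next iteration begins at the loop head
        return vpc + 1, lcount
    return vpc + 1, lcount           # linear section: purely sequential

vpc, lcount = 100, 3                 # 3-packet loop starting at address 100
for _ in range(7):
    vpc, lcount = next_vpc(vpc, lvpc=100, lcount=lcount, lsize=3, in_loop=True)
    print(vpc, lcount)
# Addresses follow 101, 102, 100, 101, 102, 100, 101 without waiting for any
# instruction to execute, which is exactly what a late exit invalidates.
```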
A problem in the arrangement described above is that certain instructions, when decoded and executed, may cause the addresses of subsequent instructions to be different from those already computed by the processor. For example an "exit" instruction, if acted on, causes the processor to exit from a loop. In such a situation, some or all of the instructions which are in earlier pipeline stages may need to be removed from the pipelines ("squashed"), because they may have been loaded from the incorrect address. In parallel pipelined processors, the removal of such unwanted instructions may require a large amount of logic, which may add to the chip area of the processor and potentially slow down the operating speed of the processor. According to the present invention there is provided a processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the processor comprising: determining means for determining whether a predetermined instruction is to be executed; deleting means for deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; and delay means for delaying the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
By providing such delay means, the mechanism required to delete the subsequent instructions may be simplified, since instructions may be deleted from the same pipeline stages in all pipelines. Furthermore, since the deletion of subsequent instructions is delayed, signals indicating that subsequent instructions are to be deleted may be distributed globally without involving critical time paths. In addition, advantage may be taken of existing logic (such as that in an exception handler) to delete the subsequent instructions.
Preferably each of the pipelines comprises determining means for determining whether a predetermined instruction in that pipeline is to be executed, and the deleting means is arranged to delete subsequent instructions from the pipelines if it is determined by any of the determining means that a predetermined instruction is to be executed. This can allow the processor to respond to the predetermined instruction whichever pipeline it occurs in.
The corresponding instructions in different pipelines may become unaligned, for example, in response to a stall signal. In that case, the delay means may be arranged to delay the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage which does not generate a stall signal. Preferably the stage before that pipeline stage also does not generate a stall signal. For example, the delay means may be arranged to delay the deletion of subsequent instructions until the predetermined instruction has reached a commit stage of the pipeline, or until the predetermined instruction has reached a final stage of the pipeline.
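As a rough sketch of this delayed deletion (assumptions only: the stage names follow the embodiment described on this page, and the squash-at-commit policy is modelled in the simplest possible way), the pipelines below defer the flush until the late-resolving instruction reaches the commit (C) stage, where no stall can be generated and every pipeline therefore holds the same packet in the same stage:

```python
# Sketch (behavioural assumption, not the patent's RTL) of deferring the
# squash until the late-resolving instruction reaches the commit stage, where
# the pipelines are guaranteed to be mutually aligned.
STAGES = ["IF", "D", "X1", "X2", "X3", "C"]    # stage names from the embodiment
COMMIT = STAGES.index("C")

class Pipeline:
    def __init__(self):
        self.slots = [None] * len(STAGES)      # slots[i] = packet in STAGES[i]

    def step(self, incoming):
        self.slots = [incoming] + self.slots[:-1]

    def squash_all(self):
        self.slots = [None] * len(STAGES)

def run_until_exit_commits(pipes, exit_packet):
    packet = 0
    while not any(p.slots[COMMIT] == exit_packet for p in pipes):
        for p in pipes:
            p.step(packet)                     # keep issuing predicted packets
        packet += 1
    for p in pipes:
        p.squash_all()                         # one deletion, same stage in all

pipes = [Pipeline(), Pipeline()]
run_until_exit_commits(pipes, exit_packet=2)
print([p.slots for p in pipes])                # every pipeline flushed together
```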
In order to simplify the logic required to delete instructions, the deleting means may be arranged to delete all instructions from the pipelines.
In embodiments of the present invention, use may be made of an exception handler in order to delete subsequent instructions from the pipeline. This may result in less logic being required than would otherwise be the case, since an exception handler may already be provided in the processor, and may include the logic required to delete instructions from the pipelines. Thus the deleting means may comprise an exception handler. Preferably the exception handler is arranged to delete subsequent instructions from the pipeline, but not to enter an exception handling routine, in response to the predetermined instruction. The subsequent instructions in the pipelines may have caused unwanted changes in the processor's state. Thus the processor may further comprise restoring means for restoring the processor to a previous state (e.g. a previously committed state) in response to the predetermined instruction. If the processor has an exception handler, then the exception handler may already include logic for returning the processor to a previous state. Use may then be made of such logic, without requiring further logic to be provided for restoring the processor to a previous state. Thus the restoring means may be part of an exception handler.
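The reuse of the exception handler can be pictured with the small behavioural sketch below. It uses the EXIT TAKEN, EXCEPTION, restore_committed and exit_request_c signal names from the embodiment described on this page; the Python model itself is only illustrative.

```python
# Behavioural sketch (illustrative only) of treating a taken exit as a
# pseudo exception: either event restores the committed state, but only a
# genuine exception enters the exception handling routine.
def resolve(exit_taken, exception):
    restore_committed = exit_taken or exception    # OR gate: flush and roll back
    exit_request_c = exit_taken and not exception  # AND gate with inverted input
    return restore_committed, exit_request_c

def handle(exit_taken, exception):
    restore_committed, exit_request_c = resolve(exit_taken, exception)
    actions = []
    if restore_committed:
        actions += ["squash all pipelines", "restore previously committed state"]
    if exit_request_c:
        actions += ["stay in VLIW mode", "zero the loop iteration count",
                    "re-issue packets from the exit packet"]
    elif exception:
        actions += ["enter exception handling routine"]
    return actions

print(handle(exit_taken=True, exception=False))   # exit handled by exception logic
print(handle(exit_taken=False, exception=True))   # ordinary exception path
```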
The predetermined instruction may be an instruction which, if executed, may cause a sequence of subsequent instructions to change. For example, the predetermined instruction may be an exit instruction, a branch instruction, a loop instruction, a return-from-VLIW-mode instruction, a subroutine call instruction, or some other instruction. The processor may be arranged to treat two or more such instructions in a late resolving way, and thus the determining means may be arranged to determine whether one of a number of predetermined instructions is to be executed, and the deleting means may be arranged to delete subsequent instructions from the pipelines if it is determined that one of the predetermined instructions is to be executed. The determining means may be provided, for example, in a decode stage of the pipeline, which may allow use to be made of existing logic in the pipeline.
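A minimal sketch of such a determining means in the decode stage follows (purely illustrative; the mnemonics are those named above and the membership test is an assumption):

```python
# Minimal sketch (assumption) of the determining means as a decode-stage
# check: flag any instruction whose opcode belongs to the set of late
# resolving control transfer instructions.
LATE_RESOLVING = {"exit", "branch", "loop", "rv", "call"}

def decode(opcode, guard_predicate):
    is_late_resolving = opcode in LATE_RESOLVING
    # Action is taken only if the guard predicate turns out to be true, so a
    # guarded-false late-resolving instruction costs no pipeline bubbles.
    return is_late_resolving, is_late_resolving and guard_predicate

print(decode("exit", guard_predicate=False))   # (True, False)
print(decode("exit", guard_predicate=True))    # (True, True)
```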
The processor may comprise a plurality of pipeline clusters, each cluster comprising a plurality of pipelines. The processor may be, for example, a VLIW processor. Corresponding methods are also provided, and thus according to a second aspect of the invention there is provided a method of operating a processor, the processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the method comprising: determining whether a predetermined instruction is to be executed; and deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; where the deletion of subsequent instructions is delayed until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
According to a third aspect of the invention there is provided a processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the processor comprising: a determining unit which determines whether a predetermined instruction is to be executed; a deleting unit which deletes subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; and a delay unit which delays the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
Features of one aspect of the invention may be applied to any other aspect. Apparatus features may be applied to the method aspects and vice versa.
Preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings, in which:
Figure 1 shows an overview of a processor embodying the present invention;
Figure 2 is a block diagram of a master cluster in a processor embodying the invention;
Figure 3 is a block diagram of a slave cluster in a processor embodying the invention; Figures 4(a), 4(b) and 4(c) show an example of a software pipelined loop;
Figure 5 shows the use of predicates in a software pipeline loop;
Figure 6 shows how a predicate register may be used to produce the predicates of Figure 5;
Figure 7 shows various pipeline stages in a processor embodying the invention;
Figure 8 shows an example of unaligned stale instruction packets; Figure 9 shows parts of a processor in accordance with an embodiment of the present invention; and
Figure 10 illustrates the operation of the processor in the embodiment of Figure 9.
Overview of a parallel pipelined processor
Figure 1 shows an overview of a parallel pipelined processor embodying the present invention. The processor 1 comprises instruction issuing unit 10, schedule storage unit 12, first, second, third and fourth processor clusters 14, 16, 18, 20 and system bus 22 connected to random access memory (RAM) 24, and input/output devices 26. As will be explained, each of the clusters 14, 16, 18, 20 contains a number of execution units having a shared register file.
The processor 1 is designed to operate in two distinct modes. In the first mode, referred to as scalar mode, instructions are issued to just the first cluster 14, and the second to fourth clusters 16, 18, 20 do not perform any computational tasks. In the second mode, referred to as VLIW mode, instructions are issued in parallel to all of the clusters 14, 16, 18, 20, and these instructions are processed in parallel. A group of instructions issued in parallel to the various clusters in VLIW mode is referred to as a VLIW instruction packet. In practice, the processor architecture may be configured to include any number of slave clusters. Each VLIW instruction packet contains a number of instructions (including no-operation instructions) equal to the total number of clusters times the number of execution units in each cluster.
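As a worked example for the four-cluster configuration of Figure 1, with the two execution units per cluster shown in Figures 2 and 3:

```python
# Worked example for the configuration of Figures 1-3: one instruction slot
# per execution unit in every VLIW packet, no-operations included.
clusters = 4
execution_units_per_cluster = 2
packet_size = clusters * execution_units_per_cluster
print(packet_size)   # 8 instructions per VLIW instruction packet
```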
When the processor is in VLIW mode, VLIW instruction packets are passed from the schedule storage unit 12 to the instruction issuing unit 10. In this example, the VLIW instruction packets are stored in compressed form in the schedule storage unit 12. The instruction issuing unit 10 decompresses the instruction packets and stores them in a cache memory, known as the V-cache. The various constituent instructions in the instruction packets are then read out from the V-cache and fed to the clusters 14, 16, 18, 20 via the issue slots IS1, IS2, IS3, IS4 respectively. In practice, the functions of the instruction issuing unit 10 may be distributed between the various clusters 14, 16, 18, 20. Further details of the instruction issuing unit 10 may be found in United Kingdom patent application number 0012839.7 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.
The master cluster 14 controls the overall operation of the processor 1. In addition, certain control instructions are always sequenced so that they will be executed in the master cluster. The block structure of the master cluster 14 is shown in Figure 2. The master cluster comprises first and second execution units 30, 32, control transfer unit (CTU) 34, instruction register 36, I-cache 38, V-cache partition 40, code decompression unit (CDU) 42, local memory 44, data cache 46, system bus interface 48, control and status registers 50, and predicate registers (P-regs) 52.
In operation, when the processor is in scalar mode, instructions are fetched one at a time from the I-cache 38 and placed in the instruction register 36. The instructions are then executed by one of the execution units 30, 32 or the control transfer unit 34, depending on the type of instruction. If an I-cache miss occurs, a cache controller (not shown) arranges for the required cache block to be retrieved from memory.
When the processor is in VLIW mode, two instructions are fetched in parallel from the V-cache partition 40 and are placed in the instruction register 36. The V-cache partition 40 is the part of the V-cache which stores VLIW instructions which are to be executed by the master cluster 14. The two instructions in the instruction register are issued in parallel to the execution units 30, 32 and are executed simultaneously. The V-cache partitions of all clusters are managed by the code decompression unit 42. If a V-cache miss occurs, the code decompression unit 42 retrieves the required cache block, which is stored in memory in compressed form, decompresses the block, and distributes the VLIW instructions to the V-cache partitions in each cluster. An address in decompressed program space is referred to as an imaginary address. A VLIW program counter (VPC) points to the imaginary address of the current instruction packet. As well as the VLIW instructions, V-cache tags are also stored in V-cache partition 40, to enable the code decompression unit 42 to determine whether a cache miss has occurred.
Figure 3 shows the block structure of a slave cluster 16. The slave cluster 16 comprises first and second execution units 60, 62, instruction register 64, V-cache partition 66, local memory 68, system bus interface 70, status registers 72, and predicate registers (P-regs) 74. When the processor is in VLIW mode, instruction execution is controlled by the master cluster 14, which broadcasts an address corresponding to the next instruction packet to be issued. The instructions in the instruction packet are read from the V-cache partition in each cluster, and proceed in parallel through the execution units 60, 62.
A contiguous sequence of VLIW instruction packets is referred to as a VLIW code schedule. Such a code schedule is entered whenever the processor executes a branch to VLIW mode (bv) instruction in scalar mode. The code within a VLIW schedule consists of two types of code section: linear sections and loop sections. On entry to each VLIW code schedule, the processor begins executing a linear section. This may initiate a subsequent loop section by executing a loop instruction. Loop sections iterate automatically, terminating when the number of loop iterations reaches the value defined by the loop instruction. It is also possible to force an early exit of a loop by executing an exit instruction. When the loop section terminates, a subsequent linear section is always entered. This may initiate a further loop section, or terminate the VLIW schedule (and cause a return to scalar mode) by executing a return from VLIW mode (rv) instruction.
A loop section is entered when the loop initiation instruction (loop) is executed. This sets up the loop control context and switches the processor into VLIW loop mode. The processor then executes the loop section code repeatedly, checking that the loop continuation condition still holds true prior to the beginning of each iteration (excluding the first iteration). The loop control operation involves a number of registers which are provided in the master cluster. These registers are described below, and their interaction is illustrated by the sketch following the list.
• LVPC - loop start VPC value. This points to the imaginary address of the first packet in the loop section. It is loaded from VPC+1 when the loop instruction is executed and is used to load the value back to VPC at the end of each loop iteration to allow VPC to return to the start of the loop.
• VPC - VLIW program counter. This points to the imaginary address of the current packet. It is loaded from LVPC at the end of every loop iteration or is simply incremented by 1. It is also incremented by the literal from a branch instruction when the branch instruction is executed.
• LPC - loop start PC value. This points to the start of the first compressed frame in memory in the block that contains the first packet in the loop section. It is used when refilling the V-cache.
• PC - program counter. When in VLIW mode, this points to the start of the current compressed block in memory.
• IC - iteration count. This register is used to count the number of loop iterations, and is decremented for each iteration of the loop. It is loaded whenever the loop instruction is executed before entering the loop section.
• EIC - epilogue iteration count. This register is used to count the number of loop iterations during the shutdown (epilogue) phase of a software pipelined loop (see below).
• CC - compression count. This indicates the size of the compressed block and is used for updating the value of PC.
• LSize - loop size. This register contains the number of packets in the loop sequence. It is loaded whenever the loop instruction is executed. The loop instruction explicitly defines the number of packets in the loop section.
• LCount - This register counts the number of loop packets, and is decremented with each new packet. When LCount becomes zero, a new loop iteration is initiated. LCount is loaded from LSize at the beginning of a new loop iteration.
The above registers are all "early modified", that is, they are modified before the processor has committed to a change in the processor context due to the instruction. Each register has a backup register in order to be able to restore the processor to its last committed state when performing exception handling.
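By way of illustration only, the following Python sketch models how the loop sequencing registers listed above might cooperate to issue the packets of a loop section. The initial values and the exact ordering of the register updates are assumptions made for the purpose of this sketch (for example, IC is assumed to reach zero during the final iteration, consistent with the pexit signal described later); the sketch is not a definitive statement of the hardware behaviour.

# Illustrative model of the loop sequencing registers; update ordering is assumed.
def run_loop_section(loop_size, iterations):
    VPC = 100              # imaginary address of the first packet of the loop body
    LVPC = VPC             # loop start VPC value (loaded from VPC+1 when loop executes)
    IC = iterations - 1    # iteration count; assumed to be zero during the final iteration
    LSize = loop_size      # number of packets in the loop body, defined by the loop instruction
    LCount = LSize         # counts down the packets remaining in the current iteration
    issued = []
    while True:
        issued.append(VPC)     # issue the packet at the current imaginary address
        LCount -= 1
        if LCount == 0:        # final packet of the loop body
            if IC == 0:        # final iteration: fall through to the next linear section
                break
            IC -= 1
            VPC = LVPC         # return to the start of the loop body
            LCount = LSize     # reload the packet count for the new iteration
        else:
            VPC += 1           # next packet of the current iteration
    return issued

# A 3-packet loop body executed twice issues 100,101,102,100,101,102
# before the subsequent linear section is entered.
print(run_loop_section(loop_size=3, iterations=2))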
Typically, a linear section of VLIW code is used to set up the context for the execution of a software pipelined loop. A software pipelined loop works by executing different iterations of the same loop in different clusters in an overlapped manner. Figure 4 shows an illustrative example of a software pipelined loop. Figure 4(a) shows the loop prior to scheduling. The loop contains a plurality of instructions which are to be executed a number of times (seven in this example). Figure 4(b) shows the loop scheduled into five stages, each stage containing a number of instructions. The first stage contains the instructions which are required to be executed before a subsequent iteration can be started. This stage has a length referred to as the initiation interval. The other stages are arranged to be of the same length. Figure 4(c) shows how the various iterations of the loop schedule are sequenced in the clusters. In this example, a total of seven iterations of a loop schedule are executed, and it is assumed that seven clusters are available. Each iteration is executed in a different execution unit, with the start times of the iterations staggered by the initiation interval.
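The overlapped execution just described can be visualised with the following short Python sketch, which prints a Figure 4(c)-style table showing which stage of which iteration is active during each initiation interval. The parameters (five stages, seven iterations) follow the example above; the tabular layout itself is merely illustrative.

# Illustrative only: each iteration starts one initiation interval after the
# previous one and advances one software pipeline stage per interval.
def software_pipeline_schedule(num_iterations=7, num_stages=5):
    total_intervals = num_iterations + num_stages - 1
    for t in range(total_intervals):
        row = []
        for i in range(num_iterations):
            stage = t - i                     # 0-based stage of iteration i in interval t
            row.append(f"it{i + 1}:S{stage + 1}" if 0 <= stage < num_stages else "  --  ")
        print(f"interval {t + 1:2d}: " + " ".join(row))

software_pipeline_schedule()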
Referring to Figure 4(c), it can be seen that the pipeline loop schedule is arranged into a prologue (startup) phase, a kernel phase and an epilogue (shutdown) phase. The prologue and epilogue phases need to be controlled in a systematic way. This can be done through use of the predicate registers 52, 74 shown in Figures 2 and 3. The predicate registers 52, 74 are used to guard instructions passing through the pipelines either true or false. If an instruction is guarded true then it is executed, while if an instruction is guarded false then it is not executed and it is converted into a no-operation (NOP) instruction. In order to control the prologue and epilogue phases of a software pipeline loop, all instructions in pipeline stage i are tagged with a predicate Pi. Pi is then arranged to be true whenever pipeline stage i should be enabled. Figure 5 shows how the predicates for each software pipeline stage change during the execution of the loop.
In order to change the predicates during the execution of the loop, the predicate values are stored in a shifting register, which is a subset of one of the predicate registers, as shown in Figure 6. A further bit in the predicate register contains a value known as the predicate seed. The shift register subset initially contains the values 00000. When a loop is to be started, a 1 is loaded into the predicate seed. This 1 is shifted into the shift register subset prior to the first iteration, so that the values stored therein become 00001. This turns on pipeline stage 1, but leaves stages 2 through 5 disabled. When the first stage of the pipeline loop has completed (i.e. after a number of cycles equal to the initiation interval), the values in the shift register are shifted to the left, so that the shift register subset contains the values 00011. This pattern continues until the shift register subset contains the values 11111. All of the software pipeline stages are then turned on, and the loop is in the kernel phase.
When a number of iterations equal to the iteration count have been executed (in this case seven), the seed predicate is then set to zero. At this point the loop enters the epilogue phase, and zeros are shifted into the shift register subset to turn off the software pipeline stages in the correct order. When all of the pipeline stages have been turned off and the shifting predicate register contains 00000 again, the loop has completed. The processor then exits the loop mode and enters the subsequent linear section.
At any time the loop itself can initiate an early shutdown by executing an exit instruction. When an exit instruction is executed in any cluster the effect is to clear the seed predicate in all clusters. This causes all clusters to enter the loop shutdown phase after completing the current loop iteration.
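It will be appreciated that the following Python sketch is illustrative only; it models the shifting predicate mechanism of Figures 5 and 6, including the clearing of the seed predicate either after the required number of iterations or on an early exit. The point at which the seed is cleared relative to the current interval is an assumption made for the sketch.

# Illustrative model of the shifting predicate register and seed predicate.
def run_predicated_loop(num_stages=5, iterations=7, exit_after=None):
    predicates = [0] * num_stages    # element 0 holds P1, element 4 holds P5
    seed = 1                         # the predicate seed, set when the loop is started
    started = 0                      # number of iterations initiated so far
    interval = 0
    while True:
        predicates = [seed] + predicates[:-1]   # shift the seed in at each interval
        if seed:
            started += 1
        interval += 1
        print(f"interval {interval:2d}: P5..P1 = {''.join(map(str, reversed(predicates)))}")
        if exit_after is not None and started >= exit_after:
            seed = 0                 # an exit guarded true clears the seed in all clusters
        if started >= iterations:
            seed = 0                 # required iteration count reached: begin the epilogue
        if all(p == 0 for p in predicates):
            break                    # 00000 again: the loop has completed

run_predicated_loop()                # normal run: prologue, kernel, epilogue
run_predicated_loop(exit_after=3)    # early exit after three iterations have been started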
Further details on the use of predicates in software pipelined loops may be found in United Kingdom patent application number 0014432.9 in the name of Siroyan Limited, the entire subject matter of which is incorporated herein by reference.
Processors embodying the present invention are hardware pipelined in order to maximise the rate at which they process instructions. Hardware pipelining works by implementing each of a plurality of phases of instruction execution as a single pipeline stage. Instructions flow through successive pipeline stages, in a production-line fashion, with all partially-completed instructions moving one stage forward on each processor clock cycle. Each of the execution units 30, 32 in Figure 2 and 60, 62 in Figure 3 is arranged as a hardware pipeline having a number of pipeline stages.
Figure 7 shows an example of the pipeline stages that may be present in the various clusters. For simplicity, a single pipeline is shown for each cluster, although it will be appreciated that two or more pipelines may be provided in each cluster. In the pipelines of Figure 7, instructions flow through the pipelines from left to right; thus a stage which is to the left of another stage in Figure 7 is referred to as being before, or earlier than, that stage. The various stages in the pipelines are as follows.
• VA - VLIW address stage. The address of the next instruction is computed in this stage in the master cluster.
• VTIA - V-cache tags and instruction address. This stage is used to propagate the address of the next instruction from the master cluster to the slave cluster. In addition, the master cluster performs a V-tag comparison to establish whether the required instruction is in the V-cache (cache hit).
• IF - instruction fetch. The VLIW instructions are fetched from memory into the pipelines in the various clusters.
• D - instruction decode. The instructions are decoded to determine the type of instruction and which registers are to be the source and the destination for the instruction, and literals are extracted from the instruction.
• X1 - execute 1. First execution cycle.
• X2 - execute 2. Second execution cycle.
• X3 - execute 3. Third execution cycle.
• C - commit. The instruction result is obtained and, unless an exception has occurred, it will commit to causing a change to the processor state.
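Purely by way of illustration of the stage sequence listed above, the following Python sketch traces a few instruction packets through the VA, VTIA, IF, D, X1, X2, X3 and C stages, one stage per clock cycle. It models only the ordering of the stages, not their internal behaviour or any stall conditions.

# Illustrative only: packets advance one pipeline stage per clock cycle.
STAGES = ["VA", "VTIA", "IF", "D", "X1", "X2", "X3", "C"]

def trace_packets(num_packets=3):
    depth = len(STAGES)
    for cycle in range(num_packets + depth - 1):
        occupancy = []
        for p in range(num_packets):
            stage = cycle - p              # packet p enters the VA stage on cycle p
            if 0 <= stage < depth:
                occupancy.append(f"{STAGES[stage]}=packet{p}")
        print(f"cycle {cycle}: " + ", ".join(occupancy))

trace_packets()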
In the VLIW instruction set which is used by the present processor there are several instructions which can directly affect the sequencing of subsequent VLIW packets. These VLIW instructions are referred to as control instructions. Examples of such control instructions are as follows:
• branch - this instruction causes the program to branch to another address. If this instruction is executed, earlier instructions in the pipelines will usually need to be discarded.
• loop - this instruction initiates a VLIW loop. If this instruction is executed, it may be necessary to discard earlier instructions from the pipelines if the loop body is less than three packets and the total number of iterations is greater than one.
• rv (return from VLIW mode) - this instruction causes the processor to change from VLIW mode to scalar mode. If this instruction is executed, earlier instructions in the pipelines need to be discarded.
• exit - this instruction causes the program to exit early from a loop. Depending on the way in which the exit is handled, one or more earlier instructions in the pipelines may need to be discarded.
Each of the above control instructions, if executed, may cause changes to the sequencing of subsequent instruction packets. However, such an instruction will only execute if the guard predicate corresponding to that instruction is true. The state of the guard predicate is assessed when the instruction is in the X1 stage. By that stage, potentially unwanted instructions from instruction packets following the packet with the control instruction will have already been issued to the various pipelines. Thus, if such a control instruction is executed, it may be necessary to discard subsequent instructions that have already been loaded into the pipelines, and to undo any effects of those instructions. As will now be explained, discarding such unwanted instructions may be difficult for a variety of reasons.
A first difficulty in discarding any unwanted instructions arises due to the fact that corresponding instructions (i.e. instructions from the same instruction packet) in different clusters may not always be in the same pipeline stage at the same time. This may be due to, for example, the way in which stall signals are communicated in the processor. As disclosed in co-pending United Kingdom patent application number 0027294.8 in the name of Siroyan Limited, the entire contents of which are incorporated herein by reference, corresponding instructions in different pipelines may be allowed to become temporarily out of step with each other, in order to allow time for a stall signal to be distributed between pipelines. In embodiments of the present invention, a stall signal which is generated by one cluster takes effect in that cluster on the next clock edge, but does not take effect in other clusters until one clock cycle after that. This allows at least one clock cycle for the stall signal to be distributed throughout the processor. The result of this stalling mechanism is that the instructions in different pipelines may not be aligned with each other.
An example of unaligned stale packets is shown in Figure 8. In this example it is assumed that the X2 stages in clusters 0 and 2 have both generated stall signals. These signals cause clusters 0 and 2 to stall immediately, while clusters 1 and 3 are stalled one clock cycle later. As a result, the instructions in clusters 1 and 3 advance one stage ahead of the corresponding instructions in clusters 0 and 2. If a control instruction (such as an exit instruction, as shown in Figure 8) is acted on in the X1 stage of cluster 2, then it is necessary to discard the instructions in the VTIA, IF and D stages of clusters 0 and 2, and from the VTIA, IF, D and X1 stages of clusters 1 and 3. The logic required to delete the unwanted packets is therefore complex due to the fact that the instructions in the pipelines may not be aligned.
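The skew produced by this stall distribution can be illustrated with the following Python sketch, which tracks the pipeline stage occupied by one reference packet in each of four clusters when clusters 0 and 2 raise a stall. The details of how the stall is eventually released are omitted; only the way the one-stage skew of Figure 8 arises is shown, and the timing convention modelled is an interpretation of the description above.

# Illustrative only: a stall raised in a cluster freezes that cluster on the next
# clock edge, and every other cluster one clock cycle later.
def simulate_stall(num_clusters=4, stalling_clusters=(0, 2), stall_cycle=3, cycles=6):
    position = [0] * num_clusters        # pipeline stage index of a reference packet
    for cycle in range(cycles):
        for c in range(num_clusters):
            if cycle >= stall_cycle and c in stalling_clusters:
                continue                 # the stalling cluster stops immediately
            if cycle >= stall_cycle + 1:
                continue                 # the remaining clusters stop one cycle later
            position[c] += 1             # otherwise the packet advances one stage
        print(f"cycle {cycle}: stage index per cluster = {position}")

simulate_stall()   # clusters 1 and 3 end up one stage ahead of clusters 0 and 2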
In addition to the non-alignment problem, the number of packets which are stale and need to be deleted depends on the type of control instruction. In the case of a branch instruction or a rv instruction, all subsequent packets that have already issued are unwanted. In the case of a loop instruction, the first unwanted packet can vary depending on factors such as the loop size, number of loop iterations and number of epilogue loop iterations. For example, if the loop size is one, and the number of loop iterations is greater than one, then the first subsequent packet could be retained but the second discarded. If the loop size is two then the first two subsequent packets could be retained. Alternatively, if the number of iterations is only one then all packets could be retained since the order of packet issue would remain unchanged.
In the case of an exit instruction, the number of packets which need to be discarded depends on loop size, number of loop iterations remaining, the number of epilogue loop iterations, and the exit instruction's position relative to the end of the loop body. In addition to deciding which packets are unwanted, predicate registers in other pipelines may have to be updated, to allow individual instructions in subsequent packets which are not deleted to become guarded false. This is necessary due to the mechanism of shifting predicates during the epilogue shutdown phase of a loop. It may be necessary to create additional stall cycles while globally broadcasting the information required to update the predicate registers, since the subsequent instructions will require the updated predicate information before they can continue.
The register files to which the execution units have access may use a mechanism known as rotation, in order to allow the program to use the same register address on subsequent iterations of a loop. If an unwanted instruction in a cluster has caused a register file to rotate, then that register file must be returned to its previous (un-rotated) state. This is also made more complicated by the packet non-alignment problem, and the additional stalls required.
Late resolving instructions
In embodiments of the invention, certain control transfer instructions are allowed to progress to the C (commit) stage of the processor pipeline before any subsequent unwanted instructions are removed from the pipelines. These instructions are referred to herein as late resolving instructions. The late resolving instructions may occur in any pipeline.
In order to ensure that instructions from the same instruction packet always leave the pipelines in parallel, in embodiments of the invention the processor is arranged such that the C stage and the X3 stage of the pipelines are not able to assert stall signals. In this way the instructions in the various clusters are guaranteed to be aligned in the C stage. Hence, by only removing unwanted instructions when the control transfer instruction has reached the C stage, difficulties due to instruction misalignment do not arise.
Figure 9 shows parts of a processor in accordance with an embodiment of the invention. In this embodiment, the exit instruction is taken as an example of a late resolving instruction. The exit instruction causes the processor to exit from a software pipelined loop.
Referring to Figure 9, master cluster 14 comprises decode unit 150, execute units 154, 158, 162, commit unit 166, registers 170, 172, AND gate 178, registers 182, OR gate 184, register 188, AND gate 192 having one inverting and one non-inverting input, OR gates 194, registers 200, 202, local exit handling circuitry 208 and local exception handling circuitry 210. Slave cluster 16 comprises decode unit 152, execute units 156, 160, 164, commit unit 168, registers 174, 176, AND gate 180, register 184, OR gate 186, register 190, AND gate 196 having one inverting and one non-inverting input, OR gate 198, registers 204, 206, local exit handling circuitry 212 and local exception handling circuitry 214.
In operation, each of the decode units 150, 152 determines whether an instruction in the D stage of the respective cluster is an exit instruction. Each of the decode units 150, 152 outputs an EXIT signal which is true if the current instruction is an exit instruction. The EXIT signals are loaded into registers 170, 174 respectively in the X1 stage. Also in this stage, the guard predicates for the instructions are loaded into registers 172, 176. The outputs of registers 170 and 172 are fed to AND gate 178, which produces a signal EXIT TAKEN to indicate that the exit instruction in cluster 0 is being executed. Similarly, the outputs of registers 174 and 176 are fed to AND gate 180, which produces a signal EXIT TAKEN to indicate that the exit instruction in cluster 1 is being executed.
In the X2 stage, the EXIT TAKEN signals in clusters 0 and 1 are registered in registers 182, 184 respectively. In this stage, the EXIT TAKEN signal from each cluster is broadcast to all other clusters, and collected in each cluster using OR gates 184, 186. This arrangement means that the same logic is provided in each cluster, which simplifies the design. Alternatively, the EXIT TAKEN signal could be collected by a single OR gate, for example in the master cluster, and then distributed to all clusters.
In the X3 stage the outputs of the OR gates 184, 186 are registered in registers 188, 190 respectively. These signals are then gated with exception request signals EXCEPTION using AND gate 192 and OR gate 194 in cluster 0 and AND gate 196 and OR gate 198 in cluster 1. The EXCEPTION signals are collected from the various clusters using OR gates (not shown) in a similar way to the EXIT TAKEN signals. In the C stage, the outputs of gates 192, 194, 196, 198 are registered in registers 200, 202, 204, 206. Registers 200 and 204 output the signals exit_request_c, and registers 202 and 206 output the signals restore_committed.
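One plausible reading of this gating (an assumption made for the purpose of the sketch below, since the exact Boolean functions of gates 192 to 198 are an interpretation of the description above) is that an exit request is suppressed when an exception is pending, while either an exit or an exception causes the committed state to be restored. The following Python sketch expresses that reading.

# Illustrative only; the gating functions shown here are an interpretation of
# the description of gates 178/180 (X1) and 192 to 198 (X3/C), not a definitive one.
def exit_taken(exit_decoded, guard_predicate):
    # X1 stage: an exit instruction is only taken if its guard predicate is true
    return exit_decoded and guard_predicate

def commit_stage_signals(exit_taken_per_cluster, exception_per_cluster):
    # X2/X3 stages: each cluster broadcasts its EXIT TAKEN and EXCEPTION signals,
    # which are collected with OR gates so every cluster sees the same global values
    any_exit = any(exit_taken_per_cluster)
    any_exception = any(exception_per_cluster)
    exit_request_c = any_exit and not any_exception   # assumed: exceptions take priority
    restore_committed = any_exit or any_exception     # either condition restores state
    return exit_request_c, restore_committed

# Example: an exit guarded true in cluster 1 only, and no exception pending.
taken = [exit_taken(False, True), exit_taken(True, True)]
print(commit_stage_signals(taken, [False, False]))    # -> (True, True)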
The processor shown in Figure 9 includes distributed exception handling circuitry 210, 214. The exception handling circuitry 210, 214 is used to handle exceptions which may be raised, for example, when unexpected events occur. An exception can be requested by any of the clusters by asserting the EXCEPTION signal. Usually, when an exception is raised, the exception handling circuitry 210, 214 flushes all pipelines of instructions and restores the processor to its previously committed state, before entering an exception handling routine. Thus, the logic required for flushing the pipelines and restoring the processor to its previously committed state is already present within the exception handling circuitry 210, 214. In the present embodiment, advantage is taken of this logic when handling late resolving control transfer instructions.
The exception handling circuitry 210 takes as inputs the restore_committed signals and the exit_request_c signals from cluster 0, and the exception handling circuitry 214 takes as inputs the restore_committed signals and the exit_request_c signals from cluster 1. When one of the restore_committed signals is asserted, the processor is restored to its previously committed state. All instructions in the various pipelines are squashed (including those in the packet containing the exit instruction), effectively flushing all pipelines. The register files and all other pipelined processor state variables are returned to their previous state, including the loop sequencing registers, with the exception of the loop iteration count register.
When one of the exit_request_c signals is asserted, indicating that a restore_committed signal was asserted due to an exit instruction rather than an exception, the other actions which are normally associated with exceptions are disabled. Instead of the processor jumping to an exception handling routine, the processor remains in VLIW mode and subsequently (after restoring the VPC and other loop context registers to the VTIA stage) re-issues the previous packets starting with the packet containing the exit instruction. The exit handling circuitry 208 takes as an input the exit_request_c signal from cluster 0, and the exit handling circuitry 212 takes as an input the exit_request_c signal from cluster 1. If one of the exit_request_c signals is asserted, the exit handling circuitry 208, 212 sets the loop iteration count register to zero, indicating that no further loop iterations should be started and the processor should enter the loop epilogue shutdown phase when it reaches the end of the current iteration. When the packet containing the exit instruction is re-issued (under control of the exception handling circuitry 210, 214), the loop sequencing context will have been changed so that the processor is now within the final loop iteration due to the iteration count register having been set to zero. The seed predicates will then be cleared automatically in every cluster under the normal predicate control mechanism.
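The overall commit-stage behaviour described in the preceding paragraphs can be summarised in the following Python sketch. The state names (for example last_committed and exit_packet_vpc) are invented solely for the purpose of illustration, and the sketch is a simplification of the sequence of events rather than a statement of the actual implementation.

# Illustrative summary of the actions taken when a late resolving exit reaches
# the commit stage; all field names are invented for this sketch.
def handle_commit_stage(state, exit_request_c, restore_committed):
    if restore_committed:
        state["pipelines"] = []                  # squash every in-flight instruction
        for reg, value in state["last_committed"].items():
            if reg != "IC":                      # the iteration count register is not restored
                state[reg] = value
    if exit_request_c:
        state["IC"] = 0                          # no further loop iterations will be started
        state["mode"] = "VLIW"                   # remain in VLIW mode; no exception routine
        state["resume_vpc"] = state["exit_packet_vpc"]   # re-issue from the exit packet
    elif restore_committed:
        state["mode"] = "exception handler"      # a genuine exception enters the handler
    return state

state = {"pipelines": ["pkt206", "pkt207"], "mode": "VLIW", "IC": 4, "VPC": 207,
         "exit_packet_vpc": 205, "last_committed": {"IC": 5, "VPC": 205}}
print(handle_commit_stage(state, exit_request_c=True, restore_committed=True))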
The normal mechanism for controlling predicates involves global signals broadcast to all slave clusters in the IF stage in parallel with the V-cache hit signal. These signals are as follows:
• pexit - this signal is asserted when the iteration count (IC) register is zero, indicating that the current issue packet is in the final iteration of the loop. It is used to clear the seed predicates in all clusters.
• pshift - this signal is asserted when the loop packet count (LCount) register is zero, indicating that the current issue packet is the final packet of the loop body. It is used to shift the predicate registers in all clusters.
When the re-issued exit packet reaches the commit stage for the second time, the exit request and restore process does not repeat. This is achieved by disabling the exit request whenever the processor is already in the final iteration of a loop (when IC=0). An exit taken during the final loop iteration will always be ignored since a taken exit will have no effect on the order of packet issue. This is because the loop sequencer will already be entering the loop shutdown phase at the end of the current iteration.
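The following short Python sketch restates these two control signals and the suppression of the exit request during the final iteration. It is illustrative only; the function names are not taken from the original description.

# Illustrative only: global predicate control signals and exit suppression.
def predicate_control_signals(IC, LCount):
    pexit = (IC == 0)       # final iteration: clear the seed predicates in all clusters
    pshift = (LCount == 0)  # final packet of the loop body: shift the predicate registers
    return pexit, pshift

def exit_request_enabled(IC):
    # an exit taken while IC == 0 is ignored, so the re-issued exit packet
    # cannot trigger a second restore of the committed state
    return IC != 0

print(exit_request_enabled(3), exit_request_enabled(0))   # -> True False
print(predicate_control_signals(IC=0, LCount=0))          # -> (True, True)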
The sequence of events when an exit instruction is guarded true is illustrated in Figure 10. Although the guard predicate for the exit instruction is examined in the X1 stage, the pipeline is only flushed once the exit instruction has reached the C stage, where all pipelines are guaranteed to be aligned (cycle 7).
In the present embodiment, the instructions after the exit instruction are allowed to progress through the pipelines. If the exit instruction is guarded false, then the subsequent instructions in the pipelines are the correct instructions, and thus can be used without requiring any instructions to be deleted. Thus, where the exit instruction is guarded false, there are no wasted clock cycles, since no bubbles are inserted into the pipelines.
The exit instruction is normally guarded false, and only becomes guarded true when an early exit is required. It is therefore advantageous to handle it in a late resolving way so that in the most frequent cases no performance penalty is incurred. During the rare cases that the instruction does become guarded true (and the loop is not already in the final iteration) the penalty will be a full pipeline flush. Treating the exit instruction as a pseudo exception in the way described above allows existing logic in the exception handler to be used to handle the exit instruction, resulting in less logic and a smaller chip area. Since the deletion of subsequent instructions is delayed until the exit instruction has reached the commit stage, information concerning any exit to be taken can be distributed globally during the X2 and X3 stages without involving any time critical paths. Furthermore, there is no need to provide complex logic for deleting individual instructions from different clusters.
While the exit instruction has been used to illustrate the concept of late resolving instructions, control transfer instructions other than the exit instruction could be provided as late resolving instructions in a similar way. For example, branch instructions, loop instructions, return from VLIW mode instructions, subroutine call instructions and/or other instructions could be treated as late resolving instructions. It is particularly advantageous to treat control transfer instructions which are normally guarded false as late resolving instructions, since no processor clock cycles are wasted when a late resolving instruction is guarded false.
Although the above description relates, by way of example, to a clustered VLIW processor it will be appreciated that the present invention is applicable to any processor having at least two parallel pipelines. Thus the invention may be applied to parallel processors other than VLIW processors, and to processors not having clustered pipelines. A processor embodying the present invention may be included as a processor "core" in a highly-integrated "system-on-a-chip" (SOC) for use in multimedia applications, network routers, video mobile phones, intelligent automobiles, digital television, voice recognition, 3D games, etc.

Claims

1. A processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the processor comprising: determining means for determining whether a predetermined instruction is to be executed; deleting means for deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; and delay means for delaying the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
2. A processor according to claim 1, wherein each of the pipelines comprises determining means for determining whether a predetermined instruction in that pipeline is to be executed, and the deleting means is arranged to delete subsequent instructions from the pipelines if it is determined by any of the determining means that a predetermined instruction is to be executed.
3. A processor according to claim 1 or 2, wherein corresponding instructions in different pipelines may become unaligned in response to a stall signal.
4. A processor according to claim 3, wherein the delay means is arranged to delay the deletion of subsequent instructions until the predetermined instruction is in a pipeline stage which does not generate a stall signal.
5. A processor according to any of the preceding claims wherein the delay means is arranged to delay the deletion of subsequent instructions until the predetermined instruction has reached a commit stage of the pipeline.
6. A processor according to any of the preceding claims wherein the delay means is arranged to delay the deletion of subsequent instructions until the predetermined instruction has reached a final stage of the pipeline.
7. A processor according to any of the preceding claims wherein the deleting means is arranged to delete all instructions from the pipelines.
8. A processor according to any of the preceding claims wherein the deleting means comprises an exception handler.
9. A processor according to claim 8 wherein the exception handler is arranged to delete subsequent instructions from the pipelines, but not to enter an exception handling routine, in response to the predetermined instruction.
10. A processor according to any of the preceding claims further comprising restoring means for restoring the processor to a previous state in response to the predetermined instruction.
11. A processor according to claim 10 wherein the restoring means is part of an exception handler.
12. A processor according to any of the preceding claims wherein the predetermined instruction is an instruction which, if executed, may cause a sequence of subsequent instructions to change.
13. A processor according to any of the preceding claims, wherein the determining means is arranged to determine whether one of a number of predetermined instructions is to be executed, and the deleting means is arranged to delete subsequent instructions from the pipelines if it is determined that one of the predetermined instructions is to be executed.
14. A processor according to any of the preceding claims wherein the determining means is provided in a decode stage of the pipeline.
15. A processor according to any of the preceding claims, the processor comprising a plurality of pipeline clusters, each cluster comprising a plurality of pipelines.
16. A processor according to any of the preceding claims, the processor being a VLIW processor.
17. A method of operating a processor, the processor comprising a plurality of parallel pipelines, each pipeline comprising a plurality of pipeline stages for performing a series of operations on an instruction passing through the pipeline, the processor being arranged such that corresponding instructions in different pipelines may become unaligned for at least one clock cycle, the method comprising: determining whether a predetermined instruction is to be executed; and deleting subsequent instructions from the pipelines if it is determined that the predetermined instruction is to be executed; where the deletion of subsequent instructions is delayed until the predetermined instruction is in a pipeline stage in which corresponding instructions in different pipelines are aligned with each other.
PCT/GB2002/004555 2001-10-12 2002-10-07 Late resolving instructions WO2003034201A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002329475A AU2002329475A1 (en) 2001-10-12 2002-10-07 Late resolving instructions

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0124557.0 2001-10-12
GB0124557A GB2380828A (en) 2001-10-12 2001-10-12 Simplified method of deleting unwanted instructions in a pipeline processor
US33824101P 2001-12-06 2001-12-06
US60/338,241 2001-12-06

Publications (2)

Publication Number Publication Date
WO2003034201A2 true WO2003034201A2 (en) 2003-04-24
WO2003034201A3 WO2003034201A3 (en) 2003-11-27

Family

ID=26246652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2002/004555 WO2003034201A2 (en) 2001-10-12 2002-10-07 Late resolving instructions

Country Status (2)

Country Link
AU (1) AU2002329475A1 (en)
WO (1) WO2003034201A2 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5450556A (en) * 1990-09-05 1995-09-12 North American Philips Corporation VLIW processor which uses path information generated by a branch control unit to inhibit operations which are not on a correct path
EP0649086A1 (en) * 1993-10-18 1995-04-19 Cyrix Corporation Microprocessor with speculative execution
US6055626A (en) * 1996-05-30 2000-04-25 Matsushita Electric Industrial Co., Ltd. Method and circuit for delayed branch control and method and circuit for conditional-flag rewriting control
GB2343973A (en) * 1998-02-09 2000-05-24 Mitsubishi Electric Corp Delayed execution of conditional instructions

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1686460A2 (en) * 2005-01-26 2006-08-02 STMicroelectronics, Inc. Method and apparatus for efficient and flexible sequencing of data processing units extending VLIW architecture
EP1686460A3 (en) * 2005-01-26 2008-11-05 STMicroelectronics, Inc. Method and apparatus for efficient and flexible sequencing of data processing units extending VLIW architecture

Also Published As

Publication number Publication date
AU2002329475A1 (en) 2003-04-28
WO2003034201A3 (en) 2003-11-27

Similar Documents

Publication Publication Date Title
US6163839A (en) Non-stalling circular counterflow pipeline processor with reorder buffer
US7725684B2 (en) Speculative instruction issue in a simultaneously multithreaded processor
JP2002535747A (en) Method and apparatus for dynamically reconfiguring an instruction pipeline of an indirect very long instruction word scalable processor
JPH10506739A (en) A device that detects and executes traps in a superscalar processor
JPH06318155A (en) Computer system
JPH06236271A (en) Processor and method for guess and execution of instruction
JP2925818B2 (en) Parallel processing controller
US6061367A (en) Processor with pipelining structure and method for high-speed calculation with pipelining processors
EP1483675B1 (en) Methods and apparatus for multi-processing execution of computer instructions
US7805592B2 (en) Early resolving instructions
EP1205840B1 (en) Stall control in a processor with multiple pipelines
US5778208A (en) Flexible pipeline for interlock removal
KR100431975B1 (en) Multi-instruction dispatch system for pipelined microprocessors with no branch interruption
US5590359A (en) Method and apparatus for generating a status word in a pipelined processor
WO2003034201A2 (en) Late resolving instructions
GB2380828A (en) Simplified method of deleting unwanted instructions in a pipeline processor
JP3759729B2 (en) Speculative register adjustment
WO2003034205A1 (en) Early resolving instructions
JPH05197547A (en) Vliw type arithmetic processor
JP2877531B2 (en) Parallel processing unit
JP2000029694A (en) Processor and instruction take-out method for selecting one of plural take-out addresses generated in parallel to generate memory request
EP1190312B1 (en) Not reported jump buffer and method for handling jumps
JP3547562B2 (en) Microprocessor
US20040128482A1 (en) Eliminating register reads and writes in a scheduled instruction cache
KR20010043891A (en) Method and apparatus for distributing commands to a plurality of circuit blocks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG US UZ VC VN YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP