WO2002039272A1 - Method and apparatus for reducing branch latency - Google Patents

Method and apparatus for reducing branch latency

Info

Publication number
WO2002039272A1
Authority
WO
WIPO (PCT)
Prior art keywords
branch
instruction
address
memory block
branch instruction
Prior art date
Application number
PCT/US2001/049653
Other languages
French (fr)
Other versions
WO2002039272A9 (en)
Inventor
John L. Redford
Original Assignee
Chipwrights Design, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipwrights Design, Inc. filed Critical Chipwrights Design, Inc.
Priority to AU2002227451A priority Critical patent/AU2002227451A1/en
Publication of WO2002039272A1 publication Critical patent/WO2002039272A1/en
Publication of WO2002039272A9 publication Critical patent/WO2002039272A9/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842Speculative instruction execution
    • G06F9/3846Speculative instruction execution using static prediction, e.g. branch taken strategy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482Procedural
    • G06F9/4484Executing subprograms
    • G06F9/4486Formation of subprogram jump address


Abstract

A method and apparatus for reducing latency in execution of branch instructions are provided. A branch instruction includes an opcode portion (122) and an address portion (128) that includes a displacement (124) and a code (126) that identifies a block in the instruction memory (22) in which the branch target instruction is located. During the fetch cycle in which the branch instruction is fetched, the displacement portion (124) of the branch instruction is reinserted into the address register (20) as the address of the next instruction to be fetched. The code (126) is used to ensure that the address register (20) is pointing to the correct block. As a result, during the next instruction fetch cycle, the target instruction is fetched for execution. Hence, the branch processing latency found in prior systems in which the next fetch cycle is skipped while the branch target address is computed, such as by adding an offset to the program counter (12) value, is eliminated.

Description

METHOD AND APPARATUS FOR REDUCING BRANCH LATENCY
Background of the Invention
Processing systems typically execute program instructions in multiple stages or cycles. During a fetch cycle, a memory address at which the instruction to be executed is stored is read from a program counter and written into an address register. The address in the address register is then used to access the instruction memory location, and the instruction is fetched from the instruction memory and loaded into an instruction register.
In general, the instruction is a multiple-bit word. It typically includes a multiple-bit opcode which identifies the instruction and memory address information. The memory address information can include, for example, a memory value which defines a location in memory at which an operand for the instruction is stored or a location at which the result of the instruction is to be stored when the instruction is complete.
After the instruction is fetched and loaded into the instruction register, a decode cycle is implemented. During the decode cycle, the instruction opcode is decoded to identify the instruction to be executed and to determine the processing steps required.
Depending on the instruction, the memory address information in the instruction can then be used to retrieve an operand, if one is required. When the decoding is complete, the instruction can be executed.
In the normal sequential flow of instruction execution, after an instruction is fetched, the program counter is incremented to point to the next location in the instruction memory. During the next fetch cycle, the next instruction in the program is then fetched for execution.
One very common type of instruction is a branch instruction. Branch instructions are used to alter the normal sequential flow of instruction execution. Branches are used, for example, to control instruction loops. When the last instruction in a loop is reached, if the condition that would terminate the loop is not satisfied, program flow must return to the top of the loop. In this case, a branch instruction is used to load the address of the top of the loop into the address register, rather than the address that would be obtained by incrementing the program counter. Branch instructions are also used to route program execution to separate procedure modules by loading address information pointing to the start of the procedure into the address register.
As is typical of most instructions, a branch instruction includes an opcode and address information. The address information typically takes the form of an offset value which defines the number of addresses that the program execution will jump in taking the branch. The offset value is typically a signed number which is added to the present instruction address during the decode cycle. Following this decode cycle, the sum is loaded into the address register such that the first instruction of the branch can be fetched during the next fetch cycle.
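As a concrete illustration of this conventional scheme, the following C sketch (written for this description, not taken from the patent; a 16-bit offset field is assumed purely for illustration) shows the branch target being formed by adding a sign-extended offset to the current program counter, which is the addition that falls in the decode cycle of a pipelined machine:

```c
/* Hedged sketch of the conventional approach: the branch target is the
 * current program counter plus a signed offset carried in the instruction.
 * A 16-bit offset field is assumed here only for illustration. */
#include <stdint.h>

static uint32_t conventional_branch_target(uint32_t pc, uint32_t instruction)
{
    int16_t offset = (int16_t)(instruction & 0xFFFFu); /* sign-extended offset field */
    return pc + (uint32_t)(int32_t)offset;             /* addition performed during decode */
}
```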
The efficiency of program execution can be enhanced by pipelining instruction execution. In pipelining, a first instruction is fetched during a first fetch cycle. Next, the decode cycle for the first instruction is performed simultaneously with the fetch cycle for the next instruction in the program. That is, while the first instruction is being decoded, the next instruction is being fetched. This approach can, in general, significantly increase program execution speed and efficiency.
In the case of branch instructions, the gains in efficiency realized by pipelining are compromised by the delay involved in computing the branch target address. During the decoding of the branch instruction, the offset value must be added to the present program counter address value. This addition requires a significant portion of the decode cycle.
Accordingly, the fetch cycle for the next instruction cannot be performed simultaneously, since the next instruction address has not yet been determined and loaded into the address register. Instead, the fetch cycle for the next instruction, i.e., the branch target instruction, cannot begin until after the decode cycle is complete. The condition described above is commonly referred to as branch latency. It would be desirable to eliminate this latency such that the address of the first instruction after the branch, i.e., the branch target instruction, can be loaded into the address register soon enough after the branch instruction is fetched such that the branch target instruction can be fetched during the next fetch cycle.
Summary of the Invention
The present invention is directed to an approach to eliminating this branch latency condition. In accordance with the invention, there is provided a method and apparatus for processing a branch instruction. The branch instruction includes an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction. The address portion of the branch instruction includes a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction. Execution branches to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.
The memory block identifying portion of the address portion of the branch instruction identifies the block in memory to which execution is to branch in response to the branch instruction, in the event that the branch is taken. In one embodiment, the block can be one of three possible blocks. One of the blocks can be the block that contains the branch instruction, in which case the branch target instruction is in the same block as the branch instruction. The other blocks can be the immediately preceding block or the immediately following block.
In one embodiment, the block identifier includes at least two bits capable of defining at least four codes used to identify blocks. One of the codes identifies the same block as the branch instruction. A second of the codes identifies the immediately following block, and a third of the codes identifies the immediately preceding block. A fourth code can be used to identify a branch within the same block in a particular direction.
The first code can then be used to identify a branch within the same block in the opposite direction. Therefore, in one example of this configuration, the first and second codes, e.g., 00 and 01, can be used to identify a forward branch, the former within the block and the latter into the next block. The third and fourth codes, e.g., 10 and 11, can be used to identify a backward branch, the former into the preceding block and the latter within the present block.
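One way to read this example encoding is as a small lookup from the two-bit code to a block adjustment and a branch direction. The C sketch below is only an interpretation of the example code values given here (00, 01, 10 and 11); the names and the table layout are illustrative, not prescribed by the text.

```c
/* Hedged interpretation of the example two-bit block codes:
 * each code selects which memory block holds the target and implies
 * a branch direction that can later be used for prediction. */
typedef struct {
    int block_delta;  /* -1 = preceding block, 0 = same block, +1 = following block */
    int backward;     /*  1 = backward branch, 0 = forward branch                   */
} block_code_info;

static const block_code_info BLOCK_CODES[4] = {
    /* 00 */ {  0, 0 },  /* forward branch within the same block     */
    /* 01 */ { +1, 0 },  /* forward branch into the following block  */
    /* 10 */ { -1, 1 },  /* backward branch into the preceding block */
    /* 11 */ {  0, 1 },  /* backward branch within the same block    */
};
```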
In one embodiment, the block identifying portion can be used in branch prediction, i.e., to predict whether the branch will be taken. In one embodiment, if the branch is backward, then it is predicted that the branch will be taken. If the branch is forward, then it is predicted that the branch will not be taken. Hence, in the illustration set forth above, if the first bit of the block identifying code is a 1, then a backward branch is called for, and it is predicted that the branch will be taken. On the other hand, if the first bit is a 0, then a forward branch is called for, and it is predicted that the branch will not be taken.
The approach of the invention substantially reduces or eliminates aspects of branch latency found in the prior art. The branch target address is generated directly from the displacement address information in the branch instruction without performing any time consuming arithmetic operations such as adding an address offset to the program counter value. The branch target address is applied directly back to the address register as a hardware function in an effectively immediate fashion as part of the fetch cycle for the branch instruction. As a result, the next cycle can be used to fetch the branch target instruction, resulting in no loss of cycles. This is in contrast to prior approaches in which the next fetch cycle had to be skipped because of the delay involved in computing the branch target address during the decode cycle. The invention therefore provides significantly improved efficiency over the prior art in the processing of branch instructions.
Brief Description of the Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration.
FIG. 2 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration in which a branch instruction is processed.
FIG. 3 contains a schematic diagram which illustrates the format of a conventional branch instruction.
FIG. 4 is a schematic functional block diagram which illustrates execution of instructions in a conventional configuration.
FIG. 5 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention which solves the branch latency problem in the configuration of FIG. 4.
FIG. 6 is a schematic block diagram illustrating the addresses and locations of a portion of an instruction memory.
FIG. 7 is a schematic diagram illustrating a format for a branch instruction word in accordance with the invention.
FIG. 8 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention in which the branch instruction identifies the block of instruction memory to which the execution is to branch.
FIG. 9 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during execution of program instructions in accordance with the method and apparatus for reducing branch latency of the present invention.
Detailed Description of Preferred Embodiments of the Invention
FIG. 1 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration. As shown in FIG. 1, during a first cycle, a first instruction is fetched from the instruction memory location addressed by the value PC stored in the program counter. During the next cycle, the instruction PC is decoded while the instruction identified by the incremented program counter value PC + 1 is fetched. During the third cycle, the instruction PC + 1 is decoded while the next instruction, identified by the program counter value PC + 2, is fetched. This process continues as instructions are fetched and decoded in the sequence controlled by incrementing the program counter during each cycle. Because of the pipelining configuration, instructions are processed efficiently, with one instruction being decoded and the next instruction being fetched simultaneously. In general, no instruction cycles are wasted.
FIG. 2 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during conventional execution of program instructions in a pipeline configuration in which a branch instruction is processed. As shown in the timing chart, a first instruction identified by the program counter value PC is fetched during the first cycle. In this case, the instruction is a branch instruction. The format of a branch instruction is illustrated schematically in FIG. 3. As shown in FIG. 3, the branch instruction includes an opcode portion and a memory displacement portion. The opcode defines the type of branch instruction, e.g., the conditions under which the branch is to be taken. The displacement portion defines the address to which program flow will proceed if the branch is taken. Typically, the displacement is a signed number which is added to the address of the branch instruction, i.e., the present value PC of the program counter. Accordingly, during the next cycle, the address that is to be loaded into the program counter to execute the branch is computed by adding the present address to the displacement, i.e., PC + Disp. Because the addition takes considerable time to complete, the address of the instruction at the beginning of the branch, i.e., the branch target instruction address, is not loaded into the program counter until late in the second cycle. As a result, the branch target instruction, referred to by the address value PC + Disp., is not fetched until the third cycle. Thus, a cycle is lost while the branch target instruction address is computed. This condition is commonly referred to as branch latency.
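The cost of this lost cycle can be made concrete with a rough cycle-count model of the two-stage fetch/decode pipeline described above. The sketch below is written for this discussion only; the instruction mix is invented and each branch is simply charged one bubble cycle while PC + Disp is computed.

```c
/* Rough cycle-count model of the behaviour shown in FIGS. 1 and 2:
 * sequential instructions overlap fetch and decode, but each conventional
 * branch costs one extra (bubble) cycle while the target address is added up.
 * The instruction mix below is hypothetical. */
#include <stdio.h>

int main(void)
{
    const int is_branch[] = { 0, 0, 1, 0, 0, 1, 0 };  /* 1 marks a branch instruction */
    const int n = (int)(sizeof is_branch / sizeof is_branch[0]);

    int cycles = n + 1;               /* fill plus drain of a two-stage pipeline */
    for (int i = 0; i < n; i++)
        if (is_branch[i])
            cycles += 1;              /* one bubble per branch, as in FIG. 2 */

    printf("%d instructions, %d cycles with conventional branch latency\n", n, cycles);
    return 0;
}
```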
FIG. 4 is a schematic functional block diagram which illustrates execution of instructions in a conventional configuration. As shown in FIG. 4, a processing system 10 for executing program instructions includes a program counter 12 which generates the addresses of the instructions to be executed in sequence. In this description, it will be assumed that instruction addresses are 32 bits long. It will be understood that other address sizes can be used and that the invention is applicable to other address sizes. The address is read from the program counter 12 and is routed to an incrementing module 14 and a summing module 26. The summer 26 adds the program counter value to a displacement routed from the instruction register 24 and applies the result to one input of a multiplexer 16. The address is also incremented by the incrementing module 14, and the incremented result is applied to another input of the MUX 16. One of the addresses applied to the MUX 16 is selected via the MUX select input S by a branch prediction module 18. If the branch selector 18 determines that a branch is to be taken, then the MUX input from the summer 26 is selected such that a branch target instruction address, generated in the summer 26 by adding the displacement to the present program counter value, is loaded into the address register 20. Otherwise, the incremented address is selected such that it is loaded into the address register 20.
The address loaded into the address register 20 is applied to the instruction memory 22 to access the next instruction to be executed. That instruction is read from the memory 22 into an instruction register 24. The instruction can then be passed on for decoding and further processing. The displacement portion of the instruction, if any, can be routed as shown back to the summer 26 to compute an address to which program flow will jump, such as, for example, when the fetched instruction is a branch instruction. As described above, this approach introduces branch latency because of the time involved in performing the sum operation 26.
FIG. 5 is a schematic functional block diagram which illustrates a solution to the branch latency problem described above. In FIG. 5, instead of adding the displacement portion of a branch instruction to the program counter value, the displacement is extracted directly from the instruction in the instruction memory 22 and is inserted directly into the address at the input to the MUX 16 as a replacement for its least significant bits (LSBs), in this particular illustration the sixteen LSBs, labeled 15:0. This is done as soon as the branch instruction is fetched from its memory location in the instruction memory 22, before the fetch cycle for the branch instruction terminates and the next fetch cycle begins. As a result, during the next succeeding cycle, the next instruction, which is the branch target instruction, can be fetched because its address is already present in the address register 20 before the succeeding fetch cycle begins. Hence, the branch instruction and the branch target instruction can be fetched in successive fetch cycles, with no loss of cycles. The branch latency described above is eliminated.
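A minimal behavioural model of this displacement-substitution step follows; it is a sketch written for this description (using the 32-bit address and 16-bit displacement widths of the illustration), not a statement of how the hardware is actually built.

```c
/* Hedged sketch of the FIG. 5 approach: the sixteen LSBs of the current
 * program counter value are simply replaced by the displacement field of the
 * just-fetched branch instruction; no addition is required, so the target
 * address is ready within the branch's own fetch cycle. */
#include <stdint.h>

static uint32_t branch_target_fig5(uint32_t pc, uint32_t branch_instruction)
{
    uint32_t displacement = branch_instruction & 0xFFFFu;  /* bits 15:0 of the instruction */
    return (pc & 0xFFFF0000u) | displacement;              /* keep bits 31:16 of the PC    */
}
```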
One particular drawback to this approach is illustrated in FIG. 6, which is a schematic block diagram illustrating the addresses and locations of a portion of a typical instruction memory 22. The memory 22 can be defined to be made up of multiple blocks 102, 104, 106. As shown in this particular illustrative example, each block has a group of locations with addresses ranging from 0000₁₆ to FFFF₁₆. Hence, the sixteen LSBs of each memory address define a location within a particular block consisting of 2¹⁶ locations.
During execution of a program, at any given time, the program counter is accessing an instruction stored at one of the locations in one of the blocks, block 104, for example.
When a branch instruction is encountered, in accordance with the approach described above, the 16-bit displacement portion of the instruction is placed in the next address in its 16 LSB positions. Execution then continues from one of the 2¹⁶ locations within block 104. A drawback to this situation is derived from the fact that the location to which the branch is made must be within the same block. Because of this, the size of a possible jump may be severely limited, depending on the current value in the program counter, i.e., the location from which the branch is taken. For example, if the program is currently executing near the end of a block when it encounters a branch instruction, a forward branch can only be made a small distance, i.e., a small number of locations. Likewise, if the program is currently executing near the beginning of a block, backward branches are extremely limited in possible distance. This situation places a constraint on the programming flexibility of the system.
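The constraint can be made concrete with a small worked example; the specific addresses below are hypothetical and chosen only to show that a branch taken near the top of a block cannot leave that block when only the sixteen LSBs are replaced.

```c
/* Worked example of the same-block limitation of the FIG. 5 scheme.
 * The program counter and displacement values are hypothetical. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t pc = 0x0004FFF0u;                       /* near the end of block 0x0004 */
    uint32_t target = (pc & 0xFFFF0000u) | 0x0010u;  /* 16-bit displacement of 0x0010 */

    /* The target is 0x00040010: still inside block 0x0004. No 16-bit
     * displacement can reach block 0x0005 or block 0x0003 from here. */
    printf("pc = %08X, target = %08X (block %04X)\n",
           (unsigned)pc, (unsigned)target, (unsigned)(target >> 16));
    return 0;
}
```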
To solve this problem, in one embodiment, the invention uses a portion of the displacement portion of a branch instruction to identify a block in instruction memory to which the branch should be made. FIG. 7 is a schematic diagram illustrating a format for a branch instruction word in accordance with the invention. The example of FIG. 7 uses a 32-bit instruction with a 16-bit displacement field. It will be understood that the invention is applicable to other sizes. Referring to FIG. 7, the instruction format 120 includes an opcode field 122 including bits 16 to 31 and an address or displacement field 128 including bits 0 to 15. The address field 128 is further divided into an address value field 124 including bits 0 to 13 and a block field 126 including bits 14 and 15. The two-bit block field defines whether the branch should take place to the same block as the present block (referred to as PC), to the immediately preceding block (referred to as PC - 1) or to the immediately succeeding block (referred to as PC + 1). The address value field 124 defines the address within the identified block from which the branch target instruction should be fetched.
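Under the bit assignments just described (opcode in bits 31:16, block field in bits 15:14, address value in bits 13:0 of a 32-bit instruction word), the fields could be unpacked as in the sketch below; the masks simply restate those widths and the helper names are invented for illustration.

```c
/* Field extraction for the FIG. 7 branch-instruction format.
 * Bit positions follow the description: opcode 31:16, block 15:14, address 13:0. */
#include <stdint.h>

static uint32_t opcode_field(uint32_t insn)  { return insn >> 16; }          /* bits 31:16 */
static uint32_t block_field(uint32_t insn)   { return (insn >> 14) & 0x3u; } /* bits 15:14 */
static uint32_t address_field(uint32_t insn) { return insn & 0x3FFFu; }      /* bits 13:0  */
```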
Hence, in this illustration, the block field 126 includes at least two bits capable of defining at least four codes used to identify blocks. One of the codes identifies the same block as the branch instruction. A second of the codes identifies the immediately following block, and a third of the codes identifies the immediately preceding block. A fourth code can be used to identify a branch within the same block in a particular direction. The first code can then be used to identify a branch within the same block in the opposite direction. Therefore, in one example of this configuration, the first and second codes, e.g., 00 and 01, can be used to identify a forward branch, the former within the block and the latter into the next block. The third and fourth codes, e.g., 10 and 11, can be used to identify a backward branch, the former into the preceding block and the latter within the present block. Hence, a 0 bit in position 15 can indicate a forward branch, and a 1 in position 15 can indicate a backward branch.
FIG. 8 is a schematic functional block diagram which illustrates execution of instructions in a configuration according to the invention in which the branch instruction identifies the block of instruction memory to which the execution is to branch. In this configuration, the 14 LSBs 13:0 are routed from the instruction memory 22 to three inputs of a four-input MUX 216. The remaining 18 bits 31:14 are taken from the program counter 12 and are combined with the 14 LSBs from the instruction memory 22. The 18 bits 31:14 are routed through an incrementing module 220, a decrementing module 222 and a direct path 223, and the resulting routed bits are combined with the 14 LSBs from the instruction memory 22 at the inputs to the MUX 216. The incrementing module 220 is used to generate an address that is used when the branch is to the next block in memory; the decrementing module 222 is used to generate an address that is used when the branch is to the immediately preceding block in memory; and the direct path 223 is used to generate an address when the branch is within the present block. The fourth input to the MUX 216 receives bits 31:0 directly from the program counter and is used where normal sequential program execution is being used.
The branch prediction module 18 is used to select which address is to be loaded into the address register 20. If no branch is to be taken, then bits 31:0 from the incrementing module 14 are selected. If a branch is to be taken into the next block, then the address that includes bits 31:14 from the incrementing module 220 is selected. If a branch is to be taken into the previous block, then the address that includes bits 31:14 from the decrementing module 222 is selected. If a branch is to be taken within the present block, then the address that includes bits 31:14 from the direct path 223 is selected.
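Functionally, this selection amounts to forming the upper eighteen bits of the next address from the incremented, decremented, or unchanged block number and concatenating the fourteen-bit address value from the instruction. The sketch below is a behavioural model written for this description, not the actual hardware; it reuses the example 00/01/10/11 code assignments given earlier.

```c
/* Behavioural model of the FIG. 8 next-address formation for a taken branch.
 * The 00/01/10/11 block-code assignments are the example values from the text. */
#include <stdint.h>

static uint32_t branch_target_fig8(uint32_t pc, uint32_t branch_instruction)
{
    uint32_t block_code = (branch_instruction >> 14) & 0x3u;  /* bits 15:14 */
    uint32_t addr_value = branch_instruction & 0x3FFFu;       /* bits 13:0  */
    uint32_t block      = pc >> 14;                           /* bits 31:14 of the PC */

    switch (block_code) {
    case 0x1u: block += 1; break;  /* 01: forward into the following block  (module 220) */
    case 0x2u: block -= 1; break;  /* 10: backward into the preceding block (module 222) */
    default:               break;  /* 00 or 11: within the present block    (path 223)   */
    }
    return (block << 14) | addr_value;
}
```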
The block identifying code added to the branch instruction in accordance with the invention can be used as an aid in branch prediction. As indicated above, a 0 bit in position 15 can indicate a forward branch, and a 1 in position 15 can indicate a backward branch. One common example approach to branch prediction is to take backward branches and not to take forward branches. Therefore, in accordance with the invention, if a 0 is in position 15, then the branch is not taken, and if a 1 is in position 15, then the branch is taken. Hence, in accordance with the embodiment of the invention shown in FIG. 8, the possible range of addresses to which a branch can be taken is increased over that of the embodiment of FIG. 5. Using this latter approach, the displacement portion of the address value can only specify 2¹⁴ possible addresses, as opposed to the 2¹⁶ possible addresses of the former approach of FIG. 5. However, using the approach of FIG. 8, 2¹⁴ addresses can be specified in each of three possible memory blocks in this example. Therefore, a large increase in possible branch distance and resulting programming flexibility are realized.
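Under this convention, the static prediction reduces to examining bit 15 of the branch instruction; a minimal sketch, assuming the example code assignments above:

```c
/* Static branch prediction from the block-identifying code: a 1 in bit
 * position 15 marks a backward branch (predicted taken), a 0 marks a
 * forward branch (predicted not taken). */
#include <stdint.h>

static int predict_branch_taken(uint32_t branch_instruction)
{
    return (int)((branch_instruction >> 15) & 0x1u);  /* 1 = backward = predict taken */
}
```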
FIG. 9 contains a timing chart which schematically illustrates the timing of fetch and decode cycles during execution of program instructions in accordance with the improvements of the present invention. As shown in FIG. 9, in accordance with the invention, a branch instruction and its associated branch target instruction can be fetched in successive cycles. The branch latency found in other conventional approaches is eliminated. It is noted that the invention can be implemented using an approach different than that described above. For example, the invention can be implemented immediately before execution commences as the instruction cache memory is loaded with instructions for execution, rather than altering the instructions themselves individually as the program is compiled or linked. In this latter approach, the block identifying field, which is the two-bit field of the exemplary embodiment described above, is added to the appropriate instruction cache memory locations as the instructions are loaded.
The embodiments described herein refer, for example, to 32-bit instructions, 16-bit address values with branch instructions, and a two-bit block identifying value. It will be understood that these numbers of bits may be different without departing from the scope of the invention.
While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. What is claimed is:

Claims

1. A method of processing a branch instruction, the branch instruction being one of a plurality of instructions in a program stored in a block of an instruction memory, the method comprising: providing the branch instruction with an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction; providing the address portion of the branch instruction with a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction; and branching to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.
2. The method of claim 1 wherein branching to the branch target instruction comprises generating a branch target address using the displacement portion of the address portion of the branch instruction and the memory block identifying portion of the address portion of the branch instruction.
3. The method of claim 2 wherein the branch target address is generated during a fetch cycle of the branch instruction.
4. The method of claim 1 wherein the memory block identifying portion of the address portion of the branch instruction identifies the block to which execution is to branch as being one of the memory block preceding the memory block of the branch instruction, the memory block following the memory block of the branch instruction and the memory block of the branch instruction.
5. The method of claim 1 further comprising predicting whether the branch is to be taken using the memory block identifying portion of the address portion of the branch instruction.
6. The method of claim 1 wherein the memory block identifying portion of the address portion of the branch instruction defines at least four codes used to predict whether the branch will be taken.
7. The method of claim 6 wherein the four codes include a first pair of codes for forward branches and a second pair of codes for backward branches.
8. The method of claim 7 wherein the first pair of codes includes a code for a forward branch to a next memory block and a code for a forward branch within the memory block of the branch instruction.
9. The method of claim 7 wherein the second pair of codes includes a code for a backward branch to a preceding memory block and a code for a backward branch within the memory block of the branch instruction.
10. An apparatus for processing a branch instruction, the branch instruction being one of a plurality of instructions in a program, the apparatus comprising: an instruction memory for storing instructions, the branch instruction being stored in one of a plurality of blocks of the instruction memory, the branch instruction having an opcode portion containing an opcode and an address portion containing address information related to an address of a branch target instruction to which execution of a program is to branch in response to the branch instruction, and the address portion of the branch instruction having a memory block identifying portion and a displacement portion, the memory block identifying portion identifying a block in the memory to which the execution is to branch in response to the branch instruction; and a processor for executing instructions stored in the instruction memory, the processor causing execution of the instructions to branch to the branch target instruction using the displacement portion of the address portion of the branch instruction as an address within the memory block identified by the memory block identifying portion of the address portion of the branch instruction.
11. The apparatus of claim 10 wherein the processor generates a branch target address using the displacement portion of the address portion of the branch instruction and the memory block identifying portion of the address portion of the branch instruction.
12. The apparatus of claim 11 wherein the branch target address is generated during a fetch cycle of the branch instruction.
13. The apparatus of claim 10 wherein the memory block identifying portion of the address portion of the branch instruction identifies the block to which execution is to branch as being one of the memory block preceding the memory block of the branch instruction, the memory block following the memory block of the branch instruction and the memory block of the branch instruction.
14. The apparatus of claim 10 wherein the processor predicts whether the branch is to be taken using the memory block identifying portion of the address portion of the branch instruction.
15. The apparatus of claim 10 wherein the memory block identifying portion of the address portion of the branch instruction defines at least four codes used to predict whether the branch will be taken.
16. The apparatus of claim 15 wherein the four codes include a first pair of codes for forward branches and a second pair of codes for backward branches.
17. The apparatus of claim 16 wherein the first pair of codes includes a code for a forward branch to a next memory block and a code for a forward branch within the memory block of the branch instruction.
18. The apparatus of claim 16 wherein the second pair of codes includes a code for a backward branch to a preceding memory block and a code for a backward branch within the memory block of the branch instruction.
PCT/US2001/049653 2000-11-10 2001-11-09 Method and apparatus for reducing branch latency WO2002039272A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002227451A AU2002227451A1 (en) 2000-11-10 2001-11-09 Method and apparatus for reducing branch latency

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71069900A 2000-11-10 2000-11-10
US09/710,699 2000-11-10

Publications (2)

Publication Number Publication Date
WO2002039272A1 true WO2002039272A1 (en) 2002-05-16
WO2002039272A9 WO2002039272A9 (en) 2003-09-04

Family

ID=24855140

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/049653 WO2002039272A1 (en) 2000-11-10 2001-11-09 Method and apparatus for reducing branch latency

Country Status (3)

Country Link
AU (1) AU2002227451A1 (en)
TW (1) TW559733B (en)
WO (1) WO2002039272A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9069938B2 (en) 2006-11-03 2015-06-30 Bluerisc, Inc. Securing microprocessors against information leakage and physical tampering
US9235393B2 (en) 2002-07-09 2016-01-12 Iii Holdings 2, Llc Statically speculative compilation and execution
US9569186B2 (en) 2003-10-29 2017-02-14 Iii Holdings 2, Llc Energy-focused re-compilation of executables and hardware mechanisms based on compiler-architecture interaction and compiler-inserted control
US9697000B2 (en) 2004-02-04 2017-07-04 Iii Holdings 2, Llc Energy-focused compiler-assisted branch prediction

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7996671B2 (en) 2003-11-17 2011-08-09 Bluerisc Inc. Security of program executables and microprocessors based on compiler-architecture interaction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5136967A (en) * 1989-12-07 1992-08-11 J. M. Voith Gmbh Means for cleaning doctor device for spreader
US6076158A (en) * 1990-06-29 2000-06-13 Digital Equipment Corporation Branch prediction in high-performance processor
US5666519A (en) * 1994-03-08 1997-09-09 Digital Equipment Corporation Method and apparatus for detecting and executing cross-domain calls in a computer system
US5608886A (en) * 1994-08-31 1997-03-04 Exponential Technology, Inc. Block-based branch prediction using a target finder array storing target sub-addresses

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9235393B2 (en) 2002-07-09 2016-01-12 Iii Holdings 2, Llc Statically speculative compilation and execution
US10101978B2 (en) 2002-07-09 2018-10-16 Iii Holdings 2, Llc Statically speculative compilation and execution
US9569186B2 (en) 2003-10-29 2017-02-14 Iii Holdings 2, Llc Energy-focused re-compilation of executables and hardware mechanisms based on compiler-architecture interaction and compiler-inserted control
US10248395B2 (en) 2003-10-29 2019-04-02 Iii Holdings 2, Llc Energy-focused re-compilation of executables and hardware mechanisms based on compiler-architecture interaction and compiler-inserted control
US9697000B2 (en) 2004-02-04 2017-07-04 Iii Holdings 2, Llc Energy-focused compiler-assisted branch prediction
US10268480B2 (en) 2004-02-04 2019-04-23 Iii Holdings 2, Llc Energy-focused compiler-assisted branch prediction
US9069938B2 (en) 2006-11-03 2015-06-30 Bluerisc, Inc. Securing microprocessors against information leakage and physical tampering
US9940445B2 (en) 2006-11-03 2018-04-10 Bluerisc, Inc. Securing microprocessors against information leakage and physical tampering
US10430565B2 (en) 2006-11-03 2019-10-01 Bluerisc, Inc. Securing microprocessors against information leakage and physical tampering
US11163857B2 (en) 2006-11-03 2021-11-02 Bluerisc, Inc. Securing microprocessors against information leakage and physical tampering

Also Published As

Publication number Publication date
TW559733B (en) 2003-11-01
AU2002227451A1 (en) 2002-05-21
WO2002039272A9 (en) 2003-09-04

Similar Documents

Publication Publication Date Title
EP1157329B1 (en) Methods and apparatus for branch prediction using hybrid history with index sharing
KR100395763B1 (en) A branch predictor for microprocessor having multiple processes
US4942520A (en) Method and apparatus for indexing, accessing and updating a memory
US5303355A (en) Pipelined data processor which conditionally executes a predetermined looping instruction in hardware
US4775927A (en) Processor including fetch operation for branch instruction with control tag
US5131086A (en) Method and system for executing pipelined three operand construct
US6105124A (en) Method and apparatus for merging binary translated basic blocks of instructions
US6687808B2 (en) Data processor using indirect register addressing
US20030065912A1 (en) Removing redundant information in hybrid branch prediction
EP0297943B1 (en) Microcode reading control system
US5146570A (en) System executing branch-with-execute instruction resulting in next successive instruction being execute while specified target instruction is prefetched for following execution
US7793078B2 (en) Multiple instruction set data processing system with conditional branch instructions of a first instruction set and a second instruction set sharing a same instruction encoding
US5295248A (en) Branch control circuit
US7003651B2 (en) Program counter (PC) relative addressing mode with fast displacement
WO2002039272A1 (en) Method and apparatus for reducing branch latency
US5142630A (en) System for calculating branch destination address based upon address mode bit in operand before executing an instruction which changes the address mode and branching
US4812970A (en) Microprogram control system
US5850542A (en) Microprocessor instruction hedge-fetching in a multiprediction branch environment
US7941651B1 (en) Method and apparatus for combining micro-operations to process immediate data
US7134000B2 (en) Methods and apparatus for instruction alignment including current instruction pointer logic responsive to instruction length information
US6182211B1 (en) Conditional branch control method
US20010037444A1 (en) Instruction buffering mechanism
JP2006053830A (en) Branch estimation apparatus and branch estimation method
US6115805A (en) Non-aligned double word fetch buffer
US6360310B1 (en) Apparatus and method for instruction cache access

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
COP Corrected version of pamphlet

Free format text: PAGES 1/6-6/6, DRAWINGS, REPLACED BY NEW PAGES 1/3-3/3; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP