US20010016903A1 - Software branch prediction filtering for a microprocessor - Google Patents
- Publication number
- US20010016903A1
- Authority
- US
- United States
- Prior art keywords
- branch
- branch prediction
- software
- instruction
- predict
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
- G06F9/3844—Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
- G06F9/3846—Speculative instruction execution using static prediction, e.g. branch taken strategy
Definitions
- the present invention relates generally to microprocessors, and more particularly, to branch prediction for a microprocessor.
- RISC microprocessors are well known. RISC microprocessors are characterized by a smaller number of instructions, which are relatively simple to decode, and by having all arithmetic/logic operations be performed register-to-register. RISC instructions are generally of only one length (e.g., 32-bit instructions). RISC instruction execution is of the direct hardwired type, as opposed to microcoding. There is a fixed instruction cycle time, and the instructions are defined to be relatively simple so that each instruction generally executes in one relatively short cycle.
- a RISC microprocessor typically includes an instruction for a conditional branch operation; i.e., if a certain condition is present, then branch to a given location. It is known that a relatively small number of branch operations cause most of the branch mispredictions. For example, it has been suggested that 80 percent of the branch mispredictions result from 20 percent of the branch instructions for a given processor. Other branch operations are relatively easy to predict. For example, if an array access is preceded by a check for a valid array access, that check is accomplished in a typical RISC microprocessor by executing multiple conditional branches, and these branches are generally easy to predict.
- Speed of execution is highly dependent on the sequentiality of the instruction stream executed by the microprocessor. Branches in the instruction stream disrupt the sequentiality of the instruction stream executed by the microprocessor and generate stalls while the prefetched instruction stream is flushed and a new instruction stream begun.
- the present invention provides software branch prediction filtering for a microprocessor.
- the present invention provides a cost-effective and high performance implementation of software branch prediction filtering executed on a microprocessor that performs branch operations.
- many easy-to-predict branches can be eliminated from a hardware-implemented branch prediction table thereby freeing up space in the branch prediction table that would otherwise be occupied by the easy-to-predict branches.
- easy-to-predict branches waste entries in a limited-size branch prediction table and, thus, are eliminated from the branch prediction table.
- This robust approach to software branch prediction filtering provides for improved branch prediction, which is desired in various environments, such as a JavaTM computing environment. For example, this method can be used for various instruction sets such as Sun Microsystems, Inc.'s UltraJavaTM instruction set.
- a method for software branch prediction filtering for a microprocessor includes determining whether a conditional branch operation is “easy” to predict and, if so, predicting whether to execute the branch operation based on software branch prediction.
- “hard”-to-predict branches are predicted using a hardware branch prediction (e.g., a limited size hardware branch prediction table).
- FIG. 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIG. 2 is a schematic block diagram showing the core of the processor.
- FIG. 3 shows a format of a branch instruction in accordance with one embodiment of the present invention.
- FIG. 4 is a block diagram of an implementation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- FIG. 5 is a flow diagram of the operation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- the present invention provides software branch prediction filtering for branch operations for a microprocessor.
- software branch prediction filtering uses hardware branch prediction only for “hard”-to-predict branches (branches for which the historical behavior of the branch is important in determining whether the branch will be taken this time, e.g., an if . . . then statement) and uses software branch prediction for “easy”-to-predict branches (branches for which the history is not important in determining whether the branch will be taken, e.g., a loop).
- the branch instruction can be used in a computing environment in which compiled programs include a significant number of branch operations, such as in a JavaTM computing environment or in a computing environment that is executing compiled “C” programs.
- branch mispredictions generally slow down JavaTM code executing on a typical microprocessor because of the time wasted fetching the branched-to instruction(s). Even with advanced compiler optimizations, it is difficult to eliminate all such branch mispredictions.
- Well-known Just-In-Time (JIT) JavaTM compilers that generate software branch predictions for a typical Reduced Instruction Set Computing (RISC) microprocessor are currently about 75% accurate. Current hardware branch prediction is more accurate, at about 85-93%. Hardware branch prediction is typically implemented using a hardware branch prediction table.
- because the hardware branch prediction table is limited in size (e.g., 512 entries), this approach is not desirable if there is a significant number of branches (e.g., more than 1000 branches), which can lead to aliasing effects (e.g., two different branches sharing the same entry will corrupt each other's prediction state).
- the present invention solves this problem by providing a branch instruction that includes a bit for indicating whether the branch is easy to predict or hard to predict in accordance with one embodiment. If the branch is hard to predict, then hardware branch prediction is used. Otherwise, software branch prediction is used. Thus, the more accurate hardware branch prediction is efficiently reserved for hard-to-predict branches. For example, a compiler can determine whether a branch is labeled as hard to predict or easy to predict (e.g., about 80% of the branches can be labeled easy to predict, and mechanisms may be added to update or modify these predictions based on mispredictions, as further discussed below).
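The filtering decision described above can be sketched in Python; the function and field names below are illustrative, not from the patent (note that FIG. 3 of the patent encodes "easy" as bit value 0, while plain booleans are used here for readability):

```python
def predict_branch(easy, sw_taken, pc, hw_table):
    """Select a prediction source for one conditional branch (sketch).

    easy: True if the compiler marked the branch easy to predict.
    sw_taken: the compiler's static prediction (True = taken).
    hw_table: dict mapping pc -> 2-bit counter state (0..3),
              standing in for the limited-size hardware table.
    """
    if easy:
        # Easy branch: trust the software prediction bit; the
        # hardware table is never consulted, so its entries are
        # conserved for hard-to-predict branches.
        return bool(sw_taken)
    # Hard branch: consult the hardware predictor; in a 2-bit
    # scheme a counter value >= 2 means predicted taken.
    counter = hw_table.get(pc, 1)  # weakly-not-taken default (assumption)
    return counter >= 2
```

With this split, only hard-to-predict branches ever allocate or update entries in `hw_table`, which is the filtering effect the patent describes.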
- In FIG. 1, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102 , a geometry decompressor 104 , two media processing units 110 and 112 , a shared data cache 106 , and several interface controllers.
- the interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die.
- the components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time.
- the interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120 .
- the illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller.
- the shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit.
- the data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown).
- the data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112 .
- the UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems.
- the UPA is a cache-coherent, processor-memory interconnect.
- the UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing.
- the UPA performs low latency memory accesses with high throughput paths to memory.
- the UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability.
- the UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect.
- the UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.
- the PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used.
- the PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
- Two media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously.
- the threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment.
- Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code.
- the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions.
- a typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.
- the illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
- Thread level parallelism is particularly useful for JavaTM applications which are bound to have multiple threads of execution.
- JavaTM methods including “suspend”, “resume”, “sleep”, and the like include effective support for threaded program code.
- JavaTM class libraries are thread-safe to promote parallelism.
- the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 112 is used by the current application.
- the compiler applies optimizations based on “on-the-fly” profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a “garbage collector” may be executed on a first media processing unit 110 , copying objects or gathering pointer information, while the application is executing on the other media processing unit 112 .
- the processor 100 shown in FIG. 1 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution.
- a limitation on the number of processors formed on a single die thus arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
- the media processing units 110 and 112 each include an instruction cache 210 , an instruction aligner 212 , an instruction buffer 214 , a pipeline control unit 226 , a split register file 216 , a plurality of execution units, and a load/store unit 218 .
- the media processing units 110 and 112 use a plurality of execution units for executing instructions.
- the execution units for a media processing unit 110 include three media functional units (MFU) 222 and one general functional unit (GFU) 220 .
- the media functional units 222 are multiple single-instruction-multiple-datapath (MSIMD) media functional units.
- Each of the media functional units 222 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare, and the like.
- the media functional units 222 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 222 has a separate and individual sub-instruction stream, but all three media functional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
- the general functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal squareroot operations, and many others.
- the general functional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction.
- the illustrative instruction cache 210 has a 16 Kbyte capacity and includes hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code.
- Software is used to indicate that the instruction storage is being modified when modifications occur.
- the 16K capacity is suitable for performing graphic loops, other multimedia tasks or processes, and general-purpose JavaTM code.
- Coherency is maintained by hardware that supports write-through, non-allocating caching.
- Self-modifying code is supported through explicit use of “store-to-instruction-space” instructions store2i.
- Software uses the store2i instruction to maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped on every single store operation issued by the media processing unit 110 .
- the pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units.
- the pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions.
- the pipeline control unit 226 maintains a scoreboard, generates stalls and bypass controls.
- the pipeline control unit 226 also generates traps and maintains special registers.
- Each media processing unit 110 and 112 includes a split register file 216 , a single logical register file including 128 thirty-two bit registers.
- the split register file 216 is split into a plurality of register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time.
- a separate register file segment 224 is allocated to each of the media functional units 222 and the general functional unit 220 .
- each register file segment 224 has 128 32-bit registers.
- the first 96 registers ( 0 - 95 ) in the register file segment 224 are global registers. All functional units can write to the 96 global registers.
- the global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224 .
- Registers 96 - 127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or “visible” to other functional units.
- the media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time.
- the operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly.
- a VLIW instruction word always includes one instruction that executes in the general functional unit (GFU) 220 and from zero to three instructions that execute in the media functional units (MFU) 222 .
- a MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
- Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory.
- the execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor.
- the media processing units 110 and 112 are high-performance but simplified with respect to both compilation and execution.
- the media processing units 110 and 112 are most generally classified as a simple 2-scalar execution engine with full bypassing and hardware interlocks on load operations.
- the instructions include loads, stores, arithmetic and logic (ALU) instructions, and branch instructions so that scheduling for the processor 100 is essentially equivalent to scheduling for a simple 2-scalar execution engine for each of the two media processing units 110 and 112 .
- the processor 100 supports full bypasses between the first two execution units within the media processing unit 110 and 112 and has a scoreboard in the general functional unit 220 for load operations so that the compiler does not need to handle nondeterministic latencies due to cache misses.
- the processor 100 scoreboards long latency operations that are executed in the general functional unit 220 , for example a reciprocal square-root operation, to simplify scheduling across execution units.
- the scoreboard (not shown) operates by tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the instruction is finished and the result becomes available.
- a VLIW instruction packet contains one GFU instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an incoming VLIW instruction packet are checked against the scoreboard.
- any true dependencies or output dependencies stall the entire packet until the result is ready.
- Use of a scoreboarded result as an operand causes instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the referencing instruction that provokes the stall executes on the general functional unit 220 or the first media functional unit 222 , then the stall only endures until the result is available for intra-unit bypass. For the case of a load instruction that hits in the data cache 106 , the stall may last only one cycle. If the referencing instruction is on the second or third media functional units 222 , then the stall endures until the result reaches the writeback stage in the pipeline where the result is bypassed in transmission to the split register file 216 .
- the scoreboard automatically manages load delays that occur during a load hit.
- all loads enter the scoreboard to simplify software scheduling and eliminate NOPs in the instruction stream.
- the scoreboard is used to manage most interlocks between the general functional unit 220 and the media functional units 222 . All loads and non-pipelined long-latency operations of the general functional unit 220 are scoreboarded. The long-latency operations include division idiv,fdiv instructions, reciprocal squareroot frecsqrt,precsqrt instructions, and power ppower instructions. None of the results of the media functional units 222 is scoreboarded. Non-scoreboarded results are available to subsequent operations on the functional unit that produces the results following the latency of the instruction.
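A minimal model of this scoreboarding policy (recording the destination registers of in-flight loads and non-pipelined long-latency operations, and stalling any packet with a dependence against a pending result) might look like the following; all names and the set-based representation are illustrative:

```python
class Scoreboard:
    """Toy scoreboard: tracks destination registers of in-flight
    scoreboarded operations (loads and long-latency GFU ops)."""

    def __init__(self):
        self.pending = set()  # destination registers awaiting results

    def issue(self, dest_reg):
        # A scoreboarded operation enters when it is issued.
        self.pending.add(dest_reg)

    def complete(self, dest_reg):
        # The entry is cleared when the result becomes available.
        self.pending.discard(dest_reg)

    def must_stall(self, packet):
        """Stall the whole VLIW packet on any true (read-after-write)
        or output (write-after-write) dependence against a pending
        result. `packet` is a list of (source_regs, dest_reg) pairs."""
        for srcs, dest in packet:
            if self.pending & set(srcs) or dest in self.pending:
                return True
        return False
```

For example, after issuing a load into register 5, any packet that reads or rewrites register 5 stalls until the load completes.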
- the illustrative processor 100 has a rendering rate of over fifty million triangles per second without accounting for operating system overhead. Therefore, data feeding specifications of the processor 100 are far beyond the capabilities of cost-effective memory systems.
- Sufficient data bandwidth is achieved by rendering of compressed geometry using the geometry decompressor 104 , an on-chip real-time geometry decompression engine.
- Data geometry is stored in main memory in a compressed format. At render time, the data geometry is fetched and decompressed in real-time on the integrated circuit of the processor 100 .
- the geometry decompressor 104 advantageously saves memory space and memory transfer bandwidth.
- the compressed geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between triangles, allowing the processor 100 to transform and light most vertices only once.
- the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the throughput for isolated triangles.
- multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining.
- operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time.
- high trip count loops are software-pipelined so that most media functional units 222 are fully utilized.
- FIG. 3 shows a format of a branch instruction in accordance with one embodiment of the present invention.
- the branch instruction 300 includes a bit 302 for indicating that the branch is easy to predict (e.g., 0) or hard to predict (e.g., 1).
- the branch instruction includes a bit 304 for indicating a software branch prediction that the branch is taken (e.g., 0) or not taken (e.g., 1).
- the software branch prediction loaded in bit 304 is used to predict the outcome of the branch if the branch is easy to predict (e.g., bit 302 is set to 0).
- Branch instruction 300 also includes opcode 306 , which corresponds to the opcode for a branch instruction, destination portion 308 , which sets forth the destination register (e.g., where the condition resides), and relative offset portion 310 , which sets forth the relative offset of the branch target when the branch is taken.
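The patent does not give field widths for branch instruction 300; the 32-bit layout below is purely illustrative, packing the easy/hard bit 302, the software-prediction bit 304, an opcode, a destination field, and a signed relative offset:

```python
# Illustrative 32-bit layout (all widths are assumptions):
# [31]    bit 302: 0 = easy to predict, 1 = hard to predict
# [30]    bit 304: 0 = predicted taken, 1 = predicted not taken
# [29:24] opcode (6 bits)
# [23:19] destination register (5 bits)
# [18:0]  signed relative offset (19 bits)

def encode_branch(easy, sw_pred, opcode, dest, offset):
    return ((easy << 31) | (sw_pred << 30) | (opcode << 24)
            | (dest << 19) | (offset & 0x7FFFF))

def decode_branch(word):
    offset = word & 0x7FFFF
    if offset & 0x40000:          # sign-extend the 19-bit offset
        offset -= 0x80000
    return {
        "easy": (word >> 31) & 1,
        "sw_pred": (word >> 30) & 1,
        "opcode": (word >> 24) & 0x3F,
        "dest": (word >> 19) & 0x1F,
        "offset": offset,
    }
```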
- branch instructions have 2 bits that the compiler can set to let the processor know (a) whether the branch is easy or hard to predict, and (b) whether the branch is predicted taken, which is a software branch prediction (e.g., determined by the compiler at compile time).
- when the microprocessor encounters an easy-to-predict branch, it simply uses the software branch prediction provided by the other bit.
- when the microprocessor encounters a hard-to-predict branch, it can use a simple hardware-based branch prediction or a more robust hardware-based branch prediction. In this way it is possible to dedicate a hardware-based branch prediction mechanism only to those branches that the software cannot predict very well. Measurements show that a 20-40 percent reduction in the number of mispredictions is achievable. Alternately, the prediction efficiency can be kept at the same level while the size of the branch prediction table is reduced.
- FIG. 4 is a block diagram of an implementation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- MPU 400 includes an instruction fetch unit 402 , which fetches instruction data from an instruction cache unit (see FIG. 1). Instruction fetch unit 402 is coupled to a branch prediction circuit 404 .
- Branch prediction circuit 404 includes a branch prediction table 406 , such as a conventional 512-entry branch prediction table. Instruction fetch unit 402 is also coupled to a decoder 408 , which decodes an instruction for execution by execution unit 410 .
- One of ordinary skill in the art will recognize that there are various ways to implement the circuitry and logic for performing the branch prediction operation in a microprocessor, such as a pipelined microprocessor.
- FIG. 5 is a flow diagram of the operation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- the operation of the branch instruction begins at stage of operation 502 .
- At stage 502, whether the branch is easy to predict is determined. If so, then software branch prediction is used to predict whether the branch is taken.
- At stage 504, whether the software branch prediction predicts that the branch is taken is determined. If so, then the branch is taken at stage 506 . Otherwise, the branch is not taken at stage 508 .
- If the branch is not easy to predict, a hardware branch prediction mechanism (e.g., the branch prediction circuit of FIG. 4) is used to determine whether the branch is predicted to be taken. If the branch is predicted taken by the hardware branch prediction circuit (e.g., branch prediction array (bpar)), then the branch is taken at stage 512 (e.g., the offset is added to the current program counter to provide a new address sequence to be fetched). Otherwise, the branch is not taken at stage 514 (e.g., the present instruction stream is continued in sequence).
- a branch misprediction by the software branch prediction causes a modification of the software branch prediction bit (e.g., toggles bit 304 of FIG. 3 using self-modifying code).
- a hardware branch misprediction causes a modification in the hardware branch prediction table (e.g., an entry in branch prediction table 406 of FIG. 4 is modified).
- the software branch prediction utilizes heuristics involving code analysis such as that set forth in Ball et al.
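Static heuristics in the Ball et al. family might be sketched as follows; the specific rules and thresholds are illustrative assumptions, not rules stated in this document:

```python
def label_branch(offset, data_dependent=False, guards_error_path=False):
    """Return (easy, predict_taken) for one branch using illustrative
    static heuristics:
    - data-dependent branches: hard to predict, defer to hardware
    - backward branches (negative offset): loop back-edges, predict taken
    - branches guarding rarely-executed error paths: predict not taken
    - remaining forward branches: predict not taken
    """
    if data_dependent:
        return (False, None)   # hard: the hardware predictor decides
    if offset < 0:
        return (True, True)    # loop back-edge: almost always taken
    if guards_error_path:
        return (True, False)   # error path: almost never taken
    return (True, False)
```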
- the hardware branch prediction utilizes the following branch prediction scheme:
- the displacement of an unconditional branch is treated as an offset and added to the program counter (not shown) to form the target address of the next instruction if the branch is taken.
- a more robust hardware branch prediction approach utilizes a branch prediction table (e.g., 512-entry branch prediction table) and associates a state machine to each branch. For example, a 2-bit counter is used to describe four states: strongly taken, likely taken, likely not taken, and strongly not taken.
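The four-state scheme can be modeled as a saturating counter; this is a standard construction, and the numeric state assignment below is an assumption:

```python
class TwoBitCounter:
    """2-bit saturating counter: 0 = strongly not taken,
    1 = likely not taken, 2 = likely taken, 3 = strongly taken."""

    def __init__(self, state=1):
        self.state = state

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the "strongly" states in either direction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

A consequence of the two "strongly" states is hysteresis: a single not-taken outcome does not flip a strongly-taken entry, so a loop-closing branch stays predicted taken across its one not-taken exit per loop.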
- the branch prediction table is implemented as a branch prediction array.
- a JIT compiler for JAVATM source code provides software branch prediction (e.g., sets bit 504 ) and indicates whether a compiled branch is easy to predict or hard to predict (e.g., sets bit 502 ).
- the software branch prediction filtering can reduce misprediction rates by about 25% and considering that about 20% of compiled JAVATM code can be branches, this embodiment provides a significant improvement.
- the present invention can also be applied to statically compiled C code or to static compilation of other computer programming languages. Also, this approach reduces the risk of polluting the hardware branch prediction table by conserving the hardware branch prediction table for hard-to-predict branches.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
- Stored Programmes (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
The present invention provides software branch prediction filtering for a microprocessor. In one embodiment, a method for software branch prediction filtering for a microprocessor includes determining whether a branch is “easy” to predict, and predicting the branch using software branch prediction if the branch is easy to predict. Otherwise (i.e., the branch is “hard” to predict), the branch is predicted using hardware branch prediction. Accordingly, more accurate but space-limited hardware branch prediction resources are conserved for hard-to-predict branches.
Description
- This application relates to application Ser. No. ______ (attorney docket number SP-2600 US), filed on even date herewith, entitled “A Multiple-Thread Processor For Threaded Software Applications” and naming Marc Tremblay and William Joy as inventors, the application being incorporated herein by reference in its entirety.
- The present invention relates generally to microprocessors, and more particularly, to branch prediction for a microprocessor.
- Reduced Instruction Set Computing (RISC) microprocessors are well known. RISC microprocessors are characterized by a smaller number of instructions, which are relatively simple to decode, and by having all arithmetic/logic operations be performed register-to-register. RISC instructions are generally of only one length (e.g., 32-bit instructions). RISC instruction execution is of the direct hardwired type, as opposed to microcoding. There is a fixed instruction cycle time, and the instructions are defined to be relatively simple so that each instruction generally executes in one relatively short cycle.
- A RISC microprocessor typically includes an instruction for a conditional branch operation: if a certain condition is present, then branch to a given location. It is known that a relatively small number of branch operations cause most of the branch mispredictions. For example, it has been suggested that 80 percent of the branch mispredictions result from 20 percent of the branch instructions for a given processor. Other branch operations are relatively easy to predict. For example, if an array access is preceded by a check for a valid array access, the check is accomplished in a typical RISC microprocessor by executing multiple conditional branches, and these branches are generally easy to predict.
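As an illustration of such an easy-to-predict branch, the validity check described above resolves the same way on every in-bounds access, so a single software prediction covers it. The sketch below is an illustrative model, not code from the patent:

```python
def check_branch_outcomes(indices, length):
    # Outcome of the "invalid access?" conditional branch for each access:
    # True means the branch is taken (the access is out of bounds).
    return [not (0 <= i < length) for i in indices]

# A loop that stays in bounds resolves the check branch the same way on
# every iteration, so a single static "not taken" prediction is almost
# always right -- no per-branch history is needed.
outcomes = check_branch_outcomes(range(100), 128)
assert not any(outcomes)
```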
- Speed of execution is highly dependent on the sequentiality of the instruction stream executed by the microprocessor. Branches in the instruction stream disrupt the sequentiality of the instruction stream executed by the microprocessor and generate stalls while the prefetched instruction stream is flushed and a new instruction stream begun.
- Accordingly, the present invention provides software branch prediction filtering for a microprocessor. For example, the present invention provides a cost-effective and high performance implementation of software branch prediction filtering executed on a microprocessor that performs branch operations. By providing the software branch prediction filtering, many easy-to-predict branches can be eliminated from a hardware-implemented branch prediction table thereby freeing up space in the branch prediction table that would otherwise be occupied by the easy-to-predict branches. In other words, easy-to-predict branches waste entries in a limited-size branch prediction table and, thus, are eliminated from the branch prediction table. This robust approach to software branch prediction filtering provides for improved branch prediction, which is desired in various environments, such as a Java™ computing environment. For example, this method can be used for various instruction sets such as Sun Microsystems, Inc.'s UltraJava™ instruction set.
- In one embodiment, a method for software branch prediction filtering for a microprocessor includes determining whether a conditional branch operation is “easy”-to-predict and predicting whether to execute the branch operation based on software branch prediction. However, “hard”-to-predict branches are predicted using a hardware branch prediction (e.g., a limited size hardware branch prediction table).
- Other aspects and advantages of the present invention will become apparent from the following detailed description and accompanying drawings.
- FIG. 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIG. 2 is a schematic block diagram showing the core of the processor.
- FIG. 3 shows a format of a branch instruction in accordance with one embodiment of the present invention.
- FIG. 4 is a block diagram of an implementation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- FIG. 5 is a flow diagram of the operation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
- The present invention provides software branch prediction filtering for branch operations for a microprocessor. In one embodiment, software branch prediction filtering uses hardware branch prediction only for “hard”-to-predict branches (a branch in which historical operation of the branch taken is important in determining whether the branch will be taken this time, e.g., an if . . . then statement) and uses software branch prediction for “easy”-to-predict branches (a branch in which the history is not important in determining whether the branch will be taken for this particular branch, e.g., a loop). For example, the branch instruction can be used in a computing environment in which compiled programs include a significant number of branch operations, such as in a Java™ computing environment or in a computing environment that is executing compiled “C” programs.
- For example, branch mispredictions generally slow down Java™ code executing on a typical microprocessor because of the time wasted fetching the branched-to instruction(s). Even with advanced compiler optimizations, it is difficult to eliminate all such branch mispredictions. Well-known Just-In-Time (JIT) Java™ compilers that generate software branch predictions for a typical Reduced Instruction Set Computing (RISC) microprocessor are currently about 75% accurate. Current hardware branch prediction is more accurate, at about 85-93%. Hardware branch prediction is typically implemented using a hardware branch prediction table. Because the hardware branch prediction table is limited in size (e.g., 512 entries), this approach is not desirable if there are a significant number of branches (e.g., more than 1000 branches), which can lead to aliasing effects (e.g., two different branches sharing the same entry will corrupt each other's prediction state).
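The aliasing effect can be made concrete with a small model. The direct-mapped indexing function below is an assumption for illustration; the text specifies only that the table has a limited number of entries (e.g., 512):

```python
TABLE_ENTRIES = 512  # limited-size hardware branch prediction table

def table_index(pc):
    # Hypothetical direct-mapped indexing: low-order bits of the
    # word-aligned program counter select the table entry.
    return (pc >> 2) % TABLE_ENTRIES

# Two distinct branches whose addresses differ by exactly 512 words map
# to the same entry, so each one corrupts the other's prediction state.
branch_a = 0x10000
branch_b = branch_a + TABLE_ENTRIES * 4
assert table_index(branch_a) == table_index(branch_b)
```

With more live branches than table entries, such collisions become common, which is why conserving the table for hard-to-predict branches pays off.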
- The present invention solves this problem by providing a branch instruction that includes a bit for indicating whether the branch is easy to predict or hard to predict in accordance with one embodiment. If the branch is hard to predict, then hardware branch prediction is used. Otherwise, software branch prediction is used. Thus, the more accurate hardware branch prediction is efficiently reserved for hard-to-predict branches. For example, a compiler can determine whether a branch is labeled as hard to predict or easy to predict (e.g., about 80% of the branches can be labeled easy to predict, and mechanisms may be added to update or modify these predictions based on mispredictions, as further discussed below).
- Referring to FIG. 1, a schematic block diagram illustrates a single integrated circuit chip implementation of a
processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die. The components are mutually linked and closely linked to the processor core with high-bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120. The illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller. The shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112. The data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112. - The UPA
controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems. The UPA is a cache-coherent, processor-memory interconnect. The UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing. The UPA performs low latency memory accesses with high throughput paths to memory. The UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability. The UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect. The UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs. - The
PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used. The PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras. - Two
media processing units 110 and 112 are included in the processor 100. The illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. A typical “general-purpose” processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time. The illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism. - Thread level parallelism is particularly useful for Java™ applications which are bound to have multiple threads of execution. Java™ methods including “suspend”, “resume”, “sleep”, and the like include effective support for threaded program code. In addition, Java™ class libraries are thread-safe to promote parallelism. Furthermore, the thread model of the
processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 112 is used by the current application. In the illustrative system, the compiler applies optimizations based on “on-the-fly” profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a “garbage collector” may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 112. - Although the
processor 100 shown in FIG. 1 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, in the processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors. - Referring to FIG. 2, a schematic block diagram shows the core of the
processor 100. The media processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218. In the illustrative processor 100, the execution units of media processing unit 110 include three media functional units (MFU) 222 and one general functional unit (GFU) 220. The media functional units 222 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 222 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-datapath capability for the processor 100 including add, multiply-add, shift, compare, and the like. The media functional units 222 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 222 has a separate and individual sub-instruction stream, but all three media functional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages. - The general
functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal squareroot operations, and many others. The generalfunctional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction. - The
illustrative instruction cache 210 has a 16 Kbyte capacity and includes hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code. Software is used to indicate that the instruction storage is being modified when modifications occur. The 16K capacity is suitable for performing graphic loops, other multimedia tasks or processes, and general-purpose Java™ code. Coherency is maintained by hardware that supports write-through, non-allocating caching. Self-modifying code is supported through explicit use of “store-to-instruction-space” instructions store2i. Software uses the store2i instruction to maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped on every single store operation issued by the media processing unit 110. - The
pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units. The pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions. The pipeline control unit 226 maintains a scoreboard and generates stalls and bypass controls. The pipeline control unit 226 also generates traps and maintains special registers. - Each
media processing unit 110, 112 includes a split register file 216 that is divided into register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time. A separate register file segment 224 is allocated to each of the media functional units 222 and the general functional unit 220. In the illustrative embodiment, each register file segment 224 has 128 32-bit registers. The first 96 registers (0-95) in the register file segment 224 are global registers. All functional units can write to the 96 global registers. The global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224. Registers 96-127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or “visible” to other functional units. - The
media processing units - Instructions are executed in-order in the
processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory. The execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor. - To avoid software scheduling errors, the
media processing units media processing units processor 100 is essentially equivalent to scheduling for a simple 2-scalar execution engine for each of the twomedia processing units - The
processor 100 supports full bypasses between the first two execution units within the media processing unit 110, 112 and has a scoreboard in the general functional unit 220 for load operations so that the compiler does not need to handle nondeterministic latencies due to cache misses. The processor 100 scoreboards long latency operations that are executed in the general functional unit 220, for example a reciprocal square-root operation, to simplify scheduling across execution units. The scoreboard (not shown) operates by tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the instruction is finished and the result becomes available. A VLIW instruction packet contains one GFU instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an incoming VLIW instruction packet are checked against the scoreboard. Any true dependencies or output dependencies stall the entire packet until the result is ready. Use of a scoreboarded result as an operand causes instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the referencing instruction that provokes the stall executes on the general functional unit 220 or the first media functional unit 222, then the stall only endures until the result is available for intra-unit bypass. For the case of a load instruction that hits in the data cache 106, the stall may last only one cycle. If the referencing instruction is on the second or third media functional units 222, then the stall endures until the result reaches the writeback stage in the pipeline where the result is bypassed in transmission to the split register file 216.
- The scoreboard is used to manage most interlocks between the general
functional unit 220 and the media functional units 222. All loads and non-pipelined long-latency operations of the general functional unit 220 are scoreboarded. The long-latency operations include division idiv, fdiv instructions, reciprocal square root frecsqrt, precsqrt instructions, and power ppower instructions. None of the results of the media functional units 222 is scoreboarded. Non-scoreboarded results are available to subsequent operations on the functional unit that produces the results following the latency of the instruction. - The
illustrative processor 100 has a rendering rate of over fifty million triangles per second without accounting for operating system overhead. Therefore, data feeding specifications of the processor 100 are far beyond the capabilities of cost-effective memory systems. Sufficient data bandwidth is achieved by rendering of compressed geometry using the geometry decompressor 104, an on-chip real-time geometry decompression engine. Data geometry is stored in main memory in a compressed format. At render time, the data geometry is fetched and decompressed in real-time on the integrated circuit of the processor 100. The geometry decompressor 104 advantageously saves memory space and memory transfer bandwidth. The compressed geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between triangles, allowing the processor 100 to transform and light most vertices only once. In a typical compressed mesh, the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the throughput for isolated triangles. For example, during processing of triangles, multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining. Thus operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time. For other types of applications with high instruction level parallelism, high trip count loops are software-pipelined so that most media functional units 222 are fully utilized. - FIG. 3 shows a format of a branch instruction in accordance with one embodiment of the present invention. The
branch instruction 300 includes a bit 302 for indicating that the branch is easy to predict (e.g., 0) or hard to predict (e.g., 1). The branch instruction includes a bit 304 for indicating a software branch prediction that the branch is taken (e.g., 0) or not taken (e.g., 1). The software branch prediction loaded in bit 304 is used to predict the outcome of the branch if the branch is easy to predict (e.g., bit 302 is set to 0). Branch instruction 300 also includes opcode 306, which corresponds to the opcode for a branch instruction, destination portion 308, which sets forth the destination register (e.g., where the condition resides), and relative offset portion 310, which sets forth the relative offset of the branch target when the branch is taken.
- Accordingly, software branch prediction filtering migrates some of the complexity associated with conditional branches to the compiler. It is observed that, for example: graphics code has few branches, or very predictable branches; JAVA applications have more unconditional branches than typical C or Fortran applications (mainly due to the extensive usage of jumps or calls); a dynamic compiler has better observability and has the capability to update software-controlled prediction bits; and software branch prediction with simple heuristics can predict branches successfully more than 75% of the time, or possibly even more than 83% for brute-force heuristics. See, e.g., Thomas Ball and James Larus, Branch Prediction for Free, Programming Languages Design & Implementation, 1993, New Mexico, pp. 300-312.
- Based on these observations, branch instructions have 2 bits that the compiler can set to let the processor know (a) whether the branch is easy or hard to predict, and (b) whether the branch is predicted taken, which is a software branch prediction (e.g., determined by the compiler at compile time). In this way, when the microprocessor encounters an easy-to-predict branch, it simply uses the software branch prediction provided by the other bit. On the other hand, when the microprocessor encounters a hard-to-predict branch, it can use a simple hardware-based branch prediction or a more robust hardware-based branch prediction. In this way it is possible to dedicate a hardware-based branch prediction mechanism only to those branches that the software cannot predict very well. Measurements show that a reduction of the number of mispredictions between 20-40 percent is achievable. Alternately, the prediction efficiency can be kept at the same level while the size of the branch prediction table is reduced.
- FIG. 4 is a block diagram of an implementation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention.
MPU 400 includes an instruction fetch unit 402, which fetches instruction data from an instruction cache unit (see FIG. 1). Instruction fetch unit 402 is coupled to a branch prediction circuit 404. Branch prediction circuit 404 includes a branch prediction table 406, such as a conventional 512-entry branch prediction table. Instruction fetch unit 402 is also coupled to a decoder 408, which decodes an instruction for execution by execution unit 410. One of ordinary skill in the art will recognize that there are various ways to implement the circuitry and logic for performing the branch prediction operation in a microprocessor, such as a pipelined microprocessor. - FIG. 5 is a flow diagram of the operation of the branch instruction of FIG. 3 in accordance with one embodiment of the present invention. The operation of the branch instruction begins at stage of
operation 502. At stage 502, whether the branch is easy to predict is determined. If so, then software branch prediction is used to predict whether the branch is taken. At stage 504, whether the software branch prediction predicts that the branch is taken is determined. If so, then the branch is taken at stage 506. Otherwise, the branch is not taken at stage 508.
- Otherwise (i.e., the branch is hard to predict), a hardware branch prediction mechanism (e.g., the branch prediction circuit of FIG. 4) is used to determine if the branch is predicted to be taken. If the branch is predicted taken by the hardware branch prediction circuit (e.g., branch prediction array (bpar)), then the branch is taken at stage 512 (e.g., the offset is added to the current program counter to provide a new address sequence to be fetched). Otherwise, the branch is not taken at stage 514 (e.g., the present instruction stream is continued in sequence).
- In one embodiment, a branch misprediction by the software branch prediction causes a modification of the software branch prediction bit (e.g., toggles bit 304 of FIG. 3 using self-modifying code). A hardware branch misprediction causes a modification in the hardware branch prediction table (e.g., an entry in branch prediction table 406 of FIG. 4 is modified).
bit 504 of FIG. 5 using self-modifying code). A hardware branch misprediction causes a modification in the hardware branch prediction table (e.g., an entry in branch prediction table 506 of FIG. 5 is modified). - In one embodiment, the software branch prediction utilizes heuristics involving code analysis such as that set forth in Ball et al.
- In one embodiment, the hardware branch prediction utilizes the following branch prediction scheme:
- if offset<0 (backward branch) then predict taken
- else (i.e., offset>0) (forward branch) predict not taken.
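This backward-taken/forward-not-taken rule is a one-line static heuristic; the sketch below takes only the offset sign convention from the text:

```python
def static_predict_taken(offset):
    # Backward branches (negative offset) typically close loops, so they
    # are predicted taken; forward branches are predicted not taken.
    return offset < 0

assert static_predict_taken(-16)      # loop back-edge: predict taken
assert not static_predict_taken(64)   # forward skip: predict not taken
```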
- The displacement of an unconditional branch is treated as an offset and added to the program counter (not shown) to form the target address of the next instruction if the branch is taken. Alternatively, a more robust hardware branch prediction approach utilizes a branch prediction table (e.g., 512-entry branch prediction table) and associates a state machine to each branch. For example, a 2-bit counter is used to describe four states: strongly taken, likely taken, likely not taken, and strongly not taken. The branch prediction table is implemented as a branch prediction array.
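The four-state counter described above can be modeled as follows; the numeric state encoding is an assumption, while the four states and the saturating increment/decrement behavior come from the text:

```python
class TwoBitCounter:
    # States: 0 = strongly not taken, 1 = likely not taken,
    #         2 = likely taken,       3 = strongly taken
    def __init__(self, state=2):
        self.state = state

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # Saturating increment/decrement on the branch's actual outcome.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

counter = TwoBitCounter()
assert counter.predict_taken()
counter.update(False)
counter.update(False)
assert not counter.predict_taken()  # two not-taken outcomes flip the prediction
counter.update(True)
assert not counter.predict_taken()  # hysteresis: one taken outcome is not enough
```

The two-bit hysteresis keeps a loop branch predicted taken across its single not-taken exit per trip, which a one-bit scheme would mispredict twice.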
- In one embodiment, a JIT compiler for JAVA™ source code provides software branch prediction (e.g., sets bit 304) and indicates whether a compiled branch is easy to predict or hard to predict (e.g., sets bit 302). The software branch prediction filtering can reduce misprediction rates by about 25%; considering that about 20% of compiled JAVA™ code can be branches, this embodiment provides a significant improvement. The present invention can also be applied to statically compiled C code or to static compilation of other computer programming languages. Also, this approach reduces the risk of polluting the hardware branch prediction table by conserving the hardware branch prediction table for hard-to-predict branches.
- Although particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects. For example, different approaches to software branch prediction and to hardware branch prediction can be used. Also, dynamic software branch prediction or dynamic hardware branch prediction (or both) can be utilized in accordance with one embodiment of the present invention. The present invention is not limited by any particular processor architecture, the presence or structure of caches or memory, or the number of bits in any register or memory location. Therefore, the appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.
Claims (16)
1. A process for software branch prediction filtering for a microprocessor, comprising:
determining whether the branch is easy to predict; and
predicting the branch using software branch prediction if the branch is easy to predict.
2. The process of claim 1, further comprising:
predicting the branch using hardware branch prediction if the branch is hard to predict.
3. The process of claim 2, further comprising:
checking a first bit of an instruction that indicates whether the branch is easy to predict or hard to predict.
4. The process of claim 3, further comprising:
checking a second bit of the instruction that indicates whether the branch is predicted taken or not taken by the software branch prediction.
5. The process of claim 4, further comprising:
modifying the second bit if the software branch prediction mispredicts the branch.
6. The process of claim 2, further comprising:
modifying a branch prediction table if the hardware branch prediction mispredicts the branch.
7. The process of claim 6, wherein the hardware branch prediction comprises incrementing and decrementing a counter based on a state machine.
8. The process of claim 6, wherein the software branch prediction comprises utilizing heuristics.
9. The process of claim 3, wherein the first bit is set by a compiler that compiled the instruction.
10. The process of claim 9, wherein the compiler is a Java™ Just-In-Time compiler.
11. An apparatus for software branch prediction filtering for a microprocessor, comprising:
branch prediction circuitry, the branch prediction circuitry comprising a branch prediction table; and
software branch prediction filtering logic coupled to the branch prediction circuitry, the software branch prediction filtering logic executing a branch instruction and determining whether the branch is easy to predict, and the software branch prediction filtering logic predicting the branch using the software branch prediction if the branch is easy to predict.
12. The apparatus of claim 11, wherein the software branch prediction filtering logic further comprises predicting the branch using the hardware branch prediction circuitry if the branch is hard to predict.
13. The apparatus of claim 12, wherein a first bit of the branch instruction provides an indication of whether the branch is easy to predict, and a second bit provides an indication of the software branch prediction.
14. The apparatus of claim 13, wherein the software branch prediction filtering logic further comprises modifying the second bit if the software branch prediction mispredicts the branch.
15. The apparatus of claim 14, wherein the hardware branch prediction circuitry comprises a 512-entry branch prediction table.
16. The apparatus of claim 15, wherein the branch instruction comprises a compiled Java™ instruction.
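Claim 15's 512-entry branch prediction table can be modeled as a direct-mapped array of 2-bit saturating counters. This is a minimal sketch under assumptions: the indexing scheme (low-order PC bits with the byte offset dropped), the initial counter state, and the class name are all illustrative choices, not details from the patent.

```python
# Hypothetical model of a 512-entry hardware branch prediction table:
# a direct-mapped array of 2-bit saturating counters indexed by
# low-order bits of the branch address.

TABLE_SIZE = 512

class BranchPredictionTable:
    def __init__(self):
        # Assumed initial state: every counter starts at 1
        # ("weakly not taken").
        self.counters = [1] * TABLE_SIZE

    def index(self, pc):
        # Drop the 2-bit byte offset, then wrap into the table.
        return (pc >> 2) % TABLE_SIZE

    def predict_taken(self, pc):
        # Counter states 2-3 predict taken, 0-1 predict not taken.
        return self.counters[self.index(pc)] >= 2

    def train(self, pc, taken):
        # Saturating update: never leave the 0..3 range.
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the table is direct-mapped, two branches whose addresses map to the same index share a counter; filtering easy branches into software prediction reduces how often such aliasing disturbs the counters.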
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/829,525 US6374351B2 (en) | 1998-12-03 | 2001-04-10 | Software branch prediction filtering for a microprocessor |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,792 US6341348B1 (en) | 1998-12-03 | 1998-12-03 | Software branch prediction filtering for a microprocessor |
US09/829,525 US6374351B2 (en) | 1998-12-03 | 2001-04-10 | Software branch prediction filtering for a microprocessor |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,792 Continuation US6341348B1 (en) | 1998-12-03 | 1998-12-03 | Software branch prediction filtering for a microprocessor |
Publications (2)
Publication Number | Publication Date |
---|---|
US20010016903A1 true US20010016903A1 (en) | 2001-08-23 |
US6374351B2 US6374351B2 (en) | 2002-04-16 |
Family
ID=22759451
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,792 Expired - Lifetime US6341348B1 (en) | 1998-12-03 | 1998-12-03 | Software branch prediction filtering for a microprocessor |
US09/829,525 Expired - Lifetime US6374351B2 (en) | 1998-12-03 | 2001-04-10 | Software branch prediction filtering for a microprocessor |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,792 Expired - Lifetime US6341348B1 (en) | 1998-12-03 | 1998-12-03 | Software branch prediction filtering for a microprocessor |
Country Status (2)
Country | Link |
---|---|
US (2) | US6341348B1 (en) |
WO (1) | WO2000033182A2 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050278517A1 (en) * | 2004-05-19 | 2005-12-15 | Kar-Lik Wong | Systems and methods for performing branch prediction in a variable length instruction set microprocessor |
US20060026408A1 (en) * | 2004-07-30 | 2006-02-02 | Dale Morris | Run-time updating of prediction hint instructions |
US7971042B2 (en) | 2005-09-28 | 2011-06-28 | Synopsys, Inc. | Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline |
US20120198428A1 (en) * | 2011-01-28 | 2012-08-02 | International Business Machines Corporation | Using Aliasing Information for Dynamic Binary Optimization |
US20140258682A1 (en) * | 2013-03-08 | 2014-09-11 | Advanced Digital Chips Inc. | Pipelined processor |
US20150363201A1 (en) * | 2014-06-11 | 2015-12-17 | International Business Machines Corporation | Predicting indirect branches using problem branch filtering and pattern cache |
US20220147360A1 (en) * | 2020-11-09 | 2022-05-12 | Centaur Technology, Inc. | Small branch predictor escape |
US11520588B2 (en) * | 2019-06-10 | 2022-12-06 | International Business Machines Corporation | Prefetch filter table for storing moderately-confident entries evicted from a history table |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3470948B2 (en) * | 1999-01-28 | 2003-11-25 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Dynamic compilation timing determination method, bytecode execution mode selection method, and computer |
US6662360B1 (en) * | 1999-09-27 | 2003-12-09 | International Business Machines Corporation | Method and system for software control of hardware branch prediction mechanism in a data processor |
US6766441B2 (en) * | 2001-01-19 | 2004-07-20 | International Business Machines Corporation | Prefetching instructions in mis-predicted path for low confidence branches |
EP1442363A1 (en) * | 2001-10-02 | 2004-08-04 | Koninklijke Philips Electronics N.V. | Speculative execution for java hardware accelerator |
US20040025153A1 (en) * | 2002-07-30 | 2004-02-05 | Johnson Teresa L. | System and method for software pipelining loops with multiple control flow paths |
US7290254B2 (en) * | 2003-03-25 | 2007-10-30 | Intel Corporation | Combining compilation and instruction set translation |
US7577824B2 (en) * | 2003-09-08 | 2009-08-18 | Altera Corporation | Methods and apparatus for storing expanded width instructions in a VLIW memory for deferred execution |
US20190303158A1 (en) * | 2018-03-29 | 2019-10-03 | Qualcomm Incorporated | Training and utilization of a neural branch predictor |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4370711A (en) | 1980-10-21 | 1983-01-25 | Control Data Corporation | Branch predictor using random access memory |
US4435756A (en) | 1981-12-03 | 1984-03-06 | Burroughs Corporation | Branch predicting computer |
US5564118A (en) * | 1992-11-12 | 1996-10-08 | Digital Equipment Corporation | Past-history filtered branch prediction |
TW261676B (en) | 1993-11-02 | 1995-11-01 | Motorola Inc | |
JP3599409B2 (en) | 1994-06-14 | 2004-12-08 | 株式会社ルネサステクノロジ | Branch prediction device |
US5655122A (en) * | 1995-04-05 | 1997-08-05 | Sequent Computer Systems, Inc. | Optimizing compiler with static prediction of branch probability, branch frequency and function frequency |
AU3666697A (en) * | 1996-08-20 | 1998-03-06 | Idea Corporation | A method for identifying hard-to-predict branches to enhance processor performance |
US5805876A (en) * | 1996-09-30 | 1998-09-08 | International Business Machines Corporation | Method and system for reducing average branch resolution time and effective misprediction penalty in a processor |
CN1153133C (en) * | 1996-12-09 | 2004-06-09 | 松下电器产业株式会社 | Information processing device using small-scale hardware for high branch prediction hit rates |
US5802602A (en) * | 1997-01-17 | 1998-09-01 | Intel Corporation | Method and apparatus for performing reads of related data from a set-associative cache memory |
US6115809A (en) * | 1998-04-30 | 2000-09-05 | Hewlett-Packard Company | Compiling strong and weak branching behavior instruction blocks to separate caches for dynamic and static prediction |
- 1998-12-03: US application US09/204,792 filed; granted as patent US6341348B1 (en), status: not active, Expired - Lifetime
- 1999-12-03: PCT application PCT/US1999/028876 filed; published as WO2000033182A2 (en), status: active Application Filing
- 2001-04-10: US application US09/829,525 filed; granted as patent US6374351B2 (en), status: not active, Expired - Lifetime
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9003422B2 (en) | 2004-05-19 | 2015-04-07 | Synopsys, Inc. | Microprocessor architecture having extendible logic |
US20050278517A1 (en) * | 2004-05-19 | 2005-12-15 | Kar-Lik Wong | Systems and methods for performing branch prediction in a variable length instruction set microprocessor |
US8719837B2 (en) | 2004-05-19 | 2014-05-06 | Synopsys, Inc. | Microprocessor architecture having extendible logic |
US20060026408A1 (en) * | 2004-07-30 | 2006-02-02 | Dale Morris | Run-time updating of prediction hint instructions |
GB2416885A (en) * | 2004-07-30 | 2006-02-08 | Hewlett Packard Development Co | Updating branch instruction hints during program execution |
GB2416885B (en) * | 2004-07-30 | 2009-03-04 | Hewlett Packard Development Co | Run-Time updating of prediction hint instructions |
US8443171B2 (en) | 2004-07-30 | 2013-05-14 | Hewlett-Packard Development Company, L.P. | Run-time updating of prediction hint instructions |
US7971042B2 (en) | 2005-09-28 | 2011-06-28 | Synopsys, Inc. | Microprocessor system and method for instruction-initiated recording and execution of instruction sequences in a dynamically decoupleable extended instruction pipeline |
US9495136B2 (en) * | 2011-01-28 | 2016-11-15 | International Business Machines Corporation | Using aliasing information for dynamic binary optimization |
US20120198428A1 (en) * | 2011-01-28 | 2012-08-02 | International Business Machines Corporation | Using Aliasing Information for Dynamic Binary Optimization |
US20140258682A1 (en) * | 2013-03-08 | 2014-09-11 | Advanced Digital Chips Inc. | Pipelined processor |
US9454376B2 (en) * | 2013-03-08 | 2016-09-27 | Advanced Digital Chips Inc. | Pipelined processor |
US20150363201A1 (en) * | 2014-06-11 | 2015-12-17 | International Business Machines Corporation | Predicting indirect branches using problem branch filtering and pattern cache |
US10795683B2 (en) * | 2014-06-11 | 2020-10-06 | International Business Machines Corporation | Predicting indirect branches using problem branch filtering and pattern cache |
US11520588B2 (en) * | 2019-06-10 | 2022-12-06 | International Business Machines Corporation | Prefetch filter table for storing moderately-confident entries evicted from a history table |
US20220147360A1 (en) * | 2020-11-09 | 2022-05-12 | Centaur Technology, Inc. | Small branch predictor escape |
US11614944B2 (en) * | 2020-11-09 | 2023-03-28 | Centaur Technology, Inc. | Small branch predictor escape |
Also Published As
Publication number | Publication date |
---|---|
WO2000033182A9 (en) | 2002-08-29 |
US6374351B2 (en) | 2002-04-16 |
WO2000033182A3 (en) | 2000-10-19 |
WO2000033182A2 (en) | 2000-06-08 |
US6341348B1 (en) | 2002-01-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1137984B1 (en) | A multiple-thread processor for threaded software applications | |
US7490228B2 (en) | Processor with register dirty bit tracking for efficient context switch | |
US6279100B1 (en) | Local stall control method and structure in a microprocessor | |
US7114056B2 (en) | Local and global register partitioning in a VLIW processor | |
US6343348B1 (en) | Apparatus and method for optimizing die utilization and speed performance by register file splitting | |
US7254697B2 (en) | Method and apparatus for dynamic modification of microprocessor instruction group at dispatch | |
WO2000033183A9 (en) | Method and structure for local stall control in a microprocessor | |
US6671796B1 (en) | Converting an arbitrary fixed point value to a floating point value | |
US6374351B2 (en) | Software branch prediction filtering for a microprocessor | |
EP1230591B1 (en) | Decompression bit processing with a general purpose alignment tool | |
WO2011141337A1 (en) | Hardware assist thread | |
JP2004171573A (en) | Coprocessor extension architecture built by using novel splint-instruction transaction model | |
US20010042187A1 (en) | Variable issue-width vliw processor | |
US7117342B2 (en) | Implicitly derived register specifiers in a processor | |
US6615338B1 (en) | Clustered architecture in a VLIW processor | |
JP2003526155A (en) | Processing architecture with the ability to check array boundaries | |
US6625634B1 (en) | Efficient implementation of multiprecision arithmetic | |
CN113853584A (en) | Variable delay instructions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
CC | Certificate of correction | |
FPAY | Fee payment | Year of fee payment: 4 |
FPAY | Fee payment | Year of fee payment: 8 |
FPAY | Fee payment | Year of fee payment: 12 |