US6959435B2 - Compiler-directed speculative approach to resolve performance-degrading long latency events in an application

Info

Publication number: US6959435B2
Application number: US09/968,261
Other versions: US20030074653A1 (published 2003-04-17)
Authority: US (United States)
Prior art keywords: instruction, instructions, performance, degrading, execution
Inventors: Dz-ching Ju, Youfeng Wu
Assignee: Intel Corporation
Priority/filing date: 2001-09-28
Grant publication date: 2005-10-25
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/44 Encoding
    • G06F 8/445 Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G06F 8/4451 Avoiding pipeline stalls

Abstract

A compiler-directed speculative approach to resolve performance-degrading long latency events in an application is described. One or more performance-degrading instructions are identified from multiple instructions to be executed in a program. A set of instructions prefetching the performance-degrading instruction is defined within the program. Finally, at least one speculative bit of each instruction of the identified set of instructions is marked to indicate a predetermined execution of the instruction.

Description

FIELD OF THE INVENTION
The present invention relates generally to computer systems and, more particularly, to a compiler-directed speculative approach to resolve performance-degrading long latency events in an application.
BACKGROUND OF THE INVENTION
The performance of a computer program is usually difficult to characterize. Programs do not perform uniformly well or uniformly poorly. Rather, programs have stretches of adequate performance punctuated by performance-degrading events. The overall observed performance of a specific program depends on the frequency of such events and their relationship to one another and to the rest of the program.
Program performance is measured by retirement throughput. Since retirement throughput is sequential, the presence of a performance-degrading event, such as a long latency instruction, blocks retirement and degrades performance. Some examples of performance-degrading long latency instructions include branch mispredictions and instruction and data cache misses.
Several solutions have been proposed to reduce the frequency and observed latency of these performance-degrading events. For example, one solution focuses on running a subset of the instructions that feed the performance-degrading events ahead of the general execution of the program in order to resolve those events, by detecting the outcomes of branches and prefetching the needed data into the cache. This approach can improve performance only if one can identify a small subset of the program that can be issued sufficiently early to resolve the events with enough accuracy. This approach also requires additional hardware, for example a separate pipeline that would allow the identified subset to run ahead. However, identification of a minimal program subset with maximum accuracy requires sophisticated program analysis, and the hardware is typically constrained by a limited program scope and the simplicity of attainable analysis.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and is not intended to be limited by the figures of the accompanying drawings, in which like references indicate similar elements and in which:
FIG. 1 illustrates a code region in a program, which includes at least one performance-degrading instruction.
FIG. 2 is a flow diagram of one embodiment of a method to resolve performance-degrading long latency events.
FIG. 3 is a block diagram of a processing system in accordance with one embodiment of the invention.
FIG. 4 is a detailed block diagram of the processing system.
DETAILED DESCRIPTION
A compiler-directed speculative approach to resolve performance-degrading long latency events in an application is described. In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate similar elements, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, functional, and other changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work. An algorithm is here, and generally, conceived to be a self-consistent sequence of processing blocks leading to a desired result. The processing blocks are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, it is appreciated that throughout the present invention discussions utilizing terms such as “processing,” or “computing,” or “calculating,” or “determining,” or “displaying,” or the like, may refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the system's registers or memories or other such information storage, transmission, or display devices.
It is to be understood that embodiments of this invention may be used as or to support software programs executed upon some form of processing core (such as the CPU of a computer) or otherwise implemented or realized upon or within a machine or computer readable medium. A machine readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine readable medium includes read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); or any other type of media suitable for storing or transmitting information. While embodiments of the present invention will be described with reference to the Internet and the World Wide Web, the system and method described herein are equally applicable to other network infrastructures or other data communication systems.
The method and system of the present invention provide a compiler-directed speculative approach to resolve performance-degrading long latency events in an application. In one embodiment, at least one performance-degrading instruction is identified from multiple instructions to be executed in a program. A set of instructions preceding the performance-degrading instruction is defined within the program. Finally, at least one speculative bit of each instruction of the identified set of instructions is marked to indicate a predetermined execution of the instruction.
FIG. 1 illustrates a code region in a program, which includes at least one performance-degrading instruction. As shown in FIG. 1, a performance-degrading instruction (I7) 103 within the code region 100 is preceded by a set of instructions identified as a backward slice 110, which contains multiple instructions (I1-I6) 102. The backward slice 110 of the instruction 103 is the set of instructions that affects whether the instruction 103 will be executed and, if so, what value and side effects it will generate. A backward slice may or may not extend beyond a function boundary, depending on where the slice boundary point (the point at which the slice and the main instruction stream implicitly synchronize) is set. In one embodiment, the execution of the backward slice is identified as the "speculative execution" and the execution of the normal program is identified as the "main execution."
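For illustration only, the following Python sketch (not part of the patent) computes such a backward slice for the straight-line code region of FIG. 1, which Table 1(a) below spells out. The Instr representation, the single coarse memory location "mem" used to model the possible store/load alias, and the helper names are assumptions of this sketch, not the patent's implementation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Instr:
        name: str            # e.g. "I7"
        op: str              # "ld", "st", "add", ...
        defs: tuple = ()     # values written; "mem" coarsely models memory
        uses: tuple = ()     # values read

    def backward_slice(instrs, target):
        """Backward slice of target in a straight-line region: chase every
        used value back to its most recent definition. Because memory is a
        single coarse location "mem", each load conservatively depends on
        the latest preceding store, the alias assumption behind Table 1(b)."""
        slice_set, worklist = {target}, [target]
        while worklist:
            cur = worklist.pop()
            pos = instrs.index(cur)
            for val in cur.uses:
                for prev in reversed(instrs[:pos]):
                    if val in prev.defs:
                        if prev not in slice_set:
                            slice_set.add(prev)
                            worklist.append(prev)
                        break
        return slice_set

    # The code region of Table 1(a); "mem" expresses that the store at I4
    # may alias the loads at I5 and I7.
    region = [
        Instr("I1", "ld",  defs=("e",),   uses=("d", "mem")),
        Instr("I2", "add", defs=("a",),   uses=("c", "d")),
        Instr("I3", "add", defs=("b",),   uses=("e",)),
        Instr("I4", "st",  defs=("mem",), uses=("a", "b")),
        Instr("I5", "ld",  defs=("y",),   uses=("x", "mem")),
        Instr("I6", "add", defs=("z",),   uses=("y",)),
        Instr("I7", "ld",  defs=(),       uses=("z", "mem")),
    ]
    I7 = region[-1]
    print(sorted(i.name for i in backward_slice(region, I7)))
    # prints ['I1', 'I2', ..., 'I7']: the whole function, as in Table 1(b)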
In one embodiment, the code region 100 further includes a launch point 101, which is the instruction that launches the execution of the backward slice 110, and a termination point 104, which is the instruction that terminates the execution of the backward slice 110. The performance-degrading instruction 103 that causes frequent branch mispredictions or cache misses is identified through one of many known profile feedback or heuristic methods.
FIG. 2 is a flow diagram of one embodiment of a method to resolve performance-degrading long latency events. As illustrated in FIG. 2, at processing block 210, a performance-degrading instruction 103 is identified using one of many known identification methods.
At processing block 220, the backward slice 110 corresponding to the performance-degrading instruction 103 is formed. In one embodiment, a compiler can apply one of many known inter-procedural methods to form backward slices across the function boundary. For example, each backward slice 110 has a set of live-in variables and a set of live-in memory locations. In order to form backward slices across the function boundary, each backward slice remembers function parameters and the memory live-in locations that it depends on when it reaches the function entry point. When the caller function is compiled, the backward slice 110 is extended at the call site along the instructions that define the function parameters and the memory live-in locations.
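Continuing the sketch above (again illustrative, not the patent's implementation), the live-ins of a slice are simply the values it reads before defining them; these are what would be stored at the function entry and resolved at each call site.

    def live_ins(slice_instrs):
        """Values read before being defined inside the slice; "mem" is
        excluded here because the coarse memory location stands in for the
        memory live-in locations, which are tracked separately."""
        defined, result = set(), set()
        for ins in sorted(slice_instrs, key=lambda i: i.name):  # program order
            result.update(u for u in ins.uses if u not in defined and u != "mem")
            defined.update(ins.defs)
        return result

    # For the full slice of Table 1(b), the parameters x and d plus the
    # value c must be ready, or supplied by a caller-side extension.
    print(sorted(live_ins(backward_slice(region, I7))))   # ['c', 'd', 'x']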
At processing block 230, a decision is made whether the size of the backward slice 110 is small enough to allow it to be pre-executed sufficiently early to resolve the performance-degrading instruction 103. If the size of the backward slice 110 is sufficiently small, then the process jumps to processing block 250. Otherwise, at processing block 240, the size of the backward slice 110 is reduced according to known speculation and prediction techniques that are described in detail below.
One example of a code region 100, according to the present invention, is illustrated in Table 1.
TABLE 1
(a) The original program
foo(x, d)
{
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5: ld y = [x]
I6: add z = y, 8
I7: ld = [z]
}
(b) Backward slice for I7
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5: ld y = [x]
I6: add z = y, 8
I7: ld = [z]
(c) Reduced backward slice for I7
I5: ld y = [x]
I6: add z = y, 8
I7: ld = [z]
(d) Launch and termination of backward slice
foo(x, d)
{
? launch I5
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5:* ld y = [x]
I6:* add z = y, 8
I7:* ld = [z]
? terminate
}
Table 1(a) shows the original code region 100. It is assumed that instruction I7 causes frequent data cache misses and is a performance-degrading instruction. Also, it is assumed that the store instruction at I4 may alias with the load instruction at I5. According to the above backward slice definition having a slice boundary point at the function entry, all of the instructions in this function, I1 through I7, are part of the backward slice for I7, as shown in Table 1(b). It is often the case that a backward slice based on the traditional definition is a significant part of the main instruction stream, and hence the slice may not be pre-executed sufficiently early to resolve the performance-degrading instruction I7. In one embodiment, the marker symbol “*” shown in Table 1(d) is used to indicate that the instruction is executed in both the speculative and main executions. The marker symbol “?” is used to indicate that the instruction is executed only in the speculative execution. Further, an instruction with no marker symbol is executed only in the main execution.
One example of a speculation technique is memory speculation. If the chance that the store instruction I4 aliases with the load instruction I5 or with the performance-degrading instruction I7 is small, the instruction I4 could be excluded at the time of the backward slice formation for the instruction I7. At the same time, instructions I1 through I3 can also be excluded from the slice, as shown in Table 1(c). With the reduced backward slice shown in Table 1(c), a compiler may invoke the backward slice at an early launch point, for example at the beginning of the function, as illustrated in Table 1(d). A speculative thread can skip other instructions and execute only the instructions on the backward slice ahead of the general execution to prefetch the data required for the performance-degrading instruction I7 into a data cache.
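A minimal sketch of this pruning, continuing the Python illustration above; the 5% alias-probability cutoff and the alias_prob profile map are assumptions, not values from the patent.

    def prune_unlikely_aliases(instrs, target, alias_prob, cutoff=0.05):
        """Memory speculation: a store whose profiled chance of aliasing
        the slice's loads is below the cutoff stops defining "mem", so it
        (and anything feeding only it) drops out of the recomputed slice."""
        speculated = []
        for ins in instrs:
            if ins.op == "st" and alias_prob.get(ins.name, 1.0) < cutoff:
                # Speculate no alias: the store no longer feeds later loads.
                ins = Instr(ins.name, ins.op, defs=(), uses=ins.uses)
            speculated.append(ins)
        return backward_slice(speculated, speculated[instrs.index(target)])

    reduced = prune_unlikely_aliases(region, I7, alias_prob={"I4": 0.01})
    print(sorted(i.name for i in reduced))   # ['I5', 'I6', 'I7'], as in Table 1(c)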
Another example of a speculation technique is data speculation. Data speculation can be utilized to make a copy of the load instructions and their uses early in a backward slice as advanced loads. The recovery code for each advanced load instruction can be used to correct any misspeculation in the speculative execution. The advanced load and check will be marked by "?" to be executed only by the speculative execution. Table 2 shows an example of the application of data speculation to the reduction of the backward slice.
TABLE 2
(a) The original program
foo(x, d)
{
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5: ld y = [x]
I6: ld = [y]
}
(b) Launch and termination of backward slice using data speculation
foo(x, d)
{
? launch I4
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4:? ld.a y = [x]
I5:? ld = [y]
I6:* st [b] = a
I7:? ld.c y = [x]
I8:? ld = [y]
I9:? terminate
I10: ld y = [x]
I11: ld = [y]
}
Table 2(a) shows the original code region 100. In Table 2(b), the instruction I4 performs an advanced load by ignoring the memory dependence on the store instruction at I6. The instruction I5 then loads the value from the memory location y. Once the speculative thread performs the critical load, it then executes the store instruction and checks whether it is necessary to reload the value from the memory location y.
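The rewrite can be pictured as a small source-to-source transformation. The sketch below is illustrative only: it uses a toy list-of-strings representation and hoists just the load itself, whereas Table 2(b) also hoists the load's dependent use.

    def apply_data_speculation(code, load_idx, store_idx):
        """Hoist the load at load_idx above the possibly-aliasing store at
        store_idx as an advanced load (ld.a), and plant a check load (ld.c)
        after the store; the check re-executes the load if the store
        actually overlapped it (the recovery discussed above)."""
        load = code[load_idx]
        advanced = load.replace("ld ", "ld.a ", 1)
        check = load.replace("ld ", "ld.c ", 1)
        return (code[:store_idx] + [advanced]        # speculative early load
                + [code[store_idx]] + [check]        # store, then verify
                + code[store_idx + 1:load_idx] + code[load_idx + 1:])

    print(apply_data_speculation(["st [b] = a", "ld y = [x]", "ld = [y]"],
                                 load_idx=1, store_idx=0))
    # ['ld.a y = [x]', 'st [b] = a', 'ld.c y = [x]', 'ld = [y]']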
A further example of a speculation technique is value speculation. Value speculation may specify the most likely value for an instruction so as to break the dependence of the backward slice on earlier instructions. The assignment of the special value will be marked by "?" to be executed only by the speculative execution. Table 3 shows an example of the application of value speculation to the reduction of the backward slice.
TABLE 3
(a) The original program
foo(x, d)
{
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5: ld y = [x]
I6: add z = y, 8
I7: ld = [z]
}
(b) Launch and termination of backward slice using value speculation
foo(x, d)
{
? launch I7
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: st [b] = a
I5: ld y = [x]
I6: add z = y, 8
I7:? add z = 0x10000000, 8
I8:* ld = [z]
? terminate
}
Table 3(a) shows the original code region 100. It is assumed that through a hardware or software mechanism, the compiler predicts that the value being loaded at the load instruction I5 is frequently 0x10000000. Hence, in the speculative thread, the compiler-generated code can quickly generate the address using the predicted value to load from a memory location z.
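As a sketch of this transformation (illustrative only: the value-profile table is assumed, and the predicted value is materialized with an explicit assignment rather than folded into the add as Table 3(b) shows):

    PREDICTED = {"I5": 0x10000000}   # assumed profile: [x] usually holds 0x10000000

    def value_speculate(slice_code):
        """In the speculative copy of the slice, replace each predicted load
        by an assignment of its most likely value, breaking the dependence
        on the instructions that compute the loaded address."""
        out = []
        for name, text in slice_code:
            if name in PREDICTED:
                # The dependence-breaking assignment runs only in the
                # speculative thread (the "?" marker in Table 3(b)).
                dest = text.split("=")[0].split()[-1]   # "ld y = [x]" -> "y"
                out.append((name, f"? mov {dest} = {PREDICTED[name]:#x}"))
            else:
                out.append((name, text))
        return out

    for name, text in value_speculate([("I5", "ld y = [x]"),
                                       ("I6", "add z = y, 8"),
                                       ("I7", "ld = [z]")]):
        print(name, text)
    # I5 becomes "? mov y = 0x10000000"; I6 and I7 then run in the slice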
An example of a prediction technique is branch prediction. Table 4 shows an example of the application of branch prediction to the reduction of the backward slice.
TABLE 4
(a) The original program
foo(x, d)
{
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: cmp.eq p = c, 0
I5: (p) br I9
I6: st [b] = a
I7: ld z = [x]
I8: br I11
I9: ld y = [x]
I10: add z = y, 8
I11: ld = [z]
}
(b) Launch and termination of backward slice using branch prediction
foo(x, d)
{
? launch I9
I1: ld e = [d]
I2: add a = c, d
I3: add b = e, 16
I4: cmp.eq p = c, 0
I5: (p) br I9
I6: st [b] = a
I7: ld z = [x]
I8: br I11
I9:* ld y = [x]
I10:* add z = y, 8
I11:* ld = [z]
? terminate
}
Branch prediction is useful to force a backward slice to progress along a predetermined path. Without branch prediction, the input to the comparison for a speculative branch may be incorrect and the execution may go in a wrong direction. In Table 4, a straightforward backward slice will include the control flow branch instruction and the instructions on both the "taken" and "not-taken" paths. If the compiler can determine which way the branch is likely to go, either through a software or a hardware mechanism, it can perform a branch prediction at compilation time. If it is assumed that the prediction favors the "taken" path, the backward slice 110 can be reduced to the instructions on the "taken" path.
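A sketch of this path selection follows (illustrative only: the block names, the edge-probability map, and the 90% confidence threshold are assumptions):

    def likely_path(succ_prob, entry, threshold=0.9):
        """Starting from the entry block, repeatedly follow the successor
        whose profiled frequency meets the threshold; slicing is then
        restricted to the blocks on this single predicted path."""
        path, cur = [entry], entry
        while cur in succ_prob:
            confident = [b for b, p in succ_prob[cur].items() if p >= threshold]
            if not confident:
                break            # no confident prediction: stop extending the path
            cur = confident[0]
            path.append(cur)
        return path

    # Table 4: B1 holds I1-I5, B2 the not-taken side (I6-I8), and B3 the
    # taken side (I9-I11). With the branch at I5 profiled 95% taken,
    # slicing is restricted to B1 and B3; within them only I9-I11 end up
    # on the slice, as Table 4(b) shows.
    print(likely_path({"B1": {"B3": 0.95, "B2": 0.05}}, "B1"))   # ['B1', 'B3']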
Referring back to FIG. 2, at processing block 250, after reduction of the backward slice 110, one or more speculative bits on each instruction of the backward slice 110 is marked. In one embodiment, the marker symbol “*” is used to indicate that the instruction is executed in both the speculative and main executions. The marker symbol “?” is used to indicate that the instruction is executed only in the speculative execution. Further, an instruction with no marker symbol is executed only in the main execution.
Finally, at processing block 260, a launch point 101 and a termination point 104 are inserted for the backward slice 110 within the code region 100. In one embodiment, the speculative backward slices 110 must be issued sufficiently early to resolve the performance-degrading long latency instructions 103. However, issuing the backward slices too early could lead to the loss of the prefetch effect. For example, the data that is prefetched by a backward slice issued too early may be evicted from the data cache before its use. A compiler can use known program analysis techniques with aid from dynamic feedback information to decide where to insert the launch and termination points 101 and 104, respectively.
A backward slice 110 may have multiple launch points 101, which may be in a different function than the backward slice. An optimal launch point 101 for the backward slice 110 is a program point that satisfies the following conditions:
    • 1. The launch point is earlier than the first instruction of the backward slice;
    • 2. All live-in variables of the backward slice are ready at the launch point; and
    • 3. The latency from the launch point to the performance-degrading instruction is greater than the total latency of the backward slice including the miss latency of the performance-degrading instruction.
To identify the launch point 101, the program is traversed backward, starting from the first instruction 102 of the backward slice 110. The backward traversal may encounter a join point with multiple predecessors. The backward traversal needs to continue along all the highly probable predecessors using branch frequency information. Each instruction that changes the live-in value of the backward slice 110 will be scheduled earlier using known data and control speculation techniques. A launch point 101 is identified when the latency condition is satisfied and all live-in variables are ready. If an instruction is reached that changes the live-in value of the backward slice 110 and it cannot be scheduled earlier, and the latency condition is not satisfied, a sub-optimal launch point is identified. A sub-optimal launch point may be used if its latency can hide the majority of the miss latency of the performance-degrading instruction. If the backward traversal reaches the function entry and all live-ins are ready, but the latency condition is still not satisfied, the backward slice 110 is marked as incomplete and the list of live-ins is stored. In order to form backward slices across the function boundary, each backward slice 110 remembers the function parameters and the memory live-in locations that it depends on when it reaches the function entry point. When the caller function is compiled, the backward slice 110 is extended at the call site along the instructions that define the function parameters and the memory live-in locations.
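The search just described can be sketched as follows, reusing the region list from the earlier sketch (illustrative only: a straight-line walk with no join points, unit instruction latencies, and made-up latency numbers):

    def find_launch_point(instrs, slice_first_idx, live_in_vals,
                          slice_latency, miss_latency, cycles_per_instr=1):
        """Walk backward from the slice's first instruction, accumulating
        the latency a launch at that point would hide, until the latency
        condition holds and no instruction defining a slice live-in has
        been crossed. Returns (index, True) for an optimal launch point,
        (index, False) for a sub-optimal or incomplete result."""
        needed = slice_latency + miss_latency
        hidden = 0
        for idx in range(slice_first_idx - 1, -1, -1):
            ins = instrs[idx]
            if any(v in live_in_vals for v in ins.defs):
                # Crossing this instruction would read a stale live-in;
                # unless it can be hoisted (not modeled), launch just after it.
                return idx + 1, hidden >= needed
            hidden += cycles_per_instr
            if hidden >= needed:
                return idx, True            # optimal launch point
        return 0, False                     # function entry reached: incomplete

    # For the reduced slice I5-I7 (live-in x, first instruction at index 4),
    # four preceding instructions cannot hide an assumed 3 + 10 cycles, so
    # the slice is incomplete and x is remembered for call-site extension.
    print(find_launch_point(region, 4, {"x"}, slice_latency=3, miss_latency=10))
    # (0, False)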
As illustrated in FIG. 2, processing blocks 210 through 260 are subsequently repeated for another performance-degrading instruction 103 within the code region 100.
FIG. 3 is a block diagram of a processing system in accordance with one embodiment of the invention. As illustrated in FIG. 3, processing system 300 includes a memory 310 and a processor 320 coupled to the memory 310. In some embodiments, the processor 320 is a processor capable of compiling software and annotating code regions of the program. Processor 320 can be any type of processor capable of executing software, such as a microprocessor, digital signal processor, microcontroller, or the like. The processing system 300 can be a personal computer (PC), mainframe, handheld device, portable computer, set-top box, or any other system that includes software.
Memory 310 can be a hard disk, a floppy disk, random access memory (RAM), read only memory (ROM), flash memory, or any other type of machine medium readable by the processor 320. Memory 310 can store instructions for performing the execution of the various method embodiments of the present invention.
FIG. 4 is a detailed block diagram of the processing system. As illustrated in FIG. 4, in one embodiment, a main pipeline 410 and a speculative pipeline 420 within the processor 320 share a data cache 430 and an instruction cache 440 located within memory 310, as well as a branch target buffer 450. This sharing arrangement allows the speculative pipeline 420 to resolve cache misses and branch mispredictions for the main pipeline 410, as described in detail above.
The processor 320 needs to fetch instructions from the program 470 in a high bandwidth manner and skip those instructions that are not part of slices in order to achieve the run-ahead effect. Instructions marked with the marker symbol “*” will be executed in both the main pipeline 410 and the speculative pipeline 420. Instructions marked with a marker symbol “?” will only be executed in the speculative pipeline 420. Finally, instructions having no marker symbol will be executed only in the main pipeline 410. The launch point instruction to launch the execution of the backward slice and the termination point instruction to terminate the execution of the backward slice will only be executed in the speculative pipeline 420.
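The marker semantics can be summarized by a small filter (illustrative only; the patent encodes the markers as speculative bits on each instruction rather than as the character tags used here):

    def filter_for_pipeline(marked_code, pipeline):
        """Keep the instructions a given pipeline executes: '*' runs in
        both, '?' only in the speculative pipeline, and an unmarked
        instruction only in the main pipeline."""
        kept = []
        for marker, text in marked_code:
            if marker == "*" or \
               (marker == "?" and pipeline == "speculative") or \
               (marker == "" and pipeline == "main"):
                kept.append(text)
        return kept

    code = [("?", "launch I5"), ("", "ld e = [d]"), ("", "add a = c, d"),
            ("*", "ld y = [x]"), ("*", "add z = y, 8"), ("*", "ld = [z]"),
            ("?", "terminate")]
    print(filter_for_pipeline(code, "speculative"))
    # ['launch I5', 'ld y = [x]', 'add z = y, 8', 'ld = [z]', 'terminate']
    print(filter_for_pipeline(code, "main"))
    # ['ld e = [d]', 'add a = c, d', 'ld y = [x]', 'add z = y, 8', 'ld = [z]']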
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

Claims (23)

1. A method comprising:
identifying at least one performance-degrading instruction from a plurality of instructions to be executed in a program;
defining a set of instructions within said program to prefetch said at least one performance-degrading instruction;
marking at least one speculative bit of each instruction of said set of instructions to indicate a predetermined execution of said each instruction; and
inserting a launch point to launch said predetermined execution, said inserting further comprising:
determining a first instruction of said set of instructions;
determining whether variables of said set of instructions are ready for execution; and
determining whether a latency value from said launch point to said performance-degrading instruction is greater than a total latency value of said set of instructions including a miss latency of said performance-degrading instruction.
2. The method according to claim 1, further comprising:
inserting a termination point to terminate said predetermined execution.
3. The method according to claim 2, wherein inserting said termination point further comprises:
marking said termination point subsequent to said performance-degrading instruction.
4. The method according to claim 1, wherein said identifying further comprises:
monitoring execution of said plurality of instructions; and
detecting said at least one performance-degrading instruction.
5. The method according to claim 1, wherein said defining further comprises:
reducing a size of said set of instructions based on speculation and prediction techniques.
6. The method according to claim 5, wherein said reducing further comprises:
determining whether a first instruction of said set of instructions is aliased with a second instruction within said set of instructions; and
removing said first instruction and adjacent instructions related to said first instruction from said set, if said first instruction is not aliased with said second instruction.
7. A system comprising:
means for identifying at least one performance-degrading instruction from a plurality of instructions to be executed in a program;
means for defining a set of instructions within said program to prefetch said at least one performance-degrading instruction;
means for marking at least one speculative bit of each instruction of said set of instructions to indicate a predetermined execution of said each instruction; and
means for inserting a launch point to launch said predetermined execution, said means for inserting further comprising:
means for determining a first instruction of said set of instructions;
means for determining whether variables of said set of instructions are ready for execution; and
means for determining whether a latency value from said launch point to said performance-degrading instruction is greater than a total latency value of said set of instructions including a miss latency of said performance-degrading instruction.
8. The system according to claim 7, further comprising:
means for inserting a termination point to terminate said predetermined execution.
9. The system according to claim 8, further comprising:
means for marking said termination point subsequent to said performance-degrading instruction.
10. The system according to claim 7, further comprising:
means for monitoring execution of said plurality of instructions; and
means for detecting said at least one performance-degrading instruction.
11. The system according to claim 7, further comprising:
means for reducing a size of said set of instructions based on speculation and prediction techniques.
12. The system according to claim 11, further comprising:
means for determining whether a first instruction of said set of instructions is aliased with a second instruction within said set of instructions; and
means for removing said first instruction and adjacent instructions related to said first instruction from said set, if said first instruction is not aliased with said second instruction.
13. A computer readable medium containing executable instructions, which, when executed in a processing system, cause said processing system to perform a method comprising:
identifying at least one performance-degrading instruction from a plurality of instructions to be executed in a program;
defining a set of instructions within said program to prefetch said at least one performance-degrading instruction;
marking at least one speculative bit of each instruction of said set of instructions to indicate a predetermined execution of said each instruction; and
inserting a launch point to launch said predetermined execution, said inserting further comprising:
determining a first instruction of said set of instructions;
determining whether variables of said set of instructions are ready for execution; and
determining whether a latency value from said launch point to said performance-degrading instruction is greater than a total latency value of said set of instructions including a miss latency of said performance-degrading instruction.
14. The computer readable medium according to claim 13, wherein said method further comprises:
inserting a termination point to terminate said predetermined execution.
15. The computer readable medium according to claim 14, wherein inserting said termination point further comprises:
marking said termination point subsequent to said performance-degrading instruction.
16. The computer readable medium according to claim 13, wherein said identifying further comprises:
monitoring execution of said plurality of instructions; and
detecting said at least one performance-degrading instruction.
17. The computer readable medium according to claim 13, wherein said defining further comprises:
reducing a size of said set of instructions based on speculation and prediction techniques.
18. The computer readable medium according to claim 17, wherein said reducing further comprises:
determining whether a first instruction of said set of instructions is aliased with a second instruction within said set of instructions; and
removing said first instruction and adjacent instructions related to said first instruction from said set, if said first instruction is not aliased with said second instruction.
19. A system comprising:
a memory to store a plurality of instructions to be executed in a program; and
a processor coupled to said memory to:
identify at least one performance-degrading instruction from said plurality of instructions;
define a set of instructions within said program to prefetch said at least one performance-degrading instruction;
mark at least one speculative bit of each instruction of said set of instructions to indicate a predetermined execution of said each instruction; and
insert a launch point to launch said predetermined execution, wherein to insert a launch point further comprises said processor to:
determine a first instruction of said set of instructions;
determine whether variables of said set of instructions are ready for execution; and
determine whether a latency value from said launch point to said at least one performance-degrading instruction is greater than a total latency value of said set of instructions including a miss latency of said performance-degrading instruction.
20. The system according to claim 19, wherein said processor further inserts a termination point to terminate said predetermined execution.
21. The system according to claim 20, wherein said processor further marks said termination point subsequent to said performance-degrading instruction.
22. The system according to claim 19, wherein said processor further monitors execution of said plurality of instructions and detects said at least one performance-degrading instruction.
23. The system according to claim 19, wherein said processor further reduces a size of said set of instructions based on speculation and prediction techniques.
24. The system according to claim 23, wherein said processor further:
determines whether a first instruction of said set of instructions is aliased with a second instruction within said set of instructions; and
removes said first instruction and adjacent instructions related to said first instruction from said set, if said first instruction is not aliased with said second instruction.
US09/968,261 2001-09-28 2001-09-28 Compiler-directed speculative approach to resolve performance-degrading long latency events in an application Expired - Fee Related US6959435B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/968,261 US6959435B2 (en) 2001-09-28 2001-09-28 Compiler-directed speculative approach to resolve performance-degrading long latency events in an application

Publications (2)

Publication Number Publication Date
US20030074653A1 US20030074653A1 (en) 2003-04-17
US6959435B2 true US6959435B2 (en) 2005-10-25

Family

ID=25513978

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/968,261 Expired - Fee Related US6959435B2 (en) 2001-09-28 2001-09-28 Compiler-directed speculative approach to resolve performance-degrading long latency events in an application

Country Status (1)

Country Link
US (1) US6959435B2 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040078790A1 (en) * 2002-10-22 2004-04-22 Youfeng Wu Methods and apparatus to manage mucache bypassing
US20050071438A1 (en) * 2003-09-30 2005-03-31 Shih-Wei Liao Methods and apparatuses for compiler-creating helper threads for multi-threading
US20070130114A1 (en) * 2005-06-20 2007-06-07 Xiao-Feng Li Methods and apparatus to optimize processing throughput of data structures in programs
US20070283106A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Method and system for generating prefetch information for multi-block indirect memory access chains
US20070283105A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Method and system for identifying multi-block indirect memory access chains
US20070294693A1 (en) * 2006-06-16 2007-12-20 Microsoft Corporation Scheduling thread execution among a plurality of processors based on evaluation of memory access data
US20090043992A1 (en) * 2007-02-21 2009-02-12 Hewlett-Packard Development Company, L.P. Method And System For Data Speculation On Multicore Systems
US20090172713A1 (en) * 2007-12-31 2009-07-02 Ho-Seop Kim On-demand emulation via user-level exception handling
US20090249316A1 (en) * 2008-03-28 2009-10-01 International Business Machines Corporation Combining static and dynamic compilation to remove delinquent loads
US9223714B2 (en) 2013-03-15 2015-12-29 Intel Corporation Instruction boundary prediction for variable length instruction set
US10379863B2 (en) * 2017-09-21 2019-08-13 Qualcomm Incorporated Slice construction for pre-executing data dependent loads
US11531544B1 (en) 2021-07-29 2022-12-20 Hewlett Packard Enterprise Development Lp Method and system for selective early release of physical registers based on a release field value in a scheduler
US20230061576A1 (en) * 2021-08-25 2023-03-02 Hewlett Packard Enterprise Development Lp Method and system for hardware-assisted pre-execution

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7137111B2 (en) * 2001-11-28 2006-11-14 Sun Microsystems, Inc. Aggressive prefetch of address chains
US20040154010A1 (en) * 2003-01-31 2004-08-05 Pedro Marcuello Control-quasi-independent-points guided speculative multithreading
US7657880B2 (en) * 2003-01-31 2010-02-02 Intel Corporation Safe store for speculative helper threads
US7827543B1 (en) 2004-02-28 2010-11-02 Oracle America, Inc. Method and apparatus for profiling data addresses
US7735073B1 (en) 2004-02-28 2010-06-08 Oracle International Corporation Method and apparatus for data object profiling
US8065665B1 (en) 2004-02-28 2011-11-22 Oracle America, Inc. Method and apparatus for correlating profile data
US7707554B1 (en) 2004-04-21 2010-04-27 Oracle America, Inc. Associating data source information with runtime events
US7784037B2 (en) * 2006-04-14 2010-08-24 International Business Machines Corporation Compiler implemented software cache method in which non-aliased explicitly fetched data are excluded
US8495636B2 (en) * 2007-12-19 2013-07-23 International Business Machines Corporation Parallelizing single threaded programs by performing look ahead operation on the single threaded program to identify plurality of instruction threads prior to execution
US8966465B2 (en) 2008-02-12 2015-02-24 Oracle International Corporation Customization creation and update for multi-layer XML customization
US8996658B2 (en) 2008-09-03 2015-03-31 Oracle International Corporation System and method for integration of browser-based thin client applications within desktop rich client architecture
US8799319B2 (en) 2008-09-19 2014-08-05 Oracle International Corporation System and method for meta-data driven, semi-automated generation of web services based on existing applications
US9122520B2 (en) 2008-09-17 2015-09-01 Oracle International Corporation Generic wait service: pausing a BPEL process
US9009689B2 (en) * 2010-11-09 2015-04-14 Intel Corporation Speculative compilation to generate advice messages
US8954942B2 (en) * 2011-09-30 2015-02-10 Oracle International Corporation Optimizations using a BPEL compiler
US10719321B2 (en) 2015-09-19 2020-07-21 Microsoft Technology Licensing, Llc Prefetching instruction blocks
US20170083339A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Prefetching associated with predicated store instructions

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5651124A (en) * 1995-02-14 1997-07-22 Hal Computer Systems, Inc. Processor structure and method for aggressively scheduling long latency instructions including load/store instructions while maintaining precise state
US5751985A (en) * 1995-02-14 1998-05-12 Hal Computer Systems, Inc. Processor structure and method for tracking instruction status to maintain precise state
US5704053A (en) * 1995-05-18 1997-12-30 Hewlett-Packard Company Efficient explicit data prefetching analysis and code generation in a low-level optimizer for inserting prefetch instructions into loops of applications
US5751945A (en) * 1995-10-02 1998-05-12 International Business Machines Corporation Method and system for performance monitoring stalls to identify pipeline bottlenecks and stalls in a processing system
US5854934A (en) * 1996-08-23 1998-12-29 Hewlett-Packard Company Optimizing compiler having data cache prefetch spreading
US5909567A (en) * 1997-02-28 1999-06-01 Advanced Micro Devices, Inc. Apparatus and method for native mode processing in a RISC-based CISC processor
US5933643A (en) * 1997-04-17 1999-08-03 Hewlett-Packard Company Profiler driven data prefetching optimization where code generation not performed for loops
US6070009A (en) * 1997-11-26 2000-05-30 Digital Equipment Corporation Method for estimating execution rates of program execution paths
US6675374B2 (en) * 1999-10-12 2004-01-06 Hewlett-Packard Development Company, L.P. Insertion of prefetch instructions into computer program code
US6421826B1 (en) * 1999-11-05 2002-07-16 Sun Microsystems, Inc. Method and apparatus for performing prefetching at the function level
US6567975B1 (en) * 1999-11-08 2003-05-20 Sun Microsystems, Inc. Method and apparatus for inserting data prefetch operations using data flow analysis
US6681387B1 (en) * 1999-12-01 2004-01-20 Board Of Trustees Of The University Of Illinois Method and apparatus for instruction execution hot spot detection and monitoring in a data processing unit
US6560693B1 (en) * 1999-12-10 2003-05-06 International Business Machines Corporation Branch history guided instruction/data prefetching

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Craig B. Zilles et al., Understanding the Backward Slices of Performance Degrading Instructions, Article, Jun. 12-14, 2000, 10 pages, Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-2000).
Jamison D. Collins et al., Speculative Precomputation: Long-range Prefetching of Delinquent Loads, Article, Jul. 2001, pp. 14-25, Proceedings of the 28th Annual International Symposium on Computer Architecture.
Lo et al., Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism, ACM, 1995. *
Black et al., Load Execution Latency Reduction, ACM, 1998. *
Luk, Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors, IEEE, May 2001. *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040133886A1 (en) * 2002-10-22 2004-07-08 Youfeng Wu Methods and apparatus to compile a software program to manage parallel μcaches
US20040078790A1 (en) * 2002-10-22 2004-04-22 Youfeng Wu Methods and apparatus to manage μcache bypassing
US7448031B2 (en) 2002-10-22 2008-11-04 Intel Corporation Methods and apparatus to compile a software program to manage parallel μcaches
US7467377B2 (en) 2002-10-22 2008-12-16 Intel Corporation Methods and apparatus for compiler managed first cache bypassing
US20100281471A1 (en) * 2003-09-30 2010-11-04 Shih-Wei Liao Methods and apparatuses for compiler-creating helper threads for multi-threading
US20050071438A1 (en) * 2003-09-30 2005-03-31 Shih-Wei Liao Methods and apparatuses for compiler-creating helper threads for multi-threading
US8612949B2 (en) 2003-09-30 2013-12-17 Intel Corporation Methods and apparatuses for compiler-creating helper threads for multi-threading
US20070130114A1 (en) * 2005-06-20 2007-06-07 Xiao-Feng Li Methods and apparatus to optimize processing throughput of data structures in programs
US20070283106A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Method and system for generating prefetch information for multi-block indirect memory access chains
US20070283105A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Method and system for identifying multi-block indirect memory access chains
US7383402B2 (en) 2006-06-05 2008-06-03 Sun Microsystems, Inc. Method and system for generating prefetch information for multi-block indirect memory access chains
US7383401B2 (en) 2006-06-05 2008-06-03 Sun Microsystems, Inc. Method and system for identifying multi-block indirect memory access chains
US20070294693A1 (en) * 2006-06-16 2007-12-20 Microsoft Corporation Scheduling thread execution among a plurality of processors based on evaluation of memory access data
US7937565B2 (en) * 2007-02-21 2011-05-03 Hewlett-Packard Development Company, L.P. Method and system for data speculation on multicore systems
US20090043992A1 (en) * 2007-02-21 2009-02-12 Hewlett-Packard Development Company, L.P. Method And System For Data Speculation On Multicore Systems
US20090172713A1 (en) * 2007-12-31 2009-07-02 Ho-Seop Kim On-demand emulation via user-level exception handling
US8146106B2 (en) 2007-12-31 2012-03-27 Intel Corporation On-demand emulation via user-level exception handling
US20090249316A1 (en) * 2008-03-28 2009-10-01 International Business Machines Corporation Combining static and dynamic compilation to remove delinquent loads
US8136103B2 (en) * 2008-03-28 2012-03-13 International Business Machines Corporation Combining static and dynamic compilation to remove delinquent loads
US9223714B2 (en) 2013-03-15 2015-12-29 Intel Corporation Instruction boundary prediction for variable length instruction set
US10379863B2 (en) * 2017-09-21 2019-08-13 Qualcomm Incorporated Slice construction for pre-executing data dependent loads
US11531544B1 (en) 2021-07-29 2022-12-20 Hewlett Packard Enterprise Development Lp Method and system for selective early release of physical registers based on a release field value in a scheduler
US20230061576A1 (en) * 2021-08-25 2023-03-02 Hewlett Packard Enterprise Development Lp Method and system for hardware-assisted pre-execution
US11687344B2 (en) * 2021-08-25 2023-06-27 Hewlett Packard Enterprise Development Lp Method and system for hardware-assisted pre-execution
US12079631B2 (en) 2021-08-25 2024-09-03 Hewlett Packard Enterprise Development Lp Method and system for hardware-assisted pre-execution

Also Published As

Publication number Publication date
US20030074653A1 (en) 2003-04-17

Similar Documents

Publication Title
US6959435B2 (en) Compiler-directed speculative approach to resolve performance-degrading long latency events in an application
US6539541B1 (en) Method of constructing and unrolling speculatively counted loops
JP3093626B2 (en) Central processing unit and method of executing instructions
US7424578B2 (en) Computer system, compiler apparatus, and operating system
JP3570855B2 (en) Branch prediction device
JP2013122774A (en) Method and device for solving simultaneously predicted branch instruction
WO2010010678A1 (en) Program optimization method
US9430240B1 (en) Pre-computation slice merging for prefetching in a computer processor
EP0655679B1 (en) Method and apparatus for controlling instruction in pipeline processor
US20030084433A1 (en) Profile-guided stride prefetching
US7234136B2 (en) Method and apparatus for selecting references for prefetching in an optimizing compiler
EP1316015B1 (en) Method and apparatus for using an assist processor to prefetch instructions for a primary processor
US9304750B2 (en) System and method for processor with predictive memory retrieval assist
US7257810B2 (en) Method and apparatus for inserting prefetch instructions in an optimizing compiler
US20040117606A1 (en) Method and apparatus for dynamically conditioning statically produced load speculation and prefetches using runtime information
US7457923B1 (en) Method and structure for correlation-based prefetching
US7293265B2 (en) Methods and apparatus to perform return-address prediction
JP2006216040A (en) Method and apparatus for dynamic prediction by software
US11893368B2 (en) Removing branching paths from a computer program
US6931632B2 (en) Instrumentation of code having predicated branch-call and shadow instructions
US20030131345A1 (en) Employing value prediction with the compiler
JP2002014868A (en) Microprocessor having memory referring operation detecting mechanism and compile method
KR101947737B1 (en) Method and apparatus for explicit and implicit information flow tracking
US20230205535A1 (en) Optimization of captured loops in a processor for optimizing loop replay performance
JP2006330813A (en) Compiler device with prefetch starting command inserting function

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JU, DZ-CHING;WU, YOUFENG;REEL/FRAME:012534/0056

Effective date: 20011127

FPAY Fee payment

Year of fee payment: 4

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20131025