US20050060517A1 - Switching processor threads during long latencies - Google Patents
- Publication number
- US20050060517A1 (application US 10/661,079)
- Authority
- US
- United States
- Prior art keywords
- thread
- instruction
- processor pipeline
- feedback signal
- switch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
Abstract
In one embodiment, the present invention includes a method to determine whether execution of an instruction of a first thread may require a long latency and switch to a second thread if the instruction may require the long latency. In certain embodiments, at least one additional instruction may be executed in the first thread while preparing to switch threads.
Description
- Processors, such as modern high-performance processors, are designed to execute a large number of instructions per clock cycle. Certain instructions produce a result only after a potentially large number of cycles. Such instructions may be known as “long latency” instructions, as a long time interval exists between the time an instruction is delivered and when it is executed. A long latency may occur, for example, when data required by an instruction needs to be loaded from a high level of memory. Such a load operation therefore may have a “load-use” penalty associated with it. That is, after a program issues such a load instruction, the data may not be available for multiple cycles, even if the data exists (i.e., “hits”) in a cache memory associated with the processor.
- Processors typically allow execution to continue while a long latency instruction is outstanding. Often, however, data is needed relatively soon (e.g., within several clock cycles) because insufficient work remains to be done by the processor without the requested data. Accordingly, a need exists to improve processor performance in such situations.
- FIG. 1 is a block diagram of a portion of a processor pipeline in accordance with one embodiment of the present invention.
- FIG. 2 is a flow diagram of a method in accordance with one embodiment of the present invention.
- FIG. 3 is a block diagram of a wireless device in accordance with one embodiment of the present invention.
- Referring to FIG. 1, shown is a block diagram of a portion of a processor pipeline in accordance with one embodiment of the present invention. As shown in FIG. 1, pipeline 100 includes an instruction cache (I-cache) 110, an instruction fetch unit 120, an instruction decode unit 130, a register lookup 140, and a reservation and execution unit 150, although the scope of the present invention is not limited in this regard.
- While the type of processor which includes a pipeline in accordance with an embodiment of the present invention may vary, in one embodiment the processor may be a relatively simple in-order processor. In one embodiment, the processor may have a reduced instruction set computing (RISC) architecture, such as an architecture based on an Advanced RISC Machines (ARM) architecture. For example, in one embodiment a 32-bit version of an INTEL® XSCALE™ processor available from Intel Corporation, Santa Clara, Calif., may be used. However, in other embodiments the processor may be a different processor.
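The pipeline organization described above may be pictured as a simple in-order model. The following sketch is illustrative only; the stage names come from FIG. 1, but the `Instr` type and `advance` helper are assumptions made for this example, not elements of the disclosed embodiment:

```python
# Illustrative sketch of the five pipeline stages of FIG. 1.
# The Instr type and advance() helper are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Instr:
    thread_id: int
    opcode: str

STAGES = (
    "I-cache (110)",                # holds instructions of multiple threads
    "instruction fetch (120)",      # per-thread program counters, thread sequencing
    "instruction decode (130)",     # breaks instructions into uops
    "register lookup (140)",        # assigns physical registers; may stall
    "reservation/execution (150)",  # schedules execution units (ALU, MAC, memory)
)

def advance(pipeline):
    """Shift each instruction one stage down the in-order pipeline per cycle."""
    return [None] + pipeline[:-1]

# One instruction entering an otherwise empty pipeline reaches the
# execution stage after four advances:
pipe = [Instr(0, "ADD")] + [None] * (len(STAGES) - 1)
for _ in range(len(STAGES) - 1):
    pipe = advance(pipe)
assert pipe[-1].opcode == "ADD"
```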
- In one embodiment, I-cache 110 may be coupled to receive instructions from a bus interface unit of the processor. I-cache 110 may be used to store instructions, including instructions of multiple threads of a program. I-cache 110 may be coupled to provide instructions to instruction fetch unit 120. Alternately, instruction fetch unit 120 may receive instructions from a fill buffer (which may be within reservation and execution unit 150). Instruction fetch unit 120 may include, in certain embodiments, program counters for each thread to be executed on the processor, along with logic to sequence between the threads. In an embodiment in which out-of-order processing is implemented, instruction fetch unit 120 may include a branch target buffer that may be used to examine instructions fetched from I-cache 110 to determine whether any branching conditions exist. - As shown in
FIG. 1, instruction decode unit 130 is coupled to receive fetched instructions from instruction fetch unit 120. Instruction decode unit 130 may be used to decode instructions by breaking more complex instructions into smaller instructions that may be processed faster. For example, in one embodiment instructions may be decoded into micro-operations (uops). However, in other embodiments other types of instructions may be decoded, such as macro operations or another form of instruction. Additionally, it is to be understood that various instruction sets may be used, such as Reduced Instruction Set Computing (RISC) instructions or Complex Instruction Set Computing (CISC) instructions. Further, in one embodiment instruction decode unit 130 may decode CISC instructions to RISC instructions. - Still referring to
FIG. 1, decoded instructions, including an identification of registers to be accessed, may be provided to register lookup 140. Register lookup 140 may be used to provide a physical register identification of a register in a register file unit. In such manner, registers may be assigned to each instruction. Also, register lookup 140 may, in certain embodiments, be used to stall the pipeline if register dependencies exist. Stalls are cycles in which the processor pipeline does not execute an instruction. - From
register lookup 140, instructions may then proceed to reservation and execution unit 150 for scheduling execution of the instructions in the execution unit (or units) of the processor (e.g., an integer and/or floating point arithmetic logic unit (ALU)). In one embodiment, multiple execution units may be present. Such execution units may include a main execution pipeline, a memory pipeline, and a multiply-accumulate (MAC) pipeline. In such an embodiment, the main execution pipeline may perform arithmetic and logic operations, as required for data processing instructions and load/store index calculations, and may further determine conditional instruction execution. The memory pipeline may include a data cache unit to handle load and store instructions. The MAC pipeline may be used to perform multiply and multiply-accumulate instructions. - The above-described flow through
pipeline 100 may describe normal operational flow. Such flow may occur when instructions do not require long latencies prior to execution, or in the absence of other conditions that may lead to processor stalls. - However, in accordance with various embodiments of the present invention, when certain predetermined events or conditions occur, such as an instruction that may require a long latency prior to execution, a
feedback loop 125 from instruction decode unit 130 to instruction fetch unit 120 may be activated to cause instruction fetch unit 120 to prepare to switch threads and accordingly fetch instructions for the new thread. - In one embodiment, identifications of predetermined conditions may be stored in
instruction decode unit 130, although the scope of the present invention is not limited in this regard. For example, the identifications may be stored in a lookup table or other storage within instruction decode unit 130. In this manner, when an instruction is received by instruction decode unit 130, it may be analyzed against entries in the lookup table to determine whether the instruction corresponds to one of the predetermined conditions. If so, a feedback signal on feedback loop 125 may be activated. Alternately, logic in instruction decode unit 130 may be used to detect the presence or occurrence of a predetermined condition. In still other embodiments, microcode in instruction decode unit 130 may determine the presence or occurrence of a predetermined condition. - While the predetermined conditions may vary in different embodiments, an instruction that may require a long latency prior to execution may be considered to be a predetermined condition. As used herein, the term "long latency" means a time period between receipt of an instruction and execution of the instruction that causes the processor to suffer one or more stalls. Thus a long latency period may be several cycles or may be hundreds of cycles, depending on the ability of the processor to perform other instructions in the latency period. For example, a load instruction that requires the obtaining of information from system memory (e.g., as on a cache miss) may require hundreds of cycles, while a load instruction that obtains data from a cache (such as a level 1 (L1) cache) closely associated with the processor may require fewer than ten cycles. In certain embodiments, both load instructions may be considered long latency instructions, as a processor may suffer stall cycles before data is ready for processing.
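The lookup-table detection scheme described above can be sketched as follows. This is a minimal illustration; the particular opcodes placed in the table are assumptions chosen for the example, not a list taken from the disclosure:

```python
# Sketch of decode-stage detection of predetermined conditions via a
# lookup table held in the instruction decode unit. The opcodes listed
# are illustrative assumptions (loads, stores, floating point divide).
LONG_LATENCY_TABLE = {"LDR", "STR", "FDIV"}

def decode(opcode):
    """Decode one instruction; return True if the feedback signal on
    feedback loop 125 should be activated for this opcode."""
    feedback_signal = opcode in LONG_LATENCY_TABLE
    return feedback_signal

assert decode("LDR") is True    # load: prepare to switch threads
assert decode("ADD") is False   # simple ALU op: normal pipeline flow
```

A set-membership test like this maps naturally onto a small hardware table indexed by opcode, which is one reason the lookup-table variant is attractive in a decode stage.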
- Thus a load instruction is an example of an instruction that may cause a long latency and thus may be considered a predetermined condition. While the latency caused by a load instruction may vary depending on the level of memory at which the data is obtained, the presence of a load instruction itself, regardless of actual latency, may be sufficient to cause a feedback signal in accordance with an embodiment of the present invention. In other words, a feedback signal in accordance with an embodiment of the present invention may be based on stochastic models. That is, the predetermined conditions may be selected based on the knowledge that the conditions may, but need not necessarily, cause a latency that leads to pipeline stalls.
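The stochastic rationale above can be made concrete with an expected-value estimate. All figures below (miss rate, penalties, switch overhead) are hypothetical numbers chosen for illustration, not values from the disclosure:

```python
# Hypothetical expected-stall estimate motivating why every load may be
# treated as a predetermined condition even though its actual latency varies.
def expected_stall_cycles(p_miss, miss_penalty, hit_penalty):
    """Average stall cycles one load causes, across cache hits and misses."""
    return p_miss * miss_penalty + (1.0 - p_miss) * hit_penalty

# Assumed figures: 5% miss rate, 200-cycle miss, 4-cycle hit penalty.
expected = expected_stall_cycles(0.05, 200, 4)   # 13.8 cycles on average
SWITCH_OVERHEAD = 2   # hypothetical thread-switch cost in cycles

# Signaling a switch on every load pays off whenever the expected stall
# exceeds the switch overhead, even though any individual load may hit.
assert expected > SWITCH_OVERHEAD
```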
- Other examples of instructions that may be considered to be a predetermined condition may include store instructions and certain arithmetic instructions. For example, a floating point divide operation may be a condition that causes a feedback signal. In addition, other operations which require accessing a memory subsystem may be a condition causing a feedback signal to be initiated.
- When a predetermined condition is detected, instructions of another thread may be fetched and executed so that few or no stall cycles occur. Thus in certain embodiments, performance may be significantly increased in multi-thread contexts. At the same time, no performance difference occurs in a single thread context, because during single thread operation, instructions pass directly through
pipeline 100, as discussed above. - Referring now to
FIG. 2, shown is a flow diagram of a method in accordance with one embodiment of the present invention. While FIG. 2 relates specifically to a method for processing a long latency instruction, it is to be appreciated that the flow of FIG. 2 may be applicable to any predetermined condition. As shown in FIG. 2, method 200 may start (oval 205) and fetch an instruction of a current thread (block 210). Next, the instruction may be decoded (block 220). After decoding, it may be determined whether a predetermined condition has been met, for example, whether the instruction is one of the predetermined conditions because it may cause a long latency (diamond 230). Such a determination may be made in instruction decode unit 130, in one embodiment. - If it is determined that the instruction may not have a long latency (i.e., a predetermined condition has not occurred), the instruction may be executed (block 240) and a next instruction of the current thread may be fetched (block 210).
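The flow of method 200 can be sketched as a fetch-decode-dispatch loop. This is a behavioral sketch only: the `run` and `is_long_latency` names are assumptions, and the long-latency instruction is modeled as completing immediately rather than remaining outstanding as it would in hardware:

```python
# Sketch of method 200: fetch (block 210), decode (block 220), check the
# predetermined condition (diamond 230), execute (block 240), or prepare
# to switch and switch threads (blocks 250/260).
def run(threads, is_long_latency, max_steps=50):
    """threads: one instruction list per thread.
    is_long_latency: predicate standing in for diamond 230 (an assumption)."""
    pcs = [0] * len(threads)   # per-thread program counters (fetch unit)
    cur = 0
    executed = []
    for _ in range(max_steps):
        if pcs[cur] >= len(threads[cur]):
            break                               # current thread exhausted
        instr = threads[cur][pcs[cur]]          # block 210: fetch
        pcs[cur] += 1                           # block 220: decode (modeled as a no-op)
        executed.append((cur, instr))           # block 240: execute
        if is_long_latency(instr):              # diamond 230
            cur = (cur + 1) % len(threads)      # blocks 250/260: switch threads
    return executed

trace = run([["ADD", "LDR", "SUB"], ["MUL", "AND"]],
            lambda op: op == "LDR")
assert trace[:4] == [(0, "ADD"), (0, "LDR"), (1, "MUL"), (1, "AND")]
```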
- If instead it is determined that a long latency may result (i.e., a predetermined condition has occurred), the pipeline may be prepared to switch threads (block 250). In one embodiment, preparation to switch threads may include executing a remaining one or several instructions of the first thread prior to the thread switch. In such manner, additional instructions may be performed without a processor stall. Moreover, because such instructions are already present in the processor pipeline, they need not be flushed. Thus instructions of a first thread may continue to be processed while the second thread is prepared for execution. For example, an embodiment executed in the
processor pipeline 100 of FIG. 1 may have a two pipeline-stage delay caused by the feedback signal on feedback loop 125. This delay may thus allow two additional instructions of the first thread to be performed before the second thread begins execution. - Preparing to switch threads may further include setting instruction fetch
unit 120 to fetch an instruction of the new thread. For example, a program counter for the second thread within instruction fetch unit 120 may be selected and used to fetch the next instruction of the thread. - Next, the threads may be switched (block 260). In the embodiment of
FIG. 2, such thread switching may cause instruction fetch unit 120 to obtain an instruction of the new thread from I-cache 110 (block 210). Additionally, various control registers and other information corresponding to the new thread may be loaded into different registers and portions of the processor pipeline to allow execution of instructions of the new thread. - Referring now to Table 1, shown is an example of execution of a code portion in accordance with an embodiment of the present invention:
TABLE 1
  Thread0:  ADD R2   LDR to R0   SUB R3   XOR R4
  Thread1:  MUL R3   LDR to R7   ADD R1   AND R4
  Thread0:  ADD R0   . . .
- As shown in Table 1, a first thread (Thread 0) may be executing in a processor pipeline. When a long latency instruction is detected by instruction decode unit 130 (i.e., the Load to register 0 (LDR to R0) instruction), a signal may be sent on
feedback line 125 to instruction fetch unit 120 to prepare for a new thread. At the same time, one or more additional instructions of the first thread may be executed in the processor pipeline (e.g., SUB R3 and XOR R4). - Then, as shown in Table 1, a second thread (i.e., Thread 1) may be switched to and activated. As with the first thread, the second thread may begin executing instructions and continue until a long latency instruction is encountered (i.e., the Load to register 7 (LDR to R7) instruction). As above, at the detection of a long latency instruction, preparation may be made to switch threads again, while at the same time processing one or more additional instructions of the current thread.
- Finally, as shown in Table 1, the original thread (i.e., Thread 0) may again be switched to and its execution resumed. While shown and discussed in Table 1 as having two threads of operation, it is to be understood that in other embodiments multithreaded operation may encompass more than two threads.
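The interleaving of Table 1, including the two extra first-thread instructions that run during the feedback delay, can be reproduced with a small simulation. This is a sketch under stated assumptions: `DELAY_SLOTS = 2` models the two pipeline-stage delay of the example embodiment, and prefix-matching on "LDR" stands in for the decode unit's condition check:

```python
# Reproduces the Table 1 instruction order: after an "LDR" is detected,
# two more instructions of the current thread issue (the assumed two
# pipeline-stage feedback delay) before the thread switch takes effect.
DELAY_SLOTS = 2

def interleave(threads, steps):
    pcs = [0] * len(threads)
    cur, pending = 0, None   # pending = delay-slot instructions left before switch
    order = []
    for _ in range(steps):
        op = threads[cur][pcs[cur] % len(threads[cur])]
        pcs[cur] += 1
        order.append(op)
        if pending is not None:
            pending -= 1
            if pending == 0:
                cur, pending = (cur + 1) % len(threads), None
        elif op.startswith("LDR"):
            pending = DELAY_SLOTS   # prepare switch; keep issuing this thread
    return order

thread0 = ["ADD R2", "LDR to R0", "SUB R3", "XOR R4", "ADD R0"]
thread1 = ["MUL R3", "LDR to R7", "ADD R1", "AND R4"]
order = interleave([thread0, thread1], steps=9)
assert order == ["ADD R2", "LDR to R0", "SUB R3", "XOR R4",
                 "MUL R3", "LDR to R7", "ADD R1", "AND R4",
                 "ADD R0"]
```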
- Embodiments of the present invention may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system, such as a wireless device, to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any type of media suitable for storing electronic instructions, including programmable storage devices.
- FIG. 3 is a block diagram of a wireless device with which embodiments of the invention may be used. As shown in FIG. 3, in one embodiment wireless device 500 includes a processor 510, which may include a general-purpose or special-purpose processor such as a microprocessor, microcontroller, application specific integrated circuit (ASIC), a programmable gate array (PGA), and the like. Processor 510 may include a feedback loop in accordance with one embodiment of the present invention and may be programmed to switch threads when a predetermined condition occurs. Processor 510 may be coupled to a digital signal processor (DSP) 530 via an internal bus 520. In turn, DSP 530 may be coupled to a flash memory 540. As further shown in FIG. 3, flash memory 540 may also be coupled to microprocessor 510, internal bus 520, and peripheral bus 560. - As shown in
FIG. 3 ,microprocessor 510 may also be coupled to aperipheral bus interface 550 and aperipheral bus 560. While many devices may be coupled toperipheral bus 560, shown inFIG. 3 is awireless interface 570 which is in turn coupled to anantenna 580. Invarious embodiments antenna 580 may be a dipole antenna, helical antenna, global system for mobile communication (GSM) or another such antenna. - Although the description makes reference to specific components of
device 500, it is contemplated that numerous modifications and variations of the described and illustrated embodiments may be possible. - While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.
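The decoder-side determination described earlier, comparing a fetched instruction against a lookup table and asserting a feedback signal back to the fetch unit, can likewise be modeled in software. The 4-bit opcode field position and the opcode encodings below are invented purely for illustration:

```python
# Hypothetical sketch of the decoder-side check: the decoder indexes a small
# lookup table of conditions that may stall the pipeline and asserts a
# feedback signal requesting a thread switch. Field layout and encodings
# are assumptions for this example only.

# Lookup table of "may stall" conditions, keyed by opcode field.
STALL_TABLE = {
    0b0100: "load (possible cache miss)",
    0b0101: "store (possible bus wait)",
    0b1100: "divide (multi-cycle)",
}

def decode(instruction_word):
    """Return (opcode, feedback); feedback=True requests a thread switch."""
    opcode = (instruction_word >> 12) & 0xF   # assume opcode in top 4 bits
    feedback = opcode in STALL_TABLE
    return opcode, feedback

print(decode(0b0100_0010_0000_0111))  # a load-class word asserts feedback
print(decode(0b0001_0011_0000_0100))  # an ALU-class word does not
```

Because the check runs at decode, before execution, the fetch unit can begin preparing the second thread while remaining instructions of the first thread drain through the pipeline.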
Claims (29)
1. A method comprising:
determining whether execution of an instruction of a first thread may require a long latency; and
switching to a second thread if the instruction may require the long latency.
2. The method of claim 1, further comprising executing at least one additional instruction in the first thread while preparing to switch to the second thread.
3. The method of claim 1, wherein the determining is based on a stochastic analysis of whether the instruction will result in a long latency.
4. The method of claim 1, wherein the determining comprises applying the instruction to a lookup table in a processor pipeline.
5. The method of claim 4, further comprising providing a feedback signal from an instruction decoder to an instruction fetch unit to switch to the second thread.
6. The method of claim 1, wherein the long latency comprises less than ten processor cycles.
7. The method of claim 1, further comprising switching back to the first thread.
8. A method comprising:
switching from a first thread to a second thread if a condition that may result in a stall of a processor pipeline occurs during execution of the first thread in the processor pipeline.
9. The method of claim 8, further comprising determining whether the condition occurs by comparing an instruction to entries in a lookup table.
10. The method of claim 8, further comprising executing at least one additional instruction after the condition occurs and before switching to the second thread.
11. The method of claim 8, wherein the condition is based on a stochastic model.
12. The method of claim 8, further comprising providing a feedback signal from an instruction decoder to an instruction fetch unit to switch to the second thread.
13. An article comprising a machine-readable storage medium containing instructions that if executed enable a system to:
switch from a first thread to a second thread if a condition that may result in a stall of a processor pipeline occurs during execution of the first thread in the processor pipeline.
14. The article of claim 13, further comprising instructions that if executed enable the system to determine whether the condition occurs by comparing an instruction to entries in a lookup table.
15. The article of claim 13, further comprising instructions that if executed enable the system to execute at least one additional instruction in the first thread while the system prepares to switch to the second thread.
16. The article of claim 13, further comprising instructions that if executed enable the system to send a feedback signal to cause the switch from the first thread to the second thread.
17. An apparatus comprising:
a processor pipeline having a feedback loop to provide a feedback signal to cause the processor pipeline to switch from a first thread to a second thread, the feedback signal to originate from a location in the processor pipeline before instruction execution.
18. The apparatus of claim 17, wherein the feedback signal is coupled between an instruction decoder and an instruction fetch unit.
19. The apparatus of claim 18, wherein the instruction decoder is coupled to provide the feedback signal to the instruction fetch unit when a predetermined condition occurs.
20. The apparatus of claim 19, wherein the instruction decoder includes logic to determine when the predetermined condition occurs.
21. The apparatus of claim 19, wherein the instruction decoder includes a lookup table that includes a list of predetermined conditions.
22. A system comprising:
a processor pipeline having a feedback loop to provide a feedback signal to cause the processor pipeline to switch from a first thread to a second thread, the feedback signal to originate from a location in the processor pipeline before instruction execution; and
a wireless interface coupled to the processor pipeline.
23. The system of claim 22, further comprising at least one storage device to store code to enable the processor pipeline to switch from the first thread to the second thread if a predetermined condition occurs during execution of the first thread.
24. The system of claim 23, wherein the at least one storage device includes code to enable the processor pipeline to execute at least one additional instruction in the first thread while the system prepares to switch to the second thread.
25. The system of claim 22, wherein the feedback signal is coupled between an instruction decoder and an instruction fetch unit.
26. The system of claim 25, wherein the instruction decoder is coupled to provide the feedback signal to the instruction fetch unit when a predetermined condition occurs.
27. The system of claim 26, wherein the instruction decoder includes logic to determine when the predetermined condition occurs.
28. The system of claim 26, wherein the instruction decoder includes a lookup table that includes a list of predetermined conditions.
29. The system of claim 22, wherein the wireless interface comprises a dipole antenna.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/661,079 US20050060517A1 (en) | 2003-09-12 | 2003-09-12 | Switching processor threads during long latencies |
US11/827,207 US7596683B2 (en) | 2003-09-12 | 2007-07-11 | Switching processor threads during long latencies |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/661,079 US20050060517A1 (en) | 2003-09-12 | 2003-09-12 | Switching processor threads during long latencies |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/827,207 Division US7596683B2 (en) | 2003-09-12 | 2007-07-11 | Switching processor threads during long latencies |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050060517A1 true US20050060517A1 (en) | 2005-03-17 |
Family
ID=34273798
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/661,079 Abandoned US20050060517A1 (en) | 2003-09-12 | 2003-09-12 | Switching processor threads during long latencies |
US11/827,207 Expired - Fee Related US7596683B2 (en) | 2003-09-12 | 2007-07-11 | Switching processor threads during long latencies |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/827,207 Expired - Fee Related US7596683B2 (en) | 2003-09-12 | 2007-07-11 | Switching processor threads during long latencies |
Country Status (1)
Country | Link |
---|---|
US (2) | US20050060517A1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2485538B (en) * | 2010-11-16 | 2013-12-25 | Nds Ltd | Obfuscated hardware multi-threading |
US9268542B1 (en) * | 2011-04-28 | 2016-02-23 | Google Inc. | Cache contention management on a multicore processor based on the degree of contention exceeding a threshold |
US9043788B2 (en) | 2012-08-10 | 2015-05-26 | Concurix Corporation | Experiment manager for manycore systems |
US20130080760A1 (en) * | 2012-08-10 | 2013-03-28 | Concurix Corporation | Execution Environment with Feedback Loop |
US8966462B2 (en) | 2012-08-10 | 2015-02-24 | Concurix Corporation | Memory management parameters derived from system modeling |
US9665474B2 (en) | 2013-03-15 | 2017-05-30 | Microsoft Technology Licensing, Llc | Relationships derived from trace data |
US20170139716A1 (en) | 2015-11-18 | 2017-05-18 | Arm Limited | Handling stalling event for multiple thread pipeline, and triggering action based on information access delay |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5870588A (en) * | 1995-10-23 | 1999-02-09 | Interuniversitair Micro-Elektronica Centrum(Imec Vzw) | Design environment and a design method for hardware/software co-design |
US6016542A (en) * | 1997-12-31 | 2000-01-18 | Intel Corporation | Detecting long latency pipeline stalls for thread switching |
US6049867A (en) * | 1995-06-07 | 2000-04-11 | International Business Machines Corporation | Method and system for multi-thread switching only when a cache miss occurs at a second or higher level |
US6076157A (en) * | 1997-10-23 | 2000-06-13 | International Business Machines Corporation | Method and apparatus to force a thread switch in a multithreaded processor |
US6463522B1 (en) * | 1997-12-16 | 2002-10-08 | Intel Corporation | Memory system for ordering load and store instructions in a processor that performs multithread execution |
US20030018826A1 (en) * | 2001-07-13 | 2003-01-23 | Shailender Chaudhry | Facilitating efficient join operations between a head thread and a speculative thread |
US6697935B1 (en) * | 1997-10-23 | 2004-02-24 | International Business Machines Corporation | Method and apparatus for selecting thread switch events in a multithreaded processor |
US6775740B1 (en) * | 2000-06-28 | 2004-08-10 | Hitachi, Ltd. | Processor having a selector circuit for selecting an output signal from a hit/miss judgement circuit and data from a register file |
US20050097552A1 (en) * | 2003-10-01 | 2005-05-05 | O'connor Dennis M. | Method and apparatus to enable execution of a thread in a multi-threaded computer system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5357617A (en) * | 1991-11-22 | 1994-10-18 | International Business Machines Corporation | Method and apparatus for substantially concurrent multiple instruction thread processing by a single pipeline processor |
JP3569014B2 (en) * | 1994-11-25 | 2004-09-22 | 富士通株式会社 | Processor and processing method supporting multiple contexts |
US5933627A (en) * | 1996-07-01 | 1999-08-03 | Sun Microsystems | Thread switch on blocked load or store using instruction thread field |
US6662297B1 (en) * | 1999-12-30 | 2003-12-09 | Intel Corporation | Allocation of processor bandwidth by inserting interrupt servicing instructions to intervene main program in instruction queue mechanism |
US6965982B2 (en) * | 2001-06-29 | 2005-11-15 | International Business Machines Corporation | Multithreaded processor efficiency by pre-fetching instructions for a scheduled thread |
US6915414B2 (en) * | 2001-07-20 | 2005-07-05 | Zilog, Inc. | Context switching pipelined microprocessor |
US7000233B2 (en) * | 2003-04-21 | 2006-02-14 | International Business Machines Corporation | Simultaneous multithread processor with result data delay path to adjust pipeline length for input to respective thread |
JP4287799B2 (en) * | 2004-07-29 | 2009-07-01 | 富士通株式会社 | Processor system and thread switching control method |
2003
- 2003-09-12 US US10/661,079 patent/US20050060517A1/en not_active Abandoned

2007
- 2007-07-11 US US11/827,207 patent/US7596683B2/en not_active Expired - Fee Related
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7941642B1 (en) * | 2004-06-30 | 2011-05-10 | Oracle America, Inc. | Method for selecting between divide instructions associated with respective threads in a multi-threaded processor |
CN102609240A (en) * | 2011-01-20 | 2012-07-25 | 瑞昱半导体股份有限公司 | Processor circuit and method for reading data |
US20120191910A1 (en) * | 2011-01-20 | 2012-07-26 | Yen-Ju Lu | Processing circuit and method for reading data |
US20160224386A1 (en) * | 2012-12-27 | 2016-08-04 | Nvidia Corporation | Approach for a configurable phase-based priority scheduler |
US10346212B2 (en) * | 2012-12-27 | 2019-07-09 | Nvidia Corporation | Approach for a configurable phase-based priority scheduler |
US20150052533A1 (en) * | 2013-08-13 | 2015-02-19 | Samsung Electronics Co., Ltd. | Multiple threads execution processor and operating method thereof |
US20160221253A1 (en) * | 2013-10-22 | 2016-08-04 | Fujifilm Corporation | Bonding method and device |
Also Published As
Publication number | Publication date |
---|---|
US20070260853A1 (en) | 2007-11-08 |
US7596683B2 (en) | 2009-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7596683B2 (en) | Switching processor threads during long latencies | |
US8099586B2 (en) | Branch misprediction recovery mechanism for microprocessors | |
US7685410B2 (en) | Redirect recovery cache that receives branch misprediction redirects and caches instructions to be dispatched in response to the redirects | |
US7401211B2 (en) | Method for converting pipeline stalls caused by instructions with long latency memory accesses to pipeline flushes in a multithreaded processor | |
US7278012B2 (en) | Method and apparatus for efficiently accessing first and second branch history tables to predict branch instructions | |
US20160291982A1 (en) | Parallelized execution of instruction sequences based on pre-monitoring | |
US7165254B2 (en) | Thread switch upon spin loop detection by threshold count of spin lock reading load instruction | |
US20090235051A1 (en) | System and Method of Selectively Committing a Result of an Executed Instruction | |
US6965983B2 (en) | Simultaneously setting prefetch address and fetch address pipelined stages upon branch | |
US7711934B2 (en) | Processor core and method for managing branch misprediction in an out-of-order processor pipeline | |
EP3171264B1 (en) | System and method of speculative parallel execution of cache line unaligned load instructions | |
US6338133B1 (en) | Measured, allocation of speculative branch instructions to processor execution units | |
US6961847B2 (en) | Method and apparatus for controlling execution of speculations in a processor based on monitoring power consumption | |
US10001996B2 (en) | Selective poisoning of data during runahead | |
US20220113966A1 (en) | Variable latency instructions | |
US20210294639A1 (en) | Entering protected pipeline mode without annulling pending instructions | |
US20040225866A1 (en) | Branch prediction in a data processing system | |
GB2542831A (en) | Fetch unit for predicting target for subroutine return instructions | |
US7519799B2 (en) | Apparatus having a micro-instruction queue, a micro-instruction pointer programmable logic array and a micro-operation read only memory and method for use thereof | |
KR20070108936A (en) | Stop waiting for source operand when conditional instruction will not execute | |
EP3278212A1 (en) | Parallelized execution of instruction sequences based on premonitoring | |
WO2020214624A1 (en) | Variable latency instructions | |
WO2007084202A2 (en) | Processor core and method for managing branch misprediction in an out-of-order processor pipeline | |
US10296350B2 (en) | Parallelized execution of instruction sequences | |
US8266414B2 (en) | Method for executing an instruction loop and a device having instruction loop execution capabilities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: INTEL CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: MORROW, MICHAEL W.; REEL/FRAME: 014849/0298. Effective date: 20030909 |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |