US20170161106A1 - Providing thread fairness in a hyper-threaded microprocessor - Google Patents
- Publication number: US20170161106A1
- Authority
- US
- United States
- Prior art keywords
- thread
- entries
- processing element
- reservation
- pipeline
- Prior art date
- Legal status: Abandoned (the status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F9/00—Arrangements for program control, e.g. control units; G06F9/06—using stored programs
- G06F9/46—Multiprogramming arrangements; G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/30—Arrangements for executing machine instructions; G06F9/30098—Register arrangements; G06F9/30101—Special purpose registers
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead; G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution; G06F9/3851—Instruction issuing from multiple instruction streams, e.g. multistreaming
- G06F2209/00—Indexing scheme relating to G06F9/00; G06F2209/50—Indexing scheme relating to G06F9/50; G06F2209/5014—Reservation
Abstract
A method and apparatus for providing fairness in a multi-processing element environment is herein described. Mask elements are used to associate portions of a reservation station with each processing element, while still allowing common access to another portion of the reservation station entries. Additionally, bias logic biases selection of processing elements in a pipeline away from a processing element associated with a blocking stall to provide fair utilization of the pipeline.
Description
- This application is a Continuation of U.S. patent application Ser. No. 12/941,637, filed on Nov. 8, 2010, which is a Divisional of U.S. patent application Ser. No. 11/784,864, filed on Apr. 9, 2007, entitled “PROVIDING THREAD FAIRNESS IN A HYPER-THREADED MICROPROCESSOR”, now U.S. Pat. No. 8,521,993 issued on Aug. 27, 2013. This application is incorporated herein by reference in its entirety.
- This invention relates to the field of processors and, in particular, to providing resource fairness for processing elements.
- Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a result, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single processor die, where the processor die may include any number of processing elements, such as cores, threads, and/or logical processors.
- In processors with multiple threads, the behavior of one thread potentially affects the behavior of another thread on the same processor core due to sharing of resources and pipelines. Often, the behavior of one thread creates unfairness in the usage of the shared resources and pipelines. In fact, when one thread's performance significantly changes in relation to other threads on the same core, the unbalanced usage of shared resources often produces large and unpredictable variability in performance.
- For example, a reservation unit in a microprocessor is used to buffer instructions with corresponding operands for scheduling on execution units. In an out-of-order (OOO) processor, instructions may be scheduled out of order on execution units; however, some instructions are dependent on other instructions. As a result, when one thread schedules a long latency operation, such as a load operation that misses a cache, instructions that are dependent on the long latency operation reside in the reservation unit, while other threads' operations are efficiently de-allocated. This results in the reservation station being monopolized by the thread that scheduled the long latency operation, which potentially adversely affects the ability of other threads on the same core to schedule operations for execution.
- In addition, during some stages of a processor pipeline, one thread may cause a stall that does not allow other threads to continue processing during the stall. This behavior is often referred to as a blocking stall. As a result, one thread's stall potentially adversely affects other threads' performance in the pipeline.
- The present invention is illustrated by way of example and is not intended to be limited by the figures of the accompanying drawings.
- FIG. 1 illustrates an embodiment of a multi-resource processor capable of providing fair sharing of shared resources amongst multiple processing elements.
- FIG. 2 illustrates an embodiment of a reservation unit capable of dedicating entries to processing elements.
- FIG. 3 illustrates an embodiment of a pipeline capable of biasing processing element selection in response to stalls in the pipeline.
- FIG. 4 illustrates an embodiment of bias logic to provide Quality of Service (QoS) to processing elements.
- FIG. 5 illustrates another embodiment of bias logic to provide Quality of Service (QoS) to processing elements.
- In the following description, numerous specific details are set forth, such as examples of specific bias logic embodiments to provide fairness between processing elements, specific processor organization, specific pipeline stages, etc., in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well known components or methods, such as different varieties of pipelines, stall detection, processing element identification, processing element selection, and specific operational details of microprocessors, have not been described in detail in order to avoid unnecessarily obscuring the present invention.
- The method and apparatus described herein are for providing fairness between processing elements. Specifically, providing fairness is primarily discussed in reference to a microprocessor with multiple threads. However, the methods and apparatus for providing fairness are not so limited, as they may be implemented on or in association with any integrated circuit device or system, such as cell phones, personal digital assistants, embedded controllers, mobile platforms, desktop platforms, and server platforms, as well as in conjunction with any type of processing element, such as a core, hardware thread, software thread, logical processor, or other processing element.
- Referring to FIG. 1, an embodiment of a processor capable of providing fairness between two processing elements is illustrated. A processing element refers to a thread, a process, a context, a logical processor, a hardware thread, a core, and/or any other processing element that shares access to shared resources of the processor, such as reservation units, execution units, pipelines, and higher level caches/memory. A physical processor typically refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.
- A core often refers to logic located on an integrated circuit capable of maintaining an independent architectural state, such as the illustrated architecture states, where each independently maintained architectural state is associated with at least some dedicated execution resources.
- As can be seen, when certain resources are shared and others are dedicated to an architectural state, the line between the nomenclature of a hardware thread and a core overlaps. Yet often, a core and a hardware thread are viewed by an operating system as individual logical processors, where the operating system is able to individually schedule operations on each logical processor. In other words, software views two cores or threads on a physical processor as two independent processors. Additionally, each core potentially includes multiple hardware threads for executing multiple software threads. Therefore, a processing element includes any of the aforementioned elements capable of maintaining a context, such as cores, threads, hardware threads, virtual machines, or other resources, that share access to shared resources of a processor, such as a shared pipeline or shared reservation unit/station.
- In one embodiment, processor 100 is a multi-threaded processor capable of executing multiple threads in parallel. Here, a first thread is associated with architecture state registers 101 and a second thread is associated with architecture state registers 102; below, the two threads are referred to as thread 101 and thread 102. Thread 101 and thread 102 share access to one or more reservation units, which may be distributed in processor 100 or located in units, such as scheduler/execution module 140 or rename/allocator module 130. As discussed below, in one embodiment, portions of the reservation unit(s) are capable of being dedicated to each thread, shared amongst both threads, or reserved, i.e. not associated with either thread.
- In addition, a pipeline or a portion of a pipeline, such as a front-end or instruction decode portion of the pipeline, is shared by thread 101 and thread 102. Here, selection between thread 101 and thread 102 for use of the shared pipeline is capable of being biased to provide fairness, as discussed below.
- As illustrated, architecture state registers 101 are replicated in architecture state registers 102, so individual architecture states/contexts are capable of being stored for logical processor 101 and logical processor 102. Other smaller resources, such as instruction pointers and renaming logic in rename/allocator logic 130, may also be replicated for the two threads. Some resources, such as re-order buffers in reorder/retirement unit 135, ILTB 120, load/store buffers, and queues, may be shared through partitioning, while resources, such as general purpose internal registers, the page-table base register, low-level data-cache and data-TLB 150, execution unit(s) 140, and out-of-order unit 135, are potentially fully shared.
- Bus interface module 105 is to communicate with devices external to processor 100, such as system memory 175, a chipset, a northbridge, or other integrated circuit. Memory 175 may be dedicated to processor 100 or shared with other devices in a system. Examples of memory 175 include dynamic random access memory (DRAM), static RAM (SRAM), non-volatile memory (NV memory), and long-term storage.
- Typically, bus interface unit 105 includes input/output (I/O) buffers to transmit and receive bus signals on interconnect 170. Examples of interconnect 170 include a Gunning Transceiver Logic (GTL) bus, a GTL+ bus, a double data rate (DDR) bus, a pumped bus, a differential bus, a cache coherent bus, a point-to-point bus, a multi-drop bus, or other known interconnect implementing any known bus protocol. Bus interface unit 105, as shown, is also to communicate with higher level cache 110.
- Higher-level or further-out cache 110 is to cache recently fetched and/or operated-on elements. Note that higher-level or further-out refers to cache levels increasing or getting further away from the execution unit(s). In one embodiment, higher-level cache 110 is a second-level data cache. However, higher level cache 110 is not so limited, as it may be or include an instruction cache, which may also be referred to as a trace cache. A trace cache may instead be coupled after decoder 125 to store recently decoded instructions. Module 120 also potentially includes a branch target buffer to predict branches to be executed/taken and an instruction-translation buffer (I-TLB) to store address translation entries for instructions. Here, a processor capable of speculative execution potentially prefetches and speculatively executes predicted branches.
- Decode module 125 is coupled to fetch unit 120 to decode fetched elements. In one embodiment, processor 100 is associated with an Instruction Set Architecture (ISA), which defines/specifies instructions executable on processor 100. Here, machine code instructions recognized by the ISA often include a portion of the instruction referred to as an opcode, which references/specifies an instruction or operation to be performed.
- In one example, allocator and renamer block 130 includes an allocator to reserve resources, such as register files to store instruction processing results. However, thread 101 is potentially capable of out-of-order execution, where allocator and renamer block 130 also reserves other resources, such as reorder buffers to track instruction results. Unit 130 may also include a register renamer to rename program/instruction reference registers to other registers internal to processor 100. Reorder/retirement unit 135 includes components, such as the reorder buffers mentioned above, load buffers, and store buffers, to support out-of-order execution and later in-order retirement of instructions executed out-of-order. Scheduler and execution unit(s) block 140, in one embodiment, includes a scheduler unit to schedule instructions/operations on execution units. Instructions/operations are potentially scheduled on execution units according to their type and execution unit availability; for example, a floating point instruction is scheduled on a port of an execution unit that has an available floating point execution unit. Register files associated with the execution units are also included to store instruction processing results. Exemplary execution units include a floating point execution unit, an integer execution unit, a jump execution unit, a load execution unit, a store execution unit, and other known execution units.
- Lower level data cache and data translation buffer (D-TLB) 150 are coupled to execution unit(s) 140. The data cache is to store recently used/operated-on elements, such as data operands, which are potentially held in memory coherency states, such as the modified, exclusive, shared, and invalid (MESI) states. The D-TLB is to store recent virtual/linear to physical address translations. Typically, a D-TLB entry includes a virtual address, a physical address, and other information, such as an offset, to provide inexpensive translations for recently used virtual memory addresses.
- In FIG. 1, processor 100 is illustrated as a microprocessor with two logical processors, i.e. two hardware threads, where certain shared resources, such as a reservation unit and a pipeline, are capable of providing fairness between the two threads. However, processor 100 is not so limited. For example, processor 100 may be any processing element, such as an embedded processor, cell-processor, microprocessor, or other known processor, which includes any number of multiple cores/threads capable of executing multiple contexts, threads, virtual machines, etc.
processor 100. However, any of the modules/units illustrated inprocessor 100 may be configured in a different order/manner, may be excluded, as well as may overlap one another including portions of components that reside in multiple modules. For example, a reservation unit may be distributed inprocessor 100 including multiple smaller reservation tables in different modules ofprocessor 100. - Turning to
FIG. 2 , an embodiment of a reservation unit capable of providing fairness between processing elements that share access to the reservation unit is illustrated. Here,reservation unit 200 includes reservation entries 201-210. As an example, reservation unit includes 36 entries; however, any number of entries may be included. An exemplary range of entries include a range of 8 entries to 128 entries. - In one embodiment, reservation entries are to hold instruction information. Note that in many architectures, instructions are broken down into multiple micro-operation (micro-ops). As a result, the use of instruction information also includes micro-op information. Examples of instruction information include reservation information, dependency information, instruction identification information, result information, scheduling information, and any other information associated with instructions or micro-operations, reservation of resources, and/or reservation entries.
- For example, if a first entry referencing a first instruction is dependent upon a second instruction, the first entry includes dependency information to indicate it is dependent on the second instruction. As a result, the first instruction is not scheduled for execution until after the second instruction. Furthermore, the result from the second instruction may be held in a second entry, which is accessed when the instruction referenced in the first entry is scheduled for execution.
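The dependency behavior just described can be sketched in a few lines: an entry is eligible for scheduling only once every entry it depends on has produced a result. This is an illustrative sketch, not the patent's implementation; the names (Entry, schedule_ready) are assumptions.

```python
class Entry:
    """Hypothetical reservation entry holding minimal instruction information."""
    def __init__(self, op, depends_on=None):
        self.op = op                  # instruction/micro-op identifier
        self.depends_on = depends_on  # producing Entry, or None if independent
        self.result = None            # filled in when the producing op completes

def schedule_ready(entries):
    """Return ops whose dependencies have produced results (eligible for OOO issue)."""
    return [e.op for e in entries
            if e.depends_on is None or e.depends_on.result is not None]

load = Entry("load")                  # long-latency op; result not yet available
add = Entry("add", depends_on=load)   # dependent op must wait in the entry
mov = Entry("mov")                    # independent op may schedule out of order
print(schedule_ready([load, add, mov]))  # ['load', 'mov']
```

Once the load's result is written into its entry, the dependent add becomes eligible; until then it simply occupies a reservation entry, which is exactly how a long-latency operation lets one thread fill the shared reservation unit.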
- Processing elements 220 and 230 share access to reservation unit 200. Thread 220 is associated with storage element 221 and thread 230 is associated with storage element 226. In one embodiment, storage elements 221 and 226 are masks, where each field of mask 221 is associated with a first number of reservation entries. As illustrated, field 222 is associated with two entries, i.e. 201 and 202. However, a field or any number of bits/fields may be associated with any number of reservation entries. As an example, a one-to-one relationship may exist between fields and entries, or a one-to-two, one-to-three, one-to-four, one-to-eight, or other ratio may exist between fields and entries.
- Here, when field 222 holds a first value, such as a logical one, entries 201 and 202 are associated with thread 220. In other words, when field 222 holds the first value, thread 220 may utilize entries 201 and 202 of reservation unit 200. Furthermore, when a field, such as field 223, holds a second value, such as a logical zero, thread 220 is not associated with the corresponding entries, i.e. thread 220 is not able to utilize those entries.
- Second storage element 226 is associated with thread 230. Similar to field 222, field 227 is also associated with entries 201 and 202. However, field 227 holds the second value, i.e. a logical zero, to indicate that thread 230 is not associated with entries 201 and 202. As a result, entries 201 and 202 are dedicated to thread 220, as field 222 indicates thread 220 may access entries 201 and 202, while field 227 indicates that thread 230 may not access them.
- As illustrated, the combination of masks 221 and 226 indicates that entries 201-204 are dedicated to thread 220, entries 205-208 are dedicated to thread 230, and entries 209-210 are associated with both thread 220 and thread 230. Consequently, if thread 230 encounters a long latency instruction, then thread 230 is only able to utilize entries 205-210, instead of filling up reservation unit 200 with dependent instructions. Therefore, thread 220 is still able to utilize dedicated entries 201-204, instead of thread 230 monopolizing all of reservation unit 200 and adversely affecting thread 220's performance. As can be seen, reservation unit 200 provides fairness by ensuring that at least some number of entries is available to each processing element.
- Also note that masks 221 and 226 may also reserve entries: when the fields of both masks that correspond to given entries hold the second value, thread 220 and thread 230 are not associated with those entries, e.g. entries 209-210.
- In another embodiment, storage elements 221 and 226 are counters to track the number of entries allocated to threads 220 and 230. Here, to ensure fair use of reservation unit 200, a thread is allocated entries only when its current number of entries in use is below a threshold value. Upon allocating entries, the counters are incremented, and upon de-allocating the entries, the counters are decremented.
- Above, the examples utilized a logical one and a logical zero as first and second values, respectively. However, any values may be held in fields to indicate that an associated resource is or is not associated with reservation entries. Furthermore, there may be any number of storage elements associated with any number of processing elements, which are illustrated as threads, but may include any resource that shares access to reservation unit 200.
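The two reservation-partitioning embodiments above (per-thread masks, and per-thread counters with a threshold) can be sketched as follows. This is an illustrative sketch under assumed values: the 10-entry size mirrors FIG. 2, but the mask constants, threshold, and function names are not from the patent.

```python
ENTRIES = 10  # entries 201-210 of FIG. 2, indexed 0-9 here

# Mask embodiment: bit i set in a thread's mask -> that thread may use entry i.
mask_220 = 0b1100001111  # thread 220: dedicated entries 0-3, shared entries 8-9
mask_230 = 0b1111110000  # thread 230: dedicated entries 4-7, shared entries 8-9

def classify(i):
    """Dedicated, shared, or reserved status of entry i under the two masks."""
    bits = (mask_220 >> i & 1, mask_230 >> i & 1)
    return {(1, 0): "dedicated-220", (0, 1): "dedicated-230",
            (1, 1): "shared", (0, 0): "reserved"}[bits]

# Counter embodiment: a thread may allocate a new entry only while its count
# of in-use entries is below a threshold (the threshold value is an assumption).
THRESHOLD = 6
in_use = {220: 0, 230: 0}

def try_allocate(thread):
    if sum(in_use.values()) >= ENTRIES or in_use[thread] >= THRESHOLD:
        return False              # thread must wait; the other keeps its share
    in_use[thread] += 1           # counter incremented on allocation
    return True

def deallocate(thread):
    in_use[thread] -= 1           # counter decremented on de-allocation
```

In either variant, a thread stalled on a long-latency load can exhaust only its own quota of entries, so the other thread always retains entries in which to buffer and schedule its operations.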
- Turning to FIG. 3, an embodiment of a pipeline capable of providing fairness between processing elements is illustrated. A pipeline often refers to a number of elements or stages coupled together in series, where the output of one stage is the input of the next stage. For example, an oversimplified pipeline includes four stages: fetch, decode, out-of-order execution, and retire. Note that pipeline 303 includes any number of stages. In addition, pipeline 303 may represent a portion of a pipeline, such as a front-end portion, back-end portion, or other portion, as well as an entire pipeline. Stages 305-330 include any known pipeline stages, such as resource selection, instruction decode, allocation, rename, execution, retire, or other pipeline stages.
- Often, stalls in pipeline 303 affect both the performance of individual processing elements and fairness between processing elements. Non-blocking stalls in pipeline 303 potentially allow processing by other processing elements to continue or to interrupt the stall. Therefore, with a non-blocking stall associated with thread 301, thread 302 may still use pipeline 303, so no biasing is needed to provide fairness. A blocking stall, however, typically refers to a stall or delay in a stage of a pipeline that blocks execution of other processing elements in that stage. Here, a blocking stall blocks execution in the stage for both thread 301 and thread 302, even if only one of the threads caused the stall.
- Previously, selection logic 305 alternates selection between thread 301 and thread 302 for fair access to pipeline 303. Consequently, in response to a blocking stall on thread 301, bias logic 360 biases selection logic 305 away from selecting thread 301 for a period of time or a number of cycles to compensate for the blocking stall.
- For example, assume stage 320 is an instruction length decoder (ILD) stage. Typically, common length instructions are decoded quickly, such as determining the start and end of an instruction in a single block of data bytes within a single cycle. However, when a length changing prefix (LCP) is detected, a slower length decode process is invoked. As an illustrative example, a single block of instructions is decoded unit by unit, which results in a stall of a number of cycles, such as seven cycles. Here, assume the LCP is associated with thread 301. Therefore, as the slower decode process is not to be interrupted, stage 320 is blocked, i.e. other processing elements, such as thread 302, are not able to determine decode lengths of instructions in stage 320 for the number of cycles of the blocking stall. Essentially, thread 301 blocks pipeline 303 for a number of cycles.
- Consequently, bias logic 360 is to bias selection in stage 305 to provide fairness in pipeline 303. Continuing the example from above, a blocking stall associated with thread 301 is detected with detection logic 350. Detection logic 350 may be independent logic for detecting stalls or logic within a stage for detecting a stall event. For example, logic to detect a Length Changing Prefix (LCP) may be part of detection logic 350, as it detects a blocking stall event. Here, assume the blocking stall lasts for seven execution cycles.
- As a result, bias logic 360 biases selection logic 305 away from thread 301 for a period of time or for a number of cycles after the blocking stall has concluded, to provide fair access for thread 302 to pipeline 303. For example, bias logic 360 biases selection logic 305 to select thread 302, i.e. away from thread 301, for the next seven cycles. However, thread 302 may be selected for any number of cycles to provide fairness, depending on the implementation.
- Selecting away from thread 301, i.e. biasing selection more toward thread 302, provides more access for thread 302 to pipeline 303 to make up for the stall cycles during which thread 301 monopolized pipeline 303. As shown, providing fairness through biasing selection 305 may take place subsequent to a blocking stall's conclusion. Since the goal is to ensure reasonably equal access to pipeline 303 over time, biasing selection logic 305 may take place immediately subsequent to completion of a blocking stall or during subsequent cycles.
- In an alternate embodiment, biasing selection 305 away from thread 301 begins immediately after detection logic 350 detects the beginning of a blocking stall associated with thread 301. For example, pipe stages 310 and 315 are cleared or flushed, and thread 302 is allowed to advance into stages 310 and 315 behind the stalled stage 320. Therefore, if stages 310 and 315 hold thread 301 information, recovering fairness may begin earlier by allowing thread 302 to populate stages 310 and 315.
- Providing fairness may, but does not necessarily, equate to equal time or cycles for each thread in pipeline 303. For example, if thread 301 creates a blocking stall that lasts seven cycles, then theoretically, bias logic 360 should bias toward thread 302 for seven cycles. However, in one embodiment, biasing away from thread 301 or toward thread 302 includes any amount of biasing. To illustrate, after a seven cycle blocking stall, bias logic 360, depending on the implementation, may bias toward thread 302 for only four cycles, instead of the full seven. Also note that the bias algorithm utilized by bias logic 360 may be statically set for stalls of known length and dynamically adjustable for stalls of unknown length. In one embodiment, biasing away from a first processing element includes selecting other processing elements more often than the first processing element.
- Also note that the example above assumes both threads have activity, i.e. operations to be selected for pipeline 303. In other words, bias logic 360 is to bias selection logic 305, not to force selection logic 305 to select a processing element. For example, assume bias logic 360 outputs values to suggest or bias selection logic 305 toward selecting thread 302 six out of eight cycles, as discussed above. However, if thread 302 has no activity for those cycles, while thread 301 does have activity for the eight cycles, then selection logic 305 may select thread 301, so as not to waste execution cycles.
- Referring next to FIG. 4, an embodiment of bias logic to provide fairness in a pipeline is illustrated. Similar to FIG. 3, pipeline 403 includes stages 410-430 and detection logic 450 to detect a blocking stall, such as a Length Changing Prefix (LCP) in an Instruction Length Decode (ILD) stage. Here, detection logic 450 detects a blocking stall associated with thread 402. Control 465 sets storage elements 470 and 475 to bias selection logic 405 in response to detecting the blocking stall.
- In one embodiment, a blocking stall, such as an LCP blocking stall, results in a stall for a specific, set number of cycles, such as seven cycles. Here, control 465 sets bias storage element 470 to a predefined pattern to bias selection logic 405. As shown, bias element 470 includes 6 bits; however, any size element may be used. For example, bias element 470 may be a 16-bit shift register holding a bit pattern that repeatedly biases toward thread 401 twice and thread 402 once. In this example, bias logic 460 is capable of biasing selection logic 405 for up to the 16 cycles of the shift register.
- In one embodiment, the pattern is determined by control 465 XORing a bias value with a thread ID of thread 402, which is associated with the stall. As a first example, the XOR is performed on the load of bias element 470. As another example, the XOR is performed on the output of bias element 470. In addition to the bias value/pattern loaded in bias storage element 470, corresponding valid values are loaded in valid storage element 475. Valid element 475 includes fields corresponding to the bias/thread fields of element 470 to form entries, such as head entry 480 and tail entry 481.
- To illustrate, assume a seven cycle blocking stall associated with an LCP from thread 402 is detected. A pattern, such as 001001, is loaded in bias element 470, and 111111 is loaded in valid element 475. Here, a logical value of 0 held in a thread field of bias element 470 represents thread 401, while a logical value of 1 represents thread 402. Additionally, a one held in valid element 475 indicates the corresponding bias field is valid, and a 0 indicates it is invalid. During a subsequent cycle, head entry 480 is shifted out to selection logic 405. Entry 480 currently holds a logical 0 representing thread 401 and a logical 1 representing that the bias is valid. As a result, selection logic 405 selects thread 401 in response to the thread value indicating thread 401 and the valid value indicating the thread value is valid.
- In addition to shifting out entry 480, in one embodiment, a zero is shifted into tail entry 481 of valid element 475 to indicate tail entry 481 is now invalid. Selection continues with the thread indicated by each entry shifted out of bias element 470. However, as stated above, if a thread, i.e. thread 401, does not have any activity during a cycle in which bias logic 460 indicates thread 401 is to be selected, then thread 402 may be selected to ensure pipeline 403 is efficiently utilized.
- Now, assume the valid field in entry 480 holds a logical zero. When entry 480 is shifted out to selection logic 405, the bias field is determined to be invalid. Consequently, selection logic 405 is able to make a normal selection between threads 401 and 402, i.e. in one embodiment, selection logic 405 normally selects the opposite of the thread selected last cycle.
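The shift-register scheme can be sketched as two parallel queues of bias bits and valid bits: each cycle the head entry is shifted out, a valid head forces the indicated thread, and an invalid head lets normal alternation proceed. This is a sketch under assumptions, not the patent's circuit; the base pattern, register width, and function names are illustrative, and the XOR-on-load variant is the one modeled.

```python
from collections import deque

def load_bias(base_pattern, stalling_tid):
    """XOR a base bit pattern with the stalling thread's ID (0 or 1) on load,
    so the resulting pattern points away from the stalling thread."""
    bias = deque(b ^ stalling_tid for b in base_pattern)
    valid = deque(1 for _ in base_pattern)   # all entries initially valid
    return bias, valid

def select(bias, valid, last_selected):
    """One selection cycle: honor a valid head bias bit, else alternate."""
    b, v = bias.popleft(), valid.popleft()
    bias.append(0)
    valid.append(0)          # a zero shifts into the tail: it is now invalid
    if v:
        return b             # valid bias: select the indicated thread
    return 1 - last_selected  # invalid bias: opposite of last cycle's thread
```

Run against a six-entry pattern, the first six selections follow the loaded bias bits, after which the drained (all-invalid) register falls back to simple alternation.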
FIG. 5 illustrates another embodiment of bias logic to bias away from selection in a pipeline of a processing element associated with a blocking stall for providing fairness in the pipeline. As illustrated,threads pipeline 503 includingstages 510 through 530. As an example,stage 530 contains an instruction allocator that maintains the mapping of the thread's architectural register state to the internal physical registers.Detection logic 550 is to detect a blocking stall. Additional examples of blocking stalls include a partial register stall, such as a write to a subset of a register and a subsequent read of the entire register, and a branch stall to recover the architecture to physical register mapping after a mispredicted branch. Often these examples of blocking stalls are seen in an instruction queue read stage and/or stages of an allocation pipeline. In one embodiment, these examples of blocking stalls are variable in length, such as from 1 cycle up to 25 cycles and potentially larger. In the embodiment illustrated inFIG. 4 , a variable pattern may be loaded inbias element 470 to compensate for the variable length stalls. - However here, counter 570 and
corresponding resource field 575 are utilized to bias selection of threads by selection logic 505. In one embodiment, counter 570 is to be set to a default value of zero. In response to detecting a blocking stall in stage 530, counter 570 is updated in a first direction, such as incrementing the counter, for each cycle of the blocking stall. Note the counter may instead be set to a default integer value greater than zero and decremented. In one embodiment, resource field 575 is to store a value representing the processing element that is associated with the stall. For example, if a branch misprediction is associated with thread 501, resource field 575 is to hold a thread ID or other value representing thread 501. In an alternative embodiment, resource field 575 is to hold a value representing the processing element to be selected, based on which processing element the stall is associated with. For example, if a branch misprediction is associated with thread 501, then resource field 575 is to hold a value representing thread 502, as thread 502 is to be selected more often to provide fairness in pipeline 503. - Below, Table 1 illustrates the operation of counter 570, control 565, and resource field 575. In the first cycle, a blocking stall, such as a partial register stall or a branch misprediction stall, is detected on thread 502. For each of the 5 cycles of the stall, control 565 increments counter 570, i.e. in the first cycle from 0 to 1, and so on, to a counter value of 5. Control logic 565 loads resource field 575 with a value to represent thread 501, which is the thread to be selected in order to provide fairness in response to the stall associated with thread 502. - After the blocking stall is complete in cycle 5,
selection logic 505 selects thread 501 based on the thread/bias value from resource field 575. In response to selecting thread 501 in cycles 6-8, counter 570 is decremented on each selection by control 565, down to a value of two. In cycle 9, thread 501 is associated with a blocking stall. However, instead of incrementing counter 570 for each cycle, control 565 recognizes that thread 501 is identified in resource field 575. Therefore, a stall in cycle 9 by thread 501 is permitted due to the unfairness of the previous stall by thread 502, which has not been fully compensated for. As a result, the value held in counter 570 is decremented in cycles 9 and 10. When the counter reaches the default value of 0, it begins to increment again; however, control 565 now sets resource field 575 to represent thread 502, to bias selection logic 505 away from thread 501. Upon completion of the stall, selection logic 505 selects thread 502 and control 565 decrements counter 570 until it reaches zero. Once at zero, selection logic 505 may return to normal selection. -
TABLE 1: Illustrative embodiment of bias counter

Cycle | Event | Counter | Resource
---|---|---|---
1 | Thread 502 5-cycle stall | 1 | 501
2 | | 2 | 501
3 | | 3 | 501
4 | | 4 | 501
5 | | 5 | 501
6 | Select thread 501 | 4 | 501
7 | Select thread 501 | 3 | 501
8 | Select thread 501 | 2 | 501
9 | Thread 501 5-cycle stall | 1 | 501
10 | | 0 | |
11 | | 1 | 502
12 | | 2 | 502
13 | | 3 | 502
14 | Select thread 502 | 2 | 502
15 | Select thread 502 | 1 | 502
16 | | 0 | |

- As illustrated above, fairness is provided in shared resources, such as reservation stations and pipelines, for processing elements, such as threads on a core. Instead of a long-latency instruction and a chain of dependent instructions monopolizing a reservation station, portions of the reservation station may be allocated/dedicated to processing elements to ensure each processing element is able to continue operation. In addition, instead of a blocking stall monopolizing a pipeline and then returning to alternating processing element selection, bias logic biases the selection logic to provide fairness between processing elements over time.
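The counter behavior traced in Table 1 can be reproduced with a short behavioral model. This is a sketch, not the patent's circuit: the function name and event encoding are invented, while the thread IDs (501/502), the 5-cycle stalls, and the resulting counter sequence come from the table above.

```python
# Behavioral sketch of counter 570, control 565, and resource field 575
# (FIG. 5 / Table 1). Names and the event encoding are illustrative.

def run_bias_counter(events, threads=(501, 502)):
    """events: one entry per cycle, either ('stall', tid) or None.
    Returns a per-cycle trace of (counter, resource) values."""
    counter, resource = 0, None
    other = lambda t: threads[1] if t == threads[0] else threads[0]
    trace = []
    for ev in events:
        if ev is not None:
            _, tid = ev
            if resource == tid and counter > 0:
                # The stalling thread is owed fairness from an earlier stall:
                # burn down that debt before biasing against it.
                counter -= 1
                if counter == 0:
                    resource = None
            else:
                # Normal case: count the stall cycle and favor the other thread.
                counter += 1
                resource = other(tid)
        elif counter > 0:
            # Selection logic picks the favored thread; one unit of debt repaid.
            counter -= 1
            if counter == 0:
                resource = None
        trace.append((counter, resource))
    return trace

# Cycles 1-5: thread 502 stalls; 6-8: select 501; 9-13: thread 501 stalls;
# 14-16: select 502 until the counter returns to zero.
events = [('stall', 502)] * 5 + [None] * 3 + [('stall', 501)] * 5 + [None] * 3
trace = run_bias_counter(events)
# Counter per cycle: 1,2,3,4,5,4,3,2,1,0,1,2,3,2,1,0 — matching Table 1.
```

Note how cycles 9-10 decrement rather than increment: the model, like the text, lets thread 501 stall "for free" until the earlier unfairness toward it has been repaid.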
- The embodiments of methods, software, firmware or code set forth above may be implemented via instructions or code stored on a machine-accessible or machine readable medium which are executable by a processing element. A machine-accessible/readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form readable by a machine, such as a computer or electronic system. For example, a machine-accessible medium includes random-access memory (RAM), such as static RAM (SRAM) or dynamic RAM (DRAM); ROM; magnetic or optical storage medium; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals); etc.
- Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
- In the foregoing specification, a detailed description has been given with reference to specific exemplary embodiments. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense. Furthermore, the foregoing use of embodiment and other exemplary language does not necessarily refer to the same embodiment or the same example, but may refer to different and distinct embodiments, as well as potentially the same embodiment.
Claims (6)
1. An apparatus comprising:
an execution unit to execute a plurality of instructions;
a reservation unit coupled to the execution unit, wherein the reservation unit is to hold instruction information associated with the plurality of instructions in a plurality of reservation entries; and
a first storage element to include a first mask field associated with a first number of reservation entries of the plurality of reservation entries, the first mask field, when holding a first value, to indicate the first number of reservation entries are associated with a first processing element.
2. The apparatus of claim 1 , wherein the first processing element is selected from a group consisting of a thread, a logical processor, and a core.
3. The apparatus of claim 1 , wherein instruction information includes a plurality of information elements, wherein each of the plurality of information elements are selected from a group consisting of dependency information, instruction identification information, result information, and scheduling information.
4. The apparatus of claim 1 , wherein the first number of entries is an even number of entries.
5. The apparatus of claim 1 , further comprising a second storage element to include a second mask field associated with the first number of reservation entries, wherein
when the first mask field holds the first value and the second mask field holds the first value, the first number of reservation entries are associated with the first processing element and a second processing element;
when the first mask field holds the first value and the second mask field holds a second value, the first number of reservation entries are associated with the first processing element and not with the second processing element;
when the first mask field holds the second value and the second mask field holds the second value, the first number of reservation entries are not associated with the first processing element and are not associated with the second processing element; and
when the first mask field holds the second value and the second mask field holds the first value, the first number of reservation entries are associated with the second processing element and not with the first processing element.
6. The apparatus of claim 5 , wherein the first and second storage elements are registers.
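The four mask-bit combinations enumerated in claim 5 amount to a two-bit ownership decode per group of reservation entries. A minimal sketch, assuming the first value is 1 and the second value is 0 (the claims leave the actual logic levels open), with `owners` and the `PE1`/`PE2` labels invented for illustration:

```python
# Illustrative decode of the two per-group mask bits from claim 5.
# FIRST_VALUE/SECOND_VALUE stand in for whatever logic levels an
# implementation chooses; the claims do not fix them.
FIRST_VALUE, SECOND_VALUE = 1, 0

def owners(mask1, mask2):
    """Return the set of processing elements allowed to use the entry group."""
    result = set()
    if mask1 == FIRST_VALUE:
        result.add('PE1')   # first mask field: first processing element
    if mask2 == FIRST_VALUE:
        result.add('PE2')   # second mask field: second processing element
    return result

# The four cases enumerated in claim 5:
assert owners(1, 1) == {'PE1', 'PE2'}   # shared by both processing elements
assert owners(1, 0) == {'PE1'}          # dedicated to the first
assert owners(0, 1) == {'PE2'}          # dedicated to the second
assert owners(0, 0) == set()            # allocated to neither entry owner
```

Under this reading, each mask bit independently grants one processing element access to the group, which is what lets the reservation station be partitioned, shared, or withheld per group.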
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/385,823 US20170161106A1 (en) | 2007-04-09 | 2016-12-20 | Providing thread fairness in a hyper-threaded microprocessor |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/784,864 US8521993B2 (en) | 2007-04-09 | 2007-04-09 | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
US12/941,637 US9524191B2 (en) | 2007-04-09 | 2010-11-08 | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
US15/385,823 US20170161106A1 (en) | 2007-04-09 | 2016-12-20 | Providing thread fairness in a hyper-threaded microprocessor |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/941,637 Continuation US9524191B2 (en) | 2007-04-09 | 2010-11-08 | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170161106A1 true US20170161106A1 (en) | 2017-06-08 |
Family
ID=39827996
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/784,864 Active 2027-10-08 US8521993B2 (en) | 2007-04-09 | 2007-04-09 | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
US12/941,637 Expired - Fee Related US9524191B2 (en) | 2007-04-09 | 2010-11-08 | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
US12/941,615 Expired - Fee Related US8438369B2 (en) | 2007-04-09 | 2010-11-08 | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
US15/385,823 Abandoned US20170161106A1 (en) | 2007-04-09 | 2016-12-20 | Providing thread fairness in a hyper-threaded microprocessor |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/784,864 Active 2027-10-08 US8521993B2 (en) | 2007-04-09 | 2007-04-09 | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
US12/941,637 Expired - Fee Related US9524191B2 (en) | 2007-04-09 | 2010-11-08 | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
US12/941,615 Expired - Fee Related US8438369B2 (en) | 2007-04-09 | 2010-11-08 | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
Country Status (1)
Country | Link |
---|---|
US (4) | US8521993B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10503550B2 (en) | 2017-09-30 | 2019-12-10 | Intel Corporation | Dynamic performance biasing in a processor |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8521993B2 (en) * | 2007-04-09 | 2013-08-27 | Intel Corporation | Providing thread fairness by biasing selection away from a stalling thread using a stall-cycle counter in a hyper-threaded microprocessor |
US8219996B1 (en) * | 2007-05-09 | 2012-07-10 | Hewlett-Packard Development Company, L.P. | Computer processor with fairness monitor |
EP2159692A4 (en) * | 2007-06-20 | 2010-09-15 | Fujitsu Ltd | Information processor and load arbitration control method |
US8332854B2 (en) * | 2009-05-19 | 2012-12-11 | Microsoft Corporation | Virtualized thread scheduling for hardware thread optimization based on hardware resource parameter summaries of instruction blocks in execution groups |
US8347309B2 (en) * | 2009-07-29 | 2013-01-01 | Oracle America, Inc. | Dynamic mitigation of thread hogs on a threaded processor |
US9135215B1 (en) * | 2009-09-21 | 2015-09-15 | Tilera Corporation | Route prediction in packet switched networks |
US9465670B2 (en) * | 2011-12-16 | 2016-10-11 | Intel Corporation | Generational thread scheduler using reservations for fair scheduling |
US9606800B1 (en) * | 2012-03-15 | 2017-03-28 | Marvell International Ltd. | Method and apparatus for sharing instruction scheduling resources among a plurality of execution threads in a multi-threaded processor architecture |
US9665375B2 (en) | 2012-04-26 | 2017-05-30 | Oracle International Corporation | Mitigation of thread hogs on a threaded processor and prevention of allocation of resources to one or more instructions following a load miss |
US9069629B2 (en) * | 2013-03-15 | 2015-06-30 | International Business Machines Corporation | Bidirectional counting of dual outcome events |
WO2014146279A1 (en) * | 2013-03-21 | 2014-09-25 | Telefonaktiebolaget L M Ericsson (Publ) | Method and device for scheduling communication schedulable unit |
US9367472B2 (en) | 2013-06-10 | 2016-06-14 | Oracle International Corporation | Observation of data in persistent memory |
US20160011874A1 (en) * | 2014-07-09 | 2016-01-14 | Doron Orenstein | Silent memory instructions and miss-rate tracking to optimize switching policy on threads in a processing device |
US9971565B2 (en) * | 2015-05-07 | 2018-05-15 | Oracle International Corporation | Storage, access, and management of random numbers generated by a central random number generator and dispensed to hardware threads of cores |
US20160378497A1 (en) * | 2015-06-26 | 2016-12-29 | Roger Gramunt | Systems, Methods, and Apparatuses for Thread Selection and Reservation Station Binding |
US10235171B2 (en) * | 2016-12-27 | 2019-03-19 | Intel Corporation | Method and apparatus to efficiently handle allocation of memory ordering buffers in a multi-strand out-of-order loop processor |
US11734585B2 (en) | 2018-12-10 | 2023-08-22 | International Business Machines Corporation | Post-hoc improvement of instance-level and group-level prediction metrics |
US11294724B2 (en) | 2019-09-27 | 2022-04-05 | Advanced Micro Devices, Inc. | Shared resource allocation in a multi-threaded microprocessor |
US11144353B2 (en) | 2019-09-27 | 2021-10-12 | Advanced Micro Devices, Inc. | Soft watermarking in thread shared resources implemented through thread mediation |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9524191B2 (en) * | 2007-04-09 | 2016-12-20 | Intel Corporation | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5752031A (en) * | 1995-04-24 | 1998-05-12 | Microsoft Corporation | Queue object for controlling concurrency in a computer system |
US5987598A (en) * | 1997-07-07 | 1999-11-16 | International Business Machines Corporation | Method and system for tracking instruction progress within a data processing system |
US6233599B1 (en) * | 1997-07-10 | 2001-05-15 | International Business Machines Corporation | Apparatus and method for retrofitting multi-threaded operations on a computer by partitioning and overlapping registers |
US6697935B1 (en) * | 1997-10-23 | 2004-02-24 | International Business Machines Corporation | Method and apparatus for selecting thread switch events in a multithreaded processor |
US6535905B1 (en) * | 1999-04-29 | 2003-03-18 | Intel Corporation | Method and apparatus for thread switching within a multithreaded processor |
US6898694B2 (en) * | 2001-06-28 | 2005-05-24 | Intel Corporation | High instruction fetch bandwidth in multithread processor using temporary instruction cache to deliver portion of cache line in subsequent clock cycle |
US6980439B2 (en) * | 2001-12-27 | 2005-12-27 | Intel Corporation | EMI shield for transceiver |
US8024735B2 (en) * | 2002-06-14 | 2011-09-20 | Intel Corporation | Method and apparatus for ensuring fairness and forward progress when executing multiple threads of execution |
US7401207B2 (en) * | 2003-04-25 | 2008-07-15 | International Business Machines Corporation | Apparatus and method for adjusting instruction thread priority in a multi-thread processor |
US7610473B2 (en) * | 2003-08-28 | 2009-10-27 | Mips Technologies, Inc. | Apparatus, method, and instruction for initiation of concurrent instruction streams in a multithreading microprocessor |
JP2006023963A (en) | 2004-07-07 | 2006-01-26 | Fujitsu Ltd | Wireless ic tag reader/writer, wireless ic tag system and wireless ic tag data writing method |
US7447961B2 (en) * | 2004-07-29 | 2008-11-04 | Marvell International Ltd. | Inversion of scan clock for scan cells |
US8364897B2 (en) * | 2004-09-29 | 2013-01-29 | Intel Corporation | Cache organization with an adjustable number of ways |
US7631308B2 (en) * | 2005-02-11 | 2009-12-08 | International Business Machines Corporation | Thread priority method for ensuring processing fairness in simultaneous multi-threading microprocessors |
US7917907B2 (en) * | 2005-03-23 | 2011-03-29 | Qualcomm Incorporated | Method and system for variable thread allocation and switching in a multithreaded processor |
US8041929B2 (en) * | 2006-06-16 | 2011-10-18 | Cisco Technology, Inc. | Techniques for hardware-assisted multi-threaded processing |
US7899994B2 (en) * | 2006-08-14 | 2011-03-01 | Intel Corporation | Providing quality of service (QoS) for cache architectures using priority information |
US8458711B2 (en) * | 2006-09-25 | 2013-06-04 | Intel Corporation | Quality of service implementation for platform resources |
US8402253B2 (en) * | 2006-09-29 | 2013-03-19 | Intel Corporation | Managing multiple threads in a single pipeline |
US8645666B2 (en) * | 2006-12-28 | 2014-02-04 | Intel Corporation | Means to share translation lookaside buffer (TLB) entries between different contexts |
-
2007
- 2007-04-09 US US11/784,864 patent/US8521993B2/en active Active
-
2010
- 2010-11-08 US US12/941,637 patent/US9524191B2/en not_active Expired - Fee Related
- 2010-11-08 US US12/941,615 patent/US8438369B2/en not_active Expired - Fee Related
-
2016
- 2016-12-20 US US15/385,823 patent/US20170161106A1/en not_active Abandoned
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9524191B2 (en) * | 2007-04-09 | 2016-12-20 | Intel Corporation | Apparatus including a stall counter to bias processing element selection, and masks to allocate reservation unit entries to one or more processing elements |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10503550B2 (en) | 2017-09-30 | 2019-12-10 | Intel Corporation | Dynamic performance biasing in a processor |
Also Published As
Publication number | Publication date |
---|---|
US20110055524A1 (en) | 2011-03-03 |
US9524191B2 (en) | 2016-12-20 |
US20110055525A1 (en) | 2011-03-03 |
US20080250233A1 (en) | 2008-10-09 |
US8521993B2 (en) | 2013-08-27 |
US8438369B2 (en) | 2013-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20170161106A1 (en) | Providing thread fairness in a hyper-threaded microprocessor | |
US8095932B2 (en) | Providing quality of service via thread priority in a hyper-threaded microprocessor | |
JP5894120B2 (en) | Zero cycle load | |
US9940132B2 (en) | Load-monitor mwait | |
US9026705B2 (en) | Interrupt processing unit for preventing interrupt loss | |
US8335911B2 (en) | Dynamic allocation of resources in a threaded, heterogeneous processor | |
US8429386B2 (en) | Dynamic tag allocation in a multithreaded out-of-order processor | |
US9286075B2 (en) | Optimal deallocation of instructions from a unified pick queue | |
US9058180B2 (en) | Unified high-frequency out-of-order pick queue with support for triggering early issue of speculative instructions | |
US8099566B2 (en) | Load/store ordering in a threaded out-of-order processor | |
US8407421B2 (en) | Cache spill management techniques using cache spill prediction | |
US9690625B2 (en) | System and method for out-of-order resource allocation and deallocation in a threaded machine | |
US8904156B2 (en) | Perceptron-based branch prediction mechanism for predicting conditional branch instructions on a multithreaded processor | |
US9454218B2 (en) | Apparatus, method, and system for early deep sleep state exit of a processing element | |
US8458446B2 (en) | Accessing a multibank register file using a thread identifier | |
US20110113199A1 (en) | Prefetch optimization in shared resource multi-core systems | |
US20120233442A1 (en) | Return address prediction in multithreaded processors | |
US8347309B2 (en) | Dynamic mitigation of thread hogs on a threaded processor | |
US8635621B2 (en) | Method and apparatus to implement software to hardware thread priority | |
US20120297167A1 (en) | Efficient call return stack technique | |
US20130024647A1 (en) | Cache backed vector registers | |
WO2019005105A1 (en) | Speculative memory activation | |
KR20100111700A (en) | System and method for performing locked operations | |
US20130138888A1 (en) | Storing a target address of a control transfer instruction in an instruction field | |
CN111752616A (en) | System, apparatus and method for symbolic memory address generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |