CN101051266A - Processor with dummy multithread - Google Patents

Publication number
CN101051266A
Authority
CN
China
Prior art keywords
thread
processor
virtual
threads
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200710104280
Other languages
Chinese (zh)
Inventor
N·英赖特
J·科林斯
P·王
H·王
X·田
J·沈
G·肖弗
P·哈马伦德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 11/070,991 (US7587584B2)
Application filed by Intel Corp
Publication of CN101051266A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

Method, apparatus, and program means for a programmable event driven yield mechanism that may activate other threads. In one embodiment, an apparatus includes execution resources to execute a plurality of instructions and an event detector to detect a long latency event associated with a synchronization object. The event detector can cause a first thread switch in response to the long latency event associated with the synchronization object. The apparatus may also include a spin detector to detect that the synchronization object is a contended synchronization object. The spin detector can cause a second thread switch in response to the detection of the contended synchronization object to enable a spin detect response.

Description

Processor with dummy multithread
Technical field
The present invention relates to the field of processing devices and systems that process sequences of instructions, and to the particular instruction sequences used to program such devices and/or systems. Some embodiments relate to monitoring and/or responding to conditions or events within the execution resources of such a processing device.
Background technology
Various mechanisms are presently used to change the flow of control (i.e., the processing path or instruction sequence being followed) in a processing system. For example, a jump instruction in a program sequence explicitly causes a jump to a new address. The jump instruction is an example of an explicit change of control flow, because the instruction directs the processor to jump to a location and continue executing from that point. A traditional jump instruction is "precise" (or synchronous) because the jump occurs as a direct result of executing the jump instruction.
Another traditional example of a change in control flow is an interrupt. An interrupt may be an external signal provided to a device such as a processor. The processor responds by jumping to an interrupt handler, a routine that handles the event signaled by the particular interrupt. Interrupts are typically also relatively precise, because the processor recognizes and acts on them within a particular window of time after they are received. In particular, an interrupt usually takes effect at the next instruction boundary after it is received internally. In some cases, only the operating system or other software running at a high privilege level is allowed to mask interrupts, so a user program has no opportunity to enable or disable these control flow changing events.
Another traditional example of a change in control flow occurs in response to an exception. An exception typically reflects a predefined architectural condition, such as the result of a mathematical instruction meeting certain criteria (denormal, underflow, overflow, not a number, etc.). Some exceptions can be masked, for example by setting a bit in a control register. If an exception occurs and is not masked, an exception handler is called to handle it.
Another technique that changes the control flow of a processor is the use of breakpoints. Breakpoints are typically used in debugging. A particular instruction address can be programmed into a breakpoint register. When breakpoints are enabled and the target address is reached, the processor takes some action other than continuing with the program as usual. Breakpoints allow single-stepping through a program, among other things.
Multithreading is a technique by which processor hardware is used by multiple different threads. A multithreaded processor may switch between threads for a variety of reasons. For example, a processor may have an algorithm that automatically switches between available threads. Other processors use switch-on-event multithreading (SoEMT), in which certain events, such as a cache miss, cause a thread switch. A thread switch can be considered a change of control flow because the processor switches the sequence or stream of instructions that it executes.
One prior art reference details a quiesce instruction (see U.S. Patent No. 6,493,741). In one example, the quiesce instruction stops processing in one thread until either a timer expires or a memory write to a memory location occurs. Thus, an instruction such as the quiesce instruction may itself cause the thread containing it to be temporarily suspended and a switch to another thread to occur.
Description of drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Fig. 1 illustrates one embodiment of a system that can detect and respond to processing conditions of execution resources.
Fig. 2 illustrates a flow diagram of operation for one embodiment of the system of Fig. 1.
Fig. 3 illustrates a flow diagram of operation for another embodiment of the system of Fig. 1.
Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or to composite performance events.
Fig. 5a illustrates one embodiment of a monitor that can recognize composite events.
Fig. 5b illustrates another embodiment of a monitor.
Fig. 5c illustrates another embodiment of a monitor.
Fig. 6 illustrates a flow diagram of user program execution, according to one embodiment, that activates helper threads in response to program-definable triggers.
Fig. 7 illustrates a flow diagram of a process for refining monitor settings according to one embodiment.
Fig. 8 illustrates a flow diagram of a process for updating software according to one embodiment.
Fig. 9a illustrates a flow diagram in which multiple nested helper threads are activated to assist processing.
Fig. 9b illustrates thread switch logic for one embodiment that supports virtual threads.
Fig. 10a illustrates one embodiment of a context-sensitive event schema vector and mask implementation.
Fig. 10b illustrates one embodiment of a context-sensitive event schema vector and mask implementation.
Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switches based on monitor events.
Fig. 12 illustrates one embodiment of a system with event detection and handling capabilities for synchronization objects.
Fig. 13 illustrates a flow diagram of synchronization event handling according to various embodiments.
Fig. 14 illustrates a flow diagram of thread scheduling improvement based on lock profiling by an event handler thread.
Detailed description
The following description sets forth embodiments of a programmable event driven yield mechanism that may activate other threads. Numerous specific details, such as processor types, microarchitectural conditions, events, enablement mechanisms, and the like, are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated by one skilled in the art, however, that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.
In some embodiments, the disclosed techniques allow a program, while it executes, to actively monitor and respond to conditions of the execution resources that are executing it. In effect, such embodiments may incorporate real-time feedback about execution resource operating conditions to improve performance. If the execution resources encounter conditions that delay execution, execution of the program can be interrupted so that adjustments can be made. In some embodiments, a handler may be activated, and the handler may spawn a helper thread to attempt to improve execution of the original thread. In other embodiments, the interruption may be accomplished by switching to another program thread that is not a helper thread. These and other embodiments may in some cases advantageously improve processing throughput and/or allow optimizations to be tailored to particular hardware.
Referring to Fig. 1, one embodiment of a system that can detect and respond to processing conditions of execution resources is described. In the embodiment of Fig. 1, execution resources 105, a monitor 110, and enable logic 120 form part of a processor 100 that is capable of executing instructions. In some embodiments the execution resources comprise hardware resources that may be integrated into a single component or integrated circuit. However, the execution resources may comprise software or firmware resources, or any combination of hardware and software and/or firmware, that can be used to execute program instructions. For example, firmware may be used as part of an abstraction layer or may add functions to the processing hardware, as may software. Software may also be used to emulate part or all of an instruction set or to otherwise assist in processing.
The processor may be any of a variety of types of processors that execute instructions. For example, it may be a general purpose processor such as a processor in the Pentium® processor family, the Itanium® processor family, or another processor family from Intel Corporation, or a processor from another company. Thus, the processor may be a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a very long instruction word (VLIW) processor, or any hybrid or alternative processor type. Moreover, special purpose processors such as network or communication processors, co-processors, embedded processors, compression engines, image processors, and the like may use the disclosed techniques. As integration trends continue and processors become more complex, the need to monitor and react to internal performance indicators may increase further, making the disclosed techniques more desirable. However, because technology in this field advances rapidly, it is difficult to foresee all applications of the disclosed techniques, although they may be widely applicable to complex hardware that executes program sequences.
As shown in Fig. 1, the processor 100 is coupled to a storage medium 150 such as a memory. The storage medium 150 may be a memory subsystem with multiple levels of hierarchy, including but not limited to various levels of cache memory, system memory such as dynamic random access memory, and non-volatile storage such as flash memory (e.g., a memory stick), a magnetic disk, or an optical disk. As illustrated, the storage medium stores a program 160 and a handler and/or other thread such as a helper thread 170.
To allow the monitor to watch for the desired events, the monitor 110 may be coupled to various portions of the execution resources in order to detect particular conditions or to be informed of certain microarchitectural events. Signal lines may be routed to the monitor 110, or the monitor may be strategically placed with or integrated with the relevant resources. The monitor may include various programmable logic or software or firmware elements, or may be custom designed to detect a particular condition. The monitor tracks the various events or conditions and, if the events or conditions it is programmed to detect occur, signals the execution resources 105 to interrupt the normal control flow the program would otherwise follow. As indicated in Fig. 1, the interruption may result in an event handler being called or in a thread switch.
One example of a specifically detectable condition is that data is missing from a cache memory, resulting in a cache miss. In fact, a program may generate a pattern of memory accesses that causes repeated cache misses, thereby degrading performance. The occurrence of a certain number of cache misses within a period of time, or during execution of a portion of code, is one example of an event that indicates relatively slow progress in executing that section of code.
Other detectable events that may indicate slow progress relate to various other microarchitectural or structural details of the execution resources. A monitor may detect conditions involving one or more of the following: a resource stall, a cache event, a retirement event, a branch or branch prediction result, an exception, a bus event, or a variety of other commonly monitored or performance-impacting events or conditions. The monitor may count such events or conditions, or may time, quantify, or characterize them, and may be programmable to respond when a particular metric associated with one or more events or conditions occurs.
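As an illustration only, the counting behavior described above (a given number of cache misses within some period) can be modeled in software. The following Python sketch is not the patent's hardware; the class name, the threshold and window parameters, and the use of cycle numbers as timestamps are all assumptions made for the example.

```python
from collections import deque


class WindowedEventMonitor:
    """Fires when `threshold` events fall within `window` cycles.

    Hypothetical model: the name, parameters, and cycle units are
    illustrative and are not taken from the patent text.
    """

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # cycle stamps of recent events

    def record(self, cycle):
        """Record one event (e.g. a cache miss); return True if triggered."""
        self.events.append(cycle)
        # Drop events that have fallen out of the sliding window.
        while self.events and cycle - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


monitor = WindowedEventMonitor(threshold=3, window=100)
sparse = [monitor.record(c) for c in (0, 200, 400)]       # misses far apart
dense = [monitor.record(c) for c in (1000, 1010, 1020)]   # misses close together
```

In this model the widely spaced misses never trigger, while the third of three closely spaced misses does.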
Fig. 2 illustrates a flow diagram of operation for one embodiment of the system of Fig. 1. As indicated in block 200 of Fig. 2, the program 160 may set conditions that cause a change in the execution control flow. For example, the enable logic 120 may control both activation of the monitor and which event(s) the monitor detects. Alternatively, the enable logic 120 may enable and/or mask events while the monitor 110 is itself also programmable, giving further flexibility in specifying which events or conditions within the execution resources or system are tracked. In either case, the program 160 itself may specify the conditions to be watched during its own execution. The program 160 may also provide the handler or thread 170 that is activated when the monitored conditions occur. For example, the program may include a main thread and a helper thread or helper routine that attempts to improve execution of the main thread when the conditions the program specifies occur.
As indicated in block 205, the program instructions are executed. Execution of the program causes the state of the execution resources to change. For example, a variety of conditions that inhibit forward progress may occur or be present while the program executes. As indicated in block 210, the various processing metrics and/or microarchitectural conditions are monitored to determine whether the triggering event programmed in block 200 has occurred. If the trigger state has not occurred in block 210, the monitor is not triggered and program execution continues by returning to block 205.
In some cases, the trigger state bears only an indirect relationship to the execution of any single instruction. For example, a prior art breakpoint detector typically causes a break when the instruction pointer reaches a designated address. Such breakpoints are precise because a particular instruction (i.e., its address) directly triggers the break. Likewise, the prior art quiesce instruction itself causes a thread to stop, at least temporarily. In contrast, some embodiments of the disclosed techniques trigger control flow changes on a set of conditions that is not necessarily caused by a single instruction, but may instead be caused by the overall program flow and/or system environment. Thus, while the monitor may repeatedly trigger at the same instruction execution state in a single system, other conditions, environments, systems, and so on may produce different trigger points for the same program. In this sense, the disclosed techniques in some cases provide an imprecise or asynchronous mechanism for generating control flow changes that are not directly tied to instruction execution boundaries. Moreover, in some embodiments such an imprecise mechanism may test for events at a finer granularity than each instruction, and/or may delay recognition of events for some period of time, because architectural correctness does not depend on any processing-rate-enhancing helper routine executing at any particular point in time.
When the monitor detects the trigger state in block 210, processing of the program is interrupted, as indicated in block 215. Generally, the system may adjust in response because the program is being processed inefficiently or in a manner other than the programmer intended. For example, another software routine, such as another program portion, may be invoked. The other program portion may be another thread unrelated to the original thread, or may be a helper thread that assists in processing the original thread's instructions, for example by prefetching data to reduce cache misses. Alternatively, a program-transparent (e.g., hardware) mechanism may perform some optimization, reconfiguration (including, but not limited to, reconfiguration of the monitor setup), reallocation of resources, or the like in the hope of improving processing.
Fig. 3 illustrates one example in which a helper thread is invoked. In particular, the flow diagram of Fig. 3 details operation of one embodiment of the system of Fig. 1 in which the execution resources are multithreaded resources and the program invokes a helper thread when a certain trigger condition occurs. Thus, as indicated in block 300, a first thread (e.g., a main program) sets a monitor condition. The condition may be one or more of the various conditions discussed herein. The first thread executes a section of code, as indicated in block 310. If the test in block 320 determines that the trigger condition has not occurred, execution of the code section continues, as indicated in block 310.
If the trigger condition does occur, a helper thread is activated to assist the first thread, as indicated in block 330. The helper thread may be activated by a routine such as a handler routine, or simply by a thread switch. For example, in one embodiment, the trigger condition signaled by the monitor to the execution resources may cause the execution resources to jump to an event handler that spawns the helper thread. In another embodiment, the helper thread may simply be one of the other active threads. In yet another embodiment, the processor may provide one or more special helper thread execution slots, and the monitor may cause a switch to a helper thread from one of these slots. As indicated in block 340, both threads then continue to execute. Ideally, the helper thread runs ahead and clears up conditions that would otherwise cause the first thread to stall or perform poorly.
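The flow of Fig. 3 can be approximated in software as a rough analogy. The Python sketch below is a minimal model under stated assumptions: a dictionary stands in for a cache, `slow_load` stands in for a long-latency memory access, and an ordinary software thread plays the role of the helper (worker) thread. None of these names come from the patent, and real hardware would detect the trigger without the explicit check shown here.

```python
import threading


def slow_load(key):
    """Stand-in for a long-latency memory access (hypothetical)."""
    return key * 2


def run_main_thread(keys, miss_threshold):
    """Process `keys`; start a prefetching helper once misses reach the threshold."""
    cache = {}
    misses = 0
    helper = None
    total = 0

    def prefetch(remaining):
        # Helper thread: warm the cache ahead of the main thread.
        for k in remaining:
            cache.setdefault(k, slow_load(k))

    for i, key in enumerate(keys):
        if key not in cache:              # block 310: executing code incurs a miss
            misses += 1
            cache[key] = slow_load(key)
        total += cache[key]
        # Block 320: has the programmed trigger condition occurred?
        if helper is None and misses >= miss_threshold:
            # Block 330: activate the helper thread.
            helper = threading.Thread(target=prefetch, args=(keys[i + 1:],))
            helper.start()
    if helper is not None:
        helper.join()                     # block 340: both threads ran concurrently
    return total, helper is not None


total, started = run_main_thread(list(range(10)), miss_threshold=3)
```

The computed total is the same whether or not the helper runs, which mirrors the point that correctness does not depend on the helper; only the number of misses seen by the main thread changes.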
Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or to composite performance events. In the embodiment of Fig. 4, execution resources 400 are shown as including a set of N monitors 410-1 through 410-N. Additionally, an event schema vector (ESV) storage location 420 and an event schema vector mask (ESVM) storage location 425 are provided. The embodiment of Fig. 4 shows a number of monitors (N) corresponding to the number of bits in the event schema vector and the event schema vector mask. In other embodiments, there may be a different number of monitors than bits in these vectors, and the monitors may or may not correlate directly to the bits. For example, in some embodiments a condition involving multiple monitors may correlate to a single vector bit.
The execution resources 400 are optionally coupled to an event descriptor table 430 (EDT), which may be implemented locally on the processor or in a co-processor or system memory. Control flow logic 435 is coupled to the monitors 410-1 through 410-N and receives values from the event schema vector and the event schema vector mask. The control flow logic 435 changes the control flow of the processing logic when a condition detected by one or more of the monitors is enabled according to the event schema vector and the event schema vector mask.
The embodiment of Fig. 4 also illustrates decode logic 402 and a set of machine or model specific registers 404 (MSRs). Either or both of the decode logic 402 and the model specific registers may be used to program and/or activate the monitors and the event schema vector and mask. For example, MSRs may be used to program the types or number of events that trigger the monitors. MSRs may also be used to program the event schema vector and mask. Alternatively, one or more new dedicated instructions to be decoded by the decoder 402 may be used to program either or both of the monitors and the event schema vector and mask. For example, a yield instruction may be used to enable interruption of program processing when a certain set of conditions occurs. Some or all of the conditions may be specified via an operand of the yield instruction, or may be programmed before it executes. The yield instruction may be decoded by the decoder 402 to trigger a microcode routine, to produce a corresponding micro-operation, micro-instruction, or sequence of micro-operations directly, or to activate a co-processor or signal dedicated logic that implements the yield function. In some embodiments the concept of a yield aptly describes the instruction, in that a thread may continue to execute after the yield instruction but may at some point be slowed by the execution of another thread or handler. For example, a largely single-threaded program may invoke extra helper threads and share the processor with them.
In the embodiment of Fig. 4, a memory 440 includes an event handler 450 and a main thread 460. In some embodiments, the event descriptor table may be stored in the same memory, or in the same memory hierarchy, as the main thread 460 and the handler 450. As previously discussed, the handler may spawn a helper thread to help the main program execute efficiently.
The memory 440 may also store an update module 442 that communicates via a communications interface 444. The update module 442 may be a hardware module or a software routine executed by the execution resources to obtain new conditions to be programmed into the various monitors and/or the enable logic. The update module 442 may also obtain new helper threads or routines. For example, a software program may download these modules from its vendor to provide enhanced performance. Thus, the network interface 444 may be any network and/or communication interface capable of conveying information over a communication channel. In some cases, the network interface may connect to the Internet to download new conditions and/or helper routines or threads.
In one embodiment, each bit of the event schema vector indicates the occurrence or non-occurrence of a particular event, and the particular event may be a composite event reflecting (and/or expressed via Boolean operations in terms of) a variety of other conditions or events. Occurrence of the particular event sets the corresponding bit in the event schema vector. Each bit in the event schema vector has a corresponding bit in the event schema vector mask. If the mask bit indicates that the particular event is masked, the control flow logic 435 disregards the event, although the bit in the event schema vector remains set because the event occurred. The user may choose whether to clear the event schema vector when unmasking events. Thus, an event may be masked for some time and handled later. In some embodiments, the user may choose to specify the trigger as a level trigger or an edge trigger, depending on various issues such as the relationship between event update, sampling, and reset (or the hold time of a trigger event in the ESV).
If the mask bit indicates that an event is unmasked, the control flow logic 435 calls an event handler for that particular event in this embodiment. The control flow logic 435 may vector into the event descriptor table 430 based on the bit position in the event schema vector, so the event descriptor table may have N entries corresponding to the N bits in the event schema vector. The event descriptor table may contain a handler address indicating where the control flow logic 435 should redirect execution, and may also include other information useful in a particular embodiment. For example, privilege level, thread, process, and/or other information may be maintained or updated in the event descriptor table.
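The interaction of the ESV, the ESVM, and the EDT can be sketched with ordinary bit operations. The Python fragment below is illustrative only: the 4-bit width, the handler addresses, and the convention that a 1 in the ESVM means "unmasked" are assumptions, since the text does not fix a width or a mask polarity.

```python
N_BITS = 4  # assumed vector width for this sketch

# Hypothetical event descriptor table: one entry per ESV bit position.
EDT = {
    0: {"handler_address": 0x1000},
    1: {"handler_address": 0x2000},
    2: {"handler_address": 0x3000},
    3: {"handler_address": 0x4000},
}


def unmasked_events(esv, esvm):
    """Bit positions that are set in the ESV and enabled by the ESVM.

    Assumption: a 1 in the ESVM means "unmasked".
    """
    active = esv & esvm
    return [bit for bit in range(N_BITS) if (active >> bit) & 1]


def handler_addresses(esv, esvm, edt=EDT):
    """Index the EDT by bit position to find where to redirect execution."""
    return [edt[bit]["handler_address"] for bit in unmasked_events(esv, esvm)]


esv = 0b1010   # events 1 and 3 have occurred
esvm = 0b0110  # events 1 and 2 are unmasked; event 3 stays pending in the ESV
targets = handler_addresses(esv, esvm)
```

Because the masked bit 3 is left set in `esv`, it could still be handled later if the mask were changed, which matches the deferred handling described for masked events.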
In another embodiment, the event descriptor table 430 may not be needed, or may be a single entry that gives the address of a single event handler that handles all events. In this case, the entry may be stored in a register or other processor storage location. In one embodiment, a single handler may be used, and that handler may access the event schema vector to determine which event occurred and therefore how to respond. In another embodiment, the event schema vector may collectively define an event that causes the control flow logic 435 to call a handler. In other words, the event schema vector may represent a variety of conditions that together signal one event. For example, the event schema vector mask may be used to designate which of the events indicated by the event schema vector must occur to trigger execution of the handler. Each bit may represent a monitor reaching a programmable condition. When all the unmasked monitors reach their respective designated conditions, the handler is called. Thus, the entire event schema vector may be used to designate some complex composite condition that triggers execution of the handler.
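The composite case, in which the handler is called only when every unmasked monitor has reached its condition, reduces to a single mask comparison. This Python sketch assumes that a 1 in the ESVM selects a required event and that an all-zero mask never fires; both conventions are assumptions made for the example.

```python
def compound_event(esv, esvm):
    """True when every event selected by the mask is set in the ESV.

    Assumptions: a 1 in the ESVM selects a required event, and an
    empty mask (no required events) never triggers the handler.
    """
    return esvm != 0 and (esv & esvm) == esvm


all_met = compound_event(0b0111, 0b0101)   # bits 0 and 2 required, both set
one_short = compound_event(0b0011, 0b0101)  # bit 2 required but not yet set
```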
In another embodiment, multiple event schema vectors and masks may be used to designate different conditions. The different vectors may vector to different handlers via the event descriptor table or some other mechanism. In another embodiment, some bits of one or more event schema vectors may be grouped to form events that trigger the calling of handlers. A variety of other permutations will be apparent to those of skill in the art.
Fig. 5a illustrates one embodiment of a monitor 500 that is programmable and capable of interfacing with a variety of performance monitors to signal a composite event. For example, such performance monitors may record occurrences of various microarchitectural events or conditions such as: cache misses incurred at a given level of the cache hierarchy; branch retirement; branch misprediction (or retirement of mispredicted branches); trace cache delivery mode changes or events; branch prediction unit fetch requests; cancellations of memory requests; cache line splits (a count of completed split loads, stores, etc.); replay events; various types of bus transactions (e.g., locks, burst reads, writebacks, invalidates); allocations in a bus sequencer (or only certain types); numeric assists (an underflow, denormal, etc.); execution/retirement of a particular type of instruction or micro-operation (uOP); machine clears (or pipeline flushes); resource stalls (register renaming resources, pipeline, etc.); processing of tagged uOPs; instructions or uOPs retired; lines allocated in a cache (and/or lines in a particular state, e.g., M); the number of cycles instruction fetch is stalled; the number of cycles the instruction length decoder is stalled; the number of cache fetches; the number of lines allocated (or evicted) in a cache; and the like. These are only some examples of microarchitectural events or conditions that may be monitored. Various other possibilities, as well as combinations of these or other conditions, will be apparent to those of skill in the art. Moreover, these and/or other conditions or events may be monitored with any of the monitors disclosed in any of the disclosed embodiments.
Performance monitors are often included in processors to count certain events. A programmer may read the counts of such performance monitors through manufacturer-defined interfaces, such as a processor-specific macro instruction like the RDPMC instruction supported by well-known Intel processors. See Appendix A of Volume III of the Intel Software Developers Guide for the Pentium® 4 Processor. In some embodiments, other internal or micro-instructions or micro-operations may be used to read the performance counters. Thus, for example, performance monitors may be used in combination with the disclosed techniques. In some cases, a programmable performance monitor may be modified to provide the capability to generate event signals. In other embodiments, performance monitors may be read by other monitors to establish events.
In the embodiment of Fig. 5a, the monitor 500 may include a set of programmable entries. Each entry may include an entry number 510, an enable field 511, a performance monitor number (EMON #) that designates one of a set of performance monitors, and a trigger condition 514. The trigger condition may be, for example, reaching a certain count, a count falling within a certain range, a difference in counts, etc. The monitor 500 may include logic to read, or be coupled to receive, counts from the designated performance monitors. The monitor 500 signals the control flow logic when the various M conditions occur. A subset of the M entries may be used by selectively programming the enable field of each entry.
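A table of programmable entries of this kind might be modeled as follows. This Python sketch is a loose software analogy: the field names, the `reaches` and `in_range` helpers, the counter dictionary, and the rule that the monitor fires only when every enabled entry is satisfied (and at least one entry is enabled) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MonitorEntry:
    entry_number: int
    enabled: bool                   # enable field
    emon: int                       # which performance monitor to read
    trigger: Callable[[int], bool]  # predicate applied to that counter's value


def reaches(count):
    """Trigger condition: the counter has reached `count`."""
    return lambda value: value >= count


def in_range(low, high):
    """Trigger condition: the counter lies within [low, high]."""
    return lambda value: low <= value <= high


def monitor_fires(entries, counters):
    """Signal only when every enabled entry's condition holds (assumed rule)."""
    active = [e for e in entries if e.enabled]
    return bool(active) and all(e.trigger(counters[e.emon]) for e in active)


entries = [
    MonitorEntry(0, True, emon=2, trigger=reaches(100)),
    MonitorEntry(1, False, emon=5, trigger=in_range(10, 20)),  # disabled, ignored
    MonitorEntry(2, True, emon=7, trigger=in_range(10, 20)),
]
fires = monitor_fires(entries, {2: 150, 5: 0, 7: 15})
quiet = monitor_fires(entries, {2: 150, 5: 0, 7: 25})
```

Disabling entry 1 removes its condition from the composite, which corresponds to using only a subset of the entries.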
Fig. 5b illustrates another embodiment of a monitor 520. The monitor 520 represents a custom composite event monitor. The monitor 520 receives a set of signals via signal lines 528-1 through 528-X from various execution resources or resource portions and combines them via combinational logic 530. If the proper combination of signals is received, the monitor 520 signals the control flow logic via an output signal line 532.
Fig. 5c illustrates another embodiment of a monitor 540. The monitor 540 includes a table having M entries. Each entry includes an enable field 552, a condition field 554, and a trigger field 556. The condition field may be programmed to specify which combination of input signals is to be monitored. The conditions may or may not be tied to other event detecting structures such as performance monitors, and therefore may be more general than those discussed with respect to Fig. 5a. The trigger field 556 may specify the states of those input signals that are needed to signal the control flow logic. Again, each entry may be enabled or disabled via the enable field 552. In some embodiments, the condition and trigger fields may be combined. Various combinations of these and other types of known or otherwise available monitors, whether simpler or more complex, will be apparent to one of skill in the art.
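One possible reading of a single entry of Fig. 5c can be sketched in Python by treating the input signals as bits of an integer. The bit-vector encoding, the function name, and the exact matching rule are assumptions for illustration only; the disclosure does not fix an encoding.

```python
def entry_signals(enabled: bool, condition: int, trigger: int, inputs: int) -> bool:
    """Model one table entry of Fig. 5c using bit vectors (an assumed encoding).

    condition: bits set for the input signals that are monitored (condition field 554)
    trigger:   required state of each monitored signal (trigger field 556)
    inputs:    current state of all input signals
    """
    if not enabled:  # enable field 552
        return False
    # Fire only when every monitored input bit matches its required state.
    return (inputs & condition) == (trigger & condition)


# Monitor input bits 0 and 2; require bit 0 high and bit 2 low.
CONDITION = 0b101
TRIGGER = 0b001
```

With these values an input of `0b011` fires the entry, since bit 1 is not monitored, while `0b111` does not, since bit 2 is high.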
Fig. 6 illustrates a flow diagram for execution of a user program that activates helper threads in response to program-definable triggers, according to one embodiment. In block 600, the program first tests whether the yield capability is present. "Yield capability" is used herein as shorthand for the ability to disrupt processing flow based on the occurrence of a condition or event. As an alternative to testing for yield capability support, the yield capability may use opcodes previously defined as no-operation opcodes and/or previously unused or undefined MSRs, so that use of the yield capability has no effect on processors lacking it. The presence of the capability may also be queried by checking a special CPU-ID that encodes hints indicating whether the capability is present on a given processor or platform. Similarly, a special instruction, such as Itanium's PAL (processor abstraction layer) call or SALE (system abstraction layer environment), may be used to query processor-specific configuration information, including the availability of the program-definable yield capability. Assuming the yield capability is present, the user program may then read and/or reset various counters, as indicated in block 610. For example, performance monitor counters may be read so that a delta may be computed, or the values may be reset if that capability is available.
As indicated in block 620, the user program then sets the helper thread trigger condition. The yield capability may be accessible at a low privilege level (e.g., user level) so that any program, or most programs, can use this feature. For example, the yield capability may be available to ring three privilege level programs in a Pentium® processor family processor or the like. Therefore, the user program itself is able to set its own performance-based trigger conditions. A user program or operating system that is aware of such context-sensitive monitor configurations may choose to save and restore the application-specific monitor configuration or setup across thread or process context switches, if the application demands it or the operating system can provide a persistent monitoring capability.
As indicated in block 630, the user program continues to execute after programming the yield conditions. Whether a yield condition occurs is tested in block 640. If the yield condition does not occur, program execution continues, as indicated in block 630. If the yield condition does occur, a helper thread is activated, as indicated in block 650. The flowchart form of Fig. 6 tends to imply that synchronous polling for events occurs, and this approach may be used in some embodiments. However, some embodiments react to events asynchronously when they occur, or within a number of clock cycles of when they occur, rather than polling for them at certain intervals. In some embodiments, a monitor condition may be set outside of a loop or other code section to detect a particular condition. This concept is demonstrated by the following pseudo-code example for a main thread and a helper thread.
main()
{
    CreateThread(T)
    WaitForEvent()
    n = NodeArray[0]
    setup Helper Trigger            //Intrinsic
    while (n and remaining)
    {
        work()
        n->i = n->next->j + n->next->k + n->next->l
        n = n->next
        remaining--
        //Every Stride Time
        //    global_n = n
        //    global_r = remaining
        //    SetEvent()
    }
    disable Helper Trigger          //Intrinsic
}

T()
{
    Do Stride times
        n->i = n->next->j + n->next->k + n->next->l
        n = n->next
        remaining--
    SetEvent()
    while (remaining)
    {
        Do Stride times
            n->i = n->next->j + n->next->k + n->next->l
            //Responsible for most effective prefetch
            //due to run-ahead
            n = n->next
            remaining--
        WaitForEvent()
        if (remaining < global_r)   //Detect Run-Behind
            remaining = global_r    //Adjust by jump ahead
            n = global_n
    }
}
One advantage of setting the trigger outside the loop is that compiler optimizations within the loop are not inhibited. For example, some compilers do not optimize loops or code sections that contain intrinsics such as those that may be used to activate the yield capability. By placing such intrinsics outside the loop, interference with compiler optimizations may be removed.
Fig. 7 illustrates a flow diagram for a process of refining yield settings according to one embodiment. Using a processor with a yield capability or the like, a programmer may design a program, as well as helper routines to be invoked under various circumstances, as indicated in block 700. Thus, helper routines may be provided for the various processing-impeding conditions that the programmer anticipates. The processor can invoke these routines if and when they are needed during execution of the program. The yield settings may include the event schema vector and mask vector and/or monitor settings or the like.
On a particular processor, a certain yield setting may result in a favorable execution result. However, such a determination may be quite difficult to make manually and is therefore better derived empirically. Accordingly, a compiler or other tuning software (e.g., the Intel VTune code analyzer) may repeatedly simulate the code with different yield settings, thereby deriving optimal or desirable settings, as indicated in block 710. Thus, desirable values for the runtime yield settings may be chosen, as indicated in block 720. A program may be simulated on multiple different versions of a processor, on multiple different processors, or in multiple different systems to derive different yield settings. A system or processor identification such as a CPU-ID may be used by the program to select which yield settings to apply when it runs, as indicated in block 730.
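The empirical derivation and run-time selection of blocks 710 to 730 might be sketched as follows in Python. The shape of a yield setting (a cache-miss threshold paired with a stride), the processor identifiers, and the synthetic cost function that stands in for simulating the code are all hypothetical placeholders chosen for illustration.

```python
from functools import partial
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical shape of a yield setting: (cache-miss threshold, prefetch stride).
YieldSetting = Tuple[int, int]


def derive_setting(candidates: Iterable[YieldSetting],
                   cost: Callable[[YieldSetting], float]) -> YieldSetting:
    """Blocks 710/720: try every candidate setting and keep the cheapest one."""
    return min(candidates, key=cost)


def build_settings_table(cpu_ids: Iterable[str],
                         candidates: Iterable[YieldSetting],
                         cost_on: Callable[[str, YieldSetting], float]
                         ) -> Dict[str, YieldSetting]:
    """Repeat the derivation once per processor version or system."""
    candidates = list(candidates)
    return {cpu: derive_setting(candidates, partial(cost_on, cpu)) for cpu in cpu_ids}


def select_setting(table: Dict[str, YieldSetting], cpu_id: str,
                   default: YieldSetting) -> YieldSetting:
    """Block 730: pick the setting for the processor the program runs on."""
    return table.get(cpu_id, default)


# Stand-in for simulating the code: a synthetic cost with a known best setting.
IDEAL = {"cpu_a": (8, 4), "cpu_b": (32, 16)}  # hypothetical processor identifiers


def simulated_cycles(cpu_id: str, setting: YieldSetting) -> float:
    ideal = IDEAL[cpu_id]
    return abs(setting[0] - ideal[0]) + abs(setting[1] - ideal[1])


CANDIDATES = [(8, 4), (16, 8), (32, 16)]
TABLE = build_settings_table(IDEAL, CANDIDATES, simulated_cycles)
```

Keeping the derived values in a small table separate from the helper routines matches the idea that the settings can be retuned without touching the routines.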
Furthermore, using a compact group of settings to optimize performance facilitates software updates. For example, new yield values may be downloaded to optimize performance for a given processor, or may be used to update software when new processors are released. Such new values allow a binary or modular modification that does not substantially disturb or jeopardize the functionality of the existing software.
Fig. 8 illustrates a flow diagram for a process of updating software according to one embodiment. As indicated in block 800, a new version of a microprocessor is released. The new version may have different latencies associated with microarchitectural events such as cache misses. Therefore, a routine previously written to activate helper threads after a given number of cache misses may be less effective because of the new cache miss latency. The yield settings are therefore re-optimized, as indicated in block 810.
Once new settings are derived, the program may be updated (e.g., via an update module that may be part of the program), as indicated in block 820. Yield values may be modified or added, depending on the details of the implementation. Moreover, additional or different helper routines may be added to assist on the new processor implementations. In either case, the yield capability can enable the delivery of performance enhancements after the initial delivery of the software. Such a capability may be quite advantageous in a great variety of scenarios, and may be used just to provide new optimizations without any change to the underlying hardware. Additionally, the underlying software may be maintained in some cases. For example, if a helper routine is written to deal with a synthetic event (e.g., bad cache misses), then on different hardware the composition of events that triggers this routine may be changed without changing the actual routine itself. For example, the monitor configuration values and/or ESV/ESVM values may be changed while the routines remain intact.
The effectiveness of the disclosed techniques may be further enhanced by creating nested helper threads, and Fig. 9a illustrates one example of such usage. In the embodiment of Fig. 9a, the program sets the yield event(s) in block 900. The program continues execution in block 910. Whether a yield event (a trigger) occurs is tested in block 920. If no yield event occurs, program execution continues, as indicated in block 910. If a yield event occurs, a helper thread is activated, as indicated in block 925. The helper thread sets another yield event, as indicated in block 930. Thus, the helper thread effectively identifies a further condition indicating that further processing assistance may be helpful. Such a further condition may indicate whether the first helper thread is effective, and/or may be designed to indicate another condition that is suspected to develop as a result of, or in spite of, activation of the first helper thread.
As indicated in block 940, both the program and the helper thread are active and executing threads. These threads are both active and executing in the sense that they execute concurrently on multithreaded processing resources. Whether the new trigger condition has occurred through the combination of the program and the helper thread is tested in block 950. If the new trigger condition does not occur, execution of both threads continues, as indicated in block 940. If the new trigger condition does occur, a second or nested helper thread is activated, as indicated in block 960. Thereafter, the program and multiple helper threads may be active and executing, as indicated in block 962. Thus, multiple nested helper threads may be employed in some embodiments.
In one embodiment, multiple helper threads (either nested or non-nested) may be activated through the use of virtual threads. Rather than dedicating a full set of resources to expand the number of threads it can handle, a processor may effectively cache context data (in a cache location, a register location, or another storage location). Accordingly, one physical thread slot may be rapidly switched between multiple threads.
For example, Fig. 9b illustrates thread switching logic according to one embodiment that allows virtual threads to be switched into a limited number of physical thread slots, which have dedicated hardware to maintain a thread context. In the embodiment of Fig. 9b, a plurality of helper threads 965-1 through 965-k may be presented to a virtual thread switcher 970. The virtual thread switcher 970 may also include other logic and/or microcode (not shown) to swap context information between the newly selected helper thread and the previously selected helper thread. The virtual thread switcher 970 may be triggered to switch threads by either a synchronous or an asynchronous stimulus. For example, an asynchronous event defined by a yield-type instruction may cause a thread switch between the virtual threads. Additionally, helper threads may include synchronous means, such as a halt, quiesce, or other type of execution-stopping instruction, to signal a switch to another thread. The virtual thread switch logic 970 presents a subset of the virtual threads (e.g., one virtual thread in the embodiment of Fig. 9b) to the processor thread switch logic 980. The processor thread switch logic 980 then switches between one of the helper threads, as a first thread 967-1, and its other N-1 threads, up to thread 967-N.
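The idea of multiplexing several virtual helper threads onto one physical thread slot by caching their contexts can be modeled in a few lines of Python. This is only a behavioral sketch under stated assumptions: a context is reduced to a dictionary holding an instruction pointer, and the class and method names are invented for illustration.

```python
from typing import Dict, Optional


class VirtualThreadSwitcher:
    """Behavioral model of switching k virtual threads into one physical slot."""

    def __init__(self, contexts: Dict[str, dict]) -> None:
        # Cached contexts of virtual threads that do not occupy the slot.
        self.cached = {name: dict(ctx) for name, ctx in contexts.items()}
        self.active: Optional[str] = None  # virtual thread currently in the slot
        self.slot: Optional[dict] = None   # context held by the physical slot

    def switch_to(self, name: str) -> None:
        """Save the outgoing context to the cache, then load the incoming one."""
        if self.active is not None:
            self.cached[self.active] = self.slot
        self.slot = self.cached.pop(name)
        self.active = name

    def step(self, instructions: int = 1) -> None:
        """Pretend to execute by advancing the active thread's instruction pointer."""
        if self.slot is None:
            raise RuntimeError("no virtual thread occupies the physical slot")
        self.slot["ip"] += instructions


switcher = VirtualThreadSwitcher({"helper_1": {"ip": 100}, "helper_2": {"ip": 200}})
switcher.switch_to("helper_1")
switcher.step(3)
switcher.switch_to("helper_2")  # helper_1's context is cached, not lost
switcher.step(1)
switcher.switch_to("helper_1")  # resumes where helper_1 left off
```

The stimulus that calls `switch_to` is left open here; in the text it may be a synchronous instruction or an asynchronous yield-type event.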
In some embodiments, it may be advantageous to confine the yield capability to a particular program or thread. Therefore, the yield capability may be made context sensitive, or non-promiscuous. For example, Fig. 10a illustrates one embodiment of a context-sensitive event schema vector and mask implementation. In the embodiment of Fig. 10a, a storage area 1000 includes a context indicator field 1010 associated with each event schema vector and mask storage location 1020. The context indicator field identifies the context to which each event schema vector and mask pair applies. For example, a context value such as the value of a control register (e.g., CR3 in an x86 processor, indicating the operating system process ID) may be used. Additionally or alternatively, thread number information may be used to define the context. Therefore, in some embodiments, when a particular context is active, certain context-specific events may be enabled to disrupt processing. As such, the yield mechanism may be non-promiscuous in that its events affect only certain contexts.
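A minimal Python sketch of the context-sensitive lookup of Fig. 10a follows. The bit semantics are an assumption made for illustration: each set bit of the event schema vector is taken to mean that an event has occurred, and each set bit of the mask is taken to mean that the event is allowed to trigger a yield. The names and the context values are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SchemaEntry:
    context_id: int  # context indicator field 1010, e.g. a CR3-like value
    esv: int         # event schema vector: assumed one bit per event that occurred
    mask: int        # mask: assumed set bits mark events allowed to trigger a yield


def yield_pending(entries: Iterable[SchemaEntry], active_context: int) -> bool:
    """True when the active context has at least one unmasked pending event."""
    return any(
        e.context_id == active_context and (e.esv & e.mask) != 0
        for e in entries
    )


entries = [
    SchemaEntry(context_id=0x1000, esv=0b0110, mask=0b0100),
    SchemaEntry(context_id=0x2000, esv=0b0001, mask=0b0000),  # event masked off
]
```

Events recorded for context `0x1000` do not disturb a program running in any other context, which is the non-promiscuous behavior described above.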
Fig. 10b illustrates another embodiment of a context-sensitive event schema vector and mask implementation. In the embodiment of Fig. 10b, an integer number k of contexts may be handled by providing one set of event schema vector and mask locations 1050-1 through 1050-k for each of the k contexts. For example, there may be k threads in a multithreaded processor, and each thread may have an event schema vector and mask or a similar yield-enabling mechanism. Notably, in other embodiments it may be undesirable to track events only in certain contexts. For example, events may reflect overall processing activity, and/or events may pertain to or be caused by multiple related threads.
Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switching based on monitor or yield-type events. Although many embodiments have been discussed as disrupting processing flow by causing a handler to execute, other embodiments may define events that cause thread switches in a multithreaded processor. For example, in the embodiment of Fig. 11, thread switch logic 1105 is coupled to receive signals from a set of N monitors 1110-1 through 1110-N. The thread switch logic 1105 may also be coupled to one or more sets of event schema and mask pairs 1130-1 through 1130-p (where p is a positive whole number). The event schema and mask pairs allow the thread switch logic to combine and/or disregard certain monitor events in determining when to switch threads.
Execution resources 1120 may support the execution of p threads, yet may be indifferent to whether an instruction belongs to a particular thread. The execution resources may be an execution unit, fetch logic, a decoder, or any other resource used in executing instructions. A multiplexer 1115 or other selection resource arbitrates between the various threads to decide which thread gets access to the execution resources 1120. One of skill in the art will recognize that various resources may be shared or duplicated in a multithreaded processor, and that various resources may have thread-switched access that allows a limited number of threads (e.g., one) to access the resource at a time.
If a set of conditions indicated by one or more monitors and/or by one of the event schema vector and mask pairs occurs, the thread switch logic 1105 switches threads of execution. Thus, another thread may be activated in place of the thread that was active when the processor conditions matched those programmed. For example, a user program may control the events that trigger thread switches.
In some multithreaded processors, each thread may have an associated set of event schema vector and mask pairs or the like. Thus, as shown in Fig. 11, the multiplexer 1115 may arbitrate between p threads, and there may be a corresponding p event schema and mask pairs. However, the fact that a processor is multithreaded does not mean that all implementations use multiple event schema vectors and masks. Some embodiments may use only one pair, or may use other enablement indicators. For example, a single bit could be used as an enablement indicator to turn a particular yield-type capability on or off.
Fig. 12 illustrates one embodiment of a system with event detection and handling capabilities for synchronization objects. Synchronization objects may be locks or lock variables, barriers, or other hardware, software, and/or memory resources usable for synchronization between threads or processes. As multiprocessing becomes more prevalent with multiple cores and/or various types of multithreading, synchronization between such threads or processes becomes more important for improving performance. Therefore, a system with enhanced synchronization efficiency may have widespread applicability in the various fields of general and/or special purpose processing that employ concurrent processes (e.g., graphics, natural media types, digital signal processing, communications, etc.).
The system of Fig. 12 includes a processor 1200 that is coupled to a memory 1250, as well as to a communication interface 1292 and one or more peripheral devices 1294 (e.g., an audio interface, a display, a keyboard, a mouse or other input device, an I/O device, etc.). These devices may be coupled together directly or indirectly through buses, bridges, and/or point-to-point connections. The processor 1200 includes execution resources 1210 and an event detector 1220 to monitor various aspects of the execution resources 1210. The processor 1200 and the event detector 1220 may have various features as described with respect to previous embodiments. Thus, the event detector 1220 may be programmable to activate a thread (e.g., to trigger a thread switch or to fork a new thread) based on a defined event. Such events may be hardwired events, or may be defined by a software program through an event schema vector-like mechanism as previously described.
The processor 1200 may also have a lock and/or spin detector 1222. In some embodiments, the lock/spin detector may be a separate detector, as shown in Fig. 12. Such a separate detector may be partially hardware that detects certain predefined or even programmable conditions indicative of locks. Thus, such a detector may be partially hardware and partially software. In other embodiments, lock or spin-lock detection may be accomplished simply by programming various conditions into a general purpose event detector. Thus, the general purpose event detector may be programmed to detect such conditions and, when appropriately programmed, effectively forms a spin/lock detector. For example, an application 1254 may include an event detector programming module (EDPM) 1256 to program the event detector 1220 to trigger on desired events.
One lock-related event that may be detected is a long latency fetch of a lock variable. First, a lock variable causing a long latency fetch may be detected by programming the event detector 1220 to trigger on a cache miss at a point where it is known that the lock variable is to be accessed in memory. Such a cache miss indicates that the processor does not have the lock variable cached. A handler may then be activated in response to this particular triggering event to address the long latency lock fetch situation.
Second, the lock/spin detector 1222 may also detect a spin condition, that is, a condition in which a program is waiting for a highly contended lock and looping to check whether the variable is available. A spin condition may be detected, for example, by sensing repeated accesses to a known lock variable location. When such a spin condition is detected, a second thread may be activated, as will be discussed further below. Some embodiments may use helper threads to address such highly contended lock situations regardless of whether a long latency lock fetch (triggering a first helper thread) occurred first.
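The idea of sensing repeated accesses to a known lock variable location can be sketched as a small Python model. The threshold, the choice to count only consecutive accesses to the same lock address, and all names are assumptions for illustration; an actual detector may be hardware and may use a different criterion.

```python
from typing import Iterable, Optional


class SpinDetector:
    """Flag a spin condition after repeated accesses to a known lock address."""

    def __init__(self, lock_addresses: Iterable[int], threshold: int) -> None:
        self.lock_addresses = set(lock_addresses)  # known lock variable locations
        self.threshold = threshold                 # assumed programmable threshold
        self.last: Optional[int] = None
        self.count = 0

    def observe(self, address: int) -> bool:
        """Record one memory access; return True once spinning is detected."""
        if address in self.lock_addresses and address == self.last:
            self.count += 1                      # another access to the same lock
        elif address in self.lock_addresses:
            self.last, self.count = address, 1   # first access to this lock
        else:
            self.last, self.count = None, 0      # unrelated access ends the run
        return self.count >= self.threshold


detector = SpinDetector(lock_addresses={0x40}, threshold=3)
trace = [0x40, 0x40, 0x80, 0x40, 0x40, 0x40]  # hypothetical access trace
results = [detector.observe(address) for address in trace]
```

In this trace the unrelated access to `0x80` resets the run, so the detector signals only on the final access.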
The embodiment of Fig. 12 includes light-weight thread context storage 1230. Such context storage allows a small subset of state to be preserved in order to achieve a "light-weight" or "fly-weight" context switch. For example, in some cases only the instruction pointer of the parent process is saved, and the programmer is responsible for saving any additional context. More or less context may be saved, but generally less than the full context is stored in the subset. These light-weight threads may be exposed at user level (e.g., to application level programs, such as programs at privilege level three in the x86 architecture) using special instructions, allowing users to activate threads on the multithreading execution resources in conjunction with a particular application. In such cases, an event handler that is activated as a light-weight thread should save any non-saved context that it disturbs (and that the parent application may need in order to proceed normally). In other embodiments, the triggered helper threads may be threads with completely independent contexts.
The embodiment of Fig. 12 also includes the memory 1250, which is coupled to the processor. In this embodiment, various lock overhead utilization modules are shown, together with the application 1254. In this embodiment, the modules are software programs. In one embodiment, each module is an individual thread that is triggered upon the occurrence of some event. One or more of the modules shown may be combined into a single helper thread, which may be a full context or a light-weight context thread. In other embodiments, these modules may be implemented in hardware or in a combination of hardware, software, and/or firmware.
The application 1254 is a user level application that may have locks or other synchronization objects or techniques. A fetch critical section data module 1258 may be used to speculatively fetch data from within a lock-protected critical section. An acquire future locks module 1260 may move ahead and obtain locks for other lock-protected sections. A lock may be acquired simply by fetching its data location into cache or, in other embodiments, by altering the lock variable so as to own the lock. The future locks module 1260 may also include a throttling module 1262 to ensure that such speculative locking activity does not degrade overall throughput by impeding other threads. A run-ahead execution module 1280 may execute ahead to perform some of the work beyond the lock-protected section.
A lock profiling module 1270 may gather data regarding the progress of particular threads with respect to lock variables and/or other threads. A user thread scheduler module 1290 may receive hints from the lock profiling module and allow threads to be scheduled more efficiently. For example, the profiling module 1270 may detect that a first thread (e.g., a consumer initially) acquires a lock and causes significant spinning in a second thread (e.g., a producer initially). In this example, informing the scheduler so that it schedules the second thread (the producer initially) first may lead to more efficient processing. In some embodiments, the scheduler is a user level thread scheduler that is exposed to the programmer to allow scheduling of user level (e.g., light-weight) threads. In some embodiments, the user thread scheduler 1290 may be a part of the application 1254.
An example of the interaction of these modules may be further appreciated with reference to Fig. 13. In the embodiment of Fig. 13, a lock fetch latency is detected in block 1310. Such detection may be accomplished by programming the event detector 1220 to sense a condition. For example, the application 1254 may call the event detector programming module 1256 to program the event detector 1220 to trigger on a cache miss just prior to accessing the lock variable. Alternatively, a dedicated lock/spin detector 1222 as previously mentioned may be used to detect the lock.
In block 1315, a thread switch (e.g., a fly-weight thread switch) may be performed in response to detecting a latency in fetching the lock variable. The thread switch activates a first helper thread, which may perform various functions in various embodiments, as set forth in the following blocks. Thus, while block 1315 is followed by various blocks, not all of these blocks are needed in any particular embodiment, nor is their sequence critical.
In block 1320, data that lies beyond the code that acquires the lock, but within the section of code protected by the lock, may be fetched. For example, the fetch critical section data module 1258 may be executed in the embodiment of Fig. 12. Such prefetching of data in the lock-protected code section may reduce cache misses or general data retrieval latency when ownership of the lock is finally obtained.
Additionally (either as part of a separate thread or of the same thread), future locks may be fetched, as indicated in blocks 1330-1365. In particular, N additional locks may be acquired, where N is a positive whole number. The number of locks to be acquired may be programmable or hard coded, and may vary during execution (e.g., through reprogramming or through the throttling module 1262). In block 1330, a loop variable i is set to one. As indicated in block 1340, a future lock is fetched (e.g., it may be prefetched or actually locked in various embodiments). Future lock addresses may be statically computed when the helper threads are generated, or may be determined by lock profiling, as will be discussed further below. As indicated in block 1350, whether the lock is contended is tested. If the future lock is contended (e.g., as may be indicated by the cache state of the lock variable, by the value of the lock variable, or by spinning behavior of the program), then fetching of this and/or other future locks may be discontinued, as indicated in block 1355.
If the lock is not found to be contended in block 1350, operation continues in block 1360. If the count of locks fetched, as tested in block 1360, is not equal to the target number N, then the variable i is incremented in block 1365 and the process returns to block 1340. If the count reaches N in block 1360, then the process proceeds to block 1370 in one embodiment. These lock fetching operations 1330-1365 may be performed by the acquire future locks module 1260 in the embodiment of Fig. 12. Fetching or obtaining ownership of future locks may advantageously accelerate program execution, because it becomes more likely that a lock will be readily available when it is encountered. Many programs encounter only a relative minority of highly contended locks. Therefore, the benefit of prefetching locks may outweigh any negative effect on the progress of other processes.
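The loop of blocks 1330 to 1365 can be written out as a short Python sketch. The `fetch` and `is_contended` callbacks are hypothetical stand-ins for whatever prefetch or acquire operation and contention test an embodiment uses, and the decision to stop all further fetching on the first contended lock is one of the options the text allows, chosen here for simplicity.

```python
from typing import Callable, List, Sequence


def fetch_future_locks(future_locks: Sequence[str],
                       n: int,
                       fetch: Callable[[str], None],
                       is_contended: Callable[[str], bool]) -> List[str]:
    """Fetch up to n future locks, stopping early at the first contended one."""
    fetched: List[str] = []
    if n <= 0:
        return fetched
    i = 1                          # block 1330: loop variable starts at one
    for lock in future_locks:
        fetch(lock)                # block 1340: prefetch or actually lock
        if is_contended(lock):     # block 1350: contention test
            break                  # block 1355: discontinue fetching
        fetched.append(lock)
        if i == n:                 # block 1360: target count N reached
            break                  # continue at block 1370
        i += 1                     # block 1365: increment and loop
    return fetched


touched: List[str] = []
got = fetch_future_locks(["a", "b", "c", "d"], 3, touched.append, lambda l: l == "c")
```

Here lock `c` is touched but found to be contended, so only `a` and `b` are reported as fetched and `d` is never attempted.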
In some embodiments, the work to be performed by the thread triggered in block 1315 may end at block 1365, so the thread may be closed (via a halt or join type of operation) and control may return to the main thread, as indicated in block 1370. In other embodiments, the helper thread may continue and perform the operations of blocks 1372-1375 and/or other operations. Moreover, other embodiments may trigger other numbers of helper threads or other combinations of the described operations.
In the embodiment of Fig. 13, as indicated in block 1372, the application fails to secure the lock. In other words, the lock variable indicates that another process owns the lock. In some embodiments, repeated such failures are detected. A spin threshold may be programmed or set to provide a threshold number of failed attempts to obtain ownership of the lock variable before action is taken. In block 1374, a second helper thread is triggered to perform other needed work under the shadow of the overhead of obtaining ownership of the lock variable. As one example, code beyond the critical section may be executed, as indicated in block 1375. For example, the run-ahead execution module 1280 may be executed in the embodiment of Fig. 12. In some embodiments, such code may be executed just to prefetch instructions or data, and not to actually compute results and/or commit results to the machine state. In other embodiments, results may be computed and/or committed as part of the run-ahead execution if dependency checks are performed to ensure correct results.
Alternatively, in response to this second event, the second helper thread may acquire additional locks in a manner similar to the process of blocks 1330-1365. Another alternative is to perform lock profiling. Another alternative is to flush previously held locks out to at least a higher level of cache (perhaps even to an external interface) to reduce the transfer latency for another processor to obtain the lock. These various examples of other work that may be done under the shadow of lock overhead may be combined in various permutations in various embodiments.
Fig. 14 illustrates one embodiment that includes lock profiling. In the embodiment of Fig. 14, various threads are scheduled in block 1410. In one embodiment, these threads may be light-weight threads that are scheduled and controlled at the user or application level. For example, the application 1254 may include the user thread scheduler 1290 of Fig. 12. In another embodiment, the threads may be operating system visible threads with full contexts. In block 1420, a contended lock is detected. The contended lock causes a profiling thread to be enabled or activated. As indicated in block 1430, that thread profiles the behavior of the lock. In one embodiment, such profiling may entail capturing event counter data, such as from performance counters or other similar structures. The profile information may then be used to help determine the priority and/or ordering of threads when thread scheduling is performed again in block 1410. Thus, lock overhead time is again used to enhance overall program performance. As previously mentioned, such scheduling information may assist a thread scheduler in scheduling producer/consumer thread pairs more efficiently.
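How profile information might feed back into scheduling can be illustrated with a small Python sketch. The policy shown, which runs the threads that spent the most time spinning first, is only one plausible interpretation of the producer/consumer example given earlier. The sample format and all names are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def order_by_spin(threads: Sequence[str],
                  spin_samples: Iterable[Tuple[str, int]]) -> List[str]:
    """Order threads so that those observed spinning the most run first.

    spin_samples holds (thread name, spin cycles) pairs, a hypothetical summary
    of what a profiling thread might capture in block 1430.
    """
    spins: Counter = Counter()
    for thread, cycles in spin_samples:
        spins[thread] += cycles
    # sorted() is stable, so threads with equal totals keep their original order.
    return sorted(threads, key=lambda t: spins[t], reverse=True)


threads = ["consumer", "producer", "logger"]
samples = [("producer", 500), ("consumer", 20), ("producer", 300)]
schedule = order_by_spin(threads, samples)
```

With these samples the producer, which spun the longest while waiting for the lock, is moved to the front of the schedule used the next time block 1410 runs.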
During development, a design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be data specifying the presence or absence of various features on the different mask layers of the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. The machine-readable medium may be an optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage such as a disc. Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that copying, buffering, or retransmission of the electrical signal is performed. Thus, a communication provider or a network provider may make copies of an article (a carrier wave) embodying techniques of the present invention.
Thus, techniques for a programmable event-driven yield mechanism that may activate other threads are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail, as facilitated by enabling technological advancements, without departing from the principles of the present invention or the scope of the appended claims.

Claims (27)

1. A multithreaded processor additionally supporting virtual multithreading, the processor comprising:
a physical thread context memory to store context information for a plurality of active threads;
a virtual thread context memory to store context information for a plurality of k virtual threads;
a virtual thread selector coupled to the virtual thread context memory and to the physical thread context memory, the virtual thread selector causing the context information for at least one virtual thread of the plurality of k virtual threads to be switched into the physical thread context memory.
2. The processor according to claim 1, wherein the physical thread context memory is dedicated hardware to maintain a plurality of thread contexts for use by a plurality of execution resources, and wherein the processor actively caches virtual thread context data in a location from the group consisting of a cache location, a register location, or another memory location.
3. The processor according to claim 2, wherein the processor caches the context information for the plurality of k virtual threads, and wherein the context information for a selected one of the plurality of k virtual threads must be switched into the physical thread context memory before the selected virtual thread is actively executed.
4. The processor according to claim 2, wherein the processor caches a context data set for a selected one of the plurality of k virtual threads in a processor storage location, and wherein the context information for the selected one of the plurality of k virtual threads must be switched into the physical thread context memory before the selected virtual thread is actively executed.
5. The processor according to claim 4, further comprising:
a monitor to detect a program-definable condition of a plurality of execution resources, wherein one of the plurality of k virtual threads is activated in response to the program-definable condition of the plurality of execution resources.
6. The processor according to claim 1, further comprising:
a plurality of event counters;
a monitor to cause a thread switch among the plurality of k virtual threads in response to the plurality of event counters reaching a plurality of program-defined counts.
7. The processor according to claim 1, further comprising logic to swap context information between a newly selected virtual thread and a previously selected virtual thread.
8. The processor according to claim 1, further comprising a processor thread selector, wherein the virtual thread selector presents a subset of the plurality of k virtual threads to the processor thread selector.
9. The processor according to claim 4, wherein a yield-type instruction causes a thread switch among the plurality of k virtual threads.
10. A multithreaded processor comprising:
context storage logic to maintain first context information for a first plurality of threads, the first plurality of threads being a plurality of active execution threads;
second thread logic to manage second context information for a second plurality of threads and to switch threads of the second plurality of threads to become members of the first plurality of threads, which are the plurality of active execution threads.
11. The multithreaded processor according to claim 10, wherein the second thread logic comprises:
a second thread context memory to store the second context information for the second plurality of threads;
a second thread switcher coupled to the second thread context memory, the second thread switcher presenting a subset of the second plurality of threads as candidate active threads; and
wherein the multithreaded processor further comprises processor thread switch logic coupled to the second thread switcher,
wherein the context storage logic comprises dedicated hardware to maintain contexts for use by a plurality of execution resources.
12. The multithreaded processor according to claim 10, wherein a yield-type instruction is used to cause a thread switch among the second plurality of threads.
13. The multithreaded processor according to claim 12, further comprising a memory location to cache the second context information for the second plurality of threads, wherein the second context information for a selected one of the second plurality of threads must be switched from the memory location into the context storage logic before the selected thread is actively executed.
14. The multithreaded processor according to claim 13, wherein the second context information for a subset of the second plurality of threads is switched into the context storage logic to activate the subset of the second plurality of threads.
15. The multithreaded processor according to claim 10, further comprising a monitor to cause a thread switch among the second plurality of threads in response to a program-definable process, the program-definable process indicating a condition that includes a plurality of threshold counts for a plurality of event counters.
16. A method comprising:
executing, from a plurality of N active thread slots in a dedicated hardware context memory, a plurality of active threads executed by execution resources in a multithreaded processor;
switching among a plurality of second threads, using second contexts maintained by the multithreaded processor in processor storage locations, to fill the plurality of N active thread slots.
17. The method according to claim 16, wherein executing the plurality of active threads comprises executing the plurality of active threads simultaneously on multithreaded execution resources, and wherein switching among the plurality of second threads comprises switching among a subset of the plurality of second threads to fill a subset of the plurality of N active thread slots.
18. The method according to claim 16, further comprising:
swapping context data to activate a new thread from the plurality of second threads and to swap out an old active thread so that it becomes one of the plurality of second threads.
19. A system comprising:
a memory to store a plurality of active threads and a plurality of virtual threads;
a multithreaded processor comprising:
execution resources to execute the plurality of active threads;
virtual thread logic to switch in a virtual thread as one of the plurality of active threads.
20. The system according to claim 19, wherein the multithreaded processor comprises a dedicated hardware context memory for the plurality of active threads and other context memory for the plurality of virtual threads, and wherein the multithreaded processor swaps in the context information for a newly selected virtual thread.
21. The system according to claim 19, wherein the multithreaded processor further comprises a low process detector to cause a thread switch.
22. The system according to claim 19, further comprising a communication interface.
23. A processor comprising:
processor thread switch logic to switch among a plurality of N threads;
virtual thread switch logic coupled to the processor thread switch logic, the virtual thread switch logic presenting a virtual thread of a plurality of virtual threads to the processor thread switch logic, wherein the processor thread switch logic is to switch between the virtual thread of the plurality of virtual threads and at least one other thread.
24. The processor according to claim 23, wherein the virtual thread switch logic is to present a subset of the plurality of virtual threads to the processor thread switch logic, and wherein the processor thread switch logic is to switch among the subset of the plurality of virtual threads and at least one other thread.
25. The processor according to claim 24, wherein the subset of the plurality of virtual threads is one virtual thread, and wherein the processor thread switch logic is to switch among the plurality of virtual threads, as a first thread, and N-1 other threads.
26. The processor according to claim 23, further comprising a plurality of execution resources to execute a plurality of threads selected as active by the processor thread switch logic and the virtual thread switch logic.
27. The processor according to claim 23, wherein the processor is represented by data stored on a computer-readable medium.
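As an informal illustration of the context swapping recited in the claims above (and not part of the claims themselves), the following minimal sketch models a fixed number of physical context slots backed by a larger store of virtual thread contexts. The class `VirtualThreadSelector`, its methods, and the dictionary-based context representation are all hypothetical simplifications of what the claims describe as hardware.

```python
class VirtualThreadSelector:
    """Software model: N physical context slots, k virtual thread contexts."""

    def __init__(self, num_physical_slots, virtual_contexts):
        # Each physical slot holds (thread_id, context) or None when empty.
        self.physical = [None] * num_physical_slots
        # Virtual thread context store: thread id -> saved context.
        self.virtual = dict(virtual_contexts)

    def switch_in(self, thread_id, slot):
        """Move a virtual thread's context into a physical slot.

        Any thread already occupying the slot is swapped back out to the
        virtual context store, so it becomes a virtual thread again.
        """
        new_context = self.virtual.pop(thread_id)
        previous = self.physical[slot]
        if previous is not None:
            previous_id, previous_context = previous
            self.virtual[previous_id] = previous_context
        self.physical[slot] = (thread_id, new_context)

    def active_threads(self):
        """Return the ids of threads currently holding a physical slot."""
        return [entry[0] for entry in self.physical if entry is not None]
```

In this model a thread can only be counted as active after `switch_in` has placed its context in a physical slot, mirroring the requirement that context information be switched into the physical thread context memory before execution.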
CN 200710104280 2005-03-02 2006-03-01 Processor with dummy multithread Pending CN101051266A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/070,991 US7587584B2 (en) 2003-02-19 2005-03-02 Mechanism to exploit synchronization overhead to improve multithreaded performance
US11/070991 2005-03-02

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN 200610019818 Division CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Publications (1)

Publication Number Publication Date
CN101051266A true CN101051266A (en) 2007-10-10

Family

ID=36946955

Family Applications (4)

Application Number Title Priority Date Filing Date
CN 200610019818 Expired - Fee Related CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance
CN 200710104280 Pending CN101051266A (en) 2005-03-02 2006-03-01 Processor with dummy multithread
CN201210460430.2A Expired - Fee Related CN102968302B (en) 2005-03-02 2006-03-01 Utilize synchronization overhead to improve the mechanism of multi-threading performance
CN 201110156959 Expired - Fee Related CN102184123B (en) 2005-03-02 2006-03-01 multithread processer for additionally support virtual multi-thread and system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN 200610019818 Expired - Fee Related CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201210460430.2A Expired - Fee Related CN102968302B (en) 2005-03-02 2006-03-01 Utilize synchronization overhead to improve the mechanism of multi-threading performance
CN 201110156959 Expired - Fee Related CN102184123B (en) 2005-03-02 2006-03-01 multithread processer for additionally support virtual multi-thread and system

Country Status (1)

Country Link
CN (4) CN1828544B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9760410B2 (en) * 2014-12-12 2017-09-12 Intel Corporation Technologies for fast synchronization barriers for many-core processing
US10268586B2 (en) * 2015-12-08 2019-04-23 Via Alliance Semiconductor Co., Ltd. Processor with programmable prefetcher operable to generate at least one prefetch address based on load requests
CN108376070A (en) 2016-10-28 2018-08-07 华为技术有限公司 A kind of method, apparatus and computer of compiling source code object
GB201717303D0 (en) * 2017-10-20 2017-12-06 Graphcore Ltd Scheduling tasks in a multi-threaded processor
CN111580792B (en) * 2020-04-29 2022-07-01 上海航天计算机技术研究所 High-reliability satellite-borne software architecture design method based on operating system
CN112286679B (en) * 2020-10-20 2022-10-21 烽火通信科技股份有限公司 DPDK-based inter-multi-core buffer dynamic migration method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6212544B1 (en) * 1997-10-23 2001-04-03 International Business Machines Corporation Altering thread priorities in a multithreaded processor
US6493741B1 (en) * 1999-10-01 2002-12-10 Compaq Information Technologies Group, L.P. Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit
US7487502B2 (en) * 2003-02-19 2009-02-03 Intel Corporation Programmable event driven yield mechanism which may activate other threads
US8694976B2 (en) * 2003-12-19 2014-04-08 Intel Corporation Sleep state mechanism for virtual multithreading

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470635B (en) * 2007-12-24 2012-01-25 联想(北京)有限公司 Method for multi-virtual processor synchronous scheduling and computer thereof
US10459728B2 (en) 2011-12-23 2019-10-29 Intel Corporation Apparatus and method of improved insert instructions
US10719316B2 (en) 2011-12-23 2020-07-21 Intel Corporation Apparatus and method of improved packed integer permute instruction
US11354124B2 (en) 2011-12-23 2022-06-07 Intel Corporation Apparatus and method of improved insert instructions
US9588764B2 (en) 2011-12-23 2017-03-07 Intel Corporation Apparatus and method of improved extract instructions
US9619236B2 (en) 2011-12-23 2017-04-11 Intel Corporation Apparatus and method of improved insert instructions
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
US9658850B2 (en) 2011-12-23 2017-05-23 Intel Corporation Apparatus and method of improved permute instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
US11347502B2 (en) 2011-12-23 2022-05-31 Intel Corporation Apparatus and method of improved insert instructions
US11275583B2 (en) 2011-12-23 2022-03-15 Intel Corporation Apparatus and method of improved insert instructions
US10474459B2 (en) 2011-12-23 2019-11-12 Intel Corporation Apparatus and method of improved permute instructions
US10467185B2 (en) 2011-12-23 2019-11-05 Intel Corporation Apparatus and method of mask permute instructions
TWI489383B (en) * 2011-12-23 2015-06-21 Intel Corp Apparatus and method of mask permute instructions
US10768960B2 (en) 2013-12-20 2020-09-08 Huawei Technologies Co., Ltd. Method for affinity binding of interrupt of virtual network interface card, and computer device
CN103699428A (en) * 2013-12-20 2014-04-02 华为技术有限公司 Method and computer device for affinity binding of interrupts of virtual network interface card
CN106170768B (en) * 2014-03-27 2020-01-03 国际商业机器公司 Dispatching multiple threads in a computer
CN106170768A (en) * 2014-03-27 2016-11-30 国际商业机器公司 Assign multiple thread in a computer
CN108292230B (en) * 2015-12-11 2022-06-10 图芯芯片技术有限公司 Hardware access counters and event generation for coordinating multithreading
CN108292230A (en) * 2015-12-11 2018-07-17 图芯芯片技术有限公司 Hardware access counter and for coordinate multiple threads event generate

Also Published As

Publication number Publication date
CN102968302B (en) 2016-01-27
CN102184123A (en) 2011-09-14
CN1828544B (en) 2013-01-02
CN102968302A (en) 2013-03-13
CN102184123B (en) 2013-10-16
CN1828544A (en) 2006-09-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20071010