CN102968302A - Mechanism for improving multithreading performance using synchronization overhead

Publication number
CN102968302A
Authority
CN
China
Prior art keywords
thread
lock
event
monitor
equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104604302A
Other languages
Chinese (zh)
Other versions
CN102968302B (en
Inventor
N.英赖特
J.科林斯
P.王
H.王
X.田
J.沈
G.肖弗
P.哈马伦德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/070,991 (US7587584B2)
Application filed by Intel Corp
Publication of CN102968302A
Application granted
Publication of CN102968302B
Expired - Fee Related
Anticipated expiration

Abstract

Method, apparatus, and program means for a programmable event driven yield mechanism that may activate other threads. In one embodiment, an apparatus includes execution resources to execute a plurality of instructions and a monitor to detect a condition indicating a low level of progress. The monitor can disrupt processing of a program by transferring to a handler in response to detecting the condition indicating a low level of progress. In another embodiment, thread switch logic may be coupled to a plurality of event monitors which monitor events within the multithreading execution logic. The thread switch logic switches threads based at least partially on a programmable condition of one or more of the performance monitors.

Description

Mechanism for exploiting synchronization overhead to improve multithreading performance
Technical field
The present disclosure pertains to the field of processing apparatuses and systems that process sequences of instructions, and to particular instruction sequences that program such apparatuses and/or systems. Some embodiments relate to monitoring and/or responding to conditions or events within the execution resources of such processing apparatuses.
Background art
Various mechanisms are currently used to change the flow of control (that is, the processing path or instruction sequence being followed) in a processing system. For example, a jump instruction in a program sequence explicitly causes a jump to a new address. The jump instruction is an example of an explicit change of control flow, because the instruction directs the processor to jump to a location and continue executing from that point. A traditional jump instruction is "precise" (or synchronous) because the jump occurs as a direct result of executing the jump instruction.
Another traditional example of a change in control flow is an interrupt. An interrupt may be an external signal provided to an apparatus such as a processor. The processor responds by jumping to an interrupt handler, a routine that handles the event signaled by the particular interrupt. Interrupts are typically also relatively precise, in that the processor recognizes and acts on them within a particular window of time after receiving them. In particular, an interrupt usually takes effect at the next instruction boundary after it is received internally. In some cases, only the operating system or other software operating at a high privilege level is allowed to mask interrupts, so a user program may have no opportunity to enable or disable these control flow changing events.
Another traditional example of a change in control flow occurs in response to an exception. An exception typically reflects a predefined architectural condition, such as the result of a mathematical instruction meeting certain criteria (denormal, underflow, overflow, not a number, etc.). Some exceptions can be masked, for example by setting a bit in a control register. If an exception occurs and is not masked, an exception handler is called to handle it.
Another technique that changes the control flow of a processor is the use of breakpoints. Breakpoints are typically used in debugging. A particular instruction address may be programmed into a breakpoint register. When breakpoints are active and the target address is reached, the processor takes some action other than continuing the program as usual. Breakpoints allow, among other things, single-stepping through a program.
Multithreading is a technique by which processor hardware may be used by multiple different threads. A multithreaded processor may switch between threads for a variety of reasons. For example, a processor may have an algorithm that automatically switches between available threads. Other processors use switch-on-event multithreading (SoEMT), whereby certain events such as a cache miss cause a thread switch. A thread switch can be considered a change of control flow, because the processor switches the sequence or stream of instructions that it executes.
One prior art reference describes a quiesce instruction in detail (see U.S. Patent No. 6,493,741). In one example, the quiesce instruction stops processing in a thread until either a timer expires or a memory write to a memory location occurs. Therefore, an instruction such as the quiesce instruction may itself trigger a temporary halt in processing of the thread containing it and a switch to another thread.
Brief description of the drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Fig. 1 illustrates one embodiment of a system that can detect and respond to processing conditions of execution resources.
Fig. 2 illustrates a flow diagram of operations for one embodiment of the system of Fig. 1.
Fig. 3 illustrates a flow diagram of operations for another embodiment of the system of Fig. 1.
Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or to composite performance events.
Fig. 5a illustrates one embodiment of a monitor that may recognize composite events.
Fig. 5b illustrates another embodiment of a monitor.
Fig. 5c illustrates another embodiment of a monitor.
Fig. 6 illustrates a flow diagram for execution of a user program that activates helper threads in response to program-definable triggers, according to one embodiment.
Fig. 7 illustrates a flow diagram for a process of refining monitor settings according to one embodiment.
Fig. 8 illustrates a flow diagram for a process of updating software according to one embodiment.
Fig. 9a illustrates a flow diagram in which multiple nested helper threads are activated to assist processing.
Fig. 9b illustrates thread switch logic for one embodiment that supports virtual threads.
Fig. 10a illustrates one embodiment of a context-sensitive event schema vector and mask implementation.
Fig. 10b illustrates one embodiment of a context-sensitive event schema vector and mask implementation.
Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switches based on monitor events.
Fig. 12 illustrates one embodiment of a system having event detection and handling capability for synchronization objects.
Fig. 13 illustrates a flow diagram of synchronization event handling according to various embodiments.
Fig. 14 illustrates a flow diagram of thread scheduling improvement based on lock profiling by event handler threads.
Detailed description
The following description sets forth embodiments of a programmable event driven yield mechanism that may activate other threads. In the following description, numerous specific details such as processor types, microarchitectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated by one skilled in the art, however, that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.
In some embodiments, the disclosed techniques allow a program, while it executes, to actively monitor and respond to conditions of the execution resources that execute it. Effectively, such embodiments may incorporate real-time feedback on execution resource operating conditions to improve performance. If the execution resources encounter execution-delaying conditions, program execution can be disrupted so that adjustments can be made. In some embodiments, a handler may be activated, and the handler may spawn a helper thread to attempt to improve execution of the original thread. In other embodiments, the disruption may be accomplished by switching to another program thread that is not a helper thread. These and other embodiments may, in some cases, advantageously improve processing throughput and/or allow optimizations to be tailored to particular hardware.
Referring to Fig. 1, one embodiment of a system that can detect and respond to processing conditions of execution resources is described. In the embodiment of Fig. 1, execution resources 105, a monitor 110, and enable logic 120 form a portion of a processor 100 that is capable of executing instructions. In some embodiments, the execution resources include hardware resources that may be integrated into a single component or integrated circuit. However, the execution resources may include software or firmware resources, or any combination of hardware, software, and/or firmware that may be used to execute program instructions. For example, firmware may be used as part of an abstraction layer or may add functions to processing hardware, as may software. Software may also be used to emulate part or all of an instruction set or to otherwise assist in processing.
The processor may be any of a variety of types of processor that execute instructions. For example, the processor may be a general purpose processor, such as a processor from the […] Processor Family or the […] Processor Family or another processor family from Intel Corporation, or a processor from another company. Thus, the processor may be a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a very long instruction word (VLIW) processor, or any hybrid or alternative processor type. Moreover, special purpose processors such as network or communication processors, co-processors, embedded processors, compression engines, graphics processors, and so on may use the disclosed techniques. As integration trends continue and processors become even more complex, the need to monitor and react to internal performance indicators may further increase, making the presently disclosed techniques more desirable. However, because technology in this field advances rapidly, it is difficult to foresee all applications of the disclosed technology, though it may be widely applicable to complex hardware that executes program sequences.
As shown in Fig. 1, the processor 100 is coupled to a storage medium 150 such as a memory. The storage medium 150 may be a memory subsystem with various levels of hierarchy, which may include but are not limited to various levels of cache memory, system memory such as dynamic random access memory, and non-volatile storage such as flash memory (e.g., a memory stick), a magnetic disk, or an optical disk. As illustrated, the storage medium stores a program 160 and a handler and/or other thread, such as a helper thread 170.
To allow the monitor to watch for the desired events, the monitor 110 may be coupled to various portions of the execution resources in order to detect particular conditions or to be informed of certain microarchitectural events. Signal lines may be routed to the monitor 110, or the monitor may be strategically placed with or integrated with the relevant resources. The monitor may include various programmable logic or software or firmware elements, or may be custom designed to detect a particular condition. The monitor tracks the various events or conditions and, if the events or conditions it is intended to detect occur, signals the execution resources 105 to disrupt the normal control flow that the program would otherwise follow. As indicated in Fig. 1, the disruption may result in an event handler being called or a thread switch occurring.
One example of a specifically detectable condition is that data is missing from a cache memory, resulting in a cache miss. In fact, a program may generate a pattern of memory accesses that causes repeated cache misses, thereby degrading performance. The occurrence of a certain number of cache misses within a period of time, or during execution of a certain portion of code, is one example of an event that indicates a relatively low level of progress in executing that section of code.
Other detectable events that may be indicators of low progress relate to various other microarchitectural or structural details of the execution resources. A monitor may detect a condition involving one or more of: a stall of a resource, a cache event, a retirement event, a branch or branch prediction result, an exception, a bus event, or a variety of other events or conditions that are commonly monitored or that affect performance. The monitor may count such events or conditions, or may time, quantify, or characterize them, and may be programmable to trigger when a particular metric associated with one or more events or conditions occurs.
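The notion of "a certain number of cache misses within a period of time" can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the class name `WindowedMissMonitor`, its parameters, and the cycle numbers are hypothetical, and an actual monitor would be realized in hardware, firmware, or microcode rather than in Python.

```python
from collections import deque


class WindowedMissMonitor:
    """Hypothetical model: fires when `threshold` misses fall within `window` cycles."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.miss_cycles = deque()

    def record_miss(self, cycle):
        """Record a cache miss at `cycle`; return True if the trigger condition holds."""
        self.miss_cycles.append(cycle)
        # Discard misses that are older than the sliding window.
        while self.miss_cycles[0] <= cycle - self.window:
            self.miss_cycles.popleft()
        return len(self.miss_cycles) >= self.threshold


# Assumed example: three misses within 100 cycles indicate low progress.
monitor = WindowedMissMonitor(threshold=3, window=100)
fired = [monitor.record_miss(cycle) for cycle in (10, 200, 250, 290, 500)]
```

In this example only the miss at cycle 290 satisfies the condition, because it is the third miss inside a single 100-cycle window.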
Fig. 2 illustrates a flow diagram of operations for one embodiment of the system of Fig. 1. As indicated in block 200 of Fig. 2, the program 160 may set conditions that cause a change in the execution control flow. For example, the enable logic 120 may control both activation of the monitor and which event or events the monitor detects. Alternatively, the enable logic 120 may enable and/or mask events while the monitor 110 is itself also programmable, giving greater flexibility in specifying the events or conditions within the execution resources or system that are tracked. In either case, the program 160 itself may specify the conditions to be watched during its own execution. The program 160 may also provide the handler or thread 170 that is activated when the monitored conditions occur. For example, the program may be one that includes a main thread and a helper thread or helper routine that attempts to improve execution of the main thread when the conditions the program specifies occur.
As indicated in block 205, the program instructions are executed. Execution of the program causes the state of the execution resources to change. For example, a variety of conditions that inhibit forward progress may occur or be present during execution of the program. As indicated in block 210, the various processing metrics and/or microarchitectural conditions are monitored to determine whether the triggering event programmed in block 200 occurs. If the triggering state does not occur in block 210, the monitor is not triggered and program execution continues by returning to block 205.
In some cases, the triggering state bears only an indirect relationship to the execution of any single instruction. For example, a prior art breakpoint detector typically causes a break when the instruction pointer reaches a designated address. Such breakpoints are precise because a particular instruction (i.e., its address) directly triggers the break. Likewise, the prior art quiesce instruction itself causes a thread to stop, at least temporarily. In contrast, some embodiments using the techniques disclosed herein trigger control flow changes on a set of conditions that is not necessarily caused by a single instruction, but rather may be caused by the overall program flow and/or system environment. Thus, while the monitor may repeatedly trigger at the same instruction execution state in a single system, other conditions, environments, systems, and so on may cause different trigger points for the same program. In this sense, the disclosed techniques in some cases provide an imprecise or asynchronous mechanism for generating a control flow change that is not directly tied to an instruction execution boundary. Moreover, in some embodiments such an imprecise mechanism may test for events at a finer granularity than each instruction, and/or may delay recognition of events for some period of time, because architectural correctness does not depend on any processing-rate-enhancing helper routine executing at any particular point in time.
When the monitor detects the triggering state in block 210, processing of the program is disrupted, as indicated in block 215. Generally, the system may responsively adjust because processing of the program is occurring inefficiently or in a manner other than the programmer desired. For example, another software routine, such as another program portion, may be invoked. The other program portion may be another thread unrelated to the original thread, or may be a helper thread that helps execute the original thread, for example by prefetching data to reduce cache misses. Alternatively, a program-transparent (e.g., hardware) mechanism may perform some optimization, reconfiguration (including but not limited to reconfiguration of the monitor setup), reallocation of resources, or the like, in the hope of improving processing.
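The loop through blocks 200 to 215 of Fig. 2 can be mimicked in a few lines of Python. This is a minimal sketch, assuming a toy model in which each "instruction" is represented only by the number of stalled cycles it incurs; the names `run_with_monitor`, `low_progress`, and `handler`, and the threshold of 10 cycles, are hypothetical and chosen for illustration.

```python
def run_with_monitor(stall_costs, trigger, handler):
    """Toy model of Fig. 2: the caller (the program) supplies trigger and handler."""
    state = {"stalled_cycles": 0, "retired": 0}
    disruptions = []
    for cost in stall_costs:
        # Block 205: "execute" one instruction and update resource state.
        state["stalled_cycles"] += cost
        state["retired"] += 1
        # Block 210: test the program-defined trigger condition.
        if trigger(state):
            # Block 215: disrupt the program and run the program-supplied handler.
            disruptions.append(handler(state))
    return disruptions


def low_progress(state):
    # Assumed condition: 10 or more stalled cycles since the last disruption.
    return state["stalled_cycles"] >= 10


def handler(state):
    # A real handler might spawn a helper thread; this one only records where
    # the disruption happened and clears the accumulated count.
    state["stalled_cycles"] = 0
    return state["retired"]


disruption_points = run_with_monitor([0, 5, 0, 7, 1, 9], low_progress, handler)
```

With these inputs the handler runs after the fourth and sixth instructions. The trigger depends on accumulated state rather than on which instruction happened to execute last, which is the sense in which the mechanism is imprecise.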
Fig. 3 illustrates one example in which a helper thread is invoked. In particular, the flow diagram of Fig. 3 details operations for one embodiment of the system of Fig. 1 in which the execution resources are multithreaded resources and the program invokes a helper thread when a certain triggering condition occurs. Thus, as indicated in block 300, a first thread (e.g., a main program) sets a monitor condition. The condition may be any one or more of the variety of conditions discussed herein. The first thread executes a code section, as indicated in block 310. If the triggering condition does not occur, as tested in block 320, the code section continues executing, as indicated in block 310.
If the triggering condition does occur, a helper thread is activated to assist the first thread, as indicated in block 330. The helper thread may be activated by a routine such as a handler routine, or simply by a thread switch. For example, in one embodiment, the trigger condition that the monitor signals to the execution resources may cause the execution resources to jump to an event handler that spawns a helper thread. In another embodiment, the helper thread may be one of the other active threads. In yet another embodiment, the processor may provide one or more special helper thread execution slots, and the monitor may cause a switch to a helper thread from one of these slots. As indicated in block 340, both threads then continue to execute. Ideally, the helper thread runs ahead and clears up conditions that would otherwise cause the first thread to stall or perform poorly.
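The flow of Fig. 3 can be approximated with ordinary software threads. The sketch below is an analogy only, under several assumptions: a Python dictionary stands in for a cache, `memory` stands in for slower storage, and the names `helper_prefetch`, `main_thread`, and `miss_limit` are hypothetical. It shows a main thread that starts a prefetching helper once it has seen a given number of misses.

```python
import threading

memory = {addr: addr * 2 for addr in range(8)}  # stand-in for slow storage
cache = {}                                      # stand-in for a cache


def helper_prefetch(addresses):
    """Helper thread: run ahead and bring the remaining data into the cache."""
    for addr in addresses:
        cache.setdefault(addr, memory[addr])


def main_thread(addresses, miss_limit):
    """Blocks 300-340: execute, watch the miss count, start a helper on trigger."""
    misses = 0
    total = 0
    helper = None
    for i, addr in enumerate(addresses):
        if addr not in cache:          # a "cache miss" in this toy model
            misses += 1
            cache[addr] = memory[addr]
        total += cache[addr]
        # Block 320/330: on the trigger condition, activate the helper once.
        if helper is None and misses >= miss_limit:
            helper = threading.Thread(
                target=helper_prefetch, args=(addresses[i + 1:],)
            )
            helper.start()
    if helper is not None:
        helper.join()
    return total, misses


total, misses = main_thread(list(range(8)), miss_limit=2)
```

The computed total does not depend on the helper, which mirrors the point that correctness does not rely on the helper running at any particular time. The number of misses the main thread observes after the helper starts varies with thread scheduling.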
Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or to composite performance events. In the embodiment of Fig. 4, the execution resources 400 are shown as including a set of N monitors 410-1 through 410-N. Additionally, an event schema vector (ESV) storage location 420 and an event schema vector mask (ESVM) storage location 425 are provided. The embodiment of Fig. 4 shows a number of monitors (N) corresponding to the number of bits in the event schema vector and the event schema vector mask. In other embodiments, there may be a different number of monitors than bits in these vectors, and the monitors may or may not correlate directly to the bits. For example, in some embodiments a condition involving multiple monitors may be associated with a single vector bit.
The execution resources 400 are optionally coupled to an event descriptor table (EDT) 430, which may be implemented locally on the processor or in a co-processor or system memory. Control flow logic 435 is coupled to the monitors 410-1 through 410-N and receives values from the event schema vector (ESV) and the event schema vector mask (ESVM). The control flow logic 435 changes the control flow of the processing logic when a condition detected by one or more of the monitors is enabled according to the ESV and ESVM.
The embodiment of Fig. 4 also illustrates decode logic 402 and a set of machine or model specific registers (MSRs) 404. Either or both of the decode logic 402 and the MSRs may be used to program and/or activate the monitors and the event schema vector (ESV) and mask. For example, MSRs may be used to program the type or number of events that trigger the monitors. MSRs may also be used to program the ESV and mask. Alternatively, one or more new dedicated instructions to be decoded by the decoder 402 may be used to program either or both of the monitors and the ESV and mask. For example, a yield instruction may be used to enable disruption of program processing when a certain set of conditions occurs. Some or all of the conditions may be specified by an operand to the yield instruction, or may be programmed in advance of its execution. The yield instruction may be decoded by the decoder 402 to trigger a microcode routine, to produce a corresponding micro-operation or micro-instruction or a sequence of micro-operations that directly signals the appropriate logic, to activate a co-processor, or otherwise to implement the yield functionality. In some embodiments, the concept of yielding aptly describes the instruction: a thread may continue after executing the yield instruction but may be slowed at some point by the execution of another thread or handler. For example, a largely single-threaded program may invoke extra helper threads and share the processor with them.
In the embodiment of Fig. 4, a memory 440 includes an event handler 450 and a main thread 460. In some embodiments, the event descriptor table may be stored in the same memory, or in the same memory hierarchy, as the main thread 460 and the handler 450. As previously discussed, the handler may spawn a helper thread to help the main program execute efficiently.
The memory 440 may also store an update module 442 that communicates through a communications interface 444. The update module 442 may be a hardware module or a software routine executed by the execution resources to obtain new conditions to be programmed into the various monitors and/or the enable logic. The update module 442 may also obtain new helper threads or routines. For example, a software program may download such modules from the vendor of the software program to provide enhanced performance. Thus, the network interface 444 may be any network and/or communication interface capable of conveying information over a communication channel. In some cases, the network interface may connect to the Internet to download new conditions and/or helper routines or threads.
In one embodiment, each bit of the event schema vector (ESV) indicates the occurrence or non-occurrence of a particular event, and that event may itself be a composite event reflective of (and/or expressed via Boolean operations in terms of) a variety of other conditions or events. Occurrence of the particular event may set the corresponding bit in the ESV. Each bit in the ESV has a corresponding bit in the event schema vector mask (ESVM). If the mask bit indicates that the particular event is masked, the control flow logic 435 disregards the event, although the bit in the ESV remains set due to the event's occurrence. The user may choose whether to clear the ESV when unmasking events. Thus, an event may be masked for some time and handled later. In some embodiments, the user may choose to specify the trigger as level triggered or edge triggered, depending on issues such as the relationship between event update, sampling, and reset (or the hold time of a trigger event in the ESV).
If the mask bit indicates that an event is unmasked, then in this embodiment the control flow logic 435 calls an event handler for that particular event. The control flow logic 435 may vector into the event descriptor table (EDT) 430 based on the bit position in the event schema vector (ESV), so the EDT may have N entries corresponding to the N bits in the ESV. The EDT may contain a handler address indicating the address to which the control flow logic 435 should redirect execution, and may also include other information useful in a particular embodiment. For example, privilege level, thread, process, and/or other information may be maintained or updated in the EDT.
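The per-bit dispatch described above can be pictured with plain integers as bit vectors. The following is a minimal Python sketch under stated assumptions: a mask bit of 1 is taken to mean "unmasked" (the text does not fix the polarity), the EDT is modeled as a list of callables indexed by bit position, and the function name `dispatch_events` and the handler strings are hypothetical.

```python
def dispatch_events(esv, esvm, edt):
    """Call the EDT handler for every ESV bit that is set and unmasked."""
    results = []
    for bit, handler in enumerate(edt):
        occurred = (esv >> bit) & 1
        unmasked = (esvm >> bit) & 1  # assumption: 1 means "not masked"
        if occurred and unmasked:
            results.append(handler())
    return results


# Hypothetical four-entry event descriptor table, one handler per ESV bit.
edt = [
    lambda: "handler 0: cache misses",
    lambda: "handler 1: branch mispredictions",
    lambda: "handler 2: resource stalls",
    lambda: "handler 3: bus events",
]

# Events 1 and 2 have occurred; only events 0 and 2 are unmasked.
called = dispatch_events(esv=0b0110, esvm=0b0101, edt=edt)
```

Here event 1 stays recorded in the ESV but is ignored because it is masked, so only the handler for bit 2 runs.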
In another embodiment, the event descriptor table 430 may not be needed, or it may be a single entry that indicates the address of a single event handler that handles all events. In this case, the entry may be stored in a register or other processor storage location. In one embodiment, a single handler may be used, and that handler may access the event schema vector (ESV) to determine which event occurred and therefore how to respond. In another embodiment, the ESV may collectively define an event that causes the control flow logic 435 to call a handler. In other words, the ESV may represent a variety of conditions that together signal one event. For example, the event schema vector mask (ESVM) may be used to designate which of the events indicated by the ESV must occur to trigger execution of the handler. Each bit may represent a monitor reaching a programmable condition. When all of the unmasked monitors reach their respective designated conditions, the handler is called. Thus, the entire ESV may be used to designate some complex composite condition that should trigger execution of the handler.
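The composite case, in which the handler runs only when every unmasked monitor has reached its condition, reduces to a single bitwise test. The sketch below again assumes that a mask bit of 1 means "unmasked" and treats an all-zero mask as never firing; both choices, and the name `compound_event_fired`, are assumptions made for illustration.

```python
def compound_event_fired(esv, esvm):
    """True when every event selected by the mask is set in the ESV."""
    # Assumption: an empty mask selects no events and therefore never fires.
    return esvm != 0 and (esv & esvm) == esvm


all_selected_present = compound_event_fired(esv=0b1011, esvm=0b0011)
one_selected_missing = compound_event_fired(esv=0b0001, esvm=0b0011)
```

The first call reports a composite event because bits 0 and 1 are both set; the second does not, because bit 1 has not yet occurred.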
In another embodiment, multiple event schema vectors and masks may be used to designate different conditions. The different vectors may vector to different handlers via the event descriptor table or some other mechanism. In another embodiment, some bits of one or more event schema vectors may be grouped to form events that trigger the calling of handlers. A variety of other permutations will be apparent to those skilled in the art.
Fig. 5a illustrates one embodiment of a monitor 500 that is programmable and capable of interfacing with a variety of performance monitors to signal a composite event. For example, such performance monitors may record occurrences of various microarchitectural events or conditions, such as: cache misses incurred at a given level of the cache hierarchy; branch retirement; branch misprediction (or retirement of mispredicted branches); trace cache delivery mode changes or events; branch prediction unit fetch requests; cancellations of memory requests; cache line splits (counts of completed split loads, stores, etc.); replay events; various types of bus transactions (e.g., locks, burst reads, writebacks, invalidates); allocations in a bus sequencer (or only certain types); numeric assists (an underflow, a denormal, etc.); execution or retirement of a particular type of instruction or micro-operation (uOP); machine clears (or pipeline flushes); resource stalls (register renaming resources, pipeline, etc.); processing of tagged uOPs; instructions or uOPs retired; lines allocated in a cache (and/or in a particular state, e.g., M); the number of cycles instruction fetch is stalled; the number of cycles the instruction length decoder is stalled; the number of cache fetches; the number of lines allocated (or evicted) in a cache; and so on. These are only some examples of microarchitectural events or conditions that may be monitored. Various other possibilities, as well as combinations of these or other conditions, will be apparent to one of skill in the art. Moreover, these and/or other conditions or events may be monitored with any of the monitors disclosed in any of the disclosed embodiments.
Performance monitors are often included in processors to count certain events. A programmer may read such performance monitor counts through a manufacturer-defined interface, such as processor-specific macro-instructions like the RDPMC instruction supported by known Intel processors. See, e.g., Appendix A of Volume III of the Intel Software Developers Guide for the […] 4 Processor. In some embodiments, other internal or micro-instructions or micro-operations may be used to read performance counters. Thus, for example, performance monitors may be adapted for use with the disclosed techniques. In some cases, a programmable performance monitor may be modified to provide event signaling capabilities. In other embodiments, performance monitors may be read by other monitors to establish events.
In the embodiment of Fig. 5a, the monitor 500 may include a set of programmable entries. Each entry may include an entry number 510, an enable field 511, a performance monitor number (EMON #) 512 that specifies one of a set of performance monitors, and a triggering condition 514. The triggering condition may be, for example, a certain count being reached, a count falling within a certain range, a difference in counts, and so on. The monitor 500 may include logic to read, or be coupled to receive, counts from the designated performance monitor. The monitor 500 signals the control flow logic when the various M conditions occur. A subset of the M entries may be used by selectively programming the enable fields of each entry.
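The entry table of Fig. 5a maps naturally onto a small data structure. The Python sketch below is illustrative only and rests on assumptions: the field names of `MonitorEntry`, the function `monitor_signals`, the counter values, and the rule that the monitor signals only when all enabled entries are satisfied (one reading of "when the various M conditions occur") are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MonitorEntry:
    enabled: bool                      # enable field 511
    emon: int                          # performance monitor number 512
    condition: Callable[[int], bool]   # triggering condition 514


def monitor_signals(entries: List[MonitorEntry], counts: List[int]) -> bool:
    """Signal when every enabled entry's condition holds for its counter."""
    active = [entry for entry in entries if entry.enabled]
    return bool(active) and all(
        entry.condition(counts[entry.emon]) for entry in active
    )


entries = [
    MonitorEntry(True, 0, lambda c: c >= 100),      # a count being reached
    MonitorEntry(True, 1, lambda c: 5 <= c <= 10),  # a count within a range
    MonitorEntry(False, 2, lambda c: c >= 1000),    # disabled, so ignored
]

signaled = monitor_signals(entries, [120, 7, 40])
not_signaled = monitor_signals(entries, [120, 20, 40])
```

Disabling an entry through its enable field removes it from the composite condition without reprogramming the other entries.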
Fig. 5b illustrates another embodiment of a monitor 520. The monitor 520 represents a custom composite event monitor. The monitor 520 receives a set of signals from various execution resources, or portions of resources, via signal lines 528-1 through 528-X and combines them in combinational logic 530. If the monitor 520 receives the proper combination of signals, it signals the control flow logic via output signal line 532.
Fig. 5c illustrates another embodiment of a monitor 540. The monitor 540 includes a table with M entries. Each entry includes an enable field 552, a condition field 554, and a trigger field 556. The condition field can be programmed to specify the combination of input signals to be monitored. These conditions may or may not be tied to other event detection structures such as performance monitors, so they are more general than the conditions discussed with respect to Fig. 5a. The trigger field 556 may specify the states of those input signals that are required to signal the control flow logic. Again, each entry may be enabled or disabled via the enable field 552. In some embodiments, the condition and trigger fields may be combined. Various combinations of these or other types of monitors, known or otherwise available, simpler or more sophisticated, will be apparent to those skilled in the art.
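One way to read the condition and trigger fields of monitor 540 is as bit vectors over the input signals. The sketch below adopts that reading as an assumption: the condition field selects which signals are watched, and the trigger field gives the state those signals must have. The function names are hypothetical.

```python
def entry_fires(enable: bool, condition: int, trigger: int, signals: int) -> bool:
    # condition: bit mask selecting the input signals to watch (field 554)
    # trigger:   required state of the selected signals (field 556)
    # enable:    enable field 552
    return enable and (signals & condition) == (trigger & condition)


def monitor_540(table, signals: int) -> bool:
    """Signal the control flow logic if any enabled table entry matches."""
    return any(entry_fires(en, cond, trig, signals) for en, cond, trig in table)
```

Signals outside the condition mask are ignored, so one entry can watch a small combination of inputs while the rest vary freely.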
Fig. 6 illustrates a flow diagram for the execution of a user program that activates worker threads in response to program-definable triggers, according to one embodiment. In block 600, the program first tests whether the yield capability is present. "Yield capability" is used here as shorthand for the ability to disrupt processing based on the occurrence of a condition or event. As an alternative to testing for yield capability support, the yield capability may use opcodes previously defined as no-operation opcodes and/or previously unused or undefined MSRs, so that use of the yield capability has no effect on processors that lack it. The presence of the capability can also be queried by checking a special CPU-ID that encodes hints indicating whether the capability is present on a given processor or platform. Similarly, a special instruction such as Itanium's PAL (processor abstraction layer) call or SALE (system abstraction layer environment) can be used to query processor-specific configuration information, including the availability of the program-definable yield capability. Assuming the yield capability is present, the user program may then read and/or reset various counters, as indicated in block 610. For example, performance monitor counters may be read so that a delta can be computed, or the values may be reset if that capability is available.
As indicated in block 620, the user program then sets the worker thread trigger condition. The yield capability may be accessible at a low privilege level (e.g., user level) so that any program, or most programs, can use this feature. For example, in some processor families the yield capability may be available to programs at the ring 3 privilege level. The user program itself is therefore able to set its own performance-based trigger conditions. A user program or operating system that is aware of such context-sensitive monitor configurations may choose to save and restore the application-specific monitor configuration/setup across thread/process context switches, if the application demands it or the operating system can provide a persistent monitoring capability.
As indicated in block 630, the user program continues to execute after the yield conditions are programmed. Whether a yield condition occurs is tested in block 640. If the yield condition does not occur, the program continues to execute, as indicated in block 630. If the yield condition does occur, a worker thread is activated, as indicated in block 650. The flowchart form of Fig. 6 tends to imply that events are polled synchronously, and this approach may be used in some embodiments. However, some embodiments react to events asynchronously when they occur, or within a number of clock cycles of when they occur, rather than polling for them at particular intervals. In some embodiments, a monitor condition may be set outside of a loop or other code section in order to detect a particular condition. This concept is illustrated by the following pseudo-code example of a main thread and a worker thread.
[Figure: pseudo-code example of the main thread and the worker thread; image not reproduced in this text.]
One advantage of setting the trigger outside the loop is that compiler optimizations within the loop are not inhibited. For example, some compilers do not optimize loops or code sections that contain intrinsics such as those used to activate the yield capability. Placing such intrinsics outside the loop removes this interference with compiler optimizations.
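The idea of programming the trigger once, before the loop, can be sketched in software. The Python model below is only a simulation under stated assumptions: `YieldMonitor` and `SimulatedMemory` are hypothetical stand-ins for the hardware monitor and the memory system, a "cache miss" is faked on every `miss_every`-th access, and activating a worker thread is reduced to calling a handler.

```python
class YieldMonitor:
    """Hypothetical software stand-in for a programmable hardware monitor."""

    def __init__(self):
        self.threshold = None
        self.handler = None
        self.count = 0

    def set_trigger(self, threshold, handler):
        self.threshold, self.handler, self.count = threshold, handler, 0

    def signal(self):
        # Called by the simulated hardware, never by the loop body itself.
        self.count += 1
        if self.threshold is not None and self.count >= self.threshold:
            self.count = 0
            self.handler()


class SimulatedMemory:
    """Fakes a cache miss on every miss_every-th access (an assumption)."""

    def __init__(self, values, monitor, miss_every):
        self.values, self.monitor, self.miss_every = values, monitor, miss_every

    def load(self, i):
        if i % self.miss_every == 0:
            self.monitor.signal()
        return self.values[i]


def main_thread(values, miss_every=4, threshold=2):
    monitor = YieldMonitor()
    activations = []
    # The trigger is programmed once, outside the loop.
    monitor.set_trigger(threshold, lambda: activations.append("worker"))
    memory = SimulatedMemory(values, monitor, miss_every)
    total = 0
    for i in range(len(values)):  # loop body contains no yield-related calls
        total += memory.load(i)
    return total, len(activations)
```

The loop body only loads and accumulates, so nothing related to the yield setup sits inside the code a compiler would want to optimize.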
Fig. 7 illustrates a flow diagram for a process of refining yield settings according to one embodiment. Using a processor with a yield capability or the like, a programmer may design a program, together with helper routines to be invoked under various circumstances, as indicated in block 700. Helper routines may thus be provided for the various processing-impeding conditions that the programmer anticipates. The processor can invoke these routines if and when they are needed during execution of the program. The yield settings may include the event schema vector and mask vector and/or monitor settings or the like.
On a particular processor, a particular yield setting may result in favorable execution throughput. Making this determination manually may be quite difficult, however, so it is better derived empirically. A compiler or other tuning software (e.g., the Intel VTune code analyzer) may therefore repeatedly simulate the code with different yield configurations, thereby deriving optimal or desirable settings, as indicated in block 710. Desirable runtime values for the yield settings can then be chosen, as indicated in block 720. A program may be simulated on multiple versions of a processor, on multiple different processors, or in multiple different systems to derive different yield settings. A system or processor identification such as a CPU-ID may be used by the program to select which yield settings to apply when it runs, as indicated in block 730.
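The empirical tuning and selection steps of blocks 710 through 730 can be sketched as two small functions. This is a minimal sketch, assuming that a yield setting is a single number (e.g., a cache-miss count threshold) and that a `simulate` callback returns a cost such as cycles; both are hypothetical simplifications.

```python
def derive_best_setting(candidates, simulate):
    """Blocks 710/720: try each candidate setting and keep the cheapest one."""
    return min(candidates, key=simulate)


def select_for_cpu(settings_by_cpu, cpu_id, default):
    """Block 730: pick the setting recorded for this CPU-ID, else a default."""
    return settings_by_cpu.get(cpu_id, default)
```

Because the tuned values live in a small table keyed by processor identification, they can be replaced later without touching the helper routines themselves.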
Furthermore, using a compact group of settings to optimize performance facilitates software updates. For example, when a new processor is released, new yield values can be downloaded to optimize performance for that processor, or used to update the software. Such new values allow a binary or modular modification that does not substantially disturb or jeopardize the functionality of the existing software.
Fig. 8 illustrates a flow diagram for a process of updating software according to one embodiment. As indicated in block 800, a new version of a microprocessor is released. The new version may have different latencies associated with microarchitectural events such as cache misses. As a result, a routine previously written to activate worker threads after a given number of cache misses may be less effective because of the new cache miss latency. The yield settings are therefore re-optimized, as indicated in block 810.
Once new settings are derived, the program can be updated (e.g., via an update module that may be part of the program), as indicated in block 820. The yield values may be modified or added to, depending on the details of the implementation. Moreover, additional or different helper routines may be added to assist on the new processor implementation. In either case, the yield capability enables performance enhancements to be delivered after the initial delivery of the software. Such a capability may be quite advantageous in many scenarios, and may be used simply to provide new optimizations without any change to the underlying hardware. Additionally, the underlying software may be maintained in some cases. For example, if a helper routine is written to deal with a synthetic event (e.g., bad cache misses), then the composition of events that triggers this routine can be changed on different hardware without changing the routine itself. For example, the monitor configuration values and/or ESV/ESVM values may be changed while the routines remain intact.
The effectiveness of the disclosed techniques may be further enhanced by creating nested worker threads, and Fig. 9a shows one example of such usage. In the embodiment of Fig. 9a, the program sets the yield event(s) in block 900. The program continues to execute in block 910. Whether a yield event (a trigger) occurs is tested in block 920. If no yield event occurs, the program continues to execute, as indicated in block 910. If a yield event occurs, a worker thread is activated, as indicated in block 925. The worker thread sets another yield event, as indicated in block 930. The worker thread thus effectively identifies a further condition indicating that additional processing assistance may be useful. Such a further condition may indicate whether the first worker thread is effective, and/or may be designed to indicate a further condition that is suspected to develop as a result of, or in spite of, activation of the first worker thread.
As indicated in block 940, both the program and the worker thread are active and executing threads. They are active and executing in the sense that both consume execution resources of a multithreaded processing resource and therefore execute concurrently. In block 950, the combination of the program and the worker thread is tested for the occurrence of the new trigger condition. If the new trigger condition does not occur, both threads continue to execute, as indicated in block 940. If the new trigger condition does occur, a second, or nested, worker thread is activated, as indicated in block 960. Thereafter, the program and multiple worker threads are active and executing, as indicated in block 962. Multiple nested worker threads may therefore be employed in some embodiments.
In one embodiment, multiple worker threads (which may or may not be nested) may be activated through the use of virtual threads. Rather than dedicating a full set of resources to expanding the number of threads it can handle, a processor may effectively cache context data (in a cache location, a register location, or another storage location). One physical thread slot can thus be switched rapidly between multiple threads.
For example, the embodiment of Fig. 9b illustrates thread switching logic, according to one embodiment, that allows virtual threads to be switched into a limited number of physical thread slots that have dedicated hardware to maintain a thread context. In the embodiment of Fig. 9b, a plurality of worker threads 965-1 through 965-k are presented to a virtual thread switcher 970. The virtual thread switcher 970 may also include other logic and/or microcode (not shown) to swap context information between the newly selected worker thread and the previously selected worker thread. The virtual thread switcher 970 may be triggered to switch threads by either a synchronous or an asynchronous stimulus. For example, an asynchronous event defined by a yield-type instruction may cause a switch between the virtual threads. Additionally, worker threads may include synchronous means, such as a halt, quiesce, or other type of execution-stopping instruction, to signal a switch to another thread. The virtual thread switch logic 970 presents a subset of the virtual threads (e.g., one virtual thread in the embodiment of Fig. 9b) to the processor thread switch logic 980. The processor thread switch logic 980 then switches between one of the worker threads, as a first thread 967-1, and its other N-1 threads, up to thread 967-N.
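The context swapping performed by the virtual thread switcher can be modeled as follows. This is a hedged sketch, not the hardware design: it assumes a single physical thread slot, represents a thread context as a Python dictionary, and uses a hypothetical class name.

```python
class VirtualThreadSwitcher:
    """Multiplexes several virtual threads onto one physical thread slot."""

    def __init__(self, contexts):
        self.saved = dict(contexts)  # cached contexts, keyed by virtual thread id
        self.active_id = None
        self.slot = None             # context currently held in the physical slot

    def switch_to(self, thread_id):
        if self.active_id is not None:
            # Save the outgoing thread's context back to the cache.
            self.saved[self.active_id] = self.slot
        # Load the incoming thread's cached context into the physical slot.
        self.slot = self.saved.pop(thread_id)
        self.active_id = thread_id
        return self.slot
```

For instance, a worker thread that advanced its instruction pointer before being switched out resumes from the updated value when it is switched back in.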
In some embodiments, it may be advantageous to confine the yield capability to a particular program or thread. The yield capability may therefore be made context sensitive, or non-promiscuous. For example, Fig. 10a illustrates one embodiment of a context-sensitive event schema vector and mask implementation. In the embodiment of Fig. 10a, a storage area 1000 includes a context indicator field 1010 associated with each event schema vector and mask storage location 1020. The context indicator field identifies the context to which each event schema vector and mask applies. For example, a context value such as the value of a control register (e.g., CR3 in an x86 processor, which indicates the operating system process ID) may be used. Additionally or alternatively, thread number information may be used to define the context. Therefore, in some embodiments, when a particular context is active, particular context-specific events may be enabled to disrupt processing. The yield mechanism is thus non-promiscuous in that its events affect only particular contexts.
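A context-sensitive lookup of this kind can be sketched as a filter on the active context. The sketch below makes two labeled assumptions: the event schema vector is treated as a bit vector of events that have occurred, and the mask as a bit vector of events that are allowed to disrupt processing. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SchemaEntry:
    context_id: int  # context indicator field 1010 (e.g., a CR3 value)
    esv: int         # event schema vector (assumed: bits of events that occurred)
    mask: int        # mask (assumed: bits of events allowed to disrupt processing)


def should_yield(entries, active_context: int) -> bool:
    """Only entries belonging to the active context can disrupt processing."""
    return any(e.context_id == active_context and (e.esv & e.mask) != 0
               for e in entries)
```

Events recorded for another context, or masked off in the active one, never cause a yield.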
Fig. 10b illustrates another embodiment of a context-sensitive event schema vector and mask implementation. In the embodiment of Fig. 10b, an integer number k of contexts can be handled by providing one set of event schema vector and mask locations, 1050-1 through 1050-k, for each of the k contexts. For example, a multithreaded processor may have k threads, each with its own event schema vector and mask or a similar yield-enabling mechanism. Note that in other embodiments it may be undesirable to track events only in particular contexts. For example, events may reflect overall processing activity, and/or events may pertain to, or be caused by, multiple related threads.
Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switching based on monitor or yield-type events. Although many of the embodiments discussed disrupt processing flow by causing a handler to execute, other embodiments may define events that cause thread switches in a multithreaded processor. For example, in the embodiment of Fig. 11, thread switch logic 1105 is coupled to receive signals from a set of N monitors 1110-1 through 1110-N. The thread switch logic 1105 may also be coupled to one or more sets of event schema and mask pairs 1130-1 through 1130-p (where p is a positive integer). The event schema and mask pairs allow the thread switch logic to combine and/or disregard particular monitor events when determining when to switch threads.
Execution resources 1120 may support the execution of p threads while being indifferent to whether an instruction belongs to a particular thread. The execution resources may be an execution unit, fetch logic, a decoder, or any other resource used in executing instructions. A multiplexer 1115 or other selection resource arbitrates between the various threads to decide which thread gets access to the execution resources 1120. Those skilled in the art will appreciate that various resources may be shared or duplicated in a multithreaded processor, and that various resources may have thread-switched access that allows only a limited number of threads (e.g., one) to access the resource at a time.
If the set of conditions indicated by one or more monitors and/or one of the event schema vector and mask pairs occurs, the thread switch logic 1105 switches threads of execution. Another thread may thus be activated in place of the thread that was active when the processor conditions matched the programmed conditions. For example, a user program may control the events that trigger thread switches.
In some multithreaded processors, each thread may have an associated set of event schema vector and mask pairs or the like. Thus, as shown in Fig. 11, the multiplexer 1115 may arbitrate between p threads, and there may be a corresponding p event schema and mask pairs. The fact that a processor is multithreaded, however, does not mean that all implementations use multiple event schema vectors and masks. Some embodiments may use only one pair, or may use other enablement indicators. For example, a single bit could be used as an enablement indicator to turn a particular yield-type capability on or off.
Fig. 12 illustrates one embodiment of a system having event detection and handling capabilities for synchronization objects. Synchronization objects may be locks or lock variables, barriers, or other hardware, software, and/or memory resources that can be used for synchronization between threads or processes. As multi-processing becomes more prevalent in the form of multi-core and/or various types of multithreading, synchronization between these threads or processes becomes more important for improving performance. A system with enhanced synchronization efficiency may therefore have wide applicability in a variety of general purpose and/or special purpose processing fields (e.g., graphics, natural media types, digital signal processing, communications, etc.) that use concurrent processes.
The system shown in Fig. 12 includes a processor 1200 that is coupled to a memory 1250 and also to a communications interface 1292 and one or more peripherals 1294 (such as an audio interface, display, keyboard, mouse or other input devices, I/O devices, etc.). These devices may be coupled directly or indirectly by buses, bridges, and/or point-to-point connections. The processor 1200 includes execution resources 1210 and an event detector 1220 to monitor various aspects of the execution resources 1210. The processor 1200 and the event detector 1220 may have the various characteristics described with reference to the previous embodiments. The event detector 1220 may therefore be programmable to activate a thread (e.g., trigger a thread switch or fork a new thread) based on defined event(s). The events may be hardwired, or may be defined by a software program through a mechanism such as the previously described event schema vector or the like.
The processor 1200 may also have a lock and/or spin detector 1222. In some embodiments, the lock/spin detector may be a separate detector, as shown in Fig. 12. Such a separate detector may be a dedicated portion of hardware that detects particular predefined, or even programmable, conditions indicative of a lock. The detector may thus be part hardware and part software. In other embodiments, lock or spin-lock detection may be accomplished simply by programming various conditions into the general purpose event detector. The general purpose event detector can thus be programmed to detect such conditions and, when programmed appropriately, effectively forms a spin/lock detector. For example, an application 1254 may include an event detector programming module (EDPM) 1256 to program the event detector 1220 to trigger on the desired event(s).
One lock-related event that may be detected is a long latency fetch of a lock variable. A lock variable that causes a long latency fetch can be detected by programming the event detector 1220 to trigger on a particular cache miss at the point where the lock variable will be accessed in memory. Such a cache miss indicates that the processor does not have the lock variable cached. A handler may then be activated in response to this particular triggered event to deal with the long latency lock fetch.
Second, the lock/spin detector 1222 may also detect a spinning condition, that is, a condition in which the program is waiting on a highly contended lock and looping to test whether the variable has become available. A spinning condition may be detected, for example, by sensing repeated accesses to a known lock variable location. A second thread may be activated when such a spinning condition is detected, as discussed further below. Some embodiments may use a worker thread to handle this highly contended lock scenario regardless of whether a long latency lock fetch (which triggers a first worker thread) occurred first.
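Detecting a spin by sensing repeated accesses to a known lock variable location can be sketched with a sliding window over recent accesses. This is a minimal software model under stated assumptions: the window size and threshold are hypothetical parameters, and real hardware would observe addresses directly rather than through a method call.

```python
from collections import deque


class SpinDetector:
    """Flags a spin when enough recent accesses hit the known lock address."""

    def __init__(self, lock_address, window, threshold):
        self.lock_address = lock_address
        self.recent = deque(maxlen=window)  # True for each access that hit the lock
        self.threshold = threshold

    def observe(self, address) -> bool:
        self.recent.append(address == self.lock_address)
        # True values count as 1, so the sum is the number of lock accesses.
        return sum(self.recent) >= self.threshold
```

A threshold of this kind also corresponds to a spin threshold: no action is taken until a programmed number of repeated accesses has been seen.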
The embodiment of Fig. 12 includes lightweight thread context storage 1230. This context storage allows a small subset of the state to be saved in order to implement a "lightweight" or "flyweight" context switch. For example, in some cases only the instruction pointer of the parent process is saved, leaving the programmer responsible for saving any additional context. More or less context may be saved, but the stored subset is generally smaller than the full context. Such lightweight threads may be exposed at the user level (e.g., to application level programs, such as programs at privilege level 3 in the x86 architecture) through specialized instructions, so that a user can activate threads on a multithreaded execution resource in connection with a particular application. In this case, an event handler activated as a lightweight thread should save any unsaved context that it disturbs (context that is, or may be, needed to properly execute the parent application). In other embodiments, the triggered worker threads may be threads with fully independent contexts.
The embodiment of Fig. 12 also includes the memory 1250, which is coupled to the processor. In this embodiment, various lock overhead utilization modules and an application 1254 are shown. In this embodiment, the modules are software programs. In one embodiment, each module is a single thread that is triggered when a particular event occurs. One or more of the modules shown may be combined into a single worker thread, which may be a full context or lightweight context thread. In other embodiments, these modules may be implemented in hardware or in a combination of hardware and/or software and/or firmware.
The application 1254 is a user level application that may have locks or other synchronization objects or techniques. A fetch critical section data module 1258 may be used to speculatively move ahead into the critical section protected by the lock and prefetch data. An acquire future locks module 1260 may run ahead and acquire locks for other lock-protected sections. A lock may be acquired simply by fetching its data location into the cache or, in other embodiments, by altering the lock variable so as to own the lock. The future locks module 1260 may also include a throttle module 1262 to ensure that this speculative lock activity does not degrade overall throughput by restricting other threads. A run-ahead execution module 1280 may execute forward to accomplish some work beyond the lock-protected section.
A lock profiling module 1270 may gather data on the progress of a particular thread with respect to lock variables and/or other threads. A user thread scheduler module 1290 may receive hints from the lock profiling module that allow threads to be scheduled more efficiently. For example, the profiling module 1270 may detect that a first thread (e.g., initially a consumer) acquires a lock and causes substantial spinning in a second thread (e.g., initially a producer). In this example, the scheduler is informed that scheduling the second thread (initially the producer) first may lead to more efficient processing. In some embodiments, the scheduler is a user level thread scheduler that is exposed to the programmer to allow scheduling of user level (e.g., lightweight) threads. In some embodiments, the user thread scheduler 1290 may be part of the application 1254.
Examples of how these modules interact may be further understood with reference to Fig. 13. In the embodiment of Fig. 13, a lock fetch latency is detected in block 1310. This detection may be accomplished by programming the event detector 1220 to sense a condition. For example, the application 1254 may invoke the event detector programming module 1256 to program the event detector 1220 to trigger on a cache miss just before the lock variable is accessed. Alternatively, the dedicated lock/spin detector 1222 described previously may be used to detect the lock.
In block 1315, a thread switch (e.g., a flyweight thread switch) may be performed in response to detecting the fetch latency for the lock variable. This thread switch activates a first worker thread, which in various embodiments may perform the various functions set forth in the following blocks. Accordingly, although various blocks follow block 1315, not all of them are required in any particular embodiment, nor is their order critical.
In block 1320, data that lies beyond the code that acquires the lock, but within the code section protected by the lock, may be fetched. For example, the fetch critical section data module 1258 may be executed in the embodiment of Fig. 12. Prefetching data within the lock-protected code section in this way reduces cache misses, or data retrieval latency in general, when ownership of the lock is eventually obtained.
Additionally (either as part of a separate thread or of the same thread), future locks may be fetched, as shown in blocks 1330-1365. In particular, N additional locks may be acquired, where N is a positive integer. The number of locks to acquire may be programmable or hard coded, and may vary during execution (e.g., through reprogramming or through the throttle module 1262). In block 1330, a loop variable i is set to 1. As indicated in block 1340, a future lock is fetched (e.g., prefetched or actually locked, in various embodiments). The future lock addresses may be statically computed when the worker thread is generated, or may be determined by lock profiling, as described below. As indicated in block 1350, whether the lock is contended is tested. If the future lock is contended (e.g., as may be indicated by the cache status of the lock variable, the value of the lock variable, or the spinning behavior of a program), the fetching of that lock and/or other future locks may be aborted, as indicated in block 1355.
If the lock is not found to be contended in block 1350, operation continues in block 1360. If the count of locks fetched, as tested in block 1360, is not equal to the target number N, the variable i is incremented in block 1365 and the process returns to block 1340. If the count reaches N in block 1360, the process proceeds to block 1370 in one embodiment. These lock fetching operations 1330-1365 may be performed by the acquire future locks module 1260 in the embodiment of Fig. 12. Prefetching or acquiring ownership of future locks may advantageously speed program execution, because the lock is more likely to be readily available when it is encountered. Many programs encounter relatively few highly contended locks. The benefits of prefetching locks may therefore outweigh any negative impact on the progress of other processes.
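The loop of blocks 1330 through 1365 can be sketched as follows. This is an illustrative model only: `fetch` and `is_contended` are hypothetical callbacks standing in for the prefetch (or acquire) operation and the contention test, and aborting the whole loop on the first contended lock is one of the behaviors the description allows, chosen here as an assumption.

```python
def fetch_future_locks(future_locks, n, is_contended, fetch):
    """Fetch up to n future locks, aborting when a contended lock is found."""
    fetched = []
    for lock in future_locks[:n]:  # i runs from 1 to N (blocks 1330, 1360, 1365)
        fetch(lock)                # block 1340: prefetch or acquire the lock
        if is_contended(lock):     # block 1350: test for contention
            break                  # block 1355: abort further lock fetching
        fetched.append(lock)
    return fetched                 # afterwards, continue at block 1370
```

A throttle could be modeled by lowering `n` between calls, which corresponds to the number of locks being variable during execution.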
In some embodiments, the work to be performed by the thread triggered in block 1315 may conclude at block 1365, so the thread may be shut down (by a halt or join type of operation) and control returned to the main thread, as indicated in block 1370. In other embodiments, the worker thread continues and performs the operations of blocks 1372-1375 and/or other operations. Moreover, other embodiments may trigger other numbers of worker threads or other combinations of the described operations.
In the embodiment of Fig. 13, as indicated in block 1372, the application fails to secure the lock. In other words, the lock variable indicates that another process holds the lock. In some embodiments, repeated failures of this kind are detected. A spin threshold may be programmed or set to define the number of failed attempts to obtain ownership of the lock variable that are tolerated before action is taken. In block 1374, a second worker thread is triggered to accomplish additional work in the shadow of the overhead required to obtain ownership of the lock variable. As one example, code beyond the critical section may be executed, as indicated in block 1375. For example, the run-ahead execution module 1280 may be executed in the embodiment of Fig. 12. In some embodiments, such code may be executed merely to prefetch instructions or data, without actually computing results and/or committing results to the machine state. In other embodiments, results may be computed and/or committed as part of the run-ahead execution, provided that dependency checks are performed to ensure correct results.
Alternatively, the second worker thread may acquire additional locks in response to this second event, in a manner similar to the process of blocks 1330-1365. Another alternative is to perform lock profiling. Yet another alternative is to flush previously held locks at least to the highest level of cache (and possibly even to an external interface) in order to reduce the transfer latency for another processor to obtain the lock. In various embodiments, these examples of additional work that can be accomplished in the shadow of the lock overhead may be combined in various permutations.
Fig. 14 illustrates one embodiment that includes lock profiling. In the embodiment of Fig. 14, various threads are scheduled in block 1410. In one embodiment, these threads may be lightweight threads that are subject to scheduling control at the user or application level. For example, the application 1254 may include the user thread scheduler 1290 of Fig. 12. In another embodiment, the threads may be operating system visible threads with full contexts. In block 1420, a contended lock is detected. The contended lock causes a profiling thread to be enabled or activated. As indicated in block 1430, the profiling thread profiles the lock behavior. In one embodiment, the profiling involves capturing data such as event counter data from performance counters or other similar structures. The profile information is then used to help determine the priority and/or ordering of threads when thread scheduling is performed again in block 1410. Lock overhead time is thus again used to improve overall program performance. As mentioned previously, such scheduling information may help a thread scheduler schedule producer/consumer thread pairs more efficiently.
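How profile information might feed back into thread ordering can be sketched with a simple heuristic. The sketch below is an assumption built on the producer/consumer example in the description: threads that were observed spinning more are scheduled earlier. The function name and the use of a spin count as the profile metric are hypothetical.

```python
def schedule_order(threads, spin_counts):
    """Order threads so that those observed spinning the most run first."""
    # sorted() is stable, so threads with equal spin counts keep their order.
    return sorted(threads, key=lambda t: -spin_counts.get(t, 0))
```

With a profile showing the producer spinning heavily while a consumer holds the lock, the producer is moved ahead of the consumer on the next scheduling pass.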
During development, a design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs at some stage reach a level of data that represents the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. The machine-readable medium may be an optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage such as a disc. Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that copying, buffering, or re-transmission of the electrical signal is performed. Thus, a communication provider or a network provider may make copies of an article (a carrier wave) embodying techniques of the present invention.
Thus, techniques for a programmable event driven yield mechanism that may activate other threads are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advances are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail as facilitated by technological advances, without departing from the principles of the present invention or the scope of the appended claims.

Claims (16)

1. An apparatus comprising:
execution resources to execute a plurality of threads concurrently;
event detector hardware logic to detect a cache miss event associated with a synchronization object, wherein said event detector is to cause a first thread switch; and
a spin detector to detect that a synchronization object is a contended synchronization object, wherein said spin detector is to cause a second thread switch.
2. The apparatus of claim 1, wherein said spin detector comprises an event detector program stored in a machine-readable medium, said event detector program to program event detector logic to detect that said synchronization object is contended.
3. The apparatus of claim 1, further comprising a memory, said memory to store an application that uses said synchronization object and a future lock module to be activated by the first thread switch, wherein said future lock module is to acquire a future lock for said application.
4. The apparatus of claim 3, wherein said future lock module is to acquire a plurality of future locks, and said apparatus further comprises a throttle module to prevent excessive lock prefetching.
5. The apparatus of claim 3, wherein said future lock module is to acquire the future lock by prefetching data.
6. The apparatus of claim 3, further comprising a profiling module to collect profile data about synchronization contention.
7. The apparatus of claim 6, wherein a user thread scheduler module is to use said profile data for user thread scheduling.
8. The apparatus of claim 1, wherein said contended synchronization object is a contended lock, and said spin detector is to detect the contended lock.
9. The apparatus of claim 8, further comprising a memory, said memory to store an application, said application comprising a locked section containing the contended lock and a module to make productive use, outside said locked section, of the overhead delay caused by the contended lock.
10. The apparatus of claim 1, further comprising a memory, said memory to store a synchronization improvement module to provide rescheduling hints to a scheduler, wherein said synchronization improvement module is to detect a thread scheduling inefficiency in which a consumer thread is scheduled before a producer thread, and to provide a hint to schedule the producer thread before the consumer thread.
11. The apparatus of claim 1, wherein said event detector is programmed to fork a future lock helper thread in response to a cache miss encountered in a locked section.
12. A method comprising:
encountering, in a first thread, a locked section involving a lock variable;
experiencing a cache miss when attempting to fetch said lock variable and activating a first helper thread to fetch future locks for the first thread, said first helper thread being executed at least partially in parallel with other threads.
13. The method of claim 12, further comprising:
detecting that said lock variable is contended;
activating a second helper thread to utilize lock synchronization overhead.
14. The method of claim 12, wherein utilizing lock synchronization overhead comprises executing individual iterations of a synchronization loop in response to the lock variable being contended.
15. The method of claim 13, wherein activating the first helper thread comprises forking the first thread and switching to the first helper thread, and wherein activating the second helper thread comprises forking the second helper thread and switching to the second helper thread.
16. The method of claim 12, wherein utilizing lock synchronization overhead comprises:
collecting synchronization profile information;
providing a thread scheduling hint based on the synchronization profile information.
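The flow of method claims 12 and 13 can be pictured with a small software simulation; this is a sketch under stated assumptions, not the claimed hardware mechanism. Here a Python set named `cache` stands in for the processor cache, membership in `contended` stands in for the spin detector firing, and `MAX_PREFETCH` is a hypothetical throttle of the kind mentioned in claim 4. The function name `run_locked_section` and all parameters are invented for illustration; the two helper (worker) threads are ordinary `threading.Thread` objects.

```python
import threading

MAX_PREFETCH = 2  # hypothetical throttle on how many future locks are prefetched


def run_locked_section(lock_name, upcoming_locks, cache, contended):
    """Simulate a first thread reaching a locked section guarded by lock_name."""
    log = []
    helpers = []

    if lock_name not in cache:  # stands in for a cache miss on the lock variable
        def prefetch_future_locks():
            # First helper thread: fetch locks the first thread will need soon.
            for name in upcoming_locks[:MAX_PREFETCH]:
                cache.add(name)
            log.append("prefetched")

        helpers.append(threading.Thread(target=prefetch_future_locks))
        cache.add(lock_name)  # the missed lock variable is now cached

    if lock_name in contended:  # stands in for the spin detector firing
        def use_sync_overhead():
            # Second helper thread: do useful work while the lock is contended.
            log.append("profiled " + lock_name)

        helpers.append(threading.Thread(target=use_sync_overhead))

    for helper in helpers:
        helper.start()
    for helper in helpers:
        helper.join()
    return sorted(log)


cache = set()
first = run_locked_section("A", ["B", "C", "D"], cache, contended={"A"})
second = run_locked_section("B", ["C"], cache, contended=set())
print(first, second, sorted(cache))
```

On the second call the lock variable "B" is already cached and uncontended, so no helper thread is started, which is the case where neither event fires.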
CN201210460430.2A 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance Expired - Fee Related CN102968302B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US11/070,991 US7587584B2 (en) 2003-02-19 2005-03-02 Mechanism to exploit synchronization overhead to improve multithreaded performance
US11/070991 2005-03-02
CN 200610019818 CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN 200610019818 Division CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Publications (2)

Publication Number Publication Date
CN102968302A true CN102968302A (en) 2013-03-13
CN102968302B CN102968302B (en) 2016-01-27

Family

ID=36946955

Family Applications (4)

Application Number Title Priority Date Filing Date
CN 201110156959 Expired - Fee Related CN102184123B (en) 2005-03-02 2006-03-01 multithread processer for additionally support virtual multi-thread and system
CN 200710104280 Pending CN101051266A (en) 2005-03-02 2006-03-01 Processor with dummy multithread
CN201210460430.2A Expired - Fee Related CN102968302B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance
CN 200610019818 Expired - Fee Related CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN 201110156959 Expired - Fee Related CN102184123B (en) 2005-03-02 2006-03-01 multithread processer for additionally support virtual multi-thread and system
CN 200710104280 Pending CN101051266A (en) 2005-03-02 2006-03-01 Processor with dummy multithread

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN 200610019818 Expired - Fee Related CN1828544B (en) 2005-03-02 2006-03-01 Mechanism to exploit synchronization overhead to improve multithreaded performance

Country Status (1)

Country Link
CN (4) CN102184123B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470635B (en) * 2007-12-24 2012-01-25 联想(北京)有限公司 Method for multi-virtual processor synchronous scheduling and computer thereof
CN108241504A (en) 2011-12-23 2018-07-03 英特尔公司 The device and method of improved extraction instruction
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
WO2013095620A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved insert instructions
WO2013095637A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved permute instructions
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
CN103699428A (en) 2013-12-20 2014-04-02 华为技术有限公司 Method and computer device for affinity binding of interrupts of virtual network interface card
US9195493B2 (en) * 2014-03-27 2015-11-24 International Business Machines Corporation Dispatching multiple threads in a computer
US9760410B2 (en) * 2014-12-12 2017-09-12 Intel Corporation Technologies for fast synchronization barriers for many-core processing
GB201717303D0 (en) * 2017-10-20 2017-12-06 Graphcore Ltd Scheduling tasks in a multi-threaded processor
CN112286679B (en) * 2020-10-20 2022-10-21 烽火通信科技股份有限公司 DPDK-based inter-multi-core buffer dynamic migration method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6212544B1 (en) * 1997-10-23 2001-04-03 International Business Machines Corporation Altering thread priorities in a multithreaded processor
US6493741B1 (en) * 1999-10-01 2002-12-10 Compaq Information Technologies Group, L.P. Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit
US7487502B2 (en) * 2003-02-19 2009-02-03 Intel Corporation Programmable event driven yield mechanism which may activate other threads
US8694976B2 (en) * 2003-12-19 2014-04-08 Intel Corporation Sleep state mechanism for virtual multithreading

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108139977A (en) * 2015-12-08 2018-06-08 上海兆芯集成电路有限公司 Processor with programmable prefetcher
CN108139977B (en) * 2015-12-08 2021-11-23 上海兆芯集成电路有限公司 Processor with programmable prefetcher
CN108292230A (en) * 2015-12-11 2018-07-17 图芯芯片技术有限公司 Hardware access counters and event generation for coordinating multithreading
CN108292230B (en) * 2015-12-11 2022-06-10 图芯芯片技术有限公司 Hardware access counters and event generation for coordinating multithreading
CN108376070A (en) * 2016-10-28 2018-08-07 华为技术有限公司 Method and apparatus for compiling source code object, and computer
US10795651B2 (en) 2016-10-28 2020-10-06 Huawei Technologies Co., Ltd. Method and apparatus for compiling source code object, and computer
US11281441B2 (en) 2016-10-28 2022-03-22 Huawei Technologies Co., Ltd. Method and apparatus for compiling source code object, and computer
CN111580792A (en) * 2020-04-29 2020-08-25 上海航天计算机技术研究所 High-reliability satellite-borne software architecture design method based on operating system
CN111580792B (en) * 2020-04-29 2022-07-01 上海航天计算机技术研究所 High-reliability satellite-borne software architecture design method based on operating system

Also Published As

Publication number Publication date
CN102184123A (en) 2011-09-14
CN1828544B (en) 2013-01-02
CN1828544A (en) 2006-09-06
CN101051266A (en) 2007-10-10
CN102184123B (en) 2013-10-16
CN102968302B (en) 2016-01-27

Similar Documents

Publication Publication Date Title
CN1828544B (en) Mechanism to exploit synchronization overhead to improve multithreaded performance
US10459858B2 (en) Programmable event driven yield mechanism which may activate other threads
US7587584B2 (en) Mechanism to exploit synchronization overhead to improve multithreaded performance
Blackham et al. Timing analysis of a protected operating system kernel
Muzahid et al. SigRace: Signature-based data race detection
US9063804B2 (en) System to profile and optimize user software in a managed run-time environment
Zagha et al. Performance analysis using the MIPS R10000 performance counters
US7849465B2 (en) Programmable event driven yield mechanism which may activate service threads
Demme et al. Rapid identification of architectural bottlenecks via precise event counting
US20080256339A1 (en) Techniques for Tracing Processes in a Multi-Threaded Processor
US8762694B1 (en) Programmable event-driven yield mechanism
CN104169889A (en) Run-time instrumentation sampling in transactional-execution mode
US9135082B1 (en) Techniques and systems for data race detection
Dimakopoulou et al. Reliable and efficient performance monitoring in linux
CN104169887A (en) Run-time instrumentation indirect sampling by instruction operation code
CN104169886A (en) Run-time detection indirect sampling by address
Imtiaz et al. Automatic platform-independent monitoring and ranking of hardware resource utilization
Desnoyers et al. Synchronization for fast and reentrant operating system kernel tracing
Yu et al. Mt-profiler: a parallel dynamic analysis framework based on two-stage sampling
Carnà et al. Strategies and software support for the management of hardware performance counters
Gracioli et al. An embedded operating system API for monitoring hardware events in multicore processors
Giesen et al. PRL: Standardizing Performance Monitoring Library for High-Integrity Real-Time Systems
Carnà HOP-Hardware-based Online Profiling of multi-threaded applications via AMD Instruction-Based Sampling
Kriegel et al. A high level mixed hardware/software modeling framework for rapid performance estimation
Kailas et al. Temporal accuracy and modern high performance processors: A case study using Pentium Pro

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160127
