CN102968302B

CN102968302B - Mechanism to exploit synchronization overhead to improve multithreaded performance

Info

Publication number: CN102968302B
Application number: CN201210460430.2A
Authority: CN
Inventors: N. Enright; J. Collins; P. Wang; H. Wang; X. Tian; J. Shen; G. Sheaffer; P. Hammarlund
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2005-03-02
Filing date: 2006-03-01
Publication date: 2016-01-27
Anticipated expiration: 2026-03-01
Also published as: CN102968302A; CN1828544A; CN101051266A; CN102184123A; CN1828544B; CN102184123B

Abstract

Methods, apparatus, and program means for a programmable event-driven yield mechanism that may activate other threads. In one embodiment, an apparatus includes execution resources to execute a plurality of instructions and an event detector to detect a long-latency event associated with a synchronization object. The event detector can cause a first thread switch in response to the long-latency event associated with the synchronization object. The apparatus also includes a spin detector to detect whether the synchronization object is a contended synchronization object. The spin detector can cause a second thread switch in response to detecting a contended synchronization object, thereby activating a spin-detect response.

Description

Mechanism to exploit synchronization overhead to improve multithreaded performance

Technical field

The present invention pertains to the field of processing apparatuses and systems that process sequences of instructions or the like, as well as to certain instruction sequences that program such apparatuses and/or systems. Some embodiments relate to monitoring and/or responding to conditions or events within the execution resources of such processing apparatuses.

Background

Various mechanisms are currently used to change the flow of control (that is, the processing path or instruction sequence being followed) in a processing system. For example, a jump instruction in a program sequence explicitly causes a jump to a new address. The jump instruction is an example of an explicit change of control flow because the instruction directs the processor to jump to a location and continue executing at that point. A traditional jump instruction is "precise" (or synchronous) because the jump occurs as a direct result of executing the jump instruction.

Another traditional example of a change in the flow of control is an interrupt. An interrupt may be an external signal provided to an apparatus such as a processor. The processor responds by jumping to an interrupt handler, a routine that handles the event(s) signaled by a particular interrupt. Interrupts are typically also relatively precise in that the processor recognizes and acts on them within a particular window of time after they are received. In particular, an interrupt usually takes effect at the next instruction boundary after it is received internally. In some cases, only the operating system or other software operating at a high privilege level is allowed to mask interrupts, so a user program may have no opportunity to enable or disable these control-flow-changing events.

Another traditional example of a control flow change occurs in response to an exception. An exception typically reflects a predefined architectural condition, such as the result of a mathematical instruction meeting certain criteria (denormal, underflow, overflow, not a number, etc.). Some exceptions can be masked, for example by setting a bit in a control register. If an exception occurs and is not masked, an exception handler is called to handle it.

Another technique that changes the flow of control of a processor is the use of breakpoints. Breakpoints are typically used in debugging. A particular instruction address may be programmed into a breakpoint register. When the breakpoint is active and the target address is reached, the processor takes various actions other than continuing with the program as usual. Breakpoints allow single-stepping through a program, among other things.

Multithreading is a technique by which processor hardware may be used by multiple different threads. Multithreaded processors may switch between threads for a variety of reasons. For example, a processor may have an algorithm that automatically switches between available threads. Other processors use switch-on-event multithreading (SoEMT), whereby certain events such as a cache miss may give rise to a thread switch. Thread switching can be considered a change of control flow because the processor switches the sequence or stream of instructions that it executes.

One prior art reference details a quiesce instruction (see U.S. Patent No. 6,493,741). In one example, the quiesce instruction stops processing in one thread until either a timer expires or a memory write to a memory location occurs. Therefore, an instruction such as the quiesce instruction may itself trigger the temporary cessation of processing of the thread containing it and a switch to another thread.

Brief description of the drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

Fig. 1 illustrates one embodiment of a system that can detect and respond to processing conditions of execution resources.

Fig. 2 illustrates a flow diagram of operations for one embodiment of the system of Fig. 1.

Fig. 3 illustrates a flow diagram of operations for another embodiment of the system of Fig. 1.

Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or composite performance events.

Fig. 5a illustrates one embodiment of a monitor that may recognize composite events.

Fig. 5b illustrates another embodiment of a monitor.

Fig. 5c illustrates another embodiment of a monitor.

Fig. 6 illustrates a flow diagram for execution of a user program that activates helper threads in response to program-definable triggers, according to one embodiment.

Fig. 7 illustrates a flow diagram for a process of refining monitor settings according to one embodiment.

Fig. 8 illustrates a flow diagram for a process of updating software according to one embodiment.

Fig. 9a illustrates a flow diagram in which multiple nested helper threads are activated to assist processing.

Fig. 9b illustrates thread switching logic for one embodiment that supports virtual threads.

Fig. 10a illustrates one embodiment of a context-sensitive event schema vector and mask implementation.

Fig. 10b illustrates one embodiment of a context-sensitive event schema vector and mask implementation.

Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switching based on monitor events.

Fig. 12 illustrates one embodiment of a system with synchronization object event detection and handling capabilities.

Fig. 13 illustrates a flow diagram of synchronization event handling according to multiple embodiments.

Fig. 14 illustrates a flow diagram for improving thread scheduling based on lock profiling by an event handler thread.

Detailed description

The following description sets forth embodiments of a programmable event-driven yield mechanism that may activate other threads. In the following description, numerous specific details such as processor types, microarchitectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of the present invention. Those skilled in the art will appreciate, however, that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the present invention.

In some embodiments, the disclosed techniques allow a program, while it executes, to actively monitor and respond to conditions of the execution resources that execute it. In effect, such embodiments may incorporate real-time feedback about execution resource operating conditions to improve performance. If the execution resources encounter conditions that delay execution, program execution may be disrupted in order to make adjustments. In some embodiments, a handler may be activated, and the handler may spawn a helper thread to attempt to improve execution of the original thread. In other embodiments, the disruption is accomplished by switching to another program thread that is not a helper thread. These and other embodiments may in some cases advantageously improve processing throughput and/or allow optimizations to be tailored to particular hardware.

Referring to Fig. 1, one embodiment of a system that can detect and respond to processing conditions of execution resources is described. In the embodiment of Fig. 1, execution resources 105, a monitor 110, and enable logic 120 form a portion of a processor 100 that is capable of executing instructions. In some embodiments, the execution resources include hardware resources that may be integrated into a single component or integrated circuit. However, the execution resources may include software or firmware resources, or any combination of hardware and software and/or firmware that may be used in executing program instructions. For example, firmware may be used as part of an abstraction layer or may add functions to processing hardware, as may software. Software may also be used to emulate part or all of an instruction set or to otherwise assist in processing.

The processor may be any of a variety of different types of processors that execute instructions. For example, the processor may be a general-purpose processor, such as a processor from one of the processor families of Intel Corporation, or a processor from another company. Thus, the processor may be a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a very long instruction word (VLIW) processor, or any hybrid or alternative processor type. Moreover, special-purpose processors such as network or communication processors, co-processors, embedded processors, compression engines, graphics processors, and the like may use the disclosed techniques. As integration trends continue and processors become more complex, the need to monitor and react to internal performance indicators may further increase, making the presently disclosed techniques more desirable. However, because of rapid technological advances in this area, it is difficult to foresee all applications of the disclosed technology, although it may be widely applicable to complex hardware that executes program sequences.

As shown in Fig. 1, the processor 100 is coupled to a storage medium 150 such as a memory. The storage medium 150 may be a memory subsystem having various levels of hierarchy, which may include, but are not limited to, various levels of cache memory, system memory such as dynamic random access memory, and non-volatile storage such as flash memory (e.g., a memory stick), a magnetic disk, or an optical disk. As illustrated, the storage medium stores a program 160 and a handler and/or other thread such as a helper thread 170.

To allow the monitor to observe the desired events, the monitor 110 may be coupled to various portions of the execution resources in order to detect particular conditions or to be informed of certain microarchitectural events. Signal lines may be routed to the monitor 110, or the monitor may be strategically placed with, or integrated into, the relevant resources. The monitor may include various programmable logic, software, or firmware elements, or may be custom designed to detect a particular condition. The monitor tracks various events or conditions, and if the events or conditions it is set to detect occur, it signals the execution resources 105 to disrupt the normal control flow that the program would otherwise follow. As indicated in Fig. 1, the disruption may result in an event handler being called or a thread switch occurring.

One example of a specifically detectable condition is data missing from a cache memory, which results in a cache miss. In fact, a program may generate a pattern of memory accesses that causes repeated cache misses, thereby degrading performance. The occurrence of a given number of cache misses within a certain period of time, or during execution of a certain portion of code, is one example of an event that indicates relatively slow progress in executing that portion of code.
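The "given number of cache misses within a period of time" condition can be modeled in software as a sliding-window counter. The following is a minimal sketch only, not the hardware mechanism of any embodiment; the class name `MissRateMonitor`, its parameters, and the cycle stamps are hypothetical.

```python
from collections import deque


class MissRateMonitor:
    """Hypothetical software model of a monitor that fires when
    `threshold` cache misses fall within a window of `window` cycles."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self._misses = deque()  # cycle stamps of recent misses

    def record_miss(self, cycle):
        """Record a miss at `cycle`; return True if the trigger condition holds."""
        self._misses.append(cycle)
        # Discard misses that have aged out of the window.
        while self._misses and cycle - self._misses[0] >= self.window:
            self._misses.popleft()
        return len(self._misses) >= self.threshold


monitor = MissRateMonitor(threshold=3, window=100)
sparse = [monitor.record_miss(c) for c in (0, 200, 400)]      # misses far apart
burst = [monitor.record_miss(c) for c in (1000, 1010, 1020)]  # misses close together
```

In this model, isolated misses never trigger, while three misses within 100 cycles do, which mirrors the idea that only a burst of misses signals slow progress.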

Other detectable events that may indicate slow progress relate to various other microarchitectural or structural details of the execution resources. A monitor may detect a condition involving one or more of the following: a resource stall, a cache event, a retirement event, a branch or branch prediction result, an exception, a bus event, or a variety of other commonly monitored or performance-affecting events or conditions. The monitor may count such events or conditions, or may time, quantify, or characterize them, and may be programmed to signal when a particular metric associated with one or more events or conditions occurs.

Fig. 2 illustrates a flow diagram of operations for one embodiment of the system of Fig. 1. As indicated in block 200 of Fig. 2, the program 160 may set conditions that cause a change in the execution control flow. For example, the enable logic 120 may control both activation of the monitor and which event(s) the monitor is to detect. Alternatively, the enable logic 120 may enable and/or mask individual events, and the monitor 110 may itself also be programmable, providing further flexibility in specifying which events or conditions within the execution resources or system are tracked. In either case, the program 160 itself may specify the conditions to be watched during its own execution. The program 160 may also provide the handler or thread 170 that is activated when the monitored condition(s) occur. For example, the program may include a main thread and a helper thread or helper routine that attempts to improve execution of the main thread when the conditions specified by the program occur.

As indicated in block 205, the program instructions are executed. Execution of the program causes the state of the execution resources to change. For example, a variety of conditions that inhibit forward progress may occur or be present during execution of the program. As indicated in block 210, the various processing metrics and/or microarchitectural conditions are monitored to determine whether the triggering event programmed in block 200 has occurred. If the triggering state does not occur in block 210, the monitor is not triggered, and program execution continues by returning to block 205.

In some cases, the triggering state bears only an indirect relationship to execution of any single instruction. For example, a prior art breakpoint detector typically causes a break when the instruction pointer reaches a designated address. Such breakpoints are precise because a particular instruction (i.e., its address) directly triggers the break. Likewise, the prior art quiesce instruction itself causes a thread to stop, at least temporarily. In contrast, some embodiments using the disclosed techniques trigger control flow changes on a set of conditions that are not necessarily caused by a single instruction, but may instead be caused by the overall program flow and/or system environment. Thus, while the monitor may repeatedly trigger at the same instruction execution state in a single system, other conditions, environments, systems, and the like may cause different trigger points for the same program. In this sense, the disclosed techniques in some cases provide an imprecise or asynchronous mechanism for generating a control flow change that is not directly tied to an instruction execution boundary. Moreover, in some embodiments such an imprecise mechanism may test for events at a granularity less fine than each instruction, and/or recognition of events may be delayed for some period of time, because architectural correctness does not depend on any processing-rate-enhancing helper routine executing at any particular point in time.

When the monitor detects the triggering state in block 210, processing of the program is disrupted, as indicated in block 215. Generally, the system may adjust in response, because the program is being processed inefficiently or in a manner other than the one the programmer desired. For example, another software routine, such as another program portion, may be invoked. The other program portion may be another thread unrelated to the original thread, or may be a helper thread that assists in processing instructions from the original thread, for example by prefetching data to reduce cache misses. Alternatively, a program-transparent (e.g., hardware) mechanism may perform some optimization, reconfiguration (including, but not limited to, reconfiguration of the monitor setup), reallocation of resources, or the like, in the hope of improving processing.
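The flow of blocks 200 through 215 can be sketched as a simple software loop. This is an illustrative model under stated assumptions, not the patented hardware: the state dictionary, `run_with_monitor`, and the stall-based trigger are hypothetical, and the check is done synchronously after each step even though some embodiments respond asynchronously.

```python
def run_with_monitor(instructions, trigger, handler):
    """Toy model of Fig. 2: execute, test the programmed condition,
    and disrupt execution when the condition is met."""
    state = {"stalls": 0, "retired": 0}  # hypothetical execution-resource state
    disruptions = 0
    for instr in instructions:           # block 205: execute program instructions
        instr(state)
        if trigger(state):               # block 210: did the programmed event occur?
            handler(state)               # block 215: disrupt processing and adjust
            disruptions += 1
    return state, disruptions


def stalled(state):
    """An instruction that retires but also incurs a stall."""
    state["stalls"] += 1
    state["retired"] += 1


def smooth(state):
    """An instruction that retires without stalling."""
    state["retired"] += 1


program = [smooth, stalled, stalled, stalled, smooth, stalled]
final_state, hits = run_with_monitor(
    program,
    trigger=lambda s: s["stalls"] >= 3,    # block 200: condition set by the program
    handler=lambda s: s.update(stalls=0),  # stand-in for a helper routine
)
```

Here the program itself supplies both the trigger and the handler, which corresponds to the program-definable nature of the conditions described above.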

Fig. 3 illustrates one example in which a helper thread is invoked. In particular, the flow diagram of Fig. 3 details operations for one embodiment of the system of Fig. 1 in which the execution resources are multithreaded resources and the program invokes a helper thread when a certain triggering condition occurs. Thus, as indicated in block 300, a first thread (e.g., a main program) sets a monitor condition. The condition may be any one or more of the variety of conditions discussed herein. The first thread executes a code section, as indicated in block 310. If the test in block 320 determines that the triggering condition has not occurred, the code section continues executing, as indicated in block 310.

If the triggering condition does occur, a helper thread is activated to assist the first thread, as indicated in block 330. The helper thread may be activated by a routine such as a handler routine, or simply by a thread switch. For example, in one embodiment, the trigger condition signaled by the monitor to the execution resources may cause the execution resources to jump to an event handler that spawns a helper thread. In another embodiment, the helper thread is simply one of the other active threads. In yet another embodiment, the processor may provide one or more special helper thread execution slots, and the monitor may cause a switch to a helper thread from one of these slots. As indicated in block 340, both threads then continue to execute. Ideally, the helper thread runs ahead and clears up conditions that would otherwise cause the first thread to stall or perform poorly.
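As a rough software analogy of blocks 300 through 340, the sketch below has a main thread count its own "cache misses" and start a prefetching helper once a limit is reached. Everything here is an assumption for illustration: the dictionary `cache`, `slow_load`, and the miss limit are hypothetical, and an operating system thread stands in for a hardware thread context.

```python
import threading

cache = {}                     # hypothetical stand-in for a data cache
cache_lock = threading.Lock()


def slow_load(key):
    """Stand-in for a long-latency memory access."""
    return key * key


def helper(keys):
    """Helper thread: runs ahead and prefetches data the main thread will need."""
    for k in keys:
        with cache_lock:
            cache.setdefault(k, slow_load(k))


def main_thread(keys, miss_limit):
    misses, helper_thread, total = 0, None, 0
    for i, k in enumerate(keys):           # block 310: execute the code section
        with cache_lock:
            hit = k in cache
            if not hit:
                cache[k] = slow_load(k)
            total += cache[k]
        if not hit:
            misses += 1
        # Blocks 320/330: when the trigger condition occurs, start the helper once.
        if misses >= miss_limit and helper_thread is None:
            helper_thread = threading.Thread(target=helper, args=(keys[i + 1:],))
            helper_thread.start()
    if helper_thread is not None:
        helper_thread.join()
    return total, helper_thread is not None


total, helper_started = main_thread(list(range(10)), miss_limit=3)
```

The result computed by the main thread does not depend on how far ahead the helper gets, which reflects the point that correctness does not rely on the helper running at any particular time.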

Fig. 4 illustrates another embodiment of a system that can respond to multiple different performance events and/or composite performance events. In the embodiment of Fig. 4, execution resources 400 are shown as including a set of N monitors 410-1 through 410-N. Additionally, an event schema vector (ESV) storage location 420 and an event schema vector mask (ESVM) storage location 425 are provided. The embodiment of Fig. 4 shows a number of monitors (N) corresponding to the number of bits in the event schema vector and the event schema mask vector. In other embodiments, the number of monitors may differ from the number of bits in these vectors, and the monitors may or may not correlate directly with the bits. For example, in some embodiments a condition involving multiple monitors may correlate with a single vector bit.

The execution resources 400 are optionally coupled to an event descriptor table 430 (EDT), which may be implemented locally on the processor or in a co-processor or system memory. Control flow logic 435 is coupled to the monitors 410-1 through 410-N and receives values from the event schema vector and the event schema vector mask. The control flow logic 435 changes the control flow of the processing logic when a condition detected by one or more of the monitors is enabled according to the event schema vector and the event schema vector mask.

The embodiment of Fig. 4 also illustrates decode logic 402 and a set of machine or model specific registers 404 (MSRs). Either or both of the decode logic 402 and the model specific registers may be used to program and/or activate the monitors and the event schema vector and mask. For example, MSRs may be used to program the types or number of events that trigger the monitors. MSRs may also be used to program the event schema vector and mask. Alternatively, one or more new dedicated instructions, decoded by the decoder 402, may be used to program either or both of the monitors and the event schema vector and mask. For example, a yield instruction may be used to enable disruption of program processing when a certain set of conditions occurs. Some or all of the conditions may be specified by an operand of the yield instruction or otherwise programmed before it executes. The yield instruction may be decoded by the decoder 402 to trigger a microcode routine, to produce a corresponding micro-operation, micro-instruction, or sequence of micro-operations that directly signals the appropriate logic, to activate a co-processor, or to otherwise implement the yield functionality. In some embodiments, the concept of yielding appropriately describes an instruction after whose execution a given thread continues, but may be slowed at some point by the execution of another thread or handler. For example, a largely single-threaded program may invoke extra helper threads and share the processor with them.

In the embodiment of Fig. 4, a memory 440 includes event handlers 450 and a main thread 460. In some embodiments, the event descriptor table may be stored in the same memory or the same memory hierarchy as the main thread 460 and the handlers 450. As previously discussed, the handler(s) may spawn a helper thread to help the main program execute efficiently.

The memory 440 may also store an update module 442 that communicates via a communications interface 444. The update module 442 may be a hardware module or a software routine that is executed by the execution resources to obtain new conditions to be programmed into the various monitors and/or enable logic. The update module 442 may also obtain new helper threads or routines. For example, a software program may download these from its vendor to provide better performance. Thus, the network interface 444 may be any network and/or communication interface that allows information to be transferred over a communication channel. In some cases, the network interface may connect to the Internet to download new conditions and/or helper routines or threads.

In one embodiment, each bit of the event schema vector indicates the occurrence or non-occurrence of a particular event, where the particular event may be a composite event reflecting (and/or expressed via Boolean operations in terms of) a variety of conditions or other events. Occurrence of the particular event may set the corresponding bit in the event schema vector. Each bit in the event schema vector may have a corresponding bit in the event schema mask vector. If the mask bit indicates that the particular event is masked, the control flow logic 435 may disregard the event, although the bit in the event schema vector may remain set because the event occurred. The user may choose whether to clear the event schema vector when unmasking events. Thus, an event may be masked for some time and handled later. In some embodiments, the user may choose to specify the trigger as level-triggered or edge-triggered, depending on issues such as the relationship between event update, sampling, and reset (or the hold time of a trigger event in the ESV).
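The difference between a level trigger and an edge trigger on an ESV bit can be illustrated with a sampled bit stream. This is a generic sketch of the two conventional trigger styles, not a description of how any embodiment samples the vector; the function names and the sample sequence are hypothetical.

```python
def level_triggered(samples):
    """Fire on every sample where the event bit is set."""
    return [bool(s) for s in samples]


def edge_triggered(samples):
    """Fire only when the event bit changes from clear to set."""
    fired, previous = [], 0
    for s in samples:
        fired.append(bool(s) and not previous)
        previous = s
    return fired


bit_samples = [0, 1, 1, 0, 1]  # successive samples of one event bit
level = level_triggered(bit_samples)
edge = edge_triggered(bit_samples)
```

With a bit that stays set across several samples, the level style keeps signaling, while the edge style signals once per occurrence, which is why the hold time of the bit matters when choosing between them.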

If the mask bit indicates that an event is unmasked, the control flow logic 435 calls an event handler for that particular event in this embodiment. The control flow logic 435 may vector into the event descriptor table 430 based on the bit position in the event schema vector, and accordingly the event descriptor table may have N entries corresponding to the N bits of the event schema vector. The event descriptor table may contain a handler address indicating the address to which the control flow logic 435 should redirect execution, and may also include other information useful in a particular embodiment. For example, privilege level, thread, process, and/or other information may be maintained or updated in the event descriptor table.
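The masking and table lookup described for the ESV, ESVM, and EDT can be sketched with integer bit operations. This is a minimal sketch under explicit assumptions: a mask bit of 1 is taken to mean "masked" (the text does not fix the polarity), the EDT is modeled as a dictionary from bit position to a handler function, and all handler names are hypothetical.

```python
def pending_events(esv, esvm):
    """Return the bit positions that are set in the ESV and not masked.

    Assumption: a 1 bit in the ESVM means the event is masked (ignored).
    """
    active = esv & ~esvm
    return [bit for bit in range(active.bit_length()) if (active >> bit) & 1]


def dispatch(esv, esvm, edt):
    """Vector into the event descriptor table by bit position."""
    return [edt[bit]() for bit in pending_events(esv, esvm)]


# Hypothetical event descriptor table: bit position -> handler.
edt = {
    0: lambda: "cache_miss_handler",
    1: lambda: "bus_event_handler",
    3: lambda: "branch_handler",
}

esv = 0b1011   # events 0, 1, and 3 have occurred
esvm = 0b0010  # event 1 is masked
called = dispatch(esv, esvm, edt)
```

Note that the masked event's bit remains set in `esv`, so it could still be handled later if the mask is changed, consistent with the description above.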

In another embodiment, the event descriptor table 430 may not be necessary, or may be a single entry that indicates the address of a single event handler that handles all events. In this case, the entry may be stored in a register or other processor storage location. In one embodiment, a single handler may be used, and that handler may access the event schema vector to determine which event occurred and therefore how to respond. In another embodiment, the event schema vector may collectively define one event that causes the control flow logic 435 to call a handler. In other words, the event schema vector may represent a variety of conditions that together signal one event. For example, the event schema mask vector may be used to designate which of the events indicated by the event schema vector must occur to trigger execution of the handler. Each bit may represent a monitor reaching a programmable condition. When all of the unmasked monitors reach their respective designated conditions, the handler is called. Thus, the entire event schema vector may be used to designate a complex composite condition that triggers execution of the handler.
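The composite case, in which the handler runs only when every unmasked monitor has reached its condition, reduces to a single bitwise comparison. The sketch below again assumes that a mask bit of 1 means "masked", and additionally assumes that a fully masked vector never triggers; `compound_trigger` and the vector width are hypothetical.

```python
def compound_trigger(esv, esvm, width):
    """Return True only when every unmasked ESV bit is set.

    Assumptions: a 1 bit in the ESVM masks the event, and a vector
    with all events masked never triggers.
    """
    required = ~esvm & ((1 << width) - 1)  # bits that must all be set
    return required != 0 and (esv & required) == required


all_met = compound_trigger(esv=0b0101, esvm=0b1010, width=4)      # bits 0 and 2 required, both set
one_missing = compound_trigger(esv=0b0100, esvm=0b1010, width=4)  # bit 0 not yet set
all_masked = compound_trigger(esv=0b1111, esvm=0b1111, width=4)   # nothing is required
```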

In another embodiment, multiple event schema vectors and masks may be used to designate different conditions. The different vectors may vector to different handlers via the event descriptor table or some other mechanism. In another embodiment, some bits of one or more event schema vectors may be grouped to form events that trigger the calling of handlers. A variety of other permutations will be apparent to those skilled in the art.

Fig. 5a illustrates one embodiment of a monitor 500 that is programmable and capable of interfacing with a variety of performance monitors to signal a composite event. For example, such performance monitors may record occurrences of various microarchitectural events or conditions, such as: cache misses incurred at a given level of the cache hierarchy; branch retirement; branch misprediction (or retirement of mispredicted branches); trace cache delivery mode changes or events; branch prediction unit fetch requests; cancellations of memory requests; cache line splits (counts of completed split loads, stores, etc.); replay events; various types of bus transactions (e.g., locks, burst reads and writes, writebacks, invalidates); allocations in a bus sequencer (or only certain types); numeric assists (underflow, denormal, etc.); execution or retirement of a particular type of instruction or micro-operation (uOP); machine clears (or pipeline flushes); resource stalls (register renaming resources, pipeline, etc.); processing of tagged uOPs; instructions or uOPs retired; lines allocated in a cache (and/or in a particular state, e.g., M); the number of cycles instruction fetch is stalled; the number of cycles the instruction length decoder is stalled; the number of cache fetches; the number of lines allocated (or evicted) in a cache; and so on. These are only some examples of the microarchitectural events or conditions that may be monitored. Various other possibilities, as well as combinations of these or other conditions, will be apparent to those skilled in the art. Moreover, these and/or other conditions or events may be monitored with any of the monitors disclosed in any of the disclosed embodiments.

Performance monitors are often included in processors to count particular events. The programmer may read the counts of such performance monitors through manufacturer-defined interfaces, such as specific processor macro-instructions like the RDPMC instruction supported by known Intel processors. See Appendix A of Volume III of the Intel Software Developers Guide for the Pentium 4 Processor. In some embodiments, other internal instructions, micro-instructions, or micro-operations may be used to read performance counters. Thus, for example, performance monitors may be used in combination with the disclosed techniques. In some cases, a programmable performance monitor may be modified to provide event signaling capabilities. In other embodiments, performance monitors may be read by other monitors to establish events.

In the embodiment of Fig. 5a, the monitor 500 may include a set of programmable entries. Each entry may include an entry number 510, an enable field 511, a performance monitor number (EMON #) 512 that specifies one of a set of performance monitors, and a triggering condition 514. The triggering condition may be, for example, a certain count being reached, a count falling within a certain range, a difference in counts, and so on. The monitor 500 may include logic to read, or be coupled to receive, counts from the designated performance monitor. The monitor 500 signals the control flow logic when the various M conditions occur. A subset of the M entries may be used by selectively programming the enable field of each entry.
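The entry table of Fig. 5a can be sketched as a list of records, each naming a counter and a predicate on its value. This is an illustrative model only: the `Entry` dataclass, the counter numbers, and the thresholds are hypothetical, and combining the enabled entries with a logical AND is an assumption about how "the various M conditions" are evaluated.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Entry:
    number: int                       # entry number (510)
    enabled: bool                     # enable field (511)
    emon: int                         # performance monitor number (512)
    condition: Callable[[int], bool]  # triggering condition (514)


def monitor_fires(entries, counters):
    """Signal when every enabled entry's condition holds (assumed AND)."""
    active = [e for e in entries if e.enabled]
    return bool(active) and all(e.condition(counters[e.emon]) for e in active)


entries = [
    Entry(0, True, 2, lambda count: count >= 100),      # a count that is reached
    Entry(1, True, 5, lambda count: 10 <= count < 20),  # a count within a range
    Entry(2, False, 7, lambda count: count == 0),       # disabled, so ignored
]

fires = monitor_fires(entries, {2: 150, 5: 12, 7: 99})
quiet = monitor_fires(entries, {2: 150, 5: 25, 7: 0})
```

Disabling an entry simply removes it from the evaluated subset, which corresponds to selectively programming the enable fields.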

Fig. 5b illustrates another embodiment of a monitor 520. The monitor 520 represents a custom composite event monitor. The monitor 520 receives a set of signals from various execution resources or resource portions via signal lines 528-1 through 528-X and combines them via combinational logic 530. If the proper combination of signals is received, the monitor 520 signals the control flow logic via an output signal line 532.

Fig. 5c illustrates another embodiment of a monitor 540. The monitor 540 includes a table having M entries. Each entry includes an enable field 552, a condition field 554, and a trigger field 556. The condition field may be programmed to specify which combination of input signals is to be monitored. The conditions may or may not be tied to other event detection structures such as performance monitors, and may therefore be more general than those discussed with respect to Fig. 5a. The trigger field 556 may specify the state of those input signals required to signal the control flow logic. Again, each entry may be enabled or disabled via the enable field 552. In some embodiments, the condition and trigger fields may be combined. Various combinations of these and other types of monitors, known or otherwise available, simpler or more complex, will be apparent to those skilled in the art.

Fig. 6 illustrates a flow diagram for execution of a user program that activates helper threads in response to program-definable triggers, according to one embodiment. In block 600, the program first tests whether the yield capability is present. "Yield capability" is used herein as shorthand for the ability to disrupt processing based on the occurrence of a condition or event. Alternatively, rather than testing for yield capability support, the yield capability may use opcodes previously defined as no-operation opcodes and/or previously unused or undefined MSRs, so that use of the yield capability has no effect on processors that lack it. The presence of the capability can also be queried by checking special CPU-ID encodings that produce a hint indicating whether the capability is present on a given processor or platform. Similarly, special instructions such as Itanium's PAL (processor abstraction layer) calls or SALE (system abstraction layer environment) can be used to query processor-specific configuration information, including the availability of such a program-definable yield capability. Assuming the yield capability is present, the user program may read and/or reset various counters, as indicated in block 610. For example, performance monitor counters may be read so that a delta can be computed, or the values may be reset if that capability is available.
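Blocks 600 and 610 amount to a feature test followed by a before-and-after counter read. The sketch below models both steps in software; the flag position `YIELD_FEATURE_BIT`, the `PerfCounter` class, and the event counts are hypothetical and do not correspond to any real CPU-ID bit or performance counter interface.

```python
YIELD_FEATURE_BIT = 1 << 5  # hypothetical flag position, not a real CPU-ID bit


def has_yield_capability(feature_flags):
    """Block 600: test whether the (hypothetical) yield feature flag is set."""
    return bool(feature_flags & YIELD_FEATURE_BIT)


class PerfCounter:
    """Hypothetical free-running event counter."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def reset(self):
        self.value = 0


def count_events(counter, work):
    """Block 610: read the counter before and after `work` and return the delta."""
    before = counter.read()
    work()
    return counter.read() - before


counter = PerfCounter()
counter.value = 40  # events counted before the measured code ran


def work():
    counter.value += 7  # stand-in for 7 events occurring in the measured code


delta = count_events(counter, work) if has_yield_capability(0b100000) else None
```

Computing a delta avoids the need to reset the counter, which matters when the reset capability is not available to the program.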

As shown in block 620, the user program then sets the worker thread trigger conditions. The abandon capability may be accessible at a low privilege level (e.g., user level), so that any program or most programs can use the feature. In some processor families, for example, the abandon capability may be available at the ring 3 privilege level. The user program itself can therefore set its own performance-based trigger conditions. A user program or operating system that is aware of such context-sensitive monitor configurations may choose to save and restore the application-specific monitor configuration/settings across thread/process context switches, if the application requires it or the operating system can provide persistent monitoring capability.

As shown in block 630, the user program continues executing after programming the abandon conditions. Block 640 tests whether the abandon conditions have occurred. If not, the program continues executing, as shown in block 630. If the abandon conditions have occurred, a worker thread is started, as shown in block 650. The flow diagram of Fig. 6 tends to imply synchronous polling for each event, and this approach may be used in some embodiments. However, some embodiments respond to an event asynchronously when it occurs, or within a number of clock cycles of its occurrence, rather than polling for the event at specific intervals. In some embodiments, a monitor condition can be set outside a loop or other code section to detect a particular condition. This concept is demonstrated by the following pseudo-code example of a main thread and a worker thread.

One advantage of setting the trigger outside the loop is that compiler optimizations within the loop are not inhibited. For example, some compilers will not optimize loops or code sections that contain intrinsics such as those that may be used to activate the abandon capability. Placing these intrinsics outside the loop removes this interference with compiler optimization.
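As a rough illustration of arming a trigger once outside a loop, the following sketch models the trigger in software. The class `AbandonTrigger`, the function `run`, and the event-count threshold are hypothetical names chosen for this example; `record_event` stands in for a hardware event (such as a cache miss) that a real processor would detect asynchronously, so it is not part of the program's own loop body.

```python
class AbandonTrigger:
    """Hypothetical software stand-in for a hardware trigger on an event count."""

    def __init__(self, threshold, helper):
        self.threshold = threshold
        self.helper = helper
        self.count = 0
        self.armed = False

    def arm(self):
        self.armed = True
        self.count = 0

    def disarm(self):
        self.armed = False

    def record_event(self):
        # Models the hardware observing one monitored event.
        if not self.armed:
            return
        self.count += 1
        if self.count >= self.threshold:
            self.count = 0
            self.helper()  # start the worker (helper) routine

def run(events_per_iteration, iterations, threshold):
    helper_runs = []
    trigger = AbandonTrigger(threshold, lambda: helper_runs.append(1))
    trigger.arm()                      # set once, before the loop
    for _ in range(iterations):
        for _ in range(events_per_iteration):
            trigger.record_event()     # simulated monitored event
    trigger.disarm()                   # cleared once, after the loop
    return len(helper_runs)
```

Because the arm and disarm calls sit outside the loop, the loop body contains no trigger-related intrinsic that could inhibit optimization.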

Fig. 7 illustrates a flow diagram of a process for refining abandon settings according to one embodiment. Using a processor or the like with abandon capability, a programmer can design a program together with auxiliary routines to be invoked under various circumstances, as shown in block 700. Auxiliary routines can thus be provided for the various execution-impeding conditions the programmer anticipates. The processor can invoke these routines if and when they are needed during program execution. The abandon settings can include the event diagram vector and mask vector and/or monitor settings.

On a particular processor, particular abandon settings may produce favorable execution results. Determining them manually is very difficult, however, so they are better derived empirically. A compiler or other tuning software (e.g., the Intel VTune code profiler) can therefore repeatedly simulate the code with different abandon configurations to derive optimal or desirable settings, as shown in block 710. Desired runtime values for the abandon settings can then be selected, as shown in block 720. The program can be simulated on multiple versions of a processor, on multiple different processors, or in multiple different systems to derive different abandon settings. At run time, the program can use a system or processor identification such as a CPU-ID to select which abandon settings to apply, as shown in block 730.
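The runtime selection step of block 730 can be sketched as a simple lookup keyed by processor identification. The identifiers, event names, and threshold values below are invented for illustration; a real program would populate the table from the tuning runs described above.

```python
# Hypothetical table of empirically derived settings, keyed by processor ID.
ABANDON_SETTINGS = {
    "cpu-model-a": {"event": "cache_miss", "threshold": 64},
    "cpu-model-b": {"event": "cache_miss", "threshold": 16},
}

# Fallback used when no tuned entry exists for the running processor.
DEFAULT_SETTINGS = {"event": "cache_miss", "threshold": 32}

def select_settings(cpu_id, table=ABANDON_SETTINGS):
    """Return the abandon settings tuned for cpu_id, or the default settings."""
    return table.get(cpu_id, DEFAULT_SETTINGS)
```

Keeping the settings in a small table of this kind means that supporting a newly released processor only requires adding an entry, without touching the program logic.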

In addition, using a compact group of settings to optimize performance facilitates software updates. For example, when a new processor is released, new abandon values can be downloaded to optimize performance for that processor, or software can be updated with the new values. Such new values allow binary or modular modifications that substantially do not disturb or jeopardize the functionality of the existing software.

Fig. 8 illustrates a flow diagram of a process for updating software according to one embodiment. As shown in block 800, a new version of a microprocessor is released. The new version may have different latencies associated with microarchitectural events such as cache misses. A routine previously written to start a worker thread after a given number of cache misses may therefore be less effective because of the new cache miss latency. The abandon settings are therefore re-optimized, as shown in block 810.

Once new settings are derived, the program can be updated (e.g., by an update module that may be part of the program), as shown in block 820. Depending on implementation details, abandon values may be modified or added. Additional or different auxiliary routines may also be added to assist on the new processor implementation. In either case, the abandon capability enables performance enhancements to be delivered after the initial delivery of the software. This ability can be quite advantageous in many scenarios, and may be used simply to provide new optimizations without any change to the underlying hardware. In some cases the underlying software may also be maintained. For example, if an auxiliary routine is written to handle a composite event (e.g., severe cache misses), the combination of events that triggers the routine can be changed on different hardware without changing the routine itself. For example, the monitor configuration values and/or ESV/ESVM values can be changed while the routine remains unchanged.

The effectiveness of the disclosed techniques can be further enhanced by creating nested worker threads; Fig. 9a illustrates one example of such usage. In the embodiment of Fig. 9a, the program sets an abandon event in block 900. The program continues executing in block 910. Block 920 tests whether the abandon event (trigger) has occurred. If not, the program continues executing, as shown in block 910. If the abandon event has occurred, a worker thread is started, as shown in block 925. The worker thread sets another abandon event, as shown in block 930. The worker thread thus effectively identifies a further condition indicating that additional processing help would be useful. This further condition may indicate whether the first worker thread is effective, and/or may be designed to indicate another condition that is suspected to arise because of, or despite, the start of the first worker thread.

As shown in block 940, both the program and the worker thread are active and executing threads. These threads execute concurrently in the sense that both are active and executing in a multithreaded processing resource. Block 950 tests whether the new trigger condition has occurred for the combination of the program and the worker thread. If not, both threads continue executing, as shown in block 940. If the new trigger condition does occur, a second or nested worker thread is started, as shown in block 960. Thereafter, the program and multiple worker threads may be active and executing, as shown in block 962. Multiple nested worker threads may thus be employed in some embodiments.

In one embodiment, multiple worker threads (nested or not) can be supported through virtual threads. Rather than dedicating a full set of resources to expand the number of threads it can handle, a processor can effectively cache context data (in a cache location, a register location, or other storage location). One physical thread slot can thus be switched rapidly among multiple threads.

For example, the embodiment of Fig. 9b illustrates thread switching logic that, according to one embodiment, allows virtual threads to be switched into a limited number of physical thread slots, which have dedicated hardware to maintain thread context. In the embodiment of Fig. 9b, multiple worker threads 965-1 through 965-k are presented to a virtual thread switch 970. The virtual thread switch 970 may also include other logic and/or microcode (not shown) to swap context information between the newly selected worker thread and the previously selected worker thread. The virtual thread switch 970 can be triggered to switch threads by a synchronous or an asynchronous stimulus. For example, an asynchronous event defined by an abandon-type instruction can cause a thread switch among the virtual threads. In addition, worker threads can include synchronous means, such as a halt, quiesce, or other execution-stopping instruction, to signal a switch to another thread. The virtual thread switching logic 970 presents a subset of the virtual threads (one, in the embodiment of Fig. 9b) to the processor thread switching logic 980. The processor thread switching logic 980 then switches between that worker thread, as its first thread 967-1, and its other N-1 threads, up to thread 967-N.
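The slot-sharing idea of Fig. 9b can be modeled in a few lines. This is a sketch only: the class name `VirtualThreadSwitch`, the round-robin choice of the next worker thread, and the use of plain identifiers in place of saved thread contexts are all assumptions made for illustration.

```python
from collections import deque

class VirtualThreadSwitch:
    """Toy model: many virtual worker threads share a few physical thread slots."""

    def __init__(self, helper_ids, slots):
        # Contexts of threads not currently in a physical slot are cached here.
        self.waiting = deque(helper_ids)
        count = min(slots, len(self.waiting))
        self.active = [self.waiting.popleft() for _ in range(count)]

    def switch(self, slot):
        """Swap the thread in `slot` with the next waiting one; return the new occupant."""
        if not self.waiting:
            return self.active[slot]
        self.waiting.append(self.active[slot])      # save the outgoing context
        self.active[slot] = self.waiting.popleft()  # restore the incoming context
        return self.active[slot]
```

With three virtual worker threads and one slot, repeated calls to `switch(0)` cycle through the workers in order, which mirrors presenting one virtual thread at a time to the processor thread switching logic.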

In some embodiments it may be advantageous to confine the abandon capability to a particular program or thread. The abandon capability can therefore be made context-sensitive, or non-promiscuous. For example, Fig. 10a illustrates one embodiment of a context-sensitive event diagram vector and mask implementation. In the embodiment of Fig. 10a, a storage area 1000 includes a context indicator field 1010 associated with each event diagram vector and mask storage location 1020. The context indicator field identifies the context to which each event diagram vector and mask pair applies. For example, a context value such as the value of a control register (e.g., CR3 in an x86 processor, which indicates the operating system process ID) may be used. Additionally or alternatively, thread number information may be used to define the context. Thus, in some embodiments, when a particular context is active, certain context-specific events can be enabled to interrupt processing. The abandon mechanism is therefore non-promiscuous in that its events affect only certain contexts.
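The context check described for Fig. 10a can be sketched as a lookup from a context value to a mask. The function name `event_enabled`, the dictionary representation, and the bit numbering are hypothetical; the context keys stand in for a value such as a CR3-style process identifier.

```python
def event_enabled(mask_table, current_context, event_bit):
    """Return True if event_bit is unmasked for the currently active context.

    mask_table maps a context value to an integer mask with one bit per event.
    A context without an entry has no events enabled.
    """
    mask = mask_table.get(current_context, 0)
    return bool(mask & (1 << event_bit))

# Hypothetical table: only context 0x1000 has events 0 and 2 enabled.
mask_table = {0x1000: 0b0101}
```

An event that is enabled for one context therefore has no effect while any other context is running.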

Fig. 10b illustrates another embodiment of a context-sensitive event diagram vector and mask implementation. In the embodiment of Fig. 10b, an integer number k of contexts can be handled by providing one set of event diagram vector and mask locations 1050-1 through 1050-k for each of the k contexts. For example, a multithreaded processor may have k threads, each with an event diagram vector and mask or a similar abandon-enabling mechanism. Note that in other embodiments it may be undesirable to track events only in certain contexts. For example, events may reflect overall processing activity, and/or events may pertain to or be caused by multiple related threads.

Fig. 11 illustrates one embodiment of a multithreaded processor that performs thread switching based on monitor or abandon-type events. Although many of the embodiments discussed interrupt the processing flow by causing a handler to execute, other embodiments may define events that cause thread switches in a multithreaded processor. For example, in the embodiment of Fig. 11, thread switch logic 1105 is coupled to receive signals from a set of N monitors 1110-1 through 1110-N. The thread switch logic 1105 may also be coupled to one or more sets of event diagram and mask pairs 1130-1 through 1130-p (p being a positive integer). The event diagram and mask pairs allow the thread switch logic to combine and/or ignore particular monitor events when determining when to switch threads.

Execution resources 1120 may support execution of p threads, yet be indifferent to which thread an instruction belongs to. The execution resources may be execution units, fetch logic, decoders, or any other resources used in executing instructions. A multiplexer 1115 or other selection resource arbitrates among the threads to determine which thread accesses the execution resources 1120. Those skilled in the art will appreciate that various resources in a multithreaded processor may be shared or duplicated, and that various resources may have thread-switched access that allows a limited number of threads (e.g., one) to access the resource at a time.

If a set of conditions indicated by one or more monitors and/or one of the event diagram vector and mask pairs occurs, the thread switch logic 1105 switches threads of execution. Another thread can thus be started in place of the thread that was active when the processor conditions matched the programmed conditions. For example, a user program can control the events that trigger thread switches.

In some multithreaded processors, each thread may have an associated set of event diagram vector and mask pairs or the like. Thus, as shown in Fig. 11, the multiplexer 1115 arbitrates among p threads, and there are p corresponding event diagram and mask pairs. That a processor is multithreaded does not mean, however, that every implementation uses multiple event diagram vectors and masks. Some embodiments may use only one pair, or may use other enable indicators. For example, a single bit could serve as an enable indicator to turn a particular abandon-type capability on or off.

Fig. 12 illustrates one embodiment of a system with event detection and handling capabilities for synchronization objects. A synchronization object may be a lock or lock variable, a barrier, or another hardware, software, and/or memory resource that can be used to synchronize between threads or processes. As multi-core processing and various types of multithreading and multiprocessing gain popularity, synchronization between threads or processes becomes more important to performance. A system with enhanced synchronization efficiency may therefore have broad applicability in general-purpose and/or special-purpose processing that uses concurrent processes (e.g., graphics, natural media types, digital signal processing, communications, etc.).

The system shown in Fig. 12 includes a processor 1200 coupled to a memory 1250 as well as to a communication interface 1292 and one or more peripherals 1294 (e.g., an audio interface, a display, a keyboard, a mouse or other input device, an I/O device, etc.). These devices may be coupled directly or indirectly by buses, bridges, and/or point-to-point connections. The processor 1200 includes execution resources 1210 and an event detector 1220 to monitor various aspects of the execution resources 1210. The processor 1200 and event detector 1220 may have the various characteristics described with reference to the preceding embodiments. The event detector 1220 may thus be programmable to start a thread (e.g., by triggering a thread switch or forking a new thread) based on a defined event. The event may be a hardwired event or may be defined by a software program through a mechanism such as the event diagram vector described above.

The processor 1200 may also have a lock and/or rotation detector 1222. In some embodiments, the lock/rotation detector may be a separate detector, as shown in Fig. 12. Such a separate detector may be a hardware portion that detects particular predefined, or even programmable, conditions indicative of a lock. The detector may thus be partly hardware and partly software. In other embodiments, lock or spin-lock detection may be implemented simply by programming various conditions into a general event detector. A rotation/lock detector can thus effectively be formed by appropriately programming the general event detector to detect such conditions. For example, application 1254 may include an event detector programming module (EDPM) 1256 to program the event detector 1220 to trigger on the desired events.

One lock-related event that can be detected is a long-latency fetch of a lock variable. A lock variable that incurs a long-latency fetch can be detected by programming the event detector 1220 to trigger on a cache miss at a point in the program where the lock variable is about to be accessed from memory. Such a cache miss indicates that the processor does not have the lock variable cached. A handler can then be started in response to the triggering of this particular event to address the long-latency lock fetch.

Second, the lock/rotation detector 1222 may also detect a rotating (spin) condition, in which a program waiting on a heavily contended lock loops, checking whether the lock variable has become available. For example, the rotating condition may be detected by sensing repeated accesses to a known lock variable location. When the rotating condition is detected, a second thread may be started, as discussed in detail below. Some embodiments may use a worker thread to handle such a heavily contended lock regardless of whether a long-latency lock fetch (which triggers the first worker thread) occurred first.
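Detection of the rotating (spin) condition by sensing repeated accesses to a known lock location can be sketched as a counter with a threshold. The class name `RotationDetector`, the threshold parameter, and the rule that any other access resets the count are assumptions made for this example; the disclosure does not prescribe a specific detection rule.

```python
class RotationDetector:
    """Toy model: flag spinning after repeated failed accesses to one lock address."""

    def __init__(self, lock_addr, threshold):
        self.lock_addr = lock_addr
        self.threshold = threshold
        self.consecutive = 0

    def observe(self, addr, acquired=False):
        """Feed one memory access; return True once spinning is detected."""
        if addr == self.lock_addr and not acquired:
            self.consecutive += 1
        else:
            # An unrelated access or a successful acquire ends the spin.
            self.consecutive = 0
        return self.consecutive >= self.threshold
```

A caller could start the second worker thread the first time `observe` returns True.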

The embodiment of Fig. 12 includes lightweight thread context storage 1230. This context storage allows a small subset of state to be saved, enabling "lightweight" or "flyweight" context switches. For example, in some cases only the instruction pointer of the parent process is saved, and the programmer is left responsible for saving any additional context. More or less context may be saved, but generally a subset smaller than the full context is stored. These lightweight threads may be exposed at the user level (e.g., the application level of a program, such as privilege level 3 programs in the x86 architecture) by special instructions, so that a user can start threads on the multithreaded execution resources in conjunction with a particular application. In this case, an event handler started as a lightweight thread should save any context that it disturbs and that is not otherwise saved (context the parent application may need for normal execution). In other embodiments, the triggered worker threads may be threads with fully independent contexts.

The embodiment of Fig. 12 also includes a memory 1250 coupled to the processor. In this embodiment, various lock overhead utilization modules are shown along with the application 1254. In this embodiment, the modules are software programs. In one embodiment, each module is a separate thread that is triggered when a particular event occurs. One or more of the modules shown may be combined into a single worker thread, which may be a full or lightweight context thread. In other embodiments, the modules may be implemented in hardware or in a combination of hardware, software, and/or firmware.

The application 1254 is a user-level application that may have locks or other synchronization objects or techniques. A fetch critical section data module 1258 may be used to speculatively move past the lock and fetch data within the critical section protected by the lock. An acquire future locks module 1260 may run ahead and acquire locks for other lock-protected sections. A lock may be acquired simply by fetching its data location into the cache, or in other embodiments by changing the lock variable so that the lock is owned. The acquire future locks module 1260 may also include a throttle module 1262 to ensure that this speculative lock activity does not reduce overall throughput by limiting other threads. A run-ahead execution module 1280 may execute ahead to complete some work outside the lock-protected sections.

A lock profiling module 1270 may collect data about the progress of particular threads with respect to lock variables and/or other threads. A user thread scheduler module 1290 may receive hints from the lock profiling module that allow threads to be scheduled more efficiently. For example, the profiling module 1270 may detect that a first thread (e.g., an initial consumer) acquires a lock and causes significant spinning in a second thread (e.g., an initial producer). In this example, informing the scheduler to schedule the second thread (the initial producer) first may result in more efficient processing. In some embodiments, the scheduler is a user-level thread scheduler that is exposed to the programmer to allow scheduling of user-level (e.g., lightweight) threads. In some embodiments, the user thread scheduler 1290 may be part of the application 1254.

The interaction of these modules can be further understood with reference to the example of Fig. 13. In the embodiment of Fig. 13, a lock fetch latency is detected in block 1310. This detection may be accomplished by programming the event detector 1220 to sense such a condition. For example, the application 1254 may call the event detector programming module 1256 to program the event detector 1220 to trigger on a cache miss just before the lock variable is accessed. Alternatively, the dedicated lock/rotation detector 1222 described above may be used to detect the lock.

In block 1315, a thread switch (e.g., a flyweight thread switch) may be performed in response to detecting the fetch latency of the lock variable. This thread switch starts a first worker thread, which in various embodiments may perform the functions set forth in the following blocks. Although block 1315 is followed by various blocks, not all of these blocks are required in any particular embodiment, and their order is not critical.

In block 1320, data that lies beyond the code that acquires the lock, but within the code section protected by the lock, may be fetched. For example, the fetch critical section data module 1258 may be executed in the embodiment of Fig. 12. Prefetching data from the lock-protected code section reduces cache misses or general data retrieval latency once ownership of the lock is finally obtained.

In addition (as part of a separate thread or the same thread), future locks may be fetched as shown in blocks 1330-1365. In particular, N additional locks may be acquired, where N is a positive integer. The number of locks to acquire may be programmable or hard coded, and may vary during execution (e.g., through reprogramming or the throttle module 1262). In block 1330, a loop variable i is set to 1. As shown in block 1340, a future lock is fetched (e.g., prefetched or actually locked, in various embodiments). Future lock addresses may be statically computed when the worker thread is generated or may be determined by lock profiling, as described below. As shown in block 1350, the lock is tested to determine whether it is contended. If the future lock is contended (as indicated, e.g., by the cache state of the lock variable, by the value of the lock variable, or by spinning behavior of the program), fetching of this lock and/or other future locks may be discontinued, as shown in block 1355.

If the lock is not found to be contended in block 1350, operation continues in block 1360. If the test in block 1360 finds that the count of locks fetched does not equal the target number N, the variable i is incremented in block 1365 and the process returns to block 1340. If the count has reached N in block 1360, the process continues to block 1370 in one embodiment. In the embodiment of Fig. 12, these lock fetch operations 1330-1365 may be performed by the acquire future locks module 1260. Fetching or acquiring ownership of future locks can advantageously speed program execution, because the lock is more likely to be readily available when it is encountered. Many programs encounter relatively few highly contended locks, so the benefit of prefetching locks may outweigh any negative effect on the progress of other processes.
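The loop of blocks 1330-1365 can be sketched as follows. The function name `acquire_future_locks` and the two callables `is_contended` and `prefetch` are hypothetical placeholders for whatever mechanism an embodiment uses to fetch a lock and test it for contention; the limit `n` plays the role of the target number N.

```python
def acquire_future_locks(future_locks, n, is_contended, prefetch):
    """Fetch up to n future locks, stopping at the first contended one."""
    acquired = []
    i = 1                          # block 1330: initialize the loop variable
    for lock in future_locks:
        prefetch(lock)             # block 1340: fetch the future lock
        if is_contended(lock):     # block 1350: contention test
            break                  # block 1355: discontinue fetching
        acquired.append(lock)
        if i == n:                 # block 1360: target count reached
            break
        i += 1                     # block 1365: increment and loop again
    return acquired
```

For example, with four candidate locks and `n` set to 3, an uncontended run fetches the first three locks, while contention on the second lock ends the run after a single successful fetch.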

In some embodiments, the work performed by the thread triggered in block 1315 may end at block 1365; the thread can then be shut down (by a halt or join type of operation) and control returned to the main thread, as shown in block 1370. In other embodiments, the worker thread may continue and perform the operations of blocks 1372-1375 and/or other operations. In addition, other embodiments may trigger other numbers of worker threads or other combinations of the described operations.

In the embodiment of Fig. 13, as shown in block 1372, the application fails to secure the lock. In other words, the lock variable indicates that another process owns the lock. In some embodiments, repetition of this failure is detected. A rotation threshold may be programmed or set to provide a threshold number of failed attempts to obtain ownership of the lock variable before action is taken. In block 1374, a second worker thread is triggered to accomplish other work under the shadow of the overhead required to obtain ownership of the lock variable. One example is to execute code beyond the critical section, as shown in block 1375. For example, the run-ahead execution module 1280 may be executed in the embodiment of Fig. 12. In some embodiments, such code may be executed only to prefetch instructions or data, without computing results and/or committing results to machine state. In other embodiments, results may be computed and/or committed as part of the run-ahead execution, provided that dependency checks are performed to ensure correct results.

Alternatively, the second worker thread may acquire additional locks in response to this second event, similar to the process of blocks 1330-1365. Another alternative is to perform lock profiling. Yet another alternative is to flush previously held locks out to at least the highest level of cache (or possibly to an external interface) to reduce the transfer latency for another processor to acquire the lock. In various embodiments, these examples of other work that can be done under the shadow of lock overhead may be combined in various permutations.

Fig. 14 illustrates an embodiment that includes lock profiling. In the embodiment of Fig. 14, various threads are scheduled in block 1410. In one embodiment, these may be lightweight threads subject to user-level or application-level scheduling control. For example, the application 1254 may include the user thread scheduler 1290 of Fig. 12. In another embodiment, the threads may be operating-system-visible threads with full contexts. In block 1420, a contended lock is detected. The contended lock causes a profiling thread to be enabled or started. As shown in block 1430, this thread profiles the behavior of the lock. In one embodiment, the profiling entails capturing data such as event counter data from performance counters or other similar structures. The profile information is then used to help determine the priority and/or ordering of threads when thread scheduling is performed again in block 1410. Lock overhead time is thereby again used to improve overall program performance. As noted above, such scheduling information can allow a thread scheduler to schedule producer/consumer thread pairs more efficiently.
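The feedback from lock profiling to scheduling can be sketched with a simple heuristic. The class `LockProfiler`, its methods, and the policy of scheduling the thread that spun the most first are assumptions for illustration; the disclosure states only that profile data helps determine thread priority and/or ordering.

```python
from collections import Counter

class LockProfiler:
    """Toy model: count spin iterations per thread and derive a scheduling order."""

    def __init__(self):
        self.spins = Counter()

    def record_spin(self, thread_id, count=1):
        # Stand-in for capturing event counter data from performance counters.
        self.spins[thread_id] += count

    def schedule_order(self, thread_ids):
        """Order threads so that those that spun the most run first.

        sorted() is stable, so threads with equal counts keep their given order.
        """
        return sorted(thread_ids, key=lambda t: -self.spins[t])
```

In the producer/consumer example above, a producer that spins heavily while the consumer holds the lock would be moved ahead of the consumer the next time scheduling is performed.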

During development, a design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs at some stage reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. The machine-readable medium may be an optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage such as a disc. Any of these media may "carry" or "indicate" the design or software information. When an electrical carrier wave indicating or carrying the code or design is transmitted, a new copy is made to the extent that copying, buffering, or retransmission of the electrical signal is performed. Thus, a communication provider or a network provider may make copies of an article (a carrier wave) embodying techniques of the present invention.

Thus, techniques for a programmable event-driven abandon mechanism that can start other threads are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail as facilitated by technological advancements, without departing from the principles of the present invention or the scope of the appended claims.

Claims

1. An apparatus for scheduling threads, comprising:

execution resources capable of executing multiple threads simultaneously;

event detector hardware logic to detect a cache miss event associated with a synchronization object, the event detector causing a first thread switch;

a rotation detector to detect that the synchronization object is a contended synchronization object, the rotation detector causing a second thread switch;

a profiling module to collect profile data about synchronization contention; and

a user thread scheduler module to use the profile data for user thread scheduling.

2. The apparatus of claim 1, wherein the rotation detector comprises an event detector program stored in a machine-readable medium, the event detector program to program the event detector logic to detect that the synchronization object is contended.

3. The apparatus of claim 1, further comprising a memory, the memory to store an application that uses the synchronization object and a future lock module to be started by the first thread switch, wherein the future lock module is to acquire future locks of the application.

4. The apparatus of claim 3, wherein the future lock module is to acquire multiple future locks, and the apparatus further comprises a throttle module to prevent excessive lock prefetching.

5. The apparatus of claim 3, wherein the future lock module is to acquire future locks by prefetching data.

6. The apparatus of claim 3, wherein collecting profile data comprises capturing data from performance counters.

7. The apparatus of claim 6, wherein the profile data is used to help determine the priority and/or ordering of threads when thread scheduling is performed again.

8. The apparatus of claim 1, wherein the contended synchronization object is a contended lock, and the rotation detector is to detect the contended lock.

9. The apparatus of claim 8, further comprising a memory, the memory to store an application, the application comprising a locked section containing the contended lock and a module that uses the overhead delay caused by the contended lock to operate outside the locked section.

10. The apparatus of claim 1, further comprising a memory, the memory to store a synchronization improvement module to provide rescheduling hints to a scheduler, wherein the synchronization improvement module is to detect the thread scheduling inefficiency of scheduling a consumer thread before a producer thread and to provide a hint to schedule the producer thread before the consumer thread.

11. The apparatus of claim 1, wherein the event detector is programmed to fork a future lock worker thread in response to a cache miss encountered in a locked section.

12. A method for scheduling threads, comprising:

encountering, in a first thread, a locked section involving a lock variable;

starting a first worker thread, in response to a cache miss that occurs when attempting to fetch the lock variable, to fetch future locks for the first thread, the first worker thread to execute at least partially in parallel with other threads;

detecting that the lock variable is contended;

collecting profile data about synchronization contention; and

using the profile data for user thread scheduling.

13. The method of claim 12, further comprising:

starting a second worker thread to utilize lock synchronization overhead.

14. The method of claim 12, wherein utilizing lock synchronization overhead comprises executing respective iterations of a synchronization loop in response to the lock variable being contended.

15. The method of claim 13, wherein starting the first worker thread comprises forking the first thread and switching to the first worker thread, and wherein starting the second worker thread comprises forking a second thread and switching to the second worker thread.

16. The method of claim 12, wherein collecting profile data comprises capturing data from performance counters.