CN105683906A - Adaptive process for data sharing with selection of lock elision and locking - Google Patents

Adaptive process for data sharing with selection of lock elision and locking

Info

Publication number
CN105683906A
CN105683906A (application number CN201480053800.8A)
Authority
CN
China
Prior art keywords
hle
transaction
lock
transactional
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201480053800.8A
Other languages
Chinese (zh)
Other versions
CN105683906B (en)
Inventor
M·K·克施温德
M·M·迈克尔
V·萨拉普拉
岑中龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 14/191,581 (granted as US 9,524,195 B2)
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN105683906A
Application granted
Publication of CN105683906B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30087Synchronisation or serialisation instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/526Mutual exclusion algorithms

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

In a Hardware Lock Elision (HLE) environment, predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally is provided. Included is: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, and suppressing any write by the lock-acquire instruction to the lock and proceeding in HLE transactional execution mode until an xrelease instruction is encountered, wherein the xrelease instruction releases the lock, or the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not-to-elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction, and proceeding in non-transactional mode.

Description

Adaptive process for data sharing with selection of lock elision and locking
Technical field
The present disclosure relates generally to transactional memory systems, and more particularly to methods, computer program products, and computer systems for adaptively sharing data by selecting between lock elision and locking.
Background
The number of central processing unit (CPU) cores on a chip, and the number of CPU cores connected to a shared memory, continue to grow significantly to support increasing workload capacity demands. The increasing number of CPUs cooperating to process the same workloads puts a significant burden on software scalability; for example, shared queues or data structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally this has been countered by implementing finer-grained locking in software. Implementing finer-grained locking to improve software scalability can be very complicated and error-prone, and, at today's CPU frequencies, the latencies of hardwired interconnects are limited by the physical dimensions of the chips and systems and by the speed of light.
Hardware transactional memory (HTM, or in this discussion simply TM) implementations have been introduced, in which a group of instructions, called a transaction, operates in an atomic manner on a data structure in memory, as viewed by other central processing units (CPUs) and the I/O subsystem (atomic operation is also known in the literature as "block concurrent" or "serialized"). The transaction executes optimistically without obtaining a lock, but may need to abort and retry the transactional execution if an operation of the executing transaction on a memory location conflicts with another operation on the same memory location. Previously, software transactional memory implementations have been proposed to support software transactional memory (TM). However, hardware TM can provide improved performance and improved ease of use compared with software TM.
U.S. Patent Application Publication No. 2004/0044850, titled "Method and apparatus for the synchronization of distributed caches" and filed on August 28, 2002, the contents of which are incorporated herein by reference, teaches a method and apparatus for the synchronization of distributed caches. More particularly, the provided embodiments relate to cache memory systems, and more particularly to a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output (I/O) hub.
U.S. Patent No. 5,586,297, titled "Partial cache line write transactions in a computing system with a writeback cache" and filed on March 24, 1994, the contents of which are incorporated herein by reference, teaches a computing system that includes a memory, an input/output adapter, and a processor. The processor includes a write-back cache in which dirty data may be stored. When a coherent write from the input/output adapter to the memory is performed, a block of data is written from the input/output adapter to a memory location in the memory. The block of data contains less data than a full cache line in the write-back cache. The write-back cache is searched to determine whether it contains data for the memory location. When the search determines that the write-back cache contains data for the memory location, the full cache line that contains the data for the memory location is purged.
Summary of the invention
A method is provided, for use in a hardware lock elision (HLE) environment, for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally. According to an embodiment of the present disclosure, the method may include: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
In another embodiment of the present disclosure, a computer program product may be provided for predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer program product may include a computer-readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
In another embodiment of the present disclosure, a computer system is provided for predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer system may include a memory and a processor in communication with the memory, wherein the computer system is configured to perform a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
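For illustration only, the following minimal C sketch mirrors the decision flow summarized above: on an HLE lock-acquire, an HLE predictor decides between eliding the lock and acquiring it non-transactionally, and the predictor is updated from the outcome. The two-bit saturating counter and all helper names (add_to_read_set, enter_hle_transactional_mode, acquire_lock) are assumptions made for illustration and are not part of the disclosed embodiments.

```c
/* Illustrative sketch of the predictor-driven decision described above.
 * The saturating-counter predictor and the extern helpers are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

extern void add_to_read_set(uintptr_t lock_addr);       /* hypothetical helper */
extern void enter_hle_transactional_mode(void);         /* hypothetical helper */
extern void acquire_lock(uintptr_t lock_addr);          /* hypothetical helper */

struct hle_predictor { uint8_t counter; };              /* 2-bit saturating counter */

static bool predict_elide(const struct hle_predictor *p)
{
    return p->counter >= 2;                             /* high values favor elision */
}

static void on_hle_lock_acquire(struct hle_predictor *p, uintptr_t lock_addr)
{
    if (predict_elide(p)) {
        /* Elide: track the lock address in the read-set, suppress the write
         * to the lock, and run transactionally until XRELEASE or a conflict. */
        add_to_read_set(lock_addr);
        enter_hle_transactional_mode();
    } else {
        /* Do not elide: treat the instruction as an ordinary lock acquire
         * and proceed non-transactionally. */
        acquire_lock(lock_addr);
    }
}

static void on_hle_outcome(struct hle_predictor *p, bool committed)
{
    if (committed) { if (p->counter < 3) p->counter++; } /* reward successful elision */
    else           { if (p->counter > 0) p->counter--; } /* back off after aborts */
}
```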
Brief description of the drawings
Features and advantages of the embodiments disclosed herein will become apparent from the following detailed description of illustrative embodiments of the present disclosure, which is to be read in conjunction with the accompanying drawings. The various features of the drawings are not to scale, as the illustrations are for clarity in facilitating the understanding of the present disclosure by one skilled in the art in conjunction with the detailed description. In the drawings:
Fig. 1 and Fig. 2 illustrate an example multicore transactional memory environment, in accordance with embodiments of the present disclosure;
Fig. 3 illustrates example components of an example CPU, in accordance with embodiments of the present disclosure;
Fig. 4 illustrates a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an example hardware or software embodiment;
Fig. 5 illustrates a flowchart of implementing a conflict predictor, also referred to as an HLE predictor, in an HLE environment in which hardware lock elision support is present;
Fig. 6 illustrates a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an example embodiment without added hardware capability;
Fig. 7 illustrates a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an example embodiment with hardware lock monitoring;
Figs. 8-9 illustrate example flows for adaptively sharing data; and
Fig. 10 is a schematic block diagram of the hardware and software of a computer environment according to at least one example embodiment of the methods of Figs. 4-7.
Detailed description of the invention
Historically, a computer system or processor had only a single processor (also known as a processing unit or central processing unit, CPU). The processor included an instruction processing unit (IPU), a branch unit, a memory control unit, and the like. Such a processor was capable of executing a single thread of a program at a time. Operating systems were developed that could time-share the processor by dispatching a program to be executed on the processor for a period of time, and then dispatching another program to be executed on the processor for another period of time. As technology evolved, memory subsystem caches were often added to the processor, as well as complex dynamic address translation including a translation lookaside buffer (TLB). The IPU itself was often referred to as the processor. As technology continued to evolve, an entire processor could be packaged on a single semiconductor chip or die; such a processor was referred to as a microprocessor. Processors were then developed that incorporated multiple IPUs; such processors were often referred to as multiprocessors. Each such processor of a multiprocessor computer system (processor) may include individual or shared caches, memory interfaces, system buses, address translation mechanisms, and the like. Virtual machine and instruction set architecture (ISA) emulators added a layer of software to a processor that provided the virtual machine with multiple "virtual processors" (also known as processors) by time-slicing usage of a single IPU in a single hardware processor. As technology further evolved, multithreaded processors were developed, enabling a single hardware processor having a single multithreaded IPU to provide the capability of simultaneously executing threads of different programs; thus each thread of a multithreaded processor appeared to the operating system as a processor. As technology further evolved, it became possible to place multiple processors (each having an IPU) on a single semiconductor chip or die. These processors were referred to as processor cores, or just cores. Thus, terms such as processor, central processing unit, processing unit, microprocessor, core, processor core, processor thread, and thread are often used interchangeably. Aspects of the embodiments herein may be practiced by any or all processors, including those described above, without departing from the teachings herein. Where the term "thread" or "processor thread" is used herein, it is expected that particular advantages of the embodiment may be had in a processor-thread implementation.
Transactional execution in Intel®-based embodiments
Chapter 8 of "Intel® Architecture Instruction Set Extensions Programming Reference", 319433-012A, February 2012, incorporated herein by reference in its entirety, teaches in part that multithreaded applications may take advantage of increasing numbers of CPU cores to achieve higher performance. However, the writing of multithreaded applications requires programmers to understand and take into account data sharing among the multiple threads. Access to shared data typically requires synchronization mechanisms. These synchronization mechanisms are used to ensure that multiple threads update shared data by serializing operations that are applied to the shared data, often through the use of a critical section that is protected by a lock. Since serialization limits concurrency, programmers try to limit the overhead due to synchronization.
Intel® Transactional Synchronization Extensions (Intel® TSX) allow a processor to dynamically determine whether threads need to be serialized through lock-protected critical sections, and to perform that serialization only when required. This allows the processor to expose and exploit concurrency that is hidden in an application because of dynamically unnecessary synchronization.
With Intel TSX, programmer-specified code regions (also referred to as "transactional regions" or just "transactions") are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other processors. A processor makes the memory operations of the executed transaction, performed within the transactional region, visible to other processors only when a successful commit occurs, i.e., when the transaction successfully completes execution. This process is often referred to as an atomic commit.
Intel TSX provides two software interfaces to specify regions of code for transactional execution. Hardware Lock Elision (HLE) is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) to specify transactional regions. Restricted Transactional Memory (RTM) is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) for programmers to define transactional regions in a more flexible manner than is possible with HLE. HLE is for programmers who prefer the backward compatibility of the conventional mutual-exclusion programming model and would like to run HLE-enabled software on legacy hardware, but would also like to take advantage of the new lock elision capabilities on hardware with HLE support. RTM is for programmers who prefer a flexible interface to the transactional execution hardware. In addition, Intel TSX also provides an XTEST instruction. This instruction allows software to query whether the logical processor is transactionally executing in a transactional region identified by either HLE or RTM.
Since a successful transactional execution ensures an atomic commit, the processor executes the code region optimistically without explicit synchronization. If synchronization was unnecessary for that specific execution, execution can commit without any cross-thread serialization. If the processor cannot commit atomically, then the optimistic execution fails. When this happens, the processor will roll back the execution, a process referred to as a transactional abort. On a transactional abort, the processor will discard all updates performed in the memory region used by the transaction, restore architectural state to appear as if the optimistic execution never occurred, and resume execution non-transactionally.
A processor can perform a transactional abort for numerous reasons. A primary reason to abort a transaction is due to conflicting memory accesses between the transactionally executing logical processor and another logical processor. Such conflicting memory accesses may prevent a successful transactional execution. Memory addresses read from within a transactional region constitute the read-set of the transactional region, and addresses written to within the transactional region constitute the write-set of the transactional region. Intel TSX maintains the read- and write-sets at the granularity of a cache line. A conflicting memory access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is part of either the read-set or the write-set of the transactional region. A conflicting access typically means that serialization is required for this code region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts that result in transactional aborts. Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity. In addition, some instructions and system events may cause transactional aborts. Frequent transactional aborts result in wasted cycles and increased inefficiency.
Hardware lock elision
Hardware Lock Elision (HLE) provides a legacy-compatible instruction set interface for programmers to use transactional execution. HLE provides two new instruction prefix hints: XACQUIRE and XRELEASE.
With HLE, a programmer adds the XACQUIRE prefix to the front of the instruction that is used to acquire the lock that is protecting the critical section. The processor treats the prefix as a hint to elide the write associated with the lock-acquire operation. Even though the lock acquire has an associated write operation to the lock, the processor does not add the address of the lock to the transactional region's write-set, nor does it issue any write requests to the lock. Instead, the address of the lock is added to the read-set. The logical processor enters transactional execution. If the lock was available before the XACQUIRE prefixed instruction, then all other processors will continue to see the lock as available afterwards. Since the transactionally executing logical processor neither added the address of the lock to its write-set nor performed externally visible write operations to the lock, other logical processors can read the lock without causing a data conflict. This allows other logical processors to also enter and concurrently execute the critical section protected by the lock. The processor automatically detects any data conflicts that occur during the transactional execution and will perform a transactional abort if necessary.
Even though the eliding processor did not perform any external write operations to the lock, the hardware ensures program order of operations on the lock. If the eliding processor itself reads the value of the lock in the critical section, it will appear as if the processor had acquired the lock, i.e., the read will return the non-elided value. This behavior allows an HLE execution to be functionally equivalent to an execution without the HLE prefixes.
An XRELEASE prefix can be added in front of an instruction that is used to release the lock protecting a critical section. Releasing the lock involves a write to the lock. If the instruction is restoring the value of the lock to the value the lock had prior to the XACQUIRE prefixed lock-acquire operation on the same lock, then the processor elides the external write request associated with the release of the lock and does not add the address of the lock to the write-set. The processor then attempts to commit the transactional execution.
With HLE, if multiple threads execute critical sections protected by the same lock, but they do not perform any conflicting operations on each other's data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware recognizes this, elides the lock, and executes the critical sections on the two threads without requiring any communication through the lock if such communication was dynamically unnecessary.
If the processor is unable to execute the region transactionally, then the processor will execute the region non-transactionally and without elision. HLE-enabled software has the same forward progress guarantees as the underlying non-HLE lock-based execution. For successful HLE execution, the lock and the critical section code must follow certain guidelines. These guidelines only affect performance; failure to follow these guidelines will not result in a functional failure. Hardware without HLE support will ignore the XACQUIRE and XRELEASE prefix hints and will not perform any elision, since these prefixes correspond to the REPNE/REPE IA-32 prefixes, which are ignored on the instructions where XACQUIRE and XRELEASE are valid. Importantly, HLE is compatible with the existing lock-based programming model. Improper use of hints will not cause functional bugs, though it may expose latent bugs already in the code.
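As an illustration of the HLE programming model just described, the following sketch shows how a simple spinlock might be annotated with the XACQUIRE/XRELEASE hints through GCC's atomic built-ins; it assumes GCC on x86 with the -mhle option, and the lock layout and function names are illustrative rather than taken from this disclosure.

```c
/* Minimal HLE spinlock sketch (assumes GCC on x86 compiled with -mhle). */
#include <immintrin.h>

static volatile int lock_word = 0;

static void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the write to the lock is elided and the
     * lock address is added to the transaction's read-set. */
    while (__atomic_exchange_n(&lock_word, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();                 /* spin until the lock appears free */
}

static void hle_unlock(void)
{
    /* XRELEASE-prefixed store: restores the original value of the lock, so
     * the write is elided and the processor attempts to commit. */
    __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

If the transactional execution aborts, the hardware re-executes the same code non-transactionally and actually acquires the lock, which is why no separate fallback path is needed in this sketch.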
Restricted Transactional Memory (RTM) provides a flexible software interface for transactional execution. RTM provides three new instructions (XBEGIN, XEND, and XABORT) for programmers to start, commit, and abort a transactional execution.
The programmer uses the XBEGIN instruction to specify the start of a transactional code region and the XEND instruction to specify the end of the transactional code region. If the RTM region could not be successfully executed transactionally, the XBEGIN instruction takes an operand that provides a relative offset to the fallback instruction address.
A processor may abort RTM transactional execution for many reasons. In many instances, the hardware automatically detects transactional abort conditions and restarts execution from the fallback instruction address, with the architectural state corresponding to that present at the start of the XBEGIN instruction and the EAX register updated to describe the abort status.
The XABORT instruction allows programmers to abort the execution of an RTM region explicitly. The XABORT instruction takes an 8-bit immediate argument that is loaded into the EAX register and will thus be available to software following an RTM abort. RTM instructions do not have any data memory location associated with them. While the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally, most transactions that follow the recommended guidelines are expected to successfully commit. However, programmers must always provide an alternative code sequence in the fallback path to guarantee forward progress. This may be as simple as acquiring a lock and executing the specified code region non-transactionally. Further, a transaction that always aborts on a given implementation may complete transactionally on a future implementation. Therefore, programmers must ensure that the code paths for the transactional region and the alternative code sequence are functionally tested.
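A minimal sketch of an RTM region with the lock-based fallback path required for forward progress follows; it assumes a compiler providing the RTM intrinsics in <immintrin.h> (compiled with -mrtm), and fallback_lock(), fallback_unlock(), and critical_section() are hypothetical placeholders, not part of this disclosure.

```c
/* Sketch of an RTM region with a lock-based fallback path (assumes -mrtm). */
#include <immintrin.h>

extern void fallback_lock(void);        /* hypothetical placeholder */
extern void fallback_unlock(void);      /* hypothetical placeholder */
extern void critical_section(void);     /* hypothetical placeholder */

void update_shared_data(void)
{
    unsigned status = _xbegin();        /* XBEGIN: start transactional execution */
    if (status == _XBEGIN_STARTED) {
        critical_section();             /* executed transactionally */
        _xend();                        /* XEND: attempt the atomic commit */
    } else {
        /* Transactional abort: 'status' holds the EAX abort status. Guarantee
         * forward progress by acquiring the lock and re-executing. */
        fallback_lock();
        critical_section();
        fallback_unlock();
    }
}
```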
Detection of HLE support
A processor supports HLE execution if CPUID.07H.EBX.HLE [bit 4] = 1. However, an application can use the HLE prefixes (XACQUIRE and XRELEASE) without checking whether the processor supports HLE. Processors without HLE support ignore these prefixes and will execute the code without entering transactional execution.
Detection of RTM support
A processor supports RTM execution if CPUID.07H.EBX.RTM [bit 11] = 1. An application must check whether the processor supports RTM before it uses the RTM instructions (XBEGIN, XEND, XABORT). These instructions will generate a #UD exception when used on a processor that does not support RTM.
Detection of the XTEST instruction
A processor supports the XTEST instruction if it supports either HLE or RTM. An application must check either of these feature flags before using the XTEST instruction. This instruction will generate a #UD exception when used on a processor that does not support either HLE or RTM.
Querying transactional execution status
The XTEST instruction can be used to determine the transactional status of a transactional region specified by HLE or RTM. Note that, while the HLE prefixes are ignored on processors that do not support HLE, the XTEST instruction will generate a #UD exception when used on processors that do not support either HLE or RTM.
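A short sketch of such a query using the _xtest() intrinsic follows (it assumes <immintrin.h> on a processor supporting HLE or RTM; log_slow_path() is a hypothetical placeholder).

```c
/* Sketch: querying transactional status with _xtest(). */
#include <immintrin.h>

extern void log_slow_path(void);        /* hypothetical placeholder */

void note_execution_mode(void)
{
    if (_xtest()) {
        /* Currently inside an HLE or RTM transactional region: avoid
         * operations that would force an abort (e.g., system calls, I/O). */
    } else {
        log_slow_path();                /* executing non-transactionally */
    }
}
```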
Requirements on HLE locks
For HLE execution to successfully commit transactionally, the lock must satisfy certain properties and access to the lock must follow certain guidelines.
An XRELEASE prefixed instruction must restore the value of the elided lock to the value it had before the lock acquisition. This allows hardware to safely elide locks by not adding them to the write-set. The data size and data address of the lock release (XRELEASE prefixed) instruction must match that of the lock acquire (XACQUIRE prefixed) instruction, and the lock must not cross a cache line boundary.
Software should not write to the elided lock inside a transactional HLE region with any instruction other than an XRELEASE prefixed instruction; otherwise such a write may cause a transactional abort. In addition, recursive locks (where a thread acquires the same lock multiple times without first releasing the lock) may also cause a transactional abort. Note that software can observe the result of the elided lock acquire inside the critical section. Such a read operation will return the value of the write to the lock.
The processor automatically detects violations of these guidelines and safely transitions to a non-transactional execution without elision. Since Intel TSX detects conflicts at the granularity of a cache line, writes to data co-located on the same cache line as the elided lock may be detected as data conflicts by other logical processors eliding the same lock.
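Because conflicts are tracked at cache-line granularity, a common precaution consistent with the guideline above is to keep the lock word on its own cache line; a minimal C sketch under the assumption of 64-byte cache lines follows (the layout is illustrative, not prescribed by this disclosure).

```c
/* Sketch: an elidable lock kept on its own 64-byte cache line so that writes
 * to neighboring data are not reported as conflicts on the elided lock. */
#include <stdint.h>

struct elidable_lock {
    _Alignas(64) volatile uint32_t word;   /* the lock itself */
    char pad[64 - sizeof(uint32_t)];       /* keep unrelated data off this line */
};
```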
Transactional nesting
Both HLE and RTM support nested transactional regions. However, a transactional abort restores state to the operation that started transactional execution: either the outermost XACQUIRE prefixed HLE-eligible instruction or the outermost XBEGIN instruction. The processor treats all nested transactions as one transaction.
HLE nesting and elision
Programmers can nest HLE regions up to an implementation-specific depth of MAX_HLE_NEST_COUNT. Each logical processor tracks the nesting count internally, but this count is not available to software. An XACQUIRE prefixed HLE-eligible instruction increments the nesting count, and an XRELEASE prefixed HLE-eligible instruction decrements it. The logical processor enters transactional execution when the nesting count goes from zero to one. The logical processor attempts to commit only when the nesting count becomes zero. A transactional abort may occur if the nesting count exceeds MAX_HLE_NEST_COUNT.
In addition to supporting nested HLE regions, the processor can also elide multiple nested locks. The processor tracks a lock for elision beginning with the XACQUIRE prefixed HLE-eligible instruction for that lock and ending with the XRELEASE prefixed HLE-eligible instruction for that same lock. The processor can, at any one time, track up to a MAX_HLE_ELIDED_LOCKS number of locks. For example, if an implementation supports a MAX_HLE_ELIDED_LOCKS value of two, and if the programmer nests three HLE-identified critical sections (by performing XACQUIRE prefixed HLE-eligible instructions on three distinct locks without performing an intervening XRELEASE prefixed HLE-eligible instruction on any one of the locks), then the first two locks will be elided, but the third will not be elided (and will be added to the transaction's write-set). However, the execution will still continue transactionally. Once an XRELEASE for one of the two elided locks is encountered, a subsequent lock acquired through an XACQUIRE prefixed HLE-eligible instruction will be elided.
The processor attempts to commit the HLE execution when all elided XACQUIRE and XRELEASE pairs have been matched, the nesting count goes to zero, and the locks have satisfied the requirements. If execution cannot commit atomically, then execution transitions to a non-transactional execution without elision, as if the first instruction did not have an XACQUIRE prefix.
RTM nesting
Programmers can nest RTM regions up to an implementation-specific MAX_RTM_NEST_COUNT. The logical processor tracks the nesting count internally, but this count is not available to software. An XBEGIN instruction increments the nesting count, and an XEND instruction decrements it. The logical processor attempts to commit only if the nesting count becomes zero. A transactional abort occurs if the nesting count exceeds MAX_RTM_NEST_COUNT.
Nested HLE and RTM
HLE and RTM provide two alternative software interfaces to a common transactional execution capability. Transactional processing behavior is implementation-specific when HLE and RTM are nested together, e.g., HLE inside RTM or RTM inside HLE. However, in all cases, the implementation will maintain HLE and RTM semantics. An implementation may choose to ignore HLE hints when used inside RTM regions, and may cause a transactional abort when RTM instructions are used inside HLE regions. In the latter case, the transition from transactional to non-transactional execution occurs seamlessly, since the processor will re-execute the HLE region without actually doing elision and then execute the RTM instructions.
Abort status definition
RTM uses the EAX register to communicate abort status to software. Following an RTM abort, the EAX register has the following definition.
Table 1
The EAX abort status for RTM only provides causes for aborts. It does not by itself encode whether an abort or commit occurred for the RTM region. The value of EAX can be zero following an RTM abort. For example, a CPUID instruction used inside an RTM region causes a transactional abort and may not satisfy the requirements for setting any of the EAX bits. This may result in an EAX value of zero.
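For illustration, the sketch below interprets the abort status returned by _xbegin() using the standard _XABORT_* macros defined by the RTM intrinsics in <immintrin.h>; the retry policy shown is an assumption made for illustration and is not prescribed by this disclosure.

```c
/* Sketch: interpreting the EAX abort status returned by _xbegin(). */
#include <immintrin.h>
#include <stdbool.h>

static bool worth_retrying(unsigned status)
{
    if (status & _XABORT_EXPLICIT)      /* software issued XABORT */
        return false;
    if (status & _XABORT_CAPACITY)      /* read/write set exceeded capacity */
        return false;
    /* _XABORT_RETRY indicates the hardware suggests the transaction may
     * succeed on a retry, e.g., after a transient conflict. */
    return (status & (_XABORT_RETRY | _XABORT_CONFLICT)) != 0;
}
```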
RTM memory ordering
A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction.
The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and are never made visible to any other logical processor.
RTM-enabled debugger support
By default, any debug exception inside an RTM region will cause a transactional abort and will redirect control flow to the fallback instruction address, with the architectural state recovered and bit 4 in EAX set. However, to allow software debuggers to intercept execution on debug exceptions, the RTM architecture provides additional capability.
If bit 11 of DR7 and bit 15 of the IA32_DEBUGCTL_MSR are both 1, any RTM abort due to a debug exception (#DB) or breakpoint exception (#BP) causes execution to roll back and restart from the XBEGIN instruction instead of the fallback address. In this scenario, the EAX register will also be restored back to the point of the XBEGIN instruction.
Programming considerations
Typical programmer-identified regions are expected to execute transactionally and to commit successfully. However, Intel TSX does not provide any such guarantee. A transactional execution may abort for many reasons. To take full advantage of the transactional capabilities, programmers should follow certain guidelines to increase the probability of their transactional execution committing successfully.
This section discusses various events that may cause transactional aborts. The architecture ensures that updates performed within a transaction that subsequently aborts will never become visible. Only committed transactional executions initiate an update to the architectural state. Transactional aborts never cause functional failures and only affect performance.
Instruction-based considerations
Programmers can use any instruction safely inside a transaction (HLE or RTM) and can use transactions at any privilege level. However, some instructions will always abort the transactional execution and cause execution to seamlessly and safely transition to a non-transactional path.
Intel TSX allows most common instructions to be used inside transactions without causing aborts. The following operations inside a transaction do not typically cause an abort:
Operations on the instruction pointer register, general-purpose registers (GPRs), and the status flags (CF, OF, SF, PF, AF, and ZF); and
Operations on XMM and YMM registers and the MXCSR register.
However, programmers must be careful when intermixing SSE and AVX operations inside a transactional region. Intermixing SSE instructions accessing XMM registers and AVX instructions accessing YMM registers may cause transactions to abort. Programmers may use REP/REPNE prefixed string operations inside transactions. However, long strings may cause aborts. Further, the use of CLD and STD instructions may cause aborts if they change the value of the DF flag. However, if DF is 1, the STD instruction will not cause an abort. Similarly, if DF is 0, the CLD instruction will not cause an abort.
Instructions not enumerated here as causing aborts when used inside a transaction will typically not cause a transaction to abort (examples include but are not limited to MFENCE, LFENCE, SFENCE, RDTSC, RDTSCP, etc.).
The following instructions will abort transactional execution on any implementation:
• XABORT
• CPUID
• PAUSE
In addition, in some implementations the following instructions may always cause transactional aborts. These instructions are not expected to be commonly used inside typical transactional regions. However, programmers must not rely on these instructions to force a transactional abort, since whether they cause transactional aborts is implementation dependent.
Operations on the X87 and MMX architectural state. This includes all MMX and X87 instructions, including the FXRSTOR and FXSAVE instructions.
Updates to non-status portions of EFLAGS: CLI, STI, POPFD, POPFQ, CLTS.
Instructions that update segment registers, debug registers, and/or control registers: MOV to DS/ES/FS/GS/SS, POP DS/ES/FS/GS/SS, LDS, LES, LFS, LGS, LSS, SWAPGS, WRFSBASE, WRGSBASE, LGDT, SGDT, LIDT, SIDT, LLDT, SLDT, LTR, STR, Far CALL, Far JMP, Far RET, IRET, MOV to DRx, MOV to CR0/CR2/CR3/CR4/CR8, and LMSW.
Ring transitions: SYSENTER, SYSCALL, SYSEXIT, and SYSRET.
TLB and cacheability control: CLFLUSH, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint (MOVNTDQA, MOVNTDQ, MOVNTI, MOVNTPD, MOVNTPS, and MOVNTQ).
Processor state save: XSAVE, XSAVEOPT, and XRSTOR.
Interrupts: INTn, INTO.
IO: IN, INS, REP INS, OUT, OUTS, REP OUTS, and their variants.
VMX: VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, and INVVPID.
SMX: GETSEC.
UD2, RSM, RDMSR, WRMSR, HLT, MONITOR, MWAIT, XSETBV, VZEROUPPER, MASKMOVQ, and V/MASKMOVDQU.
Runtime considerations
In addition to the instruction-based considerations, runtime events may cause transactional execution to abort. These may be due to data access patterns or micro-architectural implementation features. The following list is not a comprehensive discussion of all abort causes.
Any fault or trap in a transaction that must be exposed to software will be suppressed. Transactional execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never occurred. If an exception is not masked, the unmasked exception will result in a transactional abort and the state will appear as if the exception had never occurred.
Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XF, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause the execution not to commit transactionally and to require a non-transactional execution. These events are suppressed as if they had never occurred. With HLE, since the non-transactional code path is identical to the transactional code path, these events will typically re-appear when the instruction that caused the exception is re-executed non-transactionally, causing the associated synchronous events to be delivered appropriately in the non-transactional execution. Asynchronous events (NMI, SMI, INTR, IPI, PMI, etc.) that occur during transactional execution may cause the transactional execution to abort and transition to a non-transactional execution. The asynchronous events will be pended and handled after the transactional abort is processed.
Transactions only support write-back cacheable memory type operations. A transaction may always abort if it includes operations on any other memory type. This includes instruction fetches to the UC memory type.
Memory accesses within a transactional region may require the processor to set the Accessed and Dirty flags of the referenced page table entry. The behavior of how the processor handles this is implementation specific. Some implementations may allow the updates to these flags to become externally visible even if the transactional region subsequently aborts. Some Intel TSX implementations may choose to abort the transactional execution if these flags need to be updated. Further, a processor's page-table walk may generate accesses to its own transactionally written but uncommitted state. Some Intel TSX implementations may choose to abort the execution of the transactional region in such situations. Regardless, the architecture ensures that, if the transactional region aborts, the transactionally written state will not be made architecturally visible through the behavior of structures such as TLBs.
Executing self-modifying code transactionally may also cause transactional aborts. Programmers must continue to follow the Intel recommended guidelines for writing self-modifying and cross-modifying code even when employing HLE and RTM. While an implementation of RTM and HLE will typically provide sufficient resources for executing common transactional regions, implementation constraints and excessive sizes of transactional regions may cause a transactional execution to abort and transition to a non-transactional execution. The architecture provides no guarantee of the amount of resources available to perform transactional execution and does not guarantee that a transactional execution will ever succeed.
Conflicting requests to a cache line accessed within a transactional region may prevent the transaction from executing successfully. For example, if logical processor P0 reads line A in a transactional region and another logical processor P1 writes line A (either inside or outside a transactional region), then logical processor P0 may abort if logical processor P1's write interferes with processor P0's ability to execute transactionally.
Similarly, if P0 writes line A in a transactional region and P1 reads or writes line A (either inside or outside a transactional region), then P0 may abort if P1's access to line A interferes with P0's ability to execute transactionally. In addition, other coherence traffic may at times appear as conflicting requests and may cause aborts. While these false conflicts may happen, they are expected to be uncommon. The conflict resolution policy for determining whether P0 or P1 aborts in the above scenarios is implementation specific.
Generic transaction execution embodiments:
According to "ARCHITECTURES FOR TRANSACTIONAL MEMORY", a dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, by Austen McDonald, June 2009, incorporated by reference herein in its entirety, fundamentally there are three mechanisms needed to implement an atomic and isolated transactional region: versioning, conflict detection, and contention management.
To make a transactional code region appear atomic, all the modifications performed by that transactional code region must be stored and kept isolated from other transactions until commit time. The system does this by implementing a versioning policy. Two versioning paradigms exist: eager and lazy. An eager versioning system stores newly generated transactional values in place and stores previous memory values on the side, in what is called an undo log. A lazy versioning system stores new values temporarily in what is called a write buffer, copying them to memory only on commit. In either system, the cache is used to optimize storage of new versions.
To ensure that transactions appear to be performed atomically, conflicts must be detected and resolved. The two systems, i.e., the eager and lazy versioning systems, detect conflicts by implementing a conflict detection policy, either optimistic or pessimistic. An optimistic system executes transactions in parallel, checking for conflicts only when a transaction commits. A pessimistic system checks for conflicts at each load and store. Similar to versioning, conflict detection also uses the cache, marking each line as either part of the read-set, part of the write-set, or both. The two systems resolve conflicts by implementing a contention management policy. Many contention management policies exist; some are more appropriate for optimistic conflict detection and some are more appropriate for pessimistic conflict detection. Some example policies are described below.
Since each transactional memory (TM) system needs both versioning and conflict detection, these options give rise to four distinct TM designs: Eager-Pessimistic (EP), Eager-Optimistic (EO), Lazy-Pessimistic (LP), and Lazy-Optimistic (LO). Table 2 briefly describes all four distinct TM designs.
Fig. 1 and Fig. 2 depict an example of a multicore TM environment. Fig. 1 shows many TM-enabled CPUs (CPU1 114a, CPU2 114b, etc.) on one die 100, connected with an interconnect 122, under management of an interconnect control 120a, 120b. Each CPU 114a, 114b (also known as a processor) may have a split cache consisting of an instruction cache 116a, 116b for caching instructions from memory to be executed and a data cache 118a, 118b with TM support for caching data (operands) of memory locations to be operated on by the CPU 114a, 114b. In an implementation, the caches of multiple dies 100 are interconnected to support cache coherency between the caches of the multiple dies 100. In an implementation, a single cache, rather than a split cache, is employed to hold both instructions and data. In implementations, the CPU caches are one level of caching in a hierarchical cache structure. For example, each die 100 may employ a shared cache 124 to be shared among all the CPUs 114a, 114b on the die 100. In another implementation, each die 100 may have access to a shared cache 124 shared among all the processors of all the dies 100.
Fig. 2 shows the details of an example transactional CPU 114, including additions to support TM. The transactional CPU (processor) 114 may include hardware for supporting register checkpoints 126 and special TM registers 128. The transactional CPU cache may have the MESI bits 130, tags 140, and data 142 of a conventional cache, and also, for example, R bits 132 showing that a line has been read by the CPU 114 while executing a transaction and W bits 138 showing that a line has been written by the CPU 114 while executing a transaction.
A key detail for programmers in any TM system is how non-transactional accesses interact with transactions. By design, transactional accesses are screened from each other using the mechanisms above. However, the interaction between a regular, non-transactional load and a transaction containing a new value for that address must still be considered. In addition, the interaction between a non-transactional store and a transaction that has read that address must also be explored. These are issues of the database concept of isolation.
A TM system is said to implement strong isolation, sometimes called strong atomicity, when every non-transactional load and store acts like an atomic transaction. Therefore, non-transactional loads cannot see uncommitted data, and non-transactional stores cause atomicity violations in any transactions that have read that address. A system where this is not the case is said to implement weak isolation, sometimes called weak atomicity.
Strong isolation is often more desirable than weak isolation because of the relative ease of conceptualizing and implementing strong isolation. Additionally, if a programmer has forgotten to surround some shared memory references with transactions, causing bugs, then with strong isolation the programmer will often detect that oversight using a simple debug interface, because the programmer will see a non-transactional region causing atomicity violations. In addition, programs written in one model may work differently on another model.
Further, strong isolation is often easier to support in hardware TM than weak isolation. With strong isolation, since the coherence protocol already manages load and store communication between processors, transactions can detect non-transactional loads and stores and act appropriately. To implement strong isolation in software transactional memory (TM), non-transactional code must be modified to include read and write barriers, potentially degrading performance. Although great effort has been expended to remove many unneeded barriers, such techniques are often complex and performance is typically far lower than that of HTMs.
Table 2
Table 2 illustrates the fundamental design space of transactional memory (versioning and conflict detection).
Eager-Pessimistic (EP)
A first TM design, described below, is known as Eager-Pessimistic. An EP system stores its write-set "in place" (hence the name "eager") and, to support rollback, stores the old values of overwritten lines in an "undo log". Processors use the W 138 and R 132 cache bits to track read- and write-sets and to detect conflicts when receiving snooped load requests. Perhaps the most notable examples of EP systems in known literature are LogTM and UTM.
Beginning a transaction in an EP system is much like beginning a transaction in other systems: tm_begin() takes a register checkpoint and initializes any status registers. An EP system also requires initializing the undo log, the details of which depend on the log format but often involve initializing a log base pointer to a region of pre-allocated, thread-private memory and clearing a log bounds register.
Versioning: In EP, because of the way eager versioning is designed to function, the MESI 130 state transitions (corresponding to the Modified, Exclusive, Shared, and Invalid code states of a cache line) are left mostly unchanged. Outside of a transaction, the MESI 130 state transitions are left completely unchanged. When reading a line inside a transaction, the standard coherence transitions apply (S (Shared) → S, I (Invalid) → S, or I → E (Exclusive)), issuing a load miss as needed, but the R 132 bit is also set. Likewise, writing a line applies the standard transitions (S → M, E → I, I → M), issuing a miss as needed, but also sets the W 138 (Written) bit. The first time a line is written, the old version of the entire line is loaded and then written to the undo log to preserve it in case the current transaction aborts. The newly written data is then stored "in place", over the old data.
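For illustration, a minimal software sketch of the eager (in-place) versioning step just described follows: before a transactional store, the old value is appended to a per-thread undo log, and on abort the log is unrolled. The data structures and function names are illustrative and do not correspond to any specific hardware implementation.

```c
/* Illustrative sketch of eager versioning with an undo log. */
#include <stddef.h>
#include <stdint.h>

#define UNDO_LOG_CAPACITY 1024          /* illustrative fixed capacity */

struct undo_entry { uint64_t *addr; uint64_t old_value; };

struct undo_log {
    struct undo_entry entries[UNDO_LOG_CAPACITY];
    size_t            count;
};

static void tm_store(struct undo_log *log, uint64_t *addr, uint64_t value)
{
    /* Preserve the old version so the transaction can roll back on abort
     * (a real system would abort or spill gracefully when the log fills). */
    log->entries[log->count].addr      = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
    *addr = value;                      /* eager: new value is stored in place */
}

static void tm_abort(struct undo_log *log)
{
    /* Unroll the log in reverse order to restore pre-transaction values. */
    while (log->count > 0) {
        struct undo_entry *e = &log->entries[--log->count];
        *e->addr = e->old_value;
    }
}
```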
Conflict detection: Pessimistic conflict detection uses coherence messages exchanged on misses, or upgrades, to look for conflicts between transactions. When a read miss occurs within a transaction, other processors receive a load request; but if they do not have the needed line, they ignore the request. If the other processors have the needed line non-speculatively, or have the line R 132 (Read), they downgrade that line to S, and in certain cases issue a cache-to-cache transfer if they have the line in MESI's M or E state. However, if the cache has the line W 138, then a conflict is detected between the two transactions and additional action(s) must be taken.
Similarly, when a transaction seeks to upgrade a line from shared to modified (on a first write), the transaction issues an exclusive load request, which is also used to detect conflicts. If a receiving cache has the line non-speculatively, then the line is invalidated and, in certain cases, a cache-to-cache transfer (M or E state) is issued. But if the line is R 132 or W 138, a conflict is detected.
Validation: Because conflict detection is performed on every load, a transaction always has exclusive access to its own write-set. Therefore, validation does not require any additional work.
Commit: Since eager versioning stores the new version of data items in place, the commit process simply clears the W 138 and R 132 bits and discards the undo log.
Abort: When a transaction rolls back, the original version of each cache line in the undo log must be restored, a process called "unrolling" or "applying" the log. This is done during tm_discard() and must be atomic with regard to other transactions. Specifically, the write-set must still be used to detect conflicts: this transaction has the only correct version of lines in its undo log, and requesting transactions must wait for the correct version to be restored from that log. The log can be applied using a hardware state machine or a software abort handler.
Eager-Pessimistic has the following characteristics: commit is simple and, since it is in place, very fast. Similarly, validation is a no-op. Pessimistic conflict detection detects conflicts early, thereby reducing the number of "doomed" transactions. For example, if two transactions are involved in a read-after-write dependency, that dependency is detected immediately in pessimistic conflict detection. However, in optimistic conflict detection, such conflicts are not detected until the writer commits.
Eager-Pessimistic also has the following characteristics: as described above, the first time a cache line is written, the old value must be written to the log, incurring extra cache accesses. Aborts are expensive, as they require undoing the log. For each cache line in the log, a load must be issued, perhaps going as far as main memory, before continuing to the next line. Pessimistic conflict detection also prevents certain serializable schedules from existing.
Additionally, because conflicts are handled as they occur, there is a potential for livelock, and careful contention management mechanisms must be employed to guarantee forward progress.
Lazy-optimistic (LO)
Another kind of popular TM design is lazy-optimistic (LO), and it is at its write collection of " write buffer " or " redoing log " middle storage and conflicts (still using R and W position) in submission time detection.
Version Control: as in EP system, the MESI protocol of LO design is forced to implement outside affairs. Once be in affairs, read line just causes standard MESI transition, but also sets R position 132. Similarly, write line sets the W position 138 of line, but the MESI transition processing LO design is different from EP design. First, by lazy Version Control, the redaction of write data is stored in cache hierarchy, until submitting to, and other affairs are able to access that legacy version available in memorizer or other buffer memory. So that legacy version can be used, it is necessary to evict dirty line (M line) from when first passing through affairs and reading. Secondly, due to optimistic collision detection feature, therefore need not upgrade and miss: if affairs have the line in S state, then it can be simply written line and this line is upgraded to M state, and these changes are not transmitted with other affairs, because collision detection completes at submission time.
Conflict detection and validation: To validate a transaction and detect conflicts, LO communicates the addresses of speculatively modified lines to other transactions only when it is preparing to commit. On validation, the processor sends one potentially large network packet containing all the addresses in the write set. Data is not sent; it is left in the cache of the committer and marked dirty (M). To build this packet without searching the cache for lines marked W, a simple bit vector called a "store buffer" is used, with one bit per cache line to track these speculatively modified lines. Other transactions use this address packet to detect conflicts: if an address is found in the cache and the R 132 and/or W 138 bit is set, a conflict is initiated. If the line is found but neither R 132 nor W 138 is set, the line is simply invalidated, which is similar to processing an exclusive load.
To support transaction atomicity, these address packets must be handled atomically, i.e., no two address packets may exist at the same time with the same addresses. In an LO system, this can be achieved by simply acquiring a global commit token before sending the address packet. Alternatively, a two-phase commit scheme could be employed by first sending out the address packet, collecting responses, enforcing an ordering protocol (perhaps oldest transaction first), and committing once all responses are satisfactory.
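As a minimal illustration of the simpler of the two options, the following C sketch serializes commits through a single global commit token before publishing the write-set address packet; broadcast_write_set() and finish_local_commit() are placeholder stubs for the hardware actions and are not part of the patent text.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static atomic_flag commit_token = ATOMIC_FLAG_INIT;

static void broadcast_write_set(const uintptr_t *addrs, size_t n)
{
    (void)addrs; (void)n;   /* other caches would check their R/W bits here */
}

static void finish_local_commit(void)
{
    /* clear W and R bits and the store buffer; lines stay dirty (M) locally */
}

void lo_commit(const uintptr_t *write_set, size_t n)
{
    while (atomic_flag_test_and_set_explicit(&commit_token,
                                             memory_order_acquire))
        ;                                   /* one committer at a time */
    broadcast_write_set(write_set, n);
    finish_local_commit();
    atomic_flag_clear_explicit(&commit_token, memory_order_release);
}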
Commit: Once validation has occurred, commit needs no special treatment: simply clear the W 138 and R 132 bits and the store buffer. The transaction's writes are already marked dirty in the cache, and other caches' copies of these lines have been invalidated via the address packet. Other processors can then access the committed data through the regular coherence protocol.
Abort: Rollback is equally easy: because the write set is contained within the local caches, these lines can be invalidated, and then the W 138 and R 132 bits and the store buffer are cleared. The store buffer allows the W lines to be found for invalidation without searching the cache.
Lazy-Optimistic has the characteristics that: aborts are very fast, requiring no additional loads or stores and making only local changes. More serializable schedules can exist than are found in EP, which allows an LO system to speculate more aggressively that transactions are independent, which can yield higher performance. Finally, the late detection of conflicts can increase the likelihood of forward progress.
Lazy-Optimistic also has the characteristics that: validation takes global communication time proportional to the size of the write set. Doomed transactions can waste work, since conflicts are detected only at commit time.
Lazy-pessimistic (LP)
Lazy-Pessimistic (LP) represents a third TM design option, sitting somewhere between EP and LO: storing newly written lines in a write buffer but detecting conflicts on a per-access basis.
Versioning: Versioning is similar to but not identical with that of LO: reading a line sets its R bit 132, writing a line sets its W bit 138, and a store buffer is used to track W lines in the cache. Also, dirty (M) lines must be evicted when first written by a transaction, just as in LO. However, since conflict detection is pessimistic, load exclusives must be performed when upgrading a transactional line from I, S to M, which is unlike LO.
Conflict detection: LP's conflict detection operates the same as EP's, using coherence messages to look for conflicts between transactions.
Validation: Like in EP, pessimistic conflict detection ensures that at any point a running transaction has no conflicts with any other running transaction, so validation is a no-op.
Commit: Commit needs no special treatment: simply clear the W 138 and R 132 bits and the store buffer, as in LO.
Abort: Rollback is also like that of LO: simply invalidate the write set using the store buffer and clear the W 138 and R 132 bits and the store buffer.
Lazy-Pessimistic has the characteristics that: like LO, aborts are very fast. Like EP, the use of pessimistic conflict detection reduces the number of "doomed" transactions. Like EP, some serializable schedules are not allowed, and conflict detection must be performed on each cache miss.

Eager-Optimistic (EO)
The final combination of versioning and conflict detection is Eager-Optimistic (EO). EO may be a less-than-optimal choice for HTM systems: since new transactional versions are written in place, other transactions have no choice but to notice conflicts as they occur (i.e., as cache misses occur). But since EO waits until commit time to detect conflicts, those transactions become "zombies", continuing to execute and wasting resources, yet "doomed" to abort.
EO has proven useful in STMs and is implemented by Bartok-STM and McRT. A lazy-versioning STM needs to check its write buffer on each read to ensure that it is reading the most recent value. Since the write buffer is not a hardware structure, this is expensive, hence the preference for write-in-place eager versioning. Additionally, since checking for conflicts is also expensive in an STM, optimistic conflict detection offers the advantage of performing this operation in bulk.
Contention management
How a transaction rolls back once the system has decided to abort it has been described above; but, since a conflict involves two transactions, the topics of which transaction should abort, how that abort should be initiated, and when the aborted transaction should be retried need to be explored. These are topics addressed by contention management (CM), a key component of transactional memory. Described below are policies regarding how the system initiates aborts and the various established methods of managing which transaction should abort in a conflict.
Contention management policies
A contention management (CM) policy is a mechanism that determines which transaction involved in a conflict should abort and when the aborted transaction should be retried. For example, it is often the case that retrying an aborted transaction immediately does not lead to the best performance. Conversely, employing a backoff mechanism, which delays the retry of an aborted transaction, can yield better performance. STMs first grappled with finding the best contention management policies, and many of the policies outlined below were originally developed for STMs.
CM policies draw on a number of measures to make their decisions, including the ages of the transactions, the sizes of the read and write sets, the number of previous aborts, and so on. The possible combinations of measures for making such decisions are endless, but certain combinations are described below, roughly in order of increasing complexity.
To establish some nomenclature, first note that in a conflict there are two sides: the attacker and the defender. The attacker is the transaction requesting access to a shared memory location. In pessimistic conflict detection, the attacker is the transaction issuing the load or load exclusive. In optimistic conflict detection, the attacker is the transaction attempting to validate. The defender in both cases is the transaction receiving the attacker's request.
An Aggressive CM policy immediately and always retries either the attacker or the defender. In LO, Aggressive means that the attacker always wins, and so Aggressive is sometimes called committer wins. Such a policy was used in the earliest LO systems. In the case of EP, Aggressive can be either defender wins or attacker wins.
Restarting a conflicting transaction that will immediately experience another conflict is bound to waste work, namely the interconnect bandwidth spent refilling cache misses. A Polite CM policy employs exponential backoff (although linear backoff could also be used) before restarting conflicting transactions. To prevent starvation, that is, a situation where a process has no resources allocated to it by the scheduler, exponential backoff greatly increases the odds of transaction success after some n retries.
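A minimal C sketch of such a Polite retry loop appears below, assuming POSIX nanosleep for the delay; try_transaction() is an assumed stand-in for one attempt at the transactional critical section (here a random stub so the sketch is self-contained) and is not part of the patent text.

#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

static bool try_transaction(void) { return rand() % 4 != 0; }  /* demo stub */

static void backoff_delay(unsigned attempt)
{
    /* Wait roughly 2^attempt microseconds, capped to keep delays bounded. */
    unsigned long us = 1UL << (attempt < 16 ? attempt : 16);
    struct timespec ts = { .tv_sec  = (time_t)(us / 1000000UL),
                           .tv_nsec = (long)((us % 1000000UL) * 1000UL) };
    nanosleep(&ts, NULL);
}

bool run_with_polite_cm(unsigned max_retries)
{
    for (unsigned attempt = 0; attempt <= max_retries; attempt++) {
        if (try_transaction())
            return true;         /* committed */
        backoff_delay(attempt);  /* back off longer after each conflict */
    }
    return false;                /* give up, e.g., fall back to a lock */
}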
Another approach to conflict resolution is to randomly abort the attacker or the defender (a policy called Randomized). Such a policy may be combined with a randomized backoff scheme to avoid unneeded contention.
However, making random choices when selecting a transaction to abort can result in aborting transactions that have completed "a lot of work", which can waste resources. To avoid such waste, the amount of work completed by a transaction can be taken into account when determining which transaction to abort. One measure of work can be a transaction's age. Other methods include Oldest, Bulk TM, Size Matters, Karma, and Polka. Oldest is a simple timestamp method that aborts the younger transaction in a conflict. Bulk TM uses this scheme. Size Matters is like Oldest, but instead of transaction age, the number of read/written words is used as the priority, reverting to Oldest after a fixed number of aborts. Karma is similar, using the size of the write set as the priority. Rollback then proceeds after backing off for a fixed amount of time. Aborted transactions keep their priorities after being aborted (hence the name Karma). Polka works like Karma, but instead of backing off a predefined amount of time, it backs off exponentially more each time.
Since aborting wastes work, it is logical to argue that stalling the attacker until the defender has finished its transaction would lead to better performance. Unfortunately, such a simple scheme easily leads to deadlock.
Deadlock avoidance techniques can be used to solve this problem. Greedy uses two rules to avoid deadlock. The first rule is: if a first transaction, T1, has lower priority than a second transaction, T0, or if T1 is waiting for another transaction, then T1 aborts when conflicting with T0. The second rule is: if T1 has higher priority than T0 and is not waiting, then T0 waits until T1 commits, aborts, or starts waiting (in which case the first rule is applied). Greedy provides some guarantees about time bounds for executing a set of transactions. One EP design (LogTM) uses a CM policy similar to Greedy to achieve stalling with conservative deadlock avoidance.
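For illustration only, the two Greedy rules can be encoded as in the following C sketch; the priority encoding and the "waiting" flag are assumed bookkeeping maintained by a contention manager and are not defined by the patent text.

#include <stdbool.h>

struct txn {
    unsigned priority;   /* larger value = higher priority (assumed encoding) */
    bool     waiting;    /* currently stalled waiting for another transaction */
};

enum cm_decision { ABORT_T1, STALL_T0 };

/* Decide what happens when transaction T0 conflicts with transaction T1. */
enum cm_decision greedy_resolve(const struct txn *t0, const struct txn *t1)
{
    if (t1->priority < t0->priority || t1->waiting)
        return ABORT_T1;  /* rule 1: T1 aborts when conflicting with T0     */
    return STALL_T0;      /* rule 2: T0 waits until T1 commits, aborts, or
                             starts waiting (then rule 1 applies)           */
}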
Example MESI coherency rules provide for four possible states in which a cache line of a multiprocessor cache system may reside, M, E, S, and I, defined as follows:
Modified (M): The cache line is present only in the current cache and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive (E): The cache line is present only in the current cache but is clean; it matches main memory. It may be changed to the Shared state at any time in response to a read request. Alternatively, it may be changed to the Modified state when it is written.
Shared (S): Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid (I): Indicates that this cache line is invalid (unused).
TM coherency status indicators (R 132, W 138) may be provided for each cache line, in addition to, or encoded in, the MESI coherency bits. An R 132 indicator indicates that the current transaction has read from the data of the cache line, and a W 138 indicator indicates that the current transaction has written to the data of the cache line.
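By way of illustration only, a per-line directory entry combining the four MESI states above with the R 132 and W 138 indicators might be modeled in C as follows; field names are illustrative assumptions, not taken from the patent.

#include <stdbool.h>

enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

struct cache_line_state {
    enum mesi_state mesi;
    bool tx_read;    /* R 132: current transaction has read this line    */
    bool tx_write;   /* W 138: current transaction has written this line */
};

/* A remote exclusive request conflicts if the valid line is in the local
 * transaction's read set or write set. */
static bool conflicts_with_exclusive_request(const struct cache_line_state *l)
{
    return l->mesi != MESI_INVALID && (l->tx_read || l->tx_write);
}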
In another aspect of TM design, a system is designed using transactional store buffers. U.S. Patent No. 6,349,361, titled "Methods and Apparatus for Reordering and Renaming Memory References in a Multiprocessor Computer System", filed March 31, 2000 and incorporated by reference herein in its entirety, teaches a method for reordering and renaming memory references in a multiprocessor computer system having at least a first and a second processor. The first processor has a first private cache and a first buffer, and the second processor has a second private cache and a second buffer. The method includes the steps of, for each of a plurality of gated store requests received by the first processor to store a datum, exclusively acquiring a cache line that contains the datum by the first private cache, and storing the datum in the first buffer. Upon the first buffer receiving a load request from the first processor to load a particular datum, the particular datum is provided to the first processor from among the data stored in the first buffer based on an in-order sequence of load and store operations. Upon the first cache receiving a load request from the second cache for a given datum, an error condition is indicated, and a current state of at least one of the processors is reset to an earlier state when the load request for the given datum corresponds to the data stored in the first buffer.
The main implementation components of one such transactional memory facility are a transaction-backup register file for holding pre-transaction GR (general register) content, a cache directory to track the cache lines accessed during the transaction, a store cache to buffer stores until the transaction ends, and firmware routines to perform various complex functions. In this section a detailed implementation is described.
IBM zEnterprise EC12 enterprise server embodiment
The IBM zEnterprise EC12 enterprise server introduces transactional execution (TX) in transactional memory, and is described in part in the paper "Transactional Memory Architecture and Implementation for IBM System z", Proceedings pp. 25-36, presented at MICRO-45, 1-5 December 2012, Vancouver, British Columbia, Canada, available from IEEE/ACM Conference Publishing Services (CPS), which is incorporated by reference herein in its entirety.
Table 3 shows an example transaction. Transactions started with TBEGIN are not assured to ever successfully complete with TEND, since they can experience an aborting condition at every attempted execution, e.g., due to repeated conflicts with other CPUs. This requires that the program support a fallback path to perform the same operation non-transactionally, e.g., by using traditional locking schemes. This puts a significant burden on the programming and software verification teams, especially where the fallback path is not automatically generated by a reliable compiler.
Table 3
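Table 3 itself is not reproduced above; the following C sketch is only a hypothetical approximation of the retry-with-fallback pattern it illustrates. tx_begin(), tx_end(), and lock_is_free() are placeholder stubs; a real build might map them to the hardware (for example GCC's __builtin_tbegin/__builtin_tend on z Systems) and to a test of the actual lock word.

#include <stdbool.h>
#include <pthread.h>

static bool tx_begin(void)     { return false; } /* stub: pretend TX never starts */
static void tx_end(void)       { }
static bool lock_is_free(void) { return true;  }

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

void update_shared(void (*critical_section)(void))
{
    for (int attempt = 0; attempt < 8; attempt++) {
        if (tx_begin()) {
            if (lock_is_free()) {        /* check the fallback lock inside the TX */
                critical_section();
                tx_end();                /* commit */
                return;
            }
            tx_end();                    /* lock held: end this attempt */
        }
        /* abort or contention: retry, possibly after a delay */
    }
    pthread_mutex_lock(&fallback_lock);  /* non-transactional fallback path */
    critical_section();
    pthread_mutex_unlock(&fallback_lock);
}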
The requirement of providing a fallback path for aborted transactional execution (TX) transactions can be onerous. Many transactions operating on shared data structures are expected to be short, touch only a few distinct memory locations, and use simple instructions only. For those transactions, the IBM zEnterprise EC12 introduces the concept of constrained transactions; under normal conditions, the CPU 114 assures that constrained transactions eventually end successfully, albeit without giving a strict limit on the number of necessary retries. A constrained transaction starts with a TBEGINC instruction and ends with a regular TEND. Implementing a task as a constrained or non-constrained transaction typically results in very comparable performance, but constrained transactions simplify software development by removing the need for a fallback path. IBM's transactional execution architecture is further described in z/Architecture, Principles of Operation, Tenth Edition, SA22-7832-09, published by IBM in September 2012, incorporated by reference herein in its entirety.
A constrained transaction starts with the TBEGINC instruction. A transaction initiated with TBEGINC must follow a list of programming constraints; otherwise the program takes a non-filterable constraint-violation interruption. Exemplary constraints may include, but are not limited to: the transaction can execute a maximum of 32 instructions, and all instruction text must be within 256 consecutive bytes of memory; the transaction contains only forward-pointing relative branches (i.e., no loops or subroutine calls); the transaction can access a maximum of 4 aligned octowords (an octoword is 32 bytes) of memory; and the instruction set is restricted to exclude complex instructions such as decimal or floating-point operations. The constraints are chosen such that many common operations, such as doubly linked list insert/delete operations, can be performed, including the very powerful concept of atomic compare-and-swap targeting up to 4 aligned octowords. At the same time, the constraints were chosen conservatively so that future CPU implementations can assure transaction success without needing to adjust the constraints, since that would otherwise lead to software incompatibility.
TBEGINC mostly behaves like XBEGIN in TSX or TBEGIN on IBM's zEC12 servers, except that the floating-point register (FPR) control and the program interruption filtering fields do not exist and the controls are considered to be zero. On a transaction abort, the instruction address is set back directly to the TBEGINC instead of to the instruction after, reflecting the immediate retry and the absence of an abort path for constrained transactions.
Nested transactions are not allowed within constrained transactions, but if a TBEGINC occurs within a non-constrained transaction it is treated as opening a new non-constrained nesting level, just as TBEGIN would. This can occur, for example, if a non-constrained transaction calls a subroutine that uses a constrained transaction internally.
Since interruption filtering is implicitly off, all exceptions during a constrained transaction lead to an interruption into the operating system (OS). Eventual successful completion of the transaction relies on the capability of the OS to page in the at most 4 pages touched by any constrained transaction. The OS must also ensure time slices long enough to allow the transaction to complete.
Table 4
Table 4 shows the constrained-transactional implementation of the code in Table 3, assuming that the constrained transactions do not interact with other locking-based code. No lock testing is therefore shown, but it could be added if constrained transactions and lock-based code were mixed.
When failure occurs repeatedly, software emulation is performed using millicode as part of system firmware. Advantageously, constrained transactions have desirable properties because of the burden removed from programmers.
The IBM zEnterprise EC12 processor introduced the transactional execution facility. The processor can decode 3 instructions per clock cycle; simple instructions are dispatched as single micro-ops, and more complex instructions are cracked into multiple micro-ops 232b. The micro-ops (Uops 232b, shown in FIG. 3) are written into a unified issue queue 216, from which they can be issued out of order. Up to two fixed-point, one floating-point, two load/store, and two branch instructions can execute every cycle. A global completion table (GCT) 232 holds every micro-op and a transaction nesting depth (TND) 232a. The GCT 232 is written in order at decode time, tracks the execution status of each micro-op, and completes instructions when all micro-ops 232b of the oldest instruction group have successfully executed.
The level 1 (L1) data cache 240 (FIG. 3) is a 96 KB (kilobyte) 6-way associative cache with 256-byte cache lines and 4-cycle use latency, coupled to a private 1 MB (megabyte) 8-way associative level 2 (L2) data cache 268 (FIG. 3) with a 7-cycle use-latency penalty for L1 misses. The L1 cache 240 (FIG. 3) is the cache closest to the processor, and an Ln cache is a cache at the nth level of caching. Both the L1 240 (FIG. 3) and L2 268 (FIG. 3) caches are store-through. Six cores on each central processor (CP) chip share a 48 MB level 3 store-in cache, and six CP chips are connected to an off-chip 384 MB level 4 cache, packaged together on a glass-ceramic multi-chip module (MCM). Up to 4 multi-chip modules (MCMs) can be connected to form a coherent symmetric multiprocessor (SMP) system with up to 144 cores (not all cores are available for running customer workloads).
Coherency is managed with a variant of the MESI protocol. Cache lines can be owned read-only (shared) or exclusive; the L1 240 (FIG. 3) and L2 268 (FIG. 3) are store-through and therefore do not contain dirty lines. The L3 and L4 caches are store-in and track dirty states. Each cache is inclusive of all of its connected lower-level caches.
Coherency requests are called "cross interrogates" (XI) and are sent hierarchically from higher-level to lower-level caches, and between the L4s. When one core misses the L1 240 (FIG. 3) and L2 268 (FIG. 3) and requests the cache line from its local L3, the L3 checks whether it owns the line and, if necessary, sends an XI to the currently owning L2 268 (FIG. 3)/L1 240 (FIG. 3) under that L3 to ensure coherency before returning the cache line to the requestor. If the request also misses the L3, the L3 sends a request to the L4, which enforces coherency by sending XIs to all necessary L3s under that L4 and to the neighboring L4s. The L4 then responds to the requesting L3, which forwards the response to the L2 268 (FIG. 3)/L1 240 (FIG. 3).
Note that, due to the inclusivity rule of the cache hierarchy, cache lines are sometimes XI'ed from lower-level caches because of evictions on higher-level caches caused by associativity overflows from requests to other cache lines. These XIs can be called "LRU XIs", where LRU stands for least recently used.
Making reference to yet another type of XI request, demote-XIs transition cache ownership from the exclusive to the read-only state, and exclusive-XIs transition cache ownership from the exclusive to the invalid state. Demote-XIs and exclusive-XIs need a response back to the XI sender. The target cache can "accept" the XI, or send a "reject" response if it first needs to evict dirty data before accepting the XI. The L1 240 (FIG. 3)/L2 268 (FIG. 3) caches are store-through, but may reject demote-XIs and exclusive-XIs if they have stores in their store queues that need to be sent to the L3 before downgrading the exclusive state. A rejected XI is repeated by the sender. Read-only-XIs are sent to caches that own the line read-only; no response is needed for such XIs since they cannot be rejected. The details of the SMP protocol are similar to those described for the IBM z10 by P. Mak, C. Walters, and G. Strait in "IBM System z10 processor cache subsystem microarchitecture", IBM Journal of Research and Development, Vol. 53:1, 2009, which is incorporated by reference herein in its entirety.
Transactional instruction execution
FIG. 3 depicts example components of an example CPU. The instruction decode unit (IDU) 208 keeps track of the current transaction nesting depth (TND) 212. When the IDU 208 receives a TBEGIN instruction, the nesting depth is incremented, and it is conversely decremented on TEND instructions. The nesting depth is written into the GCT 232 for every dispatched instruction. When a TBEGIN or TEND is decoded on a speculative path that later gets flushed, the IDU's 208 nesting depth is refreshed from the youngest GCT 232 entry that is not flushed. The transactional state is also written into the issue queue 216 for consumption by the execution units, mostly by the load/store unit (LSU) 280. The TBEGIN instruction may specify a transaction diagnostic block (TDB) for recording status information, should the transaction abort before reaching a TEND instruction.
Similar to the nesting depth, the IDU 208/GCT 232 collaboratively track the access register/floating-point register (AR/FPR) modification masks through the transaction nest; the IDU 208 can place an abort request into the GCT 232 when an AR/FPR-modifying instruction is decoded and the modification mask blocks it. When the instruction becomes next to complete, completion is blocked and the transaction aborts. Other restricted instructions are handled similarly, including TBEGIN if decoded while in a constrained transaction, or if the maximum nesting depth is exceeded.
An outermost TBEGIN is cracked into multiple micro-ops depending on the GR-save-mask; each micro-op is executed by one of the two fixed-point units (FXUs) 202 to save a pair of GRs 228 into a special transaction-backup register file 224, which is used to later restore the GR 228 content if the transaction aborts. Also, if a TDB is specified, the TBEGIN spawns micro-ops 226b to perform an accessibility test for the TDB; the address is saved in a special-purpose register for later use in the abort case. At the decoding of an outermost TBEGIN, the instruction address and the instruction text of the TBEGIN are also saved in special-purpose registers for potential abort processing later on.
TEND and NTSTG are single micro-op 232b instructions; NTSTG (non-transactional store) is handled like a normal store, except that it is marked as non-transactional in the issue queue so that the LSU 280 can treat it appropriately. TEND is a no-op at execution time; the ending of the transaction is performed when TEND completes.
As mentioned above, instructions that are within a transaction are marked as such in the issue queue 216, but otherwise execute mostly unchanged; the LSU 280 performs isolation tracking as described in the next section.
Since decoding is in order, and since the IDU 208 keeps track of the current transactional state and writes it into the issue queue 216 along with every instruction from the transaction, execution of TBEGIN, TEND, and instructions before, within, and after the transaction can be performed out of order. An effective address calculator 236 is included in the LSU 280. It is even possible (though unlikely) that TEND is executed first, then the entire transaction, and lastly the TBEGIN. Program order is restored through the GCT 232 at completion time. The length of transactions is not limited by the size of the GCT 232, since the general registers (GRs) 228 can be restored from the backup register file 224.
During execution, program event recording (PER) events are filtered based on the event suppression control, and a PER TEND event is detected if enabled. Similarly, while in transactional mode, a pseudo-random generator may be causing random aborts as enabled by the transaction diagnostic control.
Tracking for transactional isolation
The load/store unit tracks the cache lines accessed during transactional execution and triggers an abort if an XI from another CPU (or an LRU-XI) conflicts with the footprint. If the conflicting XI is an exclusive or demote XI, the LSU rejects the XI back to the L3 in the hope of finishing the transaction before the L3 repeats the XI. This "stiff-arming" is very efficient in highly contended transactions. To prevent hangs when two CPUs stiff-arm each other, an XI-reject counter is implemented, which triggers a transaction abort when a threshold is met.
The L1 cache directory 240 is traditionally implemented with static random-access memories (SRAMs). For the transactional memory implementation, the valid bits 244 (64 rows x 6 ways) of the directory have been moved into normal logic latches and are supplemented with two more bits per cache line: the TX-read 248 and TX-dirty 252 bits.
The TX-read 248 bits are reset when a new outermost TBEGIN is decoded (which is interlocked against a prior, still pending transaction). The TX-read 248 bit is set at execution time by every load instruction that is marked "transactional" in the issue queue. Note that this can lead to over-marking if speculative loads are executed, for example on a mispredicted branch path. The alternative of setting the TX-read 248 bit at load completion time was too expensive in silicon area, since multiple loads can complete at the same time, requiring many read ports on the load queue.
Stores execute the same way as in non-transactional mode, but a transaction mark is placed in the store queue (STQ) 260 entry of the store instruction. At write-back time, when the data from the STQ 260 is written into the L1 240, the TX-dirty bit 252 in the L1 directory 256 is set for the written cache line. Store write-back into the L1 240 occurs only after the store instruction has completed, and at most one store is written back per cycle. Before completion and write-back, loads can access the data from the STQ 260 by means of store forwarding; after write-back, the CPU 114 (FIG. 2) can access the speculatively updated data in the L1 240. If the transaction ends successfully, the TX-dirty bits 252 of all cache lines are cleared, and the TX marks of not-yet-written stores in the STQ 260 are also cleared, effectively turning the pending stores into normal stores.
On a transaction abort, all pending transactional stores are invalidated from the STQ 260, even those already completed. All cache lines that were modified by the transaction in the L1 240, that is, that have the TX-dirty bit 252 on, have their valid bits turned off, effectively removing them from the L1 240 instantaneously.
The architecture requires that the isolation of the transaction read set and write set be maintained before a new instruction is completed. This isolation is ensured by stalling instruction completion at appropriate times when XIs are pending; speculative out-of-order execution is allowed, optimistically assuming that the pending XIs are to different addresses and do not actually cause a transaction conflict. This design fits very naturally with the XI-versus-instruction-completion interlocks that are implemented on prior systems to ensure the strong memory ordering that the architecture requires.
When the L1 240 receives an XI, the L1 240 accesses the directory to check the validity of the XI'ed address in the L1 240, and if the TX-read bit 248 is active on the XI'ed line and the XI is not rejected, the LSU 280 triggers an abort. When a cache line with an active TX-read bit 248 is LRU'ed from the L1 240, a special LRU-extension vector remembers, for each of the 64 rows of the L1 240, that a TX-read line existed on that row. Since no precise address tracking exists for the LRU extensions, any non-rejected XI that hits a valid extension row causes the LSU 280 to trigger an abort. Providing the LRU extension effectively increases the read footprint capability from the L1 size to the L2 size and associativity, provided that no conflicts with other CPUs 114 (FIG. 2) against the non-precise LRU-extension tracking cause aborts.
The store footprint is limited by the store cache size (the store cache is discussed in more detail below) and thus implicitly by the L2 268 size and associativity. No LRU-extension action needs to be performed when a TX-dirty cache line is LRU'ed from the L1 240.
Store cache
In prior systems, since the L1 240 and L2 268 are store-through caches, every store instruction causes an L3 store access; with now six cores per L3 and further improved performance of each core, the store rate for the L3 (and, to a lesser extent, for the L2 268) becomes problematic for certain workloads. To avoid store queuing delays, a gathering store cache had to be added, which combines stores to neighboring addresses before sending them to the L3.
For transactional memory performance, it is acceptable to invalidate every TX-dirty cache line from the L1 240 on transaction aborts, because the L2 268 cache is very close (7-cycle L1 miss penalty) for bringing back the clean lines. However, it would be unacceptable for performance (and for the silicon area needed for tracking) to have transactional stores write the L2 268 before the transaction ends and then invalidate all dirty L2 268 cache lines on abort (or, even worse, on the shared L3).
The two problems of store bandwidth and transactional memory store handling can both be addressed with the gathering store cache 264. The cache 264 is a circular queue of 64 entries, each entry holding 128 bytes of data with byte-precise valid bits. In non-transactional operation, when a store is received from the LSU 280, the store cache 264 checks whether an entry exists for the same address and, if so, gathers the new store into the existing entry. If no entry exists, a new entry is written into the queue, and if the number of free entries falls below a threshold, the oldest entries are written back to the L2 268 and L3 caches.
When a new outermost transaction begins, all existing entries in the store cache 264 are marked closed so that no new stores can be gathered into them, and eviction of those entries to the L2 268 and L3 is started. From that point on, the transactional stores coming out of the LSU 280 STQ 260 allocate new entries or gather into existing transactional entries. The write-back of those stores into the L2 268 and L3 is blocked until the transaction ends successfully; at that point, subsequent (post-transaction) stores can continue to gather into existing entries until the next transaction closes those entries again.
The store cache 264 is queried on every exclusive or demote XI and causes an XI reject if the XI compares against any active entry. If the core is not completing further instructions while continuously rejecting XIs, the transaction is aborted at a certain threshold to avoid hangs.
The LSU 280 requests a transaction abort when the store cache 264 overflows. The LSU 280 detects this condition when it tries to send a new store that cannot merge into an existing entry and the entire store cache 264 is filled with stores from the current transaction. The store cache 264 is managed as a subset of the L2 268: while transactionally dirty lines can be evicted from the L1 240, they have to stay resident in the L2 268 throughout the transaction. The maximum store footprint is thus limited to the store cache size of 64 x 128 bytes, and it is also limited by the associativity of the L2 268. Since the L2 268 is 8-way associative and has 512 rows, it is typically large enough not to cause transaction aborts.
If a transaction aborts, the store cache 264 is notified, and all entries holding transactional data are invalidated. The store cache 264 also has a mark per doubleword (8 bytes) indicating whether the entry was written by an NTSTG instruction; those doublewords stay valid across transaction aborts.
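Purely as an illustration, a simplified software model of the gathering behavior described above might look like the following C sketch; the structure and names are assumptions for this example, not the hardware design, and the caller is assumed to keep each store within one 128-byte block.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SC_ENTRIES   64
#define SC_LINE_SIZE 128

struct sc_entry {
    uintptr_t base;                    /* 128-byte-aligned address               */
    uint8_t   data[SC_LINE_SIZE];
    uint8_t   valid[SC_LINE_SIZE];     /* byte-precise valid bits                */
    bool      in_use;
    bool      closed;                  /* closed at the start of a new transaction */
};

static struct sc_entry store_cache[SC_ENTRIES];

/* Gather a store into an existing open entry for the same 128-byte block,
 * or report that a new entry must be allocated (possibly evicting the
 * oldest one toward L2/L3). */
bool store_cache_gather(uintptr_t addr, const uint8_t *src, unsigned len)
{
    uintptr_t base = addr & ~(uintptr_t)(SC_LINE_SIZE - 1);
    unsigned  off  = (unsigned)(addr - base);
    assert(off + len <= SC_LINE_SIZE);

    for (unsigned i = 0; i < SC_ENTRIES; i++) {
        struct sc_entry *e = &store_cache[i];
        if (e->in_use && !e->closed && e->base == base) {
            memcpy(&e->data[off], src, len);
            memset(&e->valid[off], 1, len);
            return true;               /* gathered into an existing entry */
        }
    }
    return false;                      /* caller allocates or evicts      */
}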
Millicode-implemented functions
Traditionally, IBM mainframe server processors contain a layer of firmware called millicode, which performs complex functions such as certain CISC instruction executions, interruption handling, system synchronization, and RAS. Millicode includes machine-dependent instructions as well as instructions of the instruction set architecture (ISA), which are fetched and executed from memory in the same way as instructions of application programs and the operating system (OS). Firmware resides in a restricted area of main memory that customer programs cannot access. When hardware detects a situation that requires invoking millicode, the instruction fetch unit 204 switches into "millicode mode" and starts fetching at the appropriate location in the millicode memory area. Millicode may be fetched and executed in the same way as instructions of the instruction set architecture (ISA) and may include ISA instructions.
For transactional memory, millicode is involved in various complex situations. Every transaction abort invokes a dedicated millicode subroutine to perform the necessary abort steps. The transaction-abort millicode starts by reading special-purpose registers (SPRs) holding the hardware-internal abort reason, potential exception reasons, and the aborted instruction address; millicode then uses these to store a TDB if one is specified. The TBEGIN instruction text is loaded from an SPR to obtain the GR-save-mask, which millicode needs in order to know which GRs 228 to restore.
The CPU 114 (FIG. 2) supports a special millicode-only instruction to read out the backup GRs and copy them into the main GRs. The TBEGIN instruction address is also loaded from an SPR to set the new instruction address in the PSW, so that execution continues after the TBEGIN once the millicode abort subroutine finishes. That PSW may later be saved as the program-old PSW if the abort is caused by a non-filtered program interruption.
The TABORT instruction may be millicode implemented; when the IDU 208 decodes TABORT, it instructs the instruction fetch unit to branch into TABORT's millicode, from which millicode branches into the common abort subroutine.
The extract transaction nesting depth (ETND) instruction may also be millicoded, since it is not performance critical; millicode loads the current nesting depth out of a special hardware register and places it into a GR 228. The PPA instruction is millicoded; it performs the optimal delay based on the current abort count provided by software as an operand to PPA, and also based on other hardware-internal state.
For constrained transactions, millicode may keep track of the number of aborts. The counter is reset to 0 on successful TEND completion, or if an interruption into the OS occurs (since it is not known if or when the OS will return to the program). Depending on the current abort count, millicode can invoke certain mechanisms to improve the chance of success for the subsequent transaction retry. The mechanisms include, for example, successively increasing random delays between retries and reducing the amount of speculative execution, to avoid aborts caused by speculative accesses to data that the transaction does not actually use. As a last resort, millicode can broadcast to the other CPUs to stop all conflicting work, retry the local transaction, and then release the other CPUs to continue normal processing. Multiple CPUs must be coordinated so as not to cause deadlock, so some serialization between the millicode instances on different CPUs is required.
Referring now to FIG. 4, reference numeral 400 generally illustrates an exemplary embodiment of a method for sharing data adaptively, which may be implemented in hardware or in software.
In current implementations, two methods of synchronizing data accesses based on locks may generally be employed. In data structure locking, also referred to as locking or true locking, a program may wish to be assured exclusive access to a region of memory, also referred to as shared data, during a critical section of code. In this case, the program may protect the shared data with a lock, the effect of which resembles marking the shared data as unavailable to competing programs during that time. However, the locking mechanism may strictly control access to the shared data. In lightly contended memory regions, competing programs may wait unnecessarily, thereby negatively affecting performance. For example, in the following code sample, thread 2 waits while thread 1 holds the lock on the structure hash_tbl, even though the two threads update different portions of the structure and could be executed in parallel.
Table 5
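Table 5 itself is not reproduced here; the following C sketch is only a hypothetical approximation of the situation it describes, in which one coarse lock protects the whole hash_tbl, so two threads updating different buckets still serialize on the same lock.

#include <pthread.h>

#define NBUCKETS 1024

struct hash_tbl {
    pthread_mutex_t lock;              /* single lock for the whole table */
    int             bucket[NBUCKETS];
};

static struct hash_tbl tbl = { .lock = PTHREAD_MUTEX_INITIALIZER };

void bucket_increment(int key)
{
    pthread_mutex_lock(&tbl.lock);     /* thread 2 blocks here while      */
    tbl.bucket[key % NBUCKETS]++;      /* thread 1 updates an unrelated   */
    pthread_mutex_unlock(&tbl.lock);   /* bucket under the same lock      */
}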
The HLE discussed above allows programs written to use traditional lock-based code to have the chance of exploiting hardware that implements transactional execution. However, in a heavily contended memory region, if a conflict occurs, the processor may abort the transaction and re-execute the critical section using pessimistic locking behavior. In one embodiment, any lock that intersects a cache line is not elided, and re-execution without HLE is triggered automatically. Therefore, where a critical section is known to constantly fail as a transaction, defaulting to transactional execution and subsequently restarting successfully with the lock degrades performance.
At 410, when the processor, CPU 114 (FIG. 2), begins a code sequence to access a memory region, the CPU 114 (FIG. 2) invokes a conflict predictor (i.e., an HLE predictor or hardware lock virtualizer), which may be implemented in hardware or in software, to attempt to predict whether lock elision is likely to succeed or whether locking should be used instead. In operation, as will be discussed, the conflict predictor may operate in various hardware and software environments. However, when the conflict predictor is referenced in the context of embodiments of conflict prediction in an HLE environment, the conflict predictor may alternatively be referred to as an HLE predictor. In one embodiment, a simple count of successful transactional executions is maintained, for example in a hardware register or in a memory location, either per thread or shared by all threads. When a threshold representing the count of successful transactional executions is passed, at 410 the conflict predictor may predict that the transactional execution path, i.e., lock elision, is more effective than the non-transactional (i.e., locking) path at 455, because interference is unlikely. In at least one embodiment, the counter is initialized to initially prefer the more effective execution path, preferably corresponding, in at least one embodiment, to lock-elision-based transactional execution. In another embodiment, an estimate of the relative cost of executing a transaction compared to acquiring the lock can be computed, either in hardware or by instructions embedded in the program flow. Based on the computed relative cost, the conflict predictor may predict whether the transactional path or the non-transactional path is more effective, for example because the predicted path is cheaper to execute or less likely to encounter interference. In yet another embodiment, a compiler may implicitly insert behavior hints to the conflict predictor to select, at 410, the transactional execution path at 420 or the locking path at 455. The CPU 114 (FIG. 2) may begin executing the critical section as a transaction at 420, updating data as needed at 425. When the transaction ends at 430, but before the results are committed, the CPU 114 (FIG. 2) may determine at 435 whether interference (i.e., two or more code sequences operating in parallel on the same data) that would cause a transaction abort is detected. When no interference is detected, then at 440 the transaction may successfully commit its results, which may subsequently be used by other transactions. However, if the CPU 114 (FIG. 2) detects interference at 435, then at 455 execution restarts using locking. At 460, the critical section must explicitly acquire the lock protecting the memory region to be accessed. However, the lock requester may be forced to wait, in an action referred to as spinning, until the lock is released by the competing process. When the lock is finally acquired at 460, the critical section may continue processing. When the data protected by the lock has been updated at 470, the critical section completes, and the lock may be released at 475.
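The counter-based form of the predictor at 410 could be sketched in C as follows; the threshold value, the bias toward elision, and the update rules on commit and abort are assumptions made for illustration only.

#include <stdbool.h>
#include <stdatomic.h>

#define ELIDE_THRESHOLD 4

static _Atomic int success_count = ELIDE_THRESHOLD;  /* bias toward elision */

bool predictor_says_elide(void)
{
    return atomic_load(&success_count) >= ELIDE_THRESHOLD;
}

void predictor_record_commit(void) { atomic_fetch_add(&success_count, 1); }

void predictor_record_abort(void)
{
    /* Drop below the threshold so the next attempt takes the locking path. */
    atomic_store(&success_count, 0);
}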
Referring to FIG. 5, reference numeral 500 generally illustrates an exemplary embodiment in which the conflict predictor (i.e., hardware lock virtualizer) is implemented in an environment having HLE support. As described above, HLE is Intel's legacy-compatible instruction set extension, comprising XACQUIRE and XRELEASE, which allows programs written to use traditional lock-based code to have the chance of exploiting hardware that implements transactional execution without significant code modification. In the present embodiment, the HLE predictor is described with reference to Intel's HLE as a particular example.
At 505, the CPU 114 (FIG. 2) executes an XACQUIRE-prefixed instruction to start an HLE sequence with the associated lock-acquire transaction. In one embodiment, this sequence may be represented by an XACQUIRE followed by a lock acquisition. In some implementations, the XACQUIRE prefix may be ignored. In other implementations, the XACQUIRE sequence is executed optionally. After the HLE initiation sequence is started, the conflict predictor (i.e., HLE predictor) is invoked at 510. Based on the prediction, either lock elision may be performed or the lock may be acquired. Whether lock elision or lock acquisition is predicted, processing may continue substantially as described at 420-475 of FIG. 4.
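For illustration, a minimal HLE-elided spinlock is sketched below, assuming GCC's x86 memory-model extension (__ATOMIC_HLE_ACQUIRE/__ATOMIC_HLE_RELEASE, compiled with HLE support enabled), which emits the XACQUIRE/XRELEASE prefixes discussed above; whether the lock is actually elided remains up to the hardware and, in this disclosure, the HLE predictor.

#include <immintrin.h>

static int lockvar = 0;

void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the hardware may elide the write to
       lockvar and start an HLE transaction instead. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();   /* spin politely if the lock is really held */
}

void hle_unlock(void)
{
    /* XRELEASE-prefixed store: ends the elided region and commits it. */
    __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}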
Referring to FIG. 6, reference numeral 600 generally illustrates a flowchart of an exemplary embodiment of the method for sharing data adaptively using a selection between lock elision and locking, in the absence of additional hardware capabilities. In the present exemplary embodiment, hints may be provided to the conflict predictor in the code stream of an application program, for example by the system being operated, or by hardware. For example, in one embodiment a programmer may explicitly insert one or more instructions, or a compiler may implicitly insert behavior hints to the conflict predictor. The conflict predictor may keep a history vector or counts to track both the number of successful predictions and the number of unsuccessful predictions (i.e., mispredictions) over a period of time, such as, for example, one second. Then, at 610, the conflict predictor may compare the count of mispredictions during this time window against a threshold number of failures. When the mispredictions exceed the threshold number of failures during the time window, the conflict predictor may default to execution using the lock (i.e., non-transactional mode) for the remainder of the time window. During this time window, the memory region may be highly contended, owing to the operating characteristics of multiple transactions simultaneously updating the data. By temporarily selecting locking as the default, the conflict predictor may avoid the possibility of having to restart failed transactions, and may improve throughput by avoiding transaction aborts. However, once the time window expires, contention for the memory region may have become light, and the conflict predictor may again attempt transactional execution. In an embodiment, the conflict predictor is implemented in software, wherein an algorithm implemented by the software makes the decision whether to perform lock elision or locking, and control is transferred either to a first version of the code that implements lock elision or to a second version of the code that implements lock acquisition. In other embodiments, the determination at 610 is made using alternative tests, based on a history of interference, in response to an indication by software of the particular items to be updated, reflecting expected interference or non-interference associated with the fields that are the target of an updating transaction, and so forth.
At 655, the critical section must explicitly acquire the lock protecting the memory region being accessed. However, the lock requester may be forced to wait, in an action referred to as spinning, until the lock is released by the competing program. When the lock is finally acquired at 660, the critical section may continue processing. When the data protected by the lock has been updated at 670, the critical section is completed at 675, and the lock may be released. At 680, the CPU 114 (FIG. 2) may check the expiration of the time window. If the time window has not expired, the process ends at 680. However, if the time window has expired, then at 685 the counts of failed transaction executions and successful transaction executions may be reset, effectively resetting the time window and beginning the retraining of the conflict predictor.
When the mispredictions during the time window are fewer than the threshold number of failures, at 610 the conflict predictor may select lock elision, i.e., an HLE transaction, or a lock-eliding transaction implemented in conjunction with reading the lock word explicitly rather than acquiring the lock. When executing as the selected HLE transaction (or as a lock-eliding transaction implemented in conjunction with a lock-eliding software transaction that includes the lock word in its read set), the CPU 114 (FIG. 2) may increment the count of successful transaction executions at 615. The HLE transaction at 620 may update data as needed at 625. When the transaction ends at 630, but before the results are committed at 635, the CPU 114 (FIG. 2) may determine whether interference (i.e., two or more code sequences operating in parallel on the same data) that would cause a transaction abort is detected. When no interference is detected, at 640 the HLE transaction (or other lock-eliding transaction) may successfully commit its results, which may subsequently be used by other processes. However, if the CPU 114 (FIG. 2) detects interference at 635, then at 650 the count of failed transaction executions is incremented, since the failed transaction counts as a misprediction and can be used to train the conflict predictor to make more accurate predictions in the future. At 655 and 660, the CPU 114 (FIG. 2) may now attempt to acquire the lock and restart the critical section non-transactionally (i.e., using the lock) on the memory region. When the data protected by the lock is finally updated at 670, the processing of the critical section completes, and the lock may be released at 675. At 680, the CPU 114 (FIG. 2) may check the expiration of the time window. If the time window has not expired, the process ends at 680. However, when the time window has expired, then at 685 the counts of failed transaction executions and successful transaction executions may be reset, effectively beginning the retraining of the conflict predictor.
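A software-only sketch of the time-window logic of FIG. 6 is given below in C; the one-second window, the failure threshold, and the use of time(NULL) are assumptions for illustration, not limitations of the embodiment.

#include <stdbool.h>
#include <time.h>

#define FAIL_THRESHOLD 3
#define WINDOW_SECONDS 1

static unsigned mispredictions, successes;
static time_t   window_start;

static void maybe_reset_window(void)
{
    time_t now = time(NULL);
    if (window_start == 0 || now - window_start >= WINDOW_SECONDS) {
        window_start   = now;   /* 685: start retraining for a new window */
        mispredictions = 0;
        successes      = 0;
    }
}

bool predictor_choose_elision(void)
{
    maybe_reset_window();
    return mispredictions < FAIL_THRESHOLD;  /* 610: otherwise default to the lock */
}

void predictor_note_success(void) { successes++; }      /* 615 */
void predictor_note_abort(void)   { mispredictions++; } /* 650 */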
Referring now to FIG. 7, reference numeral 700 generally illustrates a flowchart of an exemplary embodiment of the method for sharing data adaptively that includes a facility in hardware to monitor a lock while executing with the lock. In FIG. 7, the processing of HLE transactions (i.e., 710 through 750) is substantially similar to the way the embodiment of FIG. 6 processes HLE transactions (i.e., 610 through 650). However, FIG. 7 introduces a hardware lock monitoring facility in the path where the critical section is executed non-transactionally. In the present embodiment, while the critical section is allowed to execute in the locked memory region, the hardware lock monitoring facility attempts to minimize mispredictions by predicting what the outcome would have been had the critical section actually executed as an HLE transaction. Once the lock is successfully acquired at 760 and 765, the hardware lock monitoring facility may begin monitoring the status of the lock at 770. The critical section at 775 updates the data in the locked memory region and completes execution by releasing the lock at 780. However, during execution, if the hardware lock monitoring facility detects at 785 that another process checks the state of the lock marker, then, had this critical section executed as a transaction rather than non-transactionally, the other process's attempt would have caused interference and a failure of the transaction. In one embodiment, only the lock is monitored. In another embodiment, the data being updated as part of the locked region is monitored. As a result, at 790 the hardware lock monitoring facility may increment the count of failed transaction executions.
In yet another embodiment, the hardware lock monitoring facility may monitor all attempted data accesses within the locked memory region. If another process attempts to access data in this region, then at 790 the hardware lock monitoring facility may count this as interference and a potential transaction failure. Thus, the conflict predictor may more accurately learn to predict whether transactional or non-transactional execution is more likely to succeed.
In yet another embodiment, a restart flag may be set at 750 when the count of failed transaction executions is incremented. Then, when the count of successful transaction executions is incremented, the restart flag may be reset at 755. The restart flag may improve prediction accuracy by preventing the count of failed transaction executions from being incremented twice (i.e., once at 750 upon failing as an HLE transaction, and once at 755 upon restarting using the lock).
Referring now to FIG. 8, in an embodiment, predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire the lock and execute non-transactionally 810 includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction 820; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict 830, wherein the xrelease instruction releases the lock; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode 840.
Referring now to FIG. 9, in an embodiment, the HLE predictor is updated based on whether the predicted HLE transaction succeeds. Based on encountering an HLE transaction with a lock address for the first time, a count of successful HLE transaction executions associated with the lock address is initialized to zero; based on failing to complete any subsequent HLE transaction having the lock address, a count of failed HLE transaction executions associated with the lock address of the HLE transaction is incremented in the HLE predictor, wherein a high count indicates that an abort is likely 920. In non-transactional mode, attempted accesses to the lock by another process are monitored, and when an attempted access by another process is detected, the count of failed HLE transaction executions is incremented 950. The count of successful HLE transaction executions and the count of failed HLE transaction executions are tracked within a time window, and based on the count of failed HLE transaction executions exceeding a threshold number of failures, non-transactional mode is made the default for the remainder of the time window 970. Based on expiration of the time window, the counts of successful HLE transaction executions and failed HLE transaction executions are reset to 0 960.
Referring now to FIG. 10, a computing device 1000 may include respective sets of internal components 800 and external components 900. Each of the sets of internal components 800 includes: one or more processors 820; one or more computer-readable RAMs 822; one or more computer-readable ROMs 824 on one or more buses 826; one or more operating systems 828; one or more software applications performing the methods of FIGS. 5-7; and one or more computer-readable tangible storage devices 830. The one or more operating systems are stored on one or more of the respective computer-readable tangible storage devices 830 for execution by one or more of the respective processors 820 via one or more of the respective RAMs 822 (which typically include cache memory). In the embodiment illustrated in FIG. 10, each of the computer-readable tangible storage devices 830 is a magnetic disk storage device of an internal hard drive. Alternatively, each of the computer-readable tangible storage devices 830 is a semiconductor storage device such as a ROM 824, an EPROM, a flash memory, or any other computer-readable tangible storage device that can store a computer program and digital information.
Each group of internal part 800 also includes the R/W for reading from one or more computer-readable tangible storage device 936 and be written to and drives or interface 832, wherein this one or more computer-readable tangible storage device 936 such as thin supply memory devices, CD-ROM, DVD, SSD, memory stick, tape, disk, CD or semiconductor memory apparatus. R/W drives or interface 832 can be used for device driver 840 firmware, software or microcode being loaded into tangible storage device 936 to be conducive to the communication of the parts with computing equipment 100.
Each group of internal part 800 may also include network adapter (or switch port card) or the interface 836 of such as TCP/IP adapter card, wireless Wi-Fi interface card or 3G or 4G wireless interface card or other wired or wireless communication link. The operating system 828 being associated with computing equipment 1000 can calculate (such as, server) via network (such as, the Internet, LAN or wide area network) and each network adapter or interface 386 from outside and be downloaded to computing equipment 1000. From network adapter (or switch port adapter) or interface 836, the operating system 828 being associated with computing equipment 1000 is loaded into each hard drive 830 and network adapter 836. Network can comprise copper cash, optical fiber, Wireless transceiver, router, fire wall, switch, gateway computer and/or Edge Server.
Each included computer display monitor 920 in the group of external component 900, keyboard 930 and computer mouse 934. External component 900 may also include touch screen, dummy keyboard, touch pad, sensing equipment and other human interface device. Each in the group of internal part 800 also includes the device driver 840 docked with computer display monitor 920, keyboard 930 and computer mouse 934. Device driver 840, R/W drive or interface 832 and network adapter or network 836 include in hardware and software (being stored in storage device 830 and/or ROM824).
The various embodiments of the present disclosure may be implemented in a data processing system suitable for storing and/or executing program code, which includes at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements include, for instance, local memory employed during actual execution of the program code, bulk storage, and cache memory which provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives and other memory media, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
While preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the disclosure, and these are therefore considered to be within the scope of the disclosure as defined in the following claims.

Claims (20)

1. A method in a hardware lock elision (HLE) environment for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write by the lock-acquire instruction to the lock, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict, wherein the xrelease instruction releases the lock; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
2. The method according to claim 1, further comprising:
updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor predicts whether the HLE transaction is likely to abort.
3. The method according to claim 1, further comprising:
based on encountering an HLE transaction with a lock address for the first time, initializing a count of successful HLE transactional executions associated with the lock address to zero;
based on any subsequent HLE transaction with the lock address aborting, incrementing, in the predictor, a count of failed HLE transactional executions associated with the lock address of the HLE transaction; and
based on any subsequent HLE transaction with the lock address completing, incrementing, in the HLE predictor, the count of successful HLE transactional executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transactional executions indicates a likely abort.
4. The method according to claim 1, further comprising:
in non-transactional mode, monitoring for an attempted access to the lock by another process; and
when the attempted access by the other process is detected, incrementing a count of failed HLE transactional executions.
5. The method according to claim 1, further comprising:
tracking a count of successful HLE transactional executions and a count of failed HLE transactional executions within a time window;
comparing the count of failed HLE transactional executions during the time window with a threshold number of failures; and
based on the count of failed HLE transactional executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.
6. The method according to claim 5, further comprising:
based on expiration of the time window, resetting the count of successful HLE transactional executions and the count of failed HLE transactional executions to zero.
7. A computer program product in a hardware lock elision (HLE) environment for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer program product comprising:
a computer-readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write by the lock-acquire instruction to the lock, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict, wherein the xrelease instruction releases the lock; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
8. The computer program product according to claim 7, further comprising:
updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor predicts whether the HLE transaction is likely to abort.
9. The computer program product according to claim 7, further comprising:
in non-transactional mode, monitoring for an attempted access to the lock by another process; and
when the attempted access by the other process is detected, incrementing a count of failed HLE transactional executions.
10. The computer program product according to claim 7, further comprising:
in non-transactional mode, monitoring for an attempted access by another process to a memory region protected by the lock; and
when the attempted access by the other process is detected, incrementing a count of failed HLE transactional executions.
11. The computer program product according to claim 7, further comprising:
based on encountering an HLE transaction with a lock address for the first time, initializing a count associated with the lock address to zero;
based on any subsequent HLE transaction with the lock address aborting, incrementing, in the predictor, the count associated with the lock address of the HLE transaction; and
based on any subsequent HLE transaction with the lock address completing, incrementing, in the predictor, a count of successful HLE transactional executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transactional executions indicates a likely abort.
12. The computer program product according to claim 7, further comprising:
tracking a count of successful HLE transactional executions and a count of failed HLE transactional executions within a time window;
comparing the count of failed HLE transactional executions during the time window with a threshold number of failures; and
based on the count of failed HLE transactional executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.
13. The computer program product according to claim 12, further comprising:
based on expiration of the time window, resetting the count of successful HLE transactional executions and the count of failed HLE transactional executions to zero.
14. A computer system in a hardware lock elision (HLE) environment for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer system comprising:
a memory; and
a processor in communication with the memory, wherein the computer system is configured to perform a method, the method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write by the lock-acquire instruction to the lock, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict, wherein the xrelease instruction releases the lock; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
15. The computer system according to claim 14, further comprising:
updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor predicts whether the HLE transaction is likely to abort.
16. The computer system according to claim 14, further comprising:
in non-transactional mode, monitoring for an attempted access to the lock by another process; and
when the attempted access by the other process is detected, incrementing a count of failed HLE transactional executions.
17. The computer system according to claim 14, further comprising:
in non-transactional mode, monitoring for an attempted access by another process to a memory region protected by the lock; and
when the attempted access by the other process is detected, incrementing a count of failed HLE transactional executions.
18. The computer system according to claim 14, further comprising:
based on encountering an HLE transaction with a lock address for the first time, initializing a count associated with the lock address to zero;
based on any subsequent HLE transaction with the lock address aborting, incrementing, in the predictor, the count associated with the lock address of the HLE transaction; and
based on any subsequent HLE transaction with the lock address completing, incrementing, in the predictor, a count of successful HLE transactional executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transactional executions indicates a likely abort.
19. The computer system according to claim 14, further comprising:
tracking a count of successful HLE transactional executions and a count of failed HLE transactional executions within a time window;
comparing the count of failed HLE transactional executions during the time window with a threshold number of failures; and
based on the count of failed HLE transactional executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.
20. The computer system according to claim 19, further comprising:
based on expiration of the time window, resetting the count of successful HLE transactional executions and the count of failed HLE transactional executions to zero.
CN201480053800.8A 2013-10-14 2014-09-28 Adaptive process for data sharing with selection of lock elision and locking Active CN105683906B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201314052960A 2013-10-14 2013-10-14
US14/052,960 2013-10-14
US14/191,581 2014-02-27
US14/191,581 US9524195B2 (en) 2014-02-27 2014-02-27 Adaptive process for data sharing with selection of lock elision and locking
PCT/CN2014/087692 WO2015055083A1 (en) 2013-10-14 2014-09-28 Adaptive process for data sharing with selection of lock elision and locking

Publications (2)

Publication Number Publication Date
CN105683906A true CN105683906A (en) 2016-06-15
CN105683906B CN105683906B (en) 2018-11-23

Family

ID=52827651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480053800.8A Active CN105683906B (en) Adaptive process for data sharing with selection of lock elision and locking

Country Status (3)

Country Link
JP (1) JP6642806B2 (en)
CN (1) CN105683906B (en)
WO (1) WO2015055083A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905365A (en) * 2019-10-30 2021-06-04 支付宝(杭州)信息技术有限公司 Data processing method, device, equipment and medium
CN113939797A (en) * 2019-07-09 2022-01-14 美光科技公司 Lock management for memory subsystems
WO2022156647A1 (en) * 2021-01-25 2022-07-28 华为技术有限公司 Database management method and apparatus

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6468053B2 (en) * 2015-04-28 2019-02-13 富士通株式会社 Information processing apparatus, parallel processing program, and shared memory access method
JP6895719B2 (en) * 2016-06-24 2021-06-30 日立Astemo株式会社 Vehicle control device
US11868818B2 (en) * 2016-09-22 2024-01-09 Advanced Micro Devices, Inc. Lock address contention predictor
JP6943030B2 (en) 2017-06-16 2021-09-29 富士通株式会社 Information processing equipment, information processing methods and programs
EP3462308B1 (en) 2017-09-29 2022-03-02 ARM Limited Transaction nesting depth testing instruction
JP6839126B2 (en) * 2018-04-12 2021-03-03 日本電信電話株式会社 Control processing device, control processing method and control processing program
WO2021026938A1 (en) * 2019-08-15 2021-02-18 奇安信安全技术(珠海)有限公司 Shellcode detection method and apparatus
CN112199391B (en) * 2020-09-30 2024-02-23 深圳前海微众银行股份有限公司 Data locking detection method, equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080115042A1 (en) * 2006-11-13 2008-05-15 Haitham Akkary Critical section detection and prediction mechanism for hardware lock elision
US20090119459A1 (en) * 2007-11-07 2009-05-07 Haitham Akkary Late lock acquire mechanism for hardware lock elision (hle)
US20100281220A1 (en) * 2009-04-30 2010-11-04 International Business Machines Corporation Predictive ownership control of shared memory computing system data
US20110145551A1 (en) * 2009-12-16 2011-06-16 Cheng Wang Two-stage commit (tsc) region for dynamic binary optimization in x86

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7529914B2 (en) * 2004-06-30 2009-05-05 Intel Corporation Method and apparatus for speculative execution of uncontended lock instructions
US8914620B2 (en) * 2008-12-29 2014-12-16 Oracle America, Inc. Method and system for reducing abort rates in speculative lock elision using contention management mechanisms
US20130159653A1 (en) * 2011-12-20 2013-06-20 Martin T. Pohlack Predictive Lock Elision

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080115042A1 (en) * 2006-11-13 2008-05-15 Haitham Akkary Critical section detection and prediction mechanism for hardware lock elision
US20090119459A1 (en) * 2007-11-07 2009-05-07 Haitham Akkary Late lock acquire mechanism for hardware lock elision (hle)
CN102722418A (en) * 2007-11-07 2012-10-10 英特尔公司 Late lock acquire mechanism for hardware lock elision (hle)
US20100281220A1 (en) * 2009-04-30 2010-11-04 International Business Machines Corporation Predictive ownership control of shared memory computing system data
US20110145551A1 (en) * 2009-12-16 2011-06-16 Cheng Wang Two-stage commit (tsc) region for dynamic binary optimization in x86
CN102103485A (en) * 2009-12-16 2011-06-22 英特尔公司 Two-stage commit (TSC) region for dynamic binary optimization in X86

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113939797A (en) * 2019-07-09 2022-01-14 美光科技公司 Lock management for memory subsystems
CN112905365A (en) * 2019-10-30 2021-06-04 支付宝(杭州)信息技术有限公司 Data processing method, device, equipment and medium
CN112905365B (en) * 2019-10-30 2024-02-13 支付宝(杭州)信息技术有限公司 Data processing method, device, equipment and medium
WO2022156647A1 (en) * 2021-01-25 2022-07-28 华为技术有限公司 Database management method and apparatus

Also Published As

Publication number Publication date
CN105683906B (en) 2018-11-23
JP6642806B2 (en) 2020-02-12
JP2016537709A (en) 2016-12-01
WO2015055083A1 (en) 2015-04-23

Similar Documents

Publication Publication Date Title
US9262207B2 (en) Using the transaction-begin instruction to manage transactional aborts in transactional memory computing environments
CN105683906B (en) Adaptive process for data sharing with selection of lock elision and locking
CN106030534B (en) Method and system for salvaging partially executed hardware transactions
CN106133705B (en) Method and system for coherence protocol enhancements indicating transaction status
US9971690B2 (en) Transactional memory operations with write-only atomicity
US10235201B2 (en) Dynamic releasing of cache lines
US9852014B2 (en) Deferral instruction for managing transactional aborts in transactional memory computing environments
US9477469B2 (en) Branch predictor suppressing branch prediction of previously executed branch instructions in a transactional execution environment
US9389802B2 (en) Hint instruction for managing transactional aborts in transactional memory computing environments
US9921895B2 (en) Transactional memory operations with read-only atomicity
US10168961B2 (en) Hardware transaction transient conflict resolution
US20180066385A1 (en) Enabling end of transaction detection using speculative look ahead
US10996982B2 (en) Regulating hardware speculative processing around a transaction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant