CN105683906B

CN105683906B - Adaptive process for data sharing with selection of lock elision and locking

Info

Publication number: CN105683906B
Application number: CN201480053800.8A
Authority: CN
Inventors: M. K. Gschwind; M. M. Michael; V. Salapura; Chung-Lung Shum
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2013-10-14
Filing date: 2014-09-28
Publication date: 2018-11-23
Anticipated expiration: 2034-09-28
Also published as: JP6642806B2; JP2016537709A; CN105683906A; WO2015055083A1

Abstract

In a hardware lock elision (HLE) environment, predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally is provided. Included is: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict, wherein the xrelease instruction releases the lock; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

Description

Adaptive process for data sharing with selection of lock elision and locking

Technical field

The present disclosure relates generally to transactional memory systems, and more particularly to a method, computer program product, and computer system for adaptively sharing data by selecting between lock elision and locking.

Background

The number of central processing unit (CPU) cores on a chip and the number of CPU cores connected to a shared memory continue to grow significantly to support growing workload capacity demand. The increasing number of CPUs cooperating to process the same workloads puts a significant burden on software scalability; for example, shared queues or data structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally this has been countered by implementing finer-grained locking in software. Implementing finer-grained locking to improve software scalability can be very complicated and error-prone, and at today's CPU frequencies, the latencies of hardware interconnects are limited by the physical dimensions of the chips and systems and by the speed of light.

Implementations of hardware transactional memory (HTM, or in this discussion, simply TM) have been introduced, wherein a group of instructions, called a transaction, operates in an atomic manner on a data structure in memory, as viewed by other central processing units (CPUs) and the I/O subsystem (atomic operation is also known as "block concurrent" or "serialized" in other literature). The transaction executes optimistically without obtaining a lock, but may need to abort and retry the transactional execution if an operation of the executing transaction on a memory location conflicts with another operation on the same memory location. Previously, software transactional memory implementations have been proposed to support software transactional memory (TM). However, hardware TM can provide improved performance and ease of use over software TM.

U.S. Patent Application Publication No. 2004/0044850, entitled "Method and apparatus for the synchronization of distributed caches" and filed August 28, 2002, the content of which is incorporated herein by reference, teaches a method and apparatus for the synchronization of distributed caches. More particularly, the embodiments provided relate to cache memory systems, and more specifically to a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output (I/O) hub.

U.S. Patent No. 5,586,297, entitled "Partial cache line write transactions in a computing system with a write back cache" and filed March 24, 1994, the content of which is incorporated herein by reference, teaches a computing system that includes a memory, an input/output adapter, and a processor. The processor includes a write back cache in which dirty data may be stored. When a coherent write from the input/output adapter to the memory is performed, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write back cache. The write back cache is searched to determine whether it contains data for the memory location. When the search determines that the write back cache contains data for the memory location, the full cache line that contains the data for the memory location is purged.

Summary of the invention

A method is provided for predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. According to an embodiment of the present disclosure, the method may include: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
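The decision step described above can be sketched as a small software model. This is only an illustration under stated assumptions: the per-lock saturating counter, the names `HlePredictor` and `on_hle_lock_acquire`, and the threshold values are hypothetical and are not taken from the disclosure, which leaves the predictor's internals to the detailed description.

```python
class HlePredictor:
    """Hypothetical per-lock saturating counter (an assumption, not the patent's design)."""

    def __init__(self, threshold=2, maximum=3):
        self.threshold = threshold  # minimum confidence required to elide
        self.maximum = maximum      # saturation value; locks start fully trusted
        self.counters = {}          # lock address -> confidence that elision commits

    def predict_elide(self, lock_addr):
        return self.counters.get(lock_addr, self.maximum) >= self.threshold

    def update(self, lock_addr, committed):
        # Raise confidence on a commit, lower it on an abort, saturating both ways.
        c = self.counters.get(lock_addr, self.maximum)
        self.counters[lock_addr] = min(self.maximum, c + 1) if committed else max(0, c - 1)


def on_hle_lock_acquire(predictor, lock_addr, read_set):
    """Model of the choice made when an HLE lock-acquire instruction is encountered."""
    if predictor.predict_elide(lock_addr):
        # Elide: the lock address joins the read-set and no write to the lock is issued.
        read_set.add(lock_addr)
        return "hle-transaction"
    # Do not elide: treat the instruction as an ordinary (non-HLE) lock acquire.
    return "non-transactional"
```

In this model, two aborts in a row on the same lock drop the counter below the threshold, so the next acquire of that lock takes the lock for real instead of eliding it.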

In another embodiment of the present disclosure, a computer program product may be provided for predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer program product may include a computer readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

In another embodiment of the present disclosure, a computer system is provided for predictively determining, in a hardware lock elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer system may include a memory and a processor in communication with the memory, wherein the computer system is configured to perform a method comprising: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered (wherein the xrelease instruction releases the lock) or the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

Brief description of the drawings

Features and advantages of the embodiments disclosed herein will become apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale, as the illustrations are for clarity in helping one skilled in the art understand the disclosure in conjunction with the detailed description. In the drawings:

Fig. 1 and Fig. 2 show an example multicore transactional memory environment according to embodiments of the present disclosure;

Fig. 3 shows example components of an example CPU according to embodiments of the present disclosure;

Fig. 4 is a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an example hardware or software implementation;

Fig. 5 is a flowchart of a conflict predictor, also referred to as an HLE predictor or hardware lock virtualizer, implemented in an environment with HLE support;

Fig. 6 is a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an exemplary embodiment without additional hardware capability;

Fig. 7 is a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an exemplary embodiment with hardware lock monitoring;

Figs. 8 and 9 show example flows for adaptively sharing data; and

Fig. 10 is a schematic block diagram of the hardware and software of a computer environment according to at least one exemplary embodiment of the methods of Figs. 4 to 7.

Detailed description

Historically, a computer system or processor had only a single processor (also known as a processing unit or central processing unit). The processor included an instruction processing unit (IPU), a branch unit, a memory control unit, and the like. Such processors were capable of executing a single thread of a program at a time. Operating systems were developed that could time-share a processor by dispatching a program to be executed on the processor for a period of time, and then dispatching another program to be executed on the processor for another period of time. As technology evolved, memory subsystem caches were often added to the processor, as well as complex dynamic address translation including translation lookaside buffers (TLBs). The IPU itself was often referred to as a processor. As technology continued to evolve, an entire processor could be packaged as a single semiconductor chip or die; such a processor was referred to as a microprocessor. Then processors were developed that incorporated multiple IPUs; such processors were often referred to as multi-processors. Each such processor of a multi-processor computer system (processor) may include individual or shared caches, memory interfaces, a system bus, an address translation mechanism, and the like. Virtual machine and instruction set architecture (ISA) emulators added a layer of software to a processor that provided the virtual machine with multiple "virtual processors" (also known as processors) by time-slice usage of a single IPU in a single hardware processor. As technology further evolved, multi-threaded processors were developed, enabling a single hardware processor having a single multi-threaded IPU to execute threads of different programs simultaneously, so that each thread of a multi-threaded processor appeared to the operating system as a processor. As technology further evolved, it became possible to put multiple processors (each having an IPU) on a single semiconductor chip or die. These processors were referred to as processor cores or just cores. Thus, terms such as processor, central processing unit, processing unit, multi-processor core, core, processor core, processor thread, and thread are often used interchangeably. Aspects of the embodiments herein may be practiced by any or all of the processors described above without departing from the teachings herein. Where the term "thread" or "processor thread" is used herein, it is expected that a particular advantage of the embodiment may be had in a processor thread implementation.

Transactional execution in Intel-based embodiments

The "Intel Architecture Instruction Set Extensions Programming Reference", 319433-012A, February 2012, which is incorporated herein by reference in its entirety, teaches in part in Chapter 8 that multithreaded applications may take advantage of increasing numbers of CPU cores to achieve higher performance. However, writing multithreaded applications requires programmers to understand and take into account data sharing among the multiple threads. Access to shared data typically requires synchronization mechanisms. These synchronization mechanisms ensure that multiple threads update shared data by serializing the operations applied to the shared data, often through the use of a critical section that is protected by a lock. Since serialization limits concurrency, programmers try to limit the overhead due to synchronization.

Intel Transactional Synchronization Extensions (Intel TSX) allow a processor to dynamically determine whether threads need to be serialized through lock-protected critical sections, and to perform that serialization only when required. This allows the processor to expose and exploit concurrency that is hidden in an application because of dynamically unnecessary synchronization.

With Intel TSX, programmer-specified code regions (also referred to as "transactional regions" or just "transactions") are executed transactionally. If the transactional execution completes successfully, then all memory operations performed within the transactional region will appear to have occurred instantaneously when viewed from other processors. A processor makes the memory operations of the executed transaction, performed within the transactional region, visible to other processors only when a successful commit occurs, i.e., when the transaction successfully completes execution. This process is often referred to as an atomic commit.

Intel TSX provides two software interfaces to specify regions of code for transactional execution. Hardware Lock Elision (HLE) is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) to specify transactional regions. Restricted Transactional Memory (RTM) is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) for programmers to define transactional regions in a more flexible manner than is possible with HLE. HLE is for programmers who prefer the backward compatibility of the conventional mutual exclusion programming model and would like to run HLE-enabled software on legacy hardware, but would also like to take advantage of the new lock elision capabilities on hardware with HLE support. RTM is for programmers who prefer a flexible interface to the transactional execution hardware. In addition, Intel TSX also provides an XTEST instruction. This instruction allows software to query whether the logical processor is transactionally executing in a transactional region identified by either HLE or RTM.

Since a successful transactional execution ensures an atomic commit, the processor executes the code region optimistically without explicit synchronization. If synchronization was unnecessary for that specific execution, the execution can commit without any cross-thread serialization. If the processor cannot commit atomically, then the optimistic execution fails. When this happens, the processor will roll back the execution, a process referred to as a transactional abort. On a transactional abort, the processor will discard all updates performed in the memory region used by the transaction, restore architectural state to appear as if the optimistic execution never occurred, and resume execution non-transactionally.

A processor can perform a transactional abort for numerous reasons. A primary reason to abort a transaction is conflicting memory accesses between the transactionally executing logical processor and another logical processor. Such conflicting memory accesses may prevent a successful transactional execution. Memory addresses read from within a transactional region constitute the read-set of the transactional region, and addresses written to within the transactional region constitute the write-set of the transactional region. Intel TSX maintains the read-set and the write-set at the granularity of a cache line. A conflicting memory access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is part of either the read-set or the write-set of the transactional region. A conflicting access typically means that serialization is required for this code region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts that result in transactional aborts. Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity. Additionally, some instructions and system events may cause transactional aborts. Frequent transactional aborts result in wasted cycles and increased inefficiency.
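The conflict rule above can be made concrete with a minimal software model. It is a sketch only: the 64-byte line size, the class name `TransactionalRegion`, and its methods are assumptions for illustration, and the model ignores capacity limits and the other abort causes mentioned in the text.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the real value is implementation specific


def line_of(addr):
    """Map a byte address to the index of the cache line that holds it."""
    return addr // CACHE_LINE_BYTES


class TransactionalRegion:
    """Tracks the read-set and write-set of one transactional region, per cache line."""

    def __init__(self):
        self.read_lines = set()
        self.write_lines = set()

    def read(self, addr):
        self.read_lines.add(line_of(addr))

    def write(self, addr):
        self.write_lines.add(line_of(addr))

    def conflicts_with(self, other_addr, other_is_write):
        """Would an access by another logical processor conflict with this region?"""
        line = line_of(other_addr)
        if other_is_write:
            # A remote write conflicts with anything the region has read or written.
            return line in self.read_lines or line in self.write_lines
        # A remote read conflicts only with lines the region has written.
        return line in self.write_lines
```

Because tracking is per line, a remote write to address 0x13F conflicts with a region that only read address 0x100: both fall in the same 64-byte line even though the data are unrelated.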

Hardware lock elision

Hardware Lock Elision (HLE) provides a legacy-compatible instruction set interface for programmers to use transactional execution. HLE provides two new instruction prefix hints: XACQUIRE and XRELEASE.

With HLE, a programmer adds the XACQUIRE prefix to the front of the instruction that is used to acquire the lock protecting the critical section. The processor treats the prefix as a hint to elide the write associated with the lock acquire operation. Even though the lock acquire has an associated write operation to the lock, the processor neither adds the address of the lock to the transactional region's write-set nor issues any write requests to the lock. Instead, the address of the lock is added to the read-set. The logical processor enters transactional execution. If the lock was available before the XACQUIRE-prefixed instruction, then all other processors will continue to see the lock as available afterwards. Since the transactionally executing logical processor neither added the address of the lock to its write-set nor performed externally visible write operations to the lock, other logical processors can read the lock without causing a data conflict. This allows other logical processors to also enter and concurrently execute the critical section protected by the lock. The processor automatically detects any data conflicts that occur during the transactional execution and will perform a transactional abort if necessary.

Even though the eliding processor does not perform any external write operations to the lock, the hardware ensures program order of operations on the lock. If the eliding processor itself reads the value of the lock in the critical section, it will appear as if the processor had acquired the lock, i.e., the read will return the non-elided value. This behavior allows an HLE execution to be functionally equivalent to an execution without the HLE prefixes.

An XRELEASE prefix can be added in front of the instruction that is used to release the lock protecting the critical section. Releasing the lock involves a write to the lock. If the instruction restores the value of the lock to the value the lock had prior to the XACQUIRE-prefixed lock acquire operation on the same lock, then the processor elides the external write request associated with the release of the lock and does not add the address of the lock to the write-set. The processor then attempts to commit the transactional execution.

With HLE, if multiple threads execute critical sections protected by the same lock but do not perform any conflicting operations on each other's data, then the threads can execute concurrently and without serialization. Even though the software uses lock acquisition operations on a common lock, the hardware recognizes this, elides the lock, and executes the critical sections on the two threads without requiring any communication through the lock, if such communication was dynamically unnecessary.

If the processor is unable to execute the region transactionally, then the processor will execute the region non-transactionally and without elision. HLE-enabled software has the same forward progress guarantees as the underlying non-HLE lock-based execution. For successful HLE execution, the lock and the critical section code must follow certain guidelines. These guidelines only affect performance; failure to follow them will not result in a functional failure. Hardware without HLE support will ignore the XACQUIRE and XRELEASE prefix hints and will not perform any elision, since these prefixes correspond to the REPNE/REPE IA-32 prefixes, which are ignored on the instructions where XACQUIRE and XRELEASE are valid. Importantly, HLE is compatible with the existing lock-based programming model. Improper use of the hints will not cause functional bugs, though it may expose latent bugs already in the code.

Restricted Transactional Memory (RTM) provides a flexible software interface for transactional execution. RTM provides three new instructions (XBEGIN, XEND, and XABORT) for programmers to start, commit, and abort a transactional execution.

The programmer uses the XBEGIN instruction to specify the start of a transactional code region and the XEND instruction to specify the end of the transactional code region. The XBEGIN instruction takes an operand that provides a relative offset to the fallback instruction address, which is used if the RTM region cannot be successfully executed transactionally.

A processor may abort RTM transactional execution for many reasons. In many instances, the hardware automatically detects transactional abort conditions and restarts execution from the fallback instruction address, with the architectural state corresponding to that present at the start of the XBEGIN instruction and the EAX register updated to describe the abort status.

The XABORT instruction allows programmers to abort the execution of an RTM region explicitly. The XABORT instruction takes an 8-bit immediate argument that is loaded into the EAX register and is thus available to software following an RTM abort. RTM instructions do not have any data memory location associated with them. While the hardware provides no guarantees as to whether an RTM region will ever successfully commit transactionally, most transactions that follow the recommended guidelines are expected to commit successfully. However, programmers must always provide an alternative code sequence in the fallback path to guarantee forward progress. This may be as simple as acquiring a lock and executing the specified code region non-transactionally. Further, a transaction that always aborts on a given implementation may complete transactionally on a future implementation. Therefore, programmers must ensure that the code paths for the transactional region and the alternative code sequence are both functionally tested.
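The fallback requirement above follows a common retry-then-lock shape, sketched here in software. Everything in the sketch is hypothetical: `try_transaction` stands in for an XBEGIN ... XEND attempt and is assumed to return True only if the region committed (with no visible side effects otherwise), and the retry count is an arbitrary illustration rather than a recommended value.

```python
import threading


def run_with_fallback(try_transaction, critical_section, fallback_lock, max_retries=3):
    """Attempt the region transactionally a few times, then fall back to a real lock.

    try_transaction is a hypothetical stand-in for an RTM attempt: it must return
    True only when the region committed, and leave no visible updates on an abort.
    """
    for _ in range(max_retries):
        if try_transaction(critical_section):
            return "transactional"
    # Fallback path: acquire the lock and run the same region non-transactionally.
    # This path is what guarantees forward progress.
    with fallback_lock:
        critical_section()
    return "fallback"
```

A real implementation would also have the transactional path observe the fallback lock so that the two paths exclude each other; that detail is not modeled here. As the text notes, both paths need functional testing, since an implementation is free to abort every attempt.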

Detection of HLE support

A processor supports HLE execution if CPUID.07H.EBX.HLE [bit 4] = 1. However, an application can use the HLE prefixes (XACQUIRE and XRELEASE) without checking whether the processor supports HLE. Processors without HLE support ignore these prefixes and will execute the code without entering transactional execution.

Detection of RTM support

A processor supports RTM execution if CPUID.07H.EBX.RTM [bit 11] = 1. An application must check whether the processor supports RTM before it uses the RTM instructions (XBEGIN, XEND, XABORT). These instructions will generate a #UD exception when used on a processor that does not support RTM.

Detection of the XTEST instruction

A processor supports the XTEST instruction if it supports either HLE or RTM. An application must check either of these feature flags before using the XTEST instruction. This instruction will generate a #UD exception when used on a processor that does not support either HLE or RTM.
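The three detection rules reduce to bit tests on the EBX value returned by CPUID leaf 07H. The sketch below only decodes such a value; how the value is obtained (executing CPUID, typically through a compiler intrinsic or operating system facility) is outside the sketch, and the function names are hypothetical.

```python
HLE_BIT = 4   # CPUID.07H.EBX.HLE, as stated in the text
RTM_BIT = 11  # CPUID.07H.EBX.RTM, as stated in the text


def supports_hle(ebx):
    """True if the EBX value from CPUID leaf 07H reports HLE support."""
    return bool((ebx >> HLE_BIT) & 1)


def supports_rtm(ebx):
    """True if the EBX value from CPUID leaf 07H reports RTM support."""
    return bool((ebx >> RTM_BIT) & 1)


def supports_xtest(ebx):
    """XTEST is available when either HLE or RTM is supported."""
    return supports_hle(ebx) or supports_rtm(ebx)
```

An application would gate its use of XBEGIN, XEND, XABORT, and XTEST on these checks, while the HLE prefixes can be emitted unconditionally because processors without HLE ignore them.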

Querying transactional execution status

The XTEST instruction can be used to determine the transactional status of a transactional region specified by HLE or RTM. Note that, while the HLE prefixes are ignored on processors that do not support HLE, the XTEST instruction will generate a #UD exception when used on processors that do not support either HLE or RTM.

Requirements for HLE locks

For an HLE execution to commit transactionally, the lock must satisfy certain properties and access to the lock must follow certain guidelines.

An XRELEASE-prefixed instruction must restore the value of the elided lock to the value it had before the lock acquisition. This allows the hardware to safely elide locks by not adding them to the write-set. The data size and data address of the lock release (XRELEASE-prefixed) instruction must match those of the lock acquire (XACQUIRE-prefixed) instruction, and the lock must not cross a cache line boundary.
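The three requirements above can be expressed as one predicate. This is a sketch for illustration: the function name, its parameters, and the 64-byte default line size are assumptions, and real hardware performs these checks internally rather than through any software-visible call.

```python
def release_can_be_elided(acq_addr, acq_size, value_before_acquire,
                          rel_addr, rel_size, value_written_by_release,
                          line_bytes=64):
    """Check the stated HLE lock requirements for an acquire/release pair.

    line_bytes is an assumed cache line size; the real value depends on the implementation.
    """
    # The release must address the same bytes, with the same operand size, as the acquire.
    same_operand = acq_addr == rel_addr and acq_size == rel_size
    # The release must restore the value the lock held before the acquire.
    restores_value = value_written_by_release == value_before_acquire
    # The lock's first and last byte must fall in the same cache line.
    within_one_line = acq_addr // line_bytes == (acq_addr + acq_size - 1) // line_bytes
    return same_operand and restores_value and within_one_line
```

For example, a 4-byte lock at address 0x103E fails the last check with 64-byte lines, because its final byte falls in the next line.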

Software should not write to the elided lock inside a transactional HLE region with any instruction other than an XRELEASE-prefixed instruction; otherwise, such a write may cause a transactional abort. In addition, recursive locking (where a thread acquires the same lock multiple times without first releasing it) may also cause a transactional abort. Note that software can observe the result of the elided lock acquire inside the critical section. Such a read operation will return the value of the write to the lock.

The processor automatically detects violations of these guidelines and safely transitions to a non-transactional execution without elision. Since Intel TSX detects conflicts at the granularity of a cache line, writes to data collocated on the same cache line as the elided lock may be detected as data conflicts by other logical processors eliding the same lock.

Transactional nesting

Both HLE and RTM support nested transactional regions. However, a transactional abort restores state to the operation that started transactional execution: either the outermost XACQUIRE-prefixed HLE-eligible instruction or the outermost XBEGIN instruction. The processor treats all nested transactions as one transaction.

HLE nesting and elision

Programmers can nest HLE regions up to an implementation-specific depth of MAX_HLE_NEST_COUNT. Each logical processor tracks the nesting count internally, but this count is not available to software. An XACQUIRE-prefixed HLE-eligible instruction increments the nesting count, and an XRELEASE-prefixed HLE-eligible instruction decrements it. The logical processor enters transactional execution when the nesting count goes from zero to one. The logical processor attempts to commit only when the nesting count becomes zero. A transactional abort may occur if the nesting count exceeds MAX_HLE_NEST_COUNT.
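The nesting rules can be illustrated with a small counter model. The class name, the returned labels, and the default limit of 7 are hypothetical; the text says only that the limit is implementation specific and that exceeding it may cause an abort, which this sketch treats as always aborting.

```python
class HleNestCounter:
    """Software model of the per-logical-processor HLE nesting count."""

    def __init__(self, max_nest=7):  # 7 is an arbitrary stand-in for MAX_HLE_NEST_COUNT
        self.max_nest = max_nest
        self.count = 0

    def xacquire(self):
        if self.count + 1 > self.max_nest:
            # Exceeding the limit: modeled here as an abort of the whole nest.
            self.count = 0
            return "abort"
        self.count += 1
        # Transactional execution begins on the 0 -> 1 transition only.
        return "enter" if self.count == 1 else "nested"

    def xrelease(self):
        if self.count == 0:
            return "not-transactional"
        self.count -= 1
        # A commit is attempted only when the count returns to zero.
        return "commit-attempt" if self.count == 0 else "nested"
```

With `max_nest=2`, two acquires followed by two releases yield "enter", "nested", "nested", "commit-attempt", while a third consecutive acquire yields "abort".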

In addition to supporting nested HLE regions, the processor can also elide multiple nested locks. The processor tracks a lock for elision beginning with the XACQUIRE-prefixed HLE-eligible instruction for that lock and ending with the XRELEASE-prefixed HLE-eligible instruction for that same lock. The processor can track up to MAX_HLE_ELIDED_LOCKS locks at any one time. For example, if the implementation supports a MAX_HLE_ELIDED_LOCKS value of two and the programmer nests three HLE-identified critical sections (by executing XACQUIRE-prefixed HLE-eligible instructions on three distinct locks without an intervening XRELEASE-prefixed HLE-eligible instruction on any of the locks), then the first two locks will be elided, but the third will not be elided (and will be added to the transaction's write-set). However, the execution will still continue transactionally. Once an XRELEASE for one of the two elided locks is encountered, a subsequent lock acquired through an XACQUIRE-prefixed HLE-eligible instruction will be elided.
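The three-lock example above can be traced with a minimal tracker model. The class and method names are hypothetical, and the model covers only the bookkeeping described in the text: a bounded set of elided locks, with overflow locks going to the write-set.

```python
class ElidedLockTracker:
    """Software model of elision tracking bounded by MAX_HLE_ELIDED_LOCKS."""

    def __init__(self, max_elided=2):  # 2 matches the example in the text
        self.max_elided = max_elided
        self.elided = set()     # lock addresses currently tracked for elision
        self.write_set = set()  # locks that had to be acquired with a real write

    def xacquire(self, lock_addr):
        """Return True if the lock is elided, False if it is actually written."""
        if len(self.elided) < self.max_elided:
            self.elided.add(lock_addr)
            return True
        # No tracking slot free: the lock is written and joins the write-set,
        # but execution still continues transactionally.
        self.write_set.add(lock_addr)
        return False

    def xrelease(self, lock_addr):
        # Releasing an elided lock frees its tracking slot for a later acquire.
        self.elided.discard(lock_addr)
```

Acquiring locks A, B, and C in that order elides A and B and writes C; after an XRELEASE on A, a subsequent acquire of a fourth lock is elided again.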

The processor attempts to commit the HLE execution when all elided XACQUIRE and XRELEASE pairs have been matched, the nesting count has gone to zero, and the locks have satisfied the requirements. If the execution cannot commit atomically, then execution transitions to a non-transactional execution without elision, as if the first instruction did not have an XACQUIRE prefix.

RTM nesting

Programmers can nest RTM regions up to an implementation-specific MAX_RTM_NEST_COUNT. The logical processor tracks the nesting count internally, but this count is not available to software. An XBEGIN instruction increments the nesting count, and an XEND instruction decrements it. The logical processor attempts to commit only when the nesting count becomes zero. A transactional abort occurs if the nesting count exceeds MAX_RTM_NEST_COUNT.

Nesting HLE and RTM

HLE and RTM provide two alternative software interfaces to a common transactional execution capability. Transactional processing behavior is implementation specific when HLE and RTM are nested together, e.g., when HLE is inside RTM or RTM is inside HLE. However, in all cases, the implementation will maintain HLE and RTM semantics. An implementation may choose to ignore HLE hints when they are used inside RTM regions, and may cause a transactional abort when RTM instructions are used inside HLE regions. In the latter case, the transition from transactional to non-transactional execution occurs seamlessly, since the processor will re-execute the HLE region without actually performing elision and then execute the RTM instructions.

Abort status definition

RTM uses the EAX register to communicate abort status to software. Following an RTM abort, the EAX register has the following definition.

Table 1

The reason of EAX abort state of RTM only provides suspension.It does not pass through itself to whether stopping to the region RTM Or it submits and is encoded.The value of EAX can be 0 after RTM suspension.For example, the cpuid instruction used when in the region RTM Cause transactional to stop, and is not able to satisfy the requirement for setting any of EAX.It is 0 that this, which will lead to EAX value,.

RTM Memory Ordering

A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK-prefixed instruction.

The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and are not made visible to any other logical processor.

RTM-Enabled Debugger Support

By default, any debug exception inside an RTM region causes a transactional abort and redirects control flow to the fallback instruction address, with architectural state recovered and bit 4 in EAX set. However, to allow software debuggers to intercept execution on debug exceptions, the RTM architecture provides an additional capability.

If bit 11 of DR7 and bit 15 of IA32_DEBUGCTL_MSR are both 1, any RTM abort due to a debug exception (#DB) or breakpoint exception (#BP) causes execution to roll back and restart from the XBEGIN instruction instead of the fallback address. In this scenario, the EAX register is also restored to its value at the point of the XBEGIN instruction.

Programming Considerations

Typical programmer-identified regions are expected to execute transactionally and commit successfully. However, Intel TSX does not provide any such guarantee. A transactional execution may abort for many reasons. To take full advantage of the transactional capabilities, programmers should follow certain guidelines to increase the probability that their transactional execution commits successfully.

This section discusses various events that may cause transactional aborts. The architecture ensures that updates performed within a transaction that subsequently aborts will never become visible. Only a committed transactional execution updates architectural state. Transactional aborts never cause functional failures and only affect performance.

Instruction-Based Considerations

Programmers can safely use any instruction inside a transaction (HLE or RTM) and can use transactions at any privilege level. However, some instructions will always abort the transactional execution and cause execution to transition seamlessly and safely to a non-transactional path.

Intel TSX allows most common instructions to be used inside transactions without causing aborts. The following operations inside a transaction do not typically cause an abort:

·Operations on the instruction pointer register, general purpose registers (GPRs), and the status flags (CF, OF, SF, PF, AF, and ZF); and

·Operations on XMM and YMM registers and the MXCSR register.

However, programmers must be careful when intermixing SSE and AVX operations inside a transactional region. Intermixing SSE instructions that access XMM registers with AVX instructions that access YMM registers may cause transactions to abort. Programmers may use REP/REPNE-prefixed string operations inside transactions, but long strings may cause aborts. Further, the CLD and STD instructions may cause aborts if they change the value of the DF flag. If DF is already 1, the STD instruction will not cause an abort. Similarly, if DF is already 0, the CLD instruction will not cause an abort.

Instructions not enumerated here as causing aborts will typically not cause a transaction to abort when used inside a transaction (examples include, but are not limited to, MFENCE, LFENCE, SFENCE, RDTSC, and RDTSCP).

The following instructions will abort transactional execution on any implementation：

·XABORT

·CPUID

·PAUSE

In addition, in some implementations, the following instructions may always cause transactional aborts. These instructions are not expected to be commonly used inside typical transactional regions. However, programmers must not rely on these instructions to force a transactional abort, because whether they cause transactional aborts is implementation dependent.

·Operations on X87 and MMX architecture state. This includes all MMX and X87 instructions, including the FXRSTOR and FXSAVE instructions.

·Updates to the non-status portion of EFLAGS：CLI, STI, POPFD, POPFQ, CLTS.

·Instructions that update segment registers, debug registers, and/or control registers：MOV to DS/ES/FS/GS/SS, POP DS/ES/FS/GS/SS, LDS, LES, LFS, LGS, LSS, SWAPGS, WRFSBASE, WRGSBASE, LGDT, SGDT, LIDT, SIDT, LLDT, SLDT, LTR, STR, Far CALL, Far JMP, Far RET, IRET, MOV to DRx, MOV to CR0/CR2/CR3/CR4/CR8, and LMSW.

·Ring transitions：SYSENTER, SYSCALL, SYSEXIT, and SYSRET.

·TLB and cacheability control：CLFLUSH, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint (MOVNTDQA, MOVNTDQ, MOVNTI, MOVNTPD, MOVNTPS, and MOVNTQ).

·Processor state save：XSAVE, XSAVEOPT, and XRSTOR.

·Interrupts：INTn, INTO.

·IO：IN, INS, REP INS, OUT, OUTS, REP OUTS, and their variants.

·VMX：VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, and INVVPID.

·SMX：GETSEC.

·UD2, RSM, RDMSR, WRMSR, HLT, MONITOR, MWAIT, XSETBV, VZEROUPPER, MASKMOVQ, and V/MASKMOVDQU.

Runtime Considerations

In addition to the instruction-based considerations, runtime events may cause transactional execution to abort. These may be due to data access patterns or micro-architectural implementation features. The following list is not a comprehensive discussion of all abort causes.

Any fault or trap in a transaction that must be exposed to software will be suppressed. Transactional execution will abort and execution will transition to a non-transactional execution, as if the fault or trap had never occurred. If an exception is not masked, that unmasked exception will result in a transactional abort and the state will appear as if the exception had never occurred.

Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XF, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally and to require a non-transactional execution. These events are suppressed as if they had never occurred. With HLE, because the non-transactional code path is identical to the transactional code path, these events will typically reappear when the instruction that caused the exception is re-executed non-transactionally, so that the associated synchronous events are delivered appropriately in the non-transactional execution. Asynchronous events (NMI, SMI, INTR, IPI, PMI, etc.) that occur during transactional execution may cause the transactional execution to abort and transition to a non-transactional execution. The asynchronous events are pended and handled after the transactional abort has been processed.

Transactions only support write-back cacheable memory type operations. A transaction may always abort if it includes operations on any other memory type. This includes instruction fetches to the UC memory type.

Memory accesses within a transactional region may require the processor to set the Accessed and Dirty flags of the referenced page table entry. How the processor handles this is implementation specific. Some implementations may allow the updates to these flags to become externally visible even if the transactional region subsequently aborts. Some Intel TSX implementations may choose to abort the transactional execution if these flags need to be updated. Further, a processor's page-table walk may generate accesses to its own transactionally written but uncommitted state. Some Intel TSX implementations may choose to abort the execution of a transactional region in such situations. Regardless, the architecture ensures that, if the transactional region aborts, the transactionally written state will not be made architecturally visible through the behavior of structures such as TLBs.

Executing self-modifying code transactionally may also cause transactional aborts. Programmers must continue to follow the Intel-recommended guidelines for writing self-modifying and cross-modifying code even when employing HLE and RTM. While an implementation of RTM and HLE will typically provide sufficient resources for executing common transactional regions, implementation constraints and excessive sizes of transactional regions may cause a transactional execution to abort and transition to a non-transactional execution. The architecture provides no guarantee of the amount of resources available for transactional execution and does not guarantee that a transactional execution will ever succeed.

Conflicting requests to a cache line accessed within a transactional region may prevent the transaction from executing successfully. For example, if logical processor P0 reads line A in a transactional region and another logical processor P1 writes line A (either inside or outside a transactional region), then logical processor P0 may abort if the write by logical processor P1 interferes with the ability of processor P0 to execute transactionally.

Similarly, if P0 writes line A in a transactional region and P1 reads or writes line A (either inside or outside a transactional region), then P0 may abort if P1's access to line A interferes with P0's ability to execute transactionally. In addition, other coherence traffic may at times appear as conflicting requests and may cause aborts. While these false conflicts may happen, they are expected to be uncommon. The conflict resolution policy that determines whether P0 or P1 aborts in the above scenarios is implementation specific.
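The two scenarios above reduce to a simple rule on P0's read-set and write-set. The following is a minimal sketch with hypothetical names; it only reports that a conflicting request exists and does not model false conflicts or the implementation-specific choice of which processor aborts.

```python
def p0_may_abort(p0_read_set, p0_write_set, line, p1_is_write):
    """Return True if P1's access to `line` conflicts with P0's transaction.

    p0_read_set / p0_write_set: sets of cache-line addresses touched by P0.
    p1_is_write: True if P1 writes the line, False if P1 only reads it.
    """
    if line in p0_write_set:
        return True  # P0 wrote the line: any read or write by P1 conflicts
    # P0 only read the line: only a write by P1 conflicts
    return line in p0_read_set and p1_is_write
```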

Generic Transactional Execution Embodiments：

According to "ARCHITECTURES FOR TRANSACTIONAL MEMORY", a dissertation by Austen McDonald submitted in June 2009 to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, which is incorporated herein by reference in its entirety, three mechanisms are needed to implement an atomic and isolated transactional region：versioning, conflict detection, and contention management.

To make a transactional code region appear atomic, all modifications performed by that region must be stored and kept isolated from other transactions until commit time. The system does this by implementing a versioning policy. Two versioning paradigms exist：eager and lazy. An eager versioning system stores newly generated transactional values in place and stores the previous memory values on the side, in what is called an undo log. A lazy versioning system stores new values temporarily in what is called a write buffer and copies them to memory only on commit. In either system, the cache is used to optimize the storage of new versions.
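The difference between the two versioning paradigms can be sketched in software as follows. This is an illustrative model only: memory is a plain dictionary whose addresses already exist, the class and method names are hypothetical, and no caching or conflict detection is modeled.

```python
class EagerTx:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, old_value) pairs, in write order

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value  # new value is visible in memory at once

    def commit(self):
        self.undo_log.clear()  # new values are already in place

    def abort(self):
        # Unroll the log in reverse so the oldest saved value is restored last.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Lazy versioning: buffer new values, copy them to memory on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value):
        self.write_buffer[addr] = value  # memory keeps the old version

    def read(self, addr):
        # The transaction must see its own buffered writes first.
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # nothing was written to memory
```

The sketch shows the usual trade-off: the eager variant has a trivial commit and a costly abort, while the lazy variant has a trivial abort and does its copying at commit.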

To ensure that transactions appear to be performed atomically, conflicts must be detected and resolved. Both eager and lazy versioning systems detect conflicts by implementing a conflict detection policy, either optimistic or pessimistic. An optimistic system executes transactions in parallel and checks for conflicts only when a transaction commits. A pessimistic system checks for conflicts at each load and store. Like versioning, conflict detection also uses the cache, marking each line as part of the read-set, part of the write-set, or both. Both systems resolve conflicts by implementing a contention management policy. Many contention management policies exist; some are more appropriate for optimistic conflict detection and some for pessimistic conflict detection. Some example policies are described below.

Since each transactional memory (TM) system needs both versioning and conflict detection, these options give rise to four distinct TM designs：Eager-Pessimistic (EP), Eager-Optimistic (EO), Lazy-Pessimistic (LP), and Lazy-Optimistic (LO). Table 2 briefly describes all four TM designs.

Figs. 1 and 2 show an example of a multicore TM environment. Fig. 1 shows many TM-enabled CPUs (CPU1 114a, CPU2 114b, etc.) on one die 100, connected by an interconnect 122 under the management of an interconnect control 120a, 120b. Each CPU 114a, 114b (also known as a processor) may have a split cache consisting of an instruction cache 116a, 116b for caching instructions from memory to be executed and a data cache 118a, 118b with TM support for caching data (operands) of memory locations to be operated on by the CPU 114a, 114b. In an implementation, the caches of multiple dies 100 are interconnected to support cache coherency between the caches of the multiple dies 100. In an implementation, a single cache, rather than a split cache, holds both instructions and data. In implementations, the CPU caches are one level of caching in a hierarchical cache structure. For example, each die 100 may employ a shared cache 124 that is shared among all the CPUs 114a, 114b on the die 100. In another implementation, each die 100 may have access to a shared cache 124 that is shared among all the processors of all the dies 100.

Fig. 2 shows the details of an example transactional CPU 114, including the additions that support TM. The transactional CPU 114 (processor) may include hardware for supporting register checkpoints 126 and special TM registers 128. The transactional CPU cache may have the MESI bits 130, tags 140, and data 142 of a conventional cache, but also, for example, R bits 132 indicating that a line has been read by the CPU 114 while executing a transaction and W bits 138 indicating that a line has been written by the CPU 114 while executing a transaction.

A key detail for programmers in any TM system is how non-transactional accesses interact with transactions. By design, transactional accesses are screened from each other using the mechanisms above. However, the interaction between a regular, non-transactional load and a transaction containing a new value for that address must still be considered. In addition, the interaction between a non-transactional store and a transaction that has read that address must also be explored. These are issues of the database concept of isolation.

A TM system is said to implement strong isolation, sometimes called strong atomicity, when every non-transactional load and store acts like an atomic transaction. Therefore, non-transactional loads cannot see uncommitted data, and non-transactional stores cause atomicity violations in any transaction that has read that address. A system where this is not the case is said to implement weak isolation, sometimes called weak atomicity.

Strong isolation is often more desirable than weak isolation because strong isolation is relatively easy to conceptualize and implement. Additionally, if a programmer has forgotten to surround some shared memory references with transactions, causing bugs, then with strong isolation the programmer will often detect that oversight using a simple debug interface, because the programmer will see a non-transactional region causing atomicity violations. Also, programs written in one model may work differently in another model.

Further, strong isolation is often easier to support in hardware TM than weak isolation. With strong isolation, since the coherence protocol already manages load and store communication between processors, transactions can detect non-transactional loads and stores and act appropriately. To implement strong isolation in software transactional memory (STM), non-transactional code must be modified to include read and write barriers, potentially crippling performance. Although great effort has been expended to remove many unneeded barriers, such techniques are often complex, and performance is typically far lower than that of HTMs.

Table 2

Table 2 shows the Basic Design space of transaction memory (Version Control and collision detection).

Eager-Pessimistic (EP)

The first TM design described below is known as Eager-Pessimistic. An EP system stores its write-set "in place" (hence the name "eager") and, to support rollback, stores the old values of overwritten lines in an "undo log". Processors use the W 138 and R 132 cache bits to track read-sets and write-sets and detect conflicts when receiving snooped load requests. Perhaps the most notable examples of EP systems in the known literature are LogTM and UTM.

Beginning a transaction in an EP system is much like beginning a transaction in other systems：tm_begin() takes a register checkpoint and initializes any status registers. An EP system also requires initializing the undo log, the details of which depend on the log format but often involve initializing a log base pointer to a region of pre-allocated, thread-private memory and clearing a log bounds register.

Versioning：In EP, because of the way eager versioning is designed to function, the MESI 130 state transitions (cache line indicators corresponding to the Modified, Exclusive, Shared, and Invalid code states) are left mostly unchanged. Outside of a transaction, the MESI 130 state transitions are left completely unchanged. When a line is read inside a transaction, the standard coherence transitions apply (S (Shared) → S, I (Invalid) → S, or I → E (Exclusive)), issuing a load miss as needed, but the R 132 bit is also set. Likewise, writing a line applies the standard transitions (S → M, E → I, I → M), issuing a miss as needed, but also sets the W 138 (Written) bit. The first time a line is written, the old version of the entire line is loaded and then written to the undo log to preserve it in case the current transaction aborts. The newly written data is then stored "in place", over the old data.

Conflict Detection：Pessimistic conflict detection uses the coherence messages exchanged on misses or upgrades to look for conflicts between transactions. When a read miss occurs within a transaction, other processors receive a load request, but they ignore the request if they do not have the needed line. If the other processors have the needed line non-speculatively or have the line R 132 (Read), they downgrade that line to S and, in certain cases, issue a cache-to-cache transfer if they have the line in the M or E state of MESI 130. However, if the cache has the line W 138, a conflict is detected between the two transactions and additional action must be taken.

Similarly, when a transaction seeks to upgrade a line from shared to modified (on a first write), the transaction issues an exclusive load request, which is also used to detect conflicts. If a receiving cache has the line non-speculatively, the line is invalidated and, in certain cases, a cache-to-cache transfer is issued (M or E states). However, if the line is R 132 or W 138, a conflict is detected.
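The snoop handling described in the two paragraphs above can be summarized as a small decision function. This is a sketch under stated assumptions: the `Line` class, the `snoop` function, and the returned action strings are hypothetical names, and cache-to-cache transfers are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Line:
    state: str = "I"  # MESI state: "M", "E", "S", or "I"
    r: bool = False   # R bit: read by the local transaction
    w: bool = False   # W bit: written by the local transaction


def snoop(line, exclusive):
    """React to a snooped load (exclusive=False) or exclusive load request."""
    if line.state == "I":
        return "ignore"  # the receiving cache does not hold the line
    if line.w or (exclusive and line.r):
        return "conflict"  # W always conflicts; R conflicts with exclusive loads
    if exclusive:
        line.state = "I"  # non-speculative copy is invalidated
        return "invalidate"
    line.state = "S"  # non-speculative or R copy is downgraded to Shared
    return "downgrade"
```

A returned "conflict" is where a contention management policy would take over to decide which transaction aborts.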

Validation：Because conflict detection is performed on every load, a transaction always has exclusive access to its own write-set. Therefore, validation does not require any additional work.

Commit：Since eager versioning stores the new version of data items in place, the commit process simply clears the W 138 and R 132 bits and discards the undo log.

Abort：When a transaction rolls back, the original version of each cache line in the undo log must be restored, a process called "unrolling" or "applying" the log. This is done during tm_discard() and must be atomic with respect to other transactions. Specifically, the write-set must still be used to detect conflicts：this transaction holds the only correct version of the lines in its undo log, and requesting transactions must wait for the correct version to be restored from that log. Such a log can be applied using a hardware state machine or a software abort handler.

Eager-Pessimistic has the following characteristics：Commit is simple and, because it is in place, very fast. Similarly, validation is a no-op. Pessimistic conflict detection detects conflicts early, thereby reducing the number of "doomed" transactions. For example, if two transactions are involved in a read-after-write dependency, that dependency is detected immediately in pessimistic conflict detection. In optimistic conflict detection, however, such conflicts are not detected until the writer commits.

Eager-Pessimistic also has the following characteristics：As described above, the first time a cache line is written, the old value must be written to the log, incurring extra cache accesses. Aborts are expensive because they require undoing the log. For each cache line in the log, a load must be issued, perhaps going as far as main memory before continuing to the next line. Pessimistic conflict detection also prevents certain serializable schedules from existing.

Additionally, because conflicts are handled as they occur, there is a potential for livelock, and careful contention management mechanisms must be employed to guarantee forward progress.

Lazy-optimistic (LO)

Another popular TM design is Lazy-Optimistic (LO), which stores its write-set in a "write buffer" or "redo log" and detects conflicts at commit time (still using the R 132 and W 138 bits).

Versioning：Just as in the EP system, the MESI protocol of the LO design is enforced outside of transactions. Once inside a transaction, reading a line incurs the standard MESI transitions but also sets the R 132 bit. Likewise, writing a line sets the W 138 bit of the line, but the handling of the MESI transitions in the LO design differs from that of the EP design. First, with lazy versioning, the new versions of written data are stored in the cache hierarchy until commit, while other transactions have access to the old versions available in memory or in other caches. To make the old versions available, dirty lines (M lines) must be evicted when first written by a transaction. Second, no upgrade misses are needed because of the optimistic conflict detection feature：if a transaction has a line in the S state, it can simply write to it and upgrade that line to the M state without communicating the change to other transactions, because conflict detection is done at commit time.

Conflict Detection and Validation：To validate a transaction and detect conflicts, LO communicates the addresses of speculatively modified lines to other transactions only when it is preparing to commit. On validation, the processor sends one, potentially large, network packet containing all the addresses in the write-set. Data is not sent but is left in the cache of the committer and marked dirty (M). To build this packet without searching the cache for lines marked W, a simple bit vector called a "store buffer" is used, with one bit per cache line to track these speculatively modified lines. Other transactions use this address packet to detect conflicts：if an address is found in the cache and the R 132 and/or W 138 bits are set, a conflict is initiated. If the line is found but neither R 132 nor W 138 is set, the line is simply invalidated, which is similar to processing an exclusive load.
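The validation step described above can be sketched from both sides of the address packet. This is a simplified model: the store buffer is represented as a Python set of line addresses rather than a bit vector, the receiving cache is a dictionary, and all function names are hypothetical.

```python
def build_address_packet(store_buffer):
    """Committer side: list every speculatively written line address.

    The store buffer already tracks the W lines, so the cache itself
    does not need to be searched.
    """
    return sorted(store_buffer)


def receive_address_packet(cache, packet):
    """Receiver side: check a packet against the local cache.

    cache maps address -> {"r": bool, "w": bool} for lines held locally.
    Returns True if the packet conflicts with the local transaction.
    """
    conflict = False
    for addr in packet:
        line = cache.get(addr)
        if line is None:
            continue  # line not cached here: nothing to do
        if line["r"] or line["w"]:
            conflict = True  # line belongs to the local read-set or write-set
        else:
            del cache[addr]  # plain copy: invalidate, like an exclusive load
    return conflict
```

Only addresses travel in the packet; the data stays in the committer's cache, as the text describes.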

To support transaction atomicity, these address packets must be handled atomically, that is, no two address packets containing the same addresses may exist at once. In an LO system, this can be achieved by simply acquiring a global commit token before sending the address packet. Alternatively, a two-phase commit scheme could be employed by first sending out the address packet, collecting responses, enforcing an ordering protocol (perhaps oldest transaction first), and committing once all responses are satisfactory.

Commit：Once validation has occurred, commit needs no special treatment：simply clear the W 138 and R 132 bits and the store buffer. The transaction's writes are already marked dirty in the cache, and the copies of these lines in other caches have been invalidated via the address packet. Other processors can then access the committed data through the regular coherence protocol.

Abort：Rollback is equally easy：because the write-set is contained within the local caches, these lines can be invalidated, after which the W 138 and R 132 bits and the store buffer are cleared. The store buffer allows the W lines to be found and invalidated without searching the cache.

Lazy-Optimistic has the following characteristics：Aborts are very fast, requiring no additional loads or stores and making only local changes. More serializable schedules can exist than in EP, which allows an LO system to speculate more aggressively that transactions are independent, which can yield higher performance. Finally, the late detection of conflicts can increase the likelihood of forward progress.

Lazy-Optimistic also has the following characteristics：Validation takes global communication time proportional to the size of the write-set. Doomed transactions can waste work, since conflicts are detected only at commit time.

Lazy-pessimistic (LP)

Lazy-Pessimistic (LP) represents a third TM design option, sitting somewhere between EP and LO：newly written lines are stored in a write buffer, but conflicts are detected on a per-access basis.

Versioning：Versioning is similar but not identical to that of LO：reading a line sets its R bit 132, writing a line sets its W bit 138, and a store buffer is used to track the W lines in the cache. Also, as in LO, dirty (M) lines must be evicted when first written by a transaction. However, since conflict detection is pessimistic, load exclusives must be performed when upgrading a transactional line from I or S to M, unlike in LO.

Conflict Detection：LP's conflict detection operates the same as EP's：coherence messages are used to look for conflicts between transactions.

Validation：As in EP, pessimistic conflict detection ensures that at any point a running transaction has no conflicts with any other running transaction, so validation is a no-op.

Commit：Commit needs no special treatment：as in LO, simply clear the W 138 and R 132 bits and the store buffer.

Abort：Rollback is also like that of LO：simply invalidate the write-set using the store buffer and clear the W 138 and R 132 bits and the store buffer.

LP has the following characteristics：Like LO, aborts are very fast. Like EP, the use of pessimistic conflict detection reduces the number of "doomed" transactions. Like EP, some serializable schedules are not allowed, and conflict detection must be performed on each cache miss.

Eager-Optimistic (EO)

The final combination of versioning and conflict detection is Eager-Optimistic (EO). EO may be a less than optimal choice for HTM systems：since new transactional versions are written in place, other transactions have no choice but to notice conflicts as they occur (that is, as cache misses occur). But since EO waits until commit time to detect conflicts, those transactions become "zombies", continuing to execute and wasting resources, yet "doomed" to abort.

EO has proven to be useful in STMs and is implemented by Bartok-STM and McRT. A lazy versioning STM needs to check its write buffer on each read to ensure that it is reading the most recent value. Since the write buffer is not a hardware structure, this is expensive, hence the preference for write-in-place eager versioning. Additionally, since checking for conflicts is also expensive in an STM, optimistic conflict detection offers the advantage of performing this operation in bulk.

Contention Management

How a transaction rolls back once the system has decided to abort it has been described above. However, since a conflict involves two transactions, the questions of which transaction should abort, how that abort should be initiated, and when the aborted transaction should be retried still need to be explored. These questions are addressed by contention management (CM), a key component of transactional memory. Described below are policies regarding how systems initiate aborts and the various established methods of managing which transaction should abort in a conflict.

Contention Management Policies

A contention management (CM) policy is a mechanism that determines which of the transactions involved in a conflict should abort and when the aborted transaction should be retried. For example, it is often the case that retrying an aborted transaction immediately does not lead to the best performance. Instead, employing a back-off mechanism, which delays the retrying of an aborted transaction, can yield better performance. STMs first grappled with finding the best contention management policies, and many of the policies outlined below were originally developed for STMs.

CM policies draw on a number of measures to make decisions, including the ages of the transactions, the sizes of the read-sets and write-sets, the number of previous aborts, and so on. The combinations of measures for making such decisions are endless, but certain combinations are described below, roughly in order of increasing complexity.

To establish some nomenclature, first note that there are two sides in a conflict：the attacker and the defender. The attacker is the transaction requesting access to a shared memory location. In pessimistic conflict detection, the attacker is the transaction issuing the load or load exclusive. In optimistic conflict detection, the attacker is the transaction attempting to validate. The defender in both cases is the transaction receiving the attacker's request.

An Aggressive CM policy immediately and always retries either the attacker or the defender. In LO, Aggressive means that the attacker always wins, so Aggressive is sometimes called "committer wins". Such a policy was used for the earliest LO systems. In the case of EP, Aggressive can be either "defender wins" or "attacker wins".

Restarting a conflicting transaction that will immediately experience another conflict is bound to waste work, namely interconnect bandwidth spent refilling cache misses. A Polite CM policy employs exponential backoff (although linear backoff could also be used) before restarting a conflicting transaction. To prevent starvation, a situation in which a process does not have resources allocated to it by the scheduler, the exponential backoff greatly increases the odds of transaction success after some n retries.
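A Polite policy's delay schedule might look like the sketch below. The function name, the time unit, and the `base` and `cap` parameters are assumptions for illustration; the text only states that backoff is exponential, with linear as an alternative.

```python
def polite_backoff_delay(retry, base=1.0, cap=1024.0, linear=False):
    """Delay to wait before the given retry (retry counts from 0).

    Exponential mode doubles the delay on every retry; linear mode grows
    it by one `base` step per retry. `cap` bounds the delay (an assumption).
    """
    growth = (retry + 1) if linear else 2 ** retry
    return min(cap, base * growth)
```

With the default parameters, retries 0, 1, 2, and 3 wait 1, 2, 4, and 8 time units respectively, so repeated losers quickly spread out in time.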

Another approach to conflict resolution is to randomly abort either the attacker or the defender (a policy called Randomized). Such a policy may be combined with a randomized backoff scheme to avoid unneeded contention.

However, making random choices when selecting a transaction to abort can result in aborting transactions that have completed "a lot of work", which can waste resources. To avoid such waste, the amount of work completed by a transaction can be taken into account when determining which transaction to abort. One measure of work is a transaction's age. Other methods include Oldest, Bulk TM, Size Matters, Karma, and Polka. Oldest is a simple timestamp method that aborts the younger transaction in a conflict. Bulk TM uses this scheme. Size Matters is like Oldest, but instead of the transaction age it uses the number of read/written words as the priority, reverting to Oldest after a fixed number of aborts. Karma is similar, using the size of the write-set as the priority. Rollback then proceeds after backing off for a fixed amount of time. Aborted transactions keep their priorities after being aborted (hence the name Karma). Polka works like Karma, but instead of backing off for a predefined amount of time, it backs off exponentially more each time.
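The work-based policies above differ mainly in which measure they use as the priority. The sketch below picks the transaction to abort under Oldest, Size Matters, and Karma. The `Tx` fields, the function names, the tie-breaking, and the `max_aborts` threshold are hypothetical; the backoff behavior of Karma and Polka is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Tx:
    name: str
    start_time: int          # smaller value means older transaction
    words_accessed: int = 0  # number of read/written words
    write_set_size: int = 0
    aborts: int = 0          # number of previous aborts


def oldest_victim(a, b):
    """Oldest: abort the younger transaction (later start time)."""
    return a if a.start_time > b.start_time else b


def size_matters_victim(a, b, max_aborts=3):
    """Size Matters: abort the one that touched fewer words,
    reverting to Oldest after a fixed number of aborts."""
    if max(a.aborts, b.aborts) >= max_aborts:
        return oldest_victim(a, b)
    return a if a.words_accessed < b.words_accessed else b


def karma_victim(a, b):
    """Karma: abort the one with the smaller write-set."""
    return a if a.write_set_size < b.write_set_size else b
```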

Since aborting wastes work, it is logical to argue that stalling an attacker until the defender has finished its transaction would lead to better performance. Unfortunately, such a simple scheme easily leads to deadlock.

Deadlock avoidance techniques can be used to solve this problem. Greedy uses two rules to avoid deadlock. The first rule is that if a first transaction, T1, has lower priority than a second transaction, T0, or if T1 is waiting for another transaction, then T1 aborts when conflicting with T0. The second rule is that if T1 has higher priority than T0 and is not waiting, then T0 waits until T1 commits, aborts, or starts waiting (in which case the first rule applies). Greedy provides some guarantees about time bounds for executing a set of transactions. One EP design (LogTM) uses a CM policy similar to Greedy to achieve stalling with conservative deadlock avoidance.
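The two Greedy rules reduce to a single decision per conflict. The sketch below is illustrative only; the function name, the string results, and the requirement that priorities be distinct are assumptions, since the rules above do not say what happens on equal priority.

```python
def greedy_resolve(t1_priority, t0_priority, t1_waiting):
    """Decide the outcome when transaction T1 conflicts with transaction T0.

    Rule 1: T1 aborts if it has lower priority than T0 or is already waiting.
    Rule 2: otherwise (T1 has higher priority and is not waiting) T0 waits
            until T1 commits, aborts, or starts waiting.
    """
    if t1_priority == t0_priority:
        # Assumption: priorities are unique (for example, start timestamps).
        raise ValueError("sketch assumes distinct priorities")
    if t1_priority < t0_priority or t1_waiting:
        return "T1_ABORTS"
    return "T0_WAITS"
```

Because a waiting transaction always loses under rule 1, no chain of transactions can wait on one another in a cycle, which is what rules out deadlock.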

Example MESI coherency rules provide four possible states in which a cache line of a multiprocessor cache system may reside, M, E, S, and I, defined as follows:

Modified (M): The cache line is present only in the current cache and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.

Exclusive (E): The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time in response to a read request. Alternatively, it may be changed to the Modified state when it is written.

Shared (S): Indicates that the cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.

Invalid (I): Indicates that the cache line is invalid (unused).
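The state changes named in the four definitions above can be collected into a small transition table. This is a sketch of only those transitions that the definitions mention; the event names are hypothetical labels, and a full MESI protocol has additional transitions that are not modelled here.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# Only the transitions stated in the definitions above; event names are assumptions.
TRANSITIONS = {
    (State.M, "write_back"): State.E,   # writing back makes the line clean
    (State.E, "remote_read"): State.S,  # another cache reads the line
    (State.E, "local_write"): State.M,  # the owning CPU writes the line
    (State.S, "discard"): State.I,      # a shared line may be dropped at any time
}


def next_state(state, event):
    """Return the new state, or the unchanged state for unmodelled events."""
    return TRANSITIONS.get((state, event), state)
```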

TM coherency status indicators (R 132, W 138) may be provided for each cache line, in addition to, or encoded in, the MESI coherency bits. An R 132 indicator indicates that the current transaction has read from the data of the cache line, and a W 138 indicator indicates that the current transaction has written to the data of the cache line.

In another aspect of TM design, a system is designed using transactional store buffers. U.S. Patent No. 6349361, titled "Methods and Apparatus for Reordering and Renaming Memory References in a Multiprocessor Computer System", filed March 31, 2000 and incorporated herein by reference in its entirety, teaches a method for reordering and renaming memory references in a multiprocessor computer system having at least a first and a second processor. The first processor has a first private cache and a first buffer, and the second processor has a second private cache and a second buffer. The method includes the steps of, for each of a plurality of gated store requests received by the first processor to store a datum, exclusively acquiring a cache line that contains the datum by the first private cache, and storing the datum in the first buffer. When the first buffer receives a load request from the first processor to load a particular datum, the particular datum is provided to the first processor from among the data stored in the first buffer, based on an in-order sequence of load and store operations. When the first cache receives a load request from the second cache for a given datum, an error condition is indicated, and the current state of at least one of the processors is reset to an earlier state when the load request for the given datum corresponds to the data stored in the first buffer.

The main implementation components of one such transactional memory facility are a transaction-backup register file for holding pre-transaction GR (general register) content, a cache directory to track the cache lines accessed during the transaction, a store cache to buffer stores until the transaction ends, and firmware routines to perform various complex functions. In this section a detailed implementation is described.

IBM zEnterprise EC12 enterprise servers embodiment

The IBM zEnterprise EC12 enterprise server introduces transactional execution (TX) in transactional memory, and is described in part in the paper "Transactional Memory Architecture and Implementation for IBM System z", pages 25-36 of the proceedings presented at MICRO-45, December 1-5, 2012, Vancouver, British Columbia, Canada, available from IEEE Computer Society Conference Publishing Services (CPS), which is incorporated herein by reference in its entirety.

Table 3 shows an example transaction. Transactions started with TBEGIN are not assured to ever complete successfully with TEND, since they can experience an aborting condition at every attempted execution, for example due to repeating conflicts with other CPUs. This requires that the program support a fallback path that performs the same operation non-transactionally, for example by using traditional locking schemes. This puts a significant burden on the programming and software verification teams, especially where the fallback path is not automatically generated by a reliable compiler.

Table 3

The requirement to provide a fallback path for aborted transactional execution (TX) transactions can be onerous. Many transactions operating on shared data structures are expected to be short, touch only a few distinct memory locations, and use only simple instructions. For those transactions, the IBM zEnterprise EC12 introduces the concept of constrained transactions; under normal conditions, the CPU 114 assures that constrained transactions eventually end successfully, albeit without giving a strict limit on the number of necessary retries. A constrained transaction starts with a TBEGINC instruction and ends with a regular TEND. Implementing a task as a constrained or a non-constrained transaction typically results in very comparable performance, but constrained transactions simplify software development by removing the need for a fallback path. IBM's transactional execution architecture is further described in z/Architecture, Principles of Operation, Tenth Edition, SA22-7832-09, published by IBM in September 2012, which is incorporated herein by reference in its entirety.

A constrained transaction starts with the TBEGINC instruction. A transaction initiated with TBEGINC must follow a list of programming constraints; otherwise the program takes a non-filterable constraint-violation interruption. Exemplary constraints may include, but are not limited to: the transaction can execute a maximum of 32 instructions, and all instruction text must lie within 256 consecutive bytes of memory; the transaction contains only forward-pointing relative branches (that is, no loops or subroutine calls); the transaction can access a maximum of 4 aligned octowords (an octoword is 32 bytes) of memory; and the instruction set is restricted to exclude complex instructions such as decimal or floating-point operations. The constraints are chosen so that many common operations, such as doubly linked list insert/delete operations, can be performed, including the very powerful concept of an atomic compare-and-swap targeting up to 4 aligned octowords. At the same time, the constraints were chosen conservatively so that future CPU implementations can assure transaction success without needing to adjust the constraints, since that would otherwise lead to software incompatibility.
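The exemplary constraints can be expressed as a simple static check. The sketch below is a software model for illustration only: the function names and the way a transaction is summarised (instruction count, text size, list of memory accesses) are assumptions, and in the described system the checks are enforced by the CPU during execution.

```python
OCTOWORD = 32  # bytes


def count_octowords(accesses):
    """Count the distinct aligned 32-byte octowords touched.

    `accesses` is an iterable of (address, length_in_bytes) pairs.
    """
    touched = set()
    for addr, length in accesses:
        first = addr // OCTOWORD
        last = (addr + length - 1) // OCTOWORD
        touched.update(range(first, last + 1))
    return len(touched)


def constraint_violations(instruction_count, text_bytes,
                          only_forward_relative_branches,
                          accesses, uses_complex_instructions):
    """Return a list of the exemplary constraints that a transaction breaks."""
    problems = []
    if instruction_count > 32:
        problems.append("more than 32 instructions")
    if text_bytes > 256:
        problems.append("instruction text exceeds 256 consecutive bytes")
    if not only_forward_relative_branches:
        problems.append("loop or subroutine call present")
    if count_octowords(accesses) > 4:
        problems.append("more than 4 aligned octowords accessed")
    if uses_complex_instructions:
        problems.append("decimal or floating-point instruction used")
    return problems
```

An access that straddles a 32-byte boundary counts as two octowords, which is why `count_octowords` works on address ranges rather than on start addresses alone.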

TBEGINC mostly behaves like XBEGIN in TSX or TBEGIN on IBM's zEC12 servers, except that the floating-point register (FPR) control and the program interruption filtering fields do not exist and the controls are considered to be zero. On a transaction abort, the instruction address is set back directly to the TBEGINC instead of to the instruction after it, reflecting the immediate retry and the absence of an abort path for constrained transactions.

Nested transactions are not allowed within constrained transactions, but if a TBEGINC occurs within a non-constrained transaction it is treated as opening a new non-constrained nesting level, just as TBEGIN would. This can occur, for example, if a non-constrained transaction calls a subroutine that uses a constrained transaction internally.

Since interruption filtering is implicitly off, all exceptions during a constrained transaction lead to an interruption into the operating system (OS). Eventual successful completion of the transaction relies on the capability of the OS to page in the at most 4 pages touched by any constrained transaction. The OS must also ensure that time slices are long enough to allow the transaction to complete.

Table 4

Table 4 shows the constrained-transactional implementation of the code in Table 3, assuming that the constrained transactions do not interact with other locking-based code. No lock testing is therefore shown, but it could be added if constrained transactions and lock-based code were mixed.

When failure occurs repeatedly, software emulation is performed using millicode as part of the system firmware. Advantageously, constrained transactions have desirable properties because of the burden they remove from programmers.

The IBM zEnterprise EC12 processor introduced the transactional execution facility. The processor can decode 3 instructions per clock cycle; simple instructions are dispatched as single micro-ops, and more complex instructions are cracked into multiple micro-ops 232b. The micro-ops (Uops 232b, shown in Fig. 3) are written into a unified issue queue 216, from which they can be issued out of order. Up to two fixed-point, one floating-point, two load/store, and two branch instructions can execute every cycle. A global completion table (GCT) 232 holds every micro-op and a transaction nesting depth (TND) 232a. The GCT 232 is written in order at decode time, tracks the execution status of each micro-op, and completes instructions when all micro-ops 232b of the oldest instruction group have executed successfully.

The level 1 (L1) data cache 240 (Fig. 3) is a 96 KB (kilobyte) 6-way associative cache with 256-byte cache lines and a 4-cycle use latency, coupled to a private 1 MB (megabyte) 8-way associative level 2 (L2) data cache 268 (Fig. 3) with a 7-cycle use-latency penalty for L1 misses. The L1 cache 240 (Fig. 3) is the cache closest to a processor, and an Ln cache is a cache at the nth level of caching. Both the L1 240 (Fig. 3) and L2 268 (Fig. 3) caches are store-through. Six cores on each central processor (CP) chip share a 48 MB level 3 store-in cache, and six CP chips are connected to an off-chip 384 MB level 4 cache, packaged together on a glass ceramic multi-chip module (MCM). Up to 4 multi-chip modules (MCMs) can be connected to form a coherent symmetric multiprocessor (SMP) system with up to 144 cores (not all cores are available to run customer workload).

Coherency is managed with a variant of the MESI protocol. Cache lines can be owned read-only (shared) or exclusive; the L1 240 (Fig. 3) and L2 268 (Fig. 3) are store-through and thus do not contain dirty lines. The L3 and L4 caches are store-in and track dirty states. Each cache is inclusive of all its connected lower-level caches.

Coherency requests are called "cross interrogates" (XI) and are sent hierarchically from higher-level to lower-level caches, and between the L4s. When one core misses the L1 240 (Fig. 3) and L2 268 (Fig. 3) and requests the cache line from its local L3, the L3 checks whether it owns the line and, if necessary, sends an XI to the currently owning L2 268 (Fig. 3)/L1 240 (Fig. 3) under that L3 to ensure coherency before it returns the cache line to the requestor. If the request also misses the L3, the L3 sends a request to the L4, which enforces coherency by sending XIs to all necessary L3s under that L4 and to the neighboring L4s. The L4 then responds to the requesting L3, which forwards the response to the L2 268 (Fig. 3)/L1 240 (Fig. 3).

Note that, due to the inclusivity rule of the cache hierarchy, cache lines are sometimes XI'ed from lower-level caches because of evictions on higher-level caches caused by associativity overflows from requests to other cache lines. These XIs can be called "LRU XIs", where LRU stands for least recently used.

Referring to yet another type of XI request, demote-XIs transition cache ownership from exclusive into read-only state, and exclusive-XIs transition cache ownership from exclusive into invalid state. Demote-XIs and exclusive-XIs need a response back to the XI sender. The target cache can "accept" the XI, or send a "reject" response if it first needs to evict dirty data before accepting the XI. The L1 240 (Fig. 3)/L2 268 (Fig. 3) caches are store-through, but may reject demote-XIs and exclusive-XIs if they have stores in their store queues that need to be sent to the L3 before the exclusive state is downgraded. A rejected XI will be repeated by the sender. Read-only XIs are sent to caches that own the line read-only; no response is needed for such XIs, since they cannot be rejected. The details of the SMP protocol are similar to those described for the IBM z10 by P. Mak, C. Walters, and G. Strait in "IBM System z10 processor cache subsystem microarchitecture", IBM Journal of Research and Development, Vol. 53:1, 2009, which is incorporated herein by reference in its entirety.

Transactional instruction execution

Fig. 3 depicts example components of an example CPU. The instruction decode unit (IDU) 208 keeps track of the current transaction nesting depth (TND) 212. When the IDU 208 receives a TBEGIN instruction, the nesting depth is incremented, and it is conversely decremented on TEND instructions. The nesting depth is written into the GCT 232 for every dispatched instruction. When a TBEGIN or TEND is decoded on a speculative path that later gets flushed, the nesting depth of the IDU 208 is refreshed from the youngest GCT 232 entry that is not flushed. The transactional state is also written into the issue queue 216 for consumption by the execution units, mostly by the load/store unit (LSU) 280. The TBEGIN instruction may specify a transaction diagnostic block (TDB) for recording status information, should the transaction abort before reaching a TEND instruction.
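The interaction between the decoder's nesting depth and the completion table can be sketched as follows. The class and method names are hypothetical; recording the depth after each decoded instruction, clamping the depth at zero, and falling back to depth 0 when the table is empty are simplifying assumptions made for this illustration.

```python
class NestingDepthTracker:
    """Toy model of the IDU nesting depth and the per-instruction GCT record."""

    def __init__(self):
        self.depth = 0
        self.gct = []  # (instruction, depth after decode), in decode order

    def decode(self, instruction):
        if instruction == "TBEGIN":
            self.depth += 1
        elif instruction == "TEND":
            # Clamping at zero is an assumption of this sketch.
            self.depth = max(0, self.depth - 1)
        self.gct.append((instruction, self.depth))

    def flush_from(self, index):
        """Flush entries from `index` onward (a mispredicted speculative path)
        and refresh the depth from the youngest surviving entry."""
        del self.gct[index:]
        self.depth = self.gct[-1][1] if self.gct else 0
```

If a TBEGIN was decoded on a wrong path, `flush_from` discards it and the decoder resumes with the depth that was valid before the speculation.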

Similar to the nesting depth, the IDU 208/GCT 232 collaboratively track the access register/floating-point register (AR/FPR) modification masks through the transaction nest; the IDU 208 can place an abort request into the GCT 232 when an AR/FPR-modifying instruction is decoded and the modification mask blocks it. When the instruction becomes next-to-complete, completion is blocked and the transaction aborts. Other restricted instructions are handled similarly, including a TBEGIN that is decoded while in a constrained transaction or that exceeds the maximum nesting depth.

An outermost TBEGIN is cracked into multiple micro-ops depending on the GR-save-mask; each micro-op is executed by one of the two fixed-point units (FXU) 202 to save a pair of GRs 228 into a special transaction-backup register file 224, which is later used to restore the GR 228 content in the case of a transaction abort. In addition, if a TDB is specified, the TBEGIN spawns micro-ops 226b to perform an accessibility test for the TDB; the address is saved in a special-purpose register for later use in the abort case. At the decoding of an outermost TBEGIN, the instruction address and the instruction text of the TBEGIN are also saved in special-purpose registers for potential abort processing later on.

TEND and NTSTG are single micro-op 232b instructions; NTSTG (non-transactional store) is handled like a normal store, except that it is marked as non-transactional in the issue queue so that the LSU 280 can treat it appropriately. TEND is a no-op at execution time; the ending of the transaction is performed when TEND completes.

As mentioned, instructions that are within a transaction are marked as such in the issue queue 216, but otherwise execute mostly unchanged; the LSU 280 performs isolation tracking as described in the next section.

Since decoding is in order, and since the IDU 208 keeps track of the current transactional state and writes it into the issue queue 216 along with every instruction from the transaction, TBEGIN, TEND, and the instructions before, within, and after the transaction can be executed out of order. An effective address calculator 236 is included in the LSU 280. It is even possible (though unlikely) that TEND is executed first, then the entire transaction, and lastly the TBEGIN. Program order is restored through the GCT 232 at completion time. The length of transactions is not limited by the size of the GCT 232, since the general registers (GR) 228 can be restored from the backup register file 224.

During execution, program event recording (PER) events are filtered based on the event suppression control, and a PER TEND event is detected if enabled. Similarly, while in transactional mode, a pseudo-random generator may cause random aborts as enabled by the transaction diagnostics control.

Tracking for transaction isolation

The load/store unit tracks the cache lines that were accessed during transactional execution, and triggers an abort if an XI from another CPU (or an LRU-XI) conflicts with the footprint. If the conflicting XI is an exclusive or demote XI, the LSU rejects the XI back to the L3 in the hope of finishing the transaction before the L3 repeats the XI. This "stiff-arming" is very efficient in highly contended transactions. In order to prevent hangs when two CPUs stiff-arm each other, an XI-reject counter is implemented, which triggers a transaction abort when a threshold is met.
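The XI-reject counter can be modelled in a few lines. The threshold value, the class name, and the point at which the counter is cleared are assumptions for illustration; the description only states that a transaction abort is triggered once a threshold of rejected XIs is met.

```python
class XiRejectCounter:
    """Toy model of the counter that bounds repeated XI rejection."""

    def __init__(self, threshold=8):  # the threshold value is hypothetical
        self.threshold = threshold
        self.rejects = 0

    def on_conflicting_xi(self):
        """Handle an exclusive or demote XI that hits the transaction footprint."""
        self.rejects += 1
        if self.rejects >= self.threshold:
            return "ABORT"   # give up so that two CPUs cannot hang each other
        return "REJECT"      # ask the L3 to repeat the XI later

    def on_transaction_end(self):
        """Assumption: the counter is cleared when the transaction ends or aborts."""
        self.rejects = 0
```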

The L1 cache directory 240 is traditionally implemented with static random access memories (SRAM). For the transactional memory implementation, the valid bits 244 (64 rows × 6 ways) of the directory have been moved into normal logic latches and are supplemented with two more bits per cache line: the TX-read 248 and TX-dirty 252 bits.

The TX-read 248 bits are reset when a new outermost TBEGIN is decoded (which is interlocked against a prior, still pending transaction). The TX-read 248 bit is set at execution time by every load instruction that is marked "transactional" in the issue queue. Note that this can lead to over-marking if speculative loads are executed, for example on a mispredicted branch path. The alternative of setting the TX-read bit at load completion time was too expensive in silicon area, since multiple loads can complete at the same time, requiring many read ports on the load queue.

Stores execute in the same way as in non-transactional mode, but a transaction mark is placed in the store queue (STQ) 260 entry of the store instruction. At write-back time, when the data from the STQ 260 is written into the L1 240, the TX-dirty bit 252 in the L1 directory 256 is set for the written cache line. Store write-back into the L1 240 occurs only after the store instruction has completed, and at most one store is written back per cycle. Before completion and write-back, loads can access the data from the STQ 260 by means of store forwarding; after write-back, the CPU 114 (Fig. 2) can access the speculatively updated data in the L1 240. If the transaction ends successfully, the TX-dirty bits 252 of all cache lines are cleared, and the TX marks of stores that have not yet been written are cleared in the STQ 260, effectively turning the pending stores into normal stores.

On a transaction abort, all pending transactional stores are invalidated from the STQ 260, even those already completed. All cache lines in the L1 240 that were modified by the transaction (that is, that have the TX-dirty bit 252 on) have their valid bits turned off, effectively removing them from the L1 240 instantaneously.

The architecture requires that, before a new instruction completes, the isolation of the transaction read set and write set is maintained. This isolation is ensured by stalling instruction completion at appropriate times when XIs are pending; speculative out-of-order execution is allowed, optimistically assuming that the pending XIs are to different addresses and will not actually cause a transaction conflict. This design fits very naturally with the XI-versus-completion interlocks that are implemented on prior systems to ensure the strong memory ordering that the architecture requires.

When the L1 240 receives an XI, the L1 240 accesses the directory to check the validity of the XI'ed address in the L1 240, and if the TX-read bit 248 is active on the XI'ed line and the XI is not rejected, the LSU 280 triggers an abort. When a cache line with an active TX-read bit 248 is LRU'ed from the L1 240, a special LRU-extension vector remembers, for each of the 64 rows of the L1 240, that a TX-read line existed on that row. Since no precise address tracking exists for the LRU extensions, any non-rejected XI that hits a valid extension row causes the LSU 280 to trigger an abort. Providing the LRU extension effectively increases the read footprint capability from the L1 size to the L2 size and associativity, provided that conflicts with other CPUs 114 (Fig. 2) against the imprecise LRU-extension tracking do not cause aborts.
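The read-set tracking described above, including the imprecise LRU-extension vector, can be sketched as follows. The class is a toy model: the row index is assumed to be the line number modulo 64 (consistent with a 96 KB, 6-way cache with 256-byte lines, but the real indexing function is not stated), and all names are hypothetical.

```python
class L1ReadTracker:
    """Toy model of TX-read bits 248 plus the per-row LRU-extension vector."""

    LINE_BYTES = 256
    ROWS = 64

    def __init__(self):
        self.tx_read_lines = set()            # lines in L1 with the TX-read bit on
        self.lru_extension = [False] * self.ROWS

    def begin_transaction(self):
        """A new outermost TBEGIN resets the tracking state."""
        self.tx_read_lines.clear()
        self.lru_extension = [False] * self.ROWS

    def load(self, addr):
        """A transactional load sets the TX-read bit of its cache line."""
        self.tx_read_lines.add(addr // self.LINE_BYTES)

    def evict(self, addr):
        """An LRU eviction of a TX-read line is remembered only per row."""
        line = addr // self.LINE_BYTES
        if line in self.tx_read_lines:
            self.tx_read_lines.discard(line)
            self.lru_extension[line % self.ROWS] = True

    def xi_aborts(self, addr, rejected=False):
        """Return True if a non-rejected XI to `addr` aborts the transaction."""
        if rejected:
            return False
        line = addr // self.LINE_BYTES
        return line in self.tx_read_lines or self.lru_extension[line % self.ROWS]
```

After an eviction, an XI to a different line that maps to the same row also triggers an abort; this is the false conflict that the imprecise extension tracking accepts in exchange for the larger read footprint.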

The store footprint is limited by the store cache size (the store cache is discussed in more detail below) and thus implicitly by the L2 size and associativity. No LRU-extension action needs to be performed when a TX-dirty cache line is LRU'ed from the L1.

Store cache

In prior systems, since the L1 240 and L2 268 are store-through caches, every store instruction causes an L3 store access; with now 6 cores per L3 and the further improved performance of each core, the store rate for the L3 (and to a lesser extent for the L2) becomes problematic for certain workloads. In order to avoid store queuing delays, a gathering store cache had to be added, which combines stores to neighboring addresses before sending them to the L3.

For transactional memory performance, it is acceptable to invalidate every TX-dirty cache line from the L1 240 on a transaction abort, because the L2 cache 268 is very close (7-cycle L1 miss penalty) for bringing back the clean lines. However, it would be unacceptable for performance (and for the silicon area needed for tracking) to have transactional stores write the L2 268 before the transaction ends and then invalidate all dirty L2 cache lines on an abort (or, even worse, on the shared L3).

The two problems of store bandwidth and transactional memory store handling can both be addressed with the gathering store cache 264. The cache 264 is a circular queue of 64 entries, each entry holding 128 bytes of data with byte-precise valid bits. In non-transactional operation, when a store is received from the LSU 280, the store cache 264 checks whether an entry exists for the same address, and if so gathers the new store into the existing entry. If no entry exists, a new entry is written into the queue, and if the number of free entries falls below a threshold, the oldest entries are written back to the L2 268 and L3 caches.
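The non-transactional gathering behaviour can be illustrated with a small software model. The class name, the `free_threshold` default, and the use of a dictionary of byte offsets to stand in for the byte-precise valid bits are assumptions; the sketch also assumes that a single store never crosses a 128-byte entry boundary.

```python
from collections import OrderedDict


class GatheringStoreCache:
    """Toy model of a gathering store queue (non-transactional operation only)."""

    def __init__(self, entries=64, entry_bytes=128, free_threshold=4):
        if not 0 <= free_threshold < entries:
            raise ValueError("free_threshold must be smaller than entries")
        self.capacity = entries
        self.entry_bytes = entry_bytes
        self.free_threshold = free_threshold
        self.queue = OrderedDict()  # entry base address -> {byte offset: value}
        self.written_back = []      # base addresses sent on to L2/L3, oldest first

    def store(self, addr, data):
        base = addr - (addr % self.entry_bytes)
        offset = addr - base
        if offset + len(data) > self.entry_bytes:
            raise ValueError("sketch assumes a store stays within one entry")
        entry = self.queue.get(base)
        if entry is None:
            # No entry for this address range yet: allocate a new one.
            entry = self.queue[base] = {}
            # Write back the oldest entries while too few entries are free.
            while self.capacity - len(self.queue) < self.free_threshold:
                old_base, _ = self.queue.popitem(last=False)
                self.written_back.append(old_base)
        # Gather the bytes; the offsets present in the dict act as valid bits.
        for i, value in enumerate(data):
            entry[offset + i] = value
```

Two stores to `0x100` and `0x110` land in the same 128-byte entry and therefore leave the cache as a single write-back, which is how gathering reduces the store rate seen by the L3.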

When a new outermost transaction begins, all existing entries in the store cache 264 are marked closed so that no new stores can be gathered into them, and eviction of those entries to the L2 268 and L3 is started. From that point on, the transactional stores coming out of the LSU 280 STQ 260 allocate new entries or gather into existing transactional entries. The write-back of those stores into the L2 268 and L3 is blocked until the transaction ends successfully; at that point, subsequent (post-transaction) stores can continue to gather into the existing entries until the next transaction closes those entries again.

The store cache 264 is queried on every exclusive or demote XI, and causes an XI reject if the XI compares to any active entry. If the core is not completing further instructions while continuously rejecting XIs, the transaction is aborted at a certain threshold to avoid hangs.

The LSU 280 requests a transaction abort when the store cache 264 overflows. The LSU 280 detects this condition when it tries to send a new store that cannot merge into an existing entry while the entire store cache 264 is filled with stores from the current transaction. The store cache 264 is managed as a subset of the L2 268: while transactionally dirty lines can be evicted from the L1 240, they have to stay resident in the L2 268 throughout the transaction. The maximum store footprint is thus limited to the store cache size of 64 × 128 bytes, and it is also limited by the associativity of the L2 268. Since the L2 268 is 8-way associative and has 512 rows, it is typically large enough not to cause transaction aborts.
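The overflow condition and the resulting footprint limit can be checked with a short model. This sketch tracks only which 128-byte entries hold transactional data; it assumes that entries closed at transaction start have already drained, and the class and method names are hypothetical.

```python
class TxStoreCache:
    """Toy model of store-cache occupancy during one transaction."""

    def __init__(self, entries=64, entry_bytes=128):
        self.entries = entries
        self.entry_bytes = entry_bytes
        self.tx_entries = set()  # indices of 128-byte blocks holding transactional stores

    def max_footprint_bytes(self):
        return self.entries * self.entry_bytes

    def tx_store(self, addr):
        """Return what happens to a transactional store to `addr`."""
        block = addr // self.entry_bytes
        if block in self.tx_entries:
            return "GATHERED"      # merges into an existing transactional entry
        if len(self.tx_entries) == self.entries:
            return "ABORT"         # cannot merge and the cache is full: overflow
        self.tx_entries.add(block)
        return "ALLOCATED"
```

With the stated geometry, `TxStoreCache().max_footprint_bytes()` evaluates to 64 × 128 = 8192 bytes, so a transaction that writes more than 64 distinct 128-byte blocks overflows the store cache in this model.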

If a transaction aborts, the store cache is notified and all entries holding transactional data are invalidated. The store cache also keeps a mark per doubleword (8 bytes) indicating whether the entry was written by an NTSTG instruction; those doublewords stay valid across transaction aborts.

Millicode-implemented functions

Traditionally, IBM mainframe server processors contain a layer of firmware called millicode, which performs complex functions such as certain CISC instruction executions, interruption handling, system synchronization, and RAS. Millicode includes machine-dependent instructions as well as instructions of the instruction set architecture (ISA), which are fetched from memory and executed similarly to the instructions of application programs and the operating system (OS). The firmware resides in a restricted area of main memory that customer programs cannot access. When hardware detects a situation that requires millicode to be invoked, the instruction fetching unit 204 switches into "millicode mode" and starts fetching at the appropriate location in the millicode memory area. Millicode may be fetched and executed in the same way as instructions of the instruction set architecture (ISA), and may include ISA instructions.

For transactional memory, millicode is involved in various complex situations. Every transaction abort invokes a dedicated millicode subroutine to perform the necessary abort operations. The transaction-abort millicode starts by reading special-purpose registers (SPRs) holding the hardware-internal abort reason, potential exception reasons, and the aborted instruction address, which millicode then uses to store a TDB if one is specified. The TBEGIN instruction text is loaded from an SPR to obtain the GR-save-mask, which millicode needs in order to know which GRs 228 to restore.

The CPU 114 (Fig. 2) supports a special millicode-only instruction to read out the backup GRs and copy them into the main GRs. The TBEGIN instruction address is also loaded from an SPR to set the new instruction address in the PSW, so that execution continues after the TBEGIN once the millicode abort subroutine finishes. That PSW may later be saved as the program-old PSW if the abort is caused by a non-filtered program interruption.

The TABORT instruction may be implemented in millicode; when the IDU 208 decodes TABORT, it instructs the instruction fetch unit to branch into the millicode for TABORT, from which millicode branches into the common abort subroutine.

The Extract Transaction Nesting Depth (ETND) instruction may also be millicoded, since it is not performance critical; millicode loads the current nesting depth out of a special hardware register and places it into a GR 228. The PPA instruction is millicoded; it performs the optimal delay based on the current abort count provided by software as an operand to PPA, and also based on other hardware-internal state.

For constrained transactions, millicode may keep track of the number of aborts. The counter is reset to 0 on successful TEND completion, or if an interruption into the OS occurs (since it is not known if or when the OS will return to the program). Depending on the current abort count, millicode can invoke certain mechanisms to improve the chance of success for the subsequent transaction retry. The mechanisms involve, for example, successively increasing random delays between retries, and reducing the amount of speculative execution to avoid encountering aborts caused by speculative accesses to data that the transaction is not actually using. As a last resort, millicode can broadcast to other CPUs to stop all conflicting work and retry the local transaction before releasing the other CPUs to continue normal processing. Multiple CPUs must be coordinated so as not to cause deadlocks, so some serialization between the millicode instances on different CPUs is required.
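The escalation described above can be sketched as a counter with staged responses. The threshold values and the action labels are purely hypothetical; the description names the mechanisms but does not state at which abort counts they are invoked.

```python
class ConstrainedRetryState:
    """Toy model of abort counting and staged escalation for constrained transactions."""

    def __init__(self, delay_from=1, less_speculation_from=3, broadcast_from=6):
        # All three thresholds are assumptions chosen for illustration.
        self.delay_from = delay_from
        self.less_speculation_from = less_speculation_from
        self.broadcast_from = broadcast_from
        self.aborts = 0

    def on_abort(self):
        """Count an abort and return the mechanisms to apply before the retry."""
        self.aborts += 1
        actions = []
        if self.aborts >= self.delay_from:
            actions.append("random_delay")          # grows with the abort count
        if self.aborts >= self.less_speculation_from:
            actions.append("reduce_speculation")
        if self.aborts >= self.broadcast_from:
            actions.append("broadcast_stop")        # last resort
        return actions

    def on_tend_or_os_interruption(self):
        """The counter is reset on a successful TEND or on an interruption into the OS."""
        self.aborts = 0
```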

Referring now to Fig. 4, reference numeral 400 generally illustrates an exemplary embodiment of a method for adaptively sharing data, which may be implemented in hardware or in software.

In current implementations, two lock-based methods for synchronizing data access are typically used. In data structure locking, also referred to as locking or true locking, a program in a critical section of code may wish to be guaranteed exclusive access to an area of memory, also referred to as shared data. In this case, the program may protect the shared data with a lock, which acts as a flag to competing programs that the shared data is unavailable at that time. However, a locking mechanism may rigidly control access to the shared data. In a lightly contended memory area, competing programs may wait unnecessarily, thereby negatively impacting performance. For example, in the sample code below, Thread 2 waits to execute while Thread 1 holds a lock on the structure hash_tbl, even though the two threads update different portions of the structure and could execute in parallel.

Table 5

HLE, described above, gives programs written using traditional lock-based code the opportunity to exploit hardware that implements transactional execution. However, in a heavily contended memory area, if a conflict occurs, the processor may abort the transaction and re-execute the critical section using pessimistic locking behavior. In one embodiment, any lock that crosses a cache line is not elided and will automatically trigger re-execution without HLE. Therefore, where a known critical section consistently fails as a transaction, defaulting to transactional execution and then restarting with locking in order to succeed may degrade performance.

At 410, when a processor, i.e., CPU 114 (Fig. 2), starts a code sequence to access a memory area, CPU 114 (Fig. 2) invokes a conflict predictor (i.e., an HLE predictor or hardware lock virtualizer), which can be implemented in hardware or in software, to attempt to predict whether lock elision is likely to succeed or whether locking should be used instead. In operation, as discussed below, the conflict predictor can operate in a variety of hardware and software environments. In embodiments where the conflict predictor performs conflict prediction in an HLE environment, the conflict predictor is also referred to as an HLE predictor. In one embodiment, a simple count of successful transaction executions is kept, for example in a hardware register or in a memory location that is either per thread or shared by all threads. When the count of successful transaction executions passes a threshold, the conflict predictor can predict at 410 that the transactional execution path, i.e., lock elision, is more efficient than the non-transactional (i.e., locking) path at 455, because interference is unlikely. In at least one embodiment, preferably one corresponding to transactional execution based on lock elision, the counter is initialized to favor the more efficient execution path at first. In another embodiment, the estimated relative cost of executing a transaction versus acquiring the lock and executing under it can be computed in hardware or by instructions inserted into the program flow. Based on the computed relative cost, the conflict predictor can predict that either the transactional path or the non-transactional path is more efficient, because the predicted path, for example, costs less to execute or is less likely to encounter interference. In another embodiment, behavior hints can be implicitly inserted by a compiler for the conflict predictor, to select at 410 between the transactional execution path at 420 and the locking path at 455.

CPU 114 (Fig. 2) can begin executing the critical section as a transaction at 420, updating data as needed at 425. At the end of the transaction at 430, but before the results are committed, CPU 114 (Fig. 2) can determine at 435 whether interference that would cause the transaction to abort has been detected (i.e., two or more code sequences operating in parallel on the same data). When no interference is detected, the transaction can successfully commit its results at 440, and the results can then be used by other transactions. However, if CPU 114 (Fig. 2) detects interference at 435, execution is restarted using locking at 455. At 460, the critical section must explicitly acquire the lock protecting the memory area to be accessed. The lock requester can, however, be forced to wait, in an action referred to as spinning, until the lock is released by a competing process. When the lock is finally acquired at 460, the critical section can proceed. When the data protected by the lock has been updated at 470, the critical section is complete and the lock can be released at 475.
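The flow at 410 through 475 can be sketched in software as follows. This is a minimal single-threaded model under stated assumptions, not the patented hardware mechanism: the names `SuccessCountPredictor` and `run_critical_section` are hypothetical, the transaction is imitated by updating a private copy of a dictionary, interference is supplied by the caller as a callable, and resetting the count on an abort is an assumption because the text does not say how the count changes after a failure.

```python
import threading


class SuccessCountPredictor:
    """Hypothetical conflict predictor that keeps a simple success count."""

    def __init__(self, threshold=2, initial_successes=2):
        self.threshold = threshold
        # Initialised at the threshold so that elision is preferred at first.
        self.successes = initial_successes

    def predict_elide(self):
        # Step 410: predict elision once enough transactions have succeeded.
        return self.successes >= self.threshold

    def record(self, committed):
        # Assumption: an abort resets the count (not specified in the text).
        self.successes = self.successes + 1 if committed else 0


def run_critical_section(predictor, lock, shared, update, interference):
    """Run `update` on `shared`, eliding the lock when the predictor allows."""
    if predictor.predict_elide():
        tentative = dict(shared)        # 420: work on a private copy
        update(tentative)               # 425: update data as needed
        if not interference():          # 435: was interference detected?
            shared.clear()
            shared.update(tentative)    # 440: commit the results
            predictor.record(True)
            return "elided"
        predictor.record(False)         # abort: the private copy is discarded
    with lock:                          # 455/460: acquire, waiting if needed
        update(shared)                  # 470: update under the lock
    return "locked"                     # 475: lock released on leaving `with`
```

In this sketch the predictor never returns to elision by itself after an abort; the time-window embodiment of Fig. 6 describes one way to retrain it.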

Referring to Fig. 5, reference numeral 500 generally illustrates an exemplary embodiment that implements a conflict predictor (i.e., a hardware lock virtualizer) in an environment where HLE support exists. As described above, HLE is a legacy-compatible instruction set extension, including XACQUIRE and XRELEASE, that gives programs written with traditional locking code the opportunity to use hardware that implements transactional execution without substantial modification of the code. In this embodiment, the HLE predictor is the HLE-specific instance of the conflict predictor.

At 505, CPU 114 (Fig. 2) is executedXACQUIRE prefix instruction using associated lock to be obtained Affairs start HLE sequence.In one embodiment, the sequence can by followed by lock obtain affairs XACQUIRE indicate.Some In implementation, XACQUIRE prefix can be ignored.In other implementations, XACQUIRE sequence is optionally executed Column.After starting HLE homing sequence, conflict prediction device (i.e. HLE fallout predictor) is called at 510.Based on prediction, can hold Row lock omits or available lock.When lock omit and obtain lock between predict when, processing can with Fig. 4 420~ The substantially similarly continuation described at 475.

Referring to Fig. 6, reference numeral 600 generally illustrates a flowchart of a method for adaptively sharing data by selecting between lock elision and locking, according to an exemplary embodiment in which no additional hardware capabilities exist. In this exemplary embodiment, hints to the conflict predictor can be provided, for example, in the code stream of an application program, by the operating system, or by hardware. For example, in one embodiment, a programmer can explicitly insert one or more instructions, or a compiler can implicitly insert behavior hints for the conflict predictor. The conflict predictor can keep a history vector or counts that track the number of both successful predictions and unsuccessful predictions (i.e., mispredictions) over some period of time, such as one second. Then, at 610, the conflict predictor can compare the count of mispredictions during the time window with a threshold number of failures. When the mispredictions during the time window exceed the threshold number of failures, the conflict predictor can default to execution using the lock (i.e., non-transactional mode) for the remainder of the time window. During the time window, the memory area can be highly contended because of workload characteristics in which multiple transactions concurrently update conflicting data. By temporarily selecting locking as the default, the conflict predictor avoids the possibility of having to restart failed transactions and improves throughput by avoiding transaction aborts. Once the time window expires, however, contention for the memory area may have eased, and the conflict predictor can attempt transactional execution again.

In an embodiment, the conflict predictor is implemented in software, where the decision to perform lock elision or locking is made by an algorithm implemented in software that transfers control either to a first version of the code that implements lock elision or to a second version of the code that implements lock acquisition. In other embodiments, the determination at 610 is implemented using alternative tests, for example based on a history of interference, in response to an indication by software of the particular items to be updated, or reflecting the expected interference or non-interference associated with the fields that are the target of the updating transaction.

At 655, the critical section must explicitly acquire the lock protecting the memory area to be accessed. The lock requester can, however, be forced to wait, in an action referred to as spinning, until the lock is released by a competing program. When the lock is finally acquired at 660, the critical section can proceed. When the data protected by the lock has been updated at 670, the critical section is complete at 675 and the lock can be released. At 680, CPU 114 (Fig. 2) can check whether the time window has expired. If the time window has not expired, processing ends at 680. If the time window has expired, however, the counts of failed transaction executions and successful transaction executions can be reset at 685, which effectively resets the time window and starts retraining of the conflict predictor.
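The time-window bookkeeping at 610, 680 and 685 can be sketched as follows. This is an illustrative model only: the class name `WindowedConflictPredictor`, its parameter names, the default values and the injectable `clock` argument (used so the window can be exercised without waiting) are assumptions and do not come from the text.

```python
import time


class WindowedConflictPredictor:
    """Hypothetical predictor tracking successes and failures per time window."""

    def __init__(self, max_failures=3, window_seconds=1.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.successes = 0
        self.failures = 0

    def should_elide(self):
        # 610: elide only while mispredictions stay within the threshold.
        return self.failures <= self.max_failures

    def record_success(self):
        self.successes += 1

    def record_failure(self):
        self.failures += 1

    def end_of_section(self):
        # 680: check whether the window expired; 685: reset and retrain.
        if self.clock() - self.window_start >= self.window_seconds:
            self.successes = 0
            self.failures = 0
            self.window_start = self.clock()
            return True
        return False
```

With these assumptions, a caller would invoke `should_elide()` before each critical section and `end_of_section()` after it, so that a burst of failures forces locking only until the current window runs out.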

Where the mispredictions during the time window do not exceed the threshold number of failures, the conflict predictor can select lock elision at 610, i.e., an HLE transaction, or a transaction that implements lock elision by explicitly reading the lock word rather than acquiring the lock. When execution as an HLE transaction is selected (or as a software transaction that performs lock elision by executing a transaction that includes the lock word in its read set), CPU 114 (Fig. 2) can increment the count of successful transaction executions at 615. The HLE transaction at 620 can update data as needed at 625. At the end of the transaction at 630, but before the results are committed at 635, CPU 114 (Fig. 2) can determine whether interference that would cause the transaction to abort has been detected (i.e., two or more code sequences operating in parallel on the same data). When no interference is detected, the HLE transaction (or other transaction implementing lock elision) can successfully commit its results at 640, and the results can then be used by other processes. However, if CPU 114 (Fig. 2) detects interference at 635, the count of failed transaction executions is incremented at 650, because a failed transaction can be regarded as a misprediction and can be used to train the conflict predictor to predict more accurately in the future. At 655 and 660, CPU 114 (Fig. 2) can now attempt to acquire the lock on the memory area and restart the critical section non-transactionally (i.e., using the lock). When the data protected by the lock has finally been updated at 670, processing of the critical section is complete and the lock can be released at 675. At 680, CPU 114 (Fig. 2) can check whether the time window has expired. If the time window has not expired, processing ends at 680. When the time window has expired, however, the counts of failed transaction executions and successful transaction executions can be reset at 685, which effectively starts retraining of the conflict predictor.
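A software conflict predictor that transfers control to one of two code versions, as described for Fig. 6, might look like the sketch below. The function `adaptive_section`, the `TransactionAborted` exception and the dictionary of counts are hypothetical names chosen for illustration; the elided version is assumed to signal detected interference by raising the exception.

```python
import threading


class TransactionAborted(Exception):
    """Raised by the (hypothetical) elided version when interference is detected."""


def adaptive_section(counts, max_failures, lock, elided_version, locked_version):
    """Dispatch to the elided or the locked version of a critical section."""
    if counts["failed"] <= max_failures:     # 610: threshold not exceeded
        counts["successful"] += 1            # 615: counted on entry, as in the text
        try:
            return elided_version()          # 620-640: run and commit
        except TransactionAborted:
            counts["failed"] += 1            # 650: record the misprediction
    with lock:                               # 655/660: acquire the lock
        return locked_version()              # 670; lock released at 675
```

Note that, following the description, the success count is incremented at 615 when elision is selected rather than after the commit, so the failure count, not the success count, is what drives the decision at 610.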

Referring now to Fig. 7, reference numeral 700 generally illustrates a flowchart of an exemplary embodiment of a method for adaptively sharing data that can include a facility for monitoring the lock in hardware during execution. In Fig. 7, the processing of the HLE transaction (i.e., 710 through 750) is substantially similar to how the embodiment of Fig. 6 processes the HLE transaction (i.e., 610 through 650). However, Fig. 7 introduces a hardware lock monitoring facility on the path where the critical section executes non-transactionally. In this embodiment, while allowing the critical section to execute in the locked memory area, the hardware lock monitoring facility attempts to minimize mispredictions by predicting the outcome as if the critical section had actually executed as an HLE transaction. Once the lock is successfully acquired at 760 and 765, the hardware lock monitoring facility can begin monitoring the state of the lock at 770. The critical section at 775 updates the data in the locked memory area and completes execution by releasing the lock at 780. During execution, however, if the hardware lock monitoring facility detects at 785 that another process checks the state of the lock flag, then the attempt by the other process would have caused interference and a transaction failure had the critical section executed as a transaction rather than non-transactionally. In one embodiment, only the lock is monitored. In another embodiment, the data updated as part of the locked region is monitored. As a result, the hardware lock monitoring facility can increment the count of failed transaction executions at 790.

In another embodiment, the hardware lock monitoring facility can monitor all attempted data accesses in the locked memory area. If another process attempts to access data in the area, the hardware lock monitoring facility can count the attempt at 790 as interference and a potential transaction failure. The conflict predictor can thereby learn to predict more accurately whether transactional or non-transactional execution is more likely to succeed.
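The lock-only monitoring embodiment can be imitated in software to show what is being counted. The class `MonitoredLock` below is a hypothetical, single-threaded stand-in for the hardware facility: it simply counts every time a party other than the owner inspects or tries to take the lock while it is held, since each such access would have aborted an elided transaction.

```python
class MonitoredLock:
    """Software stand-in (an assumption) for a hardware lock monitoring facility."""

    def __init__(self):
        self.owner = None
        self.would_have_failed = 0  # count of failed transaction executions (790)

    def _observe(self, who):
        # 785: another party touches the lock while it is held.
        if self.owner is not None and who != self.owner:
            self.would_have_failed += 1

    def acquire(self, who):
        if self.owner is None:
            self.owner = who        # 760/765: lock acquired, monitoring starts
            return True
        self._observe(who)
        return False

    def is_held(self, who):
        self._observe(who)
        return self.owner is not None

    def release(self, who):
        if self.owner != who:
            raise RuntimeError("only the owner may release the lock")
        self.owner = None           # 780: release ends the monitored region
```

Monitoring the protected data rather than the lock word, as in the other embodiment, would follow the same pattern with the observation hook placed on data accesses.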

In another embodiment, a restart flag can be set at 750 when the count of failed transaction executions is incremented. Then, when the count of successful transaction executions is incremented, the restart flag can be reset at 755. The restart flag can improve prediction accuracy by preventing the count of failed transaction executions from being incremented twice (i.e., once at 750 on failing as an HLE transaction, and once at 755 on restarting using the lock).
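One way to read the restart flag is sketched below. The class `FailureCounter` and its method names are hypothetical, and the interpretation that the second increment would come from interference observed while the aborted section re-executes under the lock is an assumption drawn from the surrounding description.

```python
class FailureCounter:
    """Hypothetical counters with a restart flag that avoids double counting."""

    def __init__(self):
        self.failed = 0
        self.successful = 0
        self.restart_flag = False

    def on_transaction_abort(self):
        # 750: count the failure once and remember that a restart is pending.
        self.failed += 1
        self.restart_flag = True

    def on_interference_during_lock(self):
        # Interference seen while re-executing under the lock: skip the
        # increment if this execution was already counted as a failure.
        if not self.restart_flag:
            self.failed += 1

    def on_success_counted(self):
        # The flag is reset when the success count is next incremented (755).
        self.successful += 1
        self.restart_flag = False
```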

Referring now to Fig. 8, in an embodiment, in a hardware lock elision (HLE) environment, predictively determining whether an HLE transaction should actually acquire the lock and execute non-transactionally 810 includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally 820; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction is encountered or the HLE transaction encounters a transactional conflict 830, wherein the xrelease instruction releases the lock; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode 840.
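The elision step at 830 can be illustrated with a toy memory model. The class `ElidedTransaction` below is a hypothetical sketch, not how hardware implements HLE: memory is a dictionary, the lock address joins the read set, the lock word is never written, data writes are buffered, and a write by another party to any tracked address counts as a transactional conflict.

```python
class ElidedTransaction:
    """Toy model (an assumption) of an HLE transaction over a dict-based memory."""

    def __init__(self, memory, lock_addr):
        self.memory = memory
        self.lock_addr = lock_addr
        self.read_set = {lock_addr}  # 830: the lock address joins the read set
        self.write_buffer = {}       # the write to the lock itself is suppressed

    def read(self, addr):
        self.read_set.add(addr)
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value  # buffered until xrelease

    def conflicts_with(self, written_addr):
        # A write by another party to a tracked address aborts the transaction.
        return written_addr in self.read_set or written_addr in self.write_buffer

    def xrelease(self):
        # Commit buffered writes; the lock word was never modified.
        self.memory.update(self.write_buffer)
        self.write_buffer = {}
```

Because the lock word stays in the read set, a competing thread that really acquires the lock writes to that address and thereby forces the elided transaction to abort, which is what keeps elision safe.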

Referring now to Fig. 9, in an embodiment, the HLE predictor is updated based on the success of the predictions for HLE transactions. Based on initially encountering an HLE transaction having a lock address, the count of successful HLE transaction executions associated with the lock address is initialized to zero; based on any subsequent HLE transaction having the lock address aborting, the count of failed HLE transaction executions associated with the lock address of the HLE transaction is incremented in the HLE predictor, wherein a high count indicates a likely abort 920. In non-transactional mode, attempts by another process to access the lock are monitored; and when an attempted access by another process is detected, the count of failed HLE transaction executions is incremented 950. The count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window are tracked; and based on the count of failed HLE transaction executions exceeding the threshold number of failures, the remainder of the time window defaults to non-transactional mode 970. Based on expiry of the time window, the count of successful HLE transaction executions and the count of failed HLE transaction executions are reset to zero 960.
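The per-lock-address bookkeeping of Fig. 9 can be sketched as a small table keyed by lock address. The class `PerLockHlePredictor`, its method names and the threshold value are assumptions made for illustration; initializing the failure count to zero alongside the success count is likewise an assumption.

```python
class PerLockHlePredictor:
    """Hypothetical HLE predictor keeping counts per lock address."""

    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.table = {}

    def _entry(self, lock_addr):
        # First encounter of a lock address: counts start at zero.
        if lock_addr not in self.table:
            self.table[lock_addr] = {"successful": 0, "failed": 0}
        return self.table[lock_addr]

    def should_elide(self, lock_addr):
        return self._entry(lock_addr)["failed"] <= self.max_failures

    def on_complete(self, lock_addr):
        self._entry(lock_addr)["successful"] += 1

    def on_abort(self, lock_addr):
        self._entry(lock_addr)["failed"] += 1   # 920: a high count predicts aborts

    def on_monitored_access(self, lock_addr):
        self._entry(lock_addr)["failed"] += 1   # 950: seen in non-transactional mode

    def on_window_expired(self):
        for entry in self.table.values():       # 960: reset both counts to zero
            entry["successful"] = 0
            entry["failed"] = 0
```

Keeping the counts per lock address lets a heavily contended lock fall back to locking without penalizing unrelated, lightly contended locks.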

Referring now to Fig. 10, computing device 1000 can include respective sets of internal components 800 and external components 900. Each of the sets of internal components 800 includes: one or more processors 820; one or more computer-readable RAMs 822; one or more computer-readable ROMs 824 on one or more buses 826; one or more operating systems 828; one or more software applications executing the methods of Figs. 5 through 7; and one or more computer-readable tangible storage devices 830. The one or more operating systems are stored on one or more of the respective computer-readable tangible storage devices 830 for execution by one or more of the respective processors 820 via one or more of the respective RAMs 822 (which typically include cache memory). In the embodiment illustrated in Fig. 10, each of the computer-readable tangible storage devices 830 is a magnetic disk storage device of an internal hard drive. Alternatively, each of the computer-readable tangible storage devices 830 is a semiconductor storage device such as ROM 824, EPROM, or flash memory, or any other computer-readable tangible storage device that can store a computer program and digital information.

Each set of internal components 800 also includes a R/W drive or interface 832 for reading from and writing to one or more computer-readable tangible storage devices 936, such as a thin provisioning storage device, CD-ROM, DVD, SSD, memory stick, magnetic tape, magnetic disk, optical disk, or semiconductor storage device. The R/W drive or interface 832 can be used to load the device driver 840 firmware, software, or microcode to tangible storage device 936 to facilitate communication with components of computing device 100.

Each set of internal components 800 can also include network adapters (or switch port cards) or interfaces 836, such as TCP/IP adapter cards, wireless Wi-Fi interface cards, 3G or 4G wireless interface cards, or other wired or wireless communication links. The operating system 828 associated with computing device 1000 can be downloaded to computing device 1000 from an external computer (e.g., a server) via a network (for example, the Internet, a local area network, or a wide area network) and the respective network adapters or interfaces 836. From the network adapters (or switch port adapters) or interfaces 836, the operating system 828 associated with computing device 1000 is loaded into the respective hard drive 830 and network adapter 836. The network can comprise copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.

Each of the sets of external components 900 can include a computer display monitor 920, a keyboard 930, and a computer mouse 934. External components 900 can also include touch screens, virtual keyboards, touch pads, pointing devices, and other human interface devices. Each of the sets of internal components 800 also includes device drivers 840 to interface with computer display monitor 920, keyboard 930, and computer mouse 934. The device drivers 840, R/W drive or interface 832, and network adapter or interface 836 comprise hardware and software (stored in storage device 830 and/or ROM 824).

Various embodiments of the present disclosure can be implemented in a data processing system suitable for storing and/or executing program code that includes at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements include, for instance, local memory employed during actual execution of the program code, bulk storage, and cache memory that provides temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives, and other storage media) can be coupled to the system either directly or through intervening I/O controllers. Network adapters can also be coupled to the system to enable the data processing system to become coupled to other data processing systems or to remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.

The present invention can be a system, a method, and/or a computer program product. The computer program product can include a computer readable storage medium having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium can be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network can comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention can be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) can execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions can be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions can also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams can represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or the blocks can sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.

Although preferred embodiments have been described in detail herein, it will be apparent to those skilled in the art that various modifications, additions, and substitutions can be made without departing from the spirit of the present disclosure, and these are therefore considered to be within the scope of the disclosure as defined in the following claims.

Claims

1. A method in a hardware lock elision (HLE) environment for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the method comprising:

based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally;

comparing a count of failed HLE transaction executions with a threshold number of failures;

based on the count of failed HLE transaction executions not exceeding the threshold number of failures, proceeding in HLE transactional execution mode until an xrelease instruction is encountered or interference is detected, wherein the xrelease instruction releases the lock, and, in response to detecting interference, incrementing the count of failed HLE transaction executions; and

based on the count of failed HLE transaction executions exceeding the threshold number of failures, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

2. The method according to claim 1, further comprising:

updating the HLE predictor based on the success of the predictions for HLE transactions, wherein the HLE predictor predicts whether an HLE transaction is likely to abort.

3. The method according to claim 1, further comprising:

based on initially encountering an HLE transaction having a lock address, initializing to zero a count of successful HLE transaction executions associated with the lock address;

based on any subsequent HLE transaction having the lock address aborting, incrementing, in the predictor, the count of failed HLE transaction executions associated with the lock address of the HLE transaction;

based on any subsequent HLE transaction having the lock address completing, incrementing, in the HLE predictor, the count of successful HLE transaction executions associated with the lock address of the HLE transaction.

4. The method according to claim 1, further comprising:

monitoring, in non-transactional mode, attempts by another process to access the lock; and

when an attempted access by another process is detected, incrementing the count of failed HLE transaction executions.

5. The method according to claim 1, further comprising:

tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;

comparing the count of failed HLE transaction executions during the time window with the threshold number of failures; and

based on the count of failed HLE transaction executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.

6. The method according to claim 5, further comprising:

based on expiry of the time window, resetting to zero the count of successful HLE transaction executions and the count of failed HLE transaction executions.

7. A computer readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:

based on the count of failed HLE transaction executions not exceeding the threshold number of failures, proceeding in HLE transactional execution mode until an xrelease instruction is encountered or interference is detected, wherein the xrelease instruction releases the lock, and, in response to detecting interference, incrementing the count of failed HLE transaction executions; and

8. The computer readable storage medium according to claim 7, wherein the method performed by the processing circuit executing the instructions further comprises:

updating the HLE predictor based on the success of the predictions for HLE transactions, wherein the HLE predictor predicts whether an HLE transaction is likely to abort.

9. The computer readable storage medium according to claim 7, wherein the method performed by the processing circuit executing the instructions further comprises:

10. The computer readable storage medium according to claim 7, wherein the method performed by the processing circuit executing the instructions further comprises:

monitoring, in non-transactional mode, attempts by another process to access the memory area protected by the lock; and

11. The computer readable storage medium according to claim 7, wherein the method performed by the processing circuit executing the instructions further comprises:

based on initially encountering an HLE transaction having a lock address, initializing to zero a count associated with the lock address;

based on any subsequent HLE transaction having the lock address completing, incrementing, in the predictor, the count of successful HLE transaction executions associated with the lock address of the HLE transaction.

12. The computer readable storage medium according to claim 7, wherein the method performed by the processing circuit executing the instructions further comprises:

comparing the count of failed HLE transaction executions during the time window with the threshold number of failures; and

13. The computer readable storage medium according to claim 12, wherein the method performed by the processing circuit executing the instructions further comprises:

14. A computer system in a hardware lock elision (HLE) environment for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer system comprising:

a memory; and

a processor in communication with the memory, wherein the computer system is configured to perform a method, the method comprising:

15. The computer system according to claim 14, wherein the method performed by the computer system further comprises:

16. The computer system according to claim 14, wherein the method performed by the computer system further comprises:

monitoring, in non-transactional mode, attempts by another process to access the lock; and

17. The computer system according to claim 14, wherein the method performed by the computer system further comprises:

monitoring, in non-transactional mode, attempts by another process to access the memory area protected by the lock; and

18. The computer system according to claim 14, wherein the method performed by the computer system further comprises:

19. The computer system according to claim 14, wherein the method performed by the computer system further comprises:

20. The computer system according to claim 19, wherein the method performed by the computer system further comprises: