CN105677526B

CN105677526B - System for testing transactional execution status

Info

Publication number: CN105677526B
Application number: CN201610081166.XA
Authority: CN
Inventors: R. Rajwar; B. L. Toll; K. K. Lai; M. C. Merten; M. G. Dixon
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2012-06-29
Filing date: 2013-06-19
Publication date: 2019-11-05
Anticipated expiration: 2033-06-19
Also published as: CN105760138B; CN105760139B; CN104335183B; CN105786665A; CN105760138A; CN105760140B; CN105786665B; CN105760139A; WO2014004222A1; CN105760265B; CN105760265A; CN105760140A; CN105677526A; CN104335183A

Abstract

This application discloses instructions and logic to test transactional execution status. Novel instructions, logic, methods and apparatus are disclosed to test transactional execution status. Embodiments include decoding a first instruction to start a transactional region. Responsive to the first instruction, a checkpoint for a set of architecture state registers is generated, and memory accesses from a processing element in the transactional region associated with the first instruction are tracked. A second instruction to detect transactional execution of the transactional region is then decoded. An operation is executed, responsive to decoding the second instruction, to determine whether the execution context of the second instruction is within the transactional region. A first flag is then updated responsive to the second instruction. In some embodiments, a register may optionally be updated and/or a second flag may optionally be updated responsive to the second instruction.

Description

System for testing transactional execution status

This patent application is a divisional of the invention patent application with international application number PCT/US2013/046633 and international filing date June 19, 2013, which entered the Chinese national phase as application number 201380028480.6 and is entitled "Instruction and logic to test transactional execution status".

Related application

This application is a continuation-in-part of currently pending international application PCT/US2012/023611, filed February 2, 2012 and designating the United States. That earlier international application is incorporated herein by reference as if set forth in its entirety in this application.

Technical field

The present disclosure relates generally to the field of processing logic, microprocessors, and associated instruction set architectures that, when executed by a processor or other processing logic, perform logical, mathematical, or other functional operations. In particular, the disclosure relates to instructions and logic to test transactional execution status.

Background

Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a result, computer system configurations have evolved from single or multiple integrated circuits in a system to multiple processing cores and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single processor die, where the processor die may include any number of cores or logical processors.

The ever-increasing number of cores and logical processors on integrated circuits enables more software threads to be executed concurrently. However, the increase in the number of software threads that may be executed simultaneously has created problems with synchronizing data shared among those threads. One common solution for accessing shared data in multi-core or multi-logical-processor systems is to use locks to guarantee mutual exclusion across multiple accesses to the shared data. However, the ever-increasing ability to execute multiple software threads turns locked data into a bottleneck: threads wait for other threads to complete, which serializes their execution and reduces the benefit of executing multiple threads concurrently. In addition, some read-only accesses may use locks to guarantee mutual exclusion in case a writer attempts to modify the data, which has the undesirable side effect of excluding other read-only accesses.

For example, consider a hash table holding shared data. With a lock system, a programmer may lock the entire hash table, allowing one thread to access the entire hash table. However, the throughput and performance of other threads may be adversely affected, because they cannot access any entry in the hash table until the lock is released. Alternatively, each entry in the hash table may be locked, resulting in many lock constructs in the software. In such a construct, many locks may need to be acquired to perform a particular task, which can lead to deadlock with other threads. Either way, after extrapolating this simple example into a large scalable program, it is apparent that the complexity of lock contention, serialization, fine-grain synchronization, and deadlock avoidance becomes an extremely cumbersome burden for programmers.
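The two locking strategies described above can be sketched in Python. This is only an illustrative model: the class names `CoarseTable` and `StripedTable`, the stripe count, and the `swap` task are assumptions introduced here and do not come from the application. The sketch shows why a task that touches several entries must acquire several locks, and why it must do so in a fixed order to avoid deadlock.

```python
import threading
from contextlib import ExitStack


class CoarseTable:
    """One lock guards the whole table: simple, but every access serializes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)


class StripedTable:
    """One lock per bucket: more concurrency, more locks to manage."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._buckets = [{} for _ in range(stripes)]

    def _index(self, key):
        return hash(key) % len(self._locks)

    def put(self, key, value):
        i = self._index(key)
        with self._locks[i]:
            self._buckets[i][key] = value

    def get(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key)

    def swap(self, key_a, key_b):
        # A task spanning two entries may need two locks. They are taken in
        # ascending bucket order; taking them in arbitrary order is how two
        # threads can end up waiting on each other forever.
        ia, ib = self._index(key_a), self._index(key_b)
        with ExitStack() as stack:
            for i in sorted({ia, ib}):
                stack.enter_context(self._locks[i])
            a = self._buckets[ia].get(key_a)
            b = self._buckets[ib].get(key_b)
            self._buckets[ia][key_a] = b
            self._buckets[ib][key_b] = a
```

The lock-ordering rule in `swap` is exactly the kind of bookkeeping that grows burdensome as a program scales, which is the motivation for the transactional techniques discussed next.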

Another recent data synchronization technique includes the use of transactional memory (TM). Often, transactional execution includes executing a grouping of a plurality of micro-operations, operations, or instructions atomically. In the example above, both threads execute within the hash table, and their memory accesses are monitored/tracked. If both threads access/alter the same entry, conflict resolution may be performed to ensure data validity. One type of transactional execution includes software transactional memory (STM), where tracking of memory accesses, conflict resolution, abort tasks, and other transactional tasks are performed in software, often without hardware support. Another type of transactional execution includes a hardware transactional memory (HTM) system, where hardware is included to support access tracking, conflict resolution, and other transactional tasks.
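As a rough illustration of the access tracking and conflict detection that an STM performs in software, the following Python sketch uses per-entry version numbers. It is a single-threaded toy model under stated assumptions: `VersionedStore`, `Transaction`, and `Conflict` are hypothetical names, version-based validation is only one of several possible schemes, and the commit step is not itself made atomic against real threads.

```python
class Conflict(Exception):
    """Raised when an entry changed after the transaction first observed it."""


class VersionedStore:
    """Shared data plus a version counter per entry (an assumed design)."""

    def __init__(self):
        self.values = {}
        self.versions = {}


class Transaction:
    def __init__(self, store):
        self.store = store
        self.seen = {}      # entry -> version observed on first access
        self.writes = {}    # buffered updates, invisible until commit

    def _observe(self, key):
        # Track the access: remember which version this transaction saw.
        self.seen.setdefault(key, self.store.versions.get(key, 0))

    def read(self, key):
        self._observe(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.values.get(key)

    def write(self, key, value):
        self._observe(key)
        self.writes[key] = value

    def commit(self):
        # Conflict detection: every tracked entry must be unchanged.
        for key, version in self.seen.items():
            if self.store.versions.get(key, 0) != version:
                raise Conflict(key)
        # No conflict: publish all buffered writes together.
        for key, value in self.writes.items():
            self.store.values[key] = value
            self.store.versions[key] = self.store.versions.get(key, 0) + 1
```

In this model two transactions that touch different entries both commit, while the second of two transactions that touch the same entry raises `Conflict` and would have to be retried or resolved.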

A technique similar to transactional memory is hardware lock elision (HLE), in which a locked critical section is executed tentatively without the locks. If the execution is successful (i.e., there are no conflicts), the result is made globally visible. In other words, the critical section is executed like a transaction, with the lock instructions of the critical section elided, instead of executing an atomically defined transaction. As a result, in the example above, instead of replacing the hash table execution with a transaction, the critical section defined by the lock instructions is executed tentatively. Multiple threads similarly execute within the hash table, and their accesses are monitored/tracked. If any of the threads access/alter the same entry, conflict resolution may be performed to ensure data validity. But if no conflicts are detected, the updates to the hash table are atomically committed.

As can be seen, transactional execution and lock elision have the potential to provide better performance among multiple threads. However, HLE and TM are relatively new fields of study with regard to microprocessors. As a result, HLE and TM implementations in processors have not yet been fully explored or studied in detail.

Brief description of the drawings

The present invention is illustrated by way of example, and not limitation, in the figures of the accompanying drawings.

Fig. 1 illustrates one embodiment of a computing system that uses instructions and logic to test transactional execution status.

Fig. 2 illustrates one embodiment of a processor that uses instructions and logic to test transactional execution status.

Fig. 3A illustrates an instruction encoding to provide functionality to test transactional execution status according to one embodiment.

Fig. 3B illustrates an instruction encoding to provide functionality to test transactional execution status according to another embodiment.

Fig. 3C illustrates an instruction encoding to provide functionality to test transactional execution status according to another embodiment.

Fig. 3D illustrates an instruction encoding to provide functionality to test transactional execution status according to another embodiment.

Fig. 3E illustrates an instruction encoding to provide functionality to test transactional execution status according to another embodiment.

Fig. 4A is a block diagram of one embodiment of a processor micro-architecture to execute instructions that provide functionality to test transactional execution status.

Fig. 4B illustrates elements of one embodiment of a processor micro-architecture to execute instructions that provide functionality to test transactional execution status.

Fig. 5 is a block diagram of one embodiment of a processor to execute instructions that provide functionality to test transactional execution status.

Fig. 6 is a block diagram of one embodiment of a computer system to execute instructions that provide functionality to test transactional execution status.

Fig. 7 is a block diagram of another embodiment of a computer system to execute instructions that provide functionality to test transactional execution status.

Fig. 8 is a block diagram of another embodiment of a computer system to execute instructions that provide functionality to test transactional execution status.

Fig. 9 is a block diagram of one embodiment of a system on a chip to execute instructions that provide functionality to test transactional execution status.

Fig. 10 is a block diagram of an embodiment of a processor to execute instructions that provide functionality to test transactional execution status.

Fig. 11 is a block diagram of one embodiment of an IP core development system that provides functionality to test transactional execution status.

Fig. 12 illustrates one embodiment of an architecture emulation system that provides functionality to test transactional execution status.

Fig. 13 illustrates one embodiment of a system to translate instructions that provide functionality to test transactional execution status.

Fig. 14 illustrates one embodiment of an apparatus that provides functionality to test transactional execution status.

Fig. 15 illustrates a flow diagram for one embodiment of a process to provide functionality to test transactional execution status.

Fig. 16 illustrates a flow diagram for an alternative embodiment of a process to provide functionality to test transactional execution status.

Detailed description

Some embodiments of the instructions and logic to test transactional execution status disclosed herein may be implemented in conjunction with the Transactional Synchronization Extensions (TSX) of a processor instruction set architecture (ISA). Such extensions may provide the ability to dynamically detect when serialization of lock-protected critical sections is needed in a multithreaded software environment. Programmer-specified code regions (called transactional regions) may be executed transactionally. If the transactional execution completes successfully (i.e., without contention from another process or thread), then upon successful completion and exit from the transactional region, all memory operations or modifications to data in memory will appear to have occurred atomically or instantaneously.

Hardware Lock Elision (HLE) is one embodiment of such an extension. It provides an instruction set interface for programmers to specify transactional regions using two instruction prefix hints, XACQUIRE and XRELEASE, placed around the acquisition and release of a lock that protects a critical section. With HLE, the processor may elide the write operations associated with the lock and attempt to execute the region transactionally. If the processor detects any data conflict, the transactional execution is aborted, and the critical section is re-executed non-transactionally and without elision.
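The elide-then-fall-back behavior described above can be modeled in a few lines of Python. This is a software analogy only, not a description of how any processor implements HLE: the function `run_elided` and the `conflict_detected` callback (which stands in for the hardware's conflict detection) are hypothetical, and the model is single-threaded.

```python
import threading


def run_elided(lock, shared, body, conflict_detected):
    """Run body as an elided critical section, falling back to the lock.

    lock              -- the lock that normally protects `shared`
    shared            -- dict holding the shared state
    body              -- callable that updates the dict it is given
    conflict_detected -- callable returning True if another thread conflicted
                         (a stand-in for hardware conflict detection)
    """
    # Speculative attempt: the lock is NOT written; updates are buffered
    # in a private copy so they can be discarded.
    speculative = dict(shared)
    body(speculative)
    if not conflict_detected():
        # Success: make the buffered result visible in one step.
        shared.clear()
        shared.update(speculative)
        return "elided"

    # Conflict: discard the speculative work and re-execute the critical
    # section non-transactionally, this time really acquiring the lock.
    with lock:
        body(shared)
    return "locked"
```

For example, with `shared = {"n": 0}` and a body that increments `shared["n"]`, a call without a conflict returns `"elided"` and a call with a conflict returns `"locked"`; in both cases the counter advances by exactly one, because the aborted speculative work is thrown away.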

Restricted Transactional Memory (RTM) is another embodiment of an instruction set interface for programmers. It uses three instructions: XBEGIN and XEND, which delimit a transactional region for execution, and XABORT, which explicitly aborts the execution of an RTM region. The XBEGIN instruction may also specify a relative offset to a branch target, the fallback code section, which is executed if the transaction aborts. The fallback code may include conflict resolution steps. An explicit XABORT may also specify an 8-bit immediate value to be written to a register, e.g., for use by the fallback code section. Embodiments of the instructions and logic to test transactional execution status disclosed herein may also be implemented in conjunction with other processor ISA transactional extensions, and/or in conjunction with HTM, and/or in conjunction with STM, and/or in conjunction with other transactional execution contexts.
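The control flow among XBEGIN, XEND, XABORT and the fallback code section can be sketched with a small Python model. The names `RtmModel`, `TransactionAborted`, and `run_region` are hypothetical, registers are modeled as a plain dictionary, and passing the abort code as an argument is an assumption standing in for the register write mentioned above. The model ignores nesting and memory tracking.

```python
class TransactionAborted(Exception):
    """Models the control transfer to the fallback code section."""

    def __init__(self, code):
        super().__init__(f"abort code {code}")
        self.code = code


class RtmModel:
    """Toy model of an RTM region over a dict of register values."""

    def __init__(self, registers):
        self.registers = registers
        self._checkpoint = None

    def xbegin(self):
        # Starting the region checkpoints the architectural state.
        self._checkpoint = dict(self.registers)

    def xend(self):
        # Commit: the checkpoint is no longer needed.
        self._checkpoint = None

    def xabort(self, imm8):
        if self._checkpoint is None:
            return  # outside a region the model treats XABORT as a no-op
        # Restore the checkpointed state, then divert to the fallback path.
        self.registers.clear()
        self.registers.update(self._checkpoint)
        self._checkpoint = None
        raise TransactionAborted(imm8 & 0xFF)  # only 8 bits are kept


def run_region(cpu, body, fallback):
    """Execute body transactionally; run fallback(cpu, code) on abort."""
    cpu.xbegin()
    try:
        body(cpu)
        cpu.xend()
        return "committed"
    except TransactionAborted as abort:
        fallback(cpu, abort.code)
        return "fallback"
```

A body that modifies a register and then calls `cpu.xabort(...)` leaves the registers exactly as they were at `xbegin`, and the fallback receives the 8-bit code so that it can decide how to proceed.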

Novel instructions, logic, methods and apparatus are disclosed herein to test transactional execution status. Embodiments include decoding a first instruction or prefix to start a transactional region. Responsive to the first instruction or prefix, a checkpoint for a set of architecture state registers is generated, and memory accesses from a processing element in the transactional region associated with the first instruction are tracked. In one embodiment, the instruction set interface for programmers may include a second instruction to test transactional status, which is executed to determine whether the execution context is within a transactional region or a speculative transactional critical section (e.g., HLE or RTM). In one embodiment, such an instruction may set a flag register to one value (e.g., zero) if it determines that it is executing inside a transactional region. In one embodiment, such an instruction may set the flag register to another value (e.g., one) if it determines that it is not executing inside a transactional region. In an alternative embodiment, such an instruction may set a register to a value indicating the nesting level of a potential transactional region. In another alternative embodiment, such an instruction may be used to determine whether accessing memory associated with a memory operand would cause a transactional abort of a potential transactional region. In another alternative embodiment, such an instruction may be used to determine whether sufficient buffering is available for transactional execution of a potential transactional region. Other alternative embodiments are possible.
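The flag and nesting-level behavior of such a test instruction can be illustrated with a minimal Python model. The class `TxStatus`, its method names, and the default nesting limit are assumptions made for illustration; only the flag convention (zero inside a region, one outside) follows the example values given above.

```python
class TxStatus:
    """Toy model of transactional status with a nesting counter."""

    def __init__(self, max_nest=7):
        self.max_nest = max_nest   # hypothetical hardware nesting limit
        self.depth = 0             # 0 means not inside any region
        self.flag = None           # last value produced by xtest()

    def begin(self):
        """Enter a (possibly nested) region; False models an abort."""
        if self.depth >= self.max_nest:
            self.depth = 0         # the modeled abort unwinds every level
            return False
        self.depth += 1
        return True

    def end(self):
        if self.depth == 0:
            raise RuntimeError("end outside a transactional region")
        self.depth -= 1

    def xtest(self):
        """Set the flag to 0 inside a region and to 1 outside it."""
        self.flag = 0 if self.depth > 0 else 1
        return self.flag

    def nest_level(self):
        """Alternative embodiment: report the current nesting level."""
        return self.depth

    def can_nest(self):
        """True if one more nested region fits under the limit."""
        return self.depth < self.max_nest
```

A library routine could call `xtest()` first and, for example, skip an operation that is known to abort a transaction when the returned flag is zero, or check `can_nest()` before opening another nested region.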

It will be appreciated that, through the use of one embodiment of such an instruction, programmers may dynamically determine, inside a potential transactional region (e.g., an HLE region), whether the region is being executed transactionally or is merely being re-executed non-transactionally after a transactional abort. Using one embodiment of such an instruction, programmers may dynamically determine, inside a potential transactional region (e.g., an RTM region), whether an XABORT instruction will restore a previous architectural state or will be treated as a NOP (i.e., no operation). Using one embodiment of such an instruction, programmers may dynamically determine whether a library routine is being called from within a transactional region or from a fallback code section. It will also be appreciated that, through the use of one embodiment of such an instruction, programmers may dynamically determine whether the nesting level of transactional regions is approaching a hardware limit and whether further nesting would cause a transactional abort.

In the following description, numerous specific details such as processing logic, processor types, micro-architectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring embodiments of the present invention.

These and other embodiments of the present invention may be realized in accordance with the following teachings, and it should be evident that various modifications and changes may be made in the following teachings without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense, and the invention is measured only in terms of the appended claims and their equivalents.

Fig. 1 illustrates one embodiment of a computing system 100 that uses instructions and logic to test transactional execution status. System 100 includes a component, such as a processor 102, that employs execution units including logic to perform algorithms for processing data in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the Pentium III, Pentium 4, Xeon™, XScale™ and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (UNIX and Linux, for example), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can execute one or more instructions in accordance with at least one embodiment.

Fig. 1 is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm to execute at least one instruction in accordance with one embodiment of the present invention. One embodiment is described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a "hub" system architecture. The computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform their conventional functions, which are well known in the art.

In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 102. Other embodiments can also include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register. Checkpoint logic 105 is provided to set a checkpoint for a set of architecture state registers in register file 106 for a thread executed by a thread processing element of processor 102. Tracking logic 103 is provided to track memory accesses from the thread processing element that are associated with a transactional region and directed to shared memory in cache memory 104.

Execution unit 108, including logic to perform integer and floating-point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. For one embodiment, execution unit 108 includes logic to handle a Transactional Synchronization Extensions (TSX) instruction set 109, which includes one or more instructions to test transactional execution status. By including the TSX instruction set 109 in the instruction set of the general-purpose processor 102, along with the associated circuitry to execute the instructions, the operations used by many multithreaded applications may be performed using restricted transactional memory or hardware lock elision in the general-purpose processor 102. Thus, many multithreaded applications can be accelerated and executed more efficiently by using restricted transactional memory or hardware lock elision to synchronize access to shared data. This can eliminate the need to perform unnecessary synchronization on critical sections of shared memory that conflict relatively rarely.

Alternative embodiments of an execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or other memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by the processor 102.

A system logic chip 116 is coupled to the processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via the processor bus 110. The MCH 116 provides a high-bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. The MCH 116 directs data signals between the processor 102, memory 120, and other components in the system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a peripheral hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 120, the chipset, and the processor 102. Some examples are an audio controller, a firmware hub (flash BIOS) 128, a transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.

For another embodiment of a system, an instruction in accordance with one embodiment can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or graphics controller, can also be located on a system on a chip.

Fig. 2 illustrates one embodiment of a processor 200 that uses instructions and logic to test transactional execution status. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as data types such as single- and double-precision integer and floating-point data types. In one embodiment, the in-order front end 201 is the part of the processor 200 that fetches instructions to be executed and prepares them to be used later in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches instructions from memory and feeds them to an instruction decoder 228, which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called "micro-instructions" or "micro-operations" (also called micro-ops or uops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations in accordance with one embodiment. In one embodiment that includes a trace cache 230, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences or traces in the uop queue 234 for execution. When the trace cache 230 encounters a complex instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 228 accesses the microcode ROM 232 to carry out the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct micro-instruction pointer for reading, from the microcode ROM 232, the micro-code sequences that complete one or more instructions in accordance with one embodiment. After the microcode ROM 232 finishes sequencing micro-ops for an instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230. It will be appreciated that not all embodiments necessarily include a trace cache 230.

The out-of-order execution engine 203 is where the instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions, to optimize performance as they go down the pipeline and are scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each uop in one of two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: the memory scheduler, fast scheduler 202, slow/general floating-point scheduler 204, and simple floating-point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the uop needs to complete its operation. The fast scheduler 202 of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.
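The readiness rule used by the schedulers (operands ready and an execution resource available) can be sketched as a simple Python function. The `MicroOp` record, the port names, and the oldest-first, one-uop-per-port policy are assumptions chosen for illustration, not details taken from the described processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    sources: tuple   # register names this uop reads
    dest: str        # register name this uop writes
    port: str        # execution resource it needs, e.g. "alu" or "agu"


def dispatch_ready(queue, ready_regs, free_ports):
    """Return the uops that can dispatch this cycle, oldest first.

    A uop dispatches only if every source operand is ready and the
    execution resource it needs has not already been claimed this cycle.
    """
    issued = []
    available = set(free_ports)
    for uop in queue:
        operands_ready = all(src in ready_regs for src in uop.sources)
        if operands_ready and uop.port in available:
            issued.append(uop)
            available.discard(uop.port)   # assumed: one uop per port per cycle
    return issued
```

With a queue holding a load, an add that depends on the load, and two independent ALU uops, only the load and the older independent ALU uop dispatch in the first cycle: the add waits for its operand, and the younger ALU uop waits for the port.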

Register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in the execution block 211. There are separate register files 208, 210 for integer and floating-point operations, respectively. Each register file 208, 210 of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating-point register file 210 are also capable of communicating data with each other. For one embodiment, the integer register file 208 is split into two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. The floating-point register file 210 of one embodiment has 128-bit-wide entries because floating-point instructions typically have operands from 64 to 128 bits in width. Some embodiments of floating-point register file 210 may have entries that are 256 bits wide, 512 bits wide, or some other width. For some embodiments, each element in the floating-point register file 210 may be written separately at boundaries of 64 bits, 32 bits, 16 bits, etc.

The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210 that store the integer and floating-point data operand values that the micro-instructions need to execute. The processor 200 of one embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating-point ALU 222, and floating-point move unit 224. For one embodiment, the floating-point execution blocks 222, 224 execute floating-point, MMX, SIMD, SSE, AVX, or other operations. The floating-point ALU 222 of one embodiment includes a 64-bit by 64-bit floating-point divider to execute divide, square root, and remainder micro-ops. For embodiments of the present invention, instructions involving a floating-point value may be handled with the floating-point hardware. In one embodiment, the ALU operations go to the high-speed ALU execution units 216, 218. The fast ALUs 216, 218 of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 220, because the slow ALU 220 includes integer execution hardware for long-latency types of operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. For one embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data bit widths, including 16, 32, 128, 256, etc. Similarly, the floating-point units 222, 224 can be implemented to support a range of operands of various bit widths. For one embodiment, the floating-point units 222, 224 can operate on 128-bit-wide packed data operands in conjunction with SIMD and multimedia instructions.

In one embodiment, micro-operation schedulers 202, 204, and 206 dispatch dependent operations before the parent load has finished executing. Because micro-operations are speculatively scheduled and executed in processor 200, processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there may be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In some embodiments, a replay mechanism tracks the instructions that used incorrect data and re-executes them. Only the dependent operations need to be replayed; the independent operations are allowed to complete. The schedulers and replay mechanism of one embodiment of the processor are also designed to catch instructions that provide functionality for testing transactional execution state. In some alternative embodiments without a replay mechanism, speculative execution of micro-operations can be prevented, and dependent micro-operations can reside in schedulers 202, 204, and 206 until they are canceled or until they can no longer be canceled.
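The selective replay described above can be sketched in a few lines. This is a minimal illustration, not the mechanism of any particular embodiment: the dependency table `deps` and the function name `ops_to_replay` are hypothetical, and the sketch assumes that an operation must be replayed exactly when it consumes, directly or transitively, the result of the load that missed.

```python
from collections import deque

def ops_to_replay(deps, missed_load):
    """Return the operations that transitively depend on `missed_load`.

    `deps` maps each operation name to the list of operations whose results
    it consumes (a hypothetical representation chosen for this sketch).
    """
    # Invert the table: producer -> operations that consume its result.
    consumers = {}
    for op, producers in deps.items():
        for producer in producers:
            consumers.setdefault(producer, []).append(op)

    replay = set()
    queue = deque([missed_load])
    while queue:
        producer = queue.popleft()
        for op in consumers.get(producer, []):
            if op not in replay:
                replay.add(op)      # used incorrect data, must be re-executed
                queue.append(op)    # its own consumers are affected as well
    return replay

deps = {"load": [], "add": ["load"], "mul": ["add"], "sub": []}
print(sorted(ops_to_replay(deps, "load")))  # ['add', 'mul']
```

The independent operation `sub` never enters the replay set, which mirrors the statement that independent operations are allowed to complete.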

The term "register" refers to an on-board processor storage location that is used as part of an instruction to identify operands. In other words, registers are those processor storage locations that are usable from outside the processor (from a programmer's perspective). However, the registers of an embodiment are not limited to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, and so on. In one embodiment, integer registers store 32-bit integer data. The register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit-wide MMX™ registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating-point forms, can operate with the packed data elements that accompany SIMD and SSE instructions. The 128-bit-wide XMM registers relating to SSE2, SSE3, SSE4, or newer technology (referred to generically as "SSEx") can also be used to hold such packed data operands. Similarly, the 256-bit-wide YMM registers and the 512-bit-wide ZMM registers relating to AVX, AVX2, AVX3 technology (or beyond) may overlap with the XMM registers and may be used to hold such wider packed data operands. In one embodiment, when storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating-point data are contained either in the same register file or in different register files. Furthermore, in one embodiment, floating-point and integer data may be stored in different registers or in the same registers.
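The overlap between XMM, YMM, and ZMM registers can be pictured as three views of different widths onto the same storage. The sketch below is only an illustration of that idea: the class name `OverlappedVectorRegister` is hypothetical, and the policy of preserving the bits above a narrower write is an assumption made for the sketch (real encodings may zero or preserve upper bits depending on the instruction form).

```python
WIDTHS = {"xmm": 128, "ymm": 256, "zmm": 512}  # view name -> width in bits

class OverlappedVectorRegister:
    """One 512-bit storage location exposed through three overlapping views."""

    def __init__(self):
        self._bits = 0  # the full register contents as a Python integer

    def write(self, view, value):
        mask = (1 << WIDTHS[view]) - 1
        # Assumed merge policy: bits above the written view are preserved.
        self._bits = (self._bits & ~mask) | (value & mask)

    def read(self, view):
        # A narrower view simply sees the low-order bits of the same storage.
        return self._bits & ((1 << WIDTHS[view]) - 1)

reg = OverlappedVectorRegister()
reg.write("zmm", (1 << 300) | 0xABCD)
print(hex(reg.read("xmm")))  # 0xabcd
```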

Fig. 3 A be with can be from the WWW (www) of the Intel company of Santa Clara City, California, America Obtained on intel.com/products/processor/manuals/ "64 and IA-32 Intel Architecture Software is opened Originator handbook volume 2: instruction set reference (64 and IA-32Intel Architecture Software Developer ' s Manual Volume 2:Instruction Set Reference) " described in operation code Format Type is corresponding has One implementation of operation coding (operation code) format 360 and register/memory operand addressing mode of 32 or more positions The description of example.It in one embodiment, can be by one or more fields 361 and 362 come coded command.It can identify each Instruction is up to two operand positions, including is up to two source operand identifiers 364 and 365.For one embodiment, mesh Ground operand identification symbol it is 366 identical as source operand identifier 364, and they are not identical in other embodiments.For can Embodiment is selected, destination operand identifier 366 is identical as source operand identifier 365, and they are not in other embodiments It is identical.In one embodiment, one in the source operand identified by source operand identifier 364 and 365 is commanded As a result it is override, and in other embodiments, identifier 364 corresponds to source register element, and identifier 365 corresponds to purpose Ground register elements.For one embodiment, operand identification accord with 364 and 365 can be used for mark 32 or 64 source and Vector element size.

Fig. 3 B shows another substitution operation coding (operation code) format 370 with 40 or more.Operation Code format 370 corresponds to operation code format 360, and including optional prefix byte 378.It can be led to according to the instruction of one embodiment One or more of field 378,371 and 372 is crossed to encode.By source operand identifier 374 and 375 and pass through prefix Byte 378 can identify and be up to two operand positions in each instruction.For one embodiment, prefix byte 378 can be used for The source and destination operand of mark 32 or 64.For one embodiment, destination operand identifier 376 and source are grasped Identifier 374 of counting is identical, and they are not identical in other embodiments.For alternate embodiment, vector element size mark Symbol 376 is identical as source operand identifier 375, and they are not identical in other embodiments.In one embodiment, instruction pair It accords with one or more operands that 374 and 375 are identified by operand identification to be operated, and by operand identification symbol 374 The result being commanded with one or more operands that 375 are identified is override, however in other embodiments, by identifier 374 and 375 operands identified are written into another data element in another register.360 He of operation code format 370 allow by MOD field 363 and 373 and by optional ratio-index-plot (scale-index-base) and displacement (displacement) register that byte is partially specified to register addressing, memory to register addressing, by memory To register addressing, by register pair register addressing, directly to register addressing, register to memory addressing.

Fig. 3 C is turned next to, in some alternative embodiments, 64 (or 128 or 256 or 512 or more It is more) single-instruction multiple-data (SIMD) arithmetical operation can instruct via coprocessor data processing (CDP) and execute.Operation coding (operation code) format 380 shows such CDP instruction, with CDP opcode field 382 and 389.It is real for substitution Example is applied, the operation of the type CDP instruction can be encoded by one or more of field 383,384,387 and 388.It can be to each Command identification is up to three operand positions, including is up to two source operand identifiers 385 and 390 and a destination Operand identification symbol 386.One embodiment of coprocessor can operate 8,16,32 and 64 place values.For one embodiment, Integer data element is executed instruction.In some embodiments, use condition field 381 can be conditionally executed instruction.For Some embodiments, can be by field 383 come source data size.In some embodiments, zero can be executed to SIMD field (Z), (N), carry (C) are born and overflows (V) detection.For some instructions, saturation type can be encoded by field 384.

Turning now to Fig. 3 D, which depict according to another embodiment with can be from Santa Clara City, California, America Intel company WWW (www) intel.com/products/processor/manuals/ on obtain "High-level vector extension programming reference (Advanced Vector Extensions Programming Reference operation code Format Type described in) " is corresponding for providing the another of the function of test transactional execution state One substitution operation coding (operation code) format 397.

The original x86 instruction set provided for a 1-byte opcode with various formats of address syllable and immediate operand contained in additional bytes, whose presence was known from the first "opcode" byte. Additionally, certain byte values were reserved as modifiers to the opcode (called prefixes, because they had to be placed before the instruction). When the original palette of 256 opcode bytes (including these special prefix values) was exhausted, a single byte was dedicated as an escape to a new set of 256 opcodes. As vector instructions (for example, SIMD) were added, a need for more opcodes arose, and the "two-byte" opcode map was also insufficient, even when expanded through the use of prefixes. To this end, new instructions were added in additional maps that use two bytes plus an optional prefix as an identifier.
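The escape scheme described above can be illustrated with a small classifier. The sketch assumes the conventional x86 escape values, namely 0F as the escape to the two-byte map and 0F 38 / 0F 3A as the escapes to the additional maps; those values, and the function name `classify_opcode`, are not taken from this description. Prefix bytes are assumed to have been stripped already.

```python
def classify_opcode(code):
    """Return (opcode map name, opcode byte) for a prefix-free byte sequence."""
    if code[0] != 0x0F:
        return ("one-byte", code[0])          # original 256-entry palette
    if code[1] in (0x38, 0x3A):
        # Second escape byte selects one of the additional maps.
        return ("three-byte 0F %02X" % code[1], code[2])
    return ("two-byte 0F", code[1])           # single escape byte

print(classify_opcode(bytes.fromhex("0f01d6")))  # ('two-byte 0F', 1)
```

The byte sequence 0F 01 D6 used here is the encoding publicly documented for XTEST; the remaining byte (D6) is carried in the ModRM position and is not interpreted by this sketch.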

Additionally, in order to facilitate additional registers in 64-bit mode, an additional prefix (called "REX") may be used between the prefixes and the opcode (and any escape bytes necessary to determine the opcode). In one embodiment, the REX prefix may have 4 "payload" bits to indicate the use of additional registers in 64-bit mode. In other embodiments, it may have fewer or more than 4 bits. The general format of at least one instruction set (which corresponds generally with format 360 and/or format 370) is illustrated generically by the following:

[prefixes] [rex] escape [escape2] opcode modrm (etc.)
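A short sketch of how the four REX "payload" bits might be extracted is given here. It assumes the conventional x86-64 layout in which a REX byte has the form 0100WRXB (values 0x40 through 0x4F); that layout and the function name `decode_rex` are assumptions of the sketch, not statements from this description.

```python
def decode_rex(byte):
    """Split a REX prefix byte into its W, R, X, and B payload bits.

    Returns None when the byte is not in the 0x40-0x4F REX range.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # operand-size bit
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModRM r/m or SIB base field
    }

print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```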

Opcode format 397 corresponds with opcode format 370 and includes optional VEX prefix bytes 391 (beginning with hexadecimal C4 or C5 in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes. For example, the following illustrates an embodiment that uses two fields to encode an instruction, which may be used when a second escape code is not present in the original instruction. In the embodiment illustrated below, the legacy escape is represented by a new escape value, the legacy prefixes are fully compressed as part of the "payload" bytes, the legacy prefixes are reclaimed and made available for future expansion, and new features are added (for example, increased vector length and an additional source register specifier).

An alternative encoding may be used when a second escape code is present in the original instruction, or when extra bits in the REX field (for example, the XB and W fields) need to be used. In the alternative embodiment illustrated below, the first legacy escape and the legacy prefixes are compressed in a manner similar to the above, and the second escape code is compressed into a "map" field, with future map or feature space available; again, new features are added (for example, increased vector length and an additional source register specifier).

An instruction according to one embodiment may be encoded by one or more of fields 391 and 392. Up to four operand locations per instruction may be identified by field 391 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, VEX prefix bytes 391 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit or 256-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 397 may be redundant with opcode format 370, whereas in other embodiments they are different. Opcode formats 370 and 397 allow register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate, and register-to-memory addressing, specified in part by MOD field 373 and by the optional SIB identifier 393, the optional displacement identifier 394, and the optional immediate byte 395.
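To make the compression of legacy prefixes and escape codes into VEX "payload" bits more concrete, the sketch below decodes the two prefix forms. It assumes the publicly documented VEX layout for 64-bit mode: a C5 prefix followed by one payload byte (inverted R, inverted vvvv, L, pp), or a C4 prefix followed by two bytes (inverted R, X, B plus a 5-bit map field, then W, inverted vvvv, L, pp). The function name `decode_vex` and the returned field names are hypothetical.

```python
def decode_vex(code):
    """Decode a 2-byte (C5) or 3-byte (C4) VEX prefix into its fields."""
    if code[0] == 0xC5:
        payload = code[1]
        r_inv, x_inv, b_inv = (payload >> 7) & 1, 1, 1  # X and B are implied
        map_select, w, length = 1, 0, 2                 # map 1 stands for 0F
    elif code[0] == 0xC4:
        byte1, payload = code[1], code[2]
        r_inv = (byte1 >> 7) & 1
        x_inv = (byte1 >> 6) & 1
        b_inv = (byte1 >> 5) & 1
        map_select = byte1 & 0x1F      # compressed escape code ("map" field)
        w = (payload >> 7) & 1
        length = 3
    else:
        return None
    return {
        "prefix_length": length,
        "R": r_inv ^ 1, "X": x_inv ^ 1, "B": b_inv ^ 1,  # stored inverted
        "W": w,
        "map": map_select,
        "vvvv": ~(payload >> 3) & 0xF,  # extra source register, stored inverted
        "vector_bits": 256 if (payload >> 2) & 1 else 128,
        "pp": payload & 0x3,            # compressed legacy prefix selector
    }

fields = decode_vex(bytes.fromhex("c5f458c2"))
print(fields["vvvv"], fields["vector_bits"])  # 1 256
```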

Fig. 3 E is turned next to, which depict according to another embodiment for providing for testing transactional execution state Another substitution operation coding (operation code) format 398 of function.Operation code format 398 corresponds to operation code format 370 and 397, And it is most to substitute including optional EVEX prefix byte 396 (in one embodiment, starting with hexadecimal 62) The traditional instruction prefix byte of other public uses and code is jumped out, and additional function is provided.According to the finger of one embodiment Order can be encoded by one or more of field 396 and 392.Pass through field 396 and source operation code identifier 374 and 375 And optional ratio-index-plot (scale-index-base SIB) identifier 393, optional displacement identifier 394 and can It selects immediate byte 395 to combine, each instruction can be identified and be up to four operand positions and mask.One is implemented Example, EVEX prefix byte 396 can be used for mark 32 or 64 source and destination operand and/or 128,256 or 512 simd registers or memory operand.For one embodiment, the function as provided by operation code format 398 can be with Operation code format 370 or 397 forms redundancy, and they are different in other embodiments.Operation code format 398 allows by MOD word Section 373 and by optional (SIB) identifier 393, optional displacement identifier 394 and optional 395 institute of immediate byte The specified register using mask in part seeks register to register addressing, memory to register addressing, by memory Location, by register pair register addressing, directly to register addressing, register to memory addressing.At least one instruction set General format (corresponding generally to format 360 and/or format 370) is shown generically as follows:

evex1 RXBmmmmm WvvvLpp evex4 opcode modrm[sib][disp][imm]

For one embodiment, an instruction encoded according to EVEX format 398 may have additional "payload" bits that may be used to provide functionality to test transactional execution state together with additional new features, such as a user-configurable mask register, an additional operand, a selection from among 128-bit, 256-bit, or 512-bit vector registers, or more registers from which to select.

For example, where VEX format 397 may be used to provide functionality to test transactional execution state with an explicit mask and with or without an additional unary operation (such as a type conversion), EVEX format 398 may be used to provide that functionality with an explicit user-configurable mask and with or without an additional binary operation (such as addition or multiplication) that requires an additional operand. Some embodiments of EVEX format 398 may also be used to provide functionality to test transactional execution state with an implicit completion mask and with an additional ternary operation. Additionally, where VEX format 397 may be used to provide functionality to test transactional execution state on 128-bit or 256-bit vector registers, EVEX format 398 may be used to provide that functionality on 128-bit, 256-bit, 512-bit, or larger (or smaller) vector registers.

It will be appreciated that some embodiments of instructions and logic to test transactional execution state may specify explicit source and/or destination operands, whereas some embodiments may have implicit source and/or destination operands. Example instructions that provide functionality to test transactional execution state (referred to below as XTEST) are illustrated by the following examples:
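As a supplementary illustration (a toy model, not an example taken from this description), the following sketch shows the behavior of an instruction that tests whether its execution context lies inside a transactional region and updates a flag accordingly. The class `TransactionState`, the nesting counter, and the flag polarity (zero flag cleared inside a region, set outside) are assumptions of the sketch; the polarity follows the publicly documented behavior of XTEST.

```python
class TransactionState:
    """Toy model: nesting depth of transactional regions plus a zero flag."""

    def __init__(self):
        self.depth = 0  # number of transactional regions currently open
        self.zf = 0     # models the first flag updated by the test

    def xbegin(self):
        self.depth += 1  # entering a (possibly nested) transactional region

    def xend(self):
        if self.depth == 0:
            raise RuntimeError("XEND outside a transactional region")
        self.depth -= 1

    def xtest(self):
        # Assumed polarity: ZF = 0 inside a transactional region, 1 outside.
        self.zf = 0 if self.depth > 0 else 1
        return self.zf

state = TransactionState()
print(state.xtest())  # 1
state.xbegin()
print(state.xtest())  # 0
```

Because the test only reads the transactional state, software can use it to select between a transactional and a non-transactional code path without starting or ending a region.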

Fig. 4 A is the ordered assembly line and register rename level, unrest for showing at least one embodiment according to the present invention Sequence publication/execution pipeline block diagram.Fig. 4 B be at least one embodiment according to the present invention is shown to be included in processing The block diagram of ordered architecture core and register renaming logic, out-of-order publication/execution logic in device.Solid box in Fig. 4 A is shown Ordered assembly line is gone out, dotted line frame shows register renaming, out-of-order publication/execution pipeline.Similarly, the reality in Fig. 4 B Wire frame shows ordered architecture logic, and dotted line frame shows register renaming logic and out-of-order publication/execution logic.

In Fig. 4A, processor pipeline 400 includes a fetch stage 402, a length decode stage 404, a decode stage 406, an allocation stage 408, a renaming stage 410, a scheduling (also known as dispatch or issue) stage 412, a register read/memory read stage 414, an execute stage 416, a write-back/memory-write stage 418, an exception handling stage 422, and a commit stage 424.

In Fig. 4B, arrows denote a coupling between two or more units, and the direction of an arrow indicates the direction of data flow between those units. Fig. 4B shows processor core 490, which includes a front end unit 430 coupled to an execution engine unit 450, both of which are coupled to a memory unit 470.

Core 490 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, core 490 may be a special-purpose core, such as a network or communication core, a compression engine, a graphics core, or the like.

Front end unit 430 includes a branch prediction unit 432 coupled to an instruction cache unit 434, which is coupled to an instruction translation lookaside buffer (TLB) 436, which is coupled to an instruction fetch unit 438, which is coupled to a decode unit 440. The decode unit, or decoder, may decode instructions and generate as output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and the like. Instruction cache unit 434 is further coupled to a level 2 (L2) cache unit 476 in memory unit 470. Decode unit 440 is coupled to a rename/allocator unit 452 in execution engine unit 450.

Execution engine unit 450 includes rename/allocator unit 452, which is coupled to a retirement unit 454 and to a set of one or more scheduler units 456. Scheduler unit 456 represents any number of different schedulers, including reservation stations, a central instruction window, and the like. Scheduler unit 456 is coupled to physical register file unit 458. Each physical register file unit 458 represents one or more physical register files, different ones of which store one or more different data types (such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, and so on) and status (for example, an instruction pointer that is the address of the next instruction to be executed). Physical register file unit 458 is overlapped by retirement unit 454 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (for example, using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; or using register maps and a pool of registers). Generally, the architectural registers are visible from outside the processor or from a programmer's perspective. These registers are not limited to any known particular type of circuit. Various different types of registers are suitable, as long as they are capable of storing and providing the data described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations of dedicated and dynamically allocated physical registers. Retirement unit 454 and physical register file unit 458 are coupled to execution cluster 460. Execution cluster 460 includes a set of one or more execution units 462 and a set of one or more memory access units 464. Execution units 462 may perform various operations (for example, shifts, addition, subtraction, and multiplication) on various types of data (for example, scalar floating-point, packed integer, packed floating-point, vector integer, and vector floating-point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. Scheduler unit 456, physical register file unit 458, and execution cluster 460 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data or operations (for example, a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has memory access units 464). It should also be understood that, where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 464 is coupled to memory unit 470, which includes a data TLB unit 472 coupled to a data cache unit 474, which in turn is coupled to the level 2 (L2) cache unit 476. In one exemplary embodiment, memory access units 464 may include a load unit, a store address unit, and a store data unit, each of which is coupled to data TLB unit 472 in memory unit 470. L2 cache unit 476 is coupled to one or more other levels of cache and eventually to main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement pipeline 400 as follows: 1) instruction fetch unit 438 performs the fetch and length decode stages 402 and 404; 2) decode unit 440 performs decode stage 406; 3) rename/allocator unit 452 performs allocation stage 408 and renaming stage 410; 4) scheduler unit 456 performs scheduling stage 412; 5) physical register file unit 458 and memory unit 470 perform register read/memory read stage 414, and execution cluster 460 performs execute stage 416; 6) memory unit 470 and physical register file unit 458 perform write-back/memory-write stage 418; 7) various units may be involved in exception handling stage 422; and 8) retirement unit 454 and physical register file unit 458 perform commit stage 424.
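The stage-to-unit assignment listed above can be captured as a small lookup table, which makes it easy to ask which stages a given unit takes part in. The table simply restates the reference numerals from the text; the names `STAGE_UNITS` and `stages_using` are hypothetical, and the exception handling stage is left with an empty tuple because the text only says that various units may be involved.

```python
# Pipeline stage number -> reference numerals of the units that perform it.
STAGE_UNITS = {
    402: (438,),      # fetch: instruction fetch unit
    404: (438,),      # length decode: instruction fetch unit
    406: (440,),      # decode: decode unit
    408: (452,),      # allocation: rename/allocator unit
    410: (452,),      # renaming: rename/allocator unit
    412: (456,),      # scheduling: scheduler unit
    414: (458, 470),  # register read/memory read
    416: (460,),      # execute: execution cluster
    418: (470, 458),  # write-back/memory-write
    422: (),          # exception handling: units not enumerated in the text
    424: (454, 458),  # commit: retirement unit and physical register file unit
}

def stages_using(unit):
    """List, in pipeline order, the stages in which `unit` participates."""
    return [stage for stage, units in STAGE_UNITS.items() if unit in units]

print(stages_using(458))  # [414, 418, 424]
```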

Core 490 may support one or more instruction sets, such as the x86 instruction set (with some extensions that have been added in newer versions), the MIPS instruction set of MIPS Technologies of Sunnyvale, California, or the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (for example, time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel Hyperthreading technology).

For one embodiment, execution engine unit 450 includes TSX logic 469 for handling a TSX instruction set. By including the TSX instruction set, and the associated TSX logic 469 to execute those instructions, in the instruction set of general-purpose processor core 490, the operations used by many multithreaded applications may be performed in general-purpose processor core 490 using restricted transactional memory or hardware lock elision. Thus, many multithreaded applications can be accelerated and executed more efficiently by using restricted transactional memory or hardware lock elision to synchronize on shared data. This can eliminate the need to perform unnecessary synchronization on critical sections of shared memory that only rarely conflict. Tracking logic 473 is provided in memory unit 470 to track memory accesses, in the caches of memory unit 470, from a thread processing element associated with a transactional region of shared memory. In one embodiment, checkpoint logic 455 is provided to checkpoint the set of architectural state registers in register file unit 458 for a thread executing on a thread processing element of core 490.
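The interplay of checkpointing and access tracking can be sketched with a toy software model. This is a simplified illustration under stated assumptions, not a description of logic 455 or 473: the class `TransactionalThread`, the buffering of stores until commit, and the conflict rule (any outside access to a tracked address) are all choices made for the sketch.

```python
class TransactionalThread:
    """Toy model of a register checkpoint plus read/write-set tracking."""

    def __init__(self, registers, memory):
        self.registers = dict(registers)
        self.memory = memory      # shared dict: address -> value
        self.checkpoint = None    # saved architectural registers
        self.read_set = set()     # addresses read inside the region
        self.write_buffer = {}    # stores held back until commit

    def in_transaction(self):
        return self.checkpoint is not None

    def begin(self):
        self.checkpoint = dict(self.registers)  # checkpoint register state
        self.read_set.clear()
        self.write_buffer.clear()

    def load(self, addr):
        if self.in_transaction():
            self.read_set.add(addr)              # track the access
            if addr in self.write_buffer:
                return self.write_buffer[addr]   # see our own pending store
        return self.memory[addr]

    def store(self, addr, value):
        if self.in_transaction():
            self.write_buffer[addr] = value      # not yet globally visible
        else:
            self.memory[addr] = value

    def conflicts_with(self, addr):
        """True if an outside access to `addr` touches a tracked address."""
        return self.in_transaction() and (
            addr in self.read_set or addr in self.write_buffer)

    def abort(self):
        self.registers = self.checkpoint         # roll back register state
        self.checkpoint = None
        self.read_set.clear()
        self.write_buffer.clear()                # discard buffered stores

    def commit(self):
        self.memory.update(self.write_buffer)    # publish stores together
        self.checkpoint = None
        self.read_set.clear()
        self.write_buffer.clear()
```

In this model an abort leaves both the registers and shared memory exactly as they were when the region began, which is the property that makes it safe to execute a critical section speculatively and fall back to conventional locking only when a conflict is detected.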

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may also be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 434/474 and a shared L2 cache unit 476, alternative embodiments may have a single internal cache for both instructions and data, such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the caches may be external to the core and/or the processor.

Fig. 5 is a block diagram of a single-core processor and a multi-core processor 500 with an integrated memory controller and graphics according to embodiments of the invention. The solid-lined boxes in Fig. 5 illustrate a processor 500 with a single core 502A, a system agent 510, and a set of one or more bus controller units 516, while the optional addition of the dashed-lined boxes illustrates an alternative processor 500 with multiple cores 502A-N, a set of one or more integrated memory controller units 514 in system agent unit 510, and integrated graphics logic 508.

The memory hierarchy includes one or more levels of cache 504A-N within the cores, a set of one or more shared cache units 506, and external memory (not shown) coupled to the set of integrated memory controller units 514. The set of shared cache units 506 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. Tracking logic 503A-N is provided to track memory accesses, in cache memories 504A-N and/or shared cache units 506, from thread processing elements associated with transactional regions of shared memory. While in one embodiment a ring-based interconnect unit 512 interconnects integrated graphics logic 508, the set of shared cache units 506, and system agent unit 510, alternative embodiments may use any number of well-known techniques for interconnecting such units.

In some embodiments, one or more of the cores 502A-N are capable of multithreading. System agent 510 includes those components that coordinate and operate cores 502A-N. System agent unit 510 may include, for example, a power control unit (PCU) and a display unit. The PCU may be, or may include, the logic and components needed for regulating the power state of cores 502A-N and integrated graphics logic 508. The display unit is for driving one or more externally connected displays.

Cores 502A-N may be homogeneous or heterogeneous in terms of architecture and/or instruction set. For example, some of the cores 502A-N may be in-order while others are out-of-order. As another example, two or more of the cores 502A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

The processor may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™, or StrongARM™ processor, which are available from Intel Corporation of Santa Clara, California. Alternatively, the processor may be from another company, such as ARM Holdings, MIPS, and so on. The processor may be a special-purpose processor, such as a network or communication processor, a compression engine, a graphics processor, a coprocessor, an embedded processor, or the like. The processor may be implemented on one or more chips. Processor 500 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS.

Figs. 6-8 are exemplary systems suitable for including processor 500, while Fig. 9 is an exemplary system on a chip (SoC) that may include one or more of the cores 502. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to Fig. 6, shown is a block diagram of a system 600 in accordance with one embodiment of the invention. System 600 may include one or more processors 610, 615, which are coupled to a graphics memory controller hub (GMCH) 620. The optional nature of additional processors 615 is denoted in Fig. 6 with broken lines.

Each processor 610, 615 may be some version of processor 500. It should be noted, however, that integrated graphics logic and integrated memory control units are unlikely to exist in processors 610, 615. Fig. 6 illustrates that GMCH 620 may be coupled to a memory 640, which may be, for example, a dynamic random access memory (DRAM). For at least one embodiment, the DRAM may be associated with a non-volatile cache, and tracking logic may also be provided to track memory accesses from thread processing elements associated with transactional regions of shared memory in the non-volatile cache.

GMCH 620 may be a chipset or a portion of a chipset. GMCH 620 may communicate with processors 610, 615 and control interaction between processors 610, 615 and memory 640. GMCH 620 may also act as an accelerated bus interface between processors 610, 615 and other elements of system 600. For at least one embodiment, GMCH 620 communicates with processors 610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.

In addition, GMCH 620 is coupled to a display 645 (such as a flat panel display). GMCH 620 may include an integrated graphics accelerator. GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to system 600. Shown by way of example in the embodiment of Fig. 6 are an external graphics device 660, which may be a discrete graphics device coupled to ICH 650, and another peripheral device 670.

Alternatively, additional or different processors may also be present in system 600. For example, additional processor 615 may include an additional processor that is the same as processor 610, an additional processor that is heterogeneous or asymmetric to processor 610, an accelerator (such as a graphics accelerator or a digital signal processing (DSP) unit), a field programmable gate array, or any other processor. There can be a variety of differences between physical resources 610, 615 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, and power consumption characteristics. These differences may effectively manifest themselves as asymmetry and heterogeneity among processors 610, 615. For at least one embodiment, the various processors 610, 615 may reside in the same die package.

Referring now to Fig. 7, shown is a block diagram of a second system 700 according to an embodiment of the present invention. As shown in Fig. 7, multiprocessor system 700 is a point-to-point interconnect system and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. Each of processors 770 and 780 may be some version of processor 500, as may one or more of processors 610, 615.

Although shown with only two processors 770, 780, it is to be understood that the scope of the present invention is not so limited. In other embodiments, one or more additional processors may be present in a given processor.

Processors 770 and 780 are shown as including integrated memory controller units 772 and 782, respectively. Processor 770 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788. As shown in Fig. 7, IMCs 772 and 782 couple the processors to respective memories, namely memory 732 and memory 734, which may be portions of main memory locally attached to the respective processors.

Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point-to-point interface circuits 776, 794, 786, 798. Chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low-power mode. Tracking logic may be provided to track memory accesses from thread processing elements associated with transactional regions of shared memory in the shared cache.

Chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in Fig. 7, various I/O devices 714 may be coupled to first bus 716, along with a bus bridge 718 that couples first bus 716 to a second bus 720. In one embodiment, second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 720 including, for example, a keyboard and/or mouse 722, communication devices 727, and a storage unit 728 such as a disk drive or other mass storage device, which in one embodiment may include instructions/code and data 730. Further, an audio I/O 724 may be coupled to second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of Fig. 7, a system may implement a multi-drop bus or other such architecture.

Referring now to Fig. 8, shown is a block diagram of a third system 800 according to an embodiment of the present invention. Like elements in Figs. 7 and 8 bear like reference numerals, and certain aspects of Fig. 7 have been omitted from Fig. 8 in order to avoid obscuring other aspects of Fig. 8.

Fig. 8 shows that processors 870, 880 may include integrated memory and I/O control logic ("CL") 872 and 882, respectively. For at least one embodiment, CL 872, 882 may include integrated memory controller units such as those described above in connection with Figs. 5 and 7. In addition, CL 872, 882 may also include I/O control logic. Fig. 8 shows that not only are memories 832, 834 coupled to CL 872, 882, but I/O devices 814 are also coupled to control logic 872, 882. Legacy I/O devices 815 are coupled to chipset 890.

Referring now to Fig. 9, shown is a block diagram of a SoC 900 according to one embodiment of the present invention. Elements similar to those in Fig. 5 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In Fig. 9, an interconnect unit 902 is coupled to: an application processor 910, which includes a set of one or more cores 502A-N, one or more levels of cache 504A-N within the cores, and shared cache units 506; tracking logic 503A-N, for tracking memory accesses from thread processing elements associated with transactional regions of shared memory in cache memory 504A-N and/or shared cache units 506; a system agent unit 510; a bus controller unit 516; an integrated memory controller unit 514; a set of one or more media processors 920, which may include integrated graphics logic 508, an image processor 924 for providing still and/or video camera functionality, an audio processor 926 for providing hardware audio acceleration, and a video processor 928 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 930; a direct memory access (DMA) unit 932; and a display unit 940 for coupling to one or more external displays.

Fig. 10 shows a processor containing a central processing unit (CPU) and a graphics processing unit (GPU), which may perform at least one instruction according to one embodiment. In one embodiment, an instruction to perform operations according to at least one embodiment could be performed by the CPU. In another embodiment, the instruction could be performed by the GPU. In still another embodiment, the instruction may be performed through a combination of operations performed by the GPU and the CPU. For example, in one embodiment, an instruction in accordance with one embodiment may be received and decoded for execution on the GPU. However, one or more operations within the decoded instruction may be performed by the CPU, with the result returned to the GPU for final retirement of the instruction. Conversely, in some embodiments, the CPU may act as the primary processor and the GPU as the co-processor.

In some embodiments, instructions that benefit from highly parallel throughput may be performed by the GPU, while instructions that benefit from the performance of processors with deeply pipelined architectures may be performed by the CPU. For example, graphics, scientific applications, financial applications, and other parallel workloads may benefit from the performance of the GPU and be executed accordingly, whereas more sequential applications, such as operating system kernel or application code, may be better suited to the CPU.

In Fig. 10, processor 1000 includes: CPU 1005, GPU 1010, image processor 1015, video processor 1020, USB controller 1025, UART controller 1030, SPI/SDIO controller 1035, display device 1040, High-Definition Multimedia Interface (HDMI) controller 1045, MIPI controller 1050, flash memory controller 1055, double data rate (DDR) controller 1060, security engine 1065, and I²S/I²C (Integrated Interchip Sound/Inter-Integrated Circuit) interface 1070. Other logic and circuits may be included in the processor of Fig. 10, including more CPUs or GPUs and other peripheral interface controllers.

One or more aspects of at least one embodiment may be implemented by representative data stored on a machine-readable medium which represents various logic within the processor and which, when read by a machine, causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium ("tape") and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor. For example, IP cores, such as the Cortex™ family of processors developed by ARM Holdings and the Loongson IP cores developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, may be licensed or sold to various customers or licensees, such as Texas Instruments, Qualcomm, Apple, or Samsung, and implemented in processors produced by these customers or licensees.

Fig. 11 shows a block diagram illustrating the development of IP cores according to one embodiment. Memory 1130 includes simulation software 1120 and/or a hardware or software model 1110. In one embodiment, the data representing the IP core design can be provided to memory 1130 via memory 1140 (e.g., a hard disk), a wired connection (e.g., the internet) 1150, or a wireless connection 1160. The IP core information generated by the simulation tool and model can then be transmitted to a fabrication facility, where it can be fabricated by a third party to perform at least one instruction in accordance with at least one embodiment.

In some embodiments, one or more instructions may correspond to a first type or architecture (e.g., x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). According to one embodiment, an instruction may therefore be performed on any processor or processor type, including ARM, x86, MIPS, a GPU, or another processor type or architecture.

Fig. 12 shows how an instruction of a first type is emulated by a processor of a different type, according to one embodiment. In Fig. 12, program 1205 contains some instructions that may perform the same or substantially the same function as an instruction according to one embodiment. However, the instructions of program 1205 may be of a type and/or format that is different from or incompatible with processor 1215, meaning that instructions of the type in program 1205 may not be able to be executed natively by processor 1215. However, with the help of emulation logic 1210, the instructions of program 1205 are translated into instructions that can be natively executed by processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment, the emulation logic is embodied in a tangible, machine-readable medium containing software to translate instructions of the type in program 1205 into the type natively executable by processor 1215. In other embodiments, the emulation logic is a combination of fixed-function or programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment, the processor contains the emulation logic, whereas in other embodiments the emulation logic exists outside of the processor and is provided by a third party. In one embodiment, the processor is capable of loading the emulation logic embodied in a tangible, machine-readable medium containing software by executing microcode or firmware contained in or associated with the processor.
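
As a rough illustration of the software form of emulation logic 1210 described above, the following Python sketch translates a list of source-type instructions into target-type instructions through a lookup table. The mnemonics, the table contents, and the function name `translate` are all invented for illustration and do not correspond to any real instruction set or to the patented implementation.

```python
# Hypothetical mapping from instructions of a "source" type to sequences of
# instructions of a "target" type. All mnemonics are invented for illustration.
TRANSLATION_TABLE = {
    "SRC_ADD": ["TGT_ADD"],
    "SRC_INC": ["TGT_LOADIMM 1", "TGT_ADD"],
}


def translate(program):
    """Convert a list of source instructions into target instructions."""
    translated = []
    for instr in program:
        if instr not in TRANSLATION_TABLE:
            # A real emulator would need a fallback path; this sketch just fails.
            raise ValueError(f"no translation for {instr!r}")
        translated.extend(TRANSLATION_TABLE[instr])
    return translated
```

One source instruction may expand into several target instructions, which is why the translated program is generally not identical to code compiled directly for the target.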

Fig. 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the present invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Fig. 13 shows that a program in a high-level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306 that may be natively executed by a processor 1316 with at least one x86 instruction set core. The processor 1316 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1304 represents a compiler operable to generate x86 binary code 1306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1316 with at least one x86 instruction set core. Similarly, Fig. 13 shows that the program in the high-level language 1302 may be compiled using an alternative instruction set compiler 1308 to generate alternative instruction set binary code 1310 that may be natively executed by a processor 1314 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1312 is used to convert the x86 binary code 1306 into code that may be natively executed by the processor 1314 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1306.

Fig. 14 shows one embodiment of an apparatus 1401 for providing functionality to test transactional execution status. Apparatus 1401 includes an instruction fetch unit 1438 coupled to a decode unit 1440. The decode unit or decoder may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. Decode unit 1440 is coupled to register file unit 1458.

Each register file unit 1458 represents one or more physical register files, different ones of which store one or more different data types (e.g., scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc.), status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. Register file unit 1458 is coupled with checkpoint logic 1455 of apparatus 1402. Generally, the architectural registers are visible from outside the processor or from a programmer's perspective. In one embodiment, checkpoint logic 1455 is provided to checkpoint a set of architectural state registers in register file unit 1458 for a thread executed by a thread processing element associated with a transactional region of shared memory. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. Register file unit 1458 is coupled to a set of one or more execution units 1462 and a set of one or more memory access units 1464. Execution units 1462 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. Register file unit 1458, memory access units 1464, and execution units 1462 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own register file unit and/or execution units; and, in the case of a separate memory access pipeline, certain embodiments are implemented in which only one or more specific pipelines have memory access units 1464). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order issue/execution.

The set of memory access units 1464 is coupled to a data cache unit 1474, which is coupled to a level 2 (L2) cache unit 1476. In one exemplary embodiment, memory access units 1464 may include a load unit, a store address unit, and a store data unit, each of which is coupled to data cache unit 1474 and tracking logic 1473 of apparatus 1402, to track memory accesses from processing elements associated with transactional regions of shared memory in data cache unit 1474. L2 cache unit 1476 is coupled to one or more other levels of cache and eventually to main memory.

By way of example, exemplary apparatus 1401 may implement pipeline 400 as follows: 1) instruction fetch unit 1438 performs the fetch and length decode stages 402 and 404; 2) decode unit 1440 performs the decode stage 406; 3) register file unit 1458 and memory access unit 1464 perform the register read/memory read stage 414; 4) execution unit 1462 performs the execute stage 416; and 5) memory access unit 1464 and physical register file unit 1458 perform the write back/memory write stage 418.
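
The stage-to-unit assignment just enumerated can be written down as a small lookup table. The Python sketch below is only a restatement of that list; the names `PIPELINE_400` and `units_for_stage` are invented for illustration.

```python
# Which unit of apparatus 1401 performs which stage of pipeline 400,
# restating the enumeration in the description (illustrative only).
PIPELINE_400 = [
    # (stage reference numerals, stage name, performing units)
    ((402, 404), "fetch and length decode", ["instruction fetch unit 1438"]),
    ((406,), "decode", ["decode unit 1440"]),
    ((414,), "register read/memory read",
     ["register file unit 1458", "memory access unit 1464"]),
    ((416,), "execute", ["execution unit 1462"]),
    ((418,), "write back/memory write",
     ["memory access unit 1464", "physical register file unit 1458"]),
]


def units_for_stage(stage_number):
    """Return the units that perform the stage with the given numeral."""
    for numerals, _name, units in PIPELINE_400:
        if stage_number in numerals:
            return units
    raise KeyError(stage_number)
```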

Apparatus 1401 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions, including TSX ISA 1469); the MIPS instruction set of MIPS Technologies of Sunnyvale, California (including transactional synchronization such as that in TSX ISA 1469); the ARM instruction set of ARM Holdings of Sunnyvale, California (with optional additional extensions such as NEON, and including transactional synchronization such as that in TSX ISA 1469)).

It should be understood that apparatus 1401 may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding and simultaneous multithreading thereafter, such as in Intel Hyper-Threading technology).

For one embodiment, execution units 1462 execute TSX instruction set architecture (ISA) 1469 to perform transactional synchronization in cooperation with TSX control 1457. TSX control 1457 operates with checkpoint logic 1455 of apparatus 1402 to checkpoint a set of architectural registers in register file unit 1458, and with tracking logic 1473 in memory access units 1464 to track memory accesses from thread processing elements associated with transactional regions of shared memory in data cache unit 1474. If a read/write conflict occurs, the architectural state can be rolled back to a previous synchronization point, and the conflicting accesses are not committed. For one embodiment, TSX ISA 1469 of apparatus 1402 includes one or more instructions (e.g., the XTEST instructions described above) which can be executed by execution units 1462 to provide functionality to test transactional execution status in a thread processing element.
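
The checkpoint-and-rollback behavior described above can be modeled in a few lines of Python. This is a toy software model under simplifying assumptions (a single checkpoint, registers held in a dictionary); the class name `CheckpointedRegisters` and its methods are invented for illustration and say nothing about how checkpoint logic 1455 is built in hardware.

```python
import copy


class CheckpointedRegisters:
    """Toy model of an architectural register set with one checkpoint."""

    def __init__(self, **regs):
        self.regs = dict(regs)
        self._checkpoint = None

    def take_checkpoint(self):
        # Save a copy of the architectural state at the start of the region.
        self._checkpoint = copy.deepcopy(self.regs)

    def rollback(self):
        # Restore the saved state, e.g. after a read/write conflict.
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint to roll back to")
        self.regs = self._checkpoint
        self._checkpoint = None

    def commit(self):
        # Discard the checkpoint; the current state becomes architectural.
        self._checkpoint = None
```

For example, after `take_checkpoint()`, a change to `regs["rax"]` followed by `rollback()` leaves `rax` at its checkpointed value, whereas `commit()` keeps the change.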

By including TSX ISA 1469 in the instruction set of a general-purpose processor core, along with associated logic to perform these instructions, the operations used by many multithreaded applications may be performed using restricted transactional memory or hardware lock elision in the general-purpose processor core with apparatus 1401. Thus, many multithreaded applications can be accelerated and executed more efficiently by using restricted transactional memory or hardware lock elision to perform synchronization on shared data. As described above, when a thread processing element is executing transactionally, tracking logic 1473 in memory access units 1464 tracks memory accesses from thread processing elements associated with transactional regions of shared memory in data cache unit 1474. This can eliminate the need to perform unnecessary synchronization on critical sections of shared memory in which conflicts are relatively rare.

Fig. 15 shows a flow diagram for one embodiment of a process 1501 to provide functionality to test transactional execution status. Process 1501 and other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by general-purpose machines, by special-purpose machines, or by a combination of both.

In processing block 1510 of process 1501, a first instruction or prefix to start a transactional region (e.g., RTM or HLE) is decoded. In response to decoding the first instruction, a checkpoint for a set of architectural state registers is generated in processing block 1520. Also in response to decoding the first instruction, memory accesses from a processing element in the transactional region associated with the first instruction are tracked in processing block 1530. In processing block 1540, a second instruction to detect transactional execution of the transactional region (e.g., one of the XTEST instructions) is decoded. In processing block 1550, an operation is executed in response to decoding the second instruction to determine whether the execution context of the second instruction is within the transactional region. Then, in response to the second instruction, a first flag is updated in processing block 1560 (e.g., updated to zero if the execution context of the second instruction is within the transactional region, and to one otherwise). In processing block 1570, a register is optionally updated, further in response to the second instruction (e.g., as in XTEST.NL or XTEST.BA, etc.). Also, in processing block 1580, a second flag is optionally updated in response to the second instruction (e.g., as in XTEST.BV, XTEST.MV, or XTEST.BM, etc.).
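
The flag update of processing block 1560 can be illustrated with a small software model. In the Python sketch below, the class `ToyCore` and its method names are hypothetical, and tracking the region with a nesting counter is an assumption made for the sketch; only the rule "flag is zero inside the region, one outside" is taken from the description.

```python
class ToyCore:
    """Toy model of the first-flag update in process 1501 (not hardware)."""

    def __init__(self):
        self.nesting = 0     # assumption: 0 means outside any region
        self.zero_flag = 0

    def begin_region(self):
        # Stands in for the first instruction (e.g., XBEGIN).
        self.nesting += 1

    def end_region(self):
        # Stands in for the end of the region (e.g., XEND).
        if self.nesting == 0:
            raise RuntimeError("not in a transactional region")
        self.nesting -= 1

    def xtest(self):
        # Stands in for the second instruction: the flag is updated to zero
        # inside a transactional region and to one otherwise (block 1560).
        self.zero_flag = 0 if self.nesting > 0 else 1
        return self.zero_flag
```

A program can then branch on the returned flag, for instance to skip a code segment that should not run transactionally.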

It will be understood that although process 1501 and other processes disclosed herein are shown as being performed in a particular order, in some alternative embodiments the operations of these processing blocks may be performed in various different orders and/or in parallel or concurrently.

Fig. 16 shows a flow diagram for an alternative embodiment 1601 of a process to support testing transactional execution status. In processing block 1605, a transactional region is entered (e.g., by encountering an XACQUIRE prefix or an XBEGIN instruction). In processing block 1610, the architectural registers and state are saved. At this point, if an XTEST instruction is executed in processing block 1615, the test at processing block 1620 will determine that, as a result of executing the XTEST instruction in processing block 1615 within the transactional execution region, the zero flag has not been set. It will be understood that the flow diagram of Fig. 16 is only an example, and that a programmer may execute the XTEST instruction of processing block 1615 at any point in the process.

Proceeding to processing block 1625, memory transactions are buffered as a result of being in the transactional execution region. In processing block 1635, the buffered memory locations may be marked as exclusive (e.g., in the data cache). In processing block 1645, the read set is monitored. If another execution thread writes to a monitored memory location of the read set in processing block 1650, then transactional processing is aborted in processing block 1665 (referred to as a transactional abort), and the processor will roll execution back to a previous synchronization point (e.g., the state saved in processing block 1610). On the other hand, when no other execution thread has written to a monitored memory location of the read set in processing block 1650, then in processing block 1655 the write set is monitored in accordance with any read/write transactions. If another execution thread reads or writes a monitored memory location of the write set in processing block 1660, then transactional processing is likewise aborted in processing block 1665. It will be understood that such monitoring is a continually ongoing process, maintained in a manner similar to cache coherency maintenance. If, by the time the end of the transactional region is reached, no other execution thread has written to a monitored memory location of the read set in processing block 1650 and no other execution thread has read or written a monitored memory location of the write set in processing block 1660, then the transactional region is exited in processing block 1670 (e.g., by encountering an XRELEASE prefix or an XEND instruction), and the buffered memory transactions are atomically committed in processing block 1675 so that they can be observed by other execution threads.
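
The read-set and write-set monitoring described above can be sketched as a single-threaded software model. In the Python below, `ToyTransaction`, `observe_remote`, and the dictionary used as shared memory are all invented for illustration; real hardware performs this tracking in the cache hierarchy, and the "atomic" commit here is only a stand-in for publishing all buffered stores together.

```python
class ToyTransaction:
    """Toy model of read-set/write-set monitoring with buffered stores."""

    def __init__(self, memory):
        self.memory = memory        # shared dict: address -> value
        self.read_set = set()
        self.write_buffer = {}      # buffered stores, invisible to others
        self.aborted = False

    def load(self, addr):
        self.read_set.add(addr)
        # A transaction sees its own buffered stores first.
        return self.write_buffer.get(addr, self.memory.get(addr))

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def observe_remote(self, addr, is_write):
        # Another thread writing our read set (block 1650), or reading or
        # writing our write set (block 1660), is treated as a conflict.
        if (is_write and addr in self.read_set) or addr in self.write_buffer:
            self.aborted = True
            self.write_buffer.clear()   # uncommitted stores are discarded

    def commit(self):
        if self.aborted:
            return False
        self.memory.update(self.write_buffer)   # publish stores together
        self.write_buffer.clear()
        return True
```

Note that a remote read of a location that is only in the read set does not abort the transaction, while any remote access to a buffered (write-set) location does.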

After a transactional abort in processing block 1665, the processor rolls execution back to the previous synchronization point, thereby restoring the saved architectural registers and state and discarding any uncommitted memory transactions. At this point, if an XTEST instruction is executed in processing block 1615, the test at processing block 1620 will determine that, as a result of executing the XTEST instruction in processing block 1615 after the transactional abort in processing block 1665, the zero flag is set, and therefore that execution is not within a transactional execution region. Thus, in processing block 1630, the program or thread has been restored to the processor state of the previous synchronization (rollback) point, and may continue executing in processing block 1640 as a non-transactional region. According to embodiments of the XTEST instruction, the program can determine whether a transactional abort has occurred, which the processor or memory state might not otherwise indicate.

It will be understood that, given the ability to observe whether a transactional abort has occurred, the programmer can be provided with options such as recording and counting the number of retries that end in a transactional abort. Other options can also be provided to the programmer, such as skipping a code segment depending on whether the program is or is not currently executing within a transactional execution region. Various other types of XTEST instructions have also been described, which may provide additional options to the programmer, such as obtaining an indication before a transactional abort that something may go wrong (e.g., buffer space is being exhausted, or some other thread has issued a transaction to the same memory location that the thread intends to modify, etc.).
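
Counting the retries that end in a transactional abort, as mentioned above, might look like the following Python sketch. The function `run_with_retries` and the callable `attempt` are hypothetical; `attempt` is assumed to return True when the region committed and False when it ended in an abort (something a program could learn from an XTEST-style check after rollback).

```python
def run_with_retries(attempt, max_retries=3):
    """Run `attempt` until it reports success, counting aborted tries.

    `attempt` is a hypothetical callable that returns True if the
    transactional region committed and False if it ended in an abort.
    Returns (succeeded, number_of_aborts).
    """
    aborts = 0
    for _ in range(max_retries):
        if attempt():
            return True, aborts
        aborts += 1
    # All tries aborted; a caller might now fall back to a lock-based path.
    return False, aborts
```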

The foregoing description is intended to illustrate preferred embodiments of the present invention. From the above discussion it should also be apparent that, especially in an area of technology where growth is fast and further advancements are not easily foreseen, the invention may be modified in arrangement and detail by those skilled in the art without departing from the principles of the present invention, within the scope of the appended claims and their equivalents.

Claims

1. A system, comprising:

a plurality of processors;

a processor interconnect to communicatively couple two or more of the processors;

a system memory, comprising dynamic random access memory, communicatively coupled to one or more of the processors,

wherein one or more of the plurality of processors comprise:

a plurality of multithreaded cores to perform out-of-order instruction execution for a plurality of threads, wherein one or more of the plurality of multithreaded cores comprise:

instruction fetch logic to fetch instructions of one or more of the plurality of threads,

an instruction decode unit to decode the instructions,

register renaming logic to rename one or more registers for the instructions within a register file,

an instruction cache to cache one or more of the instructions to be executed,

a data cache to cache data for the instructions,

a second-level (L2) cache unit to cache one or more of the instructions and data for the instructions, and

an execution unit to execute a transactional execution region of instructions, the execution unit having circuitry to execute a first instruction, the first instruction to test a state of the transactional execution region,

wherein the execution unit is further to determine whether the first instruction is within a context of the transactional execution region and, in response, to set a flag register to a value indicating that the first instruction is within the context of the transactional execution region.

2. the system as claimed in claim 1, which is characterized in that further include:

an accelerator unit, communicatively coupled to one or more of the processors, to perform a specified function.

3. The system of claim 2, wherein the accelerator unit comprises a field-programmable gate array.

4. The system of claim 1, further comprising:

an external cache communicatively coupled to one or more processor interconnects.

5. The system of claim 1, further comprising:

at least one communication device communicatively coupled to one or more of the processors.

6. The system of claim 1, further comprising:

a storage device communicatively coupled to two or more of the processors.

7. The system of claim 1, wherein the execution unit further has circuitry to execute a second instruction of the instructions, the second instruction to indicate a start of the transactional execution region.

8. The system of claim 1, wherein the execution unit further has circuitry to execute a third instruction of the instructions, the third instruction to indicate an end of the transactional execution region and to cause a memory transaction to be atomically committed or aborted.

9. The system of claim 1, wherein the execution unit further has circuitry to execute a second instruction of the instructions and circuitry to execute a third instruction of the instructions, wherein the second instruction is to indicate a start of the transactional execution region, and the third instruction is to indicate an end of the transactional execution region and to cause a memory transaction to be atomically committed or aborted.

10. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating a nesting level of the transactional execution region.

11. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating at least one of a number or a size of internal buffers available to the transactional execution region.

12. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating that a transaction to a particular memory location may overflow an internal buffer and cause an abort of execution of the transactional execution region.

13. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating that an access to a particular memory location may conflict with execution of another transactional execution region and cause an abort of execution of the transactional execution region.
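The status values recited in claims 11 to 13 (available buffering, possible overflow, possible conflict) can be illustrated with a simplified software model. This is a sketch under stated assumptions, not the claimed hardware: the class `Region`, its methods, the 64-byte `LINE` granularity, and the choice to track read and write sets per cache line are all hypothetical and chosen only for illustration.

```python
LINE = 64  # assumed tracking granularity in bytes (hypothetical)


class Region:
    """Toy model of the accesses tracked for one transactional region."""

    def __init__(self, capacity_lines):
        self.capacity_lines = capacity_lines  # assumed internal buffer size
        self.read_set = set()
        self.write_set = set()

    def track(self, addr, is_write):
        # Record the line touched by an access made inside the region.
        line = addr // LINE
        (self.write_set if is_write else self.read_set).add(line)

    def lines_available(self):
        # Claim 11 analogue: how much internal buffering is still free.
        return self.capacity_lines - len(self.read_set | self.write_set)

    def would_overflow(self, addr):
        # Claim 12 analogue: a new line with no free buffer left would
        # overflow the buffer and abort the region.
        line = addr // LINE
        tracked = line in self.read_set or line in self.write_set
        return not tracked and self.lines_available() == 0

    def would_conflict(self, addr, is_write, other):
        # Claim 13 analogue: a write conflicts with any access by the other
        # region to the same line; a read conflicts only with its writes.
        line = addr // LINE
        if is_write:
            return line in other.read_set or line in other.write_set
        return line in other.write_set


a = Region(capacity_lines=2)
b = Region(capacity_lines=2)
a.track(0, is_write=True)    # region a writes line 0
a.track(64, is_write=False)  # region a reads line 1
```

With this state, `a.would_overflow(128)` reports a potential overflow because both buffer lines are in use, while `b.would_conflict(64, False, a)` reports no conflict because two reads of the same line can coexist.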