CN105677526B - System for testing transactional execution state - Google Patents
- Publication number
- CN105677526B
- Authority
- CN
- China
- Prior art keywords
- instruction
- transactional
- processor
- register
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F9/30076: Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F11/2236: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
- G06F9/30087: Synchronisation or serialisation instructions
- G06F9/3834: Maintaining memory consistency
- G06F9/3842: Speculative instruction execution
- G06F9/467: Transactional memory
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Executing Machine-Instructions (AREA)
Abstract
This application discloses instructions and logic to test transactional execution state. Novel instructions, logic, methods, and apparatus are disclosed to test transactional execution state. Embodiments include decoding a first instruction to start a transactional region. Responsive to the first instruction, a checkpoint for a set of architecture state registers is generated, and memory accesses from a processing element in the transactional region associated with the first instruction are tracked. A second instruction to detect transactional execution of the transactional region is then decoded. Responsive to decoding the second instruction, an operation is executed to determine whether the execution context of the second instruction is within the transactional region. A first flag is then updated responsive to the second instruction. In some embodiments, responsive to the second instruction, a register is optionally updated and/or a second flag is optionally updated.
Description
This patent application is a divisional application of the invention patent application entitled "Instruction and logic to test transactional execution state", international application number PCT/US2013/046633, international filing date June 19, 2013, which entered the Chinese national phase as application number 201380028480.6.
Related application
This application is a continuation-in-part of currently pending international application PCT/US2012/023611, filed February 2, 2012, and designating the United States. That earlier international application is incorporated herein by reference as if set forth in its entirety in this application.
Technical field
The present disclosure relates generally to the field of processing logic, microprocessors, and associated instruction set architectures that, when executed by a processor or other processing logic, perform logical, mathematical, or other functional operations. In particular, the disclosure relates to instructions and logic to test transactional execution state.
Background
Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a result, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple processing cores and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single processor die, where the processor die may include any number of cores or logical processors.
The ever-increasing number of cores and logical processors on integrated circuits enables more software threads to be executed concurrently. However, the increase in the number of software threads that may execute simultaneously has created problems with synchronizing data shared among the software threads. One common solution for accessing shared data in multi-core or multi-logical-processor systems is to use locks to guarantee mutual exclusion across multiple accesses to shared data. However, the ever-increasing ability to execute multiple software threads turns lock-protected data into a bottleneck: threads must wait for other threads to finish, which serializes their execution and reduces the benefit of running multiple threads concurrently. In addition, some read-only accesses may take a lock to ensure mutual exclusion in case a writer attempts to modify the data, which has the undesirable side effect of excluding other read-only accesses.
For example, consider a hash table holding shared data. With a lock system, a programmer may lock the entire hash table, allowing one thread to access the entire hash table. However, the throughput and performance of other threads may be adversely affected, because they cannot access any entry in the hash table until the lock is released. Alternatively, each entry in the hash table may be locked, resulting in many lock constructs in software. In such a construction, many locks may need to be acquired to perform a particular task, which may lead to deadlock with other threads. Either way, after extrapolating this simple example into a large scalable program, it is apparent that the complexity of lock contention, serialization, fine-grained synchronization, and deadlock avoidance becomes an extremely cumbersome burden for programmers.
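The two locking strategies in the hash table example can be sketched in software. The following is a minimal illustration only, not taken from the patent: the class names `CoarseLockedTable` and `StripedLockedTable`, the stripe count, and the sorted lock-ordering rule used to avoid deadlock are all assumptions chosen for the sketch.

```python
import threading


class CoarseLockedTable:
    """One lock guards the whole table: simple, but every access serializes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, amount):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


class StripedLockedTable:
    """One lock per bucket: more concurrency, but more locks to manage."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._buckets = [{} for _ in range(stripes)]

    def _index(self, key):
        return hash(key) % len(self._locks)

    def add(self, key, amount):
        i = self._index(key)
        with self._locks[i]:
            self._buckets[i][key] = self._buckets[i].get(key, 0) + amount

    def get(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key, 0)

    def transfer(self, src, dst, amount):
        # A task touching two entries may need two locks. Taking them in a
        # fixed (sorted) order is one common way to avoid deadlock.
        a, b = self._index(src), self._index(dst)
        indices = sorted({a, b})
        for i in indices:
            self._locks[i].acquire()
        try:
            self._buckets[a][src] = self._buckets[a].get(src, 0) - amount
            self._buckets[b][dst] = self._buckets[b].get(dst, 0) + amount
        finally:
            for i in reversed(indices):
                self._locks[i].release()
```

The coarse table is easy to reason about but lets only one thread in at a time; the striped table admits more concurrency at the cost of the lock-ordering discipline visible in `transfer`.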
Another, more recent data synchronization technique involves the use of transactional memory (TM). Transactional execution generally includes atomically executing a grouping of multiple micro-operations, operations, or instructions. In the example above, both threads execute within the hash table, and their memory accesses are monitored/tracked. If both threads access/alter the same entry, conflict resolution may be performed to ensure data validity. One type of transactional execution is software transactional memory (STM), in which the tracking of memory accesses, conflict resolution, abort tasks, and other transactional tasks are performed in software, typically without hardware support. Another type of transactional execution is a hardware transactional memory (HTM) system, in which hardware is included to support access tracking, conflict resolution, and other transactional tasks.
A technique similar to transactional memory is hardware lock elision (HLE), in which a locked critical section is executed tentatively without acquiring the lock. If the execution succeeds (i.e., without conflicts), the results are made globally visible. In other words, the critical section is executed like a transaction with the lock instructions elided from the critical section, rather than as an atomically defined transaction. As a result, in the example above, instead of replacing the hash table accesses with a transaction, the critical section defined by the lock instructions is tentatively executed. Multiple threads similarly execute within the hash table, and their accesses are monitored/tracked. If any of the threads access/alter the same entry, conflict resolution may be performed to ensure data validity. But if no conflicts are detected, the updates to the hash table are atomically committed.
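The tentative-execute, detect-conflict, commit-atomically flow described above can be modeled in software. The sketch below is an assumption-laden illustration, not the hardware mechanism: it uses per-address version numbers for conflict detection, and the names `Memory`, `SpeculativeRegion`, and `try_commit` are hypothetical.

```python
class Memory:
    """Shared memory with a version counter per address (a modeling device)."""

    def __init__(self):
        self.cells = {}
        self.versions = {}

    def version(self, addr):
        return self.versions.get(addr, 0)


class SpeculativeRegion:
    """Tracks the accesses of one tentatively executed critical section."""

    def __init__(self, memory):
        self.memory = memory
        self.seen_versions = {}  # addr -> version when first touched
        self.write_buffer = {}   # addr -> tentative value, not yet visible

    def read(self, addr):
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.seen_versions.setdefault(addr, self.memory.version(addr))
        return self.memory.cells.get(addr, 0)

    def write(self, addr, value):
        self.seen_versions.setdefault(addr, self.memory.version(addr))
        self.write_buffer[addr] = value

    def try_commit(self):
        # Conflict: some tracked address was changed by another region
        # after this region first touched it.
        for addr, seen in self.seen_versions.items():
            if self.memory.version(addr) != seen:
                return False
        # No conflict: make all buffered updates visible together.
        for addr, value in self.write_buffer.items():
            self.memory.cells[addr] = value
            self.memory.versions[addr] = self.memory.version(addr) + 1
        return True
```

Two regions that update different entries both commit; when two regions update the same entry, the second to commit sees a changed version and must retry or fall back to taking the lock.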
As can be seen, transactional execution and lock elision have the potential to provide better performance across multiple threads. However, HLE and TM are relatively new fields of research with respect to microprocessors. Consequently, HLE and TM implementations in processors have not yet been fully explored or studied in detail.
Brief description of the drawings
The present invention is illustrated by way of example, and not limitation, in the figures of the accompanying drawings.
Fig. 1 illustrates one embodiment of a computing system that uses instructions and logic to test transactional execution state.
Fig. 2 illustrates one embodiment of a processor that uses instructions and logic to test transactional execution state.
Fig. 3A illustrates an instruction encoding to provide functionality to test transactional execution state according to one embodiment.
Fig. 3B illustrates an instruction encoding to provide functionality to test transactional execution state according to another embodiment.
Fig. 3C illustrates an instruction encoding to provide functionality to test transactional execution state according to another embodiment.
Fig. 3D illustrates an instruction encoding to provide functionality to test transactional execution state according to another embodiment.
Fig. 3E illustrates an instruction encoding to provide functionality to test transactional execution state according to another embodiment.
Fig. 4A is a block diagram of one embodiment of a processor micro-architecture to execute instructions that provide functionality to test transactional execution state.
Fig. 4B illustrates elements of one embodiment of a processor micro-architecture to execute instructions that provide functionality to test transactional execution state.
Fig. 5 is a block diagram of one embodiment of a processor to execute instructions that provide functionality to test transactional execution state.
Fig. 6 is a block diagram of one embodiment of a computer system to execute instructions that provide functionality to test transactional execution state.
Fig. 7 is a block diagram of another embodiment of a computer system to execute instructions that provide functionality to test transactional execution state.
Fig. 8 is a block diagram of another embodiment of a computer system to execute instructions that provide functionality to test transactional execution state.
Fig. 9 is a block diagram of one embodiment of a system on a chip to execute instructions that provide functionality to test transactional execution state.
Fig. 10 is a block diagram of an embodiment of a processor to execute instructions that provide functionality to test transactional execution state.
Fig. 11 is a block diagram of one embodiment of an IP core development system that provides functionality to test transactional execution state.
Fig. 12 illustrates one embodiment of an architecture emulation system that provides functionality to test transactional execution state.
Fig. 13 illustrates one embodiment of a system to translate instructions that provide functionality to test transactional execution state.
Fig. 14 illustrates one embodiment of an apparatus that provides functionality to test transactional execution state.
Fig. 15 illustrates a flow diagram for one embodiment of a process to provide functionality to test transactional execution state.
Fig. 16 illustrates a flow diagram for an alternative embodiment of a process to provide functionality to test transactional execution state.
Detailed description
Some embodiments of the instructions and logic disclosed herein to test transactional execution state may be implemented in conjunction with the Transactional Synchronization Extensions (TSX) to a processor instruction set architecture (ISA). Such extensions provide the ability to detect dynamically when serialization of a lock-protected critical section is needed in a multithreaded software environment. A programmer-specified code region (called a transactional region) may be executed transactionally. If the transactional execution completes successfully (i.e., without contention from another process or thread), then all memory operations or modifications to data in memory performed within the transactional region appear to have occurred atomically or instantaneously when the execution completes and exits the transactional region.
Hardware lock elision (HLE) is one embodiment of such an extension. It provides programmers with an instruction set interface that uses two instruction prefix hints, XACQUIRE and XRELEASE, to mark the transactional region around the acquisition and release of the lock protecting a critical section. With HLE, the processor may elide the write operations associated with the lock and attempt to execute the region transactionally. If the processor detects any data conflict, the transactional execution is aborted and the critical section is re-executed non-transactionally, without elision.
Restricted transactional memory (RTM) is another embodiment of an instruction set interface for programmers. It uses three instructions: XBEGIN and XEND, to execute a transactional region; and XABORT, to explicitly abort execution of the RTM region. The XBEGIN instruction may also specify a branch-target relative displacement to a fallback code section that is executed if the transaction aborts. The fallback code may include conflict resolution steps. An explicit XABORT may also specify an 8-bit immediate value to be written to a register, for example for use by the fallback code section. Embodiments of the instructions and logic disclosed herein to test transactional execution state may also be implemented in conjunction with other processor ISA transactional extensions, and/or with HTM, and/or with STM, and/or with other transactional execution contexts.
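The control flow of an RTM region (begin, body, explicit abort with an 8-bit code, fallback) can be illustrated with a small software model. This is a sketch under stated assumptions rather than real TSX code: a Python exception stands in for the hardware abort, a dictionary copy stands in for the architecture-state checkpoint, and the names `TransactionAborted`, `run_region`, and the `abort_code` key are hypothetical.

```python
class TransactionAborted(Exception):
    """Models an explicit abort; only the low 8 bits of the code are kept."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code & 0xFF  # the text describes an 8-bit immediate


def run_region(state, body, fallback):
    """Run `body` as a modeled transactional region over `state`.

    On abort, speculative updates are discarded, the abort code is made
    available to the fallback code, and `fallback` runs instead.
    """
    checkpoint = dict(state)  # stand-in for the register checkpoint
    try:
        body(state)           # models the code between XBEGIN and XEND
        return "committed"
    except TransactionAborted as abort:
        state.clear()
        state.update(checkpoint)          # roll back to the checkpoint
        state["abort_code"] = abort.code  # hypothetical register name
        fallback(state)
        return "fallback"
```

A fallback function can inspect `state["abort_code"]` to decide how to resolve the conflict, for example by retrying or by taking a conventional lock.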
Novel instructions, logic, methods, and apparatus are disclosed herein to test transactional execution state. Embodiments include decoding a first instruction or prefix to start a transactional region. Responsive to the first instruction or prefix, a checkpoint for a set of architecture state registers is generated, and memory accesses from a processing element in the transactional region associated with the first instruction are tracked. In one embodiment, the instruction set interface for programmers may include a second instruction to test transactional state, where the second instruction is executed to determine whether the execution context is within a transactional region or a speculative transactional critical section (e.g., HLE or RTM). In one embodiment, such an instruction may set a flag register to one value (e.g., zero) if it determines that it is executing inside a transactional region, and to another value (e.g., one) if it determines that it is not executing inside a transactional region. In an alternative embodiment, such an instruction may set a register to a value indicating the nesting level of the possibly transactional region. In another alternative embodiment, such an instruction may determine whether accessing the memory associated with a memory operand would cause a transactional abort of the possibly transactional region. In yet another alternative embodiment, such an instruction may determine whether sufficient buffering is available for transactional execution of the possibly transactional region. Other alternative embodiments are possible.
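The flag and nesting-level behaviors described above can be captured in a short software model. The following is a sketch only: the class `TransactionalContext`, its method names, and the `MAX_NESTING` value of 7 are hypothetical, chosen for illustration, and the flag values follow the examples in the text (zero inside a region, one outside).

```python
class TransactionalContext:
    """Models the state that a transactional-state test instruction inspects."""

    MAX_NESTING = 7  # hypothetical hardware limit, for illustration only

    def __init__(self):
        self.depth = 0  # current nesting level; 0 means non-transactional

    def begin(self):
        """Enter a (possibly nested) region; False models an abort."""
        if self.depth >= self.MAX_NESTING:
            self.depth = 0  # exceeding the limit aborts the whole nest
            return False
        self.depth += 1
        return True

    def end(self):
        """Leave the innermost region, if any."""
        if self.depth > 0:
            self.depth -= 1

    def test_flag(self):
        """First embodiment: flag is zero inside a region, one outside."""
        return 0 if self.depth > 0 else 1

    def test_depth(self):
        """Alternative embodiment: report the nesting level in a register."""
        return self.depth
```

Code running in such a context could compare `test_depth()` against the limit before opening another nested region, which corresponds to the nesting check the text mentions.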
It will be appreciated that, using one embodiment of such an instruction, a programmer may dynamically determine inside a possibly transactional region (e.g., an HLE region) whether the region is being executed transactionally or is merely being re-executed non-transactionally following a transactional abort. Using one embodiment of such an instruction, a programmer may dynamically determine inside a possibly transactional region (e.g., an RTM region) whether an XABORT instruction will restore the previous architecture state or will be treated as a NOP (i.e., no operation). Using one embodiment of such an instruction, a programmer may dynamically determine whether a library routine was called from within a transactional region or from a fallback code section. It will also be appreciated that, using one embodiment of such an instruction, a programmer may dynamically determine whether the nesting level of transactional regions is approaching a hardware limit and whether further nesting would cause a transactional abort.
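One of the use cases above, a library routine that adapts to its caller's context, can be sketched as follows. This is an illustrative model with hypothetical names (`Context`, `in_transaction`, `log_event`, `flush_deferred`); the deferral policy is an assumption about how a programmer might use the test result and is not something the text prescribes.

```python
class Context:
    """Minimal model of a thread's execution context."""

    def __init__(self):
        self.depth = 0       # > 0 while inside a transactional region
        self.deferred = []   # work postponed until after the region
        self.emitted = []    # work already performed


def in_transaction(ctx):
    """Models the result of the transactional-state test instruction."""
    return ctx.depth > 0


def log_event(ctx, message):
    """Library routine that behaves differently depending on its caller."""
    if in_transaction(ctx):
        # Inside a region: postpone side effects that might force an abort.
        ctx.deferred.append(message)
    else:
        # Called from fallback or ordinary code: act immediately.
        ctx.emitted.append(message)


def flush_deferred(ctx):
    """Perform postponed work once execution is non-transactional again."""
    if not in_transaction(ctx):
        ctx.emitted.extend(ctx.deferred)
        ctx.deferred.clear()
```

The same routine is thereby usable both from a transactional region and from its fallback path without the caller passing any extra argument.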
In the following description, numerous specific details such as processing logic, processor types, micro-architectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring embodiments of the present invention.
These and other embodiments of the present invention may be realized in accordance with the following teachings, and it should be evident that various modifications and changes may be made in the following teachings without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense, and the invention is measured only in terms of the appended claims and their equivalents.
Fig. 1 illustrates one embodiment of a computing system 100 that uses instructions and logic to test transactional execution state. System 100 includes a component, such as a processor 102, that employs execution units including logic to perform algorithms for processing data, in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the Pentium® III, Pentium® 4, Xeon™, XScale™, and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (UNIX and Linux, for example), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware and software.
Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform one or more instructions in accordance with at least one embodiment.
Fig. 1 is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm to execute at least one instruction in accordance with one embodiment of the present invention. One embodiment is described in the context of a single-processor desktop or server system, but alternative embodiments can be included in a multiprocessor system. System 100 is an example of a "hub" system architecture. The computer system 100 includes a processor 102 to process data signals. The processor 102 can be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. The processor 102 is coupled to a processor bus 110 that can transmit data signals between the processor 102 and other components in the system 100. The elements of system 100 perform their conventional functions that are well known in the art.
In one embodiment, the processor 102 includes a Level 1 (L1) internal cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 102. Other embodiments can also include a combination of both internal and external caches, depending on the particular implementation and needs. Register file 106 can store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register. Checkpoint logic 105 is provided to set a checkpoint for a set of architecture state registers in register file 106 for a thread executed by a thread processing element of processor 102. Tracking logic 103 is provided to track memory accesses from the thread processing element that are associated with a transactional region to shared memory in cache memory 104.
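The roles of checkpoint logic 105 and tracking logic 103 can be illustrated with a simple software model. This sketch is an assumption, not the patented circuitry: the classes `RegisterFile` and `AccessTracker`, the use of cache-line identifiers as set members, and the specific conflict rule (a remote write conflicts with any local access, a remote read conflicts only with a local write) are illustrative choices.

```python
class RegisterFile:
    """Models a register file with a single saved checkpoint."""

    def __init__(self, names):
        self.regs = {name: 0 for name in names}
        self._checkpoint = None

    def checkpoint(self):
        """Save the architecture state at the start of a region."""
        self._checkpoint = dict(self.regs)

    def restore(self):
        """Roll the registers back to the saved checkpoint on abort."""
        self.regs = dict(self._checkpoint)


class AccessTracker:
    """Models tracking of one thread's transactional memory accesses."""

    def __init__(self):
        self.read_set = set()
        self.write_set = set()

    def on_load(self, line):
        self.read_set.add(line)

    def on_store(self, line):
        self.write_set.add(line)

    def conflicts_with(self, line, is_write):
        """Would another thread's access to `line` conflict with this region?"""
        if is_write:
            return line in self.read_set or line in self.write_set
        return line in self.write_set
```

In this model, a detected conflict would be followed by `restore()`, after which the region is retried or the fallback path runs.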
Execution unit 108, including logic to perform integer and floating-point operations, also resides in the processor 102. The processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. For one embodiment, execution unit 108 includes logic to handle a Transactional Synchronization Extensions (TSX) instruction set 109, which includes one or more instructions to test transactional execution state. By including the TSX instruction set 109 in the instruction set of a general-purpose processor 102, along with associated circuitry to execute the instructions, the operations used by many multithreaded applications may be performed using restricted transactional memory or hardware lock elision in a general-purpose processor 102. Thus, many multithreaded applications can be accelerated and executed more efficiently by using restricted transactional memory or hardware lock elision to synchronize access to shared data. This can eliminate the need to perform unnecessary synchronization on critical sections of shared memory that conflict relatively rarely.
Alternative embodiments of an execution unit 108 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or other memory device. Memory 120 can store instructions and/or data, represented by data signals, that can be executed by the processor 102.
A system logic chip 116 is coupled to the processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). The processor 102 can communicate with the MCH 116 via the processor bus 110. The MCH 116 provides a high-bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. The MCH 116 directs data signals between the processor 102, memory 120, and other components in the system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O. In some embodiments, the system logic chip 116 can provide a graphics port for coupling to a graphics controller 112. The MCH 116 is coupled to memory 120 through a memory interface. The graphics card 112 is coupled to the MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.
System 100 uses a peripheral hub interface bus 122 to couple the MCH 116 to the I/O controller hub (ICH) 130. The ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to memory 120, the chipset, and processor 102. Some examples are the audio controller, the firmware hub (flash BIOS) 128, the transceiver 126, the data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. The data storage device 124 can include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or another mass storage device.
For another embodiment of a system, an instruction in accordance with one embodiment can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or graphics controller, can also be located on the system on a chip.
Fig. 2 shows one embodiment of a processor 200 that uses instructions and logic to test transactional execution status. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, etc., as well as data types such as single- and double-precision integer and floating-point data types. In one embodiment, the in-order front end 201 is the part of processor 200 that fetches instructions to be executed and prepares them to be used later in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches instructions from memory and feeds them to an instruction decoder 228, which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called "micro-instructions" or "micro-operations" (also called micro-ops or uops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the micro-architecture to perform operations in accordance with one embodiment. In one embodiment that includes a trace cache 230, the trace cache 230 takes decoded micro-ops and assembles them into program-ordered sequences or traces in the micro-op queue 234 for execution. When the trace cache 230 encounters a complex instruction, the microcode ROM 232 provides the micro-ops needed to complete the operation.
Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 228 accesses the microcode ROM 232 to carry out the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine a correct micro-instruction pointer for reading the microcode sequences from the microcode ROM 232 to complete one or more instructions in accordance with one embodiment. After the microcode ROM 232 finishes sequencing micro-ops for an instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230. It will be appreciated that not all embodiments include a trace cache 230.
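The decode rule of the embodiment described above, in which an instruction needing more than four micro-operations is handed to the microcode ROM, can be sketched as a simple selection function. The function name `decode_path` and the string labels are assumptions made for this illustration; only the threshold of four comes from the text.

```python
MICROCODE_THRESHOLD = 4  # from the text: more than four micro-ops uses the microcode ROM


def decode_path(uop_count):
    """Return which front-end structure supplies the micro-ops for one instruction."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-op")
    # Small instructions are handled by the decoder; larger ones by the microcode ROM.
    return "microcode_rom" if uop_count > MICROCODE_THRESHOLD else "decoder"
```

Other embodiments may use a different threshold, so the constant is kept separate from the selection logic.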
The out-of-order execution engine 203 is where instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions, optimizing performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each micro-op needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each micro-op in one of two micro-op queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: memory scheduler, fast scheduler 202, slow/general floating-point scheduler 204, and simple floating-point scheduler 206. The micro-op schedulers 202, 204, 206 determine when a micro-op is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the micro-op needs to complete its operation. The fast scheduler 202 of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule micro-ops for execution.
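The readiness test described above (source operands ready and an execution resource available) can be sketched as follows. This is a toy model: the `Uop` record, the port labels, and the one-micro-op-per-port rule are assumptions for illustration, not details taken from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    sources: tuple  # names of the registers this micro-op reads
    port: str       # execution resource it needs (hypothetical label)


def ready_uops(uops, ready_regs, free_ports):
    """Pick micro-ops whose sources are all ready and whose port is still free."""
    picked = []
    free = set(free_ports)
    for u in uops:
        if all(s in ready_regs for s in u.sources) and u.port in free:
            picked.append(u.name)
            free.remove(u.port)  # toy rule: one micro-op per port per cycle
    return picked
```

A micro-op with ready operands still waits if its port has already been claimed in the same cycle, which is the arbitration step in miniature.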
Register files 208, 210 sit between the schedulers 202, 204, 206 and the execution units 212, 214, 216, 218, 220, 222, 224 in the execution block 211. There are separate register files 208, 210 for integer and floating-point operations, respectively. Each register file 208, 210 of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent micro-ops. The integer register file 208 and the floating-point register file 210 can also communicate data with each other. For one embodiment, the integer register file 208 is split into two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. The floating-point register file 210 of one embodiment has 128-bit-wide entries because floating-point instructions typically have operands from 64 to 128 bits in width. Some embodiments of the floating-point register file 210 may have entries that are 256 bits wide, 512 bits wide, or some other width. For some embodiments, each element in the floating-point register file 210 can be written separately at 64-bit, 32-bit, 16-bit, etc. boundaries.
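The split of a 64-bit integer value into a low-order and a high-order 32-bit half, as in the embodiment with two integer register groups, amounts to the following arithmetic. The helper names `split64` and `join64` are hypothetical and serve only to make the bit manipulation concrete.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split64(value):
    """Return (low 32 bits, high 32 bits) of a 64-bit value."""
    value &= MASK64
    return value & MASK32, (value >> 32) & MASK32


def join64(low, high):
    """Recombine the two 32-bit halves into one 64-bit value."""
    return ((high & MASK32) << 32) | (low & MASK32)
```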
The execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210 that store the integer and floating-point data operand values that the micro-instructions need to execute. Processor 200 of one embodiment includes a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating-point ALU 222, and floating-point move unit 224. For one embodiment, the floating-point execution blocks 222, 224 execute floating-point, MMX, SIMD, SSE, AVX, or other operations. The floating-point ALU 222 of one embodiment includes a 64-bit by 64-bit floating-point divider to execute divide, square root, and remainder micro-ops. For embodiments of the present invention, instructions involving a floating-point value may be handled with the floating-point hardware. In one embodiment, ALU operations go to the high-speed ALU execution units 216, 218. The fast ALUs 216, 218 of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 220, because the slow ALU 220 includes integer execution hardware for long-latency operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. For one embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data widths, including 16, 32, 128, 256 bits, etc. Similarly, the floating-point units 222, 224 can be implemented to support a range of operands having various bit widths. For one embodiment, the floating-point units 222, 224 can operate on 128-bit-wide packed data operands in conjunction with SIMD and multimedia instructions.
In one embodiment, the micro-op schedulers 202, 204, 206 dispatch dependent operations before the parent load has finished executing. Because micro-ops are speculatively scheduled and executed in processor 200, processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In some embodiments, a replay mechanism may track the instructions that used incorrect data and re-execute them. Only the dependent operations need to be replayed; the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instructions that provide functionality for testing transactional execution status. In some alternative embodiments without a replay mechanism, speculative execution of micro-ops may be prevented, and dependent micro-ops may reside in the schedulers 202, 204, 206 until they are canceled or until they can no longer be canceled.
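The replay rule described above, where only operations that depend on a missed load are re-executed, can be sketched as a walk over a dependency graph. The function `ops_to_replay` and the `consumers` mapping are hypothetical constructs for this illustration; an actual replay mechanism tracks this information in hardware.

```python
def ops_to_replay(missed_load, consumers):
    """Return every operation that directly or indirectly consumed the missed load.

    consumers maps an operation name to the operations that read its result.
    """
    replay, stack = set(), [missed_load]
    while stack:
        op = stack.pop()
        for dep in consumers.get(op, ()):
            if dep not in replay:
                replay.add(dep)
                stack.append(dep)
    return replay
```

Operations that never appear downstream of the missed load are absent from the result, which corresponds to independent operations being allowed to complete.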
The term "registers" refers to the on-board processor storage locations that are used as part of instructions to identify operands. In other words, registers are those storage locations that are usable from outside the processor (from a programmer's perspective). However, the registers of an embodiment are not limited to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In one embodiment, integer registers store 32-bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit-wide MMX™ registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit-wide XMM registers relating to SSE2, SSE3, SSE4, or later technology (referred to generically as "SSEx") can also be used to hold such packed data operands. Likewise, 256-bit-wide YMM registers and 512-bit-wide ZMM registers relating to AVX, AVX2, AVX3 technology (or more advanced technology) may overlap with the XMM registers and may be used to hold such wider packed data operands. In one embodiment, in storing packed data and integer data, the registers do not need to differentiate between the two data types. In one embodiment, integer and floating-point data are contained either in the same register file or in different register files. Furthermore, in one embodiment, floating-point and integer data may be stored in different registers or in the same registers.
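The overlap between the XMM, YMM, and ZMM registers can be pictured as three views of one wide storage location. The sketch below assumes the common arrangement in which the narrower register is the low-order part of the wider one; the class `VectorRegister` is a hypothetical name, and a Python integer stands in for the 512-bit storage.

```python
class VectorRegister:
    """Toy model: one 512-bit location viewed at 128, 256, and 512 bits."""

    def __init__(self):
        self.bits = 0  # the full 512-bit contents

    def write_zmm(self, value):
        self.bits = value & ((1 << 512) - 1)

    @property
    def zmm(self):
        return self.bits

    @property
    def ymm(self):
        return self.bits & ((1 << 256) - 1)  # low-order 256 bits

    @property
    def xmm(self):
        return self.bits & ((1 << 128) - 1)  # low-order 128 bits
```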
Fig. 3A depicts one embodiment of an operation encoding (opcode) format 360, having thirty-two or more bits, and register/memory operand addressing modes corresponding to a type of opcode format described in the "Intel 64 and IA-32 Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference," which is available from Intel Corporation of Santa Clara, California on the World Wide Web (www) at intel.com/products/processor/manuals/. In one embodiment, an instruction may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. For one embodiment, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the result of the instruction, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. For one embodiment, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.
Fig. 3B shows another alternative operation encoding (opcode) format 370, having forty or more bits. Opcode format 370 corresponds to opcode format 360 and includes an optional prefix byte 378. An instruction according to one embodiment may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. For one embodiment, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. For one embodiment, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, an instruction operates on one or more of the operands identified by operand identifiers 374 and 375, and one or more operands identified by operand identifiers 374 and 375 is overwritten by the result of the instruction, whereas in other embodiments the operands identified by identifiers 374 and 375 are written to another data element in another register. Opcode formats 360 and 370 allow register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate, and register-to-memory addressing specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.
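How a MOD field together with optional scale-index-base and displacement bytes selects an addressing mode can be illustrated with the publicly documented x86 ModR/M byte layout. The sketch below assumes 32-bit or 64-bit addressing; the function name `decode_modrm` and the dictionary keys are choices made for this illustration, and the mapping onto fields 363 and 373 of the figures is an assumption.

```python
def decode_modrm(byte):
    """Split a ModR/M byte into its fields and derive what follows it."""
    mod = (byte >> 6) & 0b11
    reg = (byte >> 3) & 0b111
    rm = byte & 0b111
    register_direct = mod == 0b11
    # With memory addressing, rm == 100b signals that a SIB byte follows.
    has_sib = (not register_direct) and rm == 0b100
    if mod == 0b01:
        disp_bytes = 1
    elif mod == 0b10 or (mod == 0b00 and rm == 0b101):
        disp_bytes = 4
    else:
        disp_bytes = 0
    return {"mod": mod, "reg": reg, "rm": rm, "register_direct": register_direct,
            "has_sib": has_sib, "disp_bytes": disp_bytes}
```

For instance, byte `0xC2` selects register-to-register addressing with no extra bytes, while `0x44` selects a memory operand with a SIB byte and a one-byte displacement.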
Turning next to Fig. 3C, in some alternative embodiments, 64-bit (or 128-bit, 256-bit, 512-bit, or wider) single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. For alternative embodiments, the type of CDP instruction operation may be encoded by one or more of fields 383, 384, 387, and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8-, 16-, 32-, and 64-bit values. For one embodiment, an instruction is performed on integer data elements. In some embodiments, an instruction may be executed conditionally, using condition field 381. For some embodiments, source data sizes may be encoded by field 383. In some embodiments, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.
Turning now to Fig. 3D, it depicts another alternative operation encoding (opcode) format 397, to provide functionality for testing transactional execution status according to another embodiment, corresponding to a type of opcode format described in the "Intel Advanced Vector Extensions Programming Reference," which is available from Intel Corporation of Santa Clara, California on the World Wide Web (www) at intel.com/products/processor/manuals/.
The original x86 instruction set provided for a 1-byte opcode with various formats of address syllable and immediate operand contained in additional bytes, whose presence was known from the first "opcode" byte. Additionally, certain byte values were reserved as modifiers to the opcode (called prefixes, because they are placed before the instruction). When the original palette of 256 opcode bytes (including these special prefix values) was exhausted, a single byte was dedicated as an escape to a new set of 256 opcodes. As vector instructions (such as SIMD) were added, a need for more opcodes arose, and the "two-byte" opcode map was also insufficient, even when expanded through the use of prefixes. To this end, new instructions were added in additional maps, which use two bytes plus an optional prefix as an identifier.
Additionally, to facilitate additional registers in 64-bit mode, an additional prefix (called "REX") may be used between the prefixes and the opcode (and any escape bytes necessary to determine the opcode). In one embodiment, the REX prefix has 4 "payload" bits to indicate use of additional registers in 64-bit mode. In other embodiments, it may have fewer or more than 4 bits. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:
[prefixes] [rex] escape [escape2] opcode modrm (etc.)
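As a concrete illustration of the four REX "payload" bits, the sketch below decodes a REX prefix using the publicly documented x86-64 layout 0100WRXB. The function name `decode_rex` is an assumption made for this example, and the sketch is not taken from the embodiments themselves.

```python
def decode_rex(byte):
    """Return the W, R, X, B payload bits of a REX prefix, or None if not a REX byte."""
    if (byte & 0xF0) != 0x40:
        return None  # REX prefixes occupy the range 0x40 to 0x4F
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModR/M reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModR/M rm or SIB base field
    }
```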
Opcode format 397 corresponds to opcode format 370 and includes an optional VEX prefix byte 391 (beginning, in one embodiment, with hexadecimal C4 or C5) to replace most other commonly used legacy instruction prefix bytes and escape codes. For example, the following illustrates an embodiment using two fields to encode an instruction, which may be used when a second escape code is not present in the original instruction. In the embodiment illustrated below, the legacy escape is represented by a new escape value, legacy prefixes are fully compressed as part of the "payload" bytes, legacy prefixes are reclaimed and available for future expansion, and new features are added (such as increased vector length and an additional source register specifier).
When a second escape code is present in the original instruction, or when extra bits in the REX field (such as the XB and W fields) need to be used, the alternative embodiment illustrated below may be used: the first legacy escape and legacy prefixes are compressed similarly to the above, the second escape code is compressed in a "map" field with future map or feature space available, and again new features are added (such as increased vector length and an additional source register specifier).
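To make the compression of escape codes and prefixes into "payload" bytes more concrete, the following sketch decodes the two-byte (C5) and three-byte (C4) VEX prefix forms using the publicly documented AVX layout. It is an illustration under that assumption rather than a reproduction of the figures of this document; the function name `decode_vex` and the dictionary keys are hypothetical.

```python
def decode_vex(prefix):
    """Decode a 2-byte (C5) or 3-byte (C4) VEX prefix into its fields."""
    first = prefix[0]
    if first == 0xC5 and len(prefix) >= 2:
        p = prefix[1]
        # The 2-byte form implies X = B = 0, W = 0, and opcode map 1.
        return {"R": ((p >> 7) & 1) ^ 1, "X": 0, "B": 0, "map": 1, "W": 0,
                "vvvv": (~(p >> 3)) & 0xF, "L": (p >> 2) & 1, "pp": p & 3}
    if first == 0xC4 and len(prefix) >= 3:
        p1, p2 = prefix[1], prefix[2]
        return {"R": ((p1 >> 7) & 1) ^ 1, "X": ((p1 >> 6) & 1) ^ 1,
                "B": ((p1 >> 5) & 1) ^ 1, "map": p1 & 0x1F,
                "W": (p2 >> 7) & 1, "vvvv": (~(p2 >> 3)) & 0xF,
                "L": (p2 >> 2) & 1, "pp": p2 & 3}
    raise ValueError("not a VEX prefix")
```

The R, X, B, and vvvv fields are stored inverted in the prefix, which is why the sketch complements them; `L` selects the vector length and `vvvv` names the additional source register.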
An instruction according to one embodiment may be encoded by one or more of fields 391 and 392. Up to four operand locations per instruction may be identified by field 391 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, VEX prefix byte 391 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit or 256-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 397 may be redundant with opcode format 370, whereas in other embodiments they are different. Opcode formats 370 and 397 allow register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate, and register-to-memory addressing specified in part by MOD field 373 and by the optional SIB identifier 393, optional displacement identifier 394, and optional immediate byte 395.
Turning next to Fig. 3E, it depicts another alternative operation encoding (opcode) format 398, to provide functionality for testing transactional execution status according to another embodiment. Opcode format 398 corresponds to opcode formats 370 and 397 and includes an optional EVEX prefix byte 396 (beginning, in one embodiment, with hexadecimal 62) to replace most other commonly used legacy instruction prefix bytes and escape codes and to provide additional functionality. An instruction according to one embodiment may be encoded by one or more of fields 396 and 392. Up to four operand locations per instruction and a mask may be identified by field 396 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, EVEX prefix byte 396 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit, 256-bit, or 512-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 398 may be redundant with opcode formats 370 or 397, whereas in other embodiments they are different. Opcode format 398 allows register-to-register, memory-to-register, register-by-memory, register-by-register, register-by-immediate, and register-to-memory addressing, with masks, specified in part by MOD field 373 and by the optional SIB identifier 393, optional displacement identifier 394, and optional immediate byte 395. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:
evex1 RXBmmmmm WvvvLpp evex4 opcode modrm[sib][disp][imm]
For one embodiment, an instruction encoded according to the EVEX format 398 may have additional "payload" bits that may be used to provide functionality for testing transactional execution status with additional new features, such as a user-configurable mask register, an additional operand, selections from among 128-bit, 256-bit, or 512-bit vector registers, more registers from which to select, etc.
For example, where VEX format 397 may be used to provide functionality for testing transactional execution status with an explicit mask and with or without an additional unary operation (such as a type conversion), EVEX format 398 may be used to provide that functionality with an explicit user-configurable mask and with or without an additional binary operation (such as addition or multiplication) that requires an additional operand. Some embodiments of EVEX format 398 may also be used to provide functionality for testing transactional execution status with an implicit completion mask and with an additional ternary operation. Additionally, where VEX format 397 may be used to provide functionality for testing transactional execution status on 128-bit or 256-bit vector registers, EVEX format 398 may be used to provide that functionality on 128-bit, 256-bit, 512-bit, or larger (or smaller) vector registers.
It will be appreciated that some embodiments of instructions and logic to test transactional execution status may specify explicit source and/or destination operands, while some embodiments may have implicit source and/or destination operands. Example instructions to provide functionality for testing transactional execution status (hereinafter referred to as XTEST) are illustrated by the following examples:
Fig. 4A is a block diagram showing an in-order pipeline and a register renaming stage, out-of-order issue/execution pipeline according to at least one embodiment of the invention. Fig. 4B is a block diagram showing an in-order architecture core and register renaming logic, out-of-order issue/execution logic to be included in a processor according to at least one embodiment of the invention. The solid-lined boxes in Fig. 4A show the in-order pipeline, while the dashed-lined boxes show the register renaming, out-of-order issue/execution pipeline. Similarly, the solid-lined boxes in Fig. 4B show the in-order architecture logic, while the dashed-lined boxes show the register renaming logic and out-of-order issue/execution logic.
In Fig. 4A, processor pipeline 400 includes a fetch stage 402, a length decode stage 404, a decode stage 406, an allocation stage 408, a renaming stage 410, a scheduling (also known as dispatch or issue) stage 412, a register read/memory read stage 414, an execute stage 416, a write-back/memory-write stage 418, an exception handling stage 422, and a commit stage 424.
In Fig. 4B, arrows denote a coupling between two or more units, and the direction of an arrow indicates the direction of data flow between those units. Fig. 4B shows processor core 490, which includes a front end unit 430 coupled to an execution engine unit 450, both of which are coupled to a memory unit 470.
Core 490 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, core 490 may be a special-purpose core, such as a network or communication core, a compression engine, a graphics core, or the like.
The front end unit 430 includes a branch prediction unit 432 coupled to an instruction cache unit 434, which is coupled to an instruction translation lookaside buffer (TLB) 436, which is coupled to an instruction fetch unit 438, which is coupled to a decode unit 440. The decode unit, or decoder, may decode instructions and generate as an output one or more micro-operations, microcode entry points, micro-instructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. The instruction cache unit 434 is further coupled to a level 2 (L2) cache unit 476 in the memory unit 470. The decode unit 440 is coupled to a rename/allocator unit 452 in the execution engine unit 450.
The execution engine unit 450 includes the rename/allocator unit 452 coupled to a retirement unit 454 and a set of one or more scheduler units 456. The scheduler unit 456 represents any number of different schedulers, including reservation stations, a central instruction window, etc. The scheduler unit 456 is coupled to the physical register file unit 458. Each physical register file unit 458 represents one or more physical register files, different ones of which store one or more different data types (such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc.), status (such as an instruction pointer that is the address of the next instruction to be executed), etc. The physical register file unit 458 is overlapped by the retirement unit 454 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (such as using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using register maps and a pool of registers; etc.). Generally, the architectural registers are visible from outside the processor or from a programmer's perspective. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. The retirement unit 454 and the physical register file unit 458 are coupled to the execution cluster 460. The execution cluster 460 includes a set of one or more execution units 462 and a set of one or more memory access units 464. The execution units 462 may perform various operations (such as shifts, addition, subtraction, multiplication) on various types of data (such as scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 456, physical register file unit 458, and execution cluster 460 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (for example, a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit 464). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 464 is coupled to the memory unit 470, which includes a data TLB unit 472 coupled to a cache unit 474 coupled to a level 2 (L2) cache unit 476. In one exemplary embodiment, the memory access units 464 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 472 in the memory unit 470. The L2 cache unit 476 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may
implement the pipeline 400 as follows: 1) the instruction fetch unit 438 performs the fetch and length
decode stages 402 and 404; 2) the decode unit 440 performs the decode stage 406; 3) the
rename/allocator unit 452 performs the allocation stage 408 and the renaming stage 410; 4) the
scheduler unit 456 performs the schedule stage 412; 5) the physical register group unit 458 and the
memory unit 470 perform the register read/memory read stage 414, and the execution cluster 460 performs
the execute stage 416; 6) the memory unit 470 and the physical register group unit 458 perform the
write back/memory write stage 418; 7) various units may be involved in the exception handling stage
422; and 8) the retirement unit 454 and the physical register group unit 458 perform the commit stage
424.
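The stage-to-unit assignment enumerated above can be restated as a small lookup table. The following
Python sketch is illustrative only: the stage and unit reference numerals are taken from the text,
while the names `PIPELINE_400` and `units_for_stage` and the tuple layout are assumptions made for
this example and are not part of the described embodiments.

```python
# Stage-to-unit mapping for pipeline 400, as enumerated in the text.
# The table layout and the helper name are assumptions for illustration.
PIPELINE_400 = [
    (402, "fetch", ["instruction fetch 438"]),
    (404, "length decode", ["instruction fetch 438"]),
    (406, "decode", ["decode unit 440"]),
    (408, "allocation", ["rename/allocator unit 452"]),
    (410, "renaming", ["rename/allocator unit 452"]),
    (412, "schedule", ["scheduler unit 456"]),
    (414, "register read/memory read",
     ["physical register group unit 458", "memory unit 470"]),
    (416, "execute", ["execution cluster 460"]),
    (418, "write back/memory write",
     ["memory unit 470", "physical register group unit 458"]),
    (422, "exception handling", ["various units"]),
    (424, "commit", ["retirement unit 454", "physical register group unit 458"]),
]


def units_for_stage(stage_number):
    """Return the list of units that perform the given pipeline stage."""
    for number, _name, units in PIPELINE_400:
        if number == stage_number:
            return units
    raise KeyError(stage_number)
```

For instance, `units_for_stage(414)` returns the two units that the text assigns to the register
read/memory read stage.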
The core 490 may support one or more instruction sets (e.g., the x86 instruction set (with some
extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of
Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of
ARM Holdings of Sunnyvale, California).
It should be understood that the core may support multithreading (executing two or more parallel sets
of operations or threads), and may do so in a variety of ways, including time-sliced multithreading,
simultaneous multithreading (where a single physical core provides a logical core for each of the
threads that the physical core is simultaneously multithreading), or a combination thereof (e.g.,
time-sliced fetching and decoding followed by simultaneous multithreading, such as in Intel
Hyper-Threading technology).
For one embodiment, the execution engine unit 450 includes TSX logic 469 to handle a TSX instruction
set. By including the TSX instruction set in the instruction set of the general-purpose processor core
490, along with the associated TSX logic 469 to execute these instructions, the operations used by
many multithreaded applications may be performed in the general-purpose processor core 490 using
restricted transactional memory (RTM) or hardware lock elision (HLE). Thus, many multithreaded
applications can be accelerated and executed more efficiently by using restricted transactional memory
or hardware lock elision to synchronize access to shared data. This can eliminate the need for
unnecessary synchronization on critical sections of shared memory that conflict relatively rarely.
Tracking logic 473 is provided in the memory unit 470 to track memory accesses, in the cache of the
memory unit 470, from a thread processing element associated with a transactional region of shared
memory. In one embodiment, checkpoint logic 455 is provided to checkpoint a set of architectural state
registers in the register group unit 458 for a thread executed by the thread processing element of the
core 490.
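The role of the tracking logic can be illustrated with a toy software model. The Python sketch below
is an assumption-laden illustration, not the hardware described here: the class name `TrackingLogic`,
its methods, and the 64-byte tracking granularity are all hypothetical. It records which lines a
transactional thread has read or written and reports whether an access by another agent would
conflict with them.

```python
class TrackingLogic:
    """Toy model of read-set/write-set tracking for one transactional thread."""

    LINE_BYTES = 64  # assumed tracking granularity; not stated in the text

    def __init__(self):
        self.read_set = set()   # lines read inside the transactional region
        self.write_set = set()  # lines written inside the transactional region

    def _line(self, address):
        return address // self.LINE_BYTES

    def record_read(self, address):
        self.read_set.add(self._line(address))

    def record_write(self, address):
        self.write_set.add(self._line(address))

    def conflicts_with(self, address, is_write):
        """Would an access by another agent conflict with the tracked sets?"""
        line = self._line(address)
        if is_write:
            # A foreign write conflicts with anything this thread read or wrote.
            return line in self.read_set or line in self.write_set
        # A foreign read conflicts only with lines this thread has written.
        return line in self.write_set
```

In this model two reads of the same line never conflict, which is why rarely conflicting critical
sections can proceed without serializing on a lock.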
While register renaming is described in the context of out-of-order execution, it should be understood
that register renaming may be used in an in-order architecture. While the illustrated embodiment of
the processor also includes separate instruction and data cache units 434/474 and a shared L2 cache
unit 476, alternative embodiments may have a single internal cache for both instructions and data,
such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the
system may include a combination of an internal cache and an external cache that is external to the
core and/or the processor. Alternatively, all of the caches may be external to the core and/or the
processor.
Fig. 5 is a block diagram of a single-core processor and a multi-core processor 500 with an integrated
memory controller and graphics according to embodiments of the invention. The solid-lined boxes in
Fig. 5 illustrate a processor 500 with a single core 502A, a system agent 510, and a set of one or more
bus controller units 516, while the optional addition of the dashed-lined boxes illustrates an
alternative processor 500 with multiple cores 502A-N, a set of one or more integrated memory
controller units 514 in the system agent unit 510, and integrated graphics logic 508.
The memory hierarchy includes one or more levels of cache 504A-N within the cores, a set of one or more
shared cache units 506, and external memory (not shown) coupled to the set of integrated memory
controller units 514. The set of shared cache units 506 may include one or more mid-level caches, such
as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or
combinations thereof. Tracking logic 503A-N is provided to track memory accesses, in the caches 504A-N
and/or the shared cache units 506, from thread processing elements associated with transactional
regions of shared memory. While in one embodiment a ring-based interconnect unit 512 interconnects the
integrated graphics logic 508, the set of shared cache units 506, and the system agent unit 510,
alternative embodiments may use any number of well-known techniques for interconnecting such units.
In some embodiments, one or more of the cores 502A-N are capable of multithreading. The system agent
510 includes those components that coordinate and operate the cores 502A-N. The system agent unit 510
may include, for example, a power control unit (PCU) and a display unit. The PCU may be or include the
logic and components needed to regulate the power state of the cores 502A-N and the integrated
graphics logic 508. The display unit drives one or more externally connected displays.
The cores 502A-N may be homogeneous or heterogeneous in terms of architecture and/or instruction set.
For example, some of the cores 502A-N may be in-order while others are out-of-order. As another
example, two or more of the cores 502A-N may be capable of executing the same instruction set, while
others may be capable of executing only a subset of that instruction set or a different instruction
set.
The processor may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™,
Itanium™, XScale™ or StrongARM™ processor, which are available from Intel Corporation of Santa Clara,
California. Alternatively, the processor may be from another company, such as ARM Holdings, MIPS, etc.
The processor may be a special-purpose processor, such as, for example, a network or communication
processor, compression engine, graphics processor, co-processor, embedded processor, or the like. The
processor may be implemented on one or more chips. The processor 500 may be a part of one or more
substrates and/or may be implemented on one or more substrates using any of a number of process
technologies, such as, for example, BiCMOS, CMOS, or NMOS.
Figs. 6-8 are exemplary systems suitable for including the processor 500, while Fig. 9 is an exemplary
system on a chip (SoC) that may include one or more of the cores 502. Other system designs and
configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants,
engineering workstations, servers, network devices, network hubs, switches, embedded processors,
digital signal processors (DSPs), graphics devices, video game devices, set-top boxes,
microcontrollers, cellular phones, portable media players, handheld devices, and various other
electronic devices are also suitable. In general, a wide variety of systems or electronic devices
capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.
Referring now to Fig. 6, shown is a block diagram of a system 600 in accordance with one embodiment of
the present invention. The system 600 may include one or more processors 610, 615, which are coupled
to a graphics memory controller hub (GMCH) 620. The optional nature of the additional processor 615 is
denoted in Fig. 6 with broken lines.
Each processor 610, 615 may be some version of the processor 500. It should be noted, however, that
integrated graphics logic and integrated memory control units are unlikely to exist in the processors
610, 615. Fig. 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for
example, a dynamic random access memory (DRAM). For at least one embodiment, the DRAM may be
associated with a non-volatile cache, and tracking logic may also be provided to track memory
accesses, in the non-volatile cache, from thread processing elements associated with transactional
regions of shared memory.
The GMCH 620 may be a chipset, or a portion of a chipset. The GMCH 620 may communicate with the
processors 610, 615 and control interaction between the processors 610, 615 and the memory 640. The
GMCH 620 may also act as an accelerated bus interface between the processors 610, 615 and other
elements of the system 600. For at least one embodiment, the GMCH 620 communicates with the processors
610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.
Furthermore, the GMCH 620 is coupled to a display 645 (such as a flat panel display). The GMCH 620 may
include an integrated graphics accelerator. The GMCH 620 is further coupled to an input/output (I/O)
controller hub (ICH) 650, which may be used to couple various peripheral devices to the system 600.
Shown for example in the embodiment of Fig. 6 are an external graphics device 660, which may be a
discrete graphics device coupled to the ICH 650, and another peripheral device 670.
Alternatively, additional or different processors may also be present in the system 600. For example,
the additional processor 615 may include additional processors that are the same as the processor 610,
additional processors that are heterogeneous or asymmetric to the processor 610, accelerators (such as
graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or
any other processor. There can be a variety of differences between the physical resources 610, 615 in
terms of a spectrum of metrics of merit including architectural, micro-architectural, thermal, and
power consumption characteristics, and the like. These differences may effectively manifest themselves
as asymmetry and heterogeneity among the processors 610, 615. For at least one embodiment, the various
processors 610, 615 may reside in the same die package.
Referring now to Fig. 7, shown is a block diagram of a second system 700 in accordance with an
embodiment of the present invention. As shown in Fig. 7, the multiprocessor system 700 is a
point-to-point interconnect system, and includes a first processor 770 and a second processor 780
coupled via a point-to-point interconnect 750. Each of the processors 770 and 780 may be some version
of the processor 500, as may one or more of the processors 610, 615.
While shown with only two processors 770, 780, it is to be understood that the scope of the present
invention is not so limited. In other embodiments, one or more additional processors may be present in
a given processor.
The processors 770 and 780 are shown including integrated memory controller units 772 and 782,
respectively. The processor 770 also includes, as part of its bus controller units, point-to-point
(P-P) interfaces 776 and 778; similarly, the second processor 780 includes P-P interfaces 786 and 788.
The processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P
interface circuits 778, 788. As shown in Fig. 7, the IMCs 772 and 782 couple the processors to
respective memories, namely a memory 732 and a memory 734, which may be portions of main memory
locally attached to the respective processors.
The processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces
752, 754 using point-to-point interface circuits 776, 794, 786, 798. The chipset 790 may also exchange
information with a high-performance graphics circuit 738 via a high-performance graphics interface
739.
A shared cache (not shown) may be included in either processor or outside of both processors, yet
connected with the processors via the P-P interconnect, such that either or both processors' local
cache information may be stored in the shared cache if a processor is placed into a low power mode.
Tracking logic may be provided to track memory accesses, in the shared cache, from thread processing
elements associated with transactional regions of shared memory.
The chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, the first
bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or
another third-generation I/O interconnect bus, although the scope of the present invention is not so
limited.
As shown in Fig. 7, various I/O devices 714 may be coupled to the first bus 716, along with a bus
bridge 718 which couples the first bus 716 to a second bus 720. In one embodiment, the second bus 720
may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 720 including, for
example, a keyboard and/or mouse 722, communication devices 727, and a storage unit 728, such as a
disk drive or other mass storage device, which may include instructions/code and data 730 in one
embodiment. Further, an audio I/O 724 may be coupled to the second bus 720. Note that other
architectures are possible. For example, instead of the point-to-point architecture of Fig. 7, a
system may implement a multi-drop bus or other such architecture.
Referring now to Fig. 8, shown is a block diagram of a third system 800 in accordance with an
embodiment of the present invention. Like elements in Figs. 7 and 8 bear like reference numerals, and
certain aspects of Fig. 7 have been omitted from Fig. 8 in order to avoid obscuring other aspects of
Fig. 8.
Fig. 8 illustrates that the processors 870, 880 may include integrated memory and I/O control logic
("CL") 872 and 882, respectively. For at least one embodiment, the CL 872, 882 may include integrated
memory controller units such as those described above in connection with Figs. 5 and 7. In addition,
the CL 872, 882 may also include I/O control logic. Fig. 8 illustrates that not only are the memories
832, 834 coupled to the CL 872, 882, but also that I/O devices 814 are coupled to the control logic
872, 882. Legacy I/O devices 815 are coupled to the chipset 890.
Referring now to Fig. 9, shown is a block diagram of an SoC 900 in accordance with an embodiment of
the present invention. Elements similar to those in Fig. 5 bear like reference numerals. Also,
dashed-lined boxes are optional features on more advanced SoCs. In Fig. 9, an interconnect unit 902 is
coupled to: an application processor 910, which includes a set of one or more cores 502A-N, one or
more levels of cache 504A-N within the cores, and shared cache units 506; tracking logic 503A-N to
track memory accesses, in the caches 504A-N and/or the shared cache units 506, from thread processing
elements associated with transactional regions of shared memory; a system agent unit 510; a bus
controller unit 516; an integrated memory controller unit 514; a set of one or more media processors
920, which may include integrated graphics logic 508, an image processor 924 for providing still
and/or video camera functionality, an audio processor 926 for providing hardware audio acceleration,
and a video processor 928 for providing video encode/decode acceleration; a static random access
memory (SRAM) unit 930; a direct memory access (DMA) unit 932; and a display unit 940 for coupling to
one or more external displays.
Fig. 10 illustrates a processor containing a central processing unit (CPU) and a graphics processing
unit (GPU), which may perform at least one instruction according to one embodiment. In one embodiment,
an instruction to perform operations according to at least one embodiment could be performed by the
CPU. In another embodiment, the instruction could be performed by the GPU. In still another
embodiment, the instruction may be performed through a combination of operations performed by the GPU
and the CPU. For example, in one embodiment, an instruction in accordance with one embodiment may be
received and decoded for execution on the GPU. However, one or more operations within the decoded
instruction may be performed by the CPU and the result returned to the GPU for final retirement of the
instruction. Conversely, in some embodiments, the CPU may act as the primary processor and the GPU as
the co-processor.
In some embodiments, instructions that benefit from highly parallel throughput may be performed by the
GPU, while instructions that benefit from the performance of processors with deeply pipelined
architectures may be performed by the CPU. For example, graphics, scientific applications, financial
applications, and other parallel workloads may benefit from the performance of the GPU and be executed
accordingly, whereas more sequential applications, such as operating system kernel or application
code, may be better suited for the CPU.
In Fig. 10, the processor 1000 includes a CPU 1005, GPU 1010, image processor 1015, video processor
1020, USB controller 1025, UART controller 1030, SPI/SDIO controller 1035, display device 1040,
High-Definition Multimedia Interface (HDMI) controller 1045, MIPI controller 1050, flash memory
controller 1055, double data rate (DDR) controller 1060, security engine 1065, and I2S/I2C (Integrated
Interchip Sound/Inter-Integrated Circuit) interface 1070. Other logic and circuits may be included in
the processor of Fig. 10, including more CPUs or GPUs and other peripheral interface controllers.
One or more aspects of at least one embodiment may be implemented by representative data stored on a
machine-readable medium which represents various logic within the processor and which, when read by a
machine, causes the machine to fabricate logic to perform the techniques described herein. Such
representations, known as "IP cores", may be stored on a tangible, machine-readable medium ("tape")
and supplied to various customers or manufacturing facilities to load into the fabrication machines
that actually make the logic or processor. For example, IP cores, such as the Cortex™ family of
processors developed by ARM Holdings and the Loongson IP cores developed by the Institute of Computing
Technology (ICT) of the Chinese Academy of Sciences, may be licensed or sold to various customers or
licensees, such as Texas Instruments, Qualcomm, Apple, or Samsung, and implemented in processors
produced by these customers or licensees.
Fig. 11 shows a block diagram illustrating the development of IP cores according to one embodiment.
The memory 1130 includes simulation software 1120 and/or a hardware or software model 1110. In one
embodiment, the data representing the IP core design can be provided to the memory 1130 via a memory
1140 (e.g., a hard disk), a wired connection (e.g., the internet) 1150, or a wireless connection 1160.
The IP core information generated by the simulation tool and model can then be transmitted to a
fabrication facility, where it can be fabricated by a third party to perform at least one instruction
in accordance with at least one embodiment.
In some embodiments, one or more instructions may correspond to a first type or architecture (e.g.,
x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). An
instruction, according to one embodiment, may therefore be performed on any processor or processor
type, including ARM, x86, MIPS, a GPU, or other processor type or architecture.
Fig. 12 illustrates how an instruction of a first type is emulated by a processor of a different type,
according to one embodiment. In Fig. 12, a program 1205 contains some instructions that may perform
the same or substantially the same function as an instruction according to one embodiment. However,
the instructions of the program 1205 may be of a type and/or format that is different from or
incompatible with the processor 1215, meaning that instructions of the type in the program 1205 may
not be executable natively by the processor 1215. With the help of emulation logic 1210, however, the
instructions of the program 1205 are translated into instructions that are natively executable by the
processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment,
the emulation logic is embodied in a tangible, machine-readable medium containing software to
translate instructions of the type in the program 1205 into the type natively executable by the
processor 1215. In other embodiments, the emulation logic is a combination of fixed-function or
programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment,
the processor contains the emulation logic, whereas in other embodiments, the emulation logic exists
outside of the processor and is provided by a third party. In one embodiment, the processor is capable
of loading the emulation logic embodied in a tangible, machine-readable medium containing software by
executing microcode or firmware contained in or associated with the processor.
Fig. 13 is a block diagram contrasting the use of a software instruction converter to convert binary
instructions in a source instruction set to binary instructions in a target instruction set according
to embodiments of the invention. In the illustrated embodiment, the instruction converter is a
software instruction converter, although alternatively the instruction converter may be implemented in
software, firmware, hardware, or various combinations thereof. Fig. 13 shows that a program in a
high-level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306
that may be natively executed by a processor 1316 with at least one x86 instruction set core. The
processor 1316 with at least one x86 instruction set core represents any processor that can perform
substantially the same functions as an Intel processor with at least one x86 instruction set core by
compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the
Intel x86 instruction set core or (2) object code versions of applications or other software targeted
to run on an Intel processor with at least one x86 instruction set core, in order to achieve
substantially the same result as an Intel processor with at least one x86 instruction set core. The
x86 compiler 1304 represents a compiler operable to generate x86 binary code 1306 (e.g., object code)
that can, with or without additional linkage processing, be executed on the processor 1316 with at
least one x86 instruction set core. Similarly, Fig. 13 shows that the program in the high-level
language 1302 may be compiled using an alternative instruction set compiler 1308 to generate
alternative instruction set binary code 1310 that may be natively executed by a processor 1314 without
at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction
set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM
Holdings of Sunnyvale, California). The instruction converter 1312 is used to convert the x86 binary
code 1306 into code that may be natively executed by the processor 1314 without an x86 instruction set
core. This converted code is not likely to be the same as the alternative instruction set binary code
1310, because an instruction converter capable of this is difficult to make; however, the converted
code will accomplish the general operation and be made up of instructions from the alternative
instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a
combination thereof that, through emulation, simulation, or any other process, allows a processor or
other electronic device that does not have an x86 instruction set processor or core to execute the x86
binary code 1306.
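The idea of instruction conversion can be illustrated with a deliberately small software model. In
the Python sketch below, both instruction sets are invented for the example: the two-operand "source"
operations and the three-operand "target" operations are hypothetical and do not correspond to real
x86, MIPS, or ARM encodings, and the function names are likewise assumptions. The sketch shows that a
single source instruction may expand into several target instructions while the architectural result
stays the same.

```python
def convert(source_program):
    """Translate hypothetical 2-operand source ops into hypothetical 3-operand target ops."""
    target = []
    for op, a, b in source_program:
        if op == "mov":      # a = b, expressed as a = b | zero
            target.append(("OR", a, b, "zero"))
        elif op == "add":    # a = a + b
            target.append(("ADD", a, a, b))
        elif op == "xchg":   # swap a and b: one source op becomes three target ops
            target.append(("OR", "tmp", a, "zero"))
            target.append(("OR", a, b, "zero"))
            target.append(("OR", b, "tmp", "zero"))
        else:
            raise ValueError(f"unsupported source instruction: {op}")
    return target


def run_source(program, registers):
    """Reference interpreter for the hypothetical source instruction set."""
    regs = dict(registers)
    for op, a, b in program:
        if op == "mov":
            regs[a] = regs[b]
        elif op == "add":
            regs[a] = regs[a] + regs[b]
        elif op == "xchg":
            regs[a], regs[b] = regs[b], regs[a]
    return regs


def run_target(program, registers):
    """Interpreter for the hypothetical target set (non-negative integer registers)."""
    regs = dict(registers, zero=0, tmp=0)  # zero register and scratch register
    for op, dst, src1, src2 in program:
        if op == "OR":
            regs[dst] = regs[src1] | regs[src2]
        elif op == "ADD":
            regs[dst] = regs[src1] + regs[src2]
    # Hide the converter's private registers from the visible result.
    return {k: v for k, v in regs.items() if k not in ("zero", "tmp")}
```

Running the same program through `run_source` and through `run_target(convert(...))` yields the same
visible register values, even though the converted code is longer and uses different operations.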
Fig. 14 illustrates one embodiment of a device 1401 to provide functionality for testing transactional
execution state. The device 1401 includes an instruction fetch unit 1438 coupled to a decode unit
1440. The decode unit or decoder may decode instructions and generate as an output one or more
micro-operations, microcode entry points, microinstructions, other instructions, or other control
signals, which are decoded from, or which otherwise reflect, or are derived from, the original
instructions. The decoder may be implemented using various different mechanisms. Examples of suitable
mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable
logic arrays (PLAs), microcode read-only memories (ROMs), etc. The decode unit 1440 is coupled to a
register group unit 1458.
Each register group unit 1458 represents one or more physical register groups, different ones of which
store one or more different data types (e.g., scalar integer, scalar floating point, packed integer,
packed floating point, vector integer, vector floating point), status (e.g., an instruction pointer
that is the address of the next instruction to be executed), etc. The register group unit 1458 is
coupled with the checkpoint logic 1455 of the device 1402. In general, the architectural registers are
visible from outside the processor or from a programmer's perspective. In one embodiment, the
checkpoint logic 1455 is provided to checkpoint a set of architectural state registers in the register
group unit 1458 for a thread executed by a thread processing element associated with a transactional
region of shared memory. The registers are not limited to any known particular type of circuit.
Various different types of registers are suitable as long as they are capable of storing and providing
data as described herein. Examples of suitable registers include, but are not limited to, dedicated
physical registers, dynamically allocated physical registers using register renaming, and combinations
of dedicated and dynamically allocated physical registers. The register group unit 1458 is coupled to
a set of one or more execution units 1462 and a set of one or more memory access units 1464. The
execution units 1462 may perform various operations (e.g., shifts, addition, subtraction,
multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating
point, vector integer, vector floating point). While some embodiments may include a number of
execution units dedicated to specific functions or sets of functions, other embodiments may include
only one execution unit or multiple execution units that all perform all functions. The register group
unit 1458, the memory access units 1464, and the execution units 1462 are shown as possibly plural
because certain embodiments create separate pipelines for certain types of data/operations (e.g., a
scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector
integer/vector floating point pipeline, and/or a memory access pipeline, each having its own register
group unit and/or execution units; in the case of a separate memory access pipeline, certain
embodiments are implemented in which only one or more particular pipelines have the memory access
units 1464). It should also be understood that where separate pipelines are used, one or more of these
pipelines may be out-of-order issue/execution and the rest in-order issue/execution.
The set of memory access units 1464 is coupled to a data cache unit 1474, which is coupled to a level
2 (L2) cache unit 1476. In one exemplary embodiment, the memory access units 1464 may include a load
unit, a store address unit, and a store data unit, each of which is coupled to the data cache unit
1474 of the device 1402 and to tracking logic 1473, to track memory accesses, in the data cache unit
1474, from a processing element associated with a transactional region of shared memory. The L2 cache
unit 1476 is coupled to one or more other levels of cache and eventually to main memory.
By way of example, the exemplary device 1401 may implement the pipeline 400 as follows: 1) the
instruction fetch unit 1438 performs the fetch and length decode stages 402 and 404; 2) the decode
unit 1440 performs the decode stage 406; 3) the register group unit 1458 and the memory access units
1464 perform the register read/memory read stage 414; 4) the execution units 1462 perform the execute
stage 416; and 5) the memory access units 1464 and the register group unit 1458 perform the write
back/memory write stage 418.
The device 1401 may support one or more instruction sets (e.g., the x86 instruction set (with some
extensions that have been added with newer versions, including the TSX ISA 1469); the MIPS instruction
set of MIPS Technologies of Sunnyvale, California (including transactional synchronization such as
that in the TSX ISA 1469); the ARM instruction set of ARM Holdings of Sunnyvale, California (with
optional additional extensions such as NEON, and including transactional synchronization such as that
in the TSX ISA 1469)).
It should be understood that the device 1401 may support multithreading (executing two or more
parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced
multithreading, simultaneous multithreading (where a single physical core provides a logical core for
each of the threads that the physical core is simultaneously multithreading), or a combination thereof
(e.g., time-sliced fetching and decoding followed by simultaneous multithreading, such as in Intel
Hyper-Threading technology).
For one embodiment, the execution units 1462 execute the TSX instruction set architecture (ISA) 1469
to perform transactional synchronization in cooperation with TSX control 1457. The TSX control 1457
operates together with the checkpoint logic 1455 of the device 1402 to checkpoint a set of
architectural registers in the register group unit 1458, and operates together with the tracking
logic 1473 in the memory access units 1464 to track memory accesses, in the data cache unit 1474, from
a thread processing element associated with a transactional region of shared memory. If a read/write
conflict occurs, the architectural state can be rolled back to the previous synchronization point, and
the conflicting accesses are not committed. For one embodiment, the TSX ISA 1469 of the device 1402
includes one or more instructions (e.g., the XTEST instruction described above) that can be executed
by the execution units 1462 to provide functionality for testing the transactional execution state in
a thread processing element.
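The interplay of checkpointing, rollback, and testing of the transactional execution state can be
sketched with a toy software model. The Python code below is an illustration under stated
assumptions, not the described hardware: the class `ThreadContext` and its methods `begin`, `end`,
`abort`, and `xtest` are hypothetical names, only register state is modeled, and the boolean returned
by `xtest` merely stands in for whatever indication the actual instruction provides.

```python
class ThreadContext:
    """Toy model of a thread processing element with a checkpointed register set."""

    def __init__(self, registers):
        self.registers = dict(registers)  # architectural state registers
        self._checkpoint = None           # saved copy while inside a region

    def begin(self):
        """Model of starting a transactional region: checkpoint the registers."""
        if self._checkpoint is not None:
            raise RuntimeError("nested regions are not modeled in this sketch")
        self._checkpoint = dict(self.registers)

    def xtest(self):
        """Report whether execution is currently inside a transactional region."""
        return self._checkpoint is not None

    def end(self):
        """Commit: keep the current registers and discard the checkpoint."""
        if self._checkpoint is None:
            raise RuntimeError("no transactional region is active")
        self._checkpoint = None

    def abort(self):
        """Conflict: roll the registers back to the checkpoint and leave the region."""
        if self._checkpoint is None:
            raise RuntimeError("no transactional region is active")
        self.registers = self._checkpoint
        self._checkpoint = None
```

In this model, a value written to a register after `begin` survives `end` but is discarded by `abort`,
and `xtest` returns true only between `begin` and the matching `end` or `abort`.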
By including the TSX ISA 1469 in the instruction set of a general-purpose processor core, along with
the associated logic to execute these instructions, the operations used by many multithreaded
applications may be performed with the device 1401 using restricted transactional memory or hardware
lock elision in the general-purpose processor core. Thus, many multithreaded applications can be
accelerated and executed more efficiently by using restricted transactional memory or hardware lock
elision to synchronize access to shared data. As described above, when a thread processing element is
executing transactionally, the tracking logic 1473 in the memory access units 1464 tracks memory
accesses, in the data cache unit 1474, from the thread processing element associated with a
transactional region of shared memory. This can eliminate the need for unnecessary synchronization on
critical sections of shared memory that conflict relatively rarely.
Figure 15 shows a flow diagram of one embodiment of a process 1501 for providing functionality to test transactional execution status. Process 1501 and the other processes disclosed herein are performed by processing blocks, which may comprise dedicated hardware, or software or firmware operation codes executable by general-purpose machines, special-purpose machines, or some combination of the two.
In processing block 1510 of process 1501, a first instruction or prefix to begin a transactional region (such as RTM or HLE) is decoded. In response to decoding the first instruction, a checkpoint for a set of architectural state registers is generated in processing block 1520. Also in response to decoding the first instruction, memory accesses from a processing element in the transactional region associated with the first instruction are tracked in processing block 1530. In processing block 1540, a second instruction (such as one of the XTEST instructions) to detect transactional execution of the transactional region is decoded. In processing block 1550, an operation is executed in response to decoding the second instruction, to determine whether the execution context of the second instruction is within the transactional region. Then, in response to the second instruction, a first flag is updated in processing block 1560 (for example, updated to zero if the execution context of the second instruction is within the transactional region, and updated to one otherwise). In processing block 1570, a register is optionally updated, further in response to the second instruction (for example, as in XTEST.NL or XTEST.BA, etc.). And in processing block 1580, a second flag is optionally updated in response to the second instruction (for example, as in XTEST.BV, XTEST.MV, or XTEST.BM, etc.).
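The blocks of process 1501 can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions: the class `ToyCore` and its method names are hypothetical, the optional register and second-flag updates of blocks 1570 and 1580 are not modeled, and real hardware performs these steps in the decoder, checkpoint logic, and tracking logic rather than in software.

```python
class ToyCore:
    """Hypothetical software model of one thread processing element."""

    def __init__(self):
        self.registers = {"r0": 0}   # stand-in for architectural state registers
        self.checkpoint = None       # register snapshot taken in block 1520
        self.tracked = []            # addresses tracked in block 1530
        self.depth = 0               # > 0 while inside a transactional region
        self.zero_flag = 1           # the "first flag" updated in block 1560

    def begin_region(self):
        # Blocks 1510 and 1520: decode the begin instruction and
        # checkpoint the registers when the outermost region starts.
        if self.depth == 0:
            self.checkpoint = dict(self.registers)
        self.depth += 1

    def access(self, address):
        # Block 1530: track memory accesses made inside the region.
        if self.depth > 0:
            self.tracked.append(address)

    def xtest(self):
        # Blocks 1540 to 1560: the flag becomes 0 inside a region, 1 otherwise.
        self.zero_flag = 0 if self.depth > 0 else 1
        return self.zero_flag

    def end_region(self):
        # Leaving the outermost region drops the checkpoint and tracking data.
        if self.depth > 0:
            self.depth -= 1
        if self.depth == 0:
            self.checkpoint = None
            self.tracked = []
```

In this model, calling `xtest()` before `begin_region()` yields 1, and calling it between `begin_region()` and `end_region()` yields 0, mirroring the flag convention described for processing block 1560.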
It will be appreciated that although process 1501 and the other processes disclosed herein are shown in a particular order, in some alternative embodiments the operations of these processing blocks may be performed in various different orders and/or in parallel or in succession.
Figure 16 shows a flow diagram of an alternative embodiment 1601 of a process supporting the testing of transactional execution status. In processing block 1605, a transactional region is entered (for example, by encountering an XACQUIRE prefix or an XBEGIN instruction). In processing block 1610, the architectural registers and state are saved. At this point, if an XTEST instruction is executed in processing block 1615, the test at processing block 1620 will determine that the zero flag has not been set, as a result of executing the XTEST instruction of processing block 1615 within the transactional execution region. It will be appreciated that the flow diagram of Figure 16 is only an example, and that a programmer may execute the XTEST instruction of processing block 1615 at any point in the process.
Proceeding to processing block 1625, memory transactions are buffered as a result of the transactional execution region. In processing block 1635, the buffered memory locations (for example, in the data cache) may be marked as exclusive. In processing block 1645, the read set is monitored. If another execution thread writes to a monitored memory location of the read set in processing block 1650, transactional processing is aborted in processing block 1665 (referred to as a transactional abort), and the processor rolls execution back to the previous synchronization point (for example, the state saved in processing block 1610). On the other hand, as long as no other execution thread writes to a monitored memory location of the read set in processing block 1650, the write set is simultaneously monitored in processing block 1655 according to any read/write transactions. If another execution thread reads or writes a monitored memory location of the write set in processing block 1660, transactional processing is likewise aborted in processing block 1665. It will be appreciated that such monitoring is a continuously ongoing process, maintained in a manner similar to cache coherency. Before the end of the transactional region is reached, if no other execution thread has written to a monitored memory location of the read set in processing block 1650, and no other execution thread has read or written a monitored memory location of the write set in processing block 1660, then the transactional region is exited in processing block 1670 (for example, by encountering an XRELEASE prefix or an XEND instruction), and the buffered memory transactions are atomically committed in processing block 1675 so that they become observable by other execution threads.
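The read-set and write-set rules above can be sketched in software. The Python model below is a hedged illustration only: `ToyTransaction`, `observe_remote`, and the dictionary used as shared memory are hypothetical names introduced for this example, and actual embodiments perform the monitoring in hardware alongside cache coherency.

```python
class ToyTransaction:
    """Hypothetical model of one transactional region over a shared dict."""

    def __init__(self, memory):
        self.memory = memory      # shared memory, modeled as address -> value
        self.read_set = set()     # locations monitored as in block 1645
        self.write_buffer = {}    # buffered stores (block 1625); keys form the write set
        self.aborted = False

    def read(self, addr):
        # Reads join the read set and see this region's own buffered stores.
        self.read_set.add(addr)
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        # Stores stay buffered and invisible to others until commit.
        self.write_buffer[addr] = value

    def observe_remote(self, addr, is_write):
        # Blocks 1650 and 1660: a remote write to the read set, or any
        # remote access to the write set, aborts the region (block 1665).
        if addr in self.write_buffer or (is_write and addr in self.read_set):
            self.aborted = True
            self.write_buffer.clear()   # discard uncommitted stores

    def end(self):
        # Blocks 1670 and 1675: commit all buffered stores together if no abort occurred.
        if self.aborted:
            return False
        self.memory.update(self.write_buffer)
        return True
```

A remote read of a location that is only in the read set does not abort the region in this model, whereas any remote access to a buffered store does, which matches the asymmetry between processing blocks 1650 and 1660.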
After a transactional abort in processing block 1665, the processor rolls execution back to the previous synchronization point, restoring the saved architectural registers and state and discarding any uncommitted memory transactions. At this point, if an XTEST instruction is executed in processing block 1615, the test at processing block 1620 will determine that the zero flag is set, as a result of executing the XTEST instruction of processing block 1615 after the transactional abort of processing block 1665, and that execution is therefore not within a transactional execution region. Thus, in processing block 1630, the program or thread has resumed with the processor state of the prior synchronization (rollback) point, and can continue executing in processing block 1640 as a non-transactional region. With embodiments of the XTEST instruction, the program can determine whether a transactional abort has occurred, which the processor or memory state might not otherwise indicate.
It will be understood that, it is contemplated that stop the observation whether having occurred and that transactional, such information can be provided to programmer
Option is such as recorded and is counted to the number retried terminated in transactional suspension.Also other choosings can be provided to programmer
, is such as executed within region and skip code segments according to determining that the program or is not executed in transactional currently.
The XTEST instruction of various other types has also been described, these XTEST instruction can provide additional option to programmer, all
Such as obtained before transactional suspension instruction that some things can malfunction (such as exhaust buffer space or some thread also to
The same memory position that your thread is intended to modification has issued affairs, etc.).
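The retry-counting option mentioned above can be sketched as follows. This Python model is an assumption-laden illustration rather than the patented mechanism: `ToyRegion`, `maybe_abort`, and `run_counting_retries` are hypothetical names, and the abort is injected deterministically where real hardware would abort on a conflict.

```python
class ToyRegion:
    """Hypothetical region that aborts a fixed number of times."""

    def __init__(self, aborts_to_inject):
        self.active = False
        self.remaining_aborts = aborts_to_inject

    def begin(self):
        self.active = True

    def maybe_abort(self):
        # Stand-in for a hardware abort: rollback leaves the region.
        if self.remaining_aborts > 0:
            self.remaining_aborts -= 1
            self.active = False

    def xtest(self):
        # Zero flag convention from the text: 0 inside a region, 1 outside.
        return 0 if self.active else 1

    def end(self):
        self.active = False


def run_counting_retries(region, max_attempts):
    """Return (succeeded, number of attempts that ended in an abort)."""
    aborted_attempts = 0
    for _ in range(max_attempts):
        region.begin()
        region.maybe_abort()          # the region's work would happen here
        if region.xtest() == 1:       # flag set: no longer transactional
            aborted_attempts += 1
            continue                  # retry the region
        region.end()
        return True, aborted_attempts
    return False, aborted_attempts
```

Under these assumptions, the program learns about an abort solely from the flag value returned by the test, which is the kind of observation the XTEST instruction is described as enabling.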
The foregoing description is intended to illustrate preferred embodiments of the present invention. From the above discussion it should also be apparent that, especially in an area of technology where growth is fast and further advances are not easily foreseen, the invention may be modified in arrangement and detail by those skilled in the art without departing from its principles, within the scope of the appended claims and their equivalents.
Claims (13)
1. A system, comprising:
a plurality of processors;
a processor interconnect to communicatively couple two or more of the processors;
a system memory, comprising dynamic random access memory, communicatively coupled to one or more of the processors,
wherein one or more of the plurality of processors comprise:
a plurality of multithreaded cores to perform out-of-order instruction execution for a plurality of threads, wherein one or more of the plurality of multithreaded cores comprise:
instruction fetch logic to fetch instructions of one or more of the plurality of threads,
an instruction decode unit to decode the instructions,
register renaming logic to rename one or more registers in a register file that are used for the instructions,
an instruction cache to cache one or more of the instructions to be executed,
a data cache to cache data for the instructions,
a level 2 (L2) cache unit to cache one or more of the instructions and data for the instructions, and
an execution unit to execute a transactional execution region of the instructions, the execution unit having circuitry to execute a first instruction, the first instruction to test the status of the transactional execution region,
wherein the execution unit is further to determine whether the first instruction is within the context of the transactional execution region and, in response, to set a flag register to a value indicating that the first instruction is within the context of the transactional execution region.
2. The system of claim 1, further comprising:
an accelerator unit, communicatively coupled to one or more of the processors, to perform a specified function.
3. The system of claim 2, wherein the accelerator unit comprises a field programmable gate array.
4. The system of claim 1, further comprising:
an external cache communicatively coupled to one or more processor interconnects.
5. The system of claim 1, further comprising:
at least one communication device communicatively coupled to one or more of the processors.
6. The system of claim 1, further comprising:
a storage device communicatively coupled to two or more of the processors.
7. The system of claim 1, wherein the execution unit further has circuitry to execute a second instruction of the instructions, the second instruction to indicate the beginning of the transactional execution region.
8. The system of claim 1, wherein the execution unit further has circuitry to execute a third instruction of the instructions, the third instruction to indicate the end of the transactional execution region and to cause memory transactions to be atomically committed or aborted.
9. The system of claim 1, wherein the execution unit further has circuitry to execute a second instruction of the instructions and circuitry to execute a third instruction of the instructions, wherein the second instruction is to indicate the beginning of the transactional execution region, and the third instruction is to indicate the end of the transactional execution region and to cause memory transactions to be atomically committed or aborted.
10. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating a nesting level of the transactional execution region.
11. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating at least one of a quantity or a size of internal buffers available to the transactional execution region.
12. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating that a transaction to a particular memory location could overflow an internal buffer and cause execution of the transactional execution region to abort.
13. The system of claim 1, wherein the execution unit is further to set a flag register to a value indicating that an access to a particular memory location could conflict with execution of another transactional execution region and cause execution of the transactional execution region to abort.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/538,951 | 2012-06-29 | ||
US13/538,951 US9268596B2 (en) | 2012-02-02 | 2012-06-29 | Instruction and logic to test transactional execution status |
CN201380028480.6A CN104335183B (en) | 2012-06-29 | 2013-06-19 | The methods, devices and systems of state are performed for testing transactional |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201380028480.6A Division CN104335183B (en) | 2012-06-29 | 2013-06-19 | Method, apparatus, and system for testing transactional execution status |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105677526A CN105677526A (en) | 2016-06-15 |
CN105677526B true CN105677526B (en) | 2019-11-05 |
Family
ID=49783754
Family Applications (7)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610081166.XA Active CN105677526B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
CN201610081114.2A Active CN105786665B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
CN201610081188.6A Active CN105760140B (en) | 2012-06-29 | 2013-06-19 | Instruction and logic for testing transactional execution status |
CN201610081127.XA Active CN105760139B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
CN201610081121.2A Active CN105760265B (en) | 2012-06-29 | 2013-06-19 | Instruction and logic for testing transactional execution status |
CN201380028480.6A Active CN104335183B (en) | 2012-06-29 | 2013-06-19 | Method, apparatus, and system for testing transactional execution status |
CN201610081087.9A Active CN105760138B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
Family Applications After (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610081114.2A Active CN105786665B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
CN201610081188.6A Active CN105760140B (en) | 2012-06-29 | 2013-06-19 | Instruction and logic for testing transactional execution status |
CN201610081127.XA Active CN105760139B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
CN201610081121.2A Active CN105760265B (en) | 2012-06-29 | 2013-06-19 | Instruction and logic for testing transactional execution status |
CN201380028480.6A Active CN104335183B (en) | 2012-06-29 | 2013-06-19 | Method, apparatus, and system for testing transactional execution status |
CN201610081087.9A Active CN105760138B (en) | 2012-06-29 | 2013-06-19 | System for testing transactional execution status |
Country Status (2)
Country | Link |
---|---|
CN (7) | CN105677526B (en) |
WO (1) | WO2014004222A1 (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8988221B2 (en) | 2005-03-16 | 2015-03-24 | Icontrol Networks, Inc. | Integrated security system with parallel processing architecture |
CN104883256B (en) * | 2014-02-27 | 2019-02-01 | 中国科学院数据与通信保护研究教育中心 | A kind of cryptographic key protection method for resisting physical attacks and system attack |
GB2538237B (en) * | 2015-05-11 | 2018-01-10 | Advanced Risc Mach Ltd | Available register control for register renaming |
US10346168B2 (en) | 2015-06-26 | 2019-07-09 | Microsoft Technology Licensing, Llc | Decoupled processor instruction window and operand buffer |
US10191747B2 (en) * | 2015-06-26 | 2019-01-29 | Microsoft Technology Licensing, Llc | Locking operand values for groups of instructions executed atomically |
US10318295B2 (en) * | 2015-12-22 | 2019-06-11 | Intel Corporation | Transaction end plus commit to persistence instructions, processors, methods, and systems |
US11106467B2 (en) * | 2016-04-28 | 2021-08-31 | Microsoft Technology Licensing, Llc | Incremental scheduler for out-of-order block ISA processors |
US10761849B2 (en) * | 2016-09-22 | 2020-09-01 | Intel Corporation | Processors, methods, systems, and instruction conversion modules for instructions with compact instruction encodings due to use of context of a prior instruction |
CN110419030B (en) * | 2016-09-28 | 2024-04-19 | 英特尔公司 | Measuring bandwidth per node in non-uniform memory access (NUMA) systems |
US10795853B2 (en) * | 2016-10-10 | 2020-10-06 | Intel Corporation | Multiple dies hardware processors and methods |
CN110121703B (en) * | 2016-12-28 | 2023-08-01 | 英特尔公司 | System and method for vector communication |
GB2563587B (en) | 2017-06-16 | 2021-01-06 | Imagination Tech Ltd | Scheduling tasks |
GB2563589B (en) * | 2017-06-16 | 2019-06-12 | Imagination Tech Ltd | Scheduling tasks |
GB2567433B (en) * | 2017-10-10 | 2020-02-26 | Advanced Risc Mach Ltd | Checking lock variables for transactions in a system with transactional memory support |
CN109815114A (en) * | 2018-12-14 | 2019-05-28 | 深圳壹账通智能科技有限公司 | Test method, management server, test equipment, computer equipment and medium |
CN110109657B (en) * | 2019-03-29 | 2023-06-20 | 南京佑驾科技有限公司 | GPU micro instruction detection method |
US11875095B2 (en) * | 2020-07-01 | 2024-01-16 | International Business Machines Corporation | Method for latency detection on a hardware simulation accelerator |
US11409533B2 (en) * | 2020-10-20 | 2022-08-09 | Micron Technology, Inc. | Pipeline merging in a circuit |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7206903B1 (en) * | 2004-07-20 | 2007-04-17 | Sun Microsystems, Inc. | Method and apparatus for releasing memory locations during transactional execution |
CN101187861A (en) * | 2006-09-20 | 2008-05-28 | 英特尔公司 | Instruction and logic for performing a dot-product operation |
CN102144218A (en) * | 2008-07-28 | 2011-08-03 | 超威半导体公司 | Virtualizable advanced synchronization facility |
US8006075B2 (en) * | 2009-05-21 | 2011-08-23 | Oracle America, Inc. | Dynamically allocated store queue for a multithreaded processor |
CN102163072A (en) * | 2008-12-09 | 2011-08-24 | 英特尔公司 | Software-based thread remapping for power savings |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7984248B2 (en) * | 2004-12-29 | 2011-07-19 | Intel Corporation | Transaction based shared data operations in a multiprocessor environment |
US7516313B2 (en) * | 2004-12-29 | 2009-04-07 | Intel Corporation | Predicting contention in a processor |
US7882339B2 (en) * | 2005-06-23 | 2011-02-01 | Intel Corporation | Primitives to enhance thread-level speculation |
US8190859B2 (en) * | 2006-11-13 | 2012-05-29 | Intel Corporation | Critical section detection and prediction mechanism for hardware lock elision |
JP4740926B2 (en) * | 2007-11-27 | 2011-08-03 | フェリカネットワークス株式会社 | Service providing system, service providing server, and information terminal device |
US8799582B2 (en) * | 2008-12-30 | 2014-08-05 | Intel Corporation | Extending cache coherency protocols to support locally buffered data |
US8627017B2 (en) * | 2008-12-30 | 2014-01-07 | Intel Corporation | Read and write monitoring attributes in transactional memory (TM) systems |
US8301849B2 (en) * | 2009-12-23 | 2012-10-30 | Intel Corporation | Transactional memory in out-of-order processors with XABORT having immediate argument |
US20110208921A1 (en) * | 2010-02-19 | 2011-08-25 | Pohlack Martin T | Inverted default semantics for in-speculative-region memory accesses |
US8549504B2 (en) * | 2010-09-25 | 2013-10-01 | Intel Corporation | Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region |
US8713256B2 (en) * | 2011-12-23 | 2014-04-29 | Intel Corporation | Method, apparatus, and system for energy efficiency and energy conservation including dynamic cache sizing and cache operating voltage management for optimal power performance |
-
2013
- 2013-06-19 CN CN201610081166.XA patent/CN105677526B/en active Active
- 2013-06-19 CN CN201610081114.2A patent/CN105786665B/en active Active
- 2013-06-19 CN CN201610081188.6A patent/CN105760140B/en active Active
- 2013-06-19 CN CN201610081127.XA patent/CN105760139B/en active Active
- 2013-06-19 CN CN201610081121.2A patent/CN105760265B/en active Active
- 2013-06-19 WO PCT/US2013/046633 patent/WO2014004222A1/en active Application Filing
- 2013-06-19 CN CN201380028480.6A patent/CN104335183B/en active Active
- 2013-06-19 CN CN201610081087.9A patent/CN105760138B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN105760138B (en) | 2018-12-11 |
CN105760139B (en) | 2018-12-11 |
CN104335183B (en) | 2018-03-30 |
CN105786665A (en) | 2016-07-20 |
CN105760138A (en) | 2016-07-13 |
CN105760140B (en) | 2019-09-13 |
CN105786665B (en) | 2019-11-05 |
CN105760139A (en) | 2016-07-13 |
WO2014004222A1 (en) | 2014-01-03 |
CN105760265B (en) | 2019-11-05 |
CN105760265A (en) | 2016-07-13 |
CN105760140A (en) | 2016-07-13 |
CN105677526A (en) | 2016-06-15 |
CN104335183A (en) | 2015-02-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105677526B (en) | System for testing transactional execution status | |
US10261879B2 (en) | Instruction and logic to test transactional execution status | |
CN104204990B (en) | Accelerate the apparatus and method of operation in the processor using shared virtual memory | |
CN106648553B (en) | For improving system, the method and apparatus of the handling capacity in continuous transactional memory area | |
CN103970509B (en) | Device, method, processor, processing system and the machine readable media for carrying out vector quantization are circulated to condition | |
KR101594502B1 (en) | Systems and methods for move elimination with bypass multiple instantiation table | |
KR101655713B1 (en) | Systems and methods for flag tracking in move elimination operations | |
CN107209722A (en) | For instruction and the logic for making the process forks of Secure Enclave in Secure Enclave page cache He setting up sub- enclave | |
TWI720056B (en) | Instructions and logic for set-multiple- vector-elements operations | |
CN107003853A (en) | The systems, devices and methods performed for data-speculative | |
CN107003850A (en) | The systems, devices and methods performed for data-speculative | |
CN108369571A (en) | Instruction and logic for even number and the GET operations of odd number vector | |
CN107408035B (en) | Apparatus and method for inter-strand communication |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||