US20070050561A1 - Method for creating critical section code using a software wrapper for proactive synchronization within a computer system - Google Patents


Publication number
US20070050561A1
Authority
US
United States
Prior art keywords
instructions
instruction
lock
memory
acquire
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/508,494
Inventor
Mitchell Alsup
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GlobalFoundries Inc
Original Assignee
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices Inc
Priority to US11/508,494
Assigned to ADVANCED MICRO DEVICES, INC. Assignment of assignors interest (see document for details). Assignors: ALSUP, MITCHELL
Publication of US20070050561A1
Assigned to GLOBALFOUNDRIES INC. Affirmation of patent assignment. Assignors: ADVANCED MICRO DEVICES, INC.
Assigned to GLOBALFOUNDRIES U.S. INC. Release by secured party (see document for details). Assignors: WILMINGTON TRUST, NATIONAL ASSOCIATION

Classifications

    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/526 Mutual exclusion algorithms
    • G06F12/04 Addressing variable-length words or parts of words
    • G06F12/0815 Cache consistency protocols
    • G06F12/0831 Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G06F12/126 Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning
    • G06F9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 Prefetch instructions; cache control instructions
    • G06F9/30087 Synchronisation or serialisation instructions
    • G06F9/3017 Runtime instruction translation, e.g. macros
    • G06F9/3842 Speculative instruction execution

Definitions

  • This invention relates to microprocessors and, more particularly, to process synchronization between processors in a multiprocessor system.
  • more advanced processors include additions to the instruction set that include hardware synchronization primitives (e.g., CMPXCHG, CMPXCHG8B, and CMPXCHG16B) that are based on atomically updating a single memory location.
  • a method for creating critical section code using a software wrapper for proactive synchronization in a computer system may include creating a high-level set of pseudo functions for requesting exclusive access to one or more memory resource addresses and then manipulating those requested memory resources under an illusion of atomicity.
  • the pseudo functions allow an application programmer to construct and reason about the critical section at a high level, while a high-level language compiler translates these pseudo functions into inline native instructions.
  • the method may include creating a high-level expression for requesting exclusive access to one or more memory resource addresses.
  • the high-level expression may include an ACQUIRE pseudo function call that includes one or more arguments associated with the one or more memory resource addresses.
  • the method may also include creating a low-level set of instructions by compiling the high-level expression.
  • the low-level set of instructions includes a specification phase for requesting exclusive access to the one or more memory resource addresses.
  • the method may include creating the specification phase of code, which may include generating an instruction stream having LOCK-based memory instructions having a LOCK prefix based on computations performed on the one or more arguments.
  • creating the specification phase of code may include inserting an ACQUIRE instruction into the instruction stream.
  • the LOCK-based memory instructions may include LOCK MOV instructions for loading data from one or more memory locations to one or more respective registers.
  • the LOCK-based memory instructions may include a LOCK PREFETCHW instruction.
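The compilation step described above can be pictured with a small model. The following is a minimal sketch in Python, not the patent's compiler: the function name `compile_acquire`, the register names, and the textual instruction syntax are all assumptions chosen for readability.

```python
def compile_acquire(addresses, prefetch_only=False):
    """Model the specification phase emitted for an ACQUIRE pseudo function.

    `addresses` holds the memory-resource operands passed to ACQUIRE.
    The mnemonic text and register names are illustrative assumptions.
    """
    stream = []
    for i, addr in enumerate(addresses):
        if prefetch_only:
            # Mark the address without loading data into a register.
            stream.append(f"LOCK PREFETCHW [{addr}]")
        else:
            # Load the data into a register while marking the address.
            stream.append(f"LOCK MOV r{i}, [{addr}]")
    # The ACQUIRE instruction closes the specification phase.
    stream.append("ACQUIRE")
    return stream


stream = compile_acquire(["head", "tail"])
```

In this model, each argument of the pseudo function yields one LOCK-prefixed memory instruction, and a single ACQUIRE instruction follows the last of them.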
  • FIG. 1 is a block diagram of one embodiment of a computer system.
  • FIG. 2 is a block diagram depicting further details of an embodiment of a processing node of FIG. 1 .
  • FIG. 3 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2 .
  • FIG. 4 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2 in response to receiving a coherency invalidation probe.
  • FIG. 5 is a diagram illustrating the creation of an embodiment of a critical code section using a software wrapper pseudo function call in a high-level language.
  • an advanced synchronization facility may be used.
  • the facility may support the construction of non-Blocking synchronization, WaitFree synchronization, Transactional Memory, along with the construction of various forms of Compare and Swap primitives typically used in the construction of these methods.
  • the facility allows construction (in software) of a large variety of synchronization primitives.
  • the advanced synchronization facility may enable software to program a large variety of synchronization kinds.
  • Each synchronization kind may directly specify: the cache lines needed for successful completion, a sequence point where failures can redirect control flow, a data modification section where the result of the successful critical section is performed, and a sequence point where success is made visible to the rest of the system making the whole sequence of instructions appear to be atomic.
  • the functionality of the advanced synchronization facility may enable the acquisition and release of multiple cache lines with write permission associated with a critical section substantially simultaneously as seen by other processors/cores. This process may be referred to as Linearizing. After acquisition, several modifications can be performed before any other interested party may observe any of the modifications to any of the specified multiple cache lines. Between the acquisition and the release, no other processors are allowed to be manipulating these same lines (e.g. have write permission). A similar method could have been performed by not sending HyperTransport™ Source Done messages for the associated lines and thereby preventing concurrent accesses. However, these solutions lead to deadlock and/or livelock, or timeouts. Thus, a computer system including processors and processor cores that may implement the advanced synchronization facility is described below.
  • Computer system 100 includes several processing nodes 312 A, 312 B, 312 C, and 312 D.
  • Each of processing node 312 A- 312 D is coupled to a respective memory 314 A- 314 D via a memory controller 316 A- 316 D included within each respective processing node 312 A- 312 D.
  • processing nodes 312 A- 312 D include interface logic (IF) used to communicate between the processing nodes 312 A- 312 D.
  • processing node 312 A includes interface logic 318 A for communicating with processing node 312 B, interface logic 318 B for communicating with processing node 312 C, and a third interface logic 318 C for communicating with yet another processing node (not shown).
  • processing node 312 B includes interface logic 318 D, 318 E, and 318 F;
  • processing node 312 C includes interface logic 318 G, 318 H, and 318 I;
  • processing node 312 D includes interface logic 318 J, 318 K, and 318 L.
  • Processing node 312 D is coupled to communicate with a plurality of input/output devices (e.g. devices 320 A- 320 B in a daisy chain configuration) via interface logic 318 L.
  • processing nodes may communicate with other I/O devices in a similar fashion. Processors may use this interface to access the memories associated with other processors in the system. It is noted that a component that includes a reference numeral followed by a letter may be generally referred to solely by the numeral where appropriate. For example, when referring generally to the processing nodes, processing node(s) 312 may be used.
  • Processing nodes 312 implement a packet-based link for inter-processing node communication.
  • the link is implemented as sets of unidirectional lines (e.g. lines 324 A are used to transmit packets from processing node 312 A to processing node 312 B and lines 324 B are used to transmit packets from processing node 312 B to processing node 312 A).
  • Other sets of lines 324 C- 324 H are used to transmit packets between other processing nodes as illustrated in FIG. 1 .
  • each set of lines 324 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed.
  • the link may be operated in a cache coherent fashion for communication between processing nodes or in a non-coherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction such as the PCI bus or ISA bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown (e.g., 320 A and 320 B). It is noted that in an exemplary embodiment, the link may be implemented as a coherent HyperTransport™ link or a non-coherent HyperTransport™ link, although in other embodiments, other links are possible.
  • I/O devices 320 A- 320 B may be any suitable I/O devices.
  • I/O devices 320 A- 320 B may include devices for communicating with another computer system to which the devices may be coupled (e.g. network interface cards or modems).
  • I/O devices 320 A- 320 B may include video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards.
  • Memories 314 A- 314 D may comprise any suitable memory devices.
  • a memory 314 A- 314 D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), DDR SDRAM, static RAM, etc.
  • the memory address space of computer system 100 is divided among memories 314 A- 314 D.
  • Each processing node 312 A- 312 D may include a memory map used to determine which addresses are mapped to which memories 314 A- 314 D, and hence to which processing node 312 A- 312 D a memory request for a particular address should be routed.
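The memory map lookup described above can be sketched as a range search. This is a hypothetical illustration: the address ranges, the node labels used as strings, and the name `home_node` are assumptions, not values from the patent.

```python
from bisect import bisect_right

# Hypothetical memory map: (base address, owning node), sorted by base.
# The ranges are illustrative and are not taken from the source.
MEMORY_MAP = [
    (0x0000_0000, "312A"),
    (0x4000_0000, "312B"),
    (0x8000_0000, "312C"),
    (0xC000_0000, "312D"),
]
BASES = [base for base, _ in MEMORY_MAP]


def home_node(address):
    """Return the node whose memory controller owns `address`."""
    index = bisect_right(BASES, address) - 1
    if index < 0:
        raise ValueError("address below mapped range")
    return MEMORY_MAP[index][1]
```

A request for an address is then routed to the node returned by the lookup, whose memory controller services it.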
  • Memory controllers 316 A- 316 D may comprise control circuitry for interfacing to memories 314 A- 314 D. Additionally, memory controllers 316 A- 316 D may include request queues for queuing memory requests.
  • Memories 314 A- 314 D may store code executable by the processors to implement the functionality as described in the preceding sections.
  • a packet to be transmitted from one processing node to another may pass through one or more intermediate nodes.
  • a packet transmitted by processing node 312 A to processing node 312 D may pass through either processing node 312 B or processing node 312 C as shown in FIG. 1 .
  • Any suitable routing algorithm may be used.
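Since any suitable routing algorithm may be used, one simple possibility is a breadth-first search over the node links. The sketch below is only an example of such an algorithm; the link table is inferred from the description of FIG. 1 and should be treated as an assumption, as should the name `route`.

```python
from collections import deque

# Links inferred from the description of FIG. 1; treat as an assumption.
LINKS = {
    "312A": ["312B", "312C"],
    "312B": ["312A", "312D"],
    "312C": ["312A", "312D"],
    "312D": ["312B", "312C"],
}


def route(source, destination):
    """Return one shortest node path using breadth-first search."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for neighbor in LINKS[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # destination unreachable
```

With this table, a packet from node 312 A to node 312 D passes through exactly one intermediate node, matching the text above.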
  • Other embodiments of computer system 100 may include more or fewer processing nodes than the embodiment shown in FIG. 1 .
  • the packets may be transmitted as one or more bit times on the lines 324 between nodes. A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines.
  • the packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.
  • processing nodes 312 may additionally include one or more processor cores (shown in FIG. 2 ). It is noted the processor cores within each node may communicate via internal packet-based links operated in the cache coherent fashion. It is further noted that processor cores and processing nodes 312 may be configured to share any (or all) of the memories 314 .
  • processor cores may implement the x86 architecture, although other architectures are possible and contemplated.
  • instruction decoder logic within each of the various processor cores may be configured to mark instructions that use a LOCK prefix.
  • processor core logic may include hardware (shown in FIG. 2 ) that may enable identification of the markers associated with LOCKed instructions. This hardware may enable the use of the LOCK instruction prefix to identify critical sections of code as part of the advanced synchronization facility.
  • the advanced synchronization facility and associated hardware may be implemented within computer system 100 .
  • the advanced synchronization facility may employ new instructions and use hardware such as a synchronization arbiter (shown in FIG. 2 ) which may be interconnected within the cache coherent fabric.
  • synchronization arbiter 230 is coupled to a Northbridge unit 290 of any processing node 312 , thus enabling the synchronization arbiter to observe explicit addresses associated with the Advanced Synchronization Facility transactions of each node.
  • the synchronization arbiter may be placed anywhere in the coherent domain of the interconnect network.
  • when a system is configured to support multiple virtual machines, and when these virtual machines do not share any actual physical memory, multiple synchronization arbiters can be configured to distribute the synchronization load across several arbiters.
  • critical section refers to a section of code used in the advanced synchronization facility that may include one or more memory reference instructions marked with a LOCK prefix, an ACQUIRE instruction, and a RELEASE instruction which ends the critical section.
  • the critical section code will appear to be executed atomically by interested observers.
  • the first phase may be referred to as the specification phase, while the latter phase is often referred to as the atomic phase.
  • software may be allowed to perform ‘simple’ arithmetic and logical manipulations on the data between reading and modifying the data of the critical section as long as the simple arithmetic operations do not cause exceptions when executed. If a data manipulation causes an exception inside a critical section, atomicity of that critical section may not be guaranteed. Critical section software should detect failures of atomicity, and deal with them appropriately, as described further below.
  • the advanced synchronization facility may utilize a weakened memory model and operate only upon cacheable data.
  • This weakened memory model may prevent the advanced synchronization facility from wasting processor cycles waiting for various processor and memory buffers to empty before performing a critical section.
  • software may insert LFENCE, SFENCE, or MFENCE instructions just prior to the RELEASE instruction to guarantee standard PC memory ordering.
  • an SFENCE instruction between the last LOCKed Store and the RELEASE instruction will guarantee that the unCacheable data is globally visible before the cacheable synchronization data is globally visible in any other processor. This may enable maximum overlap of unCacheable and Cacheable accesses with minimal performance degradation.
  • interface logic 318 A- 318 L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link.
  • Computer system 100 may employ any suitable flow control mechanism for transmitting packets.
  • each processing node may include respective bus interface units (BIU) 220 (shown in FIG. 2 ), which may provide functionality to enable proactive synchronization.
  • BIU 220 may be configured to identify those special addresses that are associated with an Advanced Synchronization event and to transmit those addresses to synchronization arbiter 230 in response to execution of an ACQUIRE instruction.
  • the BIU 220 may also be configured to determine whether the response received from synchronization arbiter 230 indicates that the addresses are interference free. If the response indicates the addresses may not be interference free, BIU 220 may notify the requesting processor core of the failure by sending a failure count value to a register within the processor core 18 and sending a completion message to synchronization arbiter 230 . If the addresses are guaranteed to be interference free, BIU 220 may allow the critical section to execute and wait to send the completion message to synchronization arbiter 230 .
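The two outcomes of the arbiter response can be summarized in a short model. This is a sketch under stated assumptions: the callbacks `run_critical_section` and `send_completion` are hypothetical stand-ins for the processor core and the link, and a passing count of zero reflects only one embodiment.

```python
def handle_arbiter_response(count, run_critical_section, send_completion):
    """Model how the BIU reacts to the arbiter's count value.

    Returns the value reported to the core's register: 0 on success,
    or the failure count when interference was detected.
    """
    if count != 0:
        # Interference: report the count and complete immediately.
        send_completion()
        return count
    # Interference free: run the critical section, then complete.
    run_critical_section()
    send_completion()
    return 0
```

In both branches the completion message is sent exactly once; what differs is whether the critical section runs before it.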
  • FIG. 2 is a block diagram that illustrates more detailed aspects of embodiments of processing node 312 A and synchronization arbiter 230 of FIG. 1 .
  • processing node 312 A includes processor cores 18 A and 18 n, where n may represent any number of processor cores. Since the processor cores may be substantially the same in various embodiments, only detailed aspects of processor core 18 A are described below.
  • processor cores 18 A and 18 n are coupled to bus interface unit 220 which is coupled to a Northbridge unit 290 , which is coupled to memory controller 316 A, HyperTransport™ interface logic 318 A- 318 C, and to synchronization arbiter 230 via a pair of unidirectional links 324 I- 324 J.
  • Processor core 18 A includes hardware configured to execute instructions. More particularly, as is typical of many processors, processor core 18 A includes one or more instruction execution pipelines including a number of pipeline stages, cache storage and control, and an address translation mechanism (only pertinent portions of which are shown for brevity). Accordingly, as shown processor core 18 A includes a level one (L1) instruction cache, prefetch logic, and branch prediction logic. Since these blocks may be closely coupled with the instruction cache, they are shown together as block 250 . Processor core 18 A also includes an L1 data cache 207 . Processor core 18 A also includes an instruction decoder 255 and an instruction dispatch and control unit 256 , which may be coupled to receive instructions from instruction decoder 255 and to dispatch operations to a scheduler 259 .
  • instruction dispatch and control unit 256 may be coupled to a microcode read-only memory (MROM) (not shown).
  • Scheduler 259 is coupled to receive dispatched operations from instruction dispatch and control unit 256 and to issue operations to execution units 260 .
  • execution units 260 may include any number of integer execution units and floating-point units.
  • processor core 18 A includes a TLB 206 and a load/store unit 270 . It is noted that in alternative embodiments, an on-chip L2 cache may be present (although not shown).
  • Instruction decoder 255 may be configured to decode instructions into operations which may be either directly decoded or indirectly decoded using operations stored within the MROM. Instruction decoder 255 may decode certain instructions into operations executable within execution units 260 . Simple instructions may correspond to a single operation, while in other embodiments, more complex instructions may correspond to multiple operations. In one embodiment, instruction decoder 255 may include multiple decoders (not shown) for simultaneous decoding of instructions. Each instruction may be aligned and decoded into a set of control values in multiple stages depending on whether the instructions are first routed to MROM. These control values may be routed in an instruction stream to instruction dispatch and control unit 256 along with operand address information and displacement or immediate data which may be included with the instruction. As described further below, when a memory reference instruction includes a LOCK prefix, instruction decoder 255 may identify the address with a marker.
  • Load/store unit 270 may be configured to provide an interface between execution units 260 and data cache 207 .
  • load/store unit 270 may include load/store buffers with several storage locations for data and address information for pending loads or stores.
  • the illustrated embodiment includes LS1 205 , linear LS2 209 , physical LS2 210 , and data storage 211 .
  • processor core 18 A includes marker logic 208 , and a marker bit 213 .
  • a critical section may be processed in one of two ways: deterministically or optimistically. The choice of execution may be based upon the configuration of the advanced synchronization facility and upon the state of a critical section predictor, as described in greater detail below. In various embodiments, either the basic input output system (BIOS), the operating system (OS), or a virtual memory manager (VMM) may configure the operational mode of the advanced synchronization facility. When operating in the deterministic execution mode, the addresses specified by the locked memory reference instructions may be bundled up and sent en masse to the synchronization arbiter 230 to be examined for interference. The cache line data may be obtained and the critical section executed (as permitted).
  • the advanced synchronization facility may use the synchronization arbiter 230 .
  • synchronization arbiter 230 examines all of the physical addresses associated with a synchronization request and either passes (i.e., accepts) the set of addresses or fails (i.e., rejects) the set of addresses, based upon whether any other processor core or requester is operating on or has requested those addresses while they are being operated on.
  • synchronization arbiter 230 may allow software to be constructed that proactively avoids interference.
  • synchronization arbiter 230 may respond to a request with a failure status including a unique number (e.g., count value 233 ) to a requesting processor core.
  • the count may be indicative of the number of requestors contending for the memory resource(s) being requested.
  • Software may use this number to proactively avoid interference in subsequent trips through the critical section by using this number to choose a different resource upon which to attempt a critical section access.
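One way software might use the returned count to choose a different resource is to index into a list of equivalent resources. This is a hypothetical sketch: the function name `pick_resource` and the modulo wrap-around policy are assumptions for illustration, and the source does not prescribe a particular selection scheme.

```python
def pick_resource(resources, failure_count):
    """Choose which resource to try next, given the arbiter's count.

    A count of 0 means the previous attempt passed, so the first
    choice is kept. Otherwise the caller steps `failure_count`
    positions down the list, so that requesters that received
    different counts spread out over different resources.
    """
    return resources[failure_count % len(resources)]
```

For example, with three free lists, a requester that received a count of 1 would retry on the second list instead of contending again for the first.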
  • synchronization arbiter 230 includes a storage 232 including a number of entries. Each of the entries may store one or more physical addresses of requests currently being operated on. In one embodiment, each entry may store up to eight physical addresses that are transported as a single 64-byte request.
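The arithmetic behind the 64-byte request is eight addresses of 8 bytes each. The sketch below packs such a request in Python. The little-endian layout, the zero padding of unused slots, and the name `pack_request` are assumptions, not details from the source.

```python
import struct

MAX_ADDRESSES = 8  # eight 8-byte physical addresses fill 64 bytes


def pack_request(addresses):
    """Pack up to eight physical addresses into one 64-byte payload.

    Unused slots are zero-filled (an assumed padding convention).
    """
    if len(addresses) > MAX_ADDRESSES:
        raise ValueError("at most eight addresses per request")
    padded = list(addresses) + [0] * (MAX_ADDRESSES - len(addresses))
    return struct.pack("<8Q", *padded)


payload = pack_request([0x1000, 0x2040])
```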
  • the synchronization arbiter entry includes the count value 233 , which corresponds to all the addresses in the entry. As described above, the count value may be indicative of the number of requesters (i.e., interferers) that are contending for any of the addresses in a critical section.
  • a compare unit 231 within synchronization arbiter 230 checks for a match between each address in the set and all the addresses in storage 232 . If there is no match, synchronization arbiter 230 may be configured to issue a pass response by returning a passing count value and to store the addresses within storage 232 . In one embodiment, the passing count value is zero, although any suitable count value may be used. However, if there is an address match, synchronization arbiter 230 may increment the count value 233 associated with the set of addresses that includes the matching address, and then return that count value as part of a failure response. It is noted that compare unit 231 may be a compare only structure implemented in a variety of ways, as desired.
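The pass/fail check described above can be modeled in a few lines. This is a toy software model under stated assumptions, not the hardware design: the class and method names are hypothetical, and removing an entry when the completion message arrives is an assumption about how storage is reclaimed.

```python
class SynchronizationArbiter:
    """Toy model of the arbiter's compare-and-count behavior."""

    PASS_COUNT = 0  # passing count value used in one embodiment

    def __init__(self):
        # Each entry: a set of physical addresses plus one shared count.
        self.entries = []

    def request(self, addresses):
        """Return the passing count, or the incremented count on failure."""
        wanted = set(addresses)
        for entry in self.entries:
            if entry["addresses"] & wanted:
                # Address match: another requester is using a line.
                entry["count"] += 1
                return entry["count"]
        # No match: record the set and grant the request.
        self.entries.append({"addresses": wanted, "count": self.PASS_COUNT})
        return self.PASS_COUNT

    def complete(self, addresses):
        """Drop the entry when its completion message arrives (assumed)."""
        wanted = set(addresses)
        self.entries = [e for e in self.entries if e["addresses"] != wanted]
```

Each failed request raises the count on the matching entry, so later interferers see larger numbers, which is what lets software tell contenders apart.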
  • each address stored within storage 232 may be associated with a respective count.
  • the count value may be indicative of the number of requestors (i.e., interferers) that are contending for the respective address in a critical section.
  • bus interface unit (BIU) 220 includes a count compare circuit 221 , a locked line buffer (LLB) 222 , and a predictor 223 .
  • BIU 220 may also include various other circuits for transmitting and receiving transactions from the various components to which it is connected, however, these have been omitted for clarity.
  • BIU 220 may be configured to transmit a set of addresses associated with a critical section from LLB 222 to synchronization arbiter 230 in response to the execution of an ACQUIRE instruction.
  • compare circuit 221 may be configured to compare the count value returned by synchronization arbiter 230 to check if the count is a passing count value (e.g., zero) or a failing count value.
  • LLB 222 may be implemented using any type of storage structure. For example, it may be part of an existing memory address buffer (MAB) or separate, as desired.
  • LOCKed Load instructions may have the following form:
  • a regular x86 memory read instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206 ). In addition, memory access strength is reduced to access the line (in the case of a cache miss) without write permission (ReadS, not ReadM or Read).
  • the Load instruction may not be retired out of LS2 until the ACQUIRE instruction returns from the synchronization arbiter 230 .
  • LOCKed MOV to register instructions (which may be otherwise referred to as LOCKed Loads) may be processed normally down the data cache pipeline.
  • each linear address may be stored within linear address portion of LS2 209 .
  • the corresponding physical addresses may be stored in TLB 206 and within physical LS2 210 , while the corresponding data may be stored within data cache 207 and data LS2 211 .
  • Marker logic 208 may detect the LOCK prefix marker generated during decode and generate an additional marker bit 213 , thereby marking each such address as a participant in a critical section. Any LOCKed Load that takes a miss in the data cache may have its cache line data fetched through the memory hierarchy with Read-to-Share access semantics, however write permission to that specified memory resource is checked.
  • addresses associated with a critical section may be marked during instruction decode by using the LOCK prefix. More particularly, memory prefetch references that explicitly participate in advanced synchronization code sequences are annotated by using the LOCK prefix with an appropriate PREFETCHW instruction.
  • LOCKed Prefetch instructions may have the following form:
  • a regular memory PREFETCHW instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206 ). In addition, memory access strength is reduced to avoid an actual DRAM access for the line.
  • the PREFETCHW instruction may not be retired out of LS2 until the ACQUIRE instruction returns from synchronization arbiter 230 . These instructions may be used to touch cache lines that participate in the critical section and that need data (e.g., a pointer) in order to touch other data also needed in the critical section.
  • an ACQUIRE instruction is used to notify BIU 220 that all memory reference addresses for the critical section are stored in LLB 222 .
  • the ACQUIRE instruction may have the form
  • the ACQUIRE instruction checks that the number of LOCKed memory reference instructions is equal to the immediate value in the ACQUIRE instruction. If this check fails, the ACQUIRE instruction terminates with a failure code, otherwise, the ACQUIRE instruction causes BIU 220 to send all addresses stored within LLB 222 to the synchronization arbiter 230 . This instruction ‘looks’ like a memory reference instruction on the data path so that the count value returned from the synchronization arbiter 230 can be used to confirm (or deny) that all the lines can be accessed without interference. No address is necessary for this ‘load’ instruction because there can be only one synchronization arbiter 230 per virtual machine or per system.
  • the register specified in the ACQUIRE instruction is the destination register of processor core 18 .
  • the semantics of a LOCKed Load operation may include probe monitoring the location for a PROBE. If a PROBE is detected for a location, the LS1 or LS2 queue may return a failure status without waiting for the read to complete.
  • a general-protection fault (#GP) may be generated if the number of LOCKed loads exceeds a micro-architectural limit. If an ACQUIRE instruction fails, the count of LOCKed loads will be reset to zero. If the address is not to a Write Back memory type, the ACQUIRE instruction can be made to fail (when subsequently encountered).
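  • The bookkeeping described for the LOCKed references and the ACQUIRE check may be modeled in software as follows. This is a minimal sketch rather than the hardware design: the names `llb_t`, `llb_record`, and `acquire_precheck`, the capacity of eight, and the clearing of the memory-type flag on failure are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed limit: the text mentions up to eight addresses per request. */
#define LLB_CAPACITY 8

/* Software model of the locked line buffer (hypothetical layout). */
typedef struct {
    uint64_t phys_addr[LLB_CAPACITY];
    size_t   count;         /* LOCKed references gathered so far */
    bool     bad_mem_type;  /* set when an address is not Write Back */
} llb_t;

typedef enum { LLB_OK, LLB_FAULT_GP } llb_status_t;

/* A LOCKed load or LOCKed PREFETCHW contributes its physical address. */
static llb_status_t llb_record(llb_t *llb, uint64_t pa, bool write_back)
{
    assert(llb != NULL);
    if (llb->count >= LLB_CAPACITY)
        return LLB_FAULT_GP;        /* exceeds the modeled limit */
    if (!write_back)
        llb->bad_mem_type = true;   /* a later ACQUIRE will fail */
    llb->phys_addr[llb->count++] = pa;
    return LLB_OK;
}

/* ACQUIRE check: true when the set may be sent to the arbiter.
 * On failure the count of LOCKed references is reset to zero.
 * Clearing bad_mem_type here is an assumption of this sketch. */
static bool acquire_precheck(llb_t *llb, unsigned imm)
{
    assert(llb != NULL);
    bool ok = (llb->count == imm) && !llb->bad_mem_type;
    if (!ok) {
        llb->count = 0;
        llb->bad_mem_type = false;
    }
    return ok;
}
```

  • In this model, an ACQUIRE whose immediate value disagrees with the number of recorded references fails before any request would be sent to the synchronization arbiter.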
  • arithmetic and memory reference instructions may be processed in either the SSE registers (XMM), or in the general-purpose registers (e.g., EAX, etc), or in the MMX or x87 registers.
  • synchronization arbiter 230 may either pass the request en masse or fail the request en masse. If synchronization arbiter 230 fails the request, the response back to BIU 220 may be referred to as a “synchronization arbiter Fail-to-ACQUIRE” with the zero bit set (e.g., RFLAGS.ZF). As described above, the response returned by synchronization arbiter 230 may include the count value 233 , which may be indicative of the number of interferers. Software may use this count to reduce future interference as described above. The count value 233 from the synchronization arbiter 230 may be delivered to a general-purpose register (not shown) within processor core 18 and may also be used to set condition codes. If the synchronization arbiter 230 passes the request, the response back to BIU 220 may include a pass count value (e.g., zero).
  • the request may be returned with a negative count value such as minus one (-1), for example.
  • This may provide software running on the processor core a means to see an overload in the system and to enable that software to stop making requests to synchronization arbiter 230 for a while. For example, the software may schedule something else or it may simply waste some time before retrying the synchronization attempt.
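  • One way software might react to the three kinds of count value (zero, positive, negative) is sketched below. The function names, the enum, and the modulo-based choice of a different resource are hypothetical; the text only states that software may use the count to choose a different resource or to back off.

```c
#include <assert.h>

/* Possible next steps for software after an ACQUIRE attempt. */
typedef enum { ACT_ENTER, ACT_PICK_OTHER, ACT_BACK_OFF } next_action_t;

/* Zero passes, a positive value counts interferers, and a
 * negative value is treated as an overload indication. */
static next_action_t classify_count(long count)
{
    if (count == 0)
        return ACT_ENTER;
    if (count < 0)
        return ACT_BACK_OFF;
    return ACT_PICK_OTHER;
}

/* Assumed policy: step forward by the interference count so that
 * contenders that saw different counts try different resources. */
static unsigned pick_resource(unsigned current, long count,
                              unsigned n_resources)
{
    assert(n_resources > 0);
    assert(count > 0);
    return (unsigned)((current + (unsigned long)count) % n_resources);
}
```

  • Because each failing requester is returned a different count as contention grows, a policy of this kind may spread the requesters across resources on the next attempt.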
  • processor core 18 may execute the instructions in the atomic phase and manipulate the data in the cache lines as desired.
  • a RELEASE instruction is executed signifying the end of the critical section.
  • the RELEASE instruction enables all of the modified data to become visible substantially simultaneously by sending the RELEASE message to synchronization arbiter 230 , thereby releasing the associated cache lines back to the system.
  • the advanced synchronization facility supports two kinds of failures, a “Fail-to-ACQUIRE” and a “Fail-to-REQUESTOR”.
  • the Fail-to-ACQUIRE failure causes the ACQUIRE instruction to complete with the zero bit set (e.g., RFLAGS.ZF) so that the subsequent conditional jump instruction can redirect control flow away from damage inducing instructions in the atomic phase.
  • there are two types of Fail-to-ACQUIRE failure. One is the synchronization arbiter Fail-to-ACQUIRE with the zero bit set (e.g., RFLAGS.ZF); the other is a processor Fail-to-ACQUIRE.
  • processor cores may communicate by observing memory transactions.
  • processor core 18 monitors all of those addresses for coherent invalidation probes (e.g., Probe with INValidate). If any of the lines are invalidated, the response from synchronization arbiter 230 may be ignored and the ACQUIRE instruction may be made to fail with the zero bit set (e.g., RFLAGS.ZF).
  • the Fail-to-REQUESTOR failure may be sent as a PROBE response if there is a cache hit on a line that has been checked for interference and passed by synchronization arbiter 230 .
  • a Fail-to-REQUESTOR response causes the requesting processor to Fail-to-ACQUIRE if it is currently processing an advanced synchronization facility critical section, or it will cause the requesting processor's BIU to re-request that memory request if it is not processing the critical section.
  • BIU 220 may be configured to cause a Fail-to-ACQUIRE in response to receiving a Probe with INValidate prior to obtaining a pass notification from synchronization arbiter 230 .
  • a processor core 18 that has had its addresses passed by synchronization arbiter 230 may obtain each cache line for exclusive access (e.g., write permission) as memory reference instructions are processed in the atomic phase. After a passed cache line arrives, processor core 18 may hold onto that cache line and prevent other processor cores from stealing the line by responding to coherent invalidation probes with Fail-to-REQUESTOR responses. It is noted that Fail-to-REQUESTOR may also be referred to as a negative-acknowledgement (NAK).
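  • The probe handling described above may be summarized by a small state model. This is an illustrative sketch only: the phase names, `core_t`, and both functions are assumptions, and real hardware tracks this state per cache line rather than per core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Coarse phases of a core with respect to a critical section. */
typedef enum { PHASE_IDLE, PHASE_SPECIFYING, PHASE_ATOMIC } phase_t;
typedef enum { RESP_ACK, RESP_NAK } probe_resp_t;
typedef enum { REQ_FAIL_ACQUIRE, REQ_RETRY } requester_action_t;

typedef struct {
    phase_t phase;
    bool    acquire_failed;  /* forces the pending ACQUIRE to fail */
} core_t;

/* Response of the probed core to a Probe with INValidate. */
static probe_resp_t on_invalidating_probe(core_t *core, bool hits_marked_line)
{
    assert(core != NULL);
    if (!hits_marked_line)
        return RESP_ACK;               /* line is not in the set */
    if (core->phase == PHASE_ATOMIC)
        return RESP_NAK;               /* Fail-to-REQUESTOR: keep the line */
    if (core->phase == PHASE_SPECIFYING)
        core->acquire_failed = true;   /* probe arrived before the pass */
    return RESP_ACK;
}

/* Reaction of the core that sent the probe and received a NAK. */
static requester_action_t on_nak(const core_t *requester)
{
    assert(requester != NULL);
    return requester->phase != PHASE_IDLE ? REQ_FAIL_ACQUIRE : REQ_RETRY;
}
```

  • In this model the holder of a passed line never gives it up mid-section, while a requester that is not in a critical section simply re-requests the line.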
  • the critical section and the interfering memory reference may both be performed (e.g., no live-lock, nor dead-lock).
  • the performance of a processor participating in the Advanced Synchronization Facility may be optimized by using a critical section predictor 223 .
  • Initially predictor 223 may be set up to predict that no interference is expected during execution of a critical section.
  • processor core 18 may not actually use the synchronization arbiter 230 . Instead processor core 18 may record the LOCKed memory references and may check these against Coherent Invalidation PROBEs to detect interference. If the end of the critical section is reached before any interference is detected, no interested third party has seen the activity of the critical section and it has been performed as if it was executed atomically. This property enables the Advanced Synchronization Facility to be processor-cycle competitive with currently existing synchronization mechanisms when no contention is observed.
  • processor core 18 may create a failure status for the ACQUIRE instruction and the subsequent conditional branch redirects the flow of control out of the critical section, and resets the predictor to predict deterministic mode.
  • the decoder will then predict interference might happen, and will process the critical section using the synchronization arbiter 230 (if enabled).
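  • The behavior of the critical section predictor may be pictured as a two-state mode switch. The sketch below is an assumption-laden model: the type and function names are invented, and no policy for returning to the optimistic mode is modeled because the text does not describe one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Optimistic: run without the arbiter and watch for probes.
 * Deterministic: go through the synchronization arbiter. */
typedef enum { MODE_OPTIMISTIC, MODE_DETERMINISTIC } cs_mode_t;

typedef struct { cs_mode_t mode; } predictor_t;

/* Initially no interference is expected. */
static void predictor_init(predictor_t *p)
{
    assert(p != NULL);
    p->mode = MODE_OPTIMISTIC;
}

/* Interference seen before the end of the critical section. */
static void predictor_on_interference(predictor_t *p)
{
    assert(p != NULL);
    p->mode = MODE_DETERMINISTIC;
}

/* The arbiter is used only in deterministic mode, and only if enabled. */
static bool uses_arbiter(const predictor_t *p, bool arbiter_enabled)
{
    assert(p != NULL);
    return p->mode == MODE_DETERMINISTIC && arbiter_enabled;
}
```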
  • the Advanced Synchronization facility may operate on misaligned data items as long as these items do not span cache lines that are not participating in the actual critical section.
  • Software is free to have synchronization items span cache line boundaries as long as all cache lines so touched are recognized as part of the critical section entry.
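  • To decide which cache lines must be named in the critical section entry for a possibly misaligned item, software may compute the lines that the item touches. The following sketch assumes a 64-byte cache line (the text mentions 64-byte requests but does not state the line size), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes. */
#define CACHE_LINE_BYTES 64u

/* Index of the cache line containing the first byte of the item. */
static uint64_t first_line(uint64_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Index of the cache line containing the last byte of the item. */
static uint64_t last_line(uint64_t addr, size_t size)
{
    assert(size > 0);
    return (addr + size - 1) / CACHE_LINE_BYTES;
}

/* Number of cache lines the item spans; each of them would need
 * to be part of the critical section entry. */
static size_t lines_spanned(uint64_t addr, size_t size)
{
    return (size_t)(last_line(addr, size) - first_line(addr)) + 1;
}
```

  • For example, under the 64-byte assumption an 8-byte item at address 0x103C crosses a line boundary, so both lines would have to be touched with LOCKed references.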
  • the processor neither detects the failure of atomicity nor signals the lack of atomicity.
  • access to critical section data may be dependent upon the presence of that data in main memory. All of the lines necessary for the critical section are touched before entry into the critical section, and any access rights issues or page-faulting issues may be detected when the LOCKed Load or LOCKed PREFETCHW instructions execute prior to entering the critical section. When any of the lead-in addresses take a fault, the subsequent ACQUIRE instruction is made to fail. After entry to the critical section, if any instruction causes an exception, the processor will cause a failure at the ACQUIRE instruction, and the subsequent conditional jump redirects control away from the critical section.
  • the decoder of processor core 18 may arrange that the ACQUIRE instruction will fail with the zero bit set (e.g., RFLAGS.ZF), and take the interrupt at the ACQUIRE instruction.
  • synchronization arbiter 230 may be assigned a predetermined and/or reserved node ID that no other component may have. This assignment may be made at boot time by the BIOS, for example.
  • the count value may be returned as a 64-bit value, although other values are contemplated.
  • FIG. 3 is a flow diagram describing the operation of the embodiments of the computer system shown in FIG. 1 and FIG. 2 .
  • addresses of cache lines that are currently being operated on or accessed as part of a critical section are maintained in a list (e.g., within LLB 222 ).
  • synchronization arbiter 230 may store the addresses corresponding to a critical section, as a set, within an entry of address storage 232 .
  • each entry of address storage 232 may also store a count value that is associated with the whole set of addresses stored therein (block 410 ).
  • the count value may be indicative of the number of contenders (i.e., interferers) for any of the addresses in the set.
  • synchronization arbiter 230 may store a number of count values within each entry, such that each address in the entry has an associated count value.
  • a critical section may include the use of LOCKed MOV instructions, followed by an ACQUIRE instruction and a RELEASE instruction (block 415 ). Accordingly, the set of addresses that are requested are checked for interference. In one embodiment, the set of addresses is compared to all of the addresses within address storage 232 (block 420 ). In the embodiments described above, the LOCKed MOV instructions cause the addresses to be marked. The marker causes BIU 220 to store each marked address in LLB 222 .
  • the ACQUIRE instruction causes BIU 220 to send the entire set of addresses in LLB 222 to synchronization arbiter 230 in the form of an unCacheable write that carries 64 bytes of physical address data.
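  • The 64-byte payload may be pictured as eight 8-byte physical address slots. The layout below is an assumption for illustration; in particular, the text does not say how unused slots are encoded, so zero-filling them is a choice made by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ADDRS_PER_REQUEST 8

/* Hypothetical payload of the unCacheable write: eight address slots. */
typedef struct {
    uint64_t phys_addr[ADDRS_PER_REQUEST];
} acquire_request_t;

/* The whole set travels as one 64-byte request. */
_Static_assert(sizeof(acquire_request_t) == 64,
               "request payload must be 64 bytes");

/* Copy n addresses (1..8) into the payload; unused slots are zeroed,
 * which is an assumption of this sketch. */
static void pack_request(acquire_request_t *req,
                         const uint64_t *addrs, size_t n)
{
    assert(req != NULL && addrs != NULL);
    assert(n >= 1 && n <= ADDRS_PER_REQUEST);
    memset(req, 0, sizeof *req);
    memcpy(req->phys_addr, addrs, n * sizeof addrs[0]);
}
```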
  • Synchronization arbiter 230 compares the set of addresses to all the addresses in the storage 232 .
  • the count value associated with the matching address is incremented (block 455 ) and the new count value is returned to BIU 220 as a part of a failure response to the unCacheable write (block 460 ) that carries 64 bits of response data.
  • synchronization arbiter 230 discards the set of addresses upon failure.
  • BIU 220 sends the failure count value to the register of the requesting processor/core, which may also set condition code flags. As a result, the requesting processor/core may use the count value to select another set of memory resources in subsequent operations (block 465 ) and avoid interference on its subsequent synchronization attempt. Operation proceeds as described above in block 415 .
  • synchronization arbiter 230 may return a passing count value (e.g., zero) to BIU 220 (block 430 ).
  • synchronization arbiter 230 may store the set of addresses in an entry of storage 232 (block 435 ).
  • BIU 220 may send the passing count value to the requesting processor/core register specified in the ACQUIRE instruction. As such, the requesting processor/core may manipulate or otherwise operate on the data at the requested addresses (block 440 ). If the operation is not complete (block 445 ), BIU 220 defers sending a completion message to synchronization arbiter 230 .
  • BIU 220 may send a completion message to synchronization arbiter 230 .
  • synchronization arbiter 230 may flush the corresponding addresses from storage 232 , thereby releasing those addresses back to the system (block 450 ) for use by another processor/core.
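  • The request, pass, fail, and release flow of FIG. 3 may be modeled in software as shown below. This is a simplified sketch, not the arbiter's implementation: the storage depth of four, the names, and the return of -1 when no entry is free (treated here as the overload case) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 4   /* assumed storage depth */
#define MAX_ADDRS   8   /* up to eight addresses per entry */

typedef struct {
    bool     valid;
    size_t   n;
    uint64_t addr[MAX_ADDRS];
    long     count;     /* contenders seen for this set */
} entry_t;

typedef struct { entry_t entry[MAX_ENTRIES]; } arbiter_t;

/* Index of the first valid entry sharing an address with the set, or -1. */
static int find_conflict(const arbiter_t *a, const uint64_t *set, size_t n)
{
    for (int e = 0; e < MAX_ENTRIES; e++) {
        if (!a->entry[e].valid)
            continue;
        for (size_t i = 0; i < a->entry[e].n; i++)
            for (size_t j = 0; j < n; j++)
                if (a->entry[e].addr[i] == set[j])
                    return e;
    }
    return -1;
}

/* Returns 0 on pass (set stored), a positive count on failure
 * (set discarded), or -1 when no entry is free (assumed overload). */
static long arbiter_request(arbiter_t *a, const uint64_t *set, size_t n)
{
    assert(a != NULL && set != NULL);
    assert(n >= 1 && n <= MAX_ADDRS);
    int hit = find_conflict(a, set, n);
    if (hit >= 0)
        return ++a->entry[hit].count;       /* blocks 455 and 460 */
    for (int e = 0; e < MAX_ENTRIES; e++) {
        if (!a->entry[e].valid) {
            a->entry[e].valid = true;       /* block 435 */
            a->entry[e].n = n;
            a->entry[e].count = 0;
            for (size_t i = 0; i < n; i++)
                a->entry[e].addr[i] = set[i];
            return 0;                       /* block 430 */
        }
    }
    return -1;
}

/* Completion message: flush the entry holding the set (block 450). */
static void arbiter_release(arbiter_t *a, const uint64_t *set, size_t n)
{
    int hit = find_conflict(a, set, n);
    if (hit >= 0)
        a->entry[hit].valid = false;
}
```

  • In this model, a second requester whose set overlaps a stored set sees counts of 1, 2, and so on until the holder releases, after which the same request passes with a count of zero.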
  • load/store unit 270 updates the data cache for all instructions in that critical section that retired.
  • FIG. 4 is a flow diagram describing the operation of the embodiments of FIG. 1 and FIG. 2 when a coherency invalidation probe is received.
  • a Probe is received and hits on a critical section address in load store unit 270 .
  • BIU 220 may send a Failure-to-Requestor response as a response to the probe (block 515 ).
  • this Failure-to-Requestor response should cause a failure of the ACQUIRE instruction if the processor core was operating in a critical section, or a retry of the addresses if not.
  • the processor core may ignore any count value received from synchronization arbiter 230 (block 520 ).
  • Load/store unit 270 may notify instruction dispatch and control unit 257 that there is a probe hit (e.g., Prb hit signal), and thus there is a Failure-to-Acquire.
  • the ACQUIRE instruction is made to fail, as described above. As such, to an outside observer the ACQUIRE instruction simply failed.
  • a critical code section may include one or more memory reference load instructions with the LOCK prefix, followed by the ACQUIRE instruction.
  • a conditional jump instruction should follow the ACQUIRE instruction to allow the code to exit the critical section should synchronization arbiter 230 provide a Fail-to-Acquire code or if a Probe with INValidate is detected prior to acquiring the cache lines.
  • the conditional jump may be followed by a release instruction.
  • LOCKed store instructions may be used in lieu of the RELEASE instructions.
  • Two assembly language critical code sections are shown further below to exemplify the two types of critical sections. It is noted that the following code segments are merely examples used for discussion purposes. Other embodiments are possible and contemplated.
  • application programmers may use pseudo function calls within high-level languages such as the ‘C’ language, for example.
  • an ACQUIRE pseudo function may take a variable number of arguments and return an integer value and a condition code result.
  • the ACQUIRE pseudo function may take between one and eight arguments, for example.
  • the ACQUIRE pseudo function should be used in the context of conditional flow control. For example, the ACQUIRE pseudo function should be inside an if-statement.
  • Each argument may be compiled into the code stream with the LOCK prefix attached to the last memory reference used in the computation of that argument.
  • the ACQUIRE function is classified as a pseudo function because it will never be compiled into a subroutine call in the resulting code, but instead results in a series of inline instructions. In effect, the compiler understands that ACQUIRE is a high-level language expression that is directly translated into native code.
  • the compiler can check that the pseudo function is used in conditional context (e.g., in an if-statement) and issue a diagnostic if used otherwise.
  • a MODIFY pseudo function may take a variable number of arguments and does not return a result.
  • the MODIFY pseudo function compiles these arguments into the code stream and then inserts a RELEASE instruction following the code that computes the arguments.
  • the MODIFY pseudo function may have a different number of arguments than the ACQUIRE pseudo function.
  • C code segment uses the ACQUIRE pseudo function and a subsequent MODIFY pseudo function to express the removal of an element from a doubly linked list.
  • the following example x86 assembly code segment illustrates the code produced from compiling the above ‘C’ code through a ‘C’ compiler with the ACQUIRE and MODIFY pseudo function built-in compilation templates.
  • the ACQUIRE pseudo function decorates the argument processing code with a LOCK prefix on the appropriate memory reference instruction(s), and then inserts the ACQUIRE instruction into the instruction stream. Since the ACQUIRE pseudo function is used in a conditional context, the compiler inserts the necessarily subsequent conditional jump following the ACQUIRE instruction. Similarly, the MODIFY pseudo function compiles its argument list and then inserts the RELEASE instruction into the code stream.
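  • Because the ACQUIRE and MODIFY pseudo functions are not available in an ordinary 'C' compiler, the shape of such a critical section can only be approximated here. The sketch below is not the listing referred to above; `acquire_stub` and `release_stub` are hypothetical stand-ins that always pass, so the code illustrates only the control flow (conditional use of the result, modification, then release) for removing an element from a doubly linked list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node *prev;
    struct node *next;
} node_t;

/* Stand-in for the ACQUIRE pseudo function: in this single-threaded
 * sketch it always returns the passing count of zero. */
static long acquire_stub(const void *a, const void *b, const void *c)
{
    (void)a; (void)b; (void)c;
    return 0;
}

/* Stand-in for the point where a RELEASE instruction would be inserted. */
static void release_stub(void)
{
}

/* Unlink elem from its list. Returns false if the acquire step
 * reports interference, in which case the caller may retry. */
static bool list_remove(node_t *elem)
{
    assert(elem != NULL && elem->prev != NULL && elem->next != NULL);
    node_t *before = elem->prev;
    node_t *after  = elem->next;

    /* The result is used in a conditional context, as required. */
    if (acquire_stub(&before->next, &after->prev, elem) != 0)
        return false;

    /* Modification phase: the three locations named above. */
    before->next = after;
    after->prev  = before;
    elem->prev   = NULL;
    elem->next   = NULL;

    release_stub();
    return true;
}
```

  • With real hardware support, the three arguments would be compiled into LOCK-prefixed memory references followed by the ACQUIRE instruction and a conditional jump, as described above.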
  • FIG. 5 is a diagram that illustrates the creation of a critical code section from a high-level language code segment using an ACQUIRE pseudo function call.
  • the critical code section 610 may be created by a compiler 600 from the critical section code 605 of high-level code segment 601 .
  • the ACQUIRE pseudo function directs the compiler to perform the argument list computation with LOCK-based memory reference instructions (e.g., LOCK MOV) and to insert the ACQUIRE instruction into the instruction stream.
  • the “interference” variable is used to cause the ‘for’ loop to search the list again if any interference was detected. At this point, the ‘for’ loop cannot depend upon running off the end of the list (meaning that there are no items on the list the dequeue operation wants).
  • the ‘count’ variable may be a direct indication of the amount of interference. Software may use this count value to find a different element and thereby possibly obviate interference the next time around.
  • the MODIFY pseudo functions are not used, but the RELEASE pseudo function is used instead.
  • the RELEASE pseudo function takes no arguments and returns no result.
  • the RELEASE pseudo function directs the compiler to insert the RELEASE instruction into the instruction stream directly.
  • C code segment uses the ACQUIRE pseudo function and a subsequent RELEASE pseudo function to express the removal of an element from a doubly linked list.
  • compiler 600 may include instructions that may be executed on any of the processing nodes 312 (and processor cores 18 ) in the computer system shown in FIG. 1 and FIG. 2 .
  • the code sequences (e.g., 610 ) generated by compiler 600 may also be executed on any of the processing nodes 312 (and processor cores 18 ).
  • each stand-alone processor may include all or part of the above described hardware and may be capable of executing the instructions that are part of the advanced synchronization facility.
  • processor and processor core may be used somewhat synonymously, except when specifically enumerated to be different.
  • a computer accessible/readable medium may include any media accessible by a computer during use to provide instructions and/or data to the computer.
  • a computer accessible medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.

Abstract

A method for creating critical section code using a software wrapper includes creating a high-level expression for requesting exclusive access to one or more memory resource addresses. The high-level expression may include an ACQUIRE function call that includes one or more arguments associated with the one or more memory resource addresses. The method may also include creating a low-level set of instructions by compiling the high-level expression into native instructions. The low-level set of instructions includes a specification phase for requesting exclusive access to the one or more memory resource addresses. In response to compiling the ACQUIRE function call, the method includes creating the specification phase of code, which may include generating an instruction stream having LOCK-based memory reference instructions having a LOCK prefix on the one or more arguments, and inserting an ACQUIRE instruction into the instruction stream.

Description

  • This application claims the benefit of U.S. Provisional Application No. 60/710,548, filed on Aug. 23, 2005.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • This invention relates to microprocessors and, more particularly, to process synchronization between processors in a multiprocessor system.
  • 2. Description of the Related Art
  • Modern microprocessor performance has increased steadily and somewhat dramatically over the past 10 years or so. To a large degree, the performance gains may be attributed to increased operating frequency and, moreover, to a technique known as deep pipelining. Generally speaking, deep pipelining refers to using instruction pipelines with many stages, with each stage doing less, thereby enabling the overall pipeline to execute at a faster rate. This technique has served the industry well. However, there are drawbacks to increased frequency and deep pipelining. For example, clock skew and power consumption can be significant during high frequency operation. As such, the physical constraints imposed by system level thermal budget points, and the increased difficulty in managing clock skew may indicate that practical limits of the technique may be just around the corner. Thus, industry has sought to increase performance using other techniques. One type of technique to increase performance is the use of multiple core processors and more generally multiprocessing.
  • As computing systems employ multiprocessing schemes with more and more processors (e.g., processing cores), the number of requestors that may interfere or contend for the same memory datum may increase to such an extent that conventional methods of process synchronization may be inadequate. For example, when a low number of processors are contending for a resource, simply locking structures may provide adequate performance to critical sections of code. For example, locked arithmetic operations on memory locations may be sufficient. As the scale of multiprocessing grows, these primitives become less and less efficient. To that end, more advanced processors include additions to the instruction set that include hardware synchronization primitives (e.g., CMPXCHG, CMPXCHG8B, and CMPXCHG16B) that are based on atomically updating a single memory location. However, we are now entering the realm in which even these hardware primitives may not provide the kind of performance that may be demanded in high-performance, high processor count multiprocessors.
  • Many conventional processors use synchronization techniques based on an optimistic model. That is, when operating in a multiprocessor environment, these conventional processors are designed to operate under the assumption that they can achieve synchronization by repeatedly rerunning the synchronization code until no interference is detected, and then declare that synchronization has been achieved. This type of synchronization may incur an undesirable waste of time, particularly when many processors are attempting the same synchronizing event, since no more than one processor can make forward progress at any instant in time. As such, different synchronization techniques may be desirable.
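  • The optimistic model described above is commonly expressed as a compare-and-swap retry loop. The following sketch uses standard C11 atomics; the counter and the function name are hypothetical and serve only to illustrate the retry pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared counter; the name is illustrative only. */
static _Atomic long shared_total = 0;

/* Optimistic update: read the value, compute the new one, then try
 * to publish it with compare-and-swap. If another thread changed the
 * value in between, the exchange fails, `seen` is refreshed with the
 * current value, and the loop reruns. Returns the attempts needed. */
static unsigned add_optimistic(_Atomic long *target, long delta)
{
    assert(target != NULL);
    unsigned attempts = 0;
    long seen = atomic_load(target);
    for (;;) {
        attempts++;
        long wanted = seen + delta;
        if (atomic_compare_exchange_weak(target, &seen, wanted))
            return attempts;
    }
}
```

  • Under heavy contention the attempt count grows for every thread except the one that wins each round, which is the wasted work the passage refers to.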
  • SUMMARY
  • Various embodiments of a method for creating critical section code using a software wrapper for proactive synchronization in a computer system are disclosed. Broadly speaking, a method is contemplated in which the method may include creating a high-level set of pseudo functions for requesting exclusive access to one or more memory resource addresses and then manipulating those requested memory resources under an illusion of atomicity. The pseudo functions allow an application programmer to construct and reason about the critical section at a high level, while a high-level language compiler always translates these pseudo functions into inline native instructions.
  • In one embodiment, the method may include creating a high-level expression for requesting exclusive access to one or more memory resource addresses. The high-level expression may include an ACQUIRE pseudo function call that includes one or more arguments associated with the one or more memory resource addresses. The method may also include creating a low-level set of instructions by compiling the high-level expression. The low-level set of instructions includes a specification phase for requesting exclusive access to the one or more memory resource addresses. In response to compiling the ACQUIRE pseudo function call, the method may include creating the specification phase of code, which may include generating an instruction stream having LOCK-based memory instructions having a LOCK prefix based on computations performed on the one or more arguments. In addition, creating the specification phase of code may include inserting an ACQUIRE instruction into the instruction stream.
  • In one specific implementation, the LOCK-based memory instructions may include LOCK MOV instructions for loading data from one or more memory locations to one or more respective registers. In addition, the LOCK-based memory instructions may include a LOCK PREFETCHW instruction.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of one embodiment of a computer system.
  • FIG. 2 is a block diagram depicting further details of an embodiment of a processing node of FIG. 1.
  • FIG. 3 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2.
  • FIG. 4 is a flow diagram that describes operation of one embodiment of the computer system shown in FIG. 1 and FIG. 2 in response to receiving a coherency invalidation probe.
  • FIG. 5 is a diagram illustrating the creation of an embodiment of a critical code section using a software wrapper pseudo function call in a high-level language.
  • While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. It is noted that the word “may” is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not a mandatory sense (i.e., must).
  • DETAILED DESCRIPTION
  • To enable the construction of high performance synchronization methods in software, a set of instructions, which may be referred to as an advanced synchronization facility, may be used. The facility may support the construction of non-Blocking synchronization, WaitFree synchronization, and Transactional Memory, along with the construction of various forms of Compare and Swap primitives typically used in the construction of these methods. The facility allows construction (in software) of a large variety of synchronization primitives.
  • Moreover, the advanced synchronization facility may enable software to program a large variety of synchronization kinds. Each synchronization kind may directly specify: the cache lines needed for successful completion, a sequence point where failures can redirect control flow, a data modification section where the result of the successful critical section is performed, and a sequence point where success is made visible to the rest of the system making the whole sequence of instructions appear to be atomic.
  • Accordingly, the functionality of the advanced synchronization facility may enable the acquisition and release of multiple cache lines with write permission associated with a critical section substantially simultaneously as seen by other processors/cores. This process may be referred to as Linearizing. After acquisition, several modifications can be performed before any other interested party may observe any of the modifications to any of the specified multiple cache lines. Between the acquisition and the release, no other processors are allowed to be manipulating these same lines (e.g. have write permission). A similar method could have been performed by not sending HyperTransport™ Source Done messages for the associated lines and thereby preventing concurrent accesses. However, these solutions lead to deadlock and/or livelock, or timeouts. Thus, a computer system including processors and processor cores that may implement the advanced synchronization facility is described below.
  • Turning now to FIG. 1, an embodiment of a computer system 100 is shown. Computer system 100 includes several processing nodes 312A, 312B, 312C, and 312D. Each of processing nodes 312A-312D is coupled to a respective memory 314A-314D via a memory controller 316A-316D included within each respective processing node 312A-312D. Additionally, processing nodes 312A-312D include interface logic (IF) used to communicate between the processing nodes 312A-312D. For example, processing node 312A includes interface logic 318A for communicating with processing node 312B, interface logic 318B for communicating with processing node 312C, and a third interface logic 318C for communicating with yet another processing node (not shown). Similarly, processing node 312B includes interface logic 318D, 318E, and 318F; processing node 312C includes interface logic 318G, 318H, and 318I; and processing node 312D includes interface logic 318J, 318K, and 318L. Processing node 312D is coupled to communicate with a plurality of input/output devices (e.g. devices 320A-320B in a daisy chain configuration) via interface logic 318L. Other processing nodes may communicate with other I/O devices in a similar fashion. Processors may use this interface to access the memories associated with other processors in the system. It is noted that a component that includes a reference numeral followed by a letter may be generally referred to solely by the numeral where appropriate. For example, when referring generally to the processing nodes, processing node(s) 312 may be used.
  • Processing nodes 312 implement a packet-based link for inter-processing node communication. In the illustrated embodiment, the link is implemented as sets of unidirectional lines (e.g. lines 324A are used to transmit packets from processing node 312A to processing node 312B and lines 324B are used to transmit packets from processing node 312B to processing node 312A). Other sets of lines 324C-324H are used to transmit packets between other processing nodes as illustrated in FIG. 1. Generally, each set of lines 324 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed. The link may be operated in a cache coherent fashion for communication between processing nodes or in a non-coherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction such as the PCI bus or ISA bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown (e.g., 320A and 320B). It is noted that in an exemplary embodiment, the link may be implemented as a coherent HyperTransport™ link or a non-coherent HyperTransport™ link, although in other embodiments, other links are possible.
  • I/O devices 320A-320B may be any suitable I/O devices. For example, I/O devices 320A-320B may include devices for communicating with another computer system to which the devices may be coupled (e.g. network interface cards or modems). Furthermore, I/O devices 320A-320B may include video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters and telephony cards, sound cards, and a variety of data acquisition cards such as GPIB or field bus interface cards. It is noted that the term “I/O device” and the term “peripheral device” are intended to be synonymous herein.
  • Memories 314A-314D may comprise any suitable memory devices. For example, a memory 314A-314D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), DDR SDRAM, static RAM, etc. The memory address space of computer system 100 is divided among memories 314A-314D. Each processing node 312A-312D may include a memory map used to determine which addresses are mapped to which memories 314A-314D, and hence to which processing node 312A-312D a memory request for a particular address should be routed. Memory controllers 316A-316D may comprise control circuitry for interfacing to memories 314A-314D. Additionally, memory controllers 316A-316D may include request queues for queuing memory requests. Memories 314A-314D may store code executable by the processors to implement the functionality as described in the preceding sections.
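  • The memory-map lookup described above can be sketched as follows. This is only an illustration in Python, not the disclosed implementation: the 1 GiB-per-node split, the `MEMORY_MAP` table, and the `home_node` helper are hypothetical names and values chosen for the example.

```python
from bisect import bisect_right

# Hypothetical memory map: (base address, owning node). The 1 GiB-per-node
# split is an assumption made for illustration only.
GIB = 1 << 30
MEMORY_MAP = [
    (0 * GIB, "312A"),
    (1 * GIB, "312B"),
    (2 * GIB, "312C"),
    (3 * GIB, "312D"),
]
TOP_OF_MEMORY = 4 * GIB
_BASES = [base for base, _ in MEMORY_MAP]


def home_node(physical_address: int) -> str:
    """Return the node whose memory controller owns the address."""
    if not 0 <= physical_address < TOP_OF_MEMORY:
        raise ValueError("address outside the mapped range")
    # Find the last range whose base is <= the address.
    index = bisect_right(_BASES, physical_address) - 1
    return MEMORY_MAP[index][1]
```

  • A request for an address owned by another node would then be routed over the inter-node links toward the node that `home_node` returns.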
  • It is noted that a packet to be transmitted from one processing node to another may pass through one or more intermediate nodes. For example, a packet transmitted by processing node 312A to processing node 312D may pass through either processing node 312B or processing node 312C as shown in FIG. 1. Any suitable routing algorithm may be used. Other embodiments of computer system 100 may include more or fewer processing nodes than the embodiment shown in FIG. 1. Generally, the packets may be transmitted as one or more bit times on the lines 324 between nodes. A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines. The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.
  • In one embodiment, processing nodes 312 may additionally include one or more processor cores (shown in FIG. 2). It is noted the processor cores within each node may communicate via internal packet-based links operated in the cache coherent fashion. It is further noted that processor cores and processing nodes 312 may be configured to share any (or all) of the memories 314.
  • In one embodiment, one or more of the processor cores may implement the x86 architecture, although other architectures are possible and contemplated. As such, instruction decoder logic within each of the various processor cores may be configured to mark instructions that use a LOCK prefix. In addition, as described further below, processor core logic may include hardware (shown in FIG. 2) that may enable identification of the markers associated with LOCKed instructions. This hardware may enable the use of the LOCK instruction prefix to identify critical sections of code as part of the advanced synchronization facility.
  • To reduce the effects of interference caused by more than one processor attempting to access the same memory reference (e.g., critical sections of code) at the same time, the advanced synchronization facility and associated hardware may be implemented within computer system 100. As will be described in greater detail below, the advanced synchronization facility may employ new instructions and use hardware such as a synchronization arbiter (shown in FIG. 2) which may be interconnected within the cache coherent fabric. As shown in FIG. 2, synchronization arbiter 230 is coupled to a Northbridge unit 290 of any processing node 312, thus enabling the synchronization arbiter to observe explicit addresses associated with the Advanced Synchronization Facility transactions of each node. The synchronization arbiter may be placed anywhere in the coherent domain of the interconnect network. It is noted that although one synchronization arbiter is shown, it is contemplated that when a system is configured to support multiple virtual machines, and when these virtual machines do not share any actual physical memory, multiple synchronization arbiters can be configured to distribute the synchronization load across several arbiters.
  • It is noted that the phrase “critical section” is used throughout this document. A “critical section” refers to a section of code used in the advanced synchronization facility that may include one or more memory reference instructions marked with a LOCK prefix, an ACQUIRE instruction, and a RELEASE instruction which ends the critical section. In one embodiment, there are four stages of each critical section: 1) specifying the address(es) of the cache line(s) needed during the critical section (e.g., entering the critical section), 2) going through the mechanics to acquire these cache lines, 3) atomically modifying the critical section data, 4) releasing the cache lines back to the system. In particular, the critical section code will appear to be executed atomically by interested observers. The first phase may be referred to as the specification phase, while the third phase is often referred to as the atomic phase.
  • In various implementations, software may be allowed to perform ‘simple’ arithmetic and logical manipulations on the data between reading and modifying the data of the critical section as long as the simple arithmetic operations do not cause exceptions when executed. If a data manipulation causes an exception inside a critical section, atomicity of that critical section may not be guaranteed. Critical section software should detect failures of atomicity, and deal with them appropriately, as described further below.
  • Generally, the advanced synchronization facility may utilize a weakened memory model and operate only upon cacheable data. This weakened memory model may prevent the advanced synchronization facility from wasting processor cycles waiting for various processor and memory buffers to empty before performing a critical section. However, when software requires a standard PC strong memory model, software may insert LFENCE, SFENCE, or MFENCE instructions just prior to the RELEASE instruction to guarantee standard PC memory ordering. For the case of using cacheable synchronization to enable accesses to unCacheable data, an SFENCE instruction between the last LOCKed Store and the RELEASE instruction will guarantee that the unCacheable data is globally visible before the cacheable synchronization data is globally visible in any other processor. This may enable maximum overlap of unCacheable and Cacheable accesses with minimal performance degradation.
  • In various embodiments, interface logic 318A-318L may comprise a variety of buffers for receiving packets from the link and for buffering packets to be transmitted upon the link. Computer system 100 may employ any suitable flow control mechanism for transmitting packets. In addition to interface logic 318A-318L, each processing node may include respective bus interface units (BIU) 220 (shown in FIG. 2), which may provide functionality to enable proactive synchronization. For example, as described further below, BIU 220 may be configured to gather those special addresses that are associated with an Advanced Synchronization event and to transmit those addresses to synchronization arbiter 230 in response to execution of an ACQUIRE instruction. The BIU 220 may also be configured to determine if the response received from synchronization arbiter 230 indicates the addresses may be interference free. If the response indicates the addresses may not be interference free, BIU 220 may notify the requesting processor core of the failure by sending a failure count value to a register within processor core 18 and sending a completion message to synchronization arbiter 230. If the addresses are guaranteed to be interference free, BIU 220 may allow the critical section to execute and wait to send the completion message to synchronization arbiter 230.
  • FIG. 2 is a block diagram that illustrates more detailed aspects of embodiments of processing node 312A and synchronization arbiter 230 of FIG. 1. Referring to FIG. 2, processing node 312A includes processor cores 18A and 18 n, where n may represent any number of processor cores. Since the processor cores may be substantially the same in various embodiments, only detailed aspects of processor core 18A are described below. As shown, processor cores 18A and 18 n are coupled to bus interface unit 220 which is coupled to a Northbridge unit 290, which is coupled to memory controller 316A, HyperTransport™ interface logic 318A-318C, and to synchronization arbiter 230 via a pair of unidirectional links 324I-324J.
  • Processor core 18A includes hardware configured to execute instructions. More particularly, as is typical of many processors, processor core 18A includes one or more instruction execution pipelines including a number of pipeline stages, cache storage and control, and an address translation mechanism (only pertinent portions of which are shown for brevity). Accordingly, as shown processor core 18A includes a level one (L1) instruction cache, prefetch logic, and branch prediction logic. Since these blocks may be closely coupled with the instruction cache, they are shown together as block 250. Processor core 18A also includes an L1 data cache 207. Processor core 18A also includes instruction decoder 255 and an instruction dispatch and control unit 256, which may be coupled to receive instructions from instruction decoder 255 and to dispatch operations to a scheduler 259. Further, instruction dispatch and control unit 256 may be coupled to a microcode read-only memory (MROM) (not shown). Scheduler 259 is coupled to receive dispatched operations from instruction dispatch and control unit 256 and to issue operations to execution units 260. In various implementations, execution units 260 may include any number of integer execution units and floating-point units. Further, processor core 18A includes a TLB 206 and a load/store unit 270. It is noted that in alternative embodiments, an on-chip L2 cache may be present (although not shown).
  • Instruction decoder 255 may be configured to decode instructions into operations which may be either directly decoded or indirectly decoded using operations stored within the MROM. Instruction decoder 255 may decode certain instructions into operations executable within execution units 260. Simple instructions may correspond to a single operation, while in other embodiments, more complex instructions may correspond to multiple operations. In one embodiment, instruction decoder 255 may include multiple decoders (not shown) for simultaneous decoding of instructions. Each instruction may be aligned and decoded into a set of control values in multiple stages depending on whether the instructions are first routed to MROM. These control values may be routed in an instruction stream to instruction dispatch and control unit 256 along with operand address information and displacement or immediate data which may be included with the instruction. As described further below, when a memory reference instruction includes a LOCK prefix, instruction decoder 255 may identify the address with a marker.
  • Load/store unit 270 may be configured to provide an interface between execution units 260 and data cache 207. In one embodiment, load/store unit 270 may include load/store buffers with several storage locations for data and address information for pending loads or stores. As such, the illustrated embodiment includes LS1 205, linear LS2 209, physical LS2 210, and data storage 211. Further, processor core 18A includes marker logic 208, and a marker bit 213.
  • In one embodiment, a critical section may be processed in one of two ways: deterministically or optimistically. The choice of execution may be based upon the configuration of the advanced synchronization facility and upon the state of a critical section predictor, as described in greater detail below. In various embodiments, either the basic input output system (BIOS), the operating system (OS), or a virtual memory manager (VMM) may configure the operational mode of the advanced synchronization facility. When operating in the deterministic execution mode, the addresses specified by the locked memory reference instructions may be bundled up and sent en masse to the synchronization arbiter 230 to be examined for interference. The cache line data may be obtained and the critical section executed (as permitted). In contrast, when operating in the optimistic synchronization mode, no interference may be assumed, and the critical section may be executed (bypassing the synchronization arbiter 230). If any other processor interferes with this critical section, the interference will be detected, and the processor then backs up to the ACQUIRE instruction and redirects control flow away from the atomic phase.
  • To implement the deterministic mode, the advanced synchronization facility may use the synchronization arbiter 230. As described above, synchronization arbiter 230 examines all of the physical addresses associated with a synchronization request and either passes (a.k.a. blesses) the set of addresses or fails (i.e., rejects) the set of addresses, based upon whether any other processor core or requester is operating on or has requested those addresses while they are being operated on. As such, synchronization arbiter 230 may allow software to be constructed that proactively avoids interference. When interference is detected by synchronization arbiter 230, synchronization arbiter 230 may respond to a request with a failure status including a unique number (e.g., count value 233) to a requesting processor core. In one embodiment, the count may be indicative of the number of requestors contending for the memory resource(s) being requested. Software may use this number to proactively avoid interference in subsequent trips through the critical section by using this number to choose a different resource upon which to attempt a critical section access.
  • Accordingly, as shown in FIG. 2, synchronization arbiter 230 includes a storage 232 including a number of entries. Each of the entries may store one or more physical addresses of requests currently being operated on. In one embodiment, each entry may store up to eight physical addresses that are transported as a single 64-byte request. In addition, the synchronization arbiter entry includes the count value 233, which corresponds to all the addresses in the entry. As described above, the count value may be indicative of the number of requesters (i.e., interferers) that are contending for any of the addresses in a critical section. When synchronization arbiter 230 receives a set of addresses, a compare unit 231 within synchronization arbiter 230 checks for a match between each address in the set and all the addresses in storage 232. If there is no match, synchronization arbiter 230 may be configured to issue a pass response by returning a passing count value and to store the addresses within storage 232. In one embodiment, the passing count value is zero, although any suitable count value may be used. However, if there is an address match, synchronization arbiter 230 may increment the count value 233 associated with the set of addresses that includes the matching address, and then return that count value as part of a failure response. It is noted that compare unit 231 may be a compare only structure implemented in a variety of ways, as desired. In addition, in another embodiment, each address stored within storage 232 may be associated with a respective count. As such, the count value may be indicative of the number of requestors (i.e., interferers) that are contending for one of the respective addresses in a critical section.
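  • The pass/fail behavior of the arbiter described above can be modeled in software as follows. This is a minimal Python sketch under stated assumptions, not the hardware design: the class and method names (`SyncArbiter`, `request`, `release`) are hypothetical, the per-entry count variant is the one modeled, and the `release` step stands in for the completion message mentioned elsewhere in this description.

```python
from dataclasses import dataclass, field

MAX_ADDRESSES_PER_ENTRY = 8  # one embodiment: eight addresses per request
PASSING_COUNT = 0            # one embodiment: a pass is reported as zero


@dataclass
class Entry:
    addresses: frozenset  # physical addresses of one in-flight request
    count: int = 0        # interferers observed against this entry


@dataclass
class SyncArbiter:
    entries: list = field(default_factory=list)

    def request(self, addresses):
        """Return PASSING_COUNT on a pass, or a positive count on a failure."""
        wanted = frozenset(addresses)
        if len(wanted) > MAX_ADDRESSES_PER_ENTRY:
            raise ValueError("too many addresses for one request")
        for entry in self.entries:
            if entry.addresses & wanted:  # at least one address matches
                entry.count += 1
                return entry.count
        # No match: record the set so later requests can be compared to it.
        self.entries.append(Entry(wanted))
        return PASSING_COUNT

    def release(self, addresses):
        """Drop the entry for a completed critical section (assumed step)."""
        done = frozenset(addresses)
        self.entries = [e for e in self.entries if e.addresses != done]
```

  • In this model a failed request is not stored, so a requester that receives a nonzero count has to submit its address set again later.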
  • In the illustrated embodiment, bus interface unit (BIU) 220 includes a count compare circuit 221, a locked line buffer (LLB) 222, and a predictor 223. BIU 220 may also include various other circuits for transmitting and receiving transactions from the various components to which it is connected; however, these have been omitted for clarity. As such, BIU 220 may be configured to transmit a set of addresses associated with a critical section from LLB 222 to synchronization arbiter 230 in response to the execution of an ACQUIRE instruction. In addition, compare circuit 221 may be configured to compare the count value returned by synchronization arbiter 230 to check if the count is a passing count value (e.g., zero) or a failing count value. It is noted that LLB 222 may be implemented using any type of storage structure. For example, it may be part of an existing memory address buffer (MAB) or separate, as desired.
  • As described above, if processor core 18 is operating in the deterministic synchronization mode, addresses associated with a critical section may be marked during instruction decode by using the LOCK prefix. More particularly, memory references that explicitly participate in advanced synchronization code sequences are annotated by using the LOCK prefix with an appropriate MOV instruction. LOCKed Load instructions may have the following form:
  • LOCK MOVx reg,[B+I*s+DISP].
  • More particularly, a regular x86 memory read instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206). In addition, memory access strength is reduced to access the line (in the case of a cache miss) without write permission (ReadS, not ReadM or Read). The Load instruction may not be retired out of LS2 until the ACQUIRE instruction returns from the synchronization arbiter 230.
  • While the request from BIU 220 (to synchronization arbiter 230) is awaiting a response, the LLB 222 watches for Probes with INValidate semantics, and if one (or more) occurs, the ACQUIRE instruction will be made to fail, even if synchronization arbiter 230 returns a success. The LOCK prefix does not cause any particular locking of the cache or bus, but simply provides a convenient marker to be added to memory based MOVe instructions. As such, LOCKed MOV to register instructions (which may be otherwise referred to as LOCKed Loads) may be processed normally down the data cache pipeline.
  • Accordingly, during address translation each linear address may be stored within linear address portion of LS2 209. The corresponding physical addresses may be stored in TLB 206 and within physical LS2 210, while the corresponding data may be stored within data cache 207 and data LS2 211. Marker logic 208 may detect the LOCK prefix marker generated during decode and generate an additional marker bit 213, thereby marking each such address as a participant in a critical section. Any LOCKed Load that takes a miss in the data cache may have its cache line data fetched through the memory hierarchy with Read-to-Share access semantics, however write permission to that specified memory resource is checked.
  • As described above, if processor core 18 is operating in a deterministic synchronization mode, addresses associated with a critical section may be marked during instruction decode by using the LOCK prefix. More particularly, memory prefetch references that explicitly participate in advanced synchronization code sequences are annotated by using the LOCK prefix with an appropriate PREFETCHW instruction. These types of LOCKed Load instructions may have the following form:
  • LOCK PREFETCHW [B+I*s+DISP].
  • Thus, a regular memory PREFETCHW instruction is made special by attaching a LOCK prefix. This causes the BIU 220 to gather the associated marked physical address into the LLB 222 as the address passes through the L1 cache (and TLB 206). In addition, memory access strength is reduced to avoid an actual DRAM access for the line. The PREFETCHW instruction may not be retired out of LS2 until the ACQUIRE instruction returns from synchronization arbiter 230. These instructions may be used to touch cache lines that participate in the critical section and that need data (e.g., a pointer) in order to touch other data also needed in the critical section. At the conclusion of the specification phase, an ACQUIRE instruction is used to notify BIU 220 that all memory reference addresses for the critical section are stored in LLB 222.
  • The ACQUIRE instruction may have the form
  • ACQUIRE reg, imm8
  • The ACQUIRE instruction checks that the number of LOCKed memory reference instructions is equal to the immediate value in the ACQUIRE instruction. If this check fails, the ACQUIRE instruction terminates with a failure code, otherwise, the ACQUIRE instruction causes BIU 220 to send all addresses stored within LLB 222 to the synchronization arbiter 230. This instruction ‘looks’ like a memory reference instruction on the data path so that the count value returned from the synchronization arbiter 230 can be used to confirm (or deny) that all the lines can be accessed without interference. No address is necessary for this ‘load’ instruction because there can be only one synchronization arbiter 230 per virtual machine or per system. The register specified in the ACQUIRE instruction is the destination register of processor core 18.
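  • The check performed by the ACQUIRE instruction can be sketched as a small Python function. This is an illustration only: the text does not specify the failure code, so `FAIL_COUNT_MISMATCH` is a hypothetical value, and `arbiter_request` is a hypothetical callable standing in for the path from BIU 220 to synchronization arbiter 230.

```python
FAIL_COUNT_MISMATCH = -2  # hypothetical failure code; not given in the text


def acquire(llb_addresses, imm8, arbiter_request):
    """Model of ACQUIRE reg, imm8; the return value stands in for 'reg'.

    llb_addresses   -- addresses gathered from the LOCKed memory references
    imm8            -- number of LOCKed references the software expects
    arbiter_request -- callable that takes the address list, returns a count
    """
    # The number of LOCKed references must equal the immediate value.
    if len(llb_addresses) != imm8:
        return FAIL_COUNT_MISMATCH
    # Otherwise send every gathered address to the arbiter in one request
    # and hand its count back to the destination register.
    return arbiter_request(list(llb_addresses))
```

  • No address operand appears in the model because, as noted above, there is only one arbiter to talk to per virtual machine or per system.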
  • In one embodiment, the semantics of a LOCKed Load operation may include probe monitoring the location for a PROBE. If a PROBE is detected for a location, the LS1 or LS2 queue may return a failure status without waiting for the read to complete. A general-purpose fault (#GP) may be generated if the number of LOCKed loads exceeds a micro-architectural limit. If an ACQUIRE instruction fails, the count of LOCKed loads will be reset to zero. If the address is not to a Write Back memory type, the ACQUIRE instruction can be made to fail (when subsequently encountered).
  • It is expected that some critical sections may include a number of arithmetic and control flow decisions to compute what data modifications may be appropriate (if any). However, software should arrange that these types of instructions never cause an actual exception. In one embodiment, arithmetic and memory reference instructions may be processed in either the SSE registers (XMM), or in the general-purpose registers (e.g., EAX, etc), or in the MMX or x87 registers.
  • As described above, synchronization arbiter 230 may either pass the request en masse or fail the request en masse. If synchronization arbiter 230 fails the request, the response back to BIU 220 may be referred to as a “synchronization arbiter Fail-to-ACQUIRE” with the zero bit set (e.g., RFLAGS.ZF). As described above, the response returned by synchronization arbiter 230 may include the count value 233, which may be indicative of the number of interferers. Software may use this count to reduce future interference as described above. The count value 233 from the synchronization arbiter 230 may be delivered to a general-purpose register (not shown) within processor core 18 and may also be used to set condition codes. If the synchronization arbiter 230 passes the request, the response back to BIU 220 may include a pass count value (e.g., zero).
  • In one embodiment, if the synchronization arbiter address storage 232 is full, the request may be returned with a negative count value such as minus one (−1), for example. This may provide software running on the processor core a means to see an overload in the system and to enable that software to stop making requests to synchronization arbiter 230 for a while. For example, the software may schedule something else or it may simply waste some time before retrying the synchronization attempt.
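  • One way software might act on the returned count is sketched below in Python. The policy is an assumption made for illustration: the text only says that software may use the count to choose a different resource and may pause when the count is negative. The function name `next_action`, the modular stepping rule, and the delay range are all hypothetical.

```python
import random


def next_action(count, num_resources, current):
    """Decide what to do with the count returned by ACQUIRE.

    Returns ("enter", index), ("retry", index) or ("backoff", seconds).
    """
    if count == 0:
        # No interferers observed: run the atomic phase on this resource.
        return ("enter", current)
    if count < 0:
        # Arbiter storage is full: stop making requests for a while.
        return ("backoff", random.uniform(0.001, 0.01))
    # Positive count: step past the contended resource by 'count' slots so
    # requesters that were handed different counts pick different resources.
    return ("retry", (current + count) % num_resources)
```

  • For example, with eight queue heads to choose from, a requester on head 3 that receives a count of 2 would retry on head 5.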
  • If the count is zero (meaning there are no interferers observed by synchronization arbiter 230), processor core 18 may execute the instructions in the atomic phase and manipulate the data in the cache lines as desired. When the data manipulation is complete, a RELEASE instruction is executed signifying the end of the critical section. In one embodiment, the RELEASE instruction enables all of the modified data to become visible substantially simultaneously by sending the RELEASE message to synchronization arbiter 230, thereby releasing the associated cache lines back to the system.
  • In one embodiment, the advanced synchronization facility supports two kinds of failures, a “Fail-to-ACQUIRE” and a “Fail-to-REQUESTOR”. The Fail-to-ACQUIRE failure causes the ACQUIRE instruction to complete with the zero bit set (e.g., RFLAGS.ZF) so that the subsequent conditional jump instruction can redirect control flow away from damage inducing instructions in the atomic phase. The synchronization arbiter Fail-to-ACQUIRE with the zero bit set (e.g., RFLAGS.ZF) is one type of Fail-to-ACQUIRE failure. A processor Fail-to-ACQUIRE is another type. In one embodiment, during execution of critical sections, processor cores may communicate by observing memory transactions. These observations may be made visible at the ACQUIRE instruction of an executing processor core. More particularly, during the time between the start of collecting of the addresses necessary for a critical section and the response of synchronization arbiter 230, processor core 18 monitors all of those addresses for coherent invalidation probes (e.g., Probe with INValidate). If any of the lines are invalidated, the response from synchronization arbiter 230 may be ignored and the ACQUIRE instruction may be made to fail with the zero bit set (e.g., RFLAGS.ZF).
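  • The two kinds of Fail-to-ACQUIRE described above can be combined into one decision, sketched here in Python. The function and parameter names are hypothetical, and the passing count of zero follows the embodiment described earlier.

```python
PASSING_COUNT = 0  # one embodiment: the arbiter reports a pass as zero


def acquire_fails(arbiter_count, watched_lines, invalidated_lines):
    """Return True when ACQUIRE should complete with the zero flag set."""
    # Processor Fail-to-ACQUIRE: a watched line was hit by an invalidating
    # probe while the arbiter request was outstanding, so the arbiter's
    # answer is ignored.
    if set(watched_lines) & set(invalidated_lines):
        return True
    # Synchronization arbiter Fail-to-ACQUIRE: any non-passing count.
    return arbiter_count != PASSING_COUNT
```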
  • The Fail-to-REQUESTOR failure may be sent as a PROBE response if there is a cache hit on a line that has been checked for interference and passed by synchronization arbiter 230. A Fail-to-REQUESTOR response causes the requesting processor to Fail-to-ACQUIRE if it is currently processing an advanced synchronization facility critical section, or it will cause the requesting processor's BIU to re-request that memory request if it is not processing the critical section. As such, BIU 220 may be configured to cause a Fail-to-ACQUIRE in response to receiving a Probe with INValidate prior to obtaining a pass notification from synchronization arbiter 230.
  • Once the addresses of the critical section have been acquired, a processor core 18 that has had its addresses passed by synchronization arbiter 230 may obtain each cache line for exclusive access (e.g., write permission) as memory reference instructions are processed in the atomic phase. After a passed cache line arrives, processor core 18 may hold onto that cache line and prevent other processor cores from stealing the line by responding to coherent invalidation probes with Fail-to-REQUESTOR responses. It is noted that Fail-to-REQUESTOR may also be referred to as a negative-acknowledgement (NAK).
  • As described above, when a processor receives a Fail-to-REQUESTOR and it is currently participating in an advanced synchronization instruction sequence, that instruction sequence will be caused to fail at the ACQUIRE instruction. In this case, the subsequent conditional jump is taken and the damage inducing part of the memory reference instructions in the critical section may be avoided. However, when a processor receives a Fail-to-REQUESTOR and is not participating in an advanced synchronization instruction sequence, the requesting processor's BIU may just re-request the original memory transaction. Thus, the elapsed time between the sending of the Fail-to-REQUESTOR and the subsequent arrival of the next coherent invalidation probe at the passed critical section enables forward progress of the processor with the synchronization arbiter's blessing to be guaranteed. The guarantee of forward progress enables the advanced synchronization facility to be more efficient under contention than currently existing synchronization mechanisms. Accordingly, sooner or later, both the critical section and the interfering memory reference may be performed (e.g., no live-lock, nor dead-lock).
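  • The two sides of the Fail-to-REQUESTOR exchange can be summarized with the following Python sketch. The function names and the string labels (including "ACK" for an ordinary probe response) are hypothetical and serve only to make the two cases explicit.

```python
def respond_to_invalidating_probe(line, passed_lines):
    """Owner side: NAK probes that hit a line the arbiter has passed."""
    if line in passed_lines:
        return "FAIL_TO_REQUESTOR"
    return "ACK"  # hypothetical label for a normal probe response


def on_fail_to_requestor(in_critical_section):
    """Requester side: reaction to receiving a Fail-to-REQUESTOR."""
    if in_critical_section:
        # The sequence fails at ACQUIRE; the conditional jump that follows
        # steers control away from the atomic phase.
        return "FAIL_AT_ACQUIRE"
    # An ordinary memory reference is simply requested again by the BIU.
    return "REISSUE_REQUEST"
```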
  • As mentioned above, the performance of a processor participating in the Advanced Synchronization Facility may be optimized by using a critical section predictor 223. Initially predictor 223 may be set up to predict that no interference is expected during execution of a critical section. In this mode, processor core 18 may not actually use the synchronization arbiter 230. Instead processor core 18 may record the LOCKed memory references and may check these against Coherent Invalidation PROBEs to detect interference. If the end of the critical section is reached before any interference is detected, no interested third party has seen the activity of the critical section and it has been performed as if it was executed atomically. This property enables the Advanced Synchronization Facility to be processor-cycle competitive with currently existing synchronization mechanisms when no contention is observed.
  • More particularly, when interference is detected, processor core 18 may create a failure status for the ACQUIRE instruction and the subsequent conditional branch redirects the flow of control out of the critical section, and resets the predictor to predict deterministic mode. When the next critical section is detected, the decoder will then predict interference might happen, and will process the critical section using the synchronization arbiter 230 (if enabled).
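  • The mode switch described above can be pictured as a small state machine. The following is a minimal sketch only, assuming a one-bit predictor; the names cs_predictor, CS_OPTIMISTIC, CS_DETERMINISTIC, cs_use_arbiter and cs_predictor_update are hypothetical and do not appear in the disclosure. The text does not say when the predictor returns to the no-interference prediction, so the sketch leaves that transition out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of critical section predictor 223. */
typedef enum { CS_OPTIMISTIC, CS_DETERMINISTIC } cs_mode;

typedef struct { cs_mode mode; } cs_predictor;

/* Initially the predictor expects no interference. */
static void cs_predictor_init(cs_predictor *p)
{
    assert(p != NULL);
    p->mode = CS_OPTIMISTIC;
}

/* True when the next critical section should be processed through the
 * synchronization arbiter (deterministic mode, and arbiter enabled). */
static bool cs_use_arbiter(const cs_predictor *p, bool arbiter_enabled)
{
    return p->mode == CS_DETERMINISTIC && arbiter_enabled;
}

/* Called when a critical-section attempt ends. Interference resets the
 * predictor to deterministic mode. Assumption: no transition back to
 * optimistic mode is modeled, because the text does not describe one. */
static void cs_predictor_update(cs_predictor *p, bool interference_detected)
{
    if (interference_detected)
        p->mode = CS_DETERMINISTIC;
}
```

  • In this sketch a core would consult cs_use_arbiter() when a critical section is decoded and call cs_predictor_update() when the section either completes or fails at the ACQUIRE instruction.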
  • In one embodiment, the Advanced Synchronization facility may operate on misaligned data items as long as these items do not span cache lines that are not participating in the actual critical section. Software is free to have synchronization items span cache line boundaries as long as all cache lines so touched are recognized as part of the critical section entry. When a data item spans a cache line into another cache line that was not part of the synchronization communication, the processor neither detects the failure of atomicity nor signals the lack of atomicity.
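  • Because the processor does not detect a data item that spills into a non-participating cache line, software may wish to check this condition itself. The following is an illustrative sketch only: the 64-byte line size, the macro CACHE_LINE_BYTES and the helper names line_of and item_covered are assumptions made for the example, not part of the disclosure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed line size, for illustration only */

/* Cache line number that contains a given byte address. */
static uint64_t line_of(uint64_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Returns true if every cache line touched by the item [addr, addr+size)
 * is among the n line numbers declared at critical-section entry. */
static bool item_covered(uint64_t addr, size_t size,
                         const uint64_t *lines, size_t n)
{
    assert(lines != NULL || n == 0);
    if (size == 0)
        return true;
    uint64_t first = line_of(addr);
    uint64_t last  = line_of(addr + size - 1);
    for (uint64_t l = first; l <= last; l++) {
        bool found = false;
        for (size_t i = 0; i < n; i++) {
            if (lines[i] == l) {
                found = true;
                break;
            }
        }
        if (!found)
            return false;   /* item spills into a non-participating line */
    }
    return true;
}
```

  • For example, under the assumed 64-byte line size an 8-byte item at byte address 124 touches lines 1 and 2, so it is covered only if both lines were named at critical-section entry.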
  • Further, access to critical section data may be dependent upon the presence of that data in main memory. All of the lines necessary for the critical section are touched before ENTRY into the critical section, and any access rights issues or page-faulting issues may be detected when the LOCKed Load or LOCKed PREFETCHW instructions execute prior to entering the critical section. When any of the lead-in addresses take a fault, the subsequent ACQUIRE instruction is made to fail. After entry to the critical section, if any instruction causes an exception, the processor will cause a failure at the ACQUIRE instruction, and the subsequent conditional jump redirects control away from the critical section.
  • In one embodiment, if the decoder of processor core 18 must take an interrupt, it may arrange that the ACQUIRE instruction will fail with the zero bit set (e.g., RFLAGS.ZF), and take the interrupt at the ACQUIRE instruction.
  • It is noted that in embodiments in which synchronization arbiter 230 is connected within a North Bridge implementation within the HyperTransport™ fabric, synchronization arbiter 230 may be assigned a predetermined and/or reserved node ID that no other component may have. This assignment may be made at boot time by the BIOS, for example. In addition, in the above embodiments, the count value may be returned as a 64-bit value, although other values are contemplated.
  • FIG. 3 is a flow diagram describing the operation of the embodiments of the computer system shown in FIG. 1 and FIG. 2. Referring collectively to FIG. 1 through FIG. 3, and beginning in block 405, addresses of cache lines that are currently being operated on or accessed as part of a critical section are maintained in a list (e.g., within LLB 222). For example, synchronization arbiter 230 may store the addresses corresponding to a critical section, as a set, within an entry of address storage 232. In one embodiment, each entry of address storage 232 may also store a count value that is associated with the whole set of addresses stored therein (block 410). As described above, the count value may be indicative of the number of contenders (i.e., interferers) for any of the addresses in the set. In another embodiment, synchronization arbiter 230 may store a number of count values within each entry, such that each address in the entry has an associated count value.
  • When a processor or processor core that is implementing the advanced synchronization facility requests exclusive access to one or more cache lines, the request comes in the form of a critical code section. For example, as described above, to ensure completion of the instructions in an atomic manner (as viewed by all outside observers) a critical section may include the use of LOCKed MOV instructions, followed by an ACQUIRE instruction and a RELEASE instruction (block 415). Accordingly, the set of addresses that is requested is checked for interference. In one embodiment, the set of addresses is compared to all of the addresses within address storage 232 (block 420). In the embodiments described above, the LOCKed MOV instructions cause the addresses to be marked. The marker causes BIU 220 to store each marked address in LLB 222. The ACQUIRE instruction causes BIU 220 to send the entire set of addresses in LLB 222 to synchronization arbiter 230 in the form of an unCacheable write that carries 64 bytes of physical address data. Synchronization arbiter 230 compares the set of addresses to all the addresses in the storage 232.
  • If there is a match on any address (block 425), the count value associated with the matching address is incremented (block 455) and the new count value is returned to BIU 220 as a part of a failure response to the unCacheable write (block 460) that carries 64-bits of response data. In addition, synchronization arbiter 230 discards the set of addresses upon failure. BIU 220 sends the failure count value to the register of the requesting processor/core, which may also set condition code flags. As a result, the requesting processor/core may use the count value to select another set of memory resources in subsequent operations (block 465) and avoid interference on its subsequent synchronization attempt. Operation proceeds as described above in block 415.
  • Referring back to block 425, if there is no matching address in storage 232, synchronization arbiter 230 may return a passing count value (e.g., zero) to BIU 220 (block 430). In addition, synchronization arbiter 230 may store the set of addresses in an entry of storage 232 (block 435). BIU 220 may send the passing count value to the requesting processor/core register specified in the ACQUIRE instruction. As such, the requesting processor/core may manipulate or otherwise operate on the data at the requested addresses (block 440). If the operation is not complete (block 445), BIU 220 defers sending a completion message to synchronization arbiter 230. When the operation in the critical section is complete such as when the RELEASE instruction is executed, BIU 220 may send a completion message to synchronization arbiter 230. Upon receiving the completion message, synchronization arbiter 230 may flush the corresponding addresses from storage 232, thereby releasing those addresses back to the system (block 450) for use by another processor/core. In addition, load/store unit 270 updates the data cache for all instructions in that critical section that retired.
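  • The flow of FIG. 3 can be modeled in software for discussion purposes. The following single-threaded sketch uses one count value per entry (the first embodiment above). The type and function names (arbiter, arb_entry, arbiter_acquire, arbiter_release), the capacity MAX_ENTRIES, and the treatment of a full address storage as a failure are all assumptions of the sketch; the disclosure describes hardware and does not define such an interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ADDRS   8   /* one embodiment allows one to eight arguments */
#define MAX_ENTRIES 4   /* assumed capacity of address storage 232 */

typedef struct {
    bool     valid;
    size_t   n;                 /* number of addresses in the set */
    uint64_t addr[MAX_ADDRS];
    uint64_t count;             /* contenders seen for this set */
} arb_entry;

typedef struct { arb_entry entry[MAX_ENTRIES]; } arbiter;

static void arbiter_init(arbiter *a)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
}

/* Blocks 420-460: returns 0 (pass) and stores the set, or a non-zero
 * count (fail), in which case the requested set is discarded. */
static uint64_t arbiter_acquire(arbiter *a, const uint64_t *set, size_t n)
{
    arb_entry *free_slot = NULL;

    if (n == 0 || n > MAX_ADDRS)
        return 1;                       /* assumption: malformed request fails */
    for (size_t e = 0; e < MAX_ENTRIES; e++) {
        arb_entry *ent = &a->entry[e];
        if (!ent->valid) {
            if (free_slot == NULL)
                free_slot = ent;
            continue;
        }
        for (size_t i = 0; i < ent->n; i++)
            for (size_t j = 0; j < n; j++)
                if (ent->addr[i] == set[j])
                    return ++ent->count;    /* match: increment and fail */
    }
    if (free_slot == NULL)
        return 1;                       /* assumption: full storage fails */
    free_slot->valid = true;
    free_slot->n = n;
    free_slot->count = 0;
    memcpy(free_slot->addr, set, n * sizeof set[0]);
    return 0;                           /* passing count value */
}

/* Block 450: the completion message flushes the matching set. */
static void arbiter_release(arbiter *a, const uint64_t *set, size_t n)
{
    for (size_t e = 0; e < MAX_ENTRIES; e++) {
        arb_entry *ent = &a->entry[e];
        if (ent->valid && ent->n == n &&
            memcmp(ent->addr, set, n * sizeof set[0]) == 0)
            ent->valid = false;
    }
}
```

  • In this model, a second requester whose set overlaps a stored set receives 1, then 2 on a repeated attempt, and receives 0 only after the holder of the overlapping set has released it.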
  • As described above, if a coherency invalidation probe hits on an address in the critical section during processing of the critical section, the response to that probe may be dependent upon the state of processing of the critical section (i.e., whether or not the cache lines have been acquired). FIG. 4 is a flow diagram describing the operation of the embodiments of FIG. 1 and FIG. 2 when a coherency invalidation probe is received.
  • Referring collectively to FIG. 1 through FIG. 4 and beginning in block 505 of FIG. 4, a Probe is received and hits on a critical section address in load store unit 270. If the requested lines have been successfully acquired (block 510), (e.g., a coherency invalidation probe is received after synchronization arbiter 230 has provided a pass count value, and stored the set of addresses within storage 232), BIU 220 may send a Failure-to-Requestor response as a response to the probe (block 515). At the requesting processor core, this Failure-to-Requestor response should cause a failure of the ACQUIRE instruction if the processor core was operating in a critical section, or a retry of the addresses if not.
  • Referring back to block 510, if the requested lines have not been acquired, the processor core may ignore any count value received from synchronization arbiter 230 (block 520). Load/store unit 270 may notify instruction dispatch and control unit 257 that there is a probe hit (e.g., Prb hit signal), and thus there is a Failure-to-Acquire. As such, the ACQUIRE instruction is made to fail, as described above, and to an outside observer the ACQUIRE instruction simply failed.
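  • The decisions of FIG. 4 can be summarized in a short sketch. It reads block 520 as the branch of block 510 in which the lines have not yet been acquired, which is an interpretation of the flow rather than a statement taken from the figure. All enumerator and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* What the probed core does when a coherency invalidation probe arrives. */
typedef enum {
    PROBE_NORMAL,            /* probe misses the critical-section addresses */
    PROBE_FAIL_TO_REQUESTOR, /* lines acquired: deny the probe (block 515) */
    PROBE_FAIL_ACQUIRE       /* lines not acquired: own ACQUIRE fails (block 520) */
} probe_action;

/* What the requesting core does on receiving a Failure-to-Requestor. */
typedef enum { REQ_FAIL_ACQUIRE, REQ_RETRY } requestor_action;

static probe_action on_invalidation_probe(bool hits_cs_address,
                                          bool lines_acquired)
{
    if (!hits_cs_address)
        return PROBE_NORMAL;
    return lines_acquired ? PROBE_FAIL_TO_REQUESTOR : PROBE_FAIL_ACQUIRE;
}

static requestor_action on_fail_to_requestor(bool in_critical_section)
{
    /* Inside a critical section the ACQUIRE fails; otherwise the BIU
     * simply re-requests the original memory transaction. */
    return in_critical_section ? REQ_FAIL_ACQUIRE : REQ_RETRY;
}

/* Small self-check that doubles as a usage example. */
static void probe_selfcheck(void)
{
    assert(on_invalidation_probe(true, true) == PROBE_FAIL_TO_REQUESTOR);
    assert(on_fail_to_requestor(false) == REQ_RETRY);
}
```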
  • As described above, a critical code section may include one or more memory reference load instructions with the LOCK prefix, followed by the ACQUIRE instruction. In addition, a conditional jump instruction should follow the ACQUIRE instruction to allow the code to exit the critical section should synchronization arbiter 230 provide a Fail-to-Acquire code or if a Probe with INValidate is detected prior to acquiring the cache lines. In some implementations the conditional jump may be followed by a release instruction. However as shown below, LOCKed store instructions may be used in lieu of the RELEASE instructions. Two assembly language critical code sections are shown further below to exemplify the two types of critical sections. It is noted that the following code segments are merely examples used for discussion purposes. It is contemplated that other embodiments are possible. To create critical code sections such as described above, application programmers may use pseudo function calls within high-level languages such as the ‘C’ language, for example.
  • As shown below, an ACQUIRE pseudo function may take a variable number of arguments and return an integer value and a condition code result. In one embodiment the ACQUIRE pseudo function may take between one and eight arguments, for example. The ACQUIRE pseudo function should be used in the context of conditional flow control. For example, the ACQUIRE pseudo function should be inside an if-statement. Each argument may be compiled into the code stream with the LOCK prefix attached to the last memory reference used in the computation of that argument. The ACQUIRE function is classified as a pseudo function because it will never be compiled into a subroutine call in the resulting code, but instead results in a series of inline instructions. In effect, the compiler understands that ACQUIRE is a high-level language expression that is directly translated into native code. In addition, the compiler can check that the pseudo function is used in conditional context (e.g., in an if-statement) and issue a diagnostic if used otherwise. The arguments specified in the ACQUIRE pseudo function specify the memory resources needed for successful traversal of the critical section.
  • In addition, a MODIFY pseudo function may take a variable number of arguments and does not return a result. The MODIFY pseudo function compiles these arguments into the code stream and then inserts a RELEASE instruction following the code that computes the arguments. The MODIFY pseudo function may have a different number of arguments than the ACQUIRE pseudo function.
  • Taken together, the ACQUIRE and MODIFY pseudo functions give the application programmer a rather direct means to convert existing synchronization codes into codes which can utilize the advanced synchronization facility.
  • The following example ‘C’ code segment uses the ACQUIRE pseudo function and a subsequent MODIFY pseudo function to express the removal of an element from a doubly linked list.
    struct doubly_linked *remove( index i, KIND kind )
    {
     unsigned int     interference, count;
     struct doubly_linked *a, *b, *c, *d, *p;
     do
     {
      interference = FALSE;
      for( count = 0, p = queue[ i ].head; p ; p = p->next )
      {
       if( p->kind == kind )
       {
        if( count == 0 )
        {
         if(!(count = ACQUIRE( a = p->next,
                    b = p->prev,
                    c = a->prev,
                    d = b->next ) ) )
         {
          MODIFY ( a->next = d,
               b->prev = c,
               p->next = NULL,
               p->prev = NULL );
          return p;
         }
         else
         {
          interference = TRUE;
          break;
         }
        }
        else
          count--;
       }
      }
     }
     while( interference );
      return NULL;
    }
  • The following example x86 assembly code segment illustrates the code produced from compiling the above ‘C’ code through a ‘C’ compiler with the ACQUIRE and MODIFY pseudo function built-in compilation templates.
    _remove:
    MOVD R9,0 // interference = 0
    do_loop:
    MOVD R10,0 // count = 0
    MOVD R11,[queue+head+RAX*8]
    for_loop:
    MOVD R12,[kind+R11] // p->kind
    CMPD R12,EDX // p->kind == kind
    JNE for_continue
    TST R10 // count == 0
    JNE bypass
       LOCK MOVD R12,[next+R11] // a = p->next
       LOCK MOVD R13,[prev+R11] // b = p->prev
       LOCK MOVD R14,[prev+R12] // c = a->prev
       LOCK MOVD R15,[next+R13] // d = b->next
    ACQUIRE R10,<4> // count = ACQUIRE( )
    JNZ fails
    MOVD [next+R12],R15 // a->next = d
    MOVD [prev+R13],R14 // b->prev = c
    MOVD [next+R11],0 // p->next = NULL
    MOVD [prev+R11],0 // p->prev = NULL
    RELEASE
    MOVD RAX,R11 // return value <- p
    RET
    fails: MOVD R9,1 // interference = TRUE
    JMP for_continue
    bypass: SUB R10,1 // count--
    for_continue: MOVD R11,[next+R11] // p = p->next
    JMP for_loop
    TST R9 // interference ?
    JNE do_loop
    MOVD RAX,0 // return value <- 0
    RET
  • Thus, it is shown that the ACQUIRE pseudo function decorates the argument processing code with a LOCK prefix on the appropriate memory reference instruction(s), and then inserts the ACQUIRE instruction into the instruction stream. Since the ACQUIRE pseudo function is used in a conditional context, the compiler inserts the necessary subsequent conditional jump following the ACQUIRE instruction. Similarly, the MODIFY pseudo function compiles its argument list and then inserts the RELEASE instruction into the code stream.
  • FIG. 5 is a diagram that illustrates the creation of a critical code section from a high-level language code segment using an ACQUIRE pseudo function call. As shown below, and in FIG. 5, the critical code section 610 may be created by a compiler 600 from the critical section code 605 of high-level code segment 601.
  • As shown, the ACQUIRE pseudo function directs the compiler to perform the argument list computation with LOCK-based memory reference instructions (e.g., LOCK MOV) and to insert the ACQUIRE instruction into the instruction stream. The “interference” variable is used to cause the ‘for’ loop to search the list again if any interference was detected. At this point, the ‘for’ loop cannot depend upon running off the end of the list (meaning that there are no items on the list the dequeue operation wants). A major difference here is that the ‘count’ variable may be a direct indication of the amount of interference. Software may use this count value to find a different element and thereby possibly obviate interference the next time around. For example, consider the worst-case behavior of a typical non-Blocking primitive; when N processors all try to dequeue (or enqueue) to the same place (i.e., memory location(s)) all at once. The contending processors may access these lines continuously, slowing system performance. In contrast, using the ACQUIRE instruction and the LOCKed MOV instructions, the returned count value provides a means for N−1 processors to select a different queue element the next time around and to simply avoid interference. Each processor may select a different element candidate that requires different lines, thus reducing the interference. As such, no exponential back-off function is required (as may be required in conventional embodiments of non-blocking and wait-free code sequences currently in use), as the variable returned from ACQUIRE may be used as a direct measure of interference, and not a predicted measure of past interference. It is noted that in one embodiment the compiler may be taught to place the RELEASE instruction appropriately into the critical section, while in other embodiments, the programmer may explicitly place the RELEASE instruction appropriately into the critical section.
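  • The way a returned count value steers a processor toward a different element can be isolated in a few lines. The sketch below mirrors the skip logic of the ‘for’ loop in the example code, using a plain array in place of the linked list for brevity. The function name pick_candidate and the array representation are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the element to try next after skipping `skip`
 * matching elements, or -1 if fewer than skip + 1 elements match.
 * `skip` plays the role of the count value returned by a failed ACQUIRE. */
static int pick_candidate(const int *kinds, size_t n, int kind, unsigned skip)
{
    assert(kinds != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (kinds[i] != kind)
            continue;           /* not the kind being dequeued */
        if (skip == 0)
            return (int)i;      /* this is the candidate to attempt */
        skip--;                 /* leave this element to a contender */
    }
    return -1;
}
```

  • With contenders receiving different count values, each one selects a different matching element on its next attempt instead of backing off for a pseudo-random interval.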
  • In another embodiment, the MODIFY pseudo functions are not used, but the RELEASE pseudo function is used instead. The RELEASE pseudo function takes no arguments and returns no result. The RELEASE pseudo function directs the compiler to insert the RELEASE instruction into the instruction stream directly.
  • The following example ‘C’ code segment uses the ACQUIRE pseudo function and a subsequent RELEASE pseudo function to express the removal of an element from a doubly linked list.
    struct doubly_linked *remove( index i, KIND kind )
    {
     unsigned int     interference, count;
     struct doubly_linked *a, *b, *c, *d, *p;
     do
     {
      interference = FALSE;
      for( count = 0, p = queue[ i ].head; p ; p = p->next )
      {
       if( p->kind == kind )
       {
        if( count == 0 )
        {
         if(!(count = ACQUIRE( a = p->next,
                    b = p->prev,
                    c = a->prev,
                    d = b->next ) ) )
         {
          a->next = d;
          b->prev = c;
          p->next = NULL;
          p->prev = NULL;
          RELEASE( );
          return p;
         }
         else
         {
          interference = TRUE;
          break;
         }
        }
        else
          count--;
       }
      }
     }
     while( interference );
      return NULL;
    }

    Since this ‘C’ function generates exactly the same x86 assembly code as the previous ‘C’ function generated, the output from the compiler is not shown for brevity.
  • It is noted that in one embodiment, compiler 600 may include instructions that may be executed on any of the processing nodes 312 (and processor cores 18) in the computer system shown in FIG. 1 and FIG. 2. Similarly, the code sequences (e.g., 610) generated by compiler 600 may also be executed on any of the processing nodes 312 (and processor cores 18).
  • It is noted that although the computer system 100 described above includes processing nodes that include one or more processor cores, it is contemplated that in other embodiments, the advanced synchronization facility and associated hardware may be implemented using stand-alone processors or a combination of processing nodes and stand-alone processors, as desired. In such embodiments, each stand-alone processor may include all or part of the above described hardware and may be capable of executing the instructions that are part of the advanced synchronization facility. As such the terms processor and processor core may be used somewhat synonymously, except when specifically enumerated to be different.
  • Code and/or data that implements the functionality described in the preceding sections may also be provided on computer accessible/readable medium. Generally speaking, a computer accessible/readable medium may include any media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), CD-ROM, or DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, volatile or non-volatile memory media such as RAM (e.g. synchronous dynamic RAM (SDRAM), Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, non-volatile memory (e.g. Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface, etc., as well as media accessible via transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
  • Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims (18)

1. A method comprising:
creating a high-level expression for requesting exclusive access to one or more memory resource addresses, wherein the high-level expression includes an ACQUIRE pseudo function call comprising one or more arguments associated with the one or more memory resource addresses; and
creating a low-level set of instructions by compiling the high-level expression, wherein the low-level set of instructions includes a specification phase for requesting exclusive access to the one or more memory resource addresses;
wherein in response to compiling the ACQUIRE pseudo function call, creating the specification phase of code including:
generating an instruction stream including LOCK-based memory reference instructions having a LOCK prefix based on computations performed on the one or more arguments; and
inserting an ACQUIRE instruction into the instruction stream.
2. The method as recited in claim 1, wherein the LOCK-based memory instructions comprise LOCK MOV instructions for loading data from one or more memory locations to one or more respective registers.
3. The method as recited in claim 1, wherein the LOCK-based memory instructions comprise a LOCK PREFETCHW instruction.
4. The method as recited in claim 1, wherein the low-level set of instructions includes code for creating an atomic phase of code, and wherein creating the atomic phase comprises inserting a RELEASE instruction into the instruction stream after one or more MOV instructions for storing data to one or more memory locations from one or more respective registers.
5. The method as recited in claim 4, wherein creating the atomic phase further comprises compiling a MODIFY pseudo function call comprising one or more additional arguments specifying the one or more memory resource modifications.
6. The method as recited in claim 1, wherein the specification phase code further comprises inserting a conditional jump instruction in the instruction stream after the ACQUIRE instruction.
7. A method comprising:
providing to a compiler, a high-level expression including an ACQUIRE pseudo function call comprising one or more arguments associated with one or more memory resource addresses; and
the compiler creating a low-level set of instructions from the high-level expression, wherein the low-level set of instructions includes a specification phase for requesting exclusive access to the one or more memory resource addresses;
wherein the specification phase of code includes:
an instruction stream including LOCK-based memory instructions having a LOCK prefix based on computations performed on the one or more arguments; and
an ACQUIRE instruction inserted into the instruction stream after the LOCK-based memory instructions.
8. The method as recited in claim 7, wherein the LOCK-based memory instructions comprise LOCK MOV instructions for loading data from one or more memory locations to one or more respective registers.
9. The method as recited in claim 7, wherein the LOCK-based memory instructions comprise a LOCK PREFETCHW instruction.
10. The method as recited in claim 7, wherein the low-level set of instructions includes code for creating an atomic phase of code, and wherein creating the atomic phase comprises inserting a RELEASE instruction into the instruction stream after one or more MOV instructions for storing data to one or more memory locations from one or more respective registers.
11. The method as recited in claim 10, wherein creating the atomic phase further comprises compiling a MODIFY pseudo function call comprising one or more additional arguments specifying the one or more memory resource modifications.
12. The method as recited in claim 10, wherein creating the atomic phase of code further comprises inserting a conditional jump instruction into the instruction stream after the ACQUIRE instruction.
13. A computer readable medium including program instructions executable by a processor to:
create a low-level set of instructions by compiling a high-level expression that includes an ACQUIRE pseudo function call comprising one or more arguments associated with one or more memory resource addresses;
wherein the low-level set of instructions includes a critical code section for requesting exclusive access to the one or more memory resource addresses;
wherein in response to compiling the ACQUIRE pseudo function, the program instructions are further executable by the processor to:
generate an instruction stream including LOCK-based memory instructions having a LOCK prefix based on computations performed on the one or more arguments; and
insert an ACQUIRE instruction into the instruction stream.
14. The computer readable medium as recited in claim 13, wherein the LOCK-based memory instructions comprise LOCK MOV instructions for loading data from one or more memory locations to one or more respective registers.
15. The computer readable medium as recited in claim 13, wherein the LOCK-based memory instructions comprise a LOCK PREFETCHW instruction.
16. The computer readable medium as recited in claim 13, wherein the program instructions are further executable by the processor to insert a RELEASE instruction into the instruction stream after one or more MOV instructions for storing data to one or more memory locations from one or more respective registers.
17. The computer readable medium as recited in claim 13, wherein the program instructions are further executable by the processor to create a low-level set of instructions by compiling a high-level expression that includes a MODIFY pseudo function call comprising one or more additional arguments specifying one or more memory resources to be modified.
18. The computer readable medium as recited in claim 13, wherein the program instructions are further executable by the processor to insert a conditional jump instruction in the instruction stream after the ACQUIRE instruction.
US11/508,494 2005-08-23 2006-08-23 Method for creating critical section code using a software wrapper for proactive synchronization within a computer system Abandoned US20070050561A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/508,494 US20070050561A1 (en) 2005-08-23 2006-08-23 Method for creating critical section code using a software wrapper for proactive synchronization within a computer system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US71054805P 2005-08-23 2005-08-23
US11/508,494 US20070050561A1 (en) 2005-08-23 2006-08-23 Method for creating critical section code using a software wrapper for proactive synchronization within a computer system

Publications (1)

Publication Number Publication Date
US20070050561A1 true US20070050561A1 (en) 2007-03-01

Family

ID=37607137

Family Applications (6)

Application Number Title Priority Date Filing Date
US11/508,646 Active 2027-06-28 US7636819B2 (en) 2005-08-23 2006-08-23 Method for proactive synchronization within a computer system
US11/508,494 Abandoned US20070050561A1 (en) 2005-08-23 2006-08-23 Method for creating critical section code using a software wrapper for proactive synchronization within a computer system
US11/508,647 Abandoned US20070050563A1 (en) 2005-08-23 2006-08-23 Synchronization arbiter for proactive synchronization within a multiprocessor computer system
US11/508,491 Active 2027-04-24 US7552290B2 (en) 2005-08-23 2006-08-23 Method for maintaining atomicity of instruction sequence to access a number of cache lines during proactive synchronization within a computer system
US11/508,492 Active 2027-06-29 US7606985B2 (en) 2005-08-23 2006-08-23 Augmented instruction set for proactive synchronization within a computer system
US11/508,493 Active 2027-06-26 US7627722B2 (en) 2005-08-23 2006-08-23 Method for denying probes during proactive synchronization within a computer system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/508,646 Active 2027-06-28 US7636819B2 (en) 2005-08-23 2006-08-23 Method for proactive synchronization within a computer system

Family Applications After (4)

Application Number Title Priority Date Filing Date
US11/508,647 Abandoned US20070050563A1 (en) 2005-08-23 2006-08-23 Synchronization arbiter for proactive synchronization within a multiprocessor computer system
US11/508,491 Active 2027-04-24 US7552290B2 (en) 2005-08-23 2006-08-23 Method for maintaining atomicity of instruction sequence to access a number of cache lines during proactive synchronization within a computer system
US11/508,492 Active 2027-06-29 US7606985B2 (en) 2005-08-23 2006-08-23 Augmented instruction set for proactive synchronization within a computer system
US11/508,493 Active 2027-06-26 US7627722B2 (en) 2005-08-23 2006-08-23 Method for denying probes during proactive synchronization within a computer system

Country Status (7)

Country Link
US (6) US7636819B2 (en)
JP (1) JP5103396B2 (en)
KR (1) KR101369441B1 (en)
CN (1) CN101297270A (en)
DE (1) DE112006002237B4 (en)
GB (1) GB2445294B (en)
WO (1) WO2007025112A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080263339A1 (en) * 2007-04-18 2008-10-23 Kriegel Jon K Method and Apparatus for Context Switching and Synchronization
US20090125519A1 (en) * 2007-11-13 2009-05-14 Intel Corporation Device, system, and method for regulating software lock elision mechanisms
US20140281038A1 (en) * 2013-03-14 2014-09-18 Samsung Electronics Co., Ltd. Terminal and application synchronization method thereof
US9619386B2 (en) 2015-01-29 2017-04-11 Kabushiki Kaisha Toshiba Synchronization variable monitoring device, processor, and semiconductor apparatus

Families Citing this family (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7636819B2 (en) * 2005-08-23 2009-12-22 Advanced Micro Devices, Inc. Method for proactive synchronization within a computer system
US8135926B1 (en) * 2008-10-21 2012-03-13 Nvidia Corporation Cache-based control of atomic operations in conjunction with an external ALU block
US8108610B1 (en) * 2008-10-21 2012-01-31 Nvidia Corporation Cache-based control of atomic operations in conjunction with an external ALU block
US8108557B2 (en) * 2009-01-22 2012-01-31 Hewlett-Packard Development Company, L.P. System and method for measuring clock skew on a network
US8316368B2 (en) * 2009-02-05 2012-11-20 Honeywell International Inc. Safe partition scheduling on multi-core processors
US9727508B2 (en) * 2009-04-27 2017-08-08 Intel Corporation Address learning and aging for network bridging in a network processor
US9384063B2 (en) * 2009-06-18 2016-07-05 Microsoft Technology Licensing, Llc Eliding synchronization in a concurrent data structure
EP2467852B1 (en) 2009-08-20 2019-05-22 Rambus Inc. Atomic memory device
US8307198B2 (en) * 2009-11-24 2012-11-06 Advanced Micro Devices, Inc. Distributed multi-core memory initialization
US20110208921A1 (en) * 2010-02-19 2011-08-25 Pohlack Martin T Inverted default semantics for in-speculative-region memory accesses
US8793471B2 (en) 2010-12-07 2014-07-29 Advanced Micro Devices, Inc. Atomic program verification
US9122476B2 (en) 2010-12-07 2015-09-01 Advanced Micro Devices, Inc. Programmable atomic memory using hardware validation agent
US8788794B2 (en) * 2010-12-07 2014-07-22 Advanced Micro Devices, Inc. Programmable atomic memory using stored atomic procedures
KR20120101761A (en) * 2011-03-07 2012-09-17 삼성전자주식회사 Cache phase detector and processor core
US10061618B2 (en) 2011-06-16 2018-08-28 Imagination Technologies Limited Scheduling heterogenous computation on multithreaded processors
US9037838B1 (en) * 2011-09-30 2015-05-19 Emc Corporation Multiprocessor messaging system
TWI454922B (en) * 2011-12-19 2014-10-01 Phison Electronics Corp Memory storage device and memory controller and data writing method thereof
EP2798471A4 (en) * 2011-12-30 2016-12-21 Intel Corp Structure access processors, methods, systems, and instructions
US9430391B2 (en) 2012-03-29 2016-08-30 Advanced Micro Devices, Inc. Managing coherent memory between an accelerated processing device and a central processing unit
US9086957B2 (en) 2012-08-02 2015-07-21 International Business Machines Corporation Requesting a memory space by a memory controller
CN103020003A (en) * 2012-12-31 2013-04-03 哈尔滨工业大学 Memory race recording device for deterministic replay of multi-core programs and control method thereof
US9146885B2 (en) * 2013-05-17 2015-09-29 Analog Devices, Inc. Parallel atomic increment
US10229043B2 (en) 2013-07-23 2019-03-12 International Business Machines Corporation Requesting memory spaces and resources using a memory controller
US9262343B2 (en) * 2014-03-26 2016-02-16 International Business Machines Corporation Transactional processing based upon run-time conditions
US9256553B2 (en) * 2014-03-26 2016-02-09 International Business Machines Corporation Transactional processing based upon run-time storage values
CN104035888B (en) * 2014-06-11 2017-08-04 华为技术有限公司 Data caching method and storage device
US9710381B2 (en) 2014-06-18 2017-07-18 International Business Machines Corporation Method and apparatus for cache memory data processing
US9740614B2 (en) * 2014-06-27 2017-08-22 International Business Machines Corporation Processor directly storing address range of co-processor memory accesses in a transactional memory where co-processor supplements functions of the processor
US9513960B1 (en) 2015-09-22 2016-12-06 International Business Machines Corporation Inducing transactional aborts in other processing threads
US9563467B1 (en) 2015-10-29 2017-02-07 International Business Machines Corporation Interprocessor memory status communication
US9760397B2 (en) 2015-10-29 2017-09-12 International Business Machines Corporation Interprocessor memory status communication
US9916179B2 (en) 2015-10-29 2018-03-13 International Business Machines Corporation Interprocessor memory status communication
US10261827B2 (en) 2015-10-29 2019-04-16 International Business Machines Corporation Interprocessor memory status communication
US9772874B2 (en) * 2016-01-29 2017-09-26 International Business Machines Corporation Prioritization of transactions based on execution by transactional core with super core indicator
US20170300427A1 (en) * 2016-04-18 2017-10-19 Mediatek Inc. Multi-processor system with cache sharing and associated cache sharing method
RU2623806C1 (en) 2016-06-07 2017-06-29 Акционерное общество Научно-производственный центр "Электронные вычислительно-информационные системы" (АО НПЦ "ЭЛВИС") Method and device of processing stereo images
US10721185B2 (en) 2016-12-06 2020-07-21 Hewlett Packard Enterprise Development Lp Age-based arbitration circuit
US10452573B2 (en) 2016-12-06 2019-10-22 Hewlett Packard Enterprise Development Lp Scripted arbitration circuit
US10237198B2 (en) 2016-12-06 2019-03-19 Hewlett Packard Enterprise Development Lp Shared-credit arbitration circuit
US10944694B2 (en) * 2016-12-06 2021-03-09 Hewlett Packard Enterprise Development Lp Predictive arbitration circuit
US11157407B2 (en) * 2016-12-15 2021-10-26 Optimum Semiconductor Technologies Inc. Implementing atomic primitives using cache line locking
US10223186B2 (en) * 2017-02-01 2019-03-05 International Business Machines Corporation Coherency error detection and reporting in a processor
US10776282B2 (en) 2017-12-15 2020-09-15 Advanced Micro Devices, Inc. Home agent based cache transfer acceleration scheme
US10693811B2 (en) 2018-09-28 2020-06-23 Hewlett Packard Enterprise Development Lp Age class based arbitration
US10796399B2 (en) 2018-12-03 2020-10-06 Advanced Micro Devices, Inc. Pixel wait synchronization
CN109933543B (en) * 2019-03-11 2022-03-18 珠海市杰理科技股份有限公司 Data locking method and device of Cache and computer equipment
CN110490581B (en) * 2019-07-18 2022-09-30 拉货宝网络科技有限责任公司 Distributed system critical data resource updating method and system
US12093689B2 (en) * 2020-09-25 2024-09-17 Advanced Micro Devices, Inc. Shared data fabric processing client reset system and method
US11740973B2 (en) * 2020-11-23 2023-08-29 Cadence Design Systems, Inc. Instruction error handling
US11972117B2 (en) * 2021-07-19 2024-04-30 EMC IP Holding Company LLC Selecting surviving storage node based on environmental conditions
US11892972B2 (en) * 2021-09-08 2024-02-06 Arm Limited Synchronization mechanisms for a multi-core processor using wait commands having either a blocking or a non-blocking state

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5142676A (en) * 1988-12-28 1992-08-25 Gte Laboratories Incorporated Separate content addressable memories for storing locked segment addresses and locking processor identifications for controlling access to shared memory
US20020004810A1 (en) * 1997-04-01 2002-01-10 Kenneth S. Reneris System and method for synchronizing disparate processing modes and for controlling access to shared resources
US6389519B1 (en) * 1999-07-19 2002-05-14 Ati International Srl Method and apparatus for providing probe based bus locking and address locking
US6604162B1 (en) * 2000-06-28 2003-08-05 Intel Corporation Snoop stall reduction on a microprocessor external bus
US6678772B2 (en) * 2000-12-19 2004-01-13 International Business Machines Corporation Adaptive reader-writer lock
US20040068607A1 (en) * 2002-10-07 2004-04-08 Narad Charles E. Locking memory locations
US20040123058A1 (en) * 2002-12-24 2004-06-24 Hum Herbert H. Method and apparatus for processing a load-lock instruction using a relaxed lock protocol
US6772255B2 (en) * 1998-06-30 2004-08-03 Sun Microsystems, Inc. Method and apparatus for filtering lock requests
US20050132132A1 (en) * 2001-08-27 2005-06-16 Rosenbluth Mark B. Software controlled content addressable memory in a general purpose execution datapath
US20050283783A1 (en) * 2004-06-22 2005-12-22 Desota Donald R Method for optimizing pipeline use in a multiprocessing system
US20060026411A1 (en) * 2004-07-29 2006-02-02 Fujitsu Limited Processor system and thread switching control method
US20060041788A1 (en) * 2004-08-17 2006-02-23 International Business Machines Corporation Protecting code from breakpoints
US20060095685A1 (en) * 2004-11-03 2006-05-04 Bonola Thomas J System and method to coordinate access to a sharable data structure using deferred cycles
US7117481B1 (en) * 2002-11-06 2006-10-03 Vmware, Inc. Composite lock for computer systems with multiple domains
US7120762B2 (en) * 2001-10-19 2006-10-10 Wisconsin Alumni Research Foundation Concurrent execution of critical sections by eliding ownership of locks
US7269717B2 (en) * 2003-02-13 2007-09-11 Sun Microsystems, Inc. Method for reducing lock manipulation overhead during access to critical code sections
US7290105B1 (en) * 2002-12-16 2007-10-30 Cisco Technology, Inc. Zero overhead resource locks with attributes
US7325064B2 (en) * 2001-07-17 2008-01-29 International Business Machines Corporation Distributed locking protocol with asynchronous token prefetch and relinquish
US7454570B2 (en) * 2004-12-07 2008-11-18 International Business Machines Corporation Efficient memory update process for on-the-fly instruction translation for well behaved applications executing on a weakly-ordered processor
US7552290B2 (en) * 2005-08-23 2009-06-23 Advanced Micro Devices, Inc. Method for maintaining atomicity of instruction sequence to access a number of cache lines during proactive synchronization within a computer system
US7814488B1 (en) * 2002-09-24 2010-10-12 Oracle America, Inc. Quickly reacquirable locks

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4574350A (en) 1982-05-19 1986-03-04 At&T Bell Laboratories Shared resource locking apparatus
US4725946A (en) 1985-06-27 1988-02-16 Honeywell Information Systems Inc. P and V instructions for semaphore architecture in a multiprogramming/multiprocessing environment
CA2045791A1 (en) * 1990-06-29 1991-12-30 Richard Lee Sites Branch performance in high speed processor
US5285528A (en) * 1991-02-22 1994-02-08 International Business Machines Corporation Data structures and algorithms for managing lock states of addressable element ranges
US5613139A (en) * 1994-05-11 1997-03-18 International Business Machines Corporation Hardware implemented locking mechanism for handling both single and plural lock requests in a lock message
JP4086259B2 (en) * 1995-08-04 2008-05-14 株式会社東芝 Communications system
US5968157A (en) * 1997-01-23 1999-10-19 Sun Microsystems, Inc. Locking of computer resources
US6219751B1 (en) * 1998-04-28 2001-04-17 International Business Machines Corporation Device level coordination of access operations among multiple raid control units
US6651088B1 (en) * 1999-07-20 2003-11-18 Hewlett-Packard Development Company, L.P. Method for reducing coherent misses in shared-memory multiprocessors utilizing lock-binding prefetchs
KR100331565B1 (en) * 1999-12-17 2002-04-06 윤종용 Matrix operation apparatus and Digital signal processor capable of matrix operation
US6668308B2 (en) * 2000-06-10 2003-12-23 Hewlett-Packard Development Company, L.P. Scalable architecture based on single-chip multiprocessing
US6976158B2 (en) * 2001-06-01 2005-12-13 Microchip Technology Incorporated Repeat instruction with interrupt
US20060218556A1 (en) * 2001-09-28 2006-09-28 Nemirovsky Mario D Mechanism for managing resource locking in a multi-threaded environment
US6986005B2 (en) * 2001-12-31 2006-01-10 Hewlett-Packard Development Company, L.P. Low latency lock for multiprocessor computer system
US7089371B2 (en) * 2002-02-12 2006-08-08 Ip-First, Llc Microprocessor apparatus and method for prefetch, allocation, and initialization of a block of cache lines from memory
US6721816B1 (en) * 2002-02-27 2004-04-13 Advanced Micro Devices, Inc. Selecting independently of tag values a given command belonging to a second virtual channel and having a flag set among commands belonging to a posted virtual and the second virtual channels
US7395274B2 (en) * 2002-07-16 2008-07-01 Sun Microsystems, Inc. Space- and time-adaptive nonblocking algorithms
US7162589B2 (en) * 2002-12-16 2007-01-09 Newisys, Inc. Methods and apparatus for canceling a memory data fetch
US20040268046A1 (en) * 2003-06-27 2004-12-30 Spencer Andrew M Nonvolatile buffered memory interface
US20050120185A1 (en) * 2003-12-01 2005-06-02 Sony Computer Entertainment Inc. Methods and apparatus for efficient multi-tasking
US8332483B2 (en) * 2003-12-15 2012-12-11 International Business Machines Corporation Apparatus, system, and method for autonomic control of grid system resources
US7210019B2 (en) * 2004-03-05 2007-04-24 Intel Corporation Exclusive access for logical blocks
US7797704B2 (en) * 2005-03-30 2010-09-14 Hewlett-Packard Development Company, L.P. System and method for performing work by one of plural threads using a lockable resource

Patent Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5142676A (en) * 1988-12-28 1992-08-25 Gte Laboratories Incorporated Separate content addressable memories for storing locked segment addresses and locking processor identifications for controlling access to shared memory
US20020004810A1 (en) * 1997-04-01 2002-01-10 Kenneth S. Reneris System and method for synchronizing disparate processing modes and for controlling access to shared resources
US6772255B2 (en) * 1998-06-30 2004-08-03 Sun Microsystems, Inc. Method and apparatus for filtering lock requests
US6389519B1 (en) * 1999-07-19 2002-05-14 Ati International Srl Method and apparatus for providing probe based bus locking and address locking
US6604162B1 (en) * 2000-06-28 2003-08-05 Intel Corporation Snoop stall reduction on a microprocessor external bus
US6678772B2 (en) * 2000-12-19 2004-01-13 International Business Machines Corporation Adaptive reader-writer lock
US7325064B2 (en) * 2001-07-17 2008-01-29 International Business Machines Corporation Distributed locking protocol with asynchronous token prefetch and relinquish
US20050132132A1 (en) * 2001-08-27 2005-06-16 Rosenbluth Mark B. Software controlled content addressable memory in a general purpose execution datapath
US7120762B2 (en) * 2001-10-19 2006-10-10 Wisconsin Alumni Research Foundation Concurrent execution of critical sections by eliding ownership of locks
US7814488B1 (en) * 2002-09-24 2010-10-12 Oracle America, Inc. Quickly reacquirable locks
US20040068607A1 (en) * 2002-10-07 2004-04-08 Narad Charles E. Locking memory locations
US7117481B1 (en) * 2002-11-06 2006-10-03 Vmware, Inc. Composite lock for computer systems with multiple domains
US7290105B1 (en) * 2002-12-16 2007-10-30 Cisco Technology, Inc. Zero overhead resource locks with attributes
US20040123058A1 (en) * 2002-12-24 2004-06-24 Hum Herbert H. Method and apparatus for processing a load-lock instruction using a relaxed lock protocol
US7269717B2 (en) * 2003-02-13 2007-09-11 Sun Microsystems, Inc. Method for reducing lock manipulation overhead during access to critical code sections
US20050283783A1 (en) * 2004-06-22 2005-12-22 Desota Donald R Method for optimizing pipeline use in a multiprocessing system
US20060026411A1 (en) * 2004-07-29 2006-02-02 Fujitsu Limited Processor system and thread switching control method
US20060041788A1 (en) * 2004-08-17 2006-02-23 International Business Machines Corporation Protecting code from breakpoints
US20060095685A1 (en) * 2004-11-03 2006-05-04 Bonola Thomas J System and method to coordinate access to a sharable data structure using deferred cycles
US7454570B2 (en) * 2004-12-07 2008-11-18 International Business Machines Corporation Efficient memory update process for on-the-fly instruction translation for well behaved applications executing on a weakly-ordered processor
US7552290B2 (en) * 2005-08-23 2009-06-23 Advanced Micro Devices, Inc. Method for maintaining atomicity of instruction sequence to access a number of cache lines during proactive synchronization within a computer system
US7606985B2 (en) * 2005-08-23 2009-10-20 Advanced Micro Devices, Inc. Augmented instruction set for proactive synchronization within a computer system
US7627722B2 (en) * 2005-08-23 2009-12-01 Advanced Micro Devices, Inc. Method for denying probes during proactive synchronization within a computer system
US7636819B2 (en) * 2005-08-23 2009-12-22 Advanced Micro Devices, Inc. Method for proactive synchronization within a computer system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080263339A1 (en) * 2007-04-18 2008-10-23 Kriegel Jon K Method and Apparatus for Context Switching and Synchronization
US7681020B2 (en) * 2007-04-18 2010-03-16 International Business Machines Corporation Context switching and synchronization
US20100115250A1 (en) * 2007-04-18 2010-05-06 International Business Machines Corporation Context switching and synchronization
US8205067B2 (en) 2007-04-18 2012-06-19 International Business Machines Corporation Context switching and synchronization
US20090125519A1 (en) * 2007-11-13 2009-05-14 Intel Corporation Device, system, and method for regulating software lock elision mechanisms
US20140281038A1 (en) * 2013-03-14 2014-09-18 Samsung Electronics Co., Ltd. Terminal and application synchronization method thereof
US10003617B2 (en) * 2013-03-14 2018-06-19 Samsung Electronics Co., Ltd. Terminal and application synchronization method thereof
US9619386B2 (en) 2015-01-29 2017-04-11 Kabushiki Kaisha Toshiba Synchronization variable monitoring device, processor, and semiconductor apparatus

Also Published As

Publication number Publication date
US20070050562A1 (en) 2007-03-01
DE112006002237B4 (en) 2023-04-06
US20070050560A1 (en) 2007-03-01
US20070050563A1 (en) 2007-03-01
CN101297270A (en) 2008-10-29
US7627722B2 (en) 2009-12-01
DE112006002237T5 (en) 2008-06-19
US20070067529A1 (en) 2007-03-22
US20070050559A1 (en) 2007-03-01
GB2445294B (en) 2009-01-21
JP2009506436A (en) 2009-02-12
GB2445294A (en) 2008-07-02
KR20080038435A (en) 2008-05-06
JP5103396B2 (en) 2012-12-19
WO2007025112A1 (en) 2007-03-01
US7552290B2 (en) 2009-06-23
KR101369441B1 (en) 2014-03-04
GB0802809D0 (en) 2008-03-26
US7606985B2 (en) 2009-10-20
US7636819B2 (en) 2009-12-22

Similar Documents

Publication Publication Date Title
US7552290B2 (en) Method for maintaining atomicity of instruction sequence to access a number of cache lines during proactive synchronization within a computer system
TWI476595B (en) Registering a user-handler in hardware for transactional memory event handling
JP5404574B2 (en) Transaction-based shared data operations in a multiprocessor environment
US8190859B2 (en) Critical section detection and prediction mechanism for hardware lock elision
JP5860450B2 (en) Extension of cache coherence protocol to support locally buffered data
US8180977B2 (en) Transactional memory in out-of-order processors
US7162613B2 (en) Mechanism for processing speculative LL and SC instructions in a pipelined processor
US9110691B2 (en) Compiler support technique for hardware transactional memory systems
US20110208921A1 (en) Inverted default semantics for in-speculative-region memory accesses
SG188993A1 (en) Apparatus, method, and system for providing a decision mechanism for conditional commits in an atomic region
US9280349B2 (en) Decode time instruction optimization for load reserve and store conditional sequences
KR101056820B1 (en) System and method for preventing in-flight instances of operations from interrupting re-execution of operations within a data-inference microprocessor
US7373484B1 (en) Controlling writes to non-renamed register space in an out-of-order execution microprocessor

Legal Events

Date Code Title Description
AS Assignment

Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:ALSUP, MITCHELL;REEL/FRAME:018208/0952

Effective date: 20060823

AS Assignment

Owner name: GLOBALFOUNDRIES INC., CAYMAN ISLANDS

Free format text: AFFIRMATION OF PATENT ASSIGNMENT;ASSIGNOR:ADVANCED MICRO DEVICES, INC.;REEL/FRAME:023120/0426

Effective date: 20090630

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: GLOBALFOUNDRIES U.S. INC., NEW YORK

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION;REEL/FRAME:056987/0001

Effective date: 20201117