WO2017131624A1

WO2017131624A1 - A unified lock

Info

Publication number: WO2017131624A1
Application number: PCT/US2016/014829
Authority: WO
Inventors: Milind M CHABBI; Hideaki Kimura; Tianzheng WANG
Original assignee: Hewlett Packard Enterprise Development Lp
Priority date: 2016-01-26
Filing date: 2016-01-26
Publication date: 2017-08-03

Abstract

In one example, a unified lock includes a first level lock and a second level lock. The first and second level locks allow threads executing on a processor to request exclusive control of a processor resource. A first thread in the second level lock acquires an individual lock in the first level lock and then passes the individual lock to successive threads in the second level lock within a bounded limit before relinquishing the individual lock to remaining threads in the first level lock.

Description

A UNIFIED LOCK BACKGROUND

[0001] Multiple computer environments provide significant

advantages to computer clients or users. In particular, multiple computer environments allow many client instruction threads to share different computer resources including both hardware and software resources. In fact, a multiple computer environment may provide client threads access to a considerable number of computer resources, which are available to practically any client processor, including multiple cores (with and without hyper-threading), and even computer systems having Internet capabilities. Today's multi-core processors can also process multiple threads of instructions that share various levels of memory, as do parallel architecture machines and other forms of distributed computing such as distributed network based computing. Sharing computer resources provides many known benefits, such as the fact that only one such resource needs to be created, updated, and maintained, which is particularly useful such as for the large databases being created today.

[0002] Modern computer systems provide various "lock" services for managing the various client thread access requests to use computer resources. The various lock services allow a client thread to lock a resource when using that resource so that subsequent client threads may not access that resource during that time the first client thread has acquired the lock. BRIEF DESCRIPTION OF THE DRAWINGS

[0003] The disclosure is better understood with reference to the following drawings. The elements of the drawings are not necessarily to scale relative to each other. Rather, emphasis has instead been placed upon clearly illustrating the claimed subject matter. Furthermore, like reference numerals designate corresponding similar parts through the several views.

[0004] Fig. 1 is a block diagram of an example unified lock having a context-less (CL) lock and a lower tiered queuing lock (QL);

[0005] Fig. 2 is a flow chart of an example unified lock technique to implement the unified lock of Fig. 1 ;

[0006] Fig. 3 is a flow chart of another example queuing lock technique to implement the queueing lock of Fig. 1 ;

[0007] Fig. 4 is a block diagram of an example computing system for implementing a unified lock;

[0008] Fig. 5 is a block diagram of an example tangible unified lock on a non-transitory computer readable medium;

[0009] Fig. 6 is a flow diagram of an example method to implement a unified lock;

[00010] Fig. 7 is an example computing system that includes a unified lock;

[00011] Fig. 8 is a block diagram of an example system 800 that has a

NUMA (non-uniform memory access) processor with multiple cores; and

[00012] Figs. 9A and 9B are example performance graphs of an example implementation of a unified lock. DETAILED DESCRIPTION

[00013] A locking mechanism, also known as a muiex, assures mutual exclusion to certain sections of shared data and other processor resources by instructions executing on a processor. These sections of instructions or code are often referred to as critical sections, where at most one thread may execute the instruction fragment at a time. Various forms of locking mechanisms have been developed that allow for preventing stalls, blocks, deadlocks, race conditions, lock contention, and reducing the additional overhead caused by a particular locking mechanism needed to address a shared resource. Most modern processors provide simple hardware supported operations of atomic primitives that support complex operations by using synchronization constructs and conventions to protect against overlap of conflicting operations trying to access the same resource. An atomic primitive operation or the atomicity of an operation, relates to a read, modify, and write operation that must be performed entirely or not at ail. Since lock requests are typically associated with a particular access or command request, atomicity is typically required. Various atomic primitive operations include, but are not limited to: "fetch and store", "compare and swap", "test and set", and "fetch and add". The atomic primitive operations may be used with a busy-wait type synchronization lock architecture that use "spin locks". A spin lock is a software lock, which causes a thread trying to acquire it to simply wait in a loop (i.e. "spin") while repeatedly checking if the lock is available.

[00014] In order of decreasing granularity, for a computing system, a lock may protect shared resources such as nibbles, bytes, words, double words, memory locations, cache lines, cache pages, memory pages, and banks of memory. For a database management system, a lock may protect shared resources part of a field, a field, a record, a data page, or an entire table. Also, a lock may protect shared input/output (I/O) resources, such as interrupts, DMA channels, configuration registers, and I/O circuits as just a few examples. [00015] Fig. 1 is a block diagram of an example composite unified lock 100 having a context-less (CL) lock 102 and a lower tiered queuing lock (QL) 104 that can request the context-less lock 102 along with other context- less threads. The unified lock 100 described herein is a superior lock architecture that allows for multiple types of clients, such as context-less and context clients, to accommodate various forms of client threads, such as those which are frequent context-based worker client threads, "regular clients" (108A-108D), and infrequent lock context-less client threads, "guest clients" (106A-106B) and client thread 108C from the queueing lock 104 as an additional guest client. In effect, this new unified lock 100 offers a large design space for programmers by combining different lock types to trade off memory space, time, fairness, ease of use, portability, and scalability. The unified lock 100 may guarantee that regular clients 108A-108D never starve and typically execute their critical sections in a first~in~first-out (FIFO) order. Performance is enhanced by allowing a regular client 108A-108D to always enter its critical section, even under high contention, by acquiring just one lock, the queueing lock 104. Flexibility in the architecture allows for various context-less locks 102 to be used to add additional functionality, such as allowing guest clients 106A-106C to be serviced in another FIFO order when using a ticket lock, which may also ensure higher fairness for both guest and regular client threads. However, there is no integrated FIFO ordering between respective guest and regular client thread FIFO queues in the context-less lock 102 and queueing lock 104.

[00016] The notion of fairness in lock acquisition applies to the order in which client threads acquire a lock successfully. If some type of fairness is implemented, a thread is prevented from being starved out of execution for a long time due to its inability to acquire a lock in favor of other client threads. With no fairness guarantees, a situation can arise where a thread (or multiple threads) can take a disproportionately long time to execute as compared to others. The unified lock 100 ensures fairness by having a bound limit within the regular clients while it may not impose a bound limit and its associated overhead on infrequent guest clients. [00017] The context-less lock 102 may have multiple guest clients, such as 106A, 106B, and 106C (the queueing lock acting as a guest client) as just one example. The context-less lock 102 may be one of several types of known context-less locks such as Test and Set (TAS) locks and Test and Test and Set (TATAS) locks.

[00018] A TAS lock typically has a single flag field per lock and acquires lock by changing the flag from false to true, i.e. true=successful lock. The flag is reset to false to release or relinquish the lock. If the flag is false, the guest client 106A-106C threads may execute their CL spin 122 to continue to try to acquire the CL lock. One issue with the TAS lock is that a "test and set" atomic operation will likely invalidate a cache line causing a large amount of memory network traffic. The TATAS lock helps to reduce the cache invalidation problem by first testing the flag field before performing the "test and set^" atomic operation.

[00019] Accordingly, the TATAS lock first checks if there is a chance of success, i.e. it only reads the flag. This restriction prevents the memory cache within the processor from being invalidated as with a write operation and thus has better performance than the TAS lock. However, the TATAS lock relies on cache-coherence between different client thread processors.

[00020] Other context-less locks may be variations of the TAS and TATAS locks, such as by adding in back-off, typically exponential, so that on failure to acquire the lock, a back-off wait time of a fixed, programmable, pseudo-random, or random duration during the spin is used to ensure that client threads request the context-less lock at different times and thus prevent contention for the lock from several client threads. However, due to the back-off time, some threads may wait longer than necessary.

[00021] Simple busy-wait spin locks such as TAS and TATAS locks scale poorly since each client thread continuously polls on a shared memory location for the lock availability. Furthermore, these locks neither maintain the time-based ordering among requesters nor ensure starvation freedom.

[00022] Another context-less lock is a ticket (TKT) lock (sometimes also referred to as an array lock). This context-less lock provides a first come-first served or FIFO queue to provide for less contention of the lock and less cache invalidations and adds the benefit of fairness to ensure starvation freedom. In one example of a TKT lock, a client thread uses a low-level atomic synchronization primitive to obtain a ticket value, then waits until a counter reaches that value, or some function of the value. For example, a fetch-and-increment instruction may be used to read a memory location and increment its value, while no other thread or processor is able to access the memory location in between. The client thread then waits for another counter to reach the ticket value, and enters the critical section. By program design, the client thread will typically be guaranteed an exclusive access to certain data or I/O objects protected by the lock. When the client thread is done and wishes to allow other threads access to the data or I/O objects, it increments the counter to allow the next successor thread in-line the exclusive access. If no successor threads are waiting, then the next client thread to try to obtain the lock will be given the exclusive access. The TKT lock scales poorly due to its centralized polling.

[00023] Complex systems such as the Linux kernel often employ centralized locks such as the TKT lock because of the simplicity of their implementations. However, centralized locks are inherently non-scalable under high contention. Queue-based locks, invented in 1990s, are scalable but not widely adopted since they require additional context management, which can cause severe code changes in complex systems.

[00024] The queueing lock (QL) 104 is a modified MCS queue lock ( CS lock is named after its creator's initials, John M. Mellor-Crummey and Michael L. Scott). An MCS lock provides a guarantee of first~in~first~out (FIFO) ordering of lock acquisition requests, a spin 1 10A-1 10D for each respective regular client 108A-108D on locally-accessible flag variables only, a constant amount of space per lock for a Qnode record, and typically requires processor systems with coherent cache. In one example, the Qnode record contains context 128 in the form of a queue link record 1 14- 120 and a Boolean flag. Each regular client 108A-108D thread includes an additional variable during a lock acquire operation. Ail client threads holding or waiting for the lock are chained together by links. Each client thread spins on its own locally accessible flag. The MCS lock itself includes a tail pointer 126 to the Qnode record for the regular client 108D thread at the tail 124 of the queue or a 'nil' if the lock is not held. Each regular client 108A- 108D thread in the MCS queue holds the queue-link record 1 14-120 for the predecessor regular client thread. An atomic "compare and swap" (CAS) operation on the tail pointer allows a regular client 108A-108D thread to determine whether it is the only regular client 108 A-108D thread in the queue, and if so, remove itself correctly as a single atomic action. The QL spin 1 10A-1 10D in the acquire lock operation checks to see if the lock is free or not. To unlock, the regular client thread holding the QL lock 1 12 modifies the locked field of the successor node in the QL queue. If no successor exists, the tail pointer 126 is reset.

[00025] However, the MCS queue lock requires that ail of its clients "bring their own context (Qnode) records." Such a requirement is difficult to adopt in large, complex, and legacy systems where some infrequent clients may be difficult to refactor code to include Qnode records. This restriction of "bring-you-own-contexf has led to the limited adoption of queuing locks since:

a) The code needs to be rewritten at ail call sites in legacy codes to provide a "context" parameter.

b) The context needs to be carried by the locking client in multiple functions and code paths since the same context is needed both in the lock acquire and release phases, which leads to an explosion of code changes.

c) The queue node may have to be preallocated in a shared- memory region, and the preallocation may not be feasible when the number of participants is not a-priori known.

[00026] Because of these complications, many complex systems still include multiple scattered centralized locks despite their scalability limitations. For example, the Linux kernel, which forms the backbone of many business critical systems, is one stereotypical example of complex, legacy code that suffers from lock contention at scale and needs to take advantage of scalable algorithms. However, Linux code guardians have shown resistance to adopting well-known scalable locks such as MCS locks since the code changes needed to incorporate MCS locks impact large parts of the code base and precompiled applications. The unified lock provides a transition path where a vast majority of legacy code can continue to work on future systems without any change while offering the benefits of massive parallelism in hardware for performance-critical software willing to tolerate minor changes.

[00027] In addition, mandating that each client thread bring its queue node poses challenges of extensive code refactoring and pre-ailocation of the queue node context 128. Extensive code refactoring forces all routines in the code where the traditional spin-locks are used (and hence a reference to only the lock word was passed) to be replaced with an additional argument, a reference to the context node. Refactoring changes can be quite intrusive and unwelcome in complex legacy systems that have critical memory space and timing requirements.

[00028] Pre-aliocation of the queue node context 128 affects both inter- and intra- process mutual exclusion. When the MCS lock is used for inter-process mutual exclusion, the queue nodes must be pre-aliocated. Pre-ailocation is necessary for interpreting the predecessor and successor pointers used in the MCS lock. However, pre-allocating the queue nodes is difficult, if not impossible, for a guest client that may be created and destroyed arbitrarily. While intra-process mutual exclusion can allocate a queue node on the client's stack, stack allocated queue nodes are not preferred in mission critical systems. This non-preference is because the MCS lock is prone to the ABA problem due to its use of the compare and swap (CAS) atomic primitive.

[00029] In multithreaded computing, the ABA problem occurs during synchronization, when a location is read twice, has the same value for both reads, and "value is the same" is used to indicate "nothing has changed". However, another thread can execute between the two reads and change the value, do other work, then change the value back, thus fooling the first thread into thinking "nothing has changed" even though the second thread did work that violates that assumption. To circumvent the ABA problem, many conventional computing systems pre-aiiocate the queue nodes and use an offset from a base address to perform the CAS operation.

Accordingly, pre-allocation remains a problem for both the inter- and intra- processes of locking for mutual exclusion. While frequent lock users such as the regular clients 108A-108D may opt to allocate and bring their queue nodes (the context 128 record), infrequent guest clients 108A-106B may actually find it too restrictive and cumbersome to take on the additional burden of providing a queue node. Such infrequent guest clients 106A- 108B include daemon processes, snapshot generators, and progress indicators as just a few non-limiting examples.

[00030] The unified lock 100 of Fig. 1 solves these issues by

composing two locks, the context-less (CL) lock 102 and the queueing lock (QL) 104. The unified lock 100 allows for performance-critical legacy code to be scalable with added context without sacrificing the portability of vast amounts of context-less legacy code. The queueing lock 104 provides a scalable queueing lock while the context-less lock 102 accommodates guest client threads 106A, 106B into a queueing lock without requiring the guest client threads 106A, 106B to bring their context information. The guest client threads 106A-108C compete directly for the CL lock 102 and thus need no context at ail. The regular client threads 108A-108D compete for the queueing lock 104. The first regular client 108A that acquires the QL lock 104 has the additional responsibility of competing for the CL lock 102. Subsequent waiting regular clients 108B-108D inherit the QL lock 104 from their predecessor and in most instances, do not compete for the CL lock 102. In effect, this CL lock capture ensures that subsequent regular client threads 108B-108D usually acquire the lock in a manner similar to their other subsequent QL lock counterparts. A bound limit on passing the QL lock 104 to subsequent regular client threads 108B-108D ensures starvation freedom for the regular client threads 108A-108D. A FIFO order is naturally maintained among ail regular client threads 108A-108D. !f a TKT lock is used for the CL lock 102, the CL lock 102 may ensure starvation freedom and a second FIFO among guest client threads 106A-106C.

[00031] Fig, 2 is a flow chart of an example unified lock technique 200 to implement the unified lock 100 of Fig. 1 by modifying an MCS type lock for the QL lock 104. In block 202, a regular client thread 108 enters the FIFO in QL lock 104 by making an "acquire lock" request with a lock API interface, for a restricted resource. In decision block 204 the QL Lock 104 is checked to see if there is one or more predecessors in the QL lock 104 FIFO. If there is at least one predecessor, then the regular client thread 108 spins in the QL lock in block 1 10 until it determines it has acquired the QL lock 104. For instance, the spin continues to check its local flag to

determine whether it has acquired QL lock or not, thereby preventing cache invalidation. If in decision block 104 it is determined there is no predecessor in the QL lock 104 FIFO then the regular client thread 108 has acquired the QL lock 104. Once QL lock 104 is acquired, then in decision block 206 a check is made to see if CL lock 102 has been acquired for QL clients. If it has not, then in block 122, the regular client thread 108A spins waiting to acquire CL lock 102, Once CL lock 102 has been acquired for QL clients, then the regular client thread 108A may access the restricted resource in block 208.

[00032] Once a regular client thread 108A is done accessing the resource, in block 210 a check is made to determine if there is a successor regular client thread 108B in the QL lock 104 FIFO, If not, then in block 212, both the CL lock 102 and the QL lock 104 are relinquished or released. To ensure high performance in some examples, it is better to release CL lock 102 first followed by releasing the QL lock 104 as such an ordering will ensure that the CL lock 102 is not unnecessarily held while a critical section is not being executed. In addition, this ordering ensures correctness independent of the type of lock used at the CL level. However, in some examples, TKT, TATAS, TAS type locks at the CL level may tolerate any order of release. For these types of locks, the QL lock 104 may be released first before the CL lock 102,

[00033] !f there is a successor regular client thread 108B-108D then in decision block 214 a check is made to see if a bound limit has been reached. The QL lock 104 may have a bound limit of any kind. Some examples include bound by time, bound by CPU cycles, bound by the number of lock passing, bound by cache misses etc. In fact, any

monotonically increasing or decreasing metric with a local counting property (to prevent cache invalidation) is sufficient.

[00034] If it is determined in decision block 214 that the bound limit has been reached, then in block 212, both the CL lock 102 and QL lock 104 is relinquished. If the bound limit has not been reached, then the QL lock 104 is passed to a successor regular client thread 108B-108D in block 216 and the successor regular client thread 108B-108D may access the restricted resource in block 208 as their turn arises in the FIFO queue. Each successive regular client thread 108B-108D is able to successively access the restricted resource until either there are no more successive regular clients 108B-108D or the bound limit is reached. Upon relinquishing both the CL and QL locks, the CL lock 102 can continue to allow regular client threads 108A-106B to request access to the restricted resource via CL lock 102.

[00035] Fig. 3 is a flow chart of another example queuing lock technique 300 to implement the queueing lock 104 of Fig. 1 that uses a bound limit based on a number of lock passes. In block 302, the bound iimit(s) are initialized and thus depending on the particular needs of the computing system, the bound limits may be tuned, calibrated, characterized, or otherwise modified to achieve specific performance criteria. In block 304, new QL regular client threads 108A-108D are accepted into the QL lock 104 FIFO queue. If the accepted regular client thread 108A-108D is the first in the FIFO queue as determined in decision block 306, it has acquired QL lock 104 and flow continues to decision block 306 where the regular client thread 108 seeks to acquire the CL lock 102 for a restricted resource in decision block 308. If not the first in the QL FIFO queue, the regular client thread 108A -108D performs its local spin 1 10 seeking to determine if it has acquired the QL lock by being first in the QL FIFO queue..

[00036] In decision block 308, if the CL Iock102 has not been acquired for the regular QL clients, then the regular client thread 108A-108D performs a CL lock spin 122 while waiting to acquire CL lock 102. Once the CL lock 102 has been acquired, the regular client threads 108A-108D may access the restricted resource in block 310. Once done accessing the restricted resource, in block 312 a QL sequence number is incremented and checked in decision block 314 to determine if the QL sequence number limit (the bound limit) has been reached. The QL sequence number is passed by regular clients to unlock their successors, such as by using a flag field of the QL lock 104 to pass the sequence number efficiently rather than relying on a centralized counter. The QL sequence number is incremented by a QL lock holder before passing the lock to its successor.

[00037] !f the limit has not been reached, then in decision block 316, a check is made to see if there is a successor in the QL lock 104 FIFO. If there is, the CL lock 102 is passed to the successor regular client thread 108B-108D. The successor regular client thread 108B-108D leaves its QL spin 1 10 upon confirmation that it is first in the QL lock 104 FIFO queue and thus that it has QL lock 104. If the CL lock is still acquired for QL clients, then the successor regular client thread 108B-108D man access the restricted resource in block 310.

[00038] Returning to decision block 314, if it is determined that the QL sequence number limit has been reached, then in block 316 the CL lock 102 is relinquished and in block 318 the QL sequence number is reset. In decision block 320 a check is made to see if a successor regular client thread 108B-108D is in the QL lock FIFO. If there is a successor, then the QL lock is relinquished to the successor and a message is sent to the successor to notify it to acquire the CL lock 102. The message may be a simple local flag that is checked in decision block 308 during CL lock 102 acquisition, !f there is no successor in block 320 in block 324 the QL lock is relinquished,

[00039] !f in decision biock 314 it is determined that the QL sequence number limit has not been reached, then in decision block 326 a check is made to see if a successor regular client thread 108B-108D is in the QL Lock FIFO. If not, then the flow returns to block 316 as described above to relinquish the CL lock in block 316 and reset the QL sequence number in biock 318. As there is no successor then from decision biock 320, the QL lock is relinquished in biock 324 and flow continues to block 304 to accept new regular clients in the QL lock. If there is a successor determined in decision biock 326 then in block 328 the CL lock is passed to the successor in the QL FIFO.

[00040] Fig, 4 is a block diagram of an example computing system 400 for implementing a unified lock 100. The computing system 400 may include one or more processors 402A-402C. The processors may include one or more cores each, and each core may have one or more hyper- threads as well as a plurality of resources. The processors may be under the control of a BIOS, an operating system, or application program to execute a plurality of executable threads of instructions requesting exclusive control of a subset of the plurality of resources, such as resource A 404A, resource B 404B, resource C 404C, and resource D 404D. Also included, are a first level lock module 406 and a second level lock module 410. The code to execute the first and second level lock modules 406, 410 may be contained in the BIOS, the operating system, and/or the application program.

[00041] The first level lock module 406 includes a first queue 408 and the second level lock module 410 includes a second queue 412. The first queue 408 allows a first set of guest client threads 106A-106B and a guest client thread 106C from the second level lock module 410 to acquire an individual lock on the subset of the plurality of resources 404A-404D. The second queue 412 allows a second set of regular client threads with queue nodes to acquire the individual lock through the first level lock module 408, and then pass the individual lock to successive second set of threads in the second level lock module 410 within a bounded limit before relinquishing the individual lock to the remaining set of guest client threads 108A-106B in the first level block module 406.

[00042] In some examples, the first level lock 406 may be any context- less lock and in one special example a ticket type lock to allow a first FIFO in the first level lock 406. Further, the second level lock 410 may allow a second FIFO and may be any queue-based lock with a local spinning property such as at least one modified MCS type lock. The first Ievei lock guest client threads 106A-106C may be without context or context-less and the second level lock 410 regular client threads 108A-108D may be with context 128 in the form of the Qnode information that is used to perform local spinning by each of the regular client threads 108A-108D. In yet other examples, a sequence number is passed to successive second set of regular client threads 108B-108D and if the sequence number reaches a predetermined value, the individual lock is relinquished to the remaining first set of guest client threads 106A-108B in the first Ievei lock 406.

[00043] Fig. 5 is a block diagram of an example tangible unified lock 500 on a non-transitory computer readable medium 502 that includes a set of instructions that when executed by a processor having one or more cores and in one or more threads cause the processor with a first set of

instructions 506 to implement a context-less lock accepting a first set of threads executing on the processor that request an individual lock on a set of resources available on the processor. Further, a second set of instructions 508 are to implement a queueing lock with context accepting a second set of threads that executes on the processor requesting the individual lock. The queueing lock first requests the individual lock in the context-less lock and passes the individual lock to successive second set of threads within a bounded limit before relinquishing the individual lock to the remaining first set of threads in the context-less lock.

[00044] In some examples, when an initial thread in the second set of threads arrives in the queueing lock without other threads in the second set of threads, the queueing lock conveys a message to the initial thread to compete for the individual lock in the context-less lock. In other examples, once the initial thread relinquishes the individual lock and there are successive threads in the second set of threads, a sequence number is passed to the next successive thread in the queueing lock. Further, once the initial thread acquires the individual lock, a bounding interval based on a monotonically growing metric with a local counting property is started and the individual lock is relinquished by the queueing lock when a

predetermined limit for the metric is reached or there are no successive threads in the second set of threads. The bounding interval on the monotonically increasing metric may be at least one from the set of a bound number of threads, a bound by time, a bound by CPU cycles, a bound by the number of lock passing, and a bound by cache misses.

[00045] A computer readable medium allows for storage of one or more sets of data structures and instructions (e.g. software, firmware, logic) embodying or utilized by any one or more of the methodologies or functions described herein. The instructions may also reside, completely or at least partially, with the static memory, the main memory, and/or within the processor during execution by the computing system. The main memory and the processor memory also constitute computer readable medium. The term "computer readable medium" may include single medium or multiple media (centralized or distributed) that store the one or more instructions or data structures. The computer readable medium may be implemented to include, but not limited to, solid state, optical, and magnetic media whether volatile or non-volatile. Such examples include, semiconductor memory devices (e.g. Erasable Programmable Read-Oniy Memory (EPROM), Electrically Erasable Programmable Read-only Memory (EEPROM), and flash memory devices), magnetic discs such as internal hard drives and removable disks, magneto-optical disks, and CD-ROM (Compact Disc Read - Only Memory), and DVD (Digital Versatile Disc) disks.

[00046] Fig. 6 is a flow diagram of an example method 800 to implement a unified lock. In block 602, a first set of threads executing on the processor requesting an individual lock on a set of resources available on the processor are accepted to implement a context-iess lock. In block 604, a second set of threads executing on the processor requesting the individual lock are accepted to implement a queueing lock with context 128. !n block 606, the individual lock in the context-less lock is requested by the queueing lock. In block 608, the individual lock is passed to a successive second set of threads within a bounded limit before relinquishing the individual lock to the remaining first set of threads in the context-less lock. In one example, the bounded limit may be based on a number of lock passes in the queueing lock. In another example, a sequence number is used to track the number of lock passes. In yet another example, the method may include sending a message to a successor in the successive second set of threads to acquire the context-less lock when the bounded limit is reached. .

[00047] Fig. 7 is an example computing system 700 that includes a unified lock. The computing system 700 includes a plurality of processor units 714 coupled via one or more computer busses or communication links to a system memory 702, a storage interface 716, a network interface 718, and other I/O resources 719. The plurality of processor units 714 have a plurality of data and I/O resources that may be accessed by multiple instruction threads running on the plurality of processor units 714. The multiple instruction threads may be present in tangible non-transitory instruction code or modules on computer readable media such as in a BIOS 704, an operating system 706, including its kernel, application programs 708, program modules 710, and program data 712. These instruction or code modules 704-712 may be located in one or more memory and storage locations such as system memory 702 and storage devices accessible by storage interface 716. Further, the network 718 or I/O resources may also allow access to instruction code or modules 706-712, such as via an intranet, Internet, virtual private network, wireless networks, and

communication links and the like. [00048] The various examples described herein may include logic or a number of components, modules, or constituents. Modules may constitute either software modules, such as code embedded in tangible non-transitory machine readable medium) or hardware modules, A hardware module is a tangible unit capable of performing certain operations and by be configured or arranged in certain manners. In one example, one or more computer systems of one or more hardware modules of a computer system may be configured by software (e.g. an application, or portion of an application) as a hardware module that operates to perform certain operations as described herein.

[00049] The ability to have the QL lock 104 (Fig. 1 ) compete for the CL lock 102 is at least one advantage over previous composite, cohort, or hierarchical lock techniques. The unified lock 100 is different from cohort locks in that the unified lock 100 allows a thread to compete either at the first level (as a guest client) or at the second level (as a regular client), whereas in a cohort lock all threads begin their protocol at the second level. Being able to compose a CL lock 102 with a context-based lock in queueing lock 104 provides a unique advantage in accommodating infrequent guest client threads 106A-106B in an otherwise context-based QL lock 104 which services most regular client threads. While the unified QL-CL lock uses a single context-based lock at the QL level, some examples may have multiple QL locks at the interior computer core or OS level.

[00050] Fig. 8 is a block diagram of an example system 800 that has a NUMA (non-uniform memory access) processor 808 with multiple cores 806A-806D, where each of the multiple cores 806A-806D feed a separate respective QL lock 804A-804D as regular clients 108. NUMA is an alternative approach for multiple processor designs that links several compute nodes using a high-performance connection. Each node contains processors and memory, much like a small SMP (symmetric multiprocessor) system. However, an advanced memory controller allows a compute node to use memory on ail other compute nodes, creating a single system image. [00051] For instance, there may be one QL lock 804A-804D per locality domain, or one QL lock 8Q4A-S04D per group of threads based on a software designer's choosing. Each of the QL locks 804A-804D have a separate request into a ticket-based (TKT) content-less (CL) lock 802 as guest clients 106. Other guest clients 106 may request the TKT CL lock 802 from any processes executing on the multiple cores 808A-808D of NU A Processor 806. Also, the described architecture does not preclude example implementations having multiple interior levels of QL locks, for instance, composing a hierarchical modified QL lock with a CL lock such as for the NUMA system. In some example systems, the processor may be a NUMA processor, and the queueing lock has multiple queueing locks each assigned to a separate compute node on the NUMA processor.

[00052] Figs. 9A and 9B are performance graphs of an example implementation of a unified lock. The performance results illustrated in the graphs of Figs, 9A and 9B are sample results of the example

implementation and the claimed subject matter is not limited to the example implementation nor any particular set of results or specific improvements over existing locks.

[00053] Fig. 9A is an example first performance chart 900 comparing the throughput (locks/sec) of one example implementation the unified lock 100 with a pure TKT lock and a pure MCS queueing lock as an experiment. The experiment was done on a 4 socket, 18 core, 2-way SMT for a total of 144 threads using an Intel™ Haswel™ machine clocked at 2.5 GHz. In this evaluation, each thread repeatedly invoked a lock-unlock pair in a loop. First performance chart 900 illustrates the scalability of the locks with no legacy clients. The unified lock 100 matches the performance of the pure MCS lock, whereas the throughput of a TKT lock falls off.

[00054] Fig. 9B is an example second performance chart 950 that illustrates the scalability of the example implementation of the unified lock 100 with respect to various numbers (1 to 144) of legacy context lock clients. The throughput of the TKT and MCS locks are shown for reference only as they do not admit different types of clients. Second performance chart 950 illustrates that the unified lock offer the same throughput as an MCS lock when the number of legacy clients is small (i.e., less than 4) as is common. However, the performance starts to drop off with frequent and larger numbers of legacy clients. In the unlikely event that all clients are legacy context lock clients, the behavior of the unified lock 100 approaches that of the TKT lock. While frequent legacy clients are not a common use case, if it becomes so, those legacy clients that cause contention should be moved over to use the modified MCS interface by adding context through

refactoring of the associated code.

[00055] For instance, it is observed that in large, complex systems there are: 1 ) numerous code places that are infrequently exercised ("cold" code) and are performance agnostic, and 2) there are a few code paths that are frequently exercised ("hot" code) that cause lock contention and hence are performance sensitive. One can easily identify such hot and cold code regions via profiling. With this profile information, one can replace the locks used only in the hot code regions with queuing locks as regular clients while leaving the legacy context-less locks used in cold code untouched as guest clients. Of course, these two locks might be protecting the same critical section and hence their interactions should preserve correctness. To make the interactions correct, the composite unified lock 100 includes a scalable queue-based lock (QL) and a centralized context less lock (CL). Cold code paths (guest clients) thus compete directly for the CL lock and require neither code change nor any recompiiation. Hot code paths (regular clients) need modification to use the MCS-type interface that passes a context and compete for the QL lock. A thread that acquires the QL lock uncontended proceeds to compete for the CL lock and it can enter its critical section only after it has acquired the CL lock as well. Threads that do not immediately acquire the QL lock, enqueue waiting for the QL lock to be passed from their predecessors. The release protocol of the QL lock checks if it has a waiting successor and if so, it releases the QL lock to the successor. The

successor acquires the QL lock and in the process it inherits the ownership of the CL lock as well. The successor immediately enters the critical section without explicitly competing for the CL lock. Passing the QL lock to a successor and consequently inheriting the CL lock from a predecessor continues throughout the QL protocol. Finally, if a thread does not have a respective successor in the QL lock, it releases the CL lock first, followed by releasing the QL lock. To avoid starving the legacy CL lock clients, the number of times a QL lock can consecutively pass the lock to a successor is bound. Once the bound is reached, the CL-QL protocol releases the CL lock even if there is a waiting QL successor and informs the successor (if any) to compete for the CL lock.

[00056] Under contention, a thread waiting for a QL lock, spins locally similar to an MCS lock and avoids interconnect (e.g. memory bus, network communication) traffic. Usually, a QL client enters its critical section by acquiring a single lock-the QL lock. Thus the CL-QL lock behaves similar to a scalable queue-based lock under contention. Sporadic cold code executes the CL protocol and does not impact the overall performance. If the contention is low, cold code paths incur no overhead since their code is unchanged. The QL lock clients may incur the cost of additional CL lock acquisition, since there may not be successors and predecessors. This cost is negligible (less than five instructions) when the lock is uncontended.

[00057] In summary, with the unified lock, vast amounts of

perform ance-non- critical, legacy code can continue to use an old

centralized context-less locking interface and remain unchanged; whereas small amounts of performance-critical code or newly introduced code can adopt the modified MCS queue-based locking interface and enjoy scalability. The unified lock design ensures mutual exclusion when these two different locking paradigms interact. Legacy binaries require no recompilation.

Example performance results show that the unified lock may deliver the same high performance as queue-based locks while being legacy

compatible.

[00058] While the claimed subject matter has been particularly shown and described with reference to the foregoing examples, those skilled in the art will understand that many variations may be made therein without departing from the intended scope of subject matter in the following claims. This description should be understood to include all novel and non-obvious combinations of elements described herein, and claims may be presented in this or a later application to any novel and non-obvious combination of these elements. The foregoing examples are illustrative, and no single feature or element is essential to ail possible combinations that may be claimed in this or a later application. Where the claims recite "a" or "a first" element of the equivalent thereof, such claims should be understood to include

incorporation of one or more such elements, neither requiring nor excluding two or more such elements.

Claims

What is claimed is: CLAIMS

1 . A unified lock, comprising:

a plurality of processors having a plurality of resources executing a plurality of threads executing instructions requesting exclusive control of a subset of the plurality of resources;

a first level lock module with a first queue to allow a first set of threads to acquire an individual lock on the subset of the plurality of resources; and a second level lock module with a second queue to allow a second set of threads to acquire the individual lock on the subset of the plurality of resources by having a first thread in the second queue acquire the individual lock through the first level lock module and then pass the individual lock to successive second set of threads in the second level lock module within a bounded limit before relinquishing the individual lock to the remaining first set of threads in the first level lock module.

2. The unified lock of claim 1 wherein the first set of threads are without context and the second set of threads are with context that is used to perform local spinning by each of the second set of threads.

3. The unified lock of claim 1 wherein the second level lock module is a queue-based lock with a local spinning property.

4. The unified lock of claim 1 wherein the first level lock module is a context-less lock.

5. The unified lock of claim 1 wherein a sequence number is passed to successive second set of threads and if the sequence number reaches a predetermined value, the individual lock is relinquished to the remaining first set of threads in the first level lock module.

6. A non-transitory computer readable medium for a unified lock, comprising instructions that when executed by a processor having one or more cores in one or more threads cause the processor to:

implement a context-less lock accepting a first set of threads executing on the processor requesting an individual lock on a set of resources available on the processor;

implement a queueing lock with context accepting a second set of threads executing on the processor requesting the individual lock, wherein the queueing lock first requests the individual lock in the context-less lock and passes the individual lock to successive second set of threads within a bounded limit before relinquishing the individual lock to the remaining first set of threads in the context-less lock.

7. The computer readable medium of claim 6, wherein when an initial thread in the second set of threads arrives in the queueing lock without other threads in the second set of threads, the queueing lock conveys a message to the initial thread to compete for the individual lock in the context-less lock.

8. The computer readable medium of claim 7, wherein once the initial thread relinquishes the individual lock and there are successive threads in the second set of threads, a sequence number is passed to the next successive thread in the queueing lock.

9. The computer readable medium of claim 7, wherein once the initial thread acquires the individual lock, the bounded limit is based on a

monotonicaiiy growing metric with a local counting property is started and the individual lock is relinquished by the queueing lock when a predetermined limit for the metric is reached or there are no successive threads in the second set of threads.

10. The computer readable medium of claim 9 wherein the bounded limit on the monotonicaiiy growing metric is at least one from the set of a bound number of threads, a bound by time, a bound by CPU cycles, a bound by the number of lock passing, and a bound by cache misses,

1 1 . A method of implementing a unified lock, comprising:

accepting a first set of threads executing on the processor requesting an individual lock on a set of resources available on the processor to implement a context-less lock;

accepting a second set of threads executing on the processor requesting the individual lock to implement a queueing lock with context; requesting the individual lock in the context-less lock by the queueing lock; and

passing the individual lock to a successive second set of threads within a bounded limit before relinquishing the individual lock to the remaining first set of threads in the context-less lock.

12. The method of claim 1 1 , wherein the bounded limit is based on a number of lock passes in the queueing lock,

13. The method of claim 12, wherein a sequence number is used to track the number of lock passes.

14. The method of claim 1 1 , further comprising sending a message to a successor in the successive second set of threads to acquire the context-less lock when the bounded lim it is reached.

15. The method of claim 1 1 , wherein the processor is a non-uniform memory access (NUMA) processor, and the queueing lock is comprised of multiple queueing locks each assigned to a separate compute node on the UMA processor.