EP3803597A1 - Device and method for serializing access to a shared resource - Google Patents

Device and method for serializing access to a shared resource

Info

Publication number
EP3803597A1
Authority
EP
European Patent Office
Prior art keywords
lwp
thread
shared resource
context information
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18750154.9A
Other languages
German (de)
French (fr)
Inventor
Shay Goikhman
Eliezer Levy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of EP3803597A1 publication Critical patent/EP3803597A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/526Mutual exclusion algorithms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores


Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The present invention relates to the field of task parallelization and multi-threaded runtime environments. The present invention in particular relates to thread synchronization and scheduling in multithreaded applications, and thus provides a device 100 for serializing access to a shared resource 101, wherein the device 100 is configured to operate a thread 102 to execute a light-weight process, LWP, 103; probe access by the LWP 103 to the shared resource 101, and, if the shared resource 101 is locked; push the LWP 103 to a delegation queue 104.

Description

DEVICE AND METHOD FOR SERIALIZING ACCESS TO A SHARED RESOURCE
TECHNICAL FIELD
The present invention relates to the field of task parallelization and multi-threaded runtime environments. The present invention in particular relates to thread synchronization and scheduling in multithreaded applications, and thus provides a device and method for synchronized and scheduled access to a shared resource.
BACKGROUND
Presently, task-parallel and multithreaded non-uniform memory access (NUMA)-aware run-time environments are notoriously difficult to develop, in particular when guaranteed performance and scalability are required. Usually, kernel-level threads, heavy-weight synchronization such as futexes (i.e. a locking mechanism), and concurrent structures are abstracted in an idiomatic way to present to the user a simplified and specialized programming environment. As such, these run times are exposed to thread synchronization overheads and operating system (OS) scheduling interactions, which are all amplified by NUMA effects.
Thread synchronization and scheduling in a non-trivial multithreaded application on a current multicore NUMA server architecture has re-emerged as an even greater problem than previously known. Efficient and scalable synchronization among threads is needed to protect shared data structures in an application and in an OS kernel, in order to enable performance and scalability of an application. With the growing number of cores in a computing system, the demand for finer-grained synchronization to achieve scalability grows as well. Thread scheduling is dealt with within an OS kernel in a way that is transparent to an application. A scheduler might itself suffer from scalability problems when a large enough number of threads needs urgent servicing. Coarse-grained synchronization primitives, e.g. semaphores and futexes, build their functionality on the ability to interact with scheduling; otherwise, an OS kernel does not take note of user-space thread contention. Known problems related to the interaction between scheduling and synchronization are priority inversion and convoying, each aggravating the performance and scalability of an executed application. Recently, progress can be observed in concurrency theory and in the design of synchronization primitives and concurrent data structures, most notably locks for fine-grained synchronized concurrent data structure operations. Yet, no single design prevails in the general case. For advanced lock designs, the lock's interfaces become esoteric, so that for a large application, a run time or an OS kernel cannot be modified to use such locks. Kernels in particular, because of their monolithic design and their multiple subsystems that share large and complex data structures, have been lagging in applying more advanced synchronization.
Many threading and parallelizing run-time environments (e.g. pthread, OpenMP, TBB) have emerged that try to combine kernel-level threads with user-level and even compile-time knowledge to enhance mostly programmer productivity. Yet, these run times are exposed to OS scheduling decisions, while they are handling synchronization by themselves with locks and mutexes, and thus are vulnerable to priority inversions and convoying. OpenMP NUMA extension support has been proposed, but there is no contribution to a standard yet. With further proliferation of many-core and non-uniform memory architectures that demand even more fine-grained synchronization and significantly enhanced locality, it is not possible for the conventional run times based on the above-mentioned premises to provide the required scalability and performance.
SUMMARY
In view of the above-mentioned problems and disadvantages, the present invention aims to improve the concept governing the design of the conventional run time environments.
The present invention has the objective to provide a device and method that enable thread synchronization and scheduling that avoids the above-mentioned problems and emphasizes single-thread locality of reference. The device specifically enables constructing efficient task-parallel NUMA-aware run times. The present invention handles a generic task-parallel application, each task being run on a cooperating user-level thread (or a fiber, or LWP, which are interchangeable terms) that accesses a global data structure (i.e. a shared resource), wherein synchronization could be semantically specified in the application's code in terms of locks. The task's code that the fiber executes is encapsulated in a function or a class method. The present invention schedules and synchronizes fibers over OS threads, which are identified with CPU cores, using a so-called fiber delegation queue, i.e. a queue to which fibers are delegated for execution, in particular in a serialized manner. The delegation queue may also be called QD. In other words, the present invention dynamically translates lock-based semantics into delegation.
The objective of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.
A first aspect of the present invention provides a device for serializing access to a shared resource, wherein the device is configured to operate a thread to execute a light weight process, LWP; probe access by the LWP to the shared resource, and, if the shared resource is locked; push the LWP to a delegation queue.
This ensures that, upon detection of contention at the shared resource, the LWP can be de-scheduled. By means of the delegation queue, the enqueued LWPs can later be processed in a serial manner by means of a single thread. That is, contention of multiple LWPs that try to access the shared resource at the same time is mitigated, since the LWPs are processed in a serialized manner based on the delegation queue.
In an implementation form of the first aspect, the device is further configured to operate a helper thread to pop the LWP from the delegation queue and execute the LWP by means of the shared resource.
Since a single thread executes conflicting operations on the shared resource, locality of a footprint of the operations can be preserved in the thread cache, thereby delivering high throughput. In a further implementation form of the first aspect, the device is further configured to operate the thread to, if the shared resource is locked, obtain LWP context information associated with the LWP, and to push the LWP context information to the delegation queue.
This ensures that context information associated with the LWP is pushed to the delegation queue, which enables a state of execution of the LWP to be provided to the delegation queue, where it can be read by a further entity which can continue executing the LWP from its previous state, based on the LWP context information.
In a further implementation form of the first aspect, the device is further configured to operate the helper thread to pop the LWP context information from the delegation queue and execute the LWP based on the LWP context information.
This ensures that the helper thread can continue executing the LWP at its previous state, which is indicated by means of the LWP context information.
In a further implementation form of the first aspect, the helper thread is the first thread to access the shared resource.
This ensures that access to the shared resource can be effectively serialized, since the thread which accesses the shared resource first is set to be the helper thread, which subsequently processes all operations that require access to the shared resource.
In a further implementation form of the first aspect, the device is further configured to operate the helper thread to, after completion of executing the LWP by means of the shared resource, push the LWP to a ready queue.
This ensures that, after the LWP no longer requires the shared resource for execution, the LWP is pushed to a ready queue, from where it can be taken by other threads for further execution.
In a further implementation form of the first aspect, the device is further configured to operate the helper thread to, after completion of executing the LWP by means of the shared resource, update the LWP context information, thereby obtaining updated LWP context information, and push the updated LWP context information to the ready queue.
This ensures that the LWP context information is stored in the ready queue, where it can in turn be read by the other thread, thereby enabling that thread to continue execution of the LWP from its previous state.
In a further implementation form of the first aspect, the device is further configured to operate the thread by a first core of the device.
This ensures that multiple cores or CPUs of a multi-processor system can be used effectively by the device according to the present invention.
In a further implementation form of the first aspect, the device is further configured to operate the helper thread by a second core of the device.
This ensures that multiple cores or CPUs of a multi-processor system can be used effectively by the device according to the present invention.
In a further implementation form of the first aspect, the LWP context information comprises a predefined application binary interface, ABI.
This enables storing a state of the LWP in the LWP context information by means of a predefined ABI.
In a further implementation form of the first aspect, the LWP context information comprises an ABI specified LWP register context.
This enables storing a state of the LWP in the LWP context information by means of an LWP register context. In a further implementation form of the first aspect, the device is further configured to operate the thread to pop the LWP from the ready queue and continue executing the LWP. This ensures that the thread that previously executed the LWP can continue execution after the part of the LWP which required access to the shared resource was executed by the helper thread.
In a further implementation form of the first aspect, the device is further configured to operate the thread to pop the updated LWP context information from the ready queue and continue executing the LWP based on the updated LWP context information.
This ensures that the LWP context information is also provided back to the thread. Thereby, the thread can continue execution of the LWP from the state that the LWP had when its execution by means of the helper thread finished.
In a further implementation form of the first aspect, the device is further configured to operate a second thread, to pop the LWP from the ready queue and continue executing the LWP, wherein the device preferably is further configured to operate the second thread to pop the updated LWP context information from the ready queue and continue executing the LWP based on the updated LWP context information.
Thereby, not only the thread which previously executed the LWP, but also any other thread in a multithreaded environment can continue executing the LWP, once the helper thread has finished processing the LWP. This thread can also obtain the LWP context information, for executing the LWP starting from its previous state.
A second aspect of the present invention provides a method for serializing access to a shared resource, wherein the method comprises the steps of operating a thread to execute a light-weight process, LWP; probing access by the LWP to the shared resource, and, if the shared resource is locked; pushing the LWP to a delegation queue. In an implementation form of the second aspect, the method further comprises operating a helper thread to pop the LWP from the delegation queue and execute the LWP by means of the shared resource.
In a further implementation form of the second aspect, the method further comprises obtaining LWP context information associated with the LWP, and operating the thread to, if the shared resource is locked, push the LWP context information to the delegation queue.
In a further implementation form of the second aspect, the method further comprises operating the helper thread to pop the LWP context information from the delegation queue and execute the LWP based on the LWP context information.
In a further implementation form of the second aspect, the helper thread is the first thread to access the shared resource.
In a further implementation form of the second aspect, the method further comprises operating the helper thread to, after completion of executing the LWP by means of the shared resource, push the LWP to a ready queue.
In a further implementation form of the second aspect, the method further comprises operating the helper thread to, after completion of executing the LWP by means of the shared resource, update the LWP context information, thereby obtaining updated LWP context information, and push the updated LWP context information to the ready queue.
In a further implementation form of the second aspect, the method further comprises operating the thread by a first core.
In a further implementation form of the second aspect, the method further comprises operating the helper thread by a second core.
In a further implementation form of the second aspect, the LWP context information comprises a predefined application binary interface, ABI. In a further implementation form of the second aspect, the LWP context information comprises an ABI specified LWP register context.
In a further implementation form of the second aspect, the method further comprises operating the thread to pop the LWP from the ready queue and continue executing the LWP.
In a further implementation form of the second aspect, the method further comprises operating the thread to pop the updated LWP context information from the ready queue and continue executing the LWP based on the updated LWP context information.
In a further implementation form of the second aspect, the method further comprises operating a second thread, to pop the LWP from the ready queue and continue executing the LWP, wherein the method preferably further comprises operating the second thread to pop the updated LWP context information from the ready queue and continue executing the LWP based on the updated LWP context information.
The method of the second aspect and its implementation forms include the same advantages as the device according to the first aspect and its implementation forms.
A third aspect of the present invention provides a computer program product comprising a program code for controlling the device according to the first aspect or any one of its implementation forms, or for performing, when running on a computer, the method according to the second aspect or any one of its implementation forms.
The computer program product of the third aspect includes the same advantages as the device according to the first aspect and its implementation forms.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows a schematic view of a device according to an embodiment of the present invention.
FIG. 2 shows a schematic view of a device according to an embodiment of the present invention in more detail.
FIG. 3 shows a schematic view of ready queues associated to cores.
FIG. 4 shows a code listing of functionality provided by the present invention.
FIG. 5 shows another code listing of functionality provided by the present invention.
FIG. 6 shows another code listing of functionality provided by the present invention.
FIG. 7 shows another code listing of functionality provided by the present invention.
FIG. 8 shows a schematic view of an embodiment of the present invention.
FIG. 9 shows a schematic view of a method according to an embodiment of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Fig. 1 shows a device 100 for serializing access to a shared resource 101 according to an embodiment of the present invention. As it is shown in Fig. 1, the device 100 is configured to operate a thread 102. The thread 102 executes a light-weight process (LWP) 103. In the field of computer operating systems, an LWP 103 is a means for providing multitasking capabilities. An LWP 103 e.g. runs in user space on top of a single kernel thread and shares its address space and system resources with other LWPs within the same process. Multiple user-level threads, managed by a thread library, can be placed on top of one kernel-managed thread, allowing multitasking to be done at the user level, which allows for achieving performance benefits. The shared resource 101 can e.g. be a globally used data structure, or any kind of device, e.g. a storage, memory, I/O, or network device of a computer system.
To execute the LWP 103, the device 100 is further configured to probe access by the LWP 103 to the shared resource 101. If the shared resource 101 is locked, the device 100 is configured to push the LWP 103 to a delegation queue 104. In the delegation queue 104, all LWPs 103 which need to access the same shared resource 101 can be enqueued. Once the LWPs 103 are in the delegation queue 104, they can be processed in a serialized manner, by taking each LWP 103, one by one, from delegation queue 104, for executing them by means of shared resource 101. The delegation queue 104 can also be called QD 104.
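For illustration only, the following minimal C sketch shows the probe-and-delegate behaviour described above for Fig. 1 from the point of view of the delegating thread 102. All names (fiber_t, qd_t, qd_try_lock, qd_push, switch_to_scheduler) are hypothetical and do not correspond to the code listings of Figs. 4 to 7; the helpers are only declared, since the point here is the control flow, not a particular implementation.

/* Minimal sketch of the probe-and-delegate step of Fig. 1.
 * All types and helpers are hypothetical illustrations. */
typedef struct fiber fiber_t;      /* an LWP 103 with its saved context    */
typedef struct qd    qd_t;         /* delegation queue 104 of a resource   */

int  qd_try_lock(qd_t *qd);              /* returns 1 if the resource was free */
void qd_push(qd_t *qd, fiber_t *self);   /* enqueue the fiber for the helper   */
void switch_to_scheduler(fiber_t *self); /* save context, resume the scheduler */

void access_shared_resource(qd_t *qd, fiber_t *self)
{
    if (qd_try_lock(qd)) {
        /* Resource was free: this thread proceeds (and may become the
         * helper thread described for Fig. 2). */
        return;
    }
    /* Resource is locked: delegate instead of spinning or blocking. */
    qd_push(qd, self);             /* push the LWP (and its context) to the QD */
    switch_to_scheduler(self);     /* de-schedule; the core picks a ready fiber */
    /* Execution resumes here later, after a helper ran the critical section. */
}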
Fig. 2 shows a device 100 according to an embodiment of the present invention in more detail. The device 100 of Fig. 2 includes all features and functionality as the device 100 of Fig. 1. To this end, identical features are labelled with identical reference signs. All features that are going to be described in view of Fig. 2 are optional features of the device 100.
As it is illustrated in Fig. 2, the device 100 is further configured to operate an optional helper thread 201. To allow for serialized execution of LWPs 103, which are enqueued in the delegation queue 104, the helper thread 201 is configured to pop the LWP 103 from the delegation queue 104, and execute the LWP 103. The helper thread 201 in particular can access the shared resource 101 and can therefore execute the LWP 103 by means of the shared resource 101. That is, an LWP 103, which previously could not access the shared resource 101 because the shared resource 101 was locked, can now access the shared resource 101 by means of the helper thread 201, which executes the LWP 103 and which has access to the shared resource 101. The helper thread 201 in particular has access to the shared resource 101 because the helper thread 201 is the first thread to access the shared resource 101. As it is also shown in Fig. 2, the device 100 can optionally obtain LWP context information 202. The LWP context information 202 can also be regarded as an execution context of the LWP 103. The LWP context information 202 is associated with the LWP 103. The LWP context information 202 can in particular store a present state of execution of the LWP 103. The device 100 can further be configured to operate the thread 102 to, if the shared resource is locked, push the LWP context information 202 to the delegation queue 104. That is, now the delegation queue 104 holds the LWP 103 and the associated context information 202.
As a result, the device 100 can now optionally be configured to operate the helper thread 201 to pop the LWP context information 202 from the delegation queue 104. By doing so, the helper thread 201 can obtain information about a previous state of execution of the LWP 103, before the LWP 103 was pushed to the delegation queue 104. Thus, the device 100 can execute the LWP 103 based on the LWP context information 202. In other words, the helper thread 201 can start execution of the LWP 103 at the state of execution at which the LWP 103 arrived, before it was pushed to the delegation queue 104. Again in other words, the helper thread 201 switches into the context (i.e. the LWP context information 202) of the LWP 103.
The device 100 can further be optionally configured to push the LWP 103 to a ready queue 203 after completion of execution of the LWP 103 by means of the shared resource 101. In other words, once the part of the LWP 103 that required access to the shared resource 101 for execution is finished, the helper thread 201 can push the LWP 103 to the ready queue 203. Being in the ready queue 203, the LWP 103 can be popped by any other thread for further execution (e.g. an execution that does not require access to the shared resource 101).
Further, after execution of the LWP 103 by means of the shared resource 101 is completed, the helper thread 201 updates the LWP context information 202. That is, the LWP context information 202 now contains information regarding the state of the LWP 103 after its execution by the helper thread 201 is completed. Thereby, updated LWP context information 202’ is obtained. The updated LWP context information 202’ is pushed to the ready queue 203, where it can be popped by means of any other thread, so that any other thread can pop the LWP 103 and the updated LWP context information 202’. That is, any other thread can continue execution of the LWP 103 that was popped from the ready queue 203, starting from the state of the LWP 103 according to the updated LWP context information 202’. The device 100 further can be configured to operate the thread 102 to pop the LWP
103 from the ready queue 203. The thread 102 then can continue executing the LWP 103. That is, the thread 102 that initially pushed the LWP 103 to the delegation queue
104 can now pop the LWP 103 from the ready queue 203 to continue execution of the LWP 103, in particular after access to the shared resource 101 is no longer needed.
The thread 102 can also pop the updated LWP context information 202’ from the ready queue 203 and continue executing the LWP 103 based on the updated LWP context information 202’. That is, the thread 102 can continue executing the LWP 103 beginning at the last state of the LWP 103 that is stored in the updated context information 202’.
The device 100 is in particular suitable for use in a multithreaded runtime environment. That is, the device 100 in particular can operate multiple threads, e.g. to run them in parallel or synchronize them. Therefore, the device 100 can further optionally be configured to operate a second thread 206 to pop the LWP 103 from the ready queue 203 and continue executing the LWP 103. The device 100 preferably can operate the second thread 206 to pop the updated LWP context information 202' from the ready queue 203 and can continue executing the LWP 103 based on the updated LWP context information 202'. As the device 100 can be, or can be used in, a multi-processor system, the device 100 can further be optionally configured to operate the thread 102 by a first core 204 of the device 100, and/or can optionally further be configured to operate the helper thread 201 by a second core 205 of the device 100.
Further optionally, the LWP context information 202 can comprise a predefined application programming interface (API), e.g. for manipulation of the LWP context information 202. The API can e.g. be one of the Linux functions getcontext() or setcontext(), which allow user-level context switching between multiple threads of control within a process.
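As a concrete, runnable illustration of such user-level context switching, the following C program uses the POSIX <ucontext.h> family to which getcontext() and setcontext() belong (together with makecontext() and swapcontext()). It only demonstrates the mechanism of suspending and resuming a fiber on its own stack; it is not code from the described embodiment.

/* Demonstration of user-level context switching with <ucontext.h>:
 * a fiber context is created on its own stack and control is passed
 * back and forth with swapcontext(), similarly to how an LWP context
 * could be suspended and resumed. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_body(void)
{
    printf("fiber: entered critical-section-like work\n");
    swapcontext(&fiber_ctx, &main_ctx);   /* yield back to the main context */
    printf("fiber: resumed, finishing\n");
    /* falling off the end resumes uc_link (main_ctx) automatically */
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    if (!stack) return 1;

    getcontext(&fiber_ctx);               /* initialize with the current state */
    fiber_ctx.uc_stack.ss_sp   = stack;   /* give the fiber its own stack      */
    fiber_ctx.uc_stack.ss_size = 64 * 1024;
    fiber_ctx.uc_link          = &main_ctx;
    makecontext(&fiber_ctx, fiber_body, 0);

    printf("main: switching into fiber\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run the fiber until it yields     */
    printf("main: fiber yielded, resuming it\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* resume the fiber until it returns */
    printf("main: fiber finished\n");

    free(stack);
    return 0;
}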
Further optionally, the LWP context information 202 can comprise an application binary interface (ABI) specified LWP register context. The ABI can e.g. be a System V Application Binary Interface.
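For orientation only, the register set that an ABI-specified LWP register context typically needs to hold under the System V AMD64 ABI can be sketched as follows. The struct below is illustrative; the actual layout used by an embodiment may differ.

/* Illustrative layout (not the patent's) of a register context for
 * x86-64 System V: the callee-saved registers plus stack pointer and
 * resume address are what a cooperative switch must save, since
 * caller-saved registers are already dead across the call. */
#include <stdint.h>

typedef struct lwp_regs {
    uint64_t rbx, rbp, r12, r13, r14, r15;  /* callee-saved registers */
    uint64_t rsp;                           /* stack pointer          */
    uint64_t rip;                           /* resume address         */
} lwp_regs_t;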
As it is now exemplarily described in view of Fig. 3, an example embodiment of the device 100 according to the invention may consist of a set of pinned CPU cores, each being associated with a ready queue. The ready queues of the cores hold ready task fibers, that is, user-level contexts (i.e. the LWPs 103). A user-level context structure (i.e. the LWP 103) may hold a compatible user-level ABI (this ABI standard defines all methods concerned with binary execution format and convention, e.g. a call function, registers for passing arguments, the binary frame of the stack, etc.), the necessary register context (i.e. the LWP context information 202), and the stack, so that control can be passed to the fiber by setting these registers. The ready queues are interconnected through an all-to-all mesh of point-to-point message-passing channels, such that each core can pass a fiber to any other core. The mesh configuration design enables each communication point to be either a single producer or mostly a single consumer, thus avoiding contention of the channels. The cores service their respective ready queues by de-queueing a ready task fiber and switching into it. When the ready queue is empty, the core engages in job stealing from its neighboring cores' ready queues. That is, an idle core can pick an LWP from another core's ready queue and execute it. The cores are aware of their NUMA interconnect distances, such that the job-stealing scheduling guarantees load-balancing both among the cores of a socket and across sockets, with the preference that local cores access local data. The throughput and latency of the task processing are standard measures of efficiency that also determine scalability. In queue delegation locking according to the prior art, function pointers to critical sections (i.e. sections that require access to the shared resource) are passed through a queue to a helper thread, requiring specific restructuring of the code. The delegating threads can then "detach", i.e. continue execution after a successful enqueue, or wait (through a future mechanism) until the helper thread is done with executing the associated critical section.
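The per-core scheduling loop described above, including job stealing from increasingly distant neighbors, can be sketched as follows. All types and helper functions are assumptions introduced for illustration and are not the functions shown in the figures; in particular, the channel and queue implementations are only named, not implemented.

/* Sketch of a core's scheduling loop: serve the local ready queue,
 * otherwise steal from neighbours in order of NUMA distance.
 * All names here are hypothetical. */
typedef struct fiber fiber_t;
typedef struct core  core_t;

fiber_t *ready_queue_pop(core_t *c);               /* NULL if empty            */
fiber_t *steal_from(core_t *victim);               /* NULL if nothing to steal */
core_t  *neighbour_by_distance(core_t *c, int d);  /* NUMA-ordered neighbours  */
void     switch_into_fiber(core_t *c, fiber_t *f); /* returns when f yields    */

void core_scheduler_loop(core_t *self, int ncores)
{
    for (;;) {
        fiber_t *f = ready_queue_pop(self);
        /* Local queue empty: try neighbours in order of NUMA distance,
         * so that local cores prefer local data. */
        for (int d = 1; f == NULL && d < ncores; d++)
            f = steal_from(neighbour_by_distance(self, d));
        if (f != NULL)
            switch_into_fiber(self, f);   /* de-queue a ready fiber and run it */
    }
}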
According to the present invention however, a data structure, i.e. a delegation queue, is used to hold the contexts (i.e. the LWP context information 202) of delegating fibers which block on contention for a resource. In the execution semantics of the present invention, a fiber entering the critical section and discovering that the resource is taken switches out and stores its context (i.e. the LWP context information), putting its fiber context in the resource's delegation queue, and control returns to servicing any other ready fiber from the core's ready queue.
Fig. 4 and Fig. 5 present code that manages the delegating and the helper fibers.
Fig. 4 shows an enqueue function. The code finds a unique slot in the dq array in which to place the fiber context and then context-switches to a scheduler, while placing the old context in the array.
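Since the listing of Fig. 4 is only available in the figure itself, the following is a sketch, under assumptions, of an enqueue function with the described behaviour: a unique slot in the dq array is claimed, the suspended fiber's context is placed there, and control switches to the per-core scheduler. The slot-claiming scheme, the fixed array size and the scheduler_ctx variable are assumptions; a real implementation would additionally have to ensure that a context is fully saved before a helper may resume it.

#include <stdatomic.h>
#include <ucontext.h>

#define DQ_SLOTS 256                     /* assumed fixed queue capacity         */

typedef struct dq {
    _Atomic unsigned tail;               /* next free slot index                 */
    ucontext_t      *slot[DQ_SLOTS];     /* suspended fiber contexts             */
} dq_t;

extern ucontext_t scheduler_ctx;         /* per-core scheduler context (assumed) */

void enqueue(dq_t *dq, ucontext_t *self)
{
    /* claim a unique slot in the dq array for this fiber's context */
    unsigned i = atomic_fetch_add(&dq->tail, 1) % DQ_SLOTS;
    dq->slot[i] = self;
    /* save the current register context into *self and switch to the
     * scheduler; a helper thread will later swap back into *self
     * (a real implementation must publish the slot only after the save) */
    swapcontext(self, &scheduler_ctx);
}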
Fig. 5 shows the lock function that manages dq delegating and helper thread roles in the NUMA-aware lock algorithm. If a resource is free, the thread becomes the helper thread; see Fig. 5, lines 6 - 12; it opens the associated QD, returns from the lock function, and then executes its own critical section code. The delegating thread, finding the locks taken, calls _enqueue_ at Fig. 4, suspending its fiber at a unique queue slot, and context switches to the scheduler to process a next fiber from the core’s ready queue. Fig. 6 shows an unlock function. As it is shown, a helper thread repeatedly executes enqueued fiber contexts, switching to the next fiber in the queue, while sending completed fiber contexts to some ready queue on the socket. The helper thread executes its critical section and at its end calls unlock, see Fig. 6. Upon entering the function, it determines if the queue is not empty and if it is, it switches to the first context in the queue, see Fig. 6, lines 8 - 16. The function queue_fiber_context_swap is a composite function that enables switching into the context provided as an argument while transmitting the old context to a communication queue, specified as the first argument. The function node_other_rand_q finds a communication queue corresponding to a ready queue of a core on the same socket. Otherwise, the QD is closed and its locks released, see Fig. 6, lines 18 - 23. The function other_node_delegate and hdq->glock are used to support delegation to another socket in a NUMA multi-socket case. While the helper thread completes execution of the first critical section it encounters another unlock call, and continues to execute the unlock function at Fig. 6, lines 32 - 39. It probes whether all the remaining contexts in the QD are done and closes the queue if it is the case. Otherwise, similarly to the first enqueued context, it resumes the next context in the queue while sending the completed context to some ready queue on the socket. Please note that the resumed context wakes up on the helper core under the impression it just returned from the lock function, and the completed context will wake up on some core under the impression it just returned from unlock function. This design supports NUMA awareness extending the prior art to enable passing and resuming of fibers on a neighbor socket. The QD is made hierarchical by composing a global spinlock and a local per socket fiber QD, see Fig. 5, lines 6-9 and Fig. 6, lines 20-21 and 28-29. In Fig. 5, a local QD lock is tried, if it is free, then it is taken and then the global lock is tried. If the global lock is taken, the fiber de-schedules itself placing its context in the first entry of the QD and returns to serve its ready queue in the function enqueue_and_open, which is not shown here for brevity. The function enqueue_and_open opens the local delegation queue and enqueues its context, leaving it open for subsequent local fibers to enqueue. When the global lock holder is about to release the global lock it notices whether there are fibers sleeping on the resource’s QD on any other socket in function other_node_delegate in Fig. 6. The lock holder chooses the next closest socket in a predefined, yet avoiding starvation way, and sends the whole of the that socket’s QD to that socket’s Ready Queue core, see Fig. 7. The function get_nn_order_socks returns socket nodes in the order of a Hamilton path of the sockets’ topology interconnect. Each node is tried to see if it has buffered fibers on its DQ on the resource. 
If that is the case, the first fiber on the QD is delivered to some ready queue on that socket, see Fig. 7, line 13. In this way the entire QD will be executed by the corresponding core. When that core is done with the QD, it calls other_node_delegate in its turn, and the next socket on the Hamiltonian path is probed.
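For readers unfamiliar with queue delegation locking, the following fiber-free C11 sketch illustrates the underlying helper/delegator pattern on a single socket: the first thread to take the lock becomes the helper, opens the queue, and later drains and executes the critical sections delegated by threads that found the lock taken. This is only an illustration under assumed names (qd_lock_t, qd_delegate, QD_SLOTS); the invention additionally parks and resumes whole fiber contexts and forwards a per-socket queue between sockets, which this sketch does not show.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QD_SLOTS 64                       /* illustrative queue capacity */

typedef void (*cs_fn)(void *arg);         /* a delegated critical section */

typedef struct {
    atomic_flag    lock;                  /* mutual exclusion on the resource      */
    atomic_uint    tail;                  /* slot counter; >= QD_SLOTS means closed */
    _Atomic(cs_fn) fn[QD_SLOTS];          /* published critical sections           */
    void          *arg[QD_SLOTS];
} qd_lock_t;

/* Static initializer: lock clear, queue closed, no published entries. */
#define QD_LOCK_INIT { ATOMIC_FLAG_INIT, QD_SLOTS, { NULL }, { NULL } }

/* Execute, or delegate, one critical section on the shared resource. */
void qd_delegate(qd_lock_t *q, cs_fn fn, void *arg)
{
    for (;;) {
        if (!atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire)) {
            /* Helper path: open the queue, run our own critical section. */
            atomic_store_explicit(&q->tail, 0, memory_order_release);
            fn(arg);

            /* Close the queue and drain what the delegators enqueued.    */
            unsigned n = atomic_exchange_explicit(&q->tail, QD_SLOTS,
                                                  memory_order_acq_rel);
            if (n > QD_SLOTS)
                n = QD_SLOTS;
            for (unsigned i = 0; i < n; i++) {
                cs_fn f;
                /* Wait until the owner of slot i has published its entry. */
                while ((f = atomic_load_explicit(&q->fn[i],
                                                 memory_order_acquire)) == NULL)
                    ;
                f(q->arg[i]);                       /* run it on its behalf */
                atomic_store_explicit(&q->fn[i], NULL, memory_order_relaxed);
            }
            atomic_flag_clear_explicit(&q->lock, memory_order_release);
            return;
        }

        /* Delegating path: try to hand the critical section to the helper. */
        if (atomic_load_explicit(&q->tail, memory_order_acquire) < QD_SLOTS) {
            unsigned slot = atomic_fetch_add_explicit(&q->tail, 1,
                                                      memory_order_acq_rel);
            if (slot < QD_SLOTS) {
                q->arg[slot] = arg;
                atomic_store_explicit(&q->fn[slot], fn, memory_order_release);
                return;                   /* the helper will execute it    */
            }
        }
        /* Queue closed or full: retry (a real implementation would instead
         * yield to the per-core scheduler here, as the fibers above do).  */
    }
}
```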
The present invention further extends the fiber QD to support multiple readers by accommodating each sleeping reader in a per-core sleeping-on-an-event list, and by transforming the lock and unlock functions into reader/writer lock and unlock functions. The event is a memory location such that, when it is set, every core having that event on its list wakes up the associated fiber.
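As an illustration of this reader extension, the following sketch shows one possible shape of the per-core sleeping-on-an-event list and its wake-up scan; sleeper_t, core_state_t and ready_queue_push are assumed names for this sketch, not the invention's actual data structures.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct fiber fiber_t;            /* opaque fiber handle (assumed) */

typedef struct sleeper {
    _Atomic(bool)  *event;               /* wake when *event becomes true */
    fiber_t        *fiber;               /* the sleeping reader           */
    struct sleeper *next;
} sleeper_t;

typedef struct {
    sleeper_t *sleepers;                 /* this core's sleeping readers  */
} core_state_t;

/* Assumed helper: hand a runnable fiber to this core's ready queue. */
extern void ready_queue_push(core_state_t *core, fiber_t *f);

/* Called from the per-core scheduler loop: wake every reader whose event
 * has been set and unlink it from the sleeping list. */
static void wake_ready_readers(core_state_t *core)
{
    sleeper_t **link = &core->sleepers;
    while (*link != NULL) {
        sleeper_t *s = *link;
        if (atomic_load_explicit(s->event, memory_order_acquire)) {
            *link = s->next;                    /* unlink ...              */
            ready_queue_push(core, s->fiber);   /* ... and make it runnable */
        } else {
            link = &s->next;
        }
    }
}
```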
As will now be described with reference to Fig. 8, a simplified Two-Phase Locking (2PL) transaction processing database using a hierarchical multi-reader fiber QD according to the present invention is shown. The database consists of a set of rows; each row is associated with a fiber QD. Transactions are represented as embedded C functions. Each core in the database is associated with a ready queue of fibers and with a list of sleeping reader fibers, each associated with an event, represented as a pointer to a boolean. The ready queues are interconnected through communication channels, implemented as FF-queues, in an all-to-all topology to minimize contention. The cores process their ready queues and the associated lists and, if these are empty, try to steal work from increasingly farther neighbors. The transaction processing system runs the YCSB benchmark on an eight-socket 192-core machine and achieves very good performance and excellent scalability under high contention.
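To illustrate the 2PL discipline followed by such transactions, the following hedged sketch shows a transaction over two rows; here the per-row fiber QD is stood in for by an ordinary pthread reader/writer lock, and row_t and txn_transfer are illustrative names only.

```c
#include <pthread.h>

typedef struct {
    pthread_rwlock_t lock;     /* per-row lock; a fiber QD in the invention */
    long             value;
} row_t;

/* Add rows[src].value to rows[dst].value under two-phase locking: the
 * growing phase takes every needed lock (in a fixed index order to avoid
 * deadlock), the work is done, and only then does the shrinking phase
 * release the locks.  No lock is released before all are acquired.       */
static void txn_transfer(row_t *rows, int src, int dst)
{
    if (src == dst)
        return;

    int lo = src < dst ? src : dst;
    int hi = src < dst ? dst : src;

    /* Growing phase. */
    pthread_rwlock_wrlock(&rows[lo].lock);
    pthread_rwlock_wrlock(&rows[hi].lock);

    rows[dst].value += rows[src].value;

    /* Shrinking phase. */
    pthread_rwlock_unlock(&rows[hi].lock);
    pthread_rwlock_unlock(&rows[lo].lock);
}
```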
Fig. 9 shows a method 900 for operating the device 100. That is, the method 900 is for serializing access to a shared resource 101.
The method 900 comprises a first step of operating 901 a thread 102 to execute an LWP 103. The method comprises a further step of probing 902 access by the LWP 103 to the shared resource 101, and, if the shared resource 101 is locked, the method comprises a step of pushing 903 the LWP 103 to a delegation queue 104. The present invention also provides a computer program product comprising a program code for controlling a device 100 or for performing, when running on a computer, the method 900. The computer program product includes any kind of computer readable data, including e.g. any kind of storage, or information that is transmitted via a communication network.
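The three steps can be tied together in the following compact sketch; try_lock_resource, dq_push, run_critical_section and unlock_resource are assumed primitives standing in for the mechanisms described above, not the claimed implementation itself.

```c
#include <stdbool.h>

typedef struct resource resource_t;      /* shared resource (101), assumed  */
typedef struct lwp      lwp_t;           /* light-weight process (103)      */
typedef struct dq       dq_t;            /* delegation queue (104)          */

extern bool try_lock_resource(resource_t *r);     /* step 902: probe access */
extern void dq_push(dq_t *q, lwp_t *lwp);         /* step 903: delegate     */
extern void run_critical_section(resource_t *r, lwp_t *lwp);
extern void unlock_resource(resource_t *r);

/* Step 901: the thread executing the LWP reaches the shared resource. */
void access_shared_resource(resource_t *r, dq_t *q, lwp_t *lwp)
{
    if (try_lock_resource(r)) {
        /* Resource free: this thread becomes the helper and executes the
           critical section itself. */
        run_critical_section(r, lwp);
        unlock_resource(r);
    } else {
        /* Resource locked: push the LWP to the delegation queue; the
           current helper will execute it on this thread's behalf. */
        dq_push(q, lwp);
    }
}
```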
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims
1. A device (100) for serializing access to a shared resource (101), wherein the device (100) is configured to
- operate a thread (102) to execute a light-weight process, LWP, (103)
- probe access by the LWP (103) to the shared resource (101), and, if the shared resource (101) is locked,
- push the LWP (103) to a delegation queue (104).
2. The device (100) according to claim 1, wherein the device (100) is further configured to operate a helper thread (201) to pop the LWP (103) from the delegation queue (104) and execute the LWP (103) by means of the shared resource (101).
3. The device (100) according to claim 1 or 2, wherein the device (100) is further configured to operate the thread (102) to, if the shared resource (101) is locked, obtain LWP context information (202) associated with the LWP (103), and to push the LWP context information (202) to the delegation queue (104).
4. The device (100) according to any one of the preceding claims, further configured to operate the helper thread (201) to pop the LWP context information (202) from the delegation queue (104) and execute the LWP (103) based on the LWP context information (202).

5. The device (100) according to any one of the preceding claims, wherein the helper thread (201) is the first thread to access the shared resource (101).
6. The device (100) according to any one of the preceding claims, further configured to operate the helper thread (201) to, after completion of executing the LWP (103) by means of the shared resource (101), push the LWP (103) to a ready queue (203).
7. The device (100) according to any one of the preceding claims, further configured to operate the helper thread (201) to, after completion of executing the LWP (103) by means of the shared resource (101), update the LWP context information (202), thereby obtaining updated LWP context information (202’), and push the updated LWP context information (202’) to the ready queue (203).

8. The device (100) according to any one of the preceding claims, further configured to operate the thread (102) by a first core (204) of the device (100).

9. The device (100) according to any one of the preceding claims, further configured to operate the helper thread (201) by a second core (205) of the device (100).

10. The device (100) according to any one of the preceding claims, wherein the LWP context information (202) comprises a predefined application programming interface, API.

11. The device (100) according to any one of the preceding claims, wherein the LWP context information (202) comprises an ABI specified LWP register context.

12. The device (100) according to any one of the preceding claims, further configured to operate the thread (102) to pop the LWP (103) from the ready queue (203) and continue executing the LWP (103).

13. The device (100) according to any one of the preceding claims, further configured to operate the thread (102) to pop the updated LWP context information (202’) from the ready queue (203) and continue executing the LWP (103) based on the updated LWP context information (202’).

14. The device (100) according to any one of the preceding claims, further configured to operate a second thread (206), to pop the LWP (103) from the ready queue (203) and continue executing the LWP (103), wherein the device (100) preferably is further configured to operate the second thread (206) to pop the updated LWP context information (202’) from the ready queue (203) and continue executing the LWP (103) based on the updated LWP context information (202’).
15. A method (900) for serializing access to a shared resource (101), wherein the method (900) comprises the steps of:
- operating (901) a thread (102) to execute a light-weight process, LWP, (103)
- probing (902) access by the LWP (103) to the shared resource (101), and, if the shared resource (101) is locked,
- pushing (903) the LWP (103) to a delegation queue (104).
EP18750154.9A 2018-08-01 2018-08-01 Device and method for serializing access to a shared resource Pending EP3803597A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/070819 WO2020025122A1 (en) 2018-08-01 2018-08-01 Device and method for serializing access to a shared resource

Publications (1)

Publication Number Publication Date
EP3803597A1 (en)

Family

ID=63108553

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18750154.9A Pending EP3803597A1 (en) 2018-08-01 2018-08-01 Device and method for serializing access to a shared resource

Country Status (2)

Country Link
EP (1) EP3803597A1 (en)
WO (1) WO2020025122A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6223204B1 (en) * 1996-12-18 2001-04-24 Sun Microsystems, Inc. User level adaptive thread blocking
US6330612B1 (en) * 1998-08-28 2001-12-11 International Business Machines Corporation Method and apparatus for serializing access to a shared resource in an information handling system
DE602008001802D1 (en) * 2008-10-13 2010-08-26 Alcatel Lucent A method of synchronizing access to a shared resource, device, storage means and software program therefor
US9152468B2 (en) * 2010-10-25 2015-10-06 Samsung Electronics Co., Ltd. NUMA aware system task management
US8966491B2 (en) * 2012-04-27 2015-02-24 Oracle International Corporation System and method for implementing NUMA-aware reader-writer locks

Also Published As

Publication number Publication date
WO2020025122A1 (en) 2020-02-06


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20210107

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20220727