WO2023093984A1 - Method and computing device for processing task request - Google Patents

Method and computing device for processing task request

Info

Publication number
WO2023093984A1
WO2023093984A1 (PCT/EP2021/082903)
Authority
WO
WIPO (PCT)
Prior art keywords
threads
task
group
request
task queue
Prior art date
Application number
PCT/EP2021/082903
Other languages
French (fr)
Inventor
Naor SHLOMO
Amit Golander
Yigal Korman
Itamar OFEK
Original Assignee
Huawei Cloud Computing Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co., Ltd.
Priority to PCT/EP2021/082903
Publication of WO2023093984A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/4887Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues involving deadlines, e.g. rate based, periodic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Definitions

  • the present disclosure relates generally to the field of data management and data replication systems; and, more specifically, to a method and a computing device for processing a task request in a multithreading computing system for low latency data replication.
  • a persistent memory is generally used to increase small input/output (I/O) performance, such as storage of metadata, indexes, log files (or logs), hot data, and the like.
  • the use of the persistent memory for storage of the aforementioned data generally results in speeding up cloud services, such as databases, high-performance computing (HPC), etc.
  • the persistent memory may also be used for storing data structures such that the stored data structures can be accessed continuously using memory instructions or memory application programming interface (APIs) even after the end of the process that created or last modified the data structures.
  • the persistent memory is like regular memory, but it is persistent across server crashes, like hard disk or solid-state drive (SSD).
  • the persistent memory is byte-addressable like regular memory and can be accessed using remote direct memory access (RDMA).
  • the use of RDMA enables data transfer with particularly low latency for both read and write I/O operations.
  • data replication of write I/O is performed to ensure higher availability and reliability of data.
  • the data replication to multiple nodes adds substantial latency to the data transfer process.
  • a conventional client device receives a reply of either a “success” or a “failure” only when the data transfer process that involves communication with all the multiple nodes (i.e., replication nodes) ends and further when the multiple write operations to each node, including a conventional primary node, end.
  • a data replication process of a RDMA write operation to multiple nodes with a storage-class memory involves sending a RDMA write request by a client device to a conventional primary node as well as to replication nodes (or replicas).
  • the persistent memory is also referred to as the storage-class memory.
  • the conventional primary node receives the RDMA write request and waits for polling or exception.
  • the conventional primary node writes to its persistent memory and sends the acknowledgement about the RDMA write request to the conventional client device. In this way, the process (or a thread) of data replication to multiple nodes gets completed.
  • the data replication process utilizes multiple threads or processes, and each thread or process typically handles the assigned work in the same manner as other threads handle the process. Consequently, a conventional operating system scheduler performs a context switch, which is a process of storing the state of a process or thread, so that it can be restored and resume execution at a later point. Performing such context switches is usually computationally intensive. In the data replication process, while waiting for replies from the replication nodes (or replicas), the conventional operating system scheduler may perform a context switch in order to let other threads work during the wait, which is considered idle time. Therefore, the use of multiple context switches leads to additional latency and high resource consumption, which is not desirable.
  • the present disclosure provides a method and a computing device for processing a task request in a multithreading computing system.
  • the present disclosure provides a solution to the existing problem of high replication latency as well as high resource (e.g., memory) consumption in a conventional multithreading computing system, further leading to reduced throughput and low efficiency in the conventional multithreading computing system.
  • An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provide an improved method and a computing device for processing a task request in a multithreading computing system, for achieving low latency data replication as compared to conventional systems.
  • a method for processing a task request in a multithreading computing system comprises determining a process for servicing the task request received from a client.
  • the method further comprises dividing the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads.
  • the method of the present disclosure improves performance of the multithreading computing system in terms of reducing latency for data replication with less resource (e.g., memory) consumption as compared to existing systems. Since the process of servicing the task request is divided into the first group of threads and the second group of threads and each group is assigned a different task at different times to work independently, low data replication latency is achieved. Due to the independent execution of multiple tasks by each group of threads, idle time decreases and CPU utilization increases and consequently, system’s resource utilization increases.
  • the first group of threads is configured to receive the task request from the client at a primary node, and send the task request to one or more secondary nodes for replication.
  • the data replication latency reduces. Additionally, computational resource utilization increases because more tasks utilize more CPU cores in parallel.
  • the second group of threads is configured to receive a response from the one or more secondary nodes, complete the task request at the primary node, and send a confirmation of completed task request to the client.
  • the method further comprises pairing an individual task queue to each thread in the second group of threads, where tasks are added to the individual task queue from the first group of threads.
  • the pairing of the individual task queue to each thread in the second group of threads leads to concurrent execution of one or more tasks resulting in high throughput of the multithreading computing system.
  • the method further comprises creating a shared task queue between all threads in the second group of threads.
  • the creation of the shared task queue between all threads in the second group of threads not only leads to parallelism of multiple tasks but also enables low memory consumption.
  • the method further comprises adding tasks lying on a shared memory between the first and the second groups of threads to either the individual task queue or the shared task queue, where each thread from the first group of threads is configured to use compare-and-swap (CAS) technique to access the individual task queue, and where each thread from the second group of threads is configured to use CAS technique to access each of the individual task queue and the shared task queue.
  • each task added to the individual task queue or the shared task queue comprises a pre-determined round-trip time (RTT) value and a commit value.
  • each thread in the second group of threads is configured to handle a task from either the corresponding individual task queue or the shared task queue in an idle period, where the idle period is the time spent by the thread waiting for the completion of the existing task request.
  • a task from either the individual task queue or the shared task queue is only handled in the idle period when the RTT value of the task is less than the sum of the RTT value and the commit value of the existing task request.
  • the task is picked from the shared task queue only when no eligible task for the idle period is found in the individual task queue.
  • the selection of eligible task in the idle period leads to minimization of the idle period.
  • the present disclosure provides a computing device for processing a task request.
  • the computing device comprises a memory, a communication interface, and a processor configured to determine a process for servicing the task request received from a client, and divide the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads.
  • the computing device achieves all the advantages and effects of the method of the present disclosure, after execution of the method.
  • the present disclosure provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method.
  • FIG. 1 is a flowchart of a method for processing a task request in a multithreading computing system, in accordance with an embodiment of the present disclosure
  • FIG. 2 is a block diagram that illustrates various exemplary components of a computing device, in accordance with an embodiment of the present disclosure
  • FIG. 3 illustrates servicing of a task request using a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure
  • FIG. 4 illustrates assignment of one or more task requests between a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the nonunderlined number to the item.
  • the non-underlined number is used to identify a general item at which the arrow is pointing.
  • FIG. 1 is a flowchart of a method for processing a task request in a multithreading computing system, in accordance with an embodiment of the present disclosure.
  • a method 100 for processing a task request in a multithreading computing system includes steps 102 and 104.
  • the method 100 is executed by a computing device, described in detail, for example, in FIG. 2.
  • the method 100 provides an improved and efficient central processing unit (CPU) scheduling scheme that enables an increase in throughput and bandwidth of the multithreading computing system.
  • the multithreading computing system may be defined as a computing system that allows two or more threads of a process to execute concurrently while sharing same resources.
  • a thread may be defined as a self-contained sequence of instructions that can execute in parallel with other threads that are part of the same process. For example, in a web browser, one thread is used to handle the user interface and in parallel, another thread is used to fetch the data to be displayed. Therefore, multithreading allows multiple concurrent tasks, which can be performed within a single process. Hence, multithreading improves responsiveness of the computing system.
  • the method 100 is described in detail in the following steps.
  • the method 100 comprises determining a process for servicing the task request received from a client.
  • the task request received from the client may be a remote direct memory access (RDMA) write request.
  • the task request received from the client may be a RDMA read request.
  • the process for servicing the task request is determined.
  • the method 100 further comprises dividing the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads.
  • the first group of threads is configured to handle tasks, which are different from tasks handled by the second group of threads.
  • the first group of threads is configured to receive the task request from the client at a primary node, and send the task request to one or more secondary nodes for replication.
  • the first group of threads is configured to receive the task request (e.g., RDMA write request) from the client at the primary node. After receiving the task request, the first group of threads is further configured to send the received task request (i.e., RDMA write request) to the secondary nodes for data replication. Therefore, the secondary nodes may also be referred to as data replication nodes or replicas.
  • the first group of threads may be associated with two or more CPU cores at the primary node. Alternatively stated, each thread is bound to its own CPU core.
  • the second group of threads is configured to receive a response from the one or more secondary nodes, complete the task request at the primary node, and send a confirmation of completed task request to the client.
  • the second group of threads is configured to receive the response (e.g., acknowledgement of the RDMA write request) from the secondary nodes. Thereafter, the second group of threads is configured to write the data to a local persistent memory of the primary node using a direct memory access (DMA) engine for data chunks that are larger than 64 KB.
  • the second group of threads is further configured to send the confirmation (i.e., acknowledgement) of completed task request (i.e., the RDMA write request) to the client.
  • the receiving of the task request (i.e., RDMA write request) at the primary node from the client and sending the task request to the one or more secondary nodes is less intensive and a faster process, in comparison to receiving the response from the one or more secondary nodes, executing the task request (i.e., RDMA write request) at the primary node and sending the acknowledgement of the task request to the client. Therefore, the first group of threads is configured to handle the tasks of lower complexity in comparison to the second group of threads. Similar to the first group of threads, the second group of threads may be associated with two or more CPU cores at the one or more secondary nodes. Moreover, the first group of threads has a smaller number of threads bound to the CPU cores in comparison to the second group of threads. By virtue of handling the tasks of higher complexity, the second group of threads requires more CPU power and therefore has a larger number of threads bound to the CPU cores accordingly.
  • the method 100 further comprises pairing an individual task queue to each thread in the second group of threads, where tasks are added to the individual task queue from the first group of threads.
  • the individual task queue may belong to a thread space shared memory, which may have one or more work queues.
  • the individual task queue may have a tail portion and a head portion.
  • the different tasks are added (i.e., enqueued) to the tail portion of the individual task queue from the first group of threads. Thereafter, each individual task queue is paired from its head portion to each thread in the second group of threads for execution of the added tasks.
  • the method 100 further comprises creating a shared task queue between all threads in the second group of threads.
  • a shared task queue is also created for all the threads in the second group of threads. Similar to the individual task queue, the shared task queue may have a tail portion and a head portion. The different tasks are enqueued to the tail portion of the shared task queue from all the threads of the second group of threads and dequeued from the head portion of the shared task queue to all the threads of the second group of threads.
  • the method 100 further comprises adding tasks lying on a shared memory between the first and the second groups of threads to either the individual task queue or the shared task queue, where each thread from the first group of threads is configured to use compare-and-swap, CAS, technique to access the individual task queue, and where each thread from the second group of threads is configured to use CAS technique to access each of the individual task queue and the shared task queue.
  • the threads from the first group of threads may be configured to add tasks to the work queues, that is, the individual task queues, lying on the shared memory between the first group of threads and the second group of threads.
  • the first group of threads is configured to generate tasks for the second group of threads while the second group of threads is configured to execute the generated tasks.
  • each thread from the second group of threads may use compare-and-swap (CAS) in order to access the shared task queue dedicated to the second group of threads.
  • the compare-and-swap (CAS) technique is an atomic procedure, which can be used to rewrite data on the shared memory without the use of operating system (OS) level locking, such as semaphores.
  • each task added to the individual task queue or the shared task queue comprises a pre-determined round-trip time (RTT) value and a commit value.
  • the pre-determined RTT value may comprise the round-trip time to the secondary nodes (or data replication nodes), which starts from sending the task request (i.e., RDMA write request) to the secondary nodes and waiting for them to complete the task request and return a reply.
  • the RTT value can be a well-known figure, since the secondary nodes are close by, and the write duration to the local persistent memory of the primary node can also be calculated based upon the payload length and manufacturer-published figures.
  • the RTT value may also be referred to as an estimated-time-of-arrival (ETA) value.
  • the commit value may be defined as a time duration required to write the data to the local persistent memory of the primary node. The commit value can easily be calculated based upon the write payload length and the manufacturer-published figures.
  • each thread in the second group of threads is configured to handle a task from either the corresponding individual task queue or the shared task queue in an idle period.
  • the idle period is the time spent by the thread waiting for the completion of the existing task request. If an existing task request has large RTT (i.e., ETA) and commit values, the idle period is increased. Therefore, in order to reduce the idle period, each thread from the second group of threads is configured to handle another task from either the corresponding individual task queue or the shared task queue in the idle period.
  • a task from either the individual task queue or the shared task queue is only handled in the idle period when the RTT value of the task is less than the sum of the RTT value and the commit value of the existing task request.
  • the handling of the other task selected from either the individual task queue or the shared task queue in the idle period is possible only when the RTT value of the selected task is less than the sum of the RTT value and the commit value of the existing task request.
  • the existing task request and the selected task are eligible for concurrent launching, which also reduces the idle time between the concurrently launched tasks.
  • a launch may be initiated by sending a task request (e.g., a RDMA write request) from the primary node to the secondary nodes.
  • the task is picked from the shared task queue only when no eligible task for the idle period is found in the individual task queue.
  • if the second group of threads does not find an eligible concurrent task to launch in the individual task queue, the second group of threads picks a task from the shared task queue. The best-fitted task is launched, which minimizes the idle time.
  • the second group of threads may push the task into the shared task queue.
  • the method 100 effectively reduces the latency (e.g., data replication latency) during data transfer process to replication nodes, since the data transfer process is divided into two groups of threads, such as the first group of threads and the second group of threads. Each thread from the first group of threads as well as from the second group of threads is assigned a particular task, which further leads to an improved CPU utilization (because more tasks utilize more CPU cores in parallel) and reduced idle time.
  • the replication latency can be calculated using one-sided RDMA operations to a persistent journal without any software on the secondary nodes.
  • the method 100 enables idle time minimization by selecting the tasks according to different time frames (e.g., RTT, commit value, launch period, etc.), which further results in maximization of the system’s utilization.
  • the method 100 enables lockless tasks transmissions between the two groups of threads, which leads to better CPU utilization and overall low memory consumption.
  • the method 100 is adaptive to high performance computing (HPC) as well.
  • steps 102, and 104 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
  • FIG. 2 is a block diagram that illustrates various exemplary components of a computing device, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements from FIG. 1.
  • There is shown a block diagram of a computing device 200 that includes a memory 202, a communication interface 204 and a processor 206.
  • the memory 202 is configured to store a task request 202A.
  • the memory 202 and the communication interface 204 may be communicatively coupled to the processor 206.
  • the processor 206 of the computing device 200 is configured to execute the method 100 (of FIG. 1).
  • the computing device 200 includes suitable logic, circuitry, interfaces, or code that is configured to process the task request 202A.
  • the computing device 200 may be a multithreading computing system.
  • the computing device 200 may also be referred to as a primary node. Examples of the computing device 200 include, but are not limited to, a computing server, mainframe computer, a super computer, and the like.
  • the computing device 200 may be a single computing device or an electronic device.
  • the computing device 200 may be a computing node in a network of a plurality of computing devices, or electronic devices, operating in a parallel or distributed architecture.
  • the memory 202 includes suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the processor 206. Examples of implementation of the memory 202 may include, but are not limited to, a local persistent memory, or a remote direct memory.
  • the memory 202 may store an operating system or other program products (including one or more operation algorithms) to operate the computing device 200.
  • the communication interface 204 may include suitable logic, circuitry, interfaces, or code that is configured for receiving a task request from a client. Moreover, the communication interface 204 is configured to communicate with each of the memory 202 and the processor 206. Examples of the communication interface 204 may include, but are not limited to, a radio frequency transceiver, a network interface, a telematics unit, and/or a subscriber identity module (SIM) card.
  • the processor 206 includes suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 202.
  • the processor 206 may be a general-purpose processor.
  • Other examples of the processor 206 may include, but are not limited to, a central processing unit (CPU), a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a state machine, a data processing unit, and other processors or control circuitry.
  • the processor 206 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the computing device 200.
  • the processor 206 is configured to determine a process for servicing the task request 202A received from a client.
  • the processor 206 is configured to receive the task request (e.g., a RDMA write request) from the client. Thereafter, the processor 206 is configured to determine the process for servicing the task request.
  • the processor 206 is further configured to divide the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads.
  • the processor 206 is further configured to divide the process in two groups of threads, such as the first group of threads and the second group of threads.
  • the first group of threads is configured to handle tasks of lower complexity, such as receiving the task request from the client and sending the task request to secondary nodes, in comparison to tasks, such as receiving replies from the secondary nodes, executing the task request at a primary node and sending a reply to the client, which are handled by the second group of threads.
  • the first group of threads has a smaller number of threads bound to the CPU cores in comparison to the second group of threads.
  • the second group of threads requires more CPU power and therefore has a larger number of threads bound to the CPU cores accordingly.
  • the present disclosure provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method 100 (of FIG. 1).
  • the present disclosure provides a non-transitory computer-readable medium having stored thereon computer-implemented instructions that, when executed by a computer, cause the computer to execute operations of the method 100.
  • FIG. 3 illustrates servicing of a task request using a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is described in conjunction with elements from FIGs. 1, and 2.
  • There is shown a sequence diagram 300 for servicing a task request. There is shown a client 302, a primary node 304 and one or more secondary nodes 306.
  • There is further shown a sequence of operations 308 to 318.
  • Each of the first group of threads 304A and the second group of threads 304B is represented by a dashed box, which is used for illustration purpose only.
  • the client 302 is configured to send a task request (e.g., a RDMA write request) to the primary node 304.
  • the first group of threads 304A at the primary node 304 is configured to receive the task request from the client 302 and send the received task request (i.e., RDMA write request) to the one or more secondary nodes 306 (i.e., replication nodes or replicas).
  • the second group of threads 304B is waiting for acknowledgement from the one or more secondary nodes 306.
  • the one or more secondary nodes 306 send the acknowledgement about the task request to the second group of threads 304B at the primary node 304.
  • the second group of threads 304B completes the task request by writing the data to persistent memory of the primary node 304.
  • the second group of threads 304B sends a confirmation of the completed task request to the client 302.
  • the task request received from the client 302 is serviced between the primary node 304 and the one or more secondary nodes 306 through the first group of threads 304A and the second group of threads 304B.
  • the second group of threads 304B may be configured to select another task of a lower RTT value in comparison to the existing task, for concurrent execution during the waiting period, so as not to disturb the order of writing. This results in more efficient utilization of the system during the waiting period.
  • the process of servicing the task request is divided into the first group of threads 304A and the second group of threads 304B, and each thread of the first group of threads 304A and the second group of threads 304B is assigned a different role. Therefore, multiple tasks can run concurrently, resulting in an improved performance of the system (i.e., the multithreading computing system) in terms of high throughput, low latency, and bandwidth.
  • FIG. 4 illustrates assignment of one or more task requests between a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure.
  • FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3.
  • There is shown the primary node 304 that includes the first group of threads 304A, and the second group of threads 304B.
  • the primary node 304 is represented by a dashed box, which is used for illustration purpose only.
  • Each thread of the first group of threads 304A and the second group of threads 304B corresponds to a single core of the CPU (e.g., the processor 206 of FIG. 2).
  • the first group of threads 304A (also represented as G1) is configured to add a plurality of task requests (e.g., RDMA write requests) to each individual task queue of the plurality of individual task queues 402.
  • Each individual task queue of the plurality of individual task queues 402 has a tail portion and a head portion.
  • the plurality of task requests is enqueued to the tail portion of each of the plurality of individual task queues 402.
  • each of the plurality of individual task queues 402 is paired to each thread of the second group of threads 304B (also represented as G2).
  • each task request from each of the plurality of individual task queues 402 is dequeued from its respective head portion to each thread in the second group of threads 304B (i.e., G2).
  • the shared task queue 404 is shared between all threads in the second group of threads 304B (i.e., G2).
  • each thread from the first group of threads 304A (i.e., G1) and the second group of threads 304B (i.e., G2) may be configured to access the plurality of individual task queues 402 using the CAS technique. Additionally, all the work queues, that is, the plurality of individual task queues 402 and the shared task queue 404, are accessed using the CAS technique because the CAS technique provides fast access to the shared data for the multiple threads and replaces the mutual exclusion technique, which is slower.
  • the plurality of individual task queues 402 corresponds to a thread space shared memory which is accessible to each of the first group of threads 304A (i.e., G1) and the second group of threads 304B (i.e., G2).

Abstract

A method for processing a task request in a multithreading computing system, includes determining a process for servicing the task request received from a client. The method further includes dividing the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads. The disclosed method improves performance of the multithreading computing system in terms of reducing latency for data replication with less resource (e.g., memory) consumption.

Description

METHOD AND COMPUTING DEVICE FOR PROCESSING TASK REQUEST
TECHNICAL FIELD
The present disclosure relates generally to the field of data management and data replication systems; and, more specifically, to a method and a computing device for processing a task request in a multithreading computing system for low latency data replication.
BACKGROUND
With the rapid development of data-intensive applications, the data storage needs in a cloud environment have increased. In the cloud environment, a persistent memory (PM) is generally used to increase small input/output (I/O) performance, such as storage of metadata, indexes, log files (or logs), hot data, and the like. The use of the persistent memory for storage of the aforementioned data generally results in speeding up cloud services, such as databases, high-performance computing (HPC), etc. Moreover, the persistent memory may also be used for storing data structures such that the stored data structures can be accessed continuously using memory instructions or memory application programming interface (APIs) even after the end of the process that created or last modified the data structures. The persistent memory is like regular memory, but it is persistent across server crashes, like hard disk or solid-state drive (SSD). The persistent memory is byte-addressable like regular memory and can be accessed using remote direct memory access (RDMA). The use of RDMA enables data transfer with particularly low latency for both read and write I/O operations. Moreover, data replication of write I/O is performed to ensure higher availability and reliability of data. However, the data replication to multiple nodes adds substantial latency to the data transfer process. This is because a conventional client device receives a reply of either a “success” or a “failure” only when the data transfer process that involves communication with all the multiple nodes (i.e., replication nodes) ends and further when the multiple write operations to each node, including a conventional primary node, end.
In a conventional multithreading computing system, a data replication process of a RDMA write operation to multiple nodes with a storage-class memory involves sending a RDMA write request by a client device to a conventional primary node as well as to replication nodes (or replicas). The persistent memory is also referred to as the storage-class memory. The conventional primary node then receives the RDMA write request and waits for polling or exception. After some time, the replication nodes (i.e., replicas) send the acknowledgement about the RDMA write request to the conventional primary node. Thereafter, the conventional primary node writes to its persistent memory and sends the acknowledgement about the RDMA write request to the conventional client device. In this way, the process (or a thread) of data replication to multiple nodes gets completed.
The data replication process utilizes multiple threads or processes, and each thread or process typically handles the assigned work in the same manner as other threads handle the process. Consequently, a conventional operating system scheduler performs a context switch, which is a process of storing the state of a process or thread, so that it can be restored and resume execution at a later point. Performing such context switches is usually computationally intensive. In the data replication process, while waiting for replies from the replication nodes (or replicas), the conventional operating system scheduler may perform a context switch in order to let other threads work during the wait, which is considered idle time. Therefore, the use of multiple context switches leads to additional latency and high resource consumption, which is not desirable.
Other solutions, such as fast remote memory (FaRM), Tailwind, and distributed asynchronous object storage (DAOS), rely on the underlying operating system to perform the thread or process scheduling for the RDMA operations, which includes the data replication process. The DAOS supports the data replication process along with the more common client-side replication. Thus, there exists a technical problem of high replication latency as well as high resource (e.g., memory) consumption in the conventional multithreading computing system, further leading to reduced throughput and inefficiency of the system.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional methods of data replication to the replication nodes (i.e., replicas).
SUMMARY
The present disclosure provides a method and a computing device for processing a task request in a multithreading computing system. The present disclosure provides a solution to the existing problem of high replication latency as well as high resource (e.g., memory) consumption in a conventional multithreading computing system, further leading to reduced throughput and low efficiency in the conventional multithreading computing system. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provide an improved method and a computing device for processing a task request in a multithreading computing system, for achieving low latency data replication as compared to conventional systems.
The object of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
According to an aspect of the present disclosure, there is provided a method for processing a task request in a multithreading computing system. The method comprises determining a process for servicing the task request received from a client. The method further comprises dividing the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads.
The method of the present disclosure improves performance of the multithreading computing system in terms of reducing latency for data replication with less resource (e.g., memory) consumption as compared to existing systems. Since the process of servicing the task request is divided into the first group of threads and the second group of threads and each group is assigned a different task at different times to work independently, low data replication latency is achieved. Due to the independent execution of multiple tasks by each group of threads, idle time decreases and CPU utilization increases and consequently, system’s resource utilization increases.
In an implementation form, the first group of threads is configured to receive the task request from the client at a primary node, and send the task request to one or more secondary nodes for replication.
By virtue of using the first group of threads for specific tasks, such as receiving the task request from the client at the primary node and sending the task request to the one or more secondary nodes for replication, the data replication latency reduces. Additionally, computational resource utilization increases because more tasks utilize more CPU cores in parallel. In a further implementation form, the second group of threads is configured to receive a response from the one or more secondary nodes, complete the task request at the primary node, and send a confirmation of completed task request to the client.
It is advantageous to use the second group of threads for receiving the response from the one or more secondary nodes, completing the task request at the primary node, and sending the confirmation of the completed task request to the client, in order to speed up the servicing of the task request.
In a further implementation form, the method further comprises pairing an individual task queue to each thread in the second group of threads, where tasks are added to the individual task queue from the first group of threads.
The pairing of the individual task queue to each thread in the second group of threads leads to concurrent execution of one or more tasks resulting in high throughput of the multithreading computing system.
In a further implementation form, the method further comprises creating a shared task queue between all threads in the second group of threads.
The creation of the shared task queue between all threads in the second group of threads not only leads to parallelism of multiple tasks but also enables low memory consumption.
In a further implementation form, the method further comprises adding tasks lying on a shared memory between the first and the second groups of threads to either the individual task queue or the shared task queue, where each thread from the first group of threads is configured to use compare-and-swap (CAS) technique to access the individual task queue, and where each thread from the second group of threads is configured to use CAS technique to access each of the individual task queue and the shared task queue.
The usage of the CAS technique provides fast access to the individual task queue and the shared task queue because the CAS technique replaces the mutual exclusion technique, which is slower. In a further implementation form, each task added to the individual task queue or the shared task queue comprises a pre-determined round-trip time (RTT) value and a commit value.
By virtue of the pre-determined RTT value and the commit value associated with each task, a task for concurrent execution can be selected with ease.
In a further implementation form, each thread in the second group of threads is configured to handle a task from either the corresponding individual task queue or the shared task queue in an idle period, where the idle period is the time spent by the thread waiting for the completion of the existing task request.
By virtue of handling the task in the idle period of the existing task request, the system’s utilization is maximized, and the idle period is minimized.
In a further implementation form, a task from either the individual task queue or the shared task queue is only handled in the idle period when the RTT value of the task is less than the sum of the RTT value and the commit value of the existing task request.
It is advantageous to handle the task of lower RTT value than the sum of the RTT value and the commit value of the existing task request in order not to disturb the order of writing.
In a further implementation form, the task is picked from the shared task queue only when no eligible task for the idle period is found in the individual task queue.
The selection of eligible task in the idle period leads to minimization of the idle period.
In another aspect, the present disclosure provides a computing device for processing a task request. The computing device comprises a memory, a communication interface, and a processor configured to determine a process for servicing the task request received from a client, and divide the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads. The computing device achieves all the advantages and effects of the method of the present disclosure, after execution of the method.
In a yet another aspect, the present disclosure provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method.
The computer (e.g., processor of a computing device or a system) achieves all the advantages and effects of the method after execution of the method.
It is to be appreciated that all the aforementioned implementation forms can be combined.
It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a flowchart of a method for processing a task request in a multithreading computing system, in accordance with an embodiment of the present disclosure;
FIG. 2 is a block diagram that illustrates various exemplary components of a computing device, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates servicing of a task request using a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure; and
FIG. 4 illustrates assignment of one or more task requests between a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the nonunderlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
FIG. 1 is a flowchart of a method for processing a task request in a multithreading computing system, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a method 100 for processing a task request in a multithreading computing system. The method 100 includes steps 102 and 104. The method 100 is executed by a computing device, described in detail, for example, in FIG. 2.
The method 100 provides an improved and efficient central processing unit (CPU) scheduling scheme that enables an increase in throughput and bandwidth of the multithreading computing system. Generally, the multithreading computing system may be defined as a computing system that allows two or more threads of a process to execute concurrently while sharing same resources. A thread may be defined as a self-contained sequence of instructions that can execute in parallel with other threads that are part of the same process. For example, in a web browser, one thread is used to handle the user interface and in parallel, another thread is used to fetch the data to be displayed. Therefore, multithreading allows multiple concurrent tasks, which can be performed within a single process. Hence, multithreading improves responsiveness of the computing system. The method 100 is described in detail in the following steps.
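By way of illustration only (and not as part of the claimed subject matter), the following minimal C++ sketch shows two threads of a single process executing concurrently, analogous to the web-browser example above; the thread names are arbitrary.

```cpp
#include <iostream>
#include <thread>

// Two threads of the same process run concurrently and share its address space,
// much like one browser thread handling the user interface while another fetches data.
int main() {
    std::thread ui([]    { std::cout << "handling user interface\n"; });
    std::thread fetch([] { std::cout << "fetching data to display\n"; });
    ui.join();
    fetch.join();
    return 0;
}
```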
At step 102, the method 100 comprises determining a process for servicing the task request received from a client. In an example, the task request received from the client may be a remote direct memory access (RDMA) write request. In another example, the task request received from the client may be a RDMA read request. After receiving the task request (i.e., RDMA write or RDMA read request) from the client, the process for servicing the task request is determined.
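A minimal sketch of step 102, assuming the request carries a simple type field, is given below; the enum and function names are illustrative assumptions and do not appear in the disclosure.

```cpp
// Illustrative dispatch for step 102: map the received task request to a servicing process.
enum class RequestType { RdmaWrite, RdmaRead };
enum class ServicingProcess { ReplicatedWrite, LocalRead };

ServicingProcess determine_process(RequestType type) {
    return (type == RequestType::RdmaWrite)
               ? ServicingProcess::ReplicatedWrite   // fan the write out to the secondary nodes
               : ServicingProcess::LocalRead;        // serve the read from local persistent memory
}
```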
At step 104, the method 100 further comprises dividing the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads. The first group of threads is configured to handle tasks, which are different from tasks handled by the second group of threads.
In an implementation, the first group of threads is configured to receive the task request from the client at a primary node, and send the task request to one or more secondary nodes for replication. The first group of threads is configured to receive the task request (e.g., RDMA write request) from the client at the primary node. After receiving the task request, the first group of threads is further configured to send the received task request (i.e., RDMA write request) to the secondary nodes for data replication. Therefore, the secondary nodes may also be referred to as data replication nodes or replicas. The first group of threads may be associated with two or more CPU cores at the primary node. Alternatively stated, each thread is bound to its own CPU core.
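A possible sketch of a first-group (G1) worker loop and of binding a thread to its own CPU core is shown below; the RDMA-facing helpers (recv_from_client, forward_to_secondaries, enqueue_for_g2) are placeholders invented for this sketch, and the affinity call is Linux-specific.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <cstdint>
#include <thread>

struct TaskRequest { std::uint64_t id; std::size_t payload_len; };

// Placeholders for the I/O a real implementation would perform (names are assumptions).
TaskRequest recv_from_client()                  { return {}; }
void forward_to_secondaries(const TaskRequest&) {}
void enqueue_for_g2(const TaskRequest&)         {}

// First-group (G1) worker: receive the task request at the primary node and send it
// to the secondary nodes for replication, then hand the task over to the second group.
void g1_worker() {
    for (;;) {
        TaskRequest req = recv_from_client();
        forward_to_secondaries(req);
        enqueue_for_g2(req);
    }
}

// Bind a thread to its own CPU core (Linux affinity API).
void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}
```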
In an implementation, the second group of threads is configured to receive a response from the one or more secondary nodes, complete the task request at the primary node, and send a confirmation of completed task request to the client. The second group of threads is configured to receive the response (e.g., acknowledgement of the RDMA write request) from the secondary nodes. Thereafter, the second group of threads is configured to write the data to a local persistent memory of the primary node using a direct memory access (DMA) engine for data chunks that are larger than 64 KB. The second group of threads is further configured to send the confirmation (i.e., acknowledgement) of completed task request (i.e., the RDMA write request) to the client.
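Correspondingly, a second-group (G2) worker loop could be sketched as follows; the helper functions are again placeholders for polling RDMA completions, flushing or DMA-copying the payload to persistent memory, and posting the acknowledgement to the client.

```cpp
#include <cstdint>

struct Task { std::uint64_t id; };

// Placeholders for the I/O a real implementation would perform (names are assumptions).
Task* next_task()                        { return nullptr; }  // from this thread's individual task queue
void wait_for_secondary_acks(Task&)      {}
void commit_to_persistent_memory(Task&)  {}  // e.g., memcpy plus flush, or a DMA engine for chunks larger than 64 KB
void send_confirmation_to_client(Task&)  {}

// Second-group (G2) worker: receive the replicas' responses, complete the task request
// at the primary node, and send the confirmation of the completed task request to the client.
void g2_worker() {
    for (;;) {
        if (Task* t = next_task()) {
            wait_for_secondary_acks(*t);
            commit_to_persistent_memory(*t);
            send_confirmation_to_client(*t);
        }
    }
}
```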
The receiving of the task request (i.e., RDMA write request) at the primary node from the client and sending the task request to the one or more secondary nodes is less intensive and a faster process, in comparison to receiving the response from the one or more secondary nodes, executing the task request (i.e., RDMA write request) at the primary node and sending the acknowledgement of the task request to the client. Therefore, the first group of threads is configured to handle the tasks of lower complexity in comparison to the second group of threads. Similar to the first group of threads, the second group of threads may be associated with two or more CPU cores at the one or more secondary nodes. Moreover, the first group of threads has a smaller number of threads bound to the CPU cores in comparison to the second group of threads. By virtue of handling the tasks of higher complexity, the second group of threads requires more CPU power and therefore has a larger number of threads bound to the CPU cores accordingly.
In an implementation, the method 100 further comprises pairing an individual task queue to each thread in the second group of threads, where tasks are added to the individual task queue from the first group of threads. The individual task queue may belong to a thread space shared memory, which may have one or more work queues. The individual task queue may have a tail portion and a head portion. The different tasks are added (i.e., enqueued) to the tail portion of the individual task queue from the first group of threads. Thereafter, each individual task queue is paired from its head portion to each thread in the second group of threads for execution of the added tasks. In an implementation, the method 100 further comprises creating a shared task queue between all threads in the second group of threads. In addition to the individual task queue from the thread space shared memory, a shared task queue is also created for all the threads in the second group of threads. Similar to the individual task queue, the shared task queue may have a tail portion and a head portion. The different tasks are enqueued to the tail portion of the shared task queue from all the threads of the second group of threads and dequeued from the head portion of the shared task queue to all the threads of the second group of threads.
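The pairing of queues to threads can be pictured with the following layout sketch; std::deque is used only to keep the illustration short, whereas the disclosure contemplates lock-free queues accessed with CAS, as sketched after the next paragraph.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Task;  // task descriptor (RTT value, commit value, payload, etc.)

// Layout sketch only: each thread in the second group (G2) is paired with its own task
// queue, which is fed by the first group (G1), and all G2 threads additionally share one queue.
struct Scheduler {
    std::vector<std::deque<Task*>> individual_queues;  // one per G2 thread; enqueue at tail, dequeue at head
    std::deque<Task*>              shared_queue;        // shared between all G2 threads
    explicit Scheduler(std::size_t g2_thread_count) : individual_queues(g2_thread_count) {}
};
```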
In an implementation, the method 100 further comprises adding tasks lying on a shared memory between the first and the second groups of threads to either the individual task queue or the shared task queue, where each thread from the first group of threads is configured to use a compare-and-swap, CAS, technique to access the individual task queue, and where each thread from the second group of threads is configured to use the CAS technique to access each of the individual task queue and the shared task queue. The threads from the first group of threads may be configured to add tasks to the work queues, that is, the individual task queues lying on the shared memory between the first group of threads and the second group of threads. Alternatively stated, the first group of threads is configured to generate tasks for the second group of threads, while the second group of threads is configured to execute the generated tasks. Additionally, each thread from the second group of threads may use compare-and-swap (CAS) in order to access the shared task queue dedicated to the second group of threads. The compare-and-swap (CAS) technique is an atomic operation, which can be used to rewrite data on the shared memory without the use of operating-system (OS) level locking, such as semaphores.
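As a minimal sketch of what such lockless access could look like, the ring below lets several first-group threads claim slots with compare_exchange_weak while the paired second-group thread consumes from the head, with no OS-level lock. The bounded-ring shape, the capacity and the memory orderings are assumptions made for brevity; the disclosure only requires that the queues be accessed with CAS.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

struct Task { /* replication request payload (illustrative) */ };

// Bounded multi-producer / single-consumer ring: producers (first-group threads)
// claim slots via CAS on the tail index; the paired second-group thread pops from the head.
class MpscTaskRing {
public:
    explicit MpscTaskRing(std::size_t capacity) : slots_(capacity), cap_(capacity) {}

    bool try_push(Task t) {                                   // called by first-group threads
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            if (pos - head_.load(std::memory_order_acquire) >= cap_)
                return false;                                 // ring is full
            if (tail_.compare_exchange_weak(pos, pos + 1,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed))
                break;                                        // slot 'pos' now belongs to us
        }
        Slot& s = slots_[pos % cap_];
        s.task = std::move(t);
        s.ready.store(true, std::memory_order_release);       // publish to the consumer
        return true;
    }

    bool try_pop(Task& out) {                                 // called only by the paired thread
        Slot& s = slots_[head_.load(std::memory_order_relaxed) % cap_];
        if (!s.ready.load(std::memory_order_acquire))
            return false;                                     // nothing published yet
        out = std::move(s.task);
        s.ready.store(false, std::memory_order_release);      // slot may now be reused
        head_.fetch_add(1, std::memory_order_release);
        return true;
    }

private:
    struct Slot { std::atomic<bool> ready{false}; Task task; };
    std::vector<Slot> slots_;
    std::atomic<std::size_t> tail_{0};   // next slot a producer will claim
    std::atomic<std::size_t> head_{0};   // next slot the consumer will read
    const std::size_t cap_;
};
```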
In an implementation, each task added to the individual task queue or the shared task queue comprises a pre-determined round-trip time (RTT) value and a commit value. The pre-determined RTT value may comprise the round-trip time to the secondary nodes (or data replication nodes), measured from sending the task request (i.e., the RDMA write request) to the secondary nodes until they complete the task request and return a reply. The RTT value can be a well-known figure, since the secondary nodes are close by, and the write duration to the local persistent memory of the primary node can also be calculated based upon the payload length and manufacturer-published figures. The RTT value may also be referred to as an estimated-time-of-arrival (ETA) value. The commit value may be defined as the time duration required to write the data to the local persistent memory of the primary node. The commit value can easily be calculated based upon the write payload length and the manufacturer-published figures.
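The following small sketch shows one way the two values could be pre-computed per task from the payload length; the fixed replica round trip and the persistent-memory write bandwidth are purely illustrative numbers, not figures taken from the disclosure or from any manufacturer.

```cpp
#include <cstddef>

// Illustrative constants only (assumptions): a measured replica round trip and a
// manufacturer-published persistent-memory write bandwidth would be used in practice.
constexpr double kReplicaRttNs   = 8'000.0;   // ~8 us round trip to the replicas
constexpr double kPmemWriteBwBps = 4.0e9;     // ~4 GB/s sequential write bandwidth

struct TaskTiming {
    double rtt_ns;      // expected replica round-trip time (the "ETA")
    double commit_ns;   // expected local persistent-memory write duration
};

TaskTiming estimate_timing(std::size_t payload_bytes) {
    // commit time = payload length divided by the published write bandwidth
    return { kReplicaRttNs, static_cast<double>(payload_bytes) * (1.0e9 / kPmemWriteBwBps) };
}
```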
In an implementation, each thread in the second group of threads is configured to handle a task from either the corresponding individual task queue or the shared task queue in an idle period. The idle period is the time spent by the thread waiting for the completion of the existing task request. If an existing task request has large RTT (i.e., ETA) and commit values, the idle period increases. Therefore, in order to reduce the idle period, each thread from the second group of threads is configured to handle another task from either the corresponding individual task queue or the shared task queue during the idle period.
In an implementation, a task from either the individual task queue or the shared task queue is only handled in the idle period when the RTT value of the task is less than the sum of the RTT value and the commit value of the existing task request. The handling of another task selected from either the individual task queue or the shared task queue in the idle period is possible only when the RTT value of the selected task is less than the sum of the RTT value and the commit value of the existing task request. In such a case, the existing task request and the selected task are eligible for concurrent launching, which also reduces the idle time between the concurrently launched tasks. Generally, a launch may be initiated by sending a task request (e.g., an RDMA write request) from the primary node to the secondary nodes.
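Expressed as code, the eligibility rule above reduces to a single comparison. The TaskTiming struct mirrors the per-task RTT and commit values from the earlier sketch and is redeclared here only to keep the snippet self-contained.

```cpp
struct TaskTiming {
    double rtt_ns;      // pre-determined round-trip time (ETA) of a task
    double commit_ns;   // pre-determined commit duration of a task
};

// A candidate may be launched during the idle period of the existing request only if
// its round trip fits inside the existing request's RTT + commit window.
bool eligible_for_idle_period(const TaskTiming& candidate, const TaskTiming& existing) {
    return candidate.rtt_ns < existing.rtt_ns + existing.commit_ns;
}
```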
In an implementation, the task is picked from the shared task queue only when no eligible task for the idle period is found in the individual task queue. If the second group of threads does not find an eligible concurrent task to launch in the individual task queue, it picks a task from the shared task queue. The best-fitting task, which minimizes the idle time, is launched. Moreover, when a task with low ETA and commit values is received by the second group of threads and is not worth launching immediately, the second group of threads may push the task into the shared task queue.
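The fallback order could be made explicit as below; the std::deque-based placeholder queues, the linear scan and the pick_idle_task helper are assumptions used only to illustrate consulting the individual queue before the shared one.

```cpp
#include <deque>
#include <optional>

struct TaskTiming { double rtt_ns; double commit_ns; };
struct Task { TaskTiming timing; /* payload omitted */ };

bool eligible_for_idle_period(const TaskTiming& c, const TaskTiming& e) {
    return c.rtt_ns < e.rtt_ns + e.commit_ns;
}

// Placeholder queues; the disclosure uses CAS-based lock-free queues instead.
using TaskQueue = std::deque<Task>;

// Pick a filler task for the idle period: the paired individual queue is consulted
// first, and the shared queue only if no eligible task was found there.
std::optional<Task> pick_idle_task(TaskQueue& individual, TaskQueue& shared,
                                   const TaskTiming& existing) {
    for (TaskQueue* q : {&individual, &shared}) {
        for (auto it = q->begin(); it != q->end(); ++it) {
            if (eligible_for_idle_period(it->timing, existing)) {
                Task picked = *it;
                q->erase(it);
                return picked;
            }
        }
    }
    return std::nullopt;      // nothing worth launching during this idle period
}
```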
Thus, the method 100 effectively reduces the latency (e.g., data replication latency) during the data transfer process to the replication nodes, since the data transfer process is divided between two groups of threads, namely the first group of threads and the second group of threads. Each thread from the first group of threads as well as from the second group of threads is assigned a particular task, which further leads to improved CPU utilization (because more tasks utilize more CPU cores in parallel) and reduced idle time. The replication latency can be calculated using one-sided RDMA operations to a persistent journal without any software on the secondary nodes. The method 100 enables idle-time minimization by selecting the tasks according to different time frames (e.g., RTT, commit value, launch period, etc.), which further results in maximization of the system's utilization. Moreover, the method 100 enables lockless task transmission between the two groups of threads, which leads to better CPU utilization and overall low memory consumption. The method 100 is also applicable to high-performance computing (HPC).
The steps 102, and 104 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
FIG. 2 is a block diagram that illustrates various exemplary components of a computing device, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements from FIG. 1. With reference to FIG. 2, there is shown a block diagram of a computing device 200 that includes a memory 202, a communication interface 204 and a processor 206. The memory 202 is configured to store a task request 202A. The memory 202 and the communication interface 204 may be communicatively coupled to the processor 206. The processor 206 of the computing device 200 is configured to execute the method 100 (of FIG. 1).
The computing device 200 includes suitable logic, circuitry, interfaces, or code that is configured to process the task request 202A. The computing device 200 may be a multithreading computing system. The computing device 200 may also be referred to as a primary node. Examples of the computing device 200 include, but are not limited to, a computing server, a mainframe computer, a supercomputer, and the like. In an example, the computing device 200 may be a single computing device or an electronic device. In another example, the computing device 200 may be a computing node in a network of a plurality of computing devices, or electronic devices, operating in a parallel or distributed architecture.
The memory 202 includes suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the processor 206. Examples of implementation of the memory 202 may include, but are not limited to, a local persistent memory, or a remote direct memory. The memory 202 may store an operating system or other program products (including one or more operation algorithms) to operate the computing device 200.
The communication interface 204 may include suitable logic, circuitry, interfaces, or code that is configured to receive a task request from a client. Moreover, the communication interface 204 is configured to communicate with each of the memory 202 and the processor 206. Examples of the communication interface 204 may include, but are not limited to, a radio frequency transceiver, a network interface, a telematics unit, and/or a subscriber identity module (SIM) card.
The processor 206 includes suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 202. In an example, the processor 206 may be a general-purpose processor. Other examples of the processor 206 may include, but are not limited to, a central processing unit (CPU), a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a very long instruction word (VLIW) processor, a state machine, a data processing unit, and other processors or control circuitry. Moreover, the processor 206 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the computing device 200.
In operation, the processor 206 is configured to determine a process for servicing the task request 202A received from a client. The processor 206 is configured to receive the task request (e.g., an RDMA write request) from the client. Thereafter, the processor 206 is configured to determine the process for servicing the task request.
The processor 206 is further configured to divide the process for servicing the task request between at least two groups of threads, where a first group of threads is configured to handle tasks of lower complexity than those handled by a second group of threads. After receiving the task request (i.e., the RDMA write request) from the client, the processor 206 is further configured to divide the process between two groups of threads, namely the first group of threads and the second group of threads. The first group of threads is configured to handle tasks of lower complexity, such as receiving the task request from the client and sending the task request to secondary nodes, in comparison to tasks such as receiving replies from the secondary nodes, executing the task request at a primary node and sending a reply to the client, which are handled by the second group of threads. Moreover, the first group of threads has a smaller number of threads bound to CPU cores in comparison to the second group of threads. By virtue of handling the tasks of higher complexity, the second group of threads requires more CPU power and therefore has a larger number of threads bound to CPU cores, as illustrated in the sketch below.
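As a hedged illustration of the asymmetric split referred to above, the sketch below spawns a small first group and a larger second group; the 2/6 split and the empty worker bodies are assumptions, not counts taken from the disclosure, and each thread would additionally be pinned to its own core as discussed earlier.

```cpp
#include <thread>
#include <vector>

// Hypothetical worker bodies for the two roles (assumptions).
void first_group_worker(int /*id*/)  { /* receive requests, forward them to replicas */ }
void second_group_worker(int /*id*/) { /* wait for acks, commit locally, ack the client */ }

int main() {
    const int kFirstGroupThreads  = 2;   // fewer threads: lighter, faster tasks (assumed count)
    const int kSecondGroupThreads = 6;   // more threads: heavier completion work (assumed count)

    std::vector<std::thread> threads;
    for (int i = 0; i < kFirstGroupThreads; ++i)
        threads.emplace_back(first_group_worker, i);
    for (int i = 0; i < kSecondGroupThreads; ++i)
        threads.emplace_back(second_group_worker, i);
    for (auto& t : threads) t.join();
    return 0;
}
```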
In one aspect, the present disclosure provides a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method 100 (of FIG. 1). In yet another aspect, the present disclosure provides a non-transitory computer-readable medium having stored thereon computer-implemented instructions that, when executed by a computer, cause the computer to execute operations of the method 100.
FIG. 3 illustrates servicing of a task request using a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIGs. 1 and 2. With reference to FIG. 3, there is shown a sequence diagram 300 for servicing a task request. There is shown a client 302, a primary node 304 and one or more secondary nodes 306. There is further shown a first group of threads 304A and a second group of threads 304B at the primary node 304. There is further shown a sequence of operations 308 to 318. Each of the first group of threads 304A and the second group of threads 304B is represented by a dashed box, which is used for illustration purposes only.
At operation 308, the client 302 is configured to send a task request (e.g., an RDMA write request) to the primary node 304.
At operation 310, the first group of threads 304A at the primary node 304 is configured to receive the task request from the client 302 and send the received task request (i.e., RDMA write request) to the one or more secondary nodes 306 (i.e., replication nodes or replicas).
At operation 312, the second group of threads 304B is waiting for acknowledgement from the one or more secondary nodes 306.
At operation 314, the one or more secondary nodes 306 send the acknowledgement about the task request to the second group of threads 304B at the primary node 304. At operation 316, the second group of threads 304B completes the task request by writing the data to persistent memory of the primary node 304.
At operation 318, the second group of threads 304B sends a confirmation of the completed task request to the client 302. In this way, the task request received from the client 302 is serviced between the primary node 304 and the one or more secondary nodes 306 through the first group of threads 304A and the second group of threads 304B.
Moreover, after the operation 310, there is a waiting period (or idle time) during which the second group of threads 304B waits for an acknowledgement from the one or more secondary nodes 306. If an existing task request has large round-trip-time and commit values, the second group of threads 304B may be configured to select another task with a lower RTT value than the existing task for concurrent execution during the waiting period, without disturbing the order of writing. This results in more efficient system utilization during the waiting period.
As described above, the process of servicing the task request is divided between the first group of threads 304A and the second group of threads 304B, and each thread of the first group of threads 304A and the second group of threads 304B is assigned a different role. Therefore, multiple tasks can run concurrently, resulting in improved performance of the system (i.e., the multithreading computing system) in terms of high throughput, low latency, and bandwidth. Conventionally, there is no specific role or task request assigned to multiple threads, and it is assumed that each thread handles the assigned task request in the same manner as its siblings. This results in an increased number of context switches; hence, the conventional process of servicing the task request is computationally intensive and has high latency as well. Due to the role separation between the first group of threads 304A and the second group of threads 304B, context switches still occur, but their number drops significantly because each thread of the first group of threads 304A and the second group of threads 304B is scheduled with its assigned task request.
FIG. 4 illustrates assignment of one or more task requests between a first group of threads and a second group of threads at a primary node, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3. With reference to FIG. 4, there is shown the primary node 304 that includes the first group of threads 304A and the second group of threads 304B. There is further shown a plurality of individual task queues 402 and a shared task queue 404. The primary node 304 is represented by a dashed box, which is used for illustration purposes only.
Each thread of the first group of threads 304A and the second group of threads 304B corresponds to a single core of the CPU (e.g., the processor 206 of FIG. 2). Initially, the first group of threads 304A (also represented as G1) is configured to add a plurality of task requests (e.g., RDMA write requests) to each individual task queue of the plurality of individual task queues 402. Each individual task queue of the plurality of individual task queues 402 has a tail portion and a head portion. The plurality of task requests is enqueued to the tail portion of each of the plurality of individual task queues 402. Thereafter, each of the plurality of individual task queues 402 is paired to each thread of the second group of threads 304B (also represented as G2).
Alternatively stated, each task request from each of the plurality of individual task queues 402 is dequeued from its respective head portion to each thread in the second group of threads 304B (i.e., G2). In addition to the plurality of individual task queues 402, the second group of threads 304B (i.e., G2) may be configured to access the shared task queue 404 using the compare-and-swap (CAS) technique. In other words, the shared task queue 404 is shared between all threads in the second group of threads 304B (i.e., G2). Furthermore, each thread from the first group of threads 304A (i.e., G1) and the second group of threads 304B (i.e., G2) may be configured to access the plurality of individual task queues 402 using the CAS technique. Additionally, all the work queues, that is, the plurality of individual task queues 402 and the shared task queue 404, are accessed using the CAS technique because the CAS technique provides fast access to the shared data for multiple threads and replaces the slower mutual-exclusion technique. The plurality of individual task queues 402 corresponds to a thread space shared memory which is accessible to each of the first group of threads 304A (i.e., G1) and the second group of threads 304B (i.e., G2).
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A method (100) for processing a task request (202A) in a multithreading computing system, the method (100) comprising: determining a process for servicing the task request (202A) received from a client (302); and dividing the process for servicing the task request (202A) between at least two groups of threads, wherein a first group of threads (304A) is configured to handle tasks of lower complexity than those handled by a second group of threads (304B).
2. The method (100) of claim 1, wherein the first group of threads (304A) is configured to: receive the task request (202A) from the client (302) at a primary node (304), and send the task request (202A) to one or more secondary nodes (306) for replication.
3. The method (100) of claim 2, wherein the second group of threads (304B) is configured to: receive a response from the one or more secondary nodes (306), complete the task request (202A) at the primary node (304), and send a confirmation of completed task request to the client (302).
4. The method (100) of any preceding claim, further comprising pairing an individual task queue to each thread in the second group of threads (304B), wherein tasks are added to the individual task queue from the first group of threads (304A).
5. The method (100) of claim 4, further comprising creating a shared task queue (404) between all threads in the second group of threads (304B).
6. The method (100) of claim 4 or 5, further comprising adding tasks lying on a shared memory between the first group of threads (304A) and the second group of threads (304B) to either the individual task queue or the shared task queue (404), wherein each thread from the first group of threads (304A) is configured to use compare-and-swap, CAS, technique to access the individual task queue, and wherein each thread from the second group of threads (304B) is configured to use CAS technique to access each of the individual task queue and the shared task queue (404).
7. The method (100) of any of claims 4 to 6, wherein each task added to the individual task queue or the shared task queue (404) comprises a pre-determined round-trip time (RTT) value and a commit value.
8. The method (100) of any of claims 5 to 7, wherein each thread in the second group of threads (304B) is configured to handle a task from either the corresponding individual task queue or the shared task queue (404) in an idle period, wherein the idle period is the time spent by the thread waiting for the completion of the existing task request.
9. The method (100) of claim 8, wherein a task from either the individual task queue or the shared task queue (404) is only handled in the idle period when the RTT value of the task is less than the sum of the RTT value and the commit value of the existing task request.
10. The method (100) of claim 9, wherein the task is picked from the shared task queue (404) only when no eligible task for the idle period is found in the individual task queue.
11. A computing device (200) for processing a task request (202A) comprises: a memory (202); a communication interface (204); and a processor (206) configured to: determine a process for servicing the task request (202A) received from a client (302); and divide the process for servicing the task request (202A) between at least two groups of threads, wherein a first group of threads (304A) is configured to handle tasks of lower complexity than those handled by a second group of threads (304B).
12. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method (100) of any of claims 1 to 10.
PCT/EP2021/082903 2021-11-25 2021-11-25 Method and computing device for processing task request WO2023093984A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/082903 WO2023093984A1 (en) 2021-11-25 2021-11-25 Method and computing device for processing task request

Publications (1)

Publication Number Publication Date
WO2023093984A1 true WO2023093984A1 (en) 2023-06-01

Family

ID=78822487

Country Status (1)

Country Link
WO (1) WO2023093984A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140059262A1 (en) * 2012-08-21 2014-02-27 Lenovo (Singapore) Pte. Ltd. Task scheduling in big and little cores
US20150121105A1 (en) * 2013-10-31 2015-04-30 Min Seon Ahn Electronic systems including heterogeneous multi-core processors and methods of operating same
US11102137B2 (en) * 2018-07-13 2021-08-24 Samsung Electronics Co., Ltd Apparatus and method for processing data packet of electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21820185

Country of ref document: EP

Kind code of ref document: A1