AU2016201513B2 - Low latency fifo messaging system - Google Patents

Low latency fifo messaging system

Info

Publication number
AU2016201513B2
Authority
AU
Australia
Prior art keywords
node
message
messages
rdma
counter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2016201513A
Other versions
AU2016201513A1 (en)
Inventor
Nishant AGRAWAL
Manoj Karunakaran Nambiar
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tata Consultancy Services Ltd
Original Assignee
Tata Consultancy Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd filed Critical Tata Consultancy Services Ltd
Priority to AU2016201513A priority Critical patent/AU2016201513B2/en
Publication of AU2016201513A1 publication Critical patent/AU2016201513A1/en
Application granted granted Critical
Publication of AU2016201513B2 publication Critical patent/AU2016201513B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Computer And Data Communications (AREA)

Abstract

LOW LATENCY FIFO MESSAGING SYSTEM A system for lockless remote messaging in an inter-process communication between processing nodes, as implemented by an RDMA supported Network Interface Card, is presented. The inter-process communication is implemented using RDMA write operations accessed through the infiniband verbs library, over Infiniband or Ethernet. This provides direct access to the RDMA enabled NIC without system call overhead, achieving the low latency required for remote messaging along with high messaging rates. The RDMA NIC receives the messages in bulk, as the remote sender process bundles together a plurality of messages to reduce the number of work requests per message transmitted and acknowledged. This requires memory mapped structures hosted on the communicating processing nodes to be synchronized by RDMA. Figure 1

Description

LOW LATENCY FIFO MESSAGING SYSTEM
FIELD OF THE INVENTION
The present invention relates to the field of inter-processor messaging and more particularly to low latency remote messaging assisted by a Remote Direct Memory Access based first-in-first-out system.
BACKGROUND OF THE INVENTION
With the advent of computing acceleration, the exchange of data between two different software threads or processor cores demands to be fast and efficient. In general, the existing methods of remote messaging over typical TCP/IP schemes have the disadvantage of high CPU utilization for sending and receiving messages. In the TCP/IP messaging paradigm, a software thread does not share any common memory space with another software thread it desires to communicate with. Instead, sending and receiving a message to and from another software thread requires the use of the socket send() and socket recv() system calls respectively.
Communication via a typical TCP/IP scheme involves a large number of software instructions that must be executed by CPU cores residing on both the sending and remote hosts. Additionally, every time a send() system call is executed there is a change of context from user level to system level, which amounts to a high CPU overhead. The same holds for the receive system call on the receiving end.
Since the amount of data that must be exchanged between two different software threads has swelled, the message FIFO between two processor cores needs to be low latency so that the processors need not slow down due to frequent communication. With the TCP/IP protocol in place, it is very difficult to achieve low latency messaging at high message rates because of the system calls that need to be executed by the application process in order to facilitate message exchange between the sending and receiving peers.
This implies that the messaging infrastructure (including software) should be capable of processing very large workloads, meaning more than a million messages per second. Accordingly, keeping in view the workload that messaging systems presently handle and the anticipated future workload, a new system which ensures low latency messaging and optimized throughput is urgently required.
Thus, in the light of the above mentioned background of the art, it is evident that there is a need for a system and method which:
• provides a high throughput and low latency messaging technique for inter-process communication between at least two processes running on at least two nodes;
• increases the throughput optimization of the messaging system;
• reduces the latencies of the messaging system;
• requires minimum infrastructure;
• reduces the cost of the hardware setup to improve throughput and reduce the latency of the messaging system; and
• is easy to deploy on existing systems.
OBJECTIVES OF THE INVENTION
The principal object of the present invention is to provide a system for high throughput messaging in inter-process communication across the network, with lower latencies at higher workloads, between processes running on remote nodes.
Another significant object of the invention is to provide a high throughput and low latency messaging system for inter-process communication between multiple processes running on remote nodes.
It is another object of the present invention to provide a cost-effective high throughput and low latency messaging system for inter-process communication between processes running on remote nodes.
Another object of the invention is to provide a system employing minimal computational resources by reducing CPU intervention for high throughput and low latency messaging and making it more available for application programs.
Yet another object of the invention is to provide an inter-process messaging system requiring minimal infrastructure support by eliminating the need for an additional receiver process at the remote host, thereby removing one latency introducing component.
Yet another object of the invention is to reduce the number of additional message copies required for realizing a high throughput and low latency message passing technique in inter-process communication.
SUMMARY OF THE INVENTION
Before the present methods, systems, and hardware enablement are described, it is to be understood that this invention is not limited to the particular systems and methodologies described, as there can be multiple possible embodiments of the present invention which are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is for the purpose of describing the particular versions or embodiments only, and is not intended to limit the scope of the present invention, which will be limited only by the appended claims.
The present invention envisages a system and method for low latency and high throughput messaging in inter-process communication between processes running on remote nodes.
In the preferred embodiment of the invention the system implements an asynchronous lockless FIFO message queue between two server hosts using Remote Direct Memory Access (RDMA) technology. The inter-process communication is implemented using RDMA write operations accessed through the infiniband verbs library, which obviates the use of the TCP/IP scheme provided by the operating system for remote messaging, a scheme which involves high system call overhead.
The present system, on the contrary, provides direct access to RDMA enabled Network Interface Cards (NICs) without system call overhead, which is key to achieving very low latency messaging. The RDMA NIC converts RDMA write operations into a series of RDMA protocol messages over TCP/IP which are acted upon by the RDMA NIC in the remote host, which makes the necessary updates to the memory of the remote host.
According to one of the preferred embodiments of the present invention, a system for lockless remote messaging in an inter-process communication between at least two processes running on at least two nodes, implemented by an RDMA supported Network Interface Card configured to synchronize a memory mapped file hosted on each of the said nodes, is provided, the system comprising:
a) a sending host node communicatively coupled with a receiving host node for sending and receiving messages over a computing network respectively;
b) an RDMA supported Network Interface Card deployed on each of the said host nodes for executing RDMA commands;
c) a storage hosted on each host node adapted to store inter-process messages invoked by either of the communicatively coupled host nodes;
d) a first memory mapped file hosted on the sending host node configured to synchronize a static circular queue of messages with a second memory mapped file hosted on the receiving host node, and vice versa; and
e) at least one remote sender process running on the sending host node for constituting at least one batch of messages and asynchronously sending the batch along with a corresponding RDMA work request, wherein the batch constitution involves a coordination between an operational status of the sending host node and a bulking variable to determine the number of messages within the batch, and wherein inclusion of an additional message in the batch is further determined by a predetermined aclat parameter.
According to another preferred embodiment of the present invention, a memory mapped structure comprising computer executable program code is provided, wherein the said structure is configured to synchronize a static circular queue of messages between the sending and receiving host nodes, the structure comprising:
a) a plurality of messages bundled together to form at least one batch, each batch comprising a sequence of payload sections, wherein each payload section is intermittently followed by a corresponding node counter structure to constitute a contiguous memory region, and wherein the payload section is further coupled with a common queue data and continuously arranged headers;
b) a rdma free pointing element adapted to point to a message buffer in which the sending host node inserts a new message;
c) a rdma insertion counter to count the number of messages inserted by the sending host node;
d) a receiving node counter structure element, responsive to the receiving host node, configured to enable said receiving host node to issue one RDMA work request for acknowledging at least one message from the batch;
e) a last sent message node pointing element of the common queue data to point to the node counter structure of the last message sent from a remote sender process to the receiving host node; and
f) a last received message node pointing element of the common queue data to point to the message last received by the receiving host node.
In another embodiment of the present invention, a method for lockless remote messaging in an inter-process communication between at least two processes running on at least one sending and one receiving host node, implemented by an RDMA supported Network Interface Card configured to synchronize a queue of messages via a memory mapped file hosted on each of the said nodes, is provided, the said method comprising:
a) initializing transfer of a message from the sending host node to the corresponding memory mapped file whenever an indication that the message is to be read from the message buffer by the receiving host node is received, and accordingly updating a rdma free pointing element and a rdma insertion counter to indicate to the sending host node to transfer the next message to the message buffer;
b) performing constitution of at least one batch of messages in a remote sender process, wherein the batch constitution is based upon a coordination between an operational status of the sending host node and a bulking variable to determine the number of messages within the batch, and wherein a determination for inclusion of a next message in the constituted batch is further dependent upon a predetermined aclat parameter;
c) updating a batch size of the constituted batch, a node counter structure of a previous node to detect arrival of any new message, and a last sent message pointing element to point to the last message in the message batch to be read by the receiving host;
d) issuing a RDMA work request for transmitting the contiguous message buffer section associated with the batch of messages; and
e) initializing transfer of the message batch from a memory mapped file to the corresponding receiving host node and updating a last received message pointing element along with a data pointing element to indicate the arrival of the message batch to be read by the receiving host.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing summary, as well as the following detailed description of preferred embodiments, are better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings example constructions of the invention; however, the invention is not limited to the specific methods and system disclosed. In the drawings:
Figure 1 illustrates a typical layout of Memory Mapped File as known in the prior art.
Figure 2 shows circular queue of messages as represented in a Memory Mapped File Layout.
Figure 3 illustrates a system for inter process communication between two server hosts with memory mapped files synchronized using RDMA.
Figure 4 shows a design layout of Memory Mapped File in accordance with the preferred embodiment of the present invention.
Figure 5 shows the implementation set up of the system in accordance with one of the preferred embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Some embodiments of this invention, illustrating all its features, will now be discussed in detail.
The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, the preferred systems and methods are now described.
The disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms.
Definitions:
Throughput: The number of messages that can be read from, or written to, the queue per second is called the throughput.
Latency: The time elapsed between the sending of a message by the sender process and the receiving of that message by the receiver process is the latency experienced by that message.

RDMA write operation: An RDMA write operation, interchangeably called an RDMA write work request, is a command issued to the RDMA capable NIC. It is a user level call which notifies the local NIC about where the data is placed in RDMA registered memory (RAM) and its length. The NIC then (asynchronously) fetches the data and transfers it across the network using the relevant RDMA protocol (iWARP). In effect, it writes the specified data from the local memory location to the memory location on the remote host. The NIC on the remote host responds to the iWARP messages by placing the data in its RDMA registered memory, thus carrying out the RDMA write work request.
The RDMA write operations are accessed through the infiniband verbs library.
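As a concrete illustration, the following minimal C sketch posts one RDMA write work request through the verbs library. It assumes a queue pair and a registered memory region have already been set up; the function name and parameters are illustrative scaffolding, not part of the patent.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a single RDMA write: the local NIC fetches `len` bytes from
 * `local_buf` (which must lie inside the registered region `mr`) and
 * places them at `remote_addr` on the peer, with no remote CPU work. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = { 0 };
    struct ibv_send_wr *bad_wr = NULL;

    wr.wr_id               = (uintptr_t)local_buf;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write  */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* report on the CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```

Note that ibv_post_send() only hands the request to the NIC; the transfer itself proceeds asynchronously, which is exactly the property the system exploits.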
Memory registration: These are APIs provided by RDMA to make a local memory region available to remote hosts. This is essential for the use of RDMA write operations. RDMA technology allows applications to access memory on remote hosts as if it were available on the same host where the application runs. RDMA was first introduced in Infiniband networks, where the native infiniband protocol is used, and was later supported on Ethernet networks using iWARP. In both networks, the network interface cards (NICs) are capable of executing RDMA write commands which cause the placement of data in the memory area of the remote host. Though Ethernet networks are referred to later in this document for illustrative purposes, the invention is not limited to Ethernet networks and can be implemented on Infiniband networks using the native infiniband protocol. An RDMA interface allows an application to read from and/or write into memory (RAM) locations of a remote host. This is very much unlike sending and receiving messages; it gives the application the illusion of a shared memory between the sender and receiver processes even though they run on separate hosts.
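A minimal sketch of such a registration is shown below, assuming the FIFO queue lives in a memory mapped file as described later; the path and size parameters are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Map the queue file and register it with the NIC so that the remote
 * host's RDMA writes may land directly in it. */
static struct ibv_mr *register_queue_file(struct ibv_pd *pd,
                                          const char *path, size_t size)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* REMOTE_WRITE permits incoming RDMA writes; LOCAL_WRITE lets the
     * local NIC write completions into the region on this host. */
    return ibv_reg_mr(pd, base, size,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```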
The device drivers of NICs supporting RDMA operation provide a direct interface for application programs to send data bypassing the operating system. The costly overhead of switching from user mode to system mode is avoided, thereby allowing the application to be executed by the CPU more efficiently. Further, the RDMA NIC implements the complex networking tasks required to transfer the message from the local host to the remote host without any CPU intervention, making the CPU more available for application programs.
The other advantageous feature of the present invention is eliminating the need to perform an additional copy operation as required in typical system call assisted communication. For an RDMA write operation the NIC does a Direct Memory Access transfer with the source data in the registered memory region, which the application running in user mode can directly write the message into, thereby obviating the need for that additional copy.
Also, as the RDMA write operation takes up the entire responsibility of making the data available in the registered memory region of the remote host where the receiver process runs, there is no need for a separate receiver process to manage the receiving of messages from the network which effectively contributes to the reduction of one latency introducing component in the system.
As shown in Figures 1 and 2, a typical layout of a memory mapped file contains a static circular queue of messages. Each message structure in the file has a header section 101 and a payload section 102. The payload section 102 contains the raw message as passed by the application; it is also referred to as the message buffer. The header section 101 contains the pointer to the header of the next message, thus creating a circular queue of messages. The initial part of the file contains data specific to the queue. Some of the important variables in this section are:
• data_anchor 103 - points to the next message to be read by the receiver.
• free_node 104 - points to the message that will be written to by the sender.
• number_of_inserts 105 - number of messages sent by the sender (since queue creation).
• number_of_deletes 106 - number of messages read by the receiver (since queue creation).
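For orientation, this layout can be pictured with the following C sketch. The field names follow the text; the sizes are arbitrary, and the links are shown as file offsets rather than raw pointers, since the file is mapped on two different hosts.

```c
#include <stdint.h>

#define PAYLOAD_MAX 256              /* illustrative message size        */

struct message {
    uint64_t next;                   /* header 101: offset of the next
                                        message header (circular link)   */
    char payload[PAYLOAD_MAX];       /* payload section 102 (the
                                        message buffer)                  */
};

struct queue_common_data {           /* initial part of the file         */
    uint64_t data_anchor;            /* 103: next message receiver reads */
    uint64_t free_node;              /* 104: slot the sender writes next */
    uint64_t number_of_inserts;      /* 105: sends since queue creation  */
    uint64_t number_of_deletes;      /* 106: reads since queue creation  */
};
```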
Referring to Figure 2, an additional free node pointer called rdma_free_node 201 and a new counter variable called rdma_inserts 202 have been introduced, which split the typical message structure into two. The messages pointed to by free_node 104 up to the messages pointed to by rdma_free_node 201 represent the messages that have been queued by the sender process but are yet to be transferred to the remote host (server B) via RDMA. The messages from free_node 104 to data_anchor 103 are already in transit to (or have reached) the remote host, waiting to be acknowledged by the receiver process (on server B) through the update of the data_anchor pointer.
Now, referring to Figure 3, a system for inter process communication between two server hosts with memory mapped files synchronized using RDMA is presented. The system 300, according to one of the preferred embodiments of the present invention, comprises:
• Physical server 301 - the host on which the sender application process runs.
• Physical server 302 - the host on which the receiver application runs.
• RDMA capable network interface card (NIC) on server 301 - capable of executing RDMA commands from the local or remote host.
• RDMA capable network interface card on server 302 - capable of executing RDMA commands from the local or remote host.
• Messaging library - contains the message send and receive functions which are linked in and invoked by the sender and receiver application processes.
• Memory mapped file 303 on server 301 - contains the FIFO queue which is used for sending and receiving messages. It is synchronized with server 302.
• Memory mapped file 304 on server 302 - contains the FIFO queue which is used for sending and receiving messages. It is synchronized with server 301.
• Remote sender process 305 running on server 301 - the component responsible for batching queued messages via RDMA. It groups all messages from free_node 104 to rdma_free_node 201 and issues RDMA work requests for the entire group of messages.
• Ethernet or Infiniband switch (optional) - the switch that connects servers 301 and 302.
Referring next to Figure 4, a design layout of the Memory Mapped File is shown. The figure represents a separate section for buffer headers which point to the payload areas of the buffers. The headers are contiguously allocated in one memory area, and so are the payload sections. There is also an area where the data common to the queue is stored. Data anchor 103, free node pointer 104, number_of_inserts 105 and number_of_deletes 106 are all examples of such common queue data 401. Within the common queue data area 401, a structure which groups the free node pointer 104 and the data anchor pointer 103 is also presented.
Further, the common queue data of the memory mapped file contains two variables, free_node 104 and number_of_inserts 105, that have been grouped together in a single structure to eliminate a latency introducing component: grouping them allows the entire structure to be sent in one RDMA write work request instead of two separate work requests. This structure will now be known as the node_counter structure 402.
In every update issued by the remote sender process RS 305 there are two work requests: one work request points to the payload region and the other points to the node_counter structure. These work requests cannot be combined because one work request can point only to a contiguous region of memory. To reduce the two work requests required per update to one, it is necessary to combine the two sets of data in a different way.
Figure 4 depicts an optimized memory layout wherein the node_counter structure 402 is repeated at the end of the payload section of every message. It is thus possible to combine the message payload and the node_counter structure into one work request, as they now lie in a contiguous memory region.
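Continuing the earlier layout sketch, the Figure 4 arrangement can be pictured as below: because each node_counter copy sits immediately after its payload, one scatter/gather entry (and hence one work request) can cover both. Names and sizes remain illustrative.

```c
struct node_counter {                /* structure 402                    */
    uint64_t free_node;              /* where the next batch will start  */
    uint64_t number_of_inserts;      /* messages updated to remote host  */
};

struct message_slot {
    char payload[PAYLOAD_MAX];       /* payload section of one message   */
    struct node_counter nc;          /* repeated per message (Figure 4):
                                        payload + nc form one contiguous
                                        region for a single work request */
};
```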
The newly added variables and the new meanings of modified variables on the sending side are as follows:
• rdma_free_node - points to the message buffer in which the sender will insert the next new message.
• rdma_inserts - number of messages inserted by the sender process since the queue was created.
• node_counter.free_node - points to the next message starting from which the remote sender process will start batching messages in order to send them as part of one RDMA write work request.
• node_counter.number_of_inserts - the number of messages that have been updated to the remote host (via RDMA write work requests) since the creation of the queue.
Also, the data_anchor and the number_of_deletes can be grouped together in a single structure. This allows the receiver process to send the entire structure in one RDMA write work request instead of sending them in separate work requests. This structure will be known as the receiver_node_counter structure. For the receiver process, receiver_node_counter.data_anchor functions the same as data_anchor, and receiver_node_counter.number_of_deletes functions the same as number_of_deletes.
The optimization achieved by introducing the newly added variables for establishing low latency, high throughput messaging is presented below, wherein the number of RDMA work requests has been reduced to one by using the remote sender process 305.
The modified algorithm of the sender process is as follows:
Loop:
a. If the next update of rdma_free_node equals data_anchor, keep checking; else continue to the next step.
b. Copy the message from the user buffer to the local memory mapped file.
c. Update rdma_free_node to point to the next data buffer.
d. Increment the rdma_inserts counter.
This time the sender process does not issue any RDMA work requests as this work now will be done by the remote sender process (RS).
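A C sketch of this sender loop is given below. The struct queue view and the next_of()/slot_payload() helpers are hypothetical glue around the steps a-d above; note that, as stated, no RDMA work request is posted here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct queue {                       /* minimal view of the mapped file */
    uint64_t data_anchor;            /* receiver's read position        */
    uint64_t rdma_free_node;         /* sender's write position         */
    uint64_t rdma_inserts;           /* inserts since queue creation    */
};

extern uint64_t next_of(struct queue *q, uint64_t slot);   /* follow link */
extern char *slot_payload(struct queue *q, uint64_t slot); /* payload ptr */

/* Enqueue one message into the local memory mapped file. */
void sender_enqueue(struct queue *q, const void *msg, size_t len)
{
    /* a. queue full: the next slot is still unread by the receiver */
    while (next_of(q, q->rdma_free_node) == q->data_anchor)
        ;                            /* spin; the receiver will advance */
    /* b. copy from the user buffer into the mapped file */
    memcpy(slot_payload(q, q->rdma_free_node), msg, len);
    /* c. advance rdma_free_node to the next data buffer */
    q->rdma_free_node = next_of(q, q->rdma_free_node);
    /* d. count the insert; the remote sender process polls this */
    q->rdma_inserts++;
}
```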
The algorithm for the remote sender process, in the new optimized approach, is as follows:
1. Register the local memory mapped file with the remote host 302 and perform the following operations:
a. If rdma_free_node equals free_node and rdma_inserts equals number_of_inserts, keep checking; else proceed to the next step.
b. node_var = free_node
c. prev_node = NULL
d. Initialize message_group to null.
e. Initialize group_size to 0.
f. While node_var does not equal rdma_free_node:
   i. Add the message pointed to by node_var into the message_group.
   ii. Increment group_size.
   iii. prev_node = node_var
   iv. node_var = next node
g. Add group_size to number_of_inserts in the node_counter structure of the nodes pointed to by last_message_node_counter_pointer and prev_node.
h. Update free_node (in the node_counter structure of the nodes pointed to by last_message_node_counter_pointer and prev_node) to point to the message buffer next to the last message in message_group.
i. last_message_node_counter_pointer = prev_node
j. Check the status of previous RDMA work requests and clear any that have completed.
k. Issue 1 RDMA work request for the payload section of messages in message_group.
Here the variable "last_sent_message_node_counter_pointer" of the first node in the queue (the last_message_node_counter_pointer used in the algorithm above) is introduced in the common queue data area. This variable points to the node_counter structure of the last message sent to the remote server B. For the purpose of illustration it will point to the node_counter belonging to message node A in the above figure. This is done during queue creation, as is the case for data_anchor, rdma_free_node and node_counter.free_node, which are made to point to message node A during queue creation as in the previous implementations. The counters node_counter.number_of_inserts in all the message nodes and number_of_deletes in the common data area are initialized to zero during queue creation as in the previous implementations. This initialization has not been explicitly mentioned before; it is mentioned now for completeness.
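The following sketch renders steps a-k in C. All types and helpers (the struct queue view, next_of(), nc_of(), reap_completions(), post_batch_write()) are hypothetical scaffolding around the algorithm in the text; the point is the single work request in the last step.

```c
#include <stddef.h>
#include <stdint.h>

struct node_counter { uint64_t free_node, number_of_inserts; };

struct queue {                       /* minimal, hypothetical view       */
    struct node_counter nc;          /* common-area node_counter 402     */
    uint64_t rdma_free_node, rdma_inserts;
    uint64_t last_message_node_counter_pointer;
};

extern uint64_t next_of(struct queue *q, uint64_t slot);
extern struct node_counter *nc_of(struct queue *q, uint64_t slot);
extern void reap_completions(struct queue *q);           /* step j        */
extern void post_batch_write(struct queue *q, uint64_t first,
                             size_t count);              /* step k: 1 WR  */

/* One iteration of the remote sender process (RS 305), steps a-k. */
void remote_sender_iteration(struct queue *q)
{
    /* a. nothing new queued by the sender yet */
    if (q->rdma_free_node == q->nc.free_node &&
        q->rdma_inserts == q->nc.number_of_inserts)
        return;

    uint64_t first      = q->nc.free_node;       /* batch starts here     */
    uint64_t node_var   = first;                 /* b.                    */
    uint64_t prev_node  = first;                 /* c. (NULL in the text) */
    size_t   group_size = 0;                     /* d, e.                 */

    while (node_var != q->rdma_free_node) {      /* f. walk pending slots */
        group_size++;
        prev_node = node_var;
        node_var  = next_of(q, node_var);
    }

    /* g, h. advance the queue state and stamp it into the node_counter
     * copies at the previous batch's tail and at this batch's tail; those
     * are the only copies the receiver will inspect. */
    q->nc.number_of_inserts += group_size;
    q->nc.free_node = node_var;                  /* one past the batch    */
    *nc_of(q, prev_node) = q->nc;
    *nc_of(q, q->last_message_node_counter_pointer) = q->nc;

    q->last_message_node_counter_pointer = prev_node;   /* i.             */
    reap_completions(q);                                /* j.             */
    post_batch_write(q, first, group_size);  /* k. one WR for the batch's
                                                contiguous payload + nc   */
}
```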
The variable "last_received_message_node_counter_pointer" of the first node in the queue is introduced in the common queue data area. This variable points to the node_counter structure of the last message received from the remote server A.
The optimized algorithm of the receiver process is as follows:
1. Register the local memory mapped file with the remote host (server A) and perform the following operations:
a. Check the status of previous RDMA work requests and clear any that have completed.
b. If last_received_message_node_counter_pointer.free_node equals data_anchor, keep checking; else continue.
c. Copy the message from the local memory mapped file to the user buffer.
d. last_received_message_node_counter_pointer = data_anchor
e. Update the receiver_node_counter.data_anchor pointer to the next data buffer.
f. Increment the receiver_node_counter.number_of_deletes counter.
g. Issue 1 RDMA work request for updating the receiver_node_counter structure.
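Correspondingly, a sketch of the receiver loop is shown below, with the same kind of hypothetical scaffolding; the acknowledgement in step g is the single RDMA write of the receiver_node_counter structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct node_counter { uint64_t free_node, number_of_inserts; };
struct receiver_node_counter { uint64_t data_anchor, number_of_deletes; };

struct queue {                            /* minimal, hypothetical view  */
    struct receiver_node_counter recv_nc; /* written back to server A    */
    uint64_t last_received_message_node_counter_pointer;
};

extern uint64_t next_of(struct queue *q, uint64_t slot);
extern char *slot_payload(struct queue *q, uint64_t slot);
extern struct node_counter *nc_of(struct queue *q, uint64_t slot);
extern void reap_completions(struct queue *q);          /* step a        */
extern void post_recv_nc_write(struct queue *q);        /* step g: 1 WR  */

/* One iteration of the optimized receiver, steps a-g. */
void receiver_dequeue(struct queue *q, void *out, size_t len)
{
    reap_completions(q);                                        /* a. */
    /* b. poll the node_counter last stamped by the remote sender */
    while (nc_of(q, q->last_received_message_node_counter_pointer)->free_node
           == q->recv_nc.data_anchor)
        ;                                                       /* spin */
    /* c. copy out of the mapped file into the user buffer */
    memcpy(out, slot_payload(q, q->recv_nc.data_anchor), len);
    q->last_received_message_node_counter_pointer =
        q->recv_nc.data_anchor;                                 /* d. */
    q->recv_nc.data_anchor = next_of(q, q->recv_nc.data_anchor);/* e. */
    q->recv_nc.number_of_deletes++;                             /* f. */
    post_recv_nc_write(q);     /* g. acknowledge with one RDMA write */
}
```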
The above approach reduces the number of work requests in the remote sender process 305 to one by grouping the variables number_of_inserts and the free_node pointer into a single structure and placing the node_counter structure intermittently after each payload section. The deciding factor of performance for a system can be the maximum number of work requests that can be executed per second by the NIC supporting RDMA. With this in mind, it should be ensured that the number of work requests per update is optimized. By batching the messages and grouping variables as in the previous optimization, the number of work requests that go into one update has been reduced.
EXAMPLE OF WORKING OF THE INVENTION
The invention is described in the example given below which is provided only to illustrate the invention and therefore should not be construed to limit the scope of the invention.
Referring to Figure 4, it is assumed that the remote sender process 305 has batched 3 messages C, D and E and wants to update the remote host 302 using RDMA. The memory region to be updated in the batch is marked in the referred figure. Note that this memory region will include the node_counter structures for messages B, C, D and E. Also important to note is that the only node_counter structures that need to be updated are the ones attached to the payloads of messages B and E. The reasoning for this is as follows:
• node_counter structure attached to B: prior to messages C, D and E, the last message sent from the remote sender to the receiver was B. The node_counter structure of B was also updated as part of that last message. So the receiver will be checking the free_node pointer in the node_counter structure attached to B to determine if any new message has come.
• node_counter structure attached to E: once the batch is updated to the remote host and the messages C, D and E are read by the receiver, the node_counter structure attached to the payload of E is checked by the receiver to know that there are no further messages. Only the next batch update from the remote sender will update this node_counter structure in E to indicate that more messages have been inserted in the queue.
Optimization by increasing the number of messages being grouped by the Remote Sender Process
Next, explained is the optimization level achieved upon increasing the number of messages being grouped by the remote sender process 305. In all the above discussed optimization approaches, it is observed that the number of messages being grouped is not significant. In fact the average size of a group of messages sent by the remote sender process was less than 2. It is therefore understood that a need exists for adding more messages to the group to get efficient message transmissions. If the remote sender process 305 waited for more messages for an indefinite time, it would add to the latency of messages. So there has to be an upper bound on how many messages can be grouped together. This upper bound is referred to as the up-limit for the purposes of this invention. However, the remote sender process 305 need not wait for the entire up-limit number of messages before sending a group. It shall also be understood that there is no guarantee on when messages arrive. So in addition to this up-limit there can be some more indicators to decide whether to continue grouping messages (hereinafter called "bulking") or not.
Consider a situation when the sender process is queuing a message. In such a case, it is a good enough indication for the remote sender process 305 to wait for the next message to be bulked. On the contrary, in a situation where the sender process is not queuing a message, there is very little reason for the remote sender process 305 to wait to add another message to the group. However, if the application is willing to tolerate a slight amount of latency (hereinafter called aclat), then even if the sender is not currently queuing a message, the remote sender process can wait for aclat nanoseconds for the next message from the sender process to be added to the group.
To implement this idea, the sender process keeps a variable indicator, called write_on, to show whether it is currently queuing a message in the queue. It is declared as volatile. Also implemented is a user configurable aclat parameter which tells the remote sender process 305 how long to wait for the next message for grouping in case the sender process is not currently sending a message.
In the above discussed scenario, the modified sender process is detailed below:
a) write_on = 1
b) If the next update of rdma_free_node equals data_anchor, keep checking; else continue to the next step.
c) Copy the message from the user buffer to the local memory mapped file.
d) Update rdma_free_node to point to the next data buffer.
e) Increment the rdma_inserts counter.
f) write_on = 0
The remote sender process 305 is also modified for the changed scenario. A few new variables are added to achieve the said optimization:
a) buffer_variable: introduced to control the grouping of messages and indicate when the grouping (bulking) can be stopped.
b) nc: a temporary variable used to control the wait for grouping messages in case the sender process is not currently sending a message.
The modified remote sender process 305 in view of the newly added variables is as follows:
1) Register the local memory mapped file with the remote host (server B) and perform the following operations:
a) If rdma_free_node equals free_node and rdma_inserts equals number_of_inserts, keep checking; else proceed to the next step.
b) node_var = free_node
c) prev_node = NULL
d) Initialize message_group to null.
e) Initialize group_size to 0.
f) buffer_variable = 1
g) While buffer_variable is 1:
   i) If node_var equals rdma_free_node:
      1) If write_on == 0, exit the innermost loop.
      2) Wait while node_var equals rdma_free_node AND write_on == 1.
      3) If node_var still equals rdma_free_node:
         a) nc = 0
         b) init_time = time_stamp
         c) while nc == 0:
            I) curr_time = time_stamp
            II) diff_time = curr_time - init_time
            III) if node_var does not equal rdma_free_node, nc = 1
            IV) else if diff_time > aclat, nc = 2
         d) if nc == 2, exit the innermost loop.
   ii) Add the message pointed to by node_var into the message_group.
   iii) Increment group_size.
   iv) prev_node = node_var
   v) node_var = next node
   vi) If group_size > up-limit, buffer_variable = 0
h) Add group_size to number_of_inserts in the node_counter structure of the nodes pointed to by last_message_node_counter_pointer and prev_node.
i) Update free_node (in the node_counter structure of the nodes pointed to by last_message_node_counter_pointer and prev_node) to point to the message buffer next to the last message in message_group.
j) last_message_node_counter_pointer = prev_node
k) Check the status of previous RDMA work requests and clear any that have completed.
l) Issue 1 RDMA work request for the payload section of messages in the message_group.
In the above modified process, the remote sender process 305 first waits for at least one message to be inserted by the sender process, which is similar to what is done in previous optimization approaches. Here, the difference is in the looping when the first message is detected and the looping starts to group messages.
The grouping loop, so modified, functions as below:
First it checks whether new messages have arrived. If there is a new message, it proceeds to add the new message to the group as before. Otherwise it checks whether the sender is currently queuing a new message; this is indicated by the write_on variable, which is updated by the sender process. If the sender is indeed queuing a message, it waits for the new message to be inserted and then adds it to the group as before. If the sender is not adding a new message, it waits in a spin loop for the time specified by the administrator configured aclat parameter. Within this spin loop, if a new message is inserted it exits the loop and adds the new message to the group as before. If the time period specified by aclat expires without a new message arriving, the grouping of messages is stopped.
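This waiting logic can be condensed into a small helper, sketched below. The struct queue view is hypothetical; write_on is the sender's volatile flag, aclat_ns the configured bound, and the return value says whether another message should be added to the group.

```c
#include <stdint.h>
#include <time.h>

struct queue {                       /* minimal, hypothetical view       */
    volatile uint64_t rdma_free_node;
    volatile int write_on;           /* set/cleared by the sender        */
};

static long elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000000L
         + (b->tv_nsec - a->tv_nsec);
}

/* Returns 1 if another message is (or became) available for bulking,
 * 0 if the aclat grace period expired and grouping should stop. */
static int wait_for_bulking(struct queue *q, uint64_t node_var, long aclat_ns)
{
    if (node_var != q->rdma_free_node)
        return 1;                            /* already arrived           */
    while (node_var == q->rdma_free_node && q->write_on)
        ;                                    /* sender mid-insert: wait   */
    if (node_var != q->rdma_free_node)
        return 1;
    struct timespec t0, now;                 /* aclat spin window         */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        if (node_var != q->rdma_free_node)
            return 1;                        /* arrived within aclat      */
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while (elapsed_ns(&t0, &now) <= aclat_ns);
    return 0;                                /* stop bulking              */
}
```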
The next level of optimization is achieved when the memory layout is changed to further reduce the work requests. This is achieved by reducing the number of work requests to one for a batch of messages in the receiver process. In all the previous optimization approaches the number of work requests was reduced in the remote sender process 305, whereas the receiver process is still issuing one work request per message received. So at this point the receiver is the bottleneck, as it is issuing the maximum number of work requests per second. To improve performance it is clear that the receiver should issue fewer work requests. A little consideration shows that it is sufficient for the receiver to issue an acknowledgement work request only for the last message in the currently received set of messages from the remote sender.
However, one perceived disadvantage to this is that the acknowledgement will arrive at the sender a little later. This can be offset by the following considerations:
When the work request is actually executed by the NIC, the latest update to the receiver_node_counter structure is sent to the sender host (server A), as opposed to the value at the time the work request was issued.
If the throughput improves due to the reduction in work requests at the receiver process, the acknowledgment will again reach the sender host faster than perceived. So, a further modification of the receiver process for better optimization is as follows:
1) Register the local memory mapped file with the remote host and perform the following operations:
a) Check the status of previous RDMA work requests and clear any that have completed.
b) If last_received_message_node_counter_pointer.free_node equals data_anchor, keep checking; else continue.
c) Copy the message from the local memory mapped file to the user buffer.
d) last_received_message_node_counter_pointer = data_anchor
e) Update the data_anchor pointer to the next data buffer.
f) Increment the number_of_deletes counter.
g) If the currently read message is the last message in the group of messages sent by the remote sender process:
   a. Issue one RDMA work request for updating the receiver_node_counter structure.
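Relative to the earlier receiver sketch (whose hypothetical types and helpers are reused here), the only change is that the acknowledgement write in step g becomes conditional; is_last_in_group() is a hypothetical predicate derived from the free_node value the remote sender stamped on the batch tail.

```c
/* Same scaffolding as the earlier receiver sketch; only step g changes. */
extern int is_last_in_group(struct queue *q);   /* hypothetical predicate */

void receiver_dequeue_batched(struct queue *q, void *out, size_t len)
{
    reap_completions(q);                                        /* a. */
    while (nc_of(q, q->last_received_message_node_counter_pointer)->free_node
           == q->recv_nc.data_anchor)
        ;                                                       /* b. */
    memcpy(out, slot_payload(q, q->recv_nc.data_anchor), len);  /* c. */
    q->last_received_message_node_counter_pointer =
        q->recv_nc.data_anchor;                                 /* d. */
    q->recv_nc.data_anchor = next_of(q, q->recv_nc.data_anchor);/* e. */
    q->recv_nc.number_of_deletes++;                             /* f. */
    if (is_last_in_group(q))        /* g. ack only the batch tail:
                                       one work request per group   */
        post_recv_nc_write(q);
}
```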
Further, the following bulk messaging APIs are adapted for RDMA write work requests in the release_reserve_read_bulk and release_reserve_write_bulk functions:
• reserve_read_bulk(&no_of_messages) - the variable no_of_messages is updated to indicate the number of free buffers available for reading.
• release_reserve_read_bulk(num) - mark the next "num" messages as read.
• reserve_write_bulk(&no_of_messages) - the variable no_of_messages is updated to indicate the number of free buffers available for writing.
• release_reserve_write_bulk(num) - mark the next "num" messages as ready to be read.
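A hypothetical receiving-side use of these APIs might look as follows; the exact prototypes, and the bulk_message()/process() helpers, are assumptions, since the text names the functions but not their full signatures.

```c
extern void reserve_read_bulk(int *no_of_messages);
extern void release_reserve_read_bulk(int num);
extern const char *bulk_message(int i);   /* hypothetical accessor  */
extern void process(const char *msg);     /* application-defined    */

/* Drain whatever is currently readable, acknowledging in one bulk call. */
void drain_readable(void)
{
    int n = 0;
    reserve_read_bulk(&n);            /* n buffers are ready to read     */
    for (int i = 0; i < n; i++)
        process(bulk_message(i));     /* consume without per-message ack */
    release_reserve_read_bulk(n);     /* mark all n read in one call,
                                         amortizing the acknowledgement  */
}
```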
When executed on a separate infrastructure with the following specification and certain changes, a throughput of 5.5 million messages per second is achieved.
Specification of the infrastructure:
• 2 nodes (server 1 and server 2), each having a six-core Intel X5675 running at 3.07GHz
• 12 MB shared cache
• 24GB memory
• Network being Infiniband with 40 Gbps bandwidth and capable of RDMA
• Mellanox ConnectX®-2 40Gb/s InfiniBand mezzanine card
• Mellanox M3601Q 36-Port 40Gb/s InfiniBand Switch
The changes made to the above infrastructure being:
a) Maintaining the maximum queue size at 1000
b) Keeping the up-limit at 40%
c) Setting aclat to 10 nanoseconds
Referring to Figure 5, a latency test is set up on the infrastructure of the above specification to validate the optimization levels achieved with the modified process flow, as given below:
So far the measurement results focused only on throughput tests, where only the messaging rate was a concern. Thus a new test was devised which measures latency as well as throughput. In this test the sender and receiver processes run on the same host (server 1). A loopback process runs on the remote host (server 2) which simply receives the messages from the sender process and sends them back to the receiver process. The receiver process receives the messages and computes the latencies and throughput. For latency computation the sender process records timestamp A into the message just before sending. When this message reaches the receiver process, it takes a timestamp B. The difference B - A is used for computing latency, and the average is calculated over several samples.
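The latency computation amounts to embedding timestamp A in the payload and differencing it against timestamp B on receipt, e.g. as below (a sketch; CLOCK_REALTIME is usable here because sender and receiver run on the same host and so share a clock).

```c
#include <string.h>
#include <time.h>

/* Sender side (server 1): stamp timestamp A into the payload just
 * before sending. */
void stamp_message(char *payload)
{
    struct timespec a;
    clock_gettime(CLOCK_REALTIME, &a);
    memcpy(payload, &a, sizeof a);
}

/* Receiver side (also server 1): timestamp B is taken on arrival; B - A
 * is the round trip time through the loopback process on server 2. */
long round_trip_ns(const char *payload)
{
    struct timespec a, b;
    clock_gettime(CLOCK_REALTIME, &b);
    memcpy(&a, payload, sizeof a);
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}
```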
The queue parameters configured for this test being:
• Maintaining the maximum queue size at 100
• Keeping the up-limit at 40%
• Setting aclat to 10 nanoseconds
In this test the receiver process recorded a throughput of 3.25 million messages per second with an average round trip latency of 34 microseconds. Thus, using the modified approach, well over 1 million messages per second is achieved with sub 100 microsecond latency.
The preceding description has been presented with reference to various embodiments of the invention. Persons skilled in the art and technology to which this invention pertains will appreciate that alterations and changes in the described structures and methods of operation can be practiced without meaningfully departing from the principle, spirit and scope of this invention.

Claims (7)

We claim:
1. A memory mapped structure operable by computer executable instructions, wherein the said memory mapped structure is configured to synchronize a static circular queue of messages between at least one sending and one receiving host node, the memory mapped structure comprising: a plurality of messages bundled together to form at least one batch, each batch comprising a sequence of payload sections, wherein each payload section is intermittently followed by a corresponding node counter structure to constitute a contiguous memory region, and wherein the payload section is further coupled with a corresponding common queue data and continuously arranged headers; a rdma free pointing element adapted to point to a message buffer in which the sending host node inserts a new message; a rdma insertion counter to count the number of messages inserted by the sending host node; a receiving node counter structure element, responsive to the receiving host node, configured to enable said receiving host node to issue one RDMA work request for acknowledging at least one message from the batch; a last sent message node pointing element of the common queue data to point to the node counter structure of the last message sent from a remote sender process to the receiving host node; and a last received message node pointing element of the common queue data to point to the message last received by the receiving host node.
2. The structure of claim 1, wherein the layout of the memory mapped file is optimized to form the contiguous memory region adapted to batch the plurality of messages and group the constituting elements such that the number of RDMA work requests is reduced to less than one per message.
3. The structure of claim 1, wherein the common queue data further comprises: a data pointing element adapted to point to the next message to be received by the receiving host node, a free pointing element adapted to point to the message written by the sending host node, an insertion counter to count the number of messages sent by the sending host node, a deletion counter to count the number of messages read by the receiving host node since queue initiation, the last sent message node pointing element, and the last received message node pointing element.
4. The structure of claim 1, wherein the node counter structure is constituted of the free pointing element and the insertion counter to send the entire structure as one RDMA work request.
5. The structure of claim 1, wherein the node counter structure invokes the free pointing element and the insertion counter to point to the message whereon the remote sender process initiates the batching process and to point to the number of messages updated to the receiving host node respectively.
6. The structure of claim 1, wherein the receiving node counter structure is constituted of the data pointing element and the deletion counter.
7. The structure of claim 1, wherein the remote sender process updates the node counter element for insertion of additional messages in the circular queue to make it accessible to the receiving host node.
AU2016201513A 2011-06-15 2016-03-09 Low latency fifo messaging system Active AU2016201513B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2016201513A AU2016201513B2 (en) 2011-06-15 2016-03-09 Low latency fifo messaging system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN1745/MUM/2011 2011-06-15
IN1745MU2011 2011-06-15
AU2011265444A AU2011265444B2 (en) 2011-06-15 2011-12-21 Low latency FIFO messaging system
AU2016201513A AU2016201513B2 (en) 2011-06-15 2016-03-09 Low latency fifo messaging system

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
AU2011265444A Division AU2011265444B2 (en) 2011-06-15 2011-12-21 Low latency FIFO messaging system

Publications (2)

Publication Number Publication Date
AU2016201513A1 AU2016201513A1 (en) 2016-03-24
AU2016201513B2 true AU2016201513B2 (en) 2017-10-05

Family

ID=47334167

Family Applications (2)

Application Number Title Priority Date Filing Date
AU2011265444A Active AU2011265444B2 (en) 2011-06-15 2011-12-21 Low latency FIFO messaging system
AU2016201513A Active AU2016201513B2 (en) 2011-06-15 2016-03-09 Low latency fifo messaging system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
AU2011265444A Active AU2011265444B2 (en) 2011-06-15 2011-12-21 Low latency FIFO messaging system

Country Status (2)

Country Link
CN (1) CN102831018B (en)
AU (2) AU2011265444B2 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424105B (en) * 2013-08-26 2017-08-25 华为技术有限公司 The read-write processing method and device of a kind of internal storage data
IN2013MU03528A (en) * 2013-11-08 2015-07-31 Tata Consultancy Services Ltd
IN2013MU03527A (en) * 2013-11-08 2015-07-31 Tata Consultancy Services Ltd
EP3155531A4 (en) * 2014-06-10 2018-01-31 Hewlett-Packard Enterprise Development LP Replicating data using remote direct memory access (rdma)
US10146721B2 (en) * 2016-02-24 2018-12-04 Mellanox Technologies, Ltd. Remote host management over a network
CN105786624B (en) * 2016-04-01 2019-06-25 浪潮电子信息产业股份有限公司 Scheduling platform based on redis and RDMA technology
CN107819734A (en) * 2016-09-14 2018-03-20 上海福赛特机器人有限公司 The means of communication and communication system between a kind of program based on web socket
US10587535B2 (en) 2017-02-22 2020-03-10 Mellanox Technologies, Ltd. Adding a network port to a network interface card via NC-SI embedded CPU
CN109002381B (en) * 2018-06-29 2022-01-18 Oppo(重庆)智能科技有限公司 Process communication monitoring method, electronic device and computer readable storage medium
EP4394613A1 (en) * 2021-09-17 2024-07-03 Huawei Technologies Co., Ltd. Data transmission method and input/output device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7093147B2 (en) * 2003-04-25 2006-08-15 Hewlett-Packard Development Company, L.P. Dynamically selecting processor cores for overall power efficiency
US20060075067A1 (en) * 2004-08-30 2006-04-06 International Business Machines Corporation Remote direct memory access with striping over an unreliable datagram transport
US7584327B2 (en) * 2005-12-30 2009-09-01 Intel Corporation Method and system for proximity caching in a multiple-core system
CN101577716B (en) * 2009-06-10 2012-05-23 中国科学院计算技术研究所 Distributed storage method and system based on InfiniBand network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040073622A1 (en) * 2002-08-19 2004-04-15 Mcdaniel Scott S. One-shot RDMA
US20060013251A1 (en) * 2004-07-16 2006-01-19 Hufferd John L Method, system, and program for enabling communication between nodes
US20060075119A1 (en) * 2004-09-10 2006-04-06 Hussain Muhammad R TCP host
US20080126564A1 (en) * 2006-08-31 2008-05-29 Keith Iain Wilkinson Multiple context single logic virtual host channel adapter supporting multiple transport protocols
US20090119676A1 (en) * 2006-09-27 2009-05-07 Supalov Alexander V Virtual heterogeneous channel for message passing
US20090083392A1 (en) * 2007-09-25 2009-03-26 Sun Microsystems, Inc. Simple, efficient rdma mechanism

Also Published As

Publication number Publication date
CN102831018A (en) 2012-12-19
AU2011265444A1 (en) 2013-01-10
CN102831018B (en) 2015-06-24
AU2011265444B2 (en) 2015-12-10
AU2016201513A1 (en) 2016-03-24

Similar Documents

Publication Publication Date Title
AU2016201513B2 (en) Low latency fifo messaging system
Su et al. Rfp: When rpc is faster than server-bypass with rdma
Kaufmann et al. High performance packet processing with flexnic
CN110402568B (en) Communication method and device
US6070189A (en) Signaling communication events in a computer network
Chun et al. Virtual network transport protocols for Myrinet
US6038604A (en) Method and apparatus for efficient communications using active messages
EP2406723B1 (en) Scalable interface for connecting multiple computer systems which performs parallel mpi header matching
KR102011949B1 (en) System and method for providing and managing message queues for multinode applications in a middleware machine environment
KR101006260B1 (en) Apparatus and method for supporting memory management in an offload of network protocol processing
KR100992282B1 (en) Apparatus and method for supporting connection establishment in an offload of network protocol processing
AU2014200239B2 (en) System and method for multiple sender support in low latency fifo messaging using rdma
US10721302B2 (en) Network storage protocol and adaptive batching apparatuses, methods, and systems
EP2618257B1 (en) Scalable sockets
US20140068165A1 (en) Splitting a real-time thread between the user and kernel space
CN111431757A (en) Virtual network flow acquisition method and device
US7788437B2 (en) Computer system with network interface retransmit
US20170329656A1 (en) Ordered event notification
Wu et al. RF-RPC: Remote fetching RPC paradigm for RDMA-enabled network
González-Férez et al. Tyche: An efficient Ethernet-based protocol for converged networked storage
US7383312B2 (en) Application and verb resource management
Salehi et al. The effectiveness of affinity-based scheduling in multiprocessor networking
JPH11328134A (en) Method for transmitting and receiving data between computers
US20090271802A1 (en) Application and verb resource management
EP2115619B1 (en) Communication socket state monitoring device and methods thereof

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)