WO2016028268A1 - Send buffer based on messaging traffic load - Google Patents

Send buffer based on messaging traffic load

Info

Publication number: WO2016028268A1
Application number: PCT/US2014/051630
Authority: WIPO (PCT)
Prior art keywords: threshold, size, send buffer, messaging, send
Prior art date: 2014-08-19
Other languages: French (fr)
Inventor: Patrick Estep
Original Assignee: Hewlett Packard Enterprise Development LP
Priority date: 2014-08-19 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2014-08-19
Application filed by Hewlett Packard Enterprise Development LP
Priority to PCT/US2014/051630
Publication of WO2016028268A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/546: Message passing systems or structures, e.g. queues
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/50: Network services
    • H04L 67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L 67/61: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources taking into account QoS or priority requirements

Abstract

A send buffer is allocated within a kernel of an operating system (OS) of a first node. The kernel of the first node is to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node. When the send buffer is sent may be dynamically varied based on a messaging traffic load.

Description

SEND BUFFER BASED ON MESSAGING TRAFFIC LOAD
BACKGROUND
[0001] Networks may have various types of communication topologies. A common communications topology is one with many processes/threads per node, where every process/thread may communicate with every other process/thread in a cluster of nodes. Manufacturers, vendors, and/or service providers are challenged to provide improved communication topologies for more efficient transfer of information between nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] The following detailed description references the drawings, wherein:
[0003] FIG. 1 is an example block diagram of a system to determine when to send a send buffer based on a messaging traffic load;
[0004] FIG. 2 is another example block diagram of a system to determine when to send a send buffer based on a messaging traffic load;
[0005] FIG. 3 is an example block diagram of a computing device including instructions for determining when to send a send buffer based on a messaging traffic load; and
[0006] FIG. 4 is an example flowchart of a method for determining when to send a send buffer based on a messaging traffic load.
DETAILED DESCRIPTION
[0007] Specific details are given in the following description to provide a thorough understanding of embodiments. However, it will be understood that embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring embodiments.
[0008] Common communications topologies may have many processes/threads per node that can communicate with other processes/threads in a cluster of nodes. These topologies may suffer in performance and scalability. For instance, as the number of nodes and the number of processes/threads increases, the number of connections per node may grow quadratically, e.g. n^2 connections, where n is the number of processes/threads. As the underlying interconnect may have to multiplex/demultiplex each connection onto a single interface, this contention may cause a performance bottleneck.
[0009] Also, there is typically a limit on the number of connections that may be possible per node. As the number of nodes and the number of processes/threads increase, this limit may be reached. Many of the messages being sent in these topologies may be small, which results in poor network throughput. Current systems, which are implemented in user level code, may not fully solve the above problems. Thus, current approaches may have shortcomings from both a performance and scalability perspective.
[0010] Some software systems may aggregate messages to improve performance and scalability. These systems may aggregate many smaller messages into a single larger message which is then sent over the network. However, the smaller messages may remain in the aggregated buffer until the buffer is full or some other threshold is reached (e.g. maximum time to wait).
[0011] Thus, under light messaging loads, aggregation systems that utilize fixed thresholds to determine when to send the aggregation buffer may wait too long to send the buffer, resulting in excessive latency. Conversely, under heavy messaging loads, aggregation systems that utilize fixed thresholds to determine when to send the aggregation may send the buffer too soon, resulting in non-optimal network throughput.
[0012] Examples may utilize an adaptive algorithm to determine when to send the aggregation or send buffer based on existing messaging traffic. An example system may include a send buffer and a threshold unit. The send buffer may be allocated within a kernel of an operating system (OS) of a first node; the kernel of the first node may aggregate a plurality of the messages stored at the send buffer into a single transfer and output the single transfer across a network to a second node. The threshold unit may dynamically vary when the send buffer is sent based on a messaging traffic load.
[0013] Thus, a balance between message latency and network throughput may be improved or optimized for a given messaging load. For instance, examples may dynamically adjust appropriate thresholds based on light or heavy messaging loads to improve or optimize message latency.
[0014] Referring now to the drawings, FIG. 1 is an example block diagram of a system 100 to determine when to send a send buffer based on a messaging traffic load. The system 100 may be any type of message aggregation system, such as a communication network. Example types of communication networks may include wide area networks (WAN), metropolitan area networks (MAN), local area networks (LAN), Internet area networks (IAN), campus area networks (CAN) and virtual private networks (VPN).
[0015] The system 100 is shown to include a first node 110. The term node may refer to a connection point, a redistribution point or a communication endpoint. The node may be an active electronic device that is attached to a network, and is capable of sending, receiving, or forwarding information over a communications channel. Examples of the node may include data communication equipment (DCE) such as a modem, hub, bridge or switch; or data terminal equipment (DTE) such as a digital telephone handset, a printer or a host computer, like a router, a workstation or a server.
[0016] The first node is shown to include an operating system (OS) 120. An OS may be a collection of software that manages computer hardware resources and provides common services for applications. Example types of OSs may include Android, BSD, iOS, Linux, OS X, QNX, Microsoft Windows, Windows Phone, and IBM z/OS.
[0017] The OS 120 is shown to include a kernel 130. The kernel 130 may be a central part of the OS 120 that loads first, and remains in main memory (not shown). Typically, the kernel 130 may be responsible for memory management, process and task management, and disk management. For example, the kernel 130 may manage input/output requests from an application and translate them into data processing instructions for a central processing unit (not shown) and other electronic components of a node. The kernel 130 may also allocate requests from applications to perform I/O to an appropriate device or part of a device.
[0018] The kernel 130 is shown to include a send buffer 140 and a threshold unit 150. The term buffer may refer to a region of physical memory storage used to temporarily store data while it is being moved from one place to another. The kernel 130 may be included in any electronic, magnetic, optical, or other physical storage device that contains or stores information, such as Random Access Memory (RAM), flash memory, a solid state drive (SSD), a hard disk drive (HDD) and the like.
[0019] The threshold unit 150 may include, for example, a hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the threshold unit 150 may be implemented as a series of instructions encoded on a machine-readable storage medium and executable by a processor.
[0020] The send buffer 140 is allocated within the kernel 130 of the OS 120 of the first node 110. Thus, the send buffer 140 may not be paged out of the kernel 130. The kernel 130 of the first node 110 may aggregate a plurality of the messages 142-1 to 142-n, where n is a natural number, stored at the send buffer 140 into a single transfer and may output the single transfer across a network to a second node.
[0021] The threshold unit 150 may dynamically vary when the send buffer 140 is sent based on a messaging traffic load. For example, the threshold unit 150 may determine a time to output the single transfer based on, for example, a timeout threshold, a number of messages within the send buffer 140, an amount of data stored at the send buffer 140 and the like. The threshold unit 150 is explained in greater detail below with respect to FIG. 2.
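For illustration only, the following C sketch shows one way such a threshold check might look; the struct and function names are hypothetical, since the patent does not specify an implementation. The buffer is sent as soon as any one of the size, number, or time thresholds is crossed:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical state for the send buffer and the three thresholds. */
    struct send_buffer {
        size_t   bytes_used;     /* data aggregated so far */
        unsigned msg_count;      /* messages aggregated so far */
        uint64_t first_msg_ns;   /* when the first message was stored */
    };

    struct thresholds {
        size_t   size_threshold;    /* e.g. 128 KB or 256 KB */
        unsigned num_threshold;     /* max messages per transfer */
        uint64_t time_threshold_ns; /* max wait since the first message */
    };

    /* Send the aggregated buffer once any one threshold is crossed. */
    static bool should_send(const struct send_buffer *b,
                            const struct thresholds *t, uint64_t now_ns)
    {
        if (b->msg_count == 0)
            return false;                       /* nothing to send */
        if (b->bytes_used >= t->size_threshold)
            return true;                        /* size threshold crossed */
        if (b->msg_count >= t->num_threshold)
            return true;                        /* number threshold crossed */
        return now_ns - b->first_msg_ns >= t->time_threshold_ns; /* timeout */
    }

A caller would test should_send() both when a message is appended and when a periodic timer fires, covering data-driven sends as well as timeout-driven sends.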
[0022] FIG. 2 is another example block diagram of a system 200 to determine when to send a send buffer based on a messaging traffic load. The system 200 may be any type of message aggregation system, such as a communication network, e.g. a WAN, LAN or VPN. The system 200 is shown to include a first node 205 and a second node 210. The second node 210 may include the functionality and/or hardware of the first node 205 of FIG. 2 and/or vice versa. While the system 200 is shown to only include two nodes 205 and 210, examples may include more than two nodes, such as a cluster of hundreds or thousands of nodes.
[0023] The first node 205 of FIG. 2 may include the functionality and/or hardware of the first node 110 of FIG. 1. For example, the first node 205 of FIG. 2 includes the OS 120 of FIG. 1, where the OS 120 includes the kernel 130. The kernel 130 includes the send buffer 140 and a threshold unit 250, where the threshold unit 250 includes at least the functionality and/or hardware of the threshold unit 150 of FIG. 1.
[0024] A receive buffer 240 is allocated within a kernel 230 of an OS 220 of the second node 210. The receive buffer 240 may receive the single transfer directly from the send buffer 140. The send and receive buffers 140 and 240 may be persistently allocated in the kernels 130 and 230 of the first and second nodes 110 and 210 for remote direct memory access (RDMA) transfers. By using a persistent RDMA approach, any preparing/unpreparing of the buffers per message transfer may be avoided.
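The persistent-allocation idea can be sketched with the user-space libibverbs API; this is an analogy only, since the patent allocates the buffers inside the kernels of the nodes, and the function name below is illustrative. The buffer is registered once and the resulting memory region is reused for every transfer:

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Hypothetical one-time setup: allocate and register the aggregation
     * buffer once, so no per-transfer preparing/unpreparing is needed. */
    static struct ibv_mr *setup_persistent_buffer(struct ibv_pd *pd,
                                                  size_t len, void **buf_out)
    {
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len))   /* page-aligned memory */
            return NULL;

        /* Register once; the memory region (and its lkey/rkey) is then
         * reused for every RDMA transfer over the connection's lifetime. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }
        *buf_out = buf;
        return mr;
    }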
[0025] The threshold unit 250 is shown to include a size threshold 252 to relate to a size of the send buffer 140. Example sizes of the size threshold 252 may include 128 kilobytes (KB), 256 KB and the like. In one example, the threshold unit 250 may halve and/or double the value of the size threshold 252 when the threshold unit 250 dynamically varies the value of the size threshold 252.
[0026] The threshold unit 250 is shown to further include a number threshold 254 to relate to a number of the plurality of messages stored at the send buffer 140, and a time threshold 256 to relate to a time elapsed since a first one 142-1 of the plurality of messages 142-1 to 142-n was stored at the send buffer 140. However, examples may include more or fewer than the above three thresholds 252, 254 and 256.
[0027] The threshold unit 250 may classify the messaging traffic load to be at least one of a low messaging load and a heavy messaging load. The threshold unit 250 may dynamically vary a value of at least one of the size, number and time threshold 252, 254 and 256 based on the messaging traffic load.
[0028] The threshold unit 250 may send the send buffer 140 sooner to reduce message latency, if the messaging traffic load is classified as the low messaging load. Conversely, the threshold unit 250 may send the send buffer 140 later to increase network throughput, if the messaging traffic load is classified as the heavy messaging load.
[0029] For instance, the threshold unit 250 may reduce the value of at least one of the size, number and time threshold 252, 254 and 256 to send the send buffer 140 sooner, if the messaging traffic load is classified as the low messaging load. The threshold unit 250 may increase the value of at least one of the size, number and time threshold 252, 254 and 256 to send the send buffer 140 later, if the messaging traffic load is classified as the heavy messaging load.
[0030] For example, the threshold unit 250 may carry out a combination of any of the following actions. The threshold unit 250 may reduce the value of the size threshold 252, if the messaging traffic load is classified as the low messaging load in response to a size of the messages at the send buffer 140 being relatively small. The threshold unit 250 may reduce the value of the number threshold 254, if the messaging traffic load is classified as the low messaging load in response to a number of the messages at the send buffer 140 being relatively small. The threshold unit 250 may reduce the value of the time threshold 256, if the messaging traffic load is classified as the low messaging load in response to both the number and the size of the messages at the send buffer 140 being relatively small.
[0031] The threshold unit 250 may also carry out a combination of any of the following actions. The threshold unit 250 may increase the value of the size threshold 252, if the messaging traffic load is classified as the heavy messaging load in response to a size of the messages at the send buffer 140 being relatively large. The threshold unit 250 may increase the value of the number threshold 254, if the messaging traffic load is classified as the heavy messaging load in response to a number of the messages at the send buffer 140 being relatively large. The threshold unit 250 may increase the value of the time threshold 256, if the messaging traffic load is classified as the heavy messaging load in response to both the number and the size of the messages at the send buffer 140 being relatively large.
[0032] The threshold unit 250 may not reduce the value of at least one of the size, number and time threshold 252, 254 and 256, if the value is already at a minimum value. An example of the minimum value may be 1 millisecond. The threshold unit 250 may also not increase the value of at least one of the size, number and time threshold 252, 254 and 256, if the value is already at a maximum value.
[0033] An amount by which at least one of the size, number and time thresholds 252, 254 and 256 is varied by the threshold unit 250 may further be based on an amount by which at least an other of the size, number and time thresholds 252, 254 and 256 is varied. In one example, the threshold unit 250 may dynamically vary the value of at least one of the size and time thresholds 252 and 256 before dynamically varying the value of the number threshold 254. In another example, the threshold unit 250 may also dynamically vary the size threshold 252 before dynamically varying the value of the time threshold 256.
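Continuing the earlier sketch, a hypothetical adjustment step consistent with paragraphs [0027] through [0033] might halve or double thresholds while clamping them at minimum and maximum values. Apart from the 1 millisecond minimum mentioned above, the floors and ceilings here are invented for illustration:

    /* Reuses struct thresholds from the earlier sketch. */
    enum load { LOW_LOAD, HEAVY_LOAD };

    #define SIZE_MIN_BYTES (16u * 1024u)     /* illustrative floor */
    #define SIZE_MAX_BYTES (1024u * 1024u)   /* illustrative ceiling */
    #define TIME_MIN_NS    1000000ull        /* 1 ms minimum, per [0032] */
    #define TIME_MAX_NS    100000000ull      /* illustrative ceiling */

    static void adjust_thresholds(struct thresholds *t, enum load load)
    {
        if (load == LOW_LOAD) {
            /* Light traffic: shrink thresholds so the buffer is sent
             * sooner, reducing per-message latency. */
            if (t->size_threshold / 2 >= SIZE_MIN_BYTES)
                t->size_threshold /= 2;
            if (t->time_threshold_ns / 2 >= TIME_MIN_NS)
                t->time_threshold_ns /= 2;
        } else {
            /* Heavy traffic: grow thresholds so more messages are
             * aggregated per transfer, increasing network throughput. */
            if (t->size_threshold * 2 <= SIZE_MAX_BYTES)
                t->size_threshold *= 2;
            if (t->time_threshold_ns * 2 <= TIME_MAX_NS)
                t->time_threshold_ns *= 2;
        }
        /* Per [0033], the number threshold could be varied last, after
         * the size and time thresholds. */
    }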
[0034] In one example, the threshold unit 150 may be implemented via kernel level code. In another example, the threshold unit 250 may dynamically vary when the send buffer is sent in response to a burst of message traffic. In yet another example, at least one of the time, number and size threshold may be varied based on a run-time analysis of a pattern of the messaging traffic load. Thus, examples may provide an improved or optimized balance between individual message latency and network throughput.
[0035] FIG. 3 is an example block diagram of a computing device 300 including instructions for determining when to send a send buffer based on a messaging traffic load. In the embodiment of FIG. 3, the computing device 300 includes a processor 310 and a machine-readable storage medium 320. The machine-readable storage medium 320 further includes instructions 321, 323, 325, 327 and 329 for determining when to send a send buffer based on a messaging traffic load.
[0036] The computing device 300 may be included in or part of, for example, a microprocessor, a controller, a memory module or device, a notebook computer, a desktop computer, an all-in-one system, a server, a network device, a wireless device, or any other type of device capable of executing the instructions 321, 323, 325, 327 and 329. In certain examples, the computing device 300 may include or be connected to additional components such as memories, controllers, etc.
[0037] The processor 310 may be at least one central processing unit (CPU), at least one semiconductor-based microprocessor, at least one graphics processing unit (GPU), a microcontroller, special purpose logic hardware controlled by microcode or other hardware devices suitable for retrieval and execution of instructions stored in the machine-readable storage medium 320, or combinations thereof. The processor 310 may fetch, decode, and execute instructions 321, 323, 325, 327 and 329 to implement determining when to send the send buffer based on the messaging traffic load. As an alternative or in addition to retrieving and executing instructions, the processor 310 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 321, 323, 325, 327 and 329.
[0038] The machine-readable storage medium 320 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, the machine-readable storage medium 320 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like. As such, the machine-readable storage medium 320 can be non-transitory. As described in detail below, machine-readable storage medium 320 may be encoded with a series of executable instructions for determining when to send the send buffer based on the messaging traffic load.
[0039] Moreover, the instructions 321, 323, 325, 327 and 329 when executed by a processor (e.g., via one processing element or multiple processing elements of the processor) can cause the processor to perform processes, such as the process of FIG. 4. For example, the measure instructions 321 may be executed by the processor 310 to measure a messaging traffic load based on a number and size of messages aggregated at a send buffer allocated within the kernel of an operating system (OS) of a first node.
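The patent leaves the measurement policy open; one hypothetical reading of the measure instructions 321, reusing enum load from the sketch above and with purely illustrative cutoffs, derives message and byte rates over a sampling window:

    /* Continues the earlier sketches (same includes and enum load). */
    struct traffic_sample {
        unsigned messages;    /* messages seen during the window */
        size_t   bytes;       /* bytes seen during the window */
        uint64_t window_ns;   /* window length */
    };

    static enum load classify_load(const struct traffic_sample *s)
    {
        if (s->window_ns == 0)
            return LOW_LOAD;  /* no data yet */

        uint64_t msgs_per_s  = (uint64_t)s->messages * 1000000000ull / s->window_ns;
        uint64_t bytes_per_s = (uint64_t)s->bytes * 1000000000ull / s->window_ns;

        /* Illustrative cutoffs: ~10k msgs/s or ~100 MB/s counts as heavy. */
        if (msgs_per_s > 10000 || bytes_per_s > 100ull * 1024 * 1024)
            return HEAVY_LOAD;
        return LOW_LOAD;
    }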
[0040] The vary number instructions 323 may be executed by the processor 310 to vary a number threshold based on the number of the aggregated messages. The number threshold may relate to a number of the plurality of messages stored at the send buffer. The vary size instructions 325 may be executed by the processor 310 to vary a size threshold based on the size of the aggregated messages. The size threshold may relate to a size of the send buffer.
[0041] The vary time instructions 327 may be executed by the processor 310 to vary a time threshold based on the number and the size of the aggregated messages. The time threshold may relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer. The determine instructions 329 may be executed by the processor 310 to determine when to send the send buffer as a single transfer across a network to a second node based on the number, size and time thresholds.
[0042] FIG. 4 is an example flowchart of a method 400 for determining when to send a send buffer based on a messaging traffic load. Although execution of the method 400 is described below with reference to the system 200, other suitable components for execution of the method 400 can be utilized, such as the system 100. Additionally, the components for executing the method 400 may be spread among multiple devices (e.g., a processing device in communication with input and output devices). In certain scenarios, multiple devices acting in coordination can be considered a single device to perform the method 400. The method 400 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 320, and/or in the form of electronic circuitry.
[0043] At block 410, a first node 110 may classify a messaging traffic load to be at least one of a low messaging load and a heavy messaging load. At block 420, the first node 110 may vary a value of at least one of a time, number and size threshold 252, 254 and 256 based on the classification. The varying at block 420 may also assign initial values to the size, number and time thresholds 252, 254 and 256. The initial values of at least one of the size, number and time threshold 252, 254 and 256 may be based on a type of a network including the first and second nodes 205 and 210.
[0044] At block 430, the first node 110 may determine when to send a send buffer 140 allocated within a kernel 130 of an operating system (OS) 120 of the first node 205 based on the at least one of the time, number and size threshold 252, 254 and 256. The kernel 130 of the first node 205 may aggregate a plurality of the messages 142-1 to 142-n stored at the send buffer 140 into a single transfer and output the single transfer across a network to the second node 210. The size threshold 252 may relate to a size of the send buffer 140. The number threshold 254 may relate to a number of the plurality of messages 142-1 to 142-n stored at the send buffer 140. The time threshold 256 may relate to a time elapsed since a first one 142-1 of the plurality of messages 142-1 to 142-n was stored at the send buffer 140.
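Blocks 410 through 430 can be tied together with the hypothetical helpers sketched earlier; send_single_transfer() below is a made-up stand-in for the kernel's aggregation-and-send path:

    /* Provided elsewhere (hypothetical): aggregates the buffered messages
     * into one transfer and outputs it across the network. */
    extern void send_single_transfer(struct send_buffer *b);

    static void aggregation_tick(struct send_buffer *b, struct thresholds *t,
                                 const struct traffic_sample *s, uint64_t now_ns)
    {
        enum load load = classify_load(s); /* block 410: classify the load */
        adjust_thresholds(t, load);        /* block 420: vary the thresholds */
        if (should_send(b, t, now_ns))     /* block 430: decide when to send */
            send_single_transfer(b);
    }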

Claims

CLAIMS
We claim:
1. A system, comprising:
a send buffer allocated within a kernel of an operating system (OS) of a first node, the kernel of the first node to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node; and
a threshold unit to dynamically vary when the send buffer is sent based on a messaging traffic load.
2. The system of claim 1, wherein the threshold unit is to include at least one of,
a size threshold to relate to a size of the send buffer,
a number threshold to relate to a number of the plurality of messages stored at the send buffer, and
a time threshold to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.
3. The system of claim 2, wherein,
the threshold unit is to classify the messaging traffic load to be at least one of a low messaging load and a heavy messaging load, and
the threshold unit is to dynamically vary a value of at least one of the size, number and time threshold based on the messaging traffic load.
4. The system of claim 3, wherein,
the threshold unit is to send the send buffer sooner to reduce message latency, if the messaging traffic load is classified as the low messaging load, and
the threshold unit is to send the send buffer later to increase network throughput, if the messaging traffic load is classified as the heavy messaging load.
5. The system of claim 4, wherein,
the threshold unit is to reduce the value of at least one of the size, number and time threshold to send the send buffer sooner, if the messaging traffic load is classified as the low messaging load, and
the threshold unit is to increase the value of at least one of the size, number and time threshold to send the send buffer later, if the messaging traffic load is classified as the heavy messaging load.
6. The system of claim 5, wherein the threshold unit is to at least one of,
reduce the value of the size threshold, if the messaging traffic load is classified as the low messaging load in response to a size of the messages at the send buffer being relatively small,
reduce the value of the number threshold, if the messaging traffic load is classified as the low messaging load in response to a number of the messages at the send buffer being relatively small, and
reduce the value of the time threshold, if the messaging traffic load is classified as the low messaging load in response to both the number and the size of the messages at the send buffer being relatively small.
7. The system of claim 5, wherein the threshold unit is to at least one of,
increase the value of the size threshold, if the messaging traffic load is classified as the heavy messaging load in response to a size of the messages at the send buffer being relatively large,
increase the value of the number threshold, if the messaging traffic load is classified as the heavy messaging load in response to a number of the messages at the send buffer being relatively large, and
increase the value of the time threshold, if the messaging traffic load is classified as the heavy messaging load in response to both the number and the size of the messages at the send buffer being relatively large.
8. The system of claim 5, wherein,
the threshold unit is to not reduce the value of at least one of the size, number and time threshold, if the value is already at a minimum value, and
the threshold unit is to not increase the value of at least one of the size, number and time threshold, if the value is already at a maximum value.
9. The system of claim 2, wherein an amount by which at least one of the size, number and time thresholds is varied by the threshold unit is further based on an amount by which at least an other of the size, number and time thresholds is varied.
10. The system of claim 9, wherein the threshold unit is to dynamically vary the value of at least one of the size and time threshold before dynamically varying the value of the number threshold.
11. The system of claim 10, wherein the threshold unit is to dynamically vary the size threshold before dynamically varying the value of the time threshold.
12. A method, comprising:
classifying a messaging traffic load at a first node to be at least one of a low messaging load and a heavy messaging load;
varying a value of at least one of a time, number and size threshold based on the classification; and
determining when to send a send buffer allocated within a kernel of an operating system (OS) of the first node based on the at least one of the time, number and size threshold, wherein
the kernel of the first node is to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node,
the size threshold is to relate to a size of the send buffer,
the number threshold is to relate to a number of the plurality of messages stored at the send buffer, and
the time threshold is to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.
13. The method of claim 12, wherein,
the varying assigns initial values to the size, number and time thresholds, and
the initial values of at least one of the size, number and time threshold is based on a type of a network including the first and second nodes.
14. A non-transitory computer-readable storage medium storing instructions that, if executed by a processor of a device, cause the processor to:
measure a messaging traffic load based on a number and size of messages aggregated at a send buffer allocated within the kernel of an operating system (OS) of a first node;
vary a number threshold based on the number of the aggregated messages;
vary a size threshold based on the size of the aggregated messages;
vary a time threshold based on the number and the size of the aggregated messages; and
determine when to send the send buffer as a single transfer across a network to a second node based on the number, size and time thresholds.
15. The non-transitory computer-readable storage medium of claim 14, wherein,
the size threshold is to relate to a size of the send buffer,
the number threshold is to relate to a number of the plurality of messages stored at the send buffer, and
the time threshold is to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Publications (1)

Publication Number Publication Date
WO2016028268A1 (en)

Family

ID=55351065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Country Status (1)

Country Link
WO (1) WO2016028268A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6788697B1 (en) * 1999-12-06 2004-09-07 Nortel Networks Limited Buffer management scheme employing dynamic thresholds
US20030009482A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corporation Method and system for dynamically managing data structures to optimize computer network performance
US20040264454A1 (en) * 2003-06-27 2004-12-30 Ajay Rajkumar Packet aggregation for real time services on packet data networks
EP1848172A1 (en) * 2006-04-19 2007-10-24 Nokia Siemens Networks Gmbh & Co. Kg Method and machine for aggregating a plurality of data packets into a unified transport data packet
US20110182294A1 (en) * 2010-01-28 2011-07-28 Brocade Communications Systems, Inc. In-order traffic aggregation with reduced buffer usage

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112422244A (en) * 2019-08-21 2021-02-26 无锡江南计算技术研究所 RDMA buffer dynamic allocation method based on flow load prediction
CN112422244B (en) * 2019-08-21 2022-11-25 无锡江南计算技术研究所 RDMA buffer dynamic allocation method based on flow load prediction

Similar Documents

Publication Publication Date Title
US20200241927A1 (en) Storage transactions with predictable latency
US9965441B2 (en) Adaptive coalescing of remote direct memory access acknowledgements based on I/O characteristics
US11381515B2 (en) On-demand packet queuing in a network device
CN111614746B (en) Load balancing method and device of cloud host cluster and server
US11805070B2 (en) Technologies for flexible and automatic mapping of disaggregated network communication resources
US10768823B2 (en) Flow control for unaligned writes in network storage device
US9092259B2 (en) Apparatus and method for controlling a resource utilization policy in a virtual environment
US11567556B2 (en) Platform slicing of central processing unit (CPU) resources
US20130100955A1 (en) Technique for prioritizing traffic at a router
WO2022025966A1 (en) Receiver-based precision congestion control
US11265235B2 (en) Technologies for capturing processing resource metrics as a function of time
US20210359955A1 (en) Cache allocation system
US20210320866A1 (en) Flow control technologies
US20210326177A1 (en) Queue scaling based, at least, in part, on processing load
US20220109733A1 (en) Service mesh offload to network devices
CN117157957A (en) Switch-induced congestion messages
US11099767B2 (en) Storage system with throughput-based timing of synchronous replication recovery
WO2016028268A1 (en) Send buffer based on messaging traffic load
US11388050B2 (en) Accelerating machine learning and profiling over a network
US10042682B2 (en) Copy message from application buffer to send buffer within kernel
WO2016160033A1 (en) Compress and load message into send buffer
US11687146B1 (en) Power consumption control
US20210328945A1 (en) Configurable receive buffer size
US20240031295A1 (en) Storage aware congestion management
US20230401079A1 (en) Resource allocation in virtualized environments

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14899990; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14899990; Country of ref document: EP; Kind code of ref document: A1)