WO2016028268A1 - Send buffer based on messaging traffic load - Google Patents

Send buffer based on messaging traffic load

Info

Publication number: WO2016028268A1
Application number: PCT/US2014/051630
Authority: WIPO (PCT)
Prior art keywords: threshold, size, send buffer, messaging, send
Prior art date: 2014-08-19
Other languages: French (fr)
Inventor: Patrick Estep
Original Assignee: Hewlett Packard Enterprise Development LP
Priority date: 2014-08-19 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2014-08-19
Application filed by Hewlett Packard Enterprise Development LP
Priority to PCT/US2014/051630
Publication of WO2016028268A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/546: Message passing systems or structures, e.g. queues
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/50: Network services
    • H04L 67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L 67/61: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources taking into account QoS or priority requirements

Abstract

A send buffer is allocated within a kernel of an operating system (OS) of a first node. The kernel of the first node is to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node. When the send buffer is sent may be dynamically varied based on a messaging traffic load.

Description

SEND BUFFER BASED ON MESSAGING TRAFFIC LOAD
BACKGROUND
[0001] Networks may have various types of communication topologies. A common communications topology is one with many processes/threads per node, where every process/thread may communicate with every other process/thread in a cluster of nodes. Manufacturers, vendors, and/or service providers are challenged to provide improved communication topologies for more efficient transfer of information between nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] The following detailed description references the drawings, wherein:
[0003] FIG. 1 is an example block diagram of a system to determine when to send a send buffer based on a messaging traffic load;
[0004] FIG. 2 is another example block diagram of a system to determine when to send a send buffer based on a messaging traffic load;
[0005] FIG. 3 is an example block diagram of a computing device including instructions for determining when to send a send buffer based on a messaging traffic load; and
[0006] FIG. 4 is an example flowchart of a method for determining when to send a send buffer based on a messaging traffic load.
DETAILED DESCRIPTION
[0007] Specific details are given in the following description to provide a thorough understanding of embodiments. However, it will be understood that embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring embodiments.
[0008] Common communications topologies may have many processes/threads per node that can communicate with other processes/threads in a cluster of nodes. These topologies may suffer in performance and scalability. For instance, as the number of nodes and the number of processes/threads increases, the number of connections per node may grow quadratically, e.g. n^2 connections, where n is the number of processes/threads. As the underlying interconnect may have to multiplex/demultiplex each connection onto a single interface, this contention may cause a performance bottleneck.
[0009] Also, there is typically a limit on the number of connections that may be possible per node. As the number of nodes and the number of processes/threads increase, this limit may be reached. Many of the messages being sent in these topologies may be small, which results in poor network throughput. Current systems, which are implemented in user level code, may not fully solve the above problems. Thus, current approaches may have shortcomings from both a performance and scalability perspective.
[0010] Some software systems may aggregate messages to improve performance and scalability. These systems may aggregate many smaller messages into a single larger message which is then sent over the network. However, the smaller messages may remain in the aggregated buffer until the buffer is full or some other threshold is reached (e.g. maximum time to wait).
[0011] Thus, under light messaging loads, aggregation systems that utilize fixed thresholds to determine when to send the aggregation buffer may wait too long to send the buffer, resulting in excessive latency. Conversely, under heavy messaging loads, aggregation systems that utilize fixed thresholds to determine when to send the aggregation may send the buffer too soon, resulting in non-optimal network throughput.
[0012] Examples may utilize an adaptive algorithm to determine when to send the aggregation or send buffer based on existing messaging traffic. An example system may include a send buffer and a threshold unit. The send buffer may be allocated within a kernel of an operating system (OS) of a first node; the kernel of the first node may aggregate a plurality of the messages stored at the send buffer into a single transfer and output the single transfer across a network to a second node. The threshold unit may dynamically vary when the send buffer is sent based on a messaging traffic load.
[0013] Thus, a balance between message latency and network throughput may be improved or optimized for a given messaging load. For instance, examples may dynamically adjust appropriate thresholds based on light or heavy messaging loads to improve or optimize message latency.
[0014] Referring now to the drawings, FIG. 1 is an example block diagram of a system 100 to determine when to send a send buffer based on a messaging traffic load. The system 100 may be any type of message aggregation system, such as a communication network. Example types of communication networks may include wide area networks (WAN), metropolitan area networks (MAN), local area networks (LAN), Internet area networks (IAN), campus area networks (CAN) and virtual private networks (VPN).
[0015] The system 100 is shown to include a first node 110. The term node may refer to a connection point, a redistribution point or a communication endpoint. The node may be an active electronic device that is attached to a network, and is capable of sending, receiving, or forwarding information over a communications channel. Examples of the node may include data communication equipment (DCE) such as a modem, hub, bridge or switch; or data terminal equipment (DTE) such as a digital telephone handset, a printer or a host computer, like a router, a workstation or a server.
[0016] The first node is shown to include an operating system (OS) 120. An OS may be a collection of software that manages computer hardware resources and provides common services for applications. Example types of OSs may include Android, BSD, iOS, Linux, OS X, QNX, Microsoft Windows, Windows Phone, and IBM z/OS.
[0017] The OS 120 is shown to include a kernel 130. The kernel 130 may be a central part of the OS 120 that loads first, and remains in main memory (not shown). Typically, the kernel 130 may be responsible for memory management, process and task management, and disk management. For example, the kernel 130 may manage input/output requests from an application and translate them into data processing instructions for a central processing unit (not shown) and other electronic components of a node. The kernel 130 may also allocate requests from applications to perform I/O to an appropriate device or part of a device.
[0018] The kernel 130 is shown to include a send buffer 140 and a threshold unit 150. The term buffer may refer to a region of physical memory storage used to temporarily store data while it is being moved from one place to another. The kernel 130 may be included in any electronic, magnetic, optical, or other physical storage device that contains or stores information, such as Random Access Memory (RAM), flash memory, a solid state drive (SSD), a hard disk drive (HDD) and the like.
[0019] The threshold unit 150 may include, for example, a hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the threshold unit 150 may be implemented as a series of instructions encoded on a machine-readable storage medium and executable by a processor.
[0020] The send buffer 140 is allocated within the kernel 130 of the OS 120 of the first node 110. Thus, the send buffer 140 may not be paged out of the kernel 130. The kernel 130 of the first node 110 may aggregate a plurality of the messages 142-1 to 142-n, where n is a natural number, stored at the send buffer 140 into a single transfer and may output the single transfer across a network to a second node.
[0021] The threshold unit 150 may dynamically vary when the send buffer 140 is sent based on a messaging traffic load. For example, the threshold unit 150 may determine a time to output the single transfer based on, for example, a timeout threshold, a number of messages within the send buffer 140, an amount of data stored at the send buffer 140 and the like. The threshold unit 150 is explained in greater detail below with respect to FIG. 2.
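For illustration only, the following C sketch shows one way such a threshold check might look; the struct and function names are hypothetical, since the patent does not specify an implementation. The buffer is sent as soon as any one of the size, number, or time thresholds is crossed:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical state for the send buffer and the three thresholds. */
    struct send_buffer {
        size_t   bytes_used;     /* data aggregated so far */
        unsigned msg_count;      /* messages aggregated so far */
        uint64_t first_msg_ns;   /* when the first message was stored */
    };

    struct thresholds {
        size_t   size_threshold;    /* e.g. 128 KB or 256 KB */
        unsigned num_threshold;     /* max messages per transfer */
        uint64_t time_threshold_ns; /* max wait since the first message */
    };

    /* Send the aggregated buffer once any one threshold is crossed. */
    static bool should_send(const struct send_buffer *b,
                            const struct thresholds *t, uint64_t now_ns)
    {
        if (b->msg_count == 0)
            return false;                       /* nothing to send */
        if (b->bytes_used >= t->size_threshold)
            return true;                        /* size threshold crossed */
        if (b->msg_count >= t->num_threshold)
            return true;                        /* number threshold crossed */
        return now_ns - b->first_msg_ns >= t->time_threshold_ns; /* timeout */
    }

A caller would test should_send() both when a message is appended and when a periodic timer fires, covering data-driven sends as well as timeout-driven sends.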
[0022] FIG. 2 is another example block diagram of a system 200 to determine when to send a send buffer based on a messaging traffic load. The system 200 may be any type of message aggregation system, such as a communication network, e.g. a WAN, LAN or VPN. The system 200 is shown to include a first node 205 and a second node 210. The second node 210 may include the functionality and/or hardware of the first node 205 of FIG. 2 and/or vice versa. While the system 200 is shown to only include two nodes 205 and 210, examples may include more than two nodes, such as a cluster of hundreds or thousands of nodes.
[0023] The first node 205 of FIG. 2 may include the functionality and/or hardware of the first node 110 of FIG. 1. For example, the first node 205 of FIG. 2 includes the OS 120 of FIG. 1, where the OS 120 includes the kernel 130. The kernel 130 includes the send buffer 140 and a threshold unit 250, where the threshold unit 250 includes at least the functionality and/or hardware of the threshold unit 150 of FIG. 1.
[0024] A receive buffer 240 is allocated within a kernel 230 of an OS 220 of the second node 210. The receive buffer 240 may receive the single transfer directly from the send buffer 140. The send and receive buffers 140 and 240 may be persistently allocated in the kernels 130 and 230 of the first and second nodes 110 and 210 for remote direct memory access (RDMA) transfers. By using a persistent RDMA approach, any preparing/unpreparing of the buffers per message transfer may be avoided.
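The persistent-allocation idea can be sketched with the user-space libibverbs API; this is an analogy only, since the patent allocates the buffers inside the kernels of the nodes, and the function name below is illustrative. The buffer is registered once and the resulting memory region is reused for every transfer:

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    /* Hypothetical one-time setup: allocate and register the aggregation
     * buffer once, so no per-transfer preparing/unpreparing is needed. */
    static struct ibv_mr *setup_persistent_buffer(struct ibv_pd *pd,
                                                  size_t len, void **buf_out)
    {
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len))   /* page-aligned memory */
            return NULL;

        /* Register once; the memory region (and its lkey/rkey) is then
         * reused for every RDMA transfer over the connection's lifetime. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }
        *buf_out = buf;
        return mr;
    }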
[0025] The threshold unit 250 is shown to include a size threshold 252 to relate to a size of the send buffer 140. Example sizes of the size threshold 252 may include 128 kilobytes (KB), 256 KB and the like. In one example, the threshold unit 250 may halve and/or double the value of the size threshold 252 when the threshold unit 250 dynamically varies the value of the size threshold 252.
[0026] The threshold unit 250 is shown to further include a number threshold 254 to relate to a number of the plurality of messages stored at the send buffer 140, and a time threshold 256 to relate to a time elapsed since a first one 142-1 of the plurality of messages 142-1 to 142-n was stored at the send buffer 140. However, examples may include more or fewer than the above three thresholds 252, 254 and 256.
[0027] The threshold unit 250 may classify the messaging traffic load to be at least one of a low messaging load and a heavy messaging load. The threshold unit 250 may dynamically vary a value of at least one of the size, number and time threshold 252, 254 and 256 based on the messaging traffic load.
[0028] The threshold unit 250 may send the send buffer 140 sooner to reduce message latency, if the messaging traffic load is classified as the low messaging load. Conversely, the threshold unit 250 may send the send buffer 140 later to increase network throughput, if the messaging traffic load is classified as the heavy messaging load.
[0029] For instance, the threshold unit 250 may reduce the value of at least one of the size, number and time threshold 252, 254 and 256 to send the send buffer 140 sooner, if the messaging traffic load is classified as the low messaging load. The threshold unit 250 may increase the value of at least one of the size, number and time threshold 252, 254 and 256 to send the send buffer 140 later, if the messaging traffic load is classified as the heavy messaging load.
[0030] For example, the threshold unit 250 may carry out a combination of any of the following actions. The threshold unit 250 may reduce the value of the size threshold 252, if the messaging traffic load is classified as the low messaging load in response to a size of the messages at the send buffer 140 being relatively small. The threshold unit 250 may reduce the value of the number threshold 254, if the messaging traffic load is classified as the low messaging load in response to a number of the messages at the send buffer 140 being relatively small. The threshold unit 250 may reduce the value of the time threshold 256, if the messaging traffic load is classified as the low messaging load in response to both the number and the size of the messages at the send buffer 140 being relatively small.
[0031] The threshold unit 250 may also carry out a combination of any of the following actions. The threshold unit 250 may increase the value of the size threshold 252, if the messaging traffic load is classified as the heavy messaging load in response to a size of the messages at the send buffer 140 being relatively large. The threshold unit 250 may increase the value of the number threshold 254, if the messaging traffic load is classified as the heavy messaging load in response to a number of the messages at the send buffer 140 being relatively large. The threshold unit 250 may increase the value of the time threshold 256, if the messaging traffic load is classified as the heavy messaging load in response to both the number and the size of the messages at the send buffer 140 being relatively large.
[0032] The threshold unit 250 may not reduce the value of at least one of the size, number and time threshold 252, 254 and 256, if the value is already at a minimum value. An example of the minimum value may be 1 millisecond. The threshold unit 250 may also not increase the value of at least one of the size, number and time threshold 252, 254 and 256, if the value is already at a maximum value.
[0033] An amount by which at least one of the size, number and time thresholds 252, 254 and 256 is varied by the threshold unit 250 may further be based on an amount by which at least an other of the size, number and time thresholds 252, 254 and 256 is varied. In one example, the threshold unit 250 may dynamically vary the value of at least one of the size and time thresholds 252 and 256 before dynamically varying the value of the number threshold 254. In another example, the threshold unit 250 may also dynamically vary the size threshold 252 before dynamically varying the value of the time threshold 256.
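Continuing the earlier sketch, a hypothetical adjustment step consistent with paragraphs [0027] through [0033] might halve or double thresholds while clamping them at minimum and maximum values. Apart from the 1 millisecond minimum mentioned above, the floors and ceilings here are invented for illustration:

    /* Reuses struct thresholds from the earlier sketch. */
    enum load { LOW_LOAD, HEAVY_LOAD };

    #define SIZE_MIN_BYTES (16u * 1024u)     /* illustrative floor */
    #define SIZE_MAX_BYTES (1024u * 1024u)   /* illustrative ceiling */
    #define TIME_MIN_NS    1000000ull        /* 1 ms minimum, per [0032] */
    #define TIME_MAX_NS    100000000ull      /* illustrative ceiling */

    static void adjust_thresholds(struct thresholds *t, enum load load)
    {
        if (load == LOW_LOAD) {
            /* Light traffic: shrink thresholds so the buffer is sent
             * sooner, reducing per-message latency. */
            if (t->size_threshold / 2 >= SIZE_MIN_BYTES)
                t->size_threshold /= 2;
            if (t->time_threshold_ns / 2 >= TIME_MIN_NS)
                t->time_threshold_ns /= 2;
        } else {
            /* Heavy traffic: grow thresholds so more messages are
             * aggregated per transfer, increasing network throughput. */
            if (t->size_threshold * 2 <= SIZE_MAX_BYTES)
                t->size_threshold *= 2;
            if (t->time_threshold_ns * 2 <= TIME_MAX_NS)
                t->time_threshold_ns *= 2;
        }
        /* Per [0033], the number threshold could be varied last, after
         * the size and time thresholds. */
    }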
[0034] In one example, the threshold unit 150 may be implemented via kernel level code. In another example, the threshold unit 250 may dynamically vary when the send buffer is sent in response to a burst of message traffic. In yet another example, at least one of the time, number and size threshold may be varied based on a run-time analysis of a pattern of the messaging traffic load. Thus, examples may provide an improved or optimized balance between individual message latency and network throughput.
[0035] FIG. 3 is an example block diagram of a computing device 300 including instructions for determining when to send a send buffer based on a messaging traffic load. In the embodiment of FIG. 3, the computing device 300 includes a processor 310 and a machine-readable storage medium 320. The machine-readable storage medium 320 further includes instructions 321, 323, 325, 327 and 329 for determining when to send a send buffer based on a messaging traffic load.
[0036] The computing device 300 may be included in or part of, for example, a microprocessor, a controller, a memory module or device, a notebook computer, a desktop computer, an all-in-one system, a server, a network device, a wireless device, or any other type of device capable of executing the instructions 321, 323, 325, 327 and 329. In certain examples, the computing device 300 may include or be connected to additional components such as memories, controllers, etc.
[0037] The processor 310 may be at least one central processing unit (CPU), at least one semiconductor-based microprocessor, at least one graphics processing unit (GPU), a microcontroller, special purpose logic hardware controlled by microcode or other hardware devices suitable for retrieval and execution of instructions stored in the machine-readable storage medium 320, or combinations thereof. The processor 310 may fetch, decode, and execute instructions 321, 323, 325, 327 and 329 to implement determining when to send the send buffer based on the messaging traffic load. As an alternative or in addition to retrieving and executing instructions, the processor 310 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 321, 323, 325, 327 and 329.
[0038] The machine-readable storage medium 320 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, the machine-readable storage medium 320 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like. As such, the machine-readable storage medium 320 can be non-transitory. As described in detail below, machine-readable storage medium 320 may be encoded with a series of executable instructions for determining when to send the send buffer based on the messaging traffic load.
[0039] Moreover, the instructions 321, 323, 325, 327 and 329 when executed by a processor (e.g., via one processing element or multiple processing elements of the processor) can cause the processor to perform processes, such as the process of FIG. 4. For example, the measure instructions 321 may be executed by the processor 310 to measure a messaging traffic load based on a number and size of messages aggregated at a send buffer allocated within the kernel of an operating system (OS) of a first node.
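The patent leaves the measurement policy open; one hypothetical reading of the measure instructions 321, reusing enum load from the sketch above and with purely illustrative cutoffs, derives message and byte rates over a sampling window:

    /* Continues the earlier sketches (same includes and enum load). */
    struct traffic_sample {
        unsigned messages;    /* messages seen during the window */
        size_t   bytes;       /* bytes seen during the window */
        uint64_t window_ns;   /* window length */
    };

    static enum load classify_load(const struct traffic_sample *s)
    {
        if (s->window_ns == 0)
            return LOW_LOAD;  /* no data yet */

        uint64_t msgs_per_s  = (uint64_t)s->messages * 1000000000ull / s->window_ns;
        uint64_t bytes_per_s = (uint64_t)s->bytes * 1000000000ull / s->window_ns;

        /* Illustrative cutoffs: ~10k msgs/s or ~100 MB/s counts as heavy. */
        if (msgs_per_s > 10000 || bytes_per_s > 100ull * 1024 * 1024)
            return HEAVY_LOAD;
        return LOW_LOAD;
    }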
[0040] The vary number instructions 323 may be executed by the processor 310 to vary a number threshold based on the number of the aggregated messages. The number threshold may relate to a number of the plurality of messages stored at the send buffer. The vary size instructions 325 may be executed by the processor 310 to vary a size threshold based on the size of the aggregated messages. The size threshold may relate to a size of the send buffer.
[0041] The vary time instructions 327 may be executed by the processor 310 to vary a time threshold based on the number and the size of the aggregated messages. The time threshold may relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer. The determine instructions 329 may be executed by the processor 310 to determine when to send the send buffer as a single transfer across a network to a second node based on the number, size and time thresholds.
[0042] FIG. 4 is an example flowchart of a method 400 for determining when to send a send buffer based on a messaging traffic load. Although execution of the method 400 is described below with reference to the system 200, other suitable components for execution of the method 400 can be utilized, such as the system 100. Additionally, the components for executing the method 400 may be spread among multiple devices (e.g., a processing device in communication with input and output devices). In certain scenarios, multiple devices acting in coordination can be considered a single device to perform the method 400. The method 400 may be implemented in the form of executable instructions stored on a machine-readable storage medium, such as storage medium 320, and/or in the form of electronic circuitry.
[0043] At block 410, a first node 110 may classify a messaging traffic load to be at least one of a low messaging load and a heavy messaging load. At block 420, the first node 110 may vary a value of at least one of a time, number and size threshold 252, 254 and 256 based on the classification. The varying at block 420 may also assign initial values to the size, number and time thresholds 252, 254 and 256. The initial values of at least one of the size, number and time threshold 252, 254 and 256 may be based on a type of a network including the first and second nodes 205 and 210.
[0044] At block 430, the first node 110 may determine when to send a send buffer 140 allocated within a kernel 130 of an operating system (OS) 120 of the first node 205 based on the at least one of the time, number and size threshold 252, 254 and 256. The kernel 130 of the first node 205 may aggregate a plurality of the messages 142-1 to 142-n stored at the send buffer 140 into a single transfer and output the single transfer across a network to the second node 210. The size threshold 252 may relate to a size of the send buffer 140. The number threshold 254 may relate to a number of the plurality of messages 142-1 to 142-n stored at the send buffer 140. The time threshold 256 may relate to a time elapsed since a first one 142-1 of the plurality of messages 142-1 to 142-n was stored at the send buffer 140.
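Blocks 410 through 430 can be tied together with the hypothetical helpers sketched earlier; send_single_transfer() below is a made-up stand-in for the kernel's aggregation-and-send path:

    /* Provided elsewhere (hypothetical): aggregates the buffered messages
     * into one transfer and outputs it across the network. */
    extern void send_single_transfer(struct send_buffer *b);

    static void aggregation_tick(struct send_buffer *b, struct thresholds *t,
                                 const struct traffic_sample *s, uint64_t now_ns)
    {
        enum load load = classify_load(s); /* block 410: classify the load */
        adjust_thresholds(t, load);        /* block 420: vary the thresholds */
        if (should_send(b, t, now_ns))     /* block 430: decide when to send */
            send_single_transfer(b);
    }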

Claims

CLAIMS
We claim:
1. A system, comprising:
a send buffer allocated within a kernel of an operating system (OS) of a first node, the kernel of the first node to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node; and
a threshold unit to dynamically vary when the send buffer is sent based on a messaging traffic load.
2. The system of claim 1, wherein the threshold unit is to include at least one of,
a size threshold to relate to a size of the send buffer,
a number threshold to relate to a number of the plurality of messages stored at the send buffer, and
a time threshold to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.
3. The system of claim 2, wherein,
the threshold unit is to classify the messaging traffic load to be at least one of a low messaging load and a heavy messaging load, and
the threshold unit is to dynamically vary a value of at least one of the size, number and time threshold based on the messaging traffic load.
4. The system of claim 3, wherein,
the threshold unit is to send the send buffer sooner to reduce message latency, if the messaging traffic load is classified as the low messaging load, and
the threshold unit is to send the send buffer later to increase network throughput, if the messaging traffic load is classified as the heavy messaging load.
5. The system of claim 4, wherein,
the threshold unit is to reduce the value of at least one of the size, number and time threshold to send the send buffer sooner, if the messaging traffic load is classified as the low messaging load, and
the threshold unit is to increase the value of at least one of the size, number and time threshold to send the send buffer later, if the messaging traffic load is classified as the heavy messaging load.
6. The system of claim 5, wherein the threshold unit is to at least one of,
reduce the value of the size threshold, if the messaging traffic load is classified as the low messaging load in response to a size of the messages at the send buffer being relatively small,
reduce the value of the number threshold, if the messaging traffic load is classified as the low messaging load in response to a number of the messages at the send buffer being relatively small, and
reduce the value of the time threshold, if the messaging traffic load is classified as the low messaging load in response to both the number and the size of the messages at the send buffer being relatively small.
7. The system of claim 5, wherein the threshold unit is to at least one of,
increase the value of the size threshold, if the messaging traffic load is classified as the heavy messaging load in response to a size of the messages at the send buffer being relatively large,
increase the value of the number threshold, if the messaging traffic load is classified as the heavy messaging load in response to a number of the messages at the send buffer being relatively large, and
increase the value of the time threshold, if the messaging traffic load is classified as the heavy messaging load in response to both the number and the size of the messages at the send buffer being relatively large.
8. The system of claim 5, wherein,
the threshold unit is to not reduce the value of at least one of the size, number and time threshold, if the value is already at a minimum value, and
the threshold unit is to not increase the value of at least one of the size, number and time threshold, if the value is already at a maximum value.
9. The system of claim 2, wherein an amount by which at least one of the size, number and time thresholds is varied by the threshold unit is further based on an amount by which at least an other of the size, number and time thresholds is varied.
10. The system of claim 9, wherein the threshold unit is to dynamically vary the value of at least one of the size and time threshold before dynamically varying the value of the number threshold.
11. The system of claim 10, wherein the threshold unit is to dynamically vary the size threshold before dynamically varying the value of the time threshold.
12. A method, comprising:
classifying a messaging traffic load at a first node to be at least one of a low messaging load and a heavy messaging load;
varying a value of at least one of a time, number and size threshold based on the classification; and
determining when to send a send buffer allocated within a kernel of an operating system (OS) of the first node based on the at least one of the time, number and size threshold, wherein
the kernel of the first node is to aggregate a plurality of the messages stored at the send buffer into a single transfer and to output the single transfer across a network to a second node,
the size threshold is to relate to a size of the send buffer,
the number threshold is to relate to a number of the plurality of messages stored at the send buffer, and
the time threshold is to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.
13. The method of claim 12, wherein,
the varying assigns initial values to the size, number and time thresholds, and
the initial values of at least one of the size, number and time threshold is based on a type of a network including the first and second nodes.
14. A non-transitory computer-readable storage medium storing instructions that, if executed by a processor of a device, cause the processor to:
measure a messaging traffic load based on a number and size of messages aggregated at a send buffer allocated within the kernel of an operating system (OS) of a first node;
vary a number threshold based on the number of the aggregated messages;
vary a size threshold based on the size of the aggregated messages;
vary a time threshold based on the number and the size of the aggregated messages; and
determine when to send the send buffer as a single transfer across a network to a second node based on the number, size and time thresholds.
15. The non-transitory computer-readable storage medium of claim 14, wherein,
the size threshold is to relate to a size of the send buffer,
the number threshold is to relate to a number of the plurality of messages stored at the send buffer, and
the time threshold is to relate to a time elapsed since a first one of the plurality of messages was stored at the send buffer.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Publications (1)

Publication Number Publication Date
WO2016028268A1 (en)

Family

ID=55351065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/051630 WO2016028268A1 (en) 2014-08-19 2014-08-19 Send buffer based on messaging traffic load

Country Status (1)

Country Link
WO (1) WO2016028268A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6788697B1 (en) * 1999-12-06 2004-09-07 Nortel Networks Limited Buffer management scheme employing dynamic thresholds
US20030009482A1 (en) * 2001-06-21 2003-01-09 International Business Machines Corporation Method and system for dynamically managing data structures to optimize computer network performance
US20040264454A1 (en) * 2003-06-27 2004-12-30 Ajay Rajkumar Packet aggregation for real time services on packet data networks
EP1848172A1 (en) * 2006-04-19 2007-10-24 Nokia Siemens Networks Gmbh & Co. Kg Method and machine for aggregating a plurality of data packets into a unified transport data packet
US20110182294A1 (en) * 2010-01-28 2011-07-28 Brocade Communications Systems, Inc. In-order traffic aggregation with reduced buffer usage

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112422244A (en) * 2019-08-21 2021-02-26 无锡江南计算技术研究所 RDMA buffer dynamic allocation method based on flow load prediction
CN112422244B (en) * 2019-08-21 2022-11-25 无锡江南计算技术研究所 RDMA buffer dynamic allocation method based on flow load prediction

Similar Documents

Publication Publication Date Title
US20200241927A1 (en) Storage transactions with predictable latency
US9965441B2 (en) Adaptive coalescing of remote direct memory access acknowledgements based on I/O characteristics
US11381515B2 (en) On-demand packet queuing in a network device
CN111614746B (en) Load balancing method and device of cloud host cluster and server
US11805070B2 (en) Technologies for flexible and automatic mapping of disaggregated network communication resources
US10768823B2 (en) Flow control for unaligned writes in network storage device
US9092259B2 (en) Apparatus and method for controlling a resource utilization policy in a virtual environment
US11567556B2 (en) Platform slicing of central processing unit (CPU) resources
US20130100955A1 (en) Technique for prioritizing traffic at a router
WO2022025966A1 (en) Receiver-based precision congestion control
US11265235B2 (en) Technologies for capturing processing resource metrics as a function of time
US20210359955A1 (en) Cache allocation system
US20210320866A1 (en) Flow control technologies
US20210326177A1 (en) Queue scaling based, at least, in part, on processing load
US20220109733A1 (en) Service mesh offload to network devices
CN117157957A (en) Switch-induced congestion messages
US11099767B2 (en) Storage system with throughput-based timing of synchronous replication recovery
WO2016028268A1 (en) Send buffer based on messaging traffic load
US11388050B2 (en) Accelerating machine learning and profiling over a network
US10042682B2 (en) Copy message from application buffer to send buffer within kernel
WO2016160033A1 (en) Compress and load message into send buffer
US11687146B1 (en) Power consumption control
US20210328945A1 (en) Configurable receive buffer size
US20240031295A1 (en) Storage aware congestion management
US20230401079A1 (en) Resource allocation in virtualized environments

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14899990; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14899990; Country of ref document: EP; Kind code of ref document: A1)