US10375168B2 - Throughput in openfabrics environments - Google Patents

Throughput in openfabrics environments

Info

Publication number
US10375168B2
US10375168B2 US15/168,449 US201615168449A
Authority
US
United States
Prior art keywords
data
header
buffers
minimum number
placement information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US15/168,449
Other versions
US20170346899A1
Inventor
Adhiraj Joshi
Abhijit Toley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Veritas Technologies LLC
Original Assignee
Veritas Technologies LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Veritas Technologies LLC filed Critical Veritas Technologies LLC
Assigned to VERITAS TECHNOLOGIES LLC reassignment VERITAS TECHNOLOGIES LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: JOSHI, ADHIRAJ, TOLEY, ABHIJIT
Priority to US15/168,449
Assigned to BANK OF AMERICA, N.A., AS COLLATERAL AGENT reassignment BANK OF AMERICA, N.A., AS COLLATERAL AGENT PATENT SECURITY AGREEMENT Assignors: VERITAS TECHNOLOGIES LLC
Priority to CN201780029882.6A
Priority to PCT/US2017/033951
Priority to EP17733209.5A
Priority to JP2018561265A
Publication of US20170346899A1
Publication of US10375168B2
Application granted
Assigned to WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT reassignment WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT PATENT SECURITY AGREEMENT SUPPLEMENT Assignors: Veritas Technologies, LLC
Assigned to WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT reassignment WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: VERITAS TECHNOLOGIES LLC
Assigned to VERITAS TECHNOLOGIES LLC reassignment VERITAS TECHNOLOGIES LLC TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS AT R/F 052426/0001 Assignors: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L67/2842
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/28 DMA
    • G06F2213/2806 Space or buffer allocation for DMA transfers

Definitions

  • This disclosure relates to throughput.
  • In particular, this disclosure relates to improving throughput in OpenFabrics computing environments.
  • OpenFabrics Enterprise Distribution (OFED™) is open-source computing technology for Remote Direct Memory Access (RDMA) and kernel bypass applications. OFED can be used in computing environments that require highly efficient networks, storage connectivity, and parallel computing.
  • OFED provides kernel-level drivers, RDMA send/receive operations, services for parallel message passing (MPI), kernel bypasses of an operating system (OS), and kernel and user-level application programming interfaces (APIs). Therefore, OFED can be used for applications that require high efficiency computing, wire-speed messaging, microsecond latencies, and fast input/output (I/O) for storage and file systems.
  • RDMA involves direct memory access (DMA) from the memory of one computing system into that of another computing system without involving either computing system's OS.
  • RDMA provides remote direct memory access, asynchronous work queues, and kernel bypass. RDMA permits increased throughput and low latency.
  • OFED APIs can accept input parameters (e.g., in the form of a header and one or more data packets). Because the header is typically small in size (e.g., compared to the data), it is efficient to coalesce the data and the header to minimize the number of RDMA writes. However, coalescing the data and the header, while keeping the data page-boundary-aligned, can result in on-wire wastage (e.g., the amount of data that is sent or transferred over the network).
  • One such method involves receiving data and a header, and identifying buffers in which the data and the header are to be written. Placement information for the data and the header is determined based, at least in part, on a size of each buffer, a page-boundary-alignment of the data, and a header alignment of the header. The data and the header are then written to the buffer(s) using the placement information. In this example, data is written on page boundaries and the header is written on a header boundary.
  • In certain embodiments, using the placement information results in the utilization of a minimum number of buffers, the data being page-boundary-aligned when written to that minimum number of buffers, and minimal (or zero) on-wire wastage.
  • The placement information includes instructions to write the data and the header to a second to last buffer.
  • In some embodiments, the header and the data are coalesced (combined) into a Remote Direct Memory Access (RDMA) write by mapping the header and the data contained in multiple source buffers to one (or more) destination buffers based on the placement information.
  • The RDMA write, which includes and is accompanied by a 32-bit data space containing metadata, is sent or transmitted to a destination along with the placement information.
  • If the data cannot be page-boundary-aligned in the minimum number of buffers, one or more additional buffers can be selected.
  • In this example, the buffers include multiple destination buffers, and the minimum number of buffers includes one or more destination buffers.
  • FIG. 1 is a block diagram of an OFED computing system, according to one embodiment of the present disclosure.
  • FIG. 2 is a block diagram of a source computing system that implements an OFED API, according to one embodiment of the present disclosure.
  • FIG. 3A is a block diagram of data units and a header that are not coalesced, according to one embodiment of the present disclosure.
  • FIG. 3B is a block diagram of a header that is written at the start of a buffer, according to one embodiment of the present disclosure.
  • FIG. 3C is a block diagram of a header that is written immediately after data, according to one embodiment of the present disclosure.
  • FIG. 3D is a block diagram of a 32-bit data space provided by the OFED API, according to one embodiment of the present disclosure.
  • FIG. 3E is a block diagram of a header that is written at the end of an alignment, according to one embodiment of the present disclosure.
  • FIG. 4A is a block diagram of a header that is written at the end of an alignment in a second to last buffer, according to one embodiment of the present disclosure.
  • FIG. 4B is a block diagram of a header that is written at the end of an alignment without additional buffers, according to one embodiment of the present disclosure.
  • FIG. 4C is a block diagram of a header that is written to a second to last buffer, according to one embodiment of the present disclosure.
  • FIG. 5A is a flowchart that illustrates a process for filling buffers with data and a header, according to one embodiment of the present disclosure.
  • FIG. 5B is a flowchart that illustrates a process for combining a header and data, according to one embodiment of the present disclosure.
  • FIG. 6 is a flowchart that illustrates a process for determining placement/mapping information of a header and data, according to one embodiment of the present disclosure.
  • FIG. 7 is a flowchart that illustrates a process for generating and transmitting data and a header using RDMA, according to one embodiment of the present disclosure.
  • FIG. 8 is a block diagram of a computing system, illustrating how certain module(s) can be implemented in software, according to one embodiment of the present disclosure.
  • FIG. 9 is a block diagram of a networked system, illustrating how various devices can communicate via a network, according to one embodiment of the present disclosure.
  • Data transfer between two (or more) computing systems can involve sending data and an accompanying header (as well as other metadata) directly from the memory of one computing system (e.g., from multiple source buffers) to the memory of another computing system (e.g., to one or more destination buffers).
  • Because a header is typically small in size (e.g., 1 k), performing a write operation solely to transfer the header from the memory of one computing system to the memory of another computing system is inefficient, not to mention resource intensive. It is therefore preferable to coalesce (or combine) the data and the header: data and a header contained in multiple source buffers can be mapped to a single destination buffer (or multiple destination buffers). This placement (or mapping) information can then be transmitted from the source buffers to the one or more destination buffers along with the data and the header as part of a single write operation.
  • OFED can be implemented in computing environments that require highly efficient networks, storage connectivity, and parallel computing.
  • OFED provides kernel-level drivers, channel oriented RDMA and send/receive operations, kernel bypasses of an operating system (OS), and kernel and user-level application programming interfaces (APIs) (e.g., OFED API for RDMA transfer).
  • OFED also provides services for parallel message passing (MPI), sockets data exchange (e.g., Session Description Protocol (SDP)), Network Attached Storage (NAS) and Storage Area Network (SAN) storage, and file systems.
  • RDMA involves direct memory access (DMA) from the memory of one computing system (e.g., a source computing system) into that of another computing system (e.g., a destination computing system) without involving either the source or destination computing system's OS.
  • A network adapter of the source computing system can send a message to the network adapter of the destination computing system that permits the network adapter of the destination computing system to directly transfer data to (or from) the destination computing system's memory.
  • Messages in RDMA can be of at least two types.
  • The first type is an RDMA write.
  • An RDMA write includes an address and data to put (or write) at that address.
  • The RDMA write permits the network adapter that receives it to write (or put) the supplied data at the specified address.
  • The second type is an RDMA read.
  • An RDMA read includes an address and a length. The RDMA read permits the network adapter that receives the RDMA read to generate a reply that sends back the data at the address requested.
  • Both of these message types are "one-sided": they are processed by the network adapter that receives them without involving the central processing unit (CPU) of the computing system that receives them.
  • Network adapters can be accessed via an asynchronous interface (also called a "verbs" interface).
  • To use a network adapter (e.g., to perform RDMA operations), objects called queue pairs (QPs) can be created.
  • QPs involve a pair of work queues (a send queue and a receive queue), as well as completion queues (CQs).
  • An RDMA operation can be posted to a work queue (e.g., as a request).
  • The RDMA operation is then executed asynchronously; when it completes, the network adapter adds work completion information to the end of the CQ. Completion information can then be retrieved from the CQ to determine which requests have been completed.
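  • By way of illustration only, this asynchronous completion flow might be drained in user space roughly as in the following C sketch, which uses the widely available libibverbs API; the completion queue is assumed to have been created elsewhere (e.g., with ibv_create_cq()), and this is not asserted to be the implementation used by the disclosure.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: retrieve work completions from a completion queue
 * (CQ) to determine which posted requests have finished. */
static int drain_completions(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);            /* non-blocking poll */

    for (int i = 0; i < n; i++) {
        if (wc[i].status != IBV_WC_SUCCESS) {
            fprintf(stderr, "request %llu failed: %s\n",
                    (unsigned long long)wc[i].wr_id,
                    ibv_wc_status_str(wc[i].status));
            continue;
        }
        /* wc[i].wr_id identifies which posted work request completed. */
        printf("request %llu completed\n", (unsigned long long)wc[i].wr_id);
    }
    return n;                                    /* completions seen */
}
```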
  • RDMA-enabled network adapters also support “two-sided” send/receive operations, in addition to one-sided RDMA operations.
  • network adapters in RDMA computing environments can also permit userspace processes to perform fast-path operations (e.g., posting work requests and retrieving work completions) directly with the hardware without involving the kernel (thus saving time associated with system call overhead). Therefore, RDMA permits high-throughput and low-latency networking (e.g., which is especially useful in parallel computing clusters).
  • OFED provides APIs for RDMA data transfer. If OFED APIs are used for RDMA data transfer, a client's message will include a header and one or more data packets. Because the size of the header is typically small compared to the size of the data, it is not efficient to write the data and the header independently (e.g., as two separate RDMA writes). The data and the header can be coalesced (or combined) (e.g., as part of a single RDMA write) to reduce the total number of RDMA writes.
  • the OFED API also provides a 32-bit data space for private use (also called 32-bit immediate data or private data) as part of each RDMA write.
  • This 32-bit data space can be used to store buffer location information as well as other metadata associated with the RDMA write, in addition to information associated with a header itself (e.g., a header offset indicating the location of the header on a header boundary).
  • In RDMA data transfer, applications (e.g., one or more applications executing on destination computing system 145) typically require the data to be page boundary aligned. For example, a computing system reads or writes data to a memory address in word-sized chunks (e.g., 4 byte chunks on a 32-bit system) or larger.
  • Page boundary alignment involves writing data at a memory address equal to some multiple of the word size so as to increase the computing system's performance (e.g., due to the way certain memory management systems manage memory).
  • To page boundary align the data, it may be necessary to insert (or treat) some bytes as meaningless "don't care" bytes (referred to as wastage or padding) between the end of the last data structure and the start of the next data structure. Therefore, the requirement that data remain page boundary aligned can result in wastage that is unnecessarily and redundantly sent on-wire (e.g., transmitted over the network as part of the RDMA data transfer).
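  • As a simple, hedged illustration of this padding arithmetic (the 4 k page size and the helper name below are assumptions for the example, not requirements of the disclosure), the wastage introduced before the next page-boundary-aligned write can be computed as follows:

```c
#include <stdio.h>
#include <stddef.h>

/* Bytes of "don't care" padding (wastage) needed so that the next write
 * starts on an alignment boundary (e.g., a 4 k page boundary). */
static size_t padding_to_alignment(size_t offset, size_t alignment)
{
    size_t remainder = offset % alignment;
    return remainder ? alignment - remainder : 0;
}

int main(void)
{
    size_t page = 4 * 1024;

    /* A 1 k header written at the start of a buffer forces 3 k of padding
     * before page-boundary-aligned data can begin (cf. FIG. 3B). */
    printf("padding after a 1 k header: %zu bytes\n",
           padding_to_alignment(1024, page));
    return 0;
}
```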
  • In addition, coalescing data and a header can result in additional redundant RDMA writes (e.g., one or more additional destination buffers may be required to write the header at the start of a header boundary, to maintain page boundary alignment, and the like). Therefore, minimizing the number of destination buffers that are used (or RDMA writes that are performed) to write the data and the header is also an important consideration.
  • Systems such as those implementing OFED employ (or can be modified to employ) the OFED API, which provides a 32-bit informational storage area as part of an RDMA write (for example, to maintain buffer location information, RDMA write metadata, header location information, and the like). Because only a few of the 32 bits of data space are available to maintain header location information (e.g., a header offset), the placement of the header is another important consideration (and limitation) when coalescing data and the header in such systems.
  • FIG. 1 is a block diagram of a computing system that implements and uses RDMA technology, according to one embodiment.
  • the computing system of FIG. 1 includes a source computing system 105 and a destination computing system 145 communicatively coupled via a network 180 .
  • Network 180 can include any type of network or interconnection (e.g., the Internet, a Wide Area Network (WAN), a SAN, and the like).
  • Source computing system 105 includes a source processor 110 and a source network adapter 115 .
  • Source processor 110 and source network adapter 115 are communicatively coupled to a source memory 120 .
  • Source memory 120 includes a source driver 125 , a source OS 130 , and an application 135 , which implements source buffers 140 ( 1 )-(N).
  • destination computing system 145 includes a destination processor 175 and a destination network adapter 170 communicatively coupled to a destination memory 150 .
  • Destination memory 150 includes destination buffers 155 ( 1 )-(N), a destination driver 160 , and a destination OS 165 .
  • the computing system of FIG. 1 permits remote direct memory access from source memory 120 of source computing system 105 into that of destination computing system 145 without involving source OS 130 or destination OS 165 (and vice-versa).
  • For example, source network adapter 115 of source computing system 105 can send a message (e.g., a client message) to destination network adapter 170 of destination computing system 145 that permits destination network adapter 170 to directly transfer data to (or from) destination memory 150 (and vice-versa).
  • Source network adapter 115 and destination network adapter 170 also permit userspace processes (on source computing system 105 and destination computing system 145, respectively) to perform RDMA operations (or other buffer-based operations amenable to methods such as those described herein) by bypassing their respective kernels and avoiding a system call (e.g., by bypassing source OS 130 and destination OS 165, as shown in FIG. 1).
  • RDMA operations can be managed and facilitated using an OFED API.
  • FIG. 2 is a block diagram of a source computing system that implements the OFED API, according to one embodiment.
  • Source computing system 105 includes an application programming interface (API) 205, such as the OpenFabrics Enterprise Distribution (OFED™) API.
  • Source computing system 105 also includes an RDMA module 210 , a buffer selector 215 , a data and header coalescer 220 , and a page boundary alignment calculator 225 .
  • RDMA module 210 (or other comparable support module), buffer selector 215 , data and header coalescer 220 , and page boundary alignment calculator 225 can be implemented as hardware or software, and as part of source computing system 105 or separately (e.g., as part of an RDMA server, a software appliance, a virtual machine, or some other type of computing device).
  • Source computing system 105 also includes source memory 120 .
  • Source memory 120 includes source buffers 140 ( 1 )-(N). Each source buffer contains data, or a combination of data and a header.
  • For example, source buffer 140(1) contains data 230(1) (e.g., with one or more data units), source buffer 140(N-2) contains data 230(2), source buffer 140(N-1) (e.g., the second to last buffer) contains data and header 235, and source buffer 140(N) contains data 230(N).
  • RDMA module 210 manages, facilitates, coordinates, and performs one or more RDMA operations. For example, RDMA module 210 can perform an RDMA write operation or an RDMA read operation.
  • Buffer selector 215 selects one or more buffers (e.g., destination buffers 155 ( 1 )-(N)) to fill with (or write) data, or data and a header.
  • Data and header coalescer 220 maps source buffers to destination buffers by coalescing (or combining) data and a header (e.g., in a single buffer, such as destination buffer 155(N-1), as part of a single RDMA write/packet).
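  • One hedged illustration of such coalescing, using the libibverbs API (a common OFED component), is a single RDMA write whose scatter/gather list references both a data buffer and a header buffer; the function below is a sketch under assumed setup (queue pair, memory registration, and remote buffer details established elsewhere) and is not taken from the claims.

```c
#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: coalesce a data buffer and a header buffer into one
 * RDMA write by listing both as scatter/gather entries of a single work
 * request, so only one write is posted instead of two. */
static int post_coalesced_write(struct ibv_qp *qp,
                                void *data, uint32_t data_len, uint32_t data_lkey,
                                void *hdr,  uint32_t hdr_len,  uint32_t hdr_lkey,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)data, .length = data_len, .lkey = data_lkey },
        { .addr = (uintptr_t)hdr,  .length = hdr_len,  .lkey = hdr_lkey  },
    };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = sge;
    wr.num_sge             = 2;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one write, two source buffers */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* destination buffer address */
    wr.wr.rdma.rkey        = rkey;               /* destination memory key */

    return ibv_post_send(qp, &wr, &bad);
}
```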
  • page boundary alignment calculator 225 determines the placement of data in destination buffers 155 ( 1 )-(N) such that the data is page boundary aligned.
  • RDMA module 210 determines the placement of data and the header among one or more (available) buffers (e.g., destination buffers 155 ( 1 )-(N)) so as to improve throughput in OFED-based RDMA computing environments.
  • source computing system 105 receives a header and data (e.g., contained in one or more source buffers and/or from an application executed by one or more hosts, servers, and the like).
  • Buffer selector 215 identifies destination buffers in which the data and the header are to be written (e.g., destination buffers 155 ( 1 )-(N)).
  • Data and header coalescer 220 then determines the (appropriate) mapping and placement of the data and the header.
  • the determination of the mapping and placement of the data and the header is based on at least three factors.
  • the determination of the placement of the data and the header is based on utilizing a minimum number of destination buffers (e.g., so as to reduce the number of RDMA writes).
  • the determination of the placement of the data and the header is based on the data being page boundary aligned (in the minimum number of destination buffers).
  • the determination of the placement of the data and the header is based on the placement minimizing on-wire wastage (e.g., reducing the amount of wastage (or padding) that is sent over the network).
  • RDMA module 210 writes the data and the header to the destination buffers (e.g., data is written on page boundaries and the header is written on a header boundary).
  • RDMA module 210 generates an RDMA write.
  • the RDMA write includes the header and the data coalesced into a single RDMA packet (e.g., using data and header coalescer 220 ).
  • the RDMA write is accompanied by a 32-bit data space (e.g., the 32-bit data space is not part of client data space), and is transmitted using RDMA to destination computing system 145 .
  • the 32-bit data space is used to include an offset of the header (e.g., as part of writing the data and the header to destination buffers).
  • buffer selector 215 determines the minimum number of buffers that are required for the data and the header based on a size of each buffer. Buffer selector 215 selects one or more additional buffers if the data cannot be page boundary aligned in the minimum number of buffers.
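  • A minimal sketch of such a minimum-buffer calculation is shown below; the rounding rule and names are assumptions used only to make the running example concrete.

```c
#include <stdio.h>
#include <stddef.h>

/* Smallest number of fixed-size destination buffers that can hold the data
 * plus the header (illustrative only). If the data cannot be kept
 * page-boundary-aligned within this minimum, one or more additional
 * buffers would be selected, as described above. */
static size_t min_buffers(size_t data_len, size_t hdr_len, size_t buf_size)
{
    size_t total = data_len + hdr_len;
    return (total + buf_size - 1) / buf_size;    /* ceiling division */
}

int main(void)
{
    /* 13 k of data and a 1 k header in 8 k buffers (the running example of
     * FIGS. 3A-3C): two buffers when coalesced, versus the three buffers
     * consumed when the header is written separately. */
    printf("coalesced: %zu buffers\n", min_buffers(13 * 1024, 1024, 8 * 1024));
    printf("separate:  %zu buffers\n",
           min_buffers(13 * 1024, 0, 8 * 1024) + 1 /* header-only buffer */);
    return 0;
}
```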
  • each buffer is a destination buffer, and the OFED API permits mapping of multiple source buffers to a single destination buffer.
  • FIG. 3A is a block diagram of data units and a header that are not coalesced, according to one embodiment. It should be noted that for the sake of illustration, the size of source buffers 140 ( 1 )-(N) (as well as the size of the receive/destination buffers 155 ( 1 )-(N)) in FIGS. 3A-3E and 4A-4C is 8 k and the data is 4 k aligned.
  • source buffers 140 ( 1 )-(N) and destination buffers 155 ( 1 )-(N) can be any size (e.g., 16 k, 32 k, and the like), and the page boundary alignment of the data can also differ (e.g., the data can be 2 k, 3 k, or 6 k aligned).
  • In this example, the data is 13 k (e.g., data units 305(1)-(13)) and the header is 1 k (e.g., header 235). The data (e.g., 13 k of data that is 4 k page boundary aligned) requires (and uses) two buffers (e.g., destination buffers 155(1) and 155(2), as shown in FIG. 3A). However, if the data and the header are not coalesced, header 235 will require a separate and additional buffer (e.g., destination buffer 155(3)). Therefore, although there is no on-wire wastage, the RDMA write operation would consume three destination buffers. Consequently, three RDMA writes would be required as part of the RDMA transfer of the data and the header.
  • FIG. 3B is a block diagram of a header that is written at the start of a buffer, according to one embodiment.
  • In FIG. 3B, the header (e.g., header 235) is written at the start of a buffer (e.g., destination buffer 155(1)). Applications in RDMA computing environments typically require the data to be written at a page boundary (e.g., at 0 k or 4 k in an 8 k buffer, if the data is 4 k page boundary aligned). Writing the header at the start of a buffer therefore causes wastage on-wire between the end of the header and the start of the data, because the data has to start at a page aligned boundary (e.g., data unit 305(1) has to start at the 4 k page boundary), and the number of buffers required has also not been reduced (e.g., three buffers are still needed).
  • FIG. 3C is a block diagram of a header that is written immediately after data, according to one embodiment.
  • In FIG. 3C, 13 k of data (e.g., data units 305(1)-(13)) is written to two buffers (e.g., destination buffers 155(1) and 155(2)), and header 235 is written immediately after the data (e.g., after data unit 305(13)).
  • The OFED API provides programmers a 32-bit space with every RDMA write for private use. Destination computing system 145 can use this 32-bit value to locate the appropriate buffer(s) and use information regarding the buffer(s) (e.g., the RDMA write) for further processing.
  • Although the approach of FIG. 3C results in zero wastage on-wire and uses the minimum number of destination buffers, such a solution may not be feasible because representing the header offset (particularly if the header is not written at a header boundary) requires too many bits of the 32-bit data space.
  • FIG. 3D is a block diagram of a 32-bit data space provided by the OFED API, according to one embodiment.
  • 32-bit immediate data 310 contains information including a buffer identifier 315 , flag(s) 320 , metadata 325 , a source identifier 330 , a client identifier 335 , and free bits 340 ( 1 )-(N) (e.g., available for header offset representation).
  • Information regarding the buffer identifier, flag(s), metadata, source identifier and client identifier takes up most of the space (e.g., bits) of 32-bit immediate data 310 .
  • free bits 340 ( 1 )-(N) represent a small number of bits available for header offset representation. Consequently, to accurately represent the header offset information in the available space (e.g., free bits), in some embodiments, the header is written at a header boundary (e.g., so that the header offset information can be fully and accurately represented using free bits 340 ( 1 )-(N)).
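  • A hedged illustration of how such a 32-bit immediate value might be packed appears below; the field widths are invented for the example (the disclosure only notes that most bits are consumed by identifiers and metadata, leaving a few free bits for a header offset expressed in header-alignment units).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the 32-bit immediate data. Field widths are
 * assumptions for illustration only. */
#define BUF_ID_BITS   10
#define FLAG_BITS      4
#define META_BITS      6
#define SRC_ID_BITS    5
#define CLIENT_BITS    4
#define HDR_OFF_BITS   3   /* "free bits": enough for 8 header-aligned offsets */

static uint32_t pack_imm(uint32_t buf_id, uint32_t flags, uint32_t meta,
                         uint32_t src_id, uint32_t client, uint32_t hdr_off)
{
    uint32_t shift = 0, v = 0;

    v |= buf_id  << shift; shift += BUF_ID_BITS;
    v |= flags   << shift; shift += FLAG_BITS;
    v |= meta    << shift; shift += META_BITS;
    v |= src_id  << shift; shift += SRC_ID_BITS;
    v |= client  << shift; shift += CLIENT_BITS;
    v |= hdr_off << shift;               /* header offset in alignment units */
    return v;
}

int main(void)
{
    /* Header at the 6 k offset of an 8 k buffer with 2 k header alignment:
     * offset index 3 (6 k / 2 k) fits comfortably in the free bits. */
    printf("imm = 0x%08x\n", pack_imm(42, 0x1, 0, 7, 2, 3));
    return 0;
}
```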
  • FIG. 3E is a block diagram of a header that is written at the end of an alignment, according to one embodiment.
  • In FIG. 3E, header 235 is written at the end of a particular alignment (e.g., at the end of a 2 k alignment) and at a header boundary (e.g., at 6 k in an 8 k buffer).
  • The alignment of header 235 is determined based on the number of bits that would potentially be available (e.g., for header placement) if the data were to be written (e.g., to one or more buffers such as destination buffers 155(1) and 155(2)). The more free bits that are available, the lower the wastage.
  • If the header is aligned at 2 k, the header can be placed at (or written to) four possible offsets (e.g., 0 k, 2 k, 4 k, and 6 k in an 8 k buffer). In that case, the maximum possible on-wire wastage is 2 k.
  • If the header is aligned at 1 k, the header can be placed at (or written to) eight possible offsets, reducing the maximum possible on-wire wastage to 1 k. Therefore, writing a header at the end of an alignment as shown in FIG. 3E still produces some wastage, however minimal (e.g., as compared to writing the header at the start of a buffer as shown in FIG. 3B).
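  • The trade-off between header alignment, the number of representable header offsets, and the worst-case on-wire wastage can be tabulated with a short sketch such as the following (illustrative only; an 8 k buffer is assumed):

```c
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t buf_size = 8 * 1024;
    size_t alignments[] = { 4 * 1024, 2 * 1024, 1 * 1024 };

    for (size_t i = 0; i < sizeof(alignments) / sizeof(alignments[0]); i++) {
        size_t align   = alignments[i];
        size_t offsets = buf_size / align;   /* possible header positions   */
        size_t bits    = 0;                  /* free bits needed to encode  */
        while (((size_t)1 << bits) < offsets)
            bits++;
        /* In the worst case the header is pushed to the next aligned slot,
         * leaving up to roughly one alignment's worth of padding. */
        printf("%zu k alignment: %zu offsets, %zu bit(s), max wastage ~%zu k\n",
               align / 1024, offsets, bits, align / 1024);
    }
    return 0;
}
```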
  • FIG. 4A is a block diagram of a header that is written at the end of an alignment in a second to last buffer, according to one embodiment.
  • In FIG. 4A, header 235 is written in the second to last buffer (e.g., destination buffer 155(N-1)) at the last (available) header-aligned offset of that buffer (e.g., 6 k), based on the header size (e.g., 1 k, as shown in FIG. 4A) and alignment (e.g., based on a header boundary).
  • Buffer selector 215 identifies destination buffer 155(N-1) as the second to last buffer, and selects destination buffer 155(N-1) for header placement.
  • Page boundary alignment calculator 225 then calculates the page boundary alignment of data that can be written to destination buffer 155(N-1) (e.g., data units 305(1)-305(6)), while data and header coalescer 220 ensures that enough space is available for header placement (e.g., 1 k), header representation (e.g., in 32-bit immediate data 310), and header alignment (e.g., at 6 k).
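  • For instance, the last header-aligned offset that still leaves room for the header could be computed as in the sketch below (the helper name and the specific rule are assumptions for illustration):

```c
#include <stdio.h>
#include <stddef.h>

/* Last offset within a buffer that lies on a header-alignment boundary and
 * still leaves room for the header itself (illustrative helper). */
static size_t last_header_offset(size_t buf_size, size_t hdr_len,
                                 size_t hdr_align)
{
    return ((buf_size - hdr_len) / hdr_align) * hdr_align;
}

int main(void)
{
    /* A 1 k header with 2 k header alignment in an 8 k buffer lands at the
     * 6 k offset, matching the placement shown in FIGS. 4A and 4C. */
    printf("header offset: %zu\n",
           last_header_offset(8 * 1024, 1024, 2 * 1024));
    return 0;
}
```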
  • FIG. 4B is a block diagram of a header that is written at the end of an alignment without (requiring) additional buffers, according to other embodiments.
  • As shown in FIG. 4B, header 235 can be written after the data (e.g., after 14 data units). It will be appreciated that the header and data placement in this example causes no wastage on-wire and also does not require additional (destination) buffers.
  • FIG. 4C is a block diagram of a header that is written to a second to last buffer, according to one embodiment.
  • In FIG. 4C, header 235 is written to destination buffer 155(N-1), which is the second to last buffer.
  • Header 235, as required by the need to include the header offset in 32-bit immediate data 310, is aligned at a header boundary (e.g., at 6 k).
  • The available space in the last buffer is fully filled with data (e.g., data units 305(7)-(14)), and data units 305(1)-(6) are written to destination buffer 155(N-1).
  • The data in both destination buffers 155(N-1) and 155(N) is page boundary aligned, and header 235 is also aligned at a header boundary. Therefore, in this example, the minimum number of buffers required to write the data and the header is used, and there is no on-wire wastage.
  • FIG. 5A is a flowchart that illustrates a process for filling buffers with data and a header, according to one embodiment.
  • The process begins at 505 by receiving a header (e.g., header 235) and data (e.g., data units 305(1)-(14)), for example, as an input parameter from a host, a virtual machine, or some other type of computing system communicatively coupled to source computing system 105, or from one or more applications executing on source computing system 105.
  • The process determines the number of buffers that are required for the header and the data (e.g., by using buffer selector 215 and based on the size of the data units received as part of the data).
  • The process page-boundary-aligns the data (e.g., by using page boundary alignment calculator 225) and records this mapping.
  • The process then determines whether there are data-only buffers to fill (e.g., to write data units to). If there are data-only buffers to fill, the process, at 525, fills those buffer(s) with data (e.g., destination buffers 155(1)-(N-2)). If there are no data-only buffers to fill, the process, at 530, determines the position of the header in the second to last buffer (e.g., destination buffer 155(N-1)). At 535, the process fills the data up to destination buffer 155(N-2). At 540, the process fills the data and the header in destination buffer 155(N-1) (e.g., as shown in FIGS. 4A and 4C). At 545, the process fills the last buffer (e.g., destination buffer 155(N)) with the remaining data. The process ends at 550 by determining if there is another message to process.
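  • The following C sketch loosely mirrors the FIG. 5A flow by computing, for each destination buffer, how much data it receives and whether it also holds the header; the buffer size, header alignment, and structure names are assumptions made to keep the example self-contained, and the sketch is not the claimed implementation.

```c
#include <stdio.h>
#include <stddef.h>

#define BUF_SIZE   (8 * 1024)     /* assumed destination buffer size */
#define HDR_ALIGN  (2 * 1024)     /* assumed header alignment        */

/* One placement record per destination buffer: bytes of data written to it
 * and, if the header lands here, its header-aligned offset ((size_t)-1
 * otherwise). Illustrative only. */
struct placement {
    size_t data_bytes;
    size_t header_offset;
};

static size_t plan(size_t data_len, size_t hdr_len,
                   struct placement *out, size_t max_bufs)
{
    size_t nbufs = (data_len + hdr_len + BUF_SIZE - 1) / BUF_SIZE;
    if (nbufs < 2)
        nbufs = 2;                       /* need a second to last buffer */
    if (nbufs > max_bufs)
        return 0;

    /* Fill buffers with page-boundary-aligned data first. */
    size_t remaining = data_len;
    for (size_t i = 0; i < nbufs; i++) {
        out[i].header_offset = (size_t)-1;
        out[i].data_bytes = remaining < BUF_SIZE ? remaining : BUF_SIZE;
        remaining -= out[i].data_bytes;
    }

    /* Place the header at the last header-aligned offset of the second to
     * last buffer; data displaced by the header spills into the last
     * buffer (a fuller implementation would add a buffer on overflow). */
    struct placement *p = &out[nbufs - 2];
    size_t hdr_off = ((BUF_SIZE - hdr_len) / HDR_ALIGN) * HDR_ALIGN;
    if (p->data_bytes > hdr_off) {
        out[nbufs - 1].data_bytes += p->data_bytes - hdr_off;
        p->data_bytes = hdr_off;
    }
    p->header_offset = hdr_off;
    return nbufs;
}

int main(void)
{
    struct placement bufs[8];
    size_t n = plan(14 * 1024, 1024, bufs, 8);   /* the FIG. 4C example */

    for (size_t i = 0; i < n; i++) {
        printf("buffer %zu: %zu data bytes", i, bufs[i].data_bytes);
        if (bufs[i].header_offset != (size_t)-1)
            printf(", header at offset %zu", bufs[i].header_offset);
        printf("\n");
    }
    return 0;
}
```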
  • FIG. 5B is a flowchart that illustrates a process for combining a header and data, according to one embodiment.
  • The process begins at 555 by receiving data and a header contained in source buffers.
  • The process initiates header and data placement analysis (e.g., using RDMA module 210).
  • The process calculates the size of the available buffers (e.g., 8 k, 16 k, 32 k, and the like).
  • The process determines the number (e.g., the minimum) of buffers that are required for the data and the header (e.g., based on the size of the data units received).
  • The process determines the page boundary alignment for the data within the identified buffers (e.g., using page boundary alignment calculator 225). It should be noted that page boundary alignment calculator 225 can also determine header boundaries for the header.
  • The process determines the position of the header in the second to last buffer (e.g., in destination buffer 155(N-1)). For example, the process can determine the position of header 235 in destination buffer 155(N-1) at the end of the page boundary aligned data (e.g., after data unit 305(6), as shown in FIGS. 4A and 4C) and at a header boundary (e.g., starting at 6 k for a 2 k alignment of a 1 k header, also as shown in FIGS. 4A and 4C).
  • The process fills destination buffers with the data and the header such that the minimum number of destination buffers is utilized and the data is page boundary aligned.
  • The process sends the combined header and data, along with the placement/mapping information, in a single RDMA write (e.g., a message sent over RDMA) to a destination (e.g., destination computing system 145).
  • The process ends at 595 by determining if there is another header and (more) data to process.
  • FIG. 6 is a flowchart that illustrates a process for determining placement/mapping information of a header and data, according to one embodiment.
  • The process begins at 605 by receiving data and a header.
  • The process determines the minimum number of buffers required to write the data and the header.
  • The process then determines the placement of the data and the header in the selected buffers.
  • The process determines whether the data is page boundary aligned and whether the data has a minimum number of gaps that could cause on-wire wastage.
  • If not, the process, at 625, re-determines the placement of the data and the header in the selected buffers to keep the data page boundary aligned and with minimum (or zero) gaps.
  • The process, at 630, fills the selected buffers with the data and the header (e.g., as shown in FIGS. 4A and 4C). The process ends at 635 by determining if there is another header and (more) data to process.
  • FIG. 7 is a flowchart that illustrates a process for generating and transmitting data and a header using RDMA, according to one embodiment.
  • The process begins at 705 by receiving or accessing data and a header (e.g., from source buffers).
  • The process writes the data and the header to buffers such that the minimum number of buffers is used, the data is page boundary aligned, the header is aligned, and there is minimal (or no) on-wire wastage.
  • The process includes header offset information (e.g., the location of the header if and when the header is written to a particular destination buffer) in 32-bit immediate data 310 (e.g., the 32-bit data space provided as part of API 205).
  • The process generates an RDMA write (e.g., using RDMA module 210).
  • The process transmits the RDMA write, along with 32-bit immediate data 310, to a destination (e.g., to destination computing system 145).
  • The process ends at 730 by determining if there is another header and (more) data to process.
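  • Under the assumption that libibverbs is used as the OFED transport API, the FIG. 7 flow of transmitting the RDMA write together with the 32-bit immediate data might look roughly like the sketch below; connection setup, memory registration, and the packing of the immediate value are assumed to happen elsewhere, and this is not asserted to be the claimed implementation.

```c
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>              /* htonl() */
#include <infiniband/verbs.h>

/* Illustrative sketch: post an RDMA write that carries a 32-bit immediate
 * value (e.g., packed buffer identifier and header offset) alongside the
 * coalesced data and header. */
static int post_write_with_imm(struct ibv_qp *qp,
                               void *local, uint32_t len, uint32_t lkey,
                               uint64_t remote_addr, uint32_t rkey,
                               uint32_t imm)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)local, .length = len, .lkey = lkey
    };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 2;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM; /* delivers imm_data */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.imm_data            = htonl(imm);     /* immediate data travels in
                                                network byte order */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}
```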
  • It will be appreciated that coalescing a header and data by mapping source buffers to one or more destination buffers, and writing the header and the data to particularly selected destination buffers based on the determined placement/mapping information, results in efficient utilization of destination buffers and reduces (or even eliminates) on-wire wastage in OFED-based and RDMA-enabled computing environments. It will also be appreciated that the systems, methods, and processes described herein can provide increased I/O performance and application throughput in such computing environments.
  • FIG. 8 is a block diagram of a computing system, illustrating how a placement and mapping information module 865 can be implemented in software, according to one embodiment.
  • Computing system 800 broadly represents any single or multi-processor computing device or system capable of executing computer-readable instructions. Examples of computing system 800 include, without limitation, any one or more of a variety of devices including workstations, personal computers, laptops, client-side terminals, servers, distributed computing systems, handheld devices (e.g., personal digital assistants and mobile phones), network appliances, storage controllers (e.g., array, tape drive, or hard drive controllers), and the like.
  • Computing system 800 may include at least one processor 855 (e.g., source processor 110 or destination processor 175 ) and a memory 860 (e.g., source memory 120 or destination memory 150 ). By executing the software that implements source computing system 105 or destination computing system 145 , computing system 800 becomes a special purpose computing device that is configured to improve throughput in OpenFabrics environments.
  • Processor 855 generally represents any type or form of processing unit capable of processing data or interpreting and executing instructions.
  • processor 855 may receive instructions from a software application or module. These instructions may cause processor 855 to perform the functions of one or more of the embodiments described and/or illustrated herein.
  • processor 855 may perform and/or be a means for performing all or some of the operations described herein.
  • Processor 855 may also perform and/or be a means for performing any other operations, methods, or processes described and/or illustrated herein.
  • Memory 860 generally represents any type or form of volatile or non-volatile storage devices or mediums capable of storing data and/or other computer-readable instructions. Examples include, without limitation, random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory device. Although not required, in certain embodiments computing system 800 may include both a volatile memory unit and a non-volatile storage device. In one example, program instructions implementing placement and mapping information module 865 may be loaded into memory 860 (e.g., source memory 120 ).
  • computing system 800 may also include one or more components or elements in addition to processor 855 and/or memory 860 .
  • computing system 800 may include a memory controller 820 , an Input/Output (I/O) controller 835 , and a communication interface 845 , each of which may be interconnected via a communication infrastructure 805 .
  • Communication infrastructure 805 generally represents any type or form of infrastructure capable of facilitating communication between one or more components of a computing device. Examples of communication infrastructure 805 include, without limitation, a communication bus (such as an Industry Standard Architecture (ISA), Peripheral Component Interconnect (PCI), PCI express (PCIe), or similar bus) and a network.
  • Memory controller 820 generally represents any type/form of device capable of handling memory or data or controlling communication between one or more components of computing system 800 .
  • memory controller 820 may control communication between processor 855 , memory 860 , and I/O controller 835 via communication infrastructure 805 .
  • memory controller 820 may perform and/or be a means for performing, either alone or in combination with other elements, one or more of the operations or features described and/or illustrated herein.
  • I/O controller 835 generally represents any type or form of module capable of coordinating and/or controlling the input and output functions of a virtual machine, an appliance, a gateway, and/or a computing system.
  • I/O controller 835 may control or facilitate transfer of data between one or more elements of source computing system 105 or destination computing system 145 , such as processor 855 (e.g., source processor 110 or destination processor 175 ), memory 860 (e.g., source memory 120 or destination memory 150 ), communication interface 845 , display adapter 815 , input interface 825 , and storage interface 840 .
  • Communication interface 845 broadly represents any type or form of communication device or adapter capable of facilitating communication between computing system 800 and one or more other devices. Communication interface 845 may facilitate communication between computing system 800 and a private or public network including additional computing systems. Examples of communication interface 845 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, and any other suitable interface.
  • Communication interface 845 may provide a direct connection to a remote server via a direct link to a network, such as the Internet, and may also indirectly provide such a connection through, for example, a local area network (e.g., an Ethernet network), a personal area network, a telephone or cable network, a cellular telephone connection, a satellite data connection, or any other suitable connection.
  • Communication interface 845 may also represent a host adapter configured to facilitate communication between computing system 800 and one or more additional network or storage devices via an external bus or communications channel.
  • Examples of host adapters include Small Computer System Interface (SCSI) host adapters, Universal Serial Bus (USB) host adapters, Institute of Electrical and Electronics Engineers (IEEE) 1394 host adapters, Serial Advanced Technology Attachment (SATA), Serial Attached SCSI (SAS), and external SATA (eSATA) host adapters, Advanced Technology Attachment (ATA) and Parallel ATA (PATA) host adapters, Fibre Channel interface adapters, Ethernet adapters, and the like.
  • Communication interface 845 may also allow computing system 800 to engage in distributed or remote computing (e.g., by receiving/sending instructions to/from a remote device for execution).
  • computing system 800 may also include at least one display device 810 coupled to communication infrastructure 805 via a display adapter 815 .
  • Display device 810 generally represents any type or form of device capable of visually displaying information forwarded by display adapter 815 .
  • display adapter 815 generally represents any type or form of device configured to forward graphics, text, and other data from communication infrastructure 805 (or from a frame buffer, as known in the art) for display on display device 810 .
  • Computing system 800 may also include at least one input device 830 coupled to communication infrastructure 805 via an input interface 825 .
  • Input device 830 generally represents any type or form of input device capable of providing input, either computer or human generated, to computing system 800 . Examples of input device 830 include a keyboard, a pointing device, a speech recognition device, or any other input device.
  • Computing system 800 may also include storage device 850 coupled to communication infrastructure 805 via a storage interface 840 .
  • Storage device 850 generally represents any type or form of storage devices or mediums capable of storing data and/or other computer-readable instructions.
  • storage device 850 may include a magnetic disk drive (e.g., a so-called hard drive), a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash drive, or the like.
  • Storage interface 840 generally represents any type or form of interface or device for transferring and/or transmitting data between storage device 850 , and other components of computing system 800 .
  • Storage device 850 may be configured to read from and/or write to a removable storage unit configured to store computer software, data, or other computer-readable information. Examples of suitable removable storage units include a floppy disk, a magnetic tape, an optical disk, a flash memory device, or the like. Storage device 850 may also include other similar structures or devices for allowing computer software, data, or other computer-readable instructions to be loaded into computing system 800 . For example, storage device 850 may be configured to read and write software, data, or other computer-readable information. Storage device 850 may also be a part of computing system 800 or may be separate devices accessed through other interface systems.
  • Other devices or subsystems (not shown) may also be connected to computing system 800.
  • The components and devices of computing system 800 may also be interconnected in different ways from that shown in FIG. 8.
  • Computing system 800 may also employ any number of software, firmware, and/or hardware configurations.
  • one or more of the embodiments disclosed herein may be encoded as a computer program (also referred to as computer software, software applications, computer-readable instructions, or computer control logic) on a computer-readable storage medium.
  • Examples of computer-readable storage media include magnetic-storage media (e.g., hard disk drives and floppy disks), optical-storage media (e.g., CD- or DVD-ROMs), electronic-storage media (e.g., solid-state drives and flash media), and the like.
  • Such computer programs can also be transferred to computing system 800 for storage in memory via a network such as the Internet or upon a carrier medium.
  • the computer-readable medium containing the computer program may be loaded into computing system 800 . All or a portion of the computer program stored on the computer-readable medium may then be stored in memory 860 and/or various portions of storage device 850 .
  • a computer program loaded into computing system 800 may cause processor 855 to perform and/or be a means for performing the functions of one or more of the embodiments described and/or illustrated herein. Additionally or alternatively, one or more of the embodiments described and/or illustrated herein may be implemented in firmware and/or hardware.
  • Computing system 800 may be configured as an application specific integrated circuit (ASIC) adapted to implement one or more of the embodiments disclosed herein.
  • FIG. 9 is a block diagram of a networked system, illustrating how various devices can communicate via a network, according to one embodiment.
  • Network-attached storage (NAS) devices may be configured to communicate with source computing system 105 and/or destination computing system 145 using various protocols, such as Network File System (NFS), Server Message Block (SMB), or Common Internet File System (CIFS).
  • Network 180 generally represents any type or form of computer network or architecture capable of facilitating communication between source computing system 105 and/or destination computing system 145 .
  • Network 180 can be a Storage Area Network (SAN).
  • All or a portion of one or more of the disclosed embodiments may be encoded as a computer program and loaded onto and executed by source computing system 105 and/or destination computing system 145, or any combination thereof. All or a portion of one or more of the embodiments disclosed herein may also be encoded as a computer program, stored on source computing system 105 and/or destination computing system 145, and distributed over network 180. In some examples, all or a portion of source computing system 105 and/or destination computing system 145 may represent portions of a cloud-computing or network-based environment. Cloud-computing environments may provide various services and applications via the Internet. These cloud-based services (e.g., software as a service, platform as a service, infrastructure as a service, etc.) may be accessible through a web browser or other remote interface.
  • Various functions described herein may be provided through a remote desktop environment or any other cloud-based computing environment.
  • placement and mapping information module 865 may transform the behavior of source computing system 105 and/or destination computing system 145 in order to cause source computing system 105 and/or destination computing system 145 to improve throughput in OpenFabrics and RDMA computing environments.

Abstract

Disclosed herein are systems, methods, and processes to improve throughput in OpenFabrics and Remote Direct Memory Access (RDMA) computing environments. Data and a header are received. Buffers in which the data and the header are to be written are identified. Placement information for the data and the header is determined based on a size of each buffer, a page-boundary-alignment of the data, and a header alignment of the header. The data and the header are written to the buffer(s) using the placement information. In such computing environments, throughput can be improved by writing data on page boundaries and the header on a header boundary in a second to last buffer.

Description

FIELD OF THE DISCLOSURE
This disclosure relates to throughput. In particular, this disclosure relates to improving throughput in OpenFabrics computing environments.
DESCRIPTION OF THE RELATED ART
OpenFabrics Enterprise Distribution (OFED™) is open-source computing technology for Remote Direct Memory Access (RDMA) and kernel bypass applications. OFED can be used in computing environments that require highly efficient networks, storage connectivity, and parallel computing. OFED provides kernel-level drivers, RDMA send/receive operations, services for parallel message passing (MPI), kernel bypasses of an operating system (OS), and kernel and user-level application programming interfaces (APIs). Therefore, OFED can be used for applications that require high efficiency computing, wire-speed messaging, microsecond latencies, and fast input/output (I/O) for storage and file systems.
RDMA involves direct memory access (DMA) from the memory of one computing system into that of another computing system without involving either computing system's OS. In addition to other features, RDMA provides remote direct memory access, asynchronous work queues, and kernel bypass. RDMA permits increased throughput and low latency.
OFED APIs can accept input parameters (e.g., in the form of a header and one or more data packets). Because the header is typically small in size (e.g., compared to the data), it is efficient to coalesce the data and the header to minimize the number of RDMA writes. However, coalescing the data and the header, while keeping the data page-boundary-aligned, can result in on-wire wastage (e.g., the amount of data that is sent or transferred over the network).
SUMMARY OF THE DISCLOSURE
Disclosed herein are methods, systems, and processes to improve throughput in OpenFabrics computing environments. One such method involves receiving data and a header, and identifying buffers in which the data and the header are to be written. Placement information for the data and the header is determined based, at least in part, on a size of each buffer, a page-boundary-alignment of the data, and a header alignment of the header. The data and the header are then written to the buffer(s) using the placement information. In this example, data is written on page boundaries and the header is written on a header boundary.
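By way of a non-limiting illustration, the placement information determined in such a method might be represented by a small per-buffer record like the hypothetical C structure below; the field names are invented for the example and are not taken from the claims.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-destination-buffer placement record: where the
 * page-boundary-aligned data goes and, optionally, where the
 * header-aligned header goes. For illustration only. */
struct buffer_placement {
    uint32_t buffer_id;      /* destination buffer identifier                   */
    size_t   data_offset;    /* start of data (a page boundary, e.g., 0 or 4 k) */
    size_t   data_length;    /* bytes of data written to this buffer            */
    int      has_header;     /* nonzero if the header lands in this buffer      */
    size_t   header_offset;  /* header-aligned offset of the header (e.g., 6 k) */
};
```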
In certain embodiments, using the placement information results in the utilization of a minimum number of buffers and the data being page-boundary-aligned when written to the minimum number of buffers, as well as minimal (or zero) on-wire wastage. The placement information includes instructions to write the data and the header to a second to last buffer.
In some embodiments, the header and the data are coalesced (combined) into a Remote Direct Memory Access (RDMA) write by mapping the header and the data contained in multiple source buffers to one (or more) destination buffers based on the placement information. The RDMA write, which includes a 32-bit data space containing metadata, is sent or transmitted to a destination along with the placement information.
In other embodiments, if the data cannot be page boundary aligned in the minimum number of buffers, one or more additional buffers can be selected. In this example, the buffers include multiple destination buffers, and the minimum number of buffers include one or more destination buffers.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, features, and advantages of the present disclosure, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure may be better understood, and its numerous objects and features made apparent to those skilled in the art by referencing the accompanying drawings.
FIG. 1 is a block diagram of an OFED computing system, according to one embodiment of the present disclosure.
FIG. 2 is a block diagram of a source computing system that implements an OFED API, according to one embodiment of the present disclosure.
FIG. 3A is a block diagram of data units and a header that are not coalesced, according to one embodiment of the present disclosure.
FIG. 3B is a block diagram of a header that is written at the start of a buffer, according to one embodiment of the present disclosure.
FIG. 3C is a block diagram of a header that is written immediately after data, according to one embodiment of the present disclosure.
FIG. 3D is a block diagram of a 32-bit data space provided by the OFED API, according to one embodiment of the present disclosure.
FIG. 3E is a block diagram of a header that is written at the end of an alignment, according to one embodiment of the present disclosure.
FIG. 4A is a block diagram of a header that is written at the end of an alignment in a second to last buffer, according to one embodiment of the present disclosure.
FIG. 4B is a block diagram of a header that is written at the end of an alignment without additional buffers, according to one embodiment of the present disclosure.
FIG. 4C is a block diagram of a header that is written to a second to last buffer, according to one embodiment of the present disclosure.
FIG. 5A is a flowchart that illustrates a process for filling buffers with data and a header, according to one embodiment of the present disclosure.
FIG. 5B is a flowchart that illustrates a process for combining a header and data, according to one embodiment of the present disclosure.
FIG. 6 is a flowchart that illustrates a process for determining placement/mapping information of a header and data, according to one embodiment of the present disclosure.
FIG. 7 is a flowchart that illustrates a process for generating and transmitting data and a header using RDMA, according to one embodiment of the present disclosure.
FIG. 8 is a block diagram of a computing system, illustrating how certain module(s) can be implemented in software, according to one embodiment of the present disclosure.
FIG. 9 is a block diagram of a networked system, illustrating how various devices can communicate via a network, according to one embodiment of the present disclosure.
While the disclosure is susceptible to various modifications, specific embodiments of the disclosure are provided as examples in the drawings and detailed description. The drawings and detailed description are not intended to limit the disclosure to the particular form disclosed. Instead, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.
DETAILED DESCRIPTION Introduction
In certain computing environments, data transfer between two (or more) computing systems can involve sending data and an accompanying header (as well as other metadata) directly from the memory of one computing system (e.g., from multiple source buffers) to the memory of another computing system (e.g., to one or more destination buffers). Because a header is typically small in size (e.g., 1 k), performing a write operation solely to transfer the header from the memory of one computing system to the memory of another computing system is inefficient, not to mention resource intensive.
Therefore, it can be advantageous to coalesce (or combine) data and the header to reduce the number of write operations performed and improve application throughput. As part of the coalescing, data and a header contained in multiple source buffers can be mapped to a single destination buffer (or multiple destination buffers). This placement (or mapping) information can then be transmitted from source buffers to one or more destination buffers along with the data and the header as part of a single write operation.
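By way of illustration only, the placement (or mapping) information described above can be pictured as a list of entries, each describing where a region of a source buffer lands in a destination buffer. The following C sketch is not taken from the described embodiments; the structure and field names are assumptions chosen for the example.

    #include <stdint.h>

    /* Hypothetical shape of one placement/mapping entry: which region of
     * which source buffer is written where in which destination buffer.
     * Field names are illustrative assumptions, not the claimed format. */
    struct placement_entry {
            uint32_t src_buf;     /* index of the source buffer              */
            uint64_t src_offset;  /* offset of the region within that buffer */
            uint32_t dst_buf;     /* index of the destination buffer         */
            uint64_t dst_offset;  /* page- or header-aligned target offset   */
            uint64_t length;      /* number of bytes covered by this entry   */
    };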
An example of a system in which the aforementioned approach can be used advantageously is a system that implements OpenFabrics Enterprise Distribution (OFED™) (or OpenFabrics Software (OFS)). OFED is open-source software for Remote Direct Memory Access (RDMA) and kernel bypass applications. OFED can be implemented in computing environments that require highly efficient networks, storage connectivity, and parallel computing. Among other features, OFED provides kernel-level drivers, channel oriented RDMA and send/receive operations, kernel bypasses of an operating system (OS), and kernel and user-level application programming interfaces (APIs) (e.g., OFED API for RDMA transfer). OFED also provides services for parallel message passing (MPI), sockets data exchange (e.g., Sockets Direct Protocol (SDP)), Network Attached Storage (NAS) and Storage Area Network (SAN) storage, and file systems.
RDMA involves direct memory access (DMA) from the memory of one computing system (e.g., a source computing system) into that of another computing system (e.g., a destination computing system) without involving either the source or destination computing system's OS. For example, a network adapter of the source computing system can send a message to the network adapter of the destination computing system that permits the network adapter of the destination computing system to directly access data to (or from) the destination computing system's memory.
RDMA messaging includes at least two types of messages. The first type is an RDMA write. An RDMA write includes an address and data to put (or write) at that address. The RDMA write permits the network adapter that receives the RDMA write to write (or put) the supplied data at the specified address. The second type is an RDMA read. An RDMA read includes an address and a length. The RDMA read permits the network adapter that receives the RDMA read to generate a reply that sends back the data at the address requested. In RDMA, both these types of messages are “one-sided”—the messages are processed by the network adapter that receives them without involving the central processing unit (CPU) on the computing system that receives the messages.
In certain computing systems, network adapters can be accessed via an asynchronous interface (also called a “verbs” interface). To use a network adapter (e.g., to perform RDMA operations), objects called queue pairs (or QPs) can be created. QPs involve a pair of work queues—a send queue and a receive queue, as well as completion queues (CQs). An RDMA operation can be posted to a work queue (e.g., as a request). The RDMA operation is then executed synchronously and when completed, the network adapter adds work completion information to the end of the CQ. Completion information can then be retrieved from the CQ to determine which requests have been completed. Operating asynchronously in this manner makes it easier to overlap computation and communication in RDMA computing environments.
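By way of illustration, the post-then-poll pattern described above might look as follows using the libibverbs verbs interface distributed with OFED. This is a minimal sketch with assumptions: the completion queue cq is presumed to have been created elsewhere (e.g., with ibv_create_cq), work requests are presumed to have already been posted to the associated queue pair, and error handling is reduced to the essentials.

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Busy-poll the completion queue until one work request completes.
     * Returns 0 on success, -1 on a poll error or a failed completion. */
    static int wait_one_completion(struct ibv_cq *cq)
    {
            struct ibv_wc wc;
            int n;

            do {
                    n = ibv_poll_cq(cq, 1, &wc);  /* returns completions polled */
            } while (n == 0);                     /* nothing has finished yet   */

            if (n < 0 || wc.status != IBV_WC_SUCCESS) {
                    fprintf(stderr, "work request %llu did not complete cleanly\n",
                            (unsigned long long)wc.wr_id);
                    return -1;
            }
            return 0;
    }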
It should be noted that RDMA-enabled network adapters also support “two-sided” send/receive operations, in addition to one-sided RDMA operations. In addition, it will be appreciated that network adapters in RDMA computing environments can also permit userspace processes to perform fast-path operations (e.g., posting work requests and retrieving work completions) directly with the hardware without involving the kernel (thus saving time associated with system call overhead). Therefore, RDMA permits high-throughput and low-latency networking (e.g., which is especially useful in parallel computing clusters).
As previously noted, OFED provides APIs for RDMA data transfer. If OFED APIs are used for RDMA data transfer, a client's message will include a header and one or more data packets. Because the size of the header is typically small compared to the size of the data, it is not efficient to write the data and the header independently (e.g., as two separate RDMA writes). The data and the header can be coalesced (or combined) (e.g., as part of a single RDMA write) to reduce the total number of RDMA writes.
The OFED API also provides a 32-bit data space for private use (also called 32-bit immediate data or private data) as part of each RDMA write. This 32-bit data space can be used to store buffer location information as well as other metadata associated with the RDMA write, in addition to information associated with a header itself (e.g., a header offset indicating the location of the header on a header boundary).
Unfortunately, coalescing (combining) data and a header (e.g., by mapping source buffers to destination buffers) for RDMA data transfer poses several challenges. First, applications (e.g., one or more applications executing on destination computing system 145) require data to be page boundary aligned. For example, a computing system reads or writes data to a memory address in word sized chunks (e.g., 4 byte chunks on a 32-bit system) or larger.
Page boundary alignment (or data structure alignment) involves writing data at a memory address equal to some multiple of the word size so as to increase the computing system's performance (e.g., due to the way certain memory management systems manage memory). To page boundary align the data, it may be necessary to insert (or treat) some bytes as meaningless (“don't care”) (referred to as wastage) between the end of the last data structure and the start of the next data structure (e.g., padding). Therefore, the requirement that data remain page boundary aligned can result in wastage that is unnecessarily and redundantly sent on-wire (e.g., transmitted over the network as part of the RDMA data transfer).
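The padding calculation itself is simple mask arithmetic when the alignment is a power of two. The following generic C sketch (not code from the described embodiments) rounds an offset up to the next page boundary and reports the resulting padding, i.e., the bytes that would otherwise be sent on-wire as wastage.

    #include <stdint.h>

    /* Round offset up to the next multiple of align; align is assumed to
     * be a power of two (e.g., 4096 for 4 k page boundary alignment). */
    static uint64_t align_up(uint64_t offset, uint64_t align)
    {
            return (offset + align - 1) & ~(align - 1);
    }

    /* Padding (potential on-wire wastage) needed so that the next data
     * unit starts on a page boundary. */
    static uint64_t padding_for(uint64_t offset, uint64_t align)
    {
            return align_up(offset, align) - offset;
    }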
Second, coalescing data and a header can result in additional redundant RDMA writes (e.g., one or more additional destination buffer(s) may be required to write the header at the start of a header boundary, to maintain page boundary alignment, and the like). Therefore, minimizing the number of destination buffers that are used (or RDMA writes that are performed) to write the data and the header is also an important consideration.
Third, and as noted, systems such as those implementing OFED employ (or can be modified to employ) the OFED API which provides for a 32-bit informational storage area as part of an RDMA write, for example, to maintain buffer location information, RDMA write metadata, header location information, and the like. Because only a few bits of the 32-bits of data space are available to maintain header location information (e.g., a header offset), the placement of the header is also another important consideration (and limitation) when coalescing data and the header in such systems.
Disclosed herein are methods, systems, and processes to coalesce data and a header and improve throughput in systems such as OFED RDMA computing systems while efficiently utilizing the provided 32-bit data space, minimizing the number of RDMA writes (e.g., reducing the number of destination buffers used to write data and the header), maintaining page boundary alignment of data, and minimizing on-wire wastage.
Example Implementation in a Computing System
FIG. 1 is a block diagram of a computing system that implements and uses RDMA technology, according to one embodiment. The computing system of FIG. 1 includes a source computing system 105 and a destination computing system 145 communicatively coupled via a network 180. Network 180 can include any type of network or interconnection (e.g., the Internet, a Wide Area Network (WAN), a SAN, and the like).
Source computing system 105 includes a source processor 110 and a source network adapter 115. Source processor 110 and source network adapter 115 are communicatively coupled to a source memory 120. Source memory 120 includes a source driver 125, a source OS 130, and an application 135, which implements source buffers 140(1)-(N). Similarly, destination computing system 145 includes a destination processor 175 and a destination network adapter 170 communicatively coupled to a destination memory 150. Destination memory 150 includes destination buffers 155(1)-(N), a destination driver 160, and a destination OS 165.
The computing system of FIG. 1 permits remote direct memory access from source memory 120 of source computing system 105 into that of destination computing system 145 without involving source OS 130 or destination OS 165 (and vice-versa). For example, source network adapter 115 of source computing system 105 can send a message (e.g., a client message) to destination network adapter 170 of destination computing system 145 that permits destination network adapter 170 of destination computing system 145 to directly access data to (or from) destination memory 150 (and vice-versa).
In some embodiments, source network adapter 115 and destination network adapter 170 permit userspace processes (on source computing system 105 and destination computing system 145, respectively) to perform RDMA operations (or other buffer-based operations amenable to methods such as those described herein) by bypassing their respective kernels and avoiding a system call (e.g., by bypassing source OS 130 and destination OS 165 as shown in FIG. 1). RDMA operations can be managed and facilitated using an OFED API.
FIG. 2 is a block diagram of a source computing system that implements the OFED API, according to one embodiment. As shown in FIG. 2, source computing system 105 includes an Application Programming Interface (API) 205, such as the OpenFabrics Enterprise Distribution (OFED™) API. Source computing system 105 also includes an RDMA module 210, a buffer selector 215, a data and header coalescer 220, and a page boundary alignment calculator 225. It should be noted that RDMA module 210 (or other comparable support module), buffer selector 215, data and header coalescer 220, and page boundary alignment calculator 225 can be implemented as hardware or software, and as part of source computing system 105 or separately (e.g., as part of an RDMA server, a software appliance, a virtual machine, or some other type of computing device).
Source computing system 105 also includes source memory 120. Source memory 120 includes source buffers 140(1)-(N). Each source buffer contains data, or a combination of data and a header. For example, as shown in FIG. 2, and as mapped by RDMA module 210 to destination buffers, source buffer 140(1) contains data 230(1) (e.g., with one or more data units), source buffer 140(N−2) contains data 230(2), source buffer 140(N−1) (e.g., the second to last buffer) includes data 230(N−1) and a header 240, and source buffer 140(N) contains data 230(N).
RDMA module 210 manages, facilitates, coordinates, and performs one or more RDMA operations. For example, RDMA module 210 can perform an RDMA write operation or an RDMA read operation. Buffer selector 215 selects one or more buffers (e.g., destination buffers 155(1)-(N)) to fill with (or write) data, or data and a header. Data and header coalescer 220 maps source buffers to destination buffers by coalescing (or combining) data and a header (e.g., in a single buffer, such as destination buffer 155(N−1), as part of a single RDMA write/packet). Finally, page boundary alignment calculator 225 determines the placement of data in destination buffers 155(1)-(N) such that the data is page boundary aligned.
In conjunction, RDMA module 210, buffer selector 215, data and header coalescer 220, and page boundary alignment calculator 225 determine the placement of data and the header among one or more (available) buffers (e.g., destination buffers 155(1)-(N)) so as to improve throughput in OFED-based RDMA computing environments.
Examples of Writing Data and a Header
In one embodiment, source computing system 105 receives a header and data (e.g., contained in one or more source buffers and/or from an application executed by one or more hosts, servers, and the like). Buffer selector 215 identifies destination buffers in which the data and the header are to be written (e.g., destination buffers 155(1)-(N)). Data and header coalescer 220 then determines the (appropriate) mapping and placement of the data and the header.
In some embodiments, the determination of the mapping and placement of the data and the header is based on at least three factors. First, the determination of the placement of the data and the header is based on utilizing a minimum number of destination buffers (e.g., so as to reduce the number of RDMA writes). Second, the determination of the placement of the data and the header is based on the data being page boundary aligned (in the minimum number of destination buffers). Third, the determination of the placement of the data and the header is based on the placement minimizing on-wire wastage (e.g., reducing the amount of wastage (or padding) that is sent over the network). Based on at least these three factors, RDMA module 210 writes the data and the header to the destination buffers (e.g., data is written on page boundaries and the header is written on a header boundary).
In other embodiments, RDMA module 210 generates an RDMA write. The RDMA write includes the header and the data coalesced into a single RDMA packet (e.g., using data and header coalescer 220). In certain embodiments, the RDMA write is accompanied by a 32-bit data space (e.g., the 32-bit data space is not part of client data space), and is transmitted using RDMA to destination computing system 145. In this example, the 32-bit data space is used to include an offset of the header (e.g., as part of writing the data and the header to destination buffers).
In other embodiments, buffer selector 215 determines the minimum number of buffers that are required for the data and the header based on a size of each buffer. Buffer selector 215 selects one or more additional buffers if the data cannot be page boundary aligned in the minimum number of buffers. In certain embodiments, each buffer is a destination buffer, and the OFED API permits mapping of multiple source buffers to a single destination buffer.
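For illustration, the buffer-count determination can be sketched as the arithmetic below. The sketch assumes fixed-size destination buffers, power-of-two sizes and alignments, and a single header placed at a header-aligned offset; it is illustrative only and is not the claimed implementation.

    #include <stdint.h>

    /* Rough estimate of the minimum number of destination buffers needed
     * for data_len bytes of page-aligned data plus a header of hdr_len
     * bytes placed at a hdr_align boundary. buf_len and hdr_align are
     * assumed to be powers of two. Illustrative arithmetic only. */
    static unsigned int min_buffers(uint64_t data_len, uint64_t hdr_len,
                                    uint64_t buf_len, uint64_t hdr_align)
    {
            unsigned int n = (unsigned int)((data_len + buf_len - 1) / buf_len);
            uint64_t hdr_slot = (hdr_len + hdr_align - 1) & ~(hdr_align - 1);
            uint64_t free_space = (uint64_t)n * buf_len - data_len;

            if (free_space < hdr_slot)
                    n += 1;  /* header does not fit: one more buffer (RDMA write) */
            if (n < 2)
                    n = 2;   /* the scheme relies on a second to last buffer */
            return n;
    }

For example, 14 k of data in 8 k buffers with a 1 k header on a 2 k boundary yields two buffers under this sketch, consistent with FIG. 4C.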
Examples of Coalescing Data and a Header
FIG. 3A is a block diagram of data units and a header that are not coalesced, according to one embodiment. It should be noted that for the sake of illustration, the size of source buffers 140(1)-(N) (as well as the size of the receive/destination buffers 155(1)-(N)) in FIGS. 3A-3E and 4A-4C is 8 k and the data is 4 k aligned. However, in alternate implementations and embodiments, source buffers 140(1)-(N) and destination buffers 155(1)-(N) can be any size (e.g., 16 k, 32 k, and the like), and the page boundary alignment of the data can also differ (e.g., the data can be 2 k, 3 k, or 6 k aligned).
As shown in FIG. 3A, in some embodiments, data (e.g., data units 305(1)-(13)) and a header (e.g., header 235) are not coalesced (e.g., not combined into a single destination buffer or a single RDMA write). In this example, the data is 13 k (e.g., data units 305(1)-(13)) and the header is 1 k (e.g., header 235). The data (e.g., 13 k of data that is 4 k page boundary aligned) requires (and uses) two buffers (e.g., destination buffers 155(1) and 155(2) as shown in FIG. 3A). If the data and the header are not coalesced, header 235 will require a separate and additional buffer (e.g., destination buffer 155(3)). Therefore, although there is no on-wire wastage, the RDMA write operation would consume three destination buffers. Consequently, three RDMA writes would be required as part of the RDMA transfer of the data and the header.
FIG. 3B is a block diagram of a header that is written at the start of a buffer, according to one embodiment. In some embodiments, the header (e.g., header 235) is written at the start of a buffer (e.g., destination buffer 155(1)). But as previously noted, applications in RDMA computing environments typically require the data to be written at a page boundary (e.g., at 0 k or 4 k in an 8 k buffer, if the data is 4 k page boundary aligned).
Consequently, as shown in FIG. 3B, writing the header at the start of a buffer can cause wastage on-wire between the end of the header and the start of the data, because the data has to start at a page aligned boundary (e.g., data unit 305(1) has to start at the 4 k page boundary), and the number of buffers required has also not been reduced (e.g., three buffers).
FIG. 3C is a block diagram of a header that is written immediately after data, according to one embodiment. In this example, writing 13 k of data (e.g., data units 305(1)-(13)) requires two buffers (e.g., destination buffers 155(1) and 155(2)). In some embodiments, header 235 is written immediately after the data (e.g., after data unit 305(13)). As previously noted, the OFED API provides programmers a 32-bit space with every RDMA write for private use. Destination computing system 145 can use this 32-bit value to locate appropriate buffer(s) and use information regarding the buffer(s) (e.g., the RDMA write) for further processing.
However, using OFED-based RDMA computing environments as an example, a part of this 32-bit data space is also required to indicate the exact header offset in a particular buffer. In the example of FIG. 3C, if header 235 is written immediately after the data (e.g., an 8 byte alignment for an 8 k buffer), representing the header offset will require more space (e.g., 10 bits) than is available from the 32-bit data space of immediate data. Unfortunately, in many scenarios such as the foregoing, finding 10 (free) bits in an already-crowded 32-bit data space may not be possible. Therefore, although the header and data placement in the example of FIG. 3C results in zero wastage on-wire and uses the minimum number of destination buffers, such a solution may not be feasible because representing the header offset (e.g., particularly if the header is not written at a header boundary) requires too many bits from the 32-bit data space.
FIG. 3D is a block diagram of a 32-bit data space provided by the OFED API, according to one embodiment. As shown in FIG. 3D, 32-bit immediate data 310 contains information including a buffer identifier 315, flag(s) 320, metadata 325, a source identifier 330, a client identifier 335, and free bits 340(1)-(N) (e.g., available for header offset representation). Information regarding the buffer identifier, flag(s), metadata, source identifier and client identifier, in addition to other information, takes up most of the space (e.g., bits) of 32-bit immediate data 310. Therefore, free bits 340(1)-(N) represent a small number of bits available for header offset representation. Consequently, to accurately represent the header offset information in the available space (e.g., free bits), in some embodiments, the header is written at a header boundary (e.g., so that the header offset information can be fully and accurately represented using free bits 340(1)-(N)).
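A sketch of the kind of bit-packing this implies is shown below. The field widths are hypothetical placeholders (the actual layout of 32-bit immediate data 310 is implementation-specific); they are chosen only to show that, after the other fields are packed, few bits remain for the header slot index.

    #include <stdint.h>

    /* Hypothetical field widths for the 32-bit immediate data, chosen for
     * illustration only. With these widths, 3 bits remain free for the
     * header slot (header boundary) index. Callers must keep each value
     * within its field width. */
    #define BUF_ID_BITS     10
    #define FLAG_BITS        4
    #define META_BITS        6
    #define SRC_ID_BITS      5
    #define CLIENT_ID_BITS   4
    #define HDR_SLOT_BITS   (32 - BUF_ID_BITS - FLAG_BITS - META_BITS - \
                             SRC_ID_BITS - CLIENT_ID_BITS)

    static uint32_t pack_imm(uint32_t buf_id, uint32_t flags, uint32_t meta,
                             uint32_t src_id, uint32_t client_id,
                             uint32_t hdr_slot)
    {
            uint32_t imm = buf_id;

            imm = (imm << FLAG_BITS)      | flags;
            imm = (imm << META_BITS)      | meta;
            imm = (imm << SRC_ID_BITS)    | src_id;
            imm = (imm << CLIENT_ID_BITS) | client_id;
            imm = (imm << HDR_SLOT_BITS)  | hdr_slot;  /* free bits 340(1)-(N) */

            return imm;
    }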
FIG. 3E is a block diagram of a header that is written at the end of an alignment, according to one embodiment. As shown in FIG. 3E, and in certain embodiments, header 235 is written at the end of a particular alignment (e.g., at the end of a 2 k alignment) and at a header boundary (e.g., at 6 k in an 8 k buffer). In this example, the alignment of header 235 is determined based on the number of bits that would potentially be available (e.g., for header placement) if the data were to be written (e.g., to one or more buffers like destination buffers 155(1) and 155(2)). The more free bits that are available, the less the wastage.
For example, if only two bits are available for alignment in FIG. 3E, the header (e.g., aligned at 2 k) can be placed at (or written to) four possible offsets (e.g., 0 k, 2 k, 4 k, and 6 k in an 8 k buffer). In this example, the maximum possible on-wire wastage is 2 k. However, if three bits are available for alignment in FIG. 3E, the header (e.g., aligned at 1 k) can be placed at (or written to) eight possible offsets, thus reducing the maximum possible on-wire wastage to 1 k. Therefore, writing a header at the end of an alignment as shown in FIG. 3E also produces some wastage, however minimal (e.g., as compared to writing the header at the beginning of a buffer as shown in FIG. 3B).
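The trade-off in that example is plain arithmetic: k free bits can distinguish 2^k header-aligned offsets in a buffer, so the header alignment (and therefore the worst-case gap between the end of the data and the header) is the buffer size divided by 2^k. The short C sketch below simply reproduces the 2 k and 1 k figures above; it is illustrative only.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            const uint64_t buf_size = 8 * 1024;  /* 8 k buffer, as in FIG. 3E */
            unsigned int bits;

            for (bits = 2; bits <= 3; bits++) {
                    uint64_t offsets   = 1ULL << bits;        /* representable slots */
                    uint64_t alignment = buf_size / offsets;  /* header alignment    */

                    /* Worst case, the header is pushed up to one full alignment
                     * step past the end of the data. */
                    printf("%u free bits: %llu offsets, %llu-byte alignment, "
                           "up to %llu bytes of wastage\n",
                           bits,
                           (unsigned long long)offsets,
                           (unsigned long long)alignment,
                           (unsigned long long)alignment);
            }
            return 0;
    }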
Examples of Header and Data Placement
FIG. 4A is a block diagram of a header that is written at the end of an alignment in a second to last buffer, according to one embodiment. As shown in FIG. 4A, and in some embodiments, header 235 is written in a second to last buffer (e.g., destination buffer 155(N−1)) at the last (available) header-aligned offset of that buffer (e.g., 6 k), based on the header size (e.g., 1 k as shown in FIG. 4A) and alignment (e.g., based on a header boundary). In certain embodiments, buffer selector 215 identifies destination buffer 155(N−1) as the second to last buffer, and selects destination buffer 155(N−1) for header placement. Page boundary alignment calculator 225 then calculates the page boundary alignment of data that can be written to destination buffer 155(N−1) (e.g., data units 305(1)-305(6)), while data and header coalescer 220 ensures that enough space is available for header placement (e.g., 1 k), header representation (e.g., in 32-bit immediate data 310), and header alignment (e.g., at 6 k).
Therefore, as shown in FIG. 4A, and in some embodiments, writing a header at the end of an alignment in a second to last buffer results in zero wastage on-wire and also does not require additional buffers. FIG. 4B is a block diagram of a header that is written at the end of an alignment without (requiring) additional buffers, according to other embodiments. For example, if the available space (e.g., bits) in a buffer (e.g., after data is written) permits a 2 k header alignment (e.g., so as to be able to capture the header offset in the smallest number of bits possible), writing header 235 at the end of an alignment (e.g., in destination buffer 155(N)) permits more data to be written than would otherwise be possible (e.g., data unit 305(14)). If the data does not fill the buffer completely (e.g., up to and including data unit 305(14)), a 1 k gap (e.g., on-wire wastage) is introduced. However, in this example, header 235 can be written after the data (e.g., after 14 data units). It will be appreciated that header and data placement in this example causes no wastage on-wire and also does not require additional (destination) buffers.
Like FIG. 4A, FIG. 4C is a block diagram of a header that is written to a second to last buffer, according to one embodiment. As shown in FIG. 4C, header 235 is written to destination buffer 155(N−1), which is the second to last buffer. Header 235, as required by the need to include the header offset in 32-bit immediate data 310, is header aligned at a header boundary (e.g., at 6 k). The available space in the last buffer is fully filled with data (e.g., data units 305(7)-(14)). Data units 305(1)-(6) are written to destination buffer 155(N−1). The data in both destination buffers 155(N−1) and 155(N) is page boundary aligned and header 235 is also aligned at a header boundary. Therefore, in this example, the minimum number of buffers required to write the data and the header are used, and there is no on-wire wastage.
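Under the sizes assumed in these figures (8 k buffers, 4 k page boundary alignment for the data, a 1 k header on a 2 k header boundary), the header position used in FIGS. 4A and 4C is simply the last header-aligned offset of the second to last buffer at which the header still fits. A C sketch of that calculation, illustrative only and assuming power-of-two sizes, follows.

    #include <stdint.h>

    /* Last header-aligned offset in a buffer at which a header of hdr_len
     * bytes still fits. For an 8 k buffer, a 1 k header, and a 2 k header
     * alignment, this returns 6 k, as shown in FIGS. 4A and 4C. */
    static uint64_t header_offset(uint64_t buf_len, uint64_t hdr_len,
                                  uint64_t hdr_align)
    {
            return ((buf_len - hdr_len) / hdr_align) * hdr_align;
    }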
It will be appreciated that when computing systems such as those described herein (e.g., as shown in FIG. 1) have more than one receive buffer (e.g., destination buffers 155(1)-(N)) and when the total size of the receive/destination buffers is more than the sum of the aligned header and the page boundary aligned data, writing a header at the end of an alignment in a second to last buffer results in the minimum utilization of buffers and reduces (or eliminates) on-wire wastage. It will also be appreciated that coalescing a header and data in this manner results in increased I/O performance and application throughput.
Example Processes for Coalescing a Header and Data
FIG. 5A is a flowchart that illustrates a process for filling buffers with data and a header, according to one embodiment. The process begins at 505 by receiving a header (e.g., header 235) and data (e.g., data units 305(1)-(14)), for example, as an input parameter from a host, a virtual machine, or some other type of computing system communicatively coupled to source computing system 105, or from one or more applications executing on source computing system 105. At 510, the process determines the number of buffers that are required for the header and the data (e.g., by using buffer selector 215 and based on the size of data units received as part of the data). At 515, the process page (boundary) aligns the data (e.g., by using page boundary alignment calculator 225) and records this mapping.
At 520, the process determines whether there are data-only buffers to fill (e.g., write data (units) to). If there are data-only buffers to fill, the process, at 525, fills buffer(s) with data (e.g., destination buffers 155(1)-(N−2)). If there are no data-only buffers to fill, the process, at 530, determines the position of the header on the second to last buffer (e.g., destination buffer 155(N−1)). At 535, the process fills the data up to destination buffer 155(N−2). At 540, the process fills the data and the header in destination buffer 155(N−1) (e.g., as shown in FIGS. 4A and 4C). At 545, the process fills the last buffer (e.g., destination buffer 155(N)) with the remaining data. The process ends at 550 by determining if there is another message to process.
FIG. 5B is a flowchart that illustrates a process for combining a header and data, according to one embodiment. The process begins at 555 by receiving data and a header contained in source buffers. At 560, the process initiates header and data placement analysis (e.g., using RDMA module 210). At 565, the process calculates the size of available buffers (e.g., 8 k, 16 k, 32 k, and the like). At 570, the process determines the number (e.g., minimum) of buffers that are required for data and the header (e.g., based on the size of the data units received). At 575, the process determines the page boundary alignment for the data within the identified buffers (e.g., using page boundary alignment calculator 225). It should be noted that page boundary alignment calculator 225 can also determine header boundaries for the header.
At 580, the process determines the position of the header in the second to last buffer (e.g., in destination buffer 155(N−1)). For example, the process can determine the position of header 235 in destination buffer 155(N−1) at the end of page boundary aligned data (e.g., after data unit 305(6) as shown in FIGS. 4A and 4C) and at a header boundary (e.g., starting at 6 k for a 2 k alignment of a 1 k header, also as shown in FIGS. 4A and 4C). At 585, the process fills destination buffers with the data and the header such that the minimum number of destination buffers are utilized and the data is page boundary aligned. At 590, the process sends the combined header and data along with placement/mapping information in a single RDMA write (e.g., a message sent over RDMA) to a destination (e.g., destination computing system 145). The process ends at 595 by determining if there is another header and (more) data to process.
FIG. 6 is a flowchart that illustrates a process for determining placement/mapping information of a header and data, according to one embodiment. The process begins at 605 by receiving data and a header. At 610, the process determines the minimum number of buffers required to write the data and the header. At 615, the process determines the placement of the data and the header in selected buffers. At 620, the process determines whether the data is page boundary aligned and whether the data has a minimum number of gaps that can cause on-wire wastage. If the data is not page boundary aligned or the data does not have a minimum number of gaps (or zero gaps) that can cause on-wire wastage, the process, at 625, re-determines placement of the data and the header in the selected buffers to keep the data page boundary aligned and with minimum (or zero) gaps. However, if the data is page boundary aligned and has a minimum number of gaps (or even zero gaps) that can cause on-wire wastage, the process, at 630, fills the selected buffers with the data and the header (e.g., as shown in FIGS. 4A and 4C). The process ends at 635 by determining if there is another header and (more) data to process.
FIG. 7 is a flowchart that illustrates a process for generating and transmitting data and a header using RDMA, according to one embodiment. The process begins at 705 by receiving or accessing data and a header (e.g., from source buffers). At 710, the process writes the data and the header to buffers such that the minimum number of buffers are used, the data is page boundary aligned, the header is aligned, and there is minimum (or no) on-wire wastage.
At 715, the process includes header offset information (e.g., the location of the header if and when the header is written to a particular destination buffer) in 32-bit immediate data 310 (e.g., the 32-bit data space provided as part of API 205). At 720, the process generates an RDMA write (e.g., using RDMA module 210). At 725, the process transmits the RDMA write to a destination along with 32-bit immediate data 310 (e.g., to destination computing system 145). The process ends at 730 by determining if there is another header and (more) data to process.
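For illustration, posting such an RDMA write with immediate data through the libibverbs verbs interface might look like the sketch below. The queue pair, memory registration (lkey), and remote address/rkey are assumed to have been established during connection setup, and imm is assumed to be the packed 32-bit value (e.g., carrying the header offset); this is a sketch of verbs API usage under those assumptions, not the claimed implementation.

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>

    /* Post one RDMA write that carries the coalesced data/header buffer
     * and a 32-bit immediate value delivered with the receiver's
     * completion. Returns 0 on success (as ibv_post_send does). */
    static int post_rdma_write_imm(struct ibv_qp *qp,
                                   void *local_buf, uint32_t len, uint32_t lkey,
                                   uint64_t remote_addr, uint32_t rkey,
                                   uint32_t imm)
    {
            struct ibv_sge sge;
            struct ibv_send_wr wr;
            struct ibv_send_wr *bad_wr = NULL;

            memset(&sge, 0, sizeof(sge));
            sge.addr   = (uintptr_t)local_buf;  /* registered local region */
            sge.length = len;
            sge.lkey   = lkey;

            memset(&wr, 0, sizeof(wr));
            wr.wr_id      = 1;
            wr.sg_list    = &sge;
            wr.num_sge    = 1;
            wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
            wr.send_flags = IBV_SEND_SIGNALED;
            wr.imm_data   = htonl(imm);         /* 32-bit immediate data   */
            wr.wr.rdma.remote_addr = remote_addr;
            wr.wr.rdma.rkey        = rkey;

            return ibv_post_send(qp, &wr, &bad_wr);
    }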
It will be appreciated that coalescing a header and data by mapping source buffers to one or more destination buffer(s), and writing the header and the data to particularly selected destination buffer(s) based on determined placement/mapping information results in efficient utilization of destination buffers and reduces (or even eliminates) on-wire wastage in OFED-based and RDMA-enabled computing environments. It will also be appreciated that the systems, methods, and processes described herein can also provide increased I/O performance and application throughput in such computing environments.
Example Computing Environment
FIG. 8 is a block diagram of a computing system, illustrating how a placement and mapping information module 865 can be implemented in software, according to one embodiment. Computing system 800 broadly represents any single or multi-processor computing device or system capable of executing computer-readable instructions. Examples of computing system 800 include, without limitation, any one or more of a variety of devices including workstations, personal computers, laptops, client-side terminals, servers, distributed computing systems, handheld devices (e.g., personal digital assistants and mobile phones), network appliances, storage controllers (e.g., array, tape drive, or hard drive controllers), and the like. Computing system 800 may include at least one processor 855 (e.g., source processor 110 or destination processor 175) and a memory 860 (e.g., source memory 120 or destination memory 150). By executing the software that implements source computing system 105 or destination computing system 145, computing system 800 becomes a special purpose computing device that is configured to improve throughput in OpenFabrics environments.
Processor 855 generally represents any type or form of processing unit capable of processing data or interpreting and executing instructions. In certain embodiments, processor 855 may receive instructions from a software application or module. These instructions may cause processor 855 to perform the functions of one or more of the embodiments described and/or illustrated herein. For example, processor 855 may perform and/or be a means for performing all or some of the operations described herein. Processor 855 may also perform and/or be a means for performing any other operations, methods, or processes described and/or illustrated herein.
Memory 860 generally represents any type or form of volatile or non-volatile storage devices or mediums capable of storing data and/or other computer-readable instructions. Examples include, without limitation, random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory device. Although not required, in certain embodiments computing system 800 may include both a volatile memory unit and a non-volatile storage device. In one example, program instructions implementing placement and mapping information module 865 may be loaded into memory 860 (e.g., source memory 120).
In certain embodiments, computing system 800 may also include one or more components or elements in addition to processor 855 and/or memory 860. For example, as illustrated in FIG. 8, computing system 800 may include a memory controller 820, an Input/Output (I/O) controller 835, and a communication interface 845, each of which may be interconnected via a communication infrastructure 805. Communication infrastructure 805 generally represents any type or form of infrastructure capable of facilitating communication between one or more components of a computing device. Examples of communication infrastructure 805 include, without limitation, a communication bus (such as an Industry Standard Architecture (ISA), Peripheral Component Interconnect (PCI), PCI express (PCIe), or similar bus) and a network.
Memory controller 820 generally represents any type/form of device capable of handling memory or data or controlling communication between one or more components of computing system 800. In certain embodiments memory controller 820 may control communication between processor 855, memory 860, and I/O controller 835 via communication infrastructure 805. In certain embodiments, memory controller 820 may perform and/or be a means for performing, either alone or in combination with other elements, one or more of the operations or features described and/or illustrated herein.
I/O controller 835 generally represents any type or form of module capable of coordinating and/or controlling the input and output functions of a virtual machine, an appliance, a gateway, and/or a computing system. For example, in certain embodiments I/O controller 835 may control or facilitate transfer of data between one or more elements of source computing system 105 or destination computing system 145, such as processor 855 (e.g., source processor 110 or destination processor 175), memory 860 (e.g., source memory 120 or destination memory 150), communication interface 845, display adapter 815, input interface 825, and storage interface 840.
Communication interface 845 broadly represents any type or form of communication device or adapter capable of facilitating communication between computing system 800 and one or more other devices. Communication interface 845 may facilitate communication between computing system 800 and a private or public network including additional computing systems. Examples of communication interface 845 include, without limitation, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, and any other suitable interface. Communication interface 845 may provide a direct connection to a remote server via a direct link to a network, such as the Internet, and may also indirectly provide such a connection through, for example, a local area network (e.g., an Ethernet network), a personal area network, a telephone or cable network, a cellular telephone connection, a satellite data connection, or any other suitable connection.
Communication interface 845 may also represent a host adapter configured to facilitate communication between computing system 800 and one or more additional network or storage devices via an external bus or communications channel. Examples of host adapters include Small Computer System Interface (SCSI) host adapters, Universal Serial Bus (USB) host adapters, Institute of Electrical and Electronics Engineers (IEEE) 1394 host adapters, Serial Advanced Technology Attachment (SATA), Serial Attached SCSI (SAS), and external SATA (eSATA) host adapters, Advanced Technology Attachment (ATA) and Parallel ATA (PATA) host adapters, Fibre Channel interface adapters, Ethernet adapters, or the like. Communication interface 845 may also allow computing system 800 to engage in distributed or remote computing (e.g., by receiving/sending instructions to/from a remote device for execution).
As illustrated in FIG. 8, computing system 800 may also include at least one display device 810 coupled to communication infrastructure 805 via a display adapter 815. Display device 810 generally represents any type or form of device capable of visually displaying information forwarded by display adapter 815. Similarly, display adapter 815 generally represents any type or form of device configured to forward graphics, text, and other data from communication infrastructure 805 (or from a frame buffer, as known in the art) for display on display device 810. Computing system 800 may also include at least one input device 830 coupled to communication infrastructure 805 via an input interface 825. Input device 830 generally represents any type or form of input device capable of providing input, either computer or human generated, to computing system 800. Examples of input device 830 include a keyboard, a pointing device, a speech recognition device, or any other input device.
Computing system 800 may also include storage device 850 coupled to communication infrastructure 805 via a storage interface 840. Storage device 850 generally represents any type or form of storage devices or mediums capable of storing data and/or other computer-readable instructions. For example, storage device 850 may include a magnetic disk drive (e.g., a so-called hard drive), a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash drive, or the like. Storage interface 840 generally represents any type or form of interface or device for transferring and/or transmitting data between storage device 850 and other components of computing system 800.
Storage device 850 may be configured to read from and/or write to a removable storage unit configured to store computer software, data, or other computer-readable information. Examples of suitable removable storage units include a floppy disk, a magnetic tape, an optical disk, a flash memory device, or the like. Storage device 850 may also include other similar structures or devices for allowing computer software, data, or other computer-readable instructions to be loaded into computing system 800. For example, storage device 850 may be configured to read and write software, data, or other computer-readable information. Storage device 850 may also be a part of computing system 800 or may be a separate device accessed through other interface systems.
Many other devices or subsystems may be connected to computing system 800. Conversely, all of the components and devices illustrated in FIG. 8 need not be present to practice the embodiments described and/or illustrated herein. The devices and subsystems referenced above may also be interconnected in different ways from that shown in FIG. 8.
Computing system 800 may also employ any number of software, firmware, and/or hardware configurations. For example, one or more of the embodiments disclosed herein may be encoded as a computer program (also referred to as computer software, software applications, computer-readable instructions, or computer control logic) on a computer-readable storage medium. Examples of computer-readable storage media include magnetic-storage media (e.g., hard disk drives and floppy disks), optical-storage media (e.g., CD- or DVD-ROMs), electronic-storage media (e.g., solid-state drives and flash media), and the like. Such computer programs can also be transferred to computing system 800 for storage in memory via a network such as the Internet or upon a carrier medium.
The computer-readable medium containing the computer program may be loaded into computing system 800. All or a portion of the computer program stored on the computer-readable medium may then be stored in memory 860 and/or various portions of storage device 850. When executed by processor 855, a computer program loaded into computing system 800 may cause processor 855 to perform and/or be a means for performing the functions of one or more of the embodiments described and/or illustrated herein. Additionally or alternatively, one or more of the embodiments described and/or illustrated herein may be implemented in firmware and/or hardware. For example, computing system 800 may be configured as an application specific integrated circuit (ASIC) adapted to implement one or more of the embodiments disclosed herein.
Example Networking Environment
FIG. 9 is a block diagram of a networked system, illustrating how various devices can communicate via a network, according to one embodiment. In certain embodiments, network-attached storage (NAS) devices may be configured to communicate with source computing system 105 and/or destination computing system 145 using various protocols, such as Network File System (NFS), Server Message Block (SMB), or Common Internet File System (CIFS). Network 180 generally represents any type or form of computer network or architecture capable of facilitating communication between source computing system 105 and/or destination computing system 145.
In certain embodiments, a communication interface, such as communication interface 845 in FIG. 8, may be used to provide connectivity between source computing system 105 and/or destination computing system 145, and network 180. It should be noted that the embodiments described and/or illustrated herein are not limited to the Internet or any particular network-based environment. For example, network 180 can be a Storage Area Network (SAN).
In one embodiment, all or a portion of one or more of the disclosed embodiments may be encoded as a computer program and loaded onto and executed by source computing system 105 and/or destination computing system 145, or any combination thereof. All or a portion of one or more of the embodiments disclosed herein may also be encoded as a computer program, stored on source computing system 105 and/or destination computing system 145, and distributed over network 180. In some examples, all or a portion of source computing system 105 and/or destination computing system 145 may represent portions of a cloud-computing or network-based environment. Cloud-computing environments may provide various services and applications via the Internet. These cloud-based services (e.g., software as a service, platform as a service, infrastructure as a service, etc.) may be accessible through a web browser or other remote interface. Various functions described herein may be provided through a remote desktop environment or any other cloud-based computing environment.
In addition, one or more of the components described herein may transform data, physical devices, and/or representations of physical devices from one form to another. For example, placement and mapping information module 865 may transform the behavior of source computing system 105 and/or destination computing system 145 in order to cause source computing system 105 and/or destination computing system 145 to improve throughput in OpenFabrics and RDMA computing environments.
Although the present disclosure has been described in connection with several embodiments, the disclosure is not intended to be limited to the specific forms set forth herein. On the contrary, it is intended to cover such alternatives, modifications, and equivalents as can be reasonably included within the scope of the disclosure as defined by the appended claims.

Claims (17)

What is claimed is:
1. A computer-implemented method comprising:
receiving data and a header;
determining a minimum number of buffers that are configured to collectively store the data and the header, wherein
the minimum number is an integer greater than or equal to two;
determining placement information for the data and the header within the buffers,
wherein
the placement information is determined based, at least in part, on
a size of each of the buffers,
a page-boundary-alignment of the data,
a header alignment of the header, and
the placement information identifies a second-to-last buffer; and
writing the data and the header to the buffers, wherein
the writing of the data uses the placement information,
the data is written on page boundaries, and
the header is written on a header boundary of the second-to-last buffer.
2. The computer-implemented method of claim 1, wherein
using the placement information results in
utilizing the minimum number of buffers,
the data being page-boundary-aligned when written to the minimum number of buffers, and
minimizing on-wire wastage.
3. The computer-implemented method of claim 1, further comprising:
coalescing the header and the data into a Remote Direct Memory Access (RDMA) write by mapping the header and the data comprised in a plurality of source buffers to the buffers based on the placement information.
4. The computer-implemented method of claim 3, wherein
the RDMA write comprises a 32-bit data space.
5. The computer-implemented method of claim 4, further comprising:
including an offset of the header in the 32-bit data space.
6. The computer-implemented method of claim 1, further comprising:
selecting one or more additional buffers if the data cannot be page-boundary aligned in the minimum number of buffers.
7. The computer-implemented method of claim 1, wherein
the minimum number of buffers comprise one or more destination buffers.
8. A non-transitory computer-readable storage medium storing program instructions executable to:
receive data and a header;
determine a minimum number of buffers that are configured to collectively store the data and the header, wherein
the minimum number is an integer greater than or equal to two;
determine placement information for the data and the header within the buffers, wherein the placement information is determined based, at least in part, on
a size of each of the buffers,
a page-boundary-alignment of the data,
a header alignment of the header, and
the placement information identifies a second-to-last buffer; and
write the data and the header to the buffers, wherein
writing the data uses the placement information,
the data is written on page boundaries, and
the header is written on a header boundary of the second-to-last buffer.
9. The non-transitory computer-readable storage medium of claim 8, wherein
using the placement information results in
utilizing the minimum number of buffers,
the data being page-boundary-aligned when written to the minimum number of buffers, and
minimizing on-wire wastage.
10. The non-transitory computer-readable storage medium of claim 8, further comprising:
coalescing the header and the data into a Remote Direct Memory Access (RDMA) write by mapping the header and the data comprised in a plurality of source buffers to the buffers based on the placement information, wherein
the RDMA write comprises a 32-bit data space; and
including an offset of the header in the 32-bit data space.
11. The non-transitory computer-readable storage medium of claim 8, further comprising:
selecting one or more additional buffers if the data cannot be page-boundary aligned in the minimum number of buffers.
12. The non-transitory computer-readable storage medium of claim 8, wherein
the minimum number of buffers comprise one or more destination buffers.
13. A system comprising:
one or more processors; and
a memory coupled to the one or more processors, wherein the memory stores program instructions executable by the one or more processors to:
receive data and a header;
determine a minimum number of buffers that are configured to collectively store the data and the header, wherein
the minimum number is an integer greater than or equal to two;
determine placement information for the data and the header within the buffers, wherein
the placement information is determined based, at least in part, on
a size of each of the buffers,
a page-boundary-alignment of the data, and
a header alignment of the header, and
the placement information identifies a second-to-last buffer; and
write the data and the header to the buffers, wherein
writing the data uses the placement information,
the data is written on page boundaries, and
the header is written on a header boundary of the second-to-last buffer.
14. The system of claim 13, wherein
using the placement information results in
utilizing the minimum number of buffers,
the data being page-boundary-aligned when written to the minimum number of buffers, and
minimizing on-wire wastage.
15. The system of claim 13, further comprising:
coalescing the header and the data into a Remote Direct Memory Access (RDMA) write by mapping the header and the data comprised in a plurality of source buffers to the buffers based on the placement information, wherein
the RDMA write comprises a 32-bit data space; and
including an offset of the header in the 32-bit data space.
16. The system of claim 13, further comprising:
selecting one or more additional buffers if the data cannot be page-boundary aligned in the minimum number of buffers.
17. The system of claim 13, wherein
the minimum number of buffers comprise one or more destination buffers.
US15/168,449 2016-05-31 2016-05-31 Throughput in openfabrics environments Active 2036-07-24 US10375168B2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US15/168,449 US10375168B2 (en) 2016-05-31 2016-05-31 Throughput in openfabrics environments
JP2018561265A JP6788691B2 (en) 2016-05-31 2017-05-23 Improved throughput in OpenFabrics
CN201780029882.6A CN109478171B (en) 2016-05-31 2017-05-23 Improving throughput in openfabics environment
PCT/US2017/033951 WO2017210015A1 (en) 2016-05-31 2017-05-23 Improving throughput in openfabrics environments
EP17733209.5A EP3465450B1 (en) 2016-05-31 2017-05-23 Improving throughput in openfabrics environments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US15/168,449 US10375168B2 (en) 2016-05-31 2016-05-31 Throughput in openfabrics environments

Publications (2)

Publication Number Publication Date
US20170346899A1 US20170346899A1 (en) 2017-11-30
US10375168B2 true US10375168B2 (en) 2019-08-06

Family

ID=59215982

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/168,449 Active 2036-07-24 US10375168B2 (en) 2016-05-31 2016-05-31 Throughput in openfabrics environments

Country Status (5)

Country Link
US (1) US10375168B2 (en)
EP (1) EP3465450B1 (en)
JP (1) JP6788691B2 (en)
CN (1) CN109478171B (en)
WO (1) WO2017210015A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9996463B2 (en) * 2015-11-10 2018-06-12 International Business Machines Corporation Selection and placement of volumes in a storage system using stripes
CN110888827B (en) * 2018-09-10 2021-04-09 华为技术有限公司 Data transmission method, device, equipment and storage medium
US11379404B2 (en) * 2018-12-18 2022-07-05 Sap Se Remote memory management
US11863469B2 (en) * 2020-05-06 2024-01-02 International Business Machines Corporation Utilizing coherently attached interfaces in a network stack framework

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5491802A (en) * 1992-05-29 1996-02-13 Hewlett-Packard Company Network adapter for inserting pad bytes into packet link headers based on destination service access point fields for efficient memory transfer
WO1999034273A2 (en) 1997-12-30 1999-07-08 Lsi Logic Corporation Automated dual scatter/gather list dma
US6694392B1 (en) 2000-06-30 2004-02-17 Intel Corporation Transaction partitioning
US20050015549A1 (en) 2003-07-17 2005-01-20 International Business Machines Corporation Method and apparatus for transferring data from a memory subsystem to a network adapter by extending data lengths to improve the memory subsystem and PCI bus efficiency
US20060095611A1 (en) 2004-11-02 2006-05-04 Standard Microsystems Corporation Hardware supported peripheral component memory alignment method
US20110185032A1 (en) 2010-01-25 2011-07-28 Fujitsu Limited Communication apparatus, information processing apparatus, and method for controlling communication apparatus
US8874844B1 (en) * 2008-12-02 2014-10-28 Nvidia Corporation Padding buffer requests to avoid reads of invalid data
US20160026604A1 (en) * 2014-07-28 2016-01-28 Emulex Corporation Dynamic rdma queue on-loading

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0574140A1 (en) * 1992-05-29 1993-12-15 Hewlett-Packard Company Network adapter which places a network header and data in separate memory buffers
US7361881B2 (en) * 2002-03-13 2008-04-22 Oy Ajat Ltd Ganged detector pixel, photon/pulse counting radiation imaging device
JP2004240711A (en) * 2003-02-06 2004-08-26 Fujitsu Ltd Buffer memory device and buffer memory control method
US7590777B2 (en) * 2004-12-10 2009-09-15 International Business Machines Corporation Transferring data between system and storage in a shared buffer
JP5206788B2 (en) * 2008-05-29 2013-06-12 Fujitsu Ltd. Data relay device, data relay program, data reception device, and communication system
US9146678B2 (en) * 2013-04-29 2015-09-29 International Business Machines Corporation High throughput hardware acceleration using pre-staging buffers

Also Published As

Publication number Publication date
CN109478171A (en) 2019-03-15
JP2019517692A (en) 2019-06-24
EP3465450A1 (en) 2019-04-10
US20170346899A1 (en) 2017-11-30
WO2017210015A1 (en) 2017-12-07
EP3465450B1 (en) 2023-07-26
JP6788691B2 (en) 2020-11-25
CN109478171B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
US9934065B1 (en) Servicing I/O requests in an I/O adapter device
EP2889780B1 (en) Data processing system and data processing method
EP3465450B1 (en) Improving throughput in openfabrics environments
US9864538B1 (en) Data size reduction
EP3057272A1 (en) Technologies for concurrency of cuckoo hashing flow lookup
US10116746B2 (en) Data storage method and network interface card
US9058338B2 (en) Storing a small file with a reduced storage and memory footprint
TW201220197A (en) for improving the safety and reliability of data storage in a virtual machine based on cloud calculation and distributed storage environment
EP2840576A1 (en) Hard disk and data processing method
US9584628B2 (en) Zero-copy data transmission system
US10241871B1 (en) Fragmentation mitigation in synthetic full backups
EP3598313A1 (en) Data access method and device
US11068399B2 (en) Technologies for enforcing coherence ordering in consumer polling interactions by receiving snoop request by controller and update value of cache line
WO2017201984A1 (en) Data processing method, associated apparatus, and data storage system
CN109241015B (en) Method for writing data in a distributed storage system
CN111881476B (en) Object storage control method, device, computer equipment and storage medium
US9886405B1 (en) Low latency write requests over a network using a pipelined I/O adapter device
US10062137B2 (en) Communication between integrated graphics processing units
JP5893028B2 (en) System and method for efficient sequential logging on a storage device that supports caching
CN107250995B (en) Memory management device
US10523741B2 (en) System and method for avoiding proxy connection latency
US20200242040A1 (en) Apparatus and Method of Optimizing Memory Transactions to Persistent Memory Using an Architectural Data Mover
US20160267050A1 (en) Storage subsystem technologies
US10275466B2 (en) De-duplication aware secure delete
US11188394B2 (en) Technologies for synchronizing triggered operations

Legal Events

Date Code Title Description
AS Assignment

Owner name: VERITAS TECHNOLOGIES LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JOSHI, ADHIRAJ;TOLEY, ABHIJIT;REEL/FRAME:038748/0753

Effective date: 20160530

AS Assignment

Owner name: BANK OF AMERICA, N.A., AS COLLATERAL AGENT, ILLINOIS

Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:VERITAS TECHNOLOGIES LLC;REEL/FRAME:040679/0466

Effective date: 20161019

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT, DELAWARE

Free format text: PATENT SECURITY AGREEMENT SUPPLEMENT;ASSIGNOR:VERITAS TECHNOLOGIES, LLC;REEL/FRAME:052426/0001

Effective date: 20200408

AS Assignment

Owner name: WILMINGTON TRUST, NATIONAL ASSOCIATION, AS NOTES COLLATERAL AGENT, DELAWARE

Free format text: SECURITY INTEREST;ASSIGNOR:VERITAS TECHNOLOGIES LLC;REEL/FRAME:054370/0134

Effective date: 20200820

CC Certificate of correction
AS Assignment

Owner name: VERITAS TECHNOLOGIES LLC, CALIFORNIA

Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS AT R/F 052426/0001;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION, AS COLLATERAL AGENT;REEL/FRAME:054535/0565

Effective date: 20201127

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4