CN108536543B - Receive queue with stride-based data dispersal - Google Patents


Info

Publication number
CN108536543B
Authority
CN
China
Prior art keywords
data
buffer
packet
strides
memory
Prior art date
Legal status
Active
Application number
CN201810217976.2A
Other languages
Chinese (zh)
Other versions
CN108536543A (en)
CN108536543B8 (en)
Inventor
Idan Burstein
Current Assignee
Mellanox Technologies Ltd
Original Assignee
Mellanox Technologies Ltd
Priority date
Filing date
Publication date
Application filed by Mellanox Technologies Ltd filed Critical Mellanox Technologies Ltd
Publication of CN108536543A
Publication of CN108536543B
Application granted
Publication of CN108536543B8

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4027 Coupling between buses using bus bridges

Abstract

A method for communication includes publishing a sequence of work items in a queue, each work item pointing to a buffer in memory that comprises multiple strides of a common, fixed size. A NIC receives data packets from a network containing data to be pushed to the memory. The NIC reads a first work item from the queue, pointing to a first buffer, and writes the data from a first packet to a first number of strides in the first buffer, without consuming all of the strides in the first buffer. The NIC then writes at least a portion of the data from a second packet to the remaining strides in the first buffer. When all of the strides in the first buffer have been consumed, the NIC reads a second work item from the queue, pointing to a second buffer, and writes further data to strides in the second buffer.

Description

Receive queue with stride-based data dispersal
Technical Field
The present invention relates generally to data communications, and in particular to devices for interfacing between computing devices and packet data networks.
Background
InfiniBand™ (IB) is a switched-fabric communication architecture that is widely used in high-performance computing. It has been standardized by the InfiniBand Trade Association. Computing devices (host processors and peripherals) connect to the IB fabric via a Network Interface Controller (NIC), which is referred to in IB parlance as a channel adapter. Host processors (or hosts) use a Host Channel Adapter (HCA), while peripheral devices use a Target Channel Adapter (TCA).
Client processes (hereinafter clients), such as software application processes, running on a host processor communicate with the transport layer of the IB fabric by manipulating a transport service instance, known as a "queue pair" (QP), which is made up of a send work queue and a receive work queue. To send and receive messages over the network using an HCA, a client initiates a Work Request (WR), which causes a work item, called a Work Queue Element (WQE), to be placed in the appropriate work queue. Typically, each WR has a data buffer associated with it, to be used in holding the data to be sent or received in executing the WQE. The HCA executes the WQEs and thus communicates with the corresponding QP of the channel adapter at the other end of the link.
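By way of illustration, the following minimal sketch shows how a client might publish one receive WQE using the standard libibverbs API; the queue pair, buffer, and registered memory region are assumed to have been set up already, and this is editorial illustration rather than code from the patent:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Minimal sketch: publishing one receive WQE via libibverbs.
 * Assumes qp, buf, and mr (a registered memory region) already exist. */
int post_one_recv(struct ibv_qp *qp, void *buf, uint32_t len, struct ibv_mr *mr)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,  /* buffer for the pushed data   */
        .length = len,
        .lkey   = mr->lkey,                  /* key of the registered region */
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uint64_t)(uintptr_t)buf, /* echoed back in the completion */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr = NULL;

    /* Each valid incoming send request consumes one such WQE. */
    return ibv_post_recv(qp, &wr, &bad_wr);
}
```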
IB channel adapters implement various service types and transport operations, including Remote Direct Memory Access (RDMA) read and write operations, as well as send operations. Both RDMA write requests and send requests carry data sent by a channel adapter (known as the requester) and cause the other channel adapter (the responder) to write the data to a memory address at its own end of the link. Whereas RDMA write requests specify the address in the remote responder's memory to which the data are to be written, send requests rely on the responder to determine the memory location at the request destination. This sort of send operation is sometimes referred to as a "push" operation, since the initiator of the data transfer pushes the data to the remote QP.
When a send request addressed to a QP is received, the channel adapter at the destination node places the data sent by the requestor into the next available receive buffer for that QP. To specify a receive buffer to be used for such incoming send requests, a client on the host computing device generates a receive WQE and places it in the receive queue of the appropriate QP. Each time a valid send request is received, the destination channel adapter fetches the next WQE from the receive queue of the destination QP and places the received data in the memory location specified in that WQE. Thus, each valid incoming send request results in a receive queue operation by the responder.
U.S. Patent 7,263,103, whose disclosure is incorporated herein by reference, describes a method for network communication in which a pool of descriptors (WQEs) is shared among multiple transport service instances used in communicating over a network. Each descriptor in the pool includes a scatter list, indicating buffers that are available in a local memory. When a message containing data to be pushed to the local memory is received over the network on one of the transport service instances, one of the descriptors is read from the pool, and the data contained in the message are written to the buffers indicated by the scatter list included in this descriptor.
U.S. Patent 9,143,467, whose disclosure is incorporated herein by reference, describes a NIC with a circular receive buffer. A first index and a second index are provided to point, respectively, to a first buffer in the set to which the NIC is to write and to a second buffer in the set from which a client process running on the host device is to read. Upon receiving a message, the data are written to the first buffer pointed to by the first index, and the first index is advanced cyclically through the set. When the data in the second buffer have been read by the client process, the second index is advanced cyclically through the set. In some embodiments, the buffers all have a uniform size, such as a single byte.
Disclosure of Invention
Embodiments of the present invention described hereinafter provide efficient methods for handling data "push" operations and apparatus for implementing such methods.
There is thus provided, in accordance with an embodiment of the present invention, a method for communication, which includes publishing a sequence of work items in a queue, each work item pointing to at least one buffer in memory that includes a plurality of strides of a common, fixed size. Data packets are received from a network in a Network Interface Controller (NIC) of a host computer, including at least a first packet and a second packet, which respectively contain at least first and second data to be pushed to the memory. A first work item, pointing to a first buffer in the memory, is read from the queue. The first data are written from the NIC to a first number of strides in the first buffer, sufficient to contain the first data, without consuming all of the strides in the first buffer. At least a first portion of the second data is written from the NIC to a remaining number of strides in the first buffer. When all of the strides in the first buffer have been consumed, a second work item, pointing to a second buffer in the memory, is read from the queue, and additional data from the data packets are written from the NIC to an additional number of strides in the second buffer.
In a disclosed embodiment, writing the additional data includes writing a second portion of the second data to the second buffer.
Typically, the NIC consumes an integer number of strides in writing data from each of the packets to the memory.
In one embodiment, the packets contain respective sequence numbers, and wherein the NIC selects a stride to which to write data from each of the packets in response to the respective sequence numbers.
Additionally or alternatively, publishing the sequence of work items includes sizing the stride to correspond to a size of packets received on a given transport service instance. Typically, the packets include a header and a payload, and in one embodiment, each stride includes a first portion mapped to a first region of the memory to receive the header of a given packet and a second portion mapped to a different second region of the memory to receive the payload of the given packet.
In some embodiments, the first packet and the second packet belong to different first and second messages transmitted by one or more peer devices to the host computer over the network. The first message and the second message are transmitted on different first and second transport service instances, the first and second transport service instances sharing the queue of the work item. Additionally or alternatively, each of the first and second messages ends with a respective last packet, the method comprising writing a respective completion report from the NIC to the memory only after writing data from the respective last packet within each of the first and second messages to the memory. Typically, the completion report contains a pointer to the stride to which the data within each of the messages was written.
In other embodiments, the method includes reserving a contiguous allocation of strides in the first buffer for the first message, wherein writing the first data includes writing to a first number of strides within the contiguous allocation, and wherein writing at least a first portion of the second data includes writing to a remaining number of strides after the contiguous allocation. In one such embodiment, the method includes writing a completion report from the NIC to the memory when the contiguous allocation has been consumed, or when a predetermined timeout, initiated upon reserving the contiguous allocation, expires before the contiguous allocation has been completely consumed.
In further embodiments, the method includes reserving a contiguous allocation of strides in the first buffer for the first message, wherein writing the first data comprises writing to a first number of strides within the contiguous allocation, and writing a fill completion report from the NIC to the memory even when the contiguous allocation is not completely consumed in order to release the first buffer and cause a subsequent message to be written to the second buffer.
In some embodiments, the method includes automatically learning, in the NIC, a characteristic size of the messages transmitted to the host computer, responsively to traffic received from the network, and reserving a contiguous allocation of strides in the first buffer for the first message, while deciding how many strides to include in the contiguous allocation responsively to the characteristic size of the messages. In one such embodiment, automatically learning the characteristic size includes estimating a number of consecutive packets to be included in a Large Receive Offload (LRO) operation. Additionally or alternatively, when the LRO operation is applied to packets received by the NIC in a first flow, which shares a receive queue with at least a second flow different from the first flow, the method may include: issuing a completion report with respect to the strides that have been consumed by the packets in the first flow when a packet from the second flow is received before the packets in the first flow have consumed all of the strides in the contiguous allocation; and writing the packet in the second flow to the remaining one or more strides in the contiguous allocation.
In some embodiments, the method includes writing a respective completion report from the NIC to the memory after writing data from each of the data packets to the memory. Typically, the completion report contains metadata about the transmission status information related to the data packet. In one embodiment, the metadata indicates a status of a Large Receive Offload (LRO) operation performed by the NIC.
In other embodiments, the method includes writing a respective completion report from the NIC to the memory after writing the data to the last stride belonging to each work item. In this case, the completion report may contain details of the data packets written to the at least one buffer pointed to by the work item.
In yet another embodiment, the method includes writing a corresponding completion report from the NIC to the memory after writing the data to a given stride when the number of strides remaining in the first buffer is less than a predetermined minimum.
Additionally or alternatively, publishing the sequence of work items includes monitoring consumption of the work items, responsively to the completion reports written to the memory, and publishing one or more additional work items to the queue when a remaining length of the queue falls below a specified limit.
There is also provided, in accordance with an embodiment of the present invention, a computing apparatus including a memory and a host processor, which is configured to publish a sequence of work items in a queue, each work item pointing to at least one buffer in the memory, each buffer including a plurality of strides of a common, fixed size. A Network Interface Controller (NIC) is configured to: receive data packets from a network, including at least a first packet and a second packet, which respectively contain at least first and second data to be pushed to the memory; read from the queue a first work item, pointing to a first buffer in the memory; write the first data to a first number of strides in the first buffer, sufficient to contain the first data, without consuming all of the strides in the first buffer; write at least a first portion of the second data to a remaining number of strides in the first buffer; and, when all of the strides in the first buffer have been consumed, read from the queue a second work item, pointing to a second buffer in the memory, and write additional data from the data packets to an additional number of strides in the second buffer.
There is additionally provided, in accordance with an embodiment of the present invention, a Network Interface Controller (NIC), including a host interface, a network interface, and packet processing circuitry. The host interface is configured to connect via a host bus to a host processor, which publishes a sequence of work items in a queue for access by the NIC, each work item pointing to at least one buffer in memory, each buffer including a plurality of strides of a common, fixed size. The network interface is configured to receive data packets from a network, including at least a first packet and a second packet, which respectively contain at least first and second data to be pushed to the memory. The packet processing circuitry is configured to: read from the queue a first work item, pointing to a first buffer in the memory; write the first data, via the host interface, to a first number of strides in the first buffer, sufficient to contain the first data, without consuming all of the strides in the first buffer; write, via the host interface, at least a first portion of the second data to a remaining number of strides in the first buffer; and, when all of the strides in the first buffer have been consumed, read from the queue a second work item, pointing to a second buffer in the memory, and write additional data from the data packets, via the host interface, to an additional number of strides in the second buffer.
The present invention will be more fully understood from the following detailed description of embodiments of the invention taken together with the accompanying drawings, in which:
drawings
FIG. 1 is a block diagram schematically illustrating a data communication system according to an embodiment of the present invention;
FIG. 2 is a block diagram schematically illustrating allocation of data buffers in a memory according to an embodiment of the present invention; and
Fig. 3 is a flowchart schematically illustrating a method for receiving data according to an embodiment of the present invention.
Detailed Description
Pushing data to a remote node in a packet network, for example using send operations, is a useful model for many data transfer operations, but it incurs significant overhead: In conventional implementations, a client process on the receiving host device must continually publish WQEs to the appropriate receive queue(s), and the NIC must fetch one of these WQEs and use the buffer information that it contains for each incoming send packet that it processes. The WQEs in the receive queue can consume a substantial amount of memory, and the generation and consumption of these WQEs can add latency to the handling of the pushed data, as well as consuming host CPU resources.
The method described in the above-mentioned U.S. Patent 9,143,467 is said to obviate the need for WQEs in receiving pushed data from a network via a suitably configured NIC. Instead, this method requires the client process to pre-allocate a contiguous, cyclical set of buffers in host memory for use by the QPs that receive packets containing data to be pushed to the host memory. To avoid buffer overflow (and hence frequent packet retransmission), however, the allocated set of buffers must be large enough to handle large volumes of bursty data, which are common in many network applications. These buffer allocations can therefore result in a substantial memory "footprint," which may go largely unused during quiet periods between bursts.
The embodiments of the invention described herein provide an alternative solution, which enables more efficient use of host resources in receiving pushed data, in terms of both memory usage and overhead. In these embodiments, the work items (such as WQEs) that are published to the receive queue point to respective buffers in memory, and each of these buffers is divided into strides. Typically, the host processor publishes a sequence of such work items to the receive queue for use in handling incoming packets (such as IB send packets) that contain data to be pushed to memory. In the embodiments described below, the buffers in question are assumed to be located in host memory; alternatively, however, some or all of these stride-based buffers may be located in another memory, such as a memory attached to or contained in the NIC.
In the context of this specification and in the claims, a "stride" is a segment of memory of a common, fixed size, which may be as small as a single byte or as large as many bytes. For example, the strides may advantageously be sized to correspond to the expected size of packets received on a given transport service instance (i.e., equal to or slightly larger than the expected packet size). The segments may be contiguous in the memory address space, but in some applications, non-contiguous memory segments can be advantageous. For example, to accommodate a sequence of packets having fixed, respective header and payload sizes, each stride may comprise a first portion mapped to one region of the memory, to receive the packet header, and a second portion mapped to another region, to receive the payload.
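As an illustration of such a split mapping, the sketch below computes the header and payload addresses of a given stride; the structure and field names are hypothetical, not taken from the patent:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical split-stride layout: each stride has a header portion in
 * one memory region and a payload portion in another, as described above. */
struct split_stride_buf {
    uint8_t *hdr_base;      /* region receiving packet headers  */
    uint8_t *payload_base;  /* region receiving packet payloads */
    size_t   hdr_size;      /* fixed header portion per stride  */
    size_t   payload_size;  /* fixed payload portion per stride */
};

/* Address of the header portion of stride i. */
static inline uint8_t *stride_hdr(const struct split_stride_buf *b, size_t i)
{
    return b->hdr_base + i * b->hdr_size;
}

/* Address of the payload portion of stride i. */
static inline uint8_t *stride_payload(const struct split_stride_buf *b, size_t i)
{
    return b->payload_base + i * b->payload_size;
}
```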
A stride-based receive queue of the type described herein may advantageously be shared among multiple transport service instances (for example, multiple QPs); alternatively, however, the stride-based approach described herein may be applied to individual QPs separately. Stride-based receive queues can be used with substantially any type of transport service instance, including both reliable and unreliable transport types.
When the NIC receives a packet from the network on a stride-based receive queue, containing data to be pushed to memory, it reads the next work item from the queue and writes the data from the packet to the buffer indicated by the work item. In contrast to conventional usage, however, this data write operation does not necessarily consume the entire buffer and the corresponding work item. Rather, the NIC writes only to the number of strides that is needed to contain the data, with each packet consuming an integer number of strides (even if the packet data do not completely fill the last stride consumed). Any remaining strides are then used by the NIC in writing the data from the next incoming packet, even if this next packet belongs to a different message from the first packet (and possibly to a different transport service instance). In the context of this specification and in the claims, the term "message" refers to any logical unit of data communicated over a network and may comprise a single packet or multiple packets transmitted on the same transport service instance. A message may correspond, for example, to an InfiniBand RDMA or send transaction; alternatively, it may comprise a sequence of packets that are processed together by the NIC in a Large Receive Offload (LRO) session over Ethernet, as described below. Thus, a single work item and its corresponding buffer may be consumed in receiving and writing to memory the data from a single large packet or from a sequence of two or more smaller packets.
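The stride accounting described here can be sketched as follows; the structure and helper names are illustrative assumptions, showing only the ceiling arithmetic and the spill into the next buffer:

```c
#include <stddef.h>

/* Illustrative stride accounting: each packet consumes an integer number
 * of strides, even when its data do not fill the last stride. */
struct stride_buf {
    size_t stride_size;  /* common, fixed stride size */
    size_t num_strides;  /* strides in this buffer    */
    size_t next_stride;  /* first unconsumed stride   */
};

static inline size_t strides_needed(size_t len, size_t stride_size)
{
    return (len + stride_size - 1) / stride_size;  /* ceiling division */
}

/* Consume strides for one packet; returns how many bytes fit in this
 * buffer (any remainder spills into the next work item's buffer). */
static size_t consume_for_packet(struct stride_buf *b, size_t packet_len)
{
    size_t avail = (b->num_strides - b->next_stride) * b->stride_size;
    size_t taken = packet_len < avail ? packet_len : avail;

    b->next_stride += strides_needed(taken, b->stride_size);
    return taken;  /* buffer is used up when next_stride == num_strides */
}
```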
The NIC will typically read and use the next work item in the queue only when all of the strides in the first buffer have been consumed. (Alternatively, in some scenarios, multiple WQEs may be open and in use in parallel for different receive sessions.) The NIC then handles the next work item and its corresponding buffer, along with further work items, in similar fashion. If the number of strides remaining in a given buffer is insufficient to contain all of the data in a given packet, the NIC may write a first portion of the data to the remaining strides in this buffer and then write the remainder of the data to the buffer indicated by the next work item. Alternatively, upon receiving a packet that is too long to fit in the strides remaining under a given work item, the NIC may issue a "filler completion report" (also referred to below as a "dummy CQE") to release the remaining strides allocated to that work item, and may then write the packet to the buffer indicated by the next work item.
The host processor monitors the consumption of the work items, for example by reading and processing completion reports that the NIC writes to the memory after writing the data from the data packets, and is thus able to track the remaining length of the queue of work items. When this length drops below a specified limit, the host processor publishes one or more additional work items to the queue. Work items are thus written, and the corresponding memory buffers reserved, only as needed. An approach of this sort can be used to limit the memory footprint associated with the receive queue to no more than is actually needed, while ensuring that buffers will be available when required.
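A driver-side sketch of this replenishment logic appears below; all of the names (state fields, allocator, post function) are hypothetical stand-ins rather than a real driver API:

```c
#include <stddef.h>

/* Illustrative driver-side state; none of these names come from a real API. */
struct rq_state {
    unsigned posted_wqes;    /* work items published so far           */
    unsigned consumed_wqes;  /* work items fully consumed, per CQEs   */
    unsigned low_watermark;  /* replenish threshold (set in software) */
};

void *alloc_stride_buffer(size_t size);                 /* hypothetical */
void  post_stride_wqe(struct rq_state *rq, void *buf);  /* hypothetical */

/* Publish new work items only when the remaining queue length falls
 * below the limit, keeping the memory footprint tight. */
void replenish_if_low(struct rq_state *rq, size_t buf_size)
{
    while (rq->posted_wqes - rq->consumed_wqes < rq->low_watermark) {
        post_stride_wqe(rq, alloc_stride_buffer(buf_size));
        rq->posted_wqes++;
    }
}
```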
This novel receive queue model can be useful in many data communication scenarios, but it is particularly advantageous in handling incoming data traffic that can include both small and large packets and can be characterized by unpredictable bursts. Stride-based buffers and work items enable the host processor and NIC to handle such traffic efficiently, in terms of both the memory footprint and the overhead imposed on the host processor and bus.
For the sake of specificity and clarity, the embodiments described below with reference to the drawings relate specifically to IB networks and use terminology taken from the IB standard. However, the principles of the present invention are similarly applicable to data communications over other kinds of packet networks, such as ethernet and Internet Protocol (IP) networks. In particular, the techniques described herein may be implemented, mutatis mutandis, in the context specified by other RDMA protocols known in the art, such as RDMA over Ethernet (RoCE) and Internet Wide Area RDMA Protocol (iWARP), as well as in the context of other kinds of transport protocols, such as Transmission Control Protocol (TCP).
Fig. 1 is a block diagram schematically illustrating a data communication system 20, according to an embodiment of the present invention. A host computer 22 (also referred to as a host or host device) communicates with other hosts 24 via a network 30, such as an IB switch fabric. The computer 22 comprises a processor, in the form of a central processing unit (CPU) 32, and a host memory 34 (also referred to as system memory), which are connected by a suitable bus 36, as is known in the art. A NIC 38, such as an IB HCA, connects the computer 22 to the network 30.
The NIC 38 includes a network interface 42, which is coupled to the network 30, and a host interface 40, which is connected to the CPU 32 and the memory 34 via the bus 36. Packet processing circuitry 44, coupled between the network interface 42 and the host interface 40, generates outgoing packets for transmission over the network 30 and processes incoming packets received from the network, as described below. The interfaces 40 and 42 and the circuitry 44 typically comprise dedicated hardware logic, whose details will be apparent to those skilled in the art after reading the present description. Alternatively or additionally, at least some of the functions of the circuitry 44 may be implemented in software on a suitable programmable processor.
Client processes running on the CPU 32, such as processes generated by application software (referred to simply as clients 46), communicate with clients 48 running on remote hosts 24 by means of QPs on the NIC 38. Each client 46 is typically assigned multiple QPs, which are used to communicate with different clients on various remote hosts. As described above, some of these QPs may operate in the conventional manner, whereby the client 46 publishes WQEs to both the send queue and the receive queue of the QP. Other QPs, however, share a receive queue containing WQEs that point to stride-based buffers, as defined above.
To support this latter arrangement, a NIC driver 50 running on the CPU 32 allocates buffers for receiving pushed data, conveyed in IB send packets 52 from clients 48 on peer devices (hosts 24), for any of the participating QPs. The buffers are divided into strides of a uniform size, which can be set by the driver. The NIC 38 is informed of the QP configuration, typically at QP initialization, and handles incoming send requests accordingly.
FIG. 2 is a block diagram that schematically illustrates the allocation of data buffers 66 in host memory 34, in accordance with an embodiment of the present invention. The clients 46 are assigned respective QPs 60, which are preconfigured to use receive WQEs 64 queued in a shared receive queue (SRQ) 62. The assignment of the QPs 60 to the SRQ 62 is recorded, for example, in the respective QP contexts (not shown in the figures), which are typically held in the memory 34, thus enabling the packet processing circuitry 44 in the NIC 38 to identify and read the WQEs 64 from the SRQ when packets are received from the network 30 on these QPs.
Each WQE 64 includes one or more scatter entries (labeled SE0, SE1, and so forth in FIG. 2). Each scatter entry points to the base address of a corresponding buffer 66 that has been reserved in the memory 34. Each buffer 66 is divided into an integer number of strides 68 of a predetermined size. The stride size may be chosen, for example, on the basis of the expected data size of incoming packets on the QPs 60, so that small packets consume a single stride while larger packets consume multiple strides. All of the buffers 66 may conveniently be of the same size, and thus contain the same number of strides, or they may alternatively be of different sizes.
In the illustrated example, the NIC 38 receives a sequence of send packets 70 from the network 30 on the QPs 60 labeled QP1, QP2, and QP3. Each packet 70 belongs to a respective message transmitted by a corresponding client 48 on a peer device, such as one of the hosts 24. These messages are labeled message 1, message 2, and so forth for each QP in FIG. 2. Some of the packets 70 contain long payloads of data for delivery to the memory 34. Such packets may belong to multi-packet messages (not shown in the figures), in which case the packet headers generally contain a field indicating whether each packet is the first packet, the last packet, or an intermediate packet of the corresponding message. Other packets may contain short data payloads, or may contain commands without any data to be scattered to the memory 34.
When the first send packet 70 in the sequence (corresponding to message 1 on QP1) is received via the network interface 42, the processing circuitry 44 in the NIC 38 reads the first WQE 64 of the SRQ 62 from the memory 34 via the host interface 40, and then reads the scatter entry SE0 from this WQE. The processing circuitry 44 extracts the payload data from this first packet and writes the data via the host interface 40 to the buffer 66 indicated by SE0, starting at the base address and extending over a sufficient number of strides 68 to contain all of the data. Each packet consumes an integer number of strides, and if the packet data do not reach the end of the last stride, the stride is padded with dummy data.
If the first packet does not consume all the strides in the buffer 66, the processing circuitry 44 writes the payload data from the next packet 70 (message 1 on QP 3) to the remaining strides 68 in this same first buffer 66, as shown in FIG. 2. In the example shown in the figure, a sufficient number of strides remain in the first buffer to accommodate all data from the second packet. Alternatively, if the number of remaining strides is insufficient, a portion of the data from the second packet may be written to the remaining strides in the first buffer while the remainder of the packet data is written to the next buffer.
When all of the strides in the first buffer have been consumed, the processing circuitry 44 reads the next available scatter entry (in this example, SE1 of the first WQE 64 in the SRQ 62) and writes the next packet or packets to the strides of the buffer 66 indicated by SE1. The processing circuitry 44 reads the next WQE 64 from the SRQ 62 only after all of the buffers indicated by the first WQE have been exhausted. The processing circuitry 44 uses the scatter entry or entries in this next WQE in writing further packet data to the appropriate numbers of strides 68 in the next buffers, and continues in this manner as long as there are incoming packets to be handled on the QPs 60.
After it has finished writing the data from each packet 70 to the appropriate buffer 66, the processing circuitry 44 writes a completion report, such as a completion queue element (CQE) 74, to a completion queue 72 in the memory 34. The CQE 74 indicates the QP 60 to which the packet was addressed and, in the case of a multi-packet message, whether the packet was the first, an intermediate, or the last packet in the message. This information is read by the clients 46, which are thus able to read and process the data conveyed to them on their respective QPs 60.
In alternative embodiments, the processing circuitry 44 may apply other criteria in deciding when to write CQEs to the memory 34. In particular, the completion reporting criteria may be chosen so as to reduce the number of completion reports that the CPU 32 must handle. For example, in one embodiment, the processing circuitry 44 writes a CQE 74 only once per message, after writing the data from the last packet in each message to the memory 34. In this case, the CQE typically contains a scatter list, with one or more pointers indicating the strides to which the packets in the message were written. Additionally or alternatively, the processing circuitry 44 may write one CQE 74 per scatter entry, for example after writing data to the last stride 68 in a given buffer 66, or one CQE per WQE 64, after writing data to the last stride belonging to the WQE.
In this latter case, CQE 74 typically contains additional information to enable a process running on CPU 32 to parse and process data that has been written to memory 34. For example, a CQE may contain details of packets and/or messages scattered to the buffer or buffers to which the WQE in question points, such as an association between the packet or message and the corresponding transport service instance (QP or LRO session), a stride index, and an indication of whether each packet in a multi-packet message is a first packet, a last packet, or an intermediate packet. As an example, a CQE for a certain WQE may contain the following information:
● first packet, QP 0x1, size 1KB
● unique packet, QP 0x2, size 4KB
● intermediate packet, QP 0x1, size 1KB
● first packet, QP 0x3, size 4KB
● Final packet, QP 0x1, size 8B
● Dummy filler, size 64B (occupying the remaining space)
For each packet 70, the CQE 74 indicates the number of strides 68 that the packet consumed. (For this purpose, even a command packet containing no payload data is considered to consume a single stride.) The driver 50 reads this information from the CQEs 74 and is thus able to track the consumption of the buffers 66. When the driver 50 finds in this manner that a number of the WQEs 64 in the SRQ 62 have been consumed, such that the number of remaining WQEs has dropped below a certain limit (which may be set in software), the driver allocates one or more additional buffers 66 in the memory 34 and writes one or more additional WQEs pointing to these buffers to the SRQ. Thus, assuming the program parameters are properly set, the SRQ 62 should always contain enough WQEs 64 to handle the incoming traffic, while keeping the memory footprint (in terms of the allocated buffers 66) no larger than is actually required by current traffic conditions.
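The driver's stride accounting might look like the following sketch; the CQE layout shown here is an illustrative assumption, not the actual hardware format:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative CQE contents for a stride-based receive queue;
 * the field names and layout are assumptions, not a hardware format. */
struct stride_cqe {
    uint32_t qpn;        /* QP on which the packet arrived             */
    uint32_t byte_count; /* payload bytes written to memory            */
    uint16_t strides;    /* strides consumed (>= 1, even for command
                            packets that carry no payload)             */
    bool     first;      /* first packet of a multi-packet message     */
    bool     last;       /* last packet of a multi-packet message      */
    bool     filler;     /* dummy CQE releasing leftover strides       */
};

struct wqe_track {
    unsigned strides_left;  /* unconsumed strides under this work item */
    bool     consumed;
};

/* Every CQE reports the strides it consumed, so the driver can tell
 * when the buffer(s) of a work item have been used up. */
void account_cqe(struct wqe_track *w, const struct stride_cqe *cqe)
{
    w->strides_left -= cqe->strides;
    if (w->strides_left == 0)
        w->consumed = true;  /* buffer may now be released and reused */
}
```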
In some cases, for example when only a small number of strides remain available in a given buffer 66, the processing circuitry 44 may write a "dummy CQE" with respect to these strides, in order to cause the driver 50 to release the buffer and write a new WQE. The processing circuitry 44 will then begin writing the data from the next packet to the first stride in the next buffer. This approach is particularly useful when the number of remaining strides is smaller than some predetermined minimum, for example when the remaining strides are insufficient to contain the next packet or to cover a stride allocation made by the processing circuitry.
Additionally or alternatively, the processing circuitry 44 may incorporate in the CQEs 74 metadata regarding transmission status information related to the data packets. These metadata may include, for example, an indication of whether the corresponding packet was the first packet in a given message, an intermediate packet, or the last packet. The "message" in question may be, for example, a send message, or it may be a sequence of packets received in an LRO operation. The metadata thus enable the client 46 and/or the driver 50 to ascertain when a message has been completed and which data should be read from the memory 34 as parts of the message. For example, when the NIC 38 allocates a context for an LRO operation to a flow on a certain QP 60 and then aggregates the data packets belonging to this flow into the LRO session, the metadata in the CQEs may indicate the state of the LRO context.
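For example, the LRO state carried in such metadata could be encoded along these lines (an illustrative enumeration only; the real bit layout is hardware-specific):

```c
/* Illustrative encoding of LRO session state in completion metadata. */
enum lro_cqe_state {
    LRO_NONE,          /* packet was not part of an LRO aggregation     */
    LRO_SESSION_OPEN,  /* first or intermediate packet; session ongoing */
    LRO_SESSION_DONE,  /* last packet; aggregated message is complete   */
};
```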
In an alternative embodiment (not shown in the figures), the processing circuitry 44 reserves a contiguous allocation of strides 68 within one or more of the buffers 66 for each message, or at least for certain messages (for example, messages on one or more of the QPs 60). When a packet belonging to a message for which such an allocation has been reserved is received, the processing circuitry 44 writes the data from the packet to the appropriate number of strides within the contiguous allocation. When a packet belonging to a different message or QP arrives in the meanwhile, the processing circuitry 44 writes the data from this latter packet to the remaining strides in the buffer, following the contiguous allocation. Alternatively, in such cases, when the new message is too large to fit in the current buffer, the processing circuitry 44 may close the buffer, for example by issuing a dummy (filler) CQE, and may then write the data from the packet to the first stride or strides of the next buffer.
This scheme is useful for QPs that carry messages of constant (or approximately constant) size, and it can also be adapted for use with Large Receive Offload (LRO) in Ethernet and IP networks. In these cases, the processing circuitry 44 may reserve contiguous allocations of multiple strides corresponding to the characteristic message size or to the LRO reservation size. Each new message, or each LRO session initiated by the packet processing circuitry 44 (which is considered a type of "message" in the present context), receives a fixed-size allocation containing a certain number of strides in the corresponding buffer, and each such reservation has a timer for buffer release. The timer starts when the reservation is made and is canceled when the reserved allocation has been filled. When the contiguous allocation has been consumed, or when the timer expires (i.e., when the contiguous allocation has not been completely consumed by the end of a predetermined timeout), the circuitry 44 issues a CQE 74, so that the CPU 32 will process the data that have been written to the buffer. Unused strides remaining in a given buffer can then be used by another message or a subsequent session. Multiple messages or sessions can run in parallel, each with its own buffer, the number of such messages or sessions being limited by the number of available buffers or of WQEs in the queue.
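The reservation-and-timer logic just described can be sketched as follows, with illustrative names and an assumed nanosecond clock:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a per-message contiguous stride reservation with a release
 * timer; the field names and the clock units are illustrative. */
struct stride_reservation {
    unsigned first_stride;  /* start of the contiguous allocation    */
    unsigned num_strides;   /* reservation size                      */
    unsigned used_strides;  /* strides consumed so far               */
    uint64_t deadline_ns;   /* timer armed when reservation was made */
};

/* A CQE is issued either when the allocation has been fully consumed
 * or when the timeout expires with the allocation only partly used. */
bool reservation_should_complete(const struct stride_reservation *r,
                                 uint64_t now_ns)
{
    return r->used_strides == r->num_strides || now_ns >= r->deadline_ns;
}
```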
When multiple strides are pre-allocated to contain the packets in a multi-packet message, the processing circuitry 44 can use the allocation to deliver the message data to the CPU 32 in the proper order, even when packets arrive out of order or interspersed with packets received on other QPs sharing the same SRQ. For this purpose, the processing circuitry 44 writes the packets to the strides not necessarily in their order of arrival, but rather according to the packet sequence numbers in the packet headers. In this manner, the processing circuitry 44 selects the stride to which to write the data from each packet responsively to the respective sequence number, so that the data are written to the allocated strides in the proper order.
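Assuming equal-size packets with a fixed number of strides allotted to each, this placement rule reduces to simple arithmetic on the sequence number, as in the following sketch:

```c
#include <stdint.h>

/* With a fixed allotment of strides per packet, the packet sequence
 * number alone determines where each packet's data land, regardless of
 * arrival order. A sketch, assuming equal-size packets. */
static inline uint32_t stride_for_psn(uint32_t psn, uint32_t base_psn,
                                      uint32_t strides_per_packet)
{
    return (psn - base_psn) * strides_per_packet;
}
```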
The above reservation scheme is also useful in relieving the burden on the CPU 32, in that it reduces the number of CQEs that the CPU must handle and enables the clients 46 to access received messages by reading contiguous segments of data from the memory 34. The allocation size can vary among the QPs 60, depending on the characteristic size of the messages transmitted to the host computer 22 on each QP, even among different QPs sharing the same SRQ 62. When the characteristic message size is known in advance, the allocation size per message can be fixed for each QP in software, for example by the driver 50. Alternatively or additionally, the driver 50 or the processing circuitry 44 may learn the characteristic message size for each QP automatically, based on the traffic received from the network 30. In either case, the processing circuitry 44 will then decide how many strides to include in each allocation on the basis of the characteristic message size, so that each message fits into its allocated number of strides with minimal waste of space in the buffer.
As a specific example, the processing circuitry 44 may learn how many strides 68 should be included in an optimal LRO reservation. In general, the optimal reservation size for a given packet flow is determined by the number of consecutive packets that are likely to arrive at the SRQ 62 from that flow before a packet from a different flow enters the SRQ. (In the present context, "flow" means a sequence of packets transmitted to the NIC 38 from a given client 48 on a peer host 24, for example a sequence of packets having the same source and destination addresses, source and destination ports, and protocol.) The number of consecutive packets depends, inter alia, on the burstiness of the flow in question and on the rates of traffic in the other flows on the SRQ 62. Having tracked and learned these flow characteristics, the processing circuitry 44 can then estimate the number of consecutive packets that are likely to arrive in a given burst, and will typically set the LRO reservation size a few strides larger than needed to accommodate the estimated burst size.
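One plausible way to implement such learning is an exponentially weighted moving average of observed burst lengths, as sketched below; the weights and the safety margin are illustrative choices, not taken from the patent:

```c
/* Sketch of learning an LRO reservation size: track a moving estimate
 * of how many consecutive packets a flow delivers before another flow
 * interleaves, then reserve slightly more than that. */
struct burst_learner {
    double avg_burst;  /* EWMA of observed burst lengths, in packets */
};

void observe_burst(struct burst_learner *bl, unsigned burst_pkts)
{
    bl->avg_burst = 0.875 * bl->avg_burst + 0.125 * (double)burst_pkts;
}

unsigned lro_reservation_strides(const struct burst_learner *bl,
                                 unsigned strides_per_packet)
{
    unsigned est_pkts = (unsigned)(bl->avg_burst + 0.5);  /* round    */
    return (est_pkts + 2) * strides_per_packet;  /* small safety margin */
}
```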
Using this approach, each burst of packets received by the NIC 38 on the flow in question will consume nearly all of the strides in the corresponding LRO reservation. The processing circuitry 44 can then release any strides remaining in the reservation by issuing a dummy (filler) CQE, as explained above. Alternatively, when a packet from another flow reaches the SRQ 62 before the LRO reservation has been fully consumed, the processing circuitry 44 may terminate the reservation by issuing a CQE relating only to the strides actually consumed by the flow for which the reservation was made, and may then use the remainder of the reservation for the other flow.
FIG. 3 is a flow chart that schematically illustrates a method for reception of data by the host computer 22, in accordance with an embodiment of the present invention. The method is initiated when the NIC 38 receives an incoming send (data push) packet from the network 30 via the network interface 42, at a packet reception step 80. The packet processing circuitry 44 checks whether the context of the QP of the packet points to a shared receive queue with stride-based WQEs, at a QP checking step 82. If not, the NIC 38 may handle the packet using conventional WQEs, buffers, and processing steps, which are beyond the scope of this description, at a default processing step 84.
If the QP of the packet does belong to a stride-based SRQ, the packet processing circuitry 44 checks whether a WQE from this SRQ 62 is already open, with strides 68 available for writing the packet data, at a WQE checking step 86. If not, the circuitry reads a new WQE 64 from the SRQ 62, at a WQE reading step 88. (If no WQE is available in the queue, the NIC 38 will return a negative acknowledgment over the network 30 to the sender of the packet, which will then retransmit the packet.) In either case, the circuitry 44 next writes the data from the packet to the buffer 66 in the memory 34, starting from the next available stride 68 (meaning the first stride, in the case of a new WQE read at step 88).
When the packet processing circuitry 44 has finished writing the data in the current packet to the memory 34, it writes a CQE 74 to the completion queue 72, at a completion step 92. (Alternatively, in other embodiments, CQEs may be issued, for example, per message or per scatter entry in a WQE, as explained above.) The driver 50 reads the CQEs 74, at a CQE reading step 94, and thereby tracks the usage of the buffers 66 and the consumption of the corresponding WQEs 64. The driver 50 may also notify a client 46 when send operations (including multi-packet send operations) directed to that client's QP 60 have been completed, so that the client 46 can read and process the data from the buffers 66 and then release the buffers for reuse. Alternatively, the clients 46 may be programmed to read and process the CQEs 74 directly, without driver involvement.
The driver 50 periodically checks whether the number of WQEs 64 remaining in the SRQ 62 (or, equivalently, the number of available buffers) has dropped below a predetermined threshold, at a queue checking step 96. This limit is typically set in software, as a tradeoff between the desired memory footprint and the expected rate of packet arrival on the participating QPs 60. When the number of remaining WQEs drops below the limit, the driver 50 reserves one or more additional buffers 66 and writes one or more corresponding WQEs 64 to the SRQ 62, at a WQE publication step 98. The host computer 22 is then ready to receive further send packets, at a packet arrival step 100.
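The overall flow of FIG. 3 can be summarized in the following sketch, in which every helper is a hypothetical stand-in for the hardware and driver behavior described above:

```c
struct nic_rq;                    /* opaque receive-queue state (illustrative) */
struct packet { unsigned qpn; };  /* only the field used in this sketch        */

/* Hypothetical helpers standing in for the behavior described above: */
int  qp_uses_stride_srq(unsigned qpn);
void default_processing(const struct packet *pkt);
int  wqe_open_with_free_strides(struct nic_rq *rq);
void open_next_wqe(struct nic_rq *rq);  /* NAKs the sender if queue is empty */
void scatter_to_strides(struct nic_rq *rq, const struct packet *pkt);
void write_cqe(struct nic_rq *rq, const struct packet *pkt);

void on_packet(struct nic_rq *rq, const struct packet *pkt)
{
    if (!qp_uses_stride_srq(pkt->qpn)) {  /* step 82: stride-based SRQ?  */
        default_processing(pkt);          /* step 84: conventional path  */
        return;
    }
    if (!wqe_open_with_free_strides(rq))  /* step 86: open WQE w/ room?  */
        open_next_wqe(rq);                /* step 88: read a new WQE     */

    scatter_to_strides(rq, pkt);          /* write from next free stride */
    write_cqe(rq, pkt);                   /* step 92: completion report  */
}
```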
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.

Claims (35)

1. A method for communication, comprising:
publishing a sequence of work items in a queue, each work item pointing to at least one buffer in memory comprising a plurality of strides of a common, fixed size;
receiving, in a network interface controller, NIC, of a host computer, a data packet from a network, the data packet comprising at least a first packet and a second packet, the first packet and the second packet respectively containing at least first data and second data to be pushed to the memory;
reading a first work item from the queue that points to a first buffer in the memory;
writing the first data from the NIC into a first number of strides in the first buffer sufficient to contain the first data without consuming all strides in the first buffer;
writing at least a first portion of the second data from the NIC to a remaining number of strides in the first buffer; and
when all strides in the first buffer have been consumed, reading a second work item from the queue that points to a second buffer in the memory, and writing additional data in the data packet from the NIC to an additional number of strides in the second buffer.
2. The method of claim 1, wherein writing the additional data comprises writing a second portion of the second data to the second buffer.
3. The method of claim 1, wherein the NIC consumes an integer number of strides in writing data from each of the packets to the memory.
4. The method of claim 1, wherein the packets contain respective sequence numbers, and wherein the NIC selects a stride to which to write data from each of the packets in response to the respective sequence numbers.
5. The method of claim 1, wherein publishing the sequence of work items comprises sizing the stride to correspond to a size of packets received on a given transport service instance.
6. The method of claim 5, wherein the packets include a header and a payload, and wherein each stride includes a first portion mapped to a first region of the memory to receive the header of a given packet and a second portion mapped to a different second region of the memory to receive the payload of the given packet.
7. The method of claim 1, wherein the first packet and the second packet belong to different first and second messages transmitted by one or more peer devices to the host computer over the network.
8. The method of claim 7, wherein the first message and the second message are transmitted on different first and second transport service instances that share the queue of the work item.
9. The method of claim 7, wherein each of the first and second messages ends with a respective last packet, and wherein the method comprises writing a respective completion report from the NIC to the memory only after writing data from the respective last packet within each of the first and second messages to the memory.
10. The method of claim 9, wherein the completion report includes a pointer to a stride to which data within each of the messages is written.
11. The method of claim 7, and comprising reserving a contiguous allocation of strides in the first buffer for the first message, wherein writing the first data comprises writing to a first number of strides within the contiguous allocation, and wherein writing at least a first portion of the second data comprises writing to a remaining number of strides after the contiguous allocation.
12. The method of claim 11, and comprising writing a completion report from the NIC to the memory when the contiguous allocation has been consumed, or when a predetermined timeout, initiated upon reserving the contiguous allocation, expires before the contiguous allocation has been completely consumed.
13. The method of claim 7, and comprising reserving a contiguous allocation of strides in the first buffer for the first message, wherein writing the first data comprises writing to a first number of strides within the contiguous allocation, and writing a fill completion report from the NIC to the memory even when the contiguous allocation is not completely consumed, in order to release the first buffer and cause a subsequent message to be written to the second buffer.
14. The method of claim 7, and comprising automatically learning, in the NIC, a characteristic size of a message transmitted to the host computer in response to traffic received from the network, and reserving a contiguous allocation of strides in the first buffer for the first message, while deciding how many strides to include in the contiguous allocation in response to the characteristic size of the message.
15. The method of claim 14, wherein automatically learning the characteristic size comprises estimating a number of consecutive packets to include in a Large Receive Offload (LRO) operation.
16. The method of claim 15, wherein the LRO operation is applied to packets received by the NIC in a first flow, which shares a receive queue with at least a second flow different from the first flow, and wherein the method comprises:
issuing a completion report with respect to the strides that have been consumed by the packets in the first flow when a packet from the second flow is received before the packets in the first flow have consumed all of the strides in the contiguous allocation; and
writing the packet in the second flow to the remaining one or more strides in the contiguous allocation.
17. The method of claim 1, and comprising writing a respective completion report from the NIC to the memory after writing data from each of the data packets to the memory.
18. The method of claim 17, wherein the completion report includes metadata regarding transmission status information related to the data packet.
19. The method of claim 18, wherein the metadata indicates a status of a Large Receive Offload (LRO) operation performed by the NIC.
20. The method of claim 1, and comprising writing a respective completion report from the NIC to the memory after writing the data to the last stride belonging to each work item.
21. The method of claim 20, wherein the completion report includes details of data packets written to the at least one buffer pointed to by the work item.
22. The method of claim 1, and comprising writing a corresponding completion report from the NIC to the memory after writing the data to a given stride when a number of strides remaining in the first buffer is less than a predetermined minimum.
23. The method of claim 1, and comprising writing completion reports from the NIC to the memory in response to the data having been written to the memory, wherein publishing the sequence of work items comprises monitoring consumption of the work items in response to the completion reports written to the memory, and publishing one or more additional work items to the queue when a remaining length of the queue falls below a specified limit.
24. A computing device, comprising:
a memory;
a host processor configured to publish a sequence of work items in a queue, each work item pointing to at least one buffer in the memory, each buffer comprising a plurality of strides of a common, fixed size; and
a network interface controller, NIC, configured to: receive a data packet from a network, the data packet comprising at least a first packet and a second packet, the first packet and the second packet respectively containing at least first data and second data to be pushed to the memory; read a first work item from the queue that points to a first buffer in the memory; write the first data to a first number of strides in the first buffer, sufficient to contain the first data, without consuming all strides in the first buffer; write at least a first portion of the second data to a remaining number of strides in the first buffer; and, when all strides in the first buffer have been consumed, read a second work item from the queue that points to a second buffer in the memory, and write additional data in the data packet to an additional number of strides in the second buffer.
25. The apparatus of claim 24, wherein the additional data written by the NIC comprises a second portion of the second data.
26. The apparatus of claim 24, wherein the NIC consumes an integer number of strides in writing data from each of the packets to the memory.
27. The apparatus of claim 24, wherein the packets contain respective sequence numbers, and wherein the NIC selects a stride to which to write data from each of the packets in response to the respective sequence numbers.
28. The apparatus of claim 24, wherein the NIC is configured to size the stride to correspond to a size of a packet received on a given transport service instance.
29. The apparatus of claim 24, wherein the first packet and the second packet belong to different first and second messages transmitted by one or more peer devices to the computing apparatus over the network.
30. The apparatus of claim 29, wherein the NIC is configured to automatically learn a characteristic size of a message transmitted to a host computer in response to traffic received from the network, and reserve a contiguous allocation of strides in the first buffer for the first message, while deciding how many strides to include in the contiguous allocation in response to the characteristic size of the message.
31. The apparatus of claim 24, wherein the NIC is configured to write a respective completion report to the memory after writing data from each of the data packets to the memory.
32. The apparatus of claim 24, wherein the NIC is configured to write a corresponding completion report to the memory after writing the data to a last stride belonging to each work item.
33. The apparatus of claim 24, wherein the NIC is configured to write a corresponding completion report to the memory after writing the data to a given stride when a number of strides remaining in the first buffer is less than a predetermined minimum.
34. The apparatus of claim 24, wherein the NIC is configured to write a completion report to the memory in response to the data having been written to the memory, and wherein the host processor is configured to monitor consumption of the work item in response to the completion report being written to the memory, and to publish one or more additional work items to the queue when a remaining length of the queue falls below a specified limit.
35. A network interface controller, NIC, comprising:
a host interface configured to connect to a host processor via a host bus, the host processor configured to publish a sequence of work items in a queue for access by the NIC, each work item pointing to at least one buffer in memory, each buffer comprising a plurality of strides of a common, fixed size;
a network interface configured to receive data packets from a network, the data packets including at least a first packet and a second packet, which respectively contain at least first data and second data to be pushed to the memory; and
a packet processing circuit configured to: read a first work item from the queue that points to a first buffer in the memory; write the first data, via the host interface, to a first number of strides in the first buffer sufficient to contain the first data, without consuming all of the strides in the first buffer; write, via the host interface, at least a first portion of the second data to a remaining number of strides in the first buffer; and, when all of the strides in the first buffer have been consumed, read a second work item from the queue that points to a second buffer in the memory and write, via the host interface, additional data from the data packets to an additional number of strides in the second buffer.
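The claims above describe the NIC's data path in prose. The following C sketch is purely illustrative and not part of the patent text: it models the stride-based scattering of claims 24 and 26 in host software, with all names and sizes (STRIDE_SIZE, STRIDES_PER_BUF, scatter_packet, and so on) assumed for the example rather than taken from the patent or any real driver.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define STRIDE_SIZE      512  /* common, fixed stride size (assumed) */
#define STRIDES_PER_BUF   16  /* strides per posted buffer (assumed) */
#define QUEUE_DEPTH        8  /* work items in the receive queue */

struct work_item {            /* claim 35: each work item points to a buffer */
    uint8_t *buf;             /* STRIDES_PER_BUF * STRIDE_SIZE bytes */
};

struct recv_queue {
    struct work_item wqes[QUEUE_DEPTH];
    unsigned head;            /* next work item the NIC will consume */
    unsigned next_stride;     /* first free stride in the current buffer */
};

/* Scatter one packet's payload into whole strides, rolling over to the
 * next work item's buffer when the current one is fully consumed
 * (claims 24 and 26). Returns the number of buffers fully consumed. */
static unsigned scatter_packet(struct recv_queue *rq,
                               const uint8_t *data, size_t len)
{
    unsigned completed = 0;

    while (len > 0) {
        struct work_item *wi = &rq->wqes[rq->head % QUEUE_DEPTH];
        size_t strides_left = STRIDES_PER_BUF - rq->next_stride;
        size_t need = (len + STRIDE_SIZE - 1) / STRIDE_SIZE;
        size_t take = need < strides_left ? need : strides_left;
        size_t bytes = take * STRIDE_SIZE < len ? take * STRIDE_SIZE : len;

        memcpy(wi->buf + (size_t)rq->next_stride * STRIDE_SIZE, data, bytes);
        data += bytes;
        len -= bytes;
        rq->next_stride += (unsigned)take;   /* integer strides, claim 26 */

        if (rq->next_stride == STRIDES_PER_BUF) {
            completed++;      /* buffer fully consumed */
            rq->head++;       /* claim 24: advance to the next work item */
            rq->next_stride = 0;
        }
    }
    return completed;
}

int main(void)
{
    static uint8_t bufs[QUEUE_DEPTH][STRIDES_PER_BUF * STRIDE_SIZE];
    struct recv_queue rq = { .head = 0, .next_stride = 0 };
    uint8_t pkt[7000];

    memset(pkt, 0xab, sizeof pkt);
    for (unsigned i = 0; i < QUEUE_DEPTH; i++)
        rq.wqes[i].buf = bufs[i];

    /* The first packet leaves strides free; the second packet's first
     * portion fills them and the rest spills into the next buffer,
     * exactly the sequence recited in claim 24. */
    scatter_packet(&rq, pkt, sizeof pkt);
    scatter_packet(&rq, pkt, sizeof pkt);
    printf("current buffer: %u, next free stride: %u\n",
           rq.head % QUEUE_DEPTH, rq.next_stride);
    return 0;
}
```

With a 7000-byte packet and 512-byte strides, the first packet occupies 14 of the 16 strides; the second packet's first 1024 bytes land in the two remaining strides, consuming the buffer, and its remaining bytes continue in the next work item's buffer.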
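Along the same illustrative lines, this second sketch models the host-side behavior of claims 32 and 34: the driver treats each completion report marking the last stride of a work item as evidence that the item has been consumed, and republishes work items when too few remain posted. Every name here (struct cqe, post_work_item, LOW_WATERMARK) is an assumption for the example, not the patent's or any vendor's API.

```c
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_DEPTH    8   /* work items the driver aims to keep posted */
#define LOW_WATERMARK  2   /* claim 34: specified limit on queue length */

struct cqe {                      /* completion report (assumed layout) */
    bool last_stride_of_wqe;      /* set when a buffer is fully consumed */
};

static unsigned posted_wqes = QUEUE_DEPTH;

static void post_work_item(void)  /* stand-in for publishing a work item */
{
    printf("published one more work item\n");
}

/* Claim 32: one completion report marks the last stride of each work
 * item; claim 34: the host monitors consumption through these reports
 * and publishes additional work items when too few remain posted. */
static void on_completion(const struct cqe *c)
{
    if (c->last_stride_of_wqe)
        posted_wqes--;
    while (posted_wqes < LOW_WATERMARK) {
        post_work_item();
        posted_wqes++;
    }
}

int main(void)
{
    /* simulate consuming all but one of the posted work items */
    for (unsigned i = 0; i + 1 < QUEUE_DEPTH; i++) {
        struct cqe c = { .last_stride_of_wqe = true };
        on_completion(&c);
    }
    printf("work items still posted: %u\n", posted_wqes);
    return 0;
}
```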
CN201810217976.2A 2017-03-16 2018-03-16 Receive queue with stride-based data dispersal Active CN108536543B8 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/460,251 2017-03-16
US15/460,251 US10210125B2 (en) 2017-03-16 2017-03-16 Receive queue with stride-based data scattering

Publications (3)

Publication Number Publication Date
CN108536543A CN108536543A (en) 2018-09-14
CN108536543B true CN108536543B (en) 2021-08-03
CN108536543B8 CN108536543B8 (en) 2021-09-03

Family

ID=63483831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810217976.2A Active CN108536543B8 (en) 2017-03-16 2018-03-16 Receive queue with stride-based data dispersal

Country Status (2)

Country Link
US (1) US10210125B2 (en)
CN (1) CN108536543B8 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11252464B2 (en) 2017-06-14 2022-02-15 Mellanox Technologies, Ltd. Regrouping of video data in host memory
US11502948B2 (en) 2017-10-16 2022-11-15 Mellanox Technologies, Ltd. Computational accelerator for storage operations
US11005771B2 (en) 2017-10-16 2021-05-11 Mellanox Technologies, Ltd. Computational accelerator for packet payload operations
US10841243B2 (en) 2017-11-08 2020-11-17 Mellanox Technologies, Ltd. NIC with programmable pipeline
US10708240B2 (en) 2017-12-14 2020-07-07 Mellanox Technologies, Ltd. Offloading communication security operations to a network interface controller
US10824469B2 (en) 2018-11-28 2020-11-03 Mellanox Technologies, Ltd. Reordering avoidance for flows during transition between slow-path handling and fast-path handling
US11184439B2 (en) 2019-04-01 2021-11-23 Mellanox Technologies, Ltd. Communication with accelerator via RDMA-based network adapter
US20200371708A1 (en) * 2019-05-20 2020-11-26 Mellanox Technologies, Ltd. Queueing Systems
US11055222B2 (en) * 2019-09-10 2021-07-06 Mellanox Technologies, Ltd. Prefetching of completion notifications and context
US11646979B2 (en) * 2019-09-25 2023-05-09 MIXHalo Corp. Packet payload mapping for robust transmission of data
US20220021629A1 (en) * 2020-07-19 2022-01-20 Mellanox Technologies, Ltd. Coalescing packets based on hints generated by network adapter
CN114095153A (en) 2020-08-05 2022-02-25 迈络思科技有限公司 Cipher data communication device
IL276538B2 (en) 2020-08-05 2023-08-01 Mellanox Technologies Ltd Cryptographic data communication apparatus
US11757778B2 (en) * 2020-12-07 2023-09-12 Pensando Systems Inc. Methods and systems for fairness across RDMA requesters using a shared receive queue
US11595472B2 (en) 2021-01-19 2023-02-28 Mellanox Technologies, Ltd. Controlling packet delivery based on application level information
US11934333B2 (en) 2021-03-25 2024-03-19 Mellanox Technologies, Ltd. Storage protocol emulation in a peripheral device
US11934658B2 (en) 2021-03-25 2024-03-19 Mellanox Technologies, Ltd. Enhanced storage protocol emulation in a peripheral device
CN117280674A (en) * 2021-06-18 2023-12-22 华为技术有限公司 Apparatus and method for remote direct memory access
US11792139B2 (en) 2022-01-24 2023-10-17 Mellanox Technologies, Ltd. Efficient packet reordering using hints
US11765237B1 (en) 2022-04-20 2023-09-19 Mellanox Technologies, Ltd. Session-based remote direct memory access

Family Cites Families (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4783698A (en) 1987-04-13 1988-11-08 Technology Inc., 64 Interpolator for compressed video data
US5668809A (en) * 1993-10-20 1997-09-16 Lsi Logic Corporation Single chip network hub with dynamic window filter
US6097734A (en) 1997-04-30 2000-08-01 Adaptec, Inc. Programmable reassembly of data received in an ATM network
US5949441A (en) 1997-08-01 1999-09-07 International Business Machines Corporation Multimedia terminal with an encoder for converting analog video data to compressed digitized video data
US6321276B1 (en) 1998-08-04 2001-11-20 Microsoft Corporation Recoverable methods and systems for processing input/output requests including virtual memory addresses
US6981027B1 (en) 2000-04-10 2005-12-27 International Business Machines Corporation Method and system for memory management in a network processing system
US7171484B1 (en) 2000-05-24 2007-01-30 Krause Michael R Reliable datagram transport service
US6766467B1 (en) 2000-10-19 2004-07-20 International Business Machines Corporation Method and apparatus for pausing a send queue without causing sympathy errors
US8051212B2 (en) 2001-04-11 2011-11-01 Mellanox Technologies Ltd. Network interface adapter with shared data send resources
US7155602B2 (en) 2001-04-30 2006-12-26 Src Computers, Inc. Interface for integrating reconfigurable processors into a general purpose computing system
US6789143B2 (en) 2001-09-24 2004-09-07 International Business Machines Corporation Infiniband work and completion queue management via head and tail circular buffers with indirect work queue entries
US7263103B2 (en) 2002-07-23 2007-08-28 Mellanox Technologies Ltd. Receive queue descriptor pool
US7299266B2 (en) 2002-09-05 2007-11-20 International Business Machines Corporation Memory management offload for RDMA enabled network adapters
GB2395307A (en) 2002-11-15 2004-05-19 Quadrics Ltd Virtual to physical memory mapping in network interfaces
GB2395308B (en) 2002-11-18 2005-10-19 Quadrics Ltd Command scheduling in computer networks
US7430623B2 (en) * 2003-02-08 2008-09-30 Hewlett-Packard Development Company, L.P. System and method for buffering data received from a network
US20050135395A1 (en) 2003-12-22 2005-06-23 Fan Kan F. Method and system for pre-pending layer 2 (L2) frame descriptors
US7930422B2 (en) * 2004-07-14 2011-04-19 International Business Machines Corporation Apparatus and method for supporting memory management in an offload of network protocol processing
EP1619589B1 (en) 2004-07-23 2007-12-26 Stmicroelectronics SA Method for programming a system on a chip DMA controller and system on a chip therefore.
US20070124378A1 (en) 2005-10-14 2007-05-31 Uri Elzur Method and system for indicate and post processing in a flow through data architecture
EP1977571A2 (en) 2006-01-12 2008-10-08 Broadcom Israel R&D Method and system for protocol offload and direct i/o with i/o sharing in a virtualized network environment
US9001899B2 (en) 2006-09-15 2015-04-07 Freescale Semiconductor, Inc. Video information processing system with selective chroma deblock filtering
US9794378B2 (en) 2006-11-08 2017-10-17 Standard Microsystems Corporation Network traffic controller (NTC)
US8958486B2 (en) 2007-07-31 2015-02-17 Cisco Technology, Inc. Simultaneous processing of media and redundancy streams for mitigating impairments
US8176252B1 (en) 2007-11-23 2012-05-08 Pmc-Sierra Us, Inc. DMA address translation scheme and cache with modified scatter gather element including SG list and descriptor tables
US8495301B1 (en) 2007-11-23 2013-07-23 Pmc-Sierra Us, Inc. System and method for scatter gather cache processing
US20100121971A1 (en) 2008-11-10 2010-05-13 Samsung Electronics Co., Ltd. Multipath transmission of three-dimensional video information in wireless communication systems
DE102009016742B4 (en) 2009-04-09 2011-03-10 Technische Universität Braunschweig Carolo-Wilhelmina Multiprocessor computer system
US8255475B2 (en) 2009-04-28 2012-08-28 Mellanox Technologies Ltd. Network interface device with memory management capabilities
US8255593B2 (en) * 2009-09-29 2012-08-28 Oracle America, Inc. Direct memory access with striding across memory
JP2011109397A (en) 2009-11-17 2011-06-02 Sony Corp Image transmission method, image reception method, image transmission device, image reception device, and image transmission system
US9596447B2 (en) 2010-07-21 2017-03-14 Qualcomm Incorporated Providing frame packing type information for video coding
CN103098462A (en) 2010-08-06 2013-05-08 松下电器产业株式会社 Encoding method, display device, and decoding method
US9270299B2 (en) 2011-02-11 2016-02-23 Qualcomm Incorporated Encoding and decoding using elastic codes with flexible source block mapping
US8682108B2 (en) 2011-04-11 2014-03-25 Hewlett-Packard Development Company, L.P. System and method for determining image placement on a canvas
US9143467B2 (en) * 2011-10-25 2015-09-22 Mellanox Technologies Ltd. Network interface controller with circular receive buffer
US8751701B2 (en) 2011-12-26 2014-06-10 Mellanox Technologies Ltd. Host channel adapter with pattern-type DMA
US9105078B2 (en) 2012-05-31 2015-08-11 Apple Inc. Systems and methods for local tone mapping
JP5935779B2 (en) 2013-09-30 2016-06-15 カシオ計算機株式会社 Image processing apparatus, image processing method, and program
US9767529B1 (en) 2013-12-18 2017-09-19 Mediatek Inc. Method and apparatus for accessing compressed data and/or uncompressed data of image frame in frame buffer
US20150373075A1 (en) 2014-06-23 2015-12-24 Radia Perlman Multiple network transport sessions to provide context adaptive video streaming

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101039256A (en) * 2006-03-17 2007-09-19 中兴通讯股份有限公司 Method for transmitting sectionally packet control unit frame
CN101146031A (en) * 2006-12-05 2008-03-19 中兴通讯股份有限公司 Service-data oriented storage method and processing method in radio communication system
CN101340605A (en) * 2007-07-06 2009-01-07 中兴通讯股份有限公司 Scheduling information uploading method for multi-carrier reinforced uplink access system

Also Published As

Publication number Publication date
CN108536543A (en) 2018-09-14
US10210125B2 (en) 2019-02-19
US20180267919A1 (en) 2018-09-20
CN108536543B8 (en) 2021-09-03

Similar Documents

Publication Publication Date Title
CN108536543B (en) Receive queue with stride-based data dispersal
US20220214919A1 (en) System and method for facilitating efficient load balancing in a network interface controller (nic)
EP3719657A1 (en) Communication with accelerator via rdma-based network adapter
US10129153B2 (en) In-line network accelerator
US7295565B2 (en) System and method for sharing a resource among multiple queues
EP3298739B1 (en) Lightweight transport protocol
US11876859B2 (en) Controlling packet delivery based on application level information
US20140223026A1 (en) Flow control mechanism for a storage server
US11503140B2 (en) Packet processing by programmable network interface
CN111404986A (en) Data transmission processing method, device and storage medium
US10990447B1 (en) System and method for controlling a flow of storage access requests
US9542356B2 (en) Determining, at least in part, one or more respective amounts of buffer memory
KR102211005B1 (en) A middleware apparatus of data distribution services for providing a efficient message processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CI03 Correction of invention patent
Correction item: Inventor
Correct: Eden Bostan
False: Eden Bostan
Number: 32-01
Page: The title page
Volume: 37