CN117203627A - Apparatus and method for remote direct memory access - Google Patents


Info

Publication number
CN117203627A
Authority
CN
China
Prior art keywords
data packet
receiving device
sequence
message
received
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180096824.1A
Other languages
Chinese (zh)
Inventor
鲁文·科恩
大卫·加诺
阿米特·杰伦
本-沙哈尔·贝尔彻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306 Intercommunication techniques
    • G06F15/17331 Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]

Abstract

The invention relates to an apparatus and method for RDMA. Specifically, the invention provides a transmitting device and a receiving device. The transmitting device is configured to: transmit a sequence of data packets to a receiving device for RDMA, wherein each transmitted data packet in the sequence is associated with a packet sequence number (PSN) and carries a message; determine whether a particular data packet in the sequence, which includes a first message, was received at the receiving device; and, after determining that the particular data packet was lost at the receiving device, retransmit it to the receiving device as a next step, wherein the retransmitted data packet carries the same first message as the particular data packet and is associated with a new PSN. The receiving device is configured to operate correspondingly.

Description

Apparatus and method for remote direct memory access
Technical Field
The present invention relates to high-performance computing, and more particularly to remote direct memory access (RDMA). It concerns transferring RDMA transactions over a packet network. To this end, the present invention provides an apparatus, a method, and a packet format for RDMA.
Background
Applications often need to communicate with other applications or other computing resources over a network fabric. Packet delivery is not always guaranteed, because network switches may drop some packets in transit. The remote computing resource may also discard data packets for various reasons.
To ensure delivery of data packets, reliable transport protocols are often used, such as the Transmission Control Protocol (TCP), Quick UDP Internet Connections (QUIC), which runs over the User Datagram Protocol (UDP), or RDMA reliable connection (RDMA-RC). Such a reliable protocol identifies and retransmits lost data packets. Each retransmitted data packet consumes additional bandwidth, so unnecessary retransmissions should be avoided.
To enable the sender to identify lost data packets, the receiver sends an acknowledgement (ACK) for received data packets. Negative acknowledgements (NAKs) are sometimes used as well. Retransmission schemes are commonly classified, according to how they operate, as "stop-and-wait", "Go-Back-N", or "selective repeat".
RDMA is a technique that enables applications to perform memory access operations on remote memory installed in remote network nodes. RDMA-RC provides reliable data transfer and is implemented in an RDMA network interface card (RNIC), enabling network nodes to perform such memory access operations without involving the operating system or the node's main central processing unit (CPU). RDMA is widely used in modern data centers and computer clusters because it provides low-latency remote memory access and high network bandwidth.
There are two common RDMA technologies: one is defined in the InfiniBand specification and the other by the Internet Engineering Task Force (IETF). In addition, InfiniBand RDMA has two variants, RoCE and RoCEv2, that enable it to run over IP/Ethernet networks.
RDMA-RC in InfiniBand uses a relatively simple Go-Back-N (GBN) retransmission algorithm, in which many packets are retransmitted after a packet loss event is identified. Moreover, in RDMA-RC, message completions are signaled in the same order in which the software layer issued the messages. Current RDMA-RC does not support out-of-order, faster completion signaling.
Disclosure of Invention
In view of the above, embodiments of the present invention aim to provide a retransmission scheme for reliable transmission, in particular one that ensures that every transmitted data packet is processed exactly once by the receiver. One object is to be able to signal message completions in an order (in time) that differs from the order in which the messages were sent. Another object is to make it easy to apply equal-cost multi-path (ECMP) flow blocks to speed up the execution and completion of messages.
These and other objects are achieved by embodiments of the invention as described in the appended independent claims. Advantageous implementations of the embodiments of the invention are further defined in the dependent claims.
A first aspect of the invention provides a transmitting device for RDMA. The transmitting device is configured to: transmit a sequence of data packets to a receiving device for RDMA, wherein each transmitted data packet in the sequence is associated with a packet sequence number (PSN) and carries a message; determine whether a particular data packet in the sequence, which includes a first message, was received at the receiving device; and, after determining that the particular data packet was lost at the receiving device, retransmit it to the receiving device as a next step, wherein the retransmitted data packet carries the same first message as the particular data packet and is associated with a new PSN.
Embodiments of the present invention introduce a new RDMA retransmission scheme for reliable transmission, called sequence-independent retransmission. The scheme enables out-of-order, and therefore faster, completion signaling. In the present invention, each packet carries exactly one message. The message referred to herein may be a work queue element (WQE), that is, an RDMA operation or transaction issued or posted by a source upper-layer protocol (ULP), such as an application or other software, and pushed into a queue pair (QP).
In one implementation manner of the first aspect, each transmitted data packet includes: a transaction identifier (XID) identifying the message carried in the data packet; the specific data packet includes a first XID identifying the first message carried in the specific data packet; the retransmitted data packet includes the same first XID as the specific data packet.
Specifically, each packet (transmitted or retransmitted) carries two different labels: an XID and a PSN. The XID identifies the message, so the sender assigns a unique XID to each message. When a packet is identified as lost, it is retransmitted immediately. The retransmitted copy carries the same XID but a different PSN. The PSN of the retransmitted packet is logically larger than that of the previously lost packet sent to the same destination.
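As a minimal sketch of this two-label scheme (not the claimed implementation), the Python below models a sender that gives each message a fresh XID and each transmission a fresh PSN. The names `Packet`, `Sender`, `send`, and `retransmit` are hypothetical, and both number spaces are assumed to be simple counters that start at zero and never wrap.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Packet:
    xid: int        # identifies the message; unchanged across retransmissions
    psn: int        # identifies this transmission; fresh for every (re)send
    payload: bytes  # the single message carried by the packet


class Sender:
    """Hypothetical sender: one XID per message, one PSN per transmission."""

    def __init__(self):
        self._next_xid = count()
        self._next_psn = count()
        self.outstanding = {}  # PSN -> Packet not yet acknowledged

    def send(self, payload: bytes) -> Packet:
        """Transmit a new message: new XID, new PSN."""
        pkt = Packet(next(self._next_xid), next(self._next_psn), payload)
        self.outstanding[pkt.psn] = pkt
        return pkt

    def retransmit(self, lost_psn: int) -> Packet:
        """Resend a lost packet: same XID and payload, new (larger) PSN."""
        lost = self.outstanding.pop(lost_psn)
        pkt = Packet(lost.xid, next(self._next_psn), lost.payload)
        self.outstanding[pkt.psn] = pkt
        return pkt
```

Because the retransmitted copy keeps its XID, a receiver can recognize a duplicate message even though the PSN has changed.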
In an implementation of the first aspect, the transmitting device is further configured to: assign a flow block ID identifying a flow block to each transmitted data packet and/or each retransmitted data packet, wherein the flow block comprises a plurality of data packets routed through the same network route, and each transmitted data packet and/or retransmitted data packet further comprises the flow block ID; and transmit each transmitted data packet and/or retransmitted data packet to the receiving device through the flow block.
Thus, the present invention may also support ECMP flow blocks. The transmitting device may maintain a separate (contiguous) PSN space for each sub-flow (i.e., flow block). That is, each flow block is associated with its own PSN space. A packet is assigned to a flow block, and assigned a PSN, when it is transmitted or retransmitted.
In one implementation of the first aspect, the flow block ID assigned to the retransmitted data packet is different from the flow block ID assigned to the transmitted data packet, wherein the transmitted data packet and the retransmitted data packet are routed through different flow blocks identified by their respective flow block IDs.
When the transmitting device determines that a packet is lost, it may retransmit the same message in a new packet via any of the flow blocks. However, it may be preferable to assign a different flow block ID to the retransmitted data packet and route it through a different flow block.
In an implementation of the first aspect, the transmitting device is configured to: receive, from the receiving device, a notification message for one or more transmitted data packets of the sequence of data packets, wherein the notification message indicates whether the transmitted data packet was received at the receiving device; and determine, based on the notification message, whether the particular data packet in the sequence of data packets was received at the receiving device.
Note that the notification message may be an ACK that carries the PSN and the flow block ID of the received packet. Alternatively, a single notification message may report multiple packets belonging to the same flow block.
In an implementation of the first aspect, the transmitting device is further configured to: upon receiving a notification message indicating for the first time that a transmitted data packet was received, generate a completion signal for the message, wherein the transmitted data packet includes an XID identifying the message.
In this case, the message may specifically be an RDMA write operation or an RDMA send operation. When a notification message informs the transmitting device that the message (i.e., the data packet carrying it) was received, completion of the message may be signaled. A later ACK reporting receipt of the same message does not trigger another completion signal, because the message has already completed. Completion of a message is signaled regardless of the status of any previously posted messages, i.e., whether or not all previously posted messages have completed. In the present invention, this completion mechanism may be referred to as "out-of-order completion" or "order-independent completion".
In an implementation of the first aspect, the transmitting device is further configured to: when a notification message indicating that a data packet having a PSN greater than X was received at the receiving device is received from the receiving device before a notification message indicating that the particular data packet having PSN X in the sequence of data packets was received at the receiving device, determine that the particular data packet having PSN X was lost at the receiving device, wherein X is a non-negative integer.
It should be noted that the retransmission scheme may have two complementary mechanisms: PSN-based and timer-based. PSN-based retransmission relies on a unique, strictly monotonically increasing PSN assigned to each transmitted packet, and quickly identifies lost packets when a "gap" is detected in the PSNs reported back by the notification messages.
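The gap rule can be sketched as follows. This is an illustration under stated assumptions rather than the patented logic: `detect_lost` is a hypothetical helper, PSNs are plain integers without wraparound, and all packets in one PSN space are assumed to follow the same route, so an ACK for a larger PSN implies that smaller unacknowledged PSNs were lost.

```python
from typing import Set


def detect_lost(outstanding: Set[int], acked_psn: int) -> Set[int]:
    """Return the outstanding PSNs that an ACK for acked_psn has skipped over.

    `outstanding` holds PSNs sent but not yet acknowledged; it is updated in
    place so that acknowledged and lost PSNs are no longer tracked.
    """
    outstanding.discard(acked_psn)
    # Every still-unacknowledged PSN below the acknowledged one is a "gap".
    lost = {psn for psn in outstanding if psn < acked_psn}
    outstanding -= lost
    return lost
```

Each PSN returned by this helper would then be retransmitted in a new packet with the same XID and a new PSN.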
In an implementation of the first aspect, each notification message indicates the transmitted data packets whose PSNs are within a specific range.
In an implementation of the first aspect, each notification message indicates transmitted data packets that include the same flow block ID.
In an implementation of the first aspect, the transmitting device is configured to: after transmitting a data packet, start tracking a time period if tracking has not already been started; and, after receiving a notification message, reset the tracking of the time period.
Timer-based retransmissions are similar to the traditional way of triggering retransmission events: if no notification message is received within a configurable period of time, all outstanding data packets are retransmitted.
In an implementation of the first aspect, the transmitting device is further configured to: determine that the particular data packet having PSN X was lost at the receiving device when no notification message indicating that the data packet having PSN X was received at the receiving device arrives before the time period expires.
In an implementation of the first aspect, the transmitting device is further configured to: set the time period according to the receiving device.
It should be noted that, to reduce the required resources, a pair of network nodes may share a single timer (i.e., use the same time period). Alternatively, to further reduce the required resources and overhead, the transmitting device may use a single timer for all connections with different receivers, i.e., set the same time period for all of them.
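The timer-based mechanism described above might be modeled as in the sketch below. The class `RetransmitTimer` and its methods are hypothetical, time is passed in explicitly so that the example is deterministic, and stopping the timer when nothing is outstanding is an added assumption that the text does not state.

```python
class RetransmitTimer:
    """Hypothetical single timer shared by all outstanding packets."""

    def __init__(self, period: float):
        self.period = period   # configurable time period
        self.deadline = None   # None means the timer is not running

    def on_send(self, now: float) -> None:
        # Start tracking only if tracking has not already been started.
        if self.deadline is None:
            self.deadline = now + self.period

    def on_ack(self, now: float, packets_outstanding: bool) -> None:
        # Any notification message resets the period. Assumption: the timer
        # is stopped when no packets remain outstanding.
        self.deadline = now + self.period if packets_outstanding else None

    def expired(self, now: float) -> bool:
        # On expiry, all outstanding packets would be treated as lost.
        return self.deadline is not None and now >= self.deadline
```

Because one timer covers every outstanding packet, the state kept per connection stays constant regardless of how many packets are in flight.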
In an implementation manner of the first aspect, the message carried in each data packet is limited to fit into a single network maximum transmission unit.
A second aspect of the invention provides a receiving device for RDMA, wherein the receiving device is configured to: receive a sequence of data packets from a transmitting device for RDMA, wherein each received data packet in the sequence is associated with a packet sequence number (PSN) and carries a message; and send a notification message for the sequence of data packets to the transmitting device, wherein the notification message indicates which data packets of the sequence were received at the receiving device.
Embodiments of the present invention also provide a receiving apparatus that operates correspondingly to the transmitting apparatus of the first aspect.
In one implementation of the second aspect, each received data packet includes an XID, the XID being used to identify the message carried in the data packet.
Note that XID is used to identify messages. Each message is assigned a unique XID.
In one implementation of the second aspect, each data packet in the sequence of data packets further includes a flow block ID identifying a flow block, wherein the flow block includes a plurality of data packets routed through the same network route.
In one implementation of the second aspect, each notification message indicates data packets that include the same flow block ID.
Alternatively, a single notification message may report multiple packets belonging to the same flow block.
In an implementation of the second aspect, the receiving device is further configured to: maintain a first data structure storing the PSNs of received data packets; and update the first data structure after receiving a data packet from the transmitting device.
Optionally, the receiving device may also record the XID carried in each received data packet.
In an implementation of the second aspect, the receiving device is further configured to: discard a received data packet if its PSN cannot be recorded in the first data structure.
In an implementation of the second aspect, the receiving device is further configured to: send the notification message based on the first data structure.
In one implementation of the second aspect, each notification message indicates the data packets whose PSNs are within a specific range.
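A receiver along these lines could be sketched as follows. Everything here is illustrative: `Receiver`, `on_packet`, and `build_notification` are hypothetical names, the "first data structure" is modeled as a bounded Python set, and the fixed `capacity` is an assumed reason why a PSN might not be recordable. A real device would also release old records, which this sketch omits.

```python
class Receiver:
    """Hypothetical receiver that records PSNs and executes each XID once."""

    def __init__(self, capacity: int):
        self.capacity = capacity    # assumed bound on recordable PSNs
        self.received_psns = set()  # the "first data structure"
        self.seen_xids = set()      # optional XID record for duplicate checks
        self.delivered = []         # messages handed to the application

    def on_packet(self, psn: int, xid: int, message: bytes) -> bool:
        """Return True if the packet was recorded, False if it was discarded."""
        is_new_psn = psn not in self.received_psns
        if is_new_psn and len(self.received_psns) >= self.capacity:
            return False  # the PSN cannot be recorded: discard the packet
        self.received_psns.add(psn)
        if xid not in self.seen_xids:  # process each message exactly once
            self.seen_xids.add(xid)
            self.delivered.append(message)
        return True

    def build_notification(self, lo: int, hi: int) -> list:
        """Report the recorded PSNs within the range [lo, hi]."""
        return sorted(p for p in self.received_psns if lo <= p <= hi)
```

Discarding a packet whose PSN cannot be recorded keeps the notification state consistent: the sender never sees an ACK for it and eventually retransmits the message.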
A third aspect of the invention provides a method for RDMA. The method comprises the following steps: transmitting a sequence of data packets to a receiving device for RDMA, wherein each transmitted data packet in the sequence of data packets is associated with a PSN and carries a message; determining whether a particular data packet in the sequence of data packets is received at the receiving device, wherein the particular data packet includes a first message; after determining that the specific data packet is lost at the receiving device, retransmitting the specific data packet to the receiving device as a next step, wherein the retransmitted data packet carries the same first message as the specific data packet and is associated with a new PSN.
The method of the third aspect and its implementation provides the same advantages and effects as described above for the transmitting device of the first aspect and its corresponding implementation.
A fourth aspect of the invention provides a method for RDMA. The method comprises the following steps: receiving a sequence of data packets from a transmitting device for RDMA, wherein each received data packet in the sequence is associated with a packet sequence number (PSN) and comprises a transaction identifier (XID) identifying a message carried in the data packet; and sending a notification message for the sequence of data packets to the transmitting device, wherein the notification message indicates which data packets of the sequence were received at the receiving device.
The method of the fourth aspect and its implementations provide the same advantages and effects as described above for the receiving device of the second aspect and its corresponding implementations.
A fifth aspect of the present application provides a computer program comprising program code for performing the method according to any of the third aspect and implementations thereof or the fourth aspect and implementations thereof when implemented on a processor.
It should be noted that all the devices, elements, units and means described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the various entities described in the present application, as well as the functions described as being performed by those entities, are intended to mean that the respective entity is adapted or configured to perform the respective steps and functions. Even if, in the following description of specific embodiments, a specific function or step performed by an entity is not reflected in the description of a specific detailed element of the entity that performs it, it should be clear to a skilled person that these methods and functions may be implemented by corresponding hardware or software elements or any combination thereof.
Drawings
The various aspects described above and the manner of attaining them will be elucidated with reference to the accompanying drawings, wherein:
Fig. 1 shows a transmitting apparatus provided by an embodiment of the present application.
Fig. 2 shows packet exchanges between a transmitting device and a receiving device provided by an embodiment of the present application.
Fig. 3 shows packet exchanges between a transmitting device and a receiving device provided by an embodiment of the present application.
Fig. 4 shows packet exchanges between a transmitting device and a receiving device provided by an embodiment of the present application.
Fig. 5 shows a receiving apparatus provided by an embodiment of the present application.
Fig. 6 illustrates a method provided by an embodiment of the present application.
Fig. 7 illustrates a method provided by an embodiment of the present application.
Detailed Description
Exemplary embodiments of a method, apparatus and program product for a sequence independent retransmission scheme are described with reference to the accompanying drawings. While this description provides detailed examples of possible implementations, it should be noted that these details are intended to be exemplary and do not limit the scope of the application.
Furthermore, one embodiment/example may refer to other multiple embodiments/examples. For example, any description mentioned in one embodiment/example, including but not limited to terms, elements, processes, interpretations, and/or technical advantages apply to other embodiments/examples.
Fig. 1 illustrates a transmitting device 100 for RDMA provided by an embodiment of the present invention. The transmitting device 100 may include processing circuitry (not shown) for performing, conducting, or initiating various operations of the transmitting device 100 described herein. The processing circuitry may include hardware and software. The hardware may include analog circuitry or digital circuitry, or both analog and digital circuitry. Digital circuitry may include components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The transmitting device 100 may also include a memory circuit that stores one or more instructions that may be executed by a processor or processing circuit (specifically, under control of software). For example, the memory circuit may include a non-transitory storage medium storing executable software code that, when executed by a processor or processing circuit, causes the transmitting device 100 to perform various operations. In one embodiment, the processing circuitry includes one or more processors and a non-transitory memory coupled to the one or more processors. The non-transitory memory may carry executable program code that, when executed by the one or more processors, causes the transmitting device 100 to perform, conduct, or initiate the operations or methods described herein.
Specifically, the transmitting device 100 is configured to transmit a sequence of data packets 101 to the receiving device 200 for RDMA. Specifically, each transmitted data packet in the sequence of data packets 101 is associated with a PSN and carries a message. Then, the transmitting apparatus 100 is configured to determine whether or not a specific packet 1011 in the packet sequence 101 is received at the receiving apparatus 200. The particular data packet 1011 includes a first message 1012. Further, the transmitting apparatus 100 is configured to retransmit the specific data packet to the receiving apparatus 200 as a next step after determining that the specific data packet is lost at the receiving apparatus 200. Specifically, the retransmitted data packet 102 carries the same first message 1012 as the specific data packet 1011 and is associated with a new PSN.
Embodiments of the present invention introduce a new RDMA retransmission scheme for reliable transmission. This retransmission scheme is not suitable for applications that require message completions to be signaled in their relative order. However, many other applications that do not require in-order completion signaling can benefit from out-of-order, faster completion signals. In the present invention, each packet carries exactly one message. That is, the message carried in each packet is limited to fit into a single network maximum transmission unit (MTU).
Furthermore, as in most known protocols that guarantee reliable transmission, two different mechanisms are used together to identify lost data packets: PSN-based and timer-based. Details are provided later in this description.
It should be noted that an RDMA transaction involves an initiating node and a destination node (or target node). The initiating node issues an RDMA operation request, and the target node receives the request and responds accordingly. The transmitting device 100 shown in Fig. 1 may be regarded as the initiating node, and the receiving device 200 shown in Fig. 1 may be regarded as the target node.
According to an embodiment of the invention, each transmitted data packet includes a transaction identifier (XID) identifying the message carried in the data packet. Specifically, the particular data packet 1011 includes a first XID identifying a first message 1012 carried in the particular data packet 1011. Once the transmitting device 100 determines that a particular data packet 1011 is lost, it retransmits the data packet carrying the same message as the lost data packet. That is, the retransmitted data packet 102 includes the same first XID as the specific data packet 1011.
It can be seen that each data packet (transmitted or retransmitted) carries two different labels: an XID and a PSN. The XID identifies the message. The sender (i.e., the transmitting device 100) assigns a unique XID to each message. When a packet is identified as lost, it is retransmitted immediately. The retransmitted copy carries the same XID but a different PSN. The PSN of the retransmitted packet is logically larger than that of the previously lost packet sent to the same destination.
Optionally, the XID number space may be contiguous, which simplifies and reduces the memory required to store the "received XID" state at the target node. Likewise, the PSN number space may be contiguous, which simplifies and reduces the memory required to store the "outstanding PSN" state at the initiating node and shrinks the data structure carried in an ACK.
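The text does not define "logically larger". One common interpretation for a fixed-width counter that wraps around is serial-number comparison, sketched below. The 24-bit width is an assumption borrowed from the InfiniBand PSN field, and `psn_greater` is a hypothetical helper, not part of the described devices.

```python
PSN_BITS = 24              # assumed width, as in the InfiniBand PSN field
PSN_MOD = 1 << PSN_BITS    # size of the wrapping PSN space
HALF = PSN_MOD >> 1        # comparison window: half of the space


def psn_greater(a: int, b: int) -> bool:
    """Return True if PSN a is logically larger than PSN b, allowing wraparound."""
    # a is "ahead of" b when the forward distance from b to a is non-zero
    # and shorter than half of the number space.
    return a != b and ((a - b) % PSN_MOD) < HALF
```

Under this interpretation a PSN of 1 sent just after the counter wraps still compares as larger than `PSN_MOD - 1`, so a contiguous PSN space can be reused indefinitely.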
According to one embodiment of the present invention, the transmitting device 100 may be configured to assign a flow block ID identifying a flow block to each transmitted data packet and/or each retransmitted data packet. Typically, a flow block includes a plurality of packets that are routed through the same network route. Optionally, each transmitted data packet and/or retransmitted data packet further includes the flow block ID. The transmitting device 100 may be configured to transmit each transmitted data packet and/or retransmitted data packet to the receiving device 200 through a flow block. All packets belonging to the same flow block are routed through the same network path. It is assumed here that different flow blocks use different network paths.
It is worth mentioning that ECMP flow blocks are supported by dividing the packet flow between the initiating node and the target node into sub-flows. It is assumed that packets of different sub-flows take different routes through the network. The transmitting device 100 may maintain a separate (contiguous) PSN space for each sub-flow (i.e., flow block). That is, each flow block is associated with its own PSN space. A packet is assigned to a flow block, and assigned a PSN, when it is transmitted or retransmitted.
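Keeping one contiguous PSN space per flow block can be illustrated with a small allocator. `FlowBlockPsnAllocator` and `next_psn` are hypothetical names, and counters that start at zero and never wrap are assumed for simplicity.

```python
from collections import defaultdict
from itertools import count


class FlowBlockPsnAllocator:
    """Hypothetical allocator with an independent PSN counter per flow block."""

    def __init__(self):
        # flow block ID -> counter; a new counter starting at 0 is created
        # the first time a flow block ID is used.
        self._counters = defaultdict(count)

    def next_psn(self, flow_block_id: int) -> int:
        """Return the next PSN in the PSN space of the given flow block."""
        return next(self._counters[flow_block_id])
```

Since each flow block follows a single route, a gap in the PSNs acknowledged for one flow block points to a loss rather than to reordering across different paths.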
According to one embodiment of the invention, a transmitted data packet or retransmitted data packet from the transmitting device 100 may include an XID identifying the message, the PSN of the data packet, and the flow block ID of the flow block over which the data packet is transmitted.
According to one embodiment of the invention, the flow block ID assigned to the retransmitted data packet may be different from the flow block ID assigned to the transmitted data packet. In particular, the transmitted data packet and the retransmitted data packet may be routed through different flow blocks identified by their respective flow block IDs. That is, the two data packets may be transmitted over different network paths established between the transmitter and the receiver.
When the transmitting device 100 determines that a packet is lost, it may retransmit the same message in a new packet through any flow block. That is, the flow block assigned to the retransmitted data packet may be the same as the flow block assigned to the lost data packet. However, it may be preferable to assign a different flow block ID to the retransmitted data packet and route it through a different flow block, because the earlier packet loss may indicate a problem on the previous network route. The transmitting device 100 has the flexibility to select a different flow block for each retransmission dynamically. This decision may be based on any combination of criteria, such as choosing a "better" flow block that is less loaded, faster, or less congested.
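One possible selection policy is sketched below. The text leaves the criteria open, so the single numeric load metric, the "lowest load wins" rule, and the function name `pick_retransmit_flow_block` are all assumptions made for illustration.

```python
def pick_retransmit_flow_block(loads: dict, lost_on: int) -> int:
    """Choose a flow block for a retransmission.

    `loads` maps flow block ID -> hypothetical load metric (lower is better);
    `lost_on` is the flow block on which the packet was lost.
    """
    # Prefer any flow block other than the one that just lost a packet.
    candidates = {fb: load for fb, load in loads.items() if fb != lost_on}
    if not candidates:
        return lost_on  # only one flow block exists, so reuse it
    return min(candidates, key=candidates.get)
```

For example, with loads `{0: 1, 1: 5, 2: 3}` and a loss on flow block 0, this policy selects flow block 2, the least loaded of the remaining routes.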
According to one embodiment of the invention, as shown in Fig. 2, the transmitting device 100 may be configured to receive, from the receiving device 200, a notification message 201 for one or more transmitted data packets in the sequence of data packets 101. Specifically, the notification message 201 indicates whether the transmitted data packet was received at the receiving device 200. The transmitting device 100 may also be configured to determine from the notification message 201 whether a particular data packet in the sequence of data packets 101 was received at the receiving device 200.
In the present invention, for each received data packet, a notification message 201 (also referred to as an ACK) is sent back from the receiver to the sender, whether or not the data packet is accepted (a data packet is not accepted if it carries a duplicate message). The notification message 201 carries the PSN and flow block ID of the received packet. Alternatively, a single notification message 201 may report multiple packets belonging to the same flow block.
According to one embodiment of the invention, each notification message indicates the transmitted data packets whose PSNs are within a particular range. Optionally, each notification message indicates transmitted data packets that include the same flow block ID.
Note that the ACKs may not be cumulative. This means that each notification message 201 can report PSNs for a specific range, all belonging to a specific flow block.
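A non-cumulative, per-flow-block notification could be assembled as in the sketch below. The representation of a notification as a list of inclusive `(first, last)` PSN ranges and the function name `build_notifications` are assumptions; the text does not specify a wire format.

```python
from collections import defaultdict


def build_notifications(received):
    """Group received (flow_block_id, psn) pairs into per-flow-block PSN ranges.

    Returns {flow_block_id: [(first_psn, last_psn), ...]} with inclusive
    ranges. PSNs that were not received are simply absent from every range.
    """
    by_block = defaultdict(set)
    for flow_block_id, psn in received:
        by_block[flow_block_id].add(psn)

    result = {}
    for flow_block_id, psns in by_block.items():
        ranges = []
        for psn in sorted(psns):
            if ranges and psn == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], psn)  # extend the current range
            else:
                ranges.append((psn, psn))          # start a new range
        result[flow_block_id] = ranges
    return result
```

A gap between two reported ranges of the same flow block is exactly the signal that the PSN-based mechanism uses to identify a lost packet.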
It should be noted that ACKs for data packets of the same flow block may be routed through the same path. Alternatively, all ACKs may be sent back to the sender over the same network path.
According to one embodiment of the invention, the transmitting device 100 may be configured to: upon receiving a notification message indicating that a transmitted data packet is first received, a completion signal for the message is generated, wherein the transmitted data packet includes an XID identifying the message.
When the notification message 201 notifies the sending device 100 that a message (i.e., the data packet carrying it) has been received, the message may be signaled as completed. A later ACK reporting reception of the same message does not trigger another completion signal, because the message has already completed.
It should be appreciated that completion of a message occurs regardless of the status of any previously published messages, i.e., whether or not all previously published messages have been completed. This completion generation mechanism may be referred to as "out-of-order completion" or "order-independent completion".
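The completion rule described above can be sketched as follows. The class name `CompletionTracker` and the method `on_ack` are hypothetical; the sketch only shows that the first ACK for an XID signals completion and that later ACKs for the same XID do not.

```python
class CompletionTracker:
    def __init__(self):
        self.completed = set()  # XIDs already signalled as complete

    def on_ack(self, xid):
        """Return True if this ACK triggers a completion signal."""
        if xid in self.completed:
            return False        # later ACK for the same message: no new signal
        self.completed.add(xid)
        return True

tracker = CompletionTracker()
# ACKs arrive out of order, and the message with XID 302 is acknowledged twice.
signals = [tracker.on_ack(x) for x in (302, 300, 302, 301)]
```

Message 302 completes before 300 and 301 in this run, which is the order-independent behaviour the text describes.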
According to one embodiment of the invention, the transmitting device 100 may be configured to: determine that a specific data packet 1011 having PSN X in the data packet sequence 101 is lost at the receiving apparatus 200 when, before receiving a notification message indicating that the specific data packet 1011 was received at the receiving apparatus 200, it receives from the receiving apparatus 200 a notification message indicating that a data packet having a PSN logically larger than X was received at the receiving apparatus 200, wherein X is a non-negative integer.
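A minimal sketch of this gap-based loss detection is given below. It assumes that "logically larger" means a wraparound-aware comparison of fixed-width sequence numbers; the 24-bit width and the function names are assumptions, since the text does not fix a PSN width.

```python
PSN_BITS = 24            # assumed width; the text does not specify one
PSN_MOD = 1 << PSN_BITS
HALF = PSN_MOD >> 1

def psn_greater(a, b):
    """True if PSN a is logically larger than PSN b, allowing for wraparound."""
    return a != b and ((a - b) % PSN_MOD) < HALF

def detect_lost(outstanding, acked_psn):
    """Unacknowledged PSNs of one stream block that acked_psn has overtaken."""
    return {p for p in outstanding if psn_greater(acked_psn, p)}

# P21 and P23 are still unacknowledged when the ACK for P22 arrives.
lost = detect_lost({21, 23}, acked_psn=22)
```

Only P21 is reported, because P23 was sent after P22 and may still be acknowledged normally.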
According to one embodiment of the invention, the transmitting device 100 may be configured to: after transmitting a data packet, start tracking a time period if tracking of the time period has not already been started. The transmitting device 100 may then be configured to: reset the tracking of the time period after receiving a notification message.
According to one embodiment of the invention, the transmitting device 100 may be configured to: when a notification message indicating that a packet having PSN X is received at the reception apparatus 200 is not received before the expiration of the period, it is determined that the specific packet 1011 having PSN X is lost at the reception apparatus 200.
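The timer behaviour of the two preceding embodiments can be sketched as follows. The class `RetransmitTimer` is a hypothetical illustration; time is passed in explicitly so the example is deterministic, and stopping the timer when nothing remains outstanding is an added assumption not stated in the text.

```python
class RetransmitTimer:
    def __init__(self, period):
        self.period = period
        self.deadline = None       # None means tracking has not started

    def on_send(self, now):
        if self.deadline is None:  # start only if not already tracking
            self.deadline = now + self.period

    def on_ack(self, now, still_outstanding):
        # Reset on every notification; stop when nothing is left in flight
        # (the stop condition is an assumption of this sketch).
        self.deadline = now + self.period if still_outstanding else None

    def expired(self, now):
        return self.deadline is not None and now >= self.deadline

timer = RetransmitTimer(period=10)
timer.on_send(now=0)   # starts tracking, deadline at 10
timer.on_send(now=5)   # already tracking, deadline unchanged
timer.on_ack(now=8, still_outstanding=True)  # reset, deadline at 18
```

If no further notification arrives by time 18, `expired` returns True and the data packets still outstanding would be treated as lost.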
Alternatively, the receiver may determine whether a data packet is lost. For example, if a data packet is received through the same stream block with a PSN logically larger than that of a data packet that has not yet been received, the receiving device 200 may determine that the earlier data packet is lost and report the loss.
According to one embodiment of the present invention, the transmitter (i.e., the transmitting device 100) may determine that a packet is lost using at least one of:
It is notified of the packet loss by a notification message 201 (e.g., ACK, NAK, or SACK).
It does not receive the notification message 201 for the packet, but receives a notification message 201 indicating that a packet with a logically larger PSN, sent over the same stream block, was received.
It detects a timeout event on a stream block, indicating that all outstanding data packets of the stream block (i.e., data packets that have been sent but not yet acknowledged) should be regarded as lost.
According to one embodiment of the present invention, the transmitting apparatus 100 may be configured to set the time period according to the receiving apparatus 200. Using one timer per stream block can achieve the best latency performance. To reduce the required resources, a single timer (i.e. the same time period) may instead be used for each pair of network nodes (i.e. the transmitting device 100 and the receiving device 200).
To further reduce the required resources and overhead, the transmitting device 100 may use a single timer for all connections with different receivers, i.e. set the same time period.
Fig. 2 illustrates a packet exchange between an originating node (i.e., transmitting device 100) and a destination node (i.e., receiving device 200) provided by an embodiment of the present invention. The transmitting apparatus 100 may be the transmitting apparatus shown in fig. 1, and the receiving apparatus 200 may be the receiving apparatus shown in fig. 1. Specifically, fig. 2 illustrates retransmission of lost data packets detected based on the notification message 201. This example uses an ACK that includes information about multiple PSNs. The size of this data structure is an implementation parameter. In this example, a single stream block is considered, and thus the stream block ID is omitted for brevity.
In the present embodiment, the transmitting apparatus 100 transmits the data packets P20 to P26, i.e., the data packet sequence 101 shown in fig. 1, to the receiving apparatus 200. The receiving apparatus 200 records each received data packet in a data structure, such as a bitmap (the receiving apparatus 200 may mark a received data packet as "1" and a lost data packet as "0"). In the example shown in fig. 2, when the reception apparatus 200 receives P22 but has still not received P21, it judges that P21 is lost and marks "0" in the bitmap. Later, when the reception apparatus 200 receives P26 but has still not received P25, it judges that P25 is lost.
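The bitmap bookkeeping in this example can be sketched as follows. The functions `record` and `missing_below_highest` are hypothetical names, and an integer is used as the bitmap; PSN wraparound is ignored for brevity.

```python
def record(bitmap, base_psn, psn):
    """Set the bit for psn; bits of PSNs not yet seen stay 0."""
    return bitmap | (1 << (psn - base_psn))

def missing_below_highest(bitmap, base_psn):
    """PSNs still marked 0 that lie below the highest PSN received so far."""
    highest = bitmap.bit_length() - 1
    return [base_psn + i for i in range(highest) if not (bitmap >> i) & 1]

bitmap = 0
for psn in (20, 22, 23, 24, 26):  # P21 and P25 never arrive
    bitmap = record(bitmap, 20, psn)

missing = missing_below_highest(bitmap, 20)
```

After P26 is recorded, the zero bits below it correspond to P21 and P25, matching the losses judged in the example above.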
The transmitting device 100 receives an ACK, i.e., a notification message 201, which notifies the transmitting device 100 that P21 and P25 were not received at the receiving device 200. In response to the ACK, the transmitting apparatus 100 transmits to the receiving apparatus 200 the data packet P29 as a retransmission of P21 and the data packet P30 as a retransmission of P25. For example, P21 may be the specific lost data packet 1011 comprising the first message 1012 shown in fig. 1, and P29 may be the retransmitted data packet 102, which carries the same first message 1012 but has a new PSN (29).
Fig. 3 shows another signaling flow diagram between an originating node (i.e., transmitting device 100) and a target node (i.e., receiving device 200) provided by an embodiment of the present invention. Similarly, the transmitting apparatus 100 may be the transmitting apparatus shown in fig. 1, and the receiving apparatus 200 may be the receiving apparatus shown in fig. 1. This example shows packet loss. From the sender's point of view, a data packet may be lost in two different ways: either the data packet is lost en route and never reaches the receiver, or it does reach the receiver but the notification message 201 for the data packet is lost on its way back to the sender.
In particular, fig. 3 emphasizes that the XID of a data packet does not change during retransmission, whereas the PSN does. In this embodiment, the data packet P21 does not reach the reception apparatus 200. When the transmission apparatus 100 receives from the reception apparatus 200 the notification message A22 acknowledging reception of P22, but has still not received a notification message 201 acknowledging reception of P21, the transmission apparatus 100 determines that P21 is lost and promptly transmits P24, which carries the same message as P21. As shown in fig. 3, the XID of the retransmitted data packet P24 is still "301", the same as the XID of the lost data packet P21. After the fast retransmission of the lost data packet, the transmitting apparatus 100 continues to transmit normal data packets.
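The retransmission rule of fig. 3 (same XID, new PSN) can be sketched as follows. The `Sender` class and the XIDs chosen for P20, P22 and P23 are illustrative assumptions; only the XID "301" of P21 is taken from the example above.

```python
import itertools

class Sender:
    def __init__(self, first_psn):
        self._psn = itertools.count(first_psn)
        self.outstanding = {}  # PSN -> XID of data packets in flight

    def send(self, xid):
        psn = next(self._psn)  # every transmission gets a fresh PSN
        self.outstanding[psn] = xid
        return psn

    def retransmit(self, lost_psn):
        xid = self.outstanding.pop(lost_psn)  # the XID is reused unchanged
        return self.send(xid)

s = Sender(first_psn=20)
psns = [s.send(xid) for xid in (300, 301, 302, 303)]  # P20..P23
new_psn = s.retransmit(21)                             # P21 lost, resent as P24
```

Because the retransmission has its own PSN, a loss of P24 itself could later be detected by the same gap rule, without waiting for a timeout.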
Fig. 4 shows another signaling flow diagram between an originating node (i.e., transmitting device 100) and a target node (i.e., receiving device 200) provided by an embodiment of the present invention. Specifically, fig. 4 shows another loss situation, in which notification message 201 is lost.
In the present embodiment, the transmitting apparatus 100 transmits the data packets P20 to P23 to the receiving apparatus 200. All the data packets P20 to P23 arrive at the reception apparatus 200, and the reception apparatus 200 transmits a notification message 201 (ACK) for each of them, but A22 is lost and does not arrive at the transmission apparatus 100. When the transmitting device 100 receives the ACK A23 from the receiving device 200, but has still not received the ACK A22, the transmitting device 100 considers P22 to be "lost" and promptly transmits P25, which carries the same message as P22. As shown in fig. 4, the XID of the retransmitted data packet P25 is still "302", the same as the XID of the data packet P22.
It should be noted that the retransmission scheme adopted by embodiments of the present invention may have two complementary mechanisms: PSN-based and timer-based. PSN-based retransmission relies on a unique, strictly monotonically increasing PSN assigned to each transmitted data packet, and quickly identifies lost data packets when a "gap" is detected in the PSNs reported back by the notification message 201. Timer-based retransmission is similar to the traditional way of triggering retransmission events: if no notification message 201 is received within a configurable period of time, all outstanding data packets are retransmitted. An outstanding data packet is defined as a data packet that has been sent but for which no notification message 201 has been received.
Fig. 5 illustrates a receiving device 200 for RDMA provided by an embodiment of the present invention. The receiving device 200 may include processing circuitry (not shown) for performing, conducting, or initiating various operations of the receiving device 200 described herein. The processing circuitry may include hardware and software. The hardware may include analog circuitry or digital circuitry, or both analog and digital circuitry. Digital circuitry may include components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The receiving device 200 may also include a memory circuit that stores one or more instructions that may be executed by a processor or processing circuit (specifically, under control of software). For example, the memory circuit may include a non-transitory storage medium storing executable software code that, when executed by a processor or processing circuit, causes the receiving device 200 to perform various operations. In one embodiment, the processing circuitry includes one or more processors and a non-transitory memory coupled to the one or more processors. The non-transitory memory may carry executable program code that, when executed by the one or more processors, causes the receiving device 200 to perform, conduct, or initiate the operations or methods described herein.
Specifically, the receiving apparatus 200 is configured to receive the data packet sequence 101 from the transmitting apparatus 100 for RDMA. The transmitting apparatus 100 here may be the transmitting apparatus 100 shown in fig. 1. Specifically, each received data packet in the sequence of data packets 101 is associated with a PSN and carries a message. Further, the receiving device 200 is configured to send a notification message 201 for the sequence of data packets 101 to the sending device 100, wherein the notification message 201 indicates which data packets of the sequence of data packets 101 were received at the receiving device 200.
The embodiment of the present invention also provides a receiving apparatus 200 that operates in accordance with the transmitting apparatus 100 previously described in the present invention. It is worth mentioning that in the present invention, the notification message 201 (ACK) supports aggregating notifications for multiple data packets, and supports "selective ACK" (SACK) reporting of received and lost data packets. The number of data packets reported in this way (i.e., the size of the ACK data structure) is an implementation parameter.
According to one embodiment of the invention, each received data packet includes an XID identifying the message carried in the data packet.
According to one embodiment of the invention, each data packet in the sequence of data packets 101 further comprises a flow block ID identifying a flow block, wherein the flow block comprises a plurality of data packets routed through the same network route.
According to one embodiment of the invention, each notification message 201 indicates a data packet that includes the same stream block ID.
According to one embodiment of the invention, the receiving device 200 may be configured to maintain a first data structure storing the PSN of the received data packet. The receiving device 200 may also be configured to update the first data structure after receiving the data packet from the transmitting device 100.
According to one embodiment of the invention, the receiving device 200 may be configured to: if the PSN of the received data packet cannot be recorded in the first data structure, discarding the received data packet. For example, if there is no resource at the receiving device 200 to store or mark a new PSN, the packet may be discarded.
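A bounded first data structure of this kind can be sketched as follows. The `PsnWindow` class and its fixed-size window are assumptions made for illustration; the text only states that a data packet is discarded when its PSN cannot be recorded.

```python
class PsnWindow:
    def __init__(self, base_psn, size):
        self.base_psn = base_psn
        self.size = size  # number of PSNs the structure can track
        self.bits = 0

    def try_record(self, psn):
        """Return True if the packet is kept, False if it must be discarded."""
        offset = psn - self.base_psn
        if not 0 <= offset < self.size:
            return False  # no room to mark this PSN
        self.bits |= 1 << offset
        return True

window = PsnWindow(base_psn=20, size=8)  # can record PSNs 20..27
kept = window.try_record(22)
dropped = window.try_record(28)
```

A data packet discarded in this way is never acknowledged, so the sender would eventually treat it as lost and retransmit it under a new PSN.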
According to one embodiment of the invention, the receiving device 200 may be configured to send a notification message based on the first data structure.
Optionally, the receiving device 200 may maintain a "received XID" state. When a message is received for the first time, its XID is stored in this state. If another copy of the message is received later, the copy is not accepted. That is, the reception apparatus 200 may also be configured to determine whether each received data packet is a duplicate by checking whether the XID included in the received data packet is already stored in this state, and to ignore any duplicate data packet. Duplicates may be acknowledged, but they are not re-executed. In this way, each message is guaranteed to be accepted by the receiving device 200 exactly once.
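The duplicate check can be sketched as follows. The `Receiver` class, its fields, and the message payload are hypothetical; the sketch keeps the set of received XIDs in memory and returns the PSN so that an ACK can be sent whether or not the message was accepted.

```python
class Receiver:
    def __init__(self):
        self.seen_xids = set()  # the "received XID" state
        self.delivered = []     # messages accepted so far

    def on_packet(self, psn, xid, message):
        """Return (psn, accepted); an ACK is sent for psn either way."""
        accepted = xid not in self.seen_xids
        if accepted:
            self.seen_xids.add(xid)
            self.delivered.append(message)
        return psn, accepted

r = Receiver()
first = r.on_packet(22, 302, "write A")   # original P22
second = r.on_packet(25, 302, "write A")  # retransmission P25, same XID
```

This corresponds to the fig. 4 scenario, where P22 did arrive and the retransmission P25 is therefore acknowledged but not executed again.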
According to one embodiment of the invention, each notification message 201 indicates a packet for which the PSN is within a particular range.
In summary, embodiments of the present invention enable out-of-order completion. The present invention uses XID and PSN for each packet sent. The retransmitted data packets carry the same XID, but the PSNs are different. This may speed up detection of lost retransmitted data packets, so that "fast retransmission" of all lost data packets occurs, even if the data packets have been retransmitted. "fast retransmission" means that there is no need to wait for a timeout event. In addition, the present invention allows the data packet to be retransmitted over any stream block regardless of the stream block used by the last transmission of the data packet.
FIG. 6 illustrates a method 600 for RDMA provided by an embodiment of the present invention. In certain embodiments of the invention, the method 600 may be performed by the transmitting device 100 shown in fig. 1 or 5. The method 600 includes a step 601 of transmitting a sequence of data packets 101 to a receiving device 200 for RDMA. Specifically, each transmitted data packet in the sequence of data packets 101 is associated with a PSN and carries a message. The method 600 further comprises a step 602 of determining whether a specific data packet 1011 in the data packet sequence 101 is received at the receiving device 200, wherein said specific data packet 1011 comprises a first message 1012. Then, the method 600 further comprises step 603, after determining that the specific data packet is lost at the receiving device 200, retransmitting the specific data packet to the receiving device 200 as a next step, wherein the retransmitted data packet 102 carries the same first message 1012 as the specific data packet 1011 and is associated with a new PSN.
FIG. 7 illustrates a method 700 for RDMA provided by an embodiment of the present invention. In certain embodiments of the invention, the method 700 is performed by the receiving device 200 shown in fig. 1 or 5. The method 700 includes step 701, receiving a sequence of data packets 101 from a sending device 100 for RDMA. Specifically, each received data packet of the sequence of data packets 101 is associated with a PSN and includes an XID that identifies the message carried in the data packet. The method 700 further comprises a step 702 of sending a notification message 201 for the sequence of data packets 101 to the sending device 100, wherein the notification message 201 indicates which data packets of the sequence of data packets 101 were received at the receiving device 200.
The invention has been described in connection with various embodiments and implementations as examples. However, other variations can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims and in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Furthermore, any of the methods provided by the embodiments of the present invention may be implemented in a computer program having code means which, when run by a processing module, causes the processing module to perform the method steps. The computer program is embodied in a computer readable medium of a computer program product. A computer readable medium may include essentially any memory, such as read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), or a hard disk drive.
Further, those skilled in the art will recognize that embodiments of the transmitting apparatus 100 and the receiving apparatus 200 include the necessary communication capabilities in the form of functions, devices, units, elements, etc. for performing the scheme. Examples of other such devices, units, elements and functions are: processors, memories, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selection units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiving units, transmitting units, DSPs, trellis-coded modulation (TCM) encoders, TCM decoders, power supply units, power supply feeders, communication interfaces, communication protocols, etc., suitably arranged together to perform the scheme.
In particular, the one or more processors of the transmitting device 100 and the receiving device 200 may include, for example, one or more of the following: a central processing unit (CPU), a processing unit, a processing circuit, a processor, an application-specific integrated circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions. The expression "processor" may thus denote processing circuitry comprising a plurality of processing circuits, e.g. any, some or all of the items listed above. The processing circuitry may also perform data processing functions for inputting, outputting, and processing data, including data buffering and device control functions, such as invoking process controls, user interface controls, and the like.

Claims (24)

1. A transmitting device (100) for remote direct memory access (RDMA), the transmitting device (100) being configured to:
transmit a sequence of data packets (101) to a receiving device (200) for RDMA, wherein each transmitted data packet of the sequence of data packets (101) is associated with a packet sequence number (PSN) and carries a message;
determine whether a specific data packet (1011) of said sequence of data packets (101) is received at said receiving device (200), wherein said specific data packet (1011) comprises a first message (1012);
after determining that the specific data packet is lost at the receiving device (200), retransmit the specific data packet to the receiving device (200) as a next step, wherein the retransmitted data packet (102) carries the same first message (1012) as the specific data packet (1011) and is associated with a new PSN.
2. The transmitting device (100) according to claim 1, characterized in that:
each transmitted data packet includes a transaction identifier XID, the XID identifying the message carried in the data packet;
the specific data packet (1011) comprises a first XID for identifying the first message (1012) carried in the specific data packet (1011);
the retransmitted data packet (102) includes the same first XID as the specific data packet (1011).
3. The transmitting device (100) according to claim 1 or 2, characterized by being configured to:
assigning a flow block ID identifying a flow block to each transmitted data packet and/or each retransmitted data packet, wherein the flow block comprises a plurality of data packets routed through the same network route, and each transmitted data packet and/or retransmitted data packet further comprises the flow block ID;
each transmitted data packet and/or retransmitted data packet is transmitted to the receiving device (200) through the flow block.
4. A transmitting device (100) according to claim 3, characterized in that the flow block ID assigned to the retransmitted data packet is different from the flow block ID assigned to the transmitted data packet, wherein the transmitted data packet and the retransmitted data packet are routed through different flow blocks identified by the flow block IDs of the transmitted data packet and the retransmitted data packet, respectively.
5. The transmitting device (100) according to any one of claims 1 to 4, characterized by being configured to:
receiving a notification message (201) for one or more transmitted data packets of the sequence of data packets (101) from the receiving device (200), wherein the notification message (201) indicates whether the transmitted data packets are received at the receiving device (200);
according to the notification message (201), it is determined whether the specific data packet in the sequence of data packets (101) is received at the receiving device (200).
6. The transmitting device (100) according to claim 5, characterized by being configured to:
upon receiving a notification message indicating that a transmitted data packet is first received, a completion signal for the message is generated, wherein the transmitted data packet includes an XID identifying the message.
7. The transmitting device (100) according to claim 5 or 6, characterized by being configured to:
when a notification message indicating that a data packet having a PSN greater than X is received at the receiving apparatus (200) is received from the receiving apparatus (200) before a notification message indicating that a specific data packet (1011) having the PSN X in the data packet sequence (101) is received at the receiving apparatus (200) is received, it is determined that the specific data packet (1011) having the PSN X is lost at the receiving apparatus (200), wherein X is a non-negative integer.
8. The transmitting device (100) according to any of claims 5 to 7, wherein each notification message indicates the transmitted data packets for which the PSN is within a specific range.
9. The transmitting device (100) according to any of claims 5 to 8, wherein each notification message indicates the transmitted data packets comprising the same flow block ID.
10. The transmitting device (100) according to any one of claims 5 to 9, characterized by being configured to:
if tracking of a time period after transmitting a data packet has not been started, starting to track the time period;
after receiving the notification message, resetting the tracking of the time period.
11. The transmitting device (100) according to claim 10, characterized by being configured to: when a notification message indicating that the data packet having the PSN X is received at the receiving apparatus (200) is not received before the expiration of the period of time, it is determined that the specific data packet (1011) having the PSN X is lost at the receiving apparatus (200).
12. The transmitting device (100) according to claim 10 or 11, characterized by being configured to:
the time period is set according to the receiving device (200).
13. The transmitting device (100) according to any of claims 1 to 12, wherein the message carried in each data packet is limited to fit into a single network maximum transmission unit.
14. A receiving device (200) for remote direct memory access (RDMA), the receiving device (200) being configured to:
receive a sequence of data packets (101) from a sending device (100) for RDMA, wherein each received data packet in the sequence of data packets (101) is associated with a packet sequence number (PSN) and carries a message;
transmit a notification message (201) for the sequence of data packets (101) to the transmitting device (100), wherein the notification message (201) indicates which data packets of the sequence of data packets (101) are received at the receiving device (200).
15. The receiving device (200) according to claim 14, characterized in that:
each received data packet includes a transaction identifier XID, which is used to identify the message carried in the data packet.
16. The receiving device (200) according to claim 14 or 15, wherein each data packet in the sequence of data packets (101) further comprises a flow block ID identifying a flow block, wherein the flow block comprises a plurality of data packets routed through the same network route.
17. The receiving device (200) of claim 16, wherein each notification message (201) indicates data packets comprising the same flow block ID.
18. The receiving device (200) according to any one of claims 14 to 17, characterized by being adapted to:
maintaining a first data structure storing the PSN of the received data packet;
the first data structure is updated after receiving a data packet from the transmitting device (100).
19. The receiving device (200) according to claim 18, characterized by being adapted to:
if the PSN of the received data packet cannot be recorded in the first data structure, discarding the received data packet.
20. The receiving device (200) according to claim 18 or 19, characterized by being adapted to:
sending the notification message based on the first data structure.
21. The receiving device (200) according to any of claims 14 to 20, wherein each notification message (201) indicates a data packet for which the PSN is within a specific range.
22. A method (600) for remote direct memory access (RDMA), the method comprising:
transmitting (601) a sequence of data packets (101) to a receiving device (200) for RDMA, wherein each transmitted data packet of the sequence of data packets (101) is associated with a packet sequence number (PSN) and carries a message;
determining (602) whether a specific data packet (1011) of the sequence of data packets (101) is received at the receiving device (200), wherein the specific data packet (1011) comprises a first message (1012);
after determining that the specific data packet is lost at the receiving device (200), as a next step, retransmitting (603) the specific data packet to the receiving device (200), wherein the retransmitted data packet (102) carries the same first message (1012) as the specific data packet (1011) and is associated with a new PSN.
23. A method (700) for remote direct memory access (RDMA), the method comprising:
receiving (701) a sequence of data packets (101) from a sending device (100) for RDMA, wherein each received data packet in the sequence of data packets (101) is associated with a packet sequence number (PSN) and comprises a transaction identifier XID identifying a message carried in the data packet;
transmitting (702) a notification message (201) for the sequence of data packets (101) to the transmitting device (100), wherein the notification message (201) indicates which data packets of the sequence of data packets (101) are received at the receiving device (200).
24. A computer program product, characterized in that it comprises program code for performing the method according to claim 22 or 23 when executed on a processor.
CN202180096824.1A 2021-08-04 2021-08-04 Apparatus and method for remote direct memory access Pending CN117203627A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/071749 WO2023011712A1 (en) 2021-08-04 2021-08-04 A device and method for remote direct memory access

Publications (1)

Publication Number Publication Date
CN117203627A true CN117203627A (en) 2023-12-08

Family

ID=77358263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180096824.1A Pending CN117203627A (en) 2021-08-04 2021-08-04 Apparatus and method for remote direct memory access

Country Status (2)

Country Link
CN (1) CN117203627A (en)
WO (1) WO2023011712A1 (en)

Also Published As

Publication number Publication date
WO2023011712A1 (en) 2023-02-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination