WO2022048647A1 - Method and related apparatus for RoCE network congestion control - Google Patents

Method and related apparatus for RoCE network congestion control

Publication number
WO2022048647A1
Authority
WO
WIPO (PCT)
Prior art keywords
congestion
network
information
network device
message
Application number
PCT/CN2021/116494
Inventor
韩兆皎
白志君
钱远盼
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP21863709.8A (EP4195760A4)
Publication of WO2022048647A1
Priority to US18/178,117 (US20230208771A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/127 Avoiding congestion; Recovering from congestion by using congestion prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/27 Evaluation or update of window size, e.g. using information derived from acknowledged [ACK] packets
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W28/00 Network traffic management; Network resource management
    • H04W28/02 Traffic management, e.g. flow control or congestion control
    • H04W28/0289 Congestion control
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/11 Identifying congestion
    • H04L47/115 Identifying congestion using a dedicated packet
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/122 Avoiding congestion; Recovering from congestion by diverting traffic away from congested entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/25 Flow control; Congestion control with rate being modified by the source upon detecting a change of network conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/26 Flow control; Congestion control using explicit feedback to the source, e.g. choke packets
    • H04L47/263 Rate modification at the source after receiving feedback
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W28/00 Network traffic management; Network resource management
    • H04W28/16 Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]
    • H04W28/18 Negotiating wireless communication parameters
    • H04W28/22 Negotiating communication rate
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/28 Flow control; Congestion control in relation to timing considerations
    • H04L47/283 Flow control; Congestion control in relation to timing considerations in response to processing delays, e.g. caused by jitter or round trip time [RTT]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/30 Flow control; Congestion control in combination with information about buffer occupancy at either end or at transit nodes

Definitions

  • the present application relates to the field of communication technologies, and in particular, to a method and related apparatus for RoCE network congestion control.
  • RDMA remote direct memory access
  • RDMA over Converged Ethernet is a type of RDMA technology that allows servers to perform remote direct memory access over Ethernet.
  • RoCE RDMA over Converged Ethernet
  • Although the advantages of the RoCE protocol derive mainly from the characteristics of converged Ethernet, the RoCE protocol can also be applied to traditional Ethernet networks or non-converged Ethernet networks.
  • DCQCN specifies that the congestion point (Congestion Point, CP) device should perform explicit congestion notification (Explicit Congestion Notification, ECN) marking based on random early detection (Random Early Detection, RED).
  • CP Congestion Point
  • RED Random Early Detection
  • ECN explicit congestion notification
  • CNP Congestion Notification Packet
  • the network congestion notification of the RoCE protocol adopts a separate CNP message notification method. Therefore, after network congestion occurs, the network card at the receiving end needs to continuously send ACK confirmation messages and CNP messages. However, since there is an upper limit on the packet sending rate of the network card at the receiving end, it may cause a delay in sending the congestion notification, which will result in a slow response speed of the source-side congestion control.
  • CNP packets can only notify the source end that the network is congested; they cannot announce that the network congestion has been eliminated. The source end can only detect periodically whether the congestion has been eliminated, which affects the effective utilization of network bandwidth.
  • the embodiments of the present application provide a RoCE network congestion control method and related device, which can timely notify the occurrence and elimination of network congestion, improve the congestion control response speed of the source end, and improve network bandwidth utilization.
  • The present application provides a RoCE network congestion control method. The method includes: a first network device sends a RoCE protocol packet to a second network device; the first network device receives an acknowledgment packet from the second network device, where the acknowledgment packet includes acknowledgment information for the RoCE protocol packet and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested; the first network device performs congestion control based on the acknowledgment packet.
  • the first network device and the second network device are hardware designed to allow computing devices to communicate on the network, and both support the network communication of the RoCE protocol.
  • A RoCE (RDMA over Converged Ethernet) network uses the Remote Direct Memory Access (RDMA) network protocol.
  • The first network device and the second network device can each be, for example, an RDMA network interface controller (RNIC), a network interface controller, a network adapter, a network interface card, or a local area network adapter (LAN adapter).
  • at least one of the first network device and the second network device may also be a switch device.
  • The first network device can be set on the source device, and the second network device can be set on the destination device, so the source device and the destination device can read, write, and transfer data remotely based on the communication interaction between the first network device and the second network device.
  • The source device can send a packet through the first network device, and the destination device can reply through the second network device with ACK confirmation information aggregated with the indication information. The indication information is used to notify the first network device whether there is network congestion on the current network path.
  • the first network device can obtain the current network state, that is, whether the current network is congested or not, and the first network device can perform a corresponding congestion control operation based on the current network state. For example, when the network is not congested, the first network device can maintain or restore the high-speed transmission rate in time.
  • Carrying the indication information and the ACK in one aggregated packet avoids the disadvantage of sending an independent CNP in the traditional solution, reduces notification overhead, helps reduce the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device.
  • The source device can learn of network congestion immediately and trigger congestion control earlier to regulate the sending rate, which improves the response speed of the source device.
  • When network congestion is eliminated, the source device can also be informed through the indication information and restore its sending rate in time, which improves network bandwidth utilization.
  • When the indication information indicates that the network path is congested, the acknowledgment packet further includes congestion information, which specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. The first network device performing congestion control based on the confirmation message specifically includes: the first network device performs congestion control based on the congestion information.
  • When there is network congestion on the current network path, the confirmation message can also carry congestion information, which represents the detailed network status.
  • The first network device of the source device can extract the congestion information to perform quantitative and diversified congestion control actions. On the one hand, carrying the indication information, the congestion information, and the ACK in one aggregated packet avoids sending independent CNPs, reduces notification overhead, reduces the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device.
  • On the other hand, existing RDMA network congestion control carries little notification information and converges slowly, whereas this application uses aggregated packets to carry detailed congestion information in different dimensions, such as congestion level, congestion point, queue depth, and network delay. This helps the first network device perform diversified, differentiated, and specific congestion control based on the detailed congestion information, such as adjusting the sending rate by level and adjusting the number of packets and the sending time, which greatly improves the effect of congestion control.
  • the first network device performing congestion control according to the congestion information includes at least one of the following manners:
  • The first network device can quantitatively adjust its packet sending rate in the next time window according to the congestion degree. In a possible embodiment, the congestion degree is one of multiple congestion levels, and each level corresponds to a different sending rate. For example, given levels such as "no congestion, mild congestion, moderate congestion, and severe congestion", the first network device can decide how to reduce its rate according to the level. Because different levels correspond to different packet sending rates, the sending rate can be adjusted in graded steps, which achieves faster rate convergence.
  • the first network device may determine the number of packets to be sent in the next time window according to at least one of the congestion position and the packet queue depth.
  • The RNIC of the source device can determine, according to the congestion location and/or the packet queue depth, how many data packets the network path can carry without causing packet loss, and thereby determine the number of packets that can continue to be sent. This is friendlier to network applications with high bandwidth requirements.
  • the first network device may adjust the sending rate of the first network device or the number of packets to be sent in the next time window according to the network delay.
  • The present application uses aggregated packets to carry detailed congestion information in different dimensions, such as congestion degree, congestion point, queue depth, and network delay, so that congestion control is diversified, differentiated, and specific, which greatly improves the effect of congestion control.
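The three adjustment manners above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: the names `CongestionLevel`, `CongestionInfo`, and `next_window`, as well as the rate factors, the queue capacity, and the target delay, are hypothetical. The source states only that each congestion level corresponds to a different sending rate.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class CongestionLevel(IntEnum):
    NONE = 0
    LIGHT = 1
    MODERATE = 2
    HEAVY = 3


# Hypothetical rate factors: one sending-rate multiplier per congestion level.
RATE_FACTOR = {
    CongestionLevel.NONE: 1.0,
    CongestionLevel.LIGHT: 0.75,
    CongestionLevel.MODERATE: 0.5,
    CongestionLevel.HEAVY: 0.25,
}


@dataclass
class CongestionInfo:
    """Congestion information extracted from an acknowledgment (any field may be absent)."""
    level: Optional[CongestionLevel] = None
    queue_depth: Optional[int] = None   # packets queued at the congestion point
    delay_us: Optional[float] = None    # measured network delay in microseconds


def next_window(line_rate: float, base_packets: int, info: CongestionInfo,
                queue_capacity: int = 1000,
                target_delay_us: float = 50.0) -> Tuple[float, int]:
    """Return (sending rate, packet budget) for the next time window."""
    rate, packets = line_rate, base_packets
    if info.level is not None:
        # Manner 1: graded rate adjustment by congestion level.
        rate = line_rate * RATE_FACTOR[info.level]
    if info.queue_depth is not None:
        # Manner 2: cap the packet count by the remaining queue headroom.
        packets = min(packets, max(queue_capacity - info.queue_depth, 0))
    if info.delay_us is not None and info.delay_us > target_delay_us:
        # Manner 3: scale the rate down when delay exceeds the target.
        rate *= target_delay_us / info.delay_us
    return rate, packets
```

For example, `next_window(100.0, 64, CongestionInfo(level=CongestionLevel.MODERATE))` halves the rate while leaving the packet budget untouched.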
  • The acknowledgment packet specifically includes a basic transport header (BTH) field and an extension field; the acknowledgment information and the indication information are carried in the BTH field, and the congestion information is carried in the extension field.
  • BTH basic transport header
  • the extension field is, for example, the congestion control extended transport header (Congestion Extended Transport Header, CETH) described herein, that is, the congestion information can be carried by the extended CETH header.
  • CETH Congestion Extended Transport Header
  • the CETH includes two parts, standard definition and manufacturer-defined information, where the standard definition part can be used for compatibility and docking in mixed networking scenarios.
  • The standard definition part may include the following fields: version number (Ver) and CETH header length (Length).
  • the manufacturer-defined information field is used to support the user-defined congestion control advertisement information of each manufacturer, and the total length is, for example, (Length*4-1) bytes.
  • In the manufacturer-defined part, the manufacturer can define fields that carry at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path.
  • The congestion level of the network path can be represented by a 2-bit ratio field, which is used to identify the congestion level.
  • The ratio field can indicate the degree of congestion in graded levels, such as: no congestion, light congestion, moderate congestion, and heavy congestion.
  • a 1-bit field is designed to indicate whether the current congestion advertisement is a common CNP type or an enhanced CNP type.
  • a 4-bit field can be designed to identify service scenarios, such as RC/XRC write/send scenarios, RC/XRC read response scenarios, or UD send scenarios.
  • the Ver field is used to indicate the CETH version number, for example, it can occupy 4 bits, which is used to support the upgrade of the congestion control algorithm and the compatibility docking.
  • the Length field is used to indicate the length of the CETH header. For example, it occupies 4 bits.
  • the variable length of the CETH header is supported to reduce fixed overhead.
  • Designing the extension field CETH ensures that congestion information can be carried without occupying the space of existing fields, provides compatibility in hybrid networking scenarios, and supports manufacturer-specific congestion control notification information, which helps meet the needs of different manufacturers.
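The CETH layout described above can be illustrated with a small pack/parse sketch. The source gives only the field widths (4-bit Ver, 4-bit Length, 2-bit ratio, 1-bit type, 4-bit scenario) and the vendor part size of (Length*4-1) bytes. Everything else here is an assumption made for the example: that Ver occupies the high nibble of the first byte, that Length counts 4-byte words, the bit positions inside the first vendor-defined byte, and the function names `pack_ceth` and `parse_ceth`.

```python
def pack_ceth(ver: int, ratio: int, enhanced: bool, scenario: int) -> bytes:
    """Build a 4-byte CETH header: 1 byte of Ver/Length plus 3 vendor-defined bytes."""
    if not (0 <= ver <= 0xF and 0 <= ratio <= 0x3 and 0 <= scenario <= 0xF):
        raise ValueError("field value out of range")
    # Hypothetical vendor byte layout: ratio[7:6] | type[5] | scenario[4:1] | reserved[0]
    flags = (ratio << 6) | (int(enhanced) << 5) | (scenario << 1)
    vendor = bytes([flags, 0, 0])           # padded to Length*4 - 1 = 3 bytes
    length_words = (1 + len(vendor)) // 4   # header length in 4-byte words (assumed unit)
    return bytes([(ver << 4) | length_words]) + vendor


def parse_ceth(buf: bytes) -> dict:
    """Decode the standard part (Ver, Length) and the hypothetical vendor byte."""
    if not buf:
        raise ValueError("empty buffer")
    ver, length_words = buf[0] >> 4, buf[0] & 0xF
    if length_words == 0 or len(buf) < length_words * 4:
        raise ValueError("truncated CETH header")
    vendor = buf[1:length_words * 4]        # (Length*4 - 1) vendor-defined bytes
    flags = vendor[0]
    return {
        "ver": ver,
        "length_words": length_words,
        "ratio": flags >> 6,                # 2-bit congestion level
        "enhanced": bool((flags >> 5) & 0x1),  # common vs. enhanced CNP type
        "scenario": (flags >> 1) & 0xF,     # 4-bit service scenario
        "vendor_len": len(vendor),
    }
```

Because Length is carried in the header itself, a parser that does not understand a vendor's fields can still skip the header, which is one way the standard part could support mixed networking.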
  • the confirmation message specifically includes a BTH field, and the confirmation information, the indication information, and the congestion information are all carried in the BTH field.
  • The reserved field "reserved 6" in the BTH of the traditional ACK may be used to carry the congestion information of the embodiment of the present application; that is, "reserved 6" is used as the first CETH to carry the related data, thereby aggregating congestion control and transmission acknowledgment.
  • The reserved field "reserved 7" in the BTH of the traditional ACK may be used to carry the congestion information of this embodiment of the present application; that is, "reserved 7" is used as the second CETH to carry the related data, thereby aggregating congestion control and transmission acknowledgment.
  • the indication information and the congestion information can be integrated and set into the field of the confirmation information.
  • The confirmation message can be regarded as an improvement of the traditional ACK message: the existing field space in the confirmation message is fully used to carry the congestion information, which avoids changing the existing message format.
  • When the indication information indicates that the network path is not congested, the first network device performing congestion control based on the acknowledgment packet includes: the first network device keeps its sending rate unchanged, for example, maintains a relatively high sending rate, thereby improving packet transmission efficiency.
  • When the indication information indicates that the network path is not congested, the first network device performing congestion control based on the acknowledgment packet includes: the first network device sets its sending rate to a preset rate. For example, it adjusts from a low sending rate (used, for example, when the network is congested) to a higher sending rate (used, for example, when the network is not congested), thereby restoring the sending rate more efficiently and reducing the delay in sending packets.
  • the indication information may be an indication bit, an indication field, an indication flag, or the like.
  • When the indication information is an indication bit, a value of 0 indicates to the RNIC of the source device that there is no network congestion on the current network path, and the confirmation message does not carry congestion information; a value of 1 means that the destination device indicates to the RNIC of the source device that there is network congestion on the current network path, and the confirmation packet carries the congestion information.
  • the indication information can use existing fields to redefine functions.
  • The indication information can be the BTH.BECN field of the confirmation packet. When the BTH.BECN field is 0, it indicates to the RNIC of the source device that there is no network congestion on the current network path, and the confirmation message does not carry congestion information; when the BTH.BECN field is 1, it indicates to the RNIC of the source device that there is network congestion on the current network path, and the confirmation message carries congestion information.
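The source-side reaction to the indication bit can be sketched as below. This is a simplified model under stated assumptions: the `Ack` and `Sender` classes are hypothetical, and the halving rule applied when the bit is 1 is only a placeholder, since the source leaves the exact rate reduction to the congestion information carried in the acknowledgment.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    """Simplified acknowledgment: only the fields needed for this sketch."""
    psn: int    # packet sequence number being acknowledged
    becn: int   # indication bit: 0 = path not congested, 1 = path congested


class Sender:
    def __init__(self, preset_rate: float, min_rate: float = 1.0):
        self.preset_rate = preset_rate   # rate used when the path is not congested
        self.min_rate = min_rate
        self.rate = preset_rate

    def on_ack(self, ack: Ack) -> float:
        if ack.becn == 0:
            # Not congested: keep the rate if already at the preset value,
            # otherwise restore it immediately instead of probing periodically.
            self.rate = self.preset_rate
        else:
            # Congested: placeholder reduction (hypothetical halving rule).
            self.rate = max(self.rate / 2, self.min_rate)
        return self.rate
```

Since every acknowledgment carries the bit, the sender learns that congestion has cleared from the first ACK with `becn == 0`.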
  • the present application provides a method for RoCE network congestion control, the method comprising:
  • The second network device receives the RoCE protocol message from the first network device; the second network device checks whether the RoCE protocol message carries an explicit congestion notification; the second network device generates a confirmation message according to the check result, where the confirmation message includes confirmation information for the RoCE protocol message and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested; the second network device sends the confirmation message to the first network device, where the confirmation message is used by the first network device to perform congestion control.
  • The destination device can check, through the second network device, whether the RoCE protocol packet carries an explicit congestion notification, and then reply with ACK confirmation information aggregated with indication information. The indication information is used to notify the first network device whether there is network congestion on the current network path.
  • the first network device can obtain the current network state, that is, whether the current network is congested or not, and the first network device can perform a corresponding congestion control operation based on the current network state. For example, when the network is not congested, the first network device can maintain or restore the high-speed transmission rate in time.
  • Carrying the indication information and the ACK in one aggregated packet avoids the disadvantage of sending an independent CNP in the traditional solution, reduces notification overhead, helps reduce the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device.
  • The source device can learn of network congestion immediately and trigger congestion control earlier to regulate the sending rate, which improves the response speed of the source device.
  • When network congestion is eliminated, the source device can also be informed through the indication information and restore its sending rate in time, which improves network bandwidth utilization.
  • When the indication information indicates that the network path is congested, the acknowledgment packet further includes congestion information, which specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. The congestion information is used by the first network device to perform congestion control. It helps the first network device perform diversified, differentiated, and specific congestion control based on the detailed congestion information, such as adjusting the sending rate by level and adjusting the number of packets and the sending time, which greatly improves the effect of congestion control.
  • the congestion degree belongs to one of multiple different levels of congestion degrees, and the different levels of congestion degrees respectively correspond to different sending rates of the first network device.
  • Before the second network device generates the confirmation message according to the check result, the method further includes: the second network device generates the congestion information.
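The destination-side steps (check the received packet for an explicit congestion notification, then build one aggregated acknowledgment) can be sketched as follows. The function `build_ack` and the dictionary representation of the acknowledgment are hypothetical. The only external fact relied on is that the IP ECN field value `0b11` is the Congestion Experienced codepoint.

```python
from typing import Optional

ECN_CE = 0b11  # "Congestion Experienced" codepoint of the 2-bit IP ECN field


def build_ack(psn: int, ip_ecn_bits: int,
              congestion_info: Optional[dict] = None) -> dict:
    """Aggregate the acknowledgment, the indication bit, and optional congestion info."""
    congested = ip_ecn_bits == ECN_CE
    ack = {"psn": psn, "becn": int(congested)}
    if congested and congestion_info is not None:
        # Congestion details are attached only when the path is congested.
        ack["congestion_info"] = congestion_info
    return ack
```

A single reply therefore serves both as the transport acknowledgment and as the congestion notification, so no separate CNP needs to be sent.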
  • When the current network of the second network device is congested, the device can obtain network status information in different dimensions, such as congestion degree, congestion point, queue depth, and network delay, through packet detection or hardware detection.
  • The second network device may obtain the congestion degree in either of the following ways:
  • The second network device determines the congestion degree of the network path according to the proportion of RoCE protocol packets carrying an explicit congestion notification in the historical packet reception record. For example, the second network device periodically computes sliding-window statistics on the proportion of received packets carrying the ECN flag, and from them calculates the current congestion degree of the network path.
  • The second network device obtains the congestion degree by means of in-band network telemetry (Inband Network Telemetry, INT) or in-situ operation administration and maintenance (In-situ Operation Administration and Maintenance, IOAM).
  • INT Inband Network Telemetry
  • IOAM In-situ Operation Administration and Maintenance
  • The range supported by INT can be extended to the server network card: the network card receives the measurement information that switches insert into data packets and calculates the current network state from it. For example, the network delay can be calculated from timestamps, and the congestion level can be calculated from the queue length and queue occupancy.
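The ECN-proportion approach can be sketched with a fixed-size sliding window over recently received packets. The class name `EcnRatioEstimator`, the window size, and the thresholds that map a marking ratio to one of four levels are assumptions chosen for illustration; the source does not specify them, and it describes the statistics as periodic, whereas this sketch updates per packet.

```python
from collections import deque
from typing import Deque, Tuple


class EcnRatioEstimator:
    """Tracks the share of ECN-marked packets among the last `window` packets."""

    def __init__(self, window: int = 100,
                 thresholds: Tuple[float, float, float] = (0.05, 0.25, 0.5)):
        self.marks: Deque[bool] = deque(maxlen=window)  # oldest entries drop out
        self.thresholds = thresholds  # hypothetical level boundaries

    def record(self, ecn_marked: bool) -> None:
        self.marks.append(bool(ecn_marked))

    def ratio(self) -> float:
        return sum(self.marks) / len(self.marks) if self.marks else 0.0

    def level(self) -> int:
        """0 = no congestion, 1 = light, 2 = moderate, 3 = heavy."""
        r = self.ratio()
        return sum(r >= t for t in self.thresholds)
```

The resulting level fits in a 2-bit field, so it could be reported directly as the congestion degree in the acknowledgment.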
  • The acknowledgment packet specifically includes a basic transport header (BTH) field and an extension field; the acknowledgment information and the indication information are carried in the BTH field, and the congestion information is carried in the extension field.
  • the extension field is, for example, the congestion control extended transport header (Congestion Extended Transport Header, CETH) described herein, that is, the congestion information can be carried by the extended CETH header.
  • Designing the extension field CETH ensures that congestion information can be carried without occupying the space of existing fields, provides compatibility in mixed networking scenarios, and supports manufacturer-specific congestion control notification information, which helps meet the needs of different manufacturers.
  • the confirmation message specifically includes a BTH field, and the confirmation information, the indication information, and the congestion information are all carried in the BTH field.
  • the existing field space is fully utilized to carry the congestion information, thereby realizing the full utilization of the field space in the acknowledgment information and avoiding the modification of the existing message format.
  • An embodiment of the present application provides an apparatus, which is applied to a first network device and includes: a message sending module, configured to send a RoCE protocol message to the second network device; a message receiving module, configured to receive an acknowledgment message from the second network device, where the acknowledgment message includes acknowledgment information for the RoCE protocol message and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested; and a congestion control module, configured to perform congestion control based on the confirmation message.
  • each functional module of the device is specifically used to implement the method steps described in the first aspect, which will not be repeated here.
  • The confirmation message further includes congestion information, which specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path; the congestion control module is specifically configured to perform congestion control according to the congestion information.
  • The congestion control module is specifically configured to: adjust the sending rate of the first network device according to the congestion degree, where there is a correspondence between the congestion degree and the sending rate; or determine the number of packets to be sent in the next time window according to at least one of the congestion location and the packet queue depth; or, according to the network delay, adjust the sending rate of the first network device or determine the number of packets to be sent in the next time window.
  • the congestion degree belongs to one of multiple different levels of congestion degrees, and different levels of congestion degrees correspond to different transmission rates respectively.
  • The acknowledgment message specifically includes a basic transport header (BTH) field and an extension field; the acknowledgment information and the indication information are carried in the BTH field, and the congestion information is carried in the extension field.
  • BTH basic transport header
  • the confirmation message specifically includes a BTH field, and the confirmation information, the indication information, and the congestion information are all carried in the BTH field.
  • The congestion control module is specifically configured to keep the sending rate of the first network device unchanged when the indication information indicates that the network path is not congested.
  • The congestion control module is specifically configured to set the sending rate of the first network device to a preset rate when the indication information indicates that the network path is not congested.
  • The present application provides an apparatus, which is applied to a second network device and includes: a message receiving module, configured to receive RoCE protocol messages from the first network device; a congestion information determination module, configured to check whether the RoCE protocol message carries an explicit congestion notification; and a notification aggregation sending module, configured to generate a confirmation message according to the check result, where the confirmation message includes confirmation information for the RoCE protocol message and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested. The notification aggregation sending module is also configured to send the confirmation message to the first network device, where the confirmation message is used by the first network device to perform congestion control.
  • each functional module of the apparatus is specifically used to implement the method steps described in the second aspect, which will not be repeated here.
  • The confirmation message further includes congestion information, which specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path; the congestion information is used by the first network device to perform congestion control.
  • the congestion degree belongs to one of multiple different levels of congestion degrees, and the different levels of congestion degrees respectively correspond to different sending rates of the first network device.
  • the congestion information determination module is further configured to generate the congestion information.
  • the congestion information determination module is specifically configured to: determine the congestion degree according to the proportion of received RoCE protocol messages that carry an explicit congestion notification in a historical message reception record; or obtain the congestion degree by means of in-band network telemetry (INT); or obtain the congestion degree by means of in-situ operations, administration, and maintenance (IOAM).
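  • The ratio-based option above can be illustrated with a minimal Python sketch. It assumes a sliding window over recently received packets and four congestion levels (none, light, moderate, heavy); the class name `EcnHistory`, the window size, and the thresholds are hypothetical choices for illustration and are not specified by the application.

```python
from collections import deque


class EcnHistory:
    """Sliding window over the most recently received RoCE packets (assumed design)."""

    def __init__(self, window=64):
        # Each entry is True if the corresponding packet carried an ECN mark.
        self.marks = deque(maxlen=window)

    def record(self, ecn_marked):
        self.marks.append(bool(ecn_marked))

    def marked_ratio(self):
        # Proportion of ECN-marked packets in the window; 0.0 when empty.
        if not self.marks:
            return 0.0
        return sum(self.marks) / len(self.marks)


def congestion_level(ratio, thresholds=(0.0, 0.25, 0.6)):
    """Map a marked-packet ratio to a level 0..3 (hypothetical thresholds)."""
    if ratio <= thresholds[0]:
        return 0  # no congestion
    if ratio <= thresholds[1]:
        return 1  # light congestion
    if ratio <= thresholds[2]:
        return 2  # moderate congestion
    return 3      # heavy congestion
```

  • In such a sketch the destination-side module would call `record()` for every received packet and derive the level from `marked_ratio()` when it builds the confirmation message.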
  • the acknowledgment packet specifically includes a base transport header (BTH) field and an extension field; the acknowledgment information and the indication information are carried in the BTH field, and the congestion information is carried in the extension field.
  • the confirmation message specifically includes a BTH field, and the confirmation information, the indication information, and the congestion information are all carried in the BTH field.
  • the present application provides a device, the device includes a host system and a first network device; the host system is configured to interact with the first network device to implement data transmission, and the first network device is configured to execute A method as described in any embodiment of the first aspect.
  • the present application provides a device, the device includes a host system and a second network device; the host system is configured to interact with the second network device to implement data transmission, and the second network device is configured to execute A method as described in any embodiment of the second aspect.
  • the present application provides a first network device, the first network device may include a controller, a register, a communication interface and a logic operation component, and these components may be electrically connected through one or more internal buses.
  • the first network device implements the method described in any embodiment of the first aspect through the cooperation of various components.
  • the present application provides a second network device
  • the second network device may include a controller, a register, a communication interface and a logic operation component, and these components may be electrically connected through one or more internal buses.
  • the second network device implements the method described in any embodiment of the second aspect through the cooperation of various components.
  • an embodiment of the present application provides a chip, the chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the method described in any embodiment of the first aspect or the second aspect.
  • an embodiment of the present invention provides a non-volatile computer-readable storage medium; the computer-readable storage medium is used to store the implementation code of the method described in any embodiment of the first aspect or the second aspect.
  • the program code can implement the method described in any embodiment of the first aspect or the second aspect when executed by the device.
  • an embodiment of the present invention provides a computer program product; the computer program product includes program instructions, and when the computer program product is executed by a device, the method described in any embodiment of the first aspect or the second aspect is performed.
  • the computer program product may be a software installation package, and the computer program product may be downloaded and executed on the controller to implement the method described in any embodiment of the first aspect or the second aspect.
  • when the RoCE protocol packet carries the ECN mark, the second network device of the destination device can reply with ACK confirmation information aggregated with CETH and indication information; the indication information is used to notify the source device that network congestion has occurred.
  • CETH is used to provide detailed congestion information to the source device.
  • the first network device of the source device extracts congestion information from the CETH to perform quantitative and diversified congestion control actions.
  • the destination device can reply with an ACK confirmation message carrying the indication information to notify the source device that there is no network congestion on the current network path, so that the source device can maintain or promptly resume a high sending rate.
  • the source-end device can learn of network congestion immediately and trigger congestion control earlier to regulate the sending rate, which improves the response speed of the source-end device.
  • the source device can also be informed through the indication information, and the sending rate can be restored in time, which improves the network bandwidth utilization rate.
  • existing RDMA network congestion control conveys little notification information and converges slowly on network congestion, whereas the present application uses CETH to carry detailed network congestion information in different dimensions, such as congestion level, congestion point, queue depth, and network delay. This helps the source device adjust the sending rate to the target rate in one step according to the detailed congestion information, achieving fast convergence, and allows diversified adjustments such as the number of packets and the sending time, which greatly improves the effect of congestion control.
  • FIG. 1 is a schematic diagram of a system architecture involved in an embodiment of the present application.
  • FIG. 2 is a scenario diagram of a device communication flow under the existing RoCE protocol.
  • FIG. 3 is a scenario diagram of a device communication flow in a high-traffic network scenario.
  • FIG. 4 is a schematic diagram of a system architecture including functional modules provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a hardware structure of a network device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the content of some possible confirmation messages provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the content of some other possible confirmation messages provided by an embodiment of the present application.
  • FIG. 8 is an example diagram of a data structure of congestion information provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a data structure of a confirmation message provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a data structure of another confirmation message provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the complete form of some RoCE protocol packets provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of a method for RoCE network congestion control provided by an embodiment of the present application.
  • FIG. 14 is a schematic flowchart of another RoCE network congestion control method provided by an embodiment of the present application.
  • FIG. 15 is a scenario diagram of a device communication process provided by an embodiment of the present application.
  • FIG. 16 is a scenario diagram of a device communication flow in a network high-traffic scenario provided by an embodiment of the present application.
  • a system, product or device comprising a series of units/devices is not limited to the listed units/devices, but optionally also includes unlisted units/devices, or optionally also includes other units/devices inherent to these products or devices.
  • FIG. 1 is a schematic diagram of a system architecture involved in an embodiment of the present application.
  • the system architecture includes a source end device 10 and a destination end device 20 , the source end device 10 and the destination end device 20 are connected through a network 30 for communication, and both the source end device 10 and the destination end device 20 support RoCE protocol for network communication.
  • the source device 10 and the destination device 20 may be computing devices such as a computer, a desktop computer, a notebook computer, a server, a terminal, and the like.
  • the network 30 may include multiple switching devices 31, and the multiple switching devices 31 may be used to perform packet forwarding, transmission, network traffic detection, etc. between the source device 10 and the destination device 20, so as to realize communication between the source device 10 and the destination device 20.
  • the switching device 31 may be, for example, a switch, a router, a relay device, a gateway device, or the like.
  • Both the source device 10 and the destination device 20 may include a network device and a host system, and the host system includes a host central processing unit (CPU) and a memory.
  • the source device 10 in FIG. 1 includes a CPU 12, a memory 13, and a network device 11, and a connection can be established between these components through a bus.
  • the destination device 20 includes a CPU 22, a memory 23, and a network device 21, and a connection can likewise be established between these components through a bus.
  • the network device is a piece of hardware designed to allow the computing device to communicate on the network, and the network device may specifically be a network interface controller (NIC) for realizing communication between the device and the network.
  • the network device supports the RDMA protocol, so the NIC may also be called an RNIC (RDMA NIC). This article will mainly take RNIC as an example to describe the scheme.
  • the network device 11 and the network device 21 are connected through a network 30 to implement communication between the source device 10 and the destination device 20 .
  • Both the network device 11 and the network device 21 support the RoCE protocol.
  • the source device 10 initiates an RDMA read or write request to the destination device 20 through the network, and the data to be written is written directly from the memory 13 into the memory 23 through the network device 11 and the network device 21, or the data to be read is written directly from the memory 23 into the memory 13.
  • the number of CPUs in the host system may be one or more, and the types of each CPU may be different or the same.
  • a CPU may include one or more processor cores, or multiple CPUs may also be integrated into a multi-core processor.
  • the host system may run various software components through the CPU, such as an operating system, applications running on the operating system, and the like. The user can initiate business communication through the operating system or the application program, and then realize the communication interaction between the source end device 10 and the destination end device 20 through the network device.
  • the memory in the host system can be used to store computer instructions and data, and the memory can also store data, messages, etc. that are read and written through RDMA.
  • the memory can be any one or any combination of the following storage media: storage class memory (SCM), read-only memory (ROM), random access memory (RAM), or cache.
  • the remote access between the two computing devices is implemented by the RNIC of the computing devices.
  • the solution of the present application is applied to the RoCE network, and the methods described in the embodiments of the present application can be deployed on an RNIC network card to implement RoCE network congestion control.
  • source device and destination device are two relative concepts:
  • the source device refers to a computing device that initiates an RDMA request, that is, a computing device that requests access to another computing device.
  • the destination device refers to a computing device that receives an RDMA request, that is, a computing device accessed by another computing device.
  • the access of the source device to the destination device may be that the source device writes data to the destination device.
  • the source device transmits the data in the source device to the RNIC of the destination device through the RNIC of the source device, and the destination device receives the data through its RNIC, so that the data in the source device is transferred to the destination device.
  • the access of the source device to the destination device can also be for the source device to read data from the destination device.
  • the source device can read the data in the memory of the destination device through the RNIC of the source device.
  • the data to be read by the source device is sent to the RNIC of the source device through the RNIC of the destination device, and the RNIC of the source device receives the data to complete the reading of the data in the destination device.
  • the data communicated between them is mainly carried in the form of a message, and a message supporting the RoCE protocol may be referred to as a RoCE protocol message or a RoCE data message for short herein.
  • the destination device needs to reply with an ACK confirmation message to inform the source device that the message was successfully received. The technical solutions of the embodiments of this application mainly optimize this process.
  • FIG. 1 is only for describing the technical solution provided by the examples of the present application, and shows the above components and their connection relationships.
  • the source device 10 and the destination device 20 shown in FIG. 1 may also include other components than the above components, for example, may also include hardware resources such as hard disks, which will not be described here.
  • FIG. 2 shows a device communication process of an existing RoCE protocol.
  • the RoCE protocol is one of the RDMA protocols.
  • the RDMA protocol is a transport layer protocol
  • the RoCE protocol is a protocol that additionally includes the network layer and the link layer.
  • the RoCE protocol also supports the reliable connection service.
  • the source device sends protocol packets carrying packet sequence numbers (PSN), and after receiving a protocol packet, the destination device returns an ACK confirmation packet to the source device to notify the RNIC of the source device that the packets it sent have been successfully transmitted.
  • the CP device in the network marks the packet with RED ECN.
  • when the destination device receives a packet carrying the ECN mark, according to the protocol, in addition to replying to the source device with the ACK confirmation message, it also sends an independent CNP message to the source device to notify it of network congestion.
  • the CNP message is only defined as a signal and does not carry any status information.
  • after the destination device replies with the ACK confirmation message and the CNP message separately, the RNIC of the source device also needs to process the two messages separately.
  • FIG. 3 exemplarily shows a device communication process in a network high-traffic scenario.
  • data packet 1 experiences network congestion, but because the sending capability of the destination device is limited, the first CNP packet A is sent with a delay, so the source device does not receive the congestion notification until protocol packet 5. During this period, network congestion keeps worsening because the sending rate is not reduced in time.
  • Network congestion begins to be eliminated from protocol packet 6.
  • the source device has been slowing down until protocol packet 7. Excessive slowdown will affect network bandwidth utilization.
  • the destination device does not notify the network congestion again from the CNP packet B, but the source device cannot obtain the congestion elimination information in time, and can only rely on the timeout to slowly increase the speed until the protocol packet 10 returns to the target sending rate.
  • the network bandwidth utilization during this period is low.
  • the RoCE protocol uses CNP packets to notify network congestion, which can only notify that congestion occurs on the network, but cannot notify the specific status of network congestion.
  • the sender cannot achieve efficient congestion control, but can only slowly approach the target rate step by step, resulting in slow network convergence and low bandwidth utilization.
  • FIG. 4 shows a specific system architecture of an embodiment of the present application.
  • the RNICs of the source device 10 and the destination device 20 are respectively configured with relevant function modules to support the implementation of the solution of the present application.
  • the network device of the source device 10 is RNIC 11
  • the network device of the destination device 20 is RNIC 21
  • the RNIC 11 is configured with a congestion control module 111 , a message sending module 112 and a message receiving module 113
  • the RNIC 21 is configured with a congestion information determination module 211, a notification aggregation sending module 212 and a packet receiving module 213, which are specifically described as follows:
  • the message sending module 112 is configured to send a RoCE protocol message to the destination device 20 .
  • the message receiving module 113 is configured to receive an acknowledgment message from the destination device 20, where the acknowledgment message is an aggregated message designed in the embodiment of the present application.
  • the acknowledgment message may aggregate acknowledgment information (ACK) and indication information for the RoCE protocol message, and the indication information is used to indicate whether the network path between the source end device 10 and the destination end device 20 is congested.
  • the confirmation message also aggregates congestion information, and the congestion information specifically includes at least one of the congestion degree, congestion location, message queue length, and network delay of the network path. That is to say, the confirmation message carries both the indication information of the network congestion and the specific state information of the network congestion. The specific implementation of the confirmation message will be described in detail later.
  • the congestion control module 111 is configured to perform quantifiable congestion control based on the acknowledgment message.
  • congestion control is a method for adjusting the number of packets that a transport protocol connection (here, a RoCE protocol connection) sends at a time. It can quantitatively increase or decrease the amount and frequency of packets sent at a time, so that the sending approaches the most suitable carrying capacity of the current network.
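  • As a rough illustration of how a source-side congestion control module might act on the aggregated confirmation message, the following Python sketch maps a congestion level to a sending rate in one step. The names `AckInfo`, `next_rate`, and `RATE_FRACTION`, the level-to-rate table, and the fallback cut are all hypothetical; the application only states that different congestion levels correspond to different sending rates, and that without congestion the rate is kept unchanged or set to a preset rate.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping: congestion level -> fraction of the line rate.
RATE_FRACTION = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25}


@dataclass
class AckInfo:
    congested: bool              # indication information from the confirmation message
    level: Optional[int] = None  # congestion degree, when congestion information is present


def next_rate(current, ack, line_rate, preset=None):
    """Return the sending rate to use after processing one confirmation message."""
    if not ack.congested:
        # Not congested: keep the current rate, or jump to a preset rate if one is configured.
        return current if preset is None else preset
    if ack.level is None:
        # Congested but no detail available: assumed fallback of halving the rate.
        return current * 0.5
    # Congested with a known level: move to the mapped rate in a single step.
    return line_rate * RATE_FRACTION[ack.level]
```

  • For example, with a line rate of 100 (in any unit), a message reporting level 2 would move the sender directly to 50 instead of approaching that rate over several rounds.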
  • the packet receiving module 213 is configured to receive the RoCE protocol packet from the source device 10 .
  • the congestion information determination module 211 is configured to check whether the RoCE protocol packet carries an explicit congestion notification. If the RoCE protocol packet carries an explicit congestion notification, the congestion information determination module 211 can be used to generate the congestion information. The congestion information specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path, and is used to support the source-end device in performing quantitative congestion control.
  • the notification aggregation sending module 212 may be configured to generate a confirmation message according to the check result of the message, where the confirmation message is the aggregation message designed in the embodiment of the present application. It is also used to send the confirmation message to the source end device 10, so that the source end device 10 can implement quantitative congestion control.
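  • The cooperation of the destination-side modules described above can be sketched as follows. This is a simplified Python model, not the RNIC implementation: the types `RocePacket`, `CongestionInfo`, and `AggregatedAck` and the function `build_ack` are hypothetical names, and the fields chosen for the congestion information are only examples of the kinds of information the application lists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RocePacket:
    psn: int          # packet sequence number
    ecn_marked: bool  # True if the packet carries an explicit congestion notification


@dataclass
class CongestionInfo:
    level: int         # congestion degree
    queue_len: int = 0 # packet queue length (example field)
    delay_us: int = 0  # network delay in microseconds (example field)


@dataclass
class AggregatedAck:
    psn: int                               # confirmation information
    congested: bool                        # indication information
    info: Optional[CongestionInfo] = None  # congestion information, only when congested


def build_ack(pkt: RocePacket, make_info: Callable[[], CongestionInfo]) -> AggregatedAck:
    """Check the packet for ECN and build one aggregated confirmation message."""
    if pkt.ecn_marked:
        # Congested path: aggregate ACK, indication, and congestion information.
        return AggregatedAck(pkt.psn, True, make_info())
    # Uncongested path: ACK plus indication only, no congestion information.
    return AggregatedAck(pkt.psn, False, None)
```

  • Passing `make_info` as a callable keeps the congestion information source open, so the same function could be fed by a ratio-based estimate or by telemetry data.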
  • FIG. 5 shows an exemplary RNIC hardware structure 30, where the RNIC hardware structure 30 may be the structure of the RNIC in the source end device or the structure of the RNIC in the destination end device.
  • the RNIC hardware structure 30 may be an independent standard network card (e.g., a PCIe interface network card), or an integrated network card integrated into an SoC chip. The solutions described in the embodiments of this application can be supported by upgrading the hardware of an existing RNIC network card (e.g., ASIC chip, FW firmware, etc.).
  • the RNIC hardware structure 30 may include a controller 31, a register 32, a communication interface 33 and a logic operation unit 34, and these components may be electrically connected through one or more internal buses 35, where:
  • the register 32 is a memory with a small storage space, and the register 32 can be used to save various instructions; the register 32 can also be used to store the register operands and the intermediate or final operation results temporarily stored during the execution of the instructions; the register 32 can also be used to store data used by the logic operation unit 34 to complete the tasks requested by the controller 31.
  • the controller 31 is used to decode the instructions stored in the registers, and to issue control signals for each operation to be performed for each instruction.
  • the controller 31 is a processor core that can run programs, for example a circuit device such as a system-on-a-chip (SoC), a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or FPGA+ASIC.
  • the controller 31 may be composed of various AND-OR gate arrays.
  • the control method of the controller 31 can be, for example, a micro-program control method with micro-storage as the core, in which the micro-program can be stored in the register 32, or a hardware control method based on a hard-wired logic structure, which is not limited in this application.
  • the logic operation unit 34 can be used to execute operation commands, such as addition commands, subtraction commands, multiplication commands, division commands, etc.; the logic operation unit 34 can also be used to obtain logic commands, such as OR logic commands, AND logic commands, NOT logic commands, etc.
  • the logic operation unit 34 may also be used to obtain a control signal from the controller 31, obtain data corresponding to the control signal from the register 32 according to the obtained control signal, and perform corresponding operations.
  • the communication interface 33 is used to send or receive data. There can be multiple communication interfaces 33, which can be used respectively to receive data from or send data to the CPU of the host system, or to receive data from or send data to an external computing device (for example, sending or receiving RoCE protocol packets or aggregated acknowledgment packets).
  • the RNIC may further include a crystal oscillator, a media access controller, a physical interface transceiver, and the like, which are not limited in this embodiment of the present application.
  • the controller 31 reads the control signals for the instructions stored in the registers and sends out the control signals for each operation to be performed for each instruction, so as to implement the RoCE network congestion control method described in any of the embodiments herein.
  • the following specifically describes the confirmation message that can realize the aggregation of the congestion notification provided by the embodiment of the present application.
  • the embodiment of the present application improves the congestion notification mechanism of the RoCE protocol by extending the existing RoCE ACK acknowledgment message, and obtains the acknowledgment message of the present application, so that it can carry the indication information and congestion information of network congestion, so as to realize the aggregation of congestion notification and accurate network congestion notification.
  • FIG. 6 and FIG. 7 show the contents of some possible confirmation packets in the embodiments of the present application.
  • the confirmation message can be generated by the destination device and used to reply to the source device.
  • FIG. 6 shows a schematic diagram of two types of confirmation packets in a scenario without network congestion (for example, a RoCE protocol packet does not carry an explicit congestion notification ECN).
  • the confirmation packet includes confirmation information and indication information for the RoCE protocol packet.
  • the acknowledgment information realizes the existing ACK acknowledgment function, that is, it notifies the RNIC of the source device whether the message it sent has been successfully received by the destination device, and the indication information is used to indicate that the network path between the source device and the destination device is not congested.
  • the confirmation message is implemented as confirmation message A, and the confirmation information and indication information in the confirmation message A can be set in different positions in the message, for example, distributed in different headers of the message, so as to avoid modifying the fields in the confirmation information.
  • the confirmation message is implemented as confirmation message B, and the indication information can be integrated and set into a field of the confirmation message, so as to fully utilize the field space in the confirmation message.
  • Fig. 7 shows a schematic diagram of two types of confirmation packets in a network congestion scenario (for example, a RoCE protocol packet carries an explicit congestion notification ECN).
  • the confirmation packet includes confirmation information, indication information, and congestion information. The confirmation information realizes the existing ACK confirmation function, that is, it informs the RNIC of the source device whether the packets it sent have been successfully received by the destination device, and the indication information is used to indicate that the network path between the source device and the destination device is congested.
  • the congestion information is used to indicate the specific state of the network, which may specifically include at least one information of the congestion degree of the network path, the congestion location, the length of the packet queue, and the network delay.
  • the confirmation message is implemented as confirmation message C, and the confirmation information, indication information and congestion information in the confirmation message C can be set at different positions in the message, for example, they can be distributed in different headers, so as to avoid modifying the fields in the confirmation information.
  • the acknowledgment message is implemented as acknowledgment message D, and the indication information and/or congestion information can be integrated and set into the fields of the acknowledgment information, so as to fully utilize the field space in the acknowledgment information.
  • the indication information may be an indication bit, an indication field, an indication flag, or the like.
  • when the indication information is an indication bit, a value of 0 indicates to the RNIC of the source device that there is no network congestion on the current network path and that the confirmation message does not carry congestion information; a value of 1 indicates to the RNIC of the source device that there is network congestion on the current network path and that the confirmation packet carries the congestion information.
  • the indication information can use existing fields to redefine functions.
  • the indication information can be the BTH.BECN field of the confirmation packet. When the BTH.BECN field is 0, it indicates to the RNIC of the source device that there is no network congestion on the current network path and that the confirmation message does not carry congestion information; when the BTH.BECN field is 1, it indicates to the RNIC of the source device that there is network congestion on the current network path and that the confirmation message carries congestion information.
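  • Reading and writing such an indication bit can be sketched in Python as below. The sketch assumes a 12-byte BTH in which BECN is the second-highest bit of the byte at offset 4 (directly after the P_Key); this offset and mask are assumptions for illustration and should be checked against the BTH layout actually in use. The helper names `get_becn` and `set_becn` are hypothetical.

```python
BTH_LEN = 12      # assumed BTH length in bytes
BECN_BYTE = 4     # assumed offset of the byte holding the BECN bit
BECN_MASK = 0x40  # assumed position of BECN within that byte


def get_becn(bth: bytes) -> int:
    """Return 1 if the indication bit signals congestion, else 0."""
    if len(bth) < BTH_LEN:
        raise ValueError("BTH too short")
    return (bth[BECN_BYTE] & BECN_MASK) >> 6


def set_becn(bth: bytes, congested: bool) -> bytes:
    """Return a copy of the BTH with the indication bit set or cleared."""
    if len(bth) < BTH_LEN:
        raise ValueError("BTH too short")
    buf = bytearray(bth)
    if congested:
        buf[BECN_BYTE] |= BECN_MASK
    else:
        buf[BECN_BYTE] &= ~BECN_MASK & 0xFF
    return bytes(buf)
```

  • Because only one bit of an existing header is reused, a confirmation message without congestion keeps the size of an ordinary ACK.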
  • the congestion information may be carried in a newly defined packet header (such as the CETH described below), which supports carrying detailed network congestion status content, so that the destination device can send an accurate network congestion notification to the source device. The congestion information may also be carried by using existing field space, for example, an existing reserved field.
  • FIG. 8 is an example diagram of a data structure of congestion information provided by an embodiment of the present application; this data structure may be referred to herein as a Congestion Extended Transport Header (CETH).
  • the CETH, also referred to as a CETH header, includes a standard definition part and a manufacturer-defined information field. The standard definition part may include the following fields: version number (Ver) and CETH header length (Length), where:
  • the manufacturer-defined information field is used to support the user-defined congestion control advertisement information of each manufacturer, and the total length is, for example, (Length*4-1) bytes.
  • the manufacturer can design at least one kind of information among the congestion degree, congestion location, packet queue length, and network delay of the bearer network path.
  • the congestion level of the network path can be represented by a 2-bit ratio field, and the ratio field is used to identify the congestion level.
  • the ratio field can indicate the degree of congestion in graded levels, such as: no congestion, light congestion, moderate congestion, heavy congestion, and so on.
  • a 1-bit field is designed to indicate whether the current congestion advertisement is a normal CNP type or an enhanced CNP type.
  • a 4-bit field can be designed to identify service scenarios, such as RC/XRC write/send scenarios, RC/XRC read response scenarios, or UD send scenarios.
  • The Ver field indicates the CETH version number. For example, it can occupy 4 bits, and it supports upgrades of the congestion control algorithm and interoperability between versions.
  • For example, version number 0 represents the standard CNP notification, in which no other information is carried, and version numbers 1 to 15 are available for manufacturer use.
  • The Length field indicates the length of the CETH header. For example, it occupies 4 bits.
  • A variable-length CETH header is supported to reduce fixed overhead. For example, Length can take a value of 1 to 4, indicating the length of the CETH header in units of 4 bytes.
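  • As a minimal sketch of the CETH layout described above, the following Python packs and parses a header under stated assumptions: Ver sits in the high 4 bits and Length in the low 4 bits of the first byte, and the 2-bit ratio, 1-bit CNP type, and 4-bit scenario fields share the first manufacturer-defined byte. These bit positions and all names (`Ceth`, `pack_info_byte`, `pack_ceth`, `unpack_ceth`) are illustrative assumptions; the text does not fix them.

```python
from dataclasses import dataclass


@dataclass
class Ceth:
    ver: int      # 4-bit version: 0 = standard CNP notification, 1..15 = manufacturer use
    length: int   # 4-bit header length in units of 4 bytes (the text gives 1..4 as an example)
    info: bytes   # manufacturer-defined information, Length*4 - 1 bytes


def pack_info_byte(ratio: int, enhanced_cnp: bool, scenario: int) -> int:
    """Hypothetical packing: ratio (2 bits) | CNP type (1 bit) | scenario (4 bits) | spare (1 bit)."""
    if not 0 <= ratio <= 3 or not 0 <= scenario <= 15:
        raise ValueError("ratio must fit in 2 bits and scenario in 4 bits")
    return (ratio << 6) | (int(enhanced_cnp) << 5) | (scenario << 1)


def pack_ceth(ceth: Ceth) -> bytes:
    """Serialize the header: one Ver/Length byte followed by the manufacturer-defined bytes."""
    if not 0 <= ceth.ver <= 15 or not 1 <= ceth.length <= 15:
        raise ValueError("Ver and Length must each fit in 4 bits, and Length must be at least 1")
    if len(ceth.info) != ceth.length * 4 - 1:
        raise ValueError("manufacturer-defined information must be Length*4 - 1 bytes")
    return bytes([(ceth.ver << 4) | ceth.length]) + ceth.info


def unpack_ceth(buf: bytes) -> Ceth:
    """Parse a header from the start of buf, using Length to find where it ends."""
    if not buf:
        raise ValueError("empty buffer")
    ver, length = buf[0] >> 4, buf[0] & 0x0F
    if length == 0 or len(buf) < length * 4:
        raise ValueError("truncated or malformed CETH")
    return Ceth(ver, length, bytes(buf[1:length * 4]))
```

  • With Length = 1 the whole header is 4 bytes, so three manufacturer-defined bytes remain after the Ver/Length byte; a receiver can skip an unknown version by reading Length alone.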
  • FIG. 9 is a schematic diagram of a data structure of an acknowledgement message provided by the present application.
  • the confirmation message specifically includes ACK confirmation information and CETH, where:
  • The ACK confirmation information specifically includes the Base Transport Header (BTH for short, also called the BTH header or BTH field) and the Acknowledge Extended Transport Header (AETH); that is, the ACK confirmation information can be carried by the BTH and AETH.
  • The ACK confirmation information is used to implement the ACK function, that is, to notify the RNIC of the source device whether the packet it sent has been successfully received by the destination device.
  • the indication information for indicating whether the network path between the source end device and the destination end device is congested may also be carried in the BTH field.
  • CETH is an extension field designed in this application.
  • CETH can be appended after the AETH of the ACK as an optional field, so that the ACK confirmation information, the indication information, and the CETH are aggregated in one message.
  • CETH is used to indicate the specific network state of the current network path.
  • When the ACK carries the indication information together with an aggregated CETH, the destination device can notify the source device of the occurrence of network congestion in time.
  • When the ACK carries the indication information without a CETH, the destination device can notify the source device in time that the congestion has been eliminated.
  • Because CETH carries network status information such as the congestion degree, congestion location, packet queue length, and network delay of the network path, it helps solve the problem that existing RoCE networks advertise too little congestion control information to support efficient congestion control. Because the messages are aggregated and carry the indication information, it also helps solve the problem of the slow response of RoCE network congestion control.
  • When the RoCE protocol packet received by the RNIC of the destination device does not carry the ECN mark, the RNIC of the destination device replies to the source device with an ACK that does not carry the CETH header. This ACK includes only the confirmation information and the indication information, and notifies the RNIC of the source device that the protocol packet has been received and that the network is not congested.
  • When the RoCE protocol packet received by the RNIC of the destination device carries the ECN mark, the RNIC of the destination device replies to the source device with an ACK carrying a CETH header. This ACK includes the confirmation information, the indication information, and the congestion information. It notifies the RNIC of the source device that the protocol packet has been received and that the network is congested, and it lets the RNIC of the source device obtain detailed network status information across several dimensions, such as congestion level, congestion point, queue depth, and network delay, so that quantitative congestion control can be performed based on this information.
  • For example, the degree of network congestion indicated in the congestion information can be divided into multiple levels, such as "no congestion, mild congestion, moderate congestion, and severe congestion". The RNIC of the source device can decide how to slow down according to the congestion level and apply a different packet sending rate for each level, so that the rate converges faster.
  • As another example, when the congestion information includes the congestion point and the queue depth, the RNIC of the source device can determine from this information how many data packets the network path can still carry without causing packet loss, and thereby determine the number of packets that can continue to be sent, which is friendlier to network applications with high bandwidth requirements.
  • FIG. 10 is a schematic diagram of the data structure of another confirmation message provided by the present application.
  • In this example, the congestion information (CETH) is carried using the existing field space.
  • For example, CETH is integrated into the BTH field of the ACK confirmation information.
  • The indication information used to indicate whether the network path between the source device and the destination device is congested can also be carried in the BTH field.
  • the acknowledgment message can be regarded as an improvement of the traditional ACK message, making full use of the existing field space to carry the congestion information.
  • The acknowledgment message can implement the ACK function, that is, inform the RNIC of the source device whether the message it sent has been successfully received by the destination device. It can also implement the congestion indication function, that is, indicate whether the network path between the source device and the destination device is congested. In addition, it lets the RNIC of the source device obtain detailed network status information across different dimensions, such as congestion level, congestion point, queue depth, and network delay, so that quantitative congestion control can be performed based on this information.
  • In one implementation, the reserved field "reserved 6" in the BTH of the traditional ACK may be used to carry the congestion information of the embodiment of the present application; that is, "reserved 6" serves as the first CETH and carries the related data, so that congestion control and transmission acknowledgment are aggregated.
  • In another implementation, the reserved field "reserved 7" in the BTH of the traditional ACK may be used to carry the congestion information of the embodiment of the present application; that is, "reserved 7" serves as the second CETH and carries the related data, so that congestion control and transmission acknowledgment are aggregated.
  • The first CETH and the second CETH may include only information such as congestion degree, congestion point, queue depth, and network delay, or may also include other information described in the embodiment of FIG. 8, such as the version number.
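  • The reuse of BTH reserved bits can be sketched as follows. This is only an illustration under assumptions: it takes the commonly documented 12-byte BTH layout in which byte 4 holds FECN (1 bit), BECN (1 bit), and a 6-bit reserved field, uses BECN as the indication bit, and splits the 6 reserved bits into a 2-bit congestion level and a 4-bit queue-depth bucket. That split and the function names are hypothetical; the text does not specify how "reserved 6" is subdivided.

```python
BTH_LEN = 12
RESV6_BYTE = 4      # assumed byte layout: FECN (1 bit) | BECN (1 bit) | reserved (6 bits)
FECN_MASK = 0x80
BECN_MASK = 0x40
RESV6_MASK = 0x3F


def set_congestion_in_bth(bth: bytes, congested: bool, level: int, depth_bucket: int) -> bytes:
    """Return a copy of the BTH with the indication bit and congestion info written into byte 4."""
    if len(bth) != BTH_LEN:
        raise ValueError("BTH must be 12 bytes")
    if not 0 <= level <= 3 or not 0 <= depth_bucket <= 15:
        raise ValueError("level must fit in 2 bits and depth_bucket in 4 bits")
    out = bytearray(bth)
    value = out[RESV6_BYTE] & FECN_MASK          # keep FECN, clear BECN and the reserved bits
    if congested:
        value |= BECN_MASK | (level << 4) | depth_bucket
    out[RESV6_BYTE] = value
    return bytes(out)


def get_congestion_from_bth(bth: bytes) -> tuple:
    """Return (congested, level, depth_bucket) as read from byte 4 of the BTH."""
    value = bth[RESV6_BYTE]
    reserved = value & RESV6_MASK
    return bool(value & BECN_MASK), reserved >> 4, reserved & 0x0F
```

  • Because no bytes are added, an ACK built this way has the same length as a traditional ACK, which is the trade-off against the extension-header form: less room for information, but no change in message size.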
  • FIG. 11 and FIG. 12 respectively exemplarily describe some complete forms of RoCE protocol packets and the complete forms of corresponding confirmation packets in the embodiments of the present application.
  • A RoCE protocol packet from a source device may include a MAC packet header, an IP packet header, a UDP packet header, a BTH header, a data payload, an invariant cyclic redundancy check value (Invariant Cyclic Redundancy Check, ICRC), and a variant cyclic redundancy check value (Variant Cyclic Redundancy Check, VCRC).
  • The MAC packet header, IP packet header, and UDP packet header are the packet headers corresponding to the MAC layer, IP layer, and UDP layer respectively.
  • The data payload is the data that needs to be transmitted for device communication and interaction. ICRC and VCRC can be used to verify data integrity.
  • Yet another RoCE protocol packet may include a MAC packet header, an IP packet header, a UDP packet header, a BTH header, an ImmDt field, a data payload, an ICRC, and a VCRC.
  • the RoCE protocol packet may further include more or less content, which is not limited in this application.
  • an acknowledgment packet from a destination device may include a MAC packet header, an IP packet header, a UDP packet header, a BTH header, an AETH header, a CETH header, an ICRC, and a VCRC.
  • the MAC packet header, IP packet header, and UDP packet header are the packet headers corresponding to the MAC layer, IP layer, and UDP layer, respectively.
  • ICRC and VCRC can be used to verify data integrity.
  • The specific content and implementation form of the BTH header, AETH header, and CETH header here may be similar to the related description of the embodiment of FIG. 9, and are not repeated here.
  • Yet another confirmation message may include a MAC header, an IP header, a UDP header, a BTH header, an AETH header, an ICRC, and a VCRC.
  • The specific content and implementation form of the BTH header and the AETH header here may be similar to the related description of the embodiment of FIG. 10, and are not repeated here.
  • The confirmation message may also include more or less content, which is not limited in this application.
  • the congestion control method provided by the embodiments of the present application is described below.
  • For convenience, the method embodiments described below are each expressed as a combination of a series of action steps, but those skilled in the art should know that the specific implementation of the technical solutions of the present application is not limited by the described order of the action steps.
  • FIG. 13 is a schematic flowchart of a method for RoCE network congestion control provided by an embodiment of the present application. The method is described from the perspective of interaction between a first network device and a second network device, wherein each of the first network device and the second network device may be an RNIC, a network interface controller, a network adapter, a network interface card, a LAN adapter, or the like.
  • the first network device may be the RNIC of the source device
  • the second network device may be the RNIC of the destination device
  • the first network device and the second network device may be connected through a network.
  • the method includes but is not limited to the following steps:
  • the first network device sends a RoCE protocol packet to the second network device.
  • the RoCE protocol message may be generated based on user service requirements, and the RoCE protocol message may be a periodic message.
  • the specific content of the RoCE protocol packet has been described above, and will not be repeated here.
  • the second network device checks whether the RoCE protocol packet carries an explicit congestion notification.
  • When the network is congested, the congestion point (CP) device in the network can apply a RED ECN mark to the packet.
  • When the second network device receives a packet carrying the ECN mark, it confirms that the network is currently congested. Conversely, when it receives a packet that does not carry the ECN mark, the second network device confirms that the network is not currently congested.
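  • The check itself can be sketched briefly. The sketch below assumes the ECN mark is the standard IP-level Congestion Experienced (CE) codepoint, that is, both bits of the 2-bit ECN field (the low 2 bits of the IPv4 TOS byte) set to 1; the text does not say at which layer the second network device reads the mark, and the function names are illustrative.

```python
ECN_MASK = 0b11   # the ECN field is the low 2 bits of the IPv4 TOS / IPv6 Traffic Class byte
ECN_CE = 0b11     # Congestion Experienced codepoint


def is_ce_marked(tos: int) -> bool:
    """True if the ECN field of the given TOS / Traffic Class byte is CE."""
    return (tos & ECN_MASK) == ECN_CE


def ipv4_packet_is_ce_marked(ip_header: bytes) -> bool:
    """Check the CE mark on a raw IPv4 header (byte 1 is the TOS byte)."""
    if len(ip_header) < 20 or ip_header[0] >> 4 != 4:
        raise ValueError("not a complete IPv4 header")
    return is_ce_marked(ip_header[1])
```

  • The two ECT codepoints (01 and 10) only declare that the flow is ECN-capable, so they count as "no mark" here; only CE (11) is treated as congestion.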
  • the second network device generates an ACK confirmation message according to the check result, where the ACK confirmation message at least includes aggregated confirmation information and indication information.
  • the indication information is used to indicate whether the network path between the first network device and the second network device is congested.
  • the indication information may be an indication bit, an indication field, an indication flag, or the like.
  • For example, when the indication information is an indication bit, a value of 0 indicates to the RNIC of the source device that there is no network congestion on the current network path and that the confirmation message does not carry congestion information;
  • a value of 1 means that the destination device indicates to the RNIC of the source device that there is network congestion on the current network path and that the confirmation message carries the congestion information.
  • the confirmation information and the indication information may be set in different positions in the message, for example, may be distributed in different message headers, so as to avoid modifying the fields in the confirmation information.
  • the indication information can be integrated and set into the field of the confirmation information, so as to fully utilize the field space in the confirmation information.
  • the second network device replies the confirmation message to the first network device.
  • the first network device receives the confirmation message from the second network device.
  • the first network device performs congestion control based on the confirmation message.
  • Congestion control here is a function that adjusts, for a RoCE protocol connection, the number of packets sent in a single transmission (the single transmission amount); by quantitatively increasing or decreasing the single transmission amount and the transmission frequency, it brings the traffic close to the most suitable carrying capacity of the current network.
  • For example, when the indication information indicates that the network path is congested, the first network device may reduce the sending rate of RoCE protocol packets in the next time window.
  • When the indication information indicates that the network path is not congested, the first network device may keep the sending rate of RoCE protocol packets in the next time window unchanged, or set it to a preset rate.
  • It can be seen that, when the RoCE protocol packet carries the ECN mark, the destination device can reply through the second network device with ACK confirmation information aggregated with the indication information, and the indication information notifies the source device that the current network path is congested.
  • In this way, the first network device of the source device can reduce the sending rate of RoCE protocol packets in the next time window.
  • When the RoCE protocol packet does not carry the ECN mark, the destination device can reply with an ACK confirmation message carrying the indication information to notify the source device that the current network path is not congested, so that the source device can maintain or resume a high sending rate in time.
  • On the one hand, carrying the indication information together with the ACK avoids sending independent CNPs, reduces notification overhead, reduces the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device.
  • On the other hand, through the indication information the source device learns of network congestion at the earliest opportunity and triggers congestion control earlier to regulate the sending rate, which improves the response speed of the source device.
  • When the network congestion is relieved, the source device can also learn of this through the indication information and restore the sending rate in time, which improves network bandwidth utilization.
  • FIG. 14 is a schematic flowchart of another RoCE network congestion control method provided by an embodiment of the present application. The method is described from the perspective of interaction between a first network device and a second network device, wherein each of the first network device and the second network device may be an RNIC, a network interface controller, a network adapter, a network interface card, a LAN adapter, or the like.
  • the first network device may be the RNIC of the source device
  • the second network device may be the RNIC of the destination device
  • the first network device and the second network device may be connected through a network.
  • the method includes but is not limited to the following steps:
  • the first network device sends a RoCE protocol packet to the second network device.
  • the specific content of the RoCE protocol packet has been described above, and will not be repeated here.
  • the second network device checks whether the RoCE protocol packet carries an explicit congestion notification (ECN). When it is determined that the RoCE protocol packet carries an explicit congestion notification, the second network device subsequently performs steps S403-S405; when it is determined that the RoCE protocol packet does not carry an explicit congestion notification, the second network device subsequently performs steps S406-S407.
  • the second network device acquires congestion information, where the congestion information is used to indicate a specific state of the network.
  • For example, when the current network is congested, the second network device can obtain network status information across different dimensions, such as congestion degree, congestion point, queue depth, and network delay, through packet detection or hardware detection.
  • the second network device may obtain the congestion level by:
  • the second network device determines the degree of congestion of the network path according to the proportion of RoCE protocol packets carrying an explicit congestion notification in the historical packet reception record. For example, the second network device performs periodic sliding window statistics on the proportion of received packets carrying the ECN flag, so as to calculate the current specific congestion degree of the network path.
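  • The sliding-window statistic described above can be sketched as follows. The window here counts the most recent packets rather than a time period, and the window size and the thresholds that map the ECN ratio to the four levels are hypothetical values chosen for illustration; the text does not give them.

```python
from collections import deque


class EcnRatioWindow:
    """Tracks the share of ECN-marked packets among the most recent `size` packets."""

    def __init__(self, size: int = 64, mild_below: float = 0.3, moderate_below: float = 0.7):
        self._marks = deque(maxlen=size)    # oldest entries fall out automatically
        self._mild_below = mild_below
        self._moderate_below = moderate_below

    def record(self, ecn_marked: bool) -> None:
        """Record one received packet and whether it carried the ECN mark."""
        self._marks.append(bool(ecn_marked))

    def ratio(self) -> float:
        """Proportion of marked packets in the window (0.0 for an empty window)."""
        return sum(self._marks) / len(self._marks) if self._marks else 0.0

    def level(self) -> int:
        """0 = no congestion, 1 = mild, 2 = moderate, 3 = severe."""
        r = self.ratio()
        if r == 0.0:
            return 0
        if r < self._mild_below:
            return 1
        if r < self._moderate_below:
            return 2
        return 3
```

  • The resulting level fits in the 2-bit ratio field described for the CETH, so the receiver can copy it into the acknowledgment directly.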
  • the second network device generates an acknowledgment packet, where the acknowledgment packet includes aggregated acknowledgment information, indication information, and congestion information.
  • the congestion information can be implemented by, for example, the CETH described herein, and the CETH can be an extension field or an existing reserved field.
  • the acknowledgment information is used to implement the ACK function, and the indication information is used to indicate that the network path between the source end device and the destination end device is congested.
  • the acknowledgment information, indication information and congestion information in the acknowledgment message can be set in different positions in the message, for example, can be distributed in different headers, so as to avoid modifying the fields in the acknowledgment information.
  • the indication information and/or the congestion information can be integrated and set into the field of the confirmation information, so as to fully utilize the field space in the confirmation information.
  • the second network device sends the confirmation message generated in S404 to the first network device.
  • the second network device generates a confirmation message, where the confirmation message includes aggregated confirmation information and indication information.
  • the acknowledgment information is used to implement the ACK function, and the indication information is used to indicate that the network path between the source end device and the destination end device is not congested.
  • the second network device sends the confirmation message generated in S406 to the first network device.
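  • The receiver-side branch of steps S402 to S407 can be sketched as below. The message is modeled as a plain data class rather than real packet bytes, the `congestion_info` dictionary stands in for the CETH, and `probe_congestion` is a hypothetical callback representing whatever packet or hardware detection the second network device uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AckMessage:
    psn: int                                  # confirmation information: acknowledged sequence number
    congested: bool                           # indication information
    congestion_info: Optional[dict] = None    # stands in for the CETH; present only when congested


def build_ack(psn: int, ecn_marked: bool, probe_congestion: Callable[[], dict]) -> AckMessage:
    """Build the acknowledgment for one received RoCE protocol packet."""
    if ecn_marked:                                   # S402: the packet carries an ECN mark
        info = probe_congestion()                    # S403: acquire congestion information
        return AckMessage(psn, True, info)           # S404: ACK + indication + congestion info
    return AckMessage(psn, False)                    # S406: ACK + indication only
```

  • Congestion information is gathered only on the congested branch, so the uncongested path pays no extra cost beyond setting the indication.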
  • the first network device performs quantitative congestion control based on the confirmation message.
  • the first network device can perform congestion control according to the congestion information in the confirmation message, including at least one of the following methods:
  • The first network device can quantitatively adjust its packet sending rate in the next time window according to the congestion degree, where there is a correspondence between the congestion degree and the sending rate. For example, for multiple levels such as "no congestion, mild congestion, moderate congestion, and severe congestion", the first network device may decide how to slow down according to the congestion degree. Different levels can correspond to different packet sending rates, so the sending rate is adjusted in steps and converges faster.
  • The first network device may determine the number of packets to be sent in the next time window according to at least one of the congestion location and the packet queue depth.
  • The RNIC of the source device can determine from the congestion location and/or the packet queue depth how many data packets the network path can still carry without causing packet loss, and thereby determine the number of packets that can continue to be sent, which is friendlier to network applications with high bandwidth requirements.
  • the first network device may adjust the sending rate of the first network device or the number of packets to be sent in the next time window according to the network delay.
  • After the first network device receives the confirmation message sent in S407, it determines from the indication information that the current network is not congested. The first network device can then keep the sending rate of RoCE protocol packets in the next time window unchanged, or restore or set the sending rate of RoCE protocol packets in the next time window to the preset rate.
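  • The sender-side decisions described above can be sketched as follows. The per-level rate factors, the default back-off, and the headroom rule for the packet count are hypothetical choices made for illustration; the text only states that a correspondence between congestion degree and sending rate exists.

```python
from typing import Optional

# Hypothetical mapping from congestion level to the fraction of the current rate to keep.
LEVEL_TO_FACTOR = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25}
DEFAULT_BACKOFF = 0.5   # used when the acknowledgment signals congestion without a level


def next_window_rate(current_rate: float, congested: bool,
                     level: Optional[int] = None,
                     preset_rate: Optional[float] = None) -> float:
    """Sending rate for the next time window, in the same unit as current_rate."""
    if not congested:
        # No congestion: keep the rate, or restore it to the preset rate if one is configured.
        return current_rate if preset_rate is None else preset_rate
    return current_rate * LEVEL_TO_FACTOR.get(level, DEFAULT_BACKOFF)


def packets_for_next_window(queue_capacity: int, queue_depth: int, window_cap: int) -> int:
    """Packets to send next window: bounded by the reported queue headroom and a local cap."""
    headroom = max(queue_capacity - queue_depth, 0)
    return min(headroom, window_cap)
```

  • Because the level arrives with the first congested acknowledgment, the sender can jump to the target rate in one step instead of halving repeatedly until the marks stop.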
  • the solution of the present application can realize the aggregation and bearer of the RoCE congestion control notification and the ACK confirmation information, and notify the generation and elimination of network congestion through the indication information.
  • the defined CETH carries detailed network congestion status information. Specifically, as shown in Figure 15:
  • When the message received by the destination device carries the ECN flag (denoted as data w/ECN), the embodiment of this application redefines the ACK message used for the reply as an ACK aggregated with CETH (denoted as ACK w/CETH), in which the CETH header is appended after the AETH header to carry the quantified congestion information.
  • In this way, the CNPs for RDMA write and RDMA send over RC, XRC, RD, and other connection types can be aggregated into the ACK, so in the embodiment of this application the destination device does not need to send a separate CNP to the source device.
  • When the message received by the destination device does not carry the ECN flag (denoted as data w/o ECN), the embodiment of this application redefines the ACK message used for the reply as an ACK without aggregated CETH (denoted as ACK w/o CETH), so that the source device learns the current network status and can quickly restore the sending rate.
  • FIG. 16 exemplarily shows a device communication process in a network traffic scenario.
  • In this scenario, network congestion occurs at data packet 1.
  • Because the destination device aggregates the RoCE congestion notification CETH into the ACK acknowledgement packet, the source device obtains the congestion notification sooner when congestion occurs and can therefore slow down sooner. For example, in FIG. 16 it enters the slowdown state starting from protocol packet 4. Compared with the independent CNP notification method in FIG. 3, the packet sending rate is reduced sooner. This aggregation method also allows the destination device to use the indication information in the ACK to notify the source device that the network congestion state has been eliminated.
  • After the source device receives the congestion elimination notification, it can quickly increase the packet sending rate; as shown in FIG. 16, the sending rate can be restored starting from data packet 8. Compared with the existing method that relies on periodic detection by the source device, the packet sending rate is raised sooner.
  • In addition, the sending rate can be controlled precisely so that it is reduced to the target rate in the first deceleration cycle, which achieves faster convergence.
  • For example, the rate can converge to the target rate as early as data packet 4, the first packet sent after the slowdown begins.
  • It can be seen that, when the RoCE protocol packet carries the ECN mark, the second network device of the destination device can reply with ACK confirmation information aggregated with CETH and indication information, and the indication information notifies the source device that the current network path is congested.
  • CETH is used to provide detailed congestion information to the source device.
  • The first network device of the source device extracts the congestion information from the CETH to perform quantitative and diversified congestion control actions.
  • When the RoCE protocol packet does not carry the ECN mark, the destination device can reply with an ACK confirmation message carrying the indication information to notify the source device that the current network path is not congested, so that the source device can maintain or resume a high sending rate in time.
  • Through the indication information, the source device learns of network congestion at the earliest opportunity and triggers congestion control earlier to regulate the sending rate, which improves the response speed of the source device.
  • When the network congestion is relieved, the source device can also learn of this through the indication information and restore the sending rate in time, which improves network bandwidth utilization.
  • Existing RDMA network congestion control advertises little information, so network congestion converges slowly. The present application instead uses CETH to carry detailed network congestion information across different dimensions, such as congestion level, congestion point, queue depth, and network delay. This helps the source device adjust the sending rate to the target rate in one step according to the detailed congestion information, achieve fast convergence, and make diversified adjustments such as to the number of packets and the sending time, which greatly improves the effect of congestion control.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk.


Abstract

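The embodiments of the present application provide a method for RoCE network congestion control and a related apparatus. The method includes: a first network device sends a RoCE protocol packet to a second network device; the first network device receives an acknowledgment packet from the second network device, where the acknowledgment packet includes acknowledgment information for the RoCE protocol packet and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested; and the first network device performs congestion control based on the acknowledgment packet. By implementing the embodiments of the present application, the occurrence and elimination of network congestion can be notified to the first network device in time, which improves the congestion control response speed of the first network device and improves network bandwidth utilization.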

Description

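Method and related apparatus for RoCE network congestion control
This application claims priority to Chinese Patent Application No. 202010915720.6, filed with the China Patent Office on September 3, 2020 and entitled "Method and related apparatus for RoCE network congestion control", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, and in particular, to a method and related apparatus for RoCE network congestion control.
Background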
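In a data communication system, to increase the speed of packet transmission between computing devices, remote direct memory access (RDMA) technology is usually used for connections. RDMA transfers data over the network directly into a computer's memory, quickly moving data from one system into the memory of a remote system without the intervention of the operating system or kernel of either computing device. RDMA eliminates the overhead of external memory copies and context switches, and thus frees memory bandwidth and CPU cycles for improving application system performance.
RDMA over Converged Ethernet (RoCE) is one type of RDMA technology that allows servers to perform remote direct memory access over Ethernet. Although the advantages of the RoCE protocol derive mainly from the characteristics of converged Ethernet, the RoCE protocol can also be applied in traditional Ethernet networks or non-converged Ethernet networks.
When excessive traffic causes congestion in the network (the packet traffic sent from a given source port may be relatively heavy during some period), DCQCN specifies that the congestion point (CP) device applies a random early detection (RED) explicit congestion notification (ECN) mark to packets. A receiving end that supports the RoCE protocol, on receiving a packet carrying the ECN mark, sends an independent congestion notification packet (CNP) to the source end to notify it of the network congestion. Based on the CNP, the source end reduces the sending rate of subsequent packets to a certain value to eliminate the congestion.
The RoCE protocol notifies network congestion by means of a separate CNP, so after network congestion occurs, the network interface card at the receiving end has to send both ACK packets and CNPs in succession. However, because the packet sending rate of the receiving end's network interface card has an upper limit, the congestion notification may be delayed, which slows the congestion control response of the source end.
In addition, a CNP can only notify the source end that network congestion has occurred; it cannot notify the source end that the congestion has been eliminated. The source end can only check periodically whether the congestion has been eliminated, so it cannot restore the packet sending rate in time, which reduces the effective utilization of network bandwidth.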
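Summary
The embodiments of the present application provide a method and related apparatus for RoCE network congestion control, which can notify the occurrence and elimination of network congestion in time, improve the congestion control response speed of the source end, and improve network bandwidth utilization.
According to a first aspect, this application provides a method for RoCE network congestion control. The method includes: a first network device sends a RoCE protocol packet to a second network device; the first network device receives an acknowledgment packet from the second network device, where the acknowledgment packet includes acknowledgment information for the RoCE protocol packet and indication information, and the indication information is used to indicate whether the network path between the first network device and the second network device is congested; and the first network device performs congestion control based on the acknowledgment packet.
The first network device and the second network device are both hardware designed to allow computing devices to communicate over a network, and both support network communication using the RoCE protocol. RoCE (RDMA over Converged Ethernet) is a network protocol that allows remote direct memory access (RDMA) over Ethernet. Each of the first network device and the second network device may be, for example, an RDMA network interface controller (RNIC), a network interface controller, a network adapter, a network interface card, or a LAN adapter. In a possible implementation, at least one of the first network device and the second network device may also be a switch device.
The first network device may be disposed in a source device and the second network device in a destination device, so the source device and the destination device can read, write, and transfer data remotely through the communication between the first network device and the second network device.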
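It can be seen that, by implementing the embodiments of the present application, the source device can send packets through the first network device, and the destination device can reply through the second network device with ACK acknowledgment information aggregated with indication information, where the indication information notifies the first network device whether there is network congestion on the current network path. In this way, the first network device obtains the state of the current network, that is, whether it is congested or not, and can perform the corresponding congestion control operation based on that state. For example, when the network is not congested, the first network device can maintain or resume a high sending rate in time. On the one hand, carrying the indication information together with the ACK avoids the drawback of traditional solutions that an independent CNP has to be sent, reduces notification overhead, helps reduce the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device. On the other hand, through the indication information the source device learns of network congestion at the earliest opportunity and triggers congestion control earlier to regulate the sending rate, which improves the response speed of the source device. When the network congestion is relieved, the source device can also learn of this through the indication information and restore the sending rate in time, which improves network bandwidth utilization.
Based on the first aspect, in a specific embodiment, when the indication information indicates that the network path is congested, the acknowledgment packet further includes congestion information, and the congestion information specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. That the first network device performs congestion control based on the acknowledgment packet specifically includes: the first network device performs congestion control according to the congestion information.
It can be seen that, by implementing the embodiments of the present application, when there is network congestion on the current network path, the acknowledgment packet can also carry congestion information, which describes the network state in detail. In this way, the first network device of the source device can extract the congestion information to perform quantitative and diversified congestion control actions. On the one hand, carrying the indication information, the congestion information, and the ACK together avoids sending an independent CNP, reduces notification overhead, helps reduce the delay of congestion notification in high-traffic scenarios, and improves the response speed of the destination device. On the other hand, existing RDMA network congestion control advertises little information and network congestion converges slowly, whereas this application uses the aggregated packet to carry detailed congestion information across different dimensions, such as congestion degree, congestion point, queue depth, and network delay. This helps the first network device perform diversified, differentiated, and specific congestion control according to the detailed congestion information, for example adjusting the sending rate in different steps, or adjusting the number of packets and the sending time, which greatly improves the effect of congestion control.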
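Based on the first aspect, in a possible embodiment, that the first network device performs congestion control according to the congestion information includes at least one of the following:
(1) The first network device may quantitatively adjust its packet sending rate in the next time window according to the congestion degree. In a possible embodiment, the congestion degree is one of multiple congestion levels, and different congestion levels correspond to different sending rates; that is, there is a correspondence between the congestion degree and the sending rate. For example, for multiple levels such as "no congestion, mild congestion, moderate congestion, and severe congestion", the first network device may decide how to slow down according to the congestion degree. Different levels can correspond to different packet sending rates, so the sending rate is adjusted in steps and converges faster.
(2) The first network device may determine the number of packets to be sent in the next time window according to at least one of the congestion location and the packet queue depth. The RNIC of the source device can determine from the congestion location and/or the packet queue depth how many data packets the network path can still carry without causing packet loss, and thereby determine the number of packets that can continue to be sent, which is friendlier to network applications with high bandwidth requirements.
(3) The first network device may adjust its sending rate or the number of packets to be sent in the next time window according to the network delay.
It can be seen that, by implementing the embodiments of the present application, the aggregated packet carries detailed congestion information across different dimensions, such as congestion degree, congestion point, queue depth, and network delay, which helps the first network device perform diversified, differentiated, and specific congestion control according to the detailed congestion information and greatly improves the effect of congestion control.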
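Based on the first aspect, in a possible embodiment, the acknowledgment packet specifically includes a base transport header (BTH) field and an extension field, the acknowledgment information and the indication information are carried in the BTH field, and the congestion information is carried in the extension field.
The extension field is, for example, the congestion extended transport header (Congestion Extended Transport Header, CETH) described herein; that is, the congestion information can be carried in an extended CETH header.
For example, the CETH includes two parts: a standard-defined part and manufacturer-defined information. The standard-defined part can be used for interoperability in mixed networking scenarios and may include the following fields: version number (Ver) and CETH header length (Length). Specifically:
The manufacturer-defined information field supports each manufacturer's own congestion control notification information; its total length is, for example, (Length*4-1) bytes. For example, a manufacturer may design this field to carry at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. For example, the congestion degree of the network path can be represented by a 2-bit ratio field that identifies the congestion degree. In an application scenario, the ratio field can indicate the congestion degree in graded levels, for example: no congestion, mild congestion, moderate congestion, severe congestion, and so on. In addition, more content can be designed into the manufacturer-defined information, for example a 1-bit field that indicates whether the current congestion notification is of the normal CNP type or the enhanced CNP type. As another example, a 4-bit field can be designed to identify the service scenario, such as the RC/XRC write/send scenario, the RC/XRC read response scenario, or the UD send scenario.
The Ver field indicates the CETH version number. For example, it can occupy 4 bits, and it supports upgrades of the congestion control algorithm and interoperability between versions.
The Length field indicates the CETH header length. For example, it occupies 4 bits, and it supports a variable-length CETH header to reduce fixed overhead.
It can be seen that the extension field CETH allows the congestion information to be carried without occupying the space of existing fields, can be used for interoperability in mixed networking scenarios, and supports manufacturer-defined congestion control notification information, which helps meet the needs of different manufacturers.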
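Based on the first aspect, in a possible embodiment, the acknowledgment packet specifically includes a BTH field, and the acknowledgment information, the indication information, and the congestion information are all carried in the BTH field.
For example, in one implementation, the reserved field "reserved 6" in the BTH of the traditional ACK may be used to carry the congestion information of the embodiments of this application; that is, "reserved 6" serves as the first CETH and carries the related data, so that congestion control and transmission acknowledgment are aggregated.
As another example, in one implementation, the reserved field "reserved 7" in the BTH of the traditional ACK may be used to carry the congestion information of the embodiments of this application; that is, "reserved 7" serves as the second CETH and carries the related data, so that congestion control and transmission acknowledgment are aggregated.
In this embodiment, the indication information and the congestion information can be integrated into the fields of the acknowledgment information. In this case, the acknowledgment packet can be regarded as an improvement of the traditional ACK packet: it makes full use of the existing field space in the acknowledgment information to carry the congestion information and avoids changing the existing packet format.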
基于第一方面,在可能的实施例中,在所述指示信息指示所述网络通路未发生拥塞的情况下,所述第一网络设备基于所述确认报文进行拥塞控制包括:所述第一网络设备将所述第一网络设备的发送速率维持不变,例如维持较高的发送速率,从而提高报文传输效率。
基于第一方面,在可能的实施例中,在所述指示信息指示所述网络通路未发生拥塞的情况下,所述第一网络设备基于所述确认报文进行拥塞控制包括:所述第一网络设备将所述第一网络设备的发送速率设置为预设速率。例如从低发送速率(该低发送速率例如是网络拥塞时设计的)调整为较高的发送速率(该高发送速率例如是网络无拥塞时设计的),从而提升了发送速率的恢复效率,降低报文发送的延迟。
基于第一方面,在可能的实施例中,指示信息可以是指示位、指示字段、指示标识等。
例如,当指示信息是指示位时,当指示位的值为0,意味着向源端设备的RNIC指示当前网络通路中没有网络拥塞的情况,确认报文中不携带拥塞信息;当指示位的值为1,意味着目的端设备向源端设备的RNIC指示当前网络通路中存在网络拥塞的情况,且确认报文中携带了拥塞信息。
又例如,指示信息可以利用现有字段进行功能重定义,比如指示信息可以是确认报文的BTH.BECN字段,当BTH.BECN字段为0,意味着向源端设备的RNIC指示当前网络通路中没有网络拥塞的情况,确认报文中不携带拥塞信息;当BTH.BECN字段为1,意味着向源端设备的RNIC指示当前网络通路中存在网络拥塞的情况,且确认报文中携带了拥塞信息。
第二方面,本申请提供一种RoCE网络拥塞控制的方法,所述方法包括:
第二网络设备接收第一网络设备的RoCE协议报文;所述第二网络设备检查所述RoCE协议报文是否携带显式拥塞通知;所述第二网络设备根据检查结果生成确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;所述第二网络设备向所述第一网络设备发送所述确认报文,所述确认报文用于所述第一网络设备进行拥塞控制。
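As can be seen, in embodiments of this application the destination device can use the second network device to check whether a RoCE protocol packet carries an explicit congestion notification, and then reply with an ACK that aggregates the indication information, which tells the first network device whether the current network path is congested. The first network device thus learns the current state of the network (congested or not congested) and can perform the corresponding congestion control operation; for example, when the network is not congested, it can promptly maintain or restore a high sending rate. On the one hand, carrying the indication information together with the ACK avoids the separate CNP required by conventional schemes, reduces notification overhead, helps reduce the latency of congestion notification under heavy traffic, and improves the reaction speed of the destination device. On the other hand, the indication information lets the source device learn of network congestion at the earliest opportunity and trigger congestion control sooner to regulate its sending rate, which improves the response speed of the source device. When the congestion clears, the source device also learns of it through the indication information and can restore its sending rate in time, which improves network bandwidth utilization.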
基于第二方面,在可能的实施例中,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;所述拥塞信息用于所述第一网络设备进行拥塞控制,有利于第一网络设备根据详细拥塞信息进行多样化、差异化、具体化的拥塞控制,例如调整不同级别的发送速率,又例如实现报文数量、发送时间等多样化的调整,从而极大提升了拥塞控制的效果。
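Based on the second aspect, in a possible embodiment, the congestion degree is one of multiple congestion levels, and different congestion levels correspond to different sending rates of the first network device.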
基于第二方面,在可能的实施例中,所述第二网络设备根据检查结果生成确认报文之前,还包括:所述第二网络设备生成所述拥塞信息。
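For example, when the current network is congested, the second network device may obtain network state information through packet inspection or hardware detection, such as the congestion degree, congestion point, queue depth, network delay, and other dimensions of information.
Based on the second aspect, in a possible embodiment, when the congestion information includes the congestion degree, the second network device may obtain the congestion degree in the following ways: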
(1)第二网络设备根据历史报文接收记录中携带显式拥塞通知的RoCE协议报文的比例,来确定所述网络通路的拥塞程度。例如,第二网络设备对接收报文携带ECN标记的比例做周期性滑窗统计,从而计算出当前该网络路径的具体拥塞程度。
(2)通过借助带内网络遥测(Inband Network Telemetry,INT)方式或者通过现场运行管理和维护(In-situ Operation Administration and Maintenance,IOAM)方式获得所述拥塞程度。以INT方式为例,可将INT支持的范围扩展到服务器网卡,网卡就可以接收到交换机插入到数据报文内的测量信息,通过该信息可以计算获得当前网络状态,例如通过时间戳计算出网络时延,通过队列长度和队列占用率计算出拥塞程度,等等。
基于第二方面,在可能的实施例中,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。其中,所述扩展字段例如为本文描述的拥塞控制扩展传输头(Congestion Extended Transport Header,CETH),即可通过扩展的CETH头来承载拥塞信息。
通过设计扩展字段CETH,既能够保证拥塞信息的承载,且不占用现有字段的空间,还可用于混合组网场景的兼容性对接,且支持各厂家自定义拥塞控制通告信息,有利于满足不同厂家的需求。
基于第二方面,在可能的实施例中,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
通过将指示信息和拥塞信息整合设置到确认信息的字段中,充分利用现有字段空间实现拥塞信息的承载,从而实现对确认信息中字段空间的充分利用,避免对现有报文格式作更改。
第三方面,本申请实施例提供一种装置,所述装置应用于第一网络设备,包括:报文发送模块,用于向第二网络设备发送RoCE协议报文;报文接收模块,用于接收来自第二网络设备的确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;拥塞控制模块,用于基于所述确认报文进行拥塞控制。
其中,该装置的各功能模块具体用于实现第一方面描述的方法步骤,这里不再赘述。
基于第三方面,在可能的实施例中,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;所述拥塞控制模块具体用于,根据所述拥塞信息进行拥塞控制。
基于第三方面,在可能的实施例中,所述拥塞控制模块具体用于:根据所述拥塞程度,调整所述第一网络设备的发送速率,其中所述拥塞程度和所述发送速率之间具有对应关系;或者,根据所述拥塞位置和所述报文队列深度中的至少一者,确定下一时间窗内待发送的报文数量;或者,根据所述网络时延调整所述第一网络设备的发送速率或确定下一时间窗内待发送的报文数量。
基于第三方面,在可能的实施例中,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应不同的发送速率。
基于第三方面,在可能的实施例中,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
基于第三方面,在可能的实施例中,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
基于第三方面,在可能的实施例中,所述拥塞控制模块具体用于,在所述指示信息指示所述网络通路未发生拥塞的情况下,将所述第一网络设备的发送速率维持不变。
基于第三方面,在可能的实施例中,所述拥塞控制模块具体用于,在所述指示信息指示所述网络通路未发生拥塞的情况下,将所述第一网络设备的发送速率设置为预设速率。
第四方面,本申请提供一种装置,所述装置应用于第二网络设备,包括:报文接收模块,用于接收第一网络设备的RoCE协议报文;拥塞信息确定模块,用于检查所述RoCE协议报文是否携带显式拥塞通知;通告聚合发送模块,用于根据检查结果生成确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;还用于向所述第一网络设备发送所述确认报文,所述确认报文用于所述第一网络设备进行拥塞控制。
其中,该装置的各功能模块具体用于实现第二方面描述的方法步骤,这里不再赘述。
基于第四方面,在可能的实施例中,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;所述拥塞信息用于所述第一网络设备进行拥塞控制。
基于第四方面,在可能的实施例中,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应所述第一网络设备不同的发送速率。
基于第四方面,在可能的实施例中,所述拥塞信息确定模块还用于生成所述拥塞信息。
基于第四方面,在可能的实施例中,当所述拥塞信息包括所述拥塞程度时,所述拥塞信息确定模块具体用于:根据历史报文接收记录中携带显式拥塞通知的RoCE协议报文的比例,来确定所述拥塞程度;或者,通过带内网络遥测(INT)方式获得所述拥塞程度;或者,通过现场运行管理和维护(IOAM)方式获得所述拥塞程度。
基于第四方面,在可能的实施例中,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
基于第四方面,在可能的实施例中,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
第五方面,本申请提供一种设备,所述设备包括主机系统和第一网络设备;所述主机系统用于与所述第一网络设备交互实现数据传输,所述第一网络设备用于执行如第一方面任意实施例描述的方法。
第六方面,本申请提供一种设备,所述设备包括主机系统和第二网络设备;所述主机系统用于与所述第二网络设备交互实现数据传输,所述第二网络设备用于执行如第二方面任意实施例描述的方法。
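In a seventh aspect, this application provides a first network device, which may include a controller, a register, a communication interface, and a logic operation component; these components may be electrically connected through one or more internal buses. The first network device implements the method described in any embodiment of the first aspect through the cooperation of these components.
In an eighth aspect, this application provides a second network device, which may include a controller, a register, a communication interface, and a logic operation component; these components may be electrically connected through one or more internal buses. The second network device implements the method described in any embodiment of the second aspect through the cooperation of these components.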
第九方面,本申请实施例提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如第一方面或第二方面的任意实施例描述的方法。
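In a tenth aspect, an embodiment of this application provides a non-volatile computer-readable storage medium that stores program code implementing the method described in any embodiment of the first aspect or the second aspect. When executed by a device, the program code implements the method described in any embodiment of the first aspect or the second aspect.
In an eleventh aspect, an embodiment of this application provides a computer program product that includes program instructions; when the computer program product is executed by a device, the method described in any embodiment of the first aspect or the second aspect is performed. The computer program product may be a software installation package, which can be downloaded and executed on a controller to implement the method described in any embodiment of the first aspect or the second aspect.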
可以看到,实施本申请实施例,目的端设备的第二网络设备可以在RoCE协议报文携带ECN标记时,回复聚合有CETH和指示信息的ACK确认信息,指示信息用于向源端设备通告当前网络通路上有网络拥塞,CETH用于向源端设备提供详细的拥塞信息。源端设备的第一网络设备从CETH提取拥塞信息进行定量、多样化的拥塞控制动作。目的端设备可以在RoCE协议报文未携带ECN标记时,则回复携带有指示信息的ACK确认信息,向源端设备通告当前网络通路上无网络拥塞,以便于源端设备及时维持或恢复高速发送速率。
这样,一方面,通过指示信息、拥塞信息和ACK的聚合承载,避免了独立CNP的发送,降低了通告的开销,有利降低大流量场景下拥塞通告的时延,提高了目的端设备的反应速度。
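On the other hand, the indication information lets the source device learn of network congestion at the earliest opportunity and trigger congestion control sooner to regulate its sending rate, which improves the response speed of the source device. When the congestion clears, the source device also learns of it through the indication information and can restore its sending rate in time, which improves network bandwidth utilization.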
最后,现有的RDMA网络拥塞控制的通告信息少,网络拥塞收敛速度慢,而本申请利用CETH承载网络详细的拥塞信息,例如拥塞程度、拥塞点、队列深度、网络时延等不同维度的信息,有利于源端设备根据详细拥塞信息将发送速率一步到位的调整到目标速率,实现快速收敛,并实现报文数量、发送时间等多样化的调整,极大提升了拥塞控制的效果。
附图说明
图1是本申请实施例涉及的系统架构的示意图;
图2是一种现有RoCE协议的设备通信流程场景图;
图3是一种网络大流量场景下的设备通信流程场景图;
图4是本申请实施例提供的一种包含功能模块的系统架构示意图;
图5是本申请实施例提供的一种网络设备的硬件结构示意图;
图6是本申请实施例提供的一些可能的确认报文的内容的示意图;
图7是本申请实施例提供的又一些可能的确认报文的内容的示意图;
图8是本申请实施例提供的一种拥塞信息的数据结构示例图;
图9是本申请实施例提供的一种确认报文的数据结构示意图;
图10是本申请实施例提供的另一种确认报文的数据结构示意图;
图11是本申请实施例提供的一些RoCE协议报文完整形式的示意图;
图12是本申请实施例提供的一些确认报文的完整形式的示意图;
图13是本申请实施例提供的一种RoCE网络拥塞控制的方法的流程示意图;
图14是本申请实施例提供的又一种RoCE网络拥塞控制的方法的流程示意图;
图15是本申请实施例提供的一种设备通信流程场景图;
图16是本申请实施例提供的一种网络大流量场景下的设备通信流程场景图。
具体实施方式
下面结合本申请实施例中的附图对本申请实施例进行描述。在本申请实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。需要说明的是,当在本说明书和所附权利要求书中使用时,术语“包括”以及它们的任何变形,意图在于覆盖不排他的包含。例如包含了一系列单元/器件的系统、产品或者装置没有限定于已列出的单元/器件,而是可选地还包括没有列出的单元/器件,或者还可选地包括这些产品或者装置固有的其他单元/器件。
还需要说明的是,本说明书和权利要求书中的术语“第一”“第二”“第三”等用于区别不同的对象,而并非用于描述特定的顺序或者特定的含义。
本申请的实施方式部分使用的术语仅用于对本申请的具体实施例进行解释,而非旨在限定本申请。
首先介绍本申请实施例所应用的系统架构。
参见图1,图1是本申请实施例涉及的系统架构的示意图。如图1所示,该系统架构包括源端设备10和目的端设备20,源端设备10和目的端设备20之间通过网络30进行通信连接,源端设备10和目的端设备20均支持RoCE协议的网络通信。源端设备10和目的端设备20可以是计算机、台式电脑、笔记本电脑、服务器、终端等等计算设备。
网络30可包括多个交换设备31,所述多个交换设备31可用于在源端设备10和目的端设备20之间进行报文中转、传输、网络流量检测等,以实现源端设备10和目的端设备20之间的通信交互。交换设备31例如可以是交换机、路由器、中继设备、或网关设备等等。
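Both the source device 10 and the destination device 20 may include a network device and a host system, where the host system includes a host CPU (central processing unit) and memory. For example, in FIG. 1 the source device 10 includes a CPU 12, a memory 13, and a network device 11, which may be connected through a bus. The destination device 20 includes a CPU 22, a memory 23, and a network device 21, which may likewise be connected through a bus.
In embodiments of this application, a network device is a piece of hardware designed to let a computing device communicate over a network. Specifically, it may be a network interface controller (NIC) that implements communication between the device and the network, also known as a network adapter, network interface card, or LAN adapter. In embodiments of this application the network device supports the RDMA protocol, so the NIC may also be called an RNIC (RDMA NIC). This document mainly uses the RNIC as an example to describe the solution.
As shown in FIG. 1, the network device 11 and the network device 21 are connected through the network 30 to enable communication between the source device 10 and the destination device 20. Both the network device 11 and the network device 21 support the RoCE protocol. When the source device 10 initiates an RDMA read or write request to the destination device 20 over the network, the network device 11 and the network device 21 write the data to be written directly from the memory 13 into the memory 23, or write the data to be read directly from the memory 23 into the memory 13.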
对于源端设备10和目的端设备20的主机系统,主机系统中的CPU的数量可以是一个或多个,各个CPU的类型可以不同或者相同。CPU可以包括一个或多个处理器核,或者多个CPU也可以集成为多核处理器。主机系统可通过CPU运行各种软件组件,例如操作系统,以及在操作系统上运行的应用程序,等等。用户可通过该操作系统或应用程序发起业务通信,进而通过网络设备实现源端设备10和目的端设备20之间的通信交互。
主机系统中的内存可用于存储计算机指令和数据,内存也可以存储通过RDMA读写的数据,报文等。内存可以是以下存储介质的任一种或任一种组合:储存级记忆体(Storage Class Memory,SCM)、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)或高速缓存(Cache)。
上述系统架构中,两台计算设备之间的远程访问由计算设备的RNIC实现。支持RoCE协议的RNIC和交换设备组成的网络共同构成了RoCE网络。本申请方案应用于该RoCE网络,本申请实施例描述的方法可部署在RNIC网卡上,用于实现RoCE网络拥塞控制。
虽然将对上述两台计算设备划分角色分为源端设备和目的端设备,但需要理解的是,其中,“源端设备”和“目的端设备”是两个相对的概念:
源端设备指的是发起RDMA请求的计算设备,即请求访问另一台计算设备的计算设备。
目的端设备指的是接收RDMA请求的计算设备,即被另一台计算设备访问的计算设备。
例如,源端设备对目的端设备的访问可以为源端设备向目的端设备中写入数据,具体地,源端设备通过源端设备的RNIC将源端设备中的数据传输给目的端设备的RNIC,目的端设备通过目的端设备的RNIC接收该数据,从而将源端设备中的数据传输到目的端设备中。源端设备对目的端设备的访问也可以为源端设备从目的端设备中读取数据,具体地,源端设备可通过源端设备的RNIC读取目的端设备内存中的数据,目的端设备通过目的端设备的RNIC将源端设备要读取的数据发送给源端设备的RNIC,源端设备的RNIC接收该数据从而完成对目的端设备中的数据的读取。
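In embodiments of this application, data exchanged between the source device and the destination device is carried mainly in packets; a packet that supports the RoCE protocol is referred to here as a RoCE protocol packet or a RoCE data packet. Generally, when a packet sent by the source device 10 to the destination device 20 is successfully received, the destination device must reply with an ACK packet to inform the source device that the packet was received. The technical solution of the embodiments of this application mainly optimizes this process.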
需要说明的是,图1只是为描述本申请实例提供的技术方案,示出了上述组件及其连接关系。在具体实现时,图1所示的源端设备10和目的端设备20还可以包括上述组件之外的其它组件,例如还可以包括硬盘等硬件资源,这里不再展开描述。
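FIG. 2 shows the device communication flow of an existing RoCE protocol. RoCE is one kind of RDMA protocol: RDMA is a transport-layer protocol, while RoCE additionally includes the network layer and the link layer. RoCE also supports the reliable connection service: the source device sends protocol packets carrying packet sequence numbers (PSN), and after receiving a protocol packet the destination device replies to the source device with an ACK packet, telling the RNIC of the source device that the packet it sent has been delivered successfully. When excessive traffic causes congestion in the network, a congestion point (CP) device in the network applies RED ECN marking to packets. On receiving a packet that carries the ECN mark, the destination device, as the protocol requires, not only replies to the source device with an ACK packet but also sends a separate CNP packet to announce the congestion. The CNP packet is defined purely as a signal and carries no state information. Moreover, after the destination device replies with the ACK packet and the CNP packet separately, the RNIC of the source device must also process the two packets separately.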
图3示例性示出了一种网络大流量场景下的设备通信流程。如图3所示,数据报文1发生了网络拥塞,但是受限于目的端设备的发送能力,其第一个CNP报文A的发送存在延迟,这导致源端设备直到协议报文5才开始降速,在此之前由于未及时降低发送速率会导致网络拥塞程度一直在加剧。从协议报文6开始网络拥塞开始消除,但是由于CNP报文的通告延迟,直到协议报文7源端设备都一直在降速,过大的降速会影响网络带宽利用率。从CNP报文B开始目的端设备就没有再次通告网络拥塞,但是源端设备无法及时获得拥塞消除的信息,只能靠时间超时缓慢升速,直到协议报文10才恢复到目标发送速率,在此期间的网络带宽利用率低。
综上可以看到,现有方案中,由于网卡的报文发送速率是存在上限的,发生网络拥塞后,目的端设备的网卡需要连续发送ACK确认报文和CNP报文,当网络流量压力增大时,会导致确认报文和CNP报文的回复延迟,从而造成控速的响应延迟。另外,CNP只能通告发生了拥塞,不能通告拥塞消除了,只能由发送端定时检测,ACK延迟增大会导致飞包统计偏多,CNP报文延迟会导致源端设备拥塞控制反应速度慢,速率控制不及时。
另外,RoCE协议采用CNP报文来通告网络拥塞,都只能通告网络上发生了拥塞,而无法通告网络拥塞的具体状态。这导致发送端无法实现高效的拥塞控制,而是只能一步步的缓慢逼近目标速率,网络收敛慢、带宽利用率低。
本申请实施例对源端设备和目的端设备的RNIC进行改进,从而能够解决上述现有方案中所提及的部分或全部的缺陷。参见图4,图4示出了本申请实施例的一种具体的系统架构。在该系统架构中,源端设备10和目的端设备20的RNIC分别配置了相关功能模块以支持本申请方案的实现。如图4所示,该源端设备10的网络设备为RNIC11,该目的端设备20的网络设备为RNIC21,RNIC11配置了拥塞控制模块111、报文发送模块112和报文接收模块113。RNIC21配置了拥塞信息确定模块211、通告聚合发送模块212和报文接收模块213,具体描述如下:
报文发送模块112用于向目的端设备20发送RoCE协议报文。
报文接收模块113用于接收来自目的端设备20的确认报文,该确认报文是本申请实施例设计的聚合报文。该确认报文可聚合了对所述RoCE协议报文的确认信息(ACK)和指示信息,所述指示信息用于指示源端设备10和目的端设备20之间的网络通路是否发生拥塞。在网络通路发生拥塞的情况下,该确认报文还聚合了拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息。也就是说,该确认报文既携带了网络拥塞的指示信息,也携带了网络拥塞的具体状态信息。关于确认报文的具体实现将在后文展开详细描述。
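The congestion control module 111 performs quantifiable congestion control based on the acknowledgment packet. In embodiments of this application, congestion control is a function that adjusts how many packets a transport protocol (here RoCE) connection sends at a time (the per-transmission volume); it can quantitatively increase or decrease the per-transmission volume and the sending frequency so that they approach the carrying capacity best suited to the current network.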
报文接收模块213用于接收来自源端设备10的RoCE协议报文。
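The congestion information determination module 211 checks whether the RoCE protocol packet carries an explicit congestion notification. If it does, the congestion information determination module 211 may generate the congestion information, which specifically includes at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. The congestion information supports quantitative congestion control by the source device.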
通告聚合发送模块212可用于根据报文的检查结果生成确认报文,该确认报文即本申请实施例设计的所述聚合报文。还用于向源端设备10发送所述确认报文,以便于所述源端设备10实现定量的拥塞控制。
源端设备和目的端设备中的RNIC的上述功能模块可分别通过各自的RNIC的软件和硬件结构的相互配合实现。参见图5,图5示出了一种示例性的RNIC硬件结构30,该RNIC硬件结构30可以是源端设备中的RNIC的结构,也可以是目的端设备中的RNIC的结构。在具体实现中,RNIC硬件结构30可以是独立的标准网卡(例如PCIe接口的网卡),也可以是集成到SoC芯片的集成网卡,通过升级现有RNIC网卡(例如ASIC芯片、FW固件等)硬件来获得,以支持本申请实施例提及的方案。
如图5所示,该RNIC硬件结构30可包括控制器31、寄存器32、通信接口33和逻辑运算部件34,这些组件可以通过一个或多个内部总线35进行电性连接。其中:
寄存器32为一种存储空间较小的存储器,寄存器32可以用于保存各种指令;寄存器32还可以用于存储在指令执行过程中临时存放的寄存器操作数和中间或最终的操作结果;寄存器还可以用于存储逻辑运算部件34完成控制器31请求的任务所使用的数据。
控制器31用于对寄存器中保存的指令进行译码,并发出为完成每条指令所要执行的各个操作的控制信号。控制器31是可运行程序的处理器核,例如可通过系统级芯片(System-on-a-Chip,SOC),现场可编程门阵列(field programmable gate array,FPGA)、专用集成电路(application-specific integrated circuit,ASIC)或FPGA+ASIC等电路装置实现。控制器31又例如可以由各种与或门阵列组成。控制器31的控制方式例如可以是以微存储为核心的微程序控制方式,微程序可以保存在寄存器32中,又例如可以是为以逻辑硬布线结构为主的硬件控制方式,本申请不做限定。
逻辑运算部件34可以用于执行运算命令,如加命令、减命令、乘命令、除命令,等等;逻辑运算部件34还可以用于获取逻辑命令,如或逻辑命令、与逻辑命令、非逻辑命令,等等。逻辑运算部件34还可以用于从控制器31获取控制信号,根据获取到的控制信号从寄存器32中获取该控制信号对应的数据并执行相应的操作。
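The communication interface 33 sends or receives data. There may be multiple communication interfaces 33, used respectively to receive data sent by the CPU of the host system or send data to it, or to receive data sent by an external computing device or send data to an external computing device (for example, to send or receive RoCE protocol packets or aggregated acknowledgment packets).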
可选地,RNIC还可以包括晶体振荡器、媒体接入控制器、物理接口收发器,等等,本申请实施例不作限制。
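In a specific embodiment, the controller 31 reads the instructions stored in the register and issues the control signals for the operations required to complete each instruction, so as to implement the RoCE network congestion control method described in any embodiment of this document.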
为了更好理解本申请的实现方案,下面对本申请实施例提供的能够实现拥塞通告聚合的确认报文进行具体描述。
本申请实施例通过对现有RoCE ACK确认报文进行扩展,改进RoCE协议的拥塞通告机制,获得本申请的确认报文,使其可以携带网络拥塞的指示信息和拥塞信息,以实现拥塞通告聚合和精准的网络拥塞通告。
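Refer to FIG. 6 and FIG. 7 together, which show some possible contents of the acknowledgment packet in embodiments of this application. The acknowledgment packet may be generated by the destination device and sent in reply to the source device.
FIG. 6 shows two acknowledgment packets for the no-congestion scenario (for example, the RoCE protocol packet carries no explicit congestion notification, ECN). The acknowledgment packet includes acknowledgment information for the RoCE protocol packet and indication information. The acknowledgment information performs the existing ACK function, that is, it tells the RNIC of the source device whether the packets it sent were successfully received by the destination device. The indication information indicates that the network path between the source device and the destination device is not congested.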
如图6所示,一种实施例中,确认报文实现为确认报文A,确认报文A中的确认信息和指示信息可设置在报文中的不同位置,例如可以分布在不同的报文头中,从而避免对确认信息中字段做修改。又一种实施例中,确认报文实现为确认报文B,可将该指示信息整合设置到确认信息的字段中,从而实现对确认信息中字段空间的充分利用。
图7示出了有网络拥塞场景(例如RoCE协议报文携带显式拥塞通知ECN)下的两种确认报文的示意图,确认报文包括对所述RoCE协议报文的确认信息、指示信息和拥塞信息,该确认信息可实现现有的ACK确认的功能,即通告源端设备的RNIC其发送的报文是否已被目的端设备成功传输,该指示信息用于指示源端设备和目的端设备之间的网络通路发生拥塞,拥塞信息用于指示网络的具体状态,具体可包括网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延等等中的至少一种信息。
如图7所示,一种实施例中,确认报文实现为确认报文C,确认报文C中的确认信息、指示信息和拥塞信息可设置在报文中的不同位置,例如可以分布在不同的报文头中,从而避免对确认信息中字段做修改。又一种实施例中,确认报文实现为确认报文D,可将该指示信息和/或拥塞信息整合设置到确认信息的字段中,从而实现对确认信息中字段空间的充分利用。
本申请实施例中,指示信息可以是指示位、指示字段、指示标识等。
例如,当指示信息是指示位时,当指示位的值为0,意味着向源端设备的RNIC指示当前网络通路中没有网络拥塞的情况,确认报文中不携带拥塞信息;当指示位的值为1,意味着目的端设备向源端设备的RNIC指示当前网络通路中存在网络拥塞的情况,且确认报文中携带了拥塞信息。
又例如,指示信息可以利用现有字段进行功能重定义,比如指示信息可以是确认报文的BTH.BECN字段,当BTH.BECN字段为0,意味着向源端设备的RNIC指示当前网络通路中没有网络拥塞的情况,确认报文中不携带拥塞信息;当BTH.BECN字段为1,意味着向源端设备的RNIC指示当前网络通路中存在网络拥塞的情况,且确认报文中携带了拥塞信息。
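The indicator semantics described above (BECN = 0 for "no congestion, no congestion information", BECN = 1 for "congestion, congestion information present") can be sketched as follows. This is a minimal illustration that assumes the commonly documented 12-byte RoCEv2 BTH layout, in which byte 4 holds FECN (0x80), BECN (0x40), and six reserved bits. The helper names `make_bth`, `set_congestion_indicator`, and `path_congested` are hypothetical and do not come from the application.

```python
import struct

BTH_LEN = 12       # Base Transport Header size in bytes
BECN_MASK = 0x40   # assumed position of BECN: bit 6 of BTH byte 4


def make_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """Build a bare BTH with all flag and reserved bits cleared (illustrative)."""
    return struct.pack("!BBHII", opcode, 0, pkey,
                       dest_qp & 0xFFFFFF, psn & 0xFFFFFF)


def set_congestion_indicator(bth: bytes, congested: bool) -> bytes:
    """Return a copy of the BTH with BECN set (congested) or cleared."""
    if len(bth) != BTH_LEN:
        raise ValueError("BTH must be exactly 12 bytes")
    out = bytearray(bth)
    if congested:
        out[4] |= BECN_MASK
    else:
        out[4] &= ~BECN_MASK & 0xFF
    return bytes(out)


def path_congested(bth: bytes) -> bool:
    """Source-side check: does the ACK report congestion on the path?"""
    return bool(bth[4] & BECN_MASK)
```

Setting or clearing the bit leaves the destination QP and the PSN untouched, so the ACK function of the header is unaffected.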
本申请实施例中,拥塞信息可以是定义的一个新的报文头(如下面描述的CETH)进行内容承载,支持携带详细的网络拥塞状态内容,以便于目的端设备向源端设备进行准确的网络拥塞通告。拥塞信息也可以是利用现有的字段空间进行内容承载,例如利用现有的保留(reserve)字段。
下面描述本申请实施例提供的一种拥塞信息的设计方式。参见图8,图8是本申请实施例提供的一种拥塞信息的数据结构示例图,本文中可将拥塞信息的数据结构称为拥塞控制扩展传输头(Congestion Extended Transport Header,CETH)。如图8所示,该CETH(也可以称为CETH头)包括标准定义和厂家自定义信息(Vender defined information)这两部分,其中标准定义部分可用于混合组网场景的兼容性对接。标准定义部分可包括如下字段:版本号(Ver)、CETH头长度(Length)。其中:
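The vendor-defined information field lets each vendor define its own congestion control notification information; its total length is, for example, (Length*4 - 1) bytes. For example, a vendor may design it to carry at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path. For instance, the congestion degree of the network path may be represented by a 2-bit ratio field that identifies the congestion degree. In application scenarios, the ratio field may indicate the congestion degree in levels, for example: no congestion, light congestion, moderate congestion, severe congestion, and so on.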
此外,在具体实现中,还可以对厂家自定义信息设计更多的其他内容,例如设计的1bit的字段来指示当前的拥塞通告为普通CNP类型还是增强CNP类型。又例如可以设计4bit的字段来用于标识业务场景,例如RC/XRC write/send场景、RC/XRC read response场景或UD send场景。
Ver字段用于指示CETH版本号,例如可占用4bit,用于支持拥塞控制算法的升级及兼容性对接。其中,版本号0表示标准CNP通告,未携带其它信息,版本号1~15由厂家自定义使用。
Length字段用于指示CETH头长度,例如占用4bit,支持CETH头可变长度,以减少固定开销。Length例如可取值1~4,用于指示CETH头长度为多少个4字节。
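A minimal sketch of the CETH layout described above: a 4-bit Ver, a 4-bit Length counted in 4-byte units (1 to 4), and (Length*4 - 1) bytes of vendor-defined information. Placing Ver in the high nibble of the first byte and zero-padding the vendor bytes are assumptions made for illustration; `pack_ceth` and `parse_ceth` are hypothetical names.

```python
from typing import Tuple


def pack_ceth(version: int, vendor_info: bytes = b"") -> bytes:
    """Serialize a CETH: one Ver|Length byte followed by vendor-defined bytes."""
    if not 0 <= version <= 15:
        raise ValueError("Ver is a 4-bit field")
    # Total header = 1 byte (Ver|Length) + vendor info, rounded up to 4 bytes.
    length_words = max(1, (1 + len(vendor_info) + 3) // 4)
    if length_words > 4:
        raise ValueError("CETH is limited to 4 words (16 bytes) in this sketch")
    padded = vendor_info.ljust(length_words * 4 - 1, b"\x00")
    return bytes([(version << 4) | length_words]) + padded


def parse_ceth(buf: bytes) -> Tuple[int, int, bytes]:
    """Return (version, length_in_words, vendor_info) from the start of buf."""
    if not buf:
        raise ValueError("empty buffer")
    version, length_words = buf[0] >> 4, buf[0] & 0x0F
    total = length_words * 4
    if not 1 <= length_words <= 4 or len(buf) < total:
        raise ValueError("invalid or truncated CETH")
    return version, length_words, buf[1:total]
```

For example, `pack_ceth(1, bytes([0b10000000]))` yields a 4-byte header whose vendor area could carry a 2-bit ratio field in its top bits; how the vendor area is subdivided is left to each vendor.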
基于上述的指示信息和拥塞信息,下面描述一些可能的确认报文的数据结构。
参见图9,图9是本申请提供的一种确认报文的数据结构示意图。如图9所示,确认报文具体包括ACK确认信息和CETH,其中:
ACK确认信息具体包括了基础传输头(Base Transport Header,简称BTH,或称BTH头,或称BTH字段)和确认扩展传输头(Acknowledge Extended Transport Header,AETH),即ACK确认信息可通过BTH和AETH承载,该ACK确认信息用于实现ACK的功能,即通告源端设备的RNIC其发送的报文是否已被目的端设备成功传输。该实施例中,用于指示源端设备和目的端设备之间的网络通路是否发生拥塞的指示信息也可承载于BTH字段。
关于BTH和AETH的相关子字段(例如OpCode、SE、目标QP、Pad、TVer等等字段)可参考现有技术方案的相关描述,这里不再展开描述。
CETH是本申请设计的扩展字段,本实施例中可以将CETH作为一个可选项聚合到ACK的AETH后面,从而实现ACK确认信息、指示信息和CETH的聚合。CETH用于指示当前网络通路的具体网络状态。当网络拥塞发生时,ACK携带指示信息并且聚合CETH,以便于目的端设备及时向源端设备通告网络拥塞的发生,当网络无拥塞时,ACK携带指示信息而不携带CETH,以便于目的端设备及时向源端设备通告拥塞的消除。
由于CETH携带了网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延等网络状态信息,有利于解决现有RoCE网络由于拥塞控制通告的信息少而对实现高效的拥塞控制不友好的问题。由于报文的聚合以及指示信息的携带,也有利于解决RoCE网络拥塞控制反应速度慢的问题。
具体应用场景中,当目的端设备的RNIC接收到的RoCE协议报文不携带ECN标记时,目的端设备的RNIC向源端设备回复一个不携带CETH头的ACK,该ACK即只包括了确认信息和指示信息,以通告源端设备的RNIC其协议报文已被接收且网络上未发生拥塞。
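When the RoCE protocol packet received by the RNIC of the destination device carries the ECN mark, the RNIC of the destination device replies to the source device with an ACK that carries a CETH header. This ACK includes the acknowledgment information, the indication information, and the congestion information. It tells the RNIC of the source device that its protocol packet has been received and that the network is congested, and it gives the RNIC of the source device detailed network state information, such as the congestion degree, congestion point, queue depth, and network delay, so that quantitative congestion control can be performed based on this information.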
例如,对于拥塞信息中指示的网络拥塞程度,可以分为“无拥塞、轻度拥塞、中度拥塞、重度拥塞”等多个级别,源端设备的RNIC可以根据拥塞程度的差异来决定如何实施降速处理,实现不同档次的报文发送速率的调节,因而可以实现更快的速率收敛。
又例如,对于拥塞点和队列深度信息,源端设备的RNIC可以根据该类信息判断该网络路径上还可以承载多少数据报文而不导致发生丢包等行为,从而决定可以继续发送的报文数量,这对高带宽需求的网络应用比较友好。
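The destination-side decision in the preceding paragraphs (ECN mark present: reply with indication plus congestion information; ECN mark absent: reply with indication only) can be sketched as below. The data classes and the `measure` callback are hypothetical stand-ins for whatever the RNIC actually records; the ECN codepoint value 0b11 ("Congestion Experienced") follows RFC 3168.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ECN_CE = 0b11  # "Congestion Experienced" codepoint of the IP ECN field


@dataclass(frozen=True)
class CongestionInfo:
    level: int                         # e.g. 0 = none .. 3 = severe
    queue_depth: Optional[int] = None  # packets queued at the congestion point
    delay_us: Optional[float] = None   # measured network delay


@dataclass(frozen=True)
class AggregatedAck:
    psn: int                           # sequence number being acknowledged
    congested: bool                    # the indication information
    info: Optional[CongestionInfo] = None  # carried only when congested


def build_ack(psn: int, ip_ecn_bits: int,
              measure: Callable[[], CongestionInfo]) -> AggregatedAck:
    """Build the reply for one received RoCE packet."""
    if ip_ecn_bits == ECN_CE:
        # Congestion: aggregate indication and detailed congestion info.
        return AggregatedAck(psn, True, measure())
    # No congestion: indication only, no congestion info is attached.
    return AggregatedAck(psn, False, None)
```

Because `measure` is only called on the congested branch, the cost of collecting detailed state is paid only when it will actually be sent.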
参见图10,图10是本申请提供的另一种确认报文的数据结构示意图。在该实施例中,CETH(拥塞信息)可利用现有的字段空间进行承载。如图10所示,CETH被整合到ACK确认信息的BTH字段中,该实施例中,用于指示源端设备和目的端设备之间的网络通路是否发生拥塞的指示信息也可承载于BTH字段。这种情况下,该确认报文可视为传统ACK报文的改进,充分利用现有字段空间实现拥塞信息的承载。该确认报文既能实现ACK的功能,即通告源端设备的RNIC其发送的报文是否已被目的端设备成功传输;又能实现拥塞指示功能,即指示源端设备和目的端设备之间的网络通路是否发生拥塞。还能使得源端设备的RNIC获得详细的网络状态信息,例如拥塞程度、拥塞点、队列深度、网络时延等不同维度的信息,从而能根据这些信息实现定量化的拥塞控制。
例如,在一种实现中,可利用传统ACK的BTH中的保留字段“reserved 6”去承载本申请实施例的拥塞信息,也就是该“reserved 6”作为第一CETH进行相关数据承载,从而达到拥塞控制和传输确认的聚合的目的。
又例如,在一种实现中,可利用传统ACK的BTH中的保留字段“reserved 7”去承载本申请实施例的拥塞信息,也就是该“reserved 7”作为第二CETH进行相关数据承载,从而达到拥塞控制和传输确认的聚合的目的。
其中,第一CETH和第二CETH的具体实现形式可以是仅包括拥塞程度、拥塞点、队列深度、网络时延等信息,还可以包括如图8实施例描述的其他信息,例如版本号等。
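As an illustration of reusing existing field space, the sketch below stores a 2-bit congestion level in the six reserved bits that share a byte with FECN and BECN. It assumes the commonly documented RoCEv2 BTH layout (byte 4 = FECN, BECN, 6 reserved bits); the choice of the two lowest reserved bits and the names `put_level` and `get_level` are hypothetical.

```python
RESV6_MASK = 0x3F  # six reserved bits in BTH byte 4 (assumed layout)
LEVEL_MASK = 0x03  # this sketch uses only the two lowest reserved bits


def put_level(bth: bytes, level: int) -> bytes:
    """Write a 2-bit congestion level into the reserved bits, keeping FECN/BECN."""
    if len(bth) != 12:
        raise ValueError("BTH must be exactly 12 bytes")
    if not 0 <= level <= 3:
        raise ValueError("level must fit in 2 bits")
    out = bytearray(bth)
    out[4] = (out[4] & ~RESV6_MASK & 0xFF) | level
    return bytes(out)


def get_level(bth: bytes) -> int:
    """Read the 2-bit congestion level back from the reserved bits."""
    return bth[4] & LEVEL_MASK
```

This variant adds no bytes to the ACK, at the price of very little room: six bits can hold a level and little else, which is why richer information such as queue depth or delay fits better in the extension header.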
参考图11和图12,图11和图12分别示例性描述了本申请实施例中的一些RoCE协议报文完整形式和对应的确认报文的完整形式。
如图11所示,一种来自源端设备的RoCE协议报文可包括MAC报文头、IP报文头、UDP报文头、BTH头、数据载荷、不变循环冗余校验值(Invariant Cyclic Redundancy Check,ICRC)和可变循环冗余校验值(Variant Cyclic Redundancy Check,VCRC)。其中MAC报文头、IP报文头、UDP报文头分别是MAC层、IP层、UDP层对应的报文头,数据载荷为设备通信交互需要传输的数据,ICRC和VCRC可用于校验数据完整性。又一种RoCE协议报文可包括MAC报文头、IP报文头、UDP报文头、BTH头、ImmDt字段、数据载荷、ICRC和VCRC。
需要说明的是,实际应用中,RoCE协议报文还可以包括更多或更少的内容,本申请对此不做限制。
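As shown in FIG. 12, one acknowledgment packet from the destination device may include a MAC header, an IP header, a UDP header, a BTH header, an AETH header, a CETH header, an ICRC, and a VCRC. The MAC header, IP header, and UDP header are the headers of the MAC layer, IP layer, and UDP layer respectively, and the ICRC and VCRC can be used to check data integrity. For the specific content and implementation of the BTH header, AETH header, and CETH header here, refer to the description of the FIG. 9 embodiment; details are not repeated. Another acknowledgment packet may include a MAC header, an IP header, a UDP header, a BTH header, an AETH header, an ICRC, and a VCRC. For the specific content and implementation of the BTH header and AETH header here, refer to the description of the FIG. 10 embodiment; details are not repeated.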
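A minimal sketch of how a receiver might split the transport headers of the first packet form (BTH, AETH, optional CETH). It assumes a 12-byte BTH with BECN at bit 6 of byte 4, a 4-byte AETH, and a CETH whose first byte carries Length (in 4-byte units) in its low nibble; the input is assumed to start at the BTH and to exclude the trailing CRC fields. `split_ack_headers` is a hypothetical name.

```python
from typing import Tuple

BTH_LEN, AETH_LEN = 12, 4  # assumed header sizes in bytes
BECN_MASK = 0x40           # assumed indicator bit in BTH byte 4


def split_ack_headers(payload: bytes) -> Tuple[bytes, bytes, bytes]:
    """Return (bth, aeth, ceth); ceth is empty when no congestion is indicated."""
    if len(payload) < BTH_LEN + AETH_LEN:
        raise ValueError("truncated ACK")
    bth = payload[:BTH_LEN]
    aeth = payload[BTH_LEN:BTH_LEN + AETH_LEN]
    rest = payload[BTH_LEN + AETH_LEN:]
    if not bth[4] & BECN_MASK:
        return bth, aeth, b""          # no congestion: no CETH expected
    if not rest:
        raise ValueError("congestion indicated but CETH missing")
    ceth_len = (rest[0] & 0x0F) * 4    # Length field counts 4-byte words
    if ceth_len == 0 or len(rest) < ceth_len:
        raise ValueError("invalid or truncated CETH")
    return bth, aeth, rest[:ceth_len]
```

The indicator bit doubles as the "CETH present" flag here, which matches the rule that congestion information is carried only when congestion is indicated.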
需要说明的是,实际应用中,确认报文还可以包括更多或更少的内容,本申请对此不做限制。
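Based on the system architecture and packet data structures described above, the following describes the congestion control method provided by embodiments of this application. For convenience, each method embodiment below is presented as a combination of a series of action steps, but a person skilled in the art should understand that the specific implementation of the technical solution of this application is not limited by the order of the described action steps.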
参见图13,图13是本申请实施例提供的一种RoCE网络拥塞控制的方法的流程示意图,该方法从第一网络设备和第二网络设备交互的角度进行方案描述,其中第一网络设备和第二网络设备可以是RNIC、网络接口控制器、网络适配器、网卡、局域网接收器等等。例如,第一网络设备可以是源端设备的RNIC,第二网络设备可以是目的端设备的RNIC,第一网络设备和第二网络设备可通过网络连接。该方法包括但不限于以下步骤:
S301.第一网络设备向第二网络设备发送RoCE协议报文。所述RoCE协议报文可以是基于用户的业务需求而产生的,RoCE协议报文可以是周期性的报文。RoCE协议报文的具体内容已在前文做了描述,这里不再赘述。
S302.第二网络设备检查所述RoCE协议报文是否携带显式拥塞通知。
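When excessive traffic causes congestion in the network, a CP device in the network may apply RED ECN marking to packets, and on receiving a packet that carries the ECN mark the second network device determines that the network is currently congested. Conversely, on receiving a packet that does not carry the ECN mark, the second network device determines that the network is currently not congested.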
S303.第二网络设备根据检查结果生成ACK确认报文,所述ACK确认报文至少包括聚合的确认信息和指示信息。所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞。
本申请实施例中,指示信息可以是指示位、指示字段、指示标识等。例如,当指示信息是指示位时,当指示位的值为0,意味着向源端设备的RNIC指示当前网络通路中没有网络拥塞的情况,确认报文中不携带拥塞信息;当指示位的值为1,意味着目的端设备向源端设备的RNIC指示当前网络通路中存在网络拥塞的情况,且确认报文中携带了拥塞信息。
一种实施例中,确认信息和指示信息可设置在报文中的不同位置,例如可以分布在不同的报文头中,从而避免对确认信息中字段做修改。又一种实施例中,可将该指示信息整合设置到确认信息的字段中,从而实现对确认信息中字段空间的充分利用。
S304.第二网络设备向所述第一网络设备回复所述确认报文,相应的,所述第一网络设备接收来自第二网络设备的确认报文。
S305.第一网络设备基于所述确认报文进行拥塞控制。
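In embodiments of this application, congestion control is a function that adjusts how many packets a transport protocol (here RoCE) connection sends at a time (the per-transmission volume); it can quantitatively increase or decrease the per-transmission volume and the sending frequency so that they approach the carrying capacity best suited to the current network. For example, when the indication information indicates that the network is currently congested, the first network device may lower the sending rate of RoCE protocol packets in the next time window. When the indication information indicates that the network is currently not congested, the first network device may keep the sending rate of RoCE protocol packets in the next time window unchanged, or set it to a preset rate.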
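The indicator-only behaviour of step S305 can be sketched as a single function: lower the rate when congestion is indicated, otherwise either keep the rate or jump to a preset rate. The multiplicative backoff factor of 0.5 and the rate floor are illustrative assumptions; the application does not specify how much the rate is lowered.

```python
from typing import Optional


def next_window_rate(current_gbps: float, congested: bool,
                     preset_gbps: Optional[float] = None,
                     backoff: float = 0.5, floor_gbps: float = 0.1) -> float:
    """Sending rate for the next time window, driven only by the indicator."""
    if congested:
        # Congestion indicated: reduce the rate, but never below the floor.
        return max(floor_gbps, current_gbps * backoff)
    # No congestion: keep the current rate, or restore a preset rate if given.
    return current_gbps if preset_gbps is None else preset_gbps
```

For instance, `next_window_rate(40.0, False, preset_gbps=100.0)` restores the full preset rate in one step instead of ramping up on a timer.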
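As can be seen, in embodiments of this application, when a RoCE protocol packet carries the ECN mark, the destination device can reply through the second network device with an ACK that aggregates the indication information, which tells the source device that the current network path is congested. The first network device of the source device can then lower the sending rate of RoCE protocol packets in the next time window. When a RoCE protocol packet does not carry the ECN mark, the destination device can reply with an ACK that carries the indication information, telling the source device that the current network path is not congested, so that the source device can promptly maintain or restore a high sending rate.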
这样,一方面,通过指示信息和ACK的聚合承载,避免了独立CNP的发送,降低了通告的开销,有利降低大流量场景下拥塞通告的时延,提高了目的端设备的反应速度。
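On the other hand, the indication information lets the source device learn of network congestion at the earliest opportunity and trigger congestion control sooner to regulate its sending rate, which improves the response speed of the source device. When the congestion clears, the source device also learns of it through the indication information and can restore its sending rate in time, which improves network bandwidth utilization.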
参见图14,图14是本申请实施例提供的又一种RoCE网络拥塞控制的方法的流程示意图,该方法从第一网络设备和第二网络设备交互的角度进行方案描述,其中第一网络设备和第二网络设备可以是RNIC、网络接口控制器、网络适配器、网卡、局域网接收器等等。例如,第一网络设备可以是源端设备的RNIC,第二网络设备可以是目的端设备的RNIC,第一网络设备和第二网络设备可通过网络连接。该方法包括但不限于以下步骤:
S401.第一网络设备向第二网络设备发送RoCE协议报文。RoCE协议报文的具体内容已在前文做了描述,这里不再赘述。
S402.第二网络设备检查所述RoCE协议报文是否携带显式拥塞通知(ECN)。当确定RoCE协议报文携带显式拥塞通知时,第二网络设备后续执行步骤S403-S405;当确定RoCE协议报文不携带显式拥塞通知时,第二网络设备后续执行步骤S406-S407。
S403.第二网络设备获取拥塞信息,拥塞信息用于指示网络的具体状态。
具体的,第二网络设备当前网络有拥塞时,可通过报文检测或者硬件检测的方式获得网络状态信息,例如拥塞程度、拥塞点、队列深度、网络时延等不同维度的信息。
举例来说,当所述拥塞信息包括所述拥塞程度时,所述第二网络设备可通过以下获得所述拥塞程度:
(1)第二网络设备根据历史报文接收记录中携带显式拥塞通知的RoCE协议报文的比例,来确定所述网络通路的拥塞程度。例如,第二网络设备对接收报文携带ECN标记的比例做周期性滑窗统计,从而计算出当前该网络路径的具体拥塞程度。
(2)通过借助带内网络遥测(Inband Network Telemetry,INT)方式或者通过现场运行管理和维护(In-situ Operation Administration and Maintenance,IOAM)方式获得所述拥塞程度。以INT方式为例,可将INT支持的范围扩展到服务器网卡,网卡就可以接收到交换机插入到数据报文内的测量信息,通过该信息可以计算获得当前网络状态,例如通过时间戳计算出网络时延,通过队列长度和队列占用率计算出拥塞程度,等等。
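Method (1) above, a sliding-window statistic over the share of ECN-marked packets, can be sketched as follows. The window size and the thresholds that map a ratio to the levels "none, light, moderate, severe" are illustrative assumptions, and `EcnRatioWindow` is a hypothetical name.

```python
from collections import deque
from typing import Tuple


class EcnRatioWindow:
    """Tracks the fraction of ECN-marked packets among the last `size` packets."""

    def __init__(self, size: int = 64,
                 thresholds: Tuple[float, float] = (0.25, 0.60)) -> None:
        self._marks = deque(maxlen=size)  # oldest samples fall out automatically
        self._thresholds = thresholds     # (moderate, severe) ratio thresholds

    def record(self, ecn_marked: bool) -> None:
        """Record one received RoCE packet."""
        self._marks.append(bool(ecn_marked))

    def ratio(self) -> float:
        """Share of marked packets in the current window (0.0 if empty)."""
        return sum(self._marks) / len(self._marks) if self._marks else 0.0

    def level(self) -> int:
        """0 = no congestion, 1 = light, 2 = moderate, 3 = severe."""
        r = self.ratio()
        if r == 0.0:
            return 0
        return 1 + sum(r >= t for t in self._thresholds)
```

The resulting level fits in a 2-bit field, so it could be placed directly in the ratio field of the congestion information.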
S404.第二网络设备生成确认报文,所述确认报文包括聚合的确认信息、指示信息和拥塞信息。拥塞信息例如可通过本文描述的CETH实现,该CETH可以是扩展字段,也可以利用现有的保留字段来实现。
其中,确认信息用于实现ACK的功能,指示信息用于指示源端设备和目的端设备之间的网络通路发生拥塞。
一种实施例中,确认报文中的确认信息、指示信息和拥塞信息可设置在报文中的不同位置,例如可以分布在不同的报文头中,从而避免对确认信息中字段做修改。
又一种实施例中,可将该指示信息和/或拥塞信息整合设置到确认信息的字段中,从而实现对确认信息中字段空间的充分利用。
关于确认信息、指示信息和拥塞信息的聚合可参考图9或图10实施例的描述,这里不再赘述。
S405.第二网络设备向所述第一网络设备发送S404的确认报文。
S406.第二网络设备生成确认报文,所述确认报文包括聚合的确认信息和指示信息。
其中,确认信息用于实现ACK的功能,指示信息用于指示源端设备和目的端设备之间的网络通路没有发生拥塞。
S407.第二网络设备向所述第一网络设备发送S406的确认报文。
S408.第一网络设备基于所述确认报文进行定量的拥塞控制。
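Specifically, after receiving the acknowledgment packet of S405, the first network device determines from the indication information that the network is currently congested, and may then perform congestion control based on the congestion information in the acknowledgment packet in at least one of the following ways: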
(1)第一网络设备可根据所述拥塞程度,定量调整所述第一网络设备的下一时间窗内的报文的发送速率,其中所述拥塞程度和所述发送速率之间具有对应关系。例如对于“无拥塞、轻度拥塞、中度拥塞、重度拥塞”等多个级别,第一网络设备可以根据拥塞程度的差异来决定如何实施降速处理。不同级别可以对应不同的报文发送速率,实现不同档次的报文发送速率的调节,因而可以实现更快的速率收敛。
(2)第一网络设备可根据拥塞位置和报文队列深度中的至少一者,确定下一时间窗内的待发送的报文数量。源端设备的RNIC可以根据该拥塞位置和/或报文队列深度判断该网络路径上还可以承载多少数据报文而不导致发生丢包等行为,从而决定可以继续发送的报文数量,这对高带宽需求的网络应用比较友好。
(3)所述第一网络设备可根据网络时延调整所述第一网络设备的发送速率或下一时间窗内的待发送的报文数量。
当第一网络设备收到S407的确认报文后,根据指示信息确定当前网络无拥塞,第一网络设备可将下一时间窗内的RoCE协议报文的发送速率维持不变,或者将下一时间窗内的RoCE协议报文的发送速率恢复/设置为预设速率。
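The three quantitative actions of S408 can be sketched as small pure functions: a level-to-rate lookup, a packet budget derived from queue depth, and a delay-based rate scaling. The rate factors, the notion of a known queue capacity, and the target delay are illustrative assumptions; all function and constant names are hypothetical.

```python
# Assumed correspondence between congestion level and fraction of line rate.
LEVEL_TO_RATE_FACTOR = {0: 1.0, 1: 0.75, 2: 0.5, 3: 0.25}


def rate_for_level(line_rate_gbps: float, level: int) -> float:
    """(1) Jump straight to the rate that corresponds to the reported level."""
    return line_rate_gbps * LEVEL_TO_RATE_FACTOR[level]


def packet_budget(queue_capacity_pkts: int, queue_depth_pkts: int) -> int:
    """(2) Packets that can still be sent in the next window without overflow."""
    return max(0, queue_capacity_pkts - queue_depth_pkts)


def rate_for_delay(current_gbps: float, measured_us: float,
                   target_us: float) -> float:
    """(3) Scale the rate down in proportion to how far delay exceeds target."""
    if measured_us <= target_us:
        return current_gbps
    return current_gbps * target_us / measured_us
```

A table lookup, rather than repeated halving, is what lets the sender reach the target rate in the first window after the notification.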
综合上文实施例可知,本申请的方案能够实现将RoCE的拥塞控制通告与ACK确认信息聚合承载,通过指示信息来通告网络拥塞的产生和消除。在有网络拥塞时,定义的CETH承载详细的网络拥塞状态信息。具体的,如图15所示:
当网络存在拥塞时,目的端设备接收到的报文携带ECN标记(记为data w/ECN),本申请实施例重定义用于回复的ACK消息为ACK聚合CETH(记为ACK w/CETH),在AETH头后扩展携带CETH头,携带量化的拥塞信息。原有的RC、XRC、RD等应用场景连接的RDMA write、RDMA send的CNP都可聚合到ACK上,所以本申请实施例中,目的端设备将无需再单独回复CNP给源端设备。
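When the network congestion clears, the packets received by the destination device no longer carry the ECN mark (denoted data w/o ECN). The embodiment of this application redefines the replying ACK message as an ACK without an aggregated CETH (denoted ACK w/o CETH), so that the source device learns the current network state and can quickly restore its sending rate.
The technical effect of the solution is further illustrated with FIG. 16, which shows an example device communication flow under heavy network traffic. As shown in FIG. 16, data packet 1 encounters network congestion. Because the destination device aggregates the RoCE congestion notification (CETH) into the ACK packet, the source device receives the congestion notification sooner when congestion occurs and can act on it more quickly. In FIG. 16, for example, the source device enters the rate-reduction state starting from protocol packet 4, so the sending rate drops faster than with the separate CNP notification method of FIG. 3. The aggregation method also lets the destination device use the indication information in the ACK to tell the source device that the congestion has cleared. After receiving this notification, the source device can quickly raise its sending rate; in FIG. 16 the sending rate is restored starting from data packet 8, which is faster than the existing method that relies on periodic detection by the source device.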
另外,由于扩展了准确详细的拥塞通告信息,在降速的第一个周期即可准确的控制降低发送速率到目标速率,实现更快的收敛,如图16中,在收到拥塞通告信息A后,第一个发送的数据报文4即可快速收敛到目标速率。
可以看到,实施本申请实施例,目的端设备的第二网络设备可以在RoCE协议报文携带ECN标记时,回复聚合有CETH和指示信息的ACK确认信息,指示信息用于向源端设备通告当前网络通路上有网络拥塞,CETH用于向源端设备提供详细的拥塞信息。源端设备的第一网络设备从CETH提取拥塞信息进行定量、多样化的拥塞控制动作。目的端设备可以在RoCE协议报文未携带ECN标记时,则回复携带有指示信息的ACK确认信息,向源端设备通告当前网络通路上无网络拥塞,以便于源端设备及时维持或恢复高速发送速率。
这样,一方面,通过指示信息、拥塞信息和ACK的聚合承载,避免了独立CNP的发送,降低了通告的开销,有利降低大流量场景下拥塞通告的时延,提高了目的端设备的反应速度。
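On the other hand, the indication information lets the source device learn of network congestion at the earliest opportunity and trigger congestion control sooner to regulate its sending rate, which improves the response speed of the source device. When the congestion clears, the source device also learns of it through the indication information and can restore its sending rate in time, which improves network bandwidth utilization.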
最后,现有的RDMA网络拥塞控制的通告信息少,网络拥塞收敛速度慢,而本申请利用CETH承载网络详细的拥塞信息,例如拥塞程度、拥塞点、队列深度、网络时延等不同维度的信息,有利于源端设备根据详细拥塞信息将发送速率一步到位的调整到目标速率,实现快速收敛,并实现报文数量、发送时间等多样化的调整,极大提升了拥塞控制的效果。
应理解,在本申请的各种方法实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在上述实施例中,对各个实施例的描述各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
上述实施例仅用以说明本申请的技术方案,而非对其限制。尽管参照上述实施例对本申请进行了详细的说明,本领域的普通技术人员还应当理解的是:任何基于对上述各实施例所记载的技术方案进行的改动、变形、或者对其中部分技术特征进行的等同替换均应属于本申请各实施例技术方案的精神和范围。

Claims (32)

  1. 一种RoCE网络拥塞控制的方法,其特征在于,所述方法包括:
    第一网络设备向第二网络设备发送RoCE协议报文;
    所述第一网络设备接收来自第二网络设备的确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;
    所述第一网络设备基于所述确认报文进行拥塞控制。
  2. 根据权利要求1所述的方法,其特征在于,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;
    所述第一网络设备基于所述确认报文进行拥塞控制,具体包括:
    所述第一网络设备根据所述拥塞信息进行拥塞控制。
  3. 根据权利要求2所述的方法,其特征在于,所述第一网络设备根据所述拥塞信息进行拥塞控制包括以下至少一种方式:
    所述第一网络设备根据所述拥塞程度,调整所述第一网络设备的发送速率,其中所述拥塞程度和所述发送速率之间具有对应关系;或者,
    所述第一网络设备根据所述拥塞位置和所述报文队列深度中的至少一者,确定下一时间窗内待发送的报文数量;或者,
    所述第一网络设备根据所述网络时延调整所述第一网络设备的发送速率或确定下一时间窗内待发送的报文数量。
  4. 根据权利要求3所述的方法,其特征在于,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应不同的发送速率。
  5. 根据权利要求2-4任一项所述的方法,其特征在于,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
  6. 根据权利要求2-4任一项所述的方法,其特征在于,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
  7. 根据权利要求1-6任一项所述的方法,其特征在于,在所述指示信息指示所述网络通路未发生拥塞的情况下,所述第一网络设备基于所述确认报文进行拥塞控制包括:
    所述第一网络设备将所述第一网络设备的发送速率维持不变。
  8. 根据权利要求1-6任一项所述的方法,其特征在于,在所述指示信息指示所述网络通路未发生拥塞的情况下,所述第一网络设备基于所述确认报文进行拥塞控制包括:
    所述第一网络设备将所述第一网络设备的发送速率设置为预设速率。
  9. 一种RoCE网络拥塞控制的方法,其特征在于,所述方法包括:
    第二网络设备接收第一网络设备的RoCE协议报文;
    所述第二网络设备检查所述RoCE协议报文是否携带显式拥塞通知;
    所述第二网络设备根据检查结果生成确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;
    所述第二网络设备向所述第一网络设备发送所述确认报文,所述确认报文用于所述第一网络设备进行拥塞控制。
  10. 根据权利要求9所述的方法,其特征在于,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;所述拥塞信息用于所述第一网络设备进行拥塞控制。
  11. 根据权利要求10所述的方法,其特征在于,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应所述第一网络设备不同的发送速率。
  12. 根据权利要求10或11所述的方法,其特征在于,所述第二网络设备根据检查结果生成确认报文之前,还包括:
    所述第二网络设备生成所述拥塞信息。
  13. 根据权利要求10-12任一项所述的方法,其特征在于,当所述拥塞信息包括所述拥塞程度时,所述第二网络设备通过以下至少一种方式获得所述拥塞程度:
    所述第二网络设备根据历史报文接收记录中携带显式拥塞通知的RoCE协议报文的比例,来确定所述拥塞程度;或者,
    通过带内网络遥测(INT)方式获得所述拥塞程度;或者,
    通过现场运行管理和维护(IOAM)方式获得所述拥塞程度。
  14. 根据权利要求10-13任一项所述的方法,其特征在于,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
  15. 根据权利要求10-14任一项所述的方法,其特征在于,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
  16. 一种装置,其特征在于,所述装置应用于第一网络设备,包括:
    报文发送模块,用于向第二网络设备发送RoCE协议报文;
    报文接收模块,用于接收来自第二网络设备的确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;
    拥塞控制模块,用于基于所述确认报文进行拥塞控制。
  17. 根据权利要求16所述的装置,其特征在于,在所述指示信息指示所述网络通路发生拥塞的情况下,所述确认报文还包括拥塞信息,所述拥塞信息具体包括所述网络通路的拥塞程度、拥塞位置、报文队列长度、网络时延中的至少一种信息;
    所述拥塞控制模块具体用于,根据所述拥塞信息进行拥塞控制。
  18. 根据权利要求17所述的装置,其特征在于,所述拥塞控制模块具体用于:
    根据所述拥塞程度,调整所述第一网络设备的发送速率,其中所述拥塞程度和所述发送速率之间具有对应关系;或者,
    根据所述拥塞位置和所述报文队列深度中的至少一者,确定下一时间窗内待发送的报文数量;或者,
    根据所述网络时延调整所述第一网络设备的发送速率或确定下一时间窗内待发送的报文数量。
  19. 根据权利要求18所述的装置,其特征在于,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应不同的发送速率。
  20. 根据权利要求17-19任一项所述的装置,其特征在于,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
  21. 根据权利要求17-19任一项所述的装置,其特征在于,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
  22. 根据权利要求16-21任一项所述的装置,其特征在于,所述拥塞控制模块具体用于,在所述指示信息指示所述网络通路未发生拥塞的情况下,将所述第一网络设备的发送速率维持不变。
  23. 根据权利要求16-21任一项所述的装置,其特征在于,所述拥塞控制模块具体用于,在所述指示信息指示所述网络通路未发生拥塞的情况下,将所述第一网络设备的发送速率设置为预设速率。
  24. 一种装置,其特征在于,所述装置应用于第二网络设备,包括:
    报文接收模块,用于接收第一网络设备的RoCE协议报文;
    拥塞信息确定模块,用于检查所述RoCE协议报文是否携带显式拥塞通知;
    通告聚合发送模块,用于根据检查结果生成确认报文,所述确认报文包括对所述RoCE协议报文的确认信息和指示信息,所述指示信息用于指示所述第一网络设备与所述第二网络设备之间的网络通路是否发生拥塞;还用于向所述第一网络设备发送所述确认报文,所述确认报文用于所述第一网络设备进行拥塞控制。
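  25. The apparatus according to claim 24, wherein, when the indication information indicates that the network path is congested, the acknowledgment packet further includes congestion information, the congestion information specifically including at least one of the congestion degree, congestion location, packet queue length, and network delay of the network path; and the congestion information is used by the first network device to perform congestion control.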
  26. 根据权利要求25所述的装置,其特征在于,所述拥塞程度属于多种不同级别拥塞程度中的一种,不同级别拥塞程度分别对应所述第一网络设备不同的发送速率。
  27. 根据权利要求25或26所述的装置,其特征在于,所述拥塞信息确定模块还用于生成所述拥塞信息。
  28. 根据权利要求25-27任一项所述的装置,其特征在于,当所述拥塞信息包括所述拥塞程度时,所述拥塞信息确定模块具体用于:
    根据历史报文接收记录中携带显式拥塞通知的RoCE协议报文的比例,来确定所述拥塞程度;或者,
    通过带内网络遥测(INT)方式获得所述拥塞程度;或者,
    通过现场运行管理和维护(IOAM)方式获得所述拥塞程度。
  29. 根据权利要求25-28任一项所述的装置,其特征在于,所述确认报文具体包括基础传输头(BTH)字段和扩展字段,所述确认信息和所述指示信息承载于所述BTH字段,所述拥塞信息承载于所述扩展字段。
  30. 根据权利要求25-29任一项所述的装置,其特征在于,所述确认报文具体包括BTH字段,所述确认信息、所述指示信息和所述拥塞信息均承载于所述BTH字段。
  31. 一种设备,其特征在于,所述设备包括主机系统和第一网络设备;所述主机系统用于与所述第一网络设备交互实现数据传输,所述第一网络设备用于执行如权利要求1-8任一项所述的方法。
  32. 一种设备,其特征在于,所述设备包括主机系统和第二网络设备;所述主机系统用于与所述第二网络设备交互实现数据传输,所述第二网络设备用于执行如权利要求9-15任一项所述的方法。
PCT/CN2021/116494 2020-09-03 2021-09-03 RoCE网络拥塞控制的方法及相关装置 WO2022048647A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21863709.8A EP4195760A4 (en) 2020-09-03 2021-09-03 ROCE NETWORK CONGESTION MANAGEMENT METHOD AND RELATED APPARATUS
US18/178,117 US20230208771A1 (en) 2020-09-03 2023-03-03 RoCE Network Congestion Control Method and Related Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010915720.6 2020-09-03
CN202010915720.6A CN114143827A (zh) 2020-09-03 2020-09-03 RoCE网络拥塞控制的方法及相关装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/178,117 Continuation US20230208771A1 (en) 2020-09-03 2023-03-03 RoCE Network Congestion Control Method and Related Apparatus

Publications (1)

Publication Number Publication Date
WO2022048647A1 true WO2022048647A1 (zh) 2022-03-10

Family

ID=80438127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116494 WO2022048647A1 (zh) 2020-09-03 2021-09-03 RoCE网络拥塞控制的方法及相关装置

Country Status (4)

Country Link
US (1) US20230208771A1 (zh)
EP (1) EP4195760A4 (zh)
CN (1) CN114143827A (zh)
WO (1) WO2022048647A1 (zh)

Also Published As

Publication number Publication date
EP4195760A4 (en) 2024-02-07
CN114143827A (zh) 2022-03-04
EP4195760A1 (en) 2023-06-14
US20230208771A1 (en) 2023-06-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863709

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021863709

Country of ref document: EP

Effective date: 20230308

NENP Non-entry into the national phase

Ref country code: DE