WO2023226603A1 - Method and apparatus for suppressing the generation of congestion queues - Google Patents

Method and apparatus for suppressing the generation of congestion queues (一种抑制拥塞队列产生的方法及装置) Download PDF

Info

Publication number
WO2023226603A1
WO2023226603A1 (application PCT/CN2023/086561)
Authority
WO
WIPO (PCT)
Prior art keywords
queue
congestion
cache queue
cache
network device
Prior art date
Application number
PCT/CN2023/086561
Other languages
English (en)
French (fr)
Inventor
韩自发
吴涛
王炳权
闫健
韩磊
顾叔衡
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023226603A1 publication Critical patent/WO2023226603A1/zh

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion

Definitions

  • the present invention relates to the field of communication technology, and in particular, to a method and device for suppressing the generation of congestion queues.
  • RDMA: remote direct memory access
  • RoCE: RDMA over converged Ethernet
  • PFC: priority-based flow control
  • This application provides a method and device for suppressing the generation of congestion queues. Congestion queues are identified by detecting queue accumulation at the input port; the time for which the upstream device needs to stop sending is then determined from whether the queue is a congestion queue and from the device's buffer occupancy. This ensures that buffer space remains for traffic in the CBD ring and that data packets in the CBD ring can flow normally.
  • this application provides a method for suppressing the generation of a congestion queue.
  • The method includes: determining the queue length of the buffer queue of the input port of the first network device and the number of messages to be received by the first network device; determining a first time according to the queue length and the number of messages to be received, the first time being the time for which the second network device stops sending messages to the first network device; when the buffer queue is determined to be a congestion queue according to the queue length, updating the first time to a second time according to the congestion level of the buffer queue; and sending the second time to the second network device.
  • In this application, the stop time for which the upstream network device needs to stop sending packets is determined jointly from the queue length of the buffer queue at the network device's input port, the number of in-flight packets, and whether the current buffer queue is a congestion queue. Specifically, before the network device generates a congestion queue, a first time can be determined from the queue length of the input-port buffer queue and the number of in-flight messages; whether the buffer queue is a congestion queue then determines whether the first time needs to be updated to obtain the second stop time. When the buffer queue is a congestion queue, the second time is sent to the second network device, which reduces the number of packets received by the buffer queue in the first network device and avoids the formation of a congestion queue. This ensures that data packets in the CBD ring can flow normally.
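As a rough illustration of this joint decision (not the patent's actual formulas — the scaling policy for the second time and all names below are assumptions), the logic might be sketched as:

```python
def compute_first_time(queue_len: int, in_flight: int, rate: float) -> float:
    """First time: pause long enough for the buffer to drain the queued
    bytes plus the in-flight messages, at the negotiated rate R."""
    return (queue_len + in_flight) / rate

def compute_stop_time(queue_len: int, in_flight: int, rate: float,
                      is_congestion_queue: bool,
                      congestion_factor: float = 2.0) -> float:
    """Return the time the upstream device should stop sending: the first
    time as-is for a normal queue, or updated (here: scaled, an assumed
    policy) to a second time when the queue is a congestion queue."""
    t1 = compute_first_time(queue_len, in_flight, rate)
    return t1 * congestion_factor if is_congestion_queue else t1
```

The point of the two-step decision is that the cheap first estimate is always available, and is only lengthened when congestion detection confirms the queue is actually accumulating.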
  • the first time is sent to the second network device.
  • In this case, the first time also needs to be sent to the second network device so that the second network device stops sending messages to the first network device, avoiding the generation of a congestion queue in the first network device.
  • determining whether the cache queue is a congestion queue includes: obtaining the congestion detection judgment result C of the cache queue; wherein the value of the congestion detection judgment result is True or False.
  • When the value of C is True, the cache queue is a congestion queue.
  • When the value of C is False, the cache queue is a non-congestion queue.
  • The initial value of C is False. Whether the cache queue is a congestion queue is determined from the congestion detection judgment result C of the cache queue.
  • the first network device determines the first time based on the queue length of the buffer queue and the number of packets to be received.
  • the first network device can directly obtain the congestion determination result of the cache queue, and determine whether the cache queue is a congestion queue based on the congestion determination result.
  • The method also includes: obtaining the congestion detection discrimination mark J of the cache queue, where the value of J is True or False. When the value of J is True, the cache queue needs to perform congestion detection; when the value of J is False, the cache queue does not need to perform congestion detection; the initial value of J is True. When the value of the congestion detection discrimination mark J of the cache queue is True, congestion detection is performed on the cache queue. When the cache queue is a congestion queue, the value of the congestion detection judgment result C of the cache queue is updated to True; or, when the cache queue is not a congestion queue, the value of the congestion detection discrimination mark J of the cache queue is updated to False.
  • The discrimination mark of the cache queue is checked first, so that congestion detection is performed on the cache queue again only when the queue has not yet undergone congestion detection, or has undergone it and was found to be a congestion queue. This avoids unnecessary congestion judgments on the cache queue and saves system overhead.
  • Performing congestion detection on the cache queue includes: determining whether the initial length of the cache queue is stored in the first network device, the initial length being the queue length recorded the first time the queue length of the cache queue exceeds the minimum queue length at which congestion queue detection can be performed; when the first network device stores the initial length of the cache queue, determining whether the queue length of the cache queue continues to accumulate based on the queue length and the initial length of the cache queue; and when the queue length of the cache queue continues to accumulate, determining that the cache queue is a congestion queue.
  • When the initial length of the cache queue is not saved in the network device, it is determined whether the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed; when it is, the queue length of the cache queue is saved as the initial length of the cache queue.
  • When the initial length of the cache queue is not stored in the current network device, it cannot be judged from the current length and the initial length whether the cache queue is a congestion queue. It is therefore determined whether the current length of the cache queue can be saved as its initial length for subsequent congestion detection. Only when the queue length of the cache queue reaches the minimum queue length for congestion queue detection is the current queue length saved as the initial length of the cache queue for subsequent congestion detection.
  • Determining whether the queue length of the cache queue continues to accumulate based on the queue length and the initial queue length of the cache queue includes: determining, from these two lengths, whether the enqueue rate of the cache queue is greater than its dequeue rate; when the enqueue rate of the cache queue is greater than its dequeue rate, it is determined that the queue length of the cache queue continues to accumulate.
  • The judgment can be made based on the rate at which data enters the current cache queue and the rate at which data leaves it.
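Since the queue grows exactly when data enters faster than it leaves, the rate comparison reduces to comparing two length samples; a minimal sketch (function and variable names are assumptions):

```python
def queue_accumulating(current_len: int, initial_len: int) -> bool:
    """Growth in queue length over a scan interval equals
    (enqueue rate - dequeue rate) * interval, so the queue continues to
    accumulate exactly when the current length exceeds the recorded
    initial length."""
    return current_len > initial_len
```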
  • this application provides a network device, which includes:
  • a detection module configured to determine the queue length of the buffer queue of the input port of the first network device and the number of messages to be received by the first network device;
  • a processing module configured to determine a first time based on the queue length and the number of messages to be received, where the first time is the time when the second network device stops sending messages to the first network device;
  • the detection module is also used to determine whether the cache queue is a congestion queue based on the queue length
  • the processing module is also used to update the first time to the second time according to the congestion level of the cache queue when the cache queue is a congestion queue;
  • the sending module is used to send the second time to the second network device.
  • the sending module is also used to:
  • the first time is sent to the second network device.
  • the detection module is used to:
  • the detection module is also used to:
  • the processing module is also used to:
  • the detection module is also used to:
  • the initial length of the cache queue refers to the queue length when the queue length of the cache queue is greater than the minimum queue length for congestion queue detection for the first time;
  • When the initial length of the cache queue is stored in the first network device, determine whether the queue length of the cache queue continues to accumulate based on the queue length of the cache queue and the initial length of the cache queue;
  • when the queue length of the cache queue continues to accumulate, the cache queue is determined to be a congestion queue.
  • the detection module is used to:
  • the queue length of the cache queue is saved as the initial length of the cache queue.
  • the detection module is used to:
  • This application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the method provided in the first aspect.
  • this application provides a computer program product containing instructions. When the instructions are run on a computer, they cause the computer to execute the method provided in the first aspect.
  • Figure 1(a) is a schematic diagram of PFC deadlock in triangular routing
  • Figure 1(b) is a schematic diagram of PFC deadlock in a CLOS topology
  • Figure 2 is a schematic diagram of a deadlock recovery process that actively recovers from PFC deadlock
  • Figure 3 is a schematic flow chart of a method of multi-queue switching to avoid PFC deadlock
  • Figure 4(a) is a system architecture diagram of a CLOS network provided by an embodiment of the present application
  • Figure 4(b) is a schematic structural diagram of a switch provided by an embodiment of the present application
  • Figure 5 is a flow chart of a method for determining whether a cache queue is a congestion queue provided by an embodiment of the present application
  • Figure 6 is a flow chart of a method for suppressing the generation of congestion queues provided by an embodiment of the present application
  • Figure 7 is a flow chart of another method for suppressing the generation of congestion queues provided by an embodiment of the present application
  • Figure 8 is a flow chart of yet another method for suppressing the generation of congestion queues provided by an embodiment of the present application
  • Figure 9 is a schematic structural diagram of a network device provided by an embodiment of the present application
  • Figure 10 is a schematic structural diagram of another network device provided by an embodiment of the present application
  • Any embodiment or design solution described as "exemplary", "such as", or "for example" should not be construed as more preferred or advantageous than other embodiments or design solutions. Rather, use of the words "exemplary", "such as", or "for example" is intended to present the concepts in a concrete manner.
  • "First" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more of these features.
  • the terms “including,” “includes,” “having,” and variations thereof all mean “including but not limited to,” unless otherwise specifically emphasized.
  • In order to reduce network latency inside the data center and improve processing efficiency, RDMA technology came into being. By allowing user applications to read and write remote memory directly, without the CPU intervening in multiple memory copies, RDMA achieves high throughput, ultra-low latency, and low CPU overhead. To unleash the true performance of RDMA and break through the network performance bottleneck of large-scale distributed systems in data centers, a lossless network environment without packet loss must be built for RDMA. The key to achieving no packet loss is solving the network congestion problem. Generally, lossless Ethernet relies on hop-by-hop priority-based flow control (PFC) to solve the packet loss caused by buffer overflow in the network.
  • PFC hop-by-hop priority-based flow control
  • PFC extends the basic IEEE 802.3x flow control to allow the creation of eight virtual channels on an Ethernet link, assigning a corresponding priority to each virtual channel. PFC allows any virtual channel to be paused and restarted independently, while allowing traffic from the other virtual channels to pass through without interruption.
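The independent pause/restart behavior can be pictured with a minimal sketch (the flag-array representation is illustrative only, not the 802.1Qbb frame format):

```python
# One pause flag per PFC priority / virtual channel on the link.
NUM_PRIORITIES = 8
paused = [False] * NUM_PRIORITIES

def on_pfc_frame(priority: int, pause: bool) -> None:
    """Pausing or restarting one virtual channel leaves the other
    channels' traffic flowing without interruption."""
    paused[priority] = pause

def may_send(priority: int) -> bool:
    return not paused[priority]
```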
  • the PFC mechanism can avoid data packet loss by pausing the data transmission corresponding to the priority of the immediate upstream network device of the current network device. However, the PFC mechanism will cause PFC deadlock in the network, and in severe cases, the entire network will be blocked.
  • PFC deadlock refers to the existence of a cyclic buffer dependency (CBD, Cyclic Buffer Dependency) among a group of network devices, in which each network device in the loop holds all the buffers required by its upstream network device while waiting for its downstream network device to release some buffers and resume its packet transmission.
  • CBD Cyclic Buffer Dependency
  • FIG. 1(a) shows a PFC deadlock scenario caused by buffer circular dependency.
  • As the downstream network device of switch B (switchB), switch A receives the data sent by switch B and caches it (switch B is the upstream network device of switch A).
  • As a downstream network device of switch C (switchC), switch B receives and caches the data sent by switch C.
  • Switch C receives the data sent by switch A and caches it. Therefore, a circular buffer dependency relationship is formed among switch A, switch B, and switch C.
  • When the buffers of switch A, switch B, and switch C all reach the XOFF (flow-controlled) waterline, switch A, switch B, and switch C all send PAUSE frames to their upstream network devices to notify them to stop sending packets.
  • PFC deadlock refers to the occurrence of micro-loops between multiple switches: congestion occurs on them at the same time, the buffer consumption of each port exceeds its threshold, and the switches wait for each other to release resources, resulting in a network state in which the data flow on all switches is permanently blocked.
  • the throughput of the entire network or part of the network will become zero due to the back pressure effect of PFC.
  • the backpressure effect of PFC means that when the switch port is congested and the XOFF waterline is triggered, the switch will send PAUSE frames to the upstream network device as backpressure, and the upstream device stops sending data after receiving the PAUSE frame.
  • The first network device, serving as a downstream network device of the fourth network device, receives and caches the data sent by the fourth network device.
  • the second network device receives and caches the data sent by the first network device.
  • the third network device receives and caches the data sent by the second network device.
  • the fourth network device as a downstream network device of the third network device, receives and caches the data sent by the third network device. That is, there is a cyclic buffer dependency relationship between the first network device, the second network device, the third network device and the fourth network device.
  • Deadlocks can also occur in CLOS networks when circular buffer dependencies exist: when all four switches reach the XOFF waterline, they all send PAUSE frames to their upstream network devices at the same time. All switches in the topology are then in a stopped state, and due to the backpressure effect of PFC the throughput of the entire network, or part of it, becomes zero. While loops may be temporary, the deadlocks they cause are not: even once the issue that caused the deadlock (a configuration error, a glitch or update, etc.) is fixed, the deadlock will not break automatically. Therefore, when deploying RDMA over Ethernet, some mechanism must be used to deal with the deadlock problem.
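A cyclic buffer dependency is simply a cycle in the "paused by downstream" graph; the sketch below (illustrative only, not part of the patent) shows how such a cycle could be detected:

```python
def has_cbd(deps: dict) -> bool:
    """deps maps each device to the downstream device whose buffer it
    depends on; a CBD exists if following a chain returns to its start."""
    for start in deps:
        seen, node = set(), start
        while node in deps and node not in seen:
            seen.add(node)
            node = deps[node]
        if node == start:
            return True  # walked the chain back to where we began
    return False
```

For the triangle of Figure 1(a), the dependency map `{"A": "B", "B": "C", "C": "A"}` contains a cycle, so a deadlock is possible once all three buffers reach XOFF.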
  • a corresponding mechanism can be used to actively monitor whether a PFC deadlock is formed.
  • the deadlock can be broken by resetting the link/port/host, etc. As shown in Figure 2, it includes: Step 1-Step 3.
  • Step 1 Device 2 (Device2) starts the timer and monitors multiple PFC backpressure frames received.
  • When the port of device 2 (Device2) receives a PFC backpressure frame sent by device 1 (Device1), the internal scheduler of device 2 stops sending the queue traffic of the corresponding priority and starts the timer.
  • Step 2 Determine whether the corresponding priority queue at the receiving port of device 2 has been flow controlled during the timer period.
  • Step 3 When the corresponding priority queue at the receiving port of device 2 has been flow controlled during the timer period, it is determined that a PFC deadlock has occurred between device 1 and device 2.
  • Device 2 detects the PFC backpressure frames received by the queue according to the configured deadlock detection method and deadlock detection accuracy. If the queue has remained in the PFC-XOFF (that is, flow-controlled) state for the configured PFC deadlock detection time, it is considered that a PFC deadlock has occurred in the network and the PFC deadlock recovery process needs to be performed. Deadlock is detected by checking the duration of PFC-XOFF, and deadlock recovery is performed by dropping packets or ignoring PFC backpressure frames. Although PFC deadlock recovery can thus be performed after a PFC deadlock occurs, PFC backpressure frames may be ignored during recovery, which can cause buffer bloat and result in packet loss.
  • PFC-XOFF that is, flow controlled
  • multi-queue switching can also be used to avoid deadlocks in the network.
  • It includes switch 1 (Switch1), switch 2 (Switch2), and switch 3 (Switch3).
  • As a downstream network device of switch 1, switch 2 receives the data sent by switch 1.
  • switch 3 receives the data sent by switch 2.
  • Four queues (ingress queues Queue5 and Queue6, and egress queues Queue5 and Queue6) are set up in switch 2 to send and receive data.
  • Queue 5 and queue 6 in network device 2 are both lossless queues.
  • Network device 2 directs traffic from ingress queue 5 into egress queue 6 by modifying the Differentiated Services Code Point (DSCP).
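The DSCP-based queue switch might look like the following sketch; the DSCP values and both mapping tables are invented for illustration:

```python
DSCP_TO_QUEUE = {40: 5, 48: 6}  # assumed DSCP-to-queue assignment
DSCP_REWRITE = {40: 48}         # rewrite in switch 2: queue-5 DSCP -> queue-6 DSCP

def egress_queue(dscp: int) -> int:
    """Traffic that arrived marked for ingress queue 5 leaves via egress
    queue 6 after the DSCP is modified, breaking the buffer dependency."""
    dscp = DSCP_REWRITE.get(dscp, dscp)
    return DSCP_TO_QUEUE.get(dscp, 0)  # unknown markings fall back to queue 0
```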
  • DSCP Differentiated Services Code Point
  • Current congestion queue suppression methods only identify "congested" flows and "victim" flows, without further distinguishing between ports and their corresponding queues.
  • When the "high-speed" queue traffic entering the CBD is much larger than the "low-speed" queue traffic leaving the CBD, the circular queue buffer continues to accumulate.
  • Once the buffer accumulation in a queue reaches a certain threshold, the queue becomes a congestion queue. The data newly added to the CBD loop therefore becomes the main culprit of congestion.
  • The corresponding queue of that port becomes a congestion queue, while the queues of other ports become victim queues. This is precisely the main source of congestion, and even deadlock, when loops occur.
  • embodiments of the present application provide a method for suppressing the generation of congestion queues, which is mainly applied to network architectures deploying lossless Ethernet (RoCE+PFC).
  • The congestion queue is identified by detecting the accumulation of the cache queue at the input port; the time for which the upstream network device needs to stop sending data is then determined from whether the cache queue is a congestion queue and from the cache occupancy of the cache queue. This ensures that buffer space remains for traffic within the CBD ring and that data packets can flow normally in the CBD ring.
  • FIG. 4(a) is a schematic system architecture diagram of a CLOS network provided by an embodiment of the present application.
  • a typical networking of a CLOS network includes network devices and hosts. There can be at least one network device.
  • the multiple network devices may be switches (S1, S2,...Sm, L1, L2,...Ln), and each switch corresponds to multiple hosts. Data can be exchanged between multiple switches, and each host in multiple hosts receives data through the switch.
  • The description takes switch S1 and switch L1 as an example, with switch S1 acting as the upstream network device and sending data to switch L1.
  • Switch L1 can periodically detect the accumulation of the buffer queue of its input port, the queue length of that buffer queue, and the number of in-flight packets to be received by the input port. Switch L1 then identifies whether the buffer queue is a congestion queue based on the accumulation of the input-port buffer queue. Finally, switch L1 jointly decides, based on the queue length of the cache queue, the number of in-flight messages (messages that the upstream network device has sent but the downstream network device has not yet received), and whether the queue is a congestion queue, the time for which the upstream network device needs to stop sending data. This reduces the number of packets entering the circular queue buffer, ensures that buffer space remains for traffic in the CBD ring, and keeps data packets flowing normally in the CBD ring.
  • the network device described in the embodiment of the present application may be a network device with a data forwarding function, such as a router or a switch.
  • Figure 4(b) shows the hardware structure of a switch.
  • the switch includes: a processor 401, an interface circuit 402, a memory 403, and a switching module 404.
  • the processor 401, the interface circuit 402, the memory 403, and the switching module 404 can be connected through a bus.
  • Memory 403 is used to store program instructions and data. For example, after the switch receives data sent by other switches, the data can be stored in the memory 403 .
  • the processor 401 is the computing core and control core of the switch. The processor 401 reads the program instructions and data stored in the memory 403, thereby executing a method for suppressing the generation of a congestion queue.
  • Interface circuit 402 includes internal circuitry that connects various ports.
  • the switching module 404 communicates with the processor 401 through the bus to complete data transmission.
  • Figure 5 provides a method for determining whether the cache queue is a congestion queue.
  • the network device involved in this method may be the network device described in Figure 4(a). Referring to Figure 5, the method includes: S501-S506.
  • the network device periodically obtains the queue length L(t), queue discrimination mark J, and congestion discrimination result C of the buffer queue of its input port.
  • the buffer queue of the input port of the network device is used to buffer messages received from the network.
  • The buffer queue at the input port of the network device is a first-in, first-out data structure, which only allows insertion at one end and deletion at the other end; direct access to data anywhere but these two ends is prohibited. That is, in this embodiment, one end of the cache queue is used to receive data from the network, and the other end of the cache queue is used to send data to the network.
  • the receiving rate of data received by a network device from the network is greater than the sending rate of data sent by the network device to the network, data that has not yet been sent will be cached in the cache queue.
  • the data size of the data cached in the cache queue is the queue length L(t) of the cache queue.
  • the queue length of the buffer queue of a network device input port is initialized to 0. Then, every time the input port of the network device receives a message, the queue length of the buffer queue is increased by 1. Every time a network device sends a message through its output port, the buffer queue length is decremented by 1.
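That per-message bookkeeping amounts to a simple counter; a minimal sketch (the class and method names are assumptions):

```python
class BufferQueue:
    """Tracks the input-port buffer queue length L(t) in messages."""

    def __init__(self) -> None:
        self.length = 0          # queue length initialized to 0

    def on_receive(self) -> None:
        self.length += 1         # input port received one message

    def on_send(self) -> None:
        self.length -= 1         # output port sent one message
```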
  • the discriminant mark J is used to indicate whether the cache queue needs to perform congestion detection. Among them, the value of J can be True or False. When the value of J is True, it indicates that the cache queue needs to perform congestion detection. When the value of J is False, it means that the cache queue does not need to perform congestion detection. In an example, the initial value of the discriminant flag J can be set to True, that is, all queues need to perform congestion detection.
  • the congestion determination result C is used to indicate whether the cache queue is a congestion queue. Among them, the value of C can be True or False.
  • the initialization value of the congestion determination result C can be set to False, indicating that the cache queue is not a congestion queue.
  • The network device needs to scan the queue length of the buffer queue of its input port periodically (such as every 1 minute or 1 hour). While scanning the cache queue, the discrimination mark of the queue also needs to be checked.
  • When the network device performs congestion detection on the cache queue of the input port for the first time, it can perform congestion detection on the cache queue based on the obtained queue length L(t) and the initial value of the discrimination mark J, and modify the values of the discrimination mark J and the congestion detection result C according to the detection result.
  • the discrimination mark of the cache queue is first judged before performing congestion queue detection on the cache queue.
  • the discrimination mark J of the cache queue is first judged. This avoids wasting system resources by performing congestion detection on the cache queue when the cache queue is not a congestion queue.
  • S503 The network device determines whether the initial length of the cache queue is 0. When the initial length of the cache queue is not 0, S506 is executed; otherwise, S504 is executed.
  • whether the cache queue is a congestion queue can be determined based on the queue length of the cache queue of the network device and the initial length record of the cache queue.
  • The initial length of the cache queue refers to the queue length recorded when the queue length of the cache queue exceeds Th1 for the first time.
  • the unit of Th1 is byte, which indicates the minimum queue length for congestion queue detection.
  • L_old can be used to record the initial length of the cache queue.
  • S504 The network device compares the queue length of the cache queue with the minimum queue length for congestion queue detection. When the queue length of the cache queue is greater than the minimum queue length for congestion queue detection, S505 is executed; otherwise, S501 is executed.
  • the queue length of the cache queue may be compared with the minimum queue length for which congestion queue detection can be performed. Only when the obtained queue length of the cache queue is greater than or equal to the minimum queue length for congestion queue detection, the queue length can be saved as the initial length of the cache queue. for the next congestion queue detection.
  • S505 The network device saves the queue length of the cache queue as the initial length of the cache queue.
  • S506 The network device determines whether the current cache queue continues to be congested. When the cache queue continues to be congested, the value of the judgment result C of the cache queue is updated to True. When the cache queue is not in the congestion state, the value of the discrimination mark J of the current cache queue is updated to False.
  • The initial condition for the congestion judgment on the cache queue is set to "whether L_old is 0"; that is, after the first congestion judgment, the entry point for the next congestion judgment is S503.
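Putting S501-S506 together, one scan iteration might look like the following sketch; the value of Th1 and all identifiers are assumptions:

```python
TH1 = 1000  # minimum queue length (bytes) at which detection may start

class QueueState:
    def __init__(self) -> None:
        self.J = True    # discrimination mark: detection still needed
        self.C = False   # judgment result: is this a congestion queue?
        self.L_old = 0   # initial length record; 0 means "not yet saved"

def scan(state: QueueState, length: int) -> None:
    """One periodic scan over the buffer queue (S501-S506)."""
    if not state.J:               # S502: queue no longer needs detection
        return
    if state.L_old == 0:          # S503: no initial length recorded yet
        if length > TH1:          # S504: long enough to start detection?
            state.L_old = length  # S505: save as the initial length
        return
    if length > state.L_old:      # S506: still accumulating
        state.C = True            # mark as congestion queue
    else:
        state.J = False           # not congested: stop detecting this queue
```

Note that C staying True keeps J True, matching the rule that a queue found congested is detected again on later scans.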
  • Figure 6 shows a flow chart of a method for suppressing the generation of a congestion queue provided by an embodiment of the present application. This method is applied to the network architecture shown in Figure 4(a).
  • The network device involved in this method may be the network device described in Figure 4(a). Referring to Figure 6, the method includes: S601-S607.
  • the network device periodically obtains the queue length of the buffer queue of its input port and the number of flight messages to be received by the network device.
  • the network device can periodically scan (for example, every 1 minute or 1 hour, etc.) the queue length L(t) of the buffer queue of the input port, and obtain the flight packets that the current network device needs to receive.
  • the flight message refers to the message that the upstream network device has sent but the downstream network device has not yet received.
  • the network device sending data can be used as the upstream network device, and the device receiving data can be used as the downstream network device.
  • switch S1 sends data to switch L1 and switch L2. Then switch S1 is the upstream network device of switch L1 and switch L2.
  • Switch L1 and switch L2 are downstream network devices of switch S1.
  • the network device determines the stop time for the upstream network device of the current network device to stop sending messages based on the queue length of the current cache queue and the number of flight messages to be received.
  • the stop time t_stop1 when the upstream network device needs to stop sending packets can be determined based on the determined queue length of the cache queue and the number of in-flight packets. specifically,
  • T is the scan cycle for the network device to scan the buffer queue at the input end;
  • R represents the message sending rate negotiated between the upstream network device and the downstream network device;
  • H is the headroom reserved at the input end of the network device for receiving in-flight packets and preventing packet loss.
  • the initial value of H is BDP+R*T;
  • F(t) is the number of in-flight packets, and F(t) can be initialized to a bandwidth-delay product (BDP).
  • BDP is the maximum number of bits on the link, also called the link length in bits.
  • The bandwidth-delay product is BDP = propagation delay × bandwidth.
  • S603: The network device determines whether the current cache queue is a congestion queue. When it is, execute S604; otherwise, execute S605.
  • After determining the stop time from the queue length of the cache queue and the number of in-flight packets, the network device also needs to determine whether the cache queue at its input port is a congestion queue so that the determined stop time is more accurate.
  • the network device can perform congestion detection directly on the cache queue.
  • The process of performing congestion detection on the queue is the same as S501-S506 and is not repeated here.
  • The network device may, while executing S601, determine whether the cache queue at its input port is a congestion queue (the determination process is the same as S501-S506 and is not repeated here), obtaining the congestion determination result C of the cache queue. It then determines from C whether the cache queue is a congestion queue. Performing congestion detection on the cache queue while executing S601 saves the time the network device spends generating the stop time for the upstream network device.
  • the congestion determination result C is set in advance for the buffer queue of the input port of the network device.
  • The initial value of C is False.
  • the value of C can be True or False. When the value of C is True, it indicates that the cache queue is a congestion queue. When the value of C is False, it indicates that the cache queue is a non-congested queue.
  • The network device needs to periodically scan the queue length of the cache queue at its input port. While scanning, it also determines whether the cache queue is a congestion queue based on the queue length and the recorded historical length of the cache queue. When the cache queue is a congestion queue, the congestion determination result C of the cache queue is set to True.
  • S604: The network device updates the stop time for which the upstream network device stops sending packets.
  • The stop time is updated to reduce the traffic entering the CBD and prevent PFC deadlock. Specifically, the updated stop time t_stop2 for which the upstream network device stops sending packets is,
  • T is the scan cycle for the network device to scan the buffer queue at the input end;
  • p is the penalty factor, p > 1;
  • t_stop1 is the stop time determined by the network device in S602 from the queue length of the cache queue and the number of in-flight packets.
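The formula image for t_stop2 is not reproduced in this text. One plausible reading, treated here purely as an assumption, is that the penalty factor simply scales the previously computed stop time:

```python
def update_stop_time(t_stop1: float, p: float) -> float:
    """Hypothetical update rule for S604 (assumed form, the original
    formula is not reproduced in the source text): scale the stop time
    by the penalty factor p (p > 1) when the cache queue has been
    identified as a congestion queue."""
    assert p > 1, "penalty factor must exceed 1"
    return p * t_stop1
```

Under this assumption, a penalty factor of 2 doubles the pause requested from the upstream device whenever the queue is marked congested.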
  • S605: The network device determines whether the stop time for the upstream network device to stop sending packets is greater than 0. When it is, execute S606; otherwise, execute S607.
  • Before sending the determined stop time to the upstream device, it is also necessary to check whether the obtained stop time is 0.
  • When the determined stop time is 0, there is no congestion in the cache queue of the network device and the upstream network device can continue sending packets. In this case, the network device does not need to send a PFC backpressure frame to the upstream network device.
  • When the determined stop time is not 0, the current network device is congested or about to become congested, and the current network device needs to send a PFC backpressure frame to the upstream network device.
  • Checking the stop time carried in the PFC backpressure frame avoids sending backpressure frames to the upstream network device when none are needed, saving channel resources in the network.
  • When the cache queue at the network device input port is a congestion queue, the updated stop time t_stop2 needs to be sent to the upstream network device. Therefore, the network device checks t_stop2 before sending a backpressure frame, and sends the PFC backpressure frame only when t_stop2 is greater than 0.
  • When the cache queue is not a congestion queue, the determined stop time t_stop1 is sent to the upstream network device.
  • In this case, the network device checks t_stop1 before sending a backpressure frame, and sends the PFC backpressure frame only when t_stop1 is greater than 0.
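The S603-S605 decision flow can be sketched as a small branch. The form `p * t_stop1` for the congestion-queue update is an assumption (the original formula image is not reproduced here):

```python
def choose_backpressure(t_stop1: float, is_congestion_queue: bool, p: float = 2.0):
    """Sketch of S603-S605: when the cache queue is a congestion queue,
    apply the (assumed) penalty update t_stop2 = p * t_stop1; otherwise
    keep t_stop1. A PFC backpressure frame is sent only when the
    resulting stop time is greater than 0."""
    t_stop = p * t_stop1 if is_congestion_queue else t_stop1
    send_pfc = t_stop > 0
    return t_stop, send_pfc
```

For example, a congested queue with t_stop1 = 5 yields a stop time of 10 with p = 2, while a stop time of 0 suppresses the backpressure frame entirely.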
  • S606: The network device sends the PFC backpressure frame to the upstream network device.
  • The PFC backpressure frame sent by the current network device carries the stop time for which the upstream network device must stop sending packets, so that the upstream network device stops sending packets to the current network device during the stop time, thereby avoiding PFC deadlock.
  • S607: The network device updates the number of in-flight packets to be received.
  • RTT refers to the round-trip propagation delay of data transmitted between the upstream network device and the downstream network device.
  • t_stop3 refers to the time during which the upstream network device stops sending data within the RTT.
  • BDP is the bandwidth-delay product.
  • With this method, the input port of the network device can use only one cache queue while avoiding packet loss, so the network device avoids deadlock and the throughput of the network is not affected.
  • FIG. 7 shows a flow chart of yet another method for suppressing the generation of a congestion queue provided by an embodiment of the present application.
  • the method of suppressing the congestion queue shown in FIG. 7 is another way of describing the method of suppressing the congestion queue shown in FIG. 5 and the method of judging the congestion queue shown in FIG. 6 .
  • the network device involved in this method may be the network device described in Figure 4(a).
  • The method includes: S710, S720, and S730. It should be noted that the specific implementation process of S710 is the same as that of S501 and S601. There is no fixed execution order between S720 and S730; they can be executed at the same time.
  • S720 Determine whether the current cache queue is a congestion queue based on the obtained cache queue length and cache queue discrimination flag.
  • S7202 determine the initial length L_old of the current cache queue. When L_old is not 0, execute S7203, otherwise execute S7204.
  • S730: Determine the stop time for which the upstream network device stops sending packets based on the obtained cache queue length, the congestion determination result of the cache queue, and the number of in-flight packets to be received by the current network device; and update the number of in-flight packets based on that stop time.
  • the stop time t_stop when the upstream network device stops sending messages is calculated.
  • the stop time t_stop when the upstream network device stops sending messages is calculated.
  • t_stop is updated.
  • Determine whether t_stop is greater than 0; when t_stop > 0, execute S7304, otherwise execute S7305.
  • the PFC backpressure frame is sent to the upstream network device.
  • F(t) is updated.
  • In the above process, the stop time required from the upstream network device is first calculated as if the cache queue were not a congestion queue. It is then determined whether the cache queue in the network device is a congestion queue; when it is, the stop time is updated. This avoids the formation of congestion queues in the network device.
  • the network device may also first determine whether the buffer queue in the input port of the network device is a congestion queue. Then, the stop time for the upstream network device to stop sending messages is determined based on whether the cache queue of the input port of the network device is a congestion queue, the queue length of the cache queue, and the number of flight messages to be received.
  • In this embodiment, the queue length of the cache queue at the network device input port and the number of in-flight packets to be received are periodically detected. Based on the cache queue length at the current input port and the number of in-flight packets, the time for which the network device needs the upstream network device to stop sending packets at the current sampling moment is determined. Further, whether the cache queue is a congestion queue is determined from the queue length and the accumulation speed of the cache queue. The network device then decides, based on whether the current cache queue is a congestion queue, whether to update that stop time. In the embodiment of this application, periodically detecting the input-port cache queue reduces the number of packets entering the circular queue, ensuring that remaining buffer space is available for traffic in the CBD ring and that data packets in the CBD ring can flow normally.
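One sampling period of the Figure 7 flow can be summarized as follows. This is a simplified sketch: `compute_t_stop` stands in for the patent's stop-time formula (which is not reproduced in the text), and the penalty factor `p` uses the assumed update `t_stop *= p`.

```python
def scan_once(queue_len, l_old, th1, f_t, compute_t_stop, p=2.0):
    """One sampling period of Figure 7 (simplified sketch).
    queue_len: current L(t); l_old: recorded initial length (0 = none);
    th1: minimum length for congestion detection; f_t: in-flight packets.
    Returns (t_stop, send_pfc, new_l_old)."""
    t_stop = compute_t_stop(queue_len, f_t)        # S730: provisional stop time
    if l_old == 0:                                 # S720: no initial length yet
        congested = False
        if queue_len > th1:                        # record the initial length
            l_old = queue_len
    else:                                          # accumulation check
        congested = (queue_len - l_old) > l_old
    if congested:
        t_stop *= p                                # assumed penalty update
    return t_stop, t_stop > 0, l_old
```

With a toy formula `compute = lambda l, f: l * 0.1`, a first scan at L(t) = 10 just records l_old, while a later scan at L(t) = 25 detects accumulation and doubles the stop time.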
  • FIG. 8 shows a flow chart of yet another method for suppressing the generation of a congestion queue provided by an embodiment of the present application.
  • the network device involved in this method may be the network device described in Figure 4(a).
  • the network device can be an upstream network device.
  • the method includes: S801-S802.
  • the network device receives a PFC backpressure frame sent by the downstream network device.
  • the backpressure frame carries a stop time to stop sending messages.
  • the backpressure frame is parsed to obtain the stop time for stopping sending messages carried in the backpressure frame.
  • S802 The network device stops sending packets to the downstream network device within the stop time.
  • the internal scheduler of the network device will stop sending the packets in the queue of the corresponding priority and start the timer.
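The upstream side (S801-S802) — pausing the corresponding priority queue for the carried stop time and resuming via a timer — can be sketched as below. Frame parsing is abstracted away; real PFC frames carry per-priority pause quanta, and the class and method names here are illustrative.

```python
import threading

class UpstreamPort:
    """Sketch of S801-S802: on receiving a PFC backpressure frame carrying
    a stop time, pause the queue of the corresponding priority and resume
    it when the timer fires."""
    def __init__(self):
        self.paused = {}  # priority -> paused flag

    def on_pfc_frame(self, priority: int, stop_time_s: float) -> threading.Timer:
        self.paused[priority] = True                     # scheduler stops this queue
        timer = threading.Timer(stop_time_s, self._resume, args=(priority,))
        timer.start()                                    # resume after stop_time_s
        return timer

    def _resume(self, priority: int):
        self.paused[priority] = False
```

A scheduler loop would consult `paused[priority]` before dequeuing packets for that priority.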
  • any network device shown in Figure 4(a) can be either an upstream network device or a downstream network device, or it can be an upstream network device and a downstream network device at the same time.
  • the network device that sends data is regarded as the upstream network device
  • the network device that receives data is regarded as the downstream network device.
  • The upstream network device stops sending packets to the downstream network device during the stop time carried in the PFC backpressure frame sent by the downstream network device, which effectively avoids the formation of congestion queues in the downstream network device.
  • FIG. 9 shows a schematic structural diagram of a network device provided by an embodiment of the present application.
  • the network device includes: a receiving module 901, a detection module 902, a processing module 903, and a sending module 904.
  • the receiving module 901 is used to receive data packets sent by the upstream network device.
  • The detection module 902 periodically detects the cache queue length at the input port of the current network device, the determination flag of the cache queue, the congestion determination result of the cache queue, and the number of in-flight packets to be received by the current network device. Further, the detection module 902 is also configured to determine whether the current cache queue is a congestion queue based on the obtained cache queue length and determination flag.
  • The processing module 903 is configured to determine the stop time for which the current upstream network device stops sending packets based on the obtained cache queue length, the congestion determination result of the cache queue, and the number of in-flight packets to be received by the current network device.
  • the sending module 904 sends a PFC backpressure frame to the upstream network device, where the PFC backpressure frame carries a stop time for the upstream network device to stop sending messages.
  • the network device embodiment depicted in Figure 9 is merely illustrative.
  • the division of modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • Each functional module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • each module in Figure 9 can be implemented in the form of hardware or software function modules.
  • the above-mentioned detection module 902 may be implemented as a software function module generated by at least one processor 401 in FIG. 4(b) after reading the program code stored in the memory 403.
  • the above-mentioned modules in Figure 9 can also be implemented by different hardware in the network device.
  • the detection module 902 is implemented by a part of the processing resources of at least one processor 401 in Figure 4(b) (for example, one of the multi-core processors).
  • The sending module 904 and the receiving module 901 are implemented by the interface circuit of Figure 4(b) and the remaining processing resources in at least one processor 401 (such as other cores in a multi-core processor), or by programmable devices such as an FPGA (field-programmable gate array) or a coprocessor.
  • the above functional modules can also be implemented by a combination of software and hardware.
  • For example, the sending module 904 and the receiving module 901 are implemented by hardware programmable devices, while the detection module 902 is implemented as a software function module generated by the CPU after reading the program code stored in the memory.
  • FIG. 10 shows a schematic structural diagram of yet another network device provided by an embodiment of the present application.
  • the network device includes: a receiving module 1001, a processing module 1002, and a sending module 1003.
  • the receiving module 1001 is configured to receive the PFC backpressure frame sent by the downstream network device.
  • the PFC backpressure frame carries the stop time required for the current network device to stop sending packets to the downstream network device.
  • the processing module 1002 is configured to stop the sending module 1003 from sending messages to the downstream network device according to the received PFC backpressure frame.
  • the network device embodiment depicted in Figure 10 is merely illustrative.
  • the division of modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • Each functional module in each embodiment of the present application can be integrated into one processing module, or each module can exist physically alone, or two or more modules can be integrated into one module.
  • each module in Figure 10 can be implemented in the form of hardware or software function modules.
  • The above-mentioned processing module 1002 can be implemented as a software function module generated by at least one processor 401 in Figure 4(b) after reading the program code stored in the memory 403.
  • Each of the above modules in Figure 10 can also be implemented by different hardware in the network device.
  • The processing module 1002 is implemented by a part of the processing resources (such as a core in a multi-core processor) of at least one processor 401 in Figure 4(b), while the receiving module 1001 and the sending module 1003 are implemented by the interface circuit of Figure 4(b) and the remaining processing resources in at least one processor 401, or by programmable devices such as an FPGA or a coprocessor.
  • The above functional modules can also be implemented by a combination of software and hardware.
  • For example, the receiving module 1001 and the sending module 1003 are implemented by hardware programmable devices, while the processing module 1002 is implemented as a software function module generated by the CPU after reading the program code stored in the memory, running on a part of the processing resources (such as a core in a multi-core processor) of at least one processor 401 in Figure 4(b).
  • the method steps in the embodiments of the present application can be implemented by hardware or by a processor executing software instructions.
  • Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), registers, hard disks, removable hard disks, CD-ROMs, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage media may be located in an ASIC.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted over a computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device, such as a server or data center, integrating one or more available media.
  • The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)), etc.

Abstract

A method and apparatus for suppressing the generation of congestion queues. The method may include: determining the queue length of a cache queue at an input port of a first network device and the number of packets to be received by the first network device; determining a first time based on the queue length and the number of packets to be received, the first time being the time for which a second network device stops sending packets to the first network device; when the cache queue is determined, based on the queue length, to be a congestion queue, updating the first time to a second time according to the degree of congestion of the cache queue; and sending the second time to the second network device. By detecting the cache queue at the input port, the number of packets entering the circular queue is reduced, ensuring that remaining buffer space is available for traffic in the CBD ring and that data packets in the CBD ring can flow normally.

Description

Method and apparatus for suppressing the generation of congestion queues
This application claims priority to Chinese patent application No. 202210565502.3, entitled "Method and apparatus for suppressing the generation of congestion queues", filed with the China National Intellectual Property Administration on May 23, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of communication technology, and in particular to a method and apparatus for suppressing the generation of congestion queues.
Background
Generally, data centers and cloud service providers can deploy remote direct memory access (RDMA) so that the network achieves low latency, high throughput, and low CPU overhead. Among the available RDMA technologies, RDMA over converged ethernet (RoCE) is attractive because it is compatible with current IP- and Ethernet-based data center networks. Deploying RoCE requires priority-based flow control (PFC) to provide a lossless L2 network. With PFC, a network device can pause data transmission from its direct upstream network device before a buffer overflow occurs, thereby avoiding packet loss. However, PFC can easily cause deadlock in an Ethernet network, and in severe cases the entire network may be blocked.
Summary
This application provides a method and apparatus for suppressing the generation of congestion queues. By detecting the accumulation of queues at the input port, congestion queues are identified; then, based on whether a queue is a congestion queue and the buffer occupancy of the device, the time for which the upstream device needs to stop sending is determined, ensuring that remaining buffer space is available for traffic in the CBD ring and that data packets in the CBD ring can flow normally.
In a first aspect, this application provides a method for suppressing the generation of congestion queues. The method includes: determining the queue length of a cache queue at an input port of a first network device and the number of packets to be received by the first network device; determining a first time based on the queue length and the number of packets to be received, the first time being the time for which a second network device stops sending packets to the first network device; when the cache queue is determined, based on the queue length, to be a congestion queue, updating the first time to a second time according to the degree of congestion of the cache queue; and sending the second time to the second network device.
In the above solution, the stop time for which the upstream network device needs to stop sending packets is jointly decided by detecting the queue length of the cache queue at the input port of the network device, the number of in-flight packets, and whether the current cache queue is a congestion queue. Further, in this application, before a congestion queue forms in the network device, a first time can be determined from the queue length of the input-port cache queue and the number of in-flight packets, and whether the first time needs to be updated to a second stop time is then determined according to whether the cache queue is a congestion queue. When the cache queue is a congestion queue, the second time is sent to the second network device to reduce the number of packets received by the cache queue in the first network device and avoid the generation of a congestion queue, thereby ensuring that data packets in the CBD ring can flow normally.
In a possible implementation, when it is determined that the cache queue is not a congestion queue, the first time is sent to the second network device.
That is, when the cache queue is not a congestion queue, the first time also needs to be sent to the second network device so that the second network device stops sending packets to the first network device, avoiding the generation of a congestion queue in the first network device.
In a possible implementation, determining whether the cache queue is a congestion queue includes: obtaining a congestion detection determination result C of the cache queue, where the value of C is True or False; when C is True, the cache queue is a congestion queue; when C is False, the cache queue is a non-congestion queue; the initial value of C is False; and determining whether the cache queue is a congestion queue according to the value of C.
That is, after the first network device determines the first time based on the queue length of the cache queue and the number of packets to be received, the first network device can directly obtain the congestion determination result of the cache queue and determine from it whether the cache queue is a congestion queue.
In a possible implementation, the method further includes: obtaining a congestion detection determination flag J of the cache queue, where the value of J is True or False; when J is True, the cache queue needs congestion detection; when J is False, the cache queue does not need congestion detection; the initial value of J is True; when J is True, performing congestion detection on the cache queue; when the cache queue is a congestion queue, updating the value of C to True; or, when the cache queue is not a congestion queue, updating the value of J to False.
That is, before the congestion judgment is performed on the cache queue, the determination flag of the cache queue is checked first, so that congestion detection is performed again only when the cache queue has never been detected, or has been detected and found to be a congestion queue. This avoids unnecessary congestion judgments on the cache queue and saves system overhead.
In a possible implementation, performing congestion detection on the cache queue includes: determining whether the first network device stores an initial length of the cache queue, where the initial length is the queue length when the queue length of the cache queue first exceeds the minimum queue length at which congestion queue detection can be performed; when the first network device stores the initial length of the cache queue, determining, based on the queue length and the initial length, whether the queue length of the cache queue is continuously accumulating; and when it is, determining that the cache queue is a congestion queue.
That is, before determining whether the current cache queue is a congestion queue, the historical queue length record of the cache queue needs to be obtained, and whether the cache queue is a congestion queue is determined according to that record.
In a possible implementation, when the network device does not store an initial length of the cache queue, it determines whether the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed; when it is, the queue length of the cache queue is saved as the initial length of the cache queue.
That is, when the current network device does not store an initial length for the cache queue, whether the cache queue is a congestion queue cannot be judged from the current length and the initial length. Therefore, it is judged whether the current length of the cache queue can be saved as its initial length for later congestion detection. Further, only when the queue length of the cache queue reaches the minimum queue length at which congestion queue detection can be performed is the current queue length saved as the initial length of the cache queue for later congestion detection.
In a possible implementation, determining, based on the queue length and the initial queue length of the cache queue, whether the queue length of the cache queue is continuously accumulating includes: determining, based on the queue length and the initial queue length, whether the enqueue rate of the cache queue is greater than its dequeue rate; when the enqueue rate is greater than the dequeue rate, determining that the queue length of the cache queue is continuously accumulating.
That is, whether the queue length of the current cache queue is continuously accumulating can be judged from the rate at which data enters the queue and the rate at which data leaves it.
In a second aspect, this application provides a network device, including:
a detection module, configured to determine the queue length of a cache queue at an input port of a first network device and the number of packets to be received by the first network device;
a processing module, configured to determine a first time based on the queue length and the number of packets to be received, the first time being the time for which a second network device stops sending packets to the first network device;
the detection module, further configured to determine, based on the queue length, whether the cache queue is a congestion queue;
the processing module, further configured to, when the cache queue is a congestion queue, update the first time to a second time according to the degree of congestion of the cache queue; and
a sending module, configured to send the second time to the second network device.
In a possible implementation, the sending module is further configured to:
when it is determined that the cache queue is not a congestion queue, send the first time to the second network device.
In a possible implementation, the detection module is configured to:
obtain the congestion detection determination result C of the cache queue, where the value of C is True or False; when C is True, the cache queue is a congestion queue; when C is False, the cache queue is a non-congestion queue; and the initial value of C is False;
determine whether the cache queue is a congestion queue according to the value of C.
In a possible implementation, the detection module is further configured to:
obtain the congestion detection determination flag J of the cache queue, where the value of J is True or False; when J is True, the cache queue needs congestion detection; when J is False, the cache queue does not need congestion detection; and the initial value of J is True;
when the congestion detection determination flag J of the cache queue is True, perform congestion detection on the cache queue.
The processing module is further configured to:
when the cache queue is a congestion queue, update the value of the congestion detection determination result C of the cache queue to True; or, when the cache queue is not a congestion queue, update the value of the congestion detection determination flag J of the cache queue to False.
In a possible implementation, the detection module is further configured to:
determine whether the first network device stores an initial length of the cache queue, the initial length being the queue length when the queue length of the cache queue first exceeds the minimum queue length at which congestion queue detection can be performed;
when the first network device stores the initial length of the cache queue, determine, based on the queue length and the initial length of the cache queue, whether the queue length of the cache queue is continuously accumulating;
when the queue length of the cache queue is continuously accumulating, determine that the cache queue is a congestion queue.
In a possible implementation, the detection module is configured to:
when the network device does not store an initial length of the cache queue, determine whether the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed;
when the queue length of the cache queue is greater than that minimum queue length, save the queue length of the cache queue as the initial length of the cache queue.
In a possible implementation, the detection module is configured to:
determine, based on the queue length and the initial queue length of the cache queue, whether the enqueue rate of the cache queue is greater than its dequeue rate;
when the enqueue rate of the cache queue is greater than its dequeue rate, determine that the queue length of the cache queue is continuously accumulating.
In a third aspect, this application provides a computer-readable medium storing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.
In a fourth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.
Brief Description of the Drawings
Figure 1(a) is a schematic diagram of a PFC deadlock in triangular routing;
Figure 1(b) is a schematic diagram of a PFC deadlock in a CLOS topology;
Figure 2 is a schematic diagram of a deadlock recovery process for actively recovering from a PFC deadlock;
Figure 3 is a schematic flow chart of a method for avoiding PFC deadlock by multi-queue switching;
Figure 4(a) is a system architecture diagram of a CLOS network provided by an embodiment of this application;
Figure 4(b) is a schematic structural diagram of a switch provided by an embodiment of this application;
Figure 5 is a flow chart of a method for determining whether a cache queue is a congestion queue provided by an embodiment of this application;
Figure 6 is a flow chart of a method for suppressing the generation of a congestion queue provided by an embodiment of this application;
Figure 7 is a flow chart of yet another method for suppressing the generation of a congestion queue provided by an embodiment of this application;
Figure 8 is a flow chart of yet another method for suppressing the generation of a congestion queue provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of a network device provided by an embodiment of this application;
Figure 10 is a schematic structural diagram of yet another network device provided by an embodiment of this application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
In the descriptions of the embodiments of this application, any embodiment or design described as "exemplary", "such as", or "for example" should not be construed as preferred over or more advantageous than other embodiments or designs. Rather, words such as "exemplary", "such as", and "for example" are intended to present relevant concepts in a concrete manner.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the indicated technical features. Thus, features defined with "first" or "second" may explicitly or implicitly include one or more of such features. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
To reduce network latency inside data centers and improve processing efficiency, RDMA technology emerged. RDMA achieves high throughput, ultra-low latency, and low CPU overhead by allowing user applications to directly read and write remote memory without the CPU performing multiple memory copies. To realize the true performance of RDMA and break through the network performance bottleneck of large-scale distributed systems in data centers, a lossless network environment without packet loss must be built for RDMA, and the key to avoiding packet loss is solving network congestion. Generally, lossless Ethernet can rely on hop-by-hop priority-based flow control (PFC) to solve packet loss caused by buffer overflow in the network. PFC extends the basic IEEE 802.3X flow control, allowing eight virtual channels to be created on one Ethernet link, each assigned a corresponding priority. PFC allows any one of the virtual channels to be paused and restarted independently while allowing traffic on the other virtual channels to pass without interruption.
The PFC mechanism can avoid packet loss by pausing, at the corresponding priority, the data transmission of the direct upstream network device of the current network device. However, the PFC mechanism can cause PFC deadlock in the network, and in severe cases the entire network may be blocked. A PFC deadlock occurs when a cyclic buffer dependency (CBD) exists among a group of network devices, and each network device in the cycle holds all the buffer needed by its upstream network device while waiting for its downstream network device to release some buffer and resume its packet transmission.
Figure 1(a) shows a PFC deadlock scenario caused by a cyclic buffer dependency. In Figure 1(a), switch A, as the downstream network device of switch B (switch B being the upstream network device of switch A), receives and buffers the data sent by switch B. Switch B, as the downstream network device of switch C, receives and buffers the data sent by switch C. Switch C, as the downstream network device of switch A, receives and buffers the data sent by switch A. Thus, a cyclic buffer dependency is formed among switches A, B, and C. When the buffers of switches A, B, and C all reach the XOFF (flow-controlled) waterline, each of them sends a PAUSE frame to its upstream network device to tell it to stop sending packets. At this point, in the network topology formed by switches A, B, and C, all switches are in a flow-stopped state and a PFC deadlock occurs (a PFC deadlock is a network state in which multiple switches become congested at the same time due to micro-loops or other causes, the buffer consumption of each port exceeds its threshold, and the switches wait for each other to release resources, permanently blocking the data flows on all switches). Once a PFC deadlock occurs, due to the backpressure effect of PFC, the throughput of the entire network or part of the network drops to zero. The backpressure effect of PFC means that when a switch port becomes congested and triggers the XOFF waterline, the switch sends PAUSE frames upstream, and the upstream device stops sending data after receiving the PAUSE frame.
Besides the triangular routing shown in Figure 1(a), in the multi-stage switching (CLOS) network topology shown in Figure 1(b), the first network device, as the downstream network device of the fourth network device, receives and buffers the data sent by the fourth network device. The second network device, as the downstream network device of the first, receives and buffers the data sent by the first. The third network device, as the downstream network device of the second, receives and buffers the data sent by the second. The fourth network device, as the downstream network device of the third, receives and buffers the data sent by the third. That is, a cyclic buffer dependency exists among the first, second, third, and fourth network devices. When a cyclic buffer dependency exists in a CLOS network, deadlock can also occur: when all four switches reach the XOFF waterline, they all simultaneously send PAUSE frames to their upstream network devices. All switches in the topology are then in a flow-stopped state, and due to the PFC backpressure effect the throughput of the entire network or part of it drops to zero. Although the cycles may be temporary, the deadlocks they cause are not; after the problem that led to the deadlock (misconfiguration, failure/update, etc.) is fixed, the deadlock does not break automatically. Therefore, when deploying RDMA over Ethernet, some mechanism must be used to handle deadlock.
For example, a corresponding mechanism can actively monitor whether a PFC deadlock has formed. When a PFC deadlock is found in the network, the deadlock can be broken by resetting links/ports/hosts, etc. As shown in Figure 2, this includes steps 1 to 3.
Step 1: Device 2 starts a timer and monitors the received PFC backpressure frames.
After a port of Device 2 receives a PFC backpressure frame sent by Device 1, the internal scheduler of Device 2 stops sending the queue traffic of the corresponding priority and starts the timer.
Step 2: Determine whether the corresponding priority queue at the receiving port of Device 2 has been continuously flow-controlled during the timer period.
Step 3: When the corresponding priority queue at the receiving port of Device 2 has been continuously flow-controlled during the timer period, determine that a PFC deadlock has occurred between Device 1 and Device 2.
In the above solution, Device 2 starts detecting the PFC backpressure frames received by the queue according to the configured deadlock detection method and detection precision. If the queue remains in the PFC-XOFF (flow-controlled) state for the configured PFC deadlock detection time, a PFC deadlock is considered to have occurred in the network and the PFC deadlock recovery process is required. Deadlock is detected by checking the duration of PFC-XOFF, and recovery is performed by dropping packets or ignoring PFC backpressure frames. Although PFC deadlock recovery can be performed after a deadlock occurs, PFC backpressure frames may be ignored during recovery. Ignoring PFC backpressure frames may cause buffer bloat and, in turn, packet loss. When packets are lost, the receiver performs go-back-N loss recovery, which severely affects throughput. Moreover, if the CBD persists, the network will immediately enter the next round of PFC deadlock right after recovery, and throughput will continue to be heavily affected.
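The detection side of the Figure 2 scheme reduces to a window check over the flow-control state. The sketch below is a minimal illustration; the boolean-sample representation and the window length are assumptions, not details from the patent.

```python
def detect_pfc_deadlock(xoff_samples, window):
    """Declare a PFC deadlock when the priority queue has been in the
    PFC-XOFF (flow-controlled) state for every sample in the detection
    window, i.e. it was never released during the whole timer period."""
    if len(xoff_samples) < window:
        return False          # not enough history to decide yet
    return all(xoff_samples[-window:])
```

Recovery (dropping packets or ignoring backpressure frames) would be triggered only when this returns True.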
As another example, deadlock in the network can also be avoided by multi-queue switching. As shown in Figure 3, the network includes switch 1 (Switch1), switch 2 (Switch2), and switch 3 (Switch3). Switch 2, as the downstream network device of switch 1, receives the data sent by switch 1. Switch 3, as the downstream network device of switch 2, receives the data sent by switch 2. The ports of switch 1 and switch 3 are configured with two queues (Queue5 and Queue6) for sending and receiving data. Switch 2 is configured with four queues (ingress queues Queue5 and Queue6, and egress queues Queue5 and Queue6) for sending and receiving data. Queue 5 and queue 6 in network device 2 are both lossless queues. Network device 2 maps ingress queue 5 into egress queue 6 by modifying the differentiated services code point (DSCP). When downstream network device 3 becomes congested, that is, when PFC is triggered on queue 6 in network device 3, that PFC is mapped onto ingress queue 5 in network device 2. In this way, the cyclic buffer dependency can be broken and PFC deadlock avoided.
In the above solution, deadlock is avoided by multi-queue switching, and flow control is performed through cross-priority PFC backpressure to prevent packet loss. Although this solution can avoid cyclic buffer dependency, it requires the network device to provide multiple queues for data switching. However, the number of priorities needed to avoid cyclic buffer dependency is determined by the longest path in the network, which grows with the network scale. Commercial network devices support a very limited number of lossless priorities in practice, so deployment is difficult.
As can be seen from the above, current congestion queue suppression methods only distinguish "congested" flows from "victim" flows and do not further distinguish ports and their corresponding queues. As a result, queue traffic entering the CBD at "high speed" far exceeds queue traffic leaving the CBD at "low speed", so the circular queue buffer keeps accumulating; when the buffer in a queue accumulates to a certain threshold, the queue becomes a congestion queue. The data newly entering the CBD loop thus becomes the main culprit of congestion: the corresponding queue of that port becomes a congestion queue, while the queues of the other ports become victim queues. This is precisely the main source of congestion when a loop appears, and it may even cause deadlock.
To suppress the generation of congestion queues in network devices, an embodiment of this application provides a method for suppressing the generation of congestion queues, mainly applied to network architectures deploying lossless Ethernet (RoCE+PFC). The congestion queue is identified by detecting the accumulation of the cache queue at the input port; then, whether the cache queue is a congestion queue and the buffer occupancy of the cache queue jointly decide the time for which the upstream network device needs to stop sending data, ensuring that remaining buffer space is available for traffic in the CBD ring and that data packets in the CBD ring can flow normally.
For example, Figure 4(a) is a schematic system architecture diagram of a CLOS network provided by an embodiment of this application. As shown in Figure 4(a), a typical CLOS network includes network devices and hosts. There may be at least one network device. When there are multiple network devices, they may be switches (S1, S2, …, Sm, L1, L2, …, Ln), each corresponding to multiple hosts. The switches can exchange data with each other, and each host receives data through a switch. Take switch S1 and switch L1 as an example. When switch S1, as an upstream network device, sends data to switch L1, switch L1 can periodically detect the accumulation of the cache queue at its input port, the queue length of the cache queue, and the number of in-flight packets to be received at that input port. Switch L1 then identifies from the accumulation of the input-port cache queue whether the cache queue is a congestion queue, and jointly decides the time for which the upstream network device needs to stop sending data based on the queue length of the cache queue, the number of in-flight packets (packets that the upstream network device has sent but the downstream network device has not yet received), and whether the queue is a congestion queue. This reduces the number of packets entering the circular queue buffer, ensures that remaining buffer space is available for traffic in the CBD ring, and keeps data packets in the CBD ring flowing normally.
It can be understood that the network devices described in the embodiments of this application (such as the network devices in Figure 4(a)) may be network devices with a data forwarding function, such as routers and switches.
The hardware structure of the network devices involved in the embodiments of this application is introduced below, taking a switch as an example.
For example, Figure 4(b) shows the hardware structure of a switch. As shown in Figure 4(b), the switch includes a processor 401, an interface circuit 402, a memory 403, and a switching module 404, which may be connected through a bus.
The memory 403 is used to store program instructions and data. For example, after the switch receives data sent by another switch, it can store the data in the memory 403. The processor 401 is the computing core and control core of the switch. The processor 401 reads the program instructions and data stored in the memory 403 to execute the method for suppressing the generation of congestion queues. The interface circuit 402 includes the internal circuits connecting the ports. The switching module 404 communicates with the processor 401 through the bus to complete data transmission.
Based on the above, the method for suppressing the generation of congestion queues provided by the embodiments of this application is introduced next.
For example, Figure 5 provides a method for determining whether a cache queue is a congestion queue. The network device involved in this method may be the network device described in Figure 4(a). Referring to Figure 5, the method includes S501-S506.
S501: The network device periodically obtains the queue length L(t) of the cache queue at its input port, the queue determination flag J, and the congestion determination result C.
In this embodiment, the cache queue at the input port of the network device is used to buffer packets received from the network. The cache queue at the input port is a first-in-first-out data structure: insertion is only allowed at one end and deletion at the other, and direct access to any data other than at these two ends is prohibited. That is, in this embodiment, one end of the cache queue receives data from the network and the other end sends data into the network. When the rate at which the network device receives data from the network exceeds the rate at which it sends data into the network, the cache queue buffers data that has not yet been sent. The size of the data buffered in the cache queue is then the queue length L(t) of the cache queue.
In a possible example, the queue length of the cache queue at the input port of the network device is initialized to 0. Then, each time the input port of the network device receives a packet, the queue length of the cache queue is incremented by 1; each time the network device sends a packet through its output port, the queue length is decremented by 1.
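The counter scheme above can be sketched directly as a FIFO wrapper (a minimal sketch; the class and method names are illustrative, not from the patent):

```python
from collections import deque

class CacheQueue:
    """FIFO cache queue with a length counter L(t): the length starts at 0,
    is incremented by 1 per packet received at the input port, and is
    decremented by 1 per packet sent through the output port."""
    def __init__(self):
        self._q = deque()
        self.length = 0  # L(t)

    def receive(self, pkt):
        self._q.append(pkt)
        self.length += 1

    def send(self):
        pkt = self._q.popleft()  # FIFO: oldest packet leaves first
        self.length -= 1
        return pkt
```

The periodic scan in S501 would simply read `length` at each sampling moment.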
进一步地,还需要为网络设备的输入端口的缓存队列设置判别标记J、拥塞判别结果C。判别标记J用于表示该缓存队列是否需要进行拥塞检测。其中,J的取值可以为True或者False。当J的取值为True时,表示该缓存队列需要进行拥塞检测。当J的取值为False时,表示该缓存队列不需要进行拥塞检测。在一个示例中,可以设置判别标记J的初始值为True,即所有的队列都需要进行拥塞检测。拥塞判别结果C用于表示缓存队列是否为拥塞队列。其中,C的取值可以为True或者False。当C的取值为True时,表示缓存队列为拥塞队列。当C的取值为False时,表示缓存队列不为拥塞队列。在一个示例中,可以设置拥塞判别结果C的初始化值为False,表示缓存队列不为拥塞队列。
网络设备需要周期性的(比如间隔1分钟或者1个小时等)扫描其输入端口的缓存队列的队列长度。在扫描缓存队列的同时还需要检测该队列的判别标记。当网络设备第一次对输入端口的缓存队列进行拥塞检测时,可以根据获取的队列长度L(t)、判别标记J的初始值对缓存队列进行拥塞检测,并根据拥塞检测结果修改判别标记J、拥塞检测结果C的值。
S502,网络设备对当前缓存队列的判别标记J进行判断,当J==True时,执行S503,否则执行S501。
本实施例中,在对缓存队列进行拥塞队列检测之前,首先对缓存队列的判别标记进行判断。当缓存队列的判别标记J==True时,表示该缓存队列没有进行过拥塞检测,或者进行过拥塞检测以后,确定该缓存队列为拥塞队列。此时,需要再次对该缓存队列进行拥塞检测。当缓存队列的判别标记J==False时,表明该缓存队列已经进行过拥塞检测,且检测结果为该缓存队列不为拥塞队列。此时,不需要再次对该缓存队列进行拥塞检测。在对缓存队列进行拥塞队列检测之前,先对该缓存队列的判别标记J进行判断。避免在该缓存队列不为拥塞队列的情况下,对该缓存队列进行拥塞检测,浪费系统资源。
S503,网络设备确定缓存队列的初始长度是否为0,当缓存队列的初始长度不为0时,执行S506,否则执行S504。
本实施例中,当缓存队列的判别标记J==True时,表明需要对该缓存队列进行拥塞判断。此时,可以根据网络设备的缓存队列的队列长度以及缓存队列的初始长度记录来确定缓存队列是否为拥塞队列。其中,缓存队列的初始长度是指缓存队列第一次大于Th1时的队列长度。Th1的单位为字节,表示可以进行拥塞队列检测的最小队列长度。
在一个可能的示例中,可以用L_old来记录缓存队列的初始长度。当L_old=0时表示缓存队列不存在初始长度。此时无法根据缓存队列的历史长度记录来确定当前缓存队列是否为拥塞队列。当L_old大于0时,可以根据当L_old记录的缓存队列初始长度来判断缓存队列是否为拥塞队列。比如,设置缓存队列可以进行拥塞检测的最小队列长度Th1=5。 在t1时刻,获取到缓存队列的队列长度L(t)=7、缓存队列的初始长度L_old=0。在对缓存队列进行拥塞判断时,确定该缓存队列的初始长度L_old为0以后,将L(t)与Th1进行比较。由于L(t)=7大于Th1=5,表明该缓存队列可以进行拥塞队列检测。因此,可以将该缓存队列的队列长度保存到L_old中(即令L_old=7),以供下一次拥塞队列检测时使用。
S504,网络设备将缓存队列的队列长度与可以进行拥塞队列检测的最小队列长度进行比较,当缓存队列的队列长度大于可以进行拥塞队列检测的最小队列长度时,执行S505,否则执行S501。
在本实施例中,由于对缓存队里进行拥塞判断时,需要同时获取的缓存队列的长度与该缓存队列的初始长度。因此,在当缓存队列的初始长度L_old为0时,还需要确定获取的缓存队列长度是否可以保存为缓存队列的初始长度。进一步地,在确定获取的缓存队列长度是否可以保存为缓存队列的初始长度时,可以将缓存队列的队列长度与可以进行拥塞队列检测的最小队列长度进行比较。只有获取的缓存队列的队列长度大于等于可以进行拥塞队列检测的最小队列长度时,才可以将该队列长度保存为该缓存队列的初始长度。以用于下一次的拥塞队列检测。
S505: The network device saves the queue length of the cache queue as the queue's initial length.
In this embodiment, once L_old = 0 has been determined, it must further be checked whether the queue length is greater than Th1. When the queue length is greater than Th1, the cache queue is eligible for congestion detection, so its queue length is saved into L_old for use in the next detection round.
In one possible example, the minimum queue length for congestion detection is set to Th1 = 5. At time t1, the network device obtains the queue length L(t) = 7, the detection flag J == True, and the initial length L_old = 0. When performing congestion queue detection at t1, the device first checks whether the queue's detection flag J is True. After confirming J == True, it checks whether the initial length L_old is 0. Having determined that L_old is 0, it compares L(t) with Th1. Since L(t) = 7 is greater than Th1 = 5, the queue is eligible for congestion detection, so its queue length is saved into L_old (i.e., L_old = 7) for use in the next detection round.
At time t2, the network device obtains the queue length L(t) = 12, the detection flag J == True, and the initial length L_old = 7. When performing congestion queue detection at t2, the device again first confirms J == True, then checks whether L_old is 0. Since the initial length L_old is not 0, the device can judge whether the queue is congested based on the queue length L(t) and the initial length L_old.
S506: The network device judges whether the current cache queue is persistently congested. When it is, the queue's determination result C is updated to True; when the queue's state is not congested, the queue's detection flag J is updated to False.
In this embodiment, once the queue length and the initial length of the current cache queue are known, it can be determined from them whether the queue is continuously accumulating (when the queue's enqueue rate exceeds its dequeue rate, the packets in the queue accumulate continuously). Specifically, whether the packets in the current cache queue are continuously accumulating can be determined by checking whether L(t) - L_old is greater than L_old. When L(t) - L_old > L_old, the packets in the queue are judged to be continuously accumulating, and the queue is marked as a congestion queue by setting C = True. Otherwise, the congestion detection process stops and J is set to False.
In one possible example, after the queue is judged not to be a congestion queue, J is set to False. Since the entry condition for congestion detection is J == True, once J == False, congestion detection is no longer performed for data subsequently received on this queue.
In another possible example, when congestion detection should continue for data subsequently received on this queue, after the first congestion judgment the entry condition for the next judgment is set to "whether L_old is 0"; that is, after the first judgment, the next detection round enters at S503.
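The detection steps S502-S506 above can be sketched as a short routine. The state layout (a dict holding J, C, and L_old for one queue) and the threshold value are illustrative assumptions, not the patented implementation:

```python
# A minimal sketch of the congestion-queue detection of S502-S506.
# Names (Th1, L_old, J, C) follow the description above; the state
# representation is an assumption for illustration.

TH1 = 5  # minimum queue length (bytes) at which detection runs

def detect_congestion(state, L_t):
    """One detection round for a single input-port cache queue.

    state is a dict with keys J, C, L_old (initialized True/False/0).
    """
    if not state["J"]:            # S502: skip queues already cleared
        return state
    if state["L_old"] == 0:       # S503: no initial length recorded yet
        if L_t > TH1:             # S504: eligible for detection?
            state["L_old"] = L_t  # S505: record the initial length
        return state
    # S506: persistent-accumulation test L(t) - L_old > L_old
    if L_t - state["L_old"] > state["L_old"]:
        state["C"] = True         # mark as congestion queue
    else:
        state["J"] = False        # stop detecting this queue
    return state
```

Running the t1/t2 example above through this sketch records L_old = 7 in the first round, then compares 12 - 7 against 7 in the second.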
Illustratively, based on the congestion-queue detection method shown in FIG. 5, FIG. 6 shows a flowchart of a method for suppressing the generation of congestion queues provided by an embodiment of this application. The method is applied to the network architecture shown in FIG. 4(a), and the network device involved in the method may be the network device described in FIG. 4(a). Referring to FIG. 6, the method includes S601-S607.
S601: The network device periodically obtains the queue length of the cache queue at its input port and the number of in-flight packets the device has yet to receive.
In this embodiment, the network device may periodically (for example, every 1 minute or every 1 hour) scan the queue length L(t) of the input port's cache queue and obtain the number F(t) of in-flight packets the current device needs to receive. In-flight packets are packets that an upstream network device has already sent but the downstream network device has not yet received. For any two network devices transmitting data, the device sending data can serve as the upstream network device and the device receiving data as the downstream network device. Taking the CLOS network in FIG. 4 as an example, switch S1 sends data to switch L1 and switch L2; switch S1 is therefore the upstream network device of switches L1 and L2, and switches L1 and L2 are the downstream network devices of switch S1.
S602: The network device determines, based on the queue length of the current cache queue and the number of in-flight packets to be received, the stop time during which its upstream network device stops sending packets.
In this embodiment, after the queue length of the current cache queue and the number of in-flight packets to be received are determined, the stop time t_stop1 during which the upstream network device needs to stop sending can be determined from the queue length and the in-flight packet count. Specifically,
where T is the period at which the network device scans the cache queue at its input port; R is the packet sending rate negotiated between the upstream and downstream network devices; H is the headroom reserved at the device's input port for absorbing in-flight packets and preventing packet loss, initialized to BDP + R*T; and F(t) is the number of in-flight packets, which may be initialized to one bandwidth-delay product (BDP). The BDP is the maximum number of bits on the link, also called the link length in bits. Further, BDP = propagation delay * bandwidth.
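The t_stop1 formula itself is given only in the figure, so the sketch below substitutes one plausible drain-time form (backlog plus in-flight bytes beyond the reserved headroom H, drained at the negotiated rate R) purely for illustration; the function name and units are assumptions:

```python
# Hypothetical reconstruction of the t_stop1 computation of S602.
# The patent presents the formula only as a figure; this drain-time
# form is an assumption for illustration, not the patented formula.

def stop_time(L_t: float, F_t: float, H: float, R: float) -> float:
    """Time (s) the upstream device should pause, given queue length
    L_t and in-flight amount F_t in bytes, headroom H in bytes, and
    negotiated sending rate R in bytes/s."""
    backlog = L_t + F_t - H  # bytes exceeding the reserved headroom
    return max(0.0, backlog / R)
```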
S603: The network device determines whether the current cache queue is a congestion queue. When the cache queue is a congestion queue, S604 is executed; otherwise, S605 is executed.
In this embodiment, after the network device has determined, from the queue length and the in-flight packet count, the stop time during which the upstream network device needs to stop sending, it also needs to determine whether the cache queue at its input port is a congestion queue, so that the determined stop time is more accurate.
In one possible example, the network device may perform congestion detection on the cache queue directly. The congestion detection process for the cache queue is the same as S501-S506 and is not repeated here.
In another possible example, the network device may judge whether the input port's cache queue is a congestion queue while executing S601 (the judging process is the same as S501-S506 and is not repeated here), obtaining the queue's congestion determination result C, and then use that result to decide whether the queue is a congestion queue. Performing congestion detection on the cache queue while executing step S601 saves the time the network device spends generating the upstream stop time.
In one example, a congestion determination result C is preset for the cache queue of the network device's input port, with C initialized to False. Within one scan period, C may take the value True or False: when C is True, the cache queue is a congestion queue, and when C is False, the cache queue is a non-congestion queue.
The network device needs to periodically scan the queue length of the cache queue at its input port. While scanning, it also judges whether the cache queue is a congestion queue based on the queue length and the recorded historical length of the queue. When the cache queue is a congestion queue, the queue's congestion determination result C is set to True.
S604: The network device updates the stop time during which the upstream network device stops sending packets.
In this embodiment, when the cache queue is determined to be a congestion queue, the stop time during which the upstream network device stops sending packets is updated, so as to reduce the traffic entering the CBD and prevent PFC deadlock from occurring. Specifically, the updated stop time t_stop2 during which the upstream network device stops sending packets is
where T is the period at which the network device scans the input-port cache queue; p is a penalty factor, with p > 1; and t_stop1 is the stop time the network device determined from the queue length of the cache queue and the in-flight packet count.
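The t_stop2 formula likewise appears only in the figure; below is a hedged sketch assuming a multiplicative penalty capped at the scan period T, which is one natural reading of "penalty factor p > 1" but is not confirmed by the text:

```python
# Hypothetical reconstruction of the S604 update: when the queue is a
# congestion queue, lengthen the pause by a penalty factor p > 1. The
# multiplicative form capped at the scan period T is an assumption.

def update_stop_time(t_stop1: float, p: float, T: float) -> float:
    assert p > 1.0, "penalty factor must exceed 1"
    # never ask upstream to pause longer than one scan period
    return min(p * t_stop1, T)
```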
S605: The network device judges whether the stop time during which the upstream network device stops sending packets is greater than 0. When it is greater than 0, S606 is executed; otherwise, S607 is executed.
In this embodiment, before sending the determined stop time to the upstream device, the device must also check whether the obtained stop time is 0. When the determined stop time is 0, the device's cache queue is not congested and the upstream network device may continue sending packets; in this case the network device does not need to send a PFC backpressure frame to the upstream device. When the determined stop time is not 0, the current network device is congested or about to become congested, and it needs to send a PFC backpressure frame to the upstream device. Checking the stop time carried in the PFC backpressure frame before sending it avoids sending backpressure frames to the upstream device when none is needed, saving channel resources in the network.
In one possible example, when the cache queue at the network device's input port is a congestion queue, the updated stop time t_stop2 must be sent to the upstream network device. The device therefore checks t_stop2 before sending a backpressure frame, and sends the PFC backpressure frame only when t_stop2 is greater than 0.
When the cache queue at the network device's input port is not a congestion queue, the determined stop time t_stop1 is sent to the upstream network device. The device checks t_stop1 before sending a backpressure frame, and sends the PFC backpressure frame only when t_stop1 is greater than 0.
S606: The network device sends a PFC backpressure frame to the upstream network device.
In this embodiment, the PFC backpressure frame sent by the current network device carries the stop time during which the upstream device is to stop sending packets, so that the upstream device stops sending packets to the current device for that duration, thereby avoiding PFC deadlock.
S607: The network device updates the number of in-flight packets to be received.
In this embodiment of the application, the updated number of in-flight packets may be determined from the RTT minus the sum of the stop times within that RTT. That is, the updated in-flight packet count F(t)' = (RTT - t_stop3) × BDP, where RTT is the air-interface propagation delay for data transmitted between the upstream and downstream network devices, t_stop3 is the time within the RTT during which the upstream device stopped sending data, and BDP is the bandwidth-delay product.
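Read literally, the update rule above can be coded as follows; treating BDP as a per-unit-time budget applied over the portion of the RTT in which the upstream device was actually sending is an interpretive assumption about the units:

```python
# Sketch of the S607 in-flight update F(t)' = (RTT - t_stop3) * BDP,
# taken at face value from the text. The unit interpretation is an
# assumption, not stated explicitly in the original.

def update_inflight(rtt: float, t_stop3: float, bdp: float) -> float:
    # portion of the RTT during which upstream was actually sending
    sending_time = max(0.0, rtt - t_stop3)
    return sending_time * bdp
```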
This reduces the number of packets entering the cyclic queue, ensures that traffic in the CBD ring has remaining cache space, and ensures that data packets in the CBD ring can keep flowing. Further, in this embodiment of the application, the input port of the network device can achieve deadlock freedom without affecting network throughput, while using only a single cache queue and avoiding packet loss at the device.
Illustratively, FIG. 7 shows a flowchart of yet another method for suppressing the generation of congestion queues provided by an embodiment of this application. The method shown in FIG. 7 is another way of describing the congestion-queue detection method shown in FIG. 5 and the congestion-queue suppression method shown in FIG. 6. The network device involved in the method may be the network device described in FIG. 4(a). Referring to FIG. 7, the method includes S710, S720, and S730. It should be noted that the specific implementation of S710 is the same as S501 and S601, and that there is no required execution order between S720 and S730; they may be executed simultaneously.
S710: Periodically obtain the queue length L(t) of the cache queue at the current network device's input port, the queue's detection flag J, the queue's congestion determination result C, and the number F(t) of in-flight packets the current device has yet to receive.
S720: Judge whether the current cache queue is a congestion queue based on the obtained queue length and the queue's detection flag.
In this embodiment, judging whether the current cache queue is a congestion queue based on the obtained queue length and detection flag can be implemented through the following steps. Specifically, at S7201, the detection flag J of the current cache queue is checked; when J == True, S7202 is executed, otherwise S710 is executed. At S7202, the initial length L_old of the current cache queue is determined; when L_old is not 0, S7205 is executed, otherwise S7203 is executed. At S7203, the queue length of the current cache queue is compared with the minimum queue length Th1 at which the congestion-queue detection algorithm takes effect; when L(t) > Th1, S7204 is executed, otherwise S710 is executed. At S7204, L_old is set to L(t). At S7205, whether the cache queue is persistently congested is judged; when it is, C is set to True, and when the queue's state is not congested, J is set to False. For the implementation of S7201-S7205, refer to the descriptions of S502-S506 above, which are not repeated here.
S730: Determine the stop time during which the upstream network device stops sending packets, based on the obtained queue length, the queue's congestion determination result, and the number of in-flight packets the current device has yet to receive; and update the number of in-flight packets according to that stop time.
In this embodiment, determining the upstream stop time and updating the number of in-flight packets can be implemented through the following steps. Specifically, at S7301, the stop time t_stop is computed from L(t) and F(t). At S7302, whether C is True is checked; when C == True, S7303 is executed, otherwise S7304 is executed. At S7303, t_stop is updated. At S7304, whether t_stop is greater than 0 is checked; when t_stop > 0, S7305 is executed, otherwise S7306 is executed. At S7305, a PFC backpressure frame is sent to the upstream network device. At S7306, F(t) is updated. For the implementation of S7301-S7306, refer to the descriptions of S602-S607 above, which are not repeated here. In this embodiment, the stop time required of the upstream device is first computed as if the cache queue were not a congestion queue; the device then determines whether the cache queue actually is a congestion queue, and when it is, updates the stop time, thereby preventing congestion queues from forming in the network device.
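One scan period of S710-S730 can be combined into a single sketch. The stop-time and penalty formulas are placeholders (the originals appear only as figures), and send_pfc is a hypothetical transmit hook:

```python
# End-to-end sketch of one scan period (S710-S730), combining the
# detection of S7201-S7205 with the stop-time logic of S7301-S7306.
# The drain-time and penalty formulas are assumptions; send_pfc is a
# hypothetical callback that would emit the PFC backpressure frame.

P, T, TH1 = 2.0, 1.0, 5  # penalty factor, scan period (s), threshold

def scan_period(state, L_t, F_t, H, R, send_pfc):
    # S720: congestion detection (same state dict as the earlier sketch)
    if state["J"]:
        if state["L_old"] == 0:
            if L_t > TH1:
                state["L_old"] = L_t
        elif L_t - state["L_old"] > state["L_old"]:
            state["C"] = True
        else:
            state["J"] = False
    # S7301: baseline stop time (placeholder drain-time form)
    t_stop = max(0.0, (L_t + F_t - H) / R)
    # S7302/S7303: penalize congestion queues
    if state["C"]:
        t_stop = min(P * t_stop, T)
    # S7304/S7305: only signal upstream when a pause is actually needed
    if t_stop > 0:
        send_pfc(t_stop)
    return t_stop
```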
In one possible example, the network device may instead first determine whether the cache queue at its input port is a congestion queue, and then determine the upstream stop time based on that result together with the queue length of the cache queue and the number of in-flight packets to be received.
In this embodiment of the application, the queue length of the input-port cache queue and the number of in-flight packets to be received are detected periodically. The network device then decides, from the current queue length and in-flight packet count, how long it needs the upstream device to stop sending at the current sampling moment. Further, whether the cache queue is a congestion queue is determined from its queue length and its accumulation rate, and the network device decides accordingly whether to update the required stop time at the current sampling moment. By periodically inspecting the input-port cache queue, this embodiment reduces the number of packets entering the cyclic queue, ensures that traffic in the CBD ring has remaining cache space, and ensures that data packets in the CBD ring can keep flowing.
Illustratively, FIG. 8 shows a flowchart of yet another method for suppressing the generation of congestion queues provided by an embodiment of this application. The network device involved in the method may be the network device described in FIG. 4(a), and may be an upstream network device. The method includes S801-S802.
S801: The network device receives a PFC backpressure frame sent by a downstream network device, the frame carrying a stop time during which packet sending is to stop.
In this embodiment, after a port of the network device receives the PFC backpressure frame sent by the downstream device, the device parses the frame and obtains the stop time carried in it.
S802: The network device stops sending packets to the downstream network device for the duration of the stop time.
In this embodiment, after the network device obtains the stop time, its internal scheduler stops sending the packets in the queue of the corresponding priority and starts a timer.
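The upstream behavior of S801-S802 can be sketched as follows. The frame representation is simplified (real PFC frames carry per-priority pause quanta as defined in IEEE 802.1Qbb), and the class layout is an assumption:

```python
# Minimal sketch of the upstream side of S801-S802: pause the queue of
# the corresponding priority for the stop time carried in the PFC
# frame, then resume via a timer. The frame layout is a simplification.

import threading

class UpstreamPort:
    def __init__(self):
        self.paused = {p: False for p in range(8)}  # 8 PFC priorities

    def on_pfc_frame(self, priority: int, stop_time_s: float):
        self.paused[priority] = True                 # S802: stop sending
        t = threading.Timer(stop_time_s, self._resume, args=(priority,))
        t.daemon = True
        t.start()                                    # resume after the pause

    def _resume(self, priority: int):
        self.paused[priority] = False
```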
It should be noted that "upstream network device" and "downstream network device" are relative terms in this embodiment. Any network device shown in FIG. 4(a) may be an upstream network device, a downstream network device, or both at once. For any two network devices transmitting data, the device sending data serves as the upstream network device and the device receiving data serves as the downstream network device.
In this embodiment, the upstream network device stops sending packets to the downstream network device for the stop time carried in the PFC backpressure frame sent by the downstream device, effectively preventing congestion queues from forming in the downstream network device.
Illustratively, FIG. 9 shows a schematic structural diagram of a network device provided by an embodiment of this application. Referring to FIG. 9, the network device includes a receiving module 901, a detection module 902, a processing module 903, and a sending module 904.
The receiving module 901 is configured to receive data packets sent by an upstream network device.
The detection module 902 is configured to periodically detect the cache queue length at the current network device's input port, the queue's detection flag, the queue's congestion determination result, and the number of in-flight packets the current device has yet to receive. Further, the detection module 902 is also configured to judge, based on the obtained queue length and detection flag, whether the current cache queue is a congestion queue.
The processing module 903 is configured to determine, based on the obtained queue length, the queue's congestion determination result, and the number of in-flight packets the current device has yet to receive, the stop time during which the upstream network device stops sending packets.
The sending module 904 is configured to send a PFC backpressure frame to the upstream network device, the frame carrying the stop time during which the upstream device stops sending packets.
The network device embodiment described in FIG. 9 is merely illustrative. For example, the division into modules is only a division by logical function; other divisions are possible in actual implementation, for instance multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically alone, or two or more modules may be integrated into one module.
For example, each module in FIG. 9 may be implemented in hardware or in the form of a software functional module. For instance, when implemented in software, the detection module 902 may be a software functional module generated after at least one processor 401 in FIG. 4(b) reads the program code stored in the memory 403. The modules in FIG. 9 may also be implemented separately by different hardware in the network device: for example, the detection module 902 is implemented by part of the processing resources of the at least one processor 401 in FIG. 4(b) (for example, one core of a multi-core processor), while the sending module 904 and the receiving module 901 are implemented by the interface circuit of FIG. 4(b) together with the remaining processing resources of the at least one processor 401 (for example, the other cores of a multi-core processor), or by a programmable device such as an FPGA or a coprocessor. Clearly, these functional modules may also be implemented by a combination of software and hardware: for example, the sending module 904 and the receiving module 901 are implemented by programmable hardware devices, while the detection module 902 is a software functional module generated after a CPU reads program code stored in a memory.
Illustratively, FIG. 10 shows a schematic structural diagram of yet another network device provided by an embodiment of this application. Referring to FIG. 10, the network device includes a receiving module 1001, a processing module 1002, and a sending module 1003.
The receiving module 1001 is configured to receive a PFC backpressure frame sent by a downstream network device, the frame carrying the stop time during which the current network device is to stop sending packets to the downstream device.
The processing module 1002 is configured to stop, according to the received PFC backpressure frame, the sending module 1003 from sending packets to the downstream network device.
The network device embodiment described in FIG. 10 is merely illustrative. For example, the division into modules is only a division by logical function; other divisions are possible in actual implementation, for instance multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. The functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically alone, or two or more modules may be integrated into one module.
For example, each module in FIG. 10 may be implemented in hardware or in the form of a software functional module. For instance, when implemented in software, the processing module 1002 may be a software functional module generated after at least one processor 401 in FIG. 4(b) reads the program code stored in the memory 403. The modules in FIG. 10 may also be implemented separately by different hardware in the network device: for example, the processing module 1002 is implemented by part of the processing resources of the at least one processor 401 in FIG. 4(b) (for example, one core of a multi-core processor), while the receiving module 1001 and the sending module 1003 are implemented by the interface circuit of FIG. 4(b) together with the remaining processing resources of the at least one processor 401 (for example, the other cores of a multi-core processor), or by a programmable device such as an FPGA or a coprocessor. Clearly, these functional modules may also be implemented by a combination of software and hardware: for example, the receiving module 1001 and the sending module 1003 are implemented by programmable hardware devices, while the processing module 1002 is a software functional module generated after a CPU reads program code stored in a memory.
The method steps in the embodiments of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to a processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, implementation may be entirely or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
It can be understood that the various numerical designations involved in the embodiments of this application are merely distinctions made for convenience of description and are not intended to limit the scope of the embodiments of this application.

Claims (16)

  1. A method for suppressing the generation of congestion queues, characterized in that the method comprises:
    determining a queue length of a cache queue at an input port of a first network device and a number of packets to be received by the first network device;
    determining a first time based on the queue length and the number of packets to be received, the first time being a time during which a second network device stops sending packets to the first network device;
    when the cache queue is determined, based on the queue length, to be a congestion queue, updating the first time to a second time according to a degree of congestion of the cache queue;
    sending the second time to the second network device.
  2. The method according to claim 1, characterized in that when the cache queue is determined not to be a congestion queue, the first time is sent to the second network device.
  3. The method according to claim 1 or 2, characterized in that determining whether the cache queue is a congestion queue comprises:
    obtaining a congestion detection determination result C of the cache queue, wherein the congestion detection determination result takes the value True or False; when C is True, the cache queue is a congestion queue, and when C is False, the cache queue is a non-congestion queue; the initial value of C is False;
    determining whether the cache queue is a congestion queue according to the value of the congestion detection determination result C of the cache queue.
  4. The method according to claim 1 or 3, characterized in that the method further comprises:
    obtaining a congestion detection flag J of the cache queue, wherein J takes the value True or False; when J is True, the cache queue requires congestion detection, and when J is False, the cache queue does not require congestion detection; the initial value of J is True;
    when the congestion detection flag J of the cache queue is True, performing congestion detection on the cache queue; when the cache queue is a congestion queue, updating the congestion detection determination result C of the cache queue to True; or, when the cache queue is not a congestion queue, updating the congestion detection flag J of the cache queue to False.
  5. The method according to claim 4, characterized in that performing congestion detection on the cache queue comprises:
    determining whether an initial length of the cache queue is stored in the first network device, the initial length of the cache queue being the queue length recorded the first time the queue length of the cache queue exceeds the minimum queue length at which congestion queue detection can be performed;
    when the initial length of the cache queue is stored in the first network device, determining, according to the queue length of the cache queue and the initial length of the cache queue, whether the queue length of the cache queue is continuously accumulating;
    when the queue length of the cache queue is continuously accumulating, determining that the cache queue is a congestion queue.
  6. The method according to claim 5, characterized in that when the initial length of the cache queue is not stored in the network device, whether the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed is determined;
    when the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed, the queue length of the cache queue is saved as the initial length of the cache queue.
  7. The method according to claim 5, characterized in that determining, according to the queue length of the cache queue and the initial queue length of the cache queue, whether the queue length of the cache queue is continuously accumulating comprises:
    determining, according to the queue length of the cache queue and the initial queue length of the cache queue, whether an enqueue rate of the cache queue is greater than a dequeue rate of the cache queue;
    when the enqueue rate of the cache queue is greater than the dequeue rate of the cache queue, determining that the queue length of the cache queue is continuously accumulating.
  8. A network device, characterized by comprising:
    a detection module, configured to determine a queue length of a cache queue at an input port of a first network device and a number of packets to be received by the first network device;
    a processing module, configured to determine a first time based on the queue length and the number of packets to be received, the first time being a time during which a second network device stops sending packets to the first network device;
    the detection module being further configured to determine, based on the queue length, whether the cache queue is a congestion queue;
    the processing module being further configured to, when the cache queue is a congestion queue, update the first time to a second time according to a degree of congestion of the cache queue;
    a sending module, configured to send the second time to the second network device.
  9. The network device according to claim 8, characterized in that the sending module is further configured to:
    when the cache queue is determined not to be a congestion queue, send the first time to the second network device.
  10. The network device according to claim 8 or 9, characterized in that the detection module is configured to:
    obtain a congestion detection determination result C of the cache queue, wherein the congestion detection determination result takes the value True or False; when C is True, the cache queue is a congestion queue, and when C is False, the cache queue is a non-congestion queue; the initial value of C is False;
    determine whether the cache queue is a congestion queue according to the value of the congestion detection determination result C of the cache queue.
  11. The network device according to claim 8 or 10, characterized in that the detection module is further configured to:
    obtain a congestion detection flag J of the cache queue, wherein J takes the value True or False; when J is True, the cache queue requires congestion detection, and when J is False, the cache queue does not require congestion detection; the initial value of J is True;
    when the congestion detection flag J of the cache queue is True, perform congestion detection on the cache queue;
    the processing module being further configured to:
    when the cache queue is a congestion queue, update the congestion detection determination result C of the cache queue to True; or, when the cache queue is not a congestion queue, update the congestion detection flag J of the cache queue to False.
  12. The network device according to claim 11, characterized in that the detection module is further configured to:
    determine whether an initial length of the cache queue is stored in the first network device, the initial length of the cache queue being the queue length recorded the first time the queue length of the cache queue exceeds the minimum queue length at which congestion queue detection can be performed;
    when the initial length of the cache queue is stored in the first network device, determine, according to the queue length of the cache queue and the initial length of the cache queue, whether the queue length of the cache queue is continuously accumulating;
    when the queue length of the cache queue is continuously accumulating, determine that the cache queue is a congestion queue.
  13. The network device according to claim 12, characterized in that the detection module is configured to:
    when the initial length of the cache queue is not stored in the network device, determine whether the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed;
    when the queue length of the cache queue is greater than the minimum queue length at which congestion queue detection can be performed, save the queue length of the cache queue as the initial length of the cache queue.
  14. The network device according to claim 12, characterized in that the detection module is configured to:
    determine, according to the queue length of the cache queue and the initial queue length of the cache queue, whether an enqueue rate of the cache queue is greater than a dequeue rate of the cache queue;
    when the enqueue rate of the cache queue is greater than the dequeue rate of the cache queue, determine that the queue length of the cache queue is continuously accumulating.
  15. A computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1-7.
  16. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1-7.
PCT/CN2023/086561 2022-05-23 2023-04-06 Method and device for suppressing generation of congestion queues WO2023226603A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210565502.3 2022-05-23
CN202210565502.3A CN117155863A (zh) 2022-05-23 2022-05-23 Method and device for suppressing generation of congestion queues

Publications (1)

Publication Number Publication Date
WO2023226603A1 true WO2023226603A1 (zh) 2023-11-30

Family

ID=88908665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/086561 WO2023226603A1 (zh) 2023-04-06 2022-05-23 Method and device for suppressing generation of congestion queues

Country Status (2)

Country Link
CN (1) CN117155863A (zh)
WO (1) WO2023226603A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101917330A (zh) * 2008-09-11 2010-12-15 Juniper Networks, Inc. Methods and apparatus for defining a flow control signal
CN108390828A (zh) * 2018-01-17 2018-08-10 New H3C Technologies Co., Ltd. Packet forwarding method and apparatus
US20200280518A1 (en) * 2020-01-28 2020-09-03 Intel Corporation Congestion management techniques
CN114095448A (zh) * 2020-08-05 2022-02-25 Huawei Technologies Co., Ltd. Congestion flow processing method and device


Also Published As

Publication number Publication date
CN117155863A (zh) 2023-12-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 23810678; country of ref document: EP; kind code of ref document: A1)