CN116170377A - Data processing method and related equipment - Google Patents

Data processing method and related equipment

Info

Publication number
CN116170377A
Authority
CN
China
Prior art keywords
port, PFC, threshold, ingress, ingress port
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111415441.4A
Other languages
Chinese (zh)
Inventor
李彤 (Li Tong)
徐恪 (Xu Ke)
杜鑫乐 (Du Xinle)
黄翰林 (Huang Hanlin)
戴惠辰 (Dai Huichen)
郑凯 (Zheng Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Huawei Technologies Co Ltd
Original Assignee
Tsinghua University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Huawei Technologies Co Ltd
Priority claimed from CN202111415441.4A
Publication of CN116170377A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/12 Avoiding congestion; Recovering from congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/544 Buffers; Shared memory; Pipes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/29 Flow control; Congestion control using a combination of thresholds

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a data processing method comprising: when port state information indicates that the increase in packet traffic received by a first ingress port is greater than a threshold and the increase in packet traffic received by a second ingress port is less than the threshold, configuring a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold. In other words, when the second ingress port is identified as not being a main cause of congestion at the target egress port, its PFC threshold is configured to be greater than that of the first ingress port. As a result, the second ingress port, which is not a main cause of congestion at the target egress port, is unlikely to trigger indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port destined for egress ports other than the target egress port is not affected, improving the transmission efficiency of the system.

Description

Data processing method and related equipment
Technical Field
The present disclosure relates to the field of computers, and in particular, to a data processing method and related devices.
Background
In recent years, lossless data center network technology has become an important focus in the industry. Remote direct memory access (RDMA) is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system, and has characteristics such as zero-copy, kernel bypass, and no CPU involvement. It not only saves a great deal of CPU resources but also improves system throughput and reduces network communication latency. In general, end-to-end congestion control cannot react effectively to bursty traffic. To guarantee a lossless environment for RDMA, many congestion control algorithms in current data centers rely on the PFC mechanism. Priority-based flow control (PFC) is an Ethernet protocol based on the IEEE 802.1Qbb standard at the L2 layer; it supports selecting priorities for different types of traffic in the network and is a mechanism to prevent packet loss under congestion. The PFC mechanism acts primarily on the ingress port of a switch (or other network device, such as a router): control is performed according to the length of the ingress port's queue (also called the ingress queue length), and an indication (e.g., a pause frame) is sent to inform the upstream port to stop sending data.
Taking a switch as an example of a network device, there are four strategies for allocating the cache shared among its ports: complete sharing, complete partitioning, static thresholds, and dynamic thresholds (DT). The currently popular approach is mainly the dynamic threshold strategy.
The DT strategy improves on the adaptivity of the traditional strategies, but it does not address traffic burstiness. The problem with the DT policy is that the PFC threshold of every port on the same network device is configured identically (i.e., the PFC thresholds of all ingress ports of the switch change uniformly). When a traffic burst occurs, the remaining buffer decreases and the PFC thresholds of all ports drop synchronously, so any port is likely to trigger a pause frame, regardless of whether it carries burst traffic. A pause frame blocks packet transmission from the upstream port, and the upstream port may carry other traffic. That traffic may have no direct relationship with the ports that generated the burst, yet its transmission performance is harmed nonetheless, resulting in low transmission efficiency.
Disclosure of Invention
An embodiment of the application provides a data processing method: when a second ingress port is identified as not being a main cause of congestion at a target egress port, the PFC threshold corresponding to the second ingress port is configured to be greater than the PFC threshold corresponding to the first ingress port. For example, the PFC threshold corresponding to the second ingress port may be raised, kept unchanged, or lowered by a smaller amount than the PFC threshold corresponding to the first ingress port. As a result, the second ingress port, which is not a main cause of congestion at the target egress port, is unlikely to trigger indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port destined for egress ports other than the target egress port is not affected, improving the transmission efficiency of the system.
In a first aspect, an embodiment of the present application provides a data processing method applied to a network device that includes a first ingress port, a second ingress port, and a target egress port. The method comprises: obtaining port state information of the first ingress port and the second ingress port, respectively, where the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port; and, when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configuring a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
It should be understood that the threshold against which the increase in packet traffic received by the second ingress port is compared need not be the same value as the threshold against which the increase in packet traffic received by the first ingress port is compared. For example, the port state information may indicate that the increase in packet traffic received by the first ingress port is greater than a first threshold while the increase in packet traffic received by the second ingress port is less than a second threshold, where the second threshold may be less than or equal to the first threshold.
In one possible implementation, if the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, the first ingress port can be considered to be receiving burst traffic destined for the target egress port while the second ingress port is not. In that case the first ingress port can be regarded as a main cause of congestion at the target egress port, and the second ingress port as not a main cause. In existing implementations, the PFC thresholds of the first and second ingress ports are reduced by the same amount; even though the second ingress port is not a main cause of congestion at the target egress port, its PFC threshold drops as much as that of the first ingress port, which hinders the second ingress port from receiving packets destined for other egress ports and greatly reduces transmission efficiency.
It should be appreciated that the threshold in the embodiments of the present application may be a preset value, for example, 10 percent of the original traffic.
In this embodiment of the present application, when the second ingress port is identified as not being a main cause of congestion at the target egress port, the PFC threshold corresponding to the second ingress port is configured to be greater than the PFC threshold corresponding to the first ingress port. For example, the PFC threshold corresponding to the second ingress port may be raised, kept unchanged, or lowered by a smaller amount than the PFC threshold corresponding to the first ingress port. As a result, the second ingress port, which is not a main cause of congestion at the target egress port, is unlikely to trigger indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port destined for egress ports other than the target egress port is not affected, improving the transmission efficiency of the system.
In one possible implementation, the first ingress port and the second ingress port share cache resources of the network device.
In one possible implementation, the first ingress port corresponds to a first ingress queue, and the switch is configured to send first indication information to an upstream port of the first ingress port when the number of packets in the first ingress queue is greater than the first PFC threshold, where the first indication information indicates that sending of packets to the first ingress port should stop.
In one possible implementation, the second ingress port corresponds to a second ingress queue, and the switch is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, where the second indication information indicates that sending of packets to the second ingress port should stop.
For example, the first indication information and the second indication information may be PFC frames, which instruct a port of the upstream network device to temporarily stop sending packets to a port of the local network device. For example, the PFC frame may be a PFC pause frame used to notify the upstream device to temporarily stop sending packets to the port of the network device. It should be noted that only the function of the PFC frame is described here, with the PFC pause frame taken as an example.
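As an informal illustration of the pause-frame trigger described above, the sketch below models a single ingress queue with its own PFC threshold; the class and method names (`IngressPort`, `on_packet_enqueued`) are hypothetical, not part of the patent, and the returned string merely stands in for sending a pause frame upstream.

```python
class IngressPort:
    """Hypothetical model of one ingress port with a per-port PFC threshold."""

    def __init__(self, name, pfc_threshold):
        self.name = name
        self.pfc_threshold = pfc_threshold  # XOFF-style threshold for this port
        self.queue_len = 0                  # packets currently buffered
        self.paused_upstream = False

    def on_packet_enqueued(self):
        self.queue_len += 1
        # When the ingress queue exceeds this port's PFC threshold,
        # tell the upstream port to stop sending (a PFC pause frame).
        if self.queue_len > self.pfc_threshold and not self.paused_upstream:
            self.paused_upstream = True
            return "PAUSE"  # stands in for a pause frame sent upstream
        return None

p1 = IngressPort("ingress-1", pfc_threshold=2)
assert p1.on_packet_enqueued() is None      # queue_len = 1
assert p1.on_packet_enqueued() is None      # queue_len = 2, not > threshold
assert p1.on_packet_enqueued() == "PAUSE"   # queue_len = 3 > 2
```

Because each port holds its own `pfc_threshold`, configuring the second ingress port with a larger value than the first makes it correspondingly harder for that port to reach the pause condition.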
In one possible implementation, the port state information includes a first packet dequeue rate of the first ingress queue and a second packet dequeue rate of the second ingress queue. The packet dequeue rate indicates the number of packets leaving an ingress port in one cycle. When the dequeue rate is greater than the dequeue rate threshold, the ingress port can be considered to be receiving relatively few packets destined for the target egress port (because the target egress port is already congested, the dequeue rate remains high only if the port is also receiving many packets that do not need to be forwarded to the target egress port). When the dequeue rate is less than the dequeue rate threshold, the ingress port can be considered to be receiving relatively many packets destined for the target egress port (because the target egress port is already congested, the dequeue rate is low when the port is receiving many packets that need to be forwarded to the target egress port).
Illustratively, the dequeue rate threshold may relate to the performance of the network device itself; it may be the dequeue rate of an ingress port's queue when the port operates normally (i.e., when no congestion occurs), for example, the historical average dequeue rate of the ingress port's queue in the absence of congestion.
That is, when the port state information indicates that the first packet dequeue rate is less than the dequeue rate threshold, the port state information can be considered to indicate that the increase in packet traffic received by the first ingress port is greater than the threshold. Similarly, when the port state information indicates that the second packet dequeue rate is greater than the dequeue rate threshold, the increase in packet traffic received by the second ingress port can be considered to be less than the threshold.
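The dequeue-rate test above can be summarized in a small sketch; the function name and the sample rates are illustrative assumptions, not values from the patent.

```python
def classify_ingress(dequeue_rate, dequeue_rate_threshold):
    """Return 'burst' if the port is treated as a main cause of congestion."""
    if dequeue_rate < dequeue_rate_threshold:
        # Low dequeue rate: the port is mostly feeding the blocked egress
        # port, so its traffic increase is treated as above the threshold.
        return "burst"
    # High dequeue rate: the port is mostly forwarding to other egress
    # ports, so its traffic increase is treated as below the threshold.
    return "non-burst"

assert classify_ingress(dequeue_rate=3.0, dequeue_rate_threshold=5.0) == "burst"
assert classify_ingress(dequeue_rate=8.0, dequeue_rate_threshold=5.0) == "non-burst"
```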
In one possible implementation, when the first ingress port is determined to be a main cause of congestion at the target egress port, the PFC threshold of the first ingress port may be changed from a third PFC threshold to the first PFC threshold, where the first PFC threshold is less than the third PFC threshold.
In one possible implementation, the first PFC threshold may be configured by the following formula:
T(t) = α · (B − Σ_i Q_i(t));
where B represents the switch cache size, Q_i(t) represents the queue length of the i-th port, and α is an adjustment factor, e.g., α = 2. T(t) represents the maximum cache size a port may occupy at time t.
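The formula above can be illustrated numerically; the buffer size and queue lengths below are made-up example values, assuming α = 2 as in the text.

```python
def dt_threshold(alpha, buffer_size, queue_lengths):
    """DT-style threshold T(t) = alpha * (B - sum of queue lengths)."""
    remaining = buffer_size - sum(queue_lengths)
    return alpha * remaining

# With α = 2, a 12-unit shared buffer, and three queues holding 2, 3 and 1 units:
assert dt_threshold(2, 12, [2, 3, 1]) == 12  # 2 · (12 − 6)
# As queues grow, the remaining buffer shrinks, and with a single shared
# formula every port's threshold drops in lockstep:
assert dt_threshold(2, 12, [4, 3, 2]) == 6   # 2 · (12 − 9)
```

This lockstep decrease is exactly the DT-policy behavior the embodiments aim to avoid for ingress ports that are not a main cause of congestion.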
In one possible implementation, when the second ingress port is determined not to be a main cause of congestion at the target egress port, the PFC threshold of the second ingress port may be changed from a fourth PFC threshold to the second PFC threshold, where the second PFC threshold is greater than or equal to the fourth PFC threshold; or the second PFC threshold is less than the fourth PFC threshold, but the decrease from the fourth PFC threshold to the second PFC threshold is smaller than the decrease from the third PFC threshold to the first PFC threshold.
In one possible implementation, the second PFC threshold may be configured by selecting a value between α/(α+1)·B and B.
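As a rough illustration of this selection rule, the sketch below picks the midpoint of the interval (α/(α+1)·B, B]; the midpoint choice and the numeric values are assumptions for illustration, since the text only requires some value in that range.

```python
def second_pfc_threshold(alpha, buffer_size):
    """Pick a value in (alpha/(alpha+1) * B, B]; the midpoint is one choice."""
    lower = alpha / (alpha + 1) * buffer_size
    upper = buffer_size
    return (lower + upper) / 2  # any value in the interval would satisfy the rule

t = second_pfc_threshold(alpha=2, buffer_size=12)
assert 2 / 3 * 12 < t <= 12  # lies in (α/(α+1)·B, B] = (8, 12]
assert t == 10.0
```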
In one possible implementation, the third PFC threshold and the fourth PFC threshold are equal. That is, when the PFC threshold is not adjusted by the threshold adjustment method of the embodiments of the present application, the PFC thresholds of the respective ports are the same.
In a second aspect, the present application provides a data processing apparatus, the apparatus being applied to a network device, the network device including a first ingress port, a second ingress port, and a target egress port; the device comprises:
an acquisition module, configured to obtain port state information of the first ingress port and the second ingress port, respectively, where the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port; and
a threshold configuration module, configured to: when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configure a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
In this embodiment of the present application, when the second ingress port is identified as not being a main cause of congestion at the target egress port, the PFC threshold corresponding to the second ingress port is configured to be greater than the PFC threshold corresponding to the first ingress port. For example, the PFC threshold corresponding to the second ingress port may be raised, kept unchanged, or lowered by a smaller amount than the PFC threshold corresponding to the first ingress port. As a result, the second ingress port, which is not a main cause of congestion at the target egress port, is unlikely to trigger indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port destined for egress ports other than the target egress port is not affected, improving the transmission efficiency of the system.
In one possible implementation, the first ingress port and the second ingress port share cache resources of the network device.
In one possible implementation, the first ingress port corresponds to a first ingress queue and the second ingress port corresponds to a second ingress queue;
the network device is configured to send first indication information to an upstream port of the first ingress port when the number of packets in the first ingress queue is greater than the first PFC threshold, where the first indication information indicates that sending of packets to the first ingress port should stop;
the network device is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, where the second indication information indicates that sending of packets to the second ingress port should stop.
In one possible implementation, the port state information includes a first message dequeuing rate of the first ingress queue and a second message dequeuing rate of the second ingress queue;
the port state information indicating that the increase in packet traffic received by the first ingress port is greater than a threshold includes:
the port state information indicating that the first packet dequeue rate is less than a dequeue rate threshold; and
the increase in packet traffic received by the second ingress port being less than the threshold includes:
the port state information indicating that the second packet dequeue rate is greater than the dequeue rate threshold.
In one possible implementation, the target egress port corresponds to a target egress queue, which is in a congested state.
In one possible implementation, the threshold configuration module is specifically configured to:
the PFC threshold of the first ingress port is configured from a third PFC threshold to the first PFC threshold, the first PFC threshold being less than the third PFC threshold.
In one possible implementation, the threshold configuration module is specifically configured to:
configuring a PFC threshold of the second ingress port from a fourth PFC threshold to the second PFC threshold; wherein the second PFC threshold is greater than or equal to the fourth PFC threshold, or the second PFC threshold is less than the fourth PFC threshold, the magnitude of decrease in the second PFC threshold compared to the fourth PFC threshold being less than the magnitude of decrease in the first PFC threshold compared to the third PFC threshold.
In one possible implementation, the third PFC threshold and the fourth PFC threshold are equal.
In a third aspect, the present application provides a network device comprising a processor, a memory and a bus, wherein:
the processor and the memory are connected through the bus;
the memory is used for storing computer programs or instructions;
the processor is configured to call or execute a program or an instruction stored in the memory to implement the steps described in the first aspect and any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer storage medium comprising computer instructions which, when run on a computer, cause the computer to perform the steps of the first aspect and any possible implementation thereof.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the steps of the first aspect and any possible implementation thereof.
In a sixth aspect, the present application provides a chip system comprising a processor configured to support a computer in implementing the functions involved in the above aspects, for example, transmitting or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for holding the program instructions and data necessary for the execution device or the training device. The chip system may consist of chips, or may include chips and other discrete devices.
An embodiment of the application provides a data processing method applied to a network device that includes a first ingress port, a second ingress port, and a target egress port. The method comprises: obtaining port state information of the first ingress port and the second ingress port, respectively, where the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port; and, when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configuring a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
In one possible implementation, if the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, the first ingress port can be considered to be receiving burst traffic destined for the target egress port while the second ingress port is not. In that case the first ingress port can be regarded as a main cause of congestion at the target egress port, and the second ingress port as not a main cause. In existing implementations, the PFC thresholds of the first and second ingress ports are reduced by the same amount; even though the second ingress port is not a main cause of congestion at the target egress port, its PFC threshold drops as much as that of the first ingress port, which hinders the second ingress port from receiving packets destined for other egress ports and greatly reduces transmission efficiency.
In this embodiment of the present application, when the second ingress port is identified as not being a main cause of congestion at the target egress port, the PFC threshold corresponding to the second ingress port is configured to be greater than the PFC threshold corresponding to the first ingress port. For example, the PFC threshold corresponding to the second ingress port may be raised, kept unchanged, or lowered by a smaller amount than the PFC threshold corresponding to the first ingress port. As a result, the second ingress port, which is not a main cause of congestion at the target egress port, is unlikely to trigger indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port destined for egress ports other than the target egress port is not affected, improving the transmission efficiency of the system.
Drawings
Fig. 1 is a schematic diagram of an application architecture provided in an embodiment of the present application;
fig. 2 is a schematic diagram of an application architecture provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an application architecture provided in an embodiment of the present application;
fig. 4 is a schematic diagram of an application architecture provided in an embodiment of the present application;
fig. 5 is a schematic diagram of an application architecture provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a data processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a circuit according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a state transition provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of a data processing method according to an embodiment of the present application;
fig. 10 is a schematic diagram of an experimental topology provided in an embodiment of the present application;
fig. 11 is a schematic diagram of an experimental topology provided in an embodiment of the present application;
FIG. 12 is a schematic view of an embodiment of a data processing apparatus according to an embodiment of the present application;
fig. 13 is an embodiment diagram of a network device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings. The terminology used in the description of the embodiments is for the purpose of describing particular embodiments only and is not intended to limit the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can appreciate, with the development of technology and the appearance of new scenes, the technical solutions provided in the embodiments of the present application are applicable to similar technical problems.
The terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. It is to be understood that terms so used are interchangeable under appropriate circumstances and merely distinguish objects of the same nature when describing the embodiments. Furthermore, the terms "comprise", "include", and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
In recent years, lossless data center network technology has become an important focus in the industry. Remote direct memory access (RDMA) is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system, and has characteristics such as zero-copy, kernel bypass, and no CPU involvement. It not only saves a great deal of CPU resources but also improves system throughput and reduces network communication latency. RDMA technology is now widely used in various fields, with the scope of application varying with requirements: applications requiring low latency, such as high performance computing (HPC) and financial services; applications requiring high bandwidth, such as HPC, medical devices, storage and backup systems, and cloud computing; and applications requiring low CPU occupation, such as HPC and cloud computing.
Traffic bursts, on the other hand, are a typical traffic pattern in modern data center networks. Traffic bursts are typically generated by various online data-intensive applications and virtualization services, such as distributed computing. These applications and services may generate multiple concurrent data messages that enter ports of the same routing switch at the same time, producing short-lived congestion. Traffic bursts may cause port queues to grow excessively, producing packet drops or even timeout retransmissions, which is unacceptable for applications in data centers, particularly delay-sensitive applications. Although RDMA offers high bandwidth and low delay, realizing this performance advantage requires that packets not be lost during data transmission; otherwise large-scale retransmission results, causing serious performance loss and load overhead.
In general, end-to-end congestion control cannot cope effectively with such bursty traffic. To guarantee a lossless environment for RDMA, many congestion control algorithms in current data centers employ the PFC mechanism. Priority-based flow control (PFC) is an L2 Ethernet protocol based on the IEEE 802.1Qbb standard; it supports selecting priorities for different types of traffic in the network and is a mechanism for preventing packet loss under congestion. The PFC mechanism acts primarily on the ingress port of a switch (or other network device, such as a router), and performs control according to the length of the ingress port's queue (also called the ingress queue length) by sending an indication (e.g., a pause frame) telling the upstream port to stop sending data. Specifically, a pause frame may be sent when the ingress queue length exceeds a preset PFC threshold (for example XOFF), and the pause may be released when the ingress queue length falls below XON. Here XOFF and XON represent thresholds on the ingress port queue length.
Illustratively, the working principle of PFC may be as shown in fig. 1, and the workflow includes the following four basic processes:
(1) The upstream port sends a message to the downstream port;
(2) Messages accumulate at the downstream port and the queue length keeps increasing; when the queue length exceeds XOFF, the downstream port sends a pause frame to the upstream port;
(3) After receiving the pause frame, the upstream port stops sending messages;
(4) When the downstream port's queue length falls below XON, the downstream port stops sending pause frames and instead sends a resume frame, notifying the upstream port to resume message sending.
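The four basic processes above amount to a hysteresis control on the ingress queue length, with pause and resume decided by the XOFF and XON thresholds. A minimal sketch of that control loop (the class name and the cell-count queue model are illustrative, not from the 802.1Qbb standard):

```python
class PfcIngressQueue:
    """Toy model of a PFC-controlled ingress queue with XOFF/XON hysteresis."""

    def __init__(self, xoff, xon):
        assert xon < xoff
        self.xoff = xoff        # pause threshold on the queue length
        self.xon = xon          # resume threshold
        self.length = 0
        self.paused = False     # True while a pause is asserted upstream

    def enqueue(self, n=1):
        self.length += n
        if not self.paused and self.length > self.xoff:
            self.paused = True  # step (2): send a pause frame upstream
        return self.paused

    def dequeue(self, n=1):
        self.length = max(0, self.length - n)
        if self.paused and self.length < self.xon:
            self.paused = False  # step (4): send a resume frame upstream
        return self.paused
```

Note the hysteresis gap between XOFF and XON: the upstream stays paused while the queue drains from above XOFF down to below XON, which avoids rapid pause/resume oscillation.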
The PFC trigger threshold (i.e., XOFF) determines when the downstream port sends the pause frame, and whether a pause frame is sent clearly has a large impact on network transmission. Therefore, the most critical technical challenge of the PFC mechanism is determining the PFC trigger threshold (referred to simply as the PFC threshold in the embodiments of the present application). In most current commodity switches, multiple ports share a common cache; such switches are called shared-memory switches. For a shared-memory switch, the physical meaning of the PFC trigger threshold is the maximum buffer size a port may occupy. Thus, the PFC trigger threshold setting problem is equivalent to the problem of allocating the cache shared by multiple ports.
For example, as shown in fig. 2, the sending interface of network device A (abbreviated as device A in fig. 2) is divided into 8 queues, and the receiving interface of network device B (abbreviated as device B in fig. 2) has 8 receive buffer queues (also called ingress queues of the ingress ports) with allocated buffers, forming 8 virtualized channels in the network. The buffer sizes give each queue a different data buffering capacity, and the 8 queues of device A correspond one-to-one with the 8 receive buffer queues of device B.
When congestion occurs in a receive buffer queue on device B's interface, device B sends indication information (for example, PFC back-pressure information, also called a pause frame) toward the direction from which data arrives (upstream device A). Device A stops sending messages of the corresponding queue according to the PFC back-pressure information and stores them in its local port buffer. If the local port buffer consumption exceeds a threshold, back pressure continues upstream, propagating stage by stage until it reaches the network end device, thereby eliminating packet loss caused by congestion at a network node.
Next, an application architecture of the embodiment of the present application is described.
Referring to fig. 3, fig. 3 is a schematic diagram of an application architecture provided by an embodiment of the present application. The topology shown in fig. 3 is a three-layer structure (this does not limit the application; other applicable topologies may include more or fewer layers and/or numbers of network devices), and optionally, each branch in the topology shown in fig. 3 may have the same bandwidth. For convenience of description, the embodiments of the present application use a switch as the example network device. Switches play an important role as bridges between the service clusters of a data center and application clients. The invention is based on a data center system architecture and improves commercially available network devices that support priority-based flow control (PFC).
Specifically, fig. 4 illustrates a simplified model of a typical PFC-enabled shared-cache switch. The model can be divided into three parts: the forwarding core, the memory management unit (MMU), and the shared memory pool. A conventional switch uses an output-queue shared-cache mode that discards an incoming packet when the output queue length exceeds a certain threshold. A PFC-enabled switch uses an ingress-port-queue shared-cache mode: when a packet enters an ingress port, the MMU checks whether the current ingress port's queue length exceeds the threshold that triggers PFC, and then updates the ingress port's queue length and the associated egress port's queue length; the latter is used to set the explicit congestion notification (ECN) threshold.
To better understand the principle of a PFC-enabled switch, fig. 5 illustrates a 5-to-1 traffic scenario within the MMU from two perspectives. Five flows (two messages per flow) enter the switch through 5 ports and exit through one port. Physically, the ten messages are stored in the shared memory pool, but from the MMU's perspective the different views serve different functions.
(a) The portal view is used to control PFC. In the ingress view, packets for each flow are queued at the corresponding ingress port. When the queue length of the input port reaches the PFC threshold, PFC pause is triggered.
(b) The exit view is used to control the ECN. In the egress view, packets for each flow are queued at the output port. When the output queue length reaches a threshold, each packet will be marked with an ECN.
Taking a switch as the example network device, multiple ports in the switch share the cache under one of four allocation policies: complete sharing, complete bisection, static threshold, and dynamic threshold. Under complete sharing, every port shares the entire memory; this is efficient but unfair, since new flows may be unable to obtain memory, easily starving some flows. Under complete bisection, the cache is divided evenly among the ports; this guarantees fairness but is inefficient, since high-throughput ports cannot be allocated more cache. The static-threshold policy is highly sensitive to its parameters and adapts poorly, limiting its commercial use. Consequently, the currently popular approach is mainly the dynamic threshold (DT) policy.
In some scenarios, there may be bursty traffic situations where DT may cause a "victim flow" problem. The "victim flow" problem means that when traffic bursts occur, some traffic may not have a direct relationship with the congested ports, but the transmission performance of these traffic may be innocently compromised. The embodiment of the application can solve the problem of 'victim' flow caused by the prior DT strategy and reduce the times of PFC triggering.
In particular, taking a switch as the example network device, in one implementation of DT the PFC thresholds (e.g., XOFF as described in the above embodiments) configured for the multiple ingress ports of the network device are not fixed but are proportional to the available cache within the network device. The core idea can be described by the following formula:
T(t) = α · (B − Σ_i Q_i(t))
wherein B represents the cache size of the switch, Q_i(t) represents the queue length of the i-th port, and α is an adjustment factor, e.g., α = 2. T(t) represents the maximum cache size the port may occupy at time t.
The PFC threshold is positively correlated with T(t). In some PFC implementations, XOFF = T(t) and XON = T(t) − 3·MTU, where MTU is the maximum transmission unit. Clearly, the larger α is, the harder PFC is to trigger.
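The DT computation above can be sketched directly from the formula. In this sketch the 1500-byte MTU and the byte-granularity queue model are assumptions for illustration; only the relations T(t) = α·(B − Σ_i Q_i(t)), XOFF = T(t), and XON = T(t) − 3·MTU come from the description:

```python
def dt_threshold(buffer_size, queue_lengths, alpha=2):
    """Dynamic threshold T(t) = alpha * (B - sum_i Q_i(t))."""
    remaining = buffer_size - sum(queue_lengths)
    return alpha * remaining

def pfc_thresholds(buffer_size, queue_lengths, alpha=2, mtu=1500):
    """Derive XOFF and XON from the DT value, with a 3-MTU hysteresis gap."""
    t = dt_threshold(buffer_size, queue_lengths, alpha)
    return t, t - 3 * mtu  # (XOFF, XON)

# Example: a 12 MB shared buffer with three ports each holding 1 MB.
xoff, xon = pfc_thresholds(12_000_000, [1_000_000] * 3)
```

Because T(t) shrinks as the occupied cache Σ_i Q_i(t) grows, every port's XOFF falls together as the shared buffer fills, which is exactly the coupling the following paragraphs criticize.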
The DT policy improves the adaptivity of traditional policies, but it does not address traffic burstiness. When a traffic burst occurs, the burst cannot be completely buffered, and once the queue length exceeds the PFC threshold, a pause frame is triggered. The problem with the DT policy is that every port on the same network device is configured with the same PFC threshold (i.e., the PFC thresholds of the switch's ingress ports vary uniformly). When a traffic burst occurs, the remaining cache shrinks and the PFC thresholds of all ports decrease synchronously, so any port is highly likely to trigger a pause frame, whether or not it carries burst traffic. A pause frame blocks message sending from the upstream port, which may carry other traffic. These flows may have no direct relationship with the ports experiencing the traffic burst, yet their transmission performance is innocently compromised, lowering transmission efficiency. Moreover, after more ports trigger pause frames, the pause frames propagate to their respective upstream ports, further reducing the transmission efficiency of the whole network.
The embodiment of the application can solve the problems, can solve the problem of 'victim' flow aiming at the burst flow of the inlet port, ensures the fairness of each port and improves the transmission efficiency.
Referring to fig. 6, fig. 6 is a flowchart of a data processing method provided in an embodiment of the present application, where the method is applied to a network device, and the network device includes a first ingress port, a second ingress port, and a target egress port; as shown in fig. 6, a data processing method provided in an embodiment of the present application includes:
601. Acquire port state information of the first ingress port and of the second ingress port, respectively; the port state information is related to the message traffic received by the corresponding ingress port for the target egress port.
In one possible implementation, the network device may be a switch, router, or the like. The embodiment of the application uses a network device as a switch for illustration:
in one possible implementation, a switch may include a plurality of ingress ports and a plurality of egress ports, e.g., a switch may include 4 ingress ports I1, I2, I3, and I4, respectively, and 4 egress ports O1, O2, O3, and O4, respectively. Since the switch can communicate bi-directionally, I1, I2, I3 and I4 can also be output ports, and O1, O2, O3 and O4 can also be input ports.
The switch may include a buffer (BUFFER), which may be used to buffer messages. The buffer may include a plurality of message buffers, where each output port may correspond to one message buffer. For example, the message buffers corresponding to output ports O1 to O4 are B1 to B4, respectively, and each message buffer may include one or more queues.
In one possible implementation, each ingress port in the switch may share a cache, one ingress queue may be configured for each ingress port, where a first ingress port may be configured with a first ingress queue for holding messages sent from an upstream network device to the first ingress port and a second ingress port may be configured with a second ingress queue for holding messages sent from the upstream network device to the second ingress port.
In one possible implementation, the first ingress port may receive a message from an upstream network device, the message being burst traffic for the destination egress port. The burst traffic for the destination output port is understood to mean that, in a certain period of time, the packet traffic of the packet received by the first input port and that needs to be transmitted to the destination output port suddenly increases.
For example, at a first time, the first ingress port does not receive a message that needs to be transferred to the target egress port, and at a second time (a time after the first time), the first ingress port receives a message that needs to be transferred to the target egress port. It can be considered that, compared with the first time, the message flow of the message which is received by the first ingress port and needs to be transmitted to the target egress port at the second time suddenly increases.
For example, at a first moment, the size of the message flow of the message that the first ingress port receives the message that needs to be transferred to the target egress port is A1, and at a second moment (a moment after the first moment), the size of the message flow of the message that the first ingress port receives the message that needs to be transferred to the target egress port is A2, where A2 is greater than A1. It can be considered that, compared with the first time, the message flow of the message which is received by the first ingress port and needs to be transmitted to the target egress port at the second time suddenly increases.
In one possible implementation, the second ingress port may receive a message from the upstream network device that needs to be delivered to the target egress port, but that is not bursty traffic for the target egress port. Or the second ingress port does not receive the message that needs to be delivered to the target egress port.
For example, at a first time, the second ingress port does not receive a message to be transferred to the destination egress port, and at a second time (a time after the first time), the second ingress port also does not receive a message to be transferred to the destination egress port. It can be considered that, compared with the first time, the message flow of the message which is received by the second ingress port at the second time and needs to be transmitted to the target egress port is unchanged.
For example, at a first time, the size of the message flow of the message received by the second ingress port and required to be transmitted to the target egress port is A1, and at a second time (a time after the first time), the size of the message flow of the message received by the second ingress port and required to be transmitted to the target egress port is A2, where A2 is equal to A1. It can be considered that, compared with the first time, the message flow of the message which is received by the second ingress port at the second time and needs to be transmitted to the target egress port is unchanged.
For example, at a first time, the size of the message flow of the message received by the second ingress port and required to be transmitted to the target egress port is A1, and at a second time (a time after the first time), the size of the message flow of the message received by the second ingress port and required to be transmitted to the target egress port is A2, where A2 is smaller than A1. It can be considered that, compared with the first time, the message flow of the message which is received by the second ingress port at the second time and needs to be transmitted to the target egress port becomes smaller.
In one possible implementation, the target egress port may be in a congested state due to bursty traffic received by the first ingress port for the target egress port. Further, the PFC threshold (e.g., XOFF, or PFC trigger threshold, as described in the above embodiments) corresponding to the ingress port in the switch needs to be adjusted.
602. If the port state information indicates that the increase in message traffic received by the first ingress port is greater than a threshold and the increase in message traffic received by the second ingress port is less than the threshold, configure a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
In one possible implementation, if the port state information indicates that the increase in message traffic received by the first ingress port is greater than a threshold while the increase for the second ingress port is less than the threshold, the first ingress port can be considered to have received burst traffic for the target egress port while the second ingress port has not. In this case the first ingress port, not the second, can be regarded as the primary cause of the target egress port's congestion. In existing implementations, the PFC thresholds of the first and second ingress ports are reduced by the same amount; even though the second ingress port is not the primary cause of the congestion, its PFC threshold is lowered just like the first port's, which affects the second ingress port's ability to receive messages destined for other egress ports and greatly reduces message transmission efficiency.
In this embodiment of the present application, when the second ingress port is identified as not being the primary cause of the target egress port's congestion, its PFC threshold is configured to be greater than that of the first ingress port. For example, the second ingress port's PFC threshold may be raised, kept unchanged, or lowered by a smaller amount than the first ingress port's. As a result, the second ingress port, not being the primary cause of the congestion, is unlikely to trigger the sending of indication information (such as a pause frame) to the upstream network device, so message transmission on the second ingress port for egress ports other than the target egress port is unaffected, improving the transmission efficiency of the system.
It should be appreciated that the threshold in the embodiments of the present application may be a preset value, for example, may be 10 percent of the original flow.
Next, how to determine port state information is described.
In one possible implementation, the port state information includes a first message dequeue rate of the first ingress queue and a second message dequeue rate of the second ingress queue. The message dequeue rate indicates the number of messages leaving the ingress port in one cycle. When the dequeue rate is greater than the dequeue-rate threshold, the messages received on that ingress port for the target egress port can be considered relatively few (the target egress port is already blocked, yet the port's dequeue rate remains high because it is also receiving many messages not destined for the target egress port). When the dequeue rate is less than the dequeue-rate threshold, the messages received on that ingress port for the target egress port can be considered relatively many (the target egress port is already blocked, so the port's dequeue rate is low because it is receiving many messages destined for the target egress port).
That is, in the case where the port status information indicates that the first packet dequeue rate is less than the dequeue rate threshold, the port status information may be considered to indicate that the magnitude of increase in the packet traffic received by the first ingress port is greater than a threshold. Similarly, in the case where the port status information indicates that the second packet dequeue rate is greater than the dequeue rate threshold, the magnitude of increase in the packet traffic received by the second ingress port may be considered to be less than the threshold.
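The dequeue-rate test above amounts to a simple per-port classification against a rate threshold. A sketch (the port names, labels, and threshold value are illustrative, not from the embodiment):

```python
def classify_ports(dequeue_counts, rate_threshold):
    """Label each ingress port from its dequeue count over one cycle.

    A low count while the target egress is congested suggests the port is
    receiving the burst for that egress ('burst'); a high count suggests
    most of its traffic leaves through other, uncongested egress ports
    ('normal').
    """
    return {
        port: ("burst" if count < rate_threshold else "normal")
        for port, count in dequeue_counts.items()
    }

# I1 drained few packets this cycle (its packets target the congested
# egress); I2 drained many (its traffic leaves through other egress ports).
states = classify_ports({"I1": 3, "I2": 40}, rate_threshold=10)
```

A port labeled "burst" here corresponds to the first ingress port of the method (increase greater than the threshold, lower PFC threshold), and "normal" to the second ingress port (higher PFC threshold).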
In one possible implementation, when it is determined that the first ingress port is the primary cause of the congestion of the target egress port, a PFC threshold value of the first ingress port may be configured as the first PFC threshold value by a third PFC threshold value, where the first PFC threshold value is less than the third PFC threshold value.
In one possible implementation, the first PFC threshold may be configured by the following formula:
T(t) = α · (B − Σ_i Q_i(t));
wherein B may represent the switch cache size, Q_i(t) represents the queue length of the i-th port, and α is an adjustment factor, e.g., α = 2. T(t) represents the maximum cache size the port may occupy at time t.
In one possible implementation, when it is determined that the second ingress port is not the primary cause of the target egress port's congestion, the PFC threshold of the second ingress port may be reconfigured from a fourth PFC threshold to the second PFC threshold, where either the second PFC threshold is greater than or equal to the fourth PFC threshold, or the second PFC threshold is less than the fourth PFC threshold but its decrease relative to the fourth PFC threshold is smaller than the decrease of the first PFC threshold relative to the third PFC threshold.
In one possible implementation, the second PFC threshold may be configured as follows: a value between α/(α+1)·B and B is selected.
In one possible implementation, the third PFC threshold and the fourth PFC threshold are equal. That is, when the PFC threshold is not configured based on the threshold adjustment method in the embodiment of the present application, the PFC thresholds corresponding to the respective ports are the same.
Next, how to obtain the above-described message dequeue rate is described:
in one possible implementation, in an implementation within the MMU of the switch, the aforementioned dequeue rate acquisition and processing may be implemented by a clock module, an ingress port rate calculation module, an ingress port status determination module, and an ingress port status save module. The present embodiment may be exemplarily implemented in the form of a circuit as shown in fig. 7 for a clock module, an ingress port rate calculation module, an ingress port state judgment module, and an ingress port state saving module.
The circuit shown in fig. 7 consists of three parts: a timer (TC), an ingress port rate calculator (ingress rate counter, IRC), and a state holder (SH). Its inputs are the ingress port dequeue signal and the PFC pause signal, and its output is the control state of the port. Before describing each part of this embodiment, the two control states of a port are first introduced: Absorption and Normal. The transitions between the two states may be as shown in fig. 8.
Here, high load and low load describe the traffic state of the port. High load means the ingress port's dequeue speed is high, i.e., the number of packets leaving the ingress port in one period exceeds a certain threshold; low load means the dequeue speed is low, i.e., the number of packets leaving the ingress port in one period is below the threshold. The traffic state of the port determines its control state, i.e., whether it is Normal or Absorption.
The details of the above-mentioned circuit are as follows:
TC: the timer is used for fixing the periodic time TC for calculating the speed, the dequeue rate counter is periodically reset, the period TC of the clock C is preferably larger than RTT, the period TC is more than or equal to 3RTT according to experience, the reciprocal is started when a pulse is generated, and the clock stops when the pulse is 0. In the reciprocal process, the output position is 0, and when the reciprocal is 0, the output position is 1, and at this time, the IRC reset counter can be notified to calculate the speed of the next cycle.
IRC: the dequeue rate counter records the number of queue empties, is the key for determining the state, and is set as (C×TC)/k, C is the line speed, C×TC is the maximum number of depackets in the period TC, and k represents the tolerance of the concurrency scale. For example when k=5, meaning that an ingress port is considered highly loaded when it competes with less than 5 ingress ports for the same egress port. The dequeue signal for each ingress port triggers an increase in the dequeue signal, and thus the dequeue rate for the ingress port, in combination with the TC, may represent the number of packets dequeued in one cycle. When the threshold is not exceeded, the output is 0, the port control state is Normal, otherwise 1, and the port control state is Absorption.
SH: for saving the state of the IRC. Because IRC will be set to 0 each time TC is reset, the output will be 0.SH is to save the state in the last cycle. And also penalizes ports that occupy much of the cache but still trigger PFC, when pfcpase is triggered, the output is 0 and the port control state becomes Normal.
For the threshold-setting module in the switch MMU, the caching scheme of this embodiment raises the maximum usage threshold of a port in the Absorption state. Under the DT policy the maximum port queue length is α/(α+1)·B, so this embodiment sets the value between α/(α+1)·B and B; ports in the Normal state may keep the existing DT policy. With this configuration, the PFC trigger threshold of a port in the Absorption state is never smaller than that of a port in the Normal state, so Normal-state ports are made to trigger pause frames preferentially.
In one possible implementation, the first ingress port corresponds to a first ingress queue, and the switch is configured to send first indication information to an upstream port of the first ingress port when the number of messages in the first ingress queue is greater than the first PFC threshold, where the first indication information is configured to indicate to stop sending the message to the first ingress port;
In one possible implementation, the second ingress port corresponds to a second ingress queue, and the switch is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, where the second indication information is configured to indicate that sending of the packets to the second ingress port is stopped.
For example, the first indication information and the second indication information may be PFC frames, which instruct a port of the upstream network device to temporarily stop sending messages to a port of the local network device. For example, the PFC frame may be a PFC pause frame notifying the upstream device to temporarily stop sending messages to the port of the network device. It should be noted that only the function of the PFC frame is described here, using the PFC pause frame as the example.
The data processing method in the embodiment of the present application is described next in connection with a specific embodiment.
Referring to fig. 9, fig. 9 is a flowchart of a data processing method according to an embodiment of the present application, where control states of ports are divided into Absorption (Absorption state) and Normal (Normal state):
(1) Absorption: the ingress port's dequeue rate is high.
(2) Normal: the ingress port's dequeue rate is low.
And according to the control state of the input port, adopting a corresponding port PFC trigger threshold.
Step one: calculating the dequeue rate of the input port;
step two: and judging the control state of the port according to the dequeue rate of the ingress port. If the port control state is not the Absorption state, entering a step III; otherwise, entering a fourth step;
step three: setting the PFC threshold of the port to B;
step four: the PFC threshold for the port is set to a, where a is greater than B.
The embodiment of the present application provides a data processing method applied to a network device, where the network device includes a first ingress port, a second ingress port, and a target egress port. The method includes: acquiring port state information of the first ingress port and of the second ingress port, respectively, where the port state information is related to the message traffic received by the corresponding ingress port for the target egress port; and, when the port state information indicates that the increase in message traffic received by the first ingress port is greater than a threshold and the increase for the second ingress port is less than the threshold, configuring a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
In one possible implementation, if the port state information indicates that the increase in message traffic received by the first ingress port is greater than a threshold while the increase for the second ingress port is less than the threshold, the first ingress port can be considered to have received burst traffic for the target egress port while the second ingress port has not. In this case the first ingress port, not the second, can be regarded as the primary cause of the target egress port's congestion. In existing implementations, the PFC thresholds of the first and second ingress ports are reduced by the same amount; even though the second ingress port is not the primary cause of the congestion, its PFC threshold is lowered just like the first port's, which affects the second ingress port's ability to receive messages destined for other egress ports and greatly reduces message transmission efficiency.
In this embodiment of the present application, when the second ingress port is identified as not being the main cause of congestion at the target egress port, its PFC threshold is configured to be greater than that of the first ingress port. For example, the PFC threshold of the second ingress port may be raised, kept unchanged, or lowered by a smaller amount than the PFC threshold of the first ingress port. As a result, the second ingress port, which is not the main cause of the congestion, is unlikely to trigger the sending of indication information (such as a pause frame) to the upstream network device, so packet transmission on the second ingress port toward egress ports other than the target egress port is unaffected, and the transmission efficiency of the system improves.
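The differentiated threshold policy described above can be sketched as follows. This is a hypothetical illustration, not code from the patent: the growth metric, the two threshold constants, and the port-dictionary layout are all assumed names chosen for the sketch.

```python
TRAFFIC_GROWTH_THRESHOLD = 0.5   # relative growth that marks burst traffic (assumed)
AGGRESSIVE_PFC_THRESHOLD = 32    # packets; applied to the bursting ingress port
RELAXED_PFC_THRESHOLD = 128      # packets; applied to the port that is not the cause

def configure_pfc_thresholds(ports):
    """Assign a low PFC threshold only to ports whose traffic toward the
    congested egress port grew faster than the growth threshold; ports
    without burst growth keep a higher (relaxed) threshold."""
    for port in ports:
        growth = (port["current_rate"] - port["previous_rate"]) / max(port["previous_rate"], 1)
        if growth > TRAFFIC_GROWTH_THRESHOLD:
            port["pfc_threshold"] = AGGRESSIVE_PFC_THRESHOLD  # burst source
        else:
            port["pfc_threshold"] = RELAXED_PFC_THRESHOLD     # not the main cause
    return ports
```

Under this policy the first (bursting) ingress port ends up with the smaller PFC threshold, matching the relation "first PFC threshold less than second PFC threshold" stated above.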
The beneficial effects of the embodiments of the present application are described below in conjunction with test results.
Technical effect 1: the probability of collateral damage to innocent traffic is reduced by 82%-85%:
The experimental topology of this embodiment is shown in fig. 10. It is a classical and representative data-center topology, appearing in Clos, Fat-Tree, and similar architectures.
There are 31 senders H0-H30 and two receivers R0-R1, connected by two switches S0 and S1. All links have 50 Gbps bandwidth and 5 us delay, and the traffic consists of long flows and concurrent burst short flows. Specifically, H0 and H1 send long flows to R0 and R1, respectively. Once the two long flows are in steady state, H2-H30 simultaneously generate 29 burst short flows at line rate toward R1, each short flow being 64 KB in size with a duration of about 11 us.
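As a quick consistency check on the figures above, a 64 KB short flow sent at the 50 Gbps line rate does indeed last roughly 11 us:

```python
# Sanity check of the burst parameters quoted above: one 64 KB short
# flow serialized at 50 Gbps takes about 10.5 us, matching the ~11 us
# duration stated in the experiment description.
burst_bits = 64 * 1024 * 8          # 64 KB expressed in bits
line_rate_bps = 50e9                # 50 Gbps link speed
duration_us = burst_bits / line_rate_bps * 1e6  # serialization time in microseconds
```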
Experimental results show that, compared with the existing DT (dynamic threshold) scheme, the PFC pause trigger rate on the H0-R0 traffic is reduced by 82%-85%.
Technical effect 2: this embodiment of the application reduces the deadlock probability by 43 percentage points:
The experimental topology of this embodiment is shown in fig. 11. In 1000 simulation runs, this embodiment encountered 207 to 233 deadlocks, while the existing DT scheme encountered 592 to 668. The data processing method provided by this embodiment therefore reduces the deadlock probability from about 66% to about 23%, a reduction of roughly 43 percentage points.
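The arithmetic behind the quoted reduction can be checked directly from the deadlock counts above:

```python
# Quick check of the deadlock statistics quoted above (1000 runs per scheme).
runs = 1000
dt_worst = 668 / runs           # existing DT scheme, upper bound: ~66.8%
proposed_best = 233 / runs      # this embodiment, upper bound: 23.3%
reduction_points = dt_worst - proposed_best  # ~0.435, i.e. about 43 percentage points
```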
Referring to fig. 12, fig. 12 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application. As shown in fig. 12, the apparatus may be applied to a network device (for example, a switch or a router) that includes a first ingress port, a second ingress port, and a target egress port; the apparatus 1200 may include:
an obtaining module 1201, configured to separately obtain port state information of the first ingress port and the second ingress port, where the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port;
for a specific description of the acquiring module 1201, reference may be made to the description of step 601 in the above embodiment, which is not repeated here.
a threshold configuration module 1202, configured to: when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configure a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, where the first PFC threshold is less than the second PFC threshold.
For a specific description of the threshold configuration module 1202, reference may be made to the description of step 602 in the above embodiment, which is not repeated here.
In one possible implementation, the first ingress port and the second ingress port share buffer resources of the network device.
In one possible implementation, the first ingress port corresponds to a first ingress queue and the second ingress port corresponds to a second ingress queue;
the network device is configured to send first indication information to an upstream port of the first ingress port when the number of packets in the first ingress queue is greater than the first PFC threshold, where the first indication information instructs the upstream port to stop sending packets to the first ingress port;
the network device is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, where the second indication information instructs the upstream port to stop sending packets to the second ingress port.
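The per-port pause rule just described can be sketched as a small helper. The function name and callback are illustrative, not from the patent:

```python
# Minimal sketch of the PFC pause rule described above: a pause frame is
# sent upstream once the ingress queue length exceeds the port's own
# PFC threshold.
def maybe_send_pause(queue_len, pfc_threshold, send_pause_frame):
    """Trigger a PFC pause toward the upstream port when the ingress
    queue length exceeds this port's PFC threshold; return True if a
    pause was actually sent."""
    if queue_len > pfc_threshold:
        send_pause_frame()  # upstream stops sending to this ingress port
        return True
    return False
```

With differentiated thresholds, the same queue length (say 40 packets) can trigger a pause on the first ingress port (low threshold) while leaving the second ingress port (high threshold) unpaused, which is exactly the asymmetry the embodiment aims for.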
In one possible implementation, the port state information includes a first packet dequeue rate of the first ingress queue and a second packet dequeue rate of the second ingress queue;
the port state information indicating that the increase in packet traffic received by the first ingress port is greater than the threshold includes:
the port state information indicating that the first packet dequeue rate is less than a dequeue rate threshold; and
the increase in packet traffic received by the second ingress port being less than the threshold includes:
the port state information indicating that the second packet dequeue rate is greater than the dequeue rate threshold.
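The dequeue-rate rule above can be encoded as a one-line classifier. The threshold constant is an assumed illustrative value, not taken from the patent:

```python
# Hypothetical encoding of the dequeue-rate rule above: an ingress port
# whose queue drains slower than the dequeue-rate threshold is treated
# as the burst source (traffic is arriving faster than it can leave).
DEQUEUE_RATE_THRESHOLD = 10.0  # packets per microsecond (assumed value)

def is_burst_ingress(packet_dequeue_rate):
    """Return True when the port's packet dequeue rate falls below the
    dequeue-rate threshold, i.e. the queue is backing up."""
    return packet_dequeue_rate < DEQUEUE_RATE_THRESHOLD
```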
In one possible implementation, the target egress port corresponds to a target egress queue, which is in a congested state.
In one possible implementation, the threshold configuration module is specifically configured to:
configure the PFC threshold of the first ingress port from a third PFC threshold to the first PFC threshold, where the first PFC threshold is less than the third PFC threshold.
In one possible implementation, the threshold configuration module is specifically configured to:
configure the PFC threshold of the second ingress port from a fourth PFC threshold to the second PFC threshold, where either the second PFC threshold is greater than or equal to the fourth PFC threshold, or the second PFC threshold is less than the fourth PFC threshold and the decrease of the second PFC threshold relative to the fourth PFC threshold is less than the decrease of the first PFC threshold relative to the third PFC threshold.
In one possible implementation, the third PFC threshold and the fourth PFC threshold are equal.
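The two adjustment rules above can be sketched under the common case just mentioned, where the third and fourth PFC thresholds are equal (one shared baseline). All names and values are illustrative assumptions; a negative cut models raising the threshold:

```python
# Sketch of the threshold adjustment in the implementations above,
# assuming third PFC threshold == fourth PFC threshold == baseline.
def adjust_thresholds(baseline, first_cut, second_cut):
    """Lower the first (bursting) port's PFC threshold by first_cut and
    the second port's by the strictly smaller second_cut; second_cut may
    be 0 (kept unchanged) or negative (raised), covering all three cases
    described above."""
    if not second_cut < first_cut:
        raise ValueError("second port must be reduced less than the first")
    return baseline - first_cut, baseline - second_cut
```

For example, cutting the first port by 60 while leaving the second port untouched yields thresholds of 40 and 100 from a baseline of 100, preserving the required ordering.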
Based on the same technical concept, an embodiment of the present application further provides a network device 1300. Referring to fig. 13, the network device 1300 is configured to implement the steps of the data processing method described in the embodiment corresponding to fig. 6. The network device 1300 of this embodiment may include a memory 1301, a processor 1302, and a computer program stored in the memory and executable on the processor, such as a data processing program. The steps of the data processing method embodiments described above are implemented when the processor executes the computer program.
The embodiment of the present application does not limit the specific connection medium between the memory 1301 and the processor 1302. In this embodiment, the memory 1301 and the processor 1302 are connected by a bus 1303, which is indicated by a thick line in fig. 13; the connections between other components are shown only schematically and are not limiting. The bus 1303 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but this does not mean that there is only one bus or only one type of bus.
The memory 1301 may be a volatile memory such as random-access memory (RAM), or a non-volatile memory such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); alternatively, the memory 1301 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 1301 may also be a combination of the above.
An embodiment of the present application also provides a computer program product that, when run on a computer, causes the computer to perform the steps of the data processing method described in the embodiment corresponding to fig. 6.
An embodiment of the present application also provides a computer-readable storage medium storing a program for signal processing; when the program runs on a computer, it causes the computer to perform the steps of the data processing method described in the embodiment corresponding to fig. 6.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by this application, the connection relationships between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary: analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is the preferred embodiment in most cases. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, including several instructions that cause a computer device (which may be a personal computer, a server, or a network device) to perform the methods described in the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.

Claims (19)

1. A data processing method, wherein the method is applied to a network device, and the network device comprises a first ingress port, a second ingress port, and a target egress port; the method comprising:
separately acquiring port state information of the first ingress port and the second ingress port, wherein the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port; and
when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configuring a first priority-based flow control (PFC) threshold for the first ingress port and a second PFC threshold for the second ingress port, wherein the first PFC threshold is less than the second PFC threshold.
2. The method of claim 1, wherein the first ingress port and the second ingress port share buffer resources of the network device.
3. The method of claim 1 or 2, wherein the first ingress port corresponds to a first ingress queue and the second ingress port corresponds to a second ingress queue;
the network device is configured to send first indication information to an upstream port of the first ingress port when the number of packets in the first ingress queue is greater than the first PFC threshold, wherein the first indication information instructs the upstream port to stop sending packets to the first ingress port; and
the network device is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, wherein the second indication information instructs the upstream port to stop sending packets to the second ingress port.
4. The method of claim 3, wherein the port state information comprises a first packet dequeue rate of the first ingress queue and a second packet dequeue rate of the second ingress queue;
the port state information indicating that the increase in packet traffic received by the first ingress port is greater than the threshold comprises:
the port state information indicating that the first packet dequeue rate is less than a dequeue rate threshold; and
the increase in packet traffic received by the second ingress port being less than the threshold comprises:
the port state information indicating that the second packet dequeue rate is greater than the dequeue rate threshold.
5. The method of any of claims 1 to 4, wherein the target egress port corresponds to a target egress queue, the target egress queue being in a congested state.
6. The method of any one of claims 1 to 5, wherein configuring the first PFC threshold for the first ingress port comprises:
configuring the PFC threshold of the first ingress port from a third PFC threshold to the first PFC threshold, wherein the first PFC threshold is less than the third PFC threshold.
7. The method of any of claims 1 to 6, wherein the configuring the second PFC threshold for the second ingress port comprises:
configuring the PFC threshold of the second ingress port from a fourth PFC threshold to the second PFC threshold, wherein either the second PFC threshold is greater than or equal to the fourth PFC threshold, or the second PFC threshold is less than the fourth PFC threshold and the decrease of the second PFC threshold relative to the fourth PFC threshold is less than the decrease of the first PFC threshold relative to the third PFC threshold.
8. The method of claim 6 or 7, wherein the third PFC threshold value and the fourth PFC threshold value are equal.
9. A data processing apparatus, wherein the apparatus is applied to a network device, and the network device comprises a first ingress port, a second ingress port, and a target egress port; the apparatus comprising:
an obtaining module, configured to separately obtain port state information of the first ingress port and the second ingress port, wherein the port state information relates to the packet traffic received by the corresponding ingress port and destined for the target egress port; and
a threshold configuration module, configured to: when the port state information indicates that the increase in packet traffic received by the first ingress port is greater than a threshold and the increase in packet traffic received by the second ingress port is less than the threshold, configure a first PFC threshold for the first ingress port and a second PFC threshold for the second ingress port, wherein the first PFC threshold is less than the second PFC threshold.
10. The apparatus of claim 9, wherein the first ingress port and the second ingress port share buffer resources of the network device.
11. The apparatus of claim 9 or 10, wherein the first ingress port corresponds to a first ingress queue and the second ingress port corresponds to a second ingress queue;
the network device is configured to send first indication information to an upstream port of the first ingress port when the number of packets in the first ingress queue is greater than the first PFC threshold, wherein the first indication information instructs the upstream port to stop sending packets to the first ingress port; and
the network device is configured to send second indication information to an upstream port of the second ingress port when the number of packets in the second ingress queue is greater than the second PFC threshold, wherein the second indication information instructs the upstream port to stop sending packets to the second ingress port.
12. The apparatus of claim 11, wherein the port state information comprises a first packet dequeue rate of the first ingress queue and a second packet dequeue rate of the second ingress queue;
the port state information indicating that the increase in packet traffic received by the first ingress port is greater than the threshold comprises:
the port state information indicating that the first packet dequeue rate is less than a dequeue rate threshold; and
the increase in packet traffic received by the second ingress port being less than the threshold comprises:
the port state information indicating that the second packet dequeue rate is greater than the dequeue rate threshold.
13. The apparatus of any of claims 9 to 12, wherein the target egress port corresponds to a target egress queue, the target egress queue being in a congested state.
14. The apparatus according to any one of claims 9 to 13, wherein the threshold configuration module is specifically configured to:
configure the PFC threshold of the first ingress port from a third PFC threshold to the first PFC threshold, wherein the first PFC threshold is less than the third PFC threshold.
15. The apparatus according to any one of claims 9 to 14, wherein the threshold configuration module is specifically configured to:
configure the PFC threshold of the second ingress port from a fourth PFC threshold to the second PFC threshold, wherein either the second PFC threshold is greater than or equal to the fourth PFC threshold, or the second PFC threshold is less than the fourth PFC threshold and the decrease of the second PFC threshold relative to the fourth PFC threshold is less than the decrease of the first PFC threshold relative to the third PFC threshold.
16. The apparatus of claim 14 or 15, wherein the third PFC threshold value and the fourth PFC threshold value are equal.
17. A network device comprising a processor, a memory, and a bus, wherein:
The processor and the memory are connected through the bus;
the memory is used for storing computer programs or instructions;
the processor is configured to invoke or execute a program or instructions stored in the memory to implement the method steps of any of claims 1-8.
18. A computer readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 8.
19. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-8.
CN202111415441.4A 2021-11-25 2021-11-25 Data processing method and related equipment Pending CN116170377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111415441.4A CN116170377A (en) 2021-11-25 2021-11-25 Data processing method and related equipment


Publications (1)

Publication Number Publication Date
CN116170377A true CN116170377A (en) 2023-05-26

Family

ID=86420675




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination