WO2023041142A1 - A device and method for flow detection and processing - Google Patents

A device and method for flow detection and processing

Info

Publication number
WO2023041142A1
WO2023041142A1 PCT/EP2021/075189 EP2021075189W WO2023041142A1 WO 2023041142 A1 WO2023041142 A1 WO 2023041142A1 EP 2021075189 W EP2021075189 W EP 2021075189W WO 2023041142 A1 WO2023041142 A1 WO 2023041142A1
Authority
WO
WIPO (PCT)
Prior art keywords
flow
elephant
packets
state
endpoint device
Prior art date
Application number
PCT/EP2021/075189
Other languages
French (fr)
Inventor
Reuven Cohen
Tal Mizrahi
Ben-Shahar BELKAR
Amir Roitshtein
Yoni BICK
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/075189 (WO2023041142A1)
Priority to CN202180102086.7A (CN117917061A)
Publication of WO2023041142A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/24Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/31Flow control; Congestion control by tagging of packets, e.g. using discard eligibility [DE] bits
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/35Flow control; Congestion control by embedding flow control information in regular packets, e.g. piggybacking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/28Flow control; Congestion control in relation to timing considerations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/50Queue scheduling
    • H04L47/52Queue scheduling by attributing bandwidth to queues
    • H04L47/522Dynamic queue service slot or variable bandwidth allocation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/30Peripheral units, e.g. input or output ports
    • H04L49/3027Output queuing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/35Switches specially adapted for specific applications
    • H04L49/351Switches specially adapted for specific applications for local area network [LAN], e.g. Ethernet switches

Definitions

  • the present disclosure relates to the field of network communications, for example, to high-speed networks, and Remote Direct Memory Access (RDMA) technologies.
  • the disclosure is concerned with elephant flow detection in high-speed networks.
  • the present disclosure provides an endpoint device for flow transmission in a network, a switch or a router for a network, and corresponding methods for flow transmission in a network and a network switch, respectively, and a computer program, or a hardware device.
  • Traffic patterns in high-speed networks often have an elephant/mouse behavior.
  • Elephant flows typically consume a large amount of bandwidth, and exist for long periods of time.
  • Mouse flows usually have low bandwidth, and sometimes live for a short period of time. Mouse flows often require low latency, as the (small) amount of traffic is more sensitive to delay, and thus needs to be forwarded with a higher priority, guaranteeing a short completion time and a low probability of packet loss.
  • Network devices such as switches and routers often have elephant flow detection mechanisms. Once a flow is detected as an elephant in these devices, it may receive different treatment than other flows, including one or more of the following: the flow may be split (load balanced) among multiple network paths, and the flow may be forwarded with a different priority than mouse flows.
  • an elephant flow is forwarded with a lower priority, allowing mouse flows to be forwarded with a lower packet loss probability, and thus resulting in lower completion time (lower network latency).
  • Elephant flows are typically detected by network devices by statistical methods that include a combination of one or more of the following: per flow counter, or per flow timestamp of the last packet.
  • One of the main challenges with existing approaches is that elephant flow detection is based on a statistical algorithm, which may cause a new elephant flow to be detected only a while after it has started.
  • a transient detection time during which the switch may be congested, potentially causing mouse packets to be dropped.
  • temporary congestion may occur before the flow is assigned a dedicated path. Possibly, an elephant flow may in some cases not be detected as an elephant at all.
  • this disclosure aims to provide an elephant flow detection method.
  • An objective is to propose an elephant flow detection method, which needs no transient time that causes temporary packet loss.
  • Another aim is to enable lower packet loss, lower delivery time, and lower congestion in scenarios where elephant and mouse flows are transmitted in parallel.
  • Another goal is to enable the combination of the elephant flow detection method and the functionality of the server and switch/router.
  • a first aspect of the present disclosure provides an endpoint device for flow transmission in a network, wherein the endpoint device is configured to: maintain a send queue for packets of one or more flows, wherein each flow comprises a plurality of packets; maintain at least one state for each flow, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; provide each packet of the plurality of packets with an indication field that indicates the at least one state of the flow that the packet belongs to, and transmit the plurality of packets.
  • the present disclosure introduces a new elephant flow detection algorithm that is performed at the endpoint, rather than in switches/routers.
  • the endpoint may be a hardware-enabled Smart Network Interface Card (NIC). It can use the information it has about the flows to detect which flows are elephants. In this way, this disclosure makes it possible to detect which flows are elephants before the flows start.
  • each flow is a Remote Direct Memory Access (RDMA) flow
  • each packet is an RDMA packet.
  • the endpoint device may be an endpoint device that generates RDMA messages.
  • the endpoint device may be a server with an NIC that may have hardware support for RDMA transmission.
  • the endpoint device is further configured to record a timestamp of transmission for each transmitted packet.
  • the timestamp of the last packet that is transmitted is kept by the endpoint device.
  • the at least one state of each flow is detected based on a first detection algorithm, wherein the endpoint device is configured to: set the at least one state of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue, and a length of the new transaction exceeds a pre-determined length.
  • the solution of this disclosure may include different mechanisms. That is, based on different detection algorithms, the endpoint device may set the state for a flow differently. Optionally, the default state of a flow is “non-Elephant”.
  • the endpoint device is configured to set the at least one state of the first flow to “non-Elephant”, when it is determined that: a following new transaction element is queued in the send queue, a length of the following new transaction element does not exceed the predetermined length, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
  • the at least one state of each flow is detected based on a second detection algorithm, wherein the endpoint device is configured to set the at least one state of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue, and at least one of the following conditions is met: a length of the new transaction exceeds a pre-determined length, or a rate of the first flow exceeds a first pre-determined rate.
  • the endpoint device is further configured to: set the at least one state of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue, a length of the following new transaction does not exceed the pre-determined length, the rate of the first flow is below a second pre-determined rate, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
  • a transaction of the send queue is a working queue element (WQE).
  • the new transaction that is queued in the send queue 101 is an RDMA transaction, i.e., a WQE.
  • the endpoint device is configured to obtain the length of the new transaction or the following new transaction by checking a length field of that transaction.
  • the endpoint device may use an RDMA “direct memory access (DMA) Length” field to detect elephant flows and set the states of the flows.
  • the first detection algorithm is used for priority assignment of multiple flows
  • the second detection algorithm is used for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
  • the indication field comprises a first indication and/or a second indication, wherein the first indication indicates whether the first detection algorithm is used for detecting the at least one state and the second indication indicates whether the second detection algorithm is used for detecting the at least one state.
  • the at least one state may comprise two Boolean states, each corresponding to a different one of the two algorithms.
  • the endpoint device 100 may maintain two Boolean state bits, i.e., the first indication and the second indication (each indicates Elephant / Non-Elephant of the flow), for each flow.
  • the indication field is comprised in a header of each packet.
  • a second aspect of the present disclosure provides a switch for a network, wherein the switch is configured to: receive a plurality of packets from an endpoint device, wherein the plurality of packets comprises packets of one or more flows, and each packet of the plurality of packets comprises an indication field that indicates at least one state of a flow that the packet belongs to, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and process each packet based on the indication field.
  • the present disclosure further provides a switch that operates in accordance with the endpoint device of the first aspect.
  • the switch or router may use the indication field to provide different services to elephant flows, including possibly assigning them to a different queue with a different priority and with different memory resources.
  • the indication field comprises a first indication and/or a second indication, wherein the first indication indicates whether a first detection algorithm is used by the endpoint device for detecting the state and the second indication indicates whether a second detection algorithm is used by the endpoint device for detecting the state.
  • the indication may include two bits, “XY”, X for indicating whether the first detection algorithm is used for detecting the state for the flow, and Y for indicating whether the second detection algorithm is used for detecting the state for the flow.
  • the first detection algorithm is used for priority assignment of multiple flows
  • the second detection algorithm is for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
  • the switch is configured to perform load balancing, for each flow of the one or more flows, based on the indication field comprised in the packets of said flow.
  • the switch is further configured to: detect a first flow that is a non-Elephant flow; and assign each packet of the first flow to a queue with a priority higher than a priority associated with an Elephant flow, or route each packet of the first flow to a path with a latency lower than a latency of a path for routing an Elephant flow.
  • a third aspect of the present disclosure provides a method performed by an endpoint device for flow transmission in a network.
  • the method comprises: maintaining a send queue for packets of one or more flows, wherein each flow comprises a plurality of packets; maintaining at least one state for each flow, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and providing each packet of the plurality of packets with an indication field that indicates the at least one state of the flow that the packet belongs to, and transmitting the plurality of packets.
  • the method of the third aspect and its implementation forms provide the same advantages and effects as described above for the endpoint device of the first aspect and its respective implementation forms.
  • a fourth aspect of the present disclosure provides a method performed by a switch for a network.
  • the method comprises: receiving a plurality of packets from an endpoint device, wherein the plurality of packets comprises packets of one or more flows, and each packet of the plurality of packets comprises an indication field that indicates at least one state of a flow that the packet belongs to, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and processing each packet based on the indication field.
  • the method of the fourth aspect and its implementation forms provide the same advantages and effects as described above for the switch of the second aspect and its respective implementation forms.
  • a fifth aspect of the present disclosure provides a computer program comprising a program code for carrying out, when implemented on a processor, the method according to any of the third aspect and its implementation forms, or any of the fourth aspect and its implementation forms. It has to be noted that all devices, elements, units, and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective endpoint device is adapted to or configured to perform the respective steps and functionalities.
  • FIG. 1 shows an endpoint device according to an embodiment of the disclosure.
  • FIG. 2 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
  • FIG. 3 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
  • FIG. 4 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
  • FIG. 4 shows a switch according to an embodiment of the disclosure.
  • FIG. 5 shows a method according to an embodiment of the disclosure.
  • FIG. 6 shows a method according to an embodiment of the disclosure.
  • an embodiment/example may refer to other embodiments/examples.
  • any description, including but not limited to terminology, element, process, explanation and/or technical advantage, mentioned in one embodiment/example is applicable to the other embodiments/examples.
  • FIG. 1 shows an endpoint device 100 for flow detection in a network according to an embodiment of the disclosure.
  • the endpoint device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the endpoint device 100 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the endpoint device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software.
  • the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the endpoint device 100 to be performed.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the endpoint device 100 to perform, conduct or initiate the operations or methods described herein.
  • the endpoint device 100 is configured to maintain a send queue 101 for packets of one or more flows, wherein each flow 102 comprises a plurality of packets. Further, the endpoint device 100 is configured to maintain at least one state 1021 for each flow 102, wherein the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow. Notably, the Elephant flow consumes more network resources than the non-Elephant flow. Then, the endpoint device 100 is further configured to provide each packet 1011 of the plurality of packets with an indication field 1022 that indicates the at least one state 1021 of the flow 102 that the packet 1011 belongs to, and transmit the plurality of packets.
  • Non-Elephant flows, e.g., mouse flows or mice flows, usually have low bandwidth, and sometimes live for a short period.
  • FIG. 2 illustrates examples of the behavior of elephant flows that are introduced along with existing mouse flows.
  • the behavior of elephant flows without and with elephant flow detection is discussed first.
  • the elephant flow consumes the queue resources and causes mouse packets to be dropped.
  • Such behavior is not necessarily desirable, as mouse flows may be control or management traffic that may be of high importance.
  • Mouse flows often require low latency, as the (small) amount of traffic is more sensitive to delay, and thus needs to be forwarded with a higher priority, guaranteeing a short completion time and a low probability of packet loss.
  • FIG. 2 (b) shows a scenario in which the switch performs the elephant flow detection. It can be seen that an elephant flow is assigned to a dedicated queue with a different (e.g., lower) priority, and potentially with a different resource allocation (such as a higher drop threshold). Consequently, mouse packets are not dropped, while elephant packets are potentially dropped.
  • FIG. 3 shows another use case for elephant flow detection, namely, load balancing.
  • switches and routers along the path may allocate a dedicated path for the elephant flow, while forwarding other flows through different paths, as shown in FIG. 3.
  • a statistical algorithm may cause a new elephant flow to be detected only a while after it has started. That is, only after the detection, the desired behavior as shown in FIG. 2 (b) may be reached.
  • the switch may be congested, thus potentially causing mouse packets to be dropped as shown in FIG. 2 (a). Since the detection is not instantaneous, temporary congestion may occur before a flow is assigned a dedicated path as shown in FIG. 3.
  • This disclosure thus introduces a new elephant flow detection algorithm that is performed at the endpoint, rather than in switches/routers.
  • the endpoint, i.e., the endpoint device 100 as shown in FIG. 1, which is potentially a hardware-enabled Smart NIC, uses the information it has about the flows to detect which flows are elephants.
  • the endpoint device 100 is an endpoint device that generates RDMA messages.
  • the endpoint device 100 may be a server with an NIC that may have hardware support for RDMA transmission.
  • In RDMA transmissions, an endpoint has information about RDMA flows and is thus able to detect which flows are elephants before the flows start.
  • each flow 102 as shown in FIG. 1 is an RDMA flow.
  • each packet 1011 of the send queue 101 is an RDMA packet.
  • a transaction of the send queue 101 is a WQE.
  • a WQE is an RDMA operation or transaction that is posted or issued by a source application or software.
  • the endpoint device 100 may use an “elephant indication”, i.e., the indication field 1022, in the packet header to indicate to network switches/routers which packets belong to elephant flows.
  • the indication field 1022 is comprised in a header of each packet 1011. Possibly, the indication field 1022 may be part of a proprietary header or may be incorporated in an existing field of an RDMA packet, such as the differentiated services code point (DSCP) field in the IP header.
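
As an illustration of carrying the indication in the DSCP field, the following Python sketch packs a two-bit indication into the two least significant DSCP bits of the IP TOS / Traffic Class byte. The bit positions, the constant names, and the function names are assumptions made for this example; the disclosure does not state which DSCP bits would be used.

```python
DSCP_SHIFT = 2           # DSCP is the upper six bits of the byte; the lower two bits are ECN
DSCP_MASK = 0x3F         # six DSCP bits
ECN_MASK = 0b11          # two ECN bits
INDICATION_MASK = 0b11   # hypothetical choice: two least significant DSCP bits


def embed_indication(tos: int, indication: int) -> int:
    """Return the TOS byte with the indication written into the DSCP field."""
    dscp = (tos >> DSCP_SHIFT) & DSCP_MASK
    ecn = tos & ECN_MASK
    # Clear the hypothetical indication bits, then write the new value.
    dscp = (dscp & ~INDICATION_MASK & DSCP_MASK) | (indication & INDICATION_MASK)
    return (dscp << DSCP_SHIFT) | ecn


def extract_indication(tos: int) -> int:
    """Read the indication back, as a switch or router might do."""
    return (tos >> DSCP_SHIFT) & INDICATION_MASK
```

The ECN bits are left untouched, so this encoding would not interfere with congestion notification.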
  • the endpoint device 100 may be configured to obtain the length of the new transaction or the following new transaction by checking a length field of that transaction.
  • the endpoint device 100 that sends RDMA packets may use the DMA Length field to detect elephant flows and assign the value of the “elephant indication” bit, i.e., the indication field 1022.
  • the endpoint device 100 keeps a Boolean state bit: Elephant / Non-Elephant. The state is incorporated into the packet’s “elephant indication” field.
  • the endpoint device 100 may be configured to record a timestamp of transmission for each transmitted packet 1011.
  • the solution of the disclosure may include two different mechanisms: an elephant flow detection mechanism for priority assignment, and a detection mechanism for load balancing. Specifically, the disclosure defines two elephant detection algorithms that are performed by the endpoint device 100.
  • the at least one state 1021 of each flow 102 is detected based on a first detection algorithm, e.g., priority assignment of multiple flows.
  • a new flow is by default assigned the non-Elephant state, i.e., the default state of a flow is “non-Elephant”.
  • the endpoint device 100 may be configured to set the at least one state 1021 of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue 101, and a length of the new transaction exceeds a pre-determined length.
  • the endpoint device 100 may be configured to set back the at least one state 1021 of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue 101, a length of the following new transaction does not exceed the pre-determined length, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
  • the new transaction that is queued in the send queue 101 is an RDMA transaction, i.e., a WQE.
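
The first detection algorithm can be sketched in Python as follows. This is a minimal illustration, not the implementation of the disclosure: the threshold values and the names `LENGTH_THRESHOLD`, `IDLE_LIMIT`, `PriorityFlowState`, `on_wqe_queued` and `on_packet_sent` are hypothetical.

```python
from dataclasses import dataclass

LENGTH_THRESHOLD = 64 * 1024  # hypothetical "pre-determined length", in bytes
IDLE_LIMIT = 0.001            # hypothetical "predetermined time limit", in seconds


@dataclass
class PriorityFlowState:
    elephant: bool = False               # a new flow defaults to "non-Elephant"
    last_tx_time: float = float("-inf")  # timestamp of the last transmitted packet


def on_wqe_queued(state: PriorityFlowState, wqe_length: int, now: float) -> bool:
    """Update the state when a new WQE is queued in the send queue."""
    if wqe_length > LENGTH_THRESHOLD:
        # A long transaction marks the flow as an Elephant immediately.
        state.elephant = True
    elif now - state.last_tx_time > IDLE_LIMIT:
        # A short transaction after a sufficiently long idle period resets the state.
        state.elephant = False
    return state.elephant


def on_packet_sent(state: PriorityFlowState, now: float) -> None:
    """Record the transmission timestamp of the packet just sent."""
    state.last_tx_time = now
```

A short WQE that arrives before the idle limit has elapsed leaves the state unchanged, so a flow does not flip back to “non-Elephant” while its earlier traffic may still be in flight.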
  • the at least one state 1021 of each flow 102 is detected based on a second detection algorithm, e.g., load balancing.
  • a new flow is by default assigned the non-Elephant state, i.e., the default state of a flow is “non-Elephant”.
  • the rate (bandwidth) of each Queue Pair or the rate of the flow may be continuously measured, and used in the decision of the state.
  • the endpoint device 100 may be configured to set the at least one state 1021 of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue 101, and at least one of the following conditions is met: a length of the new transaction exceeds a pre-determined length, or a rate of the first flow exceeds a first pre-determined rate, e.g., R1.
  • the endpoint device 100 may be configured to set back the at least one state 1021 of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue 101, a length of the following new transaction does not exceed the pre-determined length, the rate of the first flow is below a second pre-determined rate, e.g., R2, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
  • the second pre-determined rate R2 is not greater than the first pre-determined rate R1, that is, R2 ≤ R1.
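
The second detection algorithm adds a rate condition with two thresholds, which gives hysteresis when R2 < R1. The Python sketch below assumes the rate is measured elsewhere and passed in; the numeric values and the names `LoadFlowState` and `update_load_state` are hypothetical.

```python
from dataclasses import dataclass

LENGTH_THRESHOLD = 64 * 1024  # hypothetical "pre-determined length", in bytes
IDLE_LIMIT = 0.001            # hypothetical "predetermined time limit", in seconds
R1 = 10e9                     # hypothetical first pre-determined rate, in bit/s
R2 = 5e9                      # hypothetical second pre-determined rate, R2 <= R1


@dataclass
class LoadFlowState:
    elephant: bool = False               # a new flow defaults to "non-Elephant"
    last_tx_time: float = float("-inf")  # timestamp of the last transmitted packet


def update_load_state(state: LoadFlowState, wqe_length: int,
                      rate: float, now: float) -> bool:
    """Update the state when a new WQE is queued; `rate` is the measured flow rate."""
    if wqe_length > LENGTH_THRESHOLD or rate > R1:
        # Either a long transaction or a high rate marks the flow as an Elephant.
        state.elephant = True
    elif rate < R2 and now - state.last_tx_time > IDLE_LIMIT:
        # Short transaction, low rate and a long enough idle period: reset.
        state.elephant = False
    return state.elephant
```

A flow whose rate sits between R2 and R1 keeps whatever state it already has, which avoids toggling the indication when the rate hovers near a single threshold.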
  • the packet header may include up to two “elephant indication” bits, one for each of the mechanisms.
  • the indication field 1022 may comprise a first indication and/or a second indication, wherein the first indication indicates whether the first detection algorithm is used for detecting the at least one state 1021 and the second indication indicates whether the second detection algorithm is used for detecting the at least one state 1021.
  • the at least one state 1021 may comprise two Boolean states, each corresponding to a different one of the two algorithms. Accordingly, the endpoint device 100 may maintain two Boolean state bits (each indicates Elephant / Non-Elephant of the flow) for each flow.
  • the indication field 1022 may include two bits, “XY”, X for indicating whether the first detection algorithm is used for detecting the at least one state 1021, and Y for indicating whether the second detection algorithm is used for detecting the at least one state 1021.
  • If the indication field 1022 carries “10”, it indicates that the elephant flow detection is performed for priority assignment, and the current flow is an elephant flow.
  • If the indication field 1022 carries “01”, it indicates that the elephant flow detection is performed for load balancing, and the current flow is an elephant flow.
  • If the indication field 1022 carries “00”, it may indicate that the current flow is a non-Elephant flow.
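
The two-bit “XY” encoding can be illustrated with a short Python sketch. The function names are hypothetical, and the reading of “11” as Elephant under both algorithms is an assumption that follows from keeping one Boolean state per algorithm; the text above only spells out “10”, “01” and “00”.

```python
from typing import Tuple


def encode_indication(priority_elephant: bool, load_elephant: bool) -> int:
    """Pack the two per-flow Boolean states into the bits "XY".

    X (high bit): Elephant according to the first algorithm (priority assignment).
    Y (low bit):  Elephant according to the second algorithm (load balancing).
    """
    return (int(priority_elephant) << 1) | int(load_elephant)


def decode_indication(field: int) -> Tuple[bool, bool]:
    """Return (priority_elephant, load_elephant) from the two-bit field."""
    return bool(field & 0b10), bool(field & 0b01)
```

For example, `format(encode_indication(True, False), "02b")` gives the string "10", the priority-assignment Elephant case described above.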
  • FIG. 4 shows a switch 200 for a network according to an embodiment of the disclosure.
  • the switch 200 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the switch 200 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as ASICs, FPGAs, DSPs, or multi-purpose processors.
  • the switch 200 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software.
  • the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the switch 200 to be performed.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the switch 200 to perform, conduct or initiate the operations or methods described herein.
  • the switch 200 is configured to receive a plurality of packets from an endpoint device 100.
  • the endpoint device 100 here may be the endpoint device 100 shown in FIG. 1.
  • the plurality of packets comprises packets of one or more flows
  • each packet 1011 of the plurality of packets comprises an indication field 1022 that indicates at least one state 1021 of a flow 102 that the packet belongs to.
  • the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow.
  • the switch 200 may be further configured to process each packet 1011 based on the indication field 1022.
  • Embodiments of the present disclosure further provide a switch 200 that operates according to the endpoint device 100 as previously described in this disclosure.
  • the term “switch” is used here. It should be noted that it could generally be a network device such as a router as well.
  • a switch or router may use the indication field 1022, which may also be referred to as “elephant indication” bit, in its routing or load balancing decision.
  • a switch or router may use the indication field 1022 to provide different services to elephant flows, including possibly assigning them to a different queue with a different priority and with different memory resources.
  • the indication field 1022 comprises a first indication and/or a second indication, wherein the first indication indicates whether a first detection algorithm is used by the endpoint device 100 for detecting the state 1021 and the second indication indicates whether a second detection algorithm is used by the endpoint device 100 for detecting the state 1021.
  • the first detection algorithm is used by the switch 200 for priority assignment of multiple flows
  • the second detection algorithm is for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
  • the switch 200 may be configured to perform load balancing, for each flow 102 of the one or more flows, based on the indication field 1022 comprised in the packets of said flow.
  • the switch 200 may use the elephant flow indication field 1022 as part of the load balancing decision. For example, each elephant flow may be assigned its own separate path, while non-elephant flows are forwarded together through one or more different paths.
  • the switch 200 may be configured to detect a first flow that is a non-Elephant flow.
  • the switch 200 may be configured to assign each packet of the first flow to a queue with a priority higher than a priority associated with an Elephant flow, or route each packet of the first flow to a path with a latency lower than the latency of a path for routing an Elephant flow.
  • a queue pair may be initialized as a non-Elephant by default, and then 1000 short WQEs are sent back-to-back. Due to the high rate of the queue pair, it is treated as an Elephant flow for load balancing purposes, and the load state, i.e., the state 1021 when it is detected by using the load balancing algorithm, may change to “Elephant”. For priority assignment purposes, it is not treated as an Elephant flow, and will still be assigned a high priority, in order to allow fast delivery times for short WQEs. Thus, the priority state, i.e., the state 1021 when it is detected by using the priority assignment algorithm, remains in the ‘non-Elephant’ value.
  • having the flow wait for a duration of at least one RTT since the last packet allows previous traffic to be delivered before new non-Elephant packets are sent.
  • elephant flow detection can be performed immediately rather than statistically. There is no transient time that causes temporary packet loss. This disclosure thus enables lower packet loss, lower delivery time, and lower congestion in scenarios where elephant and mouse flows are transmitted in parallel.
  • combined endpoint and switch/router functionality provide added value to a network that uses the combined server and switch/router functionality. The disclosure allows detecting RDMA elephant flows based on a combination of components such as WQE length, idle time, and traffic rate.
  • FIG. 5 shows a method 500 for flow transmission in a network according to an embodiment of the disclosure.
  • the method 500 is performed by an endpoint device 100 shown in FIG. 1 or FIG. 4.
  • the method 500 comprises a step 501 of maintaining a send queue 101 for packets of one or more flows, wherein each flow 102 comprises a plurality of packets.
  • the method 500 further comprises a step 502 of maintaining at least one state 1021 for each flow 102.
  • the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow.
  • the Elephant flow consumes more network resources than the non-Elephant flow.
  • the method 500 further comprises a step 503 of providing each packet 1011 of the plurality of packets with an indication field 1022 that indicates the at least one state 1021 of the flow 102 that the packet belongs to, and transmitting the plurality of packets.
  • FIG. 6 shows a method 600 for flow transmission in a network according to an embodiment of the disclosure.
  • the method 600 is performed by a switch 200 shown in FIG. 4.
  • the method 600 comprises a step 601 of receiving a plurality of packets from an endpoint device 100.
  • the plurality of packets comprises packets of one or more flows
  • each packet 1011 of the plurality of packets comprises an indication field 1022 that indicates at least one state 1021 of a flow 102 that the packet belongs to.
  • the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow.
  • the method 600 further comprises a step 602 of processing each packet 1011 based on the indication field 1022.
  • any method according to embodiments of the disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method.
  • the computer program is included in a computer readable medium of a computer program product.
  • the computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
  • embodiments of the endpoint device 100, or the switch 200, comprise the necessary communication capabilities in the form of, e.g., functions, means, units, elements, etc., for performing the solution.
  • means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.
  • the processor(s) of the endpoint device 100, or the switch 200 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an ASIC, a microprocessor, or other processing logic that may interpret and execute instructions.
  • the expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above.
  • the processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present disclosure relates to devices and methods for flow detection. Specifically, the disclosure proposes an endpoint device and a switch. The endpoint device is configured to: maintain a send queue for packets of one or more flows, wherein each flow comprises a plurality of packets; maintain at least one state for each flow, wherein the at least one state includes "Elephant" or "non-Elephant", which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; provide each packet of the plurality of packets with an indication field that indicates the at least one state of the flow that the packet belongs to, and transmit the plurality of packets. The switch is configured to receive a plurality of packets from an endpoint device, and process each packet based on the indication field.

Description

A DEVICE AND METHOD FOR FLOW DETECTION AND PROCESSING
TECHNICAL FIELD
The present disclosure relates to the field of network communications, for example, to high-speed networks, and Remote Direct Memory Access (RDMA) technologies. The disclosure is concerned with elephant flow detection in high-speed networks. To this end, the present disclosure provides an endpoint device for flow transmission in a network, a switch or a router for a network, and corresponding methods for flow transmission in a network and a network switch, respectively, and a computer program, or a hardware device.
BACKGROUND
Traffic patterns in high-speed networks often have an elephant/mouse behavior. Elephant flows typically consume a large amount of bandwidth, and exist for long periods of time.
Mouse flows (or mice flows) usually have low bandwidth, and sometimes live for a short period of time. Mouse flows often require low latency, as the (small) amount of traffic is more sensitive to delay, and thus needs to be forwarded with a higher priority, guaranteeing a short completion time and a low probability of packet loss.
Network devices such as switches and routers often have elephant flow detection mechanisms. Once a flow is detected as an elephant in these devices, it may receive different treatment than other flows, including one or more of the following: the flow may be split (load balanced) among multiple network paths, and the flow may be forwarded with a different priority than mouse flows.
Typically, an elephant flow is forwarded with a lower priority, allowing mouse flows to be forwarded with a lower packet loss probability, and thus resulting in lower completion time (lower network latency). Elephant flows are typically detected by network devices by statistical methods that include a combination of one or more of the following: per flow counter, or per flow timestamp of the last packet. One of the main challenges with existing approaches is that elephant flow detection is based on a statistical algorithm, which may cause a new elephant flow to be detected only a while after it has started. Thus, with the current statistical detection mechanism, there is a transient detection time during which the switch may be congested, potentially causing mouse packets to be dropped. Further, since detection is not instantaneous, temporary congestion may occur before the flow is assigned a dedicated path. In some cases, an elephant flow may not be detected as an elephant at all.
Therefore, a new elephant detection algorithm is desired.
SUMMARY
In view of the above, this disclosure aims to provide an elephant flow detection method. An objective is to propose an elephant flow detection method, which needs no transient time that causes temporary packet loss. Another aim is to enable lower packet loss, lower delivery time, and lower congestion in scenarios where elephant and mouse flows are transmitted in parallel. Another goal is to enable the combination of the elephant flow detection method and the functionality of the server and switch/router.
These and other objectives are achieved by the solution described in the enclosed independent claims. Advantageous implementations are further defined in the dependent claims.
A first aspect of the present disclosure provides an endpoint device for flow transmission in a network, wherein the endpoint device is configured to: maintain a send queue for packets of one or more flows, wherein each flow comprises a plurality of packets; maintain at least one state for each flow, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; provide each packet of the plurality of packets with an indication field that indicates the at least one state of the flow that the packet belongs to, and transmit the plurality of packets.
The present disclosure introduces a new elephant flow detection algorithm that is performed at the endpoint, rather than in switches/routers. The endpoint may potentially be a hardware-enabled Smart Network Interface Card (NIC). It can use the information it has about the flows to detect which flows are elephants. In this way, this disclosure makes it possible to detect which flows are elephants before the flows start.
In an implementation form of the first aspect, each flow is a Remote Direct Memory Access (RDMA) flow, and each packet is an RDMA packet.
The endpoint device may be an endpoint device that generates RDMA messages. For instance, the endpoint device may be a server with an NIC that may have hardware support for RDMA transmission.
In an implementation form of the first aspect, the endpoint device is further configured to record a timestamp of transmission for each transmitted packet.
Notably, for each flow, the timestamp of the last packet that is transmitted is kept by the endpoint device.
In an implementation form of the first aspect, the at least one state of each flow is detected based on a first detection algorithm, wherein the endpoint device is configured to: set the at least one state of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue, and a length of the new transaction exceeds a pre-determined length.
It should be noted that the solution of this disclosure may include different mechanisms. That is, based on different detection algorithms, the endpoint device may set the state for a flow differently. Optionally, the default state of a flow is “non-Elephant”.
In an implementation form of the first aspect, the endpoint device is configured to set the at least one state of the first flow to “non-Elephant”, when it is determined that: a following new transaction element is queued in the send queue, a length of the following new transaction element does not exceed the predetermined length, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
In an implementation form of the first aspect, the at least one state of each flow is detected based on a second detection algorithm, wherein the endpoint device is configured to set the at least one state of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue, and at least one of the following conditions is met: a length of the new transaction exceeds a pre-determined length, or a rate of the first flow exceeds a first pre-determined rate.
In an implementation form of the first aspect, the endpoint device is further configured to: set the at least one state of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue, a length of the following new transaction does not exceed the pre-determined length, the rate of the first flow is below a second pre-determined rate, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
In an implementation form of the first aspect, a transaction of the send queue is a work queue element (WQE). Optionally, the new transaction that is queued in the send queue 101 is an RDMA transaction, i.e., a WQE.
In an implementation form of the first aspect, the endpoint device is configured to obtain the length of the new transaction or the following new transaction by checking a length field of that transaction.
Optionally, the endpoint device may use an RDMA “direct memory access (DMA) Length” field to detect elephant flows and set the states of the flows.
In an implementation form of the first aspect, the first detection algorithm is used for priority assignment of multiple flows, and the second detection algorithm is used for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
In an implementation form of the first aspect, the indication field comprises a first indication and/or a second indication, wherein the first indication indicates whether the first detection algorithm is used for detecting the at least one state and the second indication indicates whether the second detection algorithm is used for detecting the at least one state.
Optionally, the at least one state may comprise two Boolean states, each corresponding to a different one of the two algorithms. Accordingly, the endpoint device 100 may maintain two Boolean state bits, i.e., the first indication and the second indication (each indicates Elephant / Non-Elephant of the flow), for each flow.
In an implementation form of the first aspect, the indication field is comprised in a header of each packet.
A second aspect of the present disclosure provides a switch for a network, wherein the switch is configured to: receive a plurality of packets from an endpoint device, wherein the plurality of packets comprises packets of one or more flows, and each packet of the plurality of packets comprises an indication field that indicates at least one state of a flow that the packet belongs to, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and process each packet based on the indication field.
The present disclosure further provides a switch that operates in accordance with the endpoint device of the first aspect. The switch or router may use the indication field to provide different services to elephant flows, including possibly assigning them to a different queue with a different priority and with different memory resources.
In an implementation form of the second aspect, the indication field comprises a first indication and/or a second indication, wherein the first indication indicates whether a first detection algorithm is used by the endpoint device for detecting the state and the second indication indicates whether a second detection algorithm is used by the endpoint device for detecting the state.
For instance, the indication may include two bits, “XY”, X for indicating whether the first detection algorithm is used for detecting the state for the flow, and Y for indicating whether the second detection algorithm is used for detecting the state for the flow.
In an implementation form of the second aspect, the first detection algorithm is used for priority assignment of multiple flows, and the second detection algorithm is for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
In an implementation form of the second aspect, the switch is configured to make a load-balancing decision, for each flow of the one or more flows, based on the indication field comprised in the packets of said flow.
In an implementation form of the second aspect, the switch is further configured to: detect a first flow that is a non-Elephant flow; and assign each packet of the first flow to a queue with a priority higher than a priority associated with an Elephant flow, or route each packet of the first flow to a path with a latency lower than a latency of a path for routing an Elephant flow.
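The two switch behaviors above (queue selection by priority and path selection for load balancing) might be sketched as follows. This is only an illustrative sketch: the queue identifiers, the path names, the round-robin pinning of elephant flows and the CRC-based hashing of non-elephant flows are all assumptions made for the example and are not specified by the disclosure.

```python
import zlib
from typing import Dict, List

# Hypothetical queue identifiers; a lower number means a higher priority here.
HIGH_PRIORITY_QUEUE = 0
LOW_PRIORITY_QUEUE = 1


def select_queue(priority_elephant: bool) -> int:
    """Non-Elephant packets go to the higher-priority queue."""
    return LOW_PRIORITY_QUEUE if priority_elephant else HIGH_PRIORITY_QUEUE


def select_path(flow_id: str, load_elephant: bool,
                elephant_paths: List[str], shared_paths: List[str],
                pinned: Dict[str, str]) -> str:
    """Pin each Elephant flow to a path; hash other flows over shared paths."""
    if not load_elephant:
        # Assumed policy: deterministic hash of the flow id over the shared paths.
        return shared_paths[zlib.crc32(flow_id.encode()) % len(shared_paths)]
    if flow_id not in pinned:
        # Assumed policy: hand out elephant paths in round-robin order.
        pinned[flow_id] = elephant_paths[len(pinned) % len(elephant_paths)]
    return pinned[flow_id]
```

In this sketch the switch keeps no detection state of its own; both decisions are driven only by the bits carried in the indication field of each packet.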
A third aspect of the present disclosure provides a method performed by an endpoint device for flow transmission in a network. The method comprises: maintaining a send queue for packets of one or more flows, wherein each flow comprises a plurality of packets; maintaining at least one state for each flow, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and providing each packet of the plurality of packets with an indication field that indicates the at least one state of the flow that the packet belongs to, and transmitting the plurality of packets.
The method of the third aspect and its implementation forms provide the same advantages and effects as described above for the endpoint device of the first aspect and its respective implementation forms.
A fourth aspect of the present disclosure provides a method performed by a switch for a network. The method comprises: receiving a plurality of packets from an endpoint device, wherein the plurality of packets comprises packets of one or more flows, and each packet of the plurality of packets comprises an indication field that indicates at least one state of a flow that the packet belongs to, wherein the at least one state includes “Elephant” or “non-Elephant”, which respectively indicates that the flow is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and processing each packet based on the indication field.
The method of the fourth aspect and its implementation forms provide the same advantages and effects as described above for the switch of the second aspect and its respective implementation forms.
A fifth aspect of the present disclosure provides a computer program comprising a program code for carrying out, when implemented on a processor, the method according to any of the third aspect and its implementation forms, or any of the fourth aspect and its implementation forms.
It has to be noted that all devices, elements, units, and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective endpoint device is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that endpoint device that performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows an endpoint device according to an embodiment of the disclosure.
FIG. 2 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
FIG. 3 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
FIG. 4 shows an exchange of packets between an endpoint device and a switch according to an embodiment of the disclosure.
FIG. 4 shows a switch according to an embodiment of the disclosure.
FIG. 5 shows a method according to an embodiment of the disclosure.
FIG. 6 shows a method according to an embodiment of the disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Illustrative embodiments of method, device, and program product for flow transmission are described with reference to the figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.
Moreover, an embodiment/example may refer to other embodiments/examples. For example, any description including but not limited to terminology, element, process, explanation and/or technical advantage mentioned in one embodiment/example is applicable to the other embodiments/examples.
FIG. 1 shows an endpoint device 100 for flow detection in a network according to an embodiment of the disclosure. The endpoint device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the endpoint device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The endpoint device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the endpoint device 100 to be performed. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the endpoint device 100 to perform, conduct or initiate the operations or methods described herein.
The endpoint device 100 is configured to maintain a send queue 101 for packets of one or more flows, wherein each flow 102 comprises a plurality of packets. Further, the endpoint device 100 is configured to maintain at least one state 1021 for each flow 102, wherein the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow. Notably, the Elephant flow consumes more network resources than the non-Elephant flow. Then, the endpoint device 100 is further configured to provide each packet 1011 of the plurality of packets with an indication field 1022 that indicates the at least one state 1021 of the flow 102 that the packet 1011 belongs to, and transmit the plurality of packets.
As previously discussed, Elephant flows typically consume a large amount of bandwidth, and exist for a long period; while Non-Elephant flows, e.g., mouse flows or mice flows, usually have low bandwidth, and sometimes live for a short period.
FIG. 2 illustrates examples of the behavior of elephant flows that are introduced along with existing mouse flows. For ease of understanding of the present disclosure, the behavior of elephant flows without and with elephant flow detection is first discussed here. As shown in FIG. 2 (a), i.e., in the case without elephant flow detection, the elephant flow consumes the queue resources and causes mouse packets to be dropped. Such behavior is not necessarily desirable, as mouse flows may be control or management traffic that may be of high importance. Mouse flows often require low latency, as the (small) amount of traffic is more sensitive to delay, and thus needs to be forwarded with a higher priority, guaranteeing a short completion time and a low probability of packet loss.
In the conventional solutions, by keeping track of the counters and timestamps it is possible to statistically evaluate flows as Elephant flows, e.g., when a flow counter exceeds a predefined threshold, or when an average time between consecutive packets of a flow is lower than a predefined threshold.
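The conventional switch-side evaluation described above might be sketched as follows, purely to illustrate why detection lags behind the start of a flow. The class name, the field names and the two threshold parameters are hypothetical and are not taken from any particular switch implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FlowStats:
    packet_count: int = 0
    first_seen: Optional[float] = None
    last_seen: Optional[float] = None


class StatisticalDetector:
    """Per-flow counter and timestamp tracking (illustrative only)."""

    def __init__(self, count_threshold: int, gap_threshold: float):
        self.count_threshold = count_threshold  # hypothetical packet-count limit
        self.gap_threshold = gap_threshold      # hypothetical mean gap limit (seconds)
        self.flows: Dict[str, FlowStats] = {}

    def observe(self, flow_id: str, now: float) -> bool:
        """Record one packet; return True if the flow is now deemed an elephant."""
        stats = self.flows.setdefault(flow_id, FlowStats())
        if stats.first_seen is None:
            stats.first_seen = now
        stats.packet_count += 1
        stats.last_seen = now
        # Rule 1: the flow counter exceeds a predefined threshold.
        if stats.packet_count > self.count_threshold:
            return True
        # Rule 2: the average time between consecutive packets is below a threshold.
        if stats.packet_count >= 2:
            mean_gap = (stats.last_seen - stats.first_seen) / (stats.packet_count - 1)
            return mean_gap < self.gap_threshold
        return False
```

With a count threshold of 3, for example, a slowly paced flow is flagged only on its fourth packet, so its first packets are handled as if they belonged to a non-elephant flow.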
FIG. 2 (b) shows a scenario in which the switch performs the elephant flow detection. It can be seen that an elephant flow is assigned to a dedicated queue with a different (e.g., lower) priority, and potentially with a different resource allocation (such as a higher drop threshold). Consequently, mouse packets are not dropped, while elephant packets are potentially dropped.
FIG. 3 shows another use case for elephant flow detection, namely, load balancing. When an elephant flow is detected, switches and routers along the path may allocate a dedicated path for the elephant flow, while forwarding other flows through different paths, as shown in FIG. 3. Notably, the conventional statistical algorithm may cause a new elephant flow to be detected only a while after it has started. That is, only after the detection, the desired behavior as shown in FIG. 2 (b) may be reached. During the transient detection time, the switch may be congested, thus potentially causing mouse packets to be dropped as shown in FIG. 2 (a). Since the detection is not instantaneous, temporary congestion may occur before a flow is assigned a dedicated path as shown in FIG. 3.
This disclosure thus introduces a new elephant flow detection algorithm that is performed at the endpoint, rather than in switches/routers. The endpoint, i.e., the endpoint device 100 as shown in FIG. 1, which is potentially a hardware-enabled Smart NIC, uses the information it has about the flows to detect which flows are elephants.
According to this disclosure, a method of detecting an RDMA elephant flow and processing it by network devices is defined. That is, the endpoint device 100 is an endpoint device that generates RDMA messages. The endpoint device 100 may be a server with an NIC that may have hardware support for RDMA transmission. In RDMA transmissions, an endpoint has information about RDMA flows, and is thus able to detect which flows are elephants before the flows start.
According to an embodiment of this disclosure, each flow 102 as shown in FIG. 1 is an RDMA flow. Accordingly, each packet 1011 of the send queue 101 is an RDMA packet. Optionally, a transaction of the send queue 101 is a WQE. Notably, a WQE is an RDMA operation or transaction that is posted or issued by a source application or software.
The endpoint device 100 may use an “elephant indication”, i.e., the indication field 1022, in the packet header to indicate to network switches/routers which packets belong to elephant flows.
Optionally, the indication field 1022 is comprised in a header of each packet 1011. Possibly, the indication field 1022 may be part of a proprietary header or may be incorporated in an existing field of an RDMA packet, such as the differentiated services code point (DSCP) field in the IP header.
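As a minimal sketch of the DSCP option, the following code sets or clears one bit of the DSCP value inside the 8-bit IP Traffic Class / TOS byte, in which DSCP occupies the upper six bits and ECN the lower two. The choice of the lowest DSCP bit as the "elephant indication" is an assumption made only for this example; the disclosure does not fix a bit position.

```python
# Hypothetical choice: the least significant DSCP bit carries the indication.
ELEPHANT_DSCP_BIT = 0x01


def set_elephant_bit(tos: int, elephant: bool) -> int:
    """Return the TOS / Traffic Class byte with the indication bit updated."""
    dscp = (tos >> 2) & 0x3F   # upper six bits of the byte
    ecn = tos & 0x03           # lower two bits are left untouched
    if elephant:
        dscp |= ELEPHANT_DSCP_BIT
    else:
        dscp &= ~ELEPHANT_DSCP_BIT & 0x3F
    return (dscp << 2) | ecn


def is_elephant(tos: int) -> bool:
    """Read the indication bit back, as a switch or router might."""
    return bool((tos >> 2) & ELEPHANT_DSCP_BIT)
```

Reusing an existing header field in this way avoids a proprietary header, at the cost of reserving part of the DSCP code space for the indication.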
According to an embodiment of this disclosure, the endpoint device 100 may be configured to obtain the length of the new transaction or the following new transaction by checking a length field of that transaction. Optionally, the endpoint device 100 that sends RDMA packets may use the DMA Length field to detect elephant flows and assign the value of the “elephant indication” bit, i.e., the indication field 1022. For instance, for each RDMA flow (Queue Pair), the endpoint device 100 keeps a Boolean state bit: Elephant / Non-Elephant. The state is incorporated into the packet’s “elephant indication” field.
According to an embodiment of this disclosure, the endpoint device 100 may be configured to record a timestamp of transmission for each transmitted packet 1011.
The solution of the disclosure may include two different mechanisms: an elephant flow detection mechanism for priority assignment, and a detection mechanism for load balancing. Specifically, the disclosure defines two elephant detection algorithms that are performed by the endpoint device 100.
According to an embodiment of this disclosure, the at least one state 1021 of each flow 102 is detected based on a first detection algorithm, e.g., priority assignment of multiple flows. Optionally, a new flow is by default assigned the non-Elephant state, i.e., the default state of a flow is “non-Elephant”.
Optionally, the endpoint device 100 may be configured to set the at least one state 1021 of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue 101, and a length of the new transaction exceeds a pre-determined length.
Notably, for each flow, the timestamp of the last packet that is transmitted is kept by the endpoint device 100. According to an embodiment of this disclosure, the endpoint device 100 may be configured to set the at least one state 1021 of the first flow back to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue 101, a length of the following new transaction does not exceed the pre-determined length, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
Optionally, the new transaction that is queued in the send queue 101 is an RDMA transaction, i.e., a WQE.
According to another embodiment of this disclosure, the at least one state 1021 of each flow 102 is detected based on a second detection algorithm, e.g., load balancing. Optionally, a new flow is by default assigned the non-Elephant state, i.e., the default state of a flow is “non-Elephant”.
For the use case of load balancing, the rate (bandwidth) of each Queue Pair, or the rate of the flow, may be continuously measured and used in the decision of the state.
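The state transitions of the first detection algorithm might be sketched as follows. The names `PriorityFlowState`, `on_transaction_queued`, `on_packet_transmitted`, `max_len` and `idle_limit` are hypothetical; the pre-determined length and the predetermined time limit are left as parameters because the disclosure does not give concrete values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriorityFlowState:
    elephant: bool = False                # default state is "non-Elephant"
    last_tx_time: Optional[float] = None  # timestamp of the last transmitted packet


def on_packet_transmitted(state: PriorityFlowState, now: float) -> None:
    """Record the transmission timestamp of the most recent packet of the flow."""
    state.last_tx_time = now


def on_transaction_queued(state: PriorityFlowState, length: int, now: float,
                          max_len: int, idle_limit: float) -> bool:
    """Update the state when a new transaction (e.g., a WQE) is queued."""
    if length > max_len:
        # A long transaction marks the flow as an Elephant immediately.
        state.elephant = True
    elif (state.elephant and state.last_tx_time is not None
          and now - state.last_tx_time > idle_limit):
        # A short transaction after a sufficiently long idle period resets the state.
        state.elephant = False
    return state.elephant
```

Because the decision is taken when the transaction is queued, the state is already known before the first packet of that transaction is transmitted.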
Optionally, the endpoint device 100 may be configured to set the at least one state 1021 of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue 101, and at least one of the following conditions is met: a length of the new transaction exceeds a pre-determined length, or a rate of the first flow exceeds a first pre-determined rate, e.g., R1.
According to an embodiment of this disclosure, the endpoint device 100 may be configured to set back the at least one state 1021 of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue 101, a length of the following new transaction does not exceed the pre-determined length, the rate of the first flow is below a second pre-determined rate, e.g., R2, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
It may be worth mentioning that the second pre-determined rate R2 is not greater than the first pre-determined rate R1, that is, R2 ≤ R1.
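Because R2 ≤ R1, the second detection algorithm behaves like a threshold pair with hysteresis: a flow whose rate lies between R2 and R1 keeps its current state. The sketch below illustrates this under stated assumptions. The names `LoadFlowState` and `update_load_state` and all numeric thresholds are hypothetical, and the measured rate is passed in as a parameter because the disclosure does not specify how the rate is measured.

```python
from dataclasses import dataclass

# Hypothetical values; the disclosure only says "pre-determined".
LENGTH_THRESHOLD_BYTES = 64 * 1024
IDLE_TIME_LIMIT_S = 100e-6
R1_BPS = 10e9   # first pre-determined rate
R2_BPS = 5e9    # second pre-determined rate, not greater than R1


@dataclass
class LoadFlowState:
    """Per-flow state kept by the endpoint for the second detection algorithm."""
    is_elephant: bool = False                  # default state is "non-Elephant"
    last_tx_timestamp: float = float("-inf")   # timestamp of the last transmitted packet


def update_load_state(flow: LoadFlowState, length_bytes: int,
                      rate_bps: float, now: float) -> bool:
    """Update and return the state when a new transaction is queued."""
    if length_bytes > LENGTH_THRESHOLD_BYTES or rate_bps > R1_BPS:
        # Either condition alone is enough to set "Elephant".
        flow.is_elephant = True
    elif rate_bps < R2_BPS and now - flow.last_tx_timestamp > IDLE_TIME_LIMIT_S:
        # All set-back conditions hold: short transaction, low rate, idle flow.
        flow.is_elephant = False
    # Otherwise the state is kept (e.g. rate between R2 and R1).
    return flow.is_elephant
```

Choosing R2 strictly below R1 in this sketch prevents a flow whose rate hovers around a single threshold from toggling between the two states.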
Optionally, the packet header may include up to two “elephant indication” bits, one for each of the mechanisms. According to an embodiment of this disclosure, the indication field 1022 may comprise a first indication and/or a second indication, wherein the first indication indicates whether the first detection algorithm is used for detecting the at least one state 1021 and the second indication indicates whether the second detection algorithm is used for detecting the at least one state 1021. Optionally, the at least one state 1021 may comprise two Boolean states, each corresponding to a different one of the two algorithms. Accordingly, the endpoint device 100 may maintain two Boolean state bits (each indicates Elephant / Non-Elephant of the flow) for each flow.
For instance, the indication field 1022 may include two bits, “XY”, X for indicating whether the first detection algorithm is used for detecting the at least one state 1021, and Y for indicating whether the second detection algorithm is used for detecting the at least one state 1021. In an example, when the indication field 1022 carries “10”, it indicates that the elephant flow detection is performed for priority assignment, and the current flow is an elephant flow. When the indication field 1022 carries “01”, it indicates that the elephant flow detection is performed for load balancing, and the current flow is an elephant flow. When the indication field 1022 carries “00”, it may indicate that the current flow is a non-Elephant flow.
It should be noted that this is merely provided as examples for ease of understanding. There may be other possible implementations to indicate the detection algorithm that is used for the elephant flow detection.
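One possible way to pack and unpack the two-bit “XY” field described above is sketched here. The function names `encode_indication` and `decode_indication` are hypothetical, and treating “11” as "Elephant under both algorithms" is an assumption that follows from keeping one Boolean state per algorithm; the example values given in the text are only “10”, “01” and “00”.

```python
from typing import Tuple


def encode_indication(priority_elephant: bool, load_elephant: bool) -> int:
    """Pack the two Boolean states into a 2-bit field "XY".

    X (high bit): Elephant according to the first (priority assignment) algorithm.
    Y (low bit):  Elephant according to the second (load balancing) algorithm.
    """
    return (int(priority_elephant) << 1) | int(load_elephant)


def decode_indication(field: int) -> Tuple[bool, bool]:
    """Return (priority_elephant, load_elephant) from a 2-bit field."""
    return bool(field & 0b10), bool(field & 0b01)


# Render the field as the two-character string used in the text.
print(format(encode_indication(True, False), "02b"))   # X set, Y clear
print(format(encode_indication(False, True), "02b"))   # X clear, Y set
```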
FIG. 4 shows a switch 200 for a network according to an embodiment of the disclosure. The switch 200 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the switch 200 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as ASICs, FPGAs, DSPs, or multi-purpose processors. The switch 200 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the switch 200 to be performed. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the switch 200 to perform, conduct or initiate the operations or methods described herein.
In particular, the switch 200 is configured to receive a plurality of packets from an endpoint device 100. Possibly, the endpoint device 100 here may be the endpoint device 100 shown in FIG. 1. In particular, the plurality of packets comprises packets of one or more flows, and each packet 1011 of the plurality of packets comprises an indication field 1022 that indicates at least one state 1021 of a flow 102 that the packet belongs to. The at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow. The switch 200 may be further configured to process each packet 1011 based on the indication field 1022.
Embodiments of the present disclosure further provide a switch 200 that operates according to the endpoint device 100 as previously described in this disclosure. For ease of description, the term “switch” is used here. It should be noted that it could generally be a network device such as a router as well. A switch or router may use the indication field 1022, which may also be referred to as “elephant indication” bit, in its routing or load balancing decision. In particular, a switch or router may use the indication field 1022 to provide different services to elephant flows, including possibly assigning them to a different queue with a different priority and with different memory resources.
According to an embodiment of the disclosure, the indication field 1022 comprises a first indication and/or a second indication, wherein the first indication indicates whether a first detection algorithm is used by the endpoint device 100 for detecting the state 1021 and the second indication indicates whether a second detection algorithm is used by the endpoint device 100 for detecting the state 1021.
Optionally, the first detection algorithm is used by the switch 200 for priority assignment of multiple flows, and the second detection algorithm is for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
According to an embodiment of the disclosure, the switch 200 may be configured to make a load-balancing decision, for each flow 102 of the one or more flows, based on the indication field 1022 comprised in the packets of said flow. The switch 200 may use the elephant flow indication field 1022 as part of the load balancing decision. For example, each elephant flow may be assigned its own separate path, while non-elephant flows are forwarded together through one or more different paths.
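The example policy above (a separate path per elephant flow, shared paths for the rest) could be sketched as follows. This is an illustration under assumptions, not the disclosed switch logic: the class `PathSelector`, the path names, the hash-based spreading of non-elephant flows, the fallback when no dedicated path is free, and the release of a dedicated path when a flow returns to non-Elephant are all hypothetical choices.

```python
from typing import Dict, Hashable, List


class PathSelector:
    """Hypothetical load-balancing helper driven by the load-balancing bit."""

    def __init__(self, shared_paths: List[str], dedicated_paths: List[str]):
        self.shared_paths = shared_paths
        self.free_dedicated = list(dedicated_paths)
        self.elephant_path: Dict[Hashable, str] = {}

    def _shared(self, flow_id: Hashable) -> str:
        # Non-elephant flows are spread over the shared paths by hashing.
        return self.shared_paths[hash(flow_id) % len(self.shared_paths)]

    def select(self, flow_id: Hashable, load_elephant: bool) -> str:
        if load_elephant:
            if flow_id not in self.elephant_path:
                if not self.free_dedicated:
                    # Assumption: fall back to a shared path when none is free.
                    return self._shared(flow_id)
                self.elephant_path[flow_id] = self.free_dedicated.pop(0)
            return self.elephant_path[flow_id]
        # Flow is (or has become) non-Elephant: release any dedicated path.
        released = self.elephant_path.pop(flow_id, None)
        if released is not None:
            self.free_dedicated.append(released)
        return self._shared(flow_id)
```

A caller would invoke `select(flow_id, load_elephant)` per packet, with `load_elephant` taken from the load-balancing bit of the indication field.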
According to another embodiment of the disclosure, the switch 200 may be configured to detect a first flow that is a non-Elephant flow. In particular, the switch 200 may be configured to assign each packet of the first flow to a queue with a priority higher than a priority associated with an Elephant flow, or route each packet of the first flow to a path with a latency lower than the latency of a path for routing an Elephant flow.
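The queue assignment can be illustrated with a two-queue strict-priority port. This is a minimal sketch under assumptions: the class name `StrictPrioritySwitchPort`, the use of exactly two queues, strict-priority scheduling, and reading the priority bit as the high bit of a two-bit field are not specified by the disclosure.

```python
from collections import deque


class StrictPrioritySwitchPort:
    """Hypothetical egress port with one high- and one low-priority queue."""

    PRIORITY_ELEPHANT_BIT = 0b10  # assumed position of the priority-assignment bit

    def __init__(self):
        self.high = deque()   # packets of non-Elephant flows
        self.low = deque()    # packets of Elephant flows

    def enqueue(self, packet, indication_field: int) -> None:
        """Place the packet in a queue according to its indication field."""
        if indication_field & self.PRIORITY_ELEPHANT_BIT:
            self.low.append(packet)
        else:
            self.high.append(packet)

    def dequeue(self):
        """Serve the high-priority queue first; return None when both are empty."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

With this scheduling, a packet of a non-Elephant flow that arrives behind queued Elephant packets is still transmitted first, which is the intended effect of the higher priority.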
In a particular embodiment, a queue pair may be initialized as a non-Elephant by default, and then 1000 short WQEs are sent back-to-back. Due to the high rate of the queue pair, it is treated as an Elephant flow for load balancing purposes, and the load state, i.e., the state 1021 when it is detected by using the load balancing algorithm, may change to “Elephant”. For priority assignment purposes, it is not treated as an Elephant flow, and will still be assigned a high priority, in order to allow fast delivery times for short WQEs. Thus, the priority state, i.e., the state 1021 when it is detected by using the priority assignment algorithm, remains in the ‘non-Elephant’ value.
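The scenario above can be worked through numerically. The figures in this sketch (WQE size, spacing between WQEs, length threshold, and rate R1) are hypothetical and chosen only so that the rate condition is met while the length condition is not; the rate is estimated naively as total bits sent divided by elapsed time, which is also an assumption.

```python
# Hypothetical parameters for the 1000-short-WQE scenario.
LENGTH_THRESHOLD_BYTES = 64 * 1024   # "pre-determined length"
R1_BPS = 10e9                        # first pre-determined rate
WQE_BYTES = 4096                     # a "short" WQE, below the length threshold
WQE_INTERVAL_S = 1e-6                # back-to-back spacing between WQEs

priority_elephant = False   # priority state, "non-Elephant" by default
load_elephant = False       # load state, "non-Elephant" by default
bytes_sent = 0

for i in range(1, 1001):
    bytes_sent += WQE_BYTES
    rate_bps = bytes_sent * 8 / (i * WQE_INTERVAL_S)   # naive average rate

    # First algorithm: only the transaction length matters.
    if WQE_BYTES > LENGTH_THRESHOLD_BYTES:
        priority_elephant = True
    # Second algorithm: length or rate may trigger "Elephant".
    if WQE_BYTES > LENGTH_THRESHOLD_BYTES or rate_bps > R1_BPS:
        load_elephant = True

# Two-bit "XY" rendering: X = priority state, Y = load state.
indication = f"{int(priority_elephant)}{int(load_elephant)}"
```

With these numbers the average rate is 4096 × 8 bits per microsecond, about 32.8 Gbit/s, which exceeds the assumed R1, so the load state becomes “Elephant” while the priority state stays “non-Elephant”.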
It should be noted that the method proposed by embodiments of this disclosure guarantees that packets are delivered in the order in which they are transmitted with a very high probability even when the state changes from “Non-Elephant” to “Elephant”, or vice versa.
When a state of a flow is switching from Non-Elephant to Elephant, new packets will be assigned lower priority by the switch, and thus will have a higher latency, preventing out-of-order delivery.
When a state of a flow is switching from Elephant to Non-Elephant, the requirement that the flow waits for a duration of at least RTT since the last packet allows previous traffic to be delivered before new Non-Elephant packets are sent.
To summarize, according to embodiments of this disclosure, elephant flow detection can be performed immediately instead of statistically. There is no transient time that causes temporary packet loss. This disclosure thus enables lower packet loss, lower delivery time, and lower congestion in scenarios where elephant and mouse flows are transmitted in parallel. In addition, combined endpoint and switch/router functionality provide added value to a network that uses the combined server and switch/router functionality. The disclosure allows detecting RDMA elephant flows based on a combination of components such as WQE length, idle time, and traffic rate.
FIG. 5 shows a method 500 for flow transmission in a network according to an embodiment of the disclosure. In a particular embodiment of the disclosure, the method 500 is performed by an endpoint device 100 shown in FIG. 1 or FIG. 4. The method 500 comprises a step 501 of maintaining a send queue 101 for packets of one or more flows, wherein each flow 102 comprises a plurality of packets. The method 500 further comprises a step 502 of maintaining at least one state 1021 for each flow 102. In particular, the at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow. Notably, the Elephant flow consumes more network resources than the non-Elephant flow. Then, the method 500 further comprises a step 503 of providing each packet 1011 of the plurality of packets with an indication field 1022 that indicates the at least one state 1021 of the flow 102 that the packet belongs to, and transmitting the plurality of packets.
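Step 503 amounts to stamping every packet of a transaction with the current state of its flow before transmission. The sketch below shows this for a single transaction; the `Packet` layout, the `packetize` helper, and the `MTU_BYTES` value are hypothetical, and real RDMA headers are not modelled.

```python
from dataclasses import dataclass
from typing import List

MTU_BYTES = 4096  # hypothetical maximum payload per packet


@dataclass
class Packet:
    flow_id: int
    indication: int    # indication field carried in the packet header
    payload_len: int


def packetize(flow_id: int, transaction_len: int, indication: int) -> List[Packet]:
    """Split one queued transaction into packets that all carry the flow's state."""
    packets = []
    remaining = transaction_len
    while remaining > 0:
        chunk = min(MTU_BYTES, remaining)
        packets.append(Packet(flow_id, indication, chunk))
        remaining -= chunk
    return packets
```

For example, `packetize(7, 10000, 0b10)` yields three packets in this sketch, each tagged with the same indication value.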
FIG. 6 shows a method 600 for flow transmission in a network according to an embodiment of the disclosure. In a particular embodiment of the disclosure, the method 600 is performed by a switch 200 shown in FIG. 4. The method 600 comprises a step 601 of receiving a plurality of packets from an endpoint device 100. In particular, the plurality of packets comprises packets of one or more flows, and each packet 1011 of the plurality of packets comprises an indication field 1022 that indicates at least one state 1021 of a flow 102 that the packet belongs to. The at least one state 1021 includes “Elephant” or “non-Elephant”, which respectively indicates that the flow 102 is an Elephant flow or is a non-Elephant flow. The method 600 further comprises a step 602 of processing each packet 1011 based on the indication field 1022.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.
Furthermore, any method according to embodiments of the disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method. The computer program is included in a computer readable medium of a computer program product. The computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
Moreover, it is realized by the skilled person that embodiments of the endpoint device 100, or the switch 200, comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution. Examples of other such means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.
Especially, the processor(s) of the endpoint device 100, or the switch 200, may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an ASIC, a microprocessor, or other processing logic that may interpret and execute instructions. The expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Claims

1. An endpoint device (100) for flow transmission in a network, wherein the endpoint device (100) is configured to: maintain a send queue (101) for packets of one or more flows, wherein each flow (102) comprises a plurality of packets; maintain at least one state (1021) for each flow (102), wherein the at least one state (1021) includes “Elephant” or “non-Elephant”, which respectively indicates that the flow (102) is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; provide each packet (1011) of the plurality of packets with an indication field (1022) that indicates the at least one state (1021) of the flow (102) that the packet (1011) belongs to, and transmit the plurality of packets.
2. The endpoint device (100) according to claim 1, wherein each flow (102) is a Remote Direct Memory Access, RDMA, flow, and each packet (1011) is an RDMA packet.
3. The endpoint device (100) according to claim 1 or 2, configured to: record a timestamp of transmission for each transmitted packet (1011).
4. The endpoint device (100) according to one of the claims 1 to 3, wherein the at least one state (1021) of each flow (102) is detected based on a first detection algorithm, wherein the endpoint device (100) is configured to: set the at least one state (1021) of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue (101), and a length of the new transaction exceeds a pre-determined length.
5. The endpoint device (100) according to claim 4 and claim 3, configured to: set the at least one state (1021) of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue (101), a length of the following new transaction does not exceed the pre-determined length, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
6. The endpoint device (100) according to one of the claims 1 to 3, wherein the at least one state (1021) of each flow (102) is detected based on a second detection algorithm, wherein the endpoint is configured to: set the at least one state (1021) of a first flow to “Elephant”, when it is determined that: a new transaction is queued in the send queue (101), and at least one of the following conditions is met: a length of the new transaction exceeds a pre-determined length, or a rate of the first flow exceeds a first pre-determined rate.
7. The endpoint device (100) according to claim 6 and claim 3, configured to: set the at least one state (1021) of the first flow to “non-Elephant”, when it is determined that: a following new transaction is queued in the send queue (101), a length of the following new transaction does not exceed the pre-determined length, the rate of the first flow is below a second pre-determined rate, and a time period since the timestamp of a previous packet is greater than a predetermined time limit.
8. The endpoint device (100) according to one of the claims 4 to 7, wherein a transaction of the send queue (101) is a working queue element, WQE.
9. The endpoint device (100) according to one of the claims 4 to 8, configured to: obtain the length of the new transaction or the following new transaction by checking a length field of that transaction.
10. The endpoint device (100) according to one of the claims 4 to 9, wherein the first detection algorithm is used for priority assignment of multiple flows, and the second detection algorithm is used for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
11. The endpoint device (100) according to one of the claims 4 to 10, wherein the indication field (1022) comprises a first indication and/or a second indication, wherein the first indication indicates whether the first detection algorithm is used for detecting the at least one state (1021) and the second indication indicates whether the second detection algorithm is used for detecting the at least one state (1021).
12. The endpoint device (100) according to one of the claims 1 to 11, wherein the indication field (1022) is comprised in a header of each packet (1011).
13. A switch (200) for a network, configured to: receive a plurality of packets from an endpoint device (100), wherein the plurality of packets comprises packets of one or more flows, and each packet (1011) of the plurality of packets comprises an indication field (1022) that indicates at least one state (1021) of a flow (102) that the packet belongs to, wherein the at least one state (1021) includes “Elephant” or “non-Elephant”, which respectively indicates that the flow (102) is an Elephant flow or non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and process each packet (1011) based on the indication field (1022).
14. The switch (200) according to claim 13, wherein the indication field (1022) comprises a first indication and/or a second indication, wherein the first indication indicates whether a first detection algorithm is used by the endpoint device (100) for detecting the state (1021) and the second indication indicates whether a second detection algorithm is used by the endpoint device (100) for detecting the state (1021).
15. The switch (200) according to claim 14, wherein the first detection algorithm is used for priority assignment of multiple flows, and the second detection algorithm is for load balancing of multiple flows or of a plurality of packets of an Elephant flow.
16. The switch (200) according to one of the claims 12 to 15, configured to make a load-balancing, for each flow (102) of the one or more flows, based on the indication field (1022) comprised in the packets of said flow.
17. The switch (200) according to one of the claims 12 to 16, configured to: detect a first flow that is a non-Elephant flow; and assign each packet of the first flow to a queue with a priority higher than a priority associated with an Elephant flow, or route each packet of the first flow to a path with a latency lower than a latency of a path for routing an Elephant flow.
18. A method (500) for flow transmission in a network, wherein the method comprises: maintaining (501) a send queue (101) for packets of one or more flows, wherein each flow (102) comprises a plurality of packets; maintaining (502) at least one state (1021) for each flow (102), wherein the at least one state (1021) includes “Elephant” or “non-Elephant”, which respectively indicates that the flow (102) is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; providing (503) each packet (1011) of the plurality of packets with an indication field (1022) that indicates the at least one state (1021) of the flow (102) that the packet belongs to, and transmitting the plurality of packets.
19. A method (600) for a network switch, comprising: receiving (601) a plurality of packets from an endpoint device (100), wherein the plurality of packets comprises packets of one or more flows, and each packet (1011) of the plurality of packets comprises an indication field (1022) that indicates at least one state (1021) of a flow (102) that the packet belongs to, wherein the at least one state (1021) includes “Elephant” or “non-Elephant”, which respectively indicates that the flow (102) is an Elephant flow or is a non-Elephant flow, wherein the Elephant flow consumes more network resources than the non-Elephant flow; and processing (602) each packet (1011) based on the indication field (1022).
20. A computer program product comprising a program code for carrying out, when implemented on a processor, the method according to claim 18 or 19.
PCT/EP2021/075189 2021-09-14 2021-09-14 A device and method for flow detection and processing WO2023041142A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2021/075189 WO2023041142A1 (en) 2021-09-14 2021-09-14 A device and method for flow detection and processing
CN202180102086.7A CN117917061A (en) 2021-09-14 2021-09-14 Apparatus and method for stream detection and processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/075189 WO2023041142A1 (en) 2021-09-14 2021-09-14 A device and method for flow detection and processing

Publications (1)

Publication Number Publication Date
WO2023041142A1 true WO2023041142A1 (en) 2023-03-23

Family

ID=77914318

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/075189 WO2023041142A1 (en) 2021-09-14 2021-09-14 A device and method for flow detection and processing

Country Status (2)

Country Link
CN (1) CN117917061A (en)
WO (1) WO2023041142A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150163144A1 (en) * 2013-12-09 2015-06-11 Nicira, Inc. Detecting and handling elephant flows
US20190190838A1 (en) * 2017-12-18 2019-06-20 Mellanox Technologies, Ltd. Elephant Flow Detection in Network Access
CN111277467A (en) * 2020-01-23 2020-06-12 华为技术有限公司 Communication device, data stream identification method and related equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEE SUGI ET AL: "Efficient User-Level Multi-Path Utilization in RDMA Networks", IEEE ACCESS, IEEE, USA, vol. 9, 7 September 2021 (2021-09-07), pages 127619 - 127629, XP011879317, DOI: 10.1109/ACCESS.2021.3110840 *

Also Published As

Publication number Publication date
CN117917061A (en) 2024-04-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21777690

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180102086.7

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE