CN117880198A - Congestion management based on flow pruning - Google Patents


Info

Publication number
CN117880198A
Authority
CN
China
Prior art keywords
payload
packets
header
packet
header packet
Prior art date
Legal status
Pending
Application number
CN202311314282.8A
Other languages
Chinese (zh)
Inventor
Keith D. Underwood
Current Assignee
Hewlett Packard Enterprise Development LP
Original Assignee
Hewlett Packard Enterprise Development LP
Priority date
Filing date
2023-10-11
Publication date
2024-04-12
Priority claimed from US 18/479,803 (published as US 2024/0121189 A1)
Application filed by Hewlett Packard Enterprise Development LP
Publication of CN117880198A


Abstract

The present disclosure relates to congestion management based on flow pruning. A networking device that facilitates efficient congestion management is provided. During operation, the device may receive, via a network, a plurality of packets comprising portions of a data segment sent from a sender device to a receiver device. The device may identify, from the plurality of packets, one or more payload packets that include the payload of the data segment and at least a header packet that includes the header information of the data segment and a header packet indicator. The device may determine whether congestion is detected at the receiver device based on the number of sender devices sending packets to the receiver device via the device. Upon determining congestion at the receiver device, the device may perform flow pruning by forwarding the header packet to the receiver device and discarding a subset of the one or more payload packets.

Description

Congestion management based on flow pruning
Cross Reference to Related Applications
The present application claims the benefit of U.S. provisional application No. 63/379,079, attorney docket No. P170847USPRV, filed by inventors Keith D. Underwood and Duncan Roweth on October 11, 2022, entitled "Systems and Methods for Implementing Congestion Management and Encryption".
Background
High-Performance Computing (HPC) can generally facilitate efficient computing on nodes running applications. An HPC environment can facilitate high-speed data transfer between a sender device and a receiver device.
Drawings
Fig. 1A illustrates an example of flow-pruning-based congestion management in a network in accordance with an aspect of the present application.
Fig. 1B illustrates an example of a packet that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application.
Fig. 2A presents a flowchart illustrating an example of a process by which the forwarding hardware of a switch facilitates flow pruning in accordance with an aspect of the present application.
Fig. 2B presents a flowchart illustrating an example of a process by which a Network Interface Controller (NIC) of a computing device generates packets from a data segment in accordance with an aspect of the present application.
Fig. 3 illustrates an example of communication that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application.
Fig. 4A presents a flowchart illustrating an example of a process by which a sender device generates packets from a data stream in accordance with an aspect of the present application.
Fig. 4B presents a flowchart illustrating an example of a process by which a switch applies flow pruning to packets in accordance with an aspect of the present application.
Fig. 5 presents a flowchart illustrating an example of a process by which a receiver device processes a packet in accordance with an aspect of the present application.
Fig. 6 illustrates an example of a computing system that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application.
Fig. 7 illustrates an example of a computer-readable memory device that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application.
Fig. 8 illustrates an example of a switch that supports flow-pruning-based congestion management in accordance with an aspect of the present application.
In the drawings, like reference numerals refer to like elements.
Detailed Description
As applications become increasingly distributed, HPC can facilitate efficient computation on the nodes running those applications. An HPC environment may include computing nodes, storage nodes, and high-capacity switches coupling the nodes. Accordingly, the HPC environment may include a high-bandwidth, low-latency network formed by the switches. Typically, the computing nodes form clusters. A cluster may be coupled to the storage nodes via the network. The computing nodes may run one or more applications in parallel in the cluster. The storage nodes may record the output of the computations performed on the computing nodes. In this way, the computing nodes and the storage nodes may interoperate with each other to facilitate high-performance computing.
To ensure the desired level of performance, a respective node needs to operate at the operating rate of the other nodes. For example, after a computing node generates a piece of data, a storage node needs to receive that data from the computing node promptly. Here, the storage node and the computing node operate as a receiver device and a sender device, respectively. On the other hand, if the computing node obtains data from the storage node, the storage node and the computing node operate as a sender device and a receiver device, respectively. In some examples, the HPC environment may deploy Edge Queuing Datagram Service (EQDS), which can provide a datagram service to higher layers via dynamic channels in the network. EQDS may encapsulate Transmission Control Protocol (TCP) and Remote Direct Memory Access (RDMA) packets.
EQDS may use a credit-based mechanism in which the receiver device issues credits to the sender device to control when packets are pulled from it. As a result, the switches in the network can avoid over-utilizing their buffers, ensuring that a standing queue never forms in the network. Transport protocols typically generate and exchange data streams. To allow a transport protocol instance to send its data stream, the sender device may maintain a queue or buffer for the stream. The receiver device can control that queue because the receiver device issues the messages that pace the data flow. When many sender devices attempt to send data to one receiver device, incast can occur in the network, causing a high degree of congestion at the receiver device. Consequently, high-performance networks (such as data center networks) can require efficient congestion management, especially during incast, to ensure high-speed data transfer.
Aspects described herein address the problem of efficient congestion management in a network by: (i) generating separate packets for the header and the payload of a transport-layer data stream; and (ii) upon detecting congestion, discarding the payload packets at a switch while forwarding the header packet. The header packet can be significantly smaller than a payload packet and may include control information indicating the payload data that is to follow. Hence, the switch can deliver header packets with low bandwidth utilization while ensuring that the receiver device is aware of the subsequent data. In this way, the switch can selectively drop packets to trim a flow, thereby alleviating congestion.
With existing techniques, data transfer from multiple sender devices to one receiver device may cause congestion at a switch and reduce the throughput of the data streams. This many-to-one communication pattern may be referred to as "incast". In general, to mitigate the effects of congestion, a switch that detects congestion may restrict traffic from the sender devices. For example, if EQDS is deployed in a network (e.g., a data center network), the sender devices may be responsible for queuing. When the receiver device signals for a transmission, the corresponding sender device transmits a packet, thereby allowing the receiver device to divide its bandwidth among the senders. To avoid congestion in the network, the switch may also prune a packet by removing the packet's payload while retaining the header.
Packet pruning forwards the semantic information (e.g., the header) of a packet while discarding the payload. The semantic information may include information that quantifies the data in the payload and identifies the distribution of the data (e.g., the number of bytes in the payload and the number of packets carrying the payload). However, packets may be encrypted to ensure end-to-end security. Such a packet may include an Integrity Check Value (ICV) at some location in the packet (e.g., before or after the payload). Any modification to the packet may render the ICV invalid. Specifically, if the switch performs packet pruning, the receiver device receives only the header of the packet. Since the ICV is calculated over the entire packet, the header information is insufficient to reproduce the ICV. Consequently, the receiver device will not be able to verify the pruned packet.
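To make this failure mode concrete, consider the following minimal sketch, with HMAC-SHA-256 standing in for whatever ICV algorithm a deployment actually uses; the key and packet layout are illustrative assumptions, not the encoding defined by this disclosure:

```python
import hashlib
import hmac

KEY = b"shared-session-key"  # assumed symmetric key (illustrative)

def compute_icv(packet: bytes) -> bytes:
    # The ICV covers the entire packet (header + payload).
    return hmac.new(KEY, packet, hashlib.sha256).digest()

header = b"HDR|seq=42|len=1400|"
payload = b"\x00" * 1400
icv = compute_icv(header + payload)

# An unmodified packet verifies.
assert hmac.compare_digest(compute_icv(header + payload), icv)

# A pruned packet does not: with only the header, the receiver cannot
# reproduce an ICV that was computed over header + payload.
assert not hmac.compare_digest(compute_icv(header), icv)
```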
To address this issue, when the NIC of the sender device receives a data segment (e.g., a TCP segment) from an upper layer, the NIC may synthesize a packet that includes the semantic information of the segment. Since the header information of the data segment may include the semantic information, the synthesized packet may include the header information of the data segment. In particular, the semantic information may include all fields of the segment's header that can be used to quantify the data in the payload and identify the distribution of the data. The NIC may include a header packet indicator in this packet. The indicator marks the packet as one that should be forwarded to the receiver device. In other words, if selective packet dropping occurs due to network congestion, the indicator marks the packet as "less suitable" for dropping (i.e., less likely to be dropped). The NIC may also generate one or more payload packets that include the payload of the segment. The NIC may omit the indicator from these packets, indicating that they are "more suitable" for selective dropping.
The NIC may then generate a corresponding ICV for each of these packets and encrypt the packets individually. Subsequently, the NIC may send the header packet to the receiver device. When the switch detects network congestion, instead of performing packet pruning, the switch may check each packet for the indicator and selectively discard the payload packets. The switch may detect congestion based on the degree of incast at the receiver device. The incast degree may indicate the number of sender devices sending packets to the receiver device. The switch may compare the current incast degree at the receiver device (which may be determined from the number of flows traversing the switch toward the receiver device) with a threshold. The threshold may indicate a predetermined number of sender devices sending packets to the receiver device. If the current incast degree at the receiver device reaches the threshold, the switch may determine that there is congestion at the receiver device. However, the switch may continue to forward the header packets to the receiver device. The receiver device may use the corresponding ICVs to validate the header packets. The receiver device may then schedule packets from the different sender devices based on the semantic information in the header packets. In this way, the semantic information of different flows can be forwarded to the receiver device while verification based on the corresponding ICVs remains possible.
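The incast-degree check can be sketched as follows; the per-receiver sender sets and the threshold of 8 senders are illustrative assumptions about how a switch might track flows, not values prescribed by this disclosure:

```python
from collections import defaultdict

INCAST_THRESHOLD = 8  # assumed predetermined number of senders per receiver

class IncastMonitor:
    """Tracks the incast degree (distinct senders per receiver) at a switch."""

    def __init__(self) -> None:
        self.senders_per_receiver: dict[str, set[str]] = defaultdict(set)

    def observe_flow(self, sender: str, receiver: str) -> None:
        # Called for each flow traversing the switch toward the receiver.
        self.senders_per_receiver[receiver].add(sender)

    def congested(self, receiver: str) -> bool:
        # Congestion is inferred once the incast degree reaches the threshold.
        return len(self.senders_per_receiver[receiver]) >= INCAST_THRESHOLD
```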
The receiver device may then obtain the data from the corresponding sender devices at the scheduled times. Since the number of bytes in the semantic information is significantly smaller than the payload, the amount of traffic generated by the header packets can be relatively small. Hence, forwarding the header packets may not exacerbate the congestion and may not overwhelm the receiver device. Furthermore, the data awaiting transmission can be buffered at the corresponding sender devices. Buffering at the switches in the network can thus be avoided. The receiver device may obtain the data from the sender devices at the scheduled times, relieving the congestion in the network.
In this disclosure, the term "switch" is used in a generic sense and may refer to any standalone networking device or fabric switch operating in any network layer. The term "switch" should not be construed as limiting examples of the invention to layer-2 networks. Any device or networking equipment that can forward traffic to an external device or another switch can be referred to as a "switch". Any physical or virtual device (e.g., a virtual machine or a switch operating on a computing device) that can forward traffic to an end device can be referred to as a "switch". Examples of a "switch" include, but are not limited to, a layer-2 switch, a layer-3 router, a routing switch, a component of a Gen-Z network, or a fabric switch comprising a plurality of similar or heterogeneous smaller physical and/or virtual switches.
The term "packet" refers to a group of bits that can be transmitted together over a network. "packets" should not be construed as limiting examples of the present invention to a particular layer of the network protocol stack. The "packet" may be replaced by other terms relating to a set of bits, such as "message", "frame", "unit", "datagram" or "transaction". Further, the term "port" may refer to a port that may receive or transmit data. A "port" may also refer to hardware, software, and/or firmware logic that may facilitate operation of the port.
Fig. 1A illustrates an example of flow-pruning-based congestion management in a network in accordance with an aspect of the present application. HPC environment 100 may include a plurality of nodes 111, 112, 113, 114, 115, 116, 117, 118, and 119. A subset of these nodes may be computing nodes while the others may be storage nodes. The nodes may be coupled to each other via network 110. A respective node may operate as a receiver device or a sender device, and such nodes may be referred to as receiver devices or sender devices, respectively. Network 110 may include a set of high-capacity networking devices, such as switches 101, 102, 103, 104, and 105. Here, network 110 may be an HPC fabric. The computing nodes and storage nodes may operate in conjunction with one another over network 110 to facilitate high-performance computing in HPC environment 100.
Subsets of the switches in network 110 may be coupled to each other via respective tunnels. Examples of such tunnels include, but are not limited to, Virtual Extensible LAN (VXLAN), Generic Routing Encapsulation (GRE), Network Virtualization using GRE (NVGRE), Generic Network Virtualization Encapsulation (GENEVE), Internet Protocol Security (IPsec), and Multiprotocol Label Switching (MPLS) tunnels. The tunnels in network 110 may be formed over an underlying network (or underlay network). The underlying network may be a physical network, and a respective link of the underlying network may be a physical link. A respective switch pair in the underlying network may be a Border Gateway Protocol (BGP) peer pair. A virtual private network (VPN), such as an Ethernet VPN (EVPN), may be deployed over network 110.
To ensure the desired level of performance, a respective node in HPC environment 100 may need to operate at the operating rates of the other nodes. Suppose that node 111 operates as a receiver device. At least a subset of the remaining nodes in environment 100 may then operate as sender devices. Switches 101, 102, 103, 104, and 105 may facilitate low-latency, high-speed data transfer from the respective sender devices to receiver device 111. When a large number of sender devices attempt to send data to receiver device 111, incast occurs in network 110, which may lead to high congestion at receiver device 111 and the associated switches. Therefore, to ensure high-speed data transfer, HPC environment 100 may require a mechanism for alleviating congestion during incast.
With existing technologies, switch 101 may detect congestion caused by incast at receiver device 111. Switch 101 may detect the congestion based on the degree of incast at receiver device 111. The incast degree may indicate the number of sender devices sending packets to receiver device 111. Switch 101 may compare the current incast degree at receiver device 111 (which may be determined from the number of flows traversing switch 101 toward receiver device 111) with a threshold. The threshold may indicate a predetermined number of sender devices sending packets to the receiver device. If the current incast degree at receiver device 111 reaches the threshold, switch 101 may determine that there is congestion at receiver device 111. To mitigate the effects of the congestion, switch 101 may limit traffic from sender devices 112 and 114. For example, if EQDS is deployed in network 110, sender devices 112 and 114 may be responsible for queuing or buffering. For example, a transport-layer daemon (TPD) 160 running on sender device 114 may send a data stream to receiver device 111. Examples of TPD 160 include, but are not limited to, a TCP daemon, a User Datagram Protocol (UDP) daemon, a Stream Control Transmission Protocol (SCTP) daemon, a Datagram Congestion Control Protocol (DCCP) daemon, an AppleTalk Transaction Protocol (ATP) daemon, a Fibre Channel Protocol (FCP) daemon, a Reliable Data Protocol (RDP) daemon, and a Reliable User Datagram Protocol (RUDP) daemon.
Sender device 114 may then become responsible for buffering the data of the stream. Similarly, sender device 112 may be responsible for buffering the data of its local streams. When receiver device 111 signals for the corresponding transmissions, sender devices 112 and 114 may send the corresponding packets, allowing receiver device 111 to divide its bandwidth among the senders. To avoid congestion in network 110, forwarding hardware 150 of switch 101 may also prune packets by removing a packet's payload while preserving the header.
Forwarding hardware 150 may perform packet pruning by forwarding the header of a packet to receiver device 111 while discarding the payload. In this way, switch 101 may "prune" the packet. However, if the packet is encrypted to ensure end-to-end security, the packet may include an ICV at some location in the packet (e.g., before or after the payload). If forwarding hardware 150 modifies the packet by pruning the payload, the ICV in the packet becomes invalid. Specifically, because receiver device 111 receives only the header of the packet while the ICV is calculated over the entire packet, receiver device 111 cannot reproduce the ICV. Hence, receiver device 111 will not be able to verify the packet.
To address this issue, when NIC 140 of sender device 114 receives data segment 142 from TPD 160, NIC 140 may synthesize header packet 144, which includes the semantic information of segment 142. The semantic information may include the number of bytes in the payload of segment 142 and the number of packets carrying the payload. Since the header of segment 142 may include the semantic information of segment 142, NIC 140 may include the header information of segment 142 in packet 144. NIC 140 may also include a header packet indicator in packet 144. The indicator marks packet 144 as a packet that should be forwarded to receiver device 111. In other words, if selective packet dropping occurs due to network congestion, the indicator marks packet 144 as less eligible for dropping. NIC 140 may also generate payload packet 146, which includes the payload of segment 142. NIC 140 may omit the indicator from packet 146, indicating that packet 146 is more eligible for selective dropping. NIC 140 may generate a corresponding ICV for each of packets 144 and 146 and encrypt the packets separately.
Similarly, upon receiving data segment 132, NIC 130 of sender device 112 may generate header packet 134, which carries the indicator, and payload packet 136. If forwarding hardware 150 detects congestion in network 110, instead of performing packet pruning, forwarding hardware 150 may check packets 134 and 144 for the indicator. Based on the presence of the indicator, forwarding hardware 150 may forward header packets 134 and 144 to receiver device 111. Receiver device 111 may then validate header packets 134 and 144 with the corresponding ICVs. NIC 170 of receiver device 111 may store header packets 134 and 144 in ingress buffers 172 and 174, respectively. In this way, switch 101 may forward the semantic information of different flows to receiver device 111, which in turn may validate header packets 134 and 144 based on the corresponding ICVs.
Based on the semantic information in header packets 134 and 144, NIC 170 may determine that the data of payload packets 136 and 146, respectively, should be retrieved. For example, the semantic information may quantify the data in payload packets 136 and 146. NIC 170 may also determine that the data awaiting retrieval is buffered in NICs 130 and 140. Accordingly, NIC 170 may deploy a scheduling mechanism to schedule the retrieval of the data from NICs 130 and 140, respectively. Since the number of bytes in the semantic information of segments 132 and 142 is significantly smaller than the corresponding payloads, the amount of traffic generated by header packets 134 and 144 can be relatively small. Hence, forwarding header packets 134 and 144 may not exacerbate the congestion and may not overwhelm NIC 170. NIC 170 may send messages to NICs 130 and 140 to initiate the transmission of payload packets 136 and 146.
However, if the congestion persists, switch 101 may deploy selective dropping. When forwarding hardware 150 receives packets 136 and 146, instead of performing packet pruning, forwarding hardware 150 may check packets 136 and 146 for the indicator. Because the indicator is not present in packets 136 and 146, forwarding hardware 150 may discard packets 136 and 146, thereby trimming the corresponding flows. This allows the switches in network 110 to discard whole packets of a flow while forwarding its header packet, without pruning individual packets. Consequently, the corresponding ICV of a respective packet received at NIC 170 is not affected by the flow pruning. As a result, NIC 170 can successfully validate the received packets even while forwarding hardware 150 performs flow pruning.
Fig. 1B illustrates an example of a packet that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application. During operation, TPD 160 may produce data stream 120 (e.g., a transport-layer stream, such as a TCP stream). TPD 160 may generate data segment 142 from stream 120. Segment 142 may include header 122 and payload 124. Header 122 may include a set of header fields, which may include one or more of: a source port, a destination port, a sequence number, an acknowledgment number, a header length, flags (e.g., an urgent bit, an acknowledgment bit, a push indicator, a connection-reset bit, a synchronization bit, and a finish bit), a sliding-window field, a checksum, an urgent pointer, and optional fields.
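As an illustration of the fields listed above, the following sketch unpacks a fixed 20-byte TCP header with Python's struct module; it assumes the segment bytes begin at the TCP header, ignores options, and is not part of the patented mechanism:

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    # Fixed 20-byte TCP header: ports, sequence and acknowledgment numbers,
    # data offset plus flags, window, checksum, and urgent pointer.
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (offset_flags >> 12) * 4,  # data offset, in bytes
        "flags": offset_flags & 0x01FF,          # URG/ACK/PSH/RST/SYN/FIN bits
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }

# Example: a synthetic segment with PSH and ACK set.
seg = struct.pack("!HHIIHHHH", 49152, 80, 42, 7, (5 << 12) | 0x018, 65535, 0, 0)
info = parse_tcp_header(seg)
assert info["header_len"] == 20 and info["flags"] == 0x018
```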
TPD 160 may provide segment 142 to NIC 140 for transmission to receiver device 111. NIC 140 may generate payload packet 146, which includes payload data 194 (i.e., the data to be transmitted) from payload 124. Payload packet 146 may include header 192, which may be a copy of header 122. NIC 140 may then generate ICV 196 for payload packet 146 and encrypt payload packet 146. NIC 140 may then store payload packet 146 in local buffer 190 associated with stream 120. In addition, NIC 140 may determine semantic information 182 from header 122 and generate header packet 144, which includes semantic information 182. Semantic information 182 may include parameter values from one or more fields of header 122. In particular, semantic information 182 may include all fields of header 122 that can be used to quantify the data in payload 124 and determine which portion of segment 142 corresponds to payload 124 (i.e., the distribution of the payload of segment 142). Semantic information 182 may allow NIC 170 to schedule and obtain the data in payload 124 (e.g., using RDMA).
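The packet synthesis described above can be sketched as follows, assuming a one-byte header packet indicator, a JSON encoding of the semantic information, and HMAC-SHA-256 standing in for the ICV; encryption is elided, and every name here is illustrative rather than the disclosure's actual format:

```python
import hashlib
import hmac
import json

KEY = b"shared-session-key"  # assumed symmetric key (illustrative)
INDICATOR = b"\x01"          # assumed header packet indicator byte

def make_header_packet(semantic_info: dict) -> bytes:
    # Semantic information quantifies the payload so the receiver can
    # schedule retrieval before any payload bytes arrive.
    body = INDICATOR + json.dumps(semantic_info).encode()
    return body + hmac.new(KEY, body, hashlib.sha256).digest()  # append ICV

def make_payload_packet(header: bytes, payload: bytes) -> bytes:
    # No indicator byte: this packet is eligible for selective dropping.
    body = b"\x00" + header + payload
    return body + hmac.new(KEY, body, hashlib.sha256).digest()  # append ICV

header_pkt = make_header_packet({"seq": 42, "payload_bytes": 1400, "packets": 1})
payload_pkt = make_payload_packet(b"HDR|seq=42|", b"\x00" * 1400)
```

Because each packet carries its own ICV, a receiver can verify a header packet on its own even when the payload packets of the same segment are later dropped in their entirety.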
NIC 140 may also include header packet indicator 184 in header packet 144. Indicator 184 may be a separate field in header packet 144 or be carried in an optional field of header 122 that is included in semantic information 182. Indicator 184 may be represented by a predefined value. NIC 140 may then encrypt packet 144 while generating ICV 186 for header packet 144. Header packet 144 may include header 180, which may include a layer-3 header (e.g., an Internet Protocol (IP) header) and a layer-2 header (e.g., an Ethernet header). The source and destination addresses of the layer-3 header may correspond to the IP addresses of sender device 114 and receiver device 111, respectively. The source and destination addresses of the layer-2 header may correspond to the media access control (MAC) addresses of sender device 114 and the locally coupled switch 102, respectively.
Based on the information in header 180, NIC 140 may send header packet 144 to switch 102. Switch 102 may then forward header packet 144 to switch 101 based on the information in header 180. If switch 101 detects congestion, switch 101 may determine whether header packet 144 includes indicator 184. When switch 101 determines that indicator 184 is present in header packet 144, switch 101 determines that discarding header packet 144 should be avoided if possible. Accordingly, switch 101 may forward header packet 144 to receiver device 111 based on the information in header 180. Because NIC 170 may receive header packet 144 in its entirety (i.e., without pruning), NIC 170 can validate header packet 144 based on ICV 186.
NIC 170 may obtain semantic information 182 from header packet 144. Because header packet 144 can be validated, NIC 170 may treat semantic information 182 as trusted information. NIC 170 may then use semantic information 182 to determine the presence of subsequent packets. Accordingly, NIC 170 may schedule the data retrieval and allocate transmission credits to NIC 140. Upon receiving the credits, NIC 140 may obtain payload packet 146 from buffer 190 based on the credits and send payload packet 146 to receiver device 111. Because indicator 184 is not included in payload packet 146, switch 101 (and switch 102) may discard payload packet 146 if flow pruning is initiated due to congestion. On the other hand, if switch 101 forwards payload packet 146, receiver device 111 may use ICV 196 to validate payload packet 146 because payload packet 146 is delivered without pruning.
Fig. 2A presents a flowchart illustrating an example of a process by which the forwarding hardware of a switch facilitates flow pruning in accordance with an aspect of the present application. During operation, the forwarding hardware may receive a plurality of packets comprising portions of a data segment sent from a sender device to a receiver device (operation 202). To allow the switch to perform flow pruning, the sender device may generate a header packet and one or more payload packets. The forwarding hardware may identify, from the plurality of packets, the one or more payload packets, which include the payload of the data segment, and at least a header packet, which includes the header information of the data segment and a header packet indicator (operation 204). The forwarding hardware may distinguish the header packet from the one or more payload packets based on the indicator in the header packet.
The forwarding hardware may determine whether congestion is detected at the receiver device based on the number of sender devices sending packets to the receiver device via the switch (operation 206). The forwarding hardware may determine that there is congestion at the receiver device if the number of sender devices sending packets to the receiver device via the switch reaches a predetermined threshold. Upon determining congestion at the receiver device (operation 208), the forwarding hardware may perform flow pruning by sending the header packet to the receiver device and discarding a subset of the one or more payload packets (operation 210). If congestion is not detected, flow pruning may not be required. Accordingly, the forwarding hardware may forward the one or more payload packets to the receiver device (operation 212).
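The decision in operations 208 through 212 can be condensed into a sketch like the following; the one-byte indicator encoding is an assumption carried over from the earlier sketches, and the congested flag would come from a check such as the incast-degree monitor sketched earlier:

```python
from typing import Callable

def handle_packet(packet: bytes, congested: bool,
                  forward: Callable[[bytes], None],
                  drop: Callable[[bytes], None]) -> None:
    # A leading 0x01 byte stands in for the header packet indicator.
    is_header_packet = packet[:1] == b"\x01"
    if congested and not is_header_packet:
        drop(packet)     # flow pruning: payload packets may be discarded
    else:
        forward(packet)  # header packets are always forwarded
```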
Fig. 2B presents a flowchart illustrating an example of a process by which the NIC of a computing device generates packets from a data segment in accordance with an aspect of the present application. During operation, the NIC may generate a header packet that includes the header information of a data segment sent from the computing device to a recipient device (operation 252). The data segment may be a transport-layer data segment generated by a TPD running on the computing system. The NIC may include one or more fields of the header of the data segment in the header packet. The NIC may select the fields associated with the semantic information of the data segment for inclusion in the header packet. The NIC may also generate one or more payload packets that include the payload of the data segment (operation 254). The NIC may distribute the payload of the data segment across the one or more payload packets. For example, the NIC may determine the total number of bytes in the data segment and the maximum number of bytes a packet can accommodate, and then distribute the payload across the one or more payload packets accordingly, as shown in the sketch below.
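A minimal sketch of that distribution step, assuming an illustrative per-packet capacity of 1,400 bytes:

```python
MAX_PAYLOAD_PER_PACKET = 1400  # assumed per-packet capacity (illustrative)

def distribute_payload(payload: bytes) -> list[bytes]:
    # Split the segment payload into the minimum number of payload packets.
    return [payload[i:i + MAX_PAYLOAD_PER_PACKET]
            for i in range(0, len(payload), MAX_PAYLOAD_PER_PACKET)]

chunks = distribute_payload(b"\x00" * 4096)
assert [len(c) for c in chunks] == [1400, 1400, 1296]
```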
The NIC may include, in the header packet, a header packet indicator that distinguishes the header packet from the one or more payload packets (operation 256). Here, the header packet indicator indicates that the header packet should not be discarded. On the other hand, the absence of the header packet indicator in the one or more payload packets indicates that those packets are allowed to be discarded. The NIC may then forward the header packet to the recipient device (operation 258). The NIC may store the one or more payload packets in a buffer from which transmission is controlled by the recipient device (operation 260). The NIC may transmit from the buffer when the recipient device issues the corresponding transmission credits.
Fig. 3 illustrates an example of communication that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application. HPC environment 300 may include network 330, which includes switches 351 and 352. Nodes 331 and 334 may operate as a receiver device and a sender device, respectively. During operation, switch 351 may detect congestion in network 330 and initiate flow pruning (operation 302). Switch 351 may determine that there is congestion at receiver device 331 if the number of sender devices sending packets to receiver device 331 via switch 351 reaches a predetermined threshold. When NIC 340 of sender device 334 receives segment 342 of transport-layer data stream 380 from TPD 360, NIC 340 may generate header packet 344, which includes the header information of segment 342, and payload packet 346, which includes the payload of segment 342 (operation 304). NIC 340 may include a header packet indicator in header packet 344 to indicate that packet 344 is unlikely to be dropped when flow pruning is initiated. NIC 340 may buffer payload packet 346 (operation 306) and send header packet 344 (operation 308). Switch 352 may forward header packet 344 toward receiver device 331 via switch 351.
Upon receiving header packet 344, forwarding hardware 350 of switch 351 may parse header packet 344 and detect the indicator in header packet 344 (operation 310). Here, forwarding hardware 350 may examine a predetermined location in header packet 344 (e.g., in a header field) and determine that the location includes the predefined value representing the indicator. Because the indicator indicates that header packet 344 is unlikely to be discarded, upon identifying the indicator, forwarding hardware 350 may refrain from discarding header packet 344 and forward header packet 344 to receiver device 331 (operation 312). NIC 370 of receiver device 331 may then determine the semantic information from header packet 344 (operation 314). Because the semantic information quantifies the payload of segment 342 and indicates how the payload is distributed (e.g., in one payload packet), the semantic information allows NIC 370 to schedule the data retrieval (operation 316). For example, NIC 370 may issue a transmission credit based on the number of bytes in payload packet 346. Accordingly, NIC 370 may allocate the credit to sender device 334 and provide the credit to switch 351 for forwarding to sender device 334 (operation 318). The credit allows sender device 334 to forward a packet to receiver device 331. Switches 351 and 352 may forward the credit to sender device 334. Upon receiving the credit, NIC 340 may send payload packet 346 to receiver device 331 (operation 320). When forwarding hardware 350 receives payload packet 346, forwarding hardware 350 may detect that the indicator is not present in payload packet 346 (operation 322). For example, forwarding hardware 350 may examine the predetermined location for the indicator in payload packet 346 and determine that the location does not include the predefined value representing the indicator. Accordingly, forwarding hardware 350 may discard payload packet 346, thereby performing flow pruning on data stream 380.
Fig. 4A presents a flowchart illustrating an example of a process by which a sender device generates packets from a data stream in accordance with an aspect of the present application. During operation, the sender device may determine a data segment of a data stream (operation 402). The data segment may be sent from the TPD of the sender device through the network protocol stack. The NIC of the sender device may then obtain the data segment via the stack. The sender device may then generate a header packet from the header of the segment (operation 404) and include a header packet indicator in the header packet (operation 406). The sender device may include one or more fields of the header of the data segment in the header packet. The sender device may select the fields associated with the semantic information of the data segment for inclusion in the header packet. The indicator may indicate that the header packet is unlikely to be discarded. The sender device may also generate a payload packet from the payload of the segment (operation 408) and store the payload packet in a local buffer (e.g., in the NIC) (operation 410). The sender device may determine the total number of bytes in the data segment and the maximum number of bytes a packet can accommodate. The sender device may then include the payload of the data segment in at least one payload packet. Transmission from the buffer may depend on the transmission credits issued by the receiver device. The sender device may send the header packet to the receiver device (operation 412).
Fig. 4B presents a flowchart illustrating an example of a process by which a switch applies flow pruning to packets in accordance with an aspect of the present application. During operation, the switch may receive a packet (operation 452) and determine whether flow pruning has been initiated (operation 454). Flow pruning may be initiated if congestion is detected at the receiver device. The switch may detect congestion at the receiver device if the number of sender devices sending packets to the receiver device via the switch reaches a predetermined threshold. If flow pruning has been initiated, the switch may look for packets that can be dropped. Accordingly, the switch may determine whether a header packet indicator is present in the packet (operation 456). If flow pruning has not been initiated (operation 454) or the indicator is present (operation 456), the switch may forward the packet (operation 460). On the other hand, if the indicator is not present in the packet, the switch may determine that the packet is allowed to be discarded. The switch may then discard the packet (operation 458).
Fig. 5 presents a flowchart illustrating an example of a process by which a receiver device processes a packet in accordance with an aspect of the present application. During operation, the receiver device may receive a header packet (operation 502) and determine the semantic information from the header packet (operation 504). The semantic information may include information from one or more fields of the header of the data segment. The semantic information may indicate the number of bytes of the data segment awaiting retrieval from the sender device and how those bytes are distributed (e.g., how many packets they span). The receiver device may then identify the payload packet(s) awaiting retrieval based on the semantic information (operation 506) and schedule the retrieval of the payload packets (operation 508). The semantic information may indicate the number of payload packets and the payload bytes in those packets. The receiver device may schedule the retrieval in such a way that the retrieval does not cause congestion at the receiver device. Based on the schedule, the receiver device may send a transmission credit to the sender device at the scheduled time and allocate a buffer for the payload packet(s) (operation 510). Since the sender device can transmit packets only after receiving a transmission credit, the receiver device can manage when it receives traffic from a sender device by scheduling when the corresponding transmission credits are issued.
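The scheduling in operations 506 through 510 can be sketched as a small credit allocator; the byte budget, the first-come-first-served policy, and all names are illustrative assumptions rather than the scheduling mechanism this disclosure prescribes:

```python
from collections import deque
from typing import Callable

class RetrievalScheduler:
    """Issues transmission credits so retrievals do not overwhelm the receiver."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget = budget_bytes  # bytes we allow in flight at once
        self.pending: deque[tuple[str, int]] = deque()

    def on_header_packet(self, sender: str, payload_bytes: int) -> None:
        # The semantic information tells us how much data awaits retrieval.
        self.pending.append((sender, payload_bytes))

    def issue_credits(self, send_credit: Callable[[str, int], None]) -> None:
        # Grant credits in arrival order while the budget allows.
        while self.pending and self.pending[0][1] <= self.budget:
            sender, nbytes = self.pending.popleft()
            self.budget -= nbytes
            send_credit(sender, nbytes)

    def on_payload_received(self, nbytes: int) -> None:
        self.budget += nbytes  # reclaim budget once the data has arrived
```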
Fig. 6 illustrates an example of a computing system that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application. Computing system 600 may include a set of processors 602, a memory unit 604, a NIC 606, and a storage device 608. Memory unit 604 may include a set of volatile memory devices (e.g., dual in-line memory modules (DIMMs)). Furthermore, computing system 600 may be coupled to a display device 612, a keyboard 614, and a pointing device 616, if needed. Storage device 608 may store an operating system 618. A flow management system 620 and data 636 associated with flow management system 620 may be maintained and executed from storage device 608 and/or NIC 606.
Flow management system 620 may include instructions that, when executed by computing system 600, cause computing system 600 to perform the methods and/or processes described in this disclosure. Specifically, if computing system 600 is a sender device, flow management system 620 may include instructions for generating a header packet and payload packets from a data segment (packet logic 622). Flow management system 620 may also include instructions for including an indicator in the header packet (indicator logic 624). Moreover, flow management system 620 may include instructions for generating the corresponding ICVs of the header and payload packets, and for encrypting those packets individually (encryption logic 626).
If computing system 600 is a receiver device, flow management system 620 may include instructions for sending transmission credits to the sender device (packet logic 622). Flow management system 620 may then include instructions for determining whether the indicator is present in a packet (indicator logic 624). Flow management system 620 may also include instructions for validating a respective packet based on the corresponding ICV (encryption logic 626).
Flow management system 620 may further include instructions for sending and receiving packets (communication logic 628). Data 636 may include any data that can facilitate the operations of flow management system 620. Data 636 may include, but is not limited to, the semantic information of a data segment, the payload data to be transmitted, the header and payload packets, and the indicator.
Fig. 7 illustrates an example of a computer-readable memory device that facilitates flow-pruning-based congestion management in accordance with an aspect of the present application. Computer-readable memory device 700 may include a number of units or apparatuses that can communicate with one another via a wired, wireless, quantum, optical, or electrical communication channel. Memory device 700 may be realized using one or more integrated circuits and may include fewer or more units or apparatuses than the ones shown in Fig. 7.
Furthermore, memory device 700 may be integrated in a computer system, or realized as a separate device capable of communicating with other computer systems and/or devices. For example, memory device 700 may be a NIC in a computer system. Memory device 700 may include units 702-708, which perform functions or operations similar to logic blocks 622-628 of flow management system 620 of Fig. 6, including: a packet unit 702; an indicator unit 704; an encryption unit 706; and a communication unit 708.
Fig. 8 illustrates an example of a switch that supports flow-pruning-based congestion management in accordance with an aspect of the present application. Switch 800, which may also be referred to as networking device 800, may include a number of communication ports 802, a packet processor 810, and a storage device 850. Switch 800 may also include forwarding hardware 860 (e.g., the processing hardware of switch 800, such as its application-specific integrated circuit (ASIC) chips), which includes information based on which switch 800 processes packets (e.g., determines output ports for packets). Packet processor 810 may extract and process header information from received packets. Packet processor 810 may identify a switch identifier (e.g., a MAC address and/or an IP address) associated with switch 800 in the header of a packet.
Communication ports 802 may include inter-switch communication channels for communication with other switches and/or user devices. The communication channels may be implemented via a regular communication port and based on any open or proprietary format. Communication ports 802 may include one or more Ethernet ports capable of receiving frames encapsulated in an Ethernet header. Communication ports 802 may also include one or more IP ports capable of receiving IP packets. An IP port is capable of receiving an IP packet and can be configured with an IP address. Packet processor 810 may process Ethernet frames and/or IP packets. A respective port of communication ports 802 may operate as an ingress port and/or an egress port.
Switch 800 may maintain a database 852 (e.g., in storage device 850). Database 852 may be a relational database and may run on one or more database management system (DBMS) instances. Database 852 may store information associated with the routing and configuration of switch 800. Forwarding hardware 860 may include congestion management logic block 830, which facilitates flow pruning in a network. Congestion management logic block 830 may include detection logic block 832 and pruning logic block 834. Congestion management logic block 830 may determine whether there is congestion at a receiver device. Congestion management logic block 830 may determine that there is congestion at the receiver device if the number of sender devices sending packets to the receiver device via switch 800 reaches a predetermined threshold. Upon detecting the congestion, congestion management logic block 830 may initiate flow pruning at switch 800. Detection logic block 832 may detect whether a packet received via one of communication ports 802 includes a header packet indicator. During the congestion, if a payload packet does not include the indicator, pruning logic block 834 may discard the payload packet while forwarding the header packets.
The description herein is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed examples will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other examples and applications without departing from the spirit and scope of the invention. Accordingly, the invention is not limited to the examples shown, but is intended to be accorded the widest scope consistent with the claims.
An aspect of the present technology may provide a networking device that facilitates efficient congestion management. During operation, the forwarding hardware of the networking device may receive, via a network, a plurality of packets comprising portions of a data segment sent from a sender device to a receiver device. The forwarding hardware may identify, from the plurality of packets, one or more payload packets that include the payload of the data segment and at least a header packet that includes the header information of the data segment and a header packet indicator. The forwarding hardware may determine whether congestion is detected at the receiver device based on the number of sender devices sending packets to the receiver device via the networking device. Upon determining congestion at the receiver device, the forwarding hardware may perform flow pruning by sending the header packet to the receiver device and discarding a subset of the one or more payload packets.
In a variation on this aspect, the forwarding hardware may distinguish the header packet from the one or more payload packets based on the header packet indicator.
In another variation, if congestion is not detected, the forwarding hardware may forward the one or more payload packets to the receiver device.
In a variation on this aspect, the header information may include semantic information that quantifies the payload of the data segment and indicates the distribution of the payload.
In a further variation, the forwarding hardware may receive a transmission credit from the receiver device. Here, the transmission credit may correspond to the one or more payload packets and may be generated based on the semantic information. The forwarding hardware may then send the transmission credit to the sender device.
In a variation on this aspect, the data segment may be a transport-layer data segment generated at the sender device.
In a variation on this aspect, the networking device may be located in a network coupling the sender device and the receiver device.
Another aspect of the present technology may provide a computing system that facilitates efficient congestion management. During operation, the NIC of the computing system may generate a header packet that includes the header information of a data segment sent from the computing system to a recipient device. The NIC may also generate one or more payload packets that include the payload of the data segment. The NIC may then include, in the header packet, a header packet indicator that distinguishes the header packet from the one or more payload packets. The NIC may then forward the header packet to the recipient device and store the one or more payload packets in a buffer from which transmission may be controlled by the recipient device.
In a variation on this aspect, the header information may include semantic information quantifying the payload of the data segment and indicating the distribution of the payload.
In a further variation, the NIC may receive a transmission credit from the recipient device. The transmission credit may correspond to the one or more payload packets and may be generated based on the semantic information.
In another variation, the NIC may determine a subset of the one or more payload packets corresponding to the transmission credit. The NIC may then transmit the subset of the one or more payload packets to the recipient device.
In a variation on this aspect, the absence of the header packet indicator in the one or more payload packets indicates that the one or more payload packets are allowed to be discarded.
In a variation on this aspect, the data segment may be a transport-layer data segment generated at the computing system.
In a variation on this aspect, the NIC may encrypt the header packet and the one or more payload packets separately.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. Computer-readable storage media include, but are not limited to, volatile memory, nonvolatile memory, and magnetic and optical storage devices such as disks, tapes, CDs (compact discs), and DVDs (digital versatile discs or digital video discs), or other media, now known or later developed, that are capable of storing code and/or data.
The methods and processes described in the detailed description section may be embodied as code and/or data, which may be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer readable storage medium.
The methods and processes described herein may be performed by and/or included in hardware logic blocks or devices. Such logic blocks or devices may include, but are not limited to, application Specific Integrated Circuit (ASIC) chips, field Programmable Gate Arrays (FPGAs), dedicated or shared processors that execute particular software logic blocks or code at particular times, and/or other programmable logic devices now known or later developed. When the hardware logic blocks or devices are activated, they perform the methods and processes included therein.
The foregoing description of the inventive examples has been presented only for the purposes of illustration and description. The description is not intended to be exhaustive or to limit the disclosure. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. The scope of the invention is defined by the appended claims.

Claims (20)

1. A networking device, comprising:
a memory device; and
forwarding hardware for:
receiving a plurality of packets, the plurality of packets including portions of a data segment transmitted from a sender device to a receiver device;
identifying, from the plurality of packets, one or more payload packets comprising a payload of the data segment and at least a header packet comprising a header packet indicator and header information of the data segment;
determining whether congestion is detected at the receiver device based on a number of sender devices sending packets to the receiver device via the networking device;
in response to detecting congestion at the receiver device, performing flow pruning by:
sending the header packet to the receiver device; and
discarding a subset of the one or more payload packets.
2. The networking device of claim 1, wherein the forwarding hardware is further to:
the header packet is distinguished from the one or more payload packets based on the header packet indicator.
3. The networking device of claim 1, wherein, in response to not detecting the congestion, the forwarding hardware is further to forward the one or more payload packets to the receiver device.
4. The networking device of claim 1, wherein the header information comprises semantic information that quantifies the payload of the data segment and indicates a distribution of the payload.
5. The networking device of claim 4, wherein the forwarding hardware is further to:
receive a transmission credit from the receiver device, wherein the transmission credit corresponds to the one or more payload packets and is generated based on the semantic information; and
send the transmission credit to the sender device.
6. The networking device of claim 1, wherein the data segment is a transport layer data segment generated at the sender device.
7. The networking device of claim 1, wherein the networking device is located in a network that couples the sender device and the receiver device.
8. A computing system, comprising:
a memory device; and
a Network Interface Controller (NIC) for:
generating a header packet including header information for a data segment sent from the computing system to a recipient device;
generating one or more payload packets comprising a payload of the data segment;
including a header packet indicator in the header packet, the header packet indicator distinguishing the header packet from the one or more payload packets;
forwarding the header packet to the recipient device; and
storing the one or more payload packets in a buffer from which transmission is controlled by the recipient device.
9. The computing system of claim 8, wherein the header information includes semantic information that quantifies the payload of the data segment and indicates a distribution of the payload.
10. The computing system of claim 9, wherein the NIC is further to receive a transmission credit from the recipient device, wherein the transmission credit corresponds to the one or more payload packets and is generated based on the semantic information.
11. The computing system of claim 10, wherein the NIC is further to:
determine a subset of the one or more payload packets corresponding to the transmission credit; and
send the subset of the one or more payload packets to the recipient device.
12. The computing system of claim 8, wherein the absence of the header packet indicator in the one or more payload packets indicates that the one or more payload packets are allowed to be discarded.
13. The computing system of claim 8, wherein the data segment is a transport layer data segment generated at the computing system.
14. The computing system of claim 8, wherein the NIC is further to encrypt the header packet and the one or more payload packets separately.
15. A method, comprising:
receiving, by a networking device in a network coupling a sender device and a receiver device, a plurality of packets comprising portions of a data segment sent from the sender device to the receiver device;
identifying, by the networking device, from the plurality of packets, one or more payload packets comprising the payload of the data segment and at least a header packet comprising a header packet indicator and header information of the data segment;
determining, by the networking device, whether congestion is detected at the receiver device based on a number of sender devices sending packets to the receiver device via the networking device; and
in response to detecting congestion at the receiver device, performing flow pruning by:
forwarding the header packet to the receiver device; and
discarding a subset of the one or more payload packets.
16. The method of claim 15, further comprising:
the header packet is distinguished from the one or more payload packets based on the header packet indicator.
17. The method of claim 15, wherein, in response to not detecting the congestion, the method further comprises forwarding the one or more payload packets to the receiver device.
18. The method of claim 15, wherein the header information includes semantic information that quantifies the payload of the data segment and indicates a distribution of the payload.
19. The method of claim 18, further comprising:
receiving a transmission credit from the receiver device, wherein the transmission credit corresponds to the one or more payload packets and is generated based on the semantic information; and
sending the transmission credit to the sender device.
20. The method of claim 15, wherein the data segment is a transport layer data segment generated at the sender device.
CN202311314282.8A 2022-10-11 2023-10-11 Congestion management based on flow pruning Pending

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
US 63/379,079 | 2022-10-11 | |
US 18/479,803 (published as US 2024/0121189 A1) | 2022-10-11 | 2023-10-02 | Flow-trimming based congestion management

Publications (1)

Publication Number Publication Date
CN117880198A 2024-04-12

Family

ID=90576067

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311314282.8A | Congestion management based on flow pruning (Pending) | 2022-10-11 | 2023-10-11

Country Status (1)

Country Link
CN (1) CN117880198A (en)


Legal Events

Date Code Title Description
PB01 Publication