WO2021107867A1

WO2021107867A1 - Method of monitoring a communication network to facilitate network fault analysis, and a communication network thereof

Info

Publication number: WO2021107867A1
Application number: PCT/SG2020/050684
Authority: WO
Inventors: Pravein GOVINDAN KANNAN; Nishant SHYAMAL BUDHDEV; Raj Joshi; Mun Choon Chan
Original assignee: National University Of Singapore
Priority date: 2019-11-26
Filing date: 2020-11-24
Publication date: 2021-06-03

Abstract

A method 300 of monitoring a communication network 100,600 to facilitate network fault analysis is disclosed. The communication network 100, 600 comprises switches 110 with each switch arranged to relay corresponding data packets in the network 100,600. In a described embodiment, the method 300 includes (i) recording the corresponding data packets received at each switch 110 within a recording window. Upon detecting a network fault 102 by a particular switch 110a among the switches 110, the network fault 102 occurring within the recording window, the method 300 further includes (ii) generating a trigger packet 332, (iii) broadcasting the trigger packet 332 selectively to at least some of the switches 110 in the network 100,600, and (iv) upon receiving the trigger packet 332, storing the recorded data packets of the selected switches 110 received within the recording window for subsequent retrieval. The recorded data packets represent the network's traffic when the network fault 102 is detected. A communication network 100,600 to facilitate network fault analysis is also disclosed.

Description

METHOD OF MONITORING A COMMUNICATION NETWORK TO FACILITATE NETWORK FAULT ANALYSIS, AND A COMMUNICATION NETWORK

THEREOF TECHNICAL FIELD

The present disclosure relates to monitoring of a communication network to facilitate network fault analysis. The present disclosure also relates to a communication network for facilitating the network fault analysis. BACKGROUND

Large network providers need to quickly resolve network faults in order to meet their high SLA (service level agreement) requirements. However, debugging network faults in modem data centers is extremely challenging due to the scale and complexity of data interactions in a dynamic environment. Further, there may be a number of root causes that may lead to a given network fault. For example, a packet drop due to a table miss can happen either due to a parity error or due to temporal inconsistency during a network update. Therefore, in order to properly debug network faults, an operator has to examine the state that the network was in when the network fault occurred. However, the task of debugging network faults in modem data center networks is difficult.

To an example, when a microburst occurs at a switch port in the network, an operator may observe a uniform distribution of packets from different sending hosts which would indicate a fan-in traffic pattern with no single offending flow. There are two possible root causes for the fan-in traffic pattern to occur. First, the sending hosts could be sending the data in a synchronized fashion. Second, non- determ inistic interaction among the network traffic may have occurred. In order to identify the root cause, the operator has to examine the packets involved in the microburst before the microburst happened. For example, if packet arrival times at first-hop switches which are directly connected to the sending hosts are synchronized, this would indicate that the root cause is synchronized traffic. Otherwise, the microburst is due to non-deterministic interaction of flows in the network.

Previous attempts have tried with varying degrees of success to develop an efficient network monitoring and debugging framework. For example, query-based streaming telemetry systems and sketch-based frameworks might be able to monitor network traffic but have their limitations. Other solutions could be expensive and does not scale well to accommodate multi-petabit network traffic seen in today’s data center networks.

Therefore, it is desirable to provide a solution that address at least one of the problems mentioned in existing prior art, and/or to provide the public with a useful alternative. SUMMARY

In a first aspect, there is provided a method of monitoring a communication network to facilitate network fault analysis, the communication network having switches with each switch arranged to relay corresponding data packets in the network, the method includes (i) recording the corresponding data packets received at each switch within a recording window. Upon detecting a network fault by a particular switch among the switches, the network fault occurring within the recording window, the method further includes (ii) generating a trigger packet, (iii) broadcasting the trigger packet selectively to at least some of the switches in the network, and (iv) upon receiving the trigger packet, storing the recorded data packets of the selected switches received within the recording window for subsequent retrieval. The recorded data packets represent the network’s traffic when the network fault is detected.

When no network fault is detected, the method may further include discarding the recorded data packets that fall outside of the recording window. Recording the corresponding data packets at each switch in (i) may further include recording timing information of the corresponding data packets at each switch. The timing information between the switches may be causally consistent. Additionally, the timing information between at least some of the switches may include a timing difference that is within a tolerance level.

The tolerance level may be a duration to transmit a single data packet between the switches.

The method may further include correlating the recorded data packets based on the timing information of the corresponding data packets to construct an order of events leading to the network fault.

The timing information may include arrival times and departure times of the corresponding data packets.

Recording the corresponding data packets at each switch in (i) may further include counting the corresponding data packets arriving at each switch within a limited recording window, the limited recording window being a subset of the recording window, and associating the corresponding data packets arriving within the limited recording window with a specific arrival time.

Furthermore, recording the corresponding data packets at each switch in (i) may further include counting the corresponding data packets departing at each switch within the limited recording window, and associating the corresponding data packets departing within the limited recording window with a specific departure time.

In another embodiment, recording the corresponding data packets at each switch in (i) may further include counting the corresponding data packets received at each switch that are within a flow, and associating the corresponding data packets within the flow with a packet identifier. Each switch may maintain a list of communication links from which the switch receives the corresponding data packets within the recording window. The trigger packets may be broadcasted only to the list of communication links.

Where the selected switches exclude the particular switch, the method may also include storing the recorded data packets of the particular switch received within the recording window for subsequent retrieval. Storing the recorded data packets may further include generating collection packets to read the recorded data packets from the respective switches, and forwarding the collection packets to the storage server.

The corresponding data packets received at each switch within the recording window may be recorded in respective pre-trigger buffers, and after receiving the trigger packet and before the recorded data packets in the respective pre-trigger buffers are stored for subsequent retrieval, the method may further include recording subsequent data packets received in respective post-trigger buffers. Further, the corresponding data packets may include compressed data packets.

In a second aspect, there is provided a communication network for facilitating network fault analysis, including a plurality of switches with each switch arranged to relay corresponding data packets in the network and having corresponding recording modules configured to record the corresponding data packets that are received within a recording window. The communication network further includes a trigger module configured to generate a trigger packet in response to a network fault detected by a particular switch, the network fault occurring within the recording window, and a broadcasting module configured to selectively broadcast the trigger packet to at least some of the switches in the network. In response to the network fault detected by the particular switch, and in response to receiving the trigger packet by the selected switches, each switch is configured to store its recorded data packets received within the recording window for subsequent retrieval. The recorded data packets represent the network’s traffic when the network fault is detected.

When no network fault is detected, each recording module may be further configured to discard the recorded data packets that fall outside of the recording window.

Each recording module may be further configured to record timing information of the corresponding data packets at each switch. The timing information between the switches may be causally consistent.

In addition, the timing information between at least some of the switches may include a timing difference that is within a tolerance level. The tolerance level may be a duration to transmit a single data packet between the switches.

The communication network may further include a correlation module configured to correlate the recorded data packets based on the timing information of the corresponding data packets, and to construct an order of events leading to the network fault, and the order may be chronological order.

Each recording module may further include an arrival packet counter configured to count the data packets arriving at a corresponding switch within a limited recording window, the limited recording window being a subset of the recording window, and to associate the data packets arriving within the limited recording window with a specific arrival time. Furthermore, each recording module may further include a departure packet counter configured to count the data packets departing at a corresponding switch within the limited recording window, and to associate the data packets departing within the limited recording window with a specific departure time.

Each recording module may further include a flow packet counter configured to count the data packets received at a corresponding switch that are within a flow, and to associate the data packets within the flow with a packet identifier. Each switch may be configured to maintain a list of communication links from which the switch receives the corresponding data packets within the recording time window, and the broadcasting module may be further configured to broadcast the trigger packet only to the list of communication links. Where the selected switches exclude the particular switch, the particular switch may be configured to store its recorded data packets received within the recording window for subsequent retrieval.

The communication module may further include a collection module configured to generate collection packets to read the recorded data packets from the respective switches, and to forward the collection packets to the storage server.

Each recording module may further include a pre-trigger buffer associated with a corresponding switch in the network. The pre-trigger buffer may be configured to record the corresponding data packets received at the corresponding switch within the recording window. Each recording module may also include a post-trigger buffer associated with a corresponding switch in the network. The post-trigger buffer may be configured to record subsequent data packets received at the corresponding switch, after receiving the trigger packet and before the recorded data packets in the respective pre-trigger buffers are stored for subsequent retrieval.

The corresponding data packets may include compressed data packets. The described embodiments may achieve the following advantages:

Visibility: Ability to observe network-wide metrics at a packet-level resolution (e.g. packet arrivals and departures at all ports for all switches). Retrospection: Ability to look back on past network-wide states before the fault has occurred. When the problem is detected, historical events relating to the fault is preserved and thus, it is possible to look-back at the events leading to the fault. Correlation: Ability to correlate network-wide events at small timescales. This might be useful if faults occur due to the interaction of traffic flows across multiple switches.

BRIEF DESCRIPTION OF THE FIGURES

Exemplary embodiments will be described with reference to the accompanying drawings in which:

Figure 1 illustrates an exemplary communication network for facilitating network fault analysis according to a first embodiment.

Figure 2 is a partial schematic of the communication network of Figure 1. Figure 3 is a flow diagram of an exemplary method of monitoring the communication network of Figure 1 to facilitate network fault analysis.

Figure 4 is a schematic diagram of the storage server of the communication network of Figure 1.

Figure 5 is a flow diagram illustrating an exemplary workflow of an operator using the communication network of Figure 1.

Figure 6A is a schematic diagram of a switch topology of an exemplary communication network according to a second embodiment.

Figure 6B is a schematic diagram of a network-level DAG for DPTP synchronization for the communication network of Figure 6A.

Figure 7A is a line graph of a synchronization error over time between switches that are 1 -hop, 2-hop and 3-hops away in the switch topology of Figure 6A. Figure 7B is a line graph of a propagation delay between two pairs of switches in the switch topology of Figure 6A.

Figure 8A is a line graph of an arrival time of data packets for a 500-packet sequence at five switches in the switch topology of Figure 6A. Figure 8B is an enlarged view of the line graph of Figure 8A showing the arrival time of data packets for a 50-packet sequence.

Figure 9 is a bar graph of a percentage of common recorded data packets as seen by the storage server in the communication network of Figure 6A.

Figure 10 is a schematic diagram of the switch topology of the communication network of Figure 6A set up to simulate synchronized application traffic according to a first debug scenario.

Figure 11 A is a line graph of a queue build up at a switch over time for the first debug scenario in Figure 10.

Figure 11 B is a line graph of the data packets arriving at each host over time for the first debug scenario in Figure 10.

Figure 12 is a schematic diagram of the switch topology of the communication network of Figure 6A set up to simulate non-synchronized application traffic according to the first debug scenario. Figure 13A is a line graph of a queue build up at a switch over time for the first scenario in Figure 12.

Figure 13B is a line graph of the data packets arriving at each host over time for the first debug scenario in Figure 12.

Figures 14A, 14B and 14C are schematic diagrams of an initial, final, and transient states of the switch topology of the communication network of Figure 6A set up to simulate a transient blackhole according to a second debug scenario.

Figure 15A is a line graph of packet drops over time at two switches for the second debug scenario in Figures 14A, 14B and 14C.

Figure 15B is a line graph of forwarding rule versions over time at the two switches for the second debug scenario in Figures 14A-C.

Figure 16 is a schematic diagram of the switch topology of the communication network of Figure 6A set up to simulate congestion at a switch due to a link load balancing problem according to a third debug scenario.

Figure 17A is a line graph of a queue duration over time at six links for the third debug scenario in Figure 16.

Figure 17B is a line graph of a link utilization over time at the six links for the third debug scenario in Figure 16. Figures 18A, 18B and 18C are schematic diagrams of an initial, final, and transient states of the switch topology of the communication network of Figure 6A set up to simulate link congestion due to a network update according to a fourth debug scenario.

Figure 19A is a line graph of a queue duration over time at six links for the fourth debug scenario in Figures 18A, 18B and 18C.

Figure 19B is a line graph of a link utilization over time at the six links for the fourth debug scenario in Figures 18A, 18B and 18C.

Figure 19C is a line graph of forwarding rule versions over time for the fourth debug scenario in Figures 18A, 18B and 18C.

Figure 20 is a bar graph of SRAM consumption used by the pre-trigger buffer of the recording module of the communication networks of Figures 1 and 6A.

DETAILED DESCRIPTION The following description contains specific examples for illustrative purposes. The person skilled in the art would appreciate that variations and alterations to the specific examples are possible and within the scope of the present disclosure. The figures and the following description of the particular embodiments should not take away from the generality of the preceding summary.

DESIGN

The present disclosure is able to record and store fine-grained (packet-level, nanosecond resolution) and network-wide telemetry information, in a synchronized manner to ensure visibility, retrospection and correlation. Visibility and correlation are achieved by collecting packet-level telemetry information and leverage data- plane time synchronization. Retrospection is achieved by leveraging on switch data- plane as a fast temporal storage for recording packet telemetry information over a moving time window which beneficially enables recording of telemetry information about all packets within a time window at line rate.

Figure 1 illustrates an exemplary communication network 100 for facilitating network fault analysis. In this embodiment, the exemplary communication network 100 includes a number of communication devices 120, and four network switches 110a, 110b, 110c, 110d communicatively coupled to the devices 120. The four switches 110a,110b,110c,110d are programmable and arranged to relay (i.e. receive and forward) data packets throughout the network, allowing the devices 120 to communicate with each other.

The communication network 100 further includes a controller 130, a storage server 140, and a human-computer interface (HCI) device 150. The controller 130 is communicatively coupled to the four switches 110a,110b,110c,110d, and is configured to instruct the four switches 110a, 110b, 110c, 110d to perform tasks if a network fault is detected at one (or more) of the four switches 110a, 110b, 110c, 110d. The controller 130 is also communicatively coupled to the storage server 140, and the HCI device 150. The HCI device 150 includes an application interface 160 that enables an operator to configure the controller 130, as well as the switches 110. The HCI device 150 is also communicatively coupled to the storage server 150, and the operator is able to retrieve data stored in the storage server 150 for further analysis.

Figure 2 is a partial schematic of the communication network 100 depicting only the controller 130 and one of the switches 110a. Notably, the four switches 110a,110b,110c,110d have similar components, and only the switch 110a is described with reference to Figure 2 for brevity. The switch 110a includes a recording module 112 which is configured to record the data packets (i.e. packet- level telemetry information) received by the switch 110a within a recording window. In addition to recording the data packets received by the switch 110a, information regarding each data packet is also recorded in the recorded data packet. Each recorded data packet contains 3 fields: [pID , pTimem, pTime_out]· pID is a packet identifier which comprises a combination of a hash value of the packet headers (5- tuple flow key) and TCP/UDP checksum. The hash value helps in associating packets from the same flow, whereas the checksum helps in uniquely tacking each packet within the flow. Any hash collisions are then resolved using topology and timing information. pTimein captures an arrival time of the data packet when the data packet enters the switch 110a. pTime₀u_t captures a departure time of the data packet when the data packet leaves the switch 110a. To identify a recorded data packet with a particular flow, the hash value to 5-tuple flow key mapping is stored temporarily in an NIC such as the controller 130, and can be retrieved on demand.

During normal operation when no network faults are detected, the recording window moves ahead, and older data packets that fall outside the recording window are discarded. In other words, the recording window maintains a recent history of the switch’s data transactions. Since the switch 110a operates on the network’s data- plane, by recording the data packets at the switch 110a, the communication network leverages the data-plane as a fast temporal storage for recording data packets. Advantageously, this allows information on the data packets received within the recording window to be recorded at line rate. The switch 110a is configured to inform the controller 130 when a network fault 102 such as packet loss or high latency is detected by the switch 110a which occurred within the recording window. Notably, the recording module 112 includes a pre- trigger buffer 112a (also referred to as a history buffer) and a post-trigger buffer 112b (also referred to as a future buffer). In this embodiment, ring buffer arrays are used for the pre-trigger buffer 112a and the post-trigger buffer 112b. During normal operation, the data packets are recorded in the pre-trigger buffer 112a. However, when the network fault 102 occurs, the switch 110a stops recording data packets in the pre-trigger buffer 112a, and instead records the data packets in the post-trigger buffer 112b. In this way, the recording module 112 is able to record data packets received before and after the network fault 102 occurs.

The controller 130 includes a trigger module 132 that is configured to generate a trigger packet with high priority, a broadcasting module 134 which is configured to broadcast the trigger packet to the other switches 110b,110c, 110d. Upon receiving the trigger packet, the other switches 110b, 110c, 110d stops recording data packets in their respective pre-trigger buffers, and instead records the data packets in their respective post-trigger buffers. The controller 130 further includes a packet collection module 136 (also referred to as a data-plane packet generator) that is configured to generate a collection packet to read the recorded data packet from the pre-trigger buffer 112a and the post- trigger buffer 112b of the recording module 112. Collection packets are also generated by the packet collection module 136 to read the recorded data packets from respective pre-trigger buffers and post-trigger buffers of the other switches 110b, 110c, 110d. The packet collection module 136 is further configured to forward the collection packets to the storage server 140 for further retrieval by the HCI device 150.

The controller 130 further includes a correlation module 138 which is configured to construct an ordering of events leading to the network fault based on the recorded data packets collected from the switches 110a,110b,110c,110d using a time synchronization technique (Data-Plane-Time-synchronization Protocol).

Since data packets are collected only when the network fault 102 occurs, the amount of data packets collected is much smaller than if the network 100 is being continuously monitored. To further reduce the data packets collected, the switch 110a implements a number of compression techniques on the data packets received at each switch 110a, 110b,110c, 110d and scope reduction at the network level.

- Compress pID

Using the switch 110a as reference, the switch 110a further includes a flow packet counter 114 that is configured to count the data packets received at the switch 110a that are within a flow. Consecutive data packets from the same flow have the same packet identifier and only one pID entry has to be recorded for the data packets within the same flow.

- Compress pTimein

The switch 110a further includes an arrival packet counter 116 that is configured to count the data packets arriving at switch 110a within a limited recording window. The limited recording window is a subset of the recording window. In this embodiment, the limited recording window is 64ns. Instead of recording the arrival time of each data packet individually, the arrival packet counter 116 assumes that every data packet received at the switch 110a within the limited recording window has the same arrival time. In this way, only one pTimetn entry is recorded for the data packets received within the limited recording window.

- Compress pTimeout

A similar approach is adopted for the departure time of the data packets leaving the switch 110a. The switch 110a further includes a departure packet counter 118 that is configured to count the data packets departing at switch 110a within the limited recording window. The departure packet counter 118 assumes that every data packet departing the switch 110a within the limited recording window has the same departure time. In this way, only one pTimeout entry is recorded for the data packets received within the limited recording window.

When the flow falls within the limited time window, a single [pID, pTimetn , pTime_0Utl tuple plus an (n-bit) packet counter including the flow packet counter 114, the arrival packet counter 116, and the departure packet counter 118, is recorded for the (2ⁿ) data packets received at the switch 110a within the limited time window.

Figure 3 is a flow diagram of an exemplary method 300 of monitoring the communication network 100 to facilitate network fault analysis. The method 300 is performed by the communication network 100.

At step 310, the communication network 100 is operating normally i.e. no trigger conditions has occurred. Each switch 110a, 110b,110c, 110d records corresponding data packets that are received by the switch 110a,110b,110c,110d within the recording window in the pre-trigger buffer 112a. In particular, the timing information of the corresponding data packets are recorded. The timing information includes arrival times ( pTimetn ) and departure times (pTimeout) of the corresponding data packets. Each switch includes respective internal clocks that are synchronized such that the timing information between the switches are causally consistent. In cases where the internal clocks between at least some of the switches are not completely synchronized and the timing information between these switches include a timing difference, as long as the timing difference is within a tolerance level, then the switches are causally consistent. This tolerance level is defined as a duration (or time taken) to transmit a single data packet between the switches.

An example is given using the switch 110a (denoted here as X) and the switch 110b (denoted here as Y) as reference. Each switch has respective internal clocks Cxand CY. A synchronization error |Cx - CY| between the internal clocks is denoted as Terr. A data packet is transmitted from switch X to switch Y. The data packet leaves switch X at TimeOutx, and enters switch Y at Timelnγ , after a propagation delay D. TimeOutx corresponds to the time packet A enters an egress pipeline in switch X after queuing. This is the latest available time in the dataplane for a packet. Similarly, Timelnγ corresponds to the time the data packet enters the ingress pipeline at switch Y. Consequently, propagation delay is defined in Equation (1) as

D = EgressDelay + DeparserDelay + MACDelay + WireDelay (1)

To ensure causal consistency between recorded data packets, the data packet has to leave switch X before reaching switch Y. In other words, TimeOutx should be less than TimelnY. This is true if the synchronization error between the internal clocks is less than the propagation delay as shown in Equation (2).

Terr < D (2)

If the condition stated in equation (2) can be met, consistency between any set of packets transmitted between two adjacent switches is ensured. The same principle is extended between switches 110 separated by several hops in the network 100. In such cases, increase in propagation delay is higher than the increase in Terr, thus ensuring consistency between switches across multiple hops.

At step 320, the network fault 102 occurs at the switch 110a within the recording window, and is detected by the switch 110a. The network fault 102 is associated with a trigger condition such as congestion at a link, packet drops or packet reordering. By monitoring the trigger condition, the switch 110a is able to detect the network fault 102 when it occurs. The trigger condition for the switch 110a to detect the network fault 102 is defined by the operator through the application interface 160. The switch 110a generates a trigger which is sent to the controller 130 to inform the controller 130 of the network fault 102.

At step 330, the trigger module 132 of the controller 130 generates a trigger packet 332 by cloning a current data packet, stripping the data packet’s payload and headers, and inserting a trigger header. The trigger header includes 1 ) Trigger ID to identify the trigger 2) Trigger Type to classify the trigger; and 3) Trigger Time to specify the time when the trigger occurred.

At step 340, the broadcasting module 134 of the controller 130 is configured to broadcast the trigger packet 332 to the switches 110 in the network. To ensure that the trigger packet 332 is received by the switches 110 in the network, the switches 110 receiving the trigger packet 332 further broadcast it to their neighbouring switches. In a large scale operation, the broadcasting module 134 sends multiple trigger packets 332 to the switches 110. Due to redundancies in the broadcast of the trigger packets 332, unless the network 100 is partitioned, the trigger packets 332 reach the entire network 100. If a switch 110 receives a trigger packet 332 with the same Trigger ID as a previous trigger packet that it received, then the trigger packet 332 is dropped. Upon receiving a trigger packet 332, the switches 110 stop recording their corresponding data packets in their respective pre-trigger buffers, and instead record their corresponding data packets in their respective post-trigger buffers. The size of the pre-trigger buffer that is needed for debugging purpose depends on a round-trip-time (RTT). In a data center context, VM-to-VM (virtual machine) RTT vary between 5μs to 100μs. For a packet rate of 1 Bpps, 1 million recorded data packets could store up to 1ms duration of history. This translates to data packets corresponding to 10’s of RTTs available for debugging. A packet collection process 350 of the method 300 is described next. The packet collection process 350 includes step 360 and step 370. At step 360, the packet collection module 136 generates a collection packet 362 to read the recorded data packets in the pre-trigger buffer 112a and the post-trigger buffer 112b of the recording module 112. The collection packet 362 can only read one recorded data packet each time it traverses through the switch 110a. As such, the collection packet 362 is recirculated multiple times through the switch 110a to coalesce multiple recorded data packets. Advantageously, this reduces large serialization overhead if the collection packet 362 contained exactly one recorded data packet. Once the number of recorded data packets read by the collection packet 362 reaches a threshold which can be set by the operator, the collection packet 362 is forwarded to the storage server 140 for subsequent retrieval by the operator. At step 370, which is similar to step 360, the packet collection module 136 generates further collection packets 372 to read the recorded data packets in the respective pre-trigger buffers and post-trigger buffers of the other switches 110b,110c,110d. The collection packet 372 is subsequently forwarded to the storage server 140 for subsequent retrieval by the operator.

The packet collection process 350 ends when all the recorded data packets in the respective recording modules of the switches 110 are stored in the storage server 140. Notably, regular traffic forwarding is not disrupted during the collection process 350. For cases when an additional network fault is detected during the collection process 350, a new trigger packet is generated by the controller 130 and the collection process 350 is extended.

At step 380, correlation of the recorded data packets is performed. The correlation module 138 uses global timing information (i.e. the arrival times and the departure times of the recorded data packets of the switches 110) to construct an accurate network-wide ordering of events leading to the network fault 102. Hence, the data- plane clocks (used to record the arrival time and the departure time of data packets) across the switches 110 are synchronized to a fine granularity to avoid timing inconsistencies.

It should be noted that the description of the communication network 100 is not meant to be limitative. For example, it should be understood that the communication network 100 can be scaled up to include any number of switches. Furthermore, although the network fault is described as occurring at the switch, it is also possible for the network fault to occur anywhere on the network 100. While storing of the recorded data packets in the storage server is described under the packet collection process 350 which is initiated by the controller 130, storing of the recorded data packets may also be initiated by the switches 110b, 110c, 110d in response to receiving the trigger packet 332. In the case of the switch 110a which detected the network fault 102, storing of the recorded data packets may be initiated by the switch 110a in response to detecting the trigger packet 332, rather than in response to receiving the trigger packet 332. In this instance, the recording module 112 of the switch 110a stops recording the data packets received at the switch 110a in the pre-trigger buffer 112a upon detecting the trigger, and starts recording subsequent data packets received at the switch 110a in the post-trigger buffer 112b.

Furthermore, it may not be necessary to store the recorded data packets in a storage server. The recorded data packets may be stored, for example, in a memory device located on the control-plane. In addition, step 360 and step 370 may be performed concurrently or in sequence.

Further, the packet collection process may also end after sufficient time has elapsed after the network fault 102 occurred.

Additionally, the operator may specify via the application interface 160 for the collection process 350 to begin only when a set of trigger conditions occur within the recording window. To support this, the trigger module 132 maps the trigger type upon receiving the trigger packet 332 at any of the switches, and again using the switch 110a as reference, the switch 110a maps the trigger type to a bit-index in a temporal trigger bit-array. Additionally, the switch 110a sends the trigger packet 332 to the controller 130. The controller 130 maintains a timer for each trigger type, and clears the bit corresponding to the trigger type upon expiration of the timer. Hence, the temporal trigger bit-array maintains a list of triggers that occurred in the network for the recording window. The collection process 350 can then begin based on the values of particular trigger-types in the temporal trigger bit-array. The trigger conditions are customizable by the operator via the application interface 160.

As explained above, the trigger packet 332 is described to be broadcasted to all the switches 110 in the network 100, and thus, the recorded data packets are collected from all switches 110 the network 100. This may not optimize the network source if the network is huge and the root cause of the network fault 102 is localized to a group of switches. Instead, an alternative is to selectively broadcast (or multicast) trigger packet 332 to some of the switches 110. Each switch maintains a list of communication links or a list of switches 110 from which each switch receives data packets during a given recording window. Upon informing the controller 130 of the network fault 102 as detected by the switch 110a, instead of broadcasting the trigger packet 332, the broadcasting module 134 may selectively multicast the trigger packet 332 to only the communication links or the switches 110 from where the packets were received within the recording window. For example, the switch 110a may have only received data packets from the switch 110b during the recording window, in which case, the broadcasting module 134 may only send the trigger packet 332 to the switch 110b thus minimizing unnecessary data stored in the storage server 140. In this way, the network 100 is concerned with where the current set of data packets came from, and provides the ability to trace every data packet which appeared in the trigger switch 110a to its source, while reducing the number of switches involved in the collection. In this way, collection overhead may be reduced at network level.

DEBUGGER The recorded data packets are stored in the storage server 140 in a relational database (RDBMS) which allows the recorded data packets to be queried using SQL. The database also stores information regarding the trigger events, network topology, and position of switches within the topology.

Figure 4 illustrates a schematic diagram of the storage server 140 of the communication network 100. The storage server 140 employs a relational database (also referred to as a debugger database). Before the recorded data packets are stored in the database, hash collision removal is performed using topology and timing information. The recorded data packets are organized in four tables: Packetrecords 410, Triggers 420, Links 430, and Switches 440.

Packetrecords: This table 410 stores basic and custom fields within each recorded data packet. Each recorded data packet has 1) Switch ID, 2) Packet ID, 3) Packet Hash, 4) TCP/UDP Checksum 5) Time In, 6) Time Queued, 7) Time Out and, 8) Operator-specified statistics. Note that Packet ID is just a combination of packet hash and the checksum, stored separately to facilitate flow-level queries as well as packet-level queries.

Triggers: This table 420 stores information regarding each trigger event. Each trigger event stores: 1 ) Trigger Type, 2) Trigger Time, and 3) Trigger Origin Switch. This enables classification of network faults based on the trigger type.

Links: This table 430 stores the topology of the communication network 100, as specified by the operator instead of inferring from the packetrecords table, as non- zero utilization of all links in the topology is not guaranteed. Each link stores endpoints and link capacity.

Switches: This table 440 stores the position of a switch in the topology, e.g. ToR, Aggregation, Core, etc.

To determine the root cause of a network fault, the operator performs SQL queries on the above tables through the application interface 160. For example, in the case of an incast, culprit packets and their routes can be obtained by combining information from packetrecords 410, triggers 420, and links 430 tables. The output of these queries can also be used to replay or build dashboards using special building tools that are available to the skilled person.

Two exemplary queries are listed below: 1 ) List events in the switch 110a which triggered a network fault:

Select * FROM packetrecords JOIN triggers

ON packetrecords.switch = triggers.switch;

2) Check where the problem-causing packets came from and the path they took: Select * FROM (packetrecords as P) WHERE id

IN ( select id from packetrecords JOIN triggers ON p.switch = triggers.switch AND p.timejn < triggers.time) ORDERBY timejn ASC;

CONFIGURATION Figure 5 is a flow diagram illustrating an exemplary work flow 500 of an operator 510 for configuring or programming the switches 110 and the controller 130 (collectively referred to as network data-plane 520), and thus provides an interface for defining the recorded data packets (‘p-records’) and triggers for programming the switches 110.

The application interface 160 provides a platform for the operator 510 to configure or program the following parameters of the network data-plane 520:

1 ) network statistics (fields) to be collected in the recorded data packets (‘p- records’);

2) number of recorded data packets (p-records) to be collected; and

3) trigger conditions to initiate the packet collection process 350.

The fields specifiable in the configuration are:

1 ) switch-provided metadata (queue depth, ingress port, egress port);

2) packet header data (flowid); and

3) data that is computed and stored in user metadata by the operator (link_utilization, counters, EWMA). In this embodiment, the network data-plane 520 is implemented on Barefoot Tofino switches using P4 (programming language). The configuration is compiled, then translated to P4 and embedded with the original switch P4 program. The configuration Syntax is provided below. precord { fields { field_list_l; field_list_2;

} default_field : field_list_{x}; history : {y}; future : {z }; time_window : {t ms};

} trigger { conditions { cl = condition_l; c2 = condition_2;

} collection { c [& I ] c2' [& I ] c3' ...

}

A recorded data packet (i.e. the p-record) has a list of fields (denoted as “field_lists”). Each “field Jist” contains one or more (metadata) fields from Packet Header Vector (PHV) which is supported by switch architecture and defined in the operator’s P4 program. A “default_field” list is specified by the programmer which is the active “fieldjist” to be included in each recorded data packet. The current active “fieldjist” is configurable during runtime. The "history" refers to the total size of the history buffer (or pre-trigger buffer 112a), while "future" refers to the size of the future buffer (or post-trigger buffer 112b). The "time_window" is the target recording window (in milliseconds), and is used to maintain the trigger and broadcast window. The operator 510 declares a list of trigger conditions which are predicates operating on header/metadata fields (e.g. meta.link_utilization > 90) and is configurable during runtime. Based on the trigger conditions declared, the packet collection process 350 is configurable to be performed using an individual trigger condition or a combination (AND(“&”), OR(“|”) of multiple triggers defined). While trigger condition c occurs in a switch locally, cO represents a trigger condition happening elsewhere in the network. Hence, a representation like c1&c20 would trigger a coordinated collection by a switch 110 only if condition c1 occurs at the switch 110, and c2 has occurred in another switch in the network. Trigger conditions and collection conditions are definable based on several network metrics in the network such as packet drop, high packet queuing, and loops.

The application interface 160 supports changes to the configuration of the “field_list”s and trigger conditions while the communication network 100 is running without the need to recompile and load a new P4 program. The application interface 160 facilitates changing of the configuration in the following ways:

1) Adding a new “fieldjist” or editing the active “fieldjist”; and

2) Adding/removing trigger conditions. Notably, these changes are restricted to the available PHV contents in the data- plane as there is no modification to the underlying P4 program.

The application interface 160 includes a compiler 560. When compiling the configuration, the compiler 560 enumerates all the PHV contents (packet headers, switch and user-defined metadata) of the P4 program. The compiler 560 then creates template tables with actions for each PHV container to be stored in the recorded data packet. This facilitates the runtime to dynamically add/remove the fields to be recorded in each recorded data packet. The fields could be TCP sequence number, TCP flags which are part of packet headers, or “ingress_port”, “queue_depth”, etc. which are part of the switch meta-data. The field to be added cannot be a metric (e.g. EWMA) that is not defined or a packet header that is not parsed by the already compiled P4 program. Since PHV contents are limited, enumerating and storing them in actions do not significantly increase data-plane resource consumption. The maximum bytes in a recorded data packet and the number of recorded data packet entries in the recording window is fixed at compile- time based on the available hardware resources (e.g. stateful ALUs and SRAM). To facilitate addition/removal of trigger conditions at runtime, the compiler 560 uses similar enumeration technique and generates range-based match-action tables. Since, collection is performed based on the trigger bit-array value, this value is added/modified based on the collection condition changes. Additionally, the application interface 160 updates the storage server 140 each time the configuration is changed to ensure that recorded data packets are stored correctly. Advantageously, the configuration can be continuously tweaked using the application interface 160 to suit the statistics that the operator 510 wants to keep an eye on. The pseudocode for recording, trigger, and collection is provided below. precordArray : Register Buffer Array writelndex : Current index to write N : Size of the ring buffer POST_TRIG_SIZE : Size of buffer for post trigger pwritelndex : Current index to write post trigger triggerArray : Temporal Trigger bit-array triggerConditions : Bitmask configuration of TriggerArray for collection TimeNow : Current Global Time

Packet Record Logic if packet is normalPacket: if collectlnProgress == False:

Store Hash, Timenow, Timequeue, CustomStats in precordArray[writelndex] writelndex = (writelndex + 1) % N add_to_port_group(ingress_port) else : if pwrite Index < POST_TRIG_SIZE:

Store Hash, Timenow, Timequeue, CustomStats in precordArray[pwritelndex] pwritelndex = (pwritelndex + 1) if triggerHit is True: clone(packet) if packet is clonedPacket: add_header(trigger) remove_header(ipv4/tcp/udp) trigger.time = Timenow trigger.id = triggerld trigger.type = triggerType recirculate() Trigger Packet Logic if packet is triggerPacket: if trigger.id != lastSeenld[trigger.source]: triggerArray | = 1 « (trigger.type - 1); lastseenld[trigger.source] = trigger.id; else: dropO if triggerArray in triggerConditions: collectlnProgress = True Multicast(port_group) Collection Packet Logic if packet is collectPacket: if collectPacket.entries < MAX ENTRIES PICT: p-record = precordArray[readlndex] readlndex = (readlndex + 1) % N add_header( p-record) collectPacket.entries++ recirculateO else: l2fwd_to_collector()

IMPLEMENTATION

An exemplary implementation of the communication network 100 is described next according to a specific embodiment. The communication network 100 is implemented on Barefoot Tofino switches using P4 (~ 1900 LoC (lines-of-code)). A precise Data-Plane-Time-synchronization Protocol (DPTP) is used for time synchronization between the switches 110. DPTP is implemented on PSA architecture and provides a global timestamp in the data-plane. Baseline contents of the recorded data packets are stored in both the ingress and egress pipeline of the PSA architecture. Ingress pipeline maintains the ‘writejndex’ of the pre-trigger buffer 112a (i.e. history ring buffer array) upon a data packet arrival and stores the pID and pTimein. Egress pipeline stores pTimeout and custom “fieldjist” to be captured. pID is a combination of 16-bit flow hash, and 16-bit TCP/UDP checksum. pTimein is a 32-bit global timestamp (at nanosecond granularity) of the data packet when it enters ingress pipeline of the switch. On the other hand, pTimeout is a 24- bit field which captures time when the packet enters egress pipeline. 24-bit Timeout allows capturing upto 16 ms of queuing delay which is much larger than normal queuing times observed in the switches 110. Hence, the basic uncompressed recorded data packet is 11 -bytes in size. To implement compression, separate 8-bit counter arrays associated with pID (flow packet counter 114), pTimein (arrival packet counter 116) and pTimeout (departure packet counter 118) are maintained. A control-plane is also implemented which performs:

1 ) Timekeeping of temporal trigger bitarray;

2) Updating multicast port-group; and 3) Set up packet collection module 136 for collection of recorded data packets.

The compiler 560 is implemented using Rust (~ 4000 LoC). The compiler 560 takes as input the configuration and the switch P4 program, and generates a P4 code that implements storage and collection logic for the recorded data packets. Storage and trigger conditions are executed using stateful ALUs. Additionally, the runtime environment (implemented in Python) accepts commands to modify configuration such as: 1) changing the active fieldjist, 2) adding new fieldjist, and 3) adding/removing trigger conditions. These configurations translate to control-plane configuration updates of the composed switch P4 program.

Finally, the storage server 140 is implemented using n2disk utility (with PF_RING) to store collection packets as PCAP files in the local disk. Additionally, a Python program is implemented to parse the PCAP files and store individual recorded data packets in a MySQL database. The storage server 140 also takes as input the configuration to parse and name the custom statistics fields correctly from the PCAP files. Every time the active “fieldjist” is modified through the application interface 160, the update is also passed on to the storage server 140.

RESULTS A communication network 600 according to a second embodiment is implemented using four physical servers, and two Barefoot Tofino Wedge100BF-32X switches. The communication network 600 differs from the communication network 100 only in switch topology. Figure 6A illustrates a schematic diagram of the switch topology 610 for the communication network 600. Notably, like-components (such as the controller 130) are not illustrated in Figure 6A. The servers and the switches S1- S10 are virtualized to create a fat-tree topology. Switches S1 to S5 are virtualized on the first switch (“Tofino A”) using 10G loopback links. Similarly, S6 to S10 are virtualized on the second switch (“Tofino B”).

Each virtual switch’s data-plane is synchronized to S10 using DPTP. Figure 6B illustrates a schematic diagram of a network-level DAG (directed acyclic graph) for DPTP synchronization. A register array of 50K entries in SRAM is created to implement the pre-trigger buffer 112a for storing recorded data packets on switches A and B. Each virtual switch stores its recorded data packet in a slice of this register array. For example, S1 uses 0-1 OK, S2 uses 10K-20K and so on. 5K entries per virtual switch are used for the post-trigger buffer 112b for capturing recorded data packets after a trigger condition is hit. Each recorded data packet is 16 bytes in size and each virtual switch stores up to 10K recorded data packets.

The configuration Syntax for the “fieldjist” is provided below: fieldjist SyNDB_verification { meta.packet_sequence;

} fieldjist SyNDB_scenario { meta.ingress_port; meta . I i n k_uti I izatio n; meta.drop_counter;

} precord { fields {

Sy N D B_ve rificatio n; SyNDB_scenario;

} default_field : SyNDB_scenario; history : 10000; future : 10000; time_window : 1; }

The configuration Syntax for the trigger conditions is provided below: trigger { conditions { a = meta.time_queue > 10000; b = meta.fwding_table_miss == 1; c = meta.fwding_update == 1; d = meta.pkt_sequence == 5000;

} collection { a; b & c'; d;

}

Performance of the communication network 600 is described in the following three segments namely: design capabilities, network debugging scenarios, and overhead.

DESIGN CAPABILITIES

The capabilities of the communication network 600 are quantified based on time, scale and granularity, and correlation.

Time: To ensure that the recorded data packets captured are consistent, the time synchronization error is kept below the propagation delay between adjacent switches (equation (2)). Figure 7A illustrate a line graph of three DPTP time- synchronization error graphs 712,714,716 between switches that are 1-hop, 2-hop, and 3-hops away respectively. The synchronization error graph 716 indicate that the synchronization error between switches separated by 3-hops is less than 50 ns. Additionally, the synchronization error is significantly lower, under 10 ns, in case of switches 2-hops apart as indicated by the synchronization error graph 714. Figure 7B is a line graph 720 illustrating the propagation delay between two pairs of switches. A first propagation delay graph 722 shows the propagation delay between the first pair of switches, S1 and S4 (virtualized using the same physical switch), while a second propagation delay graph 724 shows the propagation delay between the second pair of switches, S4 and S10 (virtualized using different physical switch).

As illustrated in the line graph 720, the propagation delay between S1 and S4, and between S4 and S10, varies around 400-450ns. The difference in delay (= 30 ns) between the two pairs of switches, S1-S4 and S4-S10, is due to the effect of clock- drift between the physical switches. Line graphs 710,720 illustrate that the synchronization error is much lower than the propagation delay, and hence consistency is ensured.

Scale and Granularity: The communication network 600 is able to capture information at a network-wide scale and with packet-level granularity. A Constant- Bit Rate (CBR) of 10Mpps (limited due to 10G host links) is used, with each data packet annotated with a sequence number. The data packets are sent from switch S1 to switch S7 along switch path S1-S4-S10-S9-S7. At each switch, a “packet_sequence” number of each packet is recorded. Next, S1 is configured to generate a trigger and broadcast the trigger packet after receiving 5000 data packets. After receiving the trigger packet, each switch S1 ,S4,S10,S9,S7 sends their recorded data packets to the controller 130 where the arrival time of data packets based on their sequence number is plotted. Figure 8A illustrates a line graph 810 of the arrival time of data packets for a 500 packet sequence at each switch S1 ,S4,S10,S9,S7 along the switch path for CBR traffic. Figure 8B is an enlarged view of line graph 810. Figure 8B illustrates a line graph 820 of the arrival time of data packets for a 50 packet sequence at each switch S1,S4,S10,S9,S7 along the switch path.

Line graphs 810,820 illustrate that every data packet is recorded in the next switch only after the data packet has left the previous switch. Furthermore, the timestamps increase linearly which means that all the recorded data packets across all the switches in the communication network 600 are consistent and at expected intervals. Furthermore, the communication network 600 records all 5000 data packets on all switches traversed by the data packets.

Correlation: The communication network 600 identifies the root causes for network faults by accurately presenting the ordering of events in the network 600 leading to the network faults. The ability to track the progression of the packet across the network using the recorded data packets is important for identifying the root causes of the network faults.

As the recorded data packets are stored in a ring buffer, it is likely that old recorded data packets that fall outside of the recording window are discarded to accommodate new recorded data packets during the time the trigger packets travel across the communication network 600. The likelihood depends on the distance from the origin trigger switch, the number of recorded data packets, and network utilization. Instead of analysing the recorded data packets collected from the switches, the correlation between the recorded data packets seen by the storage server 140 after a trigger is generated is analysed instead.

A similar setup, with 10 Mpps CBR traffic transmitted along the switch path S1 -84- SI 0-S9-S7 with each switch storing up to 10K recorded data packets entries, is used here. Switch S7 is configured to generate a trigger after it has received 10,000 data packets. Once the trigger is generated at S7, the trigger is broadcast to other switches in the communication network 600 to initiate the packet collection process 350. A percentage of common recorded data packets seen by other switches compared to those seen by switch S7 (the switch that initiated the trigger) is obtained based on the recorded data packets available at the storage server 140. The results are shown in Figure 9 which illustrates a bar graph 900 of the percentage of common recorded data packets seen by other switches compared to those seen by switch S7. Figure 9 indicates that with increasing number of hops relative to S7, the percentage of common recorded data packets reduces slightly. Nevertheless, the percentage of common recorded data packets collected is above 99% between two switches that are 4-hops away i.e. between switches S7 and S1. In other words, the communication network 600 accurately records the progression of most data packets leading up to the network fault.

NETWORK DEBUGGING The communication network 600 is utilized here to debug four kinds of transient network faults: (1 ) microburst caused by incast traffic, (2) transient packet drops due to network update, (3) transient load balancing issues, and (4) congestion due to network updates.

Each recorded data packet is configured to contain the custom "fieldjist: SyNDB_scenario" which contains three metrics: 1) Ingress Port 2) Link Utilization and 3) Drop Counter. Ingress port of a packet is provided by the switch meta-data. Link utilization is calculated over a window of 10μs in the data-plane using a low- pass-filter. Drop counter is the number of packets which missed the forwarding table.

Additionally, the communication network 600 is configured to perform the packet collection process 350 based on three triggers: (1) High Queuing Delay (trigger A), (2) Table Lookup Miss (trigger B), and (3) Network Configuration Update (trigger C). The packet collection process 350 is initiated when a switch receives trigger A or a switch receives both triggers B and C. In each of the following scenarios, data and control traffic is generated to emulate the corresponding network faults. The host data traffic is generated using MoonGen.

Scenario 1 : Microbursts Microburst is common in data centers in which congestion is caused by a short burst of packets lasting for at most a few hundred microseconds. Traffic bursts occur due to various reasons like application traffic patterns (e.g. DFS, MapReduce), TCP behavior and also NIC (Segmentation offload, receive offload) . Complex interactions and traffic patterns make microbursts debugging extremely complicated. It is necessary to find the root cause to determine how the network fault should be resolved.

The communication network 600 is able to attribute different root causes to two microburst events that are detected using the same trigger conditions. In one scenario, the microburst is due to incast of synchronized application traffic. In the other scenario, the microburst is caused by the interaction of uncorrelated flows with different source-destinations.

Synchronized Application Traffic: Fan-in traffic pattern of data center networks exhibited by applications such as MapReduce and Distributed File System (DFS) is an incast traffic pattern where many sources transmit to a small number of destinations within a short time window. These short bursts of traffic increase the queuing delay at microsecond time-scales. The challenge in identifying the root cause of such microburst is that many sources contribute to the total traffic and the burst occurs only for a very short time.

Figure 10 illustrates the switch topology 610 of the communication network 600 set up to simulate synchronized application traffic for the first debug scenario. As indicated in Figure 10, hosts H1 to H6 are set up to send data to H8. Each host sends a burst of ten 1500-byte packets at an average rate of 1 Gbps to H8 via ToR switch S7. All links have capacity of 10 Gbps. In this embodiment, the sources start in an asynchronized fashion, but over time transmissions from different hosts can synchronize their transmissions causing sudden spikes in queuing delays on switch S7, triggering the trigger A. Such synchronization of periodic messages over time are known to occur in routing message updates. To determine if the issue of microburst is caused by synchronized fan-in traffic, a query of the queuing delay at S7 together with the packet arrival information at the ToR switches before the microburst detection is performed at the storage server 140 by the operator 510 through the application interface 160. The Syntax for the query to list the packet arrival times at ToR switch ports and queuing delay at S7 is provided below:

SELECT switch , ingress_port , timejn FROM packetrecords WHERE id IN (SELECT id FROM packetrecords AS A JOIN triggers as T ON (A.timejn < T.time AND A.switch = T.switch)) AND switch IN (SELECT switch FROM switches WHERE type = "tor");

SELECT time_queue FROM packetrecords where switch=7;

Answers to the query are shown in Figure 11 A and 11 B. Figure 11 A illustrates a line graph 1110 of the number of data packets in the queue building up (“queue build- up”) at switch S7 over time for the first debug scenario. Figure 11 B illustrates a line graph 1120 of the data packets arriving at each host over time for the first debug scenario.

By correlating the data packets arriving from different hosts before the bursts occurred, it is shown that the data packets that make up the bursts are transmitted by hosts H1 to H6 synchronously and reach S7 at about the same time. The root cause of this microburst from H1 to H6 can thus be determined as host-based synchronized traffic. Non-Synchronized Application Traffic: Synchronized incast is one possible cause for microburst. There are many other scenarios for microbursts. Microburst events are generated through the interaction of non-synchronized traffic. In this next case of the first debug scenario, hosts H1 to H6 send bursts of 10 packets at an average rate of 1Gbps to H8. A randomized delay of up to 5μs is added before sending a burst to minimize traffic synchronization. In addition, another flow sends a burst of 10 packets every 1ms of packets from H9 to H6 (through S1-S4-S5-S8-S6) at an average rate of 2Gbps. Note that this flow (H9 - H6) runs asynchronously and does not travel through the bottleneck switch (S7) where the microburst occurred. Nevertheless, microbursts are observed on the link from S7 to H8.

Figure 12 illustrates the switch topology 610 of the communication network 600 set up to simulate non-synchronized application traffic for the first debug scenario. A query of the queuing delay at S7 together with the packet arrival information is performed by the operator 510 through the application interface 160.

Figure 13A illustrates a line graph 1310 of the queue build-up at switch S7 over time for the first debug scenario. Figure 13B illustrates a line graph 1320 of the data packets arriving at each host over time for the first debug scenario.

The shaded portion 1330 in the line graph 1320 shows the duration in which the flow from H9 to H6 occurs. Line graphs 1310 and 1320 indicate that the microburst is likely due to a combination of factors, namely (1 ) the synchronization of the bursts among the pair of flows from H1 & H2, H3 & H4 and H5 & H6; (2) the burst from H9 to H6 arriving just before the bursts from H1 & H2 in S1 and the bursts from H3 & H4 in S5. This causes queue build-up resulting in packets from H1 to H6 arriving at S7 at about the same time. The root-cause is thus determined to be due to interaction of network queuing effect.

Scenario 2: Network Configuration Updates

Networks operate in a dynamic environment where operators frequently modify forwarding rules and link weights to perform tasks from fault management, traffic engineering, to planned maintenance. However, dynamic network configurations are complex and error prone especially if they involve several devices. For example, updating the route for a flow(s) can lead to unexpected packet drops if the updates are not applied consistently or efficiently.

The communication network 600 is shown to identify whether a transient error is due to a network update or localized hardware fault. For the set-up, each forwarding rule is assumed to indicate a version number and route based on destination MAC address as shown in the Syntax below: table_addforward send_to_portethernet_dstAddr <dstMac> => output_port <num> entry_ver <num>

A transient forwarding blackhole occurs when an out-of-order execution of a network update gives rise to nondeterm inistic network behaviour leading to temporary packet loss. Figures 14A-C illustrate the switch topology 610 of the communication network 600 set up to simulate the transient blackhole according to the second debug scenario. Figures 14A-C depict the initial, final and transient states of the switch topology 610 of the communication network 600 respectively when routing of a flow from H1 to H 7 is updated by rerouting traffic from S10 to S8. However, the transitioning as illustrated in Figure 14A to 14B requires updates to both S10 and S8. In this network update, a new rule to route the flow needs to be added to S8 first and then S10 needs to update the policy to route the flows from S9 to S8. If the update at S8 occurs later than the reroute at S10, a temporary forwarding blackhole 1410 will form as shown in Figure 14C, resulting in a data packet drop. However, the data packet drop at S8 due to table lookup miss could also be flagged as a parity error when the context of the table miss is unknown. To check if a delayed network update is a possible cause, the operator 510 sends a query via the application interface 160 for the forwarding rule versions observed by each data packet at the two switches S10 and S8 along with the number of drops observed. Results of the query are illustrated in Figures 15A and 15B. Figure 15A illustrates a line graph 1510 of the packet drops over time at two switches S10 and S8 for the second debug scenario. Figure 15B illustrates a line graph 1520 of the forwarding rule version over time at the two switches S10 and S8 for the second debug scenano.

From Figures 15A and 15B, by correlating the rule version number and packet drops in time, it is clear that the data packet dropped can be attributed to a transient inconsistency in rules between switches S8 and S10. The Syntax for the query for correlating the network update with packet drops is provided below:

SELECT forwarding_rule_ver, drop_counter FROM packetrecords WHERE switch=8 OR switch=10;

Notably, data collection is triggered based on an aggregating trigger defined over multiple switches. Switch S8 broadcasts the trigger B to other switches on detection of forwarding table miss and S10 broadcasts the trigger C on policy update. Trigger B or C by itself does not trigger data collection. When a switch receives both triggers (within a time window), then data collection is triggered. Such multi-switch trigger reduces both false-positives and collection overhead.

Scenario 3: Transient Load Balancing Issues

Modem data center topologies such as fat-tree provide redundant paths between a source-destination pair. ECMP is a common load balancing policy for handling multipath routes. However, it has a lot of inefficiencies in distributing the load evenly. As a result, it is observed that a subset of core-links regularly experience congestion while there is spare capacity on other links. Figure 16 illustrates the switch topology 610 of the communication network 600 set up to simulate congestion at switch S9 due to a link load balancing problem. In the third debug scenario, each switch calculates the hash of the 5-tuple and redirects the flow via one out of the two links. A variety of combinations of 5-tuple flows which can lead to load imbalance in the communication network 600 is set up. In one such combination, S9-S7 is congested, even though spare capacity is available at S8-S7. Multiple flows is created in the network originating from H1 to H6 with the destination as H7 and H8. The traffic (containing faulty combination) is sent at short bursts, with an overall throughput of 1 Gbps per flow. The load imbalance happens when both the core switches (S5 and S10) direct too many flows to S9, resulting in congestion on the S9-S7 link. With only the congestion indication, it is be difficult to determine the root cause. To determine if load imbalance is the root cause, the operator 510 observes the queuing duration and link utilization of various links at the same time by plotting the utilization of the links measured at the same time at packet level granularity. The query for link utilization and queue duration is provided below: SELECT switchl , switch 2 , link_utilization*8 , time_queue FROM (SELECT switchl , switch 2 FROM links WHERE (switchl IN (select switch FROM switches WHERE type != "tor") AND switch2 IN (SELECT switch FROM switches WHERE type != "tor" ))) AS L JOIN (SELECT * FROM packetrecords) AS A JOIN (SELECT * FROM packetrecords) AS B ON (A. hash = B.hash AND A.switch = L. switchl AND B.switch = L.switch2);

SELECT forwarding_rule_ver FROM packetrecords WHERE switch=10;

Results of the query are illustrated in Figures 17A and 17B. Figure 17A illustrates a line graph 1710 of the queue duration over time at six links: S9-S7, S5-S9, S10-S9, S8-S7, S5-S8, S10-S8 for the third debug scenario. Figure 17B illustrates a line graph 1720 of the link utilization over time at the six links for the third debug scenario.

High link utilization is observed at S9-S7 link S8-S7 sees no significant utilization. Furthermore, the congestion trigger at the link S9-S7 is preceded by higher than normal link utilization in links S5-S9 and S10-S9. Thus, the load distribution from the core switches (S5 and S10) to S8 and S9 is heavily skewed, with most flows being routed via S9 during some time intervals. Based on this observation, the operator 510 can infer that the root cause for the congestion at the link S9-S7 is the load imbalance caused by the load balancing scheme. Scenario 4: Network updates leading to congestion

Network updates to the forwarding rules can cause not only packet drops, but also cause intermittent congestion in the network.

Figures 18A-C illustrate the initial, final and transient states of the switch topology 610 of the communication network 600 respectively to simulate link congestion at link S10-S8 due to network update. In the fourth debug scenario, two flows, H1 to H 7 and H2 to H6, of 6 Gbps each are simulated. The initial routing for these flows is shown in Figure 18A. To reduce the traffic on links S8-S6 and S9-S7, routing is changed to the configuration shown in Figure 18B. This requires the network to update two forwarding rules in S10: 1 ) Route flow 1 via S8, and 2) Route flow 2 via S9. While the rules are being updated, a possible state is shown in Figure 18C, where the two flows share the link S10-S8. This causes over-utilization of the S10- S8 link as the link capacity is 10 Gbps whereas the combined traffic rate of the two flows is 12 Gbps.

This problem may cause S8 to generate two possible triggers, one for packet drop and one for congestion. The trigger is generated when the queuing duration increases beyond the threshold. The controller 130 will initiate the packet collection process 350 on whichever trigger condition is raised first, whereas only information relating to the second network fault will be sent to the storage server 140.

Based on the packet drop and congestion trigger, the root cause can mostly be attributed to network m ^configuration or load balancing issues. To rule out network misconfiguration due to policy updates, the operator 510 sends a query via the application interface 160 for the queuing duration, link utilization, and forwarding rule versions. The query Syntax for plotting forwarding rule version of S10 along with link utilization and queueing times of core links is provided below: SELECT switchl , switch 2 , link_utilization*8 , time_queue FROM (SELECT switchl , switch2 FROM links WHERE (switchl IN (select switch FROM switches WHERE type != "tor") AND switch2 IN (SELECT switch FROM switches WHERE type != "tor" ))) AS L JOIN (SELECT * FROM packetrecords) AS A JOIN (SELECT * FROM packetrecords) AS B ON (A.hash = B.hash AND A.switch = L. switchl AND B.switch = L.switch2); SELECT forwarding_rule_ver

FROM packetrecords WHERE switch=10;

Figure 19A illustrates a line graph 1910 of the queue duration over time at six links: S9-S7, S5-S9, S10-S9, S8-S7, S5-S8, S10-S8 for the fourth debug scenario. Figure 19B illustrates a line graph 1920 of the link utilization over time at the six links for the fourth debug scenario. Figure 19C illustrates a line graph 1930 for the forwarding rule versions over time for the fourth debug scenario.

As indicated by Figures 19A-C, a forwarding rule update is observed on S10 just before queue build-up of 50μs in S8, and increased link utilization is observed on the S10-S8 link. Subsequently, a second update is observed in S10. Following this update, both the queuing duration and link utilization return to levels prior to the first update. Thus, it can be concluded that the root cause for the transient congestion at S8 was a transient network misconfiguration due to policy updates.

It can be seen that in order to identify the root cause of complex network faults, it is often necessary to have the visibility into packet statistics, the ability to look at past events (retrospection) and the timing information to correlate observations among across switches (correlation), capabilities which are provided by the communication network 600.

OVERHEAD SRAM Overhead: Figure 20 illustrates a bar graph 2000 for SRAM consumption used by the pre-trigger buffer 112a for different packet rates and compressed recorded data packet sizes. The vertical axis indicates the SRAM consumption (or usage) in MB and the horizontal axis indicates the recorded time in the pre-trigger buffer 112a based on a baseline compressed recorded data packet size. For example, "100μs(11B)" represents 100μs of history buffer with 11 -byte baseline compressed recorded data packet size, and "100μs(11B)C" represents the same configuration with compressed recorded data packets. At packet rates of 1 Bpps, the communication network 600 consumes an average of =5MB using compressed recorded data packets while consuming =10MB of

SRAM to record 1 ms of history for uncompressed baseline recorded data packets respectively. The average case of compressed recorded data packets is computed by using the mean of the best-case and worst-case compression scenarios.

Compression saves 50% of SRAM memory on average, and saves up to 80% depending on the traffic pattern. The SRAM consumption is easily accommodated by the latest switching ASICs which contain SRAM greater than 100 MB. High utilization of the SRAM is observed only across a few switch ports during congestion events. Thus, the pipeline utilization is usually much lower than its capacity. To support lower packet rates like <500 Mpps, the communication network 600 is configured to use about 2MB of SRAM. In this way, the operator 150 is able to trade- off between the total capture duration and the memory budget.

Collection Overhead: The overhead incurred at the switch to collect the recorded data packets and ship them to the storage server 140 is measured. To perform collection, the controller 130 sets up the packet collection module 136 to generate collection packets 362,372 at 100 Mpps. The collection packets 362,372 coalesce recorded data packets (64 recorded data packets per packet) by recirculation.

Collection of 10000 compressed recorded data packets requires 104 collection packets on average. It then takes a total time of 245μs to evict the recorded data packets. Also, it takes only 45μs to collect these packets in the data-plane using packet generator and recirculation, with the majority of the time being signalling from the controller 130 to start the packet collection module 136. This timing overhead can be reduced drastically in special architectures which support triggering packet generation from data-plane events. The overall pipeline overhead incurred is about

100 Mpps and bandwidth consumption is limited to re-circulation port and the collection forwarding port (e.g. mirror port), thus not affecting regular data-plane traffic. In order to collect 1M compressed recorded data packets (1ms history at 1

Bpps), it takes only about 323μs on average. Out of this, it takes 123μs to recirculate 7800 packets to collect the compressed recorded data packets, and 200μs to trigger the packet generation. Thus, the recording module 112 can resume recording (in the pre-trigger buffer 112a) after half the recorded data packets are collected in about 260μs. This means the communication network 600 is able to support up to

=6000 triggers/sec. Notably, the recording module 112 has a post-trigger buffer 112b for storing recorded data packets once trigger conditions are met. To support continuous recording of all future events, the minimum duration the post-trigger buffer 112b needs to capture would be 260μs. With a lms pre-trigger buffer 112a, the ability to support 1000 triggers per second without any break in recording is sufficient to enable continuous monitoring. Hence, the communication network 600 is able to capture microbursts occurring every few milliseconds as well as network incidents separated by hours.

Switch Resource Overhead: The total hardware resource consumption of the application interface 160 is compared to the baseline switch.p4. Switch. p41 is a baseline P4 program that implements various common networking features applicable to a typical data center switch. The total resources consumed by all the components (switch.p4, DPTP and the communication network 600 (referred to as SyNDB in the table)) is shown in Table 1 below.

Table 1

The majority of resources required for the communication network 600 arise from the need to store precords in the data-plane. The communication network 600 consumes 33% of the stateful ALUs and 15% of the SRAM to store recorded data packets and trigger conditions in the evaluation configuration. Thus, the communication network 600 can be implemented on top of switch.p4 in programmable switch ASICs. Although the present disclosure has been described with reference to specific exemplary embodiments, various modifications may be made to the embodiments without departing from the scope of the invention as laid out in the claims. For example, the methods described may be operated on any computer systems with the proper software tools to execute the instructions.

In addition, switches compatible with the communication network 600 can be deployed incrementally with each new switch providing additional visibility into the network 600. To maximize effectiveness, deployment can start from ToR switches where most congestion events occur. For DPTP synchronization, links can be added between adjacent ToR switches. Further, network tomography techniques available to the skilled person can be used to infer the core network’s states.

Further, as packet processing rate increases beyond 1 Bpps and parallel pipelines increase, the SRAM overhead increases linearly. Since, SRAM is an expensive resource, an alternative approach to scale is to use the relatively large DRAM available in switch CPU. Alternatively, external DRAMs using RDMA channel (without CPU overhead) to store high-speed look-up tables or packet buffers may also be employed. The communication network 100,600 can leverage such external memory to store p-records for a longer duration without having to change its design.

The communication network 100,600 may also be used to debug timing bugs in distributed systems. In most of these timing bugs, a deadlock is caused due to a missed or delayed message. The communication network 100,600 can help to find the reason why the message was missed or delayed by raising trigger conditions when it observes reordering or drops of certain packets.

The communication network’s 100,600 capability may go beyond debugging as well. A network device’s configuration can be used along with network traces to create a “replay” of the network fault. This in turn, can be used to form regression test suites. Such test suites are integral to large software development but are rare in network testing and management. For example, a bug in a network load-balancing implementation may cause a skewed distribution for a certain traffic combination. After the bug is fixed, packet-level replays can be employed in the communication network 100,600 to inject the exact same traffic combination into the network device to test the fix and prevent future regression. Dashboards, query suggestions and intelligent assistants available to the operator 150 may also be deployed together with the communication network 100,600 using Al techniques to facilitate faster debugging.

Various embodiments as discussed above may be practiced using any means available to the skilled person without departing from the scope of the invention as laid out in the claims. Modifications and alternative constructions apparent to the skilled person are understood to be within the scope of the disclosure.

Claims

1. A method of monitoring a communication network to facilitate network fault analysis, the communication network having switches with each switch arranged to relay corresponding data packets in the network, the method comprising

(i) recording the corresponding data packets received at each switch within a recording window; wherein upon detecting a network fault by a particular switch among the switches, the network fault occurring within the recording window;

(ii) generating a trigger packet;

(iii) broadcasting the trigger packet selectively to at least some of the switches in the network; and

(iv) upon receiving the trigger packet, storing the recorded data packets of the selected switches received within the recording window for subsequent retrieval, wherein the recorded data packets represent the network’s traffic when the network fault is detected.

2. A method according to claim 1 , wherein when no network fault is detected, discarding the recorded data packets that fall outside of the recording window.

3. A method according to claim 1 or 2, wherein recording the corresponding data packets at each switch in (i) further comprises recording timing information of the corresponding data packets at each switch, wherein the timing information between the switches are causally consistent.

4. A method according to claim 3, wherein the timing information between at least some of the switches include a timing difference that is within a tolerance level.

5. A method according to claim 4, wherein the tolerance level is a duration to transmit a single data packet between the switches.

6. A method according to any one of claims 3 to 5, further comprising correlating the recorded data packets based on the timing information of the corresponding data packets to construct an order of events leading to the network fault.

7. A method according to any one of claims 3 to 6, wherein the timing information include arrival times and departure times of the corresponding data packets.

8. A method according to any preceding claim, wherein recording the corresponding data packets at each switch in (i) further comprises counting the corresponding data packets arriving at each switch within a limited recording window, the limited recording window being a subset of the recording window; and associating the corresponding data packets arriving within the limited recording window with a specific arrival time.

9. A method according to any preceding claim, wherein recording the corresponding data packets at each switch in (i) further comprises counting the corresponding data packets departing at each switch within the limited recording window; and associating the corresponding data packets departing within the limited recording window with a specific departure time.

10. A method according to any preceding claim, wherein recording the corresponding data packets at each switch in (i) further comprises counting the corresponding data packets received at each switch that are within a flow; and associating the corresponding data packets within the flow with a packet identifier.

11. A method according to any preceding claim, wherein each switch maintains a list of communication links from which the switch receives the corresponding data packets within the recording window, and wherein, the trigger packets are broadcasted only to the list of communication links.

12. A method according to any preceding claim, wherein where the selected switches exclude the particular switch, the method further comprises storing the recorded data packets of the particular switch received within the recording window for subsequent retrieval.

13. A method according to any preceding claim, wherein storing the recorded data packets further comprises generating collection packets to read the recorded data packets from the respective switches; and forwarding the collection packets to the storage server.

14. A method according to any preceding claim, wherein the corresponding data packets received at each switch within the recording window are recorded in respective pre-trigger buffers, and after receiving the trigger packet and before the recorded data packets in the respective pre-trigger buffers are stored for subsequent retrieval, the method further comprises recording subsequent data packets received in respective post-trigger buffers.

15. A method according to any preceding claim, wherein the corresponding data packets are compressed data packets.

16. A communication network for facilitating network fault analysis, comprising a plurality of switches with each switch arranged to relay corresponding data packets in the network and having corresponding recording modules configured to record the corresponding data packets that are received within a recording window; a trigger module configured to generate a trigger packet in response to a network fault detected by a particular switch, the network fault occurring within the recording window; and a broadcasting module configured to selectively broadcast the trigger packet to at least some of the switches in the network, wherein in response to the network fault detected by the particular switch, and in response to receiving the trigger packet by the selected switches, each switch is configured to store its recorded data packets received within the recording window for subsequent retrieval, wherein the recorded data packets represent the network’s traffic when the network fault is detected.

17. A communication network according to claim 16, when no network fault is detected, each recording module is further configured to discard the recorded data packets that fall outside of the recording window.

18. A communication network according to claim 16 or 17, wherein each recording module is further configured to record timing information of the corresponding data packets at each switch, and wherein the timing information between the switches are causally consistent.

19. A communication network according to claim 18, wherein the timing information between at least some of the switches including a timing difference that is within a tolerance level.

20. A communication network according to claim 19, wherein the tolerance level is a duration to transmit a single data packet between the switches.

21. A communication network according to any one of claims 18 to 20, further comprising a correlation module configured to correlate the recorded data packets based on the timing information of the corresponding data packets to construct an order of events leading to the network fault.

22. A communication network according to any one of claims 18 to 21 , wherein the timing information include arrival times and departure times of the corresponding data packets.

23. A communication network according to any one of claims 16 -22, wherein each recording module further comprises an arrival packet counter configured to count the data packets arriving at a corresponding switch within a limited recording window, the limited recording window being a subset of the recording window; and to associate the data packets arriving within the limited recording window with a specific arrival time.

24. A communication network according to any one of claims 16 -23, wherein each recording module further comprises a departure packet counter configured to count the data packets departing at a corresponding switch within the limited recording window; and to associate the data packets departing within the limited recording window with a specific departure time.

25. A communication network according to any one of claims 16 -24, wherein each recording module further comprises a flow packet counter configured to count the data packets received at a corresponding switch that are within a flow; and to associate the data packets within the flow with a packet identifier.

26. A communication network according to any one of claims 14 -25, wherein each switch is configured to maintain a list of communication links from which the switch receives the corresponding data packets within the recording time window, and wherein the broadcasting module is further configured to broadcast the trigger packet only to the list of communication links.

27. A communication network according to any one of claims 16-26, wherein where the selected switches exclude the particular switch, the particular switch is configured to store its recorded data packets received within the recording window for subsequent retrieval.

28. A communication network according to any one of claims 16 - 27, further comprising a collection module configured to generate collection packets to read the recorded data packets from the respective switches; and to forward the collection packets to the storage server.

29. A communication network according to any one of claims 16 -28, wherein each recording module further comprises a pre-trigger buffer associated with a corresponding switch in the network, the pre-trigger buffer configured to record the corresponding data packets received at the corresponding switch within the recording window; and a post-trigger buffer associated with a corresponding switch in the network, the post-trigger buffer configured to record subsequent data packets received at the corresponding switch, after receiving the trigger packet and before the recorded data packets in the respective pre-trigger buffers are stored for subsequent retrieval.

30. A communication network according to any one of claims 16 - 29, wherein the corresponding data packets are compressed data packets.