GB2467424A - Managing overload in an Ethernet network by re-routing data flows - Google Patents

Managing overload in an Ethernet network by re-routing data flows

Info

Publication number
GB2467424A
Authority
GB
United Kingdom
Prior art keywords
overload
node
routing
network
data flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1001283A
Other versions
GB201001283D0 (en)
Inventor
Cyriel Johan Minkenberg
Mircea Gusat
Alessandra Sciccitano
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of GB201001283D0
Publication of GB2467424A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/24 Multipath
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/11 Identifying congestion
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/122 Avoiding congestion; Recovering from congestion by diverting traffic away from congested entities
    • H04L47/17 Interaction among intermediate nodes, e.g. hop by hop
    • H04L47/26 Flow control; Congestion control using explicit feedback to the source, e.g. choke packets
    • H04L47/266 Stopping or restarting the source, e.g. X-on or X-off
    • H04L47/30 Flow control; Congestion control in combination with information about buffer occupancy at either end or at transit nodes

Abstract

The invention relates to data flow overload management in an Ethernet network, the network comprising nodes connected to define various data flow paths. Data flows from a first, congested path are re-routed to another, uncongested path of the network when an overload condition is detected (see the dashed line in fig. 1B). The data flow overload may be detected on the basis of at least one overload event sent in the network such as a congestion notification, link-level flow control event or a combination of the two. The invention takes advantage of multi-path capability offered by, for example, data centre networks. Re-routing data flows to different paths provides a spatial alternative to the sole temporal reaction (rate reduction mechanism) usually provided in Ethernet networks.

Description

OVERLOAD MANAGEMENT IN ETHERNET NETWORKS
FIELD OF THE INVENTION
The invention relates to the field of overload management in Ethernet networks, such as congestion management.
BACKGROUND OF THE INVENTION
The IEEE 802.3 standard defines Ethernet, which is known to be one of the most widely implemented local area network (LAN) technologies. It is believed that simplicity, scalability, wide availability, low cost, and acceptable performance have made Ethernet the network of choice for LAN traffic. In particular, performance parameters such as throughput, drop rate, latency, and jitter are considered acceptable for LAN traffic.
Data center installations are also known, which typically have a communication infrastructure comprising at least three disjoint networks: a LAN, a storage area network (SAN), and a clustering network. Concerns with such data center installations include cost, power consumption, complexity, and management and maintenance overhead.
Moreover, to ensure that 10-Gigabit Ethernet (10GE) will be equipped to meet some requirements of the data center, several working groups within the IEEE and IETF standards bodies are addressing key issues, such as congestion management (IEEE 802.1Qau).
Important requirements of data center networks are low latency, losslessness, and high speed.
In this respect, although losing packets might be tolerable in traditional networks, this is no longer true in a data center environment, where packet loss can seriously degrade system performance. To achieve the objective of losslessness and avoid drops due to buffer overflows, data center networks employ some form of link-level flow control (LL-FC), usually some variation of either credit-based or stop/go-based flow control.
The combination of short buffers, for low latency, and high speed may easily lead to congested switch buffers, which will trigger the LL-FC mechanism, thus propagating the congestion to upstream switches. If congestion persists long enough, a saturation tree [1] of congested switches is induced, which can cause catastrophic collapse of global network throughput [1], [2], as a saturation tree affects not only flows directly contributing to the congestion, but also other flows getting caught in the ensuing backlog. Therefore, congestion management (CM) is relied upon to prevent such collapses.
In this respect, the IEEE 802.3 standard provides an LL-FC mechanism called PAUSE (802.3x), which can be used for temporarily pausing the link when the buffer is filling up.
In addition, the IEEE 802.1Qau working group is currently in the process of defining a standard for congestion management (CM) in 10GE networks. CM protocols created and studied in this context, such as ECM or QCN, try to eliminate congestion by reducing the sending rates at the sources.
In a nutshell, such schemes operate by monitoring switch queue length offsets with respect to a predefined equilibrium threshold (Qeq) as well as queue length changes, computing a feedback value indicating the level of congestion, sending congestion notification (CN) frames to the source of "hot" flows when congestion is detected, and reducing the transmission rate at the source based on the feedback value. This mechanism keeps congestion under control by reducing the aggregate sending rate of all flows traversing the bottleneck, thus pushing the backlog to the edge of the network.
BRIEF SUMMARY OF THE INVENTION
In a first aspect, the present invention is embodied as an overload management method for an Ethernet network comprising nodes and connections between said nodes, the connections defining a set of data flow paths, the method comprising a step of re-routing a data flow from a first path of the set to a second path of the set, upon detection of data flow overload.
In other embodiments, the said method may comprise one or more of the following features:
-the method further comprises, prior to re-routing, a step of detecting data flow overload, based on at least one overload event sent in the Ethernet network;
-at the step of detecting, the at least one overload event is a congestion notification, a link-level flow control event, or a combination thereof;
-the step of detecting data flow overload comprises: receiving at least one overload event; and updating overload information in relation to paths of the set, based on the received overload event, and the step of re-routing the data flow is carried out according to the updated overload information;
-the step of detecting data flow overload comprises: intercepting, at a given node of the Ethernet network, a congestion notification generated by a third node and destined to a source node of the Ethernet network;
-the method further comprises forwarding the intercepted notification to a source node for subsequent reduction of a rate of data flow transmitted from the source node;
-the method further comprises releasing the intercepted notification to a source node, upon detection of data flow overload of both the first and second paths of the set;
-both steps of detecting and re-routing are implemented in parallel at each of a first node and a second node of the network, each located between a source node and a third node of the Ethernet network;
-the method further comprises, prior to detecting data flow overload, steps of: generating congestion notifications at a third node of the Ethernet network; and randomly routing the said congestion notifications from the third node through the Ethernet network;
-the step of detecting data flow overload comprises: receiving, at a given node of the Ethernet network, a link-level flow control event generated by a third node of the Ethernet network;
-at the step of receiving, the link-level flow control event is indicative of a full buffer at the third node of the Ethernet network;
-the step of detecting data flow overload comprises: receiving overload events comprising both a link-level flow control event and a congestion notification event; and updating overload information in relation to paths of the set, based on the received overload events, and the step of re-routing the data flow is carried out according to the updated overload information;
-the step of receiving the at least one overload event is implemented at a first node of the network; and the step of re-routing is implemented at a second node of the network, wherein re-routing is carried out according to the at least one overload event received at the first node of the network; and
-the step of receiving comprises receiving at a first node or a source node of the network a congestion notification generated by a third node of the network; the step of updating comprises updating overload information at the second node, from the first node or source node, based on the received congestion notification; and the step of re-routing is implemented at the second node, based on the updated overload information.
In a second aspect, the present invention is further embodied as an Ethernet network, comprising nodes and connections between said nodes, the connections defining a set of data flow paths; and data flow overload management means, designed to take steps of the method according to any embodiments above.
In yet another aspect, the present invention is embodied as a computer program product comprising program code means to take steps of the method according to any embodiments above.
A network and method embodying the present invention will now be described, by way of non-limiting example, and in reference to the accompanying drawings.
BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS
-FIG. 1A schematically depicts an Ethernet network;
-FIG. 1B illustrates re-routing of data flows from a given path to another path of the network of FIG. 1A, according to an embodiment of the present invention;
-FIG. 1C shows the network of FIG. 1A, with a loop-free set of links to a given host;
-FIG. 2 is a flowchart reflecting a general embodiment of the present invention;
-FIG. 3 is a flowchart pertaining to more specific embodiments; and
-FIG. 4 illustrates an example of routing congestion notifications.
DETAILED DESCRIPTION OF THE INVENTION
I. Introduction
As an introduction to the following description, general aspects of the invention are first pointed out, directed to a data flow overload management method for an Ethernet network. In brief, the Ethernet network comprises nodes connected such as to define various data flow paths. Overload management is here designed to re-route data flows, from a given path to another path of the network, upon detection of data flow overload. This way, advantage is taken of the multi-path capability offered by, e.g., data center networks. In other words, re-routing data flows to different paths provides a spatial alternative to the temporal reaction (rate reduction mechanism) provided by the 802.1Qau schemes.
As will be described in detail later, different types of overload are contemplated. Note that the overload can be considered as a change in the conditions of a path, i.e. similar to a broken link. Having realized this, one understands that overload may be addressed by searching for another path between source and destination and, if one exists, re-routing some or all of the traffic to the new path.
Hence, adaptive routing (AR) is implemented within an Ethernet network. By adaptive routing or AR, what is meant here are overload management solutions going beyond existing solutions, such as the known 802.1Qau congestion management solutions. This is likely to result in a higher overall network throughput compared, for instance, to existing 802.1Qau approaches, which can only reduce the source rates. In particular, if a congested flow can be routed on an uncongested path, its rate does not need to be reduced. Therefore, combining AR with CM can significantly increase the throughput of a congested network.
In further embodiments, existing overload events are advantageously relied upon, such as congestion notifications generated to signal congestion, or link-level flow control events such as the PAUSE command. For instance, multi-path routing can simply be enabled by configuring node (e.g. switch) routing tables to allow multiple routing table entries for every destination MAC address. Such an AR scheme can accordingly be built on top of the existing end-to-end CM schemes being defined in 802.1Qau, taking advantage of congestion notifications. As such, a key advantage is that no changes are necessary to the Ethernet frame format, to the existing CM schemes, or to the Ethernet adapters. In fact, specific embodiments can be contemplated which merely operate by modifying the routing behavior of the Ethernet switching nodes.
The detailed description is organized as follows. In Sec. II, the IEEE 802.1Qau congestion management (CM) schemes as well as solutions in the area of AR are discussed in detail, for the sake of better understanding the subsequent description. Then, in Sec. III, embodiments of the present invention are presented, which make use of congestion notifications. Specific embodiments of adaptive routing schemes for Data Center Ethernet or Convergence Enhanced Ethernet (CEE) are described in Sec. IV. Next, results obtained by means of simulations on several test topologies are briefly discussed in Sec. V. Finally, alternative embodiments are briefly discussed in the last section. References are listed at the end of the detailed description.
II. Analysis of known solutions of Ethernet congestion management and adaptive routing.
A. Ethernet congestion management
In the IEEE and IETF standards bodies, several working groups are defining new standards to ensure that 10 Gigabit Ethernet (10GE) will be able to meet data center requirements.
The following paragraphs provide an overview of efforts of the IEEE 802.1Qau Task Force towards defining a CM scheme for Ethernet.
Several mechanisms have been considered so far. As mentioned earlier, their goal is to hold the buffer occupancy around the Qeq threshold so that the buffer is neither overutilized nor underutilized.
The 802.1Qau Task Force has adopted an end-to-end approach to CM aiming to push congestion out to the edge of the network.
The framework is based on controlling switch queue lengths by generating congestion notification messages that cause the senders to impose and adjust rate limits for those flows contributing to congestion. The key components are as follows:
1) Congestion detection and signaling: each switch samples the incoming frames with a probability P. When a frame is sampled, the switch determines the output buffer occupancy and computes a feedback value Fb, which is a weighted sum of the current queue offset Qoff with respect to the equilibrium threshold Qeq and the level change Qdelta since the last sample: Fb = Qoff - Wd * Qdelta. Depending on the value of Fb, the switch sends a congestion notification to the source of the sampled frame. The congested queue that causes the generation of the message is called Congestion Point (CP).
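For illustration only, the following is a minimal Python sketch of the feedback computation just described; the parameter values and the sign conventions for Qoff and Qdelta are assumptions made here, not values taken from the 802.1Qau drafts.

```python
import random

# Illustrative parameter values; the actual constants are configuration choices.
Q_EQ = 26 * 1500   # equilibrium threshold Qeq, in bytes (assumed)
W_D = 2.0          # weight Wd applied to the queue-length change (assumed)
P_SAMPLE = 0.01    # frame sampling probability P (assumed)

def sample_and_compute_fb(queue_len, prev_sample_len):
    """On frame arrival, sample with probability P and compute the feedback Fb.

    Sign convention (an assumption): Qoff is positive while the queue is below
    Qeq and Qdelta is positive while the queue is growing, so Fb becomes
    negative when the queue exceeds Qeq and keeps filling up.
    """
    if random.random() >= P_SAMPLE:
        return None                      # frame not sampled, no feedback
    q_off = Q_EQ - queue_len             # offset with respect to Qeq
    q_delta = queue_len - prev_sample_len
    fb = q_off - W_D * q_delta           # Fb = Qoff - Wd * Qdelta
    return fb                            # a negative Fb triggers a CN to the source
```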
2) Source reaction: each source has an associated Reaction Point (RP) that instantiates rate limiters for congested flows, adjusting the sending rates depending on the feedback received.
When an RP receives a congestion notification, it decreases the rate limit of the sampled flow if the feedback is negative, and increases it if the feedback is positive. To avoid starvation, accelerate rate recovery when congestion is over, and improve fairness of rate allocation, the RP can autonomously increase rate limits based on timers or byte counting.
Ethernet Congestion Management (ECM; formerly called BCN) [3], [4] signals both positive and negative feedback. Negative feedback is generated only if the buffer level exceeds Qeq.
Positive feedback is generated only when the sampled frame is tagged as belonging to a rate-limited flow and the tag contains the congestion point ID (CPID) that corresponds to the switch and output queue in question. Here, CPID tagging is needed to be able to filter out false positives; otherwise it might happen that a rate limit is increased by positive feedback from uncongested output queues on the same path, thus compromising stability. Each rate limiter is associated with the CPID of the most recent negative feedback; each rate-limited frame is tagged with the associated CPID.
Another known scheme is Quantized Congestion Notification (QCN), first proposed in [5] and detailed in [6]. A main feature of this algorithm is the absence of positive feedback. A source autonomously increases the rate of a rate-limited flow after a time interval T during which it has not received any negative feedback. This T is determined by counting the number of transmitted bytes for a given rate-limited flow and comparing it with a threshold that depends on the last feedback value.
Both schemes, as shown in [8], are able to provide good performance in controlling the queue length at the CP. However, the performance benchmarks [9] used in 802.1Qau consider single-path topologies exclusively.
B. Adaptive routing
In packet switching networks, the goal of routing protocols is to select the path that a message should take to reach its destination. The choice may be made among a set of different paths and based on different decision metrics. Existing routing algorithms can be divided into two categories: deterministic and adaptive routing.
Deterministic routing became popular when wormhole switching was invented [10]. Its popularity is due to its minimal hardware requirements. Indeed, its simple deadlock-avoidance algorithm results in simple and fast router designs. In deterministic routing algorithms, paths between sources and destinations are fixed and messages with the same source and destination addresses always take the same route [11], [12]. Consequently, a message must wait for each busy channel in the path to become available.
It can be realized that such algorithms do not take advantage of alternative paths that a topology may provide to avoid blocked channels. In contrast, AR algorithms support multiple paths between the source and destination, and messages are allowed to explore all alternative routes when crossing the network [13], [14]. AR algorithms are either minimal or non-minimal. Minimal routing algorithms allow only shortest paths to be chosen, while non-minimal routing algorithms also allow longer paths.
AR algorithms, whether minimal or non-minimal, can be further differentiated by the number of paths allowed. Partially adaptive routing algorithms do not allow all messages to use any path, while fully adaptive routing algorithms do not have any restriction on the set of paths that can be used. Partially adaptive routing allows for selecting an output channel from a subset of all possible channels. Turn-model-based algorithms [15] and planar adaptive routing algorithms [16] are important partially adaptive routing algorithms for some networks.
Several fully adaptive routing algorithms have been proposed so far, see e.g. [17], [18]. In addition, several fully adaptive routing algorithms on tori have been evaluated in [19], of which the one using Negative Hop-based (NHop) routing augmented with so-called bonus cards (Nbc) has been shown to offer high performance. For completeness, in [19], the Nbc routing scheme has been used in the context of Duato's methodology [13], resulting in a routing algorithm named Duato-Nbc with high performance and minimum virtual channel requirements.
III. Route configuration
Traditionally, Ethernet networks employ the Spanning Tree Protocol (STP) or a variant thereof to construct a routing tree, transforming an arbitrary physical topology (which may contain loops) into a logical one, without loops, by enabling or disabling specific ports in each switch. This is used to eliminate routing loops in the topology, which lead to "broadcast storms", i.e., endless replications of broadcast frames.
Unfortunately, this also prevents potential multi-pathing capabilities. By contrast, the present invention can be embodied so as to restore the multi-pathing capabilities of the Ethernet network.
For instance, FIGS. 1A - 1C schematically depict an Ethernet network 10, with a given topology. S1 - S12 are inner nodes or switches of the network 10. H1 - H6 are hosts of the network, that is, end nodes. In the examples of FIGS. 1A - 1B, H4 - H5 can be considered to be destination nodes. The bold arrows (e.g. LS3S7, LS3S10, LS3S7a, LS3S10a in FIGS. 1A - 1C) denote links to a subsequent node. In other words, arrows denote connections between nodes. Successive connections define a path. Yet, a single connection is already a path to another switch and thereby part of a path to a destination node. Hence, one understands that re-routing locally to an alternate port amounts to re-routing to another path of the network 10, be it from a local point of view.
However, one may distinguish amongst scenarios wherein a switch locally re-routes data flows from a first output port to a second output port. In a first case, both ports may link to the same subsequent switch, while in a second case, the ports link to distinct switches. While re-routing modifies a data flow path in each case (at least locally), the second case would be the most likely in practice, as illustrated in FIG. 1B. Here, detection of overload at the level of S7 (as denoted by the flash) prompts S3 to re-route data flows through S10 instead of S7. Other variants are possible, not to mention the distinction between logical and physical switches.
In the example of FIG. 1C, corresponding to the network of FIG. 1A, not all connections are depicted. More will be said about FIG. 1C later in the description.
As illustrated in the flowchart of FIG. 2, an embodiment of the present invention provides for re-routing a data flow from a first to a second path (step 170), upon detection of overload in the Ethernet network (step 140). Obviously, if no overload occurs, data flows do not specifically need to be re-routed.
Such re-routing is for instance carried out upon detection of data flow overload in the first path. Alternatively, a global strategy may lead to re-routing data flow from a first to a second path upon detection of overload in a third path. In either case, multi-pathing capabilities of an Ethernet network are reinstated. Re-routing may for instance be enabled through conveniently adapted overload management means of the network.
FIG. 3 pertains to more specific embodiments. However, as recited in reference to FIG. 2, the main principle remains to re-route data flow from a first to a second path (step 170) if overload is detected (step 140). Yet, the type of overload more specifically addressed here is congestion.
The detection of congestion may preferably rely upon intercepting (step 110), e.g. at a given node or switch, a congestion notification or CN generated (step 100) by a third node.
Generation of CNs is, as such, already provided in Ethernet networks. Thus, intercepting CNs according to the present embodiment allows for a simple integration of the embodiment of FIG. 2 into Ethernet networks, without requiring additional means for congestion signaling.
Interception of congestion notifications may advantageously be implemented at an inner node (see e.g. S1 - S12 in FIGS. 1A - 1C) of the network (step 110), i.e. between a source node and a third node. A "third node" means here any downstream node likely to generate notifications destined to a source node (upstream). Hence, notifications are intercepted on their path to the source (step 110), whereby appropriate steps can be taken before the source actually reduces the flow rate, as discussed at length earlier.
In fact, in other embodiments, interception of notifications, detection of congestion and re-routing can be implemented at several nodes (for example one or more, or all switches S1 - S12 of FIG. 1B), such that multiple decision points are enabled in parallel in the Ethernet network.
Referring back to FIG. 3, note that the intercepted notifications may either be immediately released (i.e. forwarded) to the source, as in step 115, or be literally hijacked, and thereby hidden from the source (at least temporarily, see step 190). In the first case, both mechanisms (spatial re-routing and sending rate reduction) are competitively implemented. Yet, provided that spatial re-routing is rapidly efficient, effects of rate reduction should not impair the performance of the network. Such an embodiment thus allows for a transparent implementation of the invention with respect to Ethernet standards. In the second case, the expected source throttling is delayed or possibly suppressed as alternative pathing is implemented at the intercepting node. In particular, re-routing data flows is here enabled based on the congestion information conveyed by the intercepted notification.
Before discussing more specific embodiments, it is worth recalling that any network with bidirectional links and multiple paths from some source to some destination necessarily has a loop therein. Hence, the drawbacks due to routing loops apply to all multi-path topologies; this is therefore of particular interest for Convergence Enhanced Ethernet (CEE), because data centers often employ multi-path network topologies.
Accordingly, a specific embodiment shall be now discussed, which overcomes problems caused by routing loops in the topology, in reference to FIG. 1C.
Switched Ethernet networks typically do not require explicit programming of switch routing tables. Rather, a switch automatically learns the routing by observing the source MAC addresses of the frames entering. If a frame from MAC m enters on port p, the switch makes an entry in its routing table to route frames destined to MAC m through port p. When a frame arrives for which no entry exists in the routing table, the switch broadcasts the frame to all ports except the one on which the frame arrived. This way, each switch will discover a route to each end node (assuming that all end nodes generate some traffic that reaches all switches).
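A minimal sketch of this classical learning/flooding behaviour follows; the frame object with src and dst fields is a hypothetical stand-in for a parsed Ethernet header.

```python
class LearningSwitch:
    """Classical Ethernet switch: learn source MACs, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)   # local port identifiers
        self.table = {}            # learnt routing table: MAC address -> port

    def forward(self, frame, in_port):
        # Learn: the source MAC of an incoming frame is reachable via in_port.
        self.table[frame.src] = in_port
        out_port = self.table.get(frame.dst)
        if out_port is not None:
            return [out_port]
        # Unknown destination: broadcast on all ports except the ingress one.
        # On a multi-path (looped) topology this is what causes broadcast storms.
        return [p for p in self.ports if p != in_port]
```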
As such a method involves broadcasting, it may also result in broadcast storms on a multi-path topology.
To enable multi-path routing while preventing routing loops, an embodiment of the invention relies on pre-configuring the routing tables of each switch. This can be carried out in such a way that, for each destination node n, the directed graph formed by all routes leading to node n is free of loops.
In this respect, FIG. 1C shows an example of how the routes can be programmed so as to obtain a set of links that is free of loops. In particular, a possible route configuration for host H5, given the topology, is depicted. The bold arrows (e.g. LS3S7a, LS3S10a) indicate a set of links (via output ports) that may be used to route to H5. Such links form a loop-free directed graph. Thus, not all possible paths are enabled: for instance, while S8 could also route to S2 to follow the path through S7, this would create a loop. Paths to other hosts, and according to other topologies, can be configured in a comparable way.
More generally, it is preferably ensured that no frame can ever be routed in a loop, without having to keep track of previously visited switches. Route configuration may for instance be performed at network initialization. This further ensures that the routing is deadlock-free, as there exists a routing subset that is free of cyclic channel dependencies.
Accordingly, each switch may maintain a routing table that maps each destination MAC address to a list of available ports. The ports are for example listed in order of preference (and so the links to subsequent nodes); the first entry is the default routing option. Ports having a shorter distance (hop count) to the destination may similarly receive higher preference.
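As an illustration of such a pre-configured table, the following sketch encodes the choice of FIG. 1C for switch S3; the MAC address keys and port names are placeholders, not values from the description.

```python
# Pre-configured routing table of S3 for the topology of FIG. 1C (illustrative).
# Each destination maps to a list of output ports in order of preference:
# the first entry (fewest hops) is the default routing option.
ROUTING_TABLE_S3 = {
    "H5": ["port_to_S7", "port_to_S10"],   # default via LS3S7a, alternate via LS3S10a
}

def default_port(routing_table, dst):
    """Return the most preferred (default) output port for destination dst."""
    return routing_table[dst][0]
```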
In the example of FIG. 1C, the link LS3S7a could be the default link on the path from S3 to H5, as it involves three intermediary nodes S7 - S9 - S5 only, while four intermediary nodes S10 - S11 - S9 - S5 are involved via the link LS3S10a.
Yet, upon detection of congestion in LS3S7a, S3 may re-route via LS3S10a, as will be discussed in more detail now.
IV. Specific embodiments of adaptive routing
A basic idea of embodiments discussed above is to take advantage of the congestion information conveyed by notifications generated in, e.g., 802.1Qau-enabled switches. These congestion notifications travel backwards from the congestion point to the sources of the flows contributing to the hotspot.
To this aim, congestion management means can be implemented at a given node between the source node and a third node which is likely to generate such notifications. Referring back to FIG. 3, the said given node is thus an upstream switch that relays a CN. Said upstream switch can intercept a CN (as in step 110 of FIG. 3) to find out about downstream congestion. The CN is thus at least temporarily kept hidden from the source, as noted earlier (see steps 115 or 190, denoting alternate embodiments). Next, by marking ports as congested with respect to specific destinations, a switch can reorder its preferences of the corresponding output ports contained in the routing table entry for that destination. Logically, uncongested ports will be preferred over congested ones.
More generally, the switch may update overload information, that is, congestion information, as to potential paths of the network (step 120), based on the intercepted congestion notification. Updating said information would typically occur upon interception of a congestion notification. Hence, re-routing data flows can be carried out according to said updated congestion information, which is itself based on interpretation of CNs.
A. Example of interception of congestion notification
In an embodiment particularly advantageous for enabling AR, each switch is likely to maintain congestion information. For example, upon intercepting congestion notifications, switches shall update respective congestion information tables, such that at any time, each switch knows which of its paths is congested. Preferably, paths are sorted according to a congestion degree, which allows for locally optimized re-routing.
For example, a congestion information table may map a congestion key (d, p), where d is the destination MAC address and p the local port number, to a small data structure that keeps track of the current congestion status of port p with respect to destination d. This data structure may for instance comprise the following four fields:
- A congested flag, indicating whether congestion has been detected on port p for traffic destined to d, it being understood that port p for traffic destined to d locally defines a path to destination d.
- A local flag, indicating (if congested is true) whether the congestion occurred locally, i.e., in the output queue attached to port p.
- A feedback counter fbcount, indicating how many congestion notifications have been intercepted for (d, p); and
- A feedback severity indication feedback, providing an estimate of how severe the congestion is.
Whenever a switch receives or itself generates a congestion notification for a flow destined to d (for example, the sampled frame that triggered the creation of the CN was destined to d), it updates the congestion information corresponding to (d, p), where p is the output port corresponding to the input port on which the CN was received (remote CN), or the output port that triggered the creation of the CN (local CN).
If the entry was not marked as congested (or did not exist yet), the congested flag is set and local is set according to whether the CN was generated remotely or locally, fbcount is increased, and feedback is for example incremented by the product of fbcount and the feedback value carried by the CN. In contexts where CNs carry negative feedback values only, feedback will also be negative and decrease as more CNs are received. The lower the value of feedback, the more severe the congestion. Such a weighted update can accordingly be used to assign more weight to recent CNs, thereby gradually reducing the effect of older entries and false positives.
Incidentally, in the update procedure, if the entry was already marked as congested, local is updated only if it was previously true, i.e. local congestion can be overridden by remote congestion but not vice versa.
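A Python sketch of one possible shape of such a congestion information table and of its update on receipt of a CN is given below; the field names follow the description above, while the dictionary layout and the unconditional counter update are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class CongestionStatus:
    congested: bool = False   # congestion detected on port p for destination d
    local: bool = False       # True if the congestion sits in the local output queue
    fbcount: int = 0          # number of CNs intercepted for this (d, p) pair
    feedback: float = 0.0     # weighted severity; more negative means more severe

# Congestion information table: key (d, p) = (destination MAC, local port number).
cong_table: dict = {}

def update_on_cn(d, p, fb_value, generated_locally):
    """Update the entry for (d, p) when a CN carrying feedback fb_value is seen."""
    entry = cong_table.setdefault((d, p), CongestionStatus())
    if not entry.congested:
        entry.congested = True
        entry.local = generated_locally
    elif entry.local:
        # Local congestion may be overridden by remote congestion, not vice versa.
        entry.local = generated_locally
    entry.fbcount += 1
    # Weighting by fbcount gives more weight to recent CNs, gradually washing
    # out older entries and false positives.
    entry.feedback += entry.fbcount * fb_value
```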
As another remark, the person skilled in the art may appreciate that more elaborate congestion information tables can be contemplated, which may for instance include VID or VLAN ID data (i.e. identification of the VLAN, as used by the 802.1Q standard) and/or priority data (in the sense of the 802.1Qbb scheme, which introduces priority-based flow control).
B. Congestion expiration
First, as noted earlier, there are contexts wherein only negative feedback is signaled (as in QCN), i.e. the presence or increase of congestion is signaled but not the absence or decrease of congestion. Thus, a timer-based approach may advantageously be used, in an embodiment, to expire remote entries in the congestion information table, at least in a QCN-like context.
In particular, local entries can be expired when the corresponding output queue is no longer congested. To this end, whenever an entry is updated as being congested, a timer is started.
When the timer expires, the entry is reset, provided that it is not flagged as local. A local entry is reset when the length of the corresponding output queue drops below e.g. Qeq/2.
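A sketch of these expiration rules follows, assuming table entries shaped like the CongestionStatus structure above plus a per-entry expiry timestamp; the expiry period and the queue-length callback are assumptions.

```python
import time

CN_EXPIRY_S = 0.05   # illustrative expiry period for remotely signalled congestion

def expire_entries(cong_table, output_queue_len, q_eq):
    """Reset stale congestion entries.

    output_queue_len is a hypothetical callback returning the current length of
    the output queue attached to a given local port.
    """
    now = time.monotonic()
    for (d, p), entry in cong_table.items():
        if not entry.congested:
            continue
        if entry.local:
            # Local entries are reset once the output queue drains below Qeq/2.
            if output_queue_len(p) < q_eq / 2:
                entry.congested = False
                entry.local = False
                entry.fbcount, entry.feedback = 0, 0.0
        elif getattr(entry, "expires_at", now) <= now:
            # Remote entries expire on a timer, since QCN carries no positive feedback.
            entry.congested = False
            entry.fbcount, entry.feedback = 0, 0.0
```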
Second, in contexts wherein positive feedback is signaled, congestion can be expired by further taking into account positive feedback values.
C. Routing decisions
Referring back to FIG. 3, whenever a frame arrives (step 130), a switch performs a routing lookup for the frame's destination MAC address d (steps 140 to 160). For example, at step 140, it is checked whether the default path to the destination of the frame is congested. If the default (most preferred) port p0 is not flagged as congested, as indicated by the congestion table entry for (d, p0), the frame is routed via port p0 (step 150). If the default port is flagged as congested, the next preferred port is checked (160) for subsequent re-routing (170), etc. If all ports are flagged as being congested, the frame will preferably be routed to the port with the least severe congestion (i.e., with the feedback value closest to zero), step 180. In each case, re-routing can take advantage of checking entries (d, p), which define local paths to a destination d, a port p belonging to a given path.
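The routing lookup of steps 140 to 180 can be sketched as follows; it reuses the preference-ordered routing table and the (d, p) congestion table from the earlier sketches, which remain illustrative assumptions.

```python
def select_output_port(routing_table, cong_table, d):
    """Pick an output port for a frame destined to MAC address d.

    Ports are tried in order of preference; the first one not flagged as
    congested is used (steps 140-170). If every port is congested, fall back
    to the port with the least severe congestion, i.e. the feedback value
    closest to zero (step 180).
    """
    candidates = routing_table[d]          # ports ordered by preference
    for p in candidates:
        entry = cong_table.get((d, p))
        if entry is None or not entry.congested:
            return p
    return max(candidates, key=lambda p: cong_table[(d, p)].feedback)
```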
For completeness, in case CNs are hijacked at the intercepting switches, they may eventually be released to the source, for subsequent rate reduction, step 190. Step 190 does not occur for embodiments wherein CNs are immediately forwarded, as already discussed.
In addition, note that in a further embodiment, congestion notification frames may receive special treatment. While such frames a priori do not need to be subjected to congestion checks, it is preferable to ensure that all ports belonging to alternative paths leading to the congestion point are made aware of the congestion. In this respect, if all CN frames are always routed on the same path to the reaction point (source), the flow might be routed on an alternative path that eventually ends up at the same congestion point.
FIG. 4 illustrates such an example. In the exemplary network 16 depicted, both H1 and H2 are sending at line speed to H3 and H4, respectively, as denoted by respective dashed and full-line arrows. This is likely to cause severe congestion at port 2 of S4 when the shortest paths are taken. In this example, ports 0, 1, ... n of switch Sn are denoted by a numeral reference in the box pertaining to Sn. The shortest reverse path back to H1 is through switch S2. However, if all CNs for H1 traverse S2, S1 will only mark its port 2 as congested, but not port 1. Thus, S1 will route traffic on the second-shortest path through port 1 to S6 and S7, still crossing the bottleneck in S4.
Therefore, S3 should preferably make sure that it routes CNs also on the reverse path through S7 and S6. Then, S1 will mark ports 1 and 2 as congested with respect to destination H3, and will proceed to route its traffic through the longest path via S8 - S12 to S5, thus bypassing S4 and eliminating the congestion.
This issue can be addressed by having each switch randomly select one of the available ports when performing a lookup for a CN frame. As long as the congestion persists, the congested switch will keep generating CNs. Thus, by routing a CN randomly, each reverse path should eventually be traversed.
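A one-line sketch of this random reverse routing of CN frames, again against the illustrative routing table used above:

```python
import random

def route_cn_frame(routing_table, reaction_point):
    """Route a CN frame towards its reaction point (the source).

    Picking uniformly among all available reverse ports ensures that, as long
    as congestion persists and CNs keep being generated, every alternative
    reverse path is eventually traversed and marked.
    """
    return random.choice(routing_table[reaction_point])
```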
Embodiments as discussed above can furthermore be embodied as a computer program product, comprising suitable program code means to take the steps described, as shall be understood by the skilled person.
V. Evaluation
Different topologies and traffic scenarios have been tested through simulations, for the purpose of evaluating the performance of the embodiments discussed above.
Briefly, the base CM algorithm used for detection and reaction is QCN. More specifically, QCN version 2.2 as specified in [7] was implemented. Simulations were performed for both non-contending and contending flows, as well as for uniform Bernoulli traffic.
As to scenarios with non-contending flows, the results obtained first showed that the classical CM scheme was already able to adjust the transmission rates. Second, enabling AR made it possible for the throughput to reach the offered load. Congestion was avoided by re-routing flows on different paths. Correspondingly, hot queues appeared to have decongested.
Concerning contending flows, it was shown that while AR was not able to eliminate congestion, the underlying CM behaved as expected, in terms of controlling the queue length by adjusting the rates of the contending flows.
Finally, for uniform traffic, it was observed that, without CM, throughput saturated at about 55.6%. Enabling CM raised this figure to 73.3%. Enabling AR ultimately increased the saturation throughput to 96.8% -98.5% (depending on the topologies tested), indicating that AR optimally exploited the available path diversity. Accordingly, the mean latency was also drastically reduced.
VI. Other alternative embodiments
A. Implementing the reaction at another switch
As evoked earlier, re-routing data flows may be implemented at one or more switches of the network, for example in parallel (the "delocalized" solution).
Nevertheless, one may further contemplate embodiments wherein the CNs are received at a given node (for example at a source) while re-routing is implemented at another node, e.g. a switch. Such a solution allows for centralizing the reaction decision, and thereby offers improved control of the reaction to overload detection.
For instance, a CN generated by a downstream switch of the network could be routed to a source node, just as in usual Ethernet networks. Then, the source may take care of updating overload information at other nodes of the network, using any convenient means. For example, the source might trigger the deployment or propagation of an application for execution at other switches of the network. Execution of this application would allow for updating the routing tables of the switches. In other words, overload information is updated by propagating some dedicated application from the source. As another example, overload information updates might be computed at the source and then transmitted to other nodes. Therefore, how the subsequent re-routing occurs at each of the switches remains under control of one decision point. This is furthermore simpler to implement than the delocalized solution, inasmuch as it only implies modifying the source nodes, not all the nodes.
However, it can be noted that such a solution is likely to be slower than the delocalized solution.
B. Link-level flow control events such as the PAUSE command
Next, and as evoked earlier, other embodiments may rely on link-level flow control events, such as the PAUSE command. Link-level flow control events are quickly transmitted, whereby a more rapid reaction can be expected, compared with solutions based on CNs.
Indeed, the PAUSE mechanism uses a particular address, which has been reserved for use in PAUSE frames. The use of a well-known address simplifies the flow control process by making it unnecessary for a switch at one end of the link to discover and store the address of the switch at the other end of the link. In addition, frames sent to this address are understood by the switch to be frames meant to be acted upon within the switch. Reactions can thus be faster.
However, such a link-level flow control event is limited to "first-neighbor" switches, whereas congestion notifications propagate through the network, allowing for a global control of the subsequent re-routing of data flows.
Notwithstanding, whether based on CNs or flow control events, the invention can be embodied in similar ways. In particular, overload information can be updated upon receiving a flow control event, at the receiving switch. Then, re-routing the data flow can be carried out according to the updated information, i.e. based on the received event.
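As a sketch of how a received PAUSE frame could feed the same overload information, the following reuses the CongestionStatus structure from the CN sketch above; the mapping from pause duration (quanta) to severity is an assumption made purely for illustration.

```python
def on_pause_received(cong_table, routing_table, paused_port, pause_quanta):
    """Treat a link-level PAUSE received on paused_port as an overload event.

    Every destination whose routing entry includes paused_port is marked as
    congested on that port, so that subsequent frames prefer alternative ports;
    the assumed severity grows with the requested pause duration.
    """
    for d, ports in routing_table.items():
        if paused_port in ports:
            entry = cong_table.setdefault((d, paused_port), CongestionStatus())
            entry.congested = True
            entry.local = False             # the full buffer sits at the neighbour
            entry.feedback -= pause_quanta  # longer pause -> more severe overload
```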
Other features discussed in reference to CN interception can be embodied similarly, except that flow control events are relied upon, instead of CNs. For example, steps of detecting and re-routing can be implemented in parallel at several switches of the Ethernet network.
Finally, both link-level flow control events and CNs could be relied upon, whereby different types of overload can be addressed in a common solution.
To summarize, the present invention can be embodied such as to provide an adaptive routing scheme for Ethernet networks, notably for IEEE-802.1Qau-compliant CEE networks. It may further take advantage of modifying switch routing behavior, based on information snooped from congestion notifications or link-level flow control events. Evaluations showed that performance can be improved significantly (increased throughput, reduced latency).
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. For example, as any fully adaptive routing scheme may cause frames to arrive out of order, the skilled person will understand that re-sequencing needs to be implemented at the receiving end. How to re-sequence frames is known per se.
References cited
[1] G. Pfister and V. Norton, "Hot spot contention and combining in multistage interconnection networks," IEEE Trans. Computers, vol. C-34, no. 10, pp. 933-938, Oct. 1985.
[2] G. Pfister, M. Gusat, W. Denzel, D. Craddock, N. Ni, W. Rooney, T. Engbersen, R. Luijten, R. Krishnamurthy, and J. Duato, "Solving hot spot contention using InfiniBand Architecture congestion control," in Proc. HP-IPC 2005, Research Triangle Park, NC, July 24 2005.
[3] D. Bergamasco, "Data Center Ethernet Congestion Management: Backward Congestion Notification," IEEE 802.1 Meeting, May 2005.
[4] D. Bergamasco and R. Pan, Backward Congestion Notification Version 2.0," IEEE 802.1 Meeting, September 2005.
[5] R. Pan, B. Prabhakar, and A. Laxmikantha, "QCN: Quantized Congestion Notification," May 17, 2007. [Online]. Available: http://www.ieee802.org/1/files/public/docs2007/auprabhakar-qcndescription.pdf
[6] R. Pan, B. Prabhakar, and A. Laxmikantha, "QCN: Quantized Congestion Notification," May 29, 2007. [Online]. Available: http://www.ieee802.org/1/files/public/docs2007/au-panqcn-details-053007.pdf
[7] R. Pan, "QCN Pseudo Code Version 2.2," Nov. 13, 2008. [Online]. Available: http://www.ieee802.org/1/files/public/docs2008/au-pan-QCNpseudo-code-ver2-2.pdf
[8] C. Minkenberg, M. Gusat, "Congestion Management for 10G Ethernet," in Proc. Second Workshop on Interconnection Network Architectures: On-Chip, Multi-Chip (INA-OCMC 2008), Göteborg, Sweden, Jan. 27, 2008.
[9] M. Wadekar, "CN-SIM: Topologies and Workloads," Feb. 8, 2007. [Online]. Available: http://www.ieee802.org/1/files/public/docs2007/ausim-wadekar-reqd-extended-sim-list020807.pdf
[10] W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547-553, May 1987.
[11] J. Duato, S. Yalamanchili, L. Ni, "Interconnection networks: an engineering approach," Morgan Kaufmann Publication, 2002.
[12] W. Dally and C. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," IEEE Transactions on Computers, vol. 36, no. 5, pp. 547-553, 1987.
[13] J. Duato, "A new theory of deadlock-free adaptive routing in wormhole routing networks," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 12, pp. 1320-1331, 1993.
[14] J. Duato and P. Lopez, "Performance Evaluation of Adaptive Routing Algorithms for k-ary-n-cubes," in Proc. First International Workshop on Parallel Computer Routing and Communication, 1994.
[15] C. Glass and L. Ni, "The turn model for adaptive routing," in Proc. 19th Int'l Symp. on Computer Architecture, pp. 278-287, 1992.
[16] A. Chien, J. Kim, "Planar-adaptive routing: low-cost adaptive networks for multiprocessors," in Proc. Int'l Symp. on Computer Architecture, Journal of ACM 42(1), pp. 91-123, 1992.
[17] I. Gopal, "Prevention of store and forward deadlock in computer network," IEEE Transactions on Communications, vol. COM-33, no. 12, pp. 1258-1264, Dec. 1985.
[18] X. Lin, P. McKinley, and L. Ni, "The message flow model for routing in wormhole- routed networks," in Proc. 1993 International Conference on Parallel Processing, pp. 1-294-1- 297, Aug. 1993.
[19] F. Safaei, A. Khonsari, M. Fathy, M. Ould-Khaoua, "Performance Comparison of Routing Algorithms in Wormhole-Switched Fault-Tolerant Interconnect Networks," in Proc. International Conference on Network and Parallel Computing (NPC), 2006, Japan.

Claims (16)

CLAIMS
  1. An overload management method for an Ethernet network (10) comprising nodes (S1 - S12) and connections (LS3S7, LS3S10, LS3S7a, LS3S10a) between said nodes, the connections defining a set of data flow paths, the method comprising a step of: -re-routing (170, 180) a data flow from a first path of the set to a second path of the set, upon detection (140) of data flow overload.
  2. The method of claim 1, further comprising, prior to re-routing, a step of: -detecting (140) data flow overload, based on at least one overload event sent (100) in the Ethernet network.
  3. The method of claim 2, wherein, at the step of detecting, the at least one overload event is a congestion notification, a link-level flow control event, or a combination thereof.
  4. The method of claim 2 or 3, wherein the step of detecting (140) data flow overload comprises: -receiving (110) at least one overload event; and -updating (120) overload information in relation to paths of the set, based on the received overload event, and wherein the step of re-routing the data flow is carried out according to the updated overload information.
  5. The method of claim 2 or 4, wherein the step of detecting data flow overload comprises: -intercepting (110), at a given node of the Ethernet network, a congestion notification generated (100) by a third node and destined to a source node of the Ethernet network.
  6. The method of claim 5, further comprising: -forwarding (115, 190) the intercepted notification to a source node for subsequent reduction of a rate of data flow transmitted from the source node.
  7. The method of claim 5, further comprising: -releasing (190) the intercepted notification to a source node, upon detection (110) of data flow overload of both the first and second paths of the set.
  8. The method according to any one of claims 5 to 7, wherein both steps of detecting and re-routing are implemented in parallel at each of a first node and a second node of the network, each located between a source node and a third node of the Ethernet network.
  9. The method according to any one of claims 2 to 8, further comprising, prior to detecting data flow overload, steps of: -generating (100) congestion notifications at a third node of the Ethernet network; and -randomly routing (100) the said congestion notifications from the third node through the Ethernet network.
  10. The method according to any one of claims 2 to 9, wherein the step of detecting data flow overload comprises: -receiving, at a given node of the Ethernet network, a link-level flow control event generated (100) by a third node of the Ethernet network.
  11. The method according to claim 10, wherein, at the step of receiving, the link-level flow control event is indicative of a full buffer at the third node of the Ethernet network, such as a PAUSE command.
  12. The method according to any one of claims 2 to 11, wherein the step of detecting (140) data flow overload comprises: -receiving (110) overload events comprising both a link-level flow control event and a congestion notification event; and -updating (120) overload information in relation to paths of the set, based on the received overload events, and wherein the step of re-routing the data flow is carried out according to the updated overload information.
  13. The method of claim 4, wherein: -the step of receiving (110) the at least one overload event is implemented at a first node of the network; and -the step of re-routing is implemented at a second node of the network, wherein re-routing is carried out according to the at least one overload event received at the first node of the network.
  14. The method of claim 13, wherein: -the step of receiving comprises receiving at a first node or a source node of the network a congestion notification generated (100) by a third node of the network; -the step of updating comprises updating overload information at the second node, from the first node or source node, based on the received congestion notification; and -the step of re-routing is implemented at the second node, based on the updated overload information.
  15. Ethernet network (10), comprising: -nodes (S1 - S12) and connections (LS3S7, LS3S10, LS3S7a, LS3S10a) between said nodes, the connections defining a set of data flow paths; and -data flow overload management means, designed to take the steps of any one of claims 1 to 14.
  16. A computer program product comprising program code means to take steps of the method according to any one of claims 1 to 15.
GB1001283A 2009-01-28 2010-01-27 Managing overload in an Ethernet network by re-routing data flows Withdrawn GB2467424A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP09151537 2009-01-28

Publications (2)

Publication Number Publication Date
GB201001283D0 GB201001283D0 (en) 2010-03-10
GB2467424A true GB2467424A (en) 2010-08-04

Family

ID=42046098

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1001283A Withdrawn GB2467424A (en) 2009-01-28 2010-01-27 Managing overload in an Ethernet network by re-routing data flows

Country Status (1)

Country Link
GB (1) GB2467424A (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022227B (en) * 2022-06-12 2023-07-21 长沙理工大学 Data transmission method and system based on circulation or rerouting in data center network


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020176363A1 (en) * 2001-05-08 2002-11-28 Sanja Durinovic-Johri Method for load balancing in routers of a network using overflow paths
US20070070893A1 (en) * 2003-05-15 2007-03-29 Siemens Aktiengesellschaft Method and network node for self-regulating, autonomous and decentralized traffic distribution in a multipath network
GB2404826A (en) * 2003-08-01 2005-02-09 Motorola Inc Packet router which re-routes packet to an alternative output port when the primary output port buffer is overloaded
US20050157641A1 (en) * 2004-01-21 2005-07-21 Anindya Roy Congestion control in connection-oriented packet-switching networks
WO2008103602A2 (en) * 2007-02-22 2008-08-28 Verizon Services Organization Inc. Traffic routing
GB2461132A (en) * 2008-06-27 2009-12-30 Gnodal Ltd Managing congestion in a multi-path network

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2489376B (en) * 2010-04-22 2016-10-26 Ibm Network data congestion management system
US9191334B2 (en) 2011-01-28 2015-11-17 Siemens Aktiengesellschaft Method for improving the quality of data transmission in a packet-based communication network
CN103329490A (en) * 2011-01-28 2013-09-25 西门子公司 Method for improving the quality of data transmission in a packet-based communication network
WO2012101054A1 (en) * 2011-01-28 2012-08-02 Siemens Aktiengesellschaft Method for improving the quality of data transmission in a packet-based communication network
CN103329490B (en) * 2011-01-28 2016-08-10 西门子公司 Improve method and the communication network of data transmission quality based on packet communication network
WO2012119214A1 (en) * 2011-03-04 2012-09-13 Research In Motion Limited Controlling network device behavior
US20120226802A1 (en) * 2011-03-04 2012-09-06 Wei Wu Controlling Network Device Behavior
US9503223B2 (en) 2011-03-04 2016-11-22 Blackberry Limited Controlling network device behavior
US8594096B2 (en) 2011-10-31 2013-11-26 Hewlett-Packard Development Company, L.P. Dynamic hardware address assignment to network devices in a switch mesh
EP2843886A1 (en) * 2013-08-30 2015-03-04 TELEFONAKTIEBOLAGET LM ERICSSON (publ) Load balancing among alternative paths
US10367739B2 (en) 2013-08-30 2019-07-30 Telefonaktiebolaget Lm Ericsson (Publ) Load balancing among alternative paths
EP3288227A4 (en) * 2015-05-14 2018-05-23 Huawei Technologies Co., Ltd. Method for processing network congestion and switch
US10305804B2 (en) 2015-05-14 2019-05-28 Huawei Technologies Co., Ltd. Method for processing network congestion and switch
WO2018205688A1 (en) * 2017-05-12 2018-11-15 华为技术有限公司 Method, apparatus and system for data transmission

Also Published As

Publication number Publication date
GB201001283D0 (en) 2010-03-10

Similar Documents

Publication Publication Date Title
US20220191127A1 (en) Method and system for providing network ingress fairness between applications
GB2467424A (en) Managing overload in an Ethernet network by re-routing data flows
US7733770B2 (en) Congestion control in a network
US7746784B2 (en) Method and apparatus for improving traffic distribution in load-balancing networks
US8503307B2 (en) Distributing decision making in a centralized flow routing system
EP3024186B1 (en) Methods and apparatus for defining a flow control signal
US8989017B2 (en) Network congestion management by packet circulation
WO2003085910A2 (en) Methods and apparatus for in-order delivery of fibre channel frames
US20100177638A1 (en) High performance probabilistic rate policer
US10728156B2 (en) Scalable, low latency, deep buffered switch architecture
Leonardi et al. Congestion control in asynchronous, high-speed wormhole routing networks
Minkenberg et al. Adaptive routing for convergence enhanced Ethernet
US6389017B1 (en) Resource scheduling algorithm in packet switched networks with multiple alternate links
Minkenberg et al. Adaptive routing in data center bridges
Farahmand et al. A closed-loop rate-based contention control for optical burst switched networks
Krishnan et al. A Localized Congestion Control Mechanism for PCI Express Advanced Switching Fabrics
US11968116B2 (en) Method and system for facilitating lossy dropping and ECN marking
CN111510391B (en) Load balancing method for fine-grained level mixing in data center environment
Chrysos et al. Integration and QoS of multicast traffic in a server-rack fabric with 640 100G ports
Wang Stationary behavior of TCP/AQM with many flows under aggressive packet marking
US20240056385A1 (en) Switch device for facilitating switching in data-driven intelligent network
EP2164210B1 (en) Methods and apparatus for defining a flow control signal
Tam et al. Leveraging performance of multiroot data center networks by reactive reroute
Bu et al. A traffic splitting algorithm based on dual hash table for multi-path internet routing
Balaji ENERGY EFFICIENT ALGORITHM FOR MULTICAST ROUTING TECHNIQUES THROUGH THE MEDIUM ACCESS CONTROL (MAC) LAYER IN MANET

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)