WO2015077878A1 - Switched path aggregation for data centers - Google Patents

Switched path aggregation for data centers

Info

Publication number
WO2015077878A1
Authority
WO
WIPO (PCT)
Prior art keywords
path
packet
switched
packet switching
paths
Prior art date
Application number
PCT/CA2014/051121
Other languages
English (en)
Inventor
Liam Casey
Original Assignee
Rockstar Consortium Us Lp
Priority date
Filing date
Publication date
Application filed by Rockstar Consortium Us Lp filed Critical Rockstar Consortium Us Lp
Publication of WO2015077878A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/46 Interconnection of networks
    • H04L12/4633 Interconnection of networks using encapsulation techniques, e.g. tunneling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/46 Interconnection of networks
    • H04L12/4604 LAN interconnection over a backbone network, e.g. Internet, Frame Relay
    • H04L12/462 LAN interconnection over a bridge based backbone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/50 Routing or path finding of packets in data switching networks using label swapping, e.g. multi-protocol label switch [MPLS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/66 Layer 2 routing, e.g. in Ethernet based MAN's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q SELECTING
    • H04Q11/00 Selecting arrangements for multiplex systems
    • H04Q11/0001 Selecting arrangements for multiplex systems using optical switching
    • H04Q11/0005 Switch and router aspects
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks

Definitions

  • The switches in a large data center are typically organized as a three-level hierarchy of Top of Rack (ToR) switches (switches usually situated on top of a rack of computers), Leaf switches and Spine switches.
  • ToR Top of Rack
  • The Leaf level is also called the Aggregation level and the Spine level is also called the Core level.
  • In a folded Clos switch design, ToR, Leaf and Spine switches are all high radix switches, i.e. they have a large number of ports. In a true folded Clos organization there are no direct links between switches of the same level: all ports of each Spine switch connect to Leaf switches; Leaf switch ports are of two types, those that connect to Spine switches and those, perhaps of lower capacity, that connect to client ToR switches. Likewise, ToR switch ports are of two types, those connected to Leaf switches and those connected to the computers/hosts in their rack. Packets are forwarded upwards towards the Spine and then downwards towards the destination ToR switch. In consequence, any path between any pair of ToR switches will be either two hops (when both ToR switches have at least one link to a common Leaf switch) or four hops (when the ToR switches are connected to different Leaf switches).
  • OSI Layer 3 (L3) forwarding in switches requires more packet processing power than Ethernet (OSI Layer 2 or L2) forwarding, so, for a given packet throughput capacity, pure Layer 2 switches are cheaper than combined L2/L3 switches. Routing requires the continuous operation of a routing protocol, with more overheads and the potential for service-affecting reconfigurations when there are switch or link failures (which, given the potential size of data center backbones, occur relatively frequently).
  • Another drawback of using Layer 3 forwarding is that it interferes with a desired mode of operation in which all servers (whether executing directly on hosts or as Domains (Virtual Machines)) are treated as equally able to be executed at any host. Having to group related servers locally in the same Layer 3 subnet, as is required, for example, for load balancing servers deploying Direct Server Return, both limits the scale of a service and results in under-utilization of computing resources.
  • aspects of the present invention provide methods, apparatus and systems for forwarding packets across a hierarchical organization of switches constituting a network for the purpose of interconnecting a large number of client systems.
  • Some embodiments provide a method for spreading Ethernet packet streams over the plurality of packet switched paths between all the pairs of switches at a particular hierarchical level, be that the Leaf level or, alternatively, the ToR level.
  • Some embodiments use Ethernet Switched Paths (ESPs) of the IEEE 802.1 PBB-TE standard as the paths, wherein each path is uniquely identified by a combination of a destination backbone medium access control (B-MAC) address and a backbone VLAN identifier (B-VID).
  • B-MAC backbone medium access control
  • B-VID backbone VLAN identifier
  • Some embodiments use a control or management entity to determine the ESPs, control the assignment of B-VIDs and install the paths in the switches of the data center network.
  • the present technique extends Ethernet Link Aggregation to aggregate packet switched paths into what appears to the aggregation client as a single Ethernet link.
  • One aspect of the invention provides a method of forwarding Ethernet frames at a first edge bridge of a network comprising the first edge bridge, a second edge bridge and a plurality of intermediate network elements, wherein there are more than two possible communication paths between the first edge bridge and the second edge bridge, and all possible communication paths between the first edge bridge and the second edge bridge traverse at least one intermediate network element.
  • the method comprises: associating a respective path identifier with each of a plurality of
  • Another aspect of the invention provides a method of establishing packet switched paths in a hierarchical network comprising at least two layers of packet switching elements, each packet switching element of a first layer being connected to a plurality of packet switching elements of a second layer.
  • The method may be performed by a control system which is distinct from the packet switching elements and may comprise: determining a path between a respective pair of packet switching elements of the first layer, via at least one packet switching element of the second layer; installing forwarding state in each packet switching element traversed by the path, such that packets can be forwarded via the path; and notifying at least one of the respective pair of packet switching elements of the first layer that the path is a candidate to be aggregated by the at least one of the respective pair of packet switching elements into a switched path aggregation group.
  • the underlay fabric system comprises: an underlay fabric control system; a plurality of interconnected backbone core bridges (BCBs) wherein each BCB is operable to receive instructions from the underlay fabric control system, the instructions comprising instructions to add, modify and/or delete forwarding entries of the BCB; and a plurality of backbone edge bridges (BEBs) each comprising at least one ingress port, wherein each BEB is connected to at least one BCB and at least one pair of the plurality of BEBs is not directly connected to each other, each BEB being operable to receive instructions from the underlay fabric control system, the instructions comprising instructions to add switched paths to particular switched path aggregation groups;
  • BCBs backbone core bridges
  • BEBs backbone edge bridges
  • a further aspect of the invention comprises a method of transporting overlay network packets over a packet switched network which comprises a plurality of interconnected core packet switches and a plurality of edge switches having installed between each respective pair of edge switches a respective set of at least two traffic engineered packet switched paths.
  • Each traffic engineered path transits at least one core packet switch, and no two of the traffic engineered paths between any pair of edge switches transit exactly the same set of core packet switches.
  • The method comprises: receiving an overlay network packet at an ingress edge switch, the ingress edge switch being one of the plurality of edge switches; determining an egress edge switch for the received overlay network packet, the egress edge switch being one of the plurality of edge switches, the determination being based, at least in part, on a destination address field of the received overlay network packet; responsive to determining the egress edge switch for the received overlay network packet, selecting one traffic engineered packet switched path from the set of at least two traffic engineered packet switched paths installed between the ingress edge switch and the egress edge switch; encapsulating the received overlay network packet with header fields associated with the selected traffic engineered packet switched path; and forwarding the encapsulated overlay network packet over the selected one traffic engineered packet switched path towards the egress edge switch.
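  • Purely as an illustrative sketch of this ingress-side sequence (the Python class, data structures and the hash-based flow selector below are assumptions of ours, not text from the patent), the handling of an overlay packet at an ingress edge switch might look like the following.

        import zlib
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class Path:               # one traffic engineered packet switched path
            path_id: str          # e.g. a (B-MAC DA, B-VID) pair for an ESP
            header: bytes         # pre-built encapsulation header for this path

        class IngressEdgeSwitch:
            def __init__(self, egress_map, paths_per_egress):
                self.egress_map = egress_map      # overlay DA -> egress edge id
                self.paths = paths_per_egress     # egress edge id -> [Path, ...]

            def forward(self, overlay_packet, overlay_da, flow_key):
                egress = self.egress_map[overlay_da]             # 1. determine egress edge
                group = self.paths[egress]                       # 2. its set of >= 2 paths
                path = group[zlib.crc32(flow_key) % len(group)]  # 3. pick one path per flow
                encapsulated = path.header + overlay_packet      # 4. encapsulate
                self.transmit(path, encapsulated)                # 5. forward towards egress

            def transmit(self, path, frame):
                # stand-in for transmission on the port associated with the path
                print(f"sending {len(frame)} bytes on {path.path_id}")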
  • FIG. 1 is a representation of a 3 level fat tree organization of switches as might be deployed, at much bigger scale, in a data center;
  • FIG. 2 depicts a functional block diagram of the elements that perform link aggregation according to the IEEE Standard 802.1 AX;
  • FIG. 3 is a representation of a PBB-TE Ethernet frame as used in Ethernet Switched Paths (ESPs);
  • FIG. 4 combines the network of FIG. 1 with functional block diagrams of the elements that perform switched path aggregation according to the present invention, to show an instance of a Switched Path Aggregation group (SPAG);
  • SPAG Switched Path Aggregation group
  • FIG. 5 is a refinement of an Aggregator of FIG. 4;
  • FIG. 6 is a flowchart of the steps for the installation and operation of SPAGs
  • FIG. 7 is a representation of an underlay fabric comprising a small full mesh core network realizable using the present invention. Each of the connections shown is a Switched Path Aggregation Group; and
  • FIG. 8 depicts a Virtual Machine (VM) Host computer, advantageously enhanced to utilize the present invention.
  • VM Virtual Machine
  • FIG. 1 is a representation of a 3 level fat tree organization of switches as might be deployed, at much bigger scale, in a data center.
  • FIG. 1 depicts uniform switches each having 6 ports of equal capacity. In a real deployment the number of ports per switch would be much larger, say 64 or even 128.
  • The top level, called the spine or core, consists of switches 111 to 119. Switches of the second and third levels are organized into pods 10 to 60. Within each pod there are shown 3 second level or leaf level switches (211, 212 and 213 in pod 10 and 261, 262 and 263 in pod 60) and 3 third level or Top of Rack (ToR) switches (311, 312 and 313 in pod 10 and 361, 362 and 363 in pod 60).
  • ToR Top of Rack
  • Within each pod 10 to 60, ports of the ToR switches belonging to the pod are joined by links to ports of the pod's Leaf switches for the bi-directional transmission of packets between the ToR level and the Leaf level.
  • Each of the spine switch ports has a link to a port of a leaf switch.
  • End systems originate and terminate the packets switched by the three levels of switches. These end systems are typically each attached to one, or maybe two, of the ToR switch ports. These end systems are typically computing devices, hosting perhaps a single application or service, or, alternatively, hosting a plurality of virtual machines (VMs) or "containers" which in turn support applications or services. End systems in some pods may be predominantly storage or database nodes. Also, as data centers have connectivity to wide area networks, the organization of some pods may be different, for example lacking any ToR switches and instead having wide-area-capable leaf switches coupled to the spine switches and one or more wide area networks.
  • VMs virtual machines
  • From FIG. 1 it is clear that in a folded Clos type of network there are a large number of minimal hop paths available for transporting packets from one edge to the other.
  • For example, a packet originating at ToR switch 311 and destined for ToR switch 362 could traverse Leaf 211, Spine 111 and Leaf 261, or it could traverse Leaf 212, Spine 115 and Leaf 262, to list two of the nine possible paths.
  • As the radix (number of ports) of the switches grows, the number of potential paths grows faster, as the sketch below illustrates.
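  • The following Python fragment is an illustrative, non-normative sketch that counts minimal-hop ToR-to-ToR paths in an idealized folded Clos built from uniform switches of a given radix (the function and the uniform, half-up/half-down port split are our assumptions); for radix 6 it reproduces the nine four-hop paths of the FIG. 1 example.

        def minimal_hop_paths(radix, same_pod):
            """Count minimal-hop paths between two ToR switches in an idealized
            folded Clos where every switch has `radix` ports, half facing up."""
            up = radix // 2                   # uplinks per ToR (and per Leaf)
            if same_pod:
                # Two hops: up through any Leaf shared by both ToR switches.
                return up
            # Four hops: choose one of `up` Leaf switches in the source pod,
            # then one of that Leaf's `up` Spine uplinks; the chosen Spine
            # fixes the Leaf used in the destination pod.
            return up * up

        assert minimal_hop_paths(6, same_pod=False) == 9      # FIG. 1 example
        assert minimal_hop_paths(64, same_pod=False) == 1024  # grows quickly with radix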
  • While the ToR switches are designated here as the edges of the network, the present invention is not limited to the edges being ToR switches. Neither is the invention limited to three-level fat trees and folded Clos organizations of switches. Rather, the invention is applicable to any network arrangement where there are multiple minimal hop paths between edges.
  • Link Aggregation Groups (LAGs), also known as Multi-Link Trunks (MLTs), were first standardized by the IEEE 802.3ad Working Group as Section 43 of IEEE Standard 802.3-2000. This standard was subsequently published as IEEE 802.1AX.
  • FIG. 2, derived from IEEE 802.1AX, depicts how two systems, such as bridges 410 and 412, can partner to treat multiple (point to point, full duplex) links between them, 400, as a single Aggregated Link.
  • Link Aggregation allows MAC Clients, 410 and 412, to each treat sets of two or more ports (for example ports 440, 442 and 444 on MAC client 410 as one set, and ports 441, 443 and 445 on MAC client 412 as another set) respectively as if they were single ports.
  • the IEEE 802.1AX standard defines the Link Aggregation Control Protocol (LACP) for use by two systems that are connected to each other by one or more physical links to instantiate a LAG between them.
  • LACP Link Aggregation Control Protocol
  • an Aggregation Key has to be associated with each Aggregation port that can potentially be aggregated together to form an Aggregator.
  • The binding of ports to Aggregators within a system is managed by the Link Aggregation Control function for that system, which is responsible for determining which links may be aggregated, aggregating them, binding the ports within the system to an appropriate Aggregator 430, 432, and monitoring conditions to determine when a change in aggregation is needed. While binding can be under manual control, automatic binding and monitoring may occur through the use of the Link Aggregation Control Protocol (LACP).
  • the LACP uses peer exchanges across the links to determine, on an ongoing basis, the aggregation capability of the various links, and continuously provides the maximum level of aggregation capability achievable between a given pair of Systems.
  • LACP provides LAG management frames for identifying when members of the link aggregation group have failed. The response to failure is to reduce the number of links in the group, not to recalculate the topology. And increasing the number of paths in a LAG group is automatic too. If extra switching capacity is added to a system there should not be any co-ordination needed in adding extra links to LAG groups. If it is necessary to re-configure some links and switches, this should appear to endpoints as a reduction in active links until the Aggregation Control function is notified of the new resources.
  • FIG. 2 depicts link aggregation between bridges, but 802.1AX link aggregation can also be used by end stations connecting to bridges.
  • A drawback of 802.1AX LAGs for folded Clos style networks is, as can be seen in FIG. 1, that each link from a particular switch terminates on a distinct system, so that there is no opportunity to form Link Aggregation Groups. Note, however, that the embodiments described below can co-exist with deployments where each of the links between switches shown in FIG. 1 is in fact a plurality of links formed into an 802.1AX LAG. Such deployments may in fact be very common as data centers grow in size and change out old equipment for new, so that single links have differing capacities and more capacity is needed in different parts of the network.
  • An aspect of at least some embodiments of the present invention is to extend the operation of an Aggregator (430, 432) to bind logical ports instead of, or in addition to, physical ports (440 through 445).
  • a logical port comprises the functionality that encapsulates a packet to be transmitted with a tunnel encapsulation so that the Ethernet packet can be transported to a peer system through a tunnel.
  • While the tunnels can be set up between directly connected immediate neighbours, their utility in the context of data centers comes when they are established between switches of the same level (e.g. the ToR level) attached to different switches at the level above.
  • A tunnel connecting a ToR switch (e.g. 311) to a ToR switch in a different pod (e.g. 361) would traverse four physical links.
  • There are various methods, well known in the art, for establishing packet switched paths as tunnels, depending on the technology (e.g. bridging, routing or label switching) used to realize them. At least some embodiments of this invention are concerned with installing and aggregating packet switched paths into Switched Path Aggregation Groups (SPAGs). As described below, in preferred embodiments the tunnels are Ethernet Switched Paths (ESPs), but the invention is not limited to this type of packet switched path and other embodiments may, for example, use Multi-Protocol Label Switching (MPLS) label switched paths as tunnels.
  • MPLS Multi-Protocol Label Switching
  • MAC Media Access Control
  • VMs Virtual Machines
  • NIC virtual Network Interface Card
  • High performance Ethernet switches that can handle such large numbers of MAC addresses are not cheap: they require a huge amount of expensive ternary content addressable memory (TCAM) or other associative structures to look up MAC addresses at line rate speeds and, by their very size, such structures consume a lot of power.
  • TCAM ternary content addressable memory
  • One approach is to use IP switches in the core to switch IP encapsulated Ethernet packets to achieve uniform MAC addressing across the data center.
  • Ethernet standards have evolved to also allow for an Ethernet encapsulation process to take place.
  • As specified in the IEEE 802.1ah Provider Backbone Bridging (PBB) standard, customer Ethernet frames are encapsulated with a backbone header at a Backbone Edge Bridge (BEB).
  • PBB Backbone Bridging
  • BEB Backbone Edge Bridge
  • a Service Provider Backbone network is comprised of the aforementioned BEBs and Backbone Core Bridges (BCBs).
  • the backbone encapsulation header comprises a destination MAC address on the service provider's network (B-MAC DA) 542, a source MAC address on the service provider's network (B-MAC SA) 544, and a VLAN ID (B-VID) 546.
  • BCBs need only work with the backbone MAC addresses and B-VIDs, thus substantially reducing the required table sizes of core switches.
  • A fourth component of the backbone encapsulation header is a service instance identifier (I-SID) 548, but this is only of significance to BEBs.
  • The core switches, such as those at the leaf and spine levels, can be general standard high speed bridges.
  • These BCBs have only to forward packets based on the B-MAC and B-VID values in the encapsulation headers.
  • The possible number of B-MAC and B-VID values that a switch will need to have associative memory for is far smaller than the number of "Customer" MAC addresses, and this enables cheaper, more energy efficient data center switching facilities.
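  • A minimal sketch of the backbone encapsulation header fields named above, and of a BCB forwarding lookup keyed only on (B-MAC DA, B-VID), is given below; the Python names are illustrative assumptions, not part of the standard or the patent.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class BackboneHeader:      # IEEE 802.1ah encapsulation fields (FIG. 3, 540)
            b_mac_da: str          # 542: destination MAC on the backbone
            b_mac_sa: str          # 544: source MAC on the backbone
            b_vid: int             # 546: backbone VLAN identifier
            i_sid: int             # 548: service instance identifier (BEBs only)

        class BackboneCoreBridge:
            """A BCB forwards solely on (B-MAC DA, B-VID); customer MACs stay invisible."""
            def __init__(self):
                self.fib = {}      # (b_mac_da, b_vid) -> output port

            def install(self, b_mac_da, b_vid, out_port):
                self.fib[(b_mac_da, b_vid)] = out_port

            def forward(self, header: BackboneHeader):
                # None means discard: with explicitly configured forwarding there is no flooding.
                return self.fib.get((header.b_mac_da, header.b_vid))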
  • PBT Provider Backbone Trunks
  • PBB-TE Provider Backbone Bridges - Traffic Engineering
  • PBB-TE replaces the normal Ethernet spanning tree and flooding operations with explicit configuration of the MAC forwarding at each Backbone Bridge.
  • Explicit MAC forwarding rules are configured for each B-VID in a range of Virtual LAN identifiers. Explicit configuration permits complete route freedom for paths defined between any pair of source B-MAC (544) and destination B-MAC (542) addresses, permitting the engineering of path placement. Multiple paths between pairs of source B-MAC and destination B-MAC addresses are distinguished by using distinct B-VIDs (546) as path identifiers. Referring to the previous example in FIG. 1, the multiple paths between a pair of ToR switches can thus each be assigned a distinct B-VID.
  • FIG. 4 depicts 3 ESPs, 480, 482 and 484, installed across the radix 6 fat tree network of FIG. 1 between BEBs (ToR switches) 311 and 363. For the radix 6 fat tree of FIG. 4 there are 9 potential paths for ESPs.
  • FIG. 4 also depicts entities of the BEBs 311 and 363 that encapsulate and forward Ethernet frames over the Switched Path Aggregation Group (SPAG) composed of switched paths 480, 482 and 484.
  • B-MAC clients 622 and 624 are responsible for encapsulating customer Ethernet frames, 500 in FIG. 3, with a Backbone Encapsulation Header 540.
  • B-MAC clients may have additional functions, but minimally they have to determine a destination B-MAC address for the B-MAC DA field 542 based on the C-MAC DA field 502 of the customer frame.
  • the value assigned to the B-MAC SA field 544 is a MAC address of the BEB.
  • The B-MAC client may be a Virtual Switch Instance (VSI) serving a particular community of interest identified by a specific I-SID.
  • VSI Virtual Switch Instance
  • The depiction of BEBs 311 and 363 in FIG. 4 follows the conventions of IEEE 802.1 specifications to show that an encapsulated Ethernet frame is passed from the B-MAC clients 622 and 624 to the Aggregator entities 631 and 632 respectively.
  • The B-VID Selector (641 and 642) determines the value to be inserted into the B-VID field 546 of the encapsulation header.
  • The set of those ESPs installed between the Aggregator's BEB and the destination BEB is herein designated a Switched Path Aggregation Group (SPAG).
  • SPAG Switched Path Aggregation group
  • One method of selecting the B-VID for an encapsulated Ethernet frame is to hash its I-TAG and use the resulting value, modulo the number of switched paths in the SPAG, as an index into a table of B-VIDs.
  • This method should result in an almost even spreading of customer frames over the members of the SPAG, while ensuring that frames from a single flow are always forwarded over the same member of the SPAG. If the number of sources is very small, however, as might be the case when the BEB is implemented in a host computer, then extra fields from higher layer protocol headers might need to be incorporated in the hash operation.
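  • A sketch of the B-VID selection just described, hashing the encapsulated frame's I-TAG (optionally salted with higher layer fields when the source population is small) and indexing the SPAG's B-VID table modulo its size, is shown below; the helper names and the CRC32 hash choice are illustrative assumptions.

        import zlib

        def select_b_vid(spag_b_vids, i_tag, extra_flow_fields=b""):
            """Pick the B-VID of one SPAG member for this frame.

            spag_b_vids: list of B-VIDs of the operational ESPs in the SPAG.
            i_tag: bytes of the frame's I-TAG (carries the I-SID and C-MACs).
            extra_flow_fields: optional higher layer fields (e.g. TCP/UDP ports)
                for when too few sources feed this BEB to spread the load well.
            """
            digest = zlib.crc32(i_tag + extra_flow_fields)
            return spag_b_vids[digest % len(spag_b_vids)]

        first = select_b_vid([101, 102, 103], b"\x00\x01\x02")
        second = select_b_vid([101, 102, 103], b"\x00\x01\x02")
        assert first == second   # frames of one flow always map to the same member ESP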
  • Each ESP to a specific destination BEB transits distinct physical ports at the originating and destination BEBs. It should be noted that this need not be the case for less structured core network organizations. Also, if the BEB is an Hswitch (being, as described below, a realization of a switch on a host computer), then the diversity of path routes is only exhibited in the core of the network, as all paths will pass through the one or two links that the host computer has to one or two ToR switches.
  • PNP Provider Network Port
  • The depiction of ToR switches 311 and 363 as Switched Path Aggregation enabled BEBs follows the style of IEEE 802.1 in showing frames being handed off between distinct sublayer functions.
  • Actual embodiments of the invention could optimize away distinct sublayers, so that the B-VID is selected as part of the process of determining the B-MAC DA and the PNP to be used.
  • SAN Storage Area Network
  • BCBs Intermediate switches
  • For example, Storage Area Network (SAN) traffic might be carried over a SPAG whose paths transit BCBs that are Priority-based Flow Control capable, while all other traffic is directed over a SPAG where the BCBs on the paths are plain Ethernet switches.
  • This last example is not meant to suggest in any way that SPAGs are restricted to frames having a single setting of the so-called "p-bits" in their B-VIDs.
  • FIG. 5 provides a more detailed look at the components that might comprise the Aggregator 631 of FIG. 4.
  • Before Switched Paths can be aggregated into SPAGs, an Aggregation Controller 671 must determine the state of the candidate switched paths that could form a SPAG to another BEB.
  • Each BEB's Aggregator will be informed of the parameters of each packet switched path available for its use. These parameters will usually comprise the tunnel packet header fields used to encapsulate the Ethernet frame to be transported over the packet switched path.
  • For ESPs, the parameters would comprise the destination B-MAC and B-VID, which, together with the source B-MAC address of the BEB itself, define the encapsulation header.
  • ESPs Ethernet Switched Paths
  • A further parameter to be determined is an identifier or index for the local switch port 661, 662 or 663 that initiates transmission on the particular packet switched path.
  • These network core-facing ports are called Provider Network Ports (PNPs).
  • each potential SPAG will have an associated SPAG identifier.
  • the parameters of a packet switched path provided to the BEB would then include the SPAG identifier that the packet switched path is to be aggregated into.
  • A further refinement would be to associate a traffic filter, specifying, in some fashion, matching criteria for determining which customer frames are forwarded on which SPAG.
  • The Aggregation Controller 671 could directly take all the information it receives concerning the potential switched paths and immediately form a SPAG from the full set of them, by enabling the B-VID Selector 641 to use all the candidate B-VIDs for the SPAG's B-MAC DA. But since failures of links and switches are commonplace in data centers, and pre-configuration of paths could possibly have had undetected errors, most embodiments would place the notified switched paths into a candidate list and determine the status of each switched path in the candidate list. Switched paths determined to be operational would then be added to the SPAG, for example by making an operational ESP's B-VID available to the B-VID Selector 641.
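  • The candidate-list behaviour described above might be sketched as follows (an illustrative assumption, not the patent's implementation): notified paths start as candidates, are probed, and only operational ones are exposed to the B-VID Selector.

        class SwitchedPathAggregator:
            def __init__(self, probe):
                self.candidates = {}   # b_vid -> path parameters, not yet trusted
                self.active = []       # B-VIDs the B-VID Selector may use
                self.probe = probe     # callable: send test frames, True if peer reports receipt

            def notify_path(self, b_vid, params):
                """The controller tells this BEB about an installed ESP that may join the SPAG."""
                self.candidates[b_vid] = params

            def refresh(self):
                """Promote candidates that pass the continuity test; demote failures."""
                for b_vid in list(self.candidates):
                    operational = self.probe(b_vid)
                    if operational and b_vid not in self.active:
                        self.active.append(b_vid)
                    elif not operational and b_vid in self.active:
                        self.active.remove(b_vid)

            def usable_b_vids(self):
                return list(self.active)   # consumed by the B-VID Selector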
  • One method for the Aggregation Controller 671 of a BEB to determine whether a candidate packet switched path is actually capable of transporting Ethernet frames to the peer BEB is to encapsulate a series of test control frames and forward them along the switched path.
  • The first step in forwarding a control frame is to pass it to a Control Parser/Multiplexor 651, which multiplexes control frames received from the Aggregation Controller with encapsulated customer frames received from the B-VID Selector 641 to form a single stream of frames to be transmitted by the relevant PNP 661, 662 or 663.
  • The Control Parser/Multiplexor 651 also examines frames received at the PNPs and steers them to the Aggregation Controller 671 if they are control frames, and to the collector function (not shown) if they are encapsulated customer frames. If the Aggregation Controller subsequently receives a report control frame from the peer BEB indicating that the test frames it originated were received by the peer BEB, then it can conclude that the candidate packet switched path is operational and can be added to the SPAG.
  • The Aggregation Controller must be able to match up each packet switched path for which it is an originator with a corresponding packet switched path that it terminates. While it is not an absolute requirement that the reverse path traverses (in reverse order) the same links and switches as the forward path, it is generally advantageous, from the perspective of removing failed packet switched paths from SPAGs, that forward and reverse paths be the same. In the simplest embodiments, both the forward and reverse paths would share the same path identifier: the same B-VID for ESPs, where the B-MAC DA of the forward ESP is the B-MAC SA of the reverse path and vice versa. For other embodiments, the Aggregation Controller will have to be informed of the parameters of the reverse path, such as a different B-VID, when it is informed of the parameters of the candidate path.
  • There are at least three protocol choices for sending test and report control frames. These three protocols are asynchronous, and a single frame type serves as both a test and a report frame. Both end points send to the other a sequence of test frames, which report state information derived in part from whether or not the end point has received test frames sent from the other end point. In particular, if an end point has not received any test frames for a period of time, it signals this in the state information field of the test frames that it is transmitting.
  • LACP Link Aggregation Control Protocol
  • The first of the protocols is the Link Aggregation Control Protocol (LACP) of IEEE 802.1AX, extended to be operable over ESPs.
  • LACP was originally designed to determine the status of the links attached to physical ports.
  • LACP functions so that peer Link Aggregators (FIG. 2, 430 and 432) synchronise between themselves which of the potentially multiple links 400 between them are currently part of the LAG, i.e. LACP ensures that the ports 440, 442, 444 over which the Distributor at one end 470 spreads traffic are matched to the ports from which the Collector at the other end 482 receives traffic.
  • An ESP-extended version of LACP would work with logical ports, each defined by the B-VID and B-MAC DA pair of the ESP they originate.
  • While LACP has a discovery phase in which it determines which links terminate on which neighbour, the Aggregation Controller 671 need not perform a discovery phase. Rather, the Aggregation Controller 671 can assume either that all the ESPs sharing the same B-MAC DA can be aggregated into one SPAG, or that it will have been informed of which ESPs belong to which SPAG, and can move straight to the phase of determining whether a candidate ESP is operable to carry encapsulated traffic.
  • Other embodiments may execute a full LACP style frame exchange over each ESP pair so that an Aggregation Controller and its peer can reach agreement on the member ESPs (forward and reverse paths) that will form the Switched Path Aggregation group between them.
  • The second protocol is the IEEE 802.1ag Continuity Check Protocol (CCP).
  • CCP Continuity Check Protocol
  • Alternative embodiments could use CCP to determine the continuity of an Ethernet Switched Path, i.e. whether it is operable to carry traffic. This is similar to the use of CCP to detect failures described in US Patent 7,996,559, "Automatic MEP provisioning in a link state controlled Ethernet network" by Mohan et al., hereby incorporated by reference.
  • This in-band or data plane protocol is implemented in many switches. Note that the standard versions of LACP and CCP utilize standard multicast MAC addresses, which would lead to having to install extra forwarding table entries in each intermediate switch on a path. But Mohan describes substituting the end unicast address (in the present application, the destination B-MAC) for the multicast address in CCP, and a similar substitution could be effected for LACP.
  • The third protocol is Bidirectional Forwarding Detection, as described in IETF RFC 5880, "Bidirectional Forwarding Detection (BFD)", by D. Katz and D. Ward, hereby incorporated by reference. Embodiments could use a version of BFD to detect failures in the forward or reverse paths of members of SPAGs.
  • IETF RFC 5884, "Bidirectional Forwarding Detection (BFD) for MPLS Label Switched Paths (LSPs)" by Aggarwal et al., defines a version of BFD for MPLS LSPs, so BFD would be a natural choice for embodiments of the present invention where the packet switched paths aggregated into SPAGs are LSPs.
  • A version of BFD for Ethernet Switched Paths might also be a preferred embodiment, both because of its inherent robustness and because of the flexibility it affords the end points to mutually set the frequency of the periodic BFD control frames. So, for example, when there is no traffic between a pair of BEBs, the sending of BFD control frames on each ESP of the one or more SPAGs established between them could be relaxed to one frame every 300 ms, while, when a SPAG is carrying substantive traffic, the interval could be reduced to 0.3 ms.
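  • The interval adaptation in that example could be as simple as the following sketch; the 300 ms idle and 0.3 ms busy figures come from the text above, while the function itself and the detection multiplier of 3 (a common BFD convention) are illustrative assumptions.

        IDLE_INTERVAL_MS = 300.0   # per-ESP control frames when the SPAG is idle
        BUSY_INTERVAL_MS = 0.3     # per-ESP control frames when carrying traffic
        DETECT_MULTIPLIER = 3      # assumed: declare the ESP down after this many missed frames

        def control_frame_interval_ms(spag_carrying_traffic: bool) -> float:
            """How often BFD-style test frames are sent on each member ESP."""
            return BUSY_INTERVAL_MS if spag_carrying_traffic else IDLE_INTERVAL_MS

        def detection_time_ms(spag_carrying_traffic: bool) -> float:
            """Worst-case time before a dead ESP is removed from the SPAG."""
            return DETECT_MULTIPLIER * control_frame_interval_ms(spag_carrying_traffic)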
  • a further refinement of the present invention would be to augment the information carried in the test and report control frames so as to allow the Aggregator control function at each end to determine how congested each ESP in a SPAG is, and consequently to bias the assignment of flows to member ESPs towards assigning fewer flows to congested ESPs.
  • Detecting congestion could be as simple as detecting lost control frames and otherwise comparing the jitter in arrival times with some baseline, or it could involve adding timestamps and/or other fields to the control frames to allow a finer deduction of the congestion state experienced by the control frame. It should be noted that an advantage of this refinement is that it does not require intermediate switches to perform special operations on the control frames, in keeping with a goal of at least some embodiments of this invention: reducing the complexity required in core switches.
  • the packet switched paths that will be aggregated into SPAGs are set up by a management entity or controller.
  • In a straightforward implementation, one such controller might control all the participating switches in a data center.
  • the overall controller function could be distributed amongst a plurality of controllers, with switches allocated in some fashion to be controlled by respective controllers.
  • Controller functionality is described herein with reference to a single operational controller which, since it can be used, as will be presently described, to set up an Ethernet Underlay Network or Fabric, is herein called an Underlay Fabric Controller (UFC).
  • UFC Underlay Fabric Controller
  • the UFC is not limited to such an embodiment and could be a distributed controller.
  • the UFC could be implemented on a suitably programmed general purpose computing element or on a more specialized platform.
  • the UFC needs to determine the topology of the constituent BEB and BCB switches, to calculate a set of Ethernet Switched Paths (ESPs) that are to be the SPAG members between BEBs, to install the B-VID, B-DA forwarding table entries in the relevant switches to realize the ESPs, and to install in BEBs the information necessary for them to aggregate some or all of the ESPs terminating on another BEB into a Switched Path Aggregation Group.
  • ESPs Ethernet Switched Paths
  • the first stage 602 is for the UFC to acquire the topology of the core network. While it is possible that the UFC is configured with complete topology information, most embodiments will involve at least some determination by constituent switches as to which other switches they themselves are connected to.
  • The discovery phase of link state protocols such as IS-IS and OSPF provides one way of doing this. While IS-IS, in particular, could be run when there is no IP layer in place, there is a specific Ethernet layer protocol for switches to discover characteristics of their neighbours.
  • Link Layer Discovery Protocol (LLDP), as now defined in IEEE 802.1AB-2009, is a widely implemented way for Ethernet switches to advertise their identity to the devices at the other end of each of their links.
  • Switches regularly transmit an LLDP advertisement from each of their operational ports. While LLDP advertisements can carry extra information, such as meaningful text names for originating switches, the two essential items of information conveyed are the originating system identifier (called a chassis ID) and an originating port identifier.
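  • A sketch of how the UFC might assemble a link-level topology from relayed LLDP observations (each observation being the local chassis/port plus the neighbour chassis/port heard on that port) follows; the data structures and switch names are illustrative assumptions.

        from collections import defaultdict

        def build_topology(lldp_reports):
            """lldp_reports: iterable of (local_chassis, local_port,
                                          neighbour_chassis, neighbour_port) tuples,
            i.e. what each switch learned on its operational ports and reported
            (for example via its LLDP MIB) to the controller."""
            links = defaultdict(dict)   # chassis -> {port: (peer_chassis, peer_port)}
            for chassis, port, peer_chassis, peer_port in lldp_reports:
                links[chassis][port] = (peer_chassis, peer_port)
            return links

        topo = build_topology([
            ("ToR-311", "p1", "Leaf-211", "p4"),    # hypothetical reuse of the figures' numbering
            ("Leaf-211", "p1", "Spine-111", "p2"),
        ])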
  • chassis ID originating system identifier
  • MIB management information base
  • the out of band (OOB) management network will usually be a routed network, complete with one or more Dynamic Host Configuration Protocol (DHCP) servers to assign IP addresses to newly powered up switches.
  • OOB out of band
  • the management network might be a single subnet, with Ethernet spanning tree bridging for forwarding packets.
  • a data center might be organized in a set of subnets, say one for each pod, and one covering the Spine switches, with IP routing between subnets.
  • the invention is not limited in how the management network is constructed, nor indeed whether topology information is conveyed to the UFC by way of an IP based protocol.
  • SDN Software Defined Networking
  • an operator might pre-configure for each switch a default ESP between it and an SDN controller.
  • To minimize pre-configuration operations, perhaps just those switches directly connected to the SDN controller would be pre-configured, and they in turn would advertise in their LLDP advertisements to their neighbours the B-VID and destination B-MAC to be used to reach an SDN controller.
  • When it receives new or updated topology information, the UFC performs an evaluation of which ESPs should be installed or removed across the core network (step 604).
  • A key decision, which might be a network operator settable parameter, is how many members a SPAG should have so as to provide good load spreading and uniform loading of switches, while not running into forwarding information base memory constraints.
  • The number of potential paths that could be installed is very large for higher radix switches: each ESP requires at least 60 bits of forwarding table space at each switch in its path.
  • A choice of a number such as 8 ESPs per SPAG might give nearly optimal load spreading without incurring unsupportable forwarding table costs.
  • Paths may be evaluated with the objective of uniformly distributing paths across all the switches, where paths between a pair of BEBs are maximally disjoint and the total number of paths passing through each Spine switch is the same (or as near to the same as the mathematics allows).
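  • One simple way to approximate the stated objective (a fixed number of member paths per SPAG, maximally disjoint, with switch usage kept even) is a greedy selection like the sketch below; the algorithm is an illustrative assumption rather than the patent's prescribed method.

        def choose_spag_paths(candidate_paths, members_per_spag, switch_load):
            """candidate_paths: list of paths, each a tuple of switch names it transits.
            switch_load: dict tracking how many installed paths already cross each switch.
            Greedily prefer paths that avoid switches already chosen for this SPAG
            and whose switches carry the fewest paths so far."""
            chosen = []
            used_switches = set()
            for _ in range(min(members_per_spag, len(candidate_paths))):
                remaining = [p for p in candidate_paths if p not in chosen]
                if not remaining:
                    break
                def cost(path):
                    overlap = sum(1 for sw in path if sw in used_switches)
                    load = sum(switch_load.get(sw, 0) for sw in path)
                    return (overlap, load)
                best = min(remaining, key=cost)
                chosen.append(best)
                used_switches.update(best)
                for sw in best:
                    switch_load[sw] = switch_load.get(sw, 0) + 1
            return chosen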
  • the basic uniform spreading of paths may be modified to accommodate differing link capacities, as might occur when some switches in a data center have been upgraded to a new model, while others remain with lower capacity links.
  • Other embodiments might modify the assignment of paths to BEB pairs based on a traffic matrix reflecting either provisioned or measured traffic patterns.
  • the UFC may use the reported topology information in additional ways: it may have a component that checks reported configurations against an ideal for the chosen architecture and generates trouble tickets or work orders for the correction of any mis-wiring or mis-configuration of the constituent switches.
  • each participating switch may be directed by the management system to install the packet switched paths (step 606).
  • Installing ESPs in the BCBs they transit comprises installing in the BCBs' forwarding information base the destination B-MAC and B-VID values for the ESPs.
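  • Installing one ESP then amounts to writing the same (B-MAC DA, B-VID) key, with the appropriate output port, into the forwarding information base of every BCB the path transits, roughly as sketched below (an illustrative assumption, regardless of whether the instructions are conveyed by OpenFlow or any other mechanism); the bcb objects are assumed to expose the install call of the BackboneCoreBridge sketched earlier.

        def install_esp(hops, b_mac_da, b_vid):
            """hops: ordered list of (bcb, out_port) pairs along the ESP's route."""
            for bcb, out_port in hops:
                bcb.install(b_mac_da, b_vid, out_port)

        def remove_esp(hops, b_mac_da, b_vid):
            """Undo the forwarding state when the controller retires a path."""
            for bcb, _ in hops:
                bcb.fib.pop((b_mac_da, b_vid), None)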
  • SDN Software Defined Networking
  • In a Software Defined Networking (SDN) deployment, the management system would be called an SDN Controller, and the instructions to the switches would be conveyed over a Control Plane interface, perhaps using a version of the SDN related OpenFlow protocol.
  • Control plane protocols such as Generalized Multi-Protocol Label Switching (GMPLS) and Resource Reservation Protocol - Traffic Engineering (RSVP-TE) also allow for the hop-by-hop signalling of a switched path.
  • GMPLS Generalized Multi-Protocol Label Switching
  • RSVP-TE Resource Reservation Protocol - Traffic Engineering
  • Edge switches, being the end points of SPAGs, may need to be notified as to which packet switched paths belong to which SPAG.
  • This information might be deduced from the destination B-MAC of each ESP, or might be determined by protocol exchanges over each packet switched path, in a similar fashion to the exchanges defined in IEEE 802.1AX.
  • Implementations of the present technique are not limited to having a single management or control entity compute and install the required packet switched paths. Any of the schemes known in the art for master and standby control, or distributed control, or hierarchical control could be used.
  • the UFC may need to detect changes in the topology (step 610) and re-evaluate the installed packet switched paths (step 604 again).
  • the re-evaluation may be responsive to detecting load changes or congestion as well as detecting topology changes resulting from the failures of links or switches and the addition of new, or returned to service, links and switches.
  • the routing of installed packet switched paths may be incrementally adjusted. Installed packet switched paths may be first removed from SPAGs or SPAG candidate lists and then uninstalled. New packet switched paths may be first installed and then added to SPAG candidate lists.
  • The UFC, in addition to receiving topology reports on a regular basis, might receive notification from BEBs when an ESP is removed from a SPAG because it has ceased to be operable.
  • The UFC may compute and configure replacement ESPs at the slower cadence of topology report reception. That said, there may be a requirement for a low frequency auditing process to detect any corruption of forwarding tables and the like that results in a packet switched path appearing to be operational to the UFC when in fact the BEBs have determined that it is not operational.
  • The management system must track the allocation of B-VIDs to ensure that, for a given destination B-MAC, each path is properly defined.
  • Ethernet Virtual Private Networks are overlay networks.
  • Many data centers of the size that would benefit from the present invention have multiple tenants, each using a fraction of the data center's resources. While a large data center may support tens, even hundreds, of thousands of low usage web server instances, each instance being a web server of a different tenant, a common deployment model is for a data center to assign each tenant a respective number of computing hosts, either physical machines or virtual machines (VMs), with the tenant organizing what computations run on their assigned computing hosts and how its computing hosts communicate with each other.
  • VMs virtual machines
  • The owners of data centers may themselves be running a number of very large scale applications. Some of these may be tuned for responsiveness in responding to Internet search queries, while others may be tuned for background computing, such as web crawling.
  • EVPNs provide an application with the lowest common denominator for communication between its constituent parts, namely a Layer 2 local area network.
  • the control software of a data center does not need to know anything about how a tenant or an application uses IP: whether it is IPv4 or IPv6, whether IP address spaces overlap, or indeed whether Layer 3 is used at all.
  • Using EVPNs also facilitates moving VMs or other constituent parts of an application from one host to another host anywhere in the data center, since the VM or application constituent will still be reachable without needing to change any IP addresses.
  • SPAGs can be used as "links" in an Underlay Fabric.
  • An Underlay Fabric is used to switch or transport packets of one or more overlay networks across the core switches of a network.
  • Ethernet Virtual Private Networks EVPNs
  • Narten et al. (in "Problem Statement: Overlays for Network Virtualization", Internet Draft draft-ietf-nvo3-overlay-problem-statement-04, herein incorporated by reference), in relation to the IETF NVO3 Working Group, describes the general way that overlay networks are realized as "Map and Encap".
  • When a packet of a specific overlay network instance arrives at a first-hop overlay device, i.e. an underlay fabric edge device, the device first performs a mapping function that maps the destination address (either L2 or L3) of the received packet into the corresponding destination address of the egress underlay fabric edge device, where the encapsulated packet should be sent to reach its intended destination.
  • The underlay fabric edge device then encapsulates the received packet with an overlay header and forwards the encapsulated packet over the underlay fabric to the determined egress.
  • An overlay header provides a place to carry either a virtual network identifier or an identifier that is locally significant to the egress edge device. In either case, the identifier in the overlay header specifies, at the egress, to which specific overlay network instance the packet belongs when it is de-encapsulated.
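  • The "Map and Encap" sequence can be sketched as below (an illustrative assumption; the field names follow FIG. 3): map the overlay destination to an egress edge, then prepend a backbone header carrying the overlay instance's I-SID, leaving the B-VID to be filled in by the SPAG Aggregator.

        def map_and_encap(c_mac_da, payload, i_sid, mapping_table, beb_b_mac):
            """mapping_table: overlay (customer) MAC -> B-MAC of the egress edge device.
            Returns (egress_b_mac, encapsulated_frame)."""
            egress_b_mac = mapping_table[c_mac_da]              # the "map" step
            header = {                                          # the "encap" step
                "B-MAC DA": egress_b_mac,   # 542
                "B-MAC SA": beb_b_mac,      # 544
                "B-VID": None,              # 546, set later by the SPAG Aggregator
                "I-SID": i_sid,             # 548, identifies the overlay instance
            }
            return egress_b_mac, (header, payload)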
  • Because the NVO3 Working Group is focused on the construction of overlay networks that operate over an IP (L3) underlay transport, the (corresponding) address produced by the mapping function is an IP address and the core switches of the underlay fabric must be IP routers. Using IP routers that support Equal Cost Multi-Path (ECMP) does allow for load spreading across the underlay fabric core.
  • ECMP Equal Cost Multi-Path
  • a goal of at least some embodiments of the current invention is to reduce the cost of switching in data centers by eliminating the requirement that the switches have high performance routing capabilities.
  • At least some embodiments of the current invention enable pure L2 (bridging) underlay fabrics and deployment of SPAGs gives traffic-engineered load balancing superior to ECMP.
  • FIG. 7 depicts a small example of an underlay fabric for realizing EVPNs.
  • Underlay Fabric Edge Switches 311, 312, 313, 361, 362 and 363, as might be the ToR switches of the first and last pods, 10 & 60 of FIG. 1, are full mesh connected by "links".
  • The "links" are in fact a plurality of distinct packet switched paths, each transiting a series of core switches (not shown), and aggregated into a SPAG.
  • The Aggregator 631 of Edge Switch 311 is shown as forming a SPAG 702 with Aggregator 633 of the Edge Switch 363.
  • SPAG 702 is the aggregation of a plurality of packet switched paths, as may be ESPs 480, 482 and 484 of FIG. 4, installed between the Edge Switch pair 311 and 363 by an Underlay Fabric Controller (UFC).
  • UFC Underlay Fabric controller
  • the "link" between a pair of Edge Switches can, for all intents and purposes, be considered always available. Physical links, ports or even entire leaf or spine switches of a given packet switched path may fail and take an ESP out of service, but the capacity of the "link” would drop by only a small percentage (depending obviously on how many ESPs are configured between each pair of edge switches).
  • FIG. 7 depicts VSIs 711 , 713, 761 and 763 of an EVPN "a" that has client connections (not shown) at 4 of the Underlay Fabric Edge switches 311 , 313, 361 and 363.
  • EVPN instances may well have VSIs at a smaller or greater number of Underlay Fabric Edges than the depicted EVPN instance "a", depending on the number and distribution of client connections.
  • the overhead of a VSI is very small and Underlay Fabric Edges can support a large number of VSIs.
  • A fully meshed core network can be used to interconnect the root bridges of multiple non-overlapping spanning trees, resulting in a bigger Ethernet network (as described, for example, in Interconnections: Bridges and Routers by Radia Perlman, ISBN 0-201-56332-0, hereby incorporated by reference), provided that the forwarding behaviour of the root bridge is modified from normal learning bridge forwarding to be split horizon forwarding. (With split horizon forwarding, frames arriving on a core full-mesh port are not forwarded on any other core full-mesh port.) It will be realized, though, that in data centers, VSIs at underlay fabric edges may not need to have the full functionality of a root bridge of a spanning tree.
  • VSIs at underlay fabric edges may only need a split horizon forwarding capability because in data centers the connections to the EVPN clients will be connections to end systems (hosts, VMs etc), rather than connections to learning bridges. Thus, a VSI can be realized as little more than a grouping of forwarding table entries.
  • When a VSI 711 of an ingress underlay fabric edge 311 receives an overlay network Ethernet frame in the form of a customer frame (FIG. 3, 500) on a client connection, it first performs the "map" step. This involves determining the egress underlay fabric edge 363 (one of 313, 361 and 363 in FIG. 7) where a peer VSI 763 will forward the customer frame towards its final destination, as designated by the customer MAC destination address of the received customer frame (502). Then the "encap" step is performed for the ingress VSI 711, adding a backbone encapsulation header (FIG. 3, 540) with the destination B-MAC address (542) set to be a MAC address of the egress underlay fabric edge.
  • The source B-MAC address field (544) is set to a MAC address of the ingress underlay fabric edge (311), while the I-component Service Instance Identifier (I-SID) field 548 is set to a community of interest identifier that uniquely identifies the overlay network instance, e.g. EVPN instance, within the underlay fabric and is associated with each of the constituent VSIs (311, 313, 361 and 363 in FIG. 7) of the specific overlay network instance.
  • I-SID I-component Service Instance Identifier
  • The B-VID field 546 of the encapsulation header is set by the Aggregator 631 for SPAG 702, responsive to determining that the destination B-MAC address of the SPAG 702 is the address of the egress underlay fabric edge, i.e. matches the destination B-MAC address 542 of the encapsulation header, and then selecting the B-VID corresponding to one of the member ESPs of the SPAG, as described above.
  • At the egress, the Aggregator 633 uses the I-SID value 548 to select the associated VSI 763, which in turn determines on which client connection the de-encapsulated frame should be further forwarded.
  • The selected VSI must be the one associated with the overlay network instance, e.g. an EVPN instance, to which the received frame belongs.
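  • At the egress the inverse operation might be sketched as follows (again an illustrative assumption, including the attribute names on the VSI object): strip the backbone header, use the I-SID to find the right VSI, and let that VSI pick the client connection from the customer destination MAC.

        from types import SimpleNamespace

        def decap_and_deliver(header, customer_frame, vsi_by_isid):
            """vsi_by_isid: I-SID -> VSI object with a client forwarding table
            mapping customer MAC addresses to client connections (access circuits)."""
            vsi = vsi_by_isid[header["I-SID"]]      # select the overlay instance's VSI
            client_connection = vsi.client_table.get(customer_frame["C-MAC DA"])
            if client_connection is None:
                # Unknown destination: a split-horizon VSI may flood only towards its
                # client connections, never back into the full-mesh core.
                return vsi.client_connections
            return [client_connection]

        # hypothetical VSI with two client connections
        vsi_a = SimpleNamespace(client_table={"00:00:5e:00:53:01": "AC-1"},
                                client_connections=["AC-1", "AC-2"])
        assert decap_and_deliver({"I-SID": 1000},
                                 {"C-MAC DA": "00:00:5e:00:53:01"},
                                 {1000: vsi_a}) == ["AC-1"]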
  • Client connections are also known in the art as access circuits (ACs). The customer frames from distinct
  • the client connection may be embodied as a physical link terminating on a physical port of the underlay fabric edge.
  • the physical port becomes dedicated to the VSI associated with the particular community of interest.
  • client connections must be logical links multiplexed onto a physical link and terminating at the underlay fabric edge as VSI logical ports.
  • Customer frames are associated with VSI logical ports based on some form of multiplexing tag carried in the frame header.
  • One such multiplexing tag would be the S-VID (506 of FIG. 3) of the IEEE 802.1ad, or Provider Bridges, Ethernet frame format.
  • Another such multiplexing tag would be the E-TAG of the IEEE 802.1BR standard when the host computer's Network Interface Card (NIC) assumes the role of an IEEE 802.1BR Bridge Port extender.
  • NIC Network Interface Card
  • Most current ToR switches already embrace IEEE 802.1BR technology. Note that in situations where the host is more than one hop away from the Underlay Fabric Edge Switch (e.g. when the underlay fabric edge functionality is incorporated in the aggregation or second level switches), it might be preferable to use the IEEE 802.1ad S-VID approach, as that approach can exploit spanning tree to provide alternate paths should an intermediate switch fail.
  • PBB Provider Backbone Bridges
  • VSIs Virtual Switch Instances
  • The rationale for PBB was to allow the transport of I-tagged customer Ethernet frames over a Provider Backbone Bridged network, and the 802.1Q I-components of Backbone Edge Bridges can be considered a generalization of Virtual Switch Instances performing the I-tagging and destination BEB determination.
  • Another putative Layer 2 data center underlay fabric is PBB-TE.
  • PBB-TE originated as Provider Backbone Trunks (PBT), which was subsequently standardised as IEEE 802.1Qay Provider Backbone Bridges - Traffic Engineering (PBB-TE) and is now incorporated into the IEEE 802.1Q-2011 standard.
  • PBB-TE Provider Backbone Bridges - Traffic Engineering
  • US Patent 8,194,668 describes an EVPN service (called Virtual Private LAN Service or VPLS in 8,194,668) over ESPs (engineered connections) between BEBs (carrier edge nodes, PE-edge) using VSIs (virtual customer-address-space Ethernet switch instances).
  • IEEE 802.1Qay and US Patent 8,194,668 both lack the establishment of a plurality of ESPs between a pair of BEBs for the purpose of aggregating them into SPAGs for load balancing customer traffic between the pair of BEBs. Lacking the formation of SPAGs means that neither teaches the use of SPAGs in providing a highly resilient full mesh underlay fabric.
  • the aggregation of a plurality of ESPs between BEBs into SPAGs in order to both load balance customer traffic and provide a highly resilient full mesh underlay fabric is a novel aspect of the disclosed embodiments.
  • In Shortest Path Bridging MAC (SPBM), each bridge uses the same standardised set of tie breaking algorithms to select a pre-administered number (between 1 and 16) of shortest paths for which to install forwarding entries, after having assigned a pre-administered distinct B-VID to each of the selected shortest paths.
  • a method that uses a set of tie breaking algorithms to determine the paths of members of all the SPAGs is brittle, inflexible and incompatible with using the lowest possible cost switches in a data center.
  • SPBM is inflexible as it only finds shortest paths and does not permit traffic engineering. If, for example, a 3 level Clos organization were augmented with "cut-through" links between selected leaf switches, then SPBM would consider only paths using the cut-through links when determining shortest paths between members of the pods they belong to.
  • SPBM requires that all the bridges in the system participate in a link state protocol, which requires that bridges have a certain degree of memory and processing power.
  • The standardized set of tie breaking algorithms is only defined to work over a single area, but incorporating thousands or even tens of thousands of switches into a single area of flooded Link State Advertisements would require bridges to have memory and processing power way beyond what is practicable, let alone economic.
  • The overlay networks supported by the traffic engineered load balancing Layer 2 underlay fabric disclosed herein are in no way limited to being EVPN instances.
  • the B-MAC clients (FIG. 4, 622 & 624) of Aggregators 631 and 632 may be any type of MAC client: bridge relay functions, Virtual Switch Instances (VSIs), Routers, Virtual Routing Forwarders, logical ports terminating access circuits and, as will be discussed below, Virtual NICs of VMs.
  • VSIs Virtual Switch Instances
  • the overlay networks that can be realized using the invention could include IP subnets, IP VPNs, Storage Area Networks (SANs) and so on, and different types of overlay network may be supported simultaneously at underlay fabric edges.
  • SANs Storage Area Networks
  • FIG. 8 depicts an example host computer organization for a computer hosting computations for multiple communities of interest.
  • a host computer might have multiple logical links over a single Ethernet physical link connected to a ToR BEB supporting per-community-of-interest VSIs and EVPN instances.
  • the realization of a computation belonging to a particular community of interest is by way of Virtual Machines (VMs) 811, 812, 819.
  • VMs Virtual Machines
  • VMs are complete instances of a total software image that is executable on a computer, wherein multiple independent VMs are supported by a host computer having a base software layer 840 variously called a Hypervisor, a Virtual Machine Monitor (VMM) or "Domain 0".
  • VMs hosted on a computer share the CPUs 802 and one or more Network Interface Cards (NICs) 804 of the host.
  • NICs Network Interface Cards
  • the term "NIC" is used for any logic and hardware drivers of a network port coupled to a computer.
  • vNIC virtual NIC
  • the VMs of each tenant, while sharing the resources of the data center, do not interact with each other. Indeed, it would be a breach of service level agreements and security if one tenant could interfere with another tenant's applications and services.
  • the VMs assigned to a single tenant need to be able to communicate with each other. The most general way that such communication can be realized is to provide each VM with its own virtual NIC (vNIC) 821, 822, 829, with each vNIC having its own (customer) MAC address, as if the VM had exclusive use of a real NIC.
  • C-MAC Customer MAC
  • tagging Ethernet frames leaving a vNIC with an identifier specific to the tenant or community of interest permits the same C-MACs to be used by different tenants, whether accidentally or deliberately, while also providing the separation of tenants required in a data center.
  • this identifier might be an I-SID (546 in FIG. 3) in the core of the network, while being an S-VID (506) on the client connection between a vNIC and a VSI of a BEB.
  • the Ethernet switching functionality implemented within the Hypervisor, herein denoted the Host switch or Hswitch 850, is of interest for the present invention.
  • the function of the Hswitch is to tag and direct Ethernet frames from the vNICs 851, 852, 859 over a hardware NIC 804, and to bridge Ethernet frames between local VMs that belong to the same community of interest (a minimal sketch of this tagging and local bridging behaviour appears after this list).
  • an Hswitch may be an I-BEB (a Backbone Edge Bridge comprising an I-Component), with the ToR switch it is connected to being a B-BEB (a Backbone Edge Bridge comprising a B-Component), and the link between Hswitch and ToR switch may then be an I-TAG Boundary LAN.
  • each host has a single NIC attached to a ToR switch, on the basis that the likelihood of a communications failure that renders the host unusable is of the same order of magnitude as other host failures.
  • a host may have two NICs 804 dual homed to different ToR switches for reliability or performance reasons.
  • VMs' vNICs may be assigned to one of the two NICs, with all vNICs being re-assigned to the other of the two NICs in the case of failure (a sketch of this failover policy appears after this list).
  • the NICs could be assigned to different types of communication, typically storage operations versus all other traffic.
  • Split MLT, as described above, could be deployed between pairs of ToR switches, so that the Hswitch can treat a plurality of NICs as a single (aggregated) link.
  • VMs are not the only realization for the sharing of resources of a data center amongst multiple tenants.
  • Tenants can "rent" complete host computers, VMs and, increasingly, Containers, for carrying out computations.
  • Containers, being a reincarnation of the user processes of time-sharing systems, are a lighter-weight realization of VM functionality, in which each tenant's programs on a host share a common operating system with all other programs on the host.
  • a tenant's programs can access private or virtualized instances of shared services, such as Network Functions Virtualization (NFV) firewalls for handling their external Internet traffic, and block storage controllers for accessing the permanent storage that the tenant has also rented.
  • NFV Network Functions Virtualization
  • the term "Domain" is used to denote any collection of tightly coupled resources used for performing a single- or multi-threaded computation at a host. This is an old usage of the term "Domain".
  • although the open source XEN hypervisor calls its virtual machines "domains", in this specification "Domain" is used henceforth to include any instance of a VM, a Container, or a virtualized service that can be dedicated to a single community of interest.
  • The system that dispatches Domains to hosts will be called a Domain Controller.
  • a data center tenant, normally needing communications, computing and storage facilities, will be assigned one or more Domains that provide these facilities, free from interference from other tenants' computations and, to a first order, unaffected by other tenants' resource requests.
  • the set of Domains of a tenant constitute a community of interest that will, in many embodiments, communicate directly with each other using a single dedicated overlay network instance, either an EVPN (a Layer 2 VPN) or an IP VPN (a Layer 3 VPN).
  • the set of a tenant's Domains together with the overlay network dedicated to their inter-communications constitute a "sandbox", so called because a tenant is free to do what she likes within her sandbox, but is severely constrained in any interactions between domains within the sandbox and any services outside the sandbox.
  • the overlay VPN instance dedicated to a sandbox will have an administered VPN identifier that could also be used as an owner (or principal) identifier for the sandbox's Domains.
  • the VPN identifier would either be directly an I-SID or be mappable to an I-SID. Note that, depending on the nature of the applications or services a tenant wants to realize, the applications or services may be separated into multiple sandboxes, each with its own associated VPN identifier.
  • each ToR switch may comprise a set of Virtual Switch Instances (VSIs) as B-MAC Clients 622, 624 of the Aggregators 631, 633 that constitute the edges of the full mesh SPAG underlay fabric shown in FIG. 7.
  • Each VSI is associated with a distinct I-SID identifying an EVPN instance dedicated to a sandbox and has logical ports linked to those of the sandbox's Domains that are located on the host computers subtending off of the ToR switch.
  • An alternative choice for the location of underlay fabric edge functionality is the Hswitch.
  • ToR switches would require only the simple functionality of BCBs; that is, the ToR switches would not require any extra capabilities related to PBB encapsulation, SPAG Aggregators or VSIs. Instead, these capabilities can be introduced piecemeal into the Host computer Hypervisor and/or Operating System implementations of Hswitches as the data center operator moves towards realizing a SPAG-based Layer 2 underlay fabric over more and more of its infrastructure.
  • the number of B-MAC addresses that the core switches (the BCBs) potentially have to deal with will be one or two orders of magnitude greater than if the edge switch functionality were exclusively the preserve of ToR switches. Installing all the ESP mappings needed to achieve full mesh multi-member SPAG connectivity with all other Hswitches may become impractical due to capacity limitations in core switch forwarding entry table space (an illustrative back-of-the-envelope calculation appears after this list).
  • the first technique is to divide the hosts into virtual pods and assign sandboxes to individual virtual pods.
  • Virtual pods can be of arbitrary size and shrink or grow as needed. Since all Layer 2 inter-communication of Domains within sandboxes stays within a respective sandbox, the Underlay Fabric Controller (UFC) needs only to establish a full mesh of SPAGs per virtual pod, i.e. between the Hswitches of the hosts belonging to the virtual pod (a sketch of this per-pod full mesh appears after this list).
  • UFC Underlay Fabric Controller
  • virtual pods may be dedicated to particular types of sandboxes, e.g. sandboxes where all the Domains are Containers.
  • the second technique requires that a Domain Controller notify the UFC of the Hswitch's B-MAC address and the sandbox's I-SID when it installs a domain of the sandbox at the Hswitch's host.
  • the UFC can determine which pairs of Hswitches will potentially forward frames to each other, given the current assignment of domains to hosts. This allows the UFC to install only the minimum number of ESP forwarding entries required to realize SPAGs that are actually of potential use to the overlay networks installed by clients in their current locations (a notification-driven sketch of this appears after this list).
  • Domain Controllers may also be responsible for migrating running domains to a new Host, as first described by Casey et al. in the paper 'Domain Structure for Distributed Computer Systems', published in the Proceedings of the 6th Symposium on Operating Systems Principles in the ACM Operating Systems Review, Vol 11 No 5, Nov. 1977, hereby incorporated by reference.
  • in the Domain migration process, either the Domain is migrated to a host for which there is already an operational SPAG between that host and each of the hosts serving the other Domains of its sandbox, or new SPAGs have to be established so as to provide full mesh connectivity.
  • the overlay networks are controlled using the Locator/Identifier Separation Protocol (LISP) as described in Internet Draft draft-maino-nvo3-lisp-cp-03, LISP Control Plane for Network Virtualization Overlays, by F. Maino et al., 18 Oct 2013, hereby incorporated by reference.
  • LISP Locator/Identifier Separation Protocol
  • the LISP mapping database holds mappings from sandbox-specific addresses (MAC addresses for EVPN overlay networks, and IPv4 addresses and/or IPv6 addresses for Layer 3 overlay networks) to destination Hswitch B-MAC addresses.
  • the Domain Controller can consult these mappings for a list of all Hswitch B-MAC addresses currently associated with a specific I-SID, i.e. with a specific sandbox.
  • the Domain Controller might first try to determine if any of the Hswitches' hosts are suitable for receiving the migrating Domain (see Casey et al. referenced above). Otherwise, the Domain Controller may develop a short list of B-MAC addresses of hosts that, according to its criteria, are suitable for receiving the migrating Domain, and may send the short list to the UFC.
  • the UFC could then choose one host from the short list based on criteria such as minimising the number of new [B-VID, B-MAC] ESP forwarding entries that would need to be installed.
  • the UFC may also try to avoid hosts for which the local links are more congested than others (assuming that the UFC has a background activity of collecting traffic statistics). Once the UFC has made its choice, the UFC would notify the Domain Controller (a sketch of this mapping consultation and host selection appears after this list).
  • SPAGs are not limited in scope to single data centers. Rather, when there are multiple communication links between a group of data centers, packet switched paths could be constructed that span between data centers, and the inter-data-center paths could be aggregated into resilient, load spreading inter-data-center SPAGs.
  • the multiple communication links between two data centers may comprise wavelength channels (sometimes called "lambdas") in the same owned fiber, multiple fibers (preferably diversely routed) or a purchased service such as MPLS pseudo-wires or Metro Ethernet E-Lines.
  • the packet switched paths may be homogeneous (e.g. Ethernet bridged both within the data centers and between data centers), or they could be heterogeneous, with Ethernet in the data centers and LSPs between the data centers.
  • inter-data-center SPAGs could be used in the realization of a single Layer 2 underlay fabric but, given the difference in bandwidth cross-section within a data center compared to that between data centers, it would likely be advantageous to deploy virtual pods with only a small number of virtual pods having hosts in more than one data center.
  • inter-data-center virtual pods, when combined with domain migration mechanisms, provide a method for the orderly transfer of the complete live load of a data center to other data centers (as might be required when, for example, a typhoon is forecast to hit the originating data center).
  • a virtual pod is created at the receiving data center and merged with a virtual pod at the originating data center; then, after the full mesh of SPAGs is installed for the merged virtual pod, individual domains can be migrated from the originating data center to the receiving data center while still maintaining inter-communication with the other members of their sandboxes. Once all the domains of all the sandboxes assigned to the virtual pod have been migrated to the receiving data center, the hosts at the originating data center can be removed from the virtual pod, with the reclamation of the ESPs of the SPAGs between them and between them and the newly formed receiving data center virtual pod (the overall sequence is sketched after this list).
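The sketches below, referenced from the list above, are illustrative Python fragments only: every class, function and variable name in them is an invention for the purpose of illustration, and none of them reproduces an API defined by this specification or by any standard.

First, a toy illustration of the SPBM-style idea of selecting a pre-administered number of equal-cost paths and pairing each selected path with a distinct, pre-administered B-VID. The ranking key used here is a deliberately simple stand-in; the tie-breaking algorithms standardised in IEEE 802.1aq operate on bridge identifiers and are considerably more involved.

from typing import List, Tuple

# A path is a tuple of bridge identifiers from source to destination.
Path = Tuple[str, ...]

def select_paths(candidates: List[Path], bvid_pool: List[int]) -> List[Tuple[int, Path]]:
    """Pick up to len(bvid_pool) equal-cost paths deterministically and pair
    each selected path with a pre-administered, distinct B-VID."""
    # Illustrative tie-break: lexicographic order of the sorted bridge IDs on the path.
    ranked = sorted(candidates, key=lambda p: tuple(sorted(p)))
    chosen = ranked[: len(bvid_pool)]
    return list(zip(bvid_pool, chosen))

if __name__ == "__main__":
    # Three equal-cost ToR-to-ToR paths through different spine bridges.
    candidates = [
        ("tor1", "leaf1", "spine1", "leaf4", "tor9"),
        ("tor1", "leaf2", "spine3", "leaf4", "tor9"),
        ("tor1", "leaf1", "spine2", "leaf3", "tor9"),
    ]
    for bvid, path in select_paths(candidates, bvid_pool=[101, 102]):
        print(f"B-VID {bvid}: {' -> '.join(path)}")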
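A minimal sketch of the Hswitch behaviour described above, under the assumption that each vNIC is bound to exactly one community of interest: a frame whose destination C-MAC belongs to a local vNIC of the same community is bridged locally, and any other frame is tagged with the community identifier (shown here as an I-SID; on an S-tagged client link the corresponding S-VID would be used) and sent out over the hardware NIC.

from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class Frame:
    dst_cmac: str
    src_cmac: str
    payload: bytes
    isid: Optional[int] = None   # set once the frame is tagged for the fabric

@dataclass
class Hswitch:
    # vNIC name -> (community I-SID, C-MAC of the vNIC)
    vnics: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def _local_cmacs(self, isid: int) -> Dict[str, str]:
        """C-MAC -> vNIC name, restricted to one community of interest."""
        return {cmac: name for name, (i, cmac) in self.vnics.items() if i == isid}

    def handle_from_vnic(self, vnic: str, frame: Frame) -> str:
        isid, _ = self.vnics[vnic]
        local = self._local_cmacs(isid)
        if frame.dst_cmac in local:
            # Bridge locally between VMs of the same community of interest.
            return f"deliver to local vNIC {local[frame.dst_cmac]}"
        # Tag with the community identifier and forward over the hardware NIC.
        frame.isid = isid
        return f"send on NIC, tagged with I-SID {isid}"

if __name__ == "__main__":
    hs = Hswitch(vnics={"vnic1": (2001, "aa:01"), "vnic2": (2001, "aa:02"),
                        "vnic3": (3001, "bb:01")})
    print(hs.handle_from_vnic("vnic1", Frame("aa:02", "aa:01", b"hi")))   # bridged locally
    print(hs.handle_from_vnic("vnic1", Frame("cc:09", "aa:01", b"hi")))   # tagged, off-host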
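A sketch of the dual-homed failover policy mentioned above: vNICs are initially spread over the host's two NICs, and on failure of one NIC every vNIC assigned to it is re-homed onto the surviving NIC.

from typing import Dict, List

def assign_vnics(vnics: List[str], nics: List[str]) -> Dict[str, str]:
    """Initial assignment: spread the vNICs round-robin across the host's NICs."""
    return {v: nics[i % len(nics)] for i, v in enumerate(vnics)}

def fail_over(assignment: Dict[str, str], nics: List[str], failed_nic: str) -> Dict[str, str]:
    """Re-home every vNIC that was on the failed NIC onto a surviving NIC."""
    survivor = next(n for n in nics if n != failed_nic)
    return {v: (survivor if n == failed_nic else n) for v, n in assignment.items()}

if __name__ == "__main__":
    nics = ["nic0", "nic1"]
    before = assign_vnics(["vnic1", "vnic2", "vnic3"], nics)
    print(before)                           # vnic1, vnic3 on nic0; vnic2 on nic1
    print(fail_over(before, nics, "nic0"))  # everything now on nic1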
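A back-of-the-envelope calculation of the scaling pressure just described. The figures used (500 racks, 40 hosts per rack, 4 ESPs per SPAG) are assumptions chosen purely for illustration and do not come from the specification.

# Illustrative numbers only; none of these figures are taken from the specification.
racks, hosts_per_rack = 500, 40
paths_per_spag = 4                       # ESPs aggregated into each SPAG

tor_edges = racks                        # edge B-MACs when ToR switches are the BEBs
hswitch_edges = racks * hosts_per_rack   # edge B-MACs when every Hswitch is a BEB
print(f"edge B-MAC growth factor: {hswitch_edges // tor_edges}x")

def full_mesh_esps(n_edges: int) -> int:
    """ESPs needed for a full mesh of multi-member SPAGs over n_edges edges."""
    return n_edges * (n_edges - 1) // 2 * paths_per_spag

print(f"ToR-edge full mesh ESPs:     {full_mesh_esps(tor_edges):,}")
print(f"Hswitch-edge full mesh ESPs: {full_mesh_esps(hswitch_edges):,}")

With these illustrative numbers the edge population grows forty-fold and the full-mesh ESP count grows from roughly half a million to roughly eight hundred million, which is why the two mitigation techniques described above are of interest.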
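A sketch of the per-virtual-pod full mesh discussed above: for every unordered pair of underlay fabric edges (Hswitch B-MACs) in a virtual pod, the UFC installs a small set of ESPs, each nominally identified here by a distinct B-VID, and aggregates them into one SPAG for that pair. The ESP naming scheme is invented for illustration.

from itertools import combinations
from typing import Dict, List, Set, Tuple

def spags_for_virtual_pod(hswitch_bmacs: Set[str],
                          paths_per_spag: int = 4) -> Dict[Tuple[str, str], List[str]]:
    """For every unordered pair of edges in the pod, return the ESPs that
    would be installed and then aggregated into the pair's SPAG."""
    spags: Dict[Tuple[str, str], List[str]] = {}
    for a, b in combinations(sorted(hswitch_bmacs), 2):
        spags[(a, b)] = [f"esp:{a}<->{b}:bvid{bvid}"
                         for bvid in range(1, paths_per_spag + 1)]
    return spags

if __name__ == "__main__":
    pod = {"hsw-0a", "hsw-0b", "hsw-0c"}
    for pair, esps in spags_for_virtual_pod(pod, paths_per_spag=2).items():
        print(pair, esps)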
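A notification-driven sketch of the second technique: the UFC records which edges host domains of which sandbox (I-SID), and installs a SPAG between a pair of edges only once that pair can actually exchange some sandbox's traffic. The method name notify_domain_installed is an invented stand-in for whatever interface a Domain Controller would actually use.

from collections import defaultdict
from typing import Dict, Set, Tuple

class UnderlayFabricController:
    """Tracks sandbox (I-SID) placement per edge and installs SPAGs lazily."""

    def __init__(self) -> None:
        self.edges_by_isid: Dict[int, Set[str]] = defaultdict(set)
        self.installed_spags: Set[Tuple[str, str]] = set()

    def notify_domain_installed(self, hswitch_bmac: str, isid: int) -> None:
        """Called when a Domain Controller places a domain of sandbox `isid`
        on the host behind `hswitch_bmac`."""
        peers = self.edges_by_isid[isid] - {hswitch_bmac}
        self.edges_by_isid[isid].add(hswitch_bmac)
        for peer in peers:
            a, b = sorted((hswitch_bmac, peer))
            if (a, b) not in self.installed_spags:
                self.installed_spags.add((a, b))
                print(f"install SPAG (and its ESP forwarding entries) between {a} and {b}")

if __name__ == "__main__":
    ufc = UnderlayFabricController()
    ufc.notify_domain_installed("hsw-1", isid=2001)
    ufc.notify_domain_installed("hsw-2", isid=2001)   # first SPAG is installed here
    ufc.notify_domain_installed("hsw-2", isid=3001)   # new sandbox, no new SPAG needed yet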
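A sketch of the mapping consultation and host selection described above, assuming a LISP-style mapping database keyed by (I-SID, client address). The selection criteria shown (fewest new SPAGs needed, then least loaded local links) follow the text; the data structures and names are otherwise invented for illustration.

from typing import Dict, List, Set, Tuple

# (I-SID, client address) -> B-MAC of the Hswitch currently serving it.
MappingDB = Dict[Tuple[int, str], str]

def hswitches_for_sandbox(db: MappingDB, isid: int) -> Set[str]:
    """All Hswitch B-MACs currently associated with one sandbox (I-SID)."""
    return {bmac for (i, _addr), bmac in db.items() if i == isid}

def choose_destination(shortlist: List[str],
                       installed_spags: Set[Tuple[str, str]],
                       sandbox_edges: Set[str],
                       link_load: Dict[str, float]) -> str:
    """Prefer the short-listed host needing the fewest new SPAGs towards the
    sandbox's other edges, breaking ties on local link load."""
    def new_spags_needed(candidate: str) -> int:
        return sum(1 for peer in sandbox_edges - {candidate}
                   if tuple(sorted((candidate, peer))) not in installed_spags)
    return min(shortlist, key=lambda c: (new_spags_needed(c), link_load.get(c, 0.0)))

if __name__ == "__main__":
    db: MappingDB = {(2001, "aa:01"): "hsw-1", (2001, "aa:02"): "hsw-2"}
    edges = hswitches_for_sandbox(db, 2001)
    pick = choose_destination(["hsw-3", "hsw-4"],
                              installed_spags={("hsw-1", "hsw-3"), ("hsw-2", "hsw-3")},
                              sandbox_edges=edges,
                              link_load={"hsw-3": 0.7, "hsw-4": 0.2})
    print(pick)   # hsw-3: no new SPAGs needed, despite its higher link load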
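Finally, a sketch of the orderly inter-data-center transfer sequence described in the last item above: merge the originating and receiving virtual pods, ensure the full mesh of SPAGs over the merged pod, migrate the domains, then shrink the pod back to the receiving data center and reclaim the ESPs of the SPAGs that are no longer needed. The print statements stand in for UFC and Domain Controller operations.

from itertools import combinations
from typing import Dict, List, Set

def evacuate_data_center(originating_pod: Set[str],
                         receiving_pod: Set[str],
                         domains_by_host: Dict[str, List[str]]) -> None:
    merged = originating_pod | receiving_pod

    # 1. Full mesh of (possibly inter-data-center) SPAGs for the merged pod.
    for pair in combinations(sorted(merged), 2):
        print(f"ensure SPAG {pair}")

    # 2. Migrate every domain off the originating hosts; the merged mesh keeps
    #    each domain in touch with the rest of its sandbox throughout.
    targets = sorted(receiving_pod)
    for i, host in enumerate(sorted(originating_pod)):
        for dom in domains_by_host.get(host, []):
            print(f"migrate {dom}: {host} -> {targets[i % len(targets)]}")

    # 3. Shrink the pod back to the receiving data center and reclaim ESPs.
    for pair in combinations(sorted(merged), 2):
        if not set(pair) <= receiving_pod:
            print(f"reclaim ESPs of SPAG {pair}")

if __name__ == "__main__":
    evacuate_data_center({"dc1-hsw1", "dc1-hsw2"},
                         {"dc2-hsw1", "dc2-hsw2"},
                         {"dc1-hsw1": ["dom-a"], "dc1-hsw2": ["dom-b", "dom-c"]})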

Abstract

Techniques are disclosed for forwarding packets across a hierarchical organization of switches forming a network for the purpose of interconnecting a large number of client systems. The network comprises at least two tiers of packet switching elements, each switching element of one tier being connected to switching elements of another tier. The method may be carried out by a control system that is separate from the packet switching elements. A topology of the network is acquired, and one or more paths are computed between respective pairs of packet switching elements of one tier via at least one packet switching element of another tier. Forwarding state is installed in each packet switching element traversed by a path such that packets can be forwarded over the path. The paths are analysed to find paths that can be aggregated, and at least two paths are aggregated into a switched path aggregation group.
PCT/CA2014/051121 2013-11-26 2014-11-25 Agrégation de chemins commutés pour les centres de données WO2015077878A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361909054P 2013-11-26 2013-11-26
US61/909,054 2013-11-26

Publications (1)

Publication Number Publication Date
WO2015077878A1 true WO2015077878A1 (fr) 2015-06-04

Family

ID=53198144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2014/051121 WO2015077878A1 (fr) 2013-11-26 2014-11-25 Agrégation de chemins commutés pour les centres de données

Country Status (1)

Country Link
WO (1) WO2015077878A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010037421A1 (fr) * 2008-10-02 2010-04-08 Telefonaktiebolaget Lm Ericsson (Publ) Émulation de diffusion de trames ethernet

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AL-FARES ET AL.: "A Scalable, Commodity Data Center Network Architecture", PROC. SIGCOMM'08, 17 August 2008 (2008-08-17), pages 63 - 74, XP058098076, Retrieved from the Internet <URL:http://ccr.sigcomm.org/online/files/p63-alfares.pdf> DOI: 10.1145/1402958.1402967 *
MORRIS, STEPHEN B.: "MPLS and Ethernet: Seven Things You Need To Know", INFORMIT, 17 December 2004 (2004-12-17), pages 1 - 9, Retrieved from the Internet <URL:http://www.informit.com/articles/article.aspx?p=357100> *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10270645B2 (en) * 2014-07-21 2019-04-23 Big Switch Networks, Inc. Systems and methods for handling link aggregation failover with a controller
US20160020939A1 (en) * 2014-07-21 2016-01-21 Big Switch Networks, Inc. Systems and methods for handling link aggregation failover with a controller
WO2017101230A1 (fr) * 2015-12-18 2017-06-22 华为技术有限公司 Procédé de sélection de routage pour réseau de centre de données et gestionnaire de réseau
CN106899478B (zh) * 2017-03-23 2023-09-01 国网浙江省电力公司 电力测试业务通过云平台实现资源弹性扩展的方法
CN106899478A (zh) * 2017-03-23 2017-06-27 国网浙江省电力公司 电力测试业务通过云平台实现资源弹性扩展的方法
US10644895B1 (en) 2018-10-26 2020-05-05 Cisco Technology, Inc. Recovering multicast data traffic during spine reload in software defined networks
CN109743266A (zh) * 2019-01-22 2019-05-10 上海宽带技术及应用工程研究中心 基于胖树结构的sdn交换网络
US20210385163A1 (en) * 2019-02-27 2021-12-09 Huawei Technologies Co., Ltd. Packet processing method, packet forwarding apparatus, and packet processing apparatus
US11683272B2 (en) * 2019-02-27 2023-06-20 Huawei Technologies Co., Ltd. Packet processing method, packet forwarding apparatus, and packet processing apparatus
US11398956B2 (en) 2020-07-16 2022-07-26 Cisco Technology, Inc. Multi-Edge EtherChannel (MEEC) creation and management
CN114363272A (zh) * 2020-09-27 2022-04-15 华为技术有限公司 一种交换机的配置方法及相关设备
CN114363272B (zh) * 2020-09-27 2023-03-31 华为技术有限公司 一种交换机的配置方法及相关设备
CN113824781A (zh) * 2021-09-16 2021-12-21 中国人民解放军国防科技大学 一种数据中心网络源路由方法与装置
CN113824781B (zh) * 2021-09-16 2023-10-31 中国人民解放军国防科技大学 一种数据中心网络源路由方法与装置

Similar Documents

Publication Publication Date Title
EP3020164B1 (fr) Support pour segments d'un réseau local extensible virtuel dans plusieurs sites d'un centre de données
US8948181B2 (en) System and method for optimizing next-hop table space in a dual-homed network environment
WO2015077878A1 (fr) Agrégation de chemins commutés pour les centres de données
EP3253009B1 (fr) Procédé et système pour prendre en charge des opérations de protocole de commande à relais distribué (drcp) lors d'une défaillance de communication
EP3692685B1 (fr) Commande à distance de tranches de réseau dans un réseau
US9553798B2 (en) Method and system of updating conversation allocation in link aggregation
US9225549B2 (en) Multi-chassis link aggregation in a distributed virtual bridge
KR20210060483A (ko) 네트워크 컴퓨팅 환경에서 제 1 홉 게이트웨이 이중화
EP3430774B1 (fr) Procédé et appareil aptes à prendre en charge un transfert bidirectionnel (bfd) sur un groupe d'agrégation de liaison multi-châssis (mc-lag) dans des réseaux de protocole internet (ip)
US20170026299A1 (en) Method and system of implementing conversation-sensitive collection for a link aggregation group
WO2017099971A1 (fr) Interconnexion de commutateurs basés sur la tunnellisation de superposition hiérarchique
US20080080535A1 (en) Method and system for transmitting packet
US11663052B2 (en) Adaptive application assignment to distributed cloud resources
CA2747007A1 (fr) Evolution de réseaux ethernet
MX2007008112A (es) Metodo para ejecutar una red sin conexion como una red de conexion orientada.
KR20150013612A (ko) 802.1aq에 대한 3 스테이지 폴딩된 clos 최적화
CN108141392B (zh) 伪线负载分担的方法和设备
WO2018065813A1 (fr) Procédé et système de distribution de trafic virtuel de couche 2 vers de multiples dispositifs de réseau d'accès
US20220311643A1 (en) Method and system to transmit broadcast, unknown unicast, or multicast (bum) traffic for multiple ethernet virtual private network (evpn) instances (evis)
Bruschi et al. A scalable SDN slicing scheme for multi-domain fog/cloud services
US20160006511A1 (en) Metro-core network layer system and method
WO2018220426A1 (fr) Procédé et système de traitement de paquets d'une fonction de réseau virtuel (vnf) distribuée
Nadeem et al. A survey of cloud network overlay protocols
Tu Cloud-scale data center network architecture
Shahrokhkhani An Analysis on Network Virtualization Protocols and Technologies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14866415

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 06/09/2016)

122 Ep: pct application non-entry in european phase

Ref document number: 14866415

Country of ref document: EP

Kind code of ref document: A1