A SYSTEM AND A METHOD FOR SWITCHING DATA PACKETS OR FRAMES USING MULTIPLE SWITCHING ELEMENTS
The purpose of the invention is to build switch systems out of multiple, preferably monolithic, switching elements, so as to obtain a high utilization of the links between the elements.
The invention comprises multiple stepwise improvements over standard non-meshed architectures.
When building switch systems of multiple (such as monolithic) switching elements, it is necessary to provide a method of preventing loops. Well-known ways to do so are to avoid loops in the architecture altogether, or to use the Spanning Tree Protocol (STP) to block selected links, thus removing potential loops. Both approaches imply sub-optimal use of the links between the switching elements.
The invention provides ways to build systems with a better throughput than similar loop-free architectures and STP based systems.
In a first aspect, the invention relates to a system for switching data packets or frames, the system comprising three or more switching elements each having a number of uplinks and downlinks each being an input/output adapted to receive and output data packets or frames, where: the switching elements are interconnected so that one or more uplinks of each switching element is connected to one or more uplinks of each of the other switching elements, and the switching elements comprise means for preventing transfer of data packets or frames between uplinks thereof.
In the present context, a switching element may be a single switch chip or an assembly of chips providing switching ability. The chip(s) may be provided with physical layer elements (PHY's) or the like and may even be provided in separate boxes etc.
The uplinks and downlinks will normally be standard I/O ports of the switching elements. These ports may have the same or different data rates depending on the bandwidth requirements in different parts of the system. The splitting up of the ports into uplinks and downlinks is merely determined by the use of the links/ports. Thus, an uplink may be converted into a downlink by disconnecting it from another switching element of the system and e.g. connecting it to an outside network or the like. The downlinks may be used as standard switch ports for interconnecting computers, networks, the WWW or any other networking element(s).
The interconnecting of the uplinks may be via PHY's or simply via a PCB - depending on the actual set-up of the switching elements. Naturally, a combination of interconnecting methods may be used.
In standard switches, data may be transferred between any pair of their ports. This would give rise to looping, which is why the preventing means are provided.
In this first aspect, preferably, the preventing means are adapted to, on the basis of an identity of an uplink or downlink receiving a data packet or frame, determine which uplink(s) and/or downlink(s) the data packet or frame may be transferred to. This could be a more or less hard-coded (or hardwired) approach, which would make the chip simpler. However, bandwidth limitations and oversubscription may be a problem.
In a second aspect, the invention relates to a system for switching data packets or frames, the system comprising three or more switching elements each having a number of uplinks and downlinks being inputs/outputs adapted to receive and output data packets or frames, where: the switching elements are interconnected so that one or more uplinks of each switching element is connected to one or more uplinks of each of the other switching elements, wherein each switching element comprises means for enabling transfer of data packets or frames only between downlinks and between uplinks and downlinks.
Thus, transfer of data packets or frames is only enabled between downlinks and between uplinks and downlinks. When uplink-to-uplink transfer is not enabled in the first place (such as when the switching element is manufactured without that ability), preventing means as seen in the first aspect are not required.
In the above two aspects, at least two switching elements may be interconnected by a plurality of uplinks, at least one of the two switching elements comprising means for determining to which of the plurality of uplinks to transfer a data packet or frame destined for the other of the at least two switching elements.
The number of uplinks connected at a given position in the system will depend on the bandwidth requirements at that position. If the switching elements are provided with ports with different data rates, ports may also be interchanged in order to adapt to bandwidth requirement changes.
IEEE requires, for Ethernet traffic, that packets relating to a data flow may not overtake each other. In that situation, it will be desired that the determining means will ensure that all packets in a data flow use the same uplink within the system. This may be ensured by determining the uplink ID using e.g. source or destination addresses (MAC or IP).
Another, simpler, manner is one where the determining means are adapted to determine the uplink on the basis of an identity of a downlink having received the packet or frame. Thus, only one or a number of identified downlinks (but preferably not all) will transfer data packets or frames (destined for the other switching element) via a given uplink.
This would also maintain the ordering of packets within a data flow. Also, the theoretical over subscription of this system may, in fact, be better than e.g. a hashing based solution where, theoretically, all incoming data packets or frames could be determined to be transferred on the same uplink.
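By way of a non-limiting illustration, the two ways of determining the uplink described above may be sketched in Python as follows. The function names, the use of CRC32 as the hash, and the modulo mapping from downlink index to uplink are assumptions made for the example only and are not prescribed by the invention.

```python
import zlib


def uplink_by_hash(src_addr: str, dst_addr: str, k: int) -> int:
    """Pick one of k parallel uplinks from a hash of the address pair.

    All packets of one flow (same source and destination address) map to
    the same uplink, so ordering within the flow is kept. CRC32 is an
    assumed hash chosen for the example.
    """
    key = f"{src_addr}-{dst_addr}".encode()
    return zlib.crc32(key) % k


def uplink_by_downlink(downlink_index: int, k: int) -> int:
    """Pick one of k parallel uplinks from the receiving downlink's index.

    The modulo mapping is an assumption; any fixed assignment of downlinks
    to uplinks would keep the ordering within a flow.
    """
    return downlink_index % k
```

With the downlink-based selection, the number of downlinks that can load a given uplink is fixed by the assignment, whereas a hash could, in theory, place every flow on the same uplink.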
In a third aspect, the invention relates to a method of switching data or packets, the method comprising:
providing three or more switching elements each having a number of uplinks and downlinks each being adapted to receive and output data packets or frames, interconnecting one or more uplinks of each switching element to one or more uplinks of each of the other switching elements, and preventing, in each switching element, transfer of data packets or frames between uplinks thereof.
Again, the preventing step may comprise determining, on the basis of an identity of an uplink or downlink receiving a data packet or frame, which uplink(s) and/or downlink(s) the data packet or frame may be transferred to.
A fourth aspect of the invention relates to a method of switching data or packets, the method comprising:
providing three or more switching elements each having a number of uplinks and downlinks each being adapted to receive and output data packets or frames, interconnecting one or more uplinks of each switching element to one or more uplinks of each of the other switching elements, and enabling, in each switching element, transfer of data packets or frames only between downlinks and between uplinks and downlinks.
In the third and fourth aspects, preferably the providing step comprises providing at least two switching elements interconnected by a plurality of uplinks, the method further comprising the step of determining to which of the plurality of uplinks to transfer a data packet or frame received by one of the at least two switching elements and being destined for the other of the at least two switching elements.
Again, preferably the determining step comprises determining the uplink on the basis of an identity of a downlink having received the packet or frame.
In the following, preferred embodiments will be described with reference to the drawing, wherein:
Fig. 1 illustrates the overall functionality of the invention - using only three switching elements with a reduced set of I/O ports,
Fig. 2 illustrates the functionality of Fig. 1 now scaled to more switching elements having more I/O ports,
Fig. 3 illustrates link aggregation between switching elements, and
Fig. 4 illustrates forwarding from uplinks and downlinks, respectively, in a switching element for use in accordance with the invention.
The first step of the invention presents a way to build switch systems of rings or meshes of monolithic switches, preventing loops through static limitations to the forwarding within the switching elements.
The basic system consists of three 4-port switching elements, interconnected in a ring by single links (FIG 1). Each of the three switching elements provides two switch ports to be exposed as ports to the switch system, in the following called downlink ports; the other two ports are used for the interconnections, in the following called uplink ports.
This method prevents loops by disallowing forwarding of traffic between the uplinks, while allowing forwarding between uplinks and downlinks, and between the downlinks. The method relies on the fact that traffic from any downlink in the system can reach any other downlink in the system by traversing at most one uplink; thus, no switching element is required to forward traffic between two uplinks. And by prohibiting the forwarding of traffic between the uplinks, no traffic can perform a loop. The Spanning Tree Protocol need not be, and preferably is not, applied to the uplinks; if any of the uplinks is blocked, the system will lose part of its connectivity.
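The forwarding restriction described above may, as a minimal sketch, be expressed as a predicate on the ingress and egress port. The port names ("d0", "u0", and so on) and the function names are hypothetical and serve only to illustrate the 4-port switching element of FIG 1.

```python
def is_uplink(port: str) -> bool:
    """Hypothetical naming: uplinks are 'u0', 'u1', ...; downlinks 'd0', 'd1', ..."""
    return port.startswith("u")


def may_forward(ingress: str, egress: str) -> bool:
    """Allow every forwarding except back out the ingress port and uplink to uplink."""
    if ingress == egress:
        return False
    return not (is_uplink(ingress) and is_uplink(egress))


# The 4-port element of the basic system: two downlinks and two uplinks.
ports = ["d0", "d1", "u0", "u1"]
allowed = {p: [q for q in ports if may_forward(p, q)] for p in ports}
```

Here `allowed["d0"]` contains the other downlink and both uplinks, while `allowed["u0"]` contains only the two downlinks.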
Learning in the system can be based on automatic learning locally in the switching elements. When a packet arrives on a downlink port on a switching element (A), destined for an unknown address (Y), the packet is flooded to all other ports on the switching element (A), including the uplink ports, and the location of the source address (X) is learned in the switching element (A). The other switching elements (B and C) will receive the packet on the uplink from A, and flood it to all downlink ports, but not to the other uplink port. B and C will learn the source address X on the uplink from A. The packet has now been flooded to all downlink ports in the system and has thus reached Y, e.g. located on a downlink port on B. When Y returns a packet to X, B receives it on the downlink port and forwards it to the uplink to A, where it has learned X. A receives it and forwards it to the downlink port where it has learned X. And like before, B and A now learn the location of Y. C will, in time, learn the position of Y, if a packet from Y is forwarded to C, either from flooding to an unknown address or from forwarding to an address that has been learned on the uplink to C.
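The learning sequence described above may be traced with the following small simulation of three switching elements A, B and C in a ring. It is a sketch under stated assumptions: the class `Element`, the function `send` and the port names ("u:B" for the uplink to B, "d:0" for a downlink) are hypothetical, and ageing of learned addresses is not modelled.

```python
class Element:
    """One switching element with local learning (hypothetical model)."""

    def __init__(self, name, peers, downlinks):
        self.name = name
        self.uplinks = {f"u:{p}" for p in peers}          # one uplink per peer
        self.downlinks = {f"d:{i}" for i in range(downlinks)}
        self.table = {}                                   # address -> port

    def receive(self, port, src, dst):
        """Learn src on the ingress port and return the set of egress ports."""
        self.table[src] = port
        if dst in self.table:
            candidates = {self.table[dst]}                # known: forward
        else:
            candidates = self.uplinks | self.downlinks    # unknown: flood
        from_uplink = port in self.uplinks
        # Never send back out the ingress port, never uplink to uplink.
        return {p for p in candidates
                if p != port and not (from_uplink and p in self.uplinks)}


def send(elements, start, port, src, dst):
    """Inject a packet and return the sorted (element, downlink) exits."""
    delivered = []
    queue = [(start, port)]
    while queue:
        name, ingress = queue.pop()
        for egress in elements[name].receive(ingress, src, dst):
            if egress.startswith("u:"):
                queue.append((egress[2:], f"u:{name}"))   # arrives at the peer
            else:
                delivered.append((name, egress))
    return sorted(delivered)


elements = {n: Element(n, [p for p in "ABC" if p != n], 2) for n in "ABC"}
flooded = send(elements, "A", "d:0", "X", "Y")   # X -> unknown Y
reply = send(elements, "B", "d:1", "Y", "X")     # Y -> learned X
```

In this run the first packet leaves on every downlink except the one it arrived on, the reply leaves only on the downlink of A where X was learned, and C has learned X but not yet Y. The loop in `send` terminates because a packet received on an uplink is never forwarded to another uplink.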
The system can be scaled to N elements, provided that the switching elements have N-1 uplink ports and M downlink ports. The N elements must be connected in a full mesh; that is, any two switching elements are directly interconnected by an uplink (FIG 2). The worst case utilization of this system is when all M downlink ports on each switching element are fully loaded with traffic destined for one particular other switching element, and thus, one uplink. The throughput of the system in this scenario is 1/M. An even load distribution leads to a throughput of (N-1)/M.
The total number of downlink ports in the system is N*M; the total number of switching element ports is N*(N+M-1).
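The figures stated above for the full mesh with single uplinks can be computed as in the following sketch. The function name `mesh_figures` and the dictionary keys are chosen for the example; the formulas are those given in the text.

```python
from fractions import Fraction


def mesh_figures(n: int, m: int) -> dict:
    """Port counts and throughput for n elements with m downlinks each.

    Each element has n - 1 uplinks, one to every other element.
    """
    return {
        "downlink_ports": n * m,
        "element_ports": n * (n + m - 1),
        # All m downlinks of an element loading one single uplink.
        "worst_case_throughput": Fraction(1, m),
        # Traffic spread evenly over the n - 1 uplinks, as stated in the text.
        "even_load_throughput": Fraction(n - 1, m),
    }


example = mesh_figures(n=4, m=6)
```

For four switching elements with six downlinks each, this gives 24 downlink ports out of 36 switching element ports, a worst-case throughput of 1/6 and an even-load throughput of 1/2.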
As the second step, the system can be scaled further by using K uplinks between any two switching elements, using Link Aggregation for the uplinks (FIG 3). Essentially this works in the same way as using single uplinks, with the addition, that the K parallel uplinks have an algorithm for distributing the traffic among the links, for instance a hashing based on the source and/or destination addresses, and that the K parallel uplinks feature common learning of addresses, such that the source address of a packet received on one of the K uplink ports is learned as being reachable on all of the K ports.
The forwarding rules still disallow forwarding between the uplinks, including forwarding between links belonging to the same aggregated uplink. This, as above, is for preventing loops. All other forwarding between links is allowed.
Scaling the number of uplinks allows scaling the number of downlink ports equally, yet keeping the same average throughput. Though not mandatory, a proportional scaling to K*M is used for illustration. Any number of ports can be used instead of K*M.
Again, the worst-case utilization of this system is when all K*M downlink ports on each switching element are fully loaded with traffic destined for one particular other switching element, and thus, one aggregated uplink. Assuming a perfect distribution algorithm for the K aggregated links, the full aggregate link bandwidth can be utilized, resulting in a throughput of 1/M. However, since a practical distribution algorithm such as the above-mentioned hashing is not perfect, the throughput may be lower. In the worst case, all traffic is placed on one link, giving a throughput of 1/(K*M).
The ratio between the total number of downlink ports on this system and the total number of switching element ports is M/(M+N-1).
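The corresponding figures for K aggregated uplinks per interconnect may be checked with the following sketch. The function name `aggregated_figures` and the dictionary keys are chosen for the example; the proportional scaling to K*M downlinks per switching element is the one used for illustration in the text.

```python
from fractions import Fraction


def aggregated_figures(n: int, m: int, k: int) -> dict:
    """Port counts and worst-case throughput with k links per aggregate uplink."""
    downlinks = n * k * m                 # k*m downlinks on each of n elements
    ports = n * (k * (n - 1) + k * m)     # plus k*(n-1) uplinks per element
    return {
        "downlink_ports": downlinks,
        "element_ports": ports,
        "downlink_ratio": Fraction(downlinks, ports),
        # k*m loaded downlinks share k links when distribution is perfect.
        "worst_case_perfect_distribution": Fraction(k, k * m),
        # k*m loaded downlinks share a single link when all traffic collides.
        "worst_case_single_link": Fraction(1, k * m),
    }


example = aggregated_figures(n=4, m=6, k=2)
```

For N = 4, M = 6 and K = 2, the ratio of downlink ports to switching element ports evaluates to 48/72 = 2/3, which agrees with M/(M+N-1) = 6/9, and the two worst-case figures are 1/6 and 1/12.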
Thirdly, the worst-case limit can be drastically improved, by using a certain static distribution algorithm for the uplink traffic. The method is as follows. As above, N switching elements are interconnected in a fully meshed system with K parallel links for each interconnect. Each switching element has K*(N-1) uplink ports and K*M downlink ports. The ports are divided into port groups, consisting of one uplink from each of the (N-1) aggregates and M downlinks.
Forwarding is disallowed between any two uplinks, in order to prevent loops. Forwarding between any downlinks is allowed, and so is forwarding from any uplink to any downlink. But, forwarding from downlinks to uplinks is allowed only within the port groups; that is, packets from a given downlink can be forwarded to only one link of each aggregate uplink (FIG 4) to a given other switching element.
This method ensures that only M of the K*M downlink ports can forward traffic to a certain uplink of an aggregate uplink. Thus, the worst-case utilization is now 1/M - a factor K higher than with the address based distribution algorithm.
This system can be implemented with static configuration of switching elements, using source and destination port masks as follows.
The source port mask for a port defines to which ports packets may be forwarded from this port. This is typically 'allowed' for all other ports than the source port itself. In this system, the source port masks are as follows:
For the downlink ports: 'Allow' all other downlink ports than the source port itself, and 'allow' the one uplink port in each aggregate uplink that is part of the source port's port group. 'Disallow' all other ports.
For the uplink ports: 'Allow' all downlink ports, 'disallow' all uplink ports.
The destination port mask defines to which ports to forward packets that are destined to an address learned on a certain port. This is typically 'allowed' for the learning port itself only. In this system the destination port masks are as follows:
For the downlink ports: 'Allow' for the learning port itself only, 'disallow' all other ports.
For the uplink ports: 'Allow' for all uplink ports in the same aggregate uplink, 'disallow' all other ports.
Thus, the destination port masks are modified from the typical case to enable aggregate uplink learning. And the source port masks are modified for loop prevention and uplink selection.
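As a minimal sketch of such a static configuration, the source and destination port masks of one switching element may be generated as follows. The port naming, the function name `build_masks` and the representation of a mask as a Python set are assumptions made for the example; an actual switching element would typically hold the masks as bit vectors.

```python
def build_masks(n: int, m: int, k: int):
    """Return (source_mask, destination_mask) for one switching element.

    Hypothetical port naming: ("d", g, j) is downlink j of port group g,
    and ("u", a, g) is the link of aggregate uplink a belonging to group g.
    There are k port groups, n - 1 aggregate uplinks and m downlinks per group.
    """
    downlinks = [("d", g, j) for g in range(k) for j in range(m)]
    uplinks = [("u", a, g) for a in range(n - 1) for g in range(k)]
    source_mask, destination_mask = {}, {}

    for port in downlinks:
        group = port[1]
        # All other downlinks, plus the one uplink per aggregate in the group.
        allowed = {d for d in downlinks if d != port}
        allowed |= {u for u in uplinks if u[2] == group}
        source_mask[port] = allowed
        destination_mask[port] = {port}          # the learning port only

    for port in uplinks:
        aggregate = port[1]
        source_mask[port] = set(downlinks)       # never to another uplink
        # Common learning: reachable on every link of the same aggregate.
        destination_mask[port] = {u for u in uplinks if u[1] == aggregate}

    return source_mask, destination_mask


source_mask, destination_mask = build_masks(n=4, m=3, k=2)
```

One plausible way to combine the two masks, assumed here rather than stated in the text, is to intersect the source port mask of the ingress port with the destination port mask of the port on which the destination address was learned. For a packet from a downlink to an address learned on an aggregate uplink, that intersection contains exactly one uplink port, namely the link of the aggregate that belongs to the downlink's port group.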