WO2002015489A2

WO2002015489A2 - Switches and routers, with parallel domains operating at a reduced speed

Info

Publication number: WO2002015489A2
Application number: PCT/IB2001/001451
Authority: WO
Inventors: Yuanlong Wang; Kewei Yang; Feng Chen Lin
Original assignee: Conexant Systems, Inc.
Priority date: 2000-08-15
Filing date: 2001-08-14
Publication date: 2002-02-21
Also published as: AU2001278644A1; WO2002015489A3; EP1310065A2

Abstract

Crossbar and queuing chips with integrated point-to-point packet-based channel interfaces and resulting high internal aggregate bandwidths are designed as modules for a scalable CIOQ-based switch fabric that supports high-capacity fixed-length cell switching. By aggregating large amounts of traffic onto a single switching chip, the system pin-count and chip-count are dramatically reduced. The switch fabric offers improved switching capacity while operating at sub-unity speedup to relax cell-time requirements.

Description

HIGH-PERFORMANCE SWITCHES AND ROUTERS WITH PARALLEL SWITCHING DOMAINS HAVING SUB-UNITY SPEEDUP

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority from and is a continuation-in-part of the following U.S. patent applications, the disclosures of which are herein incorporated by reference for all purposes:

1. Ser. No. 09/350,414, entitled "ACCURATE TIMING CALIBRATION FOR EACH OF MULTIPLE HIGH-SPEED CLOCKED RECEIVERS USING A SINGLE DLL," filed July 8, 1999; and

2. Ser. No. 09/349,832, entitled "LOW-LEVEL CIRCUIT IMPLEMENTATION OF SIGNAL FLOW GRAPHS FOR REAL-TIME SIGNAL PROCESSING OF HIGHSPEED DIGITAL SIGNALS," filed July 8, 1999.

This application claims priority from the following provisional patent applications, the disclosures of which are herein incorporated by reference for all purposes:

1. U.S. Provisional Patent Application Ser. No. 60/173,777, entitled "INPUT-QUEUED CROSSBAR-BASED PROTOCOL-INDEPENDENT SWITCHING FABRIC FOR SWITCHES AND ROUTERS," filed Dec. 30, 1999;

2. U.S. Provisional Patent Application Ser. No. 60/178,076, entitled "INPUT-QUEUED CROSSBAR-BASED PROTOCOL-INDEPENDENT SWITCHING FABRIC FOR SWITCHES AND ROUTERS," filed Jan. 25, 2000; and

3. U.S. Provisional Patent Application Ser. No. 60/178,132, entitled "INPUT-QUEUED CROSSBAR-BASED PROTOCOL-INDEPENDENT SWITCHING FABRIC FOR SWITCHES AND ROUTERS," filed Jan. 26, 2000.

BACKGROUND

Packet Techniques

The term "packet" as applied to digital systems and networks has multiple connotations. First, packet transmission, packet-messaging, and packetized signaling all refer to a class of techniques for transmitting collections of digital bits as a data unit having a well-defined beginning and end. Packet-messaging techniques are broadly taught in "Frames, Packets and Cells in Broadband Networking," by William A. Flanagan, published by Telecom Library Incorporated, 1991; and in "Computer Networks: a Systems Approach," Second Edition, by Larry Peterson and Bruce Davie, published by Morgan Kaufmann, 2000.

A number of different ways have evolved to implement and characterize the data units sent by these techniques. For purposes of providing different classes of transmission functionality at different levels of abstraction, it is common to recursively encapsulate (embed) a data unit within another data unit at a lower level of abstraction. (This will become clearer when we consider specific layers in the next paragraph.) This is possible because each level of abstraction ignores the content of its own data unit. Hence, each level is transparent to any embedding of data units from a higher level. (However, as data units at a given level may be of fixed length, or have a relatively short maximum variable length, a data unit at a higher level may in fact be segmented and transmitted from source to destination using many lower level data units.) Informally, across these layers and especially when there is no need to distinguish among layers, the data units are often loosely referred to as packets, a second usage of the term.

A complete set of abstraction layers is referred to as a protocol-stack. Different protocol-stacks exist for various purposes, using different abstractions as appropriate. The different protocol-stacks generally have different definitions and names for each of their layers. A particularly well-known protocol-stack is the OSI Reference Model. In this model, the lowest layer of abstraction (layer-1) is referred to as the physical layer. The next higher layers of abstraction are the data link layer (layer-2), the network layer (layer-3), and the transport layer (layer-4). Other higher layers also exist, but we are not interested in them here. At each of these layers, the data unit has a particular formal name. The OSI data link layer data unit is formally referred to as a frame. The OSI network layer data unit is (confusingly) formally referred to as a packet. Thus we have the third usage of this term.
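The recursive encapsulation described above can be illustrated with a minimal sketch. The `DataUnit` class, the one-byte header-length prefix, and the header contents are hypothetical choices made only for illustration; they do not correspond to any real protocol format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataUnit:
    layer: int      # OSI layer number of this data unit
    header: bytes   # illustrative header, meaningful only at this layer
    payload: bytes  # opaque content: raw data or an encoded higher-layer unit

    def encode(self) -> bytes:
        # Hypothetical wire form: 1-byte header length, header, then payload.
        return bytes([len(self.header)]) + self.header + self.payload


def encapsulate(data: bytes, headers: Dict[int, bytes]) -> bytes:
    """Embed data in successively lower layers (highest layer wrapped first)."""
    unit = data
    for layer in sorted(headers, reverse=True):
        unit = DataUnit(layer, headers[layer], unit).encode()
    return unit


def decapsulate(unit: bytes, n_layers: int) -> bytes:
    """Strip n_layers headers, outermost (lowest layer) first."""
    for _ in range(n_layers):
        header_len = unit[0]
        unit = unit[1 + header_len:]
    return unit
```

Each layer treats the bytes it carries as opaque, which is why the lower layers need no knowledge of what the higher layers embedded.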

The fourth usage arises from distinctions made between packets of variable or fixed length. Specifically, the term "cell" is used to refer to a fixed-length network data unit. In this context, use of the term packet is taken to refer to a variable-length network data unit. Typical practice is to redundantly refer to "fixed-length cells" vs. "variable-length packets." As will be discussed later, the use of fixed-length cells generally enables hardware optimizations not available to variable-length packets. Asynchronous Transfer Mode (ATM) is a particular packet-switched technology (in the first and broadest sense of "packet") that only uses cells, and is thus commonly referred to as a cell-switching technology.

Clearly, without further qualification, the particular meaning of the term "packet" must be inferred from the context of its use. To help clarify matters, this specification will avoid the second and third usage of the term packet as given above, unless explicitly indicated otherwise. This leaves the first and fourth usages, which are generally easy to distinguish between. This specification will also sometimes use the term "message" to broadly refer to a data unit from a set that includes fixed-length, variable-length, layer-2, and layer-3 data units.

Point-to-Point Packet Signaling for Chip-to-Chip Transfers

Contemporary chip-to-chip transfers in high-performance digital systems are performed using point-to-point transfers using packet transmission (packet-messaging) protocol techniques. The transfers are said to take place over a channel, or link. Use of the term channel implies, at a minimum, the interconnect between the two chips, but often is intended to include the channel interfaces on either or both chips. The channel interfaces, also referred to as link macros, can be thought of as being conceptually partitioned into physical layer circuits and (so-called) transport layer logic. (The OSI Reference Model would consider this latter logic as existing at the data link layer. However, in the point-to-point channel technology, which only has layer-1 and layer-2 levels of abstraction, layer-2 is loosely referred to as the transport layer.) The interconnect side of these link macros is generally characterized by a minimalist high-performance full-duplex I/O interface of relatively narrow data-width. The I/O interface is optimized for a basic transfer type and has minimal control signals that are focused on error detection and retry for the basic transfer type. Control for higher order functions beyond the basic transfer type is implemented via control data fields defined within the (layer-2) frames.

Fig. 1 illustrates a system 15,000 using packet-messaging techniques to communicate between Chip A 15,100 and Chip B 15,300 using a generic (serial or parallel) bi-directional point-to-point channel (link) 15,200. Chip A 15,100 includes core logic 15,120 and a link macro 15,110. The link macro 15,110 includes physical layer circuits 15,140 and transport layer logic 15,130. Chip B 15,300 includes core logic 15,320 and a link macro 15,310. The link macro 15,310 includes physical layer circuits 15,340 and transport layer logic 15,330. The terms "link" and "channel" are synonymously used to refer to serial (one data wire or complementary wire-pair) and parallel (multiple data wires or wire-pairs) groupings of point-to-point inter-chip interconnect that are managed using variations on packet-messaging techniques as described herein. The point-to-point link 15,200 could be either a serial link, hereinafter referred to by the term S-Link, or a parallel link, hereinafter referred to by the term P-Link.

A clock rate is defined for the link macros and a plurality of bits is transferred (using multiple-data rate techniques) over each individual wire (bit) in each clock cycle. All of the bits transferred each cycle, either for the single wire in a serial link, or for multiple wires in a parallel link, are mapped into a predefined unit, or block, referred to as a "frame." The frame is often scaled up into a super-frame by the use of multiple links in parallel. The frame includes a data-unit (DU) and framing bits (F). The DU may be used for either Control Data (CD) or Payload Data (PD). The control data generally includes at least one address to identify the ultimate destination for the packet. Such an address is required for packet switching, wherein it is desired that the packet be forwarded to the destination across a network. The framing bits are required overhead information that generally includes error detection and correction bits and "flag" bits provided to define the boundaries of the frame or to facilitate timing alignment or calibration. The term "cell" as used herein refers to a packet-messaging unit of a predetermined (fixed) number of frames (or super-frames). With reference to packet-messaging units, the term "packet" as used herein refers to a dynamically determined (variable-length) number of frames (or super-frames). Within a cell or packet, each frame (or super-frame) is typically allocated into multiple fields, the definition of each generally varying on a cycle-to-cycle basis.

Figs. 2A through 2E are abstract drawings illustrating in concept the components of a generic packet or cell message, transferred using the generic point-to-point link of Fig. 1. Fig. 2A illustrates a data unit prior to the addition of framing bits. Fig. 2B illustrates the composite frame after framing bits have been added to the data unit. Fig. 2C illustrates a complete cell or packet built from multiple frames. As shown, some of the data units have been designated as control data fields and others have been designated as payload data fields. Fig. 2D conceptually shows the control data and payload data with the overhead framing bits stripped away. Fig. 2E conceptually shows the assembly of the control data and payload data fields into larger respective control data and payload data words, as might be employed by the core logic of either chip.
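The sequence of Figs. 2A through 2E (data unit, framed data unit, cell of several frames, framing stripped, fields reassembled into words) can be sketched as follows. The data-unit width, the per-frame field layout, and the one-byte XOR check standing in for the framing bits are all assumptions made for illustration; the text does not specify them.

```python
from functools import reduce
from typing import List, Tuple

DU_BYTES = 4                             # assumed data-unit width
CELL_LAYOUT = ("CD", "PD", "PD", "PD")   # assumed field designation per frame


def xor_check(du: bytes) -> int:
    """Hypothetical 1-byte error-detection code used as the framing bits."""
    return reduce(lambda a, b: a ^ b, du, 0)


def add_framing(du: bytes) -> bytes:
    """Fig. 2A to 2B: append framing bits to a data unit."""
    assert len(du) == DU_BYTES
    return du + bytes([xor_check(du)])


def strip_framing(frame: bytes) -> bytes:
    """Fig. 2C to 2D: verify and remove the framing bits."""
    du, check = frame[:-1], frame[-1]
    if xor_check(du) != check:
        raise ValueError("framing check failed")
    return du


def build_cell(control: bytes, payload: bytes) -> List[bytes]:
    """Fig. 2C: a cell is a fixed number of frames, some CD and some PD."""
    assert len(control) == CELL_LAYOUT.count("CD") * DU_BYTES
    assert len(payload) == CELL_LAYOUT.count("PD") * DU_BYTES
    cd = [control[i:i + DU_BYTES] for i in range(0, len(control), DU_BYTES)]
    pd = [payload[i:i + DU_BYTES] for i in range(0, len(payload), DU_BYTES)]
    return [add_framing(cd.pop(0) if kind == "CD" else pd.pop(0))
            for kind in CELL_LAYOUT]


def parse_cell(frames: List[bytes]) -> Tuple[bytes, bytes]:
    """Fig. 2D to 2E: assemble control and payload words for the core logic."""
    control, payload = b"", b""
    for kind, frame in zip(CELL_LAYOUT, frames):
        du = strip_framing(frame)
        if kind == "CD":
            control += du
        else:
            payload += du
    return control, payload
```

A real link macro would carry flag and calibration bits as well; the single check byte here only marks where that overhead sits in the frame.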

Fig. 2F is a conceptual timing diagram of a data transfer from chip A to chip B for the abstract generic link of Fig. 1. Time is shown increasing to the right on the x-axis, the vertical grid delineating cycle boundaries. Each of the waveforms A through E corresponds to a respective point in the system 15,000 of Fig. 1, as identified by the matching encircled letter. As will be appreciated from examining the timing diagram, the core logic of Chip A and the core logic of Chip B may pipeline or stream the control and payload data as though the two were directly coupled via a small multi-stage latency. Thus the two cores may inter-operate largely transparent to the fact that their interaction is actually occurring across chip (or board) boundaries and chip-to-chip (or board-to-board) interconnect. It will be appreciated by those skilled in the art that this has been an abstract example, intended to establish the terminology to be used herein, and that the point-to-point links of the present invention will differ significantly in detail from the foregoing discussion.

Router/Switch Overview

While industry usage is in a state of flux, switches and routers are distinguished herein as follows. Switches are defined as devices that perform packet-forwarding functions to transport layer-2 data units (e.g., Token Ring, AppleTalk, Ethernet frames) across a switched network that is layer-2 homogeneous (i.e., the same protocol is used throughout the network). Routers are defined as devices that perform packet-forwarding functions to transport layer-3 data units (e.g., IP and IPX datagrams) across a network that is layer-3 homogeneous, but may be layer-2 heterogeneous (i.e., different protocols are used in various parts of the network).

The architectures described in the following paragraphs, while characterized as switch architectures, are the architectural foundation for both switches and routers. While exact implementation details will vary, at a purely abstract level, the operation of both of these packet-forwarding devices may be viewed as follows. The switch fabric core forwards (switches) packets between one incoming and at least one outgoing network interface. (In high-end systems, the interfaces are generally modular and referred to as line cards.) For each data unit received by a line card, the data unit is briefly buffered on the input side of the switch fabric. The destination address specified by the data unit is looked up in the line card's copy of a forwarding table. Under control of resource scheduling fabric logic the data unit is subsequently transferred over the switch fabric to the destination line card.

A crossbar packet switch consists of a crossbar switch fabric core having a multiple-port input side and multiple-port output side. The packet switch must interface externally at specified line rates for the input and output ports. The addition of input and output queues (often implemented in shared-memory), at the input and output respectively of the switch fabric core, impacts both the service performance of the switch and the rate at which the switch fabric core must operate, for a given external line rate. In accordance with their particular use of input and output queues, crossbar switches are referred to as Input-Queued (IQ) switches, Output-Queued (OQ) switches, or Combined Input and Output Queued (CIOQ) switches.

Externally, the switch may be called upon to handle either fixed-length cells or variable-length packets. Internally, however, variable-length packets are segmented into cells upon arrival. Subsequent to being switched across the switch fabric core, the cells are reassembled into the original variable-length packets. In this way, the switch fabric core need only handle fixed-length cells. From the perspective of the switch fabric core, a "time slot" is defined to be the interval between cell arrivals.
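A minimal sketch of the segmentation and reassembly step follows. The 48-byte cell payload, the `Cell` fields, and zero padding of the final cell are assumptions for illustration; the text does not give the internal cell format. Cells are assumed to arrive in order.

```python
from typing import List, NamedTuple

CELL_PAYLOAD_BYTES = 48  # assumed payload capacity of one internal cell


class Cell(NamedTuple):
    last: bool      # True for the final cell of a packet
    valid: int      # number of meaningful payload bytes in this cell
    payload: bytes  # always exactly CELL_PAYLOAD_BYTES long


def segment(packet: bytes, size: int = CELL_PAYLOAD_BYTES) -> List[Cell]:
    """Split a variable-length packet into fixed-length cells, padding the last."""
    n_cells = max(1, -(-len(packet) // size))  # ceiling division, at least one cell
    cells = []
    for i in range(n_cells):
        chunk = packet[i * size:(i + 1) * size]
        cells.append(Cell(i == n_cells - 1, len(chunk), chunk.ljust(size, b"\x00")))
    return cells


def reassemble(cells: List[Cell]) -> bytes:
    """Rebuild the packet from in-order cells, dropping the padding."""
    if not cells or not cells[-1].last:
        raise ValueError("incomplete packet")
    return b"".join(c.payload[:c.valid] for c in cells)
```

Because every cell has the same length, the fabric core can work in fixed time slots regardless of the external packet sizes.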

Switches are characterized by their internal "speedup". Speedup is the (maximum) number of cells that are transferred across the switch fabric core each time slot. The switch fabric core must operate faster than the line rate by a factor equal to the speedup.

Basic OQ switches must operate with a speedup of N, where N is the number of ports on each of the input and output sides of the switch. OQ switches offer the maximum possible throughput and can provide quality-of-service (QoS) guarantees. Unfortunately, the speedup of N requirement for OQ switches makes them prohibitively expensive for high line rates or for large number of ports.

Basic IQ switches need only operate with a speedup of one. Thus, the switch fabric need only operate as fast as the line rate, making IQ switches a common choice for applications with high line rates or large number of ports. Unfortunately, in its basic form in which each input port has a single FIFO, output-port contention results in head-of-line (HOL) blocking, which can limit overall throughput to less than 60 percent.
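The throughput limit caused by HOL blocking can be observed with a small saturation simulation. This is a sketch under stated assumptions: every input FIFO always has a cell waiting, destinations are uniformly random, and each contended output picks one contender at random. Under those assumptions the classical analysis gives a limit of 2 - sqrt(2), about 0.586, for large port counts, which is consistent with the "less than 60 percent" figure above. The function name and parameters are hypothetical.

```python
import random


def hol_throughput(n_ports: int, n_slots: int, seed: int = 0) -> float:
    """Fraction of output capacity used by a saturated single-FIFO IQ switch."""
    rng = random.Random(seed)
    # hol[i] is the destination of the head-of-line cell at input i.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        # Group inputs by the output their head-of-line cell wants.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender this slot.
        for inputs in contenders.values():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            delivered += 1
        # Losers keep their head-of-line cell and block the cells behind it.
    return delivered / (n_ports * n_slots)
```

For example, `hol_throughput(32, 2000)` typically lands close to 0.59, while a 2-port switch settles near 0.75.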

Fig. 3 is an abstract drawing of a prior art enhanced IQ crossbar switch that employs Virtual Output Queues (VOQs) to eliminate HOL blocking. This and other architectures for Internet routers are overviewed in "Fast Switched Backplane for a Gigabit Switched Router," by Nick McKeown, in Business Communications Review, volume 27, No. 12, December 1997. At each input, a separate FIFO queue is maintained for each output. Hence, each of these FIFO queues is a VOQ for a respective output. After an initial forwarding decision is made, an arriving cell is placed in the VOQ for the output port to which it is to be forwarded. At the beginning of each time slot, the scheduling and matching logic evaluates the contents of the VOQs and selects a conflict-free input-to-output configuration for the M-way crossbar switch.

Because CIOQ switches use buffering at both the inputs and outputs, they have speedup values between 1 and N. It has been shown in simulation that the average delay of practical CIOQ switches with a speedup of 2 approximates the average delay of OQ switches. Thus, CIOQ switches should provide much better delay control compared with IQ switches, and at the same time require only a modest speedup. Reducing the required speedup accordingly reduces bandwidth requirements and costs associated with memory and internal links. Thus from a theoretical perspective, CIOQ switches would appear to be an underlying architecture for high capacity switches that approximate the performance of an OQ switch without requiring a high speedup.
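The VOQ arrangement and the per-time-slot selection of a conflict-free configuration can be sketched as below. The matching shown is a simple greedy maximal matching with a rotating starting input. It is not the scheduler of any cited work or of the present invention, and the class and method names are hypothetical.

```python
from collections import deque
from typing import Dict


class VoqSwitch:
    def __init__(self, n_ports: int) -> None:
        self.n = n_ports
        # voq[i][j] is the FIFO at input i holding cells destined for output j.
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        self.start = 0  # rotating first input, for rough fairness

    def enqueue(self, inp: int, out: int, cell: object) -> None:
        self.voq[inp][out].append(cell)

    def schedule(self) -> Dict[int, int]:
        """Return a conflict-free {input: output} match for this time slot."""
        match: Dict[int, int] = {}
        used_outputs = set()
        for k in range(self.n):
            inp = (self.start + k) % self.n
            for out in range(self.n):
                if out not in used_outputs and self.voq[inp][out]:
                    match[inp] = out          # each input is matched at most once
                    used_outputs.add(out)     # each output is matched at most once
                    break
        self.start = (self.start + 1) % self.n
        return match

    def time_slot(self) -> Dict[int, object]:
        """Transfer one cell per matched pair; returns {output: cell}."""
        return {out: self.voq[inp][out].popleft()
                for inp, out in self.schedule().items()}
```

Because an input can offer a cell for any output whose VOQ is non-empty, a cell waiting for a busy output no longer blocks cells behind it that are bound for idle outputs.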

There have been several difficult technical challenges that, taken in combination, have prevented prior art CIOQ architectures from large-scale deployment for commercial high capacity switches with high link rates. First, high capacity switch fabrics (64x64x10Gbps or 256x256x2.5Gbps) require very large internal bandwidth, normally in the multi-terabits per second range, even at the low levels of speedup required for a CIOQ approach. As an example, in order to support a target 640Gbps user switching capacity, a switch fabric with a speedup of 2 must have at least 2Tbps of internal bandwidth. Second, in spite of available high-speed integrated transceiver technology, prior art CIOQ architectures necessitate a large (and thereby expensive) number of semi-custom or ASIC chips to provide the requisite aggregate bandwidth. Third, prior art CIOQ architectures have problematically short cell times, even at the low levels of speedup required for a CIOQ approach. That is, a straightforward implementation of a CIOQ crossbar with a speedup of 2 could cause the cell time for an ATM cell (at 10Gbps link rate) to be only 25ns. There is no practical prior art crossbar scheduler that can operate this fast for switch sizes up to 64x64. Such a short cell time also presents challenges to other aspects of the switch fabric design.
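The internal bandwidth figure can be checked with simple arithmetic. The sketch below assumes the ratio of 3.2 units of raw internal bandwidth per unit of link rate that the detailed description gives for a speedup of 2; tying that ratio to the "at least 2Tbps" figure here is an inference, not something the text states. The function name is hypothetical.

```python
from fractions import Fraction

SPEEDUP = 2                    # payload speedup quoted in the text
RAW_FACTOR = Fraction(32, 10)  # raw internal bandwidth per unit of link rate (assumed to apply here)


def fabric_bandwidth_gbps(ports: int, link_gbps: int) -> dict:
    """Break the internal bandwidth of a fabric into payload and overhead."""
    user = ports * link_gbps   # external (user) switching capacity
    payload = user * SPEEDUP   # bandwidth spent switching payload
    raw = user * RAW_FACTOR    # total internal bandwidth
    return {"user": user, "payload": payload,
            "raw": raw, "overhead": raw - payload}
```

For a 64-port, 10Gbps fabric this gives 640Gbps of user capacity, 1280Gbps of payload bandwidth, and 2048Gbps of raw internal bandwidth, which is in line with the 2Tbps quoted above.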

SUMMARY

The present invention teaches crossbar and queuing chips with integrated point-to-point packet-based channel interfaces and resulting high internal aggregate bandwidths (on the order of 256Gbps in current technology), designed as modules of a scalable CIOQ-based switch fabric that supports high-capacity fixed-length cell switching. By aggregating large amounts of traffic onto a single switching chip, the system pin-count and chip-count are dramatically reduced. The switch fabric offers improved switching capacity while operating at sub-unity speedup to relax cell-time requirements.

In accordance with the present invention, the switch fabric consists of multiple (eight in an illustrative embodiment) switching domains operating in parallel, for an overall effective speedup of 2, but with sub-unity speedup within each domain. Each domain contains one or more non-buffered crossbar chips operating in bit-sliced fashion (plus a stand-alone scheduler chip if more than one slice is used). The incoming traffic is queued at the ingress port in VOQs and then dispatched uniformly to all switching domains. Traffic coming out of the switching domains will then be aggregated at the egress port with shared-memory based OQs.

Overall, the switch fabric approximates a CIOQ crossbar switch with a speedup factor of 2 and at the same time doubles the cell time for ATM cells to 100ns. The relatively long cell time makes the crossbar scheduler much easier to design. The multiple sub-unity domains also make the switch fabric much easier to implement at the system level than a fully bit-sliced architecture.

The multiple non-buffered switching domains are totally independent and any switch domain can switch any cell for any ingress port to any egress port. The status of each switch domain (Xchip or crossbar card) will be sent to the Qchips in conjunction with its handling of all the ingress and egress ports. The Qchips monitor the returned status and automatically redirect cells and requests to available switching domains, avoiding any disabled or malfunctioning domains (whether due to link, chip, or other cause). Thus, there is no need to provide extra redundant switching capacity. Also, since there is no buffer within the switching domains, there is no need for cell reordering at the egress ports. This both simplifies the design and improves the switch fabric performance.

The switch fabric is protocol independent, scalable, and may be implemented as a set of ASIC chips. The switch fabric is composed of building blocks (modules) including queuing chips (Qchips), crossbar chips (Xchips), and MUX chips (Mchips). These chips are collectively referred to herein as the chipset. All of these chips have high-speed integrated CMOS transceivers. The high-speed transceivers include 8Gbps parallel channels and 2.5Gbps serial links that allow for low chip-count, low pin-count, low power and highly integrated implementations. The chipset uses sideband signaling via a parallel channel as the control mechanism between the components. The chipset switches fixed cell sizes, which is optimal for all protocols (IP, ATM, etc.), allowing the architecture to be protocol independent.

Compared to prior art solutions, the switch fabric of the present invention scales to larger port configurations, requires less than one-tenth the components and uses only one-fourth the total power of an equivalent system. Three basic variations (described herein) provide throughputs of 160Gbps, 320Gbps and 640Gbps. It will be obvious to those skilled in the art how to scale these variations to either higher or lower capacities as required. The 320Gbps configuration requires only 24 total chips while the 640Gbps configuration requires only 120 total chips. A single crossbar chip with integrated transceivers can provide an aggregate throughput of 256Gbps with a pin-count of less than 650 pins.

Edge and core switches and routers are exemplary system applications of the present invention. More generally, the present invention provides a high-performance protocol-independent switch fabric for the selective forwarding of packets between multiple networks and sub-networks that collectively comprise a larger switched network. System applications of the present invention will vary in terms of subnet heterogeneity, network scale, control protocols, and forwarding algorithms. The present invention is applicable in modified, enhanced, and hybrid router/switch variants that blur and extend the definitions given above for switches and routers. Thus the system applications for the present invention include, but are not limited to, switches, routers, layer-3 switches, routing switches, and devices that implement specialized packet forwarding protocols, such as the Multiple-Protocol-Label-Switching (MPLS) standard.

BRIEF DESCRIPTION OF DRAWINGS

Fig. 1 is an abstract drawing of a generic point-to-point link used for chip-to-chip message transfers, as found in the prior art.

Figs. 2A through 2E are abstract drawings illustrating the components of a message transferred using the generic point-to-point link of Fig. 1. Fig. 2F is a timing diagram of a data transfer from chip A to chip B for the abstract generic link of Fig. 1.

Fig. 3 is a prior art crossbar switch.

Fig. 4 illustrates a switch fabric 9200 in accordance with the present invention, having a capacity of 320Gbps using current technology, implemented using P-Link interconnect network 9250, and a particular number of Qchips 2000 and Xchips 1000.

Fig. 5 illustrates a router/switch 9000 using the switch fabric 9200 of Fig. 4, in which the network interface 9100 and the Qchip 2000 are both implemented on Line Card 9150.

Fig. 6 illustrates the system environment in which the router/switch of Fig. 5 finds application.

Fig. 7 illustrates a more general configuration of the switch fabric configuration 9200 of Fig. 4, emphasizing that within the scope of the invention different numbers of Qchips and Xchips are possible.

Fig. 8 illustrates a more general configuration of router/switch 9000 of Fig. 5, emphasizing that within the scope of the invention the Qchips need not be implemented on the line cards.

Fig. 9A illustrates the internal architecture of the Qchips of Fig. 4, for specific numbers of OC-192 ports, S-Links, and P-Links.

Fig. 9B illustrates the outgoing logic 2200 of Fig. 9A, for a specific configuration. Fig. 9C illustrates the incoming logic 2100 of Fig. 9A, for a specific configuration.

Fig. 10A illustrates a more general configuration of the Qchip of Fig. 9A, emphasizing that within the scope of the invention different numbers of ports, S-Links, and P-Links are possible. Fig. 10B illustrates a more general configuration of the outgoing logic 2200 of Fig. 9C, emphasizing that within the scope of the invention different queue configurations are possible. Fig. 10C illustrates a more general configuration of the incoming logic 2100 of Fig. 9B, emphasizing that within the scope of the invention different queue configurations are possible.

Fig. 11A illustrates the internal architecture of each Xchip of Fig. 4, for specific numbers of crossbar ports and associated P-Links. Fig. 11B illustrates the internal structure of logic 1400 of Fig. 11A.

Fig. 12 illustrates a more general configuration of the Xchip of Fig. 11A, emphasizing that within the scope of the invention different numbers of crossbar ports and P-Links are possible.

Fig. 13A is an abstract drawing illustrating one functional view of the S-Link macro 4100 of Fig. 5A. Fig. 13B illustrates the minimum length S-Link packet format. Fig. 13C illustrates the full-length S-Link packet format.

Fig. 14A is an abstract drawing illustrating one functional view of the P-Link macro 3100 of Fig. 5A. Fig. 14B illustrates the P-Link cell format.

Fig. 15 illustrates the logic of the S-Link macro 4100 of Fig. 13A.

Figs. 16A through 16I detail logic, circuitry, and behavioral aspects of the P-Link macro 3100 of Fig. 14A. Fig. 16A illustrates the logic of the P-Link macro 3100 of Fig. 14A. Fig. 16B is a different view of the transmitter section 10,100 of the P-Link macro 3100 of Fig. 16A. Fig. 16C illustrates the internal circuitry of the differential transceiver 10,127 of the transmitter section 10,100 of Fig. 16B. Fig. 16D illustrates the voltage waveform for the link as observed from the transmitter output. Fig. 16E illustrates the voltage waveform for the link as observed from the receiver input at the opposite end of the link relative to the observation-point of Fig. 16D. Fig. 16F is a different view of the receiver section 10,200 of the P-Link macro 3100 of Fig. 16A. Fig. 16G illustrates the internal circuitry of the receiver section 10,200 of Fig. 16F. Fig. 16H illustrates the logic within the receiver synchronization circuit 10,221 of Figs. 16A and 16F. Fig. 16I illustrates a detail of the operation of the synchronization circuit 10,221 of Fig. 16H.

Fig. 17 illustrates switch fabric 9202, a reduced-scale variation of the switch fabric of Fig. 4 within the scope of the present invention, having a P-Link interconnect network 9252 and a capacity of 160Gbps using current technology, using a particular number of Qchips 2000 and Xchips 1000.

Fig. 18 is a drawing of a router/switch 9000 in accordance with the present invention using the switch fabric configuration 9202 of Fig. 17.

Fig. 19 illustrates a more general configuration of the switch fabric configuration 9202 of Fig. 17, emphasizing that within the scope of the invention different numbers of Qchips and Xchips are possible.

Fig. 20 illustrates switch fabric 9205, an expanded-scale variation of the switch fabric of Fig. 4 within the scope of the present invention, having a P-Link interconnect network 9255 and a capacity of 640Gbps using current technology, using a particular number of Qchips 2000 and Crossbar Cards 5000.

Fig. 21 illustrates a router/switch 9000 using the switch fabric 9205 of Fig. 20, in which the network interface 9100 and the Qchip 2000 are both implemented on Line Card 9150.

Fig. 22 illustrates a more general configuration of the switch fabric configuration 9205 of Fig. 20, emphasizing that within the scope of the invention different numbers of Qchips and Xchips are possible.

Fig. 23 illustrates a more general configuration of router/switch 9000 of Fig. 21, emphasizing that within the scope of the invention the Qchips need not be implemented on the line cards.

Fig. 24 illustrates the internal architecture of the crossbar cards 5000 of Fig. 20, using a particular number of Mchips 6000 and Xchips 1000.

Fig. 25 illustrates a more general configuration of the crossbar card 5000 of Fig. 24, emphasizing that within the scope of the invention different numbers of Mchips 6000 and Xchips 1000 are possible.

Fig. 26A illustrates the internal architecture of Mchip 6000 of Fig. 16C. Fig. 26B provides additional detail of logic block 6100 of Mchip 6000 of Fig. 26A. Fig. 26C provides additional detail of logic block 6300 of Mchip 6000 of Fig. 26A. Fig. 26D provides additional detail of logic block 6400 of Mchip 6000 of Fig. 26A.

DETAILED DESCRIPTION

ROUTER/SWITCH PRIMARY ARCHITECTURE

Fig. 4 illustrates a switch fabric 9200 in accordance with the present invention, having a capacity of 320Gbps using current technology, implemented using P-Link interconnect network 9250, and a particular number of Qchips 2000 and Xchips 1000. The switch fabric of this particular illustrative embodiment is designed to implement the CIOQ crossbar architecture with 8 slower speed (sub-unity speedup) switching domains operating in parallel. Although the switch fabric as a whole has an internal speedup of 2 for better QoS, the 8-switching domain design allows each domain to operate at only half the link rate.

To provide an internal speedup of 2 for the user bandwidth, the raw switching capacity of the switch fabric is 3.2 times the link data rate. For example, for every 10Gbps link, the switch fabric allocates 32Gbps internal bandwidth. Of that, 20Gbps is used for the switching of payload; the other 12Gbps bandwidth is used for overhead, which includes requests, grants, backpressure information, and other control information. The switch fabric supports 8 priorities (classes) with a per-port per-class based delay control.

The switch fabric naturally supports fault tolerance without requiring additional redundancy logic. The multiple non-buffered switching domains are totally independent and any switch domain can switch any cell for any ingress port to any egress port. The status of each switch domain (Xchip or crossbar card) will be sent to the Qchips in conjunction with its handling of all the ingress and egress ports. The Qchips monitor the returned status and automatically redirect cells and requests to available switching domains, avoiding any disabled or malfunctioning domains (whether due to link, chip, or other cause). Even when one switching domain is disabled, the remaining seven switching domains can still provide a speedup of 1.8 times that of the link data rate. In this case, the switch fabric continues to provide good performance. Thus, there is no need to provide extra redundant switching capacity.
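The bandwidth figures above can be checked with a short calculation. The following is a minimal sketch, not part of the disclosed design; the function names are hypothetical, and the assumption that payload speedup scales linearly with the number of active domains is an inference from the text (it gives 1.75 for seven domains, which the text rounds to 1.8).

```python
LINK_RATE_GBPS = 10.0   # one OC-192 link, from the text
RAW_FACTOR = 3.2        # raw switching capacity per link, from the text
PAYLOAD_SPEEDUP = 2.0   # internal speedup for user bandwidth, from the text
NUM_DOMAINS = 8         # parallel switching domains in Fig. 4


def per_link_budget(link_rate=LINK_RATE_GBPS):
    """Return (raw, payload, overhead) internal bandwidth in Gbps for one link."""
    raw = RAW_FACTOR * link_rate
    payload = PAYLOAD_SPEEDUP * link_rate
    return raw, payload, raw - payload


def speedup_with_active_domains(active, total=NUM_DOMAINS):
    """Payload speedup, ASSUMING it scales linearly with the active domains."""
    if not 0 <= active <= total:
        raise ValueError("active must be between 0 and total")
    return PAYLOAD_SPEEDUP * active / total


if __name__ == "__main__":
    print(per_link_budget())
    print(speedup_with_active_domains(7))
```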

Fig. 5 illustrates a router/switch 9000 using the switch fabric 9200 of Fig. 4, in which the network interface 9100 and the Qchip 2000 are both implemented on Line Card 9150. For switch applications, forwarding-table management is performed in a master network processor, a designation given to the network processor in the first line card. Specifically, the master network processor is responsible for dynamically maintaining forwarding tables for the network topology and switch configuration and updating the forwarding tables of the network processors on the other line cards. In addition, the master network processor is also responsible for system administration functions typical for a switch, including initialization (e.g., loading of the switch operating system software), configuration, console, and maintenance. Alternatively, implementations are possible in which all network processors have identical functionality, and a separate unit, a switch processor, performs the forwarding-table management and system administration functionality.

For router applications, forwarding-table management is performed in a master network processor, a designation given to the network processor in the first line card. Specifically, the master network processor is responsible for running the routing protocols, building routing and forwarding tables for the network topology and router configuration and distributing the forwarding tables to the network processors on the other line cards. In addition, the master network processor is also responsible for system administration functions typical for a router, including initialization (e.g., loading of the router operating system software), configuration, console, and maintenance. Alternatively, implementations are possible in which all network processors have identical functionality, and a separate unit, a route processor, performs the forwarding-table management and system administration functionality.

Fig. 6 illustrates the system environment in which the router/switch of Fig. 5 finds application.

Fig. 7 illustrates a more general configuration of the switch fabric 9200 of Fig. 4, emphasizing that within the scope of the invention different numbers of Qchips and Xchips are possible.

Fig. 8 illustrates a more general configuration of router/switch 9000 of Fig. 5, emphasizing that within the scope of the invention the Qchips need not be implemented on the line cards.

QUEUING CHIP (QCHIP) INTERNAL ARCHITECTURE

Qchip Overview

Fig. 9A illustrates the internal architecture of the Qchips of Fig. 4, for specific numbers of OC-192 ports, S-Links, and P-Links. For this particular illustrative embodiment, each Qchip can support a) 2 OC-192 ports (10Gbps), b) 8 OC-48 ports (2.5Gbps), or c) 1 OC-192 port and 4 OC-48 ports. On one side, the Qchip interfaces to a line card or the network processor with 16 high-speed serial links providing a total of 32Gbps bandwidth. Each serial link runs at 2.5Gbps and provides effective bandwidth of 2Gbps. On the other side, the Qchip interfaces with the 8 switching domains with 8 parallel links providing a total of 64Gbps bandwidth. Each parallel link can provide 8Gbps bandwidth.

Qchip Ingress Processing

Fig. 9B illustrates the outgoing logic 2200 of Fig. 9A, for a specific configuration. Ingress processing is performed here. In this particular illustrative embodiment, the Qchip maintains 64 unicasting VOQs 2250, shared by all the ingress ports, each of these 64 VOQs targeting a respective egress port. Each VOQ consists of 8 sub-VOQs, one for each of the 8 priorities. Thus, there are 512 sub-VOQs. The Qchip also maintains a multicasting queue 2260 for each OC-192 port, so there are 2 multicasting queues in this particular Qchip embodiment. In the case that an OC-192 port is configured as 4 OC-48 ports, all 4 OC-48 ports will share the same multicasting queue. Each multicasting queue also consists of 8 sub-queues for the 8 priorities. The buffers for the unicasting queues and multicasting queues are implemented by on-chip SRAMs and are managed with an adaptive dynamic threshold algorithm for better adaptation to different traffic patterns and for efficient use of buffer space. The ingress portions of the SRAMs have a bandwidth of 80Gbps.
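The organization of 64 VOQs into 8 sub-VOQs each can be pictured as a flat array of 512 entries. The sketch below is illustrative only; the port-major index layout and the function names are assumptions and are not taken from the disclosure.

```python
NUM_EGRESS_PORTS = 64   # unicasting VOQs per Qchip, from the text
NUM_PRIORITIES = 8      # sub-VOQs per VOQ, from the text


def sub_voq_index(egress_port, priority):
    """Flat sub-VOQ index (hypothetical layout: port-major, priority-minor)."""
    if not 0 <= egress_port < NUM_EGRESS_PORTS:
        raise ValueError("egress_port out of range")
    if not 0 <= priority < NUM_PRIORITIES:
        raise ValueError("priority out of range")
    return egress_port * NUM_PRIORITIES + priority


def sub_voq_fields(index):
    """Inverse of sub_voq_index: recover (egress_port, priority)."""
    if not 0 <= index < NUM_EGRESS_PORTS * NUM_PRIORITIES:
        raise ValueError("index out of range")
    return divmod(index, NUM_PRIORITIES)
```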

The ingress port scheduler 2240 sits between the queues. It controls the dispatch of requests, which are sent to the Xchips from the queues via the 8 outgoing P-Links. The real data transfer happens only after the crossbar schedulers in the Xchips have granted a request. For every grant, a cell will be forwarded to the Xchip to be switched to the appropriate egress port. In a first embodiment, the ingress port scheduler is based on a round-robin algorithm. The pointers used for the scheduler can be reset to a random value periodically. In a second embodiment, the scheduling mechanism is fully programmable and can support either strict or weighted priority.
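The round-robin behavior of the first embodiment, including the periodic pointer reset, might look like the following minimal sketch. The class and method names are hypothetical, and the rule of advancing the pointer just past the winning queue is a common convention assumed here, not something the text specifies.

```python
import random


class RoundRobinPointer:
    """Round-robin selection over n queues with a movable pointer."""

    def __init__(self, n):
        self.n = n
        self.pointer = 0

    def pick(self, has_request):
        """Return the first requesting queue at or after the pointer, or None."""
        for offset in range(self.n):
            q = (self.pointer + offset) % self.n
            if has_request[q]:
                self.pointer = (q + 1) % self.n  # advance past the winner (assumed)
                return q
        return None

    def randomize(self, rng=random):
        """Reset the pointer to a random value, as the text permits periodically."""
        self.pointer = rng.randrange(self.n)
```

With requests pending on queues 0, 2, and 3 of four queues, successive calls to `pick` visit 0, 2, 3, and then 0 again.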

Qchip Egress Processing

Fig. 9C illustrates the incoming logic 2100 of Fig. 9A, for a specific configuration. Egress processing is performed here. In this particular illustrative embodiment, the Qchip maintains 8 OQs, corresponding to one OQ for every potential OC-48 egress port on the Qchip. When a Qchip is configured to support OC-48 ports, each OC-48 port will have its own OQ. Each OQ 2120 corresponds to an egress port and further has one sub-OQ per priority. There is a multicasting queue 2130 for each egress OC-192 port. Each multicasting queue consists of one sub-queue per priority. When an OC-192 port is configured to be 4 OC-48 ports, all 4 OC-48 ports will share the same multicasting queue.

The OQs and multicasting queues are shared-memory based and are implemented as on-chip SRAMs and managed with an adaptive dynamic threshold algorithm for better adaptation to different traffic patterns. The egress portions of the SRAMs also have a bandwidth of 80Gbps. The egress port scheduler 2140 sits between the OQs and multicasting queues, and the egress port 2145. It supports both strict priority and weighted round-robin algorithms to control delays of the cells for each of the priorities.
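As an illustration of the weighted round-robin option, the following sketch computes the order in which cells would leave a set of priority sub-queues. It is a textbook form of weighted round-robin chosen for clarity, with hypothetical names; the disclosure does not state which variant the egress port scheduler implements.

```python
def weighted_round_robin(queue_lengths, weights):
    """Return the service order (list of priority indices) until queues drain.

    In each round, priority p may send up to weights[p] cells.
    """
    remaining = list(queue_lengths)
    order = []
    while any(remaining):
        progressed = False
        for p, w in enumerate(weights):
            sends = min(w, remaining[p])
            order.extend([p] * sends)
            remaining[p] -= sends
            progressed = progressed or sends > 0
        if not progressed:
            break  # only zero-weight queues still hold cells; stop instead of spinning
    return order
```

For two sub-queues holding three cells each and weights 2 and 1, the service order is 0, 0, 1, 0, 1, 1.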

Generalized Implementations of the Qchip

Fig. 10A illustrates a more general configuration of the Qchip of Fig. 9A, emphasizing that within the scope of the invention different numbers of ports, S-Links, and P-Links are possible. The Qchip interfaces to the line card or the network processor via multiple S-Links (reference No. 4000 individually, 2050 collectively). Each serial link runs at 2.5Gbps and provides effective bandwidth of only 2Gbps due to 8b/10b encoding. A single Qchip can support N OC-192 ports or 4xN OC-48 ports. A group of 8 S-Links can be programmed to support either a single OC-192 port or 4 OC-48 ports. When supporting 4 OC-48 ports, an external mux is required to mux OC-48 cells onto the 8 S-Links. The Qchip interfaces with each of the switching domains via a respective 8Gbps bandwidth P-Link. All the ports on a Qchip share the multiple P-Links interfacing to the switching domains.
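The S-Link figures follow directly from the 8b/10b coding overhead. A minimal sketch of the arithmetic, with hypothetical function names:

```python
S_LINK_LINE_RATE_GBPS = 2.5   # raw serial rate per S-Link, from the text
S_LINKS_PER_GROUP = 8         # one group serves 1 OC-192 or 4 OC-48 ports, from the text


def effective_rate_gbps(line_rate=S_LINK_LINE_RATE_GBPS):
    """8b/10b sends 10 line bits per 8 data bits, so 80% of the line rate is data."""
    return line_rate * 8 / 10


def group_capacity_gbps(links=S_LINKS_PER_GROUP):
    """Effective bandwidth of a group of S-Links."""
    return links * effective_rate_gbps()
```

A group of 8 S-Links thus offers 16Gbps of effective bandwidth for one 10Gbps OC-192 port, and 16 S-Links give the 32Gbps total quoted for the two-port Qchip.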

Fig. 10B illustrates a more general configuration of the incoming logic 2100 of Fig. 9B, emphasizing that within the scope of the invention different queue configurations are possible. Fig. 10C illustrates a more general configuration of the outgoing logic 2200 of Fig. 9C, emphasizing that within the scope of the invention different queue configurations are possible.

CROSSBAR CHIP (XCHIP) INTERNAL ARCHITECTURE

Fig. 11A illustrates the internal architecture of each Xchip of Fig. 4, for specific numbers of crossbar ports and associated P-Links. Fig. 11B provides additional detail of logic 1400 of Fig. 11A. The Xchip has an integral crossbar 1100 as the data path for the switch fabric. It also has an integral crossbar scheduler 1200. Each Xchip has 16 P-Link interfaces. With 8Gbps per parallel channel (P-Link), per direction, an Xchip has raw switching capacity of 128Gbps. The aggregate throughput of a single Xchip is 256Gbps. The Xchip also supports 1->N multicasting. In one cell time, multiple targets can receive the cell from a single ingress port.

Each Xchip used alone, or each combination of Xchips on a crossbar card (discussed below), constitutes an independent switching domain. The real cell switching is done in these domains. In the switch fabric 9200 of Fig. 4 there are eight such domains. Each switching domain operates independently of the others and at half the link rate. For an ATM cell with a 10Gbps incoming link data rate, the cell time is 100ns. A scheduler using commonly available CMOS technology can finish one cell scheduling for a 64x64 crossbar within one cell time.

POINT-TO-POINT CHANNEL TECHNOLOGY

Overview

Serial and parallel point-to-point packet-based channels and parallel multi-drop channels (referred to as S-Link, P-Link, and P-Link MD, respectively) have been integrated into the chips of the switch fabric. The following paragraphs will describe for each of these channel interfaces their overall functionality, their data unit protocols, and aspects of their circuit design. Additional aspects regarding their implementation are provided in the previously referenced applications: Ser. No. 09/350,414 and Ser. No. 09/349,832. These low power, low pin-count, high reliability, high-speed CMOS transceivers are suitable for chip-to-chip, box-to-box, and a variety of backplane implementations.

The S-Link is a serial transceiver optimized for ease-of-use and efficiency in high-performance data transmission systems. It accepts two 10-bit 8b/10b encoded transmit characters, latches them on the rising edge of the transmission clock and serializes that data onto the transmitter differential outputs at a baud rate that is twenty times the transmission clock frequency. It also samples the serial receive data on the receiver differential inputs, recovers the clock and data, deserializes it into two 10-bit characters and outputs a recovered clock at one-twentieth of the incoming baud rate. The S-Link contains on-chip PLL circuitry for synthesis of the baud-rate transmit clock and the extraction of the clock from the received serial stream. The S-Link can operate at bit data rates of up to 3.125Gbps.

The P-Link uses a parallel transceiver that provides 16Gbps of aggregated bandwidth at a data rate of 1.6Gbps per differential signal pair with less than 10^-15 BER. The data width is 5 bits. The transceiver utilizes 24 signal pins and 6 power and ground pins. It can drive 39 inches of 50 ohm PCB trace with two connector crossings. It has a built-in self-calibration circuit that optimizes data transfer rate and corrects up to 1.2ns line-to-line data skew. Multiple transceivers can be integrated onto a single ASIC chip to dramatically improve the per chip bandwidth. Furthermore, the P-Link transceiver requires no external termination resistors. The latency through the link is less than 8ns because there is no need for data encoding and decoding.

The P-Link multi-drop parallel channel (P-Link MD) is a variation on the P-Link that supports a multi-drop configuration where one transmitter drives multiple receivers.

S-Link Macro

Fig. 13A is an abstract drawing illustrating one functional view of the S-Link macro 4100 of Fig. 5A. Table 1 itemizes the four link-side signal wires per S-Link transceiver, consisting of a complementary wire-pair for each direction of transmission. The link-side interface uses a CMOS transceiver designed to drive up to 30 meters of cable at 2.5Gbps per direction on a single differential pair. The transceiver is designed in a standard 0.25 µm CMOS process with power dissipation of 350mW. The transceiver has a transmitter pre-equalization circuit for compensating high frequency attenuation through PCB or cable. The transceiver supports both AC and DC coupled transmission, and both 50 ohm and 75 ohm transmission lines. The S-Link macro has built-in comma detection and framing logic. Table 2 details the parallel core-side interface, which includes 20 bits of payload, running at 125MHz. The transceiver can also operate in 1.25Gbps mode with the parallel interface running at 62.5MHz. An automatic locked-to-reference feature allows the receiver to tolerate very long streams of consecutive 1's or 0's. A frequency difference up to +/-200ppm between two ends of the serial link is tolerable. Multiple S-Link macros can be integrated into a single ASIC chip to achieve any desired bandwidth. There is an internal loopback mode for at-speed BIST.


Table 1. S-Link 4000 Component Signals

Signal Name    Signal Type (relative to S-Link Macro)    Signal Description
TXP            O                                         Transmit Differential Signal (+ polarity)
TXM            O                                         Transmit Differential Signal (- polarity)
RXP            I                                         Receiver Differential Signal (+ polarity)
RXM            I                                         Receiver Differential Signal (- polarity)

Table 2. S-Link Macro Core Interface 4900 Component Signals

Signal Name    Signal Type (relative to S-Link Macro)    Signal Description
TX[19:0]       O                                         Data out of the core, synchronized to the core clock
RX[19:0]       I                                         Data into the core from RX+, synchronized to the core clock
RCLKP          O                                         Differential Clock output to be used by core (+ polarity)
RCLKM          O                                         Differential Clock output to be used by core (- polarity)
BIAS_CTL       I                                         Adjusts transmit swing
COM_DET        O                                         Comma detected; used for word alignment
EN_CDET        I                                         Enables comma detection by macro
REFCLK         I                                         Reference clock

S-Link Packet Format

Fig. 13B illustrates the minimum length S-Link packet format. Fig. 13C illustrates the full-length S-Link packet format. Variable-length packets are used for carrying payload to and from the switch fabric, via the S-Links.

Unlike a cell payload (discussed below), the packet payload can have variable length. The length of the packet payload (PP) can vary from 0 to the cell payload length in 8-byte granularity. PP will be addressed with byte granularity. For example, PP0 will be byte 0 of the PP. The cell payload length is 72 bytes. Packet payloads shorter than the cell payload length will first be padded to the full cell payload length before being switched within the switch fabric.
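The relationship between the variable packet payload, its length in 8-byte units (as carried in the L[3:0] field of the packet control header), and the fixed 72-byte cell payload can be sketched as follows. The function names and the zero fill byte are assumptions; the text does not say what value is used for padding.

```python
CELL_PAYLOAD_BYTES = 72   # fixed cell payload length, from the text
GRANULE_BYTES = 8         # packet payload length granularity, from the text


def length_field(payload):
    """Payload length in 8-byte units (0..9), as it would fit a 4-bit field."""
    n = len(payload)
    if n > CELL_PAYLOAD_BYTES or n % GRANULE_BYTES:
        raise ValueError("payload must be a multiple of 8 bytes, at most 72")
    return n // GRANULE_BYTES


def pad_to_cell(payload, fill=0x00):
    """Pad a packet payload to the full 72-byte cell payload (fill byte assumed)."""
    length_field(payload)  # validates the length
    return bytes(payload) + bytes([fill]) * (CELL_PAYLOAD_BYTES - len(payload))
```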

The packet control header (PCH) is 96 bits long and carries control information related to the payload and backpressure information. PCH will be addressed with both byte granularity and bit granularity. For example, PCH0 will be byte 0 of the PCH, and PCH[7:0] will be bits 0-7 of the PCH. Tables 3 and 4 provide detailed definitions for the PCH fields.

Table 4. Type 2 Packet Control Header Field Definitions (Outgoing S-Link from Qchip to Line Card Network Interface)

Field Name        Description

CMD[2:0]          Command.
                  0-5: RSVD
                  6: VALID. Indicates that the packet is a valid packet.
                  7: IDLE. Indicates that the packet is an IDLE packet.

L[3:0]            Length. 4-bit field used to indicate the length of the payload in 8-byte granularity.

LDES[1:0]         Local destination port number.

MBP_MAP[7:0]      Multicasting backpressure bit map. Used to indicate the backpressure information of the multicasting queues at an ingress port. Note that each priority of an OC-192 (or 4 OC-48 ports) has a separate multicasting queue and can be backpressured by setting a bit in the bitmap.

P[2:0]            Priority.

UBP_BASE[2:0]     Unicasting backpressure base address. Used together with the 64-bit backpressure map to indicate which egress ports (out of a maximum of 256 ports) have backpressured.

UBP_MAP[63:0]     Unicasting backpressure bit map. Used to indicate the backpressure information to an ingress port. Together with UBP_BASE[2:0], the switch fabric can broadcast system-wide 512-bit backpressure information in 8 cell times. Each bit in the bitmap indicates the backpressure information in an output queue. The detailed encoding is defined in the later chapter.

P-Link Macro

Fig. 14A is an abstract drawing illustrating one functional view of the P-Link macro 3100 of Fig. 5A. Table 5 provides detailed definitions of the P-Link side signals. Table 6 provides detailed definitions of the core-side signals. The P-Link macro is a scalable parallel data transceiver with an aggregate bandwidth of 32Gbps, or 16Gbps in each direction. The transmitter serializes 200MHz data into five pairs of differential data lines. A 200MHz transmitter phase-locked loop (PLL) generates equally distributed eight-phase clocks to multiplex 200MHz data into a 1.6Gbps data stream. The data is then transmitted at a data rate eight times the system clock over differential signal lines. A delay-locked loop (DLL) at the receiver retrieves the clock and data, latching incoming data through high-speed differential receiver/latch modules using the eight phase clocks. The channel is optimized for ease-of-use and efficient, low Bit Error Rate data transmission in high performance systems. An on-chip timing calibration circuit performs data de-skewing and timing optimization. Low swing differential signaling further reduces noise and error probability and therefore relaxes the restrictions of board design.

Table 6. P-Link Macro Core Interface 3900 Component Signals

Signal Name    Signal Type (relative to P-Link Macro)    Signal Description
TX[39:0]       I                                         Transmit data from core, synchronized to core clock
RX[39:0]       O                                         Receiver data to core, synchronized to core clock
REFCLK                                                   Reference clock
BIAS_CTL                                                 Adjusts transmit swing

P-Link Cell Format

Fig. 14B illustrates the P-Link cell format. Fixed length cells are used for carrying payload from ingress ports to egress ports within the switch fabric via the P-Links. When a cell is transferred on a link within the switch fabric, two types of information are transferred: the information that is independent of the physical links and the information that is dependent on the physical links.

The physical link independent portion of the cell is divided into 3 fields while the physical link dependent portion of the cell defines how the 3 fields of a cell are transferred on a physical link.

The cell payload (CP) is the payload that will be transferred within the switch fabric. CP will be addressed with byte granularity. For example, CP0 will be byte 0 of the CP. The length of the cell payload is fixed at 72 bytes. The packet payload can be of variable length, from 0 to the cell payload length of 72 bytes. Packet payloads shorter than the cell payload length will first be padded to the full cell payload length before being switched within the switch fabric.

The ingress processing logic of the Qchip will add a 32-bit cell payload header (CPH) to the cell payload. CPH will be addressed with bit granularity. For example, CPH[7:0] will be bits 0-7 of the CPH. The switch fabric will then switch the cell payload and cell payload header to the target egress port as pure data without being looked at or modified. The egress processing logic of the Qchip needs the information in the cell payload header for processing the cell. Table 7 provides detailed field definitions for the CPH.

Together with the cell payload and cell payload header, a cell also contains the cell control header (CCH) for request, grant, backpressure control and other control purposes. CCH will be addressed with bit granularity. For example, CCH[7:0] will be bits 0-7 of the CCH. The cell control header is 89 bits long. Tables 8 and 9 provide detailed field definitions for the CCH.

SOC bit is the cell-framing signal. It is used to indicate the start of a cell. E bit is used to indicate whether the cell being transferred is an erroneous cell. This bit will be updated along the path. When set to 1, the E bit indicates that the cell contains an error even if there is no parity error detected on the cell. Odd horizontal parity is used for error protection. PAR[0] covers P-Link bits [18:0], PAR[1]

MREQ_SRC            Multicasting Request Source. Used to indicate which OC-192 port of a Dual-port Qchip the multicasting request comes from.

UBP           32    Unicasting backpressure. Used to indicate the status of the unicasting queues in the egress portion of the Qchip.

UREQ0               Unicasting Request 0. Set to 1 to indicate a valid unicasting request.

UREQ1               Unicasting Request 1.

UREQ_ID0            Unicasting Request ID 0. 6-bit field encoded with the VOQ number.

UREQ_ID1            Unicasting Request ID 1.

UREQ_PRI0           Unicasting Request Priority 0.

UREQ_PRI1           Unicasting Request Priority 1.

Table 9. Cell Control Header (CCH) Field Definitions - (Incoming to Qchip from Switch Fabric Core)

Field Name # Bits Description

CMD                 Command.
                    0-5: RSVD
                    6: GNT. Indicates that the CCH contains unicasting and multicasting grants.
                    7: IDLE. Indicates that the CCH is an IDLE CCH.

MGNT_BAS            Multicasting Grant Base Address.

MGNT_MAP      16    Multicasting Grant Bitmap. Bit map indicating a multicasting grant.

MGNT_PRI            Multicasting Grant Priority. Used to indicate the priority of the multicasting grant.

MGNT_QID            Multicasting Grant Queue ID.

MGNT_SRC            Multicasting Grant Source. Used to indicate which OC-192 port of a Dual-port Qchip the multicasting grant is for.

SSN                 System Sequencing Number.

UGNT                Unicasting Grant. Set to 1 to indicate a valid unicasting grant.

UGNT_ID             Unicasting Grant ID. VOQ number.

UGNT_PRI            Unicasting Grant Priority.

Link Physical Design

Fig. 15 illustrates the logic of the S-Link macro 4100 of Fig. 13A. Figs. 16A through 16I detail logic, circuitry, and behavioral aspects of the P-Link macro 3100 of Fig. 14A. Fig. 16A illustrates the logic of the P-Link macro 3100 of Fig. 14A. The transmitter 10,100 serializes 40-bit 200MHz data into 5 pairs of differential data lines. A 200MHz transmitter phase-locked loop (PLL) 10,110 generates equally distributed 8 phase clocks to multiplex 200MHz data into a 1.6Gb/s data stream. (In an alternate embodiment, 16 phases are used. In the 16 phase embodiment, phase 16, phase 13, and phase 3 are respectively used instead of phase 8, phase 7, and phase 2 of the 8 phase embodiment illustrated herein.) Fig. 16B is a different view of the transmitter section 10,100 of the P-Link macro 3100 of Fig. 16A. Fig. 16C illustrates the internal circuitry of the differential transceiver 10,127 of the transmitter section 10,100 of Fig. 16B.
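The serialization figures are related by simple arithmetic: each differential pair carries one bit per clock phase. The sketch below, with hypothetical function names, treats the number of pairs as a parameter, since five pairs correspond to a 40-bit core word and 8Gbps per direction, while ten pairs would correspond to an 80-bit word and 16Gbps per direction.

```python
CORE_CLOCK_MHZ = 200   # parallel-side clock, from the text
NUM_PHASES = 8         # PLL clock phases used for multiplexing, from the text


def pair_rate_gbps(core_clock_mhz=CORE_CLOCK_MHZ, phases=NUM_PHASES):
    """Serial rate on one differential pair: one bit per clock phase."""
    return core_clock_mhz * phases / 1000


def link_rate_gbps(pairs):
    """One-direction bandwidth of a link with the given number of data pairs."""
    return pairs * pair_rate_gbps()


def core_width_bits(pairs, phases=NUM_PHASES):
    """Width of the parallel word presented to the core each clock cycle."""
    return pairs * phases
```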

In the transmitter section, a single stage differential 8-to-1 multiplexer/predriver 10,124 is used. The data driver 10,128 is a constant current, differential current steering driver with matched impedance termination resistors 10,134 to termination voltage (Vt). Vt is typically the same as Vdd, and can be lower as long as it meets the receiver input common-mode range. A process/voltage/temperature (PVT) compensated current source driven by TXBIAS 3130 is used to generate the output bias current. Because of constant current drive, power and ground noise due to simultaneous output switching is greatly reduced; therefore, the number of power and ground pins required is reduced compared to other implementations. A dynamic impedance matching mechanism is utilized to obtain the best match between the termination resistor and the transmission line impedance. The transmitter also has a 1-bit pre-equalization circuit 10,129 that amplifies signal swing when it switches. The pre-equalization circuit provides compensation for the high-frequency loss through board traces or cables. This maximizes the data eye opening at receiver inputs. Fig. 16D illustrates the voltage waveform for the link as observed from the transmitter output. Fig. 16E illustrates the voltage waveform for the link as observed from the receiver input at the opposite end of the link relative to the observation point of Fig. 16D. 2-bit transmitter swing control 3125 provides 4 levels of signal swing to compensate for signal attenuation through transmission lines. The transmitter forwards a 200MHz clock 10,115 in addition to transmitting 10 bit data.

Fig. 16F is a different view of the receiver section 10,200 of the P-Link macro 3100 of Fig. 16A. Fig. 16G illustrates the internal circuitry of the receiver section 10,200 of Fig. 16F. Signals are terminated at the receiver in addition to the transmitter. A delay-locked loop (DLL) at the receiver regenerates 8 phase clocks from the incoming 200MHz transmitted clock. Incoming data is latched through high-speed differential receiver/latch modules using these 8 phase clocks. Synchronization with the receiving chip's 200MHz core clock is performed after the data capture. The receiver cell, shown in Fig. 16G, is a high-speed differential latch. It senses and latches incoming data at the capture_clock's rising edge. Q and its complement are pre-charged to Vdd while the capture_clock is low. This receiver is capable of sensing a differential input of 80mV and requires a very small data valid window.

The main function of calibration is data de-skewing. This function is achieved through two steps. The first step is bit clock optimization. 8 phase clocks are globally distributed to 10 receiver cells, as shown in Figure X. At each receiver, the clock can be delayed up to 1-bit time (625ps at 1.6Gb/s data rate) before the input data latch uses it. 4-bit control provides a bit clock adjustment resolution of 40ps. In addition to the 8 clock phases that are used for data capture, a 90-degree phase shifted version of the 8 phase clocks is also used during the calibration to achieve 2X oversampling. During the calibration, a dedicated training pattern is used. The receiver captures 16 data samples within one 200MHz clock cycle. This data is used to determine the receiver timing and allow calibration logic to optimize the local clock delay in order to center the capture clock in the center of the data eye. Once the bit timing is achieved, the second step follows: control logic determines the byte alignment for each receiver cell and between different receiver cells. This assures that 80-bit data output at the receiver matches the 80-bit data input at the transmitter. Overall, the calibration can correct up to 1.2ns data skew at 1.6Gb/s data transfer rate.
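The timing numbers in the bit clock optimization step can be reproduced as shown below, together with one plausible way of centering the capture clock. The eye-centering routine is purely a hypothetical illustration (pick the middle of the longest run of delay settings at which the training pattern is received correctly); the disclosure does not describe the calibration logic at this level, and all names are assumptions.

```python
def bit_time_ps(rate_mbps=1600):
    """Duration of one bit in picoseconds (625 ps at 1.6Gb/s)."""
    return 1_000_000 / rate_mbps


def delay_step_ps(control_bits=4, rate_mbps=1600):
    """Delay resolution when one bit time is divided into 2**control_bits steps."""
    return bit_time_ps(rate_mbps) / (2 ** control_bits)


def center_of_eye(passes):
    """Return the delay code in the middle of the longest passing run, or None.

    passes[i] is True if the training pattern was captured correctly at code i.
    Hypothetical selection rule; runs are not treated as wrapping around.
    """
    best_start, best_len, start = None, 0, None
    for i, ok in enumerate(list(passes) + [False]):  # sentinel closes a trailing run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2
```

A 4-bit control over a 625ps bit time gives steps of about 39ps, consistent with the quoted 40ps resolution.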

Fig. 16H illustrates the logic within the receiver synchronization circuit 10,221 of Figs. 16A and 16F. Fig. 16I illustrates a detail of the operation of the synchronization circuit 10,221 of Fig. 16H. The received data is de-serialized and latched using phase 8. A clock phase detector is used to determine the phase relationship between the core clock and the received clock. If the core clock rising edge is between phase 7 and phase 2, the data is delayed using a phase 7 latch, and then registered using the next rising edge of the core clock. If the core clock rising edge is between phase 2 and phase 7, the data is registered into the core clock domain without a delay. This circuit achieves minimum latency with the link. It also allows drifting of the clock up to a bit time without need of re-calibration. This design assumes that chips on both sides of the link obtain reference clocks from the same clock source. A data FIFO would have to be added into the receiver data path to handle the synchronization if the two chips do not share the same clock source. As mentioned earlier, the transport layer of logic handles errors. Error correction is achieved through retry. Once transport logic determines that the link is not operating at its optimum condition, it issues a re-calibration command to the transceiver macro.

REDUCED-SCALE VARIATION

EXPANDED-SCALE CROSSBAR-CARD VARIATION

For larger systems (roughly greater than 320Gbps at present), a preferred illustrative embodiment uses Mux chips (Mchips) for doing bit slicing and protocol conversion between the Qchips and Xchips. For smaller capacity systems (roughly less than or equal to 320Gbps at present), the parallel channels (P-Links) alone are sufficient and preferred, resulting in more compact designs with fewer chips.

Fig. 21 illustrates a router/switch 9000 using the switch fabric 9205 of Fig. 20, in which the network interface 9100 and the Qchip 2000 are both implemented on Line Card 9150. Fig. 22 illustrates a more general configuration of the switch fabric configuration 9205 of Fig. 20, emphasizing that within the scope of the invention different numbers of Qchips and Xchips are possible.

Fig. 23 illustrates a more general configuration of router/switch 9000 of Fig. 21, emphasizing that within the scope of the invention the Qchips need not be implemented on the line cards.

Fig. 24 illustrates the internal architecture of the crossbar cards 5000 of Fig. 20, using a particular number of Mchips 6000 and Xchips 1000. Each crossbar card constitutes an independent switching domain. It includes multiple sliced crossbars and an external centralized scheduler. Specifically, Xchip 1000-3 acts as an external centralized scheduler for the other two Xchips 1000-1 and 1000-2 on the card. External scheduler 1000-3 sends crossbar configuration information to the Xchips 1000-1 and 1000-2 synchronously every cell time via the dedicated high-speed crossbar configuration bus 5030, implemented using a P-Link MD channel.

MUX CHIP (MCHIP) INTERNAL ARCHITECTURE

Fig. 26A illustrates the internal architecture of Mchip 6000 of Fig. 16C. Fig. 26B provides additional detail of logic block 6100 of Mchip 6000 of Fig. 26A. Fig. 26C provides additional detail of logic block 6300 of Mchip 6000 of Fig. 26A. Fig. 26D provides additional detail of logic block 6400 of Mchip 6000 of Fig. 26A.

On the Qchip side of the Mchip, there are N P-Links. On the Xchip side of the Mchip, there are N+1 P-Links: N of them are used to connect to the 4 crossbar chips (Xchips), and the (N+1)th P-Link is used to connect to an Xchip dedicated to use as a scheduler chip for the other Xchips on the card (within the switching domain). In the Mchip->Xchip direction, the Mchip receives cells from the 4 P-Links on the Qchip side and forwards them across the first 5 P-Links on the Xchip side. The CP and CPH portions of the cells are forwarded in a bit-sliced fashion across the 4 P-Links to the crossbar chips. The CCH portions of the cells are forwarded on the 5th P-Link to the scheduler chip.

In the Xchip->Mchip direction, the Mchip receives the CP and CPH portions of the cells from the crossbar chips on the 4 P-Links and the CCH portion of the cells from the scheduler chip on the 5th P-Link. The Mchip assembles a complete cell from the CP, CPH and CCH received and forwards the cell to the appropriate Qchip via one of the 4 P-Links on the Qchip side.
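The bit-sliced forwarding and reassembly performed by the Mchip can be illustrated as below. This is a sketch under stated assumptions: the slicing granularity (bits dealt round-robin, most significant bit first) and the function names are hypothetical, since the text says only that the CP and CPH are forwarded in a bit-sliced fashion across the 4 P-Links.

```python
NUM_SLICES = 4  # crossbar P-Links carrying CP and CPH, from the text


def slice_bits(data, slices=NUM_SLICES):
    """Deal the bits of data (MSB first) round-robin onto the given lanes."""
    lanes = [[] for _ in range(slices)]
    for i in range(len(data) * 8):
        bit = (data[i // 8] >> (7 - i % 8)) & 1
        lanes[i % slices].append(bit)
    return lanes


def merge_bits(lanes):
    """Inverse of slice_bits: interleave the lane bits back into bytes."""
    slices = len(lanes)
    total = sum(len(lane) for lane in lanes)
    out = bytearray(total // 8)
    for i in range(total):
        bit = lanes[i % slices][i // slices]
        out[i // 8] |= bit << (7 - i % 8)
    return bytes(out)
```

A 72-byte CP plus a 4-byte CPH is 608 bits, so each of the 4 lanes carries 152 bits per cell under this assumed scheme.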

LOSSLESS BACKPRESSURE PROTOCOL

The switch fabric supports a lossless backpressure protocol on a per class per port basis. The backpressure information can be generated by the line card (network processor) at the egress port and by the switch fabric internal queues. The unicasting and multicasting backpressure information can be transferred all the way back to the line card (network processor) at the ingress port.

The switch fabric utilizes the lossless backpressure protocol for flow control. The backpressure can come from either the line card at the egress port or the switch fabric internal queues. The line card backpressure is 8 bits per port, carried in the BP[7:0] field of the PCH, and indicates the status of the 8 priority queues on the line card. The LSRC[1:0] field is used to identify which OC-48 port is sending the BP[7:0]; LSRC[1:0] should be set to 0 for an OC-192 port.

The 8-bit backpressure information is applied directly on the OQ (in the Qchip) for the specific port. The backpressured sub-OQs will stop sending data to the egress port. Note that in the switch fabric, the OQs are egress port specific (an OQ for either an OC-192 port or an OC-48 port). Virtual Input Queues (VIQs) are not implemented at the egress ports. Although VIQs can be used to achieve fairness among ingress ports at an egress port, the backpressure protocol and design complexity would go up dramatically and unnecessarily due to the large number of queues that would need to be maintained on a single Qchip. Instead of using VIQs to achieve fairness, a user can map different types of traffic to different priorities for the same purpose. For example, telnet traffic and ftp traffic can be mapped to different priorities (the switch fabric has a total of 8 priorities) to guarantee a timely response to the telnet traffic under heavy ftp traffic load. Fairness can also be achieved by higher-level traffic management protocols. Another argument for not implementing the VIQs is that although fairness among ingress ports at an egress port is better achieved with VIQs within a switch fabric, the problem still exists once the cells (packets) leave the switch fabric. This is because, once they leave a switch fabric, the packets can no longer be distinguished by ingress port number.
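The effect of the 8-bit backpressure on the sub-OQs of one port can be expressed as a bitmask operation. The sketch below is illustrative; the function names are hypothetical, and treating bit 0 as the highest priority in the strict-priority pick is an assumption not stated in the text.

```python
def eligible_priorities(nonempty_mask, bp_mask):
    """Bitmask of sub-OQs that hold cells and are not backpressured.

    Bit p of each 8-bit mask refers to priority p.
    """
    return nonempty_mask & ~bp_mask & 0xFF


def next_priority(nonempty_mask, bp_mask):
    """Strict-priority pick among eligible sub-OQs (bit 0 highest, assumed)."""
    eligible = eligible_priorities(nonempty_mask, bp_mask)
    if eligible == 0:
        return None
    return (eligible & -eligible).bit_length() - 1  # index of the lowest set bit
```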

The switch fabric uses OQs for the unicasting traffic. For each OC-192 egress port, the switch fabric also has an MOQ for multicasting traffic. When an OC-192 port is configured as 4 OC-48 ports, the OC-48 ports will share the same MOQ. The BP[7:0] will be applied to both the OQ and MOQ for the specific egress port. Note that there may be head-of-line blocking in the MOQ if an OC-192 port is configured to be quad-OC-48. In this case, a backpressured OC-48 port will block the entire MOQ (for the specific priority).

The line card backpressure information does not directly result in the generation of internal backpressure. The internal backpressure from the egress portion of the Qchip to the Schedulers (Xchip) will come from the OQs and MOQs. The Qchip to Scheduler backpressure information is carried in the MBP[7:0], UBP[31:0] and BP_BAS fields of the CCH and transferred from the Qchip to the Scheduler in two cell times. The MBP and UBP fields of the CCH are simply the status of the OQ and MOQ.

A Qchip will always generate the unicasting backpressure (UBP) information for the OQs. However, the Qchip can be programmed through CSRs on a per OC-192 port basis to select whether to generate multicasting backpressure (MBP) information to the Scheduler. The reason is that there could be head-of-line blocking in the multicasting queues once an egress port is blocked. Disabling the generation of multicasting backpressure on all egress ports can eliminate the head-of-line blocking in the multicasting queues. However, disabling the multicasting backpressure on an egress port will make the multicasting traffic to that port lossy. The Qchip will drop any multicasting cells destined for an MOQ if the MOQ is backpressured and the generation of backpressure information to the Scheduler for that MOQ is disabled. There could be on-the-fly cells coming to a Qchip even after the backpressure information has been generated and transferred to the Scheduler. The maximum number of on-the-fly cells towards an OQ or MOQ is the round-trip delay from Qchip to Qchip, in units of cell time, multiplied by the number of switching domains. To make the backpressure protocol lossless, the adaptive dynamic threshold logic for the OQs and MOQs will always reserve a certain number of entries in the two on-chip SRAM blocks. The number of reserved entries from each SRAM block is programmable through CSRs to be 0, 32, 64 or 128 entries.
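The sizing rule for the reserved entries can be worked through with a short sketch. The function names and the numbers in the usage line (a round trip of 6 cell times and 8 switching domains) are hypothetical and chosen only for illustration. How the requirement is divided across the two SRAM blocks is not assumed here, so the helper takes a per-block requirement as its input.

```python
from typing import Optional

# Programmable reserve sizes per SRAM block, as listed in the text.
RESERVE_OPTIONS = (0, 32, 64, 128)


def max_in_flight_cells(round_trip_cell_times: int, num_domains: int) -> int:
    """Upper bound on cells still arriving after backpressure has been raised."""
    if round_trip_cell_times < 0 or num_domains < 0:
        raise ValueError("inputs must be non-negative")
    return round_trip_cell_times * num_domains


def smallest_reserve(needed_entries: int) -> Optional[int]:
    """Smallest programmable reserve that covers needed_entries, or None if none does."""
    for option in RESERVE_OPTIONS:
        if option >= needed_entries:
            return option
    return None


# Hypothetical figures: a 6-cell-time round trip and 8 switching domains.
needed = max_in_flight_cells(round_trip_cell_times=6, num_domains=8)
print(needed, smallest_reserve(needed))
```

A `None` result indicates that no programmable setting covers the requirement, in which case the protocol could not be guaranteed lossless for that configuration.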

The Schedulers in each switching domain maintain the complete backpressure information of the entire switch fabric. The system-wide backpressure information will be updated every 2 cell times by the UBP and MBP fields in the CCH. The requests in the RVOQs and RMIQs of the Xchips will be selectively masked by the backpressure information. As a result, the backpressured requests will not be served (or scheduled) by the crossbar scheduler. In the Xchip, the backpressure information is applied to the unicasting requests in units of sub-VOQs. Multiple ingress ports may share the same VOQ in different configurations. In such configurations, multiple ingress ports may be backpressured by a single egress port. The backpressure information is applied to the multicasting requests in units of OC-192 ports. As in the MOQ case, there could be head-of-line blocking in the multicasting RMIQs.
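The masking step can be illustrated with a bitwise sketch. The representation is an assumption made for the example: pending requests and backpressure status are each modeled as an integer bitmap with one bit per sub-VOQ, and a set backpressure bit is taken to mean "do not schedule". The function name `eligible_requests` is hypothetical.

```python
def eligible_requests(pending: int, backpressure: int, width: int) -> int:
    """Return the pending requests that remain schedulable after masking.

    pending      -- bitmap of queues with at least one pending request
    backpressure -- bitmap of backpressured queues (assumed: 1 = backpressured)
    width        -- number of queues represented in the bitmaps
    """
    field_mask = (1 << width) - 1
    return pending & ~backpressure & field_mask


# Four queues: requests pending on queues 0, 1 and 3; queue 1 is backpressured.
print(bin(eligible_requests(0b1011, 0b0010, width=4)))
```

Only the masked result is presented to the scheduler in this sketch, so a backpressured request stays queued and becomes eligible again once its backpressure bit clears.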

Between the egress portion of the Qchip and the Scheduler, the backpressure protocol is achieved by broadcasting the UBP and MBP in the CCH from the Qchip to the Scheduler. The backpressure protocol from the Scheduler to the ingress portion of the Qchip is different and is designed to be credit based. The P-Link interfaces in the Qchips keep track of pending requests in the Xchip. The Qchip will stop sending requests once there is no room in the Xchip request queues.
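A credit-based scheme of this kind can be sketched as a counter of pending requests bounded by the depth of the request queue. The class name `RequestCredits`, the queue depth used in the usage lines, and the assumption that the Qchip learns a request has been served through some returned indication (for example a grant) are all illustrative rather than taken from the design.

```python
class RequestCredits:
    """Tracks requests outstanding in one Xchip request queue (illustrative)."""

    def __init__(self, queue_depth: int) -> None:
        if queue_depth <= 0:
            raise ValueError("queue depth must be positive")
        self.queue_depth = queue_depth
        self.pending = 0

    def try_send(self) -> bool:
        """Send a request only if the Xchip queue still has room."""
        if self.pending >= self.queue_depth:
            return False  # no credit left: hold the request in the Qchip
        self.pending += 1
        return True

    def on_request_served(self) -> None:
        """Free one slot when the Xchip reports that a request was served."""
        if self.pending == 0:
            raise RuntimeError("no outstanding request to release")
        self.pending -= 1


credits = RequestCredits(queue_depth=2)  # hypothetical depth
print(credits.try_send(), credits.try_send(), credits.try_send())
credits.on_request_served()
print(credits.try_send())
```

Because the sender never issues a request without room at the receiver, no request is dropped and no explicit stop signal from the Xchip is needed.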

Once the Qchips stop forwarding requests to the Xchips, the VOQs and the MIQs will eventually be backpressured by the adaptive dynamic threshold logic. The Qchip will broadcast the 512-bit unicasting backpressure information (64 VOQs, 8 sub-VOQs per VOQ) to the line card in the UBP_MAP[63:0] and UBP_BASE[2:0] fields of the PCH in 8 cell times. The multicasting backpressure is generated from the MIQs and broadcast to the line card in the MBP_MAP[7:0] field of the PCH every cell time. Each bit of the MBP_MAP[7:0] corresponds to a priority of an OC-192 port. The unicasting backpressure information is per VOQ. The multicasting backpressure is per OC-192 port. When an OC-192 port is configured as quad OC-48 ports, the MBP_MAP[7:0] field will backpressure all 4 OC-48 ports.
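Since 64 VOQs times 8 sub-VOQs gives 512 bits, and each cell time carries 64 of them, the full map takes 8 cell times. One way a line card might reassemble it is sketched below. The bit layout is an assumption made purely for illustration: UBP_BASE is taken to select a 64-bit slice of a flat 512-bit map, and the flat index of a queue is taken to be voq * 8 + sub_voq. The actual mapping of slices to VOQs and sub-VOQs may differ, and the class name is hypothetical.

```python
class UnicastBackpressureMap:
    """Reassembles a 512-bit map from eight (UBP_BASE, UBP_MAP) pairs (illustrative)."""

    SLICES = 8       # UBP_BASE[2:0] selects one of 8 slices
    SLICE_BITS = 64  # UBP_MAP[63:0] carries 64 bits per cell time

    def __init__(self) -> None:
        self.slices = [0] * self.SLICES

    def update(self, ubp_base: int, ubp_map: int) -> None:
        """Store the slice received in one cell time."""
        if not 0 <= ubp_base < self.SLICES:
            raise ValueError("UBP_BASE is a 3-bit field")
        if not 0 <= ubp_map < (1 << self.SLICE_BITS):
            raise ValueError("UBP_MAP is a 64-bit field")
        self.slices[ubp_base] = ubp_map

    def is_backpressured(self, voq: int, sub_voq: int) -> bool:
        """Assumed layout: flat index = voq * 8 + sub_voq, 64 bits per slice."""
        if not (0 <= voq < 64 and 0 <= sub_voq < 8):
            raise ValueError("expected 64 VOQs with 8 sub-VOQs each")
        base, bit = divmod(voq * 8 + sub_voq, self.SLICE_BITS)
        return bool((self.slices[base] >> bit) & 1)


ubp = UnicastBackpressureMap()
ubp.update(ubp_base=1, ubp_map=1 << 3)  # flat bit 67, i.e. VOQ 8, sub-VOQ 3
print(ubp.is_backpressured(8, 3), ubp.is_backpressured(8, 2))
```

Under this layout a given sub-VOQ's status is refreshed once every 8 cell times, which is the latency that the reserved entries for on-the-fly cells need to absorb.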

Again, the adaptive dynamic threshold logic will reserve entries for on-the-fly cells (packets) from the line card to the Qchip. The number of entries reserved in the SRAM block for the VOQs and MIQs can be programmed via CSRs to be 0, 32, 64 or 128.

CONCLUSION

Although the present invention has been described using particular illustrative embodiments, it will be understood that many variations in construction, arrangement and use are possible within the scope of the invention. For example, the number of units, banks, ways, or arrays, and the size or width, number of entries, number of ports, speed, and type of technology used may generally be varied in each component block of the invention. Also, unless specifically stated to the contrary, the value ranges specified, and maximum and minimum values used, are merely those of the illustrative or preferred embodiments and should not be construed as limitations of the invention. Specifically, other embodiments may use different port densities (switch sizes), minimum and maximum data rates (link speeds), data unit sizes, and control or data bit-widths. Functionally equivalent techniques known to those skilled in the art may be employed instead of those illustrated to implement various components. The names given to interconnect and logic are illustrative, and should not be construed as limiting the invention. The present invention is thus to be construed as including all possible modifications and variations encompassed within the scope of the appended claims.

Claims

We Claim:

1) A device for transferring messages between a first and second port, including:

a plurality of network interfaces, including one of said interfaces for each of said first and second ports; a switch fabric coupling the network interfaces, the switch fabric including an interconnect network having parallel point-to-point links; and a plurality of serial links coupling the network interfaces to the switch fabric.

2) A switch fabric having:

a plurality of queuing chips; a plurality of serial links coupling the queuing chips to the outside world via network interfaces operating at a first rate; a plurality of crossbar chips operating at a second rate slower than the first rate; and an interconnect network of parallel point-to-point links coupling the queuing chips and crossbar chips.

3) The switch fabric of claim 2, wherein the parallel links couple each queuing chip to every crossbar chip and every crossbar chip to every queuing chip.

4) A queuing module for a switch fabric implemented on a single integrated circuit, having:

serial link interface ports for coupling to multiple network interfaces; parallel link interface ports for coupling to multiple switching domains; outgoing logic for merging messages input from the multiple network interfaces and routing them to the multiple switching domains; and incoming logic for coupling the messages arriving from the multiple switching domains to a message designated network interface.

5) A method of providing fault tolerance in a switch fabric, including:

providing a switch fabric having multiple independent switching domains and links; providing a route/switch processor; monitoring for one or more malfunctions in the links and switching domains with the route/switch processor; detecting one of said malfunctions; and dynamically redirecting traffic to the remaining working links and domains.

6) A method of operating a switch fabric, including:

providing multiple mux chips having integral parallel point-to-point link interfaces; providing a first and second crossbar chip used in combination as a switching domain; providing a third crossbar chip used as an external centralized crossbar scheduler; wherein each crossbar chip has integral parallel point-to-point link interfaces and an integral parallel multi-drop link interface; providing a multi-drop link channel; and sending crossbar configuration information to the first and second crossbar chips synchronously every cell time via the multi-drop link channel.