Current Assignee: Oracle International Corp
Original Assignee: Oracle International Corp
Priority claimed from US10/889,468 (US7757033B1)
Application filed by Oracle International Corp
Priority to US11/057,036 (US8868790B2)
Priority to US11/256,269 (US7398380B1)
Priority to US11/256,646 (US7990994B1)
Priority to US11/256,688 (US7843907B1)
Priority to US11/256,645 (US7843906B1)
Assigned to FABRIC7 SYSTEMS, INC. Assignors: SHINGANE, MANGESH; SHAH, SHREYAS B.; LOVETT, THOMAS DEAN; WHITE, MYRON H.; JAGANNATHAN, RAJESH K.; MEHROTRA, SHARAD; NICOLAOU, COSMOS; SARAIYA, NAKUL PRATAP
Priority to US11/736,355 (US8713295B2)
Priority to US11/736,281 (US7872989B1)
Assigned to Habanero Holdings, Inc. Assignor: FABRIC7 (ASSIGNMENT FOR THE BENEFIT OF CREDITORS), LLC
Assigned to FABRIC7 (ASSIGNMENT FOR THE BENEFIT OF CREDITORS), LLC Assignor: FABRIC7 SYSTEMS, INC.
Priority to US12/954,655 (US8218538B1)
Priority to US13/007,977 (US8489754B2)
Priority to US13/544,696 (US8743872B2)
Assigned to ORACLE INTERNATIONAL CORPORATION Assignor: Habanero Holdings, Inc.
Publication of US20130107872A1
Publication of US8868790B2
Classifications (CPC): H—ELECTRICITY; H04—ELECTRIC COMMUNICATION TECHNIQUE; H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
H04L 12/00: Data switching networks
H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
H04L 12/46: Interconnection of networks
H04L 12/4641: Virtual LANs, VLANs, e.g. virtual private networks [VPN]
H04L 12/4645: Details on frame tagging
H04L 1/00: Arrangements for detecting or preventing errors in the information received
H04L 1/004: Arrangements for detecting or preventing errors by using forward error control
H04L 1/0056: Systems characterized by the type of code used
H04L 1/0061: Error detection codes
H04L 49/00: Packet switching elements
H04L 49/10: Packet switching elements characterised by the switching fabric construction
H04L 49/111: Switch interfaces, e.g. port details
H04L 49/113: Arrangements for redundant switching, e.g. using parallel planes
H04L 49/116: Transferring a part of the packet through each plane, e.g. by bit-slicing
H04L 49/118: Address processing within a device, e.g. using internal ID or tags for routing within a switch
H04L 49/35: Switches specially adapted for specific applications
H04L 49/356: Switches specially adapted for storage area networks
H04L 49/357: Fibre channel switches
H04L 49/45: Arrangements for providing or supporting expansion
H04L 49/60: Software-defined switches
H04L 49/602: Multilayer or multiprotocol switching, e.g. IP switching
H04L 49/70: Virtual switches
F7.2004.09C, entitled FABRIC-BACKPLANE ENTERPRISE SERVERS WITH VNICS AND VLANS
The present invention relates generally to interprocess and inter-module communications in servers and server clusters. More specifically, it relates to the organization, provisioning, management, and interoperation of compute, storage, and network resources to enhance datacenter availability, efficiency, and utilization.
FIG. 1A illustrates a conceptual representation of an embodiment of an ES system.
FIG. 1B illustrates a conceptual representation of selected details of data transfer in an embodiment of an ES system.
FIG. 2 illustrates various example embodiments of packet and process data flow in an ES embodiment.
FIG. 3A illustrates selected aspects of an embodiment of packet transmission and reception in an ES embodiment.
FIG. 3B illustrates selected aspects of an embodiment of packet and cell prioritized transmission in an ES embodiment.
FIG. 4A illustrates selected aspects of an embodiment of a System Intelligence Module (SIM) configured as a pluggable module including a System Control Module (SCM) and an associated Switch Fabric Module (SFM).
FIG. 4B illustrates selected aspects of an embodiment of a Processor Memory Module (PMM) configured as a pluggable module.
FIG. 4C illustrates selected aspects of an embodiment of a Network Module (NM) configured as a pluggable module.
FIG. 4D illustrates selected aspects of an embodiment of a Fibre Channel Module (FCM) configured as a pluggable module.
FIG. 4E illustrates selected aspects of an embodiment of an OffLoad Board (OLB) configured as a pluggable module.
FIG. 5A illustrates selected aspects of embodiments of SoftWare (SW) layers for executing on application processor resources in an ES embodiment.
FIG. 5B illustrates selected aspects of embodiments of SW layers for executing on management processor resources in an ES embodiment.
FIG. 5C illustrates selected aspects of embodiments of SW layers for executing on module-level configuration and management processor resources in an ES embodiment.
FIG. 6A illustrates selected aspects of a logical view of an embodiment of a plurality of virtual Network Interface Controllers (VNICs), also known as virtualized Network Interface Cards.
FIG. 6B illustrates selected aspects of a logical view of an embodiment of VNIC transmit queue organization and prioritization.
FIG. 6C illustrates selected aspects of a logical view of an embodiment of transmit output queue organization and prioritization.
FIG. 6D illustrates selected aspects of a logical view of an embodiment of receive input queue organization and prioritization.
FIG. 6E illustrates selected aspects of a logical view of an embodiment of VNIC receive queue organization and prioritization.
FIG. 7A illustrates selected aspects of an embodiment of a Virtual Input/Output Controller (VIOC).
FIG. 7B illustrates selected aspects of egress operation of an embodiment of a VIOC.
FIG. 7C illustrates selected aspects of ingress operation of an embodiment of a VIOC.
FIG. 8A illustrates selected aspects of an embodiment of an egress lookup key and result entries.
FIG. 8B illustrates selected aspects of an embodiment of an ingress lookup key and entry.
FIGS. 9A and 9B illustrate a Hardware Resources view and a Provisioned Servers and Switch view of an embodiment of an ES system, respectively.
FIG. 9C illustrates an operational view of selected aspects of provisioning and management SW in an ES embodiment.
FIG. 10 illustrates a conceptual view of an embodiment of a Server Configuration File (SCF) and related SCF tasks.
FIG. 11 illustrates selected aspects of an embodiment of server operational states and associated transitions.
FIGS. 12A and 12B are flow diagrams illustrating selected operational aspects of real time server provisioning and management in an ES embodiment.
FIG. 13A is a state diagram illustrating processing of selected Baseboard Management Controller (BMC) related commands in an ES embodiment.
FIG. 13B illustrates selected operational aspects of single and dual PMM low-level hardware boot processing in an ES embodiment.
FIG. 14 illustrates a conceptual view of selected aspects of embodiments of Internet Protocol (IP) and Media Access Control (MAC) address failover data structures and associated operations.
FIG. 15 illustrates a flow diagram of an embodiment of rapid IP address takeover.
FIG. 16 illustrates an embodiment of a multi-chassis fabric-backplane ES system.
FIG. 17 illustrates an embodiment of two variations of multi-chassis provisioning and management operations.
The invention can be implemented in numerous ways, including as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium, or a computer network wherein program instructions are sent over optical or electronic communication links. These implementations, or any other form that the invention may take, may be referred to as techniques. The order of the steps of disclosed processes may be altered within the scope of the invention.
A hybrid server/multi-layer switch system architecture forms the basis for a number of Enterprise Server (ES) chassis embodiments.
The SIM includes a cellified switching-fabric core (SF) and a System Control Module (SCM).
Each PMM has one or more resident Virtual IO Controller (VIOC) adapters. Each VIOC is a specialized Input/Output (I/O) controller that includes embedded layer-2 forwarding and filtering functions and tightly couples the PMM to the SF. Thus the layer-2 switch functionality within the ES chassis is distributed over all of the SCM, NM, and PMM modules. Via VIOC/VNIC device drivers, host operating system software (Host O/S) running on the PMMs is presented with a plurality of Virtual Network Interface Cards (VNICs). Each VNIC behaves as a high-performance Ethernet interface at the full disposal of the Host O/S. At least some of the VNICs behave as high-performance Fibre Channel Host Bus Adapters.
The SIM is responsible for provisioning and overall system management. Via system control and management processes running on the SIM, the server and switch functionality of the ES chassis is provisioned via configuration files in accordance with respective requirements specified by server and network administrators. Configurable parameters for each server include the number of processors, memory, the number of VNICs, and VNIC bandwidth. Configurable parameters for the network include Virtual LAN (VLAN) assignments for both Network Module ports and VNICs, and Link Aggregation Group (LAG) definitions.
An ES system may be operated as one or more provisioned servers, each of the provisioned servers including capabilities as identified by a corresponding set of specifications and attributes, according to various embodiments. Typically the specifications (or constraints) and attributes are specified with a Server Configuration File. An ES system may be provisioned into any combination and number of servers according to needed processing and I/O capabilities. Each of these servers may include distinct compute, storage, and networking performance. Provisioned servers may be managed similarly to conventional servers, including operations such as booting and shutting down.
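The following Python sketch illustrates, under assumed field names, the kind of per-server specifications a Server Configuration File might convey (processors, memory, and VNICs with bandwidth and VLAN assignments); it is a minimal illustration and not the actual SCF format.

```python
# Hypothetical sketch of an SCF's contents for one provisioned server; all
# field names and values are illustrative assumptions, not the actual format.
server_configuration = {
    "name": "provisioned-server-1",
    "processors": 4,              # number of processors in the SMP complex
    "memory_gb": 16,              # memory allocated to the physical partition
    "vnics": [
        {"name": "vnic0", "bandwidth_mbps": 2000, "vlan": 100},
        {"name": "vnic1", "bandwidth_mbps": 1000, "vlan": 200},
    ],
}

def validate_scf(scf: dict) -> None:
    """Minimal sanity checks a provisioning process might apply."""
    assert scf["processors"] > 0 and scf["memory_gb"] > 0
    assert all(v["bandwidth_mbps"] > 0 for v in scf["vnics"])

validate_scf(server_configuration)
```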
VNICs provide for communication among modules of Enterprise Server (ES) embodiments via a switch fabric dataplane. Processes executing on compute complexes of the servers exchange data as packets or messages via interfaces made available through VNICs. The VNICs further provide for transparent communication with network and storage interfaces. VNIC provisioning capabilities include programmable bandwidth, priority scheme selection, and detailed priority control (such as round-robin weights).
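As one way to picture the round-robin weight control mentioned above, the following sketch drains per-VNIC transmit queues in proportion to configured weights; the queue layout and weight semantics are assumptions for illustration, not the device's actual scheduler.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """Yield entries from queues in proportion to their configured weights."""
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(w):
                if not q:
                    break
                yield q.popleft()

q_high = deque(["h1", "h2", "h3", "h4"])   # higher-priority transmit queue
q_low = deque(["l1", "l2"])                # lower-priority transmit queue
print(list(weighted_round_robin([q_high, q_low], weights=[3, 1])))
# -> ['h1', 'h2', 'h3', 'l1', 'h4', 'l2']
```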
VNICs are implemented in Virtual Input/Output Controllers (VIOCs). VLANs enable access to Layer-2 (L2) and selected Layer-3 (L3) network functions while exchanging the packets and messages. VLAN identification is provided in each VNIC, and VLAN processing is partially performed in VIOCs implementing VNICs. The compute complexes and interfaces are typically configured as pluggable modules inserted into a backplane included in a chassis. The switch fabric dataplane (sometimes simply referred to as "a dataplane") is accessible via the backplane (serving as a replacement for a conventional backplane bus), and hence ES embodiments are known as "fabric-backplane" enterprise servers. Various ES embodiments comprise varying numbers and arrangements of modules.
The EF architecture provides for provisioning virtual servers (also known as server-instances) with included virtual networks from underlying ES hardware and software resources. The EF architecture is applicable to application scenarios requiring dynamic combinations of compute, network, and storage performance and capabilities, and is a unifying solution for applications requiring a combination of computation and networking performance. Resources may be pooled, scaled, and reclaimed dynamically for new purposes as requirements change, using dynamic reconfiguration of virtual computing and communication hardware and software. This approach offers the advantages of reduced cost, as provisioning is "just-right" rather than over-provisioned. Dynamic configuration allows for quick performance or scale modifications.
The EF architecture provides a radically different underlying server architecture compared to traditional multi-way Symmetric MultiProcessor (SMP) servers, including integrated fabric interconnectivity to enable high-bandwidth, low-latency I/O operation. Processing and I/O throughput are virtualized, providing scalable, coordinated resources. Partitioning and fail-over are hardware supported, including mechanisms for treating multiple virtual servers as a single managed entity, resulting in new high availability clustering and multi-site fail-over capabilities.
Virtualized fabric services, such as Server Load Balancing (SLB), Secure Sockets Layer (SSL) protocols including Transport Layer Security (TLS) variants, eXtensible Markup Language (XML), and so forth, are also provided.
A data center or other installation implemented in accordance with the EF architecture will include one or more ES chassis. In a first embodiment, the ES chassis capabilities include an 8-way SMP partition-configurable compute complex. These compute resources include a plurality of 64-bit x86 processing elements. The ES chassis hardware configuration is compatible with execution of software operating systems such as Linux and Microsoft Windows. Processing elements in the ES chassis are coupled to a low-latency high-bandwidth interconnect fabric via virtualized I/O functions, providing for efficient communication between processing elements and with network and fibre channel interfaces coupled to the fabric. The virtualized I/O functions are distributed throughout the plurality of processing elements. The ES chassis includes VNICs and virtualized Host Bus Adaptors (VHBAs). Via these VNICs and VHBAs, the processing elements can selectively communicate with external networks coupled to any of several high-performance network interfaces (up to three 10 Gb Ethernet interfaces, or thirty 1 Gb Ethernet interfaces, in the first embodiment) and with several high-performance 2 Gb Fibre Channel interfaces (up to eight in the first embodiment). Each VNIC/VHBA can be individually configured such that it appears to be coupled to a multi-port switch coupled to others of the VNICs/VHBAs and to the network/storage interfaces. Alternatively, each VNIC/VHBA can be configured such that it appears to be directly coupled to one of the network/storage interfaces. Additional processing capabilities may be provided in the chassis in the form of offload cards (or pluggable boards or modules) supporting virtualized services, such as SLB, SSL, and XML processing.
The ES chassis is further configured with capabilities to provide for a high availability system, including modular components, hot-swap of components, and fully redundant components. Other high availability capabilities include multi-site fail-over and mainframe-class Reliability, Availability, and Serviceability (RAS) features.
An ES system may be operated as one or more provisioned servers, each of the provisioned servers including capabilities as identified by a corresponding set of specifications and attributes, according to various embodiments. Typically the specifications (or constraints) and attributes are specified with an SCF (see the SCF and Related Tasks section, elsewhere herein). An ES system may be provisioned into any combination and number of servers according to needed processing and I/O capabilities. Each of these servers may include distinct compute, storage, and networking performance. Provisioned servers may be managed similarly to conventional servers, including operations such as booting and shutting down (see the Server Operational States section, elsewhere herein).
The EF architecture further includes a Graphical User Interface (GUI) for configuration management. The GUI may be provided via a web browser, a network-based Java client, or some other related mechanism, according to various embodiments. The GUI provides role-based access and division of functions, and may be used as a single point of management for all EF system functions. System management personnel may use the GUI to control EF virtualized configuration and provisioning settings. Resource pooling and allocation of Central Processing Unit (CPU) and I/O capabilities may be dynamically altered without requiring physical changes or re-cabling. Network and storage capabilities may be similarly dynamically modified, including Network Interface Controller (NIC), Host Bus Adaptor (HBA), and bandwidth resources. Redundancy, fail-over, and other RAS capabilities may also be configured via the GUI, including specific multi-site configuration information. Various embodiments may also include a Command Line Interface (CLI) with functions and capabilities similar to the GUI.
The GUI further provides functions for monitoring various aspects of the hardware and software performance and behavior of systems including each ES chassis. The monitoring functions are available for inspection of operations at several levels in the system, from top-level application performance to low-level network interconnect metrics. The GUI provides hooks for integration of the functions provided therein into higher-level application software and standard applications, allowing for flexibility in specifying and monitoring the EF system configuration. EF configuration management and monitoring may also be performed via other mechanisms. Alternate mechanisms include one or more command line interfaces, a scripting interface, and remote network-centric interfaces using standard capabilities provided for in Simple Network Management Protocol (SNMP) and Remote MONitoring (RMON).
Systems including EF capabilities may also provide for upgrades to installed software, including operating system software, application software, driver-level software, and firmware. The upgrades may include updates to address security issues, to enable new or expanded functionality, or to repair incorrect operation (a "bug fix"). A variety of sources may provide upgrades, including EF vendors, or Independent Software Vendors (ISVs) of software installed or used in EF-based systems.
Illustrative application usage scenarios include a first usage scenario including a first configuration adapted to replace a server (having an Operating System selected from a list including but not limited to Unix, Linux, Windows, etc.) or a collection of such servers. The first configuration provides for virtualization of data center capabilities, resource pooling, and consolidation of functions otherwise performed in a plurality of heterogeneous devices. Computing, networking, and services are completely virtualized, enabling dynamic deployment, scaling, and reclamation according to changing application requirements. Significant savings in capital and operating expense result.
A second usage scenario includes a second configuration adapted for I/O-intensive applications. The second configuration provides high-bandwidth and low-latency storage and networking capabilities, enabling new classes of applications using fewer infrastructure components than currently possible. The high-bandwidth and low-latency capabilities are enabled in part by use of a high-bandwidth, low-latency fabric. Efficient intra-chassis communication is provided for in a transparent manner, enabling increased I/O bandwidth and reduced latency compared to existing solutions.
A third usage scenario includes a third configuration adapted for consolidating tiers in a data center application. The third configuration provides for collapsing the physical divisions in present 3-tier data centers, enabling solutions with fewer servers, a smaller number of network switches, and reduced needs for specialized appliances. The concepts taught herein provide for completely virtualized computing, networking, and services, in contrast to existing solutions addressing tiered data systems. Dynamic configuration enables pooling of resources and on-the-fly deploying, scaling, and reclaiming of resources according to application requirements, allowing for reduced infrastructure requirements and costs compared to existing solutions.
A fourth usage scenario includes a fourth configuration adapted for enhanced high availability, or RAS functionality, including multi-site fail-over capabilities. The fourth configuration provides for new redundancy and related architectures. These new architectures reduce set-up and configuration time (and cost), and also decrease on-going operating expenses. Modular components of the ES chassis are hot-swap compatible and all EF systems are configured with fully redundant components, providing for mainframe-class RAS functionality. Reduced networking latency capabilities enable enhanced multi-site fail-over operation.
The concepts taught herein consolidate multiple devices and tiers in data center operations, requiring fewer servers (in type and quantity), reduced supporting hardware, and smaller infrastructure outlays compared to systems of the current art. Significant reductions in the total cost of ownership are thus provided for by the concepts taught herein. The concepts taught herein ensure highly reliable and available compute, network, storage, and application resources while also dramatically improving storage and networking performance and reliability. True multi-site fail-over and disaster recovery are possible by use of the concepts taught herein, enabling new classes of I/O and high availability applications.
FIG. 1A illustrates System 100A, a conceptual representation of an embodiment of an ES system. The system includes a particular ES chassis embodiment, ES1 110A, which is coupled to various other systems, including Fibre Channel Storage Network 106, Generic Packet Network 107, and Ethernet Storage Network 108. Fibre Channel Storage Network 106 provides mass storage via a collection of disks organized, for example, as a Storage Area Network (SAN). Generic Packet Network 107 conceptually includes arbitrary combinations of Local Area Network (LAN), Metro Area Network (MAN), and Wide Area Network (WAN) networks and typically includes Ethernet and Ethernet derivative links for coupling to Internet 101, an arbitrary number and arrangement of Client machines or servers, represented as Client 102 and Client 103, as well as an arbitrary number and arrangement of Personal Computers (PCs) or Workstations, represented as PC 104 and PC 105. Ethernet Storage Network 108 provides mass storage via a collection of disks organized in a Network Attached Storage (NAS) or Small Computer System Interface over Transmission Control Protocol/Internet Protocol (iSCSI) fashion.
ES1 110A includes a central I/O SFM (SFM 180) providing a switch fabric dataplane coupling for FCMs 120, NMs 130, SCMs 140, PMMs 150 (also known as Application Processor Modules), and OLBs 160, also known as Offload Modules (OLMs) or AppLication Modules (ALMs).
FCMs 120 include Fibre Channel Interfaces (FCIs) for coupling to Fibre Channel standard storage devices and networks (such as SANs). NMs 130 include interfaces to standard network infrastructures. PMMs 150 include compute elements for execution of Application, Driver, and Operating System (OS) processes, via SMP clusters illustrated conceptually as SMP 151. A configurable Coherency Switch Fabric and Interconnect (CSFI 170) is included for partitioning or combining the CPU and Randomly Accessible read/write Memory (RAM) resources of PMMs 150. OLBs 160 include compute elements for execution of service processes, via various service acceleration modules. Service acceleration modules include SLB accelerator 161, SSL accelerator 162, and XML accelerator 163. SCMs 140 include compute elements for providing system management, controlplane (L2/L3 bridging and routing, for example), and load balancing processing for SFM 180 and the elements coupled to it. PMMs 150 also include FCIs for coupling to mass storage systems, such as Optional Local Disks 111-112, or SAN systems including mass storage.
Application, Driver, and OS processes are executed on PMMs 150 via CPU and RAM elements included in SMP 151. At least some of the data consumed and produced by the processes is exchanged in packets formatted as cells for communication on SFM 180. The data may include network data exchanged with Generic Packet Network 107 via NMs 130, and storage data exchanged with Ethernet Storage Network 108 via NMs 130 or Fibre Channel Storage Network 106 via FCMs 120. The data may also include service data exchanged with OLBs 160 and SCMs 140, and other Application, Driver, or OS data exchanged with other elements of PMMs 150.
Data communicated on SFM 180 is not limited to data exchanged with PMMs 150, but may also include data communicated between any of the modules (or fabric clients) coupled to the fabric. For example, one NM may forward packets to itself or to another NM via the fabric. An NM may also exchange packets with an OLB for processing via the fabric. SCMs 140 may also exchange configuration and forwarding update information with VIOCs via VIOC Control Protocol (VIOC-CP) packets via the fabric. In some embodiments, SCMs 140 may also exchange selected system management, controlplane, and load balancing information with all modules coupled to the fabric via in-band packets communicated on the fabric.
A modified Ethernet Driver provides the illusion of local NIC functionality to Application, Driver, and OS processes locally executing on any of SCMs 140, PMMs 150, and OLBs 160. The NIC functionality can be configured to either appear to be coupled to a switch coupled to other NICs or appear to be coupled directly to one of the networking interfaces included on NMs 130. This technique may be used to access networked storage devices (i.e., NAS subsystems) via the NMs 130.
SFM 180 includes a redundant pair of fabrics, with one of the pair typically configured as a Primary Fabric, while the other fabric is typically configured as a Redundant Fabric. SCM-Fabric coupling 149 represents two fabric dataplane couplings: a first Primary Coupling between a Primary SCM of SCMs 140 and the Primary Fabric, and a Redundant Coupling between a Redundant SCM of SCMs 140 and the Redundant Fabric. In typical operation, all dataplane traffic is carried on the Primary Fabric, managed by the Primary SCM, while the Redundant Fabric and the Redundant SCM are maintained in a hot-standby mode.
In FIG. 1A, further dataplane couplings to SFM 180 are illustrated conceptually as FCM-Fabric coupling 129, NM-Fabric coupling 139, PMM-Fabric coupling 159, and OLB-Fabric coupling 169. In the figure, each coupling is abstractly portrayed as a single line between each group of modules and the switch fabric. It will be understood that for the FCM, NM, and OLB modules, each module instance has a Primary Fabric coupling and a Redundant Fabric coupling. For the PMM, each PMM instance has two Primary Fabric couplings and two Redundant Fabric couplings. All of the modules coupled to SFM 180 include fabric interface communication units for exchanging data as cells on the fabric. The details of this data exchange are described in more detail elsewhere herein.
The components of ES1 110A are included on a plurality of pluggable modules adapted for insertion into and removal from a backplane while the server is powered-up and operational (although software intervention to cleanly shut down or start up various processes or functions may be required). The backplane forms portions of FCM-Fabric coupling 129, NM-Fabric coupling 139, CSFI-PMM coupling 179, PMM-Fabric coupling 159, and OLB-Fabric coupling 169. The Primary Fabric of SFM 180 and the associated Primary SCM of SCMs 140 are included on the pluggable module Primary SIM, as illustrated by SIMs 190. Similarly, the Redundant Fabric of SFM 180 and the associated Redundant SCM of SCMs 140 are included on the pluggable module Redundant SIM of SIMs 190. All of the modules of FCMs 120, NMs 130, PMMs 150, and OLBs 160 are also configured as pluggable modules adapted for operation with the backplane.
Each PMM of PMMs 150 is physically-partitionable, i.e., configurable into one or more physical partitions. The physical partitioning of PMMs 150 and related modes of CSFI 170 are configured under program control. For example, PMMs 150 may be configured as a single SMP complex in conjunction with CSFI 170. The result is a first example of a physical partition. In a second example, each PMM of PMMs 150 may instead be configured individually as an independent SMP complex, resulting in a plurality of physical partitions, one for each PMM. In a third example, each PMM of PMMs 150 may instead be configured as a pair of SMP complexes, resulting in two physical partitions per PMM. In various embodiments, CSFI 170 may be implemented as any combination of simple interconnect, coherency logic, and switching logic, operating in conjunction with any combination of interconnect and logic included on PMMs 150. Some of these embodiments are discussed in more detail later herein.
ES1 110A is representative of a number of embodiments configured with various Modules to provide differing amounts of storage and network interface capability (connectivity and bandwidth), as well as differing levels of compute capability (cycles and memory). Typically, each embodiment includes at least a redundant pair of Switch Fabrics and associated System Intelligence Modules (for communication between Modules), at least one Processor Memory Module (for execution of Application, Driver, and OS processes), and at least one Network Module (for communication with external agents). Some embodiments may optionally further include any combination of additional Modules to provide additional interface and compute capability, up to the physical limits of the particular implementation. For example, additional Network Modules may be included in an embodiment to provide additional network bandwidth or connectivity. Fibre Channel Modules may be included in an embodiment to provide additional storage bandwidth or connectivity. Additional Processor Memory Modules may be included to provide additional compute cycles or memory. One or more Offload Modules may be included to provide additional service compute cycles or memory, and these Offload Modules may each be individually configured with any combination of SLB, SSL, and XML accelerators.
Communication between the Modules via SFM 180 is independent of the manner and arrangement of the Modules. All of the Modules communicate as peers on SFM 180 and interface to the fabric in a similar manner.
System 100A is also representative of a variety of system embodiments, for example, differing in the number, type, and arrangement of storage and network systems coupled to ES1 110A. For example, any combination of Optional Local Disks 111-112 may be included. Generic Packet Network 107 may include any combination of LAN, MAN, or WAN elements. FCMs 120 may be coupled to a single SAN, or a plurality of SANs. NMs 130 may be coupled to a plurality of networks or storage systems. Couplings between ES1 110A and other systems are limited only by the number and type of interfaces and physical couplings available according to implementation.
FIG. 1B illustrates System 100B, a conceptual representation of selected details of data transfer in an embodiment of an ES system. An ES chassis embodiment, illustrated as ES1 110B, is coupled to Fibre Channel Storage Network 106 and Ethernet Network 107, as described elsewhere herein. In the illustrated embodiment, one module of each type (FCM 120A, NM 130A, PMM 150A, and OLB 160A) is coupled to Primary Switch Fabric Module 180A, via FCM-Fabric coupling 129A, NM-Fabric coupling 139A, PMM-Fabric couplings 159A/159A′, and OLB-Fabric coupling 169A, respectively. FCM 120A provides interfaces for storage network couplings, including a coupling for Fibre Channel Storage Network 106. NM 130A provides interfaces for network couplings, including a coupling for Ethernet Network 107, coupled in turn to Client 102. PMM 150A is configured as a first and a second physical partition. The first physical partition includes SMP Portion PA 152A, having RAM 153A, and is coupled by PMM-Fabric coupling 159A to the fabric dataplane. The second physical partition includes SMP Portion PA′ 152A′, having RAM 153A′, and is coupled by PMM-Fabric coupling 159A′ to the fabric dataplane. Note: several elements have been omitted from the figure for clarity, including the SCMs, the Redundant Fabric, the CSFI, and optional SANs.
Each illustrated data exchange path shows the movement of data between two clients of the switch fabric. For some exchanges, data is organized as packets transferred via a stateless, connection-free (and unreliable) protocol. For other exchanges, data is organized as messages and transferred via a connection-oriented reliable message protocol. In still other exchanges, data is selectively organized as either packets or messages.
Each port of each NM acts as a switch port of a virtualized high-performance L2/L3 switch. The switch has advanced VLAN and classification functionalities. The VLAN functionality provides for selectively coupling or isolating the network segments coupled to each switch port. Each segment associated with an NM port may have one or more external physical network devices as in any conventional network segment. The classification functionality provides for special forwarding treatments in accordance with a variety of attributes of the Ethernet frames received from external network devices on the ports of the NMs. A virtualized fibre channel switch is similarly presented to external fibre channel devices. Certain software processes running on the PMMs are provided the illusion that they are coupled to the fibre channel switch via high-performance fibre channel interfaces. Multiple VLANs and multiple fibre channel networks can simultaneously co-exist on top of the fabric transport infrastructure while being completely logically separate and secure.
The term "packets" refers to conventional Ethernet frames sent via some connectionless protocol that does not have integral support for reliable delivery. The term "messages" refers to one or more data transfers of quasi-arbitrarily sized data blocks reliably delivered over a logical connection established between end-points. Packets are transported over the fabric using "fabric packets," while messages are transported over the fabric using "fabric messages." Both fabric packets and fabric messages make use of highly similar fabric frames. A fabric packet consists of a single fabric frame, sent over the fabric without any connection or reliable delivery support. In contrast, fabric messages consist of (potentially quite long) sequences of fabric frames, sent over the fabric using a connection-oriented reliable delivery protocol. Some of the fabric frames of a fabric message are for transfer of the message data blocks while other fabric frames are used for control to set up and take down connections and to implement reliable delivery (e.g., via handshake and re-delivery attempts). Thus fabric messages require additional fabric frames for messaging control beyond the fabric frames required to transport the message data. Furthermore, the fabric frames of fabric messages require additional processing at the source and destination ends related to the management of reliable delivery, connections, and the fragmentation (segmentation) and reassembly of data blocks. Nevertheless, the transport over the fabric of individual fabric frames is essentially the same for both fabric messages and fabric packets. Since all sources and destinations on the fabric have support for processing fabric packets and fabric messages, those of ordinary skill in the art will understand that all of the data exchange illustrations below that describe the transport of packets using fabric packets are equally applicable to the transport of messages using fabric messages.
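To make the packet/message contrast concrete, the following Python sketch models a fabric packet as a single data frame and a fabric message as a sequence of data frames bracketed by connection-control frames; the frame fields and control payloads are illustrative assumptions, not the actual fabric frame format.

```python
from dataclasses import dataclass

@dataclass
class FabricFrame:
    fabric_destination: int   # fabric egress port (and sub-address)
    kind: str                 # "data" or "control"
    payload: bytes = b""

def as_fabric_packet(dest: int, ethernet_frame: bytes) -> list:
    """A fabric packet is one frame, with no connection or delivery support."""
    return [FabricFrame(dest, "data", ethernet_frame)]

def as_fabric_message(dest: int, data_blocks: list) -> list:
    """A fabric message adds control frames to set up and take down the connection."""
    frames = [FabricFrame(dest, "control", b"CONNECT")]
    frames += [FabricFrame(dest, "data", block) for block in data_blocks]
    frames.append(FabricFrame(dest, "control", b"DISCONNECT"))
    return frames

print(len(as_fabric_packet(7, b"frame")))             # 1 frame
print(len(as_fabric_message(7, [b"a", b"b", b"c"])))  # 3 data + 2 control frames
```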
The Ethernet frames of the packets to be transported over the fabric may originate in external clients or devices coupled to the NM ports or from within the various processing modules. A fabric packet is formed to contain the data of each original Ethernet frame plus additional information to facilitate transport over the fabric. During formation of the fabric packet, the protocol field (Ether-type) of the original Ethernet frame is examined. The fabric packet is generally labeled (tagged) in accordance with the Ether-type and other information found in the original packet. For example, if the original Ethernet frame is an IP frame, the fabric packet is identifiable as an "IP fabric packet." IP fabric packets are evaluated for L3 forwarding (a.k.a. IP forwarding) based upon their included destination IP address. Otherwise, non-IP fabric packets are evaluated for L2 forwarding based upon their included MAC destination address (MACDA). L2/L3 forwarding is overviewed next.
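A minimal sketch of this dispatch, assuming the standard Ether-type value for IPv4 (0x0800) purely for illustration:

```python
# Tagging and dispatch by Ether-type: IPv4 frames take the L3 (IP forwarding)
# path, all other frames take the L2 (MACDA) path. Values are illustrative.
ETHERTYPE_IPV4 = 0x0800

def classify_fabric_packet(ethertype: int) -> str:
    if ethertype == ETHERTYPE_IPV4:
        return "IP fabric packet: evaluate L3 forwarding by destination IP address"
    return "non-IP fabric packet: evaluate L2 forwarding by MACDA"

print(classify_fabric_packet(0x0800))  # IPv4
print(classify_fabric_packet(0x0806))  # e.g., ARP falls back to L2 forwarding
```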
A forwarding decision is made that determines a fabric destination address that is embedded in the fabric packet. The embedded fabric destination address controls how the fabric packet is delivered to destinations within the system. The fabric destination address includes a specification for an egress port of the switch fabric. When multiple sub-ports (corresponding to multiple L2 or L3 destination addresses) are associated with a single egress port, the fabric destination address will also include a fabric sub-address to specify a particular one of the sub-ports. The fabric packet is subsequently cellified (segmented into cells) and presented to an ingress port of the switch fabric. Each cell includes the fabric destination address, and the cell is transferred by the switch fabric to the egress port specified by the fabric destination address. After being received by the module coupled to the specified egress port, the cells are reformed into a representation of the original Ethernet frame prior to presentation to the destination. If the module at the egress port has multiple sub-ports, the module will use the included fabric sub-address to further direct the reformed Ethernet frame to the specified sub-port.
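The following sketch illustrates cellification and reassembly in the abstract: each cell carries the fabric destination address (egress port plus optional sub-address) together with a slice of the fabric packet. The cell payload size and dictionary layout are assumptions for illustration only.

```python
CELL_PAYLOAD_BYTES = 64  # illustrative cell payload size, not the actual value

def cellify(fabric_packet: bytes, egress_port: int, sub_address: int = 0):
    """Segment a fabric packet into cells, each tagged with the fabric address."""
    for offset in range(0, len(fabric_packet), CELL_PAYLOAD_BYTES):
        yield {
            "egress_port": egress_port,
            "sub_address": sub_address,
            "payload": fabric_packet[offset:offset + CELL_PAYLOAD_BYTES],
        }

def reassemble(cells) -> bytes:
    """Reform the cells received at the egress module into the original data."""
    return b"".join(cell["payload"] for cell in cells)

packet = bytes(200)                                  # a 200-byte fabric packet
cells = list(cellify(packet, egress_port=5, sub_address=2))
assert reassemble(cells) == packet
print(len(cells))                                    # 4 cells of up to 64 bytes
```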
For L2 forwarding, the VLAN assignment of the network port or processing module from which the original Ethernet frame was sourced is also used with the MACDA in determination of the fabric destination address. The determination is by way of a lookup in an L2 Forwarding Information Base (L2 FIB). As discussed elsewhere herein, an L2 FIB is implemented for each VIOC and NM in the system using any combination of TCAM/SRAM structures and search engines, according to embodiment. The L2 forwarding decision is thus implemented completely within the module where the original Ethernet frame was sourced, and the next fabric destination is the module most directly associated with the MACDA of the original Ethernet frame.
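Conceptually, the lookup key combines the source VLAN with the MACDA, and the result is a fabric destination. A minimal sketch, with made-up table contents and result fields:

```python
# Illustrative L2 FIB: key = (VLAN, MAC destination address),
# result = fabric destination (egress port and sub-address).
l2_fib = {
    (100, "00:11:22:33:44:55"): {"egress_port": 3, "sub_address": 0},
    (100, "00:11:22:33:44:66"): {"egress_port": 5, "sub_address": 1},
}

def l2_forward(vlan: int, macda: str):
    """Return the fabric destination for a non-IP fabric packet, if known."""
    return l2_fib.get((vlan, macda))

print(l2_forward(100, "00:11:22:33:44:66"))  # {'egress_port': 5, 'sub_address': 1}
print(l2_forward(200, "00:11:22:33:44:66"))  # None: no entry for this VLAN/MACDA
```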
General L3 forwarding (i.e., to destinations beyond the IP sub-net of the source IP) requires access (on the same L2 sub-net as the source) to at least one gateway IP interface and associated L3 FIB. The number and location of gateway IP interfaces and L3 FIBs vary by embodiment. In some embodiments, a gateway IP interface and L3 FIB are implemented external to the system chassis. In other embodiments, at least one gateway IP interface is implemented via a media port (physical port) or pseudo-port (virtual port) somewhere on at least one NM, and an L3 FIB is implemented within each NM having a gateway IP interface. The gateway IP interface is only visible from inside the chassis if implemented on a pseudo-port. The gateway IP interface is visible from inside and outside the chassis if implemented on a media port. Combinations of multiple gateway IP interfaces, some on media ports and others on pseudo ports, are envisioned. In still other embodiments, a gateway IP interface is implemented somewhere on at least one NM and for each VNIC, and an L3 FIB is implemented within each NM and VIOC. Gateway IP interfaces and L3 FIBs implemented within the chassis are fabric packet aware and assist L3 forwarding by providing the fabric destination address of the next hop. L3 FIB management processes maintain a master L3 FIB in the SCM and maintain coherency between all L3 FIBs in the chassis.
Upon a miss when attempting to look up a destination in an L3 FIB, the IP fabric packet undergoing the forwarding decision is forwarded as an "exception packet" to the controlplane process executing on the Primary SCM. The controlplane process determines the proper fabric address for the missing entry, propagates a new entry to all of the L3 FIBs, and forwards the IP fabric packet to the destination IP (or at least one hop closer) using the newly learned fabric address.
Fabric frames exist for a single cellified hop across the fabric between fabric source and fabric destination. Thus, when an IP fabric packet is "forwarded" via an indirection or hop via a gateway IP interface or the Primary SCM, the IP fabric packet is being re-instantiated into a new fabric frame for each traverse across the fabric. For example, an IP fabric packet may first undergo an indirection to an IP gateway interface, possibly on a different port or pseudo port on the same or a different module or external to the system. All transport is by conventional Ethernet frames outside the chassis and by fabric frames within the chassis. Once at an IP gateway interface, the destination IP address of the original Ethernet frame is used to associatively access the L3 FIB and the lookup result is used to forward the IP packet to the IP destination (or at least one hop closer).
For IP packet transfers over the fabric, generally a gateway IP interface must be involved. In the following illustrated data exchanges, the paths are drawn for scenarios that do not require additional indirection. Nevertheless, it will be understood that if an IP packet is received at an interface that is neither the IP destination address nor a gateway IP interface, then generally the corresponding data exchange path is modified by interposing an intermediate hop to a gateway IP interface. Furthermore, when an IP packet is received at a gateway IP interface, either directly or as part of an indirection from a non-gateway IP interface, in the relatively rare event that there is a miss in the associated L3 FIB, the corresponding data exchange path is modified by interposing an intermediate hop to the Primary SCM. Primary SCM controlplane processing services the miss in the master L3 FIB and updates the L3 FIBs throughout the chassis. Once the miss is serviced, the Primary SCM forwards the IP packet toward the originally intended destination. Thus, while not a frequent occurrence, for some IP fabric packets two intermediate hops are interposed in the data exchange paths: a first intermediate hop to a gateway IP interface and a second intermediate hop to the Primary SCM.
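The following sketch illustrates the hit/miss behavior at a gateway IP interface: a hit in the local L3 FIB yields the next-hop fabric destination, while a miss is punted to the Primary SCM, which resolves the entry, propagates it, and forwards the packet. Table contents and names are illustrative assumptions.

```python
# Illustrative master and local L3 FIBs keyed by destination IP address.
master_l3_fib = {"10.0.1.5": {"egress_port": 7}, "10.0.2.9": {"egress_port": 2}}
local_l3_fib = {"10.0.1.5": {"egress_port": 7}}   # initially missing 10.0.2.9

def primary_scm_handle_exception(dest_ip: str):
    entry = master_l3_fib[dest_ip]   # controlplane resolves the missing entry
    local_l3_fib[dest_ip] = entry    # propagate the new entry to chassis L3 FIBs
    return entry                     # then forward toward the destination

def l3_forward(dest_ip: str):
    entry = local_l3_fib.get(dest_ip)
    if entry is None:                # miss: send as an "exception packet"
        return primary_scm_handle_exception(dest_ip)
    return entry

print(l3_forward("10.0.2.9"))   # first packet: miss serviced via the Primary SCM
print(l3_forward("10.0.2.9"))   # later packets: hit in the updated local L3 FIB
```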
The classification functionality of the NMs facilitates more sophisticated forwarding decisions, special data manipulation, and other data treatments, to be optionally performed as a function of additional attributes of the network data traffic encountered. In one example, the fabric destination address for IP fabric packets is at least in part determined by the recognition of particular service requests (and the lack thereof) embedded in the data traffic. More specifically, the service request recognition takes the form of recognizing particular Transmission Control Protocol/Internet Protocol (TCP/IP) destination ports corresponding to particular applications. The L2 and L3 FIBs are also updated dynamically, both in response to changes in the network configuration and optionally for dynamic performance optimization, such as to achieve load balancing among the processing resources of the system.
In the following discussion, references to packet transmission, packets originating from the client, incoming packets, received packets, reassembled packets, or simply packets, are references to Ethernet frames. It will be understood that all such Ethernet frames are transported across the fabric via the process of fabric packet encapsulation, cellification, switch fabric traversal, and reassembly. References to augmented packets or cellified packets are references to fabric packets. References to cells or cellified packets being forwarded refer to the providing of fabric-addressed cells to the switch fabric for transfer by the switch fabric to the module coupled to the switch fabric egress port specified by the fabric address.
Client-Server Data Exchange 115 includes packet transmission from Client 102 via Ethernet Network 107 to NM 130A. Since the system may in part be providing the functionality of an L2/L3 switch for any of many network segments, packets received in close time proximity by NM 130A may be for any of multiple destinations both internal and external to the system. The incoming packets are classified, formed into fabric packets, subjected to a forwarding decision to determine a fabric address, and selectively provided as cells to Primary Switch Fabric Module 180A via a fabric ingress port associated with NM-Fabric coupling 139A. In this example, NM 130A addresses the cells to PMM 150A, and more specifically to SMP Portion PA 152A, as a result of the forwarding decision identifying the fabric egress port associated with PMM-Fabric coupling 159A as the destination fabric address for the cells. Primary Switch Fabric Module 180A then transfers the cells to the fabric egress port associated with PMM-Fabric coupling 159A. SMP Portion PA 152A receives the cells and reassembles them into received packets corresponding to the packets originating from Client 102. The received packets are formed directly in RAM 153A, typically via DMA write data transfers. Return packets follow the flow in reverse, typically beginning with DMA read transfers from RAM 153A. While Client-Server Data Exchange 115 has been described from the perspective of packets "originating" from Client 102 and return traffic flowing in reverse, this is only for illustrative purposes. The flow from Client 102 to SMP Portion PA 152A is entirely independent of the flow in the other direction.
Client-Service Data Exchange 117 illustrates cellified packets selectively forwarded by NM 130A toward OLB 160A via NM-Fabric coupling 139A, Primary Switch Fabric Module 180A, and OLB-Fabric coupling 169A. Packets from Client 102 are determined to require transfer to OLB 160A (instead of other fabric clients, such as SMP Portion PA 152A as in Client-Server Data Exchange 115). NM 130A addresses the corresponding cells to OLB 160A and executes a forwarding decision identifying the fabric egress port associated with OLB-Fabric coupling 169A as the fabric destination address for the cells. Primary Switch Fabric Module 180A then transfers the cells to the fabric egress port associated with OLB-Fabric coupling 169A. OLB 160A receives the cells and reassembles them into received packets directly into a RAM local to the OLB. Return packets follow the flow in reverse.
Storage-Server Data Exchange 116 includes establishing a reliable end-to-end logical connection, directly reading message data from RAM 153A (included in SMP Portion PA 152A), fragmenting (as required) the message data into fabric frames, and providing corresponding cells addressed to FCM 120A via PMM-Fabric coupling 159A. The cell destination addresses specify the fabric egress port associated with FCM-Fabric coupling 129A. The cells are transferred, received, and reassembled in a manner similar to that described for fabric packets in conjunction with Client-Service Data Exchange 117. The storage transactions are provided via a storage network coupling to at least one storage device of external Storage Network 106. If more than one storage network and associated storage network coupling is associated with FCM 120A, the particular storage network coupling is specified via a fabric sub-address portion of the cell destination address. Returning storage transaction responses follow the flow in reverse.
Service Data Exchange 118 is similar to Client-Service Data Exchange 117. Packet data is read from RAM 153A′ (included in SMP Portion PA′ 152A′), and cells are forwarded to OLB 160A by a forwarding decision specifying the fabric egress port associated with OLB-Fabric coupling 169A as the cell destination address. The packets exchanged by Client-Server Data Exchange 115, Client-Service Data Exchange 117, and Service Data Exchange 118 are typically, but not necessarily, IP packets.
Data Exchanges 115-118 are overlapped or partially concurrent with each other. For example, cells corresponding to a portion of Client-Server Data Exchange 115 traffic may be intermixed with cells relating to Client-Service Data Exchange 117 traffic, as the cells from both data exchanges are coupled via NM-Fabric coupling 139A to the fabric. However, each cell includes sufficient information in the corresponding fabric destination address and other information to specify the proper operation. For example, cells of Client-Server Data Exchange 115 are forwarded to SMP Portion PA 152A by a first forwarding decision specifying the fabric egress port associated with PMM-Fabric coupling 159A, while cells of Client-Service Data Exchange 117 are forwarded to OLB 160A by a second forwarding decision specifying the fabric egress port associated with OLB-Fabric coupling 169A. Similarly, cells from Client-Service Data Exchange 117 and Service Data Exchange 118 may be intermixed on OLB-Fabric coupling 169A, because the sub-port destination address and other information in the cells enable proper processing. A portion of the sub-port destination address is used to associate packets with a respective input queue within the destination module.
The termini of Data Exchanges are located in RAM that is directly accessible by one or more processing elements. For example, Service Data Exchange 118 includes a first terminus in RAM 153A′, and a second terminus in a RAM within OLB 160A. Packet data in the RAMs is read and written by DMA logic units included in each of the respective modules. Data is streamed from a source RAM as packets, cellified and provided to the fabric, transferred to the egress port as specified by the cells, reassembled, and stored into a destination RAM in packet form. These operations are fully overlapped, or pipelined, so that data from a first cell of a packet may be stored into the destination RAM while data from a following cell of the same source packet is being read from the source RAM.
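As an abstraction of this overlapped behavior, the following sketch uses a Python generator so that each cell is stored into the destination buffer before the next cell has been read from the source; it is a conceptual model only, not a description of the actual DMA hardware.

```python
def cellify(source: bytes, cell_size: int = 64):
    """Yield cells one at a time, modeling reads from the source RAM."""
    for offset in range(0, len(source), cell_size):
        print(f"read cell at offset {offset} from source RAM")
        yield source[offset:offset + cell_size]

destination = bytearray()
for cell in cellify(bytes(192)):   # cells flow through the pipeline one at a time
    destination += cell            # each cell is stored before the next is read
    print(f"stored {len(destination)} bytes into destination RAM")
```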
FIG. 2 illustrates various example embodiments of packet and process data flow in an ES1 110A embodiment. A plurality of FCMs are illustrated by FCM 120A and FCM 120B, coupled to Primary Switch Fabric Module 180A via FCM-Fabric coupling 129A and FCM-Fabric coupling 129B, respectively. A plurality of NMs are illustrated by NM 130A and NM 130B, coupled to Primary Switch Fabric Module 180A via NM-Fabric coupling 139A and NM-Fabric coupling 139B, respectively. A plurality of PMMs are illustrated by PMM 150A and PMM 150B, coupled to Primary Switch Fabric Module 180A by PMM-Fabric couplings 159A/159A′ and PMM-Fabric couplings 159B/159B′, respectively. CSFI 170 is coupled to PMM 150A and PMM 150B by CSFI-PMM coupling 179A and CSFI-PMM coupling 179B, respectively. A plurality of OLBs are illustrated by OLB 160A and OLB 160B, coupled to Primary Switch Fabric Module 180A by OLB-Fabric coupling 169A and OLB-Fabric coupling 169B. Note: the Redundant SIM and associated couplings are omitted from the figure for clarity.
Each of the active FCMs and NMs of FIG. 2 is typically, but not necessarily, coupled to external devices on external networks, as illustrated in FIGS. 1A and 1B. It remains the case that all transport is by conventional Ethernet frames outside the chassis and by fabric frames within the chassis. Nevertheless, when such external devices or networks are involved, the termini of FCM-related and NM-related packet and message data exchanges may be considered from a certain perspective to extend to those external devices. However, even with coupled external devices, exchanges for at least some fabric frames related to system management and control will terminate within the various modules. Furthermore, in certain embodiments and scenarios, including scenarios with external devices, certain non-control data exchanges terminate within the NMs. Specifically, for the case of fabric IP packets unicast to the IP gateway interface on a pseudo port within an NM, the data exchange to the pseudo port terminates within the NM and is not visible externally.
In one illustrated configuration, PMM 150A is shown configured as two physical partitions, P1 201 and P2 202, while PMM 150B is shown configured as a single physical partition P3 203. In another illustrated configuration, PMM 150A and PMM 150B are shown configured as a single unified physical partition P4 204.
FCM-PMM Data Exchange 210 is representative of data exchanged between a storage sub-system coupled to an FCM and a PMM, or more specifically a physical partition of a PMM. As illustrated, this traffic is typically storage-related messages between processes executing on P3 203 of PMM 150B (including any of Application, Driver, or OS Processes) and an external storage sub-system (such as SAN 106 of FIG. 1B). In operation, bidirectional message information flows as cellified fabric frames via FCM-Fabric coupling 129A, Primary Switch Fabric Module 180A, and PMM-Fabric coupling 159B. For example, a storage sub-system request is generated by a storage sub-system Driver process executing on P3 203. The request is formed as a storage sub-system message addressed to the external storage sub-system coupled to FCM 120A, and delivered as cellified fabric frames to Primary Switch Fabric Module 180A via PMM-Fabric coupling 159B. Primary Switch Fabric Module 180A routes the cells to FCM-Fabric coupling 129A, which delivers the cellified fabric frames to FCM 120A. There, the cells of each fabric frame are reconstituted (or reconstructed) into the original storage sub-system message request, which is then sent to the storage sub-system attached to FCM 120A (such as Fibre Channel Storage Network 106 of FIG. 1B, for example).
In response, the storage sub-system returns a response message, which is formed by FCM 120A into one or more fabric messages addressed to P3 203. The fabric messages are fragmented into fabric frames that are delivered as cells to Primary Switch Fabric Module 180A via FCM-Fabric coupling 129A. Primary Switch Fabric Module 180A routes the cells via PMM-Fabric coupling 159B to P3 203 of PMM 150B. P3 203 reconstitutes the cells into fabric frames, then reassembles and delivers the response message to the storage sub-system Driver process executing on P3 203.
In various embodiments, FCM-PMM Data Exchange 210 may flow via PMM-Fabric coupling 159B′ instead of 159B, or it may flow partially via PMM-Fabric coupling 159B and partially via PMM-Fabric coupling 159B′. The operation is similar for these cases, as the fabric messages may be forwarded to P3 203 via 159B and 159B′ with no other change in operation.
NM-OLB Data Exchange 211 is representative of data exchanged between an NM and a service process executing on an OLB.
NM 130 A receives information, typically but not necessarily in IP packet form, from an external coupled client (such as Client 102 of FIG. 1B ), and classifies the packets, in part to determine a subset of the packets to be sent to OLB 160 B. Based in part on the classification, an appropriate subset of the information is formed into like-kind fabric packets including the destination address of OLB 160 B. An appropriate L 2 /L 3 forwarding decision is made and the fabric packets are then communicated as cells to Primary Switch Fabric Module 180 A via NM-Fabric coupling 139 A. Primary Switch Fabric Module 180 A forwards the cells toward OLB 160 B.
The cells are ultimately received via OLB-Fabric coupling 169 B, reconstituted as packets, and provided directly to the service process executing on OLB 160 B.
The reverse path is used to transfer information from the service process to the client coupled to NM 130 A.
Another path may also be used to transfer information from the service process to other destinations, such as an application process executing on a PMM.
NM 130 A recognizes a variety of SSL IP packet forms during classification, including HyperText Transfer Protocol Secure (HTTPS) as TCP/IP destination port 443 , Secure Simple Mail Transport Protocol (SSMTP) as TCP/IP destination port 465 , and Secure Network News Transfer Protocol (SNNTP) as TCP/IP destination port 563 .
IP fabric packets are formed including the destination IP address of OLB 160 B.
An L 3 forwarding decision is made and the IP fabric packets are provided as cells to the fabric and forwarded toward OLB 160 B.
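For illustration only, the port-based SSL classification described above can be summarized as a small sketch. The function and constant names below are assumptions of this sketch, not identifiers from the embodiment; in the NM the decision is made by classification hardware rather than software, but the port-to-service mapping is the same.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP destination ports recognized as SSL-bearing traffic
 * (HTTPS, SSMTP, SNNTP), per the description above. */
enum { SSL_PORT_HTTPS = 443, SSL_PORT_SSMTP = 465, SSL_PORT_SNNTP = 563 };

/* Hypothetical classifier helper: returns true when a packet's TCP
 * destination port indicates the packet should be formed into an IP
 * fabric packet destined for the SSL service OLB. */
static bool classify_as_ssl(uint16_t tcp_dst_port)
{
    switch (tcp_dst_port) {
    case SSL_PORT_HTTPS:
    case SSL_PORT_SSMTP:
    case SSL_PORT_SNNTP:
        return true;
    default:
        return false;
    }
}
```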
The SSL service process executing on OLB 160 B, upon receiving the reconstituted IP packets, performs SSL service functions such as context switching, state look-up, protocol layer demultiplexing, and decryption.
The SSL service process executing on OLB 160 B produces result data based in part on the packets received from the external client via NM 130 A.
The result data typically includes IP packets that may be sent back to the external client via NM 130 A (a handshake or acknowledgement, for example), as illustrated by NM-OLB Data Exchange 211 , or alternately addressed to P 3 203 (decrypted clear text, for example), as illustrated by PMM-OLB Data Exchange 216 .
Fabric packets are provided as cells to Primary Switch Fabric Module 180 A via OLB-Fabric coupling 169 B and forwarded accordingly.
NM 130 A recognizes TCP SYN packets during classification and forms an IP fabric packet including the destination IP address of OLB 160 B.
An L 3 forwarding decision is made and the IP fabric packet is provided as cells to the fabric and forwarded toward OLB 160 B.
The SLB service process executing on OLB 160 B, upon receiving a reconstituted packet, consults load information for the system, and assigns the request to a relatively unloaded physical partition of a PMM (such as one of P 1 201 , P 2 202 , and P 3 203 ), establishing a new connection.
The new connection is recorded in the appropriate L 3 FIBs, in order for NM 130 A to properly forward subsequent IP packets for the new connection to the assigned physical partition, enabling information flow from NM 130 A to the assigned physical partition without the need for indirection through OLB 160 B.
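A minimal sketch of the SLB assignment and FIB update just described follows; the data structures, the load metric, and the absence of bounds checking are assumptions made for brevity, not details of the embodiment.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-partition load record and L3 FIB entry. */
struct partition {
    uint32_t fabric_addr;   /* fabric egress port address of the partition */
    uint32_t active_conns;  /* crude load metric for "relatively unloaded"  */
};

struct l3_fib_entry {
    uint32_t flow_hash;     /* identifies the new TCP connection            */
    uint32_t fabric_addr;   /* where the NM forwards subsequent packets     */
};

/* Assign a new connection (seen as a TCP SYN) to the least-loaded
 * partition and record the result so the ingress NM can forward later
 * packets directly, without indirection through the OLB.
 * Note: the FIB array is assumed large enough (sketch only). */
static uint32_t slb_assign(struct partition *parts, size_t nparts,
                           uint32_t flow_hash,
                           struct l3_fib_entry *fib, size_t *fib_len)
{
    size_t best = 0;
    for (size_t i = 1; i < nparts; i++)
        if (parts[i].active_conns < parts[best].active_conns)
            best = i;
    parts[best].active_conns++;

    fib[*fib_len].flow_hash = flow_hash;
    fib[*fib_len].fabric_addr = parts[best].fabric_addr;
    (*fib_len)++;
    return parts[best].fabric_addr;
}
```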
SSL and SLB processing may be cascaded.
NM 130 A forwards cellified encrypted IP packet information from an encrypted external client toward OLB 160 B for SSL processing, or decryption.
OLB 160 B in turn forwards cellified decrypted (clear text) IP packet information to itself, another OLB, or a PMM for subsequent SLB processing.
Cellified packets are then forwarded first from NM 130 A to OLB 160 B for decryption, and then directly to the assigned physical partition.
The service process may also function as an XML server.
NM 130 A identifies XML requests from the external client, and forwards each request, in the form of cellified IP packets, toward OLB 160 B, where the XML service process analyzes the request. Appropriate response information is produced and provided as cellified packets forwarded toward NM 130 A.
While NM-OLB Data Exchange 211 illustrates data exchange between NM 130 A and OLB 160 B, NM 130 A may examine a packet (typically but not necessarily an IP packet) received from the client coupled to NM 130 A to determine an associated flow, and then selectively determine a destination OLB based on the determined flow (OLB 160 A or OLB 160 B, as appropriate). This provides a form of service processing load balancing.
The destination OLB may also be determined based on the type of service (SLB, SSL, or XML), if a particular OLB includes hardware acceleration specific to a service.
For example, if OLB 160 A includes an SSL hardware accelerator and OLB 160 B includes an XML hardware accelerator, then IP packets requiring SSL processing are typically directed toward OLB 160 A, and IP packets requiring XML processing are typically directed toward OLB 160 B.
Destination OLB determination may be performed based on combining service processing load balancing with selection based on hardware accelerator availability and location.
Service processing (such as SLB, SSL, and XML) is not restricted to OLBs, as PMMs and SCMs may also be configured to provide service processing.
NMs take the destination IP address assigned to a physical partition of a PMM (such as P 1 201 , P 2 202 , or P 3 203 , for example) or an SCM (such as Primary SCM 140 A, for example) and perform an L 3 forwarding decision to provide a fabric destination address in preparation for transit on the fabric as cells. The cells are then forwarded toward the appropriate PMM or SCM where the service process is executing.
NM-NM Data Exchange 212 is representative of data exchanged between NMs. This traffic is exemplary of network traffic between a first external client and a second external client coupled respectively to a port of NM 130 A and a port of NM 130 B, and wherein ES 1 110 A performs as a bridge, switch, or router. (Clients such as 102 and 103 of FIG. 1A are representative of the external clients.) The low-level details of the data exchange are substantially similar to NM-OLB Data Exchange 211 .
A port of NM 130 A receives information, typically in packet form, from the first external client (and potentially a plurality of other external sources), and classifies the packets (which may be for a plurality of destinations), in part to determine a subset of the packets to be sent to the port of NM 130 B. Based in part on the classification, an appropriate subset of the information is formed into fabric packets destination-addressed to the port of NM 130 B.
NM 130 A makes a forwarding decision that embeds a fabric address into the fabric packet, which is then communicated as cells to Primary Switch Fabric Module 180 A via NM-Fabric coupling 139 A. Primary Switch Fabric Module 180 A forwards the cells toward NM 130 B. After arriving at NM 130 B the cells are reconstituted as packets, and sent to the second external client coupled to the port of NM 130 B. The reverse path is used to transfer information from the second client to the first client, and operates in a symmetric manner.
An NM may forward data toward itself via the fabric. Operation in this scenario is similar to NM-NM Data Exchange 212 , except the packets are addressed to NM 130 A instead of NM 130 B.
The multiple media ports are distinguished via the sub-address portion of the fabric address.
NM-PMM Data Exchange 213 is representative of IP packets exchanged between an NM and a process (Application, Driver, or OS) executing on a PMM, typically under control of a higher-level protocol, such as Transmission Control Protocol (TCP) or User Datagram Protocol (UDP).
The data exchange is substantially similar to NM-OLB Data Exchange 211 .
NM 130 B forms a portion of received information (based on classification) as IP packets addressed to P 3 203 .
NM 130 B executes a forwarding decision on the destination IP address to obtain a fabric destination address in preparation for providing a fabric packet as cells to Primary Switch Fabric Module 180 A via NM-Fabric coupling 139 B.
Primary Switch Fabric Module 180 A forwards the cells toward P 3 203 .
Upon arrival via PMM-Fabric coupling 159 B (or alternatively PMM-Fabric coupling 159 B′), the cells are reconstituted as IP packets by PMM 150 B, and provided to P 3 203 .
The process transfers information to NM 130 B (and typically on to a client coupled to the NM) using the reverse path.
A return IP fabric packet is formulated by the process, IP destination-addressed to NM 130 B, a corresponding fabric address is obtained from a forwarding decision, and the IP fabric packet is provided as cells to Primary Switch Fabric Module 180 A for forwarding toward NM 130 B.
NM-SCM Data Exchange 214 is representative of data exchanged between an NM (or a client coupled thereto) and a management, forwarding, or load balancing process executing on an SCM.
The data exchange is substantially similar to NM-OLB Data Exchange 211 .
Packets addressed to Primary SCM 140 A are formulated by either an external client coupled to NM 130 B or (as an alternative example) by a control plane related process running on the Network Processor of NM 130 B.
NM 130 B forms corresponding fabric packets and a forwarding decision is made to determine the embedded fabric address.
The fabric packets are then provided as cells to Primary Switch Fabric Module 180 A via NM-Fabric coupling 139 B.
Primary Switch Fabric Module 180 A forwards the cells toward Primary SCM 140 A.
The management, controlplane, or load balancing process transfers information back to NM 130 B (or a client coupled thereto) using the reverse path.
A management packet addressed to NM 130 B (or the client coupled thereto) is formulated by a process executing on SCM 140 A, a corresponding fabric packet is formed, and a forwarding decision is made to determine the embedded fabric address.
The fabric packet is provided as cells to Primary Switch Fabric Module 180 A for forwarding toward NM 130 B.
Upon arrival at NM 130 B, the management packet is reconstructed. If addressed to NM 130 B, the packet is consumed therein. If addressed to the external client, the reconstructed packet is provided thereto.
A management client coupled externally to NM 130 B (typically for remote server management, provisioning, configuration, or other related activities) sends a management related packet via NM 130 B with the destination address of the management process executing on Primary SCM 140 A.
The packet is classified and determined to be a management related packet.
A forwarding decision is then made and a cellified version of the management packet is forwarded toward the management process via Primary Switch Fabric Module 180 A.
Return information from the management process to the management client uses the reverse path, by formulation (within SCM 140 A) of packets having the destination address of the management client coupled to NM 130 B.
A forwarding decision is then made and a cellified version of the return information packets is forwarded toward the management client via Primary Switch Fabric Module 180 A and NM 130 B.
IP packets would typically be used for the exchange between the management client and process.
NM 130 B classification determines that the proper L 2 forwarding for a received packet is not known, and designates the received packet as an “exception packet”.
NM 130 B forwards a cellified version of the exception packet to an L 2 FIB management process executing on the Primary SCM via Primary Switch Fabric Module 180 A.
The L 2 FIB management process examines the exception packet, the master L 2 FIB, and other forwarding related information, to determine the proper fabric address for the missing L 2 FIB entry.
The updated forwarding information is then recorded in the master L 2 FIB, in some embodiments, and propagated to the ancillary L 2 FIBs in order for NM 130 B to properly forward subsequent packets having the same or similar classification.
Primary SCM 140 A also provides a correct fabric address for the exception packet and emits an IP fabric packet equivalent to the exception packet (but addressed to the updated fabric address) as corresponding cells to Primary Switch Fabric Module 180 A for forwarding to the interface at the destination IP address (or at least one hop closer).
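For illustration, a sketch of the exception-packet handling just described is shown below. The entry layout, the resolver, the propagation helper, and the re-emission helper are all assumptions of the sketch (declared but not defined here); the embodiment does not specify these interfaces.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical L2 FIB entry keyed by destination MAC. */
struct l2_fib_entry {
    uint8_t  dst_mac[6];
    uint32_t fabric_addr;
};

struct l2_fib {
    struct l2_fib_entry *entries;
    size_t len, cap;
};

/* Placeholders for the management process consulting the master FIB and
 * other forwarding state, pushing updates to the NMs' ancillary FIBs, and
 * re-emitting the packet as cells; implementations are not described here. */
uint32_t resolve_fabric_addr(const uint8_t dst_mac[6]);
void     propagate_to_ancillary_fibs(const struct l2_fib_entry *e);
void     emit_as_cells(const void *pkt, size_t len, uint32_t fabric_addr);

/* Handle an exception packet forwarded by an NM whose L2 FIB missed:
 * learn the mapping, install it in the master FIB, push it to the
 * ancillary FIBs, and re-emit the packet toward the learned address. */
void handle_l2_exception(struct l2_fib *master, const uint8_t dst_mac[6],
                         const void *pkt, size_t pkt_len)
{
    struct l2_fib_entry e;
    memcpy(e.dst_mac, dst_mac, 6);
    e.fabric_addr = resolve_fabric_addr(dst_mac);

    if (master->len < master->cap)
        master->entries[master->len++] = e;   /* record in master L2 FIB  */

    propagate_to_ancillary_fibs(&e);          /* update the NMs' copies   */
    emit_as_cells(pkt, pkt_len, e.fabric_addr); /* forward the packet     */
}
```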
The fabric destination could be any of the elements coupled to the dataplane of Primary Switch Fabric Module 180 A, including NM 130 B or Primary SCM 140 A (this general data exchange is not illustrated in the figure).
For a load balancing process executing on Primary SCM 140 A, operation is similar to the SLB service executing on an OLB, except the IP packets are destination IP addressed to Primary SCM 140 A instead of an OLB.
PMM-SCM Data Exchange 215 is representative of data exchanged between an Application, Driver, or OS process executing on a physical partition of a PMM and a management, controlplane, or load balancing process executing on an SCM.
The data exchange is substantially similar to the exchanges described elsewhere herein.
A PMM-to-SCM communication is formed as a packet addressed to Primary SCM 140 A by a Driver process, for example, executing on P 3 203 .
A fabric packet is formed and a forwarding decision is made to determine the embedded fabric address.
The fabric packet is then provided as cells via PMM-Fabric coupling 159 B (or 159 B′), and forwarded via Primary Switch Fabric Module 180 A toward Primary SCM 140 A.
Upon arrival at Primary SCM 140 A and subsequent reassembly, the packet is provided to the management, controlplane, or load balancing process.
The reverse path is used for SCM-to-PMM communication, with the management, controlplane, or load balancing process formulating a packet addressed to P 3 203 , for communication to the Driver process.
The communication includes, for example, server load information relating to PMM 150 B.
PMM-SCM Data Exchange 215 is also representative of a variety of paths between an SCM and all other elements coupled to the fabric dataplane (such as FCMs, NMs, OLBs, and other PMMs), to update forwarding information maintained in each of the elements.
The controlplane process executing on Primary SCM 140 A formulates one or more packets to include the updated forwarding information and addresses the packet(s) to the appropriate fabric destination.
The packets are provided as cells to the fabric and the fabric forwards the cells according to the fabric destination.
The fabric destination may include a multicast destination, in which case the cellified packets are delivered to a plurality of destinations by the fabric.
PMM-OLB Data Exchange 216 is representative of data exchanged between a process (Application, Driver, or OS) executing on a physical partition of a PMM and a service process executing on an OLB.
The data exchange is substantially similar to PMM-SCM Data Exchange 215 , except that OLB 160 B takes the place of Primary SCM 140 A, and data is coupled via OLB-Fabric coupling 169 B instead of SCM-Fabric coupling 149 A.
Data exchanges between processes executing on different physical partitions are communicated on the fabric (PMM-PMM-Fabric Data Exchange 217 , for example). Data exchanges between processes executing within the same physical partition are communicated by coherent shared memory and coherent cache memory transactions (PMM-Internal Data Exchange 218 , for example). When multiple PMMs are configured as a single physical partition, coherent shared memory and coherent cache memory transactions travel between the PMMs of the partition via CSFI 170 (PMM-PMM-CSFI Data Exchange 219 , for example).
PMM-PMM-Fabric Data Exchange 217 is representative of data exchanged between a first process and a second process executing on different physical partitions, i.e. message-passing InterProcess Communication (IPC).
The two processes may be any combination of Application, Driver, or OS processes.
The data exchange is substantially similar to PMM-SCM Data Exchange 215 , except P 1 201 takes the place of Primary SCM 140 A, and data is coupled via PMM-Fabric coupling 159 A′ instead of SCM-Fabric coupling 149 A.
Another example of this type of communication would be between P 1 201 and P 2 202 (via PMM-Fabric coupling 159 A′ and PMM-Fabric coupling 159 A), even though these two physical partitions are on the same PMM.
PMM-Internal Data Exchange 218 is representative of data exchanged between two processes executing on the same physical partition, where the physical partition resides entirely within a single PMM.
A source process, executing on a first compute element of P 3 203 , writes to a shared memory location, and a sink process, executing on a second compute element of P 3 203 , reads the shared memory modified by the write.
Communication is provided by links internal to PMM 150 B supporting coherent shared memory and coherent cache memory.
PMM-PMM-CSFI Data Exchange 219 is representative of data exchanged between two processes executing on the same physical partition, where the physical partition spans more than one PMM and the two processes execute on different PMMs.
An example of this physical partition configuration is illustrated as P 4 204 , where P 4 204 includes all of the compute elements of PMM 150 A and PMM 150 B.
Coherent shared memory and coherent cache memory transactions are used to exchange data, as in PMM-Internal Data Exchange 218 . However, the transactions are communicated via CSFI 170 , instead of links internal to the PMMs.
Data exchanges may also occur between processes executing on physical partitions distributed across multiple PMMs via a combination of paths similar to PMM-Internal Data Exchange 218 and PMM-PMM-CSFI Data Exchange 219 . That is, particular coherent memory traffic (for both shared memory and cache memory) may travel via both CSFI 170 and via links internal to the PMMs.
Data exchanges involving an NM typically include Tag processing.
Incoming packets from a first client coupled to the NM are classified, producing a condensed representation of certain details of the incoming packet, typically by analyzing the header of the incoming packet.
A portion of the classification result is represented by a Tag, and typically a portion of the Tag, referred to as the egress Tag, is included in a header of the fabric packet produced by the NM in response to the incoming packet.
The egress Tag may specify selected packet processing operations to be performed by the NM during egress processing, thus modifying the packet header, data, or both, before receipt by a second client.
Egress packet processing may be performed in response to the corresponding Tag produced during ingress processing (in the NM that did the ingress processing on the packet) or in response to a Tag ‘manufactured’ by a service process on an OLB or an Application, Driver, or OS process executing on a PMM.
Egress processing is specified directly by the Tag in one embodiment, and in another embodiment egress processing is determined indirectly by the Tag (via a table look-up, for example).
The egress Tag provided may include information examined by the SSL process in order to perform any combination of SSL processing related context switching, SSL related per context state look-up, and early protocol layer demultiplexing.
The exception packet delivered to the routing process may include an egress Tag providing exception information to determine a particular Quality of Service (QoS) for the associated routing protocol.
The routing process examines the exception information of the egress Tag to determine the particular route processing queue to insert the exception packet into.
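As a minimal sketch of that queue selection, the following assumes the egress Tag carries a small QoS field; the field position and width are illustrative assumptions, not details of the embodiment.

```c
#include <stdint.h>

/* Assumed egress-Tag layout: a few bits carry exception/QoS information.
 * The shift and mask below are illustrative only. */
#define TAG_QOS_SHIFT 4u
#define TAG_QOS_MASK  0x3u   /* two QoS bits -> four route processing queues */

/* Select the route processing queue for an exception packet from the
 * exception information carried in its egress Tag. */
static unsigned route_queue_from_tag(uint32_t egress_tag)
{
    return (egress_tag >> TAG_QOS_SHIFT) & TAG_QOS_MASK;
}
```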
In one embodiment, Primary Switch Fabric Module 180 A provides for only a single transport of cells at a time between any pairing of ingress and egress ports. In another embodiment, Primary Switch Fabric Module 180 A provides for a plurality of simultaneous transports of cells between multiple pairings of fabric ingress and egress ports. This simultaneous transport may be by parallel communication paths available in the fabric, by interleaving cells from different transports in time on a single communication path, or any combination of these and other techniques. Those of ordinary skill in the art will recognize that the details of Primary Switch Fabric Module 180 A operation affect only the available bandwidth and latency provided by the fabric, not details of data exchanges as exemplified by FCM-PMM Data Exchange 210 , NM-OLB Data Exchange 211 , and so forth. In one embodiment, Primary Switch Fabric Module 180 A includes sufficient parallel resources to provide substantially simultaneous communication for all of the data exchanges illustrated in the figure.
FIG. 3A illustrates Fabric Communication 300 A, conceptually showing selected aspects of an embodiment of packet transmission and reception in an ES embodiment, used in FCM-PMM Data Exchange 210 and other similar data exchanges described elsewhere herein.
Fabric client elements coupled to Primary Switch Fabric Module 180 A include PMM 150 A, OLB 160 A, NM 130 A, FCM 120 A, and Primary SCM 140 A.
Each coupled client includes RAM, shown respectively as PMM RAM 350 , OLB RAM 360 , NM RAM 330 , FCM RAM 320 , and SCM RAM 340 .
Each RAM in turn includes a packet memory image, shown respectively as Packet Memory Image PKT PMM 351 , Packet Memory Image PKT OLB 361 , Packet Memory Image PKT NM 331 , Packet Memory Image PKT FCM 321 , and Packet Memory Image PKT SCM 341 .
A VIOC is included in each of PMM 150 A, OLB 160 A, FCM 120 A, and Primary SCM 140 A, illustrated as VIOC 301 . 5 , VIOC 301 . 6 , VIOC 301 . 2 , and VIOC 301 . 4 , respectively.
The VIOCs are shown conceptually coupled to corresponding packet images as dashed arrows 359 , 369 , 329 , and 349 , respectively.
The VIOCs provide an interface to the fabric via PMM-Fabric coupling 159 A, OLB-Fabric coupling 169 A, FCM-Fabric coupling 129 A, and SCM-Fabric coupling 149 A, respectively.
NM 130 A includes a Traffic Manager (TM 302 ), also known as a Buffer and Traffic Manager (BTM), instead of a VIOC.
The TM is shown conceptually coupled to Packet Memory Image PKT NM 331 via dashed arrow 339 .
TM 302 provides an interface for NM-Fabric coupling 139 A.
Packet transmission begins at the source fabric client by reading a packet image from a source RAM and providing the packet as cells for transmission via the fabric.
The fabric routes the cells to the appropriate destination fabric client.
Packet reception conceptually begins at the fabric edge of the destination fabric client, where the cells are received from the fabric and reassembled into a corresponding packet (including information from the packet image as read from the source RAM) and written into a destination RAM in the destination fabric client.
Each of TM 302 and the VIOCs ( 301 . 5 , 301 . 6 , 301 . 2 , and 301 . 4 ) comprises various logic blocks, including a fabric interface communication unit (also known as a packet-cell converter) for performing the functions relating to packet transmission and packet reception via cells on the fabric.
The fabric communication operation of all fabric clients is substantially similar, but for clarity is described in a context of data exchanged between PMM 150 A and NM 130 A (such as NM-PMM Data Exchange 213 , for example).
The fabric interface communication units read a packet image from a RAM included in a fabric client (such as Packet Memory Image PKT PMM 351 included in PMM RAM 350 ).
The packet image includes a header and packet body data, illustrated as Header 311 and Packet Body 312 respectively.
The fabric interface communication unit (included in VIOC 301 . 5 in this example) conceptually segments the packet into a plurality of cells of equal length, illustrated as Cell C 1 313 . 1 ′ through Cell C M-1 313 .(M- 1 )′.
The final Cell C M 313 .M′ is typically a different length than the other cells, as packets are not restricted to integer multiples of cell sizes.
The packet body data may instead be scattered in various non-contiguous buffers in RAM, according to various embodiments.
Each of the cells is encapsulated with a header, illustrated as Cell Header h 1 314 . 1 , corresponding to Cell C 1 313 . 1 , and Cell Header h M 314 .M, corresponding to Cell C M 313 .M.
The cell headers for each of the cells segmented from a given packet are determined in part from the packet header, as shown conceptually by arrow 310 . 1 and arrow 310 .M flowing from Header 311 to Cell Header h 1 314 . 1 and to Cell Header h M 314 .M respectively.
Each of the resultant encapsulated cells is provided, in order, to the fabric for routing to the destination. Segmenting the packet into cells and encapsulating the cells is also referred to as “cellification”.
Header 311 includes a packet destination address, and VIOC 301 . 5 determines a cell destination address in part from the destination address of the packet header, in addition to routing tables and state information available to the VIOC.
The cell destination address, also known as a fabric destination or egress port address, is included in each of the cell headers (Cell Header h 1 314 . 1 through Cell Header h M 314 .M).
This technique enables a process executing on a fabric client to transparently address a packet to another fabric client using a logical address for the addressed client.
The resolution of the packet address to a fabric egress port address corresponds to resolving the logical address to a physical address, i.e. a specific port of the switch that the addressed client is coupled to.
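A compact sketch of cellification as described above follows. The cell payload size, header fields, and the resolver function are assumptions of the sketch; as noted further below, a real implementation may avoid copying by reading the packet in place in cell-sized chunks.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CELL_PAYLOAD 64   /* illustrative cell payload size, not from the text */

/* Hypothetical cell header: fabric egress (cell destination) address plus
 * ordering/length bookkeeping derived from the packet header. */
struct cell_header {
    uint32_t fabric_dst;  /* resolved from the packet's logical destination    */
    uint16_t seq;         /* cell index within the packet                      */
    uint16_t len;         /* payload bytes in this cell (final cell may be short) */
};

struct cell {
    struct cell_header hdr;
    uint8_t payload[CELL_PAYLOAD];
};

/* Placeholder for the VIOC's lookup of routing tables and state that maps a
 * logical packet destination to a fabric egress port address. */
uint32_t resolve_fabric_dst(uint32_t packet_dst);

/* Segment a packet body into cells and encapsulate each with a header
 * derived from the packet header ("cellification"). Returns the cell count. */
size_t cellify(uint32_t packet_dst, const uint8_t *body, size_t body_len,
               struct cell *out, size_t max_cells)
{
    uint32_t fabric_dst = resolve_fabric_dst(packet_dst);
    size_t n = 0;

    for (size_t off = 0; off < body_len && n < max_cells; off += CELL_PAYLOAD, n++) {
        size_t chunk = body_len - off;
        if (chunk > CELL_PAYLOAD)
            chunk = CELL_PAYLOAD;
        out[n].hdr.fabric_dst = fabric_dst;
        out[n].hdr.seq = (uint16_t)n;
        out[n].hdr.len = (uint16_t)chunk;
        memcpy(out[n].payload, body + off, chunk);
    }
    return n;
}
```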
The reverse direction receive path operates in a complementary, conceptually symmetric, inverse manner.
The segmented cells are routed by the fabric, in order, to the fabric port specified by the cell destination address.
The fabric interface communication unit included in the destination fabric client receives the cells, processes the headers, and reconstitutes (or reconstructs) the cells into a packet image in RAM, resulting in a packet image substantially similar to the packet image originally provided by the transmitting fabric client.
The destination address of Packet Memory Image PKT PMM 351 resolves to NM 130 A.
The cells are routed to NM-Fabric coupling 139 A by Primary Switch Fabric Module 180 A.
TM 302 receives the cells via NM-Fabric coupling 139 A, assembles them back into a packet, and stores the resultant image as Packet Memory Image PKT NM 331 in NM RAM 330 .
Packet transmission and reception procedures are not limited to complete packet images in RAM.
Packet information flowing to the fabric may be provided, in some embodiments, directly from a network interface included on the NM, without intermediate storage in RAM.
Packet information flowing from the fabric may, in some embodiments, be provided directly to the network interface without intermediate storage in RAM.
The same techniques may be used on an FCM with respect to information flowing to and from the storage interface.
The sans-header intermediate cells, Cell C 1 313 . 1 ′ through Cell C M 313 .M′, are only conceptual in nature, as some embodiments implement cellification without copying packet data. Instead, packet data is accessed in-situ in cell-sized chunks and encapsulation is performed on-the-fly.
The fabric interface communication units included in TMs and VIOCs further include logic adapted to allocate and manage bandwidth and priority for various flows as identified by any combination of classification information, Tag, and a sub-port portion of a fabric egress port address. This enables provisioning of bandwidth and setting of priorities for transport according to operational requirements.
The particular priority at which transport is performed is selectively determined by examination of the sub-port portion of the fabric egress port address.
NM 130 A may be configured with a high-priority queue and a low-priority queue, having corresponding high-priority and low-priority sub-port addresses.
With respect to NM-PMM Data Exchange 213 , for example, a portion of the data exchange may be performed via the low-priority queue and another portion performed via the high-priority queue.
A process desiring selective access to the high-priority queue and low-priority queue addresses packets accordingly, providing a corresponding high-priority packet destination address to high-priority packets and a corresponding low-priority packet destination address to low-priority packets.
The high-priority packet address and the low-priority packet address are resolved by the appropriate VIOC on PMM 150 B to a corresponding high-priority fabric egress port address and a corresponding low-priority egress port address.
The high-priority egress port address and the low-priority egress port address include a fabric port number sub-portion that is identical for the two egress port addresses, since both packets are destined to the same pluggable module. However, the sub-port portion is different, to distinguish between high and low priorities.
Upon receipt in the NM of high-priority cells and low-priority cells (corresponding to cells from packets addressed to the high-priority queue and the low-priority queue, respectively), the TM on NM 130 A examines the sub-port portion of the fabric egress port addresses provided with each cell, and selectively identifies packets as associated with the high-priority queue or the low-priority queue as appropriate.
The sub-port portion may also include a bandwidth allocation portion to identify one of a plurality of bandwidth allocations to associate with the packet assembled from the cell. Still other embodiments provide for combining priority and bandwidth allocation dependent processing according to the sub-port portion of the fabric egress port address.
FCMs, for example, may provide for allocation of bandwidth to various coupled storage devices or networks via the sub-port mechanism.
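A small sketch of the fabric port number / sub-port split described above is given below; the bit widths and the two sub-port values are assumptions for illustration, not a format defined by the embodiment. A receiving TM would apply the sub-port accessor to each arriving cell's egress address to steer the reassembled packet to the corresponding queue.

```c
#include <stdint.h>

/* Assumed split of a fabric egress port address into a fabric port number
 * and a sub-port portion; the widths are illustrative only. */
#define SUBPORT_BITS 4u
#define SUBPORT_MASK ((1u << SUBPORT_BITS) - 1u)

/* Illustrative sub-port values for a module exposing two queues. */
enum { SUBPORT_LOW_PRIO = 0u, SUBPORT_HIGH_PRIO = 1u };

static uint32_t make_egress_addr(uint32_t fabric_port, uint32_t subport)
{
    return (fabric_port << SUBPORT_BITS) | (subport & SUBPORT_MASK);
}

static uint32_t egress_fabric_port(uint32_t egress_addr)
{
    return egress_addr >> SUBPORT_BITS;   /* identical for both priorities */
}

static uint32_t egress_subport(uint32_t egress_addr)
{
    return egress_addr & SUBPORT_MASK;    /* distinguishes high vs. low queue */
}
```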
The fabric interface communication units of TMs and VIOCs provide hardware support for a reliable message protocol in addition to packet communication.
State machines implement a connection-oriented procedure including establishing a connection via a connection request and a corresponding acknowledgement, sending and receiving one or more messages using the established connection, and terminating the connection after it is no longer needed. Delivery of message content is guaranteed, using a limited number of retries, otherwise an error is returned to the sender.
Message images are constructed similar to packet images, with an indicator included in the message image identifying the image as a message instead of a packet.
The message image includes a message header, similar in format to a packet header, and message body data, similar to packet body data.
The message body data is communicated in the same manner as packet body data.
The message header includes a message destination similar in format to a packet destination address. The message destination address is resolved into a cell destination address for inclusion in the cells during cellification, similar to the resolution of a packet destination address.
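The connection-oriented procedure mentioned above can be pictured as a small state machine. The state and event names below are assumptions made for this sketch; the embodiment only states that connections are established via request and acknowledgement, used for one or more messages, and then terminated, with exhausted retries reported as an error.

```c
#include <stdbool.h>

/* Assumed connection states and events for the hardware reliable-message
 * protocol; names are illustrative only. */
enum msg_conn_state { CONN_IDLE, CONN_REQ_SENT, CONN_OPEN, CONN_CLOSED };

enum msg_event {
    EV_OPEN_REQUEST,   /* sender issues a connection request            */
    EV_OPEN_ACK,       /* peer acknowledges the request                 */
    EV_SEND_FAILED,    /* retries exhausted: report an error to sender  */
    EV_CLOSE           /* connection no longer needed                   */
};

/* One step of the connection-oriented procedure: establish via request and
 * acknowledgement, exchange messages while open, then terminate. */
static enum msg_conn_state msg_conn_step(enum msg_conn_state s,
                                         enum msg_event ev,
                                         bool *report_error)
{
    *report_error = false;
    switch (s) {
    case CONN_IDLE:
        return (ev == EV_OPEN_REQUEST) ? CONN_REQ_SENT : s;
    case CONN_REQ_SENT:
        if (ev == EV_OPEN_ACK)    return CONN_OPEN;
        if (ev == EV_SEND_FAILED) { *report_error = true; return CONN_CLOSED; }
        return s;
    case CONN_OPEN:
        if (ev == EV_SEND_FAILED) { *report_error = true; return CONN_CLOSED; }
        if (ev == EV_CLOSE)       return CONN_CLOSED;
        return s;
    default:
        return CONN_CLOSED;
    }
}
```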
FIG. 3B illustrates Prioritized Fabric Communication 300 B, conceptually showing selected aspects of an embodiment of packet and cell prioritized transmission in an ES embodiment, focusing on transports originating from a VIOC.
VIOCs 301 . 5 and 301 . 6 , along with TM 302 , are coupled to Primary Switch Fabric Module 180 A via couplings 159 A, 169 A, and 139 A, respectively.
A VIOC typically is closely associated with 16 groups of four transmit queues each, providing a total of 64 transmit queues.
The transmit queues are conceptualized as existing within the VIOC.
The transmit queues physically reside in shared portions of the host memory, although their state is managed within the VIOC and portions are buffered within the VIOC.
Each transmit queue is configurable as operating according to a specified priority or according to a specified bandwidth.
For example, Q 1 309 . 1 may be configured to operate at a strictly higher priority than Q 2 309 . 2 and Q 3 309 . 3 , or the queues may be configured to operate in a round-robin priority with respect to each other.
As another example, Q 1 309 . 1 may be configured to operate at twice the bandwidth of Q 2 309 . 2 and Q 3 309 . 3 .
As a further example, Q 1 may be configured for a first maximum bandwidth and a first minimum bandwidth, Q 2 309 . 2 may be configured for a second maximum and a second minimum bandwidth, and Q 3 309 . 3 may be configured for third maximum and minimum bandwidths.
In addition to transmit queues, VIOCs typically implement virtual output queues (VOQs) to prevent head-of-line blocking, in order to maximize the bandwidth of transmission to the fabric.
The VOQs are implemented as pointer-managed buffers within an egress shared memory internal to the VIOC.
A subset of the VOQs in VIOC 301 . 6 is illustrated as VOQ 1 308 . 1 and VOQ 2 308 . 2 , one for each of the two illustrated destinations (VIOC 301 . 5 and TM 302 ).
The VOQs are processed according to configurable priority algorithms, including a straight (or strict) priority algorithm, a straight round-robin algorithm (without weights), a weighted round-robin algorithm, and a weighted round-robin algorithm with configurable weights, according to embodiment.
A transmit queue selection is made according to the configured priority and bandwidth. Data is then made available from the selected queue accordingly, and provided to the fabric for transfer to the destination according to the virtual output queue associated with the destination. These procedures repeat as long as any of the queues are not empty.
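The selection-then-VOQ flow can be sketched as below under a strict-priority policy; the queue structures, depths, and the absence of overflow checks are assumptions of the sketch, and the weighted or round-robin policies named above would replace the selection loop.

```c
#include <stddef.h>
#include <stdbool.h>

#define NUM_TXQ 3    /* Q1..Q3 in the example                 */
#define NUM_VOQ 2    /* one VOQ per illustrated destination   */
#define QDEPTH  64   /* illustrative queue depth              */

/* A minimal packet reference: which destination (hence which VOQ) it targets. */
struct pkt { unsigned dest; };

struct queue {
    struct pkt items[QDEPTH];
    size_t head, tail;        /* monotonically increasing indices, wrapped below */
};

static bool q_empty(const struct queue *q)        { return q->head == q->tail; }
static void q_push(struct queue *q, struct pkt p) { q->items[q->tail++ % QDEPTH] = p; }
static struct pkt q_pop(struct queue *q)          { return q->items[q->head++ % QDEPTH]; }

/* One scheduling step under strict priority: pick the highest-priority
 * non-empty transmit queue (index 0 is highest) and move its head packet
 * to the virtual output queue matching its destination, so one congested
 * destination cannot block traffic headed elsewhere. Returns false when
 * all transmit queues are empty. */
static bool schedule_once(struct queue txq[NUM_TXQ], struct queue voq[NUM_VOQ])
{
    for (int i = 0; i < NUM_TXQ; i++) {
        if (q_empty(&txq[i]))
            continue;
        struct pkt p = q_pop(&txq[i]);
        q_push(&voq[p.dest % NUM_VOQ], p);
        return true;
    }
    return false;
}
```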
One of Q 1 309 . 1 , Q 2 309 . 2 , and Q 3 309 . 3 is selected, and then examined to determine the next packet (or message) to transmit.
In the illustrated example, Q 1 309 . 1 is configured as the highest priority, Q 2 309 . 2 is the next highest priority, and Q 3 309 . 3 is configured as the lowest priority.
A first selection and subsequent transmission is made from Q 1 309 . 1 , as it is configured as the highest priority and includes at least one packet ready for transmission.
Data is read according to Q 1 309 . 1 , and determined to be destined for the fabric egress port associated with PMM-Fabric coupling 159 A.
Data is transferred to Primary Switch Fabric Module 180 A under the control of VOQ 1 308 . 1 , and further transferred to PMM-Fabric coupling 159 A.
A second selection and transmission is made from Q 2 309 . 2 , as it is configured as the next highest priority, it includes at least one packet ready for transmission, and Q 1 309 . 1 is empty.
Data is read according to Q 2 309 . 2 , determined to be destined for PMM-Fabric coupling 159 A, provided to Primary Switch Fabric Module 180 A under the control of VOQ 1 308 . 1 , and transferred to PMM-Fabric coupling 159 A.
A third selection and transmission, shown conceptually as Packet Transmission Path 317 . 3 , is made from Q 3 309 . 3 , as it is configured as the lowest priority, it is not empty, and Q 1 309 . 1 and Q 2 309 . 2 are empty.
Data is read according to the selected transmit queue (Q 3 309 . 3 ) as in the previous two scenarios, but the destination is determined to be the fabric egress port associated with NM-Fabric coupling 139 A, and therefore data is provided to the fabric under the control of VOQ 2 308 . 2 .
Transmission of data from differing virtual output queues may instead be interleaved on the fabric.
Transmission of data from VOQ 1 308 . 1 (such as Packet Transmission Paths 317 . 1 or 317 . 2 ) may overlap in time with the transmission of data from VOQ 2 308 . 2 (such as Packet Transmission Path 317 . 3 ).
Cells from the overlapping transmissions are wholly or partially interleaved on the fabric.
In addition to prioritized selection among the transmit queues, the switch fabric also typically provides for prioritized transport. Each cell may specify a priority, and in one embodiment there are four priority levels available. The cell priority is developed from any combination of various parameters, including packet size, packet type, packet class of service, packet quality of service, transmit queue priority, and other packet header information. As shown in the figure, Cell Transmission Path 318 . 1 provides for transmission of cells from VIOC 301 . 6 to VIOC 301 . 5 , and Cell Transmission Path 318 . 2 provides for transmission of cells from VIOC 301 . 6 to TM 302 . Each of Paths 318 . 1 and 318 . 2 may transfer cells according to any of the four priorities. For example, cells corresponding to Packet Transmission Path 317 . 1 may be transferred at the highest priority, while cells corresponding to Packet Transmission Path 317 . 2 may be transferred at a lower priority.
FIGS. 4A-4E illustrate various embodiments of pluggable modules included in various ES embodiments.
The modules share many similar characteristics.
Each of the modules includes a fabric interface communication unit included in a TM or a VIOC.
Each of the modules typically includes one or more computation and memory elements. Couplings between elements of the modules typically operate in the same or substantially similar fashion.
RAM elements are shown with identifiers prefixed with 411 , and these elements are typically Dynamic Random Access Memories (DRAMs) organized as Dual Inline Memory Modules (DIMMs) in some embodiments.
CPU elements are shown with identifiers prefixed with 410 , and these elements are typically Opteron processors.
VIOC identifiers are prefixed with 301 .
Elements representing combined Ternary Content Addressable Memory (TCAM) and Static Random Access Memory (SRAM) identifiers are prefixed with 403 .
BMC elements are prefixed with 402 .
FCI elements are prefixed with 413 , and the associated optional coupling identifiers are prefixed with 414 .
HyperTransport (HT) channel couplings are shown with identifiers prefixed with 460 .
FIG. 4A illustrates SIM Detail 400 A, including selected aspects of an embodiment of a SIM configured as a pluggable module including an SCM and an associated SFM. It will be understood that the discussion of FIG. 4A is made with respect to the capabilities and topology of the primary SIM and primary fabric, but the discussion equally describes the redundant topology and latent capabilities of the secondary SIM and secondary fabric. As discussed elsewhere herein, the secondary fabric remains dormant with respect to non-control dataplane functionality as long as the primary fabric operates properly.
Primary SCM 140 A includes compute and associated memory elements CPU 410 . 4 L/RAM 411 . 4 L coupled to CPU 410 . 4 R/RAM 411 . 4 R via HT coupling 460 . 4 L.
VIOC 301 . 4 is coupled to CPU 410 . 4 R via HT coupling 460 . 4 R. VIOC 301 . 4 is in communication with TCAM/SRAM 403 . 4 and provides a fabric interface for SCM-Fabric coupling 149 A.
Management I/O 412 is coupled to CPU 410 . 4 L via HT coupling 460 . 4 M and provides an interface to the intra-chassis BMCs via coupling 452 .
Primary SCM 140 A also includes BMC 402 . 4 coupled to VIOC 301 . 4 and Management I/O 412 .
Mass Storage 412 A is coupled to Management I/O 412 via coupling 453 and provides local mass storage.
Primary Switch Fabric Module 180 A includes Primary SFM Dataplane (SFDP) 404 having a plurality of fabric ports with respective fabric port addresses.
The fabric ports are coupled to the various system modules via SCM-Fabric coupling 149 A, FCM-Fabric coupling 129 A, NM-Fabric coupling 139 A, PMM-Fabric coupling 159 A, and OLB-Fabric coupling 169 A.
Each module in the system may be coupled to one or more of the fabric ports, and at least some of the foregoing illustrated fabric couplings represent more than one full-duplex fabric coupling to the Primary Switch Fabric Module 180 A. For example, in one embodiment, there may be up to two PMM modules, and each PMM module has two full-duplex fabric couplings.
PMM-Fabric coupling 159 A may therefore be representative of four full-duplex fabric couplings to four respective fabric ports, each having a respective fabric port address. Each module or portion thereof having its own fabric coupling to a fabric port is addressable via the corresponding fabric port address.
Primary Switch Fabric Module 180 A also includes Primary Switch Fabric Scheduler 401 coupled to Primary SFDP 404 . In operation, SFDP 404 , under the direction of Switch Fabric Scheduler 401 , routes data as cells provided to a fabric ingress port to a fabric egress port, according to a cell destination address, as described elsewhere herein.
Each of dataplane couplings 149 A, 129 A, 139 A, 159 A, and 169 A couples with a respective fabric ingress port and a respective fabric egress port.
Primary SCM 140 A executes any combination of management, controlplane, and load balancing processes using compute and memory resources provided by CPU 410 . 4 L/RAM 411 . 4 L and CPU 410 . 4 R/RAM 411 . 4 R.
The CPUs operate as a single SMP complex, communicating shared memory coherency and cache memory coherency transactions via HT coupling 460 . 4 L.
VIOC 301 . 4 operates as an intelligent I/O device responding to commands from the CPUs, typically originating from a Driver process.
A Driver process executing on one of the CPUs forms a packet image in one of the RAMs, including specifying a destination address for the packet, and then notifies the VIOC that a new packet is available for transmission.
The VIOC fabric interface communication transmit unit directly accesses the packet image from RAM via an included transmit Direct Memory Access (DMA) unit.
The VIOC examines the packet header and identifies the packet destination address. The transmission of packets as cells proceeds without direct assistance from any of the processes executing on the CPUs.
The packet address and other associated information are referenced in accessing forwarding and state information maintained in TCAM/SRAM 403 . 4 to determine the corresponding fabric egress port address and other related information for inclusion in headers of cells provided to the fabric to transmit the packet as cells.
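For illustration, the Driver-to-VIOC handoff just described might look like the following sketch. The descriptor layout, the doorbell register, its address, and the ownership flag are all assumptions of the sketch; the embodiment only states that the Driver forms a packet image in RAM and notifies the VIOC, after which the VIOC's transmit DMA unit and TCAM/SRAM lookup complete the work without CPU involvement.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transmit descriptor the Driver fills in for the VIOC. */
struct tx_descriptor {
    uint64_t packet_addr;            /* where the packet image sits in RAM      */
    uint32_t packet_len;
    uint32_t packet_dst;             /* logical destination from the header     */
    volatile uint32_t owned_by_vioc; /* set to 1 once handed to the VIOC        */
};

/* Conceptual memory-mapped "doorbell" used to notify the VIOC; the
 * address is purely illustrative. */
static volatile uint32_t *const vioc_tx_doorbell =
    (volatile uint32_t *)0xF0000000u;

/* Driver-side transmit: build the packet image in RAM, fill a descriptor,
 * then notify the VIOC. From this point the VIOC reads the image via its
 * transmit DMA unit, consults TCAM/SRAM for the fabric egress port address,
 * and emits the packet as cells without further CPU assistance. */
static void vioc_post_tx(struct tx_descriptor *d, uint8_t *pkt_ram,
                         const uint8_t *payload, uint32_t len, uint32_t dst)
{
    memcpy(pkt_ram, payload, len);                 /* packet image in host RAM */
    d->packet_addr = (uint64_t)(uintptr_t)pkt_ram; /* conceptual address only  */
    d->packet_len = len;
    d->packet_dst = dst;
    d->owned_by_vioc = 1;                          /* hand ownership over      */
    *vioc_tx_doorbell = 1;                         /* notify the VIOC          */
}
```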
VIOC 301 . 4 also operates as an intelligent I/O device in the reverse direction, in a conceptually symmetric fashion.
Cells are received from the fabric and reassembled as packets by the VIOC fabric interface communication receive unit.
The packet data is partially reassembled directly into a packet image stored in one of the RAMs via an included receive DMA unit.
The reception of packets as cells proceeds without direct assistance from any of the processes executing on the CPUs.
The VIOC notifies one or more of the CPUs that new packet data is available, and subsequently a process, typically a Driver process, accesses the packet image provided in RAM and processes it accordingly.
The management process (or processes) executing on the CPUs of Primary SCM 140 A communicate management and configuration control information via Management I/O 412 between Primary SCM 140 A and other modules via coupling 452 , coupled to BMCs included in PMM 150 A, NM 130 A, FCM 120 A, OLB 160 A, and Primary SCM 140 A (local BMC 402 . 4 ).
This communication is typically via a dedicated management Ethernet network, and is consequently out-of-band with respect to Primary Switch Fabric Module 180 A.
BMC 402 . 4 provides baseboard management functions, communicating with Management I/O 412 and VIOC 301 . 4 .
The processes executing on the CPUs of Primary SCM 140 A collect selected management information from all BMCs in the server and, in response to the collected information and provisioning commands received from elsewhere, provide management and configuration commands to the BMCs.
Management I/O 412 also communicates configuration and control information via coupling 451 between management and controlplane processes executing on the CPUs and Switch Fabric Scheduler 401 . This provides, for example, for static or dynamic configuration of the SCMs, one as the Primary SCM and the other as the Redundant SCM.
Mass Storage 412 A may include any combination of mass storage device types including Flash memory, Magnetic Disk memory, and Optical Disk memory.
The mass storage devices may be coupled via any combination of storage interface types including but not limited to PC Card, Compact Flash, Multi-Media Card, Memory Stick, Smart Card, Secure Digital, Universal Serial Bus (USB), FireWire, SCSI (Small Computer System Interface), IDE (Integrated Device Electronics), EIDE (Enhanced IDE), and variations and successors thereof.
In some embodiments the local mass storage is omitted, and this data is accessed from mass storage devices or networks remotely via FCMs 120 or NMs 130 .
FIG. 4B illustrates PMM Detail 400 B, including selected aspects of an embodiment of a PMM configured as a pluggable module.
The PMM is arranged as a pair of identical sections, Half-PMM 430 and Half-PMM 430 ′.
Each section includes two CPU/RAM elements coupled to each other by HT links, a VIOC/TCAM/SRAM element interfacing to a fabric coupling, and an optional FCI.
The coupling of these elements is substantially similar to corresponding elements of Primary SCM 140 A, except that Management I/O 412 is omitted.
The two Half-PMMs share BMC 402 . 5 .
The two Half-PMMs are coupled to each other by a pair of HT links (HT coupling 460 . 5 X and HT coupling 460 . 5 Y).
One of the CPUs of each half also provides an HT interface for coupling to another PMM (such as PMM 150 B of FIG. 2 ) via CSFI-PMM coupling 179 A and CSFI-PMM coupling 179 A′.
In some embodiments these couplings are coupled directly to another identically configured PMM, and in other embodiments these couplings are coupled indirectly to another PMM via CSFI 170 (with variations illustrated in FIG. 1A and FIG. 2 ).
Shared memory coherency and cache memory coherency transactions are communicated over the HT couplings internal to the PMM ( 460 . 5 L, 460 . 5 X, 460 . 5 L′, and 460 . 5 Y) and over HT couplings external to the PMM ( 179 A and 179 A′).
The HT couplings communicating shared memory coherency and cache memory coherency transactions, and CSFI 170 , are programmatically configurable to provide for physical partitioning of the CPU/RAM elements of PMMs.
In one configuration, the PMM is configured as a single 4-way physical partition by programming the internal HT links ( 460 . 5 L, 460 . 5 X, 460 . 5 L′, and 460 . 5 Y) for coherent operation, and programming the external HT links ( 179 A and 179 A′) for “isolated” operation (i.e. links 179 A and 179 A′ are disabled).
Isolating a PMM for configuration as a single 4-way physical partition (or as two 2-way physical partitions) may also be performed by programmatically configuring CSFI 170 (of FIG. 1A ) to isolate the PMM from other PMMs.
In another configuration, the PMM is configured as a pair of identical 2-way physical partitions (Half-PMM 430 and Half-PMM 430 ′) by programmatically configuring a portion of the internal HT links ( 460 . 5 L and 460 . 5 L′) for coherent operation, and another portion of the internal HT links ( 460 . 5 X and 460 . 5 Y) for isolated operation.
The external HT links ( 179 A and 179 A′) or CSFI 170 are also programmed for isolated operation.
In yet another configuration, a plurality of PMMs are configured as a single unified 8-way physical partition by programmatically configuring all of the internal and external HT links of all of the PMMs (and also CSFI 170 , depending on the embodiment) for coherent operation.
Each PMM is programmatically partitioned according to provisioning information.
Physical partitions can be established that have one-half of a PMM (2-way), a single PMM (4-way), or two PMMs (8-way). It will be understood that the number of SMP-ways per half of a PMM is merely illustrative and not limiting, as is the configurable topology for aggregation of SMP-ways.
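A minimal sketch of the link programming that yields those three partition shapes is given below. The structure, enum, and link ordering are assumptions made for illustration; the embodiment describes the coherent/isolated programming of the named HT links and CSFI 170, not a software interface for it.

```c
#include <stdbool.h>

/* The four internal and two external HT links called out above. Which links
 * are programmed coherent versus isolated determines the partition shape. */
struct pmm_ht_config {
    bool internal_coherent[4];   /* 460.5L, 460.5X, 460.5L', 460.5Y */
    bool external_coherent[2];   /* 179A, 179A'                     */
};

enum partition_mode { PART_TWO_2WAY, PART_ONE_4WAY, PART_UNIFIED_8WAY };

/* Produce the HT link programming for one PMM given the provisioned
 * partition mode, following the three configurations described above. */
static struct pmm_ht_config pmm_partition(enum partition_mode mode)
{
    struct pmm_ht_config c = { { false, false, false, false }, { false, false } };

    switch (mode) {
    case PART_TWO_2WAY:          /* only 460.5L and 460.5L' coherent      */
        c.internal_coherent[0] = true;
        c.internal_coherent[2] = true;
        break;
    case PART_ONE_4WAY:          /* all internal links coherent           */
        for (int i = 0; i < 4; i++)
            c.internal_coherent[i] = true;
        break;
    case PART_UNIFIED_8WAY:      /* internal and external links coherent  */
        for (int i = 0; i < 4; i++)
            c.internal_coherent[i] = true;
        c.external_coherent[0] = true;
        c.external_coherent[1] = true;
        break;
    }
    return c;
}
```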
Application, Driver, and OS processes are executed on the resultant physical partitions.
Each resultant physical partition always includes at least one VIOC. The VIOC provides for communication between the executing processes and other clients of the fabric via packet images in memory, operating as described elsewhere herein.
In some embodiments, one or both of optional FCIs 413 . 5 and FCI 413 . 5 ′ are included, to access boot images or related information, via couplings 414 . 5 and 414 . 5 ′ and FCIs 413 . 5 and 413 . 5 ′, from either a local mass storage device or via a mass storage network.
In other embodiments, the optional FCIs are omitted, and this data is accessed via the fabric from mass storage devices or networks via fabric-coupled FCMs 120 or NMs 130 .
CSFI 170 may be wholly or partially implemented on the SIM, on the PMM, on a separate module, or any combination of SIM, PMM, and separate module, or any other convenient location.
The coherent switch functionality may be implemented in conjunction with the HT links on the PMM, or implemented independently of these functions without substantially modifying the operation.
In some embodiments, CSFI 170 is limited to interconnect operating in conjunction with coherency and switching logic implemented internal to the CPU elements included on the PMMs.
In other embodiments, CSFI 170 includes some portion of coherency and switching logic operating in conjunction with coherency and switching logic included on the PMMs.
FIG. 4C illustrates NM Detail 400 C, including selected aspects of an embodiment of a Network Module (NM) configured as a pluggable module.
NM 130 A includes media interface hardware specific to a particular type of network coupling (Interfaces 420 and 419 for couplings 426 and 427 , respectively), coupled to network processing elements adapted for packet processing, including Packet Classification and Editor (PCE 417 ) and associated CAM 418 , coupled in turn to Traffic Manager (TM 302 ).
TM 302 is in communication with RAM 416 , and provides a fabric interface for NM-Fabric coupling 139 A.
Control Processor (CP) 429 is coupled to PCE 417 and TM 302 .
NM 130 A also includes BMC 402 . 3 , which provides an interface for coupling 452 . While the illustrated embodiment shows CP 429 coupled to Management I/O 412 indirectly via BMC 402 . 3 , in alternate embodiments the CP is coupled to the Management I/O via a coupling shared with the BMC, and in further alternate embodiments the CP is coupled to the Management I/O via a dedicated (i.e. not shared) coupling.
In operation, information (typically in the form of packets) communicated between a network device (typically external to the ES 1 ) coupled to coupling 426 is processed at a low level and in an interface-specific manner by Interface 420 (the operation of coupling 427 and Interface 419 is substantially similar).
Packets received from the network device are provided to PCE 417 for classification and Tag determination, as described elsewhere herein.
The packet data and Tag are stored in RAM 416 , and provided to the fabric as cells by TM 302 via NM-Fabric coupling 139 A. In the reverse direction, cells are reassembled by TM 302 as received from the fabric via NM-Fabric coupling 139 A, and the resultant packets are stored in RAM 416 .
PCE 417 reads the stored packet data, and dynamically modifies it according to any associated Tag information, providing the result to Interface 420 for transmission to the network device via coupling 426 .
TM 302 operates as a fabric interface communication unit, and includes a fabric interface communication transmit unit that directly accesses the packet image from RAM via an included DMA unit.
The TM examines the packet header and identifies the packet destination address.
The packet address and other associated information are referenced in accessing routing and state information maintained in one or more of CAM 418 and RAM 416 .
The resultant fabric egress port address and other related information are included in headers of cells provided to the fabric to transmit the packet as cells.
TM 302 also includes a fabric interface communication receive unit that operates in a conceptually symmetric fashion. Cells are received from the fabric and reassembled as packets stored into RAM 416 via an included DMA unit. The TM notifies the PCE as new packet data becomes available for editing and transport to Interface 420 .
CP 429 manages various HW resources on the NM, including PCE 417 and TM 302 , and the respective lookup elements CAM 418 and RAM 416 .
The CP receives management information via coupling 452 (either indirectly via the BMC or directly via a Management I/O coupling, according to embodiment) and programs lookup, forwarding, and data structure information included in CAM 418 (such as associatively searched information) and RAM 416 (such as trie table information).
FIG. 4D illustrates FCM Detail 400 D, including selected aspects of an embodiment of an FCM configured as a pluggable module.
FCM 120 A includes Fibre Channel compatible couplings 428 . 1 A through 428 . 4 B, coupled in pairs to Fibre Channel interface Processors (FCPs 423 . 1 - 423 . 4 ).
The FCPs are in turn coupled to a compute and storage element including Input Output Processor (IOP 421 ) and associated RAM 422 , coupled in turn to VIOC 301 . 2 .
The VIOC provides a fabric interface for FCM-Fabric coupling 129 A.
FCM 120 A also includes BMC 402 . 2 coupled to VIOC 301 . 2 and providing an interface for coupling 452 .
In operation, information communicated between Fibre Channel compatible devices or networks coupled to couplings 428 . 1 A through 428 . 4 B is processed in a low-level manner by FCPs 423 . 1 - 423 . 4 .
Information received from external storage devices is typically stored as packets in RAM 422 .
Packet data is transmitted as cells to the fabric by the fabric interface communication transmit unit of VIOC 301 . 2 via FCM-Fabric coupling 129 A (as described elsewhere herein).
IOP 421 reads the stored data, providing the end result to FCPs 423 . 1 - 423 . 4 for transmission to the coupled device or network.
FCPs 423 . 1 - 423 . 4 access the data directly via DMA.
FIG. 4E illustrates OLB Detail 400 E, including selected aspects of an embodiment of an OLB configured as a pluggable module.
OLB 160 A is similar in many respects to Primary SCM 140 A, and includes compute and associated memory elements CPU 410 . 6 L/RAM 411 . 6 L coupled to CPU 410 . 6 R/RAM 411 . 6 R via HT coupling 460 . 6 L.
VIOC 301 . 6 is coupled to CPU 410 . 6 R via HT coupling 460 . 6 R.
VIOC 301 . 6 is in communication with TCAM/SRAM 403 . 6 and provides a fabric interface for OLB-Fabric coupling 169 A.
PCI sub-module 425 and HT sub-module 424 are optionally included in various combinations and configurations in several embodiments to provide additional service-specific computational capabilities as service acceleration modules.
OLB 160 A also includes BMC 402 . 6 coupled to VIOC 301 . 6 and providing an interface for coupling 452 .
PCI sub-module 425 includes a PCI interface for interfacing PCI-adapter based devices to HT coupling 460 . 6 R.
HT sub-module 424 includes an HT interface for coupling to CPU 410 . 6 R via coupling 460 . 6 X.
Various embodiments of PCI sub-modules and HT sub-modules further include any number and combination of service-specific hardware accelerators according to implementation, such as an SLB hardware accelerator, an SSL hardware accelerator, and an XML hardware accelerator.
OLB 160 A executes any combination of service processes (relating to SLB, SSL, or XML, for example) using compute and memory resources provided by CPU 410 . 6 L/RAM 411 . 6 L and CPU 410 . 6 R/RAM 411 . 6 R.
The CPUs operate as a single SMP complex, communicating shared memory coherency and cache memory coherency transactions via HT coupling 460 . 6 L.
The VIOC provides for communication between the executing processes and other clients of the fabric via packet images in memory, operating as described elsewhere herein.
Service processes executing on embodiments including any combination of PCI sub-module 425 and HT sub-module 424 access elements of the sub-modules in order to accelerate processing related to the service processes.
FIG. 5A illustrates Application SW Layering 500A, including selected aspects of embodiments of SW layers for executing on application processor resources, such as CPUs included on PMMs, of an ES embodiment.
The layers are represented in a first context for execution on P3 203, and in a second context for execution on P1 201.
P3 203 and P1 201 correspond to distinct physical partitions configured from one or more PMMs.
Although Application SW Layering 500A is illustrated as representative of a collection of code images, in some contexts it may be useful to consider it as conceptually representative of processes, or groups of processes, associated with each of the illustrated elements.
For example, Hypervisor1 510 represents a code image of a specific Hypervisor, but it may also be considered conceptually representative of all processes and related execution threads associated with executing any portion of the Hypervisor code image. In typical embodiments, a plurality of concurrent execution streams co-exists and cooperates while executing portions of the code image.
Similarly, OS1/Drivers1 507, App1 501, and so forth may be considered representative of groups of respective processes associated with each of the respective SW elements.
Illustrated at the highest (most fundamental and privileged) SW layer level is a Hypervisor layer, as shown by Hypervisor1 510 and Hypervisor2 520.
Hypervisors typically provide a software environment for executing a plurality of OS instances in an apparently concurrent manner via timesharing on a shared hardware resource, such as P3 203 or P1 201, as illustrated in the figure.
At the middle SW layer level is an OS layer.
Hypervisor1 510 provides two logical partitions, one for OS1/Drivers1 507 and another for OS2/Drivers2 508.
Hypervisor2 520 provides a single logical partition for OS3/Drivers3 509.
Illustrated within the OS layer are Drivers, including VIOC and VNIC Drivers.
A VIOC Driver provides an interface between management and controlplane processes and VIOCs.
VIOC Drivers include VIOC Driver1 511 in OS1/Drivers1 507, VIOC Driver2 512 in OS2/Drivers2 508, and VIOC Driver3 513 in OS3/Drivers3 509.
In some embodiments, VIOC Drivers are customized according to OS environment, such that VIOC Driver1 511, VIOC Driver2 512, and VIOC Driver3 513 may be distinct if OS1/Drivers1 507, OS2/Drivers2 508, and OS3/Drivers3 509 or associated environments are distinct.
An OS layer may include one or more VIOC Drivers, depending on embodiment.
A VNIC Driver provides an interface between processes (executing on application processor resources, for example) and communication resources as provided by VNICs (implemented by VIOCs, for example).
A VNIC Driver is conceptually similar to a modified Ethernet Driver.
VNIC Drivers include VNIC Driver1 514 and VNIC Driver2 515 in OS1/Drivers1 507, VNIC Driver3 516 and VNIC Driver4 517 in OS2/Drivers2 508, and VNIC Driver5 518 and VNIC Driver6 519 in OS3/Drivers3 509.
VNIC Drivers are customized according to OS environment, such that VNIC Driver1 514 and VNIC Driver3 516 may be distinct if OS1/Drivers1 507 and OS2/Drivers2 508 or associated environments are distinct.
VNIC Drivers are further customized according to OS requirements or contexts, such that VNIC Drivers within the same OS are distinct (VNIC Driver1 514 being distinct with respect to VNIC Driver2 515, for example).
An OS layer may include one or more VNIC Drivers, each having unique functions, parameters, or customizations, depending on embodiment.
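Because a VNIC Driver presents an interface functionally similar to a conventional Ethernet driver, its OS-facing surface can be pictured as a small table of operations. The C sketch below is a hypothetical illustration only; none of the structure or function names come from the text, and real embodiments differ per OS environment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver-facing interface for a VNIC, sketched to show how a
 * VNIC Driver can look like a conventional Ethernet driver to the OS while
 * being customized per OS environment. All names are assumptions. */
struct vnic_dev;                 /* opaque handle to one VNIC instance */

struct vnic_driver_ops {
    int  (*open)(struct vnic_dev *dev);                    /* enable VNIC     */
    int  (*stop)(struct vnic_dev *dev);                    /* disable VNIC    */
    int  (*transmit)(struct vnic_dev *dev,
                     const void *frame, size_t len);       /* post to tx ring */
    int  (*set_mac)(struct vnic_dev *dev, const uint8_t mac[6]);
    int  (*set_vlan)(struct vnic_dev *dev, uint16_t vlan_id);
    void (*rx_callback)(struct vnic_dev *dev,
                        const void *frame, size_t len);    /* completion path */
};
```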
OS1/Drivers1 507 hosts a plurality of Applications, as shown by App1-App3 501-503.
OS2/Drivers2 508 hosts two applications, App4 504 and App5 505.
OS3/Drivers3 509 hosts a single application, App6 506.
Execution of SW at the different layers typically results in a plurality of processes or execution streams, corresponding to program execution of various portions or sub-portions of code from the illustrated SW layers.
For example, execution of each of Hypervisor1 510, OS1/Drivers1 507, and App1 501 may result in a plurality of concurrently running processes.
One example of a process spawned from OS1/Drivers1 507 is a Device Driver process, for example a Device Driver process corresponding to execution of the modified Ethernet Driver described elsewhere herein.
VIOC Device Drivers are associated with VIOC management and control, and VNIC Device Drivers are associated with VNIC management and control.
OS1/Drivers1 507, OS2/Drivers2 508, and OS3/Drivers3 509 include various combinations of VIOC and VNIC Drivers.
VIOC Drivers, VNIC Drivers, and associated functions may be wholly or partially shared and combined according to embodiment.
The illustrated SW layering is only one example embodiment.
Other embodiments may include more layers (such as Application sub-layers) or fewer layers (such as dedicated physical partitions requiring no Hypervisor layer).
FIG. 5B illustrates Management SW Layering 500B, including selected aspects of SW layers for executing on management processor resources, such as processor resources included on SCMs (Primary SCM 140A, for example), OLBs (OLB 160A, for example), and PMMs (PMM 150A, for example), according to various ES system embodiments.
Enterprise Manager 530supports multi-chassis management, complex provisioning, interfaces to client GUIs, and generally operates at a relatively high level of abstraction, as does CLI 532 .
Platform Manager 531generally performs in-chassis (or single-chassis) management operations and tends to manipulate system objects directly at a relatively low level of abstraction.
SW modulesoperate in close cooperation with the Platform Manager, including Chassis Manager (CM) 533 , Query Engine 534 , Repository Manager 535 , VIOC Manager 536 , Interface Manager 537 , L 2 Forwarding DataBase (FDB) Manager 538 , VLAN Manager 539 , and Other Management Code 540 .
CLI 532generally provides some combination of the CLI proper and related services to support the interface.
CM 533discovers and manages chassis resources.
Query Engine 534processes queries relating to persistent state managed by Repository Manager 535 .
VIOC Manager 536generally provides an interface between the system control and management processes and VIOCs in a system.
Interface Manager 537generally provides an interface between the system control and management processes and NMs in a system.
L 2 FDB Manager 538provides L 2 forwarding information management by maintaining one or more FDBs. The L 2 FDB Manager further manages dissemination of information related to portions of the FDB throughout the system as needed (such as updating the forwarding and state information maintained in TCAM/SRAM elements coupled to VIOCs).
VLAN Manager 539generally provides system-wide management functions relating to provisioning VLANs and maintaining VLAN related information, such as associations between VNICs, VLANs, and NM ports. Typically a Logical InterFace identifier (LIF) is allocated per port per VLAN, identifying a “connection” to the VLAN.
LIFLogical InterFace identifier
Other Management Code 540generally provides other management, controlplane, and load balancing functions. The platform manager and the aforementioned closely cooperating SW modules are described in more detail elsewhere herein.
FIG. 5Cillustrates BMC SW Layering 500 C, including selected aspects of SW layers for executing on module-level configuration and management processor resources, such as BMCs (BMC 402 . 4 of FIG. 4A , for example), according to various ES system embodiments.
Module BMC SW 550supports module-level operations, typically via interfaces with HW components on the module the BMC is included on.
the Module BMC SWgenerally functions at a relatively low level of abstraction. Similar to Application SW Layering 500 A and Management SW Layering 500 B , BMC SW Layering 500 C represents a collection of code images, and each element may be usefully considered as representative of one or more processes executing portions of each respective element.
optional IPMI Client 551provides an interface to IPMI services (typically part of Platform Manager 531 ) and in some embodiments serves to export low-level platform services to various elements of the Management SW.
Event Agent 552monitors module-level information (such as sensors and board insertion detection logic) to recognize changes in status and configuration of the module. The Event Agent then communicates these changes as events delivered to various elements of the Management SW (such as CM 533 ).
the Selected BMC Event Details sectionincluded elsewhere herein, provides further details on BMC events.
Command Agent 553receives BMC commands from various elements of the Management SW (such as CM 533 ) and sequences and forwards the commands. The Command Agent communicates results of command execution as events via functions provided by the Event Agent.
the Selected BMC Command Details sectionincluded elsewhere herein, provides further details on BMC commands.
VNICs provide each processing element with access, via a modified Ethernet Driver, to other system resources connected to the fabric.
Each VNIC, in conjunction with transport via the fabric and external communication via an NM, provides capabilities similar to those of a conventional NIC, made available to processes by way of a driver functionally similar to a conventional Ethernet driver.
The accessible system resources include networking interfaces provided by NMs, storage interfaces provided by FCMs, and computing capabilities provided by SCMs, PMMs, and OLBs.
VNICs are implemented by VIOCs included in modules such as PMMs, OLBs, FCMs, and SCMs, thereby providing processing elements on such modules with access to VNIC capabilities.
VNIC operation generally provides for communication of data directly between processes executing on a Local Processor and the fabric via Local Processor Memory.
The Local Processor Memory is typically accessible by the Local Processor and, for example, a VIOC implementing VNIC functionality.
A VNIC provides transmit and receive queues for use by processes executing on a Local Processor for communication of data (as packets or messages) to and from the Local Processor Memory.
As VNICs (like conventional NICs) are bidirectional, VNICs also provide access to each processing element from other system resources connected to the fabric. For example, each VNIC on an OLB provides a separate portal to execution resources provided by the OLB. A first VNIC on the OLB may be allocated to and accessed by processes executing on a first provisioned server, while a second VNIC on the same OLB may be used by a second provisioned server. As another example, each VNIC on an FCM provides a separate path to storage resources enabled by the FCM, and each of the separate paths may be accessed by distinct provisioned servers.
FIG. 6A illustrates selected aspects of a logical view of an embodiment of a plurality of VNICs.
VIOC 301 implements 16 identical VNICs; for clarity, only two of the VNICs are shown in the figure (VNIC #1 600.1 and VNIC #16 600.16).
Each VNIC, such as VNIC #1 600.1, includes programmable identification information illustrated as Address Block 601.1, VNIC configuration registers illustrated as VNIC Configuration Block 618.1, and packet and message data communication interfaces illustrated as I/O Block 605.1.
Address Block 601.1 includes Fabric Address 602.1, for specifying the source fabric address of the VNIC, and MAC Address 603.1, defining the MAC address associated with the VNIC.
A first portion of Fabric Address 602.1 is common to some or all of the VNICs of a VIOC, and corresponds to the physical fabric address of the VIOC.
A second portion of Fabric Address 602.1, also known as the fabric sub-address, is unique to each VNIC.
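A simple way to picture the two-part fabric address is as a fixed VIOC-wide port field concatenated with a per-VNIC sub-address. The C sketch below assumes a 6-bit port field and a 4-bit sub-address (enough for 16 VNICs); the actual field widths are not specified here and are assumptions.

```c
#include <stdint.h>

/* Sketch of composing a VNIC source fabric address from a VIOC-wide
 * physical fabric address plus a per-VNIC fabric sub-address.
 * The 6-bit and 4-bit widths are illustrative assumptions. */
static inline uint16_t vnic_fabric_address(uint8_t vioc_fabric_port,
                                           uint8_t vnic_subaddress)
{
    return (uint16_t)((vioc_fabric_port & 0x3F) << 4) |
           (uint16_t)(vnic_subaddress & 0x0F);   /* 16 VNICs per VIOC */
}
```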
Address Block 601.1 optionally includes, according to embodiment, Public Internet Protocol (Public IP) Address 604.1, for specifying the IP address associated with the VNIC, to enable selected L3 processing.
Address Block 601.1 also includes a Virtual Local Area Network (VLAN) Identifier (VLANid 617.1), for use according to embodiment to map a destination address according to VLAN information, or as a tag to identify the VNIC with respect to a VLAN for selected L3 processing, or both.
A separate L3 VLAN identifier (in addition to VLAN Identifier 617.1) may be provided for use with selected L3 VLAN operations. Examples of VLAN processing include formulating a packet or cell header including a VLAN tag, and filtering incoming traffic with respect to a VLAN.
Programming of VLAN Identifier 617.1 further enables implementation of fabric-port-based, MAC address-based, IP address-based, and general L3 type VLAN functions.
VNIC Configuration Block 618.1 includes VNIC Enable 618.1a, for enabling (and disabling) the corresponding VNIC, and priority and bandwidth configuration registers.
The priority configuration registers include Priority Scheme Selection 618.1b and Priority Weights 618.1c, for specifying priority processing related information, such as priority scheme selection and weights, respectively.
The bandwidth configuration registers include Bandwidth Minimum 618.1d and Bandwidth Maximum 618.1e, for specifying bandwidth allocation and control configurations, such as minimum and maximum bandwidth allocations, respectively, for the corresponding VNIC.
I/O Block 605.1 includes separate collections of queues for packets and messages. Each collection includes transmit, receive, and completion queues (the completion queues are also known as receive completion queues). The packet and message queues are organized by "context". Each context includes a transmit queue, a receive queue, and either a completion queue or a pointer to a completion queue. In a typical usage scenario, the transmit, receive, and completion queues of a context are associated with a particular protocol connection or a particular multiplexed set of protocol connections. Each context is established by software running on the Local Processors. In certain embodiments, while each context has a respective pair of dedicated transmit and receive queues, multiple contexts are associated with a common completion queue. This is detailed further below.
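A minimal software view of a context and its Context State Block might look like the following C sketch. The exact CSB layout is not given in the text, so the field names, the 64-bit ring base addresses, and the index used to select a shared completion queue are assumptions for illustration.

```c
#include <stdint.h>

/* Illustrative layout of a Context State Block (CSB) and its queues.
 * Field names, address width, and the shared-completion-queue index are
 * assumptions made for this sketch. */
struct vnic_queue_ref {
    uint64_t ring_base;     /* descriptor ring in Local Processor Memory */
    uint16_t ring_entries;  /* ring is circular ("descriptor ring")      */
    uint16_t head;          /* consumer index                            */
    uint16_t tail;          /* producer index                            */
};

struct vnic_csb {
    struct vnic_queue_ref tx;          /* dedicated transmit queue        */
    struct vnic_queue_ref rx;          /* dedicated receive queue         */
    uint8_t  completion_queue_index;   /* selects one of the 16 shared
                                          completion queues               */
    uint8_t  in_use;                   /* context established by SW       */
};
```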
A packet (or message) context may be used for IP traffic, interprocess communication (directly or over IP), or other similar communications.
When accessed via physical buffer addresses, packet (or message) contexts are typically used directly by OS and Hypervisor processes, and typically used indirectly (via system calls, for example) by application processes.
When accessed via virtual buffer addresses, packet (or message) contexts are typically usable directly by application processes, in addition to direct use by OS and Hypervisor processes.
A single packet context may be used to implement multiple connections via SW, while a single message context typically implements a single connection managed (at least in part) by HW.
Message contexts may be used by RDMA operations with fewer SW operations than similar functionality implemented with packet contexts.
Message contexts are typically directly accessible via application processes.
Each context has particular state associated with it, organized within a Context State Block (CSB), which identifies the context's transmit, receive, and completion queues.
The CSBs are maintained in various embodiments by any combination of Hypervisor, OS, and Driver processes providing control information for the VNIC.
The queues contain descriptors that are organized within circular buffers in contiguous memory and thus are also known as descriptor rings.
Each enqueued descriptor describes various aspects (detailed herein below) of the packets or messages being transmitted and received.
The packet transmit and receive descriptors correspond to the raw data buffers in the Local Processor Memory where packet data is respectively read and written by the VIOC.
A common scenario is for these descriptors to have been prepared by processes of the Local Processors (such as a Driver).
Packet Contexts 606.1 provides four packet CSBs (also referred to as packet contexts) of identical capability, two of which are illustrated in the figure (Packet Context Block #1 606.1.1 and Packet Context Block #4 606.1.4).
Each packet context includes a transmit queue, such as Packet Transmit Queue #1 607.1.1 and Packet Transmit Queue #4 607.1.4, and a receive queue pointer, such as Packet Receive Queue Pointer #1 608.1.1 and Packet Receive Queue Pointer #4 608.1.4.
Each packet receive queue pointer identifies one of 16 identical shared packet receive queues, as illustrated by dashed-arrow 619r pointing toward Packet Receive Queues 616r. For clarity, only two of the 16 packet receive queues are illustrated in the figure, specifically Packet Receive Queue #1 616r.1 and Packet Receive Queue #16 616r.16.
The packet receive queue pointers enable arbitrary sharing of packet receive queues among packet contexts, including sharing among packet contexts in multiple distinct VNICs.
In some embodiments, packet context receive queues are not shared among VNICs, but are dedicated per VNIC, as packet context transmit queues are.
In such embodiments, elements 608.1.1 . . . 608.1.4 operate directly as the packet receive queues for a given VNIC, and Packet Receive Queues 616r (of FIG. 6A) are not provided.
A completion queue pointer, such as Packet Complete Queue Pointer #1 609.1.1 and Packet Complete Queue Pointer #4 609.1.4, is also included in each packet context.
Each packet completion queue pointer identifies one of 16 identical shared packet completion queues, as illustrated by dashed-arrow 619c pointing toward Packet Completion Queues 616c. For clarity, only two of the 16 packet completion queues are illustrated in the figure, specifically Packet Completion Queue #1 616c.1 and Packet Completion Queue #16 616c.16. Similar to the packet receive queue pointers, the packet completion queue pointers enable arbitrary sharing of packet completion queues among packet contexts, including sharing among packet contexts in multiple distinct VNICs.
The packet transmit queues of a VNIC (such as Packet Transmit Queue #1 607.1.1, for example), also known as transmit rings, are used to communicate transmit packet availability and location in memory.
The transmit queues include a plurality of transmit descriptors, each of which refers to a buffer in memory having a complete or partial packet image for transmission.
The descriptor includes the address of the buffer (in Local Processor Memory), the buffer size, a packet state indicator, a valid indicator, a done indicator, and other related information.
The packet state indicator describes the buffer information as associated with the start, the middle, or the end of a packet. Assertion of the valid indicator specifies that the descriptor refers to packet image data ready for transmission.
Deassertion indicates otherwise: that no data is ready, the descriptor is invalid, or some other related condition.
The valid indicator allows a VNIC implementation (such as a VIOC) to poll (or scan) the transmit queues for valid descriptors to discover packet data ready for transmission without requiring any further information or stimulus from the processes executing on the Local Processor.
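The transmit descriptor fields named above, and the polling behavior enabled by the valid indicator, can be sketched as follows. The packing and types are assumptions; only the listed fields (buffer address, size, packet state, valid, done) come from the text.

```c
#include <stdint.h>

/* Sketch of a transmit descriptor and of the scan loop implied by the
 * valid indicator. Exact field packing is an assumption. */
enum pkt_state { PKT_START, PKT_MIDDLE, PKT_END };

struct tx_descriptor {
    uint64_t buf_addr;        /* buffer in Local Processor Memory */
    uint32_t buf_len;
    uint8_t  pkt_state;       /* enum pkt_state                   */
    uint8_t  valid;           /* set by SW: data ready to send    */
    uint8_t  done;            /* set by HW: descriptor retired    */
};

/* Model of the VIOC scanning a transmit ring for work, with no stimulus
 * from the Local Processor required. Returns the index of the next valid,
 * not-yet-done descriptor, or -1 if none is ready. */
int scan_tx_ring(const struct tx_descriptor *ring, int entries, int start)
{
    for (int i = 0; i < entries; i++) {
        int idx = (start + i) % entries;
        if (ring[idx].valid && !ring[idx].done)
            return idx;
    }
    return -1;
}
```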
The processes may also interrupt the VIOC by accessing a Doorbell location in the VIOC memory-mapped I/O address space to indicate new or modified descriptors in a transmit queue, corresponding to additional packet data ready for transmission.
A plurality of Doorbell locations are typically provided per VNIC.
In some embodiments, packet transmit queues and message contexts of all VNICs are allocated distinct doorbells.
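A Doorbell access is simply a store to a location in the VIOC's memory-mapped I/O space. The sketch below is hypothetical: the base offset, stride, and the convention of writing the new tail index are assumptions, since only the existence of per-queue Doorbell locations is described.

```c
#include <stdint.h>

/* Sketch of a doorbell write into the VIOC's memory-mapped I/O space.
 * Offsets and the meaning of the written value are assumptions. */
#define VIOC_MMIO_DOORBELL_BASE   0x1000u   /* hypothetical offset    */
#define VIOC_MMIO_DOORBELL_STRIDE 0x8u      /* one doorbell per queue */

static inline void ring_tx_doorbell(volatile uint8_t *vioc_mmio,
                                    unsigned queue_id, uint16_t new_tail)
{
    volatile uint32_t *db = (volatile uint32_t *)
        (vioc_mmio + VIOC_MMIO_DOORBELL_BASE +
         (uintptr_t)queue_id * VIOC_MMIO_DOORBELL_STRIDE);
    *db = new_tail;   /* informs the VIOC of newly valid descriptors */
}
```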
the packet receive queues of a VNIC(such as those pointed to by Packet Receive Queue Pointer # 1 608 . 1 . 1 , for example), also known as receive rings, are used to communicate receive packet data locations in memory.
the receive queuesinclude a plurality of receive descriptors, each of which refers to a buffer in memory for reception of a complete or partial packet image.
the descriptorincludes the address of the buffer (in Local Processor Memory), the buffer size, a valid indicator, and other related information.
the valid indicatorspecifies that the buffer the descriptor refers to is ready to receive packet image data, enabling a VNIC implementation (such as a VIOC) to determine receive buffer availability without direct involvement of processes executing on the Local Processor.
the shared packet completion queues(Packet Completion Queue # 1 616 c . 1 , for example) are used to communicate completion information, including receive packet data availability and status.
the completion queuesinclude a plurality of entries, each of which includes a packet status indicator and other related information.
the packet status indicatoridentifies buffer data as corresponding to the start, middle, or end of a packet.
the completion queue data structure and related processingfurther enable a VNIC implementation (such as a VIOC) to provide packet data to processes executing on Local Processors with little direct involvement of the processes.
By performing the ingress processing (e.g., packet reassembly from cells), the VIOC/VNIC frees up Local Processor resources for other tasks, and may also simplify implementation of other portions of the system.
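For reference, the receive-side structures named above can be sketched in the same style. Only the fields mentioned in the text (buffer address, size, and valid indicator for receive descriptors; a packet status indicator for completion entries) are grounded; the remaining fields and the layout are assumptions.

```c
#include <stdint.h>

/* Sketch of a receive descriptor and a completion queue entry. */
struct rx_descriptor {
    uint64_t buf_addr;     /* empty buffer in Local Processor Memory  */
    uint32_t buf_len;
    uint8_t  valid;        /* set by SW: buffer ready for packet data */
};

struct completion_entry {
    uint8_t  pkt_status;   /* start, middle, or end of a packet        */
    uint8_t  vnic_id;      /* which VNIC the data belongs to (assumed) */
    uint16_t rx_queue;     /* receive queue the buffer came from       */
    uint32_t byte_count;   /* bytes written into the receive buffer    */
};
```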
Message Contexts 610.1 provides up to a total of 4K−4 (i.e., 2 to the 12th power, minus 4, or 4092) message CSBs (also referred to as message contexts) per VNIC.
Each message context has identical capability, and two are illustrated in the figure (Message Context Block #1 610.1.1 and Message Context Block #N 610.1.N).
In some embodiments, the message contexts of the 16 VNICs are shared, for a total of 64K−16 (i.e., 2 to the 16th power, minus 16, or 65,520) message contexts available collectively to all of the VNICs.
Each message context includes a transmit queue, such as Message Transmit Queue #1 611.1.1 and Message Transmit Queue #N 611.1.N, and a receive queue, such as Message Receive Queue #1 612.1.1 and Message Receive Queue #N 612.1.N. Each message context also includes a completion queue, such as Message Completion Queue #1 613.1.1 and Message Completion Queue #N 613.1.N.
The message transmit, receive, and completion queue data structures are similar to those defined for packet processing.
The message data structures specify message location and availability for transmission, provide for determining buffer location and availability for reception, and directly provide message data and completion status to processes executing on Local Processors.
Message Lists 615.1 includes two linked lists of messages to be processed (Message List #1 615.1.1 and Message List #2 615.1.2). Each of the lists includes an active entry pointing to (or selecting) one of the message contexts, as illustrated by dashed-ellipse 614.1 and associated dashed-arrows pointing to message contexts in Message Contexts 610.1. In embodiments with 4K−4 message contexts per VNIC, the active entries point to message contexts associated with the VNIC, and in embodiments with 64K−16 shared message contexts, the active entries point to any of the shared message contexts. Each of the selected message contexts in turn includes transmit, receive, and completion queues, as illustrated.
Transmit and receive packet and message buffer addresses may be provided as physical addresses or virtual addresses subsequently translated by a VNIC implementation (such as a VIOC), depending on embodiment and configuration.
The address type varies according to queue, and may also vary according to other configuration information.
VNICsshare resources amongst each other and between queues within each respective VNIC.
Shared transmit resourcesinclude access to the fabric (based on destination, for example) and effective bandwidth to the fabric (allocated between VNICs and queues included in VNICs, for example).
Shared receive resourcesinclude input queuing (or enqueuing) bandwidth from the fabric (based on a categorization included in the received data, for example) and dequeuing bandwidth from input queues to Local Processor Memory (allocated between VNICs and queues included in VNICs, for example).
Sharing of transmit resourcesis managed by three scheduling policies. These are the transmit-descriptor-fetch scheduling policy, the VOQ scheduling policy, and the subsequent-transmit-data-read scheduling policy. From a high-level perspective, the transmit-descriptor-fetch scheduling policy decides which transmit tasks the VIOC will do next. More specifically, the transmit-descriptor-fetch scheduling policy, described in more detail below in conjunction with FIG. 6B , determines the next transmit-descriptor to be fetched. (An initial data read is also performed in conjunction with each transmit descriptor fetch.) Each transmit descriptor describes a transmit buffer in the local processor memory that holds one or more packets awaiting transport to their addressed destinations.
the transmit descriptor to be next fetchedis the descriptor identified by the next descriptor pointer of a transmit queue selected by the transmit-descriptor-fetch scheduling policy from the candidate pool of all VNIC transmit queues.
the selectionis based in part on a prioritization among the VNIC transmit queues, in a manner that is configurable in part.
the selectionis also based in part on an approximate fabric bandwidth allocation among the VNICs.
the VOQ scheduling policydetermines the next cell to transmit to the fabric.
the cell to be next transmitted to the fabricis the cell at the head of the VOQ selected by the VOQ scheduling policy from the candidate pool of all VOQs.
the selectionis based in part on a prioritization among the VOQs, in a manner that is configurable in part.
the subsequent-transmit-data-read scheduling policydetermines the next transmit data to read (for all reads required to retire a transmit descriptor made after the first data read).
the transmit data to be next readis the data (generally a cache-line in memory) identified by the next data unit prefetch pointer of a transmit queue packet buffer (in transmit shared memory) selected by the subsequent-transmit-data-read scheduling policy from the candidate pool of all transmit queue packet buffers awaiting to do transmit data reads.
Each of the solid-arrow diagrams is very similar to a "tournament bracket" (also known as a tournament chart), wherein a pool of competitors is reduced by successive stages of competition to arrive at an overall winner.
The queues represented at the start (on the left for transmit, on the right for receive) "compete" in accordance with rules that may differ at each stage, to be chosen as the queue for which an action (different in each figure) is next performed.
Each selection described is a logical abstraction that generally does not necessitate any data movement corresponding to any of: the queues, identifiers (or pointers) representing the queues, descriptors pointed to by the queues, or data associated with the descriptors.
Evaluation in each of the discrete stages described is also a logical abstraction that need not be physically implemented in order to realize the overall result.
Each scheduling policy of FIGS. 6B through 6E is reducible to logic equations that can be implemented in many functionally equivalent ways, including but not limited to: pass gates, multiplexers, AND/OR-gating, memory or programmable-logic arrays, micro-programming, and combinations thereof.
FIG. 6B illustrates selected aspects of a logical view of an embodiment of VNIC transmit-descriptor-fetch scheduling. For clarity, only the transmit queue functions of two of the 16 VNICs are illustrated (VNIC #1 600.1 and VNIC #16 600.16). The figure represents operations related to selecting a transmit queue from among all the transmit queues. The next descriptor pointer of the selected transmit queue in turn identifies the next transmit descriptor to fetch, according to various priority techniques. This determines the relative processing order of packet and message buffers described by the transmit descriptors, and thereby approximates and manages a per-VNIC effective bandwidth allocation to the fabric.
Message Transmit Queues 621.1.M includes two message transmit queues (Message Transmit Queue #5 621.1.M.5 and Message Transmit Queue #6 621.1.M.6) of VNIC #1 600.1. These correspond to the message transmit queues identified by an active message of each of Message Lists 615.1 (Message List #1 615.1.1 and Message List #2 615.1.2), as shown by dashed-ellipse 614.1 (of FIG. 6A).
The other 15 VNICs are organized identically.
A first prioritization level selects (identifies), for each VNIC, one packet transmit queue and one message transmit queue.
The first level includes Packet Transmit Queue Prioritization 622.1.P, selecting one of Packet Transmit Queues 621.1.P according to either a straight or weighted round-robin priority scheme.
The first level also includes Message Transmit Queue Prioritization 622.1.M, selecting one of Message Transmit Queues 621.1.M according to either a straight or weighted round-robin priority scheme.
Identical prioritization processing is performed for each of the 16 VNICs, selecting one potential packet queue and one potential message queue for each VNIC. The resultant 32 candidate queues are then evaluated in a second prioritization level.
The second prioritization level selects, for each VNIC, between the packet queue and the message queue as selected by the first prioritization level.
The second level includes Packet vs. Message Transmit Prioritization 623.1, selecting a packet queue or a message queue according to a weighted round-robin priority scheme. Identical prioritization processing is performed for each of the 16 VNICs, selecting one candidate queue for each VNIC. These 16 queues are then evaluated in a third prioritization level.
The third prioritization level selects, across the 16 VNICs, a single transmit queue for subsequent evaluation.
The third level includes VNIC Bandwidth Management 624, selecting one of the 16 queues provided by the second level according to a bandwidth allocation priority scheme. Bandwidth schemes include enforcing a maximum bandwidth per VNIC, a minimum bandwidth per VNIC, and arbitrary combinations of maximum and minimum bandwidths (individually selectable) per VNIC.
A single resultant transmit queue is selected for processing, as indicated by dashed-arrow 639.
The processing includes accessing a transmit descriptor identified by the selected transmit queue, reading data for transmission according to the accessed descriptor, and then readying the data for transmission over the fabric in accordance with the VOQ scheduling policy discussed in conjunction with FIG. 6C below.
The selection of straight or weighted round-robin prioritization (Packet Transmit Queue Prioritization 622.1.P and Message Transmit Queue Prioritization 622.1.M, for example) is individually programmable for each VNIC and may be distinct for packet and message prioritization, in various embodiments.
Weights for each of the weighted round-robin prioritizations (Packet Transmit Queue Prioritization 622.1.P, Message Transmit Queue Prioritization 622.1.M, and Packet vs. Message Transmit Prioritization 623.1, for example) are individually programmable, in various embodiments.
The maximum and minimum bandwidths per VNIC (VNIC Bandwidth Management 624) are also programmable in typical embodiments.
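The three prioritization levels can be approximated in software as nested weighted round-robin stages followed by a bandwidth gate. The following C sketch is a deliberately simplified abstraction, not the hardware algorithm: the credit-based WRR, the byte-count bandwidth ceiling, and the fixed order of evaluation across VNICs are all assumptions made for illustration.

```c
#include <stdint.h>

/* Compact model of the transmit-descriptor-fetch scheduling policy:
 * per-VNIC WRR among packet queues and among message queues, a
 * packet-vs-message round, then a cross-VNIC stage gated by a crude
 * per-VNIC bandwidth ceiling. Heavily simplified. */
#define NUM_VNICS   16
#define PKT_QUEUES   4
#define MSG_QUEUES   2

struct wrr_q { int pending; unsigned weight; int credit; };

/* One weighted round-robin pass: pick a pending queue with positive
 * credit, refilling credits from the weights when all are exhausted.
 * (For simplicity, credit is consumed when a candidate is selected.) */
static int wrr_pick(struct wrr_q *q, int n)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < n; i++)
            if (q[i].pending && q[i].credit > 0) { q[i].credit--; return i; }
        for (int i = 0; i < n; i++)          /* refill and retry once */
            q[i].credit = (int)q[i].weight;
    }
    return -1;                               /* nothing pending */
}

struct vnic_sched {
    struct wrr_q pkt[PKT_QUEUES];
    struct wrr_q msg[MSG_QUEUES];
    struct wrr_q pkt_vs_msg[2];              /* 0 = packet, 1 = message */
    uint64_t bytes_sent, max_bytes;          /* crude bandwidth ceiling */
};

/* Returns the VNIC whose selected queue should have its next transmit
 * descriptor fetched, or -1 if no VNIC has eligible work.
 * *queue_out: 0..3 = packet queue index, 4..5 = message queue index. */
int select_tx_descriptor(struct vnic_sched v[NUM_VNICS], int *queue_out)
{
    for (int vn = 0; vn < NUM_VNICS; vn++) {        /* stand-in for the */
        if (v[vn].bytes_sent >= v[vn].max_bytes)    /* bandwidth stage  */
            continue;
        int p = wrr_pick(v[vn].pkt, PKT_QUEUES);
        int m = wrr_pick(v[vn].msg, MSG_QUEUES);
        v[vn].pkt_vs_msg[0].pending = (p >= 0);
        v[vn].pkt_vs_msg[1].pending = (m >= 0);
        int which = wrr_pick(v[vn].pkt_vs_msg, 2);
        if (which < 0)
            continue;
        *queue_out = (which == 0) ? p : PKT_QUEUES + m;
        return vn;
    }
    return -1;
}
```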
the descriptoris fetched and the first transfer is made (corresponding to a fabric-cell-worth of data) from the first packet in the transmit memory buffer pointed to by the descriptor to a packet buffer in shared memory associated with the transmit queue.
This first read and the subsequent readsare performed by one or more transmit DMA engines, which operate in accordance with the transmit-descriptor-fetch scheduling policy and the subsequent-transmit-data-read scheduling policy, discussed below.
a lookupis generally performed on the MACDA contained in the first read, the nature of the destination becomes known as a result of the lookup, and the data can subsequently be appropriately directed to either a multicast VOQ or a fabric-port-specific unicast VOQ. Operation of the VOQs is discussed in more detail below, in conjunction with FIG. 6C .
the data from the transmit queue packet buffer(in shared memory) is read, additional information is added (such as priority and the destination fabric address and fabric sub-address), and the data is transferred to the appropriate VOQ as cells.
cells from different packetsare not co-mingled in the VOQs.
the VOQsact as variable depth FIFOs, wherein the order in which cells enter a VOQ determines the order in which the cells leave the VOQ.
cellsare released by the VOQs and transmitted to the fabric in accordance with the VOQ scheduling policy.
the packet buffersare depleted. Responsive to the depletion, the subsequent-transmit-data-read scheduling policy generally attempts to keep the transmit queue packet buffers full of pre-fetched transmit read data up to the allocated pre-fetch depth, which is 16 cache-lines in one embodiment. In doing so, it gives higher priority to those transmit queue packet buffers that are supplying cells for a packet that is at the head of a VOQ (a packet being actively transferred as cells over the fabric). It gives lower priority to those transmit queue packet buffers that are not yet supplying cells to a VOQ.
FIG. 6Cillustrates selected aspects of a logical view of an embodiment of a VOQ scheduling policy to provide efficient access to the fabric.
the figurerepresents processing to select the VOQ to send the next data unit (cells in one embodiment) to the fabric.
the VOQssend information to the fabric interface as cells. Accordingly, at least logically the VOQ receives information as cells. Physically, the VOQ could receive cells as constituent components (data, priority, fabric address, etc.) and assemble the cells just prior to sending the cells to the fabric interface.
the VOQsmay be implemented within the egress shared memory. Thus a number of levels of virtualization and indirection are possible.
the VOQsare implemented within the egress shared memory and they hold cells that have been pre-assembled and are ready for immediate transfer via the fabric interface to the switch fabric.
the pre-assemblyis performed at least in part by transfers to each VOQ from an appropriately dynamically associated transmit queue packet buffer (also in the egress shared memory).
transmit logic included in the VNIC implementation(such as in the egress logic of a VIOC as illustrated in FIG. 7A and as discussed below), assembles cells in preparation for providing them to the VOQs as represented abstractly by dashed-arrow 640 . Included within each cell is a corresponding priority indication (one of four levels: P 0 , P 1 , P 2 , and P 3 , in selected embodiments) and a corresponding fabric destination address.
the destination addressmay be a unicast address (one of up to 11 destinations, in some embodiments), or a multicast address.
Multicast cellsare enqueued into one of Multicast Output Queues 641 .M (also known as Multicast VOQs) according to priority, as abstractly represented by dashed-arrow 640 .M, illustrating insertion into the P 0 priority multicast VOQ.
Unicast cells are enqueued into one of 11 unicast VOQ groups (Unicast VOQ Group #1 641.1 . . . Unicast VOQ Group #11 641.11) according to the fabric destination address (VOQ Group #1 . . . VOQ Group #11) and further according to priority (P0 . . . P3) within each VOQ group. Since there are 11 destinations, each having four priorities, there are a total of 44 unicast VOQs.
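Given the 11 destinations and four priorities stated above, a VOQ can be identified by a simple index computation; the linear encoding below is an assumption made for illustration.

```c
/* Mapping of (fabric destination, priority) to one of the 44 unicast VOQs,
 * plus 4 multicast VOQs selected by priority alone. Only the 11x4 + 4
 * queue counts come from the text; the linear encoding is assumed. */
#define VOQ_DESTS       11
#define VOQ_PRIORITIES   4   /* P0..P3 */
#define NUM_UNICAST_VOQ (VOQ_DESTS * VOQ_PRIORITIES)   /* 44 */

static inline int unicast_voq_index(int dest, int priority)
{
    return dest * VOQ_PRIORITIES + priority;   /* 0..43 */
}

static inline int multicast_voq_index(int priority)
{
    return NUM_UNICAST_VOQ + priority;         /* 44..47 */
}
```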
The unicast enqueue operation is illustrated by dashed-arrows 640.1 and 640.11.
A first prioritization level selects a candidate VOQ from within the Multicast VOQs (Multicast Output Queues 641.M) and each of the Unicast VOQ groups (Unicast VOQ Group #1 641.1 . . . Unicast VOQ Group #11 641.11).
The first level includes Multicast Output Queue Prioritization 642.M, selecting a single multicast VOQ from Multicast Output Queues 641.M according to either a straight or weighted round-robin priority scheme.
The first level also includes Unicast Output Queue Prioritization 642.1, selecting one VOQ of Unicast VOQ Group #1 641.1 according to either a straight or weighted round-robin priority scheme. Identical prioritization processing is performed for each of the 11 Unicast VOQ groups, selecting one potential VOQ for each Unicast VOQ group. The resultant 11 unicast VOQ candidates are then evaluated in a second prioritization level, and the resultant single multicast VOQ candidate is then evaluated in a third prioritization level.
The second prioritization level selects, from the 11 per-VOQ-group candidates provided by the first level, a single unicast VOQ candidate.
The second level includes Destination Prioritization 643, selecting a VOQ according to a round-robin priority scheme. Since the VOQ groups are organized by fabric destination, the second level is a fabric-destination-based prioritization. The single resultant unicast VOQ candidate is then evaluated in the third prioritization level.
The third and final prioritization level selects between the multicast and unicast VOQ candidates as provided by the first and second levels, respectively.
The third level includes Multicast vs. Unicast Output Prioritization 644, selecting the final multicast or unicast VOQ candidate according to a weighted round-robin priority scheme. The final selected VOQ is then permitted to provide one cell to the fabric, as abstractly represented by dashed-arrow 659.
The selection of straight or weighted round-robin prioritization is individually programmable for Multicast Output Queue Prioritization 642.M and each of Unicast Output Queue Prioritization 642.1 . . . 642.11.
Weights for each of the weighted round-robin prioritizations are individually programmable, in various embodiments.
In some embodiments, the number of VOQ groups is equal to the number of fabric destinations in the ES system (such as 11 VOQ groups and 11 fabric destinations). In some embodiments, the number of VOQ groups is greater than the number of fabric destinations (such as 16 VOQ groups and 11 fabric destinations). In some embodiments, more than one priority may share a VOQ, instead of each priority having a separate VOQ. For example, P0 and P1 priorities may share a first VOQ within a VOQ group and P2 and P3 priorities may share a second VOQ within the VOQ group. These and all similar variations are contemplated within the contexts of various embodiments.
Shared receive resourcesinclude enqueuing bandwidth for cells received from the fabric.
the bandwidthis shared based on a priority included in the received data units (see the following FIG. 6D discussion).
the data unitsAfter enqueuing, the data units are classified according to destination VNIC and associated queue, including processing according to multicast and unicast destinations. Then the data units are dequeued for storage into Local Processor Memory according to priorities associated with the VNICs and the queues of the VNICs.
the shared receive resourcesfurther include the dequeuing bandwidth (see the following FIG. 6E discussion).
FIG. 6Dillustrates selected aspects of a logical view of an embodiment to schedule the start of receive processing for incoming cells.
Received cellsare pushed into the VIOC from the fabric, typically via one or more First-In-First-Out (FIFO) or similar buffering mechanisms external to the VIOC, as suggested by dashed-arrow 660 .
the cellsare classified according to multicast or unicast (dashed arrows 660 .M and 660 .U, respectively), and inserted into Multicast Input Queues 661 .M or Unicast Input Queues 661 .U accordingly.
queue insertionis without regard to priority, as all priorities (P 0 , . . . P 3 ) share the same queue for a given traffic type.
a single level of prioritizationis performed by the receive logic to select an input queue from a candidate pool that includes Multicast Input Queues 661 .M and Unicast Input Queues 661 .U.
the single levelselects between the multicast and the unicast queues according to Multicast vs. Unicast Input Prioritization 663 , a weighted round-robin priority scheme.
the receive logicthen pulls one cell from the queue selected for storage into receive logic memory (such as ISMem 738 ) and subsequent processing. Weights for Multicast vs. Unicast Input Prioritization 663 are individually programmable, according to various embodiments.
some embodimentsselect the next cell to pull from the input queues at least in part according to priorities associated with the received cells.
the multicast and unicast input queuesmay be managed with priority information either included in the received data unit or determined as a function of the fabric transport priority associated with the received cell, according to embodiment.
Multicast and unicast input queue insertionis then partially determined by the priority associated with the received cell, in addition to multicast versus unicast categorization.
Multicast Input Queues 661 .M and Unicast Input Queues 661 .Uare thus each modified to be organized with respect to data unit priority.
each queueincludes data associated with a single priority (i.e. there is a queue per categorization and priority pair).
each queuemay include cells from a pair of priorities (P 0 and P 1 in a first queue, and P 2 and P 3 in a second queue, for example).
Queue insertion may be further determined according to information provided in response to a lookup operation based on information included in the received cell (see the TCAM/SRAM lookup state section, elsewhere herein).
A first prioritization level selects, on a per-queue-priority basis, one candidate multicast input queue and one candidate unicast input queue from the multicast and unicast input queue groups, respectively.
The first prioritization may be straight priority, straight round-robin, or weighted round-robin, according to embodiment.
The resultant two input queue candidates are then evaluated in a second prioritization level.
The second prioritization level selects between the multicast and unicast input queue candidates according to straight priority, straight round-robin, or weighted round-robin, according to embodiment.
At least some of the queues may be implemented with queue depths substantially larger than the queue depths implemented for embodiments lacking priority-managed queues.
In some embodiments, the receive path input queue depths are substantially larger than the effective queue depth of the fabric providing the receive data units.
Selected received cells are ignored (or dropped) according to programmable receive (or ingress) bandwidth limitations or policies, to prevent overuse of subsequent VIOC receive resources or associated Local Processor resources.
In some embodiments, the ingress bandwidth limitation policies operate in parallel with the prioritization illustrated in FIG. 6D, and in various other embodiments the ingress policies operate either "before" or "after" the operations depicted in the figure. Some embodiments implement dropping policies according to a single cell, while other embodiments drop all subsequent cells of a packet or a message after dropping a first cell in response to a dropping policy operation.
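The text leaves the form of the ingress bandwidth limitation policies open; one plausible realization is a per-VNIC token bucket, sketched below in C, including the option of dropping a packet's remaining cells once its first cell is dropped. The token-bucket choice itself and all parameters are assumptions, not the patent's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VNIC ingress policer. Only the drop behavior is
 * grounded in the text; the token bucket and its units are assumed. */
struct ingress_policer {
    uint64_t tokens;     /* bytes currently available                     */
    uint64_t burst;      /* bucket depth, bytes                           */
    uint64_t rate;       /* refill, bytes per tick                        */
    bool     drop_rest;  /* once a cell is dropped, drop the packet's
                            remaining cells as well                       */
};

static void policer_tick(struct ingress_policer *p)
{
    p->tokens = (p->tokens + p->rate > p->burst) ? p->burst
                                                 : p->tokens + p->rate;
}

/* Returns true if the cell should be accepted, false if dropped. */
static bool policer_accept(struct ingress_policer *p, uint32_t cell_bytes,
                           bool first_cell_of_packet)
{
    if (p->drop_rest && !first_cell_of_packet)
        return false;                 /* continue dropping this packet */
    if (p->tokens < cell_bytes) {
        p->drop_rest = true;          /* drop subsequent cells too     */
        return false;
    }
    p->tokens -= cell_bytes;
    if (first_cell_of_packet)
        p->drop_rest = false;         /* new packet accepted cleanly   */
    return true;
}
```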
Incoming unicast cellshave a fabric sub-address that identifies the VNIC and receive queue to which the cell is associated. As each unicast cell is pulled from the unicast queue, the receive queue is ascertained and the data payload of the cell is placed into a corresponding receive queue packet buffer. The data carried by multicast cells is replicated in accordance with the multicast group ID and appropriately placed into multiple receive queue packet buffers. The data held within the receive queue packet buffers accumulates until written to Local Processor Memory in accordance with a receive-data-write scheduling policy, as illustrated in FIG. 6E .
the receive-data-write scheduling policy of FIG. 6Eselects the next receive queue to be serviced.
the queue selectiondetermines the next receive data to write through the following indirection.
Each receive queuepoints to a currently active receive descriptor, which describes a receive buffer in the Local Processor Memory that is ready to receive one or more packets.
Each receive queuealso has a corresponding receive queue packet buffer in the ingress shared memory.
the receive data to be next writtenis the data unit (generally a cache-line in memory) identified by the next data unit write pointer of the receive queue packet buffer corresponding to the receive queue selected by the receive-data-write scheduling policy from the candidate pool of all VNIC receive queues.
the selectionis based in part on a prioritization among the VNIC receive queues, in a manner that is configurable in part.
Each data unit writtenis scheduled independently (generally on a cache-line by cache-line basis) by the receive-data-write scheduling policy.
the writes associated with each receive descriptorcarry out the reassembly of corresponding data link layer frames (typically Ethernet frames).
Packet Receive Queues 684 . 1 .Pincludes four packet receive queues such as Packet Receive Queue # 1 684 . 1 .P. 1 and Packet Receive Queue # 4 684 . 1 .P. 4 of VNIC # 1 600 . 1 . These correspond to the four packet receive queues identified by each of Packet Receive Queue Pointer # 1 608 . 1 . 1 . . . Packet Receive Queue Pointer # 4 608 . 1 . 4 respectively (of FIG. 6A ).
Message Receive Queues 684 . 1 .Mincludes two message receive queues (Message Receive Queue # 5 684 . 1 .M. 5 and Message Receive Queue # 6 684 . 1 .M. 6 ) of VNIC # 1 600 . 1 . These correspond to the two message receive queues identified by an active message identified by each of Message Lists 615 . 1 (Message List # 1 615 . 1 . 1 and Message List # 2 615 . 1 . 2 ), as shown by dashed-ellipse 614 . 1 (of FIG. 6A ). The other 15 VNICs are organized identically.
a first prioritization levelselects, for each VNIC, one candidate packet receive queue and one candidate message receive queue.
the first levelincludes Packet Receive Queue Prioritization 682 . 1 .P, selecting one of Packet Receive Queues 684 . 1 .P according to a straight round-robin priority scheme.
the first levelalso includes Message Receive Queue Prioritization 682 . 1 .M, selecting one of Message Receive Queues 684 . 1 .M according to a straight round-robin prioritization scheme.
Identical processing is performed for each of the 16 VNICs, selecting one potential receive packet queue and one potential receive message queue for each VNIC. The resultant 32 candidate queues are then evaluated in a second prioritization level.
the second processing levelselects, for each VNIC, between the packet or the message receive queue as selected by the first prioritization level.
the second levelincludes Packet vs. Message Receive Prioritization 681 . 1 , selecting a packet or a message receive queue according to a straight round-robin priority scheme. Identical prioritization processing is performed for each of the 16 VNICs, selecting one candidate receive queue for each VNIC. These 16 candidate data units are then evaluated in a third prioritization level.
the third and final prioritization levelselects, across the 16 VNICs, a single receive queue.
the third levelincludes VNIC Prioritization 680 , selecting one of the 16 receive queues provided by the second level according to a straight round-robin priority scheme.
a final single resultant receive queueis selected and, through the indirection process described previously, a single data unit (generally a cache-line) is written via the HT interface into Local Processor Memory as abstractly represented by dashed-arrow 699 .
weighted round-robin prioritizationmay be performed for any combination of the first, second, and third prioritization levels, and the associated weights may be fixed or individually programmable, according to embodiment.
Various embodimentsmay also provide individual programmable selection between straight and weighted round-robin for each of the first, second, and third prioritization levels.
Transmit and receive priority algorithmsmay vary according to embodiments.
straight prioritymay implement a static priority having queue # 1 as the highest, queue # 2 as the next highest, and so forth with queue # 4 as the lowest priority.
the priority ordermay be reversed (i.e. # 4 is the highest and # 1 is the lowest).
Round-robin weightingmay be based on data units (cells, for example) or bytes, according to various embodiments.
Weighted fair queuingmay also be provided by some embodiments in place of or in addition to weighted round-robin, and the weighted fair queuing may be based on data units or bytes, according to various embodiments. Round-robin processing may be based on previously processed information or on queue depth, also according to embodiment.
In some embodiments, each VNIC is a member of one VLAN, which is a port-based VLAN (i.e., a virtual LAN defined by logical connection to a designated subset of available logical L2 switch ports).
In other embodiments, each VNIC may be a member of a plurality of VLANs, including at least one port-based VLAN.
The VLANs may be port-based, MAC address-based, IP address-based, and L3 type VLANs.
VLANs may be provisioned and managed by programming VNIC address information accordingly (such as VLAN Identifier 617.1, for example) and by writing corresponding lookup state (such as that retained in TCAM/SRAMs).
VLAN management operations may be relatively static, as related to endpoints included within a server, or relatively dynamic, as related to endpoints external to the server.
Internal endpoint VLAN operations include server and cluster provisioning and re-provisioning, VLAN-specific provisioning, pluggable module insertion and removal, and failover responses, for example.
VLAN operations may be supervised by controlplane processes executing on an SCM (such as Primary SCM 140A), Driver processes executing on Local Processors, or combinations of the two, according to embodiment.
VLAN related processing for egress data to the fabric includes determining a VLAN identifier. If the VLAN identifier does not match the source VLAN identifier, then the egress data may optionally be dropped, according to embodiment.
The source VLAN identifier may be provided directly from VLAN Identifier 617.1 or derived from it, according to embodiment. If the destination MAC is not identifiable, then the egress data may be flooded to all destinations allowed by the source VLAN configuration, according to embodiment.
VLAN related processing for ingress data from the fabric includes determining which VNICs, if any, are members of the VLAN identified by the received data, and providing the data to the member VNICs appropriately. If no VNICs are members of the destination VLAN, then the ingress data may optionally be dropped, according to embodiment.
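The egress and ingress VLAN checks described above reduce to a membership test keyed by VLAN identifier. The C sketch below uses an assumed table and bitmap representation of VNIC membership; the actual lookup state resides in the TCAM/SRAMs, whose layout is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative VLAN checks: egress data whose VLAN does not match the
 * source VNIC's VLAN identifier may be dropped, and ingress data is
 * delivered only to member VNICs. Table sizes and encoding are assumed. */
#define NUM_VNICS  16
#define NUM_VLANS  4096

struct vlan_state {
    uint16_t vnic_vlan[NUM_VNICS];    /* per-VNIC VLANid (617.1 analog) */
    uint16_t members[NUM_VLANS];      /* bit i set: VNIC i is a member  */
};

/* Egress check: returns true if the frame may be sent to the fabric. */
static bool vlan_egress_allowed(const struct vlan_state *s,
                                int src_vnic, uint16_t frame_vlan)
{
    return s->vnic_vlan[src_vnic] == frame_vlan;   /* else optionally drop */
}

/* Ingress check: returns a bitmap of member VNICs that should receive the
 * data; a zero result corresponds to the optional drop case. */
static uint16_t vlan_ingress_targets(const struct vlan_state *s,
                                     uint16_t frame_vlan)
{
    return s->members[frame_vlan & (NUM_VLANS - 1)];
}
```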
VLAN related broadcasts to VNICsmay be wholly or partially implemented using multicast group processing.
VLAN ingress processingmay optionally include determining the VLAN identifier associated with learning a MAC Source Address (MACSA) associated with the ingress data.
Such processing may further include dropping the ingress data if the learning VLAN (i.e., the VLAN from which the MACSA was learned) is different from the destination VLAN.
VLAN broadcasts are implemented by assigning a Multicast Group IDentifier (MGID) to each of the VLAN broadcast groups.
MAC address learning may be performed according to Independent VLAN Learning (IVL) or Shared VLAN Learning (SVL), according to embodiment.
IVL and SVL both enforce inter-VLAN isolation (within the same abstraction layer) through confirmation of VLAN membership based on MAC address.
Under IVL, forwarding entries (i.e., entries of the FIBs) learned on a first VLAN remain unknown to other VLANs. Under SVL, forwarding entries learned on a first VLAN are "shared" with other VLANs.
A forwarding entry learned for a MAC address on a first VLAN (and therefore unreachable at the same abstraction layer by other VLANs) is used by the other VLANs for the limited purpose of dropping frames addressed to the MAC address on the first VLAN.
The MAC address is known by the other VLANs to be unreachable only because of the SVL sharing. In this way, SVL prevents unnecessary flooding within any of the other VLANs, which under IVL would have occurred in a futile effort to reach the MAC address on the first VLAN (which, under IVL, is guaranteed to be unknown to the other VLANs). Further details of IVL and SVL, particularly with respect to TCAM/SRAM configuration and use, are provided in conjunction with the discussion of FIGS. 8A and 8B.
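The IVL/SVL distinction can be illustrated as a difference in how a forwarding-entry lookup treats the VLAN on which a MAC address was learned. The sketch below is illustrative only, assumes at most one entry per MAC under SVL, and does not reflect the TCAM/SRAM organization discussed with FIGS. 8A and 8B.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Contrast of IVL and SVL forwarding-entry lookups. Under IVL the FIB is
 * effectively keyed by (VLAN, MAC); under SVL a MAC entry learned on one
 * VLAN is visible to the others, here used only to suppress flooding
 * toward an unreachable address. Structure and sizes are assumptions. */
struct fdb_entry {
    uint8_t  mac[6];
    uint16_t learned_vlan;
    uint16_t egress_port;
    bool     valid;
};

#define FDB_ENTRIES 4096

/* Returns the matching entry, or NULL. On NULL, *drop tells the caller
 * whether to drop (SVL hit on another VLAN) or flood within the VLAN. */
const struct fdb_entry *fdb_lookup(const struct fdb_entry *fdb,
                                   const uint8_t mac[6], uint16_t vlan,
                                   bool svl, bool *drop)
{
    *drop = false;
    for (int i = 0; i < FDB_ENTRIES; i++) {
        if (!fdb[i].valid || memcmp(fdb[i].mac, mac, 6) != 0)
            continue;
        if (fdb[i].learned_vlan == vlan)
            return &fdb[i];               /* normal hit                   */
        if (svl) {
            *drop = true;                 /* known, but on another VLAN   */
            return NULL;
        }
    }
    return NULL;                          /* unknown: flood in the VLAN   */
}
```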
Static VLAN management operationstypically include distribution of VLAN membership information, removing the need for learning VLAN membership changes relating to provisioning, module insertion and removal, and failover responses.
VLAN learning operationsare performed under the supervision of SCM-based management processes.
At least one VIOCis included in each of PMM 150 A, OLB 160 A, FCM 120 A, and each of the SCMs 140 (including Primary SCM 140 A).
Each VIOCtypically implements functionality for a plurality of VNICs.
the VIOCis illustrated in these various operating contexts in FIGS. 3A , 4 A, 4 B, 4 D, and 4 E.
VIOC operationwas summarized in the pluggable modules section above, in conjunction with an examination of VIOC 301 . 4 of FIG. 4A .
the VIOC 301 . 5is coupled and adapted to directly communicate packets 351 between RAM elements 350 and the Primary Switch Fabric Module 180 A.
the RAM elementsare also accessible by one or more CPUs, enabling processes executing on the CPUs to directly exchange data via the fabric.
CPUs coupled to RAMs accessible by VIOCs in this manner are examples of Local Processors, and the coupled RAMs are examples of Local Processor Memory.
RAM elements 411 . 4 L and 411 . 4 Rare accessible via an HT Channel 460 . 4 R, and the fabric is accessible via a Common Switch Interface consortium (CSIX) channel 149 A.
Control of dataplane functionalitycorresponds to controlplane functionality and conceptually includes forwarding tables and related state information included in part in the TCAM/SRAM.
Control packets (also known as VIOC-CP packets) generated by the SCM are received via the fabric and processed by the VIOCs, resulting in selective accesses to configuration registers and the TCAM/SRAM coupled to each VIOC.
the forwarding and state information of the TCAMs/SRAMsis typically initialized and maintained in this way.
the control packetsare provided by management and controlplane processes executing on any combination of the SCMs, PMMs, and OLBs. Configuration information of a more general nature is typically provided in part by a BMC.
The VIOC and processes executing on the Local Processors communicate in part by sharing portions of the Local Processor Memory space. Included in these shared portions are the packet and message queues as described in the VNIC overview and queuing operation section. In addition, the VIOC itself appears as an intelligent memory-mapped I/O device residing in a portion of the Local Processor Memory space. In this way, the VIOC provides access to configuration registers and certain state relating to packet (and message) transmission and reception.
The packet transmit and receive descriptors associated with the VNICs describe raw data buffers in the Local Processor Memory where packet data is respectively read and written by the VIOC, via DMA operations, in order to implement VNIC functions.
At least some of the packet transmit and receive descriptors are prefetched into buffers on the VIOC to improve performance.
All of the packet receive descriptors corresponding to the VIOC's VNICs are buffered.
The packet CSBs are held within the VIOC to improve performance.
The message context state is kept in either the Local Processor Memory, or in memory private to the VIOC (such as the TCAM/SRAM or the DDR DRAM discussed herein below). Since in certain embodiments the packet CSBs that represent the packet queues are held on-chip, and since some descriptors are buffered on-chip, for some conceptual purposes the queues may be thought of as residing within the VIOC. Those skilled in the art will understand that this is an informal abstraction, as the queues actually reside in Local Processor Memory.
Packet and message transmissioncorresponds to data flowing out from the VIOC to the fabric, also known as VIOC egress, or simply as egress when discussing the VIOC operation. Conversely, packet and message reception corresponds to VIOC ingress, or simply ingress.
FIG. 7Aillustrates selected aspects of one VIOC embodiment as VIOC block diagram 700 A.
VIOC 301includes several interfaces, including a unit for coupling to Double Data Rate (DDR) DRAM memories (DDR Interface 701 ) via coupling 721 , a unit for coupling to an HT channel (HT Interface 702 ) via coupling 722 , and a block for coupling to a BMC (BMC Interface 718 included in VIOC Control 704 ) via coupling 733 .
Further included in VIOC 301 are FICTX 714 (an instance of a VIOC fabric interface communication transmit unit) and FICRX 715 (an instance of a VIOC fabric interface communication receive unit).
FICTX 714 includes egress path elements Vioc EGRess interface (VEGR) 708 , and CSix Transmit unit (CSTX) 710 .
VEGR 708 includes DMATX 716 , an instance of a transmit DMA unit; ECSM 735 , an instance of Egress Control State Machines; and ESMem 736 , an instance of an Egress Shared Memory.
FICRX 715 includes ingress path elements Vioc INGress interface (VING) 709 , and CSix Receive unit (CSRX) 711 .
VING 709 includes DMARX 717 , an instance of a receive DMA unit; ICSM 737 , an instance of Ingress Control State Machines; and ISMem 738 , an instance of an Ingress Shared Memory that in some embodiments is an implementation of the receive logic memory.
Csix Flow Control Unit Transmit side (CFCUTX) 712 and Csix Flow Control Unit Receive side (CFCURX) 713are coupled from the receive path to the transmit path.
CFCUTX 712is used to temporarily suspend sending by CSTX 710 upon receiving an indication of fabric congestion
CFCURX 713is used to indicate VIOC congestion to other modules.
Other VIOC elementsinclude RXDmgr 766 , and shared egress and ingress elements Lookup Engine (LE) 703 and Message State Machine 707 .
VIOC 301 control elementsinclude VIOC Control 704 , in turn including SIM Interface 705 , VIOC Configuration block 706 , and BMC Interface 718 .
egress dataenters VIOC 301 via HT Channel coupling 722 , and flows from HT Interface 702 to VEGR 708 via coupling 750 , under control of DMA read protocols implemented by DMATX 716 .
the egress datacontinues to CSTX 710 via coupling 751 , exiting CSTX 710 via coupling 753 , and exits VIOC 301 via Fabric Coupling 732 .
ingress dataflows in a symmetric reverse path, entering via Fabric Coupling 732 and continuing to CSRX 711 via coupling 763 and then to VING 709 via coupling 761 .
the ingress dataproceeds to HT Interface 702 via coupling 760 under control of DMA write protocols implemented by DMARX 717 to exit VIOC 301 via HT Channel coupling 722 .
Information related to egress flow controlis provided from CSRX 711 to CFCUTX 712 via coupling 752 r.
Egress flow control commandsare provided from CFCUTX 712 to CSTX 710 via coupling 752 t.
Information related to ingress flow controlis provided from CSRX 711 to CFCURX 713 via coupling 762 r.
Ingress flow control commandsare provided from CFCURX 713 to CSTX 710 via coupling 762 t.
Control packet handshakingis provided from FICRX 715 to FICTX 714 as shown by ingress-egress coupling 772 .
Although couplings 750 , 751 , 753 , 760 , 761 , 763 , and 772 are illustrated as unidirectional, this is only to highlight the primary flow of data, as control and status information, for example, flows along similar pathways in a bidirectional manner.
Internal egress path related coupling 770 and ingress path related coupling 771 illustrate LE 703 request and returning status and result communication with VEGR 708 and CSRX 711 , respectively.
VIOC Configuration block 706includes configuration and mode information relating to operation of VIOC 301 , generally organized into registers, including system configuration registers and local configuration registers.
the system and local configuration registersare typically accessed by management processes executing on Primary SCM 140 A, by control packets sent to Fabric Coupling 732 , and then processed by CSRX 711 and SIM Interface 705 .
the system registersare typically inaccessible to processes executing on Local Processors, and include a plurality of scratchpad registers typically used for communication with the management processes.
the local registersare typically accessible via the HT channel by Hypervisor, OS, and Driver processes executing on Local Processors. Hypervisor and OS processes typically configure environments for Application processes so that the local configuration registers are inaccessible to the Application processes.
the system registersinclude VNIC related registers, such as Address Block 601 . 1 (of FIG. 6A ) for each of 16 VNICs. Also included is a bit (or mask) per VNIC to enable and disable the corresponding VNIC.
the local registersinclude pointers and state information associated with I/O Block 605 . 1 (of FIG. 6A ) of each of the 16 VNICs.
Local Processor access to the system registersmay be provided by manipulation of a field in the system configuration registers.
the system and local configuration registersare accessible via BMC command and data information received from BMC Interface 718 .
VIOCs included on controlplane modulesare initialized by BMC commands to enable selected privileged operations, including transmission via the fabric of control packets without lookup processing (these packets are also referred to as ‘No Touch’ packets).
No Touch packetsmay be used for control packets (to initialize or modify forwarding information included in TCAM/SRAMs) and to forward an exception packet from an SCM to the proper destination.
VIOCs included on other modulesare initialized to disable No Touch packet transmission, i.e. packets (and messages) are always processed with an egress lookup.
SIM Interface 705 is coupled to receive control packets from CSRX 711 as typically provided by controlplane processes executing on an SCM included in a SIM. The control packets are parsed to determine the included command and any associated parameters, such as address and data. SIM Interface 705 then passes the command and parameters to the proper element of VIOC 301 for execution. Return handshake status is typically provided in the form of a packet addressed to the sending SCM from FICRX 715 to FICTX 714 via ingress-egress coupling 772 , and FICTX 714 provides the packet to the fabric.
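A hedged Python sketch of the kind of parse-execute-respond sequence just described follows. The packet layout, opcode values, and function names here are assumptions made purely for illustration; only the overall structure (extract command and parameters, apply them to configuration state, build a status packet back to the sending SCM) follows the text.

```python
# Assumed control-packet layout: 1-byte command, pad, 16-bit address, 32-bit data.
import struct

REG_READ, REG_WRITE = 0x1, 0x2          # hypothetical opcodes

def handle_control_packet(packet: bytes, registers: dict, scm_addr: int) -> bytes:
    cmd, addr, data = struct.unpack(">BxHI", packet[:8])
    if cmd == REG_WRITE:
        registers[addr] = data
        status, result = 0, 0
    elif cmd == REG_READ:
        status, result = 0, registers.get(addr, 0)
    else:
        status, result = 1, 0            # unknown command -> error status
    # Handshake/status packet addressed back to the sending SCM.
    return struct.pack(">HBxHI", scm_addr, status, addr, result)

regs = {}
handle_control_packet(struct.pack(">BxHI", REG_WRITE, 0x10, 0xDEAD), regs, scm_addr=0x05)
print(hex(regs[0x10]))                   # 0xdead
```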
BMC Interface 718 includes logic for interfacing to a BMC, including receiving, processing, and responding to BMC commands received via coupling 733 .
The interface parses the command, provides the command and any associated parameters to the proper unit of VIOC 301 , and returns response information as appropriate.
HT Interface 702includes an HT Channel compatible external interface providing read and write style accesses to resources available via coupling 722 .
Read response information(typically associated with egress processing) is provided to VEGR 708 via coupling 750 .
Write information(typically associated with ingress processing) is provided from VING 709 via coupling 760 .
the read and write accessestarget memory locations in RAMs coupled to CPUs coupled in turn to HT Channel coupling 722 (i.e. Local Processor Memory).
HT Channel coupling 722is an illustrative instance of HT couplings including 460 . 4 R, 460 . 5 R, 460 . 5 R′, 460 . 6 R (of FIGS. 4A , 4 B, 4 B, and 4 E, respectively).
DDR Interface 701includes logic for coupling to DDR DRAMs via coupling 721 .
DDR Interface 701communicates with Message State Machine 707 via coupling 767 , as shown.
DDR Interface 701also communicates with other elements of VIOC 301 via implicit communication paths that allow access to the DRAMs via control packets (SIM Interface 705 ), BMC commands (BMC Interface 718 ), and processes executing on Local Processors (HT Channel coupling 722 ), in addition to VIOC internal requests (Message State Machine 707 , VIOC Control 704 , FICTX 714 , and FICRX 715 ).
the topology of these pathswill be understood by those of ordinary skill in the art.
DDR SDRAMtypically includes data structures related to context and message related processing (such as CSBs), as well as virtual to physical address translation for transmit and receive buffer addresses in Local Processor Memory.
Message State Machine 707manages message state (including connecting, established, closing, and closed) for a plurality of message contexts, such as those associated with the 16 VNICs, according to a connection-oriented reliable protocol.
message stateis stored in part in DDR coupled via coupling 721 to DDR Interface 701 , and coupling 767 communicates requests for DDR reads and writes, as well as resultant read data between the state machine and the DDR interface.
the state machinealso provides for message handshaking and re-delivery attempts by appropriate processing and communication with FICTX 714 and FICRX 715 , via implicit couplings that will be understood to those of ordinary skill in the art.
message related inspection and processing of incoming informationmay be performed in CSRX 711 under the direction of Message State Machine 707 .
message related processing and information insertionmay be performed in CSTX 710 also under the control of the state machine.
In some embodiments, logic units for performing RDMA are also included.
RXDmgr 766includes logic for fetching and prefetching receive descriptors to support ingress operation. Receive descriptor requirements and results are communicated with FICRX 715 via coupling 764 . Requests to read descriptors from Local Processor Memory are provided to HT Interface 702 via coupling 765 , and returning data is returned via coupling 765 .
FICTX 714includes logic (VEGR 708 ) implementing egress path processing, including accessing packet data for transmission and cellification using DMA protocols, according to configured priorities and bandwidth allocations, and including one lookup (LE 703 via coupling 770 ).
the lookuptypically provides a fabric egress port based in part on the packet destination address (typically a MAC address) and relevant VLAN related information.
the included logicalso implements packet data cellification and CSIX cell-level processing (CSTX 710 ). An overview of selected aspects of packet access and cellification is provided with respect to FIG. 3A .
FICTX 714processes selected multicast packets (and hence cells) using cell-level multicast capability provided by the fabric.
VEGR 708includes logic blocks performing packet egress processing functions including transmit queue management and scheduling (see FIG. 6B and the related discussion), transmit packet scheduling, packet segmentation into cells (including a packet address processing lookup via LE 703 ), various control state machines within ECSM 735 , and an egress shared memory ESMem 736 .
DMATX 716included in VEGR 708 , is configured to transfer packet image data from Local Processor Memory to the egress shared memory, and further configured to transfer data from the egress shared memory to CSTX 710 .
the VOQsare implemented as pointer managed buffers that reside within the egress shared memory.
the DMA transfersare managed by the control state machines in VEGR 708 according to bandwidth and priority scheduling algorithms.
Logic units in CSTX 710read cell data according to the VOQs as scheduled by a VOQ prioritizing algorithm (see FIG. 6C and the related discussion), calculate horizontal parity, vertical parity, and CRC for each cell, and then send the results and the cell data to the fabric.
Logic units in CSTX 710include CSIX egress queue structures and associated transmit data path (FIFO) buffers, CSIX compatible transmit flow control logic responsive to information received from CFCUTX 712 , logic responsive to information received from CFCURX 713 (to apply fabric back-pressure using CSIX compatible receive flow control instructions), and a transmit-side CSIX compatible external interface for Fabric Coupling 732 .
CFCUTX 712(shown outside of FICTX 714 in the figure, but closely associated with egress processing) includes fabric congestion detection logic and VOQ feedback control logic to instruct CSTX 710 to stop sending cell traffic from a VOQ when fabric congestion is detected. When the congestion is relieved, the logic instructs CSTX 710 to resume cell traffic from the stopped VOQ. Fabric congestion information is provided to CFCUTX 712 from CSRX 711 as it is received from the fabric.
The VOQ prioritizing algorithm implemented in CSTX 710 includes configurable weighted round-robin priority between unicast output queues and multicast output queues, round-robin priority among VOQ groups, and straight priority within VOQ groups.
The algorithm also guarantees that all cells associated with a given packet are sent in order, and further that cells from different packets from the same VOQ are not intermingled. In other words, once a first cell for a packet from a selected one of the VOQs is sent, then the remainder of the cells for the packet are sent before any cells of any other packet from the selected VOQ are sent.
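The following Python sketch illustrates the selection order just described under simplifying assumptions: the weights, group counts, and queue shapes are invented for the example, and the scheduler names (VoqScheduler, next_cell) are not from the patent. Keeping each VOQ as a FIFO is what preserves the per-packet cell ordering noted above.

```python
# Simplified VOQ selection: weighted round-robin between the unicast and
# multicast queue sets, round-robin among VOQ groups, straight priority
# within a group.  Per-VOQ FIFOs keep cells of one packet in order.
from collections import deque
from itertools import cycle

class VoqScheduler:
    def __init__(self, groups_per_set=4, prios_per_group=4, uc_weight=3, mc_weight=1):
        def make_set():
            return [[deque() for _ in range(prios_per_group)] for _ in range(groups_per_set)]
        self.voqs = {"uc": make_set(), "mc": make_set()}
        self.set_order = cycle(["uc"] * uc_weight + ["mc"] * mc_weight)
        self.next_group = {"uc": 0, "mc": 0}

    def enqueue(self, kind, group, prio, cell):
        self.voqs[kind][group][prio].append(cell)

    def next_cell(self):
        first = next(self.set_order)
        for kind in (first, "mc" if first == "uc" else "uc"):
            groups = self.voqs[kind]
            for i in range(len(groups)):                  # round-robin among VOQ groups
                g = (self.next_group[kind] + i) % len(groups)
                for q in groups[g]:                       # straight priority within a group
                    if q:
                        self.next_group[kind] = (g + 1) % len(groups)
                        return q.popleft()
        return None

sched = VoqScheduler()
sched.enqueue("uc", group=0, prio=1, cell="pktA-cell0")
sched.enqueue("uc", group=0, prio=1, cell="pktA-cell1")   # same VOQ: stays in order
sched.enqueue("mc", group=2, prio=0, cell="pktB-cell0")
print(sched.next_cell(), sched.next_cell(), sched.next_cell())
```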
FICRX 715includes logic implementing ingress path processing, including CSIX cell-level processing (CSRX 711 ), and packet-level processing (VING 709 ), including storing reassembled packets using DMA protocols.
An optional lookup(LE 703 ) is performed under the control of CSRX 711 via coupling 771 . The lookup provides information related to processing the packet, including mapping the packet to the proper receive queue.
Logic units in CSRX 711receive, buffer, and parse cell data from the fabric.
Logic units in CSRX 711include a receive-side CSIX compatible external interface for Fabric Coupling 732 , CSIX ingress queue structures and associated CSIX receive data path (FIFO) buffers, a CSIX cell parser unit, and transmit and receive flow control information detection logic.
CFCURX 713(shown outside of FICRX 715 in the figure, but closely associated with ingress processing) includes VIOC congestion detection logic and fabric feedback control logic to instruct the fabric to stop sending cell traffic of a specific priority when VIOC congestion is detected for that priority. When the congestion is relieved, the logic instructs the fabric to resume cell transmission. Receive flow control instructions to the fabric are communicated via CSTX 710 . This method of congestion relief is referred to elsewhere herein as applying fabric back-pressure.
Cell datais received from the fabric, including horizontal parity, vertical parity, and CRC.
the parities and CRCare computed for the received data, checked, and errors logged.
Cell and packet headersare parsed, and in some embodiments an optional lookup is performed (LE 703 ) for selected unicast packets to determine in part an appropriate receive queue.
an optional lookupis performed for multicast packets, VLAN broadcast packets, or both, according to embodiment, to determine in part one or more appropriate receive queues or multicast group identifiers, also according to embodiment.
Unicast lookups(if performed) are typically based in part on a source fabric port address and a context key included in the packet header. Some embodiments omit unicast lookups entirely. Control packet data is written into a control packet portion of the CSRX's FIFOs, and subsequently sent to SIM Interface 705 for further processing, while non-control packet data is written to a data portion of the CSRX's FIFOs.
VING 709includes logic blocks performing packet ingress processing functions including receive and completion queue management and scheduling, receive packet scheduling (see FIG. 6D and the related discussion), cell reassembly into packets, various control state machines, and an ingress shared memory.
DMARX 717included in VING 709 , is configured to transfer cell data into Local Processor Memory from the Ingress Shared Memory (ISMem 738 ). The DMA transfers are managed by the Ingress Control State Machines (ICSM 737 ) in VING 709 .
A receive buffer is considered complete (or consumed) when either the last available location in a buffer is written, or the last cell of a packet is written.
Buffer completion is indicated by writing an entry to one of the completion queues, with data including packet receive status (Error or OK), receive processing (or thread) number, and context key (if the data includes the last cell of the packet).
The completion queue write information optionally includes results of packet-level CRC and 1's complement computations for use by Driver or other processes executing on the Local Processors.
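A minimal Python sketch of the completion information just listed follows; field names and Python types are assumptions for illustration (the patent gives the fields, not an encoding).

```python
# Assumed representation of a completion queue entry and of the buffer
# consumption rule described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompletionEntry:
    status_ok: bool                        # packet receive status: OK or Error
    thread: int                            # receive processing (thread) number
    context_key: Optional[int] = None      # present only for the packet's last cell
    pkt_crc: Optional[int] = None          # optional packet-level CRC result
    ones_complement: Optional[int] = None  # optional 1's complement checksum

def buffer_consumed(buf_full: bool, last_cell: bool) -> bool:
    """A receive buffer completes when it fills or the packet's last cell arrives."""
    return buf_full or last_cell

completion_queue = []
if buffer_consumed(buf_full=False, last_cell=True):
    completion_queue.append(CompletionEntry(status_ok=True, thread=2, context_key=0x1A2B))
print(completion_queue[0])
```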
VING 709controls movement of cell data from the ingress shared memory (ISMem 738 ) to Local Processor Memory during packet reconstruction according to various receive scheduling algorithms (see FIG. 6E and the related discussion), including determination of buffers in the Local Processor Memory, selection of cell data to transfer, and movement of the packet data to the Local Processor Memory. Buffers are determined by processing receive descriptors, and in some embodiments the receive descriptors are fetched and processed according to a round-robin priority between the groups of receive queues. Cell data is scheduled for transfer to the Local Processor Memory according to a selection between unicast and multicast queues for insertion into the receive queues, as illustrated by FIG. 6D .
Packet data movement into the Local Processor Memoryis also directed in part according to flow control information from HT Interface 702 that VING 709 responds to in order to prevent overrunning limited resources in the HT interface.
Fabric Coupling 732in one embodiment, includes a Fabric Interface Chip (FIC) providing low-level functions relating to coupling to an embodiment of SFM 180 that includes Fabric Switch Chips (FSCs).
Fabric Coupling 732is an illustrative instance of generic fabric couplings, which in the system as a whole further includes FCM-Fabric coupling 129 A, NM-Fabric coupling 139 A, OLB-Fabric coupling 169 A, PMM-Fabric coupling 159 A, and PMM-Fabric coupling 159 A′, for example.
LE 703includes TCAM and SRAM interfaces, and accepts egress lookup requests from VEGR 708 and ingress lookup requests from CSRX 711 .
Lookup requestsinclude a key and a look up transaction identifier.
the TCAMis searched for a first entry matching the key, and if a match is found, a corresponding entry from the SRAM is read.
the requesteris notified by a handshake, and the transaction identifier, a match indication, and result data (if a match) are returned to the requestor (one of VEGR 708 and CSRX 711 ).
Request processingis pipelined in LE 703 , but if the Lookup Engine is unable to accept a request, then an acceptance delayed indicator is provided to the requestor.
The key and the results are each 64 bits; both are multiplexed in two 32-bit chunks, and the transaction identifier is 4 bits.
LE 703supports directly reading and writing the TCAM and SRAM to examine and modify lookup information, via requests from BMC Interface 718 , SIM Interface 705 , and HT Interface 702 .
VIOC 301 and related componentsare initialized to set configuration, mode, initial state, and other related information.
selected management and configuration control information maintained in VIOC Configuration block 706is written by an external BMC via coupling 733 and BMC Interface 718 .
Additional informationis optionally written by an external agent via packets received from Fabric Coupling 732 , CSRX 711 , and SIM Interface 705 .
Additional informationmay also be optionally written by an agent coupled to HT Channel coupling 722 via HT Interface 702 .
the management and configuration control informationis provided by management processes executing on Primary SCM 140 A, as described elsewhere herein.
Initial (as well as subsequent) ingress and egress lookup informationis typically provided by controlplane and related processes executing on Primary SCM 140 A.
the informationis included in packets sent by the processes and received via Fabric Coupling 732 , CSRX 711 , and SIM Interface 705 .
the lookup informationis stored in TCAM/SRAM resources coupled to VIOC 301 via TCAM/SRAM coupling 723 by LE 703 . Portions of state stored in the TCAM/SRAM may also be optionally initialized by the agent coupled to HT Channel coupling 722 via HT Interface 702 and LE 703 .
VIOC 301 and related elementsare available for directly communicating packets (and messages) between clients coupled to the fabric, as described with respect to FIG. 3A and FIG. 4A .
the communicationis bidirectional, including egress (from Local Processor Memory to fabric) and ingress (from fabric to Local Processor Memory), and is typically accessible to processes executing on Local Processors via a VNIC-style interface as illustrated by FIG. 6A .
Egress operation serves to directly transmit a buffer of packet data, as provided by a Driver process in conjunction with an OS executing on a Local Processor, to the fabric.
The Driver (or optionally an Application process) forms a packet image within the buffer.
A transmit descriptor, including the physical address of the buffer, the buffer size, a valid indicator, and a done indicator, is fabricated by the Driver and placed on one of the transmit descriptor queues.
The valid indicator is asserted by the Driver to indicate the descriptor is ready for processing by VIOC 301 .
The done indicator is initially deasserted by the Driver and later asserted by VIOC 301 when the descriptor and the underlying buffer data have been fully processed by the VIOC. Upon assertion of done the buffer is available for subsequent use by the Driver.
The Driver informs VIOC 301 that additional packet data is available for transmission by accessing a corresponding Doorbell, asynchronously interrupting VIOC 301 .
The Doorbell access is sent via HT Channel coupling 722 , received by HT Interface 702 , and processed by VIOC Control 704 .
VIOC 301 polls transmit descriptors, examining the associated valid indicators, to determine that additional packet data is available for transmission.
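The Driver/VIOC transmit handshake just described can be sketched as follows in Python. The class and field names are illustrative assumptions; only the valid/done semantics and the doorbell notification follow the text.

```python
# Assumed shapes for the transmit descriptor handshake: the Driver posts a
# descriptor with valid asserted and done deasserted, rings a doorbell (or the
# VIOC polls valid bits), and the VIOC asserts done when the buffer is consumed.
from dataclasses import dataclass, field

@dataclass
class TxDescriptor:
    buf_addr: int          # physical address of the packet image buffer
    buf_len: int
    valid: bool = False    # set by the Driver: ready for VIOC processing
    done: bool = False     # set by the VIOC: buffer may be reused by the Driver

@dataclass
class TxQueue:
    ring: list = field(default_factory=list)
    doorbell: int = 0      # bumped by the Driver to notify the VIOC

def driver_post(q: TxQueue, addr: int, length: int):
    q.ring.append(TxDescriptor(addr, length, valid=True))
    q.doorbell += 1                      # "Doorbell Ring"

def vioc_service(q: TxQueue):
    for d in q.ring:
        if d.valid and not d.done:
            # ... DMA-read the buffer, cellify, send cells to the fabric ...
            d.done = True                # buffer now reclaimable by the Driver

q = TxQueue()
driver_post(q, addr=0x1000_0000, length=1514)
vioc_service(q)
print(q.ring[0].done)                    # True
```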
VEGR 708accesses the transmit queue using DMA processing functions included in DMATX 716 according to the bandwidth and priority scheduling algorithms of FIG. 6B .
Algorithms implemented by the priority scheduling of the transmit queue accessesinclude straight priority, round-robin, and weighted round-robin, and priority is determined between transmit packet queues and transmit message queues.
the information obtained from the queueincludes a descriptor including a pointer to the new packet images.
VEGR 708analyzes the descriptor, providing the pointer to DMATX 716 and requesting additional accesses to begin to read in the packet image.
the packet imagebegins with a packet header, including packet destination and priority information.
VEGR 708formats information from the packet header, including the destination, along with VLAN processing related information, into a lookup request in a suitable format, and passes the request to LE 703 .
LE 703accepts the request when room for the request is available, and processes it by accessing the coupled TCAM/SRAM.
the lookup resultincluding a destination fabric port address, is used in forming appropriate cell headers, including a fabric destination port address. Packet data is cellified and sent to CSTX 710 for further processing.
CSTX 710receives the cellified packet data, including cell header information, and processes the data according to the VOQ prioritizing algorithms of FIG. 6C .
Cell datais sent according to the configured priority algorithm, and CSTX 710 is further responsive to flow control instructions received from CFCUTX 712 .
Cell parities and CRCare calculated and provided to Fabric Coupling 732 along with the cell header and data information.
Ingress operationis conceptually the reverse of egress operation, and serves to directly receive packet data into a buffer in Local Processor Memory, the buffer being directly accessible by a Driver process (and optionally an Application process) executing on a Local Processor.
a receive descriptorincluding the physical address of the buffer, the buffer size, and a valid indicator, is fabricated by the Driver and placed on one of the receive descriptor queues. The valid indicator is asserted by the Driver to indicate the descriptor is ready for use by VIOC 301 .
VIOC 301prefetches (under the direction of RXDmgr 766 ) and preprocesses one or more receive descriptors in preparation for receiving cell data and storing it into a new receive buffer in Local Processor Memory.
a completion queue entryis written by VIOC 301 when the buffer has been filled with packet image data.
CSRX 711receives CSIX cells, checks parities and CRC for the received cells, parses cell headers, and for the first cells of packets, parses a packet header. Information related to flow control is provided to CFCURX 713 , and fabric back-pressure is applied (via CSTX 710 ) when VIOC congestion is detected. A lookup is performed via LE 703 for the first cells of multicast packets, to determine proper destinations and required replication of the packet. Further within CSRX 711 , control packet data is FIFOed for presentation to and processing by SIM Interface 705 , while non-control packet data is FIFOed for further data path processing in accordance with FIG. 6D as discussed elsewhere herein.
VING 709directs DMARX 717 to store received non-control packet data as complete or partially reassembled packets into Local Host Memory via DMA transfers according to the various receive scheduling algorithms of FIG. 6E .
VING 709directs writing a corresponding completion queue entry, including status (Error or OK), thread number, context key, and optionally packet-level CRC and 1's complement results. This completes the reception of the packet (if the last cell was received) and the packet image is available for use directly by the Driver (or optionally an Application) process executing on a Local Processor.
Control packetsare sent in-order to SIM Interface 705 for further processing.
SIM Interface 705parses the control packet and passes command, address, and data information to the appropriate VIOC element for execution.
Return handshake status and result informationis typically provided via ingress-egress coupling 772 as a packet (typically addressed to an SCM) for transmission to Fabric Coupling 732 .
Control packetstypically provided from a controlplane process executing on Primary SCM 140 A, may be used at any time to effect updates or changes to forwarding, VLAN, multicast, and other related state information included in TCAM/SRAM coupled to VIOC 301 .
Egress and ingress buffer operation is not restricted to physical addresses, as virtual addresses may be supplied in transmit and receive descriptors.
VIOC 301 references coupled DDR DRAM via coupling 721 to access translation mapping information. The VIOC then translates the virtual addresses to physical addresses and processing proceeds accordingly.
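The translation step can be sketched as below; the page size and the dictionary standing in for the mapping information held in DDR DRAM are assumptions, since the patent does not specify the table format.

```python
# Hypothetical virtual-to-physical translation: a page-table-like map keyed by
# virtual page number stands in for the mapping information read from DDR DRAM.
PAGE_SHIFT = 12                        # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

translation_map = {                    # virtual page number -> physical page number
    0x40: 0x9F21,
    0x41: 0x3117,
}

def translate(vaddr: int) -> int:
    vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
    ppn = translation_map[vpn]         # a miss would be handled as an error/fault
    return (ppn << PAGE_SHIFT) | offset

print(hex(translate(0x40ABC)))         # 0x9f21abc: page 0x40 -> 0x9f21, offset kept
```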
Message State Machine 707manages various aspects of the reliable connection-oriented protocol, and directs overall processing according to message related queues and associated descriptors.
the reliable protocolincludes handshake and retry information that is communicated from VING 709 to VEGR 708 via ingress-egress coupling 772 for communication as messages on Fabric Coupling 732 .
Message operationis otherwise similar to packet operation.
Embodiments implementing only packet operationomit Message State Machine 707 and associated processing logic.
VIOC 301 is an example embodiment only.
For example, the external couplings may have bandwidths differing from those described heretofore.
In a controlplane context, such as an SCM included on a SIM, the VIOC included in the SCM is coupled to the SFM via one-half the coupling bandwidth of the VIOC included in the PMM.
Similarly, the VIOC included in the OLB is coupled to the SFM via one-half the coupling bandwidth of the VIOC included in the PMM.
VIOC 301 is only illustrative, and any number of other arrangements and variations may be implemented.
The functions relating to SIM Interface 705 , VIOC Configuration block 706 , and Message State Machine 707 may be incorporated within the control state machines of VEGR 708 and VING 709 .
Functions implemented in the blocks of FICTX 714 and FICRX 715 may be rearranged in other equivalent organizations. These and other such arrangements are contemplated in various embodiments.
FIG. 7B illustrates selected aspects of egress operation of an embodiment of a VIOC as flow diagram Egress Overview 700 B. For clarity, only selected details related to packet processing are shown (message processing is similar). Processing begins upon receipt of “Doorbell Ring” 781 , indicating availability of one or more new transmit descriptors pointing to packets available for transmission. Flow proceeds to “Descriptor Fetch” 780 . 1 , where transmit descriptors are fetched (in accordance with the scheduling illustrated by FIG. 6B ) and passed to “Valid?” 780 . 2 to determine which descriptors are valid for processing by the VIOC. If an invalid descriptor is detected, then an error condition is present, and processing is complete (“No” 780 . 2 N proceeds to “End” 780 . 14 ). If the descriptor is valid, then flow continues to “Program DMATX Transmit Q Fetch” 780 . 4 via “Yes” 780 . 2 Y.
“Program DMATX Transmit Q Fetch” 780 . 4 analyzes the fetched descriptor information to determine the buffer address and length, and configures DMATX 716 to fetch the packet data located in the buffer and to store the data into ESMem 736 .
The fetched packet data is in turn analyzed to determine the destination, and a lookup is performed according to the MAC destination address (MACDA) and the VLAN associated with the descriptor at “Lookup” 780 . 5 .
the lookup result, including a destination fabric port addressis used in part by “Form Packet Header” 780 . 6 to formulate a working image of a packet header.
the packet headerincludes other information from the address block of the VNIC sourcing the transmission (such as Address Block 601 . 1 , of FIG. 6A ), including a MAC source address (such as MAC Address 603 . 1 , of FIG. 6A ), and a VLAN tag (such as VLAN Identifier 617 . 1 , of FIG. 6A ).
Some embodiments implement selected VLAN processing, such as dropping the packet if source and destination VLANs are different.
Processingcontinues as “Scheduled?” 780 . 7 determines if a first cell of the packet is scheduled, and if not, loops back via “No” 780 . 7 N until the cell is scheduled. The scheduling is as illustrated in FIG. 6C .
flowproceeds to “Program DMATX Output Q Fetch” 780 . 8 where DMATX 716 is programmed to fetch data from ESMem 736 for insertion into the appropriate output queue.
the output queuesare implemented within ESMem 736 . It will be understood that data transfers within the same memory structure may be at least in part performed logically via pointer manipulation rather than via physical data transfers.
a cell headeris formulated in “Form Cell Header” 780 . 8 A, for encapsulating cells corresponding to the packet.
the fetched datais processed (“Compute Checksum, CRC” 780 . 8 B) to determine packet-level error detection information in CSTX 710 (of FIG. 7A ).
the cell header and cell dataare then ready for transmission on the fabric by CSTX 710 (“Transmit Cells” 780 . 9 ).
“Packet Complete?” 780 . 10 determines if the entire packet has been transmitted. If not (“No” 780 . 10 N), then flow returns to “Scheduled?” 780 . 7 to continue sending the packet. If the entire packet has been transmitted (“Yes” 780 . 10 Y), then flow proceeds to “Modify Transmit Q Descriptor” 780 . 11 to indicate that the buffer identified by the transmit descriptor has been transmitted by setting the done indicator accordingly.
“Interrupt Requested?” 780 . 12 determines if an interrupt to the Local Processor is requested, based in part on an interrupt request indicator included in the transmit descriptor, in one embodiment. If an interrupt is requested (“Yes” 780 . 12 Y), then flow continues to request an interrupt (“Interrupt” 780 . 13 ) and then processing of the information related to the descriptor is complete (“End” 780 . 14 ). If an interrupt is not requested (“No” 780 . 12 N), then processing is complete (“End” 780 . 14 ).
Egress Overview 700 Bis representative of the overall flow with respect to one cell, including any special processing relating to completion of a packet. However, according to various embodiments, such processing may be wholly or partially overlapped for a plurality of cells.
Descriptor Fetch 780 . 1may provide a plurality of descriptors, each pointing to a plurality of cells, and each of the respective cells are processed according to the remainder of the flow.
a first cellmay remain in ESMem 736 indefinitely once processed by Program DMATX Transmit Q Fetch 780 . 4 , while subsequent cells are processed according to Program DMATX Transmit Q Fetch 780 . 4 .
cellsmay be removed from ESMem 736 in a different order than stored, according to Program DMATX Output Q Fetch 780 . 8 .
FIG. 7C illustrates selected aspects of ingress operation of an embodiment of a VIOC as flow diagram Ingress Overview 700 C. For clarity, only selected details related to packet processing are shown (message processing is similar and is omitted). Processing begins when a cell is received from the fabric and enqueued (according to priorities illustrated by FIG. 6D ) by CSRX 711 , as indicated by “Cell Received” 791 . Flow continues to “Check Parities, CRC” 790 . 1 , where cell-level error check computations are performed. The error results are checked (“Data OK?” 790 . 2 ), and if the data is incorrect (“No” 790 . 2 N), then the error is recorded (“Log Error” 790 . 3 ) and processing for the cell is complete (“End” 790 . 16 ). If the data is correct (“Yes” 790 . 2 Y), then processing proceeds to “UC/MC?” 790 . 4 .
“UC/MC?” 790 . 4 determines if the cell is a multicast (“MC” 790 . 4 M) or a unicast (“UC” 790 . 4 U) type cell.
Unicast processing continues at “Enque By VNIC/Q” 790 . 5 , where the received cell is enqueued in a selected unicast receive queue according to VNIC number and receive priority (or queue).
Multicast processing continues at “Lookup” 790 . 17 , where a lookup is performed by LE 703 (of FIG. 7A ) based at least in part on the MGID as discussed elsewhere herein in conjunction with FIG. 8B .
The lookup provides information describing the VNICs to receive the multicast data, and the cell is enqueued accordingly (“Enqueue Multicast” 790 . 18 ).
Some embodimentsimplement selected VLAN processing such as dropping the cell if the learning VLAN is different from the destination VLAN.
the receive queuesare implemented within ISMem 738 .
Processing in “Scheduled?” 790 . 8 determines if the cell has been scheduled (according to priorities illustrated in FIG. 6E ), and if not (“No” 790 . 8 N), then processing loops back. If the cell has been scheduled (“Yes” 790 . 8 Y), then processing continues at “Program DMARX DeQueue Fetch” 790 . 9 , where DMARX 717 (of FIG. 7A ) is programmed to fetch the cell data from the shared ingress memory and to store the cell data into local host memory according to the fetched receive descriptor. Error checking information is then computed (“Compute Checksum, CRC” 790 .
processing for multicast cellsis performed wholly or partially in parallel, including embodiments where all multicast destinations for the cell are processed in parallel.
Processing in “Write Completion Q Descriptor” 790 . 12 records completion of the processing of the receive descriptor, or consumption of the descriptor, by entering an entry on a designated write complete queue. The entry includes packet and error status. Then a determination is made (“Interrupt>Threshold ?” 790 . 13 ) as to whether the number of receive events exceeds a specified threshold. If so (“Yes” 790 . 13 Y), then an interrupt is signaled to the Local Processor (“Interrupt” 790 . 14 ). If the threshold has not been reached (“No” 790 . 13 N), then a further determination is made if a timer has expired (“Timer Expired?” 790 . 15 ). If so (“Yes” 790 .
Ingress Overview 700 Cis representative of the overall flow with respect to one cell, including any required multicast processing and special processing relating to an EOP cell or consumption of a receive descriptor. However, according to various embodiments, such processing may be wholly or partially overlapped for a plurality of cells. For example, once a first cell has been processed according to “Enque By VNIC/Q” 790 . 5 , processing for the first cell may be suspended indefinitely, and in the meantime a plurality of additional cells may be received and processed up to and including “Enque By VNIC/Q” 790 . 5 . In addition, cells may be processed according to “Program DMARX DeQueue Fetch” 790 .
“Scheduled?” 790 . 8is conceptually performed for many (for example all) enqueued cells on every cycle, even though only a subset of cells is scheduled according to “Yes” 790 . 8 Y (for example, only a single cell may be scheduled).
ISMem 738is used to store received cells during some portion of the processing time corresponding to “Enque By VNIC/Q” 790 . 5 through “Yes” 790 . 8 Y.
processing of a first cell according to “Lookup” 790 . 17may be wholly or partially concurrent with processing of a second cell according to “Check Parities, CRC” 790 . 1 .
Packets are aligned on cache line boundaries, and packets are segmented along cache line boundaries into cells.
Local host memory may be referenced a cache line at a time.
Reception of the first cell of the packet provides a full cache line of data that is also aligned with respect to the receiving buffer, and the entire received cell may be written to the local host memory in a single transaction. Subsequently received cells may also be written one cell per cache line transaction.
In some embodiments, packet-level error detection information is computed, transmitted, and checked upon reception irrespective of packet size. In other embodiments, if all of the data for a packet fits in a single cell, then no packet-level error detection information is computed or transmitted, enabling the transmission of additional data bytes instead of the error detection information. For example, if a two-byte CRC is used for error detection, then two additional data bytes may be sent instead of the CRC. In these circumstances the cell error detection information (such as a CRC) provides error detection information for the packet.
TCAM lookups may be performed using a combination of “local” and “global” masks.
Each entry in the Tag array logically has a data field (holding a data value) and an associated equal-width local mask field (holding a local mask value). Equivalently (and possibly physically), the Tag array may also be described as having a data array and a corresponding mask array.
One or more global mask registers, located outside the Tag array, hold a global mask value of the same width as the data and local mask values.
During a search, the data value of each entry has applied to it the associated local mask value of the entry and a selected global mask value.
The masked data value is then compared against the search key.
One or more entries in the Tag array may result in a hit (a match with the key).
A priority encoder selects the highest priority entry (the match at the lowest address), which is used as an index to address the SRAM and retrieve the corresponding result entry.
The mask values that will be used for a search are often known well in advance and are often stable for extended periods of operation. Accordingly, many mask values (in the local mask array and the global mask registers) may be programmed well in advance of the search. This permits the searches to proceed more quickly.
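The search behavior just described can be modeled in a few lines of Python. This is an assumption-laden sketch, not the hardware: the class names are invented, the key layout is simplified (PathID/TableID omitted), and a set mask bit is taken to mean "participates in the comparison," consistent with the text. The usage lines show a learned MACDA-specific entry at a lower address winning over a per-VLAN broadcast (masked-MACDA) entry, as discussed further below.

```python
# Software model of a TCAM/SRAM with local masks, an optional global mask, and
# a lowest-address-wins priority encoder.
class TcamSram:
    def __init__(self, width_bits=64):
        self.full = (1 << width_bits) - 1
        self.tag = []                   # (data, local_mask) in address (priority) order
        self.sram = []                  # result entry at the same index

    def write(self, index, data, local_mask, result):
        while len(self.tag) <= index:
            self.tag.append((0, 0))
            self.sram.append(None)
        self.tag[index] = (data, local_mask)
        self.sram[index] = result

    def search(self, key, global_mask=None):
        gmask = self.full if global_mask is None else global_mask
        for addr, (data, lmask) in enumerate(self.tag):   # priority encoder: lowest address wins
            if self.sram[addr] is not None and ((data ^ key) & lmask & gmask) == 0:
                return self.sram[addr]
        return None

VLAN_SHIFT, MACDA_BITS = 48, 48
def key(vlanid, macda):
    # Simplified 60-bit key: VLANid above a 48-bit MACDA (PathID/TableID omitted).
    return (vlanid << VLAN_SHIFT) | (macda & ((1 << MACDA_BITS) - 1))

t = TcamSram()
t.write(0, key(10, 0x001122334455), local_mask=t.full, result="unicast -> fabric port 3")
t.write(1, key(10, 0), local_mask=(0xFFF << VLAN_SHIFT), result="VLAN 10 broadcast, MGID 7")
print(t.search(key(10, 0x001122334455)))   # learned MACDA: non-broadcast entry wins
print(t.search(key(10, 0xDEADBEEF0000)))   # unlearned MACDA: falls through to broadcast entry
```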
Local mask values are required at least in part if the use of simultaneous prioritized hits to more than one entry is desired.
Local mask values are used to establish an arbitrary-MACDA low-priority broadcast entry for each VLAN.
For the broadcast entry, the data value includes the VLANid for the VLAN and a null MACDA, and the MACDA-related bits of the local mask are cleared to prevent the null MACDA field from participating in comparison with presented keys. If there are no other entries in the TCAM to match on the VLANid, then the multicast result (and included MGID) corresponding to the broadcast entry will be returned.
The broadcast entry is used to flood the frame being forwarded to all members of the VLAN.
This is accomplished by providing a multicast result that includes a Multicast Group ID (MGID) that has been assigned to the VLAN.
When a MACDA on a VLAN is learned, a higher-priority (lower addressed) MACDA-specific non-broadcast entry is created wherein the data value generally includes the VLANid for the VLAN and the learned MACDA, and the MACDA-related bits of the local mask are set to enable the MACDA field to participate in the comparison with presented keys. Subsequent searches using a key having both the VLANid and the MACDA will cause hits to both the non-broadcast and broadcast entries. Since the broadcast entry is created at a higher address, the TCAM's priority encoder only returns the MACDA-specific non-broadcast entry.
The TCAM illustrations are intentionally general to encompass a number of implementation options and variations.
The use of both local and global masks is allowed for, as are TCAM Tag-entry and SRAM result-entry organizations that support both IVL and SVL modes of operation.
The combination of local and global masks is illustrative and not limiting. Within a given implementation, either local masks or global masks could be eliminated with an associated reduction in the logic associated with the eliminated functionality. (However, elimination of local masks generally requires performing a subsequent broadcast lookup upon encountering a previously unlearned MACDA.)
The combination of IVL and SVL is merely illustrative and not limiting. One of either IVL or SVL could be eliminated with possibly associated reductions in certain fields within the TCAM Tag-entries and SRAM result-entries.
The TCAM illustrations are also described in terms of a single overall TCAM/SRAM combination. It will be understood that the overall TCAM/SRAM may be physically implemented using a plurality of smaller TCAM/SRAM primitives (i.e., smaller in entry-width, number of entries, or both) arranged in ways known to those of ordinary skill in the art to provide the desired number of entries and entry-width of the overall TCAM/SRAM.
“Lookup” 780 . 5includes a lookup in a TCAM/SRAM coupled to a VIOC (such as VIOC 301 . 5 coupled to TCAM/SRAM 403 . 5 ), as performed by LE 703 .
“Lookup” 790 . 17includes a lookup in the TCAM/SRAM. The lookup operations are performed by formulating a lookup key, optionally selecting a global mask register, and presenting the key and optional global mask value to the TCAM portion of the TCAM/SRAM. A result is then produced by reading the first matching entry (if any) as determined by the TCAM from the SRAM portion of the TCAM/SRAM.
the TCAM/SRAMis programmed according to various provisioning, switching, and routing functions, as described elsewhere herein.
Egress TCAM/SRAM keys, masks, and resultsare formulated to provide for transparent processing of various L 2 switching related activities, and selected L 3 switching and routing related activities.
the L 2 and L 3 operationsinclude multicast and unicast, with and without Link Aggregation Group (LAG) processing, and further include VLAN processing.
a lookup keyis formed without specific knowledge concerning the destination, other than the MAC Destination Address (MACDA). In other words, the key is formed in the same manner for multicast and unicast destinations. As described in more detail below, the lookup result provides information specifying the type of destination (multicast, unicast, or unicast LAG), along with information specific to the destination according to the destination type.
FIG. 8A illustrates selected aspects of an embodiment of an egress lookup key and result entries as TCAM/SRAM Egress Layout 800 A.
The egress layout is an example of a portion of a MAC Forwarding Information Base (MACFIB) implementation.
A 64-bit lookup key, Egress Key 801 , has several fields, including: Egress PathID 899 A (two bits), Egress TableID 898 A (two bits), VLANid 805 (12 bits), and MACDA 806 (48 bits).
The PathID is 0x0 and the TableID is 0x0.
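The field widths above (2 + 2 + 12 + 48 = 64 bits) suggest a straightforward packing, sketched below in Python. The exact bit ordering within Egress Key 801 is not stated in the text, so the most-significant-first layout here is an assumption made only for illustration.

```python
# Hypothetical packing of the 64-bit egress lookup key from its listed fields.
def egress_key(vlanid: int, macda: int, path_id: int = 0x0, table_id: int = 0x0) -> int:
    assert 0 <= vlanid < (1 << 12) and 0 <= macda < (1 << 48)
    key = path_id                      # Egress PathID, 2 bits (0x0 for egress)
    key = (key << 2) | table_id        # Egress TableID, 2 bits (0x0)
    key = (key << 12) | vlanid         # VLANid, 12 bits
    key = (key << 48) | macda          # MACDA, 48 bits
    return key                         # 64 bits total

print(hex(egress_key(vlanid=0x00A, macda=0x001122334455)))   # 0xa001122334455
```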
Embodiments implementing IVL formulate lookup keys including the VLANid associated with the source (such as a VNIC or NM port), such as VLAN Identifier 617 . 1 or a value derived from it, according to embodiment. If the VLANid and the MACDA of the key match to a TCAM entry having both the same VLANid and the same MACDA as the key, then a non-broadcast (unicast or multicast) entry has been identified and the corresponding SRAM result is retrieved.
In a first SVL embodiment, lookup keys and TCAM data array values are formed with a common predetermined value (such as all zeros) in the VLANid field of the key and in the corresponding VLANid field of the TCAM entries. Since the TCAM entries so programmed always match on the common predetermined VLANid, forwarding entries learned for one VLAN are accessible by other VLANs. If no entry matches on the common predetermined VLANid, then the MACDA has yet to be learned, and some subsequent means must be used to broadcast over the VLAN. The previously discussed catch-all broadcast entry (wherein the MACDA field is masked) could still be used, being looked-up using a key with the VLANid of the source VLAN in a follow-up TCAM search.
A second SVL embodiment uses two global mask registers.
Lookup keys are formed with the MACDA specified by the source, but with arbitrary values in the VLANid portion of the key.
In the first global mask register, mask bit locations corresponding to the VLANid bits are cleared to remove the VLANid bits of the data array from participation in the TCAM search, while mask bit locations corresponding to the MACDA bits are set to ensure their participation in the search.
TCAM entries match on the MACDA of the key regardless of the VLANid of the key, and forwarding entries learned for one VLAN are accessible by other VLANs. If no entry matches, then the MACDA has yet to be learned.
A second lookup is performed, this time with a key having the VLANid of the source and arbitrary values in the MACDA portion of the key.
A second global mask register is used, this time having the mask bit locations corresponding to the VLANid set and the mask bit locations corresponding to the MACDA bits cleared.
The VLANid field is not strictly required in the lookup key for SVL-only implementations. However, if the VLANid field is not present in the TCAM, then in order to implement broadcast when the MACDA is yet to be learned by the TCAM, the VLANid to broadcast address mapping must be resolved using other logic.
A third SVL embodiment uses local masks. For non-broadcast entries, the mask bit locations in the local mask array corresponding to the VLANid field are cleared to remove the VLANid bits of the data array from participation in the TCAM search (the mask bit locations in the local mask array corresponding to the MACDA field are set). As before, for previously learned MACDAs, TCAM entries match on the MACDA of the key regardless of the VLANid of the key, and forwarding entries learned for one VLAN are accessible by other VLANs. A broadcast entry as described above, which reverses the local masking between the VLANid field and the MACDA field, would within the same lookup still provide the broadcast match if the MACDA has yet to be learned.
the MACDAis a value provided in the packet (or message) header included in the buffer image formed in the Local Processor Memory, or the MACDA is derived in part from the header. Since the destination may be one of several types, the format of the 64-bit result returned varies accordingly, including: multicast, unicast, and unicast LAG formats, as illustrated by Multicast Result 802 , Unicast Result 803 , and Unicast (LAG) Result 804 , respectively.
Multicast Result 802has several fields, including: Multicast Bit (Multicast) 807 .M (one bit), VLANid (Multicast) 808 .M (12 bits), and Egress MGID 809 (16 bits).
the multicast bitis asserted if the result corresponds to a multicast destination, and deasserted otherwise. As this is a multicast result, the multicast bit is asserted.
the VLANididentifies the VLAN membership of the destination.
the MGIDidentifies the destination multicast group, and may be associated with a VLAN broadcast group or an IP broadcast group. Subsequent processing uses the MGID to replicate information to one or more destinations, as determined by the identified multicast group.
VLANid fieldis not strictly required in the multicast, unicast, or LAG results of IVL-only implementations.
TCAM matchesand subsequent result retrievals are predicated upon the destination residing within the VLAN of the source, thus checking the VLANid field of the result is superfluous.
Unicast Result 803has several fields, including: Multicast Bit (Unicast) 807 .U (one bit), LAG Bit (Unicast) 810 .U (one bit), VLANid (Unicast) 808 .U (12 bits), DstFabAddr 811 (eight bits), DstSubAddr 812 (four bits), Egress DstQ 813 (four bits), and DstLFIFID (Unicast) 814 .U (12 bits).
the multicast bitis deasserted to indicate the result is a unicast result.
the LAG bitis deasserted to indicate the result is not a LAG result.
VLANid (Unicast) 808 .Uis identical in format and function to VLANid (Multicast) 808 .M.
DstFabAddr 811identifies the destination fabric port address (typically associated with a slot having an inserted pluggable module).
DstSubAddr 812identifies a sub-address distinguishing one of a plurality of sub-destinations associated with the destination fabric port address.
DstSubAddr 812typically identifies either a) one of the plurality of VNICs implemented in the VIOC at the destination, or b) one of the plurality of network ports of a multi-ported NM.
DstQ 813typically identifies a packet receive queue associated with the identified VNIC.
DstLIFIDsare typically global, and may be used by software or hardware components (such as NMs), according to embodiment.
DstLIFID (Unicast) 814 .Uis a DstLIFID associated with the destination identified by MACDA 806 .
LAG Result 804 has several fields, including: Multicast Bit (LAG) 807 .UL (one bit), LAG Bit (LAG) 810 .UL (one bit), VLANid (LAG) 808 .UL (12 bits), LagID 815 (eight bits), and DstLIFID (LAG) 814 .UL (16 bits).
The multicast bit is deasserted to indicate the result is a unicast result.
The LAG bit is asserted to indicate the result is a LAG result.
VLANid (LAG) 808 .UL is identical in format and function to VLANid (Multicast) 808 .M.
LagID 815 identifies the LAG the destination is associated with, to enable load balancing, failover, and other related operations with respect to the identified LAG.
DstLIFID (LAG) 814 .UL is identical in format and function to DstLIFID (Unicast) 814 .U.
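The three 64-bit result formats above are distinguished by the multicast and LAG bits, as the Python sketch below shows. The field offsets used here are assumptions (the text gives field widths, not bit positions); only the decode structure (multicast vs. unicast vs. unicast-LAG) follows the description.

```python
# Hypothetical decoding of a 64-bit egress lookup result; offsets are assumed.
def decode_egress_result(result: int) -> dict:
    bits = lambda lo, width: (result >> lo) & ((1 << width) - 1)
    if bits(63, 1):                    # Multicast Bit asserted -> Multicast Result 802
        return {"type": "multicast", "vlanid": bits(50, 12), "mgid": bits(34, 16)}
    if bits(62, 1):                    # LAG Bit asserted -> Unicast (LAG) Result 804
        return {"type": "unicast-lag", "vlanid": bits(50, 12),
                "lag_id": bits(42, 8), "dst_lifid": bits(26, 16)}
    return {"type": "unicast", "vlanid": bits(50, 12),        # Unicast Result 803
            "dst_fab_addr": bits(42, 8), "dst_sub_addr": bits(38, 4),
            "dst_q": bits(34, 4), "dst_lifid": bits(22, 12)}

example = (1 << 63) | (0x00A << 50) | (0x0007 << 34)   # multicast, VLAN 10, MGID 7
print(decode_egress_result(example))
```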
Subsequent processing of Unicast Result 803 and Unicast (LAG) Result 804provides portions of the lookup result, or information derived in part from the lookup result, for use in forming selected egress cell headers.
various combinations of DstFabAddr 811 , DstSubAddr 812 , Egress DstQ 813 , and DstLIFID (Unicast) 814 .Uare included in selected cell headers formulated during data transmission, according to embodiment.
various combinations of LagID 815 and DstLIFID (LAG) 814 .ULare included in selected cell headers during data transmission, according to embodiment.
Providing destination information such as the destination fabric port address, sub-address (or VNIC identifier), destination queue, and destination logical interface in the lookup resultenables transparent L 2 and selected L 3 operations with respect to processes sending data.
the sending processesaddress data by MACDA, and are not aware of multicast, unicast, and LAG properties associated with the destination.
the lookup destination informationfurther enables transparent management of bandwidth and other related resources by agents other than the Local Processor process sending the data. Such agents include management, controlplane, and load balancing processes executing elsewhere.
FIG. 8Billustrates selected aspects of an embodiment of an ingress lookup key and result entry, as TCAM/SRAM Ingress Layout 800 B.
the illustrated ingress layout embodimentis an example of an implementation of an MGID table.
a 64-bit lookup key, Ingress Key 821has several fields, including: Ingress PathID 899 B (two bits), Ingress TableID 898 B (two bits), Ingress Mask 823 (44 bits), Multicast Key Bit 897 (one bit), and Ingress MGID 824 (15 bits).
the PathIDis 0x1 and the TableID is 0x0.
the same size keyis used for both ingress and egress searches.
the TCAMoperates identically for ingress and egress searches, comparing each presented key in parallel with all of the stored data values as masked by the local mask values and the global mask value, as described previously.
the PathID and TableID bitsare commonly laid out between the ingress and egress entries. These bits participate in the TCAM comparisons, allowing if so desired the ingress and egress entries, and entries from multiple tables, to be co-mingled in the TCAM while remaining logically distinct subsets.
Ingress searchesonly require the 16 bits corresponding to the Ingress MGID 824 bits and the Multicast Key Bit 897 .
Multicast Key Bit 897is asserted to indicate the lookup is a multicast type search.
Ingress MGID 824is directly from an MGID field included in the received cell header, or is derived from the header, according to embodiment.
the remaining 44 bit positions of the common key layout, denoted by Ingress Mask 823are null place-holders, being neither required nor logically used on ingress lookups.
The mask bit locations within the local mask array corresponding to Ingress Mask 823 are cleared to ensure that the bit locations within the data array corresponding to Ingress Mask 823 do not participate in ingress searches.
the corresponding mask bits within a global mask registerare cleared to accomplish the same result.
Ingress Result 822is 64 bits and has several fields, including: Ingress DstQ 825 (four bits) and VNIC BitMask 826 (16 bits).
Ingress DstQ 825identifies one of a plurality of multicast packet receive queues for insertion of the received data into (see the discussion relating to FIG. 6D ).
VNIC BitMask 826identifies destination VNICs for replication of the received data. Typically there is a one-to-one correspondence between asserted bits in VNIC BitMask 826 and VNICs that are to receive the multicast data.
Egress PathID 899 A and Ingress PathID 899 Bare arranged to be in the same location in the egress and ingress lookup keys, respectively.
An egress path lookupis identified by the value 0x0 and an ingress path lookup is identified by the value 0x1, thus enabling selected embodiments to include egress and ingress lookup information in a shared TCAM/SRAM (such as TCAM/SRAM 403 . 4 ).
Other embodimentsmay provide separate TCAM/SRAM resources for egress and ingress processing.
Egress TableID 898 A and Ingress TableID 898 Bare in the same location and may be used to specify one of a plurality of tables to facilitate additional lookup information for use in other scenarios.
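The following sketch shows one plausible way to assemble the 64-bit ingress multicast key and to consume the ingress result described above. The field widths (2 + 2 + 44 + 1 + 15 bits) are from the text, but the exact bit positions, the helper names, and the replication loop are illustrative assumptions.

```c
/* Hedged sketch: ingress multicast key construction and VNIC BitMask walk.
 * Bit placement is assumed, not taken from the actual hardware layout. */
#include <stdint.h>
#include <stdio.h>

#define INGRESS_PATH_ID  0x1u          /* ingress path lookups            */
#define INGRESS_TABLE_ID 0x0u          /* MGID table                      */

/* Assumed placement: PathID/TableID in the top bits (common to egress and
 * ingress layouts), Multicast Key Bit and MGID in the low 16 bits, and the
 * 44 null place-holder bits (Ingress Mask 823) in between, masked out of
 * the compare. */
static uint64_t make_ingress_key(uint16_t mgid /* 15 bits */)
{
    uint64_t key = 0;
    key |= (uint64_t)INGRESS_PATH_ID  << 62;
    key |= (uint64_t)INGRESS_TABLE_ID << 60;
    key |= 1ull << 15;                        /* Multicast Key Bit asserted */
    key |= (uint64_t)(mgid & 0x7FFF);         /* Ingress MGID               */
    return key;
}

typedef struct {                /* Ingress Result 822 (fields per the text) */
    uint8_t  dst_q;             /* Ingress DstQ 825 (4 bits used)           */
    uint16_t vnic_bitmask;      /* VNIC BitMask 826                         */
} ingress_result_t;

static void replicate_to_vnics(const ingress_result_t *r)
{
    for (int vnic = 0; vnic < 16; vnic++)
        if (r->vnic_bitmask & (1u << vnic))
            printf("replicate to VNIC %d via receive queue %u\n",
                   vnic, r->dst_q);
}

int main(void)
{
    ingress_result_t r = { .dst_q = 2, .vnic_bitmask = 0x0005 };
    printf("key = 0x%016llx\n", (unsigned long long)make_ingress_key(42));
    replicate_to_vnics(&r);
    return 0;
}
```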
An ES system provides one or more provisioned servers in conjunction with a provisioned L 2 /L 3 switch and associated network topology.
Each of the provisioned servers and the provisioned switch and network include capabilities as identified by a corresponding set of specifications and attributes, according to various embodiments.
Specifications (or constraints) and attributes are specified with an SCF (see the SCF and Related Tasks section, elsewhere herein).
An ES system may be provisioned into a wide range of server combinations according to needed processing and I/O capabilities.
Servers may be provisioned according to various illustrative application usage scenarios described elsewhere herein, including: a Unix server, an I/O intensive server, a data-center tier-consolidation server, and an enhanced high availability server.
Each of these servers may include distinct compute, storage, and networking performance.
Provisioned servers may be managed similarly to conventional servers, including operations such as booting and shutting down (see the server operational states section, elsewhere herein).
An ES system may also be provisioned to configure a wide range of networking capabilities and logical switch topologies within the ES system (i.e., internal to the chassis of the ES system).
The networking may be provisioned such that a logical L 2 /L 3 switch provides L 2 and L 3 forwarding for VNICs of provisioned servers within the ES system and other network interfaces external to the ES system.
Any of the logical ports of the L 2 /L 3 switch may be configured to be part of a VLAN, and multiple simultaneous VLANs are possible.
A provisioned server may optionally be provided with a dedicated (a.k.a. “pinned”) network port for direct non-switched coupling to an external network.
Another option is the implementation of one or more LAGs, where multiple physical network ports are aggregated to form one logical port of the L 2 /L 3 switch.
FIG. 9A illustrates a Hardware Resources View 900 A of an embodiment of an ES system.
Provisioned servers and an associated network and switch complex are formed by assigning hardware resources from a collection of available hardware resources (such as any or all of the elements of Hardware Resources View 900 A) and then programming configuration and management state associated with the assigned hardware resources.
FIG. 9B illustrates a Provisioned Servers and Switch View 900 B of an embodiment of an ES system, and conceptually represents the result of provisioning several servers and network functionality from the hardware elements of Hardware Resources View 900 A.
Hardware Resources View 900 A illustrates one embodiment having a Primary SFM 180 A, a Primary SCM 140 A, a first NM 130 A, a second NM 130 B, a first PMM 150 A, and a second PMM 150 B.
NM 130 A and NM 130 B each provide a plurality of network ports for interfacing with networks external to the ES system, and further adapt those ports to couple with the cell-based Primary SFM 180 A.
The ports of NM 130 A are coupled to the Internet 910 while the ports of NM 130 B are coupled to Data Center Network 920 . It will be understood that this configuration is merely illustrative and the ES system may be configured for use with a wide range of external network topologies.
Each NM has an associated fabric address on the Primary SFM 180 A and each network port of the NMs has an associated fabric sub-address.
Each PMM has two 2-way SMP Processing Complexes that can be optionally coupled (via configurable couplings 990 A and 990 B respectively) as a single 4-way SMP Processing Complex.
Each of couplings 990 A and 990 B represents coupling pair 460 . 5 X and 460 . 5 Y of FIG. 4B .
Couplings 179 A and 179 B may be configured in conjunction with CSFI 170 to couple the two PMMs together as a single 8-way SMP Processing Complex.
In the illustrated scenario, 2-way and 4-way configurations have been chosen, and thus couplings 179 A and 179 B are configured as inactive (as suggested by dashed lines, rather than solid lines).
PMM 150 A is configured such that coupling 990 A is not active (as suggested by dashed lines), facilitating the formation of 2-way SMP Processing Complex 901 . 2 and 2-way SMP Processing Complex 901 . 1 .
PMM 150 B is configured such that coupling 990 B is active (as suggested by solid lines), facilitating the formation of 4-way SMP Processing Complex 901 . 0 . It will be understood that since the two PMMs are physically identical, the illustrated scenario is an arbitrary choice. Both PMMs can just as easily be configured in any combination of 2-way and 4-way SMP Processing Complexes (up to four 2-ways, one 4-way and up to two 2-ways, up to two 4-ways, or one 8-way). In this way, the ES system provides for an easily scalable number of SMP processor-ways from which to configure physical partitions, which are the basis for provisioned servers.
Each 2-way SMP Processing Complex is associated with a VIOC that adapts the Complex to the cell-based Primary SFM 180 A. While in FIG. 9A each VIOC is represented by only one VNIC, in one illustrative embodiment each VIOC includes 16 VNICs. Each VIOC has an associated fabric address on the Primary SFM 180 A and each VNIC has an associated fabric sub-address. In conjunction with appropriate device drivers, each VNIC appears to the operating software on the SMP Complex as a high-performance Ethernet compatible NIC. Each VNIC can be selectively enabled, thus any subset of the 16 VNICs may be provisioned for use by the associated 2-way SMP Processing Complex. In this way, the ES system provides easily scalable virtualized I/O services to the provisioned servers.
2-way, 4-way, and 8-way physical partitions are envisioned that make use of the 2-way SMP Processing Complex and its associated VIOC and VNICs as an underlying primitive.
Each of up to four 2-way physical partitions consists of a 2-way SMP Processing Complex and up to 16 VNICs.
Each of up to two 4-way physical partitions consists of two coupled 2-way SMP Processing Complexes and up to 32 VNICs (up to 16 VNICs in each of two groups).
An 8-way physical partition consists of four coupled 2-way SMP Processing Complexes and up to 64 VNICs (up to 16 VNICs in each of four groups).
In the illustrated scenario, physical partition P 1 201 consists of 2-way SMP Processing Complex 901 . 2 and VNIC 154 A. 1 .
Physical partition P 2 202 consists of 2-way SMP Processing Complex 901 . 1 and VNIC 154 A′. 1 .
Physical partition P 3 203 consists of 4-way SMP Processing Complex 901 . 0 and VNICs 154 B. 1 and 154 B′. 1 .
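A small sketch of the partition arithmetic just described: each N-way physical partition is built from N/2 of the 2-way primitives, each contributing a VIOC with up to 16 VNICs. The constant and function names are illustrative only.

```c
/* Illustrative only: an N-way physical partition (N = 2, 4, or 8) is
 * composed of N/2 2-way SMP Processing Complexes, each with one VIOC
 * providing up to 16 VNICs. */
#include <stdio.h>

#define VNICS_PER_VIOC 16

static int max_vnics_for_partition(int ways)      /* ways: 2, 4, or 8 */
{
    int complexes = ways / 2;                      /* 2-way primitives */
    return complexes * VNICS_PER_VIOC;
}

int main(void)
{
    for (int ways = 2; ways <= 8; ways *= 2)
        printf("%d-way partition: up to %d VNICs\n",
               ways, max_vnics_for_partition(ways));
    return 0;
}
```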
The fabric infrastructure of FIG. 9A is not seen by the software operating on the SMP Complexes or by external sub-systems. These entities need not be concerned with (and in fact are isolated from) knowledge about the fabric addresses and fabric sub-addresses.
Software and external sub-systems operate conventionally, communicating through packets and messages over Ethernet frames using either Ethernet-style MACDAs or IP addressing.
The SCM 140 A maintains separate master L 2 and L 3 FIBs (an L 2 FIB and a separate L 3 FIB).
The L 2 FIB maps Ethernet-style MACDAs to fabric addresses and sub-addresses, and the L 3 FIB maps destination IP addresses to fabric addresses and sub-addresses.
The L 2 and L 3 FIBs are initialized and updated with mappings for the provisioned hardware resources internal to the ES system, and other mappings are learned over time based on network traffic processed.
Each NM and VIOC makes L 2 forwarding decisions for packets they receive (via the network ports of the NM and via the SMP Complex associated with each VIOC) by referencing a local copy of the L 2 FIB. That is, each NM and VIOC does a local L 2 FIB lookup on the MACDA of each packet received (packets heading toward the SFM) and determines the fabric address and sub-address within the ES system where the packet should be delivered.
The NM or VIOC then provides a fabric-addressed cellified version of the packet to the Primary SFM 180 A, which delivers the cellified packet to the module specified by the fabric address (PMM 150 A, PMM 150 B, NM 130 A, or NM 130 B). In turn, each module delivers the reconstructed packet to the network port or VNIC specified by the fabric sub-address.
The packets are replicated as required both prior to fabric insertion (for each fabric address in the multicast group) and after fabric egress (for each fabric sub-address in the multicast group).
The multicast to fabric sub-address mapping is determined via multicast group lookups in the L 2 FIBs.
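The forwarding decision described above may be sketched as follows. The entry layout and the linear search are simplifications for illustration; the actual lookup uses the TCAM/SRAM key and result formats of FIGS. 8A and 8B.

```c
/* Illustrative sketch of the local L2 FIB decision each NM or VIOC makes on
 * a packet heading toward the SFM: map the MACDA to a fabric address
 * (module) and fabric sub-address (network port or VNIC). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t  macda[6];        /* Ethernet-style MAC destination address   */
    uint8_t  fabric_addr;     /* module on the Primary SFM                */
    uint16_t fabric_subaddr;  /* NM port or VNIC within that module       */
} l2_fib_entry_t;

static bool l2_fib_lookup(const l2_fib_entry_t *fib, int n,
                          const uint8_t macda[6],
                          uint8_t *fab, uint16_t *sub)
{
    for (int i = 0; i < n; i++) {
        if (memcmp(fib[i].macda, macda, 6) == 0) {
            *fab = fib[i].fabric_addr;     /* cellify and address the SFM  */
            *sub = fib[i].fabric_subaddr;  /* delivered after fabric egress */
            return true;
        }
    }
    return false;   /* unknown MACDA: handled by flooding/broadcast paths */
}
```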
Provisioned Servers and Switch View 900 B illustrates three example resultant provisioned servers as Provisioned Server PS 2 902 . 2 , Provisioned Server PS 1 902 . 1 , and Provisioned Server PS 0 902 . 0 . These correspond respectively to physical partition P 1 201 , physical partition P 2 202 , and physical partition P 3 203 , of the Hardware Resources View 900 A of FIG. 9A .
Each provisioned server includes compute and I/O resources.
Provisioned Server PS 1 902 . 1 includes 2-way SMP Processing Complex 901 . 1 and network connectivity via NIC Function 903 . 1 .
Provisioned Server PS 0 902 . 0 includes 4-way SMP Processing Complex 901 . 0 and network connectivity via NIC Function 903 . 0 A and NIC Function 903 . 0 B.
The network ports and VNICs readily lend themselves to the logical view (of FIG. 9B ) that they are coupled to the ports of an L 2 switch (exemplified by L 2 /L 3 Switch 930 ).
Switch 930 provides selective isolation between the logical network segments coupled to each switch port. Switch 930 forwards packets between the switch ports (network segments) only when warranted to reach a known MACDA on an otherwise isolated segment (or when the MACDA is unknown and thus can only be reached via a broadcast on the otherwise isolated segments).
Switch 930 provides its selective isolation/forwarding functionality to resources both within and external to the ES system. For example, packets originating in Data Center Network 920 are selectively isolated-from/forwarded-to destinations associated with Internet 910 , Provisioned Server PS 1 902 . 1 , and Provisioned Server PS 0 902 . 0 , as warranted by the specified MACDA. Unicast packets exchanged solely between two end-point nodes are not observable by other nodes either inside or outside the ES system.
VLANs are a common networking administration tool to selectively isolate or couple multiple network segments for collaboration, security, and performance motives.
A common VLAN implementation is the so-called port-based VLAN, wherein each logical switch-port is defined to be associated with a particular VLAN.
Switch ports (and associated network segments) belonging to the same VLAN are logically coupled for forwarding and broadcast events (they are part of the same broadcast domain), while switch ports (and associated network segments) belonging to different VLANs are L 2 isolated for all events (they are in different broadcast domains).
The L 2 FIB lookup architecture of the ES system has integral support for port-based VLANs.
Port-based VLANs are defined by associating each of the switch-ports of Switch 930 with a particular VLAN.
SCM 140 A has a VLAN Manager that associates each NM port and each VNIC with a particular VLAN.
Each NM port and VNIC in the system is by default a member of the VLAN known as VLAN 1 .
In the illustrated scenario, a subset of the switch-ports of Switch 930 are expressly associated with VLAN 905 (a different VLAN than VLAN 1 ).
Those switch-ports associated with interconnects 937 , 938 , and 940 are members of VLAN 905 .
Switch ports and associated network segments within VLAN 1 are selectively coupled for forwarding and broadcast events as warranted.
Network segments (switch ports) within VLAN 905 are similarly selectively coupled for forwarding and broadcast events as warranted. From a strict L 2 view, network segments within VLAN 1 are never coupled to network segments within VLAN 905 . In effect, the two VLANs function as though each were topographically segregated, including having two separate switches (one for each VLAN).
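A minimal sketch of the port-based VLAN rule described above: two switch ports may exchange L 2 traffic only when they are members of the same VLAN. The port-to-VLAN table is a made-up example of the kind of state the VLAN Manager programs, not the actual data structure.

```c
/* Illustrative sketch: port-based VLAN isolation. Ports on the same VLAN
 * share a broadcast domain; ports on different VLANs are L2 isolated. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_SWITCH_PORTS 8

/* Example assignment: default VLAN 1, with some ports moved to VLAN 905. */
static const uint16_t port_vlan[NUM_SWITCH_PORTS] = {
    1, 1, 1, 1, 905, 905, 905, 1
};

static bool may_forward(int src_port, int dst_port)
{
    return port_vlan[src_port] == port_vlan[dst_port];
}
```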
Provisioned Server PS 2 902 . 2 includes 2-way SMP Processing Complex 901 . 2 and network connectivity as illustrated by (virtualized) NIC Function 903 . 2 .
VNIC 154 A. 1 is “pinned” to the port of NM 130 A associated with interconnect 131 . That is, the port of NM 130 A associated with interconnect 131 has been dedicated to data traffic to and from VNIC 154 A. 1 . More specifically, data traffic coming into the dedicated network port goes only to VNIC 154 A. 1 and data traffic coming from VNIC 154 A. 1 goes only to the dedicated network port.
This is reflected in FIG. 9B , where NIC 903 . 2 appears to be coupled directly to Internet 910 via interconnect 931 , without use of L 2 /L 3 Switch 930 .
Provisioned servers may include OLB and FCM resources (not shown explicitly in FIGS. 9A and 9B ).
OLB and FCM allocation granularity is by number, bandwidth capacity, and relative queuing priorities of VNICs implemented in VIOCs included on OLBs and FCMs.
A first level of resource partitioning between provisioned servers may be accomplished by allocating appropriate numbers of VNICs to each of the provisioned servers (a greater number of VNICs generally corresponding to larger allocations).
For example, a first provisioned server may be allocated a single first VNIC of an OLB, while a second provisioned server may be allocated second, third, and fourth VNICs of the OLB.
VNICs implement various combinations of minimum and maximum bandwidth, providing a second level of resource control, such that the first VNIC may be allocated a first maximum bandwidth and the second, third, and fourth VNICs may be allocated respective second, third, and fourth maximum bandwidths.
A third level of resource sharing control may be implemented by proper variation of VNIC queuing priorities, as described elsewhere herein.
FCM resource partitioning may be accomplished in the same manner, by assigning proper numbers, bandwidth capacities, and relative queuing priorities of VNICs implemented by VIOCs on FCMs.
Provisioned servers are logically isolated from each other.
Each provisioned server may be characterized by a group of VNICs corresponding to resources allocated to the server.
By assigning each group of VNICs to distinct VLANs (at least one VLAN per server), each of the provisioned servers remains completely isolated from the others, even though multiple servers may be using resources from the same module (such as an OLB or FCM).
For example, first and second provisioned servers may be using a shared OLB via respective first and second VNICs. If the first and second VNICs are assigned respectively to distinct first and second VLANs, then the provisioned servers are isolated, even though both are using the shared OLB. Similar isolation may be provided when provisioned servers share an FCM, by associating the groups of VNICs with distinct VLANs.
Link aggregation provides a way to linearly scale connectivity bandwidth and also offers reliability and availability benefits.
The L 2 FIB lookup architecture of the ES system supports link aggregation.
The NM 130 A ports associated with interconnects 133 and 134 in FIG. 9A are aggregated to form one logical port (with double the bandwidth), illustrated in FIG. 9B as LAG 950 . While not expressly illustrated, link aggregation across multiple NMs is also possible.
Load balancing processing is performed in conjunction with the L 2 FIB lookup architecture to distribute the traffic quasi-evenly over the ports that comprise the aggregate.
NM ports that are aggregated must be of the same media type.
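The load balancing over LAG members may be sketched as a flow hash over the Ethernet addresses, so traffic is spread quasi-evenly while any given flow stays on one member port. The hash choice (FNV-1a) and the structure below are illustrative assumptions, not the actual implementation.

```c
/* Hedged sketch: select one physical member port of a LAG via a flow hash
 * over the source and destination MAC addresses. Assumes num_members > 0. */
#include <stdint.h>

typedef struct {
    int      num_members;
    uint16_t member_port[8];   /* fabric sub-addresses of the NM ports */
} lag_t;

static uint16_t lag_select_port(const lag_t *lag,
                                const uint8_t macsa[6], const uint8_t macda[6])
{
    uint32_t h = 2166136261u;                        /* FNV-1a over SA/DA */
    for (int i = 0; i < 6; i++) h = (h ^ macsa[i]) * 16777619u;
    for (int i = 0; i < 6; i++) h = (h ^ macda[i]) * 16777619u;
    return lag->member_port[h % (uint32_t)lag->num_members];
}
```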
FIG. 9C illustrates an operational view of selected aspects of various system control and system management SW and processes (including provisioning) in an embodiment of an ES system.
Many provisioning functions store, access, and communicate data using abstractions and representations derived from the Common Information Model (CIM), as described by CIM Schema v2.8.2 from http://www.dmtf.org, herein incorporated by reference for all purposes, for example.
Selected command and result communications use abstractions and representations derived from the Common Object Request Broker Architecture (CORBA), as described by CORBA: Core Specification Version 3.0.3, March 2004, from http://www.omg.org, herein incorporated by reference for all purposes, for example.
SW layers (Platform Manager 531 and Enterprise Manager 530 of FIG. 5B ) conceptually surround HW core ES 1 110 A ( FIG. 1A ).
Platform Manager 531 is representative of not only Platform Manager 531 but also the SW modules operating in close cooperation with the Platform Manager.
Such modules may include CLI 532 , CM 533 , Query Engine 534 , Repository Manager 535 , VIOC Manager 536 , Interface Manager 537 , L 2 FDB Manager 538 , VLAN Manager 539 , and Other Management Code 540 (see FIG. 5B ), according to embodiment.
The SW layers execute using resources included in the HW core (such as Primary SCM 140 A), and provide system-level control and management of the HW core.
Persistent state relating to these operations is maintained in CIM-compatible formats in two repositories, Primary CIM Instance Repository 917 .P and Secondary CIM Instance Repository 917 .S.
Non-volatile storage for these repositories may be included in various elements of Flash memory, Magnetic Disk memory, and Optical Disk memory, according to embodiment, and interfaced to various elements of ES 1 110 A (such as SCMs, PMMs, or OLBs), or directly included in such elements, also according to embodiment.
In some embodiments, the secondary repository is not specifically stored in persistent storage, but is stored as data structures in portions of processor main memory (such as DRAM and cache memory), although OS-related paging operations may swap memory pages including portions of the repository to disk.
In other embodiments, the secondary repository is not present, and accesses to the secondary repository are replaced with accesses to the primary repository.
Enterprise Manager 530 provides an interface for client GUIs (such as GUI 914 . 1 and GUI 914 . 2 ) and maintains Secondary CIM Instance Repository 917 .S.
Platform Manager 531 provides an interface for remote shell window CLIs (such as CLI 911 . 1 and CLI 911 . 2 ), and maintains Primary CIM Instance Repository 917 .P.
The Enterprise and Platform Managers cooperate to keep the Primary and Secondary Repositories synchronized. Communication between Client GUIs and remote shell CLIs is generally via CIM and CORBA standard representations (or similar techniques), as shown by CIM Communication 1 916 . 1 , CIM Communication 2 916 . 2 , CORBA Communication 1 916 . 3 , and CORBA Communication 2 916 . 4 .
GUIs enable performance of various system management and configuration control operations by system management personnel, including various aspects of HW and SW operation.
GUIs are provided via network-based Java clients (such as Client 1 913 . 1 and Client 2 913 . 2 ) executing on PCs, Workstations, or other similar computing elements.
The clients include interface functions (such as Interface 915 . 1 and Interface 915 . 2 ) to facilitate processing of GUI commands and display of GUI data, as well as communication of commands and data.
GUIs generally operate on managed objects (such as provisioned servers), and typically independently maintain state information about the managed objects (i.e., the GUIs are “stateful”).
GUIs update in real time as status of managed objects changes in real time.
Communications between the GUIs and the Enterprise Manager may be transported via any combination of WAN (including the Internet), MAN, LAN, or a direct connection to any compatible communication interface provided by ES 1 110 A, according to various embodiments.
Communications between GUIs and the Enterprise Manager may be coupled via an Ethernet port provided by one of NMs 130 , or by Management I/O 412 ( FIG. 4A ), according to embodiment.
CLIs enable all or any subset of the system management and configuration control operations available via GUIs, according to embodiment. In some embodiments, CLIs enable somewhat limited functionality with respect to the GUIs.
CLIs are typically provided via network-based text-oriented command shells (such as Shell 1 910 . 1 and Shell 2 910 . 2 ) executing on PCs, Workstations, or other similar computing elements.
The shells and related SW include interface functions (such as Interface 912 . 1 and Interface 912 . 2 ) similar in operation to the client interface functions.
CLIs are typically stateless, relying on the Platform Manager 531 to manage objects on their behalf.
The CLIs send commands to the Platform Manager for translation into operations on managed objects (such as provisioned servers).
Communications between the CLIs and the Platform Manager may be transported via any of the mechanisms provided for the communications between the clients and the Enterprise Manager.
CLIs, low-level services supporting CLIs, or both, are provided by SW executing on resources of ES 1 110 A, such as CLI 532 ( FIG. 5B ) executing on Primary SCM 140 A.
Multiple GUI sessions may be simultaneously active and in communication with the Enterprise Manager, receiving data and providing commands in real time.
The Enterprise Manager updates Secondary CIM Instance Repository 917 .S according to the commands received from all of the GUI sessions.
Multiple CLI sessions may be simultaneously active and in communication with the Platform Manager, receiving data and providing commands in real time.
The Platform Manager updates Primary CIM Instance Repository 917 .P according to the commands received from all of the CLI sessions.
Any number of GUI sessions and any number of CLI sessions may also be active concurrently, and the Enterprise and Platform Managers receive and process the respective commands, synchronizing the two repositories as necessary.
GUIs and CLIs provide an integrated view of processing and networking functions available in an ES system.
GUIs and CLIs may also provide a “legacy” view of logically separate elements, including switches, routers, accelerators for Layer- 4 to Layer- 7 processing (such as SSL accelerators), management servers, and enterprise servers.
GUIs and CLIs providing integrated and legacy views may be operated simultaneously.
Functions performed by Platform Manager 531 (and closely cooperating SW modules) include HW and SW inventory discovery and maintenance, SW services relating to internal subsystem management and RAS, networking services, low-level user interface services, and component level services.
A single Application Programming Interface (API) is provided by Platform Manager 531 to enable accessing the aforementioned functions by other SW modules (such as Enterprise Manager 530 and CLI 532 , for example).
HW and SW inventory discovery functions include any combination of several elements, according to various embodiments.
A chassis manager (such as CM 533 of FIG. 5B ) discovers and manages chassis resources, a query engine (such as Query Engine 534 ) processes queries, and a persistent state manager (such as Repository Manager 535 ) records and provides system configuration, status, and query information.
A VIOC manager (such as VIOC Manager 536 ) provides control information directly to VIOCs via the SFM, and indirectly interfaces to VIOC Drivers via the scratchpad registers included in the VIOC. A write to any of the scratchpad registers typically results in an interrupt being delivered to a VIOC Driver.
An interface manager (such as Interface Manager 537 ) discovers interfaces on NMs and detects changes in state on NMs.
A VLAN Manager (such as VLAN Manager 539 ) provides services and operations relating to VLANs, such as some aspects of provisioning VLANs in relation to VIOCs.
An event service includes a general publish and subscribe message layer, and an alarm service enables management processes to set and clear alarms.
A software versioning and upgrade service enables management of binary software releases.
Internal subsystem management SW services include, according to various embodiments, an Interface Definition Language (IDL) based communication infrastructure for use between various system components.
The IDL-based infrastructure is also used for process management and monitoring (via SNMP, for example) of system components, services, and applications, and also for information logging from one or more processors.
Further internal subsystem management SW services include: security services and virtualization services relating to modularity and ownership records of system components and resources; maintenance services relating to a persistent database to store configuration and other related information across system restarts and failures; a naming service serving name and location information to processes executing on resources inside a chassis, and to executing agents external to the chassis; an IPC communication framework and associated services for communication between management and controlplane processes; and a low-level CLI for accessing various platform manager functions.
RAS SW services include, according to various embodiments, state replication, quorum protocols, fast restart mechanisms, product validation techniques, support for in-service upgrades, and statistics and tracing collection and storage.
Platform manager functions further include, according to embodiment, an interface for Enterprise Manager 530 , a debugging infrastructure, a highly available process infrastructure (with disaster recovery), and various support functions relating to security, logging in, filtering, and secure communication.
Networking services include, according to various embodiments, L 2 and L 3 protocols and functions, such as those associated with management of FIB data and Routing Information Base (RIB) data, respectively.
Networking services further include selected Layer 4 and above services, and protocols and services relating to SNMP Management Information Base (MIB) data and SNMP agent support for external management systems.
Component level services include, according to various embodiments, services and frameworks relating to management of processing resources included on NMs, FCMs, OLBs (including system and user code), and PMMs.
An example of such a framework is a configuration framework to be used by CLIs, SNMP agents, the Enterprise Manager (via a GUI), and other similar mechanisms to deliver configuration information to components.
An example of such a service is a boot management service to provide support and boot image management for booting pluggable modules (such as PMMs, NMs, FCMs, and OLBs) in a system.
Enterprise Manager 530 functions performed include multi-chassis functions analogous to functions performed by the Platform Manager, selected complex provisioning operations, and interfacing to GUIs (described elsewhere herein).
The Enterprise Manager includes functions to integrate one or more ES systems into surrounding management infrastructure.
FIG. 10 illustrates a conceptual view of an embodiment of a Server Configuration File (SCF) and related SCF tasks.
SCF Tasks 1010 operate on SCF 1020 as illustrated by Interaction 1030 between SCF Tasks 1010 and SCF 1020 .
SCF 1020 serves to describe a desired server by specifying a list of required (or desired) resources, typically in the form of constraints.
A server specification, as illustrated by SCF 1020 , may include HardWare (HW) details such as HW Specifications 1021 , and SW details such as OS Specifications 1022 . Additionally, various HW and SW attributes and organization and operational details may also be included in SCF 1020 , as illustrated by Server Attributes 1023 .
SCF 1020 may be implemented as a human-readable text file or as a machine-readable binary file.
Text file implementations enable editing and viewing operations via standard text editors.
Binary file implementations enable editing and viewing operations via a specialized Command Line Interface (CLI) or a specialized GUI.
Text file SCF implementations may also provide CLI and GUI driven editing and viewing operations.
Tasks that may be performed on an SCF include creating an SCF, as illustrated by Create 1011 , and modifying an SCF (including editing), as illustrated by Modify 1012 .
An SCF may be explicitly locked to prevent inadvertent or unauthorized modifications, and explicitly unlocked to allow modifications, as illustrated by Lock/Unlock 1013 .
Viewing an SCF, as illustrated by View 1014 , enables examining the SCF to inspect various details (such as parameters in HW Specifications 1021 , OS Specifications 1022 , and Server Attributes 1023 ). In some embodiments, View 1014 may provide error checking operations to determine if the SCF is legal and is syntactically correct.
An SCF may be copied, as illustrated by Copy 1015 , moved (or renamed), as illustrated by Move 1016 , and removed, as illustrated by Remove 1017 .
HW Specifications 1021 may describe constraints in a logical manner (for example ‘CreateServer 2way SMP’) or in a physical manner (for example ‘CreateServer 2way SMP-slot 2’, referring specifically to HW inserted into slot 2 ).
The HW specifications may include a combination of logical and physical specifications.
The constraints are interpreted by default as minimums (i.e. ‘CreateServer 2way SMP’ may be satisfied by one or more 2-way SMPs), and may be specified explicitly to match exactly (for example ‘CreateServer 2way SMP-exact’ only allows for a single 2-way SMP).
Constraints for some types of resources may be described by a full or a partial specification.
The full specification is expressed by itemizing and fully specifying each individual resource (enumerating bandwidth, priority scheme selection and associated weights, and other similar parameters, for example).
The partial specification is expressed by listing the number of a particular type of resource required (each of the resources is assumed to require identical parameters, for example).
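The constraint interpretation described above (default minimum match versus explicit exact match, logical versus physical placement) may be sketched as follows; the structure and field names are illustrative assumptions, not an actual SCF schema.

```c
/* Illustrative sketch: does a candidate SMP complex satisfy an SCF HW
 * constraint such as 'CreateServer 2way SMP' or '2way SMP-exact -slot 2'? */
#include <stdbool.h>

typedef struct {
    int  smp_ways;   /* e.g. 2 for '2way SMP'                    */
    bool exact;      /* true when '-exact' is specified          */
    int  slot;       /* physical constraint, or -1 if logical    */
} scf_hw_constraint_t;

static bool satisfies(const scf_hw_constraint_t *c, int ways, int slot)
{
    if (c->slot >= 0 && c->slot != slot)
        return false;                           /* physical placement mismatch */
    return c->exact ? (ways == c->smp_ways)     /* exact match required        */
                    : (ways >= c->smp_ways);    /* default: minimum constraint */
}
```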
FIG. 11 illustrates selected aspects of an embodiment of server operational states and associated transitions.
Flow typically begins at “Unused” 1101 with a defined SCF (such as SCF 1020 of FIG. 10 , for example) and an available resource collection (such as Hardware Resources View 900 A of FIG. 9A ).
Flow proceeds via “PinServer” 1111 to “Pinned” 1102 , where resources are assigned (or allocated) from the available resource collection according to constraints in the SCF.
Resource allocation is managed by a platform manager (such as Platform Manager 531 of FIG. 5B ).
The result of the pinning is a provisioned server (such as any of Provisioned Server PS 2 902 . 2 , Provisioned Server PS 1 902 . 1 , and Provisioned Server PS 0 902 . 0 illustrated in FIG. 9B ).
For example, the SCF may specify a two-processor constraint, and during processing relating to “Pinned” 1102 an assignment of SMP Portion P A 152 A is made from an available resource pool originally including SMP Portion P A 152 A and SMP Portion P A′ 152 A′ (see FIGS. 9A and 9B , for example).
SMP 151 A (see FIG. 1B , for example) may then be configured as the pair of two-processor physical partitions 152 A and 152 A′ (if this has not already been accomplished).
VNIC 154 A. 1 may also be configured with an IP address, a MAC address, a VLANid, and so forth, according to information in the SCF or according to other management related criteria.
Selected lookup state information (such as keys and results illustrated in FIG. 8A and FIG. 8B ) may be programmed, including a VNIC MAC destination address (MACDA) to corresponding fabric port mapping, for subsequent reference by VIOCs when processing packets (and messages).
Other programming may be performed relating to VLAN membership, bandwidth, queuing behaviors, or other related properties (such as programming any combination of VNIC registers including those of Address Block 601 . 1 and VNIC Configuration Block 618 . 1 of FIG. 6A ).
The initial operating software includes any combination of a boot image, an OS boot loader, a root file system image, portions of an OS image, and an entire OS image.
The software is customized as required according to attributes included in the SCF, and stored so that it is accessible by hardware resources assigned during processing relating to “Pinned” 1102 . If installation fails, then flow proceeds along “Failure” 1113 .F to “Failed” 1109 , where corrective action may be taken. If installation is successful, then flow proceeds along “Success” 1113 to “Installed” 1104 .
“Running” 1106 is exited when any of several server operational commands, including shutdown, reboot, and reclaim, are received, as illustrated by “ShutdownServer, RebootServer, ReclaimServer” 1116 .
In response to a shutdown command, flow proceeds to “Shutting Down” 1107 , where any executing applications are closed and the executing OS is terminated.
Flow then proceeds along “ShutdownServer” 1117 to “Pinned” 1102 , awaiting the next command.
In response to a reboot command, flow proceeds to “Shutting Down” 1107 (where software execution is terminated) and then proceeds along “RebootServer” 1117 .B to “Booting” 1105 to boot the server again.
In response to a reclaim server command, flow proceeds to “Shutting Down” 1107 (terminating software) and then proceeds along “ReclaimServer” 1117 .R to “Unused” 1101 , where processing frees the resources assigned when pinning the server and returns them to the available resource collection.
A reclaim server command may also be processed from “Installed” 1104 (via “ReclaimServer” 1114 .R) and from “Pinned” 1102 (via “ReclaimServer” 1112 .R).
The unused SCF may be deleted, as indicated by flow “DeleteServer” 1111 .D proceeding to “Deleted” 1108 .
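The command-driven transitions of FIG. 11 discussed above may be summarized by the following sketch. Only the principal arcs named in the text are modeled; the “Shutting Down” 1107 and “Installing”/“Booting” transients are collapsed, and the state and command names are placeholders rather than the actual software interfaces.

```c
/* Illustrative end-state summary of the server operational commands.
 * Not a complete model of FIG. 11. */
typedef enum { UNUSED, PINNED, INSTALLED, RUNNING, FAILED, DELETED } srv_state_t;
typedef enum { PIN_SERVER, INSTALL_OS, BOOT_SERVER, SHUTDOWN_SERVER,
               REBOOT_SERVER, RECLAIM_SERVER, DELETE_SERVER } srv_cmd_t;

static srv_state_t end_state(srv_state_t s, srv_cmd_t c, int success)
{
    switch (c) {
    case PIN_SERVER:      return (s == UNUSED)    ? PINNED : s;
    case INSTALL_OS:      return (s == PINNED)    ? (success ? INSTALLED : FAILED) : s;
    case BOOT_SERVER:     return (s == INSTALLED) ? (success ? RUNNING : FAILED) : s;
    case SHUTDOWN_SERVER: return (s == RUNNING)   ? PINNED  : s;  /* via Shutting Down  */
    case REBOOT_SERVER:   return (s == RUNNING)   ? RUNNING : s;  /* shut down, re-boot */
    case RECLAIM_SERVER:  return (s == RUNNING || s == INSTALLED || s == PINNED)
                                 ? UNUSED : s;    /* resources returned to the pool     */
    case DELETE_SERVER:   return (s == UNUSED)    ? DELETED : s;  /* SCF deleted        */
    }
    return s;
}
```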
The initial operating software may correspond to a generic operating system environment (such as Linux, Windows, or any similar commercially available OS). In other embodiments, the initial operating software may correspond to a wholly or partially customized operating system environment, according to user requirements. Such customization may be based in part on one of the commercially available generic operating system environments. As illustrated in FIG. 5A , in some embodiments the initial operating software may include a Hypervisor (such as Hypervisor 1 510 or Hypervisor 2 520 ).
In some embodiments, operations such as those illustrated in SCF Tasks 1010 may be performed on an SCF only while there is no corresponding provisioned server, as designated by “Unused” 1101 .
The previous discussion of SCF Tasks assumes this scenario, and is roughly analogous to operating on the SCF when a server is “off-line”.
Other embodiments enable some or all SCF Tasks to be performed while a server is “on-line” (i.e. in any combination of operational states “Pinned” 1102 , “Installing” 1103 , “Installed” 1104 , “Booting” 1105 , “Running” 1106 , and “Failed” 1109 , in addition to “Unused” 1101 ).
For example, Modify 1012 may be performed while a server is running, in order to add resources associated with the server, and has the effect of changing the SCF as well as pinning additional resources that are then made available to the running server. Similarly, resources may be subtracted from the running server via Modify 1012 .
Parameters that may be modified while a server is running are considered dynamic parameters, and parameters that may not be modified are considered static parameters.
Various parameters may be dynamic or static, according to implementation.
The SCF Tasks available with respect to server operational state (such as Running and Unused) may also vary by embodiment.
When viewing an SCF, information specific to the operational state of the server corresponding to the selected SCF may be provided. This information may include current operating state, any relevant error information, data concerning processing load, and other similar information relating to status of an on-line server.
Variants of the View operation may provide a system-wide listing of defined SCFs and any corresponding server operational state.
A server as defined by an SCF may be physically relocated from one set of resources to another, including resources in distinct geographical locations, according to various embodiments.
For example, an SCF may be developed on a first ES system and then the SCF may be used on a second ES system. This effectively deassigns resources for the server from the first ES system and then assigns resources for the server from the second ES system.
The two systems may be co-located in a single server facility, or located in geographically distinct server facilities.
Similarly, a server provisioned according to an SCF may be unpinned with respect to a first assignment of resources, and then pinned anew to a second assignment of resources.
Real time server provisioning and management includes recognizing system status changes and responding to the system status changes at approximately the same rates, and performing these operations with low latency.
For example, when a pluggable module is inserted, availability of new HW resources included in the pluggable module is reflected immediately to an operator of the ES, according to a previously recorded query. The operator may then request booting a new provisioned server, and this request is performed immediately (provided sufficient resources are available to provision the server).
CM 533 , executing on an SCM (such as Primary SCM 140 A), manages many aspects of real time server provisioning and management, including managing modules in the chassis and maintaining an inventory of modules in the chassis.
The CM also monitors operational status of the modules and provides user interfaces for configuration of modules and provisioning of servers (via CLIs or GUIs, as provided directly by the CM or in conjunction with other SW, according to various embodiments). Additional information concerning the CM is included elsewhere in this section (see also the Chassis Manager Operation section, elsewhere herein).
Instances of Module BMC SW 550 , executing on BMCs in the system (such as BMCs 402 . 4 , 402 . 5 , and so forth), provide events to and receive commands from the CM.
The events convey information regarding module status changes and command execution results, providing information necessary for module management and inventory maintenance.
FIG. 12A is a flow diagram illustrating selected operational aspects of real time server provisioning and management in an ES system embodiment, including “CM SW Flow” 1201 . 20 and “Module BMC SW Flow” 1201 . 21 .
The CM flow illustrates portions of processing performed directly by or under the control of CM 533 ( FIG. 5B ).
The BMC flow illustrates portions of processing performed directly by or under the control of the Module BMC SW 550 ( FIG. 5C ) executing on the BMCs.
Processing begins when a module (a PMM, such as PMM 150 A, for example) is introduced into an ES chassis backplane (“Module Insertion” 1201 . 1 ), and continues as a presence interrupt is generated and delivered to CM 533 , indicating insertion of the pluggable module (“Detect Module Insertion and Generate Presence Interrupt” 1201 . 2 ). Processing then continues under the control of the CM, as illustrated in “CM SW Flow” 1201 . 20 .
When the CM receives the presence interrupt, a request is made to establish communication between the CM and a BMC included on the inserted module, such as BMC 402 . 5 of PMM 150 A (“Establish TCP/IP with Module BMC” 1201 . 3 ).
Meanwhile, the module BMC has been powered (due to insertion of the module) and has begun booting. Depending on various implementation dependent timing behaviors, the module BMC may have completed booting.
Eventually the BMC completes booting, responds to the TCP/IP communication channel, and listens for commands from the CM (by executing various portions of Command Agent 553 of FIG. 5B , for example).
CM 533 is aware only that a module has been inserted, but is not aware of any particular details of the module (such as whether the module is a PMM, NM, FCM, or OLB).
The CM then interrogates the module for Vital Product Data (VPD) to determine the particular details of the module (“Request VPD” 1201 . 4 ) by issuing a Module Get VPD command to the module BMC.
The CM then awaits a BMC event in response to the command (“Event Available?” 1201 . 5 ), looping back (“No” 1201 . 5 N) until a response is received (“Yes” 1201 . 5 Y).
The BMC SW receives the command (as illustrated conceptually by dashed-arrow 1201 . 4 V) and begins to gather the VPD for the module.
Optional power-up processing may occur (“Optional Power-Up” 1201 . 10 via dashed-arrow 1201 . 4 P) to enable various components on the module to respond to BMC interrogatories concerning various capacities and capabilities.
The various elements of the VPD are eventually gathered from components of the module (“Gather VPD” 1201 . 11 ).
The BMC SW flow then proceeds to send an event (“Send VPD Event” 1201 . 12 ) to the CM in response to the command (as illustrated conceptually by dashed-arrow 1201 . 12 V).
Processing relating to sending the event is generally performed by executing various portions of Event Agent 552 ( FIG. 5B ), for example.
The CM has been awaiting a response from the BMC, and when an event arrives conveying the response, the VPD included in the response is parsed and corresponding entries are stored into a repository (“Post Event Data to Repository” 1201 . 6 via “Yes” 1201 . 5 Y).
In some embodiments, the repository is Primary CIM Instance Repository 917 .P ( FIG. 9C ) and Repository Manager 535 accesses the repository at the request of CM 533 .
In other embodiments, the repository includes any combination of Primary CIM Instance Repository 917 .P and Secondary CIM Instance Repository 917 .S.
The CM then processes queries that depend on at least one of the corresponding entries stored in the repository (“Pre-Select Queries and Respond to Activated Queries” 1201 . 7 ). Processing includes determining queries that are dependent on any of the newly stored entries (or “pre-selecting” such queries), evaluating the pre-selected queries (to determine which, if any, are activated or deactivated), and processing any resultant triggered queries (and ceasing processing of any queries that are no longer triggered). In some embodiments, query processing is performed via SW routines included in Query Engine 534 ( FIG. 5B ). Flow then loops back to await a subsequent event (“Event Available?” 1201 . 5 ).
The BMC SW, meanwhile, has entered a loop monitoring for status changes on the module (“Status Change?” 1201 . 13 ). If no change has occurred, then processing loops back (“No” 1201 . 13 N). If a change has occurred, then processing flows forward (“Yes” 1201 . 13 Y) to send a status change event indicating and describing the new status to the CM (“Send StatusChange Event” 1201 . 14 ).
The event communication is indicated conceptually by dashed-arrow 1201 . 14 E, pointing to “Event Available?” 1201 . 5 , where the CM is looping while awaiting a newly available event.
Processing of triggered queries may result in one or more commands being sent to the BMC to alter the status or configuration of the module (as illustrated conceptually by dashed-arrow 1201 . 7 C, for example).
For example, a query may be registered that is activated whenever a module is inserted, and the query may result in an action including provisioning a server. If the module is a PMM, then provisioning the server may require sending a command to the BMC on the PMM module to partition the PMM according to the requirements of the server to be provisioned. Other such scenarios are possible, such as re-provisioning a failed server when a replacement module is inserted.
The BMC SW is enabled to receive and process commands in parallel with performing other processing.
The received BMC commands are typically generated by the CM, and in some embodiments are provided in response to server provisioning and management commands that may be provided manually by a user, or generated automatically in response to an activated query, according to various usage scenarios. Examples include booting a server, such as processing related to “BootServer” 1114 ( FIG. 11 ), and shutting down a server, such as processing relating to “ShutdownServer” 1117 .
A command is sent asynchronously to the BMC (“Command” 1201 . 15 ), as a result, for example, of processing related to an activated query (see dashed-arrow 1201 . 7 C originating from “Pre-Select Queries and Respond to Activated Queries” 1201 . 7 ).
The command is then received, any accompanying parameters are parsed, and the required operation is performed (“Perform Command” 1201 . 16 ).
Status that may change as a result of executing the command is updated (“Update Status” 1201 . 17 ) and processing of the command is complete (“End” 1201 . 18 ). Updating the status, as shown conceptually by dashed-arrow 1201 .
Recognized status changes (“Status Change?” 1201 . 13 ) are not limited to those occurring as a result of processing a command, but may include a change in any monitored parameter, state, or other related variable associated with the module. Such status changes may include a module failing or becoming operational or powered up, a sensor crossing a threshold, or completion of a boot operation. See the Selected BMC Event Details section, included elsewhere herein, for other examples.
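The CM side of the exchange just described may be sketched as follows. All function and type names are placeholders standing in for CM and BMC interfaces, not actual ES software APIs; the canned event simply illustrates the request/event/repository/query sequence of FIG. 12A.

```c
/* Hedged sketch of the CM flow: request VPD, wait for the response event,
 * post the parsed data to the repository, then re-evaluate dependent queries. */
#include <stdio.h>

typedef struct { int module; const char *kind; } bmc_event_t;   /* placeholder */

static void send_bmc_command(int module, const char *cmd)        /* stub */
{ printf("-> module %d: %s\n", module, cmd); }

static bmc_event_t wait_for_event(void)                          /* stub */
{ bmc_event_t ev = { 3, "VPD" }; return ev; }                    /* canned event */

static void post_to_repository(const bmc_event_t *ev)            /* stub */
{ printf("repository: %s event from module %d\n", ev->kind, ev->module); }

static void evaluate_dependent_queries(const bmc_event_t *ev)    /* stub */
{ (void)ev; /* pre-select queries touching the new entries, run triggered ones */ }

int main(void)
{
    send_bmc_command(3, "Module Get VPD");    /* "Request VPD" 1201.4        */
    bmc_event_t ev = wait_for_event();        /* "Event Available?" 1201.5   */
    post_to_repository(&ev);                  /* 1201.6                      */
    evaluate_dependent_queries(&ev);          /* 1201.7                      */
    return 0;
}
```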
CMcomplementary metal-oxide-semiconductor
FIG. 12B is a flow diagram illustrating selected operational aspects of real time server provisioning and management in an ES system embodiment, including selected details relating to provisioning VNICs and booting PMMs, as typically performed during processing related to “Booting” 1105 (see FIG. 11 ).
FIG. 12B is representative of operations performed by various SW elements, including the CM, the BMC SW, the VIOC Manager, and the BIOS, OS, and VIOC drivers executing on a PMM.
Flow starts (“Begin” 1202 . 1 ) when a command or request to boot a server is processed.
The CM instructs the BMC to partition the PMM according to an SCF, and the BMC configures the HT links on the PMM to form the appropriate physical partitions (“Partition PMM” 1202 . 2 ).
The BMC also “constructs” or “routes” a VIOC implementation in all or a portion of a Field Programmable Gate Array (FPGA) device (“Instantiate VIOC from FPGA” 1202 . 3 ).
A plurality of VIOCs may be instantiated, such as VIOC 301 . 5 and VIOC 301 . 5 ′ of PMM 150 A ( FIG. 4B ).
A further plurality of VIOCs, included on a plurality of PMMs, may be instantiated, depending on the requirements stated in the SCF.
The CM provides VNIC provisioning information from the SCF to a controlplane process (such as VIOC Manager 536 of FIG. 5B ) responsible for configuring VNICs in the VIOC (“VNIC Provisioning Info to VIOCmgr” 1202 . 4 ).
The VNICs are then configured according to the provisioning information (“Configure VNICs” 1202 . 5 ), typically by asserting corresponding VNIC enable bits (such as VNIC Enable 618 . 1 a of FIG. 6A ) of respective VNICs.
The VNIC configuration further includes setting minimum and maximum bandwidth parameters (such as Bandwidth Minimum 618 . 1 d and Bandwidth Maximum 618 .
Lookup information is programmed into TCAM/SRAMs coupled to the VIOC (“Configure TCAMs” 1202 . 6 ), based in part on the SCF and also based on additional system configuration and topological information.
The PMM configuration (including partitioning and VIOC setup) is now complete, and processing continues by booting the PMM (or PMMs) used in the instantiated server (“Boot PMMs” 1202 . 7 ). Processing in the PMM continues as an OS (such as OS/Drivers 1 507 of FIG. 5A ) is booted (“Start OS” 1202 . 8 ). A kernel mode VIOC Driver (such as VIOC Driver 1 511 ) is in turn initialized and spawned by the OS (“Start VIOC Driver” 1202 . 9 ). The VIOC Driver is typically responsible for communication between the OS and selected VIOC resources, including VNICs.
The VIOC Driver subsequently instantiates OS-level interfaces for each of the configured VNICs, presenting the VNICs as NIC resources to the OS (“Present NICs to OS” 1202 . 10 ).
Presenting the NICs includes the VIOC Driver reading the VNIC enable bits implemented in the VIOC, and, for each asserted bit (indicating an active VNIC), allocating and initializing driver-level SW data structures for the respective enabled VNIC to enable the OS to access the VNIC as a NIC.
The illustrated processing is then complete (“End” 1202 . 9 ).
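The "Present NICs to OS" step just described may be sketched as follows. The register width, the bit layout, and the registration call are placeholders, not the actual VIOC register map or driver interface.

```c
/* Illustrative sketch: the VIOC Driver reads the VNIC enable bits and
 * creates an OS-level network interface for each bit that is asserted. */
#include <stdint.h>
#include <stdio.h>

#define NUM_VNICS 16

static void register_os_nic(int vnic)                /* placeholder for the  */
{ printf("registered vnic%d as a NIC with the OS\n", vnic); } /* OS binding  */

static void present_nics_to_os(uint16_t vnic_enable_reg)
{
    for (int vnic = 0; vnic < NUM_VNICS; vnic++)
        if (vnic_enable_reg & (1u << vnic))          /* VNIC enabled?        */
            register_os_nic(vnic);                   /* allocate and expose  */
}

int main(void)
{
    present_nics_to_os(0x0003);    /* e.g. VNICs 0 and 1 provisioned */
    return 0;
}
```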
Removal of a VNIC (as a result of operations relating to server management, for example) is typically accomplished in several stages.
First, the VIOC Manager disables a corresponding VNIC enable bit (included in a VNIC enable register of a VIOC).
In response to writing the VNIC enable register, the VIOC generates an interrupt that is delivered to the VIOC Driver executing on the module including the VIOC (such as a PMM).
The VIOC Driver reads the VNIC enable register, determines that a VNIC has been disabled, and deletes any corresponding OS-level interfaces previously configured and presented to the OS.
Reconfiguration of a VNIC, with respect to properties visible to the OS, is accomplished in a similar manner: the VIOC Manager writes VIOC registers, an interrupt is delivered to the VIOC Driver, and the VIOC Driver modifies corresponding OS-level interfaces appropriately. Results may be returned by the VIOC Driver by writing one or more VIOC scratchpad registers with return value information. Typically the VIOC Manager polls the proper scratchpad registers awaiting results.
In some embodiments, the VIOC Manager executes using resources on an SCM (such as Primary SCM 140 A), and in such embodiments the VIOC Driver typically communicates with the Manager via an SFM (as shown by PMM-SCM Data Exchange 215 , for example).
One or more of the VIOC Drivers communicate with the same VIOC Manager, and one or more of the VIOC Drivers communicate with the same VLAN Manager.
FIG. 13A is a state diagram illustrating processing of selected BMC related commands in an ES embodiment.
All or any portion of BMC commands are implemented as IPMI type commands, according to embodiment, and in some of these embodiments an IPMI Client (such as IPMI Client 551 of FIG. 5C ) may provide interface functions for communication with Command Agent 553 .
The BMC commands are typically provided by CM 533 and processed by BMC SW, as illustrated by “Perform Command” 1201 . 16 ( FIG. 12A ).
In some embodiments, the BMC SW implements processing of BMC commands via a BMC Operational State Machine, and the figure is representative of states and transitions of these embodiments.
The BMC command processing is performed by any combination of SW and HW, according to embodiment.
A plurality of BMC state machine instances may be provided on a module (such as a PMM), one for each possible partition the module may be partitioned into.
The BMC Operational State Machine receives IPMI commands and passes them through to an IPMI implementation, returning IPMI status from the IPMI implementation if needed.
Some BMC events correspond to pass-through IPMI events, such as Sensor Threshold Breached, described elsewhere herein.
The figure conforms to the convention that the state machine remains in a state unless one of the illustrated transitions is activated (“loopback” transitions are omitted for clarity).
Each transition is reported to Chassis Manager 533 as one or more events that describe the resultant state.
The transitions are typically recognized as status changes, such as those detected by “Status Change?” 1201 . 13 ( FIG. 12A ), and event signaling is as illustrated by “Send StatusChange Event” 1201 . 14 .
The events include information concerning the transition or the cause for the transition.
The BMC Operational State Machine is illustrated as starting in state P 1301 .
While in state P 1301 , the module the BMC is included in is powered up only sufficiently for operation of the BMC sub-system, and other elements of the module remain powered down.
BMC sub-system elements related to detecting selected module status information, receiving CM commands, and delivering event information to the CM are operational. In some embodiments, these elements include execution of all or portions of Event Agent 552 and Command Agent 553 of FIG. 5C .
In response to a Module Power Up, Module Hold Reset, or Module Boot command received in state P, the state machine transitions to state C 1 1302 . State C 1 generally corresponds to a first or minimal configuration. If the command was Module Power Up, then state C 1 1302 is the end state for processing the command. If the command was Module Hold Reset, then when conditions for transition “Valid BCT AND Configuration Complete OK” 1302 .C 2 are met, the state machine transitions to state C 2 1303 , and this is the end state for processing the Module Hold Reset command. State C 2 generally corresponds to a second or customized configuration.
If the command was Module Boot, then the state machine transitions first to state C 2 1303 , as for Module Hold Reset. Then, when conditions for transition “Boot” 1303 .S 1 are met, the state machine transitions to state S 1 /D 1304 , followed by a transition to state S 2 1305 when conditions for transition “Heartbeat OK” 1304 .S 2 are met, and this is the end state for processing the Module Boot command.
While in state C 1 1302 , power is applied to all of the elements on the module, in addition to the BMC sub-system, and reset is active to any system or application processing elements. For example, CPUs included on PMMs, PCEs and TMs included on NMs, IOPs included on FCMs, and CPUs included on OLBs are continuously reset. If a Module Power Up command was being processed, then C 1 is the final state, and is exited only upon receipt of another command. If a Module Power Down command is received, then the state machine transitions to state P 1301 via transition “Power Down” 1302 .P. If a Module Force Fault command is received, then the state machine transitions to state F 1306 via transition “Force Fault” 1302 .F.
Otherwise C 1 is a transient state, and is exited when the BMC sub-system detects or determines that the conditions for transition “Valid BCT AND Configuration Complete OK” 1302 .C 2 are met.
Specifically, the Boot Configuration Table (BCT) information received with the command being processed has been found to be valid, and any module configuration information included in the BCT information has been successfully applied to the module.
The module configuration is complete and the machine transitions to state C 2 1303 .
While in state C 2 1303 , reset remains active to the system and application processing elements. If a Module Hold Reset command was being processed, then C 2 is the final state, and is exited only upon receipt of another command. If any command that provides new BCT information is received, then the state machine transitions to state C 1 1302 via transition “BCT Change” 1303 .C 1 . If a Module Power Down command is received, then the state machine transitions to state P 1301 via transition “Power Down” 1303 .P. If a Module Force Fault command is received, then the state machine transitions to state F 1306 via transition “Force Fault” 1303 .F.
Otherwise C 2 is a transient state, and is exited when the BMC sub-system detects or determines that the conditions for transition “Boot” 1303 .S 1 are met. Specifically, the BMC determines that an implementation dependent delay has transpired, and the state machine transitions to state S 1 /D 1304 .
state SI/D 1304While in state SI/D 1304 , reset is released, allowing the system and application processing elements to begin fetching and executing code. If a Module Hold Reset command is received, then the state machine transitions to state C 2 1303 via transition “Time Out OR Hold Reset” 1304 .C 2 . If a Module Power Down command is received, then the state machine transitions to state P 1301 via transition “Power Down” 1304 .P. If a Module Force Fault command is received, then the state machine transitions to state F 1306 via transition “Force Fault” 1304 .F.
S 1 /D is a transient state. If the conditions for transition “Heartbeat OK” 1304 .S 2 are met, then the state machine transitions to state S 2 1305 via transition “Heartbeat OK” 1304 .S 2 . The conditions are met when the BMC sub-system receives a heartbeat indication from the system or application processor after the processor has executed sufficient start-up code (such as BIOS boot for a PMM) to communicate the heartbeat indication to the BMC. In some embodiments, BIOS execution communicates heartbeat information to the BMC via VIOC scratchpad registers.
A special sub-case of the Module Boot command may specify (via information in the BCT) that off-line diagnostics are to be performed instead of a full boot. In that sub-case, completion of the Module Boot command occurs when the off-line diagnostics are completed or have timed out, and the state machine transitions to state C 1 1302 via transition “Offline Diagnostics Finished” 1304 .C 1 .
While in state S 2 (the terminus of successful processing of a Module Boot command), reset to the processing elements remains released, and the processing elements continue executing instructions and periodically generating heartbeat indications to the state machine. If a predetermined period of time elapses without a heartbeat indication, then the state machine transitions to state F 1306 via transition “Heartbeat Timeout OR Force Fault” 1305 .F.
If a Module Boot or a Module Hold Reset command is received, then the state machine transitions to state C 2 1303 via transition “Boot OR Hold Reset” 1305 .C 2 . If a Module Power Down command is received, then a transition is made to state P 1301 via transition “Power Down” 1305 .P. If a Module Force Fault command is received, then a transition is made to state F 1306 via transition “Heartbeat Timeout OR Force Fault” 1305 .F.
State F is a transient fault recovery state where an attempt is made to recover from whatever condition led to the transition into the state. If recovery from any non-fatal faults relating to state S 2 is made, then the machine transitions to state S 2 1305 via transition “Recover” 1306 .S 2 . If recovery from any fatal faults relating to states C 2 , S 1 /D, or S 2 is made, then the machine transitions to state C 2 1303 via transition “Recover” 1306 .C 2 . If recovery from any fatal faults relating to state C 1 is made, then a transition is made to state C 1 1302 via transition “Recover OR Hold Reset” 1306 .C 1 .
Receipt of a Module Hold Reset command overrides any in-progress or attempted recovery, and in response the machine transitions to state C 1 1302 via transition “Recover OR Hold Reset” 1306 .C 1 . A Module Power Down command is similarly overriding, and the machine moves to state P 1301 via transition “Power Down” 1306 .P.
The CM may issue a Module Power Down command in response to event information sent from the BMC indicating that the fault is an unrecoverable HW or SW failure, or represents a catastrophic fault, according to embodiment.
Some server provisioning and management operations typically require issuing one or more BMC commands that are processed according to the illustrated state diagram.
A first example is booting a server, such as processing related to “BootServer” 1114 ( FIG. 12 ). If the server to be booted is configured with multiple modules (such as two PMMs), then two separate BMC command streams will be issued, one to each of the two PMMs (see FIG. 13B for an example).
A second example is shutting down a server, such as processing relating to “ShutdownServer” 1117 , resulting in separate Module Power Down commands to some or all of the modules the server was provisioned from.
While the BMC Operational State Machine has been described with respect to selected BMC commands (such as Module Power Up, Module Power Down, Module Hold Reset, Module Boot, and Module Force Fault), this is only a representative embodiment. Any combination of BMC commands (such as those described in the Selected BMC Command Details section, elsewhere herein) may be implemented by the BMC state machine. Additionally, in some embodiments, any combination of BMC commands illustrated with respect to the BMC state machine may be implemented by other mechanisms.
The BMC Operational State Machine may be implemented in HW, SW, or any combination of the two, according to embodiment. It is also apparent that any number of state machine states and transitions may be implemented to provide similar functionality, according to embodiment.
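For concreteness, the states and transitions described above can be pictured as a small table-driven state machine. The following C sketch is illustrative only: the state, event, and function names are invented here, the transition set is abbreviated, and the code is not drawn from the patent text.

```c
/* Illustrative sketch of the BMC operational state machine described above
 * (states P, C1, C2, S1/D, S2, F). Identifiers are hypothetical. */
#include <stddef.h>
#include <stdio.h>

typedef enum { ST_P, ST_C1, ST_C2, ST_S1D, ST_S2, ST_F } bmc_state_t;

typedef enum {
    EV_POWER_UP, EV_POWER_DOWN, EV_BCT_VALID_CONFIG_OK, EV_BOOT,
    EV_HEARTBEAT_OK, EV_HEARTBEAT_TIMEOUT, EV_HOLD_RESET, EV_FORCE_FAULT,
    EV_RECOVER
} bmc_event_t;

/* One row per (current state, event) pair that causes a transition. */
struct transition { bmc_state_t from; bmc_event_t ev; bmc_state_t to; };

static const struct transition table[] = {
    { ST_P,   EV_POWER_UP,            ST_C1  },   /* "Power Up"             */
    { ST_C1,  EV_BCT_VALID_CONFIG_OK, ST_C2  },   /* 1302.C2                */
    { ST_C1,  EV_POWER_DOWN,          ST_P   },   /* 1302.P                 */
    { ST_C1,  EV_FORCE_FAULT,         ST_F   },   /* 1302.F                 */
    { ST_C2,  EV_BOOT,                ST_S1D },   /* 1303.S1                */
    { ST_C2,  EV_POWER_DOWN,          ST_P   },
    { ST_C2,  EV_FORCE_FAULT,         ST_F   },
    { ST_S1D, EV_HEARTBEAT_OK,        ST_S2  },   /* 1304.S2                */
    { ST_S1D, EV_HOLD_RESET,          ST_C2  },
    { ST_S1D, EV_POWER_DOWN,          ST_P   },
    { ST_S1D, EV_FORCE_FAULT,         ST_F   },
    { ST_S2,  EV_HEARTBEAT_TIMEOUT,   ST_F   },   /* 1305.F                 */
    { ST_S2,  EV_HOLD_RESET,          ST_C2  },
    { ST_S2,  EV_POWER_DOWN,          ST_P   },
    { ST_F,   EV_RECOVER,             ST_S2  },   /* non-fatal recovery     */
    { ST_F,   EV_HOLD_RESET,          ST_C1  },
    { ST_F,   EV_POWER_DOWN,          ST_P   },
};

/* Apply one event; unknown (state, event) pairs leave the state unchanged. */
static bmc_state_t bmc_step(bmc_state_t cur, bmc_event_t ev)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].from == cur && table[i].ev == ev)
            return table[i].to;
    return cur;
}

int main(void)
{
    bmc_state_t s = ST_P;                       /* powered off             */
    s = bmc_step(s, EV_POWER_UP);               /* -> C1                    */
    s = bmc_step(s, EV_BCT_VALID_CONFIG_OK);    /* -> C2                    */
    s = bmc_step(s, EV_BOOT);                   /* -> S1/D                  */
    s = bmc_step(s, EV_HEARTBEAT_OK);           /* -> S2 (booted)           */
    printf("final state: %d (expect %d)\n", s, ST_S2);
    return 0;
}
```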
FIG. 13B illustrates selected operational aspects of single and dual PMM low-level hardware boot processing in an ES embodiment, as represented by “Boot PMMs” 1202 . 7 ( FIG. 12B ), for example.
The processing illustrated in FIG. 13B conceptualizes selected paths through states as illustrated by FIG. 13A , with corresponding states and transitions named accordingly.
FIG. 13B illustrates boot processing for a single PMM configuration (such as P 3 203 of FIG. 2 ) and a dual PMM configuration (such as P 4 204 of FIG. 2 ), and as such the generic Module commands described in the FIG. 13A transitions correspond to specific PMM Module commands in the context of FIG. 13B .
Operation in the single PMM configuration is as follows. “Boot Flow” 1312 begins at state P 1301 .M, when the BMC on PMM 150 B receives a PMM Boot command via CM 533 . The BMC Operational State Machine then moves to state C 1 1302 .M via transition “Boot” 1301 .C 1 .M, and asserts reset to the PMM. When the BCT has been found to be valid and the configuration included therein has been properly applied to the PMM, the state machine moves to state C 2 1303 .M via transition “Configuration Complete OK” 1302 .C 2 .M.
The state machine then continues to state S 1 /D 1304 .M via transition “Boot” 1303 .S 1 .M, and releases reset to the PMM. The PMM then boots BIOS and generates a valid heartbeat. The machine moves to state S 2 1305 .M via transition “Heartbeat OK” 1304 .S 2 .M, and the PMM boot flow is complete.
Each of the state machine transitions is reported to CM 533 via events describing the resultant state. For example, when the state machine has completed the transition to state C 1 1302 , an event describing the new state machine state as “C1” is generated and delivered to the CM. Events are similarly generated and delivered for all of the state machine transitions.
Operation in the dual PMM configuration is as follows, with PMM 150 B operating as the master, and PMM 150 A operating as the slave.
The master PMM is partially booted (“Hold Reset Flow” 1313 M), the slave PMM is then booted (“Hold Reset Flow” 1313 S and “Release Reset Flow” 1314 S), and the master PMM is then fully booted (“Release Reset Flow” 1314 M).
The final slave PMM boot state is different than the master PMM boot state, as the slave PMM omits booting of BIOS and hence generates no heartbeat. Coordination of transitions between the master and slave PMMs is managed by CM 533 , via reception and processing of state transition events and issuing of appropriate commands to the master and slave BMCs on the respective PMMs.
“Hold Reset Flow” 1313 M begins at state P 1301 .M, when the BMC on the master PMM (PMM 150 B) receives a PMM Hold Reset command from CM 533 . The BMC Operational State Machine then moves to state C 1 1302 .M (asserting reset to the master PMM) and then to state C 2 1303 .M as in “Boot Flow” 1312 .
However, the state machine remains in state C 2 1303 .M when processing the PMM Hold Reset command (leaving reset asserted), instead of continuing as when processing a PMM Boot command. An event is generated upon arrival in state C 2 1303 .M and delivered to the CM. The CM then sends a PMM Hold Reset command to the BMC on the slave PMM (PMM 150 A).
The slave BMC Operational State Machine then transitions from state P 1301 .S to state C 1 1302 .S (asserting reset to the slave PMM) and then to state C 2 1303 .S, where it remains, awaiting further CM commands. An event is generated and delivered to the CM indicating the slave BMC is now in the “C2” state. The CM then provides a PMM Release Reset command to the slave BMC. The slave BMC releases reset to the slave PMM and transitions to state S 1 /D 1304 .S, whereupon another event is delivered to the CM indicating arrival in the “S1/D” state.
In response (indicated conceptually by dashed-arrow 1311 ), the CM sends a Release Reset command to the master BMC. The master BMC then transitions to state S 1 /D 1304 .M and releases reset to the master PMM.
When BIOS boot is complete and the resultant heartbeat is detected, the master BMC Operational State Machine transitions to state S 2 1305 .M and reports the new state to the CM. Booting of the dual PMM configuration is now complete, with both PMMs out of reset and the master PMM having booted BIOS.
CM communication with BMCs is via any combination of transports and protocols. The transports include Ethernet (coupling 452 of FIG. 4A , for example, as described elsewhere herein), an Intelligent Chassis Management Bus (ICMB), an Intelligent Platform Management Bus (IPMB), RS-485, RS-232, PCI mailboxes, in-band or out-of-band signaling over the SFM, and any other similar mechanisms. The protocols include TCP/IP and any similar protocols. The communications include events from BMCs to the CM, and commands from the CM to the BMCs.
Some embodiments provide for larger than 8-way SMPs, and in a first group of implementations, BMC coordination is via explicit CM control, as illustrated in FIG. 13B . In a second group of implementations, BMC SW instances communicate and cooperate with each other in a peer-to-peer mode, independent of explicit CM control coordination and sequencing.
BMC events are generated when a change in specific characteristics of an ES system or a pluggable module included therein occurs, and are also generated in response to most BMC commands (even those effecting no change in characteristics). The CM is the primary consumer of the generated events.
In some embodiments, the CM establishes a separate TCP connection for each respective BMC, for communication of the events as TCP messages. Each of the TCP messages may include a returned data structure providing specific details regarding the event, such as detailed status or log information, according to embodiment. The data structure typically includes fields identifying the pluggable module type sourcing the event, and the event classification or number. SIMs, PMMs, NMs, and OLBs may be identified as pluggable module types 1 , 2 , 3 , and 4 , respectively, with unknown modules identified as module type 0 , according to embodiment.
In some embodiments, a dedicated packet format is used to convey event information. In other embodiments, BMC events are conveyed as SNMP traps.
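A minimal sketch of how such an event header might be laid out follows; the structure name and field widths are assumptions, with only the module type codes and the presence of an event number taken from the description above.

```c
/* Hypothetical sketch of a BMC event header: a module type code
 * (0=unknown, 1=SIM, 2=PMM, 3=NM, 4=OLB) and an event number, followed by
 * event-specific detail bytes. Field widths are assumptions. */
#include <stdint.h>

enum module_type { MOD_UNKNOWN = 0, MOD_SIM = 1, MOD_PMM = 2, MOD_NM = 3, MOD_OLB = 4 };

struct bmc_event_hdr {
    uint8_t  module_type;   /* enum module_type of the event source          */
    uint16_t event_number;  /* event classification or number                */
    uint16_t detail_len;    /* length of the detail/status bytes that follow */
};
```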
Pluggable modules including VIOCs communicate events specific to VIOC operation, such as VIOC Initialization Complete/Fail, and VIOC Reset Request. The VIOC Initialization Complete event is sent when the BMC has successfully initialized the VIOC after module reset has been released, and the VIOC Initialization Fail event is sent if the VIOC initialization fails. In some embodiments, the VIOC Initialization Complete and Fail events are implemented as a single event with a field in the associated return data structure specifying success or failure. The VIOC Reset Request event is sent by the BMC in response to receipt of a corresponding VIOC reset request from a VIOC Driver executing on the module. In response, the CM determines if and when permission for the request is to be given, and if so sends a corresponding Module Reset VIOC command to the BMC, providing a mechanism for the VIOC Driver to reset an associated VIOC under control of the CM.
Other BMC events include Module Operational Status Up/Down, Release Reset Failure, and Sensor Threshold Breached. The Module Operational Status Up/Down event is sent when the BMC successfully establishes a heartbeat with SW executing on the module; in that case the associated return data structure specifies that the module is operational (Up). If the heartbeat is subsequently lost, the BMC sends the event with the data structure indicating the module is not operational (Down). In some embodiments, separate events are used for Module Operational Status Up and Module Operational Status Down. The SW executing on the module and providing the heartbeat may be any combination of OS SW, Driver SW, and BIOS SW, varying according to module type and embodiment.
In some embodiments, the Module Operational Status Up/Down event is sent when the BMC Operational State Machine transitions to state S 2 1305 ( FIG. 13A ), with the return data structure indicating the module is operational. A general BMC State Change event may be used to communicate transitions of the BMC Operational State Machine, including the transition to state S 2 1305 , as well as other transitions of the state machine.
The Release Reset Failure event is sent when the BMC detects that a module fails to respond to a request to release reset, typically delivered to the module by the BMC in response to a corresponding command from the CM.
The Sensor Threshold Breached event is sent when any sensors included in the BMC sub-system report a value that crosses any predefined thresholds (for example an over-temperature or over-voltage detection). The event data structure may optionally include the sensor value at the time the event is detected, according to sensor type and embodiment.
PMM specific events generally relate to a BCT, which is typically a superset of a Partition Configuration Table (PCT), used to specify the configuration of a PMM, particularly with respect to the number of CPUs in a partition (such as 2-way, 4-way, or 8-way). PMM specific events include a BCT Valid event that is sent in response to a BMC command that communicates a BCT. The BMC checks the communicated BCT to determine that it is valid for the module (such as determining that a requested partitioning is possible for the module), and if so, then configures the module according to the information in the BCT. If the configuration is successful, then the BMC sends a BCT Valid event indicating that the BCT was valid for the module, and the module was successfully configured as specified by the BCT.
SIM specific events include Power Up/Down and Fan Up/Down events. The Power Up/Down event is sent when there is a change in the operational status of a power module in the system; the event data structure specifies if the module has become operational (Up) or has become non-operational (Down). The Fan Up/Down event is sent to notify the CM of a change in a fan module operational state. In some embodiments, separate events are used for Power Up, Power Down, Fan Up, and Fan Down events.
BMC commands are generally sent by the CM to determine information or status regarding pluggable modules, or to effect a change in configuration or status of pluggable modules. BMC commands may be directed to BMCs on any type of pluggable module (such as a SIM, PMM, NM, FCM, and OLB), via the separate TCP connections for each module established at module boot; the TCP connections are also used to communicate BMC events. Some commands are specific to one module type (such as a PMM), and other commands may be applicable to more than one module type (such as any module including a VIOC, or any module having a configurable power supply). Typically, commands directed toward a SIM are directed to a Redundant SIM by a Primary SIM, since the CM typically executes at least in part using computing resources included in a Primary SIM (such as Primary SCM 140 A of FIG. 2 ).
Each BMC command generally includes a command parameter data structure defining specific details or values associated with the command. The data structure typically includes fields identifying the pluggable module type receiving the command, and the command identifier (or number). SIMs, PMMs, NMs, and OLBs may be identified as pluggable module types 1 , 2 , 3 , and 4 , respectively, with unknown modules identified as module type 0 , according to embodiment. In some embodiments, a dedicated packet format is used to convey command information. Processing of a BMC command may include generating a response event (directed to the CM, for example) acknowledging receipt of the command and describing the outcome of the command in the form of a return code.
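As a rough illustration, a command and its acknowledging response event might be framed as below; the structure names and exact layout are assumptions, while the return code and module type fields follow the description.

```c
/* Hypothetical sketch of a BMC command header and the acknowledging
 * response event; the layout is an assumption, not a documented format. */
#include <stdint.h>

struct bmc_cmd_hdr {
    uint8_t  module_type;   /* target pluggable module type (0..4)          */
    uint16_t command_id;    /* e.g. 0x000F for Module BCT                   */
    uint16_t param_len;     /* length of the command parameter structure    */
};

struct bmc_cmd_response {
    uint16_t command_id;    /* command being acknowledged                   */
    uint16_t return_code;   /* outcome, e.g. 0x0000 for success             */
};
```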
BMC commands specific to BMCs on PMM modules include Module BCT. The Module BCT command (identifier 0x000F) is used to instruct the BMC to configure the associated PMM (or a previously partitioned portion of it) according to a BCT (provided as a command parameter data structure), and is typically issued in the context of provisioning a server. The BMC parses the provided BCT to determine if the configuration is valid for the PMM in which the BMC is included. If the configuration is valid, then the BMC configures components on the PMM according to the configuration.
The parameters include structures for general, server, VNIC, boot, partition, console, and OS information. The general parameter structure includes a table version number (such as 0), and an action identifier describing an action to take based on the configuration information (such as provision, identified by the value 1 , and release or reclaim provision, identified by the value 2). The general parameter structure further includes a count of the number of BMCs involved in applying the configuration (one BMC for a 2-way or a 4-way configuration, and two BMCs for an 8-way configuration). The general parameters further include an IP address identifying a master BMC associated with the configuration, and a list of IP addresses for all of the BMCs involved in the configuration.
The server structure includes a server type identifier (having values such as 1 for 2-way, 2 for 4-way, and 3 for 8-way), and a slot number and valid bit to associate with the provisioned server (having values such as 0 and 1). The server structure further includes a system number and valid bit to associate with the provisioned server (having values such as 0 and 1), and a boot method identifier (such as 1 for network booting and 2 for local fibre channel booting). The server structure further includes a count of VNICs for the server (from 1 to 64, for example), a VNIC structure for each of the VNICs, and a list and count of fibre channel boot paths. Each VNIC structure includes a VNIC identifier that is unique throughout the server (such as a 32-bit integer), a bandwidth specification, and a MAC address for the VNIC. Each fibre channel boot path includes a port identifier of an associated fibre channel port, a world wide name of a fibre channel destination, and a logical unit number for the fibre channel destination.
The partition structure includes a boot policy identifier (such as 1 for ‘Wait-for-SIM’, 2 for ‘autoBoot’, 3 for ‘oneShot’ and 4 for ‘Debug’), and a sticky bit to indicate if the configuration remains over reboots (such as 0 for not sticky and 1 for sticky). The console structure includes information describing a baud rate, a number of data bits, a parity type, a number of stop bits, and a console type (such as 1 for VT-100). The OS structure includes an OS identifier (such as 1 for Linux and 2 for Windows).
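The BCT parameter structures described above might be rendered in C roughly as follows. This is a sketch under stated assumptions: field widths, array bounds, and structure names are invented here, while the enumerated values (action codes, server types, boot methods, and so on) follow the description.

```c
/* Hypothetical C rendering of the BCT parameter structures: general,
 * server, VNIC, fibre channel boot path, partition, console, and OS. */
#include <stdint.h>

#define MAX_BMCS        2       /* one BMC for 2/4-way, two for 8-way       */
#define MAX_VNICS       64
#define MAX_BOOT_PATHS  8       /* bound chosen for illustration only       */

struct bct_vnic {
    uint32_t vnic_id;           /* unique throughout the server             */
    uint32_t bandwidth;         /* bandwidth specification                  */
    uint8_t  mac[6];            /* MAC address for the VNIC                 */
};

struct bct_fc_boot_path {
    uint16_t port_id;           /* associated fibre channel port            */
    uint8_t  wwn[8];            /* world wide name of the destination       */
    uint32_t lun;               /* logical unit number                      */
};

struct bct_general {
    uint8_t  table_version;     /* such as 0                                */
    uint8_t  action;            /* 1 = provision, 2 = release/reclaim       */
    uint8_t  bmc_count;         /* BMCs involved in the configuration       */
    uint32_t master_bmc_ip;     /* master BMC for the configuration         */
    uint32_t bmc_ips[MAX_BMCS]; /* all BMCs involved                        */
};

struct bct_server {
    uint8_t  server_type;       /* 1 = 2-way, 2 = 4-way, 3 = 8-way          */
    uint8_t  slot, slot_valid;
    uint8_t  system_number, system_valid;
    uint8_t  boot_method;       /* 1 = network, 2 = local fibre channel     */
    uint8_t  vnic_count;        /* 1..64                                    */
    struct bct_vnic vnics[MAX_VNICS];
    uint8_t  boot_path_count;
    struct bct_fc_boot_path boot_paths[MAX_BOOT_PATHS];
};

struct bct {
    struct bct_general general;
    struct bct_server  server;
    struct { uint8_t boot_policy, sticky; } partition;   /* policy 1..4, sticky 0/1 */
    struct { uint32_t baud; uint8_t data_bits, parity, stop_bits, type; } console;
    struct { uint8_t os_id; } os;                        /* 1 = Linux, 2 = Windows  */
};
```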
Processing performed in response to the Module BCT command analyzes the BCT and configures PMM HW accordingly. If the action is to provision a server (action identifier equal to 1), then the server type identifier is decoded to determine how to configure the HT links on the PMM. More specifically, if the server type is 2-way (type identifier equal to 1), then in the context of PMM 150 A ( FIG. 4B ), HT couplings 460 . 5 L and 460 . 5 R are configured for coherent operation by BMC 402 . 5 , and HT couplings 460 . 5 X and 460 . 5 Y are configured for isolated operation. If the server type is 4-way (type identifier equal to 2), then HT couplings 460 .
Information from each VNIC structure is stored into corresponding configuration state in one or more VIOCs, such as setting MAC Address 603 . 1 ( FIG. 6A ), by writing corresponding configuration state implemented in VIOC Configuration block 706 ( FIG. 7A ).
The response event generated upon completion of the Module BCT command includes a return code selected from the set including Valid BCT Configuration Successful (encoded as 0x0000), Invalid Slot Information (0x0100), Valid BCT Configuration Failure (0x0200), and Invalid BCT (0x0300). Valid BCT Configuration Successful is returned when the BMC has determined that the provided BCT is valid for the associated module, and the specified configuration has been successfully applied to the module (or portion thereof). Valid BCT Configuration Failure is returned when the BCT is valid but the configuration has not been successfully applied. Invalid Slot Information is returned when the slot information in the BCT is not valid for the module. Invalid BCT is returned when the BMC determines that the BCT is not valid for the module (no attempt is made to configure the module).
BMC commands specific to BMCs on pluggable modules including VIOCs include Module Reset VIOC. The Module Reset VIOC command (identifier 0x000E) causes the BMC to reset a selected VIOC on the module (without resetting any other elements) and is typically issued in response to a request by a VIOC Driver to reset a VIOC. The parameters include a slot number, and a VIOC number to select which VIOC on the module to reset (such as 0 or 1). The return codes include VIOC Reset Successful (0x0000), Invalid Slot Information (0x0100), Invalid VIOC Number (0x0200), and VIOC Reset Failure (0x0300).
BMC commands specific to BMCs on pluggable modules having system or application processing elements include Module Reset Partition, Module Hold Reset, Module Release Reset, Module Boot, Module Firmware Update, and Module Firmware Update Status. Such modules include PMMs (having CPUs), NMs (having PCEs and TMs), FCMs (having IOPs), and OLBs (having CPUs).
The Module Reset Partition command (identifier 0x0006) causes the BMC to assert and then release reset for an entire module or a partition of a module (such as a partition of a PMM). If the module has been previously configured into partitions (by a Module BCT command, for example), then the command operates on a specified partition of the module. If the module is a partitionable module (such as a PMM) and there has been no previous partitioning of the module, then the entire module is reset and an error is returned.
The parameters include a slot number and a partition identifier. The associated return codes include Reset Partition Successful (0x0000), Invalid Slot Information (0x0100), Invalid Partition (0x0200), and Reset Partition Failure (0x0300). Reset Partition Successful is returned when the partition identifier is valid and reset has been successfully applied and released. Invalid Slot Information is returned when the slot information is not valid for the module (for example when the module is inserted in a different slot than the command was intended for, or an incorrect BMC received the command). Invalid Partition is returned when the partition identifier is incorrect for the module. In some embodiments, Invalid Partition is returned when the module has not been previously partitioned (although the entire module is also reset).
The Module Hold Reset command (identifier 0x0005) causes the BMC to begin asserting reset to system and application processing elements on the module, a selected partition of the module, or a CPU sub-system on the module, and to continue asserting reset until a command to release reset is received. If the module has not been previously partitioned (or is not partitionable), then the entire module (or CPU sub-system) is reset and continues to be reset. The parameters include a slot number and a partition identifier. The return codes include Hold Reset Successful (encoding 0x0000) for indicating the partition identifier is valid (or ignored) and reset has been successfully applied, Invalid Slot Information (0x0100), Invalid Partition (0x0200), and Hold Reset Failure (0x0300).
The Module Release Reset command (identifier 0x0004) causes the BMC to stop asserting reset to system and application processing elements on the module, a selected partition of the module, or a CPU sub-system on the module. The Module Release Reset command enables the module (or the selected partition or CPU sub-system) to boot. It may be used, for example, when directed to a PMM as in “Release Reset Flow” 1314 S ( FIG. 13B ). The parameters include a slot number and a partition identifier. The return codes include Release Reset Successful (encoding 0x0000), Invalid Slot Information (0x0100), and Release Reset Failure (0x0200).
The Module Boot command instructs the BMC to power up, reset, and release reset to system and application processing elements on the module, a selected partition of the module, or a CPU sub-system on the module. The Module Boot command typically enables the module (or the selected partition or CPU sub-system) to proceed from being not powered to a fully booted state without additional BMC commands. The parameters include a slot number and a partition identifier. The return codes include Boot Successful (encoding 0x0000), Invalid Slot Information (0x0100), and Boot Failure (0x0200). In some embodiments, intermediate events return information as the module proceeds through various stages of executing the Module Boot command.
The Module Firmware Update command (identifier 0xFFFE) instructs the BMC to download and program firmware into non-volatile memory (such as flash memory) on the module. Downloading typically uses the Trivial File Transfer Protocol (TFTP). The parameters include an IP address (in binary format) and a number and list of file names. The return codes include Firmware Update Successful (encoding 0x0000), indicating all of the requested files have been downloaded and stored into the non-volatile memory, and Firmware Update Failure (0x0100), indicating otherwise.
The Module Firmware Update Status command (identifier 0xFFFF) instructs the BMC to provide information concerning the most recent Module Firmware Update command. In some embodiments, there are no parameters. Multiple return codes are provided in response, including an overall status indicator, a stage indicator, and a completion/error indicator. The overall status indicator states include Success (encoding 0x0000) and Failure (0x0100). The stage indicator states include Update Complete (0), Update Downloading (1), and Updating Flash (2). The completion/error indicator states include percent completion from 0% to 100% (encodings 0x00 to 0x64), Update Successful (0x70), No TFTP Server (0x71), File Not Found (0x72), Checksum Invalid (0x73), Bad Sector Number (0x74), TFTP Connection Closed (0x75), and Canceled (0x76).
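A small helper illustrating how the completion/error indicator might be decoded is sketched below; the function name is hypothetical and the mapping simply mirrors the encodings listed above.

```c
/* Hypothetical decoder for the firmware update completion/error indicator:
 * values 0x00..0x64 are a percentage, 0x70..0x76 are terminal codes. */
#include <stdint.h>
#include <stdio.h>

static const char *fw_status_str(uint8_t code)
{
    static char buf[32];                     /* note: not reentrant; sketch only */
    if (code <= 0x64) {
        snprintf(buf, sizeof buf, "%u%% complete", code);
        return buf;
    }
    switch (code) {
    case 0x70: return "Update Successful";
    case 0x71: return "No TFTP Server";
    case 0x72: return "File Not Found";
    case 0x73: return "Checksum Invalid";
    case 0x74: return "Bad Sector Number";
    case 0x75: return "TFTP Connection Closed";
    case 0x76: return "Canceled";
    default:   return "Unknown status";
    }
}
```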
The Module Get VPD command (identifier 0x0002) causes the BMC to collect and report information describing the capabilities of the module. The parameters include a slot number. The return information includes several fields and structures, including a status code, a table identifier, a count of table entries, a variable number of table entries, an end marker tag, and a checksum. The status codes include VPD Retrieval Successful (0x0000) and VPD Retrieval Failure (0x0100).
The table identifier is an 8-bit field in some embodiments. The count of table entries specifies the number of individual VPD table entries that follow the count. The end marker tag (encoded as 0x79) marks the end of the VPD table. The checksum is used to verify integrity of the response data, and is an 8-bit field in some embodiments.
In some embodiments, VPD entries are compatible with those described by the Conventional PCI v2.2 Specification, available from http://www.pcisig.com, and hereby incorporated by reference herein for all purposes. Each VPD table entry includes an entry beginning marker tag (0x90) followed by a count of fields in the entry and a variable number of fields as indicated by the count. Each field in turn includes a field name (a 3-character string in some embodiments), a field length, and a field value string having a length as indicated by the field length. The general format of the VPD table enables essentially unlimited information to be provided by the BMC to the CM, as the format is not restrictive.
VPD may include descriptions of the number and capabilities of system and application processing elements present on or associated with the module. Examples include the number and frequency of CPUs included on PMMs, PCEs and TMs included on NMs and included on daughter cards coupled to NMs, IOPs included on FCMs, CPUs included on OLBs, and CPUs and Accelerators included on daughter cards coupled to OLBs. VPD may include memory size and organization on the module. VPD may include MAC address information associated with the module, such as a MAC address associated with a VIOC on the module.
VPD returned for SIM modules may indicate the presence and capabilities of Mass Storage 412 A ( FIG. 4A ), and information concerning Primary Switch Fabric Module 180 A. VPD returned for PMM modules may indicate the presence and capabilities of FCI 413 . 5 and FCI 413 . 5 ′ ( FIG. 4B ). VPD returned for NM modules may describe Interface 420 and IOP 421 ( FIG. 4C ), including bandwidth capacity and physical interface type. VPD returned for FCM modules may describe operational parameters associated with FCPs, such as FCP 423 . 4 ( FIG. 4D ). VPD returned for OLB modules may describe the presence and capabilities of optional daughter cards or modules, such as PCI sub-module 425 and HT sub-module 424 ( FIG. 4E ), including descriptions of specific services or protocols accelerated by the daughter elements. The aforementioned module-specific VPD information may vary in specific details and may be provided in various combinations, according to embodiment.
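The VPD table layout lends itself to a simple linear parse. The sketch below assumes a byte stream in which the entry tag, field count, and field length are each one byte; those widths are assumptions, while the 0x90 entry tag, 3-character field names, and 0x79 end marker follow the description.

```c
/* Hypothetical parser for the VPD table layout described above. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void parse_vpd(const uint8_t *p, size_t len)
{
    size_t i = 0;
    while (i < len && p[i] != 0x79) {            /* 0x79 = end marker tag    */
        if (p[i++] != 0x90) return;              /* expect entry marker tag  */
        uint8_t nfields = p[i++];                /* count of fields in entry */
        for (uint8_t f = 0; f < nfields && i + 4 <= len; f++) {
            char name[4] = {0};
            memcpy(name, &p[i], 3);              /* 3-character field name   */
            i += 3;
            uint8_t vlen = p[i++];               /* field length             */
            printf("%s = %.*s\n", name, (int)vlen, (const char *)&p[i]);
            i += vlen;                           /* skip the value string    */
        }
    }
}
```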
The Module SIM Fabric Port command (identifier 0x000D) informs the BMC of the physical fabric port number of the Primary SIM (having an included Primary SCM), such as the fabric port associated with Primary SCM-Fabric coupling 149 A ( FIG. 2 ) as coupled to Primary Switch Fabric Module 180 A. The parameters include a slot number and a fabric port number. The fabric port number corresponds to the fabric port number of the primary (or master) SIM of the chassis (such as 4 or 5 ). The return codes include SIM Fabric Port Success (0x0000) and SIM Fabric Port Failure (0x0100). In some embodiments, SIM Fabric Port Failure is returned when the BMC fails to register the SIM fabric port.
The Module Power Up and Module Power Down commands (identifiers 0x0003 and 0x0007, respectively) instruct the BMC to apply and remove, respectively, operating power for the remainder of the module. The Module Power Up command leaves reset to system and application processing elements of the module asserted. The Module Power Down command optionally fails unless the module (such as a PMM) has no booted, active, or running partitions, or has no active heartbeat established, according to embodiment. The parameters include a slot number. The return codes include Success (0x0000), Invalid Slot Information (0x0100), and Failure (0x0200).
The Module Get Sensors command causes the BMC to return information regarding sensors available on the module, such as the number and types of sensors. The parameters include a slot number. The return information includes a status code, a count of sensors available, and a variable number of sensor identifiers. The status codes include Success (0x0000), Invalid Slot Information (0x0100), and Failure (0x0200). The count of sensors specifies the number of sensors available on the module and individually identified by the information following the count. Each of the sensor identifiers is a 32-bit integer in some embodiments.
The Module Get Sensor Information command (identifier 0x000B) causes the BMC to return information about a selected sensor or list of sensors, as specified by the command. The parameters include a slot number, a count of sensors requested, and a variable number of sensor identifiers. The count of sensors requested specifies the number of sensors for which information is requested and individually identified by the information following the count. Each of the sensor identifiers is a 32-bit integer in some embodiments. The return information includes a status code, and a sensor information structure for the sensors selected by the sensor identifiers. The status codes include Success (0x0000), Invalid Slot Information (0x0100), and Failure (0x0200). In some embodiments, sensor information structures are compatible with IPMI v1.5, available from ftp://download.intel.com/design/servers/ipmi/IPMIv1_5rev1_1.pdf, and hereby incorporated by reference herein for all purposes. Each of the sensor information structures includes a sensor identifier (32-bits, for example), a length specification (16-bits, for example) equal to the length of the following name, value, and type fields (including nulls), a name string, a value string (representing the current value of the sensor), and a data type field for the value string.
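A fixed-size C rendering of the per-sensor information structure might look as follows; the buffer sizes are arbitrary illustration choices, while the identifier and length fields follow the description.

```c
/* Hypothetical rendering of the per-sensor information structure: a 32-bit
 * sensor identifier, a 16-bit length covering the name, value, and type
 * fields (including nulls), then the three strings. Fixed sizes are used
 * here purely for illustration; the real encoding is variable length. */
#include <stdint.h>

struct sensor_info {
    uint32_t sensor_id;     /* 32-bit sensor identifier                     */
    uint16_t length;        /* length of name + value + type (with nulls)   */
    char     name[32];      /* sensor name string                           */
    char     value[16];     /* current value, represented as a string       */
    char     data_type[8];  /* data type of the value string                */
};
```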
The Module Get Operational Status command (identifier 0x000C) instructs the BMC to return the operational status of a module or a selected partition of a previously partitioned module (such as a PMM). Typically the operational status is determined by the presence of a heartbeat between the BMC and an OS (or BIOS) executing on the module or the selected partition. The parameters include a slot number and a partition identifier. The return information includes a status code and an operational code. The status codes include Get Operational Status Successful (0x0000), Invalid Slot Information (0x0100), Invalid Partition (0x0200), and Get Operational Status Failure (0x0300). The operational codes include Down/Non-operational (0x0000) and Up/Operational (0x0100).
The Module Force Fault command instructs the BMC to force the BMC Operational State Machine associated with the module (or a selected partition of a module) to transition to state F 1306 ( FIG. 13A ), and may be used when the CM detects operational errors requiring the module to be failed. The parameters may include a slot number and a partition identifier, according to embodiment.
Some BMC commands are IPMI-compliant, relate to collecting and managing information in a System Event Log (SEL) maintained by a BMC, and include Module Get SEL and Module Clear SEL. The Module Get SEL command causes the BMC to provide selected log entries from the associated SEL. The parameters include a slot number, an offset, and a maximum count. The offset specifies a starting point in the SEL from which the BMC is to return data, to prevent resending older data. The maximum count specifies the maximum number of entries to provide in the return information.
The return information includes several fields and structures, including a status code, a count of returned log entries, and a variable number of log entries. The status codes include Get SEL Successful (0x0000), Invalid Slot Information (0x0100), Invalid Offset (0x0200), and Get SEL Failure (0x0300). The count of log entries specifies the number of individual log entries that follow the count. Each returned log entry, in some embodiments, includes a 16-byte field encoded according to an IPMI standard (such as is described on page 308 of IPMI specification Rev 1.1, incorporated herein by reference for all purposes).
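The Module Get SEL exchange might be framed as the request/reply pair sketched below; structure names and field widths are assumptions, while the parameter set, status code, and 16-byte IPMI-format entries follow the description.

```c
/* Hypothetical sketch of the Module Get SEL request and reply framing. */
#include <stdint.h>

struct get_sel_request {
    uint8_t  slot;          /* slot number                                  */
    uint16_t offset;        /* starting point, to avoid resending old data  */
    uint16_t max_count;     /* maximum number of entries to return          */
};

struct get_sel_reply {
    uint16_t status;        /* 0x0000 = Get SEL Successful, ...             */
    uint16_t count;         /* number of log entries that follow            */
    uint8_t  entries[][16]; /* IPMI-format 16-byte log entries              */
};
```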
The Module Clear SEL command (identifier 0x0009) causes the BMC to clear all or selected log entries from the associated SEL, according to embodiment. The parameters include a slot number. The return codes include Clear SEL Successful (0x0000), Invalid Offset (0x0200), and Clear SEL Failure (0x0300).
In some embodiments, the aforementioned command identifiers vary according to target module. For example, a prefix may be inserted identifying the module type (such as 0x0001 for SIMs, 0x0002 for PMMs, and 0x0003 for NMs, according to embodiment). In some embodiments, the aforementioned return codes are 16-bit values, and the status codes are 8-bit values. In some embodiments, the slot number is 0 or 1 for PMMs, 2 or 3 for SIMs, 4, 5, or 6 for NMs, and other values for other modules. The partition identifier is 0 or 1 to select a first or a second partition of a PMM that is partitioned as a 2-way element. In some embodiments, the partition identifier is optional, and is provided only for a command directed to a PMM. In some embodiments (or contexts, such as a PMM), the partition identifier is ignored unless the module has been partitioned as a 2-way element.
The CM provides a single source of chassis information to all other processes in the system. It provides other processes with information such as the presence of modules, properties of the modules, and status of the modules. It also provides information about failure of modules and changes in module configurations. To provide such detailed information about each of the modules, the CM peers with the BMC on each of the modules in a chassis and obtains vital data to maintain a persistent database. The CM may be considered to provide a window into an ES system embodiment and an interface for users and operators to view and modify various system level behaviors.
There is a plurality of slots (10, for example) in an ES system chassis embodiment. In some embodiments, each slot in the chassis is enabled to accommodate only one type of pluggable module, and the slot assignments and the module types in the chassis are predefined.
The CM performs various initialization steps, including resetting values of global variables, initializing an event library, and initializing the BMC interface of the SCM the CM is executing on. Typical SCM embodiments include a Mastership Module (MM), and the CM initializes an interface of the MM. The CM then issues a Process Initialized Event.
The initialization of the various interfaces triggers an associated set of activities in the CM. The CM performs the initialization functions and then enters a loop for listening to events occurring in the system, such as those reported by the Module BMC SW.
The MM is typically implemented in all or a portion of an FPGA, according to embodiment, and provides various functions, also according to embodiment. The functions may include an application level heartbeat, and an identification of the slot in which the SCM the CM is executing on is inserted. Other functions may include presence information of various modules inserted in the chassis, notification of pluggable module insertion (such as that associated with “Detect Module Insertion and Generate Presence Interrupt” 1201 . 2 of FIG. 12A ), and notification of pluggable module removal. Further functions may include various indications of whether or not inserted modules are capable of powering up and powering down. Further functions may enable failover from a Primary SCM to a Redundant SCM (such as the Primary and Redundant SCMs 140 of FIG. 1A ), either manually via a user or operator request, or automatically as a result of a system failure.
In some embodiments, the MM includes a Mastership state machine. The CM indicates it has booted, and in response the state machine transitions from a Waiting state to a Booted state. The state machine determines whether the SCM the CM is executing on is a Primary SCM or a Secondary SCM, and transitions to a Primary or Redundant state accordingly. The SCM is determined to be the Primary SCM if there is currently no Primary SCM, and otherwise it is the Redundant SCM. If the determination of Primary versus Secondary roles is not possible, then an error is recognized and the state machine transitions to an Error state. If there is a failover (either manual or automatic), then a transition is made from the Redundant to the Primary state, the SCM becomes a new Primary SCM, and the CM changes roles from Redundant to Primary accordingly.
The CM also sets a watchdog time interval in the MM, corresponding to a heartbeat for SW executing on the SCM. During normal operation, the CM sets a watchdog bit at regular intervals (shorter than the watchdog time interval). If the CM is unable to set the watchdog bit within the timeout of the watchdog interval, then the MM assumes that the SW executing on the SCM is locked up and the SCM becomes unavailable. If the SCM is the Primary SCM, then an automatic failover occurs, and the Redundant SCM becomes a new Primary SCM. If the SCM was the Redundant SCM, then the SCM is no longer available for failover, and there is no longer a Redundant SCM.
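The watchdog interaction described above amounts to a periodic kick loop on the CM side. The sketch below is illustrative only; the MM register helpers (mm_set_watchdog_interval, mm_kick_watchdog) and the interval values are hypothetical.

```c
/* Illustrative sketch of the CM-side watchdog behavior: program an interval
 * in the MM, then set the watchdog bit at a shorter, regular interval. If
 * the kicks stop, the MM declares the SCM SW locked up and failover occurs. */
#include <stdbool.h>
#include <unistd.h>

#define WATCHDOG_INTERVAL_S  10   /* assumed timeout programmed into the MM */
#define KICK_PERIOD_S         2   /* kick well inside the timeout window    */

extern void mm_set_watchdog_interval(unsigned seconds);   /* hypothetical */
extern void mm_kick_watchdog(void);                        /* hypothetical */
extern bool cm_healthy(void);                              /* hypothetical */

void cm_watchdog_loop(void)
{
    mm_set_watchdog_interval(WATCHDOG_INTERVAL_S);
    while (cm_healthy()) {
        mm_kick_watchdog();        /* sets the watchdog bit in the MM       */
        sleep(KICK_PERIOD_S);
    }
}
```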
When an SCM becomes a Primary SCM, the CM reads the identification of the SCM slot from the MM and stores it in an internal variable. The CM then obtains the presence information from the MM (such as by reading selected MM registers, in some embodiments) and determines the modules that are present in the chassis. The CM then populates a database of modules (such as Primary CIM Instance Repository 917 .P of FIG. 9C ) and attempts to set up communication channels with BMCs on each of the modules that were indicated as being present.
When a module is inserted into or removed from the chassis, the MM detects this change and notifies the CM via an event. The CM receives the event, determines the affected slot, and carries out any necessary actions as determined by the specific module involved. There is also a notification when the SCM has changed from Secondary to Primary (such as during failover processing).
The CM maintains a map of the slots in the chassis. In some embodiments, the slots are restricted to selected modules (for example, PMMs may only be inserted in slots 0 or 1 , SIMs in slots 2 or 3 , NMs in slots 4 , 5 , or 6 , and so forth according to embodiment). The map includes information concerning the type of module that may be inserted in each slot according to the restrictions. The module type information may vary according to any combination of product type, chassis type, or other similar customization information, according to various embodiments.
The CM attempts to establish a TCP connection for each module in the chassis by issuing a connect request to the BMC on each respective module. The request issuing is non-blocking, and responses arrive asynchronously. Once a connection is established, the CM typically requests VPD for the corresponding module (using a Module Get VPD command, for example). Returned information arrives via a corresponding BMC event, and is processed and stored in the module database. The information is used, for example, to determine a module type and various properties associated with each respective module. The CM then issues module presence events to any other processes that may be listening for the module presence events. In some embodiments, the presence is published (i.e. module presence events generated) only if the VPD information is obtained. If there is a failure in retrieving the VPD data, then the module is considered of an unknown or unrecognized type.
The CM then collects other information such as module properties, sensor properties, and anything else that may be necessary for CM and related functions. The CM may also poll the SEL maintained by the BMC to determine if there were any new system events logged. System events in the SEL may also be dumped into a system log file along with appropriate information to identify sources of the dumped system events.
To boot a module, the CM may initiate booting by issuing a command (such as a Module Boot or a Module BCT command) to the BMC of the module.
The CM also initiates module resets, reloads, and other related operations by issuing corresponding BMC commands. The various commands from the CM to the BMCs may be results of manual user input or of automatic provisioning or configuration processing. The CM stores module information (such as presence, sensor values, and so forth) in the database. Thresholds and policies relating to these values may also be stored in the database, and in some embodiments are implemented as queries having corresponding actions.
In some embodiments, booting of some pluggable modules that include system or application processing elements includes providing one or more data images to the booting module. In preparation, the CM updates a Dynamic Host Configuration Protocol (DHCP) configuration file and creates or updates a Pre-boot eXecution Environment (PXE) configuration file for the module.
The CM then restarts a DHCP daemon and issues a BMC command to boot the module. The module subsequently issues a DHCP request, and the DHCP daemon responds with IP address and PXE configuration information, according to the updates the CM has performed. The module then requests a kernel image and RAM disk image via TFTP, the images are transferred, and the module boots using the images.
Since the DHCP configuration file is accessed during module boot, and modules may be dynamically configured in or added to live systems and then booted, the CM must dynamically alter the DHCP and PXE information as module configuration changes, or as modules are inserted into the chassis. Additionally, in these embodiments, the DHCP configuration file may also include entries corresponding to each of the bootable modules.
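The DHCP and PXE updates might be performed along the lines of the sketch below, assuming an ISC dhcpd-style configuration file and a PXELINUX-style per-module configuration file; the paths, entry names, and helper functions are assumptions rather than details from the text.

```c
/* Illustrative sketch of the DHCP/PXE update step; file formats assumed to
 * be ISC dhcpd host blocks and PXELINUX per-module config files. */
#include <stdio.h>

/* Append a host entry for a bootable module, keyed by its MAC address. */
int add_dhcp_host(const char *conf_path, const char *name,
                  const char *mac, const char *ip)
{
    FILE *f = fopen(conf_path, "a");
    if (!f) return -1;
    fprintf(f, "host %s {\n"
               "  hardware ethernet %s;\n"
               "  fixed-address %s;\n"
               "  filename \"pxelinux.0\";\n"
               "}\n", name, mac, ip);
    return fclose(f);
}

/* Write a PXELINUX-style config naming the kernel and RAM disk images. */
int write_pxe_config(const char *cfg_path, const char *kernel,
                     const char *ramdisk)
{
    FILE *f = fopen(cfg_path, "w");
    if (!f) return -1;
    fprintf(f, "DEFAULT module\n"
               "LABEL module\n"
               "  KERNEL %s\n"
               "  APPEND initrd=%s\n", kernel, ramdisk);
    return fclose(f);
}
```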
In some embodiments, the VPD includes a MAC address, and the CM may determine some portion of the DHCP and PXE configuration information based in part on the MAC address. The configuration information may also be based in part on processing of an SCF when performing an install server command (such as “Installing” 1103 of FIG. 11 ).
The CM stores portions of configuration data and portions of runtime information, including portions of the database, in the form of CIM instances, providing a standard interface for querying the information and other operational parameters. Chassis Manager 533 may issue events to Repository Manager 535 to create CIM instances corresponding to modules inserted in the chassis, for example when VPD is returned and processed. Additional CIM instances may be created to represent components of inserted modules. In some embodiments, creation and maintenance of the CIM instances (or instances of classes) may be performed in a library form in a platform process (such as a process associated with Platform Manager 531 ).
Commands or requests originating from CLI and GUI operations trigger the CM to carry out operations necessary to perform the requested command. The operations may include accessing the database to view CIM instances (corresponding to modules or components in the chassis), and delivering BMC commands as needed. The operations may further include updating CIM instances as module state changes, as indicated by received BMC events. Some of the BMC events may be generated due to processing the BMC commands, and may indicate success or failure of a command. This enables a user to determine success or failure of a request by requesting a display of appropriate module status information. In some embodiments, asynchronous requests are posted to the CM without blocking, and a requestor determines success or failure by specifically requesting status from the CM. In some embodiments, synchronous requests are posted to the CM with blocking, and wait for status returned from the CM before completion.
One or more processes associated with any combination of Enterprise Manager 530 and Platform Manager 531 may require portions of module sensor information (and portions of other monitoring information) to be visible via CIM instances. The CM acts as the instance provider for some or all of the sensor and monitoring instances, creating the instances as corresponding information is received (perhaps in response to commands) from the modules. In some embodiments, all non-configuration type CIM instances are managed in the CM context (the CM performs as the associated instance provider) and all configuration type CIM instances are managed by Platform Manager 531 .
The Chassis Manager Operation section is illustrative only, as those of ordinary skill in the art will recognize that selected CM functions may be performed elsewhere while still under the direct control of the CM. Additionally, some of the functions may be modified, added, or deleted, according to embodiment.
Layer-3 and above networking protocols typically identify and name sources, destinations, and resources using one or more IP addresses, and the IP addresses are mapped to associated MAC addresses while performing various switching and routing functions. A pluggable module of an ES system embodiment is typically associated with (or assigned) one or more IP addresses, such as Public IP Address 604 . 1 ( FIG. 6A ), and one or more MAC addresses, such as MAC Address 603 . 1 . IP and MAC addresses are typically assigned or configured when a server is provisioned (see the Server Operational States section, elsewhere herein). For modules having VIOCs implementing one or more VNICs, appropriate values are written into each of the respective public IP and MAC address registers corresponding to the assignments.
Layer-3 forwarding information (including correspondence between selected IP addresses and respective MAC addresses) is maintained by system management, controlplane, and load balancing processes (also referred to collectively as “SCM processes” since they are executed by an SCM that is part of a SIM). The SCM processes provide portions of the layer-3 forwarding information to pluggable modules, typically as L 3 FIB updates. NMs include search engines accessing IP to MAC forwarding information that is managed by the SCM processes, and in some embodiments VIOCs access forwarding information (stored in coupled TCAM/SRAMs) that is also managed by the SCM processes.
Layer-2 networking protocols typically communicate source and destination information using MAC addresses, and pluggable modules in an ES system embodiment typically map each pluggable module MAC address to a corresponding fabric port address. The correspondence between module MAC addresses and fabric port addresses is maintained by the SCM processes, according to embodiment, and may be modified when a server is provisioned. The MAC address to fabric port address mapping (or forwarding) information is provided to pluggable modules, typically as L 2 FIB updates. The NM search engines access and manage a cache of MAC to fabric port forwarding information that is provided by the SCM processes. In some embodiments, VIOCs access and manage a cache of similar forwarding information (such as MACFIB information as discussed in the TCAM/SRAM Lookup State section, elsewhere herein) that is also managed by the SCM processes.
Server provisioning and management functions enable detection of a failed module, identification of a standby module (already available in the system), and automatic failover replacement of the failed module by the standby module. Any combination of the IP address and the MAC address assigned to the failed module is re-assigned to the standby module. In the following discussion, assume that the module that is going to fail is associated with a first IP address and a first MAC address, and that the standby module is associated with a second IP address and a second MAC address.
In a first group of embodiments, the standby module is associated with the first IP address (replacing or “taking over” the first IP address) as part of performing the module failover. The standby module remains associated with the second MAC address, and thus the first IP address should no longer be resolved to the first MAC address, but to the second MAC address.
In some embodiments, an Address Resolution Protocol (ARP) compatible address discovery mechanism is used to discover the new mapping when the remapped IP address is referenced. The new mapping is then propagated to the layer-3 forwarding information tables (such as those accessed by the NM search engine and the VIOCs, according to embodiment). In some embodiments, the SCM processes intercede during the ARP-compatible processing, recognizing a “local” IP address and providing a corresponding local MAC address without overheads typically associated with ARP-compatible processing. Local IP addresses include IP addresses allocated to pluggable modules (such as SIMs, PMMs, NMs, FCMs, and OLBs) within an ES system or within an ES chassis.
In other embodiments, the SCM processes actively update the new mapping in the layer-3 forwarding information tables upon the replacement event, irrespective of if or when the remapped IP address is referenced. Since the MAC addresses are unchanged in the first group of embodiments, the layer-2 forwarding information (such as mappings to fabric port addresses) is also unchanged. If there is a mapping between the service and an associated service address, then since the standby module has been assigned the first IP address, no change in the service address is made.
In a second group of embodiments, the standby module is associated with the first MAC address (taking over the first MAC address) as part of performing module failover. The second group of embodiments is typically used in conjunction with local service IP addresses (i.e. the service address is not visible external to the ES system), or in conjunction with a proxy, or in circumstances where changes to the service address are inconsequential. The standby module remains associated with the second IP address, and thus the mapping between the first IP address and the first MAC address is no longer valid, and a new mapping between the second IP address and the first MAC address is created.
As before, some implementations use the ARP-compatible mechanism and some implementations use the active update of the new mapping. Since the MAC address is changed, the layer-2 forwarding information is also changed accordingly, and the SCM processes actively propagate new MAC to fabric port address mapping information to the pluggable modules. If there is a mapping between the service and an associated service address, then since the standby module is assigned the second IP address, the service address is changed to the second IP address. Some implementations perform passive discovery of this new mapping via the ARP-compatible mechanism and some implementations use the active updating of the new mapping.
In a third group of embodiments, the standby module is associated with the first IP address and the first MAC address as part of performing module failover. The mapping between the first IP address and the first MAC address remains valid; however, the layer-2 mapping between the first MAC address and the associated fabric port is updated, and the associated layer-2 forwarding information is changed by active propagation to the pluggable modules. If there is a mapping between the service and an associated service address, then since the standby module has been assigned the first IP address, no change in the service address is made.
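The three groups can be summarized as three different update patterns applied to the module addresses and the forwarding tables. The sketch below is purely illustrative; the types and helper functions (l3_update, l2_update, set_service_ip) are hypothetical stand-ins for the SCM processes' FIB update mechanisms.

```c
/* Illustrative contrast of the three failover techniques as table updates.
 * IP1/MAC1 belong to the failed module, IP2/MAC2 to the standby module. */
#include <stdint.h>
#include <string.h>

struct module_addr { uint32_t ip; uint8_t mac[6]; uint16_t fabric_port; };

extern void l3_update(uint32_t ip, const uint8_t mac[6], uint16_t fport); /* hypothetical L3 FIB update */
extern void l2_update(const uint8_t mac[6], uint16_t fport);              /* hypothetical L2 FIB update */
extern void set_service_ip(uint32_t ip);                                  /* hypothetical service map   */

/* First group: standby takes over the failed module's IP address only.   */
void failover_take_ip(struct module_addr *standby, const struct module_addr *failed)
{
    standby->ip = failed->ip;                                    /* IP1 moves to standby */
    l3_update(standby->ip, standby->mac, standby->fabric_port);  /* IP1 -> MAC2, FPort1  */
    /* L2 forwarding and the service address are unchanged.                              */
}

/* Second group: standby takes over the failed module's MAC address only. */
void failover_take_mac(struct module_addr *standby, const struct module_addr *failed)
{
    memcpy(standby->mac, failed->mac, 6);                        /* MAC1 moves to standby */
    l3_update(standby->ip, standby->mac, standby->fabric_port);  /* new IP2 -> MAC1       */
    l2_update(standby->mac, standby->fabric_port);               /* MAC1 -> FPort1        */
    set_service_ip(standby->ip);                                 /* service moves to IP2  */
}

/* Third group: standby takes over both the IP and the MAC address.       */
void failover_take_both(struct module_addr *standby, const struct module_addr *failed)
{
    standby->ip = failed->ip;
    memcpy(standby->mac, failed->mac, 6);
    l2_update(standby->mac, standby->fabric_port);               /* MAC1 -> FPort1        */
    /* IP1 -> MAC1 remains valid; the service address is unchanged.                       */
}
```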
FIG. 14 illustrates a conceptual view of selected aspects of embodiments of IP and MAC address failover data structures and associated operations, including HW elements 1404 and IP/MAC Address and Forwarding Chart 1405 . Three techniques are illustrated, corresponding to one embodiment of each of the aforementioned three groups of embodiments.
The illustrated HW elements include only selected portions of an ES system embodiment, with Primary Switch Fabric Module 180 A providing communication between the included modules NM 130 B, Primary SCM 140 A, PMM 150 A, and PMM 150 B. The NM includes Search Engine 1406 to search state information included on the NM to perform layer-3 forwarding functions, including supplying a forwarding MAC address for a provided IP address. The SCM executes the SCM processes. PMM 150 A illustrates a “failure” PMM, in other words a PMM that is initially functioning properly, but then becomes non-functional. PMM 150 B illustrates a “standby” PMM, in other words a PMM that is initially unused (or spare), but is later used to replace the failed PMM. Each of the PMMs includes a VIOC ( 301 . 5 and 301 . 5 B) and a TCAM/SRAM ( 403 . 5 and 403 . 5 B) accessed in part for layer-2 and optionally for layer-3 forwarding functions, according to various implementations.
IP/MAC Address and Forwarding Chart 1405illustrates address and forwarding information before PMM 150 A fails, and corresponding information after PMM 150 B has replaced PMM 150 A, for each of the three techniques.
the chartis organized in rows and columns. “Initial MAC/IP” column 1410 shows information before the failure, and “Technique 1 MAC/IP” column 1411 , “Technique 2 MAC/IP” column 1412 , and “Technique 3 MAC/IP” column 1413 show final information after failure processing for the three techniques.
Failure PMM Address Row 1420 A and Standby PMM Address Row 1420 Bshow IP and MAC address information stored in VIOCs 301 . 5 and 301 . 5 B included in PMMs 150 A and 150 B respectively, for the initial and final states. More specifically, the failure and standby IP address information are stored in respective instances of Public IP Address 604 . 1 ( FIG. 6A ), corresponding to VNICs implemented in VIOCs 301 . 5 and 301 . 5 B. The failure and standby MAC addresses are stored in respective instances of MAC Address 603 . 1 .
MAC to Fabric Port Forwarding Rows 1430shows destination MAC address to destination fabric port address forwarding information stored in TCAM/SRAMs 403 . 5 and 403 . 5 B and referenced by VIOCs 301 . 5 and 301 . 5 B respectively. More specifically, 1430 shows key and result pairs as described in association with FIG. 8B . For example, instances of Egress Key 801 are programmed with MAC addresses as shown in 1430 (MAC 1 and MAC 2 ), and corresponding Unicast Result 803 instances are programmed with DstFabAddr 811 as shown in 1430 (FPort 0 and FPort 1 ). Typically identical information is stored in TCAM/SRAMs 403 . 5 and 403 . 5 B, such that both VIOC 301 .
IP to Fabric Port Forwarding Rows 1431 show destination IP address to destination fabric port address forwarding information referenced by Search Engine 1406.
Service Address Row 1432 shows an IP address associated with a service provided by PMM 150A before failing, and by PMM 150B after completion of failover processing. In some implementations the IP address to service mapping of 1432 is also referenced by Search Engine 1406.
“Address(es) stored in PMM” is shorthand for “address(es) stored in a VNIC implemented in a VIOC included in PMM”.
The shorthand terminology is meant to refer to storage in an instance of Public IP Address 604.1 for an IP address, and to storage in an instance of MAC Address 603.1 for a MAC address.
Initially, per “Initial MAC/IP” column 1410, the IP address stored in PMM 150A (the PMM that is to fail) is IP1 and the MAC address stored in PMM 150A is MAC1; the corresponding initial addresses stored in PMM 150B (the standby PMM) are IP2 and MAC2.
The initial MAC address to fabric port forwarding information stored in the TCAM/SRAMs of both PMM 150A and PMM 150B associates MAC address MAC1 (of PMM 150A) with fabric port 0 (FPort0, corresponding to slot 0) and MAC2 (of PMM 150B) with fabric port 1 (FPort1, corresponding to slot 1).
The initial IP to fabric port address forwarding information referenced by Search Engine 1406 associates IP1 with FPort0 and IP2 with FPort1.
The initial mapping for the service is to IP address IP1 (that of PMM 150A).
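Gathering this initial state, a minimal sketch of the “Initial MAC/IP” column 1410, with plain Python dicts standing in for the VIOC/VNIC registers, TCAM/SRAM entries, and search engine state (all variable names are illustrative, not part of the patent), is:

```python
# A sketch of the "Initial MAC/IP" column (1410) of Chart 1405, using plain
# dicts in place of VIOC/VNIC registers, TCAM/SRAM entries, and search engine
# state. Names are illustrative only.

# VNIC addresses (instances of Public IP Address 604.1 / MAC Address 603.1).
pmm_addresses = {
    "PMM_150A": {"ip": "IP1", "mac": "MAC1"},   # module that subsequently fails
    "PMM_150B": {"ip": "IP2", "mac": "MAC2"},   # standby module
}

# Rows 1430: destination MAC address -> destination fabric port address.
mac_to_fport = {"MAC1": "FPort0", "MAC2": "FPort1"}

# Rows 1431: destination IP address -> destination fabric port address.
ip_to_fport = {"IP1": "FPort0", "IP2": "FPort1"}

# Row 1432: the service is initially reachable at the failing module's IP.
service_ip = "IP1"
```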
In the first technique, the IP address stored in PMM 150B (the standby PMM that has replaced the failed PMM) is set to the value of the IP address previously stored in PMM 150A (the failed PMM). This is illustrated by PMM_B IP address 1450 (of “Technique 1 MAC/IP” column 1411) having the value IP1.
The IP and MAC address information stored in the VIOC of PMM 150A (the failed PMM) is no longer relevant, as the module is no longer being used.
The IP to fabric port address forwarding information has changed, since the replacement module has taken on the IP address of the failed module without also taking on the fabric port address of the failed module (i.e., the replacement module retains its own fabric port address). This is illustrated by IP to fabric port address entry 1454 having the value FPort1.
The MAC address to fabric port forwarding and service IP address mapping information are not changed (see the intersection of rows 1430 and 1432, respectively, with column 1411), as the initial mappings remain applicable. Note that the MAC address to fabric port forwarding information previously associated with MAC1 is no longer valid, as the MAC1 address is no longer being used.
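A minimal sketch of this first technique, operating on the plain-dict state from the earlier sketch and using a hypothetical helper function (not the patent's implementation), is:

```python
# Sketch of the first technique (column 1411): the standby PMM takes over the
# failed PMM's IP address but keeps its own MAC address. Hypothetical helper.
def technique_1_ip_takeover(pmm_addresses, ip_to_fport):
    failed = pmm_addresses["PMM_150A"]
    standby = pmm_addresses["PMM_150B"]
    standby["ip"] = failed["ip"]        # PMM_B IP address 1450 becomes IP1
    # The standby remains in slot 1, so IP1 now forwards to FPort1 (entry 1454).
    ip_to_fport[standby["ip"]] = "FPort1"
    # MAC-to-fabric-port forwarding (rows 1430) and the service mapping
    # (row 1432) are unchanged; the MAC1 entry is simply no longer referenced.
```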
In the second technique, the MAC address stored in PMM 150B (the replacement PMM) is set to the value of the MAC address previously stored in PMM 150A (the failed PMM). This is illustrated by PMM_B MAC address 1451 (of “Technique 2 MAC/IP” column 1412) having the value MAC1.
The IP and MAC address information stored in PMM 150A is no longer relevant, as the module is no longer being used.
The MAC address to fabric port forwarding information is changed, since the replacement PMM has a new MAC address but has remained inserted in the same slot. This is illustrated by MAC address to fabric port address entry 1455 (of “Technique 2 MAC/IP” column 1412) having the value FPort1.
The MAC address to fabric port forwarding information associated with MAC2 is no longer valid, and the MAC1 address is now associated with a different fabric port address.
The IP to fabric port address forwarding associated with the IP address of the failed module is now invalid.
The service IP address mapping has changed, since the replacement module is known by a different IP address than the failed module. This is illustrated by service IP address 1456 having the value IP2.
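A corresponding sketch of the second technique (again a hypothetical helper over the plain-dict state, not the patent's implementation) is:

```python
# Sketch of the second technique (column 1412): the standby PMM takes over the
# failed PMM's MAC address but keeps its own IP address. Hypothetical helper.
def technique_2_mac_takeover(pmm_addresses, mac_to_fport, ip_to_fport):
    failed = pmm_addresses["PMM_150A"]
    standby = pmm_addresses["PMM_150B"]
    standby["mac"] = failed["mac"]      # PMM_B MAC address 1451 becomes MAC1
    # MAC1 now forwards to the standby's slot (entry 1455 has value FPort1).
    mac_to_fport[standby["mac"]] = "FPort1"
    # Forwarding for the failed module's IP is now invalid and may be removed.
    ip_to_fport.pop(failed["ip"], None)
    # The service must be re-addressed to the standby's IP (1456 becomes IP2).
    return standby["ip"]                # new service IP address
```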
In the third technique, the IP and MAC addresses stored in PMM 150B are set to the corresponding values previously stored in PMM 150A (the failed PMM). This is illustrated by PMM_B IP address 1452 (of “Technique 3 MAC/IP” column 1413) having the value IP1, and PMM_B MAC address 1453 having the value MAC1.
The IP and MAC address information stored in PMM 150A is no longer relevant, as the module is no longer being used.
The MAC address to fabric port forwarding information is changed, as illustrated by MAC to fabric port address entry 1457 having the value FPort1.
The MAC address to fabric port forwarding information associated with MAC2 is no longer valid, and the MAC1 address is now associated with a different fabric port address.
The IP to fabric port address forwarding is changed, as illustrated by IP to fabric port address entry 1458 having the value FPort1.
The service IP address mapping information associated with IP1 is not changed, as the initial mappings remain applicable.
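A corresponding sketch of the third technique (hypothetical helper, same plain-dict state) is:

```python
# Sketch of the third technique (column 1413): the standby PMM takes over both
# the failed PMM's IP and MAC addresses. Hypothetical helper.
def technique_3_ip_and_mac_takeover(pmm_addresses, mac_to_fport, ip_to_fport):
    failed = pmm_addresses["PMM_150A"]
    standby = pmm_addresses["PMM_150B"]
    standby["ip"], standby["mac"] = failed["ip"], failed["mac"]  # 1452 / 1453
    # Both MAC1 and IP1 now forward to the standby's slot (entries 1457, 1458).
    mac_to_fport[standby["mac"]] = "FPort1"
    ip_to_fport[standby["ip"]] = "FPort1"
    # The service mapping to IP1 (row 1432) remains applicable unchanged.
```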
FIG. 15 illustrates a flow diagram of an embodiment of rapid IP address takeover in a context of replacing a failed module with a standby module.
An IP address originally associated with a first MAC address (corresponding to the failed module) is re-associated with a second MAC address (corresponding to the standby module).
Because the failover processing that directs the re-association is typically executed outside of the standby module, the re-association is often described as the standby module “taking over” the IP address from the failed module; this corresponds to the aforementioned first group of embodiments.
Processing begins at “Start” 1501, and then flows to “Detect Failed Module” 1510 upon determination that a module is no longer functional (such as PMM 150A as shown in FIG. 14). Flow then proceeds to “Identify Replacement Module” 1511 to determine a standby module to serve in place of the failed module (such as PMM 150B replacing PMM 150A). Processing continues at “Determine Replacement MAC Address” 1512, where the MAC address of the standby module is ascertained. This may be performed by consulting appropriate MAC address assignment or allocation tables maintained by the SCM processes, by reading state managed by the module (such as an instance of MAC Address 603.1), or by other similar mechanisms, according to embodiment. Note that this operation is distinct from determining an IP to MAC address mapping, as there is no specific IP address involved in “Determine Replacement MAC Address” 1512.
The standby module MAC address, and its correspondence to the IP address previously associated with the failed module, are made known throughout the SCM processes by updating a master Layer-3 FIB table (“Update Master L3 FIB” 1513).
Flow continues to “Update Module L3 FIBs” 1514, where the correspondence between the IP address and the standby module MAC address is actively disseminated to module tables (such as the forwarding information consulted by Search Engine 1406), replacing the stale correspondence to the failed module MAC address.
The SCM processes communicate with interface management processes that in turn provide updates to search engine lookup state via the switch fabric module. This contrasts with the more passive replacement of IP to MAC correspondence information for IP addresses external to an ES system (such as Client 103 of FIG. 1A) via Address Resolution Protocol (ARP) requests.
The illustrated embodiment of FIG. 15 is shown determining and propagating a new IP address to MAC address association (leaving an original IP address for an associated service intact) when replacing a failing module with a standby module (corresponding to the aforementioned first group of embodiments). Some embodiments also determine and propagate a new IP address to fabric port address association in conjunction with propagating a new IP address to MAC address association. Other embodiments determine and propagate a MAC address update, or both MAC and IP address updates (corresponding to the aforementioned second and third groups of embodiments, respectively), including propagating updates as appropriate for the following mappings: IP address to MAC address, MAC address to fabric port address, and IP address to fabric port address.
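A condensed sketch of the illustrated flow follows; all function and parameter names are hypothetical, and “Detect Failed Module” 1510 and “Identify Replacement Module” 1511 are assumed to have already yielded the failed and standby module identifiers.

```python
# Sketch of the FIG. 15 flow for rapid IP address takeover. Names are
# illustrative only; steps 1510 and 1511 are assumed to have run already.
def rapid_ip_takeover(failed_module, standby_module, module_macs,
                      module_ips, master_l3_fib, module_l3_fibs):
    # "Determine Replacement MAC Address" 1512: MAC of the standby module.
    standby_mac = module_macs[standby_module]
    # IP address previously associated with the failed module.
    ip = module_ips[failed_module]
    # "Update Master L3 FIB" 1513: record the new IP-to-MAC correspondence.
    master_l3_fib[ip] = standby_mac
    # "Update Module L3 FIBs" 1514: actively disseminate the new correspondence
    # to per-module tables (rather than relying on ARP as for external hosts).
    for fib in module_l3_fibs:
        fib[ip] = standby_mac
```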
The SCM processes program the Layer-2 and Layer-3 module tables (L2 FIBs and L3 FIBs) in their entirety with respect to all elements known to reside within an ES system.
All IP and MAC address to fabric port address correspondences are programmed into the TCAM/SRAM structures included on the PMMs and into the search engines included in the NMs.
The pre-programming of mapping information guarantees that references to local IP and MAC addresses will be found in the module tables (i.e. will be “hits”).
Upon IP and MAC takeover (for example during failover processing), the SCM processes update the L2 and L3 FIBs immediately, guaranteeing that later references to local IP and MAC addresses will continue to be hits in the module tables.
In some embodiments, the L2 and L3 FIB preprogramming is limited according to VLAN configuration, but is still sufficient to guarantee that local IP and MAC address references are hits, in order to conserve TCAM/SRAM entry usage.
For example, TCAM/SRAM 403.5 would be initially preprogrammed only with entries corresponding to the VLANs of which VNICs implemented in VIOC 301.5 are members, while TCAM/SRAM 403.5B would be initially programmed according to VLAN membership of VNICs implemented by VIOC 301.5B.
Thus there could be entries uniquely present in TCAM/SRAM 403.5, entries uniquely present in TCAM/SRAM 403.5B, and entries present in both TCAM/SRAMs 403.5 and 403.5B. Failover processing would immediately update and add TCAM/SRAM 403.5B entries in order to continue to guarantee local IP and MAC address hits.
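A minimal sketch of this VLAN-limited preprogramming and the subsequent failover update, assuming dict-based FIBs whose entries carry a VLAN tag (all names hypothetical), is:

```python
# Sketch of VLAN-limited preprogramming and the corresponding failover update.
# Hypothetical helpers; entries are keyed by address and carry a VLAN tag.
def preprogram_fib(full_fib, local_vnic_vlans):
    """Initially program only entries for VLANs the local VNICs belong to."""
    return {addr: entry for addr, entry in full_fib.items()
            if entry["vlan"] in local_vnic_vlans}

def failover_fib_update(standby_fib, failed_module_fib):
    """Immediately add entries the standby's TCAM/SRAM was not preprogrammed
    with, so local IP and MAC references remain hits after takeover."""
    for addr, entry in failed_module_fib.items():
        standby_fib.setdefault(addr, entry)
```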
NMs and PMMs may implement any combination of L2 and L3 FIBs and perform corresponding L2 and L3 forwarding lookups.
In some embodiments the L2 and L3 module tables are distinct, while in other embodiments the L2 and L3 module tables are implemented in a single combined module table, with L2 and L3 type entries being differentiated by a table identification field (of one or more bits) stored in the table and included in the lookup key.
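A minimal sketch of the combined-table alternative, assuming a one-bit table identification field (the encoding below is an assumption), is:

```python
# Sketch of a single combined module table in which a table identification
# field (one bit here, by assumption) in the lookup key distinguishes L2
# entries from L3 entries.
L2, L3 = 0, 1   # assumed encoding of the table identification field

combined_fib = {
    (L2, "MAC1"): "FPort0",   # layer-2 entry: destination MAC -> fabric port
    (L3, "IP1"):  "FPort0",   # layer-3 entry: destination IP  -> fabric port
}

def combined_lookup(table_id, address):
    # Including the table id in the key keeps L2 and L3 entries distinct.
    return combined_fib.get((table_id, address))
```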
The module tables (L2 and L3 FIBs) described in the foregoing are typically implemented as TCAM/SRAM elements associated with VIOCs included on PMMs.
TCAM/SRAM elements associated with VIOCs included on SCMs, FCMs, and OLBs, as well as TCAM/SRAM elements included on NMs, function similarly.
Those of ordinary skill in the art will readily appreciate how to extend the failover techniques to other module types having TCAM/SRAM elements implementing Layer-2 and Layer-3 module tables.
FIG. 16 illustrates an embodiment of a multi-chassis fabric-backplane ES system, also referred to simply as a “multi-chassis system”.
Servers may be provisioned from compute, storage, and I/O resources available via three chassis (ES 110X, ES 110Y, and ES 110Z), each similar to ES1 110A (see FIG. 1A).
The multi-chassis provisioning process is similar to that of a single chassis, as illustrated in FIGS. 9A, 9B, and 11 and their respective discussions, except that resources for provisioning are distributed amongst more than one chassis.
Each chassis includes an SFM (SFM 180X of ES 110X, for example) coupled to various compute, storage, and I/O modules.
The compute modules include two OLBs (OLB 160XA and OLB 160XB of ES 110X, for example), two PMMs (PMM 150XA and PMM 150XB of ES 110X, for example), and an SCM (SCM 140X of ES 110X, for example).
Storage modules include two FCMs (FCM 120XA and FCM 120XB of ES 110X, for example).
I/O modules include two NMs (NM 130XA and NM 130XB of ES 110X, for example).
ES 110Y and ES 110Z are similar to ES 110X (similar elements are identified with a ‘Y’ and a ‘Z’, respectively, in the identifier name instead of an ‘X’).
Fibre Channel standard storage arrays (or networks coupled to arrays, according to implementation) are coupled to each ES chassis, as illustrated by Fibre Channel Array/Networks 106X, 106Y, and 106Z, coupled to ESs 110X, 110Y, and 110Z, respectively.
Each ES system chassis is coupled to LAN/WAN/MAN/Internet network 1619, ES 110X via NM 130XB and coupling 1614, ES 110Y via NM 130YB and coupling 1615,