US11799714B2 - Device management using baseboard management controllers and management processors - Google Patents

Device management using baseboard management controllers and management processors

Info

Publication number
US11799714B2
Authority
US
United States
Prior art keywords
management
processor
electronic device
management processor
processors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US17/652,335
Other versions
US20230269126A1 (en
Inventor
Mohan Parthasarathy
Matthew James Muggeridge
Vinay VENUGOPAL
Srinivasan Varadarajan Sahasranamam
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Enterprise Development LP
Original Assignee
Hewlett Packard Enterprise Development LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development LP
Priority to US17/652,335
Assigned to HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SAHASRANAMAM, SRINIVASAN VARADARAJAN, MUGGERIDGE, MATTHEW JAMES, VENUGOPAL, Vinay, PARTHASARATHY, MOHAN
Publication of US20230269126A1
Application granted
Publication of US11799714B2
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2025 Failover techniques using centralised failover control functionality
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2028 Failover techniques eliminating a faulty processor or activating a spare
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2033 Failover techniques switching over of hardware resources
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0695 Management of faults, events, alarms or notifications the faulty arrangement being the maintenance, administration or management system
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route

Definitions

  • a computing environment can include a relatively large number (hundreds, thousands, etc.) of electronic devices, such as server computers, storage systems, communication nodes, Internet-of-things (IoT) devices, and so forth.
  • the computing environment is a data center associated with an enterprise, such as a business concern, a government agency, an education organization, an individual, and so forth.
  • the computing environment can include a cloud, a web-based computing environment, and so forth.
  • FIG. 1 is a block diagram of an arrangement that includes electronic devices and management components according to some examples.
  • FIG. 2 is a flow diagram of a process of management processors according to some examples.
  • FIG. 3 is a block diagram of a system according to some examples.
  • FIG. 4 is a block diagram of a storage medium storing machine-readable instructions according to some examples.
  • FIG. 5 is a flow diagram of a process according to some examples.
  • An electronic device in a computing environment can be managed from a remote location using a management component in the electronic device.
  • a systems administrator at a remote computer can access the management component in the electronic device over a network, which in some cases can be a management network that is separate from a data network used for data communications of the electronic device.
  • the management component in the electronic device includes a baseboard management controller (BMC).
  • a “BMC” can refer to a specialized service controller that monitors the physical state of an electronic device using sensors and communicates with a remote management system (that is remote from the electronic device) through an independent “out-of-band” connection.
  • the BMC can perform management tasks to manage components of the electronic device.
  • Examples of management tasks that can be performed by the BMC can include any or some combination of the following: power control to perform power management of the electronic device (such as to transition the electronic device between different power consumption states in response to detected events), thermal monitoring and control of the electronic device (such as to monitor temperatures of the electronic device and to control thermal management states of the electronic device), fan control of fans in the electronic device, system health monitoring based on monitoring measurement data from various sensors of the electronic device, remote access of the electronic device (to access the electronic device over a network, for example), remote reboot of the electronic device (to trigger the electronic device to reboot using a remote command), system setup and deployment of the electronic device, system security to implement security procedures in the electronic device, and so forth.
  • the BMC can provide so-called “lights-out” functionality for an electronic device.
  • the lights out functionality may allow a user, such as a systems administrator, to perform management operations on the electronic device even if an operating system (OS) is not installed or not functional on the electronic device.
  • the BMC can run on auxiliary power provided by an auxiliary power supply (e.g., a battery); as a result, the electronic device does not have to be powered on to allow the BMC to perform the BMC's operations.
  • the auxiliary power supply is separate from a main power supply that supplies power to other components (e.g., a main processor, a memory, an input/output (I/O) device, etc.) of the electronic device.
  • an additional management controller (separate from the BMCs) can be used to interact with the BMCs to perform management of the electronic devices.
  • the additional management controller can be referred to as a rack management controller (RMC).
  • a “rack” refers to a mounting structure that has supports for multiple electronic devices.
  • the additional management controller can be referred to as a “system management controller” to manage a collection of electronic devices that each further includes a BMC.
  • the system management controller can perform any or some combination of the following tasks.
  • the system management controller can provide a command interface to allow a remote management device, such as one associated with a systems administrator, to submit commands to the system management controller to perform electronic device management operations.
  • the interface can include a command-line interface in which commands are submitted to the system management controller as lines of text.
  • the interface can include a graphical user interface (GUI).
  • the system management controller can provide a RESTful API (API is an abbreviation for “application programming interface” and REST is an abbreviation for “REpresentational State Transfer”) that is an interface for management of information technology (IT) components such as servers, storage, etc.
  • the RESTful API is provided by a REDFISH interface, which is a standardized interface (defined by a suite of specifications) for management of electronic devices using various management tools, scripts, and so forth.
  • An interface provided by the system management controller can also be used by a remote entity to access information collected by the system management controller and/or the BMCs, such as analysis logs, reports, and so forth.
  • the system management controller can perform analysis of components (e.g., hardware, programs, etc.) of the electronic devices for faults.
  • a “fault” can refer to any anomaly in a component (hardware or machine-readable instructions or data) of an electronic device that prevents an expected operation of the electronic device.
  • the fault can include a hardware failure, machine-readable instructions crashing or producing an error, an error in data, and so forth.
  • the system management controller can predict a fault in the electronic devices. For example, the system management controller can monitor health data (e.g., measurement data from various sensors in the electronic devices) to predict when a fault is expected to occur.
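  • As an illustration of predicting a fault from health data, the following is a minimal sketch that fits a straight-line trend to recent sensor readings and estimates when a limit would be reached. The function name `samples_until_limit`, the one-reading-per-interval assumption, and the use of a linear fit are all hypothetical choices made for this example; the source does not specify a prediction method.

```python
from typing import Optional, Sequence


def samples_until_limit(readings: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many sampling intervals remain before `limit` is reached.

    Hypothetical helper: fits a least-squares line to the readings (one reading
    per interval) and extrapolates it. Returns None when there are too few
    readings or the trend is flat or falling, and 0.0 when the latest reading
    already meets or exceeds the limit.
    """
    n = len(readings)
    if n < 2:
        return None
    if readings[-1] >= limit:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Index at which the fitted line reaches the limit, expressed relative to
    # the most recent sample (index n - 1).
    return max(0.0, (limit - intercept) / slope - (n - 1))
```

  • A controller could use such an estimate to raise an alert before the limit is actually crossed; a real predictor would likely combine several sensors and a more robust model.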
  • the system management controller can initiate recovery actions in response to detected or predicted faults.
  • the recovery actions can include sending an alert to a target recipient (e.g., a systems administrator or another entity), powering off an electronic device, rebooting an electronic device, isolating a component in an electronic device, and so forth.
  • the system management controller can perform firmware management to manage versions of firmware loaded in the electronic devices.
  • firmware can include machine-readable instructions such as Basic Input/Output System (BIOS) code, firmware of hardware I/O devices, and so forth.
  • the system management controller can partition the electronic devices into multiple partitions (where each partition can include a collection of electronic devices, where a “collection of electronic devices” can refer to a single electronic device or multiple electronic devices). Each partition of electronic devices can be managed by the system management controller independently of another partition.
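  • The partitioning described above can be pictured with a minimal sketch that groups device identifiers by partition name and then acts on one partition without touching the others. The function names, the partition labels, and the textual "reboot" command are hypothetical and only illustrate the idea of independent management per partition.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_devices(assignments: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group device IDs by partition name, preserving input order.

    Each item of `assignments` is a (device_id, partition_name) pair. A
    partition may hold a single device or several devices.
    """
    partitions: Dict[str, List[str]] = defaultdict(list)
    for device_id, partition_name in assignments:
        partitions[partition_name].append(device_id)
    return dict(partitions)


def reboot_partition(partitions: Dict[str, List[str]], name: str) -> List[str]:
    """Return the (hypothetical) commands that would be issued for one partition only."""
    return [f"reboot {device_id}" for device_id in partitions.get(name, [])]
```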
  • the BMC in each electronic device can interface with the system management controller, and can perform management tasks instructed by the system management controller.
  • the system management controller is external of the electronic devices, and can be coupled to the BMCs over a network, such as a local area network (LAN) or another type of network.
  • the use of the system management controller to work with the BMCs of the respective individual electronic devices allows for overall system management that can be accomplished without the system management controller having to drill down into the details associated with each individual electronic device.
  • the BMC of an individual electronic device is aware of the specific configuration of each electronic device that can be leveraged for management of the electronic device using the combination of the system management controller and the BMC of the electronic device.
  • the load on the system management controller can be relatively high, which can adversely affect performance of management operations, since the system management controller may have to perform a relatively large number of management tasks in conjunction with the BMCs.
  • if the system management controller were to experience a fault (e.g., due to an error of machine-readable instructions in the system management controller, a failure of a hardware component, a failure of a communications link, etc.), then management of the electronic devices using the system management controller may not be possible.
  • a high-availability arrangement is provided to perform management of multiple electronic devices, such as server computers in a rack or another arrangement.
  • the electronic devices include BMCs and respective management processors.
  • the BMCs and management processors are “onboard” management components since they are included within respective electronic devices. This is in contrast with an external system management controller (as discussed above) that is external of the electronic devices.
  • the onboard management processors make up a cluster of management processors, and one (or more) of the management processors can be elected as a primary management processor to operate as a rack management controller (or more generally, a system management controller).
  • an onboard management processor can be used to perform the tasks of the system management controller.
  • a “management processor” refers to a processor that is able to perform management tasks, such as those of the system management controller discussed above.
  • a “processor” can include a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
  • the primary management processor can delegate management tasks to other onboard management processors in the cluster to offload such management tasks.
  • management tasks can be distributed across multiple management processors for load balancing to reduce the likelihood that any individual management processor is overburdened.
  • a “cluster of management processors” can refer to any grouping of management processors that are configured to work together to manage operations of electronic devices.
  • the cluster of management processors can also provide fault tolerance to provide high-availability operation. For example, if the primary management processor were to experience a fault, then failover to another management processor to become the primary management processor can be performed. Additionally, if any given management processor were to experience a fault, the primary management processor can delegate the management tasks of the given management processor to another management processor.
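  • The delegation and failover behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented mechanism: the `Cluster` class, its method names, the "fewest assigned tasks" load measure, and the name-based tie-break are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Hypothetical model of a cluster of management processors."""

    primary: str
    # Maps each management processor name to the tasks currently assigned to it.
    tasks: Dict[str, List[str]] = field(default_factory=dict)

    def _least_loaded(self) -> str:
        # Ties are broken by processor name so the choice is deterministic.
        return min(self.tasks, key=lambda p: (len(self.tasks[p]), p))

    def delegate(self, task: str) -> str:
        """Assign a task to the processor with the fewest tasks (load balancing)."""
        target = self._least_loaded()
        self.tasks[target].append(task)
        return target

    def handle_fault(self, failed: str) -> None:
        """Remove a faulty processor, fail over if it was primary, and
        redistribute its tasks among the remaining processors."""
        orphaned = self.tasks.pop(failed)
        if not self.tasks:
            raise RuntimeError("no management processor left in the cluster")
        if failed == self.primary:
            # Failover: another management processor becomes the primary.
            self.primary = self._least_loaded()
        for task in orphaned:
            self.delegate(task)
```

  • In this sketch the same routine covers both cases in the text: a fault in a member processor only triggers task redistribution, while a fault in the primary also promotes a replacement.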
  • FIG. 1 is a block diagram of an example arrangement that includes a rack 102 in which are mounted server computers 104-1, 104-2, ..., 104-N (N ≥ 2).
  • Although the server computers 104-1 to 104-N are shown mounted in the rack 102 in FIG. 1, in other examples the server computers 104-1 to 104-N can be in other types of mounting structures, or can be freely standing without being mounted in any type of mounting structure.
  • an arrangement can include other types of electronic devices, such as storage systems, communication nodes, IoT devices, and so forth.
  • the server computers 104-1 to 104-N can be scale-up or scale-out server nodes.
  • a “scale-up” system allows for support of additional workload by increasing the processing/storage capacity of each node, which is a server computer in FIG. 1 .
  • a “scale-out” system allows for support of additional workload by adding more nodes (e.g., server computers).
  • Each server computer includes a BMC, a management processor, and a central processing unit (CPU).
  • a CPU refers to the main processor of a server computer.
  • the server computer can include one CPU or multiple CPUs.
  • a CPU can include a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
  • each server computer further includes a BMC.
  • the server computer 104-1 includes a BMC 106-1, a management processor 108-1, and a CPU 110-1.
  • the server computer 104-2 includes a BMC 106-2, a management processor 108-2, and a CPU 110-2.
  • the server computer 104-N includes a BMC 106-N, a management processor 108-N, and a CPU 110-N.
  • Each management processor 108-1 to 108-N can perform any or some combination of the tasks that can be performed by the system management controller discussed above.
  • a “management processor” can also refer to a node controller in a server computer or another type of electronic device.
  • a node controller can perform power control and/or capping (capping the amount of power that can be used by a server computer).
  • the node controller can also collect measurement data that is accessible by a remote entity.
  • one (or more than one) of the management processors 108-1 to 108-N can be elected as a primary management processor to serve as a management controller (similar to the system management controller discussed further above).
  • the management controller implemented with the primary management processor is onboard a server computer and is not external of the server computers 104-1 to 104-N.
  • the management processor 108-2 has been elected as the primary management processor.
  • more than one management processor can be elected as a primary management processor, resulting in multiple primary management processors that can provide load balancing and/or redundancy in case of a fault of any of the primary management processors.
  • the management processors 108-1 to 108-N form a cluster of management processors. In further examples, less than all of the management processors 108-1 to 108-N can form a cluster of management processors. For example, the management processors 108-1 and 108-2 can be part of the cluster while the management processor 108-N is not part of the cluster. Omitting a management processor of the server computers 104-1 to 104-N can be performed for any of various reasons (e.g., the omitted management processor is non-functional or is experiencing a fault, the omitted management processor does not have a reliable network connection, the omitted management processor is not executing a correct version of machine-readable instructions, and so forth).
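  • The reasons given above for omitting a management processor from the cluster can be expressed as a simple eligibility filter. The following is a minimal sketch; the `ProcessorStatus` fields, the `eligible_members` function, and the version strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProcessorStatus:
    """Hypothetical status record for one management processor."""

    name: str
    functional: bool          # False if non-functional or experiencing a fault
    network_reliable: bool    # False if the network connection is unreliable
    software_version: str     # version of machine-readable instructions running


def eligible_members(statuses: Iterable[ProcessorStatus], required_version: str) -> List[str]:
    """Return names of processors that pass all three checks listed in the text."""
    return [
        s.name
        for s in statuses
        if s.functional
        and s.network_reliable
        and s.software_version == required_version
    ]
```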
  • in the example in which the management processor 108-2 has been elected as the primary management processor, the remaining management processors (108-1, 108-N) are referred to as “member management processors” to which the primary management processor can assign management tasks.
  • the management processors 108-1 to 108-N are connected to a management network 112.
  • the management network 112 is distinct and separate from a data network 114 that is used by the server computers 104-1 to 104-N to communicate data with each other as well as to communicate with other devices coupled to the data network 114.
  • Each management processor can include a network interface to connect to the management network 112 .
  • each management processor can be coupled through a network interface in the respective server computer to the management network 112 .
  • the management network 112 is an “out-of-band” network since the management network 112 allows management-related information to be communicated with the server computers 104-1 to 104-N over a communication path that is separate from the data network 114.
  • the management network 112 allows a remote management device 116 (such as a device associated with a systems administrator) or another entity to access the server computers 104-1 to 104-N to perform management operations with respect to the server computers.
  • the management device 116 can include a desktop computer, a notebook computer, a tablet computer, a smartphone, or any other type of electronic device.
  • each BMC is locally connected to its respective management processor.
  • the BMC 106-1 is connected over a communication link 118-1 to the management processor 108-1.
  • the BMC 106-2 is connected over a communication link 118-2 to the management processor 108-2.
  • the BMC 106-N is connected over a communication link 118-N to the management processor 108-N.
  • Each communication link 118-1, 118-2, or 118-N can be in the form of a local area network (LAN) or another type of communication link, such as a computer bus.
  • Each management processor 108-1, 108-2, or 108-N can present an interface (or multiple interfaces) that can be accessed by the management device 116.
  • the interface(s) can include a command line interface, a GUI, a REDFISH interface, and so forth. Although reference is made to the REDFISH interface in some examples, it is noted that in other examples, other types of interfaces (whether standardized or proprietary) can be presented by each management processor for the purpose of managing the respective server computer in which the management processor is located.
  • the management processors 108-1 to 108-N in the cluster can perform a distributed election process by which one (or more than one) management processor is elected as a primary management processor.
  • the distributed election process is performed based on cooperation of multiple management processors in the cluster.
  • the election of the primary management processor can be based on any one or some combination of the following criteria: a current processing load of a management processor (e.g., a given management processor with a lower processing load than other management processors in the cluster may be favored to be elected as the primary management processor); a network status of the management processor (e.g., a communication bandwidth of a management processor, where a given management processor with a higher communication bandwidth than other management processors in the cluster may be favored to be elected as the primary management processor); a health status of a management processor (e.g., based on faults detected in each management processor, where a given management processor that is healthier (has fewer faults) than other management processors in the cluster may be favored to be elected as the primary management processor); a random criterion (in which a primary management processor is randomly selected from among the cluster of management processors); and so forth.
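  • One way the election criteria above could be combined is sketched below. This is an assumption-laden illustration only: the source does not say how the criteria are weighted or how the distributed protocol exchanges state. Here each processor is assumed to already hold the same candidate list, candidates are ranked by fault count, then processing load, then bandwidth, and the random criterion is used only to break exact ties. The `Candidate` fields and `elect_primary` function are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of one management processor at election time."""

    name: str
    load: float            # fraction of processing capacity in use, 0.0 to 1.0
    bandwidth_mbps: float  # available communication bandwidth
    fault_count: int       # number of faults detected in this processor


def _rank(c: Candidate) -> Tuple[int, float, float]:
    # Lower tuples are better: fewer faults, lower load, higher bandwidth.
    return (c.fault_count, c.load, -c.bandwidth_mbps)


def elect_primary(candidates: Sequence[Candidate], rng: Optional[random.Random] = None) -> str:
    """Pick the best-ranked candidate; choose randomly among exact ties."""
    if not candidates:
        raise ValueError("cannot elect a primary from an empty cluster")
    best = min(_rank(c) for c in candidates)
    tied = [c for c in candidates if _rank(c) == best]
    chooser = rng if rng is not None else random.Random()
    return chooser.choice(tied).name
```

  • Electing more than one primary, as the text allows, could be approximated by taking the top few candidates of the same ranking instead of only the best one.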
  • Each server computer includes a respective OS, firmware, and other machine-readable instructions.
  • the server computer 104-1 includes an OS 120-1 and a firmware 122-1 (e.g., a BIOS), which are executable on the CPU 110-1.
  • the server computer 104-2 includes an OS 120-2 and a firmware 122-2 that are executable on the CPU 110-2.
  • the server computer 104-N includes an OS 120-N and a firmware 122-N that are executable on the CPU 110-N.
  • Each server computer further includes a main power supply and an auxiliary power supply.
  • the server computer 104-1 includes a main power supply 124-1 that supplies power to the CPU 110-1 and other components of the server computer 104-1, such as a memory, a persistent storage, a network interface controller to communicate over the data network 114, I/O devices, and so forth.
  • the server computer 104-1 also includes an auxiliary power supply 126-1 that supplies power to the management processor 108-1 and the BMC 106-1.
  • the auxiliary power supply 126-1 can power the management processor 108-1 and the BMC 106-1 even if the server computer 104-1 is powered off (i.e., the main power supply 124-1 is off).
  • Management operations can be performed with respect to the server computer 104-1 using the management processor 108-1 and the BMC 106-1 that are powered by the auxiliary power supply 126-1 even if the main power supply 124-1 is off and the OS 120-1 is not currently running in the server computer 104-1.
  • a main power supply can include a battery or can be powered by an external power source, such as a wall outlet.
  • An auxiliary power supply can include a battery or any other power source that is separate from the main power supply.
  • the server computer 104 - 2 includes a main power supply 124 - 2 to power the CPU 110 - 2 and other components of the server computer 104 - 2 , and an auxiliary power supply 126 - 2 to supply power to the management processor 108 - 2 and the BMC 106 - 2 .
  • the server computer 104 -N similarly includes a main power supply 124 -N and an auxiliary power supply 126 -N.
  • The primary management processor (which in the example of FIG. 1 is 108-2) coordinates with the remaining management processors (“member management processors”) of the cluster of management processors to perform management tasks.
  • The primary management processor 108-2 can assign management tasks to respective member management processors 108-1, 108-N to perform, and each respective member management processor can coordinate with a corresponding BMC to perform the respective management operation.
  • The primary management processor 108-2 can assign management tasks to itself to perform in coordination with the BMC 106-2 in the server computer 104-2.
  • In some examples, the cluster of management processors is a Kubernetes cluster of nodes.
  • Kubernetes is an open-source orchestration system that includes a primary controlling unit and nodes that are part of the Kubernetes cluster, where a node can refer to a machine in which containers (including workloads) are deployed.
  • A Kubernetes cluster can support the use of multiple primary controlling units to allow for the selection of multiple primary management processors.
  • A container can reside in a “pod,” which can include a collection of containers.
  • A “collection of containers” can refer to a single container or multiple containers.
  • A Kubernetes cluster includes pods that are executable on respective nodes (management processors in the context of FIG. 1).
  • A pod can include networking and storage resources that can be employed by the container(s) of the pod.
  • A networking resource can include an assigned network address, such as an Internet Protocol (IP) address, a network namespace, a network port, and so forth.
  • A storage resource can refer to a collection of storage volumes (e.g., logical volumes).
  • A pod is considered to be a self-contained, isolated logical host in which programs executed in the container(s) of the pod can run.
  • k8s is a full-weight version of Kubernetes, and k3s is a lightweight version of Kubernetes.
  • The primary management processor is a primary node within the Kubernetes cluster, and the remaining management processors are nodes on which containers can be deployed.
  • The primary node (primary management processor) of the Kubernetes cluster can schedule pods on any of the Kubernetes nodes (member management processors).
  • Management work can be distributed across the management processors to avoid any potential bottlenecks associated with use of a single system management controller in other examples. Additionally, by distributing the management workload across multiple management processors, more management services than may be possible using a single external system management controller (e.g., an RMC) can be provided.
  • Failover support is provided in the cluster of management processors for high availability. Failover support can allow another management processor to take over as a primary management processor if a current management processor experiences a fault. Failover support can also allow the primary management processor to reassign management tasks from a first member management processor to a second member management processor if the first member management processor experiences a fault.
  • A stateless process is a process that does not depend on another process, such as one executed in another container or pod or on another management processor.
  • Management tasks can be stateful processes that have dependencies on other processes.
  • A shared storage can be employed to store data shared by the multiple processes that have dependencies on one another, so that failover of a management task from one management processor to another management processor is possible.
  • FIG. 2 is a flow diagram that shows example tasks performed by the primary management processor 108-2 and the member management processors 108-1 and 108-N.
  • The primary management processor 108-2 receives (at 204) requests for management operations, such as from the management device 116 or another entity.
  • The requests can be received through a command line interface, GUI, or REDFISH interface of the primary management processor 108-2.
  • The primary management processor 108-2 schedules (at 206, 208) corresponding management tasks to be performed by the member management processors 108-1 and 108-N, respectively.
  • The primary management processor 108-2 can send an instruction or any other type of indicator to each member management processor to perform the corresponding management task.
  • The instruction or other indicator can include information that identifies the management task to be performed.
  • The member management processors 108-1 and 108-N perform (at 210 and 212, respectively) the scheduled management tasks.
  • The member management processor 108-1 can send (at 214) a completion indication to the primary management processor 108-2.
  • The member management processor 108-N can send (at 216) a completion indication to the primary management processor 108-2.
  • A “completion indication” can refer to a message, an information element, a signal, or another indicator that provides an indication that the requested management task has been completed. If the management task were to fail at a member management processor, then the member management processor can send an error indication instead of the completion indication to the primary management processor 108-2.
  • The primary management processor 108-2 can respond to the completion indications by notifying (at 218) the management device 116 or another requesting entity that the requested management operations have been completed.
  • The cluster of management processors 108-1 to 108-N can also provide failover support.
  • The primary management processor 108-2 can detect (at 220) whether a member management processor has experienced a fault. For example, in FIG. 2, it is assumed that the primary management processor 108-2 has detected that the member management processor 108-1 has experienced a fault. The detection can be based on use of heartbeat messages sent by each member management processor to the primary management processor 108-2. The heartbeat messages may be sent by a management processor on a periodic basis. In other examples, the primary management processor 108-2 can poll each member management processor (such as on a periodic basis) to trigger a response from the member management processor. If the member management processor does not respond to the polling, then the primary management processor 108-2 can make a determination that the member management processor has experienced a fault and a failover operation should be initiated.
  • The primary management processor 108-2 can reassign (at 222) management task(s) of the member management processor 108-1 to the member management processor 108-N.
  • The scheduled management task(s) can include any management task not completed by the member management processor 108-1, or any other management task that should have been assigned to the member management processor 108-1 but which cannot be so assigned due to the fault of the member management processor 108-1.
  • When a management task of the member management processor 108-1 is reassigned to the member management processor 108-N, the member management processor 108-N would interact with the BMC 106-1 of the server computer 104-1 (such as by communicating over the management network 112) to perform the reassigned management task.
  • Failover support can also include failing over from a current primary management processor to another primary management processor.
  • Each member management processor is able to detect whether the primary management processor 108-2 has experienced a fault.
  • The member management processor (108-1 or 108-N) can receive heartbeat messages from the primary management processor 108-2 on a periodic basis.
  • The member management processor (108-1 or 108-N) can poll the primary management processor 108-2 to determine whether the primary management processor 108-2 is still available.
  • In the example of FIG. 2, the member management processor 108-1 has detected (at 224) that the primary management processor 108-2 has experienced a fault (e.g., the member management processor 108-1 has not received a heartbeat message at an expected time or has not received a response to a polling request).
  • The member management processor 108-1 can initiate a primary management processor failover process by sending (at 226) to each other member management processor an indication of primary management processor fault.
  • The member management processors 108-1 and 108-N can elect (at 228) a new primary management processor, using criteria similar to those noted above for electing a primary management processor.
  • FIG. 3 is a block diagram of a system 300 according to some examples.
  • The system 300 can include an arrangement of electronic devices 302, such as the server computers 104-1 to 104-N shown in FIG. 1.
  • Each of the electronic devices 302 includes a respective management processor 304 and a BMC 306. At least some of the management processors 304 that are part of the electronic devices 302 can form a cluster 308 of management processors.
  • A management processor of the cluster of management processors is a primary management processor (indicated by a * in FIG. 3) to act as a management controller for the electronic devices 302.
  • The management controller interacts with the BMC 306 in a respective electronic device 302 to perform management of the respective electronic device 302.
  • The cluster of management processors can perform failover responsive to a fault of the primary management processor to select another management processor of the cluster of management processors as the management controller.
  • The management controller (the primary management processor 304 indicated with a * in FIG. 3) can schedule management tasks to be performed by multiple management processors of the cluster 308 of management processors that are part of the arrangement of electronic devices 302.
  • The management controller can schedule execution of containers on the multiple management processors to perform the management tasks, where each container includes a management program to interact with the BMC 306 of an electronic device 302 to perform management of the electronic device 302.
  • In some examples, another management processor of the cluster 308 of management processors is a further primary management processor to act as a further management controller for the plurality of electronic devices.
  • The multiple primary management processors can serve as management controllers in a high-availability arrangement, to provide load balancing and/or failover support.
  • The management controller is a proxy for a management device (e.g., 116 in FIG. 1) to access the arrangement of electronic devices 302 to perform management of the electronic devices 302.
  • The management processors of the cluster 308 are to collectively elect the primary management processor using a distributed election process.
  • The primary management processor can provide a management interface (e.g., a command line interface, a GUI, a REDFISH interface, etc.) accessible by a remote management device (e.g., 116 in FIG. 1) over a management network.
  • FIG. 4 is a block diagram of a non-transitory machine-readable or computer-readable storage medium 400 storing machine-readable instructions that upon execution cause a system to perform various tasks.
  • The machine-readable instructions include primary management processor election instructions 402 to elect, from among a plurality of management processors in respective electronic devices, a primary management processor.
  • The election of the primary management processor can be based on any one or some combination of criteria discussed further above.
  • The machine-readable instructions include management tasks scheduling instructions 404 to schedule, by the primary management processor, management tasks to be performed by respective management processors of the plurality of management processors.
  • The machine-readable instructions receive, from a remote management device at the primary management processor, requests for management operations, where the scheduling of the management tasks by the primary management processor is responsive to the requests.
  • The machine-readable instructions include BMC interactions instructions 406 to interact, by each corresponding management processor of the plurality of management processors, with a BMC in a respective electronic device of the electronic devices to perform a management task with respect to the respective electronic device.
  • A management processor can instruct a BMC to perform a management task, and the BMC can perform the management task in the respective electronic device.
  • The machine-readable instructions include failover instructions 408 to perform failover from the primary management processor to another management processor in response to detecting a fault of the primary management processor. In response to detecting the fault of the primary management processor, another primary management processor is elected.
  • FIG. 5 is a flow diagram of a process 500 according to some examples.
  • The process 500 includes electing (at 502), from among a plurality of management processors in respective electronic devices, a primary management processor using a distributed election process.
  • The plurality of management processors are part of a cluster, such as a Kubernetes cluster of nodes or another type of cluster.
  • The process 500 includes receiving (at 504), by the primary management processor over a management network, requests for management operations to be performed with respect to the electronic devices.
  • The requests can be received by the primary management processor from a remote management device, for example.
  • The process 500 includes scheduling (at 506), by the primary management processor in response to the requests, management tasks to be performed by respective management processors of the plurality of management processors. This distributes the management workload across multiple management processors.
  • The process 500 includes interacting (at 508), by each corresponding management processor of the plurality of management processors, with a BMC in a respective electronic device of the electronic devices to perform a management task with respect to the respective electronic device.
  • The process 500 includes performing (at 510) failover from the primary management processor to another management processor in response to detecting a fault of the primary management processor.
  • A storage medium can include any or some combination of the following: a semiconductor memory device such as a dynamic or static random access memory (a DRAM or SRAM), an erasable and programmable read-only memory (EPROM), an electrically erasable and programmable read-only memory (EEPROM) and flash memory; a magnetic disk such as a fixed, floppy and removable disk; another magnetic medium including tape; an optical medium such as a compact disk (CD) or a digital video disk (DVD); or another type of storage device.
  • Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture).
  • An article or article of manufacture can refer to any manufactured single component or multiple components.
  • The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
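As a rough illustration of the FIG. 2 flow described in the list above (requests received at 204, tasks scheduled at 206 and 208, tasks performed at 210 and 212, completion or error indications returned at 214 and 216, and the requester notified at 218), the following Python sketch simulates the exchange in a single process. The class names, the round-robin scheduling, and the "fail:" task prefix are hypothetical choices made only for this sketch; the source does not specify message formats or a scheduling policy.

```python
from dataclasses import dataclass, field


@dataclass
class MemberProcessor:
    """Hypothetical stand-in for a member management processor."""
    name: str

    def perform(self, task: str) -> dict:
        # A real member would interact with a BMC here. This sketch only
        # returns a completion indication, or an error indication for any
        # task whose name starts with the made-up "fail:" prefix.
        status = "error" if task.startswith("fail:") else "completed"
        return {"member": self.name, "task": task, "status": status}


@dataclass
class PrimaryProcessor:
    """Hypothetical stand-in for the primary management processor."""
    members: list
    notifications: list = field(default_factory=list)

    def handle_requests(self, requests: list) -> list:
        indications = []
        # Schedule one task per request, round-robin across the members
        # (an assumed policy), and collect each returned indication.
        for i, task in enumerate(requests):
            member = self.members[i % len(self.members)]
            indications.append(member.perform(task))
        # Record what would be reported back to the requesting entity.
        self.notifications.extend(indications)
        return indications


primary = PrimaryProcessor([MemberProcessor("108-1"), MemberProcessor("108-N")])
results = primary.handle_requests(["reboot 104-1", "fail:update firmware 104-N"])
```

In an actual deployment the members would run on separate management processors and the indications would travel over the management network; the single-process form is used here only to keep the sketch self-contained.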

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

In some examples, a system includes a plurality of electronic devices each comprising a respective management processor and a baseboard management controller (BMC). A management processor of a cluster of management processors is a primary management processor to act as a management controller for the plurality of electronic devices. The management controller interacts with the BMC in a respective electronic device to perform management of the respective electronic device. The cluster of management processors performs failover responsive to a fault of the primary management processor to select another management processor of the cluster of the management processors as the management controller.

Description

BACKGROUND
A computing environment can include a relatively large number (hundreds, thousands, etc.) of electronic devices, such as server computers, storage systems, communication nodes, Internet-of-things (IoT) devices, and so forth. In some examples, the computing environment is a data center associated with an enterprise, such as a business concern, a government agency, an education organization, an individual, and so forth. In further examples, the computing environment can include a cloud, a web-based computing environment, and so forth.
BRIEF DESCRIPTION OF THE DRAWINGS
Some implementations of the present disclosure are described with respect to the following figures.
FIG. 1 is a block diagram of an arrangement that includes electronic devices and management components according to some examples.
FIG. 2 is a flow diagram of a process of management processors according to some examples.
FIG. 3 is a block diagram of a system according to some examples.
FIG. 4 is a block diagram of a storage medium storing machine-readable instructions according to some examples.
FIG. 5 is a flow diagram of a process according to some examples.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements. The figures are not necessarily to scale, and the size of some parts may be exaggerated to more clearly illustrate the example shown. Moreover, the drawings provide examples and/or implementations consistent with the description; however, the description is not limited to the examples and/or implementations provided in the drawings.
DETAILED DESCRIPTION
In the present disclosure, use of the term “a,” “an,” or “the” is intended to include the plural forms as well, unless the context clearly indicates otherwise. Also, the term “includes,” “including,” “comprises,” “comprising,” “have,” or “having” when used in this disclosure specifies the presence of the stated elements, but does not preclude the presence or addition of other elements.
An electronic device in a computing environment can be managed from a remote location using a management component in the electronic device. For example, a systems administrator at a remote computer (or other remote entity) can access the management component in the electronic device over a network, which in some cases can be a management network that is separate from a data network used for data communications of the electronic device.
In some examples, the management component in the electronic device includes a baseboard management controller (BMC).
A “BMC” can refer to a specialized service controller that monitors the physical state of an electronic device using sensors and communicates with a remote management system (that is remote from the electronic device) through an independent “out-of-band” connection. The BMC can perform management tasks to manage components of the electronic device. Examples of management tasks that can be performed by the BMC can include any or some combination of the following: power control to perform power management of the electronic device (such as to transition the electronic device between different power consumption states in response to detected events), thermal monitoring and control of the electronic device (such as to monitor temperatures of the electronic device and to control thermal management states of the electronic device), fan control of fans in the electronic device, system health monitoring based on monitoring measurement data from various sensors of the electronic device, remote access of the electronic device (to access the electronic device over a network, for example), remote reboot of the electronic device (to trigger the electronic device to reboot using a remote command), system setup and deployment of the electronic device, system security to implement security procedures in the electronic device, and so forth.
In some examples, the BMC can provide so-called “lights-out” functionality for an electronic device. The lights-out functionality may allow a user, such as a systems administrator, to perform management operations on the electronic device even if an operating system (OS) is not installed or not functional on the electronic device.
Moreover, in some examples, the BMC can run on auxiliary power provided by an auxiliary power supply (e.g., a battery); as a result, the electronic device does not have to be powered on to allow the BMC to perform the BMC's operations. The auxiliary power supply is separate from a main power supply that supplies power to other components (e.g., a main processor, a memory, an input/output (I/O) device, etc.) of the electronic device.
In some examples, in addition to the BMC in each electronic device, an additional management controller (separate from the BMCs) can be used to interact with the BMCs to perform management of the electronic devices. In examples where the electronic devices are server computers (or other types of electronic devices) mounted in a rack, the additional management controller can be referred to as a rack management controller (RMC). A “rack” refers to a mounting structure that has supports for multiple electronic devices.
More generally, the additional management controller can be referred to as a “system management controller” to manage a collection of electronic devices that each further includes a BMC.
The system management controller can perform any or some combination of the following tasks. For example, the system management controller can provide a command interface to allow a remote management device, such as one associated with a systems administrator, to submit commands to the system management controller to perform electronic device management operations. The interface can include a command-line interface in which commands are submitted to the system management controller as lines of text. In other examples, the interface can include a graphical user interface (GUI).
Alternatively or additionally, the system management controller can provide a RESTful API (API is an abbreviation for “application programming interface” and REST is an abbreviation for “REpresentational State Transfer”) that is an interface for management of information technology (IT) components such as servers, storage, etc. An example of the RESTful API is provided by a REDFISH interface, which is a standardized interface (defined by a suite of specifications) for management of electronic devices using various management tools, scripts, and so forth.
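To make the idea of a RESTful management interface concrete, the following Python sketch builds (but does not send) an HTTP request that asks a Redfish service to restart a system. The resource path and the "ResetType" value follow the DMTF Redfish ComputerSystem schema as commonly published; the host name and system identifier are hypothetical, and a real service would also require authentication and may support a different set of reset types. The source does not state which Redfish resources the described controller uses, so this is an assumption for illustration only.

```python
import json
from urllib.request import Request


def build_reset_request(host: str, system_id: str,
                        reset_type: str = "ForceRestart") -> Request:
    """Build a POST request for the Redfish ComputerSystem.Reset action."""
    url = (f"https://{host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Hypothetical host and system identifier; nothing is sent over the network.
req = build_reset_request("bmc.example.com", "1")
```

Passing the returned object to urllib.request.urlopen would issue the request; that step is omitted so the sketch runs without a reachable Redfish service.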
An interface provided by the system management controller can also be used by a remote entity to access information collected by the system management controller and/or the BMCs, such as analysis logs, reports, and so forth.
The system management controller can perform analysis of components (e.g., hardware, programs, etc.) of the electronic devices for faults. A “fault” can refer to any anomaly in a component (hardware or machine-readable instructions or data) of an electronic device that prevents an expected operation of the electronic device. For example, the fault can include a hardware failure, machine-readable instructions crashing or producing an error, an error in data, and so forth.
The system management controller can predict a fault in the electronic devices. For example, the system management controller can monitor health data (e.g., measurement data from various sensors in the electronic devices) to predict when a fault is expected to occur.
The system management controller can initiate recovery actions in response to detected or predicted faults. The recovery actions can include sending an alert to a target recipient (e.g., a systems administrator or another entity), powering off an electronic device, rebooting an electronic device, isolating a component in an electronic device, and so forth.
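A minimal sketch of how fault prediction and recovery selection might fit together is shown below. It assumes temperature samples as the health data, a simple linear extrapolation from the last two samples as the prediction, and made-up limit and horizon values; none of these specifics come from the source, which leaves the prediction technique open.

```python
def predict_fault(readings, limit, horizon):
    """Return True if extrapolating the last two readings linearly
    reaches `limit` within `horizon` further samples."""
    if len(readings) < 2:
        return False
    slope = readings[-1] - readings[-2]
    projected = readings[-1] + slope * horizon
    return projected >= limit


def choose_recovery_action(readings, limit=90.0, horizon=3):
    """Map sensor readings to one of the recovery actions named in the text."""
    if readings and readings[-1] >= limit:
        return "power_off"   # detected fault: the limit is already exceeded
    if predict_fault(readings, limit, horizon):
        return "send_alert"  # predicted fault: warn a target recipient
    return "none"
```

For example, readings of 80, 84, and 88 project to 100 after three more samples, so the sketch selects "send_alert", while a latest reading of 91 selects "power_off".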
The system management controller can perform firmware management to manage versions of firmware loaded in the electronic devices. For example, firmware can include machine-readable instructions such as Basic Input/Output System (BIOS) code, firmware of hardware I/O devices, and so forth.
The system management controller can partition the electronic devices into multiple partitions (where each partition can include a collection of electronic devices, where a “collection of electronic devices” can refer to a single electronic device or multiple electronic devices). Each partition of electronic devices can be managed by the system management controller independently of another partition.
The BMC in each electronic device can interface with the system management controller, and can perform management tasks instructed by the system management controller. In some examples, the system management controller is external of the electronic devices, and can be coupled to the BMCs over a network, such as a local area network (LAN) or another type of network.
The use of the system management controller to work with the BMCs of the respective individual electronic devices allows for overall system management that can be accomplished without the system management controller having to drill down into the details associated with each individual electronic device. The BMC of an individual electronic device is aware of the specific configuration of that electronic device, and this awareness can be leveraged for management of the electronic device using the combination of the system management controller and the BMC of the electronic device.
In a large computing environment with a relatively large number of electronic devices (e.g., hundreds, thousands, hundreds of thousands, millions, etc., of electronic devices), the load on the system management controller can be relatively high, which can adversely affect performance of management operations, since the system management controller may have to perform a relatively large number of management tasks in conjunction with the BMCs.
Also, if the system management controller were to experience a fault (e.g., due to an error of machine-readable instructions in the system management controller, a failure of a hardware component, a failure of a communications link, etc.), then management of the electronic devices using the system management controller may not be possible.
In accordance with some implementations of the present disclosure, a high-availability arrangement is provided to perform management of multiple electronic devices, such as server computers in a rack or another arrangement. The electronic devices include BMCs and respective management processors. The BMCs and management processors are “onboard” management components since they are included within respective electronic devices. This is in contrast with an external system management controller (as discussed above) that is external of the electronic devices.
The onboard management processors make up a cluster of management processors, and one (or more) of the management processors can be elected as a primary management processor to operate as a rack management controller (or more generally, a system management controller). Thus, instead of using a system management controller that is external of electronic devices that are being managed, an onboard management processor can be used to perform the tasks of the system management controller.
A “management processor” refers to a processor that is able to perform management tasks, such as those of the system management controller discussed above. As used here, a “processor” can include a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
In accordance with some implementations of the present disclosure, the primary management processor can delegate management tasks to other onboard management processors in the cluster to offload such management tasks. As a result, management tasks can be distributed across multiple management processors for load balancing to reduce the likelihood that any individual management processor is overburdened.
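One way to picture this delegation is a least-loaded assignment, sketched below in Python. The policy (always give the next task to the processor with the fewest tasks so far) and the task names are assumptions for illustration; the source does not prescribe a particular load-balancing rule.

```python
import heapq


def delegate(tasks, processors):
    """Assign each task to the processor with the fewest tasks so far."""
    # Heap entries are (current load, processor name); ties break by name.
    heap = [(0, name) for name in processors]
    heapq.heapify(heap)
    assignment = {name: [] for name in processors}
    for task in tasks:
        load, name = heapq.heappop(heap)
        assignment[name].append(task)
        heapq.heappush(heap, (load + 1, name))
    return assignment


plan = delegate(["t1", "t2", "t3", "t4", "t5"], ["108-1", "108-2", "108-N"])
```

With five tasks and three processors, no processor ends up with more than one task beyond any other, which is the balancing effect the paragraph above describes.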
A “cluster of management processors” can refer to any grouping of management processors that are configured to work together to manage operations of electronic devices.
In addition to load balancing, the cluster of management processors can also provide fault tolerance to provide high-availability operation. For example, if the primary management processor were to experience a fault, then failover to another management processor to become the primary management processor can be performed. Additionally, if any given management processor were to experience a fault, the primary management processor can delegate the management tasks of the given management processor to another management processor.
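The second case, in which the primary management processor delegates the tasks of a faulted management processor to another one, can be sketched as follows. The function name and the choice of the least-loaded survivor as the target are hypothetical; the source only states that the tasks are delegated to another management processor.

```python
def reassign_tasks(assignment, faulted):
    """Move the tasks of `faulted` to the least-loaded surviving processor."""
    survivors = {name: list(tasks)
                 for name, tasks in assignment.items() if name != faulted}
    if not survivors:
        raise RuntimeError("no surviving management processor")
    for task in assignment.get(faulted, []):
        # Pick the survivor that currently holds the fewest tasks.
        target = min(survivors, key=lambda name: len(survivors[name]))
        survivors[target].append(task)
    return survivors


before = {"108-1": ["a", "b"], "108-2": ["c"], "108-N": []}
after = reassign_tasks(before, "108-1")
```

The input mapping is left unmodified, so a caller can compare the assignment before and after the failover.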
FIG. 1 is a block diagram of an example arrangement that includes a rack 102 in which are mounted server computers 104-1, 104-2, . . . , 104-N (N≥2).
Although reference is made to the server computers 104-1 to 104-N being mounted in the rack 102 in FIG. 1, in other examples, the server computers 104-1 to 104-N can be in other types of mounting structures, or can be freely standing without being mounted in any type of mounting structure.
Also, although reference is made to server computers in FIG. 1, in other examples, an arrangement can include other types of electronic devices, such as storage systems, communication nodes, IoT devices, and so forth.
In some examples, the server computers 104-1 to 104-N can be scale-up or scale-out server nodes. A “scale-up” system allows for support of additional workload by increasing the processing/storage capacity of each node, which is a server computer in FIG. 1 . A “scale-out” system allows for support of additional workload by adding more nodes (e.g., server computers).
Each server computer includes a BMC, a management processor, and a central processing unit (CPU). A CPU refers to the main processor of a server computer. The server computer can include one CPU or multiple CPUs. A CPU can include a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit.
In the arrangement of FIG. 1, instead of using a single system management controller that is external of the server computers 104-1 to 104-N, local (onboard) management processors are included in the respective server computers. Each server computer further includes a BMC. Thus, for example, the server computer 104-1 includes a BMC 106-1, a management processor 108-1, and a CPU 110-1. Similarly, the server computer 104-2 includes a BMC 106-2, a management processor 108-2, and a CPU 110-2, and the server computer 104-N includes a BMC 106-N, a management processor 108-N, and a CPU 110-N.
Each management processor 108-1 to 108-N can perform any or some combination of the tasks that can be performed by the system management controller discussed above.
In other examples, a “management processor” can also refer to a node controller in a server computer or another type of electronic device. For example, a node controller can perform power control and/or capping (capping the amount of power that can be used by a server computer). The node controller can also collect measurement data that is accessible by a remote entity.
In accordance with some examples of the present disclosure, one (or more than one) of the management processors 108-1 to 108-N can be elected as a primary management processor to serve as a management controller (similar to the system management controller discussed further above). However, the management controller implemented with the primary management processor is onboard a server computer and is not external of the server computers 104-1 to 104-N.
In the example of FIG. 1, the management processor 108-2 has been elected as the primary management processor. In further examples, more than one management processor can be elected as a primary management processor, resulting in multiple primary management processors that can provide load balancing and/or redundancy in case of a fault of any of the primary management processors.
Even in examples where just a single primary management processor is elected, high availability is still provided since another management processor can be elected as the primary management processor in case a current primary management processor experiences a fault.
The management processors 108-1 to 108-N form a cluster of management processors. In further examples, less than all of the management processors 108-1 to 108-N can form a cluster of management processors. For example, the management processors 108-1 and 108-2 can be part of the cluster while the management processor 108-N is not part of the cluster. Omitting a management processor of the server computers 104-1 to 104-N can be performed for any of various reasons (e.g., the omitted management processor is non-functional or is experiencing a fault, the omitted management processor does not have a reliable network connection, the omitted management processor is not executing a correct version of machine-readable instructions, and so forth).
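The omission reasons listed above can be read as a simple eligibility filter. The following is a minimal sketch under stated assumptions: the `ManagementProcessor` fields, the `EXPECTED_VERSION` value, and the function names are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass

# Hypothetical value; the text only refers to "a correct version".
EXPECTED_VERSION = "1.4.2"


@dataclass
class ManagementProcessor:
    name: str
    functional: bool = True
    network_reliable: bool = True
    instructions_version: str = EXPECTED_VERSION


def eligible_for_cluster(mp: ManagementProcessor) -> bool:
    """Return True if none of the example omission reasons applies."""
    return (
        mp.functional
        and mp.network_reliable
        and mp.instructions_version == EXPECTED_VERSION
    )


def form_cluster(processors):
    """Keep only the management processors that pass the eligibility check."""
    return [mp for mp in processors if eligible_for_cluster(mp)]


processors = [
    ManagementProcessor("108-1"),
    ManagementProcessor("108-2"),
    ManagementProcessor("108-N", network_reliable=False),
]
cluster = form_cluster(processors)
print([mp.name for mp in cluster])  # ['108-1', '108-2']
```

In this sketch the management processor 108-N is left out of the cluster because its network connection is marked unreliable, which mirrors the example in which 108-1 and 108-2 form the cluster without 108-N.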
In examples where the management processor 108-2 has been elected as the primary management processor, the remaining management processors (108-1, 108-N) are referred to as “member management processors” to which the primary management processor can assign management tasks.
As further shown in FIG. 1, the management processors 108-1 to 108-N are connected to a management network 112. The management network 112 is distinct and separate from a data network 114 that is used by the server computers 104-1 to 104-N to communicate data with each other as well as to communicate with other devices coupled to the data network 114. Each management processor can include a network interface to connect to the management network 112. Alternatively, each management processor can be coupled through a network interface in the respective server computer to the management network 112.
The management network 112 is an “out-of-band” network since the management network 112 allows management-related information to be communicated with the server computers 104-1 to 104-N over a communication path that is separate from the data network 114.
The management network 112 allows a remote management device 116 (such as a device associated with a systems administrator) or another entity to access the server computers 104-1 to 104-N to perform management operations with respect to the server computers. In some examples, the management device 116 can include a desktop computer, a notebook computer, a tablet computer, a smartphone, or any other type of electronic device.
In addition, each BMC is locally connected to its respective management processor. Thus, the BMC 106-1 is connected over a communication link 118-1 to the management processor 108-1, the BMC 106-2 is connected over a communication link 118-2 to the management processor 108-2, and the BMC 106-N is connected over a communication link 118-N to the management processor 108-N.
Each communication link 118-1, 118-2, or 118-N can be in the form of a local area network (LAN) or another type of communication link, such as a computer bus.
Each management processor 108-1, 108-2, or 108-N can present an interface (or multiple interfaces) that can be accessed by the management device 116. The interface(s) can include a command line interface, a GUI, a REDFISH interface, and so forth. Although reference is made to the REDFISH interface in some examples, it is noted that in other examples, other types of interfaces (whether standardized or proprietary) can be presented by each management processor for the purpose of managing the respective server computer in which the management processor is located.
The management processors 108-1 to 108-N in the cluster can perform a distributed election process by which one (or more than one) management processor is elected as a primary management processor. The distributed election process is performed based on cooperation of multiple management processors in the cluster. The election of the primary management processor can be based on any one or some combination of the following criteria: a current processing load of a management processor (e.g., a given management processor with a lower processing load than other management processors in the cluster may be favored to be elected as the primary management processor); a network status of the management processor (e.g., a communication bandwidth of a management processor, where a given management processor with a higher communication bandwidth than other management processors in the cluster may be favored to be elected as the primary management processor); a health status of a management processor (e.g., based on faults detected in each management processor, where a given management processor that is healthier (has fewer faults) than other management processors in the cluster may be favored to be elected as the primary management processor); a random criterion (in which a primary management processor is randomly selected from among the cluster of management processors); and so forth.
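One way to picture how the election criteria could be combined is a deterministic ranking that every management processor computes over the same candidate data, so that all of them arrive at the same winner. This is only an illustrative sketch: the `Candidate` fields, the order in which the criteria are applied, and the name-based tie-break are assumptions, and the disclosure does not prescribe any particular weighting or protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    load: float            # fraction of processing capacity in use, 0.0 to 1.0
    bandwidth_mbps: float  # available communication bandwidth
    fault_count: int       # faults detected so far


def election_key(c: Candidate):
    """Sort key: fewer faults first, then higher bandwidth, then lower load.

    The ordering of the criteria and the name tie-break are assumptions.
    """
    return (c.fault_count, -c.bandwidth_mbps, c.load, c.name)


def elect_primary(candidates):
    """Return the candidate that every processor would rank first."""
    if not candidates:
        raise ValueError("cannot elect a primary from an empty cluster")
    return min(candidates, key=election_key)


cluster = [
    Candidate("108-1", load=0.40, bandwidth_mbps=1000, fault_count=1),
    Candidate("108-2", load=0.55, bandwidth_mbps=1000, fault_count=0),
    Candidate("108-N", load=0.20, bandwidth_mbps=100, fault_count=0),
]
primary = elect_primary(cluster)
print(primary.name)  # 108-2
```

Here 108-2 wins because it has no detected faults and a higher bandwidth than 108-N, even though its processing load is higher. A different ordering of the criteria, or a random selection, would be equally consistent with the description.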
Each server computer includes a respective OS, firmware, and other machine-readable instructions. For example, the server computer 104-1 includes an OS 120-1 and a firmware 122-1 (e.g., a BIOS), which are executable on the CPU 110-1. The server computer 104-2 includes an OS 120-2 and a firmware 122-2 that are executable on the CPU 110-2, and the server computer 104-N includes an OS 120-N and a firmware 122-N that are executable on the CPU 110-N.
Each server computer further includes a main power supply and an auxiliary power supply. For example, the server computer 104-1 includes a main power supply 124-1 that supplies power to the CPU 110-1 and other components of the server computer 104-1, such as a memory, a persistent storage, a network interface controller to communicate over the data network 114, I/O devices, and so forth. The server computer 104-1 also includes an auxiliary power supply 126-1 that supplies power to the management processor 108-1 and the BMC 106-1. The auxiliary power supply 126-1 can power the management processor 108-1 and the BMC 106-1 even if the server computer 104-1 is powered off (i.e., the main power supply 124-1 is off).
Management operations can be performed with respect to the server computer 104-1 using the management processor 108-1 and the BMC 106-1 that are powered by the auxiliary power supply 126-1 even if the main power supply 124-1 is off and the OS 120-1 is not currently running in the server computer 104-1.
A main power supply can include a battery or can be powered by an external power source, such as a wall outlet. An auxiliary power supply can include a battery or any other power source that is separate from the main power supply.
Similarly, the server computer 104-2 includes a main power supply 124-2 to power the CPU 110-2 and other components of the server computer 104-2, and an auxiliary power supply 126-2 to supply power to the management processor 108-2 and the BMC 106-2. The server computer 104-N similarly includes a main power supply 124-N and an auxiliary power supply 126-N.
The primary management processor (which in the example of FIG. 1 is 108-2) coordinates with the remaining management processors (“member management processors”) of the cluster of management processors to perform management tasks. For example, the primary management processor 108-2 can assign management tasks to respective member management processors 108-1, 108-N to perform, and each respective member management processor can coordinate with a corresponding BMC to perform the respective management operation. Note that the primary management processor 108-2 can assign management tasks to itself to perform in coordination with the BMC 106-2 in the server computer 104-2.
In some examples, the cluster of management processors is a Kubernetes cluster of nodes. Kubernetes is an open-source orchestration system that includes a primary controlling unit and nodes that are part of the Kubernetes cluster, where a node can refer to a machine in which containers (including workloads) are deployed.
In some examples, a Kubernetes cluster can support the use of multiple primary controlling units to allow for the selection of multiple primary management processors.
A “container” refers to an isolated computing environment in which a service can execute while being isolated from a service executed in another container. Service(s) executing in a container can be provided by a program and any associated components, such as libraries. In some examples according to the present disclosure, the program that can be run in a container can be a management program to perform management tasks of a management processor as discussed above.
According to Kubernetes, a container can reside in a “pod,” which can include a collection of containers. A “collection of containers” can refer to a single container or multiple containers. A Kubernetes cluster includes pods that are executable on respective nodes (management processors in the context of FIG. 1).
When a pod runs multiple containers, the containers can share resources of the pod. A pod can include networking and storage resources that can be employed by the container(s) of the pod. A networking resource can include an assigned network address, such as an Internet Protocol (IP) address, a network namespace, a network port, and so forth. A storage resource can refer to a collection of storage volumes (e.g., logical volumes). A pod is considered to be a self-contained, isolated logical host in which programs executed in the container(s) of the pod can run.
In some examples, a full weight version of Kubernetes (referred to as “k8s”) can be employed. In other examples, a lightweight version of Kubernetes (referred to as “k3s”) can be employed.
Although reference is made to Kubernetes in some examples, in other examples, other clustering arrangements can be employed, such as Pacemaker clusters, Red Hat clusters, and so forth.
In examples where the cluster of management processors includes a Kubernetes cluster, then the primary management processor is a primary node within the Kubernetes cluster, and the remaining management processors are nodes on which containers can be deployed. The primary node (primary management processor) of the Kubernetes cluster can schedule pods on any of the Kubernetes nodes (member management processors).
By leveraging multiple onboard management processors in the cluster, management work can be distributed across the management processors to avoid any potential bottlenecks associated with use of a single system management controller in other examples. Additionally, by distributing the management workload across multiple management processors, more management services than may be possible using a single external system management controller (e.g., an RMC) can be provided.
In addition, failover support is provided in the cluster of management processors for high availability. Failover support can allow another management processor to take over as a primary management processor if a current primary management processor experiences a fault. Failover support can also allow the primary management processor to reassign management tasks from a first member management processor to a second member management processor if the first member management processor experiences a fault.
In some examples, failing over management tasks from one management processor to another management processor is possible if the management tasks are stateless processes. A stateless process is a process that does not depend on another process, such as one executed in another container or pod or on another management processor.
In other examples, management tasks can be stateful processes that have dependencies on other processes. In such examples, a shared storage can be employed to store data shared by the multiple processes that have dependencies on one another, so that failover of a management task from one management processor to another management processor is possible.
FIG. 2 is a flow diagram that shows example tasks performed by the primary management processor 108-2 and member management processors 108-1 and 108-N.
Initially, such as when the management processors start up, or in response to other events (such as specified time intervals, error events, etc.), the management processors 108-1 to 108-N can perform (at 202) a primary management processor election process to elect a primary management processor using criteria as discussed further above. In examples according to FIGS. 1 and 2, the election process elects the management processor 108-2 as the primary management processor. The remaining management processors 108-1 and 108-N are member management processors.
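The role of the shared storage can be illustrated with a task that records its progress after every step, so that a second management processor can resume where the first one stopped. The sketch below is hypothetical throughout: `SharedStorage` is an in-memory stand-in for storage reachable by every management processor, and the task and step names are invented for illustration.

```python
class SharedStorage:
    """In-memory stand-in for storage reachable by every management processor."""

    def __init__(self):
        self._data = {}

    def save(self, task_id, state):
        self._data[task_id] = dict(state)

    def load(self, task_id):
        # A task with no checkpoint starts at its first step.
        return dict(self._data.get(task_id, {"next_step": 0}))


def run_task(task_id, steps, storage, processor, log, fail_after=None):
    """Run the remaining steps in order, checkpointing after each one.

    fail_after simulates a fault once that many steps have run on this
    processor. Returns True if the task finished, False on a fault.
    """
    state = storage.load(task_id)
    executed = 0
    while state["next_step"] < len(steps):
        if fail_after is not None and executed == fail_after:
            return False  # simulated fault; the checkpoint stays in storage
        log.append((processor, steps[state["next_step"]]))
        state["next_step"] += 1
        storage.save(task_id, state)
        executed += 1
    return True


storage = SharedStorage()
steps = ["stage-image", "verify-image", "apply-update"]  # hypothetical steps
log = []

finished = run_task("update-1", steps, storage, "108-1", log, fail_after=2)
if not finished:
    # Failover: another management processor resumes from the checkpoint.
    finished = run_task("update-1", steps, storage, "108-N", log)

print(log)
```

In this run the first two steps are logged by 108-1 and the last step by 108-N, with no step repeated, because the progress marker lives in the shared storage rather than in either management processor.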
The primary management processor 108-2 receives (at 204) requests for management operations, such as from the management device 116 or another entity. The requests can be received through a command line interface, GUI, or REDFISH interface of the primary management processor 108-2.
In response to the requests for management operations, the primary management processor 108-2 schedules (at 206, 208) corresponding management tasks to be performed by the member management processors 108-1 and 108-N, respectively. For example, the primary management processor 108-2 can send an instruction or any other type of indicator to each member management processor to perform the corresponding management task. The instruction or other indicator can include information that identifies the management task to be performed.
In response to the scheduled management tasks, the member management processors 108-1 and 108-N perform (at 210 and 212, respectively) the scheduled management tasks. In response to completion of the corresponding management tasks, the member management processor 108-1 can send (at 214) a completion indication to the primary management processor 108-2, and the member management processor 108-N can send (at 216) a completion indication to the primary management processor 108-2.
A “completion indication” can refer to a message, an information element, a signal, or another indicator that provides an indication that the requested management task has been completed. If the management task were to fail at a member management processor, then the member management processor can send an error indication instead of the completion indication to the primary management processor 108-2.
The primary management processor 108-2 can respond to the completion indications by notifying (at 218) the management device 116 or another requesting entity that the requested management operations have been completed.
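The request, schedule, perform, and notify sequence of FIG. 2 can be sketched as follows. The class names, the dictionary shape of an indication, the round-robin assignment, and the task names are all assumptions made for illustration; the description does not specify a scheduling policy or a message format.

```python
class MemberProcessor:
    """Hypothetical member management processor."""

    def __init__(self, name, failing_tasks=()):
        self.name = name
        self.failing_tasks = set(failing_tasks)

    def perform(self, task):
        """Perform a task and return a completion or error indication."""
        status = "error" if task in self.failing_tasks else "completed"
        return {"member": self.name, "task": task, "status": status}


class PrimaryProcessor:
    """Hypothetical primary management processor."""

    def __init__(self, members):
        self.members = members

    def handle_requests(self, requests):
        """Schedule one task per request (round-robin) and collect indications."""
        indications = []
        for i, task in enumerate(requests):
            member = self.members[i % len(self.members)]
            indications.append(member.perform(task))
        return indications

    @staticmethod
    def notify(indications):
        """Build the notification returned to the requesting entity."""
        return {ind["task"]: ind["status"] for ind in indications}


members = [
    MemberProcessor("108-1"),
    MemberProcessor("108-N", failing_tasks={"read-sensors"}),
]
primary = PrimaryProcessor(members)
indications = primary.handle_requests(["power-cap", "read-sensors"])
print(primary.notify(indications))
```

The second task is reported with an error status, corresponding to the case in which a member management processor sends an error indication instead of a completion indication.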
The cluster of management processors 108-1 to 108-N can also provide failover support. For example, the primary management processor 108-2 can detect (at 220) whether a member management processor has experienced a fault. For example, in FIG. 2, it is assumed that the primary management processor 108-2 has detected that the member management processor 108-1 has experienced a fault. The detection can be based on use of heartbeat messages sent by each member management processor to the primary management processor 108-2. The heartbeat messages may be sent by a management processor on a periodic basis. In other examples, the primary management processor 108-2 can poll each member management processor (such as on a periodic basis) to trigger a response from the member management processor. If the member management processor does not respond to the polling, then the primary management processor 108-2 can make a determination that the member management processor has experienced a fault and a failover operation should be initiated.
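Heartbeat-based detection can be sketched as a table of last-seen times checked against a timeout. The `HeartbeatMonitor` class, the timeout value, and the use of explicit timestamps (rather than a real clock) are assumptions chosen to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each member management processor."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def record(self, member, now):
        """Record a heartbeat received from a member at time now (seconds)."""
        self.last_seen[member] = now

    def faulted(self, now):
        """Return members whose last heartbeat is older than the timeout."""
        return sorted(
            member
            for member, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)  # hypothetical timeout
monitor.record("108-1", now=0.0)
monitor.record("108-N", now=0.0)
monitor.record("108-N", now=2.0)   # 108-1 misses this heartbeat interval
print(monitor.faulted(now=4.0))    # ['108-1']
```

A polling variant would look similar, with the primary management processor recording the time of the last successful poll response instead of the last received heartbeat.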
In response to detecting by the primary management processor 108-2 that the member management processor 108-1 has experienced a fault, the primary management processor 108-2 can reassign (at 222) management task(s) of the member management processor 108-1 to the member management processor 108-N. The scheduled management task(s) can include any management task not completed by the member management processor 108-1, or any other management task that should have been assigned to the member management processor 108-1 but which cannot be so assigned due to the fault of the member management processor 108-1.
When a management task of the member management processor 108-1 is reassigned to the member management processor 108-N, the member management processor 108-N would interact with the BMC 106-1 of the server computer 104-1, such as by communicating over the management network 112, to perform the reassigned management task.
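The key point of the reassignment is that only the responsible management processor changes, while the task still targets the BMC in the original server computer. A minimal sketch, in which the `Task` fields and the task names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    target_bmc: str   # BMC the task acts on, e.g. "106-1"
    assignee: str     # management processor currently responsible
    done: bool = False


def reassign_tasks(tasks, faulted, replacement):
    """Move unfinished tasks from the faulted processor to the replacement.

    Only the assignee changes; target_bmc still names the BMC in the
    original electronic device, which the replacement would reach over
    the management network.
    """
    moved = []
    for task in tasks:
        if task.assignee == faulted and not task.done:
            task.assignee = replacement
            moved.append(task)
    return moved


tasks = [
    Task("power-cap", target_bmc="106-1", assignee="108-1"),
    Task("read-sensors", target_bmc="106-1", assignee="108-1", done=True),
    Task("power-cap", target_bmc="106-N", assignee="108-N"),
]
moved = reassign_tasks(tasks, faulted="108-1", replacement="108-N")
print([(t.name, t.target_bmc, t.assignee) for t in moved])
# [('power-cap', '106-1', '108-N')]
```

The completed task of 108-1 is not moved, and the reassigned task keeps 106-1 as its target even though 108-N now performs it.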
Failover support can also include failing over from a current primary management processor to another primary management processor.
Each member management processor is able to detect whether the primary management processor 108-2 has experienced a fault. For example, the member management processor (108-1 or 108-N) can receive heartbeat messages from the primary management processor 108-2 on a periodic basis. As another example, the member management processor (108-1 or 108-N) can poll the primary management processor 108-2 to determine whether the primary management processor 108-2 is still available.
In the example of FIG. 2, it is assumed that the member management processor 108-1 has detected (at 224) that the primary management processor 108-2 has experienced a fault (e.g., the member management processor 108-1 has not received a heartbeat message at an expected time or has not received a response to a polling request). In response, the member management processor 108-1 can initiate a primary management processor failover process by sending (at 226) to each other member management processor an indication of primary management processor fault. In the example of FIG. 2, it is assumed that there are just two member management processors 108-1 and 108-N. In other examples, there may be more than two member management processors.
After the primary management processor fault indication has been sent to initiate a primary management processor failover process, the member management processors 108-1 and 108-N can elect (at 228) a new primary management processor, using similar criteria as noted above for electing a primary management processor.
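The primary failover steps (detect at 224, indicate at 226, elect at 228) can be sketched as below. The message shape, the function names, and the choice of lowest processing load as the sole election criterion are assumptions for illustration; any of the criteria discussed earlier could be used instead.

```python
def start_failover(detector, members):
    """Return the fault indications the detecting member sends to its peers."""
    return [
        {"from": detector, "to": peer, "type": "primary-fault"}
        for peer in members
        if peer != detector
    ]


def elect_new_primary(loads):
    """Pick the surviving member with the lowest load (name breaks ties)."""
    return min(loads, key=lambda name: (loads[name], name))


# Surviving member management processors and their hypothetical loads;
# the faulted primary 108-2 is excluded from the candidates.
member_loads = {"108-1": 0.6, "108-N": 0.3}

indications = start_failover("108-1", list(member_loads))
new_primary = elect_new_primary(member_loads)

print(indications)   # one indication, sent from 108-1 to 108-N
print(new_primary)   # 108-N
```

With only two member management processors, a single fault indication is sent, and the member with the lower load becomes the new primary management processor in this sketch.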
FIG. 3 is a block diagram of a system 300 according to some examples. The system 300 can include an arrangement of electronic devices 302, such as the server computers 104-1 to 104-N shown in FIG. 1.
Each of the electronic devices 302 includes a respective management processor 304 and a BMC 306. At least some of the management processors 304 that are part of the electronic devices 302 can form a cluster 308 of management processors.
A management processor of the cluster of the management processors is a primary management processor (indicated by a * in FIG. 3) to act as a management controller for the electronic devices 302. The management controller interacts with the BMC 306 in a respective electronic device 302 to perform management of the respective electronic device 302. The cluster of management processors can perform failover responsive to a fault of the primary management processor to select another management processor of the cluster of management processors as the management controller.
In some examples, the management controller (the primary management processor 304 indicated with a * in FIG. 3) can schedule management tasks to be performed by multiple management processors of the cluster 308 of the management processors that are part of the arrangement of electronic devices 302.
In some examples, the management controller can schedule execution of containers on the multiple management processors to perform the management tasks, where each container includes a management program to interact with the BMC 306 of an electronic device 302 to perform management of the electronic device 302.
In some examples, the management controller can detect a fault in a given management processor of the multiple management processors, and in response to detecting the fault, schedule a management task of the given management processor on another management processor of the cluster of the management processors.
In some examples, another management processor of the cluster 308 of management processors is a further primary management processor to act as a further management controller for the plurality of electronic devices. The multiple primary management processors can serve as management controllers in a high-availability arrangement, to provide load balancing and/or failover support.
In some examples, the management controller is a proxy for a management device (e.g., 116 in FIG. 1) to access the arrangement of electronic devices 302 to perform management of the electronic devices 302.
In some examples, the management processors of the cluster 308 are to collectively elect the primary management processor using a distributed election process.
In some examples, the primary management processor can provide a management interface (e.g., a command line interface, a GUI, a REDFISH interface, etc.) accessible by a remote management device (e.g., 116 in FIG. 1) over a management network.
FIG. 4 is a block diagram of a non-transitory machine-readable or computer-readable storage medium 400 storing machine-readable instructions that upon execution cause a system to perform various tasks.
The machine-readable instructions include primary management processor election instructions 402 to elect, from among a plurality of management processors in respective electronic devices, a primary management processor. The election of the primary management processor can be based on any one or some combination of criteria discussed further above.
The machine-readable instructions include management tasks scheduling instructions 404 to schedule, by the primary management processor, management tasks to be performed by respective management processors of the plurality of management processors. In some examples, the machine-readable instructions receive, from a remote management device at the primary management processor, requests for management operations, where the scheduling of the management tasks by the primary management processor is responsive to the requests.
The machine-readable instructions include BMC interactions instructions 406 to interact, by each corresponding management processor of the plurality of management processors, with a BMC in a respective electronic device of the electronic devices to perform a management task with respect to the respective electronic device. A management processor can instruct a BMC to perform a management task, and the BMC can perform the management task in the respective electronic device.
The machine-readable instructions include failover instructions 408 to perform failover from the primary management processor to another management processor in response to detecting a fault of the primary management processor. In response to detecting the fault of the primary management processor, another primary management processor is elected.
FIG. 5 is a flow diagram of a process 500 according to some examples. The process 500 includes electing (at 502), from among a plurality of management processors in respective electronic devices, a primary management processor using a distributed election process. In some examples, the plurality of management processors are part of a cluster, such as a Kubernetes cluster of nodes or another type of cluster.
The process 500 includes receiving (at 504), by the primary management processor over a management network, requests for management operations to be performed with respect to the electronic devices. The requests can be received by the primary management processor from a remote management device, for example.
The process 500 includes scheduling (at 506), by the primary management processor in response to the requests, management tasks to be performed by respective management processors of the plurality of management processors. This distributes the management workload across multiple management processors.
The process 500 includes interacting (at 508), by each corresponding management processor of the plurality of management processors, with a BMC in a respective electronic device of the electronic devices to perform a management task with respect to the respective electronic device.
The process 500 includes performing (at 510) failover from the primary management processor to another management processor in response to detecting a fault of the primary management processor.
A storage medium (e.g., 400 in FIG. 4) can include any or some combination of the following: a semiconductor memory device such as a dynamic or static random access memory (a DRAM or SRAM), an erasable and programmable read-only memory (EPROM), an electrically erasable and programmable read-only memory (EEPROM) and flash memory; a magnetic disk such as a fixed, floppy and removable disk; another magnetic medium including tape; an optical medium such as a compact disk (CD) or a digital video disk (DVD); or another type of storage device. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.

Claims (17)

What is claimed is:
1. A system comprising:
a plurality of electronic devices comprising respective management processors and respective baseboard management controllers (BMCs),
wherein the management processors form a cluster of management processors, and wherein the management processors are to collectively elect, using a distributed election process, a first management processor of the cluster of management processors as a primary management processor based on respective communication bandwidths of the management processors, the primary management processor to act as a management controller for the plurality of electronic devices,
the management controller to assign a management task to a second management processor of the cluster of management processors, wherein the second management processor is in a first electronic device of the plurality of electronic devices, and the second management processor to interact with a first BMC in the first electronic device to start the management task in the first electronic device, and
the management controller to:
detect a fault of the second management processor in the first electronic device, and
in response to detecting the fault of the second management processor, reassign the management task of the second management processor to a third management processor in a second electronic device that is separate from the first electronic device,
wherein after the reassignment of the management task to the third management processor, the third management processor in the second electronic device is to interact with the first BMC in the first electronic device to complete the management task.
2. The system of claim 1, wherein the management controller is to schedule management tasks to be performed by multiple management processors of the cluster of management processors that are part of the plurality of electronic devices.
3. The system of claim 2, wherein the management controller is to schedule execution of containers on the multiple management processors to perform the management tasks, wherein each container of the containers is an isolated computing environment and comprises a management program to interact with the BMC of an electronic device of the plurality of electronic devices to perform management of the electronic device.
4. The system of claim 1, wherein the cluster of management processors is to perform failover responsive to a fault of the primary management processor to select another management processor of the cluster of management processors as the management controller.
5. The system of claim 1, wherein another management processor of the cluster of management processors is a further primary management processor to act as a further management controller for the plurality of electronic devices.
6. The system of claim 4, wherein the fault of the primary management processor is indicated by a further management processor of the cluster of management processors responsive to the further management processor failing to receive a heartbeat message from the primary management processor.
7. The system of claim 1, wherein the management controller is part of one of the plurality of electronic devices.
8. The system of claim 1, wherein the management controller is a proxy for a management device to access the first and second electronic devices to perform the management task.
9. The system of claim 1, wherein the cluster of management processors comprises a Kubernetes cluster.
10. The system of claim 1, wherein each corresponding electronic device of the plurality of electronic devices comprises an auxiliary power supply to supply power to the BMC and the management processor of the corresponding electronic device, wherein the BMC and the management processor of the corresponding electronic device remain on even when the corresponding electronic device is powered off.
11. The system of claim 1, wherein the primary management processor is to provide a management interface accessible by a remote management device over a management network.
12. The system of claim 11, wherein the management interface comprises an application programming interface for management of electronic devices.
13. A non-transitory machine-readable storage medium comprising instructions that upon execution cause a system to:
elect, by a plurality of management processors in respective electronic devices, a primary management processor from among the plurality of management processors using a distributed election process based on respective communication bandwidths of the plurality of management processors;
schedule, by the primary management processor, a management task to be performed by a first member management processor of the plurality of management processors, wherein the first member management processor is in a first electronic device of the electronic devices;
interact, by the first member management processor, with a baseboard management controller (BMC) in the first electronic device to start the management task in the first electronic device;
detect, at the primary management processor, a fault of the first member management processor in the first electronic device;
in response to detecting the fault of the first member management processor, reassign the management task of the first member management processor to a second member management processor in a second electronic device that is separate from the first electronic device; and
after the reassignment of the management task to the second member management processor, interact, by the second member management processor in the second electronic device, with the BMC in the first electronic device to complete the management task.
14. The non-transitory machine-readable storage medium of claim 13, wherein the instructions upon execution cause the system to:
receive, from a remote management device at the primary management processor, a request for a management operation, wherein the scheduling of the management task by the primary management processor is responsive to the request.
15. The non-transitory machine-readable storage medium of claim 13, wherein the scheduling of the management task by the primary management processor comprises scheduling a container on the first member management processor, the container being an isolated computing environment and comprising a management program.
16. The non-transitory machine-readable storage medium of claim 15, wherein the primary management processor is internal to one of the electronic devices.
17. A method comprising:
electing, by a plurality of management processors in respective electronic devices, a primary management processor from among the plurality of management processors using a distributed election process based on respective communication bandwidths of the plurality of management processors;
receiving, by the primary management processor over a management network, a request for a management operation to be performed with respect to a first electronic device of the electronic devices;
scheduling, by the primary management processor in response to the request, a management task to be performed by a first member management processor of the plurality of management processors, wherein the first member management processor is in the first electronic device;
interacting, by the first member management processor, with a first baseboard management controller (BMC) in the first electronic device to start the management task in the first electronic device;
detecting, by the primary management processor, a fault of the first member management processor in the first electronic device;
in response to detecting the fault of the first member management processor, reassigning the management task of the first member management processor to a second member management processor in a second electronic device that is separate from the first electronic device; and
after the reassignment of the management task to the second member management processor, interacting, by the second member management processor in the second electronic device, with the BMC in the first electronic device to complete the management task.
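
The steps recited in claim 17 (election, scheduling, fault detection, reassignment) can be illustrated with a minimal single-process Python sketch. It is not the patented implementation. The names ManagementProcessor, Task, elect_primary, schedule, and reassign_on_fault are hypothetical. The sketch assumes the election favors the highest communication bandwidth, with ties broken by device identifier, and it collapses the distributed election into one function call. Fault detection is reduced to a boolean flag rather than, for example, missed heartbeat messages, and BMC interaction is represented only by log entries.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementProcessor:
    """Hypothetical stand-in for the management processor of one device."""
    device_id: str
    bandwidth_mbps: float  # assumed election metric
    healthy: bool = True   # assumed result of some fault detector


@dataclass
class Task:
    name: str
    target_device: str  # device whose BMC the task acts on
    assigned_to: str    # device_id of the member processor running the task
    log: list = field(default_factory=list)


def elect_primary(processors):
    """Pick the healthy processor with the highest bandwidth (assumption);
    ties are broken by the lowest device_id."""
    candidates = [p for p in processors if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy management processor to elect")
    return min(candidates, key=lambda p: (-p.bandwidth_mbps, p.device_id))


def schedule(task_name, target_device, processors):
    """Schedule the task on the member processor inside the target device."""
    member = next(p for p in processors if p.device_id == target_device)
    task = Task(task_name, target_device, member.device_id)
    task.log.append(f"{member.device_id} started {task_name} "
                    f"via BMC of {target_device}")
    return task


def reassign_on_fault(task, processors):
    """If the assigned member has failed, move the task to a healthy member
    in a different device, which then completes it against the same BMC."""
    current = next(p for p in processors if p.device_id == task.assigned_to)
    if current.healthy:
        return task
    replacement = next(
        (p for p in processors
         if p.healthy and p.device_id != task.target_device),
        None,
    )
    if replacement is None:
        raise RuntimeError("no healthy member available for reassignment")
    task.assigned_to = replacement.device_id
    task.log.append(f"{replacement.device_id} completed {task.name} "
                    f"via BMC of {task.target_device}")
    return task


cluster = [
    ManagementProcessor("dev-a", 100.0),
    ManagementProcessor("dev-b", 1000.0),
    ManagementProcessor("dev-c", 1000.0),
]
primary = elect_primary(cluster)                       # dev-b wins the tie
task = schedule("firmware-update", "dev-a", cluster)   # runs on dev-a
cluster[0].healthy = False                             # dev-a's processor faults
task = reassign_on_fault(task, cluster)                # dev-b takes over
```

In this sketch the task's target device stays the same after reassignment; only the processor doing the work changes, which mirrors the claim language in which the second member management processor interacts with the BMC in the first electronic device.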
US17/652,335 2022-02-24 2022-02-24 Device management using baseboard management controllers and management processors Active US11799714B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/652,335 US11799714B2 (en) 2022-02-24 2022-02-24 Device management using baseboard management controllers and management processors

Publications (2)

Publication Number Publication Date
US20230269126A1 US20230269126A1 (en) 2023-08-24
US11799714B2 US11799714B2 (en) 2023-10-24

Family ID: 87574809

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/652,335 Active US11799714B2 (en) 2022-02-24 2022-02-24 Device management using baseboard management controllers and management processors

Country Status (1)

Country Link
US (1) US11799714B2 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220374320A1 (en) * 2022-06-14 2022-11-24 Intel Corporation Reliability availability serviceability (ras) service framework
US11915059B2 (en) * 2022-07-27 2024-02-27 Oracle International Corporation Virtual edge devices
CN115562950B (en) * 2022-12-05 2023-03-17 苏州浪潮智能科技有限公司 Data acquisition method and device and computer equipment
US12361133B2 (en) * 2023-03-14 2025-07-15 Dell Products, L.P. System-level service discovery in a multi-baseboard management controller (BMC) environment
US12355619B2 (en) 2023-06-08 2025-07-08 Oracle International Corporation Multi-tier deployment architecture for distributed edge devices
US20250328437A1 (en) * 2024-04-19 2025-10-23 Nvidia Corporation Processor scheduling

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7234073B1 (en) * 2003-09-30 2007-06-19 Emc Corporation System and methods for failover management of manageable entity agents
US20150373115A1 (en) * 2014-06-23 2015-12-24 Liqid Inc. Modular switched fabric for data storage systems
US9853938B2 (en) 2014-09-08 2017-12-26 Quanta Computer Inc. Automatic generation of server network topology
US10015023B2 (en) 2014-09-08 2018-07-03 Quanta Computer Inc. High-bandwidth chassis and rack management by VLAN
US9866548B2 (en) 2014-12-17 2018-01-09 Quanta Computer Inc. Authentication-free configuration for service controllers
US20160239370A1 (en) 2015-02-12 2016-08-18 Aic Inc. Rack having automatic recovery function and automatic recovery method for the same
US10223229B2 (en) 2015-11-18 2019-03-05 Mitac Computing Technology Corporation System for monitoring a to-be-monitored unit of a rack/chassis management controller (RMC/CMC) according to heartbeat signals for determining operating modes
US20170262011A1 (en) * 2016-03-14 2017-09-14 Electronics And Telecommunications Research Institute Processor system and fault detection method thereof
US10509456B2 (en) 2016-05-06 2019-12-17 Quanta Computer Inc. Server rack power management
US10298479B2 (en) 2016-05-09 2019-05-21 Mitac Computing Technology Corporation Method of monitoring a server rack system, and the server rack system
US11194680B2 (en) * 2018-07-20 2021-12-07 Nutanix, Inc. Two node clusters recovery on a failure
US20200050523A1 (en) * 2018-08-13 2020-02-13 Stratus Technologies Bermuda, Ltd. High reliability fault tolerant computer architecture
US10754722B1 (en) 2019-03-22 2020-08-25 Aic Inc. Method for remotely clearing abnormal status of racks applied in data center
US20200305300A1 (en) 2019-03-22 2020-09-24 Aic Inc. Method for remotely clearing abnormal status of racks applied in data center
US10842041B2 (en) 2019-03-22 2020-11-17 Aic Inc. Method for remotely clearing abnormal status of racks applied in data center
US11243589B1 (en) * 2019-09-24 2022-02-08 Amazon Technologies, Inc. Remote power button actuation device for a pre-assembled computer system integrated into a server for a virtualization service
US20200136943A1 (en) * 2019-12-27 2020-04-30 Intel Corporation Storage management in a data management platform for cloud-native workloads
US20210344497A1 (en) * 2020-04-29 2021-11-04 Hewlett Packard Enterprise Development Lp Hashing values using salts and peppers
US20210374084A1 (en) * 2020-05-26 2021-12-02 Hewlett Packard Enterprise Development Lp Server identification via a keyboard/video/mouse switch
US20210405097A1 (en) * 2020-06-29 2021-12-30 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Reliable hardware metering
US20220308860A1 (en) * 2021-03-26 2022-09-29 Hewlett Packard Enterprise Development Lp Program installation in a virtual environment
US20210342213A1 (en) * 2021-07-14 2021-11-04 Karunakara Kotary Processing Device, Control Unit, Electronic Device, Method and Computer Program

Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
Andy Jeffries, CIVO.com, What's the difference between k3s vs k8s, Sep. 24, 2019 (5 pages).
Bajic et al., "Implementation of High Availability Server Cluster by Using Fencing Concept", 18th International Symposium INFOTECH, Mar. 20, 2019. *
Brey et al., "BladeCenter Chassis Management", IBM Journal of Research and Development (vol. 49, Issue: 6), Nov. 2005, IBM Publisher, pp. 941-961. *
HP Enterprise, QuickSpecs, HPE Superdome Flex, Dec. 6, 2021 (37 pages).
HP Enterprise, Technical White Paper, HPE Superdome Flex Server Architecture and RAS, May 2021 (22 pages).
Meier et al., "IEEE 1588 applied in the environment of high availability LANs", IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication, Oct. 2007. *
Naik, "Docker Container-Based Big Data Processing System in Multiple Clouds for Everyone", IEEE International Systems Engineering Symposium, Oct. 11, 2017, IEEE Publishing. *
Pratheek Prabhakaran, How to set up a Pacemaker cluster for high availability Linux, May 28, 2021 (8 pages).
Rania Mohamed, www.suse.com, Kubernetes Cluster vs Master Node, Apr. 16, 2019 (6 pages).
Telco 5G, Platform9, How to Create Multi-Master Kubernetes Clusters downloaded Jan. 17, 2022 (12 pages).
Vayghan et al., "Microservices Based Architectures: Toward High-Availability for Stateful Applications with Kubernetes", IEEE 19th International Conference on Software Quality, Reliability and Security, Jul. 22, 2019, IEEE Publishing. *
Wikipedia, Kubernetes last edited Jan. 16, 2022 (14 pages).
Wikipedia, Pacemaker (Software) last edited Oct. 14, 2021 (2 pages).
Wikipedia, Red Hat cluster suite last edited Mar. 15, 2020 (3 pages).
Wikipedia, Redfish (specification) last edited Dec. 9, 2021 (4 pages).

Also Published As

Publication number Publication date
US20230269126A1 (en) 2023-08-24

Similar Documents

Publication Publication Date Title
US11799714B2 (en) Device management using baseboard management controllers and management processors
JP4496093B2 (en) Remote enterprise management of high availability systems
US8055933B2 (en) Dynamic updating of failover policies for increased application availability
US8176501B2 (en) Enabling efficient input/output (I/O) virtualization
US10609159B2 (en) Providing higher workload resiliency in clustered systems based on health heuristics
US9110717B2 (en) Managing use of lease resources allocated on fallover in a high availability computing environment
US20060155912A1 (en) Server cluster having a virtual server
CN106980529B (en) Baseboard Management Controller Resource Management Computer System
US9471137B2 (en) Managing power savings in a high availability system at a redundant component level of granularity
US11119872B1 (en) Log management for a multi-node data processing system
US20210224121A1 (en) Virtual machine-initiated workload management
US11909818B2 (en) Reaching a quorum with a number of master nodes
US12236269B2 (en) Distribution of workloads in cluster environment using server warranty information
US11755100B2 (en) Power/workload management system
US20100083034A1 (en) Information processing apparatus and configuration control method
US11902089B2 (en) Automated networking device replacement system
US10454773B2 (en) Virtual machine mobility
US11748176B2 (en) Event message management in hyper-converged infrastructure environment
US8074109B1 (en) Third-party voting to select a master processor within a multi-processor computer
Pongpaibool et al. Netham-nano: A robust and scalable service-oriented platform for distributed monitoring
US20260010398A1 (en) Job management system and information processing method
US11204820B2 (en) Failure detection for central electronics complex group management
CN116880956A (en) Automatic deployment method and system applied to digital factory
Salapura et al. Remote Restart for a High Performance Virtual Machine Recovery in a Cloud
WO2016108945A1 (en) Cluster arbitration

Legal Events

Date Code Title Description
AS Assignment

Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PARTHASARATHY, MOHAN;MUGGERIDGE, MATTHEW JAMES;VENUGOPAL, VINAY;AND OTHERS;SIGNING DATES FROM 20220221 TO 20220223;REEL/FRAME:059091/0384

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE