US20110113228A1 - Rules-Based, Mode-Driven Manager for Timer Bounded Arbitration Protocol Based Resource Control - Google Patents

Rules-Based, Mode-Driven Manager for Timer Bounded Arbitration Protocol Based Resource Control

Info

Publication number
US20110113228A1
Authority
US
United States
Prior art keywords
cluster
mode
status
locked
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/906,220
Inventor
William J. MIDDLECAMP
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quantum Corp
Original Assignee
Quantum Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quantum Corp
Priority to US12/906,220
Assigned to QUANTUM CORPORATION. Assignment of assignors interest (see document for details). Assignor: MIDDLECAMP, WILLIAM J.
Publication of US20110113228A1
Assigned to WELLS FARGO CAPITAL FINANCE, LLC, AS AGENT. Security agreement. Assignor: QUANTUM CORPORATION
Assigned to QUANTUM CORPORATION. Release by secured party (see document for details). Assignor: WELLS FARGO CAPITAL FINANCE, LLC, AS AGENT
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, where processing functionality is redundant
    • G06F 11/2023 Failover techniques
    • G06F 11/2033 Failover techniques switching over of hardware resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/004 Error avoidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 Handling requests for interconnection or transfer
    • G06F 13/36 Handling requests for interconnection or transfer for access to common bus or bus system

Definitions

  • A server in an HA cluster operates in one of six modes: default, single, config, locked, peerdown, and failed startup.
  • In the default mode, HA monitoring is turned on, the peer server is assumed to be in default mode, and SMITH reset is enabled so that a server can force a hardware reset on itself.
  • In the single mode, HA monitoring is turned off and SMITH reset is disabled for single-server operation. The paired peer server must be communicating and in locked mode, or not communicating and certified as being in peerdown mode. This mode is meant for extended production operations without a redundant server, for example when one server is being repaired or replaced. The operating server can be transitioned from single to default mode without stopping an associated SAN file system (e.g., StorNext) application.
  • In the config mode, HA monitoring is turned off. The peer server must be communicating and in locked mode, or not communicating and certified as being in the peerdown mode. The config mode is intended for re-configuration and other non-production service operations. When returning to production service and the default mode, an associated SAN file system (e.g., StorNext) application must be stopped to ensure that SAN file system processes can be started correctly.
  • In the locked mode, an associated SAN file system application (e.g., StorNext) is stopped and prevented from starting on the local server of a pair of HA servers. Locked mode allows the RBM to actively query the peer server to ensure that it is stopped while the local peer is operating in single or config mode. Communication with the locked node must continue, so this mode is effective when the application is stopped for a short period and the node will not be rebooted. If communication is lost, the peer node assumes this node is in default mode, which facilitates avoiding split brain scenarios. Locked mode can be set programmatically, which allows the RBM to put a cluster into the config mode automatically.
  • In the peerdown mode, the peer server is turned off and must not be communicating with the local server's RBM subsystem, so this mode is effective when the peer server is powered down. The mode is declared by the peerdown command on a working server to give information about the non-working peer server. By setting this mode, an administrator certifies the off status of the peer, which the RBM cannot verify by itself. This allows the local peer to be in single or config mode. If the peer starts communicating while this mode is set, the setting is immediately erased, the local mode is set to default to restore HA monitoring, and the associated SAN file system (e.g., StorNext) application is shut down, which can trigger an HA reset; the sketch after this bullet illustrates that reaction. The peerdown mode is changed to default mode with the peerup command. The peerdown and peerup commands should not be automated because they require external knowledge about the peer server's condition and operator awareness of the requirement to keep the peer server turned off.
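  • As a non-authoritative sketch of the reaction described in the bullet above, a local server holding a peerdown certification might respond to unexpected peer traffic as follows. The object and method names are placeholders; the patent does not describe a programmatic interface.

        def on_peer_message_while_peerdown(local):
            """Peer traffic contradicts the administrator's peerdown
            certification, so revert to the safe default immediately."""
            local.erase_peerdown_setting()   # the certification is no longer valid
            local.set_mode("default")        # restore HA monitoring
            local.stop_san_application()     # shutting down can trigger an HA reset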
  • The RBM collects server statuses along with server modes to measure the operating condition of an HA cluster. Statuses may include stopped, running, primary, and unknown. The stopped status is reported when a status command (e.g., DSM_control status) has returned a first pre-determined code (e.g., false). The running status is reported when a status command has returned a second, different pre-determined code (e.g., true). The primary status is reported when the server status is running and the FSMPM is in the primary state; this combination indicates that the HA shared FSM has been activated. The unknown status is reported when attempts to communicate with the peer server fail.
  • In summary, the RBM controls modes into which HA cluster members can be forced. The modes are governed by pairing rules and are established by the RBM issuing commands. The RBM collects status information in order to determine which commands, if any, to issue to control modes. In some circumstances, the RBM may force a hardware reset of one or both members of an HA cluster.
  • FIG. 4 illustrates a computer 400 that facilitates maintaining single writer access between a pair of HA servers by participating in a timer bounded ARB protocol for resource control. Computer 400 includes a processor 402 and a memory 404 that are operably connected by a bus 412. The computer 400 may include a first component 406, a second component 408, and a third component 410, and may be associated with a process 414 and data 416.
  • The first component 406 is configured to acquire a mode from a member of an HA cluster. The second component 408 is configured to determine a desired mode pairing for the member of the HA cluster. The third component 410 is configured to take an action configured to either achieve the desired mode pairing or to selectively force a hardware reset of the member upon determining that a split brain scenario is possible based, at least in part, on the mode of the member of the HA cluster.
  • The modes include: a default mode, a single mode, a config mode, a locked mode, a peerdown mode, and a failed startup mode. In the default mode, HA monitoring for a member of the HA cluster is ON and SMITH reset is enabled. In the single mode, HA monitoring for a member is OFF and the peer member is communicating and in locked mode, or not communicating and in peerdown mode. In the config mode, HA monitoring for a member is likewise OFF and the peer member is communicating and in locked mode, or not communicating and in peerdown mode. In the locked mode, a SAN application on the member is stopped and prevented from starting. In the peerdown mode, the peer member is OFF. In the failed startup mode, attempts to start the SAN application are blocked until a failure indicator is cleared.
  • The first component 406 may also be configured to acquire a status from the member of the HA cluster. The status may be one of: unknown, stopped, running, and primary. The unknown status is reported when the member is not communicating. The stopped status is reported when a status command returns a first pre-determined code. The running status is reported when a status command returns a second, different pre-determined code. The primary status is reported when the member's status is running and the FSMPM in the member is in the primary state.
  • The third component 410 may force actions including status, stop, start, config, clear, primary, and force reset. The status action causes a cluster member to report status. The stop action causes the non-primary member to be transitioned to locked mode, the primary member to be transitioned to config mode, HA monitoring to be turned off on the primary member, the SAN application to be stopped, and both members to be transitioned to default mode. The start action causes the stop command to be run, the SAN application and HA monitor to be started on the local member, SMITH reset to be enabled on the local member, the SAN application and HA monitor to be started on the peer member, and SMITH reset to be enabled on the peer member. The config action causes the peer member to be transitioned to the locked mode and the local member to be transitioned to the config mode. The clear action clears an indicator that was set by failure of a start command. The primary action sets the status of the FSMPM on the local member to primary. The force reset action triggers an immediate HA reset.
  • The third component 410 may also be configured to force a hardware reset upon determining that the HA cluster is in a prohibited paired mode and is at risk of an SBS.
  • The processor 402 may be any of a variety of processors, including dual microprocessor and other multi-processor architectures. The memory 404 may include volatile memory (e.g., RAM (random access memory)) and/or non-volatile memory (e.g., ROM (read only memory)). The memory 404 can store a process 414 and/or data 416; for example, the process 414 may be an RBM process and the data 416 may be coordination and control data.
  • The bus 412 may be a single internal bus interconnect architecture and/or other bus or mesh architectures. While a single bus is illustrated, it is to be appreciated that the computer 400 may communicate with various devices, logics, and peripherals using other busses (e.g., PCIE (peripheral component interconnect express), 1394, USB (universal serial bus), Ethernet). The bus 412 can be, for example, a memory bus, a memory controller, a peripheral bus, an external bus, a crossbar switch, and/or a local bus.
  • References to “one embodiment”, “an embodiment”, “one example”, “an example”, and other similar terms indicate that the embodiment(s) or example(s) so described may include a particular feature, structure, characteristic, property, element, or limitation, but not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element, or limitation. Furthermore, repeated use of the phrase “in one embodiment” or “in one example” does not necessarily refer to the same embodiment or example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Bus Control (AREA)
  • Hardware Redundancy (AREA)
  • Computer And Data Communications (AREA)

Abstract

An example apparatus includes a processor, a memory, and an interface that connects the processor, the memory, and a set of components. The set of components includes a first component configured to acquire a mode from members of an HA cluster and a second component configured to enforce mode pairing rules for members of the HA cluster. Once the desired mode pairing has been determined, a third component takes actions configured to either achieve the mode pairing according to rules for members of the HA cluster or to selectively force a hardware reset of one or more members of the HA cluster upon determining that a split brain scenario is possible based, at least in part, on the mode of the members of the HA cluster. The example apparatus therefore implements a rules-based manager for timer bounded arbitration protocol based resource control.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Patent Application 61/259,271 filed Nov. 9, 2009.
  • BACKGROUND
  • Networked systems of computers allow parallel and distributed computing. Networked systems of computers may act as clusters, where the computers are nodes in the clusters. Clusters may function collectively as operational groups of servers. The nodes of a cluster or other multi-node computer system may function together to achieve high server system performance, availability, and reliability. High availability (HA) systems facilitate a standby server taking over in the event of undesired failures on a primary server. Goals for an HA system include providing safety and uninterrupted operation. If a primary server fails, failover should occur automatically and operations should resume on a secondary server. However, at any point in time, only one of the primary server or the secondary server should have write access to certain items. For example, at any point in time, there should only be one server with write access to file system metadata to prevent corruption of the metadata. When two servers both have write access, which is an undesirable state, this may be referred to as a split brain scenario (SBS).
  • Conventional systems may have employed protocols and techniques for preventing multiple-writer access leading to an SBS. However, these conventional systems may have had default settings that unintentionally allowed both the primary server and the secondary server to have write access, resulting in SBS under certain circumstances. Additionally, these systems may have had settings that unintentionally led to unnecessary hardware resets when an ambiguous or non-deterministic state was encountered. One unintentional occurrence that could lead to an undesired hardware reset involves a communications network breakdown or slowdown. When synchronizing communications are lost, a hardware reset may be forced, even though all parts of the system except the communications network are healthy and single writer access is still in place. Operation of an HA cluster includes times when the protection mechanism must be stopped, which requires stopping all but one of the processors to avoid SBS. A state-based system is insufficient to protect against SBS because one or more processors may change state without awareness of that state change by another processor under certain types of equipment failures or operational mistakes.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate various example methods, apparatus, and other example embodiments of various aspects of the invention described herein. It will be appreciated that the illustrated element boundaries (e.g., boxes, groups of boxes, other shapes) in the figures represent one example of the boundaries of the elements. One of ordinary skill in the art will appreciate that in some examples one element may be designed as multiple elements or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of another element may be implemented as an external component and vice versa. Furthermore, elements may not be drawn to scale.
  • FIG. 1 illustrates a pair of metadata controllers (MDCs) used in a system for maintaining single writer access between a pair of HA servers using a rules-based, mode-driven manager (RBM) for timer bounded arbitration protocol based resource control.
  • FIG. 2 illustrates a method for maintaining single writer access between a pair of HA servers using an RBM for timer bounded arbitration protocol based resource control.
  • FIG. 3 illustrates a system for maintaining single writer access between a pair of HA servers using an RBM for timer bounded arbitration protocol based resource control.
  • FIG. 4 illustrates an apparatus for maintaining single writer access between a pair of HA servers using an RBM for timer bounded arbitration protocol based resource control.
  • DETAILED DESCRIPTION
  • Example apparatuses and methods work with apparatuses and methods that prevent a split brain scenario (SBS) where both servers in a high availability (HA) pair could have write access to resources (e.g., file system metadata, databases) for which there should only be single writer access. Example apparatuses and methods implement a rules-based, mode-driven manager (RBM) that enforces rules for modes into which HA cluster members can be placed. The modes are controlled by pairing rules. The modes are created by the RBM issuing commands to cluster members. The RBM collects mode and/or status information in order to determine which commands, if any, to issue to control modes into which cluster members are placed. In some circumstances, the RBM may trigger a hardware reset of one or both members of an HA cluster after acquiring information (e.g., mode, status) and attempting to alter modes by taking actions.
  • The mode and/or status information is collected at least when a decision point in the management chain of events for an HA cluster (e.g., subsystem startup) occurs. Upon detecting a decision point, an RBM running on one server attempts to communicate with a peer server in the HA cluster to ensure that mode-pairing rules are followed and to ensure that the cluster is not exposed to split brain scenario risk by operational mistakes.
  • The RBM can be viewed as a cluster manager that automates HA cluster pairwise operation. The RBM facilitates normal operation, reconfiguration, maintenance, not running in HA mode, and so on. The RBM facilitates configuring metadata controllers (MDCs) to operate properly together to avoid split brain scenario. The RBM makes it possible to be able to turn off HA protection and then to turn it back on. To achieve this desired functionality while avoiding split brain scenario, the RBM is configured to substantially instantaneously acquire a picture of the operating status of servers in a system. Acquiring substantially instantaneous information is desired because there are asynchronous processes operating in the HA system. The RBM also is configured to act on the information it acquires. Therefore, the RBM is not just a status-collection machine, but rather is an entity that can force mode transitions according to rules that maintain cluster members in desired pair-wise modes that facilitate avoiding a split brain scenario. In one example, an RBM can receive three different statuses from a server, and can force a server into one of six different modes by causing up to seven different actions.
  • FIG. 1 illustrates a pair of metadata controllers (MDCs) used in a system for maintaining single writer access between a pair of HA servers. Thus, FIG. 1 provides context for the example RBM for timer-bounded ARB-protocol-based resource control described herein. The pair of MDCs includes active MDC 100 and standby MDC 150.
  • Active MDC 100 includes an active file system portmapper (FSMPM) 110, an active file system manager (FSM) 120, and an active dead-man timer 130. The standby MDC 150 includes a standby FSMPM 160, a standby FSM 170, and a standby dead-man timer 180. The active FSMPM 110 is connected to the active FSM 120 by a socket. Likewise, the standby FSMPM 160 is connected to the standby FSM 170 by a socket.
  • The active FSMPM 110 is configured to provide a heartbeat signal to a coordinator (not shown). When the active FSMPM is configured to provide a heartbeat signal, the timer bounded ARB protocol may define an amount of elapsed time between heartbeats. Therefore, one way that standby FSMPM 160 can determine that active FSM 120 is not operating properly is by monitoring the heartbeat signal.
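  • As a non-authoritative illustration of the heartbeat check described above, the standby side could treat the active FSM as suspect once too much time has elapsed since the last heartbeat. The sketch below is a minimal Python rendering; the interval constant, grace multiplier, and function name are assumptions, since the patent states only that the protocol may define the elapsed time between heartbeats.

        import time

        HEARTBEAT_INTERVAL = 5.0  # seconds between expected heartbeats (assumed)
        GRACE_MULTIPLIER = 2      # missed intervals tolerated before suspicion

        def active_fsm_looks_healthy(last_heartbeat_ts: float) -> bool:
            """Return True while the active FSM's heartbeat is recent enough."""
            elapsed = time.monotonic() - last_heartbeat_ts
            return elapsed < HEARTBEAT_INTERVAL * GRACE_MULTIPLIER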
  • The active FSM 120 is configured to maintain ownership of an ARB block 140 by periodically writing the ARB block 140 according to the timer bounded ARB protocol. The active dead-man timer 130 is reset upon successfully writing the ARB block 140. Before the expiration of the active dead-man timer 130, the active FSMPM 110 may request permission to reset the active dead-man timer 130. If permission is granted, the active FSM 120 may attempt to maintain control by updating the ARB block 140 before the active dead-man timer 130 expires.
  • In this manner the active FSM 120 can negotiate for additional time to retain control of the ARB block 140. Also if permission is granted, the standby FSM 170 will not attempt to establish control of the ARB block 140 for a predetermined amount of time. Therefore, the negotiation affords the active FSM 120 the opportunity to maintain control of the ARB block 140 in situations where a hardware reset is unnecessary (e.g., minor system delay or slowdown). Furthermore, single writer access is maintained via the negotiation.
  • The active FSMPM 110 can also be configured to selectively force an election of an FSM to replace the active FSM 120 as the single writer upon a determination that the active FSM 120 has exited. Therefore, the active FSMPM 110 may establish the standby FSM 170 as the replacement of the active FSM 120. Accordingly, an activation command may be sent to the standby FSM 170.
  • The standby FSM 170 monitors the ARB block 140 for a safety period of time before writing to the ARB block 140. During the safety period of time the standby FSM 170 monitors the ARB block 140 to ensure that the active FSM 120 is not writing to the ARB block 140. Once the safety period expires and it is determined the active FSM 120 has not written to the ARB block 140, the standby FSM 170 may write to the ARB block. This way single writer access is maintained during the transition of control from the active FSM 120 to the standby FSM 170.
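  • The safety period just described can be pictured as a polling loop that samples the ARB block and yields control only if no new write from the active FSM appears. A minimal sketch, assuming hypothetical names (read_arb_block, SAFETY_PERIOD, POLL_INTERVAL) and durations the patent does not specify:

        import time

        SAFETY_PERIOD = 10.0  # seconds to watch the ARB block (assumed value)
        POLL_INTERVAL = 1.0   # seconds between samples (assumed value)

        def standby_may_take_over(read_arb_block) -> bool:
            """Watch the ARB block for the safety period; return True only if
            the active FSM never writes it during that window."""
            baseline = read_arb_block()           # snapshot the ARB block
            deadline = time.monotonic() + SAFETY_PERIOD
            while time.monotonic() < deadline:
                time.sleep(POLL_INTERVAL)
                if read_arb_block() != baseline:  # the active FSM wrote it
                    return False                  # it is still alive; back off
            return True                           # safe to claim the writer role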
  • The active MDC 100 violates the timer bounded ARB protocol when the active FSM 120 has not exited before the active dead-man timer 130 expires. If the timer bounded ARB protocol is violated, a hardware reset will be forced, and an election is held to select a standby MDC 150 to replace the active MDC 100.
  • The standby FSMPM 160 is configured to selectively activate the standby FSM 170 to take control of the ARB block 140 after being elected to replace the active MDC 100. The standby FSM 170 is configured to acquire ownership of the ARB block 140, to maintain ownership of the ARB block 140 by periodically writing the ARB block 140 according to the timer bounded ARB protocol, and to reset the standby dead-man timer 180 upon successfully writing the ARB block 140. In one embodiment, the active MDC 100 and the standby MDC 150 reside on separate pieces of computer hardware and communicate over a computer network.
  • FIG. 1 illustrates the active dead-man timer 130 and the standby dead-man timer 180 as being internal to the active MDC 100 and the standby MDC 150 respectively. One skilled in the art will appreciate that the dead-man timers 130 and 180 may be part of an MDC or may be external to but used by an MDC. For example, a dead-man timer may be external to a process and/or hardware implementing an active MDC 100 or a standby MDC 150. Therefore, in different examples, the active dead-man timer 130 can be, but is not limited to being, a kernel timer, an operating system timer, and a timer associated with computer hardware (e.g., peripheral component interconnect express (PCIE) card) operatively connected to an interface visible to active MDC 100. In one embodiment, there is one dead-man timer per active FSM 120. Similarly, the standby dead-man timer 180 can be, but is not limited to being, a kernel timer, an operating system timer, and a timer associated with computer hardware (e.g., PCIE card) operatively connected to an interface visible to the standby MDC 150. There is one dead-man timer per standby FSM 170.
  • The active MDC 100 and the standby MDC 150 participate in the timer bounded ARB protocol. Functioning of the protocol, and thus functioning of the active MDC 100 and the standby MDC 150 are enhanced by example RBMs described herein. In one example, the timer bounded ARB protocol includes controlling the active FSM 120 to write the ARB block 140 once per FSM write period. The periodic writing indicates continued ownership of the ARB block 140. When a write to the ARB block 140 is successful, the active FSM 120 will reset the active dead-man timer 130 to a reset threshold period. Recall that it is the expiration of the active timer 130 that forces a hardware reset of the active MDC 100.
  • One skilled in the art will appreciate that a failover system for a pair of HA servers can be arranged in different environments and may experience different operating conditions, different communication conditions, and other different factors. Therefore the timer bounded ARB protocol may have different time delays. In one embodiment, the FSM write period employed by the active dead-man timer 130 is 5 seconds. The active FSMPM 110 associated with the active FSM 120 sends a request to the standby FSMPM 160 to ask permission to restart the active dead-man timer 130. The active FSMPM 110 may measure the round trip time of the request and reset the active dead-man timer 130 if permission was granted in less than a second.
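  • Combining the timing rules of the two paragraphs above, one iteration of the active side's write-and-reset cycle might look like the following sketch. The 5-second write period and the sub-second permission round trip come from the text; the callable names are hypothetical.

        import time

        FSM_WRITE_PERIOD = 5.0  # seconds between ARB block writes (from the text)
        MAX_ROUND_TRIP = 1.0    # permission must be granted in under a second

        def arb_ownership_cycle(ask_peer_permission, write_arb_block,
                                reset_dead_man_timer):
            """One iteration of the timer bounded ARB protocol on the active MDC."""
            t0 = time.monotonic()
            granted = ask_peer_permission()  # FSMPM asks the standby FSMPM
            round_trip = time.monotonic() - t0
            if granted and round_trip < MAX_ROUND_TRIP:
                write_arb_block()            # assert continued ownership
                reset_dead_man_timer()       # expiry would force a hardware reset
            # If permission is late or denied, the dead-man timer keeps running;
            # its expiry forces a reset, which prevents a second writer.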
  • In one example, an HA manager collects and reports the operating information of the servers in the HA pair. This operating information may be employed by the RBM to enhance operation of the pair of servers in the HA pair. The operating information may include information concerning modes and statuses of the individual HA servers. Example RBMs control the modes and statuses of the servers in the HA pair so that both of the servers in the HA pair do not have write access to file system metadata.
  • The RBM may also change the operating information and/or status of the HA servers in the HA pair. The RBM has a set of actions available to it. The set of actions facilitates putting individual servers into pairs of operating modes according to pairing rules to avoid SBS.
  • FIG. 2 illustrates a method 200. Computer executable instructions may be stored on a non-transitory computer readable medium. The instructions, when executed by a computer in an HA cluster, control the computer to perform method 200. Method 200 begins, at 205, when a decision point in the lifecycle of an HA cluster is detected. Method 200 proceeds, at 210, to acquire data describing an operating condition of a set of servers comprising the HA cluster. The data will include a mode of a server in the HA cluster.
  • Method 200 also includes, at 220, controlling at least one member of the HA cluster to selectively change mode to a target paired mode. The target paired mode is selected based, at least in part, on mode pairing rules associated with the HA cluster.
  • In method 200, the mode is one of: a default mode, a single mode, a config (configuration) mode, a locked mode, a peerdown mode, and a failed startup mode. In the default mode, HA monitoring for a server is ON and SMITH (Shoot Myself In The Head) reset is enabled. In the single mode, HA monitoring for a server is OFF and a peer server is communicating and in locked mode, or not communicating and designated peerdown. In the configuration mode, HA monitoring for a server is OFF and a peer server is communicating and in locked mode, or not communicating and designated peerdown. In the locked mode, a storage area network (SAN) application on the server is stopped and prevented from starting. In the peerdown mode, a peer server is OFF. In the failed startup mode, attempts to start a SAN application are blocked until a failure indicator is cleared.
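  • Expressed as data, the six modes form a small enumeration. The following Python sketch is one illustrative encoding (the string values are assumptions), annotated per the paragraph above:

        from enum import Enum

        class Mode(Enum):
            DEFAULT = "default"    # HA monitoring ON, SMITH reset enabled
            SINGLE = "single"      # HA monitoring OFF; peer locked or peerdown
            CONFIG = "config"      # HA monitoring OFF; peer locked or peerdown
            LOCKED = "locked"      # SAN application stopped, prevented from starting
            PEERDOWN = "peerdown"  # peer server certified as powered off
            FAILED_STARTUP = "failed_startup"  # startup blocked until cleared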
  • In one example, the data may also include a status of the server in the HA cluster. The status is one of: unknown, stopped, running, and primary. The unknown status is reported when a peer server is not communicating. The stopped status is reported when a status command returns a first pre-determined code. The running status is reported when a status command returns a second, different pre-determined code. The primary status is reported when the server status is running and the FSMPM (file system port mapper) in the server is in the primary state.
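  • The status determination just described reduces to a small decision function. In this sketch the pre-determined codes returned by the status command are represented by a boolean, which is an assumption made for illustration:

        from enum import Enum

        class Status(Enum):
            UNKNOWN = "unknown"
            STOPPED = "stopped"
            RUNNING = "running"
            PRIMARY = "primary"

        def derive_status(peer_communicating: bool,
                          status_cmd_code: bool,
                          fsmpm_is_primary: bool) -> Status:
            """Map the reported conditions onto the four statuses."""
            if not peer_communicating:
                return Status.UNKNOWN  # no response from the peer server
            if not status_cmd_code:
                return Status.STOPPED  # first pre-determined code (e.g., false)
            if fsmpm_is_primary:
                return Status.PRIMARY  # running and FSMPM in the primary state
            return Status.RUNNING      # second pre-determined code (e.g., true)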
  • In one example, controlling at least one member of the HA cluster to selectively change mode comprises issuing one or more of: a status, stop, start, configuration, clear, primary, and force reset command. The status command causes members of the HA cluster to report their status. The stop command causes the non-primary server in the HA cluster to be transitioned to locked mode, the primary server in the HA cluster to be transitioned to configuration mode, HA monitoring to be turned off on the primary server in the HA cluster, the SAN application to be stopped, and both servers in the HA cluster to be transitioned to default mode. The start command causes the stop command to be run to transition the cluster to default-default mode if necessary, the SAN application and HA monitor to be started on the local server in the HA cluster, SMITH reset to be enabled on the local server in the HA cluster, the SAN application and HA monitor to be started on the peer server in the HA cluster, and SMITH reset to be enabled on the peer server in the HA cluster. The config command causes the peer server in the HA cluster to be transitioned to the locked mode and the local server in the HA cluster to be transitioned to the configuration mode. The clear command clears an indicator that was set by failure of a start command. The primary command sets the status of the FSMPM on the local server in the HA cluster to primary. The force reset command triggers an immediate HA reset of the local server.
  • Method 200 facilitates persisting modes for members of the HA cluster. Therefore, method 200 may also include storing values for the modes associated with members of the HA cluster to maintain modes through a hardware reset. Part of the persistence can involve monitoring a file that indicates that a previous initialization of the HA cluster has failed. Since the file is monitored, in one example method 200 may also include granting permission, prior to initialization of the HA cluster, for the HA cluster to initialize.
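  • Mode persistence and the failed-initialization check might be sketched as two small files, as below. The paths, file format, and function names are assumptions; the patent says only that mode values are stored so they survive a hardware reset and that a file indicating a failed initialization is monitored.

        import os

        MODE_FILE = "/var/lib/rbm/mode"                # hypothetical path
        FAILED_INIT_FILE = "/var/lib/rbm/failed_init"  # hypothetical marker file

        def persist_mode(mode: str) -> None:
            """Store the mode so it survives a hardware reset."""
            os.makedirs(os.path.dirname(MODE_FILE), exist_ok=True)
            with open(MODE_FILE, "w") as f:
                f.write(mode)

        def may_initialize_cluster() -> bool:
            """Grant permission to initialize only if no prior init failed."""
            return not os.path.exists(FAILED_INIT_FILE)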
  • In one example, the allowed set of mode pairs includes: default-default, default-locked, default-peerdown, single-peerdown, single-locked, config-locked, config-peerdown, locked-default, locked-single, locked-config, and locked-locked. The prohibited set of paired modes may include: single-default, single-single, single-config, config-default, config-single, and config-config.
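  • Because the rules reduce to a fixed set of pairs, they can be encoded as a lookup. The sketch below copies the allowed set verbatim from the list above and rejects any other combination, which subsumes the prohibited list:

        # Allowed (local, peer) mode pairs, copied from the list above.
        ALLOWED_PAIRS = {
            ("default", "default"), ("default", "locked"), ("default", "peerdown"),
            ("single", "peerdown"), ("single", "locked"),
            ("config", "locked"), ("config", "peerdown"),
            ("locked", "default"), ("locked", "single"),
            ("locked", "config"), ("locked", "locked"),
        }

        def pair_is_allowed(local_mode: str, peer_mode: str) -> bool:
            """True when the cluster's mode pair satisfies the pairing rules."""
            return (local_mode, peer_mode) in ALLOWED_PAIRS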
  • To review the actions caused by the commands, the status command causes cluster members to report their status. The stop command causes a series of actions: the non-primary server is transitioned to locked mode, the primary server is transitioned to configuration mode, HA monitoring is turned off on the primary server, the SAN application is stopped, and then both servers are transitioned to default mode. The start command also causes a series of actions: the stop command is run to transition the cluster to default-default mode if necessary, then SMITH reset is enabled on the local server and the SAN application and HA monitor are started on the local server. Then the SMITH reset is enabled on the peer server and the SAN application and HA monitor are started on the peer server. The config command transitions the peer server to the locked mode and transitions the local server to the configuration mode. The clear command clears an indicator that was set by failure of a start command. The primary command sets the status of the FSMPM on the local server to primary. The force reset command triggers an immediate HA reset.
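  • The command sequences above are ordered, and the ordering matters for safety: the non-primary side is locked before the primary side stops, and SMITH reset is enabled before each server is started. The sketch below preserves that ordering; the server objects and their methods are placeholders, since the patent does not describe a programmatic interface.

        def stop_cluster(non_primary, primary):
            """Ordered actions of the stop command (method names are placeholders)."""
            non_primary.set_mode("locked")    # 1. lock the non-primary server
            primary.set_mode("config")        # 2. primary into configuration mode
            primary.turn_off_ha_monitoring()  # 3. stop HA monitoring on the primary
            primary.stop_san_application()    # 4. stop the SAN application
            non_primary.set_mode("default")   # 5. both servers back to default mode
            primary.set_mode("default")

        def start_cluster(local, peer):
            """Ordered actions of the start command (method names are placeholders).
            The real command first runs stop, if necessary, so that the cluster is
            in default-default mode before anything is started."""
            local.enable_smith_reset()        # local first: SMITH, then startup
            local.start_san_application_and_ha_monitor()
            peer.enable_smith_reset()         # then the peer, in the same order
            peer.start_san_application_and_ha_monitor()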
  • The pairing rules describe mode configurations for a primary server and a secondary server of the HA pair that allow single-server write access to system resources. Thus, an RBM selectively sets the modes and statuses of the paired HA servers. Setting the modes and statuses overrides the default behaviors of the paired HA servers and puts them into allowed mode pairs while preventing them from entering prohibited mode pairs. The RBM changes the modes and statuses of the servers in the pair of HA servers so that only a single server has write access to the system resources (e.g., file metadata) at a given time. Likewise, the RBM sets the modes and statuses according to pairing rules to avoid unintentional hardware resets.
  • Example apparatuses and methods rely on pairing rules that define valid mode combinations of paired HA servers. The RBM monitors the operating information of paired HA servers to ensure that the paired HA servers are operating in valid mode combinations. If the RBM detects that paired servers are not operating in a valid mode combination, the RBM will take an action to attempt to force the paired servers into a valid mode combination if possible. If the RBM is unable to move the paired HA servers to a valid mode combination, the RBM may trigger a hardware reset in one or both of the paired HA servers to prevent SBS.
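• A sketch of this enforcement step, using dict-based server records for brevity. The particular corrective action chosen here (locking the peer) is an assumption for illustration; the description does not mandate a specific repair.

    # Hypothetical enforcement step: attempt to repair an invalid mode
    # pair before falling back to a hardware reset.
    def enforce(local, peer, allowed_pairs, reset_fn):
        if (local["mode"], peer["mode"]) in allowed_pairs:
            return "ok"
        peer["mode"] = "locked"  # try to force a valid combination
        if (local["mode"], peer["mode"]) in allowed_pairs:
            return "repaired"
        reset_fn(peer)           # last resort: reset to prevent SBS
        return "reset"

    local, peer = {"mode": "single"}, {"mode": "single"}  # prohibited pair
    print(enforce(local, peer, {("single", "locked")}, lambda s: None))
    # prints "repaired": single-single was turned into single-locked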
  • FIG. 3 illustrates an HA cluster manager apparatus 300. Apparatus 300 implements an RBM 310 for timer bounded arbitration protocol based resource control. RBM 310 includes a rules logic 320, a mode selection logic 330, and an action logic 340.
• Rules logic 320 is configured to acquire substantially instantaneous information about an HA cluster. The information includes at least a mode and status for members of the HA cluster. The HA cluster may include, for example, primary server 350 and secondary server 360. RBM 310 manages the HA cluster to prevent a split brain scenario with regard to file system resource 370. While RBM 310 is illustrated as separate from the primary server 350 and the secondary server 360, in different examples the RBM 310 may be implemented in one or both of the primary server 350 and the secondary server 360.
  • The mode selection logic 330 is configured to select a mode for a member of the HA cluster. The mode is selected to make the HA cluster comply with a set of allowed paired modes and to prevent the HA cluster from attaining a prohibited paired mode.
• The action logic 340 is configured to prevent a split brain scenario in the HA cluster by transforming an HA cluster member mode. The cluster member mode is changed by the action logic 340 causing the performance of one or more of: a status action, a stop action, a start action, a configuration action, a clear action, a primary action, and a force hardware reset action.
• In one example, control of and write access to system resources (e.g., file system resource 370) are regulated through the RBM 310. The RBM 310 monitors and sets operating modes. The modes persist across system reboots. Therefore, if the paired HA servers are rebooted, the paired HA servers do not encounter an ambiguous or non-deterministic state when initialized. Accordingly, the paired HA servers are not subject to an unnecessary forced hardware reset upon initialization.
• Operating modes for members of the HA cluster may include default, single, config (configuration), locked, peerdown, and failed startup. These are the modes that can be assigned to a server in the HA cluster. The RBM 310 may employ a distributed application that puts individual servers into pairs of operating modes according to rules that prevent a split brain scenario. The RBM 310 may need to suspend SMITH resets to facilitate configuration changes, to restart a cluster without incurring a SMITH reset, and for other reasons. When suspending SMITH resets is necessary, the RBM uses the mode pairing rules to ensure that one of the servers stops and stays stopped until the RBM tells it to restart. Therefore, a component of a SAN file system (e.g., StorNext) application can start only after the RBM gives its permission.
  • At decision points in the management chain of events (e.g., component startup actions), the RBM 310 attempts to communicate across a network (e.g., LAN) to a peer server computer to ensure that mode-pairing rules are followed and to ensure that the cluster is not exposed to split brain scenario risk by operational mistakes. In one example, the RBM 310 monitors operating states of the SAN file system on both servers in an HA pair and outputs modes and statuses.
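• A sketch of such a decision-point gate: before a component may start, the local RBM queries the peer across the network and checks the resulting pair against the rules. The query_peer callable and its transport are assumptions for illustration.

    # Hypothetical decision-point check before a component starts.
    def may_start(local_mode, query_peer, allowed_pairs):
        try:
            peer_mode = query_peer()  # e.g., a call over the LAN
        except ConnectionError:
            # An unreachable peer is assumed to be in default mode.
            peer_mode = "default"
        return (local_mode, peer_mode) in allowed_pairs

    assert may_start("default", lambda: "locked", {("default", "locked")})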
  • In default mode, HA monitoring is turned on. When the peer server is not available for communication, the peer server is assumed to be in default mode. In default mode, SMITH reset is enabled and thus a server can force a hardware reset on itself.
  • In single mode, HA monitoring is turned off. For this server to be in single mode, its paired peer server must be communicating and in locked mode, or not communicating and certified as being in peerdown mode. This mode is meant for extended production operations without a redundant server, for example when one server is being repaired or replaced. When the peer server is about to be restored to service, the operating server can be transitioned from single to default mode without stopping an associated SAN file system (e.g., StorNext) application. In the single mode, SMITH reset is disabled for single-server operation.
• In the config mode, HA monitoring is turned off. In this mode, a peer server must be communicating and in locked mode, or not communicating and certified as being in the peerdown mode. The config mode is intended for re-configuration and other non-production service operations. When returning to production service, an associated SAN file system (e.g., StorNext) application must be stopped to ensure that its processes can be started correctly upon the return to default mode.
• In the locked mode, an associated SAN file system application (e.g., StorNext) is stopped and prevented from starting on the local server of a pair of HA servers. The locked mode allows the RBM to actively query a locked server to ensure that it is stopped while its peer is operating in single or config mode. Communication with the locked node must continue, so this mode is effective when the associated SAN file system (e.g., StorNext) application is stopped for a short period and the node will not be rebooted. If communication is lost, the peer node assumes this node is in default mode, which facilitates avoiding split brain scenarios. The locked mode can be set programmatically, which allows a cluster to be put into the config mode automatically by the RBM.
• In the peerdown mode, the peer server is turned off and must not be communicating with the local server's RBM subsystem. Therefore, this mode is effective when the peer server is powered down. This mode is declared by the peerdown command on a working server to give information about the non-working peer server. By setting this mode, an administrator is certifying the off status of the peer, which the RBM cannot verify by itself. This allows the local server to be in single or config mode. If the peer starts communicating while this mode is set, the setting is immediately erased, the local mode is set to default to restore HA monitoring, and an associated SAN file system (e.g., StorNext) application is shut down, which can trigger an HA reset. The peerdown mode is changed to default mode with the peerup command. The peerdown and peerup commands should not be automated because they require external knowledge about the peer server's condition and operator awareness of a requirement to keep the peer server turned off.
  • In the failed startup mode, a previous attempt to start an associated SAN file system application (e.g., StorNext) with a command (e.g., service cvfs start) has failed before completion. Attempts to start the SAN file system application are blocked until this status is cleared by running the clear command.
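• The per-mode properties stated in the preceding paragraphs, collected into one table as a sketch. Fields the description does not specify for a mode are simply omitted rather than guessed; the field names are illustrative.

    # Stated per-mode properties from the description, as data.
    MODE_RULES = {
        "default":  {"ha_monitoring": True, "smith_reset": True},
        "single":   {"ha_monitoring": False, "smith_reset": False,
                     "peer_must_be": ("locked", "peerdown")},
        "config":   {"ha_monitoring": False,
                     "peer_must_be": ("locked", "peerdown")},
        "locked":   {"san_application": "stopped and blocked from starting"},
        "peerdown": {"peer": "certified off by an administrator"},
        "failed_startup": {"start": "blocked until cleared by the clear command"},
    }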
• The RBM collects server statuses along with server modes to measure the operating condition of an HA cluster. Statuses may include stopped, running, primary, and unknown. The stopped status is reported when a status command (e.g., DSM_control status) has returned a first pre-determined code (e.g., false). The running status is reported when a status command (e.g., DSM_control status) has returned a second, different pre-determined code (e.g., true). The primary status is reported when the server status is running and the FSMPM is in the primary state. This combination indicates that the HA shared FSM has been activated. The unknown status is reported when attempts to communicate with the peer server fail.
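• A sketch of this status derivation. Representing the two pre-determined codes of the status command as a boolean is an assumption for illustration.

    # Hypothetical status derivation from the inputs named above.
    def derive_status(peer_reachable, dsm_control_status, fsmpm_state):
        if not peer_reachable:
            return "unknown"
        if dsm_control_status is False:   # first pre-determined code
            return "stopped"
        if fsmpm_state == "primary":      # running plus FSMPM primary
            return "primary"
        return "running"                  # second pre-determined code

    assert derive_status(False, None, None) == "unknown"
    assert derive_status(True, True, "primary") == "primary"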
• Therefore, the RBM controls the modes into which HA cluster members can be forced. The modes are governed by the pairing rules and are created by the RBM issuing commands. The RBM collects status information in order to determine which commands, if any, to issue to control modes. In some circumstances, the RBM may force a hardware reset of one or both members of an HA cluster.
  • FIG. 4 illustrates a computer 400 that facilitates maintaining single writer access between a pair of HA servers by participating in a timer bounded ARB protocol for resource control. Computer 400 includes a processor 402 and a memory 404 that are operably connected by a bus 412. In one example, the computer 400 may include a first component 406, a second component 408, and a third component 410. Additionally, the computer 400 may be associated with a process 414 and data 416.
  • The first component 406 is configured to acquire a mode from a member of an HA cluster. The second component 408 is configured to determine a desired mode pairing for the member of the HA cluster. The third component 410 is configured to take an action configured to either achieve the desired mode pairing for the member of the HA cluster or to selectively force a hardware reset of the member of the HA cluster upon determining that a split brain scenario is possible based, at least in part, on the mode of the member of the HA cluster.
  • The modes include: a default mode, a single mode, a config mode, a locked mode, a peerdown mode, and a failed startup mode. In the default mode, HA monitoring for a member of the HA cluster is ON and SMITH reset is enabled. In the single mode, HA monitoring for a member of the HA cluster is OFF and a communicating peer member of the HA cluster is communicating and in locked mode, or not communicating and in peerdown mode. In the configuration mode, HA monitoring for a member of the HA cluster is OFF and a communicating peer member of the HA cluster is communicating and in locked mode, or not communicating and in peerdown mode. In the locked mode, an SAN application on the member of the HA cluster is stopped and prevented from starting. In the peerdown mode, a peer member of the HA cluster is OFF. In the failed startup mode, attempts to start the SAN application are blocked until a failure indicator is cleared.
  • The first component 406 may also be configured to acquire a status from the member of the HA cluster. The status may be one of: unknown, stopped, running, and primary. The unknown status is reported when the member of the HA cluster is not communicating. The stopped status is reported when a status command returns a first pre-determined code. The running status is reported when a status command returns a second, different pre-determined code. The primary status is reported when the member of the HA cluster status is running and the FSMPM in the member of the HA cluster is in the primary state.
• The third component 410 may take actions including status, stop, start, config, clear, primary, and force reset actions. The status action causes a cluster member to report status. The stop action causes the non-primary member of the HA cluster to be transitioned to locked mode, the primary member to be transitioned to config mode, HA monitoring to be turned off on the primary member, the SAN application to be stopped, and both members to be transitioned to default mode. The start action causes the stop command to be run, the SAN application and HA monitor to be started on the local member, SMITH reset to be enabled on the local member, the SAN application and HA monitor to be started on the peer member, and SMITH reset to be enabled on the peer member. The configuration action causes the peer member to be transitioned to the locked mode and the local member to be transitioned to the configuration mode. The clear action clears an indicator that was set by failure of a start command. The primary action sets the status of the FSMPM on the local member to primary. The force reset action triggers an immediate HA reset.
  • The third component 410 may also be configured to force a hardware reset upon determining that the HA cluster is in a prohibited paired mode and is at risk of an SBS.
• Generally describing an example configuration of the computer 400, the processor 402 may be a variety of various processors including dual microprocessor and other multi-processor architectures. A memory 404 may include volatile memory (e.g., RAM (random access memory)) and/or non-volatile memory (e.g., ROM (read only memory)). The memory 404 can store a process 414 and/or data 416, for example. The process 414 may be an RBM process and the data 416 may be co-ordination and control data.
• The bus 412 may be a single internal bus interconnect architecture and/or other bus or mesh architectures. While a single bus is illustrated, it is to be appreciated that the computer 400 may communicate with various devices, logics, and peripherals using other busses (e.g., PCIE (peripheral component interconnect express), 1394, USB (universal serial bus), Ethernet). The bus 412 can be of types including, for example, a memory bus, a memory controller, a peripheral bus, an external bus, a crossbar switch, and/or a local bus.
  • The following includes definitions of selected terms employed herein. The definitions include various examples and/or forms of components that fall within the scope of a term and that may be used for implementation. The examples are not intended to be limiting.
  • References to “one embodiment”, “an embodiment”, “one example”, “an example”, and other similar terms indicate that the embodiment(s) or example(s) so described may include a particular feature, structure, characteristic, property, element, or limitation, but that not every embodiment or example necessarily includes that particular feature, structure, characteristic, property, element or limitation. Furthermore, repeated use of the phrase “in one embodiment” or “in one example” does not necessarily refer to the same embodiment or example.
  • To the extent that the term “includes” or “including” is employed in the detailed description or the claims, it is intended to be inclusive in a manner similar to the term “comprising” as that term is interpreted when employed as a transitional word in a claim.
  • To the extent that the term “or” is employed in the detailed description or claims (e.g., A or B) it is intended to mean “A or B or both”. When the applicants intend to indicate “only A or B but not both” then the term “only A or B but not both” will be employed. Thus, use of the term “or” herein is the inclusive, and not the exclusive use. See, Bryan A. Garner, A Dictionary of Modern Legal Usage 624 (2d. Ed. 1995).
  • While example apparatus, methods, and articles of manufacture have been illustrated by describing examples, and while the examples have been described in considerable detail, it is not the intention of the applicants to restrict or in any way limit the scope of the appended claims to such detail. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the systems, methods, and so on described herein. Therefore, the invention is not limited to the specific details, the representative apparatus, and illustrative examples shown and described. Thus, this application is intended to embrace alterations, modifications, and variations that fall within the scope of the appended claims.

Claims (20)

1. A non-transitory computer readable medium storing computer executable instructions that when executed by a computer in a high availability (HA) cluster control the computer to perform a method, the method comprising:
upon detecting an occurrence of a decision point in the lifecycle of the HA cluster, acquiring data describing an operating condition of a set of servers comprising the HA cluster, the data including a mode of a server in the HA cluster;
controlling at least one member of the HA cluster to selectively change mode to a target paired mode, where the target paired mode is selected based, at least in part, on mode pairing rules associated with the HA cluster; and
selectively forcing a hardware reset of one or more members of the HA cluster upon determining that the mode pairing rules have been violated.
2. The computer readable medium of claim 1, where the mode is one of: a default mode, a single mode, a configuration mode, a locked mode, a peerdown mode, and a failed startup mode.
3. The computer readable medium of claim 2, where:
in the default mode, HA monitoring for a server is ON and SMITH (Shoot Myself In The Head) reset is enabled;
in the single mode, HA monitoring for a server is OFF and a communicating peer server is communicating and in locked mode;
in the configuration mode, HA monitoring for a server is OFF and a communicating peer server is communicating and in locked mode;
in the locked mode, a storage area network (SAN) application on the server is stopped and prevented from starting;
in the peerdown mode, a peer server is OFF; and
in the failed startup mode, attempts to start an SAN application are blocked until a failure indicator is cleared.
4. The computer readable medium of claim 1, the data also including a status of the server in the HA cluster.
5. The computer readable medium of claim 4, where the status is one of: unknown, stopped, running, and primary.
6. The computer readable medium of claim 5, where:
the unknown status is reported when the server is not communicating;
the stopped status is reported when a status command returns a first pre-determined code;
the running status is reported when a status command returns a second, different pre-determined code; and
the primary status is reported when the server status is running and the FSMPM (file system port mapper) in the server is in the primary state.
7. The computer readable medium of claim 1, where controlling at least one member of the HA cluster to selectively change mode comprises issuing one or more of: a status, stop, start, config, clear, primary, and force reset command.
8. The computer readable medium of claim 7, where:
the status command causes members of the HA cluster to report their status;
the stop command causes the non-primary server in the HA cluster to be transitioned to locked mode, the primary server in the HA cluster to be transitioned to config mode, HA monitoring to be turned off on the primary server in the HA cluster, the SAN application to be stopped, and both servers in the HA cluster to be transitioned to default mode;
the start command also causes the stop command to be run, the SAN application and HA monitor to be started on the local server in the HA cluster, SMITH reset to be enabled on the local server in the HA cluster, the SAN application and HA monitor to be started on the peer server in the HA cluster, and SMITH reset to be enabled on the peer server in the HA cluster;
the configuration command causes the peer server in the HA cluster to be transitioned to the locked mode and the local server in the HA cluster to be transitioned to the configuration mode;
the clear command clears an indicator that was set by failure of a start command;
the primary command sets the status of the FSMPM on the local server in the HA cluster to primary; and
the force reset command triggers an immediate HA reset of one or more servers in the HA cluster.
9. The computer readable medium of claim 1, the method comprising:
storing values for the modes associated with members of the HA cluster to maintain modes through a hardware reset.
10. The computer readable medium of claim 1, the method comprising:
monitoring a file that indicates that a previous initialization of the HA cluster has failed, and
granting permission, prior to initialization of the HA cluster, for the HA cluster to initialize.
11. The computer readable medium of claim 1, the method comprising controlling members of the HA cluster to be in a mode pairing selected from an allowed set of mode pairings comprising:
default-default, default-locked, default-peerdown, single-peerdown, single-locked, config-locked, locked-default, locked-single, locked-config, and locked-locked.
12. The computer readable medium of claim 11, the method comprising controlling members of the HA cluster to not be in a mode pairing selected from the prohibited set of paired mode states comprising:
single-default, single-single, single-config, config-default, config-single, and config-config.
13. An apparatus, comprising:
a processor,
a memory, and
an interface that connects the processor, the memory, and a set of components, the set of components comprising:
a first component configured to acquire a mode from a member of an HA cluster;
a second component configured to determine a desired mode pairing for the member of the HA cluster; and
a third component configured to take an action configured to either achieve the desired mode pairing for the member of the HA cluster or to selectively force a hardware reset of the member of the HA cluster upon determining that a split brain scenario is possible based, at least in part, on the mode of the member of the HA cluster.
14. The apparatus of claim 13, the mode being one of:
a default mode, a single mode, a configuration mode, a locked mode, a peerdown mode, and a failed startup mode, where:
in the default mode, HA monitoring for a member of the HA cluster is ON and SMITH reset is enabled;
in the single mode, HA monitoring for a member of the HA cluster is OFF and a communicating peer member of the HA cluster is communicating and in locked mode or not communicating and in peerdown mode;
in the configuration mode, HA monitoring for a member of the HA cluster is OFF and a communicating peer member of the HA cluster is communicating and in locked mode or not communicating and in peerdown mode;
in the locked mode, an SAN application on the member of the HA cluster is stopped and prevented from starting;
in the peerdown mode, a peer member of the HA cluster is OFF; and
in the failed startup mode, attempts to start the SAN application are blocked until a failure indicator is cleared.
15. The apparatus of claim 14, the first component being configured to acquire a status from the member of the HA cluster, the status being one of:
unknown, stopped, running, and primary, and where:
the unknown status is reported when the server is not communicating;
the stopped status is reported when a status command returns a first pre-determined code;
the running status is reported when a status command returns a second, different pre-determined code; and
the primary status is reported when the member of the HA cluster status is running and the FSMPM in the member of the HA cluster is in the primary state.
16. The apparatus of claim 15, where the action performed by the third component is one of:
status, stop, start, configuration, clear, primary, and force reset, and where:
the status action causes a cluster member to report status;
the stop action causes the non-primary member of the HA cluster to be transitioned to locked mode, the primary member of the HA cluster to be transitioned to configuration mode, HA monitoring to be turned off on the primary member of the HA cluster, the SAN application to be stopped, and both members of the HA cluster to be transitioned to default mode;
the start action also causes the stop command to be run, the SAN application and HA monitor to be started on the local member of the HA cluster, SMITH reset to be enabled on the local member of the HA cluster, the SAN application and HA monitor to be started on the peer member of the HA cluster, and SMITH reset to be enabled on the peer member of the HA cluster;
the configuration action causes the peer member of the HA cluster to be transitioned to the locked mode and the local member of the HA cluster to be transitioned to the configuration mode;
the clear action clears an indicator that was set by failure of a start command;
the primary action sets the status of the FSMPM on the local member of the HA cluster to primary; and
the force reset action triggers an immediate HA reset.
17. The apparatus of claim 16, the desired mode pairings comprising:
default-default, default-locked, default-peerdown, single-peerdown, single-locked, config-locked, locked-default, locked-single, locked-config, and locked-locked.
18. The apparatus of claim 17, where prohibited mode pairings comprise:
single-default, single-single, single-config, config-default, config-single, and config-config.
19. The apparatus of claim 18, the third component being configured to force a hardware reset upon determining that the HA cluster is in a prohibited paired mode and is in danger of an SBS.
20. A high availability (HA) cluster manager apparatus, comprising:
a logic configured to acquire a substantially instantaneous state of an HA cluster, the state comprising at least a mode and status for members of the HA cluster;
a mode rules logic configured to select a mode for a member of the HA cluster, the mode being selected to make the HA cluster comply with a set of allowed mode pairings and to prevent the HA cluster from attaining a prohibited mode pairing; and
an action logic configured to prevent a split brain scenario in the HA cluster by transforming an HA cluster member state by performing one or more of: a status action, a stop action, a start action, a configuration action, a clear action, a primary action, and a force hardware reset action.
US12/906,220 2009-11-09 2010-10-18 Rules-Based, Mode-Driven Manager for Timer Bounded Arbitration Protocol Based Resource Control Abandoned US20110113228A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US25927109P 2009-11-09 2009-11-09
US12/906,220 US20110113228A1 (en) 2009-11-09 2010-10-18 Rules-Based, Mode-Driven Manager for Timer Bounded Arbitration Protocol Based Resource Control

Publications (1)

Publication Number Publication Date
US20110113228A1 true US20110113228A1 (en) 2011-05-12

Family

ID=43974950

Family Applications (4)

Application Number Title Priority Date Filing Date
US12/821,350 Active 2030-12-17 US8260830B2 (en) 2009-11-09 2010-06-23 Adapting a timer bounded arbitration protocol
US12/821,251 Active 2031-06-21 US8332684B2 (en) 2009-11-09 2010-06-23 Timer bounded arbitration protocol for resource control
US12/906,220 Abandoned US20110113228A1 (en) 2009-11-09 2010-10-18 Rules-Based, Mode-Driven Manager for Timer Bounded Arbitration Protocol Based Resource Control
US13/564,755 Active US8442948B2 (en) 2009-11-09 2012-08-02 Adapting a timer bounded arbitration protocol

Country Status (1)

Country Link
US (4) US8260830B2 (en)

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8850062B2 (en) * 2010-08-09 2014-09-30 Cisco Technology, Inc. Distributed connectivity verification protocol redundancy
US9600315B2 (en) * 2010-10-22 2017-03-21 Netapp, Inc. Seamless takeover of a stateful protocol session in a virtual machine environment
US9258244B1 (en) 2013-05-01 2016-02-09 Sandia Corporation Protocol for communications in potentially noisy environments
US10749711B2 (en) 2013-07-10 2020-08-18 Nicira, Inc. Network-link method useful for a last-mile connectivity in an edge-gateway multipath system
US10454714B2 (en) 2013-07-10 2019-10-22 Nicira, Inc. Method and system of overlay flow control
US10498652B2 (en) 2015-04-13 2019-12-03 Nicira, Inc. Method and system of application-aware routing with crowdsourcing
US10135789B2 (en) 2015-04-13 2018-11-20 Nicira, Inc. Method and system of establishing a virtual private network in a cloud service for branch networking
US10425382B2 (en) 2015-04-13 2019-09-24 Nicira, Inc. Method and system of a cloud-based multipath routing protocol
US10229017B1 (en) * 2015-10-01 2019-03-12 EMC IP Holding Company LLC Resetting fibre channel devices for failover in high availability backup systems
CN107211003B (en) * 2015-12-31 2020-07-14 华为技术有限公司 Distributed storage system and method for managing metadata
US10275468B2 (en) 2016-02-11 2019-04-30 Red Hat, Inc. Replication of data in a distributed file system using an arbiter
US11252079B2 (en) 2017-01-31 2022-02-15 Vmware, Inc. High performance software-defined core network
US20180219765A1 (en) 2017-01-31 2018-08-02 Waltz Networks Method and Apparatus for Network Traffic Control Optimization
US11706127B2 (en) 2017-01-31 2023-07-18 Vmware, Inc. High performance software-defined core network
US11121962B2 (en) 2017-01-31 2021-09-14 Vmware, Inc. High performance software-defined core network
US10992558B1 (en) 2017-11-06 2021-04-27 Vmware, Inc. Method and apparatus for distributed data network traffic optimization
US10992568B2 (en) 2017-01-31 2021-04-27 Vmware, Inc. High performance software-defined core network
US20200036624A1 (en) 2017-01-31 2020-01-30 The Mode Group High performance software-defined core network
US10778528B2 (en) 2017-02-11 2020-09-15 Nicira, Inc. Method and system of connecting to a multipath hub in a cluster
US10523539B2 (en) 2017-06-22 2019-12-31 Nicira, Inc. Method and system of resiliency in cloud-delivered SD-WAN
US11516049B2 (en) 2017-10-02 2022-11-29 Vmware, Inc. Overlay network encapsulation to forward data message flows through multiple public cloud datacenters
US11089111B2 (en) 2017-10-02 2021-08-10 Vmware, Inc. Layer four optimization for a virtual network defined over public cloud
US11115480B2 (en) 2017-10-02 2021-09-07 Vmware, Inc. Layer four optimization for a virtual network defined over public cloud
US10999165B2 (en) 2017-10-02 2021-05-04 Vmware, Inc. Three tiers of SaaS providers for deploying compute and network infrastructure in the public cloud
US10959098B2 (en) 2017-10-02 2021-03-23 Vmware, Inc. Dynamically specifying multiple public cloud edge nodes to connect to an external multi-computer node
US10999100B2 (en) 2017-10-02 2021-05-04 Vmware, Inc. Identifying multiple nodes in a virtual network defined over a set of public clouds to connect to an external SAAS provider
US11223514B2 (en) 2017-11-09 2022-01-11 Nicira, Inc. Method and system of a dynamic high-availability mode based on current wide area network connectivity
US11018995B2 (en) 2019-08-27 2021-05-25 Vmware, Inc. Alleviating congestion in a virtual network deployed over public clouds for an entity
US11144374B2 (en) * 2019-09-20 2021-10-12 Hewlett Packard Enterprise Development Lp Data availability in a constrained deployment of a high-availability system in the presence of pending faults
US11044190B2 (en) 2019-10-28 2021-06-22 Vmware, Inc. Managing forwarding elements at edge nodes connected to a virtual network
US11489783B2 (en) 2019-12-12 2022-11-01 Vmware, Inc. Performing deep packet inspection in a software defined wide area network
US11394640B2 (en) 2019-12-12 2022-07-19 Vmware, Inc. Collecting and analyzing data regarding flows associated with DPI parameters
US20210234804A1 (en) 2020-01-24 2021-07-29 Vmware, Inc. Accurate traffic steering between links through sub-path path quality metrics
US11245641B2 (en) 2020-07-02 2022-02-08 Vmware, Inc. Methods and apparatus for application aware hub clustering techniques for a hyper scale SD-WAN
US11363124B2 (en) 2020-07-30 2022-06-14 Vmware, Inc. Zero copy socket splicing
US11575591B2 (en) 2020-11-17 2023-02-07 Vmware, Inc. Autonomous distributed forwarding plane traceability based anomaly detection in application traffic for hyper-scale SD-WAN
US11575600B2 (en) 2020-11-24 2023-02-07 Vmware, Inc. Tunnel-less SD-WAN
US11929903B2 (en) 2020-12-29 2024-03-12 VMware LLC Emulating packet flows to assess network links for SD-WAN
US11792127B2 (en) 2021-01-18 2023-10-17 Vmware, Inc. Network-aware load balancing
US11979325B2 (en) 2021-01-28 2024-05-07 VMware LLC Dynamic SD-WAN hub cluster scaling with machine learning
US11509571B1 (en) 2021-05-03 2022-11-22 Vmware, Inc. Cost-based routing mesh for facilitating routing through an SD-WAN
US11729065B2 (en) 2021-05-06 2023-08-15 Vmware, Inc. Methods for application defined virtual network service among multiple transport in SD-WAN
US11489720B1 (en) 2021-06-18 2022-11-01 Vmware, Inc. Method and apparatus to evaluate resource elements and public clouds for deploying tenant deployable elements based on harvested performance metrics
US11375005B1 (en) 2021-07-24 2022-06-28 Vmware, Inc. High availability solutions for a secure access service edge application
US11943146B2 (en) 2021-10-01 2024-03-26 VMware LLC Traffic prioritization in SD-WAN
US11909815B2 (en) 2022-06-06 2024-02-20 VMware LLC Routing based on geolocation costs

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5295258A (en) * 1989-12-22 1994-03-15 Tandem Computers Incorporated Fault-tolerant computer system with online recovery and reintegration of redundant components
US5392434A (en) * 1993-09-03 1995-02-21 Motorola, Inc. Arbitration protocol system granting use of a shared resource to one of a plurality of resource users
US5649206A (en) * 1993-09-07 1997-07-15 Motorola, Inc. Priority arbitration protocol with two resource requester classes and system therefor
US5805905A (en) * 1995-09-06 1998-09-08 Opti Inc. Method and apparatus for arbitrating requests at two or more levels of priority using a single request line
US6625750B1 (en) * 1999-11-16 2003-09-23 Emc Corporation Hardware and software failover services for a file server
EP1370947A4 (en) * 2001-02-13 2009-05-27 Candera Inc Silicon-based storage virtualization server
US7159015B2 (en) * 2001-11-16 2007-01-02 Sun Microsystems, Inc. Method and apparatus for managing configuration information in a distributed computer system
US20050009467A1 (en) * 2003-07-11 2005-01-13 Nuber Raymond Mark System and method for satellite communication with a hybrid payload and DAMA support
US7711820B2 (en) * 2004-11-08 2010-05-04 Cisco Technology, Inc. High availability for intelligent applications in storage networks
US7484131B2 (en) * 2005-09-13 2009-01-27 International Business Machines Corporation System and method for recovering from a hang condition in a data processing system
US7747806B2 (en) * 2006-06-02 2010-06-29 Panasonic Corporation Resource use management device, resource use management system, and control method for a resource use management device
US20080201605A1 (en) * 2007-02-21 2008-08-21 Inventec Corporation Dead man timer detecting method, multiprocessor switching method and processor hot plug support method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020133727A1 (en) * 2001-03-15 2002-09-19 International Business Machines Corporation Automated node restart in clustered computer system
US20060031706A1 (en) * 2002-01-24 2006-02-09 Nick Ramirez Architecture for high availability using system management mode driven monitoring and communications
US20100088440A1 (en) * 2008-10-03 2010-04-08 Donald E Banks Detecting and preventing the split-brain condition in redundant processing units

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150382214A1 (en) * 2010-09-13 2015-12-31 Blinq Wireless Inc. System and method for reception mode switching in dual-carrier wireless backhaul networks
US9635571B2 (en) * 2010-09-13 2017-04-25 Blinq Wireless Inc. System and method for reception mode switching in dual-carrier wireless backhaul networks
US9877325B2 (en) 2010-09-13 2018-01-23 Blinq Wireless Inc. System and method for joint scheduling in dual-carrier wireless backhaul networks

Also Published As

Publication number Publication date
US20110113066A1 (en) 2011-05-12
US20120310999A1 (en) 2012-12-06
US20110107139A1 (en) 2011-05-05
US8260830B2 (en) 2012-09-04
US8442948B2 (en) 2013-05-14
US8332684B2 (en) 2012-12-11


Legal Events

Date Code Title Description
AS Assignment

Owner name: QUANTUM CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MIDDLECAMP, WILLIAM J.;REEL/FRAME:025149/0948

Effective date: 20101013

AS Assignment

Owner name: WELLS FARGO CAPITAL FINANCE, LLC, AS AGENT, CALIFORNIA

Free format text: SECURITY AGREEMENT;ASSIGNOR:QUANTUM CORPORATION;REEL/FRAME:027967/0914

Effective date: 20120329

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: QUANTUM CORPORATION, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:WELLS FARGO CAPITAL FINANCE, LLC, AS AGENT;REEL/FRAME:040474/0079

Effective date: 20161021