GB2418039A - Proactive maintenance for a high availability cluster of interconnected computers - Google Patents

Proactive maintenance for a high availability cluster of interconnected computers

Info

Publication number
GB2418039A
Authority
GB
United Kingdom
Prior art keywords
node
cluster
nodes
quality
cluster master
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0516362A
Other versions
GB0516362D0 (en
Inventor
Ken Gary Pomaranski
Andrew Harvey Barr
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Development Co LP
Original Assignee
Hewlett Packard Development Co LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Development Co LP
Publication of GB0516362D0
Publication of GB2418039A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2038Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/2028Failover techniques eliminating a faulty processor or activating a spare
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)
  • Medicines That Contain Protein Lipid Enzymes And Other Medicines (AREA)

Abstract

A high availability cluster apparatus 100 comprises a plurality of computing nodes 104, a cluster master 102 connected to each of the nodes 104, and a non-volatile memory 120 associated with the cluster master and configured to store data keeping track of quality measures of the nodes 104. The quality (i.e. operational health) measures could be parameters such as node error rate, critical performance parameters, chassis code output, or details of any Intelligent Platform Management Interface (IPMI) events. The cluster master 102 could also keep track of freshness measures of the nodes 104, i.e. how recently a node 104 has had a quality check. The cluster master 102 could be a separate computing system from the cluster nodes 104, or it could be an application that is distributed over the nodes 104 of the cluster. The input/output interfaces 103 of the cluster master 102 could be connected to the node I/O 106 using point-to-point links or using a network. The cluster master 102 could be configured to periodically gather status, quality and freshness measures for each node and then execute a quality check task. A poor quality active node 114 could then be replaced with the best available spare node 112, i.e. a failure prediction could be made for a node and suitable action taken.

Description

HIGH-AVAILABILITY CLUSTER WITH PROACTIVE MAINTENANCE
Inventors: Ken Gary Pomaranski and Andrew Harvey Barr
Field of the Invention
The present disclosure relates generally to computer networks.
More particularly, the present disclosure relates to clusters of interconnected computer systems.
Description of the Background Art
A cluster is a parallel or distributed system that comprises a collection of interconnected computer systems or servers that is used as a single, unified computing unit. Members of a cluster are referred to as nodes or systems. The cluster service is the collection of software on each node that manages cluster-related activity.
Clustering may be used for parallel processing or parallel computing to simultaneously use two or more processors to execute an application or program. Clustering is a popular strategy for implementing parallel processing applications because it allows system administrators to leverage already existing computers and workstations. Because it is difficult to predict the number of requests that will be issued to a networked server, clustering is also useful for load balancing to distribute processing and communications activity evenly across a network system so that no single server is overwhelmed. If one server is running the risk of being swamped, requests may be forwarded to another clustered server with greater capacity. For example, busy Web sites may employ two or more clustered Web servers in order to employ a load balancing scheme. Clustering also provides for increased scalability by allowing new components to be added as the system load increases. In addition, clustering simplifies the management of groups of systems and their applications by allowing the system administrator to manage an entire group as a single system. Clustering may also be used to increase the fault tolerance of a network system. If one server suffers an unexpected software or hardware failure, another clustered server may assume the operations of the failed server. Thus, if any hardware or software component in the system fails, the user might experience a performance penalty, but will not lose access to the service.
Current cluster services include Microsoft Cluster Server (MSCS), designed by Microsoft Corporation for clustering for its Windows NT 4.0 and Windows 2000 Advanced Server operating systems, and Novell Netware Cluster Services (NWCS), among other examples. For instance, MSCS supports the clustering of two NT servers to provide a single highly available server.
It is desirable to improve apparatus and methods for high availability (HA) clusters. It is particularly desirable to make HA clusters more robust and increase uptime for such clusters.
SUMMARY
One embodiment disclosed relates to a method of preventative maintenance of a high-availability cluster. A least-recently-tested active node is determined. The least-recently-tested active node is swapped out from the HA cluster, and a stand-by node is swapped into the HA cluster.
Another embodiment pertains to a high-availability cluster apparatus including a plurality of computing nodes of said cluster and a cluster master communicatively connected to each of the nodes. Non-volatile memory associated with the cluster master is configured to store data keeping track of quality measures of the nodes. Data keeping track of freshness measures of the nodes may also be stored.
Another embodiment pertains to a method of pro-actively maintaining a high-availability cluster having a plurality of nodes. The method includes keeping track of status variables, quality measures, and freshness measures for the nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a high-availability cluster in accordance with an embodiment of the invention.
FIG. 2 is a flow chart depicting a method of gathering maintenance related data in accordance with an embodiment of the invention.
FIG. 3A is a flow chart depicting a method of proactively checking the quality of active nodes in accordance with an embodiment of the invention.
FIG. 3B is a flow chart depicting a procedure for replacing a low quality node in accordance with an embodiment of the invention.
FIG. 4A is a flow chart depicting a method of performing preventative maintenance in accordance with an embodiment of the invention.
FIG. 4B is a flow chart depicting a testing procedure in accordance with an embodiment of the invention.
DETAILED DESCRIPTION
A highly disadvantageous event for a high-availability cluster is having a participating node of the cluster "drop-out" unexpectedly. First, it takes precious time to notice and verify that a node has disappeared.
Secondly, an unexpected drop-out is highly risky to the uptime of the HA cluster, as there is a finite probability of some "glitch" (sudden interruption of function) occurring during the drop-out which could cause the failure of applications to "switchover" to the backup nodes correctly.
Conventional implementations of HA clusters deal only with nodes after they have already failed or been disconnected from the cluster.
Furthermore, the quality of spare or underutilized nodes is not tracked after the HA cluster has been up for a period of time. Moving resources to these nodes may be dangerous to the operational health of the HA cluster.
An aspect of the present invention relates to reducing the chance that a participating node of a cluster drops out unexpectedly. Controlled or expected switchovers are much more desirable than unexpected drop-outs, because it is much easier to move applications off of a running system than off of one that has "disappeared" from the HA cluster.
Furthermore, controlled switchovers can be made to occur at the convenience of the operator, such as, for example, when the HA cluster is operating at a less critical time of the day.
Apparatus and methods are disclosed herein to make an HA cluster more robust by performing proactive maintenance on cluster nodes. It is expected that such proactive maintenance may result in significant reductions in unexpected outages of nodes, and therefore greater uptime for the HA cluster in total.
The proactive maintenance provided by the hardware system and software algorithms discussed below provides a means to safely assure the quality of nodes in an HA cluster. The quality assurance applies to both active and stand-by nodes and is handled at the cluster level, resulting in added robustness. In addition, the "freshness" of a quality-assured node is advantageously tracked and used, resulting in greater cluster uptime with less risk. With such proactive maintenance, failures can be anticipated and dealt with before a node or system crash occurs. Moreover, the impact on cluster level performance is advantageously minimized by swapping each node out of the active cluster prior to running functional/electrical tests on it.
FIG. 1 is a schematic diagram of a high-availability cluster in accordance with an embodiment of the invention. The HA cluster includes a cluster master 102 and multiple nodes 104.
In one implementation, the cluster master 102 is a separate computing system that may communicate with the nodes 104 via point-to-point links (as depicted in FIG. 1) or via a data communications network (such as an Ethernet type network, a mesh network, or other network). In the case of point-to-point links, an input/output (I/O) system of the cluster master 102 may communicate via links to I/O cards 106 at each of the nodes 104. The I/O card 106 for a node 104 may be implemented as a simple network card (for example, a conventional Ethernet card), or as a microprocessor-based smart network card with built-in functionality to handle a portion of the cluster management for that node.
In an alternate implementation, the cluster master 102 may be implemented as a distributed application that runs across the nodes 104 in the HA cluster. In this implementation, the cluster master 102 may have specific memory allocated to or associated with it at each node 104. The allocated memory may include non-volatile memory and may also include volatile memory.
The nodes 104 may be grouped into different sets based on their status. Nodes 104 that are actively being used as part of the operating cluster have an "active" status 110. Nodes 104 that are not actively being used as part of the operating cluster, but that are available to swap into the active cluster, have a "stand-by" or "spare" status 112. Inactive nodes that are unavailable to be swapped into the cluster may have a status indicating they are either "under test" or "out-of-service" 114. In one implementation, the status data may be stored in non-volatile memory 120 allocated to the cluster master 102.
As discussed further below, the cluster master 102 may be configured to keep track of the quality (operational health) of the cluster nodes 104. The quality-related information may include, for example, the node error rate, critical performance parameters or measurements, the chassis code output, and any IPMI (Intelligent Platform Management Interface) events coming from each node. Based on such information, each node in the cluster may be assigned a quality rank or measure. In addition, the cluster master 102 may be configured to keep track of the freshness of the nodes 104. The freshness relates to how recently that node has had a quality check. Each node may be assigned a freshness rank or measure, where the "freshest" node is one that has just gone through, and passed, a thorough testing procedure by the cluster master 102. Both the quality and freshness data may be stored in non-volatile memory 120 allocated to the cluster master 102.
Also discussed further below, the cluster master 102 may proactively take a specific node 104 out of the set of active nodes 110 of the HA cluster so as to perform diagnostic testing on the node 104. This may happen if the node's current health is below an acceptable level, and/or if it is that node's turn to get tested. The cluster master 102 also coordinates the movement of critical applications off the node 104 which is going to be removed from the active node set 110.
In accordance with an embodiment of the invention, the cluster master 102 is configured to perform various tasks. These tasks may include a "gather data" task discussed below in relation to FIG. 2, a "quality check" task discussed below in relation to FIGS. 3A and 3B, and a "perform preventative maintenance" task discussed below in relation to FIGS. 4A and 4B. Each of these tasks may be performed periodically, and the periodicity or cycle rate for each task may be individually configurable.
In accordance with an embodiment of the invention, each node 104 has a status that may be either "active", "spare" (or stand-by), "under test", or "under repair" (or out-of-service). A node under repair is unavailable to the HA cluster and unavailable to the cluster master 102.
A node that is "under repair" may be moved back into the spare resource pool 112 after the node is repaired (by physically replacing bad hardware, doing a reconfiguration, etc.). This may be done by manually or automatically changing the status from "under repair" to "ready to be tested" after the repair is done. The cluster master 102 may then determine the viability of the node through a series of tests or status checks, set its quality level, then move the node into the resource pool 112 by changing the status of the node to the spare (or stand-by) status if the quality level is sufficient (in other words, if the tests are passed). This method advantageously prevents potentially bad nodes from entering the resource pool 112.
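The four statuses and the per-node data kept by the cluster master can be pictured as a small record. The following is a minimal sketch only; the names `NodeStatus`, `NodeRecord`, and `is_available_to_cluster` are illustrative assumptions, as is the convention that a higher number means lower quality or less freshness (one of the conventions the description allows).

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    ACTIVE = "active"
    SPARE = "spare"                # also called stand-by
    UNDER_TEST = "under test"
    UNDER_REPAIR = "under repair"  # also called out-of-service


@dataclass
class NodeRecord:
    """Hypothetical per-node entry in the cluster master's data storage."""
    node_id: int
    status: NodeStatus
    quality: float = 0.0    # assumption: higher number = lower quality
    freshness: float = 0.0  # assumption: higher number = less fresh


def is_available_to_cluster(record: NodeRecord) -> bool:
    # A node under repair is unavailable to the HA cluster and the cluster master.
    return record.status is not NodeStatus.UNDER_REPAIR
```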
A similar process may be used for new nodes entering the cluster.
A new node may have its status set to a ready-to-be-tested status. Functional tests may then be applied to the new node, and a quality measure for the node may be set based on the results of the tests. The new node may then be placed in the resource pool 112 by changing the status of the node to the spare (or stand-by) status if the quality level is sufficient (in other words, if the tests are passed).
This process advantageously prevents potentially bad nodes from entering the resource pool 112.
FIG. 2 is a flow chart depicting a method (200) of gathering maintenance-related data in accordance with an embodiment of the invention.
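The admission path for repaired and new nodes might be sketched as follows. This is an illustration under stated assumptions, not the claimed method: `Node`, `admit_node`, `run_tests`, and the threshold `MAX_ALLOWED_QUALITY` are hypothetical names, and a higher quality number is assumed to mean lower quality. The text does not say what happens to a node that fails, so the sketch simply leaves its status unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ALLOWED_QUALITY = 5.0  # hypothetical threshold; higher measure = lower quality


@dataclass
class Node:
    node_id: int
    status: str = "ready-to-be-tested"
    quality: Optional[float] = None


def admit_node(node: Node, run_tests: Callable[[Node], float]) -> bool:
    """Test a repaired or new node; move it to the spare pool only if it passes.

    `run_tests` stands in for the functional tests or status checks and
    returns the resulting quality measure.
    """
    if node.status != "ready-to-be-tested":
        return False
    node.quality = run_tests(node)
    if node.quality <= MAX_ALLOWED_QUALITY:
        node.status = "spare"
        return True
    # Failure handling is not specified in the text; status is left as-is here.
    return False
```

Keeping the status change behind the quality comparison is what prevents an untested or failing node from ever appearing in the spare pool.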
These gathered data may be stored in data storage 120 associated with the cluster master 102. A cycle of this "gather data" task (200) may be periodically (220) performed by the cluster master 102. The frequency of performance of the gather data task may be configurable.
For each task cycle, the method (200) goes through all the nodes 104. The node 104 assigned the number 1 is set (202) to be the first in the illustrated example of FIG. 2, but other orders of the nodes may be followed instead. For each node "n", the status is gathered (204) by communications between the cluster master 102 and that node 104. If the status indicates that the node n is "under repair" (out-of-service) (206), then the process moves on to the next node. Moving onto the next node may involve, for example, incrementing (216) the node number n, and gathering the status data from the next node (204), so long as there are more nodes from which to gather data (218).
If the status indicates that the node n is not under repair (i.e. that it is either active, or a spare, or under test) (206), then the cluster master 102 gathers further data from the node n. The gathering of further data includes gathering quality-related data (208) and freshness-related data (212), not necessarily in that order. The quality-related data may include, for example, a chassis code from node n and performance data. Using the quality-related data, a quality measure or ranking may be generated (210). In one implementation, a higher number for the quality measure indicates a lower quality of the node.
Using the freshness-related data, a freshness measure or ranking may be generated (214). In one implementation, a higher number for the freshness measure indicates that a node is less "fresh". Of course, other implementations of the quality and freshness measures are also possible. For example, a lower number for the quality measure may indicate a lower quality of the node, and a lower number for the freshness measure may indicate a less fresh node.
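One cycle of the gather-data task of FIG. 2 could be sketched along these lines. The three callables are hypothetical stand-ins for the communication between the cluster master and each node, and `NodeRecord` and `store` are assumed names for the data held in storage 120; how raw data is turned into a measure is left to the callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class NodeRecord:
    status: str = "spare"
    quality: float = 0.0
    freshness: float = 0.0


def gather_data(
    node_ids: Iterable[int],
    get_status: Callable[[int], str],
    get_quality: Callable[[int], float],
    get_freshness: Callable[[int], float],
    store: Dict[int, NodeRecord],
) -> None:
    """Run one gather-data cycle; flow-chart step numbers are noted in comments."""
    for n in node_ids:                        # 202, 216, 218: walk all nodes
        record = store.setdefault(n, NodeRecord())
        record.status = get_status(n)         # 204: gather status
        if record.status == "under repair":   # 206: skip out-of-service nodes
            continue
        record.quality = get_quality(n)       # 208, 210: quality measure
        record.freshness = get_freshness(n)   # 212, 214: freshness measure
```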
FIG. 3A is a flow chart depicting a method of proactively checking the quality of active nodes 110 in accordance with an embodiment of the invention. A cycle of this "quality check" task (300) may be periodically (318) performed by the cluster master 102. The frequency of performance of the quality check task may be configurable.
For each task cycle, the method (300) goes through all the nodes 104. The node 104 assigned the number 1 is set (302) to be the first in the illustrated example of FIG. 3A, but other orders of the nodes may be followed instead. For each node "n", the status is checked (304) by examining the status data stored in the data storage 120 associated with the cluster master 102. If the status indicates that the node n is not "active" (306), then the process moves on to the next node. Moving onto the next node may involve, for example, incrementing (314) the node number n and checking status of the next node (304), so long as there are more nodes to check (316).
If the status indicates that the node n is active (306), then the cluster master 102 checks (308) the quality measure for the node n as stored in the data storage 120. If the quality measure does not exceed a maximum allowed measure (310), then the method (300) moves onto the next node, for example, by incrementing n (314). If the quality measure is greater than the maximum allowed measure (310), then node n is replaced with a "best" spare node (312).
In other words, the node n is replaced (312) with the best-available spare node if the node n is deemed (310) to have an unsatisfactorily poor quality measure. A procedure for this replacement (312) is discussed in further detail below in relation to FIG. 3B. After the replacement (312), the method (300) moves onto the next node, for example, by incrementing n (314) and checking status of the next node (304), so long as there are more nodes to check (316).
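The quality-check loop of FIG. 3A can be sketched as below, assuming the convention that a higher quality number means lower quality. The dictionary layout and the callback `replace_with_best_spare` are hypothetical; the callback stands in for the replacement procedure of FIG. 3B.

```python
from typing import Callable, Dict, List


def quality_check(
    records: Dict[int, dict],
    max_allowed: float,
    replace_with_best_spare: Callable[[int], None],
) -> List[int]:
    """Run one quality-check cycle and return the ids of the nodes replaced."""
    replaced = []
    for n in sorted(records):                # 302, 314, 316: walk nodes in order
        record = records[n]
        if record["status"] != "active":     # 304, 306: only active nodes
            continue
        if record["quality"] > max_allowed:  # 308, 310: too poor?
            replace_with_best_spare(n)       # 312: see FIG. 3B
            replaced.append(n)
    return replaced
```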
In an alternate embodiment, the quality check task may replace an active node so long as there is a better spare node available. Such a method may involve comparing the quality measure for each active node with that of the best-available spare node. If the quality measure is poorer for the active node, then the active node is replaced in the cluster by the best-available spare node.
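A sketch of this alternate embodiment follows. It assumes that "best-available" is judged by the quality measure alone (lower number is better) and that each spare can be used for only one replacement per cycle; both are assumptions, and `plan_replacements` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def plan_replacements(
    active_quality: Dict[int, float],
    spare_quality: Dict[int, float],
) -> List[Tuple[int, int]]:
    """Return (active_id, spare_id) pairs where the spare has the better quality."""
    # Spares sorted so that the best (lowest measure) is always first.
    remaining = sorted(spare_quality.items(), key=lambda item: item[1])
    plan = []
    for node_id, quality in active_quality.items():
        if remaining and quality > remaining[0][1]:
            spare_id, _ = remaining.pop(0)
            plan.append((node_id, spare_id))
    return plan
```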
FIG. 3B is a flow chart depicting a procedure for replacing (312) an active node of low quality in accordance with an embodiment of the invention.
Quality measures (332) and freshness measures (334) are obtained by the cluster master 102 from the stored quality/freshness data 120 for all spare (stand-by) nodes 112. While FIG. 3B depicts the quality measures being obtained first and the freshness measures second, the measures need not be obtained in that order.
Using the quality and freshness data, a determination is made as to the "best" spare node (336). This determination may be made using various techniques or formulas. For example, the best spare node may be the spare node that is the most fresh with a satisfactory quality measure, where the satisfactory quality measure is no greater than a maximum allowable. As another example, the best spare node may be the spare node that is of best quality with a satisfactory freshness, where a satisfactory freshness measure is no greater than a maximum allowable. As another example, the quality and freshness measures may be averaged for each spare node, and the spare node with the lowest average (best combined quality and freshness) may be chosen.
The averaging may be weighted in favor of either the quality or the freshness measure. Other formulas may also be used.
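The three example formulas for choosing the "best" spare node can be sketched as follows, assuming that lower numbers are better for both measures. All function and parameter names are illustrative assumptions, and the weighting scheme (a single `quality_weight` between 0 and 1) is only one possible way to weight the average.

```python
from typing import Dict, Optional, Tuple

# node id -> (quality measure, freshness measure); lower is better for both
Measures = Dict[int, Tuple[float, float]]


def freshest_with_ok_quality(spares: Measures, max_quality: float) -> Optional[int]:
    """Most fresh spare among those whose quality measure is satisfactory."""
    ok = {n: qf for n, qf in spares.items() if qf[0] <= max_quality}
    return min(ok, key=lambda n: ok[n][1]) if ok else None


def best_quality_with_ok_freshness(spares: Measures, max_freshness: float) -> Optional[int]:
    """Best-quality spare among those whose freshness measure is satisfactory."""
    ok = {n: qf for n, qf in spares.items() if qf[1] <= max_freshness}
    return min(ok, key=lambda n: ok[n][0]) if ok else None


def lowest_weighted_average(spares: Measures, quality_weight: float = 0.5) -> Optional[int]:
    """Spare with the lowest weighted average of quality and freshness."""
    if not spares:
        return None

    def score(n: int) -> float:
        quality, freshness = spares[n]
        return quality_weight * quality + (1.0 - quality_weight) * freshness

    return min(spares, key=score)
```

With `quality_weight=0.5` the third function reduces to the plain average; values closer to 1 favor quality and values closer to 0 favor freshness.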
Once the best spare node has been determined (336), the best spare node may be added (338) to the active list of the cluster and then swapped (340) into the HA cluster. Critical applications may then be moved (342) from the node being replaced, and then the node being replaced may be removed from the cluster and have its status changed to the repair (out-of-service) status (344).
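The ordering of the swap matters: the spare joins the cluster before applications leave the node being replaced. A minimal sketch of steps 338 to 344, with hypothetical dictionaries standing in for the status data and for the placement of critical applications:

```python
from typing import Dict, List


def replace_node(
    statuses: Dict[int, str],
    apps: Dict[int, List[str]],
    replaced: int,
    best_spare: int,
) -> None:
    """Swap `best_spare` in for `replaced`; step numbers follow FIG. 3B."""
    statuses[best_spare] = "active"                   # 338, 340: add and swap in
    moved = apps.pop(replaced, [])                    # 342: move critical apps
    apps.setdefault(best_spare, []).extend(moved)
    statuses[replaced] = "under repair"               # 344: out-of-service
```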
FIG. 4A is a flow chart depicting a method of performing preventative maintenance in accordance with an embodiment of the invention. A cycle of this "preventative maintenance" task (400) may be periodically (420) performed by the cluster master 102. The frequency of performance of the preventative maintenance task may be configurable.
For each task cycle, the method (400) goes through all the nodes 104. The node 104 assigned the number 1 is set (402) to be the first in the illustrated example of FIG. 4A, but other orders of the nodes may be followed instead. In addition, a greatest-freshness variable is reset, for example, to zero.
For each node "n", the status is checked (404) by examining the status data stored in the data storage 120 associated with the cluster master 102. If the status indicates that the node n is not "active" (406), then the process moves on to the next node. Moving onto the next node may involve, for example, incrementing (414) the node number n and checking status of the next node (404), so long as there are more nodes to check (416).
If the status indicates that the node n is active (406), then the cluster master 102 checks (408) the freshness measure for the node n as stored in the data storage 120. The freshness measure for the node n is compared with the value of the greatest-freshness variable. If the freshness measure for this node is greater than the value of the greatest-freshness variable, then the greatest-freshness variable is replaced with the freshness measure for this node, and the node-to-test is set to be node n (412). On the other hand, if the freshness measure for this node is not greater than the value of the greatest-freshness variable, then the method (400) moves onto the next node, for example, by incrementing n (414) and checking status of the next node (404), so long as there are more nodes to check (416). When there are no more nodes to check, then the node-to-test should be the active node with the highest freshness measure, i.e. the least fresh node.
When all the nodes have been checked (416), then a testing procedure is run (418). An example of the testing procedure (418) is discussed in further detail below in relation to FIG. 4B.
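The selection loop of FIG. 4A amounts to a running maximum over the freshness measures of the active nodes. A sketch, assuming a higher freshness number means a less fresh node; the record layout and the name `pick_node_to_test` are hypothetical:

```python
from typing import Dict, Optional, Tuple


def pick_node_to_test(records: Dict[int, Tuple[str, float]]) -> Optional[int]:
    """Return the least fresh active node, or None if no node qualifies.

    `records` maps node id -> (status, freshness measure).
    """
    greatest_freshness = 0.0               # reset at the start of each cycle
    node_to_test = None
    for n in sorted(records):              # 402, 414, 416: walk nodes in order
        status, freshness = records[n]
        if status != "active":             # 404, 406: only active nodes
            continue
        if freshness > greatest_freshness: # 408: compare with running maximum
            greatest_freshness = freshness
            node_to_test = n               # 412
    return node_to_test
```

Because the comparison is strict, ties go to the earlier node, and an active node whose freshness measure equals the reset value of zero is never selected.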
FIG. 4B is a flow chart depicting a testing procedure (418) in accordance with an embodiment of the invention. This testing procedure (418) relates to the testing of a node-to-test from the set of active nodes 110. For example, the node-to-test may be determined by the preventative maintenance method (400) discussed above in relation to FIG. 4A.
Quality measures (432) and freshness measures (434) are obtained by the cluster master 102 from the stored quality/freshness data 120 for all spare (stand-by) nodes 112. While FIG. 4B depicts the quality measures being obtained first and the freshness measures second, the measures need not be obtained in that order.
Using the quality and freshness data, a determination is made as to the "best" spare node (436). This determination may be made using various techniques or formulas. For example, the best spare node may be the spare node that is the most fresh with a satisfactory quality measure, where the satisfactory quality measure is no greater than a maximum allowable. As another example, the best spare node may be the spare node that is of best quality with a satisfactory freshness, where a satisfactory freshness measure is no greater than a maximum allowable. As another example, the quality and freshness measures may be averaged for each spare node, and the spare node with the lowest average (best combined quality and freshness) may be chosen.
The averaging may be weighted in favor of either the quality or the freshness measure. Other formulas may also be used.
Once the best spare node has been determined (436), the best spare node may be added (438) to the active list of the cluster (changing the node's status to active) and then swapped (440) into the HA cluster. Critical applications may then be moved (442) from the node-to-test to the swapped-in node.
The node-to-test may then have its status changed to the under test status, and functional and/or electrical tests (preferably both) may be run on the node-under-test (444). The tests may involve use of "worst-case" type data patterns and other stresses to probe the functionality and robustness of the node being tested. From the test results, a new or updated quality measure for the node is determined (446).
With the node having just been retested, the freshness measure is restored (448) to the lowest (most fresh) value. The just tested node is then added (450) to the spare (stand-by) list (changing the node's status to spare).
However, if the just tested node had an updated quality measure deemed too poor or "bad" (for example, having a quality measure above a maximum allowable value) (452), then the status of the just tested node may be changed (454) to repair (out-of-service).
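The second half of the testing procedure (steps 444 to 454) can be sketched as a short sequence of status changes. The sketch assumes the best spare has already been swapped in and applications moved (steps 436 to 442); `Node`, `retest_node`, `run_tests`, and `MOST_FRESH` are hypothetical names, and lower numbers are assumed to be better for both measures.

```python
from dataclasses import dataclass
from typing import Callable

MOST_FRESH = 0.0  # assumption: the lowest freshness value means most fresh


@dataclass
class Node:
    node_id: int
    status: str
    quality: float
    freshness: float


def retest_node(
    node: Node,
    run_tests: Callable[[Node], float],
    max_allowed_quality: float,
) -> None:
    """Test a swapped-out node and file it as spare or under repair."""
    node.status = "under test"                # 444: node leaves the active set
    node.quality = run_tests(node)            # 444, 446: tests give new quality
    node.freshness = MOST_FRESH               # 448: just tested, so most fresh
    node.status = "spare"                     # 450: back to the stand-by list
    if node.quality > max_allowed_quality:    # 452: quality too poor?
        node.status = "under repair"          # 454: out-of-service
```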
In the above description, numerous specific details are given to provide a thorough understanding of embodiments of the invention. However, the above description of illustrated embodiments of the invention is not intended to be exhaustive or to limit the invention to the precise forms disclosed. One skilled in the relevant art will recognize that the invention can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures or operations are not shown or described in detail to avoid obscuring aspects of the invention. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the claims. Rather, the scope of the invention is to be determined by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.

Claims (10)

  1. A high-availability cluster apparatus, the apparatus comprising: a plurality of computing nodes of said cluster; a cluster master communicatively connected to each of the nodes; and non-volatile memory associated with the cluster master, wherein the non-volatile memory is configured to store data keeping track of quality measures of the nodes.
  2. The apparatus of claim 1, wherein the non-volatile memory is further configured to store data keeping track of freshness measures of the nodes.
  3. The apparatus of claim 2, wherein the cluster master is configured to keep track of status variables of the nodes.
  4. The apparatus of claim 1, wherein the cluster master comprises a separate computing system from the nodes of the cluster.
  5. The apparatus of claim 1, wherein the cluster master comprises an application that is distributed over the nodes of the cluster.
  6. The apparatus of claim 1, further comprising: point-to-point links between input/output interfaces of the cluster master and each of the nodes.
  7. The apparatus of claim 1, further comprising: a network interconnecting input/output interfaces of the cluster master and each of the nodes.
  8. The apparatus of claim 3, further comprising: computer-executable code of the cluster master configured to periodically execute a gather task so as to determine the status variable and the quality and freshness measures for each node.
  9. The apparatus of claim 8, further comprising: computer-executable code of the cluster master configured to periodically execute a quality check task so as to check the quality measure of each active node.
  10. The apparatus of claim 9, wherein the quality check task is configured to replace a poor-quality active node with a best-available spare node.
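As an informal illustration of claims 1 to 3 and 8 to 10 (not part of the claims), the per-node data kept by the cluster master and one pass of the quality check task might be sketched as below. Everything here is an assumption made for illustration: a JSON file stands in for the non-volatile memory, the record layout and function names are hypothetical, "best-available" is taken to mean the spare with the lowest quality measure, and the periodic scheduling of the task is omitted.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical layout: node name -> {"status", "quality", "freshness"}.
Records = Dict[str, dict]


def save_records(path: str, records: Records) -> None:
    """Persist the records, replacing the old file atomically."""
    target = Path(path)
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(records, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to stable storage
    os.replace(tmp, target)


def load_records(path: str) -> Records:
    target = Path(path)
    return json.loads(target.read_text()) if target.exists() else {}


def quality_check(records: Records, max_quality: float) -> List[Tuple[str, str]]:
    """Pair each poor-quality active node with the best unused spare.

    Returns (active_node, replacement_spare) pairs and does not modify
    the records; applying the swap is left to the caller.
    """
    spares = sorted(
        (name for name, r in records.items()
         if r["status"] == "spare" and r["quality"] <= max_quality),
        key=lambda name: records[name]["quality"],
    )
    plan = []
    for name, r in records.items():
        if r["status"] == "active" and r["quality"] > max_quality and spares:
            plan.append((name, spares.pop(0)))
    return plan
```

Returning a plan instead of mutating the records keeps the check separate from the swap itself, so the same function could serve a dry run.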
GB0516362A 2004-09-08 2005-08-09 Proactive maintenance for a high availability cluster of interconnected computers Withdrawn GB2418039A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/935,941 US7409576B2 (en) 2004-09-08 2004-09-08 High-availability cluster with proactive maintenance

Publications (2)

Publication Number Publication Date
GB0516362D0 GB0516362D0 (en) 2005-09-14
GB2418039A true GB2418039A (en) 2006-03-15

Family

ID=34984339

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0516362A Withdrawn GB2418039A (en) 2004-09-08 2005-08-09 Proactive maintenance for a high availability cluster of interconnected computers

Country Status (3)

Country Link
US (1) US7409576B2 (en)
JP (1) JP2006079602A (en)
GB (1) GB2418039A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013112356A1 (en) 2012-01-23 2013-08-01 Microsoft Corporation Building large scale test infrastructure using hybrid clusters
US8689040B2 (en) 2010-10-01 2014-04-01 Lsi Corporation Method and system for data reconstruction after drive failures

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002058337A1 (en) * 2001-01-19 2002-07-25 Openwave Systems, Inc. Computer solution and software product to establish error tolerance in a network environment
US20040153714A1 (en) * 2001-01-19 2004-08-05 Kjellberg Rikard M. Method and apparatus for providing error tolerance in a network environment
US7543174B1 (en) * 2003-09-24 2009-06-02 Symantec Operating Corporation Providing high availability for an application by rapidly provisioning a node and failing over to the node
US8195976B2 (en) * 2005-06-29 2012-06-05 International Business Machines Corporation Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance
JP4920391B2 (en) * 2006-01-06 2012-04-18 株式会社日立製作所 Computer system management method, management server, computer system and program
US20070234114A1 (en) * 2006-03-30 2007-10-04 International Business Machines Corporation Method, apparatus, and computer program product for implementing enhanced performance of a computer system with partially degraded hardware
US7850260B2 (en) * 2007-06-22 2010-12-14 Oracle America, Inc. Injection/ejection mechanism
JP2010086363A (en) * 2008-10-01 2010-04-15 Fujitsu Ltd Information processing apparatus and apparatus configuration rearrangement control method
JP5332518B2 (en) * 2008-10-31 2013-11-06 日本電気株式会社 Build-up computer, switching control method, and program
US9454444B1 (en) 2009-03-19 2016-09-27 Veritas Technologies Llc Using location tracking of cluster nodes to avoid single points of failure
US8073952B2 (en) * 2009-04-22 2011-12-06 Microsoft Corporation Proactive load balancing
NL2003736C2 (en) * 2009-10-30 2011-05-03 Ambient Systems B V Communication method for high density wireless networks, terminal, cluster master device, central node, and system therefor.
US8458515B1 (en) 2009-11-16 2013-06-04 Symantec Corporation Raid5 recovery in a high availability object based file system
US8495323B1 (en) 2010-12-07 2013-07-23 Symantec Corporation Method and system of providing exclusive and secure access to virtual storage objects in a virtual machine cluster
US9026560B2 (en) 2011-09-16 2015-05-05 Cisco Technology, Inc. Data center capability summarization
WO2013136526A1 (en) * 2012-03-16 2013-09-19 株式会社日立製作所 Distributed application-integrating network system
US9529772B1 (en) * 2012-11-26 2016-12-27 Amazon Technologies, Inc. Distributed caching cluster configuration
US9847907B2 (en) 2012-11-26 2017-12-19 Amazon Technologies, Inc. Distributed caching cluster management
US9602614B1 (en) 2012-11-26 2017-03-21 Amazon Technologies, Inc. Distributed caching cluster client configuration
US9262323B1 (en) 2012-11-26 2016-02-16 Amazon Technologies, Inc. Replication in distributed caching cluster
US9116958B2 (en) * 2012-12-07 2015-08-25 At&T Intellectual Property I, L.P. Methods and apparatus to sample data connections
US9590852B2 (en) * 2013-02-15 2017-03-07 Facebook, Inc. Server maintenance system
US9760420B1 (en) * 2014-09-03 2017-09-12 Amazon Technologies, Inc. Fleet host rebuild service implementing vetting, diagnostics, and provisioning pools
US10972335B2 (en) * 2018-01-24 2021-04-06 Hewlett Packard Enterprise Development Lp Designation of a standby node
US11640342B2 (en) * 2019-10-30 2023-05-02 Ghost Autonomy Inc. Fault state transitions in an autonomous vehicle
CN113595246B (en) * 2021-07-30 2023-06-23 广东电网有限责任公司 Micro-grid state online monitoring method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020087913A1 (en) * 2000-12-28 2002-07-04 International Business Machines Corporation System and method for performing automatic rejuvenation at the optimal time based on work load history in a distributed data processing environment
US20030105867A1 (en) * 2001-11-30 2003-06-05 Oracle Corporation Managing a high availability framework by enabling & disabling individual nodes
US20030158933A1 (en) * 2002-01-10 2003-08-21 Hubbert Smith Failover clustering based on input/output processors

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61233845A (en) 1985-04-09 1986-10-18 Nec Corp Normality check system for standby system device
JP2748757B2 (en) 1991-12-12 1998-05-13 日本電気株式会社 Information processing device
US5553058A (en) * 1995-07-10 1996-09-03 Telefonaktiebolaget Lm Ericsson Centralized load minimizing method for periodical routing verification tests scheduling
JP2872113B2 (en) 1996-06-12 1999-03-17 日本電気通信システム株式会社 Micro diagnostic system for information processing equipment
US6601084B1 (en) * 1997-12-19 2003-07-29 Avaya Technology Corp. Dynamic load balancer for multiple network servers
JPH11203157A (en) 1998-01-13 1999-07-30 Fujitsu Ltd Redundancy device
US6389551B1 (en) 1998-12-17 2002-05-14 Steeleye Technology, Inc. Method of preventing false or unnecessary failovers in a high availability cluster by using a quorum service
US6594784B1 (en) * 1999-11-17 2003-07-15 International Business Machines Corporation Method and system for transparent time-based selective software rejuvenation
US6609213B1 (en) * 2000-08-10 2003-08-19 Dell Products, L.P. Cluster-based system and method of recovery from server failures
US20020087612A1 (en) * 2000-12-28 2002-07-04 Harper Richard Edwin System and method for reliability-based load balancing and dispatching using software rejuvenation
US6922791B2 (en) * 2001-08-09 2005-07-26 Dell Products L.P. Failover system and method for cluster environment
JP4072392B2 (en) 2002-07-29 2008-04-09 Necエンジニアリング株式会社 Multiprocessor switching method
US7137040B2 (en) * 2003-02-12 2006-11-14 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
US8627149B2 (en) * 2004-08-30 2014-01-07 International Business Machines Corporation Techniques for health monitoring and control of application servers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Linux Clusters: The HPC Revolution [online] 17-20 May 2004, Leangsuksun et al, "A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster". Available from http://www.linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/2004techpapers.html *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8689040B2 (en) 2010-10-01 2014-04-01 Lsi Corporation Method and system for data reconstruction after drive failures
WO2013112356A1 (en) 2012-01-23 2013-08-01 Microsoft Corporation Building large scale test infrastructure using hybrid clusters
EP2807552A4 (en) * 2012-01-23 2016-08-03 Microsoft Technology Licensing Llc Building large scale test infrastructure using hybrid clusters

Also Published As

Publication number Publication date
JP2006079602A (en) 2006-03-23
US20060053337A1 (en) 2006-03-09
US7409576B2 (en) 2008-08-05
GB0516362D0 (en) 2005-09-14

Similar Documents

Publication Publication Date Title
US7409576B2 (en) High-availability cluster with proactive maintenance
US10346230B2 (en) Managing faults in a high availability system
US20050188283A1 (en) Node management in high-availability cluster
US7426554B2 (en) System and method for determining availability of an arbitrary network configuration
CN108153622B (en) Fault processing method, device and equipment
DE60314025T2 (en) System and method for identifying a faulty component in a network element
US9164864B1 (en) Minimizing false negative and duplicate health monitoring alerts in a dual master shared nothing database appliance
US20070011499A1 (en) Methods for ensuring safe component removal
JP2005209190A (en) Reporting of multi-state status for high-availability cluster node
JP2008532170A (en) Computer QC module test monitor
GB2418040A (en) Monitoring a high availability cluster using a smart card
Alfatafta et al. Toward a generic fault tolerance technique for partial network partitioning
US20080270852A1 (en) Multi-directional fault detection system
US10853191B2 (en) Method, electronic device and computer program product for maintenance of component in storage system
JP2007156679A (en) Failure recovery method for server, and database system
JP2012504808A (en) Method, apparatus, and program for use in a computerized storage system that includes one or more replaceable units to manage testing of one or more replacement units (to manage testing of replacement units) Computerized storage system with replaceable units)
Nagaraja et al. Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services
CN108199901A (en) Hardware reports method, system, equipment, hardware management server and storage medium for repairment
Jang et al. Enhancing Node Fault Tolerance through High-Availability Clusters in Kubernetes
CN106777238B (en) A kind of self-adapted tolerance adjusting method of HDFS distributed file system
CN112564968B (en) Fault processing method, device and storage medium
Corsava et al. Self-healing intelligent infrastructure for computational clusters
Alfatafta et al. Understanding partial network partitioning
Alfatafta An analysis of partial network partitioning failures in modern distributed systems
US20240195679A1 (en) Smart online link repair and job scheduling in machine learning supercomputers

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)