US20020198996A1 - Flexible failover policies in high availability computing systems - Google Patents

Flexible failover policies in high availability computing systems

Info

Publication number
US20020198996A1
Authority
US
United States
Prior art keywords
failover
resource
node
domain
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US09/997,404
Inventor
Padmanabhan Sreenivasan
Ajit Dandapani
Michael Nishimoto
Ira Pramanick
Manish Verma
Robert Bradshaw
Luca Castellano
Raghu Mallena
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment America LLC
Morgan Stanley and Co LLC
Original Assignee
Graphics Properties Holdings Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US18986400P priority Critical
Priority to US81135701A priority
Priority to US09/997,404 priority patent/US20020198996A1/en
Application filed by Graphics Properties Holdings Inc filed Critical Graphics Properties Holdings Inc
Assigned to SILICON GRAPHICS, INC. reassignment SILICON GRAPHICS, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CASTELLANO, LUCA, VERMA, MANISH, SREENIVASAN, PADMANABHAN, NISHIMOTO, MICHAEL, DANDAPANI, AJIT, MALLENA, RAGHU, PRAMANICK, IRA, BRADSHAW, ROBERT DAVID
Publication of US20020198996A1 publication Critical patent/US20020198996A1/en
Assigned to WELLS FARGO FOOTHILL CAPITAL, INC. reassignment WELLS FARGO FOOTHILL CAPITAL, INC. SECURITY AGREEMENT Assignors: SILICON GRAPHICS, INC. AND SILICON GRAPHICS FEDERAL, INC. (EACH A DELAWARE CORPORATION)
Assigned to GENERAL ELECTRIC CAPITAL CORPORATION reassignment GENERAL ELECTRIC CAPITAL CORPORATION SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SILICON GRAPHICS, INC.
Assigned to MORGAN STANLEY & CO., INCORPORATED reassignment MORGAN STANLEY & CO., INCORPORATED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GENERAL ELECTRIC CAPITAL CORPORATION
Assigned to SILICON GRAPHICS INTERNATIONAL, CORP. reassignment SILICON GRAPHICS INTERNATIONAL, CORP. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SGI INTERNATIONAL, INC., SILICON GRAPHICS, INC. ET AL.
Assigned to SONY COMPUTER ENTERTAINMENT AMERICA LLC reassignment SONY COMPUTER ENTERTAINMENT AMERICA LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SILICON GRAPHICS INTERNATIONAL, CORP.
Assigned to SONY INTERACTIVE ENTERTAINMENT AMERICA LLC reassignment SONY INTERACTIVE ENTERTAINMENT AMERICA LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: SONY COMPUTER ENTERTAINMENT AMERICA LLC
Application status is Abandoned legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2033 Failover techniques switching over of hardware resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Application independent communication protocol aspects or techniques in packet data networks
    • H04L69/40 Techniques for recovering from a failure of a protocol instance or entity, e.g. failover routines, service redundancy protocols, protocol state redundancy or protocol service redirection in case of a failure or disaster recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805 Real-time

Abstract

A system for implementing a failover policy includes a cluster infrastructure for managing a plurality of nodes, a high availability infrastructure for providing group and cluster membership services, and a high availability script execution component operative to receive a failover script and at least one failover attribute and operative to produce a failover domain. In addition, a method for determining a target node for a failover comprises executing a failover script that produces a failover domain, the failover domain having an ordered list of nodes, receiving a failover attribute and based on the failover attribute and failover domain, selecting a node upon which to locate a resource.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Application No. 60/189,864, entitled “HIGH AVAILABILITY COMPUTING SYSTEM AND METHOD” and filed Mar. 16, 2000, and is related to cofiled, copending and coassigned U.S. patent application Ser. No. ______ entitled “MAINTAINING MEMBERSHIP IN HIGH AVAILABILITY COMPUTING SYSTEMS”, both of which are hereby incorporated herein by reference. [0001]
  • FIELD
  • The present invention is related to computer processing, and more particularly to providing flexible failover policies on high availability computer processing systems. [0002]
  • COPYRIGHT NOTICE/PERMISSION
  • A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright © 2000, 2001 Silicon Graphics Incorporated, All Rights Reserved. [0003]
  • BACKGROUND INFORMATION
  • Companies today rely on computers to drive all aspects of their business. Certain business functions can survive intermittent interruptions in service; others cannot. [0004]
  • To date, attempts to ensure high availability to mission critical applications have relied on two approaches. Applications have been made more available either through the use of specialized fault tolerant hardware or through cumbersome changes to the applications or to the environment in which the applications run. These approaches increase the costs to the organization of running the applications. In addition, certain approaches to making applications more available increase the risk of introducing errors in the underlying data. [0005]
  • Furthermore, these approaches [0006]
  • What is needed is a system and method of increasing the availability of mission critical applications by providing greater failover flexibility in determining the targets for moving resources from a machine that has failed. [0007]
  • SUMMARY OF THE INVENTION
  • To address the problems stated above, and to solve other problems that will become apparent in reading the specification and claims, a high availability computing system and method are described. The high availability computing system includes a plurality of servers connected by a first and a second network, wherein the servers communicate with each other to detect server failure and transfer applications to other servers on detecting server failure through a process referred to as “failover”. [0008]
  • According to another aspect of the present invention, a system for implementing a failover policy includes a cluster infrastructure for managing a plurality of nodes, a high availability infrastructure for providing group and cluster membership services, and a high availability script execution component operative to receive a failover script and at least one failover attribute and operative to produce a failover domain. [0009]
  • According to another aspect of the invention, a method for determining a target node for a failover comprises executing a failover script that produces a failover domain, the failover domain having an ordered list of nodes, receiving a failover attribute and, based on the failover attribute and the failover domain, selecting a node upon which to locate a resource. [0010]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1A shows a diagram of the hardware and operating environment in conjunction with which embodiments of the invention may be practiced; [0011]
  • FIG. 1B is a diagram illustrating an exemplary node configuration according to embodiments of the invention; and [0012]
  • FIG. 2 is a flowchart illustrating a method for providing failover policies according to an embodiment of the invention. [0013]
  • DETAILED DESCRIPTION
  • In the following detailed description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. [0014]
  • Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar computing device, that manipulates and transforms data represented as physical (e.g., electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. [0015]
  • Definitions
  • A number of computing terms will be used throughout this specification. In this specification, a cluster node is a single computer system. Usually, a cluster node is an individual computer. The term node is also used for brevity. When one node fails, other nodes are left intact and able to operate. [0016]
  • A pool is the entire set of nodes involved with a group of clusters. The group of clusters are usually close together and should always serve a common purpose. A replicated database is stored on each node in the pool. [0017]
  • A cluster is a collection of one or more nodes coupled to each other by networks or other similar interconnections. A cluster is identified by a simple name; this name must be unique within the pool. A particular node may be a member of only one cluster. All nodes in a cluster are also in the pool; however, not all nodes in the pool are necessarily in the cluster. [0018]
  • A node membership is the list of nodes in a cluster on which High Availability base software can allocate resource groups. [0019]
  • A process membership is the list of process instances in a cluster that form a process group. There can be multiple process groups per node. [0020]
  • A client-server environment is one in which a set of users operate on a set of client systems connected through a network to a set of server systems. Often, applications within a client-server system are divided into two components: a client component and a server component. Each component can run on the same or different nodes. A process running the client component of the application is called a client; a process running the server component is called a server. [0021]
  • Clients send requests to servers and collect responses from them. Not all servers can satisfy all requests. For instance, a class of Oracle database servers might be able to satisfy requests regarding the employees of a company, while another class might be able to satisfy requests regarding the company's products. [0022]
  • Servers that are able to satisfy the same type of requests are said to be providing the same service. The time interval between the event of posting a request and the event of receiving a response is called latency. [0023]
  • Service availability can be defined by the following example. Consider a web service implemented by a set of web servers running on a single system. Assume that the system suffers an operating system failure. After the system is rebooted, the web servers are restarted and clients can connect again. A failure of the servers therefore appears to clients like a long latency. [0024]
  • A service is said to be unavailable to a client when latencies become greater than a certain threshold, called critical latency. Otherwise, it is available. A service is down when it is unavailable to all clients; otherwise, it is up. An outage occurs when a service goes down. The outage lasts until the service comes up again. [0025]
  • If downtime is the sum of the durations of outages over a certain time interval D=[t, t′], for a certain service S, service availability can be defined as: [0026]
  • avail(S)=1−downtime/(t′- t)
  • where t′ - t is a large time interval, generally a year. For instance, a service that is available 99.99% of the time should have a yearly downtime of about an hour. A service that is available 99.99% of the time or more is generally called highly available. [0027]
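  • As a rough illustration of the availability formula, the following Python sketch computes avail(S) from a list of outage durations and derives the yearly downtime that corresponds to 99.99% availability. The function name `availability`, the constant `HOURS_PER_YEAR`, and the sample figures are illustrative assumptions and do not appear in the specification.

```python
def availability(outage_durations, interval):
    """Return avail(S) = 1 - downtime / (t' - t).

    outage_durations: durations of the individual outages in the interval.
    interval: length of the observation interval t' - t, in the same unit.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    downtime = sum(outage_durations)
    return 1.0 - downtime / interval


# Assumption for illustration: a non-leap year of 365 days.
HOURS_PER_YEAR = 365 * 24

# Yearly downtime, in hours, that corresponds to 99.99% availability.
allowed_downtime_hours = (1.0 - 0.9999) * HOURS_PER_YEAR

# Two hypothetical outages that together use up that budget.
example = availability([0.5, 0.376], HOURS_PER_YEAR)
```

  • Under these assumptions the downtime budget works out to a little under 53 minutes per year, which matches the "about an hour" figure given above.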
  • Service outages occur for two reasons: maintenance (e.g. hardware and software upgrades) and failures (e.g. hardware failures, OS crashes). [0028]
  • Outages due to maintenance are generally considered less severe. They can be scheduled when clients are less active, for instance, during a weekend. Users can get early notification. Downtime due to maintenance is often called scheduled downtime. On the other hand, failures tend to occur when the servers are working under heavy load, i.e. when most clients are connected. Downtime due to failures is often called unscheduled downtime. Sometimes service availability is measured considering only unscheduled downtime. [0029]
  • Vendors often provide figures for system availability. System availability is computed similarly to service availability. The downtime is obtained by multiplying the average number of system failures (OS crashes, HW failures, . . . ) by the average repair time. [0030]
  • Consider a service whose servers are distributed on a set of N (where N>1) nodes in a cluster. For the service to be unavailable, all of the N nodes must fail at the same time. Since most system failures are statistically independent, the probability of such an event is p^N, where p is the probability of a failure of a single system. For example, given a cluster of 2 nodes with availability of 99.7% for each node, at any given time there is a 0.3% or 0.003 probability that a node is unavailable. The probability of both nodes being unavailable at the same time is 0.003^2 = 0.000009 or 0.0009%. The cluster as a whole therefore has a system availability of 1 − 0.000009 = 0.999991, or 99.9991%. The system availability of a cluster is thus high enough to allow the deployment of highly available services. [0031]
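  • The two-node calculation can be reproduced with a short Python sketch. It assumes, as the text does, that node failures are statistically independent; the function names are hypothetical.

```python
def cluster_unavailability(node_availability, n_nodes):
    """Probability that all n_nodes are unavailable at the same time.

    Assumes independent failures, so the result is p ** n_nodes,
    where p = 1 - node_availability.
    """
    if n_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    p = 1.0 - node_availability
    return p ** n_nodes


def cluster_availability(node_availability, n_nodes):
    """System availability of the cluster as a whole."""
    return 1.0 - cluster_unavailability(node_availability, n_nodes)


# The example from the text: two nodes, each 99.7% available.
both_down = cluster_unavailability(0.997, 2)
whole_cluster = cluster_availability(0.997, 2)
```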
  • A resource is a single physical or logical entity that provides a service to clients or other resources. For example, a resource can be a single disk volume, a particular network address, or an application such as a web server. A resource is generally available for use over time on two or more nodes in a cluster, although it can be allocated to only one node at any given time. [0032]
  • Resources are identified by a resource name and a resource type. One resource can be dependent on one or more other resources; if so, it will not be able to start (that is, be made available for use) unless the resources it depends on are also started. Dependent resources must be part of the same resource group and are identified in a resource dependency list. [0033]
  • A resource name identifies a specific instance of a resource type. A resource name must be unique for a given resource type. [0034]
  • A resource type is a particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type. [0035]
  • A resource type is identified by a simple name; this name must be unique within the cluster. A resource type can be defined for a specific node, or it can be defined for an entire cluster. A resource type that is defined for a specific node overrides a cluster-wide resource type definition with the same name; this allows an individual node to override global settings from a cluster-wide resource type definition. [0036]
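  • The override rule can be pictured as a two-level lookup. The sketch below is only an illustration: the dictionary layout, the function name `resolve_resource_type`, and the `monitor_interval` setting are assumptions, not structures described in the specification.

```python
def resolve_resource_type(name, node, node_definitions, cluster_definitions):
    """Return the definition of resource type `name` as seen from `node`.

    A node-specific definition, if present, takes precedence over the
    cluster-wide definition with the same name.
    """
    node_specific = node_definitions.get(node, {})
    if name in node_specific:
        return node_specific[name]
    return cluster_definitions.get(name)


# Hypothetical definitions: node1 overrides one cluster-wide setting.
cluster_defs = {"IP_address": {"monitor_interval": 30}}
node_defs = {"node1": {"IP_address": {"monitor_interval": 10}}}

on_node1 = resolve_resource_type("IP_address", "node1", node_defs, cluster_defs)
on_node2 = resolve_resource_type("IP_address", "node2", node_defs, cluster_defs)
```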
  • Like resources, a resource type can be dependent on one or more other resource types. If such a dependency exists, at least one instance of each of the dependent resource types must be defined. For example, a resource type named Netscape_web might have resource type dependencies on resource types named IP_address and volume. If a resource named web is defined with the Netscape_web resource type, then the resource group containing web must also contain at least one resource of the type IP_address and one resource of the type volume. [0037]
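  • A minimal sketch of this group-level rule follows, using the Netscape_web example. The data layout (a list of name/type pairs and a dictionary of type dependencies) and the function name are assumptions made for illustration.

```python
# Hypothetical table: resource type -> resource types it depends on.
TYPE_DEPENDENCIES = {
    "Netscape_web": {"IP_address", "volume"},
    "IP_address": set(),
    "volume": set(),
}


def missing_type_dependencies(group, type_dependencies):
    """Report resource types that a resource group still lacks.

    group: list of (resource_name, resource_type) pairs.
    Returns a dict mapping (resource_name, resource_type) to the sorted
    list of dependent types with no instance in the group.
    """
    present_types = {rtype for _name, rtype in group}
    missing = {}
    for name, rtype in group:
        absent = type_dependencies.get(rtype, set()) - present_types
        if absent:
            missing[(name, rtype)] = sorted(absent)
    return missing


incomplete = [("web", "Netscape_web"), ("199.10.48.22", "IP_address")]
complete = incomplete + [("Vol1", "volume")]

problems = missing_type_dependencies(incomplete, TYPE_DEPENDENCIES)
no_problems = missing_type_dependencies(complete, TYPE_DEPENDENCIES)
```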
  • In one embodiment, predefined resource types are provided. However, a user can create additional resource types. [0038]
  • A resource group is a collection of interdependent resources. A resource group is identified by a simple name; this name must be unique within a cluster. Table 1 shows an example of the resources for a resource group named WebGroup. [0039]
    TABLE 1
    Resource        Resource Type
    Vol1            volume
    /fs1            filesystem
    199.10.48.22    IP_address
    web1            Netscape_web
    Oracle_DB       Application
  • In some embodiments, if any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. In these embodiments, a resource group is the unit of failover for the High Availability base software. [0040]
  • In some embodiments of the invention, resource groups cannot overlap; that is, two resource groups cannot contain the same resource. [0041]
  • A resource dependency list is a list of resources upon which a resource depends. Each resource instance must have resource dependencies that satisfy its resource type dependencies before it can be added to a resource group. [0042]
  • A resource type dependency list is a list of resource types upon which a resource type depends. For example, the filesystem resource type depends upon the volume resource type, and the Netscape_web resource type depends upon the filesystem and IP_address resource types. [0043]
  • For example, suppose a file system instance /fs1 is mounted on volume /vol1. [0044]
  • Before /fs1 can be added to a resource group, /fs1 must be defined to depend on /vol1. The High Availability base software only knows that a file system instance must have one volume instance in its dependency list. This requirement is inferred from the resource type dependency list. [0045]
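  • The instance-level check described for /fs1 and /vol1 might look like the following sketch. The dependency table mirrors the examples given in the text, but the function name and the data layout are hypothetical.

```python
# Resource type dependency lists taken from the examples in the text.
TYPE_DEPENDENCY_LISTS = {
    "volume": [],
    "filesystem": ["volume"],
    "Netscape_web": ["filesystem", "IP_address"],
}


def unsatisfied_types(resource_type, dependency_list,
                      type_dependency_lists=TYPE_DEPENDENCY_LISTS):
    """Return the dependent types not yet covered by a resource instance.

    dependency_list: list of (resource_name, resource_type) pairs that the
    instance is defined to depend on. An empty result means the instance
    satisfies its resource type dependencies and may be added to a group.
    """
    provided = {rtype for _name, rtype in dependency_list}
    required = type_dependency_lists.get(resource_type, [])
    return [rtype for rtype in required if rtype not in provided]


before = unsatisfied_types("filesystem", [])
after = unsatisfied_types("filesystem", [("/vol1", "volume")])
```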
  • A failover is the process of allocating a resource group (or application) to another node, according to a failover policy. A failover may be triggered by the failure of a resource, a change in the node membership (such as when a node fails or starts), or a manual request by the administrator. [0046]
  • A failover policy is the method used by High Availability base software to determine the destination node of a failover. A failover policy consists of the following: [0047]
  • Failover domain [0048]
  • Failover attributes [0049]
  • Failover script [0050]
  • The administrator can configure a failover policy for each resource group. A failover policy name must be unique within the pool. [0051]
  • A failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be within the same cluster. However, the failover domain does not have to include every node in the cluster. [0052]
  • The administrator defines the initial failover domain when creating a failover policy. This list is transformed into a run-time failover domain by the failover script. [0053]
  • High Availability base software uses the run-time failover domain along with failover attributes and the node membership to determine the node on which a resource group should reside. High Availability base software stores the run-time failover domain and uses it as input to the next failover script invocation. Depending on the run-time conditions and contents of the failover script, the initial and run-time failover domains may be identical. [0054]
  • In general, High Availability base software allocates a given resource group to the first node listed in the run-time failover domain that is also in the node membership: the point at which this allocation takes place is affected by the failover attributes. [0055]
  • A failover attribute is a string that affects the allocation of a resource group in a cluster. The administrator must specify system attributes (such as Auto_Failback or Controlled_Failback), and can optionally supply site-specific attributes. [0056]
  • A failover script is a shell script that generates a run-time failover domain and returns it to the High Availability base software process. The High Availability base software process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current node membership. [0057]
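  • The selection rule (the first node in the run-time failover domain that is also in the node membership) can be sketched as follows. This is an illustration only: the function name is hypothetical, and the effect of failover attributes on when the allocation takes place is not modeled.

```python
def select_target_node(runtime_domain, node_membership):
    """Return the first node of the run-time failover domain that is also
    in the current node membership, or None if there is no such node.

    runtime_domain: ordered list of node names.
    node_membership: iterable of node names currently in the membership.
    """
    members = set(node_membership)
    for node in runtime_domain:
        if node in members:
            return node
    return None


# Hypothetical three-node domain in which node1 has left the membership.
target = select_target_node(["node1", "node2", "node3"], {"node2", "node3"})
```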
  • The action scripts are the set of scripts that determine how a resource is started, monitored, and stopped. Typically, there will be a set of action scripts specified for each resource type. [0058]
  • The following is the complete set of action scripts that can be specified for each resource: [0059]
  • probe, which verifies that the resource is configured on a server [0060]
  • exclusive, which verifies that the resource is not already running [0061]
  • start, which starts the resource [0062]
  • stop, which stops the resource [0063]
  • monitor, which monitors the resource [0064]
  • restart, which restarts the resource on the same server after a monitoring failure occurs [0065]
  • Highly Available Services Overview
  • Highly Available (HA) services can be provided in two ways. First, a multi-server application, using built-in or highly available services, can directly provide HA services. In the alternative, a single-server application layered on top of multi-server highly available system services can provide equivalent HA services. In other words, a single-server application may depend on a special application which uses the multi-server application discussed above. [0066]
  • FIG. 1A illustrates an exemplary environment for providing HA services. As shown, the environment 10 includes nodes 20, clients 24, and database 26, all communicably coupled by a network 22. Each of nodes 20.1-20.5 is a computer system comprising hardware and software that can provide services to clients 24. Thus nodes 20 can also be referred to as servers. Specifically, processes 26 comprise software that provides services to clients 24. Moreover, each of nodes 20.1 can be suitably configured to provide high availability services according to the various embodiments of the invention. [0067]
  • FIG. 1B provides further detail of software layers according to an embodiment of the invention that can be run on nodes 20 to support HA services. As illustrated, software running on a node 20 includes cluster infrastructure 12, HA infrastructure 14, HA base software 16, application plug-ins 28, and processes 26. [0068]
  • Cluster infrastructure 12 includes software components for performing the following: [0069]
  • Node logging [0070]
  • Cluster administration [0071]
  • Node definition [0072]
  • In one embodiment, the cluster software infrastructure includes cluster_admin and cluster_control subsystems. [0073]
  • HA infrastructure 14 provides software components to define clusters, resources, and resource types. In one embodiment, the HA infrastructure includes the following: [0074]
  • Cluster membership daemon. Provides the list of nodes, called node membership, available to the cluster. [0075]
  • Group membership daemon. Provides group membership and reliable communication services in the presence of failures to HA base software 16 processes. [0076]
  • Start daemon. Starts HA base software daemons and restarts them on failures. [0077]
  • System resource manager daemon. Manages resources, resource groups and resource types. Executes action scripts for resources. [0078]
  • Interface agent daemon. Monitors the local node's network interfaces. [0079]
  • Further details on the cluster infrastructure 12 and HA infrastructure 14 can be found in the cofiled, copending, coassigned U.S. patent application Ser. No. ______ entitled “MAINTAINING MEMBERSHIP IN HIGH AVAILABILITY COMPUTING SYSTEMS”, previously incorporated by reference. [0080]
  • HA base software 16 provides end-to-end monitoring of services and clients in order to determine whether resource load balancing or failover is required. In one embodiment, HA base software 16 is the IRIS FailSafe product available from Silicon Graphics, Inc. In this embodiment, HA base software includes the software required to provide the following high-availability services: [0081]
  • IP addresses (the IP_address resource type) [0082]
  • XLV logical volumes (the volume resource type) [0083]
  • XFS file systems (the filesystem resource type) [0084]
  • MAC addresses (the MAC_address resource type) [0085]
  • In one embodiment of the invention, application plug-ins 28 comprise software components that provide an interface to convert applications such as processes 26 into high-availability services. For example, application plug-ins 28 can include database agents. Each database agent monitors all instances of one type of database. In one embodiment, database agents comprise the following: [0086]
  • IRIS FailSafe Oracle [0087]
  • IRIS FailSafe INFORMIX [0088]
  • IRIS FailSafe Netscape Web [0089]
  • IRIS FailSafe Mediabase [0090]
  • Processes 26 include software applications that can be configured to provide services. It is not a requirement that any of processes 26 be intrinsically HA services. Application plug-ins 28, along with HA base software 16, can be used to turn processes 26 into HA services. In order for a plug-in to be used to turn a process 26 into an HA application, it is desirable that the process 26 have the following characteristics: [0091]
  • The application can be easily restarted and monitored. [0092]
  • It should be able to recover from failures, as most client/server software does. The failure could be a hardware failure, an operating system failure, or an application failure. If a node crashes and reboots, client/server software should be able to attach again automatically. [0093]
  • The application must have a start and stop procedure. [0094]
  • When the resource group fails over, the resources that constitute the resource group are stopped on one node and started on another node, according to the failover script and action scripts. [0095]
  • The application can be moved from one node to another after failures. [0096]
  • If the resource has failed, it must still be possible to run the resource stop procedure. In addition, the resource must recover from the failed state when the resource start procedure is executed in another node. [0097]
  • The application does not depend on knowing the host name; that is, it can be configured to work with an IP address. [0098]
  • It should be noted that an application process 26 itself is not modified to make it into a high-availability service. [0099]
  • In addition, node 20 can include a database (not shown). The database can be used to store information including: [0100]
  • Resources [0101]
  • Resource types [0102]
  • Resource groups [0103]
  • Failover policies [0104]
  • Nodes [0105]
  • Clusters [0106]
  • In one embodiment, a cluster administration daemon (cad) maintains identical databases on each node in the cluster. [0107]
  • Method
  • The previous section described an overview of a system for providing high availability services and failover policies for such services. This section will provide a description of a method 200 for providing failover policies for high availability services. The methods to be performed by the operating environment constitute computer programs made up of computer-executable instructions. Describing the methods by reference to a flowchart enables one skilled in the art to develop such programs including such instructions to carry out the methods on suitable computers (the processor of the computer executing the instructions from computer-readable media). The method illustrated in FIG. 2 is inclusive of the acts required to be taken by an operating environment executing an exemplary embodiment of the invention. [0108]
  • The method 200 begins when a software component, such as HA base software 16 (FIG. 1A), executes a failover script for a resource in response to either a failover event such as a node or process failure, or a load balancing event such as a resource bottleneck or processor load (block 202). The failover script can be programmed in any of a number of languages, including Java, Perl, shell (Bourne, C-Shell, Korn shell etc.) or the C programming language. The invention is not limited to a particular programming language. In one embodiment of the invention, the following scripts can be executed: [0109]
  • probe, which verifies that the resource is configured on a node [0110]
  • exclusive, which verifies that the resource is not already running [0111]
  • start, which starts the resource [0112]
  • stop, which stops the resource [0113]
  • monitor, which monitors the resource [0114]
  • restart, which restarts the resource on the same node when a monitoring failure occurs [0115]
  • It should be noted that in some embodiments, the start, stop, and exclusive scripts are required for every resource type. A monitor script may also be required, but if need be it can contain only a return-success function. A restart script may be required if the restart mode is set to 1; however, this script may contain only a return-success function. The probe script is optional. [0116]
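The script requirements above can be sketched as a small check that a resource type supplies every script it needs. This is only an illustration: the function name `missing_scripts`, the `restart_mode` parameter, and the choice to treat the monitor script as always required are assumptions of the sketch, not part of the described system.

```python
# Hypothetical sketch: which action scripts is a resource type still missing?
# Assumption: monitor is treated as required (it may be a return-success stub).
REQUIRED_SCRIPTS = {"start", "stop", "exclusive", "monitor"}
OPTIONAL_SCRIPTS = {"probe"}


def missing_scripts(provided, restart_mode=0):
    """Return the sorted names of required action scripts that are absent."""
    required = set(REQUIRED_SCRIPTS)
    # A restart script is needed only when the restart mode is set to 1.
    if restart_mode == 1:
        required.add("restart")
    return sorted(required - set(provided))


if __name__ == "__main__":
    print(missing_scripts({"start", "stop", "probe"}))
```

A resource type that supplies all four required scripts plus a restart script passes the check in either restart mode.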
  • In some embodiments, there are two types of monitoring that may be accomplished in a monitor script: [0117]
  • Is the resource present? [0118]
  • Is the resource responding? [0119]
  • For a client-node resource that follows a protocol, the monitoring script can make a simple request and verify that the proper response is received. For a web node, the monitoring script can request a home page, verify that the connection was made, and ignore the resulting home page. For a database, a simple request such as querying a table can be made. [0120]
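As a rough illustration of the web node case, a monitor script could request the home page, confirm that a response arrives, and discard the body. The sketch below uses Python's standard `http.client` module; the function name `monitor_web`, its parameters, and the convention of returning 0 for success and 1 for failure are assumptions made for this example.

```python
import http.client


def monitor_web(host, port=80, timeout=5.0):
    """Return 0 if the web node answers a home page request, otherwise 1."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")        # ask for the home page
        response = conn.getresponse()   # the connection was made and answered
        response.read()                 # drain the page and ignore its content
        return 0
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, or a malformed reply: not responding.
        return 1
    finally:
        conn.close()
```

In this sketch any HTTP response counts as "responding"; a stricter monitor could also inspect the status code.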
  • Next, a system executing the method receives a failover domain as output from the failover script (block 204). The failover script can receive an input domain, apply script logic, and provide an output domain. The output domain is an ordered list of nodes on which a given resource can be allocated. [0121]
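The input domain, script logic, output domain flow can be sketched with two interchangeable pieces of script logic. Both functions, their names, and the `live_nodes` and `failed_node` parameters are hypothetical; they only show how different logic can yield different ordered lists from the same input domain.

```python
def ordered_failover(input_domain, live_nodes):
    """Keep the input order, dropping nodes that are not currently alive."""
    live = set(live_nodes)
    return [node for node in input_domain if node in live]


def round_robin_failover(input_domain, live_nodes, failed_node):
    """Rotate the domain so the search starts just after the failed node."""
    if failed_node in input_domain:
        start = input_domain.index(failed_node) + 1
        rotated = input_domain[start:] + input_domain[:start]
    else:
        rotated = list(input_domain)
    return ordered_failover(rotated, live_nodes)


if __name__ == "__main__":
    domain = ["A", "B", "C"]
    print(ordered_failover(domain, {"B", "C"}))
    print(round_robin_failover(domain, {"A", "C"}, "B"))
```

Either function returns an ordered list of nodes, which is the only contract the rest of the method relies on in this sketch.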
  • Next, the system receives failover attributes (block 206). The failover attributes are used by the scripts and by the HA base software to modify the run-time failover domain used for a specific resource group. Based on the failover domain and attributes, the method determines a target node for the resource (block 208). Once a target node has been determined, the system can cause the resource to start on the target node. [0122]
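A minimal sketch of the final step, choosing a target node from the failover domain and attributes, might look like the following. The attribute name `in_place_recovery` and the rule attached to it (prefer the node already hosting the resource when it is still in the domain) are assumptions introduced for illustration only.

```python
def select_target(failover_domain, attributes, current_node=None):
    """Pick a target node from an ordered failover domain, or None if empty."""
    if not failover_domain:
        return None
    # Hypothetical attribute: stay on the current node when it is still eligible.
    if "in_place_recovery" in attributes and current_node in failover_domain:
        return current_node
    # Default: the first node in the ordered domain is the preferred target.
    return failover_domain[0]


if __name__ == "__main__":
    print(select_target(["B", "C"], set()))
    print(select_target(["B", "C"], {"in_place_recovery"}, current_node="C"))
```

Keeping the selection separate from the failover script means the same ordered domain can produce different targets under different attributes.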
  • In the above discussion and in the attached appendices, the term “computer” is defined to include any digital or analog data processing unit. Examples include any personal computer, workstation, set top box, mainframe, server, supercomputer, laptop or personal digital assistant capable of embodying the inventions described herein. [0123]
  • Examples of articles comprising computer readable media are floppy disks, hard drives, CD-ROM or DVD media or any other read-write or read-only memory device. [0124]
  • Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiment shown. This application is intended to cover any adaptations or variations of the present invention. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof. [0125]

Claims (2)

What is claimed is:
1. A system for implementing a failover policy comprising:
a cluster infrastructure for managing a plurality of nodes;
a high availability infrastructure for providing group and cluster membership services; and
a high availability script execution component operative to receive a failover script and at least one failover attribute and operative to produce a failover domain.
2. A method for determining a target node for a failover, comprising:
executing a failover script, said script producing a failover domain, said failover domain having an ordered list of nodes;
receiving a failover attribute; and
based on the failover attribute and failover domain, selecting a node upon which to locate a resource.
US09/997,404 2000-03-16 2001-11-29 Flexible failover policies in high availability computing systems Abandoned US20020198996A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US18986400P 2000-03-16 2000-03-16
US81135701A 2001-03-16 2001-03-16
US09/997,404 US20020198996A1 (en) 2000-03-16 2001-11-29 Flexible failover policies in high availability computing systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US09/997,404 US20020198996A1 (en) 2000-03-16 2001-11-29 Flexible failover policies in high availability computing systems
US12/891,390 US8769132B2 (en) 2000-03-16 2010-09-27 Flexible failover policies in high availability computing systems
US14/288,079 US9405640B2 (en) 2000-03-16 2014-05-27 Flexible failover policies in high availability computing systems
US15/226,725 US20170031790A1 (en) 2000-03-16 2016-08-02 Flexible failover policies in high availability computing systems

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US81135701A Continuation 2001-03-16 2001-03-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/891,390 Continuation US8769132B2 (en) 2000-03-16 2010-09-27 Flexible failover policies in high availability computing systems

Publications (1)

Publication Number Publication Date
US20020198996A1 US20020198996A1 (en) 2002-12-26

Family

ID=46278520

Family Applications (4)

Application Number Title Priority Date Filing Date
US09/997,404 Abandoned US20020198996A1 (en) 2000-03-16 2001-11-29 Flexible failover policies in high availability computing systems
US12/891,390 Active US8769132B2 (en) 2000-03-16 2010-09-27 Flexible failover policies in high availability computing systems
US14/288,079 Active US9405640B2 (en) 2000-03-16 2014-05-27 Flexible failover policies in high availability computing systems
US15/226,725 Pending US20170031790A1 (en) 2000-03-16 2016-08-02 Flexible failover policies in high availability computing systems

Country Status (1)

Country Link
US (4) US20020198996A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040163007A1 (en) * 2003-02-19 2004-08-19 Kazem Mirkhani Determining a quantity of lost units resulting from a downtime of a software application or other computer-implemented system
US20050036483A1 (en) * 2003-08-11 2005-02-17 Minoru Tomisaka Method and system for managing programs for web service system
US20050071449A1 (en) * 2003-09-30 2005-03-31 International Business Machines Corporation Policy driven autonomic computing-programmatic policy definitions
US20050091352A1 (en) * 2003-09-30 2005-04-28 International Business Machines Corporation Policy driven autonomic computing-specifying relationships
US20050091351A1 (en) * 2003-09-30 2005-04-28 International Business Machines Corporation Policy driven automation - specifying equivalent resources
US20050132379A1 (en) * 2003-12-11 2005-06-16 Dell Products L.P. Method, system and software for allocating information handling system resources in response to high availability cluster fail-over events
US20050138317A1 (en) * 2003-12-19 2005-06-23 Cannon David M. Real-time feedback for policies for computing system management
US20050259572A1 (en) * 2004-05-19 2005-11-24 Esfahany Kouros H Distributed high availability system and method
US20060036894A1 (en) * 2004-07-29 2006-02-16 International Business Machines Corporation Cluster resource license
US20060080568A1 (en) * 2004-10-08 2006-04-13 Microsoft Corporation Failover scopes for nodes of a computer cluster
US20060242454A1 (en) * 2003-02-12 2006-10-26 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
CN1302412C (en) * 2003-11-11 2007-02-28 联想(北京)有限公司 Computer group system and its operation managing method
US20080250267A1 (en) * 2007-04-04 2008-10-09 Brown David E Method and system for coordinated multiple cluster failover
US20090049329A1 (en) * 2007-08-16 2009-02-19 International Business Machines Corporation Reducing likelihood of data loss during failovers in high-availability systems
US20110022882A1 (en) * 2009-07-21 2011-01-27 International Business Machines Corporation Dynamic Updating of Failover Policies for Increased Application Availability
US20110078235A1 (en) * 2009-09-25 2011-03-31 Samsung Electronics Co., Ltd. Intelligent network system and method and computer-readable medium controlling the same
US20110179419A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Dependency on a resource type
US20110179169A1 (en) * 2010-01-15 2011-07-21 Andrey Gusev Special Values In Oracle Clusterware Resource Profiles
US20110179173A1 (en) * 2010-01-15 2011-07-21 Carol Colrain Conditional dependency in a computing cluster
US20110179171A1 (en) * 2010-01-15 2011-07-21 Andrey Gusev Unidirectional Resource And Type Dependencies In Oracle Clusterware
US20110179172A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Dispersion dependency in oracle clusterware
US20110179428A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Self-testable ha framework library infrastructure
US20110214007A1 (en) * 2000-03-16 2011-09-01 Silicon Graphics, Inc. Flexible failover policies in high availability computing systems
EP2416526A1 (en) * 2009-04-01 2012-02-08 Huawei Technologies Co., Ltd. Task switching method, server node and cluster system
US8732162B2 (en) 2006-02-15 2014-05-20 Sony Computer Entertainment America Llc Systems and methods for server management
US8738961B2 (en) 2010-08-17 2014-05-27 International Business Machines Corporation High-availability computer cluster with failover support based on a resource map
US8949425B2 (en) 2010-01-15 2015-02-03 Oracle International Corporation “Local resource” type as a way to automate management of infrastructure resources in oracle clusterware
US20150066185A1 (en) * 2013-09-05 2015-03-05 SK Hynix Inc. Fail-over system and method for a semiconductor equipment server
US20150100826A1 (en) * 2013-10-03 2015-04-09 Microsoft Corporation Fault domains on modern hardware
US9424152B1 (en) * 2012-10-17 2016-08-23 Veritas Technologies Llc Techniques for managing a disaster recovery failover policy
US9454444B1 (en) * 2009-03-19 2016-09-27 Veritas Technologies Llc Using location tracking of cluster nodes to avoid single points of failure

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7925917B1 (en) * 2008-04-03 2011-04-12 United Services Automobile Association (Usaa) Systems and methods for enabling failover support with multiple backup data storage structures
US8448014B2 (en) * 2010-04-23 2013-05-21 International Business Machines Corporation Self-healing failover using a repository and dependency management system
US20140380300A1 (en) * 2013-06-25 2014-12-25 Bank Of America Corporation Dynamic configuration framework
CN103577235A (en) * 2013-11-14 2014-02-12 中安消技术有限公司 Software deploying method, deploying server, computer to be deployed and system
US10254928B1 (en) 2014-09-08 2019-04-09 Amazon Technologies, Inc. Contextual card generation and delivery
US9836363B2 (en) * 2014-09-30 2017-12-05 Microsoft Technology Licensing, Llc Semi-automatic failover
US10275326B1 (en) * 2014-10-31 2019-04-30 Amazon Technologies, Inc. Distributed computing system failure detection
US9811345B2 (en) * 2015-04-16 2017-11-07 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Utilizing computing resources under a disabled processor node without fully enabling the disabled processor node
US10069688B2 (en) 2016-03-07 2018-09-04 International Business Machines Corporation Dynamically assigning, by functional domain, separate pairs of servers to primary and backup service processor modes within a grouping of servers
CN106506233A (en) * 2016-12-01 2017-03-15 郑州云海信息技术有限公司 Method for automatically deploying Hadoop clusters and scalable working nodes

Citations (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5513314A (en) * 1995-01-27 1996-04-30 Auspex Systems, Inc. Fault tolerant NFS server system and mirroring protocol
US5526492A (en) * 1991-02-27 1996-06-11 Kabushiki Kaisha Toshiba System having arbitrary master computer for selecting server and switching server to another server when selected processor malfunctions based upon priority order in connection request
US5590285A (en) * 1993-07-28 1996-12-31 3Com Corporation Network station with multiple network addresses
US5805785A (en) * 1996-02-27 1998-09-08 International Business Machines Corporation Method for monitoring and recovery of subsystems in a distributed/clustered system
US5852724A (en) * 1996-06-18 1998-12-22 Veritas Software Corp. System and method for "N" primary servers to fail over to "1" secondary server
US5862348A (en) * 1996-02-09 1999-01-19 Citrix Systems, Inc. Method and apparatus for connecting a client node to a server node based on load levels
US5941999A (en) * 1997-03-31 1999-08-24 Sun Microsystems Method and system for achieving high availability in networked computer systems
US5987621A (en) * 1997-04-25 1999-11-16 Emc Corporation Hardware and software failover services for a file server
US6047323A (en) * 1995-10-19 2000-04-04 Hewlett-Packard Company Creation and migration of distributed streams in clusters of networked computers
US6145089A (en) * 1997-11-10 2000-11-07 Legato Systems, Inc. Server fail-over system
US6185695B1 (en) * 1998-04-09 2001-02-06 Sun Microsystems, Inc. Method and apparatus for transparent server failover for highly available objects
US6189111B1 (en) * 1997-03-28 2001-02-13 Tandem Computers Incorporated Resource harvesting in scalable, fault tolerant, single system image clusters
US6243825B1 (en) * 1998-04-17 2001-06-05 Microsoft Corporation Method and system for transparently failing over a computer name in a server cluster
US6266781B1 (en) * 1998-07-20 2001-07-24 Academia Sinica Method and apparatus for providing failure detection and recovery with predetermined replication style for distributed applications in a network
US6292905B1 (en) * 1997-05-13 2001-09-18 Micron Technology, Inc. Method for providing a fault tolerant network using distributed server processes to remap clustered network resources to other servers during server failure
US6438705B1 (en) * 1999-01-29 2002-08-20 International Business Machines Corporation Method and apparatus for building and managing multi-clustered computer systems
US6438704B1 (en) * 1999-03-25 2002-08-20 International Business Machines Corporation System and method for scheduling use of system resources among a plurality of limited users
US6442685B1 (en) * 1999-03-31 2002-08-27 International Business Machines Corporation Method and system for multiple network names of a single server
US6487622B1 (en) * 1999-10-28 2002-11-26 Ncr Corporation Quorum arbitrator for a high availability system
US6496941B1 (en) * 1998-12-29 2002-12-17 At&T Corp. Network disaster recovery and analysis tool
US6532494B1 (en) * 1999-05-28 2003-03-11 Oracle International Corporation Closed-loop node membership monitor for network clusters
US6539494B1 (en) * 1999-06-17 2003-03-25 Art Technology Group, Inc. Internet server session backup apparatus
US6594786B1 (en) * 2000-01-31 2003-07-15 Hewlett-Packard Development Company, Lp Fault tolerant high availability meter
US6636982B1 (en) * 2000-03-03 2003-10-21 International Business Machines Corporation Apparatus and method for detecting the reset of a node in a cluster computer system
US6745241B1 (en) * 1999-03-31 2004-06-01 International Business Machines Corporation Method and system for dynamic addition and removal of multiple network names on a single server
US6785678B2 (en) * 2000-12-21 2004-08-31 Emc Corporation Method of improving the availability of a computer clustering system through the use of a network medium link state function
US6857082B1 (en) * 2000-11-21 2005-02-15 Unisys Corporation Method for providing a transition from one server to another server clustered together
US6859882B2 (en) * 1990-06-01 2005-02-22 Amphus, Inc. System, method, and architecture for dynamic server power management and dynamic workload management for multi-server environment
US6892317B1 (en) * 1999-12-16 2005-05-10 Xerox Corporation Systems and methods for failure prediction, diagnosis and remediation using data acquisition and feedback for a distributed electronic system
US6901530B2 (en) * 2000-08-01 2005-05-31 Qwest Communications International, Inc. Proactive repair process in the xDSL network (with a VDSL focus)
US6952766B2 (en) * 2001-03-15 2005-10-04 International Business Machines Corporation Automated node restart in clustered computer system

Family Cites Families (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4769772A (en) 1985-02-28 1988-09-06 Honeywell Bull, Inc. Automated query optimization method using both global and parallel local optimizations for materialization access planning for distributed databases
US5339392A (en) * 1989-07-27 1994-08-16 Risberg Jeffrey S Apparatus and method for creation of a user definable video displayed document showing changes in real time data
US5187787B1 (en) * 1989-07-27 1996-05-07 Teknekron Software Systems Inc Apparatus and method for providing decoupling of data exchange details for providing high performance communication between software processes
IL99923D0 (en) * 1991-10-31 1992-08-18 Ibm Israel Method of operating a computer in a network
US5504894A (en) * 1992-04-30 1996-04-02 International Business Machines Corporation Workload manager for achieving transaction class response time goals in a multiprocessing system
US5625811A (en) * 1994-10-31 1997-04-29 International Business Machines Corporation Method and system for database load balancing
US6003030A (en) * 1995-06-07 1999-12-14 Intervu, Inc. System and method for optimized storage and retrieval of data on a distributed computer network
US5864854A (en) * 1996-01-05 1999-01-26 Lsi Logic Corporation System and method for maintaining a shared cache look-up table
US5778187A (en) * 1996-05-09 1998-07-07 Netcast Communications Corp. Multicasting method and apparatus
JP3788832B2 (en) * 1996-10-04 2006-06-21 株式会社東芝 Composite-based computer system
US5867494A (en) * 1996-11-18 1999-02-02 Mci Communication Corporation System, method and article of manufacture with integrated video conferencing billing in a communication system architecture
US6421726B1 (en) * 1997-03-14 2002-07-16 Akamai Technologies, Inc. System and method for selection and retrieval of diverse types of video data on a computer network
AU7149498A (en) 1997-04-25 1998-11-24 Symbios, Inc. Redundant server failover in networked environment
US6199110B1 (en) * 1997-05-30 2001-03-06 Oracle Corporation Planned session termination for clients accessing a resource through a server
JP3901806B2 (en) * 1997-09-25 2007-04-04 富士通株式会社 Information management system and the secondary server
US5999712A (en) * 1997-10-21 1999-12-07 Sun Microsystems, Inc. Determining cluster membership in a distributed computer system
US6173420B1 (en) * 1997-10-31 2001-01-09 Oracle Corporation Method and apparatus for fail safe configuration
US6279032B1 (en) * 1997-11-03 2001-08-21 Microsoft Corporation Method and system for quorum resource arbitration in a server cluster
US6178529B1 (en) * 1997-11-03 2001-01-23 Microsoft Corporation Method and system for resource monitoring of disparate resources in a server cluster
US6360331B2 (en) * 1998-04-17 2002-03-19 Microsoft Corporation Method and system for transparently failing over application configuration information in a server cluster
US6157955A (en) * 1998-06-15 2000-12-05 Intel Corporation Packet processing system including a policy engine having a classification unit
US6108703A (en) * 1998-07-14 2000-08-22 Massachusetts Institute Of Technology Global hosting system
JP3859369B2 (en) * 1998-09-18 2006-12-20 株式会社東芝 Message relay device and method
US6263433B1 (en) * 1998-09-30 2001-07-17 Ncr Corporation Provision of continuous database service and scalable query performance using active redundant copies
US6473396B1 (en) * 1999-01-04 2002-10-29 Cisco Technology, Inc. Use of logical addresses to implement module redundancy
US20010000801A1 (en) * 1999-03-22 2001-05-03 Miller Paul J. Hydrophilic sleeve
US6351747B1 (en) * 1999-04-12 2002-02-26 Multex.Com, Inc. Method and system for providing data to a user based on a user's query
US6647430B1 (en) * 1999-07-30 2003-11-11 Nortel Networks Limited Geographically separated totem rings
US6625152B1 (en) 1999-10-13 2003-09-23 Cisco Technology, Inc. Methods and apparatus for transferring data using a filter index
US6564336B1 (en) * 1999-12-29 2003-05-13 General Electric Company Fault tolerant database for picture archiving and communication systems
US20020198996A1 (en) 2000-03-16 2002-12-26 Padmanabhan Sreenivasan Flexible failover policies in high availability computing systems
US7627694B2 (en) * 2000-03-16 2009-12-01 Silicon Graphics, Inc. Maintaining process group membership for node clusters in high availability computing systems
JP2004062603A (en) 2002-07-30 2004-02-26 Dainippon Printing Co Ltd Parallel processing system, server, parallel processing method, program and recording medium
US7716238B2 (en) 2006-02-15 2010-05-11 Sony Computer Entertainment America Inc. Systems and methods for server management
US7979460B2 (en) * 2006-02-15 2011-07-12 Sony Computer Entainment America Inc. Systems and methods for server management
US7831620B2 (en) * 2006-08-31 2010-11-09 International Business Machines Corporation Managing execution of a query against a partitioned database

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8769132B2 (en) 2000-03-16 2014-07-01 Sony Computer Entertainment America Llc Flexible failover policies in high availability computing systems
US20110214007A1 (en) * 2000-03-16 2011-09-01 Silicon Graphics, Inc. Flexible failover policies in high availability computing systems
US9405640B2 (en) 2000-03-16 2016-08-02 Sony Interactive Entertainment America Llc Flexible failover policies in high availability computing systems
US20060242454A1 (en) * 2003-02-12 2006-10-26 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
US7814373B2 (en) * 2003-02-12 2010-10-12 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against node failures for very large clusters
US20080313333A1 (en) * 2003-02-12 2008-12-18 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against node failures for very large clusters
US20070277058A1 (en) * 2003-02-12 2007-11-29 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
US7296191B2 (en) * 2003-02-12 2007-11-13 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
US7401265B2 (en) 2003-02-12 2008-07-15 International Business Machines Corporation Scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
US20040163007A1 (en) * 2003-02-19 2004-08-19 Kazem Mirkhani Determining a quantity of lost units resulting from a downtime of a software application or other computer-implemented system
US20050036483A1 (en) * 2003-08-11 2005-02-17 Minoru Tomisaka Method and system for managing programs for web service system
US7533173B2 (en) 2003-09-30 2009-05-12 International Business Machines Corporation Policy driven automation - specifying equivalent resources
US20050091352A1 (en) * 2003-09-30 2005-04-28 International Business Machines Corporation Policy driven autonomic computing-specifying relationships
US8892702B2 (en) * 2003-09-30 2014-11-18 International Business Machines Corporation Policy driven autonomic computing-programmatic policy definitions
US20050091351A1 (en) * 2003-09-30 2005-04-28 International Business Machines Corporation Policy driven automation - specifying equivalent resources
US7451201B2 (en) 2003-09-30 2008-11-11 International Business Machines Corporation Policy driven autonomic computing-specifying relationships
US20050071449A1 (en) * 2003-09-30 2005-03-31 International Business Machines Corporation Policy driven autonomic computing-programmatic policy definitions
CN1302412C (en) * 2003-11-11 2007-02-28 联想(北京)有限公司 Computer group system and its operation managing method
US20050132379A1 (en) * 2003-12-11 2005-06-16 Dell Products L.P. Method, system and software for allocating information handling system resources in response to high availability cluster fail-over events
US20050138317A1 (en) * 2003-12-19 2005-06-23 Cannon David M. Real-time feedback for policies for computing system management
US8307060B2 (en) 2003-12-19 2012-11-06 International Business Machines Corporation Real-time feedback for policies for computing system management
US20100198958A1 (en) * 2003-12-19 2010-08-05 International Business Machines Corporation Real-time feedback for policies for computing system management
US8930509B2 (en) 2003-12-19 2015-01-06 International Business Machines Corporation Real-time feedback for policies for computing system management
US7734750B2 (en) * 2003-12-19 2010-06-08 International Business Machines Corporation Real-time feedback for policies for computing system management
US20050259572A1 (en) * 2004-05-19 2005-11-24 Esfahany Kouros H Distributed high availability system and method
US20060036894A1 (en) * 2004-07-29 2006-02-16 International Business Machines Corporation Cluster resource license
CN1758608B (en) 2004-10-08 2011-08-17 微软公司 Method and system for processing fault of computer cluster node
EP1647890A3 (en) * 2004-10-08 2009-06-03 Microsoft Corporation Failover scopes for nodes of a computer cluster
US7451347B2 (en) * 2004-10-08 2008-11-11 Microsoft Corporation Failover scopes for nodes of a computer cluster
JP2006114040A (en) * 2004-10-08 2006-04-27 Microsoft Corp Failover scope for node of computer cluster
US20060080568A1 (en) * 2004-10-08 2006-04-13 Microsoft Corporation Failover scopes for nodes of a computer cluster
KR101176651B1 (en) * 2004-10-08 2012-08-23 Microsoft Corporation Failover scopes for nodes of a computer cluster
US9886508B2 (en) 2006-02-15 2018-02-06 Sony Interactive Entertainment America Llc Systems and methods for server management
US8732162B2 (en) 2006-02-15 2014-05-20 Sony Computer Entertainment America Llc Systems and methods for server management
US8429450B2 (en) 2007-04-04 2013-04-23 Vision Solutions, Inc. Method and system for coordinated multiple cluster failover
US20100241896A1 (en) * 2007-04-04 2010-09-23 Brown David E Method and System for Coordinated Multiple Cluster Failover
US7757116B2 (en) 2007-04-04 2010-07-13 Vision Solutions, Inc. Method and system for coordinated multiple cluster failover
US20080250267A1 (en) * 2007-04-04 2008-10-09 Brown David E Method and system for coordinated multiple cluster failover
US7669080B2 (en) 2007-08-16 2010-02-23 International Business Machines Corporation Reducing likelihood of data loss during failovers in high-availability systems
US20090049329A1 (en) * 2007-08-16 2009-02-19 International Business Machines Corporation Reducing likelihood of data loss during failovers in high-availability systems
US9454444B1 (en) * 2009-03-19 2016-09-27 Veritas Technologies Llc Using location tracking of cluster nodes to avoid single points of failure
EP2416526A4 (en) * 2009-04-01 2012-04-04 Huawei Tech Co Ltd Task switching method, server node and cluster system
EP2416526A1 (en) * 2009-04-01 2012-02-08 Huawei Technologies Co., Ltd. Task switching method, server node and cluster system
US20110022882A1 (en) * 2009-07-21 2011-01-27 International Business Machines Corporation Dynamic Updating of Failover Policies for Increased Application Availability
US8055933B2 (en) 2009-07-21 2011-11-08 International Business Machines Corporation Dynamic updating of failover policies for increased application availability
US20110078235A1 (en) * 2009-09-25 2011-03-31 Samsung Electronics Co., Ltd. Intelligent network system and method and computer-readable medium controlling the same
US8473548B2 (en) * 2009-09-25 2013-06-25 Samsung Electronics Co., Ltd. Intelligent network system and method and computer-readable medium controlling the same
US9098334B2 (en) 2010-01-15 2015-08-04 Oracle International Corporation Special values in oracle clusterware resource profiles
US20110179428A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Self-testable ha framework library infrastructure
US8438573B2 (en) * 2010-01-15 2013-05-07 Oracle International Corporation Dependency on a resource type
US20110179172A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Dispersion dependency in oracle clusterware
US20110179171A1 (en) * 2010-01-15 2011-07-21 Andrey Gusev Unidirectional Resource And Type Dependencies In Oracle Clusterware
US20110179173A1 (en) * 2010-01-15 2011-07-21 Carol Colrain Conditional dependency in a computing cluster
US8949425B2 (en) 2010-01-15 2015-02-03 Oracle International Corporation “Local resource” type as a way to automate management of infrastructure resources in oracle clusterware
US9207987B2 (en) 2010-01-15 2015-12-08 Oracle International Corporation Dispersion dependency in oracle clusterware
US20110179419A1 (en) * 2010-01-15 2011-07-21 Oracle International Corporation Dependency on a resource type
US9069619B2 (en) 2010-01-15 2015-06-30 Oracle International Corporation Self-testable HA framework library infrastructure
US20110179169A1 (en) * 2010-01-15 2011-07-21 Andrey Gusev Special Values In Oracle Clusterware Resource Profiles
US8583798B2 (en) 2010-01-15 2013-11-12 Oracle International Corporation Unidirectional resource and type dependencies in oracle clusterware
US8738961B2 (en) 2010-08-17 2014-05-27 International Business Machines Corporation High-availability computer cluster with failover support based on a resource map
US9424152B1 (en) * 2012-10-17 2016-08-23 Veritas Technologies Llc Techniques for managing a disaster recovery failover policy
US20150066185A1 (en) * 2013-09-05 2015-03-05 SK Hynix Inc. Fail-over system and method for a semiconductor equipment server
US9678503B2 (en) * 2013-09-05 2017-06-13 SK Hynix Inc. Fail-over system and method for a semiconductor equipment server
US20150100826A1 (en) * 2013-10-03 2015-04-09 Microsoft Corporation Fault domains on modern hardware

Also Published As

Publication number Publication date
US20110214007A1 (en) 2011-09-01
US9405640B2 (en) 2016-08-02
US8769132B2 (en) 2014-07-01
US20170031790A1 (en) 2017-02-02
US20140281675A1 (en) 2014-09-18

Similar Documents

Publication Publication Date Title
EP1650653B1 (en) Remote enterprise management of high availability systems
EP1654645B1 (en) Fast application notification in a clustered computing system
US7529822B2 (en) Business continuation policy for server consolidation environment
US7516221B2 (en) Hierarchical management of the dynamic allocation of resources in a multi-node system
US7555673B1 (en) Cluster failover for storage management services
US8019732B2 (en) Managing access of multiple executing programs to non-local block data storage
KR101091250B1 (en) On-demand propagation of routing information in distributed computing system
US9262273B2 (en) Providing executing programs with reliable access to non-local block data storage
US6449734B1 (en) Method and system for discarding locally committed transactions to ensure consistency in a server cluster
US6393485B1 (en) Method and apparatus for managing clustered computer systems
CA2778723C (en) Monitoring of replicated data instances
CN103124967B (en) System and method for connecting an application server to a clustered database
US8055933B2 (en) Dynamic updating of failover policies for increased application availability
US6868442B1 (en) Methods and apparatus for processing administrative requests of a distributed network application executing in a clustered computing environment
US7181524B1 (en) Method and apparatus for balancing a load among a plurality of servers in a computer system
US7260818B1 (en) System and method for managing software version upgrades in a networked computer system
US7024580B2 (en) Markov model of availability for clustered systems
US7171459B2 (en) Method and apparatus for handling policies in an enterprise
US7941510B1 (en) Management of virtual and physical servers using central console
EP3276492B1 (en) Failover and recovery for replicated data instances
US7487390B2 (en) Backup system and backup method
US7546398B2 (en) System and method for distributing virtual input/output operations across multiple logical partitions
US20160314022A1 (en) Virtual systems management
US20060155912A1 (en) Server cluster having a virtual server
US7437386B2 (en) System and method for a multi-node environment with shared storage

Legal Events

Date Code Title Description
AS Assignment

Owner name: SILICON GRAPHICS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SREENIVASAN, PADMANABHAN;DANDAPANI, AJIT;NISHIMOTO, MICHAEL;AND OTHERS;REEL/FRAME:013102/0229;SIGNING DATES FROM 20020127 TO 20020710

AS Assignment

Owner name: SILICON GRAPHICS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SREENIVASAN, PADMANABHAN;DANDAPANI, AJIT;NISHIMOTO, MICHAEL;AND OTHERS;REEL/FRAME:013523/0968;SIGNING DATES FROM 20020127 TO 20020710

AS Assignment

Owner name: WELLS FARGO FOOTHILL CAPITAL, INC., CALIFORNIA

Free format text: SECURITY AGREEMENT;ASSIGNOR:SILICON GRAPHICS, INC. AND SILICON GRAPHICS FEDERAL, INC. (EACH A DELAWARE CORPORATION);REEL/FRAME:016871/0809

Effective date: 20050412

AS Assignment

Owner name: GENERAL ELECTRIC CAPITAL CORPORATION, CALIFORNIA

Free format text: SECURITY INTEREST;ASSIGNOR:SILICON GRAPHICS, INC.;REEL/FRAME:018545/0777

Effective date: 20061017

AS Assignment

Owner name: MORGAN STANLEY & CO., INCORPORATED, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GENERAL ELECTRIC CAPITAL CORPORATION;REEL/FRAME:019995/0895

Effective date: 20070926

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION

AS Assignment

Owner name: SILICON GRAPHICS INTERNATIONAL, CORP., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SILICON GRAPHICS, INC. ET AL.;SGI INTERNATIONAL, INC.;SIGNING DATES FROM 20090508 TO 20120320;REEL/FRAME:027904/0315

AS Assignment

Owner name: SONY COMPUTER ENTERTAINMENT AMERICA LLC, CALIFORNI

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SILICON GRAPHICS INTERNATIONAL, CORP.;REEL/FRAME:031519/0523

Effective date: 20131015

AS Assignment

Owner name: SONY INTERACTIVE ENTERTAINMENT AMERICA LLC, CALIFO

Free format text: CHANGE OF NAME;ASSIGNOR:SONY COMPUTER ENTERTAINMENT AMERICA LLC;REEL/FRAME:038630/0154

Effective date: 20160331