WO2007061440A2 - System and method for providing singleton services in a cluster

Publication number
WO2007061440A2
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
server
servers
migratable
cluster master
Application number
PCT/US2006/012413
Other languages
French (fr)
Other versions
WO2007061440A3 (en)
Inventor
Prasad Peddada
Original Assignee
Bea Systems, Inc.
Priority claimed from US 11/396,826 (US7447940B2)
Application filed by Bea Systems, Inc.
Publication of WO2007061440A2
Publication of WO2007061440A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration
    • G06F11/2025 Failover techniques using centralised failover control functionality
    • G06F11/2035 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
    • G06F11/2046 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share persistent storage

Definitions

  • the invention is generally related to servers, clusters and deployment of various services on server clusters; and in particular to systems and methods for providing highly available and migratable servers which offer singleton services across a cluster of application servers.
  • Clustering refers to a group of one or more servers (usually called "nodes") that work together and generally represent themselves as a single virtual server to the network.
  • With clustering, when a client connects to a set of clustered servers, it thinks that there is only a single server, rather than a plurality.
  • When one node fails, its responsibilities are taken over by another node, thereby boosting the reliability of the system.
  • a singleton service should be migrated in the event of a hosting server failure.
  • Migratable, singleton services were manually targeted to a server in the cluster, and the administrator performed the migration manually. This type of resolution is lacking in that it is complex, time consuming and tedious for the system administrators.
  • the downtime of the service provided can be quite lengthy.
  • a new approach is desired, one which would automatically target and distribute migratable, singleton services across the servers in the cluster, in addition to migrating them automatically in the event of server failures.
  • Embodiments of the present invention include systems and methods for providing singleton services within a cluster and for automatically migrating those services across the machines in the cluster.
  • the term "machine," for the purposes of this disclosure, means any computer system capable of maintaining a server or providing some type of service. Examples are personal computers, workstations, mainframes and other computers that can be connected to a network or cluster.
  • the clustering infrastructure can guarantee that each migratable service is active on only one node in the cluster at all times.
  • the present methodology can perform three tasks: First, a judgment can be made as to whether a server has failed; Second, the seemingly failed server can be isolated from clients and disks as well as other entities connected to it; Third, the seemingly failed server can be restarted on the machine upon which it sits or, if that cannot be achieved, the server can be migrated to another machine.
  • Figure 1 is a flow chart of a process defining the overall functionality of providing singleton services in a cluster by implementing migratable servers, in accordance with certain embodiments of the invention.
  • Figure 2 is a flow chart of a process defining an exemplary functionality of one server in the cluster, in accordance with certain embodiments of the invention.
  • Figure 3 is a flow chart of a process defining an exemplary functionality of a cluster master in the cluster, in accordance with certain embodiments of the invention.
  • Figure 4 is an illustration of the overall placement of a cluster of machines running servers, a node manager, a highly available database and an administration server, in accordance with certain embodiments of the invention.
  • Figure 5 is an illustration of a cluster of servers functioning against the database, in accordance with certain embodiments of the invention.
  • Figure 6 is an illustration of a method of migrating the migratable server to a different machine within the cluster, in accordance with certain embodiments of the invention.
  • FIG. 7 is an illustration of Internet protocol (IP) address migration, in accordance with certain embodiments of the invention.
  • a server for purposes of this disclosure can be any type of an application server that provides some type of a service, resource or application.
  • WebLogic® Server, available from BEA Systems, can be implemented.
  • a migratable server is a server in a cluster, which hosts a singleton service or services that are required to be highly available. Any of the servers in the cluster can be tagged as migratable, depending on the customer's needs, and these migratable servers can be made to host a variety of both singleton and non-singleton services.
  • Each migratable server can be assigned a unique identifier (id) or name. All servers in the cluster other than migratable servers will be generally referred to as "pinned" servers.
  • Figure 1 is a flow diagram illustration of a process defining the overall functionality of providing singleton services in a cluster via migratable servers, in accordance with various embodiments of the invention.
  • Although Figure 1 depicts functional steps in a particular order for purposes of illustration, the process is not necessarily limited to any particular order or arrangement of steps.
  • One skilled in the art will appreciate that the various steps portrayed in this figure can be omitted, rearranged, performed in parallel, combined and/or adapted in various ways.
  • In step 101, the various servers can be started in cluster form (i.e. as nodes in the cluster) by the administration (admin) server.
  • the admin server can be made responsible for starting the servers initially, and for stopping the servers finally. Its role can also be to coordinate any manual migration by system administrators, in addition to any kind of changing of configuration of servers.
  • each server can assume the role assigned to it.
  • the first server started can take the role of being the cluster master.
  • the cluster master is one server in the cluster that is responsible for the placement and migration of migratable servers. Usually a cluster would require the services of a cluster master if at least one of the managed servers in the cluster were tagged as a migratable server. The rest of the servers that are starting up can then take the role of being either a migratable server, or a pinned server, according to the particular needs of the enterprise operating the cluster (the customer).
  • In step 105, all of the servers in the cluster can be heartbeating against a database.
  • By heartbeating, it is meant that the server is continuously renewing its liveness information in the database.
  • This process can be implemented by assigning a table entry to each server, which the server then must update after every certain time period expires.
  • the time period required for updating the table entry can be arbitrarily chosen, or can be defined according to the servers and database implemented, in order to maximize performance. For example, a time period of 30,000 milliseconds can be selected.
  • If a server does not update the table entry in the database before the defined time period expires, then the server has failed to heartbeat, and it can be assumed that a crash, server hang or some other type of failure has occurred.
  • the database for purposes of this invention, can be any database, file system or other form of information storage, capable of storing some form of entry for each server.
  • the database should be made highly available in order to boost the reliability of the cluster and performance of the services provided.
  • the database can be selected from various products offered by companies such as Oracle, Microsoft, Sybase and IBM. It should be noted that the migration capability of the servers, and consequently the providing of singleton services, depends to a large extent on the integrity of the database, so a reliable database should be selected.
  • The functions of each server can be determined by the role that was assigned to it in step 103.
  • a server that is assigned the role of the cluster master can be responsible for performing one set of functions, while all migratable servers can perform another, all as described in further detail below.
  • In step 107, as the cluster master is heartbeating against the database, it is also monitoring the heartbeats of all other servers in the cluster. This can be implemented in various ways, including but not limited to having the cluster master read all of the table entries in the database whenever it accesses the database to heartbeat. Thus, if some server has failed to heartbeat, the cluster master should notice it the next time it accesses the database.
  • the cluster master can notice a failed server as described above. It can then take the necessary steps to restart the failed server on the same machine, or migrate the failed server to a different machine in the cluster.
  • the cluster master can first attempt to restart the failed server on the same machine by calling the node manager.
  • the node manager can be a software program that runs on all of the machines in the cluster. It should be capable of starting, restarting, stopping, shutting down, and migrating all of the migratable servers, together with their Internet protocol (IP) addresses, to different machines.
  • the node manager should also be capable of being invoked remotely by the cluster master. Any programming framework can be used in order to impart this functionality upon the node manager, including but not limited to scripts for Unix or Microsoft Windows operating systems.
  • the cluster master can then use the node manager to migrate the failed server to a different machine in the cluster.
  • the Internet protocol (IP) address can be migrated along with the migratable failed server to another machine. This makes running various applications easier, because the client will always be connected to the same server, no matter where that server is within the cluster.
  • An advantage of IP migration is that the client need not know the physical location of the server; simply knowing the IP address of the server is enough.
  • the cluster master can invoke the remote machine's node manager and have the node manager migrate the server to the new machine.
  • In step 111, all of the servers that did not take the role of being cluster master can be actively monitoring the heartbeats of the cluster master as they are themselves heartbeating against the database. This can be implemented similarly to the monitoring ability of the cluster master, or in some other form of a monitoring function.
  • In step 113, if the cluster master were to fail its heartbeat, then any of the other servers can notice that failure. The first server to notice can then take over the role of being cluster master. In effect, all servers can be actively trying to become the cluster master at all times. When a migratable server becomes the new cluster master, it assumes all the functions and duties of the original cluster master. No migration is necessary at this point, although it may be implemented.
  • the cluster can be configured to freeze whenever the cluster master fails, until a system administrator reboots or reconfigures the cluster master; however this type of implementation is not as efficient in that the cluster is dependent upon the performance of one server, namely the cluster master.
  • Figure 2 is a flow diagram illustration of a process defining an exemplary functionality of one server in the cluster, in accordance with various embodiments of the invention.
  • Although Figure 2 depicts functional steps in a particular order for purposes of illustration, the process is not necessarily limited to any particular order or arrangement of steps.
  • One skilled in the art will appreciate that the various steps portrayed in this figure can be omitted, rearranged, performed in parallel, combined and/or adapted in various ways.
  • In step 200, a server in the cluster is initially started by the admin server and joins the cluster. It can then be determined, in step 202, whether the server is the first server being connected to the cluster. In step 203, if a server is the first server joining the cluster, the server can be assigned the role of being cluster master. In step 205, the cluster master can then begin to heartbeat against the database, proving its liveness to it. At the same time, the cluster master can monitor the migratable servers that are heartbeating against the database, noticing any failures to heartbeat by any migratable server. Two things may occur from that point on: the cluster master may notice that a migratable server has failed, or the cluster master may itself fail.
  • In step 207, if the cluster master notices that a migratable server has failed to heartbeat, it can assume that the migratable server has crashed or has failed in some other manner. Consequently it can be assumed that the failed server is not responding and therefore not providing the singleton services that it is supposed to be providing.
  • In step 209, the cluster master can then take steps to migrate the failed migratable server to another machine. The cluster master can first attempt to restart the failed server on the same machine and, if that attempt fails, it can then call the node manager in order to migrate the server to another machine. The node manager can subsequently migrate the failed server to another machine.
  • the cluster master itself may fail to heartbeat because of a crash, server hang or some other type of failure.
  • the first migratable server available can take over the role of being cluster master, as illustrated in step 213.
  • In step 202, if the server is not the first server to join the cluster, it would not be assigned the role of cluster master; rather the server could become a migratable server, as illustrated in step 204. From that point on, the migratable server heartbeats against the database, and at the same time it monitors the heartbeat of the cluster master, as illustrated in step 215. Thus, a migratable server can notice that the cluster master has failed, or the migratable server may itself fail.
  • In step 217, if the migratable server notices that the cluster master failed to heartbeat, it actively attempts to become the cluster master itself, i.e. it attempts to take over the role of cluster master and assume its functions, as illustrated in step 219.
  • the migratable server may itself fail to heartbeat because of a crash, server hang or some other failure. The failure to heartbeat is then noticed by the cluster master and the migratable server will get restarted on the same machine or migrated to a different machine by the cluster master, as illustrated in step 223.
  • FIG. 3 is a flow diagram illustration of a process defining an exemplary functionality of a cluster master in the cluster, in accordance with various embodiments of the invention.
  • the process begins at step 300.
  • a server is started as a cluster master, as previously described above. It then begins to perform two functions either simultaneously or consecutively.
  • the cluster master heartbeats against the database as illustrated in step 313, providing its liveness information to it.
  • the cluster master harvests the liveness information of other migratable servers from the database, as illustrated in step 303.
  • If, in step 315, the cluster master crashes, hangs, or fails in some other manner, one of the migratable servers will take over its functions, as illustrated in step 317.
  • While the cluster master is harvesting liveness information from the database, if it notices that a migratable server has failed to heartbeat (in step 305), it can then initiate the node manager in order to deal with this problem, as illustrated in step 307.
  • the node manager will first attempt to restart the failed server on the same machine.
  • In step 310, if that attempt is successful, the cluster master will go back to performing its usual functions, namely heartbeating against the database and monitoring the liveness of other migratable servers.
  • If the restart attempt checked in step 310 is unsuccessful, the node manager will migrate the failed server onto a different machine, as illustrated in step 311. Subsequently the cluster master could go back to performing its duties of harvesting liveness information and heartbeating against the database.
  • the cluster master need not be made to wait for the node manager to complete the migration. After initiating the node manager, the cluster master is free, and could go back to fulfilling its role of heartbeating and harvesting, as described above. Alternatively, the cluster master could be made to wait for the node manager to finish its server migration process, so as to ensure the success of the migration, before returning to its usual functions. Both alternatives are within the spirit of the invention, as will be apparent to one skilled in the art.
  • FIG. 4 is an exemplary illustration of the overall placement of a cluster of machines running servers, a node manager, a highly available database and an administration (admin) server, in accordance with various embodiments of the invention.
  • Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
  • each machine 13, 14, 15 in the cluster 2 may have one or more servers 7, 8, 9, 10, 16 running thereon.
  • the machines can also have node manager 6, 11, 12 software deployed on them.
  • the node manager should be capable of running customizable scripts or other programs in order to facilitate migration of the servers across machines.
  • the node manager can be invoked remotely by the cluster master 8, in order to start and to stop (kill) various servers in the cluster.
  • the admin server 5 can be used to coordinate manual server migration and changing of configuration. It should also be used for the purpose of initially starting the servers. Similarly, it can be responsible for finally stopping all the servers in the cluster.
  • the admin server is running on a separate machine 4, which is not part of the cluster, and is thus not migratable itself.
  • An admin server can be implemented by another machine, a network computer, a workstation or some other means. It can be made accessible by system administrators and other persons who can subsequently coordinate manual migration of migratable servers within the cluster.
  • the highly available database 3 need not necessarily be a traditional database, as already discussed above. Instead, it can be implemented as any type of file or information storage system; however it is preferable that it be highly available in order to boost performance of the cluster and the singleton services.
  • FIG. 5 is an exemplary illustration of a cluster of servers functioning against the database, in accordance with various embodiments of the invention.
  • Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
  • the cluster master 8 can be heartbeating 55 its liveness information to the highly available database 3. It can do this by continuously updating one of the entries (17-21) of a table 59 in the database, in order to check in.
  • each table entry may have variables for storing the primary key, server name, server instance, host machine, domain name, cluster name, the timeout (check-in time period), and a variable to determine whether this particular server is the cluster master.
  • the cluster master can be monitoring the heartbeats 50, 51, 52, 53 of all of the other migratable servers 7, 9, 10, 16 in the database. Once it notices that a migratable server has stopped heartbeating, the cluster master can restart/migrate that server to another machine.
  • all of the migratable servers 7, 9, 10, 16 can be heartbeating (50-53) their own liveness information to the database, by the same means as the cluster master 8.
  • Each migratable server has an entry in the database corresponding to its liveness information, which the migratable server can be continuously updating.
  • each migratable server can be proactively attempting to take over the role of cluster master. Thus, if the cluster master were to fail its heartbeat, the first migratable server to notice this will become cluster master itself.
  • a single table need not necessarily be implemented in order to store the liveness information of the servers in the cluster. Multiple such tables may be used, or other types of data structures can be employed, including but not limited to lists, graphs or binary trees.
  • FIG. 6 is an exemplary illustration of a migration method and system, in accordance with various embodiments of the invention.
  • Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
  • each server depicted 7,8,10 is heartbeating against the database 3.
  • a first server S4 (7) may crash or fail and consequently it may stop sending its heartbeats 6 to the highly available database.
  • a second server S1 (8), designated as the current cluster master, will notice 61 another server S4's failure to heartbeat, and then it will attempt to restart/migrate server S4.
  • This figure illustrates one method of restarting or migration of S4 by the cluster master S1.
  • the cluster master can send an instruction 62 to restart S4, to the node manager 6 installed upon the machine 15 that S4 is currently deployed on. However, because the machine itself may have crashed or frozen, the node manager installed therein may not receive the restart instruction sent by the cluster master.
  • the cluster master will subsequently send instructions 63 to migrate S4 to the node manager 11 of another machine, for example machine M1 (13).
  • the node manager can then migrate 64 server S4 by starting S4 on the new machine 13, and the migrated server can begin to heartbeat again 65 against the database, as well as continue providing the singleton services.
  • Precautions may be taken that no previously crashed or frozen server is restarted again on the old machine 15, because that would cause two instances of server S4, and consequently two instances of every singleton service that the server is providing.
  • These precautions can be implemented in various ways, including, but not limited to, continuously sending kill messages to the old machine 15, or isolating the old machine from the cluster.
  • FIG. 7 is an exemplary illustration of IP migration, in accordance with various embodiments of the invention.
  • Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
  • IP addresses are usually stored in the IP stack 60.
  • a server S2 (9) may be migrated in the manner previously discussed above with reference to Figure 6. Assuming it is migrated to a different machine 14, and not restarted upon the same machine 13, then server S2 can be made to retain its original IP address, IP Addr 2 (62).
  • This implementation provides an advantage over assigning new IP addresses to migrated servers, as previously discussed, in that clients in the outside world 23 need not learn a new IP address for the server they are trying to access.
  • the IP address 62 gets migrated along with the server S2 onto the different machine 14.
  • the term "outside world" refers to computers or systems accessing a server that exist outside the cluster of servers.
  • the present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art.
  • the invention may also be implemented by the preparation of integrated circuits and/or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
  • Various embodiments include a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a general purpose or specialized computing processor(s)/device(s) to perform any of the features presented herein.
  • the storage medium can include, but is not limited to, one or more of the following: any type of physical media including floppy disks, optical discs, DVDs, CD-ROMs, microdrives, magneto-optical disks, holographic storage, ROMs, RAMs, PRAMS, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs); paper or paper-based media; and any type of media or device suitable for storing instructions and/or information.
  • Various embodiments include a computer program product that can be transmitted in whole or in parts and over one or more public and/or private networks wherein the transmission includes instructions which can be used by one or more processors to perform any of the features presented herein.
  • the transmission may include a plurality of separate transmissions.
  • the present disclosure includes software for controlling both the hardware of general purpose/specialized computer(s) and/or processor(s), and for enabling the computer(s) and/or processor(s) to interact with a human user or other mechanism utilizing the results of the present invention.
  • Such software may include, but is not limited to, device drivers, operating systems, execution environments/containers, user interfaces and applications.

Abstract

A system and method for providing singleton services in a cluster of servers, where one server is designated as a cluster master, other servers are designated as migratable servers and where all servers in the cluster heartbeat their liveness information against a database. The cluster master monitors the heartbeats of all migratable servers. Upon failure of a migratable server's heartbeat, the cluster master first attempts to restart the migratable server on the same machine and if that does not succeed, the cluster master migrates the migratable server to a different machine in the cluster. In accordance with an embodiment, all migratable servers monitor the heartbeats of the cluster master. Upon failure of the cluster master's heartbeating, one migratable server takes over the role of being cluster master.

Description

SYSTEM AND METHOD FOR PROVIDING SINGLETON SERVICES IN A CLUSTER
Claim of Priority:
U.S. Provisional Patent Application No. 60/736,718 entitled SYSTEM AND METHOD FOR PROVIDING SINGLETON SERVICES IN A CLUSTER, by Prasad Peddada, filed November 15, 2005 [Attorney Docket No. BEAS-01559US0]; and U.S. Patent Application No. 11/ , entitled SYSTEM AND METHOD FOR PROVIDING SINGLETON SERVICES IN A CLUSTER, by Prasad Peddada, filed April 3, 2006 [Attorney Docket No. BEAS-01559US1].
Field of the Invention:
The invention is generally related to servers, clusters and deployment of various services on server clusters; and in particular to systems and methods for providing highly available and migratable servers which offer singleton services across a cluster of application servers.
Background:
Clustering of servers is becoming increasingly important in a wide variety of contexts, for reasons of increased functionality, higher levels of services and availability, in addition to supporting server failover. Many businesses that employ computer systems require such connectivity between servers in order to ensure the durability and improved services of the network, intranet or website employed. As referred to herein, clustering refers to a group of one or more servers (usually called "nodes") that work together and generally represent themselves as a single virtual server to the network. In other words, when a client connects to a set of clustered servers, it sees only a single server, rather than a plurality. When one node fails, that node's responsibilities are taken over by another node, thereby boosting the reliability of the system.
Traditionally, all services on such a cluster have been deployed homogeneously on all of the servers in the cluster. This has satisfied most demands, in that when one server fails, another server is providing the same services, and thus clients maintain connection and access to such services. However, sometimes there is a set of stateful services that need to be run on only one server in the cluster at any given time, with the ability to automatically migrate the service in the event of server failures. For example, the Java Message Service (JMS) subsystem guarantees that user-generated client subscriber identifiers (ids) are unique within the cluster. In order to honor such requirements, a JMS or similar service that runs on only one node in the cluster is required. These types of services are, for the purposes of this disclosure, referred to as "singleton services", by which it is meant that the service has a single active instance in the cluster.
A singleton service should be migrated in the event of a hosting server failure. With a traditional approach, migratable, singleton services were manually targeted to a server in the cluster, and the administrator did the migration manually. This type of resolution is lacking in that it is complex, time consuming and tedious on the system administrators. In addition, the downtime of the service provided can be quite lengthy.
A new approach is desired, one which would automatically target and distribute migratable, singleton services across the servers in the cluster, in addition to migrating them automatically in the event of server failures. However, there are two sets of problems that make it difficult to provide such automation. First, when a server becomes temporarily frozen or disconnected from the cluster and is mistakenly judged to have failed, then the service may be migrated to a new server, and subsequently the original server may rejoin the cluster. In that instance, two servers would be providing the singleton service. Second, if a server is incorrectly assumed to be alive, then none of the servers in the cluster would be providing the singleton service.
Summary:
Embodiments of the present invention include systems and methods for providing singleton services within a cluster and for automatically migrating those services across the machines in the cluster. The term "machine," for the purposes of this disclosure, means any computer system capable of maintaining a server or providing some type of service. Examples are personal computers, workstations, mainframes and other computers that can be connected to a network or cluster. The clustering infrastructure can guarantee that each migratable service is active on only one node in the cluster at all times. In order to prevent the problems of auto-migration described above, the present methodology can perform three tasks: First, a judgment can be made as to whether a server has failed; Second, the seemingly failed server can be isolated from clients and disks as well as other entities connected to it; Third, the seemingly failed server can be restarted on the machine upon which it sits or, if that cannot be achieved, the server can be migrated to another machine.
Brief Description of the Figures:
Figure 1 is a flow chart of a process defining the overall functionality of providing singleton services in a cluster by implementing migratable servers, in accordance with certain embodiments of the invention.
Figure 2 is a flow chart of a process defining an exemplary functionality of one server in the cluster, in accordance with certain embodiments of the invention.
Figure 3 is a flow chart of a process defining an exemplary functionality of a cluster master in the cluster, in accordance with certain embodiments of the invention.
Figure 4 is an illustration of the overall placement of a cluster of machines running servers, a node manager, a highly available database and an administration server, in accordance with certain embodiments of the invention.
Figure 5 is an illustration of a cluster of servers functioning against the database, in accordance with certain embodiments of the invention.
Figure 6 is an illustration of a method of migrating the migratable server to a different machine within the cluster, in accordance with certain embodiments of the invention.
Figure 7 is an illustration of Internet protocol (IP) address migration, in accordance with certain embodiments of the invention.
Detailed Description:
Embodiments of the present invention include systems and methods for providing singleton services within a cluster and for automatically migrating those services across the machines in the cluster. The term "machine," for the purposes of this disclosure, means any computer system capable of maintaining a server or providing some type of service. Examples are personal computers, workstations, mainframes and other computers that can be connected to a network or cluster. The clustering infrastructure can guarantee that each migratable service is active on only one node in the cluster at all times. In order to prevent the problems of auto-migration described above, the present methodology can perform three tasks: First, a judgment can be made as to whether a server has failed; Second, the seemingly failed server can be isolated from clients and disks as well as other entities connected to it; Third, the seemingly failed server can be restarted on the machine upon which it sits or, if that cannot be achieved, the server can be migrated to another machine.
Aspects of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to "an", "one" and "various" embodiments in this disclosure are not necessarily to the same embodiment. In the following description, numerous specific details are set forth to provide a thorough description of the invention. However, it will be apparent to one skilled in the art that the invention may be practiced without these specific details.
Various embodiments include a highly available database and a node manager in order to implement server migration. A server, for purposes of this disclosure, can be any type of application server that provides some type of service, resource or application. As one non-limiting example, WebLogic® Server, available from BEA Systems, can be implemented. As also described herein, a migratable server is a server in a cluster which hosts a singleton service or services that are required to be highly available. Any of the servers in the cluster can be tagged as migratable, depending on the customer's needs, and these migratable servers can be made to host a variety of both singleton and non-singleton services. Each migratable server can be assigned a unique identifier (id) or name. All servers in the cluster other than migratable servers will be generally referred to as "pinned" servers.
Figure 1 is a flow diagram illustration of a process defining the overall functionality of providing singleton services in a cluster via migratable servers, in accordance with various embodiments of the invention. Although Figure 1 depicts functional steps in a particular order for purposes of illustration, the process is not necessarily limited to any particular order or arrangement of steps. One skilled in the art will appreciate that the various steps portrayed in this figure can be omitted, rearranged, performed in parallel, combined and/or adapted in various ways.
The process begins at step 100. In step 101, the various servers can be started in cluster form (i.e. as nodes in the cluster) by the administration (admin) server. The admin server can be made responsible for starting the servers initially, and for stopping the servers finally. Its role can also be to coordinate any manual migration by system administrators, in addition to any change to the configuration of servers.
In step 103, as the servers are being started, each server can assume the role assigned to it. For example, the first server started can take the role of being the cluster master. The cluster master is one server in the cluster that is responsible for the placement and migration of migratable servers. Usually a cluster would require the services of a cluster master if at least one of the managed servers in the cluster were tagged as a migratable server. The rest of the servers that are starting up can then take the role of being either a migratable server, or a pinned server, according to the particular needs of the enterprise operating the cluster (the customer).
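The role assignment of step 103 can be illustrated with a minimal Python sketch. The names `Role` and `assign_roles`, and the idea of passing the startup order as a list, are assumptions made for illustration only; the disclosure does not prescribe any particular data structure or language.

```python
from enum import Enum


class Role(Enum):
    CLUSTER_MASTER = "cluster master"
    MIGRATABLE = "migratable"
    PINNED = "pinned"


def assign_roles(startup_order, migratable_names):
    """Map each server name to a role, in the order the servers start.

    Assumption: the first server started becomes cluster master, and the
    customer supplies the set of servers tagged as migratable.
    """
    roles = {}
    for name in startup_order:
        if not roles:
            roles[name] = Role.CLUSTER_MASTER  # first server started
        elif name in migratable_names:
            roles[name] = Role.MIGRATABLE
        else:
            roles[name] = Role.PINNED
    return roles
```

For example, `assign_roles(["S1", "S2", "S3"], {"S2"})` would make S1 the cluster master, S2 a migratable server and S3 a pinned server.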
In step 105, all of the servers in the cluster can be heartbeating against a database. By the term "heartbeating" it is meant that the server is continuously renewing its liveness information in the database. This process can be implemented by assigning a table entry to each server, which the server then must update after every certain time period expires. The time period required for updating the table entry can be arbitrarily chosen, or can be defined according to the servers and database implemented, in order to maximize performance. For example, a time period of 30,000 milliseconds can be selected. Thus, if a server does not update the table entry in the database after the expiration of the defined time period, then the server has failed to heartbeat, and it could be assumed that there has occurred a crash, server hang or some other type of failure. The database, for purposes of this invention, can be any database, file system or other form of information storage, capable of storing some form of entry for each server. However, the database should be made highly available in order to boost the reliability of the cluster and performance of the services provided. For example, the database can be selected from various products offered by companies such as Oracle, Microsoft, Sybase and IBM. It should be noted that the migration capability of the servers, and consequently the providing of singleton services, depends to a large extent on the integrity of the database, so a reliable database should be selected.
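As one possible illustration of heartbeating, the sketch below uses an in-memory SQLite database as a stand-in for the highly available database. The table name `liveness`, its columns, and the function names are hypothetical; only the 30,000 millisecond period is taken from the example above.

```python
import sqlite3

HEARTBEAT_PERIOD_MS = 30_000  # example period mentioned in the text


def open_liveness_db():
    """Create an in-memory stand-in for the highly available database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE liveness ("
        "server_name TEXT PRIMARY KEY, last_beat_ms INTEGER NOT NULL)"
    )
    return db


def heartbeat(db, server_name, now_ms):
    """Renew this server's table entry (creating it on the first beat)."""
    db.execute(
        "INSERT OR REPLACE INTO liveness (server_name, last_beat_ms) "
        "VALUES (?, ?)",
        (server_name, now_ms),
    )
    db.commit()


def has_failed_to_heartbeat(db, server_name, now_ms,
                            period_ms=HEARTBEAT_PERIOD_MS):
    """True if the entry is missing or was not renewed within the period."""
    row = db.execute(
        "SELECT last_beat_ms FROM liveness WHERE server_name = ?",
        (server_name,),
    ).fetchone()
    return row is None or now_ms - row[0] > period_ms
```

Timestamps are passed in explicitly rather than read from a clock so that the sketch is deterministic; a real deployment would more likely rely on the database's own clock to avoid skew between machines.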
In step 106, the further functionality of each server can be determined by the role that was assigned to it in step 103. Thus, a server that is assigned the role of the cluster master can be responsible for performing one set of functions, while all migratable servers can perform another, all as described in further detail below.
In step 107, as the cluster master is heartbeating against the database, it is also monitoring the heartbeats of all other servers in the cluster. This can be implemented by various functions, including but not limited to having the cluster master read all of the table entries in the database whenever it accesses the database to heartbeat. Thus, if some server has failed to heartbeat, the cluster master should notice it the next time it accesses the database.
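The combined "heartbeat and read all entries" access of step 107 might look like the following sketch. A plain dictionary stands in for the database table, and the function name `heartbeat_and_scan` is an assumption for illustration.

```python
def heartbeat_and_scan(table, master_name, now_ms, period_ms=30_000):
    """Renew the master's own entry, then return the names of stale servers.

    `table` maps server name -> time of last heartbeat in milliseconds and
    is a hypothetical stand-in for the liveness table in the database.
    """
    table[master_name] = now_ms  # the master's own heartbeat
    return sorted(
        name
        for name, last_beat_ms in table.items()
        if name != master_name and now_ms - last_beat_ms > period_ms
    )
```

Because the scan happens on the same access as the master's own heartbeat, a failed server is noticed no later than one heartbeat period after its entry goes stale.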
In step 109, the cluster master can notice a failed server as described above. It can then take the necessary steps to restart the failed server on the same machine, or migrate the failed server to a different machine in the cluster. The cluster master can first attempt to restart the failed server on the same machine by calling the node manager. The node manager can be a software program that runs on all of the machines in the cluster. It should be capable of starting, restarting, stopping, shutting down, and migrating all of the migratable servers, together with their Internet protocol (IP) addresses, to different machines. The node manager should also be capable of being invoked remotely by the cluster master. Any programming framework can be used in order to impart this functionality upon the node manager, including but not limited to scripts for Unix or Microsoft Windows operating systems. If the cluster master cannot restart the failed server on the same machine, it can then use the node manager to migrate the failed server to a different machine in the cluster. In certain embodiments, the IP address can be migrated along with the migratable failed server to another machine. This makes running various applications easier, because the client will always be connected to the same server, no matter where that server is within the cluster. One advantage of IP migration is that the client need not know the physical location of the server; simply knowing the IP address of the server is enough. Thus the cluster master can invoke the remote machine's node manager and have the node manager migrate the server to the new machine.
In step 111, all of the servers that did not take the role of being cluster master can be actively monitoring the heartbeats of the cluster master as they are themselves heartbeating against the database. This can be implemented similarly to the monitoring ability of the cluster master, or in some other form of a monitoring function. In step 113, if the cluster master were to fail its heartbeat, then any of the other servers can notice that failure. The first server to notice can then take over the role of being cluster master. In effect, all servers can be actively trying to become the cluster master at all times. When the migratable server becomes a new cluster master, it assumes all the functions and duties of the original cluster master. No migration is necessary at this point, although it may be implemented. In the alternative, the cluster can be configured to freeze whenever the cluster master fails, until a system administrator reboots or reconfigures the cluster master; however, this type of implementation is not as efficient, in that the cluster is dependent upon the performance of one server, namely the cluster master.
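The rule that the first server to notice a failed cluster master takes over the role can be sketched as an atomic check-and-claim. In the sketch below, a `threading.Lock` plays the part of a database transaction, and the class `MasterRecord` with its methods is hypothetical; the disclosure does not specify how the claim is made atomic.

```python
import threading


class MasterRecord:
    """In-memory stand-in for the cluster master's entry in the database."""

    def __init__(self, master, last_beat_ms):
        self._lock = threading.Lock()  # plays the role of a transaction
        self.master = master
        self.last_beat_ms = last_beat_ms

    def renew(self, server, now_ms):
        """Heartbeat as master; refused if `server` no longer holds the role."""
        with self._lock:
            if server != self.master:
                return False
            self.last_beat_ms = now_ms
            return True

    def try_take_over(self, candidate, now_ms, period_ms=30_000):
        """Claim the role only if the current master's heartbeat is stale."""
        with self._lock:
            if now_ms - self.last_beat_ms <= period_ms:
                return False  # the master is still heartbeating
            self.master = candidate
            self.last_beat_ms = now_ms
            return True
```

Since a successful claim also refreshes the heartbeat, a second candidate that arrives a moment later sees a live master and backs off, so only one server wins. The refused `renew` is one way a former master that was merely frozen could learn that it has been replaced.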
Figure 2 is a flow diagram illustration of a process defining an exemplary functionality of one server in the cluster, in accordance with various embodiments of the invention. Although Figure 2 depicts functional steps in a particular order for purposes of illustration, the process is not necessarily limited to any particular order or arrangement of steps. One skilled in the art will appreciate that the various steps portrayed in this figure can be omitted, rearranged, performed in parallel, combined and/or adapted in various ways.
The process begins in step 200. In step 201, a server in the cluster is initially started by the admin server and joins the cluster. It can then be determined, in step 202, whether the server is the first server being connected to the cluster. In step 203, if a server is the first server joining the cluster, the server can be assigned the role of being cluster master. In step 205, the cluster master can then begin to heartbeat against the database, proving its liveness to it. At the same time, the cluster master can monitor the migratable servers that are heartbeating against the database, noticing any failures to heartbeat by any migratable server. Two things may occur from that point on: the cluster master may notice that a migratable server has failed, or the cluster master can fail itself.
In step 207, if the cluster master notices that a migratable server has failed to heartbeat, it can assume that the migratable server has crashed or has failed in some other manner. Consequently it can be assumed that the failed server is not responding and therefore not providing the singleton services that it is supposed to be providing. In step 209, the cluster master can then take steps to migrate the failed migratable server to another machine. The cluster master can first attempt to restart the failed server on the same machine and if that attempt fails, it can then call the node manager in order to migrate the server to another machine. The node manager can subsequently migrate the failed server to another machine.
Alternatively, in step 211, the cluster master itself may fail to heartbeat because of a crash, server hang or some other type of failure. However, since all migratable servers are always actively trying to become cluster master themselves, once the original cluster master fails to heartbeat, the first migratable server available can take over the role of being cluster master, as illustrated in step 213.
Returning to step 202, if the server is not the first server to join the cluster, it would not be assigned the role of cluster master; rather the server could become a migratable server, as illustrated in step 204. From that point on, the migratable server heartbeats against the database, and at the same time it is monitoring the heartbeat of the cluster master as illustrated in step 215. Thus, a migratable server can notice that the cluster master has failed, or the migratable server may fail itself.
In step 217, if the migratable server notices that the cluster master failed to heartbeat, it actively attempts to become the cluster master itself, i.e. it attempts to take over the role of cluster master and assume its functions as illustrated in step 219.
In step 221, the migratable server may itself fail to heartbeat because of a crash, server hang or some other failure. The failure to heartbeat is then noticed by the cluster master and the migratable server will get restarted on the same machine or migrated to a different machine by the cluster master, as illustrated in step 223.
Figure 3 is a flow diagram illustration of a process defining an exemplary functionality of a cluster master in the cluster, in accordance with various embodiments of the invention. Although this figure depicts functional steps in a particular order for purposes of illustration, the process is not necessarily limited to any particular order or arrangement of steps. One skilled in the art will appreciate that the various steps portrayed in this figure can be omitted, rearranged, performed in parallel, combined and/or adapted in various ways.
The process begins at step 300. In step 301, a server is started as a cluster master, as previously described above. It then begins to perform two functions either simultaneously or consecutively. The cluster master heartbeats against the database as illustrated in step 313, providing its liveness information to it. In addition, the cluster master harvests the liveness information of other migratable servers from the database, as illustrated in step 303.
If, as illustrated in step 315, the cluster master crashes, hangs, or fails in some other manner, one of the migratable servers will take over its functions as illustrated in step 317. On the other hand, while the cluster master is harvesting liveness information from the database, if the cluster master notices that a migratable server has failed to heartbeat (in step 305), it can then initiate the node manager in order to deal with this problem, as illustrated in step 307. In step 309, the node manager will first attempt to restart the failed server on the same machine. In step 310, if that attempt is successful, the cluster master will go back to performing its usual functions, namely heartbeating against the database and monitoring the liveness of other migratable servers. If the attempt to restart was unsuccessful, then, in step 310, the node manager will migrate the failed server onto a different machine, as illustrated in step 311. Subsequently the cluster master could go back to performing its duties of harvesting liveness information and heartbeating against the database.
Although not illustrated, the cluster master need not be made to wait for the node manager to complete the migration. After initiating the node manager, the cluster master is free, and could go back to fulfilling its role of heartbeating and harvesting, as described above. Alternatively, the cluster master could be made to wait for the node manager to finish its server migration process, so as to ensure the success of the migration, before returning to its usual functions. Both alternatives are within the spirit of the invention, as will be apparent to one skilled in the art.
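The restart-then-migrate sequence of steps 307 through 311 can be sketched as follows. The `NodeManager` class here is a hypothetical stand-in with a single `start` method and a `reachable` flag that simulates a crashed or frozen machine; it does not model the interface of any real node manager product.

```python
class NodeManager:
    """Hypothetical stand-in for the node manager on one machine."""

    def __init__(self, machine, reachable=True):
        self.machine = machine
        self.reachable = reachable  # False simulates a crashed or frozen machine
        self.running = set()

    def start(self, server):
        """Start (or restart) a server; False if the machine does not respond."""
        if not self.reachable:
            return False
        self.running.add(server)
        return True


def recover(server, home, others):
    """Restart on the home machine if possible, otherwise migrate.

    Returns the name of the machine now hosting the server, or None if no
    node manager could start it.
    """
    if home.start(server):
        return home.machine  # restarted in place, no migration needed
    for node_manager in others:
        if node_manager.start(server):
            return node_manager.machine  # migrated to a different machine
    return None
```

This sketch is the variant in which the cluster master waits for the outcome. The non-waiting alternative described above could be approximated by running `recover` on a separate thread while the cluster master returns to heartbeating and harvesting.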
Figure 4 is an exemplary illustration of the overall placement of a cluster of machines running servers, a node manager, a highly available database and an administration (admin) server, in accordance with various embodiments of the invention. Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
As shown in Figure 4, each machine 13, 14, 15 in the cluster 2 may have one or more servers 7, 8, 9, 10, 16 running thereon. The machines can also have node manager 6, 11, 12 software deployed on them. The node manager should be capable of running customizable scripts or other programs in order to facilitate migration of the servers across machines. The node manager can be invoked remotely by the cluster master 8, in order to start and to stop (kill) various servers in the cluster.
The admin server 5 can be used to coordinate manual server migration and changing of configuration. It should also be used for the purpose of initially starting the servers. Similarly, it can be responsible for finally stopping all the servers in the cluster. In certain embodiments, the admin server is running on a separate machine 4, which is not part of the cluster, and is thus not migratable itself. An admin server can be implemented by another machine, a network computer, a workstation or some other means. It can be made accessible by system administrators and other persons who can subsequently coordinate manual migration of migratable servers within the cluster. The highly available database 3 need not necessarily be a traditional database, as already discussed above. Instead, it can be implemented as any type of file or information storage system; however it is preferable that it be highly available in order to boost performance of the cluster and the singleton services.
As shown in Figure 4, all of the components are illustrated as being part of one, i.e. a single domain 1. However this is done merely for purposes of illustration and ease of understanding. Multiple domains and subdomains can also be implemented.
Figure 5 is an exemplary illustration of a cluster of servers functioning against the database, in accordance with various embodiments of the invention. Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
As shown in Figure 5, in normal operation the cluster master 8 can be heartbeating 55 its liveness information to the highly available database 3. It can do this by continuously updating one of the entries (17-21) of a table 59 in the database, in order to check in. As an illustrative example, each table entry may have variables for storing the primary key, server name, server instance, host machine, domain name, cluster name, the timeout (check-in time period), and a variable to determine whether this particular server is the cluster master. Simultaneously, the cluster master can be monitoring the heartbeats 50, 51, 52, 53 of all of the other migratable servers 7, 9, 10, 16 in the database. Once it notices that a migratable server has stopped heartbeating, the cluster master can restart that server or migrate it to another machine.
Similarly, all of the migratable servers 7, 9, 10, 16 can be heartbeating (50-53) their own liveness information to the database, by the same means as the cluster master 8. Each migratable server has an entry in the database corresponding to its liveness information, which the migratable server can be continuously updating. Simultaneously, each migratable server can be proactively attempting to take over the role of cluster master. Thus, if the cluster master were to fail its heartbeat, the first migratable server to notice this will become cluster master itself.
It should be pointed out that, in accordance with various embodiments, a single table need not necessarily be implemented in order to store the liveness information of the servers in the cluster. Multiple such tables may be used, or other types of data structures can be employed, including but not limited to lists, graphs or binary trees.
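One way to picture a table entry with the variables listed above is the following Python dataclass. The field names and types are assumptions chosen to mirror that list, and `last_check_in_ms` is an additional hypothetical field (not named in the text) that records when the entry was last renewed.

```python
from dataclasses import dataclass


@dataclass
class LivenessEntry:
    primary_key: int
    server_name: str
    server_instance: str
    host_machine: str
    domain_name: str
    cluster_name: str
    timeout_ms: int            # the check-in time period
    is_cluster_master: bool
    last_check_in_ms: int = 0  # assumption: time of the most recent heartbeat

    def check_in(self, now_ms):
        """Renew the entry, as a server does on each heartbeat."""
        self.last_check_in_ms = now_ms

    def is_stale(self, now_ms):
        """True once the server has missed its check-in period."""
        return now_ms - self.last_check_in_ms > self.timeout_ms
```

The values used to populate such an entry (for instance a server named S1 on machine 14 with a 30,000 millisecond timeout) would come from the cluster configuration.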
Figure 6 is an exemplary illustration of a migration method and system, in accordance with various embodiments of the invention. Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
As shown in Figure 6, to begin with, each server depicted 7,8,10, is heartbeating against the database 3. A first server S4 (7) may crash or fail and consequently it may stop sending its heartbeats 6 to the highly available database. A second server S1 (8), designated as the current cluster master, will notice 61 another server S4's failure to heartbeat, and then it will attempt to restart/migrate server S4. This figure illustrates one method of restarting or migration of S4 by the cluster master S1. As shown in Figure 6, the cluster master can send an instruction 62 to restart S4, to the node manager 6 installed upon the machine 15 that S4 is currently deployed on. However, because the machine itself may have crashed or frozen, the node manager installed therein may not receive the restart instruction sent by the cluster master. Thus, the cluster master will subsequently send instructions 63 to migrate S4 to the node manager 11 of another machine, for example machine M1 (13). The node manager can then migrate 64 server S4 by starting S4 on the new machine 13, and the migrated server can begin to heartbeat again 65 against the database, as well as continue providing the singleton services.
Precautions may be taken that no previously crashed or frozen server is restarted again on the old machine 15, because that would cause two instances of server S4, and consequently two instances of every singleton service that the server is providing. These precautions can be implemented in various ways, including, but not limited to, continuously sending kill messages to the old machine 15, or isolating the old machine from the cluster.
Figure 7 is an exemplary illustration of IP migration, in accordance with various embodiments of the invention. Although this diagram depicts components as logically separate, such depiction is merely for illustrative purposes. It will be apparent to those skilled in the art that the components portrayed in this figure can be combined or divided into separate software, firmware and/or hardware components. Furthermore, it will also be apparent to those skilled in the art that such components, regardless of how they are combined or divided, can execute on the same computing device or can be distributed among different computing devices connected by one or more networks or other suitable communication means.
As shown in Figure 7, all migratable servers are assigned their own Internet protocol (IP) addresses (61-64). These IP addresses are usually stored in the IP stack 60. Subsequently, a server S2 (9) may be migrated in the manner previously discussed above with reference to Figure 6. Assuming it is migrated to a different machine 14, and not restarted upon the same machine 13, then server S2 can be made to retain its original IP address IP Addr (address) 2 (62). This implementation provides an advantage over assigning new IP addresses to migrated servers, as previously discussed, in that clients in the outside world 23 need not know the physical location of the server they are trying to access; the original IP address continues to reach it. As illustrated, the IP address 62 gets migrated along with the server S2 onto the different machine 14. As used herein, the term "outside world" refers to computers or systems accessing a server that exist outside the cluster of servers.
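The effect of IP migration can be sketched with two lookup tables: one from server to its fixed IP address, and one from IP address to the machine currently holding it. The class `ClusterNetwork` and the address 192.0.2.2 (taken from a range reserved for documentation) are illustrative assumptions and say nothing about how addresses are actually rebound at the network level.

```python
class ClusterNetwork:
    """Toy model of which machine currently answers for each IP address."""

    def __init__(self):
        self.ip_of_server = {}   # server name -> IP address (never changes)
        self.machine_of_ip = {}  # IP address -> machine currently holding it

    def place(self, server, ip, machine):
        """Record the initial placement of a server and its IP address."""
        self.ip_of_server[server] = ip
        self.machine_of_ip[ip] = machine

    def migrate(self, server, new_machine):
        """Move the server to another machine; its IP address moves with it."""
        ip = self.ip_of_server[server]
        self.machine_of_ip[ip] = new_machine
        return ip

    def route(self, ip):
        """Machine that a client connecting to `ip` ends up reaching."""
        return self.machine_of_ip[ip]
```

A client that keeps using the same address reaches machine 13 before the migration of S2 and machine 14 afterwards, without any change on the client side.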
The present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art. The invention may also be implemented by the preparation of integrated circuits and/or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
Various embodiments include a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a general purpose or specialized computing processor(s)/device(s) to perform any of the features presented herein. The storage medium can include, but is not limited to, one or more of the following: any type of physical media including floppy disks, optical discs, DVDs, CD-ROMs, microdrives, magneto-optical disks, holographic storage, ROMs, RAMs, PRAMS, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs); paper or paper-based media; and any type of media or device suitable for storing instructions and/or information. Various embodiments include a computer program product that can be transmitted in whole or in parts and over one or more public and/or private networks wherein the transmission includes instructions which can be used by one or more processors to perform any of the features presented herein. In various embodiments, the transmission may include a plurality of separate transmissions.
Stored on one or more of the computer readable medium (media), the present disclosure includes software for controlling both the hardware of general purpose/specialized computer(s) and/or processor(s), and for enabling the computer(s) and/or processor(s) to interact with a human user or other mechanism utilizing the results of the present invention. Such software may include, but is not limited to, device drivers, operating systems, execution environments/containers, user interfaces and applications.
The foregoing description of the preferred embodiments of the present invention has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art. Embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.

Claims

What is claimed is:
1. A method for providing singleton services in a cluster, the method comprising: designating one server in the cluster to be the cluster master; designating one or more servers in the cluster to be migratable servers, wherein said one or more migratable servers perform a singleton service; heartbeating liveness information of the migratable servers to a database; wherein the cluster master monitors the heartbeating of the migratable servers; and wherein the cluster master automatically migrates a migratable server that failed to heartbeat onto a different machine, whereby execution of the singleton service offered by the migratable server is ensured.
2. The method of claim 1, further comprising: heartbeating liveness information of the cluster master to the database; wherein the migratable servers monitor the heartbeating of the cluster master; and wherein one of the migratable servers is automatically designated cluster master if the cluster master fails to heartbeat.
3. The method of claim 1 wherein the cluster master migrates the migratable servers by invoking a node manager, said node manager comprising: scripts to facilitate server migration across machines; and scripts to start and shutdown servers; wherein the node manager is capable of being invoked remotely by the cluster master.
4. The method of claim 1 wherein the migratable server's IP address is migrated along with the migratable server.
5. The method of claim 1 wherein an administration server is employed for initially starting and stopping the servers in the cluster.
6. The method of claim 5 wherein the administration server is capable of being used to manually migrate the migratable server.
7. The method of claim 1 wherein the database employed is a highly available database comprising: a table for storing the heartbeats and other information about the servers in the cluster; wherein the table can be accessed and updated by the servers in the cluster.
8. A system for providing singleton services in a cluster, comprising: a database capable of storing server liveness information; a cluster of servers comprising: one server designated as a cluster master; and one or more other servers designated as migratable servers; wherein the one or more migratable servers in the cluster heartbeat against the database in order to prove their liveness information; wherein the cluster master monitors the heartbeat of all migratable servers; wherein the cluster master automatically migrates a migratable server to a different machine if the migratable server fails to heartbeat.
9. The system of claim 8 wherein: the cluster master heartbeats against the database; wherein migratable servers monitor the heartbeats of the cluster master; and wherein one migratable server is automatically designated as cluster master if the cluster master fails to heartbeat.
10. The system of claim 8 further including a node manager, said node manager comprising: scripts to facilitate server migration across machines; and scripts to start and shutdown servers; wherein the node manager is capable of being invoked remotely by the cluster master in order to migrate the migratable servers.
11. The system of claim 8 wherein the migratable server's IP address is migrated along with the migratable server.
12. The system of claim 8 wherein an administration server is employed for initially starting and stopping the servers in the cluster.
13. The system of claim 12 wherein the administration server is capable of being used to manually migrate the migratable server.
14. The system of claim 8 wherein the database employed is a highly available database comprising: a table for storing the heartbeats and other information about the servers in the cluster; wherein the table can be accessed and updated by the servers in the cluster.
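For illustration only, and without limiting the claims, the heartbeat, monitoring, migration and cluster master takeover steps recited above might be sketched as follows. The in-memory `LivenessTable` stands in for the database table; the function names, the `timeout` parameter and the numeric timestamps are assumptions made for this sketch and are not specified in the disclosure.

```python
class LivenessTable:
    """In-memory stand-in for the database table that stores heartbeats."""

    def __init__(self):
        self.last_beat = {}  # server name -> timestamp of its latest heartbeat
        self.master = None   # name of the current cluster master

    def heartbeat(self, server, now):
        """Record that `server` was alive at time `now`."""
        self.last_beat[server] = now

    def expired(self, now, timeout):
        """Servers whose latest heartbeat is older than `timeout`."""
        return sorted(s for s, t in self.last_beat.items() if now - t > timeout)


def master_tick(table, placement, spare_machine, now, timeout):
    """One monitoring pass by the cluster master.

    Every migratable server that failed to heartbeat is moved to
    `spare_machine`. Returns the names of the servers that were migrated.
    """
    migrated = []
    for server in table.expired(now, timeout):
        if server == table.master:
            continue  # a failed master is handled by the migratable servers
        placement[server] = spare_machine
        # Simplification: treat the restart as immediately successful.
        table.heartbeat(server, now)
        migrated.append(server)
    return migrated


def elect_master_if_needed(table, candidate, now, timeout):
    """Run by a migratable server: take over if the master stopped heartbeating.

    A real deployment would need this check-and-set to be atomic
    (for example, inside a database transaction); that is assumed here.
    """
    beat = table.last_beat.get(table.master)
    if table.master is None or beat is None or now - beat > timeout:
        table.master = candidate
        table.heartbeat(candidate, now)
        return True
    return False
```

In this sketch each server would call `heartbeat` on a fixed interval, the cluster master would call `master_tick` on its own interval, and every migratable server would periodically call `elect_master_if_needed` with its own name as the candidate.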
PCT/US2006/012413 2005-11-15 2006-04-04 System and method for providing singleton services in a cluster WO2007061440A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US73671805P 2005-11-15 2005-11-15
US60/736,718 2005-11-15
US11/396,826 2006-04-03
US11/396,826 US7447940B2 (en) 2005-11-15 2006-04-03 System and method for providing singleton services in a cluster

Publications (2)

Publication Number Publication Date
WO2007061440A2 (en) 2007-05-31
WO2007061440A3 (en) 2007-11-15

Family

ID=38067672

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/012413 WO2007061440A2 (en) 2005-11-15 2006-04-04 System and method for providing singleton services in a cluster

Country Status (1)

Country Link
WO (1) WO2007061440A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015139510A1 (en) * 2014-03-19 2015-09-24 福建福昕软件开发股份有限公司 Method for cluster deployment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6108300A (en) * 1997-05-02 2000-08-22 Cisco Technology, Inc Method and apparatus for transparently providing a failover network device
US20020131423A1 (en) * 2000-10-26 2002-09-19 Prismedia Networks, Inc. Method and apparatus for real-time parallel delivery of segments of a large payload file
US20060190766A1 (en) * 2005-02-23 2006-08-24 Adler Robert S Disaster recovery framework
US20060195560A1 (en) * 2005-02-28 2006-08-31 International Business Machines Corporation Application of attribute-set policies to managed resources in a distributed computing system

Also Published As

Publication number Publication date
WO2007061440A3 (en) 2007-11-15

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
NENP Non-entry into the national phase in: Ref country code: DE
122 EP: PCT application non-entry in European phase; Ref document number: 06740448; Country of ref document: EP; Kind code of ref document: A2