GB2345769A - Failure recovery in a multi-computer system - Google Patents
- Publication number
- GB2345769A
- Authority
- GB
- United Kingdom
- Prior art keywords
- node
- server
- disk
- computer
- standby
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2028—Failover techniques eliminating a faulty processor or activating a spare
- G06F11/2033—Failover techniques switching over of hardware resources
- G06F11/2038—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
- G06F11/2046—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share persistent storage
Abstract
A multi-node computer system includes N + 1 nodes 10, comprising, e.g., N active nodes and one standby node. Each node hosts a server installation 11. Each server has a system disk 16, holding the operating system and configuration files for the server, and being mirrored for resilience. Each server also has a recovery disk 17, which holds a synchronised recovery copy of data held on the respective system disk 16, from which it is normally disconnected. In the event of failure of a node 10, a recovery process (Fig. 2), initiated on system administration workstation 12, reconfigures the system, by connecting the recovery disk 17 corresponding to the failed node to the system disk 16 of the standby node, and copying the contents of this recovery disk to the system disk. This causes the server in the failed node to migrate to the standby node, which thus becomes an active node. The server that was running on the standby node is similarly relocated to the failed node (Fig. 3).
Description
RESILIENCE IN A MULTI-COMPUTER SYSTEM
Background to the Invention
This invention relates to techniques for achieving resilience in a multi-computer system.
Such systems are often used to support a large number of users, and to store very large databases. For example, a typical system may consist of 8 server computers, supporting up to 50,000 users and may store one or more 300 GigaByte databases.
It would be desirable to be able to provide such a system based on standard server software, such as Microsoft Exchange running under Microsoft Windows NT. However, this raises the problem of providing resilience to failure of one of the computers. The use of cluster technology for a system of this scale would be too expensive. Also, Microsoft Exchange is not a cluster-aware application, and it is not permissible to have two instances of Exchange on the same server (even a 2-node cluster).
Summary of the Invention
According to the invention, a computer system comprises a plurality of active computers, at least one standby computer, a plurality of system disk units, a plurality of further disk units for providing a synchronised recovery copy of data held on the system disk units, and a recovery process for reconfiguring the system in the event of failure of one of the active computers, by causing the standby computer to pick up the further disk unit corresponding to the failed computer.
Brief Description of the Drawings
Figure 1 is a block diagram of a multi-node computer system embodying the invention.
Figure 2 is a flow chart showing a recovery process for handling failure of one of the nodes of the system.
Figure 3 is a block diagram showing an example of the system after reconfiguration by the recovery process.
Description of an Embodiment of the Invention
One computer system in accordance with the invention will now be described by way of example with reference to the accompanying drawings.
In the present specification, the following terms are used with specific meanings:
Node: this means an individual computer hardware configuration. In the present embodiment of the invention, each node comprises an ICL Xtraserver computer. Each node has a unique identity number.
Server: this means a specific server software installation. In the present embodiment of the invention, each server comprises a specific Microsoft NT installation. Each server has a unique server name, and is capable of being hosted (i.e. run) on any of the nodes. A server can, if necessary, be shut down and relocated to another node.
System: this means a number of servers accessing a common storage unit.
Referring to Figure 1, the system comprises N+1 nodes 10. In normal operation, N of the nodes are active, while the remaining one is a standby. In this example, N equals four (i.e. there are five nodes altogether). Each of the nodes 10 hosts a server 11.
The system also includes a system administration workstation 12, which allows a (human) operator or system administrator to monitor and control the system. Each server displays its name and current operational state on the workstation 12. One or more other systems (not shown) may also be controlled and monitored from the same workstation.
All of the nodes 10 are connected to a shared disk array 13. In this example, the disk array 13 is an EMC Symmetrix disk array. This consists of a large number of magnetic disk units, all of which are mirrored (duplexed) for resilience. In addition, the disk array includes a number of further disks, providing a Business Continuance Volume (BCV). A BCV is effectively a third plex, which can be connected to or disconnected from the primary plexes under control of EMC TimeFinder software, running on the workstation 12. The BCV data can be synchronised with the primary plexes so as to provide a backup, or can be disconnected from the primary plexes, so as to provide a snapshot of the main data at a given point in time. When the BCV has been split in this way, it can be reconnected at any time and the data then copied from the primary plexes to the BCV, or vice versa, to resynchronise them.
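The connect, split and resynchronise behaviour of a BCV described above can be pictured as a small state machine. The following Python sketch is illustrative only: the class, the state names and the method names are hypothetical and do not represent the TimeFinder command interface, and a volume's contents are reduced to a single byte string.

```python
from enum import Enum


class BcvState(Enum):
    CONNECTED = "connected"        # attached to the primary plexes and kept in step
    DISCONNECTED = "disconnected"  # split off, holding a point-in-time snapshot


class Bcv:
    """Toy model of a BCV acting as a third plex for one primary volume."""

    def __init__(self, primary_data: bytes) -> None:
        self.primary = primary_data
        self.copy = primary_data
        self.state = BcvState.CONNECTED

    def write_primary(self, data: bytes) -> None:
        """Write to the primary; a connected BCV follows, a split one does not."""
        self.primary = data
        if self.state is BcvState.CONNECTED:
            self.copy = data

    def split(self) -> None:
        """Disconnect the BCV, freezing its contents as a snapshot."""
        self.state = BcvState.DISCONNECTED

    def reconnect(self, restore: bool = False) -> None:
        """Reconnect and resynchronise.

        restore=False copies primary -> BCV (refresh the backup).
        restore=True copies BCV -> primary (roll the primary back).
        """
        if restore:
            self.primary = self.copy
        else:
            self.copy = self.primary
        self.state = BcvState.CONNECTED
```

In this model, splitting the BCV and then writing to the primary leaves the BCV copy unchanged, and reconnecting with `restore=True` copies the snapshot back over the primary, which is the direction of copy the recovery process relies on.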
The system also includes an archive server 14 connected to the disk array 13 and to a number of robotic magnetic tape drives 15. In operation, the archive server periodically performs an offline archive of the data in each database, by archiving the copy of the database held in the BCV to tape. When the archive is secure, the BCV is then brought back into synchronism with the main database, before again being broken away to form the recovery BCV, using the EMC TimeFinder software.
As illustrated in Figure 1, the disk array 13 includes a number of system disks 16, one for each of the servers 11. Each system disk holds the NT operating system files and configuration files for its associated server: in other words, the system disk holds all the information that defines the "personality" of the server installation. Each of the system disks has a BCV disk 17 associated with it, holding a backup copy of the associated system disk. Normally, each BCV disk 17 is disconnected from its corresponding system disk; it is connected only if the system disk changes, so as to synchronise the two copies.
In the event of failure of one of the N active nodes 10, a recovery process is initiated on the system administration workstation 12. In this example, the recovery process comprises a script, written in the scripting language associated with the Timefinder software. The process guides the system administrator through a recovery procedure, which reconfigures the system to cause the standby node to pick up the system disk BCV of the failed node, thereby relocating the server on the failed node on to the standby node and vice versa.
The recovery process makes use of a predetermined set of device files, one for every possible combination of node and server. Since in this example there are five servers and five nodes (including the standby), there are 25 possible combinations, and hence 25 such device files are provided. Each of these files is identified by a name in the form n(N)is(S), where N is a node identity number, and S is the last three digits of the server name. (Other conventions could of course be used for naming the files.) Each device file contains all the information required to install the specified server on the specified node.
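As a minimal sketch of this naming scheme, the Python below enumerates one device file name per node and server combination. The specification gives only the general form of the name, so the concrete rendering used here (the letter `n`, the node number, `is`, then the last three characters of the server name, with no brackets) is an assumption, as are the node numbers and the server names.

```python
from itertools import product
from typing import List


def device_file_name(node_id: int, server_name: str) -> str:
    """Build a device file name from a node number and a server name.

    Assumed rendering: 'n' + node number + 'is' + last three characters
    of the server name. The real file naming may differ.
    """
    return f"n{node_id}is{server_name[-3:]}"


# Hypothetical identities: five nodes and five servers, as in the example.
NODE_IDS: List[int] = [1, 2, 3, 4, 5]
SERVER_NAMES: List[str] = ["EXCH001", "EXCH002", "EXCH003", "EXCH004", "EXCH005"]

# One device file per (node, server) pair: 5 x 5 = 25 names.
ALL_DEVICE_FILES: List[str] = [
    device_file_name(node, server)
    for node, server in product(NODE_IDS, SERVER_NAMES)
]
```

Because every pairing is prepared in advance, the recovery process only has to select two existing files at failure time rather than generate configuration data on the spot.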
As illustrated in Figure 2, the recovery process comprises the following steps:
(Step 201) The recovery process first confirms the identity of the failed system with the administrator. This step is required only if more than one system is managed from the same system administration workstation.
(Step 202) The recovery process then queries the administrator to obtain the identity numbers of the failed node and the standby node. The administrator can determine these node numbers using information displayed on the system administration workstation 12.
(Step 203) The recovery process next queries the system administrator to obtain the name of the failed server (i.e. the server currently running on the failed node). The recovery process also automatically determines the name of the standby server; this is a predetermined value for each system.
(Step 204) The recovery process also automatically determines the device identifiers for the BCVs associated with the failed server and the standby server, using a lookup table which associates each server name with a particular device identifier.
(Step 205) The recovery process then calls the BCV QUERY command in the Timefinder software, so as to determine the current states of these two BCVs. These should both be in the disconnected state.
If one or both of the BCVs is not in the disconnected state, the recovery process aborts, prompting the system administrator to call the appropriate technical support service.
(Step 206) If both of the BCVs are in the disconnected state, the recovery process continues by prompting the administrator to ensure that both the failed server and the standby server are shut down. The recovery process waits for confirmation that this has been done.
(Step 207) When both the failed server and the standby server have been shut down, the recovery process constructs two device file names as follows:
The first file name is n(W)is(X), where W is the node number of the standby node and X is the last three digits of the failed server's name.
The second file name is n(Y)is(Z), where Y is the node number of the failed node and Z is the last three digits of the standby server's name.
(Step 208) The recovery process then calls the TimeFinder BCV RESTORE command, passing it the first device file name as a parameter. This causes the BCV of the failed node to be linked to the system disk of the standby server, and initiates copying of the data from this BCV to the system disk. It can be seen that the effect of this is to relocate the server that was running on the failed node on to the standby node.
The recovery process also calls the BCV RESTORE command, passing it the second device file name as a parameter. This causes the BCV of the standby node to be linked to the system disk of the failed server, and initiates copying of the data from this BCV to the system disk. The effect of this is therefore to relocate the server that was running on the standby node on to the failed node.
As an example, Figure 3 shows the case where node 1 has failed, and where node 4 is the standby. As shown, the BCV disk of the standby node is linked to the system disk of the failed node, and the BCV of the failed node is linked to the system disk of the standby node.
While the restore commands are running, the recovery process checks for error responses, and reports any such responses to the administrator. It also writes all actions to a log file immediately prior to the action.
(Step 209) After issuing the restore commands, the recovery process prompts the administrator to restart the recovered server (i.e. the server which has migrated from the failed node to the standby node), stating the new node name it will run on.
The standby node therefore now becomes an active node.
It should be noted that the restore commands run in the background and typically take about an hour to complete. However, the recovered server can be restarted immediately, and its data accessed, without waiting for the restore commands to complete.
(Step 210) The recovery procedure monitors for completion of the BCV restore operations, using the Timefinder BCV Query command.
(Step 211) When the restore operations are complete, the recovery procedure issues a Timefinder BCV Split command, which disconnects the BCVs from the system disks. Recovery is now complete, and the recovery process terminates.
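The automated part of the procedure (Steps 205, 207, 208, 210 and 211) can be sketched as follows. This is not the script described in the specification, which is written in the scripting language associated with the TimeFinder software. Here the BCV operations are hidden behind a hypothetical `BcvControl` interface, the state strings `"disconnected"` and `"restored"` are assumed labels, the device file name rendering is assumed, and the administrator prompts of Steps 201 to 203, 206 and 209 are omitted.

```python
from typing import Callable, Mapping, Protocol


class BcvControl(Protocol):
    """Hypothetical wrapper around the BCV operations named in the text."""

    def query(self, device: str) -> str: ...
    def restore(self, device_file: str) -> None: ...
    def split(self, device: str) -> None: ...


def recover(
    ctl: BcvControl,
    failed_node: int,
    standby_node: int,
    failed_server: str,
    standby_server: str,
    bcv_of: Mapping[str, str],
    log: Callable[[str], None],
    wait: Callable[[], None],
) -> bool:
    """Swap the failed and standby servers; return False if recovery aborts."""
    # Step 204: look up the BCV device for each of the two servers.
    bcvs = [bcv_of[failed_server], bcv_of[standby_server]]

    # Step 205: both BCVs must be disconnected, otherwise abort.
    if any(ctl.query(device) != "disconnected" for device in bcvs):
        log("abort: BCV not disconnected, call technical support")
        return False

    # Step 207: build the two device file names (assumed rendering).
    device_files = [
        f"n{standby_node}is{failed_server[-3:]}",
        f"n{failed_node}is{standby_server[-3:]}",
    ]

    # Step 208: log each action immediately before performing it.
    for device_file in device_files:
        log(f"restore {device_file}")
        ctl.restore(device_file)

    # Step 210: poll until both restores report completion.
    while any(ctl.query(device) != "restored" for device in bcvs):
        wait()

    # Step 211: split the BCVs from the system disks again.
    for device in bcvs:
        log(f"split {device}")
        ctl.split(device)
    return True
```

Passing `log` and `wait` in as callables keeps the sketch free of real I/O and delays; a real script would write to a log file and sleep between queries.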
Once the failed node has been fixed, it can be rebooted as required, and will then host the standby server, so becoming the new standby node. The recovery procedure can then be repeated if any of the active nodes fails.
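The net effect of a recovery on the node-to-server assignment is a simple exchange, which is why the procedure can be repeated after a repair. The sketch below models this with a plain dictionary; the node numbers, the server names and the `relocate` helper are hypothetical and serve only to illustrate the bookkeeping.

```python
from typing import Dict


def relocate(hosting: Dict[int, str], failed_node: int, standby_node: int) -> Dict[int, str]:
    """Return a new node -> server map with the two servers exchanged."""
    updated = dict(hosting)
    updated[failed_node] = hosting[standby_node]
    updated[standby_node] = hosting[failed_node]
    return updated


# Hypothetical starting point: four active servers, standby server on node 4.
initial = {0: "SRV000", 1: "SRV001", 2: "SRV002", 3: "SRV003", 4: "STANDBY"}

# Node 1 fails: its server moves to node 4, and node 1 takes the standby server.
after_first = relocate(initial, failed_node=1, standby_node=4)

# Later, node 2 fails: the repaired node 1 is now the standby node.
after_second = relocate(after_first, failed_node=2, standby_node=1)
```

After each exchange exactly one node hosts the standby server, so the system returns to an N+1 arrangement as soon as the failed node is repaired.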
Some possible modifications
It will be appreciated that many modifications may be made to the system described above without departing from the scope of the present invention. For example, different numbers of disks and computers may be used. Also, the invention may be implemented in other operating systems, and using other hardware configurations. Moreover, instead of implementing the recovery procedure by means of a script, it could for example be integrated into the operating system.
Claims (5)
- 1. A computer system comprising a plurality of active computers, at least one standby computer, a plurality of system disk units, a plurality of further disk units for providing a synchronised recovery copy of data held on the system disk units, and a recovery process for reconfiguring the system in the event of failure of one of the active computers, by causing the standby computer to pick up the further disk unit corresponding to the failed computer.
- 2. A computer system according to Claim 1 wherein the recovery process connects the further disk unit associated with the failed computer to the system disk of the standby computer and initiates copying of data from that further disk to the system disk.
- 3. A computer system according to Claim 2 wherein the recovery process restarts the standby computer while the copying of data is being performed in the background.
- 4. A computer system according to any preceding claim, including a set of device files, one for each possible combination of a particular operating system installation with a particular computer hardware configuration, wherein the recovery process selects two of the device files that correspond to the new configurations of the failed computer and the standby computer and uses these to control reconfiguration of the system.
- 5. A computer system substantially as hereinbefore described with reference to the accompanying drawings.
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9900473A GB2345769A (en) | 1999-01-12 | 1999-01-12 | Failure recovery in a multi-computer system |
EP99306404A EP0987630B1 (en) | 1998-09-08 | 1999-08-13 | Resilience in a multi-computer system |
DE69927223T DE69927223T2 (en) | 1998-09-08 | 1999-08-13 | Resilience of a multi-computer system |
US09/385,937 US6460144B1 (en) | 1998-09-08 | 1999-08-30 | Resilience in a multi-computer system |
AU47388/99A AU753898B2 (en) | 1998-09-08 | 1999-09-06 | Resilence in a multi-computer system |
JP25385899A JP3967499B2 (en) | 1998-09-08 | 1999-09-08 | Restoring on a multicomputer system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9900473A GB2345769A (en) | 1999-01-12 | 1999-01-12 | Failure recovery in a multi-computer system |
Publications (1)
Publication Number | Publication Date |
---|---|
GB2345769A (en) | 2000-07-19 |
Family
ID=10845807
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9900473A Withdrawn GB2345769A (en) | 1998-09-08 | 1999-01-12 | Failure recovery in a multi-computer system |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2345769A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5621884A (en) * | 1993-04-30 | 1997-04-15 | Quotron Systems, Inc. | Distributed data access system including a plurality of database access processors with one-for-N redundancy |
Non-Patent Citations (1)
Title |
---|
PC Week Vol. 4, No. 37, September 15, 1987, pages C26-30, and also DIALOG Accession No. 01210619. * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4744804B2 (en) | Information replication system with enhanced error detection and recovery | |
US10146453B2 (en) | Data migration using multi-storage volume swap | |
JP4400913B2 (en) | Disk array device | |
US6658589B1 (en) | System and method for backup a parallel server data storage system | |
US7689862B1 (en) | Application failover in a cluster environment | |
US7447933B2 (en) | Fail-over storage system | |
US7546484B2 (en) | Managing backup solutions with light-weight storage nodes | |
US6145066A (en) | Computer system with transparent data migration between storage volumes | |
US8015396B2 (en) | Method for changing booting configuration and computer system capable of booting OS | |
US6978282B1 (en) | Information replication system having automated replication storage | |
JP3957278B2 (en) | File transfer method and system | |
US20140325267A1 (en) | System and method for high performance enterprise data protection | |
US20050188248A1 (en) | Scalable storage architecture | |
US20050149684A1 (en) | Distributed failover aware storage area network backup of application data in an active-N high availability cluster | |
US5615330A (en) | Recovery method for a high availability data processing system | |
US6460144B1 (en) | Resilience in a multi-computer system | |
US7207033B2 (en) | Automatic backup and restore for configuration of a logical volume manager during software installation | |
JP2000099359A5 (en) | ||
JPH09293001A (en) | Non-stop maintenance system | |
GB2345769A (en) | Failure recovery in a multi-computer system | |
WO2003003209A1 (en) | Information replication system having enhanced error detection and recovery | |
Bagal et al. | Oracle Database Storage Administrator's Guide, 11g Release 1 (11.1) B31107-05 | |
Salinas et al. | Oracle Real Application Clusters Administrator's Guide 10g Release 1 (10.1) Part No. B10765-02 Copyright© 1998, 2004, Oracle. All rights reserved. Primary Authors: David Austin, Mark Bauer Contributing Authors: Jonathan Creighton, Rajiv Jayaraman, Raj Kumar, Dayong Liu, Venkat Maddali | |
Salinas et al. | Oracle Real Application Clusters Administrator's Guide, 10g Release 1 (10.1) Part No. B10765-01 Copyright© 1998, 2003, Oracle. All rights reserved. Primary Author: David Austin and Mark Bauer. Contributor: Jonathan Creighton, Rajiv Jayaraman, Raj Kumar, Dayong Liu, Venkat Maddali, Michael | |
Scriba et al. | Storage Foundation Software stack |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |