JP5321658B2 - Failover method and its computer system. - Google Patents

Info

Publication number
JP5321658B2
JP5321658B2 JP2011184262A JP2011184262A JP5321658B2 JP 5321658 B2 JP5321658 B2 JP 5321658B2 JP 2011184262 A JP2011184262 A JP 2011184262A JP 2011184262 A JP2011184262 A JP 2011184262A JP 5321658 B2 JP5321658 B2 JP 5321658B2
Authority
JP
Japan
Prior art keywords
logical partition
server
failure
affected
computer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2011184262A
Other languages
Japanese (ja)
Other versions
JP2011258233A (en)
Inventor
片野真吾
高本良史
畑崎恵介
Original Assignee
株式会社日立製作所
Application filed by 株式会社日立製作所
Priority to JP2011184262A
Publication of JP2011258233A
Application granted
Publication of JP5321658B2

Abstract

PROBLEM TO BE SOLVED: To enhance availability upon recovery from a failure.

SOLUTION: The computer system comprises a plurality of servers connected to an external disk device over a network, and a logical partition mechanism that establishes one or more logical partitions on each server; each logical partition boots an operating system from a boot disk in the external disk device. When a failure occurs on a server that is running a task and the task is taken over by another server in the computer system, only the logical partition affected by the failure is failed over. The failover method comprises the steps of: detecting the server on which the failure occurred (921); determining the failed part; determining the logical partition affected by the failure of the failed part (924); stopping that logical partition (912); searching the computer system for a spare server to take over the task (926); establishing, in the spare server, a logical partition corresponding to the affected one (952); having the established logical partition take over the boot disk of the affected logical partition (941); and starting up the established logical partition (953).

COPYRIGHT: (C)2012,JPO&INPIT

Description

The present invention relates to a failover method in a computer system comprising servers booted from an external disk device, and more particularly to a method for failing over only a specific logical partition in a logically partitioned computer system.

When a server boots from an external disk array device, the disk array device can be connected to multiple servers via Fibre Channel or a Fibre Channel switch, so the boot disk of one server can be referenced from another server. In this configuration, when a failure occurs in a server that is executing a job, the job can be taken over by starting a spare server from that server's boot disk. Furthermore, because a spare server need not be paired with each active server, work can be taken over from any active server by any spare server, which reduces the introduction cost (see Patent Document 1).
On the other hand, another technique for reducing the introduction cost is to consolidate a plurality of tasks by dividing a single server into a plurality of logical partitions. For example, the CPUs, memories, I/O devices, and so on are divided up and assigned to individual logical partitions.
Combining these technologies can reduce the introduction cost further.

Japanese Patent Application Laid-Open No. 2003-228561 describes a hot standby scheme for virtual machines in which only a failed operating system is taken over by another host.

JP 2006-163963 A JP-A-4-141744

Since the conventional technology described in Patent Document 2 is a hot standby system, another host must be operated in synchronization with the host in which a failure has occurred, which is a cost problem. Further, Patent Document 2 describes operating system failures but not hardware failures. When a hardware component in a server fails, which logical partitions are affected depends on how the logical partitions are configured, and a single component may be related to a plurality of logical partitions. Conventionally, the normal response to a failure has been to fail over the entire server regardless of the configuration of its logical partitions.

Consider a server system that boots an OS from an external disk array device, divides a server into logical partitions, and runs two or more independent virtual servers on a single server. When a server failure occurs and the boot disk is taken over from the active server to a spare server, all of the virtual servers running on the active server are stopped. The impact of the failure is therefore very large, and the availability of the entire system is lowered. For example, even when a failure occurs in a CPU assigned to one specific logical partition, the other logical partitions must also be stopped, which reduces availability.

An object of the present invention is to solve the problems in the above-described conventional method and increase the availability at the time of failure recovery in such a system.

The invention applies to a computer system in which a plurality of servers are connected to an external disk device on a network, each server has a logical partition mechanism for constructing one or more logical partitions on the server, and each logical partition boots an operating system from a boot disk of the external disk device. When a failure occurs in a server that is running a business and the business is transferred to another server in the computer system, only the logical partition affected by the failure is failed over. To this end, the failover method is characterized by: detecting the failure of the server; identifying the failed part; identifying, from the failed part, the logical partition affected by the failure; stopping that logical partition; searching the computer system for a spare server to take over; constructing, in the spare server, a logical partition corresponding to the one affected by the failure; having the constructed logical partition take over the boot disk of the affected logical partition; and activating the constructed logical partition.

In addition, in a computer system in which a plurality of computers are connected to a disk device via a network and each computer has a logical partition mechanism for constructing one or more logical partitions in the computer,
At least one of the computers has a failover mechanism that takes over the work to another computer when a failure occurs in a computer that is running the work,
The computer has a processing unit that detects the occurrence of a failure, and a processing unit that identifies the site of the failure,
The failover mechanism has:
A processing unit for identifying a logical partition affected by the failure from the failure site;
A processing unit for stopping the logical partition;
A processing unit for searching the computer system for a spare computer to take over;
A processing unit for constructing, in the spare computer, a logical partition corresponding to the one affected by the failure;
A processing unit for constructing a correspondence relationship between the logical partition constructed in the spare computer and the external disk device for booting an operating system; and
A processing unit that activates the constructed logical partition.

Further, in a computer system in which a server has a logical partition mechanism for constructing one or more logical partitions on the server,
A failover method for taking over work in the computer system comprises:
Detecting a failure of the server;
Identifying the site of the failure;
Identifying, from the failure site, a logical partition that is affected by the failure;
Stopping that logical partition;
Searching the server for a logical partition to take over;
Building a logical partition corresponding to the logical partition affected by the failure;
The method is characterized in that the logical partition affected by the failure is rebuilt as another logical partition within the same physical computer and failed over to it.

The present invention takes over a business by taking over its boot disk when a failure occurs in a server that boots from an external disk device and on which at least one logical partition is constructed using logical partitioning technology. By identifying the logical partition affected by the failure and failing over only that partition, the method allows logical partitions that are not affected by the failure to continue operating.

FIG. 1 shows an overall configuration diagram of the present invention. (Example 1)
FIG. 2 shows a block diagram of a server.
FIG. 3 shows a block diagram of a logical partition.
FIG. 4 shows the failover mechanism of the management server.
FIG. 5 shows a server management table.
FIG. 6 shows the structure of a server management function.
FIG. 7 shows a logical partition information table.
FIG. 8 shows an example of changing a logical partition.
FIG. 9 shows a sequence diagram of the present invention.
FIG. 10 shows the processing flow of a server management function.
FIG. 11 shows the processing flow of the influence range search function.
FIG. 12 shows the processing flow of a server search function.
FIG. 13 shows the processing flow of a logical partition management function.
FIG. 14 shows a sequence diagram of the present invention. (Example 2)
FIG. 15 shows the processing flow of a server management function.
FIG. 16 shows a sequence diagram of the present invention. (Example 3)
FIG. 17 shows a logical partition information table.
FIG. 18 shows the processing flow of a server management function.
FIG. 19 shows the processing flow of a logical partition management function.
FIG. 20 shows a logical partition information table. (Example 4)
FIG. 21 shows the processing flow of the influence range search function.
FIG. 22 shows an example of the management server console screen in Example 2.
FIG. 23 shows an example of the management server console that displays the structure of a server in Example 2.

  Hereinafter, embodiments of the present invention will be described.

FIG. 1 shows an overall view of an embodiment of the present invention. A plurality of servers 102 are connected to a network that includes the management server 101 via network interface cards (NICs) 122, and to the disk array device 103 via Fibre Channel host bus adapters (HBAs) 121. Each server 102 includes a BMC (Baseboard Management Controller) 123, so hardware status monitoring and power supply control can be performed from the management server 101 via the network. In addition, the server 102 is equipped with a logical partition mechanism 120 and can configure at least one logical partition. The management server 101 monitors the state of the servers 102 and the disk array device 103 via the network and controls them as necessary. The management server 101 contains a failover mechanism 110. The failover mechanism 110 receives failure notifications from the BMC 123, controls the power supply through the BMC 123, searches for the logical partitions affected by the cause of a failure, controls logical partitions through the logical partition mechanism 120, and controls the disk mapping function 130 of the disk array device 103; this mechanism is one of the features of the present invention. The disk mapping function 130 in the disk array device 103 is a security function that restricts which servers 102 can access a disk 131 by associating the HBA 121 mounted on a server 102 with the disk 131. In the first embodiment of the present invention, the server 102 uses a disk 131 in the disk array device 103 as its boot disk, and the OS and business applications are stored on the disk 131.

In this embodiment, "server" means a computer (a physical computer). Failover refers to handing over processing and data to an alternate or spare server when a server fails. In the present invention, failover covers not only failures but also cases in which a user specifies, for example through a logical partition operation, that processing and data be taken over by an alternate or spare server.

In this embodiment, logical partitioning is a function that allows a single computer to operate like a plurality of independent computers: for example, a plurality of CPUs, memories, I/O devices, and so on are divided up and assigned to individual logical partitions, and an operating system runs in each logical partition.

FIG. 2 shows a detailed configuration of the server 102 in this embodiment. The server 102 includes a memory 201 that stores programs and data, at least one CPU 202 that executes the programs in the memory, and, as I/O devices, at least one HBA 121, at least one NIC 122, and a BMC 123. The HBA 121 stores in its memory a WWN (World Wide Name) 204, which is required to identify the communication partner in Fibre Channel communication.
The BMC 123 mainly monitors and controls the hardware of the server 102. When an abnormality occurs in the hardware of the server 102, the failure detection mechanism 206 detects it and notifies the outside. The server 102 can also be powered on and off remotely through the BMC 123.

FIG. 3 shows the configuration of the logical partitions of the server 102 in this embodiment. The server 102 includes a logical partition mechanism 120, which can construct one or more logical partitions 301, and an operating system 311 can be executed on each logical partition 301. The logical partition mechanism 120 may be dedicated hardware, or may be a program that runs on the CPU of the server.

FIG. 4 shows the failover mechanism 110 of the management server 101 in FIG. 1 in this embodiment. The failover mechanism 110 includes: a server management table 401 that stores the hardware and partition configuration of each server and its usage status; a server management function 405 that monitors server status and controls server power; a logical partition information table 402 that stores a list of the logical partitions of each server and the hardware used by each logical partition; a logical partition management function 407 that performs logical partition control of the servers; an influence range search function 403 that, when a hardware failure occurs, searches for the logical partitions affected by it; a server search function 406 that searches for a spare server to serve as the business takeover destination at failover; and a disk mapping change function 404 that, at failover, changes the disk mapping from the server in which the failure occurred to the spare server. The functions and tables of the failover mechanism may be loaded into the memory of the management server 101 and executed by its CPU, or each function and table may be implemented in dedicated hardware.

FIG. 5 shows details of the server management table 401 in FIG. 4. The server management table stores a list of the servers managed by the failover mechanism 110, together with the configuration information and status of each server. Column 501 stores server identifiers. A server identifier 501 may be any information that can identify the server, such as the server serial number. Column 502 stores CPU information. Column 521 within column 502 stores the number of CPUs mounted on the server. Column 522 within column 502 stores CPU identifiers. A CPU identifier 522 may be any information that can identify a CPU in the server, such as the slot number of the CPU. The CPU information in this embodiment is per CPU slot, that is, per CPU, but the management unit is not limited to the CPU. The management unit can be chosen in consideration of the granularity at which a failure can be detected and the unit of allocation to logical partitions; a CPU core, for example, is another possible unit. Column 503 stores memory information. Column 531 within column 503 stores the total amount of memory installed in the server. Column 532 within column 503 stores memory identifiers. A memory identifier 532 may be any information that can identify a memory in the server, such as the memory slot number. Column 533 within column 503 stores the memory capacity. The memory information in this embodiment is per memory slot, that is, per memory module, but again the management unit is not limited to this; a memory bank, for example, is another possible unit. Column 504 stores information on the server's I/O devices. Column 541 within column 504 stores the MAC addresses of the NICs. Column 542 within column 504 stores the WWNs of the HBAs.

In this embodiment, the I/O device information consists of the MAC address of the NIC in column 541 and the WWN of the HBA in column 542, but this does not restrict the I/O device information to NICs and HBAs. When the server is equipped with other I/O devices, their information can also be registered in column 504. Column 505 indicates the state of the server. "In use" means a job is being executed. "Unused" means the server can be used for another job immediately. When a failure has occurred and the server cannot be used, information indicating the failure is stored. Column 506 indicates the failure site: when column 505 indicates that a failure has occurred, column 506 stores which part of the server has failed.
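As an illustration of the table just described, the following is a minimal Python sketch of one row of the server management table. The class name, field names, status strings, and the use of gigabytes are assumptions made for the example; only the column numbers in the comments come from the description of FIG. 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ServerRecord:
    server_id: str                     # column 501
    cpu_ids: List[str]                 # column 522; column 521 is the count
    memory_gb: Dict[str, int]          # column 532 -> column 533 (GB is an assumed unit)
    nic_macs: List[str]                # column 541
    hba_wwns: List[str]                # column 542
    status: str = "unused"             # column 505: "in use", "unused", or "failed"
    failed_part: Optional[str] = None  # column 506, set only when status is "failed"

    @property
    def cpu_count(self) -> int:        # column 521
        return len(self.cpu_ids)

    @property
    def total_memory_gb(self) -> int:  # column 531
        return sum(self.memory_gb.values())


def record_failure(table: Dict[str, ServerRecord], server_id: str, part: str) -> None:
    """Mark a server as failed and remember which part failed."""
    record = table[server_id]
    record.status = "failed"
    record.failed_part = part


table = {
    "S1": ServerRecord("S1", ["CPU1", "CPU2"], {"Mem1": 4, "Mem2": 4},
                       ["MAC1", "MAC2"], ["WWN1", "WWN2"], status="in use"),
}
record_failure(table, "S1", "CPU2")
```

Deriving the CPU count and total memory from the per-device entries keeps columns 521 and 531 consistent with columns 522 and 533; a real implementation might store them separately.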

FIG. 6 shows details of the server management function 405 in FIG. 4. The server management function monitors the servers, tracking their operating status and failures, and controls their power supply. When the BMC 123 shown in FIG. 1 or an agent program running on a server detects a server failure, the server management function 405 is notified of it. The failure information in this notification includes the type of failure. The server management function 405 has a failure operation table 601 that it uses to determine, according to the failure type, whether to execute failover. Column 611 indicates the type of failure, and column 612 indicates whether to execute failover when that failure occurs. The contents of the failure operation table can be set freely by the system user.
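The failure operation table can be pictured as a simple lookup from failure type to a yes/no decision. The sketch below is illustrative only: the failure type names and the choice to skip failover for unknown types are assumptions, not values given in the description.

```python
# Failure operation table: failure type (column 611) -> execute failover? (column 612).
# The failure type names are hypothetical; the system user sets the real contents.
FAILURE_OPERATION_TABLE = {
    "cpu_error": True,
    "memory_error": True,
    "fan_warning": False,
}


def should_fail_over(failure_type, table=FAILURE_OPERATION_TABLE, default=False):
    """Return whether failover should run for this failure type.

    Unknown failure types fall back to `default` (an assumption of this sketch).
    """
    return table.get(failure_type, default)
```

For example, `should_fail_over("cpu_error")` returns `True`, while `should_fail_over("fan_warning")` returns `False`.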

FIG. 7 shows details of the logical partition information table 402 in FIG. 4. For each server managed by the failover mechanism 110, the logical partition information table 402 stores a list of the logical partitions configured in the server and the hardware (H/W) used by each logical partition. Column 701 shows server identifiers, which are the same identifiers as in column 501 of the server management table of FIG. 5. Column 702 shows the identifier of the logical partition mechanism. Column 703 shows logical partition identifiers, that is, the identifiers that the logical partition mechanism uses to identify logical partitions. Column 704 shows the number of CPUs used by the logical partition. Column 705 shows the identifiers of those CPUs, which are the same identifiers as in column 522 of the server management table of FIG. 5. Column 706 shows the memory capacity used by the logical partition. Column 707 shows memory identifiers, which are the same identifiers as in column 532 of the server management table of FIG. 5. Column 708 shows the I/O devices used by the logical partition: column 781 shows the MAC address of the NIC, and column 782 shows the WWN of the HBA.

In the example in the table of FIG. 7, the server with server identifier S1 has two logical partitions, LPAR1 and LPAR2; LPAR1, for example, corresponds to the hardware elements CPU1, Mem1, MAC1, and WWN1. The server with server identifier S2 has two logical partitions, LPAR3 and LPAR4, and LPAR3 corresponds to two CPUs, three memories, two NICs, and two HBAs.

The table in FIG. 7 does not include an example in which a hardware device such as a CPU or an I/O device is shared by a plurality of logical partitions. Depending on the configuration, however, two or more logical partitions may use a single hardware device. In that case, two or more logical partitions may be identified in the step of identifying the logical partitions affected by a failure (step 1104, described later).
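To make the shared-device case concrete, here is a small Python sketch that models rows of the logical partition information table and reports which devices are used by more than one logical partition. The row class and the sample rows are hypothetical; in particular, the sample deliberately lets LPAR2 share CPU1 with LPAR1, which the table of FIG. 7 itself does not show.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LparRow:
    server_id: str               # column 701
    lpar_id: str                 # column 703
    cpu_ids: Tuple[str, ...]     # column 705
    memory_ids: Tuple[str, ...]  # column 707
    nic_macs: Tuple[str, ...]    # column 781
    hba_wwns: Tuple[str, ...]    # column 782

    def devices(self):
        return set(self.cpu_ids) | set(self.memory_ids) | set(self.nic_macs) | set(self.hba_wwns)


def shared_devices(rows):
    """Map (server, device) -> logical partitions, for devices used by two or more partitions."""
    users = defaultdict(set)
    for row in rows:
        for device in row.devices():
            users[(row.server_id, device)].add(row.lpar_id)
    return {key: lpars for key, lpars in users.items() if len(lpars) > 1}


rows = [
    LparRow("S1", "LPAR1", ("CPU1",), ("Mem1",), ("MAC1",), ("WWN1",)),
    # Hypothetical row: LPAR2 shares CPU1 with LPAR1.
    LparRow("S1", "LPAR2", ("CPU1",), ("Mem2",), ("MAC2",), ("WWN2",)),
]
shared = shared_devices(rows)
```

With these sample rows, a failure of CPU1 would affect both LPAR1 and LPAR2.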

FIG. 8 shows an example of changing the logical partition in this embodiment. The server 802 in which the failure occurred has a logical partition mechanism 823, which configures logical partition 821 and logical partition 822. The logical partition 821 uses hardware 824 and the logical partition 822 uses hardware 825. The hardware used by the logical partitions may be disjoint, as in this embodiment, or part or all of it may be shared by different logical partitions. The spare server 803 has a logical partition mechanism 832. When a failure occurs in the hardware 825 of the server 802, the failure detection mechanism 826 detects the failure and notifies the failover mechanism 110 of the management server 101. The failover mechanism 110 activates the influence range search function 403 and identifies the logical partition that uses the failed hardware. In FIG. 8, the logical partition 822 is identified as the logical partition affected by the failure and is moved to the spare server 803, which can take its place. To do so, the failover mechanism 110 uses the logical partition mechanism 823 of the server 802 to stop the logical partition 822, and uses the logical partition mechanism 832 to configure, on the spare server 803, a logical partition 831 with a hardware configuration equivalent to that of the logical partition 822. As a result, failover is achieved without stopping the logical partition that does not use the failed hardware 825.

FIG. 9 shows an operation sequence in the present invention. The participants in the sequence are a server 901, a failover mechanism 902, a server logical partition mechanism 903, a disk mapping mechanism 904, a spare server 905, and a spare server logical partition mechanism 906. Step 911 indicates the occurrence of a failure. The failure is detected by a BMC installed in the server 901 or by an agent program running on the server 901, and the failover mechanism 902 is notified. In step 921, the failover mechanism detects the notified failure. In step 922, the hardware configuration of the server 901 is acquired from the server management table shown in FIG. 5. In step 923, the logical partition configuration of the server is acquired from the logical partition information table shown in FIG. 7. In step 924, the logical partition affected by the failure is identified from the failure information notified in step 921, the server information acquired in step 922, and the logical partition information acquired in step 923. In step 925, the server logical partition mechanism 903 is requested to stop the logical partition identified in step 924. Through steps 924 and 925, only the logical partition affected by the failure is stopped, and the work running in logical partitions not affected by the failure continues without interruption. In step 931, the logical partition designated by the failover mechanism 902 is stopped. If the OS is still running in the logical partition, a shutdown is attempted. However, if the OS is acquiring a dump, the shutdown is not performed until the dump process is completed.

The failover mechanism 902 may request the server 901 to start dump processing. In step 912, the logical partition is stopped. When the OS cannot be shut down, the logical partition mechanism 903 forcibly stops the logical partition. In step 926, a spare server capable of taking over the work is searched for based on the server information acquired in step 922 and the logical partition information acquired in step 923. This search uses the server management table shown in FIG. 5 and the logical partition information table shown in FIG. 7 to find a server on which a logical partition with the same H/W configuration as the logical partition affected by the failure can be created. In the following, the server found by the search is the spare server 905, and its logical partition mechanism is the spare server logical partition mechanism 906. In step 927, the failover mechanism requests activation of the spare server 905 so that a logical partition can be constructed in it. If the spare server 905 is already running, step 927 need not be executed. In step 951, the spare server 905 is activated. In step 928, the spare server logical partition mechanism 906 is requested to change the logical partition configuration, namely to create in the spare server 905 a logical partition with an H/W configuration equivalent to the logical partition affected by the failure. In step 961, the spare server logical partition mechanism 906 carries out the request, and in step 952 the logical partition is created in the spare server 905. In step 929, the disk mapping mechanism 904 is sent a disk mapping change request to release the disk mapping of the logical partition affected by the failure and map the disk to the logical partition created in step 952. In step 941, the disk mapping mechanism 904 of the disk array device executes the requested disk mapping change so that the disk used by the server 901 can be used by the spare server 905.

Here, so that the work of the logical partition affected by the failure can be booted in the new logical partition constructed on the spare server, the WWN of the HBA of the spare server is changed to the WWN of the server affected by the failure. In step 92A, the spare server logical partition mechanism 906 is requested to activate the logical partition created in step 952. In step 962, the spare server logical partition mechanism 906 activates the requested logical partition. In step 953, the logical partition created in step 952 starts up, and the OS and business application are started. Business then resumes.
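The disk mapping change requested in step 929 and executed in step 941 can be illustrated with a toy model in which access to a disk is keyed by the WWN of an HBA. This is a sketch under assumptions: the class, its method names, and the disk and WWN labels are hypothetical, and it models only the mapping change, not the alternative of rewriting the spare server's WWN.

```python
class DiskMapping:
    """Toy model of a disk mapping function: which WWN may access which disks."""

    def __init__(self):
        self._disks_by_wwn = {}

    def allow(self, wwn, disk):
        self._disks_by_wwn.setdefault(wwn, set()).add(disk)

    def can_access(self, wwn, disk):
        return disk in self._disks_by_wwn.get(wwn, set())

    def move(self, old_wwn, new_wwn):
        """Release every mapping held by old_wwn and grant it to new_wwn."""
        disks = self._disks_by_wwn.pop(old_wwn, set())
        if disks:
            self._disks_by_wwn.setdefault(new_wwn, set()).update(disks)
        return disks


mapping = DiskMapping()
mapping.allow("WWN1", "boot-disk-1")      # failed partition's HBA (hypothetical labels)
moved = mapping.move("WWN1", "WWN9")      # hand the boot disk to the spare partition's HBA
```

After the move, the old WWN can no longer reach the boot disk, which mirrors the security role of the disk mapping function.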

In FIG. 9, the spare server is a separate server. However, when a replacement logical partition can be constructed from normal hardware within the same server as the server 901, failover can be performed within the same physical server. In that case, there is no need to boot a new server, and hence no need to change the disk mapping.

  Hereinafter, the sequence in FIG. 9 will be described in more detail.

FIG. 10 shows an operation flow of the server management function 405. In step 1001, failure information is received from the server where the failure occurred. In step 1002, the location and type of the failure are identified from the received failure information. In step 1003, the failure operation table is consulted to find whether failover is to be executed for that failure type. Step 1004 determines from the contents of the failure operation table whether to execute failover. If failover is necessary, the process proceeds to step 1005; if not, the process ends without doing anything. In step 1005, the influence range search function is activated and the logical partition affected by the failure is identified, so that business running in logical partitions not affected by the failure can continue operating as is. In the subsequent processing, failover is performed only for the logical partition identified as affected. When the influence range search function finishes, the process proceeds to step 1006. In step 1006, shutdown of the logical partition identified in step 1005 is requested. In step 1011, the requested logical partition is shut down. When the shutdown ends, the process moves to step 1007. In step 1007, the server search function is activated to identify a spare server to switch to. When the server search function finishes, the process moves to step 1008. In step 1008, the logical partition management function is activated, and a logical partition equivalent to the one on the failed server is configured on the spare server identified in step 1007. When the logical partition management function finishes, the process moves to step 1009.
In step 1009, the disk mapping change function is activated to change the disk mapping from the logical partition of the server where the failure occurred to the logical partition of the spare server. When the disk mapping change function finishes, the process proceeds to step 101A. In step 101A, activation of the logical partition configured on the spare server in step 1008 is requested.
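The flow of FIG. 10 can be sketched as a small driver that calls the other functions in order. In this Python sketch every callable passed in is a hypothetical stand-in for the corresponding function of the failover mechanism, and looping over several affected partitions is an assumption of the example; the step numbers in the comments come from the description above.

```python
def handle_failure(failure, *, should_fail_over, find_affected, stop_lpar,
                   find_spare, build_lpar, remap_disk, start_lpar):
    """Run the failover flow for one failure notification and return what was moved."""
    if not should_fail_over(failure["type"]):   # steps 1003-1004
        return []
    moved = []
    for lpar in find_affected(failure):         # step 1005
        stop_lpar(lpar)                         # steps 1006 and 1011
        spare = find_spare(lpar)                # step 1007
        new_lpar = build_lpar(spare, lpar)      # step 1008
        remap_disk(lpar, new_lpar)              # step 1009
        start_lpar(new_lpar)                    # step 101A
        moved.append((lpar, new_lpar))
    return moved


log = []
result = handle_failure(
    {"type": "cpu_error", "server": "S1", "part": "CPU2"},
    should_fail_over=lambda kind: kind == "cpu_error",
    find_affected=lambda failure: ["LPAR2"],
    stop_lpar=lambda lpar: log.append(("stop", lpar)),
    find_spare=lambda lpar: "S3",
    build_lpar=lambda spare, lpar: f"{spare}:{lpar}",
    remap_disk=lambda old, new: log.append(("remap", old, new)),
    start_lpar=lambda lpar: log.append(("start", lpar)),
)
```

The sketch does not handle the case in which no spare server is found; a fuller version would need to decide what to do then.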

FIG. 11 shows a processing flow of the influence range search function 403. Step 1101 determines whether the logical partition mechanism is affected by the failure. If it is, the process moves to step 1102; if not, it moves to step 1103. In step 1102, all logical partitions operating on the server are identified as logical partitions affected by the failure. In step 1103, information on the logical partitions of the server is acquired from the logical partition information table, extracted by the server identifier of the server. In step 1104, the logical partition affected by the failure is identified. The logical partition information acquired in step 1103 records which H/W each logical partition configured in the server uses. Combining this information with the failed-part information notified from the server identifies the logical partition that uses the failed H/W, that is, the logical partition affected by the failure. If several logical partitions use the same H/W, or if the failed part relates to the entire server, such as a chassis fan or chassis power supply unit, multiple logical partitions are identified.
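The influence range search can be sketched in a few lines of Python. The table layout (a list of dictionaries with a set of device names), the part names for server-wide components, and the `mechanism_failed` flag are all assumptions made for this example. The sketch also reads the server's rows first for convenience, whereas the flow above checks the logical partition mechanism before reading the table.

```python
SERVER_WIDE_PARTS = {"chassis_fan", "chassis_psu"}  # assumed names for server-wide parts


def find_affected_lpars(server_id, failed_part, lpar_table, mechanism_failed=False):
    """Return the identifiers of the logical partitions affected by a failure."""
    # Step 1103: keep only the rows that belong to the failed server.
    rows = [row for row in lpar_table if row["server"] == server_id]
    # Steps 1101-1102: a failed partition mechanism (or a server-wide part)
    # affects every logical partition on the server.
    if mechanism_failed or failed_part in SERVER_WIDE_PARTS:
        return [row["lpar"] for row in rows]
    # Step 1104: otherwise only partitions that use the failed device are affected.
    return [row["lpar"] for row in rows if failed_part in row["devices"]]


lpar_table = [
    {"server": "S1", "lpar": "LPAR1", "devices": {"CPU1", "Mem1", "MAC1", "WWN1"}},
    {"server": "S1", "lpar": "LPAR2", "devices": {"CPU2", "Mem2", "MAC2", "WWN2"}},
    {"server": "S2", "lpar": "LPAR3", "devices": {"CPU1", "CPU2"}},
]
```

Because device identifiers such as slot numbers are only unique within a server, the filter on the server identifier has to come before the device match.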

FIG. 12 shows the processing flow of the server search function 406. Step 1201 acquires server information from the server management table, extracted by the server identifier of the server. Step 1202 acquires the H/W configuration information of the logical partition to be failed over from the logical partition information table, extracted by the server identifier of the server and the logical partition identifier of the logical partition to be failed over. That logical partition identifier is the one identified by the influence range search function. Step 1203 searches for a server on which a logical partition with the same configuration as the logical partition to be failed over can be configured. Here, based on the information in the server management table and in the logical partition information table, the available unused devices of each server are obtained, that is, the devices that are not used by any logical partition and have not failed. This information is compared with the H/W configuration information of the logical partition to be failed over, and a server with available unused devices equivalent to that H/W configuration is searched for. Here, the H/W configuration refers to the number of CPUs, the memory capacity, and the I/O devices. Step 1204 determines whether a server was found in step 1203. If no server was found, the process moves to step 1205. In step 1205, a server on which a logical partition equivalent to the logical partition to be failed over can be configured is searched for, this time using H/W already used by another logical partition. In step 1206, the server found in step 1203 or step 1205 is designated as the spare server. If a plurality of servers are found in step 1203, one of them is selected. The selection can follow rules set in advance by the user.
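
The search of FIG. 12 can be sketched as follows. The Resources type, the function name search_spare_server, and the reading of "equivalent" as "at least as much of each resource" are assumptions for illustration only; the specification does not define how equivalence is tested.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpus: int
    memory_gb: int
    io_devices: int

    def covers(self, need):
        # Assumption: "equivalent" is modeled as having at least what is needed.
        return (self.cpus >= need.cpus
                and self.memory_gb >= need.memory_gb
                and self.io_devices >= need.io_devices)


def search_spare_server(free_by_server, total_by_server, need):
    """free_by_server: unused, non-failed devices per server.
    total_by_server: non-failed devices per server, including assigned ones."""
    # Step 1203: servers whose unused devices cover the requirement.
    candidates = [s for s, free in free_by_server.items() if free.covers(need)]
    if not candidates:
        # Step 1205: fall back to servers that can host the partition only by
        # using H/W already used by other logical partitions.
        candidates = [s for s, total in total_by_server.items() if total.covers(need)]
    # Step 1206: designate one server; "first found" stands in for a user rule.
    return candidates[0] if candidates else None
```

Taking the first candidate is only a placeholder for whatever selection rule the user has preset.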

For example, a server with no failed H/W may be given priority so that the business runs on a more reliable server; a server that is not running other business may be given priority so that processing is distributed and efficiency improves; or a server that is already running may be given priority so that power consumption is reduced. In addition, when multiple logical partitions are affected by a failure and there are not enough spare servers to fail over all of them, rules can likewise be set in advance by the user. For example, priorities may be assigned to logical partitions so that a logical partition with a higher priority is failed over first, or the combination that allows the most logical partitions to be failed over may be selected. When an SMP is configured across blades, multiple blades appear as one blade; in that case, it is also conceivable to select a configuration that raises processing speed by consolidating the separated logical partitions onto the same blade when a failure occurs.

When priorities are set for logical partitions, a column indicating the priority is added to the logical partition information table 402, and the user sets the priority of each logical partition when the system is constructed.
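
As a minimal sketch of the priority rule, the following function picks which affected logical partitions to fail over when fewer spare slots are available than affected partitions. The function name, the dictionary layout, and the convention that a larger number means a higher priority are assumptions; the specification only states that a priority column is added to the table.

```python
def choose_partitions_to_fail_over(priorities, spare_slots):
    """priorities: partition identifier -> priority (larger = more important,
    an assumed convention). spare_slots: how many partitions can be hosted."""
    # Rank the affected partitions from highest to lowest priority.
    ranked = sorted(priorities, key=lambda p: priorities[p], reverse=True)
    # Fail over only as many partitions as the spare capacity allows.
    return ranked[:spare_slots]
```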

FIG. 13 shows the processing flow of the logical partition management mechanism 407. Step 1301 determines whether the spare server is running. Here, the spare server is the server selected in step 1205 of the server search function. If the spare server is running, the process proceeds to step 1302; if it is not, the process proceeds to step 1311. In step 1311, the spare server is started so that its logical partition mechanism becomes available. The spare server can be started through its BMC. In step 1302, the logical partition mechanism of the spare server is requested to create a logical partition equivalent to the logical partition to be failed over. In step 1321, the logical partition mechanism of the spare server creates a logical partition with the requested configuration.
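
The steps of FIG. 13 can be sketched with a stand-in object for the spare server. The class SpareServer, its methods, and the dictionary used as a partition configuration are hypothetical; no real BMC or hypervisor interface is being modeled.

```python
class SpareServer:
    """Stand-in for a spare server reached through its BMC (hypothetical)."""

    def __init__(self, powered_on=False):
        self.powered_on = powered_on
        self.partitions = []

    def power_on_via_bmc(self):
        self.powered_on = True

    def create_partition(self, config):
        # Step 1321: the spare server's logical partition mechanism creates
        # a partition with the requested configuration.
        self.partitions.append(dict(config))
        return len(self.partitions) - 1


def prepare_partition(spare, config):
    # Step 1301: is the spare server already running?
    if not spare.powered_on:
        # Step 1311: start it so its logical partition mechanism is available.
        spare.power_on_via_bmc()
    # Step 1302: request a partition equivalent to the one being failed over.
    return spare.create_partition(config)
```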

The second embodiment of the present invention shows an example in which failover is performed on a trigger other than the occurrence of a failure. Such triggers include an instruction from a user such as an administrator or maintenance staff, and an instruction from a program according to a schedule. One purpose is maintenance, such as a server firmware update.

FIG. 22 shows the management server console in the second embodiment. The management server console 2201 is an interface for performing failover according to an instruction from a user or the like. The management server console 2201 has an interface 2211 for selecting or inputting a server identifier, an interface 2212 for selecting or inputting a logical partition identifier, an interface 2213 for selecting or inputting hardware, an interface 2214 for instructing execution of failover, and an interface 2215 for notifying completion of failover. The failover target is specified by the interface 2211, the interface 2212, and the interface 2213. For example, when maintaining specific hardware, it is conceivable to specify a server identifier and the hardware to be maintained, and to fail over the logical partitions that use the hardware. Further, when failing over a specific logical partition, a server identifier and a logical partition identifier may be designated, and when failing over a server, only the server identifier may be designated.
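
The three ways of specifying a failover target can be sketched as one lookup. The function name resolve_targets and the tuple layout of the table are assumptions for illustration.

```python
def resolve_targets(table, server_id, partition_id=None, hardware=None):
    """table: list of (server_id, partition_id, set_of_hardware) tuples.
    Returns the identifiers of the logical partitions to fail over."""
    # Server identifier only: every partition on that server is a target.
    rows = [r for r in table if r[0] == server_id]
    if partition_id is not None:
        # Server and logical partition identifier: one specific partition.
        rows = [r for r in rows if r[1] == partition_id]
    if hardware is not None:
        # Server and hardware: the partitions that use that hardware.
        rows = [r for r in rows if hardware in r[2]]
    return [r[1] for r in rows]
```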

As another usage, after a failure occurs and the logical partitions affected by the failure are failed over as described in the first embodiment, the logical partitions that remain in operation on the server may then be moved to the spare server so that the server can be maintained. In this case, the user is provided with an interface that shows which logical partitions remain in operation and allows the user to specify the logical partitions to be failed over.

Execution of failover is instructed through the interface 2214. When execution is instructed through the interface 2214, a pseudo failure is generated and failover is executed. When the failover is completed, the completion is reported on the interface 2215. The notification may include the logical partitions that were failed over, the success or failure of the failover, and the like. The interface for instructing the execution of failover is not limited to a GUI (Graphical User Interface) as shown in FIG. 22, and may be another type of interface such as a CLI (Command Line Interface).

FIG. 23 illustrates an example of a management server console that displays the server configuration in the second embodiment. The management server console 2301 is an interface for checking the server configuration. The management server console 2301 has an interface 2302 for selecting or inputting a server identifier, and an interface 2303 for displaying the configuration of the selected server. As the server configuration, the interface 2303 displays the logical partition identifier in a column 2331, the operation state of the logical partition in a column 2332, and the I/O devices in a column 2333. This information can be used as a reference when executing failover based on instructions from the user.
FIG. 14 is an operation sequence according to the second embodiment. The difference from the first embodiment is that the trigger for executing the failover is not the occurrence of a failure but a pseudo failure occurrence instruction 1400 by the user or the like. In the second embodiment, the failover is performed in response to the designation of the failed part in step 1400.

FIG. 15 shows the operation flow of the server management function in the second embodiment. The difference from the first embodiment is that the flow starts on reception of the pseudo failure occurrence instruction in step 1501 and then generates a pseudo failure. In step 1502, the pseudo failure is generated. Information such as the part in which the pseudo failure is to be generated is taken from the pseudo failure occurrence instruction received in step 1501.

As described in the first embodiment, when the logical partitions affected by a failure have been moved to a spare server and other logical partitions remain, the user can specify the remaining logical partitions through the interface so that another server takes them over and the failed server can be maintained. Likewise, when a program is to be updated or specific hardware is to be replaced, specifying the hardware identifies the logical partitions that use it. The present embodiment thus has the effect that such logical partitions can be moved to another server and hardware maintenance can be performed.

The third embodiment of the present invention shows an example in which, after the logical partitions affected by a failure are failed over, the other logical partitions that are not affected by the failure are also moved to the spare server. Where possible, these other logical partitions are moved to the spare server by relocation during operation, so that their business keeps running. As a result, all logical partitions of the server can be moved to the spare server without stopping the business of the logical partitions not affected by the failure, which makes it possible, for example, to power down the server and perform hardware maintenance safely. Relocation during operation refers to moving a logical partition to another physical server without stopping the programs of the logical partition.

FIG. 16 shows an operation sequence in the third embodiment. The difference from the first embodiment is that step 162B is added. In step 162B, another logical partition that is not affected by the failure is moved to the spare server by relocation during operation.

FIG. 17 shows a logical partition information table in the third embodiment. The difference from the first embodiment is that a column 1709 is added. A column 1709 indicates whether the logical partition mechanism can perform relocation during operation.

FIG. 18 shows an operation flow of the server management function in step 162B of FIG. Step 1801 acquires server logical partition information from the logical partition information table.
In step 1802, it is determined whether there is a logical partition that is not affected by the failure, based on the information in the logical partition information table acquired in step 1801. If there is no logical partition that is not affected by the failure, the process moves to step 1811. If there is one, the process proceeds to step 1803. Step 1803 determines whether the state of the other logical partition that is not affected by the failure is normal. If the agent program operating in the logical partition reports an abnormal state, or if communication with the agent program is not possible, it is determined that relocation during operation is not possible, and the process moves to step 1811. If the state is normal, the process proceeds to step 1804. Step 1804 activates the server search function to search for a spare server. The operation flow of the server search function is the same as in the first embodiment. Step 1805 determines whether the logical partition mechanisms of the failed server and the spare server both support relocation during operation. If they do not, the process proceeds to step 1811. If they do, the process proceeds to step 1806. In step 1806, the logical partition management function is activated, relocation during operation is executed, and the process ends. Step 1811 handles the case where relocation during operation cannot be performed. In this case, the administrator or the like is notified that the logical partition not affected by the failure remains in operation on the server. Possible notification methods include sending an e-mail to a preset notification destination and displaying a message on the management server console.
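
The branching of FIG. 18 can be condensed into a small decision function. The function name decide_relocation, its parameters, and the convention of returning the number of the terminating step are assumptions for illustration; the search for a spare server in step 1804 is left out and only its result is assumed to be available.

```python
def decide_relocation(unaffected, agent_healthy, source_supports, spare_supports):
    """Return the step in which the FIG. 18 flow ends:
    1806 (execute relocation during operation) or 1811 (notify the administrator).

    unaffected: identifiers of partitions not affected by the failure.
    agent_healthy: partition identifier -> True if its agent reports normal;
    a missing entry models an agent that cannot be reached.
    """
    # Step 1802: is there any partition that is not affected by the failure?
    if not unaffected:
        return 1811
    # Step 1803: every such partition must be in a normal state.
    if not all(agent_healthy.get(p, False) for p in unaffected):
        return 1811
    # Step 1805: both logical partition mechanisms must support relocation.
    if not (source_supports and spare_supports):
        return 1811
    # Step 1806: relocation during operation can be executed.
    return 1806
```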

FIG. 19 shows the operation flow of the logical partition management function in the third embodiment. Step 1901 determines whether the spare server is running. If the spare server is not running, it is started in step 1902 so that its logical partition mechanism becomes available. In step 1903, the logical partition mechanism of the spare server is requested to create a logical partition equivalent to the logical partition to be failed over, and in step 1921 the logical partition is created. Step 1904 requests the logical partition mechanisms of the failed server and the spare server to execute relocation during operation, and in step 1922 the relocation is performed. As a result, the business running in the logical partition that is not affected by the failure is moved to the spare server without being stopped.

As an effect of this embodiment, a logical partition that cannot continue because of a failure is failed over to a spare server by taking over its boot disk, while the other logical partitions that are not affected by the failure are moved to the spare server by a technique that moves a logical partition to another physical server without stopping its programs. All logical partitions of the failed server can thus be moved to the spare server without stopping any logical partition that is not affected by the failure. As a result, the failed server can be stopped and maintenance work can be performed safely. These effects realize highly available failover when a running server fails.

The fourth embodiment of the present invention shows an example in which a logical partition associated with a logical partition affected by a failure also fails over.
FIG. 20 shows a logical partition information table in the fourth embodiment. The difference from the first embodiment is that a column 2009, a column 2091, and a column 2092 are added. The column 2009 stores information on related logical partitions. The column 2091 stores the server identifiers of the related logical partitions. The column 2092 stores the logical partition identifiers of the related logical partitions. Information on related logical partitions is registered in advance by the user. In this embodiment, the information on related logical partitions is represented by the server identifier and the logical partition identifier. As another means of realization, an identifier may be assigned to each set of related logical partitions.

FIG. 21 shows the operation flow of the influence range search function in the fourth embodiment. The difference from the first embodiment is that a step 2105 is added. Step 2105 identifies, from the information in the column 2009 of the logical partition information table, the logical partitions related to a logical partition affected by the failure, and treats those related logical partitions as logical partitions affected by the failure as well.
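
Step 2105 can be sketched as a one-hop expansion of the affected set. The function name add_related_partitions and the dictionary standing in for the column 2009 are assumptions; whether relations should be followed transitively is not stated in the text, so this sketch follows them once only.

```python
def add_related_partitions(affected, related):
    """affected: set of (server_id, partition_id) pairs affected by the failure.
    related: (server_id, partition_id) -> list of related pairs, standing in
    for the columns 2091 and 2092 of the logical partition information table."""
    result = set(affected)
    for key in affected:
        # Step 2105: partitions related to an affected partition are treated
        # as affected as well (one hop only in this sketch).
        result.update(related.get(key, []))
    return result
```

Because the related entries carry a server identifier, a related logical partition on a different server is pulled into the failover set in the same way.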

In this embodiment, the logical partition is described, but the logical partition may be replaced with a virtual server (computer).

101 Management server
102 Server
103 Disk array device
110 Failover mechanism
120 Logical partition mechanism
121 HBA
122 NIC
123 BMC
130 Disk mapping mechanism
131 Boot disk

Claims (17)

  1. A plurality of servers are connected to an external disk device on a network, one or more logical partitions are constructed in the server, and a management server is connected to the server via the network,
    A failover method that targets a computer system that boots an operating system from the external disk device, and takes over the work in the computer system to another server when a failure occurs in an operating server,
    The management server
    Storing correspondence information between the logical partition and hardware information including an identifier of an I / O device used by the logical partition;
    A console of the management server has an interface for generating a pseudo failure in a specific server, and receiving the pseudo failure from the interface;
    Identifying the site of the simulated fault;
    Identifying a logical partition affected by the simulated fault from the simulated fault site and the correspondence information in response to a simulated fault occurrence of the server;
    Stopping the logical partition;
    Searching for a spare server to be taken over from within the computer system;
    Starting a spare server identified in the search;
    Constructing a logical partition affected by the simulated failure in the activated spare server;
    A logical partition constructed in the spare server by changing the identifier of the I / O device of the constructed logical partition to the identifier of the I / O device of the logical partition affected by the pseudo failure based on the correspondence information; Building a correspondence with the external disk device;
    And a step of activating the constructed logical partition.
  2. The failover method according to claim 1, wherein the step of identifying a logical partition affected by the simulated fault detects whether the simulated fault affects a logical partition mechanism and, if it affects the logical partition mechanism, treats all logical partitions as logical partitions affected by the simulated fault.
  3. The step of searching for the spare server includes the case where there are two or more logical partitions affected by the simulated failure, and there is no server capable of constructing all the logical partitions affected by the simulated failure. 2. The failover method according to claim 1, wherein a logical partition having a high priority among logical partitions affected by a pseudo failure is a logical partition to be failed over.
  4. The step of searching for the spare server is characterized in that if there is no server capable of constructing a logical partition affected by the simulated fault, a server already assigned to another logical partition is set as the spare server. The failover method according to claim 1.
  5. After failing over a logical partition affected by a failure, an instruction of information for failing over a logical partition not affected by a failure is received from the interface, the designated logical partition is taken over to a spare server, and the interface is taken over after taking over. The failover method according to claim 1, further comprising notifying that the handover has been completed.
  6. The step of identifying the logical partition affected by the simulated failure is characterized in that, when there is another logical partition related to the logical partition, the other logical partition is a logical partition affected by the simulated failure. The failover method according to claim 1.
  7. When there is another logical partition except the logical partition affected by the simulated failure in the server in which the simulated failure occurs, and the logical partition mechanism of the server has a function of relocating the logical partition during operation, 2. The failover method according to claim 1, further comprising the step of relocating another logical partition to the spare server during operation.
  8. The failover method according to claim 1, wherein, when another logical partition other than the logical partition affected by the pseudo failure exists in the server in which the pseudo failure has occurred, and the logical partition mechanism of that server does not have a function of relocating a logical partition during operation, a notification is issued that the other logical partition is operating on the server.
  9. The failover method according to claim 1, wherein the step of stopping the logical partition is executed after the dump processing of the logical partition corresponding to the pseudo-failure site is completed.
  10. A plurality of computers are connected to a disk device via a network, one or more logical partitions are constructed in the computer, a management computer is connected to the computer via the network, and an operating system is booted from an external disk device The management computer is equipped with a failover mechanism that takes over work to another computer in the event of a failure of a computer that is operating.
    The management computer has an interface that generates a pseudo failure in a specific computer ,
    Means for accepting a pseudo fault from the interface;
    The failover mechanism is
    Storing correspondence information between the logical partition and hardware information including an identifier of an I / O device used by the logical partition;
    A processing unit for identifying a logical partition affected by the simulated fault from the simulated fault site;
    A processing unit for stopping the logical partition;
    A processing unit for searching for a spare computer to be taken over from within the computer system;
    A processing unit for starting the spare computer identified in the search;
    A processing unit for constructing a logical partition affected by the simulated fault in the activated spare computer;
    Based on the correspondence information with the hardware information, the identifier of the I / O device of the constructed logical partition is changed to the identifier of the I / O device of the logical partition affected by the pseudo failure, and the spare computer is changed. A processing unit for constructing a correspondence relationship between the constructed logical partition and the external disk device for booting the operating system;
    And a processing unit that activates the constructed logical partition.
  11. The computer system according to claim 10, wherein, after the logical partition affected by the failure is failed over, an instruction of information for failing over a logical partition not affected by the failure is received from the interface, the designated logical partition is taken over by the spare computer, and after the takeover the interface is notified that the takeover is completed.
  12. Information indicating whether the computer can be relocated during operation is provided in the failover mechanism, and if both the computer with the simulated failure and the spare computer can support the operation relocation, the computer is affected by the simulated failure. The computer system according to claim 10, further comprising a processing unit that executes a process for taking over a logical partition that is not present to another computer.
  13. The computer system according to claim 10, wherein a logical partition related to the logical partition affected by the pseudo failure is defined in advance, and a processing unit is provided for failing over the related logical partition to a spare computer together with the logical partition affected by the failure.
  14. When there are a plurality of logical partitions affected by the simulated fault, the processing unit that identifies the logical partition affected by the simulated fault from the simulated faulty portion stores information indicating the logical partition having a high priority in the logical partition information storage. The computer system according to claim 10, further comprising a process that prioritizes failover of any logical partition based on the information provided in the means.
  15. A plurality of computers are connected to a disk device via a network, one or more logical partitions are constructed in the computer, and a management computer is connected to the computer via the network. Target computer systems that boot
    A failover program for causing the management computer having a failover mechanism to take over work to another computer in the event of a failure of a computer operating the business, to execute the following functions,
    Having an interface that generates a simulated fault and receiving a simulated fault from the interface;
    Storing correspondence information between the logical partition and hardware information including an identifier of an I / O device used by the logical partition;
    A function of specifying, from the site of the accepted pseudo fault, a logical partition affected by the pseudo fault;
    A function of stopping the logical partition;
    A function of searching for a spare computer to be taken over from within the computer system;
    A function of starting the spare computer identified in the search;
    A function of constructing a logical partition affected by the simulated fault in the activated spare computer;
    Based on the correspondence information with the hardware information, the identifier of the I / O device of the constructed logical partition is changed to the identifier of the I / O device of the logical partition affected by the pseudo failure, and the spare computer is changed. A function for constructing a correspondence relationship between the constructed logical partition and the external disk device for booting the operating system;
    A failover execution program that executes a function to activate the constructed logical partition.
  16. The failover execution program according to claim 15, wherein the function of specifying a logical partition affected by the simulated fault detects whether the simulated fault affects a logical partition mechanism and, if it affects the logical partition mechanism, treats all logical partitions as logical partitions affected by the simulated fault.
  17. The failover execution program according to claim 16, wherein the function of stopping the logical partition is executed after the dump processing of the logical partition corresponding to the pseudo failure part is completed.
JP2011184262A 2011-08-26 2011-08-26 Failover method and its computer system. Active JP5321658B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011184262A JP5321658B2 (en) 2011-08-26 2011-08-26 Failover method and its computer system.

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
JP2006326446 Division 2006-12-04

Publications (2)

Publication Number Publication Date
JP2011258233A JP2011258233A (en) 2011-12-22
JP5321658B2 true JP5321658B2 (en) 2013-10-23

Family

ID=45474252

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011184262A Active JP5321658B2 (en) 2011-08-26 2011-08-26 Failover method and its computer system.

Country Status (1)

Country Link
JP (1) JP5321658B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015114816A1 (en) 2014-01-31 2015-08-06 株式会社日立製作所 Management computer, and management program

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000057108A (en) * 1998-08-12 2000-02-25 Fujitsu Ltd Switching test method for duplex computer system, monitoring device for it, and computer readable recording medium
US6594784B1 (en) * 1999-11-17 2003-07-15 International Business Machines Corporation Method and system for transparent time-based selective software rejuvenation
JP4119162B2 (en) * 2002-05-15 2008-07-16 株式会社日立製作所 Multiplexed computer system, logical computer allocation method, and logical computer allocation program
JP4054616B2 (en) * 2002-06-27 2008-02-27 株式会社日立製作所 Logical computer system, logical computer system configuration control method, and logical computer system configuration control program
JP4448719B2 (en) * 2004-03-19 2010-04-14 株式会社日立製作所 Storage system
JP4462024B2 (en) * 2004-12-09 2010-05-12 株式会社日立製作所 Failover method by disk takeover
JP2006178801A (en) * 2004-12-24 2006-07-06 Hitachi Ltd System changeover test system for duplex system

Also Published As

Publication number Publication date
JP2011258233A (en) 2011-12-22

Legal Events

Date Code Title Description
A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20130219

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20130401

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130618

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130701