EP2518627B1 - Method for processing a partial fault in a computer system - Google Patents

Method for processing a partial fault in a computer system

Info

Publication number
EP2518627B1
EP2518627B1 (application EP12165177A / EP20120165177)
Authority
EP
European Patent Office
Prior art keywords
lpar
fault
failover
hypervisor
notice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Not-in-force
Application number
EP20120165177
Other languages
German (de)
English (en)
Other versions
EP2518627B8 (fr)
EP2518627A2 (fr)
EP2518627A3 (fr)
Inventor
Tomoki Sekiguchi
Hitoshi Ueno
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd
Publication of EP2518627A2
Publication of EP2518627A3
Application granted
Publication of EP2518627B1
Publication of EP2518627B8
Legal status: Not-in-force
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 - Error or fault processing where the processing takes place on a specific hardware platform or in a specific software environment
    • G06F 11/0712 - Error or fault processing in a virtual computing platform, e.g. logically partitioned systems
    • G06F 11/0793 - Remedial or corrective actions
    • G06F 11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/202 - Error detection or correction using active fault-masking where processing functionality is redundant
    • G06F 11/2023 - Failover techniques
    • G06F 11/2028 - Failover techniques eliminating a faulty processor or activating a spare
    • G06F 11/2038 - Failover techniques with a single idle spare processing component

Definitions

  • The present invention relates to a processing method for a partial fault in a computer system in which a plurality of LPARs (Logical PARtitions) are running by means of logical partitioning.
  • Each computer has various hardware fault detection mechanisms.
  • The computer detects an abnormality in one of its components and notifies software such as an OS or a hypervisor of the fault by means of an interrupt.
  • An interrupt which notifies of a fault is called a machine check interrupt.
  • Depending on the contents of the fault reported by the machine check, the OS or the hypervisor can stop either the whole computer or only the part related to the fault which has occurred.
  • A computer which supports logical partitioning notifies only the LPAR affected by a hardware fault of the machine check. Only the LPAR notified of the machine check stops execution; LPARs which do not use the component in which the fault has occurred can continue to execute.
  • US 7,134,052 B2 (Bailey et al.) discloses a method for identifying, at run time, the LPAR related to a fault which has occurred in a device of a computer and transmitting a machine check only to that LPAR. In principle, similar fault processing is possible in virtualization as well.
  • As a technique for constructing a computer system in which data loss and processing interruption are not allowed, there is the cluster technique.
  • A backup computer is prepared in case a computer stops due to a fault.
  • The primary computer is called the primary node and the backup computer is called the backup node.
  • These kinds of control are executed by software called cluster management software, which runs in both the primary node and the backup node.
  • A highly reliable system can be configured by combining the hardware fault processing of logical partitioning with the cluster configuration.
  • The cluster management software which runs in the LPAR affected by a hardware fault executes failover and causes a backup-node LPAR, on standby in another computer, to continue the data processing that was being executed in the affected LPAR.
  • LPARs which are not affected by the fault continue to execute data processing as before.
  • Hardware in which a fault has occurred needs to be replaced sooner or later.
  • An application, virtual computer, or LPAR executing as the primary node on a computer containing defective hardware is failed over manually to the backup-node computer in the cluster; the computer which executed the primary-node virtual computer or LPAR is then stopped and its hardware is replaced.
  • An operator performing maintenance determines by some means whether the faulty node can be stopped and whether the faulty node is no longer executing data processing, and then performs the operation to stop the faulty node.
  • Document US 2007/0011495 discloses a server cluster management method for a computing environment, in which a standby logical partition in a processing complex is activated to operate in active mode in response to failure detection in another processing complex of the server cluster.
  • The method involves operating a logical partition in a processing complex of a server cluster in an active mode and operating another logical partition in the same processing complex in a standby mode.
  • A failure is detected in another processing complex of the server cluster, and the standby logical partition is activated to operate in the active mode in response to the failure detection.
  • Partition resources are transferred from the former logical partition to the latter logical partition.
  • Each clustered LPAR affected by a hardware fault fails over at the time of the fault, so primary-node LPARs and backup-node LPARs come to coexist in one physical computer.
  • The hardware fault processing described hereafter is conducted for a virtual computer system having a plurality of LPARs generated, under control of hypervisors, on physical computers constituting clusters.
  • A first hypervisor in the first physical computer decides, for the LPARs generated on the first physical computer, whether each LPAR can continue execution. If there is an LPAR which cannot continue execution, the first hypervisor stops that first LPAR, and a cluster control unit in a second LPAR, which constitutes a cluster with the first LPAR and is generated on the second physical computer, conducts a first failover to move the application of the first LPAR to the second LPAR.
  • A cluster control unit in a fourth LPAR, which constitutes a cluster with a third LPAR capable of continuing execution and is generated on the second physical computer, conducts a second failover to move the application of the third LPAR to the fourth LPAR.
  • A cluster control unit in the third LPAR may set the stop possibility for the third LPAR in the fault notice information held by the first hypervisor to "possible" after the second failover.
  • The hypervisors have fault notice information, which manages, for each LPAR, whether a fault notice has been requested and whether the LPAR can be stopped after failover, as regards a hardware fault which does not affect execution of the LPARs.
  • Upon occurrence of a hardware fault, the hypervisor refers to the fault notice information, transmits a fault notice to each LPAR which requests notice of a hardware fault, and decides whether there is an LPAR which can continue execution among the plurality of LPARs.
  • The hypervisor stops a first LPAR which cannot continue execution, and the first LPAR is failed over to a second LPAR which constitutes a cluster with the first LPAR.
  • A third LPAR which can continue execution is failed over to a fourth LPAR which constitutes a cluster with the third LPAR.
  • According to the present invention, it is recorded in the fault notice information whether an LPAR can be stopped after the failover. Therefore, an operator can easily decide, at the time of maintenance after occurrence of a partial hardware fault, whether a computer is in a state in which maintenance work can be executed.
  • A hypervisor provides an interface through which an LPAR which has received a notice of a hardware fault allowing continuation of execution notifies the hypervisor that it has executed fault processing in response to the notice.
  • The hypervisor retains the notification status of the fault processing of the LPAR.
  • The hypervisor also provides an interface for acquiring the notification status. It thereby becomes possible to register and acquire the fault-processing status through these interfaces and to assess how the computer as a whole is coping with a fault.
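
As an illustration only, the following Python sketch shows the kind of guest-visible interface such a hypervisor could expose for registering a fault notice request, reporting completion of fault processing, and acquiring the notification status. The class and method names are assumptions made for the example, not an interface defined by the patent.

```python
class HypervisorFaultInterface:
    """Guest-visible fault notice interface; all names here are illustrative
    assumptions, not an API defined by the patent."""

    def __init__(self):
        # One record per LPAR: notice requested?, fault notified?, stop possible?
        self._fault_notice = {}

    def register_fault_notice_request(self, lpar_name):
        # Called by an LPAR (e.g. its cluster control unit) that wants to be told
        # about hardware faults which do not stop its own execution.
        self._fault_notice[lpar_name] = {
            "notice_requested": True, "fault_notified": False, "stop_possible": False}

    def report_stop_possible(self, lpar_name, possible):
        # Called by an LPAR after it has finished fault processing (e.g. failover)
        # to record whether it may now be stopped for maintenance.
        entry = self._fault_notice.setdefault(
            lpar_name,
            {"notice_requested": False, "fault_notified": False, "stop_possible": False})
        entry["stop_possible"] = possible

    def notification_status(self, lpar_name):
        # Query interface used, for example, by a monitoring computer.
        return dict(self._fault_notice.get(lpar_name, {}))


hv = HypervisorFaultInterface()
hv.register_fault_notice_request("LPAR1")
hv.report_stop_possible("LPAR1", True)
print(hv.notification_status("LPAR1"))
```
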
  • FIG. 3 is a diagram showing a configuration of a computer system according to embodiment 1 of the present invention.
  • LPARs 210 and 310 executing in a computer 100 and a computer 200 constitute a cluster
  • LPARs 260 and 360 executing in the computer 100 and the computer 200 constitute another cluster.
  • the computer 100 is denoted by computer A
  • the computer 200 is denoted by computer B.
  • the LPAR 210 is LPAR1 in LPAR name
  • the LPAR 260 is LPAR2 in LPAR name.
  • the LPAR 310 is LPAR3 in LPAR name and the LPAR 360 is LPAR4 in LPAR name.
  • the LPAR 210 and the LPAR 260 constitute primary nodes in the clusters and the LPAR 310 and the LPAR 360 constitute backup nodes in the clusters.
  • The LPARs 210 and 260 are configured for execution by a hypervisor 250 (not illustrated in FIG. 3).
  • the hypervisor 250 is implemented as software which executes in a CPU in the computer 100 or hardware in the computer 100.
  • The hypervisor 250 is shown assigning logical NICs 392 and 393 to the LPAR 210 and assigning logical NICs 394 and 395.
  • Other LPARs are assigned logical NICs in the same way.
  • a software configuration will now be described by taking the LPAR 210 and the LPAR 310 included in the cluster as an example.
  • In the LPAR 210, an OS 230, a cluster control unit 220, and a data processing program 211 are executed. The same is true of the LPAR 310.
  • The cluster control unit 220 in the LPAR 210 and a cluster control unit 320 in the LPAR 310 monitor each other's operation status via a network 390. Under control of the cluster control units 220 and 320, the data processing program is executed in the LPAR which is running as the primary node. For example, if the LPAR 210 is the primary node, the data processing program 211 performs the actual processing.
  • If the cluster control unit 320 in the LPAR 310 detects an abnormality in the LPAR 210, it executes failover and starts execution of a data processing program 311 (thereafter, the LPAR 210 becomes the backup node).
  • The data processing programs 211 and 311 transmit and receive processing requests and results via a network 391.
  • Execution of the data processing program is also started in the backup node, but the backup node does not produce actual output.
  • In failover, the cluster control unit exercises control to cause the data processing program 311 to start the actual processing.
  • the LPAR 260 and the LPAR 360 also constitute a similar cluster. Furthermore, although not illustrated in FIG. 3 , resources such as a main memory, a CPU, and a storage device required to execute the data processing program are also subject to logical partitioning and assigned to respective LPARs.
  • FIG. 1 shows a structure of the computer 100 which constitutes clusters in embodiments of the present invention.
  • CPUs 101 to 104, a main memory 120, and an I/O bus management device 130 are connected to each other via a bus 110.
  • An input/output device 150 such as a display or a keyboard, HBAs (Host Bus Adapters) 161 to 163 for connection to an external storage device, and NICs (Network Interface Adapters) 171 to 173 for connection to a network are connected to an I/O bus 140, to which the I/O bus management device 130 is also connected.
  • The CPUs 101 to 104 read a program into the main memory 120, execute the program read into the main memory 120, and perform various kinds of processing. In the ensuing description, this is expressed by saying that the program or the processing executes.
  • Each of components in the computer 100 has an abnormality detection function.
  • the CPUs 101 to 104 can detect a failure of a part of an internal cache, a failure of an internal core, and a failure of an internal register. Upon detecting such an internal fault, the CPUs 101 to 104 generate a machine check interrupt and notify software of an abnormality.
  • the main memory 120, the I/O bus management device 130, the HBAs 161 to 163, and the NICs 171 to 173 also have a similar function.
  • the machine check interrupt is transmitted to any or all of the CPUs 101 to 104 via a device which manages the main memory 120. If the HBAs 161 to 163 and the NICs 171 to 173 detect an abnormality, the I/O bus management device 130 transmits the machine check interrupt.
  • FIG. 2 is a diagram showing a configuration of software in the computer in the embodiments of the present invention.
  • the configuration of software will now be described by taking the computer 100 as an example.
  • the computer 100 is executing the hypervisor 250.
  • The hypervisor 250 logically divides the resources of the computer 100 and runs the LPAR 210 and the LPAR 260. The divided resources are the CPUs 101 to 104, the main memory 120, the HBAs 161 to 163, and the NICs 171 to 173.
  • The LPAR 210 and the LPAR 260 each execute by utilizing the resources provided by the hypervisor 250.
  • The hypervisor 250 includes a machine check interrupt processing handler 251, which processes machine check interrupts transmitted from components of the computer 100, and a fault notice table 252.
  • a configuration of the fault notice table 252 is shown in FIG. 4 .
  • For each of the LPARs executed on the hypervisor 250, the table 252 retains an LPAR name 401, a fault notice request flag 402 indicating whether the LPAR is requesting notice of a hardware fault which does not affect it, a fault notice flag 403 indicating whether such a hardware fault has occurred in the past, and a stop possibility flag 404 indicating whether fault processing has been completed in the LPAR after a fault notice, so that the LPAR has reached a state in which it can be stopped.
  • The hypervisor 250 provides the LPAR 210 with an interface for setting the fault notice request flag 402 and the stop possibility flag 404, so that they can be set from the OS 230.
  • The hypervisor 250 assigns an entry for each LPAR in the table 252 and initially sets a value indicating "not requesting" into the fault notice request flag 402, a value indicating "not present" into the fault notice flag 403, and a value indicating "no" into the stop possibility flag 404. The contents of the table at this time are shown in 410 in FIG. 4.
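
The structure of the fault notice table 252 and its default values (state 410 in FIG. 4) can be sketched as follows; the Python class and field names are illustrative assumptions, while the three flags mirror 402, 403, and 404 described above.

```python
from dataclasses import dataclass, field

@dataclass
class FaultNoticeEntry:
    # One row of the fault notice table 252 (fields 401 to 404).
    lpar_name: str                  # 401: LPAR name
    notice_requested: bool = False  # 402: default "not requesting"
    fault_notified: bool = False    # 403: default "not present"
    stop_possible: bool = False     # 404: default "no"

@dataclass
class FaultNoticeTable:
    entries: dict = field(default_factory=dict)

    def add_lpar(self, lpar_name):
        # The hypervisor assigns an entry with the default values for a new LPAR.
        self.entries[lpar_name] = FaultNoticeEntry(lpar_name)

table = FaultNoticeTable()
for name in ("LPAR1", "LPAR2"):
    table.add_lpar(name)
# Corresponds to state 410 in FIG. 4: no request, no past fault, not stoppable.
print(table.entries["LPAR1"])
```
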
  • the OS 230 is executing.
  • the OS 230 includes a machine check interrupt processing handler 231 which processes a logical machine check interrupt transmitted by the hypervisor.
  • the OS 230 has an interface to notify a program executed in the OS 230 that a machine check interrupt is received.
  • the program can receive a notice that the machine check interrupt is received, via the interface.
  • the LPAR 210 constitutes a cluster with the LPAR 310.
  • the LPAR 210 is executing the OS 230.
  • the cluster control unit 220 is executing.
  • the cluster control unit 220 executes mutual monitoring and failover processing between the primary node and the backup node.
  • the cluster control unit 220 includes a fault notice acceptance unit 222 which accepts a notice that a hardware fault is occurring from the OS 230, a failover request table 223 for managing failover requests, a failover processing unit 224 which executes failover, a request monitoring unit 225 which schedules failover processing, and a cluster control interface 221 which provides the data processing program 211 which executes in the cluster with information such as the cluster state and a failover interface.
  • Through the interface provided by the hypervisor 250, the cluster control unit 220 sets the fault notice request flag 402 to "requesting" and the stop possibility flag 404 to "no," so that the LPAR will be notified of a hardware fault.
  • a state in the fault notice table 252 at a time point when a cluster control unit 270 in the LPAR 260 is also executing is shown in 420 in FIG. 4 .
  • The cluster control unit 320 in the LPAR 310 serving as the backup node sets the stop possibility flag 404 to "yes."
  • This setting of the stop possibility flag 404 indicates that the primary node, which is executing a data processing program, must not be stopped, whereas the backup node may be stopped.
  • States in the fault notice table 252 at a time point when a cluster control unit 370 in the LPAR 360 is also executing are shown in 430 in FIG. 4 .
  • a state in which the LPAR3 and the LPAR4 serve as the backup node is shown.
  • Each of the cluster control units 220 and 320 changes the stop possibility flag 404 in the fault notice table 252 from "yes" to "no" when the operation mode of the LPAR running under its control transitions from backup node to primary node.
  • Each of the cluster control units 220 and 320 changes the stop possibility flag 404 in the fault notice table 252 from "no" to "yes" when the operation mode transitions from primary node to backup node.
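
A minimal sketch of the role-transition handling described in the two items above, with the hypervisor-side table reduced to a plain dictionary; the function name and data layout are assumptions.

```python
def on_role_change(fault_notice_table, lpar_name, new_role):
    """Flip the stop possibility flag 404 when the LPAR's cluster role changes.
    `fault_notice_table` stands in for the hypervisor interface to table 252."""
    if new_role == "primary":
        # The primary node runs the data processing program: it must not be stopped.
        fault_notice_table[lpar_name]["stop_possible"] = False
    elif new_role == "backup":
        # The backup node is idle from the service's point of view: it may be stopped.
        fault_notice_table[lpar_name]["stop_possible"] = True
    else:
        raise ValueError(f"unknown role: {new_role}")

table = {"LPAR1": {"stop_possible": False}}
on_role_change(table, "LPAR1", "backup")   # e.g. after failing over to the other node
print(table["LPAR1"])                      # {'stop_possible': True}
```
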
  • FIG. 11 shows a structure of the failover request table 223 in the cluster control unit 220.
  • the failover request table 223 retains a value (a request flag 1110) indicating whether a request to execute failover is received and a value (an uncompleted request flag 1111) indicating whether there is an unprocessed failover request.
  • The data processing program 211 or another program can set these flags through the cluster control interface 221.
  • the failover processing unit 224 and the request monitoring unit 225 also operate these flags.
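
A sketch of the failover request table 223 with its two flags (1110 and 1111) and the operations the description attributes to the cluster control interface 221 and the failover processing unit; the class and method names are assumptions.

```python
class FailoverRequestTable:
    """Holds the request flag 1110 and the uncompleted request flag 1111."""

    def __init__(self):
        self.requesting = False   # 1110: a failover has been requested
        self.uncompleted = False  # 1111: a requested failover has not completed yet

    def post_request(self):
        # Set, for example, by the fault notice acceptance unit via interface 221.
        self.requesting = True
        self.uncompleted = True

    def mark_completed(self):
        # Cleared by the failover processing unit once the failover succeeds.
        self.requesting = False
        self.uncompleted = False

requests = FailoverRequestTable()
requests.post_request()
print(requests.requesting, requests.uncompleted)   # True True
```
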
  • the data processing program 211 is an application which executes on the cluster. If the LPAR 210 stops execution due to a fault when the LPAR 210 is the primary node, the processing of the program 211 is taken over by the LPAR 310 in the computer 200. At this time, the cluster control unit 320 exercises control to cause the LPAR 310 which has taken over the processing to become the primary node.
  • the other LPAR 260 also has a similar configuration (not illustrated).
  • A data processing program 261 may be a program which executes independently of the program 211 executing in the LPAR 210.
  • the LPARs 310 and 360 in the computer 200 also have similar configurations.
  • When starting a cluster, the cluster control unit in each of the LPARs which constitute the cluster registers with the OS to be notified of machine check interrupts indicating occurrence of a hardware fault. In the case of the LPAR 210, the cluster control unit 220 requests the OS 230 to notify it of machine check interrupts (this flow is not illustrated).
  • the cluster control unit 220 controls execution of the data processing program 211 which provides service and conducts mutual monitoring of computers which constitute the cluster, in the same way as the general cluster system.
  • the cluster control unit 220 conducts mutual communication with the cluster control unit 320 which executes in the LPAR 310 to monitor the operation situation. If the cluster control unit 320 on the backup node detects an abnormality in the primary node, then failover is executed by control of the cluster control unit 320 and the LPAR 310 becomes the primary node.
  • the cluster control unit 220 waits for a cluster control request from the OS 230 or the data processing program 211 as well.
  • FIG. 5 shows a processing flow of the machine check interrupt processing handler 251 in the hypervisor and the machine check interrupt processing handler 231 in the OS 230 at the time when a partial hardware fault has occurred.
  • a component which has caused the fault transmits a machine check interrupt to the CPUs 101 to 104.
  • A CPU which has caught the interrupt executes the machine check interrupt processing handler 251 in the hypervisor 250.
  • the machine check interrupt processing handler 251 identifies a fault reason on the basis of contents of the interrupt (step 501), and identifies an LPAR which becomes impossible to execute due to influence of the hardware fault (step 502).
  • The machine check interrupt processing handler 251 transmits an uncorrectable machine check, which indicates that continuation of execution is impossible, to the LPAR which cannot execute, and causes execution of that LPAR to be stopped (step 503).
  • the cluster control unit 220 changes the fault notice flag in the fault notice table 252 to "present” and the stop possibility flag in the fault notice table 252 to "yes.”
  • the uncorrectable machine check is transmitted from the hypervisor to the LPAR 260.
  • A machine check interrupt processing unit 281 in an OS 280, which executes in the LPAR 260, catches the interrupt transmitted from the machine check interrupt processing handler 251 in the hypervisor and stops execution of the OS 280.
  • the cluster control unit 370 in the LPAR 360 in the computer 200 which constitutes the cluster with the LPAR 260 detects execution stop of the OS 280 and executes failover. As a result, the LPAR 360 becomes the primary node and a data processing program 361 starts execution. At this time, the cluster control unit 370 has set the stop possibility flag 404 for the LPAR 360 (LPAR4) in the fault notice table 252 to "no" as described above. States of the fault notice table 252 in the computer 200 at this time are shown in 440 in FIG. 12 .
  • the machine check interrupt processing unit 251 refers to the fault notice request flag 402 in the fault notice table 252, sets the fault notice flag 403 for the LPAR which is requesting a fault notice to "present,” sets the stop possibility flag 404 to "no,” and transmits a machine check (correctable machine check) to notify that execution continuation is possible, but a hardware fault has occurred (step 504).
  • The LPAR 210 corresponding to the LPAR1 is requesting a notice, so the machine check interrupt processing handler 251 transmits the correctable machine check to the LPAR 210. Furthermore, before transmitting the machine check, the interrupt processing handler 251 sets the fault notice flag 403 corresponding to the LPAR 210 in the fault notice table 252 to "present" and sets the stop possibility flag 404 to "no." In the fault notice table 252 at this time, the fault notice flag 403 is "present" and the stop possibility flag 404 is "no" for the LPAR 210 (LPAR1). This state indicates that the LPAR 210 has received a hardware fault notice but the processing required to stop the LPAR 210 in connection with it is not yet completed. The state of the fault notice table 252 in the computer 100 at this time is shown in 450 in FIG. 12.
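
The flow of steps 501 to 504 can be summarized in the following sketch; the helper functions and data structures are hypothetical stand-ins for hypervisor internals, not an actual hypervisor API.

```python
# Hypothetical stand-ins for hypervisor internals (not a real API).
def identify_fault_reason(fault):                  # step 501
    return fault["component"]

def uses_failed_component(lpar, component):        # helper for step 502
    return component in lpar["assigned_components"]

def send_uncorrectable_machine_check(lpar):        # stops the LPAR (step 503)
    lpar["state"] = "stopped"

def send_correctable_machine_check(lpar, reason):  # notice only (step 504)
    lpar.setdefault("notices", []).append(reason)

def handle_machine_check(fault, lpars, fault_notice_table):
    """Sketch of the machine check interrupt processing handler 251 (steps 501-504)."""
    reason = identify_fault_reason(fault)
    affected = [l for l in lpars if uses_failed_component(l, reason)]   # step 502
    for lpar in affected:
        send_uncorrectable_machine_check(lpar)                          # step 503
    for lpar in lpars:
        entry = fault_notice_table[lpar["name"]]
        if lpar not in affected and entry["notice_requested"]:
            entry["fault_notified"] = True                              # step 504
            entry["stop_possible"] = False
            send_correctable_machine_check(lpar, reason)

lpars = [{"name": "LPAR1", "assigned_components": {"NIC171"}},
         {"name": "LPAR2", "assigned_components": {"HBA162"}}]
table = {"LPAR1": {"notice_requested": True, "fault_notified": False, "stop_possible": False},
         "LPAR2": {"notice_requested": True, "fault_notified": False, "stop_possible": False}}
handle_machine_check({"component": "HBA162"}, lpars, table)
print(lpars[1]["state"], table["LPAR1"])   # LPAR2 stopped; LPAR1 merely notified
```
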
  • Upon receiving a machine check interrupt, the machine check interrupt processing handler 231 in the OS 230 is started and executes the processing described hereafter.
  • the machine check interrupt processing handler 231 makes a decision whether the caught machine check interrupt is an uncorrectable machine check which indicates that execution continuation is impossible (step 510).
  • If the machine check is an uncorrectable machine check, execution of the OS 230 is stopped (step 513).
  • If the machine check is a correctable machine check, occurrence of a fault is recorded (step 511) and a program which is requesting a fault notice is notified of the fault reason (step 512).
  • The cluster control unit 220 has requested notice of machine check interrupts, so the OS 230 schedules the fault notice acceptance unit 222 in the cluster control unit 220 for execution.
  • the machine check notice is transmitted from the machine check interrupt processing handler 231.
  • The notice processing may be executed after execution of the machine check interrupt processing handler 231 is completed.
  • the fault notice acceptance unit 222 is dispatched by the OS 230 to execute.
  • the fault notice acceptance unit 222 sets the request flag 1110 in the failover request table 223 in the cluster control unit 220 to a value indicating "requesting" and sets the uncompleted request flag 1111 to a value indicating "present” (a processing flow is omitted).
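
A corresponding sketch of the guest-side flow: the OS handler (steps 510 to 513) either stops the OS or records the fault and invokes the fault notice acceptance unit 222, which turns the notice into a failover request. All names are illustrative assumptions.

```python
class GuestOS:
    """Illustrative guest OS with machine check handler 231; not a real OS API."""

    def __init__(self, failover_requests):
        self.running = True
        self.fault_log = []
        self.failover_requests = failover_requests   # failover request table 223

    def machine_check_handler(self, machine_check):
        if machine_check["uncorrectable"]:               # step 510
            self.running = False                         # step 513: stop the OS
            return
        self.fault_log.append(machine_check["reason"])   # step 511: record the fault
        self.notify_fault(machine_check["reason"])       # step 512: notify requesters

    def notify_fault(self, reason):
        # Fault notice acceptance unit 222: turn the notice into a failover request.
        self.failover_requests["requesting"] = True      # flag 1110
        self.failover_requests["uncompleted"] = True     # flag 1111

requests = {"requesting": False, "uncompleted": False}
os230 = GuestOS(requests)
os230.machine_check_handler({"uncorrectable": False, "reason": "HBA162 degraded"})
print(os230.running, requests)   # True {'requesting': True, 'uncompleted': True}
```
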
  • the request monitoring unit 225 in the cluster control unit 220 periodically executes processing of checking the failover request table 223.
  • FIG. 6 shows its processing flow.
  • The request monitoring unit 225 inspects the uncompleted request flag 1111 in the failover request table 223 and determines whether a requested failover has not yet completed (step 601).
  • If so, the request flag 1110 in the failover request table 223 is set to "requesting" again (step 602).
  • the request monitoring unit 225 waits for a predetermined time (step 603) and repeats check processing from the step 601.
  • The predetermined time is set by the user on the basis of the application which is being executed.
  • For example, the predetermined time may be 30 seconds.
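
A sketch of the periodic check performed by the request monitoring unit 225 (steps 601 to 603); the 30-second default interval is the example value given above, and the loop and flag names are assumptions.

```python
import time

def request_monitoring_loop(failover_requests, interval_s=30, max_iterations=None):
    """Re-arm the request flag 1110 while a failover request remains uncompleted."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        if failover_requests["uncompleted"]:        # step 601: request not completed?
            failover_requests["requesting"] = True  # step 602: ask for failover again
        time.sleep(interval_s)                      # step 603: wait predetermined time
        iterations += 1

# One bounded pass with a short interval, for illustration only.
requests = {"requesting": False, "uncompleted": True}
request_monitoring_loop(requests, interval_s=0.01, max_iterations=1)
print(requests["requesting"])   # True
```
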
  • FIG. 7 shows a processing flow of the data processing program 211.
  • The data processing program 211 basically repeats accepting a processing request transmitted via the network and executing the data processing corresponding to that request.
  • The data processing program 211 waits for a processing request (step 701).
  • The wait times out when a predetermined time has elapsed.
  • The step 701 is therefore completed either by arrival of a processing request or by timeout.
  • the data processing program 211 inquires of the cluster control unit 220 via the cluster control interface 221 whether there is a failover request (step 702).
  • the cluster control unit 220 returns the value of the request flag 1110 in the failover request table 223.
  • If there is no failover request, the requested data processing is executed (step 703); if the step 701 was completed by timeout, however, nothing is done.
  • the data processing program 211 waits for arrival of a processing request again (the step 701).
  • If there is a failover request, the data processing program 211 requests the cluster control unit 220 to conduct failover (step 710).
  • The data processing program 211 waits for completion of the failover processing, acquires the execution status of the failover from the cluster control unit 220, and decides whether the failover was successful (step 711).
  • If the failover was successful, the data processing program 211 stops execution. If the failover failed, the data processing program 211 waits for arrival of a processing request again (the step 701).
  • In that case, failover is requested again at some future time by the processing in the request monitoring unit 225.
  • Processing may be conducted to forcibly stop the processing of the data processing program 211 at a time point when the failover is successful. In this case, the processing of the data processing program 211 is continued only when the failover fails.
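
A sketch of the main loop of the data processing program 211 (steps 701, 702, 703, 710 and 711); the queue-based request source and the cluster-control callbacks are assumptions used only to keep the example self-contained.

```python
import queue

def data_processing_loop(requests_q, cluster, handle, wait_s=1.0, max_iterations=3):
    """Steps 701-711: alternate between serving requests and checking for failover."""
    for _ in range(max_iterations):
        try:
            request = requests_q.get(timeout=wait_s)   # step 701: wait (with timeout)
        except queue.Empty:
            request = None                             # timed out: nothing to process

        if not cluster["failover_requested"]:          # step 702: any failover request?
            if request is not None:
                handle(request)                        # step 703: do the data processing
            continue

        ok = cluster["do_failover"]()                  # step 710: ask for failover
        if ok:                                         # step 711: did it succeed?
            return "stopped after successful failover"
        # On failure, go back to waiting for requests; the request monitoring
        # unit will re-issue the failover request later.
    return "still running"

q = queue.Queue()
cluster = {"failover_requested": True, "do_failover": lambda: True}
print(data_processing_loop(q, cluster, handle=print, wait_s=0.01))
```
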
  • FIG. 8 shows a processing flow of the failover processing unit 224 in the cluster control unit 220.
  • Upon receiving a request from the data processing program 211, the failover processing unit 224 executes failover processing (step 801). When the failover is completed, the LPAR 310 in the computer 200 becomes the primary node, and the data processing program 311 accepts requests and executes data processing.
  • the failover processing unit 224 makes a decision whether the failover is successful (step 802).
  • If the failover is successful, the uncompleted request flag 1111 in the failover request table 223 (FIG. 11) is set to "not present" and the request flag 1110 in the failover request table 223 is set to "not requesting" (step 803).
  • Furthermore, the stop possibility flag 404 in the fault notice table 252 in the hypervisor 250 is set to "yes" (step 804). This is done via an interface provided by the hypervisor 250. As a result, the entry for the LPAR 210 in the fault notice table 252 shows "present" in the fault notice flag 403 and "yes" in the stop possibility flag 404, which indicates that a fault has been notified and that preparations for stopping in response to the fault notice are complete. States of the fault notice table 252 in the computer 100 at this time are shown in 460 in FIG. 12. As shown in 460, referring to the fault notice table 252 makes it easy to recognize that the LPAR1 and the LPAR2 executed in the computer 100 can be stopped. Furthermore, states of the fault notice table 252 in the computer 200 are shown in 470 in FIG. 12. Since the LPAR 310 has become the primary node, the stop possibility flag 404 for the LPAR3 is shown as "no."
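
A sketch of the failover processing unit 224 (steps 801 to 804): after a successful failover it clears both flags in the failover request table 223 and reports to the hypervisor that the LPAR may now be stopped. The function arguments are hypothetical stand-ins for the interfaces described above.

```python
def process_failover(failover_requests, hypervisor_entry, execute_failover):
    """Steps 801-804 of the failover processing unit 224 (illustrative names)."""
    success = execute_failover()                 # step 801: run the failover
    if not success:                              # step 802: did it succeed?
        return False
    failover_requests["uncompleted"] = False     # step 803: clear flag 1111
    failover_requests["requesting"] = False      #           and flag 1110
    hypervisor_entry["stop_possible"] = True     # step 804: flag 404 -> "yes"
    return True

requests = {"requesting": True, "uncompleted": True}
entry = {"fault_notified": True, "stop_possible": False}   # LPAR1 after step 504
process_failover(requests, entry, execute_failover=lambda: True)
print(requests, entry)   # flags cleared; LPAR1 now marked stoppable (state 460)
```
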
  • the fault notice table 252 for managing the fault notice request flag 402 and the stop possibility flag 404 is provided in the hypervisor 250 and the hypervisor 250 provides an interface for updating them.
  • It thereby becomes possible for the hypervisor 250 to notify an LPAR which can continue execution of the occurrence of a partial fault and to cause that LPAR to execute processing in preparation for a future stop.
  • a program (the cluster control unit 220, in the embodiment) which executes in the LPAR can receive a notice of a partial fault which has no direct relation to execution of itself and execute processing in preparation for system stop at the time of future maintenance (execute failover, in the embodiment).
  • Upon receiving a fault notice, therefore, the LPAR can execute preparations for future maintenance and notify the hypervisor 250 that the preparations have been completed.
  • The method for ascertaining whether an LPAR can be stopped depends upon the application which is being executed, and consequently individual ascertainment is needed.
  • the hypervisor 250 can retain information for judging whether the whole of the computer 100 can be stopped, regardless of the application which is being executed, and consequently it becomes possible for an operator engaging in maintenance to easily judge whether the computer can be stopped.
  • In the embodiment described above, the hardware fault processing is combined with cluster control.
  • However, the combination is not restricted to cluster control. Processing in preparation for a future stop can be executed as long as a program which receives notice of a machine check interrupt and updates the fault notice table 252 in the hypervisor 250 is running.
  • FIG. 9 is a system configuration diagram of an embodiment 2 according to the present invention.
  • a computer 900 for monitoring the operation states of the computer 100 and the computer 200 is added to the system configuration in the embodiment 1.
  • A NIC 931 and a NIC 932 are mounted on the computer 100 and the computer 200, respectively, and are connected to the computer 900 via a network 920. This makes it possible to access an interface of the hypervisor 250 via the network 920 and acquire the contents of the fault notice table 252.
  • The computer 900 has a configuration similar to that shown in FIG. 1.
  • In the computer 900, a fault status display unit 910 is executing.
  • the fault status display unit 910 acquires information from computers to be managed and displays the information. It is now supposed that the computer 100 and the computer 200 have been registered as objects of the management. In particular, the fault status display unit 910 acquires states of the fault notice tables 252 from the hypervisors in the computer 100 and the computer 200 and displays the states. As a result, it is possible to make a decision easily whether a hardware fault which allows continuation of execution has occurred and stop preparation processing corresponding to it has been executed.
  • FIG. 10 shows an example of display of a fault status. This is an example showing a state after failover of the LPAR 210 is completed in the embodiment 1.
  • This display is constituted on the basis of the fault notice table 460 of the computer 100 and the fault notice table 470 of the computer 200 shown in FIG. 12 .
  • a constituting method of contents of this display will be described.
  • The fault status display unit 910, which creates the view shown in FIG. 10, acquires an LPAR name 1001 and an operation situation 1002 of the computer 100, corresponding to the computer A, from the hypervisor 250. This information is assumed to be acquired from the hypervisor 250 as management information.
  • the fault status display unit 910 acquires contents of the fault notice table 252 concerning an LPAR in operation from the hypervisor. Specifically, the fault status display unit 910 displays a value acquired from the fault notice flag 403 in the fault notice table 252 as contents of a fault notice 1003 and displays a value acquired from the stop possibility flag 404 in the fault notice table 252 as contents of a stop possibility 1004.
  • The LPAR 260 (LPAR2) is in the stopped state, so the information of the LPAR 210 (LPAR1) is acquired and shown.
  • For an LPAR in the stopped state, only the LPAR name and the operation situation are displayed; other information is not shown.
  • Information of the computer 200, corresponding to the computer B, is acquired and shown in the same way. Specifically, information of the LPAR 310 (LPAR3) and the LPAR 360 (LPAR4) is acquired, and an LPAR name 1011, an operation situation 1012, a fault notice 1013, and a stop possibility 1014 are shown.
  • The fault notice flag 403 for the LPAR1 is "present" and the stop possibility flag 404 is "yes." Therefore, the fault status display unit 910 shows "present" in the fault notice 1003 for the computer 100 and "yes" in the stop possibility 1004. The fault status display unit 910 also acquires the contents of the fault notice table 470 for the computer 200 and shows that information.
  • The computer A corresponds to the computer 100 and the computer B corresponds to the computer 200. The display shows that the LPAR1 is executing in the computer A but has received a fault notice and is in a stoppable state, and that the LPAR2 has stopped execution. Furthermore, it shows that both the LPAR3 and the LPAR4 in the computer B are executing.
  • When the screen display 1000 is referred to, it can be judged through the screen that in the computer A the LPAR1 is executing but can be stopped.
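
A sketch of how the fault status display unit 910 could assemble the rows of screen 1000 from the fault notice tables acquired from each managed hypervisor; the data layout and function name are assumptions, and the example values loosely follow tables 460 and 470 in FIG. 12.

```python
def build_fault_status_view(managed_computers):
    """Build rows like FIG. 10 (columns 1001-1004 / 1011-1014) from per-computer
    fault notice tables; only running LPARs get fault notice / stop possibility."""
    rows = []
    for computer, lpars in managed_computers.items():
        for name, info in lpars.items():
            row = {"computer": computer, "lpar": name, "status": info["status"]}
            if info["status"] == "running":
                row["fault_notice"] = "present" if info["fault_notified"] else "not present"
                row["stop_possible"] = "yes" if info["stop_possible"] else "no"
            rows.append(row)
    return rows

# Example state loosely corresponding to tables 460 and 470 in FIG. 12.
computers = {
    "Computer A": {"LPAR1": {"status": "running", "fault_notified": True,  "stop_possible": True},
                   "LPAR2": {"status": "stopped"}},
    "Computer B": {"LPAR3": {"status": "running", "fault_notified": False, "stop_possible": False},
                   "LPAR4": {"status": "running", "fault_notified": False, "stop_possible": False}},
}
for row in build_fault_status_view(computers):
    print(row)
```
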
  • management software such as the cluster control unit 220 which executes in the LPAR can set a stop possibility state of the LPAR in the fault notice table 252 in the hypervisor 250 in cooperation with the hypervisor 250 by using the interface provided by the hypervisor.
  • In general, the applications running on different LPARs are unrelated systems, and only the manager of a system (not an operator) can decide whether an application can be stopped. According to the present invention, an LPAR which continues execution despite a partial fault can be caused to execute stop preparations directed at maintenance work, and the status of that preparation processing can be ascertained easily. As a result, the work of stopping the computer when replacing a faulty part is facilitated.
  • the data processing program 211 and the cluster control unit 220 cooperate and execute the failover.
  • Alternatively, the cluster control unit 220 may start the failover by itself.
  • In the embodiment, the data processing program 211 is caused to judge the execution timing of the cluster failover.
  • the hypervisor may monitor the operation situation of the LPAR and cause the cluster control unit 220 to start failover. For example, it is also possible to find an idle state of an LPAR and execute failover.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)

Claims (14)

  1. A hardware fault processing method in a virtual computer system having a plurality of logical partitions (LPARs) (210, 260, 310 and 360) generated on a first physical computer (100) and a second physical computer (200) which constitute clusters, under control of hypervisors, wherein a first hypervisor (250) in the first physical computer controls a plurality of LPARs generated on the first physical computer and has fault notice information (252) indicating, for each LPAR on the first physical computer, whether or not the LPAR can be stopped, the hardware fault processing method comprising the steps of:
    responding to occurrence of a hardware fault in the first physical computer by:
    causing the first hypervisor (250) to:
    identify a first LPAR (260), among the plurality of LPARs on the first physical computer, which becomes unable to execute due to influence of the hardware fault, and
    transmit a fault notice indicating that continuation of execution is impossible to the first LPAR, to cause execution of the first LPAR to be stopped and to cause a cluster control unit (370) in a second LPAR (360), which constitutes a cluster with the first LPAR and which is generated on the second physical computer, to execute failover of the first LPAR to the second LPAR,
    causing a cluster control unit (270) in the first LPAR to change the fault notice information for the first LPAR, in response to reception of the fault notice indicating that continuation of execution is impossible, so that the fault notice information indicates that the first LPAR can be stopped,
    causing the first hypervisor to transmit a fault notice indicating that continuation of execution is possible but a hardware fault has occurred to a third LPAR (210), which is different from the first LPAR, on the first physical computer, so that a cluster control unit (320) in a fourth LPAR (310), which constitutes a cluster with the third LPAR and which is generated on the second physical computer, executes failover of the third LPAR to the fourth LPAR, and
    causing a cluster control unit (220) in the third LPAR to change the fault notice information for the third LPAR, in response to completion of the failover of the third LPAR to the fourth LPAR, so that the fault notice information indicates that the third LPAR can be stopped.
  2. The hardware fault processing method according to claim 1, wherein
    the fault notice information manages, for each LPAR on the first physical computer, whether (402) a request for a fault notice is present for the LPAR and whether (404) the LPAR can be stopped after failover, as regards a hardware fault which does not affect execution of the LPARs,
    the first hypervisor (250) refers to the fault notice information and, if there is a request for a hardware fault notice from the third LPAR which can continue execution, transmits the fault notice to the third LPAR, and
    a cluster control unit in the third LPAR which has received the fault notice has failover request information (223) for managing the status of the failover of the third LPAR to the fourth LPAR, and sets, in the failover request information, the presence (1110) of a request for the failover of the third LPAR to the fourth LPAR.
  3. The hardware fault processing method according to claim 2, wherein
    the cluster control unit in the third LPAR refers to the failover request information (223) and, if the failover request exists, executes the failover of the third LPAR to the fourth LPAR, and
    upon completion of the failover of the third LPAR to the fourth LPAR, the cluster control unit in the third LPAR sets a stop possibility (404) for the third LPAR in the fault notice information (252) in the first hypervisor (250) to "possible" after the failover of the third LPAR to the fourth LPAR.
  4. The hardware fault processing method according to claim 3, wherein
    the virtual computer system comprises a fault status display unit (910),
    the fault status display unit displays an operation situation (1002, 1012) and a stop possibility (1004, 1014) for each LPAR included in the system, and
    the stop possibility displayed by the fault status display unit is based on whether the LPAR can be stopped after failover, as managed by the fault notice information (252).
  5. The hardware fault processing method according to claim 3, wherein the cluster control unit in the third LPAR refers to the failover request information (223) at every predetermined time period.
  6. The hardware fault processing method according to claim 1, wherein
    hypervisors in the first physical computer (100) and the second physical computer (200) have interfaces for registering that an LPAR requests notice of a hardware fault for which execution of the LPAR can continue, and
    the hypervisors in the first physical computer and the second physical computer notify an LPAR which has requested a notice of a hardware fault for which execution of the LPAR can continue, in accordance with the registration made via the interfaces.
  7. The hardware fault processing method according to claim 1, wherein
    the first hypervisor (250) and a second hypervisor included in the second physical computer have interfaces for notifying that the third LPAR has executed the failover of the third LPAR to the fourth LPAR,
    at least one of the first hypervisor and the second hypervisor retains the notification status of fault coping processing of an LPAR, and
    at least one of the first hypervisor and the second hypervisor has an interface for acquiring the notification status.
  8. The hardware fault processing method according to claim 7, comprising a procedure and a device for acquiring and displaying a fault coping status retained by at least one of the first hypervisor and the second hypervisor.
  9. The hardware fault processing method according to claim 7, comprising:
    a procedure for receiving a notice of a hardware fault which allows continuation of execution from at least one of the first hypervisor (250) and the second hypervisor, and executing a system switchover, and
    a procedure for notifying, via an interface of at least one of the first hypervisor and the second hypervisor, that fault coping processing has been executed after completion of the system switchover,
    wherein a completion status of the system switchover can be acquired from at least one of the first hypervisor and the second hypervisor.
  10. A virtual computer system having a plurality of logical partitions (LPARs) (210, 260, 310 and 360) generated on a first physical computer (100) and a second physical computer (200) which constitute clusters, under control of hypervisors, wherein:
    a first hypervisor (250) in the first physical computer is adapted to control a plurality of LPARs generated on the first physical computer, the first hypervisor having fault notice information (252) indicating, for each LPAR on the first physical computer, whether or not the LPAR can be stopped,
    wherein, upon occurrence of a hardware fault in the first physical computer:
    the first hypervisor (250) is configured to:
    identify a first LPAR (260), among the plurality of LPARs on the first physical computer, which becomes unable to execute due to influence of the hardware fault, and
    transmit a fault notice indicating that continuation of execution is impossible to the first LPAR, to cause execution of the first LPAR to be stopped and to cause a cluster control unit (370) in a second LPAR (360), which constitutes a cluster with the first LPAR and which is generated on the second physical computer, to execute failover of the first LPAR to the second LPAR,
    a cluster control unit (270) in the first LPAR is configured to change the fault notice information for the first LPAR, in response to reception of the fault notice indicating that continuation of execution is impossible, so that the fault notice information indicates that the first LPAR can be stopped,
    the first hypervisor is further configured to transmit a fault notice indicating that continuation of execution is possible but a hardware fault has occurred to a third LPAR (210), which is different from the first LPAR, on the first physical computer, so that a cluster control unit (320) in a fourth LPAR (310), which constitutes a cluster with the third LPAR and which is generated on the second physical computer, executes failover of the third LPAR to the fourth LPAR, and
    a cluster control unit (220) in the third LPAR is configured to change the fault notice information for the third LPAR, in response to completion of the failover of the third LPAR to the fourth LPAR, so that the fault notice information indicates that the third LPAR can be stopped.
  11. The virtual computer system according to claim 10, wherein
    the fault notice information manages, for each LPAR on the first physical computer, whether (402) a request for a fault notice is present for the LPAR and whether (404) the LPAR can be stopped after failover, as regards a hardware fault which does not affect execution of the LPARs,
    the first hypervisor (250) refers to the fault notice information and, if there is a request for a hardware fault notice from the third LPAR which can continue execution, transmits the fault notice to the third LPAR, and
    a cluster control unit in the third LPAR which has received the fault notice has failover request information (223) for managing the status of the failover of the third LPAR to the fourth LPAR, and sets, in the failover request information, the presence (1110) of a request for the second failover.
  12. The virtual computer system according to claim 11, wherein
    the cluster control unit in the third LPAR refers to the failover request information (223) and, if the failover request exists, executes the failover of the third LPAR to the fourth LPAR, and
    upon completion of the failover of the third LPAR to the fourth LPAR, the cluster control unit in the third LPAR sets a stop possibility (404) for the third LPAR in the fault notice information (252) in the first hypervisor (250) to "possible" after the failover of the third LPAR to the fourth LPAR.
  13. The virtual computer system according to claim 12, wherein
    the virtual computer system comprises a fault status display unit (910),
    the fault status display unit displays an operation situation (1002, 1012) and a stop possibility (1004, 1014) for each LPAR included in the system, and
    the stop possibility displayed by the fault status display unit is based on whether the LPAR can be stopped after failover, as managed by the fault notice information (252).
  14. The virtual computer system according to claim 12, wherein the cluster control unit in the third LPAR refers to the failover request information at every predetermined time period.
EP20120165177 2011-04-25 2012-04-23 Method for processing a partial fault in a computer system Not-in-force EP2518627B8 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2011096689A JP5548647B2 (ja) 2011-04-25 2011-04-25 Partial fault processing method in a computer system

Publications (4)

Publication Number Publication Date
EP2518627A2 EP2518627A2 (fr) 2012-10-31
EP2518627A3 EP2518627A3 (fr) 2013-07-10
EP2518627B1 true EP2518627B1 (fr) 2014-08-27
EP2518627B8 EP2518627B8 (fr) 2015-03-04

Family

ID=46045828

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20120165177 Not-in-force EP2518627B8 (fr) Method for processing a partial fault in a computer system

Country Status (3)

Country Link
US (1) US8868968B2 (fr)
EP (1) EP2518627B8 (fr)
JP (1) JP5548647B2 (fr)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015052836A1 (fr) * 2013-10-11 2015-04-16 株式会社日立製作所 Storage device and failover method
US20160110277A1 (en) * 2014-10-16 2016-04-21 Siemens Aktiengesellschaft Method for Computer-Aided Analysis of an Automation System
JP2017045084A (ja) * 2015-08-24 2017-03-02 日本電信電話株式会社 Failure detection device and failure detection method
JP6535572B2 (ja) 2015-10-26 2019-06-26 日立オートモティブシステムズ株式会社 Vehicle control device and vehicle control system
US9798641B2 (en) * 2015-12-22 2017-10-24 Intel Corporation Method to increase cloud availability and silicon isolation using secure enclaves
US20180150331A1 (en) * 2016-11-30 2018-05-31 International Business Machines Corporation Computing resource estimation in response to restarting a set of logical partitions
CN108959063A (zh) * 2017-05-25 2018-12-07 北京京东尚科信息技术有限公司 Method and device for program execution
US10496351B2 (en) * 2017-06-07 2019-12-03 Ge Aviation Systems Llc Automatic display unit backup during failures of one more display units through the utilization of graphic user interface objects defined for control transfer and reversion after resolution of the failures
JP7006461B2 (ja) * 2018-04-02 2022-01-24 株式会社デンソー Electronic control device and electronic control system
US11061785B2 (en) 2019-11-25 2021-07-13 Sailpoint Technologies, Israel Ltd. System and method for on-demand warm standby disaster recovery
CN117389790B (zh) * 2023-12-13 2024-02-23 苏州元脑智能科技有限公司 Firmware detection system and method capable of fault recovery, storage medium, and server

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4123942B2 (ja) * 2003-01-14 2008-07-23 株式会社日立製作所 Information processing device
US7134052B2 (en) * 2003-05-15 2006-11-07 International Business Machines Corporation Autonomic recovery from hardware errors in an input/output fabric
US7774785B2 (en) * 2005-06-28 2010-08-10 International Business Machines Corporation Cluster code management
US7937616B2 (en) * 2005-06-28 2011-05-03 International Business Machines Corporation Cluster availability management
JP2007279890A (ja) * 2006-04-04 2007-10-25 Hitachi Ltd Backup system and backup method
JP4923990B2 (ja) * 2006-12-04 2012-04-25 株式会社日立製作所 Failover method and computer system therefor
JP4809209B2 (ja) * 2006-12-28 2011-11-09 株式会社日立製作所 System switchover method and computer system in a server virtualization environment
JP5032191B2 (ja) * 2007-04-20 2012-09-26 株式会社日立製作所 Cluster system configuration method and cluster system in a server virtualization environment
JP4980792B2 (ja) * 2007-05-22 2012-07-18 株式会社日立製作所 Virtual machine performance monitoring method and apparatus using the method
JP4744480B2 (ja) * 2007-05-30 2011-08-10 株式会社日立製作所 Virtual machine system
JP4995015B2 (ja) * 2007-09-13 2012-08-08 株式会社日立製作所 Method for checking whether a virtual machine can be executed
US8141094B2 (en) * 2007-12-03 2012-03-20 International Business Machines Corporation Distribution of resources for I/O virtualized (IOV) adapters and management of the adapters through an IOV management partition via user selection of compatible virtual functions
JP5353378B2 (ja) * 2009-03-31 2013-11-27 沖電気工業株式会社 HA cluster system and clustering method therefor
JP5856925B2 (ja) * 2012-08-21 2016-02-10 株式会社日立製作所 Computer system

Also Published As

Publication number Publication date
EP2518627B8 (fr) 2015-03-04
JP2012230444A (ja) 2012-11-22
EP2518627A2 (fr) 2012-10-31
US8868968B2 (en) 2014-10-21
EP2518627A3 (fr) 2013-07-10
US20120272091A1 (en) 2012-10-25
JP5548647B2 (ja) 2014-07-16

Similar Documents

Publication Publication Date Title
EP2518627B1 (fr) Method for handling a partial fault in a computer system
US8245077B2 (en) Failover method and computer system
US8132057B2 (en) Automated transition to a recovery kernel via firmware-assisted-dump flows providing automated operating system diagnosis and repair
US7925817B2 (en) Computer system and method for monitoring an access path
CN108121630B (zh) Electronic device, restart method, and recording medium
US9176834B2 (en) Tolerating failures using concurrency in a cluster
US9304849B2 (en) Implementing enhanced error handling of a shared adapter in a virtualized system
US7941810B2 (en) Extensible and flexible firmware architecture for reliability, availability, serviceability features
US20070260910A1 (en) Method and apparatus for propagating physical device link status to virtual devices
JP5579650B2 (ja) Apparatus and method for executing a monitored process
US8880936B2 (en) Method for switching application server, management computer, and storage medium storing program
WO2020239060A1 (fr) Error recovery method and apparatus
JP2014106587A (ja) I/O device control method and virtual machine system
WO2018095107A1 (fr) Method and apparatus for handling an abnormality of a biological program
US7925922B2 (en) Failover method and system for a computer system having clustering configuration
US10353786B2 (en) Virtualization substrate management device, virtualization substrate management system, virtualization substrate management method, and recording medium for recording virtualization substrate management program
US20070174723A1 (en) Sub-second, zero-packet loss adapter failover
US9529656B2 (en) Computer recovery method, computer system, and storage medium
US10102088B2 (en) Cluster system, server device, cluster system management method, and computer-readable recording medium
US9298568B2 (en) Method and apparatus for device driver state storage during diagnostic phase
US10454773B2 (en) Virtual machine mobility
EP3974979A1 (fr) Prevention of platform and service outages using deployment metadata
CN111581058A (zh) Fault management method, apparatus, device, and computer-readable storage medium
KR101883251B1 (ko) Apparatus for determining failover in a virtual system and method therefor
Lee et al. NCU-HA: A lightweight HA system for kernel-based virtual machine

Legal Events

Code  Event (details)

PUAI  Public reference made under Article 153(3) EPC to a published international application that has entered the European phase (original code: 0009012)
17P   Request for examination filed (effective date: 20120510)
AK    Designated contracting states, kind code of ref document A2: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
AX    Request for extension of the European patent, extension state: BA ME
PUAL  Search report despatched (original code: 0009013)
AK    Designated contracting states, kind code of ref document A3: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
AX    Request for extension of the European patent, extension state: BA ME
RIC1  Information provided on IPC code assigned before grant: G06F 11/07 20060101ALI20130606BHEP; G06F 11/20 20060101AFI20130606BHEP
GRAP  Despatch of communication of intention to grant a patent (original code: EPIDOSNIGR1)
INTG  Intention to grant announced (effective date: 20140327)
GRAS  Grant fee paid (original code: EPIDOSNIGR3)
GRAA  (expected) grant (original code: 0009210)
AK    Designated contracting states, kind code of ref document B1: AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
REG   Reference to a national code: GB, legal event code FG4D
REG   Reference to a national code: CH, legal event code EP
REG   Reference to a national code: AT, legal event code REF (ref document number 684815, country AT, kind code T, effective date 20140915)
RAP2  Party data changed (patent owner data changed or rights of a patent transferred): owner name HITACHI, LTD.
REG   Reference to a national code: IE, legal event code FG4D
REG   Reference to a national code: DE, legal event code R096 (ref document number 602012002854, country DE, effective date 20141009)
REG   Reference to a national code: AT, legal event code MK05 (ref document number 684815, country AT, kind code T, effective date 20140827)
REG   Reference to a national code: LT, legal event code MG4D
REG   Reference to a national code: NL, legal event code VDEP (effective date 20140827)
PG25  Lapsed in a contracting state [announced via postgrant information from national office to EPO], lapse because of failure to submit a translation of the description or to pay the fee within the prescribed time-limit: SE, ES, LT, FI (effective 20140827); BG, NO (20141127); GR (20141128); PT (20141229)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: HR, RS, LV, CY, AT (20140827); IS (20141227)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: NL (20140827)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: EE, IT, DK, CZ, RO, SK (20140827)
REG   Reference to a national code: DE, legal event code R097 (ref document number 602012002854, country DE)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: PL (20140827)
PLBE  No opposition filed within time limit (original code: 0009261)
STAA  Information on the status of an EP patent application or granted EP patent: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT
26N   No opposition filed (effective date: 20150528)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: SI, MC (20140827); LU (20150423)
REG   Reference to a national code: CH, legal event code PL
REG   Reference to a national code: IE, legal event code MM4A
PG25  Lapsed in a contracting state, non-payment of due fees: CH, LI (20150430)
REG   Reference to a national code: FR, legal event code PLFP (year of fee payment 5)
PG25  Lapsed in a contracting state, non-payment of due fees: IE (20150423)
PGFP  Annual fee paid to national office [announced via postgrant information from national office to EPO]: FR (payment date 20160309, year of fee payment 5)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: BE (20140827)
PGFP  Annual fee paid to national office: DE (payment date 20160419, year of fee payment 5); GB (payment date 20160420, year of fee payment 5)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: MT (20140827)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: SM (20140827); HU, invalid ab initio (20120423)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: TR (20140827)
REG   Reference to a national code: DE, legal event code R119 (ref document number 602012002854, country DE)
GBPC  GB: European patent ceased through non-payment of renewal fee (effective date: 20170423)
REG   Reference to a national code: FR, legal event code ST (effective date 20171229)
PG25  Lapsed in a contracting state, non-payment of due fees: FR (20170502); DE (20171103)
PG25  Lapsed in a contracting state, non-payment of due fees: GB (20170423)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: MK (20140827)
PG25  Lapsed in a contracting state, failure to submit a translation or to pay the fee: AL (20140827)