WO2012077235A1 - Multiplex system and method for switching multiplex system - Google Patents

Multiplex system and method for switching multiplex system

Info

Publication number
WO2012077235A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
failure
server
stop
flag
Application number
PCT/JP2010/072272
Other languages
French (fr)
Japanese (ja)
Inventor
峯村 治実
俊介 國分
Original Assignee
三菱電機株式会社
Application filed by 三菱電機株式会社
Priority to JP2012547662A (patent JP5342701B2)
Priority to PCT/JP2010/072272 (WO2012077235A1)
Publication of WO2012077235A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2028 Failover techniques eliminating a faulty processor or activating a spare
    • G06F11/2038 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component

Definitions

  • The present invention relates to, for example, a multiplex system that switches a standby server to a new active server when a failure occurs in the active server, and to a system switching method for the multiplex system.
  • There is a technique that realizes fault tolerance in software by applying virtualization technology and synchronizing virtual machines between two physical servers (active and standby) (Non-Patent Document 1, Non-Patent Document 2). This technique is called “software FT”.
  • In software FT, the standby system periodically acquires the operation data of the active system. When a failure occurs in the active system, the standby system takes over the processing of the active system using the operation data acquired from it. Consequently, if the standby system acquires operation data recorded when a failure occurred in the active system, the standby system also stops, because the operation data indicates that a failure has occurred. A stop of the standby system together with the active system in this way is called “co-death”.
  • An object of the present invention is, for example, to prevent the standby system from stopping in accordance with the operation data of the active system (co-death).
  • The multiplex system of the present invention includes a first server that executes predetermined operation processing, and a second server that executes the operation processing in place of the first server when a failure occurs in the first server.
  • The first server includes:
  • a first operation data storage unit that stores, as operation data, data including processing data used for the operation processing and a failure flag indicating whether or not an operation processing failure has occurred;
  • a first execution unit that executes the operation processing using the processing data stored in the first operation data storage unit;
  • a first failure detection unit that detects a failure in the operation processing executed by the first execution unit and, when it detects a failure, sets in the failure flag stored in the first operation data storage unit a failure occurrence value indicating that an operation processing failure has occurred;
  • a first stop unit that stops the first execution unit; and
  • a first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit.
  • The second server includes:
  • a second synchronization unit that acquires the operation data from the first server every predetermined synchronization period;
  • a second operation data storage unit that stores the operation data acquired by the second synchronization unit;
  • a second execution unit that executes the operation processing using the operation data stored in the second operation data storage unit;
  • a second stop flag storage unit that stores a second stop flag indicating whether or not the second execution unit may be stopped;
  • a second monitoring unit that monitors the first server every predetermined monitoring period and determines whether or not a failure has occurred in the first server;
  • a second activation unit that, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped;
  • a second failure detection unit that detects a failure in the operation processing executed by the second execution unit and, when it detects a failure, sets the failure occurrence value in the failure flag stored in the second operation data storage unit;
  • a second stop unit that stops the second execution unit; and
  • a second instruction unit that refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets in the referenced failure flag a non-failure occurrence value indicating that no operation processing failure has occurred.
  • When instructed by the second instruction unit to stop the second execution unit, the second stop unit refers to the second stop flag stored in the second stop flag storage unit. If the second continuation value is set in the referenced second stop flag, the second stop unit sets in that flag a non-second continuation value indicating that the second execution unit may be stopped. If the second continuation value is not set in the referenced second stop flag, the second stop unit stops the second execution unit.
  • According to the present invention, for example, the standby system (second server) can be prevented from stopping in accordance with the operation data of the active system (first server) (co-death).
  • FIG. 1 is a configuration diagram of a dual system 100 according to Embodiment 1.
  • FIG. 2 is a flowchart showing the system switching method (active server) in Embodiment 1.
  • FIG. 3 is a flowchart showing the system switching method (standby server) in Embodiment 1.
  • FIG. 4 is a table showing the relationship between failures and the means for detecting them in Embodiment 1.
  • FIG. 5 is a flowchart showing the relationship between the synchronization timing between the active server 200 and the standby server 300 and the operation of the standby server 300 in Embodiment 1.
  • FIG. 6 is a diagram illustrating an example of the hardware resources of the active server 200 and the standby server 300 in Embodiment 1.
  • FIG. 7 is a configuration diagram showing another form of the dual system 100 according to Embodiment 1.
  • FIG. 8 is a configuration diagram of the dual system 100 according to Embodiment 2.
  • FIG. 9 is a flowchart showing the system switching method (active server) in Embodiment 2.
  • FIG. 10 is a flowchart showing the system switching method (standby server) in Embodiment 2.
  • Embodiment 1. A dual system in which the standby server operates in place of the active server when a failure occurs in the active server will be described.
  • FIG. 1 is a configuration diagram of the dual system 100 according to Embodiment 1.
  • The dual system 100 according to Embodiment 1 will be described with reference to FIG. 1.
  • The dual system 100 (an example of a multiplex system) includes an active server 200, a standby server 300, and a LAN 101 (Local Area Network).
  • The active server 200 and the standby server 300 communicate via the LAN 101.
  • The LAN 101 is an example of a network.
  • The active server 200 (an example of a first server) is a server device that executes predetermined operation processing.
  • The active server 200 includes a host OS unit 210, a guest OS unit 220, and a virtual machine monitor unit 230. The active server 200 also includes a CPU (Central Processing Unit), a memory, and the like as hardware 201.
  • The virtual machine monitor unit 230 executes a virtual machine monitor and controls the host OS unit 210 and the guest OS unit 220 as virtual machines.
  • The virtual machine monitor is a function that controls a plurality of virtual machines by allocating the hardware resources of the computer (server device), such as CPU usage time and storage areas in memory, to those virtual machines.
  • The host OS unit 210 (an example of a first host computer) is a virtual machine that manages the guest OS unit 220.
  • The OS (Operating System) of the host OS unit 210 is referred to as the host OS.
  • The host OS unit 210 includes a software FT unit 211, a system switching control unit 212, a system switching detection unit 213, and a host OS storage unit 219.
  • The host OS unit 210 and each “... unit” included in the host OS unit 210 operate using the hardware resources allocated to the host OS unit 210.
  • The guest OS unit 220 (an example of a first guest computer) is a virtual machine that executes the predetermined operation processing.
  • The OS of the guest OS unit 220 is referred to as the guest OS.
  • The guest OS unit 220 includes an application execution unit 221, a cluster software unit 222, a system switching instruction unit 224, and a guest OS storage unit 229.
  • The cluster software unit 222 includes a failure detection unit 223.
  • The guest OS unit 220 and each “... unit” included in the guest OS unit 220 operate using the hardware resources allocated to the guest OS unit 220.
  • The guest OS storage unit 229 (an example of a first operation data storage unit) stores various data used by the guest OS unit 220.
  • For example, the guest OS storage unit 229 stores operation data.
  • The operation data is data including processing data used for the operation processing and a failure flag (the system switching instruction flag described later) indicating whether or not an operation processing failure has occurred.
  • The application execution unit 221 (an example of a first execution unit) executes an application program (hereinafter, an application) that describes the processing procedure of the predetermined operation processing, using the processing data stored in the guest OS storage unit 229.
  • The failure detection unit 223 (an example of a first failure detection unit) included in the cluster software unit 222 detects a failure in the operation processing executed by the application execution unit 221.
  • When it detects a failure in the operation processing, the failure detection unit 223 sets, in the failure flag stored in the guest OS storage unit 229, a failure occurrence value (“permitted”, described later) indicating that an operation processing failure has occurred.
  • The system switching instruction unit 224 (an example of a first instruction unit) refers to the failure flag stored in the guest OS storage unit 229. When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 224 instructs the system switching control unit 212 to stop the application execution unit 221 (the system switching instruction described later).
  • When the system switching control unit 212 (an example of a first stop unit) is instructed by the system switching instruction unit 224 to stop the application execution unit 221, it stops the guest OS unit 220 including the application execution unit 221.
  • When a failure occurs in the active server 200, the standby server 300 operates as a new active server; when the active server 200 recovers, the active server 200 operates as a new standby server.
  • The host OS storage unit 219 (an example of a first stop flag storage unit) stores various data used by the host OS unit 210.
  • For example, the host OS storage unit 219 stores a first stop flag (the system switching state flag described later) indicating whether or not the application execution unit 221 may be stopped.
  • After such a system switch, each “... unit” of the active server 200 operates as follows.
  • The software FT unit 211 (an example of a first synchronization unit) acquires operation data from the standby server 300 every predetermined synchronization period.
  • The system switching detection unit 213 (an example of a first monitoring unit and a first activation unit) monitors the standby server 300 every predetermined monitoring period and determines whether or not a failure has occurred in the standby server 300. If it determines that a failure has occurred in the standby server 300, the system switching detection unit 213 starts the guest OS unit 220 including the application execution unit 221. It also sets, in the first stop flag stored in the host OS storage unit 219, a first continuation value (“system switching present”, described later) indicating that the application execution unit 221 is not to be stopped.
  • When the system switching control unit 212 (an example of a first stop unit) is instructed by the system switching instruction unit 224 to stop the application execution unit 221, it refers to the first stop flag stored in the host OS storage unit 219.
  • If the first continuation value is set in the referenced first stop flag, the system switching control unit 212 sets in that flag a non-first continuation value (“no system switching”, described later) indicating that the application execution unit 221 may be stopped.
  • If the first continuation value is not set in the referenced first stop flag, the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221.
  • The standby server 300 (an example of a second server) is a server device that executes the operation processing in place of the active server 200 when a failure occurs in the active server 200.
  • Like the active server 200, the standby server 300 includes a host OS unit 310, a guest OS unit 320, and a virtual machine monitor unit 330.
  • Like the active server 200, the standby server 300 also includes hardware 301.
  • The host OS unit 310 (an example of a second host computer) includes a software FT unit 311, a system switching control unit 312, a system switching detection unit 313, and a host OS storage unit 319, in the same manner as the host OS unit 210 of the active server 200. The host OS unit 310 and each “... unit” included in it operate using the hardware resources allocated to the host OS unit 310.
  • The guest OS unit 320 (an example of a second guest computer) includes an application execution unit 321, a cluster software unit 322, a system switching instruction unit 324, and a guest OS storage unit 329, in the same manner as the guest OS unit 220 of the active server 200.
  • The cluster software unit 322 includes a failure detection unit 323.
  • The guest OS unit 320 and each “... unit” included in the guest OS unit 320 operate using the hardware resources allocated to the guest OS unit 320.
  • The software FT unit 311 (an example of a second synchronization unit) acquires operation data from the active server 200 every predetermined synchronization period.
  • The guest OS storage unit 329 (an example of a second operation data storage unit) stores various data used by the guest OS unit 320.
  • For example, the guest OS storage unit 329 stores the operation data (processing data, failure flag, etc.) acquired by the software FT unit 311.
  • The application execution unit 321 (an example of a second execution unit) executes the operation processing using the operation data stored in the guest OS storage unit 329.
  • The host OS storage unit 319 (an example of a second stop flag storage unit) stores various data used by the host OS unit 310.
  • For example, the host OS storage unit 319 stores a second stop flag (the system switching state flag described later) indicating whether or not the application execution unit 321 may be stopped.
  • The system switching detection unit 313 (an example of a second monitoring unit and a second activation unit) monitors the active server 200 every predetermined monitoring period and determines whether or not a failure has occurred in the active server 200. When it determines that a failure has occurred in the active server 200, the system switching detection unit 313 starts the guest OS unit 320 including the application execution unit 321. It also sets, in the second stop flag stored in the host OS storage unit 319, a second continuation value (“system switching present”, described later) indicating that the application execution unit 321 is not to be stopped.
  • The failure detection unit 323 (an example of a second failure detection unit) included in the cluster software unit 322 detects a failure in the operation processing executed by the application execution unit 321.
  • When it detects a failure in the operation processing, the failure detection unit 323 sets the failure occurrence value in the failure flag stored in the guest OS storage unit 329.
  • The system switching instruction unit 324 (an example of a second instruction unit) refers to the failure flag (the system switching instruction flag described later) stored in the guest OS storage unit 329.
  • When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 324 instructs the system switching control unit 312 to stop the application execution unit 321 (the system switching instruction described later) and sets in the referenced failure flag a non-failure occurrence value (“not occurred”, described later) indicating that no operation processing failure has occurred.
  • When the system switching control unit 312 (an example of a second stop unit) is instructed by the system switching instruction unit 324 to stop the application execution unit 321, it refers to the second stop flag stored in the host OS storage unit 319.
  • If the second continuation value is set in the referenced second stop flag, the system switching control unit 312 sets in that flag a non-second continuation value (“no system switching”, described later) indicating that the application execution unit 321 may be stopped.
  • If the second continuation value is not set in the referenced second stop flag, the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321.
  • FIG. 2 is a flowchart showing the system switching method (active server) in Embodiment 1.
  • The system switching method of the active server 200 (or of the standby server 300 operating as a new active server) will be described with reference to FIG. 2.
  • The application execution unit 221 executes an application program in which a predetermined operation processing procedure is described (S110).
  • The processing data handled by the operation processing is stored in the guest OS storage unit 229.
  • The guest OS storage unit 229 is, for example, a storage area in the memory allocated to the guest OS unit 220.
  • To notify the standby server 300 that the guest OS unit 220 is operating normally, the system switching detection unit 213 transmits a heartbeat signal to the standby server 300 every time a predetermined heartbeat notification period elapses (S120).
  • For example, the system switching detection unit 213 starts a heartbeat notification timer and transmits a heartbeat signal to the standby server 300 when the timer notifies a timeout. The system switching detection unit 213 then starts a new heartbeat notification timer. However, while the guest OS unit 220 is stopped, the system switching detection unit 213 does not transmit a heartbeat signal.
  • The heartbeat notification timer is a function that notifies a timeout when the heartbeat notification period has elapsed since it was started.
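  • As an illustration only, the heartbeat notification described above could be sketched in Python as follows. The class name HeartbeatSender, its parameters, the explicit `now` arguments (used instead of a real timer so the sketch is deterministic), and the choice to restart the timer even while the guest OS unit is stopped are assumptions of this sketch, not details taken from the embodiment.

```python
class HeartbeatSender:
    """Hypothetical model of the heartbeat notification timer (cf. S120)."""

    def __init__(self, period, send, guest_running):
        self.period = period                # heartbeat notification period (seconds)
        self.send = send                    # callable that delivers one heartbeat
        self.guest_running = guest_running  # callable: is the guest OS unit running?
        self.deadline = None                # time at which the timer times out

    def start(self, now):
        """Start (or restart) the heartbeat notification timer."""
        self.deadline = now + self.period

    def tick(self, now):
        """Call regularly; returns True if a heartbeat was sent on this call."""
        if self.deadline is None or now < self.deadline:
            return False                    # timer not started or not yet timed out
        sent = False
        if self.guest_running():            # no heartbeat while the guest is stopped
            self.send(now)
            sent = True
        self.start(now)                     # start a new timer after the timeout
        return sent


sent_at = []
running = {"guest": True}
hb = HeartbeatSender(period=1.0, send=sent_at.append,
                     guest_running=lambda: running["guest"])
hb.start(now=0.0)
hb.tick(0.5)              # before the timeout: nothing is sent
hb.tick(1.0)              # timeout: a heartbeat is sent and the timer restarts
running["guest"] = False
hb.tick(2.0)              # guest OS unit stopped: nothing is sent
```

  • Because no heartbeat is sent while the guest OS unit is stopped, a monitoring peer sees the stop simply as a missing heartbeat.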
  • The failure detection unit 223 determines whether a hardware failure or a software failure has occurred in the operation processing executed by the application execution unit 221 (S130).
  • For example, the failure detection unit 223 monitors the hardware 201 (external storage device, communication device, etc.) accessed in the operation processing and, using a response wait timer, detects a response delay (timeout) from the hardware 201 as a hardware failure.
  • The response wait timer is a function that times out when a predetermined response wait time has elapsed since it was started.
  • The failure detection unit 223 also detects a defect in the application program as a software failure. Examples of such defects are a memory shortage and a release error, in which a reserved storage area is not released.
  • When a failure occurs, the cluster software unit 222 executes predetermined failure processing. For example, the cluster software unit 222 displays information on the failure on a display device.
  • If no failure has occurred, S110 to S130 are repeated. If a hardware failure has occurred, the process proceeds to S131. If a software failure has occurred, the process proceeds to S132.
  • In S131, the failure detection unit 223 sets “permitted” in the system switching instruction flag (an example of the failure flag) stored in advance in the guest OS storage unit 229.
  • The initial value of the system switching instruction flag is “not occurred”. “Not occurred” means that no failure has occurred in the operation processing, and “permitted” means that a hardware failure has occurred. “Permitted” also means that the system switching instruction may be discarded. After S131, the process proceeds to S140.
  • In S132, the failure detection unit 223 sets “not permitted” in the system switching instruction flag. “Not permitted” means that a software failure has occurred, and also that the system switching instruction may not be discarded. After S132, the process proceeds to S140.
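  • The three values of the system switching instruction flag and the branch taken in S131 and S132 can be summarized in a short Python sketch. The enum SwitchInstructionFlag, its member names, and the helper flag_for_failure are hypothetical names introduced here for illustration; the embodiment only specifies the three flag values and their meanings.

```python
from enum import Enum


class SwitchInstructionFlag(Enum):
    """Hypothetical encoding of the system switching instruction flag."""
    NOT_OCCURRED = "not occurred"    # initial value: no failure has occurred
    PERMITTED = "permitted"          # hardware failure; instruction may be discarded
    NOT_PERMITTED = "not permitted"  # software failure; instruction may not be discarded


def flag_for_failure(kind):
    """Map a detected failure kind to the flag value set in S131 or S132."""
    if kind == "hardware":
        return SwitchInstructionFlag.PERMITTED       # S131
    if kind == "software":
        return SwitchInstructionFlag.NOT_PERMITTED   # S132
    if kind is None:
        return SwitchInstructionFlag.NOT_OCCURRED    # nothing detected
    raise ValueError(f"unknown failure kind: {kind!r}")


hardware_flag = flag_for_failure("hardware")
software_flag = flag_for_failure("software")
```

  • Encoding the failure kind in the flag itself is what later lets the system switching control unit treat hardware and software failures differently.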
  • The system switching instruction unit 224 refers to the system switching instruction flag stored in the guest OS storage unit 229 every time a predetermined failure detection period elapses (S140).
  • When the system switching instruction flag is set to “permitted” or “not permitted”, the system switching instruction unit 224 notifies the system switching control unit 212 of data including the value of the system switching instruction flag as a system switching instruction.
  • For example, the system switching instruction unit 224 starts a failure detection timer and refers to the system switching instruction flag when the timer notifies a timeout.
  • The failure detection timer is a function that notifies a timeout when the failure detection period has elapsed since it was started.
  • The system switching instruction unit 224 inputs a system switching instruction to the system switching control unit 212 when “permitted” or “not permitted” is set in the referenced system switching instruction flag.
  • The system switching instruction unit 224 starts a new failure detection timer when “not occurred” is set in the referenced system switching instruction flag.
  • For example, the system switching instruction unit 224 notifies the system switching control unit 212 of the system switching instruction by writing it into a predetermined storage area provided for passing system switching instructions.
  • The system switching instruction unit 224 then sets “not occurred” in the system switching instruction flag stored in the guest OS storage unit 229. That is, the system switching instruction unit 224 initializes the system switching instruction flag. After S140, the process proceeds to S150.
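  • One timeout of the failure detection timer in S140 might look like the following Python sketch. The function name poll_instruction_flag, the dictionary used as the guest OS storage unit, the key switch_instruction_flag, and the string flag values are assumptions made for illustration.

```python
def poll_instruction_flag(guest_storage):
    """Handle one failure-detection-timer timeout (cf. S140).

    Returns the system switching instruction (a dict carrying the flag
    value) or None when no failure has been recorded.
    """
    flag = guest_storage["switch_instruction_flag"]
    if flag == "not occurred":
        return None                      # nothing to report; the timer is restarted
    instruction = {"flag": flag}         # the instruction carries the flag value
    guest_storage["switch_instruction_flag"] = "not occurred"  # initialize the flag
    return instruction


storage = {"switch_instruction_flag": "permitted"}
first = poll_instruction_flag(storage)    # a failure was recorded: instruction issued
second = poll_instruction_flag(storage)   # flag already initialized: nothing issued
```

  • Initializing the flag right after the notification means that each recorded failure produces exactly one system switching instruction.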
  • In S150, the system switching control unit 212 refers to the system switching instruction notified from the system switching instruction unit 224. If the value of the system switching instruction flag included in the system switching instruction is “permitted”, the process proceeds to S151. If the value is “not permitted”, the process proceeds to S152.
  • In S151, the system switching control unit 212 refers to the system switching state flag (an example of a first stop flag) stored in advance in the host OS storage unit 219. If the flag is “system switching present”, the system switching control unit 212 discards the system switching instruction and sets “no system switching” in the flag; otherwise, the process proceeds to S152.
  • The host OS storage unit 219 is, for example, a storage area in the memory allocated to the host OS unit 210.
  • The system switching state flag is set to either “no system switching” or “system switching present”, and its initial value is “no system switching”. “No system switching” means that neither system switching from the active server 200 to the standby server 300 nor system switching from the standby server 300 (new active server) to the active server 200 (new standby server) is being performed. “System switching present” means that system switching processing is being performed.
  • In S152, the system switching control unit 212 stops the guest OS unit 220 via the virtual machine monitor unit 230. After the guest OS unit 220 is stopped, the system switching detection unit 213 no longer transmits a heartbeat signal (S120). After S152, the process proceeds to S200.
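  • The decision made in S150 to S152 can be sketched in Python as below. The discard behaviour in S151 follows the description of the system switching control unit (the stop unit) given earlier in this embodiment. The function name handle_switch_instruction, the dictionary standing in for the host OS storage unit, and the returned action strings are assumptions of this sketch.

```python
def handle_switch_instruction(instruction_flag, host_storage):
    """Return "discard" or "stop_guest" for one system switching instruction."""
    if instruction_flag == "permitted":                       # S150 -> S151
        if host_storage["switch_state_flag"] == "system switching present":
            # A switch has just taken place: drop the instruction once and
            # allow later instructions to stop the guest OS unit again.
            host_storage["switch_state_flag"] = "no system switching"
            return "discard"
    return "stop_guest"                                       # S152


host = {"switch_state_flag": "system switching present"}
first = handle_switch_instruction("permitted", host)   # discarded during switching
second = handle_switch_instruction("permitted", host)  # flag was reset: guest stops
```

  • A “not permitted” instruction bypasses the state flag entirely, so a software failure always stops the guest OS unit.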
  • In S200, the administrator resolves the failure, for example by replacing the hardware in which the failure occurred.
  • The active server 200 then operates as a new standby server.
  • At this point, the guest OS unit 220 has not yet been started.
  • The active server 200 operating as a new standby server behaves in the same way as the standby server 300 does until a failure occurs in the active server 200.
  • The operation of the standby server 300 is described later.
  • FIG. 3 is a flowchart showing the system switching method (standby server) in Embodiment 1.
  • The system switching method of the standby server 300 (or of the active server 200 operating as a new standby server) will be described with reference to FIG. 3.
  • The guest OS unit 320 of the standby server 300 is stopped.
  • The software FT unit 311 acquires the operation data of the guest OS unit 220 from the active server 200 every time a predetermined synchronization period elapses, and stores the acquired operation data in the guest OS storage unit 329.
  • The operation data is the data stored in the storage area (guest OS storage unit 229) in the memory allocated to the guest OS unit 220, such as the processing data and the system switching instruction flag.
  • For example, the software FT unit 311 of the standby server 300 starts a synchronization timer and transmits a synchronization request to the active server 200 when the timer notifies a timeout.
  • The software FT unit 211 of the active server 200 receives the synchronization request and transmits the operation data stored in the guest OS storage unit 229 to the standby server 300.
  • The software FT unit 311 of the standby server 300 receives the operation data, stores it in the guest OS storage unit 329, and starts a new synchronization timer.
  • The synchronization timer is a function that notifies a timeout when the synchronization period has elapsed since it was started.
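  • A minimal Python sketch of one synchronization round is given below. The function name synchronize, the dictionaries standing in for the two guest OS storage units, and the use of a deep copy in place of the request and transfer over the LAN are assumptions made for illustration.

```python
import copy


def synchronize(active_guest_storage, standby_guest_storage):
    """Copy the active guest's operation data into the standby guest's storage."""
    # The deep copy stands in for the synchronization request and the
    # transfer of the operation data over the network.
    snapshot = copy.deepcopy(active_guest_storage)
    standby_guest_storage.clear()
    standby_guest_storage.update(snapshot)


active = {"processing_data": {"count": 3},
          "switch_instruction_flag": "permitted"}   # a failure was just recorded
standby = {}
synchronize(active, standby)
active["processing_data"]["count"] = 4              # later changes are not visible
```

  • Note that the system switching instruction flag is copied together with the processing data, which is why a failure recorded on the active server can reach the standby server.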
  • The system switching detection unit 313 determines whether or not a failure has occurred in the active server 200 as follows.
  • The system switching detection unit 313 determines that a failure has occurred in the active server 200 when it cannot receive (detect) the heartbeat signal of the active server 200 within a predetermined monitoring period.
  • For example, the system switching detection unit 313 starts a monitoring timer and determines whether a heartbeat signal is received before the timer notifies a timeout.
  • When a heartbeat signal is received, the system switching detection unit 313 stops the running monitoring timer and starts a new one.
  • The monitoring timer is a function that notifies a timeout when the monitoring period has elapsed since it was started.
  • When the monitoring timer notifies a timeout, the system switching detection unit 313 determines that a failure has occurred in the active server 200.
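  • The monitoring timer logic can be illustrated with the following Python sketch. The class name HeartbeatMonitor, its method names, and the explicit `now` arguments (used instead of a real timer) are assumptions of this sketch.

```python
class HeartbeatMonitor:
    """Hypothetical model of the monitoring timer on the standby server."""

    def __init__(self, monitoring_period, now):
        self.monitoring_period = monitoring_period
        self.deadline = now + monitoring_period   # monitoring timer started

    def on_heartbeat(self, now):
        """Heartbeat received: stop the running timer and start a new one."""
        self.deadline = now + self.monitoring_period

    def failure_detected(self, now):
        """True once the timer has timed out without a heartbeat arriving."""
        return now >= self.deadline


monitor = HeartbeatMonitor(monitoring_period=3.0, now=0.0)
monitor.on_heartbeat(now=1.0)                  # timer restarted, new deadline 4.0
still_alive = not monitor.failure_detected(now=3.5)
failed = monitor.failure_detected(now=4.0)     # no heartbeat for a full period
```

  • Choosing a monitoring period longer than the heartbeat notification period gives a single delayed heartbeat some margin before a failure is declared; the embodiment does not specify the ratio.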
  • In S230, the system switching detection unit 313 starts the guest OS unit 320 via the virtual machine monitor unit 330. After S230, the process proceeds to S240.
  • In S240, the system switching detection unit 313 sets “system switching present” in the system switching state flag (an example of the second stop flag) stored in advance in the host OS storage unit 319.
  • The initial value of the system switching state flag is “no system switching”.
  • The values of the system switching state flag have the same meaning as on the active server 200 (see S151 in FIG. 2). After S240, the process proceeds to S100.
  • In S100, the standby server 300 operates as a new active server.
  • With this, the system switching method (standby server) ends.
  • The standby server 300 operates as a new active server as follows.
  • The system switching instruction unit 324 refers to the system switching instruction flag stored in the guest OS storage unit 329 every time the failure detection period elapses.
  • When the system switching instruction flag is set to “permitted” or “not permitted”, the system switching instruction unit 324 notifies the system switching control unit 312 of a system switching instruction including the value of the system switching instruction flag, and initializes the system switching instruction flag.
  • The value of the system switching state flag of the standby server 300 operating as a new active server has been set to “system switching present” in S240 (see FIG. 3).
  • Therefore, the system switching control unit 312 discards the system switching instruction in S151 and sets “no system switching” in the system switching state flag. The application execution unit 321 then executes the operation processing (S110), the system switching detection unit 313 transmits a heartbeat signal every time the heartbeat notification period elapses (S120), and the failure detection unit 323 determines, every time the failure detection period elapses, whether or not a failure has occurred (S130). In other words, even when a system switching instruction is notified, the system switching control unit 312 does not stop the guest OS unit 320 if the value of the system switching state flag is “system switching present”. At this point, a hardware failure has occurred in the active server 200, but no hardware failure has occurred in the standby server 300.
  • the duplex system 100 described in Embodiment 1 has the following effects.
  • the standby server 300 can be operated as a new active server, so that the system availability can be increased.
  • The standby server 300 can operate normally as a new active server. That is, even when the failure state is synchronized from the active server 200 to the standby server 300, the standby server 300 does not stop and operates as a new active server.
  • the dual system 100 can discriminate between a hardware failure and a software failure by using the system switching instruction flag, and can perform different failure control when a hardware failure occurs and when a software failure occurs. For example, in the case of a hardware failure, it is determined whether to stop the guest OS unit 220 based on the system switching status flag (S151 in FIG. 2), and in the case of a software failure, the guest OS unit 220 is stopped (see FIG. 2, S152).
  • FIG. 4 is a table showing the relationship between the failure and the means for detecting the failure in the first embodiment. The effects of the duplex system 100 described in Embodiment 1 are explained with reference to FIG. 4.
  • Hardware (H / W) failures can be distinguished by causes (1) to (4).
  • the host OS unit 210 of the active server 200 includes a failure detection unit (not shown) that detects a hardware failure of the active server 200.
  • Failure (1) is a serious failure that causes the active server 200 to stop suddenly due to a power failure or the like. Such a failure interrupts heartbeat communication. Therefore, the failure (1) is detected by the host OS unit 310 (system switch detection unit 313) of the standby server 300.
  • Fault (2) is a minor fault that does not cause the active server 200 to stop, such as a fan failure.
  • the failure (2) is detected by the host OS unit 210 (failure detection unit) of the active server 200.
  • Fault (3) is a fault in which waiting for a response from hardware times out due to an I / O error of a disk or network. Such a failure is detected by the host OS unit 210 (failure detection unit) of the active server 200.
  • the host OS unit 210 (failure detection unit) detects the failure (2) using the driver function of the host OS.
  • the failure (4) is a failure in which waiting for a response from the hardware times out as in the case of the failure (3).
  • the failure (4) and the failure (3) are different in timeout time and detection means.
  • the hardware timeout time is set longer at the OS level.
  • the failure (4) is a failure in which a time-out time set according to the system is applied, and waiting for a response from hardware times out.
  • the failure (4) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200 (FIG. 2, S130).
  • The state of the failure (4) (system switching instruction flag) is synchronized to the standby server 300 as part of the data (operation data) of the guest OS unit 220. In this case, although the failure (4) has not occurred in the standby server 300, the standby server 300 detects the failure (4) and stops the guest OS unit 320. That is, both the active server 200 and the standby server 300 stop. This is called “co-death”. However, in the first embodiment, co-death can be prevented by using the system switching status flag (FIG. 2, S151).
  • Failure (5) is a failure in which the host OS unit 210 of the active server 200 stops due to a cause such as an OS hang-up. Such a failure interrupts heartbeat communication. Therefore, the failure (5) is detected by the host OS unit 310 (system switch detection unit 313) of the standby server 300 or the guest OS unit (cluster software unit) of the standby server described later.
  • Fault (6) is a fault that causes the operation process to stop due to an application program malfunction (for example, memory shortage).
  • the failure (6) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200.
  • FIG. 5 is a flowchart showing the relationship between the synchronization timing between the active server 200 and the standby server 300 and the operation of the standby server 300 in the first embodiment.
  • The effects of the duplex system 100 are explained with reference to FIG. 5.
  • “Synchronous A” is a case where the standby server 300 acquires operational data from the active server 200 before the guest OS unit 220 of the active server 200 detects a hardware failure.
  • “Synchronous B” is a case where the standby server 300 acquires operational data from the active server 200 after the guest OS unit 220 of the active server 200 detects a hardware failure and before it notifies the host OS unit 210 of a system switching instruction.
  • “Synchronous C” is a case where the standby server 300 acquires operational data from the active server 200 after the guest OS unit 220 of the active server 200 notifies the host OS unit 210 of a system switching instruction.
  • The value of the system switching instruction flag of the standby server 300 is “not generated”.
  • Therefore, the guest OS unit 320 does not notify the host OS unit 310 of a system switching instruction. That is, the guest OS unit 320 does not stop until a new failure is detected.
  • Thus, the duplex system 100 can prevent co-death of the active server 200 and the standby server 300 regardless of the timing at which the operation data is synchronized between them.
  • FIG. 6 is a diagram illustrating an example of hardware resources of the active server 200 and the standby server 300 according to the first embodiment.
  • the active server 200 and the standby server 300 include a CPU 901 (Central Processing Unit).
  • the CPU 901 is connected to the ROM 903, the RAM 904, the communication board 905, the display device 911, the keyboard 912, the mouse 913, the drive device 914, and the magnetic disk device 920 via the bus 902, and controls these hardware devices.
  • The drive device 914 is a device that reads and writes storage media such as an FD (Flexible Disk), a CD (Compact Disc), and a DVD (Digital Versatile Disc).
  • the communication board 905 is wired or wirelessly connected to a communication network such as a LAN (Local Area Network), the Internet, or a telephone line.
  • the magnetic disk device 920 stores an OS 921 (operating system), a program group 922, and a file group 923.
  • The program group 922 includes programs that execute the functions described as “~ unit” in the embodiments.
  • The programs are read and executed by the CPU 901. That is, a program causes a computer to function as a “~ unit”, and causes the computer to execute the procedure or method of the “~ unit”.
  • The file group 923 includes various data (inputs, outputs, determination results, calculation results, processing results, etc.) used by each “~ unit” described in the embodiments.
  • Arrows in the configuration diagrams and flowcharts mainly indicate the input and output of data and signals.
  • What is described as a “~ unit” may be a “~ circuit”, “~ apparatus”, or “~ device”, or may be a “~ step”, “~ procedure”, or “~ process”. That is, what is described as a “~ unit” may be implemented by firmware, software, hardware, or any combination thereof.
  • FIG. 7 is a configuration diagram showing another form of the dual system 100 according to the first embodiment.
  • the dual system 100 may include a standby server 400 and a shared storage 102.
  • the standby server 400 (third server) is a server device that executes operation processing instead of the active server 200 when a software failure occurs in the active server 200.
  • the standby server 400 includes a host OS unit 410, a guest OS unit 420, a virtual machine monitor unit 430, and hardware 401, similar to the active server 200 and the standby server 300.
  • the shared storage 102 is a storage device that stores processing data used for operation processing, image data constituting a guest OS unit (virtual machine), and the like.
  • the active server 200, the standby server 300, or the standby server 400 accesses the shared storage 102 via the LAN 101 and executes an operation process using the processing data stored in the shared storage 102.
  • The standby server 400 need not synchronize the operation data (system switching instruction flag) with the active server 200.
  • the cluster software unit 222 of the active server 200 transmits a heartbeat signal to the standby server 400 every time a predetermined heartbeat communication cycle elapses, and the cluster software unit of the standby server 400 receives the signal from the active server 200. Receive a heartbeat signal.
  • the cluster software unit of the standby server 400 monitors the heartbeat signal every predetermined monitoring period. If the cluster software unit of the standby server 400 cannot receive the heartbeat signal within the monitoring period, it determines that a software failure has occurred in the active server 200.
  • the application execution unit of the standby server 400 restarts the application program for operation processing.
  • the system switching detection unit of the standby server 400 restarts the guest OS unit 420.
  • the standby server 400 operates as a new operational server.
  • In Embodiment 1, for example, the following inter-server state synchronization method (system switching method) has been described.
  • the dual system 100 includes an active server 200 and a standby server 300, and replicates the operating state (operation data) of the active system to the standby system in a predetermined procedure.
  • the active server 200 is stopped, and the standby server 300 is started using the replicated operation state, thereby switching the system from the active system to the standby system.
  • If the state in which the failure occurred (system switching instruction flag) has been replicated to the standby system but the failure has not actually occurred in the standby system, the standby server 300 is not stopped.
  • the standby server 300 is provided with a system switching status holding unit (host OS storage unit 319) for storing presence / absence of system switching from the active system to the standby system (system switching status flag).
  • After system switching, the original standby system (standby server 300) operates as the new active system, and the original active system (active server 200) operates as the new standby system. After a failure occurs in the new active system and the system is switched to the new standby system, if the state in which the failure occurred has been replicated to the new standby system but the failure has not actually occurred there, the new standby server is not stopped.
  • a system switching state holding unit (host OS storage unit) that stores the presence / absence of system switching from another system to the own system (system switching status flag) is provided in each of the active server 200 and the standby server 300.
  • If the system switching status flag stored in the system switching status holding unit is “no system switching”, the local server is stopped in order to switch the system to the other system; if the system switching status flag is “system switching present”, the local server is not stopped.
  • A virtual environment (virtual machine monitor unit) is mounted on each of the active server 200 and the standby server 300, and one host OS (host OS unit) and one or more guest OSs (guest OS units) are installed in the virtual environment.
  • the operating state (operation data) of the guest OS is synchronized from the active system to the standby system by the software fault tolerant function (software FT unit 211) installed on the host OS.
  • a system switching control unit that performs system switching and a system switching status holding unit (host OS storage unit) that holds whether or not system switching from the active system to the standby system is performed (system switching status flag) are provided in the host OS.
  • a system switching instruction is transmitted from the guest OS to the host OS.
  • When the system switching control unit of the host OS receives a system switching instruction from the guest OS, it discards the instruction only when the system switching status flag stored in the system switching status holding unit is “system switching present”; otherwise, it switches the system to the standby system.
  • the system switching control unit on the host OS stops the active guest OS.
  • the standby host OS detects that the synchronization by the software fault tolerant function or the heartbeat communication with the active system has been interrupted, and starts the standby guest OS.
  • a flag (system switching instruction flag) indicating whether or not the system switching instruction can be discarded is provided in the system switching instruction transmitted from the guest OS to the host OS.
  • The system switching instruction is discarded only when the system switching instruction flag indicates that the instruction can be discarded and the system switching status flag stored in the system switching status holding unit is “system switching present”. Otherwise, the system switching control unit of the host OS performs system switching to the standby system.
  • The system switching instruction flag is set to “permitted” only when the cause of the failure is hardware, and whether system switching can be executed is determined based on this value and the system switching status flag stored in the system switching status holding unit.
  • When the system switching instruction flag is “permitted” and the system switching status flag is “system switching present”, the system switching operation is not performed, so co-death can be prevented.
  • Embodiment 2. A mode in which the active server 200, when it stops the guest OS unit 220, notifies the standby server 300 that the guest OS unit 220 has been stopped will be described.
  • FIG. 8 is a configuration diagram of the dual system 100 according to the second embodiment. The configuration of dual system 100 in the second embodiment will be described with reference to FIG.
  • the host OS unit 210 of the active server 200 includes a system switching communication unit 214 instead of the system switching detection unit 213 (see FIG. 1) described in the first embodiment.
  • the host OS unit 310 of the standby server 300 includes a system switching communication unit 314 instead of the system switching detection unit 313 described in the first embodiment.
  • The system switching communication unit 214 (an example of a first notification unit) notifies the standby server 300 of the stop of the guest OS unit 220 (application execution unit 221) when the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221.
  • The system switching communication unit 314 (an example of a second activation unit) starts the guest OS unit 320 including the application execution unit 321 when the active server 200 notifies it of the stop of the guest OS unit 220 (application execution unit 221). Further, the system switching communication unit 314 sets the system switching status flag stored in the host OS storage unit 319 to a second continuation value (“system switching present”) indicating that the guest OS unit 320 (application execution unit 321) is not to be stopped.
  • The system switching communication unit 214 and the system switching communication unit 314 operate as follows.
  • When the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321, the system switching communication unit 314 notifies the active server 200 of the stop of the guest OS unit 320 (application execution unit 321).
  • The system switching communication unit 214 (an example of the second notification unit) starts the guest OS unit 220 including the application execution unit 221 when it is notified by the standby server 300 of the stop of the guest OS unit 320 (application execution unit 321). Further, the system switching communication unit 214 sets the system switching status flag stored in the host OS storage unit 219 to the second continuation value (“system switching present”).
  • FIG. 9 is a flowchart showing a system switching method (active server) in the second embodiment.
  • a system switching method of the active server 200 (or the standby server 300 operating as a new active server) will be described with reference to FIG.
  • the system switching communication unit 214 transmits a stop notification of the guest OS unit 220 to the standby server 300 (S153).
  • FIG. 10 is a flowchart showing a system switching method (standby server) in the second embodiment.
  • a system switching method of the standby server 300 (or the active server 200 operating as a new standby server) will be described with reference to FIG.
  • S220B is executed instead of S220 (see FIG. 3) described in the first embodiment.
  • the system switching communication unit 314 determines whether or not a failure has occurred in the active server 200 as follows.
  • the system switching communication unit 314 determines that a failure has occurred in the active server 200 when the heartbeat signal of the active server 200 cannot be received within a predetermined monitoring period.
  • the system switching communication unit 314 determines that a failure has occurred in the active server 200 when the software FT unit 311 cannot acquire operation data from the active server 200. Further, the system switching communication unit 314 determines that a failure has occurred in the active server 200 when receiving a stop notification of the guest OS unit 220 from the active server 200. If a failure has occurred in the active server 200 (YES), the process proceeds to S230. If no failure has occurred in the active server 200 (NO), S210 and S220B are repeated.
  • In Embodiment 2, for example, the following inter-server state synchronization method (system switching method) has been described.
  • A system switching notification unit (system switching communication unit) and a system switching notification reception unit (system switching communication unit) are provided in the host OS (host OS unit).
  • The system switching control unit of the active host OS stops the guest OS (guest OS unit), and the system switching notification unit of the active system notifies the system switching notification reception unit of the standby system of the system switching. Then, the standby host OS starts the guest OS.
  • As a result, the standby server can detect that the active server has stopped without depending on the synchronization cycle or the heartbeat cycle, and can immediately operate as a new active server when the active server stops.
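
The failure determination of S220B and the subsequent start of the guest OS (S230, S240) can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the class `StandbyHost`, the enum `Trigger`, the use of seconds as the time unit, and the order in which the three conditions are checked are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    """Reason the standby host judged the active server to have failed (hypothetical names)."""
    NONE = "none"
    HEARTBEAT_TIMEOUT = "heartbeat timeout"
    SYNC_FAILED = "sync failed"
    STOP_NOTIFIED = "stop notified"


@dataclass
class StandbyHost:
    monitoring_period: float            # assumed to be in seconds
    last_heartbeat: float = 0.0         # time the last heartbeat signal was received
    sync_ok: bool = True                # False if operation data could not be acquired
    stop_notified: bool = False         # True once a stop notification has arrived
    guest_running: bool = False
    switch_status: str = "no system switching"

    def detect_failure(self, now: float) -> Trigger:
        # S220B: any one of the three conditions counts as a failure of the active server.
        # The check order is an assumption; the text does not specify one.
        if self.stop_notified:
            return Trigger.STOP_NOTIFIED
        if not self.sync_ok:
            return Trigger.SYNC_FAILED
        if now - self.last_heartbeat > self.monitoring_period:
            return Trigger.HEARTBEAT_TIMEOUT
        return Trigger.NONE

    def poll(self, now: float) -> Trigger:
        trigger = self.detect_failure(now)
        if trigger is not Trigger.NONE and not self.guest_running:
            self.guest_running = True                         # S230: start the guest OS
            self.switch_status = "system switching present"   # S240: set the status flag
        return trigger


host = StandbyHost(monitoring_period=3.0, last_heartbeat=10.0)
print(host.poll(11.0))      # heartbeat still fresh, nothing notified
host.stop_notified = True   # stop notification arrives from the active server
print(host.poll(11.5))      # switches without waiting for the heartbeat timeout
```

In this sketch the stop notification causes the switch at the next poll, while the heartbeat path alone would have had to wait until the monitoring period elapsed.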


Abstract

The purpose of the present invention is to prevent a standby system from crashing due to the operational data of an operational system. In an operational server (200), a failure detection unit (223) sets a system switching instruction flag to "permitted" when a hardware failure has been detected. A guest OS (220) is subsequently suspended. In a standby server (300), a software FT unit (311) periodically acquires operational data containing the system switching instruction flag from the operational server (200). When a heartbeat signal from the operational server (200) ceases, a system switching detection unit (313) launches a guest OS (320), and sets a system switching status flag to "system switched". A system switching instruction unit (324) detects a hardware failure by referring to the system switching instruction flag, and notifies the system switching detection unit (313) of the system switching instruction. Having received the system switching instruction, the system switching detection unit (313) suspends the guest OS (320) only when the system switching status flag has not been set to "system switched".

Description

Multiplex system and system switching method for multiplex system
The present invention relates to, for example, a multiplex system that switches a standby server to a new active server when a failure occurs in the active server, and to a system switching method for the multiplex system.
There is a technology that applies virtualization technology and synchronizes a virtual machine between two physical servers (active and standby) to achieve fault tolerance in software (Non-Patent Document 1, Non-Patent Document 2). This technology is called “software FT”.
In software FT, the standby system periodically acquires the operation data of the active system. When a failure occurs in the active system, the standby system takes over the processing of the active system using the operation data acquired from the active system.
For this reason, if the standby system acquires the operation data as it was when the failure occurred in the active system, the standby system also stops in accordance with the operation data indicating that a failure has occurred.
The standby system stopping together with the active system in this way is called “co-death”.
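
The way co-death arises can be illustrated with a small sketch. It is a simplified model under stated assumptions, not software FT itself: the dictionary layout and the function names `replicate` and `naive_takeover` are hypothetical and exist only for this example.

```python
def replicate(active_state: dict) -> dict:
    """The standby system takes a copy of the active system's operation data."""
    return dict(active_state)


def naive_takeover(replica: dict) -> str:
    # Hypothetical naive standby: it acts on the failure flag found in the copied
    # operation data, even though the failure did not occur on the standby itself.
    if replica["failure_flag"]:
        return "stopped"
    return "running"


active = {"processing_data": [1, 2, 3], "failure_flag": False}
active["failure_flag"] = True   # a failure is detected on the active system
replica = replicate(active)     # synchronization happens after the flag was set

print(naive_takeover(replica))  # the standby stops as well: co-death
```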
An object of the present invention is, for example, to prevent the standby system from also stopping in accordance with the operation data of the active system (co-death).
The multiplex system of the present invention includes a first server that executes predetermined operation processing, and a second server that executes the operation processing in place of the first server when a failure occurs in the first server.
The first server includes:
a first operation data storage unit that stores, as operation data, data including processing data used for the operation processing and a failure flag indicating whether a failure of the operation processing has occurred;
a first execution unit that executes the operation processing using the processing data stored in the first operation data storage unit;
a first failure detection unit that detects a failure of the operation processing executed by the first execution unit and, when it detects such a failure, sets the failure flag stored in the first operation data storage unit to a failure occurrence value indicating that a failure of the operation processing has occurred;
a first stop unit that stops the first execution unit; and
a first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit.
The second server includes:
a second synchronization unit that acquires the operation data from the first server at every predetermined synchronization cycle;
a second operation data storage unit that stores the operation data acquired by the second synchronization unit;
a second execution unit that executes the operation processing using the operation data stored in the second operation data storage unit;
a second stop flag storage unit that stores a second stop flag indicating whether the second execution unit may be stopped;
a second monitoring unit that monitors the first server at every predetermined monitoring cycle and determines whether a failure has occurred in the first server;
a second activation unit that, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets the second stop flag stored in the second stop flag storage unit to a second continuation value indicating that the second execution unit is not to be stopped;
a second failure detection unit that detects a failure of the operation processing executed by the second execution unit and, when it detects such a failure, sets the failure flag stored in the second operation data storage unit to the failure occurrence value;
a second stop unit that stops the second execution unit; and
a second instruction unit that refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets the referenced failure flag to a non-failure occurrence value indicating that no failure of the operation processing has occurred.
When instructed by the second instruction unit to stop the second execution unit, the second stop unit refers to the second stop flag stored in the second stop flag storage unit; if the second continuation value is set in the referenced second stop flag, it sets the flag to a non-second continuation value indicating that the second execution unit may be stopped, and if the second continuation value is not set, it stops the second execution unit.
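
The interplay of the second instruction unit and the second stop unit can be sketched as below. This is an illustrative model under assumptions, not the claimed implementation: the class `SecondServer`, its method names, and the string constants `CONTINUE` and `MAY_STOP` are hypothetical stand-ins for the second continuation value and the non-second continuation value.

```python
from dataclasses import dataclass

CONTINUE = "continue"    # stands in for the "second continuation value"
MAY_STOP = "may stop"    # stands in for the "non-second continuation value"


@dataclass
class SecondServer:
    failure_flag: bool = False   # part of the operation data replicated from the first server
    stop_flag: str = MAY_STOP    # second stop flag
    running: bool = False        # whether the second execution unit is running

    def start_after_first_server_failure(self, replicated_failure_flag: bool) -> None:
        # Second activation unit: start execution and mark the next stop request as discardable.
        self.failure_flag = replicated_failure_flag
        self.running = True
        self.stop_flag = CONTINUE

    def instruct(self) -> None:
        # Second instruction unit: act on the failure flag, then reset it.
        if self.failure_flag:
            self.failure_flag = False
            self.request_stop()

    def request_stop(self) -> None:
        # Second stop unit: discard one stop request if the continuation value is set.
        if self.stop_flag == CONTINUE:
            self.stop_flag = MAY_STOP
        else:
            self.running = False


server = SecondServer()
server.start_after_first_server_failure(replicated_failure_flag=True)
server.instruct()            # stale flag copied from the first server: request is discarded
print(server.running)
server.failure_flag = True   # a failure detected on the second server itself
server.instruct()            # this time the second execution unit is stopped
print(server.running)
```

The stop request caused by the replicated failure flag is consumed once, so a later failure detected on the second server itself still stops the second execution unit.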
According to the present invention, for example, it is possible to prevent the standby system (second server) from also stopping in accordance with the operation data of the active system (first server) (co-death).
FIG. 1 is a configuration diagram of the dual system 100 according to Embodiment 1.
FIG. 2 is a flowchart showing a system switching method (active server) in Embodiment 1.
FIG. 3 is a flowchart showing a system switching method (standby server) in Embodiment 1.
FIG. 4 is a table showing the relationship between failures and the means for detecting them in Embodiment 1.
FIG. 5 is a flowchart showing the relationship between the synchronization timing between the active server 200 and the standby server 300 and the operation of the standby server 300 in Embodiment 1.
FIG. 6 is a diagram illustrating an example of hardware resources of the active server 200 and the standby server 300 according to Embodiment 1.
FIG. 7 is a configuration diagram showing another form of the dual system 100 according to Embodiment 1.
FIG. 8 is a configuration diagram of the dual system 100 according to Embodiment 2.
FIG. 9 is a flowchart showing a system switching method (active server) in Embodiment 2.
FIG. 10 is a flowchart showing a system switching method (standby server) in Embodiment 2.
Embodiment 1.
A dual system in which the standby server operates in place of the active server when a failure occurs in the active server will be described.
FIG. 1 is a configuration diagram of the dual system 100 according to Embodiment 1.
The dual system 100 according to Embodiment 1 will be described with reference to FIG. 1.
The dual system 100 (an example of a multiplex system) includes an active server 200, a standby server 300, and a LAN 101 (Local Area Network).
The active server 200 and the standby server 300 communicate via the LAN 101. The LAN 101 is an example of a network.
 運用系サーバ200(第一サーバの一例)は、所定の運用処理を実行するサーバ装置である。
 運用系サーバ200は、ホストOS部210とゲストOS部220と仮想マシンモニタ部230とを備える。
 また、運用系サーバ200は、ハードウェア201としてCPU(Central Processing Unit)やメモリなどを備える。
The active server 200 (an example of a first server) is a server device that executes predetermined operation processing.
The active server 200 includes a host OS unit 210, a guest OS unit 220, and a virtual machine monitor unit 230.
Further, the operational server 200 includes a CPU (Central Processing Unit), a memory, and the like as the hardware 201.
The virtual machine monitor unit 230 executes a virtual machine monitor and controls the host OS unit 210 and the guest OS unit 220 as virtual machines.
A virtual machine monitor is a function that allocates the hardware resources of a computer (server device), such as CPU time and storage areas in memory, to a plurality of virtual machines and controls those virtual machines.
The host OS unit 210 (an example of a first host computer) is a virtual machine that manages the guest OS unit 220. Hereinafter, the OS (Operating System) of the host OS unit 210 is referred to as the host OS.
The host OS unit 210 includes a software FT unit 211, a system switching control unit 212, a system switching detection unit 213, and a host OS storage unit 219.
The host OS unit 210 and each “unit” it contains operate using the hardware resources allocated to the host OS unit 210.
The guest OS unit 220 (an example of a first guest computer) is a virtual machine that executes the predetermined operation processing. Hereinafter, the OS of the guest OS unit 220 is referred to as the guest OS.
The guest OS unit 220 includes an application execution unit 221, a cluster software unit 222, a system switching instruction unit 224, and a guest OS storage unit 229. The cluster software unit 222 includes a failure detection unit 223.
The guest OS unit 220 and each “unit” it contains operate using the hardware resources allocated to the guest OS unit 220.
The guest OS storage unit 229 (an example of a first operation data storage unit) stores various data used by the guest OS unit 220.
For example, the guest OS storage unit 229 stores operation data. The operation data includes processing data used in the operation processing and a failure flag (the system switching instruction flag described later) indicating whether a failure of the operation processing has occurred.
The application execution unit 221 (an example of a first execution unit) executes an application program (hereinafter, the application) that describes the procedure of the predetermined operation processing, using the processing data stored in the guest OS storage unit 229.
The failure detection unit 223 (an example of a first failure detection unit) included in the cluster software unit 222 detects failures of the operation processing executed by the application execution unit 221.
When the failure detection unit 223 detects a failure of the operation processing, it sets the failure flag stored in the guest OS storage unit 229 to a failure occurrence value (“permitted”, described later) indicating that a failure of the operation processing has occurred.
The system switching instruction unit 224 (an example of a first instruction unit) refers to the failure flag stored in the guest OS storage unit 229.
When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 224 instructs the system switching control unit 212 to stop the application execution unit 221 (the system switching instruction described later).
When the system switching control unit 212 (an example of a first stop unit) is instructed by the system switching instruction unit 224 to stop the application execution unit 221, it stops the guest OS unit 220 including the application execution unit 221.
When a failure occurs in the active server 200, the standby server 300 operates as the new active server; when the active server 200 is restored, it operates as the new standby server.
The host OS storage unit 219 (an example of a first stop flag storage unit) stores various data used by the host OS unit 210.
For example, the host OS storage unit 219 stores a first stop flag (the system switching state flag described later) indicating whether the application execution unit 221 may be stopped.
When the standby server 300 operates as the new active server and the active server 200 operates as the new standby server, the units of the active server 200 operate as follows.
The software FT unit 211 (an example of a first synchronization unit) acquires operation data from the standby server 300 at every predetermined synchronization period.
The system switching detection unit 213 (an example of a first monitoring unit and a first activation unit) monitors the standby server 300 at every predetermined monitoring period and determines whether a failure has occurred in the standby server 300.
When the system switching detection unit 213 determines that a failure has occurred in the standby server 300, it starts the guest OS unit 220 including the application execution unit 221. It also sets the first stop flag stored in the host OS storage unit 219 to a first continuation value (“system switching present”, described later) indicating that the application execution unit 221 is not to be stopped.
When the system switching control unit 212 (an example of a first stop unit) is instructed by the system switching instruction unit 224 to stop the application execution unit 221, it refers to the first stop flag stored in the host OS storage unit 219.
When the first continuation value is set in the referenced first stop flag, the system switching control unit 212 sets that flag to a non-first-continuation value (“no system switching”, described later) indicating that the application execution unit 221 may be stopped.
When the first continuation value is not set in the referenced first stop flag, the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221.
The standby server 300 (an example of a second server) is a server device that executes the operation processing in place of the active server 200 when a failure occurs in the active server 200.
Like the active server 200, the standby server 300 includes a host OS unit 310, a guest OS unit 320, and a virtual machine monitor unit 330.
Like the active server 200, the standby server 300 also includes hardware 301.
Like the host OS unit 210 of the active server 200, the host OS unit 310 (an example of a second host computer) includes a software FT unit 311, a system switching control unit 312, a system switching detection unit 313, and a host OS storage unit 319.
The host OS unit 310 and each “unit” it contains operate using the hardware resources allocated to the host OS unit 310.
Like the guest OS unit 220 of the active server 200, the guest OS unit 320 (an example of a second guest computer) includes an application execution unit 321, a cluster software unit 322, a system switching instruction unit 324, and a guest OS storage unit 329. The cluster software unit 322 includes a failure detection unit 323.
The guest OS unit 320 and each “unit” it contains operate using the hardware resources allocated to the guest OS unit 320.
The software FT unit 311 (an example of a second synchronization unit) acquires operation data from the active server 200 at every predetermined synchronization period.
The guest OS storage unit 329 (an example of a second operation data storage unit) stores various data used by the guest OS unit 320.
For example, the guest OS storage unit 329 stores the operation data (processing data, failure flag, and so on) acquired by the software FT unit 311.
The application execution unit 321 (an example of a second execution unit) executes the operation processing using the operation data stored in the guest OS storage unit 329.
The host OS storage unit 319 (an example of a second stop flag storage unit) stores various data used by the host OS unit 310.
For example, the host OS storage unit 319 stores a second stop flag (the system switching state flag described later) indicating whether the application execution unit 321 may be stopped.
The system switching detection unit 313 (an example of a second monitoring unit and a second activation unit) monitors the active server 200 at every predetermined monitoring period and determines whether a failure has occurred in the active server 200.
When the system switching detection unit 313 determines that a failure has occurred in the active server 200, it starts the guest OS unit 320 including the application execution unit 321. It also sets the second stop flag stored in the host OS storage unit 319 to a second continuation value (“system switching present”, described later) indicating that the application execution unit 321 is not to be stopped.
The failure detection unit 323 (an example of a second failure detection unit) included in the cluster software unit 322 detects failures of the operation processing executed by the application execution unit 321.
When the failure detection unit 323 detects a failure of the operation processing, it sets the failure occurrence value in the failure flag stored in the guest OS storage unit 329.
The system switching instruction unit 324 (an example of a second stop unit) refers to the failure flag (the system switching instruction flag described later) stored in the guest OS storage unit 329.
When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 324 instructs the system switching control unit 312 to stop the application execution unit 321 (the system switching instruction described later) and sets the referenced failure flag to a non-failure-occurrence value (“not occurred”, described later) indicating that no failure of the operation processing has occurred.
When the system switching control unit 312 (an example of a second stop unit) is instructed by the system switching instruction unit 324 to stop the application execution unit 321, it refers to the second stop flag stored in the host OS storage unit 319.
When the second continuation value is set in the referenced second stop flag, the system switching control unit 312 sets that flag to a non-second-continuation value (“no system switching”, described later) indicating that the application execution unit 321 may be stopped.
When the second continuation value is not set in the referenced second stop flag, the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321.
FIG. 2 is a flowchart showing the system switching method (active server) according to Embodiment 1.
The system switching method of the active server 200 (or of the standby server 300 operating as the new active server) is described with reference to FIG. 2.
S110 to S130 described below are executed in parallel.
In S110, the application execution unit 221 executes the application program that describes the procedure of the predetermined operation processing.
The processing data to be processed by the operation processing and the processing data already processed by it are stored in the guest OS storage unit 229. The guest OS storage unit 229 is, for example, a storage area in the memory allocated to the guest OS unit 220.
Next, S120 is described.
In S120, to notify the standby server 300 that the guest OS unit 220 is operating normally, the system switching detection unit 213 transmits a heartbeat signal to the standby server 300 every time a predetermined heartbeat notification period elapses.
For example, the system switching detection unit 213 starts a heartbeat notification timer and transmits a heartbeat signal to the standby server 300 when the heartbeat notification timer reports a timeout. The system switching detection unit 213 then starts a new heartbeat notification timer.
However, while the guest OS unit 220 is stopped, the system switching detection unit 213 does not transmit a heartbeat signal.
The heartbeat notification timer is a function that reports a timeout when the heartbeat notification period has elapsed since the timer was started.
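The heartbeat loop of S120 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class `HeartbeatNotifier`, its `poll` method, and the use of plain numbers for time are assumptions made so the timer behavior can be followed without threads or real clocks.

```python
class HeartbeatNotifier:
    """Hypothetical model of the heartbeat notification timer loop (S120)."""

    def __init__(self, notification_period, send):
        self.notification_period = notification_period
        self.send = send              # callable that delivers the signal to the peer server
        self.guest_running = True     # set to False once the guest OS unit is stopped
        self.deadline = None

    def start(self, now):
        """Start (or restart) the heartbeat notification timer."""
        self.deadline = now + self.notification_period

    def poll(self, now):
        """On timeout, send a heartbeat if the guest runs, then start a new timer."""
        if self.deadline is None or now < self.deadline:
            return
        if self.guest_running:
            self.send("heartbeat")
        self.start(now)


sent = []
notifier = HeartbeatNotifier(notification_period=1.0, send=sent.append)
notifier.start(now=0.0)
notifier.poll(now=0.5)   # timer still running: nothing is sent
notifier.poll(now=1.0)   # timeout: one heartbeat is sent
notifier.guest_running = False
notifier.poll(now=2.0)   # guest OS unit stopped: no heartbeat is sent
```

Passing the current time in explicitly keeps the sketch deterministic; a real system would drive `poll` from an actual timer.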
Next, S130 is described.
In S130, the failure detection unit 223 determines whether a hardware failure or a software failure has occurred in the operation processing executed by the application execution unit 221.
For example, the failure detection unit 223 monitors the hardware 201 (external storage device, communication device, and so on) accessed by the operation processing and uses a response wait timer to detect a delayed response (timeout) from the hardware 201 as a hardware failure. The response wait timer is a function that times out when a predetermined response wait time has elapsed since the timer was started.
The failure detection unit 223 also detects a defect in the application program as a software failure. Running out of memory and a release error, in which an allocated storage area is not released, are examples of application program defects.
When a hardware failure or a software failure occurs, the cluster software unit 222 executes predetermined failure processing. For example, the cluster software unit 222 displays information on the failure on a display device.
When neither a hardware failure nor a software failure has occurred (not occurred), S110 to S130 are repeated.
When a hardware failure has occurred, the process proceeds to S131.
When a software failure has occurred, the process proceeds to S132.
In S131, the failure detection unit 223 sets “permitted” in the system switching instruction flag (an example of the failure flag) stored in advance in the guest OS storage unit 229.
The initial value of the system switching instruction flag is “not occurred”. “Not occurred” means that no failure has occurred in the operation processing, and “permitted” means that a hardware failure has occurred. “Permitted” further means that the system switching instruction may be discarded.
After S131, the process proceeds to S140.
In S132, the failure detection unit 223 sets “not permitted” in the system switching instruction flag. “Not permitted” means that a software failure has occurred. “Not permitted” further means that the system switching instruction may not be discarded.
After S132, the process proceeds to S140.
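The three values of the system switching instruction flag and the way S131 and S132 set them can be sketched as below. The English value names, the `SwitchInstructionFlag` enum, the `record_failure` helper, and the dictionary standing in for the guest OS storage unit are all assumptions chosen for illustration.

```python
from enum import Enum


class SwitchInstructionFlag(Enum):
    """Illustrative names for the values of the system switching instruction flag."""
    NOT_OCCURRED = "not occurred"    # initial value: no failure in the operation processing
    PERMITTED = "permitted"          # hardware failure; the instruction may be discarded
    NOT_PERMITTED = "not permitted"  # software failure; the instruction may not be discarded


def record_failure(guest_storage, failure_kind):
    """Set the flag according to the detected failure kind (S131 / S132).

    failure_kind is "hardware" or "software"; both labels are hypothetical.
    """
    if failure_kind == "hardware":
        guest_storage["switch_instruction_flag"] = SwitchInstructionFlag.PERMITTED
    elif failure_kind == "software":
        guest_storage["switch_instruction_flag"] = SwitchInstructionFlag.NOT_PERMITTED
    else:
        raise ValueError(f"unknown failure kind: {failure_kind!r}")


# A dict stands in for the guest OS storage unit in this sketch.
storage = {"switch_instruction_flag": SwitchInstructionFlag.NOT_OCCURRED}
record_failure(storage, "hardware")
```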
In S140, the system switching instruction unit 224 refers to the system switching instruction flag stored in the guest OS storage unit 229 every time a predetermined failure detection period elapses.
When “permitted” or “not permitted” is set in the system switching instruction flag, the system switching instruction unit 224 notifies the system switching control unit 212 of data containing the value of the flag as a system switching instruction.
For example, the system switching instruction unit 224 starts a failure detection timer and refers to the system switching instruction flag when the failure detection timer reports a timeout. The failure detection timer is a function that reports a timeout when the failure detection period has elapsed since the timer was started.
When “permitted” or “not permitted” is set in the referenced system switching instruction flag, the system switching instruction unit 224 inputs a system switching instruction to the system switching control unit 212.
When “not occurred” is set in the referenced system switching instruction flag, the system switching instruction unit 224 starts a new failure detection timer.
For example, the system switching instruction unit 224 notifies the system switching control unit 212 of the system switching instruction by writing the instruction to a predetermined storage area provided for passing system switching instructions.
In addition, the system switching instruction unit 224 sets “not occurred” in the system switching instruction flag stored in the guest OS storage unit 229. That is, the system switching instruction unit 224 initializes the system switching instruction flag.
After S140, the process proceeds to S150.
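One expiry of the failure detection timer in S140 can be sketched as a single function. The function name, the dictionary standing in for the guest OS storage unit, and the `notify` callback standing in for the hand-over storage area are assumptions for illustration only.

```python
NOT_OCCURRED = "not occurred"


def on_failure_detection_timeout(guest_storage, notify):
    """Handle one expiry of the failure detection timer (S140).

    Returns True when a system switching instruction was issued.
    """
    value = guest_storage["switch_instruction_flag"]
    if value == NOT_OCCURRED:
        return False  # nothing to report; the caller starts a new timer
    # The instruction carries the flag value ("permitted" or "not permitted").
    notify({"switch_instruction_flag": value})
    # Initialize the flag after the instruction has been issued.
    guest_storage["switch_instruction_flag"] = NOT_OCCURRED
    return True


received = []
storage = {"switch_instruction_flag": "permitted"}
issued = on_failure_detection_timeout(storage, received.append)
```

Resetting the flag only after `notify` returns means a detected failure is never lost between the check and the hand-over in this sketch.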
In S150, the system switching control unit 212 refers to the system switching instruction notified by the system switching instruction unit 224.
When the value of the system switching instruction flag contained in the system switching instruction is “permitted”, the process proceeds to S151.
When the value of the system switching instruction flag contained in the system switching instruction is “not permitted”, the process proceeds to S152.
In S151, the system switching control unit 212 refers to the system switching state flag (an example of the first stop flag) stored in advance in the host OS storage unit 219. The host OS storage unit 219 is, for example, a storage area in the memory allocated to the host OS unit 210.
The initial value of the system switching state flag is “no system switching”. “No system switching” means that neither system switching from the active server 200 to the standby server 300 nor system switching from the standby server 300 (new active server) to the active server 200 (new standby server) has been performed.
The system switching state flag is set to either “no system switching” or “system switching present”. “System switching present” means that system switching has been performed.
When “no system switching” is set in the system switching state flag, the process proceeds to S152.
When “system switching present” is set in the system switching state flag, the system switching control unit 212 discards the system switching instruction and sets “no system switching” in the system switching state flag. S110 to S130 then continue.
In S152, the system switching control unit 212 stops the guest OS unit 220 via the virtual machine monitor unit 230.
After the guest OS unit 220 has been stopped, the system switching detection unit 213 no longer transmits the heartbeat signal (S120).
After S152, the process proceeds to S200.
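The decision made in S150 to S152 can be condensed into one function. This is a sketch under stated assumptions: the function name, the string flag values, the dictionary standing in for the host OS storage unit, and the `stop_guest` callback (standing in for the request to the virtual machine monitor unit) are all hypothetical.

```python
def handle_switch_instruction(instruction_value, host_storage, stop_guest):
    """Decide whether to discard the instruction or stop the guest OS unit.

    Returns "discarded" or "stopped".
    """
    if instruction_value == "permitted":                        # S150 -> S151
        if host_storage["switch_state_flag"] == "system switching present":
            # A switch has just taken place: discard the instruction once.
            host_storage["switch_state_flag"] = "no system switching"
            return "discarded"
    # "not permitted", or "permitted" with no preceding switch: S152.
    stop_guest()
    return "stopped"


stopped = []
host = {"switch_state_flag": "system switching present"}
first = handle_switch_instruction("permitted", host, lambda: stopped.append(True))
second = handle_switch_instruction("permitted", host, lambda: stopped.append(True))
```

Because the state flag is cleared when an instruction is discarded, only the first “permitted” instruction after a switch is ignored; the second call above stops the guest.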
In S200, an administrator resolves the failure, for example by replacing the failed hardware.
Thereafter, the active server 200 operates as the new standby server. At this point, the guest OS unit 220 has not yet been started.
The active server 200 operating as the new standby server behaves in the same way as the standby server 300 did before the failure occurred in the active server 200.
The operation of the standby server 300 is described later.
With S200, the system switching method (active server) ends.
FIG. 3 is a flowchart showing the system switching method (standby server) according to Embodiment 1.
The system switching method of the standby server 300 (or of the active server 200 operating as the new standby server) is described with reference to FIG. 3.
The guest OS unit 320 of the standby server 300 is stopped.
S210 and S220 described below are executed in parallel.
In S210, every time a predetermined synchronization period elapses, the software FT unit 311 acquires the operation data of the guest OS unit 220 from the active server 200 and stores the acquired operation data in the guest OS storage unit 329.
The operation data is the data stored in the storage area (guest OS storage unit 229) in the memory allocated to the guest OS unit 220, such as the processing data and the system switching instruction flag.
For example, the software FT unit 311 of the standby server 300 starts a synchronization timer and transmits a synchronization request to the active server 200 when the synchronization timer reports a timeout.
The software FT unit 211 of the active server 200 receives the synchronization request and transmits the operation data stored in the guest OS storage unit 229 to the standby server 300.
The software FT unit 311 of the standby server 300 then receives the operation data, stores it in the guest OS storage unit 329, and starts a new synchronization timer.
The synchronization timer is a function that reports a timeout when the synchronization period has elapsed since the timer was started.
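The request and response exchange of S210 can be sketched with two objects standing in for the software FT units. The class `SoftwareFT`, its method names, and the in-process call that replaces the LAN transfer are assumptions for illustration; the point shown is that the whole operation data, including the system switching instruction flag, is copied to the standby side.

```python
import copy


class SoftwareFT:
    """Hypothetical stand-in for a software FT unit and its guest OS storage."""

    def __init__(self, guest_storage):
        self.guest_storage = guest_storage

    def handle_sync_request(self):
        """Active side: answer a synchronization request with a copy of the data."""
        return copy.deepcopy(self.guest_storage)

    def synchronize(self, peer):
        """Standby side: one expiry of the synchronization timer."""
        data = peer.handle_sync_request()
        self.guest_storage.clear()
        self.guest_storage.update(data)


active_ft = SoftwareFT({"processing_data": {"count": 1},
                        "switch_instruction_flag": "permitted"})
standby_ft = SoftwareFT({})
standby_ft.synchronize(active_ft)
```

Note that a failure flag value set on the active server just before it fails is copied as well, which is why the standby side needs the system switching state flag.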
Next, S220 is described.
In S220, the system switching detection unit 313 determines whether a failure has occurred in the active server 200 as follows.
When the system switching detection unit 313 has not received (detected) a heartbeat signal from the active server 200 within a predetermined monitoring period, it determines that a failure has occurred in the active server 200.
For example, the system switching detection unit 313 starts a monitoring timer and determines whether a heartbeat signal is received before the monitoring timer reports a timeout. When a heartbeat signal is received, the system switching detection unit 313 stops the running monitoring timer and starts a new one.
The monitoring timer is a function that reports a timeout when the monitoring period has elapsed since the timer was started.
The system switching detection unit 313 also determines that a failure has occurred in the active server 200 when the software FT unit 311 was unable to acquire the operation data from the active server 200 (S210).
When a failure has occurred in the active server 200 (YES), the process proceeds to S230.
When no failure has occurred in the active server 200 (NO), S210 and S220 are repeated.
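The two failure conditions of S220 (a missed heartbeat and a failed synchronization) can be sketched as follows. The class `HeartbeatMonitor`, the `sync_ok` parameter, and the numeric timestamps are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Hypothetical model of the monitoring timer used in S220."""

    def __init__(self, monitoring_period, now):
        self.monitoring_period = monitoring_period
        self.deadline = now + monitoring_period   # start the monitoring timer

    def on_heartbeat(self, now):
        """Stop the running monitoring timer and start a new one."""
        self.deadline = now + self.monitoring_period

    def peer_failed(self, now, sync_ok=True):
        """Report a failure on a monitoring timeout or a failed synchronization."""
        return (not sync_ok) or now >= self.deadline


monitor = HeartbeatMonitor(monitoring_period=3.0, now=0.0)
monitor.on_heartbeat(now=2.0)                    # deadline moves to 5.0
alive_at_4 = not monitor.peer_failed(now=4.0)
failed_at_5 = monitor.peer_failed(now=5.0)
failed_on_sync = monitor.peer_failed(now=2.5, sync_ok=False)
```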
In S230, the system switching detection unit 313 starts the guest OS unit 320 via the virtual machine monitor unit 330.
After S230, the process proceeds to S240.
In S240, the system switching detection unit 313 sets “system switching present” in the system switching state flag (an example of the second stop flag) stored in advance in the host OS storage unit 319.
The initial value of the system switching state flag is “no system switching”. The values of the system switching state flag have the same meaning as in the active server 200 (see S151 in FIG. 2).
After S240, the process proceeds to S100.
In S100, the standby server 300 operates as the new active server.
With S100, the system switching method (standby server) ends.
That is, the standby server 300 operates as the new active server as follows.
In S140 (see FIG. 2), the system switching instruction unit 324 refers to the system switching instruction flag stored in the guest OS storage unit 329 every time the failure detection period elapses.
When “permitted” or “not permitted” is set in the system switching instruction flag, the system switching instruction unit 324 notifies the system switching control unit 212 of a system switching instruction containing the value of the flag and initializes the system switching instruction flag.
The value of the system switching state flag of the standby server 300 operating as the new active server is “system switching present” as a result of S240 (see FIG. 3).
Therefore, when the value of the system switching instruction flag in S150 (see FIG. 2) is “permitted” (hardware failure), the system switching control unit 212 discards the system switching instruction in S151 and sets “no system switching” in the system switching state flag.
The application execution unit 321 then executes the operation processing (S110), the system switching detection unit 313 transmits a heartbeat signal every time the heartbeat notification period elapses (S120), and the failure detection unit 323 determines whether a failure has occurred every time the failure detection period elapses (S130).
In other words, even when a system switching instruction is notified, the system switching control unit 312 does not stop the guest OS unit 320 if the value of the system switching state flag is “system switching present”. This is because, at this point, the hardware failure occurred in the active server 200 and not in the standby server 300.
If the value of the system switching instruction flag is “not permitted (software failure)” in S150 (see FIG. 2), the system switching control unit 212 stops the guest OS unit 320 regardless of the value of the system switching state flag (S152).
This is because, when a software failure occurs in the active server 200 and the standby server 300 takes over the operation process, the same software failure also occurs in the standby server 300.
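The handling of a system switching instruction in S150 to S152 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the names `InstructionFlag`, `StateFlag`, `HostOS`, and `on_switch_instruction` are hypothetical, and stopping the guest OS unit is modeled as clearing a boolean.

```python
from dataclasses import dataclass
from enum import Enum


class InstructionFlag(Enum):
    """System switching instruction flag (values as named in the text)."""
    NOT_OCCURRED = "not occurred"
    PERMITTED = "permitted (hardware failure)"          # instruction may be discarded
    NOT_PERMITTED = "not permitted (software failure)"  # instruction must be obeyed


class StateFlag(Enum):
    """System switching state flag held by the host OS."""
    NO_SWITCHING = "no system switching"
    SWITCHING_PRESENT = "system switching present"


@dataclass
class HostOS:
    """Hypothetical stand-in for the system switching control unit."""
    state_flag: StateFlag = StateFlag.NO_SWITCHING
    guest_running: bool = True

    def on_switch_instruction(self, flag: InstructionFlag) -> str:
        """Handle an instruction from the guest OS (S150 to S152)."""
        if (flag is InstructionFlag.PERMITTED
                and self.state_flag is StateFlag.SWITCHING_PRESENT):
            # S151: the failure state was replicated from the failed server,
            # so discard the instruction and reset the state flag.
            self.state_flag = StateFlag.NO_SWITCHING
            return "discarded"
        # S152: stop the guest OS so that the other server can take over.
        self.guest_running = False
        return "stopped"
```

In this sketch, a server that has just taken over (state flag “system switching present”) discards the first “permitted” instruction, while a second “permitted” instruction, which would correspond to a failure newly detected on that server, stops the guest OS.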
The duplex system 100 described in Embodiment 1 has the following effects.
When a hardware failure occurs in the active server 200, the standby server 300 can be operated as a new active server, which increases the availability of the system.
A system with a fault-tolerant function can be built at lower cost than with an FT (fault-tolerant) server in which the hardware is multiplexed.
By using the system switching state flag, the duplex system 100 can operate the standby server 300 normally as a new active server even when the failure state (system switching instruction flag = “permitted”) has been synchronized from the active server 200 to the standby server 300.
That is, even when the failure state has been synchronized from the active server 200 to the standby server 300, the standby server 300 does not stop and operates as a new active server.
By using the system switching instruction flag, the duplex system 100 can distinguish a hardware failure from a software failure and perform different failure control for each.
For example, in the case of a hardware failure, whether to stop the guest OS unit 220 is determined based on the system switching state flag (FIG. 2, S151), and in the case of a software failure, the guest OS unit 220 is stopped (FIG. 2, S152).
FIG. 4 is a table showing the relationship between failures and the means for detecting them in Embodiment 1.
The effects of the duplex system 100 described in Embodiment 1 are explained with reference to FIG. 4.
Hardware (H/W) failures can be classified by cause into failures (1) to (4).
The host OS unit 210 of the active server 200 is assumed to include a failure detection unit (not shown) that detects hardware failures of the active server 200.
Failure (1) is a severe failure in which the active server 200 stops suddenly due to a cause such as a power loss. Such a failure interrupts heartbeat communication. Failure (1) is therefore detected by the host OS unit 310 (system switching detection unit 313) of the standby server 300.
Failure (2) is a minor failure, such as a fan failure, that does not cause the active server 200 to stop. Failure (2) is detected by the host OS unit 210 (failure detection unit) of the active server 200.
Failure (3) is a failure in which waiting for a response from hardware times out due to an I/O error of a disk, a network, or the like. Such a failure is detected by the host OS unit 210 (failure detection unit) of the active server 200. The host OS unit 210 (failure detection unit) detects failure (3) using the driver functions of the host OS.
Failure (4), like failure (3), is a failure in which waiting for a response from hardware times out. However, failure (4) and failure (3) differ in timeout period and in detection means.
Normally, hardware timeout periods are set relatively long at the OS level. However, in a system that guarantees a predetermined response time for each process, such as an online system, the hardware timeout period must be set short.
That is, failure (4) is a failure in which waiting for a response from hardware times out under a timeout period set according to the system.
Failure (4) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200 (FIG. 2, S130).
Because failure (4) is detected by the guest OS unit 220 of the active server 200, the state of failure (4) (the system switching instruction flag) is synchronized to the standby server 300 as part of the data (operation data) of the guest OS unit 220.
In this case, although failure (4) has not occurred in the standby server 300, the standby server 300 detects failure (4) and stops the guest OS unit 320. That is, both the active server 200 and the standby server 300 stop. This is called “joint death”.
In Embodiment 1, however, joint death can be prevented by using the system switching state flag (FIG. 2, S151).
Software (S/W) failures can be classified by cause into failures (5) and (6).
Failure (5) is a failure in which the host OS unit 210 of the active server 200 stops due to a cause such as an OS hang-up. Such a failure interrupts heartbeat communication. Failure (5) is therefore detected by the host OS unit 310 (system switching detection unit 313) of the standby server 300 or by the guest OS unit (cluster software unit) of the standby server 400 described later.
Failure (6) is a failure in which the operation process stops due to a defect in the application program (for example, a memory shortage). Failure (6) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200.
In Embodiment 1, using the system switching instruction flag allows different failure control to be performed for hardware failures and software failures (FIG. 2, S150).
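The classification of failures (1) to (6) described above can be summarized as a small lookup table. The Python sketch below only restates the text; the names `Failure`, `FAILURES`, and `instruction_flag` are hypothetical, and the detector labels are shortened descriptions rather than terms from the embodiment.

```python
from typing import NamedTuple, Optional


class Failure(NamedTuple):
    kind: str      # "hardware" or "software"
    cause: str     # short description of the cause
    detector: str  # which part of the system detects the failure


FAILURES = {
    1: Failure("hardware", "sudden stop of the active server (e.g. power loss)",
               "standby host OS"),
    2: Failure("hardware", "minor fault that does not stop the server (e.g. fan)",
               "active host OS"),
    3: Failure("hardware", "I/O response timeout at the OS level",
               "active host OS"),
    4: Failure("hardware", "I/O response timeout with a shorter, system-specific limit",
               "active guest OS"),
    # Failure (5) may also be detected by the cluster software of standby server 400.
    5: Failure("software", "host OS stops (e.g. OS hang-up)",
               "standby host OS"),
    6: Failure("software", "operation process stops (e.g. application defect)",
               "active guest OS"),
}


def instruction_flag(failure_id: int) -> Optional[str]:
    """Flag value set by the guest OS, or None if the guest OS is not the detector."""
    failure = FAILURES[failure_id]
    if failure.detector != "active guest OS":
        return None
    # "permitted" is used only when the cause of the failure is hardware.
    return "permitted" if failure.kind == "hardware" else "not permitted"
```

Only failures (4) and (6) are detected inside the guest OS, so only they produce a system switching instruction flag; of these, only failure (4) yields “permitted”.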
FIG. 5 is a flowchart showing the relationship between the timing of synchronization between the active server 200 and the standby server 300 and the operation of the standby server 300 in Embodiment 1.
The effects of the duplex system 100 described in Embodiment 1 are explained with reference to FIG. 5.
“Synchronization A” is the case in which the standby server 300 acquires operation data from the active server 200 before the guest OS unit 220 of the active server 200 detects a hardware failure.
“Synchronization B” is the case in which the standby server 300 acquires operation data from the active server 200 after the guest OS unit 220 of the active server 200 detects a hardware failure and before it notifies the host OS unit 210 of a system switching instruction.
“Synchronization C” is the case in which the standby server 300 acquires operation data from the active server 200 after the guest OS unit 220 of the active server 200 notifies the host OS unit 210 of a system switching instruction.
In the case of “synchronization B”, the value of the system switching instruction flag of the standby server 300 is “permitted”, so after the guest OS unit 320 starts, the guest OS unit 320 notifies the host OS unit 310 of a system switching instruction.
However, because the system switching state flag is set to “system switching present” when the guest OS unit 320 starts (FIG. 3, S240), the host OS unit 310 discards the system switching instruction and does not stop the guest OS unit 320.
In the case of “synchronization A” or “synchronization C”, the value of the system switching instruction flag of the standby server 300 is “not occurred”, so after the guest OS unit 320 starts, the guest OS unit 320 does not notify the host OS unit 310 of a system switching instruction until a new failure is detected.
That is, the guest OS unit 320 does not stop until a new failure is detected.
In this way, the duplex system 100 can prevent joint death of the active server 200 and the standby server 300 regardless of the timing at which operation data is synchronized between the active server 200 and the standby server 300.
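The three synchronization timings can be checked with a small simulation. The following Python sketch is illustrative only: `FLAG_AT_SYNC` and `standby_guest_survives` are hypothetical names, and the `use_state_flag` parameter is an assumption added here to contrast the behavior with a system that has no system switching state flag.

```python
# Value of the system switching instruction flag on the active server
# at each synchronization point described in the text.
FLAG_AT_SYNC = {
    "A": "not occurred",  # before the hardware failure is detected
    "B": "permitted",     # detected, instruction not yet sent to the host OS
    "C": "not occurred",  # instruction sent and flag initialized
}


def standby_guest_survives(sync_point: str, use_state_flag: bool = True) -> bool:
    """Return True if the standby guest OS keeps running after takeover."""
    synced_flag = FLAG_AT_SYNC[sync_point]
    # S240: the state flag is set when the standby guest OS is started.
    # Without the mechanism, the flag is treated as never being set.
    state_flag = ("system switching present" if use_state_flag
                  else "no system switching")
    if synced_flag != "permitted":
        # No instruction is sent until a new failure is detected.
        return True
    # The replicated flag makes the guest OS send a system switching instruction.
    if state_flag == "system switching present":
        return True   # S151: the host OS discards the instruction
    return False      # the guest OS is stopped, i.e. joint death
```

With the state flag in use, the function returns True for all of A, B, and C; with `use_state_flag=False`, “synchronization B” returns False, which corresponds to the joint death described above.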
FIG. 6 is a diagram showing an example of the hardware resources of the active server 200 and the standby server 300 in Embodiment 1.
In FIG. 6, the active server 200 and the standby server 300 each include a CPU 901 (Central Processing Unit). The CPU 901 is connected via a bus 902 to a ROM 903, a RAM 904, a communication board 905, a display device 911, a keyboard 912, a mouse 913, a drive device 914, and a magnetic disk device 920, and controls these hardware devices. The drive device 914 reads from and writes to storage media such as an FD (Flexible Disk Drive), a CD (Compact Disc), and a DVD (Digital Versatile Disc).
The communication board 905 is connected, by wire or wirelessly, to a communication network such as a LAN (Local Area Network), the Internet, or a telephone line.
The magnetic disk device 920 stores an OS 921 (operating system), a program group 922, and a file group 923.
The program group 922 includes programs that execute the functions described as “... unit” in the embodiments. The programs are read and executed by the CPU 901. That is, a program causes a computer to function as a “... unit” and causes the computer to execute the procedure or method of the “... unit”.
The file group 923 includes the various data (inputs, outputs, determination results, calculation results, processing results, and so on) used by the “... units” described in the embodiments.
In the embodiments, the arrows in the configuration diagrams and flowcharts mainly indicate the input and output of data and signals.
What is described as a “... unit” in the embodiments may be a “... circuit”, “... apparatus”, or “... device”, and may also be a “... step”, “... procedure”, or “... process”. That is, what is described as a “... unit” may be implemented in firmware, software, hardware, or any combination of these.
FIG. 7 is a configuration diagram showing another form of the duplex system 100 in Embodiment 1.
As shown in FIG. 7, the duplex system 100 may include a standby server 400 and a shared storage 102.
The standby server 400 (third server) is a server device that executes the operation process in place of the active server 200 when a software failure occurs in the active server 200.
Like the active server 200 and the standby server 300, the standby server 400 includes a host OS unit 410, a guest OS unit 420, a virtual machine monitor unit 430, and hardware 401.
The shared storage 102 is a storage device that stores processing data used for the operation process, image data constituting the guest OS unit (virtual machine), and the like.
The active server 200, the standby server 300, or the standby server 400 accesses the shared storage 102 via the LAN 101 and executes the operation process using the processing data stored in the shared storage 102.
The standby server 400 does not need to synchronize operation data (the system switching instruction flag) with the active server 200.
However, the cluster software unit 222 of the active server 200 transmits a heartbeat signal to the standby server 400 every time a predetermined heartbeat communication cycle elapses, and the cluster software unit of the standby server 400 receives the heartbeat signal from the active server 200.
The cluster software unit of the standby server 400 monitors the heartbeat signal at every predetermined monitoring period. If it cannot receive a heartbeat signal within the monitoring period, it determines that a software failure has occurred in the active server 200.
When a software failure occurs in the active server 200, the application execution unit of the standby server 400 restarts the application program for the operation process. Alternatively, the system switching detection unit of the standby server 400 restarts the guest OS unit 420.
The standby server 400 then operates as a new active server.
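The heartbeat monitoring performed by the cluster software unit of the standby server 400 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `HeartbeatMonitor` and its methods are hypothetical, timestamps are passed in explicitly as plain numbers so that the example is deterministic, and the actual transport of the heartbeat signal is left out.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat and reports a failure when it is overdue."""

    def __init__(self, monitoring_period: float, start: float = 0.0) -> None:
        self.monitoring_period = monitoring_period
        self.last_seen = start  # time of the most recent heartbeat

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat signal received from the active server."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """True if no heartbeat arrived within the monitoring period."""
        return now - self.last_seen > self.monitoring_period
```

For example, with a monitoring period of 3.0 and a last heartbeat at time 1.0, `peer_failed(3.5)` is False and `peer_failed(4.5)` is True; in the latter case the standby server 400 would restart the application program or the guest OS unit 420 as described above.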
In Embodiment 1, for example, the following inter-server state synchronization method (system switching method) has been described.
The duplex system 100 includes the active server 200 and the standby server 300, and replicates the operating state (operation data) of the active system to the standby system by a predetermined procedure.
When a failure occurs in the active system, the system is switched from the active system to the standby system by stopping the active server 200 and starting the standby server 300 using the replicated operating state.
After switching from the active system to the standby system, if the failure state (system switching instruction flag) has been replicated to the standby system but the failure has not actually occurred in the standby system, the standby server 300 is not stopped.
The standby server 300 is provided with a system switching state holding unit (host OS storage unit 319) that stores whether system switching from the active system to the standby system has taken place (system switching state flag).
When a failure is detected in the standby system, the standby server 300 is stopped if the system switching state flag stored in the system switching state holding unit is “no system switching”, and is not stopped if the flag is “system switching present”.
If the active system recovers after switching from the active system to the standby system, the original standby system (standby server 300) is operated as the new active system and the original active system (active server 200) as the new standby system.
After a failure occurs in the new active system and the system is switched to the new standby system, if the failure state has been replicated to the new standby system but the failure has not actually occurred there, the new standby server is not stopped.
Each of the active server 200 and the standby server 300 is provided with a system switching state holding unit (host OS storage unit) that stores whether system switching from the other system to the own system has taken place (system switching state flag).
When a failure is detected, the own-system server is stopped in order to switch to the other system if the system switching state flag stored in the system switching state holding unit is “no system switching”, and is not stopped if the flag is “system switching present”.
A virtualization environment (virtual machine monitor unit) is installed on each of the active server 200 and the standby server 300, and one host OS (host OS unit) and one or more guest OSs (guest OS units) run in the virtualization environment.
A software fault-tolerant function (software FT unit 211) installed on the host OS synchronizes the operating state (operation data) of the guest OS from the active system to the standby system.
When a failure occurs in the active server 200, the guest OS of the active server 200 is stopped, and operation of the guest OS is resumed on the standby server 300 using the synchronized operating state of the guest OS.
The host OS is provided with a system switching control unit that performs system switching and a system switching state holding unit (host OS storage unit) that holds whether system switching from the active system to the standby system has been performed (system switching state flag).
When a failure is detected in the guest OS, the guest OS transmits a system switching instruction to the host OS.
When the system switching control unit of the host OS receives a system switching instruction from the guest OS, it discards the instruction only if the system switching state flag stored in the system switching state holding unit is “system switching present”; otherwise, it performs system switching to the standby system.
When system switching is performed, the system switching control unit on the host OS stops the guest OS of the active system. The host OS of the standby system detects that synchronization by the software fault-tolerant function, or heartbeat communication with the active system, has been interrupted, and starts the guest OS of the standby system.
The system switching instruction transmitted from the guest OS to the host OS is provided with a flag (system switching instruction flag) indicating whether the instruction may be discarded.
When the system switching control unit of the host OS receives a system switching instruction from the guest OS, it discards the instruction only if the system switching instruction flag indicates that the instruction may be discarded and the system switching state flag stored in the system switching state holding unit is “system switching present”. Otherwise, the system switching control unit of the host OS performs system switching to the standby system.
The system switching instruction flag is set to “permitted” only when the cause of the failure is hardware, and whether to perform system switching is determined based on this value and the system switching state flag stored in the system switching state holding unit.
As a result, when the system switching instruction flag is “permitted” and the system switching state flag is “system switching present”, the system switching operation is not performed, and joint death can be prevented.
Embodiment 2.
This embodiment describes a form in which, when the guest OS unit 220 is stopped, the active server 200 notifies the standby server 300 that the guest OS unit 220 has been stopped.
FIG. 8 is a configuration diagram of the duplex system 100 in Embodiment 2.
The configuration of the duplex system 100 in Embodiment 2 is described with reference to FIG. 8.
The host OS unit 210 of the active server 200 includes a system switching communication unit 214 in place of the system switching detection unit 213 (see FIG. 1) described in Embodiment 1.
Likewise, the host OS unit 310 of the standby server 300 includes a system switching communication unit 314 in place of the system switching detection unit 313 described in Embodiment 1.
When the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221, the system switching communication unit 214 (an example of a first notification unit) notifies the standby server 300 of the stop of the guest OS unit 220 (application execution unit 221).
When the active server 200 notifies it of the stop of the guest OS unit 220 (application execution unit 221), the system switching communication unit 314 (an example of a second activation unit) starts the guest OS unit 320 including the application execution unit 321.
The system switching communication unit 314 also sets the system switching state flag stored in the host OS storage unit 319 to the second continuation value (“system switching present”), which indicates that the guest OS unit 320 (application execution unit 321) is not to be stopped.
When the standby server 300 operates as the new active server and the active server 200 operates as the new standby server, the system switching communication unit 214 and the system switching communication unit 314 operate as follows.
When the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321, the system switching communication unit 314 (an example of a first activation unit) notifies the active server 200 of the stop of the guest OS unit 320 (application execution unit 321).
When the standby server 300 notifies it of the stop of the guest OS unit 320 (application execution unit 321), the system switching communication unit 214 (an example of a second notification unit) starts the guest OS unit 220 including the application execution unit 221.
The system switching communication unit 214 also sets the system switching state flag stored in the host OS storage unit 219 to the second continuation value (“system switching present”).
FIG. 9 is a flowchart showing the system switching method (active server) in Embodiment 2.
The system switching method of the active server 200 (or of the standby server 300 operating as a new active server) is described with reference to FIG. 9.
In the system switching method (active server), S153 is executed in addition to the processing described in Embodiment 1 (see FIG. 2).
That is, after the system switching control unit 212 stops the guest OS unit 220 (S152), the system switching communication unit 214 transmits a stop notification for the guest OS unit 220 to the standby server 300 (S153).
The other processing is the same as in Embodiment 1 (FIG. 2).
FIG. 10 is a flowchart showing the system switching method (standby server) in Embodiment 2.
The system switching method of the standby server 300 (or of the active server 200 operating as a new standby server) is described with reference to FIG. 10.
In the system switching method (standby server), S220B is executed in place of S220 (see FIG. 3) described in Embodiment 1.
In S220B, the system switching communication unit 314 determines whether a failure has occurred in the active server 200 as follows.
The system switching communication unit 314 determines that a failure has occurred in the active server 200 if it could not receive a heartbeat signal from the active server 200 within a predetermined monitoring period.
The system switching communication unit 314 also determines that a failure has occurred in the active server 200 if the software FT unit 311 could not acquire operation data from the active server 200.
The system switching communication unit 314 also determines that a failure has occurred in the active server 200 if it receives a stop notification for the guest OS unit 220 from the active server 200.
If a failure has occurred in the active server 200 (YES), the process proceeds to S230.
If no failure has occurred in the active server 200 (NO), S210 and S220B are repeated.
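The determination in S220B combines three conditions, any one of which is sufficient. The Python sketch below illustrates this under stated assumptions: `Observation`, `active_server_failed`, and `next_step` are hypothetical names, and the three inputs are assumed to have been collected elsewhere during one monitoring period.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the standby server observed during one monitoring period."""
    heartbeat_received: bool  # a heartbeat arrived within the monitoring period
    sync_succeeded: bool      # the software FT unit acquired operation data
    stop_notified: bool       # a stop notification for the guest OS was received


def active_server_failed(obs: Observation) -> bool:
    """S220B: any one of the three conditions indicates a failure."""
    return (not obs.heartbeat_received
            or not obs.sync_succeeded
            or obs.stop_notified)


def next_step(obs: Observation) -> str:
    """Return the step that follows S220B in the flowchart."""
    return "S230" if active_server_failed(obs) else "S210"
```

The third condition is what distinguishes Embodiment 2: a stop notification leads to S230 even while heartbeats and synchronization still appear normal.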
The other processing is the same as in Embodiment 1 (FIG. 3).
In Embodiment 2, for example, the following inter-server state synchronization method (system switching method) has been described.
The host OS (host OS unit) is provided with a system switching notification unit (system switching communication unit) and a system switching notification reception unit (system switching communication unit).
At the time of system switching, the system switching control unit of the host OS of the active system stops the guest OS (guest OS unit), and the system switching notification unit of the active system notifies the system switching notification reception unit of the standby system that system switching is taking place. The host OS of the standby system then starts the guest OS.
As a result, the standby server can detect that the active server has stopped without depending on the synchronization or heartbeat cycle, and can begin operating as a new active server immediately when the active server stops.
100 duplex system, 101 LAN, 102 shared storage, 200 active server, 201 hardware, 210 host OS unit, 211 software FT unit, 212 system switching control unit, 213 system switching detection unit, 214 system switching communication unit, 219 host OS storage unit, 220 guest OS unit, 221 application execution unit, 222 cluster software unit, 223 failure detection unit, 224 system switching instruction unit, 229 guest OS storage unit, 230 virtual machine monitor unit, 300 standby server, 301 hardware, 310 host OS unit, 311 software FT unit, 312 system switching control unit, 313 system switching detection unit, 314 system switching communication unit, 319 host OS storage unit, 320 guest OS unit, 321 application execution unit, 322 cluster software unit, 323 failure detection unit, 324 system switching instruction unit, 329 guest OS storage unit, 330 virtual machine monitor unit, 400 standby server, 401 hardware, 410 host OS unit, 420 guest OS unit, 430 virtual machine monitor unit, 901 CPU, 902 bus, 903 ROM, 904 RAM, 905 communication board, 911 display device, 912 keyboard, 913 mouse, 914 drive device, 920 magnetic disk device, 921 OS, 922 program group, 923 file group.

Claims (9)

  1.  A multiplex system comprising a first server that executes predetermined operation processing and a second server that executes the operation processing in place of the first server when a failure occurs in the first server, wherein
     the first server comprises:
     a first operation data storage unit that stores, as operation data, data including processing data used in the operation processing and a failure flag indicating whether a failure of the operation processing has occurred;
     a first execution unit that executes the operation processing using the processing data stored in the first operation data storage unit;
     a first failure detection unit that detects a failure of the operation processing executed by the first execution unit and, upon detecting a failure of the operation processing, sets the failure flag stored in the first operation data storage unit to a failure occurrence value indicating that a failure of the operation processing has occurred;
     a first stop unit that stops the first execution unit; and
     a first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit,
     the second server comprises:
     a second synchronization unit that acquires the operation data from the first server at every predetermined synchronization cycle;
     a second operation data storage unit that stores the operation data acquired by the second synchronization unit;
     a second execution unit that executes the operation processing using the operation data stored in the second operation data storage unit;
     a second stop flag storage unit that stores a second stop flag indicating whether the second execution unit may be stopped;
     a second monitoring unit that monitors the first server at every predetermined monitoring cycle and determines whether a failure has occurred in the first server;
     a second activation unit that, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets the second stop flag stored in the second stop flag storage unit to a second continuation value indicating that the second execution unit is not to be stopped;
     a second failure detection unit that detects a failure of the operation processing executed by the second execution unit and, upon detecting a failure of the operation processing, sets the failure flag stored in the second operation data storage unit to the failure occurrence value;
     a second stop unit that stops the second execution unit; and
     a second instruction unit that refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets the referenced failure flag to a non-failure occurrence value indicating that no failure of the operation processing has occurred, and
     the second stop unit, when instructed by the second instruction unit to stop the second execution unit, refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, sets the referenced second stop flag to a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, stops the second execution unit.
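The interplay of the failure flag and the second stop flag recited in claim 1 can be sketched in code. The following Python is only an illustrative sketch, not the applicant's implementation: the class `SecondServer`, its method names, and the enum names are hypothetical, and a single in-memory object stands in for the separate units and storage units of the second server. It shows that the first stop instruction issued after activation only clears the continuation value, while a later stop instruction actually stops the execution unit.

```python
from dataclasses import dataclass
from enum import Enum


class FailureFlag(Enum):
    NO_FAILURE = 0  # stands for the "non-failure occurrence value"
    FAILURE = 1     # stands for the "failure occurrence value"


class StopFlag(Enum):
    MAY_STOP = 0    # stands for the "non-second continuation value"
    CONTINUE = 1    # stands for the "second continuation value"


@dataclass
class SecondServer:
    """Hypothetical model of the second server's flags (names are assumptions)."""
    failure_flag: FailureFlag = FailureFlag.NO_FAILURE
    stop_flag: StopFlag = StopFlag.MAY_STOP
    running: bool = False

    def synchronize(self, first_server_flag: FailureFlag) -> None:
        # Second synchronization unit: copy the operation data (here only
        # the failure flag) acquired from the first server.
        self.failure_flag = first_server_flag

    def activate(self) -> None:
        # Second activation unit: start the execution unit and set the
        # stop flag to the continuation value.
        self.running = True
        self.stop_flag = StopFlag.CONTINUE

    def detect_failure(self) -> None:
        # Second failure detection unit: record a failure of the
        # operation processing in the failure flag.
        self.failure_flag = FailureFlag.FAILURE

    def instruct(self) -> None:
        # Second instruction unit: on a set failure flag, instruct the
        # stop unit and reset the flag to the non-failure value.
        if self.failure_flag is FailureFlag.FAILURE:
            self.request_stop()
            self.failure_flag = FailureFlag.NO_FAILURE

    def request_stop(self) -> None:
        # Second stop unit: a continuation value absorbs one stop
        # instruction; otherwise the execution unit is stopped.
        if self.stop_flag is StopFlag.CONTINUE:
            self.stop_flag = StopFlag.MAY_STOP
        else:
            self.running = False
```

Under the assumption that the failure flag reaches the second server already set (for example through synchronization), calling `synchronize(FailureFlag.FAILURE)`, `activate()`, and `instruct()` in that order leaves the execution unit running; a subsequent `detect_failure()` followed by `instruct()` stops it.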
  2.  The multiplex system according to claim 1, wherein
     the first server further comprises:
     a first stop flag storage unit that stores a first stop flag indicating whether the first execution unit may be stopped;
     a first synchronization unit that acquires the operation data from the second server at every predetermined synchronization cycle;
     a first monitoring unit that monitors the second server at every predetermined monitoring cycle and determines whether a failure has occurred in the second server; and
     a first activation unit that, when the first monitoring unit determines that a failure has occurred in the second server, starts the first execution unit and sets the first stop flag stored in the first stop flag storage unit to a first continuation value indicating that the first execution unit is not to be stopped, and
     the first stop unit, when instructed by the first instruction unit to stop the first execution unit, refers to the first stop flag stored in the first stop flag storage unit; when the first continuation value is set in the referenced first stop flag, sets the referenced first stop flag to a non-first continuation value indicating that the first execution unit may be stopped; and when the first continuation value is not set in the referenced first stop flag, stops the first execution unit.
  3.  The multiplex system according to claim 2, wherein
     the first server comprises a first guest computer and a first host computer,
     the first guest computer comprises the first execution unit, the first failure detection unit, the first instruction unit, and the first operation data storage unit,
     the first host computer comprises the first stop unit, the first synchronization unit, the first monitoring unit, the first activation unit, and the first stop flag storage unit,
     the second server comprises a second guest computer and a second host computer,
     the second guest computer comprises the second execution unit, the second failure detection unit, the second instruction unit, and the second operation data storage unit, and
     the second host computer comprises the second stop unit, the second synchronization unit, the second monitoring unit, the second activation unit, and the second stop flag storage unit.
  4.  A multiplex system comprising a first server that executes predetermined operation processing and a second server that executes the operation processing in place of the first server when a failure occurs in the first server, wherein
     the first server comprises:
     a first operation data storage unit that stores, as operation data, data including processing data used in the operation processing and a failure flag indicating whether a failure of the operation processing has occurred;
     a first execution unit that executes the operation processing using the processing data stored in the first operation data storage unit;
     a first failure detection unit that detects a failure of the operation processing executed by the first execution unit and, upon detecting a failure of the operation processing, sets the failure flag stored in the first operation data storage unit to a failure occurrence value indicating that a failure of the operation processing has occurred;
     a first stop unit that stops the first execution unit;
     a first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit; and
     a first notification unit that, when the first stop unit stops the first execution unit, notifies the second server of the stop of the first execution unit,
     the second server comprises:
     a second synchronization unit that acquires the operation data from the first server at every predetermined synchronization cycle;
     a second operation data storage unit that stores the operation data acquired by the second synchronization unit;
     a second execution unit that executes the operation processing using the operation data stored in the second operation data storage unit;
     a second stop flag storage unit that stores a second stop flag indicating whether the second execution unit may be stopped;
     a second activation unit that, when notified by the first server of the stop of the first execution unit, starts the second execution unit and sets the second stop flag stored in the second stop flag storage unit to a second continuation value indicating that the second execution unit is not to be stopped;
     a second failure detection unit that detects a failure of the operation processing executed by the second execution unit and, upon detecting a failure of the operation processing, sets the failure flag stored in the second operation data storage unit to the failure occurrence value;
     a second stop unit that stops the second execution unit; and
     a second instruction unit that refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets the referenced failure flag to a non-failure occurrence value indicating that no failure of the operation processing has occurred, and
     the second stop unit, when instructed by the second instruction unit to stop the second execution unit, refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, sets the referenced second stop flag to a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, stops the second execution unit.
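The notification-driven takeover recited in claim 4 can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the classes `FirstServer` and `SecondServer`, their method names, and the boolean flags are hypothetical, and an in-process callback stands in for whatever inter-server message the first notification unit would actually send.

```python
from typing import Callable


class SecondServer:
    """Hypothetical second server that starts on a stop notification."""

    def __init__(self) -> None:
        self.running = False
        self.stop_flag_continue = False  # True models the continuation value

    def on_stop_notified(self) -> None:
        # Second activation unit: start the execution unit and set the
        # second stop flag to the continuation value.
        self.running = True
        self.stop_flag_continue = True


class FirstServer:
    """Hypothetical first server that notifies its peer when it stops."""

    def __init__(self, notify_peer: Callable[[], None]) -> None:
        self.running = True
        self.failure_flag = False  # True models the failure occurrence value
        self._notify_peer = notify_peer  # stands in for the notification unit

    def detect_failure(self) -> None:
        # First failure detection unit: set the failure flag.
        self.failure_flag = True

    def instruct(self) -> None:
        # First instruction unit: a set failure flag triggers a stop.
        if self.failure_flag:
            self.stop()

    def stop(self) -> None:
        # First stop unit stops the execution unit; the notification
        # unit then informs the second server of the stop.
        self.running = False
        self._notify_peer()
```

In this sketch the second server is started by the notification itself, so no monitoring loop on the second server appears in the code; constructing `FirstServer(second.on_stop_notified)` wires the two objects together.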
  5.  The multiplex system according to claim 4, wherein
     the second server further comprises
     a second notification unit that, when the second stop unit stops the second execution unit, notifies the first server of the stop of the second execution unit,
     the first server further comprises:
     a first stop flag storage unit that stores a first stop flag indicating whether the first execution unit may be stopped;
     a first synchronization unit that acquires the operation data from the second server at every predetermined synchronization cycle; and
     a first activation unit that, when notified by the second server of the stop of the second execution unit, starts the first execution unit and sets the first stop flag stored in the first stop flag storage unit to a first continuation value indicating that the first execution unit is not to be stopped, and
     the first stop unit, when instructed by the first instruction unit to stop the first execution unit, refers to the first stop flag stored in the first stop flag storage unit; when the first continuation value is set in the referenced first stop flag, sets the referenced first stop flag to a non-first continuation value indicating that the first execution unit may be stopped; and when the first continuation value is not set in the referenced first stop flag, stops the first execution unit.
  6.  The multiplex system according to claim 5, wherein
     the first server comprises a first guest computer and a first host computer,
     the first guest computer comprises the first execution unit, the first failure detection unit, the first instruction unit, and the first operation data storage unit,
     the first host computer comprises the first notification unit, the first stop unit, the first synchronization unit, the first monitoring unit, the first activation unit, and the first stop flag storage unit,
     the second server comprises a second guest computer and a second host computer,
     the second guest computer comprises the second execution unit, the second failure detection unit, the second instruction unit, and the second operation data storage unit, and
     the second host computer comprises the second notification unit, the second stop unit, the second synchronization unit, the second monitoring unit, the second activation unit, and the second stop flag storage unit.
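The division of units between guest computer and host computer recited in claim 6 can be written down as a small lookup table. This is only a reading aid under stated assumptions: the names `Layer`, `PLACEMENT`, and `units_on` are hypothetical, and the table lists the unit names without the "first"/"second" prefix because the claim assigns them identically on both servers.

```python
from enum import Enum


class Layer(Enum):
    GUEST = "guest computer"
    HOST = "host computer"


# Placement of each unit as listed in claim 6 (same for both servers).
PLACEMENT = {
    "execution unit": Layer.GUEST,
    "failure detection unit": Layer.GUEST,
    "instruction unit": Layer.GUEST,
    "operation data storage unit": Layer.GUEST,
    "notification unit": Layer.HOST,
    "stop unit": Layer.HOST,
    "synchronization unit": Layer.HOST,
    "monitoring unit": Layer.HOST,
    "activation unit": Layer.HOST,
    "stop flag storage unit": Layer.HOST,
}


def units_on(layer: Layer) -> list[str]:
    """Return the names of the units placed on the given layer, sorted."""
    return sorted(name for name, where in PLACEMENT.items() if where is layer)
```

The table makes one property of the claim easy to see: the failure flag is written and read on the guest side, whereas the stop flag and the unit that acts on it sit on the host side.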
  7.  The multiplex system according to claim 3 or claim 6, wherein the first guest computer, the first host computer, the second guest computer, and the second host computer are configured as virtual machines.
  8.  A system switching method for a multiplex system comprising a first server that executes predetermined operation processing and a second server that executes the operation processing in place of the first server when a failure occurs in the first server, the method comprising:
     storing, by a first operation data storage unit of the first server, as operation data, data including processing data used in the operation processing and a failure flag indicating whether a failure of the operation processing has occurred;
     executing, by a first execution unit of the first server, the operation processing using the processing data stored in the first operation data storage unit;
     detecting, by a first failure detection unit of the first server, a failure of the operation processing executed by the first execution unit and, upon detecting a failure of the operation processing, setting the failure flag stored in the first operation data storage unit to a failure occurrence value indicating that a failure of the operation processing has occurred;
     referring, by a first instruction unit of the first server, to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructing a stop of the first execution unit;
     stopping, by a first stop unit of the first server, the first execution unit;
     storing, by a second stop flag storage unit of the second server, a second stop flag indicating whether a second execution unit may be stopped;
     acquiring, by a second synchronization unit of the second server, the operation data from the first server at every predetermined synchronization cycle;
     storing, by a second operation data storage unit of the second server, the operation data acquired by the second synchronization unit;
     monitoring, by a second monitoring unit of the second server, the first server at every predetermined monitoring cycle and determining whether a failure has occurred in the first server;
     when the second monitoring unit determines that a failure has occurred in the first server, starting, by a second activation unit of the second server, the second execution unit and setting the second stop flag stored in the second stop flag storage unit to a second continuation value indicating that the second execution unit is not to be stopped;
     executing, by the second execution unit of the second server, the operation processing using the operation data stored in the second operation data storage unit;
     detecting, by a second failure detection unit of the second server, a failure of the operation processing executed by the second execution unit and, upon detecting a failure of the operation processing, setting the failure flag stored in the second operation data storage unit to the failure occurrence value;
     referring, by a second instruction unit of the second server, to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructing a stop of the second execution unit and setting the referenced failure flag to a non-failure occurrence value indicating that no failure of the operation processing has occurred; and
     when a second stop unit of the second server is instructed by the second instruction unit to stop the second execution unit, referring, by the second stop unit, to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, setting the referenced second stop flag to a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, stopping the second execution unit.
  9.  A system switching method for a multiplex system comprising a first server that executes predetermined operation processing and a second server that executes the operation processing in place of the first server when a failure occurs in the first server, the method comprising:
     storing, by a first operation data storage unit of the first server, as operation data, data including processing data used in the operation processing and a failure flag indicating whether a failure of the operation processing has occurred;
     executing, by a first execution unit of the first server, the operation processing using the processing data stored in the first operation data storage unit;
     detecting, by a first failure detection unit of the first server, a failure of the operation processing executed by the first execution unit and, upon detecting a failure of the operation processing, setting the failure flag stored in the first operation data storage unit to a failure occurrence value indicating that a failure of the operation processing has occurred;
     referring, by a first instruction unit of the first server, to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructing a stop of the first execution unit;
     stopping, by a first stop unit of the first server, the first execution unit;
     when the first stop unit stops the first execution unit, notifying, by a first notification unit of the first server, the second server of the stop of the first execution unit;
     storing, by a second stop flag storage unit of the second server, a second stop flag indicating whether a second execution unit may be stopped;
     acquiring, by a second synchronization unit of the second server, the operation data from the first server at every predetermined synchronization cycle;
     storing, by a second operation data storage unit of the second server, the operation data acquired by the second synchronization unit;
     when the stop of the first execution unit is notified from the first server, starting, by a second activation unit of the second server, the second execution unit and setting the second stop flag stored in the second stop flag storage unit to a second continuation value indicating that the second execution unit is not to be stopped;
     executing, by the second execution unit of the second server, the operation processing using the operation data stored in the second operation data storage unit;
     detecting, by a second failure detection unit of the second server, a failure of the operation processing executed by the second execution unit and, upon detecting a failure of the operation processing, setting the failure flag stored in the second operation data storage unit to the failure occurrence value;
     referring, by a second instruction unit of the second server, to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructing a stop of the second execution unit and setting the referenced failure flag to a non-failure occurrence value indicating that no failure of the operation processing has occurred; and
     when a second stop unit of the second server is instructed by the second instruction unit to stop the second execution unit, referring, by the second stop unit, to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, setting the referenced second stop flag to a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, stopping the second execution unit.
PCT/JP2010/072272 2010-12-10 2010-12-10 Multiplex system and method for switching multiplex system WO2012077235A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2012547662A JP5342701B2 (en) 2010-12-10 2010-12-10 Multisystem system and system switching method for multisystem
PCT/JP2010/072272 WO2012077235A1 (en) 2010-12-10 2010-12-10 Multiplex system and method for switching multiplex system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2010/072272 WO2012077235A1 (en) 2010-12-10 2010-12-10 Multiplex system and method for switching multiplex system

Publications (1)

Publication Number Publication Date
WO2012077235A1 (en) 2012-06-14

Family

ID=46206754

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/072272 WO2012077235A1 (en) 2010-12-10 2010-12-10 Multiplex system and method for switching multiplex system

Country Status (2)

Country Link
JP (1) JP5342701B2 (en)
WO (1) WO2012077235A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015138292A (en) * 2014-01-20 2015-07-30 横河電機株式会社 Process controller and update method thereof
JP5855724B1 (en) * 2014-09-16 2016-02-09 日本電信電話株式会社 Virtual device management apparatus, virtual device management method, and virtual device management program
JP2016095770A (en) * 2014-11-17 2016-05-26 富士電機株式会社 Controller and redundancy control system using the same
EP4345626A1 (en) * 2022-09-30 2024-04-03 Yokogawa Electric Corporation Primary machine and fault-tolerant system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61136137A (en) * 1984-12-07 1986-06-24 Yokogawa Electric Corp Duplex computer system
JPH06131208A (en) * 1992-10-20 1994-05-13 Fujitsu Ltd Switching system between in-operation device and stand-by device
JPH0991233A (en) * 1995-09-27 1997-04-04 Nec Corp Network connection device
JP2006285631A (en) * 2005-03-31 2006-10-19 Yokogawa Electric Corp Duplex system
JP2010160660A (en) * 2009-01-07 2010-07-22 Nec Corp Network interface, computer system, operation method therefor, and program

Also Published As

Publication number Publication date
JPWO2012077235A1 (en) 2014-05-19
JP5342701B2 (en) 2013-11-13

Similar Documents

Publication Publication Date Title
US9582373B2 (en) Methods and systems to hot-swap a virtual machine
CN108139925B (en) High availability of virtual machines
TWI537828B (en) Method, computer system and computer program for virtual machine management
JP5851503B2 (en) Providing high availability for applications in highly available virtual machine environments
US7617411B2 (en) Cluster system and failover method for cluster system
US9176834B2 (en) Tolerating failures using concurrency in a cluster
US20150149813A1 (en) Failure recovery system and method of creating the failure recovery system
JP5579650B2 (en) Apparatus and method for executing monitored process
JP5561622B2 (en) Multiplexing system, data communication card, state abnormality detection method, and program
JP2012221321A (en) Fault tolerant computer system, control method for fault tolerant computer system and control program for fault tolerant computer system
JP2006072591A (en) Virtual computer control method
JP2010067042A (en) Computer switching method, computer switching program, and computer system
JP5342701B2 (en) Multisystem system and system switching method for multisystem
US9210059B2 (en) Cluster system
US20110209148A1 (en) Information processing device, virtual machine connection method, program, and recording medium
WO2012004902A1 (en) Computer system and system switch control method for computer system
EP2725496A1 (en) Information processing device, virtual machine control method and program
JPWO2015104841A1 (en) MULTISYSTEM SYSTEM AND MULTISYSTEM SYSTEM MANAGEMENT METHOD
JP2015148893A (en) virtualization system, control method, and control program
JP2014191491A (en) Information processor and information processing system
JP2021002144A (en) Information processing device, control method of information processing device, and control program of information processing device
JP5335150B2 (en) Computer apparatus and program
JP6424134B2 (en) Computer system and computer system control method
CN112912848A (en) Power supply request management method in cluster operation process
JP2013254354A (en) Computer device, software management method and program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 10860561; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2012547662; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 10860561; Country of ref document: EP; Kind code of ref document: A1)