WO2012077235A1

WO2012077235A1 - Multiplex system and method for switching multiplex system

Info

Publication number: WO2012077235A1
Application number: PCT/JP2010/072272
Authority: WO
Inventors: 峯村　治実; 俊介國分
Original assignee: 三菱電機株式会社
Priority date: 2010-12-10
Filing date: 2010-12-10
Publication date: 2012-06-14
Also published as: JPWO2012077235A1; JP5342701B2

Abstract

The purpose of the present invention is to prevent a standby system from crashing due to the operational data of an operational system. In an operational server (200), a failure detection unit (223) sets a system switching instruction flag to "permitted" when a hardware failure has been detected. A guest OS (220) is subsequently suspended. In a standby server (300), a software FT unit (311) periodically acquires operational data containing the system switching instruction flag from the operational server (200). When a heartbeat signal from the operational server (200) ceases, a system switching detection unit (313) launches a guest OS (320), and sets a system switching status flag to "system switched". A system switching instruction unit (324) detects a hardware failure by referring to the system switching instruction flag, and notifies the system switching detection unit (313) of the system switching instruction. Having received the system switching instruction, the system switching detection unit (313) suspends the guest OS (320) only when the system switching status flag has not been set to "system switched".

Description

Multiplex system and system switching method for multiplex system

The present invention relates to a multiplex system that switches a standby server to a new active server when a failure occurs in the active server, for example, and a system switching method of the multiplex system.

There is a technology that realizes fault tolerance in software by applying virtualization technology and synchronizing virtual machines between two physical servers, an active system and a standby system (Non-Patent Document 1, Non-Patent Document 2). This technique is called “software FT”.

In software FT, the standby system periodically acquires the operation data of the active system. When a failure occurs in the active system, the standby system takes over the processing of the active system using the operation data acquired from the active system.
For this reason, if the standby system acquires operation data at the moment a failure occurs in the active system, the standby system also stops in accordance with the operation data, which indicates that a failure has occurred.
A situation in which the standby system stops together with the active system in this way is called “co-death”.
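
The co-death problem described above can be illustrated with a minimal Python sketch. The `OperationData` record and the `naive_takeover` routine are hypothetical names that do not appear in this document, and the sketch shows only the failure mode that the invention addresses, not the invention itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OperationData:
    """Hypothetical stand-in for the operation data copied to the standby."""
    counter: int           # stand-in for application processing data
    failed: bool = False   # failure state recorded by the active system


def naive_takeover(snapshot: OperationData) -> str:
    """Standby resumes from the last snapshot and obeys whatever it contains."""
    if snapshot.failed:
        return "stopped"   # the replicated failure state stops the standby too
    return "running"


active = OperationData(counter=41)
healthy_snapshot = active                        # synchronized before the failure
failed_snapshot = replace(active, failed=True)   # synchronized after the failure

print(naive_takeover(healthy_snapshot))  # running
print(naive_takeover(failed_snapshot))   # stopped (co-death)
```

Whether co-death occurs therefore depends only on when the last synchronization happened relative to the failure.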

An object of the present invention is, for example, to prevent the standby system from stopping in accordance with the operation data of the active system (co-death).

The multiplex system of the present invention includes a first server that executes predetermined operation processing, and a second server that executes operation processing instead of the first server when a failure occurs in the first server.
The first server includes:
a first operation data storage unit that stores, as operation data, data including processing data used for the operation processing and a failure flag indicating whether or not a failure of the operation processing has occurred;
a first execution unit that executes the operation processing using the processing data stored in the first operation data storage unit;
a first failure detection unit that detects a failure of the operation processing executed by the first execution unit and, when a failure is detected, sets in the failure flag stored in the first operation data storage unit a failure occurrence value indicating that a failure of the operation processing has occurred;
a first stop unit that stops the first execution unit; and
a first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit.
The second server includes:
a second synchronization unit that acquires the operation data from the first server at every predetermined synchronization period;
a second operation data storage unit that stores the operation data acquired by the second synchronization unit;
a second execution unit that executes the operation processing using the operation data stored in the second operation data storage unit;
a second stop flag storage unit that stores a second stop flag indicating whether or not the second execution unit may be stopped;
a second monitoring unit that monitors the first server at every predetermined monitoring cycle and determines whether or not a failure has occurred in the first server;
a second activation unit that, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped;
a second failure detection unit that detects a failure of the operation processing executed by the second execution unit and, when a failure is detected, sets the failure occurrence value in the failure flag stored in the second operation data storage unit;
a second stop unit that stops the second execution unit; and
a second instruction unit that refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets in the referenced failure flag a non-failure occurrence value indicating that no failure of the operation processing has occurred.
When instructed by the second instruction unit to stop the second execution unit, the second stop unit refers to the second stop flag stored in the second stop flag storage unit. When the second continuation value is set in the referenced second stop flag, the second stop unit sets in that flag a non-second continuation value indicating that the second execution unit may be stopped; when the second continuation value is not set, the second stop unit stops the second execution unit.

According to the present invention, for example, it is possible to prevent the standby system (second server) from stopping according to the operation data of the active system (first server) (co-death).

FIG. 1 is a configuration diagram of the dual system 100 according to Embodiment 1.
FIG. 2 is a flowchart showing a system switching method (active server) in the first embodiment.
FIG. 3 is a flowchart showing a system switching method (standby server) in the first embodiment.
FIG. 4 is a table showing the relationship between a failure and the means for detecting the failure in the first embodiment.
FIG. 5 is a flowchart showing the relationship between the synchronization timing between the active server 200 and the standby server 300 and the operation of the standby server 300 in the first embodiment.
FIG. 6 is a diagram illustrating an example of hardware resources of the active server 200 and the standby server 300 according to Embodiment 1.
FIG. 7 is a configuration diagram showing another form of the dual system 100 according to the first embodiment.
FIG. 8 is a configuration diagram of the dual system 100 according to a second embodiment.
FIG. 9 is a flowchart showing a system switching method (active server) in the second embodiment.
FIG. 10 is a flowchart showing a system switching method (standby server) in the second embodiment.

Embodiment 1
A dual system in which the standby server operates instead of the active server when a failure occurs on the active server will be described.

FIG. 1 is a configuration diagram of the dual system 100 according to the first embodiment.
The dual system 100 according to Embodiment 1 will be described with reference to FIG. 1.

The dual system 100 (an example of a multiplex system) includes an active server 200, a standby server 300, and a LAN 101 (Local Area Network).
The active server 200 and the standby server 300 communicate via the LAN 101. The LAN 101 is an example of a network.

The active server 200 (an example of a first server) is a server device that executes predetermined operation processing.
The active server 200 includes a host OS unit 210, a guest OS unit 220, and a virtual machine monitor unit 230.
Further, the active server 200 includes a CPU (Central Processing Unit), a memory, and the like as the hardware 201.

The virtual machine monitor unit 230 executes a virtual machine monitor and controls the host OS unit 210 and the guest OS unit 220 as virtual machines.
The virtual machine monitor is a function for controlling a plurality of virtual machines by allocating hardware resources (CPU usage time, storage areas in a memory, etc.) of the computer (server device) to the plurality of virtual machines.

The host OS unit 210 (an example of a first host computer) is a virtual machine that manages the guest OS unit 220. Hereinafter, an OS (Operating System) of the host OS unit 210 is referred to as a host OS.
The host OS unit 210 includes a software FT unit 211, a system switching control unit 212, a system switching detection unit 213, and a host OS storage unit 219.
The host OS unit 210 and each “~ unit” included in the host OS unit 210 operate using hardware resources allocated to the host OS unit 210.

The guest OS unit 220 (an example of a first guest computer) is a virtual machine that executes predetermined operation processing. Hereinafter, the OS of the guest OS unit 220 is referred to as a guest OS.
The guest OS unit 220 includes an application execution unit 221, a cluster software unit 222, a system switching instruction unit 224, and a guest OS storage unit 229. The cluster software unit 222 includes a failure detection unit 223.
The guest OS unit 220 and each “~ unit” included in the guest OS unit 220 operate using hardware resources allocated to the guest OS unit 220.

The guest OS storage unit 229 (an example of a first operation data storage unit) stores various data used by the guest OS unit 220.
For example, the guest OS storage unit 229 stores operation data. The operation data is data including processing data used for operation processing and a failure flag (system switching instruction flag described later) indicating whether or not an operation processing failure has occurred.

The application execution unit 221 (an example of a first execution unit) executes an application program (hereinafter referred to as an application) that describes a processing procedure of a predetermined operation process by using processing data stored in the guest OS storage unit 229.

A failure detection unit 223 (an example of a first failure detection unit) included in the cluster software unit 222 detects a failure in an operation process executed by the application execution unit 221.
When the failure detection unit 223 detects a failure of the operation processing, it sets, in the failure flag stored in the guest OS storage unit 229, a failure occurrence value (“permitted”, described later) indicating that a failure of the operation processing has occurred.

The system switching instruction unit 224 (an example of a first instruction unit) refers to a failure flag stored in the guest OS storage unit 229.
When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 224 instructs the system switching control unit 212 to stop the application execution unit 221 (system switching instruction described later).

When the system switching control unit 212 (an example of a first stopping unit) is instructed to stop the application execution unit 221 from the system switching instruction unit 224, the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221.

When a failure occurs in the active server 200, the standby server 300 operates as a new active server, and when the active server 200 recovers, the active server 200 operates as a new standby server.

The host OS storage unit 219 (an example of a first stop flag storage unit) stores various data used by the host OS unit 210.
For example, the host OS storage unit 219 stores a first stop flag (a system switching state flag described later) indicating whether the application execution unit 221 can be stopped.

When the standby server 300 operates as a new active server and the active server 200 operates as a new standby server, each “~ unit” of the active server 200 operates as follows.

The software FT unit 211 (an example of a first synchronization unit) acquires operation data from the standby server 300 every predetermined synchronization period.

The system switching detection unit 213 (an example of a first monitoring unit and a first activation unit) monitors the standby server 300 every predetermined monitoring cycle and determines whether or not a failure has occurred in the standby server 300.
If the system switching detection unit 213 determines that a failure has occurred in the standby server 300, it activates the guest OS unit 220 including the application execution unit 221. Further, the system switching detection unit 213 sets, in the first stop flag stored in the host OS storage unit 219, a first continuation value (“system switching present”, described later) indicating that the application execution unit 221 is not to be stopped.

When the system switching control unit 212 (an example of a first stop unit) is instructed by the system switching instruction unit 224 to stop the application execution unit 221, it refers to the first stop flag stored in the host OS storage unit 219.
When the first continuation value is set in the referenced first stop flag, the system switching control unit 212 sets in that flag a value (“no system switching”, described later) indicating that the application execution unit 221 may be stopped.
When the first continuation value is not set in the referenced first stop flag, the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221.

The standby server 300 (an example of a second server) is a server device that executes operation processing instead of the active server 200 when a failure occurs in the active server 200.
The standby server 300 includes a host OS unit 310, a guest OS unit 320, and a virtual machine monitor unit 330, similar to the active server 200.
The standby server 300 includes hardware 301 as with the active server 200.

The host OS unit 310 (an example of a second host computer) includes a software FT unit 311, a system switching control unit 312, a system switching detection unit 313, and a host OS storage unit 319, in the same manner as the host OS unit 210 of the active server 200.
The host OS unit 310 and each “~ unit” included in the host OS unit 310 operate using hardware resources allocated to the host OS unit 310.

The guest OS unit 320 (an example of a second guest computer) includes an application execution unit 321, a cluster software unit 322, a system switching instruction unit 324, and a guest OS storage unit 329, similar to the guest OS unit 220 of the active server 200. The cluster software unit 322 includes a failure detection unit 323.
The guest OS unit 320 and each “~ unit” included in the guest OS unit 320 operate using hardware resources allocated to the guest OS unit 320.

The software FT unit 311 (an example of a second synchronization unit) acquires operation data from the active server 200 at every predetermined synchronization period.

The guest OS storage unit 329 (an example of a second operation data storage unit) stores various data used by the guest OS unit 320.
For example, the guest OS storage unit 329 stores operation data (processing data, failure flag, etc.) acquired by the software FT unit 311.

The application execution unit 321 (an example of a second execution unit) executes an operation process using the operation data stored in the guest OS storage unit 329.

The host OS storage unit 319 (an example of a second stop flag storage unit) stores various data used by the host OS unit 310.
For example, the host OS storage unit 319 stores a second stop flag (system switching state flag described later) indicating whether the application execution unit 321 can be stopped.

The system switching detection unit 313 (an example of a second monitoring unit and a second activation unit) monitors the active server 200 every predetermined monitoring cycle and determines whether or not a failure has occurred in the active server 200.
When the system switching detection unit 313 determines that a failure has occurred in the active server 200, it activates the guest OS unit 320 including the application execution unit 321. In addition, the system switching detection unit 313 sets, in the second stop flag stored in the host OS storage unit 319, a second continuation value (“system switching present”, described later) indicating that the application execution unit 321 is not to be stopped.

A failure detection unit 323 (an example of a second failure detection unit) included in the cluster software unit 322 detects a failure in an operation process executed by the application execution unit 321.
When the failure detection unit 323 detects a failure in the operation process, the failure detection unit 323 sets the failure occurrence value in the failure flag stored in the guest OS storage unit 329.

The system switching instruction unit 324 (an example of a second instruction unit) refers to the failure flag (a system switching instruction flag described later) stored in the guest OS storage unit 329.
When the failure occurrence value is set in the referenced failure flag, the system switching instruction unit 324 instructs the system switching control unit 312 to stop the application execution unit 321 (a system switching instruction described later), and sets in the referenced failure flag a non-failure occurrence value (“not occurred”, described later) indicating that no failure of the operation processing has occurred.

When the system switching control unit 312 (an example of a second stop unit) is instructed by the system switching instruction unit 324 to stop the application execution unit 321, it refers to the second stop flag stored in the host OS storage unit 319.
When the second continuation value is set in the referenced second stop flag, the system switching control unit 312 sets in that flag a value (“no system switching”, described later) indicating that the application execution unit 321 may be stopped.
When the second continuation value is not set in the referenced second stop flag, the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321.

FIG. 2 is a flowchart showing a system switching method (active server) in the first embodiment.
A system switching method of the active server 200 (or the standby server 300 operating as a new active server) will be described with reference to FIG. 2.

S110 to S130 described below are executed in parallel.

In S110, the application execution unit 221 executes an application program in which a predetermined operation processing procedure is described.
The processing data to be processed by the operation processing and the processing data resulting from the operation processing are stored in the guest OS storage unit 229. The guest OS storage unit 229 is, for example, a storage area in a memory allocated to the guest OS unit 220.
Next, S120 will be described.

In S120, every time a predetermined heartbeat notification period elapses, the system switching detection unit 213 transmits a heartbeat signal to the standby server 300 to notify it that the guest OS unit 220 is operating normally.

For example, the system switching detection unit 213 starts a heartbeat notification timer, and transmits a heartbeat signal to the standby server 300 when a timeout is notified from the heartbeat notification timer. Then, the system switching detection unit 213 starts the heartbeat notification timer anew.
However, when the guest OS unit 220 is stopped, the system switching detection unit 213 does not transmit a heartbeat signal.
The heartbeat notification timer is a function for notifying a timeout when the heartbeat notification cycle has elapsed since the activation.
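
The timer-driven heartbeat of S120 can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `HeartbeatSender`, its `tick` method, and the injected `send` callable are hypothetical, time is passed in explicitly instead of using a real timer, and deadlines are kept on a fixed grid rather than restarted from the actual send time.

```python
class HeartbeatSender:
    """Sends a heartbeat each time the notification period elapses (S120)."""

    def __init__(self, period: float, send):
        self.period = period          # heartbeat notification period (seconds)
        self.send = send              # hypothetical callable that delivers the signal
        self.guest_running = True     # no heartbeat is sent once the guest OS stops
        self._deadline = period       # first timer expiry

    def tick(self, now: float) -> None:
        """Call with the current time; handles every timer expiry up to `now`."""
        while now >= self._deadline:
            if self.guest_running:
                self.send(self._deadline)
            self._deadline += self.period   # start the timer anew


sent = []
sender = HeartbeatSender(period=1.0, send=sent.append)
for t in (0.5, 1.0, 2.5):
    sender.tick(t)
sender.guest_running = False   # guest OS stopped after a failure
sender.tick(4.0)
print(sent)  # [1.0, 2.0]
```

After the guest OS stops, the timer keeps expiring but nothing is sent, which is what allows the standby server to notice the failure.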

Next, S130 will be described.

In S130, the failure detection unit 223 determines whether a hardware failure or a software failure has occurred in the operation processing executed by the application execution unit 221.

For example, the failure detection unit 223 monitors the hardware 201 (external storage device, communication device, etc.) accessed in the operation processing and, using a response wait timer, detects a response delay (timeout) from the hardware 201 as a hardware failure. The response wait timer is a function that times out when a predetermined response waiting time elapses after activation.
Further, the failure detection unit 223 detects a defect in the application program as a software failure. A memory shortage or a release error that does not release the secured storage area is an example of a malfunction of the application program.

When a hardware failure or a software failure occurs, the cluster software unit 222 executes a predetermined failure process. For example, the cluster software unit 222 displays information on the failure that has occurred on the display device.

When neither a hardware failure nor a software failure has occurred (not occurred), S110 to S130 are repeatedly executed.
If a hardware failure has occurred, the process proceeds to S131.
If a software failure has occurred, the process proceeds to S132.

In S131, the failure detection unit 223 sets “permitted” in the system switching instruction flag (an example of the failure flag) stored in advance in the guest OS storage unit 229.
The initial value of the system switching instruction flag is “not occurred”. “Not occurred” means that no failure has occurred in the operation processing, and “permitted” means that a hardware failure has occurred. “Permitted” also means that the system switching instruction may be discarded.
After S131, the process proceeds to S140.

In S132, the failure detection unit 223 sets “no” in the system switching instruction flag. “No” means that a software failure has occurred. “No” also means that the system switching instruction must not be discarded.
After S132, the process proceeds to S140.
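
The mapping from the detected failure type to the system switching instruction flag (S130 to S132) can be sketched as below. The enum and function names are hypothetical, and the string labels are only illustrative renderings of the three flag states described in the text.

```python
from enum import Enum


class FailureKind(Enum):
    """Result of the check in S130 (hypothetical type)."""
    NONE = "none"
    HARDWARE = "hardware"   # e.g. response timeout from a device
    SOFTWARE = "software"   # e.g. application defect such as a memory leak


class SwitchInstructionFlag(Enum):
    """States of the system switching instruction flag."""
    NOT_OCCURRED = "not occurred"  # initial value, no failure
    PERMITTED = "permitted"        # hardware failure; instruction may be discarded
    NO = "no"                      # software failure; instruction must not be discarded


def flag_for(kind: FailureKind) -> SwitchInstructionFlag:
    """Choose the flag value the failure detection unit would set."""
    if kind is FailureKind.HARDWARE:
        return SwitchInstructionFlag.PERMITTED   # S131
    if kind is FailureKind.SOFTWARE:
        return SwitchInstructionFlag.NO          # S132
    return SwitchInstructionFlag.NOT_OCCURRED    # keep running S110 to S130


print(flag_for(FailureKind.HARDWARE).value)  # permitted
print(flag_for(FailureKind.SOFTWARE).value)  # no
```

Because the flag lives in the guest OS storage, whatever value is set here is also copied to the standby server at the next synchronization.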

In S140, the system switching instruction unit 224 refers to the system switching instruction flag stored in the guest OS storage unit 229 every time a predetermined failure detection period elapses.
When the system switching instruction flag is set to “permitted” or “no”, the system switching instruction unit 224 notifies the system switching control unit 212 of data including the value of the system switching instruction flag as a system switching instruction.

For example, the system switching instruction unit 224 activates a failure detection timer and refers to the system switching instruction flag when a timeout is notified from the failure detection timer. The failure detection timer is a function for notifying a timeout when a failure detection cycle has elapsed since the activation.
The system switching instruction unit 224 inputs a system switching instruction to the system switching control unit 212 when “permitted” or “no” is set in the referenced system switching instruction flag.
The system switching instruction unit 224 starts the failure detection timer anew when “not occurred” is set in the referenced system switching instruction flag.

For example, the system switching instruction unit 224 notifies the system switching control unit 212 of the system switching instruction by setting the system switching instruction in a predetermined storage area provided for delivery of the system switching instruction.

Further, the system switching instruction unit 224 sets “not occurred” in the system switching instruction flag stored in the guest OS storage unit 229. That is, the system switching instruction unit 224 initializes the system switching instruction flag.
After S140, the process proceeds to S150.
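
One expiry of the failure detection timer in S140 can be sketched as follows. The function name, the dictionary used as guest OS storage, and the `notify` callable are assumptions made for illustration; the flag labels stand for the initial, hardware-failure, and software-failure states.

```python
def poll_switch_flag(guest_storage: dict, notify) -> bool:
    """Handle one failure detection timer timeout (S140).

    Returns True if a system switching instruction was issued.
    """
    flag = guest_storage["switch_instruction_flag"]
    if flag == "not occurred":
        return False                      # nothing to do; the timer is simply restarted
    notify({"flag": flag})                # hand the instruction to the control unit
    guest_storage["switch_instruction_flag"] = "not occurred"   # initialize the flag
    return True


inbox = []   # hypothetical hand-over area for system switching instructions
storage = {"switch_instruction_flag": "permitted"}

print(poll_switch_flag(storage, inbox.append))  # True
print(inbox)                                    # [{'flag': 'permitted'}]
print(poll_switch_flag(storage, inbox.append))  # False, flag was reset
```

Resetting the flag after issuing the instruction means a replicated failure state triggers at most one instruction on the standby side.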

In S150, the system switching control unit 212 refers to the system switching instruction notified from the system switching instruction unit 224.
If the value of the system switching instruction flag included in the system switching instruction is “permitted”, the process proceeds to S151.
If the value of the system switching instruction flag included in the system switching instruction is “no”, the process proceeds to S152.

In S151, the system switching control unit 212 refers to a system switching status flag (an example of a first stop flag) stored in advance in the host OS storage unit 219. The host OS storage unit 219 is a storage area in a memory allocated to the host OS unit 210, for example.
The initial value of the system switching status flag is “no system switching”. “No system switching” means that neither system switching processing from the active server 200 to the standby server 300 nor system switching processing from the standby server 300 (new active server) to the active server 200 (new standby server) is being performed.
Either “no system switching” or “system switching present” is set in the system switching status flag. “System switching present” means that system switching processing is being performed.
If “no system switching” is set in the system switching status flag, the process proceeds to S152.
When “system switching present” is set in the system switching status flag, the system switching control unit 212 discards the system switching instruction, and sets “no system switching” in the system switching status flag. Then, S110 to S130 are continued.

In S152, the system switching control unit 212 stops the guest OS unit 220 via the virtual machine monitor unit 230.
After the guest OS unit 220 is stopped, the system switching detection unit 213 no longer transmits a heartbeat signal (S120).
After S152, the process proceeds to S200.
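
The decision made by the system switching control unit in S150 to S152 can be sketched as a single function. The function name and the dictionary standing in for the host OS storage unit are hypothetical; the return strings are illustrative labels for the two outcomes.

```python
def handle_switch_instruction(instruction_flag: str, host_storage: dict) -> str:
    """Decide what to do with one system switching instruction (S150 to S152)."""
    if instruction_flag == "permitted":                              # S150 -> S151
        if host_storage["switch_state_flag"] == "system switching present":
            # The failure state was inherited through synchronization: discard once.
            host_storage["switch_state_flag"] = "no system switching"
            return "discarded"
    # "no" (software failure), or "permitted" with no switch in progress.
    return "guest stopped"                                           # S152


host = {"switch_state_flag": "system switching present"}
print(handle_switch_instruction("permitted", host))  # discarded
print(handle_switch_instruction("permitted", host))  # guest stopped
```

The second call stops the guest because the first call cleared the status flag, so a genuine hardware failure on the new active server is still acted upon.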

In S200, the administrator resolves the failure that has occurred, such as replacing the hardware in which the failure has occurred.
Thereafter, the active server 200 operates as a new standby server. At this time, the guest OS unit 220 has not been activated yet.
The active server 200 that operates as a new standby server operates in the same manner as the standby server 300 until a failure occurs in the active server 200.
The operation of the standby server 300 will be described later.
With S200, the system switching method (active server) ends.

FIG. 3 is a flowchart showing a system switching method (standby server) in the first embodiment.
A system switching method of the standby server 300 (or the active server 200 operating as a new standby server) will be described with reference to FIG. 3.

The guest OS unit 320 of the standby server 300 is stopped.

S210 and S220 described below are executed in parallel.

In S210, the software FT unit 311 acquires the operation data of the guest OS unit 220 from the active server 200 every time a predetermined synchronization period elapses, and stores the acquired operation data in the guest OS storage unit 329.
The operation data is data stored in a storage area (guest OS storage unit 229) in a memory allocated to the guest OS unit 220, such as processing data and a system switching instruction flag.

For example, the software FT unit 311 of the standby server 300 starts a synchronization timer and transmits a synchronization request to the active server 200 when a timeout is notified from the synchronization timer.
The software FT unit 211 of the active server 200 receives the synchronization request and transmits the operation data stored in the guest OS storage unit 229 to the standby server 300.
Then, the software FT unit 311 of the standby server 300 receives the operation data, stores the received operation data in the guest OS storage unit 329, and newly starts a synchronization timer.
The synchronization timer is a function for notifying a timeout when the synchronization period has elapsed since the activation.
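
The pull-style synchronization of S210 can be sketched as below. The class `SoftwareFT`, its method names, and the use of a `ConnectionError` to model an unreachable active server are assumptions for illustration; a real implementation would transfer the data over the LAN rather than call a local function.

```python
import copy


class SoftwareFT:
    """Standby-side synchronization (S210); all names here are hypothetical."""

    def __init__(self, fetch_operation_data, guest_storage: dict):
        self.fetch_operation_data = fetch_operation_data  # callable returning a dict
        self.guest_storage = guest_storage                # standby's guest OS storage
        self.last_sync_ok = True

    def on_sync_timer_timeout(self) -> None:
        """Run once per synchronization period."""
        try:
            snapshot = self.fetch_operation_data()
        except ConnectionError:
            self.last_sync_ok = False   # S220 can treat this as an active-server failure
            return
        self.guest_storage.clear()
        self.guest_storage.update(copy.deepcopy(snapshot))
        self.last_sync_ok = True


active_storage = {"processing_data": [1, 2], "switch_instruction_flag": "not occurred"}
standby_storage = {}
ft = SoftwareFT(lambda: active_storage, standby_storage)

ft.on_sync_timer_timeout()
active_storage["processing_data"].append(3)   # active keeps working after the sync
print(standby_storage["processing_data"])     # [1, 2] until the next sync


def unreachable():
    raise ConnectionError("active server unreachable")


ft_down = SoftwareFT(unreachable, {})
ft_down.on_sync_timer_timeout()
print(ft_down.last_sync_ok)  # False
```

The deep copy keeps the standby's snapshot independent of later changes on the active side, which mirrors the fact that the standby only sees the state as of the last synchronization.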

Next, S220 will be described.

In S220, the system switching detection unit 313 determines whether or not a failure has occurred in the active server 200 as follows.

The system switching detection unit 313 determines that a failure has occurred in the active server 200 when the heartbeat signal of the active server 200 cannot be received (detected) within a predetermined monitoring period.

For example, the system switching detection unit 313 activates a monitoring timer and determines whether a heartbeat signal has been received before a timeout is notified from the monitoring timer. When the heartbeat signal is received, the system switching detection unit 313 stops the running monitoring timer and starts a new monitoring timer.
The monitoring timer is a function for notifying a timeout when a monitoring cycle has elapsed since the activation.

In addition, when the software FT unit 311 cannot acquire operation data from the active server 200 (S210), the system switching detection unit 313 determines that a failure has occurred in the active server 200.

If a failure has occurred in the active server 200 (YES), the process proceeds to S230.
If no failure has occurred in the active server 200 (NO), S210 and S220 are repeated.
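
The failure determination of S220 can be sketched as a monitoring timer that is restarted by every heartbeat. The class `HeartbeatMonitor` and its method names are hypothetical, and time is passed in explicitly so that the behaviour is easy to follow.

```python
class HeartbeatMonitor:
    """Standby-side check for a missing heartbeat (S220); names are hypothetical."""

    def __init__(self, monitoring_period: float, now: float = 0.0):
        self.monitoring_period = monitoring_period
        self._deadline = now + monitoring_period   # monitoring timer expiry

    def on_heartbeat(self, now: float) -> None:
        """A heartbeat arrived: stop the running timer and start a new one."""
        self._deadline = now + self.monitoring_period

    def active_failed(self, now: float, sync_ok: bool = True) -> bool:
        """True if the timer expired or operation data could not be acquired."""
        return (not sync_ok) or now >= self._deadline


monitor = HeartbeatMonitor(monitoring_period=3.0)
monitor.on_heartbeat(now=1.0)
print(monitor.active_failed(now=3.5))                 # False
print(monitor.active_failed(now=4.0))                 # True
print(monitor.active_failed(now=1.5, sync_ok=False))  # True
```

A monitoring period longer than the heartbeat notification period avoids treating a single delayed heartbeat as a failure; the specific values used here are arbitrary.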

In S230, the system switching detection unit 313 activates the guest OS unit 320 via the virtual machine monitor unit 330.
After S230, the process proceeds to S240.

In S240, the system switching detection unit 313 sets “system switching present” to the system switching state flag (an example of the second stop flag) stored in advance in the host OS storage unit 319.
The initial value of the system switching status flag is “no system switching”. The meaning of the value of the system switching status flag is the same as that of the active server 200 (see S151 in FIG. 2).
After S240, the process proceeds to S100.

In S100, the standby server 300 operates as a new active server.
With S100, the system switching method (standby server) ends.

That is, the standby server 300 operates as a new active server as follows.

In S140 (see FIG. 2), the system switching instruction unit 324 refers to the system switching instruction flag stored in the guest OS storage unit 329 every time the failure detection cycle elapses.
When the system switching instruction flag is set to “permitted” or “no”, the system switching instruction unit 324 notifies the system switching control unit 312 of a system switching instruction including the value of the system switching instruction flag, and initializes the system switching instruction flag.

The value of the system switching state flag of the standby server 300 operating as a new active server is “system switching present” by S240 (see FIG. 3).

Therefore, if the value of the system switching instruction flag is “permitted” (hardware failure) in S150 (see FIG. 2), the system switching control unit 312 discards the system switching instruction in S151 and sets “no system switching” in the system switching state flag.
Then, the application execution unit 321 executes the operation processing (S110), the system switching detection unit 313 transmits a heartbeat signal every time the heartbeat notification cycle elapses (S120), and the failure detection unit 323 determines, every time the failure detection cycle elapses, whether or not a failure has occurred (S130).
That is, even if a system switching instruction is notified, the system switching control unit 312 does not stop the guest OS unit 320 if the value of the system switching state flag is “system switching present”. At this time, a hardware failure has occurred in the active server 200, and no hardware failure has occurred in the standby server 300.

In S150 (see FIG. 2), when the value of the system switching instruction flag is “no (software failure)”, the system switching control unit 312 stops the guest OS unit 320 regardless of the value of the system switching status flag (S152).
This is because, when a software failure has occurred in the active server 200, the same software failure would occur in the standby server 300 if the standby server 300 took over the operation process.
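
The decision made in S150 to S152 can be summarized in a minimal Python sketch. The names InstructionFlag, HostState, and handle_switch_instruction are hypothetical and do not appear in the embodiment; only the flag values and the step numbers are taken from the description.

```python
from dataclasses import dataclass
from enum import Enum


class InstructionFlag(Enum):
    # Values carried by a system switching instruction (names are illustrative).
    PERMITTED = "permitted"  # hardware failure: the instruction may be discarded
    NO = "no"                # software failure: the instruction must be honored


@dataclass
class HostState:
    # System switching status flag held in the host OS storage unit.
    system_switched: bool = False
    guest_running: bool = True


def handle_switch_instruction(host: HostState, flag: InstructionFlag) -> str:
    """Sketch of S150 to S152 as performed by the system switching control unit."""
    if flag is InstructionFlag.PERMITTED and host.system_switched:
        # S151: the failure state was copied from the former active server,
        # so discard the instruction and reset the status flag.
        host.system_switched = False
        return "discarded"
    # S152: stop the guest OS unit (always the case for a software failure).
    host.guest_running = False
    return "stopped"
```

Under these assumptions, a host that has just taken over (system_switched=True) discards the first “permitted” instruction and keeps its guest running, while a second “permitted” instruction, or any “no” instruction, stops the guest.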

The duplex system 100 described in Embodiment 1 has the following effects.

Even when a hardware failure occurs in the active server 200, the standby server 300 can be operated as a new active server, so that the system availability can be increased.

It is possible to construct a system having a fault tolerant function at a lower cost than when using an FT (fault tolerant) server in which hardware is multiplexed.

By using the system switching status flag, the duplex system 100 can operate the standby server 300 normally as a new active server even when the failure state (system switching instruction flag = “permitted”) has been synchronized from the active server 200 to the standby server 300.
That is, even when the failure state is synchronized from the active server 200 to the standby server 300, the standby server 300 does not stop and operates as a new active server.

The duplex system 100 can discriminate between a hardware failure and a software failure by using the system switching instruction flag, and can perform different failure control for each.
For example, in the case of a hardware failure, whether to stop the guest OS unit 220 is determined based on the system switching status flag (FIG. 2, S151), and in the case of a software failure, the guest OS unit 220 is always stopped (FIG. 2, S152).

FIG. 4 is a table showing the relationship between the failure and the means for detecting the failure in the first embodiment.
The effects of the duplex system 100 described in Embodiment 1 are explained with reference to FIG. 4.

Hardware (H/W) failures can be classified by cause into failures (1) to (4).
The host OS unit 210 of the active server 200 includes a failure detection unit (not shown) that detects a hardware failure of the active server 200.

Failure (1) is a serious failure that causes the active server 200 to stop suddenly due to a power failure or the like. Such a failure interrupts heartbeat communication. Therefore, the failure (1) is detected by the host OS unit 310 (system switch detection unit 313) of the standby server 300.

Fault (2) is a minor fault that does not cause the active server 200 to stop, such as a fan failure. The failure (2) is detected by the host OS unit 210 (failure detection unit) of the active server 200.

Failure (3) is a failure in which waiting for a response from hardware times out due to an I/O error of a disk or network. Such a failure is detected by the host OS unit 210 (failure detection unit) of the active server 200. The host OS unit 210 (failure detection unit) detects the failure (3) using the driver function of the host OS.

The failure (4) is a failure in which waiting for a response from the hardware times out as in the case of the failure (3). However, the failure (4) and the failure (3) are different in timeout time and detection means.
Normally, the hardware timeout set at the OS level is relatively long. However, in a system such as an online system that guarantees a predetermined response time for each process, the hardware timeout must be set short.
That is, the failure (4) is a failure in which a time-out time set according to the system is applied, and waiting for a response from hardware times out.
The failure (4) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200 (FIG. 2, S130).

Since the failure (4) is detected by the guest OS unit 220 of the active server 200, the state of the failure (4) (system switching instruction flag) is synchronized to the standby server 300 as part of the data (operation data) of the guest OS unit 220.
In this case, although the failure (4) has not occurred in the standby server 300, the standby server 300 detects the failure (4) and stops the guest OS unit 320. That is, both the active server 200 and the standby server 300 are stopped. This is called “joint death”.
However, in the first embodiment, joint death can be prevented by using the system switching status flag (FIG. 2, S151).

Software (S/W) failures can be classified by cause into failures (5) and (6).

Failure (5) is a failure in which the host OS unit 210 of the active server 200 stops due to a cause such as an OS hang-up. Such a failure interrupts heartbeat communication. Therefore, the failure (5) is detected by the host OS unit 310 (system switch detection unit 313) of the standby server 300 or the guest OS unit (cluster software unit) of the standby server described later.

Fault (6) is a fault that causes the operation process to stop due to an application program malfunction (for example, memory shortage). The failure (6) is detected by the guest OS unit 220 (failure detection unit 223) of the active server 200.

In the first embodiment, by using the system switching instruction flag, different failure control can be performed when a hardware failure occurs and when a software failure occurs (FIG. 2, S150).
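
The classification of failures (1) to (6) and their detection means can be written down as a small lookup table. The following Python sketch is illustrative only: the type names Detector and Fault and the function instruction_flag are hypothetical, the summaries are abbreviated from the description, and the additional detection of failure (5) by the cluster software unit of a further standby server is omitted for brevity.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Detector(Enum):
    STANDBY_HOST = "host OS unit 310 of the standby server (heartbeat loss)"
    ACTIVE_HOST = "host OS unit 210 of the active server"
    ACTIVE_GUEST = "guest OS unit 220 of the active server (failure detection unit 223)"


class Fault(NamedTuple):
    kind: str          # "H/W" or "S/W"
    summary: str
    detector: Detector


FAULTS = {
    1: Fault("H/W", "sudden stop, e.g. power failure", Detector.STANDBY_HOST),
    2: Fault("H/W", "minor fault, e.g. fan failure", Detector.ACTIVE_HOST),
    3: Fault("H/W", "I/O timeout at the OS level", Detector.ACTIVE_HOST),
    4: Fault("H/W", "I/O timeout with a system-specific limit", Detector.ACTIVE_GUEST),
    5: Fault("S/W", "host OS stop, e.g. OS hang-up", Detector.STANDBY_HOST),
    6: Fault("S/W", "application malfunction, e.g. memory shortage", Detector.ACTIVE_GUEST),
}


def instruction_flag(fault_no: int) -> Optional[str]:
    """Flag value the guest would record, or None if the guest is not the detector."""
    fault = FAULTS[fault_no]
    if fault.detector is not Detector.ACTIVE_GUEST:
        return None
    # "permitted" is used only when the cause of the failure is hardware.
    return "permitted" if fault.kind == "H/W" else "no"
```

Only failures (4) and (6) are recorded in the guest OS data, so only these two can reach the standby server through synchronization; of the two, only failure (4) yields an instruction that may be discarded.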

FIG. 5 is a flowchart showing the relationship between the synchronization timing between the active server 200 and the standby server 300 and the operation of the standby server 300 in the first embodiment.
The effects of the duplex system 100 described in Embodiment 1 are explained with reference to FIG. 5.

“Synchronization A” is the case where the standby server 300 acquires operational data from the active server 200 before the guest OS unit 220 of the active server 200 detects a hardware failure.
“Synchronization B” is the case where the standby server 300 acquires operational data from the active server 200 after the guest OS unit 220 of the active server 200 detects a hardware failure and before it notifies the host OS unit 210 of a system switching instruction.
“Synchronization C” is the case where the standby server 300 acquires operational data from the active server 200 after the guest OS unit 220 of the active server 200 notifies the host OS unit 210 of a system switching instruction.

In the case of “synchronization B”, the value of the system switching instruction flag of the standby server 300 is “permitted”, so the guest OS unit 320 notifies the host OS unit 310 of a system switching instruction.
However, since “system switching present” is set in the system switching status flag when the guest OS unit 320 is started (S240 in FIG. 3), the host OS unit 310 discards the system switching instruction and does not stop the guest OS unit 320.

In the case of “synchronization A” or “synchronization C”, the value of the system switching instruction flag of the standby server 300 is “not generated”, so the guest OS unit 320 does not notify the host OS unit 310 of a system switching instruction.
That is, the guest OS unit 320 does not stop until a new failure is detected.

As described above, the duplex system 100 can prevent joint death of the active server 200 and the standby server 300 regardless of the timing at which the operation data is synchronized between the active server 200 and the standby server 300.
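
The three synchronization timings can be checked with a short simulation. This is a sketch under stated assumptions, not the implementation of the embodiment: the function standby_survives, the table FLAG_AT_SYNC, and the use_status_flag switch (which models a system without the system switching status flag) are hypothetical names introduced for illustration.

```python
NOT_GENERATED, PERMITTED = "not generated", "permitted"

# Instruction flag value captured in the operation data at each timing.
FLAG_AT_SYNC = {"A": NOT_GENERATED, "B": PERMITTED, "C": NOT_GENERATED}


def standby_survives(sync_point: str, use_status_flag: bool = True) -> bool:
    """Return True if the standby guest OS keeps running after takeover."""
    instruction_flag = FLAG_AT_SYNC[sync_point]  # replicated guest data
    system_switched = use_status_flag            # set in S240 when the guest starts
    guest_running = True                         # guest OS unit 320 started on takeover

    # S140: the guest checks the replicated flag once per failure detection cycle.
    if instruction_flag != NOT_GENERATED:
        notified_flag = instruction_flag
        instruction_flag = NOT_GENERATED         # the flag is initialized after notifying
        # S150/S151: the host discards a "permitted" instruction after a switch.
        if notified_flag == PERMITTED and system_switched:
            system_switched = False
        else:
            guest_running = False                # S152: joint death
    return guest_running
```

With the status flag in use, the standby guest keeps running for all of A, B, and C; with use_status_flag=False, timing B stops the standby guest as well, which is the joint death described above.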

FIG. 6 is a diagram illustrating an example of hardware resources of the active server 200 and the standby server 300 according to the first embodiment.
In FIG. 6, the active server 200 and the standby server 300 include a CPU 901 (Central Processing Unit). The CPU 901 is connected to the ROM 903, the RAM 904, the communication board 905, the display device 911, the keyboard 912, the mouse 913, the drive device 914, and the magnetic disk device 920 via the bus 902, and controls these hardware devices. The drive device 914 is a device that reads and writes a storage medium such as an FD (Flexible Disk), a CD (Compact Disc), or a DVD (Digital Versatile Disc).

The communication board 905 is wired or wirelessly connected to a communication network such as a LAN (Local Area Network), the Internet, or a telephone line.

The magnetic disk device 920 stores an OS 921 (operating system), a program group 922, and a file group 923.

The program group 922 includes programs for executing the functions described as “~ unit” in the embodiment. The programs are read and executed by the CPU 901. That is, a program causes the computer to function as a “~ unit”, and causes the computer to execute the procedure or method of the “~ unit”.

The file group 923 includes various data (input, output, determination result, calculation result, processing result, etc.) used by each “~ unit” described in the embodiment.

In the embodiment, arrows included in the configuration diagrams and flowcharts mainly indicate input and output of data and signals.

In the embodiment, what is described as “~ unit” may be a “~ circuit”, “~ apparatus”, or “~ device”, and may also be a “~ step”, “~ procedure”, or “~ process”. That is, what is described as “~ unit” may be implemented by firmware, software, hardware, or any combination thereof.

FIG. 7 is a configuration diagram showing another form of the duplex system 100 according to the first embodiment.
As illustrated in FIG. 7, the duplex system 100 may include a standby server 400 and a shared storage 102.

The standby server 400 (third server) is a server device that executes operation processing instead of the active server 200 when a software failure occurs in the active server 200.
The standby server 400 includes a host OS unit 410, a guest OS unit 420, a virtual machine monitor unit 430, and hardware 401, similar to the active server 200 and the standby server 300.

The shared storage 102 is a storage device that stores processing data used for operation processing, image data constituting a guest OS unit (virtual machine), and the like.

The active server 200, the standby server 300, or the standby server 400 accesses the shared storage 102 via the LAN 101 and executes an operation process using the processing data stored in the shared storage 102.

The standby server 400 need not synchronize the operation data (system switching instruction flag) with the active server 200.

However, the cluster software unit 222 of the active server 200 transmits a heartbeat signal to the standby server 400 every time a predetermined heartbeat communication cycle elapses, and the cluster software unit of the standby server 400 receives the heartbeat signal from the active server 200.
The cluster software unit of the standby server 400 monitors the heartbeat signal every predetermined monitoring period. If the cluster software unit of the standby server 400 cannot receive the heartbeat signal within the monitoring period, it determines that a software failure has occurred in the active server 200.
When a software failure occurs in the active server 200, the application execution unit of the standby server 400 restarts the application program for operation processing. Alternatively, the system switching detection unit of the standby server 400 restarts the guest OS unit 420.
The standby server 400 operates as a new operational server.
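
The heartbeat monitoring performed by the cluster software unit of the standby server 400 can be sketched as follows. The class HeartbeatMonitor and its methods are hypothetical; timestamps are passed in explicitly (an assumption made so that the example is deterministic) rather than read from a system clock.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat and reports a failure when it is overdue."""

    def __init__(self, monitoring_period: float, now: float = 0.0):
        self.monitoring_period = monitoring_period
        self.last_heartbeat = now

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat signal arrives from the active server.
        self.last_heartbeat = now

    def failure_detected(self, now: float) -> bool:
        # No heartbeat within the monitoring period is treated as a
        # software failure of the active server.
        return now - self.last_heartbeat > self.monitoring_period
```

For example, with a monitoring period of 3.0 and a last heartbeat at time 1.0, failure_detected(3.5) is False and failure_detected(4.5) is True; on a True result the standby server 400 would restart the application program or the guest OS unit 420 as described above.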

In Embodiment 1, for example, the following inter-server state synchronization method (system switching method) has been described.

The duplex system 100 includes an active server 200 and a standby server 300, and replicates the operating state (operation data) of the active system to the standby system in a predetermined procedure.
When a failure occurs in the active system, the active server 200 is stopped, and the standby server 300 is started using the replicated operation state, thereby switching the system from the active system to the standby system.
After system switching from the active system to the standby system, if the state in which the failure occurred (system switching instruction flag) has been replicated to the standby system but no failure has actually occurred in the standby system, the standby server 300 is not stopped.

The standby server 300 is provided with a system switching status holding unit (host OS storage unit 319) for storing presence / absence of system switching from the active system to the standby system (system switching status flag).
When a failure is detected in the standby system, the standby server 300 is stopped if the system switching status flag stored in the system switching status holding unit is “no system switching”, and is not stopped if the system switching status flag is “system switching present”.

When the active system is restored after system switching from the active system to the standby system, the original standby system (standby server 300) is operated as the new active system, and the original active system (active server 200) is operated as the new standby system.
After a failure occurs in the new active system and the system is switched to the new standby system, if the state in which the failure occurred has been replicated to the new standby system but no failure has actually occurred in the new standby system, the new standby server is not stopped.

A system switching state holding unit (host OS storage unit) that stores the presence / absence of system switching from another system to the own system (system switching status flag) is provided in each of the active server 200 and the standby server 300.
When a failure is detected, the local server is stopped for system switching to the other system if the system switching status flag stored in the system switching status holding unit is “no system switching”, and is not stopped if the system switching status flag is “system switching present”.

A virtual environment (virtual machine monitor unit) is mounted on each of the active server 200 and the standby server 300, and one host OS (host OS unit) and one or more guest OSs (guest OS units) are installed in the virtual environment.
The operating state (operation data) of the guest OS is synchronized from the active system to the standby system by the software fault tolerant function (software FT unit 211) installed on the host OS.
When a failure occurs in the active server 200, the guest OS of the active server 200 is stopped and the operation of the guest OS is restarted on the standby server 300 using the synchronized operating state of the guest OS.
A system switching control unit that performs system switching and a system switching status holding unit (host OS storage unit) that holds whether or not system switching from the active system to the standby system is performed (system switching status flag) are provided in the host OS.
When a failure is detected in the guest OS, a system switching instruction is transmitted from the guest OS to the host OS.
When the system switching control unit of the host OS receives the system switching instruction from the guest OS, it discards the system switching instruction only when the system switching status flag stored in the system switching status holding unit is “system switching present”. Otherwise, it performs system switching to the standby system.
When performing system switching, the system switching control unit on the host OS stops the active guest OS. The standby host OS detects that the synchronization by the software fault tolerant function or the heartbeat communication with the active system has been interrupted, and starts the standby guest OS.

A flag (system switching instruction flag) indicating whether or not the system switching instruction can be discarded is provided in the system switching instruction transmitted from the guest OS to the host OS.
When the system switching control unit of the host OS receives a system switching instruction from the guest OS, the system switching instruction flag indicates that the system switching instruction can be discarded, and the system switching state stored in the system switching state holding unit Only when the flag is “system switching present”, the system switching instruction is discarded. Otherwise, the system switching control unit of the host OS performs system switching to the standby system.

The system switching instruction flag is set to “permitted” only when the cause of the failure is hardware, and whether or not system switching is to be executed is determined based on this value and the system switching status flag stored in the system switching status holding unit.
As a result, when the system switching instruction flag is “permitted” and the system switching status flag is “system switching present”, the system switching operation is not performed, and joint death can be prevented.

Embodiment 2.
A mode in which the active server 200 notifies the standby server 300 that the guest OS unit 220 has been stopped when the guest OS unit 220 is stopped will be described.

FIG. 8 is a configuration diagram of the duplex system 100 according to the second embodiment.
The configuration of the duplex system 100 in the second embodiment will be described with reference to FIG. 8.

The host OS unit 210 of the active server 200 includes a system switching communication unit 214 instead of the system switching detection unit 213 (see FIG. 1) described in the first embodiment.
The host OS unit 310 of the standby server 300 includes a system switching communication unit 314 instead of the system switching detection unit 313 described in the first embodiment.

When the system switching control unit 212 stops the guest OS unit 220 including the application execution unit 221, the system switching communication unit 214 (an example of a first notification unit) notifies the standby server 300 of the stop of the guest OS unit 220 (application execution unit 221).

The system switching communication unit 314 (an example of a second activation unit) activates the guest OS unit 320 including the application execution unit 321 when the active server 200 notifies it of the stop of the guest OS unit 220 (application execution unit 221).
Further, the system switching communication unit 314 sets, in the system switching status flag stored in the host OS storage unit 319, a second continuation value (“system switching present”) indicating that the guest OS unit 320 (application execution unit 321) is not to be stopped.

When the standby server 300 operates as a new active server and the active server 200 operates as a new standby server, the system switching communication unit 214 and the system switching communication unit 314 operate as follows.

When the system switching control unit 312 stops the guest OS unit 320 including the application execution unit 321, the system switching communication unit 314 (an example of a second notification unit) notifies the active server 200 of the stop of the guest OS unit 320 (application execution unit 321).

The system switching communication unit 214 (an example of a first activation unit) activates the guest OS unit 220 including the application execution unit 221 when the standby server 300 notifies it of the stop of the guest OS unit 320 (application execution unit 321).
Further, the system switching communication unit 214 sets the first continuation value (“system switching present”) in the system switching status flag stored in the host OS storage unit 219.

FIG. 9 is a flowchart showing a system switching method (active server) in the second embodiment.
A system switching method of the active server 200 (or the standby server 300 operating as a new active server) will be described with reference to FIG. 9.

In the system switching method (active server), in addition to the processing described in the first embodiment (see FIG. 2), S153 is executed.

That is, after the system switching control unit 212 stops the guest OS unit 220 (S152), the system switching communication unit 214 transmits a stop notification of the guest OS unit 220 to the standby server 300 (S153).

Other processes are the same as those in the first embodiment (FIG. 2).

FIG. 10 is a flowchart showing a system switching method (standby server) in the second embodiment.
A system switching method of the standby server 300 (or the active server 200 operating as a new standby server) will be described with reference to FIG. 10.

In the system switching method (standby server), S220B is executed instead of S220 (see FIG. 3) described in the first embodiment.

In S220B, the system switching communication unit 314 determines whether or not a failure has occurred in the active server 200 as follows.
The system switching communication unit 314 determines that a failure has occurred in the active server 200 when the heartbeat signal of the active server 200 cannot be received within a predetermined monitoring period.
The system switching communication unit 314 determines that a failure has occurred in the active server 200 when the software FT unit 311 cannot acquire operation data from the active server 200.
Further, the system switching communication unit 314 determines that a failure has occurred in the active server 200 when receiving a stop notification of the guest OS unit 220 from the active server 200.
If a failure has occurred in the active server 200 (YES), the process proceeds to S230.
If no failure has occurred in the active server 200 (NO), S210 and S220B are repeated.
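
The determination in S220B combines three independent conditions, any one of which is sufficient. A minimal Python sketch follows; the names ActiveServerObservation and active_failure_detected are hypothetical and serve only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass
class ActiveServerObservation:
    heartbeat_received: bool       # heartbeat arrived within the monitoring period
    operation_data_acquired: bool  # software FT unit 311 obtained operation data
    stop_notified: bool            # stop notification of guest OS unit 220 received


def active_failure_detected(obs: ActiveServerObservation) -> bool:
    """S220B: any one of the three conditions means the active server has failed."""
    return (
        not obs.heartbeat_received
        or not obs.operation_data_acquired
        or obs.stop_notified
    )
```

The third condition is what distinguishes S220B from S220: a stop notification triggers the switch even while heartbeats and synchronization still appear healthy.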

Other processes are the same as those in the first embodiment (FIG. 3).

In Embodiment 2, for example, the following inter-server state synchronization method (system switching method) has been described.

A system switching notification unit (system switching communication unit) and a system switching notification receiving unit (system switching communication unit) are provided in the host OS (host OS unit).
At the time of system switching, the system switching control unit of the active host OS stops the guest OS (guest OS unit), and the system switching notification unit of the active system notifies the system switching notification receiving unit of the standby system of the system switching. Then, the standby host OS activates the guest OS.

As a result, the standby server can detect that the active server has stopped without depending on the synchronization cycle or the heartbeat cycle, and can immediately operate as a new active server when the active server stops.

100 duplex system, 101 LAN, 102 shared storage, 200 active server, 201 hardware, 210 host OS unit, 211 software FT unit, 212 system switching control unit, 213 system switching detection unit, 214 system switching communication unit, 219 host OS storage unit, 220 guest OS unit, 221 application execution unit, 222 cluster software unit, 223 failure detection unit, 224 system switching instruction unit, 229 guest OS storage unit, 230 virtual machine monitor unit, 300 standby server, 301 hardware, 310 host OS unit, 311 software FT unit, 312 system switching control unit, 313 system switching detection unit, 314 system switching communication unit, 319 host OS storage unit, 320 guest OS unit, 321 application execution unit, 322 cluster software unit, 323 failure detection unit, 324 system switching instruction unit, 329 guest OS storage unit, 330 virtual machine monitor unit, 400 standby server, 401 hardware, 410 host OS unit, 420 guest OS unit, 430 virtual machine monitor unit, 901 CPU, 902 bus, 903 ROM, 904 RAM, 905 communication board, 911 display device, 912 keyboard, 913 mouse, 914 drive device, 920 magnetic disk device, 921 OS, 922 program group, 923 file group.

Claims

In a multiplex system including a first server that executes predetermined operation processing, and a second server that executes the operation processing instead of the first server when a failure occurs in the first server,
The first server
A first operational data storage unit that stores, as operational data, data including processing data used for operational processing and a failure flag indicating whether or not an operational processing failure has occurred;
A first execution unit that executes an operation process using the processing data stored in the first operation data storage unit;
A first failure detection unit that detects a failure in the operation process executed by the first execution unit and, when a failure in the operation process is detected, sets, in the failure flag stored in the first operation data storage unit, a failure occurrence value indicating that an operation processing failure has occurred;
A first stop unit for stopping the first execution unit;
A first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit,
The second server
A second synchronization unit that obtains operational data from the first server every predetermined synchronization period;
A second operational data storage unit that stores operational data acquired by the second synchronization unit;
A second execution unit that executes an operation process using the operation data stored in the second operation data storage unit;
A second stop flag storage unit for storing a second stop flag indicating whether or not the second execution unit can be stopped;
A second monitoring unit that monitors the first server every predetermined monitoring cycle and determines whether or not a failure has occurred in the first server;
A second activation unit that, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped;
A second failure detection unit that detects a failure in the operation process executed by the second execution unit and, when a failure in the operation process is detected, sets the failure occurrence value in the failure flag stored in the second operation data storage unit;
A second stop unit for stopping the second execution unit;
A second instruction unit that refers to the failure flag stored in the second operational data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets, in the referenced failure flag, a non-failure occurrence value indicating that no operation processing failure has occurred,
When the second instruction unit instructs the second stop unit to stop the second execution unit, the second stop unit refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, the second stop unit sets, in the referenced second stop flag, a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, the second stop unit stops the second execution unit.
The first server further
A first stop flag storage unit for storing a first stop flag indicating whether or not the first execution unit can be stopped;
A first synchronization unit that obtains operational data from the second server every predetermined synchronization period;
A first monitoring unit that monitors the second server every predetermined monitoring cycle and determines whether or not a failure has occurred in the second server;
A first activation unit that, when the first monitoring unit determines that a failure has occurred in the second server, starts the first execution unit and sets, in the first stop flag stored in the first stop flag storage unit, a first continuation value indicating that the first execution unit is not to be stopped,
When the first instruction unit instructs the first stop unit to stop the first execution unit, the first stop unit refers to the first stop flag stored in the first stop flag storage unit; when the first continuation value is set in the referenced first stop flag, the first stop unit sets, in the referenced first stop flag, a non-first continuation value indicating that the first execution unit may be stopped; and when the first continuation value is not set in the referenced first stop flag, the first stop unit stops the first execution unit. The multiplex system according to claim 1.
The first server includes a first guest computer and a first host computer,
The first guest computer includes the first execution unit, the first failure detection unit, the first instruction unit, and the first operation data storage unit.
The first host computer includes the first stop unit, the first synchronization unit, the first monitoring unit, the first activation unit, and the first stop flag storage unit,
The second server includes a second guest computer and a second host computer,
The second guest computer includes the second execution unit, the second failure detection unit, the second instruction unit, and the second operation data storage unit,
The second host computer includes the second stop unit, the second synchronization unit, the second monitoring unit, the second activation unit, and the second stop flag storage unit. The multiplex system according to claim 2.
In a multiplex system including a first server that executes predetermined operation processing, and a second server that executes the operation processing instead of the first server when a failure occurs in the first server,
The first server
A first operational data storage unit that stores, as operational data, data including processing data used for operational processing and a failure flag indicating whether or not an operational processing failure has occurred;
A first execution unit that executes an operation process using the processing data stored in the first operation data storage unit;
A first failure detection unit that detects a failure in the operation process executed by the first execution unit and, when a failure in the operation process is detected, sets, in the failure flag stored in the first operation data storage unit, a failure occurrence value indicating that an operation processing failure has occurred;
A first stop unit for stopping the first execution unit;
A first instruction unit that refers to the failure flag stored in the first operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the first stop unit to stop the first execution unit;
A first notification unit that, when the first stop unit stops the first execution unit, notifies the second server of the stop of the first execution unit,
The second server
A second synchronization unit that obtains operational data from the first server every predetermined synchronization period;
A second operational data storage unit that stores operational data acquired by the second synchronization unit;
A second execution unit that executes an operation process using the operation data stored in the second operation data storage unit;
A second stop flag storage unit for storing a second stop flag indicating whether or not the second execution unit can be stopped;
A second activation unit that, when notified by the first server of the stop of the first execution unit, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped;
A second failure detection unit that detects a failure in the operation process executed by the second execution unit and, when a failure in the operation process is detected, sets the failure occurrence value in the failure flag stored in the second operation data storage unit;
A second stop unit for stopping the second execution unit;
A second instruction unit that refers to the failure flag stored in the second operational data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs the second stop unit to stop the second execution unit and sets, in the referenced failure flag, a non-failure occurrence value indicating that no operation processing failure has occurred,
The second stop unit, when instructed by the second instruction unit to stop the second execution unit, refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, it sets in that flag a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, it stops the second execution unit. A multiplex system characterized by the above.
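The stop-flag handling recited for the second stop unit can be illustrated with a minimal Python sketch. This is not part of the claims; the names StopFlag, StandbyHost, activate and request_stop are hypothetical and chosen only for illustration. It assumes the stop flag takes exactly two values, and it shows that the first stop instruction after activation is absorbed while a later one stops the execution unit.

```python
from dataclasses import dataclass
from enum import Enum


class StopFlag(Enum):
    CONTINUE = "continue"    # stands for the "second continuation value"
    MAY_STOP = "may_stop"    # stands for the "non-second continuation value"


@dataclass
class StandbyHost:
    """Hypothetical stand-in for the second activation unit and second stop unit."""
    stop_flag: StopFlag = StopFlag.MAY_STOP
    guest_running: bool = False

    def activate(self) -> None:
        # Second activation unit: start the execution unit and mark it
        # as "not to be stopped" for the next stop instruction.
        self.guest_running = True
        self.stop_flag = StopFlag.CONTINUE

    def request_stop(self) -> bool:
        # Second stop unit: returns True only if the execution unit was stopped.
        if self.stop_flag is StopFlag.CONTINUE:
            # Absorb this instruction and allow the next one to take effect.
            self.stop_flag = StopFlag.MAY_STOP
            return False
        self.guest_running = False
        return True


host = StandbyHost()
host.activate()
first_result = host.request_stop()   # instruction caused by the inherited failure flag
second_result = host.request_stop()  # instruction caused by a failure on this server
print(first_result, second_result, host.guest_running)
```

In this sketch the flag acts as a one-shot guard: it is consumed by the first stop instruction, so a failure detected later on the second server itself is still acted upon.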
The second server further includes
A second notification unit that, when the second stop unit stops the second execution unit, notifies the first server of the stop of the second execution unit,
The first server further includes
A first stop flag storage unit for storing a first stop flag indicating whether or not the first execution unit can be stopped;
A first synchronization unit that obtains operational data from the second server every predetermined synchronization period;
A first activation unit that, when notified by the second server of the stop of the second execution unit, starts the first execution unit and sets, in the first stop flag stored in the first stop flag storage unit, a first continuation value indicating that the first execution unit is not to be stopped,
The first stop unit, when instructed by the first instruction unit to stop the first execution unit, refers to the first stop flag stored in the first stop flag storage unit; when the first continuation value is set in the referenced first stop flag, it sets in that flag a non-first continuation value indicating that the first execution unit may be stopped; and when the first continuation value is not set in the referenced first stop flag, it stops the first execution unit. The multiplex system according to claim 4.
The first server includes a first guest computer and a first host computer,
The first guest computer includes the first execution unit, the first failure detection unit, the first instruction unit, and the first operation data storage unit.
The first host computer includes the first notification unit, the first stop unit, the first synchronization unit, the first monitoring unit, the first activation unit, and the first stop flag storage unit,
The second server includes a second guest computer and a second host computer,
The second guest computer includes the second execution unit, the second failure detection unit, the second instruction unit, and the second operation data storage unit,
The second host computer includes the second notification unit, the second stop unit, the second synchronization unit, the second monitoring unit, the second activation unit, and the second stop flag storage unit. The multiplex system according to claim 5.
7. The multiplex system according to claim 3, wherein the first guest computer, the first host computer, the second guest computer, and the second host computer are configured as virtual machines.
In a system switching method for a multiplex system including a first server that executes predetermined operation processing, and a second server that executes the operation processing in place of the first server when a failure occurs in the first server,
The first operation data storage unit of the first server stores, as operation data, data including processing data used for operation processing and a failure flag indicating whether or not an operation processing failure has occurred,
The first execution unit of the first server executes the operation process using the processing data stored in the first operation data storage unit,
The first failure detection unit of the first server detects a failure in the operation process executed by the first execution unit and, when a failure in the operation process is detected, sets, in the failure flag stored in the first operation data storage unit, a failure occurrence value indicating that an operation processing failure has occurred,
The first instruction unit of the first server refers to the failure flag stored in the first operational data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs that the first execution unit be stopped,
The first stop unit of the first server stops the first execution unit,
A second stop flag storage unit of the second server stores a second stop flag indicating whether the second execution unit can be stopped;
The second synchronization unit of the second server acquires operational data from the first server every predetermined synchronization period,
A second operation data storage unit of the second server stores the operation data acquired by the second synchronization unit;
The second monitoring unit of the second server monitors the first server every predetermined monitoring cycle, determines whether or not a failure has occurred in the first server,
The second activation unit of the second server, when the second monitoring unit determines that a failure has occurred in the first server, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped,
The second execution unit of the second server executes the operation process using the operation data stored in the second operation data storage unit,
The second failure detection unit of the second server detects a failure in the operation process executed by the second execution unit and, when a failure in the operation process is detected, sets the failure occurrence value in the failure flag stored in the second operation data storage unit,
The second instruction unit of the second server refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs that the second execution unit be stopped and sets, in the referenced failure flag, a non-failure occurrence value indicating that no operation processing failure has occurred,
The second stop unit of the second server, when instructed by the second instruction unit to stop the second execution unit, refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, it sets in that flag a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, it stops the second execution unit. A system switching method for a multiplex system characterized by the above.
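The order of the method steps above can be traced with a small simulation. This is a sketch under stated assumptions rather than the claimed method itself: the dictionary layout, the string constants, and the helper second_instruction_and_stop are hypothetical, and monitoring is reduced to a single check of whether the first execution unit is still running. The point it illustrates is that the failure flag set on the first server reaches the second server through synchronization, and that the stop instruction it triggers there is absorbed once.

```python
import copy

FAILURE, NO_FAILURE = "failure", "no_failure"
CONTINUE, MAY_STOP = "continue", "may_stop"


def second_instruction_and_stop(server: dict, log: list) -> None:
    """Second instruction unit followed by the second stop unit (hypothetical helper)."""
    if server["data"]["failure_flag"] != FAILURE:
        return
    server["data"]["failure_flag"] = NO_FAILURE   # reset the referenced failure flag
    if server["stop_flag"] == CONTINUE:
        server["stop_flag"] = MAY_STOP            # absorb this one instruction
        log.append("stop instruction ignored")
    else:
        server["running"] = False
        log.append("second execution unit stopped")


def simulate() -> list:
    log = []
    first = {"running": True,
             "data": {"processing": [1, 2, 3], "failure_flag": NO_FAILURE}}
    second = {"running": False, "stop_flag": MAY_STOP, "data": {}}

    first["data"]["failure_flag"] = FAILURE         # first failure detection unit
    second["data"] = copy.deepcopy(first["data"])   # second synchronization unit
    first["running"] = False                        # first instruction and stop units
    log.append("first execution unit stopped")

    if not first["running"]:                        # second monitoring unit
        second["running"] = True                    # second activation unit
        second["stop_flag"] = CONTINUE
        log.append("second execution unit started")

    second_instruction_and_stop(second, log)        # acts on the copied failure flag
    second["data"]["failure_flag"] = FAILURE        # second failure detection unit
    second_instruction_and_stop(second, log)        # acts on a failure of its own
    return log


events = simulate()
for event in events:
    print(event)
```

Without the stop flag, the third step would stop the second execution unit immediately after the switchover, which is the situation the method is intended to avoid.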
In a system switching method for a multiplex system including a first server that executes predetermined operation processing, and a second server that executes the operation processing in place of the first server when a failure occurs in the first server,
The first operation data storage unit of the first server stores, as operation data, data including processing data used for operation processing and a failure flag indicating whether or not an operation processing failure has occurred,
The first execution unit of the first server executes the operation process using the processing data stored in the first operation data storage unit,
The first failure detection unit of the first server detects a failure in the operation process executed by the first execution unit and, when a failure in the operation process is detected, sets, in the failure flag stored in the first operation data storage unit, a failure occurrence value indicating that an operation processing failure has occurred,
The first instruction unit of the first server refers to the failure flag stored in the first operational data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs that the first execution unit be stopped,
The first stop unit of the first server stops the first execution unit,
The first notification unit of the first server, when the first stop unit stops the first execution unit, notifies the second server of the stop of the first execution unit,
A second stop flag storage unit of the second server stores a second stop flag indicating whether the second execution unit can be stopped;
The second synchronization unit of the second server acquires operational data from the first server every predetermined synchronization period,
A second operation data storage unit of the second server stores the operation data acquired by the second synchronization unit;
The second activation unit of the second server, when notified by the first server of the stop of the first execution unit, starts the second execution unit and sets, in the second stop flag stored in the second stop flag storage unit, a second continuation value indicating that the second execution unit is not to be stopped,
The second execution unit of the second server executes the operation process using the operation data stored in the second operation data storage unit,
The second failure detection unit of the second server detects a failure in the operation process executed by the second execution unit and, when a failure in the operation process is detected, sets the failure occurrence value in the failure flag stored in the second operation data storage unit,
The second instruction unit of the second server refers to the failure flag stored in the second operation data storage unit and, when the failure occurrence value is set in the referenced failure flag, instructs that the second execution unit be stopped and sets, in the referenced failure flag, a non-failure occurrence value indicating that no operation processing failure has occurred,
The second stop unit of the second server, when instructed by the second instruction unit to stop the second execution unit, refers to the second stop flag stored in the second stop flag storage unit; when the second continuation value is set in the referenced second stop flag, it sets in that flag a non-second continuation value indicating that the second execution unit may be stopped; and when the second continuation value is not set in the referenced second stop flag, it stops the second execution unit. A system switching method for a multiplex system characterized by the above.