WO1999026138A1

WO1999026138A1 - Method of changing over a multiplex system

Info

Publication number: WO1999026138A1
Application number: PCT/JP1997/004160
Authority: WO
Inventors: Hiroshi Ohno; Shigenori Kaneko; Yoshihiro Miyazaki; Soichi Takaya; Hiroaki Fukumaru; Takahiro Saruta; Naoshi Kato; Kunihiro Suzuki; Kenichi Kurosawa; Masahiko Saito; Hidehito Takewa; Hirohito Tsukahara; Eiki Shoji
Original assignee: Hitachi, Ltd.; Hitachi Process Computer Engineering, Inc.
Priority date: 1997-11-14
Filing date: 1997-11-14
Publication date: 1999-05-27
Also published as: JP3806600B2

Abstract

When a failure has occurred in a multiplex system, the active computer in which the failure has occurred stops its processing and starts storing the failure information. Then, a stand-by computer detects the failure in that computer and autonomously takes over the processing. When the failure information is not properly stored, the computer having the failure is brought to a complete halt by the operation-stop instruction from the stand-by computer. Before the computer having the failure stores the failure information, the operation of the input/output devices at the coupling portions among the systems, such as networks and shared disk devices, is halted. Thus the system can be switched at high speed while a large amount of failure information is stored when a failure has occurred in a multiplex system.

Description

Specification

System switching method for multiple systems

The present invention relates to a method for managing a multiplex system, and more particularly to a method for performing a system switchover when a failure occurs in any of the computers in a multiplex system including an active computer and a standby computer.

Background art

In applications where high reliability is required, for example when computers are used for railway operation management, plant control, or power system control, it is desirable to configure the computers as a multiplex system that has, in addition to the active computer that performs the processing, a standby computer that takes over the processing performed by the active computer when a failure occurs in the active computer.

Failures that hinder the operation of computers include hardware failures and logical inconsistencies due to defects in core software such as operating systems (hereinafter referred to as OS) and device drivers. When such a failure occurs, saving the various states of the computer's hardware and software makes it possible to analyze the failure afterwards. The results can be used for recovery measures and for measures to prevent recurrence, which is useful for improving system reliability. The same applies to a multiplex system.

In a conventional multiplex system, a switching method has been used in which, when a failure occurs, the failure information is first stored in the disk unit of the failed computer, and the processing executed by the failed computer is then taken over by the standby system. Japanese Patent Application Laid-Open No. 8-202573 describes a method in which all computers constituting a multiplex system are equipped with a common memory whose contents are always kept identical, failure information is always written to this common memory, and the computer that has taken over the processing of the failed computer saves this failure information to disk.

In order to shorten the time during which processing is stopped, the time required for system switching should be as short as possible. With the conventional switching method, system switching is delayed by the time required to store the failure information, so the amount of failure information that can be stored must be limited in order to achieve a practical switching time.

On the other hand, in the case of the method described in Japanese Patent Application Laid-Open No. 8-202573, it is possible to reduce the system switching time, but if the amount of fault information to be stored increases, the required capacity of the common memory increases and the cost of the device increases. At the same time, the computer load and network load for matching the common memory contents also increase.

An object of the present invention is to realize high-speed system switching while storing large-capacity failure information including a memory dump when a failure occurs in a multiplex system.

A further object is to prevent runaway of hardware or software in the failed system, and the operation of saving the failure information in the failed system, from affecting the system switching operation and the operation of the new active system that takes over the processing after the switchover.

Disclosure of the invention

According to the present invention, the processing performed on the active computer in which the failure has occurred is stopped and storage of the failure information is started; subsequently, the standby computer detects the failure of that computer and takes over the processing that was stopped. The stopping of the processing and the start of the storage of the failure information in the failed computer are either performed spontaneously by software on the failed computer, or performed in response to an instruction from the standby computer, which first detects the failure and notifies the failed computer.

According to such a system switching method, the time required to switch the processing can be reduced to approximately the estimated time from the detection of the failure by the standby computer until stable storage of the failure information has started in the failed computer.

Further, in order to achieve the above object, according to the present invention, the standby computer that has detected the failure of the active computer instructs the failed computer to stop operating, following the instruction to start saving the failure information. The failed computer ignores the operation stop instruction when the failure information storage operation is proceeding normally, and accepts the operation stop instruction and stops completely when the failure information storage operation is not proceeding normally.

With this operation method, even in a severe failure state in which the failure information storage operation cannot be performed, the failed computer is prevented from operating unpredictably and affecting, through the connection units between the systems such as the network and the shared disk device, the operation of the new active computer that has taken over the processing.

Further, in order to achieve the above object, according to the present invention, the failed computer stops the operation of the input/output devices of the connection units between the systems, such as the network and the shared disk device, before storing the failure information.

With this operation method, hardware that is not involved in storing the failure information is prevented from operating and affecting, through the connection units between the systems such as the network and the shared disk device, the operation of the new active computer that has taken over the processing.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram showing the configuration of a duplex system, and FIG. 2 is a time chart showing the order of the system switching process and the relationship between the processes in the duplex system.

FIG. 3 is a time chart of the system switching process based on the OS logical inconsistency detection, and FIG. 4 is a time chart of the system switching process based on the hardware failure detection.

Fig. 5 is a block diagram showing the configuration of the LXP board mounted on the computer. Fig. 6 is a flowchart showing the processing procedure of the expansion bus interface mounted on the LXP board. Fig. 7 is a flowchart showing the processing procedure of the linkage control processor mounted on the LXP board. Fig. 8 is a flowchart showing the processing procedure of the survival notification message transmission processing of the management program. Fig. 9 is a flowchart showing the processing procedure by which the management program monitors survival notification messages and handles a failure in another system. FIG. 10 is a flowchart showing the processing procedure of the management program when a failure occurs in its own computer.

FIG. 11 is a flowchart showing the processing procedure of the interrupt processing routine.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, an embodiment of a method for switching a multiplex system according to the present invention will be described in detail. FIG. 1 shows the configuration of a multiplex system according to the present embodiment.

As illustrated, the multiplex system according to the present embodiment is a duplex system composed of two computers. However, three or more computers may be used.

In Fig. 1, computers 100 and 101 represent an active computer and a standby computer, respectively. After a system switchover, the computer 100 operates as the standby computer and the computer 101 operates as the active computer. Each of the computers 100 and 101 has a central processing unit (hereinafter referred to as MPU) 110, a main memory 111, and an input/output control unit 112, which are connected by a processor bus 120. A disk unit 113 and an expansion bus 121 are connected to the input/output control unit 112.

Circuits for expanding the functions of the computer are connected to the expansion bus 122. Generally, an expansion board on which such a circuit is mounted is connected to the expansion bus 122 by inserting it into a slot connector. However, some functions may be implemented inside the computer and connected directly to the expansion bus. The computers 100 and 101 according to the present embodiment are equipped, as expansion boards, with a small computer system interface (SCSI) board 114, a link bus port (hereinafter referred to as LXP) board 115, and an Ethernet board 116.

The shared disk unit 102 is connected to the SCSI board 114. This shared disk device 102 is used to store the data needed to take over the processing at the time of system switching. A bus such as USB (Universal Serial Bus) may be used instead of the SCSI bus.

The Ethernet board 116 is connected to the Ethernet network 103, and communicates with other computers connected to the network 103. In the present embodiment, a plurality of controllers 910 for managing and controlling the plant 900 are connected to the network 103. A network such as token ring or ATM may be used instead of Ethernet.

The LXP board 115 is a function expansion board for system switching control; the LXP boards of the computers 100 and 101 are connected to each other via a linkage bus 104, which is a dedicated transmission path. The LXP board monitors the survival of the partner computer, transmits the forced interrupt, operation stop, and computer restart instruction messages required for system switchover, and, on receiving such an instruction message, executes the instruction on its own computer.

In such a duplex system, when both the active computer 100 and the standby computer 101 are in a normal state, the OS 130, the management program 131, the management communication program 132, and the application (AP) 135 are loaded into the main memory 111 of the active computer 100, and the management program 131, the management communication program 132, and the application 135 are running. Similarly, the same programs are loaded into the main memory 111 of the standby computer 101, and the OS 130, the management program 131, and the management communication program 132 are running, but the application 135 is not. Further, an interrupt processing routine 133 is loaded into the main memory 111 of each of the computers 100 and 101.

The application 135 is a program that performs the processing for which the duplex system is used. In the present embodiment, the application 135 processes and records data sent from each controller 910 via the network 103.

The management program 131 is a program that switches between the active computer and the standby computer. This program sends message transmission requests and operation instructions to the LXP board 115, and sends requests to transmit and receive survival notification messages to the management communication program 132.

The management communication program 132 sends and receives survival notification messages to and from the other computer via the network 103 using the Ethernet board 116. Messages are sent and received using the TCP/IP protocol. This program waits for a connection from another computer on a predetermined TCP port; when a connection is made, it receives the message and retains its contents, and it returns the retained contents in response to a read request from the management program 131. Also, upon receiving a request from the management program 131 to transmit a survival notification message, it sends the message to the TCP port on which the management communication program 132 on the other computer of the duplex system is waiting.

The interrupt processing routine 133 is registered so as to be activated when a non-maskable interrupt signal is input to the MPU. When a non-maskable interrupt signal is generated, it performs the processing required when a failure occurs, such as storage of failure information. Although in the present embodiment the routine is registered to be activated by a non-maskable interrupt signal, it may be realized using another interrupt mechanism provided by the MPU. In this embodiment, the interrupt processing routine 133 is an independent program. However, depending on the type of the OS 130, the interrupt processing routine is provided as a part of the OS. In that case, the same function can be realized by incorporating the necessary processing as a subroutine called from the interrupt processing routine of the OS 130.

Next, a system switching method of the multiplex system according to the present embodiment will be described.

Figure 2 shows the time chart of the system switchover process.

If both the active computer 100 and the standby computer 101 are normal, the following processing is performed.

The management program 131 requests the management communication program 132 and the LXP board 115 to transmit a survival notification message at regular time intervals (301). The management communication program 132 drives the Ethernet board 116 and sends a survival notification message 401 to the other computer via the network 103 (302). The LXP board 115, for its part, sends a survival notification message 402 to the other computer via the linkage bus 104 (303).

The management communication program 132 and the LXP board 115 of the standby computer 101, on receiving the survival notification messages 401 and 402 respectively, store the reception results (304, 305). The management program 131 of the standby computer 101 then checks at regular intervals whether the management communication program 132 and the LXP board 115 of its own computer have received a survival notification message from the active computer (306). If neither of the survival notification messages 401 and 402 from the active computer is received for a certain time or more, it is determined that a failure has occurred in the active computer. The survival notification message is transmitted over two paths so that failures in a transmission path, or in the circuit connecting to it, can be distinguished from failures in the computer itself. If only one of the survival notification messages is not received, it is determined that there is a failure in the transmission path; only a warning is issued, in the form of a screen display or a log record, and no system switchover is performed.
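The two-path decision described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `judge_peer`, the `Verdict` values, and the use of timestamps in seconds are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"            # both paths delivered a message recently
    PATH_WARNING = "path_warning"  # exactly one path is silent: warn only
    PEER_FAILED = "peer_failed"    # both paths are silent: start switchover


def judge_peer(now, last_seen_network, last_seen_linkage, timeout):
    """Classify the partner computer from the last reception time on each path.

    All arguments are hypothetical: times are in seconds, and `timeout`
    plays the role of the "certain time" in the text.
    """
    network_silent = (now - last_seen_network) >= timeout
    linkage_silent = (now - last_seen_linkage) >= timeout
    if network_silent and linkage_silent:
        return Verdict.PEER_FAILED
    if network_silent or linkage_silent:
        return Verdict.PATH_WARNING
    return Verdict.HEALTHY
```

Keeping the verdict separate from the action taken (warning or switchover) makes the rule easy to test without any network or LXP hardware.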

In Fig. 2, only the transmission of the survival notification message from the active computer 100 to the standby computer 101 is shown, but in reality the survival notification message is also transmitted in the opposite direction. The reception confirmation processing at the active computer 100 and the transmission processing at the standby computer 101 are likewise executed periodically.

Next, an operation when a failure occurs in the active computer 100 will be described.

There are several possible failure modes. First, a description will be given of a case in which a hang-up state occurs due to an infinite loop inside the OS.

The operation of the management program 131 stops because of the failure inside the OS, and the transmission processing 301 of the survival notification message is no longer executed at regular intervals. When the management program 131 of the standby computer 101 detects, in the received message confirmation 306 performed at regular intervals, that neither of the survival notification messages 401 and 402 has been received for the time 451, it determines that a failure has occurred in the active computer 100. The management program 131 on the standby computer 101 that has detected the failure requests the LXP board 115 to transmit a forced interrupt instruction (307), and the LXP board 115 transmits the forced interrupt instruction message 403 to the LXP board of the active computer (308).

When the LXP board 115 on the active computer 100 receives the forced interrupt instruction message 403, it generates a hardware non-maskable interrupt signal 404 (309). The MPU receives this interrupt signal and activates the interrupt processing routine 133.

When the interrupt processing routine 133 is started, it first invalidates the non-maskable interrupt signal. That is, it makes a setting so that the non-maskable interrupt signal is ignored if it is generated again (310).

After starting, the interrupt processing routine 133 instructs the components in its own computer that may affect the partner computer 101 to stop operating (311). In the configuration of the present embodiment, the SCSI board 114 and the Ethernet board 116 are such components, and their operation is stopped by setting the bit in the register on each board that instructs the board to stop. As a result, when the other computer 101 accesses the shared disk 102 or the network 103, it is not affected by the failed computer 100. Note that, depending on the type of component, the operation stop may instead be instructed by clearing an operation-enable bit in the register.
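The two ways of instructing a stop (setting a stop bit, or clearing an enable bit) can be illustrated with a small register model. The 8-bit width, the bit masks, and the class and function names below are hypothetical; the source does not specify the register layout of the SCSI or Ethernet boards.

```python
REGISTER_MASK = 0xFF   # assumption: an 8-bit control register
STOP_BIT = 0x01        # hypothetical "stop operation" bit
ENABLE_BIT = 0x80      # hypothetical "operation enabled" bit


class BoardRegister:
    """Stand-in for a memory-mapped control register on an expansion board."""

    def __init__(self, value=0):
        self.value = value & REGISTER_MASK


def stop_by_setting(register, stop_mask=STOP_BIT):
    """Stop a component whose register has a dedicated stop bit."""
    register.value |= stop_mask


def stop_by_clearing(register, enable_mask=ENABLE_BIT):
    """Stop a component whose register has an operation-enable bit."""
    register.value &= ~enable_mask & REGISTER_MASK
```

Both helpers leave the other bits of the register untouched, which matters when the same register also holds unrelated configuration.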

Next, the interrupt processing routine 133 sets the LXP board 115 to ignore subsequent instruction messages from other computers (312), and saves the failure information (313). After saving the failure information, the interrupt processing routine 133 stops (314), and the failed computer 100 is stopped.

In the failure information storage process 313, the contents of the main memory 111 and the contents of each register indicating the operation state of the computer body and of each function expansion board are stored. In addition to saving the failure information, those parts of the normal shutdown processing that can still be executed under the conditions after the failure may be executed. For example, if the cache contents are written to the disk unit 113, the consistency of the disk contents of the failed computer is maintained, and the possibility of rescuing the contents increases.

After requesting transmission of the forced interrupt instruction (307), the management program 131 of the standby computer 101 waits for a certain period of time 452 and then requests the LXP board 115 to transmit an operation stop instruction (315). At this point, it starts the application 135 loaded on the standby computer 101 to take over the processing of the active computer 100 (318), and sets its own computer as the new active system. This completes the system switching.

In response to the operation stop instruction transmission request from the management program 131, the LXP board 115 transmits an operation stop instruction message 405 (316). However, in the failed computer 100, the interrupt processing routine 133 has set the LXP board to ignore instruction messages (312), so the operation stop instruction message 405 is ignored and the collection of failure information (313) continues.
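The order of operations on the standby side (steps 307, 315, and 318) can be sketched as a small function. The callables `send_to_peer`, `wait`, and `start_application` are hypothetical stand-ins for the LXP board request, a timer, and the application launcher; the message names are also invented for the example.

```python
def switch_over(send_to_peer, wait, start_application, interval_452):
    """Standby-side switchover sequence, assuming the failure is already detected.

    send_to_peer(name)  -- asks the LXP board to transmit an instruction message
    wait(seconds)       -- blocks for the given time (e.g. time.sleep)
    start_application() -- starts the application on the standby computer
    """
    send_to_peer("FORCED_INTERRUPT")  # 307/308: trigger the failed computer's routine
    wait(interval_452)                # give it time to set "ignore messages" (312)
    send_to_peer("OPERATION_STOP")    # 315/316: ignored if the dump is in progress
    start_application()               # 318: take over as the new active computer
```

Passing the side effects in as callables keeps the sequence testable; in a real system `wait` could simply be `time.sleep`.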

In the operation stop processing 311 for the components in the failed computer, if each component has a means of checking its operating status, such as an operation status display register, a step that confirms that the operation stop processing 311 has actually stopped the component may be added. If this confirmation determines that the operation stop instruction has failed, the interrupt processing routine 133 stops processing. As a result, the setting to ignore instruction messages from the other computer is not made, and the computer 100 is forcibly stopped when its LXP board receives the operation stop instruction message 405 from the LXP board of the standby computer. The standby computer 101 then takes over the processing without being affected by the failed computer 100.

If it is determined at the beginning of the failure information storage process 313 that the failure information cannot be stored, for example because of a disk unit error, the interrupt processing routine 133 may cancel the LXP board's setting to ignore messages (319) and stop the failure information storage processing. In this case as well, the failed computer 100 is forcibly stopped in response to the operation stop instruction message 405 from the standby computer.
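The behavior of the interrupt processing routine across steps 311 to 314 and 319, including the two fallback cases, can be summarized in a short sketch. The `Lxp` class and the callables are hypothetical; the real routine runs in an interrupt context and manipulates hardware registers rather than Python objects.

```python
class Lxp:
    """Stand-in for the LXP board's 'ignore instruction messages' setting."""

    def __init__(self):
        self.ignore_instructions = False


def handle_failure(lxp, stop_components, save_failure_info, disk_ready=lambda: True):
    """Return True if the failure information was saved.

    stop_components() -- stops SCSI/Ethernet boards and returns True on success (311)
    disk_ready()      -- reports whether the dump can be written
    save_failure_info() -- writes the memory dump and register contents (313)
    """
    if not stop_components():
        # Stop not confirmed: leave the LXP board able to accept the
        # operation stop instruction from the standby computer.
        return False
    lxp.ignore_instructions = True        # 312
    if not disk_ready():
        lxp.ignore_instructions = False   # 319: allow the forced stop again
        return False
    save_failure_info()                   # 313
    return True                           # 314: the routine then stops
```

In both failure branches the board ends up accepting instruction messages, so the operation stop instruction message 405 can still halt the computer.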

As a second failure mode, a failure generally called a kernel panic, in which the OS detects a serious logical inconsistency and determines that continued operation is impossible, will be described. Figure 3 shows the time chart of the process in this case.

When the OS detects a logical inconsistency, it activates the interrupt processing routine 133 (331). As in the case described with reference to FIG. 2, the interrupt processing routine instructs the components in its own computer to stop operating (311), then sets the LXP board 115 to ignore subsequent instruction messages from other computers (312), then saves the failure information (313), and stops (314).

When an OS failure occurs and execution is transferred to the interrupt processing routine, the management program 131 on the active computer 100 stops operating, and the survival notification messages 401 and 402 are no longer transmitted to the standby computer. As described above, the management program 131 on the standby computer 101 detects that neither of the survival notification messages 401 and 402 is received (306), and the forced interrupt instruction message 403 and the computer operation stop instruction message 405 are transmitted (308, 316).

By the time the forced interrupt instruction message 403 is received, the interrupt processing routine 133 has already been activated and has set the LXP board to ignore messages (312). The forced interrupt instruction message 403 is therefore ignored (332), and the collection of failure information 313 continues. The operation stop instruction message 405 received subsequently is also ignored (333).

Here, it is assumed that the OS calls the interrupt processing routine 133, but the interrupt processing routine 133 may instead be started by generating the non-maskable interrupt signal. Depending on the type of OS, the OS itself saves the failure information (memory dump). If such an OS provides a function for calling a registered process before the dump is executed, the same processing can be realized by registering a routine consisting of the interrupt processing routine 133 without the saving of the failure information (313).

As a third failure mode, a partial hardware failure will be described. This is a case in which the effect of the failure does not appear as either of the two failure modes described above, but a failure that makes it impossible to continue the processing for which the multiplex system is used is detected. Figure 4 shows the time chart of the process in this case.

Such a failure may be detected by the management program 131, by the dedicated failure detection subprogram 134, or as an abnormality detected by the application 135. If a failure is detected by a program other than the management program, the failure detection is notified to the management program 131 (341, 342). The management program 131 starts the interrupt processing routine 133 when it detects a failure itself or receives a failure notification from the failure detection subprogram 134 or the application 135 (343). The interrupt processing routine 133 executes the same processing procedure as that performed on detection of an OS logical inconsistency, described with reference to FIG. 3, and the system switching is performed.

If the occurrence of a failure is monitored by a hardware mechanism, either this hardware uses an interrupt to notify the management program 131 or the failure detection subprogram 134 of the abnormality detection result, or the management program and the failure detection subprogram poll the hardware periodically to check whether an abnormality has been detected; the same processing is then performed.

Also, depending on the degree of destruction of memory contents or hardware malfunction, the interrupt processing routine 133 may not be able to be started. In this case, the fault occurrence computer 100 is in a severely uncontrollable state, performs an unpredictable operation, and may affect the operation of the standby computer 101.

In this case, the setting (312) for ignoring instruction messages from the other computer is not made on the LXP board 115 of the failed computer. Therefore, the LXP board 115, on receiving the operation stop instruction message 405 from the standby computer, forcibly stops the computer 100. The processing is thus taken over after it has been ensured that the failed computer 100 does not affect the operation of the standby computer 101, so the system switching can be performed reliably.

The time 451 from when the survival notification messages cease to be received until it is determined that a failure has occurred is set slightly longer than the time from when the interrupt processing routine 133 is called by software because of a failure, as shown in FIG. 3, until the setting (312) for the LXP board is completed. The interval 452 between the transmission of the forced interrupt instruction message and the transmission of the computer operation stop instruction message is set slightly longer than the time from the forced interrupt instruction (307) until the interrupt processing routine 133 of the active computer 100 is started and the setting (312) for the LXP board is completed, as shown in FIG. 2. The system switching time, that is, the time until the takeover of processing is completed, is approximately the sum of the time 451 and the time 452. This switching time is sufficiently shorter than the time required for saving failure information such as a memory dump, so that the failure information can be saved while the system switching time is still reduced.
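The relationship between the two timer values and the switching time can be made concrete with a small calculation. The source gives no numeric values; the millisecond figures in the usage note, the `margin_ms` parameter, and the function names are purely illustrative assumptions.

```python
def detection_timeout_451(setup_after_software_call_ms, margin_ms):
    """Time 451: slightly longer than the time from the software call of the
    interrupt routine until the LXP 'ignore' setting (312) is complete."""
    return setup_after_software_call_ms + margin_ms


def stop_delay_452(setup_after_forced_interrupt_ms, margin_ms):
    """Time 452: slightly longer than the time from the forced interrupt (307)
    until the LXP 'ignore' setting (312) is complete."""
    return setup_after_forced_interrupt_ms + margin_ms


def switchover_time(t451_ms, t452_ms):
    """Approximate switching time: the sum of time 451 and time 452."""
    return t451_ms + t452_ms
```

For example, with hypothetical setup times of 200 ms and 300 ms and a 100 ms margin on each, the switching time comes to 700 ms, regardless of how long the memory dump itself takes.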

The above description covers the case in which a failure occurs in the active computer 100. When a failure occurs in the standby computer 101, the same processing is performed, except that no takeover of processing and no switching between the active and standby systems takes place.

In this embodiment, each computer has the LXP board 115 and the Ethernet board 116. However, system switching can be performed in the same way in a multiplex system in which each computer has two Ethernet boards 116 and the survival notification messages are communicated over a duplicated Ethernet network 103. In such a system, in the failure modes of OS logical inconsistency detection and partial hardware failure detection, the failure information can be saved on the failed computer 100 while the processing is transferred to the standby computer 101. However, since the forced interrupt instruction message 403 cannot be sent, failure information cannot be saved in the hang-up failure mode. In addition, since the operation stop instruction message 405 cannot be sent, the abnormal operation of the failed computer 100 may affect the standby computer 101, depending on the severity of the failure.

Hereinafter, details of each unit will be described.

First, the LXP board 115 will be described. Figure 5 shows the internal configuration of the LXP board 115.

As shown in the figure, the LXP board 115 has an expansion bus interface 170 in charge of input and output to and from the expansion bus 121, a linkage control processor 171 that processes messages passing over the linkage bus 104, a memory 175 for storing the programs executed by the linkage control processor 171, a transmission line interface 172 that converts messages to electrical signals on the linkage bus, a message storage memory 173 that is a buffer for temporarily storing messages, a power supply voltage detection circuit 174 that detects the rise of the power supply voltage, and an operation control register 176 for checking the operation state of the linkage control processor 171 from the expansion bus side and for instructing its operation method.

The operation control register 176 can be read and written from the expansion bus 121, so software running on the computer on which the LXP board 115 is mounted can check the operation state and instruct the operation method. The operation control register 176 includes a forced interrupt instruction inhibit bit 1761, an operation stop instruction inhibit bit 1762, and a restart instruction inhibit bit 1763, which are described later.

The initialization operation of the LXP board will be described. The LXP board operates independently of the computer to which it is connected and must itself drive the reset signal of that computer. For this reason, the LXP board is initialized only when power to the LXP board is turned on, independently of the reset processing of the computer. The power supply voltage detection circuit 174, which monitors the power supply voltage supplied via the expansion bus 122, detects the rise of the power supply voltage and outputs an initialization signal 184 that instructs each component in the LXP board to initialize. The expansion bus interface 170, the linkage control processor 171, and the transmission line interface 172 receive this initialization signal 184 and perform initialization processing such as clearing the memory, clearing various state information, clearing the registers, and resetting the linkage bus.

Next, the message transmission function will be described. The management program 131 sends a message transmission request to the expansion bus interface 170 via the expansion bus 122. Because the data transfer rates of the expansion bus 122 and the linkage bus 104 differ, the expansion bus interface 170 temporarily stores the message to be transmitted in the message storage memory 173, which serves as a speed-matching buffer, and notifies the linkage control processor 171 of the arrival of the message. In response to this notification, the linkage control processor 171 retrieves the message from the message storage memory 173, transfers it to the transmission line interface 172, and transmits it via the linkage bus 104 to the LXP board of the other computer.

Finally, the message reception processing function will be described. When an instruction message arrives from the LXP board of another computer via the linkage bus 104, one of the following processes is performed according to the type of the instruction message.

(1) If the message is a forced interrupt instruction, a non-maskable interrupt signal is output to the connected computer via the non-maskable interrupt signal line 182, and the processing of the MPU 110 is switched to the interrupt processing routine 133. However, if the forced interrupt instruction inhibit bit 1761 of the register 176 is set, this processing is not performed and the instruction message is ignored.

(2) If the message is an operation stop instruction, the reset signal is output continuously to the connected own computer via the reset signal line 183, thereby forcibly stopping the computer. However, if the operation stop instruction inhibit bit 1762 of the register 176 is set, this processing is not performed and the message is ignored.

(3) If the message is a restart instruction, the reset signal is output once to the connected own computer via the reset signal line 183, thereby restarting the computer. However, if the restart instruction inhibit bit 1763 of the register 176 is set, this processing is not performed and the message is ignored.

(4) In the case of a message other than the above, the contents of the message are stored in the message storage memory 173. The stored message can thereafter be read out at any time via the expansion bus interface 170 and the expansion bus 122 in response to a request from the management program 131.
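The four reception cases above can be illustrated with a minimal Python sketch. It is only a software model of behavior that the LXP board implements in hardware; the class name `LxpBoard`, the field names, and the signal labels are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Msg(Enum):
    FORCED_INTERRUPT = auto()
    OPERATION_STOP = auto()
    RESTART = auto()
    OTHER = auto()


@dataclass
class LxpBoard:
    # Inhibit bits of the operation control register 176 (field names are hypothetical).
    inhibit_interrupt: bool = False  # bit 1761
    inhibit_stop: bool = False       # bit 1762
    inhibit_restart: bool = False    # bit 1763
    message_memory: list = field(default_factory=list)  # models message storage memory 173
    signals: list = field(default_factory=list)         # record of signals driven to the host

    def receive(self, kind: Msg, payload: str = "") -> None:
        """Dispatch one message received from another board over the linkage bus."""
        if kind is Msg.FORCED_INTERRUPT:
            if not self.inhibit_interrupt:
                self.signals.append("NMI")          # case (1): signal line 182
        elif kind is Msg.OPERATION_STOP:
            if not self.inhibit_stop:
                self.signals.append("RESET_HELD")   # case (2): reset asserted continuously
        elif kind is Msg.RESTART:
            if not self.inhibit_restart:
                self.signals.append("RESET_PULSE")  # case (3): reset asserted once
        else:
            self.message_memory.append(payload)     # case (4): keep for the management program
```

With an inhibit bit set, the corresponding instruction leaves no trace in `signals`, which is the property the failed computer relies on to keep saving failure information.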

FIG. 6 shows the processing procedure of the expansion bus interface 170.

Upon receiving either an input/output request signal from the computer (via the expansion bus) or an initialization signal from the initialization signal line 184, the expansion bus interface 170 leaves the request waiting state (501), starts processing, and determines the type of the processing request from the received signal (502).

If the processing request is an initialization signal, the internal registers and circuits are initialized (503).

When the processing request is a read signal from the expansion bus 121, the contents of the register 176 are read if the target of the read request is a register (505), and the contents of the message storage memory 173 are read if the target is a message (507). The read result is sent to the expansion bus 121 (506, 508).

When the processing request is a write signal from the expansion bus 122, the write contents are written to the register 176 if the target of the write request is a register (510). If the target of the write request is a transmission message, the message is temporarily stored in the message storage memory 173 (511), and the linkage control processor 171 is notified (512).

FIG. 7 shows the processing procedure of the linkage control processor 171.

The linkage control processor 171 receives, as an event, one of a start request from the expansion bus interface 170, a message from the transmission path interface 172, and an initialization signal from the initialization signal line 184. On such an event it leaves the event waiting state (521), starts processing, and determines the type of the event (522).

If the event is an initialization signal, the communication processing is initialized, all messages stored in the message storage memory 173 are discarded, and the register 176 is set to its initial state (523).

If the event is a start request from the expansion bus interface 170, that is, a message transmission request, the message to be transmitted is read from the message storage memory 173 (524) and passed to the transmission path interface 172 (525).

If the event is a message reception event from the transmission path interface 172, it indicates the arrival of an instruction message from another LXP board. In this case, the type of the received instruction message is determined (526), and the corresponding processing is performed.

If the message is a forced interrupt instruction, an operation stop instruction, or a restart instruction, the processor confirms, as described above, that the corresponding inhibit bit (1761, 1762, or 1763) is cleared (527, 529, 531) and then outputs the corresponding signal (528, 530, 532).

In the case of a message other than the above, the received message is simply stored in the message storage memory 173 (533).

Next, the management program 131 will be described.

The management program 131 performs the following three processes.

(1) Periodically send a survival notification message to notify other computers that its own computer is operating normally.

(2) Monitor the survival notification message sent from the other computer. If it is not received for a certain period of time, judge that the source computer has failed, and send a forced interrupt instruction message and an operation stop instruction message to that computer. If the failed computer is the active computer, take over the processing it was executing, so that the own computer becomes the new active computer.

(3) Recognize, through a call from another program, that a failure has occurred in its own computer, and start the interrupt processing routine 133 to collect failure information.

Note that the management program 131 may itself have a function of detecting the occurrence of a failure in its own computer. In this case, when a failure is detected, the interrupt processing routine is started in the same way as in (3) above.

Fig. 8 shows the processing flow of the survival notification message transmission processing in (1) above.

As shown in the figure, this process periodically sends a survival notification to the other computer. That is, it requests the management communication program 132 and the LXP board 115 to transmit a survival notification message (301) and then enters a waiting state for a predetermined time (5401).
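The periodic transmission described above might be sketched as follows. This is an illustrative loop only: the function name, the two callback parameters standing in for the management communication program 132 and the LXP board 115, and the `cycles` limit (added so the loop can terminate in a demonstration) are all assumptions.

```python
import time
from typing import Callable, Optional


def send_survival_notifications(
    send_network: Callable[[str], None],  # stands in for management communication program 132
    send_lxp: Callable[[str], None],      # stands in for LXP board 115
    interval_s: float,
    cycles: Optional[int] = None,         # None means run indefinitely
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a survival notification on both paths, then wait, repeatedly."""
    sent = 0
    while cycles is None or sent < cycles:
        send_network("alive")  # step 301, network path
        send_lxp("alive")      # step 301, linkage bus path
        sent += 1
        sleep(interval_s)      # waiting state for a predetermined time
    return sent
```

Sending over two independent paths is what later allows the receiver to distinguish a failed computer from a failed transmission path.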

Fig. 9 shows the processing flow of (2) above: monitoring of the survival notification message and the processing performed when a failure occurs in the other system.

As shown in the figure, the reception status of the survival notification message from the other computer is checked periodically, and if the message has not been received for a certain period of time, the other-system failure processing is executed.

To define the waiting time 451 used to judge a failure in the other system, the variables “notification 1 wait count” and “notification 2 wait count” are used. The initial value of these variables is N, and the product N × t_w of N and the waiting time t_w in process 563 is the waiting time 451 for judging that the other system has failed. First, each of these variables is initialized to N (551, 552).

Next, since the management communication program 132 stores the contents of received messages, it is asked whether the survival notification message 401 has been received (553). If it has been received, “notification 1 wait count” is reset to N (554), and the management communication program 132 is instructed to clear the stored survival notification message (555). If it has not been received, the value of “notification 1 wait count” is decremented by one; if the value would become negative, it is set to 0 (556).

Similarly, since the LXP board 115 stores the contents of received messages, it is asked whether the survival notification message 402 has been received (557). If it has been received, “notification 2 wait count” is reset to N (558), and the survival notification message stored in the LXP board 115 is cleared (559). If it has not been received, the value of “notification 2 wait count” is decremented by one; if the value would become negative, it is set to 0 (560).

At this point, the values of “notification 1 wait count” and “notification 2 wait count” are checked (561).

If both variables are 0, neither survival notification message 401 nor 402 has been received for at least the waiting time 451, given by N × t_w, so it is determined that a failure has occurred in the other computer. First, the LXP board 115 is requested to transmit a forced interrupt instruction message 403 (307); then, after waiting for a certain period of time 452 (564), the LXP board 115 is requested to transmit a computer operation stop instruction message 405 (315). Further, if the own computer is set as a standby computer, it takes over the processing of the active computer (318), and the system switchover is executed. After these processes have been executed, the failed computer of the other system remains stopped, so the survival notification message monitoring process is stopped (566).

If the failed computer is replaced or the cause of the failure is removed, and the computer is to be returned to the multiplex system as a standby computer, this monitoring process is started again (550). The restart may be performed manually by the operator. Alternatively, after the monitoring process is stopped (566), a separate process may be started that continues to watch for survival notification messages and restarts the monitoring process (550) when one is detected.

If only one of “notification 1 wait count” and “notification 2 wait count” is 0 in process 561, it is determined that a failure has occurred in a message transmission path or in the connection circuit to that path, and a warning is issued in the form of a screen display or log record (562).

Unless both “notification 1 wait count” and “notification 2 wait count” are 0 in process 561, the process waits for the predetermined time t_w (563) and returns to process 553.
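The two-counter monitoring flow of Fig. 9 can be summarized in a short Python sketch. It is a simplified model under stated assumptions: each element of `polls` represents one polling period t_w and says whether messages 401 and 402 were received in that period, and the returned action strings are hypothetical labels for the steps in the text rather than real calls.

```python
from typing import Iterable, List, Tuple


def monitor(polls: Iterable[Tuple[bool, bool]], n: int, is_standby: bool = True) -> List[str]:
    """Model of the survival notification monitoring process (550 to 566)."""
    count1 = n  # "notification 1 wait count", initialized in 551
    count2 = n  # "notification 2 wait count", initialized in 552
    actions: List[str] = []
    for got_401, got_402 in polls:
        # 553 to 556: reset on reception, otherwise decrement with a floor of 0.
        count1 = n if got_401 else max(count1 - 1, 0)
        # 557 to 560: the same handling for the path through the LXP board.
        count2 = n if got_402 else max(count2 - 1, 0)
        if count1 == 0 and count2 == 0:  # 561: silent for at least N x t_w on both paths
            actions.append("send forced interrupt instruction 403")  # 307
            actions.append("wait 452")                               # 564
            actions.append("send operation stop instruction 405")    # 315
            if is_standby:
                actions.append("take over active processing")        # 318
            actions.append("stop monitoring")                        # 566
            return actions
        if count1 == 0 or count2 == 0:
            actions.append("warn: transmission path fault suspected")  # 562
        # 563: wait t_w, then poll again (the next loop iteration).
    return actions
```

For example, with N = 3 and t_w = 1 second, the other computer is judged to have failed only after three consecutive polling periods with no message on either path, that is, after about 3 seconds of silence.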

FIG. 10 shows the processing flow of the management program 131 for (3) above, when a failure has occurred in its own computer.

This processing is started by a call from the fault detection subprogram 134 or the application 135 (570) and simply starts the interrupt processing routine 133 (344). The interrupt processing routine 133 does not return control to the caller.

Next, the interrupt processing routine 133 will be described.

The interrupt processing routine 133 is started either by software on its own computer when a failure occurs, or by the LXP board 115 when it receives a forced interrupt instruction message from another computer. It saves failure information and performs the related processing.

FIG. 11 shows the processing flow of the interrupt processing routine 133.

At startup, the interrupt processing routine 133 first invalidates the non-maskable interrupt signal (310). This is achieved by preparing a dummy interrupt processing routine that returns without performing any processing and registering it in the MPU as the processing routine for non-maskable interrupts. As a result, even if a non-maskable interrupt signal is generated again while the interrupt processing routine 133 is running, control passes to the dummy routine and returns immediately, so the non-maskable interrupt is effectively ignored and the interrupt processing routine 133 can continue. Next, the routine instructs some components of its own computer, in particular those that may affect other computers, to stop operating (311). It then queries the status of each component that was instructed to stop and confirms whether all of them have actually stopped (581). If any component failed to stop, the interrupt processing is aborted (590). If all the components instructed to stop have stopped, the LXP board 115 is set to ignore subsequent instruction messages from other computers (312).

Subsequently, it is checked whether the failure information can be saved (582). If it is determined that the failure information cannot be saved, the LXP board 115 is released from ignoring instruction messages from other computers (319), and the interrupt processing is aborted (590). If it is determined that saving is possible, the failure information is actually saved (313). After the failure information has been saved, the interrupt processing routine 133 stops (314), and the own computer is stopped. After the failure information has been saved, the LXP board 115 on the own computer may also be instructed to assert the reset signal continuously so that the operation of the computer stops completely.

If the interrupt processing is aborted (590), the own computer enters a halted state. Because the LXP board 115 then generates the reset signal continuously in accordance with the operation stop instruction message sent from another computer, operation stops completely in this case as well.
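The control flow of the interrupt processing routine 133 in FIG. 11 can be sketched as follows. This is a simplified model, not the actual routine: the `Host` class, its fields, and the log strings are hypothetical, and the outcomes of checks 581 and 582 are supplied as inputs rather than measured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    components_stop_ok: bool = True         # assumed outcome of check 581
    can_save: bool = True                   # assumed outcome of check 582
    lxp_ignores_instructions: bool = False  # models the inhibit bits on LXP board 115
    log: List[str] = field(default_factory=list)


def interrupt_routine(host: Host) -> str:
    """Model of interrupt processing routine 133; returns 'saved' or 'aborted'."""
    host.log.append("310 register dummy NMI handler")
    host.log.append("311 stop components that may affect other computers")
    if not host.components_stop_ok:            # 581: did every component stop?
        host.log.append("590 abort")
        return "aborted"                       # board still obeys the stop instruction
    host.lxp_ignores_instructions = True       # 312: ignore later instruction messages
    if not host.can_save:                      # 582: can failure information be saved?
        host.lxp_ignores_instructions = False  # 319: obey instruction messages again
        host.log.append("590 abort")
        return "aborted"
    host.log.append("313 save failure information")
    host.log.append("314 stop")
    return "saved"
```

On both abort paths the board is left obeying instruction messages, so the operation stop instruction from the other computer can still halt the failed computer completely.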

As described above, according to the present invention, in a multiplex system, when a failure occurs, high-speed system switching can be realized while storing large-capacity failure information including a memory dump.

Further, according to the present invention, a runaway of hardware or software in the failed system, and the failure-information saving operation in the failed system, can be prevented from affecting the system switching operation and the operation of the new active system that has taken over the processing after the switchover.

Industrial applicability

As described above, the present invention is effective in multiplex systems for applications requiring high reliability, in which a standby computer takes over the processing performed by the active computer when the active computer fails. When one of the computers in such a system fails, post-failure analysis becomes possible, which helps with recovery measures, prevention of recurrence, and improvement of reliability.

Claims

The scope of the claims

1. A system switching method for a multiplex system that comprises a plurality of computers and in which, when a failure occurs in the computer set as the active system, the computer set as the standby system takes over the processing performed by that computer, characterized in that,

at the time of the failure,

software running on the failed computer detects the failure and saves the failure information, or a standby computer detects the failure and instructs the failed computer to save the failure information; and, after the standby computer recognizes the failure, the standby computer autonomously takes over the processing without waiting for the failed computer to finish saving the failure information.

2. The system switching method for a multiplex system according to claim 1, wherein each of the computers has a function expansion board that operates independently of the software on the computer, the function expansion boards being connected to each other via a transmission line;

each of the function expansion boards has a function of generating an interrupt to the computer on which it is mounted and a function of stopping the operation of that computer, according to the content of a message received via the transmission line from a function expansion board mounted on another computer, and has a function of suppressing each of these message-driven functions when instructed by software running on the computer on which it is mounted; and

when the occurrence of a failure in another computer is recognized, the function expansion board mounted on the computer that recognized the failure transmits, to the function expansion board mounted on the failed computer, a message instructing generation of an interrupt and, after a certain period of time, a message instructing the computer to be stopped, and the failed computer, in the interrupt processing for the interrupt generated in response to the interrupt instruction message, saves the failure information and instructs its function expansion board to suppress the interrupt generation function and the computer operation stop function, thereby ignoring the message transmitted later to instruct the computer to stop and continuing to save the failure information.

3. The system switching method for a multiplex system according to claim 2, wherein, when a failure occurs, the software of the failed computer saves the failure information on its own initiative and instructs the function expansion board to suppress the interrupt generation function and the computer operation stop function, so that the interrupt generation instruction message and the computer stop instruction message transmitted later are ignored and the saving of the failure information continues.

4. The system switching method for a multiplex system according to claim 1, wherein, prior to saving the failure information, the failed computer instructs parts not related to the saving of the failure information, in particular input/output units connected to the other computers constituting the multiplex system, other than the transmission line connecting the function expansion boards, to stop operating.

5. The system switching method for a multiplex system according to claim 4, further comprising means for confirming whether the parts instructed to stop operating have actually stopped, wherein, if any of the parts failed to stop, the failure information is not saved and the operation of the failed computer is stopped.

6. The system switching method for a multiplex system according to claim 1, wherein a part of a normal computer shutdown procedure is executed instead of, or simultaneously with, the saving of the failure information.