JP2005108034A

JP2005108034A - Computer system

Info

Publication number: JP2005108034A
Application number: JP2003342544A
Authority: JP
Inventors: Tsutomu Igarashi (五十嵐 強)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-09-30
Filing date: 2003-09-30
Publication date: 2005-04-21

Abstract

PROBLEM TO BE SOLVED: To provide a computer system in which the cause of a failure can be investigated easily when a failure occurs in a computer, and in which the computer can reliably be made to resume operation.

SOLUTION: When the computer being monitored for faults is in a communicable state, meaning that at least neither a memory dump nor a fault diagnosis is being carried out, and the computer fails to respond and is therefore judged to be faulty, an interrupt signal is output to make the computer carry out restart processing. If the computer still does not respond after that, a reset signal is output to make the computer restart.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a failure countermeasure technique for a computer system, applied when a failure occurs in an operating computer.

Conventionally, the following technique has been used in computer systems as a countermeasure when a failure occurs in an operating computer.

A main processor periodically queries a slave processor to confirm that it is operating normally. If there is no response from the slave processor, the main processor recognizes that a failure has occurred in the slave processor, and resets and restarts it (Patent Document 1, paragraphs 0010 to 0016, FIGS. 1 and 2).

Computer failures fall into two types: hardware failures caused by hardware faults and software failures caused by software faults. In the conventional technique above, the processor is uniformly reset and restarted whenever a failure occurs, so the cause of the failure was not easy to investigate.

Furthermore, in a dual computer system, there are situations such as when the failed computer itself is taking a memory dump (crash dump) to capture the state of its memory, or when the system is temporarily being operated on one computer alone while the other computer runs its own fault diagnosis (test). Applying the conventional countermeasure in such situations could cause a serious failure, for example leaving the computer unable to restart.

Patent Document 1: JP-A-6-236299

An object of the present invention is to provide a computer system that makes it easy to investigate the cause of a computer failure and that can reliably restart the computer.

The computer system according to the present invention comprises: communicable state determination means for determining whether a computer to be monitored for faults is in a communicable state in which at least neither a memory dump nor a fault diagnosis is being performed; operation confirmation signal output means for outputting an operation confirmation signal to the computer when the computer is in the communicable state; first response determination means for determining whether there is a response from the computer after the operation confirmation signal output means has output the operation confirmation signal; interrupt means for outputting, when the first response determination means determines that there is no response from the computer, an interrupt signal for causing the computer to perform restart processing; second determination means for determining whether there is a response from the computer after the interrupt means has output the interrupt signal; and reset means for outputting a reset signal to the computer when the second determination means determines that there is no response from the computer.

According to the present invention, it is possible to provide a computer system that makes it easy to investigate the cause of a computer failure and that can reliably restart the computer.

Embodiments of the present invention will be described below with reference to the drawings.

A first embodiment of the present invention will now be described in detail with reference to the drawings. FIG. 1 is a diagram showing a dual computer system. A computer (A) 1 and a computer (B) 2 are connected to each other via a LAN 3 to form the dual computer system. In the dual computer system, one computer serves as the active computer (the computer in the online state) and the other as the standby computer (the computer in the standby state). The computer (A) 1 and the computer (B) 2 exchange inter-computer communication signals 100 and 101, reset signals 102 and 103, interrupt signals 104 and 105, and status signals 106 and 107.

FIG. 2 is a diagram showing details of the computer (B) 2 shown in FIG. 1. The computer (A) 1 has the same configuration as that shown in FIG. 2. A first controller 10, a second controller 11, a processor 12, a memory 13, an I/O controller 14, and a network interface 15 are connected to one another via a system bus 17. A magnetic disk device (HDD) 16 is connected to the I/O controller 14.

The first controller 10 receives an operation confirmation signal 100, a reset signal 102, an interrupt signal 104, and a status signal 106 from the computer (A) 1 (not shown). In the other direction, the first controller 10 transmits an operation confirmation signal 101, a reset signal 103, an interrupt signal 105, and a status signal 107 to the computer (A) 1. The reset signal 102 is output from the first controller 10 to the second controller 11, and the interrupt signal 104 is output from the first controller 10 to the processor 12.

FIG. 3 is a diagram showing details of the second controller 11. The second controller 11 consists of an external reset flip-flop circuit 111, a power-on reset circuit 112, and an OR circuit 113. The external reset flip-flop circuit 111 is set by the reset signal 102 output from the first controller 10 and supplies its set output to the OR circuit 113. The power-on reset circuit 112 outputs a reset signal to the OR circuit 113 when the power of the computer (B) 2 is turned on or when the computer is reset. The OR circuit 113 ORs the set output of the external reset flip-flop circuit 111 with the reset signal from the power-on reset circuit 112 and outputs the result as the reset signal 102. FIG. 3 omits the circuits related to the interface with the system bus 17.
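The signal path through the second controller 11 can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the description does not say when the flip-flop is cleared, so clearing is not modelled.

```python
class SecondController:
    """Toy model of the second controller 11 in FIG. 3 (names are hypothetical)."""

    def __init__(self) -> None:
        # State of the external reset flip-flop circuit 111.
        self.external_reset_ff = False

    def latch_external_reset(self) -> None:
        """Reset signal 102 from the first controller 10 sets the flip-flop."""
        self.external_reset_ff = True

    def reset_output(self, power_on_reset: bool) -> bool:
        """OR circuit 113: flip-flop set output OR power-on reset circuit 112."""
        return self.external_reset_ff or power_on_reset


controller = SecondController()
print(controller.reset_output(power_on_reset=False))
controller.latch_external_reset()
print(controller.reset_output(power_on_reset=False))
```

Because the flip-flop keeps its state, the processor can later read it to tell an externally requested reset from a power-on reset.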

FIG. 4 is a diagram showing the states of the status signal 107 and, for each state, whether the computer is communicable. The status signal 107 is a 3-bit signal: one bit indicates whether the computer is in the operating state, one bit indicates whether the computer is performing a memory dump, and one bit indicates whether the computer is performing a fault diagnosis (test). In FIG. 4, "1" means that the state is set and "0" means that the state is reset. As shown in FIG. 4, the computer is communicable only when it is in the operating state (online state), no memory dump is in progress, and it is not in the test (diagnosis) state. In every other state the computer is not communicable.
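The rule in FIG. 4 reduces to a single Boolean expression over the three status bits. A minimal sketch follows; the `Status` type, its field names, and the bit order used in the loop are assumptions made for illustration, since the description does not specify bit positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Three status bits carried by status signal 107 (field names are hypothetical)."""
    online: bool       # 1 = computer is in the operating (online) state
    memory_dump: bool  # 1 = a memory dump is in progress
    test: bool         # 1 = a fault diagnosis (test) is in progress


def is_communicable(status: Status) -> bool:
    """Communicable only when online, not dumping memory, and not under test."""
    return status.online and not status.memory_dump and not status.test


# Enumerate all eight bit combinations, in the spirit of the table in FIG. 4.
# The bit order (online, memory_dump, test) is an assumption.
for bits in range(8):
    s = Status(bool(bits & 0b100), bool(bits & 0b010), bool(bits & 0b001))
    print(int(s.online), int(s.memory_dump), int(s.test), is_communicable(s))
```

Exactly one of the eight combinations (online set, the other two bits reset) is communicable.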

Next, the operation when the computer (B) 2, the active system in the dual computer system, fails will be described with reference to the flowcharts in FIGS. 5, 6, 7, and 8. FIG. 5 is a flowchart explaining the operation of the computer (A) 1, which is the standby system. FIG. 6 is a flowchart explaining the interrupt processing of the computer (B) 2, which is the active system. FIG. 7 is a flowchart explaining the reset acceptance processing of the computer (B) 2. FIG. 8 is a flowchart explaining the reset processing of the computer (B) 2.

The computer (A) 1, which is the standby system, monitors the computer (B) 2, which is the active system, for failures, and takes countermeasures when a failure is found. First, the computer (A) 1 checks the status signal 107 to determine whether the computer (B) 2 is in a communicable state (step S46 in FIG. 5). As shown in FIG. 4, the computer (B) 2 is judged to be communicable when it is in the operating state (online state), no memory dump is in progress, and it is not in the test (diagnosis) state; in every other case it is judged to be non-communicable. If the computer (B) 2 is communicable, the process proceeds to step S50; if not, it proceeds to step S47. In step S47, the computer (A) 1 waits for a fixed time so that the memory dump or diagnosis on the computer (B) 2 can finish. The waiting time differs from system to system. After the fixed time has elapsed, the status signal 107 is checked again in step S48 to determine whether the computer (B) 2 is communicable. If it is, the process proceeds to step S50. If it is not, the computer (B) 2 is judged to be in a locked state, a flag indicating that the process was forcibly advanced is set in the following step S49, and the process proceeds to step S50.
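Steps S46 to S49 amount to "check, wait once, check again, then proceed regardless". The sketch below shows that shape. The function name, the callable used to read the status signal, and the injectable `sleep` parameter are assumptions introduced for illustration.

```python
import time
from typing import Callable


def precheck_before_probe(
    read_communicable: Callable[[], bool],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run steps S46 to S49 and return the 'forcibly advanced' flag.

    In every case the caller continues to step S50 afterwards.
    """
    if read_communicable():   # S46: status signal 107 says communicable
        return False
    sleep(wait_seconds)       # S47: give the memory dump or diagnosis time to finish
    if read_communicable():   # S48: check the status signal again
        return False
    return True               # S49: target judged locked, flag set


# Example: the target is busy on the first check and ready on the second.
readings = iter([False, True])
forced = precheck_before_probe(lambda: next(readings), 0.0, sleep=lambda _: None)
print(forced)
```

Passing `sleep` as a parameter is only a convenience for trying the sketch without real delays; the waiting time itself is system-specific, as the description notes.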

Next, the computer (A) 1 outputs the operation confirmation signal 100 to the computer (B) 2 (step S50). The computer (A) 1 then checks whether it has received the operation confirmation signal 101, which the computer (B) 2 outputs in response to the operation confirmation signal 100, and thereby checks whether there is a response from the computer (B) 2 (step S51). If the computer (A) 1 receives the operation confirmation signal 101, it determines that the computer (B) 2 has responded and returns to step S46 (Y in step S51). If the computer (A) 1 does not receive the operation confirmation signal 101, it determines that the computer (B) 2 has not responded and has failed, and proceeds to step S52 (N in step S51).

In step S52, the computer (A) 1 outputs to the computer (B) 2 the interrupt signal 104, a non-maskable interrupt of the highest level. In step S53, the computer (A) 1 checks whether it has received the operation confirmation signal 101 output from the computer (B) 2 and thereby determines whether there is a response from the computer (B) 2.

Meanwhile, if a hardware failure has occurred in the computer (B) 2, it cannot carry out the interrupt processing even though it receives the interrupt signal 104 from the computer (A) 1, and so cannot output the operation confirmation signal 101 as a response to the computer (A) 1. If instead a software failure has occurred, such as processing stuck in an infinite loop because of a software bug, the computer (B) 2 executes the interrupt processing shown in FIG. 6.

First, the computer (B) 2 outputs the operation confirmation signal 101 to the computer (A) 1 (step S60). Next, for failure cause analysis, the computer (B) 2 collects the contents of the memory 13, the I/O setting information, the register information of the processor (CPU) 12, and so on (step S61), and saves the collected information to the magnetic disk device (HDD) 16 (step S62). The computer (B) 2 then clears the memory 13 (step S63) and boots the system (step S64). After that, the computer (B) 2 sends a message to the computer (A) 1 via the LAN 3 indicating that the system has started up (step S65), and ends the interrupt processing.
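The ordering of steps S60 to S65 matters: the diagnostic snapshot is taken and saved before the memory is cleared. The toy model below illustrates that ordering only. The dictionary layout, the key names, and the message strings are hypothetical stand-ins and not part of the described system.

```python
from typing import Any, Dict, List


def handle_restart_interrupt(machine: Dict[str, Any], sent: List[str]) -> None:
    """Toy model of the interrupt processing in FIG. 6 (steps S60 to S65)."""
    sent.append("operation confirmation signal 101")   # S60: answer the monitor
    snapshot = {                                        # S61: collect diagnostic data
        "memory": dict(machine["memory"]),
        "io_settings": dict(machine["io_settings"]),
        "registers": dict(machine["registers"]),
    }
    machine["disk"].append(snapshot)                    # S62: save to the HDD
    machine["memory"].clear()                           # S63: clear the memory
    machine["booted"] = True                            # S64: boot the system
    sent.append("system started message")               # S65: notify over the LAN


# Hypothetical machine state used only to exercise the sketch.
machine = {
    "memory": {"0x1000": 42},
    "io_settings": {"port": 1},
    "registers": {"pc": 0x1234},
    "disk": [],
    "booted": False,
}
sent: List[str] = []
handle_restart_interrupt(machine, sent)
print(machine["disk"][0]["memory"], machine["memory"], sent)
```

Copying the memory contents with `dict(...)` before clearing keeps the saved snapshot intact, which mirrors the reason the description saves to disk before step S63.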

Meanwhile, in step S53 the computer (A) 1 checks whether it has received the message from the computer (B) 2 indicating that the system has started up, and thereby determines whether there is a response from the computer (B) 2. If there is a response, the computer (A) 1 determines that the failure in the computer (B) 2 has been recovered and returns to step S46. If there is no response, the computer (A) 1 determines that a failure (hardware failure) has occurred in the computer (B) 2, proceeds to step S54, and checks the status signal 107 to determine whether the computer (B) 2 is in a communicable state. If it is communicable, the computer (A) 1 determines that a failure (hardware failure) has occurred in the computer (B) 2 and outputs the reset signal 102 to the computer (B) 2 (step S57). If it is not communicable, the computer (A) 1 waits for a fixed time so that the memory dump or diagnosis on the computer (B) 2 can finish (step S55). The waiting time differs from system to system. After the fixed time has elapsed, the computer (A) 1 checks the status signal 107 again to determine whether the computer (B) 2 is communicable (step S56). If it is, the computer (A) 1 judges the state to be normal and returns to step S46. If it is not, the computer (A) 1 determines that a failure (hardware failure) has occurred in the computer (B) 2, proceeds to step S57, and outputs the reset signal 102 to the computer (B) 2.

The computer (B) 2 receives the reset signal 102 at the first controller 10 and sets the external reset flip-flop circuit 111 in the second controller 11 (step S70 in FIG. 7). The output of the set external reset flip-flop circuit 111 passes through the OR circuit 113 and is output to the processor 12 as the reset signal 102 (step S71 in FIG. 7). On receiving the reset signal 102, the processor 12 checks whether the external reset flip-flop circuit 111 is set (step S80 in FIG. 8). If the external reset flip-flop circuit 111 is not set, the processor 12 judges the reset to be an ordinary reset, for example a power-on reset, and advances the processing to step S83.

If the external reset flip-flop circuit 111 is set, the reset is judged to have been requested by the computer (A) 1 and the processing advances to step S81. For failure cause analysis, the computer (B) 2 collects the contents of the memory 13, the I/O setting information, the register information of the processor (CPU) 12, and so on (step S81), and saves the collected information to the magnetic disk device (HDD) 16 (step S82). The computer (B) 2 then clears the memory 13 (step S83) and boots the system (step S84). After that, the computer (B) 2 sends a message to the computer (A) 1 via the LAN 3 indicating that the system has started up (step S85), and ends the reset processing.
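The branch at step S80 decides whether diagnostic data is saved before the reboot. A minimal sketch of that decision follows; it returns step labels rather than performing real work, and the function name is hypothetical.

```python
from typing import List


def reset_processing_steps(external_reset_ff_set: bool) -> List[str]:
    """Return the steps of FIG. 8 that run for a given flip-flop state."""
    steps: List[str] = []
    if external_reset_ff_set:        # S80: reset was requested by the other computer
        steps += ["S81", "S82"]      # collect diagnostic data, then save it to the HDD
    steps += ["S83", "S84", "S85"]   # clear memory, boot, send the started message
    return steps


print(reset_processing_steps(True))
print(reset_processing_steps(False))
```

An ordinary reset such as a power-on reset skips straight to clearing the memory, so no snapshot is written in that case.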

Meanwhile, after outputting the reset signal 102 to the computer (B) 2 in step S57 of FIG. 5, the computer (A) 1 checks whether it has received the message from the computer (B) 2 indicating that the system has started up, and thereby determines whether there is a response from the computer (B) 2 (step S58). If there is a response, the computer (A) 1 determines that the failure in the computer (B) 2 has been recovered and returns to step S46 (Y in step S58). If there is no response (N in step S58), the computer (A) 1 determines that the failure in the computer (B) 2 has not been recovered and performs error processing such as notifying the operator of the failure (step S59).
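Taken together, steps S50 to S59 of FIG. 5 form an escalation: probe, then interrupt, then reset, then report an error. The sketch below is one way to express that flow under stated assumptions. `Outcome`, `supervise`, and `FakeTarget` with its methods are hypothetical names, and the fake target only imitates the behaviours described for software and hardware failures.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ALIVE = "responded to the operation confirmation signal"
    RECOVERED_BY_INTERRUPT = "restarted after the interrupt"
    RESUMED_AFTER_WAIT = "became communicable after waiting"
    RECOVERED_BY_RESET = "restarted after the reset"
    ERROR = "did not recover"


class FakeTarget:
    """Stand-in for the monitored computer; fault is 'none', 'software', 'hardware' or 'dead'."""

    def __init__(self, fault: str) -> None:
        self.fault = fault
        self.booted = False

    def send_confirmation(self) -> None:
        pass

    def got_response(self) -> bool:
        return self.fault == "none"

    def send_interrupt(self) -> None:
        # Only a software fault can still run the interrupt processing.
        self.booted = self.fault == "software"

    def send_reset(self) -> None:
        self.booted = self.fault in ("software", "hardware")

    def got_boot_message(self) -> bool:
        return self.booted

    def is_communicable(self) -> bool:
        return True


def supervise(target: FakeTarget, wait: Callable[[], None]) -> Outcome:
    target.send_confirmation()              # S50
    if target.got_response():               # S51
        return Outcome.ALIVE
    target.send_interrupt()                 # S52
    if target.got_boot_message():           # S53
        return Outcome.RECOVERED_BY_INTERRUPT
    if not target.is_communicable():        # S54
        wait()                              # S55
        if target.is_communicable():        # S56
            return Outcome.RESUMED_AFTER_WAIT
    target.send_reset()                     # S57
    if target.got_boot_message():           # S58
        return Outcome.RECOVERED_BY_RESET
    return Outcome.ERROR                    # S59


for fault in ("none", "software", "hardware", "dead"):
    print(fault, supervise(FakeTarget(fault), wait=lambda: None).name)
```

Recording which outcome occurred gives the same distinction the description relies on for diagnosis: recovery by interrupt points to a software failure, while needing a reset points to a hardware failure.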

As described above, the failure monitoring of the computer B is performed only after determining in step S46 of FIG. 5 whether the computer B is in a communicable state, so serious failures such as the computer B becoming unable to restart can be avoided. In addition, in step S61 of FIG. 6 and step S81 of FIG. 8, the contents of the memory 13, the I/O setting information, the register information of the processor 12, and so on are collected and saved to the magnetic disk device 16. By referring to this history, including whether the computer B was restarted by a reset or by an interrupt, the cause of the failure in the computer B can be investigated easily.

FIG. 1 is a diagram showing a dual computer system according to an embodiment of the present invention.
FIG. 2 is a diagram showing details of the computer B in FIG. 1.
FIG. 3 is a diagram showing details of the second controller 11 in FIG. 2.
FIG. 4 is a diagram showing the states of the status signal 107 and, for each state, whether the computer is communicable.
FIG. 5 is a flowchart explaining the operation of the computer A.
FIG. 6 is a flowchart explaining the interrupt processing of the computer B.
FIG. 7 is a flowchart explaining the reset acceptance processing of the computer B.
FIG. 8 is a flowchart explaining the reset processing of the computer B.

Explanation of symbols

1 Computer A
2 Computer B
3 LAN
10 First controller
11 Second controller
12 Processor
111 External reset flip-flop circuit
112 Power-on reset circuit
113 OR circuit

Claims

A computer system comprising:
communicable state determination means for determining whether a computer to be monitored for faults is in a communicable state in which at least neither a memory dump nor a fault diagnosis is being performed;
operation confirmation signal output means for outputting an operation confirmation signal to the computer when the computer is in the communicable state;
first response determination means for determining whether there is a response from the computer after the operation confirmation signal output means has output the operation confirmation signal to the computer;
interrupt means for outputting, when the first response determination means determines that there is no response from the computer, an interrupt signal for causing the computer to perform restart processing;
second determination means for determining whether there is a response from the computer after the interrupt means has output the interrupt signal to the computer; and
reset means for outputting a reset signal to the computer when the second determination means determines that there is no response from the computer.