JP3806600B2

JP3806600B2 - System switching method for multi-system

Info

Publication number: JP3806600B2
Application number: JP2000521438A
Authority: JP
Inventors: 大野　　洋; 茂則金子; 義弘宮崎; 壮一 ▲高▼谷; 広昭福丸; 隆弘猿田; 加藤　　直; 邦弘鈴木; 憲一黒沢; 雅彦齊藤; 秀仁武和; 裕人塚原; 栄喜庄子
Original assignee: Hitachi Ltd; Hitachi Information and Control Solutions Ltd
Current assignee: Hitachi Ltd; Hitachi Information and Control Solutions Ltd
Priority date: 1997-11-14
Filing date: 1997-11-14
Publication date: 2006-08-09
Anticipated expiration: 2017-11-14
Also published as: WO1999026138A1

Description

技術分野
本発明は多重系システムの管理方法に係わり、特に、稼働系と待機系の計算機により構成される多重系システムにおいて、いずれかの計算機に障害が発生した際に系切り替えを実施する方法に関するものである。
背景技術
高い信頼性が要求される用途、例えば、鉄道運行管理，プラント制御，電力系統制御などに計算機を用いる場合には、処理を行う稼働系計算機の他に、稼働系の計算機に障害が生じた場合に稼働系の計算機が行っていた処理を引き継ぐ待機系の計算機を備えた多重系システムとして計算機を利用することが望ましい。
計算機の稼働を阻害する障害としては、ハードウェアの故障、およびオペレーティングシステム（以下ＯＳと記す）やデバイスドライバなどの基幹ソフトウェアの欠陥による論理矛盾が挙げられる。これらの障害発生時に、計算機のハードウェア・ソフトウェアに関する各種状態を保存することにより、事後の障害解析が可能となり、復旧措置，再発防止策の実施などに活用でき、システムの信頼性向上に役立つ。これは多重系システムにおいても同様である。
従来の多重系システムにおいては、障害が発生した場合に、障害が発生した計算機のディスク装置に障害情報を保存し、その後、当該障害発生計算機が実行していた処理を待機系に引き継ぐ系切り替え方法が実施されてきた。
また、特開平８−２０２５７３号公報には、多重系を構成する計算機全てに、お互いに常に内容を一致化させている共通メモリを搭載し、この共通メモリ上に障害情報を常に書き込み、障害発生計算機が実行していた処理を引き継いだ計算機がこの障害情報をディスクに保存する方法が記載されている。
処理の停止時間を短くするために、系切り替えに要する時間はできるだけ短いことが望ましい。従来の切り替え方法の場合、障害情報の保存に要する間だけ系切り替えが待たされるため、実用的な切り替え時間を実現するためには保存できる障害情報の量が制限されてしまう。
一方、特開平８−２０２５７３号公報に記載された方法の場合、系切り替え時間の短縮は可能であるが、保存する障害情報の量が多くなると、必要な共通メモリの容量が大きくなり装置コストが大きくなると同時に、共通メモリ内容一致化のための計算機負荷およびネットワーク負荷も大きくなってしまう。
本発明は、多重系システムにおいて、障害発生時に、メモリダンプを含む大容量の障害情報の保存を実施しつつ、高速な系切り替えを実現することを目的とする。
また、障害発生系におけるハードウェアやソフトウェアの暴走、および障害発生系における障害情報の保存動作が、系切り替え動作および切り替え後の処理を引き継いだ新稼働系の動作に影響を与えないようにすることを目的とする。
発明の開示
本発明は、障害の発生した稼働系計算機で行っていた処理を停止して障害情報の保存処理を開始し、引き続いて待機系計算機は該計算機の障害を検出して停止していた処理を引き継ぐものである。該障害発生計算機における処理の停止および障害情報の保存開始は、該障害発生計算機上のソフトウェアにより自発的に行うか、または先に待機系計算機が該計算機の障害を検出し該計算機に対して動作を指示することにより行うかにより実現される。
このような系切り替え方法によれば、処理の切り替えは、待機系計算機における障害検出から、障害発生計算機において安定して障害情報の保存が開始されるまでの見込み時間のみで実施でき、切り替え時間の短縮が実現できる。
また、前記目的達成のために、本発明は、稼働系計算機の障害を検出した待機系計算機が該障害発生計算機に対して障害情報の保存開始指示に引き続き該障害発生計算機の動作停止を指示して、該障害発生計算機では正常な障害情報保存動作をしている場合には動作停止指示を無視し、正常な障害情報保存動作をしていない場合には動作停止指示を受け入れて完全に停止するものである。
このような障害発生計算機の動作方法により、障害情報保存動作が不可能なほどの重度の障害状態において、該障害発生計算機が予期せぬ動作をし、ネットワークや共有ディスク装置といった系間の結合部を通じて、処理を引き継いだ新稼働系計算機の動作に影響を与えることが防げる。
また、前記目的達成のために、本発明は、該障害発生計算機において障害情報の保存を実施する前に、ネットワークや共有ディスク装置といった系間の結合部の入出力装置の動作を停止させるものである。
このような障害発生計算機の動作方法により、障害情報保存に無関係なハードウェアの動作により、ネットワークや共有ディスク装置といった系間の結合部を通じて、処理を引き継いだ新稼働系計算機の動作に影響を与えることが防げる。
発明を実施するための最良の形態
以下、本発明に係る多重系システムの切り替え方法の実施形態について詳細に説明する。
第１図に本実施形態に係る多重系システムの構成を示す。
図示するとおり、本実施形態に係る多重系システムは２台の計算機で構成された２重系システムである。ただし、計算機は３台以上で構成してもよい。
第１図において、計算機１００，１０１はそれぞれ稼働系計算機，待機系計算機を示している。系切り替えにより、稼働系計算機１００は待機系計算機として、待機系計算機１０１は稼働系計算機として動作する。
各計算機１００，１０１は、中央演算処理装置（以下ＭＰＵと記す）１１０と主メモリ１１１，入出力制御装置１１２を備え、これらはプロセッサバス１２０によって接続されている。入出力制御装置１１２には、ディスク装置１１３や拡張バス１２１が接続される。
拡張バス１２１には、計算機の機能を拡張するための回路が接続される。一般的には回路が実装された拡張ボードを、スロットコネクタに挿入する形態で拡張バス１２１に接続される。ただし一部の機能は計算機本体内に実装され、拡張バスに直接内部で接続されている場合もある。本実施形態に係る計算機１００，１０１は、拡張ボードとしてＳＣＳＩ（ＳｍａｌｌＣｏｍｐｕｔｅｒＳｙｓｔｅｍＩｎｔｅｒｆａｃｅ）ボード１１４，リンケージバスポート（ＬｉｎｋａｇｅＢｕｓＰｏｒｔ）（以下ＬＸＰと記す）ボード１１５，Ｅｔｈｅｒｎｅｔボード１１６を備える。
ＳＣＳＩボード１１４には共有ディスク装置１０２が接続されている。この共有ディスク装置１０２は、系切り替え時の処理の引き継ぎデータなどを記憶するのに使用される。なお、ＳＣＳＩバスの代わりにＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）といったバスを使用する場合もある。
Ｅｔｈｅｒｎｅｔボード１１６はＥｔｈｅｒｎｅｔネットワーク１０３に接続され、このネットワーク１０３に接続された他の計算機などと通信を行う。本実施形態ではネットワーク１０３には、プラント９００を管理・制御するための複数のコントローラ９１０が接続されている。なお、Ｅｔｈｅｒｎｅｔの代わりに、トークンリングやＡＴＭといったネットワークを使用する場合もある。
ＬＸＰボード１１５は、系切り替え制御のための機能拡張ボードであり、専用の伝送路であるリンケージバス１０４を介して接続される。ＬＸＰボードは計算機１００，１０１相互間での相手計算機の生存監視と、系切り替えに必要な強制割込，動作停止，計算機再起動の各指示メッセージの送信、さらに各指示メッセージ受信時の自計算機における指示内容の実行を行う。
このような２重系システムにおいて、稼働系計算機１００，待機系計算機１０１ともに正常な状態では、稼働系計算機１００の主メモリ１１１にはＯＳ１３０，管理プログラム１３１，管理通信プログラム１３２、およびアプリケーション（ＡＰ）１３５がロードされ、管理プログラム１３１，管理通信プログラム１３２、およびアプリケーション１３５がＯＳ１３０上で実行されている。同様に、待機系計算機１０１の主メモリ１１１にも同じプログラムがロードされ、ＯＳ１３０，管理プログラム１３１、および管理通信プログラム１３２は実行されているが、アプリケーション１３５は実行されていない。さらに各計算機１００，１０１の主メモリ１１１には割込処理ルーチン１３３がロードされている。
アプリケーション１３５は、該２重系システムの用途たる処理を行うプログラムであり、本実施形態の場合、ネットワーク１０３を介して各コントローラ９１０から送られるデータの処理・記録を行うものである。
管理プログラム１３１は、稼働系計算機と待機系計算機の切り替え処理を行うプログラムである。本プログラムはＬＸＰボード１１５に対してメッセージ送受信要求や動作指示を行い、また、管理通信プログラム１３２に対して生存通知メッセージの送受信要求を行う。
管理通信プログラム１３２はＥｔｈｅｒｎｅｔボード１１６を使いネットワーク１０３を介して、他計算機と生存通知メッセージの送受信を行う。メッセージ送受信はＴＣＰ／ＩＰプロトコルを使って実行する。本プログラムは予め決められたＴＣＰポートで他計算機からの接続を待ち、接続された場合にはメッセージを受信して本プログラム内で内容を保持し、管理プログラム１３１からの読み出し要求に対して保持している内容を返す。また管理プログラム１３１からの生存確認メッセージ送信要求を受け、２重系を構成している他計算機上の管理通信プログラム１３２が待機しているＴＣＰポートに対してメッセージを送信する。
割込処理ルーチン１３３は、ＭＰＵに対してマスク不可能割込信号が入力されたときに起動されるように登録される。そして、マスク不可能割込信号発生時に障害情報の保存等、障害発生時の処理を実行する。ただし、本実施形態ではマスク不可能割込信号により起動するように登録しているが、ＭＰＵが提供する他の割込機構を使って実現してもよい。なお、本実施形態の場合、割込処理ルーチン１３３が独立したプログラムとなっているが、ＯＳ１３０の種類によってはＯＳの一部として割込処理ルーチンが提供される場合もあり、この場合はＯＳ１３０の割込処理ルーチンから呼び出されるサブルーチンとして必要な処理を組み込むことにより同一の機能が実現できる。
次に、本実施形態に係る多重系システムの系切り替え方法について説明する。
第２図に系切り替え処理のタイムチャートを示す。
稼働系計算機１００，待機系計算機１０１がともに正常な状態では、次のような処理が行われる。
管理プログラム１３１は、一定時間毎に管理通信プログラム１３２およびＬＸＰボード１１５に対して、生存通知メッセージ送信を要求する（３０１）。管理通信プログラム１３２はＥｔｈｅｒｎｅｔボード１１６を駆動し、ネットワーク１０３経由で他計算機に対して生存通知メッセージ４０１を送信する（３０２）。一方、ＬＸＰボード１１５はリンケージバス１０４経由で他計算機に対して生存通知メッセージ４０２を送信する（３０３）。
前記の生存通知メッセージ４０１，４０２を受信した待機系計算機１０１の管理通信プログラム１３２およびＬＸＰボード１１５は、各々受信結果を記憶する（３０４，３０５）。そして、待機系計算機１０１の管理プログラム１３１は、一定時間毎に自計算機の管理通信プログラム１３２およびＬＸＰボード１１５に対して、稼働系計算機からの生存通知メッセージを受信したかどうか確認する（３０６）。一定時間以上、稼働系計算機からの生存通知メッセージ４０１，４０２が双方とも受信されない場合には、稼働系計算機に障害が発生したものと判断する。
ここで生存通知メッセージを２つの経路で伝送するのは、各伝送経路や伝送路への接続回路に発生した障害を、計算機自体の障害と区別できるようにするためである。一方の生存通知メッセージのみが受信されない場合には、伝送路で障害であると判断し、画面表示やログ記録などの形で警告を発するに止め、系切り替えは実施しない。
第２図では稼働系計算機１００から待機系計算機１０１への向きの生存確認メッセージの送信動作のみが示されているが、実際には逆向きの生存確認メッセージの送信も行っており、稼働系計算機１００での受信確認処理３０６および待機系計算機１０１での送信処理３０１が一定時間毎に実行されている。
次に、稼働系計算機１００に障害が発生した場合の動作について説明する。
障害モードは複数考えられるが、第１に、ＯＳ内部で無限ループが発生するなどの要因でハングアップ状態になった場合を説明する。
ＯＳ内部での障害発生により管理プログラム１３１の動作はストップし、生存通知メッセージの送信処理３０１が一定時間毎に実行されなくなる。待機系計算機１０１の管理プログラム１３１は、一定時間４５１の間隔で行う受信メッセージ確認３０６の際に、２つの生存通知メッセージ４０１，４０２とも受信されていないことを検出すると、稼働系計算機１００に障害が発生したものと判断する。障害発生を検出した待機系計算機１０１上の管理プログラム１３１はＬＸＰボード１１５に対して強制割込指示の送信を依頼し（３０７）、ＬＸＰボード１１５は稼働系計算機のＬＸＰボードに対して強制割込指示メッセージ４０３を送信する（３０８）。
稼働系計算機１００上のＬＸＰボード１１５は強制割込指示メッセージ４０３を受信すると、ハードウェア的にマスク不可能割込信号４０４を発生させる（３０９）。ＭＰＵはこの割込信号を受け、割込処理ルーチン１３３を起動する。
割込処理ルーチン１３３は起動時に、まず、マスク不可能割込信号を無効化、すなわち再度マスク不可能割込信号が発生した場合にこれを無視するように設定する（３１０）。
割込処理ルーチン１３３は、起動後、相手系計算機１０１に影響を及ぼす可能性のある自計算機内の構成要素の動作停止を指示する（３１１）。本実施形態の構成の場合、ＳＣＳＩボード１１４およびＥｔｈｅｒｎｅｔボード１１６がこの様な構成要素に相当し、各ボードにあるレジスタ中の動作停止を指示するビットをセットすることにより動作を停止させる。これにより相手系計算機１０１が共有ディスク１０２やネットワーク１０３にアクセスする場合に、障害発生計算機１００の影響を受けなくなる。なお、構成要素の種類によってはレジスタ中の動作可能ビットをクリアすることにより、動作停止を指示する場合もある。
次に割込処理ルーチン１３３は、ＬＸＰボード１１５に対して以後の他計算機からの指示メッセージを無視するように設定し（３１２）、障害情報の保存を実行する（３１３）。障害情報の保存完了後、割込処理ルーチン１３３は停止し（３１４）、障害が発生した計算機１００は停止状態となる。
障害情報の保存処理３１３では、主メモリ１１１の内容や、計算機本体および各機能拡張ボードの動作状態を表す各々のレジスタの内容などを保存する。また、障害情報以外に、通常のシャットダウン処理のうち、該障害発生後の条件下でも実行可能な処理を実行してもよい。例えば、ディスク装置１１３に対するキャッシュ内容の書き出しを実行すれば、該障害発生計算機のディスク内容の整合性が保たれ、内容を救出できる可能性が高くなる。
待機系計算機１０１の管理プログラム１３１は、強制割込指示の送信（３０７）後、一定時間４５２をおいて、ＬＸＰボード１１５に対して動作停止指示の送信を依頼し（３１５）、またこの時点で、待機系計算機１０１でロードされていたアプリケーション１３５を起動して稼働系計算機１００の処理を引き継ぎ（３１８）、自計算機を新たな稼働系に設定する。これで系切り替えは完了する。
ＬＸＰボード１１５は、管理プログラム１３１からの動作停止指示送信依頼により、動作停止指示メッセージ４０５を送信する（３１６）。しかし、障害発生計算機１００では割込処理ルーチン１３３によりＬＸＰボードに対して指示メッセージを無視する設定が行われている（３１２）ため、この動作停止指示メッセージ４０５は無視され、障害情報の収集（３１３）が継続されることになる。
障害発生計算機内の構成要素の動作停止処理３１１において、各構成要素に、動作状態表示レジスタなどの動作状況確認手段が備わっている場合、動作停止処理３１１による動作停止を確認する手順を追加してもよい。この動作停止の確認において動作停止指示が失敗していると判断された場合、割込処理ルーチン１３３はその処理を停止する。これにより、他計算機からの指示メッセージを無視する処理が行われず、待機系計算機のＬＸＰボードからの動作停止指示メッセージ４０５を受けたＬＸＰボードにより計算機１００は強制的に停止状態となり、待機系計算機１０１は障害発生計算機１００の影響を受けずに処理を引き継ぐことになる。
また、障害情報保存処理３１３の先頭で、ディスク装置の異常など、障害情報保存のための準備が出来ていないと判断された場合、割込処理ルーチン１３３はＬＸＰボードのメッセージ無視の設定を解除し（３１９）、障害情報保存処理を停止するようにしてもよい。この場合も、待機系計算機からの動作停止指示メッセージ４０５を受けて障害発生計算機１００は強制的に停止状態となる。
第２の障害モードとして、一般的にカーネルパニックと呼ばれる、ＯＳが重大な論理矛盾を検出して継続運転不能と判断した障害について説明する。この場合の処理のタイムチャートを第３図に示す。
ＯＳは論理矛盾を検出すると、割込処理ルーチン１３３を起動する（３３１）。割込処理ルーチンは、第２図で説明した場合と同様に、自計算機内の構成要素の動作停止を指示し（３１１）、次にＬＸＰボード１１５に対して以後の他計算機からの指示メッセージを無視するように設定し（３１２）、その後、障害情報の保存処理を行い（３１３）、停止する（３１４）。
ＯＳに障害が発生し、割込処理ルーチンへ実行が移ることにより、稼働系計算機１００上の管理プログラム１３１が動作しなくなるため、待機系計算機に対して生存通知メッセージ４０１，４０２が送信されなくなる。待機系計算機１０１上の管理プログラム１３１は、前述のとおり、生存通知メッセージ４０１，４０２ともに受信されないことを検出し（３０６）、強制割込指示メッセージ４０３および計算機動作停止指示メッセージ４０５の送信を行う（３０８，３１６）。
強制割込指示メッセージ４０３を受けた時点で、すでに割込処理ルーチン１３３が起動しＬＸＰボードに対してメッセージ無視の設定が行われている（３１２）ため、強制割込指示メッセージ４０３は無視され（３３２）、障害情報の収集３１３が継続される。引き続いて受け取る動作停止指示メッセージ４０５も同様に無視される（３３３）。
なお、ここではＯＳが割込処理ルーチン１３３を呼び出すものとしたが、マスク不可能割込信号を発生させて割込処理ルーチン１３３を起動してもよい。またＯＳの種類によってはＯＳ自身が障害情報の保存（メモリダンプ）を行うものもあるが、その実行前に登録した処理を呼び出す機能が提供されている場合には、割込処理ルーチン１３３から障害情報の保存（３１３）を除いた処理を登録しておくことにより、同等の処理を実現することができる。
第３の障害モードとして、ハードウェアの部分的な障害について説明する。ここで説明するのは、障害の影響が前述した２つの障害モードとしては現れないが、多重系システムの本来の用途たる処理を継続することができないものであり、何らかの検出方法により検出されたものである。この場合の処理のタイムチャートを第４図に示す。
このような障害の発生の検出には、管理プログラム１３１による検出、専用の障害検出サブプログラム１３４による検出、アプリケーション１３５での異常検出などがある。これらのうち、管理プログラム以外で障害を検出した場合は、障害発生の検出を管理プログラム１３１に通知する（３４１，３４２）。管理プログラム１３１は、自分自身での障害検出、または障害検出サブプログラム１３４やアプリケーション１３５からの障害通知を受けて、割込処理ルーチン１３３を起動する（３４３）。割込処理ルーチン１３３は第３図で説明したＯＳの論理矛盾検出時と同一の処理手順を実行し、系切り替えが実施される。
なお、障害発生をハードウェア機構により監視している場合は、このハードウェアが割込を使用して異常検出結果を管理プログラム１３１や障害検出サブプログラム１３４に通知するか、もしくは管理プログラムや障害検出サブプログラムの側が定期的に該ハードウェアをポーリングして異常検出の有無を確認して、同様の処理を行う。
また、メモリ内容の破壊やハードウェア的な動作不全の程度により、割込処理ルーチン１３３の起動ができない場合がある。この場合、障害発生計算機１００は重度の制御不能状態であり、予測できない動作をして、待機系計算機１０１の動作に影響を与える恐れがある。
この場合は、障害発生計算機のＬＸＰボード１１５に対して他計算機からの指示メッセージを無視する設定（３１２）が行われない。従って、待機系計算機からの動作停止指示メッセージ４０５を受けたＬＸＰボード１１５が計算機１００を強制的に停止状態とする。従って障害発生計算機１００を待機系計算機１０１の動作に確実に影響を与えない状態としてから処理の引き継ぎを実施することになるので、確実に系の切り替えができる。
生存通知メッセージが受信されず障害が発生したと判断するまでの時間４５１は、第３図で示すように、障害が発生してソフトウェア的に割込処理ルーチン１３３が呼び出され、ＬＸＰボードに対する設定（３１２）を完了するまでの時間に対して、やや長く設定しておく。また強制割込指示メッセージ送信と計算機動作停止指示メッセージ送信の間隔４５２は、第２図に示すように、強制割込指示（３０７）による稼働系計算機１００の割込処理ルーチン１３３が起動され、ＬＸＰボードに対する設定（３１２）を完了するまでの時間に対して、やや長く設定しておく。
系の切り替え時間、すなわち処理引き継ぎ完了までの時間は、おおよそ時間４５１と時間４５２の合計となる。この系の切り替え時間は、メモリダンプなどの障害情報の保存３１３に要する時間に対して十分短く、障害情報の保存と系切り替え時間の短縮が両立される。
なお、以上の説明では稼働系計算機１００に障害が発生した場合の処理について説明してきたが、待機系計算機１０１に障害が発生した場合も、処理の引き継ぎによる稼働系，待機系切り替えがないことを除いて、同一の処理が行われる。
本実施形態では、各計算機がＬＸＰボード１１５とＥｔｈｅｒｎｅｔボード１１６を備えていたが、各計算機にＥｔｈｅｒｎｅｔボード１１６を２つ備え、Ｅｔｈｅｒｎｅｔネットワーク１０３を二重化して生存監視メッセージの通信を行う構成の多重系システムにおいても、同様の方法による系切り替えが可能である。このようなシステムにおいては、ＯＳの論理矛盾検出やハードウェアの部分的な障害検出という障害モードに対して、障害発生計算機１００における障害情報の保存３１３と待機系計算機１０１への処理引き継ぎ３１８による系切り替え動作が可能である。ただし強制割込指示４０３を送ることが出来ないので、ハングアップ状態の障害モードでは障害情報の保存が出来ない。また、動作停止指示メッセージ４０５を送ることが出来ないので、障害の程度によっては障害発生計算機１００の異常動作が待機系計算機１０１に影響を与える可能性が残る。
以下、各部の詳細について説明する。
まずＬＸＰボード１１５について説明する。第５図にＬＸＰボード１１５の内部構成を示す。
図示するようにＬＸＰボード１１５は、拡張バス１２１との入出力を担当する拡張バスインタフェース１７０，リンケージバス１０４を介したメッセージ処理を行うリンケージ制御用プロセッサ１７１、このリンケージ制御用プロセッサ１７１が実行するプログラムを格納するメモリ１７５，メッセージとリンケージバス上の電気信号との変換を行う伝送路インタフェース１７２，メッセージの一時格納用バッファであるメッセージ記憶用メモリ１７３，電源電圧の立ち上がりを検出する電源電圧検出回路１７４，拡張バス側からリンケージ制御用プロセッサ１７１の動作状態を確認したり動作方法を指示するための動作制御レジスタ１７６を備えている。
動作制御レジスタ１７６は拡張バス１２１から読み書きできるので、このＬＸＰボード１１５が搭載されている計算機上で動作するソフトウェアから動作状態を確認したり動作方法を指示することが可能である。この動作制御レジスタ１７６は、後述する強制割込指示禁止ビット１７６１，動作停止指示禁止ビット１７６２，再起動指示禁止ビット１７６３を含む。
ＬＸＰボードの初期化動作を説明する。ＬＸＰボードは、接続されている計算機とは独立に動作し、計算機のリセット信号自体を扱う必要がある。このため、ＬＸＰボードの初期化処理は、計算機のリセット処理とは独立に、ＬＸＰボードへの電源投入時にのみ行う。このため、拡張バス１２１経由で供給される電源電圧を監視する電源電圧検出回路１７４が電源電圧の立ち上がりを検出して、ＬＸＰボード内の各構成要素に対して初期化を指示する初期化信号１８４を出力する。拡張バスインタフェース１７０，リンケージ制御用プロセッサ１７１、および伝送路インタフェース１７２は、この初期化信号１８４を受け、メモリのクリア，各種状態情報のクリア，レジスタのクリア，リンケージバスのリセットなどの初期化処理を実行する。
次にメッセージ送信機能について説明する。管理プログラム１３１は拡張バス１２１を介して、拡張バスインタフェース１７０にメッセージの送信要求を行う。拡張バスインタフェース１７０は、拡張バス１２１とリンケージバス１０４のデータ転送速度が異なるため、送信するメッセージを一旦速度緩衝用バッファとしてメッセージ記憶用メモリ１７３に格納し、リンケージ制御用プロセッサ１７１に対してメッセージの到着を通知する。リンケージ制御用プロセッサ１７１はこの通知を受けてメッセージ記憶用メモリ１７３からメッセージを取り出し、伝送路インタフェース１７２に転送し、リンケージバス１０４を介して、メッセージを他計算機のＬＸＰボードに送信する。
最後にメッセージ受信処理機能について説明する。他計算機のＬＸＰボードからリンケージバス１０４を経由して指示メッセージが届いた場合、その種類に応じて以下のいずれかの処理を行う。
（１）メッセージが強制割込指示の場合、接続されている自計算機に対して、マスク不可能割込信号線１８２を通じて、マスク不可能割込信号を出力し、ＭＰＵ１１０での処理を割込ルーチン１３３に切り替える。ただし、レジスタ１７６の強制割込指示禁止ビット１７６１がセットされている場合には、本処理を行わず、指示メッセージを無視する。
（２）メッセージが動作停止指示の場合、接続されている自計算機に対してリセット信号線１８３を通じてリセット信号を継続して出力し続け、これにより計算機を強制的に停止する。ただし、レジスタ１７６の動作停止指示禁止ビット１７６２がセットされている場合には、本処理を行わず、メッセージを無視する。
（３）メッセージが再起動指示の場合、接続されている自計算機に対してリセット信号線１８３を通じてリセット信号を１度出力し、これにより計算機を再起動する。ただし、レジスタ１７６の再起動指示禁止ビット１７６３がセットされている場合には、本処理を行わず、メッセージを無視する。
（４）上記以外のメッセージの場合、メッセージ内容をメッセージ記憶用メモリ１７３に格納する。格納されたメッセージは、その後、管理プログラム１３１からの要求により、拡張バスインタフェース１７０，拡張バス１２１を介して随時読み出される。
第６図に拡張バスインタフェース１７０の処理手順を示す。
拡張バスインタフェース１７０は、計算機（拡張バス）からの入出力要求信号、および初期化信号線１８４からの初期化信号を受けると、要求待ち状態５０１から抜けて処理を開始し、受けた信号から処理要求の種類を判定する（５０２）。
処理要求が初期化信号であった場合、内部レジスタや回路の初期化処理（５０３）を行う。
処理要求が拡張バス１２１からの読出信号の場合、読み出し要求の対象がレジスタであればそのレジスタ１７６の内容を読み出し（５０５）、読み出し要求の対象がメッセージであればメッセージ記憶メモリ１７３の内容を読み出し（５０７）、読み出した結果を拡張バス１２１に送出する（５０６，５０８）。
処理要求が拡張バス１２１からの書込信号の場合、書き込み要求の対象がレジスタであれば書き込み内容をレジスタ１７６に書き込む（５１０）。一方、書き込み要求の対象が送信メッセージである場合には、その送信メッセージを一旦メッセージ記憶用メモリ１７３に格納し（５１１）、これをリンケージ制御用プロセッサ１７１に伝送させる（５１２）。
第７図にリンケージ制御用プロセッサ１７１の処理手順を示す。
制御用プロセッサ１７１は、拡張バスインタフェース１７０からの起動要求、伝送路インタフェース１７２からのメッセージ受信、および初期化信号線１８４からの初期化信号のいずれかのイベントにより、イベント待ち状態５２１から抜けて処理を開始し、そのイベントの種類を判定する（５２２）。
発生したイベントが初期化信号の場合、通信処理を初期化し、メッセージ記憶用メモリ１７３に保存されている全メッセージを破棄し、さらにレジスタ１７６を初期状態に設定する（５２３）。
一方、発生したイベントが、拡張バスインタフェース１７０からの起動要求、すなわち、メッセージの送信要求であれば、送信すべきメッセージをメッセージ記憶用メモリ１７３から読み出し（５２４）、伝送路インタフェース１７２に該メッセージを伝送させる（５２５）。
また、発生したイベントが伝送路インタフェース１７２からのメッセージ受信イベントの場合、他のＬＸＰボードからの指示メッセージの到着を示している。この場合、受信した指示メッセージの種類を判定し（５２６）、各々に対応した処理を行う。
メッセージが強制割込指示，動作停止指示，再起動指示のいずれかの場合、既に述べたとおり、レジスタ１７６中の対応する各禁止ビット（１７６１，１７６２，１７６３）がクリアされていることを確認し（５２７，５２９，５３１）、前述のとおりの信号を出力する（５２８，５３０，５３２）。
前記以外のメッセージの場合、単に受信した指示メッセージをメッセージ記憶用メモリ１７３に格納する（５３３）。
次に管理プログラム１３１について説明する。
管理プログラム１３１は次の３つの処理を行う。
（１）自計算機が正常に動作していることを他の計算機に通知するため、定期的に生存通知メッセージを送信する。
（２）他計算機から送られてくる生存通知メッセージを監視し、一定時間以上受信されない場合は送信元計算機に障害が発生したものと判断し、他計算機に対して強制割込指示メッセージならびに動作停止指示メッセージを送信する。また、障害発生計算機が稼働系計算機ならば、該計算機で実行していた処理を引き継ぎ、自計算機を新たな稼働系計算機に設定する。
（３）他のプログラムからの呼び出しにより、自計算機に障害が発生したことを認識し、障害情報収集等の割込処理ルーチン１３３を起動する。
なお、管理プログラム１３１が自計算機の障害発生を検出する機能を合わせ持っていてもよい。この場合、障害検出時には前記（３）と同様に割込処理ルーチンを起動する。
第８図に前記（１）の生存通知メッセージ送信処理の処理フローを示す。
図示するとおり、この処理では定期的に生存通知を他計算機に対して通知する。すなわち、管理通信プログラム１３２およびＬＸＰボード１１５に対して生存通知メッセージ送信を要求し（３０１）、予め定められた時間だけ待ち状態に移行する（５４１）処理を繰り返す。
第９図に前記（２）の生存通知メッセージの監視と他系障害発生時処理の処理フローを示す。
図示するように、周期的に他計算機からの生存メッセージの受信状態を確認し、一定時間以上受信できない場合には他系障害発生時処理を実行する。
他系障害と判断するための待ち時間４５１を決定するために、「通知１待ち回数」，「通知２待ち回数」という変数を設定する。これらの変数の初期値はＮ回であり、処理５６３での待ち時間ｔ_ｗとの積「Ｎ×ｔ_ｗ」が他系障害と判断するための待ち時間４５１となる。まずこれらの変数の初期化処理として、各々Ｎ回を設定する（５５１，５５２）。
次に、管理通信プログラム１３２では受信したメッセージの内容を記憶しているので、生存通知メッセージ４０１を受信したかどうかを管理通信プログラム１３２に問い合わせる（５５３）。受信されていれば「通知１待ち回数」をＮ回に設定して再度初期化し（５５４）、管理通信プログラム１３２に対しては記憶している生存通知メッセージのクリアを指示する（５５５）。一方、生存通知メッセージが受信されていなければ、「通知１待ち回数」の値を１減少させる。ただし「通知１待ち回数」の値が負になった場合は０を設定するものとする（５５６）。
同様にして、ＬＸＰボード１１５は受信したメッセージの内容を記憶しているので、生存通知メッセージ４０２を受信したかどうかを問い合わせる（５５７）。受信されていれば「通知２待ち回数」をＮ回に再設定して（５５８）、ＬＸＰボード１１５に記憶している生存通知メッセージのクリアを指示する（５５９）。生存通知メッセージが受信されていなければ、「通知２待ち回数」の値を１減少させる。ただし「通知２待ち回数」の値が負になった場合は０を設定するものとする（５６０）。
ここで「通知１待ち回数」および「通知２待ち回数」の値を調べる（５６１）。
両変数とも０となっている場合には、「Ｎ×ｔ_ｗ」で表される待ち時間４５１以上の間、生存通知メッセージ４０１および４０２がともに受信されていないことになるため、他系の計算機に障害が発生したものと判断する。そしてまずＬＸＰボード１１５に対して強制割込指示メッセージ４０３の送信を依頼し（３０７）、次いで一定時間４５２だけ待ち状態とし（５６４）、その後、ＬＸＰボード１１５に対して計算機動作停止指示メッセージ４０５の送信を依頼する（３１５）。さらに自計算機の設定が待機系計算機である場合には、稼働系計算機の処理内容の引き継ぎを行い（３１８）、系切り替えを実行する。これらの処理を実行した後は、他系の障害発生計算機は必ず停止状態なので、生存通知メッセージの監視処理は停止する（５６６）。なお、障害発生計算機を交換しまたは障害要因を取り除き、待機系計算機として二重化システム内に復帰させる場合には、再度本処理を開始する（５５０）。開始はオペレータによる手動操作でもよいし、本監視処理停止（５６６）後、別処理を起動して生存監視メッセージの監視を続け、生存監視メッセージを検出した時点で本監視処理を再開する（５５０）方法でもよい。
処理５６１にて「通知１待ち回数」および「通知２待ち回数」のいずれか一方のみが０であった場合は、メッセージ伝送路や伝送路への接続回路に障害が発生したと判断し、これを画面表示やログ記録などの形で警告を発する（５６２）。
処理５６１にて「通知１待ち回数」および「通知２待ち回数」の両変数が０であった場合を除き、予め定められた時間ｔ_ｗだけ待ち（５６３）、処理５５３へ戻る。
第１０図に前記（３）の自計算機で障害が発生した時の管理プログラム１３１の処理フローを示す。
この処理は、障害検出サブプログラム１３４やアプリケーション１３５からの呼び出しにより起動し（５７０）、単に割込処理ルーチン１３３を起動する（３４３）。割込処理ルーチン１３３は呼び出し元に処理を戻さない。
次に、割込処理ルーチン１３３について説明する。
割込処理ルーチン１３３は、障害発生時に、自計算機上のソフトウェアから起動されるか、または他計算機からの強制割込指示メッセージを受けてＬＸＰボード１１５から起動され、障害情報の保存およびこれに関連する処理を行う。
第１１図に割込処理ルーチン１３３の処理フローを示す。
割込処理ルーチン１３３は起動時に、まずマスク不可能割込信号を無効化する（３１０）。これは、何も処理を行わずに復帰するダミーの割込処理ルーチンを用意し、これをマスク不可能割込に対する処理ルーチンとしてＭＰＵに登録することにより実現する。これにより割込処理ルーチン１３３の処理中に再度マスク不可能割込信号が発生した場合でも、前記ダミーのルーチンへ処理が移りすぐに割込復帰するので、マスク不可能割込を無視することとなり、割込処理ルーチン１３３を継続できる。
次に、自計算機の一部、特に他系の計算機に影響を及ぼす可能性のある構成要素の動作停止を指示する（３１１）。そして動作停止を指示した各構成要素に対して状態を問い合わせ、全ての構成要素が本当に動作停止したかどうかを確認する（５８１）。動作停止に失敗したものがある場合、割込処理を打ち切る（５９０）。動作停止を指示した各構成要素が全て停止していれば、ＬＸＰボード１１５に対して以後の他計算機からの指示メッセージを無視するように設定する（３１２）。
続いて障害情報の保存が可能な状態かどうかを調べ（５８２）、保存が不可と判断された場合は、ＬＸＰボード１１５に対して他計算機からの指示メッセージ無視を解除し（３１９）、割込処理を打ち切る（５９０）。保存が可能と判断された場合は、実際の障害情報の保存を実行する（３１３）。障害情報の保存完了後、割込処理ルーチン１３３は停止し（３１４）、自計算機は停止状態となる。なお、障害情報の保存完了後、自計算機上のＬＸＰボード１１５に対してリセット信号の継続発生を指示し、計算機の動作を完全に停止させるようにしてもよい。
割込処理の打ち切りにより停止した場合（５９０）、自計算機は停止状態となるが、引き続き他計算機から送られてくる動作停止指示メッセージを受けてＬＸＰボード１１５がリセット信号を継続発生するので、この場合でも動作は完全に停止する。
以上のように、本発明によれば、多重系システムにおいて、障害発生時に、メモリダンプを含む大容量の障害情報の保存を実施しつつ、高速な系切り替えを実現することが可能である。
また、本発明によれば、障害発生系におけるハードウェアやソフトウェアの暴走、および障害発生系における障害情報の保存動作が、系切り替え動作および切り替え後の処理を引き継いだ新稼働系の動作に影響を与えないようにすることが可能である。
産業上の利用可能性
以上のように、本発明は高い信頼性が要求される用途の多重系システムに有効であり、稼働系の計算機に障害が生じた場合に稼働系の計算機が行っていた処理を引き継ぐ待機系の計算機を備えた多重系システムにおいて、いずれか一方の計算機で障害が発生した際に、事後の障害解析が可能となり、復旧措置，再発防止策の実施などに活用でき、システムの信頼性向上に役立つ。
【図面の簡単な説明】
第１図は、２重系システムの構成を示すブロック図であり、第２図は、この２重系システムにおける系切り替え処理の順序と各処理の関係を示したタイムチャートである。
第３図は、ＯＳの論理矛盾検出による系切り替え処理のタイムチャートであり、第４図は、ハードウェア障害検出による系切り替え処理のタイムチャートである。
第５図は、計算機に搭載するＬＸＰボードの構成を示すブロック図であり、第６図は、ＬＸＰボードに搭載する拡張バスインタフェースの処理手順を示すフローチャートであり、第７図は、ＬＸＰボードに搭載するリンケージ制御用プロセッサの処理手順を示すフローチャートである。
第８図は、管理プログラムの生存通知メッセージ送信処理の処理手順を示すフローチャートであり、第９図は、管理プログラムの生存通知メッセージの監視と他系障害発生時処理の処理手順を示すフローチャートであり、第１０図は、管理プログラムの自計算機に障害発生時処理の処理手順を示すフローチャートである。
第１１図は、割込処理ルーチンの処理手順を示すフローチャートである。
Technical field
The present invention relates to a method for managing a multiplex system, and more particularly to a method for performing system switching when a failure occurs in any of the computers in a multiplex system composed of active and standby computers.
Background art
When computers are used for applications that require high reliability, such as railway operation management, plant control, and power system control, it is desirable to configure them as a multiplex system that includes, in addition to the active computer that performs the processing, a standby computer that takes over that processing if the active computer fails.
Failures that hinder the operation of a computer include hardware faults and logical inconsistencies caused by defects in core software such as the operating system (hereinafter referred to as OS) and device drivers. Saving the various hardware and software states of the computer when such a failure occurs makes later failure analysis possible; the saved information can be used for recovery measures and for measures to prevent recurrence, and it helps improve system reliability. The same applies to a multiplex system.
In a conventional multiplex system, a system switching method has been used in which, when a failure occurs, the failure information is first saved to the disk device of the failed computer, and the processing that the failed computer was executing is then taken over by the standby system.
Japanese Patent Laid-Open No. 8-202573 describes a method in which every computer constituting the multiplex system is equipped with a common memory whose contents are always kept identical across the computers, failure information is continuously written to this common memory, and the computer that takes over the processing of the failed computer saves this failure information to disk.
To shorten the time during which processing is stopped, the time required for system switching should be as short as possible. With the conventional switching method, system switching is delayed for as long as it takes to save the failure information, so the amount of failure information that can be saved must be limited in order to achieve a practical switching time.
With the method described in Japanese Patent Laid-Open No. 8-202573, on the other hand, the system switching time can be shortened, but as the amount of failure information to be saved grows, the required common memory capacity and hence the equipment cost increase, and the computer load and network load needed to keep the common memory contents identical also increase.
An object of the present invention is to realize high-speed system switching while saving a large amount of failure information including a memory dump when a failure occurs in a multiplex system.
A further object is to ensure that runaway hardware or software in the failed system, and the failure-information saving operation in the failed system, do not affect the system switching operation or the operation of the new active system that has taken over the processing after switching.
Disclosure of the invention
In the present invention, the active computer in which a failure has occurred stops the processing it was performing and starts saving failure information; the standby computer then detects the failure of that computer and takes over the stopped processing. The stopping of processing and the start of failure-information saving in the failed computer are either performed spontaneously by software on the failed computer, or are performed when the standby computer first detects the failure and instructs the failed computer to act.
According to such a system switching method, processing can be switched within only the expected time from failure detection by the standby computer until the failed computer has stably started saving failure information, so the switching time can be shortened.
To achieve the above object, in the present invention the standby computer that has detected the failure of the active computer sends the failed computer an instruction to start saving failure information, followed by an instruction to stop operating. The failed computer ignores the stop instruction if it is saving failure information normally, and accepts the stop instruction and halts completely if it is not.
This way of operating the failed computer prevents it, when the failure is so severe that failure information cannot be saved, from behaving unexpectedly and affecting the operation of the new active computer that has taken over the processing through the coupling points between the systems, such as the network and the shared disk device.
To achieve the above object, the present invention also stops the operation of the input/output devices at the coupling points between the systems, such as the network and the shared disk device, before failure information is saved in the failed computer.
This way of operating the failed computer prevents hardware activity unrelated to the saving of failure information from affecting, through the coupling points between the systems such as the network and the shared disk device, the operation of the new active computer that has taken over the processing.
Best mode for carrying out the invention
Hereinafter, an embodiment of the system switching method for a multiplex system according to the present invention will be described in detail.
FIG. 1 shows the configuration of a multiplex system according to this embodiment.
As shown in the figure, the multiplex system according to the present embodiment is a dual system composed of two computers. However, the system may also be composed of three or more computers.
In FIG. 1, computers 100 and 101 are the active computer and the standby computer, respectively. After system switching, the active computer 100 operates as a standby computer and the standby computer 101 operates as the active computer.
Each of the computers 100 and 101 includes a central processing unit (hereinafter referred to as MPU) 110, a main memory 111, and an input/output control device 112, which are connected by a processor bus 120. A disk device 113 and an expansion bus 121 are connected to the input/output control device 112.
A circuit for extending the function of the computer is connected to the expansion bus 121. In general, an expansion board on which a circuit is mounted is connected to the expansion bus 121 in a form of being inserted into a slot connector. However, some functions are implemented in the computer main body and may be directly connected to the expansion bus internally. The computers 100 and 101 according to this embodiment include a small computer system interface (SCSI) board 114, a linkage bus port (hereinafter referred to as LXP) board 115, and an Ethernet board 116 as expansion boards.
The shared disk device 102 is connected to the SCSI board 114. This shared disk device 102 is used to store, among other things, the data handed over when processing is taken over at system switching. A bus such as USB (Universal Serial Bus) may be used instead of the SCSI bus.
The Ethernet board 116 is connected to the Ethernet network 103 and communicates with other computers connected to the network 103. In the present embodiment, a plurality of controllers 910 for managing and controlling the plant 900 are connected to the network 103. A network such as token ring or ATM may be used instead of Ethernet.
The LXP board 115 is a function expansion board for system switching control, and the boards are connected to each other via the linkage bus 104, a dedicated transmission path. The LXP board monitors, between the computers 100 and 101, whether the other computer is alive; sends the instruction messages needed for system switching (forced interrupt, operation stop, and computer restart); and, when it receives one of these instruction messages, carries out the instruction on its own computer.
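The receive-side behaviour of the LXP board can be sketched as follows. This is only an illustrative model, not the board's firmware: the message names and returned action strings are assumptions, while the three inhibit flags correspond to the inhibit bits 1761, 1762, and 1763 of the operation control register described in the detailed explanation of the board.

```python
from dataclasses import dataclass


@dataclass
class LxpBoard:
    """Illustrative model of the LXP board's handling of received messages."""
    inhibit_interrupt: bool = False  # forced-interrupt inhibit bit (1761)
    inhibit_stop: bool = False       # operation-stop inhibit bit (1762)
    inhibit_restart: bool = False    # restart inhibit bit (1763)

    def handle(self, message: str) -> str:
        # Message names are hypothetical labels for the three instruction kinds.
        if message == "FORCED_INTERRUPT":
            return "ignored" if self.inhibit_interrupt else "assert NMI"
        if message == "STOP":
            return "ignored" if self.inhibit_stop else "hold reset"
        if message == "RESTART":
            return "ignored" if self.inhibit_restart else "pulse reset"
        # Anything else is kept for the management program to read later.
        return "stored"
```

With all inhibit flags clear, a stop instruction holds the computer in reset; once the corresponding flag is set, the same instruction is ignored.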
In such a dual system, when both the active computer 100 and the standby computer 101 are operating normally, the OS 130, the management program 131, the management communication program 132, and the application (AP) 135 are loaded into the main memory 111 of the active computer 100, and the management program 131, the management communication program 132, and the application 135 run on the OS 130. The same programs are likewise loaded into the main memory 111 of the standby computer 101, where the OS 130, the management program 131, and the management communication program 132 are running but the application 135 is not. In addition, an interrupt processing routine 133 is loaded in the main memory 111 of each of the computers 100 and 101.
The application 135 is the program that performs the processing for which the dual system is used. In the present embodiment, it processes and records the data sent from each controller 910 via the network 103.
The management program 131 performs the switching between the active computer and the standby computer. It issues message transmission/reception requests and operation instructions to the LXP board 115, and issues survival notification message transmission/reception requests to the management communication program 132.
The management communication program 132 uses the Ethernet board 116 to exchange survival notification messages with the other computer via the network 103. Messages are exchanged using the TCP/IP protocol. The program waits on a predetermined TCP port for a connection from the other computer; when a connection is made, it receives the message and retains its contents, and it returns the retained contents in response to a read request from the management program 131. In addition, on receiving a survival confirmation message transmission request from the management program 131, it sends a message to the TCP port on which the management communication program 132 of the other computer in the dual system is waiting.
The interrupt processing routine 133 is registered so that it is started when a non-maskable interrupt signal is input to the MPU. When a non-maskable interrupt signal occurs, it performs the failure-time processing, such as saving failure information. Although in the present embodiment the routine is registered to be started by a non-maskable interrupt signal, it may instead be implemented using another interrupt mechanism provided by the MPU. Also, although the interrupt processing routine 133 is an independent program in the present embodiment, some types of OS 130 provide the interrupt processing routine as part of the OS; in that case the same function can be achieved by incorporating the necessary processing as a subroutine called from the interrupt processing routine of the OS 130.
Next, a system switching method of the multiplex system according to the present embodiment will be described.
FIG. 2 shows a time chart of the system switching process.
When both the active computer 100 and the standby computer 101 are normal, the following processing is performed.
The management program 131 requests the management communication program 132 and the LXP board 115 to send a survival notification message at regular intervals (301). The management communication program 132 drives the Ethernet board 116 and transmits a survival notification message 401 to other computers via the network 103 (302). On the other hand, the LXP board 115 transmits a survival notification message 402 to other computers via the linkage bus 104 (303).
On the standby computer 101, the management communication program 132 and the LXP board 115, which receive the survival notification messages 401 and 402, each store the reception result (304, 305). The management program 131 of the standby computer 101 then checks at regular intervals with the management communication program 132 and the LXP board 115 of its own computer whether a survival notification message from the active computer has been received (306). If neither of the survival notification messages 401 and 402 from the active computer has been received for a certain time or longer, it judges that a failure has occurred in the active computer.
The survival notification message is sent over two paths so that a failure in a transmission path, or in the circuit connecting to it, can be distinguished from a failure of the computer itself. If only one of the survival notification messages is not received, the failure is judged to be in the transmission path; only a warning is issued, in a form such as a screen display or a log record, and system switching is not performed.
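The two-path decision rule can be illustrated with a small sketch. The enum and function names are assumptions made for this example; each boolean stands for "a survival notification message arrived on that path within the timeout window".

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"          # both paths delivered a message
    PATH_FAULT = "path_fault"    # warn only, no system switching
    PEER_FAILED = "peer_failed"  # start system switching


def classify(network_seen: bool, linkage_seen: bool) -> Verdict:
    """Classify the peer's state from the two survival notification paths."""
    if network_seen and linkage_seen:
        return Verdict.HEALTHY
    if network_seen or linkage_seen:
        # Exactly one path is silent: treat it as a transmission-path fault.
        return Verdict.PATH_FAULT
    # Both paths are silent: treat it as a failure of the computer itself.
    return Verdict.PEER_FAILED
```

Only the case in which both paths are silent leads to system switching; a single silent path produces a warning.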
Although FIG. 2 shows only the transmission of survival confirmation messages from the active computer 100 to the standby computer 101, survival confirmation messages are in fact also sent in the opposite direction: the reception confirmation processing 306 on the active computer 100 and the transmission processing 301 on the standby computer 101 are executed at regular intervals.
Next, an operation when a failure occurs in the active computer 100 will be described.
A plurality of failure modes can be considered. First, a case will be described in which a hang-up state is caused by factors such as the occurrence of an infinite loop inside the OS.
When a failure occurs inside the OS, the management program 131 stops running and the survival notification message transmission processing 301 is no longer executed at regular intervals. When the management program 131 of the standby computer 101, during the received-message check 306 that it performs at intervals of the fixed time 451, detects that neither of the two survival notification messages 401 and 402 has been received, it judges that a failure has occurred in the active computer 100. Having detected the failure, the management program 131 on the standby computer 101 requests the LXP board 115 to send a forced interrupt instruction (307), and the LXP board 115 sends a forced interrupt instruction message 403 to the LXP board of the active computer (308).
When the LXP board 115 on the active computer 100 receives the forced interrupt instruction message 403, it generates a non-maskable interrupt signal 404 in hardware (309). The MPU receives this interrupt signal and activates the interrupt processing routine 133.
When the interrupt processing routine 133 is started, it first invalidates the non-maskable interrupt signal; that is, it makes a setting so that any non-maskable interrupt signal generated again is ignored (310).
The interrupt processing routine 133 instructs the operation stop of the components in the own computer that may affect the partner computer 101 after starting (311). In the case of the configuration of this embodiment, the SCSI board 114 and the Ethernet board 116 correspond to such components, and the operation is stopped by setting a bit instructing the operation stop in the register in each board. As a result, when the partner computer 101 accesses the shared disk 102 or the network 103, it is not affected by the failure computer 100. Depending on the type of component, the operation stop may be instructed by clearing the operable bit in the register.
Next, the interrupt processing routine 133 sets the LXP board 115 so as to ignore subsequent instruction messages from other computers (312), and executes failure information storage (313). After the failure information is saved, the interrupt processing routine 133 is stopped (314), and the computer 100 in which the failure has occurred is stopped.
In the failure information saving process 313, the contents of the main memory 111, the contents of each register representing the operation state of the computer main body and each function expansion board, and the like are saved. In addition to saving the failure information, those parts of the normal shutdown processing that can still be executed under the conditions after the failure may also be executed. For example, if the cache contents are written to the disk device 113, the consistency of the disk contents of the failed computer is maintained, and the possibility that the contents can be rescued increases.
The management program 131 of the standby computer 101 requests the LXP board 115 to send an operation stop instruction (315) when a certain time 452 has elapsed after sending the forced interrupt instruction (307). At this point, the application 135 loaded in the standby computer 101 is activated to take over the processing of the active computer 100 (318), and the own computer is set as the new active system. This completes the system switchover.
In response to the operation stop instruction transmission request from the management program 131, the LXP board 115 transmits an operation stop instruction message 405 (316). However, since the interrupt processing routine 133 has set the LXP board of the failure-occurring computer 100 to ignore instruction messages (312), the operation stop instruction message 405 is ignored and the collection of failure information (313) continues.
In the operation stop processing 311 for the components in the failure-occurring computer, if each component is provided with an operation status confirmation means such as an operation status display register, a procedure for confirming that the operation stop processing 311 has actually stopped the component may be added. If this confirmation determines that the operation stop instruction has failed, the interrupt processing routine 133 stops processing. As a result, the setting for ignoring instruction messages from other computers is not made, so the computer 100 is forcibly stopped by its LXP board when that board receives the operation stop instruction message 405 from the LXP board of the standby computer, and the standby computer 101 takes over the processing without being affected by the failure-occurring computer 100.
Also, if it is determined at the beginning of the failure information saving process 313 that the failure information cannot be saved, for example because of a disk device error, the interrupt processing routine 133 may cancel the message ignore setting of the LXP board (319) and stop the failure information saving process. In this case too, the failure-occurring computer 100 is forcibly stopped upon receiving the operation stop instruction message 405 from the standby computer.
As a second failure mode, a failure generally called a kernel panic, in which the OS detects a serious logical contradiction and determines that continued operation is impossible, will be described. FIG. 3 shows a time chart of processing in this case.
When the OS detects a logical contradiction, the OS activates the interrupt processing routine 133 (331). As in the case described with reference to FIG. 2, the interrupt processing routine instructs the components in its own computer to stop operating (311), sets the LXP board 115 to ignore instruction messages from other computers (312), and then stores the failure information (313) and stops (314).
When a failure occurs in the OS and execution proceeds to the interrupt processing routine, the management program 131 on the active computer 100 does not operate, so the survival notification messages 401 and 402 are not transmitted to the standby computer. As described above, the management program 131 on the standby computer 101 detects that neither of the survival notification messages 401 and 402 has been received (306), and transmits a forced interrupt instruction message 403 and a computer operation stop instruction message 405 (308, 316).
When the forced interrupt instruction message 403 is received, the interrupt processing routine 133 has already started and the message ignore setting has been made on the LXP board (312), so the forced interrupt instruction message 403 is ignored (332) and the failure information collection 313 continues. The operation stop instruction message 405 received subsequently is similarly ignored (333).
Although the OS calls the interrupt processing routine 133 here, the interrupt processing routine 133 may instead be activated by generating a non-maskable interrupt signal. Depending on the type of OS, the OS itself stores the failure information (memory dump); if such an OS provides a function for calling a previously registered process before that storage is executed, equivalent processing can be realized by registering the interrupt processing routine 133 with the failure information storage (313) omitted.
As a third failure mode, a hardware failure will be described. The failure described here is one whose influence does not appear in the form of either of the two failure modes described above, but which prevents the multi-system from continuing the processing that is its original purpose, and which is detected by some detection method. FIG. 4 shows a time chart of processing in this case.
Detection of the occurrence of such a failure includes detection by the management program 131, detection by a dedicated failure detection subprogram 134, detection of an abnormality in the application 135, and the like. Of these, if a failure is detected by a program other than the management program, the management program 131 is notified of the occurrence of the failure (341, 342). The management program 131 activates the interrupt processing routine 133 in response to failure detection by itself or a failure notification from the failure detection subprogram 134 or the application 135 (343). The interrupt processing routine 133 executes the same processing procedure as that when detecting the OS logical contradiction described with reference to FIG.
Note that when the occurrence of a failure is monitored by a hardware mechanism, either the hardware notifies the management program 131 or the failure detection subprogram 134 of the abnormality detection result using an interrupt, or the management program or the failure detection subprogram periodically polls the hardware to check whether an abnormality has been detected; the same processing is then performed.
In some cases, the interrupt processing routine 133 cannot be activated due to the destruction of the memory contents or hardware malfunction. In this case, the failure-occurring computer 100 is in a severely uncontrollable state, and may operate unpredictably and affect the operation of the standby computer 101.
In this case, the setting (312) for ignoring instruction messages from other computers is not made on the LXP board 115 of the failure-occurring computer. Therefore, the LXP board 115 that has received the operation stop instruction message 405 from the standby computer forcibly puts the computer 100 into the stopped state. Because the processing is taken over only after the failure-occurring computer 100 has reliably been brought into a state in which it cannot affect the operation of the standby computer 101, the system can be switched reliably.
As shown in FIG. 3, the time 451 until it is determined that a failure has occurred because the survival notification messages have not been received is set slightly longer than the time from when the interrupt processing routine 133 is called by software until the setting for the LXP board (312) is completed. In addition, as shown in FIG. 2, the interval 452 between transmission of the forced interrupt instruction message and transmission of the computer operation stop instruction message is set slightly longer than the time from when the forced interrupt instruction (307) starts the interrupt processing routine 133 of the active computer 100 until the setting (312) for the LXP board is completed.
The system switching time, that is, the time until the completion of the process takeover, is approximately the sum of time 451 and time 452. This system switching time is sufficiently shorter than the time required for the failure information storage 313, such as a memory dump, so that both storage of the failure information and a short system switching time are achieved.
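The timing relationship described above can be illustrated with a minimal Python sketch. The function names and all numeric values are hypothetical and chosen only for illustration; the description does not state concrete durations.

```python
def switching_time(time_451: float, time_452: float) -> float:
    """Approximate system switching time: the failure determination wait (451)
    plus the delay before the operation stop instruction is sent (452)."""
    return time_451 + time_452


def interval_452_is_safe(time_452: float, time_to_setting_312: float) -> bool:
    """True if the operation stop instruction is sent only after the failing
    computer is expected to have completed setting 312 on its LXP board."""
    return time_452 > time_to_setting_312


# Hypothetical durations in seconds (assumptions, not values from the text).
total = switching_time(time_451=0.5, time_452=0.25)
safe = interval_452_is_safe(time_452=0.25, time_to_setting_312=0.2)
```

Under these assumed values the takeover completes after roughly 451 plus 452, independently of how long the memory dump itself takes.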
The above description covers the processing when a failure occurs in the active computer 100. When a failure occurs in the standby computer 101, the same processing is performed, except that no takeover of processing, and hence no switching between the active and standby systems, takes place.
In this embodiment, each computer is provided with the LXP board 115 and the Ethernet board 116. However, system switching can be performed by the same method in a multi-system in which each computer is provided with two Ethernet boards 116 and the Ethernet network 103 is duplicated to carry the survival monitoring messages. In such a system, the switching operation consisting of the failure information storage 313 in the failure-occurring computer 100 and the process takeover 318 by the standby computer 101 is possible for failure modes such as OS logical contradiction detection or partial hardware failure detection. However, since the forced interrupt instruction 403 cannot be sent, the failure information cannot be stored in the hang-up failure mode. Further, since the operation stop instruction message 405 cannot be sent, abnormal operation of the failure-occurring computer 100 may affect the standby computer 101, depending on the degree of the failure.
Details of each part will be described below.
First, the LXP board 115 will be described. FIG. 5 shows the internal configuration of the LXP board 115.
As shown in the figure, the LXP board 115 includes an expansion bus interface 170 in charge of input/output to and from the expansion bus 121, a linkage control processor 171 that performs message processing via the linkage bus 104, a memory 175 for storing the program executed by the linkage control processor 171, a transmission path interface 172 for converting between messages and electrical signals on the linkage bus, a message storage memory 173 which is a buffer for temporarily storing messages, a power supply voltage detection circuit 174 for detecting the rise of the power supply voltage, and an operation control register 176 for confirming the operation state of the linkage control processor 171 and instructing the operation method from the expansion bus side.
Since the operation control register 176 can be read and written from the expansion bus 121, software operating on the computer on which the LXP board 115 is mounted can confirm the operation state and instruct an operation method. The operation control register 176 includes a forced interrupt instruction prohibition bit 1761, an operation stop instruction prohibition bit 1762, and a restart instruction prohibition bit 1863, which will be described later.
The initialization operation of the LXP board will be described. The LXP board operates independently of the connected computer and needs to handle the reset signal of the computer itself. For this reason, the initialization process of the LXP board is performed only when power to the LXP board is turned on, independently of the reset process of the computer. To this end, the power supply voltage detection circuit 174, which monitors the power supply voltage supplied via the expansion bus 121, detects the rise of the power supply voltage and outputs an initialization signal 184 to initialize each component in the LXP board. The expansion bus interface 170, the linkage control processor 171, and the transmission path interface 172 receive the initialization signal 184 and perform initialization processing such as memory clear, clearing of various state information, register clear, and linkage bus reset.
Next, the message transmission function will be described. The management program 131 sends a message transmission request to the expansion bus interface 170 via the expansion bus 121. Since the data transfer speeds of the expansion bus 121 and the linkage bus 104 are different, the expansion bus interface 170 temporarily stores the message to be transmitted in the message storage memory 173, which serves as a speed buffer, and notifies the linkage control processor 171 of the message's arrival. Upon receiving this notification, the linkage control processor 171 retrieves the message from the message storage memory 173, transfers it to the transmission path interface 172, and transmits the message to the LXP board of another computer via the linkage bus 104.
Finally, the message reception processing function will be described. When an instruction message arrives from the LXP board of another computer via the linkage bus 104, one of the following processes is performed according to the type of the message.
(1) When the message is a forced interrupt instruction, a non-maskable interrupt signal is output to the connected own computer through the non-maskable interrupt signal line 182, so that the processing by the MPU 110 is interrupted and switched to the interrupt processing routine 133. However, when the forced interrupt instruction prohibition bit 1761 of the register 176 is set, this process is not performed and the instruction message is ignored.
(2) When the message is an operation stop instruction, the reset signal is continuously output to the connected own computer through the reset signal line 183, thereby forcibly stopping the computer. However, when the operation stop instruction prohibition bit 1762 of the register 176 is set, this processing is not performed and the message is ignored.
(3) When the message is a restart instruction, a reset signal is output once to the connected own computer through the reset signal line 183, thereby restarting the computer. However, when the restart instruction prohibition bit 1863 of the register 176 is set, this processing is not performed and the message is ignored.
(4) In the case of a message other than those described above, the message content is stored in the message storage memory 173. The stored message is thereafter read out as needed via the expansion bus interface 170 and the expansion bus 121 in response to a request from the management program 131.
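The four reception cases (1) to (4) above can be modelled in a few lines. The following Python sketch is only an illustrative model of the behaviour described, not the board's firmware; the class name, message type strings, and signal labels are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical message type labels (assumptions for this sketch).
FORCED_INTERRUPT = "forced_interrupt"
OPERATION_STOP = "operation_stop"
RESTART = "restart"


@dataclass
class LxpBoard:
    # Prohibition bits of the operation control register 176.
    inhibit_interrupt: bool = False   # bit 1761
    inhibit_stop: bool = False        # bit 1762
    inhibit_restart: bool = False     # restart instruction prohibition bit
    stored_messages: list = field(default_factory=list)  # message storage memory 173
    signals: list = field(default_factory=list)  # signals driven to the own computer

    def receive(self, kind: str, body: str = "") -> None:
        """Handle one message arriving over the linkage bus."""
        if kind == FORCED_INTERRUPT:
            if not self.inhibit_interrupt:
                self.signals.append("NMI")          # case (1)
        elif kind == OPERATION_STOP:
            if not self.inhibit_stop:
                self.signals.append("RESET_HELD")   # case (2): reset held continuously
        elif kind == RESTART:
            if not self.inhibit_restart:
                self.signals.append("RESET_PULSE")  # case (3): reset output once
        else:
            self.stored_messages.append((kind, body))  # case (4)
```

For example, a board whose `inhibit_stop` bit has been set by the interrupt processing routine records no reset signal when an operation stop instruction arrives, which is what allows the failure information storage to continue.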
FIG. 6 shows the processing procedure of the expansion bus interface 170.
When the expansion bus interface 170 receives an input/output request signal from the computer (expansion bus) or an initialization signal from the initialization signal line 184, it exits from the request wait state 501, starts processing, and determines the type of processing request from the received signal (502).
If the processing request is an initialization signal, an internal register or circuit initialization process (503) is performed.
When the processing request is a read signal from the expansion bus 121, the content of the register 176 is read if the target of the read request is a register (505), and the content of the message storage memory 173 is read if the target is a message (507); the read result is sent to the expansion bus 121 (506, 508).
When the processing request is a write signal from the expansion bus 121, if the target of the write request is a register, the write content is written to the register 176 (510). On the other hand, if the target of the write request is a transmission message, the transmission message is temporarily stored in the message storage memory 173 (511) and transmitted to the linkage control processor 171 (512).
FIG. 7 shows the processing procedure of the linkage control processor 171.
The linkage control processor 171 exits from the event wait state 521 and starts processing upon any one of an activation request from the expansion bus interface 170, a message reception from the transmission path interface 172, or an initialization signal from the initialization signal line 184, and determines the type of the event (522).
If the generated event is an initialization signal, the communication process is initialized, all messages stored in the message storage memory 173 are discarded, and the register 176 is set to the initial state (523).
On the other hand, if the generated event is an activation request from the expansion bus interface 170, that is, a message transmission request, the message to be transmitted is read from the message storage memory 173 (524), and the message is sent to the transmission path interface 172. Transmit (525).
When the event that has occurred is a message reception event from the transmission path interface 172, it indicates the arrival of an instruction message from another LXP board. In this case, the type of the received instruction message is determined (526), and processing corresponding to each is performed.
If the message is a forced interrupt instruction, an operation stop instruction, or a restart instruction, it is confirmed that the corresponding prohibition bit (1761, 1762, 1863) in the register 176 is cleared (527, 529, 531), and the corresponding signal described above is output (528, 530, 532).
In the case of a message other than the above, the received instruction message is stored in the message storage memory 173 (533).
Next, the management program 131 will be described.
The management program 131 performs the following three processes.
(1) In order to notify other computers that the own computer is operating normally, a survival notification message is periodically transmitted.
(2) Monitor the survival notification message sent from another computer; if it is not received for a certain period of time, determine that a failure has occurred in the source computer and send a forced interrupt instruction message and an operation stop instruction message to that computer. Further, if the failure-occurring computer is an active computer, take over the processing executed by that computer and set the own computer as the new active computer.
(3) Recognize that a failure has occurred in the local computer by a call from another program, and activate an interrupt processing routine 133 such as failure information collection.
Note that the management program 131 may have a function of detecting the occurrence of a failure in the own computer. In this case, when a failure is detected, an interrupt processing routine is started in the same manner as (3).
FIG. 8 shows a process flow of the survival notification message transmission process (1).
As shown in the figure, in this process, a survival notification is periodically sent to other computers. That is, the management communication program 132 and the LXP board 115 are requested to send a survival notification message (301), and the process of shifting to a waiting state for a predetermined time (541) is repeated.
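As a minimal sketch of this transmission loop, the Python function below sends a notification over every path and then waits, repeating the cycle. The real process repeats indefinitely; the `cycles` bound, the injectable `sleep` argument, and the function name are assumptions added here so that the example terminates and can be exercised without real delays.

```python
import time
from typing import Callable, Iterable


def send_survival_notifications(
    senders: Iterable[Callable[[], None]],
    interval: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send a survival notification on each path, then wait, `cycles` times."""
    paths = list(senders)
    for _ in range(cycles):
        for send in paths:
            send()          # request transmission on one path (301)
        sleep(interval)     # wait for the predetermined time (541)


# Usage with stand-in senders for the two paths (hypothetical labels).
log = []
send_survival_notifications(
    [lambda: log.append("ethernet"), lambda: log.append("lxp")],
    interval=1.0,
    cycles=2,
    sleep=lambda seconds: log.append(seconds),
)
```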
FIG. 9 shows a processing flow of the monitoring of the survival notification message (2) and the processing at the time of occurrence of another system failure.
As shown in the figure, the reception status of the survival message from the other computer is periodically checked, and if it cannot be received for a predetermined time or longer, the processing at the time of occurrence of another system failure is executed.
To set the waiting time 451 for determining a failure in the other system, the variables “notification 1 wait count” and “notification 2 wait count” are used. The initial value of each variable is N, and the product N × t_w of N and the waiting time t_w in process 563 is the waiting time 451 for determining a failure in the other system. First, both variables are initialized to N (551, 552).
Next, since the contents of received messages are stored in the management communication program 132, the management communication program 132 is asked whether the survival notification message 401 has been received (553). If it has been received, “notification 1 wait count” is reset to N (554), and the management communication program 132 is instructed to clear the stored survival notification message (555). If it has not been received, the value of “notification 1 wait count” is decreased by one; if the value would become negative, it is set to 0 (556).
Similarly, since the LXP board 115 stores the contents of received messages, it is asked whether the survival notification message 402 has been received (557). If it has been received, “notification 2 wait count” is reset to N (558), and the LXP board 115 is instructed to clear the stored survival notification message (559). If it has not been received, the value of “notification 2 wait count” is decreased by one; if the value would become negative, it is set to 0 (560).
Here, the values of “notification 1 wait count” and “notification 2 wait count” are examined (561).
When both variables are 0, neither of the survival notification messages 401 and 402 has been received during the waiting time 451 represented by N × t_w, so it is determined that a failure has occurred in the other computer. First, the LXP board 115 is requested to send a forced interrupt instruction message 403 (307); then, after waiting for a predetermined time 452 (564), the LXP board 115 is requested to send the computer operation stop instruction message 405 (315). Further, when the own computer is set as a standby computer, the processing of the active computer is taken over (318), and system switching is executed. After these processes have been executed, the failure-occurring computer of the other system is necessarily in the stopped state, so the monitoring process for the survival notification message is stopped (566). When the faulty computer has been replaced, or the cause of the fault has been removed, and the computer is returned to the redundant system as a standby computer, this processing is started again (550). The restart may be a manual operation by the operator; alternatively, after the monitoring process is stopped (566), another process may be started that continues to watch for survival notification messages and resumes the monitoring process (550) when one is detected.
If only one of “notification 1 wait count” and “notification 2 wait count” is 0 in process 561, it is determined that a failure has occurred in a message transmission path or in the connection circuit to that path, and a warning is issued in the form of a screen display or log record (562).
Except when both “notification 1 wait count” and “notification 2 wait count” are 0 in process 561, the process waits for the predetermined time t_w (563) and returns to process 553.
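The monitoring flow of FIG. 9 can be sketched as follows. This Python model is an illustration under stated assumptions, not the management program itself: the function and parameter names are hypothetical, the clearing of stored messages (555, 559) is assumed to happen inside the `received_1` and `received_2` callbacks, and `max_cycles` is added only so that the example terminates.

```python
from typing import Callable


def monitor_survival(
    received_1: Callable[[], bool],
    received_2: Callable[[], bool],
    n: int,
    wait: Callable[[], None],
    warn: Callable[[str], None],
    max_cycles: int,
) -> bool:
    """Return True once a failure of the other computer is determined."""
    wait_1 = wait_2 = n                                       # 551, 552
    for _ in range(max_cycles):
        wait_1 = n if received_1() else max(wait_1 - 1, 0)    # 553 to 556
        wait_2 = n if received_2() else max(wait_2 - 1, 0)    # 557 to 560
        if wait_1 == 0 and wait_2 == 0:                       # 561: both paths silent
            return True
        if wait_1 == 0 or wait_2 == 0:                        # only one path silent
            warn("transmission path failure suspected")       # 562
        wait()                                                # 563: wait t_w
    return False


def handle_other_system_failure(
    lxp_send: Callable[[str], None],
    wait_452: Callable[[], None],
    is_standby: bool,
    take_over: Callable[[], None],
) -> None:
    """Actions taken after monitor_survival() returns True."""
    lxp_send("forced_interrupt")    # 307
    wait_452()                      # 564
    lxp_send("operation_stop")      # 315
    if is_standby:
        take_over()                 # 318
```

With N = 3 and both paths silent, the counters reach 0 on the third check, which corresponds to a waiting time of about N × t_w.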
FIG. 10 shows the processing flow of the management program 131 when a failure occurs in the own computer (3).
This process is activated by a call from the failure detection subprogram 134 or the application 135 (570), and simply activates the interrupt processing routine 133 (343). The interrupt processing routine 133 does not return processing to the caller.
Next, the interrupt processing routine 133 will be described.
The interrupt processing routine 133 is started either by software on the own computer when a failure occurs or by the LXP board 115 upon receiving a forced interrupt instruction message from another computer, and performs storage of the failure information and the processing related to it.
FIG. 11 shows a processing flow of the interrupt processing routine 133.
The interrupt processing routine 133 first disables the non-maskable interrupt signal at the time of activation (310). This is realized by preparing a dummy interrupt processing routine that returns without performing any processing and registering it in the MPU as the processing routine for the non-maskable interrupt. As a result, even if a non-maskable interrupt signal is generated again during the processing of the interrupt processing routine 133, processing moves to the dummy routine and immediately returns from the interrupt, so the non-maskable interrupt is effectively ignored and the interrupt processing routine 133 can continue.
Next, those components of the own computer that may affect the computers of other systems are instructed to stop operating (311). Then, the status of each component instructed to stop is queried, and it is confirmed whether all of them have actually stopped operating (581). If any component has failed to stop, the interrupt processing is terminated (590). If all the components instructed to stop have stopped, the LXP board 115 is set to ignore subsequent instruction messages from other computers (312).
Subsequently, it is checked whether the failure information can be saved (582). If it is judged that the failure information cannot be saved, the setting for ignoring instruction messages from other computers is cancelled on the LXP board 115 (319), and the interrupt processing is terminated (590). If it is determined that saving is possible, the failure information is actually saved (313). After the failure information has been saved, the interrupt processing routine 133 is stopped (314), and the own computer is stopped. Note that after the failure information has been saved, the LXP board 115 on the own computer may be instructed to keep outputting the reset signal so as to stop the operation of the computer completely.
When the routine stops because the interrupt processing was terminated (590), the own computer enters a stopped state; in addition, the LXP board 115 continuously generates a reset signal in response to the operation stop instruction message sent from another computer, so operation is stopped in any case.
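The flow of FIG. 11 described above can be summarised in a short Python sketch. It models only the order of the steps and the two abort paths; `FakeComponent`, `FakeLxp`, the callback parameters, and the returned status strings are hypothetical stand-ins introduced for illustration.

```python
from typing import Callable, Iterable


class FakeComponent:
    """Stand-in for a board such as the SCSI board 114 or the Ethernet board 116."""

    def __init__(self, stops_on_request: bool = True):
        self.stops_on_request = stops_on_request
        self.stopped = False

    def stop(self) -> None:
        if self.stops_on_request:
            self.stopped = True

    def is_stopped(self) -> bool:
        return self.stopped


class FakeLxp:
    """Stand-in for the ignore setting on the LXP board 115."""

    def __init__(self) -> None:
        self.ignore_instructions = False

    def set_ignore_instructions(self, ignore: bool) -> None:
        self.ignore_instructions = ignore


def interrupt_processing_routine(
    disable_nmi: Callable[[], None],
    components: Iterable[FakeComponent],
    lxp: FakeLxp,
    can_save: Callable[[], bool],
    save_failure_info: Callable[[], None],
) -> str:
    disable_nmi()                                     # 310
    parts = list(components)
    for part in parts:
        part.stop()                                   # 311
    if not all(part.is_stopped() for part in parts):  # 581
        return "aborted"                              # 590: ignore setting never made
    lxp.set_ignore_instructions(True)                 # 312
    if not can_save():                                # 582
        lxp.set_ignore_instructions(False)            # 319
        return "aborted"                              # 590
    save_failure_info()                               # 313
    return "stopped"                                  # 314
```

In both abort paths the ignore setting is left off, so a later operation stop instruction from the other computer still takes effect.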
As described above, according to the present invention, in a multiplex system, it is possible to realize high-speed system switching while saving large-capacity fault information including a memory dump when a fault occurs.
Further, according to the present invention, runaway of hardware or software in the failed system, and the failure information saving operation in the failed system, can be prevented from affecting the system switching operation and the operation of the new active system that has taken over the processing after switching.
Industrial applicability
As described above, the present invention is effective for multi-systems used in applications requiring high reliability. In a multi-system provided with a standby computer that takes over the processing performed by the active computer when a failure occurs in the active computer, the failure can be analyzed after the fact when a failure occurs in any of the computers; this can be used to implement recovery measures and recurrence prevention measures, and helps improve system reliability.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a dual system, and FIG. 2 is a time chart showing the order of system switching processing and the relationship between the processes in this dual system.
FIG. 3 is a time chart of the system switching process based on OS logical contradiction detection, and FIG. 4 is a time chart of the system switching process based on hardware failure detection.
FIG. 5 is a block diagram showing the configuration of the LXP board mounted on the computer, FIG. 6 is a flowchart showing the processing procedure of the expansion bus interface mounted on the LXP board, and FIG. 7 is a flowchart showing the processing procedure of the linkage control processor mounted on the LXP board.
FIG. 8 is a flowchart showing the processing procedure of the survival notification message transmission process of the management program, and FIG. 9 is a flowchart showing the processing procedure of the monitoring of the survival notification message of the management program and the processing when another system failure occurs. FIG. 10 is a flowchart showing a processing procedure for processing when a failure occurs in the own computer of the management program.
FIG. 11 is a flowchart showing the processing procedure of the interrupt processing routine.

Claims

In a multi-system configured from multiple computers, in which, when a failure occurs in the computer set as the active system, the computer set as the standby system takes over the processing performed by the computer set as the active system,
When the failure occurs,
The software operating on the failed computer detects the failure and saves the failure information, or the standby computer detects the failure and instructs the failed computer to save the failure information,
And the standby computer, after recognizing the failure, spontaneously takes over the processing without waiting for the end of the storage of the failure information in the failed computer,
Each of the computers is equipped with a function expansion board that operates independently from the software on the computer and is connected to each other via a transmission path.
Each function expansion board has a function for generating an interrupt to the computer on which the function expansion board is mounted and a function for stopping the operation of that computer, according to the content of a message received via the transmission path from the function expansion board mounted on another computer, and has a function by which software operating on the computer on which the function expansion board is mounted can instruct suppression of each of these functions with respect to such messages,
When the occurrence of a failure in another computer is recognized, a message instructing generation of an interrupt is sent from the function expansion board mounted on the computer that has recognized the failure to the function expansion board mounted on the failed computer, and then, after a certain period of time, a message instructing the stop of the computer is sent,
In the interrupt processing for the interrupt generated by the function expansion board mounted on the failed computer in response to the interrupt instruction message, the failure information is saved and the function expansion board is instructed to suppress the computer operation stop function and the interrupt generation function, so that the message instructing the stop of the computer that is sent later is ignored and saving of the failure information continues; a system switching method for a multi-system characterized by the above.

The system switching method for a multi-system according to claim 1, wherein, when a failure occurs, software on the failed computer voluntarily saves the failure information and instructs the function expansion board to suppress the interrupt generation function and the computer operation stop function, so that the interrupt generation instruction and computer stop instruction messages sent later are ignored and saving of the failure information continues.