JP3555047B2 - Compound computer system

Info

Publication number: JP3555047B2
Application number: JP33135795A
Authority: JP
Inventors: 進奥原; 浩守島; 新吾前田; 貴久子田巻
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-12-20
Filing date: 1995-12-20
Publication date: 2004-08-18
Anticipated expiration: 2015-12-20
Also published as: JPH09171475A; US5974565A

Description

【０００１】
【発明の属する技術分野】
本発明は、独立して稼働する複数の処理装置によって共有資源を排他制御してアクセスする複合コンピュータシステムに関し、特に、複数の処理装置によって共有資源を排他制御してアクセスする際に発生した障害を早期に発見し代替処理を行う複合コンピュータシステムに適用して有効な技術に関するものである。
【０００２】
【従来の技術】
従来、相互に接続された複数の処理装置が磁気ディスク装置や磁気テープ装置などの資源を共有する負荷分散・協調型の複合コンピュータシステムにおいては、複数の処理装置間での通信を行うチャネル間結合装置等の入出力機器を接続し、入出力命令によって相互に通信することにより複数の処理装置間の連携を行ってきた。
【０００３】
ところが、この様な従来の複合コンピュータシステムにおいては、チャネル障害、通信経路の障害及びシステムダウン等の障害により相手系の処理装置との連絡が不能になると、共有している資源の排他処理が続行できなくなる。
【０００４】
従って、相手系の処理装置の無応答を検知した場合には、オペレータに無応答の処理装置を検知したことを示すメッセージを出力して人間の判断によって障害部位を特定し、発生した障害に対応する処理を行って業務を続行していた。
【０００５】
なお、従来の複合コンピュータシステムにおける障害検知時の応答手順については（株）日立製作所発行のマニュアル「プログラムプロダクトＶＯＳ３／ＡＳシステム操作−ＪＳＳ３編−」（平成６年１２月発行）に「ＭＳＣＦ障害時のオペレータ処置」として記述されている。
【０００６】
更に、従来の複合コンピュータシステムにおいて、複数の処理装置間の通信オーバヘッドを削減する為に、共有資源を管理する排他制御用のメモリを設け、複数の処理装置間で効率よく連携する方式がとられてきた。
【０００７】
例えば、二重化される磁気ディスク装置の各ボリューム単位に設けた不揮発の制御メモリに排他制御用のロック情報を配置し、ディスク二重書き制御プログラムで前記制御メモリの排他制御用のロック情報を使用するものがある。
【０００８】
前記のディスク二重書き制御プログラムでは、１つの処理装置がロック情報を更新すると、他の処理装置に非同期の入出力割り込みとして報告する機能を利用して、複数の処理装置間で連携することを実現している。
【０００９】
しかし、前記従来の複合コンピュータシステムにおいて、１つの処理装置がロック情報を持ったままシステムダウンした場合には、正常に稼働中の他の処理装置の二重書き磁気ディスク装置へのアクセスがロック情報を確保できず、入出力タイムオーバとなり処理が続行できなくなる。
【００１０】
前述したチャネル間結合装置等の入出力機器を使用して複数の処理装置間で通信を行って共有資源の排他制御を行う従来の技術や、前記ディスク二重書き制御プログラムの様に１つの処理装置がロック情報を持つことによって排他制御を行う従来の技術では、他の処理装置の稼働状態を判断することができない為、ロック情報を持つ処理装置に障害が発生したときのロック情報の解除にはオペレータの介入が必要である。
【００１１】
この為、前記従来の複合コンピュータシステムでは、事前に障害時の組み合わせを想定した回復手順書を作成する必要があり、複合コンピュータシステムを運用する際の負担となっていた。
【００１２】
【発明が解決しようとする課題】
本発明者は、前記従来技術を検討した結果、以下の問題点を見い出した。
【００１３】
すなわち、前記従来の複合コンピュータシステムでは、相手系の処理装置の無応答を検知した場合に、オペレータに無応答の処理装置を検知したことを示すメッセージを出力して人間の判断によって障害部位を特定し業務を続行していた為、メッセージ出力時の運用手順の作成等の運用負担の増加や、長時間の無人運転に対応することができないという問題があった。
【００１４】
また、前記従来の複合コンピュータシステムのディスク二重書き制御プログラムでは、１つの処理装置がロック情報を持ったままシステムダウンした場合には、ロック情報の解除にはオペレータの介入を必要とする為、事前に障害時の組み合わせを想定した回復手順書を作成する必要があり運用上の負担となっていた。
【００１５】
本発明の目的は、障害が発生したときに早期に障害部位を特定し障害部位に対応する処理を行って長時間の無人運転の実現とユーザ負担の軽減を行うことが可能な技術を提供することにある。
【００１６】
本発明の他の目的は、特定の稼働監視装置が障害により使用できなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能な技術を提供することにある。
【００１７】
本発明の他の目的は、稼働監視装置が全面的に動作しなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能な技術を提供することにある。
【００１８】
本発明の前記並びにその他の目的と新規な特徴は、本明細書の記述及び添付図面によって明かになるであろう。
【００１９】
【課題を解決するための手段】
本願によって開示される発明のうち、代表的なものの概要を簡単に説明すれば、下記のとおりである。
【００２０】
（１）複数の処理装置を通信手段で接続し特定の共有資源を排他制御してアクセスする複合コンピュータシステムにおいて、
複数の処理装置が起動または停止したときに前記複数の処理装置の稼働状態を記録する稼働監視装置と、前記複数の処理装置と稼働監視装置とを接続する稼働監視用ネットワークと、前記複数の処理装置のプログラムが起動または停止したときに前記プログラムの稼働状態を記録するプログラム状態管理手段とを備え、前記複数の処理装置で障害が発生したときに前記稼働監視用ネットワークを介して稼働監視装置に記録された前記複数の処理装置の稼働状態と前記プログラム状態管理手段に記録されたプログラムの稼働状態を取得して障害部位の特定を行うものである。
【００２１】
前記複合コンピュータシステムでは、複数の処理装置をチャネル間結合装置等の特定の通信手段で接続し、前記チャネル間結合装置等の特定の通信手段によって複数の処理装置間で通信を行うことにより、磁気ディスク装置や磁気テープ装置等の特定の共有資源を排他制御してアクセスしている。
【００２２】
前記複合コンピュータシステムを構成する複数の処理装置は、前記チャネル間結合装置等の排他制御用の特定の通信手段とは異なる稼働監視用ネットワークを介して稼働監視装置に接続されており、前記複数の処理装置が起動または停止したときに前記複数の処理装置の稼働状態を前記稼働監視装置に記録する。
【００２３】
また、前記複合コンピュータシステムの複数の処理装置で稼働するオペレーティングシステムは、前記複数の処理装置上でプログラムが起動または停止したときに前記プログラムの稼働状態をプログラム状態管理手段に記録する。
【００２４】
前記複合コンピュータシステムにおいて、前記チャネル間結合装置等の排他制御用の特定の通信手段によって、磁気ディスク装置や磁気テープ装置等の特定の共有資源を排他制御してアクセスしようとしたときに、特定の処理装置からの応答が予め規定された特定の時間を経過しても得られない無応答の状態を検知する場合がある。
【００２５】
前記の様に無応答の状態を検知したときに複合コンピュータシステムで障害が発生したとみなして、前記稼働監視用ネットワークを介して稼働監視装置に記録された前記複数の処理装置の稼働状態と前記プログラム状態管理手段に記録されたプログラムの稼働状態を取得し、前記特定の処理装置の稼働状態と前記特定の処理装置上のプログラムの稼働状態とを比較して障害部位の特定を行う。
【００２６】
すなわち、前記特定の処理装置が非稼働中である場合には、障害部位を前記特定の処理装置であるとみなして他の処理装置で排他処理を代替する縮退運転を行い、前記特定の処理装置が稼働中である場合には、前記特定の処理装置上のプログラムの稼働状態を調べる。
【００２７】
前記特定の処理装置上のプログラムの稼働状態を調べ、前記特定の処理装置上のプログラムが非稼働中である場合には、障害部位を前記特定の処理装置上のプログラムであるとみなして前記特定の処理装置上のプログラムの再起動を行い、前記特定の処理装置上のプログラムが稼働中である場合には、前記排他制御用の特定の通信手段が障害部位であるとみなして予備の通信経路を選択して排他制御を続行する。
【００２８】
以上の様に、前記複合コンピュータシステムによれば、複数の処理装置の稼働状態と前記複数の処理装置上のプログラムの稼働状態とを稼働監視用ネットワークを介して監視するので、障害が発生したときに早期に障害部位を特定し障害部位に対応する処理を行って長時間の無人運転の実現とユーザ負担の軽減を行うことが可能である。
【００２９】
（２）前記（１）に記載された複合コンピュータシステムにおいて、前記稼働監視装置を複数備え、正装置である稼働監視装置以外の稼働監視装置から前記複数の処理装置への通信を抑止する通信抑止手段と、前記通信抑止手段を制御することにより、複数の処理装置から前記複数の稼働監視装置への通信を制御する単一の稼働監視装置多重化手段とを備え、前記稼働監視装置多重化手段により前記複数の処理装置の稼働状態を前記複数の稼働監視装置に送信すると共に、前記通信抑止手段により正装置である稼働監視装置以外の稼働監視装置から前記複数の処理装置への通信を抑止して正装置である稼働監視装置のみにより前記複数の処理装置の稼働状態を監視し、前記正装置である稼働監視装置に障害が発生した場合に、前記稼働監視装置多重化手段により前記障害の発生した稼働監視装置以外の複数の稼働監視装置の任意の稼働監視装置の通信抑止手段の通信抑止状態を解除し、前記通信抑止状態が解除された稼働監視装置により前記複数の処理装置の稼働状態の監視を続行するものである。
【００３０】
前記複合コンピュータシステムでは、複数の処理装置と複数の稼働監視装置とを稼働監視用ネットワークで接続し、前記複数の稼働監視装置は、前記複数の処理装置との通信を抑止する通信抑止手段を備えている。
【００３１】
前記複合コンピュータシステムでは、稼働監視装置多重化手段により、前記複数の処理装置からの通知を前記複数の稼働監視装置のそれぞれに通知する。
【００３２】
一方、前記複数の稼働監視装置では、特定の稼働監視装置以外の稼働監視装置の通信抑止手段を通信抑止状態にしておき、前記特定の稼働監視装置を正装置、前記特定の稼働監視装置以外の稼働監視装置を副装置とし、正装置である稼働監視装置以外からの前記複数の処理装置への通信を抑止している。
【００３３】
前記の様に、副装置である稼働監視装置において通信抑止手段によって稼働監視装置から複数の処理装置への通信が抑止されることにより、特定の処理装置のシステム停止を検知した場合に送られる通知が、稼働中の他の処理装置に重複して届けられることはない。
【００３４】
前記複合コンピュータシステムにおいて、正装置である稼働監視装置に障害が発生し、予め規定された特定の時間が経過しても正装置である稼働監視装置からの応答が得られない状態となって、前記複数の処理装置と正装置である稼働監視装置との間の通信ができなくなった場合には、前記稼働監視装置多重化手段は、副装置である稼働監視装置の特定の稼働監視装置の通信抑止手段の通信抑止状態を解除する。
【００３５】
この様にして、多重化された稼働監視装置の特定の稼働監視装置が障害により使用できなくなっても、複数の処理装置側では何も意識する必要はなく、障害の発生していない他の稼働監視装置によって複数の処理装置の稼働状態の監視を続行することが可能である。
【００３６】
以上の様に、前記複合コンピュータシステムによれば、複数の稼働監視装置により複数の処理装置の稼働状態を監視するので、特定の稼働監視装置が障害により使用できなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【００３７】
（３）前記（１）または（２）に記載された複合コンピュータシステムにおいて、複数の処理装置を接続する前記通信手段を介して前記複数の処理装置間で特定のデータを送受信することにより前記複数の処理装置が相互に稼働状態の監視を行うものである。
【００３８】
前記複合コンピュータシステムにおいて、複数の処理装置上で稼働中のプログラムは、各処理装置を結ぶチャネル間結合装置等の特定の通信手段を介して一定間隔で入出力命令を発行する。
【００３９】
例えば、特定の処理装置で稼働中のプログラムは、他の処理装置上で稼働中のプログラムにある特定のデータを送信し、前記他の処理装置上で稼働中のプログラムは、前記特定のデータを受信したら、その応答として受信確認のデータを送信元の前記特定の処理装置上で稼働中のプログラムに送り返す。
【００４０】
この様なシーケンスで、複数の処理装置で稼働中の各プログラムが、相互に特定のデータを送受信することによって、何らかの障害が発生した場合には予め規定された特定の時間を経過しても応答が受信されない為、無応答をもって相手の処理装置の異常とみなせる。
【００４１】
前記の様に、複数の処理装置で稼働中のプログラムが相互に特定のデータを送受信する場合には、相互に特定のデータを送受信するプログラムの数が増加すると、その通信負荷が急速に増加することが考えられるが、前記複合コンピュータシステムでは、通常の障害検知は稼働監視装置により実現することが可能である為、前記の相互に特定のデータを送受信する頻度を少なくしても良い。
【００４２】
従って、前記複合コンピュータシステムでは、複数の処理装置相互で特定のデータを送受信するオーバヘッドを削減して通常の通信に与える影響を少なくすると共に、稼働監視装置が障害の発生等により全面的に動作しなくなった場合であっても複数の処理装置の稼働状態の監視を続行することが可能である。
【００４３】
以上の様に、前記複合コンピュータシステムによれば、複数の処理装置相互で特定のデータを送受信して他の処理装置の稼働状態を監視するので、稼働監視装置が全面的に動作しなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【００４４】
【発明の実施の形態】
以下、本発明について、実施形態とともに図を参照して詳細に説明する。なお、実施形態を説明するための全図において、同一機能を有するものは同一符号を付け、その繰り返しの説明は省略する。
【００４５】
（実施形態１）
以下に、本発明の複合コンピュータシステムにおいて、磁気ディスク装置上の共有データを排他制御管理プログラムを介してアクセスする複数の処理装置を監視する実施形態１の複合コンピュータシステムについて説明する。
【００４６】
図１は、本実施形態の複合コンピュータシステムの概略構成を示す図である。図１において、１００、１１０及び１２０は処理装置、１０１、１０２、１１１、１１２、１２１及び１２２は命令プロセッサ、１０３、１０４、１１３、１１４、１２３及び１２４は入出力プロセッサ、１０５、１１５及び１２５は主記憶装置、１０６、１１６及び１２６はシステム制御装置、１０７、１１７及び１２７はサービスプロセッサ、１０８、１１８及び１２８はコンソール、１３０は稼働監視装置、１４０及び１４１は磁気ディスク装置、１５０及び１５１は磁気テープ装置、１６０〜１６２はチャネル間結合装置である。
【００４７】
図１に示す様に、本実施形態の複合コンピュータシステムは、処理装置１００、１１０及び１２０と、命令プロセッサ１０１、１０２、１１１、１１２、１２１及び１２２と、入出力プロセッサ１０３、１０４、１１３、１１４、１２３及び１２４と、主記憶装置１０５、１１５及び１２５と、システム制御装置１０６、１１６及び１２６と、サービスプロセッサ１０７、１１７及び１２７と、コンソール１０８、１１８及び１２８と、稼働監視装置１３０と、磁気ディスク装置１４０及び１４１と、磁気テープ装置１５０及び１５１と、チャネル間結合装置１６０〜１６２とを有している。
【００４８】
また、図１に示す様に、本実施形態の複合コンピュータシステムでは、処理装置１００は、命令プロセッサ１０１と、命令プロセッサ１０２と、入出力プロセッサ１０３と、入出力プロセッサ１０４と、主記憶装置１０５とをシステム制御装置１０６に接続し、処理装置１００に対してシステムの起動指示及びハードウェア構成定義をするサービスプロセッサ１０７及びコンソール１０８が接続されている。
【００４９】
また、処理装置１１０は、命令プロセッサ１１１と、命令プロセッサ１１２と、入出力プロセッサ１１３と、入出力プロセッサ１１４と、主記憶装置１１５とをシステム制御装置１１６に接続し、処理装置１１０に対してシステムの起動指示及びハードウェア構成定義をするサービスプロセッサ１１７及びコンソール１１８が接続されており、処理装置１２０は、命令プロセッサ１２１と、命令プロセッサ１２２と、入出力プロセッサ１２３と、入出力プロセッサ１２４と、主記憶装置１２５とをシステム制御装置１２６に接続し、処理装置１２０に対してシステムの起動指示及びハードウェア構成定義をするサービスプロセッサ１２７及びコンソール１２８が接続されている。
【００５０】
入出力プロセッサ１０３、１０４、１１３、１１４、１２３及び１２４は、磁気ディスク装置１４０及び１４１並びに磁気テープ装置１５０及び１５１に接続されており、複数の処理装置１００、１１０及び１２０は、磁気ディスク装置１４０及び１４１並びに磁気テープ装置１５０及び１５１を共有資源として共有している。
【００５１】
また、入出力プロセッサ１０３はチャネル間結合装置１６０を介して入出力プロセッサ１１４に、入出力プロセッサ１１３はチャネル間結合装置１６１を介して入出力プロセッサ１２４に、入出力プロセッサ１２３はチャネル間結合装置１６２を介して入出力プロセッサ１０４に接続されており、複数の処理装置１００、１１０及び１２０はマルチパス構成で相互に接続されている。
【００５２】
処理装置１００、１１０または１２０が他の処理装置と通信を行う場合には、チャネル間結合装置１６０、１６１または１６２を介して、入出力プロセッサ１０３及び１１４、入出力プロセッサ１１３及び１２４または入出力プロセッサ１２３及び１０４を使用して通信を行う。
【００５３】
本実施形態の複合コンピュータシステムでは、処理装置１００、１１０及び１２０の状態を管理するサービスプロセッサ１０７、１１７及び１２７と稼働監視装置１３０とを稼働監視用ネットワークであるＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）で接続することにより、稼働監視装置１３０が処理装置１００、１１０及び１２０の稼働情報・構成情報を一括して管理している。
【００５４】
以下に、本実施形態の複合コンピュータシステムにおいて、処理装置１００、１１０及び１２０のシステムが起動する場合や、処理装置１００、１１０及び１２０で動作するプログラムが起動する場合の稼働管理について説明する。
【００５５】
図２は、本実施形態の複合コンピュータシステムの起動時の稼働管理の概略を示す図である。図２において、２００、２１０及び２２０はオペレーティングシステム、２００１及び２１０１は構成管理手段、２００２及び２１０２はプログラム状態管理手段、２００３及び２１０３は稼働監視装置通信手段、２００４及び２１０４は他システム通信手段、２１１及び２２１はジョブ管理プログラム、２１２及び２２２は排他制御管理プログラム、２２３はデータベース管理プログラム、２３０は処理装置通信手段、２３１は接続状態監視手段、２３２は接続構成管理手段、２３３は稼働状態管理手段、２３４は構成情報・稼働状態管理テーブルである。
【００５６】
図２に示す様に、本実施形態の複合コンピュータシステムの起動時の稼働管理では、オペレーティングシステム２００、２１０及び２２０と、構成管理手段２００１及び２１０１と、プログラム状態管理手段２００２及び２１０２と、稼働監視装置通信手段２００３及び２１０３と、他システム通信手段２００４及び２１０４と、ジョブ管理プログラム２１１及び２２１と、排他制御管理プログラム２１２及び２２２と、データベース管理プログラム２２３と、処理装置通信手段２３０と、接続状態監視手段２３１と、接続構成管理手段２３２と、稼働状態管理手段２３３と、構成情報・稼働状態管理テーブル２３４とを使用している。
【００５７】
また、図２に示す様に、本実施形態の複合コンピュータシステムの起動時の稼働管理では、処理装置１００、１１０及び１２０のシステムが起動する場合や、処理装置１００、１１０及び１２０で動作するプログラムが起動する場合には、稼働監視装置１３０に起動通知を行い、構成情報・稼働状態管理テーブル２３４の内容を更新する。
【００５８】
本実施形態の複合コンピュータシステムの稼働監視装置１３０に格納されている構成情報・稼働状態管理テーブル２３４には、接続構成管理手段２３２によって管理されている稼働監視装置１３０に接続された処理装置１００、１１０及び１２０の物理アドレス、システム識別子、システム名称及び稼働状態が記録されており、構成情報・稼働状態管理テーブル２３４の稼働状態が「０」である場合には、その処理装置が非稼働中であることを示し、稼働状態が「１」である場合には、その処理装置が稼働中であることを示している。
【００５９】
以下に、本実施形態の複合コンピュータシステムにおいて、処理装置１００のシステムを起動したときの稼働管理について説明する。
【００６０】
本実施形態の複合コンピュータシステムにおいて、処理装置１００のシステムが起動すると処理装置１００のオペレーティングシステム２００は、処理装置１００のシステムが起動されたことを稼働監視装置１３０に通知する起動通知命令を発行し、構成管理手段２００１を経由して稼働監視装置通信手段２００３により稼働監視装置１３０に対して起動通知を行う。
【００６１】
稼働監視装置通信手段２００３によって送信された処理装置１００の起動通知は、稼働監視装置１３０の処理装置通信手段２３０によって受け付けられ、稼働監視装置１３０の稼働状態管理手段２３３は、前記受け付けた起動通知のパラメタを解析し、構成情報・稼働状態管理テーブル２３４の物理アドレス「０００１」、システム識別子「Ａ」及びシステム名称「ＳＹＳ１」に対応する処理装置１００の稼働状態を、非稼働中であることを示す「０」から稼働中であることを示す「１」に遷移させる。
【００６２】
稼働監視装置１３０は、処理装置１００の起動通知が正常に完了すると、稼働監視装置１３０の処理装置通信手段２３０により、起動通知を発行した処理装置１００に前記起動通知に対する応答を返す。
【００６３】
尚、本実施形態の複合コンピュータシステムにおいて、稼働監視装置１３０を比較的処理能力の低いコンピュータで構成し、比較的低速の非同期通信回線によって前記起動通知に対する応答を処理装置１００に返しても良い。
【００６４】
処理装置１００のオペレーティングシステム２００は、稼働監視装置通信手段２００３により受信した稼働監視装置１３０からの応答を構成管理手段２００１により解析し、構成情報・稼働状態管理テーブル２３４の処理装置１００の稼働状態を正常に更新できたかどうかを検知する。
【００６５】
同様にして、本実施形態の複合コンピュータシステムの処理装置１１０及び処理装置１２０のシステムを起動すると、図２に示す様に、稼働監視装置１３０に格納されている構成情報・稼働状態管理テーブル２３４には、物理アドレス「０００２」、システム識別子「Ｂ」、及びシステム名称「ＳＹＳ２」に対応する処理装置１１０の稼働状態と、物理アドレス「０００３」、システム識別子「Ｃ」、及びシステム名称「ＳＹＳ３」に対応する処理装置１２０の稼働状態が稼働中であることを示す「１」として記録される。
【００６６】
また、稼働監視装置１３０に格納されている構成情報・稼働状態管理テーブル２３４の、物理アドレス「０００４」、システム識別子「Ｄ」、及びシステム名称「ＳＹＳ４」に対応する処理装置は本実施形態の複合コンピュータシステムに未接続状態である為、その稼働状態は「０」で非稼働中であることを示している。
【００６７】
本実施形態の複合コンピュータシステムにおいて、各処理装置のオペレーティングシステムでプログラムを起動すると、前記起動されたプログラムからの通知によりオペレーティングシステムは、前記プログラムが稼働中であることを記録する。
【００６８】
各処理装置のオペレーティングシステムで稼働中のプログラムが、他の処理装置上のプログラムが起動されているかどうかを知りたい場合には、前記稼働中のプログラムのオペレーティングシステムの構成管理手段に指示し、他システム通信手段を経由して他の処理装置のオペレーティングシステムと通信することにより、他の処理装置上のプログラムが起動されているかどうかを検知することが可能である。
【００６９】
例えば、本実施形態の複合コンピュータシステムにおいて、処理装置１１０のオペレーティングシステム２１０上で稼働中の排他制御管理プログラム２１２が、他の処理装置である処理装置１００または処理装置１２０で排他制御管理プログラムが起動されているかどうかをチェックする処理は以下の様になる。
【００７０】
本実施形態の複合コンピュータシステムの処理装置１１０において、排他制御管理プログラム２１２を起動すると、起動された排他制御管理プログラム２１２は、オペレーティングシステム２１０の構成管理手段２１０１に対し、排他制御管理プログラム２１２が起動したことを通知する。
【００７１】
処理装置１１０のオペレーティングシステム２１０の構成管理手段２１０１は、プログラム状態管理手段２１０２により、排他制御管理プログラム２１２が稼働中であることを記録する。
【００７２】
また、他の処理装置である処理装置１００または処理装置１２０で排他制御管理プログラムを起動した場合にも同様な手順により、その処理装置のオペレーティングシステム上で排他制御管理プログラムが稼働中であることを記録する。
【００７３】
図２に示す様に、本実施形態の複合コンピュータシステムでは、処理装置１１０及び処理装置１２０において排他制御管理プログラム２１２及び排他制御管理プログラム２２２が起動されている。
【００７４】
ここで、処理装置１１０で実行中の排他制御管理プログラム２１２が、処理装置１２０上で排他制御管理プログラム２２２が稼働中であるかどうかを調べる為に、オペレーティングシステム２１０の構成管理手段２１０１に、処理装置１２０のプログラムの稼働状態のチェックを依頼する。
【００７５】
処理装置１１０のオペレーティングシステム２１０の構成管理手段２１０１は、他システム通信手段２１０４を介して処理装置１２０のオペレーティングシステム２２０の構成管理手段に問い合わせることにより、処理装置１２０で排他制御管理プログラム２２２が稼働中であることを検知する。
【００７６】
次に、本実施形態の複合コンピュータシステムにおいて、処理装置１００、１１０及び１２０のシステムを停止する場合や、処理装置１００、１１０及び１２０で動作中のプログラムを停止する場合の稼働管理について説明する。
【００７７】
図３は、本実施形態の複合コンピュータシステムの停止時の稼働管理の概略を示す図である。
【００７８】
図３に示す様に、本実施形態の複合コンピュータシステムの停止時の稼働管理では、稼働監視装置１３０の接続状態監視手段２３１と、サービスプロセッサ１０７、１１７及び１２７とが定期的に通信を行っており、処理装置１００、１１０または１２０のシステムを停止した場合には、停止したシステムに接続されているサービスプロセッサも停止し、稼働監視装置１３０が接続状態監視手段２３１により停止した処理装置のサービスプロセッサからの応答が無いことから、対応する処理装置のシステムの停止を検知する。
【００７９】
本実施形態の複合コンピュータシステムにおいて、処理装置１１０がシステム停止を行うと、稼働監視装置１３０が接続状態監視手段２３１により処理装置１１０のシステム停止を検知し、稼働状態管理手段２３３により構成情報・稼働状態管理テーブル２３４の処理装置１１０に対応する稼働状態を、稼働中であることを示す「１」から非稼働中であることを示す「０」に遷移させる。
【００８０】
これと同時に、稼働監視装置１３０は、この時稼働状態が「１」である処理装置１００及び処理装置１２０に対して、システム停止が発生したことを処理装置通信手段２３０により通知する。
【００８１】
処理装置１００のオペレーティングシステム２００の構成管理手段２００１は、稼働監視装置１３０からのシステム停止の発生を示す通知を検知したら、稼働監視装置１３０の構成情報・稼働状態管理テーブル２３４の内容を稼働監視装置通信手段２００３によって採取し、どの処理装置が停止したかを直ちに把握することが可能である。
【００８２】
また、本実施形態の複合コンピュータシステムにおいて、各処理装置のオペレーティングシステムで稼働中のプログラムを停止する場合には、前記停止するプログラムからの通知によりオペレーティングシステムは、前記プログラムの稼働状態を示す情報を稼働中から非稼働中に変更する。
【００８３】
各処理装置のオペレーティングシステムで稼働中のプログラムが、他の処理装置上のプログラムが停止しているかどうかを知りたい場合には、前記稼働中のプログラムのオペレーティングシステムの構成管理手段に指示し、他システム通信手段を経由して、プログラムの稼働状態を知りたい他の処理装置のオペレーティングシステムと通信することにより、他の処理装置上のプログラムが停止しているかどうかを検知することが可能である。
【００８４】
例えば、本実施形態の複合コンピュータシステムの処理装置１１０において、排他制御管理プログラム２１２を停止するときに、排他制御管理プログラム２１２は、オペレーティングシステム２１０の構成管理手段２１０１に対し、排他制御管理プログラム２１２を停止することを通知する。
【００８５】
処理装置１１０のオペレーティングシステム２１０の構成管理手段２１０１は、プログラム状態管理手段２１０２により、排他制御管理プログラム２１２の稼働状態を示す情報を稼働中から非稼働中に変更する。
【００８６】
また、他の処理装置である処理装置１００または処理装置１２０で排他制御管理プログラムを停止する場合にも同様な手順により、その処理装置のオペレーティングシステム上の排他制御管理プログラムの稼働状態を示す情報を稼働中から非稼働中に変更する。
【００８７】
図３に示す様に、本実施形態の複合コンピュータシステムでは、処理装置１００の排他制御管理プログラムは起動されていない。
【００８８】
ここで、処理装置１１０で実行中の排他制御管理プログラム２１２が、処理装置１００上で排他制御管理プログラムが稼働中であるかどうかを調べる為に、オペレーティングシステム２１０の構成管理手段２１０１に、処理装置１００のプログラムの稼働状態のチェックを依頼する。
【００８９】
処理装置１１０のオペレーティングシステム２１０の構成管理手段２１０１は、他システム通信手段２１０４を介して処理装置１００のオペレーティングシステム２００の構成管理手段２００１に問い合わせることにより、処理装置１００では排他制御管理プログラムが停止していることを検知する。
【００９０】
以下に、本実施形態の複合コンピュータシステムにおいて、複数の処理装置が排他制御管理プログラムを介して共有データをアクセスする際に発生した障害部位の特定を行う処理手順について説明する。
【００９１】
図４は、本実施形態の複合コンピュータシステムの障害部位を特定する処理の処理手順を示すフローチャートである。
【００９２】
本実施形態の複合コンピュータシステムにおいて、処理装置１００、１１０及び１２０は、各処理装置上の排他制御管理プログラムを介して磁気ディスク装置１４０上の共有データをアクセスする。
【００９３】
各処理装置上の排他制御管理プログラムは、マスター・スレーブ方式で排他制御を行うものとし、マスター側の排他制御管理プログラムは処理装置１１０に存在するものとする。
【００９４】
マスター・スレーブ方式の排他制御では、スレーブ側の処理装置上の排他制御管理プログラムは、磁気ディスク装置１４０上の共有データにアクセスする前に必ずマスター側の処理装置の排他制御管理プログラムに、磁気ディスク装置１４０上の共有データを使用する使用許可を得る。
【００９５】
例えば、処理装置１００が磁気ディスク装置１４０上の共有データにアクセスする場合には、磁気ディスク装置１４０上の共有データを使用しても良いかどうかを、チャネル間結合装置１６０を介して処理装置１１０の排他制御管理プログラム２１２に問い合わせる。
【００９６】
処理装置１１０の排他制御管理プログラム２１２は、処理装置１１０及び処理装置１２０で磁気ディスク装置１４０上の共有データを使用していないことを確認すると、処理装置１００に対しチャネル間結合装置１６０を介して磁気ディスク装置１４０上の共有データの使用許可を発行する。
【００９７】
処理装置１００では、処理装置１１０の排他制御管理プログラム２１２からの使用許可を受信した後に、磁気ディスク装置１４０上の共有データにアクセスする。
【００９８】
本実施形態の複合コンピュータシステムにおいて、処理装置１２０の排他制御管理プログラム２２２が、磁気ディスク装置１４０上の共有データを使用しても良いかどうかを処理装置１１０の排他制御管理プログラム２１２に問い合わせた後、処理装置１１０の排他制御管理プログラム２１２からの応答が、予め規定された特定の時間を経過しても受信されない場合には、その原因としてチャネル間結合装置、処理装置間を接続する通信経路及びチャネル装置の障害といった経路障害、並びに、処理装置１１０の排他制御管理プログラム２１２の異常終了及び処理装置１１０のシステム停止の何れかが想定される。
【００９９】
図４に示す様に、本実施形態の複合コンピュータシステムにおいて、処理装置１２０から磁気ディスク装置１４０上の共有データをアクセスしようとしたときに発生した障害部位を特定する処理では、まず、ステップ４０１の処理で、マスター側の排他制御管理プログラム２１２が存在する処理装置１１０への通信が、予め規定された特定の時間内に完了したかどうかを調べる。
【０１００】
処理装置１２０からマスター側の排他制御管理プログラム２１２が存在する処理装置１１０への通信が予め規定された特定の時間内に完了していない場合には、ステップ４０２の処理に進み、処理装置１２０のオペレーティングシステム２２０の構成管理手段は、稼働監視装置１３０に処理装置１１０のシステムが停止状態かどうかを問い合わせる。
【０１０１】
ステップ４０２の処理で、処理装置１２０のオペレーティングシステム２２０の構成管理手段は、稼働監視装置１３０の構成情報・稼働状態管理テーブル２３４の内容を稼働監視装置通信手段によって採取し、処理装置１１０のシステムが停止しているかどうかを調べる。
【０１０２】
処理装置１１０のシステムが停止している場合には、ステップ４０３の処理に進み、マスター側の処理装置を処理装置１１０から処理装置１２０に交代し、排他制御管理プログラム２２２をマスター側の排他制御管理プログラムに変更する。
【０１０３】
処理装置１１０のシステムが停止していない場合には、ステップ４０４の処理に進み、処理装置１２０のオペレーティングシステム２２０の構成管理手段は、マスター側である処理装置１１０の排他制御管理プログラム２１２の稼働状態を処理装置１１０の構成管理手段２１０１に問い合わせる。
【０１０４】
ステップ４０４の処理で、処理装置１２０のオペレーティングシステム２２０の構成管理手段は、他システム通信手段を介して処理装置１１０のオペレーティングシステム２１０の構成管理手段２１０１に問い合わせることにより、処理装置１１０で排他制御管理プログラム２１２が稼働中であるかどうかを調べる。
【０１０５】
処理装置１１０の排他制御管理プログラム２１２が停止している場合には、ステップ４０５の処理に進み、処理装置１１０上の排他制御管理プログラム２１２を再起動する。
【０１０６】
処理装置１１０の排他制御管理プログラム２１２が停止していない場合には、通信経路の障害が想定される為、ステップ４０６の処理に進み、予備の通信経路を交代パスとして再接続処理を行う。
【０１０７】
この様な処理手順により、従来オペレータの判断が必要であった複合コンピュータシステムの障害部位の特定を自動的に行うことが可能となる。
【０１０８】
以上説明した様に、本実施形態の複合コンピュータシステムによれば、複数の処理装置の稼働状態と前記複数の処理装置上のプログラムの稼働状態とを稼働監視用ネットワークを介して監視するので、障害が発生したときに早期に障害部位を特定し障害部位に対応する処理を行って長時間の無人運転の実現とユーザ負担の軽減を行うことが可能である。
【０１０９】
（実施形態２）
以下に、本発明の複合コンピュータシステムにおいて、複数の稼働監視装置によって複合コンピュータシステムの稼働監視を行う実施形態２の複合コンピュータシステムについて説明する。
【０１１０】
図５は、本実施形態の複合コンピュータシステムの稼働監視装置を二重化した場合の概略構成を示す図である。図５において、１０９は稼働監視装置二重化手段、１３０は正装置である稼働監視装置、１３１は副装置である稼働監視装置、２３５及び２４５は通信抑止手段、２３６及び２４６はコンソール間通信手段である。
【０１１１】
図５に示す様に、本実施形態の複合コンピュータシステムの稼働監視装置を二重化した場合では、稼働監視装置二重化手段１０９と、正装置である稼働監視装置１３０と、副装置である稼働監視装置１３１と、通信抑止手段２３５及び２４５と、コンソール間通信手段２３６及び２４６とを有している。
【０１１２】
また、図５に示す様に、本実施形態の複合コンピュータシステムでは、処理装置１００、１１０及び１２０の状態を管理するサービスプロセッサ１０７、１１７及び１２７と正装置である稼働監視装置１３０とを稼働監視用ネットワークである第１のＬＡＮで接続すると共に、サービスプロセッサ１０７、１１７及び１２７と副装置である稼働監視装置１３１とを稼働監視用ネットワークの第２のＬＡＮで接続している。
【０１１３】
また、本実施形態の複合コンピュータシステムの稼働監視装置１３０及び稼働監視装置１３１は、処理装置１００、１１０及び１２０との通信を行う処理装置通信手段２３０及び２４０の動作を抑止する通信抑止手段２３５及び２４５を備えており、また、稼働監視装置１３０と稼働監視装置１３１とはコンソール間通信手段２３６及びコンソール間通信手段２４６を介して接続されている。
【０１１４】
以下に、本実施形態の複合コンピュータシステムにおいて、稼働監視装置が二重化された場合に複数の処理装置の稼働状態を管理する処理について説明する。
【０１１５】
本実施形態の複合コンピュータシステムのサービスプロセッサ１０７は、稼働監視装置二重化手段１０９を備え、サービスプロセッサ１０７の稼働監視装置二重化手段１０９により、処理装置１００からの通知を二重化された稼働監視装置１３０及び１３１のそれぞれに通知する。
【０１１６】
二重化された稼働監視装置から処理装置１００、１１０及び１２０への通知は、正装置である稼働監視装置１３０から実行され、副装置である稼働監視装置１３１では、処理装置通信手段２４０の通信抑止手段２４５によって稼働監視装置１３１から処理装置１００、１１０及び１２０への通信が抑止されている。
【０１１７】
前記の様に、副装置である稼働監視装置１３１において処理装置通信手段２４０の通信抑止手段２４５によって稼働監視装置１３１から処理装置１００、１１０及び１２０への通信が抑止されることにより、処理装置１００、１１０または１２０のシステム停止を検知した場合に送られる通知が、稼働中の他の処理装置に二重に届けられることはない。
【０１１８】
本実施形態の複合コンピュータシステムにおいて、正装置である稼働監視装置１３０に障害が発生し、予め規定された特定の時間が経過しても正装置である稼働監視装置１３０からの応答が得られない状態となって、サービスプロセッサ１０７と稼働監視装置１３０との間の通信ができなくなった場合には、サービスプロセッサ１０７の稼働監視装置二重化手段１０９は、副装置である稼働監視装置１３１の処理装置通信手段２４０に備えられた通信抑止手段２４５の通信抑止状態を解除する。
【０１１９】
サービスプロセッサ１０７の稼働監視装置二重化手段１０９が、通信抑止手段２４５の通信抑止状態を解除することにより、副装置である稼働監視装置１３１は、コンソール間通信手段２４６により、正装置である稼働監視装置１３０に閉塞命令を発行する。
【０１２０】
正装置である稼働監視装置１３０のコンソール間通信手段２３６は、副装置である稼働監視装置１３１からの閉塞命令を受けると、処理装置通信手段２３０に備えられた通信抑止手段２３５により稼働監視装置１３０から処理装置１００、１１０及び１２０への通信を抑止する。
【０１２１】
この様にして、二重化された稼働監視装置１３０または１３１の一方の稼働監視装置が障害により使用できなくなっても、処理装置１００、１１０及び１２０側では何も意識する必要はなく、障害の発生していない他方の稼働監視装置によって処理装置１００、１１０及び１２０の稼働状態の監視を続行することが可能である。
【０１２２】
また、本実施形態の複合コンピュータシステムにおいて、上記以外の稼働監視装置を多重化する手段として、処理装置に稼働監視装置二重化手段に相当する手段を備え、処理装置側で多重化された稼働監視装置を管理したり、稼働監視装置内の各手段を多重化して複数の処理装置の稼働状態を監視しても良い。
【０１２３】
以上説明した様に、本実施形態の複合コンピュータシステムによれば、複数の稼働監視装置により複数の処理装置の稼働状態を監視するので、特定の稼働監視装置が障害により使用できなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【０１２４】
（実施形態３）
以下に、本発明の複合コンピュータシステムにおいて、処理装置相互による監視によって複合コンピュータシステムの稼働監視を行う実施形態３の複合コンピュータシステムについて説明する。
【０１２５】
本実施形態の複合コンピュータシステムにおいて、処理装置１００、１１０及び１２０上で稼働中のプログラムは、各処理装置を結ぶチャネル間結合装置１６０、１６１及び１６２を介して一定間隔で入出力命令を発行する。
【０１２６】
例えば、処理装置１００上で稼働中のプログラムは、処理装置１１０及び１２０上で稼働中のプログラムにある特定のデータを送信し、処理装置１１０及び１２０上で稼働中のプログラムは、前記特定のデータを受信したら、その応答として受信確認のデータを送信元の処理装置１００上で稼働中のプログラムに送り返す。
【０１２７】
この様なシーケンスで、処理装置１００、１１０及び１２０上で稼働中の各プログラムが、相互に特定のデータを送受信することによって、何らかの障害が発生した場合には予め規定された特定の時間を経過しても応答が受信されない為、無応答をもって相手の処理装置の異常とみなせる。
【０１２８】
前記の様に、複数の処理装置で稼働中のプログラムが相互に特定のデータを送受信する場合には、相互に特定のデータを送受信するプログラムの数が増加すると、その通信負荷が急速に増加することが考えられるが、本実施形態の複合コンピュータシステムでは、通常の障害検知は稼働監視装置により実現することが可能である為、前記の相互に特定のデータを送受信する頻度を少なくしても良い。
【０１２９】
従って、本実施形態の複合コンピュータシステムでは、複数の処理装置相互で特定のデータを送受信するオーバヘッドを削減して通常の通信に与える影響を少なくすると共に、稼働監視装置が障害の発生等により全面的に動作しなくなった場合であっても複数の処理装置の稼働状態の監視を続行することが可能である。
【０１３０】
以上説明した様に、本実施形態の複合コンピュータシステムによれば、複数の処理装置相互で特定のデータを送受信して他の処理装置の稼働状態を監視するので、稼働監視装置が全面的に動作しなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【０１３１】
以上、本発明を前記実施形態に基づき具体的に説明したが、本発明は、前記実施形態に限定されるものではなく、その要旨を逸脱しない範囲において種々変更可能であることは勿論である。
【０１３２】
例えば、排他制御専用のコンピュータに複数の処理装置を接続した複合コンピュータシステムでは、前記排他制御専用のコンピュータを稼働監視装置による稼働状態の監視の対象としても良い。
【０１３３】
また、仮想計算機上に複数の処理装置と稼働監視装置を仮想的に設定して複合コンピュータシステムを構成し、前記の仮想的な複数の処理装置の稼働状態を監視しても良い。
【０１３４】
【発明の効果】
本願において開示される発明のうち代表的なものによって得られる効果を簡単に説明すれば、下記のとおりである。
【０１３５】
（１）複数の処理装置の稼働状態と前記複数の処理装置上のプログラムの稼働状態とを稼働監視用ネットワークを介して監視するので、障害が発生したときに早期に障害部位を特定し障害部位に対応する処理を行って長時間の無人運転の実現とユーザ負担の軽減を行うことが可能である。
【０１３６】
（２）複数の稼働監視装置により複数の処理装置の稼働状態を監視するので、特定の稼働監視装置が障害により使用できなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【０１３７】
（３）複数の処理装置相互で特定のデータを送受信して他の処理装置の稼働状態を監視するので、稼働監視装置が全面的に動作しなくなった場合に複数の処理装置の稼働状態の監視を続行することが可能である。
【図面の簡単な説明】
【図１】実施形態１の複合コンピュータシステムの概略構成を示す図である。
【図２】実施形態１の複合コンピュータシステムの起動時の稼働管理の概略を示す図である。
【図３】実施形態１の複合コンピュータシステムの停止時の稼働管理の概略を示す図である。
【図４】実施形態１の複合コンピュータシステムの障害部位を特定する処理の処理手順を示すフローチャートである。
【図５】実施形態２の複合コンピュータシステムの稼働監視装置を二重化した場合の概略構成を示す図である。
【符号の説明】
１００、１１０及び１２０…処理装置、１０１、１０２、１１１、１１２、１２１及び１２２…命令プロセッサ、１０３、１０４、１１３、１１４、１２３及び１２４…入出力プロセッサ、１０５、１１５及び１２５…主記憶装置、１０６、１１６及び１２６…システム制御装置、１０７、１１７及び１２７…サービスプロセッサ、１０８、１１８及び１２８…コンソール、１０９…稼働監視装置二重化手段、１３０及び１３１…稼働監視装置、１４０及び１４１…磁気ディスク装置、１５０及び１５１…磁気テープ装置、１６０〜１６２…チャネル間結合装置、２００、２１０及び２２０…オペレーティングシステム、２００１及び２１０１…構成管理手段、２００２及び２１０２…プログラム状態管理手段、２００３及び２１０３…稼働監視装置通信手段、２００４及び２１０４…他システム通信手段、２１１及び２２１…ジョブ管理プログラム、２１２及び２２２…排他制御管理プログラム、２２３…データベース管理プログラム、２３０…処理装置通信手段、２３１…接続状態監視手段、２３２…接続構成管理手段、２３３…稼働状態管理手段、２３４…構成情報・稼働状態管理テーブル、２３５及び２４５…通信抑止手段、２３６及び２４６…コンソール間通信手段。
[0001]
[Technical field of the invention]
The present invention relates to a compound computer system in which a plurality of independently operating processing devices access a shared resource under exclusive control. More particularly, it relates to a technique that is effective when applied to a compound computer system that detects, at an early stage, a failure occurring while a plurality of processing devices access a shared resource under exclusive control, and performs alternative processing.
[0002]
[Prior art]
Conventionally, in a load-sharing, cooperative compound computer system in which a plurality of interconnected processing devices share resources such as magnetic disk devices and magnetic tape devices, input/output equipment such as an inter-channel coupling device that carries communication between the processing devices has been connected, and the processing devices have cooperated by communicating with one another through input/output instructions.
[0003]
However, in such a conventional compound computer system, if communication with the partner processing device becomes impossible because of a channel failure, a communication path failure, a system down, or a similar failure, exclusive processing of the shared resources cannot continue.
[0004]
Therefore, when non-response of the partner processing device was detected, a message indicating that a non-responding processing device had been detected was output to the operator, the failed part was identified by human judgment, and processing appropriate to the failure was performed so that business operations could continue.
[0005]
The response procedure upon failure detection in a conventional compound computer system is described under "Operator Action at MSCF Failure" in the manual "Program Product VOS3/AS System Operation - JSS3" (issued December 1994) published by Hitachi, Ltd.
[0006]
Furthermore, in conventional compound computer systems, a scheme has been adopted in which a memory for exclusive control that manages the shared resources is provided, so that the processing devices cooperate efficiently and the communication overhead between them is reduced.
[0007]
For example, in one such scheme, lock information for exclusive control is placed in a non-volatile control memory provided for each volume of a duplexed magnetic disk device, and a disk dual-write control program uses that lock information.
[0008]
In this disk dual-write control program, cooperation among the processing devices is achieved by using a function that, when one processing device updates the lock information, reports the update to the other processing devices as an asynchronous input/output interrupt.
[0009]
However, in the conventional compound computer system, if one processing device goes down while holding the lock information, accesses to the dual-write magnetic disk device by the other, normally operating processing devices cannot acquire the lock information, the input/output operations time out, and processing cannot continue.
[0010]
With the conventional technique in which exclusive control of shared resources is performed by communicating between processing devices through input/output equipment such as the inter-channel coupling device described above, and with the conventional technique in which exclusive control is performed by one processing device holding lock information, as in the disk dual-write control program, a processing device cannot determine the operating state of the other processing devices. Releasing the lock information when the processing device holding it fails therefore requires operator intervention.
[0011]
For this reason, the conventional compound computer system requires that a recovery procedure manual anticipating the possible combinations of failures be prepared in advance, which is a burden when operating the system.
[0012]
[Problems to be solved by the invention]
The present inventor has found the following problems as a result of studying the above-mentioned conventional technology.
[0013]
That is, in the conventional compound computer system, when non-response of a partner processing device is detected, a message indicating that a non-responding processing device has been detected is output to the operator, and the failed part is identified by human judgment before business operations continue. This increases the operational burden, for example by requiring operating procedures to be written for each message output, and makes long periods of unattended operation impossible.
[0014]
Further, in the disk dual-write control program of the conventional compound computer system, if one processing device goes down while holding the lock information, releasing the lock information requires operator intervention. A recovery procedure manual anticipating the possible combinations of failures must therefore be prepared in advance, which is an operational burden.
[0015]
An object of the present invention is to provide a technique that, when a failure occurs, identifies the failed part at an early stage and performs processing appropriate to it, thereby enabling long periods of unattended operation and reducing the burden on the user.
[0016]
It is another object of the present invention to provide a technique capable of continuing to monitor the operation states of a plurality of processing devices when a specific operation monitoring device cannot be used due to a failure.
[0017]
Another object of the present invention is to provide a technique capable of continuing to monitor the operation states of a plurality of processing devices when the operation monitoring device completely stops operating.
[0018]
The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.
[0019]
[Means for Solving the Problems]
The outline of typical inventions among the inventions disclosed by the present application will be briefly described as follows.
[0020]
(1) A compound computer system in which a plurality of processing devices are connected by communication means and access a specific shared resource under exclusive control comprises: an operation monitoring device that records the operating states of the plurality of processing devices when the processing devices start or stop; an operation monitoring network that connects the plurality of processing devices to the operation monitoring device; and program state management means that records the operating state of a program on the plurality of processing devices when the program starts or stops. When a failure occurs in the plurality of processing devices, the operating states of the processing devices recorded in the operation monitoring device and the operating state of the program recorded in the program state management means are acquired via the operation monitoring network, and the failed part is identified.
[0021]
In the compound computer system, the plurality of processing devices are connected by specific communication means such as an inter-channel coupling device, and by communicating with one another through that communication means they access specific shared resources, such as magnetic disk devices and magnetic tape devices, under exclusive control.
[0022]
The plurality of processing devices constituting the compound computer system are connected to the operation monitoring device via an operation monitoring network that is separate from the specific communication means used for exclusive control, such as the inter-channel coupling device. When the processing devices start or stop, their operating states are recorded in the operation monitoring device.
[0023]
Further, the operating system running on the plurality of processing devices of the compound computer system records the operating state of a program in the program state management means when the program starts or stops on those processing devices.
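The two kinds of record described above can be pictured with a minimal sketch. It assumes a simple in-memory flag per name; the class `StateRecord`, its method names, the system names, and the program name `exclusive_control_manager` are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class StateRecord:
    """Running/stopped flags keyed by name (hypothetical helper)."""
    running: dict = field(default_factory=dict)

    def notify_start(self, name: str) -> None:
        self.running[name] = True

    def notify_stop(self, name: str) -> None:
        self.running[name] = False

    def is_running(self, name: str) -> bool:
        # Anything that never reported a start is treated as not running.
        return self.running.get(name, False)


# One record held by the operation monitoring device, keyed by processing device.
device_states = StateRecord()
# One record per operating system, keyed by program name
# (stand-in for the program state management means).
program_states = {"SYS1": StateRecord(), "SYS2": StateRecord()}

device_states.notify_start("SYS1")
device_states.notify_start("SYS2")
program_states["SYS2"].notify_start("exclusive_control_manager")
device_states.notify_stop("SYS1")
```

Keeping the device record and the program records separate mirrors the text: the first lives on the operation monitoring device, the second inside each operating system.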
[0024]
In the compound computer system, when an attempt is made to access a specific shared resource such as a magnetic disk device or a magnetic tape device under exclusive control through the specific communication means for exclusive control, such as the inter-channel coupling device, a non-response state may be detected in which no response is obtained from a specific processing device even after a predetermined time has elapsed.
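As an illustration only, non-response detection of this kind can be sketched as a bounded wait on a reply channel. The sketch assumes replies arrive on a Python `queue.Queue`; the function name `wait_for_response`, the reply text, and the 0.05-second limit are hypothetical, and the specification does not state how the predetermined time is chosen.

```python
import queue


def wait_for_response(replies: "queue.Queue[str]", timeout_s: float):
    """Return the partner's reply, or None if none arrives within timeout_s."""
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        # No reply within the predetermined time: treat as non-response.
        return None


silent_partner = queue.Queue()       # nothing is ever put on this queue
answering_partner = queue.Queue()
answering_partner.put("use permitted")

no_reply = wait_for_response(silent_partner, timeout_s=0.05)
reply = wait_for_response(answering_partner, timeout_s=0.05)
```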
[0025]
When such a non-response state is detected, a failure is deemed to have occurred in the compound computer system. The operating states of the plurality of processing devices recorded in the operation monitoring device and the operating state of the program recorded in the program state management means are acquired via the operation monitoring network, and the operating state of the specific processing device is compared with the operating state of the program on that processing device to identify the failed part.
[0026]
That is, if the specific processing device is not operating, the failed part is deemed to be that processing device, and degraded operation is performed in which another processing device takes over the exclusive processing. If the specific processing device is operating, the operating state of the program on it is checked.
[0027]
If the program on the specific processing device is not running, the failed part is deemed to be that program, and the program is restarted. If the program is running, the specific communication means for exclusive control is deemed to be the failed part, and a spare communication path is selected to continue exclusive control.
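The decision procedure in the preceding paragraphs reduces to two checks made in a fixed order. The following sketch assumes the two operating states have already been obtained as booleans; the names `Fault`, `RECOVERY`, and `locate_fault` are hypothetical.

```python
from enum import Enum


class Fault(Enum):
    PROCESSING_DEVICE = "processing device"
    PROGRAM = "program"
    COMMUNICATION_PATH = "communication path"


# Recovery action that the text associates with each failed part.
RECOVERY = {
    Fault.PROCESSING_DEVICE: "degraded operation: another device takes over exclusive processing",
    Fault.PROGRAM: "restart the program on the specific processing device",
    Fault.COMMUNICATION_PATH: "select a spare communication path and continue exclusive control",
}


def locate_fault(device_running: bool, program_running: bool) -> Fault:
    """Identify the failed part after a non-response has been detected."""
    if not device_running:
        return Fault.PROCESSING_DEVICE
    if not program_running:
        return Fault.PROGRAM
    # Device and program are both up, so the path between them is suspected.
    return Fault.COMMUNICATION_PATH


fault = locate_fault(device_running=True, program_running=False)
action = RECOVERY[fault]
```

The device is checked before the program because a stopped device makes the program's recorded state irrelevant.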
[0028]
As described above, the compound computer system monitors the operating states of the plurality of processing devices and of the programs on them via the operation monitoring network. When a failure occurs, it can therefore identify the failed part at an early stage and perform processing appropriate to it, enabling long periods of unattended operation and reducing the burden on the user.
[0029]
(2) The compound computer system described in (1) comprises: a plurality of the operation monitoring devices; communication suppression means that suppresses communication to the plurality of processing devices from every operation monitoring device other than the one serving as the primary device; and a single operation monitoring device multiplexing means that, by controlling the communication suppression means, controls communication from the plurality of processing devices to the plurality of operation monitoring devices. The operation monitoring device multiplexing means transmits the operating states of the plurality of processing devices to the plurality of operation monitoring devices, while the communication suppression means suppresses communication to the processing devices from the operation monitoring devices other than the primary device, so that only the primary operation monitoring device monitors the operating states of the processing devices. When a failure occurs in the primary operation monitoring device, the operation monitoring device multiplexing means releases the communication suppression state of the communication suppression means of any one of the operation monitoring devices other than the failed one, and the operation monitoring device whose communication suppression state has been released continues monitoring the operating states of the plurality of processing devices.
[0030]
In the composite computer system, a plurality of processing devices and a plurality of operation monitoring devices are connected by an operation monitoring network, and each operation monitoring device includes communication suppression means that suppresses communication to the plurality of processing devices.
[0031]
In the composite computer system, the operation monitoring device multiplexing unit forwards each notification from the plurality of processing devices to every one of the plurality of operation monitoring devices.
[0032]
Meanwhile, among the plurality of operation monitoring devices, one specific operation monitoring device is designated the primary device and the others are designated secondary devices. The communication suppression means of each secondary device is set to the communication suppressed state, so that communication to the plurality of processing devices from any operation monitoring device other than the primary device is suppressed.
[0033]
Because the communication suppression means of each secondary device suppresses communication from that device to the plurality of processing devices, the notification sent when a system stop of a specific processing device is detected is not delivered in duplicate to the other operating processing devices.
[0034]
In the composite computer system, when a failure occurs in the primary operation monitoring device and no response is obtained from it even after a predetermined time has elapsed, so that communication between the plurality of processing devices and the primary device becomes impossible, the operation monitoring device multiplexing unit releases the communication suppression state of the communication suppression means of a specific one of the secondary operation monitoring devices.
[0035]
In this way, even if a specific one of the multiplexed operation monitoring devices becomes unusable due to a failure, the plurality of processing devices need not be aware of it, and another operation monitoring device can continue monitoring the operating states of the plurality of processing devices.
[0036]
As described above, according to the composite computer system, the operating states of the plurality of processing devices are monitored by a plurality of operation monitoring devices, so monitoring of the operating states can continue even if a specific operation monitoring device fails.
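The primary/secondary arrangement of (2) can be illustrated with a minimal Python sketch. The names `Monitor`, `Multiplexer`, `forward`, and `fail_over` are hypothetical and do not appear in this specification; the sketch assumes only that each monitoring device carries a suppression flag and that the multiplexing unit forwards every notification to all monitoring devices.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical stand-in for one operation monitoring device."""
    name: str
    suppressed: bool = True       # communication suppression state
    failed: bool = False
    received: list = field(default_factory=list)

    def receive(self, notification):
        # Every healthy monitor records what the processing devices report.
        if not self.failed:
            self.received.append(notification)

    def notify_processing_devices(self, message, outbox):
        # Only a healthy monitor whose suppression is released may send.
        if not self.failed and not self.suppressed:
            outbox.append((self.name, message))


class Multiplexer:
    """Hypothetical stand-in for the multiplexing unit."""

    def __init__(self, monitors):
        self.monitors = monitors
        self.monitors[0].suppressed = False   # assumed: first one is primary

    def forward(self, notification):
        for monitor in self.monitors:
            monitor.receive(notification)

    def fail_over(self):
        # Keep the active monitor if it is healthy; otherwise release
        # suppression on the first healthy monitor.
        active = next((m for m in self.monitors if not m.suppressed), None)
        if active is not None and not active.failed:
            return active
        if active is not None:
            active.suppressed = True
        for monitor in self.monitors:
            if not monitor.failed:
                monitor.suppressed = False
                return monitor
        return None


primary, secondary = Monitor("primary"), Monitor("secondary")
mux = Multiplexer([primary, secondary])
mux.forward("system stop detected")

outbox = []
for monitor in mux.monitors:
    monitor.notify_processing_devices("system stop detected", outbox)

primary.failed = True
new_active = mux.fail_over()
```

Both monitors hold the same information because every notification is forwarded to both, which is why the secondary can take over without any action by the processing devices.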
[0037]
(3) In the composite computer system according to (1) or (2), the plurality of processing devices mutually monitor one another's operating states by transmitting and receiving specific data among themselves via communication means that connects the plurality of processing devices.
[0038]
In the complex computer system, programs running on a plurality of processing devices issue input / output instructions at regular intervals via specific communication means such as an inter-channel coupling device that connects the processing devices.
[0039]
For example, a program running on a specific processing device transmits specific data to a program running on another processing device. On receiving the specific data, the program on the other processing device returns acknowledgment data as a response to the program on the originating processing device.
[0040]
When the programs running on the plurality of processing devices exchange specific data in this sequence, a program that receives no response even after a predetermined time has elapsed can judge that the other processing device is unresponsive and therefore abnormal.
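The timeout judgment described above can be sketched in Python as follows. The function name `find_unresponsive`, the peer labels, and the numeric values are hypothetical; the specification states only that a peer is judged abnormal when no acknowledgment arrives within a predetermined time.

```python
def find_unresponsive(last_ack, now, timeout):
    """Return the peers whose last acknowledgment is older than `timeout`.

    last_ack maps a peer label to the time (in seconds) at which its most
    recent acknowledgment was received.
    """
    return sorted(peer for peer, t in last_ack.items() if now - t > timeout)


# Assumed example: two peers, a 30-second timeout, current time 110 s.
last_ack = {"peer-1": 100.0, "peer-2": 40.0}
suspects = find_unresponsive(last_ack, now=110.0, timeout=30.0)
```

Here `peer-2` last answered 70 seconds ago and is reported, while `peer-1` answered 10 seconds ago and is not.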
[0041]
When programs running on a plurality of processing devices exchange specific data in this way, the communication load increases rapidly as the number of participating programs grows. In the composite computer system, however, failures are normally detected by the operation monitoring device, so the frequency at which the specific data is exchanged can be reduced.
[0042]
Therefore, in the composite computer system, the overhead of exchanging specific data among the plurality of processing devices is reduced, which lessens the impact on normal communication, and monitoring of the operating states of the plurality of processing devices can continue even when the operation monitoring devices have stopped operating entirely due to a failure or the like.
[0043]
As described above, according to the composite computer system, the plurality of processing devices exchange specific data to monitor one another's operating states, so monitoring of the operating states of the plurality of processing devices can continue even when the operation monitoring devices have stopped operating entirely.
[0044]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the drawings together with embodiments. In all the drawings for describing the embodiments, components having the same function are denoted by the same reference numerals, and a repeated description thereof will be omitted.
[0045]
(Embodiment 1)
The following describes a composite computer system according to Embodiment 1 of the present invention, which monitors a plurality of processing devices that access shared data on a magnetic disk device via an exclusive control management program.
[0046]
FIG. 1 is a diagram illustrating a schematic configuration of the composite computer system according to the present embodiment. In FIG. 1, 100, 110 and 120 are processing devices; 101, 102, 111, 112, 121 and 122 are instruction processors; 103, 104, 113, 114, 123 and 124 are input/output processors; 105, 115 and 125 are main storage devices; 106, 116 and 126 are system control devices; 107, 117 and 127 are service processors; 108, 118 and 128 are consoles; 130 is an operation monitoring device; 140 and 141 are magnetic disk devices; 150 and 151 are magnetic tape devices; and 160 to 162 are inter-channel coupling devices.
[0047]
As shown in FIG. 1, the composite computer system according to the present embodiment includes processing devices 100, 110, and 120; instruction processors 101, 102, 111, 112, 121, and 122; input/output processors 103, 104, 113, 114, 123, and 124; main storage devices 105, 115, and 125; system control devices 106, 116, and 126; service processors 107, 117, and 127; consoles 108, 118, and 128; an operation monitoring device 130; magnetic disk devices 140 and 141; magnetic tape devices 150 and 151; and inter-channel coupling devices 160 to 162.
[0048]
Also, as shown in FIG. 1, in the processing device 100 of the composite computer system of the present embodiment, the instruction processor 101, the instruction processor 102, the input/output processor 103, the input/output processor 104, and the main storage device 105 are connected to the system control device 106. A service processor 107 and a console 108, which are used to instruct the processing device 100 to start its system and to define its hardware configuration, are also connected.
[0049]
Likewise, in the processing device 110, the instruction processor 111, the instruction processor 112, the input/output processor 113, the input/output processor 114, and the main storage device 115 are connected to the system control device 116, and a service processor 117 and a console 118, which are used to instruct the processing device 110 to start its system and to define its hardware configuration, are connected. In the processing device 120, the instruction processor 121, the instruction processor 122, the input/output processor 123, the input/output processor 124, and the main storage device 125 are connected to the system control device 126, and a service processor 127 and a console 128, which are used to instruct the processing device 120 to start its system and to define its hardware configuration, are connected.
[0050]
The input/output processors 103, 104, 113, 114, 123 and 124 are connected to the magnetic disk devices 140 and 141 and the magnetic tape devices 150 and 151, and the processing devices 100, 110 and 120 share the magnetic disk devices 140 and 141 and the magnetic tape devices 150 and 151 as shared resources.
[0051]
The input/output processor 103 is connected to the input/output processor 114 via the inter-channel coupling device 160, the input/output processor 113 is connected to the input/output processor 124 via the inter-channel coupling device 161, and the input/output processor 123 is connected to the input/output processor 104 via the inter-channel coupling device 162, so that the processing devices 100, 110, and 120 are connected to one another in a multipath configuration.
[0052]
When the processing device 100, 110, or 120 communicates with another processing device, it does so via the inter-channel coupling device 160, 161, or 162, using the input/output processors 103 and 114, the input/output processors 113 and 124, or the input/output processors 123 and 104, respectively.
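The pairings in paragraphs [0051] and [0052] amount to a small lookup table, sketched below in Python. The reference numerals are taken from the text; the name `COUPLINGS` and the helper `coupling_between` are hypothetical.

```python
# Inter-channel coupling device -> the two (processing device, I/O processor)
# endpoints it joins, using the reference numerals given in the text.
COUPLINGS = {
    160: ((100, 103), (110, 114)),
    161: ((110, 113), (120, 124)),
    162: ((120, 123), (100, 104)),
}


def coupling_between(device_a, device_b):
    """Return the coupling device that directly joins two processing devices."""
    wanted = {device_a, device_b}
    for coupler, (end_1, end_2) in COUPLINGS.items():
        if {end_1[0], end_2[0]} == wanted:
            return coupler
    return None


direct = coupling_between(100, 110)
reverse = coupling_between(100, 120)
```

Each pair of processing devices has one direct coupling, and a second route exists through the third processing device, which is what makes the configuration multipath.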
[0053]
In the multifunction computer system of the present embodiment, the service processors 107, 117, and 127 that manage the states of the processing devices 100, 110, and 120 and the operation monitoring device 130 are connected by a LAN (Local Area Network) that is an operation monitoring network. Thus, the operation monitoring device 130 collectively manages the operation information and configuration information of the processing devices 100, 110, and 120.
[0054]
Hereinafter, operation management when the system of the processing devices 100, 110, and 120 is activated and when the programs that operate on the processing devices 100, 110, and 120 are activated in the multifunction computer system of the present embodiment will be described.
[0055]
FIG. 2 is a diagram showing an outline of operation management at the time of startup of the composite computer system of the present embodiment. In FIG. 2, reference numerals 200, 210, and 220 denote operating systems; 2001 and 2101 denote configuration management means; 2002 and 2102 denote program state management means; 2003 and 2103 denote operation monitoring device communication means; 2004 and 2104 denote other system communication means; 211 and 221 denote job management programs; 212 and 222 denote exclusive control management programs; 223 denotes a database management program; 230 denotes a processing device communication unit; 231 denotes a connection state monitoring unit; 232 denotes a connection configuration management unit; 233 denotes an operation state management unit; and 234 denotes a configuration information / operation state management table.
[0056]
As shown in FIG. 2, operation management at the time of startup of the composite computer system according to the present embodiment uses the operating systems 200, 210, and 220, the configuration management means 2001 and 2101, the program state management means 2002 and 2102, the operation monitoring device communication means 2003 and 2103, the other system communication means 2004 and 2104, the job management programs 211 and 221, the exclusive control management programs 212 and 222, the database management program 223, the processing device communication means 230, the connection state monitoring means 231, the connection configuration management means 232, the operation state management means 233, and the configuration information / operation state management table 234.
[0057]
As shown in FIG. 2, in operation management at the time of startup of the composite computer system according to the present embodiment, when the system of the processing device 100, 110, or 120 is started, or when a program operating on the processing device 100, 110, or 120 is started, the operation monitoring device 130 is notified of the start and the contents of the configuration information / operation state management table 234 are updated.
[0058]
The configuration information / operation state management table 234, stored in the operation monitoring device 130 of the composite computer system according to the present embodiment, records the physical address, system identifier, system name, and operating state of each of the processing devices 100, 110, and 120 connected to the operation monitoring device 130, which are managed by the connection configuration management unit 232. An operating state of “0” indicates that the processing device is not operating, and an operating state of “1” indicates that it is operating.
[0059]
Hereinafter, operation management when the system of the processing device 100 is started in the composite computer system of the present embodiment will be described.
[0060]
In the composite computer system according to the present embodiment, when the system of the processing device 100 is started, the operating system 200 of the processing device 100 issues, through the configuration management unit 2001, a start notification command for notifying the operation monitoring device 130 that the system of the processing device 100 has started, and the operation monitoring device communication unit 2003 sends the start notification to the operation monitoring device 130.
[0061]
The start notification of the processing device 100 transmitted by the operation monitoring device communication unit 2003 is received by the processing device communication unit 230 of the operation monitoring device 130. The operation state management unit 233 of the operation monitoring device 130 analyzes the parameters of the notification and changes the operating state of the processing device 100, which corresponds to the physical address “0001”, the system identifier “A”, and the system name “SYS1” in the configuration information / operation state management table 234, from “0”, indicating that it is not operating, to “1”, indicating that it is operating.
[0062]
When the activation notification of the processing device 100 is completed normally, the operation monitoring device 130 returns a response to the activation notification to the processing device 100 that has issued the activation notification by the processing device communication unit 230 of the operation monitoring device 130.
[0063]
In the composite computer system of the present embodiment, the operation monitoring device 130 may be constituted by a computer having a relatively low processing capacity, and a response to the start notification may be returned to the processing device 100 via a relatively low-speed asynchronous communication line.
[0064]
The operating system 200 of the processing device 100 uses the configuration management unit 2001 to analyze the response from the operation monitoring device 130 received by the operation monitoring device communication unit 2003, and thereby determines whether the operating state of the processing device 100 in the configuration information / operation state management table 234 was updated successfully.
[0065]
Similarly, when the systems of the processing device 110 and the processing device 120 of the composite computer system of the present embodiment are started, as shown in FIG. 2, the configuration information / operation state management table 234 stored in the operation monitoring device 130 records “1”, indicating operating, as the operating state of the processing device 110, which corresponds to the physical address “0002”, the system identifier “B”, and the system name “SYS2”, and as the operating state of the processing device 120, which corresponds to the physical address “0003”, the system identifier “C”, and the system name “SYS3”.
[0066]
Further, the processing device corresponding to the physical address “0004”, the system identifier “D”, and the system name “SYS4” in the configuration information / operation status management table 234 stored in the operation monitoring device 130 is a composite of this embodiment. Since it is not connected to the computer system, its operating state is "0", indicating that it is not operating.
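The table contents described in paragraphs [0058] to [0066] can be modeled with a short Python sketch. The four rows and the 0/1 encoding come from the text; the names `Entry`, `table`, and `handle_start_notification`, and the use of a boolean as the response, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    physical_address: str
    system_id: str
    system_name: str
    operating: int = 0    # 0 = not operating, 1 = operating


# One row per processing device known to the operation monitoring device.
table = {
    entry.physical_address: entry
    for entry in [
        Entry("0001", "A", "SYS1"),
        Entry("0002", "B", "SYS2"),
        Entry("0003", "C", "SYS3"),
        Entry("0004", "D", "SYS4"),   # not connected, so it stays at 0
    ]
}


def handle_start_notification(table, physical_address):
    """Mark the sender as operating; the return value models the response."""
    entry = table.get(physical_address)
    if entry is None:
        return False
    entry.operating = 1
    return True


# Processing devices 100, 110, and 120 report that their systems started.
responses = [handle_start_notification(table, a) for a in ("0001", "0002", "0003")]
running = [e.system_name for e in table.values() if e.operating == 1]
```

After the three notifications, `running` lists SYS1 to SYS3, and SYS4 remains at “0” because no start notification was received for it.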
[0067]
In the multifunction computer system according to the present embodiment, when a program is started by the operating system of each processing device, the operating system records that the program is running based on a notification from the started program.
[0068]
When a program running on the operating system of a processing device needs to know whether a program on another processing device has been started, it instructs the configuration management means of its own operating system, which communicates with the operating system of the other processing device via the other system communication means and thereby detects whether the program on the other processing device has been started.
[0069]
For example, in the composite computer system of the present embodiment, the exclusive control management program 212 running on the operating system 210 of the processing device 110 checks whether the exclusive control management program has been started on the other processing device 100 or 120 as follows.
[0070]
When the exclusive control management program 212 is started on the processing device 110 of the composite computer system according to the present embodiment, the started exclusive control management program 212 notifies the configuration management unit 2101 of the operating system 210 that the exclusive control management program 212 has started.
[0071]
The configuration management unit 2101 of the operating system 210 of the processing device 110 records that the exclusive control management program 212 is operating by the program state management unit 2102.
[0072]
Also, when the exclusive control management program is started on another processing device, such as the processing device 100 or the processing device 120, the same procedure is used to record, on the operating system of that processing device, that the exclusive control management program is operating.
[0073]
As shown in FIG. 2, in the composite computer system of the present embodiment, the exclusive control management program 212 and the exclusive control management program 222 are activated in the processing devices 110 and 120.
[0074]
Here, to check whether the exclusive control management program 222 is running on the processing device 120, the exclusive control management program 212 running on the processing device 110 requests the configuration management unit 2101 of the operating system 210 to check the operating status of the program on the processing device 120.
[0075]
The configuration management unit 2101 of the operating system 210 of the processing device 110 queries the configuration management unit of the operating system 220 of the processing device 120 via the other system communication unit 2104, and thereby detects that the exclusive control management program 222 is running on the processing device 120.
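The program state bookkeeping and cross-system query described in paragraphs [0067] to [0075] can be sketched in Python as follows. The class `ConfigManager` and its methods are hypothetical, and a direct method call stands in for the other system communication means. Treating a program that never reported a start as not running is also an assumption.

```python
class ConfigManager:
    """Hypothetical model of one operating system's configuration management."""

    def __init__(self, system_name):
        self.system_name = system_name
        self.program_state = {}    # program name -> True while operating
        self.peers = {}            # system name -> ConfigManager

    def connect(self, other):
        # Stands in for the other system communication means.
        self.peers[other.system_name] = other
        other.peers[self.system_name] = self

    def program_started(self, program):
        self.program_state[program] = True

    def program_stopped(self, program):
        self.program_state[program] = False

    def is_running(self, program):
        return self.program_state.get(program, False)

    def query_peer(self, system_name, program):
        # Ask the peer's configuration management for the program state.
        return self.peers[system_name].is_running(program)


cm_100, cm_110, cm_120 = (ConfigManager(n) for n in ("100", "110", "120"))
cm_110.connect(cm_100)
cm_110.connect(cm_120)

# The exclusive control management programs start on 110 and 120 only.
cm_110.program_started("exclusive control")
cm_120.program_started("exclusive control")

on_120 = cm_110.query_peer("120", "exclusive control")
on_100 = cm_110.query_peer("100", "exclusive control")
```

The same query also serves stop detection: once a program reports that it is stopping, `program_stopped` flips its state and later queries return `False`.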
[0076]
Next, operation management when the system of the processing devices 100, 110, and 120 is stopped or when the programs running on the processing devices 100, 110, and 120 are stopped in the composite computer system of the present embodiment will be described.
[0077]
FIG. 3 is a diagram illustrating an outline of operation management when the composite computer system according to the present embodiment is stopped.
[0078]
As shown in FIG. 3, in operation management when a system of the composite computer system according to the present embodiment is stopped, the connection state monitoring unit 231 of the operation monitoring device 130 and the service processors 107, 117, and 127 communicate with one another periodically. When the system of the processing device 100, 110, or 120 stops, the service processor connected to the stopped system also stops, so the connection state monitoring unit 231 of the operation monitoring device 130 no longer receives a response from that service processor and thereby detects that the system of the corresponding processing device has stopped.
[0079]
In the composite computer system according to the present embodiment, when the system of the processing device 110 stops, the operation monitoring device 130 detects the system stop of the processing device 110 through the connection state monitoring unit 231, and the operation state management unit 233 changes the operating state corresponding to the processing device 110 in the configuration information / operation state management table 234 from “1”, indicating that it is operating, to “0”, indicating that it is not operating.
[0080]
At the same time, the operation monitoring device 130 notifies the processing device 100 and the processing device 120 whose operation state is “1” at this time by the processing device communication unit 230 that the system stoppage has occurred.
[0081]
When the configuration management unit 2001 of the operating system 200 of the processing device 100 detects the notification from the operation monitoring device 130 indicating that a system stop has occurred, it collects the contents of the configuration information / operation state management table 234 of the operation monitoring device 130 through the operation monitoring device communication unit 2003 and can immediately determine which processing device has stopped.
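The stop detection and notification flow of paragraphs [0078] to [0081] can be sketched in Python as follows. The function names and the representation of the table as a dictionary keyed by system name are hypothetical; the sketch assumes that a missing reply to the periodic poll is the only stop indicator.

```python
def detect_stops(operating, responded):
    """Return systems marked operating that did not answer the periodic poll."""
    return [name for name, state in operating.items()
            if state == 1 and name not in responded]


def handle_stop(operating, stopped):
    """Set the stopped system to 0 and return the systems to be notified."""
    operating[stopped] = 0
    return [name for name, state in operating.items() if state == 1]


# Operating states as held in the table: 1 = operating, 0 = not operating.
operating = {"SYS1": 1, "SYS2": 1, "SYS3": 1, "SYS4": 0}

# The service processor of SYS2 (processing device 110) stops answering.
stopped = detect_stops(operating, responded={"SYS1", "SYS3"})
notified = handle_stop(operating, stopped[0])
```

Only systems whose state is still “1” are notified, so SYS4, which was never operating, receives nothing.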
[0082]
Further, in the composite computer system of the present embodiment, when a program running on the operating system of a processing device stops, the operating system changes the information indicating the operating state of that program from operating to not operating, based on a notification from the program being stopped.
[0083]
When a program running on the operating system of a processing device needs to know whether a program on another processing device has stopped, it instructs the configuration management means of its own operating system, which communicates via the other system communication means with the operating system of the processing device whose program state is of interest and thereby detects whether the program on that processing device has stopped.
[0084]
For example, when the exclusive control management program 212 is stopped on the processing device 110 of the composite computer system of the present embodiment, the exclusive control management program 212 notifies the configuration management unit 2101 of the operating system 210 that it is stopping.
[0085]
The configuration management unit 2101 of the operating system 210 of the processing device 110 uses the program status management unit 2102 to change the information indicating the operating status of the exclusive control management program 212 from active to inactive.
[0086]
Also, when the exclusive control management program is stopped on another processing device, such as the processing device 100 or the processing device 120, the information indicating the operating state of the exclusive control management program on the operating system of that processing device is likewise changed from operating to not operating.
[0087]
As shown in FIG. 3, in the composite computer system of the present embodiment, the exclusive control management program of the processing device 100 has not been started.
[0088]
Here, to check whether the exclusive control management program is running on the processing device 100, the exclusive control management program 212 running on the processing device 110 requests the configuration management unit 2101 of the operating system 210 to check the operating status of the program on the processing device 100.
[0089]
The configuration management unit 2101 of the operating system 210 of the processing device 110 queries the configuration management unit 2001 of the operating system 200 of the processing device 100 via the other system communication unit 2104, and thereby detects that the exclusive control management program is stopped on the processing device 100.
[0090]
In the following, a description will be given of a processing procedure for specifying a failure site that has occurred when a plurality of processing devices access shared data via the exclusive control management program in the multifunction computer system of the present embodiment.
[0091]
FIG. 4 is a flowchart illustrating a processing procedure of a process of specifying a failed part in the multifunction computer system according to the present embodiment.
[0092]
In the composite computer system of the present embodiment, the processing devices 100, 110, and 120 access the shared data on the magnetic disk device 140 via the exclusive control management program on each processing device.
[0093]
The exclusive control management program on each processing device performs exclusive control in a master-slave manner, and the master-side exclusive control management program exists in the processing device 110.
[0094]
In master-slave exclusive control, before accessing the shared data on the magnetic disk device 140, the exclusive control management program on a slave-side processing device always obtains permission to use the shared data on the magnetic disk device 140 from the exclusive control management program on the master-side processing device.
[0095]
For example, when the processing device 100 accesses shared data on the magnetic disk device 140, it inquires of the exclusive control management program 212 of the processing device 110, via the inter-channel coupling device 160, whether the shared data on the magnetic disk device 140 may be used.
[0096]
When the exclusive control management program 212 of the processing device 110 confirms that the shared data on the magnetic disk device 140 is not being used by the processing device 110 or the processing device 120, it issues permission to use the shared data on the magnetic disk device 140 to the processing device 100 via the inter-channel coupling device 160.
[0097]
The processing device 100 accesses the shared data on the magnetic disk device 140 after receiving the use permission from the exclusive control management program 212 of the processing device 110.
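The request and permission exchange of paragraphs [0094] to [0097] can be sketched in Python as follows. The class `MasterLock` is hypothetical, and the sketch assumes a single lock for the shared data, immediate refusal when the data is in use, and an explicit release step. The specification describes only the inquiry and the grant.

```python
class MasterLock:
    """Hypothetical master-side bookkeeping for one piece of shared data."""

    def __init__(self):
        self.holder = None    # processing device currently permitted, if any

    def request(self, requester):
        # Grant permission only when no other processing device holds it.
        if self.holder is None or self.holder == requester:
            self.holder = requester
            return True
        return False

    def release(self, requester):
        # Assumed step: the holder reports that it has finished.
        if self.holder == requester:
            self.holder = None


master = MasterLock()                 # would reside on processing device 110

first = master.request("100")         # shared data is free, so granted
second = master.request("120")        # in use by 100, so refused
master.release("100")
third = master.request("120")         # free again, so granted
```

Because every slave must pass through the master before touching the shared data, a master that stops answering blocks all slaves, which is the situation the failure identification procedure below addresses.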
[0098]
In the composite computer system of the present embodiment, suppose that the exclusive control management program 222 of the processing device 120 inquires of the exclusive control management program 212 of the processing device 110 whether the shared data on the magnetic disk device 140 may be used, and no response is received from the exclusive control management program 212 even after a predetermined time has elapsed. The possible causes are a path failure, such as a failure of an inter-channel coupling device, of a communication path connecting the processing devices, or of a channel device; an abnormal termination of the exclusive control management program 212 of the processing device 110; or a system stop of the processing device 110.
[0099]
As shown in FIG. 4, in the composite computer system of the present embodiment, the process of identifying the location of a failure that occurs when the processing device 120 attempts to access shared data on the magnetic disk device 140 begins at step 401, which determines whether communication with the processing device 110, where the master-side exclusive control management program 212 resides, has completed within a predetermined time.
[0100]
If the communication from the processing device 120 to the processing device 110, where the master-side exclusive control management program 212 resides, is not completed within the predetermined time, the process proceeds to step 402, where the configuration management means of the operating system 220 of the processing device 120 inquires of the operation monitoring device 130 whether the system of the processing device 110 is stopped.
[0101]
In the process of step 402, the configuration management unit of the operating system 220 of the processing device 120 collects the contents of the configuration information / operation state management table 234 of the operation monitoring device 130 through the operation monitoring device communication unit 2003 and checks whether the system of the processing device 110 is stopped.
[0102]
If the system of the processing device 110 is stopped, the process proceeds to step 403, where the master-side processing device is changed from the processing device 110 to the processing device 120 and the exclusive control management program 222 becomes the master-side exclusive control management program.
[0103]
If the system of the processing device 110 has not stopped, the process proceeds to step 404, where the configuration management unit of the operating system 220 of the processing device 120 sends the configuration management unit 2101 of the processing device 110 an inquiry about the operating status of the master-side exclusive control management program 212 of the processing device 110.
[0104]
In the process of step 404, the configuration management unit of the operating system 220 of the processing device 120 queries the configuration management unit 2101 of the operating system 210 of the processing device 110 via the other system communication unit and checks whether the exclusive control management program 212 is running on the processing device 110.
[0105]
If the exclusive control management program 212 of the processing device 110 is stopped, the process proceeds to step 405, and the exclusive control management program 212 of the processing device 110 is restarted.
[0106]
If the exclusive control management program 212 of the processing device 110 is not stopped, a failure in the communication path is assumed, so the process proceeds to step 406, and the reconnection processing is performed using the spare communication path as an alternate path.
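Steps 401 to 406 form a simple decision sequence, sketched below in Python. The function name `isolate_fault`, its boolean parameters, and the returned strings are hypothetical; in the system described, each condition would be obtained by the timeout check and the inquiries to the operation monitoring device 130 and the configuration management unit 2101.

```python
def isolate_fault(reply_in_time, master_system_stopped, master_program_running):
    """Map the outcome of the three checks to the recovery action taken."""
    if reply_in_time:                        # step 401: reply arrived in time
        return "no fault"
    if master_system_stopped:                # step 402: ask the monitoring device
        return "become master"               # step 403
    if not master_program_running:           # step 404: ask the master's OS
        return "restart master program"      # step 405
    return "reconnect via alternate path"    # step 406: path failure assumed


# The three failure causes named in the text, from the viewpoint of 120.
system_down = isolate_fault(False, True, False)
program_down = isolate_fault(False, False, False)
path_down = isolate_fault(False, False, True)
```

The order of the checks matters: a stopped system also implies a stopped program and an unreachable path, so the broadest cause is ruled out first.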
[0107]
According to such a processing procedure, it is possible to automatically specify a faulty part of the complex computer system, which has conventionally required the judgment of the operator.
[0108]
As described above, according to the composite computer system of the present embodiment, the operating states of the plurality of processing devices and of the programs running on them are monitored via the operation monitoring network. When a failure occurs, the faulty part can therefore be identified early and the corresponding processing performed, which enables long periods of unattended operation and reduces the burden on the user.
[0109]
(Embodiment 2)
In the following, a description will be given of a composite computer system according to a second embodiment in which a plurality of operation monitoring devices monitor the operation of the composite computer system in the composite computer system of the present invention.
[0110]
FIG. 5 is a diagram illustrating a schematic configuration in a case where the operation monitoring device of the composite computer system according to the present embodiment is duplicated. In FIG. 5, reference numeral 109 denotes operation monitoring device duplexing means, 130 denotes an operation monitoring device serving as the primary device, 131 denotes an operation monitoring device serving as the secondary device, 235 and 245 denote communication suppression means, and 236 and 246 denote inter-console communication means.
[0111]
As shown in FIG. 5, when the operation monitoring device of the composite computer system according to the present embodiment is duplicated, the system includes the operation monitoring device duplexing means 109, the operation monitoring device 130 serving as the primary device, the operation monitoring device 131 serving as the secondary device, the communication suppression means 235 and 245, and the inter-console communication means 236 and 246.
[0112]
Further, as shown in FIG. 5, in the composite computer system of the present embodiment, the service processors 107, 117, and 127, which manage the states of the processing devices 100, 110, and 120, are connected to the operation monitoring device 130, the primary device, by a first LAN of the operation monitoring network, and the service processors 107, 117, and 127 are connected to the operation monitoring device 131, the secondary device, by a second LAN of the operation monitoring network.
[0113]
In addition, the operation monitoring device 130 and the operation monitoring device 131 of the composite computer system according to the present embodiment include the communication suppression means 235 and 245, which suppress the operation of the processing device communication means 230 and 240 that communicate with the processing devices 100, 110, and 120. The operation monitoring device 130 and the operation monitoring device 131 are connected to each other via the inter-console communication means 236 and the inter-console communication means 246.
[0114]
The following describes the process of managing the operating states of the plurality of processing devices when the operation monitoring device is duplicated in the composite computer system of the present embodiment.
[0115]
The service processor 107 of the composite computer system according to the present embodiment includes the operation monitoring device duplexing means 109. The operation monitoring device duplexing means 109 duplicates each notification from the processing device 100 and sends it to both of the operation monitoring devices 130 and 131.
[0116]
Notifications from the duplicated operation monitoring devices to the processing devices 100, 110, and 120 are issued by the operation monitoring device 130, the primary device. In the operation monitoring device 131, the secondary device, the communication suppression means 245 of the processing device communication means 240 suppresses communication from the operation monitoring device 131 to the processing devices 100, 110, and 120.
[0117]
Because communication from the operation monitoring device 131 to the processing devices 100, 110, and 120 is suppressed in this way by the communication suppression means 245 of the processing device communication means 240 in the operation monitoring device 131, the secondary device, a notification originating from the processing device 100, 110, or 120 is not delivered twice to the other processing devices that are in operation.
[0118]
In the composite computer system according to the present embodiment, when a failure occurs in the operation monitoring device 130, the primary device, so that no response is obtained from it even after a predetermined time has elapsed and communication between the service processor 107 and the operation monitoring device 130 becomes impossible, the operation monitoring device duplexing means 109 of the service processor 107 releases the communication suppression state of the communication suppression means 245 provided in the processing device communication means 240 of the operation monitoring device 131, the secondary device.
[0119]
When the operation monitoring device duplexing means 109 of the service processor 107 releases the communication suppression state of the communication suppression means 245, the operation monitoring device 131, the secondary device, issues a blocking instruction through the inter-console communication means 246 to the operation monitoring device 130, the primary device.
[0120]
When the inter-console communication means 236 of the operation monitoring device 130, the primary device, receives the blocking instruction from the operation monitoring device 131, the secondary device, the communication suppression means 235 provided in the processing device communication means 230 suppresses communication from the operation monitoring device 130 to the processing devices 100, 110, and 120.
[0121]
In this way, even if one of the duplicated operation monitoring devices 130 and 131 becomes unusable due to a failure, the processing devices 100, 110, and 120 need not be aware of it, and the other operation monitoring device, in which no failure has occurred, can continue monitoring the operating states of the processing devices 100, 110, and 120.
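The switchover between the duplicated operation monitoring devices can be modeled roughly as below. This is a simplified sketch under stated assumptions: the class `Monitor` and the function `fail_over` are hypothetical names, the `suppressed` flag stands in for the communication suppression means 235 and 245, and the blocking instruction that the secondary device sends over the inter-console link is collapsed into a single assignment.

```python
class Monitor:
    """Hypothetical model of one operation monitoring device."""

    def __init__(self, name: str, suppressed: bool):
        self.name = name
        self.suppressed = suppressed  # True: outbound notifications are blocked
        self.alive = True
        self.sent = []  # notifications actually delivered to processing devices

    def notify(self, message: str) -> bool:
        """Send a notification to the processing devices if permitted."""
        if self.alive and not self.suppressed:
            self.sent.append(message)
            return True
        return False


def fail_over(primary: Monitor, secondary: Monitor) -> None:
    """Switch roles after the primary has not responded within the time limit."""
    # The duplexing means releases suppression on the secondary (245 in the text).
    secondary.suppressed = False
    # The secondary then blocks the primary's outbound path (235 in the text),
    # so a primary that recovers later cannot send duplicate notifications.
    primary.suppressed = True


if __name__ == "__main__":
    m130 = Monitor("130", suppressed=False)
    m131 = Monitor("131", suppressed=True)
    print(m130.notify("state update"), m131.notify("state update"))
    m130.alive = False
    fail_over(m130, m131)
    print(m130.notify("state update"), m131.notify("state update"))
```

At every point exactly one device is allowed to send, which is what keeps the processing devices unaware of the duplication.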
[0122]
Further, in the composite computer system of the present embodiment, the operation monitoring devices may be multiplexed by means other than those described above. For example, the processing devices may include means corresponding to the operation monitoring device duplexing means so that the multiplexed operation monitoring devices are managed on the processing device side, or each means within the operation monitoring device may be multiplexed to monitor the operating states of the plurality of processing devices.
[0123]
As described above, according to the composite computer system of the present embodiment, the operating states of the plurality of processing devices are monitored by a plurality of operation monitoring devices, so that monitoring of the operating states of the processing devices can continue even when a specific operation monitoring device becomes unusable due to a failure.
[0124]
(Embodiment 3)
The following describes a composite computer system according to a third embodiment of the present invention, in which the operation of the composite computer system is monitored by having the processing devices monitor one another.
[0125]
In the composite computer system of the present embodiment, the programs running on the processing devices 100, 110, and 120 issue input/output instructions at regular intervals via the inter-channel coupling devices 160, 161, and 162 that connect the processing devices.
[0126]
For example, the program running on the processing device 100 transmits specific data to the programs running on the processing devices 110 and 120. When the programs running on the processing devices 110 and 120 receive the data, they send acknowledgment data back as a response to the program running on the source processing device 100.
[0127]
In this sequence, the programs running on the processing devices 100, 110, and 120 transmit and receive the specific data to and from one another. If, because of some failure, no response is received even after a predetermined time has elapsed, the partner processing device that did not respond can be regarded as abnormal.
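The timeout judgment in this mutual monitoring can be sketched as follows. The sketch is illustrative only: the function name `find_unresponsive`, the dictionary of last acknowledgment times, and the numeric values are assumptions made for the example and do not appear in the embodiment, which exchanges the data through the inter-channel coupling devices.

```python
def find_unresponsive(last_ack: dict, now: float, timeout: float) -> list:
    """Return the partner devices whose last acknowledgment is older than timeout.

    last_ack maps a partner device identifier to the time (in seconds) at which
    its most recent acknowledgment was received.
    """
    return sorted(peer for peer, t in last_ack.items() if now - t > timeout)


if __name__ == "__main__":
    # Processing device 100 records when devices 110 and 120 last answered.
    last_ack = {"110": 10.0, "120": 2.0}
    suspects = find_unresponsive(last_ack, now=12.0, timeout=5.0)
    print("regarded as abnormal:", suspects)
```

A device is flagged only when the elapsed time strictly exceeds the limit, so an acknowledgment that arrives exactly at the limit still counts as a response.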
[0128]
When programs running on a plurality of processing devices exchange specific data with one another in this way, the communication load increases rapidly as the number of such programs grows. In the composite computer system of the present embodiment, however, ordinary failure detection is performed by the operation monitoring device, so the frequency with which the specific data is exchanged may be reduced.
[0129]
Therefore, in the composite computer system of the present embodiment, the overhead of transmitting and receiving the specific data among the plurality of processing devices can be kept low to limit its effect on normal communication, while monitoring of the operating states of the plurality of processing devices can continue even if the operation monitoring device stops completely due to a failure or the like.
[0130]
As described above, according to the composite computer system of the present embodiment, specific data is transmitted and received among the plurality of processing devices to monitor the operating state of the other processing devices, so that monitoring of the operating states of the plurality of processing devices can continue even when the operation monitoring device has stopped completely.
[0131]
As described above, the present invention has been specifically described based on the embodiment. However, the present invention is not limited to the embodiment, and it is needless to say that various changes can be made without departing from the gist of the invention.
[0132]
For example, in a composite computer system in which a plurality of processing devices are connected to a computer dedicated to exclusive control, the computer dedicated to exclusive control may be the target of operation status monitoring by an operation monitoring device.
[0133]
Further, a plurality of processing devices and an operation monitoring device may be set up virtually on a virtual machine to configure a composite computer system, and the operating states of the plurality of virtual processing devices may be monitored.
[0134]
[Effects of the invention]
The effects obtained by the typical inventions among the inventions disclosed in the present application will be briefly described as follows.
[0135]
(1) Since the operating states of the plurality of processing devices and the operating states of the programs on those processing devices are monitored via the operation monitoring network, a faulty part can be identified early when a failure occurs and processing corresponding to that part can be performed, which makes long-duration unattended operation possible and reduces the burden on the user.
[0136]
(2) Since the operating states of the plurality of processing devices are monitored by a plurality of operation monitoring devices, monitoring of the operating states of the plurality of processing devices can continue even when a specific operation monitoring device becomes unusable due to a failure.
[0137]
(3) Since specific data is transmitted and received among the plurality of processing devices to monitor the operating state of the other processing devices, monitoring of the operating states of the plurality of processing devices can continue even when the operation monitoring device has stopped completely.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a schematic configuration of a composite computer system according to a first embodiment.
FIG. 2 is a diagram illustrating an outline of operation management at the time of startup of the composite computer system according to the first embodiment.
FIG. 3 is a diagram illustrating an outline of operation management when the composite computer system according to the first embodiment is stopped.
FIG. 4 is a flowchart illustrating the procedure for identifying a faulty part in the composite computer system according to the first embodiment.
FIG. 5 is a diagram illustrating a schematic configuration in a case where the operation monitoring device of the composite computer system according to the second embodiment is duplicated.
[Explanation of symbols]
100, 110, 120: processing devices; 101, 102, 111, 112, 121, 122: instruction processors; 103, 104, 113, 114, 123, 124: input/output processors; 105, 115, 125: main storage devices; 106, 116, 126: system control units; 107, 117, 127: service processors; 108, 118, 128: consoles; 109: operation monitoring device duplexing means; 130, 131: operation monitoring devices; 140, 141: magnetic disk devices; 150, 151: magnetic tape devices; 160 to 162: inter-channel coupling devices; 200, 210, 220: operating systems; 2001, 2101: configuration management means; 2002, 2102: program state management means; 2003, 2103: operation monitoring device communication means; 2004, 2104: other-system communication means; 211, 221: job management programs; 212, 222: exclusive control management programs; 223: database management program; 230: processing device communication means; 231: connection state monitoring means; 232: connection configuration management means; 233: operating state management means; 234: configuration information / operating state management table; 235, 245: communication suppression means; 236, 246: inter-console communication means.

Claims

A composite computer system in which a plurality of processing devices are connected to each other by communication means and access a specific shared resource under exclusive control, the composite computer system comprising: an operation monitoring device that records the operating states of the plurality of processing devices when the plurality of processing devices are started or stopped; an operation monitoring network that connects the plurality of processing devices and the operation monitoring device; and program state management means that records the operating state of a program when each of the plurality of processing devices starts or stops its own program; wherein, when there is no response from any one of the processing devices within a predetermined time, the operating states of the plurality of processing devices recorded in the operation monitoring device and the operating state of the program recorded in the program state management means are obtained via the operation monitoring network and the faulty part is identified, and wherein the master-side processing device is replaced in the case of a system failure, the program is restarted in the case of a program stop, and reconnection processing is performed on a spare communication path in the case of a communication path failure.

The composite computer system according to claim 1, comprising: a plurality of operation monitoring devices; communication suppression means for suppressing communication from any operation monitoring device other than the operation monitoring device serving as the primary device to the plurality of processing devices; and operation monitoring device multiplexing means for controlling the communication suppression means and for controlling communication from the plurality of processing devices to the plurality of operation monitoring devices; wherein the operation monitoring device multiplexing means notifies the plurality of operation monitoring devices of the operating states of the plurality of processing devices; the communication suppression means suppresses communication from any operation monitoring device other than the primary device to the plurality of processing devices, so that only the operation monitoring device serving as the primary device monitors the operating states of the plurality of processing devices; and, when a failure occurs in the operation monitoring device serving as the primary device, the operation monitoring device multiplexing means releases the communication suppression state of the communication suppression means of one of the operation monitoring devices other than the failed operation monitoring device, and the operation monitoring device whose communication suppression state has been released continues monitoring the operating states of the plurality of processing devices.

The composite computer system according to claim 1 or claim 2, wherein the plurality of processing devices mutually monitor their operating states by transmitting and receiving specific data among the plurality of processing devices via the communication means connecting the plurality of processing devices.