JP2004213178A

JP2004213178A - Computer system

Info

Publication number: JP2004213178A
Application number: JP2002379727A
Authority: JP
Inventors: Takanori Kono; 貴憲河野; Masaru Koyanagi; 勝小柳; Takayuki Abe; 孝之阿部; Hirobumi Fujita; 博文藤田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-12-27
Filing date: 2002-12-27
Publication date: 2004-07-29

Abstract

PROBLEM TO BE SOLVED: To provide a computer system in which, when a failure occurs in a resource shared by a plurality of OSs, the maintenance software on each OS is prevented from making redundant reports; in which maintenance software on another OS can be made to report a failure, and malfunction of the maintenance software or creation of an error log is prevented; and in which a failure is reliably reported even when the OS that is the notification destination has crashed or is otherwise not operating normally.

SOLUTION: A single computer system capable of running a plurality of OSs simultaneously is provided with failure detection means 152 for detecting a failure, failure notification destination determination means 155 for determining the set of OSs to be notified of the failure, failure notification means 151 for notifying each OS in that set of the failure, and OS monitoring means 150 for monitoring the operation of the notification destination OSs. When an OS operates abnormally, the notification destination of the failure is switched to another OS.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、複数のオペレーティングシステム（以下、ＯＳと称する）の稼動する計算機システムにおいて発生する障害をＯＳに通知する技術、および障害を通知されたＯＳがユーザに障害を通報する技術に適用して有効な技術に関する。
【０００２】
【従来の技術】
本発明者が検討したところによれば、計算機システムに関しては、以下のような技術が考えられる。
【０００３】
例えば、計算機システムで障害が発生した場合、迅速な保守作業による対応が望まれる。障害発生時にユーザに障害の発生を報告するため、ＯＳ上で実行される障害通報機能を備えた保守用ソフトウェアが広く使用されている。
【０００４】
通常、計算機システムにおける障害が発生した場合、障害の種類や障害の発生した部品、障害の発生した時刻などの情報を含む障害ログが主記憶等の記憶媒体内の障害ログ保存部に記録され、ＯＳがこの障害ログ保存部から障害ログを読み出すことによってＯＳに障害が通知される。このように通知された障害は前記保守用ソフトウェアにより、コンソール表示や電子メールの送信などの手段でユーザに通報される。
【０００５】
一方、１台の計算機上で複数のＯＳを実行する技術が知られている。例えば、１台の計算機システム上で複数の仮想的な計算機を同時に走行させるシステムとして、仮想計算機システムがある。仮想計算機システムでは、複数のＯＳを制御するプログラムであるハイパバイザ（ホストＯＳとも呼ばれる）が走行することにより、複数のＯＳのスケジューリング、割り込みのディスパッチ、命令シミュレーションなどの制御を行い、複数のＯＳの同時実行を可能としている（例えば、特許文献１参照）。
【０００６】
また、仮想計算機システムにおけるホストＯＳをハードウエア機構のように提供し、実計算機を論理的に分割しているようにユーザから見える論理分割システムがある。論理分割システムには、ゲストに割り当てられたシステム資源に対するゲストの動作を制限する方法がある（例えば、特許文献２参照）。また、論理プロセッサ設備を備えたデータプロセッシングシステム内で論理システムの起動を制御する装置がある（例えば、特許文献３参照）。
【０００７】
また、従来、複数のＯＳの稼動する計算機システムにおいて、障害の発生をユーザに通報する手段として以下の三つの方法が用いられていた。
【０００８】
第一の方法は、計算機システムに組み込まれたサービスプロセッサ上で動作するソフトウェアが障害をユーザに通報する方法である。
【０００９】
第二の方法は、仮想計算機を制御するソフトウェアが障害をユーザに通報する方法である。
【００１０】
第三の方法は、前記仮想計算機が自身の使用する資源の障害をＯＳに通知し、ＯＳ上で動作する保守用ソフトウェアが障害を通報する方法である。
【００１１】
上記第一および第二の方法では、障害の通報に用いるコンソールやネットワークインタフェースカードなどのデバイスをＯＳの使用可能なデバイスとは別に用意する必要があり、計算機システムの価格の上昇を招く。
【００１２】
また、これらのデバイスを制御するプログラムを前記サービスプロセッサ上で動作するソフトウェアまたは前記仮想計算機を制御するソフトウェアの中に持つ必要があるが、このプログラムの開発コストおよび保守コストのため、やはり計算機システムの価格の上昇を招く。
【００１３】
従って、安価に障害通報機能を提供するためには、上記第三の方法を採用することが望ましい。
【００１４】
【特許文献１】
特公昭６１ー２２８２５号公報
【００１５】
【特許文献２】
特公平６ー７３１０８号公報
【００１６】
【特許文献３】
特許第３０９０４５２号公報
【００１７】
【発明が解決しようとする課題】
ところで、前記のような計算機システムの技術について、本発明者が検討した結果、上記第三の方法には以下のような問題点があることが明らかとなった。
【００１８】
まず、複数のＯＳが共通に使用する資源に障害が発生した場合、複数のＯＳ上で動作する保守用ソフトウェアのそれぞれが障害を通報することになる。このように１回の障害発生に対して複数の通報がなされることは、保守作業を非常に煩雑にする。たとえば、必要な交換部品の数を決定するため、複数の通報の内容を検査し、同一の部品での障害を示している冗長な通報を除去する必要が生じる。
【００１９】
また、ＯＳ毎に異なる保守サービス会社と保守契約を結び、それぞれのＯＳで障害通報の届く保守拠点が異なる場合では、たとえば１つの部品の交換で対策可能な障害に対しても複数の保守拠点に通報が届けられることになり、無駄な保守作業の発生を招く可能性がある。
【００２０】
これらの問題は、複数のＯＳ上で動作する保守用ソフトウェアが連携して動作する機能を備え、無駄な通報を行わないようにすることによって回避可能であるが、全ての保守用ソフトウェアがそのような機能を持っているわけではない。
【００２１】
また、上記第三の方法では、あるＯＳがクラッシュした場合に、このＯＳのみに通知されるように設定されていた障害の通知が以後なされなくなるという問題がある。
【００２２】
また、上記第三の方法では、あるＯＳ上で実行される保守用ソフトウェアがこのＯＳに通知される障害に対応していない場合に、この障害に対して通報がなされないという問題がある。あるＯＳ上で保守用ソフトウェアが実行されていない場合も同様の問題が生じる。また、保守用ソフトウェアが対応していない障害を保守用ソフトウェアが受け取った場合、この保守用ソフトウェアが誤動作したり、エラーログを生成したりする可能性があり、システムの運用上問題となる。
【００２３】
そこで、本発明の目的は、複数のＯＳが共通に使用する資源に障害が発生した場合に、各ＯＳ上で動作する保守用ソフトウェアが冗長な通報を行うことを防止できる計算機システムを提供することにある。
【００２４】
さらに、本発明の他の目的は、あるＯＳ上で実行される保守用ソフトウェアが、このＯＳに通知される障害の一部に対応していない場合にも、他のＯＳ上の保守用ソフトウェアに障害通報を行わせることを可能とし、また保守用ソフトウェアの誤動作やエラーログが作成されることを防止できる計算機システムを提供することにある。
【００２５】
また、本発明のさらに他の目的は、障害の通知先となっているＯＳがクラッシュするなどして正常に動作していない場合でも、障害発生時の確実な通報が可能となる計算機システムを提供することにある。
【００２６】
【課題を解決するための手段】
本発明は、複数のＯＳを同時に動作させることのできる単一の計算機システムに適用され、以下のような特徴を有するものである。
【００２７】
（１）計算機システムで発生する障害を検出する障害検出手段と、障害検出手段により検出された障害に対してその通知先となるＯＳの集合を決定する障害通知先決定手段と、障害通知先決定手段により決定されたＯＳの集合の各々に対し障害検出手段により検出された障害を通知する障害通知手段と、障害検出手段が検出しうる障害の少なくとも１つに対しそれを通知すべきＯＳの集合をユーザが設定できる障害通知先設定手段とを備え、障害通知先決定手段が、障害通知先設定手段による設定に従って障害の通知先となるＯＳの集合を決定するものである。
【００２８】
（２）計算機システム上で稼動するＯＳの動作を監視するＯＳ監視手段と、あるＯＳに通知されるように設定されていた障害を他のＯＳに通知するように設定を変更する障害通知先切り替え手段とを備え、ＯＳ監視手段があるＯＳの動作の異常を検出した場合に、障害通知先切り替え手段によりあるＯＳに通知されるように設定されていた障害の通知先を他のＯＳに切り替えるものである。
【００２９】
（３）前記（１）と（２）を組み合わせたものである。
【００３０】
【発明の実施の形態】
以下、図面を用いて、本発明の実施の形態を詳細に説明する。
【００３１】
まず、図１により、本発明の一実施の形態である計算機システムの構成の一例を説明する。図１は本実施の形態である計算機システムを示す構成図である。
【００３２】
本実施の形態の計算機システムは、筐体１００の中に、システムバス１１０を介して接続されたプロセッサ１１１，１１２、主記憶１１４および温度センサ１１３、不揮発性メモリ１１５、ＩＯアダプタ１１６、タイマ１１７を有し、さらにＩＯアダプタ１１６には磁気ディスク装置１７０とコンソール装置１７１が接続されている。
【００３３】
温度センサ１１３は、筐体１００内の温度を測定し、測定値が一定の範囲から外れた場合、プロセッサ１１１またはプロセッサ１１２に対し温度障害を通知するための割り込みを発生する機能を有する。また、プロセッサ１１１，１１２は、キャッシュメモリを有し、自身のキャッシュメモリに１ビット障害が発生した場合にそれを自動的に訂正し、さらにキャッシュ障害が発生して訂正されたことをＯＳに通知するため、自身に対する割り込みを発生する機能を有する。
【００３４】
主記憶１１４には、障害検出手段１５２が配置されており、プロセッサ１１１または１１２が、前記の温度センサ１１３またはプロセッサ１１１，１１２による割り込みを受け取った時に障害検出手段１５２が実行される。障害検出手段１５２は、障害ログを作成した後、この障害ログを不揮発性メモリ１１５内の障害ログ保存部１２０に記録し、復帰するプログラムである。
【００３５】
タイマ１１７は、一定の時間毎にプロセッサ１１１または１１２に対する割り込みを発生させる機能を有し、プロセッサ１１１または１１２が前記割り込みを受け取ると、主記憶１１４内に配置されたハイパバイザ１９０が実行される。
【００３６】
また、主記憶１１４内には、ＯＳ１３０、ＯＳ１３１、ＯＳ１３２の３つのＯＳが配置されており、ハイパバイザ１９０の制御によりＯＳ１３０およびＯＳ１３２はプロセッサ１１１上で動作し、ＯＳ１３１はプロセッサ１１１およびプロセッサ１１２の２つのプロセッサ上で動作しているものとする。
【００３７】
また、ＯＳ１３０，１３１，１３２では、それぞれ障害発生時に通報を行う保守用ソフトウェア１８０，１８１，１８２が実行されている。
【００３８】
また、主記憶１１４内には、ＯＳ監視手段１５０、障害通知手段１５１、障害検出手段１５２、障害通知先設定手段１５３、障害通知先切り替え手段１５４、障害通知先決定手段１５５が配置されている。これらは、プロセッサ１１１またはプロセッサ１１２で実行されるプログラムである。
【００３９】
また、主記憶１１４内には、それぞれＯＳ１３０，１３１，１３２に通知されるログを保存する領域である個別障害ログ保存部１４０，１４１，１４２が設けられている。
【００４０】
さらに、主記憶１１４内には、障害検出手段１５２により検出される障害と、この障害を通知されるＯＳの集合との対応を保存する領域である障害通知先ＯＳ設定保存部１６０が設けられている。
【００４１】
続いて、図２により、本実施の形態の計算機システムにおいて、障害をＯＳに通知する処理の動作の概要を説明する。図２は本実施の形態の計算機システムにおける各要素がどのように連携して障害通知処理を行うかの概要を示す図である。
【００４２】
障害の通知処理は、障害検出手段１５２が割り込みにより起動されて障害ログを作成し、障害ログ保存部１２０に記録することによって始まる（ステップ２０１）。次に、ハイパバイザ１９０によって障害通知手段１５１が呼び出され、障害ログ保存部１２０から前記障害ログの取得を行う（ステップ２０２）。
【００４３】
さらに、障害通知手段１５１は、障害を通知すべきＯＳの集合を得るため、この障害ログの一部を引数として障害通知先決定手段１５５を呼び出す（ステップ２０３）。障害通知先ＯＳ設定保存部１６０には、どの障害をどのＯＳに通知すべきかを表す表が記録されており、障害通知先決定手段１５５は渡された障害ログの一部を用いて表を検索し、障害を通知すべきＯＳの集合を返り値として返す（ステップ２０４）。
【００４４】
次に、障害通知手段１５１は、前記障害を通知すべきＯＳの集合内の各ＯＳ１３０，１３１，１３２に対応する個別障害ログ保存部１４０，１４１，１４２に、前記障害ログのコピーを書き込む（ステップ２０５）。この後、障害通知手段１５１はハイパバイザ１９０に制御を返す。
【００４５】
以上により、個別障害ログ保存部１４０，１４１，１４２に記録された障害ログは、保守用ソフトウェア１８０，１８１，１８２からのポーリングによってＯＳ１３０，１３１，１３２に読み出され、このようにして障害の発生がＯＳに通知される（ステップ２０６）。保守用ソフトウェア１８０，１８１，１８２は、それぞれＯＳ１３０，１３１，１３２が提供するインタフェースを使用してＯＳが読み出した障害ログを取得し、コンソール１７１にログの内容を表示する。
【００４６】
また、前記障害通知先ＯＳ設定保存部１６０内の表は、障害通知先設定手段１５３、障害通知先切り替え手段１５４によって書き換えられ、これによって障害通知先ＯＳの設定が変更される。
【００４７】
例えば、障害通知先設定手段１５３は、ユーザによる入力を受け取り、この入力に基づいて前記障害通知先ＯＳ設定保存部１６０内の表を書き換える（ステップ２１１，２１２）。これによって、ユーザが障害通知先ＯＳを設定することが可能となる。
【００４８】
また、障害通知先切り替え手段１５４は、ＯＳの動作を監視するＯＳ監視手段１５０があるＯＳの動作に異常を検出した場合に呼び出され、このＯＳに通知するように設定されていた障害を、このＯＳに通知しないように前記障害通知先ＯＳ設定保存部１６０内の表を書き換えることで障害通知先ＯＳの切り替えを行う（ステップ２２１，２２２）。
【００４９】
本実施の形態では、ＯＳ監視手段１５０はポーリングによる個別障害ログ保存部１４０，１４１，１４２からの読み出しが一定時間行われない場合に、この個別障害ログ保存部に対応するＯＳの動作に異常が発生したと判定し、上記の手順に従って切り替えを行う。
【００５０】
続いて、具体的にどのようなステップで、以上に説明した処理が実行されるかを詳細に説明する。
【００５１】
最初に、図３および図４により、障害検出手段１５２による障害の検出について説明する。図３は障害検出手段１５２の処理を示す流れ図である。図４は障害ログ保存部１２０に保存される障害ログのフォーマットを示す図である。
【００５２】
一般に、計算機システムでは割り込みの発生時に割り込みの種類ごとに実行されるプログラム（割り込みハンドラ）を設定可能であり、障害検出手段１５２は温度センサ１１３またはプロセッサ１１１，１１２からの割り込みの発生時に実行されるプログラムであるものとする。
【００５３】
また一般に、割り込みハンドラを実行しているプロセッサは割り込みの種類を示す値をレジスタに保持するが、本実施の形態でもレジスタの値によって割り込みの種類を判別可能であるとする。
【００５４】
障害検出手段１５２は、まずステップ３００で、前記レジスタの値から障害の種類を判定する。前記レジスタの値が温度センサ１１３からの温度障害による割り込みを示している場合には、ステップ３０１で、温度障害の障害ログを作成する。また、前記レジスタの値がプロセッサのキャッシュメモリ障害による割り込みを示している場合には、ステップ３０２で、プロセッサ障害の障害ログを作成する。
【００５５】
その後、ステップ３０３で、作成された障害ログを障害ログ保存部１２０に保存し、復帰する。障害ログ保存部１２０に保存される障害ログのフォーマットを図４に示す。障害ログは、障害の種類を表す障害ＩＤ、障害ログを書き込んだ時点での時刻を示す時刻印および温度センサの値やエラーの発生したキャッシュメモリのアドレスなど障害の内容を示す障害データからなっている。本実施の形態では、温度障害の障害ＩＤは１、プロセッサ１１１の障害ＩＤは２、プロセッサ１１２の障害ＩＤは３である。
【００５６】
次に、図５により、検出された障害がどのようにＯＳに通知されるかを説明する。図５はタイマ割り込みの発生時に実行されるハイパバイザ１９０の処理を示す流れ図である。
【００５７】
タイマ割り込み発生後、ハイパバイザ１９０は、ステップ５００で、ＯＳ監視手段１５０を呼び出し、その後、ステップ５０１で、障害通知手段１５１を呼び出し、その後、ステップ５０２で、ＯＳのスケジューリングを行い、次に実行するＯＳを決定後、ステップ５０３で、このＯＳに制御を渡す。タイマ割り込み毎に、以上に述べた処理が実行される。
【００５８】
次に、図６により、ＯＳ監視手段１５０の処理を説明する。図６はＯＳ監視手段１５０の処理を示す流れ図である。
【００５９】
ＯＳ監視手段１５０は、自身の呼ばれた回数をカウントする内部変数ｃと、ｃが０であった時点に個別障害ログ保存部１４０，１４１，１４２がプロセッサによってリードされた回数が何回であったかをそれぞれ保持する内部変数ｐ０，ｐ１，ｐ２を持つ。
【００６０】
ＯＳ監視手段１５０は、まずステップ６００で、ｃの値を１増やし、ステップ６０１で、ｃの値が１０００に達したか否かを検査する。本実施の形態では、ｃの値が１０００に達している場合には、ｃが０であった時点から十分な時間が経過しており、ＯＳが正常に動作している場合にはその間に個別障害ログ保存部１４０，１４１，１４２はそれぞれＯＳ１３０，１３１，１３２によって最低一度は読み込まれると仮定できるとする。
【００６１】
ステップ６０１で、ｃの値が１０００に達していないと判定された場合はそのまま復帰する。ｃの値が１０００に達したと判定された場合は、ステップ６０３で、個別障害ログ保存部１４０がプロセッサにリードされた回数を求め、この回数がｐ０の値よりも大きいことを検査する。
【００６２】
ステップ６０３の判定結果が偽の場合は、ＯＳ１３０が正常に動作していないと判断されるので、ステップ６０４で、障害通知先切り替え手段１５４を呼び出し、ＯＳ１３０に通知されるように設定されていた障害の通知先を他のＯＳに切り替える。
【００６３】
ステップ６０５〜６０８では、個別障害ログ保存部１４１，１４２について同様の処理を行う。以上の処理の後、ステップ６０９で、ｐ０，ｐ１，ｐ２の値を更新し、復帰する。
【００６４】
なお、ステップ６０３，６０５，６０７，６０９で、個別障害ログ保存部１４０，１４１，１４２がプロセッサにリードされた回数を取得しているが、本実施の形態では、プロセッサ１１１，１１２は個別障害ログ保存部１４０，１４１，１４２のリード回数を計測するレジスタを保持しており、このレジスタを読むことによって回数を取得するものとする。
【００６５】
次に、図７により、障害通知手段１５１の処理の内容を説明する。図７は障害通知手段１５１の処理を示す流れ図である。
【００６６】
障害通知手段１５１は、自身が最後に読み込んだ障害ログの時刻印を保持する内部変数ｔを持つ。障害通知手段１５１は、ハイパバイザ１９０によって呼び出された後、ステップ７００で、障害ログ保存部１２０内の障害ログの時刻印とｔとを比較し、ｔより大きい時刻印を持つ障害ログが存在するかどうかを判定する。
【００６７】
存在する場合、ステップ７０１で、そのような障害ログのうち時刻印が最小のものを１つ読み込み、ｔの値を読み込んだ障害ログの時刻印に更新する。その後、ステップ７０２で、前記読み込んだ障害ログの障害ＩＤを引数として障害通知先決定手段１５５を呼び出し、返り値として得た障害を通知すべきＯＳの集合を得る。
【００６８】
その後、ステップ７０３で、この集合に含まれるＯＳに対応する個別障害ログ保存部に前記読み込んだ障害ログのコピーを書き込み、ステップ７００に戻る。ステップ７００で、ｔより大きい時刻印を持つ障害ログが存在しない場合、復帰する。
【００６９】
次に、図８により、障害通知先決定手段１５５の処理について説明する。図８は障害通知先ＯＳ設定保存部１６０内の表を示す図である。
【００７０】
本実施の形態では、障害通知先ＯＳ設定保存部１６０内に図８に示される表を保持している。この表で、「Ｙ」となっているセルは、そのセルの行の障害ＩＤを持つ障害をそのセルの列のＯＳに通知することを表し、「Ｎ」の場合は通知しないことを表す。
【００７１】
図８の表では、障害ＩＤが１の障害をＯＳ１３０，１３１，１３２に、障害ＩＤが２の障害をＯＳ１３０，１３１，１３２に、障害ＩＤが３の障害をＯＳ１３１に通知することを表している。障害通知先決定手段１５５は、引数である障害ＩＤからこの表の行を検索し、その行で「Ｙ」となっている各セルの列のＯＳの集合を返り値として返す。
【００７２】
次に、図９により、ＯＳ１３０，１３１，１３２および保守用ソフトウェア１８０，１８１，１８２の処理について説明する。図９は、保守用ソフトウェア１８０，１８１，１８２の処理を示す流れ図である。
【００７３】
本実施の形態では、ＯＳ１３０，１３１，１３２は、ＧＥＴ＿ＥＲＲＯＲ＿ＬＯＧと言う名前のシステムコールを備え、このシステムコールは個別障害ログ保存部１４０，１４１，１４２内の障害ログを読み出し、また読み出した障害ログを消去し、読み出したログを呼び出し元に返り値として渡すものとする。保守用ソフトウェア１８０，１８１，１８２は図９に示される処理を行う。
【００７４】
まず、ステップ９００で、システムコールＧＥＴ＿ＥＲＲＯＲ＿ＬＯＧを呼び出す。この呼び出しで障害ログを取得できなかった場合には、ステップ９０５で、一定時間スリープし、ステップ９００に戻る。障害ログを取得できた場合には、ステップ９０２で、コンソールに障害ログの内容を表示することでユーザに障害を通報する。
【００７５】
その後、ステップ９０３で、障害ログの障害ＩＤから障害が温度障害かどうかを判定し、温度障害の場合にはステップ９０４で、ＯＳのシャットダウン処理を行う。その他の障害の場合には、ステップ９０５で、一定時間スリープした後、ステップ９００に戻る。
【００７６】
次に、図１０により、障害通知先設定手段１５３について説明する。図１０は設定画面を示す図である。
【００７７】
本実施の形態では、障害通知先設定手段１５３は計算機システムの起動時に実行され、図１０のような設定画面をコンソールに表示し、ユーザがチェックボックスにチェックすることにより、各障害を通知するＯＳの集合を指定するユーザインタフェースを備える。障害通知先設定手段１５３は、ユーザが図１０の「ＯＫ」ボタンを押したことを検知し、ユーザの入力通りに障害通知先ＯＳ設定保存部１６０内の表を書き換えることで障害の通知先の設定を行う。
【００７８】
たとえば、ユーザが、温度障害ではデータの保全を行うため、全ＯＳに通知して保守用ソフトウェアにシャットダウンを行わせたいが、プロセッサのキャッシュ障害については軽度の障害であるためＯＳ１３１にのみ通知し、ＯＳ１３０，１３２上の保守用ソフトウェア１８０，１８２による通報を行わせないようにしたいと考えた場合、図１０のようにチェックを行うことで実現できる。
【００７９】
また、このように障害の通知先ＯＳを決定できることは、誤動作を防ぐ上でも有用である。たとえば、ＯＳ１３２上で動作する保守用ソフトウェア１８２が温度障害の障害ログに対応しておらず、障害ログを読み込んだ場合に誤動作したり、エラーログを生成する可能性がある場合、温度障害をＯＳ１３２に通知しないように設定することで、保守用ソフトウェア１８２が誤動作したり、エラーログを生成することを防止できる。
【００８０】
次に、図１１により、障害通知先切り替え手段１５４について説明する。図１１は障害通知先切り替え手段１５４により変更された障害通知先ＯＳ設定保存部１６０内の表を示す図である。
【００８１】
障害通知先切り替え手段１５４は、前述のようにＯＳ監視手段１５０がＯＳの動作に異常を検出した場合に呼び出される。本実施の形態では、障害通知先切り替え手段１５４は動作に異常を検出したＯＳがＯＳ１３０，１３１，１３２のいずれのＯＳであるかを識別する識別子を引数として受け取り、この識別子を用いて障害通知先ＯＳ設定保存部１６０内の表を検索することにより、この識別子の示すＯＳが通知先に設定されている障害ＩＤを得た後、その障害ＩＤを持つ障害の通知先ＯＳを他のＯＳに切り替えるために障害通知先ＯＳ設定保存部１６０内の表を書き換える。
【００８２】
たとえば、障害通知先ＯＳ設定保存部１６０内の表が前述した図８に示されている状態であった場合に、ＯＳ監視手段１５０がＯＳ１３１の動作に異常を検出したとする。すると、障害通知先切り替え手段１５４は、ＯＳ１３１が通知先に設定されている障害の障害ＩＤが１および２であることを求め、これらの障害ＩＤの示す障害の通知先ＯＳをＯＳ１３０に切り替えるために、障害通知先ＯＳ設定保存部１６０内の表を図１１に示されている状態へと書き換える。以上により、以後プロセッサ１１２のキャッシュメモリ障害をＯＳ１３０に通知され、保守用ソフトウェア１８０がユーザに通報することが可能となる。
【００８３】
このようにして、ＯＳがクラッシュするなどして障害の通報を行えなくなった場合でも、障害の通知先となるＯＳが自動的に変更され、以後も他のＯＳ上の保守用ソフトウェアにより障害の通報を行うことができる。
【００８４】
なお、以上に述べた実施の形態は本発明の実施方法の一例であり、本発明の内容が以上に述べた実施の形態に限定されることを示すものではない。
【００８５】
以上に述べた実施の形態では、ＯＳ監視手段１５０はプロセッサのレジスタを読むことによって個別障害ログ保存部のリード回数を計測し、この回数が一定時間変化しないことを検出することによりＯＳの動作の異常を検出しているが、別の方法によって行うこともできる。
【００８６】
たとえば、プロセッサ１１１，１１２が個別障害ログ保存部への読み出しが発生した場合に割り込みを発生する機能を備えているならば、この割り込みの割り込みハンドラでリード回数を計測することができ、これによってＯＳの監視を行える。また、ＯＳが定期的に特定の関数を実行することがわかっている場合は、この関数の呼び出しの回数を計測することで監視を行うことができる。
【００８７】
また、障害通知先設定手段１５３は、前記図１０に示したようなユーザインタフェースを備える必要はなく、ユーザが障害通知先のＯＳの集合を設定することさえできれば、テキストベースのユーザインタフェースや、またはスイッチによる設定などであってもよい。
【００８８】
【発明の効果】
以上に述べたように、本発明によれば、障害を通知するＯＳを限定することにより、複数のＯＳが共通に使用する資源に障害が発生した場合に、各ＯＳ上で動作する保守用ソフトウェアが冗長な通報を行うことを防止できる。さらに、あるＯＳ上で実行される保守用ソフトウェアが、このＯＳに通知される障害の一部に対応していない場合にも、他のＯＳを障害の通知先に設定することにより、他のＯＳ上の保守用ソフトウェアに障害通報を行わせることが可能となる。また、他のＯＳを障害の通知先に設定することにより、保守用ソフトウェアの誤動作やエラーログが作成されることを防止できる。
【００８９】
また、本発明によれば、障害の通知先となっているＯＳがクラッシュするなどして正常に動作していない場合でも、障害の通知先を正常に動作している他のＯＳに切り替えることにより、障害発生時の確実な通報が可能となる。
【図面の簡単な説明】
【図１】本発明の一実施の形態である計算機システムを示す構成図である。
【図２】本発明の一実施の形態である計算機システムにおいて、障害通知処理の動作概要を示す図である。
【図３】本発明の一実施の形態である計算機システムにおいて、障害検出手段の処理を示す流れ図である。
【図４】本発明の一実施の形態である計算機システムにおいて、障害ログのフォーマットを示す図である。
【図５】本発明の一実施の形態である計算機システムにおいて、ハイパバイザの処理を示す流れ図である。
【図６】本発明の一実施の形態である計算機システムにおいて、ＯＳ監視手段の処理を示す流れ図である。
【図７】本発明の一実施の形態である計算機システムにおいて、障害通知手段の処理を示す流れ図である。
【図８】本発明の一実施の形態である計算機システムにおいて、障害通知先ＯＳ設定保存部内の表を示す図である。
【図９】本発明の一実施の形態である計算機システムにおいて、保守用ソフトウェアの処理を示す流れ図である。
【図１０】本発明の一実施の形態である計算機システムにおいて、設定画面を示す図である。
【図１１】本発明の一実施の形態である計算機システムにおいて、障害通知先切り替え手段により変更された障害通知先ＯＳ設定保存部内の表を示す図である。
【符号の説明】
１００…筐体、１１０…システムバス、１１１，１１２…プロセッサ、１１３…温度センサ、１１４…主記憶、１１５…不揮発性メモリ、１１６…ＩＯアダプタ、１１７…タイマ、１２０…障害ログ保存部、１３０，１３１，１３２…ＯＳ、１４０，１４１，１４２…個別障害ログ保存部、１５０…ＯＳ監視手段、１５１…障害通知手段、１５２…障害検出手段、１５３…障害通知先設定手段、１５４…障害通知先切り替え手段、１５５…障害通知先決定手段、１６０…障害通知先ＯＳ設定保存部、１７０…磁気ディスク装置、１７１…コンソール装置、１８０，１８１，１８２…保守用ソフトウェア、１９０…ハイパバイザ。
[0001]
[Technical field of the invention]
The present invention relates to a technique that is effective when applied to notifying an OS of a failure that occurs in a computer system in which a plurality of operating systems (hereinafter referred to as OSs) run, and to having the OS that was notified of the failure report it to a user.
[0002]
[Prior art]
According to studies by the present inventors, the following techniques are conceivable for computer systems.
[0003]
For example, when a failure occurs in a computer system, a prompt maintenance response is desirable. To report the occurrence of a failure to the user, maintenance software that runs on the OS and has a failure reporting function is widely used.
[0004]
Normally, when a failure occurs in a computer system, a failure log containing information such as the type of failure, the component in which it occurred, and the time at which it occurred is recorded in a failure log storage unit in a storage medium such as main memory. The OS is notified of the failure by reading the failure log from this storage unit. The maintenance software then reports the notified failure to the user, for example by displaying it on a console or by sending an e-mail.
[0005]
Meanwhile, techniques for executing a plurality of OSs on one computer are known. For example, a virtual computer system runs a plurality of virtual computers simultaneously on one computer system. In a virtual computer system, a hypervisor (also called a host OS), which is a program that controls the plurality of OSs, performs control such as scheduling the OSs, dispatching interrupts, and simulating instructions, and thereby allows the plurality of OSs to execute simultaneously (for example, see Patent Document 1).
[0006]
There is also the logical partitioning system, in which the host OS of a virtual computer system is provided as if it were a hardware mechanism, so that the real computer appears to the user to be logically partitioned. For logical partitioning systems, there is a method of restricting the operations a guest may perform on the system resources allocated to it (for example, see Patent Document 2). There is also an apparatus that controls the activation of logical systems in a data processing system equipped with a logical processor facility (for example, see Patent Document 3).
[0007]
Conventionally, in a computer system running a plurality of OSs, the following three methods have been used as means for notifying a user of the occurrence of a failure.
[0008]
The first method is a method in which software running on a service processor incorporated in a computer system notifies a user of a failure.
[0009]
The second method is a method in which software for controlling a virtual machine notifies a user of a failure.
[0010]
A third method is a method in which the virtual machine notifies the OS of a failure of a resource used by itself, and maintenance software operating on the OS reports the failure.
[0011]
In the first and second methods, devices used for reporting failures, such as a console and a network interface card, must be provided separately from the devices available to the OSs, which raises the price of the computer system.
[0012]
In addition, a program for controlling these devices must be included in the software running on the service processor or in the software that controls the virtual machines, and the development and maintenance costs of this program likewise raise the price of the computer system.
[0013]
Therefore, in order to provide a fault reporting function at low cost, it is desirable to adopt the third method.
[0014]
[Patent Document 1]
Japanese Examined Patent Publication No. Sho 61-22825
[0015]
[Patent Document 2]
Japanese Examined Patent Publication No. Hei 6-73108
[0016]
[Patent Document 3]
Japanese Patent No. 3090452
[0017]
[Problems to be solved by the invention]
However, as a result of the present inventors' study of the computer system techniques described above, it became clear that the third method has the following problems.
[0018]
First, when a failure occurs in a resource shared by a plurality of OSs, each instance of maintenance software running on those OSs reports the failure. Having several reports made for a single failure greatly complicates maintenance work. For example, to determine the number of replacement parts required, the contents of the reports must be examined and redundant reports that indicate a failure in the same part must be removed.
[0019]
In addition, if a maintenance contract is concluded with a different maintenance service company for each OS, so that the failure reports from each OS arrive at a different maintenance site, reports are delivered to several maintenance sites even for a failure that could be dealt with by replacing a single component, which may lead to unnecessary maintenance work.
[0020]
These problems can be avoided if the maintenance software running on the OSs has a function for cooperating with one another so that unnecessary reports are not made, but not all maintenance software has such a function.
[0021]
Further, the third method has the problem that, when an OS crashes, failures that were set to be notified only to that OS are no longer notified to anyone.
[0022]
The third method also has the problem that, when the maintenance software running on an OS does not support a failure notified to that OS, no report is made for that failure. The same problem arises when no maintenance software is running on an OS. Moreover, when maintenance software receives a failure it does not support, it may malfunction or generate an error log, which is a problem for system operation.
[0023]
Accordingly, an object of the present invention is to provide a computer system that can prevent the maintenance software running on each OS from making redundant reports when a failure occurs in a resource shared by a plurality of OSs.
[0024]
Another object of the present invention is to provide a computer system in which, even when the maintenance software running on an OS does not support some of the failures notified to that OS, maintenance software on another OS can be made to report the failure, and in which malfunction of the maintenance software and the creation of error logs can be prevented.
[0025]
Still another object of the present invention is to provide a computer system that can reliably report a failure when one occurs, even if the OS that is the notification destination of the failure has crashed or is otherwise not operating normally.
[0026]
[Means for Solving the Problems]
The present invention is applied to a single computer system capable of operating a plurality of OSs simultaneously, and has the following features.
[0027]
(1) The computer system comprises: failure detection means for detecting a failure occurring in the computer system; failure notification destination determination means for determining the set of OSs to be notified of a failure detected by the failure detection means; failure notification means for notifying each OS in the set determined by the failure notification destination determination means of the failure detected by the failure detection means; and failure notification destination setting means with which a user can set, for at least one of the failures detectable by the failure detection means, the set of OSs to be notified of it. The failure notification destination determination means determines the set of OSs to be notified of a failure in accordance with the setting made with the failure notification destination setting means.
[0028]
(2) The computer system comprises: OS monitoring means for monitoring the operation of the OSs running on the computer system; and failure notification destination switching means for changing the setting so that a failure set to be notified to one OS is notified to another OS. When the OS monitoring means detects an abnormality in the operation of an OS, the failure notification destination switching means switches the notification destination of the failures that were set to be notified to that OS to another OS.
[0029]
(3) A combination of the above (1) and (2).
[0030]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0031]
First, an example of the configuration of a computer system according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a configuration diagram showing the computer system according to the present embodiment.
[0032]
The computer system according to the present embodiment has, inside a housing 100, processors 111 and 112, a main memory 114, a temperature sensor 113, a nonvolatile memory 115, an IO adapter 116, and a timer 117, all connected via a system bus 110. A magnetic disk device 170 and a console device 171 are connected to the IO adapter 116.
[0033]
The temperature sensor 113 measures the temperature inside the housing 100 and, when the measured value falls outside a certain range, generates an interrupt to notify the processor 111 or the processor 112 of a temperature failure. The processors 111 and 112 each have a cache memory. When a one-bit failure occurs in its own cache memory, a processor corrects it automatically and then generates an interrupt to itself in order to notify the OS that a cache failure occurred and was corrected.
[0034]
The failure detection means 152 is located in the main memory 114 and is executed when the processor 111 or 112 receives an interrupt generated by the temperature sensor 113 or by the processors 111 and 112 as described above. The failure detection means 152 is a program that creates a failure log, records it in the failure log storage unit 120 in the nonvolatile memory 115, and returns.
[0035]
The timer 117 has a function of generating an interrupt to the processor 111 or 112 at predetermined time intervals. When the processor 111 or 112 receives the interrupt, the hypervisor 190 arranged in the main memory 114 is executed.
[0036]
Three OSs, OS 130, OS 131, and OS 132, are located in the main memory 114. Under the control of the hypervisor 190, OS 130 and OS 132 are assumed to run on the processor 111, and OS 131 is assumed to run on both the processor 111 and the processor 112.
[0037]
Maintenance software 180, 181, and 182, which reports failures when they occur, runs on the OSs 130, 131, and 132, respectively.
[0038]
The main memory 114 also holds the OS monitoring means 150, the failure notification means 151, the failure detection means 152, the failure notification destination setting means 153, the failure notification destination switching means 154, and the failure notification destination determination means 155. These are programs executed by the processor 111 or the processor 112.
[0039]
In the main memory 114, individual failure log storage units 140, 141, and 142, which are areas for storing logs notified to the OSs 130, 131, and 132, respectively, are provided.
[0040]
Further, the main memory 114 contains a failure notification destination OS setting storage unit 160, which is an area for storing the correspondence between each failure detected by the failure detection means 152 and the set of OSs to be notified of that failure.
[0041]
Next, an outline of the process of notifying an OS of a failure in the computer system according to the present embodiment will be described with reference to FIG. 2. FIG. 2 is a diagram outlining how the elements of the computer system of the present embodiment cooperate to perform the failure notification process.
[0042]
The failure notification process starts when the failure detection means 152 is activated by an interrupt, creates a failure log, and records it in the failure log storage unit 120 (step 201). Next, the failure notification means 151 is called by the hypervisor 190 and obtains the failure log from the failure log storage unit 120 (step 202).
[0043]
The failure notification means 151 then calls the failure notification destination determination means 155, passing part of the failure log as an argument, in order to obtain the set of OSs to be notified of the failure (step 203). The failure notification destination OS setting storage unit 160 holds a table indicating which failure should be notified to which OS. The failure notification destination determination means 155 searches this table using the part of the failure log it was passed and returns the set of OSs to be notified of the failure (step 204).
[0044]
Next, the failure notification means 151 writes a copy of the failure log into the individual failure log storage unit 140, 141, or 142 corresponding to each OS 130, 131, or 132 in the set of OSs to be notified (step 205). The failure notification means 151 then returns control to the hypervisor 190.
[0045]
The failure logs recorded in the individual failure log storage units 140, 141, and 142 in this way are read out by the OSs 130, 131, and 132 in response to polling from the maintenance software 180, 181, and 182, and this is how the occurrence of a failure is notified to the OSs (step 206). The maintenance software 180, 181, and 182 each obtain the failure log read by the OS through an interface provided by the OS 130, 131, or 132, respectively, and display the contents of the log on the console device 171.
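The flow of steps 202 to 205 can be modeled in a few lines. The following is a minimal Python sketch, not the implementation described in the patent: the names `FailureLog`, `NOTIFY_TABLE`, `determine_destinations`, and `notify` are hypothetical, and the table contents are illustrative values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureLog:
    failure_id: int  # type of failure
    timestamp: int   # time at which the log was written
    data: bytes      # failure-specific payload


# Illustrative contents of the table in storage unit 160 (assumed values):
# failure ID -> set of OSs to be notified.
NOTIFY_TABLE = {1: {130, 131, 132}, 2: {131}, 3: {131}}

# One individual failure log store per OS (stands in for units 140, 141, 142).
per_os_logs = {130: [], 131: [], 132: []}


def determine_destinations(failure_id, table):
    """Steps 203-204: look up the set of OSs to be notified of this failure."""
    return set(table.get(failure_id, set()))


def notify(log, table, stores):
    """Step 205: copy the log into the store of every destination OS."""
    for os_id in determine_destinations(log.failure_id, table):
        stores[os_id].append(log)


notify(FailureLog(failure_id=2, timestamp=100, data=b"cache"), NOTIFY_TABLE, per_os_logs)
```

With the assumed table, the log for failure ID 2 ends up only in the store for OS 131, so only the maintenance software on that OS would see it when it polls.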
[0046]
The table in the failure notification destination OS setting storage unit 160 is rewritten by the failure notification destination setting means 153 and the failure notification destination switching means 154, and this is how the setting of the failure notification destination OSs is changed.
[0047]
For example, the failure notification destination setting means 153 receives input from the user and rewrites the table in the failure notification destination OS setting storage unit 160 based on that input (steps 211 and 212). This allows the user to set the failure notification destination OSs.
[0048]
The failure notification destination switching means 154 is called when the OS monitoring means 150, which monitors the operation of the OSs, detects an abnormality in the operation of an OS. It switches the failure notification destination OS by rewriting the table in the failure notification destination OS setting storage unit 160 so that failures that had been set to be notified to that OS are no longer notified to it (steps 221 and 222).
[0049]
In the present embodiment, when the polling reads from an individual failure log storage unit 140, 141, or 142 have not taken place for a certain period of time, the OS monitoring means 150 determines that an abnormality has occurred in the operation of the OS corresponding to that storage unit, and the switch is performed according to the above procedure.
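The table rewrite of steps 221 and 222 can be sketched as below. This is a hedged Python illustration: `switch_destination` is a hypothetical name, the table values are invented for the example, and letting the caller choose the replacement OS (`fallback_os`) is an assumption, since the text here says only that the destination is switched to another OS.

```python
def switch_destination(table, failed_os, fallback_os):
    """Steps 221-222: stop notifying failed_os and notify fallback_os instead.

    `table` maps a failure ID to the set of OSs notified of that failure.
    The choice of fallback_os is left to the caller (an assumption).
    """
    if fallback_os == failed_os:
        raise ValueError("fallback OS must differ from the failed OS")
    for destinations in table.values():
        if failed_os in destinations:
            destinations.discard(failed_os)
            destinations.add(fallback_os)


# Illustrative table: failure ID -> OSs to be notified.
table = {1: {130, 131, 132}, 2: {131}, 3: {131}}
switch_destination(table, failed_os=131, fallback_os=130)
```

After the call, no row of the example table names OS 131, and every failure that was routed to it is routed to OS 130 instead, so a failure is still delivered to at least one running OS.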
[0050]
Next, the specific steps by which the processing described above is executed will be described in detail.
[0051]
First, the detection of a failure by the failure detection means 152 will be described with reference to FIGS. 3 and 4. FIG. 3 is a flowchart showing the processing of the failure detection means 152. FIG. 4 is a diagram showing the format of the failure log stored in the failure log storage unit 120.
[0052]
In general, a computer system allows a program (an interrupt handler) to be registered for each type of interrupt and executed when that interrupt occurs. The failure detection means 152 is assumed to be the program executed when an interrupt from the temperature sensor 113 or the processors 111 and 112 occurs.
[0053]
In general, a processor executing an interrupt handler holds a value indicating the type of interrupt in a register. In this embodiment, it is assumed that the type of interrupt can be determined based on the value of the register.
[0054]
First, in step 300, the failure detection means 152 determines the type of failure from the value of the register. If the value indicates an interrupt caused by a temperature failure from the temperature sensor 113, a failure log for a temperature failure is created in step 301. If the value indicates an interrupt caused by a cache memory failure of a processor, a failure log for a processor failure is created in step 302.
[0055]
Then, in step 303, the created failure log is stored in the failure log storage unit 120, and the process returns. FIG. 4 shows the format of the failure log stored in the failure log storage unit 120. A failure log consists of a failure ID indicating the type of failure, a time stamp indicating the time at which the log was written, and failure data describing the failure, such as the temperature sensor reading or the address in the cache memory at which the error occurred. In the present embodiment, the failure ID of a temperature failure is 1, that of a failure of the processor 111 is 2, and that of a failure of the processor 112 is 3.
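Steps 300 to 303 and the log format can be illustrated as follows. This Python sketch rests on assumptions: the interrupt-cause codes `IRQ_TEMPERATURE` and `IRQ_CACHE` are invented (the text says only that a register value identifies the interrupt type), and `failure_log_store` is a plain list standing in for the failure log storage unit 120. The failure IDs 1, 2, and 3 are the ones given above.

```python
import time
from dataclasses import dataclass

# Failure IDs as given in the embodiment.
TEMPERATURE_FAILURE = 1
PROCESSOR_FAILURE_ID = {111: 2, 112: 3}

# Hypothetical register values identifying the interrupt type.
IRQ_TEMPERATURE = 0x10
IRQ_CACHE = 0x20


@dataclass
class FailureLog:
    failure_id: int   # type of failure
    timestamp: float  # time at which the log was written
    data: dict        # e.g. sensor reading or failing cache address


failure_log_store = []  # stands in for failure log storage unit 120


def detect_failure(cause, processor, sensor_value=None, cache_address=None):
    """Model of the interrupt handler: classify, build a log, store it."""
    if cause == IRQ_TEMPERATURE:                       # steps 300, 301
        log = FailureLog(TEMPERATURE_FAILURE, time.time(),
                         {"temperature": sensor_value})
    elif cause == IRQ_CACHE:                           # steps 300, 302
        log = FailureLog(PROCESSOR_FAILURE_ID[processor], time.time(),
                         {"address": cache_address})
    else:
        raise ValueError(f"unexpected interrupt cause: {cause:#x}")
    failure_log_store.append(log)                      # step 303
    return log
```

For example, `detect_failure(IRQ_CACHE, 112, cache_address=0x1000)` would append a log with failure ID 3 to the store.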
[0056]
Next, how a detected failure is notified to the OSs will be described with reference to FIG. 5. FIG. 5 is a flowchart showing the processing of the hypervisor 190 that is executed when a timer interrupt occurs.
[0057]
After a timer interrupt occurs, the hypervisor 190 calls the OS monitoring means 150 in step 500 and then the failure notification means 151 in step 501. In step 502 it schedules the OSs and decides which OS to run next, and in step 503 it passes control to that OS. This processing is executed on every timer interrupt.
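The ordering of steps 500 to 503 can be shown with a small model. In this Python sketch, `make_timer_handler` and its parameters are hypothetical, and the round-robin scheduling policy is an assumption, since the text does not say how the next OS is chosen.

```python
from itertools import cycle


def make_timer_handler(monitor, notifier, os_ids, dispatch):
    """Build a handler that mirrors the hypervisor's timer-interrupt path."""
    schedule = cycle(os_ids)  # round-robin order: an assumed policy

    def on_timer_interrupt():
        monitor()                 # step 500: run the OS monitoring means
        notifier()                # step 501: run the failure notification means
        next_os = next(schedule)  # step 502: decide which OS runs next
        dispatch(next_os)         # step 503: pass control to that OS
        return next_os

    return on_timer_interrupt


trace = []
handler = make_timer_handler(
    monitor=lambda: trace.append("monitor"),
    notifier=lambda: trace.append("notify"),
    os_ids=[130, 131, 132],
    dispatch=lambda os_id: trace.append(f"run OS {os_id}"),
)
handler()
```

Running monitoring and notification before dispatch means that any table switch made by the monitor is already in effect when failure logs are copied during the same interrupt.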
[0058]
Next, the processing of the OS monitoring means 150 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the processing of the OS monitoring means 150.
[0059]
The OS monitoring means 150 has an internal variable c that counts the number of times it has been called, and internal variables p0, p1, and p2 that hold the number of times the individual failure log storage units 140, 141, and 142, respectively, had been read by the processors at the time c was 0.
[0060]
The OS monitoring means 150 first increments c by 1 in step 600 and checks in step 601 whether c has reached 1000. In this embodiment, it is assumed that by the time c reaches 1000, sufficient time has elapsed since c was 0 that, if the OSs are operating normally, the individual failure log storage units 140, 141, and 142 will each have been read at least once by the OSs 130, 131, and 132, respectively.
[0061]
If it is determined in step 601 that the value of c has not reached 1000, the process returns. When it is determined that the value of c has reached 1000, in step 603, the number of times that the individual failure log storage unit 140 has been read by the processor is obtained, and it is checked that this number is greater than the value of p0.
[0062]
If the result of the determination in step 603 is false, the OS 130 is judged not to be operating normally, so in step 604 the failure notification destination switching means 154 is called and the notification destination of the failures set to be notified to the OS 130 is switched to another OS.
[0063]
In steps 605 to 608, similar processing is performed for the individual failure log storage units 141 and 142. After the above processing, in step 609, the values of p0, p1, and p2 are updated, and the process returns.
[0064]
In steps 603, 605, 607, and 609, the number of times the individual failure log storage units 140, 141, and 142 have been read by the processors is acquired. In this embodiment, the processors hold a register that counts the reads of the individual failure log storage units 140, 141, and 142, and the number of reads is obtained by reading this register.
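The monitoring logic of steps 600 to 609 can be sketched as below. The class name, the callables that stand in for the read-count registers, and the `on_abnormal` callback are hypothetical. Resetting c to 0 after a check is also an assumption, since the text only says that p0 to p2 hold the counts "when c was 0".

```python
class OSMonitor:
    THRESHOLD = 1000  # number of calls between checks, as in the embodiment

    def __init__(self, read_counters, on_abnormal):
        # read_counters: OS id -> zero-argument callable returning the
        # cumulative read count (stands in for reading the register).
        self.read_counters = read_counters
        self.on_abnormal = on_abnormal  # stands in for switching means 154
        self.c = 0
        self.p = {os_id: read() for os_id, read in read_counters.items()}

    def tick(self):
        self.c += 1                   # step 600
        if self.c < self.THRESHOLD:   # step 601
            return
        current = {os_id: read() for os_id, read in self.read_counters.items()}
        for os_id, count in current.items():  # steps 603-608
            if not count > self.p[os_id]:
                self.on_abnormal(os_id)       # no read since the last check
        self.p = current              # step 609
        self.c = 0                    # assumption: restart the counting window


counts = {130: 0, 131: 0, 132: 0}
abnormal = []
monitor = OSMonitor({k: (lambda k=k: counts[k]) for k in counts}, abnormal.append)
for _ in range(1000):
    monitor.tick()
    counts[130] += 1  # OS 130 and OS 132 keep reading their logs
    counts[132] += 1  # OS 131 never reads, simulating a hung OS
```

In this run only OS 131 is reported, because its read count did not increase during the 1000-call window.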
[0065]
Next, the contents of the processing of the failure notification unit 151 will be described with reference to FIG. FIG. 7 is a flowchart showing the processing of the failure notification means 151.
[0066]
The failure notification means 151 has an internal variable t that stores the time stamp of the failure log it read last. After being called by the hypervisor 190, the failure notification means 151 compares the time stamps of the failure logs in the failure log storage unit 120 with t in step 700 and determines whether there is a failure log with a time stamp greater than t.
[0067]
If there is, in step 701, the failure log with the smallest time stamp among them is read, and t is updated to the time stamp of the read failure log. Thereafter, in step 702, the failure notification destination determination means 155 is called with the failure ID of the read failure log as an argument, and the set of OSs to be notified of the failure is obtained as its return value.
[0068]
Then, in step 703, a copy of the read failure log is written in the individual failure log storage unit corresponding to the OS included in the set, and the process returns to step 700. If there is no failure log having a time stamp greater than t in step 700, the process returns.
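The loop of steps 700 to 703 can be sketched as follows. Logs are represented as `(timestamp, failure_id, data)` tuples, the storage units as Python lists, and the initial value of t as 0. All of these are assumptions made for illustration, as is the routing dictionary in the usage example.

```python
class FailureNotifier:
    def __init__(self, shared_logs, individual_logs, determine_destinations):
        self.shared_logs = shared_logs          # stands in for storage unit 120
        self.individual_logs = individual_logs  # OS id -> list (units 140-142)
        self.determine_destinations = determine_destinations  # means 155
        self.t = 0  # assumption: earlier than any real time stamp

    def run(self):
        while True:
            # Step 700: look for logs newer than the last one processed.
            pending = [log for log in self.shared_logs if log[0] > self.t]
            if not pending:
                return
            # Step 701: take the oldest pending log and advance t.
            log = min(pending, key=lambda entry: entry[0])
            self.t = log[0]
            # Steps 702-703: copy the log to each destination OS.
            for os_id in self.determine_destinations(log[1]):
                self.individual_logs[os_id].append(log)


shared = [(20, 3, 0x1F00), (10, 1, 95)]
individual = {130: [], 131: [], 132: []}
routes = {1: [130, 131, 132], 3: [131]}  # hypothetical routing for the example
notifier = FailureNotifier(shared, individual, lambda failure_id: routes[failure_id])
notifier.run()
```

Because the comparison is strictly "greater than t", two logs with an identical time stamp would result in the second being skipped. The sketch keeps that behavior as described rather than resolving it.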
[0069]
Next, the processing of the failure notification destination determination unit 155 will be described with reference to FIG. FIG. 8 is a diagram showing a table in the failure notification destination OS setting storage unit 160.
[0070]
In the present embodiment, the table shown in FIG. 8 is held in the failure notification destination OS setting storage unit 160. In this table, a cell indicated by "Y" indicates that a fault having a fault ID in the row of the cell is notified to the OS in a column of the cell, and "N" indicates that no notification is made.
[0071]
The table in FIG. 8 indicates that a failure with a failure ID of 1 is notified to the OSs 130, 131, and 132, a failure with a failure ID of 2 is notified to the OSs 130, 131, and 132, and a failure with a failure ID of 3 is notified to the OS 131. The failure notification destination determination means 155 looks up the row of this table that corresponds to the failure ID passed as an argument and returns, as its return value, the set of OSs in the columns whose cells in that row contain "Y".
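The table lookup can be sketched as a dictionary of rows keyed by failure ID. The cell values below are transcribed from the description of FIG. 8 in the text, with `True` standing for "Y" and `False` for "N". The function and variable names are hypothetical.

```python
OS_IDS = (130, 131, 132)

# Rows are keyed by failure ID; columns by OS number.
NOTIFY_TABLE = {
    1: {130: True, 131: True, 132: True},
    2: {130: True, 131: True, 132: True},
    3: {130: False, 131: True, 132: False},
}


def determine_destinations(failure_id, table=NOTIFY_TABLE):
    """Return the set of OSs whose cell in the row for failure_id is 'Y'."""
    row = table[failure_id]
    return {os_id for os_id in OS_IDS if row[os_id]}


destinations_for_3 = determine_destinations(3)
```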
[0072]
Next, the processing of the OSs 130, 131, 132 and the maintenance software 180, 181, 182 will be described with reference to FIG. FIG. 9 is a flowchart showing the processing of the maintenance software 180, 181, 182.
[0073]
In the present embodiment, the OSs 130, 131, and 132 have a system call named GET_ERROR_LOG. This system call is assumed to read a failure log from the individual failure log storage unit 140, 141, or 142, delete the failure log that was read, and pass it to the caller as the return value. The maintenance software 180, 181, 182 performs the processing shown in FIG.
[0074]
First, in step 900, the system call GET_ERROR_LOG is called. If no failure log can be acquired by this call, the maintenance software sleeps for a predetermined time in step 905 and returns to step 900. If a failure log has been acquired, the user is notified of the failure in step 902 by displaying the contents of the failure log on the console.
[0075]
Then, in step 903, it is determined from the failure ID in the failure log whether the failure is a temperature failure. If the failure is a temperature failure, the OS shutdown processing is performed in step 904. In the case of any other failure, after sleeping for a certain period of time in step 905, the process returns to step 900.
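The polling loop of steps 900 to 905 can be sketched as follows. `get_error_log` stands in for the GET_ERROR_LOG system call and is assumed to return `None` when no log is available. The dictionary log format, the callback parameters, and the `max_iterations` guard are hypothetical additions that keep the example testable.

```python
import time

TEMPERATURE_FAILURE_ID = 1  # failure ID of a temperature failure in the embodiment


def maintenance_loop(get_error_log, report, shutdown,
                     sleep=time.sleep, interval=1.0, max_iterations=None):
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        log = get_error_log()  # step 900: read and remove one failure log
        if log is not None:
            report(log)        # step 902: e.g. display on the console
            if log["failure_id"] == TEMPERATURE_FAILURE_ID:  # step 903
                shutdown()     # step 904: shut the OS down
                return
        sleep(interval)        # step 905: wait before polling again


queue = [{"failure_id": 3, "data": 0x1F00}, {"failure_id": 1, "data": 95}]
reported, events = [], []
maintenance_loop(
    get_error_log=lambda: queue.pop(0) if queue else None,
    report=reported.append,
    shutdown=lambda: events.append("shutdown"),
    sleep=lambda seconds: None,  # skip real waiting in the example
)
```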
[0076]
Next, the failure notification destination setting means 153 will be described with reference to FIG. FIG. 10 shows a setting screen.
[0077]
In the present embodiment, the failure notification destination setting means 153 is executed when the computer system is started. It displays a setting screen as shown in FIG. 10 on the console and provides a user interface with which the user specifies, by checking check boxes, the set of OSs to be notified of each failure. When the failure notification destination setting means 153 detects that the user has pressed the "OK" button in FIG. 10, it rewrites the table in the failure notification destination OS setting storage unit 160 according to the user's input, thereby setting the failure notification destinations.
[0078]
For example, suppose the user wants a temperature failure to be notified to all OSs so that the maintenance software shuts each OS down to protect data, but wants a processor cache failure, being a minor failure, to be notified only to the OS 131, with no report made by the maintenance software 180 and 182 on the OSs 130 and 132. This can be realized by checking the boxes as shown in FIG.
[0079]
In addition, being able to choose the failure notification destination OS in this way is useful for preventing malfunctions. For example, if the maintenance software 182 running on the OS 132 does not support the failure log of a temperature failure and might malfunction or generate an error log when it reads that failure log, excluding the OS 132 from the notification destinations of the temperature failure prevents the maintenance software 182 from malfunctioning or generating an error log.
[0080]
Next, the failure notification destination switching means 154 will be described with reference to FIG. FIG. 11 is a diagram showing a table in the failure notification destination OS setting storage unit 160 changed by the failure notification destination switching unit 154.
[0081]
The failure notification destination switching means 154 is called when the OS monitoring means 150 detects an abnormality in the operation of an OS, as described above. In the present embodiment, the failure notification destination switching means 154 receives as an argument an identifier indicating which of the OSs 130, 131, and 132 is the OS whose abnormality was detected. It searches the table in the failure notification destination OS setting storage unit 160 to obtain the failure IDs for which the OS indicated by this identifier is set as a notification destination, and then rewrites the table so that the notification destination OS of the failures with those failure IDs is switched to another OS.
[0082]
For example, assume that the OS monitoring means 150 detects an abnormality in the operation of the OS 131 while the table in the failure notification destination OS setting storage unit 160 is in the state shown in FIG. Then the failure notification destination switching means 154 obtains 1 and 2 as the failure IDs of the failures for which the OS 131 is set as a notification destination and, in order to switch the notification destination OS of the failures with these failure IDs to the OS 130, rewrites the table in the failure notification destination OS setting storage unit 160 to the state shown in FIG. As a result, the OS 130 is thereafter notified of a cache memory failure of the processor 112, and the maintenance software 180 can report it to the user.
[0083]
In this way, even if a failure cannot be notified to an OS because of a crash or the like, the OS to be notified of the failure is changed automatically, and the maintenance software on another OS can report the failure thereafter.
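The table rewrite performed on switching can be sketched as below. The text does not state how the replacement OS is chosen or whether the failed OS's cell is cleared. The sketch assumes the caller names the fallback OS and that the failed OS's cell is set to "N". The table contents are illustrative only and are not a transcription of any figure.

```python
def switch_destination(table, failed_os, fallback_os):
    """Redirect every failure routed to failed_os to fallback_os, in place.

    Returns the list of failure IDs whose rows were rewritten.
    """
    switched = []
    for failure_id, row in table.items():
        if row.get(failed_os):
            row[failed_os] = False   # assumption: stop notifying the failed OS
            row[fallback_os] = True  # notify the fallback OS instead
            switched.append(failure_id)
    return switched


# Hypothetical table: True means "Y", False means "N".
table = {
    1: {130: True, 131: True, 132: True},
    3: {130: False, 131: True, 132: False},
}
moved = switch_destination(table, failed_os=131, fallback_os=130)
```

After the call, failures with ID 3 reach OS 130 instead of OS 131, which mirrors the effect described for the cache memory failure of the processor 112.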
[0084]
The embodiment described above is an example of a method of implementing the present invention, and does not indicate that the content of the present invention is limited to the embodiment described above.
[0085]
In the above-described embodiment, the OS monitoring means 150 measures the number of reads of each individual failure log storage unit by reading a register of the processor, and detects an abnormality in the operation of an OS when this number does not change for a certain period of time. However, the monitoring can also be performed by other methods.
[0086]
For example, if the processors 111 and 112 have a function of generating an interrupt when a read of an individual failure log storage unit occurs, the number of reads can be counted by an interrupt handler for this interrupt, and the OS can be monitored in that way. If it is known that the OS periodically executes a specific function, monitoring can also be performed by counting the number of times this function is called.
[0087]
Further, the failure notification destination setting means 153 does not need to have a user interface like the one shown in FIG. 10. As long as the user can set the set of failure notification destination OSs, a text-based user interface or a setting made with switches may be used.
[0088]
[Effects of the invention]
As described above, according to the present invention, limiting the OSs to which a failure is notified prevents the maintenance software running on each OS from making redundant reports when a failure occurs in a resource shared by a plurality of OSs. Further, even when the maintenance software running on a certain OS does not support some of the failures notified to that OS, setting another OS as the notification destination of those failures allows the maintenance software on the other OS to report them. Setting another OS as the failure notification destination also prevents malfunction of the maintenance software that lacks such support and prevents the creation of error logs.
[0089]
Further, according to the present invention, even if the OS that is the failure notification destination is not operating normally because of a crash or the like, the failure notification destination is switched to another OS that is operating normally, so that a failure is reliably reported when it occurs.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an outline of an operation of a failure notification process in a computer system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing processing of a failure detecting unit in the computer system according to the embodiment of the present invention.
FIG. 4 is a diagram showing a format of a failure log in a computer system according to an embodiment of the present invention.
FIG. 5 is a flowchart showing processing of a hypervisor in the computer system according to the embodiment of the present invention.
FIG. 6 is a flowchart showing processing of an OS monitoring means in the computer system according to one embodiment of the present invention.
FIG. 7 is a flowchart showing processing of a failure notifying unit in the computer system according to one embodiment of the present invention.
FIG. 8 is a diagram showing a table in a failure notification destination OS setting storage unit in the computer system according to the embodiment of the present invention.
FIG. 9 is a flowchart showing processing of maintenance software in the computer system according to the embodiment of the present invention.
FIG. 10 is a diagram showing a setting screen in a computer system according to an embodiment of the present invention.
FIG. 11 is a diagram showing a table in a failure notification destination OS setting storage unit changed by a failure notification destination switching unit in the computer system according to the embodiment of the present invention.
[Explanation of symbols]
100: chassis, 110: system bus, 111, 112: processor, 113: temperature sensor, 114: main memory, 115: nonvolatile memory, 116: IO adapter, 117: timer, 120: failure log storage unit, 130, 131, 132: OS, 140, 141, 142: individual failure log storage unit, 150: OS monitoring means, 151: failure notification means, 152: failure detection means, 153: failure notification destination setting means, 154: failure notification destination switching means, 155: failure notification destination determination means, 160: failure notification destination OS setting storage unit, 170: magnetic disk device, 171: console device, 180, 181, 182: maintenance software, 190: hypervisor

Claims

A single computer system capable of running multiple operating systems simultaneously,
Fault detection means for detecting a fault occurring in the computer system,
Failure notification destination determination means for determining a set of operating systems to be notified of the failure detected by the failure detection means,
Failure notification means for notifying a failure detected by the failure detection means to each of a set of operating systems determined by the failure notification destination determination means,
Failure notification destination setting means for allowing a user to set a set of operating systems to be notified of at least one of the failures that can be detected by the failure detection means;
The computer system, wherein the failure notification destination determination means determines a set of operating systems to which failures are to be notified according to the setting by the failure notification destination setting means.

A single computer system capable of running multiple operating systems simultaneously,
Operating system monitoring means for monitoring operation of an operating system running on the computer system;
Failure notification destination switching means for changing settings so as to notify the second operating system of a failure set to be notified to the first operating system;
The computer system, wherein, when the operating system monitoring means detects an abnormality in the operation of the first operating system, the failure notification destination switching means switches the notification destination of the failure set to be notified to the first operating system to the second operating system.

A single computer system capable of running multiple operating systems simultaneously,
Fault detection means for detecting a fault occurring in the computer system,
Failure notification destination determination means for determining a set of operating systems to be notified of the failure detected by the failure detection means,
Failure notification means for notifying a failure detected by the failure detection means to each of a set of operating systems determined by the failure notification destination determination means,
Failure notification destination setting means for allowing a user to set a set of operating systems to be notified of at least one of the failures that can be detected by the failure detection means;
Operating system monitoring means for monitoring operation of an operating system running on the computer system;
Failure notification destination switching means for changing settings so that a failure set to be notified to a first operating system included in the determined set of operating systems is notified to a second operating system not included in the determined set of operating systems,
The computer system, wherein the failure notification destination determination means determines the set of operating systems to be notified of a failure according to the setting by the failure notification destination setting means, and
when the operating system monitoring means detects an abnormality in the operation of the first operating system, the failure notification destination switching means switches the notification destination of the failure set to be notified to the first operating system to the second operating system.