JP4087974B2 - Equipment failure management apparatus, equipment failure management method, and storage medium

Info

Publication number: JP4087974B2
Application number: JP06283399A
Authority: JP
Inventors: 彰男深田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-03-10
Filing date: 1999-03-10
Publication date: 2008-05-21
Anticipated expiration: 2019-03-10
Also published as: JP2000259455A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばインタフェースカードなどのハードウェアモジュールや、プリンタ、ハードディスクドライブなどのユニットを組み合わせて構成されるシステムにおける、各ハードウェアモジュールや各ユニットの障害状況を管理する設備障害管理装置および設備障害管理方法、ならびに記憶媒体に関するものである。
【０００２】
【従来の技術】
この種の従来の設備障害管理装置および設備障害管理方法について以下に説明する。
【０００３】
すなわち、従来では、インタフェースカードなどのハードウェアモジュール(以下、「モジュール」と称する)や、プリンタ、ハードディスクドライブ（以下、「ＨＤＤ」と称する）などのユニットを組み合わせて構成されるシステムにおける、各モジュールや各ユニットの障害は、各モジュールや各ユニットに内蔵された診断回路により個別に検出され、その検出結果が、当該システムのヒューマンインタフェース（以下「ＨＭＩ」と称する）部にて、警報としてシステムの使用者へ通知される。
【０００４】
そして、システムの使用者は、この通知された障害情報を見ることにより、システムにどのような障害が発生したのかを判断した上で、障害あるいは故障の発生したモジュールまたはユニットを交換することにより、使用中のシステムを障害から回復させる作業を行う。
【０００５】
【発明が解決しようとする課題】
しかしながら、このような従来の設備障害管理装置および設備障害管理方法では、以下のような問題がある。
【０００６】
すなわち、多くのモジュール、ユニットから構成されるシステムの場合、システムの使用者は容易に的確な判断をすることができず、システムの回復作業に手間取ることが多かった。
【０００７】
また、良品と交換し、システムから取り除かれた障害発生の原因と推定されるモジュール、ユニットは、工場の品質保証部門へ故障解析や修理作業が依頼されるが、工場で再現テストを行っても容易にその原因が突き止められない場合があった。
【０００８】
さらに、各モジュールや各ユニットの障害履歴は、各モジュール、各ユニットに情報を記録する手段がないため、各モジュール、各ユニットに添付される障害履歴を所定の様式にて記載した書類の記録に頼ることになる。
【０００９】
しかしながら、これらの情報は、障害発見者、またはサービスマンの主観によって記載されるところが多く、客観的な分析が阻害されるという問題があった。特に、伝送路にて互いに接続されたシステム間に発生した障害の解析を行うことは、難題であった。
【００１０】
本発明はこのような事情に鑑みてなされたものであり、多くのモジュール、ユニットから構成されるシステムにおいて障害が発生した場合においても、使用者の能力に依存することなく、その回復作業、故障解析、および修理作業を容易に実施することができ、もって、障害監視能力や障害回復能力に優れた設備障害管理装置および設備障害管理方法、ならびに記憶媒体を提供することを目的とする。
【００１１】
【課題を解決するための手段】
上記の目的を達成するために、本発明では、以下のような手段を講じる。
【００１２】
すなわち、請求項１の発明では、各種機能を持ったハードウェアモジュール及びユニットを組み合わせて構成されるシステムにおける、ハードウェアモジュール及びユニットの障害の状況を管理する設備障害管理装置において、ハードウェアモジュール及びユニットの内部の障害情報を検出し、第１の不揮発性記憶装置に記録する障害情報検出記録手段と、障害情報検出記録手段によって検出された障害情報に基づいて、当該障害が発生したハードウェアモジュール及びユニットを判断するとともに、当該障害の原因を特定する障害原因特定手段と、障害原因特定手段によって特定された障害原因の情報を、当該障害が発生したと判断されたハードウェアモジュール及びユニットヘ伝達する障害原因伝達手段と、障害原因特定手段によって特定された障害原因の情報を、当該障害が発生したと判断されたハードウェアモジュール及びユニットの内部に備えられた第２の不揮発性記憶装置に記録する障害原因記録手段と、障害情報検出記録手段によって検出された障害情報、および障害原因特定手段によって特定された障害原因情報を表示する表示手段とを具備し、障害原因特定手段が障害原因を特定することができない場合に、障害情報検出記録手段により検出された障害情報を、障害原因記録手段によって同一時間帯に障害が発生したハードウェアモジュール及びユニットの内部に備えられた第３の不揮発性記憶装置に記録する。
また、請求項２の発明では、請求項１に記載の設備障害管理装置において、複数の設備障害管理装置本体を、ネットワークによって相互に接続する。
【００１３】
また、請求項３の発明では、各種機能を持ったハードウェアモジュール及びユニットを組み合わせて構成されるシステムにおけるハードウェアモジュール及びユニットの障害の状況を管理するプログラムを実行するプロセッサモジュールによって、障害の状況を管理する設備障害管理方法において、プログラムは、プロセッサモジュールに、ハードウェアモジュール及びユニットの障害情報の検出及び記録を行う第１の手順、障害情報に対応する障害発生部の判断及び原因特定を行う第２の手順、障害情報の伝達及び記録を行う第３の手順、障害情報及び特定された原因の表示を行う第４の手順を実行させる。
そして、プロセッサモジュールが、プログラムに従って第１の手順を実行することにより、ハードウェアモジュール及びユニットの内部の障害情報を検出して第１の不揮発性記憶装置に記録し、第２の手順を実行することにより、検出された障害情報に基づいて、当該障害が発生したハードウェアモジュール及びユニットを判断するとともに、当該障害の原因を特定し、第３の手順を実行することにより、特定された障害原因の情報を、当該障害が発生したと判断されたハードウェアモジュール及びユニットヘ伝達すると共に、その内部に備えられた第２の不揮発性記憶装置に記録し、第４の手順を実行することにより、障害情報および障害原因情報を表示する。
更に、第２の手順において原因特定を行うことができない場合には、障害情報を、同一時間帯に障害が発生したハードウェア及びユニットに記録する第５の手順を実行させる。そして、プロセッサモジュールが、プログラムに従って第５の手順を実行することにより、障害原因を特定することができない場合に、検出された障害情報を、同一時間帯に障害が発生したハードウェアモジュール及びユニットの内部に備えられた第３の不揮発性記憶装置に記録する。
【００１４】
従って、請求項１および請求項３の発明の設備障害管理装置および設備障害管理方法においては、障害が発生した場合に、その障害の発生時刻、障害が発生したユニットおよびモジュール、障害原因を明確にすることができる。
【００１５】
また、障害が発生したユニットおよびモジュールに障害状況を記録することによって、各ユニットおよびモジュール単位での障害履歴の確認することが可能となる。
【００１９】
また、障害が発生したユニットおよびモジュールに自身の障害状況のみならず、同一時間帯に障害が発生した全てのユニットおよびモジュールの障害状況を記録することによって、障害原因が特定できない場合においても、ユニットおよびモジュール単体での障害履歴が確認できるため、障害解析に役立てることができる。
【００２０】
一方、請求項５の発明では、障害情報検出記録手段および障害原因記録手段を、不揮発性記憶装置によりそれぞれ構成したことを特徴とする請求項１または請求項３の発明の設備障害管理装置とする。
【００２１】
従って、請求項５の発明の設備障害管理装置においては、設備障害管理装置本体からモジュール及びユニットを取り外しても、モジュール及びユニットの記録内容が喪失しない。よって、モジュール及びユニットのみを設備障害管理装置本体から取り外し、工場、解析部門へ持ち込んで障害発生の履歴確認及び解析をすることが可能となる。
【００２３】
また、特に、請求項２の発明の設備障害管理装置においては、設備障害管理装置の障害データを、公衆回線などを経由して読み出すことができるようになる。その結果、ユーザにおける設備の稼働状況を、メーカの品質管理部門にてリモート診断することが可能となる。
【００２４】
さらに、請求項４の発明では、各種機能を持ったハードウェアモジュール及びユニットを組み合わせて構成されるシステムにおける、ハードウェアモジュール及びユニットの障害の状況を管理するためのプログラムを記憶した記憶媒体であって、ハードウェアモジュール及びユニットの内部の障害情報を検出し、第１の不揮発性記憶装置に記録する障害情報検出記録手段、障害情報検出記録手段によって検出された障害情報に基づいて、当該障害が発生したハードウェアモジュール及びユニットを判断するとともに、当該障害の原因を特定する障害原因特定手段、障害原因特定手段によって特定された障害原因の情報を、当該障害が発生したと判断されたハードウェアモジュール及びユニットヘ伝達する障害原因伝達手段、障害原因特定手段によって特定された障害原因の情報を、当該障害が発生したと判断されたハードウェアモジュール及びユニットの内部に備えられた第２の不揮発性記憶装置に記録する障害原因記録手段、障害情報検出記録手段によって検出された障害情報、および障害原因特定手段によって特定された障害原因情報を表示する表示手段、前記障害原因特定手段が障害原因を特定することができない場合に、前記障害情報検出記録手段により検出された障害情報を、前記障害原因記録手段によって同一時間帯に障害が発生したハードウェアモジュール及びユニットの内部に備えられた第３の不揮発性記憶装置に記録する手段としてコンピュータを機能させるためのプログラムを記憶したコンピュータ読み取り可能な記憶媒体とする。
【００２５】
このような、請求項４の発明は、請求項１および請求項３に対応する発明をコンピュータに実現させるプログラムを記憶した記憶媒体である。
【００２６】
この記憶媒体から読み出されたプログラムより制御されるコンピュータは、請求項１の設備障害管理装置として機能するとともに、請求項３の設備障害管理方法を実現する。
【００２７】
【発明の実施の形態】
以下に、本発明の実施の形態について図面を参照しながら説明する。
【００２８】
（第１の実施の形態）
本発明の第１の実施の形態を図１と図２とを用いて説明する。
【００２９】
図１は、本発明の実施の形態による設備障害管理装置及び設備障害管理方法を適用したシステム構成の一例を示す機能ブロック図である。
【００３０】
本装置および本方法は、例えば磁気ディスク等の記憶媒体に記憶されたプログラムを読み込み、このプログラムによって動作が制御されるコンピュータによって実現される。
【００３１】
なお、このシステム構成は、パソコンや制御システムのヒューマンインタフェースを想定した例であり、本発明はこの構成に何ら限定されるものではない。
【００３２】
図１に示すシステムにおいて、図示一点鎖線で囲まれた範囲は、システムの主装置（以下、単に「主装置」と記す）１である。主装置１の構成要素である、メインプロセッサモジュール２、主メモリモジュール４、ＳＣＳＩ（small computer system interface）インタフェースモジュール５、表示・入力制御モジュール６、伝送制御モジュール７は、内部バス３を介して他の構成要素と互いに各々接続している。
【００３３】
メインプロセッサモジュール２は、主メモリモジュール４上に配置されたプログラムを解読、実行し、主装置１に与えられている機能を実行する。
【００３４】
ＳＣＳＩインタフェースモジュール５は、インタフェースケーブル９を介してＨＤＤ８と接続している。
【００３５】
ＨＤＤ８は、主装置１の外部記憶装置として動作するものである。主装置１のプログラムは、通常このＨＤＤ８に格納されており、ＳＣＳＩインタフェースモジュール５は、電源投入時に必要なプログラムをＨＤＤ８から読み出し、主メモリモジュール４に格納する。主メモリモジュール４に格納されたプログラムは、更にメインプロセッサモジュール２によって逐次読み出され、解読、実行される。
【００３６】
表示・入力制御モジュール６は、主装置１の使用者に必要な情報を、使用者が理解できる形にして、インタフェースケーブル１２を介して表示装置１０に表示する。また、主装置１の使用者は、入力装置１１から操作指令を入力することにより、この操作指令をインタフェースケーブル１３を介して表示・入力制御モジュール６へ伝えることができる。
【００３７】
表示・入力制御モジュール６は、更にこの操作指令を内部バス３を経由して、メインプロセッサモジュール２、主メモリモジュール４へ伝達する。
【００３８】
伝送制御モジュール７は、伝送路１４を介して他の装置との情報交換を行うためのインタフェースモジュールである。
【００３９】
本発明の第１の実施の形態による設備障害管理装置および設備障害管理方法は、システムを構成しているモジュールまたはユニットに組み込まれている。
【００４０】
以下、システムを構成しているメインプロセッサモジュール２、主メモリモジュール４、ＳＣＳＩインタフェースモジュール５、表示・入力制御モジュール６、伝送制御モジュール７、ＨＤＤ８、表示装置１０、入力装置１１に組み込んだ本実施の形態による設備障害管理装置および設備障害管理方法の構成について説明する。
【００４１】
主装置１内部のモジュールであるメインプロセッサモジュール２、主メモリモジュール４、ＳＣＳＩインタフェースモジュール５、表示・入力制御モジュール６、伝送制御モジュール７、および主装置１にインターフェイスを介して接続している外部ユニットであるＨＤＤ８、表示装置１０、入力装置１１には、それぞれ内部に発生する障害を検出する障害検出手段２０ａ、２０ｂ、２０ｃ、２０ｄ、２０ｅ、２０ｆ、２０ｇ、２０ｈ（以下、「２０ａ〜ｈ」のように表す）、内部に発生した障害情報を記録する障害記録手段２１ａ〜ｈ、後述する障害原因特定プログラム２４ｂによって特定された障害原因を後述する障害原因記録プログラム２５ｂに指定されたモジュール又はユニットに記録する障害原因記録手段２２ａ〜ｈを個別に備えている。
【００４２】
また、主メモリモジュール４には、障害の検出および記録動作をメインプロセッサモジュール２に指示する障害検出・記録プログラム２３ｂ、複数の障害情報から障害原因を特定する動作をメインプロセッサモジュール２に指示する障害原因特定プログラム２４ｂ、前記障害原因の情報を所定のモジュール又はユニットヘ書き込む指示をメインプロセッサモジュール２に与える障害原因記録プログラム２５ｂ、障害の情報および障害原因特定プログラム２４ｂによって決定された障害原因情報を表示装置１０から表示する指示をメインプロセッサモジュール２に与える障害情報・障害原因情報読み出し・表示プログラム２６ｂを備えている。更に、主メモリモジュール４は、障害の情報を記録するエリアである障害管理テーブル２７ｂと、障害を判定する基準となるデータを備えている障害判定データベース２８ｂとを備えている。
【００４３】
次に、以上のように構成した本実施の形態の設備障害管理方法を適用した設備障害管理装置の作用について、ＳＣＳＩインタフェースモジュール５にて障害が検出された場合を例として説明する。
【００４４】
上述したように、各モジュールおよび各ユニットは、それぞれに障害検出手段２０ａ〜ｈを備えている。
【００４５】
いま、ＳＣＳＩインタフェースモジュール５において障害が発生すると、ＳＣＳＩインタフェースモジュール５が備えている障害検出手段２０ｃにて障害が検出される。この障害は、１種類とは限らず、装置、障害の発生状況によっては複数種類の障害が検出される場合もある。
【００４６】
このように検出された障害情報、すなわち障害の種類、障害の発生時刻、障害発生時に動作していた作業、などの情報を当該モジュールであるＳＣＳＩインタフェースモジュール５に備えられた障害記録手段２１ｃが記録する。
【００４７】
また、これら障害情報は、障害が発生すると、メインプロセッサモジュール２へ通知されると共に、主メモリモジュール４上の障害管理テーブル２７ｂへ記録される。障害発生の情報が、メインプロセッサモジュール２に通知されると、障害原因特定プログラム２４ｂが起動し、障害原因特定プログラム２４ｂは複数のモジュールまたはユニットから通知された障害情報を考慮して障害原因の特定を行う。
【００４８】
尚、上述した例では、同時に他のモジュールおよびユニットから障害発生の通知がない場合であるが、他のモジュールやユニットにて同時に発生した障害情報がある場合には、障害原因特定プログラム２４ｂは全ての障害の内容を考慮して、どのモジュール、どのユニットに障害の原因が発生したのかを、障害判定データベース２８ｂを参照することにより特定する。
【００４９】
この場合、さらに、障害原因記録プログラム２５ｂが起動し、障害原因特定プログラム２４ｂにおいて特定された障害原因を、主メモリモジュール４上の障害管理テーブル２７ｂへ記録する。
【００５０】
図２は、障害管理テーブル２７ｂの一例を示す図であり、障害発生時刻、検出されたモジュール又はユニット、原因の発生したモジュールまたはユニット、障害原因、障害発生時に実行されていた処理を記録できるようになっている。使用者は、必要に応じて、このような障害管理テーブル２７ｂを参照することによって、発生した障害の原因を分析することができる。
【００５１】
上述したように、本実施の形態の設備障害管理装置および設備障害管理方法においては、上記のような作用により、障害が発生した場合、障害の発生時刻、障害が発生したユニットおよびモジュール、障害原因を明確にすることができる。
【００５２】
また、障害が発生したユニットおよびモジュールに障害状況を記録することによって、各ユニットおよびモジュール単位での障害履歴の確認をすることが可能となる。
【００５３】
（第２の実施の形態）
すなわち、本発明の第２の実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムは、図１および図２に示す前述した第１の実施の形態において、他のモジュールやユニットにて発生した障害情報があり、これらの情報を基に障害判定データベース２８ｂを参照しても、どのモジュール、どのユニットに原因となる障害が発生したのかを特定できない場合に対処する手段を備えたものである。
【００５４】
なお、本実施の形態においても、第１の実施の形態と同様、ＳＣＳＩインタフェースモジュール５にて障害が検出された場合を例として説明する。
【００５５】
すなわち、本実施の形態の設備障害管理装置および設備障害管理方法が適用されるシステムの場合は、図１に示すように、障害原因記録プログラム２５ｂによって障害原因特定プログラム２４ｂへ入力された障害情報の全てを、図２に示すような障害管理テーブル２７ｂへ記録すると共に、同一時間帯に障害の発生したモジュールおよびユニットの障害原因記録手段２２に、障害が発生した全てのモジュールおよびユニットを記録するようにする。
【００５６】
主装置１の使用者は、必要に応じて、関連するモジュールおよびユニットの障害原因記録手段２２から障害の関連情報を取り出すことによって、障害の原因を推定することができる。また、これらの情報は、メインプロセッサモジュール２により表示装置１０に表示することもできる。
【００５７】
上述したように、本実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムにおいては、上記のような作用により、障害が発生した場合に、その障害の発生時刻、障害が発生したユニットおよびモジュールを明確にすることができる。
【００５８】
また、障害が発生したユニットおよびモジュールに自身の障害状況のみならず、同一時間帯に障害が発生した全てのユニットおよびモジュールの障害状況を記録することによって、障害原因が特定できない場合においても、ユニットおよびモジュール単体での障害履歴が確認できるため、障害解析に役立てることができる。
【００５９】
（第３の実施の形態）
すなわち、本発明の第３の実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムは、図１および図２に示す前述した第１または第２の実施の形態において、障害記録手段２１ａ〜ｈ、および障害原因記録手段２２ａ〜ｈを、不揮発性記憶装置によりそれぞれ構成する。
【００６０】
本実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムにおいては、以上のような手段を講じることによって、設備障害管理装置本体からモジュール及びユニットを取り外しても、モジュール及びユニットの記録内容が喪失しない。よって、モジュール及びユニットのみを設備障害管理装置本体から取り外し、工場、解析部門へ持ち込んで障害発生の履歴確認及び解析をすることが可能となる。
【００６１】
（第４の実施の形態）
本発明の第４の実施の形態を図２、図３を用いて説明する。
【００６２】
図３は、本発明の実施の形態による設備障害管理装置および設備障害管理方法を適用したシステムの一例を示す機能ブロック図である。
【００６３】
本装置および本方法は、例えば磁気ディスク等の記憶媒体に記憶されたプログラムを読み込み、このプログラムによって動作が制御されるコンピュータによって実現される。
【００６４】
すなわち、本発明の第４の実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムは、図３に示す様に、２つの主装置３１、４１を、伝送路幹線５６を介して相互に接続している。また、それぞれの主装置３１、４１は、内部バス３３、４３を備えており、この内部バス３３、３５によって、それぞれ以下の４つのモジュールである、メインプロセッサモジュール３２、４２、主メモリモジュール３４、４４、機能モジュール３５、４５、伝送制御モジュール３７、４７を互いに接続している。
【００６５】
メインプロセッサモジュール３２、４２は、主メモリモジュール３４、４４に格納された種々のプログラムを解読、実行し、主装置３１、４１に与えられた機能を実行するものである。機能モジュール３５、４５は、特定の機能を実行する機能を有したモジュールである。
【００６６】
伝送制御モジュール３７、４７は、他の装置との情報交換を行うためのインタフェースモジュールであり、おのおの伝送路支線５４、５５を経由して他の装置との共通ネットワーク幹線である伝送路幹線５６と接続している。
【００６７】
更に、それぞれの主装置３１、４１内部の各モジュールには、本実施の形態にある設備障害管理装置が組み込まれている。すなわち、各モジュール内部に発生する障害を検出するための障害検出手段２０ｉ、２０ｊ、２０ｋ、２０ｌ、２０ｍ、２０ｎ、２０ｏ、２０ｐ（以下、「２０ｉ〜ｐ」のように表す）、各モジュール内部に発生した障害情報を当該モジュールに記録する障害記録手段２１ｉ〜ｐ、後述する障害原因特定プログラム２４ｊ、２４ｎによって特定された障害原因を、後述する障害原因記録プログラム２５ｊ、２５ｎによって指定されたモジュールに記録する障害原因記録手段２２ｉ〜ｐを備えている。
【００６８】
また、第１の主装置３１および第２の主装置４１の主メモリモジュール３４、４４は、障害検出とその記録とをメインプロセッサモジュール３２、４２に指示するための障害検出・記録プログラム２３ｊ、２３ｎ、各モジュールから入力された複数の障害情報から真の障害原因を特定する動作をメインプロセッサモジュール３２、４２に指示するための障害原因特定プログラム２４ｊ、２４ｎ、特定された障害原因を各モジュールヘ書き込む指示をメインプロセッサモジュール３２、４２に与える障害原因記録プログラム２５ｊ、２５ｎ、各モジュールで検出された障害情報および障害原因特定プログラム２４ｊ、２４ｎで決定された真の障害原因情報を表示装置（図示せず）に表示する指示をメインプロセッサモジュール３２、４２に与える障害情報・障害原因情報読み出し・表示プログラム２６ｊ、２６ｎを備えている。
【００６９】
更に、第１の主装置３１および第２の主装置４１の主メモリモジュール３４、４４は、障害情報を記録するエリアである障害管理テーブル２７ｊ、２７ｎと、障害を判定する基準となるデータを備えている障害判定データベース２８ｊ、２８ｎとを備えている。
【００７０】
次に、以上のように構成した本実施の形態の設備障害管理方法が適用された設備障害管理装置の作用について説明する。
【００７１】
なお、ここでは、第２の主装置４１の伝送制御モジュール４７において障害が発生した場合における障害の検出方法を例として説明する。
【００７２】
いま、第２の主装置４１の伝送制御モジュール４７にて動作上の障害が発生すると、伝送制御モジュール４７の障害検出手段２０ｏがこの障害を検出する。この障害は、１種類とは限らず、装置や障害の発生状況によって複数種類の障害が検出される場合もある。
【００７３】
この場合、第２の主装置４１の伝送制御モジュール４７が備えている障害記録手段２１ｏに検出された障害情報（障害の種類、障害の発生時刻、障害発生時に動作していた作業など）は、第２の主装置４１のメインプロセッサモジュール４２へ通知されると共に、第２の主装置４１の主メモリモジュール４４の障害管理テーブル２７ｎへ記録される。
【００７４】
メインプロセッサモジュール４２へ障害発生の通知が行われると、障害原因特定プログラム２４ｎが起動し、障害原因特定プログラム２４ｎは複数のモジュールから通知された障害の内容も考慮して、障害原因の特定を行う。
【００７５】
尚、上述した例は、同時に他のモジュールから障害発生の通知がない場合であるが、他のモジュールにて発生した障害情報がある場合には、全ての障害の内容を考慮して、どのモジュールに障害の原因が発生したのかを障害判定データベース２８ｎを参照することにより特定する。
【００７６】
次に、障害原因記録プログラム２５ｎが起動し、障害原因特定プログラム２４ｎにおいて特定された障害原因を、第２の主装置４１の主メモリモジュール４４の障害管理テーブル２７ｎへ記録する。更に、この障害原因は伝送路幹線５６を介して、接続している他の装置である第１の主装置３１の主メモリモジュール３４の障害管理テーブル２７ｊにも記録される。
【００７７】
図２は、障害管理テーブル２７ｎの一例を示す図であり、障害発生時刻、検出されたモジュール、原因の発生したモジュール、障害原因、障害発生時に実行されていた処理を記録できるようになっていると共に、関連するモジュールの障害原因記録手段２２に障害が発生した全てのモジュールを記録するようにする。
【００７８】
尚、伝送制御モジュールにおける障害は、障害が発生した当該装置における伝送制御モジュールのみならず、同一ネットワーク上に接続している他の装置における伝送制御モジュールにおいても検出される。
【００７９】
すなわち、本実施の形態におけるように、第２の主装置４１の伝送制御モジュール４７に障害が発生した場合、その障害情報は、第１の主装置３１の伝送制御モジュール３７の障害検出手段２０ｌにおいても同様に検出され、この障害の種類、障害の発生時刻、障害発生時に動作していた作業などの情報が、障害記録手段２１ｌに記録される。更に、障害検出手段２０ｌにおいて検出された障害情報は、メインプロセッサモジュール３２へ通知されると共に、主メモリモジュール３４の障害管理テーブル２７ｊに記録される。
【００８０】
このようにして、第１の主装置３１の主メモリモジュール３４の障害管理テーブル２７ｊは、第２の主装置４１の伝送制御モジュール４７で発生した障害の情報を、第２の主装置４１の障害管理テーブル２７ｎから取得する場合と、伝送制御モジュール３７を介して取得する場合との２つの情報として取得する。
【００８１】
障害情報がメインプロセッサモジュール３２へ通知されると、主メモリモジュール３４の障害原因特定プログラム２４ｊが起動し、障害原因の特定を行う。
【００８２】
尚、ここでは、伝送制御モジュール３７、４７のうちいずれのモジュールに障害の原因が発生したのかを、障害判定データベース２８ｊを参照することにより特定する。
【００８３】
次に、障害原因記録プログラム２５ｊが起動し、障害原因特定プログラム２４ｊにて特定された障害原因を、第１の主装置３１の主メモリモジュール３４の障害管理テーブル２７ｊへ記録すると共に、伝送制御モジュール３７、４７の障害原因記録手段２２ｌ、２２ｏに、障害発生モジュール、ユニット名、時刻を記録する。
【００８４】
また、これら障害情報をシステム全体を管理する特定の主装置を設け、そこで一元管理することも可能である。図３の場合、第１の主装置３１を、システム全体を管理する主装置として動作させると、システム内の全ての障害情報は、第１の主装置３１へ転送され、主メモリモジュール３４の障害管理テーブル２７ｊに反映することができる。これにより、使用者は、障害管理テーブル２７ｊから障害原因情報を取り出すことによって、システム内における障害の全情報を把握することができる。
【００８５】
上述したように、本実施の形態の設備障害管理装置および設備障害管理方法を適用したシステムにおいては、上記のような作用により、障害発生時刻と障害の状況だけでなく、障害発生時に実行されていた処理および関連モジュールでの障害検出状況が明確になり、障害発生時の解析を容易に行うことができる。
【００８６】
更に、これらの設備障害管理装置が接続しているネットワークに、公衆回線などを利用することにより、ユーザ側で使用されている設備の稼働状況を、メーカの品質管理部門にて、リモート診断することも可能となる。
【００８７】
なお、上述した各実施の形態において記載した方法は、コンピュータに実行させることのできるプログラムとして、例えば磁気ディスク（フロッピーディスク、ハードディスク等）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤ等）、半導体メモリなどの記憶媒体に書き込んで各種装置に適用したり、通信媒体により伝送して各種装置に適用することも可能である。本装置を実現するコンピュータは、記憶媒体に記憶されたプログラムを読み込み、このプログラムによって動作が制御されることにより、上述した処理を実行する。
【００８８】
【発明の効果】
以上説明したように、本発明の設備障害管理装置および設備障害管理方法、ならびに記憶媒体によれば、多くのモジュール、ユニットから構成されるシステムにおいて障害が発生した場合においても、システムの使用者の能力に依存することなく、その回復作業、故障解析、および修理作業を容易に実施することができる。
【００８９】
以上により、障害履歴の客観的な分析が可能となり、もって、障害監視能力や障害回復能力に優れた設備障害管理装置および設備障害管理方法を実現することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態による設備障害管理装置および設備障害管理方法を適用したシステムの一例を示す図。
【図２】本発明による障害管理テーブルのフォーマットの一例を示す図。
【図３】本発明の第３の実施の形態による設備障害管理装置および設備障害管理方法を適用したシステムの一例を示す構成図。
【符号の説明】
１…システムの主装置、
２…メインプロセッサモジュール、
３…内部バス、
４…主メモリモジュール、
５…ＳＣＳＩインタフェースモジュール、
６…表示・入力制御モジュール、
７…伝送制御モジュール、
８…ＨＤＤ、
９、１２、１３…インタフェースケーブル、
１０…表示装置、
１１…入力装置、
１４…伝送路、
２０…障害検出手段、
２１…障害記録手段、
２２…障害原因記録手段、
２３…障害検出・記録プログラム、
２４…障害原因特定プログラム、
２５…障害原因記録プログラム、
２６…障害情報・障害原因情報読み出し・表示プログラム、
２７…障害管理テーブル、
２８…障害判定データベース、
３１…第１の主装置、
３２、４２…メインプロセッサモジュール、
３３、４３…内部バス、
３４、４４…主メモリモジュール、
３５、４５…機能モジュール、
３７、４７…伝送制御モジュール、
４１…第２の主装置、
５４、５５…伝送路支線、
５６…伝送路幹線。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an equipment failure management apparatus, an equipment failure management method, and a storage medium for managing the failure status of each hardware module and each unit in a system configured by combining hardware modules, such as interface cards, with units such as printers and hard disk drives.
[0002]
[Prior art]
This type of conventional equipment failure management apparatus and equipment failure management method will be described below.
[0003]
That is, conventionally, in a system configured by combining hardware modules such as interface cards (hereinafter referred to as “modules”) with units such as printers and hard disk drives (hereinafter referred to as “HDDs”), a failure of each module or unit is detected individually by a diagnostic circuit built into that module or unit, and the detection result is reported to the user of the system as an alarm by the human interface (hereinafter referred to as “HMI”) section of the system.
[0004]
The user of the system then examines the reported failure information to determine what kind of failure has occurred, and restores the system in use by replacing the module or unit in which the failure or fault occurred.
[0005]
[Problems to be solved by the invention]
However, such conventional equipment failure management apparatuses and equipment failure management methods have the following problems.
[0006]
That is, in the case of a system composed of many modules and units, the user of the system cannot easily make an accurate determination, and often takes time to recover the system.
[0007]
In addition, a module or unit presumed to be the cause of the failure, once replaced with a good part and removed from the system, is sent to the factory's quality assurance department for failure analysis and repair. In some cases, however, the cause could not easily be determined even when reproduction tests were performed at the factory.
[0008]
Furthermore, because the modules and units have no means of recording information themselves, the failure history of each module or unit depends on a paper record, attached to the module or unit, in which the failure history is written in a prescribed format.
[0009]
However, this information is often written according to the subjective judgment of the person who discovered the failure or of the service technician, which hinders objective analysis. In particular, analyzing a failure that occurred between systems connected to each other via a transmission line has been a difficult problem.
[0010]
The present invention has been made in view of these circumstances. Its object is to provide an equipment failure management apparatus, an equipment failure management method, and a storage medium with which, even when a failure occurs in a system composed of many modules and units, recovery work, failure analysis, and repair work can be carried out easily without depending on the ability of the user, and which therefore offer excellent failure monitoring and failure recovery capability.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, the present invention takes the following measures.
[0012]
That is, the invention of claim 1 is an equipment failure management apparatus that manages the failure status of hardware modules and units in a system configured by combining hardware modules and units having various functions, the apparatus comprising: failure information detection and recording means for detecting failure information inside the hardware modules and units and recording it in a first nonvolatile storage device; failure cause identification means for determining, based on the failure information detected by the failure information detection and recording means, the hardware module and unit in which the failure occurred and for identifying the cause of the failure; failure cause transmission means for transmitting the failure cause information identified by the failure cause identification means to the hardware module and unit determined to have failed; failure cause recording means for recording the failure cause information identified by the failure cause identification means in a second nonvolatile storage device provided inside the hardware module and unit determined to have failed; and display means for displaying the failure information detected by the failure information detection and recording means and the failure cause information identified by the failure cause identification means. When the failure cause identification means cannot identify the cause of the failure, the failure cause recording means records the failure information detected by the failure information detection and recording means in a third nonvolatile storage device provided inside each hardware module and unit in which a failure occurred in the same time period.
According to a second aspect of the present invention, in the equipment fault management apparatus according to the first aspect, a plurality of equipment fault management apparatus bodies are connected to each other via a network.
[0013]
The invention of claim 3 is an equipment failure management method in which the failure status of hardware modules and units, in a system configured by combining hardware modules and units having various functions, is managed by a processor module that executes a program for managing that failure status. The program causes the processor module to execute a first procedure of detecting and recording failure information of the hardware modules and units, a second procedure of determining the location of the failure corresponding to the failure information and identifying its cause, a third procedure of transmitting and recording the failure information, and a fourth procedure of displaying the failure information and the identified cause.
By executing the first procedure in accordance with the program, the processor module detects failure information inside the hardware modules and units and records it in the first nonvolatile storage device. By executing the second procedure, it determines, based on the detected failure information, the hardware module and unit in which the failure occurred and identifies the cause of the failure. By executing the third procedure, it transmits the identified failure cause information to the hardware module and unit determined to have failed and records it in the second nonvolatile storage device provided inside them. By executing the fourth procedure, it displays the failure information and the failure cause information.
Further, if the cause cannot be identified in the second procedure, the program causes a fifth procedure to be executed, in which the failure information is recorded in the hardware modules and units in which a failure occurred in the same time period. By executing the fifth procedure in accordance with the program, the processor module, when the cause of the failure cannot be identified, records the detected failure information in the third nonvolatile storage device provided inside each hardware module and unit in which a failure occurred in the same time period.
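The five procedures can be pictured as one short control flow. The following Python sketch is illustrative only and is not taken from the patent: the names Module, first_nv, second_nv, third_nv, manage_failures, identify, and display are hypothetical, and plain in-memory lists stand in for the three nonvolatile storage devices.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a hardware module or unit with three on-board logs."""
    name: str
    first_nv: list = field(default_factory=list)   # raw failure information
    second_nv: list = field(default_factory=list)  # identified failure causes
    third_nv: list = field(default_factory=list)   # co-occurring failures, cause unknown


def manage_failures(modules, detected, identify, display):
    """Run the five procedures once.

    detected maps a module name to its failure information.
    identify returns (culprit_name, cause), or None when no cause can be identified.
    """
    # Procedure 1: record each detected failure in the module that detected it.
    for name, info in detected.items():
        modules[name].first_nv.append(info)

    # Procedure 2: determine the failed module and identify the cause.
    result = identify(detected)

    if result is not None:
        # Procedure 3: pass the cause to the failed module and record it there.
        culprit, cause = result
        modules[culprit].second_nv.append(cause)
    else:
        # Procedure 5: no cause found, so every module that failed in this
        # time period receives the complete set of failure information.
        snapshot = dict(detected)
        for name in detected:
            modules[name].third_nv.append(snapshot)

    # Procedure 4: display the failure information and the cause (if any).
    display(detected, result)
    return result
```

In this sketch the fallback branch is the only place where one module stores information about another module's failure, which is the point of the fifth procedure.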
[0014]
Accordingly, with the equipment failure management apparatus and equipment failure management method of the inventions of claims 1 and 3, when a failure occurs, the time at which it occurred, the unit and module in which it occurred, and its cause can be made clear.
[0015]
Also, by recording the failure status in the unit and module in which the failure has occurred, it is possible to check the failure history for each unit and module.
[0019]
In addition, by recording in each failed unit and module not only its own failure status but also the failure status of all units and modules in which a failure occurred in the same time period, the failure history can be checked from a single unit or module even when the cause of the failure cannot be identified, which is useful for failure analysis.
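One way to picture recording by "same time period" is to group failure events whose timestamps lie close together and then copy each group to every module in it. The sketch below is an assumption-laden illustration: the patent does not define how long a time period is, so the window parameter, the function names group_by_time_window and record_co_occurring, and the event tuple layout are all hypothetical.

```python
from datetime import datetime, timedelta


def group_by_time_window(events, window):
    """Group (timestamp, module_name, failure_type) events.

    An event joins the current group when it occurs within `window` of the
    previous event; otherwise it starts a new group. The window length is an
    assumption, since the source does not define "same time period".
    """
    groups = []
    for event in sorted(events, key=lambda e: e[0]):
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def record_co_occurring(events, window, module_logs):
    """Append each group of co-occurring failures to the log of every module in it."""
    for group in group_by_time_window(events, window):
        for name in {module_name for _, module_name, _ in group}:
            module_logs.setdefault(name, []).append(list(group))
    return module_logs
```

With a five-second window, two failures two seconds apart end up in both modules' logs, while a failure ten minutes later is logged on its own.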
[0020]
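On the other hand, the invention of claim 5 is the equipment failure management apparatus of the invention of claim 1 or claim 3, characterized in that the failure information detection and recording means and the failure cause recording means are each constituted by a nonvolatile storage device.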
[0021]
Therefore, in the equipment failure management apparatus according to the fifth aspect, even if the module and unit are removed from the equipment failure management apparatus main body, the recorded contents of the module and unit are not lost. Therefore, it is possible to remove only the modules and units from the equipment failure management apparatus main body and bring them into the factory or analysis department to check and analyze the history of failure occurrence.
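The benefit of nonvolatile recording can be illustrated with a log that is written to persistent storage on every update, so that a different reader (for example, at the factory) sees the same history later. This is a minimal sketch under stated assumptions: a JSON file stands in for the module's nonvolatile storage device, and the class name ModuleFailureLog and the entry fields are hypothetical.

```python
import json
from pathlib import Path


class ModuleFailureLog:
    """Hypothetical per-module failure log; a JSON file stands in for nonvolatile memory."""

    def __init__(self, path):
        self.path = Path(path)

    def read(self):
        """Return all recorded entries, or an empty list if nothing was recorded yet."""
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def append(self, entry):
        """Add one entry and write the whole log back, so it survives power loss or removal."""
        entries = self.read()
        entries.append(entry)
        self.path.write_text(json.dumps(entries), encoding="utf-8")
```

A second ModuleFailureLog object opened on the same path reads back the entries written by the first, which mirrors a module being removed from the main apparatus and examined elsewhere.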
[0023]
In particular, with the equipment failure management apparatus of the invention of claim 2, the failure data of the equipment failure management apparatus can be read out via a public line or the like. As a result, the operating status of equipment at the user's site can be diagnosed remotely by the manufacturer's quality control department.
[0024]
Furthermore, the invention of claim 4 is a computer-readable storage medium storing a program for managing the failure status of hardware modules and units in a system configured by combining hardware modules and units having various functions, the program causing a computer to function as: failure information detection and recording means for detecting failure information inside the hardware modules and units and recording it in a first nonvolatile storage device; failure cause identification means for determining, based on the failure information detected by the failure information detection and recording means, the hardware module and unit in which the failure occurred and for identifying the cause of the failure; failure cause transmission means for transmitting the failure cause information identified by the failure cause identification means to the hardware module and unit determined to have failed; failure cause recording means for recording that failure cause information in a second nonvolatile storage device provided inside the hardware module and unit determined to have failed; display means for displaying the failure information detected by the failure information detection and recording means and the failure cause information identified by the failure cause identification means; and means for recording, when the failure cause identification means cannot identify the cause of the failure, the failure information detected by the failure information detection and recording means, by way of the failure cause recording means, in a third nonvolatile storage device provided inside each hardware module and unit in which a failure occurred in the same time period.
[0025]
The invention of claim 4 is thus a storage medium storing a program that causes a computer to implement the inventions corresponding to claims 1 and 3.
[0026]
A computer controlled by the program read from this storage medium functions as the equipment failure management apparatus of claim 1 and implements the equipment failure management method of claim 3.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0028]
(First embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 and 2.
[0029]
FIG. 1 is a functional block diagram showing an example of a system configuration to which an equipment failure management apparatus and an equipment failure management method according to an embodiment of the present invention are applied.
[0030]
The apparatus and the method are realized by a computer that reads a program stored in a storage medium such as a magnetic disk and whose operation is controlled by the program.
[0031]
This system configuration is an example assuming a human interface of a personal computer or a control system, and the present invention is not limited to this configuration.
[0032]
In the system shown in FIG. 1, the area enclosed by the dash-dot line is the main device of the system (hereinafter simply referred to as the “main device”) 1. The main processor module 2, main memory module 4, SCSI (small computer system interface) interface module 5, display / input control module 6, and transmission control module 7, which are components of the main device 1, are each connected to the other components via the internal bus 3.
[0033]
The main processor module 2 decodes and executes a program arranged on the main memory module 4 and executes a function given to the main device 1.
[0034]
The SCSI interface module 5 is connected to the HDD 8 via an interface cable 9.
[0035]
The HDD 8 operates as an external storage device of the main device 1. The program of the main device 1 is normally stored in the HDD 8, and the SCSI interface module 5 reads out a program required when the power is turned on from the HDD 8 and stores it in the main memory module 4. The program stored in the main memory module 4 is sequentially read out by the main processor module 2, and is decoded and executed.
[0036]
The display / input control module 6 displays information necessary for the user of the main device 1 on the display device 10 through the interface cable 12 in a form that the user can understand. Further, the user of the main device 1 can transmit the operation command to the display / input control module 6 through the interface cable 13 by inputting the operation command from the input device 11.
[0037]
The display / input control module 6 further transmits this operation command to the main processor module 2 and the main memory module 4 via the internal bus 3.
[0038]
The transmission control module 7 is an interface module for exchanging information with other devices via the transmission path 14.
[0039]
The equipment failure management apparatus and equipment failure management method according to the first embodiment of the present invention are incorporated in modules or units constituting the system.
[0040]
The following describes the configuration of the equipment failure management apparatus and equipment failure management method of this embodiment as incorporated in the main processor module 2, the main memory module 4, the SCSI interface module 5, the display / input control module 6, the transmission control module 7, the HDD 8, the display device 10, and the input device 11 that constitute the system.
[0041]
The main processor module 2, main memory module 4, SCSI interface module 5, display / input control module 6, and transmission control module 7, which are modules inside the main device 1, and the HDD 8, display device 10, and input device 11, which are external units connected to the main device 1 via interfaces, are each individually provided with: failure detection means 20a, 20b, 20c, 20d, 20e, 20f, 20g, and 20h (hereinafter written as “20a to 20h”) for detecting failures that occur internally; failure recording means 21a to 21h for recording information on failures that occurred internally; and failure cause recording means 22a to 22h for recording the failure cause identified by the failure cause identification program 24b, described later, in the module or unit designated by the failure cause recording program 25b, described later.
[0042]
The main memory module 4 holds: a failure detection / recording program 23b that instructs the main processor module 2 to detect and record failures; a failure cause identification program 24b that instructs the main processor module 2 to identify the cause of a failure from multiple pieces of failure information; a failure cause recording program 25b that instructs the main processor module 2 to write the failure cause information to a specified module or unit; and a failure information / failure cause information read / display program 26b that instructs the main processor module 2 to display, on the display device 10, the failure information and the failure cause information determined by the failure cause identification program 24b. The main memory module 4 further includes a failure management table 27b, which is an area in which failure information is recorded, and a failure determination database 28b, which holds the data used as criteria for judging failures.
[0043]
Next, the operation of the equipment failure management apparatus to which the equipment failure management method of the present embodiment configured as described above is applied will be described by taking a case where a failure is detected by the SCSI interface module 5 as an example.
[0044]
As described above, each module and each unit includes the failure detection means 20a to 20h.
[0045]
Now, when a failure occurs in the SCSI interface module 5, the failure detection means 20c provided in the SCSI interface module 5 detects the failure. This failure is not limited to one type, and a plurality of types of failures may be detected depending on the apparatus and the occurrence state of the failure.
[0046]
The fault recording means 21c provided in the SCSI interface module 5 as the module records the fault information detected in this way, that is, the fault type, the time of occurrence of the fault, and the work that was operating when the fault occurred. To do.
[0047]
In addition, when the failure occurs, the failure information is notified to the main processor module 2 and recorded in the failure management table 27b on the main memory module 4. When the failure occurrence information is notified to the main processor module 2, the failure cause identification program 24b is activated, and the failure cause identification program 24b identifies the failure cause in consideration of failure information notified from a plurality of modules or units. I do.
[0048]
In the above-described example, no failure is notified from other modules or units at the same time. However, when there is failure information that occurred simultaneously in other modules or units, the failure cause identification program 24b takes all of the failure contents into account and, by referring to the failure determination database 28b, identifies which module or unit caused the failure.
[0049]
In this case, the failure cause recording program 25b is further activated, and the failure cause specified in the failure cause specifying program 24b is recorded in the failure management table 27b on the main memory module 4.
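The identification step described above can be sketched as follows. This is only an illustrative sketch: the specification does not define the format of the failure determination database 28b, so the lookup table `DETERMINATION_DB`, the symptom names, the one-second window used to decide that failures are simultaneous, and the function names are all assumptions introduced for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, List, Optional, Tuple

# A notification is (time, reporting module or unit, failure type).
Notification = Tuple[datetime, str, str]

# Hypothetical failure determination database: maps the set of
# (module, failure type) pairs observed together to (cause location, cause).
DETERMINATION_DB: Dict[FrozenSet[Tuple[str, str]], Tuple[str, str]] = {
    frozenset({("SCSI interface module", "bus timeout")}):
        ("SCSI interface module", "controller fault"),
    frozenset({("SCSI interface module", "bus timeout"),
               ("HDD", "no response")}):
        ("HDD", "drive not responding"),
}


def simultaneous(notifications: List[Notification], at: datetime,
                 window: timedelta = timedelta(seconds=1)) -> List[Notification]:
    """Return the notifications whose time lies within `window` of `at`."""
    return [n for n in notifications if abs(n[0] - at) <= window]


def identify_cause(notifications: List[Notification],
                   at: datetime) -> Optional[Tuple[str, str]]:
    """Look up the combined symptoms; None means the cause is unidentified."""
    symptoms = frozenset((module, kind)
                         for _, module, kind in simultaneous(notifications, at))
    return DETERMINATION_DB.get(symptoms)


t0 = datetime(1999, 3, 10, 9, 0, 0)
single = [(t0, "SCSI interface module", "bus timeout")]
both = single + [(t0 + timedelta(milliseconds=200), "HDD", "no response")]
```

With these assumed rules, a lone notification from the SCSI interface module is attributed to that module, whereas the same notification accompanied by a simultaneous HDD failure is attributed to the HDD.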
[0050]
FIG. 2 is a diagram showing an example of the failure management table 27b, in which the failure occurrence time, the module or unit that detected the failure, the module or unit in which the cause occurred, the cause of the failure, and the processing being executed when the failure occurred can be recorded. The user can analyze the cause of the failure that has occurred by referring to the failure management table 27b as necessary.
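A table with the columns listed above might be represented as in the following minimal sketch. The class names, field names, and sample values are hypothetical and are not taken from FIG. 2; only the five recorded items come from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class FailureRecord:
    """One row of a failure management table (field names are assumed)."""
    occurred_at: datetime                  # failure occurrence time
    detected_by: str                       # module or unit that detected the failure
    processing: str                        # processing running when the failure occurred
    cause_location: Optional[str] = None   # module or unit in which the cause occurred
    cause: Optional[str] = None            # identified cause, if any


class FailureManagementTable:
    """In-memory stand-in for the table held in the main memory module."""

    def __init__(self) -> None:
        self.rows: List[FailureRecord] = []

    def record(self, row: FailureRecord) -> None:
        self.rows.append(row)

    def unresolved(self) -> List[FailureRecord]:
        """Rows whose cause has not yet been identified."""
        return [r for r in self.rows if r.cause is None]


table = FailureManagementTable()
table.record(FailureRecord(datetime(1999, 3, 10, 9, 0, 0),
                           detected_by="SCSI interface module 5",
                           processing="HDD write"))
table.record(FailureRecord(datetime(1999, 3, 10, 9, 5, 0),
                           detected_by="transmission control module 7",
                           processing="frame transmission",
                           cause_location="transmission control module 7",
                           cause="timeout"))
```

Keeping the cause fields optional lets a row be written when the failure is first notified and completed later, once the cause has been identified.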
[0051]
As described above, in the equipment failure management apparatus and equipment failure management method according to the present embodiment, the above-described operation makes it possible, when a failure occurs, to clarify the failure occurrence time, the unit and module in which the failure occurred, and the cause of the failure.
[0052]
Further, by recording the failure status in the unit and module in which the failure has occurred, it is possible to check the failure history for each unit and module.
[0053]
(Second Embodiment)
The system to which the equipment failure management apparatus and the equipment failure management method according to the second embodiment of the present invention are applied differs from the first embodiment shown in FIGS. 1 and 2 in that it provides a means for handling the case where, even when the failure determination database 28b is consulted on the basis of information from other modules and units, it cannot be determined which module or unit caused the failure.
[0054]
In the present embodiment, as in the first embodiment, a case where a failure is detected by the SCSI interface module 5 will be described as an example.
[0055]
That is, in the system to which the equipment failure management apparatus and equipment failure management method of the present embodiment are applied, shown in FIG. 1, the failure cause recording program 25b records all of the failure information input to the failure cause identification program 24b in the failure management table 27b as shown in FIG. 2, and also records all of the modules and units in which failures occurred in the failure cause recording means 22 of each module and unit that failed in the same time period.
[0056]
The user of the main device 1 can estimate the cause of the failure by extracting the related information of the failure from the failure cause recording means 22 of the related module and unit as necessary. These pieces of information can also be displayed on the display device 10 by the main processor module 2.
[0057]
As described above, in the system to which the equipment failure management apparatus and the equipment failure management method according to the present embodiment are applied, the above-described operation makes it possible, when a failure occurs, to clarify the failure occurrence time and the units and modules in which the failure occurred.
[0058]
Moreover, because each failed unit and module records not only its own failure status but also the failure status of all units and modules that failed during the same time period, the failure history can be confirmed from a single module even when the cause of the failure cannot be identified, which is useful for failure analysis.
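The handling of an unidentified cause in this embodiment can be illustrated with the sketch below. It assumes that "the same time period" means within a fixed window of one second, which the specification does not state, and it models the failure cause recording means 22 of each module as a plain list keyed by a module name; both are assumptions made only for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Failure = Tuple[datetime, str, str]  # (time, module or unit, failure type)


def record_unidentified(failures: List[Failure],
                        cause_logs: Dict[str, List[Failure]],
                        window: timedelta = timedelta(seconds=1)) -> None:
    """Store, in each failed module's log, every failure close to it in time.

    `cause_logs` stands in for the failure cause recording means 22 of the
    individual modules and units; `window` is an assumed definition of
    "the same time period".
    """
    for t_own, module, _ in failures:
        log = cause_logs.setdefault(module, [])
        for other in failures:
            # A module's own failure is included (time difference of zero).
            if abs(other[0] - t_own) <= window and other not in log:
                log.append(other)


t0 = datetime(1999, 3, 10, 9, 0, 0)
failures = [
    (t0, "SCSI interface module 5", "bus timeout"),
    (t0 + timedelta(milliseconds=500), "HDD 8", "no response"),
    (t0 + timedelta(hours=1), "transmission control module 7", "link down"),
]
logs: Dict[str, List[Failure]] = {}
record_unidentified(failures, logs)
```

In this example the SCSI interface module and the HDD each end up holding both near-simultaneous failures, while the transmission control module, which failed an hour later, holds only its own.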
[0059]
(Third embodiment)
That is, in the system to which the equipment failure management apparatus and equipment failure management method according to the third embodiment of the present invention are applied, each of the failure recording means 21a-h and the failure cause recording means 22a-h of the first or second embodiment described above is configured as a nonvolatile storage device.
[0060]
In the system to which the equipment failure management apparatus and equipment failure management method of the present embodiment are applied, adopting the above-described means ensures that the recorded contents of a module or unit are not lost even if the module or unit is removed from the main body of the equipment failure management apparatus. Therefore, it is possible to remove only the modules and units from the main body and bring them to the factory or analysis department to check and analyze the history of failure occurrence.
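The effect of nonvolatile recording can be illustrated by the sketch below, in which a JSON file stands in for the nonvolatile storage device of a module. The file format, the class name `NonvolatileLog`, and the entry fields are assumptions chosen for the example; the specification does not prescribe any particular storage format.

```python
import json
import os
import tempfile
from typing import Dict, List


class NonvolatileLog:
    """File-backed stand-in for a module's nonvolatile failure record."""

    def __init__(self, path: str) -> None:
        self.path = path

    def read(self) -> List[Dict[str, str]]:
        """Return all stored entries, or an empty list if none exist yet."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def append(self, entry: Dict[str, str]) -> None:
        """Add one entry and write the whole record back to the file."""
        entries = self.read()
        entries.append(entry)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(entries, f)


# The module writes its record while installed in the main device ...
path = os.path.join(tempfile.mkdtemp(), "module5.json")
NonvolatileLog(path).append({"time": "1999-03-10T09:00:00",
                             "type": "bus timeout",
                             "cause": "unidentified"})

# ... and the record is read back later through a separate object,
# as would happen after the module is brought to the factory.
recovered = NonvolatileLog(path).read()
```

Because the second `NonvolatileLog` object shares no in-memory state with the first, the recovered entry comes solely from the persistent store.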
[0061]
(Fourth embodiment)
A fourth embodiment of the present invention will be described with reference to FIGS.
[0062]
FIG. 3 is a functional block diagram showing an example of a system to which the equipment failure management apparatus and the equipment failure management method according to the embodiment of the present invention are applied.
[0063]
The apparatus and the method are realized by a computer that reads a program stored in a storage medium such as a magnetic disk and whose operation is controlled by the program.
[0064]
That is, in a system to which the equipment failure management apparatus and equipment failure management method according to the fourth embodiment of the present invention are applied, two main devices 31 and 41 are connected to each other via a transmission line trunk 56, as shown in FIG. 3. The main devices 31 and 41 include internal buses 33 and 43, respectively. Within each main device, the internal buses 33 and 43 interconnect the main processor modules 32 and 42, the main memory modules 34 and 44, the functional modules 35 and 45, and the transmission control modules 37 and 47.
[0065]
The main processor modules 32 and 42 decode and execute various programs stored in the main memory modules 34 and 44, and execute functions given to the main devices 31 and 41. The function modules 35 and 45 are modules having a function of executing a specific function.
[0066]
The transmission control modules 37 and 47 are interface modules for exchanging information with other devices. They are connected via transmission line branch lines 54 and 55, respectively, to the transmission line trunk 56, which is a network trunk line shared with other devices.
[0067]
Furthermore, the equipment failure management apparatus according to the present embodiment is incorporated in each module inside each of the main devices 31 and 41. That is, each module is provided with failure detection means 20i, 20j, 20k, 20l, 20m, 20n, 20o, and 20p (hereinafter written as “20i-p”) for detecting a failure that occurs inside the module; failure recording means 21i-p for recording the information on the failure that occurred in the module; and failure cause recording means 22i-p in which the failure cause identified by the failure cause identification programs 24j and 24n described later is recorded for the modules designated by the failure cause recording programs 25j and 25n described later.
[0068]
Further, the main memory modules 34 and 44 of the first main device 31 and the second main device 41 are provided with failure detection/recording programs 23j and 23n that instruct the main processor modules 32 and 42 to detect and record failures; failure cause identification programs 24j and 24n that instruct the main processor modules 32 and 42 to identify the true cause of a failure from the multiple pieces of failure information input from the modules; failure cause recording programs 25j and 25n that instruct the main processor modules 32 and 42 to write the identified failure cause to each module; and failure information/failure cause information read/display programs 26j and 26n that instruct the main processor modules 32 and 42 to display, on a display device (not shown), the failure information detected by each module and the true failure cause information determined by the failure cause identification programs 24j and 24n.
[0069]
Furthermore, the main memory modules 34 and 44 of the first main device 31 and the second main device 41 include failure management tables 27j and 27n, which are areas for recording failure information, and failure determination databases 28j and 28n, which hold the data serving as criteria for determining failures.
[0070]
Next, the operation of the equipment failure management apparatus to which the equipment failure management method of the present embodiment configured as described above is applied will be described.
[0071]
Here, a failure detection method when a failure occurs in the transmission control module 47 of the second main device 41 will be described as an example.
[0072]
If an operational failure occurs in the transmission control module 47 of the second main device 41, the failure detection means 20o of the transmission control module 47 detects this failure. This failure is not limited to one type, and a plurality of types of failures may be detected depending on the device and the occurrence status of the failure.
[0073]
In this case, the detected failure information (failure type, failure occurrence time, the task that was running when the failure occurred, and so on) is recorded by the failure recording means 21o included in the transmission control module 47 of the second main device 41. It is also notified to the main processor module 42 of the second main device 41 and recorded in the failure management table 27n of the main memory module 44 of the second main device 41.
[0074]
When the main processor module 42 is notified of the occurrence of a failure, the failure cause identification program 24n is activated and identifies the cause of the failure, taking into account the failure contents notified from a plurality of modules.
[0075]
Note that the above example is a case where no failure is notified from another module at the same time. If there is failure information that occurred in another module, the contents of all the failures are taken into account and the failure determination database 28n is consulted to identify which module caused the failure.
[0076]
Next, the failure cause recording program 25n is started, and the failure cause specified by the failure cause specifying program 24n is recorded in the failure management table 27n of the main memory module 44 of the second main device 41. Further, the cause of the failure is also recorded in the failure management table 27j of the main memory module 34 of the first main device 31 which is another connected device via the transmission line trunk line 56.
[0077]
FIG. 2 is a diagram showing an example of the failure management table 27n, in which the failure occurrence time, the module that detected the failure, the module in which the cause occurred, the cause of the failure, and the processing being executed when the failure occurred can be recorded. At the same time, all the failed modules are recorded in the failure cause recording means 22 of the related modules.
[0078]
A failure in the transmission control module is detected not only in the transmission control module in the device in which the failure has occurred, but also in the transmission control module in another device connected on the same network.
[0079]
That is, when a failure occurs in the transmission control module 47 of the second main device 41 as in the present embodiment, the failure is likewise detected by the failure detection means 20l of the transmission control module 37 of the first main device 31, and information such as the type of failure, the time of occurrence, and the task that was running at the time of the failure is recorded in the failure recording means 21l. Further, the failure information detected by the failure detection means 20l is notified to the main processor module 32 and recorded in the failure management table 27j of the main memory module 34.
[0080]
In this way, the failure management table 27j of the main memory module 34 of the first main device 31 obtains the information on the failure that occurred in the transmission control module 47 of the second main device 41 as two pieces of information: one acquired from the failure management table 27n of the second main device 41, and one acquired via the transmission control module 37.
[0081]
When the failure information is notified to the main processor module 32, the failure cause specifying program 24j of the main memory module 34 is activated to specify the cause of the failure.
[0082]
Here, it is specified by referring to the failure determination database 28j which of the transmission control modules 37 and 47 has caused the failure.
[0083]
Next, the failure cause recording program 25j is started, and the failure cause identified by the failure cause identification program 24j is recorded in the failure management table 27j of the main memory module 34 of the first main device 31. In addition, the name of the module or unit in which the failure occurred and the time are recorded in the failure cause recording means 22l and 22o of the transmission control modules 37 and 47, respectively.
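How the first main device might combine its two pieces of information to decide which transmission control module is at fault can be sketched as follows. The specification does not disclose the contents of the failure determination database 28j, so the decision table `DB_28J`, the symptom names, and both function names are hypothetical and serve only to illustrate the flow.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical contents of failure determination database 28j. The key pairs
# the symptom detected locally by transmission control module 37 with the
# symptom reported by the second main device for module 47; the value names
# the module judged to have caused the failure.
DB_28J: Dict[Tuple[str, str], str] = {
    ("receive timeout", "transmitter fault"): "transmission control module 47",
    ("transmitter fault", "receive timeout"): "transmission control module 37",
}


def locate_cause(local_symptom: str,
                 remote_symptom: Optional[str]) -> Optional[str]:
    """Combine the two pieces of information held in table 27j."""
    if remote_symptom is None:
        return None  # only one side reported; the cause stays unidentified
    return DB_28J.get((local_symptom, remote_symptom))


def record_cause(cause_module: str, time: str,
                 cause_logs: Dict[str, List[Tuple[str, str]]]) -> None:
    """Write the failed module's name and the time to both modules' logs."""
    for name in ("22l", "22o"):
        cause_logs.setdefault(name, []).append((cause_module, time))


cause_logs: Dict[str, List[Tuple[str, str]]] = {}
located = locate_cause("receive timeout", "transmitter fault")
if located is not None:
    record_cause(located, "1999-03-10T09:00:00", cause_logs)
```

Writing the same entry to both logs mirrors the description, in which the failure cause recording means 22l and 22o of both transmission control modules receive the name of the failed module and the time.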
[0084]
It is also possible to designate a specific main device to manage the failure information for the entire system and to manage the information centrally. In the case of FIG. 3, when the first main device 31 is operated as the main device that manages the entire system, all failure information in the system can be transferred to the first main device 31 and reflected in the failure management table 27j of the main memory module 34. The user can thereby grasp all the failure information in the system by extracting the failure cause information from the failure management table 27j.
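Central management of this kind amounts to merging the tables of the individual main devices into one. The sketch below assumes that each row is reduced to a time string in ISO 8601 form and a cause, and that the merged table is ordered by time; the row layout, the device names, and the function name `merge_tables` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str]  # (ISO 8601 time, source device, cause)


def merge_tables(tables: Dict[str, List[Tuple[str, str]]]) -> List[Row]:
    """Merge per-device (time, cause) rows into one time-ordered table."""
    merged = [(time, device, cause)
              for device, rows in tables.items()
              for time, cause in rows]
    # ISO 8601 strings of equal format sort chronologically as plain text.
    return sorted(merged)


tables = {
    "main device 31": [("1999-03-10T09:05:00", "receive timeout")],
    "main device 41": [("1999-03-10T09:00:00", "transmitter fault"),
                       ("1999-03-10T09:10:00", "memory error")],
}
central = merge_tables(tables)
```

The merged list gives the user a single chronological view of every failure in the system while keeping the originating device attached to each row.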
[0085]
As described above, in the system to which the equipment failure management apparatus and equipment failure management method of the present embodiment are applied, the above-described operation clarifies not only the failure occurrence time and the failure status but also the processing being executed when the failure occurred and the failure detection status in related modules, so that analysis when a failure occurs can be performed easily.
[0086]
Furthermore, by connecting a public line or the like to the network to which these equipment failure management apparatuses are connected, the manufacturer's quality control department can also remotely diagnose the operating status of equipment in use on the user side.
[0087]
Note that the method described in each of the above embodiments can also be written, as a program executable by a computer, to a storage medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory and applied to various apparatuses, or transmitted over a communication medium and applied to various apparatuses. A computer that implements this apparatus reads the program stored in the storage medium and performs the above-described processing with its operation controlled by this program.
[0088]
[Effects of the invention]
As described above, according to the equipment failure management apparatus, equipment failure management method, and storage medium of the present invention, even when a failure occurs in a system composed of many modules and units, the recovery work, failure analysis, and repair work can be performed easily without depending on the user's ability.
[0089]
As described above, an objective analysis of the fault history is possible, and thus an equipment fault management apparatus and equipment fault management method excellent in fault monitoring capability and fault recovery capability can be realized.
[Brief description of the drawings]
FIG. 1 is a diagram showing an example of a system to which an equipment failure management apparatus and an equipment failure management method according to a first embodiment of the present invention are applied.
FIG. 2 is a diagram showing an example of a format of a failure management table according to the present invention.
FIG. 3 is a configuration diagram showing an example of a system to which an equipment failure management apparatus and an equipment failure management method according to a third embodiment of the present invention are applied.
[Explanation of symbols]
1 ... The main device of the system,
2 ... Main processor module,
3 ... Internal bus,
4 ... main memory module,
5 ... SCSI interface module,
6 ... display / input control module,
7 ... Transmission control module,
8 ... HDD,
9, 12, 13 ... interface cable,
10 ... display device,
11 ... Input device,
14 ... transmission line,
20 ... failure detection means,
21 ... Failure recording means,
22 ... Failure cause recording means,
23 ... Fault detection / recording program,
24 ... Fault cause identification program,
25 ... Failure cause recording program,
26 ... Fault information / failure cause information read / display program,
27 ... Failure management table,
28 ... Failure judgment database,
31 ... first main device,
32, 42 ... main processor module,
33, 43 ... Internal bus,
34, 44 ... main memory module,
35, 45 ... functional modules,
37, 47 ... transmission control module,
41 ... second main device,
54, 55 ... transmission line branch line,
56 ... Transmission line trunk line.

Claims

An equipment failure management apparatus that manages the failure status of hardware modules and units in a system configured by combining hardware modules and units having various functions, the apparatus comprising:
failure information detection recording means for detecting failure information inside the hardware modules and units and recording the failure information in a first nonvolatile storage device;
failure cause identifying means for determining, based on the failure information detected by the failure information detection recording means, the hardware module and unit in which the failure has occurred, and for identifying the cause of the failure;
failure cause transmission means for transmitting information on the cause of failure identified by the failure cause identifying means to the hardware module and unit determined to have failed;
failure cause recording means for recording the information on the cause of failure identified by the failure cause identifying means in a second nonvolatile storage device provided inside the hardware module and unit determined to have failed; and
display means for displaying the failure information detected by the failure information detection recording means and the failure cause information identified by the failure cause identifying means,
wherein, when the failure cause identifying means cannot identify the cause of the failure, the failure information detected by the failure information detection recording means is recorded by the failure cause recording means in a third nonvolatile storage device provided inside the hardware modules and units in which failures occurred in the same time period.

The equipment fault management apparatus according to claim 1, wherein a plurality of equipment fault management apparatus main bodies are connected to each other via a network.

An equipment failure management method in which a processor module manages the failure status of hardware modules and units, in a system configured by combining hardware modules and units having various functions, by executing a program for managing that failure status, wherein
the program includes a first procedure for causing the processor module to detect and record failure information of the hardware modules and units, a second procedure for determining the failed unit corresponding to the failure information and identifying the cause, a third procedure for transmitting and recording the failure information, and a fourth procedure for displaying the failure information and the identified cause,
the processor module, by executing the first procedure according to the program, detects the failure information inside the hardware modules and units and records it in a first nonvolatile storage device,
by executing the second procedure, determines, based on the detected failure information, the hardware module and unit in which the failure has occurred and identifies the cause of the failure,
by executing the third procedure, transmits the information on the identified cause of failure to the hardware module and unit determined to have failed and records it in a second nonvolatile storage device provided therein, and
by executing the fourth procedure, displays the failure information and the failure cause information,
the program further includes a fifth procedure for recording, when the cause cannot be identified in the second procedure, the failure information in the hardware modules and units in which failures occurred in the same time period, and
the processor module, by executing the fifth procedure according to the program when the cause of the failure cannot be identified, records the detected failure information in a third nonvolatile storage device provided inside the hardware modules and units in which failures occurred in the same time period.

A computer-readable storage medium storing a program for managing the failure status of hardware modules and units in a system configured by combining hardware modules and units having various functions, the program causing a computer to function as:
failure information detection recording means for detecting failure information inside the hardware modules and units and recording the failure information in a first nonvolatile storage device;
failure cause identifying means for determining, based on the failure information detected by the failure information detection recording means, the hardware module and unit in which the failure has occurred, and for identifying the cause of the failure;
failure cause transmission means for transmitting information on the cause of failure identified by the failure cause identifying means to the hardware module and unit determined to have failed;
failure cause recording means for recording the information on the cause of failure identified by the failure cause identifying means in a second nonvolatile storage device provided inside the hardware module and unit determined to have failed;
display means for displaying the failure information detected by the failure information detection recording means and the failure cause information identified by the failure cause identifying means; and
means for recording, by the failure cause recording means, when the failure cause identifying means cannot identify the cause of the failure, the failure information detected by the failure information detection recording means in a third nonvolatile storage device provided inside the hardware modules and units in which failures occurred in the same time period.