JPH09305439A

JPH09305439A - Fault monitoring and reporting device for distributedly arranged computer systems

Info

Publication number: JPH09305439A
Application number: JP8139492A
Authority: JP
Inventors: Nobuhito Tsuchiya; 伸仁土屋
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1996-05-09
Filing date: 1996-05-09
Publication date: 1997-11-28
Anticipated expiration: 2016-05-09
Also published as: JP3102349B2

Abstract

PROBLEM TO BE SOLVED: To shorten the time required to take countermeasures against a fault and to prevent secondary faults by reporting a hardware fault immediately to the operator at the job site and simultaneously reporting it to the head office computer. SOLUTION: Information about a hardware fault is reported immediately to the operator at the job site and, at the same time, is stored in a hardware fault log file 11 and a fault information file 12 of a job site computer 10. A fault monitoring part 13 detects the occurrence of any new fault at the job site computer 10, and the detected fault is reported to a head office computer 20 by a fault reporting means 14. The reported fault information is displayed on a display device 15. Thus, the time required to take countermeasures against the fault can be shortened and secondary faults can be prevented.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a fault monitoring and reporting device that, in a distributed computer system configured by arranging a large number of computers over a wide area, detects a fault occurring in hardware and promptly reports it to the server computer at the headquarters.

[0002]

[Prior art] A distributed computer system, in which a large number of computers are arranged over a wide area and connected by a WAN or the like to transmit and receive information, is provided with fault reporting means for managing information about faults at the headquarters, so that a fault occurring in the hardware (computers) constituting the system can be dealt with appropriately. Conventionally, in this type of distributed computer system, the operator of the computer in which a fault was discovered reported the occurrence of the fault and its content to the operator of the server computer at the headquarters (hereinafter referred to as the headquarters operator).

[0003] Here, when a hardware fault occurred in a computer constituting the distributed computer system (hereinafter referred to as a local computer), the operator of that local computer (hereinafter referred to as the local operator) detected the hardware fault only after the computer had continued operating with the fault and a program fault or the like had caused a malfunction.

[0004] Further, when the local computer in which the hardware fault occurred was not the server computer at the headquarters of the distributed computer system, the headquarters operator learned of the occurrence of the hardware fault in that local computer, and of its content, only upon receiving the report from the local operator.

[0005] Consequently, the countermeasure for a hardware fault in a local computer was decided by an instruction from the headquarters operator after the report from the local operator, and a great deal of time elapsed between the occurrence of the hardware fault and the execution of the countermeasure.

[0006]

[Problems to be solved by the invention] The first problem of the fault reporting means in the conventional distributed computer system described above is that it takes time before a countermeasure for a fault is executed, so the system cannot be restored promptly. The reason is that faults were reported by a procedure in which the local operator detected the hardware fault only when the affected computer malfunctioned, and then reported it to the headquarters operator.

[0007] The second problem is that a delay in executing the countermeasure for a hardware fault may lead to a secondary fault such as corrupted file data. The reason is that a great deal of time is required from the occurrence of the hardware fault to the execution of the countermeasure, so prompt recovery is not possible.

[0008] An object of the present invention is to solve the above conventional problems by immediately notifying the local operator of a hardware fault that has occurred in a computer constituting the distributed computer system and, at the same time, reporting it to the headquarters computer, thereby shortening the time required to execute a countermeasure for the fault and preventing secondary faults.

[0009]

[Means for solving the problems] To achieve the above object, the fault monitoring and reporting device of the present invention comprises: fault monitoring means, installed in a local computer constituting a distributed computer system, for acquiring fault information about a hardware fault when the hardware fault occurs in the local computer and for notifying an operator of the fault information; fault reporting means for transmitting the fault information acquired by the fault monitoring means to a headquarters computer that constitutes the distributed computer system and is connected to the local computer via a communication line; and fault report receiving means, installed in the headquarters computer, for receiving the fault information and notifying an operator of the occurrence of the hardware fault in the local computer.

[0010] In another aspect, the local computer further comprises first fault information storage means for storing and accumulating fault information about detected hardware faults, and second fault information storage means for storing fault information to be reported to the headquarters computer. The fault monitoring means compares the first fault information storage means with the second fault information storage means and, when the first fault information storage means contains fault information newer than the fault information in the second fault information storage means, writes that fault information to the second fault information storage means. The fault reporting means, when unreported fault information is stored in the second fault information storage means, reads that fault information and transmits it to the headquarters computer.

[0011] In still another aspect, the fault monitoring means periodically compares the first fault information storage means with the second fault information storage means at predetermined timing, independently of normal program jobs.

[0012] In another preferred aspect, the fault report receiving means further comprises third fault information storage means for storing and accumulating the fault information received by the fault report receiving means.

[0013] In another preferred aspect, the headquarters computer further comprises fault information search means for searching the fault information stored in the third fault information storage means according to predetermined search conditions.

[0014]

[Embodiments of the invention] An embodiment of the present invention will be described below in detail with reference to the drawings.

[0015] FIG. 1 is a block diagram showing the configuration of a fault monitoring and reporting device according to one embodiment of the present invention.

[0016] As illustrated, in a distributed computer system in which a large number of computers arranged over a wide area are connected using a WAN or the like, this embodiment comprises local computers 10 located at various sites and a headquarters computer 20 located at the headquarters. Although not specifically shown, the local computer 10 and the headquarters computer 20 are connected by a WAN line or the like. Only the configuration characteristic of this embodiment is shown in the figure; other components are omitted.

[0017] The local computer 10 comprises a hardware fault log file 11 and a fault information file 12 that store information about hardware faults, a fault monitoring unit 13 that detects the occurrence of a new fault in the local computer, fault notification means 14 that notifies the headquarters computer of a detected fault, and a display device 15 that displays fault information to inform the operator of the occurrence of a fault. Detection of hardware faults in the local computer 10, such as DISK faults and CPU faults, is performed as a function of the operating system installed on the local computer 10.

[0018] The hardware fault log file 11 is implemented on a storage device such as a magnetic disk device and, when the operating system discovers a hardware fault during processing, stores information about that hardware fault (hereinafter referred to as fault information). Specifically, it stores an error code indicating the occurrence and type of the fault, the device in which the fault occurred, the time at which the fault occurred, and so on. The output format of the hardware fault log file 11 follows the specifications of the operating system in use.

[0019] The fault information file 12 is implemented on a storage device such as a magnetic disk device and stores the information to be reported to the headquarters computer 20. The stored information is the same as the fault information stored in the hardware fault log file 11. The fault information transmitted to the headquarters computer 20 is retained in the fault information file 12 until fault information about a new hardware fault is stored. Thus, until the next hardware fault occurs, the file retains the fault information most recently transmitted to the headquarters computer 20.

[0020] FIG. 2 shows an example of the format of the information stored in the fault information file 12. As described above, the information stored in the hardware fault log file 11 when a hardware fault occurs is stored in the same format.
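As an illustration only, a record carrying the items named in paragraphs [0018] and [0026] (error code, faulty device, time of occurrence, and the transmission result stamped after reporting) might be modeled as follows. FIG. 2 is not reproduced in this text, so the class name `FaultRecord`, the field names, and the `|`-delimited line layout are all assumptions made for this sketch, not the format used in the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FaultRecord:
    """Hypothetical fault record; field names are assumptions."""
    error_code: str                    # occurrence and type of the fault
    device: str                        # device in which the fault occurred
    occurred_at: datetime              # time at which the fault occurred
    send_result: Optional[str] = None  # transmission result; None = not yet reported

    def to_line(self) -> str:
        # Serialize to one '|'-delimited line (assumed layout; fields must not contain '|').
        result = self.send_result or ""
        return "|".join(
            [self.error_code, self.device, self.occurred_at.isoformat(), result]
        )

    @classmethod
    def from_line(cls, line: str) -> "FaultRecord":
        # Parse a line produced by to_line(); an empty result field means unreported.
        code, device, timestamp, result = line.rstrip("\n").split("|")
        return cls(code, device, datetime.fromisoformat(timestamp), result or None)
```

Keeping the transmission result inside the record lets a single file serve both as the queue of unreported faults and as the trace of what was last sent.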

[0021] The fault monitoring unit 13 is implemented by a program-controlled CPU or the like and performs the following processing to monitor the local computer 10 for faults.

[0022] First, it searches the information stored in the fault information file 12, that is, the information most recently reported to the headquarters computer 20, and the information stored in the hardware fault log file 11, and compares the fault occurrence dates and times of the two. If the fault information stored in the hardware fault log file 11 includes fault information whose occurrence date and time is later than that of the fault information stored in the fault information file 12, it writes that fault information to the fault information file 12. Because this fault information is newer than the fault information previously stored in the fault information file 12, it is known to relate to a hardware fault that occurred after fault information was last transmitted to the headquarters computer 20.
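The comparison described in this paragraph can be sketched roughly as below. This is a minimal illustration under stated assumptions: both files are represented as in-memory lists, and the names `Fault`, `find_new_faults`, and `monitor_once` are hypothetical. Returning the new entries oldest first follows the ordering mentioned in paragraph [0037].

```python
from datetime import datetime
from typing import List, NamedTuple


class Fault(NamedTuple):
    """Hypothetical fault entry; only the timestamp matters for the comparison."""
    occurred_at: datetime
    error_code: str
    device: str


def find_new_faults(hardware_log: List[Fault], fault_info: List[Fault]) -> List[Fault]:
    # Latest occurrence time already held in the fault information file;
    # datetime.min is used when nothing has been recorded yet.
    latest_known = max((f.occurred_at for f in fault_info), default=datetime.min)
    # Keep only log entries that occurred after that time.
    new = [f for f in hardware_log if f.occurred_at > latest_known]
    # Oldest first, so that later notification follows occurrence order.
    return sorted(new, key=lambda f: f.occurred_at)


def monitor_once(hardware_log: List[Fault], fault_info: List[Fault]) -> List[Fault]:
    # One monitoring cycle: copy newly found entries into the fault information file.
    new = find_new_faults(hardware_log, fault_info)
    fault_info.extend(new)
    return new
```

Because the only state carried between cycles is the newest timestamp in the fault information file, a second cycle over an unchanged log finds nothing and returns an empty list.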

[0023] Second, it displays the newly detected fault information on the display device 15 of the local computer 10. This notifies the local operator of the occurrence of the hardware fault. A warning sound may also be emitted when the fault information is displayed on the display device 15. The fault information remains displayed on the display device 15 until, for example, it is printed out on a printer at the instruction of the local operator.

[0024] Third, it notifies the fault reporting unit 14 that new fault information has been detected. This notification starts the operation of the fault reporting unit 14 described later.

[0025] Each of the above processes is executed periodically at a predetermined interval set by the local operator. Therefore, when a hardware fault occurs, it can be detected by this periodic processing without waiting for the fault to develop into a program fault and cause a malfunction or the like.

[0026] The fault reporting unit 14 is implemented by a program-controlled CPU, an interface, and the like, and transmits newly detected fault information to the headquarters computer 20. The fault reporting unit 14 starts upon receiving a notification of fault occurrence from the fault monitoring unit 13, reads the fault information from the fault information file 12, and transmits it to the headquarters computer 20 via a WAN line or the like. When the fault report to the headquarters computer 20 completes normally, it stamps "normal end" in the headquarters computer transmission result field corresponding to that fault information in the fault information file 12.

[0027] The display device 15 is implemented by a CRT display device, a liquid crystal display device, or the like, and displays the fault information read from the fault information file 12 under the control of the fault monitoring unit 13. By referring to this display, the local operator can recognize that a hardware fault has occurred in the local computer 10. FIG. 4 shows an example of the display screen of the display device 15.

[0028] The headquarters computer 20 comprises a fault report receiving unit 21 that accepts fault reports received from the local computer 10, a headquarters fault information file 22 that stores fault information, a fault information search unit 23 for retrieving arbitrary fault information from the headquarters fault information file 22, and a display device 24 that displays fault information to inform the operator of the occurrence of a fault.

[0029] The fault report receiving unit 21 is implemented by a program-controlled CPU, an interface, and the like; it accepts a fault report from the local computer 10 and notifies the headquarters operator. It also writes the fault information received with the fault report to the headquarters information file 22. The headquarters operator is notified of the received fault report by displaying on the display device 24 a screen similar to the one shown in FIG. 4 for the display device 15 of the local computer 10. As with the notification to the local operator at the local computer 10, a warning sound may be emitted when the fault information is displayed on the display device 24, and the fault information may remain displayed on the display device 24 until it is printed out on a printer.

[0030] The headquarters fault information file 22 is implemented on a storage device such as a magnetic disk device; it receives from the fault report receiving unit 21, and stores, information about faults that have occurred in the past in each computer constituting the distributed computer system. Specifically, for the fault information received from each computer, it stores the ID data of that computer, the error code, the device in which the fault occurred, the date and time of the fault, and so on in association with one another. FIG. 3 shows an example of the format of the information stored in the headquarters fault information file 22.

[0031] The fault information search unit 23 is implemented by a program-controlled CPU or the like and searches the fault information stored in the headquarters fault information file 22 under various search conditions. As search conditions, the operator can freely specify items of the fault information, for example the local computer ID or the fault occurrence date and time. The fault information search unit 23 displays on the display device 24 the fault information that matches the entered search conditions. FIG. 5 shows an example of the display screen of the display device 24 when search results are displayed.
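A search over the headquarters fault information file by operator-specified conditions might look like the following sketch. The record type `HqFault`, the function `search_faults`, and the particular filter parameters are assumptions chosen to mirror the items listed in paragraph [0030]; the patent does not specify how the search is implemented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class HqFault:
    """Hypothetical headquarters-side record; field names are assumptions."""
    computer_id: str
    error_code: str
    device: str
    occurred_at: datetime


def search_faults(
    records: Iterable[HqFault],
    computer_id: Optional[str] = None,
    error_code: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> List[HqFault]:
    # Each condition left as None is ignored; all given conditions must match.
    hits = []
    for r in records:
        if computer_id is not None and r.computer_id != computer_id:
            continue
        if error_code is not None and r.error_code != error_code:
            continue
        if since is not None and r.occurred_at < since:
            continue
        if until is not None and r.occurred_at > until:
            continue
        hits.append(r)
    # Return matches in order of occurrence for display.
    return sorted(hits, key=lambda r: r.occurred_at)
```

Searching by `error_code` alone lists every local computer that has seen a given fault, while searching by `computer_id` alone lists the fault history of one machine, which corresponds to the two kinds of trend lookup described in paragraph [0042].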

[0032] The display device 24 is implemented by a CRT display device, a liquid crystal display device, or the like, and, under the control of the fault report receiving unit 21 or the fault information search unit 23, displays fault information read from the headquarters fault information file 22 or the results of a fault information search. By referring to this display, the headquarters operator can recognize that a hardware fault has occurred in a given local computer 10 and can identify trends in past faults.

[0033] Next, the operation of the fault monitoring and reporting device of this embodiment, configured as described above, will be described with reference to the flowchart of FIG. 6.

[0034] In the initial state, the local computer 10 is executing predetermined program jobs in response to operator instructions, preset start times, and the like. If a processing abnormality due to a DISK access error, a CPU fault, or the like occurs during execution of a program job, information about that fault is written to the hardware fault log file 11 by a function of the operating system.

[0035] The fault monitoring unit 13 starts processing when the local computer 10 starts up, independently of the above program jobs, and performs the following processing at the time interval set by the local operator. First, it searches the fault information file 12 and obtains the fault occurrence date and time of the stored fault information (step 601). It then searches the hardware fault log file 11 (step 602). At this time, it compares the occurrence date and time of the fault information obtained from the fault information file 12 with the occurrence dates and times of the fault information stored in the hardware fault log file 11 (step 603). If it detects fault information newer than the occurrence date and time obtained from the fault information file 12, it reads that fault information and writes it to the fault information file 12 (steps 604, 605).

[0036] If, on the other hand, no new fault information is detected, it determines that no hardware fault has occurred since the fault already transmitted to the headquarters computer 20, ends the processing, and waits for the next time to search the fault information file 12 and the hardware fault log file 11 (step 604).

[0037] If information has been output to the fault information file 12 by the time the search of the hardware fault log file 11 is finished, the fault monitoring unit 13 displays the fault information newly output to the fault information file 12 on the display screen of the display device 15, thereby notifying the local operator of the occurrence of the hardware fault (step 606). It then notifies the fault reporting unit 14 of the occurrence of the hardware fault (step 607). If multiple new pieces of fault information are found, they are notified in order starting from the one with the oldest fault occurrence date and time.

[0038] The fault reporting unit 14 starts upon receiving the hardware fault notification sent from the fault monitoring unit 13, obtains from the fault information file 12 the newly stored fault information, that is, the fault information not yet stamped "normal end", and transmits it to the headquarters computer 20 (step 608).

[0039] When the fault report to the headquarters computer 20 completes normally, "normal end" is stamped in the headquarters computer transmission result field corresponding to that fault information in the fault information file 12 (step 609). When the fault report to the headquarters computer 20 does not complete normally, the corresponding fault code is stamped in the headquarters computer transmission result field corresponding to that fault information in the fault information file 12. For example, if the fault report to the headquarters computer 20 ends abnormally because of a line fault, the error code for the line fault is stamped in the transmission result field. When there are multiple pieces of new fault information to report to the headquarters computer 20, the above processing is repeated until all the relevant fault information has been read from the fault information file 12 and reported to the headquarters computer 20.

[0040] When the fault report to the headquarters computer 20 ends abnormally, the fault report is retried (steps 609, 608). The local operator can freely set the time interval at which retries are executed. The retry processing is repeated until the headquarters computer transmission result field for that fault information becomes "normal end".
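The report-stamp-retry cycle of steps 608 and 609 can be illustrated with the sketch below. It rests on several assumptions: records are plain dictionaries with a `send_result` key, the hypothetical `send` callable returns either the string "normal end" or an error code, and the `sleep` function is injectable so the loop can be exercised without real delays. The text describes retrying until "normal end" is obtained; the `max_attempts` bound is an addition for this sketch so that the example always terminates.

```python
import time
from typing import Callable, Dict, List

NORMAL_END = "normal end"  # stamp written after a successful report


def report_pending(
    records: List[Dict],
    send: Callable[[Dict], str],
    retry_interval: float = 60.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    for record in records:
        # Skip records already stamped as successfully reported.
        if record.get("send_result") == NORMAL_END:
            continue
        for attempt in range(max_attempts):
            # Step 608: transmit; step 609: stamp the result (success or error code).
            record["send_result"] = send(record)
            if record["send_result"] == NORMAL_END:
                break
            # Wait for the operator-configured interval before retrying.
            if attempt + 1 < max_attempts:
                sleep(retry_interval)
```

A record that still carries an error code after the last attempt remains unstamped with "normal end", so a later call to `report_pending` picks it up again.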

[0041] Next, in the headquarters computer 20, when the fault report receiving unit 21 receives a fault report from the local computer 10, it writes the fault information to the headquarters fault information file 22 (step 610). It also displays the fault information on the display screen of the display device 24, notifying the headquarters operator that a hardware fault has occurred in the local computer 10 (step 611).
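The headquarters-side handling in steps 610 and 611 amounts to storing the report and then notifying the operator. A minimal sketch follows, assuming the headquarters fault information file is an in-memory list and the display is any callable that accepts a message string; the function name `receive_fault_report` and the dictionary keys are hypothetical.

```python
from typing import Callable, Dict, List


def receive_fault_report(
    report: Dict[str, str],
    hq_file: List[Dict[str, str]],
    notify: Callable[[str], None],
) -> None:
    # Step 610: store a copy of the received fault information.
    hq_file.append(dict(report))
    # Step 611: notify the headquarters operator (here via a message callback).
    notify(
        "Hardware fault at {computer_id}: code {error_code}, "
        "device {device}, occurred {occurred_at}".format(**report)
    )
```

For example, passing `print` as `notify` writes the message to the console, whereas a real system would drive the display device and, optionally, a warning sound.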

[0042] As a result of the above operation, the headquarters operator can immediately learn of the occurrence of a hardware fault in a local computer 10 and of its content, and can give appropriate instructions to the local operator of the local computer 10 in which the hardware fault occurred. In addition, by referring to the fault content displayed on the display device 24, the headquarters operator can, on judging that the hardware fault reported from the local computer 10 is serious, instruct the local operator to stop the system or take similar action. Furthermore, using the fault information search unit 23, the headquarters operator can easily search for fault trends, such as which local computers 10 have experienced a particular hardware fault or what faults have occurred in a particular local computer 10.

[0043] Although the present invention has been described above with reference to a preferred embodiment, the present invention is not necessarily limited to that embodiment. For example, the items acquired as fault information may be set as appropriate to need and are not limited to the formats shown in FIGS. 2 and 3. Likewise, the display screens of the display devices are not limited to those shown in FIGS. 4 and 5.

[0044]

[Effects of the invention] As described above, the first effect of the present invention is that the time until a countermeasure for a fault is executed is shortened, enabling prompt restoration of the system. The reason is that the computer detects the occurrence of a fault through periodic monitoring means without waiting for a malfunction or the like to occur and reports it to the local operator and the headquarters operator, so that the headquarters operator can promptly learn of the fault and its content and take countermeasures.

[0045] The second effect is that, by promptly executing the countermeasure for a hardware fault, secondary faults such as corrupted file data can be avoided. The reason is that the time from the occurrence of the hardware fault to the execution of the countermeasure is shortened, enabling prompt recovery.

[0046]

An object of the present invention is to solve the above-mentioned conventional problems: by immediately notifying the local operator of a hardware failure occurring in a computer constituting a distributed arrangement computer system and, at the same time, reporting it to the headquarters computer, the invention aims to shorten the time required to execute countermeasures against the failure and to prevent secondary failures.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a fault monitoring and reporting device according to one embodiment of the present invention.

FIG. 2 is a diagram showing an example format of the failure information accumulated in the local computer of the embodiment.

FIG. 3 is a diagram showing an example format of the failure information accumulated in the headquarters computer of the embodiment.

FIG. 4 is a diagram showing an example of the display screen that reports the occurrence of a failure on the display devices of the local computer and the headquarters computer of the embodiment.

FIG. 5 is a diagram showing an example of the display screen shown on the display device of the headquarters computer of the embodiment when a failure search is executed.

FIG. 6 is a flowchart showing the operation of the embodiment.

[Explanation of symbols]

10 local computer
11 hardware failure log file
12 failure information file
13 failure monitoring section
14 failure reporting section
15 display device
20 headquarters computer
21 failure report receiving section
22 headquarters failure information file
23 failure information search section
24 display device

Claims

[Claims]

1. A fault monitoring and reporting device comprising: failure monitoring means, installed in a local computer constituting a distributed computer system, for acquiring, when a hardware failure occurs in the local computer, failure information regarding the hardware failure and notifying an operator of the failure information; failure reporting means for sending the failure information acquired by the failure monitoring means to a headquarters computer that constitutes the distributed computer system and is connected to the local computer via a communication line; and fault notification receiving means, installed in the headquarters computer, for receiving the failure information and notifying an operator of the occurrence of the hardware failure in the local computer.

2. The fault monitoring and reporting device according to claim 1, wherein the local computer comprises first failure information storage means for storing and accumulating failure information relating to a detected hardware failure, and second failure information storage means for storing failure information to be reported to the headquarters computer; the failure monitoring means compares the first failure information storage means with the second failure information storage means and, if failure information newer than the failure information in the second failure information storage means is stored in the first failure information storage means, writes that failure information into the second failure information storage means; and the failure reporting means, when unreported failure information is stored in the second failure information storage means, reads that failure information and sends it to the headquarters computer.

3. The fault monitoring and reporting device according to claim 2, wherein the failure monitoring means periodically executes the comparison of the first failure information storage means with the second failure information storage means at a predetermined timing independent of normal programs and jobs.

4. The fault notification receiving means further comprises third failure information storage means for storing and accumulating the failure information received by the fault notification receiving means. The fault monitoring and reporting device described in.

5. The headquarters computer further comprises failure information search means for searching the failure information stored in the third failure information storage means according to a predetermined search condition. Fault monitoring and reporting device.
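
The comparison and reporting steps recited in claims 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record fields (`seq`, `device`, `detail`, `reported`), the in-memory lists standing in for the first and second failure information storage means, and the `send` callback standing in for the communication line to the headquarters computer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailureRecord:
    seq: int                # hypothetical, monotonically increasing entry number
    device: str             # hypothetical name of the failed hardware
    detail: str             # hypothetical description of the failure
    reported: bool = False  # set once the record has been sent to headquarters


def monitor_once(log: List[FailureRecord], info: List[FailureRecord]) -> List[FailureRecord]:
    """Copy entries newer than anything in `info` from `log` into `info`.

    `log` plays the role of the first storage means and `info` the second.
    Returns the newly copied records so the caller can also notify the
    local operator.
    """
    newest = max((r.seq for r in info), default=-1)
    new = [FailureRecord(r.seq, r.device, r.detail) for r in log if r.seq > newest]
    info.extend(new)
    return new


def report_unreported(info: List[FailureRecord], send: Callable[[FailureRecord], None]) -> int:
    """Send every unreported record in `info` and mark it as reported."""
    sent = 0
    for record in info:
        if not record.reported:
            send(record)
            record.reported = True
            sent += 1
    return sent
```

Running `monitor_once` followed by `report_unreported` from a timer at a fixed interval would correspond to the periodic execution of claim 3; the interval and the scheduling mechanism are left open here because the claims do not fix them.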