JPH10312321A - On-line system fault analyzing method

Info

Publication number: JPH10312321A
Application number: JP9120484A
Authority: JP
Inventors: Shinichi Kogure; 慎一小暮
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-05-12
Filing date: 1997-05-12
Publication date: 1998-11-24

Abstract

PROBLEM TO BE SOLVED: To minimize the impact of a failure by carrying out, in real time, the processing from detection of a failure in an on-line system through to display of a countermeasure method. SOLUTION: When a failure message is generated, a counter is set to one (step 152). For every case in the message table, the history certainty factor and the monitoring certainty factor of the case at the position indicated by the counter are computed (step 168). The cause of the failure message is displayed, and the case with the maximum history certainty factor in the message table is displayed as the optimum countermeasure method (when more than one case has the maximum history certainty factor, the one with the maximum monitoring certainty factor is displayed) (step 156). When the countermeasure is set to automatic, the countermeasure command and the countermeasure confirmation command are executed (step 160); if the failure is recovered, the countermeasure result is displayed (step 164), and if it is not, the countermeasure module is started so that a manual countermeasure can be taken (step 166).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a failure analysis method that uses information output from a system to display, for hardware and software failures, the cause of the failure and a method of dealing with it, and in particular to online systems that require fast, real-time processing from failure detection through to countermeasure.

[0002]

[Prior Art] Devices and systems that diagnose failures of hardware such as equipment and failures of software such as application programs have been developed and put into operation in various fields. Failure analysis plays a major role in the management of these systems. For example, Japanese Patent Application Laid-Open No. 06-161760 describes an apparatus that diagnoses equipment in real time. By mechanizing diagnosis that used to be carried out by experts, this technique removes variation in judgments and enables rapid diagnosis. In Japanese Patent Application Laid-Open No. 08-237188, a communication terminal device stores data on its communication history and, when a communication failure occurs, transmits the data from the time of the failure to an information storage device over a network, so that the communication terminal device can be maintained efficiently.

[0003] In companies and government offices, a variety of online systems operate in fields that demand high-speed processing and reliability. In these systems, software such as application programs runs on hardware such as equipment, and failures are diagnosed for each individual piece of hardware and software.

[0004]

[Problems to Be Solved by the Invention] Such conventional methods have the following problems.

[0005] That is, when multiple hardware and software failures occur at the same time, it is difficult to carry out the processing from failure detection through to countermeasure at high speed.

[0006] Thus the conventional methods deal with individual hardware or software failures, and cannot deal with failures of a system that is composed of multiple hardware devices and software components that may affect one another.

[0007] An object of the present invention is to provide an online system failure analysis method capable of dealing, in real time, with failures of an online system composed of multiple hardware devices and software.

[0008] Another object of the present invention is to mechanize the complex failure recovery work conventionally performed by humans, thereby improving the accuracy and speed of the work and making it easier, keeping damage caused by failures to a minimum, and providing a means by which even a person without specialized knowledge can perform failure recovery work.

[0009]

[Means for Solving the Problems] The present invention accumulates, over a fixed period, information relating to the history of messages output from an online system to a monitoring terminal, and monitors the operating status of hardware and software; when a failure occurs, it displays the cause of the failure and the countermeasure method of the optimum past case.

[0010] Messages output from the online system to the monitoring terminal are constantly monitored by the message monitoring module, and the history of output messages over a fixed period is continually accumulated in the history accumulation table. When a failure occurs, the history accumulation table as of the time of the failure is copied as the work history accumulation table. This work history accumulation table is matched against the histories of past cases in the message table, using the message ID as the key, and the number of matches is stored as the history certainty factor. Next, the hardware/software operating status at the time of the failure is matched against the hardware/software operating status at the time each past case in the message table occurred, using the hardware/software name and monitoring item as keys, and the number of matches is stored as the monitoring certainty factor. The case with the highest history certainty factor is taken as the optimum case (when several cases share the highest history certainty factor, the one with the highest monitoring certainty factor is chosen). The causes and classification information of failure messages, and past case information, are managed in the message table. Next, the cause of the failure and the countermeasure method of the optimum case are displayed. Failure recovery by the countermeasure is carried out by executing the countermeasure command and the countermeasure confirmation command and checking the results.
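
The matching and selection described in this paragraph can be illustrated with a minimal Python sketch. It is not taken from the specification: the `PastCase` class, its field names, and the sample data are hypothetical, and it assumes that a "match" means a distinct message ID present in both histories, or a (name, monitoring item) key whose monitoring result is the same now as when the past case occurred.

```python
from dataclasses import dataclass


@dataclass
class PastCase:
    """One past case held in the message table (field names are hypothetical)."""
    name: str
    message_ids: list    # message IDs logged when the past case occurred
    status: dict         # (hardware/software name, monitoring item) -> result
    countermeasure: str


def history_certainty(work_history_ids, case):
    # Assumption: a match is a distinct message ID present in both histories.
    return len(set(work_history_ids) & set(case.message_ids))


def monitoring_certainty(current_status, case):
    # Assumption: a match is a (name, item) key with an identical result.
    return sum(
        1
        for key, result in current_status.items()
        if key in case.status and case.status[key] == result
    )


def select_optimum_case(work_history_ids, current_status, cases):
    # Highest history certainty wins; monitoring certainty breaks ties.
    return max(
        cases,
        key=lambda c: (
            history_certainty(work_history_ids, c),
            monitoring_certainty(current_status, c),
        ),
    )


cases = [
    PastCase("disk retry", ["M001", "M007"],
             {("disk0", "io_errors"): "high"}, "restart I/O driver"),
    PastCase("disk swap", ["M001", "M007"],
             {("disk0", "io_errors"): "high", ("disk0", "temperature"): "high"},
             "switch to spare disk"),
    PastCase("queue flush", ["M020"],
             {("app1", "queue_depth"): "high"}, "flush queue"),
]
work_history = ["M001", "M003", "M007"]
now = {("disk0", "io_errors"): "high", ("disk0", "temperature"): "high"}
best = select_optimum_case(work_history, now, cases)
```

In this sample the first two cases tie on history certainty, so the monitoring certainty factor decides between them. If both factors tie, `max` keeps the first such case, which is a choice of this sketch; the text does not say how that situation is resolved.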

[0011] A failure analysis method characterized in that hardware and software are monitored constantly and, for hardware and software failures, the cause of the failure and the method of dealing with it are displayed in real time.

[0012] The information relating to a failure is of two kinds: information relating to the cause of the failure, and information relating to past cases of failure.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below.

[0014] FIGS. 1, 2, 3, and 4 are flowcharts showing an embodiment of the processing procedure when the present invention is applied to an online system, and FIG. 5 is a block diagram showing the configuration of the online system failure analysis method according to the present invention. FIGS. 6 and 7 show the data tables used in implementing the online system failure analysis method according to the present invention.

[0015] In FIG. 5, the failure analysis processing is implemented by five modules in the monitoring terminal. The message monitoring module 12 first stores, in the history accumulation table 17, all messages output from the online system 10 to the monitoring terminal through the message acquisition unit 11. It then looks up the classification information for each output message in the message table 19 and stores it in the history accumulation table 17. The hardware monitoring module 13 constantly monitors the operating status of the hardware and stores it in the hardware monitoring table 20. The software monitoring module 14 constantly monitors the operating status of the software and stores it in the software monitoring table 21. The history accumulation table 17 is used to accumulate information relating to the history of messages output from the online system 10. The work history accumulation table 18 is used to hold the message history as of the time a failure occurred and to compare it with the histories of past cases in the message table. The countermeasure history table 22 is used to store the results of countermeasures taken after a failure has occurred. The message table 19 stores the messages to be monitored, keyed by message ID and classified by hardware/software and by system message/application message. The hardware monitoring table 20 is a table for constantly monitoring the operating status of the hardware; the monitoring targets are set in advance. The software monitoring table 21 is a table for constantly monitoring the operating status of the software; the monitoring targets are likewise set in advance. The failure analysis module 15 is started when a failure occurs, displays the cause of the failure and the countermeasure method of the optimum past case, and stores the countermeasure result in the countermeasure history table 22. The countermeasure history table 22 is a table for storing information on the countermeasures taken for failure messages that have occurred.
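
As an illustration of how the history accumulation table 17 and the work history accumulation table 18 relate, the following Python sketch keeps a rolling window of messages and takes a snapshot when a failure occurs. It is a sketch under stated assumptions rather than the patented implementation: the class name, the `window_seconds` parameter, the numeric timestamps, and the message IDs are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class HistoryAccumulationTable:
    """Keeps only messages from the last `window_seconds` (hypothetical parameter)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.rows = deque()  # (occurred_at, message_id), oldest first

    def accumulate(self, occurred_at, message_id):
        self.rows.append((occurred_at, message_id))
        # Drop rows that have fallen out of the retention window.
        while self.rows and self.rows[0][0] < occurred_at - self.window_seconds:
            self.rows.popleft()

    def copy_as_work_table(self):
        # Snapshot taken at failure time; later messages do not alter it.
        return list(self.rows)


table_17 = HistoryAccumulationTable(window_seconds=10)
table_17.accumulate(0, "M001")
table_17.accumulate(5, "M002")
table_17.accumulate(12, "M900")   # treated here as the failure message
table_18 = table_17.copy_as_work_table()
table_17.accumulate(13, "M003")   # arrives after the snapshot
```

Copying the rows into a separate list means that analysis works on a stable view of the history while monitoring continues to append to table 17.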

[0016] FIGS. 6 and 7 show the record formats of the tables in FIG. 5; the relationships among the tables in FIGS. 6 and 7 are described below. FIG. 6A <history accumulation table and work history accumulation table> is the record format of the history accumulation table 17 and the work history accumulation table 18 in FIG. 5. The items of FIG. 6A <history accumulation table and work history accumulation table> are as follows. The message ID is used as the key identifying a message output from the online system 10 in FIG. 5 to the monitoring terminal in FIG. 5, and is accumulated in FIG. 6A <history accumulation table> each time a message occurs. The occurrence time stores the time at which the message was output. The software/hardware classification and the failure/warning classification are obtained by referring to FIG. 6B <message table> and then stored. The message item stores the comment information contained in the output message, in text form. FIG. 6B <message table> is the record format of the message table 19 in FIG. 5. The items of FIG. 6B <message table> are as follows. The message ID is managed as the key of FIG. 6B <message table>, and is classified by hardware/software and by system message/application message. The failure/warning classification indicates whether the message signifies a failure or a warning. The cause is the information given in the manual, and the importance indicates the significance of the message. The number of cases indicates how many past cases are stored in FIG. 6B <message table>. For failure messages only, the history of the messages output when a past case occurred is managed in a history accumulation table, and information on the countermeasure taken in the past case is managed in a countermeasure table. The items of FIG. 6C <countermeasure table> are as follows. FIG. 6C <countermeasure table> stores the countermeasure method applied to a failure that occurred. The countermeasure item stores the content of the actual countermeasure. The countermeasure command stores the command used in the actual countermeasure, and the countermeasure result command stores the execution result of the countermeasure command. The countermeasure confirmation command stores the command that checks whether the countermeasure command was executed correctly, and the countermeasure confirmation result stores the items to be checked in the execution result of the countermeasure confirmation command. The history certainty factor stores the result of matching, with the message ID as the key, FIG. 6A <work history accumulation table> as of the time of the failure against the history accumulation table of each case stored in FIG. 6B <message table>. The monitoring certainty factor stores the result of matching the hardware/software operating status at the time of the failure, with the hardware/software name and monitoring item as keys. The automatic/manual countermeasure setting specifies whether the countermeasure is to be performed automatically when the case has the highest certainty factor among past failures. FIG. 7D <hardware monitoring table> is the record format of the hardware monitoring table 20 in FIG. 5, and sets monitoring items and monitoring results for each piece of hardware. FIG. 7E <software monitoring table> is the record format of the software monitoring table 21 in FIG. 5, and sets monitoring items and monitoring results for each piece of software. FIG. 7F <countermeasure history table> is the record format of the countermeasure history table 22 in FIG. 5; a record is stored in it when the countermeasure for a failure message that occurred has been completed, and when failure analysis ends the record is stored as a case in the message table 19 of FIG. 5.
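
The record formats above might be modeled as follows. This is a hedged Python sketch: the items come from the description of FIGS. 6A to 6C, 7D, and 7E, but every class name, field name, type, and sample value is a hypothetical choice, and FIG. 7F is left out because its individual items are not listed in the text.

```python
from dataclasses import dataclass, field


@dataclass
class HistoryRow:                  # FIG. 6A
    message_id: str
    occurred_at: str
    sw_hw_class: str               # "software" or "hardware"
    failure_warning_class: str     # "failure" or "warning"
    message: str                   # comment text of the output message


@dataclass
class CountermeasureRow:           # FIG. 6C
    countermeasure: str
    command: str
    command_result: str
    confirmation_command: str
    confirmation_result: str
    history_certainty: int = 0
    monitoring_certainty: int = 0
    automatic: bool = False        # automatic/manual countermeasure setting


@dataclass
class Case:                        # one past case attached to a failure message
    history: list                  # HistoryRow items logged when the case occurred
    countermeasure: CountermeasureRow


@dataclass
class MessageTableRow:             # FIG. 6B
    message_id: str
    sw_hw_class: str
    system_app_class: str          # "system" or "application"
    failure_warning_class: str
    cause: str
    importance: str
    cases: list = field(default_factory=list)

    @property
    def case_count(self):
        return len(self.cases)


@dataclass
class MonitoringRow:               # FIGS. 7D and 7E
    name: str                      # hardware or software name
    item: str                      # monitoring item
    result: str                    # monitoring result


row = MessageTableRow("M900", "hardware", "system", "failure",
                      "disk I/O error", "high")
row.cases.append(Case(
    history=[HistoryRow("M900", "09:15:02", "hardware", "failure", "I/O error")],
    countermeasure=CountermeasureRow(
        "switch to spare disk", "swap disk0", "done", "status disk0", "OK",
        automatic=True),
))
```

The text stores the number of cases as an item of the message table; deriving it from the list, as `case_count` does here, is a design choice of the sketch that keeps the count and the stored cases from drifting apart.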

[0017] Next, the operations of FIG. 5 are described with reference to the flowcharts of FIGS. 1, 2, 3, and 4.

[0018] The operation of the message monitoring module 12 in FIG. 5 from startup (step 100) to termination is described. Processing continues until failure analysis is terminated (step 102). First, the module checks whether a message has been output from the online system 10 in FIG. 5 (step 104). If one has, the module refers to the message table 19 in FIG. 5 using the message ID as the key, and accumulates information about the output message in the history accumulation table 17 in FIG. 5 (step 106). When accumulating, the message ID, occurrence time, software/hardware classification, failure/warning classification, and message are referenced from the message table 19. If a message that is not registered in the message table 19 is output, it is not accumulated in the history accumulation table. Next, the module determines whether the output message is a failure message (step 108), and if so performs the following processing. First, the history accumulation table as of the time the failure message occurred is copied as the work history accumulation table (step 110). Then, if the failure message concerns hardware (step 112), the hardware monitoring table is updated (step 114); otherwise the software monitoring table is updated (step 116). Next, if a countermeasure method exists in the message table 19 (step 118), the failure analysis module in FIG. 5 is started (step 120). If no countermeasure method exists, only the cause of the failure is displayed (step 122). Whether a countermeasure method exists is determined by whether the number of cases in FIG. 6B <message table> is 0.
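
Steps 104 to 122 can be read as a per-message decision procedure. The Python sketch below shows one possible reading for a single message; the function name, the dictionary keys, the returned action labels, and the sample message IDs are hypothetical, and the monitoring-table refresh of steps 112 to 116 is deliberately left out.

```python
def handle_message(message_id, occurred_at, text, message_table, history):
    """Process one output message; returns (action, work_history)."""
    entry = message_table.get(message_id)       # step 106: look up by message ID
    if entry is None:
        return ("not_accumulated", None)        # unregistered messages are skipped
    history.append({
        "message_id": message_id,
        "occurred_at": occurred_at,
        "sw_hw_class": entry["sw_hw_class"],
        "failure_warning_class": entry["failure_warning_class"],
        "message": text,
    })
    if entry["failure_warning_class"] != "failure":   # step 108
        return ("accumulated", None)
    work_history = list(history)                # step 110: snapshot copy
    # Steps 112-116 (refreshing the hardware or software monitoring table)
    # are omitted from this sketch.
    if entry["case_count"] > 0:                 # step 118: any past cases?
        return ("start_failure_analysis", work_history)   # step 120
    return ("display_cause_only", work_history)           # step 122


message_table = {
    "M001": {"sw_hw_class": "software", "failure_warning_class": "warning",
             "case_count": 0},
    "M900": {"sw_hw_class": "hardware", "failure_warning_class": "failure",
             "case_count": 2},
    "M901": {"sw_hw_class": "software", "failure_warning_class": "failure",
             "case_count": 0},
}
history = []
first = handle_message("M999", 0, "unknown", message_table, history)
second = handle_message("M001", 1, "queue growing", message_table, history)
third = handle_message("M900", 2, "disk I/O error", message_table, history)
fourth = handle_message("M901", 3, "process aborted", message_table, history)
```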

[0019] Next, the processing of the hardware monitoring module 13 in FIG. 5 from startup (step 130) to termination is described. Until failure analysis ends (step 132), the hardware monitoring module 13 stores the operating status of the hardware in the hardware monitoring table 20 (step 134).

[0020] Next, the processing of the software monitoring module 14 in FIG. 5 from startup (step 140) to termination is described. Until failure analysis ends (step 142), the software monitoring module 14 stores the operating status of the software in the software monitoring table 21 (step 144).
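
The hardware and software monitoring modules follow the same loop, so a single Python sketch can stand for both. The function and parameter names are hypothetical, and the probes are assumed to be plain callables that return the current monitoring result; a real monitor would presumably also wait between passes, which is omitted here.

```python
def run_monitoring(probes, monitoring_table, analysis_finished):
    """Refresh `monitoring_table` until `analysis_finished()` returns True."""
    while not analysis_finished():                    # step 132 / step 142
        for (name, item), probe in probes.items():
            monitoring_table[(name, item)] = probe()  # step 134 / step 144


ticks = iter([False, False, True])     # stop after two passes
readings = iter(["normal", "high"])    # one reading per pass
table_20 = {}
run_monitoring(
    {("disk0", "io_errors"): lambda: next(readings)},
    table_20,
    lambda: next(ticks),
)
```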

[0021] Next, the processing of the failure analysis module 15 in FIG. 5 from startup (step 150) to termination is described. First, the counter used to find the optimum case among the past cases is set to 1 (step 152). It is then determined whether the counter is larger than the number of cases in FIG. 6B <message table> (step 154). If it is not larger, the history accumulation table of the case stored at the position indicated by the counter is matched against the work history accumulation table, using the message ID as the key, and the result is stored in the history certainty factor of FIG. 6C <countermeasure table>. Next, the software/hardware monitoring table of the case stored at the position indicated by the counter (FIG. 6A <history accumulation table>) is matched against the software/hardware monitoring tables of FIGS. 7D and 7E, using the hardware/software name and monitoring item as keys, and the result is stored in the monitoring certainty factor. Then 1 is added to the counter (step 168). When the counter is larger than the number of cases in FIG. 6B <message table> (step 154), the following processing is performed. The case with the highest history certainty factor is taken as the optimum case (when several cases share the highest history certainty factor, the one with the highest monitoring certainty factor is chosen), and the cause of the failure and the optimum past case are displayed (step 156). Next, the automatic/manual countermeasure setting in FIG. 6C <countermeasure table> is consulted to determine whether the countermeasure is set to automatic (step 158). If it is not, the countermeasure module is started and a person performs the countermeasure (step 166). If it is, the countermeasure command and the countermeasure confirmation command of FIG. 6C <countermeasure table> are executed (step 160). Whether the failure has been recovered is then determined by comparing the countermeasure result and the countermeasure confirmation result in FIG. 7F <countermeasure history table> (step 162); if it has, failure recovery is displayed (step 164). If it has not, the countermeasure module is started and a person performs the countermeasure (step 166).
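
The control flow of steps 152 to 166 might be sketched in Python as below. Several points are assumptions made for illustration: the function signature, the dictionary keys, the injected `score`, `run_command`, and `display` callables, and in particular the recovery test, which here simply compares the output of the confirmation command with the stored confirmation result.

```python
def analyze_failure(cause, cases, score, run_command, display):
    """Return "recovered" or "manual" for one failure message."""
    if not cases:
        raise ValueError("the module is started only when past cases exist")
    counter = 1                                   # step 152
    scored = []
    while counter <= len(cases):                  # step 154
        case = cases[counter - 1]
        # score(case) -> (history certainty, monitoring certainty)
        scored.append((score(case), case))
        counter += 1                              # step 168
    # Step 156: highest history certainty; monitoring certainty breaks ties.
    best = max(scored, key=lambda row: row[0])[1]
    display(f"cause: {cause}")
    display(f"countermeasure: {best['countermeasure']}")
    if not best["automatic"]:                     # step 158
        return "manual"                           # step 166
    run_command(best["command"])                  # step 160
    confirmation = run_command(best["confirmation_command"])
    if confirmation == best["confirmation_result"]:   # step 162 (assumed test)
        display("failure recovered")              # step 164
        return "recovered"
    return "manual"                               # step 166


cases = [
    {"countermeasure": "restart driver", "command": "restart",
     "confirmation_command": "status", "confirmation_result": "OK",
     "automatic": True, "scores": (2, 1)},
    {"countermeasure": "swap disk", "command": "swap",
     "confirmation_command": "status", "confirmation_result": "OK",
     "automatic": True, "scores": (2, 2)},
]
shown = []
outcome = analyze_failure(
    "disk I/O error",
    cases,
    lambda case: case["scores"],
    lambda cmd: "OK" if cmd == "status" else "done",
    shown.append,
)
failed = analyze_failure(
    "disk I/O error", cases, lambda case: case["scores"],
    lambda cmd: "NG", lambda line: None,
)
```

Passing the command runner and the display function in as arguments keeps the sketch free of any real command execution, so the branch taken at each step can be checked in isolation.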

[0022] Next, the processing of the countermeasure module 16 in FIG. 5 from startup (step 170) to termination is described. First, the module waits for a person to input a countermeasure (step 172); after the input, the countermeasure result and the countermeasure confirmation result are stored in FIG. 7F <countermeasure history table> (step 174). Whether the failure has been recovered is determined by comparing the countermeasure result and the countermeasure confirmation result in FIG. 7F <countermeasure history table> (step 176); if it has, failure recovery is displayed (step 178). If it has not, the manual countermeasure is performed again (steps 172, 174, 176).
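
The retry loop of steps 172 to 178 can be sketched as follows, again in Python and under stated assumptions: operator input is modeled as an iterable of (countermeasure result, confirmation result) pairs, and the recovery test is passed in as `is_recovered` because the text does not spell out how the two results are compared. All names are hypothetical.

```python
def manual_countermeasure(operator_inputs, countermeasure_history,
                          is_recovered, display):
    """Loop over operator inputs until one of them recovers the failure."""
    for result, confirmation in operator_inputs:      # step 172: wait for input
        countermeasure_history.append(                # step 174
            {"result": result, "confirmation_result": confirmation}
        )
        if is_recovered(result, confirmation):        # step 176
            display("failure recovered")              # step 178
            return True
        # Not recovered: back to step 172 for the next attempt.
    return False  # inputs exhausted; the flowchart itself would keep waiting


table_22 = []
shown = []
recovered = manual_countermeasure(
    [("driver restarted", "NG"), ("disk swapped", "OK")],
    table_22,
    lambda result, confirmation: confirmation == "OK",
    shown.append,
)
```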

[0023]

[Effects of the Invention] As described above, the present invention makes it possible to deal in real time with failures of an online system composed of multiple hardware devices and software, so that damage caused by a failure can be kept to a minimum. In addition, by mechanizing and supporting the complex failure recovery work conventionally performed by humans, it allows even a person without specialized knowledge to perform failure recovery work.

[Brief description of the drawings]

FIG. 1 is a flowchart showing an embodiment of the processing procedure of the present invention.

FIG. 2 is a flowchart showing an embodiment of the processing procedure of the present invention.

FIG. 3 is a flowchart showing an embodiment of the processing procedure of the present invention.

FIG. 4 is a flowchart showing an embodiment of the processing procedure of the present invention.

FIG. 5 is a system block diagram showing the configuration of the online system failure analysis method according to the present invention.

FIG. 6 shows data tables constituting the online system failure analysis system according to the present invention.

FIG. 7 shows data tables constituting the online system failure analysis system according to the present invention.

[Description of Reference Numerals]
10 online system
11 message acquisition unit
12 message monitoring module
13 hardware monitoring module
14 software monitoring module
15 failure analysis monitoring module
16 countermeasure module
17 history accumulation table
18 work history accumulation table
19 message table
20 hardware monitoring table
21 software monitoring table
22 countermeasure history monitoring table

Claims

[Claims]

1. A failure analysis method characterized by accumulating information relating to data output from a system, checking the correspondence between information relating to failures and the information relating to the data output from the system, detecting a failure when the two match, and displaying the cause of the failure and a method of dealing with the failure.

2. The failure analysis method according to claim 1, wherein hardware and software are monitored constantly and, for hardware and software failures, the cause of the failure and a method of dealing with the failure are displayed.

3. The failure analysis method according to claim 2, wherein, as the information relating to the failure, information relating to the cause of the failure and information relating to past cases of the failure are managed.