JP6317074B2

JP6317074B2 - Failure notification device, failure notification program, and failure notification method

Info

Publication number: JP6317074B2
Application number: JP2013106078A
Authority: JP
Inventors: 裕美大立目; 佐藤　秀憲; 秀憲佐藤; 光悦和賀; 裕之岩田
Original assignee: NEC Communication Systems Ltd
Current assignee: NEC Communication Systems Ltd
Priority date: 2013-05-20
Filing date: 2013-05-20
Publication date: 2018-04-25
Anticipated expiration: 2033-05-20
Also published as: JP2014228932A

Description

The present invention relates to a technique for notifying of failures that occur in a computer (information processing apparatus).

A distributed computer network system is a system in which the individual parts of a program are executed concurrently on a plurality of computers that communicate with each other via a network. Using a distributed computer network system therefore improves throughput compared with performing the computation on a single computer. Because such a system must be monitored on computers located at distributed offices, remote command control is required.

Patent Document 1 discloses a system that uses an integrated network monitoring system for the maintenance and operation of a plurality of devices in a large-scale distributed computer network system. As a means of remotely monitoring servers, this network monitoring system installs an information collection agent (program) in each monitoring target device to collect information on that device. It also installs a network monitoring manager (program) in the monitoring device. FIG. 2 shows the overall configuration of such a network monitoring system. The network monitoring system consists of a network monitoring manager function 131 installed in the maintenance network monitoring device 130 and monitoring agent functions 102, 112, and 122 installed in the monitoring target devices 100, 110, and 120. Application programs 101, 111, and 121 are installed in the monitoring target devices 100, 110, and 120, respectively. Failure information on the monitoring target devices 100, 110, and 120 detected by the monitoring agent functions 102, 112, and 122 is notified to the network monitoring manager function 131. This failure information is displayed on a proprietary GUI screen 131a implemented by the network monitoring manager function 131. The failure notification destination is limited to this GUI screen 131a (GUI: Graphical User Interface).

Introducing this method into a network system that is already in operation, such as a large-scale distributed computer network system, requires newly incorporating the integrated network monitoring system and server management software described above into the existing system.

On the other hand, a network system in operation is generally already equipped with an integrated monitoring function that allows centralized management. For example, as shown in FIG. 3a, an existing distributed computer system already implements an integrated monitoring function 171 that manages a plurality of monitoring target devices from the maintenance work GUI screen 171a of the maintenance network monitoring device 170. The integrated monitoring function 171 checks the operating state, failure occurrence status, and so on of the monitoring target devices 140 and 150 centrally from the maintenance work GUI screen 171a. The integrated monitoring function 171 also performs various maintenance operations from this screen. Here, application programs 141 and 151, which have failure monitoring and notification functions, are installed in the monitoring target devices 140 and 150, respectively. Consequently, when a monitoring target device 160 is newly added to the system of FIG. 3a and an integrated network monitoring system or server management software is installed along with it, the monitoring facilities become duplicated. FIG. 3b shows a configuration in which the monitoring target device 160, monitored by a network monitoring manager function 172, is added to the network monitoring device 170 in operation, and the network monitoring manager function 172 is installed in the maintenance network monitoring device 170. The maintenance person must perform checks and maintenance operations from two screens: the existing GUI screen 171a and the GUI screen 172a provided by the network monitoring manager function 172. As a result, the maintenance person's work becomes complicated and work efficiency decreases.

In addition, adding a new program or installing software on a device that is running stably in a production system may affect the operation of the applications in service, so it cannot be done easily.

Meanwhile, the information obtained by monitoring can also be obtained by using the logging function of a UNIX (registered trademark) operating system (OS). A UNIX-based OS has a logging function called the system log (syslog), which records operation logs such as the various events and state changes occurring on the system in files. This logging function records a wide range of system-related events in the system log file, from user login times to the circumstances of an abnormality such as a kernel panic. The system log is therefore an important information source for system management, for example when pursuing the cause of a failure or searching for traces of unauthorized access. Accordingly, this logging function can take the place of the information collection agent and the network monitoring manager. With this substitution, even when an integrated network monitoring system or the like is incorporated into an existing distributed computer network, the network monitoring manager function does not need to be installed, and a drop in the user's work efficiency is avoided.

However, the contents of the system log are generally complex. They include not only failure information but also the entire prescribed operation history. When the goal is simply to obtain failure information, the system log contains too much information. Each relevant message must be checked one by one to judge whether it is urgent and what the failure level is. This not only burdens the maintenance person but also means that analyzing the cause of a failure in an emergency takes time.

Patent Document 2 discloses a system that detects a failure occurring in a UNIX server, records the failure contents in a plurality of different log files, and notifies the system maintenance person. However, because this system notifies of every failure, the maintenance person must still isolate the cause. A wide variety of logs are notified, ranging from information-level messages that require no urgent action, such as state changes and login information, to messages accompanying a failure that requires urgent intervention by the system administrator. The system maintenance person is therefore overwhelmed with sorting out this information.

Patent Document 3 discloses a log information management device and a log information management program that manage system log data useful for system management. This log information management program enables keyword searches of messages in the system log file and provides a technique by which the system administrator can extract, as needed, the log information best suited to pursuing the cause of a failure. However, this system incurs the man-hours needed for the search.

The technique disclosed in Patent Document 4 is an example of reducing the maintenance person's burden by selecting which log information to notify. Patent Document 4 discloses that a log management application monitors the log data output from a driver application, determines the error level, and records errors at or above a predetermined level in a log database.

Patent Document 5 discloses centrally monitoring failures during system operation, collecting the log information stored in the activation processing unit when a failure is detected, and reporting it together with an alarm message to an external remote maintenance management system.

Patent Document 1: JP 2004-021549 A
Patent Document 2: JP 2000-148541 A
Patent Document 3: JP 2003-141075 A
Patent Document 4: JP 2009-151680 A
Patent Document 5: JP 2001-325124 A

The problem with the technique of Patent Document 1, namely the burden of performing checks and maintenance operations on two screens, and the complexity of the search work in the technique of Patent Document 2 both make monitoring of a distributed computer system difficult.

Because the technique of Patent Document 4 determines the error level from log data output by an application, it cannot detect an error of a more serious level that occurs in the operating system. In addition, when a plurality of applications are running at the same time, duplicate error messages may be generated, which hinders efficient handling. Furthermore, errors with the same root cause may appear as different symptoms depending on the application, so the means of Patent Document 4 does not lead to a fundamental solution.

For errors that occur during application startup, the technique of Patent Document 5 reports all log data stored in the activation processing unit to the remote maintenance management system, but it does not report errors occurring in the OS in general. In addition, because all log data is reported when a failure occurs, the burden of judgment falls on the system maintenance person. Furthermore, once the application has started normally, this technique relies on log data output from the application, so, like Patent Document 4, it is not a fundamental solution.

An object of the present invention is to provide a failure notification device and the like that notify error messages in a way that enables a maintenance person to perform maintenance efficiently.

According to the present invention, there is provided a failure notification device comprising: monitoring means for acquiring logs relating to failures from the operating system of a monitoring target device; analysis means for determining the importance of the logs acquired from the operating system; and notification means for notifying failure notification information representing the contents of those logs whose importance is equal to or higher than a threshold.

According to the present invention, there is provided a failure notification program that causes a computer to execute: a monitoring process of acquiring logs relating to failures from the operating system of a monitoring target device; an analysis process of determining the importance of the logs acquired from the operating system; and a notification process of notifying failure notification information representing the contents of those logs whose importance is equal to or higher than a threshold.

According to the present invention, there is provided a failure notification method comprising: acquiring logs relating to failures from the operating system of a monitoring target device; determining the importance of the logs acquired from the operating system; and notifying failure notification information representing the contents of those logs whose importance is equal to or higher than a threshold.

According to the present invention, a failure notification device and the like are obtained that notify failure information in a way that enables a maintenance person to perform maintenance efficiently.

Next, the basic configuration of an embodiment of the present invention will be described.

The configuration of the embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a functional block diagram of a distributed computer network including a failure notification device according to an embodiment of the present invention.

The failure notification device 301 according to the present invention is provided, as an example, in the monitoring target UNIX device 1. The failure notification device 301 includes a monitoring agent function 11, a system log function 12, a failure log file 12c, and a failure notification function 15. The monitoring target UNIX device 1 includes the failure notification device 301, a system log 12a, various logs 12b, a failure monitoring function 13, a failure management function 14, a failure notification information management table 14a, and the failure notification function 15.

The monitoring target UNIX device 1 operates in a distributed computer network system. Monitoring target devices A3, B4, C5, D6, E7, and F8 are also connected to the distributed computer network system.

The monitoring agent function 11 monitors the operating status of the operating system of the monitoring target UNIX device 1, acquires logs representing that status, and passes the acquired logs to the system log function 12. The monitoring agent function 11 is also called the monitoring unit 310.

The system log function 12 is provided by, for example, a UNIX-based OS. The system log function 12 records, as the system log 12a, various messages such as the actions, operations, and failure information of the software that implements the basic functions of the UNIX-based OS (the kernel) and of programs that run in the background (daemons). The system log function 12 also records various other logs as the various logs 12b. These are logs in which OS-standard functions, such as the mail server software and the scheduled job execution program built into the UNIX-based OS, record messages such as their operation logs and failure information in a separate file for each function. Here, the mail server software is, for example, sendmail, and the scheduled job execution program is, for example, cron. The various logs 12b are a group of files.

The system log function 12 also records, in the failure log file 12c, messages reporting hardware failures of the monitoring target UNIX device 1 that the monitoring agent function 11 has detected.

In addition, from the messages recorded in the system log 12a and the various logs 12b, the system log function 12 selects those with a high "importance", which indicates the failure level, and stores them in the failure log file 12c. The system log function 12 is also called the analysis unit 320.

FIG. 4 shows an example of the contents of the failure log file 12c. The "importance" of a message recorded in the failure log file 12c has four levels: 4: emerg (an error so severe that the system is unusable), 3: alert (an error that must be dealt with urgently), 2: crit (a critical error), and 1: err (a general error). Which importance levels are recorded is defined in the system log setting file 12d. For example, the system log setting file 12d defines the levels to record with rules such as "record importance 4 only", "record importance 3 and 4", or "record all of importance 1 to 4". As described later, FIG. 4 shows the case where "record importance 3 and 4" is defined in the system log setting file 12d. In other words, FIG. 4 is an example of the failure log file 12c when the importance threshold is 3. The system log setting file 12d is a setting file held by the system log function 12; in addition to the "importance" above, it also defines the "destination file name" for messages. The destination file name is, for example, the file name of the failure log file 12c. The failure log file 12c is also called the storage unit 330. The system log setting file 12d is stored in a storage unit (not shown) of the system log function 12.

Next, an outline of the operation of this embodiment will be described with reference to FIG. 1. The system log function 12 determines the importance of the entries in the system log 12a and the various logs 12b, and registers those whose importance meets a given criterion in the failure log file 12c. This determination is based on the per-message importance defined in the system log setting file 12d. In the example shown in FIG. 4, the threshold is importance 3, and failure logs at or above this value (alert, emerg) are registered. Failure logs may be registered under other criteria. For example, if the criterion is an importance of 4 or higher, only the failure log of the kernel program that occurred at 9:51:27 on December 2 is recorded.
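The threshold-based selection described here can be sketched as follows. This is a minimal illustration, not the implementation of the patent: the `LogRecord` type, the `select_for_failure_log` function, and the sample records are hypothetical, and only the four-level importance scale and the "importance at or above the threshold" rule come from the text.

```python
from dataclasses import dataclass

# Importance scale as described in the text (4 is the most severe).
IMPORTANCE = {"emerg": 4, "alert": 3, "crit": 2, "err": 1}


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical record type; the field names are assumptions."""
    timestamp: str
    host: str
    program: str
    level: str   # one of the IMPORTANCE keys
    detail: str


def select_for_failure_log(records, threshold):
    """Keep only records whose importance is at or above the threshold."""
    return [r for r in records if IMPORTANCE.get(r.level, 0) >= threshold]


# Hypothetical sample data, not the contents of FIG. 4.
records = [
    LogRecord("Dec  2 09:51:27", "host1", "kernel", "emerg", "sample message"),
    LogRecord("Dec  2 09:52:00", "host1", "cron", "alert", "sample message"),
    LogRecord("Dec  2 09:53:10", "host1", "sendmail", "err", "sample message"),
]

selected = select_for_failure_log(records, threshold=3)
```

With a threshold of 3 the sketch keeps the emerg and alert records; raising the threshold to 4 keeps only the emerg record, which mirrors the two cases discussed above.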

The failure monitoring function 13 constantly monitors the failure log file 12c. When it detects that a new failure log has been recorded, it converts that failure log into the data format of a failure notification message so that it can be notified to the maintenance person. The failure monitoring function 13 then sends the failure notification message to the failure management function 14. Based on the failure notification message received from the failure monitoring function 13, the failure management function 14 registers, in the failure notification information management table 14a and in association with one another, the failure occurrence time, information identifying where the failure occurred (such as the name of the failed hardware or the application process name), and the details of the failure. The contents recorded in the failure notification information management table 14a are the same as the corresponding contents of FIG. 4, except that they have been converted into the data format of a failure notification message.

The failure notification function 15 monitors the failure notification information management table 14a at a fixed time interval. When it detects that a failure notification message has been newly registered, it acquires the message from the failure notification information management table 14a and sends it to the maintenance network monitoring device 2. The maintenance person performs maintenance management of the distributed computer network centrally through the GUI screen 21 connected to the maintenance network monitoring device 2. The failure notification function 15 is also called the notification unit 340.

Next, the operation of the failure monitoring function 13 in this embodiment will be described in detail with reference to the processing outline flowchart of FIG. 5.

The failure monitoring function 13 reads the failure log file 12c at a fixed interval of a set number of seconds, or according to a specific schedule, and checks whether a new log has been written (steps S052 to S059). If there is a new log (YES in step S054), it converts the data into the failure notification message format used to notify the maintenance person (step S055) and sends the failure notification message to the failure management function 14 (step S056). The interval is set to, for example, about 10 to 30 seconds, taking into account the failure frequency, the required immediacy of failure detection, and system load conditions.

As mentioned above, FIG. 4 is an example of the contents of the failure log file 12c. Each record stored in the failure log file 12c consists of the "failure occurrence date and time", "host name of the failed device", "failed program", "failure details", and "failure importance", stored in association with one another. FIG. 4 is an example in which records of failures with importance 3 and 4 are stored. Because the failure log file 12c records the "failure occurrence date and time", it is efficient, when reading the file, to use this record to read only the new logs. This "failure occurrence date and time" is therefore used as position information indicating where, in the failure log file, reading was completed in the previous cycle and where the next read should start. That is, a marker is attached to the data associated with the "date and time of the data up to which the failure log file 12c has been read" (hereinafter, "date/time DP"; DP: Data Position). In the next cycle, the failure monitoring function 13 uses this position information to read the logs after the "date/time DP" up to the last line of the file (step S053).

The "date/time DP" information is initialized when the failure monitoring function 13 starts (step S051). The failure monitoring function 13 reads the failure log file 12c from the logs written after the "date/time DP" to the last line of the file (step S053). The failure monitoring function 13 determines whether there is a new log (step S054). If there is a new log (YES in step S054), the failure monitoring function 13 converts the acquired log into the failure notification message format used to notify the maintenance person (step S055). The failure monitoring function 13 then sends this failure notification message to the failure management function 14 (step S056). The failure monitoring function 13 sets the "failure occurrence date and time" of the last failure message read as the "date/time DP" and completes the update (step S057). After pausing for the set number of seconds (step S058), the failure monitoring function 13 resumes processing from S052.

If there is no new log (NO in step S054), the failure monitoring function 13 pauses for the set number of seconds (step S058) and then resumes processing from S052.
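The polling cycle of steps S051 to S058 can be sketched as below. This is only an illustration under stated assumptions: the record layout (a dict with `timestamp`, `host`, `program`, `importance`, `detail`), the message format produced by `to_notification`, the callables `read_failure_log` and `send_to_manager`, and the choice to initialize the date/time DP to the current time are all hypothetical; the text does not specify them.

```python
import time
from datetime import datetime


def to_notification(record):
    """S055: convert a failure-log record into a (hypothetical) message format."""
    return {
        "occurred_at": record["timestamp"].isoformat(),
        "source": f'{record["host"]}/{record["program"]}',
        "importance": record["importance"],
        "detail": record["detail"],
    }


def poll_once(read_failure_log, send_to_manager, date_dp):
    """One cycle of S053 to S057; returns the (possibly updated) date/time DP."""
    # S053: read only records written after the date/time DP.
    new_records = [r for r in read_failure_log() if r["timestamp"] > date_dp]
    if not new_records:                      # S054: NO
        return date_dp
    for record in new_records:               # S054: YES
        send_to_manager(to_notification(record))   # S055, S056
    # S057: the last record read becomes the new date/time DP.
    return max(r["timestamp"] for r in new_records)


def run(read_failure_log, send_to_manager, period_seconds=10.0):
    """Endless monitoring loop; the text suggests a period of about 10 to 30 s."""
    date_dp = datetime.now()                 # S051 (initial value is an assumption)
    while True:
        date_dp = poll_once(read_failure_log, send_to_manager, date_dp)
        time.sleep(period_seconds)           # S058
```

One design point worth noting: because this sketch compares timestamps strictly, a record that shares its timestamp with the current date/time DP but is written after the read would be skipped. A real implementation would likely need a finer-grained position, such as a file offset, alongside the timestamp.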

Next, the operation of the failure management function 14 in this embodiment will be described in detail with reference to part of FIG. 5 and the processing outline flowchart of FIG. 6.

The failure management function 14 receives the failure notification message sent by the failure monitoring function 13 (step S056 in FIG. 5) (step S061). The failure management function 14 then registers it in the failure notification information management table 14a (step S062).
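As a rough sketch of the registration step, the table can be modeled as an in-memory list of rows. The class name `FailureNotificationTable`, its fields, and the `rows_after` helper are hypothetical; the text only states that the occurrence time, the occurrence location, and the failure details are registered in association with one another.

```python
from dataclasses import dataclass, field


@dataclass
class FailureNotificationTable:
    """Hypothetical in-memory stand-in for the management table 14a."""
    rows: list = field(default_factory=list)

    def register(self, occurred_at, location, detail):
        """S062: store the associated fields as one row and return its index."""
        self.rows.append(
            {"occurred_at": occurred_at, "location": location, "detail": detail}
        )
        return len(self.rows) - 1

    def rows_after(self, position):
        """Rows registered after the given index, for a reader tracking its position."""
        return self.rows[position + 1:]
```

Keeping the three fields in a single row is what preserves the association between them; a persistent store would serve the same purpose in a real system.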

Next, the operation of the failure notification function 15 in this embodiment will be described in detail with reference to the processing outline flowchart of FIG. 7. When reading the failure notification information management table 14a, it is efficient to read only the new failure notification information. A marker is therefore attached to the data for which reading of the failure notification information management table 14a has been completed. With this marker, the information on the "position of the data up to which the failure notification information management table has been read" (hereinafter, "management table DP"; DP: Data Position) is initialized (step S071). Following this initialization, the failure notification function 15 performs the loop processing of S072 to S07a. In this loop, the failure notification function 15 first reads the failure notification information management table 14a from the marker position, reading the failure notification messages registered after the already-read data up to the last row (step S073). This read may be performed at a fixed interval of a set number of seconds or according to a specific schedule.

The failure notification function 15 checks whether any new failure notification message has been written (step S074). If there is a new failure notification message (YES in step S074), the failure notification function 15 repeats the processing of S075 to S07b. That is, the failure notification function 15 sends a failure notification message to the maintenance network monitoring device 2 (step S076). As the means of notifying the maintenance network monitoring device 2, the failure notification function 15 uses trap messages of SNMP (Simple Network Management Protocol), a UDP (User Datagram Protocol) based protocol for network monitoring and management.

The failure notification function 15 then updates the "management table DP" information (step S077). If the number of SNMP traps sent is less than or equal to the prescribed number (YES in step S078), the failure notification function 15 repeats S075 to S07b. If the number of SNMP traps sent exceeds the prescribed number (NO in step S078), the failure notification function 15 returns to step S072 and repeats the processing.

When the failure log file is read at a fixed interval, the interval (step S079) is set short, for example about 1 second, so that failures are reported promptly.
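The polling behaviour of steps S071 to S079 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the table is modelled as a plain Python list, the management table DP as an integer index, and `send` is a hypothetical callable standing in for SNMP trap transmission (no SNMP encoding is attempted here). The names `notify_pass`, `poll`, `max_traps`, and `cycles`, and the choice to initialize the DP to the start of the table, are assumptions made for the example.

```python
import time
from typing import Callable, List


def notify_pass(table: List[str], dp: int,
                send: Callable[[str], None], max_traps: int) -> int:
    """One pass over the table (roughly S073 to S078); returns the updated DP."""
    sent = 0
    for message in table[dp:]:      # S073: read from the marker to the last row
        send(message)               # S076: notify the monitoring device
        dp += 1                     # S077: advance the management table DP
        sent += 1
        if sent > max_traps:        # S078 "NO": end this pass, resume next cycle
            break
    return dp


def poll(table: List[str], send: Callable[[str], None],
         period_s: float = 1.0, max_traps: int = 10, cycles: int = 3) -> int:
    """Run a bounded number of polling cycles; a real daemon would loop forever."""
    dp = 0                          # S071: initialize the DP (start of table assumed)
    for _ in range(cycles):
        dp = notify_pass(table, dp, send, max_traps)
        time.sleep(period_s)        # S079: short fixed period, about 1 second
    return dp
```

Because the DP only ever moves forward, each registered message is sent once, and a burst larger than the per-pass limit is simply carried over to the next cycle.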

In this embodiment, the failure notification function 15 may be configured to also serve as the failure monitoring function 13, the failure management function 14, and the failure notification information management table 14a. The failure monitoring function 13 may also be configured to incorporate the failure management function 14.
A configuration that combines the failure monitoring function 13 and the failure management function 14 in this way, or the failure management function 14 alone, is also referred to simply as the management unit. The failure notification information management table 14a is also referred to as the management table.

Alternatively, the failure notification function 205 may be configured to send a failure notification message whenever data is stored in the failure log file 12c. In this case, the failure monitoring function 13, the failure management function 14, and the failure notification information management table 14a are unnecessary.

In this embodiment, failure management uses the system log function of the OS, so maintenance can be performed with an existing GUI monitor. In addition, because failure notification messages are sent only for failures of high importance, maintenance personnel can respond efficiently.

The monitoring agent function 11, the system log function 12, the failure monitoring function 13, the failure management function 14, and the failure notification function 15 may be implemented in hardware such as logic circuits, or by executing a program stored in a memory (not shown).

In this embodiment UNIX is used as the OS, but another OS may be used. For example, LINUX (registered trademark) may be used.
(Second Embodiment)
Next, a second embodiment based on the first embodiment described above will be described. In the following description, components that are the same as in the first embodiment are given the same reference numerals as the corresponding parts of FIG. 1 of the first embodiment, and redundant description is omitted.

FIG. 8 is a schematic block diagram showing the configuration of the second embodiment of the present invention.

In this embodiment, in addition to the configuration of FIG. 1, an application program 206 and an application program 207 that run on the monitored UNIX device 200 are provided.

Like the monitored UNIX device 1, the monitored UNIX device 200 is connected to a distributed computer network system. Monitored device A 220, monitored device B 230, monitored device C 240, monitored device D 250, monitored device E 260, and monitored device F 270 are also connected to this distributed computer network system so that they can communicate via the wide-area LAN 9.

The system log 202a, the various logs 202b, and the failure log file 202c in the monitored UNIX device 200 have the same functions as the system log 12a, the various logs 12b, and the failure log file 12c of the first embodiment, respectively.

When the application program 206 or the application program 207 detects a processing error during execution, it converts the failure information into data in the failure information message format used to notify maintenance personnel. The application programs 206 and 207 then send a failure notification message to the failure management function 204.

The failure management function 204 receives the failure notification message sent by the application 206 or 207 and registers the failure notification information in the failure notification information management table 204a, in the same way as when it receives a failure information message from the failure monitoring function 203. The application 206 and the application 207 are also referred to as execution monitoring units.

The failure notification messages notified from the failure management function 204, the application program 206, and the application program 207 are registered in the failure notification information management table 204a. The failure notification function 205 of the failure notification device 302 sends these failure notification messages to the maintenance network monitoring device 210. As with the failure notification function 15 of the first embodiment, the failure notification function 205 also sends to the maintenance network monitoring device 210 the failure notification messages that the failure monitoring function 13 read from the failure log file 202c and passed to the failure management function 204. Maintenance personnel manage the maintenance of the distributed computer network centrally through the GUI screen 211 connected to the maintenance network monitoring device 210.
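As a rough sketch of how an application might hand a detected error to the failure management function through a common interface, consider the following. The message layout is not specified in this section, so the `FailureMessage` fields, the `FailureManager` class, and the `report_application_error` helper are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FailureMessage:
    """Hypothetical failure information message format."""
    occurred_at: float   # failure occurrence time (epoch seconds, assumed)
    source: str          # application process name or hardware name
    detail: str          # description of the failure


class FailureManager:
    """Stands in for the failure management function and its management table."""

    def __init__(self) -> None:
        self.table: List[FailureMessage] = []

    def register(self, message: FailureMessage) -> None:
        # Log-derived and application-derived messages share one table.
        self.table.append(message)


def report_application_error(manager: FailureManager, app_name: str,
                             error: Exception,
                             now: Optional[float] = None) -> FailureMessage:
    """Convert an application error into the common format and register it."""
    message = FailureMessage(
        occurred_at=time.time() if now is None else now,
        source=app_name,
        detail=f"{type(error).__name__}: {error}",
    )
    manager.register(message)
    return message
```

With an interface of this kind, a newly added application only needs to call the shared reporting helper; it does not need its own table or notification path.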

According to the second embodiment of the present invention, not only hardware failures and OS errors occurring in the monitored UNIX device 1 but also error information detected by various application programs can be managed centrally in the failure notification information management table 204a. Consequently, when a new application program is added, matching the application program's failure notification interface to that of the failure management function 204 is enough to cover everything from management of failure information to failure notification to the maintenance network monitoring device 210. There is therefore no need to build a failure management function 204 or a failure notification function 205 specific to each application program.
(Third Embodiment)
Next, a third embodiment based on the first embodiment described above will be described. In the following description, components that are the same as in the first embodiment are given the same reference numerals as the corresponding parts of FIG. 1 of the first embodiment, and redundant description is omitted.

In the first embodiment, failure notifications are recorded in the failure log file 12c according to their importance. In the third embodiment, an importance is set for each application, and failure notifications are recorded according to that importance. The failure monitoring function 13 is configured to notify the failure management function 14 only of failure messages that significantly affect an application in use. Based on the failure information message received from the failure monitoring function 13, the failure notification function 15 registers in the failure notification information management table 14a the failure occurrence time, information identifying where the failure occurred (such as the name of the failed hardware or the application process name), and the details of the failure. As in the first embodiment, the failure notification function 15 monitors the failure notification information management table 14a at a fixed interval. When it detects that failure notification information has been newly registered, it obtains the failure information from the failure notification information management table 14a and sends a failure notification message to the maintenance network management device 2.

In this embodiment, failure messages that do not directly affect an application in use are not notified, so the user is not needlessly interrupted by maintenance work, and maintenance can be carried out more efficiently than in the first embodiment.

As an example, consider an environment in system X running application A, which directly affects the operation of system X, and application B, which does not. In system X, if the process of application A stops, this is a significant failure, but if the process of application B stops, it is not a serious problem. By defining the importance of the "process stop" failure separately for applications A and B, only failure information useful to maintenance personnel is notified. In other words, this embodiment makes it possible to assign a process-stop failure a higher importance for application A than for application B.
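The application A and application B example can be illustrated with a small lookup keyed by application and failure type. The numeric importance values, the threshold, the `"process_stop"` label, and the rule that unlisted pairs default to importance 0 are assumptions made for this sketch, not values taken from the patent.

```python
from typing import Dict, Tuple

# Hypothetical per-application importance for the same failure type.
IMPORTANCE: Dict[Tuple[str, str], int] = {
    ("A", "process_stop"): 3,   # A directly affects system X
    ("B", "process_stop"): 1,   # B does not affect system X
}
THRESHOLD = 2  # failures at or above this importance are notified (assumed)


def should_notify(app: str, failure_type: str,
                  importance: Dict[Tuple[str, str], int] = IMPORTANCE,
                  threshold: int = THRESHOLD) -> bool:
    """Return True when this failure for this application warrants notification."""
    # Unlisted (application, failure) pairs are treated as importance 0.
    return importance.get((app, failure_type), 0) >= threshold
```

Here a process stop in application A is notified while the same event in application B is suppressed, which is the distinction the example in the text draws.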

As in the second embodiment, this embodiment may be configured so that an execution monitoring unit, such as the application programs 206 and 207, identifies the application program being executed.

This embodiment may also be configured so that an application program identification unit, which identifies the application program being executed, determines which application program is running.
(Fourth Embodiment)
A fourth embodiment will be described with reference to FIG. 9. The fourth embodiment of the present invention is a failure notification device 300 having a monitoring unit 310 that acquires logs from an operating system, an analysis unit 320 that determines the importance of the logs acquired from the operating system, and a notification unit 340 that notifies a maintenance device, as failure notifications, of those logs whose importance is equal to or greater than a threshold.
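The monitor, analyze, and notify chain of the fourth embodiment might look like the sketch below. It assumes log lines of the hypothetical form `severity: message` and maps the standard syslog severity keywords to an importance score where a larger number means more important (the reverse of syslog's own numeric codes). The mapping, the line format, and the function names are illustrative assumptions.

```python
from typing import Dict, Iterable, List

# Assumed mapping: syslog severity keyword -> importance (higher is more important).
SEVERITY_IMPORTANCE: Dict[str, int] = {
    "debug": 0, "info": 1, "notice": 2, "warning": 3,
    "err": 4, "crit": 5, "alert": 6, "emerg": 7,
}


def analyze(line: str) -> int:
    """Analysis step: determine the importance of one log line."""
    severity, _, _ = line.partition(":")
    # Lines with an unknown or missing severity are given the lowest importance.
    return SEVERITY_IMPORTANCE.get(severity.strip().lower(), 0)


def select_notifications(lines: Iterable[str], threshold: int) -> List[str]:
    """Notification step: keep only logs whose importance meets the threshold."""
    return [line for line in lines if analyze(line) >= threshold]
```

For instance, with a threshold of 4 only lines tagged `err` or more severe would be passed on to the maintenance device.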

The effect of this embodiment is that, by using the operating system's logs, efficient maintenance is possible even in an existing computer network.

The present invention, described above using the first to fourth embodiments as examples, is achieved by supplying the information processing device 1000 shown in FIG. 10 with a program capable of implementing the functions of the flowcharts referred to in the description of the embodiments (FIGS. 5, 6, and 7), or the units shown within the devices in the block diagrams of FIGS. 1, 8, and 9, and then having the CPU 1100 (CPU: Central Processing Unit) execute that program. The program supplied to the information processing device 1000 may be stored in the readable and writable temporary storage memory 1200 or in a non-volatile storage device 1300 such as a hard disk drive.

A block diagram of a distributed computer network including the failure notification device according to the first embodiment of the present invention.
A block diagram showing an example of the overall configuration of a network monitoring system using server management software.
A block diagram showing an example of the overall configuration of an existing network monitoring system before a monitored device is added.
A block diagram showing an example of the overall configuration of an existing network monitoring system after a monitored device is added.
A diagram showing an example of the contents of the failure log file according to the first embodiment of the present invention.
A flowchart showing the processing of the failure monitoring function according to the first embodiment of the present invention.
A flowchart showing the processing of the failure management function according to the first embodiment of the present invention.
A flowchart showing the processing of the failure notification function according to the first embodiment of the present invention.
A block diagram showing the configuration of the second embodiment of the present invention.
A block diagram showing the configuration of the fourth embodiment of the present invention.
A block diagram showing an information processing device that executes the present invention as a computer program.

1 Monitored UNIX device
2 Maintenance network monitoring device
3 Monitored device A
4 Monitored device B
5 Monitored device C
6 Monitored device D
7 Monitored device E
8 Monitored device F
9 Wide-area LAN
11 Monitoring agent function
12 System log function
12a System log
12b Various logs
12c Failure log
12d System log configuration file
13 Failure monitoring function
14 Failure management function
14a Failure notification information management table
15 Failure notification function
21 GUI screen
100 Monitored device
101 Application program
102 Monitoring agent function
110 Monitored device
111 Application program
112 Monitoring agent function
120 Monitored device
121 Application program
122 Monitoring agent function
130 Maintenance network monitoring device
131 Network monitoring manager function
131a GUI screen
140 Monitored device
150 Monitored device
160 Monitored device
141 Application program
151 Application program
161 Application program
162 Monitoring agent function
170 Maintenance network monitoring device
171 Integrated monitoring function
171a GUI screen for maintenance work
172 Network monitoring manager function
172a GUI screen
200 Monitored UNIX device
202a System log
202b Various logs
202c Failure log
203 Failure monitoring function
204 Failure management function
204a Failure notification information management table
205 Failure notification function
206 Application program
207 Application program
210 Maintenance network monitoring device
211 GUI screen
220 Monitored device A
230 Monitored device B
240 Monitored device C
250 Monitored device D
260 Monitored device E
270 Monitored device F
300 Failure notification device
301 Failure notification device
302 Failure notification device
310 Monitoring unit
320 Analysis unit
340 Notification unit
1000 Information processing device
1100 CPU
1200 Temporary storage memory
1300 Storage device

Claims

A failure notification device comprising:
a monitoring unit that acquires a log relating to a failure from an operating system of a monitored device;
an analysis unit that determines the importance of the log acquired from the operating system;
a notification unit that notifies failure notification information indicating the content of those logs whose importance is equal to or greater than a threshold;
a storage unit that stores the logs whose importance is equal to or greater than the threshold; and
a management unit that monitors the storage unit and, when a new log is registered, registers the new log as the failure notification information in a management table of the monitored device,
wherein the notification unit monitors the management table and, when new failure notification information is registered, reads the new failure notification information from the management table and notifies it, and has a function of attaching a marker, in the management table, to the new failure notification information that has been read.

The failure notification device according to claim 1, further comprising an execution monitoring unit that monitors an application program,
wherein the management unit registers, in the management table as the failure notification information, a failure message relating to the application program acquired by the execution monitoring unit.

The failure notification device according to claim 1, wherein the analysis unit determines the importance of the log for each of a plurality of application programs executable by the monitored device, and the notification unit notifies, as the failure notification information, the content of the log corresponding to the application program being executed by the monitored device.

A failure notification program causing a computer to execute:
a monitoring process of acquiring a log relating to a failure from an operating system of a monitored device;
an analysis process of determining the importance of the log acquired from the operating system;
a notification process of notifying failure notification information indicating the content of those logs whose importance is equal to or greater than a threshold;
a process of storing the logs whose importance is equal to or greater than the threshold in a storage unit; and
a management process of monitoring the storage unit and, when a new log is registered, registering the new log as the failure notification information in a management table of the monitored device,
wherein the notification process includes a process of monitoring the management table and, when new failure notification information is registered, reading the new failure notification information from the management table and notifying it, and attaching a marker, in the management table, to the new failure notification information that has been read.

The failure notification program according to claim 4, further causing the computer to execute an execution monitoring process of monitoring an application program,
wherein the management process includes a process of registering, in the management table as the failure notification information, a failure message relating to the application program acquired in the execution monitoring process.

The failure notification program according to claim 4, wherein the analysis process includes a process of determining the importance of the log for each of a plurality of application programs executable by the monitored device, and the notification process includes a process of notifying, as the failure notification information, the content of the log corresponding to the application program being executed by the monitored device.

A failure notification method comprising:
acquiring a log relating to a failure from an operating system of a monitored device;
determining the importance of the log acquired from the operating system;
notifying failure notification information indicating the content of those logs whose importance is equal to or greater than a threshold;
storing the logs whose importance is equal to or greater than the threshold in a storage unit;
monitoring the storage unit and, when a new log is registered, registering the new log as the failure notification information in a management table of the monitored device; and
monitoring the management table and, when new failure notification information is registered, reading the new failure notification information from the management table and notifying it, and attaching a marker, in the management table, to the new failure notification information that has been read.

The failure notification method according to claim 7, further comprising monitoring an application program,
wherein a failure message relating to the application program, acquired by monitoring the application program, is registered in the management table as the failure notification information.

The failure notification method according to claim 7 or 8, wherein the importance of the log is determined for each of a plurality of application programs executable by the monitored device, and the content of the log corresponding to the application program being executed by the monitored device is notified as the failure notification information.