JP2003216457A

JP2003216457A - Error log collecting and analyzing agent system

Info

Publication number: JP2003216457A
Application number: JP2002013654A
Authority: JP
Inventors: Yoshitsune Yamamura; 喜恒山村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-01-23
Filing date: 2002-01-23
Publication date: 2003-07-31

Abstract

PROBLEM TO BE SOLVED: When a plurality of products are involved in a business system, the log file of each product has to be traced separately, which makes it difficult to identify the cause of a fault step by step, starting from the point of failure, across the business system as a whole.

SOLUTION: The error log collecting and analyzing agent system collects the log information output by the products of a plurality of vendors at the time of a failure, organizes it in time sequence according to a criterion of importance for the business system as a whole rather than the importance defined for each product, and generates a report of the essential cause of the failure and its countermeasure. With this feature, when a failure occurs in the business system, not only the visible failure but also its essential cause is identified, and MTTR is shortened.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a fault information notification system that, in a computer system composed of one or more application products, collects and analyzes log information during normal operation or at the time of a failure and notifies the system administrator of the cause of the failure and its countermeasure.

[0002]

[Description of the Related Art] When computer software for a medium- to large-scale business system is developed, it is rare for all of the application products to be developed in-house. In most cases a single business system is built by combining several products supplied by existing vendors. When a system built from the products of multiple vendors is operated in this way, whether the system as a whole and each vendor's product are operating normally has been confirmed by judging the consistency of the data actually registered in the database and by analyzing the contents of the log files output by each system. The format of the log files often differs from vendor to vendor, and when a failure occurs it is necessary to check the error locations product by product and combine them to infer the cause of the failure, which takes a great deal of time. Furthermore, because the importance of a message output to a log file depends on the severity defined for each product, a message that is important for the system as a whole may be overlooked if its importance for the individual product is low. As a method of supporting log file analysis, Japanese Patent Laid-Open No. 9-321728 discloses a technique of extracting only the log recording time information from the log information and creating a recording time index.

[0003]

[Problems to Be Solved by the Invention] The technique of the above known example is said to enable maintenance personnel to efficiently find the log information needed to isolate and analyze a failure. However, although log information can be searched efficiently with a specific time as the reference, when multiple products are involved in a business system the log file of each product must be traced separately, so it is considered difficult to identify the cause of a failure step by step, starting from the time of the failure, across the business system as a whole. Moreover, even when the cause of an error has been identified, only an administrator who is familiar with the system can devise the actual countermeasure, so recovery takes a great deal of time when that administrator is absent. An object of the present invention is to solve these problems of the prior art, which are particularly likely to arise in a business system built by combining multiple products, and to improve the maintenance efficiency of the system by shortening the MTTR (Mean Time To Repair: the time from a service stop caused by a failure until the service is restored) when a failure occurs.

[0004]

[Means for Solving the Problems] To solve the above problems, the error log collecting and analyzing agent system according to the present invention collects the log information output by the products of multiple vendors when a failure occurs, arranges it in time series, organizes it according to a criterion of importance for the business system as a whole rather than the importance defined for each product, and generates a report of the essential cause of the failure and its countermeasure. With this function, when a failure occurs in the business system, not only the visible failure but also its root cause is identified, and the MTTR is shortened.

Because the output format of log information differs from vendor to vendor, the message code, the message, and, when the message code relates to an error, the cause of that error and its countermeasure are converted into a common format and registered in advance in a database provided in the agent. Data is registered in this database in the common format by setting keywords, expressed as regular expressions, against the message list manuals distributed by each vendor as electronic media, so that only the messages the system administrator judges to be important are extracted. When the products that make up the computer system have functional interrelationships, the database also has a function for registering information related to them.

When a failure occurs in the system, the agent system of the present invention collects the log information of the products that make up the system, arranges it in time series, obtains from the database the cause of and countermeasure for the error corresponding to each message code, and generates a failure information report. The system administrator carries out system recovery by referring to the error causes and countermeasures described in the failure information report. The present invention thus makes it easier to investigate the cause of a failure involving multiple products and to establish countermeasure procedures. In addition, to arrange the order in which errors occurred accurately in time series, the agent system has a function for synchronizing the system time of the computers on which the software constituting the business system runs.

[0005]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the embodiment, and the configuration of the invention is described with reference to it. The agent system 02 according to the present invention is broadly composed of a failure countermeasure information master creation unit 08, a preparation processing unit 05, a log information analysis processing unit 10, a failure report creation processing unit 15, and an inter-network time synchronization processing unit 07. The failure countermeasure information master creation unit 08 creates, before the agent system performs any analysis, the failure countermeasure information master 09, in which the causes of errors and their countermeasures are gathered in advance.

[0006] The preparation processing unit 05, which consists of a log information acquisition processing unit 06, reads the analysis parameters 03 and the collection target product definition 04 and imports the log files of the collection target log file group 01 into the agent system. The analysis parameters 03 contain the analysis target end time of the log information, the criterion for judging the importance of the information needed to identify the cause of a system failure, and the related information reference time, which specifies how close together one message and the next must be for them to be judged related. The collection target product definition 04 contains file path information indicating where the log file of each product is located on the computer system.
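The two inputs described above could be represented roughly as follows. This is a minimal sketch, not the format used in the patent: the class name `AnalysisParameters`, its field names, the product keys, and the file paths are all hypothetical; only the three parameter kinds and the idea of a per-product file path come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AnalysisParameters:
    """Hypothetical container for the analysis parameters 03."""
    end_time: datetime         # analysis target end time
    target_importance: str     # analysis target importance, e.g. "A"
    related_window: timedelta  # related information reference time


# Hypothetical collection target product definition 04:
# product name -> path of that product's log file (paths are invented).
PRODUCT_DEFINITION = {
    "database": "/var/log/vendor_a/db.log",
    "comm_control": "/var/log/vendor_b/comm.log",
}

params = AnalysisParameters(
    end_time=datetime(2001, 10, 11, 20, 40, 0),
    target_importance="A",
    related_window=timedelta(seconds=60),
)
```

Keeping the parameters immutable (`frozen=True`) reflects that they are read once at start-up and then only consulted.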

[0007] The log information analysis processing unit 10, which consists of a log information lexical analysis processing unit 11, an event time-series sorting unit 12, and a failure cause/countermeasure identification unit 13, arranges the collected log information in time series, estimates the cause of each error and its countermeasure by querying the failure countermeasure information master 09, and generates the failure cause/countermeasure temporary file 14.

[0008] The failure report creation processing unit 15, which consists of a failure cause/countermeasure report creation unit 16 and an operation information mail transmission unit 17, generates the failure information report 18 and the operation information 19 from the failure cause/countermeasure temporary file 14. The agent system also includes the inter-network time synchronization processing unit 07 in order to synchronize the timestamps of the log files of products running on different computers.

[0009] FIG. 2 shows the processing flow of the present invention. The processing performed by the agent system during failure monitoring is described below with reference to FIG. 2.

[0010] First, when the agent system is started (S200), the analysis parameter setting process (S201) of the preparation processing unit 05 reads the analysis parameters 03 and determines the parameter values needed to analyze the log information. The information defined in the analysis parameters is described in detail later with reference to FIG. 3. Next, the preparation processing unit 05 reads the collection target product definition 04, obtains the file path information of the log files to be collected, and imports the contents of the collection target log file group 01 one file at a time (S202). The collection target product definition 04 defines in advance, for each product whose logs are collected, where its log file is stored, in a form that follows the conventions of the operating system installed on the computer. The collection target log file group 01 may include files on remote servers reached over the network. If a log file cannot be imported, for example because the file does not exist, a message describing the failure is displayed and processing moves on to the next log file. The imported log information is passed unchanged to the log information analysis processing unit 10.
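The import step (S202), including its "report the failure and move on" behavior, can be sketched as below. The function name `collect_logs` and the mapping of product names to paths are assumptions made for illustration; the patent does not specify how files are read, and remote files are not handled here.

```python
from pathlib import Path


def collect_logs(product_paths):
    """Read each product's log file; report and skip files that cannot be read.

    product_paths: hypothetical mapping of product name -> log file path.
    Returns a mapping of product name -> list of raw log lines.
    """
    collected = {}
    for product, path in product_paths.items():
        try:
            text = Path(path).read_text(encoding="utf-8", errors="replace")
        except OSError as exc:
            # Display the reason for the failed import, then continue
            # with the next log file instead of aborting the whole run.
            print(f"failed to import log for {product}: {exc}")
            continue
        collected[product] = text.splitlines()
    return collected
```

Catching `OSError` covers a missing file as well as permission problems, so a single unreadable log does not prevent the other products' logs from being analyzed.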

[0011] Next, the log information lexical analysis processing unit 11 of the log information analysis processing unit 10 analyzes the imported log information (S203) and extracts the date and time at which each log entry was output, the message code, the message corresponding to the message code, and any other additional information. The configuration of the log information lexical analysis processing unit 11 is described later with reference to FIG. 7.
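As an illustration of the lexical analysis step (S203), the sketch below extracts date, time, message code, and message from one log line. The line layout `YYYY/MM/DD HH:MM:SS <code> <message>` is an assumption modeled on the timestamp style that appears in the FIG. 3 example; real vendor formats differ, and the sample message text is invented.

```python
import re
from datetime import datetime

# Assumed layout: "2001/10/11 20:36:30 LOG200-E some message text"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<code>\S+)(?: (?P<message>.*))?$"
)


def parse_line(product, line):
    """Return a record dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    timestamp = datetime.strptime(
        f"{m['date']} {m['time']}", "%Y/%m/%d %H:%M:%S"
    )
    return {
        "product": product,
        "timestamp": timestamp,
        "code": m["code"],
        "message": m["message"] or "",
    }


record = parse_line("database", "2001/10/11 20:36:30 LOG200-E disk write failed")
```

Returning `None` for unrecognized lines lets the caller decide whether to skip them or keep them as additional information.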

[0012] Next, the event time-series sorting process 12 is executed (S204). The data that meets the conditions specified in the analysis parameters 03 is sorted in descending order starting from the time at which the failure occurred, and the result is passed to the failure cause/countermeasure identification unit 13. This process makes it possible to bring together the log information of the multiple products that make up the system and arrange it in time series.
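A minimal sketch of the sorting step (S204) follows. It assumes each parsed record is a dict with a `timestamp` key (a hypothetical representation) and treats "meets the conditions" as "recorded at or before the analysis target end time"; other conditions in the analysis parameters are not modeled.

```python
from datetime import datetime


def sort_events(records, end_time):
    """Keep records up to end_time and order them newest first."""
    in_scope = [r for r in records if r["timestamp"] <= end_time]
    return sorted(in_scope, key=lambda r: r["timestamp"], reverse=True)


# Records from two different products, merged into one list.
records = [
    {"product": "database", "timestamp": datetime(2001, 10, 11, 20, 36, 30)},
    {"product": "comm_control", "timestamp": datetime(2001, 10, 11, 20, 36, 50)},
    {"product": "database", "timestamp": datetime(2001, 10, 11, 20, 45, 0)},
]
ordered = sort_events(records, end_time=datetime(2001, 10, 11, 20, 40, 0))
```

Because the records of all products are sorted in a single list, the result interleaves the products by time, which is the point of the step.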

[0013] Using the message code in the log information as the key, the failure cause/countermeasure identification unit 13 queries the failure countermeasure information master 09, which was built in advance by the failure countermeasure information master process (S211) of the failure countermeasure information master creation unit 08, for the cause of the error and the countermeasure for recovering from the failure (S205). The results are first accumulated in the failure cause/countermeasure temporary file 14 by the temporary file writing process (S206). Next, the temporary file reading process (S207) of the failure report creation processing unit 15 reads the contents of the failure cause/countermeasure temporary file 14, and the report creation process 16 creates the failure information report 18, which contains the details of the failure that occurred, its cause, and the countermeasure (S208).
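The inquiry step (S205) amounts to a keyed lookup. In the sketch below the master is modeled as an in-memory dict keyed by product and message code; this representation, the function name `identify`, and the cause and countermeasure texts are all placeholders, not content from the patent.

```python
# Hypothetical stand-in for the failure countermeasure information master 09.
MASTER = {
    ("database", "LOG200-E"): {
        "importance": "A",
        "cause": "placeholder cause text",
        "countermeasure": "placeholder countermeasure text",
    },
}


def identify(records, master):
    """Attach cause and countermeasure to every record found in the master."""
    results = []
    for record in records:
        entry = master.get((record["product"], record["code"]))
        if entry is None:
            continue  # message code not registered in the master
        results.append({**record, **entry})
    return results


found = identify(
    [
        {"product": "database", "code": "LOG200-E"},
        {"product": "database", "code": "LOG999-I"},
    ],
    MASTER,
)
```

The returned list plays the role of the temporary file contents: one entry per log record that has a registered cause and countermeasure.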

[0014] The same content as the failure information report 18 is sent to the system administrator by the operation information mail transmission unit 17 (S209). By referring to the failure information report 18 created by the agent system, the system administrator can easily grasp the sequence of processing that led up to the failure, even when the cause involves multiple products, and can therefore quickly understand the failure and its cause and take countermeasures. Furthermore, by running the agent system periodically and receiving the operation information 19 by e-mail, the administrator can keep track of the warnings issued by the computer system and prevent failures before they occur.

[0015] FIG. 3 shows the analysis parameters 03, which define the operating criteria of the agent system. The analysis parameters 03 define the analysis target end time, which is the starting point for analyzing the collected log information; the analysis target importance, which is the criterion for selecting error information; and the related information reference time, which is used to group error information into a single category as related information. The agent system reads these parameters when it starts and first begins analyzing the log information recorded before the analysis target end time. Next, it checks the analysis target importance and extracts the log information that corresponds to the severity of the system failure. The importance used as this criterion must be set in advance for each error message in the failure countermeasure information master 09, which is described later with reference to FIG. 4. When a system failure is caused by several of the products that make up the system, extracting their failure information as a single category makes it easier to find the root cause. In the example of FIG. 3, the database product log entry 2001/10/11 20:36:30 LOG200-E is taken as the reference, and the log information that occurs within 60 seconds of it and has importance level A, namely the communication control product log entry 2001/10/11 20:36:50 System Down, is extracted as related information. The processing flow of related information extraction is described later with reference to FIG. 8.
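The FIG. 3 example can be worked through in code. The sketch below uses the two timestamps, the 60-second window, and importance level A from the text; the record layout, the function name `related_events`, and the two extra records that are filtered out are assumptions added to show the exclusion cases.

```python
from datetime import datetime, timedelta


def related_events(base, candidates, window, importance):
    """Return candidates logged within `window` after `base` at the given importance."""
    return [
        c for c in candidates
        if c is not base
        and timedelta(0) <= c["timestamp"] - base["timestamp"] <= window
        and c["importance"] == importance
    ]


base = {"product": "database", "message": "LOG200-E",
        "timestamp": datetime(2001, 10, 11, 20, 36, 30), "importance": "A"}
events = [
    base,
    {"product": "comm_control", "message": "System Down",
     "timestamp": datetime(2001, 10, 11, 20, 36, 50), "importance": "A"},
    # Hypothetical records that should NOT be picked up:
    {"product": "comm_control", "message": "too late",
     "timestamp": datetime(2001, 10, 11, 20, 38, 0), "importance": "A"},
    {"product": "comm_control", "message": "low importance",
     "timestamp": datetime(2001, 10, 11, 20, 36, 40), "importance": "B"},
]
related = related_events(base, events, timedelta(seconds=60), "A")
```

The System Down entry is 20 seconds after the reference entry and has importance A, so it is the only record selected; the entry 90 seconds later and the importance B entry are left out.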

[0016] FIG. 4 shows how the failure countermeasure information master creation unit 08 builds the failure countermeasure information master 09. The description assumes, as an example, a computer system (such as a resident management system or a staff payroll management system) composed of a database system made by company A, a communication control system made by company B, and so on.

[0017] When a failure occurs in the computer system, the agent system uses the message code output in the log information as the key to query the failure countermeasure information master 09 for the countermeasure, and creates a report of that countermeasure. The failure countermeasure information master 09, in which messages, message codes, countermeasures, and similar information are registered, must therefore be built before the computer system goes into operation. The method of building the failure countermeasure information master 09 is described below with reference to FIG. 4.

[0018] The message codes and messages registered in the failure countermeasure information master 09 are normally distributed by the vendor of each product as the message list manual group 29. In most cases messages are classified by severity into three types: error, warning, and information. During daily system monitoring, or when a failure occurs, the system administrator judges the importance of these messages to determine whether a failure has occurred and how far its effects extend. However, this severity is defined for each product and does not necessarily match the importance of the message when the business system as a whole is considered. For example, if the database product made by company A outputs a warning such as "Insufficient disk space 60%", this may not be an immediate problem for the database on its own, but for a computer system that runs continuously 24 hours a day it is an important issue that requires work such as adding disks to be scheduled promptly.

[0019] In other words, relying only on the severity defined for each product carries the risk of overlooking a problem that is critical to the operation of the system as a whole. The system administrator therefore needs to know the importance of each message for the system as a whole, separately from the severity defined for each product, and registers this information as well in the failure countermeasure information master 09.

[0020] The failure countermeasure information master 09 is built by the failure countermeasure information master creation unit 08, which is broadly composed of an automatic data input processing unit 25 and a user input processing unit 31. First, the message list manual group 29 distributed by the vendors is registered automatically in the failure countermeasure information master 09.

[0021] The automatic data input processing unit 25 has a function that lets the system administrator select the necessary information from the message codes and corresponding messages supplied as electronic media for each vendor's product, using a search function based on regular expressions, and register it automatically. A regular expression is a way of expressing part of a character string in generalized form rather than as a specific string; in a keyword search of a document it allows part of the string to be left open to substitution. For example, by including '*', which stands for an arbitrary character string, in the search keyword and searching for 'LOG*-E', only the messages related to system errors can be picked out efficiently from the message collection. (Error-level messages usually end in -E; similarly, warning messages generally end in -W and information messages in -I.) The automatic data input processing unit 25 is composed of automatic registration processing units (26, 27, 28), one for each vendor's own format in the message list manual group 29. For example, the company A automatic data registration processing unit 26 automatically imports into the failure countermeasure information master 09 the necessary information on the products that company A supplies as electronic media in the message list manual group 29, and the company B automatic data registration processing unit 27 analyzes the information on the products that company B supplies as electronic media and imports it into the failure countermeasure information master 09.
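The 'LOG*-E' selection can be sketched as follows. Note that '*' meaning "any string" is shell-style wildcard syntax; in Python's `re` syntax the equivalent pattern is `LOG.*-E`, and the standard-library function `fnmatch.translate` performs that conversion. The entry layout and all codes other than LOG200-E are invented for illustration.

```python
import fnmatch
import re


def select_messages(entries, pattern):
    """Keep entries whose message code matches a wildcard pattern such as 'LOG*-E'."""
    regex = re.compile(fnmatch.translate(pattern))
    return [e for e in entries if regex.match(e["code"])]


# Hypothetical excerpt of a vendor message list.
entries = [
    {"code": "LOG200-E", "message": "error-level message"},
    {"code": "LOG201-W", "message": "warning-level message"},
    {"code": "LOG300-I", "message": "information-level message"},
    {"code": "LOG450-E", "message": "another error-level message"},
]
errors = select_messages(entries, "LOG*-E")
```

Changing the pattern to 'LOG*-W' would select the warning-level entries instead, which is how an administrator could choose which severity classes to register.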

[0022] These individual automatic registration processing units (26, 27, 28) may be implemented as independent components with a common interface. In that case, when a product in the computer system is added or replaced, the corresponding unit can easily be exchanged for another automatic data registration component 30.

[0023] Next, the system administrator uses the user input processing unit 31 as needed to register additional information in the failure countermeasure information master 09. The additional information includes, for example, know-how from failures the system administrator has handled in the past and information supplied as supplementary material by the vendors of the products that make up the computer system.

[0024] FIG. 5 shows the table structure of the data registered in the failure countermeasure information master 09. The failure countermeasure information master 09 contains, for each message, the name of the product in the computer system, the message code, the message, the importance of the message when the computer system as a whole is considered, the cause when the message is output, and the countermeasure when the message is output (32).
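One possible rendering of this table structure is shown below with the standard-library `sqlite3` module. The six columns follow the fields listed in the text; the table name, column names, primary key choice, and the sample row's message, cause, and countermeasure texts are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE countermeasure_master (
        product        TEXT NOT NULL,  -- name of the product
        message_code   TEXT NOT NULL,  -- message code
        message        TEXT NOT NULL,  -- message text
        importance     TEXT NOT NULL,  -- importance for the system as a whole
        cause          TEXT,           -- cause when the message is output
        countermeasure TEXT,           -- countermeasure when the message is output
        PRIMARY KEY (product, message_code)
    )
    """
)
conn.execute(
    "INSERT INTO countermeasure_master VALUES (?, ?, ?, ?, ?, ?)",
    ("database", "LOG200-E", "placeholder message", "A",
     "placeholder cause", "placeholder countermeasure"),
)
row = conn.execute(
    "SELECT importance, countermeasure FROM countermeasure_master "
    "WHERE product = ? AND message_code = ?",
    ("database", "LOG200-E"),
).fetchone()
```

Keying on product plus message code allows two vendors to use the same code string without colliding.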

[0025] FIG. 6 shows a specific example of a log file output by a product in the system. Each log file in the collection target log file group 01 contains information (33) such as the date and time at which the log entry was output, the message code, and the message.

[0026] FIG. 7 shows the configuration of the log information lexical analysis processing unit 11. Because the log files imported by the preparation processing unit 05 are output in each vendor's own format, the log information lexical analysis processing unit 11 is composed of separate analysis processing units (34, 35, 36), which extract the necessary information.

[0027] For example, if the log file obtained by the preparation processing unit 05 is log information output by the database product made by company A, the company A database analysis processing unit 34 performs lexical analysis and extracts the date, the time, the message code, the message, and any additional information. The components (34, 35, 36) that make up the log information lexical analysis processing unit 11 may be implemented as independent components with a common interface. In that case, when another vendor's product is newly added to the running computer system, or when the log file format of an existing product changes, the affected component can easily be replaced with another log analysis component 37.
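The "independent components with a common interface" idea can be sketched with a small parser registry. Everything here is hypothetical: the `LogParser` protocol, the two vendor classes, and both line formats are invented to show how one component could be swapped without touching the others.

```python
from typing import Optional, Protocol


class LogParser(Protocol):
    """Common interface assumed for every per-vendor analysis component."""

    def parse(self, line: str) -> Optional[dict]: ...


class VendorAParser:
    """Hypothetical format: '<code> <message>'."""

    def parse(self, line: str) -> Optional[dict]:
        code, _, message = line.partition(" ")
        return {"code": code, "message": message} if code else None


class VendorBParser:
    """Hypothetical format: '<message>|<code>'."""

    def parse(self, line: str) -> Optional[dict]:
        message, sep, code = line.rpartition("|")
        return {"code": code, "message": message} if sep else None


# Registry: product name -> parser component. Replacing an entry swaps the
# component for one product without changing the calling code.
PARSERS: dict = {"database": VendorAParser(), "comm_control": VendorBParser()}


def parse(product: str, line: str) -> Optional[dict]:
    return PARSERS[product].parse(line)
```

Adding a new vendor's product then means writing one class with a `parse` method and registering it under the product's name.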

[0028] FIG. 8 illustrates the processing flow of the failure cause/countermeasure identification unit 13. The unit executes the failure countermeasure inquiry process 39 until the imported log information is exhausted (38). Taking the log information sorted in descending order of date and time, the inquiry process 39 uses each message code as the key to query the failure countermeasure information master 09, obtains the cause of the failure and its countermeasure from the data registered in advance, and stores them in the failure cause/countermeasure temporary file 14.

[0029] Next, the unit reads the next record of the log information and determines whether it was output within the unit time, measured from the output time of the message code (40). The purpose is to extract log information as a single category when two or more products are performing related processing. In a system where two or more products work together, for example where the communication control product made by company B depends on the database product made by company A being in operation, analysis is more efficient if the failure information is extracted as a single group. If the company A database product terminates abnormally, some failure should also occur in the cooperating company B communication control product and leave log information behind, so these pieces of log information are obtained together under the category of related information.

[0030] In such a case, the agent system assumes that the log information of the company B communication control product is output several seconds to several minutes after the log information of the company A database product. For log information that falls within the related information reference time defined in the analysis parameters 03, related information inquiry processing 41 retrieves the corresponding entries from the failure countermeasure information master 09 and stores them as related information in the failure cause/countermeasure temporary file 14.
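The time-window test (40) and the grouping of related information can be sketched as follows. The function name `related_records`, the tuple layout, and the three-minute window are assumptions for illustration; in the described system the window would come from the related information reference time in the analysis parameters 03.

```python
from datetime import datetime, timedelta


def related_records(records, trigger, window):
    """Return the records output within `window` after the trigger record.

    `records` is a list of (timestamp, product, code) tuples and `trigger`
    is one of them. The trigger itself is not included in the result.
    """
    start = trigger[0]
    end = start + window
    return [r for r in records if r != trigger and start <= r[0] <= end]


# Hypothetical example: a database error followed by a communication error.
trigger = (datetime(2002, 1, 23, 10, 0, 0), "A-database", "DB-0042")
log = [
    (datetime(2002, 1, 23, 9, 59, 0), "B-comm", "I001"),   # before the trigger
    trigger,
    (datetime(2002, 1, 23, 10, 0, 5), "B-comm", "E100"),   # 5 seconds later
    (datetime(2002, 1, 23, 10, 30, 0), "B-comm", "E200"),  # outside the window
]
group = related_records(log, trigger, timedelta(minutes=3))
```

With these sample values only the `E100` record falls inside the window, so it would be stored alongside the database error as related information.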

[0031] FIG. 9 shows an example of the failure information report 18, which presents the analysis result of the error log information. The failure information report 18 lists, in a format such as failure report 43, the products in which errors occurred and their events arranged in time series, the cause of each error, and the countermeasure procedure for each error. By referring to this report, the system administrator can immediately take measures to recover from the failure.
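A report with the three kinds of content listed above could be assembled along these lines. The layout is purely an assumption, since FIG. 9 is not reproduced in the text; `format_report` and the sample entries are hypothetical.

```python
def format_report(entries):
    """Render (timestamp, product, code, cause, action) entries as plain text.

    Entries are listed in ascending timestamp order so the events read as a
    time series, each followed by its cause and countermeasure.
    """
    lines = ["FAILURE INFORMATION REPORT"]
    for ts, product, code, cause, action in sorted(entries, key=lambda e: e[0]):
        lines.append(f"{ts}  {product}  {code}")
        lines.append(f"  Cause: {cause}")
        lines.append(f"  Countermeasure: {action}")
    return "\n".join(lines)


# Hypothetical entries, with ISO-style timestamp strings so they sort correctly.
report = format_report([
    ("2002-01-23 10:00:05", "B-comm", "E100",
     "Connection to database lost", "Check the database, then reconnect"),
    ("2002-01-23 10:00:00", "A-database", "DB-0042",
     "Database terminated abnormally", "Restart the database service"),
])
```

The same text could serve as the body of the e-mail sent to the administrator, although the description does not say whether the report and the e-mail share one format.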

[0032] Operating the agent system according to the embodiment described above periodically has the effect of preventing failures before they occur. The system administrator sets in advance, for the message information and warning information output by each product making up the computer system, an importance level that reflects the operation of the system as a whole, and then runs the agent system periodically. By identifying, from the failure information reports and the e-mail sent to the administrator, content that is likely to lead to a failure, the administrator can predict failures and take countermeasures in advance. A further effect is that, when a system failure is minor, general users other than the system administrator can also take countermeasures. Conventionally, failure recovery know-how was held only by the system administrator, and it was extremely difficult for anyone else to recover the system. With the agent system according to this embodiment, the system administrator's know-how can be registered in advance in the failure countermeasure information master, so people other than the system administrator can easily recover from minor failures.

[0033]

[Effects of the Invention] According to the present invention, MTTR can be shortened. Conventionally, the system administrator collected the individual log files of the products making up the computer system one by one, extracted the data for the specific date and time at which the failure occurred, analyzed that data for critical factors that could lead to a system outage, read the manuals provided by each vendor, and only then worked out a countermeasure. Restoring the computer system therefore took a great deal of time. Delegating this procedure to the present system shortens MTTR.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of the error log collection and analysis agent system according to the present invention.

FIG. 2 is a diagram showing the processing flow of the error log collection and analysis agent system according to the present invention.

FIG. 3 is a diagram showing a configuration example of the analysis parameters of the error log collection and analysis agent system according to the present invention.

FIG. 4 is a block diagram showing the configuration of the failure countermeasure information master creation unit of the error log collection and analysis agent system according to the present invention.

FIG. 5 is a diagram showing a table configuration example of the failure countermeasure information master of the error log collection and analysis agent system according to the present invention.

FIG. 6 is a diagram showing a specific example of a log file of the error log collection and analysis agent system according to the present invention.

FIG. 7 is a block diagram showing the configuration of the log information lexical analysis processing unit of the error log collection and analysis agent system according to the present invention.

FIG. 8 is a diagram showing the processing flow of the failure cause/countermeasure specifying unit of the error log collection and analysis agent system according to the present invention.

FIG. 9 is a diagram showing an output example of the failure information report of the error log collection and analysis agent system according to the present invention.

[Explanation of symbols]

01: collection target log file group
02: error log collection and analysis agent system
03: analysis parameters
04: collection target product definition
05: preparation processing unit
06: log information import processing unit
07: inter-network time synchronization processing unit
08: failure countermeasure information master creation unit
09: failure countermeasure information master
10: log information analysis processing unit
11: log information lexical analysis processing unit
12: event time-series sorting unit
13: failure cause/countermeasure specifying unit
14: failure cause/countermeasure temporary file
15: failure report creation processing unit
16: failure cause/countermeasure report creation unit
17: operation information mail transmission unit
18: failure information report
19: operation information

Claims

[Claims]

1. An error log collection and analysis agent system comprising: failure countermeasure information master creating means for registering, as a failure information master, a message code that a system outputs as log information, the message, the importance of the message with respect to the system, and, for error messages among the messages, the error cause and countermeasure information; inter-network time synchronization processing means for synchronizing the time of the computers in the system; means for acquiring information in a system-independent format from the log information; log information analysis processing means having means for analyzing the information in the system-independent format; and failure report creation processing means having means for outputting the analysis result as a failure information report and means for transmitting the report as an e-mail for system management.

2. The error log collection and analysis agent system according to claim 1, wherein the failure countermeasure information master creating means comprises automatic data analysis processing means for extracting keywords by regular expressions from message lists distributed by a plurality of product vendors, and data input means by which a user inputs additional information.

3. An error log collection and analysis agent system, wherein at least one of the log information lexical analysis processing means provided in the log information analysis processing means according to claim 1 and the automatic data analysis processing means according to claim 2 is constituted by independent components.

4. An error log collection and analysis agent system, wherein the log information analysis processing means according to claim 1 comprises means for organizing the log information of a plurality of products together in time series, and means for classifying failure information within a unit time range as a category of related failure information.

5. The error log collection and analysis agent system according to claim 1, wherein the means for acquiring information in a system-independent format from the log information selectively takes in the necessary information from the log information based on analysis parameters.

6. An error log collection and analysis agent system comprising: failure countermeasure information master creating means for registering, as a failure information master, a message code that a system having a plurality of products outputs as log information, the importance of the message with respect to the entire system, and, for error messages among the messages, the error cause and countermeasure information; inter-network time synchronization processing means for synchronizing the time of the computers in the system; means for acquiring, from the log information, information in a format that does not depend on the individual system; log information analysis processing means having means for analyzing the information in the system-independent format; and failure report creation processing means having means for outputting the analysis result as a failure information report and means for transmitting the report as an e-mail for system management.

7. An error log collection and analysis agent system, wherein the plurality of products in claim 6 are products provided by a plurality of different vendors, and the means for acquiring information in a system-independent format extracts the date and time at which the log was output, the message code, the message corresponding to the message code, and other additional information.

8. An error log collection and analysis method comprising: a failure countermeasure information master creation processing step of registering, as a failure information master, a message code that a system outputs as log information, the importance of the message with respect to the system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing step of synchronizing the time of the computers in the system; a step of acquiring information in a system-independent format from the log information; a log information analysis processing step of analyzing the information in the system-independent format; and a failure report creation processing step of outputting the analysis result as a failure information report and transmitting the report as an e-mail to a system administrator terminal.

9. An error log collection and analysis method comprising: a failure countermeasure information master creation step of registering, as a failure information master, a message code that a system including a plurality of products outputs as log information, the importance of the message with respect to the entire system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing step of synchronizing the time of the computers in the system; a step of acquiring, from the log information, information in a format that does not depend on the individual system; a log information analysis processing step of analyzing the information in the system-independent format; and a failure report creation processing step of outputting the analysis result as a failure information report and transmitting the report as an e-mail to a system administrator terminal.

10. An error log collection and analysis method, wherein the plurality of products in claim 9 are products provided by a plurality of different vendors, and the step of acquiring information in a system-independent format extracts the date and time at which the log was output, the message code, the message corresponding to the message code, and other additional information.

11. A computer-readable recording medium recording an error log collection and analysis agent program that causes a computer to realize: a failure countermeasure information master creation function of registering, as a failure information master, a message code that a system outputs as log information, the importance of the message with respect to the system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing function of synchronizing the time of the computers in the system; a function of acquiring information in a system-independent format from the log information; a log information analysis processing function having means for analyzing the information in the system-independent format; and a failure report creation processing function having means for outputting the analysis result as a failure information report and a function of transmitting the report as an e-mail for system management.

12. A computer-readable recording medium recording an error log collection and analysis agent program that causes a computer to realize: a failure countermeasure information master creation function of registering, as a failure information master, a message code that a system having a plurality of products outputs as log information, the importance of the message with respect to the entire system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing function of synchronizing the time of the computers in the system; a function of acquiring, from the log information, information in a format that does not depend on the individual system; a log information analysis processing function of analyzing the information in the system-independent format; and a failure report creation processing function having a function of outputting the analysis result as a failure information report and a function of transmitting the report as an e-mail for system management.

13. A computer-readable recording medium recording an error log collection and analysis agent program, wherein the plurality of products in claim 12 are products provided by a plurality of different vendors, and the function of acquiring information in a system-independent format extracts the date and time at which the log was output, the message code, the message corresponding to the message code, and other additional information.

14. An error log collection and analysis agent program that causes a computer to realize: a failure countermeasure information master creation function of registering, as a failure information master, a message code that a system outputs as log information, the importance of the message with respect to the system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing function of synchronizing the time of the computers in the system; a function of acquiring information in a system-independent format from the log information; a log information analysis processing function having means for analyzing the information in the system-independent format; and a failure report creation processing function having means for outputting the analysis result as a failure information report and a function of transmitting the report as an e-mail for system management.

15. An error log collection and analysis agent program that causes a computer to realize: a failure countermeasure information master creation function of registering, as a failure information master, a message code that a system including a plurality of products outputs as log information, the importance of the message with respect to the entire system, and, for error messages among the messages, the error cause and countermeasure information; an inter-network time synchronization processing function of synchronizing the time of the computers in the system; a function of acquiring, from the log information, information in a format that does not depend on the individual system; a log information analysis processing function of analyzing the information in the system-independent format; and a failure report creation processing function having a function of outputting the analysis result as a failure information report and a function of transmitting the report as an e-mail for system management.

16. An error log collection and analysis agent program, wherein the plurality of products in claim 15 are products provided by a plurality of different vendors, and the function of acquiring information in a system-independent format extracts the date and time at which the log was output, the message code, the message corresponding to the message code, and other additional information.