JP2005316728A - Fault analysis device, method, and program

Info

Publication number: JP2005316728A
Application number: JP2004133998A
Authority: JP
Inventors: Nobutane Mori; 信胤森; Makoto Sato; 佐藤　　誠
Original assignee: Mitsubishi Electric Corp; Mitsubishi Electric Information Systems Corp; Mitsubishi Electric Information Technology Corp
Current assignee: Mitsubishi Electric Corp; Mitsubishi Electric Information Systems Corp; Mitsubishi Electric Information Technology Corp
Priority date: 2004-04-28
Filing date: 2004-04-28
Publication date: 2005-11-10
Anticipated expiration: 2024-04-28
Also published as: JP4575020B2

Abstract

PROBLEM TO BE SOLVED: To specify a fault part from the fault condition of the relevant work configuration appliances and resources, and to immediately check the function of the specified fault part so that the fault part can be specified easily.

SOLUTION: This fault analysis device is provided with a work configuration information table 41 that stores, for each work, the configuration appliances and resources constituting the work of a monitoring object system; a monitoring method information table 42 that stores monitoring methods for the respective configuration appliances and resources; a monitoring part 51 that monitors the monitoring object system; and a check method specification part 33 that, when the monitoring part detects a fault of the monitoring object system, estimates the work causing the fault by referring to the work configuration information table, extracts the configuration appliances and resources constituting that work, and applies the monitoring methods for those configuration appliances and resources.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an apparatus and a method that estimate the cause of a failure from the failure situation observed in a system and identify the element that is the source of the failure.

In a distributed computer system, various kinds of monitoring are performed within the system in preparation for failures. One example is alive/dead monitoring of a system, or of the elements constituting it, by queries from outside the device, as typified by ping monitoring. Beyond failure detection, there is also resource monitoring by an agent inside the device (presence of processes, free disk space, CPU usage, and so on) and response-time monitoring from the viewpoint of the users of a business process.
These monitoring activities, however, run independently according to their individual settings and do not cooperate with one another. Some run on a short cycle and others on a long cycle, so the several knock-on failures induced by a single root cause are detected at different times. Moreover, because each monitor collects only surface-level symptoms, analyzing the source of a failure requires specialist skill, and identifying the root cause location takes time. Shortening the monitoring cycle improves the real-time nature and the simultaneity of detection, but the monitoring load grows because more data must be collected from the system and more cause analysis must be performed.

There are also techniques that manage business process flows centrally and collect their execution results so that failure data can be gathered quickly, for example the “Management Manager Computer, Recording Medium, and Computer Operation Management Method” of Patent Document 1 and the “business flow operation information acquisition method and business flow system” of Patent Document 2. These techniques track execution results from the viewpoint of the business process flow. They may therefore report the fact that a failure has occurred, but they take no account of failures in the computer resources, external services, or network paths that the business process uses. In other words, they disclose nothing about failure analysis and say nothing about identifying the root cause of a failure.
JP-A-11-134306
JP 2003-67222 A

A conventional system monitoring apparatus is configured as described above and merely collects failure statuses individually, or merely estimates the cause of a failure by referring to a failure database that records past failures. The problem is that the underlying failure is difficult to find.

The present invention was made to solve this problem. Its object is to identify the failed part from the failure status of the related business component devices and resources, and to check the function of the identified part immediately, so that the failed part can be pinpointed easily.

The failure analysis apparatus according to the present invention comprises:
a business configuration information table that stores, for each business, the component devices and resources that make up the business of the monitored system;
a monitoring method information table that stores an individual monitoring method for each component device and resource;
a monitoring unit that monitors the monitored system; and
a related-failure confirmation method specifying unit that, when the monitoring unit detects a failure of the monitored system, refers to the business configuration information table to estimate the business in which the failure occurred, extracts the component devices and resources that make up that business, and applies the monitoring methods for those component devices and resources.

According to the present invention, the apparatus includes the business configuration information table, the monitoring method information table, and the related-failure confirmation method specifying unit, which estimates the business in which a failure occurred, extracts the component devices and resources that make up that business, and applies their monitoring methods. The failed part can therefore be function-checked immediately and identified easily, without advanced failure-analysis expertise.

Embodiment 1.
When a system failure occurs, its root cause must lie somewhere in the group of resources, including devices and software, that make up the business. Therefore, if a failure in the related business component devices or resources is estimated as soon as a system failure is detected, and the estimated devices or resources are function-checked immediately, the underlying failed part can be identified in a short time. This does not increase the monitoring load during normal operation.
FIG. 1 is a block diagram showing the configuration of the failure analysis apparatus according to Embodiment 1 of the present invention, based on the above idea.
In the configuration of FIG. 1, the failure analysis apparatus monitors the monitored system 1 for failures during normal operation, at a fixed cycle or the like, by means of the monitoring units 51 and 52. Similarly, the on-demand failure monitoring units 53 and 54 monitor the failure status of the components in the monitored system as needed.
As described in detail later, if the monitored businesses are subdivided by client and the monitoring program monitors down to the responses of the resources used by those subdivided businesses, it is not difficult to determine at least which business the failure occurred in.

The failure analysis apparatus also has a related-failure confirmation method specifying unit 30. When a failure is detected from the monitoring data obtained by these monitoring units, this unit extracts the devices and resources related to the failure and specifies how to check their failure status. Internally, unit 30 has a failure occurrence business specifying unit 31, which estimates the business most likely to be affected by the detected failure; a related resource extraction unit 32, which extracts the devices and resources that carry out that business process; and a confirmation method specifying unit 33, which specifies how to check the failure status of those devices and resources.
The failure analysis apparatus further holds a system information group 40, which the related-failure confirmation method specifying unit 30 consults when estimating the failed part. This group contains a business configuration information table 41 and a per-target monitoring method information table 42. Finally, a failure information collection control unit 20 controls the entire series of processes described above.

FIG. 2 shows a concrete example of the configuration of the monitored system 1 of FIG. 1, together with the relationship between a business and the specific component devices and resources it uses.
In the figure, servers 101 to 106, which are computer nodes, host various programs 111 to 117 and are connected to one another via network devices 121 to 124 and network services 125 and 126.
The dotted line 130 in the figure traces the configuration of one business process, “business 1”. In this example, a processing request from client 1 passes through business program 3 and business program 4 in business server 3 (103) and is processed by business program 1 in business server 1 (101). During its processing, business program 3 also uses shared service program 1 (115), such as a name service or an authentication service. The dotted line 130 of “business 1” represents this business configuration as a line.
As for the monitoring performed by the normal-time failure monitoring unit 51 and the like, businesses can be named per client: for example, access to business program 1 from client 1 is called business 1, while access to the same business program 1 from client 2 is called business 11. This makes it possible to identify the affected business at a fairly fine level when a failure occurs. If the monitoring program also monitors the behavior of business program 1 and client program 1, detecting the occurrence of a failure is easy.

FIG. 3 shows example data in the business configuration information table 41 of FIG. 1; it expresses the configuration of “business 1” shown in FIG. 2 in table form. The table lists the devices and resources that make up business 1, together with the devices and resources that each of them depends on in order to operate. The component devices and resources of other businesses, such as business 2 and business 3, are of course stored in the same way.
FIG. 4 shows example data in the per-target monitoring method information table 42 of FIG. 1; it lists, in table form, the monitoring method for each individual component of the target system shown in FIG. 2.
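As a rough illustration of how the two tables might be held in memory, the sketch below models them as Python dictionaries. The row contents are assumptions: only the members of business 1 named in the description of FIG. 2 and the `ping NW1-1` check quoted for NW device 1 come from the text. The dependency entries and the other check string are hypothetical placeholders, since the actual rows of FIG. 3 and FIG. 4 are not reproduced here.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Component:
    """One row of the business configuration table (hypothetical layout)."""
    name: str
    depends_on: list[str] = field(default_factory=list)


# Business configuration information table 41: business -> component rows.
# Members follow the FIG. 2 description; dependencies are illustrative guesses.
BUSINESS_CONFIG = {
    "business 1": [
        Component("client program 1", ["NW device 1"]),
        Component("business program 3", ["shared service program 1"]),
        Component("business program 4"),
        Component("business program 1"),
        Component("shared service program 1"),
        Component("NW device 1"),
    ],
}

# Monitoring method information table 42: component -> check to run.
MONITORING_METHODS = {
    "NW device 1": "ping NW1-1",                  # quoted in the text
    "business program 1": "<process check>",      # placeholder, not from the text
}


def components_of(business: str) -> list[str]:
    """Return the component names registered for a business (empty if unknown)."""
    return [c.name for c in BUSINESS_CONFIG.get(business, [])]
```

Keeping the two tables separate mirrors the text: table 41 answers "what does this business consist of", and table 42 answers "how is each component checked".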

Next, the operation is described.
When the normal-time failure monitoring units 51 and 52 of FIG. 1 detect a failure in the monitored system, they notify the failure information collection control unit 20 of the detection. As an example, assume that client program 1 (116) in FIG. 2 stops receiving business responses; that is, no response returns within the configured time, which is treated as a failure detection.
On receiving this notification, the failure information collection control unit 20 passes the information to the related-failure confirmation method specifying unit 30. There, the failure occurrence business specifying unit 31 first estimates, from the response failure of the client program reported in the failure information, that the failure concerns business 1. This is the failed-business estimation step S101 in FIG. 18. Next, the related resource extraction unit 32 refers to the business configuration information table 41 shown in FIG. 3 and extracts the devices and resources related to business 1. This is the component device and resource estimation step S102 in FIG. 18. Then the confirmation method specifying unit 33 specifies the failure monitoring method for each device or resource from the per-target monitoring method information table 42 shown in FIG. 4. This is the monitoring method specifying step S104 in FIG. 18. FIG. 5 shows an example of the monitoring method information table for business 1 created by the related-failure confirmation method specifying unit 30.
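Steps S101, S102, and S104 can be sketched as three small lookups chained together. This is a minimal sketch under stated assumptions, not the patented implementation: the table contents, the `SYMPTOM_TO_BUSINESS` mapping, and all function names are hypothetical, and only the `ping NW1-1` check is taken from the text.

```python
# Hypothetical stand-ins for tables 41 and 42.
BUSINESS_CONFIG = {
    "business 1": [
        "client program 1", "business program 3", "business program 4",
        "business program 1", "shared service program 1", "NW device 1",
    ],
}
MONITORING_METHODS = {
    "NW device 1": "ping NW1-1",                               # from the text
    "business program 1": "check-process business-program-1",  # placeholder
}
# Assumed mapping from the component that reported the symptom to a business.
SYMPTOM_TO_BUSINESS = {"client program 1": "business 1"}


def estimate_business(failed_component):
    """S101: estimate which business the detected failure belongs to."""
    return SYMPTOM_TO_BUSINESS.get(failed_component)


def extract_resources(business):
    """S102: look up the devices and resources that make up the business."""
    return list(BUSINESS_CONFIG.get(business, []))


def specify_methods(resources):
    """S104: pair each resource with its registered check, skipping unknown ones."""
    return [(r, MONITORING_METHODS[r]) for r in resources if r in MONITORING_METHODS]


def build_check_list(failed_component):
    """Chain S101, S102 and S104 into the per-business check list (cf. FIG. 5)."""
    business = estimate_business(failed_component)
    if business is None:
        return []
    return specify_methods(extract_resources(business))
```

Calling `build_check_list("client program 1")` yields the list of (component, check) pairs that the control unit would then execute.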

On receiving the monitoring method information in the monitoring method information table 42b for business 1 (FIG. 5), the failure information collection control unit 20 immediately checks, via the on-demand failure monitoring units 53 and 54, the failure status of the devices and resources involved in business 1 within the monitored system. It does so one business component device or resource at a time, following the “monitoring method (current status checking method)” column of FIG. 5. This is the individual failure confirmation execution step S105 in FIG. 18. For example, ping NW1-1 is executed for NW device 1 in FIG. 5; if the expected response does not return, NW device 1 is confirmed and detected as the cause of the failure. If this check confirms that NW device 1 (121) in FIG. 2 has failed, then NW device 1 turns out to be the root cause of the lost business response at client program 1 (116).
FIG. 13 shows an example of a table describing the normal-time monitoring methods of this system. According to the monitoring intervals listed there for reference, NW device 1 is monitored every 20 minutes, so without the failure analysis apparatus of this embodiment, up to 20 minutes could elapse in the worst case before the root-cause failure is detected.
Thus, compared with a conventional monitoring apparatus that merely monitors each device and resource on its own independently configured cycle, the present failure analysis apparatus identifies the root cause of a business failure quickly.
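The immediate, on-demand checking of step S105 might look like the following sketch. The checks are injected as plain callables so the example runs without a network; in a real deployment each callable would wrap a command such as `ping NW1-1`. The function name `run_checks` and the stubbed results are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# A check returns True when the component responds normally.
Check = Callable[[], bool]


def run_checks(checks: Iterable[Tuple[str, Check]]) -> List[str]:
    """Run each check in order and return the names of components that failed."""
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run is treated as a failure
        if not ok:
            failed.append(name)
    return failed


# Stubbed checks standing in for the per-component methods of FIG. 5.
checks = [
    ("business program 1", lambda: True),
    ("NW device 1", lambda: False),  # simulated: no ping response
    ("shared service program 1", lambda: True),
]
failed = run_checks(checks)
```

Because the checks run as soon as the business-level symptom is reported, the failed component is found without waiting for its own monitoring cycle (20 minutes for NW device 1 in the example above).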

Embodiment 2.
Embodiment 1 described an example in which the confirmation method specifying unit 33 performs failure checks one by one, in sequence, on the business component devices and resources shown in FIG. 5. Checking in sequence, however, is inefficient. A natural way to estimate which component device or resource has failed is to refer to past failures and, if the situation is similar, to suspect the same device or resource. This embodiment estimates the business in which the failure occurred on the basis of such past history.
FIG. 6 is a block diagram showing the configuration of the failure analysis apparatus of this embodiment. A failure history information table 43 is added as a new component relative to the previous embodiment. The related-failure confirmation method specifying unit 30 refers to this table when it performs its processing.
FIG. 7 shows an example of the data stored in the failure history information table 43, which records the failure history of each device and resource of the monitored system.

Next, the operation is described.
FIG. 8 is a table in which the “failure occurrence date” and “failure severity” information obtained from the failure history information table 43 of FIG. 7 is added to the monitoring method information table of FIG. 5 created in Embodiment 1.
When a particular business is estimated to have failed, many devices and resources may be extracted for it. Applying the monitoring methods of FIG. 5 to all of them in sequence, with no priority control, takes a long time to produce a result.
Therefore, on the assumption that a part that failed in the past is likely to fail again, those parts are checked first and the result is reported as primary information. All remaining items are checked afterwards. This speeds up failure recovery measures.
In the example of FIG. 8, business program 1, NW device 1, and shared service program 1 are checked first.
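The history-based ordering can be illustrated with a simple two-group partition. This is a sketch under assumptions: the function name is hypothetical, the history is reduced to a set of component names (the dates and severities of FIG. 8 are not reproduced here), and the component list follows the FIG. 2 description of business 1.

```python
def prioritize_by_history(components, has_history):
    """Put components with a recorded past failure first.

    The original order is preserved inside each group, so the full list is
    still checked; only the order changes.
    """
    first = [c for c in components if c in has_history]
    rest = [c for c in components if c not in has_history]
    return first + rest


business_1 = [
    "client program 1", "business program 3", "business program 4",
    "business program 1", "shared service program 1", "NW device 1",
]
# Components that the FIG. 8 example treats as having a failure history.
has_history = {"business program 1", "NW device 1", "shared service program 1"}

ordered = prioritize_by_history(business_1, has_history)
```

The first three entries of `ordered` are the components with a failure history; their results can be reported as primary information before the rest are checked.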

Embodiment 3.
Embodiment 3 performs prioritized failure checking as in Embodiment 2, but the information used to decide priority is the failure history information table 43c illustrated in FIG. 9. The functional block diagram is omitted here. The failure occurrence frequency information is held as a frequency field added to the failure history information table 43 of FIG. 6, and the related-failure confirmation method specifying unit 30 refers to this field when it performs its processing.

The operation is similar to that of Embodiment 2. On the assumption that a part with a high failure frequency is likely to fail again, the business component devices and resources with a high failure frequency are checked first and the result is reported as primary information. All remaining items are checked afterwards, which speeds up failure recovery measures.
In the example of FIG. 9, if priority is given at a threshold of, say, three or more occurrences, NW device 1 and business program 1 are checked first. NW service 2 is excluded because it is not part of business 1.
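The threshold rule, including the exclusion of components outside the affected business, can be sketched as follows. The failure counts are invented for illustration (FIG. 9 is not reproduced here); only the threshold of three and the resulting priority set come from the text.

```python
def split_by_frequency(components, failure_counts, threshold=3):
    """Split a business's components into (check first, check later).

    Only components of the affected business are considered, so an entry
    such as "NW service 2" in the frequency table is ignored when it does
    not belong to that business.
    """
    first = [c for c in components if failure_counts.get(c, 0) >= threshold]
    rest = [c for c in components if failure_counts.get(c, 0) < threshold]
    return first, rest


business_1 = [
    "client program 1", "business program 3", "business program 4",
    "business program 1", "shared service program 1", "NW device 1",
]
# Hypothetical counts; "NW service 2" exceeds the threshold but is not in business 1.
failure_counts = {
    "NW device 1": 5,
    "business program 1": 3,
    "NW service 2": 4,
    "business program 3": 1,
}

first, rest = split_by_frequency(business_1, failure_counts)
```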

Embodiment 4.
This embodiment performs prioritized failure checking as in Embodiment 2, but the information used to decide priority is the system change history information table 44 illustrated in FIG. 10. The functional block diagram is omitted here. The system change history information table 44 is a table equivalent to the failure history information table 43 of FIG. 6 and is provided in the same place. The related-failure confirmation method specifying unit 30 refers to the system change history information table 44 when it performs its processing.

As in Embodiment 2, on the assumption that a part of the system that has been changed is likely to fail, those parts are checked first and the result is reported as primary information. All remaining items are checked afterwards, which speeds up failure recovery.
In the example of FIG. 10, if everything with a change record is given priority, business program 3, business program 1, NW device 3, NW device 1, and shared service program 1 are checked first. NW service 2 is excluded because it is not part of business 1.

Embodiment 5.
This embodiment performs prioritized failure checking in a manner similar to Embodiments 2, 3, and 4, but the information used to decide priority is that of the system device and resource importance information table 45 illustrated in FIG. 11. The functional block diagram is omitted here. The system device and resource importance information table 45 is a table equivalent to the failure history information table 43 of FIG. 6 and is provided in the same place. The related-failure confirmation method specifying unit 30 refers to the system device and resource importance information table 45 when it performs its processing.

When identifying the root cause location, failures of devices and resources whose failure has a large impact should be detected and dealt with first. To this end, the devices and resources of high importance are checked first and reported in order, which minimizes the business impact of the failure.
In the example of FIG. 11, business program 1, which has the highest severity level of 10, is checked first, and the remaining items are then checked in descending order of severity level.
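Ordering by importance amounts to a descending sort on the level stored in table 45. In the sketch below, only the level 10 for business program 1 comes from the text; the other levels and the default of 0 for unlisted components are assumptions.

```python
business_1 = [
    "client program 1", "business program 3", "business program 4",
    "business program 1", "shared service program 1", "NW device 1",
]

# Hypothetical levels, except the value 10 for business program 1.
importance = {
    "business program 1": 10,
    "NW device 1": 8,
    "shared service program 1": 6,
}


def order_by_importance(components, levels):
    """Sort components from highest to lowest level; unlisted ones count as 0."""
    return sorted(components, key=lambda c: levels.get(c, 0), reverse=True)


ordered = order_by_importance(business_1, importance)
```

Python's `sorted` is stable, so components with equal levels keep their original relative order.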

Embodiment 6.
This embodiment performs prioritized failure checking in a manner similar to Embodiments 2, 3, and 4, but the information used to decide priority is the table 46 of usage frequency per unit time for each resource, shown in FIG. 12. For example, in the configuration of FIG. 1, the failure information collection control unit 20 periodically examines, through the on-demand failure monitoring unit 53, how often each business component device and resource is used. The result is recorded and managed in the usage frequency field of the resource usage frequency information table 46 of FIG. 12. The survey may cover any period: the failure information collection control unit 20 activates the on-demand failure monitoring unit 53 as needed and counts either the opens (starts) or the closes (ends) of the target resource, which yields the frequency. Accumulating these counts gives the relative usage frequency. This resource usage frequency table is provided in the same place as the failure history information table 43 of FIG. 6.
The related-failure confirmation method specifying unit 30 refers to the resource usage frequency information table 46 when it performs its processing.
When identifying the root cause location, the lower the usage frequency recorded for a resource in the resource usage frequency information table 46, the more likely it is to contain residual bugs, so it may be the one that has failed.
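The counting rule described above (count either opens or closes, not both) and one possible use of the result are sketched below. The event format and function names are hypothetical. Checking the least-used resources first is a reading of the text, which only says that rarely used resources are more likely to harbor residual bugs.

```python
from collections import Counter


def usage_frequency(events):
    """Count 'open' events per resource.

    Counting only one of open/close gives one count per use and avoids
    double counting a single use.
    """
    return Counter(resource for resource, kind in events if kind == "open")


def least_used_first(components, freq):
    """Order components from least to most used (assumed priority rule)."""
    return sorted(components, key=lambda c: freq.get(c, 0))


# Hypothetical event log gathered over an arbitrary survey period.
events = [
    ("business program 1", "open"),
    ("business program 1", "close"),
    ("business program 1", "open"),
    ("NW device 1", "open"),
]
freq = usage_frequency(events)
order = least_used_first(
    ["business program 1", "NW device 1", "business program 4"], freq
)
```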

Embodiment 7.
The apparatus of the present invention aims to detect system failures efficiently. Depending on the monitoring method, however, a detection may be a false one caused not by a failure of the monitored target itself but by some other failure.
FIG. 13 shows an example in which the alive/dead status of a network device is checked from outside by ping: the ping monitoring server 107 monitors NW device 3 (124). In this case, a ping response error (no response) for NW device 3 is detected not only when NW device 3 itself fails but also when NW device 2 or NW device 4 on the monitoring path fails. In other words, ping monitoring of NW device 3 from the position of the ping monitoring server 107 depends on NW device 2 and NW device 4.

In this embodiment, before the processing of Embodiments 1 to 6 is performed, it is first checked whether the system failure was detected falsely.
FIG. 14 shows an example of the monitoring dependency table 57, which describes the monitoring dependencies for normal-time monitoring. For example, when a failure of NW device 3 is detected, the failure status of NW device 2 and NW device 4, which are registered under “dependent part of the monitoring function causing the false detection”, is checked first. If these have no failure, the processing of Embodiments 1 to 6 is carried out.
This preliminary step improves the accuracy of the root failure location detection.
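The preliminary check can be sketched as a lookup in the dependency table followed by a health check of each dependency. The dependency entry for NW device 3 follows the example in the text; the function names, the returned labels, and the injected `is_healthy` callable are assumptions.

```python
# Monitoring dependency table 57 (entry taken from the FIG. 14 example).
MONITOR_DEPENDENCIES = {"NW device 3": ["NW device 2", "NW device 4"]}


def failed_dependencies(detected, is_healthy):
    """Return the monitoring-path components of `detected` that are themselves down."""
    return [d for d in MONITOR_DEPENDENCIES.get(detected, []) if not is_healthy(d)]


def handle_detection(detected, is_healthy):
    """Check the monitoring path first; analyze the target only if the path is clean."""
    culprits = failed_dependencies(detected, is_healthy)
    if culprits:
        # The alarm may be a false detection caused by the monitoring path.
        return ("dependency failure", culprits)
    return ("proceed to root-cause analysis", [detected])


# Simulated state: NW device 2 is down, everything else responds.
down = {"NW device 2"}
result = handle_detection("NW device 3", lambda name: name not in down)
```

With NW device 2 down, the alarm on NW device 3 is attributed to the monitoring path rather than being passed on to the processing of Embodiments 1 to 6.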

Embodiment 8.
In Embodiments 1 to 6, the failure status of the devices and resources extracted from the business configuration is checked with the registered confirmation methods, but it is conceivable that none of the devices or resources shows a failure state. For that case, this embodiment provides means for carrying out secondary and tertiary steps, such as changing the viewpoint of the failure check or collecting failure analysis information in preparation for more detailed analysis by a person.
FIG. 15 shows a monitoring method information table 42d, which extends the per-target monitoring method information table 42 of FIG. 4, as an example in which secondary actions are registered. In this example, the secondary actions registered are methods of collecting information for failure analysis.

FIG. 16 is a flow of the operation in the present embodiment. If checking the failure status of the related devices and resources according to the primary action list reveals no clear failure state in any of them, processing switches to the secondary actions and is executed again, through steps S61 to S65.
To implement this example, elements suited to the desired processing must be added, such as the newly provided failure analysis information collection units 55 and 56 in FIG. 17, which shows the apparatus with the added components.
With this processing, even when the failure location cannot be detected, alternative measures can be carried out, such as collecting in advance the failure analysis information needed for manual analysis, so that system failure countermeasures can be taken more quickly.
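The escalation from primary to secondary actions can be sketched as follows in Python. This is only an illustration under stated assumptions: the representation of each action as a callable that returns True when a failure is confirmed, the function name `run_staged_actions`, and the mapping of loop positions to steps S62 to S65 are all inferred from the step names in the explanation of symbols, not taken from FIG. 16 itself.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An action receives a target name and returns True if it confirms a failure.
# An information-collection action (a secondary action in the FIG. 15 example)
# can simply record data and return False.
Action = Callable[[str], bool]


def run_staged_actions(
    targets: List[str],
    stages: List[Dict[str, Action]],
) -> Tuple[Optional[int], List[str]]:
    """Run primary actions, then secondary, tertiary, ... until a failure is found.

    `targets` are the business-related devices/resources (assumed output of S61);
    `stages[0]` holds the primary actions, `stages[1]` the secondary ones, etc.
    Returns (stage index, failed targets), or (None, []) if no stage finds a failure.
    """
    for stage_index, actions in enumerate(stages):  # next stage: cf. S64/S65
        # cf. S62: execute the registered action for every target in this stage
        failed = [t for t in targets if t in actions and actions[t](t)]
        if failed:  # cf. S63: a failure location was detected
            return stage_index, failed
    return None, []
```

When the function returns `(None, [])`, any information gathered by the later-stage actions remains available for manual analysis, which corresponds to the alternative measure described above.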

Embodiment 9.
In each of the above embodiments, the failure analysis apparatus was described as being configured in hardware. The apparatus is not limited to this, however: a general-purpose processor and memory may be used, with the steps written as a software program in the memory, so that the program steps carry out equivalent operations.
FIG. 18 is a flowchart that realizes the operation of Embodiment 1 with such program steps. In the figure, the function corresponding to the failure occurrence business specifying unit 31 is implemented as program step S101. Step S100 waits for a failure detection notification from the normal-time failure monitoring unit 51 and starts the flow when the notification arrives. The function corresponding to the related resource extraction unit 32 is implemented in S102, and the function corresponding to the confirmation method specifying unit 33 in S104 and S105. The function of prioritizing the selection of individual component devices and resources in Embodiments 2 to 6 is implemented in S103.
Furthermore, if the functions of the flowchart shown in FIG. 18 are written as a program, the failure analysis apparatus described in the above embodiments can be configured by loading that program into a general-purpose computer.
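As a rough illustration of how steps S101 to S106 might look as program steps, the following Python sketch strings them together. The table layouts, the function name `analyze_failure`, the rule that a business is affected when it contains the device named in the alert, and the optional `priority` key are assumptions for illustration; the specification does not prescribe them.

```python
from typing import Callable, Dict, List, Optional

ConfirmMethod = Callable[[str], bool]  # returns True when a failure is confirmed


def analyze_failure(
    alerted_component: str,
    business_config: Dict[str, List[str]],         # stand-in for table 41
    monitoring_methods: Dict[str, ConfirmMethod],  # stand-in for table 42
    priority: Optional[Callable[[str], float]] = None,
) -> Dict[str, bool]:
    """Called when a failure notification arrives (cf. S100)."""
    # S101: estimate the businesses in which the failure occurred
    # (assumption: every business that uses the alerted component).
    businesses = [b for b, parts in business_config.items() if alerted_component in parts]

    # S102: extract the component devices/resources of those businesses, without duplicates.
    components: List[str] = []
    for business in businesses:
        for part in business_config[business]:
            if part not in components:
                components.append(part)

    # S103: order the components by priority (lower key value is checked first).
    if priority is not None:
        components.sort(key=priority)

    # S104-S106: look up each confirmation method and run it until all are done.
    results: Dict[str, bool] = {}
    for part in components:
        method = monitoring_methods.get(part)
        if method is not None:
            results[part] = method(part)
    return results
```

The `priority` key is where the orderings of Embodiments 2 to 6 would plug in, for example the negated past failure count taken from a failure history table.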

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the failure analysis apparatus in Embodiment 1 of this invention.
FIG. 2 is a diagram showing the relationship between business 1, which is the monitored system in Embodiment 1, and specific component devices and resources.
FIG. 3 is a diagram showing an example of data in the business configuration information table in Embodiment 1.
FIG. 4 is a diagram showing an example of data in the monitoring method information table in Embodiment 1.
FIG. 5 is a diagram showing an example of the monitoring method information created by the related failure confirmation method specifying unit in Embodiment 1.
FIG. 6 is a block diagram showing the failure analysis apparatus in Embodiment 2 of this invention.
FIG. 7 is a diagram showing an example of data in the failure history information table in Embodiment 2.
FIG. 8 is a diagram showing an example of the monitoring method information created by the related failure confirmation method specifying unit in Embodiment 2.
FIG. 9 is a diagram showing an example of data in the failure history information table in Embodiment 3.
FIG. 10 is a diagram showing an example of data in the system change history information table in Embodiment 4.
FIG. 11 is a diagram showing an example of data in the system device / resource importance information table in Embodiment 5.
FIG. 12 is a diagram showing an example of data in the usage frequency information table in Embodiment 6.
FIG. 13 is a diagram explaining an example of a method for confirming a component device / resource failure from the outside in Embodiment 7.
FIG. 14 is a diagram showing an example of data in the monitoring dependency relationship table in Embodiment 7.
FIG. 15 is a diagram showing an example of data in the monitoring method information table in Embodiment 8.
FIG. 16 is a flowchart showing the failure analysis operation in Embodiment 8.
FIG. 17 is a block diagram showing the failure analysis apparatus in Embodiment 8.
FIG. 18 is an operation flowchart showing the failure analysis method in Embodiment 9 of this invention.

Explanation of symbols

20: failure information collection control unit; 30: related failure confirmation method specifying unit; 31: failure occurrence business specifying unit; 32: related resource extraction unit; 33: confirmation method specifying unit; 40: information group concerning the system; 41: business configuration information table; 42, 42b, 42c, 42d: (per-target) monitoring method information table; 43, 43c: failure history information table; 44: system change history information table; 45: system device / resource importance information table; 46: usage frequency information table; 51, 52: normal-time failure monitoring unit; 53, 54: on-demand failure monitoring unit; 55, 56: failure analysis information collection unit; 57: monitoring dependency relationship table.
S61: extraction of business-related locations and of primary-step actions; S62: execution of actions on all targets; S63: failure location detection step; S64: next action registration step; S65: next action extraction step; S101: failed business estimation step; S102: step of extracting the component devices and resources of the relevant business; S103: component device / resource priority selection step; S104: component device / resource confirmation method specifying step; S105: step of executing individual failure confirmation based on the priority order; S106: step of confirming that confirmation execution has finished.

Claims

A failure analysis apparatus comprising:
a business configuration information table that stores, for each business, the component devices and resources that constitute that business of a monitored system;
a monitoring method information table that stores an individual monitoring method for each component device and resource;
a monitoring unit that monitors the monitored system; and
a related failure confirmation method specifying unit that, when the monitoring unit detects a failure of the monitored system, refers to the business configuration information table to estimate the business in which the failure has occurred, extracts the component devices and resources that constitute that business, and applies the monitoring methods for those component devices and resources.

The failure analysis apparatus according to claim 1, further comprising a failure history information table that stores past failure histories of the component devices and resources,
wherein the related failure confirmation method specifying unit refers to the failure history information table when applying the monitoring methods for the component devices and resources constituting the business.

The failure analysis apparatus according to claim 1, further comprising a system change history information table that stores past system change histories of the component devices and resources,
wherein the related failure confirmation method specifying unit refers to the system change history information table when applying the monitoring methods for the component devices and resources constituting the business.

The failure analysis apparatus according to claim 1, further comprising a component device / resource importance information table that specifies the importance of the component devices and resources in the monitored system,
wherein the related failure confirmation method specifying unit refers to the component device / resource importance information table when applying the monitoring methods for the component devices and resources constituting the business.

The failure analysis apparatus according to claim 1, further comprising a monitoring dependency relationship table that stores the mutual influence relationships between the component devices and resources in the monitored system,
wherein the related failure confirmation method specifying unit refers to the monitoring dependency relationship table when applying the monitoring methods for the component devices and resources constituting the business.

The failure analysis apparatus according to claim 1, wherein the monitoring method information table stores the individual monitoring method of each component device and resource as a primary action and additionally stores other monitoring methods as secondary actions, and
the related failure confirmation method specifying unit, following the primary action of the monitoring method information table, applies where necessary the monitoring methods for the component devices and resources constituting the business on the basis of the secondary action.

A failure analysis method for an analysis apparatus comprising a business configuration information table that stores, for each business, the component devices and resources that constitute that business of a monitored system, and a monitoring method information table that stores an individual monitoring method for each component device and resource, the method comprising:
a step of monitoring the monitored system;
a step of referring to the business configuration information table, when the monitoring step detects a failure of the monitored system, to estimate the business in which the failure has occurred;
a step of extracting the component devices and resources constituting the business in which the failure has occurred; and
a step of confirming and executing the monitoring methods for the component devices and resources constituting the business.

The failure analysis method according to claim 7, wherein the analysis apparatus further comprises a failure history information table that stores past failure histories of the component devices and resources,
the method further comprising, following the step of extracting the component devices and resources, a component device / resource priority selection step of selecting, with reference to the failure history information table, the priority order of the component devices and resources whose relevance is to be checked, wherein the step of confirming and executing the monitoring methods for the component devices and resources is performed after the priority selection step has been executed.

The failure analysis method according to claim 7, wherein the monitoring method information table of the analysis apparatus stores the individual monitoring method of each component device and resource as a primary action and additionally stores other monitoring methods as secondary actions,
the method further comprising, following the primary-action step of confirming and executing the monitoring methods for the component devices and resources, a step of confirming and executing, where necessary, the monitoring methods for the component devices and resources on the basis of the secondary action.

A failure analysis program configured as a computer-executable program, comprising:
a step of storing, for each business, the component devices and resources that constitute that business of a monitored system, thereby forming a business configuration information table;
a step of storing an individual monitoring method for each component device and resource, thereby forming a monitoring method information table;
a step of monitoring the monitored system;
a step of referring to the business configuration information table, when the monitoring step detects a failure of the monitored system, to estimate the business in which the failure has occurred;
a step of extracting the component devices and resources constituting the business in which the failure has occurred; and
a step of confirming and executing the monitoring methods for the component devices and resources constituting the business.