JP4862446B2 - Failure cause estimation system, method, and program

Info

Publication number: JP4862446B2
Application number: JP2006079266A
Authority: JP
Inventors: 敏夫登内
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2006-03-22
Filing date: 2006-03-22
Publication date: 2012-01-25
Anticipated expiration: 2026-03-22
Also published as: JP2007257184A

Abstract

PROBLEM TO BE SOLVED: To provide a failure cause estimation system that estimates the root cause of a failure without requiring complicated dependencies to be defined manually.

SOLUTION: An initial model generation means 40 generates, based on a basic model definition 20, an initial failure occurrence model in which the correspondence between events and their causes and the transitions between causes are modeled as a finite automaton. A Baum-Welch calculation means 50 learns, from the initial failure occurrence model and a learning event sequence 100, the probabilities with which the states of the finite automaton, which correspond to causes, transition. A Viterbi calculation means 60 searches, using the learned failure occurrence model, for the state transition sequence under which a failure cause discovery event sequence 110 is most likely to be observed. A filtering module estimates, from that state transition sequence, the root cause of the failure that occurred in the monitoring target device.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a failure cause estimation system, method, and program, and more particularly to a failure cause estimation system, method, and program for estimating the cause of a failure that has occurred in a monitoring target device.

There are systems that monitor a managed device such as a computer and, when a failure occurs in the managed device, estimate its cause. FIG. 7 shows the configuration of the failure cause discovery device described in Patent Document 1. In this conventional failure cause discovery device 200, a process information acquisition unit 201 acquires information on the processes running on the computer system, and an environment file information acquisition unit 202 acquires system environment file information. A device information acquisition unit 203 acquires information on the device drivers installed in the computer system.

While the computer system is operating normally, a reference environment information acquisition unit 204 collects the information acquired by the process information acquisition unit 201, the environment file information acquisition unit 202, and the device information acquisition unit 203, and stores it in a reference environment information storage unit 205. When checking whether an abnormality has occurred in the computer system, an inspection environment information acquisition unit 206 collects the information acquired by the same three units and stores it in an inspection environment information storage unit 207.

An environment information comparison/determination unit 209 compares the contents of the reference environment information storage unit 205 with those of the inspection environment information storage unit 207 to find changes in state. Referring to an allowable range information storage unit 208, which stores the allowable values for such changes, the environment information comparison/determination unit 209 determines whether each detected change exceeds the allowable range. When the environment information comparison/determination unit 209 determines that the allowable range is exceeded, an abnormality cause identifying unit 211 identifies the cause of the abnormality in the computer system from the content of the state change.

For example, the reference environment information storage unit 205 stores "one nfs process is running". The allowable range information storage unit 208 stores, for the situation in which adding a mount device is permitted, that the maximum number of running "nfs" processes is 8. If, in that situation, the inspection environment information acquisition unit 206 detects the state "11 nfs processes are running", the environment information comparison/determination unit 209 determines that the allowable range is exceeded and detects an abnormality. The abnormality cause identifying unit 211 then identifies the cause of the detected abnormality as the number of "nfs" processes having exceeded its allowable value.
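The comparison described for Patent Document 1 can be sketched roughly as follows. This is a minimal illustration, not the device's actual implementation: the dictionary layout, the situation label `mount_add_allowed`, and the function name `find_violations` are all hypothetical.

```python
# Hypothetical layout: (process name, situation) -> maximum allowed process count.
ALLOWED_MAX = {("nfs", "mount_add_allowed"): 8}


def find_violations(observed, situation, allowed_max=ALLOWED_MAX):
    """Return (process, count, limit) for every process over its limit."""
    violations = []
    for process, count in sorted(observed.items()):
        limit = allowed_max.get((process, situation))
        # Processes with no registered limit are not checked.
        if limit is not None and count > limit:
            violations.append((process, count, limit))
    return violations


# 11 running nfs processes against a limit of 8, as in the example above.
violations = find_violations({"nfs": 11}, "mount_add_allowed")
```

Note that the check reports only the directly violated limit; nothing in it points back to whatever caused the count or the situation to change.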

Another conventional system for estimating failure causes is described in Patent Document 2. FIG. 8 shows the configuration of the causal system model construction device used in the failure cause estimation system of Patent Document 2. The causal system model construction device 300 uses a causal data generation storage unit 311, which is a database representing cause-to-effect relationships; an effect-cause data generation storage unit 312, which is the inverse mapping; and a same-result data set generation unit 313, which records relations that assign a plurality of events to one event group or a plurality of causes to one cause group. With these, a partial causal system model configuration unit 314 maps the relationships between cause groups and event groups.

FIG. 9 shows the configuration of the cause estimation device used in the failure cause estimation system described in Patent Document 2. A causal system model storage unit 324 stores the causal system model constructed by the causal system model construction device 300 (FIG. 8). The cause estimation device 320 applies the event-to-cause mapping of the stored causal system model to the events recognized by an observation data recognition unit 321 to obtain the causes of the failure. It then applies the cause-to-event mapping to obtain the events that those causes can produce. By applying the event-to-cause and cause-to-event mappings transitively and taking the transitive closure, it obtains the possible causes, including the root cause.
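The transitive application of the two mappings can be sketched as a simple worklist closure. The mapping contents and names below (`e1`, `c1`, `transitive_causes`) are hypothetical and only illustrate the idea attributed to Patent Document 2.

```python
def transitive_causes(observed_events, event_to_causes, cause_to_events):
    """Alternate event->cause and cause->event lookups until nothing new appears."""
    causes = set()
    events = set(observed_events)
    frontier = list(events)
    while frontier:
        event = frontier.pop()
        for cause in event_to_causes.get(event, ()):
            if cause in causes:
                continue
            causes.add(cause)
            # Every event this cause can produce may point to further causes.
            for produced in cause_to_events.get(cause, ()):
                if produced not in events:
                    events.add(produced)
                    frontier.append(produced)
    return causes


event_to_causes = {"e1": {"c1"}, "e2": {"c2"}, "e3": {"c3"}}
cause_to_events = {"c1": {"e1", "e2"}, "c2": {"e2"}, "c3": {"e3"}}
found = transitive_causes({"e1"}, event_to_causes, cause_to_events)
```

Here observing only `e1` yields both `c1` and `c2`, because `c1` can also produce `e2`; `c3` is never reached.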

As a technique for automatic troubleshooting of a system, there is the technique described in Patent Document 3. In Patent Document 3, the system components that can cause a system failure are modeled as a Bayesian network, which is a directed acyclic graph consisting of nodes that represent variables and arcs that represent dependencies between variables, and this model is used for automatic diagnosis of a printer system.

Patent Document 1: JP-A-8-255093 (paragraphs 0013 to 0031, FIG. 1)
Patent Document 2: JP-A-2004-126641
Patent Document 3: JP-A-2001-75808

Patent Document 1 detects a change in state and, when the detected change exceeds the allowable range, identifies the cause of the abnormality in the computer system from the content of the change. However, Patent Document 1 can identify only the direct state; when an abnormal state arises as a consequence of some other failure, it cannot identify the failure at the root.

For example, suppose the maximum (allowable) number of nfs processes is 12 in the situation "adding a mount device is permitted" and 8 in the situation "adding a mount device is not permitted". If the system is currently in the "adding a mount device is permitted" situation and 11 nfs processes are running, the count does not exceed the maximum of 12, so the system is judged to be normal.

Suppose that a "SCSI board failure" then puts the system into the situation "adding a mount device is not permitted". If the number of nfs processes remains unchanged at 11, it exceeds the maximum of 8 that applies in this situation. Patent Document 1 therefore identifies the number of nfs processes exceeding its allowable value as the cause of the failure.

In this case, however, the real cause is the "SCSI board failure". Unless it is recognized that the number of nfs processes departed from its normal range because of the SCSI board failure, the failure cannot be handled correctly. For failures that occur transitively, for example when failure A moves the system into state B and failure C then occurs with state B as its direct cause, Patent Document 1 can identify only the direct failure C and cannot identify failure A, the root cause.

In Patent Document 2, the correspondence between causes and failures must be registered in advance. Because relationships between events are generally complicated, describing these correspondences becomes difficult when there are many kinds of failures and states, and so does entering the rules needed to find the root cause of a failure. To reduce this difficulty, Patent Document 2 groups failure causes and describes mappings between the groups. Even when mappings are defined per group, however, the mappings still have to be defined, defining the groups takes effort, and unless the grouping is done well the accuracy of the mapping between causes and failures may drop.

FIG. 10 shows a model of causal relationships. Patent Document 2, for example, describes the case shown in the figure, in which arrows indicate causal relationships. Define a new event y' and suppose that y' is the cause of x2. This relationship can be written
y' → x2
To make mappings easier to define, Patent Document 2 introduces, when y' → x2 and x2 ∈ X1, a mapping h that abstracts the effect-to-cause relationship:
h(X1) = y'
Because h does not break X1 down into x2 and x3, the effect-to-cause relationship can be described simply. However, precisely because it does not look that closely, x3 ∈ X1 gives h(x3) = y'. That is, there is a risk that y', which has no causal relationship with x3, is treated as the cause of x3.
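The over-approximation caused by the group-level mapping h can be made concrete with a small sketch. The dictionaries and the helper `cause_via_group` are hypothetical; only the symbols x2, x3, X1, y' and h come from the text.

```python
X1 = {"x2", "x3"}                      # event group X1 containing x2 and x3
true_cause = {"x2": "y'"}              # the only real relationship: y' -> x2
group_of = {event: "X1" for event in X1}
h = {"X1": "y'"}                       # group-level mapping h(X1) = y'


def cause_via_group(event):
    """Look up the cause of an event through its group, as h does."""
    return h.get(group_of.get(event))


# Events for which the group-level answer differs from the real relationship.
spurious = [e for e in sorted(X1) if cause_via_group(e) != true_cause.get(e)]
```

Looking up x3 through its group returns y' even though no y' → x3 relationship was ever defined, which is exactly the risk described above.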

Patent Document 3 models failures with a DAG (a graph without loops). A single failure often raises events repeatedly, but because Patent Document 3 uses a DAG, it cannot handle events that occur repeatedly.

An object of the present invention is to solve the above problems of the prior art and to provide a failure cause estimation system, method, and program that can estimate the root cause of a failure without requiring complicated dependencies to be defined manually.

To achieve the above object, the failure cause estimation system of the present invention comprises: input means for inputting a basic model definition that defines events caused by failures that can occur in a monitoring target device and the failure causes that give rise to those events; initial model generation means for generating, based on the basic model definition input by the input means, an initial failure occurrence model in which the correspondence between events and their causes and the transitions between causes are modeled as a finite automaton; initial model storage means for storing the initial failure occurrence model generated by the initial model generation means; learning event sequence storage means for storing a learning event sequence, which is a set of events that occurred in the monitoring target device; transition probability learning means for learning, based on the initial failure occurrence model stored in the initial model storage means and the learning event sequence stored in the learning event sequence storage means, the probabilities with which the states of the finite automaton representing the transitions between the causes of the learning event sequence transition; failure occurrence model generation means for generating a failure occurrence model that reflects the transition probabilities learned by the transition probability learning means; failure occurrence model storage means for storing the failure occurrence model generated by the failure occurrence model generation means; failure cause discovery event sequence storage means for storing a failure cause discovery event sequence, which is a set of events that occurred in the monitoring target device; state transition sequence generation means for obtaining, in accordance with the failure occurrence model stored in the failure occurrence model storage means, the state transition sequence under which the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means is considered most likely to be observed; state transition sequence storage means for storing the state transition sequence generated by the state transition sequence generation means; and a filtering module that divides the state transition sequence stored in the state transition sequence storage means based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage means, and estimates the leading state of each divided state transition sequence to be a root cause of a failure that occurred in the monitoring target device.

The failure cause estimation method of the present invention is a method of estimating, using a computer, the root cause of a failure that occurred in a monitoring target device, the method comprising: an input step of inputting a basic model definition that defines events caused by failures that can occur in the monitoring target device and the failure causes that give rise to those events; an initial model generation step of generating, based on the basic model definition input in the input step, an initial failure occurrence model in which the correspondence between events and their causes and the transitions between causes are modeled as a finite automaton; an initial model storage step of storing the initial failure occurrence model generated in the initial model generation step; a learning event sequence storage step of storing a learning event sequence, which is a set of events that occurred in the monitoring target device; a transition probability learning step of learning, based on the initial failure occurrence model stored in the initial model storage step and the learning event sequence stored in the learning event sequence storage step, the probabilities with which the states of the finite automaton representing the transitions between the causes of the learning event sequence transition; a failure occurrence model generation step of generating a failure occurrence model that reflects the transition probabilities learned in the transition probability learning step; a failure occurrence model storage step of storing the failure occurrence model generated in the failure occurrence model generation step; a failure cause discovery event sequence storage step of storing a failure cause discovery event sequence, which is a set of events that occurred in the monitoring target device; a state transition sequence generation step of obtaining, in accordance with the failure occurrence model stored in the failure occurrence model storage step, the state transition sequence under which the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step is considered most likely to be observed; a state transition sequence storage step of storing the state transition sequence generated by the state transition sequence generation means; and a filtering step of dividing the state transition sequence stored in the state transition sequence storage step based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage step, and estimating the leading state of each divided state transition sequence to be a root cause of a failure that occurred in the monitoring target device.

The program of the present invention causes a computer to execute a method of estimating the root cause of a failure that occurred in a monitoring target device, the program causing the computer to execute: an input step of inputting a basic model definition that defines events caused by failures that can occur in the monitoring target device and the failure causes that give rise to those events; an initial model generation step of generating, based on the basic model definition input in the input step, an initial failure occurrence model in which the correspondence between events and their causes and the transitions between causes are modeled as a finite automaton; an initial model storage step of storing the initial failure occurrence model generated in the initial model generation step; a learning event sequence storage step of storing a learning event sequence, which is a set of events that occurred in the monitoring target device; a transition probability learning step of learning, based on the initial failure occurrence model stored in the initial model storage step and the learning event sequence stored in the learning event sequence storage step, the probabilities with which the states of the finite automaton representing the transitions between the causes of the learning event sequence transition; a failure occurrence model generation step of generating a failure occurrence model that reflects the transition probabilities learned in the transition probability learning step; a failure occurrence model storage step of storing the failure occurrence model generated in the failure occurrence model generation step; a failure cause discovery event sequence storage step of storing a failure cause discovery event sequence, which is a set of events that occurred in the monitoring target device; a state transition sequence generation step of obtaining, in accordance with the failure occurrence model stored in the failure occurrence model storage step, the state transition sequence under which the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step is considered most likely to be observed; a state transition sequence storage step of storing the state transition sequence generated by the state transition sequence generation means; and a filtering step of dividing the state transition sequence stored in the state transition sequence storage step based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage step, and estimating the leading state of each divided state transition sequence to be a root cause of a failure that occurred in the monitoring target device.

In the failure cause estimation system, method, and program of the present invention, the initial failure occurrence model created from the basic model definition is trained on the learning event sequence; from the failure occurrence model obtained by this learning, the state transition sequence under which the failure cause discovery event sequence is considered most likely to be observed is obtained; and, based on that state transition sequence, the root cause of a failure that occurred in the monitoring target device is estimated. In the present invention, what must be defined manually is the events that can occur, the failure causes that give rise to them, and the correspondence between the two, all of which are easy to define by hand. The root cause of a failure can therefore be estimated without defining complicated dependencies between causes and results.

The failure cause estimation system, method, and program of the present invention can adopt a configuration in which the Baum-Welch algorithm is used to learn the state transition probabilities between the causes of the learning event sequence stored in the learning event sequence storage means and the event occurrence probabilities for each cause. The Baum-Welch algorithm, which estimates model parameters from an output symbol sequence, can be used as the learning algorithm.
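As a rough illustration of this learning step, the following sketch performs Baum-Welch re-estimation on a single event sequence. It is not the patented implementation: states and events are assumed to be coded as integers, the numeric parameters are made up, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(obs, init, trans, emit):
    n = len(init)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[j][o] * sum(prev[i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, init, trans, emit):
    """One re-estimation step; also returns the likelihood under the old parameters."""
    n, m, T = len(init), len(emit[0]), len(obs)
    alpha, beta = forward(obs, init, trans, emit), backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t given the events.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j between t and t + 1.
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    new_trans, new_emit = [], []
    for i in range(n):
        denom = sum(gamma[t][i] for t in range(T - 1))
        new_trans.append([sum(xi[t][i][j] for t in range(T - 1)) / denom
                          if denom else trans[i][j] for j in range(n)])
        denom = sum(gamma[t][i] for t in range(T))
        new_emit.append([sum(gamma[t][i] for t in range(T) if obs[t] == k) / denom
                         if denom else emit[i][k] for k in range(m)])
    return gamma[0][:], new_trans, new_emit, likelihood


# Made-up starting model: state 0 = normal, state 1 = one failure cause.
init, trans = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 1, 1, 0]                 # learning event sequence (event codes)
history = []
for _ in range(5):
    init, trans, emit, likelihood = baum_welch_step(obs, init, trans, emit)
    history.append(likelihood)
```

Each step should not decrease the likelihood of the learning event sequence, which is what `history` records; a practical implementation would work with scaled or log probabilities and with several sequences.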

The failure cause estimation system, method, and program of the present invention can adopt a configuration in which the state transition sequence is obtained with the Viterbi algorithm. The Viterbi algorithm, which estimates a state sequence from an output symbol sequence, can be used to obtain the state transition sequence from the learned failure occurrence model.
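A minimal sketch of the Viterbi search is shown below, assuming integer-coded states and events and made-up probabilities; it uses plain products rather than log probabilities, which is adequate only for short event sequences.

```python
def viterbi(obs, init, trans, emit):
    """Return the most probable state sequence for obs and its probability."""
    n = len(init)
    prob = [init[i] * emit[i][obs[0]] for i in range(n)]
    back = []                                   # back[t][j]: best predecessor of j
    for o in obs[1:]:
        step, new_prob = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prob[i] * trans[i][j])
            step.append(best_i)
            new_prob.append(prob[best_i] * trans[best_i][j] * emit[j][o])
        back.append(step)
        prob = new_prob
    state = max(range(n), key=lambda i: prob[i])
    path = [state]
    for step in reversed(back):                 # follow predecessors backwards
        state = step[state]
        path.append(state)
    path.reverse()
    return path, max(prob)


# Made-up learned model: state 0 = normal, state 1 = one failure cause.
init = [0.9, 0.1]
trans = [[0.8, 0.2], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
path, probability = viterbi([0, 0, 1, 1], init, trans, emit)
```

For this toy model the search returns the path `[0, 0, 1, 1]`: the system stays in the normal state for the first two events and then moves to the failure cause.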

The failure cause estimation system, method, and program of the present invention can adopt a configuration in which the initial failure occurrence model stored in the initial model storage means includes: the set Σ of events; the set of states obtained by adding a normal state s_0 to the set S of failure causes; for each state, the conditional probabilities {Pr(sj|si)}, si, sj ∈ S, of transitioning from that state to each state; for each state, the initial probability {P^0_si}, si ∈ S, of being in that state at the start; and, for each state, the probabilities {Pr(ej|si)}, si ∈ S, ej ∈ Σ, that each event occurs in that state.
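The components listed above could be laid out as in the following sketch. How the initial probabilities are chosen is not stated in this passage, so the uniform values, the rule that the normal state may emit any event, and all names (`build_initial_model`, `scsi_board_failure`, and so on) are assumptions made for illustration.

```python
def build_initial_model(cause_events, normal_state="s0"):
    """Build starting parameters from a hypothetical basic model definition.

    cause_events maps each failure cause to the set of events it can raise.
    """
    events = sorted(set().union(*cause_events.values()))   # event set (Sigma)
    states = [normal_state] + sorted(cause_events)          # S plus normal state
    uniform = 1.0 / len(states)
    init = {s: uniform for s in states}                     # P^0_si
    trans = {si: {sj: uniform for sj in states} for si in states}  # Pr(sj|si)
    emit = {}                                               # Pr(ej|si)
    for s in states:
        # Assumption: the normal state may emit any event, while a cause
        # emits only the events the definition associates with it.
        allowed = cause_events.get(s) or set(events)
        emit[s] = {e: (1.0 / len(allowed) if e in allowed else 0.0)
                   for e in events}
    return states, events, init, trans, emit


definition = {
    "scsi_board_failure": {"scsi_error", "mount_failed"},
    "nfs_overload": {"nfs_over_limit"},
}
states, events, init, trans, emit = build_initial_model(definition)
```

Only the event-to-cause correspondence is supplied by hand here; the transition probabilities between causes start out uninformative and are left to the learning step.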

The failure cause estimation system, method, and program of the present invention can adopt a configuration in which, where the state transition sequence stored in the state transition sequence storage means is {s(0), s(1), ..., s(n)}, the filtering module divides that sequence into {s(0), s(1), ..., s(i)} and {s(i+1), ..., s(n)} when the conditional probability Pr(s(i+1)|s(i)) of the transition from state s(i) (0 ≤ i < n) to the next state s(i+1) is lower than a predetermined probability. In this case, the filtering module can estimate the leading state of each divided state transition sequence to be a root cause of the failure. When two state sequences arising from different root causes happen to be treated as a single sequence because that sequence has the highest probability, dividing in this way separates it into one sequence per root cause, and the leading state of each sequence can be estimated to be its root cause.
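The splitting rule can be sketched in a few lines. The state names, the transition probabilities, and the threshold value 0.1 are hypothetical; the rule itself (cut wherever Pr(s(i+1)|s(i)) falls below a predetermined probability and take the first state of each piece) follows the description above.

```python
def split_by_transition_probability(states, trans, threshold):
    """Cut the state sequence wherever Pr(next | current) is below threshold."""
    if not states:
        return []
    segments = [[states[0]]]
    for current, nxt in zip(states, states[1:]):
        if trans[current][nxt] < threshold:
            segments.append([nxt])          # unlikely transition: new segment
        else:
            segments[-1].append(nxt)
    return segments


def root_causes(segments):
    """The leading state of each segment is taken as a root cause."""
    return [segment[0] for segment in segments]


# Hypothetical learned transition probabilities between causes A to D.
trans = {"A": {"B": 0.6}, "B": {"C": 0.05}, "C": {"D": 0.7}}
segments = split_by_transition_probability(["A", "B", "C", "D"], trans, 0.1)
roots = root_causes(segments)
```

Here the B → C transition is improbable under the model, so the sequence is split into A, B and C, D, and A and C are reported as separate root causes.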

The failure cause estimation system of the present invention can adopt a configuration in which the learning event sequence stored in the learning event sequence storage means is an event sequence monitored during a trial run of the monitoring target device. Alternatively, the learning event sequence stored in the learning event sequence storage means may be an event sequence that was monitored during operation of the monitoring target device and for which the cause of the failure has already been analyzed.

The failure cause estimation system of the present invention can adopt a configuration in which the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means is an event sequence monitored during operation of the monitoring target device. In this case, the root cause of a failure that occurred in the monitoring target device while in operation can be estimated.

The failure cause estimation system of the present invention can adopt a configuration in which, in each of the learning event sequence stored in the learning event sequence storage means and the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means, the interval between the occurrence times of any two adjacent events is at most a predetermined value. In this case, sequences of events that are related to one another with respect to a given failure cause can be used as the learning event sequence and the failure cause discovery event sequence.
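One way to obtain event sequences that satisfy this condition is to cut a time-stamped event log wherever the gap between neighbouring events exceeds the predetermined value. The sketch below assumes events arrive as hypothetical `(time, event)` pairs and that `max_gap` is in the same time unit.

```python
def split_by_gap(timed_events, max_gap):
    """Group events so that adjacent events in a group are at most max_gap apart."""
    sequences = []
    last_time = None
    for time, event in sorted(timed_events):
        if last_time is None or time - last_time > max_gap:
            sequences.append([])            # gap too large: start a new sequence
        sequences[-1].append(event)
        last_time = time
    return sequences


log = [(0, "e1"), (2, "e2"), (3, "e3"), (60, "e1"), (61, "e4")]
sequences = split_by_gap(log, max_gap=5)
```

Each resulting sequence can then be used separately as a learning event sequence or a failure cause discovery event sequence.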

The failure cause estimation system, method, and program of the present invention may further include means for storing the state transition sequence stored in the state transition sequence storage means, and means for storing, in association with each state, those events in the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means whose cause of occurrence is a state included in that state transition sequence. By referring to the cause estimation result database stored in this way, an administrator or the like can analyze the root cause of a failure and the causes that arose as a consequence of it.

In the failure cause estimation system, method, and program of the present invention, an initial failure occurrence model created from the basic model definition is trained with a learning event sequence. From the failure occurrence model obtained by this training, the state transition sequence considered most likely to produce the observed failure cause discovery event sequence is obtained, and the root cause of the failure that has occurred in the monitoring target device is estimated from that state transition sequence. In the present invention, the only items that must be defined by hand are the events that can occur, the failure causes that give rise to those events, and the correspondence between them, all of which are easy to define manually. The root cause of a failure can therefore be estimated without defining complicated dependencies between causes and results.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows the configuration of a failure cause estimation device according to one embodiment of the present invention. The failure cause estimation device 10 includes an initial model parser 30, initial model generation means 40, Baum-Welch calculation means 50, Viterbi calculation means 60, a filtering module 70, a model storage database 120, an event sequence parser 130, an event sequence database 140, and a cause estimation result database 150. The failure cause estimation device 10 is implemented on a computer system such as a workstation, and the function of each unit is realized by running a predetermined program.

The initial model parser 30 reads the basic model definition 20 written by the device developer, parses it to generate syntax information, and passes that information to the initial model generation means 40. The basic model definition 20 defines the events that can occur in the monitoring target device 80 and the failure causes that give rise to them. Based on the syntax information generated by the initial model parser 30, the initial model generation means 40 generates an initial failure occurrence model, in which the correspondence between events and their causes and the transitions between causes are modeled by a finite automaton, and stores it in the model storage database 120.

The event monitor 90 monitors events that occur in the monitoring target device 80 and generates a learning event sequence 100 and a failure cause discovery event sequence 110. The event sequence parser 130 parses the learning event sequence 100 and the failure cause discovery event sequence 110, generates event data, and stores the data in the event sequence database 140. Based on the initial failure occurrence model stored in the model storage database 120 and the event data corresponding to the learning event sequence 100 stored in the event sequence database 140 (hereinafter also simply called the learning event sequence 100), the Baum-Welch calculation means 50 learns the probabilities with which the states of the finite automaton, which correspond to causes, make transitions. It then stores a failure occurrence model reflecting the learning result in the model storage database 120.

Using the failure occurrence model stored in the model storage database 120 and the event data corresponding to the failure cause discovery event sequence 110 stored in the event sequence database 140 (hereinafter also simply called the failure cause discovery event sequence 110), the Viterbi calculation means 60 obtains the state transition sequence of the failure occurrence model that has the highest probability of occurrence and outputs it to the filtering module 70. Together with it, the Viterbi calculation means 60 outputs the failure cause discovery event sequence 110 from which the state transition sequence was derived.

From the state transition sequence (sequence of failure causes) that the Viterbi calculation means 60 found to be the most likely to have occurred, the filtering module 70 cuts away transitions of low probability, finds the plausible transition sequences, and estimates the starting state of such a transition sequence to be the root cause. The filtering module 70 stores a cause sequence containing the estimated root cause and the derived causes that follow it in the cause estimation result database 150. At the same time, the events in the failure cause discovery event sequence 110 that are caused by the root cause or by a derived cause are stored in the cause estimation result database 150 in association with the respective cause. The administrator analyzes the cause of a failure by referring to the cause estimation result database 150.

FIG. 2 shows the operation procedure of the failure cause estimation device 10 when generating a failure occurrence model. The initial model parser 30 reads the basic model definition 20 written by the device developer and converts it into syntax information that the initial model generation means 40 can interpret (step A1). The basic model definition 20 consists of a set $\Sigma$ of events, a set $S$ of failure causes, and a function $f: \Sigma \to S$. The function $f$ is normally a total function, but it may be a partial function.

FIG. 3 shows an example of how the basic model definition 20 is written. The basic model definition 20 is written, for example, as a text file like the one shown in the figure. This example assumes Windows (registered trademark) as the OS. The paragraph beginning with [states] defines the set $S$ of failure causes. In the example in the figure, nine causes are defined, including "Print" and "Application Popup". The paragraph beginning with [observations] defines the set $\Sigma$ of events. In the Windows (registered trademark) event monitoring tool "event viewer", event types are given as numeric IDs such as "3", "4", and "16", and these IDs are used in the definition of the event set $\Sigma$.

In [observations], the part separated from the event type (ID) by "," indicates the cause assumed for that event. This part defines the function (mapping) $f: \Sigma \to S$ from events to causes. For example, event "3" is caused by a state related to "Print", so $f(\text{"3"}) = \text{"Print"}$. The initial model parser 30 reads such a text file and passes syntax information corresponding to the basic model definition 20 written in it to the initial model generation means 40.
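
The parsing step described above can be sketched as follows. This is only an illustration under stated assumptions: FIG. 3 is not reproduced in the text, so the exact file layout (one cause per line under `[states]`, one `ID, cause` pair per line under `[observations]`) and the function name `parse_basic_model` are hypothetical. An event listed without a cause is mapped to `None`, which models the case where f is a partial function.

```python
def parse_basic_model(text):
    """Parse a basic model definition into (causes, f).

    Assumed layout (hypothetical, FIG. 3 is not available here):
      [states]        one failure cause per line
      [observations]  one "event_id, cause" pair per line
    """
    causes = []
    f = {}            # event ID -> cause name (None when f is undefined)
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].lower()
        elif section == "states":
            causes.append(line)
        elif section == "observations":
            event_id, _, cause = line.partition(",")
            f[event_id.strip()] = cause.strip() or None
    return causes, f


EXAMPLE = """[states]
Print
Tcpip
[observations]
3, Print
4, Tcpip
16
"""

causes, f = parse_basic_model(EXAMPLE)
```

The event set Σ is then simply the keys of `f`, and the cause set S is `causes`.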

Returning to FIG. 2, the initial model generation means 40 generates an initial failure occurrence model based on the syntax information corresponding to the basic model definition 20 (step A2). The failure occurrence model $M$ is defined as

$$M = \left\{\Sigma,\; S \cup \{s_0\},\; \{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S},\; \{P^0_i\}_{s_i \in S},\; \{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}\right\}$$

Here $\Pr(a \mid b)$ is a conditional probability, namely the probability that $a$ occurs given $b$. $P^0_i$ is the probability that the failure occurrence model $M$ starts from state $s_i$, and $s_0$ is the state in which the monitoring target device 80 is normal. Within $M$, the part $S \cup \{s_0\},\ \{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S},\ \{P^0_i\}_{s_i \in S}$ represents a finite state automaton: the next state $s \in S \cup \{s_0\}$ is determined solely by the immediately preceding state $s' \in S \cup \{s_0\}$, and the transition occurs with the fixed probability $\Pr(s \mid s')$.

The initial failure occurrence model $M_0$ generated by the initial model generation means 40 is now described in detail. The event set $\Sigma$ handled by $M_0$ is identical to the $\Sigma$ defined in the basic model definition 20. The cause set $S \cup \{s_0\}$ handled by $M_0$ is the set $S$ defined in the basic model definition 20 with the normal state $s_0$ added. $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ gives the transition probabilities between causes, which are set to be equal. Specifically, with $|S|$ denoting the number of elements in the cause set $S$, $\Pr(s_j \mid s_i) = 1/(|S| + 1)$. Instead of making these probabilities equal, the probability of remaining in a steady state may be set higher, for example by increasing only the self-transition probability $\Pr(s_i \mid s_i)$. For $\{P^0_i\}_{s_i \in S}$, $P^0_0 = 1$ and $P^0_i = 0$ ($i \neq 0$). This means that the initial failure occurrence model $M_0$ starts from the normal state $s_0$.

$\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ expresses the correspondence between events and causes: it gives the probability that event $e_j$ occurs in state $s_i$. It is defined as

$$\Pr(e \mid s) = \begin{cases} k \times p & \text{if } f(e) = s \\ p & \text{if } f(e) \neq s \end{cases}$$

where $k$ is a constant greater than or equal to 1. In addition, for every $s \in S \cup \{s_0\}$,

$$\sum_{\{e \,\mid\, f(e) = s\}} k \times p \;+ \sum_{\{e \,\mid\, f(e) \neq s\}} p \;\le\; 1.$$

This definition means that, when $f(e) = s$ in the basic model definition 20, that is, when $s$ is the cause of event $e$, the probability that $s$ produces $e$ is set to $k$ times the probability $p$ used when $f(e) \neq s$. When $f$ is a partial function and $f(e)$ is undefined for some event $e$, the definition above gives $\Pr(e \mid s)$ the probability $p$.
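
The construction of the initial model can be sketched as below. This is a minimal illustration rather than the patent's implementation. The dictionary layout, the name `build_initial_model`, and the default `k=2.0` are assumptions. The text only requires that each state's event probabilities sum to at most 1, so the sketch picks `p` such that the state with the most associated events sums to exactly 1, which is one possible choice among many.

```python
from collections import Counter

NORMAL = "s0"  # label chosen here for the normal state s_0


def build_initial_model(causes, events, f, k=2.0):
    """Build M_0: uniform transitions, start in s_0, emissions k*p or p."""
    states = [NORMAL] + list(causes)
    n = len(states)                       # |S| + 1
    start = {s: (1.0 if s == NORMAL else 0.0) for s in states}
    trans = {a: {b: 1.0 / n for b in states} for a in states}

    # Assumption: choose p so that the largest row sum equals 1.
    per_cause = Counter(c for c in f.values() if c is not None)
    most = max(per_cause.values(), default=0)
    p = 1.0 / (len(events) + (k - 1.0) * most)

    emit = {
        s: {e: (k * p if f.get(e) == s else p) for e in events}
        for s in states
    }
    return {"states": states, "start": start, "trans": trans, "emit": emit}


model = build_initial_model(
    causes=["Print", "Tcpip"],
    events=["3", "4", "16"],
    f={"3": "Print", "4": "Tcpip", "16": None},
)
```

Rows for states with fewer associated events sum to less than 1, which the inequality in the text permits.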

The administrator supplies the learning event sequence 100 to the failure cause estimation device 10 online or offline (step A3). For example, the administrator supplies offline, as the learning event sequence 100, the event sequence that the event monitor 90 recorded during a trial run of the monitoring target device 80. Alternatively, from among the event sequences recorded by the event monitor 90 during operation of the monitoring target device 80, the administrator supplies online, as the learning event sequence 100, those for which failure cause analysis has already been carried out.

The event sequence parser 130 generates, from the supplied learning event sequence 100, event data that the other modules can interpret, and stores it in the event sequence database 140. When generating the event data, the event sequence parser 130 divides the learning event sequence 100 into several event sequences according to a predetermined condition. FIG. 4 shows how an event sequence is divided. The event sequence parser 130 splits the event sequence wherever the interval between event occurrences is greater than a predetermined threshold $T$. Specifically, given an event sequence $(e(0), \dots, e(n))$, if the time between the occurrence of event $e(i)$ and the occurrence of $e(i+1)$ is longer than the threshold $T$, the sequence is divided into event region $R_0$: $(e(0), \dots, e(i))$ and event region $R_1$: $(e(i+1), \dots, e(n))$. As a result, event occurrence intervals within an event region are at most $T$, while the interval between event regions is greater than $T$. Note that $e(i)$ here denotes an individual event occurrence, not an event type.
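
The splitting rule can be sketched as follows. The representation of an event as a `(timestamp, event_id)` tuple and the function name `split_into_regions` are assumptions made for illustration. The input is assumed to be sorted by time.

```python
def split_into_regions(events, threshold):
    """Split a time-ordered list of (timestamp, event_id) into event regions.

    A new region starts whenever the gap to the previous event exceeds
    `threshold` (the value T in the text).
    """
    regions = []
    for event in events:
        if regions and event[0] - regions[-1][-1][0] <= threshold:
            regions[-1].append(event)     # gap <= T: same region
        else:
            regions.append([event])       # first event, or gap > T
    return regions


observed = [(0.0, "3"), (0.5, "4"), (2.5, "16"), (3.0, "3")]
regions = split_into_regions(observed, threshold=1.0)
```

With a threshold of 1 second, the 2-second gap between the second and third events produces two regions.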

Returning again to FIG. 2, the Baum-Welch calculation means 50 uses the Baum-Welch algorithm to train the initial failure occurrence model $M_0$ generated by the initial model generation means 40, using the event regions of the learning event sequence 100 received from the event sequence parser 130. For the given training event sequences (event regions), the Baum-Welch calculation means 50 finds the transition probabilities $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ and the event occurrence probabilities $\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ of the model $M = \{\Sigma,\ S \cup \{s_0\},\ \{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S},\ \{P^0_i\}_{s_i \in S},\ \{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}\}$ that maximize the probability of those sequences. This is, however, a maximum-likelihood method that finds a local solution starting from the initial model $M_0$, not one that finds the globally optimal values. The Baum-Welch algorithm is well known, being described for example in Section 9.3 of "Statistical Methods for Speech Recognition (Language, Speech, and Communication)" by Frederick Jelinek, so a detailed description is omitted here.

The Baum-Welch calculation means 50 generates a failure occurrence model $M'$ by replacing the transition probabilities $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ and the event occurrence probabilities $\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ of the initial failure occurrence model $M_0$ with the transition probabilities and event occurrence probabilities obtained by training, and stores $M'$ in the model storage database 120 (step A4). This completes the failure occurrence model generation phase. From this point on, the failure occurrence model $M'$ obtained in this way is used to estimate the root cause of a failure.
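
A single Baum-Welch re-estimation pass can be sketched in plain Python as below. This is a minimal illustration, not the patent's implementation: the dictionary layout of `model` and all function names are assumptions, no numerical scaling is applied (so it is suitable only for short event regions), and the start probabilities are left unchanged because the text replaces only the transition and event probabilities. In practice the step would be repeated until the likelihood stops improving.

```python
def forward(model, obs):
    """alpha[i][s] = Pr(obs[0..i], state at time i is s)."""
    S = model["states"]
    alpha = [{s: model["start"][s] * model["emit"][s][obs[0]] for s in S}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({t: sum(prev[s] * model["trans"][s][t] for s in S)
                         * model["emit"][t][o] for t in S})
    return alpha


def backward(model, obs):
    """beta[i][s] = Pr(obs[i+1..] | state at time i is s)."""
    S = model["states"]
    beta = [{s: 1.0 for s in S}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(model["trans"][s][t] * model["emit"][t][o] * nxt[t]
                               for t in S) for s in S})
    return beta


def normalise(row, fallback):
    total = sum(row.values())
    return {k: v / total for k, v in row.items()} if total > 0 else dict(fallback)


def baum_welch_step(model, regions):
    """One EM iteration over several event regions (lists of event IDs)."""
    S = model["states"]
    events = list(model["emit"][S[0]])
    t_cnt = {s: {t: 0.0 for t in S} for s in S}        # expected transitions
    e_cnt = {s: {e: 0.0 for e in events} for s in S}   # expected emissions
    for obs in regions:
        alpha, beta = forward(model, obs), backward(model, obs)
        like = sum(alpha[-1].values())
        if like == 0.0:
            continue                                   # region impossible under model
        for i, o in enumerate(obs):
            for s in S:
                e_cnt[s][o] += alpha[i][s] * beta[i][s] / like
                if i + 1 < len(obs):
                    for t in S:
                        t_cnt[s][t] += (alpha[i][s] * model["trans"][s][t]
                                        * model["emit"][t][obs[i + 1]]
                                        * beta[i + 1][t]) / like
    return {"states": S, "start": dict(model["start"]),
            "trans": {s: normalise(t_cnt[s], model["trans"][s]) for s in S},
            "emit": {s: normalise(e_cnt[s], model["emit"][s]) for s in S}}
```

States that are never visited keep their previous probabilities (the `fallback` argument), which avoids division by zero.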

FIG. 5 shows the operation procedure of the failure cause estimation device 10 when estimating the cause of a failure. The administrator supplies online, as the failure cause discovery event sequence 110, the event sequence that the event monitor 90 observed in the monitoring target device 80 to the failure cause estimation device 10 (step B1). The event sequence parser 130 divides the supplied failure cause discovery event sequence 110 into several event regions (FIG. 4) and passes them to the Viterbi calculation means 60 via the event sequence database 140.

Using the Viterbi algorithm, the Viterbi calculation means 60 finds, for the failure occurrence model $M'$ that was trained by the procedure shown in FIG. 2 and stored in the model storage database 120, the ordered sequence of causes $(s(0), s(1), s(2), \dots, s(n))$ that is most likely (has the highest likelihood) to produce the input failure cause discovery event sequence 110 (event region) (step B2). The $s(i)$ in this ordered sequence does not denote a type of cause. It denotes the state transition sequence of causes in time order, and the numbers in parentheses are assigned in time order. The Viterbi algorithm is generally well known, being described for example in Chapter 5 of "Statistical Methods for Speech Recognition (Language, Speech, and Communication)" by Frederick Jelinek, so a detailed description is omitted here.
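
For reference, the Viterbi search over one event region can be sketched as follows. The dictionary layout of `model` and the function name `viterbi` are assumptions made for illustration. Log probabilities are used so that long regions do not underflow, and a zero probability is treated as negative infinity.

```python
import math


def viterbi(model, obs):
    """Return the most likely state sequence for a list of event IDs."""
    S = model["states"]

    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    score = {s: log(model["start"][s]) + log(model["emit"][s][obs[0]]) for s in S}
    back = []                                   # back[i][t] = best predecessor of t
    for o in obs[1:]:
        step, new_score = {}, {}
        for t in S:
            best = max(S, key=lambda s: score[s] + log(model["trans"][s][t]))
            step[t] = best
            new_score[t] = (score[best] + log(model["trans"][best][t])
                            + log(model["emit"][t][o]))
        back.append(step)
        score = new_score

    path = [max(S, key=lambda s: score[s])]     # best final state
    for step in reversed(back):                 # follow back-pointers
        path.append(step[path[-1]])
    path.reverse()
    return path


demo = {"states": ["s0", "a"], "start": {"s0": 1.0, "a": 0.0},
        "trans": {"s0": {"s0": 0.1, "a": 0.9}, "a": {"s0": 0.1, "a": 0.9}},
        "emit": {"s0": {"x": 0.5, "y": 0.5}, "a": {"x": 0.1, "y": 0.9}}}
path = viterbi(demo, ["x", "y", "y"])
```

Because the model starts in the normal state with probability 1, the first element of the returned path is always `s0` in this sketch.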

The filtering module 70 divides the ordered sequence of causes $(s(0), s(1), s(2), \dots, s(n))$ obtained by the Viterbi calculation means 60 into several groups, based on the state transition probability $\Pr(s_{i+1} \mid s_i)$ between each pair of adjacent states in the sequence (step B3). For example, if $\Pr(s(q+1) \mid s(q)) < L$, the sequence is divided into $(s(0), s(1), \dots, s(q))$ and $(s(q+1), \dots, s(n))$. The probability $L$ used to decide the division is a threshold between 0 and 1 and is a relatively small probability value. The reason for dividing in this way is that, when there are two sequences arising from different root causes, they may have been treated as a single sequence merely because that happened to give the maximum probability. The filtering module 70 therefore regards a transition whose probability is lower than the threshold $L$ as joining two sequences that are not one probabilistic sequence but merely happen to be adjacent in time, and splits the sequence at that point.
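
The division at low-probability transitions can be sketched as below. The function name `split_causes`, the argument name `limit` (the threshold L), and the example probabilities are assumptions for illustration only.

```python
def split_causes(path, trans, limit):
    """Split a cause sequence wherever Pr(next | current) < limit."""
    groups = [[path[0]]] if path else []
    for prev, cur in zip(path, path[1:]):
        if trans[prev][cur] < limit:
            groups.append([cur])          # unlikely transition: start a new group
        else:
            groups[-1].append(cur)
    return groups


# Hypothetical learned transition probabilities (only the entries used here).
trans = {
    "Tcpip": {"Dhcp": 0.6},
    "Dhcp": {"Browser": 0.5},
    "Browser": {"Print": 0.01},
}
groups = split_causes(["Tcpip", "Dhcp", "Browser", "Print"], trans, limit=0.05)
```

Here the transition from "Browser" to "Print" falls below the threshold, so "Print" is treated as the start of a separate sequence.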

The filtering module 70 stores the divided sequences (ordered sequences of causes) in the cause estimation result database 150. It also estimates the head of each divided sequence to be a root cause. When storing an ordered sequence of causes in the cause estimation result database 150, the filtering module 70 also stores the events corresponding to each cause in association with that cause. For example, if the root cause is $s_i$, the events in the failure cause discovery event sequence 110 $(e(0), e(1), \dots, e(n))$ that are associated with failure cause $s_i$ in the basic model definition 20 are stored in the cause estimation result database 150 in association with failure cause $s_i$.
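
The records written to the cause estimation result database might be assembled as in the sketch below. The record layout and the name `cause_records` are assumptions, since the text does not specify a storage format. The text also does not say whether the normal state should be skipped when it heads a group, and this sketch does not skip it.

```python
def cause_records(groups, observed, f):
    """Build one record per divided cause sequence.

    groups:   list of cause sequences (output of the splitting step)
    observed: list of event IDs from the failure cause discovery sequence
    f:        mapping event ID -> cause from the basic model definition
    """
    records = []
    for group in groups:
        distinct = []
        for cause in group:               # keep first-seen order, drop repeats
            if cause not in distinct:
                distinct.append(cause)
        records.append({
            "root": group[0],             # head of the sequence = root cause
            "causes": [(c, [e for e in observed if f.get(e) == c])
                       for c in distinct],
        })
    return records


records = cause_records(
    groups=[["Tcpip", "Tcpip", "Dhcp"]],
    observed=["4", "4", "16", "3"],
    f={"4": "Tcpip", "16": "Dhcp", "3": "Print"},
)
```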

A specific example follows. The basic model definition 20 is taken to be the one shown in FIG. 3. The set $\Sigma$ of events handled by the initial failure occurrence model $M_0$ generated by the initial model generation means 40 is

$$\Sigma = \{3, 4, 16, 17, 18, 19, 20\} \;\Rightarrow\; \{e_0, e_1, e_2, e_3, e_4, e_5, e_6\}.$$

The set of states is

$$S \cup \{s_0\} = \{s_0, \text{"Print"}, \text{"Windows Update Agent"}, \text{"W32Time"}, \text{"Application Popup"}, \text{"i8042prt"}, \text{"Windows Installer"}, \text{"DHCP"}, \text{"Browser"}, \text{"Tcpip"}\} \;\Rightarrow\; \{s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9\}.$$

Because there are 10 states in total, the transition probabilities between causes $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ are

$$\Pr(s_j \mid s_i) = 1/10.$$

The initial probabilities are $P^0_0 = 1$ and $P^0_i = 0$ ($i \neq 0$). FIG. 3 is assumed to contain seven event types. In this case, the event occurrence probabilities are

$$\Pr(e \mid s) = \begin{cases} 2/8 & \text{if } f(e) = s \\ 1/8 & \text{if } f(e) \neq s \end{cases}$$
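
The figures in this example can be checked with exact fractions. The sketch below assumes k = 2 and p = 1/8, which is what the stated values 2/8 and 1/8 imply. The row sums are shown for a cause with exactly one associated event and for a state with none, since the full mapping f of FIG. 3 is not reproduced in the text.

```python
from fractions import Fraction

causes = ["Print", "Windows Update Agent", "W32Time", "Application Popup",
          "i8042prt", "Windows Installer", "DHCP", "Browser", "Tcpip"]
events = ["3", "4", "16", "17", "18", "19", "20"]

# Uniform transition probability 1 / (|S| + 1), counting the normal state s_0.
transition = Fraction(1, len(causes) + 1)

k = 2                    # assumed from 2/8 = k * p
p = Fraction(1, 8)

# Sum of event probabilities for a cause with exactly one associated event.
row_sum_one_event = k * p + (len(events) - 1) * p
# Sum for a state with no associated event, such as the normal state.
row_sum_no_event = len(events) * p
```

Both row sums satisfy the requirement that the event probabilities of a state total at most 1.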

Let the learning event sequence 100 be $\{e(0), \dots, e(n)\}$. Each $e(i)$ is an individual event occurrence rather than an event type, and its time of occurrence is recorded. Suppose that in this event sequence the occurrence times of $e(i)$ and $e(i+1)$ differ by 2 seconds, while the occurrence times of all other adjacent events differ by 1 second or less. If the threshold $T$ that the event sequence parser 130 uses to divide an event sequence into regions is 1 second, the learning event sequence 100 is divided into $R_1 = \{e(0), \dots, e(i)\}$ and $R_2 = \{e(i+1), \dots, e(n)\}$. Given many event regions including $R_1$ and $R_2$, the Baum-Welch calculation means 50 starts from the initial failure occurrence model $M_0$ and learns the transition probabilities $\Pr(s_j \mid s_i)$ and event occurrence probabilities $\Pr(e \mid s)$ that maximize the probability of the given event regions, thereby obtaining the failure occurrence model $M'$.

For the trained failure occurrence model $M'$, the Viterbi calculation means 60 finds the ordered sequence of causes that is most likely to produce each event region of the failure cause discovery event sequence 110. The filtering module 70 divides the ordered sequence of causes found by the Viterbi calculation means 60 based on the state transition probabilities between causes in the failure occurrence model $M'$, and stores each divided sequence in the cause estimation result database 150. At the same time, the events in the failure cause discovery event sequence 110 that correspond to each cause are stored in the cause estimation result database 150 in association with that cause. The administrator determines the root cause by referring to the information stored in the cause estimation result database 150.

FIG. 6 shows a specific example of the information stored in the cause estimation result database 150. In the figure, each part labeled "state" corresponds to a cause, and the braces { } following "state" contain the events that correspond to that cause. In this example, the ordered sequence of causes runs from bottom to top, and Tcpip is estimated to be the root cause. By referring to the information stored in the cause estimation result database 150, the administrator can see that a TCP/IP protocol stack error gave rise to failure causes such as "Browser", "Dhcp", and "Windows Installer 3.1".

In this embodiment, a failure occurrence model is generated from a given correspondence between events and their causes, an event sequence observed in the monitoring target device 80 is supplied to that model, and an ordered sequence of causes is obtained from the transitions of the event sequence. By dividing the ordered sequence of causes obtained in this way according to the transition probabilities between causes, the root cause of the failure, from which the chain of causes originates, can be estimated. In this embodiment, the relationships between failure causes are obtained by supplying the learning event sequence 100 to the initial failure occurrence model, so the dependencies between causes need not be defined by hand. To generate the initial failure occurrence model, it is enough to define the events and their causes. Because the relationship between an event and its cause is comparatively easy to describe, the root cause of a failure can be estimated with little effort.

The present invention has been described above based on its preferred embodiment. However, the failure cause estimation system, method, and program of the present invention are not limited to the embodiment described above, and configurations obtained by applying various modifications and changes to that embodiment are also included in the scope of the present invention.

The present invention can be applied to failure monitoring systems for networks and computer systems. It can also be applied to failure detection in embedded systems.

FIG. 1 is a block diagram showing the configuration of a failure cause estimation device according to one embodiment of the present invention.
FIG. 2 is a flowchart showing the operation procedure of the failure cause estimation device when generating a failure occurrence model.
FIG. 3 is a diagram showing an example of how a basic model definition is written.
FIG. 4 is a schematic diagram showing how an event sequence is divided.
FIG. 5 is a flowchart showing the operation procedure of the failure cause estimation device when estimating the cause of a failure.
FIG. 6 is a diagram showing a specific example of the information stored in the cause estimation result database.
FIG. 7 is a block diagram showing the configuration of the failure cause discovery device described in Patent Document 1.
FIG. 8 is a block diagram showing the configuration of the causal system model construction device used in the failure cause estimation system described in Patent Document 2.
FIG. 9 is a block diagram showing the configuration of the cause estimation device used in the failure cause estimation system described in Patent Document 2.
FIG. 10 is a model diagram showing a modeled causal relationship.

Explanation of symbols

10: Failure cause estimation device
20: Basic model definition
30: Initial model parser
40: Initial model generation means
50: Baum-Welch calculation means
60: Viterbi calculation means
70: Filtering module
80: Monitoring target device
90: Event monitor
100: Learning event sequence
110: Failure cause discovery event sequence
120: Model storage database
130: Event sequence parser
140: Event sequence database
150: Cause estimation result database

Claims

A failure cause estimation system comprising:
Input means for inputting a basic model definition that defines events caused by failures that can occur in a monitoring target device and the failure causes that give rise to those events;
Initial model generation means for generating, based on the basic model definition input by the input means, an initial failure occurrence model in which the correspondence between events and their occurrence causes and the transitions between occurrence causes are modeled by a finite automaton;
Initial model storage means for storing the initial failure occurrence model generated by the initial model generation means;
Learning event sequence storage means for storing a learning event sequence that is a set of events that have occurred in the monitoring target device;
Transition probability learning means for learning, based on the initial failure occurrence model stored in the initial model storage means and the learning event sequence stored in the learning event sequence storage means, with what probability the state of the finite automaton representing the transitions between the occurrence causes of the learning event sequence has transitioned;
Failure occurrence model generation means for generating a failure occurrence model reflecting the transition probabilities learned by the transition probability learning means;
Failure occurrence model storage means for storing the failure occurrence model generated by the failure occurrence model generation means;
Failure cause discovery event sequence storage means for storing a failure cause discovery event sequence that is a set of events that have occurred in the monitoring target device;
State transition sequence generation means for obtaining, according to the failure occurrence model stored in the failure occurrence model storage means, the state transition sequence judged most likely to have produced the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means;
State transition sequence storage means for storing the state transition sequence generated by the state transition sequence generation means; and
A filtering module that divides the state transition sequence stored in the state transition sequence storage means based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage means, and estimates the head state of each divided state transition sequence to be a root cause of the failure that has occurred in the monitoring target device.

The failure cause estimation system according to claim 1, wherein the transition probability learning means uses the Baum-Welch algorithm to learn the state transition probabilities between the occurrence causes of the learning event sequence stored in the learning event sequence storage means and the event occurrence probabilities for each cause.
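
A minimal sketch of one Baum-Welch re-estimation pass over a single learning event sequence may help make the learning concrete. It assumes states and events are encoded as integer indices, that `pi`, `A`, and `B` hold the initial, transition, and event occurrence probabilities of the initial model, and that the sequence is short enough that no numerical scaling is needed. All names and numeric values here are hypothetical and are not taken from the specification.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = Pr(events up to t, state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = Pr(events after t | state i at time t)."""
    n = len(A)
    beta = [[1.0] * n]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One re-estimation pass; returns (pi, A, B, likelihood of obs under the input model)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])  # assumed to be non-zero
    # gamma[t][i]: probability of being in state i at time t, given obs
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of the transition i -> j between t and t + 1
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A, new_B = [], []
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))
        new_A.append([
            sum(xi[t][i][j] for t in range(T - 1)) / visits if visits > 0 else A[i][j]
            for j in range(n)
        ])
        total = sum(gamma[t][i] for t in range(T))
        new_B.append([
            sum(gamma[t][i] for t in range(T) if obs[t] == k) / total if total > 0 else B[i][k]
            for k in range(m)
        ])
    return new_pi, new_A, new_B, likelihood


# Hypothetical model: state 0 = normal, states 1 and 2 = two failure causes;
# events 0, 1, 2 are three kinds of monitored event.
pi = [0.8, 0.1, 0.1]
A = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.3, 0.1, 0.6]]
B = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
obs = [0, 0, 1, 1, 2, 0]

pi1, A1, B1, like0 = baum_welch_step(obs, pi, A, B)
_, _, _, like1 = baum_welch_step(obs, pi1, A1, B1)  # like1: likelihood under the updated model
```

Each pass cannot decrease the likelihood of the learning sequence, so in practice the step would be repeated until the likelihood stops improving.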

The failure cause estimation system according to claim 1, wherein the state transition sequence generation means obtains the state transition sequence by the Viterbi algorithm.
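
The Viterbi search can be sketched as follows, assuming the same integer encoding of states and events and a learned model given as `pi`, `A`, and `B`. The two-state model and its probabilities are hypothetical values chosen only for illustration.

```python
def viterbi(obs, pi, A, B):
    """Return the most likely state sequence for the observed event sequence."""
    n = len(pi)
    # delta[i]: probability of the best path that ends in state i at the current time
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t - 1][j]: best predecessor of state j at time t
    for t in range(1, len(obs)):
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        back.append(step)
        delta = new_delta
    # Trace the best path backwards from the most likely final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical model: state 0 = normal, state 1 = a failure cause;
# event 0 = "ok", event 1 = "error".
pi = [0.9, 0.1]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.9, 0.1], [0.2, 0.8]]
path = viterbi([0, 0, 1, 1], pi, A, B)
```

For long event sequences the products would normally be replaced by sums of logarithms to avoid underflow; the sketch omits this for brevity.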

The failure cause estimation system according to claim 1, wherein the initial failure occurrence model stored in the initial model storage means includes: a set $\Sigma$ of the events; a set $S$ of states obtained by adding a normal state $s_0$ to the set of failure causes; conditional probabilities $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ indicating the probability of a transition from each state to each state; initial probabilities $\{P^0_{s_i}\}_{s_i \in S}$ indicating the probability of being in each state at the start; and probabilities $\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ indicating, for each state, the probability that each event occurs in that state.
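
One possible in-memory representation of these five components is sketched below. The `FailureModel` class, the `uniform_initial_model` helper, the dictionary form of the basic model definition, and the choice of uniform starting probabilities are all assumptions made for illustration; the claim does not prescribe how the initial values are chosen.

```python
from dataclasses import dataclass


@dataclass
class FailureModel:
    events: list  # Sigma: the set of events
    states: list  # S: the normal state s0 followed by the failure causes
    trans: dict   # trans[si][sj] = Pr(sj | si)
    init: dict    # init[si] = initial probability of state si
    emit: dict    # emit[si][ej] = Pr(ej | si)

    def validate(self, tol=1e-9):
        """Check that every probability distribution sums to 1."""
        assert abs(sum(self.init.values()) - 1.0) < tol
        for s in self.states:
            assert abs(sum(self.trans[s].values()) - 1.0) < tol
            assert abs(sum(self.emit[s].values()) - 1.0) < tol


def uniform_initial_model(definition, normal_events, normal="s0"):
    """Build a model with uniform probabilities (an assumed starting point).

    definition: hypothetical mapping from failure cause to the events it can raise.
    normal_events: events that may be observed in the normal state.
    """
    states = [normal] + list(definition)
    emits = {normal: normal_events, **definition}
    events = sorted({e for evs in emits.values() for e in evs})
    p = 1.0 / len(states)
    trans = {si: {sj: p for sj in states} for si in states}
    init = {s: p for s in states}
    emit = {
        s: {e: (1.0 / len(emits[s]) if e in emits[s] else 0.0) for e in events}
        for s in states
    }
    return FailureModel(events, states, trans, init, emit)


definition = {"disk_failure": ["io_error", "timeout"], "net_failure": ["timeout"]}
model = uniform_initial_model(definition, normal_events=["ok"])
model.validate()
```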

The failure cause estimation system according to claim 4, wherein, when the state transition sequence stored in the state transition sequence storage means is $\{s(0), s(1), \ldots, s(n)\}$ and the conditional probability $\Pr(s(i+1) \mid s(i))$ of a transition from state $s(i)$ ($0 \le i < n$) to the next state $s(i+1)$ is lower than a predetermined probability, the filtering module divides the state transition sequence stored in the state transition sequence storage means into $\{s(0), s(1), \ldots, s(i)\}$ and $\{s(i+1), \ldots, s(n)\}$.
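
The division rule can be sketched as a single pass over the state transition sequence. The state names, transition probabilities, and threshold below are hypothetical; `trans[a][b]` is assumed to hold the learned probability of a transition from state `a` to state `b`.

```python
def split_by_transition(path, trans, threshold):
    """Cut the state sequence wherever Pr(next | current) falls below the threshold."""
    if not path:
        return []
    segments = [[path[0]]]
    for prev, cur in zip(path, path[1:]):
        if trans[prev][cur] < threshold:
            segments.append([cur])    # unlikely transition: start a new segment
        else:
            segments[-1].append(cur)  # likely transition: same causal chain
    return segments


def root_causes(path, trans, threshold):
    """The head state of each segment is taken as a root cause."""
    return [segment[0] for segment in split_by_transition(path, trans, threshold)]


# Hypothetical learned transition probabilities between failure causes.
trans = {
    "power": {"disk": 0.6},
    "disk": {"app": 0.7},
    "app": {"net": 0.01},
    "net": {"app": 0.5},
}
path = ["power", "disk", "app", "net", "app"]
segments = split_by_transition(path, trans, threshold=0.05)
roots = root_causes(path, trans, threshold=0.05)
```

In this example the transition from `app` to `net` is improbable, so the sequence is cut there and `power` and `net` are reported as two independent root causes.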

The failure cause estimation system according to claim 1, wherein the learning event sequence stored in the learning event sequence storage means is an event sequence monitored during a trial run of the monitoring target device.

The failure cause estimation system according to claim 1, wherein the learning event sequence stored in the learning event sequence storage means is an event sequence that was monitored during operation of the monitoring target device and whose failure cause has already been analyzed.

The failure cause estimation system according to claim 1, wherein the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means is an event sequence monitored during operation of the monitoring target device.

The failure cause estimation system according to claim 1, wherein, in each of the learning event sequence stored in the learning event sequence storage means and the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means, the occurrence time interval between any two adjacent events is equal to or less than a predetermined value.
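
One way to obtain event sequences that satisfy this condition is to cut a raw event log wherever the gap between adjacent events exceeds the predetermined value. The sketch below assumes events are `(timestamp_in_seconds, event_name)` pairs sorted by timestamp; the log contents and the 5-second limit are hypothetical.

```python
def split_by_gap(events, max_gap):
    """Split a time-ordered event log into sequences whose adjacent gaps are <= max_gap."""
    sequences = []
    for timestamp, name in events:
        last_timestamp = sequences[-1][-1][0] if sequences else None
        if last_timestamp is not None and timestamp - last_timestamp <= max_gap:
            sequences[-1].append((timestamp, name))  # close enough: same sequence
        else:
            sequences.append([(timestamp, name)])    # gap too large: new sequence
    return sequences


log = [(0, "io_error"), (2, "timeout"), (30, "link_down"), (31, "timeout")]
sequences = split_by_gap(log, max_gap=5)
```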

The failure cause estimation system according to claim 1, wherein the filtering module further comprises:
Means for storing the state transition sequence stored in the state transition sequence storage means; and
Means for storing, in association with each state, those events in the failure cause discovery event sequence stored in the failure cause discovery event sequence storage means whose occurrence cause is a state included in the state transition sequence stored in the state transition sequence storage means.

A failure cause estimation method for estimating, using a computer, the root cause of a failure that has occurred in a monitoring target device, the method comprising:
An input step of inputting a basic model definition that defines events caused by failures that can occur in the monitoring target device and the failure causes that give rise to those events;
An initial model generation step of generating, based on the basic model definition input in the input step, an initial failure occurrence model in which the correspondence between events and their occurrence causes and the transitions between occurrence causes are modeled by a finite automaton;
An initial model storage step of storing the initial failure occurrence model generated in the initial model generation step;
A learning event sequence storage step of storing a learning event sequence that is a set of events that have occurred in the monitoring target device;
A transition probability learning step of learning, based on the initial failure occurrence model stored in the initial model storage step and the learning event sequence stored in the learning event sequence storage step, with what probability the state of the finite automaton representing the transitions between the occurrence causes of the learning event sequence has transitioned;
A failure occurrence model generation step of generating a failure occurrence model reflecting the transition probabilities learned in the transition probability learning step;
A failure occurrence model storage step of storing the failure occurrence model generated in the failure occurrence model generation step;
A failure cause discovery event sequence storage step of storing a failure cause discovery event sequence that is a set of events that have occurred in the monitoring target device;
A state transition sequence generation step of obtaining, according to the failure occurrence model stored in the failure occurrence model storage step, the state transition sequence judged most likely to have produced the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step;
A state transition sequence storage step of storing the state transition sequence generated in the state transition sequence generation step; and
A filtering step of dividing the state transition sequence stored in the state transition sequence storage step based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage step, and estimating the head state of each divided state transition sequence to be a root cause of the failure that has occurred in the monitoring target device.

The failure cause estimation method according to claim 11, wherein the transition probability learning step uses the Baum-Welch algorithm to learn the state transition probabilities between the occurrence causes of the learning event sequence stored in the learning event sequence storage step and the event occurrence probabilities for each cause.

The failure cause estimation method according to claim 11, wherein the state transition sequence generation step obtains the state transition sequence by the Viterbi algorithm.

The failure cause estimation method according to claim 11, wherein the initial failure occurrence model stored in the initial model storage step includes: a set $\Sigma$ of the events; a set $S$ of states obtained by adding a normal state $s_0$ to the set of failure causes; conditional probabilities $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ indicating the probability of a transition from each state to each state; initial probabilities $\{P^0_{s_i}\}_{s_i \in S}$ indicating the probability of being in each state at the start; and probabilities $\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ indicating, for each state, the probability that each event occurs in that state.

The failure cause estimation method according to claim 14, wherein, when the state transition sequence stored in the state transition sequence storage step is $\{s(0), s(1), \ldots, s(n)\}$ and the conditional probability $\Pr(s(i+1) \mid s(i))$ of a transition from state $s(i)$ ($0 \le i < n$) to the next state $s(i+1)$ is lower than a predetermined probability, the filtering step divides the state transition sequence stored in the state transition sequence storage step into $\{s(0), s(1), \ldots, s(i)\}$ and $\{s(i+1), \ldots, s(n)\}$.

The failure cause estimation method according to claim 11, wherein the filtering step further includes:
A step of storing the state transition sequence stored in the state transition sequence storage step; and
A step of storing, in association with each state, those events in the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step whose occurrence cause is a state included in the state transition sequence stored in the state transition sequence storage step.

A failure cause estimation program for causing a computer to estimate the root cause of a failure that has occurred in a monitoring target device, the program causing the computer to execute:
An input step of inputting a basic model definition that defines events caused by failures that can occur in the monitoring target device and the failure causes that give rise to those events;
An initial model generation step of generating, based on the basic model definition input in the input step, an initial failure occurrence model in which the correspondence between events and their occurrence causes and the transitions between occurrence causes are modeled by a finite automaton;
An initial model storage step of storing the initial failure occurrence model generated in the initial model generation step;
A learning event sequence storage step of storing a learning event sequence that is a set of events that have occurred in the monitoring target device;
A transition probability learning step of learning, based on the initial failure occurrence model stored in the initial model storage step and the learning event sequence stored in the learning event sequence storage step, with what probability the state of the finite automaton representing the transitions between the occurrence causes of the learning event sequence has transitioned;
A failure occurrence model generation step of generating a failure occurrence model reflecting the transition probabilities learned in the transition probability learning step;
A failure occurrence model storage step of storing the failure occurrence model generated in the failure occurrence model generation step;
A failure cause discovery event sequence storage step of storing a failure cause discovery event sequence that is a set of events that have occurred in the monitoring target device;
A state transition sequence generation step of obtaining, according to the failure occurrence model stored in the failure occurrence model storage step, the state transition sequence judged most likely to have produced the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step;
A state transition sequence storage step of storing the state transition sequence generated in the state transition sequence generation step; and
A filtering step of dividing the state transition sequence stored in the state transition sequence storage step based on the transition probabilities of the failure occurrence model stored in the failure occurrence model storage step, and estimating the head state of each divided state transition sequence to be a root cause of the failure that has occurred in the monitoring target device.

The failure cause estimation program according to claim 17, wherein the transition probability learning step uses the Baum-Welch algorithm to learn the state transition probabilities between the occurrence causes of the learning event sequence stored in the learning event sequence storage step and the event occurrence probabilities for each cause.

The failure cause estimation program according to claim 17, wherein the state transition sequence generation step obtains the state transition sequence by the Viterbi algorithm.

The failure cause estimation program according to claim 17, wherein the initial failure occurrence model stored in the initial model storage step includes: a set $\Sigma$ of the events; a set $S$ of states obtained by adding a normal state $s_0$ to the set of failure causes; conditional probabilities $\{\Pr(s_j \mid s_i)\}_{s_i, s_j \in S}$ indicating the probability of a transition from each state to each state; initial probabilities $\{P^0_{s_i}\}_{s_i \in S}$ indicating the probability of being in each state at the start; and probabilities $\{\Pr(e_j \mid s_i)\}_{s_i \in S,\, e_j \in \Sigma}$ indicating, for each state, the probability that each event occurs in that state.

The failure cause estimation program according to claim 20, wherein, when the state transition sequence stored in the state transition sequence storage step is $\{s(0), s(1), \ldots, s(n)\}$ and the conditional probability $\Pr(s(i+1) \mid s(i))$ of a transition from state $s(i)$ ($0 \le i < n$) to the next state $s(i+1)$ is lower than a predetermined probability, the filtering step divides the state transition sequence stored in the state transition sequence storage step into $\{s(0), s(1), \ldots, s(i)\}$ and $\{s(i+1), \ldots, s(n)\}$.

The failure cause estimation program according to claim 17, wherein the filtering step further causes the computer to execute:
A step of storing the state transition sequence stored in the state transition sequence storage step; and
A step of storing, in association with each state, those events in the failure cause discovery event sequence stored in the failure cause discovery event sequence storage step whose occurrence cause is a state included in the state transition sequence stored in the state transition sequence storage step.