JP6280862B2

JP6280862B2 - Event analysis system and method

Info

Publication number: JP6280862B2
Application number: JP2014238755A
Authority: JP
Inventors: 慶行但馬; 明俊志村; 知行山形
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-11-26
Filing date: 2014-11-26
Publication date: 2018-02-14
Anticipated expiration: 2034-11-26
Also published as: JP2016099938A

Description

The present invention relates to an event analysis system and an event analysis method for use in the maintenance of, and failure response for, social infrastructure systems such as railways and electric power and their equipment.

In recent years, social infrastructure has grown larger and more complex, as seen in through-service operation between railway lines and power interchange between utilities, and rising costs for failure response and maintenance, along with a shortage of skilled personnel, have become problems. Moreover, as it has become common for systems related to social infrastructure to be connected to public communication networks such as the Internet, new threats such as cyber attacks have emerged. Against this background, in addition to the failure analysis of individual components such as turbines, which has long been developed and put to practical use, there is a need for faster response to system-level failures, such as failures caused by the computers that control the components and failures that can be analyzed from the data those computers output.

In computers, most of the data to be analyzed is not numerical time-series data measured at regular intervals, like the sensor values that measure the state of social infrastructure, but time-series character string data, so-called logs, which record irregularly occurring OS and middleware warnings and errors, command execution results, application internal state transitions, operation history, and the like.

To deal with system-level failures, Patent Document 1 discloses a technique that defines in advance the causes that give rise to events such as failures, models the transitions between causes with a finite automaton, and outputs the state transition of causes that maximizes the generation probability of the observed events. This makes it possible to present to the user a cause analysis based on the process by which the defined causes arise, as well as predictions of future event occurrences.

Patent Document 1: JP 2011-175504 A

However, to apply the technique described in Patent Document 1, the device developer must appropriately define the causes of events in advance. In a system in which multiple computers are connected to multiple components (elements that make up social infrastructure, such as devices and equipment), the candidate causes of an event (a system state such as a failure) become enormous, for example because the cause of a failure in one computer may lie in another component on the network, and it is difficult to define all of them in advance. Consequently, not only may the cause of an event go unidentified, but an event whose cause cannot be identified may be judged to have arisen from some other cause, so the accuracy of cause estimation may deteriorate.

On the other hand, even when the causes are properly defined, there are situations in which the event corresponding to a cause cannot actually be observed. For example, an event cannot be observed when a log was output but then discarded because of constraints on disk storage capacity or network resources. In such a case, contrary to the situation above, the event that should be associated with a given cause cannot be observed, so another event is associated with it instead, and the events that occur may not be interpreted correctly even when the causes are analyzed.

In such situations, the time and cost required for failure detection, analysis, and so on when a system is put into operation or modified become very large. There is therefore a need for an event analysis system and method that do not require the causes of failures in the monitoring target system to be defined in advance.

The disclosed event analysis system has: a local prediction model learning process that learns in advance a local prediction model for predicting a first event based on a first event sequence obtained while the monitoring target system is operating; an anomaly detection process that monitors a second event sequence of the monitoring target system and extracts a second event, included in the monitored second event sequence, whose observation result does not match the prediction result for the first event predicted using the local prediction model; and a failure analysis support process that, in response to the extraction of the second event, creates a first event graph by backtracing from a third event as the starting point, the third event being included in the monitored second event sequence, having occurred immediately before the second event, and giving rise to the observation result.

According to the disclosed event analysis system, events corresponding to the occurrence of an abnormality in the monitoring target system can be analyzed using a local prediction model based on events from that system, so the time and cost required for failure detection, analysis, and so on when a system is put into operation or modified can be reduced.

FIG. 1 is a configuration diagram of the event analysis system.

FIG. 2 is an example of an event sequence.

FIG. 3 is an example of local prediction model parameters.

FIG. 4 is an example of model prediction accuracy data.

FIG. 5 is an example of event graph data.

FIG. 6 is an example of node events.

FIG. 7 is an example of case data.

FIG. 8 is an example of feature function definitions.

FIG. 9 is a processing flowchart of the local prediction model learning process.

FIG. 10 is a processing flowchart of the learning of the local prediction model.

FIG. 11 is a processing flowchart of the failure analysis support process.

FIG. 12 is a processing flowchart of the creation of an event graph.

FIG. 13 is a processing flowchart of the case registration process.

FIG. 14 is a processing flowchart of the case presentation process.

FIG. 15 is an example of an event graph display screen.

FIG. 16 is an example of a case data display screen.

This embodiment is an example of a system that analyzes events (states of the monitoring target system) for failure detection, response support, and case presentation, based on logs that a social infrastructure system such as a railway or electric power system (referred to as the monitoring target system) outputs during operation. Before the detailed description, an outline of the event analysis system of this embodiment is given.

The event analysis system has three functions (processes): model learning, failure analysis support, and case presentation. Model learning learns a local prediction model that locally predicts changes in events, using an event sequence obtained by collecting and analyzing logs from the monitoring target system during normal operation.

Failure analysis support detects an abnormal state of the monitoring target system, analyzes the process that led to it, and presents the result to maintenance personnel and engineers. Specifically, it supports failure analysis as follows. The event analysis system predicts changes in events using the local prediction model and monitors the divergence (match or mismatch) between the prediction and the observed events. When they diverge (do not match), the event analysis system judges the observed event to be a failure and extracts it. This judgment rests on the fact that monitoring target systems such as typical social infrastructure are built on state transitions, so the state transition process is the same even when the timing differs; behavior that departs from the prediction is therefore likely to be abnormal behavior of the monitoring target system. Next, starting from the event that occurred immediately before the extracted event, the system traces back through the events (backtraces), extracts related events, and links the extracted events to generate an event graph. It then presents this event graph to the maintenance personnel and engineers handling the failure. Because the event graph links events that are likely to have contributed to the failure, it helps them grasp the sequence leading to the failure and find the root cause.
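The mismatch check described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Predictor` type, the function `detect_mismatches`, and the toy rules for `ACK` and `ERR` are all assumptions introduced for illustration, and the sketch flags a mismatch in either direction (an expected event that is missing, or an unexpected event that occurs).

```python
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical predictor type: given the recent event IDs, return True if the
# event this predictor belongs to is expected to occur next.
Predictor = Callable[[Sequence[str]], bool]


def detect_mismatches(
    history: Sequence[str],
    observed: Set[str],
    predictors: Dict[str, Predictor],
) -> List[str]:
    """Return the event IDs whose observation disagrees with the prediction."""
    mismatched = []
    for event_id in sorted(predictors):
        expected = predictors[event_id](history)
        occurred = event_id in observed
        if expected != occurred:
            mismatched.append(event_id)
    return mismatched


# Toy predictors (assumptions): an ACK is expected right after a SEND,
# and an ERR is never expected during normal operation.
predictors = {
    "ACK": lambda history: len(history) > 0 and history[-1] == "SEND",
    "ERR": lambda history: False,
}

normal = detect_mismatches(["SEND"], {"ACK"}, predictors)
abnormal = detect_mismatches(["SEND"], {"ERR"}, predictors)
```

In the second call the expected `ACK` is absent and an unexpected `ERR` is present, so both event IDs are reported; in the system described here, such events would become the trigger for backtracing.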

Case presentation uses the event graph generated by failure analysis support, which may lead to a failure, as a key to search case data on failure causes and countermeasures registered in advance by maintenance personnel and engineers, and presents countermeasures for the failure based on the search result. When registering case data, maintenance personnel or engineers designate the events related to that case data. The event analysis system generates an event graph from the designated events and registers it in association with the case data. The event analysis system then generates event graphs from the events extracted from the monitoring target system and monitors their similarity to the registered event graphs. When it finds a similar event graph, it presents the generated event graph and the corresponding case data to the maintenance personnel and engineers. In other words, with the event graph serving as the key to the case data, case data is searched for using the event graph generated while the monitoring target system is operating, and the retrieved case data is presented together with that event graph. As a result, for a failure similar to a known one, maintenance personnel and engineers can take countermeasures without spending time on analysis.
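As a rough illustration of using an event graph as a search key, the following Python sketch compares graphs by the Jaccard similarity of their edge sets. The similarity measure, the threshold value, and the names `jaccard` and `find_similar_cases` are assumptions chosen for this example; this passage does not specify how similarity is computed.

```python
from typing import AbstractSet, Dict, List, Tuple

# An event graph is reduced here to a set of (parent event ID, child event ID)
# edges. This representation is an assumption made for the sketch.
EdgeSet = AbstractSet[Tuple[str, str]]


def jaccard(a: EdgeSet, b: EdgeSet) -> float:
    """Jaccard similarity of two edge sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar_cases(
    query: EdgeSet,
    registered: Dict[str, EdgeSet],
    threshold: float = 0.5,  # hypothetical cut-off
) -> List[Tuple[str, float]]:
    """Return (case ID, similarity) pairs at or above the threshold, best first."""
    hits = [(case_id, jaccard(query, graph)) for case_id, graph in registered.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical registered cases and a graph generated during operation.
registered = {
    "case1": {("HUB.LOSS", "M2.ABORT"), ("M3.ADD", "HUB.LOSS")},
    "case2": {("X", "Y")},
}
query = {("HUB.LOSS", "M2.ABORT"), ("M3.ADD", "HUB.LOSS"), ("A", "B")}
matches = find_similar_cases(query, registered)
```

Here `case1` shares two of the three distinct edges with the query graph and is returned, while `case2` shares none and is filtered out.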

FIG. 1 is a configuration diagram of the event analysis system. The event analysis system includes an event monitoring device 2 that monitors events of the monitoring target system 1, an analysis computer 3, and an operation terminal 4. The monitoring target system 1 and the event monitoring device 2 are connected via a control network 14, such as a control LAN (Local Area Network), within the monitoring target system 1. The event monitoring device 2, the analysis computer 3, and the operation terminal 4 are connected via a network 5 such as the Internet or a private network.

As shown in FIG. 1, in a typical event analysis system the event monitoring device 2 is placed at the local site where the monitoring target system 1 is installed, and the analysis computer 3 is installed at a data center or the like. The operation terminal 4 is installed at both the local site and the maintenance base, for use by maintenance personnel at the local site and engineers at the maintenance base. This configuration and arrangement are only an example: everything may be placed at the local site, or everything except some functions of the event monitoring device 2 described later (the collection analysis unit and the transfer unit) may be consolidated at the maintenance base. In a system in which the event monitoring device 2, the analysis computer 3, and the operation terminal 4 are not connected by a network, events and the parameters of the local prediction model (described later) may be transferred via a storage medium such as a USB (Universal Serial Bus) memory.

The monitoring target system 1 may be any of various systems and may take various forms, but it includes equipment 11 having sensors and actuators, a controller 12 that monitors the state of the equipment 11 via the sensors and controls the equipment 11, and a control computer 13 that supervises the controller 12, all connected via the control network 14. The illustrated components are an example: the number of elements such as the equipment 11 may be increased or decreased, and the elements may be connected by a single control network 14 or by hierarchical control networks 14.

The configuration of each device is described with reference to FIG. 1. The event monitoring device 2 is a computer that includes, as processing units, a collection analysis unit 21, an abnormality detection unit 22, and a case search unit 23, and, as storage units, a short-term event storage unit 24 and a parameter storage unit 25.

The collection analysis unit 21 collects logs from the controller 12, the control computer 13, and the control network 14 (network devices that output logs) of the monitoring target system 1, analyzes the collected logs to generate a time-series event sequence, and transmits it to the analysis computer 3. The abnormality detection unit 22 detects an abnormal state of the monitoring target system 1 based on the parameters of the local prediction model received from the analysis computer 3. The case search unit 23 searches for case data corresponding to the event graph received from the analysis computer 3. The short-term event storage unit 24 is a buffer that temporarily stores events until they are transmitted to the analysis computer 3. The parameter storage unit 25 stores the parameters of the local prediction model, case data, and the like.

The analysis computer 3 is a computer that includes, as processing units, a model learning unit 31, an event graph creation unit 32, and an operation management unit 33, and, as storage units, an event storage unit 34, a model storage unit 35, and a case storage unit 36.

The model learning unit 31 generates a local prediction model of the monitoring target system 1 in its normal state from the event sequence generated by the event monitoring device 2. The event graph creation unit 32 traces back (backtraces) from the starting event, extracts the events that, on the local prediction model of the monitoring target system 1, contributed to the abnormal state up to the occurrence of the event indicating that abnormal state, and links the extracted events to create an event graph. The operation management unit 33 creates display information for the operation terminal 4 and registers case data received from the operation terminal 4.

The event storage unit 34 stores the events received from the event monitoring device 2. The model storage unit 35 stores the parameters of the local prediction model and the like. The case storage unit 36 stores case data, the corresponding event graphs, and the like.

The operation terminal 4 displays the display information created by the analysis computer 3 and accepts input of case data.

The arrangement of the processing units and storage units described above is an example; they may reside on other devices or be distributed over multiple devices. For example, the abnormality detection unit 22 may reside in the analysis computer 3, and the event graph creation unit 32 may be placed in the event monitoring device 2.

FIG. 2 is an example of the event sequence 100 stored in the short-term event storage unit 24 and the event storage unit 34 (a single event is also called an event sequence). The event sequence 100 is obtained by analyzing the logs collected from the monitoring target system 1 and includes a site ID 101, a time 102, an event ID 103, and detailed data 104. The site ID 101 and the event ID 103 are identifiers for the local site and the event, respectively. As an example of a communication log of the communication middleware at a local site Site1, FIG. 2 shows "2014/9/21 12:23:14.1034 SEND TYPE=xxx DATA=yyy" in the detailed data 104. For a log that yields such an event sequence 100, Site1 is stored in the site ID 101, the time portion of the log (the time portion of the detailed data 104) is stored in the time 102, an identifier predetermined for a name such as "device name.communication middleware name.log name.SEND" is stored in the event ID 103, and either the whole log or the part of its character string that remains after unnecessary portions are removed is stored in the detailed data 104.

A time-ordered series of events (the records in FIG. 2) is thus an event sequence. As noted above, a single event is also called an event sequence. The short-term event storage unit 24 stores the events of its own local site, whereas the event storage unit 34 stores the events of all local sites connected to the analysis computer 3.
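To make the mapping from a log line to an event record concrete, here is a small Python sketch. The regular expression, the `Event` class, the `source` prefix `machine1.commmw.commlog`, and the choice to use the dotted name directly as the event ID (rather than looking up a predetermined identifier) are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    site_id: str    # corresponds to site ID 101
    time: datetime  # corresponds to time 102
    event_id: str   # corresponds to event ID 103
    detail: str     # corresponds to detailed data 104


# Assumed log layout: "<yyyy/m/d> <h:mm:ss>.<fraction> <KIND> <rest>"
LOG_PATTERN = re.compile(r"^(\d{4}/\d{1,2}/\d{1,2}) (\d{1,2}:\d{2}:\d{2})\.(\d+) (\w+)")


def parse_log_line(site_id: str, source: str, line: str) -> Event:
    """Turn one log line into an Event; `source` is a hypothetical dotted prefix."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    date, clock, fraction, kind = match.groups()
    # Pad or truncate the fractional seconds to microseconds.
    microseconds = int(fraction[:6].ljust(6, "0"))
    timestamp = datetime.strptime(f"{date} {clock}", "%Y/%m/%d %H:%M:%S")
    timestamp = timestamp.replace(microsecond=microseconds)
    # A real system would map this name to a predetermined identifier.
    return Event(site_id, timestamp, f"{source}.{kind}", line)


event = parse_log_line(
    "Site1",
    "machine1.commmw.commlog",
    "2014/9/21 12:23:14.1034 SEND TYPE=xxx DATA=yyy",
)
```

The whole line is kept as the detailed data here; trimming unneeded portions, as the text allows, would be a straightforward variation.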

FIG. 3 is an example of the local prediction model parameters 200 stored in the parameter storage unit 25 and the model storage unit 35. As described in detail later, a logistic regression model (a log-linear model for the binary case) is used as the local prediction model for events. The parameters of the local prediction model are therefore a weight vector over the features of events (described later with reference to FIG. 8). The local prediction model parameters 200 include a site ID 201, an event ID 202, a feature ID 203, and a weight 204. The site ID 201 and the event ID 202 are identifiers for the local site and the event, respectively, and the feature ID 203 is an identifier for a feature of the event corresponding to the local prediction model. A local prediction model is specified by a pair of site ID 201 and event ID 202.

When another model is used, the parameters are changed accordingly. The parameter storage unit 25 stores the parameters relevant to its own site, whereas the model storage unit 35 stores the parameters of all local sites connected to the analysis computer 3.
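The role of the per-feature weights in a logistic regression model can be sketched in Python as follows. This is a generic logistic regression scoring step, not code from the patent; the parameter layout (a dictionary keyed by site ID and event ID), the absence of a bias term, and the sample weights are assumptions.

```python
import math
from typing import Dict, Tuple

ModelKey = Tuple[str, str]  # (site ID 201, event ID 202)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(
    params: Dict[ModelKey, Dict[str, float]],
    key: ModelKey,
    features: Dict[str, int],
) -> float:
    """Probability that the event identified by `key` occurs, given 0/1 features."""
    weights = params[key]  # feature ID 203 -> weight 204
    z = sum(weight * features.get(feature_id, 0) for feature_id, weight in weights.items())
    return sigmoid(z)


# Hypothetical parameters for one local prediction model.
params = {("Site1", "E1"): {"F1": 2.0, "F2": -1.0}}

p_no_features = predict_probability(params, ("Site1", "E1"), {})
p_f1_active = predict_probability(params, ("Site1", "E1"), {"F1": 1})
```

With no active features the score is 0 and the probability is 0.5; activating `F1` raises the score to 2.0 and the probability with it.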

FIG. 4 is an example of the model prediction accuracy data 300 stored in the model storage unit 35. The model prediction accuracy data 300 includes a site ID 301, an event ID 302, and a prediction accuracy 303. The site ID 301 and the event ID 302 are identifiers for the local site and the event, respectively. As described in detail later, the prediction accuracy 303 represents the prediction accuracy of the local prediction model, calculated from the event sequence.

FIG. 5 is an example of the event graph data 400 stored in the parameter storage unit 25 and the case storage unit 36. The event graph data 400 is the edge information of an event graph, which represents the process by which events occur, and includes a graph ID 401, a parent node ID 402, a child node ID 403, a link type 404, and a weight 405. The graph ID 401 is an identifier for the event graph. The parent node ID 402 and the child node ID 403 are identifiers for nodes. Because event occurrence is traced backward, an edge means that the event of the child node ID 403 gives rise to the event of the parent node ID 402; each parent node ID 402 and child node ID 403 is defined by a node event 500 in FIG. 6. The link type 404 is set based on how the edge was derived (the state transition from the event of the child node ID 403 to the event of the parent node ID 402). The weight 405 is used in the prediction by the local prediction model. Details of the event graph data 400 are given later, for example in the description of the operation of the event analysis system.

FIG. 6 is an example of the node events 500, which serve as the nodes of an event graph and are stored in the parameter storage unit 25 and the case storage unit 36. A node event 500 is data that associates a node of the event graph with an event, and includes a node ID 501, a site ID 502, a time 503, and an event ID 504. The node ID 501 corresponds to the parent node ID 402 or the child node ID 403 of the event graph data 400, and the site ID 502, time 503, and event ID 504 correspond to the site ID 101, time 102, and event ID 103 of the event sequence 100.
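The two tables can be pictured as plain records, with a graph recovered by following parent-to-child edges. The Python sketch below is illustrative only: the class and function names, the breadth-first walk, and the placeholder link type `"typeA"` are assumptions, since the actual link types and the graph construction procedure are described later in the document.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NodeEvent:  # one row of node event 500
    node_id: str
    site_id: str
    time: str
    event_id: str


@dataclass(frozen=True)
class Edge:  # one row of event graph data 400
    graph_id: str
    parent_node_id: str
    child_node_id: str
    link_type: str  # placeholder values; real link types are defined elsewhere
    weight: float


def children_by_parent(edges: List[Edge], graph_id: str) -> Dict[str, List[str]]:
    """Index the edges of one graph as parent node ID -> child node IDs."""
    index: Dict[str, List[str]] = {}
    for edge in edges:
        if edge.graph_id == graph_id:
            index.setdefault(edge.parent_node_id, []).append(edge.child_node_id)
    return index


def trace_back(edges: List[Edge], graph_id: str, start: str) -> List[str]:
    """Visit nodes breadth-first from `start` toward the events that led to it."""
    index = children_by_parent(edges, graph_id)
    visited, order, queue = {start}, [start], deque([start])
    while queue:
        for child in index.get(queue.popleft(), []):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append(child)
    return order


edges = [
    Edge("G1", "N1", "N2", "typeA", 0.8),
    Edge("G1", "N1", "N3", "typeA", 0.5),
    Edge("G1", "N2", "N4", "typeA", 0.3),
    Edge("G2", "N1", "N9", "typeA", 0.1),
]
visited_nodes = trace_back(edges, "G1", "N1")
```

Each visited node ID would then be resolved to its site, time, and event ID through the corresponding `NodeEvent` record.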

FIG. 7 is an example of the case data 600 stored in the parameter storage unit 25 and the case storage unit 36. The case data 600 is data that maintenance personnel or engineers enter via the operation terminal 4 after finding, analyzing, and resolving a failure, that is, data about a failure that has occurred, and includes a case ID 601, a site ID 602, a graph ID 603, and content data 604. The case ID 601, site ID 602, and graph ID 603 are identifiers for the case data, the local site, and the event graph, respectively. Case data is specified by a pair of case ID 601 and site ID 602. The content data 604 is entered by maintenance personnel or engineers and typically holds information about the situation, cause, and countermeasure, such as "[Situation] The process on machine 2 terminated abnormally. [Cause] The performance of the added machine exceeded that of the HUB, causing packet loss. [Countermeasure] Replaced the HUB with the latest model." The format, including these input items, is changed as appropriate for the target system. Because the graph ID 603 represents the event graph corresponding to the content data 604, the event graph identified by the graph ID 603 serves as the key with which case data is retrieved from an event sequence of the monitoring target system 1 in operation (an event sequence leading to the failure that corresponds to the event graph identified by the graph ID 603).

FIG. 8 is an example of the feature function definitions 700 stored in the model storage unit 35. The feature function definitions 700 are definition data for the feature functions that generate the feature values input to the local prediction model, and include a feature ID 701 and a feature function definition 702. The feature ID 701 is an identifier for the definition held in the feature function definition 702.

In this embodiment, the occurrence of a specific event and the order or timing in which a specific sequence of events occurs are used as feature values. In general, a feature function is a function that returns a numerical value (1 or 0) for an event sequence given as its argument (input). Definitions other than those above are possible: for example, a function that simply tests for the occurrence of a composite event, or one that considers not only event IDs but also specific keywords contained in the detailed data of the events.
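The three kinds of feature named above (occurrence, order, and timing) can each be written as a function from an event sequence to 1 or 0. The Python sketch below is one possible reading; the function names, the representation of an event as a `(time in seconds, event ID)` pair, and the exact semantics of "order" and "timing" are assumptions.

```python
from typing import Callable, List, Tuple

EventSeq = List[Tuple[float, str]]  # (time in seconds, event ID); assumed layout
FeatureFn = Callable[[EventSeq], int]


def occurred(event_id: str) -> FeatureFn:
    """1 if the event appears anywhere in the sequence."""
    def feature(seq: EventSeq) -> int:
        return int(any(e == event_id for _, e in seq))
    return feature


def occurred_in_order(first: str, second: str) -> FeatureFn:
    """1 if some `first` event occurs strictly before some `second` event."""
    def feature(seq: EventSeq) -> int:
        first_times = [t for t, e in seq if e == first]
        second_times = [t for t, e in seq if e == second]
        if not first_times or not second_times:
            return 0
        return int(min(first_times) < max(second_times))
    return feature


def occurred_within(first: str, second: str, max_gap: float) -> FeatureFn:
    """1 if `second` follows `first` by more than 0 and at most `max_gap` seconds."""
    def feature(seq: EventSeq) -> int:
        first_times = [t for t, e in seq if e == first]
        second_times = [t for t, e in seq if e == second]
        return int(any(0 < tb - ta <= max_gap for ta in first_times for tb in second_times))
    return feature


seq = [(0.0, "A"), (1.5, "B"), (10.0, "C")]
feature_vector = [
    occurred("A")(seq),
    occurred_in_order("A", "B")(seq),
    occurred_within("A", "C", 2.0)(seq),
]
```

Writing each feature as a closure keeps the definition data (event IDs, time window) separate from the evaluation, which loosely mirrors storing definitions under a feature ID.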

The processing of each process, carried out through cooperation between the event monitoring device 2 and the analysis computer 3, is described below; to keep the description simple, the steps that store the data of FIGS. 2 to 8 in the respective storage units are not always stated explicitly.

FIG. 9 is a processing flowchart of the local prediction model learning process. The local prediction model learning process is executed through cooperation between the collection analysis unit 21 and the model learning unit 31.

The collection analysis unit 21 collects logs from the monitoring target system 1; these logs are time-series character string data on OS and middleware warnings and errors, command execution results, application internal state transitions, operation history, and the like (S101). The collection analysis unit 21 analyzes the collected logs, extracts an event sequence 100 in which each event is associated with the site ID 101 of its own site and given a time 102, an event ID 103, and detailed data 104 according to the content of the log (for an alarm, the error level, its ID, and so on), and stores it in the short-term event storage unit 24 (S102). The event sequence 100 stored in the short-term event storage unit 24 is deleted once its transmission to the analysis computer 3 (its storage in the event storage unit 34) is complete.

The model learning unit 31 periodically (for example, once a day) collects the event sequence 100 from the short-term event storage unit 24 of the event monitoring device 2 and stores it in the event storage unit 34 (S103). Using the event sequence 100 of the monitoring target system 1 in its normal state, the model learning unit 31 learns a local prediction model that locally predicts changes in events, creates the local prediction model parameters 200, and stores them in the model storage unit 35 (S104). This processing is described later.

The model learning unit 31 uses cross-validation or a similar method to calculate the prediction accuracy of the created local prediction model, creates the model prediction accuracy data 300, and stores it in the model storage unit 35 (S105). The model learning unit 31 transmits the created local prediction model parameters 200 and the model prediction accuracy data 300 to the event monitoring device 2. The event monitoring device 2 stores the received local prediction model parameters 200 and model prediction accuracy data 300 in the parameter storage unit 25, and the processing ends (S106).

The example above learns the model from the event sequence obtained while the monitored system 1 is in (normal) operation. In practice, the model may instead be learned from event sequences collected during pre-shipment testing of the monitored system 1 or during on-site trial operation after shipment.

図１０は、モデル学習部３１による局所予測モデルの学習（S１０４）の処理フローチャートである。ここでは、局所予測モデルとしてL1正則化付のロジスティック回帰を用いる。説明に先立って定義を説明する。 FIG. 10 is a processing flowchart of local prediction model learning (S104) by the model learning unit 31. Here, logistic regression with L1 regularization is used as the local prediction model. Prior to the explanation, the definition will be explained.

The event set E corresponds to the event sequence 100 and is written E = {E_i,t | i=1,2,3,…,I, t=1,2,3,…,T}. Here, E_i,t is an element of the event set, i denotes a pair of a site ID 101 and an event ID 103 (below, the event ID 103 stands for the pair, as in "the event whose event ID is i"), and t denotes the time 102. If no event with event ID i occurs at time t, E_i,t is associated with the value Empty. Otherwise (if the event does occur), E_i,t is associated with the record of the event sequence 100 that corresponds to the site ID 101, event ID 103, and time 102.

The subset of events from a time ts to another time te-1 is written E[ts:te] = {E_i,t | i=1,2,3,…,I, t=ts,…,te}.

The event change C(i, t) is a function that returns 1 when E_i,t is not Empty and its event ID differs from that of the latest non-Empty event E_i,t' that occurred within the preceding K time steps (back to t-K; times are discrete, as in t-1, t-2, and so on, and the same notation is used below), and returns 0 otherwise. Here, K is a constant defined in advance; in this embodiment it is set, for each event ID, to twice the average occurrence interval of E_i. Other values, such as a fixed 3 minutes, may also be used.
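The definition of C(i, t) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the original: at most one event occurs per discrete time step, the log is held as a dictionary from time to event ID, and a window containing no earlier event yields 0. The names `latest_event_before` and `event_change` are hypothetical.

```python
from typing import Dict, Optional


def latest_event_before(events: Dict[int, int], t: int, k: int) -> Optional[int]:
    """Return the ID of the most recent event at times t-1, ..., t-k, or None."""
    for u in range(t - 1, t - k - 1, -1):
        if u in events:
            return events[u]
    return None


def event_change(events: Dict[int, int], i: int, t: int, k: int) -> int:
    """C(i, t): 1 if event i occurs at t and the latest earlier event differs."""
    if events.get(t) != i:
        return 0  # E_i,t is Empty
    previous = latest_event_before(events, t, k)
    if previous is None:
        return 0  # assumption: an empty window counts as "no change"
    return 1 if previous != i else 0


# Hypothetical log: time step -> event ID.
log = {1: 7, 3: 7, 4: 9, 9: 7}
changes = [event_change(log, log[t], t, k=3) for t in sorted(log)]
```

Here the event with ID 9 at time 4 follows an event with ID 7 at time 3, so it is the only entry for which the sketch reports a change.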

The feature vector Φ(E[ts:te]) represents the features of E[ts:te] and is written Φ(E[ts:te]) = [φ_1(E[ts:te]), …, φ_k(E[ts:te]), …, φ_K(E[ts:te])]. Here, φ_k(·) is a feature function, and φ_k(E[ts:te]) is the k-th feature of E[ts:te].
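As an illustration of how such a feature vector might be assembled, the sketch below builds Φ from a list of feature functions. It borrows the two kinds of feature that the description itself gives as examples (a single event occurring; one event following another within a fixed delay). The window representation as (time, event ID) pairs and the names `occurred`, `followed_within`, and `feature_vector` are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

Event = Tuple[float, int]            # (time, event ID); an assumed representation
Window = Sequence[Event]             # stands in for E[ts:te]
FeatureFn = Callable[[Window], float]


def occurred(event_id: int) -> FeatureFn:
    """Single-event feature: 1 if the event appears in the window, else 0."""
    return lambda window: 1.0 if any(e == event_id for _, e in window) else 0.0


def followed_within(first_id: int, second_id: int, max_gap: float) -> FeatureFn:
    """Order feature: 1 if second_id occurs within max_gap after first_id."""
    def phi(window: Window) -> float:
        for t1, e1 in window:
            if e1 != first_id:
                continue
            for t2, e2 in window:
                if e2 == second_id and 0 < t2 - t1 <= max_gap:
                    return 1.0
        return 0.0
    return phi


def feature_vector(window: Window, functions: Sequence[FeatureFn]) -> List[float]:
    """Phi(E[ts:te]) = [phi_1(E[ts:te]), ..., phi_K(E[ts:te])]."""
    return [phi(window) for phi in functions]


window = [(0.0, 44), (1.5, 67), (5.0, 108)]
functions = [occurred(44), occurred(99),
             followed_within(44, 67, 2.0), followed_within(44, 108, 2.0)]
phi_values = feature_vector(window, functions)
```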

The weight vector W_i is the weight vector of the local prediction model for event ID i and is written W_i = [W_i,0, W_i,1, …, W_i,K].

The learning process for the local prediction model is described below using these definitions. The model learning unit 31 sets the learning-iteration counter cnt to 0 and initializes each W_i with random numbers (S201). The model learning unit 31 sets the time variable t to a randomly chosen value from 1 to T (S202). The model learning unit 31 calculates the feature vector Φ(E[t-τ:t]) (S203). Here, τ is a constant, set for each event ID to the average occurrence interval of E_i. The model learning unit 31 assigns 0 to the variable i, which corresponds to the event ID (S204).

The model learning unit 31 updates the weight vector W_i of the local prediction model for event i, which is expressed as an L1-regularized logistic regression model whose input is the feature vector Φ(E[t-τ:t]) and whose output is C(i, t) (S205). The feature vector Φ(E[t-τ:t]) is generated using the definitions 702 of the feature functions φ_k registered in the feature function definitions 700. Concretely, two kinds of features are used, for example. One is based on a single event: 1 if an event with a given event ID has occurred, and 0 otherwise. The other is based on several events and their order: 1 if an event with another event ID occurs within 2 seconds after an event with a given event ID, and 0 otherwise. The weight vector W_i is updated with Forward Backward Splitting, which determines the final weight vector by taking the weight vector updated by a gradient or subgradient method and then optimizing it against the regularization term. This yields a sparse weight vector W_i.
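One way to picture the update in S205 is the sketch below: a single Forward Backward Splitting step for L1-regularized logistic regression, with a gradient step on the logistic loss followed by soft-thresholding. The learning rate `eta`, the regularization strength `lam`, the treatment of W_i,0 as an unpenalized bias term, and all function names are assumptions of this example rather than details given in the description.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w: float, amount: float) -> float:
    """Proximal operator of the L1 penalty: shrink w toward zero by amount."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0


def fobos_update(w: List[float], phi: List[float], c: int,
                 eta: float = 0.1, lam: float = 0.05) -> List[float]:
    """One update of W_i from the features phi and the label c = C(i, t).

    w[0] is treated as a bias (assumption); w[1:] pair with phi.
    """
    x = [1.0] + list(phi)
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    # Forward step: gradient descent on the logistic loss.
    half = [wj - eta * (p - c) * xj for wj, xj in zip(w, x)]
    # Backward step: soft-threshold every weight except the bias.
    return [half[0]] + [soft_threshold(h, eta * lam) for h in half[1:]]


w_next = fobos_update([0.0, 0.0, 0.0], [1.0, 0.0], c=1)
```

The soft-thresholding step sets small weights to exactly zero, which is what produces a sparse weight vector.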

モデル学習部３１は、変数iを1インクリメントする（S２０６）。モデル学習部３１は、すべてのiについてS２０５〜S２０６を実行したかを判定し、実行した場合はS２０８に移り、そうでない場合は、S２０５に戻る（S２０７）。 The model learning unit 31 increments the variable i by 1 (S206). The model learning unit 31 determines whether S205 to S206 have been executed for all i, and if executed, moves to S208, otherwise returns to S205 (S207).

S２０７において、すべてのiについて実行した場合、モデル学習部３１はカウンタcntを1インクリメントする（S２０８）。モデル学習部３１は、cntが予め定められた回数N未満かを判定し、未満の場合、S２０２に戻り、そうでない（以上の）場合は、処理を終了する（S２０９）。 In S207, when all i are executed, the model learning unit 31 increments the counter cnt by 1 (S208). The model learning unit 31 determines whether cnt is less than a predetermined number N. If it is less, the process returns to S202, and if not (or more), the process ends (S209).
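The control flow of S201 to S209 can be sketched as a simple loop. The feature extraction, the label function C(i, t), and the per-model update are passed in as callables so that the sketch stays independent of any particular model. The function name, its parameters, and the initialization range are assumptions of this example.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def train_local_models(
    event_ids: Sequence[int],
    horizon: int,                                     # T
    n_rounds: int,                                    # N
    n_features: int,                                  # number of feature functions
    features: Callable[[int], Vector],                # t -> Phi(E[t-tau:t])
    label: Callable[[int, int], int],                 # (i, t) -> C(i, t)
    update: Callable[[Vector, Vector, int], Vector],  # one weight update
    seed: int = 0,
) -> Dict[int, Vector]:
    rng = random.Random(seed)
    # S201: reset the counter and initialise every W_i with random numbers.
    cnt = 0
    weights = {i: [rng.uniform(-0.01, 0.01) for _ in range(n_features + 1)]
               for i in event_ids}
    while cnt < n_rounds:                             # S209
        t = rng.randint(1, horizon)                   # S202
        phi = features(t)                             # S203
        for i in event_ids:                           # S204, S206, S207
            weights[i] = update(weights[i], phi, label(i, t))  # S205
        cnt += 1                                      # S208
    return weights


# Toy run with a do-nothing update, only to show the call pattern.
update_calls = []


def record_update(w: Vector, phi: Vector, c: int) -> Vector:
    update_calls.append(c)
    return w


models = train_local_models([1, 2, 3], horizon=10, n_rounds=5, n_features=2,
                            features=lambda t: [0.0, 0.0],
                            label=lambda i, t: 0,
                            update=record_update)
```

Each round samples one time t and updates every event's model once, so N rounds over I event IDs perform N × I updates.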

説明した局所予測モデルは、イベントの明示的な因果関係や原因を扱う必要がなく、モデル学習部３１が、イベント監視装置２を介して監視対象システム１から取得したイベント列を用いるので、対象とする監視対象システム１に関して知識をほとんど持たない人でも、局所予測モデルの構築が可能である。また、重みベクトルのスパース化によって後述の処理で生成されるイベントグラフが簡素化できる。この結果、保守員やエンジニアにわかりやすいイベントグラフを提示できる。また、事例検索のためにイベントグラフとイベント列の相関を求めるための計算量も小さくできる。 The described local prediction model does not need to deal with an explicit causal relationship or cause of an event, and the model learning unit 31 uses an event sequence acquired from the monitoring target system 1 via the event monitoring device 2. Even a person who has little knowledge about the monitoring target system 1 can construct a local prediction model. In addition, the event graph generated by the process described later can be simplified by making the weight vector sparse. As a result, an easy-to-understand event graph can be presented to maintenance personnel and engineers. In addition, it is possible to reduce the calculation amount for obtaining the correlation between the event graph and the event string for the case search.

The above is an example using a logistic regression model. To handle multi-valued changes, a log-linear model can be used over feature functions expressed as combinations of the input features described above and the multi-valued event changes. Other prediction models may also be used, for example discriminative models such as CW (Confidence-Weighted) and AROW (Adaptive Regularization of Weight Vectors), or models capable of structured prediction such as CRF (Conditional Random Fields). Likewise, the model may be updated with a plain gradient or subgradient method alone, or with some other method.

図１１は、障害分析支援プロセスの処理フローチャートである。障害分析支援プロセスの処理は、収集解析部２１、異常検知部２２、イベントグラフ作成部３２、および操作管理部３３の連携により実行される。障害分析支援プロセスの処理は、新しいイベントに応じて又は定期的に実行される。収集解析部２１は、前述の処理（S１０１、S１０２）と同様に、監視対象システム１からログを収集し、最新のイベント列１００を短期イベント記憶部２４に格納する。ここで、説明を簡単にするために、格納された（観測された）イベント列１００をイベントID＝１００、時刻tとする。異常検知部２２が、時刻t-1に観測したイベントをもとに、局所予測モデルを用いて時刻tのイベントを予測する（S３０１）。 FIG. 11 is a process flowchart of the failure analysis support process. The processing of the failure analysis support process is executed by cooperation of the collection analysis unit 21, the abnormality detection unit 22, the event graph creation unit 32, and the operation management unit 33. The processing of the failure analysis support process is executed in response to a new event or periodically. The collection analysis unit 21 collects logs from the monitoring target system 1 and stores the latest event sequence 100 in the short-term event storage unit 24 as in the above-described processing (S101, S102). Here, to simplify the explanation, it is assumed that the stored (observed) event string 100 is event ID = 100 and time t. The anomaly detector 22 predicts an event at time t using a local prediction model based on the event observed at time t−1 (S301).

異常検知部２２は、予測した時刻ｔのイベントの予測結果と観測した時刻ｔのイベントID=100（観測結果）が一致しているか否かを判定する。一致していれば処理を終了し、一致していなければS３０３に移る（S３０２）。予測結果と観測結果が一致していない場合、異常検知部２２は、分析用計算機３に監視対象システム１の異常を通知する（S３０３）。イベントグラフ作成部３２は、異常を通知したイベント監視装置２から最新のイベント列１００を収集する（S３０４）。最新のイベント列１００がイベント記憶部３４にすでに格納されている場合は、イベントグラフ作成部３２はイベント記憶部３４から最新のイベント列１００を収集する。 The abnormality detection unit 22 determines whether or not the predicted result of the event at the predicted time t matches the event ID = 100 (observed result) at the observed time t. If they match, the process ends. If they do not match, the process proceeds to S303 (S302). When the prediction result and the observation result do not match, the abnormality detection unit 22 notifies the analysis computer 3 of the abnormality of the monitoring target system 1 (S303). The event graph creation unit 32 collects the latest event sequence 100 from the event monitoring device 2 that has notified the abnormality (S304). When the latest event sequence 100 is already stored in the event storage unit 34, the event graph creation unit 32 collects the latest event sequence 100 from the event storage unit 34.
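The check in S301 to S303 might look like the following sketch. It assumes that each event ID has a logistic model whose output is read as the probability that the event occurs next, and that an observed event is treated as unexpected when that probability falls below a cutoff. The cutoff of 0.5, the handling of events without a model, and the function names are assumptions of this example.

```python
import math
from typing import Dict, List


def predict_probability(w: List[float], phi: List[float]) -> float:
    """Logistic model output; w[0] is taken to be the bias term (assumption)."""
    z = w[0] + sum(wk * xk for wk, xk in zip(w[1:], phi))
    return 1.0 / (1.0 + math.exp(-z))


def is_anomalous(models: Dict[int, List[float]], phi_prev: List[float],
                 observed_id: int, cutoff: float = 0.5) -> bool:
    """True if the event observed at time t was not predicted from time t-1."""
    w = models.get(observed_id)
    if w is None:
        return True  # assumption: an event with no learned model is unexpected
    return predict_probability(w, phi_prev) < cutoff


# Hypothetical model for event ID 100 with a single feature.
models = {100: [-2.0, 4.0]}
expected_case = is_anomalous(models, [1.0], observed_id=100)
unexpected_case = is_anomalous(models, [0.0], observed_id=100)
```

In this toy setting, event 100 is predicted when its one feature is active and is flagged as an abnormality when it appears without that feature.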

Starting from the event immediately preceding the observed event for which the abnormality was notified (the event at time t in the example above), that is, the event observed at time t-1, the event graph creation unit 32 traces events back in time (backtraces) to extract related events, and creates an event graph by linking the extracted events (S305). The details of this processing are described later. The operation management unit 33 transmits a notification of the abnormality and the created event graph to the operation terminal 4. The operation terminal 4 displays the received content, and the failure analysis support process ends (S306).

In the above description, the abnormality detection unit 22 detects an abnormality from the deviation between the prediction result and the observation result. This may be combined with detection of events that can be clearly judged to be abnormalities leading to a failure, or with abnormality detection by other methods such as one-class SVM (Support Vector Machine) or SVM.

FIG. 12 is a processing flowchart of event graph creation (S305 in FIG. 11) by the event graph creation unit 32. The event graph creation unit 32 stores the node event 500 corresponding to the event for which the abnormality was notified in the case storage unit 36. It then creates event graph data 400 with the parent node ID 402 set to "Empty", the child node ID 403 set to the node ID 501 of the node event 500 stored in the case storage unit 36, the connection type 404 set to "none", and the weight 405 set to "Empty", and stores it in the case storage unit 36 (S401).

The event graph creation unit 32 extracts, from the event sequence 100 in the event storage unit 34, the events that occurred within M time steps before the notified event, creates node events 500 for them, and registers them in the case storage unit 36. The event graph creation unit 32 then pushes tuples of the node ID of the notified event (parent node) and the node ID of each extracted event (child node) onto a stack (last-in, first-out). The stack is provided in the work area (working storage area) of the analysis computer 3. If several events occurred within M time steps before, each is pushed onto the stack separately (S402). In this embodiment, M is the average occurrence interval of the notified event. M may be changed freely to suit the target, for example to cover up to one minute before.

The event graph creation unit 32 determines whether the stack is empty; if it is empty, this processing ends, and otherwise the processing proceeds to S404 (S403). The event graph creation unit 32 pops a tuple of a parent node and a child node from the stack (S404), creates event graph data 400 corresponding to the popped parent node and child node, and registers it in the case storage unit 36. If the parent node ID 402 of the event graph data 400 is the notified node, "abnormal transition" is registered in the connection type 404 and "Empty" in the weight 405. If it is not the notified node, "normal transition" is registered in the connection type 404, and the weight 405 is set to the weight for the child node's event ID in the local prediction model of the parent node's event ID (S405).

The event graph creation unit 32 determines whether the prediction accuracy 303 of the local prediction model for the popped child node is below a preset threshold α (for example, α = 0.6); if it is below α, the processing returns to S403, and otherwise it proceeds to S407 (S406). If the prediction accuracy is not below α, the event graph creation unit 32 determines whether the local prediction model's prediction for the child node is consistent with the observation result (the child node), that is, whether it is a True-Positive or a False-Negative. If the two are consistent, the processing proceeds to S408; otherwise it proceeds to S409 (S407).

If the prediction result and the observation result are consistent, the event graph creation unit 32 builds the event graph recursively: it takes the observation result (child node) as a new parent node, takes as new child nodes the events used by feature functions whose weight in the local prediction model is at or above a certain value (nonzero, if sparsification has succeeded), that is, the events that contribute to the prediction, pushes them onto the stack, and returns to S403 (S408).

If the prediction result and the observation result diverge, the event graph creation unit 32 extracts the events that occurred within M time steps before the observation result (child node), creates node events 500, and registers them in the case storage unit 36. The event graph creation unit 32 creates event graph data 400 with the observation result as a new parent node and the nodes corresponding to the registered node events as new child nodes, and stores it in the case storage unit 36. In doing so, it sets the connection type 404 of the event graph data 400 to "abnormal transition" and the weight 405 to "Empty". The processing then returns to S403 (S409).
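The stack-driven backtrace of S401 to S409 can be summarized in a short sketch. Storage in the case storage unit is replaced here by a returned list of edges, and the lookups the flow relies on (preceding events, contributing events, prediction consistency, prediction accuracy, model weight) are supplied as callables. The function name, its parameters, and the `seen` set that guards against pushing the same pair twice are assumptions of this example.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Node = Hashable
Edge = Tuple[Optional[Node], Node, str, Optional[float]]


def build_event_graph(
    root: Node,
    preceding: Callable[[Node], Iterable[Node]],     # events within M before a node
    contributors: Callable[[Node], Iterable[Node]],  # events with non-zero weight
    matches: Callable[[Node], bool],                 # prediction agrees with observation
    accuracy: Callable[[Node], float],               # prediction accuracy of the model
    weight: Callable[[Node, Node], float],           # model weight for (parent, child)
    alpha: float = 0.6,
) -> List[Edge]:
    edges: List[Edge] = [(None, root, "none", None)]         # S401
    stack = [(root, child) for child in preceding(root)]     # S402
    seen = set(stack)
    while stack:                                             # S403
        parent, child = stack.pop()                          # S404
        if parent == root:                                   # S405
            edges.append((parent, child, "abnormal transition", None))
        else:
            edges.append((parent, child, "normal transition",
                          weight(parent, child)))
        if accuracy(child) < alpha:                          # S406
            continue
        if matches(child):                                   # S407
            for nxt in contributors(child):                  # S408
                if (child, nxt) not in seen:
                    seen.add((child, nxt))
                    stack.append((child, nxt))
        else:                                                # S409
            for prev in preceding(child):
                edges.append((child, prev, "abnormal transition", None))
    return edges


# Hypothetical toy data: node names stand in for observed events.
before = {"A": ["B"], "D": ["E"]}
contrib = {"B": ["D"]}
agrees = {"B": True, "D": False}
graph = build_event_graph(
    "A",
    preceding=lambda n: before.get(n, []),
    contributors=lambda n: contrib.get(n, []),
    matches=lambda n: agrees.get(n, True),
    accuracy=lambda n: 0.9,
    weight=lambda p, c: 0.7,
)
```

In the toy data, the notified event A leads abnormally to B, B is explained normally by D, and D itself was not predicted, so the events just before D are attached as a further abnormal transition.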

以上のように、局所予測モデルに基づいて、あるイベントの発生過程を分析、抽出することができる。 As described above, it is possible to analyze and extract the occurrence process of a certain event based on the local prediction model.

FIG. 13 is a processing flowchart of the case registration process. The case registration process is executed through cooperation between the operation management unit 33 and the case search unit 23, after maintenance personnel or engineers have used the failure analysis support process described above to identify the cause of a failure and take countermeasures. The maintenance personnel or engineers enter the case ID, the site ID, the identified cause, and the countermeasures taken through the operation terminal 4 (S501).

保守員やエンジニアが、操作端末４を用いて、障害に関係するイベントグラフを、入力した原因および対策に対応付ける。操作端末４は、イベントグラフと原因および対策などが対応付けられた事例データ６００を操作管理部３３に送信する。操作管理部３３は受信した事例データ６００を事例記憶部３６に格納する（S５０２）。 The maintenance staff or engineer uses the operation terminal 4 to associate the event graph related to the failure with the input cause and countermeasure. The operation terminal 4 transmits the case data 600 in which the event graph is associated with the cause and the countermeasure to the operation management unit 33. The operation management unit 33 stores the received case data 600 in the case storage unit 36 (S502).

The case storage unit 36 distributes to the case search unit 23 of the event monitoring device 2 the stored case data 600, the event graph data 400 referenced by the graph ID 603 of the case data 600, and the node events 500 referenced by the parent node ID 402 and child node ID 403 of that event graph data 400. The case search unit 23 stores the received case data 600, event graph data 400, and node events 500 in the parameter storage unit 25, and the processing ends (S503).

When a case from one site is distributed to another site, events that do not exist in terms of system configuration or operation are removed from the event graph beforehand.

The example above registers cases for failures that occurred during operation. In practice, failures that occurred during pre-shipment testing of the monitored system 1 or during on-site trial operation after shipment may be registered in the same way.

図１４は、事例提示プロセスの処理フローチャートである。事例提示プロセスの処理は、事例検索部２３および操作管理部３３の連携によって定期的に実行される。短期イベント記憶部２４には、最新のイベント列が格納されている状態を前提として、事例提示プロセスの処理を説明する。 FIG. 14 is a process flowchart of the case presentation process. The processing of the case presentation process is periodically executed by the cooperation of the case search unit 23 and the operation management unit 33. The case presentation process will be described on the assumption that the short-term event storage unit 24 stores the latest event sequence.

The case search unit 23 acquires from the short-term event storage unit 24 the event sequence 100 from τ time steps before the current time t up to t (t-τ to t). The case search unit 23 then calculates the correlation (or similarity, or distance) between the event sequence 100 and the event graph data 400 and node events 500 stored in the parameter storage unit 25 (S601). For example, since the order in which the events of the event sequence 100 occur is often important, Spearman's rank correlation coefficient, which can reflect the order of occurrence, can be used. If the detailed data 104 of the event sequence 100 is to be taken into account, an event graph can be generated, by the event graph creation processing described above, from the events near the latest time in the event sequence 100 acquired from the short-term event storage unit 24, and similarity can be measured with a graph kernel (tree kernel) between event graphs. Using an event graph built from the observed event sequence as the key makes it possible to search, from the event sequence of the running system, for cases that correspond to similar event graphs.

事例検索部２３が、相関が予め設定した閾値γ（例えば、γ＝0.8）以上となる事例があるかを判定し、該当する事例ある場合にはS６０３に移り、そうでなければ処理を終了する（S６０２）。 The case search unit 23 determines whether there is a case where the correlation is equal to or greater than a preset threshold value γ (for example, γ = 0.8). If there is a case, the process proceeds to S603, otherwise the process ends. (S602).
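A minimal sketch of S601 and S602 using Spearman's rank correlation follows. It assumes that each stored case can be reduced to an ordered list of event IDs, ranks the shared event IDs by their first occurrence on each side, and uses the tie-free formula ρ = 1 − 6Σd² / (n(n² − 1)). The reduction of an event graph to a sequence, the zero score for fewer than two shared events, and the function names are assumptions of this example.

```python
from typing import Dict, List, Sequence, Tuple


def spearman_order_correlation(observed: Sequence[int],
                               case: Sequence[int]) -> float:
    """Spearman's rho between the first-occurrence orders of shared event IDs."""
    case_ids = set(case)
    shared = [e for e in dict.fromkeys(observed) if e in case_ids]
    n = len(shared)
    if n < 2:
        return 0.0  # assumption: too little overlap counts as no correlation
    shared_ids = set(shared)
    case_order = [e for e in dict.fromkeys(case) if e in shared_ids]
    case_rank = {e: r for r, e in enumerate(case_order)}
    d2 = sum((r - case_rank[e]) ** 2 for r, e in enumerate(shared))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def matching_cases(observed: Sequence[int],
                   cases: Dict[str, Sequence[int]],
                   gamma: float = 0.8) -> List[Tuple[str, float]]:
    """Return (case ID, correlation) for every case at or above gamma."""
    scored = [(cid, spearman_order_correlation(observed, seq))
              for cid, seq in cases.items()]
    return [(cid, rho) for cid, rho in scored if rho >= gamma]


# Hypothetical cases keyed by case ID.
cases = {"case-a": [44, 67, 108], "case-b": [108, 67, 44]}
hits = matching_cases([44, 67, 108], cases)
```

Only the case whose events occur in the same order as the observed sequence passes the threshold; the case with the reversed order scores −1.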

If there is a case whose correlation with the event graph is at or above the threshold γ, the case search unit 23 notifies the operation management unit 33 of the analysis computer 3 of that case data 600. The operation management unit 33 transmits to the operation terminal 4 the notified case data 600, the corresponding event graph data 400, and the node events 500 corresponding to the parent node ID 402 and child node ID 403 of the event graph data 400. The operation terminal 4 displays the received data (notifying the maintenance personnel or engineers), and the processing ends (S603).

以上のように、イベントグラフを使うことで稼動している監視対象システム１のイベント列からリアルタイムで事例の検索が可能になる。さらに、イベントグラフは、障害が起きた際のイベントからバックトレースすることで生成しているので、最終的な障害が起きる前に事例を検索し、保守員やエンジニアに提示することも可能となる。 As described above, by using the event graph, it is possible to search for cases in real time from the event sequence of the monitoring target system 1 that is operating. In addition, since the event graph is generated by back-tracing from the event at the time of the failure, it is possible to search for cases before the final failure occurs and present it to maintenance personnel and engineers. .

なお、本実施形態ではリアルタイムで事例を検索するためにイベントグラフを用いる方法を説明したが、イベントグラフ同士の類似性から事例を分類し、体系化して保守員やエンジニアに提示することもできる。 In the present embodiment, the method of using an event graph to search for cases in real time has been described. However, cases can be classified based on the similarity between event graphs, organized, and presented to maintenance personnel and engineers.

図１５は、イベントグラフ表示画面１１００の例である。イベントグラフ表示画面１１００は、イベント監視装置２の異常検知部２２が検出した異常に関連するイベントIDとイベントグラフを操作端末４の画面に表示したものである。イベントグラフ表示画面１１００は、サイト表示ボックス１１０１、日時表示ボックス１１０２、異常イベントID表示ボックス１１０３、およびイベントグラフモニタ１１０４を有する。 FIG. 15 is an example of the event graph display screen 1100. The event graph display screen 1100 displays the event ID and event graph related to the abnormality detected by the abnormality detection unit 22 of the event monitoring device 2 on the screen of the operation terminal 4. The event graph display screen 1100 includes a site display box 1101, a date / time display box 1102, an abnormal event ID display box 1103, and an event graph monitor 1104.

サイト表示ボックス１１０１には、異常が発生した監視対象システム１のサイトを表示する。日時表示ボックス１１０２には、異常が発生した時刻を表示する。異常イベントID表示ボックス１１０３には通知されたイベントIDを表示する。イベントグラフモニタ１１０４には、イベントグラフデータ４００に基づいて、異常イベントID表示ボックス１１０３に表示したイベント１１０４aをルートノードとし、子孫ノードに対応するイベントIDを左側に配置するツリー状の有向グラフを表示する。また、イベントグラフモニタ１１０４の下部には、表示された各ノードに対応するイベントの発生時刻が表示される。 The site display box 1101 displays the site of the monitoring target system 1 in which an abnormality has occurred. The date and time display box 1102 displays the time when the abnormality occurred. The abnormal event ID display box 1103 displays the notified event ID. The event graph monitor 1104 displays, based on the event graph data 400, a tree-shaped directed graph in which the event 1104a displayed in the abnormal event ID display box 1103 is a root node, and event IDs corresponding to descendant nodes are arranged on the left side. . In addition, at the lower part of the event graph monitor 1104, the event occurrence time corresponding to each displayed node is displayed.

As in the example shown in FIG. 15, one event may contribute to the occurrence of two other events. In other words, a node may have two parent nodes; in the figure, event ID 44 has event ID 67 and event ID 108 as parent nodes. Strictly speaking, therefore, the graph is not a tree in which the nodes form a partially ordered set.

The display method also changes with the connection type 404 of the event graph data 400, as follows. When the connection type 404 is "abnormal transition", sibling nodes that share the same parent node are enclosed in a broken-line frame 1104b, and the parent node is connected to the broken-line frame 1104b by an edge 1104c. In addition, "!" (1104c) is displayed on the edge to make clear that it is an "abnormal transition". When the connection type 404 is "normal transition", the parent node and the child node are connected directly by an edge. In addition, "○" (1104d) is displayed on the edge to make clear that it is a "normal transition". The weight (1104e) of the corresponding local prediction model is also displayed, based on the weight 204 of the local prediction model parameters 200. When normal transitions continue over many levels, the intermediate nodes and edges may be omitted from the display. Displaying the graph in this way lets maintenance personnel and engineers visually grasp the events that show how a failure developed. "!" and "○" are only examples of symbols; any visually recognizable indication will do, and other symbols or character strings may be used instead.

FIG. 16 is an example of the case data display screen 1200. The case data display screen 1200 shows, on the operation terminal 4, the case data that the case search unit 23 of the event monitoring device 2 retrieved, stamped with the search time, and reported to the analysis computer 3. The case data display screen 1200 includes a site display box 1201, a date and time display box 1202, a case list 1203, an operation status list 1204, and a case event graph monitor 1205.

サイト表示ボックス１２０１には、分析用計算機３に通知された事例データ６００のサイトID６０１を表示する。日時表示ボックス１２０２には、通知された事例データ６００の検索時刻を表示する。事例リスト１２０３には通知された事例データ６００の事例ID６０１、相関（事例データを検索する際のイベントグラフ間の相関）、および内容データ６０４を表示する。稼働状況リスト１２０４には、サイト表示ボックス１２０１に表示したサイトで発生した最近のイベント列１００の時刻１０２、イベントID１０３、および詳細データ１０４を表示する。事例イベントグラフモニタ１２０５には、事例リスト１２０３で選択された事例データ６００のイベントグラフを表示する。 In the site display box 1201, the site ID 601 of the case data 600 notified to the analysis computer 3 is displayed. The date / time display box 1202 displays the search time of the notified case data 600. In the case list 1203, the case ID 601 of the notified case data 600, correlation (correlation between event graphs when searching for case data), and content data 604 are displayed. In the operation status list 1204, the time 102, event ID 103, and detailed data 104 of the recent event sequence 100 that occurred in the site displayed in the site display box 1201 are displayed. The case event graph monitor 1205 displays an event graph of the case data 600 selected in the case list 1203.

When an event ID displayed in the operation status list 1204 also appears as a node of the event graph, that node is highlighted (displayed differently from the others, for example by changing its display color). Displaying case data in this way lets maintenance personnel and engineers identify cases that are close to the event sequence that has occurred.

The processing description above does not state that all of the data shown in the display screen examples of FIG. 15 and FIG. 16 is transmitted to the operation terminal 4, whether from the event monitoring device 2 via the analysis computer 3 or from the analysis computer 3 itself; these details were omitted to keep the description concise.

As described above, according to the present embodiment, the causes of events need not be defined in advance. A local prediction model can be built from the event sequence acquired from the monitored system, and using the built model reduces the time and cost required to put systems for failure detection, analysis, and the like into operation or to modify them.
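As a rough illustration of building a per-event model from nothing but an observed event sequence, the following Python sketch estimates, for each target event, how often each preceding event is followed by it within a time window. This frequency-based estimate is a stand-in chosen for brevity, not the learning method of the embodiment; the function name `learn_local_models`, the `window` parameter, and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def learn_local_models(events, window):
    """events: list of (time, event_id) sorted by time.

    Returns {target_id: {predecessor_id: weight}}, where weight is the
    fraction of occurrences of predecessor_id that were followed by
    target_id within `window` time units (an assumed, simplified model).
    """
    occurrences = Counter(event_id for _, event_id in events)
    followed = defaultdict(Counter)
    for i, (t_i, pred) in enumerate(events):
        seen = set()  # count each target at most once per predecessor occurrence
        for t_j, target in events[i + 1:]:
            if t_j - t_i > window:
                break
            if target not in seen:
                followed[target][pred] += 1
                seen.add(target)
    return {
        target: {pred: n / occurrences[pred] for pred, n in preds.items()}
        for target, preds in followed.items()
    }


# Hypothetical log: "A" is usually followed by "B" within 2 time units.
sample = [(0, "A"), (1, "B"), (10, "A"), (11, "B"), (20, "A"), (30, "C")]
models = learn_local_models(sample, window=2)
```

No cause-and-effect definitions are supplied by hand here; the weights come only from the observed sequence, which is the property the paragraph above emphasizes.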

Further, according to the present embodiment, anomalies can be detected in real time from the logs of the running monitored system, and cases can be searched using the generated event graph as the key, so even unskilled maintenance personnel and engineers can grasp anomalies of the monitored system and the related cases.
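Searching cases with an event graph as the key can be sketched as follows. The similarity used here is the Jaccard index over directed edges, which is sensitive to event order but is only a stand-in for the correlation used in the embodiment. The names `edge_similarity` and `search_cases`, the dictionary layout of a case, and the sample data are hypothetical.

```python
def edge_similarity(edges_a, edges_b):
    """Jaccard index of two directed edge sets (assumed similarity measure)."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search_cases(query_edges, cases, top_k=3):
    """Return up to top_k (similarity, case_id) pairs, most similar first."""
    scored = [
        (edge_similarity(query_edges, case["edges"]), case["case_id"])
        for case in cases
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


# Hypothetical event graph generated for a detected anomaly.
query = [("E1", "E2"), ("E2", "E3")]
# Hypothetical stored cases, each keyed by its own event graph.
stored = [
    {"case_id": "C1", "edges": [("E1", "E2"), ("E2", "E3")]},
    {"case_id": "C2", "edges": [("E2", "E1")]},
    {"case_id": "C3", "edges": [("E1", "E2"), ("E4", "E3")]},
]
results = search_cases(query, stored, top_k=2)
```

Because edges are directed, case C2, which contains the same events in the reverse order, scores zero against the query.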

Moreover, according to the present embodiment, even for a failure caused by interaction between devices, the local prediction model is built from the actual operation of the monitored system. The event sequence specific to the failure therefore contains little noise (events unrelated to the monitored system), and the time needed to estimate the cause of the failure can be reduced.
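The failure-specific event sequence is obtained by back-tracing from the event just before the anomaly, following the termination rules recited in claims 3 and 4. The Python sketch below is one possible reading of those rules. The scoring rule (a sum of weights compared with `threshold`), the `min_accuracy` cutoff, the `window` parameter, and the layout of `models` are all assumptions made for illustration.

```python
def back_trace(events, models, start_index, window,
               min_accuracy=0.8, threshold=0.5):
    """Return edges (cause_index, effect_index) found by back-tracing.

    events: list of (time, event_id) sorted by time.
    models: {event_id: {"accuracy": float, "weights": {event_id: float}}}
    """
    edges = set()
    visited = set()

    def preceding(idx):
        # Indices of earlier events that fall inside the time window.
        t = events[idx][0]
        return [j for j in range(idx) if t - events[j][0] <= window]

    def visit(idx):
        if idx in visited:
            return
        visited.add(idx)
        model = models.get(events[idx][1])
        if model is None or model["accuracy"] < min_accuracy:
            return  # low prediction accuracy: stop the recursion here
        prior = preceding(idx)
        weights = model["weights"]
        score = sum(weights.get(events[j][1], 0.0) for j in prior)
        if score >= threshold:
            # Prediction matches the observation: continue from each
            # event that contributed to the prediction.
            for j in prior:
                if weights.get(events[j][1], 0.0) > 0.0:
                    edges.add((j, idx))
                    visit(j)
        else:
            # Prediction does not match: keep the events immediately
            # before this one, then stop.
            for j in prior:
                edges.add((j, idx))

    visit(start_index)
    return edges


# Hypothetical data: C is explained by B, B by A, and A is unexpected.
log = [(0, "X"), (1, "A"), (2, "B"), (3, "C")]
local_models = {
    "C": {"accuracy": 0.9, "weights": {"B": 1.0}},
    "B": {"accuracy": 0.9, "weights": {"A": 1.0}},
    "A": {"accuracy": 0.9, "weights": {"Z": 1.0}},
}
trace = back_trace(log, local_models, start_index=3, window=2)
```

Events that carry no weight for a well-predicted event are never added to the graph, which is how unrelated events are kept out of the extracted sequence in this sketch.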

1: monitored system, 2: event monitoring device, 3: analysis computer, 4: operation terminal, 5: network, 11: equipment, 12: controller, 13: control computer, 14: control network, 21: collection and analysis unit, 22: anomaly detection unit, 23: case search unit, 24: short-term event storage unit, 25: parameter storage unit, 31: model learning unit, 32: event graph creation unit, 33: operation management unit, 34: event storage unit, 35: model storage unit, 36: case storage unit.

Claims

An event analysis system comprising:
a local prediction model learning process that learns in advance a local prediction model for predicting a first event based on a first event sequence obtained during operation of a monitored system;
an anomaly detection process that monitors a second event sequence of the monitored system and extracts a second event, included in the monitored second event sequence, whose observation result does not match the prediction result of the first event predicted using the local prediction model; and
a failure analysis support process that, in response to the extraction of the second event, creates a first event graph by back-tracing from a third event that is included in the monitored second event sequence, gave rise to the observation result, and occurred immediately before the second event.

The event analysis system according to claim 1, wherein the prediction using the local prediction model is made based on either the occurrence order or the occurrence timing of the first events included in the first event sequence.

The event analysis system according to claim 1, wherein, when the prediction accuracy of the local prediction model for the third event is high and the prediction result of the third event matches the observation result of the third event, the failure analysis support process recursively extracts event sequences included in the second event sequence, using as a new starting point a fourth event that is included in the second event sequence and contributed to the prediction of the third event, and generates the first event graph by combining the recursively extracted event sequences.

The event analysis system according to claim 3, wherein the failure analysis support process, when the prediction accuracy of the local prediction model for the fourth event included in the recursively extracted event sequence is high and the prediction result of the fourth event does not match the observation result of the fourth event, extracts the event sequence immediately before the fourth event and terminates the recursive extraction, and, when the prediction accuracy of the local prediction model for the fourth event is low, terminates the recursive extraction.

The event analysis system according to claim 1, wherein the local prediction model learning process sparsifies the weight vector of the local prediction model.

The event analysis system according to claim 1, further comprising a case presentation process that searches for a second event graph that has a corresponding case and is strongly correlated with the second event sequence, and displays the case corresponding to the retrieved second event graph on an operation terminal.

The event analysis system according to claim 6, wherein the case presentation process calculates the correlation using a correlation coefficient that reflects the occurrence order of the second events included in the second event sequence.

The event analysis system according to claim 6, wherein the case presentation process displays the course of the back trace of the second event sequence in a visually recognizable manner and, when a transition between events included in the second event sequence is a normal transition, attaches the weight of the local prediction model used to predict that transition and displays it in the graph.

The event analysis system according to claim 1, wherein the local prediction model used by the anomaly detection process is stored in advance in an event monitoring device installed at a site associated with the monitored system.

An event analysis method in an event analysis system of a monitored system, wherein the event analysis system:
learns in advance a local prediction model that predicts a first event based on a first event sequence obtained during operation of the monitored system;
monitors a second event sequence of the monitored system and extracts a second event, included in the monitored second event sequence, whose observation result does not match the prediction result of the first event predicted using the local prediction model; and
in response to the extraction of the second event, creates a first event graph by back-tracing from a third event that is included in the monitored second event sequence, gave rise to the observation result, and occurred immediately before the second event.

The event analysis method according to claim 10, wherein, when the prediction accuracy of the local prediction model for the third event is high and the prediction result of the third event matches the observation result of the third event, the event analysis system recursively extracts event sequences included in the second event sequence, using as a new starting point a fourth event that is included in the second event sequence and contributed to the prediction of the third event, and generates an event graph by combining the recursively extracted event sequences.

The event analysis method according to claim 11, wherein the event analysis system, when the prediction accuracy of the local prediction model for the fourth event included in the recursively extracted event sequence is high and the prediction result of the fourth event does not match the observation result of the fourth event, extracts the event sequence immediately before the fourth event and terminates the recursive extraction, and, when the prediction accuracy of the local prediction model for the fourth event is low, terminates the recursive extraction.

The event analysis method according to claim 10, wherein the event analysis system searches for a second event graph that has a corresponding case and is strongly correlated with the second event sequence, and displays the case corresponding to the retrieved second event graph on an operation terminal.

The event analysis method according to claim 13, wherein the event analysis system calculates the correlation using a correlation coefficient that reflects the occurrence order of the second events included in the second event sequence.

The event analysis method according to claim 13, wherein the event analysis system displays the course of the back trace of the second event sequence in a visually recognizable manner and, when a transition between events included in the second event sequence is a normal transition, attaches the weight of the local prediction model used to predict that transition and displays it in the graph.