JP2009253358A - Information processor and information processing method

Info

Publication number: JP2009253358A
Application number: JP2008095102A
Authority: JP
Inventors: Mitsutoshi Arai; 光俊荒井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-04-01
Filing date: 2008-04-01
Publication date: 2009-10-29

Abstract

PROBLEM TO BE SOLVED: To provide an information processor capable of generating a rule for deleting low-priority fault messages.

SOLUTION: The information processor comprises a memory and a control portion. The memory stores network configuration information indicating the connection relations among a plurality of data transfer devices. The control portion receives fault messages from the plurality of data transfer devices and stores them in the memory. When the fault occurrence times of two of these fault messages fall within a predetermined time of each other, the control portion gives a predetermined score to the two fault messages. The control portion then refers to the network configuration information to find the hop count between the two devices that sent the two fault messages, and adds to the score of the two fault messages a number of points that becomes larger as the hop count becomes smaller. The larger the total score, the more closely the two fault messages are judged to be related. Based on this, the control portion determines whether to generate a rule that deletes the later message when faults of the types indicated by the two fault messages occur, in the same order as the two fault messages, within a predetermined time.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an information processing apparatus and an information processing method for monitoring failures that occur in data transfer apparatuses in a network.

When a failure occurs in a network device such as a router or a switch connected to a network, the device sends a failure message, which notifies the occurrence of the failure, to a management device. When the first failure causes a failure in another device, that device sends the management device a failure message that differs from the one for the main cause. Thus, even when a failure has only one cause, multiple failure messages may reach the administrator, and it may take the administrator considerable effort to find the main cause.

To reduce this effort, the administrator prepares in advance correlation rules that describe the relationships between failures. When the management device receives multiple failure messages from multiple network devices, it applies the correlation rules to delete the redundant failure messages and reveal the failure message of the main cause.

In the method of Patent Document 1, the way of sorting alarms into high-priority main-cause alarms and low-priority affected alarms is defined in advance, and alarms are sorted accordingly.
Patent Document 1: JP-A-8-251162

When creating correlation rules, the administrator designs the rules in advance by reasoning about how the notified failure messages relate to one another. However, it is difficult to predict what failures will be induced by a failure in an adjacent device. Rule creation therefore requires a thorough prior investigation of the network device specifications and the overall network configuration, followed by careful configuration, which is extremely laborious for the administrator. Defining the alarm sorting method of Patent Document 1 is similarly laborious. As a result, even when a management device is equipped with a correlation function that deletes low-priority failure messages according to preset rules, the function is sometimes never actually used.

The present invention has been made to solve the problems of the technology described above. Its object is to provide an information processing apparatus and an information processing method that generate rules for deleting low-priority failure messages.

To achieve the above object, an information processing apparatus of the present invention is connected via a network to a plurality of data transfer devices, each of which is assigned a different identifier, and comprises:
a storage unit storing network configuration information indicating the connection relationships among the plurality of data transfer devices; and
a control unit that, upon receiving from the plurality of data transfer devices failure messages each including the identifier, the failure occurrence date and time, and the failure type, stores these messages in the storage unit; gives a predetermined number of points to two of the failure messages if their failure occurrence dates and times fall within a predetermined time of each other; refers to the network configuration information to find the number of hops between the two devices that sent the two failure messages; adds to the score of the two failure messages a number of points that is larger the smaller the hop count; judges the two failure messages to be more closely related the larger the total score; and decides whether to generate a rule stating that, when failures of the types indicated by the two failure messages occur within the predetermined time in the same order as the two failure messages, the later message is deleted.

An information processing method of the present invention is a method for monitoring a plurality of data transfer devices that are provided in a network and are each assigned a different identifier, the method comprising:
holding network configuration information indicating the connection relationships among the plurality of data transfer devices;
upon receiving from the plurality of data transfer devices failure messages each including the identifier, the failure occurrence date and time, and the failure type, recording these messages;
if the failure occurrence dates and times of two of the recorded failure messages fall within a predetermined time of each other, giving a predetermined number of points to the two failure messages;
referring to the network configuration information to find the number of hops between the two devices that sent the two failure messages, and adding to the score of the two failure messages a number of points that is larger the smaller the hop count; and
judging the two failure messages to be more closely related the larger the total score, and deciding whether to generate a rule stating that, when failures of the types indicated by the two failure messages occur within the predetermined time in the same order as the two failure messages, the later message is deleted.

According to the present invention, the importance of the failure messages sent from the data transfer devices is judged and rules for deleting low-priority messages are generated, which spares the network administrator the effort of creating those rules.

The information processing apparatus of the present invention is characterized in that it generates correlation rules based on the failure occurrence dates and times and on the connection relationships between the network devices involved in the failures.

The failure management apparatus of this embodiment will now be described.

FIG. 1 is a block diagram showing a configuration example of the failure management apparatus of this embodiment. FIG. 2 is a diagram showing an example of a configuration in which the failure management apparatus of FIG. 1 is connected to a network.

As shown in FIG. 2, the failure management apparatus 1 is connected via the network 100 to the network devices 10a to 10e that it manages. The network devices 10a to 10e are data transfer devices, such as routers and switches, that forward received data onto the transmission path toward its destination terminal. A different identifier is assigned to each of the network devices 10a to 10e.

In this embodiment, the network devices 10a to 10e are data transfer devices for IP (Internet Protocol) communication. Each of the network devices 10a to 10e reads the header of a received packet and forwards the packet to the destination given in the header. When a failure occurs, a network device sends the failure management apparatus 1, via the network 100, a failure message containing the failure occurrence date and time, the device's own identifier, and information indicating the type of failure. Although the network devices 10a to 10e are the managed devices in this embodiment, the number of managed devices is not limited to five.

Next, the configuration of the failure management apparatus 1 will be described.

The failure management apparatus 1 is an information processing apparatus such as a server or a workstation. As shown in FIG. 1, the failure management apparatus 1 has a receiving unit 11 that receives failure messages from the network devices 10a to 10e, a storage unit 9 in which the failure messages are stored, and a control unit 3 that controls each unit.

The control unit 3 has a network configuration management unit 2, a score accumulating unit 5, a correlation rule generation unit 6, and a correlation rule application unit 7. The control unit 3 is provided with a CPU (Central Processing Unit) (not shown) that executes predetermined processing according to a program, and a memory (not shown) for storing the program. When the CPU executes the program, the network configuration management unit 2, the score accumulating unit 5, the correlation rule generation unit 6, and the correlation rule application unit 7 are virtually configured within the failure management apparatus.

The storage unit 9 stores the failure messages received from the network devices 10a to 10e. FIG. 3 is a table showing an example of the failure messages stored in the storage unit. The number indicating the reception order of a failure message is smaller the older its failure occurrence date and time. The number column is included in the table only so that the failure messages can be told apart in this description; the failure messages themselves need not contain a number.

For each failure message, the table of FIG. 3 shows the number, the failure occurrence date and time, the identifier of the device in which the failure occurred, and the type of failure. "Device a" is the identifier of the network device 10a, and the identifiers of the other devices likewise correspond to the letter in each device's reference sign. Failure message No. 1 in the table indicates that a failure of the type "failure message A" occurred in the network device 10a at 20:38:50 on December 11, 2007. To simplify the explanation, each device has a single message type here, but a device may have multiple types of failure message.
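As an illustration only, the fields of a failure message could be modelled as a small record. The class name `FailureMessage`, its field names, and the short device and type labels are assumptions made for this sketch; the three entries use the times, devices, and types that the description gives for messages No. 1 to No. 3 of FIG. 3.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FailureMessage:
    """One failure message: occurrence time, sending device, failure type."""
    occurred_at: datetime
    device_id: str      # hypothetical short form of "device a", "device b", ...
    failure_type: str   # hypothetical short form of "failure message A", ...


# Messages No. 1 to No. 3 as described for FIG. 3.
MESSAGES = [
    FailureMessage(datetime(2007, 12, 11, 20, 38, 50), "a", "A"),
    FailureMessage(datetime(2007, 12, 11, 20, 38, 52), "b", "B"),
    FailureMessage(datetime(2007, 12, 11, 20, 38, 55), "c", "C"),
]

# Numbering by occurrence time mirrors the number column of the table;
# the number is not part of the message itself.
numbered = dict(enumerate(sorted(MESSAGES, key=lambda m: m.occurred_at), start=1))
```

Keeping the number outside the record reflects the remark that the messages need not carry a number themselves.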

The storage unit 9 also stores network configuration information, which indicates how the monitored network devices 10a to 10e are connected to one another.

In the configuration example of FIG. 2, the following information is stored in the storage unit 9 as the network configuration information: "The network device 10a is connected to the network device 10b, and the network device 10b is connected to the network devices 10a, 10d, and 10e. The network device 10d is connected to the network devices 10b and 10c, and the network device 10e is connected to the network devices 10b and 10c. The network device 10c is connected to the network devices 10d and 10e." The identifiers of the network devices 10a to 10e are used as the device names.
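One possible encoding of this network configuration information is an adjacency map keyed by device identifier. The name `TOPOLOGY`, the single-letter identifiers, and the helper `is_symmetric` are assumptions of this sketch; the description leaves the registration format open.

```python
# Adjacency map for the FIG. 2 example: device -> directly connected devices.
TOPOLOGY = {
    "a": {"b"},
    "b": {"a", "d", "e"},
    "c": {"d", "e"},
    "d": {"b", "c"},
    "e": {"b", "c"},
}


def is_symmetric(topology):
    """Check that every listed connection is recorded in both directions."""
    return all(
        device in topology[neighbor]
        for device, neighbors in topology.items()
        for neighbor in neighbors
    )


print(is_symmetric(TOPOLOGY))
```

A symmetry check of this kind is a cheap way to catch registration mistakes when the information is entered by hand.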

In this embodiment, the administrator registers the network configuration information in the storage unit 9 in advance. Alternatively, the control unit 3 may execute a program to collect the routing table information stored in routers (not shown) in the network 100, create the network configuration information from the collected information, and register it in the storage unit 9. The registration format of the network configuration information is not limited to the description given above.

The receiving unit 11 passes the failure messages received from the network devices 10a to 10e to the control unit 3. The control unit 3 stores the failure messages received from the receiving unit 11 in the storage unit 9 and generates correlation rules as follows.

The score accumulating unit 5 of the control unit 3 periodically checks the failure messages stored in the storage unit 9 and, when multiple failure messages have been recorded, scores them to gauge how closely the failures are related. Scoring is performed in two stages: a time determination based on the failure occurrence dates and times, and a determination based on the connection relationship between the devices.

First, the scoring method based on the time determination will be described. The score accumulating unit 5 divides the failure messages to be judged into groups using a predetermined time t1. The time t1 may take any value and is determined in advance from experiments, operating experience, or the like. In this embodiment, t1 = 10 seconds. Within a group, the failure message with the earliest failure occurrence date and time is the parent message, and the other failure messages are child messages. Each pair consisting of the parent message and one child message is given 1 point. The score accumulating unit 5 records score information for each pair in a memory (not shown) in the control unit 3. Depending on how the groups are formed, one failure message may belong to more than one group; that case is described in detail later.
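The grouping step can be sketched as follows, using the three occurrence times given for FIG. 3. The function name `parent_child_pairs` is hypothetical, and treating "within t1" as inclusive of exactly t1 is an assumption; the description does not state how the boundary is handled.

```python
from datetime import datetime, timedelta

T1 = timedelta(seconds=10)  # t1 = 10 seconds in this embodiment


def parent_child_pairs(times, window=T1):
    """Return (parent_index, child_index) pairs for sorted occurrence times.

    Each message in turn acts as a parent; every later message that occurred
    within `window` of the parent becomes one of its children.
    """
    pairs = []
    for i, parent_time in enumerate(times):
        for j in range(i + 1, len(times)):
            if times[j] - parent_time > window:
                break  # times are sorted, so all later ones are outside too
            pairs.append((i, j))
    return pairs


times = [
    datetime(2007, 12, 11, 20, 38, 50),
    datetime(2007, 12, 11, 20, 38, 52),
    datetime(2007, 12, 11, 20, 38, 55),
]
print(parent_child_pairs(times))  # [(0, 1), (0, 2), (1, 2)]
```

The third message appears as a child of both the first and the second, which is the situation where one failure message belongs to more than one group.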

After scoring by the time determination, the score accumulating unit 5 performs the determination based on the connection relationship between the devices as follows. If the two failure messages in a pair were sent by different devices, the score accumulating unit 5 asks the network configuration management unit 2 how far the device that sent the child message is from the device that sent the parent message. To do so, it sends the network configuration management unit 2 a pair configuration information request signal containing the identifiers of the two devices in the pair and a request for the distance between them.

Here, the length of the transmission path between devices is not actually measured; instead, the unit "hop" is used as a measure of the distance between devices. The minimum value, 1 hop, is the distance between two devices that are directly connected via a transmission path. FIG. 4 illustrates the distances between devices in the configuration example of FIG. 2. As shown in FIG. 4, for example, the distance between the network device 10a and the network device 10b is 1 hop, and the distance between the network device 10a and the network device 10c is 3 hops.
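A hop count of this kind can be computed with a breadth-first search over the connection information. The description does not say which algorithm the network configuration management unit 2 uses, so the following is a sketch under that assumption; `TOPOLOGY` and `hop_count` are hypothetical names, and the map encodes the FIG. 2 connections.

```python
from collections import deque

# FIG. 2 connections: device -> directly connected devices.
TOPOLOGY = {
    "a": {"b"},
    "b": {"a", "d", "e"},
    "c": {"d", "e"},
    "d": {"b", "c"},
    "e": {"b", "c"},
}


def hop_count(topology, src, dst):
    """Return the minimum number of hops from src to dst, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in topology[node]:
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


print(hop_count(TOPOLOGY, "a", "b"))  # 1
print(hop_count(TOPOLOGY, "a", "c"))  # 3
```

Both results agree with the distances stated for FIG. 4.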

When the score accumulating unit 5 receives from the network configuration management unit 2 a pair configuration information response signal containing the identifiers of the two devices in the pair and the hop count, it adds points to the score information of the child message according to the hop count. The points for each hop count are as follows.

If the sender of the child message is directly connected to the sender of the parent message (1 hop), 3 points are added to the score information of the child message. If the sender of the child message is connected to the sender of the parent message through one intermediate device (2 hops), 2 points are added. If the distance between the two devices is 3 hops, 1 point is added, and if it is 4 hops or more, no points are added. More points are added for smaller hop counts because the closer two devices are, the more easily one is affected by a failure in the other.
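The hop-to-points mapping above is small enough to express as a lookup. The function name `hop_points` is hypothetical, and returning 0 for a hop count below 1 (for example, two messages from the same device) is an assumption, since the description covers only pairs whose senders differ.

```python
def hop_points(hops):
    """Points added to a pair's score for a given hop count.

    1 hop -> 3 points, 2 hops -> 2 points, 3 hops -> 1 point,
    4 hops or more -> 0 points. Hop counts below 1 also give 0 (assumption).
    """
    return {1: 3, 2: 2, 3: 1}.get(hops, 0)


for hops in range(1, 6):
    print(hops, hop_points(hops))
```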

When scoring of the failure messages to be judged is complete, the score accumulating unit 5 sends the correlation rule generation unit 6 pair determination result information containing, for each pair, the failure types of the parent message and the child message together with the score information.

When the network configuration management unit 2 receives a pair configuration information request signal from the score accumulating unit 5, it refers to the network configuration information registered in the storage unit 9 and finds the distance, in hops, between the devices with the two identifiers contained in the request signal. It then sends the score accumulating unit 5 a pair configuration information response signal containing the hop count it found.

When the correlation rule generation unit 6 receives the pair determination result information from the score accumulating unit 5, it generates correlation rules based on that information and sends the generated correlation rules to the correlation rule application unit 7.

When the correlation rule application unit 7 receives the correlation rules from the correlation rule generation unit 6, it applies them to the failure messages that accumulate in the storage unit 9 and sends the failure messages remaining after the rules have been applied to the display unit 8. The display unit 8 displays these failure messages.

Next, the operation procedure of the failure management apparatus of this embodiment will be described.

When a failure occurs, the network devices 10a to 10e send a failure message to the failure management apparatus 1 via the network 100. When the failure management apparatus 1 receives a failure message via the receiving unit 11, it records the message in the storage unit 9. Here, it is assumed that the failure messages shown in FIG. 3 have been recorded in the storage unit 9.

FIG. 5 is a flowchart showing the operation procedure of the score accumulating unit. From the failure messages recorded in the storage unit 9, the score accumulating unit 5 reads the failure occurrence date and time of the message for the failure that occurred first (step 101). In FIG. 3, failure message No. 1 indicates a failure that occurred at 20:38:50 on December 11, 2007. The score accumulating unit 5 then checks whether any other failure occurred within t1 = 10 seconds of the failure occurrence date and time of message No. 1 (step 102). If so, it reads the messages for those failures and treats them as belonging to the same group as message No. 1.

In the example of FIG. 3, failure message No. 2 indicates a failure that occurred at 20:38:52 and failure message No. 3 indicates a failure that occurred at 20:38:55, so the score accumulating unit 5 treats these two messages as belonging to the same group as message No. 1. If no other failure is found in step 102, the process returns to step 101 for the next failure message (step 103).

FIGS. 6 and 7 illustrate the parent-child relationships of the messages and the scoring method. In FIG. 6, it is assumed that the pair whose failure types are failure message A and failure message B already holds 1 point of score information from a past determination process.

The score accumulating unit 5 takes message No. 1, which it read first, as the parent message and messages No. 2 and No. 3 as child messages. It creates one pair from the parent message and message No. 2 and another pair from the parent message and message No. 3 (step 104), and adds 1 point to the score information of each pair (step 105). As shown in FIG. 6, the score based on the failure occurrence date and time is therefore 2 points for the pair containing message No. 2 and 1 point for the pair containing message No. 3.

Next, the score accumulating unit 5 sends the network configuration management unit 2 a pair configuration information request signal for these two pairs and receives a pair configuration information response signal from it as the result (step 106). As shown in FIG. 2, the network devices 10a and 10b are 1 hop apart, so the score accumulating unit 5 adds 3 points to the score information of the pair containing message No. 2. The network devices 10a and 10c are 3 hops apart, so it adds 1 point to the score information of the pair containing message No. 3 (step 107).

Once the determination based on the connection relationship between devices has been made, the score information of the pair containing message No. 2 is 2 + 3 = 5 points and that of the pair containing message No. 3 is 1 + 1 = 2 points, as shown in FIG. 6. Having processed this far, the score accumulating unit 5 determines whether any unprocessed messages remain in the storage unit 9 (step 108). If so, it repeats the processing from step 101 for the next failure message. The case in which the next message, failure message No. 2, is the parent message is described below.

In step 102, the score accumulating unit 5 takes the next message, failure message No. 2, as the parent message and checks whether there are child messages belonging to the same group. FIG. 3 shows that failure messages No. 2 and No. 3 also form a group.

As shown in FIG. 7, the score accumulating unit 5 forms a pair with message No. 2 as the parent message and message No. 3 as the child message (step 104) and gives it 1 point based on the failure occurrence date and time (step 105). It then sends the network configuration management unit 2 a pair configuration information request signal for this pair and receives a pair configuration information response signal as the result (step 106).

As shown in FIG. 2, the network devices 10b and 10c are 2 hops apart, so the score accumulating unit 5 adds 2 points to the score information of the pair of messages No. 2 and No. 3 (step 107). As shown in FIG. 7, the score information of this pair is 1 + 2 = 3 points.
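The arithmetic of the worked example in FIGS. 6 and 7 can be reproduced in a few lines. The variable names and the single-letter labels are assumptions of this sketch; the hop counts, the prior point for the A-B pair, and the point values are those stated in the description.

```python
# Hop counts between sender devices, as stated for FIG. 2 and FIG. 4.
HOPS = {("a", "b"): 1, ("a", "c"): 3, ("b", "c"): 2}
HOP_POINTS = {1: 3, 2: 2, 3: 1}  # 4 hops or more adds nothing

# (parent type, parent device, child type, child device) for each pair.
PAIRS = [
    ("A", "a", "B", "b"),
    ("A", "a", "C", "c"),
    ("B", "b", "C", "c"),
]

# The A-B pair is assumed to hold 1 point from a past determination.
time_scores = {("A", "B"): 1}
hop_scores = {}

for parent_type, parent_dev, child_type, child_dev in PAIRS:
    key = (parent_type, child_type)
    # Step 105: one point for occurring within t1 of each other.
    time_scores[key] = time_scores.get(key, 0) + 1
    # Step 107: points according to the hop count between the senders.
    hop_scores[key] = HOP_POINTS.get(HOPS[(parent_dev, child_dev)], 0)

totals = {key: time_scores[key] + hop_scores[key] for key in time_scores}
print(totals)  # {('A', 'B'): 5, ('A', 'C'): 2, ('B', 'C'): 3}
```

The totals match the 2 + 3 = 5, 1 + 1 = 2, and 1 + 2 = 3 points given for the three pairs.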

After that, if no unprocessed messages are found in step 108, the score accumulating unit 5 ends the process. After a fixed time, it checks the messages recorded in the storage unit 9 again and, if there are newly recorded failure messages, executes the process shown in FIG. 5.

Having scored each pair as described above, the score accumulating unit 5 sends the pair determination result information to the correlation rule generation unit 6. On receiving it, the correlation rule generation unit 6 reads the score information for each pair and decides whether the failures of the parent message and the child message are related according to whether the score is at or above a predetermined reference score.

Here, three pairs are subject to determination: the two pairs shown in FIG. 6 and the one pair shown in FIG. 7. If the predetermined reference point is three points, the correlation rule generation unit 6 selects, from the three pairs, the following two as pairs indicating related failure types: "failure message A (parent message) - failure message B (child message)", with a total of five points, and "failure message B (parent message) - failure message C (child message)", with a total of three points.
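The selection against the reference point can be sketched as below. The A-B and B-C totals are those given in the text. The third pair is not identified in this passage, so the A-C pair and its score of 2 are hypothetical, as are the function and variable names.

```python
REFERENCE_SCORE = 3  # reference point used in the example above

# Total scores per (parent type, child type). The A-B and B-C totals come from
# the text; the A-C pair and its score of 2 are hypothetical.
pair_totals = {("A", "B"): 5, ("B", "C"): 3, ("A", "C"): 2}


def select_related(totals: dict, reference: int) -> list:
    """Keep the pairs whose total score is at or above the reference point."""
    return [pair for pair, score in totals.items() if score >= reference]


related = select_related(pair_totals, REFERENCE_SCORE)
print(related)  # [('A', 'B'), ('B', 'C')]
```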

The correlation rule generation unit 6 then generates the following two rules. Rule 1 is: "If the failure of failure message B occurs within a certain time t2 from the time at which the failure of failure message A occurred, delete the failure message that notifies the occurrence of the failure of failure message B." Rule 2 is: "If the failure of failure message C occurs within the certain time t2 from the time at which the failure of failure message B occurred, delete the failure message that notifies the occurrence of the failure of failure message C." The certain time t2 is a value determined in advance by experiment, operating experience, or the like. It is also possible to set t1 = t2. FIG. 8 is a table that briefly describes the generated rules.
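One way to represent such rules as data is sketched below. The class name, field names, and the five-minute value of t2 are all assumptions made for illustration; the text only says that t2 is determined in advance.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CorrelationRule:
    parent_type: str   # failure type of the earlier (main-cause) message
    child_type: str    # failure type of the later (derived) message
    window: timedelta  # t2: the child message is deleted if it occurs within this time


def generate_rules(related_pairs, t2):
    """Create one deletion rule per related (parent, child) pair."""
    return [CorrelationRule(parent, child, t2) for parent, child in related_pairs]


T2 = timedelta(minutes=5)  # hypothetical value; the text leaves t2 to experiment
rules = generate_rules([("A", "B"), ("B", "C")], T2)
for rule in rules:
    print(f"delete {rule.child_type} if within {rule.window} after {rule.parent_type}")
```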

Rule 1 rests on the determination that the failure of failure message B, when it occurs within the certain time t2 after the failure of failure message A, is mainly caused by the failure of failure message A. Therefore, once the failure of failure message A is resolved, there is no need to deal with the failure of failure message B. Likewise, Rule 2 rests on the determination that the failure of failure message C, when it occurs within the certain time t2 after the failure of failure message B, is mainly caused by the failure of failure message B. Therefore, once the failure of failure message B is resolved, there is no need to deal with the failure of failure message C.

The correlation rule generation unit 6 transmits the generated rules to the correlation rule application unit 7. The correlation rule application unit 7 applies the correlation rules received from the correlation rule generation unit 6 to the failure messages that are successively accumulated in the storage unit 9. As a result, some of the failure messages that notify failures derived from the main cause are deleted.

The correlation rule application unit 7 transmits to the display unit 8 the failure messages that remain after the correlation rules have been applied to the failure messages accumulated in the storage unit 9. The display unit 8 displays the failure messages received from the correlation rule application unit 7. The failure messages shown on the display unit 8 relate to the main-cause failure, and fewer messages about failures derived from the main cause are displayed.
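Applying the rules to stored messages might look like the following sketch. It is an illustration under assumptions, not the method of the embodiment: the tuple layout, the names, and the five-minute t2 are hypothetical, and the sketch checks each message against all stored messages, including parents that are themselves deleted by another rule.

```python
from datetime import datetime, timedelta

# Each rule: (parent type, child type, window t2). The 5-minute t2 is hypothetical.
T2 = timedelta(minutes=5)
RULES = [("A", "B", T2), ("B", "C", T2)]


def apply_rules(messages, rules):
    """Return the messages that remain after deleting derived failures.

    `messages` is a list of (failure type, occurrence time) tuples. A message is
    dropped when a rule names its type as the child and a message of the parent
    type occurred no more than t2 before it.
    """
    kept = []
    for ftype, when in messages:
        derived = any(
            ftype == child
            and any(
                other == parent and timedelta(0) <= when - t <= window
                for other, t in messages
            )
            for parent, child, window in rules
        )
        if not derived:
            kept.append((ftype, when))
    return kept


t0 = datetime(2008, 4, 1, 12, 0, 0)
log = [
    ("A", t0),
    ("B", t0 + timedelta(minutes=1)),
    ("C", t0 + timedelta(minutes=2)),
    ("B", t0 + timedelta(hours=1)),  # no failure A within t2 before this one
]
remaining = apply_rules(log, RULES)
print([ftype for ftype, _ in remaining])  # ['A', 'B']
```

In this example only the main-cause message A and the later, unrelated occurrence of B are passed on for display.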

On receiving a plurality of failure messages from a plurality of network devices, the information processing apparatus of this embodiment extracts the failure types to be examined for relevance, scores the extracted failure types according to the connection relationship of the network devices in which the failures occurred and the failure occurrence dates and times, and generates correlation rules based on the result. Because the correlation rules for failures are generated from the information in the failure messages, the administrator does not need to create correlation rules by hand, and the workload of rule creation is reduced.

In addition, even when a failure that the administrator did not anticipate occurs, the generated correlation rules reduce the number of messages about failures other than the main cause, so the administrator can respond quickly to the main cause of the failure.

Furthermore, by recording pair information by failure type as shown in FIG. 6, when two types of failure have occurred within t1 of each other three or more times, the score reaches three points and the relevance of those failures can be detected. Thus, even when the number of hops between two devices is four or more, so that the score from the connection relationship is zero and no rule is generated by a single execution of the procedure shown in FIG. 5, a rule that deletes the message of the derived failure is still generated for failures that occur in association with each other in two distant devices.
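The accumulation across repeated occurrences can be sketched as follows. The failure types "X" and "Y", the function name, and the use of a dictionary keyed by type pair are assumptions for illustration; the reference point of three and the one point per co-occurrence within t1 follow the example in the text.

```python
from collections import defaultdict

REFERENCE_SCORE = 3  # reference point from the example
TIME_SCORE = 1       # point for two failures occurring within t1

# Accumulated score per (parent type, child type), kept between runs.
totals = defaultdict(int)


def record_occurrence(parent_type: str, child_type: str, hop_points: int) -> bool:
    """Add one co-occurrence to the pair's total; report whether it now qualifies."""
    totals[(parent_type, child_type)] += TIME_SCORE + hop_points
    return totals[(parent_type, child_type)] >= REFERENCE_SCORE


# Types "X" and "Y" are hypothetical failures on devices 4 or more hops apart,
# so the connection-based score is 0 each time.
results = [record_occurrence("X", "Y", 0) for _ in range(3)]
print(results)  # [False, False, True]
```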

In this embodiment, when the score accumulating unit 5 groups the failure messages in the storage unit 9, it places in the same group the failure messages whose failure occurrence date and time falls within the certain time t1 after that of the reference failure message. The way of delimiting time for grouping is not limited to this, however. Failure messages that fall within the certain time t1 before the failure occurrence date and time of the reference failure message may also be included in the same group. In that case, the message with the oldest failure occurrence date and time in the group is taken as the parent message, and the processing is executed in the same way as described above.
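The two grouping variants can be sketched as below. The function name, the `include_before` flag, the message tuples, and the five-minute t1 are hypothetical and serve only to illustrate the difference between the two time windows.

```python
from datetime import datetime, timedelta


def group_around(messages, reference_time, t1, include_before=False):
    """Return the messages within t1 of the reference time, oldest first.

    By default only messages at or after the reference time are grouped; with
    include_before=True, messages up to t1 before it are included as well.
    """
    start = reference_time - t1 if include_before else reference_time
    end = reference_time + t1
    group = [m for m in messages if start <= m[1] <= end]
    return sorted(group, key=lambda m: m[1])


t0 = datetime(2008, 4, 1, 12, 0, 0)
T1 = timedelta(minutes=5)  # hypothetical value of t1
msgs = [
    ("A", t0 - timedelta(minutes=2)),
    ("B", t0),
    ("C", t0 + timedelta(minutes=3)),
    ("D", t0 + timedelta(minutes=30)),
]
forward = group_around(msgs, t0, T1)
both = group_around(msgs, t0, T1, include_before=True)
print([m[0] for m in forward])  # ['B', 'C']
print([m[0] for m in both])     # ['A', 'B', 'C']
```

With the extended window, the first element of the group (here A) is the oldest message and would serve as the parent message.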

Also, when generating correlation rules, the correlation rule generation unit 6 selects the pairs whose score is equal to or greater than the reference point, but the method of selecting pairs is not limited to this. Having extracted a plurality of pairs from the failure messages recorded during a certain period, the correlation rule generation unit 6 may sort the pairs by total score and select the pairs that fall within a certain rank from the highest-scoring pair.
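This rank-based alternative can be sketched as follows. The function name, the cutoff of two, and the A-C pair with its score are assumptions for illustration only.

```python
def select_top_pairs(totals: dict, top_n: int) -> list:
    """Return the top_n pairs in descending order of total score."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:top_n]]


# The A-C pair and its score are hypothetical; top_n = 2 is an assumed cutoff.
pair_totals = {("A", "C"): 2, ("B", "C"): 3, ("A", "B"): 5}
top_pairs = select_top_pairs(pair_totals, 2)
print(top_pairs)  # [('A', 'B'), ('B', 'C')]
```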

FIG. 1 is a block diagram showing a configuration example of the failure management device of this embodiment.
FIG. 2 is a diagram showing an example of a configuration in which the failure management device shown in FIG. 1 is connected to a network.
FIG. 3 is a table showing examples of failure messages.
FIG. 4 is a diagram for explaining the distance between devices.
FIG. 5 is a flowchart showing the operation procedure of the score accumulating unit.
FIG. 6 is a diagram for explaining the parent-child relationship of messages and the scoring method.
FIG. 7 is another diagram for explaining the parent-child relationship of messages and the scoring method.
FIG. 8 is a table showing an example of the generated correlation rules.

Explanation of symbols

1 Failure management device
3 Control unit
9 Storage unit

Claims

An information processing apparatus connected via a network to a plurality of data transfer devices, each assigned a different identifier, the apparatus comprising:
a storage unit that stores network configuration information indicating a connection relationship among the plurality of data transfer devices; and
a control unit that, on receiving from the plurality of data transfer devices failure messages each including information on the identifier, a failure occurrence date and time, and a failure type, stores the messages in the storage unit; gives a predetermined score to two failure messages among the plurality of failure messages if their failure occurrence dates and times are within a predetermined time of each other; examines, with reference to the network configuration information, the number of hops between the two devices that are the transmission sources of the two failure messages; adds to the score of the two failure messages a score that is larger as the number of hops is smaller; determines that the relevance of the two failure messages is higher as the total value of the scores is larger; and determines whether to generate a rule to the effect that, when failures of the types indicated by the two failure messages occur within the predetermined time in the failure occurrence order of the two failure messages, the message that is later in the order is deleted.

The information processing apparatus according to claim 1, wherein the control unit generates the rule when the total value is equal to or greater than a preset reference point, and does not generate the rule when the total value is smaller than the reference point.

The information processing apparatus according to claim 1, wherein the control unit extracts a plurality of pairs of the two failure messages from the plurality of failure messages recorded in the storage unit during a certain period, arranges the plurality of pairs in descending order of the total value, and generates the rule for the two failure messages included in each of the higher-ranked pairs.

The information processing apparatus according to any one of claims 1 to 3, wherein, when the rule is not generated, the control unit pairs the information on the failure types indicated by the two failure messages and records the score in association with the pair.

An information processing method for monitoring a plurality of data transfer devices provided in a network, each assigned a different identifier, the method comprising:
holding network configuration information indicating a connection relationship among the plurality of data transfer devices;
on receiving from the plurality of data transfer devices failure messages each including information on the identifier, a failure occurrence date and time, and a failure type, recording the messages;
giving a predetermined score to two failure messages among the plurality of recorded failure messages if their failure occurrence dates and times are within a predetermined time of each other;
examining, with reference to the network configuration information, the number of hops between the two devices that are the transmission sources of the two failure messages, and adding to the score of the two failure messages a score that is larger as the number of hops is smaller; and
determining that the relevance of the two failure messages is higher as the total value of the scores is larger, and determining whether to generate a rule to the effect that, when failures of the types indicated by the two failure messages occur within the predetermined time in the failure occurrence order of the two failure messages, the message that is later in the order is deleted.

The information processing method according to claim 5, wherein the rule is generated when the total value is greater than or equal to a preset reference point, and the rule is not generated when the total value is smaller than the reference point.

The information processing method according to claim 5, wherein a plurality of pairs of the two failure messages are extracted from the plurality of failure messages recorded in the storage unit during a certain period, the plurality of pairs are arranged in descending order of the total value, and the rule is generated for the two failure messages included in each of the higher-ranked pairs.

The information processing method according to any one of claims 5 to 7, wherein, when the rule is not generated, the information on the failure types indicated by the two failure messages is paired and the score is recorded in association with the pair.