JP2005004513A

JP2005004513A - Failure analysis data sampling system and method thereof

Info

Publication number: JP2005004513A
Application number: JP2003167858A
Authority: JP
Inventors: Nobutane Mori; 信胤森; Akihiro Baba; 昭宏馬場
Original assignee: Mitsubishi Electric Corp; Mitsubishi Electric Information Systems Corp; Mitsubishi Electric Information Technology Corp
Current assignee: Mitsubishi Electric Corp; Mitsubishi Electric Information Systems Corp; Mitsubishi Electric Information Technology Corp
Priority date: 2003-06-12
Filing date: 2003-06-12
Publication date: 2005-01-06
Anticipated expiration: 2023-06-12
Also published as: JP4286594B2

Abstract

PROBLEM TO BE SOLVED: To provide a failure analysis data sampling system and a method thereof which determine the conditions at the occurrence of a failure and decide which data to sample, so that the most suitable failure analysis data is sampled.

SOLUTION: The failure analysis data sampling system 100 is provided with surveillance object nodes 1, 2 and 3, a network 5, a surveillance object resource 10, a failure surveillance means 11, a data sampling control means 12, data collecting means 13a, 13b and 13c, a sampling object specifying means 14, and data recording means 23a, 23b and 23c. Rather than only preserving the logs and traces concerning sampling processing that each program records during normal operation, the system dynamically selects and additionally samples detailed data useful for failure analysis according to the conditions at the occurrence of the failure. Consequently, the data necessary to analyze the failure can be assembled easily, and failure analysis is accelerated.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
[Technical field to which the invention belongs]
The present invention relates to a failure analysis data collection device and method for computer systems.
[0002]
[Prior art]
In one conventional failure analysis data collection device, a client-side workstation and a server-side workstation are connected via a network, and the client-side workstation is provided with an application program, a communication management program, a related program list table that tabulates the names of the programs related to the application program, a maintenance information management unit, a message communication area, a management table, and the like. The related program list table and the management table are created when the system is introduced. The related program list table registers the client's communication management program, the server's communication management program, and the application programs that are involved when the server is accessed. A management table is also created on the server-side workstation when the system is introduced.
[0003]
When a failure occurs while the client-side workstation is accessing the server-side workstation in the device described above, the program in which the failure occurred stops its own maintenance information acquisition and writes the information recorded in memory up to that point to disk. It then refers to the related program list table, identifies the related programs on the client-side and server-side workstations, and stops maintenance information acquisition in each of those programs via the maintenance information management unit and the message communication area. Through this processing, maintenance information acquisition is stopped not only for the program in which the failure occurred but also for the related programs on the client-side and server-side workstations, and the maintenance information collected so far is recorded on disk (see, for example, Patent Document 1).
[0004]
[Patent Document 1]
Japanese Patent Laid-Open No. 6-266686
[0005]
[Problems to be solved by the invention]
In the conventional failure analysis data collection device, the maintenance information recorded on disk is limited to the information that each program had stored in memory. To investigate the cause of a failure, it is desirable to additionally collect more detailed failure information at the time of the failure, or before and after it, but the device has no function for doing so. Moreover, if detailed information is collected during normal operation so that no additional information is needed, an excessive load is placed on computer processing and on resources such as memory and disk.
[0006]
In addition, although the targets from which failure information is collected can be registered in a table, the registration is fixed: the maintenance information to be acquired cannot be added to, removed, or otherwise changed according to the situation at the time of the failure. Furthermore, if the related programs are registered in a fixed manner and maintenance information is always obtained only from the registered programs, or from all of the registered programs, whenever a failure occurs, it is difficult to obtain the optimum maintenance information.
[0007]
The present invention has been made to solve the problems described above. Its object is to obtain a method for additionally collecting failure analysis data (maintenance information) when a failure occurs, and a failure analysis data collection device and method that judge the situation at the time of the failure and determine the data to be collected, so that the optimum failure analysis data is collected.
[0008]
[Means for Solving the Problems]
The failure analysis data collection device according to the present invention is a failure analysis data collection device configured in a distributed computer system, and comprises: failure monitoring means for monitoring and detecting the occurrence of a failure; collection target specifying means for investigating the situation at the time of the failure and identifying the nodes and the failure analysis data to be collected; data collection means for collecting the failure analysis data; and data collection control means for controlling the failure analysis data collection processing.
[0009]
The failure analysis data collection method according to the present invention is a method for collecting failure analysis data in a distributed computer system. It monitors and detects the occurrence of a failure, investigates the situation at the time of the failure, determines the nodes and the failure analysis data to be collected, and then collects the failure analysis data and performs the failure analysis data collection processing.
[0010]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
Embodiment 1 according to the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the failure analysis data collection device 100 according to Embodiment 1.
[0011]
The failure analysis data collection device 100 comprises monitoring target nodes 1, 2, and 3, a network 5, a monitoring target resource 10, failure monitoring means 11, data collection control means 12, data collection means 13a, 13b, and 13c, collection target specifying means 14, and data recording means 23a, 23b, and 23c.
Note that the number of nodes to be monitored is not limited to three, and a larger number of nodes may be connected to the network.
[0012]
The monitoring target nodes 1, 2, and 3 are the nodes to be monitored in the distributed computer system, such as servers and network devices, and the network 5 connects the distributed monitoring target nodes 1, 2, and 3 to one another. The monitoring target resource 10 is a resource to be monitored within a monitoring target node, such as memory, a processor, a program, or a disk, and the failure monitoring means 11 monitors for and detects failures and abnormalities in the monitoring targets.
[0013]
The data collection control means 12 controls the processing that gathers failure analysis data by using the data collection means 13a, 13b, and 13c and the collection target specifying means 14. The data collection means 13a, 13b, and 13c newly collect detailed failure analysis data. The collection target specifying means 14 investigates the current system status and the like and judges what failure analysis data should be collected and from where. The data recording means 23a, 23b, and 23c hold records that include past failure information; the data collection means 13a, 13b, and 13c write to them and read the necessary information from them.
[0014]
Next, the operation of the failure analysis data collection device 100 will be described.
The failure monitoring means 11 monitors for failures (process (1) in FIG. 1) and, when it detects one, notifies the data collection control means 12 of its content (process (2) in FIG. 1). The data collection control means 12 queries the collection target specifying means 14 and obtains, based on the current system status, information on which failure analysis data should be collected from which nodes (process (3) in FIG. 1).
[0015]
Based on the information obtained from the collection target specifying means 14 and on the information recorded in the data recording means 23a, 23b, and 23c, the data collection control means 12 selects the necessary means from among the data collection means 13a, 13b, and 13c (for example, 13a and 13c), gives each selected means additional information that limits its data collection range, and carries out the failure analysis data collection processing (process (4) in FIG. 1).
[0016]
As described above, when analysis data is collected at the time of a failure, the device does not merely save the logs and traces that each program has been recording during normal operation; it can also select and additionally collect detailed data useful for failure analysis according to the situation at the time of the failure. As a result, the data needed to analyze the failure is more readily available, and failure analysis can be performed more quickly.
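The flow of processes (1) to (4) can be sketched roughly as follows. This is only an illustrative Python sketch, not the implementation described in the specification: the `CollectionTarget` type, the selection rules inside `identify_targets`, the failure dictionary fields, and the strings returned by the collectors are all assumptions made for the example, and the lookup of past records in the data recording means 23a to 23c is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CollectionTarget:
    node: str       # node to collect from
    collector: str  # which data collection means to use, e.g. "13a"
    scope: str      # additional information that limits the collection range


def identify_targets(failure: dict) -> List[CollectionTarget]:
    """Stand-in for the collection target specifying means 14 (hypothetical rules)."""
    targets = [CollectionTarget(failure["node"], "13a", f"pid={failure['pid']}")]
    if failure["kind"] == "disk":
        targets.append(CollectionTarget(failure["node"], "13c", "device=disk0"))
    return targets


# Stand-ins for the data collection means 13a to 13c (hypothetical outputs).
COLLECTORS: Dict[str, Callable[[str, str], str]] = {
    "13a": lambda node, scope: f"process trace from {node} ({scope})",
    "13b": lambda node, scope: f"memory dump from {node} ({scope})",
    "13c": lambda node, scope: f"disk statistics from {node} ({scope})",
}


def on_failure(failure: dict) -> List[str]:
    """Stand-in for the data collection control means 12: processes (3) and (4)."""
    targets = identify_targets(failure)  # (3) ask what to collect and from where
    return [COLLECTORS[t.collector](t.node, t.scope) for t in targets]  # (4) collect


# Processes (1) and (2): the failure monitoring means detects a failure and reports it.
collected = on_failure({"node": "node1", "kind": "disk", "pid": 4242})
```

In this sketch only collectors 13a and 13c run for a disk failure, mirroring the "for example, 13a and 13c" selection in the text, and each is given a scope string that narrows what it gathers.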
[0017]
Embodiment 2.
Embodiment 2 according to the present invention will be described with reference to FIG. 2. FIG. 2 is a block diagram showing the configuration of the failure analysis data collection device 200 according to Embodiment 2.
[0018]
The failure analysis data collection device 200 is the failure analysis data collection device 100 of Embodiment 1 with the addition of inter-node cooperation means 15, so that failure analysis data can also be collected from other nodes in the distributed system. The other components and their operations are the same as those described in Embodiment 1, and their description is omitted here.
[0019]
Collecting failure analysis information only from within the node where the failure occurred may be insufficient, and collection from other nodes may be desired. To provide for this, the inter-node cooperation means 15 issues data collection requests to other nodes and also responds to data collection requests from other nodes, collecting failure analysis data with the data collection means 13a, 13b, and 13c via the data collection control means 12. The internal configurations of the monitoring target node 2 and the monitoring target node 3 in FIG. 2 are the same as that of the monitoring target node 1, and the nodes can cooperate with one another (processes (5) and (6) in FIG. 2). The cooperation method is not particularly limited: nodes may cooperate directly, or requests may be made through a function that controls the overall cooperation, such as a manager function. The method is not limited to a client-server format.
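The request and response roles of the inter-node cooperation means can be illustrated with a small in-process Python sketch of the direct node-to-node variant. The `Node` class, its method names, and the returned strings are hypothetical; a real system would exchange these requests over the network, which is not modeled here.

```python
from typing import Dict, List


class Node:
    """A monitoring target node with a minimal stand-in for cooperation means 15."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peers: Dict[str, "Node"] = {}

    def link(self, other: "Node") -> None:
        # Register a bidirectional relationship (direct node-to-node cooperation).
        self.peers[other.name] = other
        other.peers[self.name] = self

    def collect_local(self, scope: str) -> str:
        # Stand-in for local collection via the data collection control means 12.
        return f"{self.name}: data for {scope}"

    def handle_request(self, scope: str) -> str:
        # Respond to a data collection request received from another node.
        return self.collect_local(scope)

    def collect_with_peers(self, scope: str, peer_names: List[str]) -> List[str]:
        # Collect locally, then request collection from the selected other nodes.
        results = [self.collect_local(scope)]
        for peer_name in peer_names:
            results.append(self.peers[peer_name].handle_request(scope))
        return results


node1, node2, node3 = Node("node1"), Node("node2"), Node("node3")
node1.link(node2)
node1.link(node3)
results = node1.collect_with_peers("job-42", ["node3"])
```

Routing the same requests through a central manager, the other option mentioned above, would only change who calls `handle_request`.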
[0020]
Embodiment 3.
Embodiment 3 according to the present invention will be described with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the failure analysis data collection device 300 according to Embodiment 3, and FIG. 4 is a table showing connection information of the failure analysis data collection device 300.
[0021]
The failure analysis data collection device 300 is the failure analysis data collection device 200 of Embodiment 2 with the addition of connection status investigation means 16, which examines the network connection state of the own node in order to identify the collection target. The other components and their operations are the same as those described in Embodiments 1 and 2, and their description is omitted here.
[0022]
A failure that occurs in a distributed system may be caused by a communication partner or by the communication processing with that partner. With the connection status investigation means 16 provided in Embodiment 3, the nodes that appear to be related can be inferred from the connection status at the time of the failure, and failure analysis data can be collected from them.
[0023]
As shown in FIG. 4, the connection information includes, for example, the protocol type, the own node's port number, the communication state, the partner node number, and the partner port number, and the decision whether to collect is made under predetermined conditions. The collection target specifying means 14 identifies the targets of failure analysis data collection by using the connection state information at the time of the failure, which it requests and obtains from the connection status investigation means 16, together with a predetermined condition that specifies the connections for which failure data is to be collected. For example, if the condition is "a tcp connection in the connected state in which both the own node's port number and the partner node's port number are 1000 or higher", the connections whose own-node port numbers are 2000 and 1500 are selected. In this way, data useful for failure analysis can be collected.
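The example condition can be written as a simple filter over a connection table. The Python sketch below is illustrative only: FIG. 4 is not reproduced in this text, so the table rows, the field names, and the `ESTABLISHED`/`LISTEN` state labels are assumptions, chosen so that the own-node ports 2000 and 1500 are the ones selected, as in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Connection:
    protocol: str    # protocol type
    local_port: int  # own node's port number
    state: str       # communication state
    peer_node: str   # partner node
    peer_port: int   # partner port number


def select_connections(connections: List[Connection]) -> List[Connection]:
    """Keep connected tcp connections whose own and partner ports are both >= 1000."""
    return [
        c for c in connections
        if c.protocol == "tcp"
        and c.state == "ESTABLISHED"
        and c.local_port >= 1000
        and c.peer_port >= 1000
    ]


# Hypothetical connection table standing in for FIG. 4.
table = [
    Connection("tcp", 2000, "ESTABLISHED", "node2", 3000),
    Connection("tcp", 1500, "ESTABLISHED", "node3", 1200),
    Connection("tcp", 80, "ESTABLISHED", "node2", 4000),    # own port below 1000
    Connection("udp", 5000, "ESTABLISHED", "node3", 5000),  # not tcp
    Connection("tcp", 2500, "LISTEN", "-", 0),              # not connected
]

selected = select_connections(table)
peer_nodes = sorted({c.peer_node for c in selected})
```

The partner nodes of the selected connections (`peer_nodes`) are then the candidates from which additional failure analysis data would be requested.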
[0024]
Embodiment 4.
Embodiment 4 according to the present invention will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the failure analysis data collection device 400 according to Embodiment 4.
[0025]
The failure analysis data collection device 400 is the failure analysis data collection device 200 of Embodiment 2 with the addition of connection status monitoring means 17, which records the network connection state of the entire distributed system in order to identify the collection target. The other components and their operations are the same as those described in Embodiments 1 and 2, and their description is omitted here.
[0026]
The connection status monitoring means 17 is dedicated to monitoring the connection state of the network and has no other function, so it can keep monitoring the constantly changing network status in detail. Embodiment 3 uses the network connection information immediately after the failure occurs, whereas Embodiment 4 also uses the information at the time of the failure and before and after it, which is more effective. In addition, the information is not limited to what the failed node itself can collect about the situation; it is collected objectively by an external device, so omissions and recognition errors are less likely and the information is more accurate.
[0027]
The contents of the collected data are the same as those in the third embodiment, but the accuracy of selecting the necessary collected data can be increased by handling the information at the time of failure and the information before and after the failure.
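One way to picture the difference from Embodiment 3 is a dedicated monitor that keeps timestamped snapshots and is queried for a window around the failure time. The Python sketch below is an assumption-laden illustration: the `ConnectionMonitor` class, the snapshot format (pairs of node names), and the window length are all hypothetical and do not come from the specification.

```python
from collections import deque
from typing import Deque, FrozenSet, Iterable, List, Tuple

Link = Tuple[str, str]                      # (node_a, node_b) connection
Snapshot = Tuple[float, FrozenSet[Link]]    # (timestamp, connections seen)


class ConnectionMonitor:
    """Stand-in for connection status monitoring means 17 (hypothetical design)."""

    def __init__(self, max_snapshots: int = 1000) -> None:
        # Bounded history so continuous monitoring does not grow without limit.
        self._history: Deque[Snapshot] = deque(maxlen=max_snapshots)

    def record(self, timestamp: float, connections: Iterable[Link]) -> None:
        self._history.append((timestamp, frozenset(connections)))

    def around(self, failure_time: float, window: float) -> List[Snapshot]:
        # Snapshots taken within `window` seconds before or after the failure.
        return [s for s in self._history if abs(s[0] - failure_time) <= window]


def peers_of(snapshots: List[Snapshot], node: str) -> List[str]:
    """Nodes that were connected to `node` in any of the given snapshots."""
    peers = set()
    for _, connections in snapshots:
        for a, b in connections:
            if a == node:
                peers.add(b)
            elif b == node:
                peers.add(a)
    return sorted(peers)


monitor = ConnectionMonitor()
monitor.record(10.0, {("node1", "node2")})
monitor.record(20.0, {("node1", "node2"), ("node1", "node3")})
monitor.record(30.0, {("node1", "node2")})
monitor.record(100.0, {("node1", "node4")})

# Failure on node1 at t = 25: look 10 seconds before and after.
related = peers_of(monitor.around(25.0, 10.0), "node1")
```

Here node3 was connected shortly before the failure but not afterwards; a check made only after the failure would miss it, while the window query still reports it.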
[0028]
Embodiment 5.
Embodiment 5 according to the present invention will be described with reference to FIGS. 6 and 7. FIG. 6 is a block diagram showing the configuration of the failure analysis data collection device 500 according to Embodiment 5, and FIG. 7 is a table showing connection information of the failure analysis data collection device 500.
[0029]
The failure analysis data collection device 500 is the failure analysis data collection device 400 of Embodiment 4 with the addition of connection state registration means 18, which registers the connection state that should exist. The other components and their operations are the same as those described in Embodiments 1, 2, and 4, and their description is omitted here.
[0030]
The connection state registration means 18 allows the network connection state that should exist at the node during normal operation to be registered. If the connection state information obtained from the connection status monitoring means 17 at the time of a failure differs from the expected connection state registered in the connection state registration means 18, the part related to that difference is considered highly likely to contain the cause of the failure or information for identifying the cause, so the device can decide to collect more detailed failure analysis data from the related part.
[0031]
FIG. 7 shows an example of the information at the time of the failure collected by the connection status monitoring means 17 (FIG. 7(a)) and the expected network connection state registered in the connection state registration means 18 (FIG. 7(b)). Comparing the two shows that failure analysis information should be collected from node 3, whose connection with own port 2000 should exist but is absent.
[0032]
For the node on which a failure occurs, information on which connections should be established during normal operation can be registered in advance. This information is compared with the actual connection status at the time of the failure and, if there is a difference, detailed failure analysis data is additionally collected from the nodes related to that difference, which makes it possible to collect more useful failure analysis data.
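The comparison between the registered and the observed connection state amounts to a set difference. The following Python sketch assumes a simplified record of (protocol, own port, partner node); the concrete rows are hypothetical stand-ins for FIG. 7(a) and FIG. 7(b), chosen so that the missing connection is the one on own port 2000 to node 3, as in the description.

```python
from typing import List, Set, Tuple

Conn = Tuple[str, int, str]  # (protocol, own port, partner node)


def nodes_to_investigate(expected: Set[Conn], actual: Set[Conn]) -> List[str]:
    """Partner nodes of every connection that differs between the two states."""
    missing = expected - actual     # should exist but is absent
    unexpected = actual - expected  # exists but was not registered
    return sorted({peer for _, _, peer in missing | unexpected})


# Hypothetical stand-in for FIG. 7(b): the registered, expected state.
expected_state = {("tcp", 2000, "node3"), ("tcp", 1500, "node2")}
# Hypothetical stand-in for FIG. 7(a): the state observed at the time of the failure.
actual_state = {("tcp", 1500, "node2")}

targets = nodes_to_investigate(expected_state, actual_state)
```

The description only discusses a connection that is missing; treating unregistered extra connections as a difference as well is a choice made for this sketch.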
[0033]
Embodiment 6.
Embodiment 6 according to the present invention will be described with reference to FIG. 8. FIG. 8 is a block diagram showing the configuration of the failure analysis data collection device 600 according to Embodiment 6.
[0034]
The failure analysis data collection device 600 is the failure analysis data collection device 200 of Embodiment 2 with the addition of traffic monitoring means 19, which records the network traffic state of the entire distributed system in order to identify the collection target. The other components and their operations are the same as those described in Embodiments 1 and 2, and their description is omitted here.
[0035]
The traffic monitoring means 19 is dedicated to monitoring network traffic and can keep monitoring the constantly changing network status in detail. A failure may be caused by abnormal network traffic. If, at the time of a failure, traffic exceeding a predefined volume regarded as abnormal has been detected, detailed failure analysis information can be additionally collected from the nodes involved in that abnormal traffic. For example, if abnormal traffic has occurred between node 2 and node 3, failure analysis information is collected from node 2, node 3, and the network devices on the relay route between them.
[0036]
Failures in a distributed system can also be caused by abnormal traffic on the network. A means for monitoring network traffic is therefore provided in the distributed system to accumulate traffic information and, when a failure occurs, to provide that information in response to an inquiry from the node concerned. Useful information can thus be obtained by additionally collecting detailed failure analysis data from the nodes involved in the abnormal network traffic.
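The threshold check and the resulting target list can be sketched as follows. This Python example is illustrative only; the traffic figures, the threshold value, the `routes` mapping, and the device name `switch-a` are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source node, destination node)


def nodes_for_abnormal_traffic(
    traffic: Dict[Pair, float],
    routes: Dict[Pair, List[str]],
    threshold: float,
) -> List[str]:
    """Nodes and relay devices involved in any flow whose volume exceeds the threshold."""
    targets = set()
    for pair, volume in traffic.items():
        if volume > threshold:
            targets.update(pair)                  # both end nodes
            targets.update(routes.get(pair, []))  # devices on the relay route
    return sorted(targets)


# Hypothetical traffic volumes accumulated by the traffic monitoring means 19.
traffic = {("node2", "node3"): 950.0, ("node1", "node2"): 40.0}
# Hypothetical relay route information.
routes = {("node2", "node3"): ["switch-a"]}

targets = nodes_for_abnormal_traffic(traffic, routes, threshold=500.0)
```

With these values only the flow between node2 and node3 is over the threshold, so the result contains those two nodes and the relay device between them, matching the example in the text.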
[0037]
Embodiment 7.
Embodiment 7 according to the present invention will be described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the failure analysis data collection device 700 according to Embodiment 7, and FIG. 10 is a table showing a processing history of the failure analysis data collection device 700.
[0038]
The failure analysis data collection device 700 is the failure analysis data collection device 200 of Embodiment 2 with the addition of processing flow monitoring means 20 for identifying the collection target. The other components and their operations are the same as those described in Embodiments 1 and 2, and their description is omitted here.
[0039]
Some processing in a distributed system proceeds across multiple nodes under workflow processing or a job control function. Such processing does not necessarily follow the same route every time; depending on the processing content and other factors, the route may differ or steps may be skipped, so it is not fixed. With this kind of processing, the problem may lie not in the node where the failure occurred but in a node that performed an earlier step. To collect failure analysis data that covers this case, a means is added that monitors the processing history the process in question has followed, so that the related nodes and programs can be identified.
[0040]
As shown in FIG. 10, the processing history data recorded by the processing flow monitoring means 20 includes, for example, the processing ID, execution order, processing node, processing program, start date and time, and end date and time. The data is recorded per processing ID, as the history of the processing to which that ID is assigned. If a process (ID = X11) fails on node 1 and stops, its past processing history can be obtained by querying the processing flow monitoring means 20. Based on this data, failure analysis data can be additionally collected for each program on each of the processing nodes through which the process has passed.
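Looking up the route a failed process has taken can be sketched as a query over history records keyed by processing ID. In the Python sketch below the record fields follow the columns listed for FIG. 10, but the rows themselves (nodes, program names, timestamps, and the second ID `X12`) are hypothetical, since the figure is not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HistoryEntry:
    process_id: str  # processing ID
    step: int        # execution order
    node: str        # processing node
    program: str     # processing program
    start: str       # start date and time
    end: str         # end date and time ("" if the step did not finish)


def route_of(history: List[HistoryEntry], process_id: str) -> List[Tuple[str, str]]:
    """(node, program) pairs the process passed through, in execution order."""
    steps = sorted(
        (h for h in history if h.process_id == process_id),
        key=lambda h: h.step,
    )
    return [(h.node, h.program) for h in steps]


# Hypothetical history recorded by the processing flow monitoring means 20.
history = [
    HistoryEntry("X11", 2, "node2", "convert", "2003-06-12 10:01", "2003-06-12 10:02"),
    HistoryEntry("X12", 1, "node2", "receive", "2003-06-12 10:00", "2003-06-12 10:03"),
    HistoryEntry("X11", 1, "node3", "receive", "2003-06-12 10:00", "2003-06-12 10:01"),
    HistoryEntry("X11", 3, "node1", "store", "2003-06-12 10:02", ""),
]

# Process X11 failed on node1: find every node and program it passed through.
targets = route_of(history, "X11")
```

Each (node, program) pair in `targets` identifies where additional failure analysis data would be requested, including the upstream nodes that handled the process before it failed.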
[0041]
[Effects of the invention]
As described above, according to the present invention, when analysis data is collected at the time of a failure, the device does not merely save the logs and traces related to the collection processing that each program has been recording during normal operation; it can also dynamically select and additionally collect detailed data useful for failure analysis according to the situation at the time of the failure. As a result, the data needed to analyze the failure is more readily available, and failure analysis can be performed more quickly.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 2 of the present invention.
FIG. 3 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 3 of the present invention.
FIG. 4 is a table showing connection information for explaining the third embodiment.
FIG. 5 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 4 of the present invention.
FIG. 6 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 5 of the present invention.
FIG. 7 is a table showing connection information for explaining the fifth embodiment.
FIG. 8 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 6 of the present invention.
FIG. 9 is a block diagram showing a configuration of a failure analysis data collection device according to Embodiment 7 of the present invention.
FIG. 10 is a table showing a processing history for explaining the seventh embodiment.
[Explanation of symbols]
1, 2, 3: monitoring target nodes; 5: network; 10: monitoring target resource; 11: failure monitoring means; 12: data collection control means; 13a, 13b, 13c: data collection means; 14: collection target specifying means; 15: inter-node cooperation means; 16: connection status investigation means; 17: connection status monitoring means; 18: connection state registration means; 19: traffic monitoring means; 20: processing flow monitoring means; 23a, 23b, 23c: data recording means.

Claims

1. A failure analysis data collection device configured in a distributed computer system, comprising:
failure monitoring means for monitoring and detecting the occurrence of a failure;
collection target specifying means for investigating the situation at the time of the failure and identifying the nodes and the failure analysis data to be collected;
data collection means for collecting the failure analysis data; and
data collection control means for controlling the failure analysis data collection processing.

2. The failure analysis data collection device according to claim 1, further comprising inter-node cooperation means for exchanging information between a plurality of nodes.

3. The failure analysis data collection device according to claim 2, further comprising connection status investigation means for examining the connection status of the own node.

4. The failure analysis data collection device according to claim 2, wherein connection status monitoring means for monitoring the connection status of the entire network is provided in the distributed system.

5. The failure analysis data collection device according to claim 4, further comprising connection state registration means for registering the connection state that should exist.

6. The failure analysis data collection device according to claim 2, further comprising traffic monitoring means for monitoring the traffic of the entire network.

7. The failure analysis data collection device according to claim 2, further comprising processing flow monitoring means for recording a history of the processing that has occurred in each node.

8. A failure analysis data collection method for collecting failure analysis data in a distributed computer system, the method comprising: monitoring and detecting the occurrence of a failure; investigating the situation at the time of the failure and determining the nodes and the failure analysis data to be collected; and collecting the failure analysis data and performing the processing for collecting the failure analysis data.

9. The failure analysis data collection method according to claim 8, wherein information is exchanged between a plurality of nodes.

10. The failure analysis data collection method according to claim 8, wherein the connection status of the own node is investigated.

11. The failure analysis data collection method according to claim 8, wherein the connection status of the entire network is monitored.

12. The failure analysis data collection method according to claim 11, wherein the connection state that should exist is registered.

13. The failure analysis data collection method according to claim 8, wherein the traffic of the entire network is monitored.

14. The failure analysis data collection method according to claim 8, wherein a history of the processing that has occurred in each node is recorded.