JP2012181699A

JP2012181699A - System for collecting failure investigation information material, administrative server, method for collecting failure investigation information material, and program therefor

Info

Publication number: JP2012181699A
Application number: JP2011044417A
Authority: JP
Inventors: Yosuke Hibi; 洋介日比
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-03-01
Filing date: 2011-03-01
Publication date: 2012-09-20
Anticipated expiration: 2031-03-01
Also published as: JP5732913B2

Abstract

PROBLEM TO BE SOLVED: To provide a system for collecting failure investigation information material, and the like, that enables the material needed to identify the cause of a failure to be collected properly according to the type of the failure.

SOLUTION: A system for collecting failure investigation information material has: a failure diagnosis unit 101 that, when an administrative server 10 receives failure information from a monitoring target server 20 indicating that a failure has occurred, determines the type of failure material to be collected from the monitoring target server 20 by looking it up in collection information data in failure 113 according to the name of the failure; a collection material determination unit 104 that acquires, from collection target failure material data 112, the collection method corresponding to the type of failure material and the influence that executing the collection method has on the monitoring target server 20; a failure material management unit 105 that determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and its influence with the current state of the monitoring target server 20; and a failure material request unit 103 that acquires, from the monitoring target server 20, the failure material determined to be acquirable.

Description

The present invention relates to a failure investigation information material collection system, a management server, a failure investigation information material collection method, and a program therefor, and in particular to a failure investigation information material collection system and the like that make it possible to appropriately collect the necessary material according to the type of failure that has occurred in a computer network.

Now that computer networks form the backbone of core business operations at companies and other organizations, failures that occur during their operation must be dealt with promptly. When a failure occurs, identifying its cause requires gathering the necessary information, such as operation logs and input logs, from the computers and network devices operating on the network. In this specification, this is referred to as collecting material.

This material must be collected only to the extent that doing so does not adversely affect other computers, applications, network devices, and the like that are operating normally on the computer network. For this reason, material is usually collected outside time periods such as weekday daytime hours, when core business applications are running.

The following technical documents relate to this. Among them, Patent Document 1 describes a failure investigation material collection system in which the risk arising from collecting information at the time of a failure is defined in advance by the system administrator as a policy, and material is collected within that range.

Patent Document 2 describes a failure investigation material collection system that determines the type of material to collect by comparing failures that occurred in the past with the current failure. Patent Document 3 describes a fault handling system that analyzes the cause of a fault and takes action based on a policy that defines the effects caused by collecting material.

Patent Document 4 describes a failure monitoring system that predicts the likelihood of a failure by comparing the current failure with similar past failures. Patent Document 5 describes a method of controlling a computer system that uses correlation information calculated from information on the operating state to judge the effect of a failure recovery process and to decide whether, when, and in what order to execute it.

Patent Document 6 describes an audit log collection and comparison system that aggregates specific index values from audit logs recording software operation and evaluates those values by comparison with earlier logs. Patent Document 7 describes an information collection procedure management system that can centralize material collection by using a collection information database.

Patent Document 1: JP 2006-195687 A
Patent Document 2: JP 2003-345628 A
Patent Document 3: JP 2004-080297 A
Patent Document 4: JP 2007-293393 A
Patent Document 5: JP 2008-009842 A
Patent Document 6: JP 2009-110220 A
Patent Document 7: JP 2009-193207 A

The existing failure investigation material collection system described in Patent Document 1 cannot switch the policy used for risk judgment, specifically the types of material to collect and how they are collected, according to the type of failure that has occurred.

More specifically, when the processing speed of an application has dropped (application slowdown), material must be collected only to the extent that this does not hinder other running applications. When an application has stopped (application down), however, that application must be brought back into operation, so it is appropriate to give priority to collecting material even if this hinders other core business software.

However, because this failure investigation material collection system cannot switch the material to collect or the collection method according to the type of failure, it cannot take appropriate measures that account for this difference.

In addition, in this failure investigation material collection system, once material that needs to be collected is judged not to satisfy the conditions set as the policy, it is treated as material that will not be collected from then on. For example, suppose the policy is set so that material is not collected if collecting it would push the CPU (Central Processing Unit) usage rate beyond a specific numerical range, and the CPU usage rate is judged to exceed that range. The material is then never collected, even if the CPU usage rate later falls, for instance at night or on holidays when other applications are not running, and collection becomes possible.

Patent Documents 2 to 7 do not describe any technique that can solve these problems of the technique described in Patent Document 1. Patent Document 5 describes a technique that judges the effect of a failure recovery process and decides, among other things, whether to execute it, but it does not judge the effect of collecting material. The remaining Patent Documents 1 to 4, 6, and 7 likewise do not describe a technique that makes this judgment and thereby allows material to be collected appropriately.

An object of the present invention is to provide a failure investigation information material collection system, a management server, a failure investigation information material collection method, and a program therefor that judge, according to the type of failure that has occurred in a network, the effect that collecting the material needed to identify its cause has on the network as a whole, and that make it possible to collect that material appropriately.

To achieve the above object, a failure investigation information material collection system according to the present invention is a system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitored server. The monitored server includes a failure detection unit that detects the occurrence of a failure based on occurrence conditions sent in advance from the management server and sends failure information to that effect to the management server, and a failure material collection unit that collects failure material in response to a request from the management server. The management server includes: a failure diagnosis unit that, on receiving failure information from the monitored server indicating that a failure has occurred, obtains from pre-stored failure-time collection information data the types of failure material to collect from the monitored server that correspond to the failure name contained in the failure information; a collection material determination unit that obtains, from pre-stored collection target failure material data, the collection method corresponding to each type of failure material and the effect that the method has on the monitored server; a failure material management unit that compares the type of failure material, the collection method, and the effect with the current state of the monitored server to judge whether the failure material can be acquired; and a failure material request unit that requests the failure material collection unit of the monitored server to acquire the failure material judged to be acquirable.

To achieve the above object, a management server according to the present invention is connected to a monitored server and collects, from the monitored server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitored server. The management server includes: a failure diagnosis unit that, on receiving failure information from the monitored server indicating that a failure has occurred, obtains from pre-stored failure-time collection information data the types of failure material to collect from the monitored server that correspond to the type of failure contained in the failure information; a collection material determination unit that obtains, from pre-stored collection target failure material data, the collection method corresponding to each type of failure material and the effect that the method has on the monitored server; a failure material management unit that compares the type of failure material, the collection method, and the effect with the current state of the monitored server to judge whether the failure material can be acquired; and a failure material request unit that acquires, from the monitored server, the failure material judged to be acquirable.

To achieve the above object, a failure investigation information material collection method according to the present invention is used in a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitored server. In this method, the management server receives failure information from the monitored server indicating that a failure has occurred; the failure diagnosis unit of the management server obtains, from pre-stored failure-time collection information data, the types of failure material corresponding to the type of failure contained in the failure information; the collection material determination unit of the management server obtains, from pre-stored collection target failure material data, the collection method corresponding to each type of failure material and the effect that the method has on the monitored server; the failure material management unit of the management server compares the type of failure material, the collection method, and the effect with the current state of the monitored server to judge whether the failure material can be acquired; and the failure material request unit of the management server acquires, from the monitored server, the failure material judged to be acquirable.

To achieve the above object, a failure investigation information material collection program according to the present invention is used in a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitored server. The program causes a computer provided in the management server to execute: a procedure of obtaining, from pre-stored failure-time collection information data, the types of failure material corresponding to the type of failure contained in the failure information; a procedure of obtaining, from pre-stored collection target failure material data, the collection method corresponding to each type of failure material and the effect that the method has on the monitored server; a procedure of comparing the type of failure material, the collection method, and the effect with the current state of the monitored server to judge whether the failure material can be acquired; and a procedure of acquiring, from the monitored server, the failure material judged to be acquirable.

As described above, the present invention is configured to judge whether failure material can be acquired by comparing the effect of collecting it, according to the type of failure that has occurred, with the current state of the monitored server. This makes it possible to provide a failure investigation information material collection system, a management server, a failure investigation information material collection method, and a program therefor that have the excellent feature of being able to appropriately collect the material needed to identify the cause of a failure according to the type of failure that has occurred in the network.

FIG. 1 is an explanatory diagram showing, more conceptually, the operation of the management server, the monitored server, and the management terminal that make up the failure investigation information collection system shown in FIG. 2.
FIG. 2 is an explanatory diagram showing the configuration of the failure investigation information collection system according to the first embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of the contents of the tolerance data shown in FIGS. 1 and 2.
FIG. 4 is an explanatory diagram showing an example of the contents of the collection target failure material data shown in FIGS. 1 and 2.
FIG. 5 is an explanatory diagram showing an example of the contents of the failure-time collection information data shown in FIGS. 1 and 2.
FIG. 6 is an explanatory diagram showing an example of the contents of the failure material data shown in FIGS. 1 and 2.
FIG. 7 is an explanatory diagram showing an example of the contents of the system state data shown in FIGS. 1 and 2.
FIG. 8 is an explanatory diagram showing an example of the contents of the occurrence condition data of the monitored server shown in FIGS. 1 and 2.
FIG. 9 is a flowchart showing the failure material collection operation performed by the failure investigation information collection system shown in FIGS. 1 and 2.

(Embodiment)
Hereinafter, the configuration of an embodiment of the present invention will be described with reference to the accompanying FIGS. 1 and 2.
First, the basic content of the embodiment is described, followed by more specific details.
The failure investigation information collection system 1 according to the present embodiment is a failure investigation information material collection system in which a management server 10 and a monitored server 20 are connected to each other and the management server collects, from the monitored server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitored server. The monitored server 20 includes a failure detection unit 201 that detects the occurrence of a failure based on occurrence conditions sent in advance from the management server and sends failure information to that effect to the management server, and a failure material collection unit 203 that collects failure material in response to a request from the management server. The management server 10 includes: a failure diagnosis unit 101 that, on receiving failure information from the monitored server indicating that a failure has occurred, obtains from pre-stored failure-time collection information data 113 the types of failure material to collect from the monitored server that correspond to the failure name contained in the failure information; a collection material determination unit 104 that obtains, from pre-stored collection target failure material data 112, the collection method corresponding to each type of failure material and the effect that the method has on the monitored server; a failure material management unit 105 that compares the type of failure material, the collection method, and the effect with the current state of the monitored server to judge whether the failure material can be acquired; and a failure material request unit 103 that requests the failure material collection unit of the monitored server to acquire the failure material judged to be acquirable.

The failure-time collection information data 113 also stores a priority tolerance, which is the criterion for judging whether the failure material corresponding to a failure name can be acquired. The failure material management unit 105 has a function of judging whether failure material can be acquired based on the priority tolerance corresponding to the failure name.
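As a rough illustration of the priority tolerance check, the sketch below assumes that each collection method carries a numeric impact level and that each failure name maps to a numeric priority tolerance. The numeric scale, the names `PRIORITY_TOLERANCE`, `CollectionMethod`, and `is_acquirable`, and all sample values are hypothetical and are not taken from the figures of this publication.

```python
from dataclasses import dataclass

# Hypothetical priority tolerances per failure name
# (a higher value means more impact on the monitored server is tolerated).
PRIORITY_TOLERANCE = {
    "application_slowdown": 1,  # other running applications must not be disturbed
    "application_down": 3,      # collection takes priority over other workloads
}


@dataclass(frozen=True)
class CollectionMethod:
    material_type: str
    impact: int  # hypothetical impact level of running this collection


def is_acquirable(failure_name: str, method: CollectionMethod) -> bool:
    """Return True when the method's impact stays within the failure's tolerance."""
    # Assumption: failure names without an entry tolerate no impact at all.
    tolerance = PRIORITY_TOLERANCE.get(failure_name, 0)
    return method.impact <= tolerance


memory_dump = CollectionMethod("memory_dump", impact=3)
print(is_acquirable("application_slowdown", memory_dump))  # False
print(is_acquirable("application_down", memory_dump))      # True
```

With these assumed values, the same material is rejected for a slowdown but accepted when the application is down, which mirrors the distinction drawn in the description of the problem.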

The management server 10 has a system information request unit 102 that requests the monitored server to obtain those items of its current state that can be obtained from the monitored server itself. It also has a system state input unit 106 that accepts and stores input for those items of the current state that a user can enter.

The failure material request unit 103 has a function of waiting until failure material judged not to be acquirable becomes acquirable and then acquiring it. In addition, the management server 10 is connected to a management terminal 30 through which a user can enter the failure-time collection information data and the collection target failure material data in advance.
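The wait-then-acquire behaviour can be pictured as a loop that re-checks each pending material until it becomes acquirable. The following is a minimal sketch under assumptions: the function `collect_with_deferral`, its callbacks, and the bounded number of rounds are all hypothetical, and a real system would wait between rounds rather than loop immediately.

```python
from typing import Callable, Dict, List


def collect_with_deferral(
    pending: List[str],
    can_collect: Callable[[str], bool],
    collect: Callable[[str], bytes],
    max_rounds: int,
) -> Dict[str, bytes]:
    """Collect each material once it becomes acquirable, re-checking every round."""
    collected: Dict[str, bytes] = {}
    remaining = list(pending)
    for _ in range(max_rounds):
        still_waiting = []
        for material in remaining:
            if can_collect(material):
                collected[material] = collect(material)
            else:
                still_waiting.append(material)  # keep it for a later round
        remaining = still_waiting
        if not remaining:
            break
    return collected


# Hypothetical CPU usage (%) observed at each successive check.
cpu_by_round = iter([90, 90, 20])


def cpu_is_low(_material: str) -> bool:
    return next(cpu_by_round) < 50


result = collect_with_deferral(
    ["memory_dump"], cpu_is_low, lambda m: b"data:" + m.encode(), max_rounds=5
)
print(result)  # {'memory_dump': b'data:memory_dump'}
```

In this run the material is skipped twice while the assumed CPU usage is high and collected on the third check, instead of being dropped permanently after the first failed check.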

With the above configuration, the failure investigation information collection system 1 can appropriately collect the material needed to identify the cause of a failure according to the type of failure that has occurred in the network.
This is described in more detail below.

FIG. 2 is an explanatory diagram showing the configuration of the failure investigation information collection system 1 according to the first embodiment of the present invention. The failure investigation information collection system 1 (hereinafter sometimes simply called the system) consists of a management server 10 that manages the operations for collecting failure investigation information, a monitored server 20 that performs the business operations being monitored, and a management terminal 30 on which the system administrator enters operations, all connected to one another via a network 40. There is no particular limit on the number of management servers 10, monitored servers 20, or management terminals 30.

The management server 10 is configured as a general computer device. That is, it includes a main arithmetic control means (CPU: Central Processing Unit) 11, which executes the various processes written as computer programs; a storage means 12, which stores programs and data; and a communication means 13, which connects to the network 40 and performs data communication with other information processing devices. Other elements are not illustrated because they are not needed to describe the present embodiment.

When the failure investigation information collection program runs on the main arithmetic control means 11 of the management server 10, the main arithmetic control means 11 functions as each of the failure diagnosis unit 101, the system information request unit 102, the failure material request unit 103, the collection material determination unit 104, the failure material management unit 105, and the system state input unit 106. The storage means 12 stores tolerance data 111, collection target failure material data 112, failure-time collection information data 113, failure material data 114, and system state data 115. Each of these is described later.

The monitored server 20 is also configured as a general computer device. That is, it includes a main arithmetic control means 21, a communication means 22, and a storage means 23, which are similar to those of the management server 10. Input/output means and other elements are not illustrated because they are not needed to describe the present embodiment.

A failure investigation information collection program separate from that of the management server 10 runs on the main arithmetic control means 21 of the monitored server 20, so that the main arithmetic control means 21 functions as each of the failure detection unit 201, the system information collection unit 202, the failure material collection unit 203, and the main business program operation unit 204. The storage means 23 stores occurrence condition data 211. Each of these is also described later.

The management terminal 30 is also configured as a general computer device. That is, it includes a main arithmetic control means 31 and a communication means 32, which are similar to those of the management server 10 and the monitored server 20. It further includes an input/output means 33 that accepts operation input from the user and presents processing results to the user. Storage means and other elements are not illustrated because they are not needed to describe the present embodiment.

A failure investigation information collection program separate from those of the management server 10 and the monitored server 20 runs on the main arithmetic control means 31 of the management terminal 30, so that the main arithmetic control means 31 functions as each of the collection target failure material registration unit 301, the failure-time collection information registration unit 302, the tolerance registration unit 303, the failure material management unit 304, and the system state registration unit 305. Each of these is also described later.

FIG. 1 is an explanatory diagram showing, more conceptually, the operation of the management server 10, the monitored server 20, and the management terminal 30 that make up the failure investigation information collection system 1 shown in FIG. 2.

The failure diagnosis unit 101 uses the failure-time collection information data 113 to determine the failure material to be collected according to the content of the failure information received from the failure detection unit 201, and passes the result to the collection material determination unit 104.
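The lookup performed by the failure diagnosis unit can be sketched as a simple table keyed by failure name. The table contents, the `failure_name` key, and the function name `diagnose` below are illustrative assumptions; the actual layout of the failure-time collection information data is given in a figure that is not reproduced in this text.

```python
from typing import Dict, List

# Hypothetical contents of the failure-time collection information data (113):
# failure name -> types of failure material to collect.
FAILURE_TIME_COLLECTION_INFO: Dict[str, List[str]] = {
    "application_slowdown": ["application_log", "cpu_statistics"],
    "application_down": ["application_log", "memory_dump"],
}


def diagnose(failure_info: Dict[str, str]) -> List[str]:
    """Return the material types to collect for the reported failure name."""
    failure_name = failure_info["failure_name"]
    # Assumption: a failure name with no entry yields nothing to collect.
    return list(FAILURE_TIME_COLLECTION_INFO.get(failure_name, []))


print(diagnose({"failure_name": "application_down"}))
# ['application_log', 'memory_dump']
```

The returned list is what would be handed on to the next stage, which resolves each material type to a collection method and its effect.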

The system information request unit 102 receives from the system information collection unit 202 the information needed to judge the direct risk that collecting failure material poses to the failure investigation information collection system 1 as a whole, for example the file size and the CPU usage rate, and passes it to the failure material management unit 105.

The failure material request unit 103 receives information on the failure material to be collected from the failure material management unit 105 and requests the failure material collection unit 203 to send that material. It then receives the failure material returned by the failure material collection unit 203 and stores it in the failure material data 114.

The collection material determination unit 104 receives information on the failure material to be collected from the failure diagnosis unit 101, reads the specific acquisition method and the effect of the collection for that material from the collection target failure material data 112, and passes them to the failure material management unit 105.
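This step amounts to resolving each material type into a collection method and its effect. The sketch below assumes a dictionary-shaped store; the entries, the `CollectionEntry` fields, and the function name `determine` are hypothetical stand-ins for the collection target failure material data, whose real contents appear only in a figure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CollectionEntry:
    method: str  # hypothetical name of the collection procedure
    impact: str  # hypothetical description of the effect on the monitored server


# Hypothetical contents of the collection target failure material data (112).
COLLECTION_TARGET_DATA: Dict[str, CollectionEntry] = {
    "application_log": CollectionEntry("copy_log_file", "disk_read"),
    "memory_dump": CollectionEntry("dump_process_memory", "application_pause"),
}


def determine(material_types: Iterable[str]) -> Dict[str, CollectionEntry]:
    """Map each known material type to its collection method and effect."""
    # Assumption: material types without an entry are silently skipped.
    return {
        m: COLLECTION_TARGET_DATA[m]
        for m in material_types
        if m in COLLECTION_TARGET_DATA
    }


resolved = determine(["application_log", "memory_dump", "unregistered_type"])
print(sorted(resolved))  # ['application_log', 'memory_dump']
```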

Based on the information received from the collection material determination unit 104, the system information request unit 102, and the system state input unit 106, the failure material management unit 105 refers to the tolerance data 111 and the failure material data 114, decides the “failure material to be collected”, and passes the decision to the failure material request unit 103. The details of this operation are described later.

The system state input unit 106 acquires user input about the risk-related state of the system and sends it to the failure material management unit 105. Here, the “risk-related state of the system” means an operating state such as the system being “in production operation” or “under failure handling”. Some of these states, such as the date and time, can instead be acquired through the internal clock of the management server 10 or similar means.

In this embodiment, the information that the system information request unit 102 acquires from the monitoring target server 20 is called “system information”, and all other information (for example, input by the system administrator) is called “system state”. Both are used by the failure material management unit 105 to determine whether each material can be collected, and both correspond to the “current state of the monitoring target server” referred to in the claims.

The failure detection unit 201 stores, as the occurrence condition data 211, the conditions for judging that a failure has occurred, which are sent in advance from the management server 10. Using these conditions, it detects failures that occur in the monitoring target server 20 and sends the failure information to the failure diagnosis unit 101.

On the monitoring target server 20, the system information collection unit 202 responds to a request from the system information request unit 102 by collecting the information needed to judge the direct risk to the system, and returns it to the system information request unit 102.

On the monitoring target server 20, the failure material collection unit 203 responds to a request from the failure material request unit 103 by collecting the failure material needed to investigate the failure, and returns it to the failure material request unit 103.

The business program operation unit 204 runs the business program (not shown), a computer program that performs the operations of the business being monitored.

In the management terminal 30, the collection target failure material registration unit 301, the failure collection information registration unit 302, the tolerance registration unit 303, the failure material management unit 304, and the system state registration unit 305 add, delete, change, and refer to the contents of the collection target failure material data 112, the failure collection information data 113, the tolerance data 111, the failure material data 114, and the system state data 115, respectively, through the input/output means 33.

FIG. 3 is an explanatory diagram showing an example of the contents of the tolerance data 111 shown in FIGS. 1 and 2. FIG. 4 is an explanatory diagram showing an example of the contents of the collection target failure material data 112. FIG. 5 is an explanatory diagram showing an example of the contents of the failure collection information data 113. FIG. 6 is an explanatory diagram showing an example of the contents of the failure material data 114. FIG. 7 is an explanatory diagram showing an example of the contents of the system state data 115. FIG. 8 is an explanatory diagram showing an example of the contents of the occurrence condition data 211 of the monitoring target server 20 shown in FIGS. 1 and 2.

The tolerance data 111 (FIG. 3) includes a tolerance number 111a assigned uniquely to each tolerance setting, an influence target 111b indicating what the tolerance applies to, and an allowable influence degree 111c describing the range that the system can tolerate for the influence target 111b. In other words, the acceptable range for each item listed as an influence target 111b is given as its allowable influence degree 111c.

The influence target 111b covers not only data that the system information request unit 102 acquires directly from the monitoring target server 20, such as file size and CPU utilization (system information), but also items such as the date and time, the day of the week, or the operating status of a program (system state). In the example shown in FIG. 3, the influence target 111b includes “date and time” and “business” in addition to “file size” and “CPU utilization”. The allowable influence degree 111c for “date and time” is “not permitted on weekdays from 9:00 to 18:00”, and the allowable influence degree 111c for “business” is “not permitted while the business program is running”.

As described later, “file size” and “CPU utilization” are compared with the system information acquired from the monitoring target server 20, while “date and time” and “business” are compared with the system state obtained from the system administrator's input or similar, to judge whether each is within its allowable range.
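The tolerance data of FIG. 3 and the “date and time” rule can be sketched as follows. This is only an illustration: the class name, the numbering of the rows, and the file-size and CPU limits are assumptions, since the text does not give them, and treating 18:00 itself as permitted is one possible reading of the rule.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tolerance:
    number: int   # tolerance number 111a (row order here is assumed)
    target: str   # influence target 111b
    allowed: str  # allowable influence degree 111c, kept as free text


# Rows modeled on FIG. 3. The file-size and CPU limits are hypothetical;
# the source text does not state their values.
TOLERANCE_DATA = [
    Tolerance(1, "file size", "up to 100 MB (hypothetical)"),
    Tolerance(2, "CPU utilization", "up to 80% (hypothetical)"),
    Tolerance(3, "date and time", "not permitted on weekdays, 9:00-18:00"),
    Tolerance(4, "business", "not permitted while the business program runs"),
]

# Which kind of current-state information each target is compared with.
SYSTEM_INFORMATION = {"file size", "CPU utilization"}  # from server 20
SYSTEM_STATE = {"date and time", "business"}           # clock / administrator


def datetime_allows_collection(now: datetime) -> bool:
    """Return False on weekdays from 9:00 up to (not including) 18:00."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return not (is_weekday and 9 <= now.hour < 18)
```

Keeping the rule as a small predicate makes it easy to re-evaluate each time the system waits and retries.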

The collection target failure material data 112 (FIG. 4) includes a collection material number 112a assigned uniquely to each failure material to be collected, a collection material name 112b giving the name of the material, a collection method 112c indicating how the material is collected, and an influence 112d indicating the effect that the collection has on the system.

The collection method 112c is either the command used for collection or a URI (Uniform Resource Identifier), including the host name, of the file to be collected. In the example shown in FIG. 4, the “operation log” and the “input log” are acquired through the URI of each file, the “host name” is acquired with hostname, and the “CPU (main arithmetic control means 21) utilization” is acquired by sending the “sar” command with an option that means “acquire the CPU utilization once per second”.

The influence 112d records the direct effect that collecting the information has on the system, for example the size of the data file acquired or the increase in the utilization of the CPU (main arithmetic control means 21).
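A minimal sketch of the collection target failure material data of FIG. 4 follows. The material names and their numbers come from the text; the URIs, the exact sar arguments, and the influence descriptions are placeholders chosen for illustration.

```python
from typing import NamedTuple


class CollectionTarget(NamedTuple):
    number: int     # collection material number 112a
    name: str       # collection material name 112b
    method: str     # collection method 112c (command or URI)
    influence: str  # influence 112d on the system


# URIs, the sar arguments, and the influence text are hypothetical.
COLLECTION_TARGETS = {
    t.number: t
    for t in [
        CollectionTarget(1, "operation log",
                         "file://server20.example/var/log/app/operation.log",
                         "file size"),
        CollectionTarget(2, "input log",
                         "file://server20.example/var/log/app/input.log",
                         "file size"),
        CollectionTarget(3, "host name", "hostname", "none"),
        CollectionTarget(4, "CPU utilization", "sar -u 1",
                         "CPU utilization"),
    ]
}


def resolve(numbers):
    """Map collection material numbers to their entries, keeping order."""
    return [COLLECTION_TARGETS[n] for n in numbers]
```

For example, `resolve([1, 4, 3])` yields the operation log, CPU utilization, and host name entries in that order.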

The failure collection information data 113 (FIG. 5) includes a failure name 113a indicating the type of failure, a determination method 113b indicating the method and condition for judging that the failure has occurred, collection information 113c indicating the materials to collect when the failure occurs, and a priority allowable influence degree 113d indicating how risk should be handled with priority when the failure occurs.

The entries in the collection information 113c are expressed as collection material numbers 112a of the collection target failure material data 112 and are listed in priority order. Each entry in the priority allowable influence degree 113d consists of a tolerance number 111a of the tolerance data 111 and a condition for the influence target 111b that the tolerance number identifies. That is, the priority allowable influence degree 113d records how risk should be handled with priority when the failure named in the failure name 113a occurs.

In the example shown in FIG. 5, for the failure name 113a “AP (application) slowdown”, which means a drop in the processing speed of the application, the determination method 113b specifies that the failure is judged to have occurred when the condition “the performance value obtained with the apperf command is 30 or less” is detected.

The collection information 113c for this failure is “1, 4, 3”, meaning that the “operation log”, the “CPU utilization”, and the “host name” are acquired in that order. The priority allowable influence degree 113d is blank in this case, so each item is acquired within the allowable ranges given in the tolerance data 111.

In contrast, for the failure name 113a “AP down”, which means that the application has stopped, the determination method 113b specifies that the failure is judged to have occurred when the following condition is detected, where “APNAME” is the process name of the application being checked: “the list of running processes is acquired and does not contain the process name (APNAME)”.

The collection information 113c for this failure is “1, 2, 3”, meaning that the “operation log”, the “input log”, and the “host name” are acquired in that order. The priority allowable influence degree 113d is “file size: unlimited” and “business: unlimited”, so the conditions for “file size” and “business” in the tolerance data 111 are ignored and each item is acquired as long as the allowable ranges for “CPU utilization” and “date and time” are satisfied.
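The two FIG. 5 entries and the way a priority allowable influence degree waives individual tolerance conditions can be sketched as below. The dictionary layout and the function name are illustrative assumptions; the failure names, material numbers, and “unlimited” settings follow the text.

```python
# Failure collection information data 113, reduced to 113a, 113c, and 113d.
FAILURE_COLLECTION_INFO = {
    "AP slowdown": {"collect": [1, 4, 3], "priority": {}},
    "AP down": {
        "collect": [1, 2, 3],
        "priority": {"file size": "unlimited", "business": "unlimited"},
    },
}

# Influence targets 111b registered in the tolerance data 111 (FIG. 3).
ALL_TARGETS = ["file size", "CPU utilization", "date and time", "business"]


def targets_to_check(failure_name):
    """Return the tolerance conditions that still apply for this failure."""
    priority = FAILURE_COLLECTION_INFO[failure_name]["priority"]
    return [t for t in ALL_TARGETS if priority.get(t) != "unlimited"]
```

With these entries, an “AP down” failure leaves only the CPU utilization and date-and-time conditions in force, while an “AP slowdown” failure keeps all four.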

The failure material data 114 (FIG. 6) includes a failure number 114a assigned to each failure that triggered information collection, a failure name 114b giving the name of the failure, a registration date and time 114c indicating when the failure was registered, collected material 114d indicating the failure materials targeted for collection, and a collection status 114e indicating whether each material has been collected.

Even when the same type of failure occurs several times, the cause may differ each time. And even when the same type of failure recurs for the same cause, it must be dealt with on every occurrence. Materials are therefore collected each time a failure occurs, and a new failure number 114a is assigned each time.

The collected material 114d is expressed as collection material numbers 112a of the collection target failure material data 112, in the order given by the collection information 113c of the failure collection information data 113. The collection status 114e takes one of three values: “pending”, “requested”, or “collected”. “Pending” means that no action has yet been taken for the material; “requested” means that the failure material request unit 103 has been asked to collect the material but the actual collection has not yet taken place; and “collected” means that collection of the material is complete. The initial value is “pending”.
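The three collection status values (未済, 要求済, 採取済, rendered here as PENDING, REQUESTED, and COLLECTED) form a simple one-way progression. A minimal sketch, with the enum and function names chosen only for illustration:

```python
from enum import Enum


class CollectionStatus(Enum):
    PENDING = "pending"      # no action taken yet (initial value)
    REQUESTED = "requested"  # request issued, material not yet collected
    COLLECTED = "collected"  # collection complete


# Allowed forward transitions; COLLECTED is terminal.
_NEXT = {
    CollectionStatus.PENDING: CollectionStatus.REQUESTED,
    CollectionStatus.REQUESTED: CollectionStatus.COLLECTED,
}


def advance(status):
    """Move a status one step forward, rejecting moves past COLLECTED."""
    if status not in _NEXT:
        raise ValueError(f"{status.name} is a terminal status")
    return _NEXT[status]
```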

As described above, the system state data 115 (FIG. 7) contains the “risk-related state of the system”, which is either entered by the system administrator through the management terminal 30 or acquired through the internal clock of the management server 10.

In the example shown in FIG. 7, the system state data 115 contains the items “date and time” and “business (operating status of the business program)”. The “date and time” is acquired from the internal clock, and the “business” item is entered by the system administrator through the system state registration unit 305 of the management terminal 30. The failure material management unit 105 compares these states with the allowable influence degree 111c to determine whether each material can be collected.

The occurrence condition data 211 (FIG. 8) of the monitoring target server 20 holds the same data as the failure name 113a and the determination method 113b of the failure collection information data 113, sent in advance from the failure diagnosis unit 101 to the failure detection unit 201 and stored there. Only the failure names 113a and determination methods 113b that apply to each monitoring target server 20 may be sent from the failure diagnosis unit 101 to the failure detection unit 201. The failure detection unit 201 detects the occurrence of a failure based on the occurrence conditions stored in this way.
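The two occurrence conditions described for FIG. 5 can be evaluated as in the sketch below. The function name and its parameters are assumptions; the performance value and the process list are passed in as plain arguments rather than obtained by running apperf or a process-listing command.

```python
def detect_failures(performance_value, running_processes, ap_name):
    """Return the names of the failures whose occurrence condition holds.

    performance_value: number reported by the performance check
    running_processes: iterable of names of currently running processes
    ap_name: process name of the application being checked (APNAME)
    """
    detected = []
    # "AP slowdown": the performance value is 30 or less.
    if performance_value <= 30:
        detected.append("AP slowdown")
    # "AP down": the process list does not contain APNAME.
    if ap_name not in set(running_processes):
        detected.append("AP down")
    return detected
```

The returned failure name is what the detection side would send to the management server as failure information.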

FIG. 9 is a flowchart showing the failure material collection operation performed by the failure investigation information collection system 1 shown in FIGS. 1 and 2. First, the failure diagnosis unit 101 sends the failure names 113a and determination methods 113b of the failure collection information data 113 to the failure detection unit 201 of each monitoring target server 20 in advance. The failure detection unit 201 stores them as the occurrence condition data 211 (step S101). The failure detection unit 201 then monitors for failures based on the criteria given in the determination method 113b.

When the failure detection unit 201 detects a failure (step S102), it sends the detected failure name 113a to the failure diagnosis unit 101 as failure occurrence information. On receiving it, the failure diagnosis unit 101 refers to the failure collection information data 113, fixes the collection information 113c corresponding to the failure name 113a, that is, the materials to collect for that failure, and instructs the collection material determination unit 104 to collect them (step S103). The instruction tells the collection material determination unit 104 to collect the materials in the order registered in the collection information 113c.

On receiving this instruction, the collection material determination unit 104 treats each entry of the collection information 113c specified by the failure diagnosis unit 101 as a collection material number 112a, looks it up in the collection target failure material data 112, acquires the corresponding collection method 112c and influence 112d, and passes them to the failure material management unit 105 (step S104).

The failure material management unit 105 then registers the received information in the failure material data 114 (step S105). The type of failure detected by the failure detection unit 201 in step S102 becomes the failure name 114b, the failure number 114a is the number assigned to this occurrence as described above, and the registration date and time 114c is when the failure was registered. The material numbers that the failure diagnosis unit 101 specified in step S103 become the collected material 114d, in the same order. The initial value of the collection status 114e for each collected material 114d is “pending”.

Next, the failure material management unit 105 determines whether any of the collected materials 114d registered in the failure material data 114 has the collection status 114e “pending”, that is, whether there is any material for which no collection request has yet been made to the failure material request unit 103 (step S106). If a collection request has been made to the failure material request unit 103 for every material (no collection status 114e is “pending”), the process proceeds to step S110.

If any material has the collection status 114e “pending” (Yes in step S106), then for that material and the failure that triggered its collection, the following are compared: the system information (acquired by the system information request unit 102 from the system information collection unit 202) and the system state (entered in the system state data 115); the influence of acquiring the material, registered as the influence 112d of the collection target failure material data 112; the allowable influence degree for each influence target, registered as the allowable influence degree 111c of the tolerance data 111; and the priority allowable influence degree, registered as the priority allowable influence degree 113d of the failure collection information data 113. From this comparison it is judged whether the material can be collected at present (steps S107 to S108).

If a priority allowable influence degree 113d is registered in the failure collection information data 113, whether the material can be collected is determined from its content. If no priority allowable influence degree 113d is registered, the determination is based on the allowable influence degree 111c of the tolerance data 111.

For items of the tolerance data 111 that appear in the system state data 115, such as “date and time” and “business” (system state), the system state data 115 is used. For the other items, such as “file size (of the operation log, input log, and so on)” and “CPU utilization” (system information), the system information request unit 102 acquires the values from the system information collection unit 202 using the collection method registered in the collection target failure material data 112.

This risk determination is performed for every collected material 114d registered in the failure material data 114. When a material can be collected while satisfying the allowable ranges, the failure material management unit 105 instructs the failure material request unit 103 to acquire it. The failure material request unit 103 sets the collection status 114e to “requested” and draws up a collection schedule in the priority order of the designated materials (step S109). The process then returns to step S106.

The judgment in steps S107 to S108 is now described in more detail. When a failure with the failure name 113a “AP down” occurs, the priority allowable influence degree 113d is “file size: unlimited” and “business: unlimited”. The failure material management unit 105 therefore ignores the “file size” and “business” conditions of the tolerance data 111 and instructs the failure material request unit 103 to acquire each item as long as the state of the system, as given by the system state data 115 and related information, satisfies the allowable ranges for “CPU utilization” and “date and time”. Because the collection information 113c is “1, 2, 3”, it instructs the failure material request unit 103 to acquire the “operation log”, the “input log”, and the “host name” in that order.

In contrast, when a failure with the failure name 113a “AP (application) slowdown” occurs, the priority allowable influence degree 113d is blank. The failure material management unit 105 therefore instructs the failure material request unit 103 to acquire each item only while the state of the system, as given by the system state data 115 and related information, satisfies every allowable range in the tolerance data 111. Because the collection information 113c is “1, 4, 3”, it instructs the failure material request unit 103 to acquire the “operation log”, the “CPU utilization”, and the “host name” in that order.
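The difference between the two cases can be made concrete with a small sketch. The numeric thresholds and the keys of the `state` dictionary are hypothetical, and the text does not specify how the influence 112d is combined with the current values, so this sketch compares only the current state with each limit.

```python
# Hypothetical limits; the source does not state the FIG. 3 values.
MAX_FILE_SIZE_MB = 100
MAX_CPU_PERCENT = 80

# Influence targets waived by the priority allowable influence degree 113d.
WAIVED = {
    "AP down": {"file size", "business"},
    "AP slowdown": set(),
}


def can_collect(failure_name, state):
    """Return True if every non-waived tolerance condition is satisfied."""
    checks = {
        "file size": state["file_size_mb"] <= MAX_FILE_SIZE_MB,
        "CPU utilization": state["cpu_percent"] <= MAX_CPU_PERCENT,
        "date and time": not state["weekday_business_hours"],
        "business": not state["business_program_running"],
    }
    waived = WAIVED[failure_name]
    return all(ok for target, ok in checks.items() if target not in waived)


# A weekend state with a large log file and the business program running.
weekend = {
    "file_size_mb": 500,
    "cpu_percent": 20,
    "weekday_business_hours": False,
    "business_program_running": True,
}
```

Under this state, collection is allowed for “AP down” but deferred for “AP slowdown”, because only the former waives the file-size and business conditions.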

Once a collection request has been made to the failure material request unit 103 for every material registered in the failure material data 114, the failure material request unit 103 collects the materials through the failure material collection unit 203 according to the schedule drawn up in step S109 (step S110). As it does so, it sets the collection status 114e of each material whose collection is complete to “collected”. It then judges whether all the materials registered in the failure material data 114 have been collected, that is, whether every collection status 114e is “collected” (step S111).

If some material has not been collected (No in step S111), this is because some item of the system information or system state does not satisfy its allowable range. For example, the “CPU utilization” exceeded the specified range; the current date and time fell within the period excluded by the “date and time” condition “not permitted on weekdays from 9:00 to 18:00”; or the business program was running although the “business” condition is “not permitted while the business program is running”.

The failure material request unit 103 therefore waits until the allowable ranges are satisfied (step S112) and then returns to step S106. Once all the materials registered in the failure material data 114 have been collected (Yes in step S111), the operation ends for the time being and the system returns to waiting for a failure to occur in step S102.
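The flow of steps S105 to S112 can be approximated by the loop below. It is a simplified sketch rather than the flowchart of FIG. 9 itself: the function and parameter names are assumptions, and the risk check, the collection, and the waiting are supplied by the caller as callbacks.

```python
def collect_materials(materials, is_collectable, collect, wait):
    """Collect every material, deferring those not currently collectable.

    materials: collection material numbers in priority order
    is_collectable: callback implementing the check of steps S107-S108
    collect: callback that fetches one material (step S110)
    wait: callback that blocks until the state may have changed (S112)
    """
    status = {m: "pending" for m in materials}           # S105
    while True:
        for m in materials:                              # S106
            if status[m] == "pending" and is_collectable(m):
                status[m] = "requested"                  # S109
        for m in materials:                              # S110
            if status[m] == "requested":
                collect(m)
                status[m] = "collected"
        if all(s == "collected" for s in status.values()):  # S111
            return status
        wait()                                           # S112
```

A material that fails the check stays pending and is retried after each wait, which mirrors the behavior of holding collection until the allowable ranges are satisfied.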

(Overall operation of the embodiment)
Next, the overall operation of the above embodiment is described. The failure investigation information material collection method according to this embodiment is used in a failure investigation information material collection system in which the management server 10 and the monitoring target server 20 are connected to each other and the management server collects, from the monitoring target server, the material needed to analyze the cause of a failure that has occurred in the operation of the monitoring target server. In this method, the management server receives from the monitoring target server failure information indicating that a failure has occurred (FIG. 9, step S102); the failure diagnosis unit of the management server determines the type of failure material corresponding to the type of failure included in the failure information by acquiring it from the failure collection information data stored in advance (FIG. 9, step S103); the collection material determination unit of the management server acquires, from the collection target failure material data stored in advance, the collection method corresponding to the type of failure material and the influence that the collection has on the monitoring target server (FIG. 9, step S104); the failure material management unit of the management server determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the influence with the current state of the monitoring target server (FIG. 9, steps S107 to S108); and the failure material request unit of the management server acquires from the monitoring target server the failure material determined to be acquirable (FIG. 9, step S110).

Each of the above operation steps may be implemented as a computer-executable program and executed by the management server 10, the computer that directly performs the steps. The program may be recorded on a non-transitory recording medium such as a DVD, a CD, or a flash memory, in which case it is read from the recording medium and executed by the computer.
Through this operation, the embodiment provides the following effects.

According to this embodiment, not only the type of material collected but also the criteria for determining whether failure material can be acquired can be switched according to the type of failure that has occurred. The necessary material can therefore be collected appropriately according to the nature and importance of the failure, as with the “AP slowdown” and “AP down” failures described above.

Furthermore, material that is judged not collectable at present is collected after the situation changes, which reduces the occurrence of cases where the cause of a failure is hard to analyze because the necessary material could not be collected.

In addition, the judgment of whether failure material can be acquired uses not only the system information acquired from the monitoring target server but also the system state entered manually by the system administrator, so the judgment can better reflect the actual situation.

The present invention has been described above with reference to the specific embodiment shown in the drawings. However, the present invention is not limited to that embodiment, and any known configuration may be adopted as long as the effects of the present invention are achieved.

The main points of the novel technical content of the embodiment described above are summarized below. Part or all of the embodiment can be summarized as novel technology in the following way, but the present invention is not necessarily limited to this.

(Supplementary Note 1) A failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the failure material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, wherein
the monitored server comprises a failure detection unit that detects the occurrence of the failure according to an occurrence condition sent in advance from the management server and sends failure information indicating the occurrence to the management server, and a failure material collection unit that collects the failure material in response to a request from the management server, and
the management server comprises:
a failure diagnosis unit that, upon receiving from the monitored server the failure information indicating that the failure has occurred, acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material to be collected from the monitored server in accordance with the failure name included in the failure information;
a collection material determination unit that acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a failure material management unit that determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a failure material request unit that requests the failure material collection unit of the monitored server to acquire the failure material determined to be acquirable.
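The data flow between the four units of the management server can be sketched in Python as below. This is an illustrative sketch, not the patented implementation: the table contents, the script names, the use of required disk space as the only "impact", and the function names are assumptions. The comparison is also simplified to use the impact alone, whereas the note above compares the material type, the collection method, and the impact with the current state.

```python
# Failure-time collection information data (113): failure name -> material types.
FAILURE_COLLECTION_INFO = {"AP down": ["thread dump", "core dump"]}

# Collection target failure material data (112):
# material type -> (collection method, impact on the monitored server).
COLLECTION_TARGETS = {
    "thread dump": ("dump_threads.sh", {"disk_gb": 1}),  # hypothetical script
    "core dump": ("dump_core.sh", {"disk_gb": 8}),       # hypothetical script
}


def diagnose(failure_info):
    """Failure diagnosis unit (101): decide which material types to collect."""
    return FAILURE_COLLECTION_INFO[failure_info["failure_name"]]


def determine(material_type):
    """Collection material determination unit (104): look up method and impact."""
    return COLLECTION_TARGETS[material_type]


def is_acquirable(impact, state):
    """Failure material management unit (105): compare impact with current state."""
    return impact["disk_gb"] <= state["disk_free_gb"]


def handle_failure(failure_info, state, request):
    """Run the flow; `request` stands in for the failure material request unit (103)."""
    results = {}
    for material_type in diagnose(failure_info):
        method, impact = determine(material_type)
        if is_acquirable(impact, state):
            results[material_type] = request(material_type, method)
    return results
```

For example, with 5 GB of free disk reported for the monitored server, only the thread dump would be requested under these assumed impact values.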

(Supplementary Note 2) The failure investigation information material collection system according to Supplementary Note 1, wherein the failure-time collection information data stores a priority tolerance, which is a criterion for determining whether the failure material corresponding to the failure name can be acquired, and
the failure material management unit has a function of determining whether the failure material can be acquired based on the priority tolerance corresponding to the failure name.

(Supplementary Note 3) The failure investigation information material collection system according to Supplementary Note 1, wherein the management server has a system information request unit that requests the monitored server to acquire those items of the current state of the monitored server that can be acquired from the monitored server.

(Supplementary Note 4) The failure investigation information material collection system according to Supplementary Note 1, wherein the management server has a system state input unit that accepts and stores input for those items of the current state of the monitored server that a user can enter.

(Supplementary Note 5) The failure investigation information material collection system according to Supplementary Note 1, wherein the failure material request unit of the management server has a function of waiting until failure material determined not to be acquirable becomes acquirable and then acquiring it.

(Supplementary Note 6) The failure investigation information material collection system according to Supplementary Note 1, wherein the management server is connected to a management terminal through which a user can enter the failure-time collection information data and the collection target failure material data in advance.

(Supplementary Note 7) A management server that is connected to a monitored server and collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, the management server comprising:
a failure diagnosis unit that, upon receiving from the monitored server failure information indicating that the failure has occurred, acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material to be collected from the monitored server in accordance with the type of failure included in the failure information;
a collection material determination unit that acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a failure material management unit that determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a failure material request unit that acquires, from the monitored server, the failure material determined to be acquirable.

(Supplementary Note 8) A failure investigation information material collection method for a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, wherein
the management server receives, from the monitored server, failure information indicating that the failure has occurred,
a failure diagnosis unit of the management server acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material corresponding to the type of failure included in the failure information,
a collection material determination unit of the management server acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server,
a failure material management unit of the management server determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server, and
a failure material request unit of the management server acquires, from the monitored server, the failure material determined to be acquirable.

(Supplementary Note 9) A failure investigation information material collection program for a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, the program causing a computer provided in the management server to execute:
a procedure of acquiring from pre-stored failure-time collection information data, and thereby determining, the type of failure material corresponding to the type of failure included in the failure information;
a procedure of acquiring, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a procedure of determining whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a procedure of acquiring, from the monitored server, the failure material determined to be acquirable.

The present invention can be applied to the maintenance and operation of computer networks. It is particularly suited to computer networks that run the core business operations of a company or similar organization.

Description of Reference Signs
1 Failure investigation information collection system
10 Management server
11, 21, 31 Main arithmetic control means
12, 23 Storage means
13, 22, 32 Communication means
20 Monitored server
30 Management terminal
33 Input/output means
40 Network
101 Failure diagnosis unit
102 System information request unit
103 Failure material request unit
104 Collection material determination unit
105 Failure material management unit
106 System state input unit
111 Tolerance data
112 Collection target failure material data
113 Failure-time collection information data
114 Failure material data
115 System state data
201 Failure detection unit
202 System information collection unit
203 Failure material collection unit
204 Main business program operation unit
211 Occurrence condition data
301 Collection target failure material registration unit
302 Failure-time collection information registration unit
303 Tolerance registration unit
304 Failure material management unit
305 System state registration unit

Claims

A failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the failure material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, wherein
the monitored server comprises a failure detection unit that detects the occurrence of the failure according to an occurrence condition sent in advance from the management server and sends failure information indicating the occurrence to the management server, and a failure material collection unit that collects the failure material in response to a request from the management server, and
the management server comprises:
a failure diagnosis unit that, upon receiving the failure information from the monitored server, acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material to be collected from the monitored server in accordance with the failure name included in the failure information;
a collection material determination unit that acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a failure material management unit that determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a failure material request unit that requests the failure material collection unit of the monitored server to acquire the failure material determined to be acquirable.

The failure investigation information material collection system according to claim 1, wherein the failure-time collection information data stores a priority tolerance, which is a criterion for determining whether the failure material corresponding to the failure name can be acquired, and
the failure material management unit has a function of determining whether the failure material can be acquired based on the priority tolerance corresponding to the failure name.

The failure investigation information material collection system according to claim 1, wherein the management server has a system information request unit that requests the monitored server to acquire those items of the current state of the monitored server that can be acquired from the monitored server.

The failure investigation information material collection system according to claim 1, wherein the management server has a system state input unit that accepts and stores input for those items of the current state of the monitored server that a user can enter.

The failure investigation information material collection system according to claim 1, wherein the failure material request unit of the management server has a function of waiting until failure material determined not to be acquirable becomes acquirable and then acquiring it.

The failure investigation information material collection system according to claim 1, wherein the management server is connected to a management terminal through which a user can enter the failure-time collection information data and the collection target failure material data in advance.

A management server that is connected to a monitored server and collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, the management server comprising:
a failure diagnosis unit that, upon receiving from the monitored server failure information indicating that the failure has occurred, acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material to be collected from the monitored server in accordance with the type of failure included in the failure information;
a collection material determination unit that acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a failure material management unit that determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a failure material request unit that acquires, from the monitored server, the failure material determined to be acquirable.

A failure investigation information material collection method for a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, wherein
the management server receives, from the monitored server, failure information indicating that the failure has occurred,
a failure diagnosis unit of the management server acquires from pre-stored failure-time collection information data, and thereby determines, the type of failure material corresponding to the type of failure included in the failure information,
a collection material determination unit of the management server acquires, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server,
a failure material management unit of the management server determines whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server, and
a failure material request unit of the management server acquires, from the monitored server, the failure material determined to be acquirable.

A failure investigation information material collection program for a failure investigation information material collection system in which a management server and a monitored server are connected to each other and the management server collects, from the monitored server, the material necessary for analyzing the cause of a failure that has occurred in the operation of the monitored server, the program causing a computer provided in the management server to execute:
a procedure of acquiring from pre-stored failure-time collection information data, and thereby determining, the type of failure material corresponding to the type of failure included in the failure information;
a procedure of acquiring, from pre-stored collection target failure material data, a collection method corresponding to the type of failure material and the impact that the collection method has on the monitored server;
a procedure of determining whether the failure material can be acquired by comparing the type of failure material, the collection method, and the impact with the current state of the monitored server; and
a procedure of acquiring, from the monitored server, the failure material determined to be acquirable.