JP2009276929A

JP2009276929A - Automatic fault handling system

Info

Publication number: JP2009276929A
Application number: JP2008126316A
Authority: JP
Inventors: Satoshi Iwagaki; 聡岩垣
Original assignee: Hitachi Electronics Services Co Ltd
Current assignee: Hitachi Electronics Services Co Ltd
Priority date: 2008-05-13
Filing date: 2008-05-13
Publication date: 2009-11-26
Anticipated expiration: 2028-05-13
Also published as: JP4872058B2

Abstract

PROBLEM TO BE SOLVED: To provide an automatic fault handling system that quickly avoids a fault in a system, thereby preventing a system stop or minimizing the time for which the system is stopped.

SOLUTION: A fault detection means 11 monitors a monitored server 2 for faults. When a fault is detected, a material collection means 12 collects materials 31 such as an operation log from the monitored server 2, and a keyword extraction means 13 identifies the log that caused the fault and extracts a keyword 32. A similar event extraction means 14 extracts similar event information 33 from an event DB 21 based on the keyword 32. A fault avoidance file generation means 15 generates a fault avoidance file 34 based on the fault avoidance procedure or other necessary information included in the similar event information 33, and a fault avoidance file application means 16 transmits the fault avoidance file 34 to the monitored server 2.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an automatic failure handling system that makes recovery work on a failed server or similar equipment more efficient and allows that work to be automated.

Conventionally, in systems that carry mission-critical work, a separate monitoring server is provided so that failures are detected as early as possible and reported to an engineer. For example, Patent Document 1 discloses a method in which a monitoring device is connected over a network to multiple monitored devices such as business servers and network equipment, and failure countermeasure information for each device is accumulated in a database. When a failure occurs in a device and the database holds information on the same or a similar failure together with its countermeasure, that information is automatically shown on a display unit.

Patent Document 2 discloses a system that, as in Patent Document 1, extracts from a database failure information that resembles the failure that occurred and that has associated failure handling information, automatically generates predetermined commands for avoiding the failure based on it, and applies them to the failed server to attempt recovery. The same document also discloses a method for the case in which no handling information is accumulated in the database: failure handling information that might be able to deal with the failure is executed sequentially in a predetermined priority order until the monitored server recovers.

Patent Document 1: JP 2005-202446 A
Patent Document 2: Japanese Patent No. 3826940

However, with the method of Patent Document 1, the work of avoiding the failure using the displayed countermeasure still has to be done by people, so it is not easy to reduce the staff who watch the display device. The method of Patent Document 2 fully automates the process from the occurrence of a failure to recovery, but when no failure handling information matches the conditions, it attempts recovery based on information with no track record. In a mission-critical system, trying such unproven handling information one item after another until the system recovers could delay recovery for a long time, which makes the approach difficult to use in practice.

The present invention has been made in view of the above problems. Its object is to provide an automatic failure handling system that automatically recovers from a failure based on known cases and that allows the system to be restored quickly even when no known case exists.

The automatic failure handling system according to claim 1 comprises:
failure detection means for monitoring a server to be monitored (hereinafter, the monitored server) and detecting a failure that has occurred in the monitored server;
data collection means for collecting operation logs of the operating system and programs used by the monitored server;
keyword extraction means for extracting keywords indicating failure information from the collected data;
a case information database (hereinafter, the case DB) that accumulates the contents of past failures and their avoidance procedures;
similar case extraction means for extracting failure avoidance information from the case DB based on the keywords extracted by the keyword extraction means;
an environment information database (hereinafter, the environment DB) that accumulates environment information of the monitored server, typified by network environment information such as IP addresses and host names and by the program operating environment such as program installation folders and various setting values;
failure avoidance file generation means for generating a failure avoidance file for automatically handling the failure, based on the failure avoidance information extracted by the similar case extraction means and the environment information of the monitored server extracted from the environment DB; and
failure avoidance file application means for applying the failure avoidance file to the monitored server.
When a failure occurs in the monitored server, the failure detection means detects it and instructs the data collection means to collect operation logs. The keyword extraction means extracts predetermined keywords from the collected operation logs, and the similar case extraction means acquires failure avoidance information from the case DB based on the extracted keywords. The failure avoidance file generation means generates a failure avoidance file from the failure avoidance information and from the environment information, acquired from the environment DB, of the monitored server in which the failure occurred. The failure avoidance file application means then applies the file to the monitored server.

According to the configuration of claim 1, the monitored server is monitored periodically. When a failure is detected, the operation logs from before and after the failure are collected, those files are searched for the error log entries or similar records that triggered the failure, and keywords such as error codes are extracted. Using those keywords, the similar case extraction means extracts a similar case from the case DB and acquires the failure avoidance information that the case contains for dealing with the failure. A failure avoidance file is then generated from this information, transmitted to the monitored server, and applied there.
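As an illustration only, the collection of logs around the failure time and the extraction of error-code keywords could be sketched in Python as follows. The log line format, the error-code pattern, the sample codes, and the function names `collect_window` and `extract_keywords` are assumptions made for this sketch; the specification does not define them.

```python
import re
from datetime import datetime, timedelta

# Assumed log format for this sketch: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")
# Assumed error-code shape such as "APP-0042" or "E1234"
ERROR_CODE = re.compile(r"\b[A-Z]{1,4}-?\d{3,5}\b")


def collect_window(lines, failure_time, margin):
    """Keep parsed log entries recorded within +/- margin of failure_time."""
    kept = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed format
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if abs(stamp - failure_time) <= margin:
            kept.append((stamp, m.group(2), m.group(3)))
    return kept


def extract_keywords(entries):
    """Return error codes found in ERROR-level entries, in first-seen order."""
    keywords = []
    for _stamp, level, message in entries:
        if level != "ERROR":
            continue
        for code in ERROR_CODE.findall(message):
            if code not in keywords:
                keywords.append(code)
    return keywords


sample_log = [
    "2008-05-13 09:00:00 INFO service started",
    "2008-05-13 10:14:58 ERROR APP-0042 internal error in worker",
    "2008-05-13 10:15:02 ERROR E1234 connection pool exhausted",
    "2008-05-13 10:15:03 WARN retrying request",
    "2008-05-13 12:00:00 ERROR E9999 unrelated later error",
]
failure_time = datetime(2008, 5, 13, 10, 15, 0)
entries = collect_window(sample_log, failure_time, timedelta(minutes=5))
keywords = extract_keywords(entries)
print(keywords)
```

Restricting the search to a window around the failure time keeps unrelated errors, such as the one logged at 12:00 in the sample, out of the keyword set.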

The automatic failure handling system according to claim 2 is the system of claim 1 in which the failure avoidance file generation means can, when generating a failure avoidance file, automatically generate an investigation report that describes in detail the procedure the failure avoidance file performs.

According to the configuration of claim 2, the operating procedure and the solution used when the failure avoidance file generation means generates the failure avoidance file can be put into the investigation report and sent to the customer or a service person.
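A minimal sketch of how such a report might be assembled from the same steps that go into the failure avoidance file is shown below. The report layout, the field names, and the function `build_report` are hypothetical; the specification states only that the report describes the procedure in detail.

```python
def build_report(host, keywords, case_id, steps):
    """Build a plain-text investigation report listing each avoidance step."""
    lines = [
        f"Investigation report for {host}",
        f"Keywords: {', '.join(keywords)}",
        f"Matched case: {case_id}",
        "Procedure performed by the failure avoidance file:",
    ]
    # Number the steps so the reader can follow them in execution order.
    lines += [f"  {n}. {step}" for n, step in enumerate(steps, start=1)]
    return "\n".join(lines)


report = build_report(
    "web01",
    ["E1234"],
    "C-002",
    ["stop service appd", "start service appd"],
)
print(report)
```

Generating the report from the same step list as the avoidance file means the document sent to the customer cannot drift from what was actually applied.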

The automatic failure handling system according to claim 1 comprises:
failure detection means for monitoring a server to be monitored and detecting a failure that has occurred in the monitored server;
data collection means for collecting operation logs of the operating system and programs used by the monitored server;
keyword extraction means for extracting keywords indicating failure information from the collected data;
a case information database (case DB) that accumulates the contents of past failures and their avoidance procedures;
similar case extraction means for extracting failure avoidance information from the case DB based on the keywords extracted by the keyword extraction means;
an environment information database (environment DB) that accumulates environment information of the monitored server, typified by network environment information such as IP addresses and host names and by the program operating environment such as program installation folders and various setting values;
failure avoidance file generation means for generating a failure avoidance file for automatically handling the failure, based on the failure avoidance information extracted by the similar case extraction means and the environment information of the monitored server extracted from the environment DB; and
failure avoidance file application means for applying the failure avoidance file to the monitored server.
When a failure occurs in the monitored server, the failure detection means detects it and instructs the data collection means to collect operation logs. The keyword extraction means extracts predetermined keywords from the collected operation logs, and the similar case extraction means acquires failure avoidance information from the case DB based on the extracted keywords. The failure avoidance file generation means generates a failure avoidance file from the failure avoidance information and from the environment information, acquired from the environment DB, of the monitored server in which the failure occurred, and the failure avoidance file application means applies the file to the monitored server. As a result, all of the work from the occurrence of a failure in the monitored server, through avoidance of the failure, to recovery of the system can be automated, and delays caused by human factors can be prevented. The number of staff, such as service personnel, can also be reduced.

The automatic failure handling system according to claim 2 is the system of claim 1 in which the failure avoidance file generation means can, when generating a failure avoidance file, automatically generate an investigation report that describes in detail the procedure the failure avoidance file performs. An investigation report setting out the details of the failure and its solution can therefore be delivered promptly to the customer or service person.

An embodiment as the best mode for carrying out the present invention is described below with reference to FIGS. 1 to 4. Needless to say, the present invention can readily be applied to configurations other than those described in the embodiment, as long as this does not depart from the spirit of the invention.

The configuration of the automatic failure handling system in this embodiment is described with reference to FIG. 1. Reference numeral 1 denotes the automatic failure handling system. Reference numeral 2 denotes a monitored server whose operation is monitored by the automatic failure handling system 1. Reference numeral 3 denotes a customer representative or a service person who handles failures.

Reference numeral 11 denotes failure detection means that performs health checks on the monitored server 2 and detects failures. Reference numeral 12 denotes data collection means that collects data 31, such as operation logs, from the monitored server 2 in which a failure has occurred. Reference numeral 13 denotes keyword extraction means that extracts from the collected data keywords 32, such as error codes, that identify the failure. Reference numeral 21 denotes a case DB (database) in which failure information and its countermeasures are accumulated. Reference numeral 14 denotes similar case extraction means that extracts similar cases from the case DB 21 based on the keywords extracted by the keyword extraction means 13. Reference numeral 22 denotes an environment DB (database) that accumulates the network information, installation folders, and other per-server information of the monitored servers 2.

Reference numeral 15 denotes failure avoidance file generation means. It acquires environment information about the failed monitored server from the environment DB 22 and, based on that environment information and the similar case information 33 extracted by the similar case extraction means 14, or based on the investigation result 36 described later, generates a failure avoidance file 34 for dealing with the failure in the monitored server 2 and an investigation report 35 that describes the cause of the failure and what happens when the failure avoidance file runs. Reference numeral 16 denotes failure avoidance file application means that transmits the failure avoidance file 34 generated by the failure avoidance file generation means 15 to the monitored server 2 and executes it. Reference numeral 17 denotes report sending means that sends the investigation report 35 to the customer or service person. Reference numeral 4 denotes an engineer. When an exception arises, such as no similar case being extractable from the case DB 21, the engineer manually creates an investigation result 36 and sends it to the failure avoidance file generation means 15, and the subsequent processing continues.
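The way the failure avoidance file generation means 15 combines a stored avoidance procedure with per-server environment information can be pictured with a short Python sketch. Everything concrete here is assumed for illustration: the placeholder syntax, the contents of `ENVIRONMENT_DB`, the step wording (which is descriptive text, not the syntax of any real tool), and the function name `generate_avoidance_file`.

```python
from string import Template

# Hypothetical environment DB keyed by host name (stands in for environment DB 22).
ENVIRONMENT_DB = {
    "web01": {"ip": "192.0.2.10", "install_dir": "/opt/app", "service": "appd"},
}

# Hypothetical avoidance procedure stored with a case in the case DB.
# $name placeholders mark values that differ from server to server.
CASE_STEPS = [
    "stop service $service",
    "remove $install_dir/cache/*.lock",
    "start service $service",
]


def generate_avoidance_file(steps, env):
    """Fill each step's placeholders with the failed server's environment."""
    lines = [Template(step).substitute(env) for step in steps]
    return "\n".join(lines) + "\n"


avoidance_file = generate_avoidance_file(CASE_STEPS, ENVIRONMENT_DB["web01"])
print(avoidance_file)
```

Keeping server-specific values out of the case record is what lets one known case be reused on servers with different installation folders or service names.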

Next, an outline of the operation of the automatic failure handling system is described with reference to FIG. 2. The failure detection means 11 checks the failure status of the multiple monitored servers 2 as needed (health check) (S201). When the failure detection means 11 detects a failure (S202), the data collection means 12 acquires, from the failed monitored server 2, data 31 containing failure information such as OS and program logs recorded at the time of the failure or within a predetermined period before and after it (S203). The keyword extraction means 13 analyzes the acquired data 31, extracts keywords from, for example, the system error code and the logs of other programs and the OS at the time the error occurred (S204), and sends them to the similar case extraction means 14 (S205). The similar case extraction means 14 uses the extracted keywords to run a full-text search of the case DB 21 and extract cases similar to the failure in the monitored server 2 (S206). At this point, environment information for the system on the monitored server 2 may be acquired from the environment DB 22 and added as keywords for the search of the case DB 21.

When multiple known cases are extracted, the result with the highest search score is sent to the failure avoidance file generation means 15 (S208). The failure avoidance file generation means 15 analyzes the information on handling the failure in the case information it receives and generates a failure avoidance file 34 (S209). It generates the investigation report 35 at the same time. The failure avoidance file application means 16 transmits the generated failure avoidance file 34 to the failed monitored server 2, where it is executed (S210). When no similar case is extracted from the case DB (S207), or when applying the failure avoidance file 34 does not resolve the failure (S211), the data is sent again to the automatic failure handling system 1 (or the support center 100) (S212). The engineer 4 then analyzes the monitored server 2 (S213), creates an investigation result 36 from the analysis (S214), and sends the investigation result 36 to the failure avoidance file generation means 15 (S215).
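The branch between automatic handling and escalation in steps S206 to S208 can be sketched as follows. This is a simplified stand-in, not the method of the specification: a plain keyword-overlap count replaces the full-text search score, and the `Case` fields, sample records, and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    description: str       # free text describing the past failure
    avoidance_steps: list  # recorded avoidance procedure


def score(case, keywords):
    """Count how many keywords occur in the case text (case-insensitive)."""
    text = case.description.lower()
    return sum(1 for kw in keywords if kw.lower() in text)


def find_best_case(case_db, keywords):
    """Return the highest-scoring case, or None when nothing matches (S207)."""
    scored = [(score(c, keywords), c) for c in case_db]
    scored = [(s, c) for s, c in scored if s > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def handle_failure(case_db, keywords):
    """Choose automatic handling (S208) or escalation to an engineer (S212)."""
    best = find_best_case(case_db, keywords)
    if best is None:
        return ("escalate_to_engineer", None)
    return ("generate_avoidance_file", best)


case_db = [
    Case("C-001", "E1234 connection pool exhausted on web tier", ["restart pool"]),
    Case("C-002", "APP-0042 worker crash followed by E1234", ["clear cache", "restart worker"]),
]
action, matched = handle_failure(case_db, ["APP-0042", "E1234"])
fallback_action, fallback_case = handle_failure(case_db, ["Z777"])
print(action, matched.case_id, fallback_action)
```

In the sample, case C-002 matches both keywords and is chosen over C-001, while an unknown keyword produces no match and is routed to the engineer rather than to an unproven procedure.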

With the automatic failure handling system configured as described above, even when a failure occurs in the monitored server 2 and causes, for example, a system outage, the system automatically collects the error logs and other records of the failure, searches for related known cases, generates a countermeasure file for avoiding the failure from the detailed information in the similar case, and automatically applies it to the failed monitored server. Failures can thus be avoided quickly and without human intervention, which allows a large reduction in the staff assigned to failure handling. When no known case exists, or when the failure cannot be avoided even by applying the failure avoidance file, data such as error logs is sent to an engineer to prompt manual work, so the monitored server can still be restored quickly.

FIG. 3 shows an example in which the automatic failure handling system is applied to a large-scale support center that monitors the servers of multiple companies, with the means that connect directly to the monitored servers installed at each company. Companies A to C each have monitored servers 2, failure detection means 11 that monitors them, and data collection means 12 that collects data such as operation logs. The data collection means 12 is connected to the keyword extraction means 13 at the support center over a public network such as the Internet. In other words, the monitored servers are monitored on the customer side, and the failure avoidance file for a failure is generated on the support center side.

With this configuration, failure information and countermeasures from every company go into the same case database, so more information accumulates than if a separate system were built for each company, which raises the hit rate when the similar case extraction means 14 searches for similar cases. The high-load analysis functions, namely the similar case extraction means 14 and the failure avoidance file generation means 15, together with the case DB 21 and the environment DB 22, can also be centralized. This lowers installation costs on the customer side and relieves customers of database maintenance. Furthermore, even when no similar case exists and the failure avoidance file generation means 15 cannot generate a failure avoidance file 34, engineers at the support center can investigate and analyze the failure manually, so unless the failure is physical, it can be handled without staff at the customer site.

Like FIG. 3, FIG. 4 shows the automatic failure handling system applied to a large-scale support center that monitors the servers of multiple companies. Unlike FIG. 3, the keyword extraction means 13, which extracts keywords from the collected data, the failure avoidance file generation means 15, and the environment DB 22 are installed on the customer side. Companies A to C each have failure detection means 11, data collection means 12, keyword extraction means 13, failure avoidance file generation means 15, failure avoidance file application means 16, report sending means 17, and an environment DB 22. Compared with FIG. 3, each of these means has to be installed for every customer, so installation costs rise. However, even when the volume of data such as operation logs is large, the files are transferred within the company, which keeps network load down. In addition, given the heightened awareness of information leaks in recent years, even if some data must not leave the company, only the keywords, the minimum necessary information, are sent to the support center, so security is improved.
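The security property of the FIG. 4 arrangement, that only keywords cross the company boundary, can be illustrated with a small sketch. The JSON layout, the `customer` field, and the function name `build_support_center_request` are hypothetical; the specification does not describe a message format.

```python
import json


def build_support_center_request(customer_id, keywords):
    """Serialize only the customer ID and the deduplicated keywords."""
    payload = {"customer": customer_id, "keywords": sorted(set(keywords))}
    return json.dumps(payload)


# The raw log stays on the customer side and is never passed to the builder.
raw_log = ["2008-05-13 10:15:02 ERROR E1234 login failed for user alice"]
request_body = build_support_center_request(
    "company-a", ["E1234", "E1234", "APP-0042"]
)
print(request_body)
```

Because the function never receives the raw log lines, details such as the user name in the sample entry cannot end up in the request by accident.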

The embodiment of the present invention has been described in detail above, but the present invention is not limited to it, and various modifications are possible within the scope of the gist of the invention. For example, the failure detection means 11 and the data collection means 12 in the above embodiment may be provided inside the monitored server 2. In that case, failures caused by, for example, a physical fault in the server may go undetected, but the configuration is smaller and costs can be reduced. The means and databases may all run on one server or may be distributed across multiple servers.

FIG. 1 is a block diagram showing the configuration of the automatic failure handling system in this embodiment.
FIG. 2 is a flowchart showing the operation of the automatic failure handling system.
FIG. 3 is a block diagram showing a first application example of the automatic failure handling system.
FIG. 4 is a block diagram showing a second application example of the automatic failure handling system.

Explanation of symbols

1 Automatic failure handling system
11 Failure detection means
12 Data collection means
13 Keyword extraction means
14 Similar case extraction means
15 Failure avoidance file generation means
16 Failure avoidance file application means
17 Report sending means
21 Case DB
22 Environment DB
31 Data
32 Keyword
33 Similar case
34 Failure avoidance file
35 Investigation report
36 Investigation result
2 Monitored server
3 Service person
4 Engineer
100 Support center

Claims

1. An automatic failure handling system comprising:
failure detection means for monitoring a server to be monitored (hereinafter, the monitored server) and detecting a failure that has occurred in the monitored server;
data collection means for collecting operation logs of the operating system and programs used by the monitored server;
keyword extraction means for extracting keywords indicating failure information from the collected data;
a case information database (hereinafter, the case DB) that accumulates the contents of past failures and their avoidance procedures;
similar case extraction means for extracting failure avoidance information from the case DB based on the keywords extracted by the keyword extraction means;
an environment information database (hereinafter, the environment DB) that accumulates environment information of the monitored server, typified by network environment information such as IP addresses and host names and by the program operating environment such as program installation folders and various setting values;
failure avoidance file generation means for generating a failure avoidance file for automatically handling the failure, based on the failure avoidance information extracted by the similar case extraction means and the environment information of the monitored server extracted from the environment DB; and
failure avoidance file application means for applying the failure avoidance file to the monitored server,
wherein, when a failure occurs in the monitored server, the failure detection means detects the failure and instructs the data collection means to collect operation logs, the keyword extraction means extracts predetermined keywords from the collected operation logs, the similar case extraction means acquires failure avoidance information from the case DB based on the extracted keywords, the failure avoidance file generation means generates a failure avoidance file from the failure avoidance information and from the environment information, acquired from the environment DB, of the monitored server in which the failure occurred, and the failure avoidance file application means applies the failure avoidance file to the monitored server.

2. The automatic failure handling system according to claim 1, wherein the failure avoidance file generation means can, when generating the failure avoidance file, automatically generate an investigation report that describes in detail the procedure performed by the failure avoidance file.