JP2010108266A - Failure detection device, failure detection method, failure detection program, wordbook forming device, and failure occurrence analysis device

Info

Publication number: JP2010108266A
Application number: JP2008279896A
Authority: JP
Inventors: Yasuhito Takamiya; 安仁高宮
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-10-30
Filing date: 2008-10-30
Publication date: 2010-05-13

Abstract

PROBLEM TO BE SOLVED: To provide a method of detecting a failure of a computer without using knowledge or log related to a process.

SOLUTION: A process monitoring part 2 monitors process information on an OS 21, and acquires monitoring information such as use states of various resources such as a CPU or memory. A rank-group information creation part 3 ranks, for all processes, the use state of the various resources or groups them according to the degree of use state from the monitoring information. A word creation part 4 creates a wordbook used for extracting feature quantity from the obtained rank-group information and process names. A collection part 11 collects the created words, and a failure time zone estimation part 12 extracts a characteristic keyword from the collected words, and estimates a faulty computer and a time zone of the failure occurrence. A display part 13 displays the faulty computer and the time zone.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a failure detection device, a failure detection method, and a failure detection program, and in particular to a failure detection device, a failure detection method, a failure detection program, a word collection creation device, and a failure occurrence analysis device for a group of computers, such as a cluster system, that are built from identical hardware and software and execute the same job.

Several efficient failure detection methods have been proposed for groups of computers, typified by cluster systems, that are built from identical hardware and software and execute the same job.

In the failure detection device disclosed in Patent Document 1, the computers subject to failure detection are arranged in a tree structure, and the nodes connected in the tree monitor one another for failures. Compared with all-to-all monitoring, this reduces the number of computers that each computer must monitor, so failures are detected efficiently.

In the failure detection device disclosed in Non-Patent Document 1, the syslog of each node in a cluster is run through a feature quantity algorithm, and the resulting feature quantities are compared across nodes; a node showing an abnormal feature quantity is presumed to have failed. Because this method can detect the time zone in which a failure occurred simply by processing the syslog mechanically, it requires neither special knowledge about processes, such as their behavior and characteristics, nor prior learning.

Patent Document 1: JP 2003-271471 A
Non-Patent Document 1: Bad Words: Finding Faults in Spirit's Syslogs, Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on (2008), pp. 765-770.

Thus, the conventional failure detection devices described above either have computers connected in a tree structure detect one another's failures, as in Patent Document 1, or estimate the failed computer and the time zone of the failure by comparing syslog feature quantities across time zones, as in Non-Patent Document 1.

With the method of Patent Document 1, each computer has fewer computers to monitor under tree-structured monitoring than under all-to-all monitoring, which reduces the load imposed by failure detection. However, determining that a failure has occurred requires knowledge of the normal behavior of each process. In other words, a knowledge database must be prepared in advance, for example by learning, before failures can be detected.

With the method of Non-Patent Document 1, failures are detected using only syslog feature quantities, so no knowledge database about processes needs to be created in advance. However, the processes whose failures can be detected in this way are limited to those that write logs to syslog, such as service processes (daemons).

The present invention has been made in view of these circumstances. Its object is to provide a failure detection device, a failure detection method, a failure detection program, a word collection creation device, and a failure occurrence analysis device that can detect a computer failure without using knowledge about processes or logs.

To solve the above problem, the present invention is a failure detection device comprising: process monitoring means for monitoring processes executed on each of a plurality of computers and acquiring a resource usage status for each process and monitoring information of the process; rank/group information generation means for ranking or grouping the processes based on the resource usage status and the monitoring information acquired by the process monitoring means; word generation means for generating words used for feature quantity extraction for each process, based on rank information indicating the ranking of each process and group information indicating the grouping, both obtained by the rank/group information generation means, and on the monitoring information; collection means for collecting the words generated by the word generation means for each of the plurality of computers; and failure time zone estimation means for estimating the computer in which a failure has occurred and the failure occurrence time zone, based on the per-process words of the plurality of computers collected by the collection means.

To solve the above problem, the present invention is also a failure detection method comprising the steps of: monitoring processes executed on each of a plurality of computers and acquiring a resource usage status for each process and monitoring information of the process; ranking or grouping the processes based on the acquired resource usage status and monitoring information; generating words used for feature quantity extraction for each process, based on rank information indicating the ranking of each process, group information indicating the grouping, and the monitoring information; collecting the generated words for each of the plurality of computers; and estimating the computer in which a failure has occurred and the failure occurrence time zone, based on the collected per-process words of the plurality of computers.

To solve the above problem, the present invention is also a failure detection program that causes a computer of a failure detection device, which monitors processes executed on each of a plurality of computers and detects the occurrence of a failure, to execute the steps of: monitoring the processes executed on each of the plurality of computers and acquiring a resource usage status for each process and monitoring information of the process; ranking or grouping the processes based on the acquired resource usage status and monitoring information; generating words used for feature quantity extraction for each process, based on rank information indicating the ranking of each process, group information indicating the grouping, and the monitoring information; collecting the generated words for each of the plurality of computers; and estimating the computer in which a failure has occurred and the failure occurrence time zone, based on the collected per-process words of the plurality of computers.

To solve the above problem, the present invention is also a word collection creation device comprising: process monitoring means for monitoring processes executed on a computer and acquiring a resource usage status for each process and monitoring information of the process; rank/group information generation means for ranking or grouping the processes based on the resource usage status and the monitoring information acquired by the process monitoring means; and word generation means for generating words used for feature quantity extraction for each process, based on rank information indicating the ranking of each process and group information indicating the grouping, both obtained by the rank/group information generation means, and on the monitoring information.

To solve the above problem, the present invention is also a failure occurrence analysis device comprising: collection means for collecting, for each of a plurality of computers, words used for feature quantity extraction for each process, the words having been generated based on the monitoring information of the process and on rank information indicating the ranking of each process and group information indicating the grouping, which are obtained by ranking or grouping the processes based on the resource usage status of each process executed on a computer and the monitoring information; and failure time zone estimation means for estimating the computer in which a failure has occurred and the failure occurrence time zone, based on the per-process words of the plurality of computers collected by the collection means.

According to the present invention, a computer failure can be detected without special knowledge about processes and without using logs.

An embodiment of the present invention will now be described with reference to the drawings.

A. First Embodiment

First, a first embodiment of the present invention will be described.
FIG. 1 is a block diagram showing the configuration of the failure detection device according to the first embodiment of the present invention. In the figure, the failure detection device consists of a word collection creation unit 1 and a failure occurrence analysis unit 10. The word collection creation unit 1 includes a process monitoring unit 2, a rank/group information generation unit 3, and a word generation unit 4. The failure occurrence analysis unit 10 includes a collection unit 11, a failure time zone estimation unit 12, and a display unit 13.

The word collection creation unit 1 runs on every computer 20-1, 20-2, 20-3 subject to failure detection and generates a word collection representing the operating state of the processes running on each of the computers 20-1 to 20-3. The failure occurrence analysis unit 10 periodically collects the word collections from all of the computers 20-1 to 20-3 and analyzes which computer failed and in which time zone.

The process monitoring unit 2 monitors information on the processes running on the OS 21 and acquires the usage status of resources such as CPU and memory, together with monitoring information such as each process's execution user ID and execution group ID. The rank/group information generation unit 3 ranks or groups all processes running on the OS 21 according to their resource usage status or monitoring information. The word generation unit 4 generates the words used for feature quantity extraction from the resulting rank/group information, the execution user ID, the execution group ID, and the process name.

The collection unit 11 collects the words generated by the word generation unit 4 from all of the computers 20-1 to 20-3. The failure time zone estimation unit 12 extracts characteristic keywords from the collected words and estimates the computer in which a failure occurred and the time zone of the failure. The display unit 13 displays the estimated time zone to the user.

FIG. 2 is a diagram showing an example of the process information monitored by the process monitoring unit 2. The process monitoring unit 2 monitors, through an interface provided by the OS 21, the resource usage status of the processes running on the OS 21 as well as their execution user IDs and execution group IDs. The monitored resources include, for example, CPU and memory usage rates. Interfaces provided by the OS 21 include the /proc file system and system calls.
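As a rough illustration of reading such monitoring information through /proc, the sketch below parses the text of a Linux-style /proc/<pid>/status file. The assumption that the OS 21 is Linux, the function name parse_status, and the sample text are all hypothetical and are not taken from the patent.

```python
def parse_status(text):
    """Extract process name, real UID/GID and resident memory (kB)
    from the text of a Linux /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.split()
    return {
        "name": " ".join(fields["Name"]),
        "uid": int(fields["Uid"][0]),   # first entry is the real UID
        "gid": int(fields["Gid"][0]),   # first entry is the real GID
        # kernel threads have no VmRSS line, so fall back to 0
        "rss_kb": int(fields.get("VmRSS", ["0"])[0]),
    }

# Hypothetical sample; on a live system the text would come from
# open(f"/proc/{pid}/status").read().
sample = (
    "Name:\tfirefox\n"
    "Uid:\t1000\t1000\t1000\t1000\n"
    "Gid:\t100\t100\t100\t100\n"
    "VmRSS:\t  204800 kB\n"
)
info = parse_status(sample)
print(info)
```

Numeric IDs are returned here; mapping them to user names is left out of the sketch.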

FIGS. 3 and 4 are diagrams showing examples of ranking and grouping by the rank/group information generation unit 3. The rank/group information generation unit 3 acquires, through the process monitoring unit 2, the resource usage status of all processes running on the OS 21 and ranks or groups the processes for each resource. If the resource usage status of the processes is as shown in FIG. 2, the ranking by CPU usage is as shown in FIG. 3. The processes can also be grouped as shown in FIG. 4 by choosing suitable thresholds on the CPU usage rate, for example "high" (30% or more), "mid" (10% or more and less than 30%), and "low" (less than 10%).
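The ranking and grouping described above can be sketched as follows. The function names and the CPU percentages are hypothetical (the values of FIG. 2 are not reproduced in this text); only the three thresholds come from the description.

```python
def rank_by_usage(usage):
    """Return {process: rank}, where rank 1 is the highest usage."""
    ordered = sorted(usage, key=usage.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}

def group_by_usage(usage, high=30.0, mid=10.0):
    """Return {process: group} using the example thresholds:
    high >= 30%, 10% <= mid < 30%, low < 10%."""
    groups = {}
    for name, pct in usage.items():
        if pct >= high:
            groups[name] = "high"
        elif pct >= mid:
            groups[name] = "mid"
        else:
            groups[name] = "low"
    return groups

# Hypothetical CPU usage rates in percent.
cpu = {"emacs": 32.0, "xterm": 3.0, "firefox": 45.0}
ranks = rank_by_usage(cpu)
groups = group_by_usage(cpu)
print(ranks)
print(groups)
```

With these assumed values, firefox, emacs, and xterm rank first, second, and third, and fall into the high, high, and low groups, which matches the words used in the examples of this description.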

The word generation unit 4 generates a word collection representing information on the processes running on the computers 20-1 to 20-3 by joining the process name to the rank or group information obtained by the rank/group information generation unit 3, the execution user ID of the process, the execution group ID, and so on, with a suitable delimiter. For example, when a word collection is created from the CPU usage ranking of FIG. 3, the process "firefox" ranks first in CPU usage, so with ":" as the delimiter the generated word is firefox:cpu1. Likewise, the words emacs:cpu2 and xterm:cpu3 are generated, and the final result is the three-word collection {firefox:cpu1, emacs:cpu2, xterm:cpu3}.

When the group information of FIG. 4 is used, the process "firefox" belongs to the CPU group "high", so the generated word is firefox:cpuhigh. Likewise, the words emacs:cpuhigh and xterm:cpulow are generated, and the final result is the three-word collection {firefox:cpuhigh, emacs:cpuhigh, xterm:cpulow}.

Although not illustrated, when a word collection is created from the execution user ID, the execution user ID of the process "firefox" is yasuhito, so the generated word is firefox:yasuhito. Likewise, the words emacs:yasuhito and xterm:yutaro are generated, and the final result is the three-word collection {firefox:yasuhito, emacs:yasuhito, xterm:yutaro}.
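The three kinds of word collection above can be produced by one small helper. This is a minimal sketch; the function name make_words and its prefix parameter are hypothetical conveniences, while the delimiter ":" and the resulting words follow the examples in the text.

```python
def make_words(attrs, prefix="", sep=":"):
    """Join each process name to its attribute value with a delimiter.
    attrs maps process name -> rank, group, or user ID."""
    return [f"{name}{sep}{prefix}{value}" for name, value in attrs.items()]

rank_words = make_words({"firefox": 1, "emacs": 2, "xterm": 3}, prefix="cpu")
group_words = make_words(
    {"firefox": "high", "emacs": "high", "xterm": "low"}, prefix="cpu"
)
user_words = make_words(
    {"firefox": "yasuhito", "emacs": "yasuhito", "xterm": "yutaro"}
)
print(rank_words)   # ['firefox:cpu1', 'emacs:cpu2', 'xterm:cpu3']
print(group_words)  # ['firefox:cpuhigh', 'emacs:cpuhigh', 'xterm:cpulow']
print(user_words)   # ['firefox:yasuhito', 'emacs:yasuhito', 'xterm:yutaro']
```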

The collection unit 11 collects the word collections from all of the computers 20-1 to 20-3. To each word collection it attaches the ID (such as the host name) of the computer that generated it and the time zone in which it was collected. The width of a time zone may be set arbitrarily, for example to one hour or 15 minutes.

The failure time zone estimation unit 12 performs keyword extraction on the word collections collected by the collection unit 11, using the existing IDF method or log.entropy method. In keyword extraction, an index g(i) indicating how characteristic a word i is is computed for each word i that appears in the collected word collections. The index g(i) is then used as a weight to compute, for each fixed time zone within the period subject to failure detection, a feature value for that time zone. A time zone with a large feature value is abnormal compared with the other time zones, and a failure is presumed to have occurred in it.

In general, the words that tend to raise the feature value are those that appear on only a small minority of the computers. For example, a process executed only on a specific computer, a process showing an abnormal CPU usage rate, or a process whose execution user ID differs from that on the other computers yields distinctive words, so the feature value readily becomes anomalous.

As preprocessing, the failure time zone estimation unit 12 creates a sparse M × N matrix as follows. Let x(i, j) denote an element of the matrix; x(i, j) is defined as "the number of times word i appeared in node time zone j". Here, a node time zone is an ID that uniquely identifies one time zone of one computer, so the number of columns is N = (number of nodes) × (number of time zones). After the M × N matrix has been generated, the index g(i) is calculated with the IDF method or the log.entropy method as follows.
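One way to hold this sparse matrix is a dictionary keyed by (word, node time zone), storing only the non-zero counts. The sketch below assumes each collected record is a (host, time slot, words) tuple; that record layout, the host names, and the name build_counts are hypothetical.

```python
from collections import Counter

def build_counts(records):
    """Build the sparse matrix x(i, j) as a Counter keyed by
    (word, (host, slot)); missing keys are implicit zeros."""
    counts = Counter()
    for host, slot, words in records:
        for word in words:
            counts[(word, (host, slot))] += 1
    return counts

# Hypothetical records: three hosts, one time slot each.
records = [
    ("node1", 0, ["firefox:cpu1", "emacs:cpu2"]),
    ("node2", 0, ["firefox:cpu1", "emacs:cpu2"]),
    ("node3", 0, ["emacs:cpu1", "firefox:cpu2"]),
]
counts = build_counts(records)
columns = {col for _, col in counts}   # node time zones that occur
words = {word for word, _ in counts}   # distinct words (rows)
print(len(words), "words x", len(columns), "node time zones")
```

A Counter returns 0 for any (word, node time zone) pair that was never seen, which matches the sparse-matrix reading of x(i, j).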

The keyword extraction method using the IDF method is described first. In the IDF method, the index g(i) is given by equation (1).

Here, n is the total number of node time zones, and df_i is the number of node time zones in which word i appears. When the resulting g(i) is used as a weight, the feature value |x_j| of node time zone j is given by equation (2).

A characteristic of the IDF method is that words appearing in many node time zones lose importance, while words appearing only in particular node time zones gain importance.
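Equations (1) and (2) are not reproduced in this text, so the sketch below rests on two assumptions: that g(i) takes the standard IDF form log(n / df_i), which matches the definitions of n and df_i given above, and that the feature value |x_j| is the Euclidean norm of the weighted column, sqrt(sum_i (g(i) x(i, j))^2). The second assumption in particular is a guess from the notation and may differ from the patent's actual equation.

```python
import math

def idf_weights(counts, n):
    """Assumed form of g(i): log(n / df_i), where df_i is the number of
    node time zones in which word i appears."""
    df = {}
    for (word, _col), c in counts.items():
        if c > 0:
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / d) for word, d in df.items()}

def feature_values(counts, g):
    """Assumed form of |x_j|: Euclidean norm of the weighted column j."""
    squares = {}
    for (word, col), c in counts.items():
        squares[col] = squares.get(col, 0.0) + (g[word] * c) ** 2
    return {col: math.sqrt(s) for col, s in squares.items()}

# Hypothetical counts: "job:cpu1" appears everywhere, "rogue:cpu2" only on n3.
counts = {
    ("job:cpu1", ("n1", 0)): 1,
    ("job:cpu1", ("n2", 0)): 1,
    ("job:cpu1", ("n3", 0)): 1,
    ("rogue:cpu2", ("n3", 0)): 1,
}
g = idf_weights(counts, n=3)
features = feature_values(counts, g)
print(g)
print(features)
```

Under these assumptions the word shared by all three node time zones gets weight 0, so only the node time zone containing the rare word has a non-zero feature value.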

Next, the keyword extraction method using the log.entropy method is described. In the log.entropy method, g(i) is given by equation (3).

Here, p_ij is the fraction of all occurrences of word i that fall in node time zone j, and is given by equation (4).

When the resulting g(i) is used as a weight, the feature value |x_j| of node time zone j is given by equation (5).

A characteristic of the log.entropy method is that a word appearing the same number of times in every node time zone has importance 0, while a word appearing in only one node time zone has importance 1. In other words, the more a word's occurrences are concentrated in particular node time zones, the closer its weight is to 1.
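Equations (3) and (4) are not reproduced in this text. A common log-entropy weight that satisfies both stated properties (0 for a word spread evenly over all node time zones, 1 for a word confined to one) is g(i) = 1 + (1 / log n) * sum_j p_ij log p_ij with p_ij = x(i, j) / sum_j x(i, j). The sketch below assumes that form; the function name and sample counts are hypothetical.

```python
import math

def log_entropy_weights(counts, n):
    """Assumed form: g(i) = 1 + sum_j p_ij * log(p_ij) / log(n),
    with p_ij = x(i, j) / (total occurrences of word i).
    Requires n > 1 node time zones."""
    totals = {}
    for (word, _col), c in counts.items():
        totals[word] = totals.get(word, 0) + c
    g = {word: 1.0 for word in totals}
    for (word, _col), c in counts.items():
        if c > 0:
            p = c / totals[word]
            g[word] += p * math.log(p) / math.log(n)
    return g

# Hypothetical counts over n = 3 node time zones.
counts = {
    ("job:cpu1", ("n1", 0)): 2,
    ("job:cpu1", ("n2", 0)): 2,
    ("job:cpu1", ("n3", 0)): 2,
    ("rogue:cpu2", ("n3", 0)): 5,
}
g = log_entropy_weights(counts, n=3)
print(g)
```

The evenly spread word comes out at 0 up to floating-point rounding, and the word seen in a single node time zone comes out at exactly 1.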

When the failure time zone estimation unit 12 detects a feature value at or above a given threshold, the display unit 13 displays the corresponding computer and time zone to the user. Besides showing the result on a display, any method that lets the user become aware of the result may be used, such as writing it to a file, sending an e-mail, or passing it to another program.

Next, the operation of the first embodiment will be described.
FIG. 5 is a flowchart explaining the operation of the word collection creation unit 1 according to the first embodiment. First, the process monitoring unit 2 monitors and acquires, through an interface provided by the OS 21, the resource usage status of the processes running on the OS 21 (step Sa1).

Next, the rank/group information generation unit 3 acquires, through the process monitoring unit 2, the resource usage status of all processes running on the OS 21 and ranks or groups the processes for each resource (step Sa2). The word generation unit 4 then generates a word collection representing information on the processes running on the computer by joining each process name to the rank or group information obtained by the rank/group information generation unit 3 with a suitable delimiter (step Sa3).

FIG. 6 is a flowchart explaining the operation of the failure occurrence analysis unit 10 according to the first embodiment. The collection unit 11 is started at fixed intervals and collects the word collections generated by the word collection creation units 1 of all computers subject to failure detection (step Sb1). Next, the failure time zone estimation unit 12 applies the IDF method or the log.entropy method to the word collections collected by the collection unit 11 to estimate the failed computer and the time zone of the failure (step Sb2).

Next, the failure time zone estimation unit 12 determines whether a failure has been found (step Sb3). If a failure has been detected, the display unit 13 displays the failed computer and the time zone to the user (step Sb4), after which the process returns to step Sb1 and repeats. If no failure has been detected in step Sb3, the process returns directly to step Sb1 and repeats.
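One pass through steps Sb1 to Sb4 might be sketched as below. The callables collect, estimate, and display stand in for the collection unit 11, the failure time zone estimation unit 12, and the display unit 13; their signatures, the feature values, and the threshold are hypothetical.

```python
def detect_failures(features, threshold):
    """Step Sb3: node time zones whose feature value reaches the threshold."""
    return sorted(col for col, value in features.items() if value >= threshold)

def analysis_cycle(collect, estimate, display, threshold):
    """One iteration of the loop in FIG. 6."""
    word_collections = collect()              # step Sb1
    features = estimate(word_collections)     # step Sb2
    failures = detect_failures(features, threshold)  # step Sb3
    if failures:
        display(failures)                     # step Sb4
    return failures

# Hypothetical stand-ins for the three units.
shown = []
failures = analysis_cycle(
    collect=lambda: ["collected word collections"],
    estimate=lambda _wc: {("node1", 0): 0.0, ("node3", 0): 1.1},
    display=shown.append,
    threshold=1.0,
)
print(failures)  # [('node3', 0)]
```

In a running system this function would be called repeatedly at the collection interval, which corresponds to the return to step Sb1.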

According to the first embodiment described above, failures can be detected using only the process information obtained from the OS, without using logs output by the processes. This makes it possible to detect failures in processes of a kind that do not output logs, which includes most user processes. In addition, because no knowledge database for failure detection is required, none has to be created in advance, and unknown failures can also be handled.

B. Second Embodiment

Next, a second embodiment of the present invention will be described.
FIG. 7 is a block diagram showing the configuration of the failure detection device according to the second embodiment of the present invention. In the figure, parts identical to those in FIG. 1 are given the same reference numerals, and their detailed description is omitted. In addition to the configuration of the first embodiment, the second embodiment includes a program path acquisition unit 5. It also includes a word generation unit 6 in place of the word generation unit 4 of the first embodiment.

The program path acquisition unit 5 acquires, through the interface of the OS 21, the program path of each process running on the OS 21. For example, it obtains the information that the program path of the process "firefox" is /usr/bin/firefox and that the program path of the process "xterm" is /usr/bin/xterm.

The word generation unit 6 generates a word collection representing information on the processes running on the computer by joining the program path obtained by the program path acquisition unit 5 to the rank or group information obtained by the rank/group information generation unit 3, the execution user ID of the process, the execution group ID, and so on, with a suitable delimiter.

For example, when a word collection is created from the CPU usage ranking of FIG. 3, the program path acquisition unit 5 shows that the process "firefox" was started by executing the program /usr/bin/firefox, and the process ranks first in CPU usage, so with ":" as the delimiter the generated word is /usr/bin/firefox:cpu1. Likewise, the words /usr/bin/emacs:cpu2 and /usr/bin/xterm:cpu3 are generated, and the final result is the three-word collection {/usr/bin/firefox:cpu1, /usr/bin/emacs:cpu2, /usr/bin/xterm:cpu3}.
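The difference between name-based and path-based words can be illustrated as follows. The helper names and the second path /tmp/firefox are hypothetical; the sketch only shows that two processes with the same name but different program paths yield the same word in the first embodiment and different words in the second.

```python
import os.path

def make_path_words(procs, sep=":"):
    """procs is a list of (program_path, cpu_rank) pairs."""
    return [f"{path}{sep}cpu{rank}" for path, rank in procs]

def make_name_words(procs, sep=":"):
    """Name-based words, using the last path component as the process name."""
    return [f"{os.path.basename(path)}{sep}cpu{rank}" for path, rank in procs]

# Hypothetical: the same process name started from two different locations.
genuine = [("/usr/bin/firefox", 1)]
suspect = [("/tmp/firefox", 1)]

print(make_name_words(genuine), make_name_words(suspect))  # identical words
print(make_path_words(genuine), make_path_words(suspect))  # distinct words
```

Because the path-based word for the second process appears on few computers, it would receive a high weight in the keyword extraction step.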

Next, the operation of the second embodiment will be described.
FIG. 8 is a flowchart explaining the operation of the word collection creation unit 1 according to the second embodiment. The process monitoring unit 2 monitors and acquires, through an interface provided by the OS 21, the resource usage status of the processes running on the OS 21 (step Sc1). Next, the rank/group information generation unit 3 acquires, through the process monitoring unit 2, the resource usage status of all processes running on the OS 21 and ranks or groups the processes for each resource (step Sc2).

Next, the program path acquisition unit 5 acquires, through the interface of the OS 21, the program path of each process running on the OS 21 (step Sc3). The word generation unit 6 generates a word collection representing information on the processes running on the computer by joining the program path name obtained by the program path acquisition unit 5 to the rank or group information obtained by the rank/group information generation unit 3 with a suitable delimiter (step Sc4). The generated word collection is passed as input to the collection unit 11, and the failure is then estimated by the same procedure as in the first embodiment.

According to the second embodiment described above, words are generated from the path of the program that created each process rather than from the process name, so that processes with the same name but different program paths can be distinguished. As a result, the execution of a program disguised by a malicious user is reflected in the failure detection result, and the accuracy of the failure detection achieved in the first embodiment can be further improved.
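
The effect of using program paths can be illustrated with a small comparison. This sketch rests on assumptions: the host names, the `sshd` example, and the path `/tmp/sshd` are hypothetical, and "a word seen on exactly one computer" is a deliberately simplified stand-in for the feature extraction performed by the failure occurrence analysis unit 10.

```python
from collections import Counter

# Hypothetical observations: (process name, program path, CPU rank) per computer.
observations = {
    "host-a": [("sshd", "/usr/sbin/sshd", 1)],
    "host-b": [("sshd", "/usr/sbin/sshd", 1)],
    "host-c": [("sshd", "/tmp/sshd", 1)],  # same name, different program
}


def words(entries, use_path):
    """Build words from either the process name or the program path."""
    return {f"{path if use_path else name}:cpu{rank}" for name, path, rank in entries}


def rare_words(use_path):
    """Return, per computer, the words that appear on exactly one computer."""
    per_host = {host: words(entries, use_path) for host, entries in observations.items()}
    counts = Counter(word for host_words in per_host.values() for word in host_words)
    return {
        host: sorted(word for word in host_words if counts[word] == 1)
        for host, host_words in per_host.items()
    }


by_name = rare_words(use_path=False)
by_path = rare_words(use_path=True)
print(by_name)
print(by_path)
```

With name-based words all three computers look identical, whereas path-based words single out the computer running the program from the unexpected location.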

The failure detection apparatus of the present invention can be applied to groups of computers, such as personal computers, workstations, and servers, that have nearly identical software and hardware configurations.

FIG. 1 is a block diagram showing the configuration of the failure detection apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the process information monitored by the process monitoring unit 2.
FIG. 3 is a diagram showing an example of ranking and grouping by the rank/group information generation unit 3.
FIG. 4 is a diagram showing an example of ranking and grouping by the rank/group information generation unit 3.
FIG. 5 is a flowchart for explaining the operation of the word collection creation unit 1 according to the first embodiment.
FIG. 6 is a flowchart for explaining the operation of the failure occurrence analysis unit 10 according to the first embodiment.
FIG. 7 is a block diagram showing the configuration of the failure detection apparatus according to the second embodiment of the present invention.
FIG. 8 is a flowchart for explaining the operation of the word collection creation unit 1 according to the second embodiment.

Explanation of symbols

1 Word collection creation unit
2 Process monitoring unit
3 Rank/group information generation unit
4 Word generation unit
5 Program path acquisition unit
6 Word generation unit
10 Failure occurrence analysis unit
11 Collection unit
12 Failure time zone estimation unit
13 Display unit
20-1 to 20-3 Computers
21 OS

Claims

A failure detection apparatus comprising:
process monitoring means for monitoring processes executed on each of a plurality of computers and acquiring, for each process, a resource usage status and process monitoring information;
rank/group information generating means for ranking or grouping the processes based on the resource usage status and monitoring information acquired by the process monitoring means;
word generating means for generating, for each process, a word used for feature value extraction based on rank information indicating the ranking of each process and group information indicating the grouping, both obtained by the rank/group information generating means, and on the monitoring information;
collecting means for collecting, for each of the plurality of computers, the words generated by the word generating means; and
failure time zone estimation means for estimating a computer in which a failure has occurred and a failure occurrence time zone based on the words for each process of the plurality of computers collected by the collecting means.

The failure detection apparatus according to claim 1, wherein the word generating means generates a word for each process by connecting a process name and rank information, a process name and group information, or a process name and monitoring information with a predetermined delimiter.

The failure detection apparatus according to claim 1, further comprising program path acquisition means for acquiring a program path for each process executed on each of the plurality of computers,
wherein the word generating means generates a word used for feature value extraction for each process based on the program path acquired by the program path acquisition means, the rank information, the group information, and the monitoring information.

The failure detection apparatus according to claim 3, wherein the word generating means generates a word for each process by connecting a program path and rank information, a program path and group information, or a program path and monitoring information with a predetermined delimiter.

The failure detection apparatus according to any one of the preceding claims, wherein the failure time zone estimation means obtains an index indicating how characteristic each of the words for each process of the plurality of computers collected by the collecting means is, calculates a feature value for each predetermined time zone based on the index, and estimates the occurrence of a failure, the computer in which the failure has occurred, and the failure occurrence time zone based on the feature value of each time zone.

The failure detection apparatus according to claim 5, wherein the failure time zone estimation means obtains the feature value for each predetermined time zone by weighting, using the index, the period subject to failure detection for each predetermined time zone.

A failure detection method comprising the steps of:
monitoring processes executed on each of a plurality of computers and acquiring, for each process, a resource usage status and process monitoring information;
ranking or grouping the processes based on the acquired resource usage status and monitoring information;
generating, for each process, a word used for feature value extraction based on rank information indicating the ranking of each process, group information indicating the grouping, and the monitoring information;
collecting the generated words for each of the plurality of computers; and
estimating a computer in which a failure has occurred and a failure occurrence time zone based on the collected words for each process of the plurality of computers.

A failure detection program that causes a computer of a failure detection apparatus, which detects the occurrence of a failure by monitoring processes executed on each of a plurality of computers, to execute the steps of:
monitoring the processes executed on each of the plurality of computers and acquiring, for each process, a resource usage status and process monitoring information;
ranking or grouping the processes based on the acquired resource usage status and monitoring information;
generating, for each process, a word used for feature value extraction based on rank information indicating the ranking of each process, group information indicating the grouping, and the monitoring information;
collecting the generated words for each of the plurality of computers; and
estimating a computer in which a failure has occurred and a failure occurrence time zone based on the collected words for each process of the plurality of computers.

A word collection creation device comprising:
process monitoring means for monitoring processes executed on a computer and acquiring, for each process, a resource usage status and process monitoring information;
rank/group information generating means for ranking or grouping the processes based on the resource usage status and monitoring information acquired by the process monitoring means; and
word generating means for generating, for each process, a word used for feature value extraction based on rank information indicating the ranking of each process and group information indicating the grouping, both obtained by the rank/group information generating means, and on the monitoring information.

The word collection creation device according to claim 9, further comprising program path acquisition means for acquiring a program path for each process executed on each of a plurality of computers,
wherein the word generating means generates a word used for feature value extraction for each process based on the program path acquired by the program path acquisition means, the rank information, the group information, and the monitoring information.

A failure occurrence analysis apparatus comprising:
collecting means for collecting, for each of a plurality of computers, words used for feature value extraction for each process, the words being generated based on rank information indicating a ranking of each process and group information indicating a grouping, which are obtained by ranking or grouping the processes based on a resource usage status of each process executed on the computer and process monitoring information, and on the monitoring information; and
failure time zone estimation means for estimating a computer in which a failure has occurred and a failure occurrence time zone based on the words for each process of the plurality of computers collected by the collecting means.

The failure occurrence analysis apparatus according to claim 11, wherein the failure time zone estimation means obtains an index indicating how characteristic each of the words for each process of the plurality of computers collected by the collecting means is, calculates a feature value for each predetermined time zone based on the index, and estimates the occurrence of a failure, the computer in which the failure has occurred, and the failure occurrence time zone based on the feature value of each time zone.

The failure occurrence analysis apparatus according to claim 12, wherein the failure time zone estimation means obtains the feature value for each predetermined time zone by weighting, using the index, the period subject to failure detection for each predetermined time zone.