JP5379719B2

JP5379719B2 - Computer, computer failure detection method, and program

Info

Publication number: JP5379719B2
Application number: JP2010040591A
Authority: JP
Inventors: 秀一洪; 和政戸部; 英敦三浦; 光宏谷野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-02-25
Filing date: 2010-02-25
Publication date: 2013-12-25
Anticipated expiration: 2030-02-25
Also published as: JP2011175570A

Description

The present invention relates to a computer, a computer failure detection method, and a program. In particular, it relates to a computer, a computer failure detection method, and a program that make it possible to detect a failure occurring in a computer quickly, before the failure leads to abnormal termination or the like.

A computer called a high availability system (hereinafter, "HA system") is a system designed so that it can continue operating without stopping even when a failure occurs, or a system designed so that, even when a failure forces it to stop, the time until recovery is as short as possible. To this end, an HA system raises its availability by, for example, making the main parts of its functional blocks redundant, so that when a failure occurs the failed part is isolated and operation continues on the normal parts. To realize this kind of failure response, it is important to configure the system so that the failed part can be detected quickly and appropriately.

To detect computer failures, hardware has seen the development of failure detection functions and redundancy in individual elements. Software has followed the same path: failure detection functions in individual programs and redundancy through HA cluster software are now applied.

Patent Document 1 proposes a configuration for detecting operation delays when a plurality of programs cooperate through events. In that configuration, the events to be monitored are specified, the wait and notify operations for the specified event notifications are recorded, the resumption of execution of waiting threads is recorded, and the records are inspected to detect a thread that does not resume running even though the event has been notified.

Patent Document 1: JP 2007-72958 A

However, the technique of Patent Document 1 is considered unable to deal with a failure that occurs while a single program is running.

Furthermore, in today's open systems, which are realized by combining a large number of software components, many people and organizations take part in development, and their design standards may differ. It can therefore be difficult to keep software design consistent and the design level uniform.

Software failures are broadly classified into three types: abnormal termination, error output, and slowdown/hang-up. Abnormal termination means that processing of a program or application does not end in a normal state, for example because of a bug in the program. Error output means that, when some error is detected during program execution, it is reported to the user or the like through an output screen or similar means. Slowdown/hang-up is a phenomenon in which, for some reason, the processing speed of the computer drops or processing stops; at the user interface it may be perceived as slower response to operation input from a mouse or the like, a frozen screen, and so on.

Failures accompanied by abnormal termination or error output are not much of a problem even if the software design level is not uniform, because as long as developers recognize the events that cause them as abnormal, detection functions for them are implemented individually.

Slowdown/hang-up, however, can occur in various parts of a system, so even if it is handled in individual hardware and software components, it can still cause a failure when multiple systems work together. In addition, some designs do not detect, in particular, failures that unduly prolong the processing time of the processor. For these reasons, slowdown/hang-up detection cannot yet be said to be sufficient. Consequently, even when the system stops providing its service because of a slowdown/hang-up, this may not be detected automatically, and system downtime may result.

The present invention has been made in view of the above circumstances, and one of its objects is to provide a computer, a computer failure detection method, and a program that make it possible to detect a failure occurring in a computer quickly, before the failure leads to abnormal termination or the like.

To achieve the above and other objects, one aspect of the present invention is a computer that comprises a processor and a memory and that executes at least one software program stored in the memory by having the processor process a plurality of processes constituting the software program. The computer comprises a processor processing reference value acquisition unit and a reference value comparison processing unit. For each process, over the period from when the processor starts processing the process until processing ends, the processor processing reference value acquisition unit sequentially measures and acquires, a plurality of times, a processor use time, which is the time during which the processor is processing the process, and a processor non-use time, which is the time during which the processor has stopped processing the process. In accordance with predetermined statistical processing, it calculates and stores, for each process, a processor use time reference value, which is a statistical reference value of the processor use times, and a processor non-use time reference value, which is a statistical reference value of the processor non-use times. While the processor is processing any of the plurality of processes constituting the software program, the reference value comparison processing unit measures the processor use time and the processor non-use time for that process and successively compares them with the processor use time reference value and the processor non-use time reference value stored for that process. When it determines that the comparison result does not satisfy a predetermined criterion, it determines that a failure has occurred during processing of that process. When it determines that the measured processor non-use time is longer than a predetermined first processor non-use time threshold and that the processor use time at that time is shorter than a predetermined processor use time threshold, it determines that a failure in which the transition time becomes shorter than the predetermined processor use time threshold has occurred in that process.

According to the present invention having the above configuration, there are provided a computer, a computer failure detection method, and a program that make it possible to detect a failure occurring in a computer quickly, before the failure leads to abnormal termination or the like.

FIG. 1 is a schematic diagram showing how the process execution state in a CPU is modeled in order to realize the failure detection method of the present invention.
FIG. 2 is a schematic diagram outlining a failure in which a state different from the usual one is recognized.
FIG. 3 is a schematic diagram outlining a failure in which a state transition takes an abnormally long time.
FIG. 4 is a schematic diagram outlining a failure in which a certain state continues for an abnormally long time.
FIG. 5 is an overall view of the hardware configuration of the computer 1 according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of the software configuration of the computer 1.
FIG. 7 is a diagram showing an example of the process management table 700.
FIG. 8 is a diagram showing an example of the statistical information management table 800.
FIG. 9 is a diagram showing an example of the temporary storage table 900.
FIG. 10 is a diagram showing another example of the process management table 700 of FIG. 7.
FIG. 11 is a diagram showing an example of the statistical information management table 800 for statistical information B.
FIG. 12 is a diagram showing an example of the processing flow of the state transition monitoring process.
FIG. 13 is a diagram showing an example of the processing flow of the statistical information collection process.
FIG. 14A is a diagram showing an example of the processing flow of the statistical information comparison process.
FIG. 14B is a diagram showing an example of the processing flow of the statistical information comparison process.

An embodiment of the present invention will be described in detail below with reference to the drawings.

<< Outline of Failure Detection Method in the Present Embodiment >>
General software (in particular, a service providing program) provides its service by having the many processes that constitute the software executed while each repeats certain state transitions on the processor that processes them. A period during which a process sleeps for a certain time (for example, one second), that is, the processor non-use time, is defined as a "state", and a period during which that state changes, that is, the processor use time, is defined as a "transition". In other words, the waiting period from when a process goes to sleep until it starts running is called a "state", and the period from when it starts running until it sleeps again is called a "transition". In this specification, the term "process" is not limited to cases where a UNIX OS is used; it refers generally to a unit of program execution handled by an OS.
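The definitions of "transition" and "state" can be illustrated with a minimal Python sketch. It assumes a hypothetical event list of `("run", time)` and `("sleep", time)` entries in milliseconds; the event format and the names `StateTransition` and `pair_state_transitions` are illustrative and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class StateTransition:
    transition_ms: int  # processor use time: run start until sleep
    state_ms: int       # processor non-use time: sleep until next run start


def pair_state_transitions(events):
    """Pair alternating run/sleep events into (transition, state) records.

    events: list of ("run" | "sleep", timestamp_ms) tuples (hypothetical format).
    A pair is emitted only once the following "run" event closes the state.
    """
    pairs = []
    run_start = None
    sleep_start = None
    for kind, t in events:
        if kind == "run":
            if run_start is not None and sleep_start is not None:
                pairs.append(StateTransition(sleep_start - run_start, t - sleep_start))
            run_start, sleep_start = t, None
        elif kind == "sleep" and run_start is not None and sleep_start is None:
            sleep_start = t
    return pairs


events = [("run", 0), ("sleep", 500), ("run", 1500), ("sleep", 1800), ("run", 2800)]
pairs = pair_state_transitions(events)
print(pairs)
```

Each record holds one processor use time followed by one processor non-use time, which is the unit the later statistics are taken over.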

The idea behind the failure detection method of this embodiment is described with reference to FIGS. 1 to 4. FIG. 1 is a schematic diagram showing how the process execution state in the processor is modeled in order to realize the failure detection method of the present invention. Here, the situation in which a process has started and is using the processor (the process is running) is defined as a "transition" 1001. The situation that pairs with a "transition", in which the process does not use the processor for a certain time, is defined as a "state" 1002. A failure is detected by tracking the state transitions formed by pairs of a transition 1001 and a state 1002 and capturing behavior that differs from normal operation (for example, the insertion of the state 1003 in FIG. 1).

FIG. 2 is a schematic diagram outlining a failure in which a state different from the usual one is recognized. It illustrates a case where the transition from the state 2001 toward the state 2002 is, unlike in normal operation, unduly short, and a new transition 2003 to an unusual state 2004 is recognized in the meantime. The example of FIG. 2 assumes, for instance, a failure in which a lock wait that normally ends in a short time is prolonged for some reason, so that the processing time of the process as a whole increases.

FIG. 3 is a schematic diagram outlining a failure in which a state transition takes an abnormally long time. It shows a case where the transition 3003 from the state 3001 to the state 3002 takes much longer than usual and the state 3002 is not reached. The example of FIG. 3 assumes failures such as the processor being occupied for an unduly long time by another process, or an infinite loop.

FIG. 4 is a schematic diagram outlining a failure in which a certain state continues for an abnormally long time. It shows a case where, in the state 1201, the processor is not allocated for longer than in normal operation and the process does not move on to the transition 1202. The example of FIG. 4 assumes failures such as a hang-up or an improper I/O wait.
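The three patterns of FIGS. 2 to 4 can be sketched as a simple classifier. This is only an illustration under stated assumptions: the specification speaks of "predetermined" thresholds without giving values, so the `tolerance` factor used here to derive thresholds from the reference values is hypothetical, as are the function name and the returned labels. The first branch follows the rule in the summary (non-use time above its threshold while use time is below its threshold), which appears to correspond to the FIG. 2 pattern.

```python
def classify(measured_transition_ms, measured_state_ms,
             ref_transition_ms, ref_state_ms, tolerance=2.0):
    """Label one measured state transition against its reference values.

    tolerance is a hypothetical factor: thresholds are taken as the reference
    value multiplied or divided by it. The source only says "predetermined".
    """
    state_too_long = measured_state_ms > ref_state_ms * tolerance
    # Long non-use time combined with an unduly short use time (FIG. 2 pattern).
    if state_too_long and measured_transition_ms < ref_transition_ms / tolerance:
        return "short_transition"
    # The transition itself runs far longer than usual (FIG. 3 pattern).
    if measured_transition_ms > ref_transition_ms * tolerance:
        return "long_transition"
    # The process stays off the processor far longer than usual (FIG. 4 pattern).
    if state_too_long:
        return "long_state"
    return "normal"


print(classify(500, 1000, 500, 1000))
print(classify(100, 5000, 500, 1000))
```

Deriving both thresholds from one factor keeps the sketch short; separate, independently tuned thresholds per process would be an equally valid reading of the text.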

In the failure detection method of this embodiment, the change over time of the "state transitions" described above is measured inside the operating system (OS) running on the processor, statistics are taken, and software failures are detected on the basis of this statistical information. Monitoring and detecting failures inside the OS under a single policy absorbs variations in the design level of the installed software and realizes quick and reliable failure detection. The policy is to focus on the states in which the processor is not being used by the process (the process is sleeping), to acquire statistical information on the time required for the transition from one state to the next so as to understand the behavior of the process, and to detect a failure by capturing changes in that behavior.

Specific forms of failure detection are described later with reference to example processing flows of the failure detection method.

<< System Configuration >>
Next, the computer 1 to which the failure detection method of this embodiment is applied is described. FIG. 5 shows an overall view of the hardware configuration of the computer 1 according to an embodiment of the present invention.

As shown in FIG. 5, the computer 1 comprises a central processing unit 110, a control unit 120, a main storage device 130, an auxiliary storage device 140, an input device 150, an output device 160, a communication control unit 170, and an internal bus 180 that connects them so that they can communicate with one another.

The central processing unit 110 is a processor including, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). For simplicity, it is referred to below as the "CPU". The control unit 120 is an interface that controls data transfer between the CPU 110 and the other hardware blocks.

The main storage device 130 is a memory comprising storage elements such as RAM (Random Access Memory) and ROM (Read Only Memory). As described later, the OS that controls the computer 1 as a whole, various programs, and data such as tables are loaded into it. The auxiliary storage device 140 is a storage device such as an HDD (Hard Disk Drive) or a semiconductor storage device (Solid State Drive, "SSD"), and stores the various programs executed on the computer 1 and the tables referred to when those programs are executed.

The input device 150 is, for example, a keyboard or a mouse, and accepts operation input from the user. The output device 160 is an output device such as a liquid crystal monitor or a printer, and includes devices capable of other kinds of output such as audio. The communication control unit 170 includes, for example, a NIC (Network Interface Card) or an HBA (Host Bus Adapter), and has the function of realizing communication with other devices.

Next, the software configuration of the computer 1 is described. FIG. 6 shows an example of the software configuration of the computer 1, with each piece of software loaded into the main storage device 130 for execution by the CPU 110. This software is normally stored in the auxiliary storage device 140 and is loaded into the main storage device 130 when the computer 1 starts up or when an instruction is received from the OS in response to a user's operation input.

In the example of FIG. 6, the computer 1 has, as software, an OS 131, a program 132, and a program for realizing a failure detection unit 133.

The OS 131 controls the operation of each component of the computer 1. The OS 131 has a process scheduler 1311 (process allocation unit). It loads the program 132, described later, from the auxiliary storage device 140 into the main storage device 130 and uses the process scheduler 1311 to allocate the CPU 110 to the processes constituting the loaded program 132 so that they are executed. Windows (registered trademark) or a UNIX (registered trademark) family OS, for example, is suitable as the OS 131, but the OS is not particularly restricted.

The program 132 is any program, including an application program, to be executed on the computer 1. As described above, it is loaded by the OS 131 from the auxiliary storage device 140 into the main storage device 130 and executed by the CPU 110. In the example of this embodiment, the program 132 has two processes, process A (132A) and process B (132B). When the program 132 is executed by the CPU 110, the process scheduler 1311 allocates the CPU 110 to process A and process B as appropriate.

The failure detection unit 133 is a functional block realized by the CPU 110 executing a program that implements the functions of the failure detection unit 133 (in FIG. 6, the failure detection unit 133 is shown as a program loaded into the main storage device 130). The failure detection unit 133 comprises a state transition monitoring unit 1331, a statistical information collection processing unit (processor processing reference value acquisition unit) 1332, a statistical information storage unit 1333, a statistical information comparison processing unit (reference value comparison processing unit) 1334, and a failure notification unit 1335.

The failure detection unit 133 is loaded by the OS 131 from the auxiliary storage device 140 into the main storage device 130 when the computer 1 starts up. Once loaded, the failure detection unit 133 acquires information on the operations that the process scheduler 1311 performs on the CPU 110, through the control unit 120, for the processes subject to failure detection (in the example of FIG. 6, process A (132A) and process B (132B), which execute the program 132), and the statistical information collection processing unit 1332 saves the collected statistical information in the statistical information storage unit 1333.

After the statistical information has been saved, the failure detection unit 133 starts failure detection. During failure detection, the statistical information comparison processing unit 1334 compares the information on CPU 110 operations obtained from the state transition monitoring unit 1331 with the statistical information in the statistical information storage unit 1333 and checks whether a failure has occurred in any process. When the failure detection unit 133 confirms that a failure has occurred, it notifies the OS 131 through the failure notification unit 1335. On receiving this notification, the OS 131 can present a warning message or the like to an administrator through the output device 160.

Next, the tables generated and referred to by the failure detection unit 133 are described. In this embodiment, the statistical information storage unit 1333 of the failure detection unit 133 holds a process management table 700, a statistical information management table 800, and a temporary storage table 900.

FIG. 7 shows an example of the process management table 700. The process management table 700 is used to manage the status of the statistical information on the process 132 monitored by the failure detection unit 133, and records the following items: monitoring target process 701, statistical information collection completion flag 702, and statistical information table 703. The process management table 700 is generated in the statistical information storage unit 1333 by the failure detection unit 133 when the computer 1 starts up.

The monitoring target process 701 records information identifying the process 132 to be monitored by the failure detection unit 133. In the example of FIG. 7, the monitored process A (132A) and the monitored process B (132B) are running on the computer 1, so both are recorded. When either process A or process B finishes executing, its record is deleted from the process management table 700. If the CPU 110 is allocated to another process to be monitored (for example, a process C) in addition to process A and process B, a record for process C is added. An administrator or the like can also register, from the input device 150, processes that they want monitored in the monitoring target process 701.

The statistical information collection completion flag 702 records, for each monitoring target process 701, whether the statistical information that the failure detection unit 133 uses to judge whether a failure has occurred has been collected. "True" is recorded when collection is complete, and "False" when it is not yet complete.

The statistical information table 703 identifies and records, for each monitoring target process 701, the statistical information table (described later) that the failure detection unit 133 should refer to. In the example of FIG. 7, the statistical information used for failure detection has already been collected for process A (132A), so the statistical information A table in which it is recorded (specifically, the statistical information management table 800 described later) is associated with process A. For process B, collection of statistical information is not yet complete, so the corresponding statistical information table 703 is associated with the temporary storage tables 900 described later ("temporary storage 1", "temporary storage 2"), which indicates that statistical information is still being collected.
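The three items of the process management table 700 can be modeled as a small in-memory structure. The following Python sketch is an assumption-laden illustration, not the patented implementation: the class names, method names, and the use of a dictionary keyed by process name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessRecord:
    collection_complete: bool = False           # item 702: True / False
    tables: list = field(default_factory=list)  # item 703: referenced table names


class ProcessManagementTable:
    """Hypothetical model of table 700, keyed by monitoring target process (item 701)."""

    def __init__(self):
        self.records = {}

    def register(self, process, temp_tables):
        # A newly monitored process starts with temporary storage tables only.
        self.records[process] = ProcessRecord(False, list(temp_tables))

    def mark_complete(self, process, stats_table):
        # Collection finished: point the record at the merged statistics table.
        record = self.records[process]
        record.collection_complete = True
        record.tables = [stats_table]

    def remove(self, process):
        # A process that has finished executing is deleted from the table.
        self.records.pop(process, None)


table = ProcessManagementTable()
table.register("Process A", ["temporary storage 1", "temporary storage 2"])
table.mark_complete("Process A", "statistical information A")
table.register("Process B", ["temporary storage 1", "temporary storage 2"])
print(table.records)
```

After these calls the structure mirrors the situation described for FIG. 7: process A refers to its completed statistics, while process B still refers to temporary storage.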

Next, the statistical information management table 800 is described. FIG. 8 shows an example of the statistical information management table 800 in this embodiment. The statistical information management table 800 records the statistical information that the failure detection unit 133 compares against for a monitored process. It records the following items: monitoring item 801, transition time 802, and state waiting time 803. The monitoring item 801 registers, in order, the "state transitions", that is, the combinations of a "transition" and a "state", that occur after the CPU 110 is allocated to the monitored process (here, process A (132A)) and execution starts; in the example of FIG. 8 they are registered as "state transition 1", "state transition 2", and so on.

The transition time 802 (processor use time reference value) records a statistically obtained value for the time required for the transition in the associated state transition. The state waiting time 803 (processor non-use time reference value) records a statistically obtained value for the time the state lasted in the associated state transition. The example of FIG. 8 shows that, for state transition 1, a transition time of 500 ms and a state waiting time of 1000 ms were obtained statistically. The values in FIG. 8 are hypothetical, and the unit need not be milliseconds. The statistical processing used to obtain the transition time 802 and the state waiting time 803 may be any appropriate method, such as taking the arithmetic mean, the central value (median), or the mode of a plurality of measured values.

Next, the temporary storage table 900 is described. FIG. 9 shows an example of the temporary storage table 900. The temporary storage table 900 is generated in the statistical information storage unit 1333 by the failure detection unit 133, triggered by the generation of the process management table 700 and the registration of the monitored processes at system startup. As shown in FIG. 7, a plurality of temporary storage tables 900 are generated in association with each monitored process for which statistical information collection is not complete (completion flag = "False"). This is so that the transition time and state waiting time can be collected and recorded multiple times, in order to obtain statistically reliable values for the state transitions of the monitored process. Once the specified number of collections has been made, the resulting temporary storage tables 900 are merged by a predetermined procedure corresponding to the statistical processing exemplified above, and the statistical information management table 800 of FIG. 8 is generated.
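The merge of several temporary storage tables into one statistical information management table can be sketched as follows. This is a minimal illustration: the dictionary layout, the function name `merge_temp_tables`, and the sample values are assumptions, and the sketch presumes every temporary table contains the same monitoring items. The reducer defaults to the arithmetic mean; the median or mode from Python's `statistics` module can be passed instead.

```python
import statistics


def merge_temp_tables(temp_tables, reducer=statistics.mean):
    """Merge temporary tables into reference values per monitoring item.

    temp_tables: list of dicts {monitoring item: (transition_ms, state_ms)}
    (hypothetical layout). All tables are assumed to share the same items.
    """
    merged = {}
    for item in temp_tables[0]:
        transitions = [table[item][0] for table in temp_tables]
        states = [table[item][1] for table in temp_tables]
        merged[item] = (reducer(transitions), reducer(states))
    return merged


# Three hypothetical collections for one monitored process.
temp1 = {"state transition 1": (480, 1000), "state transition 2": (200, 900)}
temp2 = {"state transition 1": (520, 1000), "state transition 2": (220, 1100)}
temp3 = {"state transition 1": (500, 1000), "state transition 2": (210, 1000)}

stats_table = merge_temp_tables([temp1, temp2, temp3])
median_table = merge_temp_tables([temp1, temp2, temp3], reducer=statistics.median)
print(stats_table)
```

Passing the reducer as a parameter reflects the text's statement that any appropriate statistical method may be used; which one suits a given workload is left open here.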

仮記憶テーブル９００に記録される、監視項目９０１、遷移時間９０２、及び状態待機時間９０３は、統計情報管理テーブル８００の対応する項目と同一である。 The monitoring item 901, the transition time 902, and the state standby time 903 recorded in the temporary storage table 900 are the same as the corresponding items in the statistical information management table 800.

FIG. 10 shows another example of the process management table 700 of FIG. 7. In the process management table 700 of FIG. 10, statistical information collection has also been completed for process B, so “True” is recorded in the statistical information collection completion flag 702 and the associated “Statistical information B” is recorded in the statistical information table 703 item.

図１１に、統計情報Ｂに関する統計情報管理テーブル８００の一例を示している。図１１の統計情報管理テーブル８００は、プロセスＢについての障害検知処理を実行する際に、障害検知部１３３によって参照される。図１１の統計情報管理テーブル８００に記録されている内容は、図８と同様である。 FIG. 11 shows an example of the statistical information management table 800 related to the statistical information B. The statistical information management table 800 in FIG. 11 is referred to by the failure detection unit 133 when executing the failure detection process for the process B. The contents recorded in the statistical information management table 800 in FIG. 11 are the same as those in FIG.

<< Contents of failure detection processing >>
Next, based on the system configuration described above, the failure detection processing executed by the failure detection unit 133 of the computer 1 is described with reference to example processing flows.

State transition monitoring process
First, the state transition monitoring process executed by the state transition monitoring unit 1331 of the failure detection unit 133 is described. FIG. 12 shows an example of the processing flow of the state transition monitoring process. In this process, the state transition monitoring unit 1331 mainly decides, for each process stored as a monitoring target in the process management table 700, whether to collect the statistical information to be used for failure detection or to actually execute failure detection using statistical information that has already been acquired.

First, the state transition monitoring unit 1331 monitors the process scheduler 1311 of the OS 131 and determines whether a process to which the OS 131 has allocated the CPU 110 and whose processing has started is one of the processes stored in the process management table 700 (S1201). If it determines that the started process is not stored in the process management table 700 (S1201, No), the state transition monitoring unit 1331 continues to monitor the process scheduler 1311 until processing of a process registered as a monitoring target in the process management table 700 starts.

If it determines that the started process is stored in the process management table 700 (S1201, Yes), the state transition monitoring unit 1331 checks the statistical information collection completion flag 702 associated with that process (S1202). If “True” is recorded in the flag 702 (S1202, Yes), the state transition monitoring unit 1331 causes the statistical information comparison processing unit 1334 to execute the statistical information comparison process (S1203). The statistical information comparison process is described later.

統計情報採取完了フラグ７０２に「Ｆａｌｓｅ」が記録されていると判断した場合（Ｓ１２０２、Ｎｏ）、状態遷移監視部１３３１は、統計情報採取処理部１３３２に統計情報採取処理を実行させる（Ｓ１２０４）。統計情報採取処理の内容については後述する。 When it is determined that “False” is recorded in the statistical information collection completion flag 702 (S1202, No), the state transition monitoring unit 1331 causes the statistical information collection processing unit 1332 to execute statistical information collection processing (S1204). The contents of the statistical information collection process will be described later.

After the statistical information collection process ends, the state transition monitoring unit 1331 determines whether that process has been executed the specified number of times (S1205). The required number of executions may be held, for example, as a parameter in the state transition monitoring unit 1331. If the collection process has been executed the specified number of times (S1205, Yes), the state transition monitoring unit 1331 applies statistical processing, by a predetermined procedure, to the state transition information collected so far and stored in the temporary storage tables 900, and stores the resulting statistical information in the statistical information management table 800 (S1207). As noted above, any appropriate statistical processing can be applied; for example, the transition times and state waiting times measured over the specified number of runs of a process may be simply averaged, or their median or mode may be taken.

Next, for the process being handled, the state transition monitoring unit 1331 changes the statistical information collection completion flag 702 in the process management table 700 to “True” and identifies, in the statistical information table 703 item, the corresponding statistical information management table 800 (S1208).

Ｓ１２０５において、統計情報採取処理が指定した回数実行されていないと判断した場合（Ｓ１２０５、Ｎｏ）、状態遷移監視部１３３１は、新規の仮記憶テーブル９００を作成し、次回の統計情報採取処理では、新たに作成した仮記憶テーブル９００（例えば「仮記憶２」）に採取した情報を記録する（Ｓ１２０６）。 When it is determined in S1205 that the statistical information collection process has not been executed the specified number of times (S1205, No), the state transition monitoring unit 1331 creates a new temporary storage table 900, and in the next statistical information collection process, The collected information is recorded in the newly created temporary storage table 900 (for example, “temporary storage 2”) (S1206).

以上述べた状態遷移監視処理は、システム起動後、コンピュータ１又は障害検知部１３３を稼働させるプログラムが終了しない限り繰り返し実行される。 The state transition monitoring process described above is repeatedly executed unless the program for operating the computer 1 or the failure detection unit 133 is terminated after the system is started.

According to the state transition monitoring process described above, the failure detection unit 133 can automatically collect the statistical information on normal process state transitions that is needed to detect failures of a process based on its execution state (state transitions) on the CPU 110, and can then monitor state transitions based on that statistical information.
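The decision flow of S1201 to S1208 can be sketched as a single dispatch function. This is a rough sketch under stated assumptions, not the patented implementation: `ProcessEntry`, `on_process_started`, `REQUIRED_RUNS`, and the injected `collect`, `compare`, and `merge` callables are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_RUNS = 3  # hypothetical parameter: number of collection runs required

@dataclass
class ProcessEntry:
    """One row of the process management table (assumed layout)."""
    collected: bool = False              # collection completion flag 702
    stats_table: Optional[list] = None   # statistical information table 703
    temp_tables: list = field(default_factory=list)  # temporary tables 900

def on_process_started(name, table, collect, compare, merge):
    """Handle one 'process started' event observed at the scheduler."""
    entry = table.get(name)
    if entry is None:                        # S1201, No: not monitored
        return "ignored"
    if entry.collected:                      # S1202, Yes
        compare(name, entry.stats_table)     # S1203: comparison process
        return "compared"
    # S1204: one collection run; its result fills one temporary table.
    entry.temp_tables.append(collect(name))
    if len(entry.temp_tables) >= REQUIRED_RUNS:        # S1205, Yes
        entry.stats_table = merge(entry.temp_tables)   # S1207
        entry.collected = True                         # S1208
        return "learned"
    return "collecting"                      # S1206: next run uses a new table
```

Passing the collection, comparison, and merge steps in as callables keeps the dispatch logic separate from how times are actually measured, which the specification ties to monitoring the process scheduler.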

Statistical information collection process
Next, the statistical information collection process executed by the statistical information collection processing unit 1332 of the failure detection unit 133 (S1204 in the state transition monitoring process of FIG. 12) is described. FIG. 13 shows an example of the processing flow of the statistical information collection process in the present embodiment. In this process, the statistical information collection processing unit 1332 mainly collects statistical information and stores it in the statistical information storage unit 1333.

まず、統計情報採取処理部１３３２は、監視対象であるプロセスによるＣＰＵ１１０の使用時間の計測を開始する（Ｓ１３０１）。次いで、統計情報採取処理部１３３２は、監視対象プロセスがＣＰＵ１１０の使用を開始しているか判断し（Ｓ１３０２）、使用していると判断した場合（Ｓ１３０２、Ｙｅｓ）、Ｓ１３０１で開始したＣＰＵ使用時間の計測を継続する。 First, the statistical information collection processing unit 1332 starts measuring the usage time of the CPU 110 by the process to be monitored (S1301). Next, the statistical information collection processing unit 1332 determines whether the monitoring target process has started using the CPU 110 (S1302). If it is determined that the monitoring target process is using (S1302, Yes), the CPU usage time started in S1301 is determined. Continue measuring.

If it determines in S1302 that the monitored process is not using the CPU 110 (S1302, No), the statistical information collection processing unit 1332 suspends the measurement of CPU usage time, so that only the time during which the CPU 110 is actually used is measured, and stores the measured usage time in an appropriate location, for example in the main storage device 130 (S1303). A process stops using the CPU 110 when, for example, the process itself releases the CPU 110 or the process scheduler 1311 suspends the process's use of the CPU 110.

次いで、統計情報採取処理部１３３２は、監視対象プロセスが再度ＣＰＵ１１０を使用開始するまでの時間、すなわち監視対象のプロセスによってＣＰＵ１１０が使用されていない時間を計測する（Ｓ１３０４）。 Next, the statistical information collection processing unit 1332 measures the time until the monitored process starts using the CPU 110 again, that is, the time when the CPU 110 is not used by the monitored process (S1304).

次いで、統計情報採取処理部１３３２は、Ｓ１３０４で計測したＣＰＵ１１０を使用していない時間が、状態として認識する閾値として設定した値を超えていないか判断する（Ｓ１３０５）。この閾値は、ＣＰＵ１１０を使用していないとして計測された時間を「状態」として把握してよいか判断するためのパラメータであり、任意の数値を設定することができる。 Next, the statistical information collection processing unit 1332 determines whether the time during which the CPU 110 measured in S1304 is not used exceeds a value set as a threshold value recognized as a state (S1305). This threshold is a parameter for determining whether or not the time measured as not using the CPU 110 may be grasped as a “state”, and an arbitrary numerical value can be set.

計測したＣＰＵ１１０を使用していない時間が閾値を超えていないと判断した場合（Ｓ１３０５、Ｎｏ）、統計情報採取処理部１３３２は、当該計測値を状態として把握せず、まだ遷移であると把握するため、ＣＰＵ１１０の使用時間の計測を再開する（Ｓ１３１１）。 If it is determined that the time during which the measured CPU 110 is not used does not exceed the threshold (No in S1305), the statistical information collection processing unit 1332 does not grasp the measurement value as a state and grasps that it is still a transition. Therefore, the measurement of the usage time of the CPU 110 is resumed (S1311).

計測したＣＰＵ１１０を使用していない時間が閾値を超えていると判断した場合（Ｓ１３０５、Ｙｅｓ）、統計情報採取処理部１３３２は、プロセス管理テーブル７００を参照し、ＣＰＵ１１０を使用していない時間の計測値を、監視対象プロセスに関して現在使用されている仮記憶テーブル９００の監視項目９０１に記録されている「状態遷移１」に対応する状態待機時間９０３として登録する。 When it is determined that the time when the measured CPU 110 is not used exceeds the threshold (S1305, Yes), the statistical information collection processing unit 1332 refers to the process management table 700 and measures the time when the CPU 110 is not used. The value is registered as the state waiting time 903 corresponding to “state transition 1” recorded in the monitoring item 901 of the temporary storage table 900 currently used for the monitoring target process.

次いで、統計情報採取処理部１３３２は、Ｓ１３０３において格納しておいたＣＰＵ１１０の使用時間計測値を、同じく現在使用している仮記憶テーブル９００の監視項目９０１に記録されている「状態遷移１」に対応する遷移時間９０２として登録する（Ｓ１３０７）。 Next, the statistical information collection processing unit 1332 stores the measured usage time value of the CPU 110 stored in S1303 in the “state transition 1” recorded in the monitoring item 901 of the temporary storage table 900 that is also currently used. The corresponding transition time 902 is registered (S1307).

以上で、監視対象プロセスに関する最初の状態遷移に関する遷移時間９０２及び状態待機時間９０３の計測及び記録が完了したこととなるので、統計情報採取処理部１３３２は、仮記憶テーブル９００において、次の状態遷移（図９の状態遷移２）について遷移時間９０２と状態待機時間９０３を記録するため、監視項目９０１として次の状態遷移２に関するレコードを追加する。 As described above, since the measurement and recording of the transition time 902 and the state standby time 903 related to the first state transition related to the monitoring target process are completed, the statistical information collection processing unit 1332 stores the next state transition in the temporary storage table 900. In order to record the transition time 902 and the state waiting time 903 for (the state transition 2 in FIG. 9), a record relating to the next state transition 2 is added as the monitoring item 901.

次に、統計情報採取処理部１３３２は、監視対象としているプロセスが終了したか判断し（Ｓ１３０９）、終了していないと判断した場合（Ｓ１３０９、Ｎｏ）、次の状態遷移における遷移時間９０２を計測するために、ＣＰＵ１１０の使用時間を再度計測開始して（Ｓ１３１０）、Ｓ１３０２に処理を戻し、監視対象のプロセスが終了するまで、状態遷移を記録する。Ｓ１３０９で、監視対象としているプロセスが終了したと判断した場合（Ｓ１３０９、Ｙｅｓ）、統計情報採取処理部１３３２は、処理を終了する。 Next, the statistical information collection processing unit 1332 determines whether or not the process to be monitored has ended (S1309), and when determining that it has not ended (S1309, No), measures the transition time 902 in the next state transition. Therefore, measurement of the usage time of the CPU 110 is started again (S1310), the process is returned to S1302, and the state transition is recorded until the process to be monitored is completed. If it is determined in S1309 that the process to be monitored has ended (S1309, Yes), the statistical information collection processing unit 1332 ends the process.

According to the statistical information collection process described above, the transition time and state waiting time of each state transition can be obtained from the start to the end of the monitored process (specifically, from start to end in FIG. 1), for use in calculating the statistical information used in the failure detection processing.
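The way the collection process splits a run into transitions and states can be illustrated on a recorded timeline. The sketch below assumes the scheduler observations are already available as a list of `(uses_cpu, duration_ms)` slices. That input format, the function name `collect_transitions`, and the handling of a trailing idle period at process end are assumptions made for illustration.

```python
def collect_transitions(slices, state_threshold_ms):
    """Split a timeline into (transition_ms, wait_ms) records.

    slices: ordered (uses_cpu, duration_ms) pairs for the monitored process.
    An idle gap counts as a "state" only if it exceeds state_threshold_ms;
    shorter gaps are treated as part of the ongoing transition.
    """
    records = []
    cpu_ms = 0    # accumulated CPU usage time of the current transition
    idle_ms = 0   # length of the current idle gap
    for uses_cpu, duration_ms in slices:
        if uses_cpu:
            if idle_ms > state_threshold_ms:
                # The gap was long enough to be a state: close the record
                # and start measuring the next transition.
                records.append((cpu_ms, idle_ms))
                cpu_ms = 0
            # Otherwise the gap is ignored and measurement simply resumes.
            idle_ms = 0
            cpu_ms += duration_ms
        else:
            idle_ms += duration_ms
    if idle_ms > state_threshold_ms:
        # Assumption: a long idle period at process end also forms a state.
        records.append((cpu_ms, idle_ms))
    return records

timeline = [(True, 300), (False, 20), (True, 200),
            (False, 1000), (True, 100), (False, 400)]
rows = collect_transitions(timeline, state_threshold_ms=100)
```

In this hypothetical timeline the 20 ms gap is below the 100 ms threshold, so the 300 ms and 200 ms CPU slices are counted as one 500 ms transition followed by a 1000 ms state.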

Statistical information comparison process
Next, the statistical information comparison process is described. FIGS. 14A and 14B show an example of its processing flow in the present embodiment. In this process, the statistical information comparison processing unit 1334 of the failure detection unit 133 mainly performs failure detection by comparing information obtained from the state transition monitoring unit 1331 with the information stored in the statistical information storage unit 1333. As in the statistical information collection process illustrated in FIG. 13, the times relating to the state transitions of the monitored process are measured by monitoring the state of the process scheduler 1311 of the OS 131.

First, the statistical information comparison processing unit 1334 starts measuring the time for which the monitored process uses the CPU 110 (S1401). It then determines whether the process is using the CPU 110 (S1402). If so (S1402, Yes), it compares the measured CPU usage time (processor usage time measurement value) with the transition time 802 recorded for the current state transition of the monitored process (for example, when process A in the process management table 700 is the monitoring target, the transition time 802 recorded for “state transition 1” in the statistical information management table 800 of FIG. 8) (S1403).

統計情報比較処理部１３３４は、Ｓ１４０３での比較処理において、ＣＰＵ使用時間の計測値が、遷移時間８０２についてあらかじめ規定されている閾値を越えているか判断する（Ｓ１４０４）。この遷移時間８０２に関する閾値は、管理者等がパラメータとして入力装置１５０を通じて指定して統計情報比較処理部１３３４内に保持させることができる。指定する閾値の具体例としては、状態遷移の遷移時間８０２の２倍あるいは３倍の値を指定することが考えられる。 The statistical information comparison processing unit 1334 determines whether the measured value of the CPU usage time exceeds the threshold value defined in advance for the transition time 802 in the comparison processing in S1403 (S1404). The threshold related to the transition time 802 can be specified as a parameter by the administrator or the like through the input device 150 and can be stored in the statistical information comparison processing unit 1334. As a specific example of the threshold value to be specified, it is conceivable to specify a value that is twice or three times the transition time 802 of the state transition.

ＣＰＵ使用時間計測値が指定した閾値を超えていないと判断した場合（Ｓ１４０４、Ｎｏ）、統計情報比較処理部１３３４は、監視対象のプロセスの遷移時間は正常であると判断して、ＣＰＵ１１０の使用中は引き続きＳ１４０２〜Ｓ１４０４の処理を反復実行する。一方、ＣＰＵ使用時間計測値が指定した閾値を超えていると判断した場合（Ｓ１４０４、Ｙｅｓ）、統計情報比較処理部１３３４は、監視対象のプロセスにおいて遷移時間が異常に長い障害が発生していると判断して、障害通知部１３３５からＯＳ１３１、出力装置１６０を通じで障害発生を通知する。なお、遷移時間が異常に長い障害とは、例えば無限ループなど監視対象プロセスがＣＰＵ１１０を通常よりも長い時間にわたって占有している障害状況である。 When it is determined that the CPU usage time measurement value does not exceed the specified threshold (S1404, No), the statistical information comparison processing unit 1334 determines that the transition time of the process to be monitored is normal, and the CPU 110 uses During this time, the processing of S1402 to S1404 is repeated. On the other hand, if it is determined that the CPU usage time measurement value exceeds the specified threshold value (S1404, Yes), the statistical information comparison processing unit 1334 has a fault with an abnormally long transition time in the monitored process. The failure notification unit 1335 notifies the occurrence of a failure through the OS 131 and the output device 160. The failure having an abnormally long transition time is a failure state in which the monitored process occupies the CPU 110 for a longer time than usual, such as an infinite loop.

Ｓ１４０２で、統計情報比較処理部１３３４が、監視対象プロセスがＣＰＵ１１０を使用していないと判断した場合（Ｓ１４０２、Ｎｏ）、統計情報比較処理部１３３４は、ＣＰＵ１１０の使用時間計測を中断し（Ｓ１４０６）、監視対象プロセスがＣＰＵ１１０を使用していない時間（プロセッサ不使用時間）の計測を開始する（Ｓ１４０７）。 If the statistical information comparison processing unit 1334 determines in S1402 that the monitoring target process is not using the CPU 110 (No in S1402), the statistical information comparison processing unit 1334 interrupts the usage time measurement of the CPU 110 (S1406). Then, measurement of the time during which the monitored process is not using the CPU 110 (processor non-use time) is started (S1407).

次いで、統計情報比較処理部１３３４は、監視対象プロセスが再度ＣＰＵ１１０を使用しているか判断する（Ｓ１４０８）。Ｓ１４０８での判断処理は、ＣＰＵ１１０の停止時間が異常に長い障害が生じているか否かを判別するために実行される。ＣＰＵ１１０が使用されていないと判断した場合（Ｓ１４０８、Ｙｅｓ）、統計情報比較処理部１３３４は、さらに、ＣＰＵ１１０の停止時間が許容される閾値を越えているか判断する（Ｓ１４０９）。この停止時間として許容される閾値（以下「停止許容閾値」）は、管理者等が入力装置１５０を通じて入力することにより、パラメータとして統計情報比較処理部１３３４内に保持させることができる。停止許容閾値には、統計情報管理テーブル８００の状態待機時間８０３に格納されている数値の２倍ないし３倍の値など、任意の数値を指定することができる。 Next, the statistical information comparison processing unit 1334 determines whether the monitoring target process uses the CPU 110 again (S1408). The determination process in S1408 is executed to determine whether a failure has occurred that causes the CPU 110 to stop abnormally long. When it is determined that the CPU 110 is not used (S1408, Yes), the statistical information comparison processing unit 1334 further determines whether the stop time of the CPU 110 exceeds an allowable threshold (S1409). A threshold that is allowed as the stop time (hereinafter referred to as “stop allowable threshold”) can be held in the statistical information comparison processing unit 1334 as a parameter when the administrator or the like inputs it through the input device 150. An arbitrary numerical value such as a value twice or three times the numerical value stored in the status waiting time 803 of the statistical information management table 800 can be designated as the stop allowable threshold value.

If it determines in S1409 that the measured CPU non-use time (processor non-use time measurement value) exceeds the stop allowable threshold (S1409, Yes), the statistical information comparison processing unit 1334 judges that a failure with an abnormally long waiting time has occurred and reports the failure from the failure notification unit 1335 through the OS 131 and the output device 160 (S1410).
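The two checks made while a transition or state is still in progress (S1404 and S1409) can be sketched as one function. The names `Fault` and `check_running` are hypothetical, and the default factor of 2 is only the example multiple mentioned in the text, not a fixed value.

```python
from enum import Enum

class Fault(Enum):
    NONE = "none"
    LONG_TRANSITION = "transition time abnormally long"
    LONG_WAIT = "waiting time abnormally long"

def check_running(cpu_ms, idle_ms, ref_transition_ms, ref_wait_ms, factor=2.0):
    """Compare in-progress measurements with the stored reference values.

    cpu_ms:  CPU usage time measured so far in the current transition
    idle_ms: CPU non-use time measured so far in the current state
    """
    if cpu_ms > ref_transition_ms * factor:   # S1404: e.g. an infinite loop
        return Fault.LONG_TRANSITION
    if idle_ms > ref_wait_ms * factor:        # S1409: stop allowable threshold
        return Fault.LONG_WAIT
    return Fault.NONE
```

For instance, with the FIG. 8 reference values of 500 ms and 1000 ms, `check_running(1200, 0, 500, 1000)` reports a long transition, because 1200 ms exceeds twice the 500 ms reference.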

Ｓ１４０９で、ＣＰＵ不使用時間の計測値が停止許容閾値を超えていないと判断した場合（Ｓ１４０９、Ｎｏ）、統計情報比較処理部１３３４は、Ｓ１４０８へ処理を戻し、再度ＣＰＵ１１０が使用されているか判断する。一方、Ｓ１４０９でＣＰＵ不使用時間が停止許容閾値を越えていないと判断し、さらにＳ１４０８で再度ＣＰＵ１１０が使用されていると判断した場合（Ｓ１４０８、Ｎｏ）、統計情報比較処理部１３３４は、ＣＰＵ１１０の不使用時間が状態と判断することができるか確認するため、図１４ＢのＳ１４１１へ処理を移行させる。 In S1409, when it is determined that the measured value of the CPU non-use time does not exceed the stop allowable threshold (No in S1409), the statistical information comparison processing unit 1334 returns the process to S1408 and determines whether the CPU 110 is used again. To do. On the other hand, if it is determined in S1409 that the CPU non-use time has not exceeded the stop allowable threshold, and further in S1408 it is determined that the CPU 110 is being used again (No in S1408), the statistical information comparison processing unit 1334 In order to confirm whether the non-use time can be determined as the state, the processing is shifted to S1411 in FIG. 14B.

Ｓ１４１１では、統計情報比較処理部１３３４は、ＣＰＵ不使用時間が所定の閾値を超えて状態と認定することができるか判断している。閾値を超えていないと判断した場合（Ｓ１４１１、Ｎｏ）、統計情報比較処理部１３３４は、ＣＰＵ１１０の使用時間の計測を再開する（Ｓ１４１７）。 In S1411, the statistical information comparison processing unit 1334 determines whether the CPU non-use time exceeds a predetermined threshold and can be recognized as a state. When it is determined that the threshold value is not exceeded (S1411, No), the statistical information comparison processing unit 1334 restarts the measurement of the usage time of the CPU 110 (S1417).

On the other hand, if it determines that the threshold is exceeded (S1411, Yes), the statistical information comparison processing unit 1334 compares the CPU usage time with the transition time 802 recorded in the statistical information management table 800 of the monitored process (S1412) and determines whether the result falls below a predetermined threshold (S1413). This threshold can be set, for example, as a parameter in the statistical information comparison processing unit 1334. As an example, the threshold can be set to 1/2 to 1/3 of the value recorded as the transition time 802. This step checks whether the time required for the state transition is unduly short. If it determines that the result is below the threshold (S1413, Yes), the statistical information comparison processing unit 1334 judges that a failure has occurred in which an abnormal transition with an unduly short transition time takes place, and reports the failure through the failure notification unit 1335, the OS 131, and the output device 160 (S1414). An example of such an abnormal transition is a case where processing that normally completes too quickly to be recognized as a state, such as a lock wait, is delayed for some reason and is newly recognized as a state.
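The short-transition check of S1411 to S1413 can be sketched as follows. The function name `transition_too_short` and its parameters are hypothetical, and the default ratio of 0.5 is just the 1/2 example from the text.

```python
def transition_too_short(cpu_ms, idle_ms, ref_transition_ms,
                         state_threshold_ms, ratio=0.5):
    """Return True when a newly recognized state follows an unduly short transition.

    cpu_ms:  CPU usage time measured before the idle gap
    idle_ms: length of the idle gap that has just ended
    """
    if idle_ms <= state_threshold_ms:
        # S1411, No: the gap is not a state, so nothing is compared yet.
        return False
    # S1412 and S1413: compare with a fraction of the reference transition time.
    return cpu_ms < ref_transition_ms * ratio
```

A lock wait that normally lasts a few milliseconds but suddenly lasts 300 ms would pass the state threshold here, and the 100 ms of CPU time measured before it would fall well below half of a 500 ms reference, so the check would report a failure.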

一方、Ｓ１４１３で閾値を下回っていないと判断された場合（Ｓ１４１３、Ｎｏ）、統計情報比較処理部１３３４は、監視対象プロセスの次の状態遷移について、遷移時間８０２と状態待機時間８０３とを記録するために、監視項目８０１を次の状態遷移へと移行させる（Ｓ１４１５）。そして、監視対象であるプロセスが終了したか判断し（Ｓ１４１６）、終了していると判断した場合（Ｓ１４１６、Ｙｅｓ）、監視対象プロセスについての統計情報比較処理を終了する。 On the other hand, when it is determined in S1413 that the threshold value is not below (S1413, No), the statistical information comparison processing unit 1334 records the transition time 802 and the state standby time 803 for the next state transition of the monitored process. Therefore, the monitoring item 801 is shifted to the next state transition (S1415). Then, it is determined whether the process to be monitored has ended (S1416). If it is determined that the process has ended (S1416, Yes), the statistical information comparison process for the process to be monitored is ended.

終了していないと判断した場合（Ｓ１４１６、Ｎｏ）、統計情報比較処理部１３３４は、ＣＰＵ１１０の使用時間を再度計測し（Ｓ１４１８）、処理をＳ１４０２に移行させて、監視対象プロセスが終了するまで障害が発生していないか調べる統計情報比較処理を続行する。 If it is determined that the process has not been completed (S1416, No), the statistical information comparison processing unit 1334 measures the usage time of the CPU 110 again (S1418), shifts the process to S1402, and fails until the monitored process ends. Continue the statistical information comparison process to check whether or not the error occurred.

以上説明したように、本実施形態に係るコンピュータ１によれば、ＣＰＵ１１０で実行される個々のプロセスについて、その状態遷移に関する時間計測値を統計的に求めた基準値と逐次比較処理することにより、ハングアップ、ＯＳ１３１によるエラー検出といったイベントに至らない早期の段階で、コンピュータ１に生じた障害を確実に検出することができるので、コンピュータ１のダウンタイムを可及的に短縮し、可用性を向上させる効果を奏する。 As described above, according to the computer 1 according to the present embodiment, for each process executed by the CPU 110, the time measurement value regarding the state transition is sequentially compared with the reference value obtained statistically. Since the failure that occurred in the computer 1 can be reliably detected at an early stage that does not lead to an event such as a hang-up or an error detection by the OS 131, the downtime of the computer 1 is reduced as much as possible and the availability is improved. There is an effect.

In this specification, the present invention has been described in terms of its embodiment with reference to the accompanying drawings, but the present invention is not limited to that embodiment. Within the scope of the invention described in the claims, the present invention can be implemented in various forms regardless of the embodiment described above, and equivalents of the invention described in the claims are also included in the present invention.

1 Computer, 110 CPU, 120 Control unit
130 Main storage device, 131 OS, 1311 Process scheduler
132 Program, 132A, 132B Process
133 Failure detection unit, 1331 State transition monitoring unit
1332 Statistical information collection processing unit, 1333 Statistical information storage unit
1334 Statistical information comparison processing unit, 1335 Failure notification unit
140 Auxiliary storage device, 150 Input device, 160 Output device
170 Communication control unit, 180 Internal bus
700 Process management table, 701 Monitored process
702 Statistical information collection completion flag, 703 Statistical information table
800 Statistical information management table, 801 Monitoring item
802 Transition time, 803 State waiting time, 900 Temporary storage table
901 Monitoring item, 902 Transition time, 903 State waiting time

Claims

A computer comprising a processor and a memory, wherein the processor executes a plurality of processes constituting at least one software program stored in the memory to execute the software program,
For each process, the processor usage time, which is the time during which the processor is processing the process from when the process is started by the processor to the end of the process, and the processor stops processing the process. The processor non-use time, which is a certain time, is measured and acquired sequentially several times, and according to a predetermined statistical process, a processor use time reference value that is a statistical reference value of each processor use time and each processor non-use A processor processing reference value acquisition unit that calculates and stores a processor non-use time reference value that is a statistical reference value of time for each of the processes;
a reference value comparison processing unit that, when any of the plurality of processes constituting the software program is processed by the processor, measures the processor use time and the processor non-use time for that process, sequentially compares them with the processor use time reference value and the processor non-use time reference value stored for that process, and determines that a failure has occurred during processing of the process when it is determined that the comparison result does not satisfy a predetermined determination criterion,
wherein the reference value comparison processing unit, when it determines that the measured processor non-use time is longer than a predetermined first processor non-use time threshold and that the processor use time at that time is shorter than a predetermined processor use time threshold, determines that a failure has occurred in the process in which the transition time is shorter than the predetermined processor use time threshold;
A computer characterized by that.

The computer according to claim 1, further comprising a process allocating unit that allocates the processor to each of the processes, wherein the processor processing reference value acquisition unit and the reference value comparison processing unit measure the processor use time and the processor non-use time by monitoring the operation of the process allocating unit.

The computer according to claim 1, wherein the reference value comparison processing unit, while any of the plurality of processes is being processed by the processor, compares the processor usage time measurement value with the corresponding processor usage time reference value, and determines that a failure has occurred during processing of the process when it is determined that the difference between the processor usage time measurement value and the corresponding processor usage time reference value exceeds a predetermined value.

The computer according to claim 1, wherein, while any of the plurality of processes is being processed by the processor, the reference value comparison processing unit compares the processor non-use time measurement value with a predetermined second processor non-use time threshold, and determines that a failure has occurred during the processing of the process when it determines that the processor non-use time measurement value exceeds the predetermined second processor non-use time threshold.
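
One way to read this claim is as a watchdog-style check on a non-use interval that is still running. The sketch below follows that reading, which is an interpretation rather than something the claim spells out; `non_use_time_exceeded`, `released_at`, and the use of `time.monotonic()` as the clock are all assumptions.

```python
import time


def non_use_time_exceeded(released_at, second_non_use_threshold, now=None):
    """Return True when the time elapsed since the processor stopped
    processing the process exceeds the second non-use time threshold.

    released_at: timestamp at which the processor last stopped processing
                 the process (same clock as `now`).
    now:         current timestamp; defaults to time.monotonic().
    """
    if now is None:
        now = time.monotonic()
    return (now - released_at) > second_non_use_threshold
```

Passing `now` explicitly keeps the check deterministic when it is driven from recorded timestamps instead of the live clock.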

A failure detection method for a computer comprising a processor and a memory, in which the processor processes a plurality of processes constituting at least one software program stored in the memory and thereby executes the software program, wherein the processor:
For each of the processes, sequentially measures and acquires, a plurality of times, a processor use time, which is the time during which the processor is processing the process between the start of processing of the process by the processor and the end of that processing, and a processor non-use time, which is the time during which the processor has stopped processing the process, and, according to a predetermined statistical process, calculates and stores for each of the processes a processor use time reference value, which is a statistical reference value of the processor use times, and a processor non-use time reference value, which is a statistical reference value of the processor non-use times;
When any one of the plurality of processes constituting the software program is processed by the processor, measures the processor use time and the processor non-use time for that process, sequentially compares the measured values with the processor use time reference value and the processor non-use time reference value stored for that process, and determines that a failure has occurred during the processing of the process when it determines that the comparison result does not satisfy a predetermined determination criterion; and
When it determines that the measured processor non-use time is longer than a predetermined first processor non-use time threshold and that the processor use time at that time is shorter than a predetermined processor use time threshold, determines that a failure in which the transition time is shorter than the predetermined processor use time threshold has occurred in the process.

The failure detection method for a computer according to claim 5, wherein the processor measures the processor use time and the processor non-use time by monitoring the status of assignment of the processor to each of the processes.
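
Measuring both times from the assignment status can be sketched as a small event recorder. This is an illustrative sketch only: the class `AssignmentMonitor`, its `on_assign` and `on_release` hooks, and the idea that the scheduler reports timestamped assign and release events are assumptions, not details taken from the claims.

```python
from collections import defaultdict


class AssignmentMonitor:
    """Derives use and non-use times from processor assignment events."""

    def __init__(self):
        self.use_times = defaultdict(list)      # process id -> use intervals
        self.non_use_times = defaultdict(list)  # process id -> non-use intervals
        self._assigned_at = {}
        self._released_at = {}

    def on_assign(self, pid, timestamp):
        """The processor is assigned to the process: a non-use interval ends."""
        if pid in self._released_at:
            self.non_use_times[pid].append(timestamp - self._released_at.pop(pid))
        self._assigned_at[pid] = timestamp

    def on_release(self, pid, timestamp):
        """The processor stops processing the process: a use interval ends."""
        if pid in self._assigned_at:
            self.use_times[pid].append(timestamp - self._assigned_at.pop(pid))
        self._released_at[pid] = timestamp


# Example event sequence for one hypothetical process "p1".
monitor = AssignmentMonitor()
monitor.on_assign("p1", 0.0)
monitor.on_release("p1", 2.0)
monitor.on_assign("p1", 7.0)
monitor.on_release("p1", 8.5)
```

The collected interval lists are the kind of samples from which per-process reference values could then be computed.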

The failure detection method for a computer according to claim 5, wherein, while any of the plurality of processes is being processed by the processor, the processor compares the processor use time measurement value with the corresponding processor use time reference value, and determines that a failure has occurred during the processing of the process when it determines that the difference between the processor use time measurement value and the corresponding processor use time reference value exceeds a predetermined value.

The failure detection method for a computer according to claim 5, wherein, while any of the plurality of processes is being processed by the processor, the processor compares the processor non-use time measurement value with a predetermined second processor non-use time threshold, and determines that a failure has occurred during the processing of the process when it determines that the processor non-use time measurement value exceeds the predetermined second processor non-use time threshold.

In a computer comprising a processor and a memory, in which the processor processes a plurality of processes constituting at least one software program stored in the memory and thereby executes the software program, the processor is caused to execute:
A step of, for each of the processes, sequentially measuring and acquiring, a plurality of times, a processor use time, which is the time during which the processor is processing the process between the start of processing of the process by the processor and the end of that processing, and a processor non-use time, which is the time during which the processor has stopped processing the process, and, according to a predetermined statistical process, calculating and storing for each of the processes a processor use time reference value, which is a statistical reference value of the processor use times, and a processor non-use time reference value, which is a statistical reference value of the processor non-use times;
A step of, when any one of the plurality of processes constituting the software program is processed by the processor, measuring the processor use time and the processor non-use time for that process, sequentially comparing the measured values with the processor use time reference value and the processor non-use time reference value stored for that process, and determining that a failure has occurred during the processing of the process when it is determined that the comparison result does not satisfy a predetermined determination criterion; and
A step of, when it is determined that the measured processor non-use time is longer than a predetermined first processor non-use time threshold and that the processor use time at that time is shorter than a predetermined processor use time threshold, determining that a failure in which the transition time is shorter than the predetermined processor use time threshold has occurred in the process.
A program characterized by that.

The program according to claim 9, wherein the processor is caused to execute a step of measuring the processor use time and the processor non-use time by monitoring the status of assignment of the processor to each of the processes.

The program according to claim 9, wherein the processor is caused to execute a step of, while any of the plurality of processes is being processed by the processor, comparing the processor use time measurement value with the corresponding processor use time reference value, and determining that a failure has occurred during the processing of the process when it is determined that the difference between the processor use time measurement value and the corresponding processor use time reference value exceeds a predetermined value.

The program according to claim 9, wherein the processor is caused to execute a step of, while any of the plurality of processes is being processed by the processor, comparing the processor non-use time measurement value with a predetermined second processor non-use time threshold, and determining that a failure has occurred during the processing of the process when it is determined that the processor non-use time measurement value exceeds the predetermined second processor non-use time threshold.