JP2000305914A

JP2000305914A - Cluster system, specification assisting device of system fault factor and recording medium

Info

Publication number: JP2000305914A
Application number: JP11109223A
Authority: JP
Inventors: Shigeru Kobayashi; 茂小林
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-04-16
Filing date: 1999-04-16
Publication date: 2000-11-02

Abstract

PROBLEM TO BE SOLVED: To quickly and accurately identify the system component that is the cause of a failure without troubling maintenance staff. SOLUTION: State values, including state value numbers, recorded in the recording means 4a, 4b, ... of all nodes 3a, 3b, ... constituting a cluster system are fetched by a state value reading and storage means 21 and stored in a state value storage part 12. The system is also provided in advance with a defining means (13: factor attribute definition table and factor-state value relational attribute table) that defines, for each system component, the certainty of onset of a fault and the certainty that the fault is reflected in a state value. A fault possibility processing means 22 then finds abnormal values among the state values stored by the state value reading and storage means 21 and calculates the fault validity of each system component using the abnormal state value, the certainty of onset of the fault, and the certainty that the fault is reflected in the state value.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a cluster system in which a plurality of computers (hereinafter referred to as nodes) execute predetermined processing of a program while cooperating over a network or the like, to a system failure factor identification support device, and to a recording medium.

[0002]

2. Description of the Related Art

Conventionally, as shown in FIG. 8, a plurality of nodes 52₁, 52₂, ... are connected on, for example, a LAN 51 having duplicated transmission lines L1 and L2. When a failure occurs while one node 52₁ is executing a program, the next node 52₂ takes over execution of a program having the same function and carries out the predetermined processing.

[0003] In such a cluster system, when the cause of a failure is analyzed, maintenance personnel read out and print the error/execution logs (log: history state), which represent the program execution state, from the storage devices 53₁, 53₂, ... attached to the nodes 52₁, 52₂, ..., and estimate the system component that is the cause of the failure.

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, such a cluster system is not only more complicated in configuration and operation than a system consisting of a single node; a single business program is also executed while moving across a plurality of nodes. It is therefore often difficult for maintenance personnel to analyze the cause of a failure manually.

[0005] The present invention has been made in view of the above circumstances. Its object is to provide a cluster system that records the detectable state values of each node together with predetermined state value numbers, so that the system component causing a failure can be easily identified.

[0006] Another object of the invention is to provide a system failure factor identification support device that quickly and accurately identifies the system component causing a failure without troubling maintenance personnel.

[0007] A further object of the invention is to provide a computer-readable recording medium on which is recorded a failure factor identification processing program that quickly and accurately identifies the system component causing a failure.

[0008]

MEANS FOR SOLVING THE PROBLEMS

To solve the above problems, the present invention is a cluster system in which a plurality of nodes execute required processing while cooperating over transmission lines, wherein each node is provided with recording means that records the time, a predetermined state value number, and the state value whenever a detectable state value of the node is abnormal or each time a fixed period elapses.

[0009] With the above means, each node records the time, a predetermined state value number, and the state value whenever a state value is abnormal or each time a fixed period elapses. The system component causing a failure can therefore be identified from the state value number of the state value found to be abnormal.

[0010] Another invention is a system failure factor identification support device for the case where every node constituting the cluster system has recording means that records the time, a predetermined state value number, and the state value whenever a detectable state value of the node is abnormal or each time a fixed period elapses. The device comprises: state value reading and storage means that fetches and stores the state values, including the state value numbers, from the recording means attached to each node; definition means that defines, for each system component, the certainty of failure onset (a real number from 0 to 1) and the certainty that the failure is reflected in a state value (a real number from 0 to 1); failure possibility processing means that finds abnormal values among the state values stored by the state value reading and storage means and calculates the failure validity of each system component using the abnormal state value, the certainty of failure onset, and the certainty that the failure is reflected in the state value; and validity output means that displays the failure validity of each system component calculated by the failure possibility processing means, so that the failure factor can be identified.

[0011] With the above means, the state value reading and storage means reads the state values, including the state value numbers, from the recording means attached to each node and processes and stores them in chronological order. The failure possibility processing means then finds the state values regarded as abnormal and calculates the failure validity of each system component using such a state value together with the two certainties, defined by the definition means, of the system components related to that state value. With the validity initially taken as "1", the product of the two certainties is subtracted from this "1", and the system component with the highest failure validity is identified.

[0012] As yet another invention, if the configuration of the system failure factor identification support device is applied as-is to at least one node constituting the cluster system, the cluster system can itself identify the system component that is the failure factor.

[0013]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Before the embodiments of the present invention are described, the basic considerations for mechanically analyzing the failure factors of a cluster system are explained.

[0014] To analyze the failure factors of a cluster system mechanically, the detectable state values of each node constituting the cluster system must be accurately grasped. Typical detectable state values include the periodic communication carried out regularly between nodes over the network and the execution state of a program on a given node. They must be analyzed in terms of operating states that span several system components rather than a single one. For example, when an abnormality appears in a state value, at least one of the components related to it is abnormal; conversely, if a state value of a node is normal, all of the related system components are likely to be normal.

[0015] In general, given the redundant configuration and program execution takeover that characterize a cluster system, the various states of the individual system components appear combined in the detected state values. The states of these components are therefore treated as simultaneous equations, and the system component that is the failure factor is identified from among the system components.

[0016] FIG. 1 is a configuration diagram showing one embodiment of a cluster system according to the present invention.

[0017] In this system, a plurality of nodes 3a, 3b, 3c, ..., each having a cluster control program 2a, 2b, 2c, ... that takes in and processes the detectable state values of the node, are connected on, for example, a LAN 1 consisting of duplicated transmission lines L1 and L2. Recording devices 4a, 4b, 4c, ... that record data such as state values are connected to the respective nodes 3a, 3b, 3c, ....

[0018] Each of the nodes 3a, 3b, 3c, ... is provided with a time source (not shown) for capturing the time at which a state value abnormality is detected when a system component is abnormal. State value numbers corresponding to the number of state value types are set in advance in the recording devices 4a, 4b, 4c, ..., as described later.

[0019] Here, the detectable state values of each node are kept as simple as possible. For example, each is stored as either "normal" or "abnormal", that is, as "1" or "0". For the program execution state on each node, only the state of a program that is running, or that was run and failed, is recorded, again as "1" or "0", for example.

[0020] The number of types of detectable state values, considered separately for the inter-node heartbeat and for the state of programs running on the nodes, is as follows.

[0021] * For the inter-node heartbeat, the number of state values indicating whether a monitored node is operating, as seen from a given node, is: number of nodes × number of communication paths × number of nodes ... (1). If the transmission line is not duplicated, for example, the number is: number of nodes × number of communication paths.

[0022] * For the state of the programs running on the nodes, the number is: number of nodes × number of programs ... (2).

[0023] To organize these state values, state value numbers are assigned, one for each type of state value.
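The counts in expressions (1) and (2) and the numbering in [0023] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `number_state_values`, the tuple keys, and the choice to number heartbeat types first are all assumptions made for the example. The heartbeat enumeration follows expression (1) literally, so it includes pairs in which the observing and monitored node are the same.

```python
from itertools import product


def number_state_values(nodes, paths, programs):
    """Assign consecutive state value numbers, one per state value type.

    Hypothetical helper: the key layout and numbering order are assumptions.
    """
    numbering = {}
    next_number = 1
    # Expression (1): observing node x communication path x monitored node.
    for observer, path, target in product(nodes, paths, nodes):
        numbering[("heartbeat", observer, path, target)] = next_number
        next_number += 1
    # Expression (2): node x program.
    for node, program in product(nodes, programs):
        numbering[("program", node, program)] = next_number
        next_number += 1
    return numbering


# Example: three nodes, a duplicated transmission line, two programs.
numbering = number_state_values(["3a", "3b", "3c"], ["L1", "L2"], ["P1", "P2"])
heartbeat_count = sum(1 for key in numbering if key[0] == "heartbeat")
program_count = sum(1 for key in numbering if key[0] == "program")
print(heartbeat_count, program_count, len(numbering))
```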

[0024] Expression (1) uses the square of the number of nodes for the following reason. In the case of a duplicated transmission line as shown in FIG. 2, for example, there are two inquiry communication paths, R1 (solid line) and R2 (dotted line), so the inquiring node must query the destination node through each of the different communication paths R1 and R2. By performing such inquiries, if the destination node replies to the inquiry over one communication path R1 but does not reply to the inquiry over the other communication path R2, a state value indicating that at least communication path R2 is abnormal can be detected.
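The per-path inquiry of [0024] might look like the sketch below. Everything here is hypothetical: the `probe_paths` function, the `send_inquiry` callback, and the stub transport are invented for illustration, and the encoding of normal as 1 and abnormal as 0 is an assumption based on the "1"/"0" convention mentioned in [0019].

```python
NORMAL, ABNORMAL = 1, 0  # assumed encoding of "normal" / "abnormal"


def probe_paths(send_inquiry, target, paths=("R1", "R2")):
    """Query the target node over each path; return one state value per path."""
    return {
        path: NORMAL if send_inquiry(target, path) else ABNORMAL
        for path in paths
    }


def fake_inquiry(target, path):
    """Hypothetical transport stub: the target answers over R1 but not R2."""
    return path == "R1"


states = probe_paths(fake_inquiry, "3b")
print(states)
```

A reply on R1 combined with silence on R2 yields a normal value for R1 and an abnormal value for R2, which is the situation the paragraph describes.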

[0025] Next, the operation of the above system is described with reference to FIGS. 3(a) and 3(b).

[0026] When a node 3 (3a, 3b, ...) receives an interrupt or any of various other triggers during its normal data processing and the cluster control program 2 (2a, 2b, ...) starts operating, the node executes the processing shown in FIG. 3(a).

[0027] That is, each node 3 (3a, 3b, ...) performs initialization such as clearing unnecessary data and setting other necessary data (S1), and then determines whether any state value is abnormal (S2). If any of the node's detectable state values takes an abnormal value, it is determined that an abnormality exists. Examples of abnormal values include inability to communicate or otherwise exchange data, power failure, OS failure, and inability of the cluster control program to operate, but various other abnormalities are also conceivable.

[0028] When a state value takes an abnormal value in this way, the time and the state value number at that moment are captured and recorded, together with the state value, in the recording device 4 (4a, 4b, ...) (S3). If the abnormality involves several types of state values, the state value is stored for each type using its own state value number.

[0029] This series of processing is repeated until an instruction to end detection is given (S4).

[0030] As a result, as shown in FIG. 4, each time an abnormal value occurs, a record consisting of the time, the state value number, and the state value is recorded in the recording device 4 in chronological order.

[0031] FIG. 3(a) shows the recording of state values when a state value takes an abnormal value, but the state values may instead be recorded each time a fixed period elapses.

[0032] For this periodic recording of state values, as shown in FIG. 3(b), each node 3 (3a, 3b, ...) performs initialization (S11) and then determines whether a fixed period Δt has elapsed (S12). When the fixed period Δt has elapsed, the time, the state value numbers, and the state values at that time are stored in sequence in the recording device 4 (S13). This series of processing is repeated until an instruction to end detection is given (S14).
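The periodic variant of FIG. 3(b) can be sketched in the same spirit. The function `record_periodically` and the pre-collected `samples` sequence are assumptions made so that the example runs without a real clock; an actual node would poll its time source instead.

```python
def record_periodically(samples, dt, log):
    """S12-S13: record every state value once per elapsed interval dt.

    `samples` is an iterable of (time, {state_value_number: value}) pairs,
    a stand-in for polling the node's time source and state values.
    """
    last = None
    for now, states in samples:
        if last is None:
            last = now                # S11: initialization fixes the start time
            continue
        if now - last >= dt:          # S12: has the fixed period elapsed?
            for number, value in sorted(states.items()):
                log.append((now, number, value))   # S13
            last = now
    return log


states = {1: 1, 2: 0}
samples = [(0, states), (5, states), (10, states), (14, states), (20, states)]
periodic_log = record_periodically(samples, dt=10, log=[])
print(periodic_log)
```

Unlike the abnormality-driven loop, this version records normal values as well, so the log shows what every state value was at each interval.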

[0033] According to the above embodiment, therefore, whenever a detectable state value of a node 3 takes an abnormal value, or each time a fixed period elapses, the value taken by the state value is recorded for each state value number. The system component causing a failure can thus be identified from the state value number of the state value regarded as abnormal.

[0034] Next, FIG. 5 is a configuration diagram showing one embodiment of a system failure factor identification support device according to the present invention. This device is a support device independent of the cluster system shown in FIG. 1, but the functions shown in FIG. 5 may instead be given to a specific node shown in FIG. 1, for example node 3₁, or to each of the nodes. In that case, a cluster system having a failure factor identification support function can be realized.

[0035] This system failure factor identification support device comprises: a state value storage unit 12 that stores the state values and other data read through an interface 11 from the recording devices 4a, 4b, 4c, ... of the nodes 3a, 3b, 3c, ...; a definition table 13 in which element attributes and element-state value relationship attributes are defined in advance; a recording medium 14 on which a failure factor identification processing program is recorded; a failure factor identification processing unit 15, constituted by a CPU, that reads the failure factor identification processing program recorded on the recording medium 14 and executes predetermined processing; a data buffer 16 that stores data before processing, during processing, and as processing results; an input device 17; and a display unit 18.

[0036] The state value storage unit 12, the definition table 13, and the data buffer 16 described above are provided as separate units, but they may instead be allocated as separate areas of a single recording medium.

[0037] As shown in FIG. 6(a), of the time, state value number, and state value recorded in each of the recording devices 4a, 4b, 4c, ... attached to the nodes 3a, 3b, 3c, ..., the state value storage unit 12 stores the state value numbers and state values as a table in chronological order. Hereinafter, this storage table of the state value storage unit 12 is called the state value table Tr.

[0038] The definition table 13 consists of an element attribute definition table Ta, which defines element attributes as shown in FIG. 6(b), and an element-state value relationship attribute definition table Tr, which defines the attributes of element-state value relationships as shown in FIG. 6(c).

[0039] The element attribute definition table Ta is a table in which the certainty of failure onset of each system component is set in advance. Specifically, it takes as its attribute each system component that may fail (system component number) and expresses, as a real number from 0 to 1, the degree to which a failure of that component shows symptoms continuously or intermittently; that is, the certainty given by the degree of continuity (repetitiveness) of the symptoms of each system component. It thus defines as an attribute the certainty of failure onset over time.

[0040] A system component is any element, among all the elements constituting the system, that may fail. Examples include each node, each transmission line (communication path), programs, the OS, power supplies, and resources used by programs. Each of these system components is assigned a system component number. A program with the same function has the same contents regardless of the node on which it is executed, so it is regarded as logically the same system component and the same element number is used.

[0041] The element-state value relationship attribute definition table Tr, on the other hand, is a table in which the certainty that a failure of a system component is reflected in a state value is set in advance. It is an array of records, each consisting of a state value number, a system component number, and a certainty. The fields of a record are referred to as Tr[i].s, Tr[i].c, and Tr[i].w, respectively, where s denotes the state value number, c the system component number, and w the certainty.

[0042] More specifically, the element-state value relationship attribute definition table Tr takes as its attributes the state value number and the system component number of a system component that may fail, and expresses, as a real number from 0 to 1, the certainty that a failure of that system component is actually reflected in the state value. It thus defines as an attribute the certainty of failure onset in space.
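The two definition tables could be represented as below. This is a sketch under stated assumptions: modelling Ta as a dictionary keyed by system component number, the `Relation` record type, the `validate_tables` check, and all sample numbers are invented for illustration. Only the field names s, c, and w and the 0-to-1 range come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    s: int     # state value number
    c: int     # system component number
    w: float   # certainty that a failure of c is reflected in state value s


def validate_tables(Ta, Tr):
    """Check that every certainty is a real number in the range 0 to 1."""
    for c, onset in Ta.items():
        if not 0.0 <= onset <= 1.0:
            raise ValueError(f"Ta[{c}] out of range: {onset}")
    for rel in Tr:
        if rel.c not in Ta:
            raise ValueError(f"component {rel.c} missing from Ta")
        if not 0.0 <= rel.w <= 1.0:
            raise ValueError(f"certainty out of range: {rel}")
    return True


# Ta: certainty of failure onset per component (sample values only).
Ta = {1: 0.9, 2: 0.5}
# Tr: which components influence which state values, and how reliably.
Tr = [
    Relation(s=10, c=1, w=0.8),
    Relation(s=10, c=2, w=0.3),
    Relation(s=11, c=2, w=1.0),
]
print(validate_tables(Ta, Tr))
```

One state value number may appear in several Tr records, which reflects the point in [0014] that a state value spans several system components.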

[0043] The recording medium 14 stores the failure factor identification processing program that the failure factor identification processing unit 15 executes as shown in FIG. 7, described later. A CD-ROM or a magnetic disk is generally used as the recording medium 14, but a magnetic tape, DVD-ROM, floppy (registered trademark) disk, MO, CD-R, memory card, or the like may also be used.

[0044] The failure factor identification processing unit 15 is provided with state value reading and storage means 21, failure possibility processing means 22, and validity output means 23.

[0045] The state value reading and storage means 21 reads the data recorded in the recording devices 4a, 4b, 4c, ... attached to the nodes 3a, 3b, 3c, ..., stores it in sequence in the data buffer 16, then sorts the stored data into chronological order and stores it in the state value storage unit 12.

[0046] The failure possibility processing means 22 expresses the possibility of failure of each system component, that is, the validity of the failure, as a numerical value, using the state values stored in the state value storage unit 12 together with the certainty of failure onset and the certainty of reflection in the state value that are set in advance for each system component in the definition table 13.

[0047] The validity output means 23 displays the failure validity of each system component on the display unit 18 to inform maintenance personnel.

[0048] Next, the process by which the device configured as above identifies the system component that is the failure factor is described.

[0049] The identification process described here is an example that uses a point-deduction method. In this example, the failure possibility (failure validity) of every system component is set to "1" in the initial state. Then, assuming that a certain system component has a failure, the possibility of that assumed failure is reduced if no abnormality is actually detected in a state value in which an abnormality could have been detected.

[0050] The specific operation is described below with reference to FIG. 7.

[0051] When the device starts operating, it performs initialization (S21) and then reads the failure factor identification processing program recorded on the recording medium 14 and stores it in, for example, the data buffer 16. When maintenance personnel then enter the time period to be analyzed from the input device 17, that analysis time period is set in the data buffer 16 or other storage means (S22; analysis time period setting function).

[0052] Next, i = 1 is set to select the first of the nodes (S23), and the times, state value numbers, and state values within the time period are read in sequence from the recording device 4a attached to the node 3a corresponding to i = 1 and stored in the data buffer 16.

[0053] When reading of the state values from the recording device 4a is complete, the times, state value numbers, and state values within the time period are then read in sequence from the recording device 4b attached to the next node 3b and stored in the data buffer 16. This processing is performed for all nodes. The data of all nodes stored in the data buffer 16 is then arranged in chronological order according to the recorded times, the time entries are removed, and records each consisting of a state value number and a state value are arranged in sequence (S24 to S28; state value reading and storage function). The fields of these state value records are referred to as Tf[i].s and Tf[i].v, respectively.
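The reading, ordering, and time-stripping of steps S23 to S28 can be sketched as follows. The function `build_state_table`, the tuple layout of the per-node logs, and the inclusive time window are assumptions for the example; only the resulting field names s and v come from the text.

```python
from collections import namedtuple

# One state value record without its timestamp: number s, value v.
StateRecord = namedtuple("StateRecord", ["s", "v"])


def build_state_table(node_logs, start, end):
    """Merge per-node logs, order them by time, and drop the timestamps.

    Each log is a list of (time, state_value_number, state_value) tuples,
    standing in for the contents of one recording device.
    """
    buffered = []                                # plays the role of data buffer 16
    for log in node_logs:                        # one node after another
        for time, number, value in log:
            if start <= time <= end:             # keep only the analysis time period
                buffered.append((time, number, value))
    buffered.sort(key=lambda rec: rec[0])        # chronological order (stable sort)
    return [StateRecord(number, value) for _, number, value in buffered]


log_a = [(1, 10, 0), (5, 11, 1)]   # hypothetical contents of recording device 4a
log_b = [(3, 20, 0), (9, 21, 0)]   # hypothetical contents of recording device 4b
Tf = build_state_table([log_a, log_b], start=0, end=6)
print(Tf)
```

Because Python's sort is stable, records with identical timestamps keep the order in which the nodes were read.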

[0054] Next, "1" is set as the failure validity of every system component number in the validity table Tw, which is formed in, for example, the definition table 13 or the data buffer 16. That is, in the initial state the failure validity of every system component is "1" (S29; initial failure possibility setting function). The validity table Tw represents the validity of the hypothesis that each system component has a failure and, as described later, is an array of real-valued failure validities, one for each system component.

[0055] After the validity of every system component number in the validity table Tw has been set to "1" in this way, the state values stored in the state value storage unit 12 are checked in sequence for abnormalities (S30, S31; state value abnormality determination function).

[0056] If the i-th record satisfies "Tf[i].v == abnormal", the state value numbers of all records in the element-state value relationship attribute table Tr are searched to check whether any record has the corresponding state value number (S32, S33; corresponding state value number search function). For every record of the element-state value relationship attribute table Tr, if the j-th record satisfies "Tr[j].s == Tf[i].v", then c = Tr[j].c is set and the failure validity of the corresponding system component is calculated (S34; validity calculation function). The failure validity of this system component is calculated as Tw[c] = Tw[c] * {1 - (Ta[c] * Tr[j].w)}.

[0057] If the search of the state values for abnormalities is not yet complete, the process returns to step S31 and the same processing is repeated for the next abnormal state value (S31 to S36).
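The loop of steps S29 to S36 can be sketched as below, applying the update Tw[c] = Tw[c] * {1 - (Ta[c] * Tr[j].w)} exactly as it is written in [0056]. Two points are assumptions rather than facts from the text: the source writes the match condition as Tr[j].s == Tf[i].v, whereas this sketch compares Tr[j].s with the state value number Tf[i].s, following the prose description of searching for "the corresponding state value number"; and abnormal is encoded as 0. The function name and sample tables are likewise invented.

```python
from collections import namedtuple

StateRecord = namedtuple("StateRecord", ["s", "v"])   # state value number, value
Relation = namedtuple("Relation", ["s", "c", "w"])    # number, component, certainty
ABNORMAL = 0  # assumed encoding of "abnormal"


def compute_validity(Tf, Ta, Tr):
    """Return the validity table Tw after processing every state value record."""
    Tw = {c: 1.0 for c in Ta}                      # S29: every validity starts at 1
    for record in Tf:                              # S30-S31: scan the state values
        if record.v != ABNORMAL:
            continue
        for rel in Tr:                             # S32-S33: search Tr
            if rel.s == record.s:                  # assumed match on the number
                c = rel.c
                Tw[c] = Tw[c] * (1 - Ta[c] * rel.w)   # S34: update as in [0056]
    return Tw


Ta = {1: 0.5, 2: 1.0, 3: 0.25}
Tr = [Relation(10, 1, 0.5), Relation(10, 2, 0.5), Relation(11, 2, 0.5)]
Tf = [StateRecord(10, ABNORMAL), StateRecord(11, 1)]
Tw = compute_validity(Tf, Ta, Tr)
print(Tw)
```

In this sample run, only state value 10 is abnormal, so only the components related to it in Tr are updated; component 3 has no Tr record and keeps its initial validity.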

[0058] Once the validity of the system components has been calculated for each abnormal state value and stored in the validity table Tw as described above, the validity of each system component (element number) is read from this validity table and displayed on the display unit 18 (S37; validity output function). That is, among the system component numbers in the validity table Tw, the system component with the largest validity value can be regarded as highly likely to have a failure.
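The output step S37 can be sketched as a simple ranking of the validity table. The function names, the report format, and the sample values are assumptions; printing to the console stands in for the display unit 18.

```python
def rank_components(Tw):
    """Order system component numbers by validity, largest first."""
    return sorted(Tw.items(), key=lambda item: item[1], reverse=True)


def format_report(Tw):
    """Build one display line per system component (hypothetical format)."""
    return [
        f"component {component}: validity {validity:.3f}"
        for component, validity in rank_components(Tw)
    ]


sample_Tw = {1: 0.75, 2: 0.5, 3: 1.0}   # sample validity table
ranking = rank_components(sample_Tw)
report = format_report(sample_Tw)
for line in report:
    print(line)
```

Listing every component rather than only the top one lets maintenance personnel also see the validities of related components, in line with [0059].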

[0059] Therefore, according to the embodiment described above, the state values are read sequentially, together with their state value numbers, from the recording devices attached to the nodes constituting the cluster system. If any of these state values is abnormal, the corresponding system component is found, on the basis of the abnormal state value, from the records of the element-state value relationship attribute table Tr, each of which holds a state value number, a system component number, and the certainty with which a failure of the system component is reflected in the state value. The failure validity is then calculated numerically from this certainty and the certainty of failure onset of the corresponding system component, which is set in advance in the element attribute table Ta, and the system component with a high possibility of failure is identified. The failure factors in the cluster system can therefore be analyzed automatically, and a system component with a large validity value can be identified as possibly having a failure. Because the validity values of the system components related to that component are calculated at the same time, the influence of each system component can also be grasped easily.

[0060]

[Effects of the Invention] As described above, according to the present invention, each node records the time, a predetermined state value number, and the state value whenever a detectable state value of that node is abnormal or each time a fixed period elapses. This makes it possible to provide a cluster system in which the system component causing a failure can be identified easily.

[0061] Another aspect of the invention can provide a system failure factor identification support device that identifies the system component causing a failure quickly and accurately, without troubling maintenance personnel.

[0062] Still another aspect of the invention can provide a computer-readable recording medium on which is recorded a failure factor identification processing program that quickly and accurately identifies the system component causing a failure.

[Brief description of the drawings]

FIG. 1 is a configuration diagram showing an embodiment of a cluster system according to the present invention.

FIG. 2 is a diagram explaining the number of state value types used for assigning state value numbers.

FIG. 3 is a flowchart explaining the operation of the cluster system according to the present invention.

FIG. 4 is a diagram showing the data recording state of the recording device attached to each node constituting the cluster system.

FIG. 5 is a configuration diagram showing an embodiment of a system failure factor identification support device according to the present invention.

FIG. 6 is a diagram showing the state of the table data used in the device of FIG. 5.

FIG. 7 is a flowchart explaining the operation of the device shown in FIG. 5.

FIG. 8 is a configuration diagram explaining a conventional cluster system.

[Description of Reference Numerals] 3a, 3b, ...: nodes; 4a, 4b, ...: recording devices; 12: state value storage unit; 13: definition table; 14: recording medium; 15: failure factor identification processing unit; 21: state value reading and storage means; 22: failure possibility processing means; 23: validity output means

Claims

[Claims]

1. A cluster system in which a plurality of nodes execute required processing while cooperating with one another over a transmission line, wherein each of the nodes comprises recording means for recording a time, a predetermined state value number, and a state value when a detectable state value of that node is abnormal or each time a fixed period elapses.

2. A system failure factor identification support device comprising: state value reading and storage means for fetching and storing state values, including state value numbers, from the recording means of all nodes constituting the cluster system according to claim 1; defining means for defining, for each system component, the certainty of failure onset and the certainty that the failure is reflected in a state value; and failure possibility processing means for finding an abnormal value among the state values stored in the state value reading and storage means and calculating the failure validity of a system component using the abnormal state value, the certainty of failure onset, and the certainty that the failure is reflected in the state value.

3. The system failure factor identification support device according to claim 2, wherein the defining means that defines the certainty of failure onset for each system component holds each system component as an attribute and defines the degree of continuity of symptoms when that system component fails.

4. The system failure factor identification support device according to claim 2, wherein the defining means that defines the certainty that a failure of each system component is reflected in a state value holds the relationship between each system component and each state value as an attribute and defines the certainty with which a failure of each system component is reflected in that state value.

5. The system failure factor identification support device according to claim 2, wherein the failure possibility processing means sets the failure validity of each system component to 1, treats the two certainties of each system component defined by the defining means as real numbers from 0 to 1, and calculates the failure validity of each system component by a deduction method in which the product of the two certainties is subtracted from the failure validity of the system component concerned.

6. The system failure factor identification support device according to claim 2, wherein the failure possibility processing means calculates the failure validity of each system component, on the basis of a state value found to be abnormal among the state values, using the two certainties of each system component related to that state value.

7. A computer-readable recording medium on which are recorded: a first definition table that defines, with system components as attributes, the degree of continuity of symptoms when a system component fails; a second definition table that defines, with predetermined state value numbers and system components as attributes, the certainty reflected in the state value for each state value number; and a failure factor identification processing program for identifying the failure factor of a system component, the program causing a computer that performs failure factor processing to realize: an analysis time range setting function for setting the time range in which failure factors are analyzed; a state value reading and storing function for sequentially reading and storing the state values, including state value numbers, recorded at all nodes within the set time range when a detectable state value is abnormal or each time a fixed period elapses; an initial failure possibility setting function for setting the failure validity of all system components to "1"; a state value abnormality presence/absence determination function for detecting, among the state values stored by the state value reading and storing function, a state value indicating an abnormality; and a validity calculation function for obtaining first and second certainties from the first and second definition tables, respectively, in relation to the state value detected by the state value abnormality presence/absence determination function, and calculating the failure validity of the corresponding system component by subtracting the product of the first and second certainties from the failure validity "1" set for each system component.