JPH0362235A

JPH0362235A - Down monitoring processing system for composite system

Info

Publication number: JPH0362235A
Application number: JP1198539A
Authority: JP
Inventors: Koichi Shiga; 浩一志賀; Yukiyoshi Yanase; 柳瀬　幸好; Kazunori Hiraishi; 平石　壽徳
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-07-31
Filing date: 1989-07-31
Publication date: 1991-03-18
Anticipated expiration: 2013-02-18
Also published as: JP2716537B2

Abstract

PURPOSE: To recognize the occurrence of an abnormality and handle it efficiently by providing each cluster with a unit that, on receiving an abnormality notification about a subsystem from another cluster, calls a previously registered subsystem of the same kind or a processing unit that performs processing related to the abnormality. CONSTITUTION: When the down monitoring mechanism 15A detects an abnormality of subsystem S1 in cluster 13A, its notification unit 17 notifies the down monitoring mechanism 15B of cluster 13B of the abnormality via a suitable communication means. The exit scheduling unit 18 of mechanism 15B then searches the subsystem monitoring table 14B by the name of the abnormal subsystem S1 and schedules the resource recovery exit E1 of the subsystem S1 running in its own cluster. The occurrence of the abnormality is thereby recognized, and the abnormality handled, efficiently.

Description

[Detailed Description of the Invention]

[Summary]

The present invention relates to a down monitoring processing method for a composite system, in which a task or cluster that has fallen into an abnormal state notifies a down monitoring mechanism of that fact.

Its purpose is to integrate the down monitoring mechanism at the OS level and to allow the occurrence of an abnormality to be recognized, and the abnormality to be handled, efficiently.

Each cluster has a down monitoring mechanism. Each down monitoring mechanism comprises: means for monitoring the occurrence of an abnormality in the subsystems within its own cluster; means for notifying, when an abnormality occurs in a subsystem within its own cluster, the down monitoring mechanisms in other clusters that the subsystem has fallen into an abnormal state; and means for calling, when an abnormality notification regarding a subsystem is received from another cluster, a previously registered subsystem of the same type or a processing unit that performs processing related to the abnormality.

[Industrial application field]

The present invention relates to a down monitoring processing method for a composite system, in which a task or cluster that has fallen into an abnormal state notifies a down monitoring mechanism of that fact.

When a composite system is built by connecting multiple computers through communication paths or the like, a mechanism is needed by which the other, normal computers recognize an abnormality in one computer. This mechanism is called a "down monitoring mechanism", and each computer that makes up the composite system is called a "cluster".

In addition, when subsystems that perform the same type of task run in multiple clusters and an abnormality occurs in the subsystem of one cluster, the subsystems in the other clusters need to recover the resources of the failed subsystem and, if possible, take over its work.

[Conventional technology]

FIG. 8 shows an example of conventional down monitoring.

In a conventional composite system, as shown in FIG. 8, the down monitoring mechanisms 15A and 15B of the normal clusters 13A and 13B exchange liveness notifications of the so-called "I'm ALIVE" kind with each other in order to detect an abnormal cluster. If a liveness notification does not arrive within a predetermined period, the receiver issues several liveness inquiries, and if there is still no response it concludes that the peer has fallen into an abnormal state.
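The conventional scheme described above can be sketched as a small state machine. This is only an illustration in Python: the class name `PeerWatch`, the `deadline` and `max_inquiries` parameters, and the use of explicit timestamps are assumptions, not details given in the specification.

```python
from dataclasses import dataclass


@dataclass
class PeerWatch:
    """Conventional liveness tracking for one peer cluster (illustrative only)."""
    deadline: float        # assumed: seconds allowed between "I'm ALIVE" messages
    max_inquiries: int     # assumed: inquiries sent before declaring the peer down
    last_seen: float = 0.0
    inquiries_sent: int = 0
    down: bool = False

    def on_alive(self, now: float) -> None:
        # A liveness notification or an inquiry response resets the watch.
        self.last_seen = now
        self.inquiries_sent = 0

    def tick(self, now: float) -> str:
        """Return the action taken at time `now`: 'ok', 'inquire' or 'down'."""
        if self.down:
            return "down"
        if now - self.last_seen <= self.deadline:
            return "ok"
        if self.inquiries_sent < self.max_inquiries:
            self.inquiries_sent += 1
            return "inquire"
        self.down = True
        return "down"


watch = PeerWatch(deadline=5.0, max_inquiries=2)
watch.on_alive(0.0)
actions = [watch.tick(t) for t in (3.0, 6.0, 7.0, 8.0)]
```

In this sketch the peer is only declared down after the deadline has passed and every inquiry has gone unanswered, which is the source of the detection delay discussed in the next section.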

Likewise, when subsystems of the same type that carry out a given task monitored one another for abnormalities, each subsystem recognized the occurrence of an abnormality by its own means, for example by exchanging periodic liveness notifications.

[Problem to be solved by the invention]

In conventional down monitoring, the liveness notifications by which clusters tell one another that they are normal are sent periodically at a fixed interval. Consequently, even when an abnormality occurs, the other clusters cannot recognize it immediately, and detection is delayed by as much as the notification interval.

Moreover, when subsystems of the same type that perform a given task monitor one another for abnormalities, a down monitoring mechanism has to be built for each task (subsystem), which increases the development burden. Furthermore, when multiple subsystems run within one cluster, a down monitoring mechanism is needed for each subsystem, which puts pressure on CPU and memory resources.

The present invention aims to solve the above problems by integrating the down monitoring mechanism at the OS level and by enabling the occurrence of an abnormality to be recognized, and the abnormality to be handled, efficiently.

[Means to solve the problem]

FIG. 1 shows an example of the configuration of the present invention.

In FIG. 1, 10 is an inter-cluster external storage device that every cluster can access in common, 11 is an inter-cluster communication area that stores communication data exchanged between clusters, and 12 is a cluster monitoring table that manages the state of each cluster (operating, stopped, down, and so on).

13A and 13B are clusters, each consisting of a processing unit with its own CPU and memory. 14A and 14B are subsystem monitoring tables that manage the state (operating, stopped, down, and so on) of the subsystems running in the respective cluster. 15A and 15B are down monitoring mechanisms, integrated at the operating system (OS) level to achieve system consistency. 16 is a monitoring unit, 17 is a notification unit, and 18 is an exit scheduling unit. S1 to S3 are subsystems that perform database management, TSS, and various other tasks. E1 is a resource recovery exit, the entry point for post-failure processing such as recovering the resources held by a subsystem in which an abnormality has occurred.

In the present invention, each cluster 13A, 13B has a down monitoring mechanism 15A, 15B at the OS level.

Each down monitoring mechanism 15A, 15B has three processing units: a monitoring unit 16, a notification unit 17, and an exit scheduling unit 18. The monitoring unit 16 monitors the subsystems S1 to S3 in its own cluster for abnormalities. When an abnormality occurs in any of the subsystems S1 to S3 in its own cluster, the notification unit 17 notifies the down monitoring mechanisms in the other clusters, via the inter-cluster communication area 11 or the like, that the subsystem has fallen into an abnormal state. When an abnormality notification regarding a subsystem is received from another cluster, the exit scheduling unit 18 refers to the subsystem monitoring table 14B or the like and calls a previously registered subsystem of the same type or a processing unit that performs processing related to the abnormality, that is, the resource recovery exit E1 or the like corresponding to the subsystem.

[Operation]

When a subsystem falls into a state in which it cannot continue operating, the other clusters are notified of the abnormality before the subsystem-down processing that recovers the subsystem's operating environment is performed. In the present invention, the mechanisms for such cluster monitoring and subsystem monitoring are integrated at the OS level as the down monitoring mechanisms 15A and 15B, which achieves system consistency.

Each subsystem S1 to S3 issues a monitoring request to the down monitoring mechanism 15A or 15B in advance. The down monitoring mechanisms 15A and 15B register the state of each subsystem to be monitored in the subsystem monitoring tables 14A and 14B.

When a subsystem S1 to S3 detects an abnormality in itself, it self-reports by notifying the down monitoring mechanism 15A or the like of the name of the subsystem that has become abnormal. The down monitoring mechanism 15A or the like also detects subsystem abnormalities by other means, for example by having the monitoring unit 16 schedule the liveness notification exit of each subsystem S1 to S3.

For example, when the down monitoring mechanism 15A in cluster 13A detects an abnormality in subsystem S1, the notification unit 17 notifies the down monitoring mechanism 15B of cluster 13B of the abnormality using a suitable communication means. The notified information includes the name of the cluster that has fallen into the abnormal state and the subsystem name.

In cluster 13B, which receives the notification, the exit scheduling unit 18 of the down monitoring mechanism 15B searches the subsystem monitoring table 14B by the subsystem name of the failed subsystem S1 and schedules the resource recovery exit E1 of the subsystem S1 running in its own cluster. This resource recovery exit E1 recovers resources, such as the operating environment, that the failed subsystem S1 in cluster 13A had been using and, if necessary, takes over the transactions and other work that were in progress.
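The notification and exit scheduling flow in this example can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `DownMonitor` class and its method names are hypothetical, a direct method call stands in for the inter-cluster communication means, and the subsystem monitoring table is reduced to a dictionary keyed by subsystem name.

```python
from typing import Callable, Dict, List, Tuple

# A resource recovery exit receives (failed_cluster, subsystem_name).
RecoveryExit = Callable[[str, str], None]


class DownMonitor:
    """One down monitoring mechanism per cluster (hypothetical API)."""

    def __init__(self, cluster: str) -> None:
        self.cluster = cluster
        self.peers: List["DownMonitor"] = []
        # Simplified subsystem monitoring table: name -> resource recovery exit.
        self.table: Dict[str, RecoveryExit] = {}

    def register(self, subsystem: str, recovery_exit: RecoveryExit) -> None:
        self.table[subsystem] = recovery_exit

    def report_abnormal(self, subsystem: str) -> None:
        # Notification unit: send the cluster name and subsystem name to peers.
        for peer in self.peers:
            peer.on_down_notice(self.cluster, subsystem)

    def on_down_notice(self, failed_cluster: str, subsystem: str) -> None:
        # Exit scheduling unit: search the local table by subsystem name and
        # call the recovery exit of the same-named subsystem, if registered.
        recovery_exit = self.table.get(subsystem)
        if recovery_exit is not None:
            recovery_exit(failed_cluster, subsystem)


recovered: List[Tuple[str, str]] = []
mon_a, mon_b = DownMonitor("13A"), DownMonitor("13B")
mon_a.peers.append(mon_b)
mon_b.peers.append(mon_a)
mon_b.register("S1", lambda cluster, name: recovered.append((cluster, name)))
mon_a.report_abnormal("S1")
```

After the last call, the exit registered in cluster 13B has been invoked once with the failed cluster name and subsystem name. A notice for a subsystem that is not registered locally is simply ignored in this sketch.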

[Example]

FIG. 2 shows an example of a system to which the present invention is applied, FIG. 3 shows examples of the control tables used in the embodiment, FIG. 4 shows the processing flow at initialization of the down monitoring mechanism according to the embodiment, FIG. 5 shows the processing flow of the monitoring unit according to the embodiment, FIG. 6 is an explanatory diagram of the processing of the down monitoring mechanism according to the embodiment, and FIG. 7 shows an example of the processing performed by the down monitoring mechanism according to the embodiment at the time of a down notification.

The present invention can be applied, for example, to a composite system such as that shown in FIG. 2. The system storage device 21 has an inter-cluster communication function and consists, for example, of a semiconductor storage device that can be shared between clusters.

The inter-cluster communication function in this example may be replaced by a conventional communication function such as inter-processor communication, and the function of an external storage device shareable between clusters may be replaced by a DASD such as a magnetic disk device.

In the example shown in FIG. 2, the processing unit 23 constituting each cluster has a CPU 24 and a memory 25 for local use, and can access the system storage device 21 via a memory control unit (MCU) 22.

The inter-cluster communication area 11 and the cluster monitoring table 12 shown in FIG. 1 are created in the system storage device 21.

The subsystem monitoring tables 14A and 14B shown in FIG. 1 are created in the memory 25 of each cluster.

The cluster monitoring table 12 is used to monitor the operating state of each cluster and consists of storage areas for information such as that shown in FIG. 3(a).

The cluster identifier is a name or number that uniquely identifies a cluster within the composite system.

The status display area holds indications such as that the cluster has started operating and become a monitoring target, and that it is alive. The number-of-operating-subsystems area stores the number of subsystems running in the cluster. The resource recovery exit pointer information allows the virtual-space address of the resource recovery exit, which recovers the resources of a cluster that has become abnormal, to be determined correctly even from a different cluster.

The subsystem monitoring table 14 is used to monitor the operating state of the subsystems in each cluster and consists of storage areas for information such as that shown in FIG. 3(b).

The subsystem identifier is information that uniquely identifies a subsystem. The status display area holds indications such as that the subsystem has become a monitoring target, and that it is alive. The liveness notification exit address is the address of the exit that the down monitoring mechanism calls periodically, at a fixed interval, to have the subsystem issue a liveness notification.

The resource recovery exit address is the address of the resource recovery routine that the down monitoring mechanism calls, when another cluster or a subsystem in another cluster goes down, to recover the resources of the same type of subsystem that was running in the other cluster.
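The two control tables can be pictured as records with the fields listed above. The following Python sketch is illustrative only: the field names, the use of callables in place of exit addresses, and the `Status` flag values are assumptions, since the specification names the areas but not their encoding.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Callable, Optional


class Status(Flag):
    """Assumed encoding of the status display area."""
    NONE = 0
    MONITORED = auto()   # entry has become a monitoring target
    ALIVE = auto()       # alive indication is set


@dataclass
class ClusterEntry:
    """One row of the cluster monitoring table (FIG. 3(a))."""
    cluster_id: str
    status: Status = Status.NONE
    active_subsystems: int = 0
    # Stands in for the resource recovery exit pointer information.
    recovery_exit: Optional[Callable[[], None]] = None


@dataclass
class SubsystemEntry:
    """One row of a subsystem monitoring table (FIG. 3(b))."""
    subsystem_id: str
    status: Status = Status.NONE
    # Stand in for the liveness notification and resource recovery exit addresses.
    alive_exit: Optional[Callable[[], None]] = None
    recovery_exit: Optional[Callable[[str], None]] = None


entry = SubsystemEntry("S1", status=Status.MONITORED)
entry.status |= Status.ALIVE   # record a liveness assertion
```

Keeping the monitored and alive indications as separate flags lets a registered subsystem be marked alive or not alive without losing its registration, which matches how the status display area is described.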

The down monitoring mechanisms 15A and 15B shown in FIG. 1 are initialized, for example, as shown in FIG. 4.

(1) Determine whether the system storage shared by the clusters needs to be initialized. If another cluster has already initialized it, initialization is unnecessary, so the next step (2) is skipped.

(2) Initialize the cluster monitoring table 12.

(3) Determine whether the cluster monitoring table 12 contains an alive indication for the own cluster. If it does, an abnormality occurred in the own cluster earlier and the cluster is now being restarted after going down, so the next step (4) is executed. If there is no alive indication, the next step (4) is skipped.

(4) As post-processing for the failure, call the resource recovery exit and return the system storage resources that were in use before.

(5) Register information about the own cluster in the cluster monitoring table 12.

(6) Set the alive indication for the own cluster in the cluster monitoring table 12.

(7) Trigger the start of the monitoring unit 16, which runs periodically. Initialization then ends.
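The initialization steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the dictionary `shared` stands in for the shared system storage, the key `"cluster_table"` and the entry layout are hypothetical, and the function returns the numbers of the steps it actually performed purely so the control flow is visible.

```python
from typing import Callable, List


def initialize(shared: dict, cluster_id: str,
               recovery_exit: Callable[[], None],
               start_monitor: Callable[[], None]) -> List[int]:
    """Run the initialization flow and return the steps that were executed."""
    steps: List[int] = []
    # (1)/(2): initialize the cluster monitoring table only if no other
    # cluster has done so already.
    if "cluster_table" not in shared:
        shared["cluster_table"] = {}
        steps.append(2)
    table = shared["cluster_table"]
    # (3)/(4): a leftover alive indication means this cluster went down
    # earlier and is restarting, so run the resource recovery exit first.
    if table.get(cluster_id, {}).get("alive"):
        recovery_exit()
        steps.append(4)
    # (5): register information about the own cluster.
    table[cluster_id] = {"alive": False, "subsystems": 0}
    steps.append(5)
    # (6): set the alive indication.
    table[cluster_id]["alive"] = True
    steps.append(6)
    # (7): trigger the periodically running monitoring unit.
    start_monitor()
    steps.append(7)
    return steps


shared_storage: dict = {}
calls: List[str] = []
first = initialize(shared_storage, "13A",
                   lambda: calls.append("recover"), lambda: calls.append("start"))
second = initialize(shared_storage, "13B",
                    lambda: calls.append("recover"), lambda: calls.append("start"))
# Cluster 13A restarts after a crash that left its alive indication set.
restart = initialize(shared_storage, "13A",
                     lambda: calls.append("recover"), lambda: calls.append("start"))
```

The third call models a restart: because the alive indication from the earlier run is still present, the recovery exit runs before the cluster registers itself again.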

Once started, the monitoring unit 16 executes steps (1) to (4) shown in FIG. 5.

(1) Set the time interval for the "I'm ALIVE" communication that notifies other clusters that the own cluster is alive.

(2) Set the monitoring time interval for the "I'm ALIVE" communication sent from other clusters.

(3) Set the monitoring time interval at which the liveness notification exits of the subsystems in the own cluster are scheduled. Monitoring then starts.

(4) When the subsystem monitoring time arrives, schedule and call the liveness notification exit of each subsystem being monitored. This step is repeated at every predetermined monitoring interval.

Although not shown in the figure, notifying other clusters that the own cluster is alive, and watching for liveness notifications that fail to arrive from other clusters, are likewise performed at predetermined intervals. This liveness notification processing may be the same as the conventional processing.
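The three intervals set by the monitoring unit can be illustrated with a small timing sketch in Python. The interval values, the action names, and the use of integer ticks with a modulo test are all assumptions made for illustration; the specification does not give concrete values or a timer mechanism.

```python
from typing import Dict, List


def due_actions(now: int, intervals: Dict[str, int]) -> List[str]:
    """Return the periodic actions that fall due at integer time `now`."""
    return [name for name, period in intervals.items() if now % period == 0]


# Assumed example values, in arbitrary time units.
intervals = {
    "send_alive": 2,                 # step (1): own "I'm ALIVE" notification
    "check_peer_alive": 3,           # step (2): watch for peers' notifications
    "schedule_subsystem_exits": 4,   # steps (3)/(4): call liveness exits
}

timeline = {t: due_actions(t, intervals) for t in range(1, 13)}
```

With these values, `timeline[4]` contains the own-cluster notification and the subsystem exit scheduling, and all three actions coincide at time 12.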

The interface between each subsystem S1 or the like and the down monitoring mechanism 15 is, for example, as shown in FIG. 6.

Each subsystem S1 or the like requests subsystem monitoring from the down monitoring mechanism 15 through a macro interface. In response, the down monitoring mechanism 15 registers the information needed for monitoring, such as the subsystem identifier, the liveness notification exit, and the resource recovery exit, in the subsystem monitoring table. From then on, the subsystem is monitored by, among other things, the monitoring unit processing shown in FIG. 5.

The monitoring unit of the down monitoring mechanism 15 periodically schedules the liveness notification exits registered in the subsystem monitoring table. If its subsystem is operating normally, the liveness notification exit asserts liveness through the macro interface, and the down monitoring mechanism 15 marks the subsystem as alive in the status display area of the subsystem monitoring table. If no liveness notification arrives within a certain waiting time, the subsystem is regarded as having become abnormal.
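The exit-based liveness check can be sketched in Python as follows. The class `SubsystemWatch`, its method names, the `timeout` parameter, and the explicit `now` timestamps are hypothetical stand-ins for the macro interface and the timer, which the specification does not detail.

```python
from typing import Callable, Dict, List


class SubsystemWatch:
    """Simplified subsystem liveness monitoring (hypothetical API)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                          # assumed waiting time
        self.exits: Dict[str, Callable[[float], None]] = {}
        self.last_alive: Dict[str, float] = {}

    def request_monitoring(self, name: str,
                           alive_exit: Callable[[float], None],
                           now: float) -> None:
        # Register the subsystem and its liveness notification exit.
        self.exits[name] = alive_exit
        self.last_alive[name] = now

    def assert_alive(self, name: str, now: float) -> None:
        # Called by a healthy subsystem from inside its liveness exit.
        self.last_alive[name] = now

    def poll(self, now: float) -> List[str]:
        """Call every registered exit, then return subsystems deemed abnormal."""
        for alive_exit in list(self.exits.values()):
            alive_exit(now)
        return [name for name, seen in self.last_alive.items()
                if now - seen > self.timeout]


watch = SubsystemWatch(timeout=5.0)
# S1 is healthy: its exit asserts liveness. S2 is hung: its exit does nothing.
watch.request_monitoring("S1", lambda now: watch.assert_alive("S1", now), now=0.0)
watch.request_monitoring("S2", lambda now: None, now=0.0)
early = watch.poll(3.0)
late = watch.poll(6.0)
```

In this sketch a subsystem is only flagged once the waiting time has elapsed without an assertion, so `early` is empty while `late` names the hung subsystem.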

When a subsystem S1 or the like detects an abnormality by itself, it self-reports the abnormality to the down monitoring mechanism 15 through the macro interface.

Various techniques by which a subsystem detects its own malfunction are already known, so a detailed description is omitted here.

When the down monitoring mechanism 15 detects that a subsystem S1 or the like has fallen into an abnormal state, it sends a down notification to the down monitoring mechanisms 15 of the other clusters.

Depending on the type of subsystem monitoring request made to the down monitoring mechanism 15, a cluster stop is scheduled after the down monitoring mechanisms 15 of the other clusters have been notified.

When a subsystem S1 or the like finishes its processing, it asks the down monitoring mechanism 15, through the macro interface, to withdraw it from subsystem monitoring. In response, the down monitoring mechanism 15 deletes the subsystem from the subsystem monitoring table and, if necessary, deletes the own cluster's resource recovery routine (resource recovery exit) from the cluster monitoring table.

When a cluster-down or subsystem-down notification arrives from another cluster, the down monitoring mechanism 15 performs the processing shown in FIG. 7.

The down monitoring mechanism 15 refers to the subsystem monitoring table in the cluster that received the notification, schedules the resource recovery exit E1 of the monitored subsystem registered there, and notifies the subsystem of the same type that its counterpart is down. That subsystem S1 can then recover the resources of the failed subsystem and, if necessary, take over its work.

In the case of the cluster stop type, the inter-cluster shared resources (system resources) are recovered before the resource recovery exits are scheduled.
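The ordering on the receiving side can be sketched in Python as follows. The function name `handle_down_notice`, its parameters, and the dictionary used as the subsystem monitoring table are assumptions for illustration; only the ordering (shared resources first for the cluster stop type, then the resource recovery exit) comes from the description above.

```python
from typing import Callable, Dict, List


def handle_down_notice(failed_cluster: str,
                       subsystem: str,
                       cluster_stop: bool,
                       recovery_exits: Dict[str, Callable[[str], None]],
                       recover_shared: Callable[[str], None]) -> None:
    """Process a down notification received from another cluster."""
    if cluster_stop:
        # Cluster stop type: recover inter-cluster shared resources first.
        recover_shared(failed_cluster)
    # Then schedule the resource recovery exit registered for the subsystem.
    exit_fn = recovery_exits.get(subsystem)
    if exit_fn is not None:
        exit_fn(failed_cluster)


log: List[str] = []
exits = {"S1": lambda cluster: log.append("E1 for " + cluster)}
handle_down_notice("13A", "S1", True, exits,
                   lambda cluster: log.append("shared for " + cluster))
```

After this call, `log` records the shared-resource recovery before the subsystem's resource recovery exit.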

[Effect of the invention]

As described above, the present invention has the following effects compared with the conventional monitoring method that relies on liveness notifications alone.

(a) An abnormality in another cluster can be recognized accurately, unaffected by the system load, and instantly, for example through a self-report from the point where the abnormality occurred. This is especially useful in hot standby systems and the like, because it speeds up processing such as system switchover.

(b) Because the down monitoring mechanisms are integrated, a cluster abnormality or a task abnormality no longer needs to be monitored by multiple down monitoring mechanisms, which eliminates the problem of the mechanisms disagreeing in what they recognize.

[Brief explanation of drawings]

FIG. 1 shows a configuration example of the present invention.
FIG. 2 shows an example of a system to which the present invention is applied.
FIG. 3 shows examples of the control tables used in the embodiment of the present invention.
FIG. 4 shows the processing flow at initialization of the down monitoring mechanism according to the embodiment.
FIG. 5 shows the processing flow of the monitoring unit according to the embodiment.
FIG. 6 is an explanatory diagram of the processing of the down monitoring mechanism according to the embodiment.
FIG. 7 shows an example of the processing performed by the down monitoring mechanism according to the embodiment at the time of a down notification.
FIG. 8 shows an example of conventional down monitoring.

In the figures, 10 is an inter-cluster external storage device, 11 is an inter-cluster communication area, 12 is a cluster monitoring table, 13A and 13B are clusters, 14A and 14B are subsystem monitoring tables, 15A and 15B are down monitoring mechanisms, 16 is a monitoring unit, 17 is a notification unit, 18 is an exit scheduling unit, S1 to S3 are subsystems, and E1 is a resource recovery exit.

Claims

[Claims] A down monitoring processing method for a composite system consisting of a plurality of clusters (13A, 13B) each having a computer, wherein each cluster has a down monitoring mechanism (15A, 15B), and each down monitoring mechanism comprises: means (16) for monitoring the occurrence of an abnormality in a subsystem within its own cluster; means (17) for notifying, when an abnormality occurs in a subsystem within its own cluster, the down monitoring mechanisms in other clusters that the subsystem has fallen into an abnormal state; and means (18) for calling, when an abnormality notification regarding a subsystem is received from another cluster, a previously registered subsystem of the same type or a processing unit that performs processing related to the abnormality.