JPH11296492A

JPH11296492A - Control method and device for multi computer system recovery and machine readable recording medium recording program

Info

Publication number: JPH11296492A
Application number: JP10115926A
Authority: JP
Inventors: Toshihiro Nishizaki; 智弘西崎
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-04-10
Filing date: 1998-04-10
Publication date: 1999-10-29

Abstract

PROBLEM TO BE SOLVED: To enable flexible operation, such as distributing the operating load among computers during normal operation, which is not possible with conventional techniques that fix in advance the computer that executes recovery processing, by having the normally operating computer with the lowest operating load at that time execute the recovery processing when a failure occurs in a computer of a multi-computer system. SOLUTION: A load information recording means 4 monitors the operating load of each of computers A, B and C at fixed time intervals and records it in a load information management file 3. When a failure occurs in a computer (for example, A), a failure notification means 7 selects the computer with the lowest operating load (for example, B) from among the normally operating computers and notifies it of the failure. The recovery control means B-8 of the notified computer B performs recovery processing on the basis of the pre-update data image in an update record file 5 and releases the exclusive state.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a multi-computer system in which a plurality of computers share data and perform exclusive control so that no inconsistency arises in the shared data, and in particular to a recovery control technique for the case where a failure occurs during execution of a business processing program.

[0002]

[Prior Art] Conventionally, in a recovery control scheme for this kind of multi-computer system, when a failure occurs on one computer during execution of a business processing program, the failure is reported to another, predetermined computer. The computer that receives the notification recovers the data that was in the middle of being updated, so that no inconsistency arises in the shared data.

[0003] An example of a conventional recovery control system is described in Japanese Patent Application Laid-Open No. 63-98764.

[0004] The system described in that publication is a multi-computer system having an external shared storage unit, in which each computer constituting the multi-computer system comprises: file access control means that accesses a file storage unit to update file data in response to the execution result of an application program; recovery information storage means that, in response to an update of file data through access by the file access control means, stores the post-update information, in order of update occurrence, in the recovery information storage area of the external shared storage unit that corresponds to the own computer; monitoring means that monitors the operating status of the other computers; and recovery means that, when the monitoring means detects an abnormality in another computer and that other computer is one for which the own computer is to perform recovery processing, restores the file data of that other computer using the information in the recovery information storage area corresponding to it.

[0005] A recovery control system with this configuration operates as follows.

[0006] To detect that a computer in the multi-computer system has gone down, health check signals are exchanged among the computers. If such a signal has not arrived after a fixed time has elapsed, this is detected as a failure of the computer concerned, and each computer determines whether it is the one that should perform the recovery processing. Which computer performs recovery when a given computer fails is determined in advance. A computer that determines that it is to perform recovery forcibly takes over whatever the failed computer still holds under exclusive control, then reads the failed computer's recovery information and, if necessary, executes recovery processing.
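The prior-art behavior described above (timeout-based failure detection followed by a fixed, predetermined choice of recovery computer) can be sketched roughly as follows. This is only an illustration: the names `RECOVERY_ASSIGNMENT`, `detect_failures`, and `should_recover`, the timestamps, and the 5-second timeout are assumptions and do not come from the cited publication.

```python
from typing import Dict, List

# Hypothetical fixed assignment: failed computer -> computer that recovers it.
RECOVERY_ASSIGNMENT: Dict[str, str] = {"A": "B", "B": "C", "C": "A"}


def detect_failures(last_seen: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Return the computers whose health check signal is overdue."""
    return sorted(c for c, t in last_seen.items() if now - t > timeout)


def should_recover(self_id: str, failed_id: str) -> bool:
    """A computer recovers only the computer it was assigned to in advance."""
    return RECOVERY_ASSIGNMENT.get(failed_id) == self_id


# Time (in seconds) at which each computer's last health check signal arrived.
last_seen = {"A": 100.0, "B": 109.0, "C": 109.5}
failed = detect_failures(last_seen, now=110.0, timeout=5.0)
```

Because the assignment table is static, computer B must keep spare capacity for A's recovery at all times, which is the limitation the invention addresses.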

[0007] Another scheme is described in Japanese Patent Application Laid-Open No. 4-219860.

[0008] The system described in that publication is a loosely coupled computer system consisting of a plurality of hosts and having a multi-system control processor that exclusively controls the resources shared among the hosts. The multi-system control processor includes: a monitoring timer that monitors time for each host; monitoring timer start means that starts the monitoring timer for a host when that host begins using the multi-system control processor; monitoring timer reset means that resets the monitoring timer within every fixed period, thereby informing the multi-system control processor that the corresponding host is in use; host failure detection means that detects a failure of the corresponding host when a monitoring timer is not reset within the fixed period; and host failure notification means that notifies the other hosts when the host failure detection means detects a failure of a host. Each host includes: recovery order definition means that defines in advance the recovery order definition information used to determine the recovery host that will perform cleanup processing for the own host if the own host fails; host operation information acquisition means that acquires the host operation information held by the multi-system control processor when a failure of another host is reported; recovery host determination means that determines whether the own host should perform recovery, and for which host, by comparing the acquired host operation information with the recovery order definition information; and recovery activation means that starts cleanup processing for the failed host when the recovery host determination means determines that the own host is the recovery host.

[0009] A recovery control system with this configuration operates as follows.

[0010] Each host uses the recovery order definition means to define in advance the recovery order definition information that determines which recovery host will perform cleanup processing for it if it fails. When a failure of another host is reported, the host operation information acquisition means acquires the host operation information held by the multi-system control processor. The recovery host determination means then compares the acquired host operation information with the recovery order definition information to determine whether the own host should perform recovery and for which host. If the own host is determined to be the recovery host, the recovery activation means starts cleanup processing for the failed host.

[0011]

[Problem to Be Solved by the Invention] In the conventional recovery control schemes for multi-computer systems described above, the computer that performs recovery processing when a given computer fails is fixed in advance. The computer in charge of recovery must therefore always be operated with spare capacity in the operating load of its processor and the like, in preparation for a failure of the other computer. Without such spare capacity, the other computer may fail while the operating load of the recovery computer is high, and when that happens the recovery computer's own normal work is delayed or the recovery processing cannot be carried out promptly. Because conventional recovery control thus requires spare capacity on the computer in charge of recovery, it is difficult to operate the system flexibly during normal operation, for example by distributing the operating load among the computers.

[0012] [Object of the Invention] An object of the present invention is therefore to provide a recovery control technique for a multi-computer system that is composed of a plurality of computers, that holds the data needed to execute business processing programs as shared data that all constituent computers can update and reference, and in which the same business processing program can be executed on any of the computers. When a failure that makes it impossible to continue executing a business processing program occurs on one computer during execution of that program, recovery processing is performed by whichever of the other, normally operating computers has the lowest operating load at that time. This eliminates the need to fix the recovery computer in advance and allows more flexible operation.

[0013]

[Means for Solving the Problem] To achieve the above object, the multi-computer system recovery control device of the present invention comprises: exclusive control means (2 in FIG. 1) that, when a business processing program updates shared data (1 in FIG. 1), prevents simultaneous updates by other business processing programs so that no inconsistency arises in the shared data (1 in FIG. 1); update record management means (6 in FIG. 1) that, when an update to the shared data (1 in FIG. 1) occurs, stores the pre-update data image in an update record file (5 in FIG. 1); failure notification means (7 in FIG. 1) that, when a computer becomes unable to continue executing a business processing program during its execution, notifies the normally operating computer, other than the failed one, that has the lowest operating load at that time of the failed computer; and recovery control means (A-8, B-8, C-8 in FIG. 1), provided for each of the computers (A, B, C in FIG. 1), that, when its computer is notified of a failed computer, reads from the update record file (5 in FIG. 1) the pre-update data image of the data that was being updated, restores that data to its pre-update state, and releases the exclusive state of the shared data that is under exclusive control and cannot be updated.

[0014] In this configuration, when a failure occurs in one of the computers (A, B, C in FIG. 1) constituting the multi-computer system (for example, A in FIG. 1), the failure notification means (7 in FIG. 1) notifies the normally operating computer (B or C in FIG. 1) that has the lowest operating load at that time (for example, B in FIG. 1) of the failed computer and has it perform the recovery processing.

[0015] So that recovery processing can be started promptly when a failure occurs, the multi-computer system recovery control device of the present invention further comprises load information recording means (4 in FIG. 1) that obtains, at fixed time intervals, operating load information indicating the operating load of each of the computers (A, B, C in FIG. 1) and records it in a load information management file (3 in FIG. 1). The failure notification means (7 in FIG. 1) is configured to notify, of the failed computer, the normally operating computer whose operating load is lowest according to the operating load information recorded in the load information management file (3 in FIG. 1).

[0016] In this configuration, the load information recording means (4 in FIG. 1) obtains operating load information indicating the operating load of each computer (A, B, C in FIG. 1) at fixed time intervals and records it in the load information management file (3 in FIG. 1). When a failure occurs in one of the computers (A, B, C in FIG. 1) constituting the multi-computer system (for example, A in FIG. 1), the failure notification means (7 in FIG. 1) notifies, of the failed computer, the normally operating computer (B or C in FIG. 1) whose operating load is lowest according to the operating load information recorded in the load information management file (3 in FIG. 1).
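As a rough sketch of the selection made by the failure notification means from the recorded load information, the following function picks the lowest-load computer among those still running. The function name, the use of a plain dictionary in place of the load information management file, the numeric load values, and the tie-break by identifier are all assumptions made for illustration; the source does not specify them.

```python
from typing import Dict, Optional


def select_recovery_computer(load_info: Dict[str, float], failed_id: str) -> Optional[str]:
    """Pick the normally operating computer with the lowest recorded load.

    load_info maps a computer identifier to its most recently recorded
    operating load. Ties are broken by identifier (an assumption).
    """
    candidates = {c: load for c, load in load_info.items() if c != failed_id}
    if not candidates:
        return None  # no healthy computer is left to perform recovery
    return min(candidates, key=lambda c: (candidates[c], c))


# Computer A has failed; C carries less load than B, so C is chosen.
chosen = select_recovery_computer({"A": 0.2, "B": 0.7, "C": 0.4}, failed_id="A")
```

Because the loads are read from a file that is already up to date to within one recording interval, the selection itself needs no communication with the surviving computers, which is why this variant can start recovery quickly.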

[0017] So that the accurate current operating load of each computer can be obtained, and so that the delay that acquiring operating load information imposes on the computers' normal work is kept small, the multi-computer system recovery control device of the present invention alternatively comprises load state reference means (9 in FIG. 6) that, when a computer (for example, A in FIG. 6) becomes unable to continue executing a business processing program, acquires operating load information from each normally operating computer other than that computer (B and C in FIG. 6) and selects the computer with the lowest operating load on the basis of the acquired information. The failure notification means (7a in FIG. 6) is configured to notify the computer selected by the load state reference means (9 in FIG. 6) of the failed computer.

[0018] In this configuration, when a computer (for example, A in FIG. 6) becomes unable to continue executing a business processing program, the load state reference means (9 in FIG. 6) acquires operating load information from each normally operating computer other than that computer (B and C in FIG. 6) and selects the computer with the lowest operating load on the basis of the acquired information. Once the load state reference means (9 in FIG. 6) has selected the computer with the lowest operating load, the failure notification means (7a in FIG. 6) notifies the selected computer of the failed computer.
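The on-demand variant can be sketched as below: instead of reading a periodically written file, the load is requested from each surviving computer at the moment of failure. The `probes` callables stand in for whatever mechanism actually queries a computer; they, the function name, and the choice to skip a computer whose probe raises `OSError` are assumptions for illustration only.

```python
from typing import Callable, Dict, Optional


def query_and_select(probes: Dict[str, Callable[[], float]], failed_id: str) -> Optional[str]:
    """Ask each surviving computer for its current load and pick the lowest."""
    loads: Dict[str, float] = {}
    for computer_id, probe in probes.items():
        if computer_id == failed_id:
            continue  # never query the failed computer
        try:
            loads[computer_id] = probe()
        except OSError:
            continue  # assumption: an unreachable computer is not a candidate
    if not loads:
        return None
    return min(loads, key=lambda c: (loads[c], c))


def unreachable() -> float:
    raise OSError("no response")


selected = query_and_select(
    {"A": unreachable, "B": lambda: 0.35, "C": lambda: 0.60},
    failed_id="A",
)
```

The trade-off against the file-based variant is visible in the structure: no periodic monitoring runs during normal operation, but the selection has to wait for the probes to answer.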

[0019]

[Embodiments of the Invention] Embodiments of the present invention will now be described in detail with reference to the drawings.

[0020] FIG. 1 is a block diagram showing a configuration example of the first embodiment of the present invention, for the case where the multi-computer system consists of three computers.

[0021] Referring to FIG. 1, the first embodiment of the present invention is a multi-computer system that is composed of a plurality of computers A, B, and C, that holds the data needed to execute business processing programs as shared data 1 that all constituent computers can update and reference, and in which the same business processing program can be executed on any of the computers A, B, and C. The system includes: exclusive control means 2 that, when a business processing program updates the shared data 1, prevents simultaneous updates by other business processing programs so that no inconsistency arises in the shared data 1; load information recording means 4 that monitors, at fixed time intervals, the operating load of the computers A, B, and C, such as the operating load of the processor (not shown) that executes business processing programs on each computer, and successively records information on the load state at each point in time in the load information management file 3; update record management means 6 that, when an update to the shared data 1 occurs, stores the pre-update data image in the update record file 5; failure notification means 7 that, when a failure that makes it impossible to continue executing a business processing program occurs on one of the computers A, B, and C during execution of that program, detects the failure, retrieves the contents of the load information management file 3 recorded by the load information recording means 4, and notifies the normally operating computer, other than the failed one, that has the lowest operating load at that time of the failed computer; and recovery control means A-8, B-8, and C-8, provided for the computers A, B, and C respectively, each of which, when its computer receives a failure notification, reads from the update record file 5 stored by the update record management means 6 the pre-update data image of the shared data 1 that was being updated, restores that data to its pre-update state, and changes the data that the exclusive control means 2 had placed in the exclusive, non-updatable state to the updatable, non-exclusive state.

[0022] In outline, these means have the following functions.

[0023] When a business processing program starts an update to the shared data 1, the exclusive control means 2 moves a predetermined exclusive control unit (a file, block, record, or the like) containing the data to be updated into the exclusive state. Update requests from other business processing programs for that exclusive control unit are then made to wait, which prevents the data being updated from being updated simultaneously by another business processing program and corrupted. When the business processing program finishes its update to the shared data 1, the exclusive control means 2 moves the exclusive control unit back to the non-exclusive state, in which it can be updated by any program. The access requests that were kept waiting can then be processed.
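The state transitions of an exclusive control unit can be illustrated with a small in-memory sketch. The class `ExclusiveControl`, its method names, and the first-come-first-served hand-over to the next waiting program are assumptions; the source only states that competing requests wait and can be processed once the unit returns to the non-exclusive state.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ExclusiveControl:
    """Tracks which exclusive control units are in the exclusive state."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}            # unit -> program holding it
        self._waiting: Dict[str, Deque[str]] = {}   # unit -> programs kept waiting

    def request_update(self, unit: str, program: str) -> bool:
        """Return True if the unit was granted, False if the request must wait."""
        if unit not in self._owner:
            self._owner[unit] = program   # non-exclusive -> exclusive
            return True
        self._waiting.setdefault(unit, deque()).append(program)
        return False

    def finish_update(self, unit: str) -> Optional[str]:
        """Release the unit; return the next waiting program that now holds it, if any."""
        del self._owner[unit]             # exclusive -> non-exclusive
        queue = self._waiting.get(unit)
        if queue:
            next_program = queue.popleft()
            self._owner[unit] = next_program  # assumption: FIFO hand-over
            return next_program
        return None

    def is_exclusive(self, unit: str) -> bool:
        return unit in self._owner
```

For example, if program P1 holds record `rec1` and P2 requests it, `request_update` returns `False` for P2, and a later `finish_update("rec1")` returns `"P2"`.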

[0024] The load information recording means 4 monitors, at fixed time intervals, the operating load of each of the computers A, B, and C that execute business processing programs, such as the utilization of its processor and the number of business processing programs scheduled in its processing queue. It stores the acquired values, such as processor utilization, in the load information management file 3 as operating load information indicating the operating loads of the computers A, B, and C.

[0025] When a business processing program starts an update to the shared data 1, the update record management means 6 stores in the update record file 5, in association with one another, the pre-update data image of a predetermined unit containing the data to be updated, the address information of that unit, and the identifier of the computer executing the business processing program. When the business processing program finishes updating the data, the update record management means 6 adds, to the information stored in the update record file 5, update end information indicating that the update has finished. Instead of adding update end information, the information may be deleted from the update record file 5.
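The contents of the update record file can be modeled as a list of before-image records. In this sketch the class and field names (`UpdateRecord`, `UpdateLog`, `before_image`, `finished`) are illustrative assumptions; the source specifies only that the pre-update data image, the unit's address information, and the computer identifier are stored together, and that update end information is added (or the record deleted) when the update finishes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpdateRecord:
    address: str            # address information of the updated unit
    before_image: bytes     # data image before the update
    computer_id: str        # computer executing the business processing program
    finished: bool = False  # stands in for the "update end information"


class UpdateLog:
    """In-memory stand-in for the update record file."""

    def __init__(self) -> None:
        self.records: List[UpdateRecord] = []

    def begin(self, address: str, before_image: bytes, computer_id: str) -> UpdateRecord:
        record = UpdateRecord(address, before_image, computer_id)
        self.records.append(record)
        return record

    def end(self, record: UpdateRecord) -> None:
        record.finished = True  # alternatively, the record could be removed

    def in_progress(self, computer_id: str) -> List[UpdateRecord]:
        """Records that a given computer started but never finished."""
        return [r for r in self.records
                if r.computer_id == computer_id and not r.finished]
```

A recovering computer would call `in_progress` with the failed computer's identifier to find the units whose before images must be written back.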

[0026] When the failure notification means 7 detects that a failure has occurred in any of the computers A, B, and C, it refers to the load information management file 3 and obtains the latest operating load information at that time for the computers other than the failed one. On the basis of that information it selects the computer with the lowest operating load at that time and notifies the recovery control means of that computer of the failed computer.

[0027] When the failure notification means 7 notifies it of a failed computer, the recovery control means A-8, B-8, or C-8 refers to the update record file 5 to check whether any data was left in the middle of an update by the failed computer. If such data exists, it retrieves the pre-update data image of that data from the update record file 5 and, on the basis of that image, restores the contents of the shared data 1 to the pre-update data.

[0028] Next, the operation of this embodiment will be described in detail with reference to FIG. 1, FIG. 2, FIG. 3, FIG. 4, and FIG. 5.

[0029] FIG. 2 shows the flow of processing by the exclusive control means 2 and the update record management means 6 when a business processing program running on computer A, B, or C starts an update to the shared data 1.

[0030] A business processing program running on any of the computers A, B, and C issues an update request to the exclusive control means 2 when it is to update the shared data 1. If the exclusive control unit of the shared data 1 containing the data to be updated is not already in the exclusive state, the exclusive control means 2 moves that unit into the exclusive state, in which other business processing programs cannot update it (step 2-1 in FIG. 2). The update record management means 6 then records in the update record file 5, in association with one another, the pre-update data image of the predetermined unit containing the data to be updated, its address information, and the identifier of the computer executing the business processing program (step 2-2), and instructs the exclusive control means 2 to start the update (step 2-3). On receiving this instruction, the exclusive control means 2 starts the update processing (step 2-4).

[0031] FIG. 3 shows the flow of processing by the exclusive control means 2 and the update record management means 6 when a business processing program finishes updating the shared data 1.

[0032] When a business processing program finishes updating the shared data 1, the exclusive control means 2 notifies the update record management means 6 that the data update to the shared data 1 has finished (step 3-1 in FIG. 3). On receiving this notification, the update record management means 6 adds update end information to the information it stored in the update record file 5 in step 2-2 of FIG. 2 (step 3-2), and notifies the exclusive control means 2 of the address information contained in that information (step 3-3). The exclusive control means 2 changes the exclusive control unit indicated by the address information to the non-exclusive state, in which other business processing programs can access it (step 3-4).
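The ordering of the start-of-update steps (2-1 to 2-4) and end-of-update steps (3-1 to 3-4) can be traced with a compact sketch. The dictionary-based shared data, the `exclusive` set, the log entry layout, and the function names are assumptions chosen for brevity; the point is only the order in which the unit is locked, the before image is logged, the update is applied, and the unit is released.

```python
from typing import Dict, List, Set

shared_data: Dict[str, str] = {"rec1": "old"}   # stand-in for shared data 1
exclusive: Set[str] = set()                      # units in the exclusive state
update_log: List[dict] = []                      # stand-in for update record file 5


def begin_update(unit: str, computer_id: str) -> dict:
    if unit in exclusive:
        raise RuntimeError("unit is in the exclusive state; the request must wait")
    exclusive.add(unit)                                        # step 2-1
    entry = {"address": unit, "before": shared_data[unit],
             "computer": computer_id, "finished": False}
    update_log.append(entry)                                   # step 2-2
    return entry                                               # steps 2-3, 2-4: update may proceed


def end_update(entry: dict) -> None:
    entry["finished"] = True                                   # steps 3-1, 3-2
    address = entry["address"]                                 # step 3-3
    exclusive.discard(address)                                 # step 3-4


entry = begin_update("rec1", "A")
shared_data["rec1"] = "new"      # the actual update by the business processing program
end_update(entry)
```

Logging the before image before the data is touched is what makes the later rollback possible: if computer A stopped between `begin_update` and `end_update`, the entry would still be unfinished and would still carry `"old"`.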

[0033] FIG. 4 shows the data format of the load information management file 3 recorded by the load information recording means 4. At a preset fixed time interval, the load information recording means 4 acquires information Na, Nb, and Nc on the operating load of all the computers A, B, and C, such as processor utilization and the number of business processing programs scheduled in the queue. It stores the acquired operating load information Na, Nb, and Nc in the load information management file 3 together with the acquisition time "hh:mm:ss" and the computer identifiers A, B, and C that uniquely identify the computers concerned. FIG. 4 is an example of what is recorded in the load information management file 3 when the multi-computer system consists of three computers A, B, and C, identified by identifiers A, B, and C, and their operating load information is acquired at time "hh:mm:ss".
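One way to picture the record layout described for FIG. 4 is a list of (time, identifier, load) entries. The `LoadRecord` type, the use of a single float for the load, and the helper names are assumptions; the source says only that the load information is stored with its acquisition time and the computer identifier. The sketch also assumes records are appended in chronological order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LoadRecord:
    time: str          # acquisition time in "hh:mm:ss" form
    computer_id: str   # identifier A, B, or C
    load: float        # operating load information (Na, Nb, Nc)


def record_loads(load_file: List[LoadRecord], time: str, loads: Dict[str, float]) -> None:
    """Append one record per computer for a single sampling instant."""
    for computer_id in sorted(loads):
        load_file.append(LoadRecord(time, computer_id, loads[computer_id]))


def latest_loads(load_file: List[LoadRecord]) -> Dict[str, float]:
    """Most recent load per computer; later records overwrite earlier ones."""
    latest: Dict[str, float] = {}
    for record in load_file:
        latest[record.computer_id] = record.load
    return latest


load_file: List[LoadRecord] = []
record_loads(load_file, "10:00:00", {"A": 0.30, "B": 0.80, "C": 0.50})
record_loads(load_file, "10:00:10", {"A": 0.35, "B": 0.20, "C": 0.55})
current = latest_loads(load_file)
```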

[0034] Next, the operation when a failure occurs will be described.

[0035] FIG. 5 shows the flow of recovery processing when a failure occurs on a computer that is executing a business processing program.

【００３６】When a failure occurs in a computer executing a business processing program (for example, computer A) and the program can no longer be executed, the failure notification means 7 detects the failure (step 5-1 in FIG. 5), refers to the load information management file 3, and acquires the operation load information for the normally operating computers B and C (step 5-2). The failure notification means 7 then selects, based on the operation load information of computers B and C, the computer with the lowest operation load at that time (step 5-3). If, for example, computer C is selected as the computer with the lowest operation load, the failure notification means 7 notifies computer C that a failure has occurred and of the identifier A of the failed computer A (step 5-4).
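Steps 5-2 and 5-3 amount to a minimum search over the most recent load records of the surviving computers. The sketch below is one possible reading, not the patent's implementation: the function name `select_recovery_computer`, the (CPU usage, queue length) tuple, and the rule of comparing CPU usage first and breaking ties by queue length and then identifier are all assumptions, since the patent only says that the computer with the lowest operation load is selected.

```python
def select_recovery_computer(latest_loads, failed_id):
    """Return the identifier of the normally operating computer with the lowest load.

    `latest_loads` maps a computer identifier to (cpu_usage, queued_programs),
    taken from the most recent entries of the load information management file.
    """
    # Step 5-2: consider only computers other than the failed one.
    candidates = {
        cid: load for cid, load in latest_loads.items() if cid != failed_id
    }
    if not candidates:
        raise ValueError("no normally operating computer is available")
    # Step 5-3: tuples compare element-wise, so CPU usage is compared first,
    # then queue length, then the identifier as a deterministic tie-break
    # (an assumed ordering).
    return min(candidates, key=lambda cid: (candidates[cid], cid))


latest_loads = {"A": (0.10, 0), "B": (0.60, 3), "C": (0.25, 1)}
chosen = select_recovery_computer(latest_loads, failed_id="A")
```

With these invented values, computer A is excluded even though its recorded load is lowest, and computer C is chosen over computer B.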

【００３７】In computer C, which has received the notification, the recovery control means C-8 refers to the update record file 5 (step 5-5) and checks whether any data was left in mid-update by the failed computer A (step 5-6). If there is no data in mid-update, no recovery processing is needed and the processing ends. If there is data in mid-update, the recovery control means C-8 retrieves the pre-update data image recorded in the update record file 5 (step 5-7) and restores the data to its state before the update processing (step 5-8). When the data has been restored, the exclusive control means 2 changes the exclusive state of the data to the non-exclusive state (step 5-9). Through the above processing, the shared data 1 becomes consistent, and the business processing programs running on computers B and C can perform processing that uses the data. If necessary, after step 5-9 is complete, computer C, which performed the recovery processing, may re-execute the update processing on the shared data 1 that the failed computer A was executing.
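Steps 5-5 through 5-9 describe an undo-style rollback driven by before-images. The following Python sketch illustrates that idea under stated assumptions: `UpdateRecord`, `recover`, the dictionary standing in for the shared data 1, and the set of locked addresses standing in for the exclusive control state are hypothetical, and marking a record as finished after rollback is an added assumption that the patent does not describe.

```python
from dataclasses import dataclass


@dataclass
class UpdateRecord:
    computer_id: str        # computer that ran the updating program
    address: str            # address of the update (exclusive control) unit
    before_image: bytes     # data image before the update
    finished: bool = False  # True once update end information has been added


def recover(failed_id, update_log, shared_data, locked_units):
    """Roll back the failed computer's in-flight updates and release their locks."""
    # Steps 5-5, 5-6: find records of the failed computer with no update end info.
    pending = [
        r for r in update_log if r.computer_id == failed_id and not r.finished
    ]
    for rec in pending:
        shared_data[rec.address] = rec.before_image  # steps 5-7, 5-8: restore
        locked_units.discard(rec.address)            # step 5-9: non-exclusive
        rec.finished = True  # assumption: prevents rolling back the record twice
    return [r.address for r in pending]


shared = {"rec1": b"half-written", "rec2": b"ok"}
log = [
    UpdateRecord("A", "rec1", b"old"),
    UpdateRecord("B", "rec2", b"prev", finished=True),
]
locks = {"rec1"}
restored = recover("A", log, shared, locks)
```

An empty return value corresponds to the case in which no data is in mid-update and recovery ends immediately.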

【００３８】Next, the effects of this embodiment will be described.

【００３９】In this embodiment, when a failure that makes it impossible to continue executing a business processing program occurs on a computer A while that program is running, the computer B or C with the lowest system load at that time can be selected, based on the data in the load information management file 3, from among the normally operating computers B and C other than computer A. The system can therefore be operated more flexibly than a conventional recovery control system in which the computer that performs recovery processing is fixed in advance. The embodiment described so far uses a multi-computer system composed of three computers as an example, but the same applies to a multi-computer system composed of four or more computers.

【００４０】Next, a second embodiment of the present invention will be described in detail with reference to the drawings.

【００４１】Referring to FIG. 6, the second embodiment of the present invention is applied to a multi-computer system that is composed of a plurality of computers A, B, and C, that shares the data needed to execute business processing programs as shared data 1 that can be updated and referenced from all the constituent computers A, B, and C, and in which the same business processing program can be executed on any of the computers A, B, and C. The embodiment includes: exclusive control means 2 which, when a business processing program updates the shared data 1, performs control to prevent simultaneous updating by other business processing programs so that no inconsistency arises in the shared data 1; update record management means 6 which, when update processing on the shared data 1 occurs, stores the data image before the update in the update record file 5; load state reference means 9 which, when a failure that makes it impossible to continue executing a business processing program occurs on one of the computers A, B, and C while that program is running, collects operation load information indicating the operation load of each normally operating computer other than the failed computer and, based on that information, selects the computer with the lowest operation load at that time; failure notification means 7a which, when a failure occurs in any of the computers A, B, and C, detects the failure and notifies the load state reference means 9, and which, when the load state reference means 9 has selected the computer with the lowest operation load at that time, notifies that computer of the failed computer; and recovery control means A-8, B-8, and C-8, provided for each of the computers A, B, and C, which, when a failure notification is issued to the corresponding computer, read from the update record file 5 the pre-update data image of the shared data 1 that was in mid-update, as stored by the update record management means 6, restore the data that was in mid-update to its pre-update state, and return data that has been placed in the exclusive state by the exclusive control means 2, and therefore cannot be updated, to the updatable non-exclusive state.

【００４２】As described above, this embodiment differs from the first embodiment in that it includes failure notification means 7a in place of the failure notification means 7, and load state reference means 9 in place of the load information management file 3 and the load information recording means 4.

【００４３】Each of the means shown in FIG. 6 has roughly the following functions.

【００４４】When a business processing program starts update processing on the shared data 1, the exclusive control means 2 moves a predetermined exclusive control unit containing the data to be updated into the exclusive state. When the update processing from the business processing program on the shared data 1 ends, the exclusive control means 2 moves the exclusive control unit that was in the exclusive state into the non-exclusive state, in which the unit can be updated by any program.
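The two state transitions of the exclusive control means 2 can be sketched as a small ownership table. This is a single-process illustration only: the class `ExclusiveControl` and its method names are hypothetical, and a real multi-computer system would need a lock service shared by all computers, which the patent does not detail.

```python
class ExclusiveControl:
    """Tracks which exclusive control units are currently in the exclusive state."""

    def __init__(self):
        self._owners = {}  # exclusive control unit -> program holding it

    def acquire(self, unit, program):
        """Move `unit` to the exclusive state for `program`.

        Returns False if another program already holds the unit, which is how
        simultaneous updates are prevented in this sketch.
        """
        owner = self._owners.setdefault(unit, program)
        return owner == program

    def release(self, unit):
        """Return `unit` to the non-exclusive state (no-op if it is not held)."""
        self._owners.pop(unit, None)

    def is_exclusive(self, unit):
        return unit in self._owners


control = ExclusiveControl()
first = control.acquire("unit-7", "program-1")   # update starts
second = control.acquire("unit-7", "program-2")  # concurrent update is refused
control.release("unit-7")                        # update ends
third = control.acquire("unit-7", "program-2")   # now allowed
```

The same `release` call is what recovery processing would invoke after restoring a before-image on behalf of a failed computer.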

【００４５】When a business processing program starts update processing on the shared data 1, the update record management means 6 stores in the update record file 5, in association with one another, the pre-update data image of a predetermined unit containing the data to be updated, the address information of the update unit, and the identifier of the computer executing the business processing program. When the business processing program finishes updating the data, the update record management means 6 adds update end information, indicating that the update processing has ended, to the information stored in the update record file 5.
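The update record file 5 behaves like a before-image log with an end marker per entry. The sketch below shows one way to model it; `UpdateLog`, `begin_update`, `end_update`, and `in_flight` are hypothetical names, and an in-memory list of dictionaries stands in for the file.

```python
class UpdateLog:
    """In-memory stand-in for the update record file 5."""

    def __init__(self):
        self.records = []

    def begin_update(self, computer_id, address, before_image):
        """Store the before-image, unit address, and computer identifier together."""
        record = {
            "computer_id": computer_id,
            "address": address,
            "before_image": before_image,
            "update_end": False,  # no update end information yet
        }
        self.records.append(record)
        return record

    def end_update(self, record):
        """Add update end information once the program finishes the update."""
        record["update_end"] = True

    def in_flight(self, computer_id):
        """Records of `computer_id` that never received update end information."""
        return [
            r for r in self.records
            if r["computer_id"] == computer_id and not r["update_end"]
        ]


log = UpdateLog()
done = log.begin_update("A", "unit-1", b"v1")
log.end_update(done)                            # completed normally
log.begin_update("A", "unit-2", b"v7")          # computer A fails before finishing
pending_addresses = [r["address"] for r in log.in_flight("A")]
```

Records returned by `in_flight` are exactly those whose before-images a recovery control means would need to restore.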

【００４６】When a failure occurs in one of the computers A, B, and C, the failure notification means 7a detects it and notifies the load state reference means 9 of the failed computer. It then notifies the normally operating computer with the lowest load at that time, which the load state reference means 9 selects in response to this notification, of the identifier of the failed computer.

【００４７】On receiving notification of a failure from the failure notification means 7a, the load state reference means 9 acquires, at that time, operation load information indicating the operation load of each computer, such as the load status of the processing processor that executes business processing programs, from all normally operating computers other than the failed computer. Based on that information, it selects the normally operating computer with the lowest system load at that time and notifies the failure notification means 7a.

【００４８】When notified of the failed computer by the failure notification means 7a, the recovery control means A-8, B-8, and C-8 provided in computers A, B, and C, respectively, refer to the update record file 5 to check whether any data was left in mid-update by the failed computer. If such data exists, the recovery control means retrieves the pre-update data image of that data from the update record file 5 and, based on that data image, restores the contents of the shared data 1 to the pre-update data.

【００４９】Next, the operation of this embodiment will be described in detail with reference to FIGS. 6, 2, 3, 7, and 8.

【００５０】FIG. 2 shows the flow of processing by the exclusive control means 2 and the update record management means 6 when a business processing program starts update processing on the shared data 1. It is the same as the corresponding flow in the first embodiment.

【００５１】FIG. 3 shows the flow of processing by the exclusive control means 2 and the update record management means 6 when a business processing program finishes update processing on the shared data 1. It is the same as the corresponding flow in the first embodiment.

【００５２】FIG. 7 shows the format of the information on operation load status that the load state reference means 9 acquires from the normally operating computers B and C at the time a failure occurs in computer A.

【００５３】When notified by the failure notification means 7a that a failure has occurred in computer A, the load state reference means 9 acquires, from all the normally operating computers, information Nb and Nc on their operation load status, such as the usage rate of the processing processors and the number of business processing programs scheduled in the wait queue, together with the identifiers B and C that uniquely identify computers B and C. From the acquired operation load information Nb and Nc, it selects the computer B or C with the lowest operation load at that time and notifies the failure notification means 7a of the identifier B or C of the selected computer. FIG. 7 is an example of the contents of the operation load information Nb and Nc of computers B and C that the load state reference means 9 acquires when a failure occurs in computer A, in a multi-computer system composed of three computers whose identifiers are A, B, and C.

【００５４】Next, the operation when a failure occurs will be described.

【００５５】FIG. 8 shows the flow of recovery processing when a failure occurs in a computer that is executing a business processing program.

【００５６】When a failure occurs in a computer executing a business processing program (for example, computer A) and the program can no longer be executed, the failure notification means 7a detects the failure (step 8-1 in FIG. 8) and notifies the load state reference means 9 that a failure has occurred and of the identifier A of the failed computer A (step 8-2).

【００５７】The load state reference means 9 acquires from each of the computers B and C, that is, from all computers other than the failed computer A, information Nb and Nc on operation load status, such as the usage rate of the processing processors and the number of business processing programs scheduled in the wait queue, together with the identifiers B and C that uniquely identify the computers (step 8-3). From the acquired operation load information Nb and Nc, it selects the computer with the lowest operation load at that time (for example, computer B) and notifies the failure notification means 7a (step 8-4). The failure notification means 7a notifies computer B, selected by the load state reference means 9, that a failure has occurred and of the identifier A of the failed computer (step 8-5).
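Steps 8-2 through 8-4 differ from the first embodiment in that load is queried live at failure time instead of being read from a periodically written file. A minimal sketch of that on-demand variant follows; the function name `choose_recovery_computer_on_demand`, the `probe` callback that returns (CPU usage, queue length) for one computer, and the tie-break by identifier are assumptions for illustration.

```python
def choose_recovery_computer_on_demand(computers, failed_id, probe):
    """Query current load from every surviving computer and pick the lowest.

    `probe(computer_id)` is a hypothetical callback returning that computer's
    current (cpu_usage, queued_programs); it is never called for `failed_id`.
    """
    # Step 8-3: collect load only from computers other than the failed one.
    loads = {cid: probe(cid) for cid in computers if cid != failed_id}
    if not loads:
        raise ValueError("no normally operating computer is available")
    # Step 8-4: lowest load wins; the identifier breaks ties (assumed rule).
    return min(loads, key=lambda cid: (loads[cid], cid))


probed = []


def fake_probe(computer_id):
    # Stand-in for a network query; the values are invented.
    probed.append(computer_id)
    return {"B": (0.20, 1), "C": (0.45, 2)}[computer_id]


target = choose_recovery_computer_on_demand(["A", "B", "C"], "A", fake_probe)
```

Because the probe runs only after a failure is detected, no sampling overhead is incurred during normal operation, at the cost of a query round before recovery can begin.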

【００５８】In computer B, which has received the notification, the recovery control means B-8 refers to the update record file 5 (step 8-6) and checks whether any data was left in mid-update by the failed computer A (step 8-7). If there is no data in mid-update, no recovery processing is needed and the processing ends. If there is data in mid-update, the recovery control means B-8 retrieves the pre-update data image recorded in the update record file 5 (step 8-8) and restores the data to its state before the update processing (step 8-9). When the data recovery processing is complete, the exclusive control means 2 changes the exclusive state of the data to the non-exclusive state (step 8-10). Through the above processing, the shared data 1 becomes consistent, and the business processing programs running on computers B and C can perform processing that uses the data.

【００５９】The second embodiment described so far uses a multi-computer system composed of three computers as an example, but the same applies to a multi-computer system composed of four or more computers.

【００６０】In this embodiment, the failure notification means 7a performs the processing for acquiring operation load information indicating the operation load of the computers only when a failed computer is detected. Compared with the first embodiment, which acquires operation load information at fixed time intervals, this yields the accurate current operation load of the computers and also reduces the delay that acquiring operation load information imposes on the computers' normal work.

【００６１】FIG. 9 shows an example of a hardware configuration that implements the multi-computer system recovery control device shown in FIG. 1 or FIG. 6. It comprises a computer 91, a recording medium 92, and computers A, B, and C provided with recovery control means A-8, B-8, and C-8. The recording medium 92 is a disk, semiconductor memory, or other recording medium, on which a multi-computer system recovery control program is recorded.

【００６２】To implement the multi-computer system recovery control device shown in FIG. 1, the computer 91 is made to read the multi-computer system recovery control program recorded on the recording medium 92. The program then controls the operation of the computer 91 and implements on the computer 91 the exclusive control means 2, the load information recording means 4, the update record management means 6, and the failure notification means 7 shown in FIG. 1.

【００６３】To implement the multi-computer system recovery control device shown in FIG. 6, the computer 91 is made to read the multi-computer system recovery control program recorded on the recording medium 92. The program then controls the operation of the computer 91 and implements on the computer 91 the exclusive control means 2, the update record management means 6, the failure notification means 7a, and the load state reference means 9 shown in FIG. 6.

【００６４】[0064]

[Effects of the Invention] The first effect is that flexible operation, such as distributing the operation load among the computers during normal operation, becomes possible. This is because the computer with the most spare operating capacity at the time a failure occurs is chosen as the computer that performs recovery processing.

【００６５】The second effect is that recovery processing can be started quickly when a failure occurs. This is because the load information recording means obtains operation load information indicating the operation load of each computer at fixed time intervals and records it in the load information management file, and, when a failure occurs, the failure notification means determines the computer that performs recovery processing based on the contents of that file.

【００６６】The third effect is that the accurate current operation load of the computers can be obtained, and the delay that acquiring operation load information imposes on the computers' normal work can be reduced. This is because the load state reference means obtains the operation load of the normally operating computers only when a failure occurs in some computer.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a configuration example of the first embodiment of the present invention.

FIG. 2 is a flowchart showing an example of the processing of the exclusive control means 2 and the update record management means 6 when a business processing program starts update processing on the shared data 1 in the first and second embodiments.

FIG. 3 is a flowchart showing an example of the processing of the exclusive control means 2 and the update record management means 6 when a business processing program finishes update processing on the shared data 1 in the first and second embodiments.

FIG. 4 is a diagram showing an example of the contents of the load information management file 3.

FIG. 5 is a flowchart showing an example of the processing of the first embodiment when a failure occurs.

FIG. 6 is a block diagram showing a configuration example of the second embodiment of the present invention.

FIG. 7 is a diagram showing an example of the load state information acquired by the load state reference means 9.

FIG. 8 is a diagram showing an example of the processing of the second embodiment when a failure occurs.

FIG. 9 is a block diagram showing an example of a hardware configuration that implements the first and second embodiments.

[Explanation of symbols]

A, B, C: computers
1: shared data
2: exclusive control means
3: load information management file
4: load information recording means
5: update record file
6: update record management means
7, 7a: failure notification means
A-8, B-8, C-8: recovery control means
9: load state reference means
91: computer
92: recording medium

Claims

[Claims]

1. A multi-computer system recovery control method for a multi-computer system that is composed of a plurality of computers, that shares data necessary for executing business processing programs as shared data that can be updated and referenced from all the constituent computers, and in which the same business processing program can be executed on any of the computers, the method comprising: when a business processing program updates the shared data, performing exclusive control to prevent simultaneous updating by other business processing programs; when update processing on the shared data occurs, storing the data image before the update in an update record file; when, during execution of a business processing program, a computer arises on which execution of the business processing program can no longer continue, notifying the computer with the lowest operation load at that time, among the normally operating computers other than that computer, of the failed computer; and, in the computer notified of the failed computer, reading from the update record file the pre-update data image of the data that was being updated, restoring the data that was being updated to its pre-update state, and releasing the exclusive state of the shared data that has been placed under exclusive control and cannot be updated.

2. The multi-computer system recovery control method according to claim 1, wherein operation load information indicating the operation load of each computer is obtained at fixed time intervals and recorded in a load information management file, and, when a computer arises on which execution of the business processing program can no longer continue, the computer with the lowest operation load indicated by the operation load information recorded in the load information management file, among the normally operating computers other than that computer, is notified of the failed computer.

3. The multi-computer system recovery control method according to claim 1, wherein, when a computer arises on which execution of the business processing program can no longer continue, the operation loads of the normally operating computers other than that computer are obtained, and the computer with the lowest operation load is notified of the failed computer.

4. A multi-computer system recovery control device for a multi-computer system that is composed of a plurality of computers, that shares data necessary for executing business processing programs as shared data that can be updated and referenced from all the constituent computers, and in which the same business processing program can be executed on any of the computers, the device comprising: exclusive control means for performing, when a business processing program updates the shared data, exclusive control to prevent simultaneous updating by other business processing programs; update record management means for storing, when update processing on the shared data occurs, the data image before the update in an update record file; failure notification means for notifying, when a computer arises during execution of a business processing program on which execution of the business processing program can no longer continue, the computer with the lowest operation load at that time, among the normally operating computers other than that computer, of the failed computer; and recovery control means, provided for each of the computers, for reading from the update record file, when the corresponding computer is notified of the failed computer, the pre-update data image of the data that was being updated, restoring the data that was being updated to its pre-update state, and releasing the exclusive state of the shared data that has been placed under exclusive control and cannot be updated.

5. The multi-computer system recovery control device according to claim 4, further comprising load information recording means for obtaining operation load information indicating the operation load of each computer at fixed time intervals and recording it in a load information management file, wherein the failure notification means is configured to notify the computer with the lowest operation load indicated by the operation load information recorded in the load information management file, among the normally operating computers, of the failed computer.

6. The multi-computer system recovery control device as described, further comprising load state reference means for acquiring, when a computer arises on which execution of the business processing program can no longer continue, operation load information indicating the operation load of each normally operating computer other than that computer from those computers, and for selecting the computer with the lowest operation load based on the acquired operation load information, wherein the failure notification means is configured to notify the computer with the lowest operation load selected by the load state reference means of the failed computer.

7. A machine-readable recording medium on which is recorded a program for causing a computer to perform recovery control of a multi-computer system that is composed of a plurality of computers, that shares data necessary for executing business processing programs as shared data that can be updated and referenced from all the constituent computers, and that is provided with recovery control means, the program causing the computer to function as: exclusive control means for performing, when a business processing program updates the shared data, exclusive control to prevent simultaneous updating by other business processing programs; update record management means for storing, when update processing on the shared data occurs, the data image before the update in an update record file; and failure notification means for notifying, when a computer arises during execution of a business processing program on which execution of the business processing program can no longer continue, the computer with the lowest operation load at that time, among the normally operating computers other than that computer, of the failed computer.