JPH10133927A - Computer system and file managing method

Info

Publication number: JPH10133927A
Application number: JP9232930A
Authority: JP
Inventors: Hideaki Hirayama (秀昭平山); Toshio Shirokibara (敏雄白木原)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1996-09-03
Filing date: 1997-08-28
Publication date: 1998-05-22
Anticipated expiration: 2017-08-28
Also published as: JP4095139B2

Abstract

PROBLEM TO BE SOLVED: To provide a computer system that can roll back when a fault occurs without making file updates wait for the pre-update data to be saved. SOLUTION: When a write or similar operation is requested on a file, "file write information" is stored in an unconfirmed queue 431 and only a primary file 39 is updated immediately. After a checkpoint is taken, the "file write information" stored in the queue 431 is moved to a confirmed queue 432 and reflected in a backup file 41. At recovery time, based on the "file write information" stored in the queue 431, all pre-update data corresponding to data updated after the last checkpoint is read from the file 41, and the file 39 is restored to its state at the checkpoint using this pre-update data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer system and a file management method well suited to group computing, database processing, transaction processing, and similar workloads that require high reliability, for example in a network computing environment made up of multiple computers connected by a network.

[0002]

2. Description of the Related Art

Some systems provide recovery from failures by periodically capturing the state of a process executed by a CPU, such as its address space and context, together with the state of files (this is called a checkpoint); when a failure occurs, the system restores the state of the most recent checkpoint and restarts the process from that point. Such systems have long had difficulty with external input/output processing. When the process is re-executed from the last checkpoint after a failure, the address space of the process and the processor context can be restored, but the state of an external input device cannot easily be restored.

[0003] For example, because it is difficult to cancel a write to a file, a conventional system that writes to a file first reads and saves the data that is about to be overwritten, and only then writes the new data to the file.

[0004] FIG. 15 illustrates the mechanism of a conventional system that, because a write to a file is difficult to cancel, reads and saves the pre-write data before writing new data to the file.

[0005] In this example, the file consists of the 4 bytes of data "ABCD" when the checkpoint is taken at time t1, and at time t2 "X" is written to byte 1 of the file (1). Byte positions are counted from 0 here, so byte 1 holds "B". In the conventional scheme, before "X" is written to byte 1, the data "B" at byte 1 is read and saved (this saved data is also called an undo log) (2), and only afterwards is "X" written to byte 1 (3).

[0006] A failure then occurs at time t3, so the process is rolled back to the state of the last checkpoint (t1). Byte 1 of the file has been updated to "X" since checkpoint t1, but the state at checkpoint t1 is restored using the undo log recorded at the time of the update. The undo log becomes unnecessary at the next checkpoint and is discarded.
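
The conventional scheme of paragraphs [0005] and [0006] can be sketched roughly as follows. This is only an illustration under stated assumptions: the file is modeled as an in-memory bytearray, and the class and method names (UndoLoggedFile, write, checkpoint, rollback) are hypothetical and do not appear in the specification. The point it shows is that every write must first read the old data, which is the cost the invention sets out to remove.

```python
class UndoLoggedFile:
    """Conventional scheme: save the old bytes before every write (hypothetical names)."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.undo_log = []  # list of (offset, old_bytes), in write order

    def write(self, offset: int, new: bytes) -> None:
        # (2) read the pre-update data first; the caller waits for this step
        old = bytes(self.data[offset:offset + len(new)])
        self.undo_log.append((offset, old))
        # (3) only then perform the actual write
        self.data[offset:offset + len(new)] = new

    def checkpoint(self) -> None:
        # undo records are discarded once a new checkpoint exists
        self.undo_log.clear()

    def rollback(self) -> None:
        # undo in reverse order so that overlapping writes restore correctly
        while self.undo_log:
            offset, old = self.undo_log.pop()
            self.data[offset:offset + len(old)] = old


f = UndoLoggedFile(b"ABCD")  # state at checkpoint t1
f.write(1, b"X")             # t2: the file becomes "AXCD"
f.rollback()                 # t3: failure, back to the state at t1
```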

[0007] There are also systems built from two computers, in which one (the primary computer) serves as the active system and the other (the backup computer) as the standby system; when the primary computer fails, the backup computer takes over processing, which raises system availability. Taking checkpoints periodically in such a system, as described above, improves reliability further.

[0008]

PROBLEMS TO BE SOLVED BY THE INVENTION

As described above, a system with this kind of failure recovery (whether duplexed or not), which periodically captures checkpoints, that is, the state of a process's address space, context, and files, restores the state of the last checkpoint when a failure occurs, and restarts the process from that point, gains reliability. However, whenever a file is updated (for example, written), the pre-update data must first be read from the file before the update can proceed, and this lowers file update performance.

[0009] The present invention was made in view of these circumstances. Its object is to provide a computer system and a file management method that, in a system with failure recovery based on periodically taken checkpoints (restoring the state of the last checkpoint when a failure occurs and restarting the process from that point), make it unnecessary to read pre-update data from a file when updating it, and thereby substantially improve file update performance.

[0010]

MEANS FOR SOLVING THE PROBLEMS

A computer system according to the present invention is duplexed with two computers, an active computer and a standby computer, and periodically takes checkpoints for restarting interrupted processing, storing them on both the active and standby computers. The system is characterized by comprising: a file, updated by a process running on the active computer, that is duplexed across the active and standby computers; means for, when the process requests an update of the file, storing the update information on the standby computer, updating only the active file, and notifying the requester of completion as soon as that update finishes; and means for reflecting the update content indicated by the update information in the standby file after the checkpoint is taken.

[0011] In this computer system, when a process requests a file update, update information describing the update is captured and stored, only the file located on the active computer (the active file) is updated immediately, and the result is returned to the requesting process. After a checkpoint is taken, the update content indicated by the stored update information is reflected in the file located on the standby computer (the standby file).

[0012] When, for example, the process aborts, all pre-update data corresponding to the data updated since the last checkpoint is read from the standby file, based on the stored update information, and the active file is restored to its state at the checkpoint using that data.

[0013] In other words, this computer system can recover files after a failure without making normal processing wait, on each file update, for the pre-update data to be read and saved as in the conventional scheme. File update performance therefore improves dramatically without loss of reliability.

[0014] Instead of restoring the active file, it is also effective to re-execute the process from the checkpoint using the standby file, which reflects all the updates indicated by the update information stored before the last checkpoint. Processing can then continue even when a restart using the active file is impossible, for example because the active computer has failed, which improves system availability. In that case, allocating a new standby file on a third computer improves availability further.

[0015]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

First, the basic principle of the invention is described with reference to FIG. 1. As FIG. 1 shows, the computer system of the invention assumes a system made redundant with an active system 10 and a standby system 20. The operation of each is described below.

[0016] (Normal processing) (1) In the active system 10, the application program 11 issues a Write system call.

[0017] (2) The jacket routine 12 hooks the Write system call, issues the Write system call to the active operating system, and sends the Write request to the standby system 20. The Write request need not be sent to the standby system 20 immediately; it only has to be sent by the next checkpoint. The standby system 20 does not execute a received Write request immediately but first stores it in the unconfirmed queue 211.

[0018] (3) When checkpoint processing is requested, the active system 10 must finish sending all accumulated Write requests to the standby system 20.

[0019] (4) The standby system 20, for its part, moves the Write requests stored in the unconfirmed queue 211 to the confirmed queue 212.

[0020] (5) The Write requests moved to the confirmed queue 212 are processed one after another by the operating system of the standby system 20.

[0021] Thus a file update during normal processing never waits for the pre-update data to be read and saved.

[0022] (Rollback processing) (3)' When a failure or similar event occurs, rollback processing is requested of both the active system 10 and the standby system 20.

[0023] (4)' At this point, all Write requests remaining in the active system 10 are sent to the standby system 20. The Write requests stored in the standby unconfirmed queue 211 were all issued after the last checkpoint, so they are consulted to read the pre-update data from the standby file 23, and the active file 14 is rolled back using that data. Both the active file 14 and the standby file 23 are then in their state as of the last checkpoint.

[0024] (5)' The standby system 20 then cancels all Write requests remaining in the unconfirmed queue 211.

[0025] Processing can then be restarted from the checkpoint.
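
The normal processing steps (1) to (5) and the rollback steps (3)' to (5)' can be illustrated with a small in-memory model. This is a sketch under several assumptions, not the implementation of the specification: both files are bytearrays, a Write request is an (offset, data) pair, requests are forwarded to the standby side immediately rather than deferred, and writes never change the file length. The class and method names are hypothetical; only the reference numerals in the comments come from the text.

```python
class StandbySystem:
    """Standby system 20: holds the standby file and both queues."""

    def __init__(self, data: bytes):
        self.file = bytearray(data)  # standby file 23
        self.unconfirmed = []        # unconfirmed queue 211: (offset, data)
        self.confirmed = []          # confirmed queue 212

    def receive(self, offset: int, data: bytes) -> None:
        # (2) queue the request instead of executing it
        self.unconfirmed.append((offset, data))

    def checkpoint(self) -> None:
        # (4) requests issued before the checkpoint become confirmed
        self.confirmed.extend(self.unconfirmed)
        self.unconfirmed.clear()

    def apply_confirmed(self) -> None:
        # (5) replay confirmed requests against the standby file, in order
        for offset, data in self.confirmed:
            self.file[offset:offset + len(data)] = data
        self.confirmed.clear()


class ActiveSystem:
    """Active system 10: updates its own file without reading the old data."""

    def __init__(self, data: bytes, standby: StandbySystem):
        self.file = bytearray(data)  # active file 14
        self.standby = standby

    def write(self, offset: int, data: bytes) -> None:
        # forward the request (the text allows deferring this until the checkpoint)
        self.standby.receive(offset, data)
        # (1) update the active file; no read of the pre-update data is needed
        self.file[offset:offset + len(data)] = data

    def rollback(self) -> None:
        sb = self.standby
        sb.apply_confirmed()  # the standby file must first reach the checkpoint state
        # (4)' copy the pre-update data back from the standby file
        for offset, data in sb.unconfirmed:
            end = offset + len(data)
            self.file[offset:end] = sb.file[offset:end]
        sb.unconfirmed.clear()  # (5)' cancel the unconfirmed requests


standby = StandbySystem(b"ABCD")
active = ActiveSystem(b"ABCD", standby)
active.write(1, b"X")  # active file is now "AXCD"; standby file is still "ABCD"
active.rollback()      # failure before the next checkpoint
```

In this model the standby file plays the role that the undo log plays in the conventional scheme: it still holds the checkpoint-time data, so the write path never has to read and save the old bytes.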

[0026] Next, embodiments of the invention are described.

[0027] (First Embodiment) The first embodiment of the invention is described first. FIG. 2 shows the system configuration of a computer system according to the first embodiment. As FIG. 2 shows, the computer system of this embodiment is duplexed with a primary computer 30 and a backup computer 40, connected by a network 50. The primary computer 30 and the backup computer 40 each contain both the active system 10 and the standby system 20 described above; when the active system 10 runs on one of them, the standby system 20 runs on the other. The description here assumes that the active system 10 runs on the primary computer 30 and the standby system 20 on the backup computer 40.

[0028] The process 35 runs on the primary computer 30 and updates a file duplexed as a primary file 39 and a backup file 41. The primary file 39 resides on the primary computer 30 and the backup file 41 on the backup computer 40; they are updated through the file system 36 on the primary computer 30 and the file system 48 on the backup computer 40.

[0029] The file system 36 on the primary computer 30 includes a primary file operation unit 38 and a primary file restoration unit 37. The file system 48 on the backup computer 40 includes a backup file operation unit 43, an unconfirmed queue 431, a confirmed queue 432, a backup file update unit 44, and a primary file restoration information reading unit 42.

[0030] When the process 35 updates the duplexed file, it does so through the primary file operation unit 38 and the backup file operation unit 43. When the process 35 performs a write on the duplexed file, the primary file 39 is updated immediately, whereas the backup file 41 is not updated at that point; instead, "file write information" is passed through the backup file operation unit 43 and stored in the unconfirmed queue 431 on the backup computer 40.

[0031] When the process 35 takes a checkpoint, the checkpoint control unit 31 issues the instruction to the checkpoint information storage unit 32 and the primary file operation unit 38. On receiving the instruction, the checkpoint information storage unit 32 saves the checkpoint information (address space and processor context) on both the primary computer 30 and the backup computer 40 (as checkpoint information 34 on the primary computer 30 and checkpoint information 45 on the backup computer 40).

[0032] On receiving the checkpoint instruction, the primary file operation unit 38, through the backup file operation unit 43, moves the "file write information" stored in the unconfirmed queue 431 to the confirmed queue 432. After the checkpoint has been taken, the backup file update unit 44 uses the "file write information" in the confirmed queue 432 to update the backup file 41, and discards it once the backup file 41 has been updated. As a result, after the checkpoint, the same write operations that were performed on the primary file 39 are also performed on the backup file 41.

[0033] When the process 35 fails, for example by aborting, and is re-executed on the primary computer 30 from the last checkpoint, the address space and processor context are restored by the checkpoint information restoration unit 33 on the primary computer 30.

[0034] As for the file, the backup file 41 needs no restoration: updates made since the checkpoint exist only as "file write information" in the unconfirmed queue 431 and have not actually been applied. The primary file 39, however, has already been updated since the checkpoint and must be restored. Based on the "file write information" stored in the unconfirmed queue 431, the pre-update data of the primary file 39 is therefore read from the backup file 41 and written to the primary file 39 to restore it. The "file write information" stored in the unconfirmed queue 431 is then discarded. If "file write information" remains in the confirmed queue 432, the restoration starts only after that information has been fully reflected in the backup file 41.

[0035] When the primary computer 30, or the operating system controlling it, fails, for example through a system crash, and the process 35 is re-executed on the backup computer 40 from the last checkpoint, the address space and processor context are restored into the process 47 by the checkpoint information restoration unit 46.

[0036] As for the file, the backup file 41 needs no restoration, because updates made since the checkpoint exist only as "file write information" in the unconfirmed queue 431 and have not actually been applied.

[0037] The transfer of this "file write information" from the primary computer 30 to the backup computer 40 can be optimized. If the primary computer 30 did not go down when the failure occurred, the primary file 39 is restored and processing resumes from the checkpoint using the primary file 39. If the primary computer 30 did go down, processing resumes from the checkpoint using the backup file 41.

[0038] In the first case any "file write information" not yet transferred is still held on the primary computer 30, and in the second case it is not needed, because the backup file 41 does not reflect those updates. The "file write information" therefore need not be sent immediately from the primary file operation unit 38 to the backup file operation unit 43. It only has to be sent by the next checkpoint, so for transfer efficiency it can be accumulated in the primary file operation unit 38 and sent to the backup file operation unit 43 in a batch, triggered by events such as "a certain volume has accumulated", "a certain time has elapsed", or "a checkpoint has been requested".
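
The batched transfer described in paragraph [0038] could look roughly like the following sketch. All names (WriteInfoBuffer, send, max_bytes, max_age, on_checkpoint) and the default thresholds are hypothetical; the specification names only the three triggers. For simplicity the elapsed-time trigger is evaluated only when a new record is added, whereas a real system would presumably also need a timer.

```python
import time


class WriteInfoBuffer:
    """Accumulates file write information on the primary side and sends it in batches."""

    def __init__(self, send, max_bytes=4096, max_age=0.5, clock=time.monotonic):
        self.send = send            # callable that receives a list of (offset, data)
        self.max_bytes = max_bytes  # trigger: "a certain volume has accumulated"
        self.max_age = max_age      # trigger: "a certain time has elapsed" (seconds)
        self.clock = clock
        self.pending = []
        self.pending_bytes = 0
        self.oldest = None          # time at which the oldest pending record was added

    def add(self, offset: int, data: bytes) -> None:
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append((offset, data))
        self.pending_bytes += len(data)
        if (self.pending_bytes >= self.max_bytes
                or self.clock() - self.oldest >= self.max_age):
            self.flush()

    def on_checkpoint(self) -> None:
        # trigger: "a checkpoint has been requested"; everything must be sent by now
        self.flush()

    def flush(self) -> None:
        if self.pending:
            self.send(self.pending)
            self.pending = []
            self.pending_bytes = 0
            self.oldest = None


batches = []
buf = WriteInfoBuffer(batches.append, max_bytes=4)
buf.add(0, b"ab")    # held back: below the volume threshold
buf.add(2, b"cd")    # 4 bytes accumulated, so both records go out as one batch
```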

[0039] FIG. 3 shows the schematic configuration of a computer system to which this embodiment is applied. The computers are duplexed as the primary computer 30 and the backup computer 40; a disk device 60a is connected to the primary computer 30 and a disk device 60b to the backup computer 40. The process 35 runs on the primary computer, and the file it accesses is duplexed as the primary file 39 and the backup file 41, located on the disk device 60a and the disk device 60b respectively.

[0040] The checkpoint information is held on both the primary computer 30 side (primary checkpoint information 34) and the backup computer 40 side (backup checkpoint information 45). In this figure the checkpoints are held on the disk devices, but they may instead be held in memory.

[0041] If the primary computer 30, or the operating system controlling it, fails, for example through a system crash, the process 47 is re-executed on the backup computer 40 using the checkpoint information 45. In this case the process 47 uses the backup file 41.

[0042] It is also possible to keep multiple primary files 39 or backup files 41 and build a file system with triple or higher redundancy. For a triplicated file system, for example, possible combinations include (1) two primary files and one backup file, and (2) one primary file and two backup files.

[0043] FIG. 4 shows how a file is updated in this embodiment. In this example, the process 35 running on the primary computer 30 writes "X" at time t1 to byte 1 of a duplexed file holding the 4 bytes of data "ABCD" (the primary file 39 on the primary computer 30 and the backup file 41 on the backup computer 40) (1). The primary file 39 is updated immediately, whereas the backup file 41 is not; only the "file write information" is stored.

[0044] A checkpoint is then taken at time t2, which confirms the earlier "file write information" for execution (2). From time t2 onward, the backup file 41 is updated based on the confirmed "file write information".

[0045] FIG. 5 shows how the primary file is restored when a failure occurs in this embodiment. In this example, the process 35 running on the primary computer 30 writes "X" at time t1 to byte 1 of a duplexed file holding the 4 bytes of data "ABCD" (the primary file 39 on the primary computer 30 and the backup file 41 on the backup computer 40) (1). The primary file 39 is updated immediately, whereas the backup file 41 is not; only the "file write information" is stored.

[0046] A failure then occurs at time t2 (2). The primary file 39 has been updated by the "file write information" of time t1 and must be restored, whereas the backup file 41 has not yet been updated and needs no restoration. The "file write information" stored at time t1 identifies which part of the primary file 39 was updated. To restore the primary file 39, the data at the position indicated by the unconfirmed "file write information" is read from the backup file 41 and written to the primary file 39.

[0047] The process 35 is then re-executed on the primary computer 30 using the checkpoint taken on the primary computer 30. The re-executed process 35 uses the restored primary file 39.

[0048] FIG. 6 is a flowchart of the processing performed when the file operation unit is instructed to perform a "file write". First, the "file write information" is saved and linked to the unconfirmed queue 431 (step A1). Next, the primary file 39 is updated according to the "file write information" (step A2). At this point the "file write" operation is regarded as complete, and the requester is notified of completion (step A3).

[0049] FIG. 7 is a flowchart of the processing performed when the file operation unit is instructed to "take a checkpoint". The saved "file write information" is moved from the unconfirmed queue 431 to the confirmed queue 432 (step B1).

[0050] FIG. 8 is a flowchart of the processing of the backup file update unit. First, the unit checks whether any "file write information" is linked to the confirmed queue 432 (step C1). If none is linked (N at step C1), the backup file update unit 44 keeps checking. If some is linked (Y at step C1), the backup file 41 is updated based on the "file write information" linked to the confirmed queue 432 (step C2). The "file write information" that has been executed is then removed from the confirmed queue 432 (step C3).
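
Steps C1 to C3 amount to a simple drain loop over the confirmed queue. The sketch below models one pass of that loop; the function name update_backup_once, the use of collections.deque, and the (offset, data) record layout are assumptions made for illustration and are not taken from the specification.

```python
from collections import deque


def update_backup_once(confirmed: deque, backup_file: bytearray) -> bool:
    """Run one pass of steps C1 to C3; return False if the queue was empty."""
    if not confirmed:                              # C1: nothing is linked
        return False
    offset, data = confirmed[0]                    # look at the oldest record
    backup_file[offset:offset + len(data)] = data  # C2: update the backup file
    confirmed.popleft()                            # C3: unlink it after it is applied
    return True


confirmed = deque([(1, b"X"), (3, b"Z")])
backup = bytearray(b"ABCD")
while update_backup_once(confirmed, backup):
    pass  # a real update unit would keep polling instead of stopping
```

Removing a record only after it has been applied (C3 after C2) means that, in this model, a record that is applied twice still leaves the same bytes in the backup file, since writing fixed data at a fixed offset is idempotent.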

[0051] FIG. 9 is a flowchart of the processing performed when a failure such as an abort occurs in the process 35 and the process 35 is re-executed on the primary computer 30 from the last checkpoint.

[0052] When a failure occurs in the process 35, the checkpoint information restoration unit 33 on the primary computer 30 is first instructed to "restore the address space and processor context" (step D1). Next, the primary file restoration unit 37 is instructed to "restore the primary file" (step D2).

[0053] FIG. 10 is a flowchart of the processing performed when the checkpoint information restoration unit on the primary computer 30 is instructed to "restore the address space and processor context". First, the address space of the process 35 is restored (step E1). Next, the processor context of the process 35 as of the checkpoint is restored (step E2).

[0054] FIG. 11 is a flowchart showing the processing flow when the primary file restoring unit 37 is instructed to restore the primary file. First, the unit checks whether any "file write information" is linked to the unconfirmed queue 431 (step F1). If an entry is linked (Y in step F1), the unit follows that "file write information": it reads from the backup file 41 the data corresponding to the updated portion of the primary file 39 and writes the data it has read to the primary file 39, thereby restoring that portion (step F2). It then removes (discards) the "file write information" used for the restoration from the unconfirmed queue 431 (step F3). This processing is repeated until no "file write information" remains linked to the unconfirmed queue 431.
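The restore loop of steps F1 to F3 can be sketched as follows in Python. This is only an illustration under stated assumptions: both files are modeled as `bytearray` objects, each queue entry is a hypothetical `(offset, length)` tuple, writes are assumed not to change the file length, and the name `restore_primary` is not taken from the patent.

```python
from collections import deque

def restore_primary(unconfirmed_queue, primary, backup):
    """Roll the primary file back to the last checkpoint (steps F1-F3)."""
    restored = 0
    while unconfirmed_queue:                      # step F1: any entry linked?
        offset, length = unconfirmed_queue[0]
        # Step F2: the backup still holds the pre-update bytes for this range.
        primary[offset:offset + length] = backup[offset:offset + length]
        unconfirmed_queue.popleft()               # step F3: discard the entry
        restored += 1
    return restored

primary = bytearray(b"XBCZ")   # bytes 0 and 3 were overwritten after the checkpoint
backup = bytearray(b"ABCD")    # unchanged since the checkpoint
pending = deque([(0, 1), (3, 1)])
restore_primary(pending, primary, backup)
print(primary)  # bytearray(b'ABCD')
```

Only the offset and length are needed for the rollback, because the pre-update bytes come from the backup file rather than from a saved copy.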

[0055] When a failure such as a system crash occurs in the primary computer 30 or in the operating system that controls it, the process 35 is re-executed on the backup computer 40 from the most recently collected checkpoint. In this case, processing is taken over using the backup file 41. FIG. 12 shows how processing is taken over with the backup file 41 when such a failure occurs.

[0056] In this example, the process 35 running on the primary computer 30 operates on a duplicated file holding the 4 bytes of data "ABCD" (the primary file 39 on the primary computer 30 and the backup file 41 on the backup computer 40). At time t1 it writes "X" to the first byte (1). The primary file 39 is updated immediately, whereas the backup file 41 is not updated immediately; only the "file write information" is saved.

[0057] Subsequently, at time t2, a failure occurs in the primary computer 30 (2). In this case, the process 47 is re-executed on the backup computer 40 using the checkpoint taken on the backup computer 40. The process 47 continues processing with the backup file 41. Although the primary file 39 was updated at time t1, the backup file 41 has not yet been updated, so the backup file 41 can be used as it is when the process 47 is re-executed on the backup computer 40.
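The "ABCD" example can be traced in a few lines of Python. This is a sketch under assumptions: the files are modeled as `bytearray` objects, the "first byte" is taken to be offset 0, and the `(offset, data)` entry format is hypothetical. Dropping the unconfirmed entries on takeover follows the wording of the claims (update information stored since the last checkpoint is deleted).

```python
primary = bytearray(b"ABCD")   # primary file 39
backup = bytearray(b"ABCD")    # backup file 41
unconfirmed = []               # file write information not yet checkpointed

# (1) Time t1: write "X" to the first byte (offset 0 in this sketch).
primary[0:1] = b"X"
unconfirmed.append((0, b"X"))

# (2) Time t2: the primary computer fails before any checkpoint is taken.
unconfirmed.clear()            # post-checkpoint write information is dropped
takeover_view = bytes(backup)  # process 47 resumes with this content

print(bytes(primary), takeover_view)  # b'XBCD' b'ABCD'
```

Because the write at t1 never reached the backup file, the content seen by the process 47 already matches the checkpoint and no rollback step is needed.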

[0058] When the backup file is detached because of a failure, creating a new backup file afterwards reproduces the initial state shown in FIG. 1, so recovery remains possible if another failure occurs.

[0059] Likewise, when processing has been taken over with the backup file because of a failure and re-executed from the checkpoint, the backup file can then be treated as the primary file and a new backup file created. This again reproduces the initial state shown in FIG. 1, so recovery remains possible if another failure occurs. There are two ways to create the backup file again, as follows.

[0060] (1) The update information and data for the primary file after the backup file was detached are saved, and when the backup file is reconnected, those post-detachment updates are reflected in the backup file.

[0061] (2) The primary file is copied to the backup file. If the primary file continues to be updated during the copy, the file's update information and data are also reflected in the backup file from the moment the copy starts.

[0062] A method that combines these two, as follows, is also effective.

[0063] (3) On the assumption that the detached backup file (or the primary file from before the failure) will be reconnected, the update information and data for the primary file after the backup file was detached are saved until a fixed time has elapsed, so that method (1) remains available. Once the fixed time has elapsed, method (1) is abandoned, saving of the post-detachment update information and data is stopped, and method (2) is used instead. Similarly, when a file other than the detached backup file is connected, saving of the post-detachment update information and data is stopped and method (2) is used.
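The choice between methods (1) and (2), and the time limit of method (3), can be sketched in Python as below. Everything here is a hypothetical illustration: the function names, the `(offset, data)` log entries, and the use of `None` to mean "logging was stopped" are assumptions, and the sketch does not model updates that arrive while the full copy of method (2) is in progress.

```python
def resync_backup(primary, backup, update_log):
    """Bring a backup file up to date; update_log is None if logging was stopped."""
    if update_log is not None:
        # Method (1): replay only the updates made after detachment.
        for offset, data in update_log:
            backup[offset:offset + len(data)] = data
        return "replayed"
    # Method (2): no usable log, so copy the whole primary file.
    backup[:] = primary
    return "copied"

def log_or_give_up(update_log, elapsed, limit, offset, data):
    """Method (3): keep logging until `limit` has passed, then drop the log."""
    if update_log is None or elapsed > limit:
        return None
    update_log.append((offset, data))
    return update_log

log = log_or_give_up([], elapsed=5, limit=60, offset=0, data=b"X")
primary, backup = bytearray(b"XBCD"), bytearray(b"ABCD")
print(resync_backup(primary, backup, log), bytes(backup))  # replayed b'XBCD'
```

Bounding the logging period keeps the saved update information from growing without limit when the detached file is never reconnected, at the cost of a full copy later.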

[0064] (Second Embodiment) Next, a second embodiment of the present invention will be described. The first embodiment described a duplicated computer system, but the invention is also effective when applied to a file system on a computer that is not duplicated, and this embodiment describes that case. FIG. 13 is a configuration diagram for this case. In this system the computer is not duplicated; only the computer 30 exists. The process 35 runs on the computer 30 and updates a file that is duplicated as the primary file 39 and the backup file 41. That is, the primary file 39 and the backup file 41 are both located on the computer 30 and are updated through the file system 36.

[0065] The file system 36 on the computer 30 includes the primary file operation unit 38, the primary file restoring unit 37, the backup file operation unit 43, the unconfirmed queue 431, the confirmed queue 432, the backup file update unit 44, and the primary file restoration information reading unit 42.

[0066] When the process 35 updates the duplicated file, it does so via the primary file operation unit 38 and the backup file operation unit 43. When the process 35 writes to the duplicated file, the primary file 39 is updated directly, whereas the backup file 41 is not updated; instead, the "file write information" is stored in the unconfirmed queue 431 via the backup file operation unit 43.

[0067] When the process 35 collects a checkpoint, the checkpoint control unit 31 issues instructions to the checkpoint information storage unit 32 and the primary file operation unit 38. On receiving the checkpoint collection instruction, the checkpoint information storage unit 32 saves the address space and the processor context on the computer 30 (checkpoint information 34).

[0068] The primary file operation unit 38, on receiving the checkpoint collection instruction, moves the "file write information" stored in the unconfirmed queue 431 to the confirmed queue 432 via the backup file operation unit 43. After the checkpoint has been collected, the backup file update unit 44 uses the "file write information" moved to the confirmed queue 432 to update the backup file 41, and the information is discarded once the backup file 41 has been updated. As a result, after the checkpoint the same write operations that were performed on the primary file 39 are performed on the backup file 41.
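The write path and the checkpoint hand-off between the two queues can be sketched together in Python. This is a simplified model under assumptions: the class name `DuplexedFile`, the `bytearray` file model, and the `(offset, data)` entries are all hypothetical, and the backup update runs synchronously here rather than as a separate unit.

```python
from collections import deque

class DuplexedFile:
    """Toy model of a primary/backup file pair with two write-information queues."""

    def __init__(self, content):
        self.primary = bytearray(content)   # primary file 39
        self.backup = bytearray(content)    # backup file 41
        self.unconfirmed = deque()          # unconfirmed queue 431
        self.confirmed = deque()            # confirmed queue 432

    def write(self, offset, data):
        # Update the primary immediately; only record the write for the backup.
        self.primary[offset:offset + len(data)] = data
        self.unconfirmed.append((offset, data))

    def checkpoint(self):
        # Writes made before the checkpoint become safe to apply to the backup.
        self.confirmed.extend(self.unconfirmed)
        self.unconfirmed.clear()

    def update_backup(self):
        # Apply and discard confirmed entries in the order they were written.
        while self.confirmed:
            offset, data = self.confirmed.popleft()
            self.backup[offset:offset + len(data)] = data

f = DuplexedFile(b"ABCD")
f.write(0, b"X")
before = bytes(f.backup)        # the write has not been confirmed yet
f.checkpoint()
f.update_backup()
print(before, bytes(f.backup))  # b'ABCD' b'XBCD'
```

Keeping the backup one checkpoint behind is what lets it double as the source of pre-update data during a rollback.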

[0069] When a failure such as an abort occurs in the process 35 and the process 35 is re-executed on the computer 30 from the most recently collected checkpoint, the address space and the processor context are restored by the checkpoint information restoring unit 33 on the computer 30.

[0070] As for the files, the backup file 41 needs no restoration: updates made since the checkpoint exist only as "file write information" in the unconfirmed queue 431 and have not actually been applied to it. The primary file 39, however, has already been updated since the checkpoint and must be restored. Therefore, on the basis of the "file write information" stored in the unconfirmed queue 431, the pre-update data of the primary file 39 is read from the backup file 41 and written to the primary file 39 to restore it. The "file write information" stored in the unconfirmed queue 431 is then discarded. If "file write information" remains in the confirmed queue 432, the restoration described above starts only after that information has been fully reflected in the backup file 41.
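The ordering constraint in the last sentence (drain the confirmed queue before restoring) can be illustrated with a short Python sketch. The function name `recover`, the `bytearray` file model, and the `(offset, data)` entries are assumptions made for illustration only.

```python
from collections import deque

def recover(primary, backup, unconfirmed, confirmed):
    """Restore the primary file to the last checkpoint after a process abort."""
    # First finish reflecting confirmed writes, so that the backup file
    # holds exactly the checkpoint-time content.
    while confirmed:
        offset, data = confirmed.popleft()
        backup[offset:offset + len(data)] = data
    # Then undo every post-checkpoint write by copying the same range
    # back from the backup, discarding each entry as it is used.
    while unconfirmed:
        offset, data = unconfirmed.popleft()
        end = offset + len(data)
        primary[offset:end] = backup[offset:end]

primary = bytearray(b"YBCD")          # "X" and then "Y" were written to byte 0
backup = bytearray(b"ABCD")           # "X" has not been reflected yet
confirmed = deque([(0, b"X")])        # written before the checkpoint
unconfirmed = deque([(0, b"Y")])      # written after the checkpoint
recover(primary, backup, unconfirmed, confirmed)
print(bytes(primary), bytes(backup))  # b'XBCD' b'XBCD'
```

If the restore ran before the confirmed entry was applied, byte 0 of the primary would be rolled back to "A", which is older than the checkpoint state.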

[0071] FIG. 14 shows a schematic configuration of a computer system to which this embodiment is applied. The system of this embodiment runs on the computer 30 alone and is not duplicated. A disk device 60a and a disk device 60b are connected to the computer 30. The process 35 is executed on the computer 30, and the file accessed by the process 35 is duplicated as the primary file 39 and the backup file 41, which are placed on the disk device 60a and the disk device 60b, respectively.

[0072] Thus, consider a system that guards against failures by periodically saving state such as the process address space and the processor context (checkpoint information) while execution continues, and by re-executing the process from the last saved checkpoint when a failure occurs. By applying the present invention to such a system, it is no longer necessary to first read the pre-update data from the file when updating it, so file update performance is greatly improved.

[0073] The file management method described in the above embodiments can be stored and distributed, as a program executable by a computer, on a recording medium such as a floppy disk, an optical disk, or a semiconductor memory.

【００７４】[0074]

[Effects of the Invention] As described in detail above, according to the present invention, when a process requests a file update, update information indicating the content of the update is acquired and saved, and only the primary file is updated immediately. After a checkpoint is collected, the update content indicated by the saved update information is reflected in the backup file. Then, for example when the process aborts, all pre-update data corresponding to the data updated since the last collected checkpoint is read from the backup file on the basis of the saved update information, the primary file is restored to its state at the checkpoint using that data, and re-execution of the process is started (re-execution of the process using the backup file is also possible).

[0075] In other words, in this computer system, file recovery after a failure is achieved without making normal processing wait for the completion of reading and saving the pre-update data each time a file is updated, as was done conventionally. File update performance can therefore be improved dramatically without sacrificing reliability.

[Brief description of the drawings]

FIG. 1 is a conceptual diagram for explaining the basic principle of the present invention.

FIG. 2 is a diagram showing the system configuration of a computer system according to the first embodiment of the present invention.

FIG. 3 is a diagram showing a schematic configuration of a computer system to which the first embodiment is applied.

FIG. 4 is a diagram showing how a file is updated in the first embodiment.

FIG. 5 is a diagram showing how the primary file is restored when a failure occurs in the first embodiment.

FIG. 6 is a flowchart showing the processing flow when the file operation unit of the first embodiment is instructed to perform a "file write".

FIG. 7 is a flowchart showing the processing flow when the file operation unit of the first embodiment is instructed to perform "checkpoint collection".

FIG. 8 is a flowchart showing the processing flow of the backup file update unit of the first embodiment.

FIG. 9 is a flowchart showing the processing flow when a failure such as an abort occurs in the process of the first embodiment and the process is re-executed on the primary computer 30 from the most recently collected checkpoint.

FIG. 10 is a flowchart showing the processing flow when the checkpoint information restoring unit on the primary computer of the first embodiment is instructed to restore the address space and the processor context.

FIG. 11 is a flowchart showing the processing flow when the primary file restoring unit of the first embodiment is instructed to restore the primary file.

FIG. 12 is a diagram showing how processing is taken over with the backup file when a failure occurs in the first embodiment.

FIG. 13 is a diagram showing the system configuration of a computer system according to the second embodiment of the present invention.

FIG. 14 is a diagram showing a schematic configuration of a computer system to which the second embodiment is applied.

FIG. 15 is a diagram explaining the mechanism of a conventional system in which, because it is difficult to cancel a write already made to a file, the data present before the write is read and saved in advance, and only then is the new data written to the file.

[Explanation of symbols]

10: active system, 11: application program, 12: jacket routine, 13: OS buffer cache, 14: disk device, 20: standby system, 21: daemon, 211: unconfirmed queue, 212: confirmed queue, 22: OS buffer cache, 23: disk device, 30: primary computer, 31: checkpoint control unit, 32: checkpoint information storage unit, 33: checkpoint information restoring unit, 34: checkpoint information, 35: process, 36: file system, 37: primary file restoring unit, 38: primary file operation unit, 39: primary file, 40: backup computer, 41: backup file, 42: primary file restoration information reading unit, 43: backup file operation unit, 431: unconfirmed queue, 432: confirmed queue, 44: backup file update unit, 45: checkpoint information, 46: checkpoint information restoring unit, 47: process, 50: network, 60a, 60b: disk devices.

Claims

[Claims]

1. A computer system that is duplexed by two computers, an active system and a standby system, and that periodically collects checkpoints for restarting interrupted processing and stores them on both the active and standby computers, wherein a file updated by a process executed on the active computer is provided in duplicate on both the active computer and the standby computer, the computer system comprising: means for, when the process instructs an update of the file, storing update information on the standby computer, updating only the active file, and notifying the update requester of completion when that update is finished; and means for reflecting the update content indicated by the update information in the standby file after the checkpoint is collected.

2. The computer system according to claim 1, further comprising means for buffering the update information on the active computer and transferring it to the standby computer by the time the checkpoint is collected.

3. The computer system according to claim 1, further comprising means for, when the process aborts, reading from the standby file, on the basis of the update information, the pre-update data for file updates executed since the last checkpoint, restoring the active file to its state at the time of that checkpoint, and then re-executing the process from the checkpoint.

4. The computer system according to claim 1, further comprising means for, when the process aborts, deleting the update information stored since the last checkpoint, reflecting the updates indicated by the update information from before the checkpoint in the standby file, and then re-executing the process from the checkpoint on the standby computer.

5. The computer system according to claim 1 or 2, further comprising means for, when a failure occurs in the active computer or in the operating system controlling the active computer, deleting the update information stored since the last checkpoint, reflecting the updates indicated by the update information from before the checkpoint in the standby file, and then re-executing the process from the checkpoint on the standby computer.

6. The computer system according to claim 1, further comprising means for stopping the transfer of the checkpoint and the update information to the standby computer when a failure occurs in the standby computer or in the operating system controlling the standby computer.

7. The computer system according to claim 1, further comprising means for, when a failure occurs in the active file, deleting the update information stored since the last checkpoint, reflecting the updates indicated by the update information from before the checkpoint in the standby file, and then re-executing the process from the checkpoint on the standby computer.

8. The computer system according to claim 1 or 2, further comprising means for stopping the transfer of the checkpoint and the update information to the standby computer when a failure occurs in the standby file.

9. The computer system according to claim 1, further comprising means for newly securing a standby file on a third computer when the standby file has been detached.

10. The computer system according to claim 1, further comprising means for, when the process is re-executed from the checkpoint using the standby file, switching that standby file to be the active file and securing a new standby file on the computer that was the active computer.

11. A file management method for a computer system that is duplexed by two computers, an active system and a standby system, that periodically collects checkpoints for restarting interrupted processing and stores them on both the active and standby computers, and in which a file updated by a process executed on the active computer is provided in duplicate on both the active computer and the standby computer, the method comprising: a step of, when the process instructs an update of the file, storing update information on the standby computer, updating only the active file, and notifying the update requester of completion when that update is finished; and a step of reflecting the update content indicated by the update information in the standby file after the checkpoint is collected.

12. The file management method according to claim 11, further comprising a step of reading from the standby file, on the basis of the update information, the pre-update data for file updates executed since the last checkpoint, restoring the active file to its state at the time of that checkpoint, and then re-executing the process from the checkpoint.

13. The file management method according to claim 11, further comprising a step of deleting the update information stored since the last checkpoint, reflecting the updates indicated by the update information from before the checkpoint in the standby file, and then re-executing the process from the checkpoint on the standby computer.

14. A computer-readable storage medium storing a program for managing files of a computer system that is duplexed by two computers, an active system and a standby system, that periodically collects checkpoints for restarting interrupted processing and stores them on both the active and standby computers, and in which a file updated by a process executed on the active computer is provided in duplicate on both the active computer and the standby computer, the program causing the computer to operate so as to: when the process instructs an update of the file, store update information on the standby computer, update only the active file, and notify the update requester of completion when that update is finished; and, after the checkpoint is collected, reflect the update content indicated by the update information in the standby file.

15. The computer-readable storage medium according to claim 14, wherein the program further causes the computer to read from the standby file, on the basis of the update information, the pre-update data for file updates executed since the last checkpoint, to restore the active file to its state at the time of that checkpoint, and then to re-execute the process from the checkpoint.

16. The computer-readable storage medium according to claim 14, wherein the program further causes the computer to delete the update information stored since the last checkpoint, to reflect the updates indicated by the update information from before the checkpoint in the standby file, and then to re-execute the process from the checkpoint on the standby computer.