JP2003280939A

JP2003280939A - Process pair execution control method and process pair execution control program in fault tolerant system, and fault tolerant system

Info

Publication number: JP2003280939A
Application number: JP2002084321A
Authority: JP
Inventors: Hideaki Hirayama; 秀昭平山
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-03-25
Filing date: 2002-03-25
Publication date: 2003-10-03
Anticipated expiration: 2022-03-25
Also published as: JP3708891B2

Abstract

PROBLEM TO BE SOLVED: To make an open system usable without requiring twice the CPU resources.

SOLUTION: Each time a checkpoint collection time arrives, a checkpoint collection unit 15 reads out (D) the state of a primary process 11 and copies it (E) to a backup process 12. If the backup process 12 is in the running state, the checkpoint collection unit 15 stops it by issuing an instruction F to a backup process execution state control unit 17. When a system call is issued from the primary process 11 while the backup process 12 is stopped, a process issuing system call detection unit 19 issues an instruction G to the backup process execution state control unit 17, so that the backup process 12 is resumed from the most recently collected checkpoint and placed in the running state.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a fault-tolerant technique that allows a process to continue executing on another computer even if the computer executing it fails. In particular, it relates to a process pair execution control method and a process pair execution control program in a fault-tolerant system, and to a fault-tolerant system, that are well suited to applying this technique to scientific and engineering computation programs such as CAD and simulation.

[0002]

2. Description of the Related Art

The process pair method is known as a representative fault-tolerant technique that allows a process to continue executing on another computer even if the computer executing it fails.

[0003] In the process pair method, a process is made up of two processes, a primary process and a backup process, and the two are placed on different computers. Conventionally, this process pair method has taken the two forms described below as the first and second methods.

[0004] (1) First method

The first method is described in the document "The Process Group Approach to Reliable Distributed Computing," K. Birman, Technical Report, Computer Science Department, Cornell University, July 1991 (hereinafter referred to as the first document).

[0005] In the first document, a process pair is called a process group, and processing is replicated across two or more processes. Here, the form of the process pair method in which the number of processes is limited to two is referred to as the first method.

[0006] FIG. 9 is a diagram for explaining the first method. As shown in FIG. 9, in the first method one process is configured as a process pair 91 consisting of a primary process and a backup process. When the process pair 91 communicates with another process pair 92, inter-process pair communication 93 to 96 and so on is performed. Inter-process pair communication provides a function by which messages are passed from the sending-side primary and backup processes to the receiving-side primary and backup processes in a consistent state.

[0007] Sending and receiving a message in a consistent state means that the primary process and the backup process both send or receive exactly one copy of the message. Put the other way around, it means the pair never reaches a state in which, for example, only the primary process has received a message and the backup process has not.

[0008] In the first method, even if the computer running the primary process fails at the time indicated by fault1 or fault2 in FIG. 9, the backup process running on the other computer continues the processing and takes over the role of the primary process. The process pair 91 as a whole can therefore continue processing.

[0009] In the first method shown in FIG. 9, exactly the same processing is executed by two processes (the primary process and the backup process), so twice the CPU resources are required.

[0010] (2) Second method

The second method is described in the document "Fault Tolerant System," by Gray et al., edited and translated by Eiichi Watanabe, McGraw-Hill Publishing (hereinafter referred to as the second document). In the second document, a process pair is simply called a process pair.

[0011] FIG. 10 is a diagram for explaining the second method. As shown in FIG. 10, in the second method one process is configured as a process pair 101 consisting of a primary process and a backup process. When the process pair 101 communicates with another process pair 102, inter-process pair communication is performed.

[0012] Inter-process pair communication provides a function by which messages are passed from the sending-side primary and backup processes to the receiving-side primary and backup processes in a consistent state.

[0013] In the first method, the backup process executes the same processing as the primary process. In the second method, by contrast, the backup process exists as a process but does not execute any actual processing; instead, at each checkpoint collection time (ckp1, ckp2, ckp3, ckp4), the state of the primary process is copied to the backup process.

[0014] In the second method, if the computer running the primary process fails at the time indicated by fault1 or fault2 in FIG. 10, the backup process on the other computer resumes processing from the most recently collected checkpoint, indicated by restart1 or restart2 respectively, and takes over the role of the primary process. The process pair as a whole can therefore continue processing.

[0015] Unlike the first method, the second method does not execute exactly the same processing in both the primary process and the backup process. The second method therefore does not require twice the CPU resources.

[0016]

PROBLEMS TO BE SOLVED BY THE INVENTION

In the conventional fault-tolerant technique called the process pair method described above, for example in the first method, even if the computer running the primary process fails, the backup process running on the other computer continues the processing and takes over the role of the primary process, so the process pair as a whole can continue processing. In the first method, however, exactly the same processing is executed by two processes (the primary process and the backup process), so there is the problem that twice the CPU resources are required.

[0017] The second method, unlike the first, does not execute exactly the same processing in two processes and so does not require twice the CPU resources. The second method, however, has a different problem, described below.

[0018] In the second method, the state of the primary process is copied to the backup process so that the backup process, which is not actually executing (that is, it remains stopped unless a failure occurs on the primary process side), can be resumed from the point of checkpoint collection. For the address space and the context, this copy is sufficient.

[0019] However, because the backup process in the second method remains stopped, merely copying the state of the primary process to the backup process cannot restore the state of the services that the process has obtained from the OS (operating system) by executing system calls. This state includes, for example, which files are open under which descriptors, and their seek pointers. The second method therefore adopts a proprietary OS that has functions for saving and restoring the state of such services.

[0020] As a result, the second method cannot use the open systems widely used in industry. Every application has to be developed specially, and productivity suffers.

[0021] The present invention has been made in view of the above circumstances. Its object is to provide a process pair execution control method and a process pair execution control program in a fault-tolerant system, and a fault-tolerant system, that allow processing to continue when a failure occurs, do not require twice the CPU resources, and can use an open system.

[0022]

MEANS FOR SOLVING THE PROBLEMS

According to one aspect of the present invention, there is provided a process pair execution control method in a fault-tolerant system that executes a process pair, made up of a primary process and a backup process, capable of continuing processing even when a failure occurs. The method comprises: a step of putting both the primary process and the backup process into the running state when the process pair is started; a step of copying the state of the primary process to the backup process each time a checkpoint collection time arrives; a step of putting the backup process into the stopped state if it is in the running state at the time of checkpoint collection; and a step of, when a system call is issued from the process pair while the backup process is in the stopped state, resuming the backup process from the most recently collected checkpoint (the last checkpoint) and putting it into the running state.
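The four steps above can be illustrated with a minimal Python sketch. This is not the patent's implementation: the class `ProcessPair`, its field names, and the use of a plain dictionary to stand in for the process state (address space and context) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"


@dataclass
class ProcessPair:
    """Illustrative model only; all field and method names are hypothetical."""
    primary: State = State.STOPPED
    backup: State = State.STOPPED
    primary_state: dict = field(default_factory=dict)  # stands in for address space and context
    checkpoint: dict = field(default_factory=dict)     # latest copy held for the backup

    def start(self):
        # Step 1: at start-up both processes are put into the running state.
        self.primary = State.RUNNING
        self.backup = State.RUNNING

    def on_checkpoint(self):
        # Step 2: copy the primary's state to the backup.
        self.checkpoint = dict(self.primary_state)
        # Step 3: a backup that is running at checkpoint collection is stopped.
        if self.backup is State.RUNNING:
            self.backup = State.STOPPED

    def on_system_call(self):
        # Step 4: a stopped backup resumes from the last checkpoint.
        if self.backup is State.STOPPED:
            self.backup = State.RUNNING
            return dict(self.checkpoint)  # state the backup resumes from
        return None  # backup was already running; nothing to resume


pair = ProcessPair()
pair.start()
pair.primary_state["iteration"] = 10
pair.on_checkpoint()             # backup is now stopped
pair.primary_state["iteration"] = 11
resumed_from = pair.on_system_call()
print(pair.backup, resumed_from)
```

In this sketch the backup resumes from the state captured at the checkpoint, not from the primary's current state, which mirrors the "last checkpoint" wording of the final step.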

[0023] In the process pair execution control method according to the first aspect of the present invention, the primary process and the backup process both operate in the running state during the period immediately after the program (process pair) starts. Both processes are therefore in a state of receiving services from the OS. After that, the arrival of the first checkpoint collection time puts the backup process into the stopped state, which suppresses CPU resource consumption on the backup process side. If a system call is issued while the backup process is stopped, the backup process resumes processing from the last checkpoint. In other words, the backup process also returns to the running state, and the primary process and the backup process are once again both receiving services from the OS.

[0024] Thus, in the process pair execution control method according to the first aspect of the present invention, the backup process, unlike in the second method described in the related art section, is in the running state during the period immediately after the program starts and, thereafter, whenever a system call is issued while it is stopped. The primary process, as in the second method, stays in the running state until the program ends unless a failure occurs. That is, in the method according to the first aspect, unlike the first method, the primary process and the backup process do not constantly execute exactly the same processing; and unlike the second method, the backup process is not permanently stopped. There are periods in which both processes are in the running state, and during those periods both processes receive services from the OS.

[0025] Consequently, the process pair execution control method according to the first aspect of the present invention does not need a proprietary OS with functions for saving and restoring the state of services received from the OS, as the second method does, and can use the open systems widely used in industry. Moreover, because there are periods in which the backup process is stopped, twice the CPU resources are not required. The method is also particularly suited to scientific and engineering computation programs such as CAD and simulation, for the following reason. Programs of this kind typically begin by reading input data, which involves issuing system calls, and then repeat CPU computation without issuing system calls; the periods of CPU computation are far longer than the periods that involve system calls. In the method according to the first aspect, the primary process and the backup process are therefore both kept in the running state only immediately after the program starts and, thereafter, during periods that involve issuing system calls, while the backup process is kept stopped during the long periods of CPU computation. This greatly reduces the time for which CPU resources are spent executing the backup process, and makes it possible to save and restore the state of services received from the OS even with a general-purpose OS.
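The CPU saving argued above can be made concrete with a rough back-of-the-envelope calculation. The following sketch is a simplification under stated assumptions: the workload figures are hypothetical, the function name `backup_running_fraction` is invented here, and the cost of copying checkpoints and of re-executing from the last checkpoint after a resume is ignored.

```python
def backup_running_fraction(phases):
    """Fraction of total run time during which the backup process is running.

    `phases` is a list of (duration, issues_system_calls) tuples. The backup
    is assumed to run only during phases that issue system calls (start-up
    included) and to be stopped during pure computation phases.
    """
    total = sum(duration for duration, _ in phases)
    running = sum(duration for duration, syscalls in phases if syscalls)
    return running / total


# Hypothetical workload: 10 s of input, 980 s of computation, 10 s of output.
phases = [(10, True), (980, False), (10, True)]
fraction = backup_running_fraction(phases)
print(f"backup runs for {fraction:.0%} of the job")
```

Under these assumed figures the backup consumes CPU for only a small share of the job, whereas the first method would keep it running throughout.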

[0026] Here, if a step is added that, when a failure occurs on the primary process side while the backup process is stopped, resumes the backup process from the most recently collected checkpoint and puts it into the running state, then the backup process can continue the processing even if the failure on the primary process side occurs during a period in which the backup process is stopped.

[0027] Further, if a step is added that sets a checkpoint collection time (first checkpoint collection time) at the timing immediately before the backup process needs to be switched from the stopped state to the running state, for example immediately before a system call is issued from the primary process, then when the primary process subsequently issues the system call and the backup process is resumed, the time required for processing after the resumption can be shortened.

[0028] Further, if a step is added that sets a checkpoint collection time (second checkpoint collection time) at the timing immediately after it is no longer necessary to keep the backup process in the running state, the period for which the backup process is running can be kept to the necessary minimum, which prevents CPU resources from being spent needlessly on executing the backup process.

[0029] Further, if a step is added that sets checkpoint collection times (third checkpoint collection times) at predetermined intervals during the periods in which the backup process is stopped, excluding the period from a first checkpoint collection time to the next second checkpoint collection time, the following wasteful, delay-inducing behavior can be prevented: a checkpoint collection operation runs while both the primary process and the backup process are running, the backup process is stopped each time, and a subsequent system call in the primary process then resumes the backup process from the last checkpoint. This improves processing efficiency and allows checkpoints to be collected efficiently.
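The three kinds of checkpoint collection time described in the preceding paragraphs can be sketched as a small scheduling function. This is an illustration under assumptions, not the patent's procedure: the function name, the idea of passing in the windows during which the backup must run as a known list, and the integer time units are all hypothetical.

```python
def checkpoint_times(running_windows, end_time, interval):
    """Return (time, kind) pairs for first, second and third checkpoint times.

    `running_windows` lists (start, stop) periods during which the backup has
    to be running, e.g. phases in which the primary issues system calls. A
    window starting at time 0 models program start-up, where both processes
    are already running and no "first" checkpoint is needed.
    """
    times = []
    cursor = 0  # time from which the backup is (or will be) stopped
    for start, stop in running_windows:
        # Third timing: periodic checkpoints only while the backup is stopped.
        t = cursor + interval
        while t < start:
            times.append((t, "third"))
            t += interval
        if start > 0:
            # First timing: immediately before the backup must start running.
            times.append((start, "first"))
        # Second timing: immediately after the backup no longer needs to run.
        times.append((stop, "second"))
        cursor = stop
    t = cursor + interval
    while t < end_time:
        times.append((t, "third"))
        t += interval
    return times


# Hypothetical job: start-up until t=5, one system call phase from 100 to 110.
for when, kind in checkpoint_times([(0, 5), (100, 110)], end_time=200, interval=30):
    print(when, kind)
```

No periodic checkpoint falls inside a window in which both processes are running, which is the wasteful case the third timing is meant to avoid.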

[0030] The present invention relating to the process pair execution control method described above also holds as an invention relating to a program (process pair execution control program) for causing a computer to execute the steps constituting the method, and as an invention relating to a fault-tolerant system that executes the method.

[0031]

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiments of the present invention will be described below with reference to the drawings.

[0032] FIG. 1 is a block diagram showing the configuration of a fault-tolerant system according to an embodiment of the present invention. In FIG. 1, computers 1a and 1b are interconnected by a network 2. A primary process 11 is placed on the computer 1a (on its storage device, not shown), and a backup process 12 is placed on the computer 1b (on its storage device, not shown). That is, the pair of computers 1a and 1b realizes a process pair 13, made up of the primary process 11 and the backup process 12, that can continue processing even when a failure occurs.

[0033] Each of the computers 1a and 1b has the following functional elements: an inter-process pair communication unit 14, a checkpoint collection unit 15, a process restart unit 16, a backup process execution state control unit 17, a backup process execution state management unit 18, and a process issuing system call detection unit 19. These units 14 to 19 are realized by a library 20 that stores function programs and the like used by the primary process 11 or the backup process 12.

[0034] FIG. 2 is a functional block diagram showing the relationship between the process pair 13, made up of the primary process 11 and the backup process 12, and the units 14 to 19 realized by the library 20.

[0035] The inter-process pair communication unit 14 handles communication between the process pair 13, made up of the primary process 11 and the backup process 12 on the computers 1a and 1b, and other process pairs. When the primary process 11 and the backup process 12 are both alive (present), the inter-process pair communication unit 14 arbitrates between the two processes and sends only one message to the other process pair. Conversely, when a message arrives from another process pair and the primary process 11 and the backup process 12 are both alive, the inter-process pair communication unit 14 delivers the message to both processes.
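The arbitration described above can be sketched as follows. The text does not say how the two processes are arbitrated, so the per-message sequence number used here to recognize the duplicate send is purely an assumption, as are the class and method names.

```python
class InterPairCommunicator:
    """Hypothetical stand-in for the inter-process pair communication unit 14."""

    def __init__(self):
        self.alive = {"primary": True, "backup": True}
        self.inbox = {"primary": [], "backup": []}
        self.outgoing = []        # messages actually sent to the other pair
        self._forwarded = set()   # sequence numbers already sent once

    def send(self, role, seq, payload):
        # Both processes may try to send the same message; forward it once.
        if not self.alive[role] or seq in self._forwarded:
            return False
        self._forwarded.add(seq)
        self.outgoing.append(payload)
        return True

    def receive(self, payload):
        # Deliver an incoming message to every process that is alive.
        for role, is_alive in self.alive.items():
            if is_alive:
                self.inbox[role].append(payload)


comm = InterPairCommunicator()
comm.send("primary", 1, "result")
comm.send("backup", 1, "result")   # same message from the backup; not re-sent
comm.receive("request")
print(comm.outgoing, comm.inbox)
```

Whichever process sends first wins in this sketch; a real implementation would also have to coordinate the two computers over the network, which is not modeled here.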

[0036] The checkpoint collection unit 15 is activated when a checkpoint collection time arrives. Here, it is activated either by an instruction for checkpoint collection issued as a system call from the primary process 11 (checkpoint collection instruction) A, or by a periodic interrupt from a timer TM (checkpoint collection interrupt) B. When the state of the primary process 11 managed by the backup process execution state management unit 18 shows that the primary process 11 is alive, the checkpoint collection unit 15 performs a checkpoint collection operation: it issues an instruction C to the primary process 11, reads out (D) the state of the primary process 11, and copies it (E) to the backup process 12. The state of a process here means its address space and context.

[0037] The checkpoint collection unit 15 also uses the backup process execution state management unit 18 to check the state of the backup process 12 (running or stopped). If the backup process 12 is running, the checkpoint collection unit 15 issues an instruction F to the backup process execution state control unit 17 to stop the execution of the backup process 12.
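The behavior of the checkpoint collection unit 15 can be sketched as below. The classes standing in for units 17 and 18, the dictionary used as process state, and the use of `copy.deepcopy` to model the read-out D and copy E are assumptions for illustration. The sketch requires both processes to be alive before copying, in line with step S1 of the flowchart described later in the text.

```python
import copy


class StateManager:
    """Hypothetical stand-in for the backup process execution state management unit 18."""

    def __init__(self):
        self.alive = {"primary": True, "backup": True}
        self.backup_running = True


class StateController:
    """Hypothetical stand-in for the backup process execution state control unit 17."""

    def __init__(self, manager):
        self.manager = manager

    def stop_backup(self):
        # Instruction F: stop the backup and register the stopped state.
        self.manager.backup_running = False


def collect_checkpoint(primary_state, backup, manager, controller):
    """Copy the primary's state to the backup; stop the backup if it is running."""
    if not (manager.alive["primary"] and manager.alive["backup"]):
        return False  # nothing to do unless both processes are alive
    # Read-out D and copy E: a deep copy keeps the checkpoint independent of
    # later changes on the primary side.
    backup["checkpoint"] = copy.deepcopy(primary_state)
    if manager.backup_running:
        controller.stop_backup()
    return True


manager = StateManager()
controller = StateController(manager)
primary_state = {"memory": [1, 2, 3], "context": {"pc": 40}}
backup = {}
collect_checkpoint(primary_state, backup, manager, controller)
primary_state["memory"].append(4)   # primary keeps computing
print(backup["checkpoint"], manager.backup_running)
```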

[0038] When the process restart unit 16 detects a failure of the computer running the primary process 11, it uses the backup process execution state management unit 18 to check the state of the backup process 12. If the backup process 12 is stopped, the process restart unit 16 issues an instruction H to the backup process execution state control unit 17 to start the execution of the backup process 12.

[0039] The backup process execution state control unit 17 has a function of registering both the primary process 11 and the backup process 12 as running in the backup process execution state management unit 18 when the process pair 13 made up of the two processes is started (when program execution begins). It also has a function of stopping the backup process 12 on receiving the instruction F from the checkpoint collection unit 15 and registering its state as stopped in the backup process execution state management unit 18. In addition, it has a function of running the backup process 12 on receiving the instruction H from the process restart unit 16 or the instruction G, described later, from the process issuing system call detection unit 19, and registering its state as running in the backup process execution state management unit 18.

[0040] The backup process execution state management unit 18 holds and manages the state of the backup process 12.

[0041] The process issuing system call detection unit 19 detects that the process pair 13 (the primary process 11 or the backup process 12 constituting it) has executed (issued) a system call. When it detects that the primary process 11 has executed a system call, it uses the backup process execution state management unit 18 to check the state of the backup process 12. If the backup process 12 is stopped, the process issuing system call detection unit 19 issues an instruction G to the backup process execution state control unit 17 to start the execution of the backup process 12.
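One way to picture the detection is a library wrapper that notifies a detector before each call that stands in for a system call. The text states only that the units are realized by a library of functions used by the processes; the decorator mechanism, the class `SyscallDetector`, and the function `read_input` below are assumptions for illustration.

```python
import functools


class SyscallDetector:
    """Hypothetical stand-in for the process issuing system call detection unit 19."""

    def __init__(self):
        self.backup_running = False   # as reported by the state management unit
        self.resume_count = 0         # number of times instruction G was issued

    def on_system_call(self, name):
        # Instruction G is issued only when the backup is currently stopped.
        if not self.backup_running:
            self.resume_count += 1
            self.backup_running = True


def detected(detector):
    """Wrap a function so the detector is notified before it runs."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            detector.on_system_call(func.__name__)
            return func(*args, **kwargs)
        return inner
    return wrap


detector = SyscallDetector()


@detected(detector)
def read_input(values):
    # Stands in for a call that would issue a real system call such as read().
    return list(values)


data = read_input([1, 2, 3])
more = read_input([4])
print(data, more, detector.resume_count)
```

The second call does not trigger another resume because the backup is already running by then.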

[0042] On the computers 1a and 1b, the inter-process pair communication units 14, the checkpoint collection units 15, the process restart units 16, the backup process execution state control units 17, the backup process execution state management units 18, and the process issuing system call detection units 19 each communicate with their counterpart over the network 2, and thereby operate as if each pair were a single unit.

[0043] FIG. 3 shows state transition diagrams. FIG. 3(a) shows the states the primary process 11 can take, and FIG. 3(b) shows the states the backup process 12 can take.

[0044] First, as shown in FIG. 3(a), the primary process 11 takes either the stopped state or the running state. The primary process 11 transitions from the stopped state to the running state at program execution start a1, and remains in the running state until program execution end a2.

[0045] Next, as shown in FIG. 3(b), the backup process 12 also takes either the stopped state or the running state. The backup process 12 transitions from the stopped state to the running state at program execution start b1. When checkpoint collection b2 is executed while the backup process 12 is running, it transitions to the stopped state. When system call issuance b3 occurs in the primary process 11 while the backup process 12 is stopped, it transitions to the running state.
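The transitions of FIG. 3 can be written down as lookup tables. The event names below are hypothetical labels for a1, a2, b1, b2 and b3. The text says only that the primary stays running until program execution end a2, so the return to the stopped state at a2 is an assumption.

```python
# (current state, event) -> next state
PRIMARY_TRANSITIONS = {
    ("stopped", "program_start"): "running",   # a1
    ("running", "program_end"): "stopped",     # a2 (target state assumed)
}

BACKUP_TRANSITIONS = {
    ("stopped", "program_start"): "running",         # b1
    ("running", "checkpoint"): "stopped",            # b2
    ("stopped", "primary_system_call"): "running",   # b3
}


def step(table, state, event):
    """Return the next state; events with no matching entry leave it unchanged."""
    return table.get((state, event), state)


def run(table, events, state="stopped"):
    """Apply a sequence of events and return every state visited."""
    history = [state]
    for event in events:
        state = step(table, state, event)
        history.append(state)
    return history


events = ["program_start", "checkpoint", "checkpoint", "primary_system_call"]
print(run(BACKUP_TRANSITIONS, events))
print(run(PRIMARY_TRANSITIONS, events))
```

The second `checkpoint` event leaves a stopped backup stopped, and the primary is unaffected by checkpoints or its own system calls.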

[0046] Next, the operation of this embodiment will be described with reference to FIGS. 4 to 8 as appropriate. FIG. 4 is a timing chart for explaining the overall operation of the primary process 11 and the backup process 12 that make up the process pair 13. FIG. 5 is a flowchart for explaining the operation of the checkpoint collection unit 15. FIG. 6 is a flowchart for explaining the operation of the process restart unit 16. FIG. 7 is a flowchart for explaining the operation of the process issuing system call detection unit 19. FIG. 8 is a timing chart for explaining the relationship between the states of the primary process 11 and the backup process 12 and the checkpoint collection times.

[0047] First, immediately after the processes start, that is, immediately after program execution starts a1 and b1, the primary process 11 on the computer 1a and the backup process 12 on the computer 1b both transition from the stopped state to the execution state, as shown in FIG. 3.

[0048] Now suppose that, while the primary process 11 and the backup process 12 are both in the execution state, the primary process 11 issues a checkpoint collection system call A to the checkpoint collection unit 15, for example at the time ckp1 in FIG. 4.

[0049] In this case, the checkpoint collection unit 15 is activated. It queries the backup process execution state management unit 18 for the states of the primary process 11 and the backup process 12 managed by that unit, and determines whether both processes are alive (step S1). If at least one of the primary process 11 and the backup process 12 is not alive, the checkpoint collection unit 15 ends its operation without doing anything further.

[0050] If, on the other hand, both the primary process 11 and the backup process 12 are alive, the checkpoint collection unit 15 issues an instruction C to the primary process 11, performs operation D of reading the state of the primary process 11, and performs operation E of copying the read state to the backup process 12 (step S2).

[0051] After executing the processing of step S2 (the checkpoint collection operation), the checkpoint collection unit 15 queries the backup process execution state management unit 18 for the state of the backup process 12 and determines whether the backup process 12 is in the execution state (step S3). If the backup process 12 is in the stopped state, the checkpoint collection unit 15 ends its operation. If the backup process 12 is in the execution state, the checkpoint collection unit 15 issues an instruction F to the backup process execution state control unit 17, which causes the backup process 12 to transition from the execution state to the stopped state (step S4). The backup process execution state control unit 17 registers this new state (the stopped state) of the backup process 12 in the backup process execution state management unit 18.
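Steps S1 to S4 of FIG. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the `Proc` class, its fields, and the function `collect_checkpoint` are hypothetical, and a process's state is modeled as a plain dictionary rather than a real process image.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Hypothetical stand-in for a process as tracked by the management unit 18."""
    alive: bool = True
    running: bool = True
    state: dict = field(default_factory=dict)  # stand-in for the process image


def collect_checkpoint(primary: Proc, backup: Proc) -> bool:
    """Sketch of steps S1-S4. Returns True if a checkpoint was taken."""
    # S1: end immediately unless both processes are alive.
    if not (primary.alive and backup.alive):
        return False
    # S2: read the primary's state (operation D) and copy it to the backup (operation E).
    backup.state = copy.deepcopy(primary.state)
    # S3/S4: a backup in the execution state is stopped (instruction F);
    # a backup that is already stopped is left as it is.
    if backup.running:
        backup.running = False
    return True
```

Whatever the backup's state was before the call, it is in the stopped state after any successful checkpoint, which matches transition b2 of FIG. 3(b).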

[0052] In the example of FIG. 4, checkpoints are also taken after ckp1, at the times ckp2, ckp3, and ckp4, which are determined, for example, by the timing of the periodic interrupt B from the timer TM. As is clear from the above description, the backup process 12 remains in the stopped state at these times.

[0053] Suppose that the primary process 11 subsequently performs the send processing 41 shown in FIG. 4, for example to write out output data produced by its processing, and that this send processing 41 is a system call. In this case, the process-issued system call detection unit 19 detects the system call (the send processing). The process-issued system call detection unit 19 then queries the backup process execution state management unit 18 for the states of the primary process 11 and the backup process 12, and determines whether both processes are alive (step S21). If at least one of the primary process 11 and the backup process 12 is not alive, the process-issued system call detection unit 19 ends its operation without doing anything further.

[0054] If, on the other hand, both the primary process 11 and the backup process 12 are alive, the process-issued system call detection unit 19 uses the backup process execution state management unit 18 to determine whether the backup process 12 is in the execution state (step S22).

[0055] If the backup process 12 is not in the execution state, that is, if it is in the stopped state, the process-issued system call detection unit 19 issues an instruction G to the backup process execution state control unit 17 to cause the backup process 12 to transition to the execution state, and restarts the backup process 12 from the last (most recent) checkpoint taken before that time (step S23). In the example of FIG. 4, the last checkpoint taken before the send processing 41 is ckp4. In this case, the backup process 12 resumes processing from checkpoint ckp4.

[0056] When the backup process 12 is in the stopped state (NO in step S22), the process-issued system call detection unit 19 puts the backup process 12 into the execution state and restarts it as described above (step S23), and then proceeds to step S24. When the backup process 12 is already in the execution state (YES in step S22), the process-issued system call detection unit 19 proceeds directly to step S24. In step S24, the process-issued system call detection unit 19 synchronizes the primary process 11 and the backup process 12 and ends its operation. In other words, once the restarted backup process 12 reaches the send processing 41, it is synchronized with the primary process 11 by the inter-process-pair communication unit 14.
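The flow of FIG. 7 (steps S21 to S24) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Proc` class, the function `on_primary_syscall`, its string return values, and the `synchronize` callback (a placeholder for the synchronization performed through the inter-process-pair communication unit 14) are all hypothetical.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Hypothetical stand-in for a process as tracked by the management unit 18."""
    alive: bool = True
    running: bool = True
    state: dict = field(default_factory=dict)


def on_primary_syscall(primary, backup, last_checkpoint, synchronize=lambda: None):
    """Sketch of steps S21-S24, run when the primary process issues a system call."""
    # S21: end immediately unless both processes are alive.
    if not (primary.alive and backup.alive):
        return "ignored"
    # S22: is the backup already in the execution state?
    if backup.running:
        action = "already running"
    else:
        # S23: restart the backup from the most recent checkpoint (instruction G).
        backup.state = copy.deepcopy(last_checkpoint)
        backup.running = True
        action = "restarted"
    # S24: synchronize the primary and the backup (placeholder callback).
    synchronize()
    return action
```

In the example of FIG. 4, `last_checkpoint` would correspond to the state captured at ckp4 when the send processing 41 is detected.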

[0057] Next, suppose that a failure occurs in the computer 1a executing the primary process 11 and the primary process 11 stops, and that this failure of the computer 1a occurs at the time fault1 in FIG. 4. The failure of the computer 1a is detected by the process restart unit 16.

[0058] When the process restart unit 16 detects a computer failure, it determines whether the failed computer is the computer on the backup process 12 side (step S11). If the failure occurred in the computer on the backup process 12 side, that is, in the computer 1b, the primary process 11 on the computer 1a can still run, so the process restart unit 16 ends its operation without doing anything further.

[0059] If, on the other hand, the failure occurred in the computer on the primary process 11 side, that is, in the computer 1a, the process restart unit 16 queries the backup process execution state management unit 18 for the state of the backup process 12 and determines whether the backup process 12 is alive and in the stopped state (steps S12 and S13). If the backup process 12 is not alive, or if it is alive but in the execution state, the process restart unit 16 ends its operation without doing anything further.

[0060] If, on the other hand, the backup process 12 is alive and in the stopped state, the process restart unit 16 issues an instruction H to the backup process execution state control unit 17 to cause the backup process 12 to transition to the execution state, and restarts the backup process 12 from the last checkpoint taken before that time, that is, before the time fault1 (here ckp4, as is apparent from FIG. 4) (step S14). As a result, in the example of FIG. 4, the backup process 12 resumes processing from checkpoint ckp4, which corresponds to the point restart1.
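The flow of FIG. 6 (steps S11 to S14) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Proc` class, the function `on_computer_failure`, and the use of the strings "1a" and "1b" as host identifiers are hypothetical, and failure detection itself is outside the sketch.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Hypothetical stand-in for a process as tracked by the management unit 18."""
    alive: bool = True
    running: bool = True
    state: dict = field(default_factory=dict)


def on_computer_failure(failed_host, backup_host, backup, last_checkpoint):
    """Sketch of steps S11-S14. Returns True if the backup was restarted."""
    # S11: if the backup's own computer failed, the primary can keep running.
    if failed_host == backup_host:
        return False
    # S12/S13: act only if the backup is alive and in the stopped state.
    if not backup.alive or backup.running:
        return False
    # S14: restart the backup from the most recent checkpoint (instruction H).
    backup.state = copy.deepcopy(last_checkpoint)
    backup.running = True
    return True
```

A backup that is already in the execution state is left alone, because it is already executing alongside the primary and needs no restart.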

[0061] In general, scientific and technical computing programs such as CAD and simulation programs first read input data, which involves issuing system calls, and then often repeat CPU computation without issuing any system calls. At the end of the long CPU computation, output data is written out, which again involves issuing system calls. If the backup process 12 is also kept in the execution state during the periods that involve system calls, then even if only the primary process 11 runs during the period in which CPU computation is repeated without system calls, no proprietary OS is needed to save and restore the state of the services received from the OS, and an open system of the kind widely used in industry can be employed.

[0062] Here, suppose that the fault tolerant system of FIG. 1 is applied to the execution of the scientific and technical computing program described above. In this case, as shown for example in FIG. 8, immediately after the program is started, that is, immediately after the process pair 13 composed of the primary process 11 and the backup process 12 is started, there is an initial period in which system calls are issued for input data reading 81 and the like. During this period, the primary process 11 and the backup process 12 both operate and are in a state (the execution state) in which they receive services from the OS.

[0063] During the subsequent long CPU computation, the backup process 12 is in the stopped state from the first checkpoint ckp1 onward, as is clear from the operation of the checkpoint collection unit 15 described above. Therefore, during the long CPU computation, the backup process 12 side consumes no CPU resources for executing the scientific and technical computing program. Finally, while system calls are issued for output data writing 82 and the like, the primary process 11 and the backup process 12 again both operate and receive services from the OS.

[0064] In this embodiment, the process-issued system call detection unit 19 detects the issuance of a system call in the primary process 11, automatically restarts the backup process 12 from the last checkpoint taken, and has it process the system call. In this case, if a long time has passed since the last checkpoint was taken, the processing after the restart takes a long time.

[0065] Therefore, in this embodiment, checkpoint collection is explicitly instructed from within the program so that a system call is not executed long after the last checkpoint was taken. In the example of FIG. 8, checkpoints ckp5 and ckp10 are of this kind; they are designated to the checkpoint collection unit 15 by the instruction A issued through a system call from the primary process 11. By contrast, checkpoints ckp1, ckp2, ckp3, ckp4 and ckp7, ckp8, ckp9 in FIG. 8 are designated by the periodic interrupt B from the timer TM during periods in which the backup process 12 is in the stopped state.

[0066] As already described, when a checkpoint collection time arrives while the backup process 12 is in the execution state, the backup process 12 is put into the stopped state. If a system call is executed in this state, the backup process 12 must be restarted from the last checkpoint taken before that time. For this reason, it is undesirable in terms of processing efficiency for a checkpoint collection time to arrive during a period in which system calls need to be issued repeatedly.

[0067] It is therefore advisable to limit the period in which the checkpoint collection unit 15 is activated by the periodic interrupt B from the timer TM to the period in which the backup process 12 is in the stopped state, and to configure the checkpoint collection unit 15 to ignore (treat as invalid) any interrupt B that arrives while the backup process 12 is in the execution state. The checkpoint collection instruction A issued through a system call from the primary process 11, on the other hand, is always treated as valid by the checkpoint collection unit 15.
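The filtering rule described here can be expressed as a single predicate. The following is only an illustrative sketch: the function `should_collect` and the trigger labels "syscall_A" and "timer_B" are hypothetical names for the instruction A and the interrupt B.

```python
def should_collect(trigger: str, backup_running: bool) -> bool:
    """Decide whether the checkpoint collection unit 15 acts on a trigger.

    "syscall_A": explicit instruction A from the primary process, always valid.
    "timer_B":   periodic interrupt B from the timer TM, valid only while
                 the backup process is in the stopped state.
    """
    if trigger == "syscall_A":
        return True
    if trigger == "timer_B":
        return not backup_running
    raise ValueError(f"unknown trigger: {trigger!r}")
```

With this rule, a timer interrupt that arrives during a burst of system calls does not stop the backup process, so no restart from a checkpoint is needed for the next system call.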

[0068] Furthermore, it is advisable to explicitly instruct checkpoint collection from within the program immediately after the end of a period in which system calls need to be issued repeatedly, that is, immediately after it is no longer necessary to keep the backup process 12 in the execution state. In this way, the backup process 12 is put into the stopped state immediately after the end of that period, and from then on checkpoint collection can be instructed by the periodic interrupt B from the timer TM. As a result, checkpoints can be taken efficiently. In the example of FIG. 8, ckp6 is a checkpoint explicitly instructed in the program so that it falls immediately after the end of a period in which system calls need to be issued repeatedly.
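From the application program's point of view, the explicit checkpoints placed immediately before and immediately after a system-call period (such as ckp5 and ckp6 in FIG. 8) could be wrapped in a helper like the one below. This is a hypothetical sketch: `syscall_phase` and the `request_checkpoint` callable are illustrative names, standing in for whatever library call issues the instruction A; no such API is specified in the text.

```python
from contextlib import contextmanager


@contextmanager
def syscall_phase(request_checkpoint):
    """Bracket a period of repeated system calls with explicit checkpoints."""
    # Checkpoint just before the system calls, so the backup restarts from a
    # recent state when the first system call is detected.
    request_checkpoint()
    yield
    # Checkpoint just after the period, which stops the backup again and
    # lets timer-driven checkpoints resume.
    request_checkpoint()
```

A program would then place its output-writing code inside `with syscall_phase(request_checkpoint): ...`, keeping the long CPU-only computation outside the block.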

[0069] Similarly, the program can also explicitly specify that the backup process 12 be put into the execution state.

[0070] As described above, this embodiment realizes a process-pair fault tolerant system that allows processing to continue when a failure occurs while (1) not using twice the CPU resources and (2) being applicable to open systems.

[0071] In the above embodiment, the primary process 11 and the backup process 12 that constitute the process pair 13 run on different computers 1a and 1b, respectively. However, in a fault tolerant system that only needs to handle program failures, the primary process 11 and the backup process 12 may run on the same computer. In that case, if the single computer on which both the primary process 11 and the backup process 12 run itself fails, processing cannot be continued.

[0072] The present invention is not limited to the above embodiment, and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent elements are removed from all the constituent elements shown in the embodiment, the configuration from which those elements have been removed can be extracted as an invention, provided that the problem described in the section on problems to be solved by the invention can be solved and the effects described in the section on effects of the invention can be obtained.

[0073]

[Effects of the Invention] As described in detail above, according to the present invention, the backup process is put into the execution state during the period immediately after the program is started and, thereafter, whenever a system call is issued while the backup process is in the stopped state; and if the backup process is in the execution state when a checkpoint is collected, it is put into the stopped state. Consequently, processing can continue even when a failure occurs without requiring twice the CPU resources. Moreover, the state of the services received from the OS can be saved and restored without adopting a proprietary OS, so an open system can be used.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a fault tolerant system according to an embodiment of the present invention.

FIG. 2 is a functional block diagram showing the relationship between the process pair 13, composed of the primary process 11 and the backup process 12 in FIG. 1, and the functional elements implemented by the library 20.

FIG. 3 is a state transition diagram showing the states that the primary process 11 and the backup process 12 can take.

FIG. 4 is a timing chart for explaining the overall operation of the primary process 11 and the backup process 12 that constitute the process pair 13 in the embodiment.

FIG. 5 is a flowchart for explaining the operation of the checkpoint collection unit 15 in the embodiment.

FIG. 6 is a flowchart for explaining the operation of the process restart unit 16 in the embodiment.

FIG. 7 is a flowchart for explaining the operation of the process-issued system call detection unit 19 in the embodiment.

FIG. 8 is a timing chart for explaining the relationship between the states of the primary process 11 and the backup process 12 and the checkpoint collection times in the embodiment.

FIG. 9 is a diagram for explaining a first conventional process pair method.

FIG. 10 is a diagram for explaining a second conventional process pair method.

[Explanation of symbols]

1a, 1b: computers
2: network
11: primary process
12: backup process
13: process pair
14: inter-process-pair communication unit
15: checkpoint collection unit
16: process restart unit
17: backup process execution state control unit
18: backup process execution state management unit
19: process-issued system call detection unit
20: library

Claims

[Claims]

1. A process pair execution control method in a fault tolerant system that executes a process pair, composed of a primary process and a backup process, capable of continuing processing even when a failure occurs, the method comprising the steps of: putting both the primary process and the backup process into the execution state when the process pair is started; copying the state of the primary process to the backup process at each checkpoint collection time; putting the backup process into the stopped state if the backup process is in the execution state at the time of checkpoint collection; and, if the backup process is in the stopped state when a system call is issued from the process pair, restarting the backup process from the most recently taken checkpoint and putting it into the execution state.

2. The process pair execution control method in a fault tolerant system according to claim 1, further comprising the step of, if the backup process is in the stopped state when a failure occurs on the primary process side, restarting the backup process from the most recently taken checkpoint and putting it into the execution state.

3. The process pair execution control method in a fault tolerant system according to claim 1, further comprising the step of setting a checkpoint collection time at a timing immediately before the backup process needs to be switched from the stopped state to the execution state.

4. The process pair execution control method in a fault tolerant system according to claim 3, further comprising the step of setting a checkpoint collection time at a timing immediately after the backup process no longer needs to be kept in the execution state.

5. The process pair execution control method in a fault tolerant system according to claim 1, further comprising the steps of: setting a first checkpoint collection time at a timing immediately before the backup process needs to be switched from the stopped state to the execution state; setting a second checkpoint collection time at a timing immediately after the backup process no longer needs to be kept in the execution state; and setting third checkpoint collection times at predetermined time intervals during the period in which the backup process is in the stopped state, excluding the period from the first checkpoint collection time to the second checkpoint collection time.

6. A process pair execution control program for a fault tolerant system that executes a process pair, composed of a primary process and a backup process, capable of continuing processing even when a failure occurs, the program causing a computer to execute the steps of: putting both the primary process and the backup process into the execution state when the process pair is started; copying the state of the primary process to the backup process at each checkpoint collection time; putting the backup process into the stopped state if the backup process is in the execution state at the time of checkpoint collection; and, if the backup process is in the stopped state when a system call is issued from the process pair, restarting the backup process from the most recently taken checkpoint and putting it into the execution state.

7. A fault tolerant system that executes a process pair, composed of a primary process and a backup process, capable of continuing processing even when a failure occurs, the system comprising: inter-process-pair communication means for performing inter-process-pair communication; backup process execution state control means for controlling the execution state of the backup process, which puts both the primary process and the backup process into the execution state when the process pair is started; checkpoint collection means for collecting a checkpoint by copying the state of the primary process to the backup process each time a checkpoint collection time arrives, and for causing the backup process execution state control means to put the backup process into the stopped state if the backup process is in the execution state at the time of checkpoint collection; and process-issued system call detection means for detecting a system call issued by the process pair and, if the backup process is in the stopped state when the system call is detected, causing the backup process execution state control means to restart the backup process from the most recently taken checkpoint and put it into the execution state.

8. The fault tolerant system according to claim 7, further comprising process restart means for, if the backup process is in the stopped state when a failure occurs on the primary process side, causing the backup process execution state control means to restart the backup process from the most recently taken checkpoint and put it into the execution state.

9. The fault tolerant system according to claim 8, further comprising backup process execution state management means for managing the execution state of the backup process, wherein the checkpoint collection means, the process-issued system call detection means, and the process restart means determine the state of the backup process by querying the backup process execution state management means.