JP3122371B2 - Computer system

Info

Publication number: JP3122371B2
Application number: JP08251041A
Authority: JP
Inventors: 秀昭平山; 邦保清水
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1996-01-31
Filing date: 1996-09-24
Publication date: 2001-01-09
Anticipated expiration: 2016-09-24
Also published as: JPH09269905A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a computer system having a checkpoint/restart function, and in particular to a computer system that, by providing a process dedicated to checkpoint processing, eliminates the need for lock runout processing at restart and greatly reduces construction cost. The present invention also relates to a computer system in which, in a multiprocessor system having a checkpoint/restart function, processing continues on the remaining processors even when some of the processors fail.

[0002]

2. Description of the Related Art

The following describes a conventional computer system that recovers from a failure by proceeding with processing while acquiring checkpoints during normal operation and, when a failure occurs, re-executing the system from the most recently acquired checkpoint.

[0003] In this computer system, processing proceeds during normal operation while checkpoints of the system are acquired. When a failure or the like occurs, the system recovers by re-executing from the most recently acquired checkpoint.

[0004] A checkpoint is acquired in the following cases.
(1) When acquisition of a checkpoint is explicitly instructed in the code.
(2) When a fixed time has elapsed since the last checkpoint was acquired.
(3) When an event (interrupt) prompting acquisition of a checkpoint occurs.

[0005] These conditions can arise at any point during program execution. Conventionally, a checkpoint was acquired immediately at the moment such a condition arose, that is, at an arbitrary point during program execution.

[0006] FIG. 15 shows a processor performing checkpoint processing in the middle of normal processing. At time t1, checkpoint processing ((2) in FIG. 15) is performed within the interrupt processing ((1) in FIG. 15) that accompanies the occurrence of an event prompting acquisition of a checkpoint.

[0007] At time t2, checkpoint processing ((4) in FIG. 15) is performed within the processing of a timer interrupt ((3) in FIG. 15) that is raised when a fixed time has elapsed since the last checkpoint was acquired. In other words, a checkpoint was acquired while an arbitrary process was executing.

[0008] FIG. 16 shows a failure occurring while processing proceeds with checkpoints being acquired, and processing being re-executed from the last checkpoint. When a failure occurs after checkpoints have been acquired at times t1 and t2 ((1) in FIG. 16), processing is re-executed from the most recently acquired checkpoint (t2) ((2) in FIG. 16).

[0009] In general, however, when re-execution after a failure is taken into account, processing usually contains "processing that must be handled as a single unit." One such portion of processing is called a lock runout region.

[0010] A lock runout region is a section during which a checkpoint may be acquired, but which, when the system is re-executed from a checkpoint acquired within it, must be "run to completion" during failure recovery before the system returns to the normal state. Concretely, it is a section during which a spin lock is held.

[0011] A spin lock is a lock whose holder cannot sleep while holding it, and for which a processor requesting it must keep spinning on that processor until the lock can be acquired.

[0012] When acquiring spin locks, care must be taken to avoid deadlock. Typically, each spin lock is given a lock class with a level, and the system is designed so that a process already holding spin locks can acquire a further spin lock only if its lock class is lower than the lowest level among the lock classes of the spin locks currently held. Managing spin lock acquisition in this way guarantees the order of lock acquisition on every processor.

[0013] For example, suppose the lock class levels are set as shown in FIG. 17, and "process A" and "process D", both of which involve lock operations, are executed concurrently. If both locks must be held at the same time, each processor must always acquire the lock of "process D" (level L5) first and then the lock of "process A" (level L3).
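The ordering rule of [0012] and [0013] can be illustrated with a minimal sketch. The class and method names (`LockTracker`, `acquire`, `release`) are hypothetical and do not appear in the patent; levels are modeled as plain integers, so L5 is 5 and L3 is 3.

```python
class LockOrderError(Exception):
    """Raised when a spin lock is requested out of lock-class order."""


class LockTracker:
    """Tracks the lock-class levels of the spin locks one process holds."""

    def __init__(self):
        self.held = []  # levels of the spin locks currently held

    def acquire(self, level):
        # Rule from [0012]: a new lock must be lower than every held lock.
        if self.held and level >= min(self.held):
            raise LockOrderError(
                f"cannot take L{level} while holding L{min(self.held)}"
            )
        self.held.append(level)

    def release(self, level):
        self.held.remove(level)


tracker = LockTracker()
tracker.acquire(5)   # lock of "process D" (level L5) first
tracker.acquire(3)   # then lock of "process A" (level L3)
print(tracker.held)  # [5, 3]
```

Requesting L5 while already holding L3 would raise `LockOrderError` in this sketch, which corresponds to the acquisition order that the rule forbids.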

[0014] The reason locks must be run out is explained with reference to FIGS. 18 and 19. FIG. 18 shows an example in which a deadlock occurs because lock runout is not performed.

[0015] Suppose that process T0 is running on processor (0) and process T1 on processor (1), and that a checkpoint is acquired while process T0 holds spin locks L5 and L3 and process T1 holds spin lock L4.

[0016] Now consider the case where a permanent fault subsequently occurs in processor (0). Only processor (1) remains operational, so processor (1) must execute both process T0 and process T1. The spin locks that T0 and T1 currently hold are known, but how T0 and T1 will behave from here on, that is, which spin locks they will go on to acquire, cannot be predicted.

[0017] Suppose that, after recovery, process T0, which currently holds the lower-level spin lock, is dispatched to processor (1). Suppose further that T0 releases spin lock L3, which it already holds, and then attempts to acquire spin lock L4. Spin lock L4, however, is held by process T1, which was running on processor (1) before the failure, so T0 can never acquire it. This is a deadlock. The problem arises because the order of spin lock acquisition, which was guaranteed across the two processors before the failure, breaks down once one processor has failed.

[0018] A lock runout function is known as a technique for solving this problem. Before re-execution from a checkpoint, this function causes all spin locks that were held when the checkpoint was acquired to be released, putting every process into a state that does not depend on a particular processor. It follows the procedure below.

[0019] (1) Among the spin locks that were held when the checkpoint was acquired, select the process holding the spin lock of the lowest level.
(2) Make the processor masquerade as the processor that had been running the selected process, and run that process until it releases that spin lock.

[0020] (3) Within the spin lock release processing, check whether any process still holds a spin lock.
(4) If one exists, repeat from step (1). If none exists, lock runout processing ends.

[0021] For example, if spin locks are held as shown in FIG. 19A, process T0 is selected first (L3 is the lowest level) and is run until spin lock L3 is released.

[0022] Next, process T1, which holds L4, now the lowest level, is selected (FIG. 19B). After L4 is released, process T0, which holds L5, is selected (FIG. 19C), and lock runout is complete. The system performs the restart after lock runout has completed.
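The conventional lock runout procedure of [0019] to [0022] can be sketched as a small simulation. This is a simplification under stated assumptions: the function name `lock_runout` is hypothetical, each process is assumed to release its lowest held lock without acquiring new ones, and the masquerading of the processor in step (2) is not modeled.

```python
def lock_runout(held):
    """Return the order in which (process, lock level) pairs are run out.

    `held` maps a process name to the lock levels it held at the checkpoint.
    """
    held = {proc: sorted(levels) for proc, levels in held.items() if levels}
    order = []
    while held:  # steps (3)/(4): repeat while any spin lock is still held
        # Step (1): pick the process holding the lowest-level spin lock.
        proc = min(held, key=lambda p: held[p][0])
        # Step (2): run that process until it releases that lock.
        level = held[proc].pop(0)
        order.append((proc, level))
        if not held[proc]:
            del held[proc]
    return order


# State assumed for FIG. 19A: T0 holds L5 and L3, T1 holds L4.
print(lock_runout({"T0": [5, 3], "T1": [4]}))
# [('T0', 3), ('T1', 4), ('T0', 5)]
```

The resulting order, T0 for L3, then T1 for L4, then T0 for L5, matches the sequence described for FIGS. 19A to 19C.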

[0023] To implement lock runout processing carried out by this procedure, the spin lock release processing must call a special dispatch mechanism while lock runout is in progress.

[0024]

PROBLEMS TO BE SOLVED BY THE INVENTION

Thus, with the conventional checkpoint acquisition method, the software (the OS, or operating system) had to identify portions of processing such as lock runout regions and implement the special dispatch mechanism described above in order to protect these "single units."

[0025] As a result, the cost of the computer system inevitably increased and the software implementation was constrained. The present invention has been made in view of these circumstances, and its object is to provide a computer system that, by providing a process dedicated to checkpoint processing, eliminates the need for lock runout processing at restart and greatly reduces construction cost.

[0026] Another object of the present invention is to provide a computer system that, in a multiprocessor system having a checkpoint/restart function, can continue processing on the remaining processors even when some of the processors fail.

[0027]

[0028]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above object, the present invention provides a computer system comprising: at least one processor; a dedicated checkpoint process, provided for each processor, that acquires a checkpoint for restarting a process interrupted by a failure or the like; interrupt means for interrupting a running process and changing the dedicated checkpoint process from a standby state to an executable state; dispatch means for dispatching the dedicated checkpoint process made executable by the interrupt means; and standby state transition means for returning the dedicated checkpoint process to the standby state after the dedicated checkpoint process dispatched by the dispatch means has acquired the checkpoint.

[0029] With this configuration, the interrupt means interrupts the running process and changes the dedicated checkpoint process from the standby state to the executable state. The dispatch means then dispatches the dedicated checkpoint process made executable by the interrupt means, and the checkpoint is acquired. Because no other process is in the running state when the checkpoint is acquired, there is no possibility of deadlock.

[0030] After the dedicated checkpoint process dispatched by the dispatch means has acquired the checkpoint, the standby state transition means returns the dedicated checkpoint process to the standby state.

[0031] To achieve the above object, the present invention also provides a computer system comprising: at least one processor; checkpoint processing execution instruction means for instructing, when a checkpoint acquisition condition is satisfied, acquisition of a checkpoint for restarting a process interrupted by a failure or the like; checkpoint processing means, provided in the dispatcher of the operating system, that is called by the dispatcher when the checkpoint processing execution instruction means has instructed the dispatcher to acquire a checkpoint, and that acquires the checkpoint corresponding to the processor; and standby state transition means for returning the checkpoint processing means to the standby state after the checkpoint has been acquired.

[0032] With this configuration, when the checkpoint acquisition instruction means instructs acquisition of a checkpoint, the execution enabling means makes the checkpoint acquisition means executable. The dispatch means then dispatches the checkpoint acquisition means made executable by the execution enabling means. Because no other process is in the running state when the checkpoint is acquired, there is no possibility of deadlock.

[0033] After the checkpoint acquisition means dispatched by the dispatch means has acquired the checkpoint, the standby state transition means returns the checkpoint acquisition means to the standby state.

[0034] Furthermore, according to the present invention, the only process in the running state when a checkpoint is acquired is the dedicated checkpoint process. None of the other, ordinary processes is in a processor-dependent state. Therefore, even if a permanent fault occurs in a particular processor of a multiprocessor system, the system can easily be reconfigured (degraded operation with that processor removed) simply by not executing the dedicated checkpoint process for that processor.

[0035]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is described below with reference to the drawings.

(First Embodiment) FIG. 1 shows the schematic configuration of a computer system according to one embodiment of the present invention.

[0036] Reference numeral 10 denotes a processor, which is connected to an internal bus 11. Also connected to the internal bus 11 are a memory 12 and a BIB (Before Image Buffer) 13, which stores the pre-update image data of the memory 12.

[0037] The memory 12 stores software, including the operating system (OS), executed by the processor. FIG. 14 illustrates the checkpoint/rollback function of the computer.

[0038] As shown in FIG. 14, at time t0 the contents of memory addresses 0, 1, 2, and 3 are a, b, c, and d, and the BIB holds no data. This state is called the CKP.

[0039] At time t1, the processor issues an instruction to store x at address 1, and the content of memory address 1 changes to x. At this time the BIB records the pre-update information of the memory, namely that "address 1 held b."

[0040] Next, at time t2 the processor issues an instruction to store Y at address 2. Before the content of memory address 2 is rewritten to Y, the BIB records the pre-update information of the memory, namely that "address 2 held c." Suppose that a failure then occurs at time t3.

[0041] In this case, applying the pre-update information stored in the BIB to the memory in the reverse of the order in which it was recorded rolls the memory back to its state before the updates, that is, to the CKP state at time t0.

[0042] If instead processing continues past time t3 without a failure, the information stored in the BIB is cleared at a suitable point and the memory state at that point becomes the new CKP, advancing the CKP by one generation.
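The checkpoint/rollback behavior of [0038] to [0042] can be sketched in software as follows. In the patent the BIB is a hardware buffer on the internal bus; the class `CheckpointedMemory` and its method names are hypothetical and only mirror the described behavior.

```python
class CheckpointedMemory:
    """Memory with a before-image buffer (BIB) for rollback."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.bib = []  # (address, old value) pairs, oldest first

    def store(self, address, value):
        # Record the before-image, then update memory.
        self.bib.append((address, self.cells[address]))
        self.cells[address] = value

    def rollback(self):
        # Undo the updates in the reverse of the order they were recorded.
        while self.bib:
            address, old = self.bib.pop()
            self.cells[address] = old

    def checkpoint(self):
        # Clearing the BIB makes the current state the new CKP.
        self.bib.clear()


mem = CheckpointedMemory(["a", "b", "c", "d"])  # state at t0 (the CKP)
mem.store(1, "x")   # t1: BIB records (1, "b")
mem.store(2, "Y")   # t2: BIB records (2, "c")
mem.rollback()      # failure at t3
print(mem.cells)    # ['a', 'b', 'c', 'd']
```

Replaying the buffer in reverse matters when one address is written more than once between checkpoints: the oldest before-image must be applied last so that the CKP value is the one restored.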

[0043] FIG. 2 is a functional block diagram illustrating the functions of the software stored in the memory 12. Reference numeral 21 denotes the dedicated checkpoint process, which is normally kept in the standby state by the sleep unit 22.

[0044] When a checkpoint acquisition condition is satisfied, the wake-up unit 23 makes the dedicated checkpoint process 21, which is in the standby state, executable.

[0045] The checkpoint acquisition condition is taken to be satisfied when a predetermined amount of data has been stored in the BIB 13. Alternatively, the checkpoint acquisition condition may be satisfied in the following cases.

[0046] (1) When a predetermined amount of data has been stored in an AIB (After Image Buffer), which stores the post-update image data of the memory 12.
(2) When acquisition of a checkpoint is explicitly instructed in the code.

[0047] (3) When a fixed time has elapsed since the last checkpoint was acquired.

The dedicated checkpoint process 21, once executable, is selected by the dispatcher 24. The dedicated checkpoint process 21 thereby enters the running state and acquires the checkpoint.
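The acquisition conditions of [0045] to [0047] can be combined into a single predicate, sketched below. The function name `checkpoint_due` and all parameter names are hypothetical; the AIB variant is omitted because it would have the same form as the BIB test.

```python
def checkpoint_due(bib_used, bib_limit, explicit_request=False,
                   elapsed=0.0, interval=None):
    """Return True when any checkpoint acquisition condition holds."""
    if bib_used >= bib_limit:          # BIB holds the predetermined amount
        return True
    if explicit_request:               # explicitly requested in the code
        return True
    if interval is not None and elapsed >= interval:  # timer expired
        return True
    return False


print(checkpoint_due(bib_used=4096, bib_limit=4096))              # True
print(checkpoint_due(bib_used=10, bib_limit=4096))                # False
print(checkpoint_due(10, 4096, elapsed=2.5, interval=1.0))        # True
```

When the predicate becomes true, the description has the wake-up unit 23 make the dedicated checkpoint process 21 executable rather than taking the checkpoint on the spot.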

[0048] So that the checkpoint is acquired as soon as possible once a checkpoint acquisition condition is satisfied, the priority (execution priority) of the dedicated checkpoint process 21 is set high, and no interrupts are accepted other than special ones such as failure notifications.

[0049] The operation is described next with reference to FIGS. 3 to 5. FIG. 3 is a flowchart showing the processing flow of the dedicated checkpoint process 21.

[0050] The dedicated checkpoint process 21 is normally in the standby state (step A1). When a checkpoint acquisition condition is satisfied, it is woken up by the wake-up unit 23 and becomes executable. The dispatcher 24 then dispatches the dedicated checkpoint process, and the checkpoint is acquired (step A2). When this is finished, the process returns to the standby state.

[0051] FIG. 4 is a flowchart showing the processing flow for notifying that a checkpoint acquisition condition has been satisfied and waking up the dedicated checkpoint process 21 in the standby state.

[0052] As shown in FIG. 4, when a checkpoint acquisition condition is satisfied, processing that notifies this is executed by interrupt processing or a subroutine call (step B1). This processing makes the dedicated checkpoint process 21, which is in the standby state, executable.

[0053] The transitions of the dedicated checkpoint process 21 to the standby state and to the executable state must apply to that specific process only. In UNIX, for example, a specific wait channel (here, S) is used to identify the process to be woken up.

[0054] FIG. 5 is a flowchart showing the processing flow when re-executing from a checkpoint acquired by the dedicated checkpoint process 21. The state of the most recently acquired checkpoint is restored first (step C1), and then the dedicated checkpoint process 21 is made the current process and executed from its beginning (step C2). Alternatively, the context of the dedicated checkpoint process saved in the checkpoint may be restored.

[0055] Next, the operation of the computer system of this embodiment is described with reference to FIGS. 1 to 7. FIG. 6 illustrates the operation of the computer system of this embodiment.

[0056] While the processor is executing an arbitrary process x, when the amount of pre-update image data of the memory 12 stored in the BIB 13 reaches a predetermined amount, the BIB 13 issues a checkpoint acquisition request interrupt to the processor 10.

[0057] Here, the BIB 13 is assumed to interrupt processor (1). When the BIB 13 interrupts processor (1), processor (1) temporarily suspends the process x it is executing and performs interrupt processing ((1) in FIG. 6).

[0058] In this interrupt processing, the wake-up unit 23 puts every dedicated checkpoint process 21 that is in the standby (sleep) state into the executable (ready) state ((2) in FIG. 6).

[0059] FIG. 7 illustrates the transition of a dedicated checkpoint process from the sleep state to the ready state. At this time, the priority (execution priority) of the dedicated checkpoint process 21 in the ready state is set high, and no interrupts are accepted other than special ones such as failure notifications.

[0060] When the interrupt processing ends, processor (1) resumes the suspended process x, and when one execution unit of process x finishes, the dispatcher 24 is called ((3) in FIG. 6). The dispatcher 24 is invoked under the control of time-sharing processing or the like.

[0061] The dispatcher 24 finds the process with the highest priority, puts it into the running state, and has the processor execute it. Here the high-priority dedicated checkpoint process 21 has been woken up, so the dispatcher 24 selects the dedicated checkpoint process 21 and has the processor execute it to acquire the checkpoint ((4) in FIG. 6).

[0062] In the conventional computer system, checkpoints were acquired within arbitrary interrupt processing during arbitrary processes, so a process might be holding a spin lock or accessing a processor-dependent resource at that time. This could lead to deadlock.

[0063] In the computer system of this embodiment, by contrast, the checkpoint is acquired by the dedicated checkpoint process 21 dispatched by the dispatcher 24, so no other process is in the running state. Consequently, unlike in the conventional computer system, no deadlock occurs at restart.

[0064] When acquisition of the checkpoint is finished, the dedicated checkpoint process 21 returns to the standby (sleep) state ((5) in FIG. 6), the dispatcher is called, and an arbitrary ordinary process is selected again ((6) in FIG. 6).
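The sequence (1) to (6) of FIG. 6 can be traced with a small single-processor simulation. This is only a sketch: the names `Process`, `dispatch`, and `simulate` are hypothetical, priorities are plain integers where a larger value means higher priority, and the interrupt is reduced to a state change.

```python
SLEEP, READY = "sleep", "ready"


class Process:
    def __init__(self, name, priority, state=READY):
        self.name, self.priority, self.state = name, priority, state


def dispatch(processes):
    """Pick the ready process with the highest priority (larger is higher)."""
    ready = [p for p in processes if p.state == READY]
    return max(ready, key=lambda p: p.priority)


def simulate():
    trace = []
    x = Process("x", priority=1)
    ckp = Process("ckp", priority=10, state=SLEEP)  # dedicated process
    procs = [x, ckp]

    trace.append(dispatch(procs).name)   # normal operation: x runs
    ckp.state = READY                    # (1)-(2): interrupt wakes ckp
    trace.append("x")                    # (3): x finishes its execution unit
    running = dispatch(procs)            # (4): dispatcher picks ckp
    trace.append(running.name)
    trace.append("checkpoint")           # taken only inside ckp
    running.state = SLEEP                # (5): ckp sleeps again
    trace.append(dispatch(procs).name)   # (6): an ordinary process resumes
    return trace


print(simulate())
# ['x', 'x', 'ckp', 'checkpoint', 'x']
```

The point the trace makes is that the checkpoint is taken only after process x has reached a dispatch boundary, so x is not in the running state at that moment.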

[0065] Thus, when the dedicated checkpoint process 21 is used, checkpoints are always acquired only within the dedicated checkpoint process 21.

[0066] Lock runout processing at restart therefore becomes unnecessary, and the operating system, including its checkpoint acquisition function, can be greatly simplified, so that construction cost can be greatly reduced.

[0067] This configuration also provides the following effect. FIG. 8A shows a multiprocessor system. This example assumes a multiprocessor system with four processors, processor (0) to processor (3). When acquiring a checkpoint, each processor 10 does so by dispatching the dedicated checkpoint process 21 corresponding to that processor.

[0068] In a multiprocessor system, there are two methods of acquiring checkpoints: (1) processors that have data dependencies on one another synchronize and take a checkpoint together, and (2) all processors synchronize and take a checkpoint together.

[0069] In general, however, managing data dependencies between processors is difficult and incurs large overhead, so method (2), in which all processors synchronize and take a checkpoint together, is the one most often used.

[0070] In the multiprocessor system of this embodiment as well, all processors are assumed to synchronize and take a checkpoint together. This means that the only processes executing when a checkpoint is acquired are the dedicated checkpoint processes 21, and no other process is in the running state.

[0071] Consider, for example, the case where an intermittent fault occurs in a particular processor 10. In failure recovery, the system state is rolled back to the most recently acquired checkpoint, after which each processor simply resumes the dedicated checkpoint process 21 it was executing when the checkpoint was acquired.

[0072] Next, with reference to FIG. 8B, consider a fixed (permanent) failure in a particular processor 10. In this case too, failure handling rolls the system state back to the last acquired checkpoint, after which each processor resumes the checkpoint processing dedicated process 21 that it was executing when the checkpoint was acquired.

[0073] However, the processor 10 in which the fixed failure occurred (processor (1) in FIG. 8B) cannot execute anything. As a result, the checkpoint processing dedicated process 21 (i) that was running on that processor is no longer executed by any processor, and no other normal process is dispatched to the failed processor. Detaching the processor that caused the fixed failure (processor degradation, reconfiguration) is therefore easy to achieve.
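
The two recovery cases in paragraphs [0071] to [0073] can be sketched as follows. This is only an illustration under stated assumptions: the `Processor` class, the `recover` function, the `fixed_failed` parameter, and the dictionary-based state are all hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Processor:
    cpu_id: int
    state: dict = field(default_factory=dict)
    online: bool = True
    current: Optional[str] = None  # name of the process now running


def recover(processors, checkpoint, fixed_failed=frozenset()):
    """Roll back to the last checkpoint and resume on the healthy processors."""
    for p in processors:
        if p.cpu_id in fixed_failed:
            # A processor with a fixed failure executes nothing: its
            # checkpoint process is never resumed, so nothing runs on it.
            p.online = False
            p.current = None
            continue
        p.state = dict(checkpoint[p.cpu_id])              # roll back
        p.current = f"checkpoint_process_21({p.cpu_id})"  # resume process 21
    return [p.cpu_id for p in processors if p.online]


checkpoint = {cpu: {"pc": 100 + cpu} for cpu in range(4)}
processors = [Processor(cpu, {"pc": 999}) for cpu in range(4)]

# Intermittent failure: every processor rolls back and resumes.
after_intermittent = recover(processors, checkpoint)
# Fixed failure in processor (1), as in FIG. 8B: it is left out.
after_fixed = recover(processors, checkpoint, fixed_failed={1})
```

The point of the sketch is that the failed processor needs no special migration step: the only process bound to it at checkpoint time is its own checkpoint process, which is simply never resumed.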

[0074] (Second Embodiment) FIG. 9 shows the schematic configuration of a computer system according to a second embodiment of the present invention.

[0075] Reference numeral 10 denotes a processor, which executes the software shown at 20, including the operating system. Reference numeral 24 denotes a dispatcher, which has a checkpoint processing unit 25. The dispatcher 24 does not call the checkpoint processing unit 25 every time; it calls it only when the checkpoint processing execution instructing unit 26 has instructed execution of checkpoint processing. The checkpoint processing execution instructing unit 26 issues this instruction when a checkpoint acquisition condition is satisfied.

[0076] When the checkpoint processing execution instructing unit 26 instructs execution of checkpoint processing, the dispatcher 24 calls the checkpoint processing unit 25.

[0077] This instruction can be implemented, for example, by using a software variable as a flag: the flag is set (to 1) when execution of checkpoint processing is instructed. The dispatcher 24 then only has to call the checkpoint processing unit 25 when this flag is set (1).

[0078] Next, the operating procedure of the computer system of this embodiment is described with reference to FIGS. 10 to 12. FIG. 10 is a flowchart showing the processing flow of the dispatcher 24 provided with the checkpoint processing unit 25.

[0079] The dispatcher 24 provided with the checkpoint processing unit 25 checks whether an instruction from the checkpoint processing execution instructing unit 26 has been posted (step D1). If execution of the checkpoint processing unit 25 has been instructed, the dispatcher clears the execution instruction (step D2) and then takes a checkpoint (step D3).

[0080] When the checkpoint has been taken, normal dispatcher processing is executed (step D4). That is, a high-priority process is selected and the processor is made to execute it.
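
Steps D1 to D4 can be sketched as a small dispatcher class. This is a minimal illustration rather than the patented implementation: the class and attribute names, the use of a boolean as the flag, the `(priority, name)` ready list, and the convention that a larger number means a higher priority are all assumptions.

```python
class Dispatcher:
    """Hypothetical model of dispatcher 24 with checkpoint processing unit 25."""

    def __init__(self, get_state):
        self.checkpoint_requested = False  # flag set by instructing unit 26
        self.get_state = get_state         # callable returning the state to save
        self.checkpoints = []              # acquired checkpoints, oldest first
        self.ready = []                    # list of (priority, process_name)

    def request_checkpoint(self):
        # Called when the checkpoint acquisition condition is satisfied.
        self.checkpoint_requested = True

    def dispatch(self):
        if self.checkpoint_requested:                        # step D1
            self.checkpoint_requested = False                # step D2
            self.checkpoints.append(dict(self.get_state()))  # step D3
        # Step D4: normal dispatching, pick the highest-priority ready process.
        if not self.ready:
            return None
        best = max(self.ready, key=lambda entry: entry[0])
        self.ready.remove(best)
        return best[1]


state = {"counter": 1}
dispatcher = Dispatcher(lambda: state)
dispatcher.ready = [(1, "editor"), (5, "logger")]

first = dispatcher.dispatch()    # no request pending, so no checkpoint is taken
dispatcher.request_checkpoint()
second = dispatcher.dispatch()   # checkpoint is taken before dispatching
```

Because the checkpoint is taken inside `dispatch`, before any process is selected, no ordinary process is running at the moment the state is saved.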

[0081] FIG. 11 is a flowchart showing the flow of the processing that gives notice that the checkpoint acquisition condition has been satisfied and instructs the dispatcher 24 to execute the checkpoint processing unit 25.

[0082] As shown in FIG. 11, when the checkpoint acquisition condition is satisfied, processing that gives notice of this fact is executed by means of an interrupt handler or a subroutine call (step E1). This processing instructs execution of the checkpoint processing unit 25.

[0083] As described above, this execution instruction can be implemented, for example, by using a software variable as a flag: the flag is set (to 1) when execution of the checkpoint processing unit 25 is instructed. The dispatcher 24 then only has to call the checkpoint processing unit 25 when this flag is set (1).

[0084] FIG. 12 is a flowchart showing the processing flow for re-execution from a checkpoint acquired by the checkpoint processing unit 25. In this case, the state of the last acquired checkpoint is first restored (step F1), and the dispatcher 24 is then called from its beginning (step F2).
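
Steps F1 and F2 can be sketched as follows. The sketch assumes, purely for illustration, that the whole system state fits in one dictionary containing a `ready` table of process priorities; the functions `dispatcher` and `restart_from_checkpoint` are hypothetical names, and a larger number is assumed to mean a higher priority.

```python
import copy


def dispatcher(system):
    """Hypothetical dispatcher: run the highest-priority ready process."""
    ready = system["ready"]
    if not ready:
        return None
    name = max(ready, key=ready.get)
    del ready[name]
    return name


def restart_from_checkpoint(system, checkpoints, dispatch):
    """Restore the last checkpoint (step F1), then enter the dispatcher (step F2)."""
    if not checkpoints:
        raise RuntimeError("no checkpoint has been acquired")
    system.clear()
    system.update(copy.deepcopy(checkpoints[-1]))  # step F1
    return dispatch(system)                        # step F2


system = {"counter": 10, "ready": {"editor": 2, "logger": 7}}
checkpoints = [copy.deepcopy(system)]  # checkpoint taken inside the dispatcher

# Work done after the checkpoint is lost when a failure occurs.
system["counter"] = 99
system["ready"].clear()

resumed = restart_from_checkpoint(system, checkpoints, dispatcher)
```

Since the checkpoint was taken inside the dispatcher, the restored state is one in which no process was mid-execution, so restarting only requires entering the dispatcher from its beginning.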

[0085] The operation of the computer system of this embodiment is now described with reference to FIG. 13. If an interrupt indicating that the checkpoint acquisition condition has been satisfied occurs while the processor is executing an arbitrary process ((1) in FIG. 13), execution of the checkpoint processing unit 25 is instructed via the checkpoint processing execution instructing unit 26 ((2) in FIG. 13).

[0086] Control then returns to the original process. When one execution unit of that process ends, the dispatcher 24 is called. Because execution of the checkpoint processing unit 25 has already been instructed, the dispatcher 24 has the checkpoint processing unit 25 execute, and a checkpoint is acquired ((3) in FIG. 13).

[0087] When acquisition of the checkpoint is finished, the dispatcher 24 returns to normal dispatcher processing, in which it selects a high-priority process and has the processor 10 execute it ((4) in FIG. 13).

[0088] As described above, when the dispatcher 24 provided with the checkpoint processing unit 25 is used, checkpoints are always acquired inside the dispatcher and nowhere else, so lock runout at restart is unnecessary. The operating system, including its checkpoint acquisition function, can therefore be greatly simplified, and the construction cost can be greatly reduced.

[0089] Therefore, according to the computer systems of the above embodiments, a checkpoint is always acquired either inside the checkpoint acquisition process or inside the dispatcher. Consequently, no other process is ever executing when a checkpoint is acquired. The "coherent units" of processing described above, which had to be taken into account in the conventional checkpoint acquisition method, no longer need to be considered, and the operating system can be greatly simplified. As a result, the cost of building or modifying the system can be greatly reduced.

[0090] Furthermore, in a multiprocessor system, even when a fixed failure occurs in some of the processors, suppressing execution of the checkpoint processing dedicated process for the failed processor makes it easy to reconfigure the system (processor degradation, detachment).

[0091]

[Effects of the Invention] As described in detail above, according to the present invention, providing a process dedicated to checkpoint processing makes lock runout processing at restart unnecessary and greatly reduces the construction cost. Furthermore, in a multiprocessor system having a checkpoint restart function, processing can be continued on the remaining processors even when a failure occurs in some of the processors.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the hardware configuration of a computer system according to an embodiment of the present invention.

FIG. 2 is a functional block diagram for explaining the functions of the software stored in the memory of the computer system according to the same embodiment.

FIG. 3 is a flowchart showing the processing flow of the checkpoint processing dedicated process according to the same embodiment.

FIG. 4 is a flowchart showing the flow of the processing that wakes up the checkpoint processing dedicated process according to the same embodiment.

FIG. 5 is a flowchart showing the processing flow for re-execution from a checkpoint acquired by the checkpoint processing dedicated process according to the same embodiment.

FIG. 6 is a diagram for explaining the operation of the computer system according to the same embodiment.

FIG. 7 is a diagram for explaining the case where the checkpoint processing dedicated process in the sleep state is put into the ready state according to the same embodiment.

FIG. 8 shows a block diagram of the configuration of the multiprocessor system according to the same embodiment (8A) and a diagram for explaining the operation of that multiprocessor system when a fixed failure occurs in it (8B).

FIG. 9 is a block diagram showing the configuration of a computer system according to the second embodiment.

FIG. 10 is a flowchart showing the processing flow of the dispatcher provided with the checkpoint processing unit according to the same embodiment.

FIG. 11 is a flowchart showing the flow of the processing that instructs the dispatcher to execute the checkpoint processing unit according to the same embodiment.

FIG. 12 is a flowchart showing the processing flow for re-execution from a checkpoint acquired by the checkpoint processing unit according to the same embodiment.

FIG. 13 is a diagram for explaining the operation of the computer system according to the same embodiment.

FIG. 14 is a diagram for explaining the checkpoint/rollback function of the computer according to the same embodiment.

FIG. 15 is a diagram showing a conventional computer system executing checkpoint processing in the course of its normal processing.

FIG. 16 is a diagram showing a conventional computer system in which a failure occurs while processing proceeds with checkpoints being acquired, and execution is restarted from the last checkpoint.

FIG. 17 is a diagram showing an example of how lock class levels are set in the conventional system.

FIG. 18 is a diagram for explaining the occurrence of a deadlock.

FIG. 19 is a diagram for explaining lock runout processing.

[Explanation of symbols]

10: processor, 12: memory, 13: BIB, 21: checkpoint processing dedicated process, 22: sleep unit, 23: wakeup unit, 24: dispatcher, 25: checkpoint processing unit.

Claims

(57) [Claims]

1. A computer system comprising: at least one processor; a checkpoint processing dedicated process that is provided for the processor and acquires a checkpoint for restarting processing interrupted by a failure; interrupt means for interrupting a process being executed and changing the checkpoint processing dedicated process from a standby state to an executable state; dispatch means for dispatching the checkpoint processing dedicated process made executable by the interrupt means; and standby state transition means for returning the checkpoint processing dedicated process to the standby state after the dispatched checkpoint processing dedicated process has acquired a checkpoint.

2. The computer system according to claim 1, wherein the interruption processing by said interruption means is performed after a checkpoint acquisition condition is satisfied.

3. The computer system according to claim 2, wherein the checkpoint acquisition condition is satisfied when a checkpoint acquisition is instructed in the code of the processor.

4. The computer system according to claim 2, wherein the checkpoint acquisition condition is satisfied after a predetermined time has elapsed after the checkpoint is acquired by the checkpoint processing dedicated process.

5. The computer system according to claim 2, wherein the checkpoint acquisition condition is determined by the amount of image data stored in a before-image buffer that collects image data from before the memory is updated.

6. The computer system according to claim 2, wherein the checkpoint acquisition condition is determined by the amount of image data stored in an after-image buffer that collects image data from after the memory is updated.

7. The computer system according to claim 1, wherein the dispatching of the process dedicated to the checkpoint process by the dispatching unit is performed by a time sharing process.

8. The computer system according to claim 1, further comprising restoring means for, when a temporary failure occurs in any of the processors, restoring the state of the processor to its state at the time of the last checkpoint acquired by the checkpoint processing dedicated process.

9. The computer system according to claim 8, wherein, after the state of the processor is restored by the restoring unit, the process of the processor is executed with the checkpoint processing dedicated process as a current process.

10. The computer system according to claim 1, further comprising restoring means for, when a fixed failure occurs in any of the processors, restoring the state of the processors other than the processor in which the fixed failure has occurred to their state at the time of the last checkpoint acquired by the checkpoint processing dedicated process.

11. The computer system according to claim 10, wherein, after the state of the processors is restored by the restoring means, processing on the processors other than the processor in which the fixed failure has occurred is executed with the checkpoint processing dedicated process as the current process.

12. A computer system comprising: at least one processor; checkpoint processing execution instructing means for instructing, when a checkpoint acquisition condition is satisfied, acquisition of a checkpoint for restarting processing interrupted by a failure or the like; a checkpoint processing unit that is provided in a dispatcher of the operating system, is called by the dispatcher when the checkpoint processing execution instructing means instructs the dispatcher to acquire a checkpoint, and acquires each checkpoint corresponding to the processor; and standby state transition means for returning the checkpoint processing unit to a standby state after the checkpoint is acquired.