JP2000501216A

JP2000501216A - Main memory system and checkpointing protocol for fault-tolerant computer systems using read buffers

Info

Publication number: JP2000501216A
Application number: JP9-522061A
Authority: JP
Inventors: Jack J. Stiffler
Original assignee: Texas Micro, Inc.
Priority date: 1995-11-29
Filing date: 1996-11-27
Publication date: 2000-02-02

Abstract

(57) Abstract: A mechanism is provided for maintaining a consistent, periodically updated state in main memory without constraining normal computer operation, thereby enabling a computer system to recover from faults without loss of data or processing continuity. In the present invention, a first computer includes a processor and input/output elements connected to a main memory subsystem that includes a primary memory element. A second computer has a remote checkpoint memory element, which may include one or more buffer memories and a shadow memory, connected to the main memory subsystem of the first computer. During normal processing, an image of the data written to the primary memory element is captured by the remote checkpoint memory element. When a new checkpoint is desired (establishing a consistent state in main memory to which all running applications can safely return after a fault), the previously captured data is used to establish the new checkpoint state in the second computer. When the first computer fails, the second computer can be restarted and can operate from the last checkpoint established for the first computer. This structure and protocol can guarantee a consistent state in main memory, thus enabling fault-tolerant operation.

Description

【発明の詳細な説明】リード・バッファを用いたフォールト・トレラント・コンピュータ・システム用主メモリ・システムおよびチェックポインティング・プロトコル発明の分野本発明は、特にフォールト・トレラント・コンピュータ・システムのための、コンピュータ・メモリ・システムおよびチェックポインティング・プロトコル(c heckpointing protocol)に関するものである。発明の背景コンピュータにおけるフォールト・トレランスは、通常、マスキングと呼ばれるハードウエア集約的な技法、またはチェックポインティングと呼ばれるソフトウエアに基づく手法のいずれかによって実現されている。マスキングを達成するには、同一ハードウエアを複数系統備え、コンピュータ・プログラムを数個の独立した装置で並列に実行する。次に、これら装置の出力を比較し、それらの有効性を判定する。この技法の最も単純かつ古い実施形態では、３台の完全なコンピュータを装備し、それらの出力に単純な多数決方式を用いて、「正しい」出力を判定する。これらのコンピュータの内少なくとも２台が適正に動作しており、投票システム自体が正しく稼働している場合、誤動作しているコンピュータの潜在的に正しくない可能性がある出力は排除(outvote)され、実際には正しい答えがユーザに提示される。これよりはいくらか効率的なマスキングの別の実施形態もあるが、マスキング・システムは通常、障害を発生した構成要素の影響を排除するためにハードウエアを追加しなければならないため、コストが著しく増大するという問題がある。加えて、マスキングは、ハードウエアの障害に対する保護を行うに過ぎない。１つの装置に誤動作を発生させるソフトウエアのバグは、同じソフトウエアを実行する他の装置にも、同様に誤動作を発生させる。全ての出力が同じエラーを含み、その結果、このエラーは検出されずに通過してしまうことになる。チェックポインティングと呼ばれる代替技法は、格段にコスト効率が高い方法で、障害に対する耐性を与える潜在的可能性を有する。この技法は、コンピュータ全体の状態を、周期的に、チェックポイントとして指定した時間間隔で記録する必要がある。障害は、ハードウエアの障害モニタによって（例えば、エラー検出コードを用いてエンコードされたデータに作用するデコーダ、温度または電圧センサ、あるいは別の同一の装置を監視する１つの装置によって）検出するか、またはソフトウエアの障害モニタ（例えば、データ構造内のスタック・ポインタまたはアドレス上で、範囲外状態をチェックする実行コードの一部として実行されるアサーション(assertion))によって検出することができる。障害が検出された場合、回復するには、まず最初に診断を行い、可能であれば誤動作装置を迂回し、次いでシステムを最後のチェックポイントに戻し、このポイントから正常な動作を再開することが必要である。回復が可能なのは、障害を発生したと識別されたあらゆる要素を、回復過程の間に迂回し、その後に十分なハードウエアが動作可能状態であり続ける場合である。例えば、マルチプロセッサ・システムでは、プロセッサの内少なくとも１つが機能し続ける限り、システムは動作し続けることができる。同様に、メモリのリマップを行うことができるシステム、あるいは代替ポートを通じてＩ／Ｏを割り当てなおすことができるシステムは、同様に、メモリまたはＩ／Ｏ資源の損失を克服することができる。更に、コンピュータ・システムにおいて見られる殆どの障害は、性質上瞬時的または間欠的であり、それら自体が一時的なグリッチ(g litch)に過ぎない。したがって、通常は、ハードウエアの迂回を全く行うことなく、かかる障害からの回復は可能である。しかしながら、瞬時的な障害および間欠的な障害は、永続的な障害と同様、障害時に操作されているデータを変転させる可能可能性があるので、かかるイベントの後にコンピュータが常に戻ってくる状態を有する必要がある。これが、周期的なチェックポイント状態(checkpointe d 
state)の目的である。チェックポイントは、典型的に５０ミリ秒程度毎に設けられているので、実行中のプログラムをその最後のチェックポイントまで後退させることは、通常ユーザには完全に透過的（トランスペアレント）である。適正に処理すれば、連続性の損失やデータの汚染(contamination)を発生することなく、全てのアプリケーションをその最後のチェックポイントから再開することができる。チェックポインティングには、マスキングに比較して、２つの主要な利点がある。第１に、チェックポインティングは、実装にかかる費用が非常に少なくて済む。第２にチェックポインティングは、ハードウエア障害だけでなくソフトウエア障害に対する保護も提供する。第１の利点は、単純に、チェックポインティングは大量の同一ハードウエアの装備を必要としないという事実を反映したに過ぎない。第２の利点は、十分に検査され完成度の高いソフトウエアにおいては、殆どのソフトウエア・バグは例外的な状況においてのみ露見されるだけであるという事実の結果である。これが正しくなければ、バグは通常の検査時に発見され、除去されるであろう。かかる例外的な状況は、一般的に、非同期的なイベントによって発生する。非同期的なイベントとは、割り込みが発生して、あるシーケンスに続いてプログラムの実行を強制するが、割り込みが発生しなければそのシーケンスに続いて実行するようなことはない場合である。システムを一貫性のある状態に強制的に戻し、動作させ続けた場合、即ち、ソフトウエア・バグをハードウエアの過渡現象(transient)として扱った場合、システムが以前と正確に同じ状態で正確に同じ例外に遭遇する可能性は非常に低い。その結果、同じバグに２回遭遇する可能性は非常に低い。また、チェックポインティングには、マスキングに比較して、２つの潜在的な欠点がある。第１に、マスキングは通常障害から瞬時的またはほぼ瞬時的に回復する。結果的に発生するあらゆるエラーは、単純にマスクしてしまうので、明示的な回復は不要である。チェックポインティングは、ある種のソフトウエア・ルーチンを実行し、問題を診断し、コンピュータの永続的に誤動作を発生するあらゆる構成要素を迂回する必要がある。その結果、回復に要する時間は、典型的に、１秒程度であり、応答時間がミリ秒未満の単位であることを要求するリアル・タイム・アプリケーションでは、フォールト・トレランスを達成するためにこの技法を用いることができない場合がある。しかしながら、人が直接コンピュータと双方向処理を行う用途、例えば、トランザクション処理の用途では、１秒程度の一時的な割り込みは、問題なく容認可能であり、実際、通常では気付かれもしない。したがって、このチェックポインティングの潜在的な欠点は、この種の用途には無関係である。第２に、チェックポインティングは、従来より、アプリケーション・レベルで達成されていた。したがって、アプリケーション・プログラマは、どのデータについてチェックポイント処理を行うのか、いつそれを行うべきかについて関与しなければならなかった。この要求は、プログラマにとっては重大な負担であり、フォールト・トレランスを達成する手段としての、チェックポインティングの使用普及を著しく妨げていた。近年になって、システム・ソフトウエア・レベルでチェックポインティングを可能にする技法が開発されたので、アプリケーション・プログラマは、チェックポイント処理対象とすべきデータを識別しようとすることに気を使うことはなくなり、チェックポインティングが行われることを知る必要すらなくなった。これを可能にするには、システム自体が、実行させ得るアプリケーションには無関係に、周期的なチェックポイントを設けることができなければならない。Stiffler の米国特許第４，６５４，８１９号および第４，８１９，１５４号は、正しくこれを行うことができるコンピュータ・システムについて記載するものである。このシステムは、この種のチェックポインティングを達成するために、そのプロセッサの各々に、新しいチェックポイントを確立し、全ての変更データを主メモリに放出(flush out)できるようになるまで、全ての変更データをそのローカル・キャッシュに保持することを要求する。このようなキャッシュのことを、時としてブロッキング・キャッシュ(blocking cashe)と呼ぶこともある。プロセッサは、そのブロッキング・キャッシュを消去す前に、内容切り替え(context switch) 
を行い、この間に、そのプログラム・カウンタを含むその内部レジスタの内容を、スタック上に置き、このスタックを他の変更データ全てと共に放出する。その結果、内部的に一貫性のあるデータによって、一度でメモリを更新することにより、その後システムに障害が発生した場合でも、システムが安全に戻ることができるチェックポイントを確立する。主メモリ障害および放出動作自体の間に発生する障害の双方を克服する機能を保証するためには、メモリを２系統備え、各データ項目を、主要位置およびシャドウ位置 (shadow location)の双方に格納する。この技法は、アプリケーション・プログラマに負担をかけずに、チェックポイントを確立するという目標は達成するものの、そのブロッキング・キャッシュの使用に依存することによる、ある種の欠点を有する。プロセッサは、現在変更されているラインを同時に全て書き戻す場合以外は、いずれのキャッシュ・ラインも主メモリに書き戻すことができないので、キャッシュのオーバーフローが発生したとき、またはあるプロセッサによって他のプロセッサのキャッシュに保持されているデータに対する要求が行われたときはいつでも、データを放出しているプロセッサに、そのキャッシュ全体を書き出すように要求することなる。この要件は、標準的なキャッシュ・コヒーレンシ・プロトコル（例えば、Gallagherの米国特許第５，２７６，８４８号に記載されているプロトコル）の使用を妨げ、プログラムがかかる標準的プロトコルに基づいて実行される場合、潜在的なポーティング(porting)や性能上の問題を生ずる。例えば、Kirrmann（米国特許第４，９０５，１９６号）およびLee et al.（"A Recovery Cache for the PDP-11"（ＰＤＰ−１１用回復キャッシュ），IEEETran s.on Computers，１９８０年６月）によって、チェックポインティングの目的のためにデータを捕獲する別の方法が提案されている。Kirrmannの方法は、カスケード状メモリ格納素子を用いる。これは、主メモリと、それに続く２つのアーカイブ・メモリから成り、各アーカイブ・メモリは主メモリと同じサイズとなっている。主メモリへの書込みは、プロセッサによってライト・バッファにも行われる。チェックポイントを確立する時刻になった場合、バッファされたデータをプロセッサがまずアーカイブ・メモリの一方にコピーし、次いで第２のアーカイブ・メモリにコピーする。しかし、これらのコピーの一方の必要をなくする技法についても記載されている。２つのアーカイブ・メモリは、バッファからメモリへのコピーが行われている最中に障害が発生しても、それらの少なくとも一方が有効なチェックポイントを含むことを保証する。このアーキテクチャに伴う問題には、３系統のメモリを備えなければならないこと、アーカイブ・メモリのために速度の遅いメモリを使用すること、および３つのメモリ素子が同一バス上の異なるポートとなるためのプロセッサの処理能力に影響が及ぶことが含まれる。 Lee et al．による論文では、アプリケーションによって特定されるアドレス範囲内に該当する全てのメモリ位置について、更新データがメモリに書き込まれる前に、データを回復キャッシュにセーブする方法が論じられている。この方法は、アプリケーションによって指定される範囲内のメモリに対する全てのライトを、ライト前リード動作(read-before-write 
operations)に変換する。アプリケーションの実行の間に障害が発生した場合、回復キャッシュの内容を主メモリに格納し戻すことによって、アプリケーションがその現実行を開始した時点における状態に、それを回復する。この方法の問題の１つに、ライト後リード動作によるメモリ・サイクルの干渉のために、ホスト・システムの速度低下を招き、これによってバス・プロトコルの変更が余儀なくされることがあげられる。また、これもアプリケーション・プログラマがチェックポインティングの処理または考慮に関与することを要求する。主メモリ以外に、ディスク上にデータのミラー(mirror)を作成する、別の技法が開発されている。ディスクのアクセスは主メモリのアクセスより数桁遅いので、このような方式は、データ・ファイルのミラー作成に限定されている。即ち、障害によってこれらのファイルへの主要アクセス経路が絶たれた場合に、バックアップをディスク・ファイルに供給する場合に限定されている。システムのユーザに対して透過的に、プログラムの連続性を保持したり、あるいは実行中のアプリケーションを回復する試みはなされていない。場合によっては、ミラー・ファイル同士の一貫性を保つことを保証するのでさえ不可能であり、それらは同一ファイルの別のコピーと一貫性があるに過ぎない。米国特許第５，２４７，６１８号は、かかる方式の一例を開示している。発明の概要本発明の実施形態は、コンピュータ・システムにおいて、従来のキャッシュ・コヒーレンシ・プロトコルおよび非ブロックキング・キャッシュの使用を可能にしつつ、コンピュータ・システムの主メモリ内において、一貫した周期的更新チェックポイント状態を維持する装置およびプロセスを提供する。本発明の実施形態は、１つ以上の論理ポートを介してプロセッサによってアクセスされ、基本メモリ素子およびチェックポイント・メモリ素子が双方ともこのポートに結合された、主メモリを提供する。基本メモリ素子は、標準的な主メモリと同様にアクセスされる。チェックポイント・メモリ素子は、検出可能な主メモリへの書き込みを捕獲(capture)する。この書込みが検出可能なのは、チェックポイント・メモリ素子が基本メモリ素子と同一ポートに接続されているからである。次に、捕獲された書込みを用いて、主メモリにおける一貫性のあるチェックポイント状態の存在を保証する。このように適切な検出および迂回手順を有するコンピュータ・システムは、データの保全性または処理の連続性を損なうことなく、障害からの回復が可能である。本発明の一実施形態では、主コンピュータから離れて位置するバックアップ・コンピュータに、バッファ・メモリと第２のランダム・アクセス・メモリ素子とを含む、遠隔チェックポイント・メモリ素子が設けられる。バックアップ・コンピュータおよび主コンピュータは、チェックポイント通信リンクによって接続されている。通常の処理の間、主コンピュータ内の主メモリに書き込まれたデータは、専用のチェックポイント通信リンクを通じて、バックアップ・コンピュータ内の遠隔チェックポイント・メモリ素子内のバッファ・メモリにも送出される。チェックポイントを確立すべきとき、バッファ・メモリ内に既に捕獲されているデータは、バックアップ・コンピュータ・システムの遠隔チェックポイント・メモリ素子のランダム・アクセス・メモリにコピーされる。主コンピュータの障害の場合、バックアップ・コンピュータは、主コンピュータが以前に処理していたアプリケーションの処理を引き継ぐ。バックアップ・コンピュータは、遠隔チェックポイント・メモリ素子のシャドウ・メモリ素子内に格納されているチェックポイント状態から開始して、アプリケーションを処理する。本発明の他の実施形態では、遠隔チェックポイント・バッファは、論理リング状にコンピュータを構成し、各コンピュータがその隣接するコンピュータの一方のためのバックアップとして機能することにより、Ｎ＋１の冗長性が得られるようにした。本発明によるシステムでは、入出力（Ｉ／Ｏ）動作は、通常以下のように処理される。通常動作の間、Ｉ／Ｏ要求はいずれかの標準的な方法で行われ、オペレーティング・システムによって適切なＩ／Ｏキューに入力される。しかしながら、実際の物理的Ｉ／Ｏ動作は、次のチェックポイントまで開始されない。したがって、障害および続くチェックポイント状態への後退の場合、全ての保留のＩ／Ｏ動作もチェックポイント処理の対象となる。ディスクおよびその他のアイデンポネント(idempotent) 
Ｉ／Ｏ動作は、単に再起動することができる。通信Ｉ／Ｏ動作の適切な処置は、通信プロトコルによって異なる。可能なメッセージの複製に対処するプロトコルでは、保留のＩ／Ｏを再起動することができる。欠落したメッセージを処理するプロトコルでは、Ｉ／Ｏを保留のキューから削除することができる。欠落メッセージも繰り返しメッセージも処理しないプロトコルでは、保留のＩ／Ｏは保留キューから削除される。障害の前にメッセージが実際に送出されなかった場合、または障害の結果として中止された場合、過渡通信リンク障害と影響は同一であり、同じ結果がアプリケーションまたはユーザにもたらされる。通信リンク割り込みは、通常、コンピュータ障害よりもかなり多く発生するので、かかるイベントを透過的にすることができないプロトコルの使用は、おそらく、ユーザまたはアプリケーションは、いずれにせよ、それらと対処する準備がなされていることを意味する。ここに記載する機構は、コンピュータが障害に続いて動作を再開することができる、一貫したチェックポイント状態の存在を保証することができ、こうしてフォールト・トレラント・コンピュータ・システム動作に対応にする。図面の簡単な説明本発明をよりよく理解するために、図面を参照する。図面は、この言及により本願にも含まれるものとする。図１は、本発明の一実施形態の主メモリ構造を用いた、コンピュータ・システムのブロック図である。図２は、本発明の一実施形態による遠隔チェックポイント・バッファを利用した、フォールト・トレラント・コンピュータ・システムのブロック図である。図３は、図２の主コンピュータ・システムおよびスタンバイ・コンピュータ・システムを詳細に示すブロック図である。図４は、Ｎ＋１冗長性を利用した、本発明の一実施形態による、遠隔チェックポイント・メモリ方式のブロック図である。図５は、主メモリの一貫性を維持するために、処理ユニットが使用するメモリ位置の図である。図６のＡは、各処理ユニットが、いかにしてそのキャッシュの放出を制御し、主メモリの一貫性を維持するのかを記述するフローチャートである。図６のＢは、各処理ユニットがそのキャッシュの主メモリへの放出を制御する際に用いる別の方法を記述するフローチャートである。詳細な説明本発明は、添付図面と関連付けて読むべき、以下の詳細な説明によって一層深く理解されよう。尚、添付図面では、同様の参照番号は同様の構造を示すものとする。１９９４年６月１０日に出願された、同一出願人の同時係属中の米国特許出願番号第０８／２５８，１６５号を引用する。この言及により、これは本願にも含まれるものとする。図１は、本発明の使用が概ね可能なコンピュータ・システム１１のブロック図である。１つ以上の処理素子１４および１６が、バスまたは交差点スイッチのような相互接続機構１０および１２を介して、１つ以上の主メモリ・システム１８および２０に接続されている。１つ以上の入出力（Ｉ／Ｏ）サブシステム２２および２４も、相互接続機構１０（１２）に接続されている。各Ｉ／Ｏサブシステムは、入出力（Ｉ／Ｏ）素子またはブリッジ２６（２８）、および１系統以上のバス３０および３２（３４および３６）から成る。Ｉ／Ｏ素子２６（２８）も、ＶＭＥバスのような、いずれかの標準的なＩ／Ｏバス３８（４０）に接続することができる。記載を簡単にするために、以下では、これらのシステム群およびサブシステム群は、各々その１つのみについて言及することにする。各処理素子、例えば、１４は、キャッシュ４２に接続された処理ユニット４４を含む。この接続は、処理ユニット４４およびキャッシュ４２を相互接続機構１０に接続するものでもある。処理ユニット４４は、いずれかの標準的なマイクロプロセッサ・ユニット（ＭＰＵ：microprocessorunit）とすればよい。例えば、Intel 
Corporationから入手可能なPENTIUMマイクロプロセッサは、この目的に適している。処理ユニット４４は、従来と同様、いずれかの適切なオペレーティング・システムにしたがって動作する。処理素子１４は、自己検査の目的のために、二重処理ユニット４４を含んでもよい。キャッシュ４２は、ライト・スルーまたはライト・バック型のキャッシュであり、任意のサイズおよび連想性(associativity)を有し、１キャッシュ・レベル以上の階層構造から成るものとしてもよい。処理ユニット４４は、キャッシュ４２内に、データのみを格納することも、コンピュータ・プログラムの命令およびデータ双方を格納することも可能である。前者の場合、同様の命令キャッシュ４３を追加として処理ユニット４４に接続し、処理ユニット４４がコンピュータ・プログラム命令を格納するようにしてもよい。この接続は、命令キャッシュ４３を相互接続機構１０に接続するものでもある。このシステムが多重処理コンピュータ・システムである場合、各処理ユニット４４は、バス・スヌーピング(bus s nooping)のような、従来のいずれかの機構を用いてキャッシュ・コヒーレンシを保持することができる。キャッシュ４２は、例えば、相互接続機構１０または１２を介して、主メモリ・システムに接続されている。遠隔チェックポインティング・バッファを利用した、本発明によるフォールト・トレラント・コンピュータ・システムの一実施形態を図２に示す。図２に示す実施形態では、主コンピュータ・システム１１０が、高速データ・リンク１５０を通じてスタンバイ・コンピュータ・システム１２０に結合されている。スタンバイ・コンピュータ・システムは、当該スタンバイ・コンピュータ・システムのメモリ内に遠隔的にチェックポイントを確立するために用いられる。図２に示すように、主コンピュータ・システム１１０およびスタンバイ・コンピュータ・システム１２０は、例えば、イーサネット・システム１４０を通じて相互接続してもよく、また共通のデュアル・ポート・ディスク・アレイ１３０ならびにその他のデュアル・ポート記憶および通信装置を共有してもよい。本実施例の好適なバージョンでは、主コンピュータ・システム１１０およびスタンバイ・コンピュータ・システム１２０は、実質的に同一コンピュータである。図３は、主コンピュータ・システム１１０の一実施形態を更に詳細に示す。スタンバイ・コンピュータ１２０は、図３に示す主コンピュータ・システムと同じ構造を有することは理解されよう。主コンピュータ・システム１１０は、メモリ・サブシステム１１２を含み、このメモリ・サブシステム１１２は、主メモリ１１３、ライト・バッファ１１６およびメモリ・サブシステム１１２内のデータ転送を制御するメモリ制御ロジック１１７から成る。メモリ・サブシステム１１２は、メモリ・バス１１４に結合されている。図１の記載によれば、図２および図３の代表的な主コンピュータ・システム１１０は、メモリ・バス１１４（またはその他の接続形態）に結合された１つ以上のプロセッサ１１８、遠隔チェックポイント・インターフェース１２２、Ｉ／Ｏバス１２４、イーサネット・インターフェース１２６、およびＳＣＳＩコントローラ１２８として示されている外部記憶インターフェースも含む。Ｉ／Ｏバスは、Ｉ／Ｏブリッジ１２５を介して、メモリ・バスに結合されている。遠隔チェックポイント・インターフェース１２２は、バッファ・メモリ１３２およびインターフェース・コントローラ１３４を含む。遠隔チェックポイント・インターフェースは、接続部１４４を介して、メモリ１１２に結合されている。インターフェース・コントローラ１３４は、高速Ｉ／Ｏポート１３１に結合され、主コンピュータ・システムを高速データ・リンク１５０に接続する。ＳＣＳＩコントローラ１２８は、ＳＣＳＩＩ／Ｏポート１２９に結合され、主コンピュータ・システムをデュアル・ポート・ディスク・アレイ１３０に接続する。イーサネット・インターフェースは、イーサネットＩ／Ｏポート１２７に結合され、主コンピュータ・システムをイーサネット１４０に接続する。このシステムを用いて、障害の後に一貫性のある状態を主メモリ内に維持するプロセスについて、これより説明する。米国特許第４，６５４，８１９号のようなシステムとは対照的に、このプロセスは、処理ユニット１４のキャッシュ４２全体を放出する必要なく、一方の処理素子１４から他の処理素子１６にデータを渡すことを可能にする。コンピュータ・システム１１内の全ての処理ユニット４４が、主メモリへの全てのバス即ち通信経路に対するアクセスを有する場合、各処理ユニット４４は、従来のバス・スヌーピング法を用いて、キャッシュのコヒー
レンシを保証することができる。全ての処理ユニット４４が全てのシステム・バスに対してアクセスを有するのではない場合、処理ユニット４４は、他の公知のキャッシュ・コヒーレンシ技法(cachec oherency technique)を代わりに用いることができる。この実施形態の動作を、ここでは遠隔チェックポインティング(remotecheckpo inting)と呼び、図２および図３に示す。これは、チェックポインティングが局所的に行われる、先に引用した本出願人の同時係属中の特許出願に記載したものと概略的には同様である。しかしながら、遠隔チェックポインティングでは、スタンバイ・コンピュータ・システム１２０の主メモリ１１３が、主コンピュータ・システム１１０の主メモリ１１３のためのチェックポイント・メモリとして機能し、最後のチェックポイントにおける主コンピュータ・システムの主メモリ１１３の状態を含む。主コンピュータ・システムの主メモリ１１３に書き込まれるデータも、メモリ制御ロジック１１７の制御の下で、主コンピュータ・システムのライト・バッファ１１６に捕獲される。なお、アクセスには、部分的なキャッシュ・ラインまたは全体的なキャッシュ・ラインを伴うものがあることは理解されよう。ライト・バッファは十分なサイズを有し、バス１４４を通じたデータの転送に伴うあらゆる遅延にも対応するように、データのバッファリングを行う。データは、ライト・バッファ１１６から主コンピュータ・システムの遠隔チェックポイント・インターフェース１２２に転送され、ここで追加のバッファリングが行われ、高速データ・リンク１５０に伴うあらゆる遅延に対応する。次に、データは高速データ・リンク１５０を通じて、スタンバイ・コンピュータ・システム１２０の遠隔チェックポイント・インターフェース１２２に転送され、スタンバイ・コンピュータ・システムのバッファ・メモリ１３２に格納される。主コンピュータ・システムの主メモリ１１３に対する全ての書込み（キャッシュ放出の間にメモリに書き込まれるラインを含む）は、したがって、それらの物理的メモリ位置と共に、スタンバイ・コンピュータ・システムのバッファ・メモリ１３２にも書き込まれる。主コンピュータ・システムのプロセッサ１１８、およびメモリ・バス１１４に結合されたいずれかの追加のプロセッサがキャッシュ放出を完了したとき、主コンピュータ・システム内のオペレーティング・システムは、遠隔制御インターフェースを通じて、スタンバイ・コンピュータ・システムに通知する。スタンバイ・コンピュータ・システムのバッファ・メモリ１３２の内容は、スタンバイ・コンピュータ・システムの主メモリ１１３に転送され、スタンバイ・コンピュータ・システムの主メモリ１１３内にチェックポイントを確立する。一貫性のあるシステム状態についてチェックポイント処理を行うために、全てのプロセッサは同期してそのキャッシュを放出（flush）する。一旦処理素子１４が放出を開始したなら、他の全ての処理素子１４がそれらの放出を完了するまで、以下で論ずるある種の条件下を除いて、通常の動作を再開することができない。プロセッサのキャッシュ放出を同期させるのは、バッファ・メモリが、どのデータを主メモリ１１３にコピーすべきで、どのデータをコピーすべきでないかを知る必要があるためである。即ち、バッファ・メモリは、放出後データおよび放出前データ間の区別をする必要がある。したがって、どのプロセッサがデータを送出しているのかがバッファにわからない場合、全てのプロセッサは、通常動作が開始可能となる前に、それらの放出を完了させ、一貫性を保持するようにしなければならない。同期を制御するには、好ましくは、図５の８０に示すように、例えば、主コンピュータの主メモリ内の指定された位置を用いて、検査および設定ロック処理または同等の処理を用い、ロック値を格納する。基本メモリ素子の障害およびその他の障害からの回復が可能であることを保証するために、この指定位置は、ステータス・レジスタの一部として実施することが好ましい。周期的な間隔で、各処理ユニット４４は、図６のＡのステップ９０に示すように、放出処理を開始すべきか否かについて判定を行う。処理ユニット４４は、この判定を多数の異なる方法で行うことができる。典型的に、放出は、固定時間期間が経過した後に開始すればよい。この処理ユニット４４が放出を開始する必要がない場合、指定メモリ位置８０を検査し、他の処理ユニット４４が既にロックを設定しているか否かについて判定を行う（ステップ９２）。ロックが設定されていない場合、このプロセスは、９４に示すように終了する。逆に、ロックが設定されている場合、この処理ユニット４４はステップ９６に
おいてそのキャッシュ４２を放出する。放出処理の効果は、キャッシュ内の全ライン（または、好ましくは、最後の放出以降変更されたラインのみ）をコンピュータの主メモリ１１３、およびライト・バッファ１１６にも同様に格納することである。実際の放出処理に先立って、処理ユニット４４は、その状態をキャッシュ４２にセーブし、この情報も同様に放出されるようにする。入出力（Ｉ／Ｏ）動作は、通常、以下のように処理される。通常動作の間、Ｉ／Ｏ要求は、オペレーティング・システムによって、いずれかの標準的な方法で発せられ、適切なＩ／Ｏキューに入力される。しかしながら、実際の物理的なＩ／Ｏ動作は、次のチェックポイントまで開始されない。したがって、障害およびそれに続くチェックポイント処理済み状態(checkpointed state)への後退の場合、全ての保留のＩ／Ｏ動作にも、チェックポイント処理が行われる。ディスクおよびその他のアイデンポネントＩ／Ｏ動作、即ち、結果を変化させることなく繰り返すことができる動作は、単に再起動することができる。通信Ｉ／Ｏ動作の適切な処置は、通信プロトコルに依存する。可能なメッセージの複製に対処するプロトコルでは、保留のＩ／Ｏを再起動することができる。欠落したメッセージを処理するプロトコルでは、Ｉ／Ｏを保留のキューから削除することができる。欠落メッセージも繰り返しメッセージも処理しないプロトコルでは、保留のＩ／Ｏは保留キューから削除される。障害の前にメッセージが実際に送出されなかった場合、または障害の結果として中止された場合、過渡通信リンク障害と影響は同一であり、同じ結果がアプリケーションまたはユーザにもたらされる。通信リンク割り込みは、通常、コンピュータ障害よりもかなり多く発生するので、かかるイベントを透過的にすることができないプロトコルの使用は、おそらく、ユーザまたはアプリケーションは、いずれにせよ、それらと対処する準備がなされていることを意味する。処理ユニット４４がステップ９０において、放出を開始すべきと判定した場合、ステップ９２と同様、ステップ９８において、ロックが既に設定されているか否かについて判定を行う。ロックが既に設定されている場合、処理ユニット４４は、ステップ９６において、そのキャッシュ４２の放出を継続する。その他の場合、ステップ１００においてロックを設定し、他のプロセッサにメッセージを送り、それらの放出ライン動作をトリガすることによって、そのキャッシュ４２を放出する前に、それ自体を放出のイニシェータ (initiator) として識別する。処理ユニット４４がステップ９６においてそのキャッシュ４２を放出した後、ステップ１０２においてその対応する放出カウンタを増分する。図５に示すように、各処理ユニット４４は、８２および８４で示すような放出カウンタを有し、これらは、主メモリ１１３内の所定の指定された位置である。放出カウンタ（例えば８２）を増分した後、処理ユニット４４は、それがこの放出シーケンスのイニシェータであるか否かについて判定を行う（ステップ１０４）。イニシェータでない場合、ステップ１０６において、ロックが解除されるまで待つ。ロックが解除されたなら、このプロセスはステップ１０８において終了し、処理ユニット４４は通常動作を再開することができる。ステップ１０４の判定において、処理ユニット４４が放出のイニシェータであった場合、ステップ１０５において、全ての放出カウンタ（８２〜８４）が増分されるまで待つ。一旦全ての放出カウンタが増分されたなら、この処理ユニット４４は、委託コマンド(commit 
command)をメモリ制御ロジック１１７に送り、前述のように、ライト・バッファ１１６内のデータを、スタンバイ・コンピュータの主メモリ１１３にコピーする。一旦この命令が送られたなら、放出ロックが解除され、処理ユニット４４は通常の処理を再開することができる。ステップ１０６ないし１１０間のループは、タイム・アウト保護を有し、放出動作中の障害の場合に、障害回復手順をトリガするようにすべきである。ここに記載する放出カウンタは、１ビット・カウンタとすればよく、したがってチェックポイント・メモリ素子内のステータス・レジスタの一部として容易に実装可能であることを注記しておく。ビットは、各プロセッサによって個別に設定し、イニシェータが委託コマンドを送った場合には自動的にリセットすることができる。ある種の非標準的バス・プロトコルも実装した場合、処理能力上の利点を得ることができる。例えば、バス・プロトコルが、メモリ・サブシステム４８に、処理素子１４間で識別すること、または格納対象のラインに書き込みを行ったのは、ｉ回目の放出を完了した処理素子１４か、またはｉ回目の放出を未だ実行中の処理素子かを少なくとも識別すること、あるいは放出後データから放出前データを少なくとも識別することを可能にする場合、処理素子１４は、通常の動作を開始する前に、他の全ての処理素子がそれらの放出を完了するまで待つ必要はない。この場合、処理素子１４に、そのｉ回目の放出を完了した後に、全ての処理素子１６も少なくともそのｉ回目の放出を開始する（しかし、完了するまでの必要はない）まで、正常動作を保留することを要求することによって、主メモリにおける一貫性を保持する。このように同期の制約を緩和してもなお、一貫したチェックポイント状態の存在は保証される。即ち、放出を開始していない処理素子１６は、放出を完了し通常処理を再開した他の処理素子１４から、放出後の変更されたデータを受け取らないことを保証する。この同期に対する制約が緩いプロトコルが許されるのは、メモリ・サブシステムが、放出処理の一部として書き込まれているデータと、放出を完了した処理素子１４によって書き込まれているデータとの間で区別することができる場合である。この種のキャッシュ放出同期を実施するためには、図６のＡのステップ９６および１０２の順番および配置を、図６のＢに示すように逆にすればよい。主コンピュータ・システムにおける障害の場合、スタンバイ・コンピュータ・システムの主メモリ１１３は、最後のチェックポイントにおける主コンピュータ・システムの状態を含む。スタンバイ・コンピュータ・システムは、標準的な技法を用いて、デュアル・ポート・ディスク・アレイ１３０のような全てのデュアル・ポートＩ／Ｏ装置への接続部を活性化し、障害の時点において主コンピュータ・システム上で実行中であったアプリケーションの処理を引き継ぎ、アプリケーションの中断や、関連するデータの損失は全く生じない。あるいは、障害の後に、スタンバイ・コンピュータ・システムは、高速データ・リンク１５０を通じて、その主メモリ１１３の内容を、主コンピュータ・システムの主メモリ１１３にコピーし、主コンピュータ・システムが、スタンバイ・コンピュータ・システムの主メモリに格納されている直前のチェックポイントから動作を再開できるようにすることも可能である。前述のように、スタンバイ・コンピュータ・システムは、主コンピュータ・システムにチェックポイント・メモリを提供しつつ、スタンバイ・モードではアイドル状態となるように設計されている。しかしながら、図１に示すフォールト・トレラント・コンピュータ・システムの実施形態では、コンピュータ・システムは、対称的に動作を行い、コンピュータ・システム１１０および１２０の各々が他のコンピュータ・システムのために遠隔チェックポイント・メモリを提供するようにしてもよい。この場合、スタンバイ・コンピュータ・システムは、主コンピュータ・システムにチェックポイント・メモリを提供しつつ、アプリケーションを実行することができる。対称的動作の場合、遠隔チェックポイント・インターフェース１１２は、チェックポイント・メモリと一体化することが好ましい。各コンピュータのバッファ・メモリ１３２および主メモリ１１３は、物理的に、メモリ・バス上に共通配置(co-located)されている。これによって、スタンバイ・モード以外のモードで動作するコンピュータ・システムのスループットを低下させる動作である、バッファからシャドウへの格納のために、Ｉ／Ｏおよびメモリ・バスを連結する必要性を回避する。一方のコンピュータの障害の後、残ったコンピュータは、１）双方のコンピュータ（それ自体および障害を発生したコンピュー
タ）のアプリケーションを実行することができるが、いずれか一方のアプリケーションに対するスループットは低下する。あるいは、２）それ自体のアプリケーションを終了し、障害を発生したコンピュータのアプリケーションのみを実行することができる。あるいは、３）結合したアプリケーションのサブセットを実行することができ、これは十分に高い優先度を有する。図４は、遠隔チェックポイント・バッファリングを利用する、本発明の他の実施形態を示す。図４に示す実施形態では、遠隔チェックポイント・バッファリングの概念を拡張し、コンピュータ１１０ａ，１１０ｂ，１２０および１２０ｄを論理リング状に構成し、各コンピュータがリング内の隣接するコンピュータのためにバックアップとして機能するようにすることによって、Ｎ＋１の冗長性を可能にするものである。図４に示す実施形態では、コンピュータ・システム１１０ｂがコンピュータ・システム１１０ａにシャドウ・メモリを提供し、コンピュータ・システム１１０ｃがコンピュータ・システム１１０ｂにシャドウ・メモリを提供し、コンピュータ・システム１１０ｄがコンピュータ・システム１１０ｃにシャドウ・メモリを提供し、コンピュータ・システム１１０ａがコンピュータ・システム１１０ｄにシャドウ・メモリを提供する。Ｉ／Ｏデータ装置１３０ａ，１３０ｂ，１３０ｃ，１３０ｄおよび１３０ｅは、直に隣接するコンピュータ間に結合されたデュアル・ポート装置である。Ｉ／Ｏデータ装置は図示のようにデュアル・ポートとすることができ、あるいはリング内の全コンピュータに共通とすることや、双方を結合することも可能である。図４に示す実施形態では、コンピュータ１１０ａ〜１１０ｄの１つが予備として指定され、これは、アイドル状態のままとなっているか、あるいは分配可能なタスクを実行することができる。リング内の各コンピュータは、２つの他の隣接するコンピュータに結合され、これら隣接するコンピュータの一方を左側のコンピュータと呼び、他方の隣接するコンピュータを右側のコンピュータと呼ぶ。リング内において、予備以外のコンピュータの１つが障害を発生した場合、障害を発生したコンピュータのタスクは、右側に隣接するコンピュータによって実行され、右側のコンピュータのタスクは、その右側のコンピュータによって実行される等、指定された予備コンピュータに達するまで続けられ、タスクのリップリング(rippling)を止めることができる。尚、前述のコンピュータは、チェーン状に構成することも可能であり、好ましくは、スタンバイ・コンピュータをチェーンの終端に配置することは理解されよう。かかる実施形態では、チェーンの終端にスタンバイ・コンピュータを維持するためには、障害後のシステムの再構成が必要となる場合もある。コンピュータ集合を論理リングに構成することによって、再構成の必要性は解消する。加えて、前述の本発明の実施形態では、コンピュータ・システムのプロセッサ、メモリ、Ｉ／Ｏ装置を含む個々の素子は、１系統以上のメモリ・バスによって相互接続されたものとして説明した。メモリバスは、交差点スイッチのような、データを転送し同じ機能を果たす他の相互接続機構で置き換えてもよいことは理解されよう。本発明の利点の１つは、遠隔ユニットを対象としたチェックポインティングを可能とすることにより、内部の冗長性が少ないか全くない（例えば、単一プロセッサ）コンピュータに、フォールト・トレランスを与えることである。本発明の従来技術のチェックポインティングに対する別の利点は、保護対象のコンピュータから物理的に分離された第２のコンピュータにおいて、チェックポイントを確立することである。第２のコンピュータは、ある距離だけ離れていても可能である。したがって、単一の予備コンピュータを、動作状態にある任意の数のコンピュータに対するバックアップとして機能させるように、フォールト・トレランスの概念を拡張する。本発明の更に別の利点は、アプリケーション・プログラムもユーザも、チェックポイント処理プロセスに関与する必要がなく、また知る必要もないことである。非常に迅速な回復に対応し、メモリ霜害やその他のハードウエアおよびソフトウエア障害に対する保護が得られる。ここに説明した本発明の実施形態から、上述の実施形態は単に例示的であり限定的なものではなく、単に一例として提示したに過ぎないことは、当業者には認められよう。多数の変更およびその他の実施形態は、当業者の範囲内であり、添付の請求の範囲に規定された本発明の範囲およびその均等物に該当するものと見做す。DETAILED DESCRIPTION OF THE INVENTION Fault Tolerant Computer System 
Using Read Buffer Main memory system and checkpointing protocol Field of the invention The present invention is particularly directed to fault tolerant computer systems. Computer memory system and checkpointing protocol (c heckpointing protocol). Background of the Invention Fault tolerance in computers is commonly referred to as masking. Hardware-intensive techniques, or software called checkpointing It is implemented by any of the wear-based approaches. Achieve masking Has the same hardware and multiple computer programs. Run in parallel on a standing device. Next, compare the outputs of these devices and Determine gender. In the simplest and oldest embodiment of this technique, three complete Computers and use a simple majority voting method for their output to get the "correct" output. judge. At least two of these computers are working properly and If the voting system itself is working properly, the potential of a malfunctioning computer Output that may be incorrect in the future is outvoted, and in fact the correct answer is Presented to the user. Another embodiment of masking that is somewhat more efficient However, masking systems usually eliminate the effects of failed components. Costs significantly because additional hardware must be added to There is a problem. In addition, masking protects against hardware failure. Just do it. Software bugs that cause malfunctions in one device are the same Other devices that execute software also cause malfunctions. All outputs Contains the same error, so that this error is passed undetected become. An alternative technique called checkpointing is a much more cost-effective method And has the potential to confer resistance to disorders. This technique is The status of the entire data is periodically recorded at specified time intervals as checkpoints. Need to be Faults are detected by hardware fault monitors (for example, error detection). 
Decoder, temperature or voltage acting on data encoded using output code Sensors, or one device monitoring another identical device) Or a software fault monitor (eg, a stack pointer in a data structure) Or on the address as part of the executable code that checks for out-of-range conditions. Assertion). Failure detected Recovery, first diagnose and, if possible, bypass the malfunctioning device. Then return the system to the last checkpoint, from which point It is necessary to resume operation. Recovery is possible if any elements identified as having failed are recovered during the recovery process. In the meantime, and then sufficient hardware remains operational. You. For example, in a multiprocessor system, at least one of the processors The system can continue to operate as long as. Similarly, in memory Assign I / O through a system that can remap, or through an alternate port A system that can be reassigned also has a loss of memory or I / O resources. Can be overcome. Furthermore, most of the things found in computer systems Failures are instantaneous or intermittent in nature and are themselves transient glitches (g litch). Therefore, it is usually not necessary to bypass the hardware at all. Recovery from such a failure is possible. However, momentary failures and pauses Intermittent failures, like permanent failures, disrupt the data being manipulated at the time of the failure. The computer will always return after such an event You need to have a state. This is the periodic checkpoint state (checkpointe d state). Checkpoints are typically provided approximately every 50 milliseconds, so Retreating a running program to its last checkpoint is usually a It is completely transparent to the user. If handled properly, All applications without loss of continuity or data contamination The application can be resumed from its last checkpoint. Checkpointing has two main advantages over masking. You. First, checkpointing has very low implementation costs. No. 
Second, checkpointing is not only a matter of hardware failure, but also software. Provide protection against a failure. The first advantage is that simply checkpointing Only reflect the fact that it does not require a large amount of identical hardware. Absent. The second advantage is that in well-tested and mature software, Which software bugs are only exposed in exceptional circumstances It is the result of the fact. If this is not correct, the bug will be found during normal inspection, Will be removed. Such an exceptional situation generally involves asynchronous events. Therefore, it occurs. Asynchronous events are when an interrupt occurs and a sequence Program execution following the program, but if no interrupt occurs, the program This is the case when there is nothing to execute after the cans. Make the system consistent If you forcibly return to the state and continue to operate, that is, software bugs When treated as a wear transient, the system is exactly the same as before. It is very unlikely that you will encounter exactly the same exception in a state. As a result, 2 It is very unlikely to be encountered twice. Checkpointing also has two potential There are drawbacks. First, masking usually recovers instantaneously or almost instantaneously from failure I do. Any resulting errors are simply masked out, so No special recovery is required. Checkpointing is a kind of software Run the routine, diagnose the problem, and cause a permanent computer malfunction. All components need to be bypassed. As a result, the time required for recovery is typically Real time, which requires a response time of less than millisecond Time applications use this to achieve fault tolerance. Techniques may not be available. However, if a person directly 1 second for applications that perform bidirectional processing with, for example, transaction processing Temporary interruptions can be tolerated without problems, and in fact, It is not always noticed. 
Therefore, the potential for this checkpointing The disadvantages are irrelevant for this type of application. Second, checkpointing has traditionally been at the application level. Had been achieved. Therefore, the application programmer has About when to checkpoint and when to do it. I had to. This demand is a significant burden on programmers, Using Checkpointing as a Means of Achieving Fault Tolerance The use of the product was severely hindered. In recent years, checkpointing at the system software level As the enabling technique was developed, the application programmer Don't worry about trying to identify the data that should be processed No need to even know that checkpointing is happening. this The system itself is independent of the applications it can run Must be able to provide periodic checkpoints. Stiffler U.S. Pat. Nos. 4,654,819 and 4,819,154 are correct. It describes a computer system that can perform this. This The system uses its processes to achieve this type of checkpointing. Establish a new checkpoint for each of the servers and store all changed data in main memory Until all changes can be flushed out to their local Request to be kept in cache. Such a cache is sometimes called Sometimes called blocking cashe. The processor , Before clearing its blocking cache, a context switch During this time, the contents of its internal registers, including its program counter, are , Put on the stack, and release this stack along with all other modified data. That As a result, updating the memory once with internally consistent data The system can then safely return if the system subsequently fails. Establish a checkpoint Occurs during main memory failure and the release operation itself In order to guarantee the ability to overcome both obstacles, two Data items to the main position and shadow position (shadow location). This technique allows checkpointing without burdening the application programmer. 
Achieves the goal of establishing It has certain disadvantages due to its dependence on use. Processor currently changed Any cache line, unless you want to write back all the lines Cannot write back to main memory, causing cache overflow Or cached by one processor in another processor's cache. Releases data whenever a request is made for the data being This would require the processor to write out its entire cache. This key The case is based on standard cache coherency protocols (eg, Gallagher's US Pat. No. 5,276,848). If the program runs under such a standard protocol, potential ports Porting and performance problems. For example, Kirrmann (U.S. Pat. No. 4,905,196) and Lee et al. ("A Recovery Cache for the PDP-11 "(Recovery cache for PDP-11), IEEETran s.on Computers, June 1980) for the purpose of checkpointing. Therefore, another method of capturing data has been proposed. Kirrmann's Method A card-shaped memory storage element is used. This is the main memory followed by the two Each archive memory is the same size as the main memory. I have. Writing to main memory is also performed by the processor to the write buffer. You. When it is time to establish a checkpoint, the buffered data is The processor first copies to one of the archive memories and then the second archive ・ Copy to memory. However, techniques that eliminate the need for one of these copies It is also described. Two archive memories, from buffer to memory If a failure occurs while a copy is being made, at least one of them Ensure that valid checkpoints are included. Problems with this architecture Must have three systems of memory, for archive memory Use slower memories and make sure that the three memory elements are on different buses Processor capacity to become a port Influence is included. Lee et al. In the paper by the address specified by the application Update data is written to memory for all memory locations that fall within the range. 
Prior to this, methods for saving data to the recovery cache have been discussed. This way Is all writes to memory within the range specified by the application Into read-before-write operations. Applique If a failure occurs during the execution of an application, the contents of the recovery cache are stored in main memory. By storing it back, the application will start To recover it. One of the problems with this method is that read after write Memory cycle interference, slowing down the host system Changes the bus protocol. Also, Both application programmers handle or consider checkpointing Request to be involved in Another technique for creating a mirror of data on disk other than main memory Is being developed. Since disk access is several orders of magnitude slower than main memory access Such schemes are limited to mirroring data files. That is, If a failure disrupts the primary access path to these files, Limited to supplying up to a disk file. System user Transparent to the user, maintaining program continuity or running applications No attempt has been made to recover the application. In some cases, mirror files It is not even possible to guarantee that files are consistent with each other; It is only consistent with another copy of the file. US Patent 5,247,618 Discloses an example of such a scheme. Summary of the Invention Embodiments of the present invention provide a computer system with a conventional cache cache. Enables use of coherency protocols and non-blocking caches While maintaining consistent periodic updates in the main memory of the computer system. An apparatus and a process for maintaining a checkpoint state are provided. Of the present invention Embodiments are accessed by a processor via one or more logical ports, Both basic and checkpoint memory elements are connected to this port. Provide combined main memory. Basic memory elements are similar to standard main memory. Is accessed. 
The checkpoint memory element captures writes to main memory. These writes are detectable because the checkpoint memory element is connected to the same port as the basic memory element. The captured writes are then used to ensure that a consistent checkpoint state exists in main memory. With suitable fault-detection and bypass procedures, such a computer system can recover from faults without compromising data integrity or processing continuity.

In one embodiment of the present invention, a backup computer is provided with a remote checkpoint memory element that includes a buffer memory and a second random-access memory element. The backup computer and the main computer are connected by a checkpoint communication link. During normal processing, data written to main memory in the main computer is also sent, over the dedicated checkpoint communication link, to the buffer memory in the remote checkpoint memory element of the backup computer. When a checkpoint is to be established, the data already captured in the buffer memory is copied to the random-access memory of the backup computer's remote checkpoint memory element. If the main computer fails, the backup computer takes over processing of the applications that the main computer was previously processing, starting from the checkpoint state stored in the shadow memory element of its remote checkpoint memory element.

In another embodiment of the invention, remote checkpoint buffers are used with multiple computers, each computer acting as a backup for one of its neighbors, to provide N + 1 redundancy.

In a system according to the present invention, input/output (I/O) operations are normally handled as follows. During normal operation, I/O requests are issued in any standard manner and entered by the operating system into the appropriate I/O queue.
However, the actual physical I/O operation is not started until the next checkpoint. Thus, in the event of a failure and subsequent rollback to the checkpoint state, all pending I/O operations are also subject to checkpointing. Disk operations and other idempotent I/O operations can simply be restarted. The appropriate handling of communication I/O operations depends on the communication protocol. With protocols that tolerate duplicate messages, pending I/O can be restarted. With protocols that tolerate lost messages, the I/O can be deleted from the pending queue. With protocols that handle neither lost nor repeated messages, the pending I/O is removed from the pending queue. Whether a message was not delivered before the failure or was aborted as a result of the failure, the effect is the same as that of a communication-link failure, and the result is the same. Because communication-link interruptions usually occur far more often than computer failures, the use of a protocol that cannot make such events transparent presumably means that the user or application is prepared to deal with them in any case.

The mechanisms described here ensure the existence of a consistent checkpoint state from which the computer can resume operation following a failure, and so support the operation of a fault-tolerant computer system.

BRIEF DESCRIPTION OF THE FIGURES

For a better understanding of the present invention, reference is made to the drawings, which are incorporated herein by reference.

FIG. 1 is a block diagram of a computer system using a main memory structure according to an embodiment of the present invention.

FIG. 2 is a block diagram of a fault-tolerant computer system using a remote checkpoint buffer according to one embodiment of the present invention.

FIG. 3 is a block diagram showing the main computer system and the standby computer system of FIG. 2 in detail.

FIG. 4 is a block diagram of a remote checkpoint memory system using N + 1 redundancy, according to one embodiment of the present invention.

FIG. 5 is a diagram of the memory locations used by the processing units to maintain the consistency of main memory.

FIG. 6A is a flowchart describing how each processing unit controls the flushing of its cache to maintain the consistency of main memory.

FIG. 6B is a flowchart describing an alternative method that each processing unit can use to control the flushing of its cache to main memory.

Detailed Description

The present invention will be more fully understood from the following detailed description, which should be read in conjunction with the accompanying drawings, in which like reference numerals indicate like structures. Reference is made to Applicant's co-pending U.S. Patent Application No. 08/258,165, filed June 10, 1994, which is incorporated herein by reference.

FIG. 1 is a block diagram of a computer system 11 in which the present invention can generally be used. One or more processing elements 14 and 16 are connected to one or more main memory systems 18 and 20 through interconnects 10 and 12, such as buses or crossbar switches. One or more input/output (I/O) subsystems 22 and 24 are also connected to the interconnect 10 (12). Each I/O subsystem comprises an I/O element or bridge 26 (28) and one or more buses 30 and 32 (34 and 36). The I/O element 26 (28) can also be connected to any standard I/O bus 38 (40), such as a VME bus. For simplicity, the following description refers to only one of each of these systems and subsystems.

Each processing element, for example 14, includes a processing unit 44 connected to a cache 42.
This connection also connects the processing unit 44 and the cache 42 to the interconnect 10. The processing unit 44 can be any standard microprocessor unit (MPU); for example, the PENTIUM microprocessor available from Intel Corporation is suitable for this purpose. As is conventional, the processing unit 44 operates under any suitable operating system. The processing element 14 may include dual processing units 44 for self-checking purposes.

The cache 42 may be either a write-through or a write-back cache, may be of arbitrary size and associativity, and may be organized hierarchically with one or more cache levels. The processing unit 44 may store only data in the cache 42, or both computer program instructions and data. In the former case, a similar instruction cache 43 may additionally be connected to the processing unit 44 to hold program instructions; this connection also connects the instruction cache 43 to the interconnect 10. If the system is a multiprocessing computer system, each processing unit 44 can maintain cache coherency using any conventional mechanism, such as bus snooping. The cache 42 is connected to the main memory system through, for example, the interconnect 10 or 12.

One embodiment of a fault-tolerant computer system according to the invention that uses a remote checkpointing buffer is shown in FIG. 2. In this embodiment, a main computer system 110 is coupled to a standby computer system 120 through a high-speed data link 150. The standby computer system is used to establish checkpoints remotely in its own memory. As shown in FIG. 2, the main computer system 110 and the standby computer system 120 may be interconnected, for example, through an Ethernet 140, and may share a common dual-port disk array 130 and other dual-port storage and communication devices. In a preferred version of this embodiment, the main computer system 110 and the standby computer system 120 are substantially identical computers.

FIG. 3 shows one embodiment of the main computer system 110 in greater detail. It will be appreciated that the standby computer 120 has the same structure as the main computer system shown in FIG. 3. The main computer system 110 includes a memory subsystem 112, which contains main memory 113, a write buffer 116, and control logic 117 that controls data transfers within the memory subsystem 112. The memory subsystem 112 is coupled to a memory bus 114. Consistent with the description of FIG. 1, the representative main computer system 110 of FIGS. 2 and 3 also includes, coupled to the memory bus 114 (or other interconnect topology), one or more processors 118, a remote checkpoint interface 122, an I/O bus 124, an Ethernet interface 126, and an external storage interface shown as a SCSI controller 128. The I/O bus is coupled to the memory bus through an I/O bridge 125.

The remote checkpoint interface 122 includes a buffer memory 132 and an interface controller 134, and is coupled to the memory subsystem 112 through a connection 144. The interface controller 134 is coupled to a high-speed I/O port 131, which connects the main computer system to the high-speed data link 150. The SCSI controller 128 is coupled to a SCSI I/O port 129, which connects the main computer system to the dual-port disk array 130. The Ethernet interface is coupled to an Ethernet I/O port 127, which connects the main computer system to the Ethernet 140.

The process by which this system maintains a consistent state in main memory for use after a failure is now described. Unlike systems such as that of U.S. Pat. No. 4,654,819, this process allows data to pass from one processing element 14 to another processing element 16 without requiring the entire cache 42 of processing element 14 to be flushed. If every processing unit 44 in computer system 11 has access to all buses or communication paths to main memory, each processing unit 44 can maintain cache coherency using conventional bus-snooping methods. If not every processing unit 44 has access to all system buses, the processing units 44 can instead use other well-known cache-coherency techniques.

The operation of this embodiment, referred to here as remote checkpointing, is shown in FIGS. 2 and 3. It is substantially the same as the checkpointing described in the applicant's previously filed co-pending patent application. With remote checkpointing, however, the main memory 113 of the standby computer system 120 acts as checkpoint memory for the main memory 113 of the main computer system 110, holding the state of the main computer system's main memory 113 as of the last checkpoint. Data written to the main memory 113 of the main computer system is also captured, under the control of memory control logic 117, by the write buffer 116 of the main computer system. It will be understood that an access may cover part of a cache line or an entire cache line. The write buffer is large enough to absorb any delays associated with transferring the data over connection 144. Data is transferred from the write buffer 116 to the remote checkpoint interface 122 of the main computer system, where further buffering absorbs any delays associated with the high-speed data link 150. The data is then transferred over the high-speed data link 150 to the remote checkpoint interface 122 of the standby computer system and stored in its buffer memory 132.
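The write-capture path just described can be sketched as a small data-flow model: every write to main memory 113 proceeds normally and is also captured, with its address, in write buffer 116, from which it is forwarded toward the standby computer. This is a minimal Python illustration under stated assumptions, not the patented hardware: the class and field names are hypothetical, main memory is modelled as a dictionary keyed by physical address, and the intermediate buffering stages plus high-speed link 150 are collapsed into one in-order FIFO.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WriteRecord:
    address: int  # physical memory address that was written
    data: bytes   # bytes written (a partial or a full cache line)


@dataclass
class PrimaryMemorySubsystem:
    """Hypothetical model of memory subsystem 112: main memory 113 plus write buffer 116."""
    main_memory: dict = field(default_factory=dict)    # address -> bytes
    write_buffer: deque = field(default_factory=deque)  # captured WriteRecords, oldest first

    def write(self, address: int, data: bytes) -> None:
        # The ordinary write to main memory 113 is not delayed or altered...
        self.main_memory[address] = data
        # ...while control logic 117 also captures an image of it in write buffer 116.
        self.write_buffer.append(WriteRecord(address, data))

    def drain(self, link: deque) -> None:
        # Forward captured writes in order toward remote checkpoint interface 122
        # and link 150 (both modelled here as a single FIFO, an assumption).
        while self.write_buffer:
            link.append(self.write_buffer.popleft())
```

Keeping every captured record, rather than only the latest value per address, preserves write order, so the standby side can replay the records in sequence and end up with the same final contents as main memory 113.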
All writes to the main memory 113 of the main computer system (including lines written to memory during a cache flush) are also written, together with their physical memory locations, to the buffer memory 132 of the standby computer system. When the processors 118 of the main computer system, and any additional processors coupled to the memory bus 114, have completed their flushes, the operating system of the main computer system notifies the standby computer system through the remote checkpoint interface. The contents of the standby computer system's buffer memory 132 are then transferred to the standby computer system's main memory 113, establishing a checkpoint in that memory.

To checkpoint a consistent system state, the processors flush their caches synchronously. Once a processing element 14 has begun a flush, it cannot resume normal operation until all other processing elements 14 have completed their flushes, except under certain conditions discussed below. The processors' cache flushes must be synchronized because the buffer memory needs to know which data should be copied to main memory 113 and which should not; that is, the buffer memory must distinguish post-flush data from pre-flush data. Therefore, if the buffer cannot tell which processor is sending which data, every processor must complete its flush before any processor resumes normal operation, in order to guarantee consistency.

To control this synchronization, a lock value is preferably stored at a designated location in the main memory of the main computer, shown at 80 in FIG. 5, using a test-and-set lock operation or an equivalent operation. To ensure that recovery from failures of the basic memory element and other failures remains possible, this designated location is preferably implemented as part of a status register.
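The requirement that the buffer memory separate pre-flush from post-flush data can be illustrated with a minimal model of the standby side: buffer memory 132 accumulates (address, data) records, and only a commit moves them into the standby's main memory 113, which therefore always holds the last checkpoint. The sketch rests on stated assumptions: records arrive in order over link 150, the checkpoint notification is modelled as a hypothetical `COMMIT` marker that arrives after every pre-flush write and before any post-flush write (which is what the synchronized flush guarantees), and records not yet committed are simply discarded if the primary fails.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical marker standing in for the "flush complete, establish checkpoint" notification.
COMMIT = object()


@dataclass
class StandbyCheckpointMemory:
    shadow: dict = field(default_factory=dict)    # standby main memory 113: last checkpoint state
    buffer: deque = field(default_factory=deque)  # buffer memory 132: writes since that checkpoint

    def receive(self, item) -> None:
        # Each item from the link is either an (address, data) pair or the COMMIT marker.
        if item is COMMIT:
            self._commit()
        else:
            address, data = item
            self.buffer.append((address, data))

    def _commit(self) -> None:
        # Apply buffered writes in arrival order, so a later write to an address wins.
        while self.buffer:
            address, data = self.buffer.popleft()
            self.shadow[address] = data

    def on_primary_failure(self) -> dict:
        # Writes received after the last commit belong to an unfinished interval:
        # drop them and resume from the last consistent checkpoint.
        self.buffer.clear()
        return dict(self.shadow)
```

Because the shadow copy is only ever updated in whole commit intervals, a failure at any point leaves it equal to the state at the most recent checkpoint, never a mixture of two intervals.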
At periodic intervals, each processing unit 44 determines, as shown at step 90 of FIG. 6A, whether it should initiate a flush. The processing unit 44 can make this determination in a number of ways; typically, a flush is initiated after a fixed period of time has elapsed.

If this processing unit 44 does not need to initiate a flush, it checks the designated memory location 80 to determine whether another processing unit 44 has already set the lock (step 92). If no lock is set, the process ends, as shown at 94. If a lock is set, this processing unit 44 flushes its cache 42 at step 96. The flush writes every line in the cache that has been modified (or, preferably, modified since the last flush) both to the computer's main memory 113 and to the write buffer 116. Before the actual flush, the processing unit 44 saves its state in the cache 42 so that this information is flushed as well.

Input/output (I/O) operations are normally handled as follows. During normal operation, I/O requests are issued by the operating system in any standard manner and entered into the appropriate I/O queue. However, the actual physical I/O operation is not started until the next checkpoint. Therefore, in the event of a failure and subsequent rollback to the checkpoint state, all pending I/O operations are also checkpointed. Disk operations and other idempotent I/O operations, that is, operations that can be repeated without changing the result, can simply be restarted. The appropriate handling of communication I/O operations depends on the communication protocol. With protocols that tolerate duplicate messages, pending I/O can be restarted. With protocols that tolerate lost messages, the I/O can be removed from the pending queue.
With protocols that handle neither lost nor repeated messages, the pending I/O is removed from the pending queue. Whether a message was not sent before the failure or was aborted as a result of the failure, the effect is the same as that of a communication-link failure, and the user sees the same result. Because communication-link interruptions usually occur far more often than computer failures, the use of a protocol that cannot make such events transparent presumably means that the user or application is prepared to deal with them in any case.

If the processing unit 44 determines at step 90 that a flush should be initiated, it determines at step 98, as in step 92, whether the lock has already been set. If the lock has already been set, the processing unit 44 proceeds to flush its cache 42 at step 96. Otherwise, it sets the lock at step 100 and, before flushing, identifies itself as the flush initiator by sending a message to the other processors that triggers their flush operations.

After the processing unit 44 flushes its cache 42 at step 96, it increments its flush counter at step 102. As shown in FIG. 5, each processing unit 44 has a flush counter, shown at 82 and 84, at predetermined designated locations in main memory 113. After incrementing its flush counter (for example, 82), the processing unit 44 determines whether it is the flush initiator (step 104). If it is not the initiator, it waits at step 106 until the lock is released. Once the lock is released, the process ends at step 108 and the processing unit 44 can resume normal operation. If at step 104 the processing unit 44 determines that it is the flush initiator, it waits at step 105 until all of the flush counters (82-84) have been incremented.
Once all of the flush counters have been incremented, this processing unit 44 sends a commit command to the memory control logic 117, which causes the buffered write data, as described above, to be copied to the main memory 113 of the standby computer. Once this command has been sent, the flush lock is released and the processing unit 44 can resume normal processing. The loop between steps 106 and 110 is protected by a timeout, so that a fault-recovery procedure is triggered if a fault occurs during the flush operation.

Note that the flush counters described here may be 1-bit counters, and so can easily be implemented as part of the status register in the checkpoint memory element. Each bit is set individually by its processor and can be reset automatically when the initiator sends the commit command.

Processing-performance advantages can also be gained by implementing certain non-standard bus protocols. For example, if the bus protocol lets the memory subsystem 48 identify which processing element 14 is writing a line to be stored, or at least distinguish a processing element 14 that has completed its i-th flush from one that is still performing its i-th flush, so that pre-flush data can be told apart from post-flush data, then a processing element need not wait for all other processing elements to complete their flushes before resuming normal operation. In this case, consistency is maintained by requiring each processing element 14, after completing its i-th flush, to suspend normal operation until all other processing elements 16 have at least initiated (but not necessarily completed) their i-th flush. Even with this relaxed synchronization constraint, the existence of a consistent checkpoint state is still guaranteed: a processing element 16 that has not yet started its flush cannot receive modified, post-flush data from another processing element 14 that has completed its flush and resumed normal processing. This less restrictive synchronization protocol is permissible if the memory subsystem can distinguish data written as part of a flush from data written by processing elements 14 that have already completed their flush. To implement this type of cache-flush synchronization, the order of steps 96 and 102 of FIG. 6A is reversed, as shown in FIG. 6B.

If the main computer system fails, the main memory 113 of the standby computer system contains the state of the main computer system as of the last checkpoint. Using standard techniques, the standby computer system activates its connections to dual-port I/O devices such as the dual-port disk array 130, and takes over processing of the applications that were running on the main computer system, without interrupting those applications or losing associated data. Alternatively, after the failure the standby computer system can copy the contents of its main memory 113 over the high-speed data link 150 into the main memory 113 of the main computer system, so that the main computer system can resume operation from the last checkpoint stored in the standby computer system's main memory.

As described above, the standby computer system remains in a standby state while it provides checkpoint memory for the main computer system. In the embodiment of the fault-tolerant computer system shown in FIG. 2, however, the computer systems may instead operate symmetrically, so that each of the computer systems 110 and 120 provides remote checkpoint memory for the other. In this case, the standby computer system can run applications while also providing checkpoint memory to the main computer system.
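The FIG. 6A flush protocol described above (steps 90 to 110) can be sketched with Python threads standing in for the processing units 44. This is a minimal illustration under several assumptions: the class and method names are hypothetical; the lock at designated location 80 and the 1-bit flush counters 82-84 are modelled as fields guarded by a `threading.Condition`, which supplies the atomic test-and-set; the flush-request message is modelled as a condition notification; `flush_cache` and `commit` are caller-supplied callbacks; and a timeout raises an exception where a real system would start fault recovery. It follows the FIG. 6A ordering (flush first, then set the counter bit), not the relaxed FIG. 6B variant.

```python
import threading
from typing import Callable, Optional


class FlushCoordinator:
    """Hypothetical model of the flush lock (location 80) and 1-bit flush counters (82-84)."""

    def __init__(self, n_processors: int, commit: Callable[[], None], timeout: float = 5.0) -> None:
        self.n = n_processors
        self.commit = commit            # stands in for the commit command to control logic 117
        self.timeout = timeout
        self.cond = threading.Condition()
        self.locked = False
        self.initiator: Optional[int] = None
        self.flushed = 0                # bit i is set once processor i has flushed this round

    def poll(self, pid: int, want_flush: bool, flush_cache: Callable[[int], None]) -> str:
        """One periodic check by processor `pid` (step 90 onward)."""
        with self.cond:
            if not self.locked:
                if not want_flush:
                    return "idle"                      # steps 92 -> 94: nothing to do
                self.locked, self.initiator = True, pid  # step 100: set lock, become initiator
                self.cond.notify_all()                 # stands in for the flush-request message
            is_initiator = self.initiator == pid
        flush_cache(pid)                               # step 96: save state, write back dirty lines
        with self.cond:
            self.flushed |= 1 << pid                   # step 102: record own flush
            self.cond.notify_all()
            if not is_initiator:                       # steps 104 -> 106: wait for lock release
                if not self.cond.wait_for(lambda: not self.locked, self.timeout):
                    raise TimeoutError("flush lock never released; start fault recovery")
                return "flushed"                       # step 108: resume normal operation
            everyone = (1 << self.n) - 1               # step 105: wait for every processor
            if not self.cond.wait_for(lambda: self.flushed == everyone, self.timeout):
                raise TimeoutError("a processor never flushed; start fault recovery")
            self.commit()                              # buffered data -> standby main memory
            self.flushed = 0                           # 1-bit counters reset on commit
            self.locked, self.initiator = False, None  # release the flush lock
            self.cond.notify_all()
            return "committed"
```

A processor that polls while the lock is set joins the current round whether or not it wanted to flush, which mirrors the path from step 92 (or step 98) to step 96 in the flowchart.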
For symmetric operation, the remote checkpoint interface 122 is preferably integrated with the checkpoint memory, so that the buffer memory 132 and the main memory 113 of each computer are physically co-located on the memory bus. This avoids the need for a computer system operating in other than standby mode to carry the copy from the buffer to the shadow memory over its I/O and memory buses, which would reduce its throughput.

After a failure of one computer, the remaining computer can: 1) run the applications of both computers (its own and those of the failed computer), with reduced throughput for each application; 2) terminate its own applications and run only the failed computer's applications; or 3) run only that subset of the combined applications with sufficiently high priority.

FIG. 4 shows another embodiment of the invention that uses remote checkpoint buffering. In the embodiment shown in FIG. 4, computers 110a, 110b, 110c and 110d form a logical ring in which each computer acts as a backup for one of its neighbors, providing N + 1 redundancy. Specifically, computer system 110b provides shadow memory for computer system 110a, computer system 110c provides shadow memory for computer system 110b, computer system 110d provides shadow memory for computer system 110c, and computer system 110a provides shadow memory for computer system 110d. I/O data devices 130a, 130b, 130c, 130d and 130e are dual-port devices coupled between immediately adjacent computers. The I/O data devices may be dual-ported, common to all computers in the ring, or a combination of the two.

In the embodiment shown in FIG. 4, one of the computers 110a-110d is reserved as a spare, which is either left idle or used to perform distributable tasks. Each computer in the ring has two neighbors; one is called the left computer and the other the right computer.
If one of the non-spare computers in the ring fails, the tasks of the failed computer are taken over by its right-hand neighbor, the tasks of that right-hand computer are in turn taken over by its own right-hand neighbor, and so on, until the designated spare computer is reached, where this rippling stops. Note that the computers described above can also be configured as a chain, in which case the standby computer is preferably placed at the end of the chain. In such an embodiment, keeping a standby computer at the end of the chain may require the system to be reconfigured after a failure; configuring the computers as a logical ring eliminates the need for reconfiguration.

In the embodiments of the invention described above, the individual elements of the computer system, including the processors, memory and I/O devices, have been described as interconnected by one or more memory buses. It will be understood that the memory bus may be replaced by other interconnect mechanisms that transfer data and perform the same function, such as a crossbar switch.

One advantage of the present invention is that, by enabling checkpointing to a remote unit, it provides fault tolerance to computers with little or no internal redundancy (for example, a single processor). Another advantage of the present invention over prior-art checkpointing is that the checkpoint is established on a second computer physically separate from the protected computer, and that second computer can be located at some distance. The concept of fault tolerance is thereby extended so that a single spare computer can act as a backup for any number of computers. Yet another advantage of the present invention is that application programmers need not be involved in the checkpointing process, or even be aware of it.
The invention also provides very fast recovery, and protection against memory corruption and other hardware and software failures.

Having described embodiments of the invention, it will be recognized by those skilled in the art that the above-described embodiments are merely illustrative and not limiting, having been presented by way of example only. Numerous modifications and other embodiments are within the purview of those skilled in the art and are contemplated as falling within the scope of the present invention as defined by the appended claims and their equivalents.
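The rippling takeover described for the FIG. 4 ring can be sketched as a small reassignment function. It relies on the shadow-memory arrangement stated above (each computer's right-hand neighbor holds its checkpoint), so the right-hand neighbor is the one able to resume a failed computer's work. The function name, the use of task lists, and the choice to discard any distributable tasks the spare was running are assumptions for illustration only.

```python
from typing import Dict, List


def ripple_failover(ring: List[str], tasks: Dict[str, List[str]],
                    failed: str, spare: str) -> Dict[str, List[str]]:
    """Return the task assignment after `failed` drops out of the logical ring.

    `ring` lists the computers left to right; the last one wraps around to the first.
    Each computer's right-hand neighbour holds its shadow memory, so it takes over.
    """
    assert spare in ring and failed in ring
    survivors = {c: list(t) for c, t in tasks.items() if c != failed}
    if failed == spare:
        return survivors                  # losing the spare moves no work
    i = ring.index(failed)
    carried = list(tasks[failed])         # work that needs a new home
    while True:
        i = (i + 1) % len(ring)
        right = ring[i]
        displaced = survivors[right]      # this computer's own work moves one step right
        survivors[right] = carried        # resumed from the checkpoint this computer holds
        if right == spare:
            return survivors              # ripple stops at the spare (its optional work is dropped)
        carried = displaced
```

For example, with the ring 110a to 110d and 110d as the spare, a failure of 110b moves 110b's work to 110c and 110c's work to 110d, while 110a is unaffected; no reconfiguration of the ring itself is needed.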

【手続補正書】特許法第１８４条の８第１項【提出日】１９９７年１２月３１日（１９９７．１２．３１）【補正内容】１．コンピュータ・システムであって、少なくとも第１のアプリケーションを処理する第１のコンピュータであって、キャッシュと、内部レジスタと、入出力イベント・キューとを有するプロセッサと、前記プロセッサおよび前記キャッシュに結合された主メモリ・サブシステムと、前記主メモリ・サブシステムに結合され、前記主メモリに書き込まれたデータを捕獲するライト・バッファと、外部ポートと、前記ライト・バッファおよび前記外部ポートに結合され、前記ライト・バッファ内のデータを、前記第１のコンピュータの前記外部ポートに転送するインターフェース制御部（１２２）と、を含み、前記プロセッサが、前記キャッシュ、前記内部レジスタおよび前記入出力イベント・キューを前記主メモリ・サブシステムに放出する手段と、前記キャッシュ、前記内部レジスタおよび前記入出力イベント・キューを消去した後に、チェックポイント命令を発行する手段とを含む該第１のコンピュータと、前記第１のコンピュータの前記外部ポートに結合されたデータ通信リンクと、前記第１のコンピュータから離れて位置する第２のコンピュータであって、前記データ通信リンクに結合された外部ポートと、前記第２のコンピュータの前記外部ポートに結合され、前記第１コンピュータの前記インターフェース・コントローラからのデータを受信する遠隔インターフェース制御部（１２２）と、主メモリ・サブシステムと、前記第２のコンピュータの前記インターフェース・コントローラと、前記第２コンピュータの前記主メモリとに結合され、前記第１のコンピュータの前記ライト・バッファから転送されるデータを受け取るバッファ・メモリと、を含み、前記第１コンピュータからの前記チェックポイント命令の受信時に、前記第２コンピュータの前記主メモリ・サブシステムに、前記バッファ・メモリに格納されているデータを転送し、前記第２コンピュータの前記主メモリ・サブシステムが、前記第１のコンピュータの前記第１のアプリケーションの処理を再起動することができる、一貫した状態を保持するようにした該第２のコンピュータと、から成ることを特徴とするコンピュータ・システム。２．前記第２のコンピュータが、更に、前記主メモリ・サブシステムに結合されたプロセッサを含み、前記第１のコンピュータの障害時に、前記第２のコンピュータが、データの損失を生ずることなく、前記第１のコンピュータの前記第１のアプリケーションを処理し続けることを特徴とする請求項１記載のコンピュータ・システム。３．前記第２のコンピュータは、少なくとも第２のアプリケーションを処理し、前記第２のアプリケーションの処理は、前記第１のコンピュータに障害が発生した場合に終了することを特徴とする請求項２記載のコンピュータ・システム。４．前記第２のコンピュータは、少なくとも第２のアプリケーションを処理し、前記第２のコンピュータは、前記第１のコンピュータに障害が発生した場合、前記第１および第２のアプリケーション双方を処理することを特徴とする請求項２記載のコンピュータ・システム。５．前記第２のコンピュータが、更に、前記主メモリ・システムに結合され、前記第２のコンピュータの前記主メモリ・サブシステムに書き込まれたデータを捕獲するライト・バッファを含み、前記第２のコンピュータの前記インターフェース・コントローラが、前記第２のコンピュータの前記ライト・バッファに含まれるデータを前記第１のコンピュータに転送し、前記第１のコンピュータの前記主メモリ・サブシステムが、前記第２のコンピュータの前記主メモリのために、前記シャドウ・メモリとして機能することを特徴とする請求項２記載のコンピュータ・システム。６．第１および第２のコンピュータを有し、該第１および第２のコンピュータの各々が、キャッシュを備えたプロセッサと、内部レジスタおよび入出力イベント・キューと、遠隔インターフェース制御部（１２２）を備えた外部ポートと、主メモリと、バッファ・メモリとを有するコンピュータシステムで、フォールト・トレランスを与える方法において、前記第１のコンピュータの前記主メモリに書き込まれたデータを捕獲するステップと、前記遠隔インターフェース制御部によって、データ・リンクを通じて、前記データを、前記第２コンピュータのバッファ・メモリに転送するステップと、前記第１のコンピュータの前記プロセッサの前記キャッシュ、前記内部レジスタおよび前記入出力イベント・キューを消去し、前記キャッシュ、前記内部レジスタおよび前記入出力イベント・キュー内に含まれているデータを、前記第１コンピュータの前記主メモリと、前記第２のコンピュータ
の前記バッファ・メモリとに書き込むステップと、前記第２のコンピュータの前記バッファ・メモリから、前記第２のコンピュータの前記主メモリにデータをコピーすることにより、チェックポイントを確立し、前記第２のコンピュータの前記主メモリを、前記第１のコンピュータの前記主メモリのシャドウ・メモリとして機能させるステップと、から成ることを特徴とする方法。７．更に、前記第１のコンピュータにおいて第１のアプリケーションを処理するステップと、前記第１のコンピュータの障害時に、前記第２のコンピュータにおいて前記第１のアプリケーションを処理するステップと、を含むことを特徴とする請求項６記載の方法。８．更に、前記第２のコンピュータにおいて第２のアプリケーションを処理するステップと、前記第１のコンピュータの障害時に、前記第２のアプリケーションの処理を終了するステップと、を含むことを特徴とする請求項７記載の方法。９．更に、前記第２のコンピュータにおいて第２のアプリケーションを処理するステップと、前記第１のコンピュータの障害時に、前記第２のコンピュータにおいて、前記第１および第２のアプリケーション双方を処理するステップと、を含むことを特徴とする請求項７記載の方法。１０．更に、前記第２のコンピュータの前記主メモリに書き込まれたデータを捕獲するステップと、前記データを、データ・リンクを通じて、前記第１コンピュータのバッファ・メモリに転送するステップと、前記第２のコンピュータの前記プロセッサの前記キャッシュを消去し、前記キャッシュ内に含まれているデータを、前記第２コンピュータの前記主メモリと、前記第１のコンピュータの前記バッファ・メモリとに書き込むステップと、前記第１のコンピュータのバッファ・メモリから、前記第１のコンピュータの前記主メモリにデータをコピーすることにより、チェックポイントを確立し、前記第１のコンピュータの主メモリを、前記第２のコンピュータの前記主メモリのシャドウ・メモリとして機能させるステップと、から成ることを特徴とする方法。[Procedure of Amendment] Article 184-8, Paragraph 1 of the Patent Act [Submission date] December 31, 1997 (Dec. 31, 1997) [Correction contents] 1. A computer system, A first computer for processing at least a first application, A processor having a cache, internal registers, and an input / output event queue. And A main memory subsystem coupled to the processor and the cache; , Data coupled to the main memory subsystem and written to the main memory A write buffer that captures An external port, The write buffer coupled to the write buffer and the external port; Interface for transferring data in the external computer to the external port of the first computer. A face control unit (122); Wherein the processor includes the cache, the internal register, and the input / output. 
Means for flushing the input/output event queue to the main memory subsystem; and means for issuing a checkpoint command after the internal registers and the input/output event queue have been flushed; a data communication link coupled to the external port of the first computer; and a second computer located remotely from the first computer, the second computer comprising: an external port coupled to the data communication link; a remote interface control unit (122), coupled to the external port of the second computer, for receiving data from the interface controller of the first computer; a main memory subsystem; and a buffer memory, coupled to the interface controller of the second computer and to the main memory of the second computer, for receiving data transferred from the write buffer of the first computer; wherein, upon receipt of the checkpoint command from the first computer, the data stored in the buffer memory is transferred to the main memory subsystem of the second computer, so that the second computer maintains a consistent state from which processing of the first application on the first computer can be restarted or moved.

2. The computer system according to claim 1, wherein the second computer further comprises a processor coupled to its main memory subsystem and, in the event of a failure of the first computer, the second computer continues processing the first application of the first computer without loss of data.

3. The computer system according to claim 2, wherein the second computer processes at least a second application, and processing of the second application is terminated when a failure occurs in the first computer.

4. The computer system according to claim 2, wherein the second computer processes at least a second application and, if the first computer fails, processes both the first and second applications.

5. The computer system according to claim 2, wherein the second computer further comprises a write buffer, coupled to its main memory subsystem, that captures data written to the main memory subsystem of the second computer; the interface controller of the second computer transfers the data contained in this write buffer to the first computer; and the main memory subsystem of the first computer functions as shadow memory for the main memory of the second computer.

6. A method of providing fault tolerance in a computer system having first and second computers, each of which has a processor with a cache, internal registers and an input/output event queue, an external port with a remote interface control unit (122), a main memory, and a buffer memory, the method comprising: capturing data written to the main memory of the first computer; transferring the data, by the remote interface control unit, over a data link to the buffer memory of the second computer; flushing the cache, the internal registers and the input/output event queue of the processor of the first computer, and writing the data they contain to the main memory of the first computer and to the buffer memory of the second computer; and establishing a checkpoint by copying the data from the buffer memory of the second computer to the main memory of the second computer, whereby the main memory of the second computer functions as shadow memory for the main memory of the first computer.

7. The method according to claim 6, further comprising: processing a first application on the first computer; and, upon failure of the first computer, processing the first application on the second computer.

8. The method according to claim 7, further comprising: processing a second application on the second computer; and terminating the processing of the second application when the first computer fails.

9. The method according to claim 7, further comprising: processing a second application on the second computer; and, upon failure of the first computer, processing both the first and second applications on the second computer.

10. The method further comprising: capturing data written to the main memory of the second computer; transferring the data over a data link to the buffer memory of the first computer; flushing the cache of the processor of the second computer and writing the data contained in the cache to the main memory of the second computer and to the buffer memory of the first computer; and establishing a checkpoint by copying the data from the buffer memory of the first computer to the main memory of the first computer, whereby the main memory of the first computer functions as shadow memory for the main memory of the second computer.
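The checkpoint protocol recited in the claims above (capture every write, forward it to a remote buffer memory, flush the cache, registers and I/O event queue, then commit the buffer into shadow memory) can be illustrated with a minimal Python sketch. It is a software model under stated assumptions, not the patented hardware: memory is modeled as a dictionary, and the names `RemoteCheckpointMemory`, `PrimaryComputer`, `on_checkpoint_command`, `take_over`, `REG_AREA` and `IO_AREA` are hypothetical. Cache coherence, the link protocol and the interface controllers are omitted.

```python
from dataclasses import dataclass, field

# Hypothetical main-memory keys where flushed processor state is saved (not from the patent).
REG_AREA = "reg"
IO_AREA = "io"


@dataclass
class RemoteCheckpointMemory:
    """Second computer: a buffer memory plus a shadow of the first computer's main memory."""
    buffer: dict = field(default_factory=dict)  # writes captured since the last checkpoint
    shadow: dict = field(default_factory=dict)  # first computer's main memory as of the last checkpoint

    def receive(self, key, value):
        # Data arriving over the link is held in the buffer and is not yet committed.
        self.buffer[key] = value

    def on_checkpoint_command(self):
        # Copy buffered data into the shadow memory, establishing a new checkpoint.
        self.shadow.update(self.buffer)
        self.buffer.clear()

    def take_over(self):
        # On failure of the first computer, discard uncommitted writes and return
        # the last consistent state from which processing can be resumed.
        self.buffer.clear()
        return dict(self.shadow)


@dataclass
class PrimaryComputer:
    remote: RemoteCheckpointMemory
    main_memory: dict = field(default_factory=dict)
    cache: dict = field(default_factory=dict)      # dirty lines not yet written back
    registers: dict = field(default_factory=dict)  # internal processor registers
    io_queue: list = field(default_factory=list)   # I/O events queued since the last checkpoint

    def write(self, key, value):
        # The write buffer captures every write to main memory and forwards a copy.
        self.main_memory[key] = value
        self.remote.receive(key, value)

    def checkpoint(self):
        # 1. Flush cache, registers and I/O event queue into main memory, which
        #    (through the write buffer) also places them in the remote buffer memory.
        for key, value in self.cache.items():
            self.write(key, value)
        self.cache.clear()
        self.write(REG_AREA, dict(self.registers))
        self.write(IO_AREA, tuple(self.io_queue))
        self.io_queue.clear()
        # 2. Only after the flush, issue the checkpoint command to the second computer.
        self.remote.on_checkpoint_command()


# Usage: a write made after the checkpoint is lost on failure; checkpointed state survives.
remote = RemoteCheckpointMemory()
cpu = PrimaryComputer(remote)
cpu.write(0x10, 1)
cpu.cache[0x20] = 2
cpu.registers["pc"] = 100
cpu.io_queue.append("disk-write#1")
cpu.checkpoint()
cpu.write(0x10, 99)         # buffered remotely, never committed
state = remote.take_over()  # first computer fails here
assert state[0x10] == 1 and state[0x20] == 2
assert state[REG_AREA] == {"pc": 100} and state[IO_AREA] == ("disk-write#1",)
```

The ordering in `checkpoint` is the point of the sketch: the checkpoint command is sent only after all volatile processor state has reached main memory, so the shadow copy never mixes pre- and post-checkpoint data.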


Claims

1. A computer system comprising: a processor; and a main memory subsystem coupled to the processor, the main memory subsystem comprising: a primary memory element from which data is read and to which data is written by the processor; a write buffer that monitors each write by the processor to the primary memory element and stores buffer data relating to the written data; and means for using the buffer data to ensure that a consistent checkpoint state, from which processing can be resumed after a failure without compromising data integrity or processing continuity, is present in the main memory subsystem.

2. The computer system according to claim 1, wherein the processor and the primary memory element are located in a first computer, and the write buffer is located in a second computer connected to the first computer by a communication link.

3. A computer system comprising: a first computer for processing at least a first application, the first computer comprising: a processor; a main memory subsystem coupled to the processor; a write buffer, coupled to the main memory subsystem, that captures data written to the main memory; an external port; and an interface controller, coupled to the write buffer and the external port, for transferring data in the write buffer to the external port of the first computer; a data communication link coupled to the external port of the first computer; and a second computer comprising: an external port coupled to the data communication link; an interface controller, coupled to the external port of the second computer, for receiving data from the interface controller of the first computer; a main memory subsystem; and a buffer memory, coupled to the interface controller of the second computer and to the main memory of the second computer, for receiving data written to the main memory of the first computer; wherein all data stored in the buffer memory is transferred to the main memory of the second computer upon receipt of a command from the first computer, so that the main memory of the second computer functions as shadow memory for the main memory of the first computer.

4. The computer system according to claim 3, wherein the second computer further comprises a processor coupled to its main memory subsystem and, in the event of a failure of the first computer, the second computer continues processing the first application of the first computer without loss of data.

5. The computer system according to claim 4, wherein the second computer processes at least a second application, and processing of the second application is terminated when a failure occurs in the first computer.

6. The computer system according to claim 4, wherein the second computer processes at least a second application and, if the first computer fails, processes both the first and second applications.

7. The computer system according to claim 4, wherein the second computer further comprises a write buffer, coupled to its main memory subsystem, that captures data written to the main memory subsystem of the second computer; the interface controller of the second computer transfers the data contained in this write buffer to the first computer; and the main memory subsystem of the first computer functions as shadow memory for the main memory of the second computer.

8. A computer system comprising: a plurality of computers, each of which, with the exception of one spare computer, performs data processing tasks; and a data communication network coupled to each of the plurality of computers and connecting them in a logical ring; wherein, in the event of a failure of one of the computers executing a data processing task, the data processing tasks of the computers executing data processing tasks are performed by the functioning computers of the plurality, including the spare computer, without loss of data.

9. The computer system according to claim 8, further comprising a plurality of dual-port input/output devices, each of which is coupled to at least two of the plurality of computers.

10. The computer system according to claim 8, wherein, for the computers performing data processing tasks that lie between the failed computer and the spare computer, the data processing task of each such computer is executed by its adjacent computer, and the spare computer performs the task of one of these computers.

11. A method of providing fault tolerance in a computer system having first and second computers, each of which has a processor with a cache, an external port, a main memory, and a buffer memory, the method comprising: capturing data written to the main memory of the first computer; transferring the data over a data link to the buffer memory of the second computer; flushing the cache of the processor of the first computer and writing the data contained in the cache to the main memory of the first computer and to the buffer memory of the second computer; and establishing a checkpoint by copying the data from the buffer memory of the second computer to the main memory of the second computer, whereby the main memory of the second computer functions as shadow memory for the main memory of the first computer.

12. The method according to claim 11, further comprising: processing a first application on the first computer; and, upon failure of the first computer, processing the first application on the second computer.

13. The method according to claim 12, further comprising: processing a second application on the second computer; and terminating the processing of the second application when the first computer fails.

14. The method according to claim 12, further comprising: processing a second application on the second computer; and, upon failure of the first computer, processing both the first and second applications on the second computer.

15. The method further comprising: capturing data written to the main memory of the second computer; transferring the data over a data link to the buffer memory of the first computer; flushing the cache of the processor of the second computer and writing the data contained in the cache to the main memory of the second computer and to the buffer memory of the first computer; and establishing a checkpoint by copying the data from the buffer memory of the first computer to the main memory of the first computer, whereby the main memory of the first computer functions as shadow memory for the main memory of the second computer.

16. A method of providing fault tolerance in a computer system having a plurality of computers and a data communication network coupling the plurality of computers to form a logical ring, the method comprising: processing an application on each of the plurality of computers; detecting that at least one of the plurality of computers has failed; and loading the application of the at least one failed computer onto at least one other computer of the plurality and processing it there.

17. The computer system according to claim 1, further comprising an input/output subsystem that receives input/output events initiated by the processor, the processor having means for queuing input/output events between checkpoints and, when a checkpoint is to be established, flushing the queued events to the primary memory element, so that the input/output events are captured in the checkpoint data in the main memory subsystem.
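The logical-ring recovery described in claims 8, 10 and 16, in which tasks shift one position toward the spare computer, can be sketched in Python as follows. This is an illustrative model only. It assumes, hypothetically, that computer `(i + 1) % n` on the ring holds the checkpoint of computer `i`, which is one reading of the "adjacent computer" language in claim 10; the function name `reassign_after_failure` and the list-of-tasks representation are likewise assumptions, not taken from the patent.

```python
def reassign_after_failure(tasks, failed, spare):
    """Return the task assignment on a logical ring after computer `failed` goes down.

    tasks[i] is the task run by computer i; the spare computer's entry is None.
    Assumption: computer (i + 1) % n holds computer i's checkpoint, so a task
    can only move one step forward around the ring.
    """
    n = len(tasks)
    if tasks[spare] is not None:
        raise ValueError("the spare computer must not be running a task")
    if failed == spare:
        raise ValueError("failure of the spare needs no task reassignment")
    new = list(tasks)
    new[failed] = None  # the failed computer no longer runs anything
    i = failed
    # Walk from the failed computer to the spare: each computer on the way takes
    # over its predecessor's task, and the spare finally takes one task too.
    while i != spare:
        nxt = (i + 1) % n
        new[nxt] = tasks[i]
        i = nxt
    return new


# Computer 1 fails; tasks B, C, D each move one step, and D lands on the spare (computer 4).
print(reassign_after_failure(["A", "B", "C", "D", None], failed=1, spare=4))
```

The design choice this illustrates is that no task moves more than one hop, so each takeover only needs the checkpoint already held by the adjacent computer; computers outside the span between the failed computer and the spare keep running undisturbed.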