JP4146045B2 - Electronic computer

Info

Publication number: JP4146045B2
Application number: JP25760999A
Authority: JP
Inventors: 泰彦黒澤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-09-10
Filing date: 1999-09-10
Publication date: 2008-09-03
Anticipated expiration: 2019-09-10
Also published as: JP2001084178A

Description

【０００１】
【発明の属する技術分野】
この発明は、チェックポイント・ロールバック方式の高信頼性機構によるチェックポイント処理やロールバック処理のオーバーヘッドを大幅に小さくする電子計算機に関する。
【０００２】
【従来の技術】
近年、様々な分野で業務処理の電子化が図られており、コンピュータの信頼性や耐障害性に対する要求は日々高まる一方である。そして、この耐障害性を実現するコンピュータとして、いわゆるフォールトトレラント型コンピュータが存在する。
【０００３】
このフォールトトレラント型コンピュータでは、ハードウェアを多重化し、演算結果を比較して信頼性を向上させる多重化方式と、定期的にチェックポイントとよばれる安定した状態を取り、障害が発生したらチェックポイント状態まで計算機の状態を戻すことによって正しい演算結果を得るチェックポイント・ロールバック方式との２つの方式がよく知られている。
【０００４】
このうち、多重化方式は、たとえば「ノンストップコンピュータのしくみがわかる本」（麹町ＦＴＣ研究会編、工業調査会：１９９２）に記載のタンデム社ＮｏｎＳｔｏｐＣｙｃｌｏｎｅシリーズ、ストラタス社ＸＡシリーズ等に代表されるように、疎結合または密結合のマルチプロセッサによる高信頼性コンピュータで広く実用化されている。
【０００５】
一方、チェックポイント・ロールバック方式は、たとえば特開平９−６７３１号公報に記載のメモリ状態復元装置や特開平５−６３０８号公報に記載のキャッシュコントローラ並びにフォールト・トレラント・コンピュータおよびそのデータ転送方式で示される高信頼性コンピュータシステムの構築方式である。
【０００６】
【発明が解決しようとする課題】
しかしながら、多重化方式によるフォールトトレラント型コンピュータは、プロセッサ等のリソースが複数必要であるためにコストが高くなり、また、チェックポイント・ロールバック方式によるフォールトトレラント型コンピュータは、チェックポイントを採るためのオーバーヘッドが大きく、多重化方式に比べると高性能を得にくいといった問題があった。
【０００７】
この発明は、このような実情を考慮してなされたものであり、チェックポイント・ロールバック方式の高信頼性機構によるチェックポイント処理やロールバック処理のオーバーヘッドを大幅に小さくする電子計算機を提供することを目的とする。
【０００８】
【課題を解決するための手段】
前述した目的を達成するために、この発明の電子計算機は、チェックポイント・ロールバック方式の高信頼化機構を有する電子計算機において、同一のアドレス空間を構成する第１および第２の記憶セル群と、前記第１および第２の記憶セル群が接続されるバッファと、前記バッファにデータを読み出す記憶セル群を前記第１および第２の記憶セル群の中から選択する第１の選択手段と、前記バッファのデータを書き込む記憶セル群を前記第１および第２の記憶セル群の中から選択する第２の選択手段と、を有するデータ記憶素子と、通常動作時、前記第１の記憶セル群のデータを前記バッファを介して同第１の記憶セル群のみに書き戻すと共に、前記第２の記憶セル群のデータを前記バッファを介して同第２の記憶セル群のみに書き戻すデータ保護手段と、チェックポイント時、前記第１の記憶セル群のデータを前記バッファを介して同第１の記憶セル群および第２の記憶セル群に同時に書き戻すデータ複製手段と、ロールバック時、前記第２の記憶セル群のデータを前記バッファを介して同第２の記憶セル群および第１の記憶セル群に同時に書き戻すデータ復元手段と、を具備することを特徴とする。
【０００９】
また、この発明の電子計算機は、チェックポイント・ロールバック方式の高信頼化機構を有する電子計算機において、同一のアドレス空間を構成する第１および第２の記憶セル群と、入力に前記第１の記憶セル群が接続され、出力に前記第１および第２の記憶セル群が接続される、前記第１の記憶セル群から読み出されたデータを格納する第１のバッファと、入力に前記第２の記憶セル群が接続され、出力に前記第１および第２の記憶セル群が接続される、前記第２の記憶セル群から読み出されたデータを格納する第２のバッファと、前記第１のバッファのデータを書き込む記憶セル群を前記第１および第２の記憶セル群の中から選択する第１の選択手段と、前記第２のバッファのデータを書き込む記憶セル群を前記第１および第２の記憶セル群の中から選択する第２の選択手段と、を有するデータ記憶素子と、通常動作時、前記第１の記憶セル群のデータを前記第１のバッファを介して同第１の記憶セル群のみに書き戻すと共に、前記第２の記憶セル群のデータを前記第２のバッファを介して同第２の記憶セル群のみに書き戻すデータ保護手段と、チェックポイント時、前記第１の記憶セル群のデータを前記第１のバッファを介して同第１の記憶セル群および第２の記憶セル群に同時に書き戻すデータ複製手段と、ロールバック時、前記第２の記憶セル群のデータを前記第２のバッファを介して同第２の記憶セル群および第１の記憶セル群に同時に書き戻すデータ復元手段と、を具備することを特徴とする。
【００１８】
【発明の実施の形態】
以下、図面を参照してこの発明の実施形態を説明する。
【００１９】
（第１実施形態）
まず、この発明の第１実施形態を説明する。
【００２０】
図１は、この発明の第１実施形態に係るチェックポイント・ロールバック方式のフォールト・トレラント・コンピュータの概略構成を示す図である。
【００２１】
図１に示すように、この第１実施形態のフォールト・トレラント・コンピュータには、システムバスＡが敷設されており、このシステムバスＡには、複数のＣＰＵ１とメモリコントローラ２とが接続されている。また、このメモリコントローラ２には、システムメモリ３が接続されている。
【００２２】
ＣＰＵ１は、密結合対系型マルチプロセッサ（ＳＭＰ）を構成している。すべてのＣＰＵ１はチェックポイントで同期をとり、内部情報をシステムメモリ３の予め定められた領域に書き込んでいる。さらに、ＣＰＵ１のキャッシュメモリのダーティ領域をシステムメモリ３に書き込み、システムの安定した状態（チェックポイント取得状態）をシステムメモリ３に作る。
【００２３】
チェックポイント取得状態が確定すると、システムメモリ３のプライマリ領域をセカンダリ領域にコピーするコマンドをメモリコントローラ２に送り、コピーが終了したことを確認すると再び通常動作状態に戻る。
【００２４】
メモリコントローラ２は、システムメモリ３を駆動制御するものであり、ＣＰＵ１からの指示にしたがって、システムメモリ３からのデータの読み出しやシステムメモリ３へのデータの書き込みを実行する。また、このメモリコントローラ２は、たとえばＤＲＡＭにおけるリフレッシュ動作など、システムメモリ３に格納されたデータの保護も実行制御する。
【００２５】
システムメモリ３は、複数のＣＰＵ１から共有される、このフォールト・トレラント・コンピュータの主記憶となるメモリデバイスであり、ＣＰＵ１によって実行制御されるプログラムや処理データを格納する。なお、このフォールト・トレラント・コンピュータでは、このシステムメモリ３をＤＲＡＭにより構成している。
【００２６】
そして、この発明は、このシステムメモリ３を構成するデータ記憶素子（ここではＤＲＡＭ）が、同一のアドレス空間を構成する記憶セル群をデータ記憶素子内部に複数組備えることで、このデータ記憶素子内にプライマリメモリ３ａとセカンダリメモリ３ｂの２つの記憶領域をもつことを可能とし、その結果、共有リソースであるシステムバスＡを用いることなく、チェックポイント処理やロールバック処理を行なえるようにしたことにより、これらのオーバヘッドを大幅に小さくした点を特徴としている。
【００２７】
そのために、このフォールト・トレラント・コンピュータでは、システムメモリ３を構成するＤＲＡＭに、リフレッシュ時に読み出しを行なうメモリセル（プライマリまたはセカンダリ）を指定する機能と、リフレッシュによる書き戻しを読み出し元のセルのみに行なうか、または、両方のセルに対して行なうかを指定する機能とを付加する。
【００２８】
図２は、このシステムメモリ３を構成するＤＲＡＭの２つのメモリセルおよびセンスアンプとその動作とを模式的に示したものである。なお、この図２には示していないが、リフレッシュ時にセンスアンプ３０に読み込むメモリセルの指定は、ＤＲＡＭの外部ピンで行なう。この外部ピンは、ＤＲＡＭの行アドレス選択用ピンと同等の機能をもつ。便宜上、このピンを「センスアンプ・リード・セレクタ・ピン」と呼ぶことにする。また、リフレッシュによる書き戻しを読み込んだセル側だけに行なうか、または、（プライマリとセカンダリの）両方のセルに対して行なうかを指定するためのピンも別に設けるものとする。便宜上、このピンを「センスアンプ・ライト・セレクタ・ピン」と呼ぶことにする。そして、ここでは簡単のため、「センスアンプ・ライト・セレクタ・ピン」は、各メモリセルに対して１ビットずつ割り当てられているものとする。
【００２９】
このＤＲＡＭでは、１つのセンスアンプ３０に２つのＤＲＡＭセル（１０，２０）が接続されており、通常はプライマリ側のメモリセル１０に対してデータをリード／ライトする。通常のリード時には、「センスアンプ・リード・セレクタ・ピン」は、プライマリ・セル１０側を選択している。また、通常のライト時には、「センスアンプ・ライト・セレクタ・ピン」は、プライマリ側だけに書き込むようにセットされている。
【００３０】
まず、このＤＲＡＭの通常動作について説明する。
【００３１】
このＤＲＡＭが通常のＤＲＡＭと異なる点は、リフレッシュ時のデータ更新動作にある。図３にプライマリ側のリフレッシュ動作を示し、図４に、セカンダリ側のリフレッシュ動作を示す。
【００３２】
リフレッシュ周期がチェックポイントよりも前に来た場合、ＤＲＡＭは、標準的なＤＲＡＭと同様の方法でリフレッシュを行なう。即ち、プライマリ側１０のリフレッシュとセカンダリ側２０のリフレッシュとを独立して行ない、書き戻しは読み出した側のセルに対してのみ行なう。
【００３３】
図３に示したプライマリ側セル１０のリフレッシュ時には、「センスアンプ・リード・セレクタ・ピン」によりセンスアンプ３０にプライマリ・セル１０側のデータを読み込み、「センスアンプ・ライト・セレクタ・ピン」はセンスアンプ３０の内容をプライマリ側だけに書き込むようにそれぞれセットされている。
【００３４】
一方、図４に示したセカンダリ側セル２０のリフレッシュ時には、「センスアンプ・リード・セレクタ・ピン」はセカンダリ・セル２０側を読み込み、「センスアンプ・ライト・セレクタ・ピン」はセカンダリ側セル２０にだけに書き込むようにそれぞれセットされている。
【００３５】
行アドレスは、通常のＣＡＳｂｅｆｏｒｅＲＡＳリフレッシュと同様にＤＲＡＭ内部で生成され、プライマリ側行アドレス（ＰＲｎ）とセカンダリ側行アドレス（ＳＲｎ）とに同じ値が供給される。
【００３６】
次に、このＤＲＡＭのチェックポイント動作について説明する。
【００３７】
図５にチェックポイント時のＤＲＡＭのリフレッシュ動作を示す。このＤＲＡＭでは、チェックポイント時にプライマリ側１０のデータをセンスアンプ３０に読み出し、プライマリ側１０とセカンダリ側２０との両方にデータを書き込むリフレッシュ動作を行なう。これにより、リフレッシュ動作にチェックポイント動作を隠蔽できるので、チェックポイント処理のオーバーヘッドを低減することができる。
【００３８】
図５に示したプライマリ側セル１０の内容をセカンダリ側セルにコピーするチェックポイント時のリフレッシュでは、「センスアンプ・リード・セレクタ・ピン」はプライマリ・セル１０側を、「センスアンプ・ライト・セレクタ・ピン」はプライマリ側セル１０とセカンダリ側セル２０との両方に書き込むようにそれぞれセットされている。また、行アドレスは、通常のＣＡＳｂｅｆｏｒｅＲＡＳリフレッシュと同様にＤＲＡＭ内部で生成され、プライマリ側行アドレス（ＰＲｎ）とセカンダリ側行アドレス（ＳＲｎ）とに同じ値が供給される。
【００３９】
次に、このＤＲＡＭのロールバック動作について説明する。
【００４０】
図６にロールバック時のＤＲＡＭのリフレッシュ動作を示す。この場合、セカンダリ側のセル２０からセンスアンプ３０にデータを読み出し、セカンダリ側２０とプライマリ側１０との両方にデータを書き戻すリフレッシュ動作をさせる。この動作により、メモリは、前のチェックポイント状態に迅速に戻ることができる。
【００４１】
図６に示したセカンダリ側セル２０の内容をプライマリ側セル１０にコピーするロールバック処理のリフレッシュでは、「センスアンプ・リード・セレクタ・ピン」はセカンダリ側セル２０側を、「センスアンプ・ライト・セレクタ・ピン」はプライマリ側セル１０とセカンダリ側セル２０との両方に書き込むようにそれぞれセットされている。また、行アドレスは、通常のＣＡＳｂｅｆｏｒｅＲＡＳリフレッシュと同様にＤＲＡＭ内部で生成され、プライマリ側行アドレス（ＰＲｎ）とセカンダリ側行アドレス（ＳＲｎ）とに同じ値が供給される。
【００４２】
以上の各リフレッシュ動作における動作を図７に示す。図７において、「センスアンプ・ライト・セレクタ・ピン」は、“（プライマリ側セルの選択、セカンダリ側セルの選択）”の形式で記載し、“１”のときアクティブ、“０”のときインアクティブとする。
【００４３】
また、図８に一般的な計算機のリフレッシュのタイミングを、図９にこの第１実施形態のＤＲＡＭを適用したフォールド・トレラント・コンピュータのリフレッシュのタイミングを示す。これらの図では、リフレッシュタイミングを簡易的に縦線で示している。
【００４４】
一般的な計算機では、ＤＲＡＭの規定により各行アドレスが６４ｍｓに１回以上の頻度でアクセスしなくてはならない。このため、普通はメモリコントローラが１５ｕｓに１回程度の頻度でリフレッシュ割り込みを入れ、ＤＲＡＭのリフレッシュを等間隔で実行する（図８）。
【００４５】
一方、チェックポイントロールバック方式の計算機では、チェックポイントを約１０ｍｓに１回の割合でとる。すなわち、チェックポイント間隔とリフレッシュ間隔とがほぼ同等であるので、この第１実施形態のＤＲＡＭでは、リフレッシュは原則として図９（ａ）に示すようにチェックポイント処理時に行なう。これにより、通常動作時にはリフレッシュによる割り込みが発生せず、メモリアクセスを高速に実行できる。
【００４６】
ただし、図９（ｂ）に示すようにチェックポイント間隔が規定のリフレッシュ間隔より長くなってしまう場合には、一般の計算機の記憶装置と同様にリフレッシュ割り込みを入れてリフレッシュを実行する。このため、たとえば規定のリフレッシュ周期６４ｍｓの３／４である４８ｍｓを越えると通常のリフレッシュ周期１５ｕｓの４倍の頻度、すなわち３．７５ｕｓに１回の割合でリフレッシュ割り込みをいれてリフレッシュ動作を行ない、ＤＲＡＭのデータを保証する。チェックポイントが４８ｍｓ以上６４ｍｓ未満で採られる場合には、チェックポイント処理に入る。一方、チェックポイントが６４ｍｓ以内で採られない場合には、リフレッシュ割り込み用の４８ｍｓカウンタをリセットして次のチェックポイント割り込みがリフレッシュ割り込みを越えないかをチェックする。
【００４７】
ここで、このＤＲＡＭのリフレッシュ方式の選択手順を図１０のフローチャートを参照しながら説明する。
【００４８】
まず、ロールバックの実行かどうかを調べ（ステップＡ１）、ロールバックの実行であれば（ステップＡ１のＹＥＳ）、セカンダリセルの内容をプライマリセルにコピーするリフレッシュ方式を選択する（ステップＡ２）。ロールバックの実行でなければ（ステップＡ１のＮＯ）、続いて、チェックポイントの採取かどうかを調べ（ステップＡ３）、チェックポイントの採取であれば（ステップＡ３のＹＥＳ）、プライマリセルの内容をセカンダリセルにコピーするリフレッシュ方式を選択する（ステップＡ４）。
【００４９】
また、チェックポイントの採取でなければ（ステップＡ３のＮＯ）、リフレッシュ規定時間内かどうかを調べ（ステップＡ５）、リフレッシュ規定時間内でなければ（ステップＡ５のＮＯ）、さらに、リフレッシュが完了しているかどうかを調べる（ステップＡ６）。そして、すでにリフレッシュが完了していれば（ステップＡ６のＹＥＳ）、リフレッシュタイマを“０”にリセットし（ステップＡ７）、一方、リフレッシュが完了していなければ（ステップＡ６のＮＯ）、通常のリフレッシュ方式を選択する（ステップＡ８）。
【００５０】
このように、このフォールト・トレラント・コンピュータでは、システムメモリ３を構成するＤＲＡＭが同一のアドレス空間を表すプライマリメモリ３ａとセカンダリメモリ３ｂの２つの記憶領域を構成するため、共有リソースであるシステムバスＡを用いることなく、チェックポイント処理やロールバック処理を行なうことを可能とし、これらのオーバヘッドを大幅に小さくする。
【００５１】
（第２実施形態）
次に、この発明の第２実施形態を説明する。
【００５２】
この第２実施形態のフォールト・トレラント・コンピュータの特徴は、前述の第１実施形態のフォールト・トレラント・コンピュータのＤＲＡＭにエラー検出／訂正回路（単一誤り訂正二重誤り検出方式ＥＣＣ）と、エラー検出／訂正イネーブルレジスタと、エラー通知部（エラーフラグ、エラーアドレスレジスタ）とを付加した点にある。
【００５３】
エラー検出／訂正回路は、ここでは、単一誤り訂正二重誤り検出方式ＥＣＣ（ＳＥＣ−ＤＳＤＥＣＣ）とする。ＤＲＡＭの列アドレス（ＣｏｌｕｍｎＡｄｄｒｅｓｓ）は、１Ｇビット（＝２０³⁰ビット）程度のＤＲＡＭで√２０³⁰＝２¹⁵ビット程度あるので、６４（＝２⁶）ビット程度のメモリバスより効率よくチェックビットを作ることができる。実際、理論上は２⁶ビットのデータに対するチェックビットは８ビット必要で、全データの１２％を占めるのに対し、２¹⁵ビットのデータに対するチェックビットは１７ビットであり、全データの０．５％にすぎない（「誤り符号化技術の要点」４章コンピュータへの応用，今井秀樹監修，日本工業技術センター参照）。
【００５４】
エラー検出／訂正イネーブルレジスタは、たとえば２ビットで構成し、ビット０を“１”にするとＥＣＣによるエラー検出を行ない、“０”にするとエラー検出を行なわない。また、ビット１を“１”にするとＥＣＣによる単一誤り訂正を行ない、“０”にすると誤り訂正を行なわない。
【００５５】
エラー通知部は、エラー発生の有無、エラーが訂正可能であったか否か、エラー発生の行アドレス（ＲｏｗＡｄｄｒｅｓｓ）を保持する。
【００５６】
この第２実施形態のＤＲＡＭは、データをセンスアンプからメモリセルに書き戻すときにＥＣＣチェックビットを生成してメモリセルに書き込み、データをメモリセルからセンスアンプに読み込むときにはＥＣＣチェックを行なう。このとき、エラー検出／訂正イネーブルレジスタにしたがって、エラー検出や訂正を行なう。すなわち、エラー検出がイネーブルであれば、エラー発生時にエラーフラグを“１”にセットし、エラーアドレスレジスタにエラーアドレスを保存し、エラー検出がディセーブルであれば、エラー発生時にもエラーフラグ、エラーアドレスレジスタを更新しない。なお、通常の高信頼性計算機のＥＣＣで行なわれているように、２重ビット誤り検出レジスタがセットされていれば、単一ビット誤り訂正時にエラーフラグやエラーアドレスレジスタを更新しないようにしてもよい。
【００５７】
チェックポイント処理またはロールバック処理のリフレッシュ動作において単一誤りを検出した場合は、誤りを訂正した後に両方のメモリセルに書き込む。また、２重誤りを検出した場合は、読み出した側のセルにのみデータを書き戻し、反対側のセルにはデータを書き戻さない。なお、エラー検出／訂正イネーブルレジスタは無くても構わない。
【００５８】
図１１にエラーチェックビットを持つ場合のＤＲＡＭの構成を示す。なお、図１１では省略したが、行アドレス（ＰＲｎ）のデータ１００とチェックビット１１０とをセンスアンプ３０に読み出すときに、ＥＣＣチェック回路を通してエラーチェックを行なう。単一ビット誤りが発生していれば、そのビットを反転させてセンスアンプ３０に書き込む。センスアンプからプライマリ１０側の行アドレス（ＰＲｎ）のメモリセルおよびセカンダリ２０側の行アドレス（ＳＲｎ）のメモリセルに書き込むときには、センスアンプ３０のデータ部１００からチェックビット１１０を生成して書き込む。すなわち、エラー情報を残さなければセンスアンプ３０にチェックビット１１０は必要ない。データ部１００からチェックビット１１０を生成するのは、外部からメモリへの書き込みが発生した場合に新しいデータ１００をもとにチェックビット１１０を生成しなおす必要があるためである。
【００５９】
（第３実施形態）
次に、この発明の第３実施形態を説明する。
【００６０】
この第３実施形態のフォールト・トレラント・コンピュータの特徴は、前述の第１実施形態のフォールト・トレラント・コンピュータのＤＲＡＭの各行アドレスに更新ビットを設けた点にあり、チェックポイント時に、この更新ビットの立っているＤＲＡＭアドレスのみをリフレッシュすることでチェックポイント処理を高速化する。
【００６１】
図１２に更新ビットをもつ場合のＤＲＡＭの構成を示す。
【００６２】
よく知られているように、計算機のメモリでは、短い時間にアクセスされるメモリアドレスは小さな領域に集中する参照の局所性がある。したがって、ＤＲＡＭのアドレスを行アドレスが連続になるように実アドレスにマッピングすると、短い時間の間に参照される行アドレスは小さな領域に集中する。よって、更新ビット（または更新管理テーブル：ＭｏｄｉｆｉｅｄＢｌｏｃｋＴａｂｌｅ）１２０を設け、必要な部分のみ前述のリフレッシュ動作による状態更新を行なえば、極めて短い時間にチェックポイント処理を行なうことが可能となる。
【００６３】
（第４実施形態）
次に、この発明の第４実施形態を説明する。
【００６４】
この第４実施形態のフォールト・トレラント・コンピュータの特徴は、前述の第３実施形態のフォールト・トレラント・コンピュータのＤＲＡＭがもつ更新ビットをＤＲＡＭの外部に設けた点にある。ここでいうＤＲＡＭの外部とは、メモリコントローラ内部であっても良いし、メモリコントローラとは独立していてもよい。また、更新ビットはメモリに対する書き込みが発生した場合にセットされるので、メモリ素子の外部に置くこともできる。この場合、管理方法選択の自由度が大きくなるので、たとえば前述の特開平９−６７３１号公報に記載のメモリ状態復元装置等に適用すると、小さなオーバーヘッドで信頼性の高い計算機を実現することができる。
【００６５】
（第５実施形態）
次に、この発明の第５実施形態を説明する。
【００６６】
この第５実施形態のフォールト・トレラント・コンピュータの特徴は、前述の第２の実施形態のフォールト・トレラント・コンピュータのＤＲＡＭの各行アドレスにリフレッシュ完了フラグを設けた点にある。図１３にリフレッシュ完了フラグをもつ場合のＤＲＡＭの構成を示す。
【００６７】
通常のリフレッシュ動作、チェックポイント処理またはロールバック処理によるリフレッシュ、または、通常のリード／ライトアクセスがあれば、このリフレッシュ完了フラグ１３０をセットする。そして、すべてのリフレッシュ完了フラグがセットされたときに、リフレッシュ完了フラグ１３０をリセットする。
【００６８】
リフレッシュ完了フラグ１３０が立っている行アドレスに関しては、ＣＡＳＢｅｆｏｒｅＲＡＳリフレッシュをしないようにすることで、リフレッシュ効率を高くすることができる。
【００６９】
（第６実施形態）
次に、この発明の第６実施形態を説明する。
【００７０】
この第６実施形態のフォールト・トレラント・コンピュータの特徴は、前述の第３実施形態のフォールト・トレラント・コンピュータのＤＲＡＭのセンスアンプを２組設けた点にある。
【００７１】
図１４にセンスアンプを２組設けた場合のＤＲＡＭの構成を示す。
【００７２】
１組のセンスアンプ３１ａの入力は、プライマリセル１０に、もう１組のセンスアンプ３１ｂの入力は、セカンダリセル２０にそれぞれ固定する。よって、第１実施形態で述べた「センスアンプ・リード・セレクタ・ピン」は必要ない。その代わりに、後述するような、メモリセルに書き戻すデータを保持するセンスアンプを指定するための手段を設ける。
【００７３】
プライマリメモリ１０に対する通常のリード／ライトアクセスでは、プライマリ側のセンスアンプ３１ａを用い、一方、セカンダリメモリ２０に対する通常のリード／ライトアクセスでは、セカンダリ側のセンスアンプ３１ｂを用いる。
【００７４】
まず、このＤＲＡＭの通常動作について説明する。
【００７５】
図１４には通常のＤＲＡＭのリフレッシュ動作が示されている。通常のリフレッシュでは、行アドレス（ＰＲｎ）にあるプライマリメモリ１０のデータは、プライマリセンスアンプ３１ａに読み出され、プライマリメモリ１０に書き戻される。同時に対応する行アドレス（ＳＲｎ）にあるセカンダリメモリ２０のデータがセカンダリセンスアンプ３１ｂに読み出され、セカンダリメモリ２０に書き戻される。簡単のため、デフォルトのセンスアンプを使用することを示す「ノーマル・リフレッシュ・ピン」を設けることにする。
【００７６】
「ノーマル・リフレッシュ・ピン」がアクティブのとき、「センスアンプ・ライト・セレクタ・ピン」がアクティブであるメモリセル群にデフォルトのセンスアンプのデータを書き込む。デフォルトのセンスアンプとは、プライマリメモリ１０はプライマリ側のセンスアンプ３１ａ、セカンダリメモリ２０はセカンダリ側のセンスアンプ３１ｂとする。「ノーマル・リフレッシュ・ピン」がアクティブでないとき、どちら側のセンスアンプの値をメモリセルに書き戻すかを決める手段が必要である。今回の場合、プライマリ、セカンダリの２つのメモリセル群しかないので、１ビットあれば十分である。これを「ライトバック・センスアンプ・セレクタ・ピン」と呼ぶことにする。
【００７７】
「ノーマル・リフレッシュ・ピン」がアクティブでないときに、「ライトバック・センスアンプ・セレクタ・ピン」が“プライマリ”であれば、プライマリ・センスアンプ３１ａの内容を、「センスアンプ・ライト・セレクタ・ピン」がアクティブであるメモリセルに書き込むものとする。
【００７８】
なお、このＤＲＡＭの構成においては、メモリセルが２組しかなく、“センスアンプ・ライトセレクタ・ピン”の設定は自明なので、“センスアンプ・ライト・セレクタ・ピン”を設けなくとも構わない。
【００７９】
次に、このＤＲＡＭのチェックポイント動作について説明する。
【００８０】
図１５にチェックポイント時のＤＲＡＭのリフレッシュ動作を示す。チェックポイント処理時には、キャッシュフラッシュを行ない、プロセッサのレジスタ等の情報をプライマリメモリに書き込んで退避した後で、更新ビット１３０の立っているプライマリメモリ１０の行アドレス（ＰＲｎ）のデータをプライマリセンスアンプ３１ａに読み出す。同時に対応する行アドレス（ＳＲｎ）にあるセカンダリメモリ２０のデータが、セカンダリセンスアンプ３１ｂに読み出される。書き込み時には、プライマリセンスアンプ３１ａの内容をプライマリメモリ１０の行アドレス（ＰＲｎ）のセル群だけでなく、セカンダリメモリ２０の対応する行アドレス（ＳＲｎ）のセル群にも書き込む。そして、セカンダリセンスアンプ３１ｂの内容は捨てられる。このために、「ノーマル・リフレッシュ・ピン」をインアクティブにし、「ライトバック・センスアンプ・セレクタ・ピン」を“プライマリ”にする。これにより、プライマリメモリ１０の更新された行アドレス（ＰＲｎ）のデータは、セカンダリメモリ２０の対応する行アドレス（ＳＲｎ）のメモリセルにコピーされる。
【００８１】
次に、このＤＲＡＭのロールバック動作について説明する。
【００８２】
障害発生時には、更新ビット１３０の立っているプライマリメモリ１０の行アドレス（ＰＲｎ）のデータをプライマリセンスアンプ３１ａに読み出す。同時に対応する行アドレス（ＳＲｎ）にあるセカンダリメモリ２０のデータが、セカンダリセンスアンプ３２に読み込まれる。書き込み時には、セカンダリセンスアンプ３１ｂの内容をセカンダリメモリ２０の行アドレス（ＳＲｎ）のセル群だけでなく、プライマリメモリ１０の対応する行アドレス（ＰＲｎ）のセル群にも書き込まれる。そして、プライマリセンスアンプ３１の内容は捨てられる。このために、「ノーマル・リフレッシュ・ピン」をインアクティブにし、「ライトバック・センスアンプ・セレクタ・ピン」を“セカンダリ”にする。これにより、セカンダリメモリ１０の更新された行アドレス（ＳＲｎ）のデータは、プライマリメモリ１０の対応する行アドレス（ＰＲｎ）のメモリセルにコピーされる。
【００８３】
なお、障害発生時には、更新ビット１３０を無視してすべての行アドレスに対してリフレッシュ動作を行なって、セカンダリメモリの内容をプライマリメモリにコピーしても良い。
【００８４】
以上の各リフレッシュ動作における動作を図１６に示す。図１６において、「センスアンプ・ライト・セレクタ・ピン」は、“（プライマリ側セルの選択、セカンダリ側のセルの選択）”の形式で記載し、“１”のときアクティブ、“０”のときインアクティブとする。
【００８５】
なお、前述の第１乃至第６実施形態のフォールト・トレラント・コンピュータでは、ＤＲＡＭのリフレッシュを用いてチェックポイント処理やロールバック処理を行なう例を示したが、データ記憶素子はＤＲＡＭに限られない。たとえばＤＲＡＭの替わりにＳＲＡＭを用いても良いし、プライマリ側をＤＲＡＭ、セカンダリ側をフラッシュメモリのような不揮発性メモリを用いてもよい。
【００８６】
特に、セカンダリ側に不揮発性素子を使用すると、高信頼性計算機のみならずに一般の計算機に使用してレジューム機能を実現する手段として用いても有効である。また、主記憶でなくディスクキャッシュに用いて高信頼性を図ることもできるようになる。
【００８７】
さらに、システム・オン・チップ技術により、プロセッサとメモリとを同一ＬＳＩチップ上に搭載する場合には、メモリ制御回路にこの発明を適用することも可能である。
【００８８】
【発明の効果】
以上詳述したように、この発明によれば、同一のアドレス空間を構成する記憶セル群を複数組設けることにより、単一のデータ記憶素子内にプライマリの記憶領域とセカンダリの記憶領域との２つの記憶領域をもつことを可能としたことから、共有リソースである外部バスを用いることなく、チェックポイント処理をプライマリとして用いる記憶セル群のデータをセカンダリとして用いる記憶セル群にコピーすることにより実行し、また、ロールバック処理をセカンダリとして用いる記憶セル群のデータをプライマリとして用いる記憶セル群にコピーすることにより実行することができるため、チェックポイント処理およびロールバック処理のオーバヘッドを大幅に小さくすることを可能とする。
【図面の簡単な説明】
【図１】この発明の第１実施形態に係るチェックポイント・ロールバック方式のフォールト・トレラント・コンピュータの概略構成を示す図。
【図２】同第１実施形態のシステムメモリ３を構成するＤＲＡＭの２つのメモリセルおよびセンスアンプとその動作とを模式的に示した図。
【図３】同第１実施形態のＤＲＡＭのプライマリ側のリフレッシュ動作を示す図。
【図４】同第１実施形態のＤＲＡＭのセカンダリ側のリフレッシュ動作を示す図。
【図５】同第１実施形態におけるチェックポイント時のＤＲＡＭのリフレッシュ動作を示す図。
【図６】同第１実施形態におけるロールバック時のＤＲＡＭのリフレッシュ動作を示す図。
【図７】同第１実施形態の各リフレッシュ動作における動作を示す図。
【図８】一般的な計算機のリフレッシュのタイミングを示す図。
【図９】同第１実施形態のＤＲＡＭを適用したフォールド・トレラント・コンピュータのリフレッシュのタイミングを示す図。
【図１０】同第１実施形態におけるＤＲＡＭのリフレッシュ方式の選択手順を説明するためのフローチャート。
【図１１】同第２実施形態のエラーチェックビットを持つ場合のＤＲＡＭの構成を示す図。
【図１２】同第３実施形態の更新ビットをもつ場合のＤＲＡＭの構成を示す図。
【図１３】同第５実施形態のリフレッシュ完了フラグをもつ場合のＤＲＡＭの構成を示す図。
【図１４】同第６実施形態のセンスアンプを２組設けた場合のＤＲＡＭの構成および通常のＤＲＡＭのリフレッシュ動作を示す図。
【図１５】同第６実施形態におけるチェックポイント時のＤＲＡＭのリフレッシュ動作を示す図。
【図１６】同第６実施形態の各リフレッシュ動作における動作を示す図。
【符号の説明】
１…ＣＰＵ
２…メモリコントローラ
３…システムメモリ（ＤＲＡＭ）
１０…プライマリセル
２０…セカンダリセル
３０…センスアンプ
３１ａ…プライマリセンスアンプ
３１ｂ…セカンダリセンスアンプ
１００…データ本体
１１０…チェックビット
１２０…更新ビット
１３０…リフレッシュ完了フラグ
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an electronic computer that greatly reduces the overhead of checkpoint processing and rollback processing performed by a checkpoint/rollback high-reliability mechanism.
[0002]
[Prior art]
In recent years, business processing has been computerized in many fields, and demands on the reliability and fault tolerance of computers grow day by day. So-called fault-tolerant computers exist to provide this fault tolerance.
[0003]
Two approaches to fault-tolerant computers are well known: the multiplexing method, which multiplexes hardware and compares operation results to improve reliability, and the checkpoint/rollback method, which periodically captures a stable state called a checkpoint and, when a failure occurs, returns the computer to the checkpoint state to obtain a correct operation result.
[0004]
Of these, the multiplexing method is widely used in practice in high-reliability computers built from loosely coupled or tightly coupled multiprocessors, as typified by the Tandem NonStop Cyclone series and the Stratus XA series described in “A Book for Understanding How Nonstop Computers Work” (edited by the Kojimachi FTC Study Group, Kogyo Chosakai, 1992).
[0005]
The checkpoint/rollback method, on the other hand, is a method of building high-reliability computer systems, as shown for example by the memory state restoration device described in JP-A-9-6731 and by the cache controller, fault-tolerant computer, and data transfer method described in JP-A-5-6308.
[0006]
[Problems to be solved by the invention]
However, a fault-tolerant computer based on the multiplexing method is costly because it needs multiple resources such as processors, while a fault-tolerant computer based on the checkpoint/rollback method incurs a large overhead in taking checkpoints and therefore has difficulty achieving high performance compared with the multiplexing method.
[0007]
The present invention has been made in consideration of these circumstances, and its object is to provide an electronic computer that greatly reduces the overhead of checkpoint processing and rollback processing performed by a checkpoint/rollback high-reliability mechanism.
[0008]
[Means for Solving the Problems]
To achieve the above object, the electronic computer of the present invention is an electronic computer having a checkpoint/rollback high-reliability mechanism, characterized by comprising: a data storage element having first and second memory cell groups constituting the same address space, a buffer to which the first and second memory cell groups are connected, first selecting means for selecting, from the first and second memory cell groups, the memory cell group whose data is read into the buffer, and second selecting means for selecting, from the first and second memory cell groups, the memory cell group into which the data in the buffer is written; data protection means for, during normal operation, writing the data of the first memory cell group back only to the first memory cell group via the buffer and writing the data of the second memory cell group back only to the second memory cell group via the buffer; data replicating means for, at a checkpoint, simultaneously writing the data of the first memory cell group back to the first memory cell group and the second memory cell group via the buffer; and data restoring means for, at rollback, simultaneously writing the data of the second memory cell group back to the second memory cell group and the first memory cell group via the buffer.
[0009]
According to another aspect, the electronic computer of the present invention is an electronic computer having a checkpoint/rollback high-reliability mechanism, characterized by comprising: a data storage element having first and second memory cell groups constituting the same address space, a first buffer that stores data read from the first memory cell group and has the first memory cell group connected to its input and the first and second memory cell groups connected to its output, a second buffer that stores data read from the second memory cell group and has the second memory cell group connected to its input and the first and second memory cell groups connected to its output, first selecting means for selecting, from the first and second memory cell groups, the memory cell group into which the data in the first buffer is written, and second selecting means for selecting, from the first and second memory cell groups, the memory cell group into which the data in the second buffer is written; data protection means for, during normal operation, writing the data of the first memory cell group back only to the first memory cell group via the first buffer and writing the data of the second memory cell group back only to the second memory cell group via the second buffer; data replicating means for, at a checkpoint, simultaneously writing the data of the first memory cell group back to the first memory cell group and the second memory cell group via the first buffer; and data restoring means for, at rollback, simultaneously writing the data of the second memory cell group back to the second memory cell group and the first memory cell group via the second buffer.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0019]
(First embodiment)
First, a first embodiment of the present invention will be described.
[0020]
FIG. 1 is a diagram showing a schematic configuration of a fault-tolerant computer of a checkpoint / rollback system according to the first embodiment of the present invention.
[0021]
As shown in FIG. 1, the fault-tolerant computer of the first embodiment has a system bus A, to which a plurality of CPUs 1 and a memory controller 2 are connected. A system memory 3 is connected to the memory controller 2.
[0022]
The CPUs 1 constitute a tightly coupled symmetric multiprocessor (SMP). All the CPUs 1 synchronize at a checkpoint and write their internal information to a predetermined area of the system memory 3. They also write the dirty areas of their cache memories to the system memory 3, creating a stable system state (the checkpoint acquisition state) in the system memory 3.
[0023]
When the checkpoint acquisition state is established, the CPU 1 sends the memory controller 2 a command to copy the primary area of the system memory 3 to the secondary area and, after confirming that the copy has completed, returns to the normal operation state.
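The checkpoint sequence described in the preceding paragraphs can be sketched as follows. This is only an illustrative model, not the patented implementation: the class names, the `context_base` address, and the dictionary-based memory are all hypothetical, chosen to show the order of the steps (save CPU state, flush dirty cache lines, copy primary to secondary, resume).

```python
from dataclasses import dataclass, field


@dataclass
class SystemMemory:
    """Hypothetical model: primary and secondary areas covering the same addresses."""
    primary: dict = field(default_factory=dict)
    secondary: dict = field(default_factory=dict)


@dataclass
class MemoryController:
    memory: SystemMemory

    def write(self, addr, value):
        # Normal writes only ever touch the primary area.
        self.memory.primary[addr] = value

    def copy_primary_to_secondary(self):
        # Stands in for the copy command issued at a checkpoint.
        self.memory.secondary = dict(self.memory.primary)
        return True  # completion acknowledgement


@dataclass
class Cpu:
    cpu_id: int
    registers: dict = field(default_factory=dict)
    dirty_cache: dict = field(default_factory=dict)


def take_checkpoint(cpus, controller, context_base=0xF000):
    # 1. Every CPU saves its internal state to a predetermined area.
    for cpu in cpus:
        controller.write(context_base + cpu.cpu_id, dict(cpu.registers))
    # 2. Every CPU writes its dirty cache lines back to system memory.
    for cpu in cpus:
        for addr, value in cpu.dirty_cache.items():
            controller.write(addr, value)
        cpu.dirty_cache.clear()
    # 3. Copy primary -> secondary; normal operation resumes once this returns.
    return controller.copy_primary_to_secondary()
```

In this sketch the copy in step 3 is a plain dictionary copy; in the embodiment it is carried out inside the DRAM during refresh, which is what keeps it off the system bus.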
[0024]
The memory controller 2 drives and controls the system memory 3 and executes reading of data from the system memory 3 and writing of data to the system memory 3 in accordance with instructions from the CPU 1. The memory controller 2 also executes and controls protection of data stored in the system memory 3, such as a refresh operation in a DRAM.
[0025]
The system memory 3 is a memory device that is shared by a plurality of CPUs 1 and serves as a main memory of the fault-tolerant computer, and stores programs and processing data that are controlled by the CPU 1. In the fault tolerant computer, the system memory 3 is constituted by a DRAM.
[0026]
The present invention is characterized in that the data storage element constituting the system memory 3 (here, a DRAM) contains multiple sets of memory cell groups constituting the same address space, so that a single data storage element can hold two storage areas, the primary memory 3a and the secondary memory 3b. As a result, checkpoint processing and rollback processing can be performed without using the system bus A, which is a shared resource, and their overhead is greatly reduced.
[0027]
To this end, in this fault-tolerant computer, the DRAM constituting the system memory 3 is given two additional functions: one that designates the memory cell (primary or secondary) to be read at refresh time, and one that designates whether the write-back by refresh is performed only on the cell that was read or on both cells.
[0028]
FIG. 2 schematically shows two memory cells and a sense amplifier of the DRAM constituting the system memory 3, together with their operation. Although not shown in FIG. 2, the memory cell to be read into the sense amplifier 30 at refresh time is designated by an external pin of the DRAM. This external pin has a function equivalent to that of a DRAM row address selection pin. For convenience, this pin is called the “sense amplifier read selector pin”. A separate pin is also provided to designate whether the write-back by refresh is performed only on the cell that was read or on both (primary and secondary) cells. For convenience, this pin is called the “sense amplifier write selector pin”. For simplicity, it is assumed here that one bit of the “sense amplifier write selector pin” is assigned to each memory cell.
[0029]
In this DRAM, two DRAM cells (10, 20) are connected to one sense amplifier 30, and data is normally read from and written to the memory cell 10 on the primary side. During a normal read, the “sense amplifier read selector pin” selects the primary cell 10 side. During a normal write, the “sense amplifier write selector pin” is set to write only to the primary side.
[0030]
First, the normal operation of this DRAM will be described.
[0031]
This DRAM is different from a normal DRAM in the data update operation during refresh. FIG. 3 shows the refresh operation on the primary side, and FIG. 4 shows the refresh operation on the secondary side.
[0032]
If the refresh cycle comes before the checkpoint, the DRAM refreshes in the same way as a standard DRAM. That is, the refresh on the primary side 10 and the refresh on the secondary side 20 are performed independently, and the write-back is performed only on the read side cell.
[0033]
When the primary side cell 10 is refreshed as shown in FIG. 3, the “sense amplifier read selector pin” is set so that the data on the primary cell 10 side is read into the sense amplifier 30, and the “sense amplifier write selector pin” is set so that the contents of the sense amplifier 30 are written only to the primary side.
[0034]
On the other hand, when the secondary side cell 20 is refreshed as shown in FIG. 4, the “sense amplifier read selector pin” is set to read the secondary cell 20 side, and the “sense amplifier write selector pin” is set to write only to the secondary side cell 20.
[0035]
The row address is generated inside the DRAM in the same way as in a normal CAS-before-RAS refresh, and the same value is supplied as the primary side row address (PRn) and the secondary side row address (SRn).
[0036]
Next, the checkpoint operation of this DRAM will be described.
[0037]
FIG. 5 shows the refresh operation of the DRAM at a checkpoint. At a checkpoint, this DRAM performs a refresh operation that reads the data on the primary side 10 into the sense amplifier 30 and writes it to both the primary side 10 and the secondary side 20. Because the checkpoint operation is thereby hidden within the refresh operation, the overhead of checkpoint processing can be reduced.
[0038]
In the checkpoint refresh that copies the contents of the primary side cell 10 shown in FIG. 5 to the secondary side cell, the “sense amplifier read selector pin” is set to select the primary cell 10 side, and the “sense amplifier write selector pin” is set to write to both the primary side cell 10 and the secondary side cell 20. The row address is generated inside the DRAM as in a normal CAS-before-RAS refresh, and the same value is supplied as the primary side row address (PRn) and the secondary side row address (SRn).
[0039]
Next, the rollback operation of this DRAM will be described.
[0040]
FIG. 6 shows the refresh operation of the DRAM during rollback. In this case, data is read from the secondary side cell 20 to the sense amplifier 30 and a refresh operation is performed to write back the data to both the secondary side 20 and the primary side 10. This action allows the memory to quickly return to the previous checkpoint state.
[0041]
In the rollback refresh that copies the contents of the secondary side cell 20 shown in FIG. 6 to the primary side cell 10, the “sense amplifier read selector pin” is set to select the secondary side cell 20, and the “sense amplifier write selector pin” is set to write to both the primary side cell 10 and the secondary side cell 20. The row address is generated inside the DRAM as in a normal CAS-before-RAS refresh, and the same value is supplied as the primary side row address (PRn) and the secondary side row address (SRn).
[0042]
The operations in the refresh modes described above are summarized in FIG. 7. In FIG. 7, the “sense amplifier write selector pin” is written in the form “(selection of primary side cell, selection of secondary side cell)”, where “1” means active and “0” means inactive.
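The four refresh modes of the first embodiment can be modeled in a few lines. The following is a minimal behavioral sketch under stated assumptions: one row of cells per side, a shared sense amplifier, and hypothetical names (`DualCellRow`, `REFRESH_MODES`). The pin settings follow the text above: a read selector choosing primary or secondary, and a two-bit write selector in the form (primary, secondary).

```python
PRIMARY, SECONDARY = 0, 1

# mode -> (read selector, (write primary, write secondary))
REFRESH_MODES = {
    "normal_primary": (PRIMARY, (1, 0)),
    "normal_secondary": (SECONDARY, (0, 1)),
    "checkpoint": (PRIMARY, (1, 1)),
    "rollback": (SECONDARY, (1, 1)),
}


class DualCellRow:
    """One row address backed by a primary and a secondary cell group."""

    def __init__(self, primary, secondary):
        self.cells = [list(primary), list(secondary)]
        self.sense_amp = None

    def refresh(self, mode):
        read_sel, write_sel = REFRESH_MODES[mode]
        # Read the selected side into the shared sense amplifier.
        self.sense_amp = list(self.cells[read_sel])
        # Write the sense amplifier back to every side whose selector bit is active.
        for side, active in enumerate(write_sel):
            if active:
                self.cells[side] = list(self.sense_amp)
```

For example, calling `refresh("checkpoint")` leaves the primary data unchanged and overwrites the secondary side with it, while `refresh("rollback")` does the reverse.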
[0043]
FIG. 8 shows the refresh timing of a general computer, and FIG. 9 shows the refresh timing of a fault-tolerant computer to which the DRAM of the first embodiment is applied. In these figures, refresh timings are indicated simply by vertical lines.
[0044]
In a general computer, the DRAM specification requires each row address to be accessed at least once every 64 ms. For this reason, the memory controller normally issues a refresh interrupt about once every 15 μs so that the DRAM is refreshed at regular intervals (FIG. 8).
[0045]
On the other hand, a checkpoint rollback computer takes checkpoints at a rate of about once every 10 ms. That is, since the checkpoint interval and the refresh interval are substantially equal, in the DRAM according to the first embodiment, the refresh is basically performed during the checkpoint process as shown in FIG. 9A. As a result, an interrupt due to refresh does not occur during normal operation, and memory access can be executed at high speed.
[0046]
However, when the checkpoint interval becomes longer than the specified refresh interval, as shown in FIG. 9B, refresh interrupts are issued and refresh is executed as in the storage device of a general computer. For example, once 48 ms (3/4 of the specified 64 ms refresh period) has elapsed, refresh interrupts are issued at four times the normal frequency of once every 15 μs, that is, once every 3.75 μs, and the refresh operation is performed to guarantee the DRAM data. If a checkpoint is taken after 48 ms or more but less than 64 ms, checkpoint processing is entered. If no checkpoint is taken within 64 ms, the 48 ms counter for refresh interrupts is reset, and it is checked whether the next checkpoint interrupt falls beyond the refresh interrupt.
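The 3.75 μs figure follows from simple arithmetic: if one interrupt every 15 μs covers all rows in 64 ms, then covering the same rows in the 16 ms that remain after the 48 ms threshold needs a four times higher rate. The sketch below works this out; the function name and parameters are illustrative only.

```python
def catch_up_interval_us(retention_ms=64.0, normal_interval_us=15.0, threshold=0.75):
    """Interrupt interval needed to refresh every row in the time left after the threshold."""
    retention_us = retention_ms * 1000.0
    # Number of refresh operations a normal cycle spreads over the retention period.
    refreshes_per_cycle = retention_us / normal_interval_us
    # Time still available once the threshold (48 ms by default) has passed.
    remaining_us = retention_us * (1.0 - threshold)
    return remaining_us / refreshes_per_cycle


interval = catch_up_interval_us()
speedup = 15.0 / interval
```

With the default values this gives an interval of 3.75 μs, a four-fold speedup over the normal rate.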
[0047]
Here, the selection procedure of the refresh method of the DRAM will be described with reference to the flowchart of FIG.
[0048]
First, it is checked whether a rollback is being executed (step A1). If so (YES in step A1), the refresh method that copies the contents of the secondary cell to the primary cell is selected (step A2). If not (NO in step A1), it is then checked whether a checkpoint is being taken (step A3). If so (YES in step A3), the refresh method that copies the contents of the primary cell to the secondary cell is selected (step A4).
[0049]
If a checkpoint is not being taken (NO in step A3), it is checked whether the current time is still within the prescribed refresh time (step A5). If it is not (NO in step A5), it is further checked whether the refresh has been completed (step A6). If the refresh has already been completed (YES in step A6), the refresh timer is reset to “0” (step A7); if it has not been completed (NO in step A6), the normal refresh method is selected (step A8).
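The selection procedure of steps A1 to A8 can be written as a small decision function. This is a sketch, not the flowchart itself: the function name and the returned strings are hypothetical, and the behavior for the YES branch of step A5 (still within the prescribed refresh time) is not spelled out in the text, so it is assumed here that no refresh is selected in that case.

```python
def select_refresh_mode(rollback, checkpoint, within_refresh_time, refresh_complete):
    """Return the action chosen by the flow of steps A1 to A8."""
    if rollback:                          # step A1: YES
        return "copy_secondary_to_primary"  # step A2
    if checkpoint:                        # step A3: YES
        return "copy_primary_to_secondary"  # step A4
    if within_refresh_time:               # step A5: YES (assumed: nothing to do yet)
        return "none"
    if refresh_complete:                  # step A6: YES
        return "reset_refresh_timer"      # step A7
    return "normal_refresh"               # step A8
```

Rollback is tested first, so a pending rollback takes priority over a checkpoint request in this sketch, matching the order of the steps in the text.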
[0050]
As described above, in this fault-tolerant computer, the DRAM constituting the system memory 3 forms two storage areas, the primary memory 3a and the secondary memory 3b, which represent the same address space. Checkpoint processing and rollback processing can therefore be performed without using the system bus A, which is a shared resource, and their overhead is greatly reduced.
[0051]
(Second Embodiment)
Next, a second embodiment of the present invention will be described.
[0052]
The fault-tolerant computer of the second embodiment is characterized in that an error detection/correction circuit (single-error-correcting, double-error-detecting ECC), an error detection/correction enable register, and an error notification unit (error flag and error address register) are added to the DRAM of the fault-tolerant computer of the first embodiment.
[0053]
Here, the error detection/correction circuit is a single-error-correcting, double-error-detecting ECC (SEC-DED ECC). The column address (Column Address) of a DRAM of about 1 Gbit (= 2³⁰ bits) spans about √2³⁰ = 2¹⁵ bits, so check bits can be generated more efficiently than on a memory bus of about 64 (= 2⁶) bits. In fact, in theory, 2⁶ bits of data require 8 check bits, which amounts to about 12% of the data, whereas 2¹⁵ bits of data require only 17 check bits, which is about 0.05% of the data (see “Guide to Error Coding Technology”, Chapter 4, Application to Computers, supervised by Hideki Imai, Japan Industrial Technology Center).
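The check-bit counts quoted above can be reproduced with the standard bound for an extended Hamming code. The text does not say which code construction is used, so this is an assumption: a single-error-correcting Hamming code needs the smallest r with 2^r ≥ m + r + 1 for m data bits, and one extra overall parity bit adds double-error detection. The function name is illustrative.

```python
def sec_ded_check_bits(data_bits):
    """Check bits for an extended Hamming (SEC-DED) code over data_bits bits."""
    r = 0
    # Smallest r such that r check bits can point at any single-bit error
    # among data_bits + r positions, or signal "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


bus_bits = sec_ded_check_bits(2 ** 6)    # 64-bit memory bus word
row_bits = sec_ded_check_bits(2 ** 15)   # 2^15-bit DRAM row
bus_overhead = bus_bits / 2 ** 6
row_overhead = row_bits / 2 ** 15
```

This gives 8 check bits for 64 data bits (12.5% overhead) and 17 check bits for 2¹⁵ data bits (roughly 0.05% overhead), which is why protecting a whole row inside the DRAM is far cheaper than protecting each bus word.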
[0054]
The error detection / correction enable register is composed of, for example, 2 bits. When bit 0 is set to “1”, error detection by ECC is performed, and when “0” is set, error detection is not performed. When bit 1 is set to “1”, single error correction by ECC is performed, and when it is set to “0”, error correction is not performed.
[0055]
The error notification unit holds whether or not an error has occurred, whether or not the error can be corrected, and the row address (Row Address) where the error has occurred.
[0056]
In the DRAM of the second embodiment, ECC check bits are generated and written to the memory cells when data is written back from the sense amplifier to the memory cells, and an ECC check is performed when data is read from the memory cells to the sense amplifier. At this time, error detection and correction are performed according to the error detection / correction enable register. That is, if error detection is enabled, the error flag is set to “1” when an error occurs, and the error address is stored in the error address register. If error detection is disabled, the error flag and the error address register are not updated even when an error occurs. Note that, as is done in the ECC of ordinary high-reliability computers, the DRAM may be configured so that, when a double-bit error detection register is set, the error flag and the error address register are not updated on single-bit error correction.
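The interaction between the enable register and the error notification unit can be sketched as follows. This is a minimal model under stated assumptions, not the circuit of the embodiment: the class `ErrorNotifier`, its field names, and the `bad_bits` argument are hypothetical, and the model assumes that single-error correction still takes place when only bit 1 is set, a case the description does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

DETECT_ENABLE = 0b01   # bit 0 of the enable register: error detection
CORRECT_ENABLE = 0b10  # bit 1 of the enable register: single-error correction


@dataclass
class ErrorNotifier:
    """Hypothetical model of the error flag and error address register."""
    enable: int = DETECT_ENABLE | CORRECT_ENABLE
    error_flag: int = 0
    error_row: Optional[int] = None
    correctable: Optional[bool] = None

    def on_row_read(self, row: int, bad_bits: int) -> bool:
        """Record the ECC result for one row; return True if data is usable.

        bad_bits is the number of flipped bits seen by the ECC check (0, 1, 2).
        """
        if bad_bits == 0:
            return True
        if self.enable & DETECT_ENABLE:
            # Detection enabled: latch the flag and the failing row address.
            self.error_flag = 1
            self.error_row = row
            self.correctable = bad_bits == 1
        # A single error is repaired only when correction is enabled.
        return bad_bits == 1 and bool(self.enable & CORRECT_ENABLE)
```

With the register cleared to 0, a faulty row is reported as unusable but neither the flag nor the address register changes, which matches the disabled case described above.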
[0057]
When a single error is detected in the refresh operation of the checkpoint process or rollback process, the error is corrected and the data is then written to both memory cell groups. If a double error is detected, the data is written back only to the cell group from which it was read, and is not written to the opposite cell group. The error detection / correction enable register may be omitted.
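The write-back rule for checkpoint and rollback refreshes can be summarized in a few lines. The sketch below is an illustration only; the function name `copy_refresh_targets` and the string labels are hypothetical, and the ECC result is reduced to a count of faulty bits.

```python
from typing import Tuple


def copy_refresh_targets(bad_bits: int, source: str) -> Tuple[str, ...]:
    """Cell groups that receive the sense-amplifier data after the ECC check.

    source is the group that was read: "primary" for checkpoint processing,
    "secondary" for rollback processing. bad_bits is 0, 1 or 2.
    """
    if source not in ("primary", "secondary"):
        raise ValueError("source must be 'primary' or 'secondary'")
    other = "secondary" if source == "primary" else "primary"
    if bad_bits <= 1:
        # Clean or corrected data is copied to both groups.
        return (source, other)
    # Double error: keep the damaged data out of the opposite group.
    return (source,)
```

Restricting the write to the source group on a double error means an uncorrectable row never overwrites the copy held on the other side.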
[0058]
FIG. 11 shows the configuration of a DRAM having error check bits. Although omitted from FIG. 11, when the data 100 and the check bits 110 at the row address (PRn) are read out to the sense amplifier 30, an error check is performed by the ECC check circuit. If a single-bit error has occurred, the affected bit is inverted and the corrected data is written to the sense amplifier 30. When the contents of the sense amplifier are written to the memory cells at the row address (PRn) on the primary 10 side and at the row address (SRn) on the secondary 20 side, the check bits 110 are generated from the data portion 100 held in the sense amplifier 30 and written together with it. In other words, unless error information is to be retained, the sense amplifier 30 does not need to hold the check bits 110. The check bits 110 are generated from the data portion 100 because they must be regenerated from the new data 100 whenever the memory is written from outside.
[0059]
(Third embodiment)
Next, a third embodiment of the present invention will be described.
[0060]
The feature of the fault tolerant computer of the third embodiment is that an update bit is provided for each row address of the DRAM of the fault tolerant computer of the first embodiment described above, and only the DRAM row addresses whose update bit is set are refreshed, which speeds up the checkpoint process.
[0061]
FIG. 12 shows a configuration of a DRAM having an update bit.
[0062]
As is well known, memory accesses in a computer exhibit locality of reference: the addresses accessed within a short time are concentrated in a small area. Therefore, if the DRAM addresses are mapped to real addresses so that the row addresses are contiguous, the row addresses referenced within a short time are also concentrated in a small area. Accordingly, if update bits (or an update management table: Modified Block Table) 120 are provided and only the necessary rows are updated by the refresh operation described above, the checkpoint process can be completed in a very short time.
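The effect of the update bits can be illustrated with a toy model. This is a sketch under stated assumptions rather than the DRAM of the embodiment: the class `DualCellDram` and its methods are hypothetical, each row is reduced to a single integer, and the model assumes that an update bit is cleared once its row has been copied, which the description does not state explicitly.

```python
from typing import List


class DualCellDram:
    """Toy model: primary and secondary rows sharing one address space."""

    def __init__(self, rows: int) -> None:
        self.primary: List[int] = [0] * rows
        self.secondary: List[int] = [0] * rows
        self.update_bit: List[bool] = [False] * rows

    def write(self, row: int, value: int) -> None:
        # Normal writes touch only the primary side and mark the row.
        self.primary[row] = value
        self.update_bit[row] = True

    def checkpoint(self) -> int:
        """Copy updated rows from primary to secondary; return rows refreshed."""
        refreshed = 0
        for row, dirty in enumerate(self.update_bit):
            if dirty:
                self.secondary[row] = self.primary[row]
                self.update_bit[row] = False  # assumption: cleared after copy
                refreshed += 1
        return refreshed
```

When writes cluster in a few rows, as locality of reference suggests, `checkpoint()` visits only those rows instead of the whole array.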
[0063]
(Fourth embodiment)
Next explained is the fourth embodiment of the invention.
[0064]
The feature of the fault-tolerant computer of the fourth embodiment is that the update bits of the DRAM of the fault-tolerant computer of the third embodiment are provided outside the DRAM. Here, outside the DRAM may mean inside the memory controller or independent of the memory controller. Because an update bit is set when a write to the memory occurs, it can be placed outside the memory element. This increases the freedom in choosing the management method, so that, for example, when the invention is applied to the memory state restoration device described in Japanese Patent Laid-Open No. 9-6731, a highly reliable computer can be realized with a small overhead.
[0065]
(Fifth embodiment)
Next explained is the fifth embodiment of the invention.
[0066]
The feature of the fault tolerant computer of the fifth embodiment is that a refresh completion flag is provided at each row address of the DRAM of the fault tolerant computer of the second embodiment. FIG. 13 shows the configuration of a DRAM having a refresh completion flag.
[0067]
When a row undergoes a normal refresh operation, a refresh by checkpoint processing or rollback processing, or a normal read / write access, its refresh completion flag 130 is set. When all the refresh completion flags have been set, the refresh completion flags 130 are reset.
[0068]
Refresh efficiency can be increased by skipping the CAS Before RAS refresh for row addresses whose refresh completion flag 130 is set.
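The bookkeeping for the refresh completion flags can be sketched as follows. The class `RefreshTracker` and its method names are hypothetical, and the model simply treats any access as recharging the row it touches.

```python
from typing import List


class RefreshTracker:
    """Hypothetical per-row refresh completion flags."""

    def __init__(self, rows: int) -> None:
        self.done: List[bool] = [False] * rows

    def touch(self, row: int) -> None:
        """Any refresh or read/write access to a row sets its flag."""
        self.done[row] = True
        if all(self.done):
            # Every row has been covered, so all flags are reset.
            self.done = [False] * len(self.done)

    def rows_needing_cbr(self) -> List[int]:
        """Rows that still need a CAS Before RAS refresh in this round."""
        return [row for row, flag in enumerate(self.done) if not flag]
```

A refresh scheduler built on this model would issue CAS Before RAS refreshes only for the rows returned by `rows_needing_cbr()`.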
[0069]
(Sixth embodiment)
Next, a sixth embodiment of the invention will be described.
[0070]
The feature of the fault-tolerant computer of the sixth embodiment is that two sets of sense amplifiers for the DRAM of the fault-tolerant computer of the third embodiment are provided.
[0071]
FIG. 14 shows the configuration of a DRAM when two sets of sense amplifiers are provided.
[0072]
The input of one set of sense amplifiers 31a is fixed to the primary cell 10, and the input of the other set of sense amplifiers 31b is fixed to the secondary cell 20. Therefore, the “sense amplifier read selector pin” described in the first embodiment is not necessary. Instead, there is provided means for designating the sense amplifier that holds the data to be written back to the memory cell, as will be described later.
[0073]
In the normal read / write access to the primary memory 10, the primary side sense amplifier 31a is used, while in the normal read / write access to the secondary memory 20, the secondary side sense amplifier 31b is used.
[0074]
First, the normal operation of this DRAM will be described.
[0075]
FIG. 14 shows a normal DRAM refresh operation. In normal refresh, the data in the primary memory 10 at the row address (PRn) is read to the primary sense amplifier 31a and written back to the primary memory 10. At the same time, the data in the secondary memory 20 at the corresponding row address (SRn) is read out to the secondary sense amplifier 31b and written back to the secondary memory 20. For simplicity, a “normal refresh pin” indicating that the default sense amplifier is used will be provided.
[0076]
When the “normal refresh pin” is active, the data of the default sense amplifier is written into the memory cell group for which the “sense amplifier write selector pin” is active. The default sense amplifier for the primary memory 10 is the primary-side sense amplifier 31a, and the default sense amplifier for the secondary memory 20 is the secondary-side sense amplifier 31b. When the “normal refresh pin” is not active, a means for determining which sense amplifier's value is written back to the memory cells is required. In this case, since there are only two memory cell groups, primary and secondary, one bit is sufficient. This will be referred to as a “write-back sense amplifier selector pin”.
[0077]
If the “normal refresh pin” is not active and the “write-back sense amplifier selector pin” is “primary”, the contents of the primary sense amplifier 31a are written to the memory cell group for which the “sense amplifier write selector pin” is active.
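The pin logic described above amounts to a small decode function. The following sketch is illustrative only: the function name `write_back_source` and the string encodings of the pin states are hypothetical, and only the selection of the source sense amplifier is modeled.

```python
def write_back_source(target: str, normal_refresh: bool, wb_selector: str) -> str:
    """Return which sense amplifier ("31a" or "31b") feeds a target cell group.

    target: "primary" or "secondary" cell group being written.
    normal_refresh: state of the "normal refresh pin".
    wb_selector: "primary" or "secondary", the state of the "write-back sense
    amplifier selector pin"; it is ignored while normal_refresh is active.
    """
    amp = {"primary": "31a", "secondary": "31b"}
    if normal_refresh:
        # Default: each cell group is rewritten from its own sense amplifier.
        return amp[target]
    # Otherwise one sense amplifier supplies whichever groups are selected.
    return amp[wb_selector]
```

For example, with the normal refresh pin inactive and the selector set to “primary”, the secondary cell group is written from sense amplifier 31a.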
[0078]
In this DRAM configuration, since there are only two memory cell groups and the setting of the “sense amplifier write selector pin” is obvious, the “sense amplifier write selector pin” need not be provided.
[0079]
Next, the checkpoint operation of this DRAM will be described.
[0080]
FIG. 15 shows the refresh operation of the DRAM at the time of checkpoint. At the time of checkpoint processing, after the cache is flushed and information such as the processor registers is written to and saved in the primary memory, the data at each row address (PRn) of the primary memory 10 whose update bit 130 is set is read out to the primary sense amplifier 31a. At the same time, the data in the secondary memory 20 at the corresponding row address (SRn) is read out to the secondary sense amplifier 31b. At the time of writing, the contents of the primary sense amplifier 31a are written not only to the cell group of the row address (PRn) of the primary memory 10 but also to the cell group of the corresponding row address (SRn) of the secondary memory 20. The contents of the secondary sense amplifier 31b are discarded. For this purpose, the “normal refresh pin” is made inactive, and the “write-back sense amplifier selector pin” is set to “primary”. Thereby, the data of the updated row address (PRn) of the primary memory 10 is copied to the memory cells of the corresponding row address (SRn) of the secondary memory 20.
[0081]
Next, the rollback operation of this DRAM will be described.
[0082]
When a failure occurs, the data at each row address (PRn) of the primary memory 10 whose update bit 130 is set is read out to the primary sense amplifier 31a. At the same time, the data in the secondary memory 20 at the corresponding row address (SRn) is read out to the secondary sense amplifier 31b. At the time of writing, the contents of the secondary sense amplifier 31b are written not only to the cell group of the row address (SRn) of the secondary memory 20, but also to the cell group of the corresponding row address (PRn) of the primary memory 10. The contents of the primary sense amplifier 31a are discarded. For this purpose, the “normal refresh pin” is made inactive, and the “write-back sense amplifier selector pin” is set to “secondary”. Thereby, the data of the row address (SRn) of the secondary memory 20 is copied to the memory cells of the corresponding row address (PRn) of the primary memory 10.
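The three refresh modes of the two-sense-amplifier configuration can be compared side by side in a small simulation. This is a sketch under simplifying assumptions: the function `refresh_row` and its mode strings are hypothetical, each row is reduced to one integer, and update-bit handling is left out.

```python
from typing import List, Tuple


def refresh_row(primary: List[int], secondary: List[int], row: int,
                mode: str) -> Tuple[int, int]:
    """Refresh one row in "normal", "checkpoint" or "rollback" mode.

    Returns the values latched in sense amplifiers 31a and 31b.
    """
    amp_a = primary[row]      # 31a always reads the primary cells
    amp_b = secondary[row]    # 31b always reads the secondary cells
    if mode == "normal":
        # Each group is rewritten from its own sense amplifier.
        primary[row], secondary[row] = amp_a, amp_b
    elif mode == "checkpoint":
        # 31a drives both groups; the contents of 31b are discarded.
        primary[row] = secondary[row] = amp_a
    elif mode == "rollback":
        # 31b drives both groups; the contents of 31a are discarded.
        primary[row] = secondary[row] = amp_b
    else:
        raise ValueError(f"unknown mode: {mode}")
    return amp_a, amp_b
```

A checkpoint followed by further writes to the primary side and then a rollback leaves the primary row holding the checkpointed value again.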
[0083]
When a failure occurs, the update bit 130 may instead be ignored and a refresh operation performed on all row addresses to copy the contents of the secondary memory to the primary memory.
[0084]
The operations in each of the above refresh operations are shown in FIG. 16. In FIG. 16, the “sense amplifier write selector pin” is written in the form “(primary side cell selection, secondary side cell selection)”, where “1” means active and “0” means inactive.
[0085]
In the fault-tolerant computers of the first to sixth embodiments described above, the checkpoint process and the rollback process are performed using the refresh of the DRAM, but the data storage element is not limited to a DRAM. For example, an SRAM may be used instead of a DRAM, or a DRAM may be used on the primary side and a nonvolatile memory such as a flash memory on the secondary side.
[0086]
In particular, when a non-volatile element is used on the secondary side, the invention is useful not only in highly reliable computers but also in general computers, as a means of realizing a resume function. High reliability can also be achieved by applying the invention to a disk cache instead of the main memory.
[0087]
Further, when the processor and the memory are mounted on the same LSI chip by the system on chip technology, the present invention can be applied to the memory control circuit.
[0088]
【Effect of the invention】
As described in detail above, according to the present invention, providing a plurality of sets of memory cell groups that constitute the same address space allows a single data storage element to hold two storage areas, a primary storage area and a secondary storage area. Checkpoint processing can therefore be executed, without using the external bus that is a shared resource, by copying the data of the memory cell group used as the primary to the memory cell group used as the secondary. Likewise, rollback processing can be executed by copying the data of the memory cell group used as the secondary to the memory cell group used as the primary. As a result, the overhead of checkpoint processing and rollback processing can be greatly reduced.
[Brief description of the drawings]
FIG. 1 is a diagram showing a schematic configuration of a fault-tolerant computer of a checkpoint / rollback method according to a first embodiment of the invention.
FIG. 2 is a diagram schematically showing two memory cells and a sense amplifier of a DRAM constituting the system memory 3 of the first embodiment and their operations.
FIG. 3 is a diagram showing a refresh operation on the primary side of the DRAM of the first embodiment.
FIG. 4 is a diagram showing a refresh operation on the secondary side of the DRAM of the first embodiment.
FIG. 5 is a diagram showing a refresh operation of the DRAM at a checkpoint in the first embodiment.
FIG. 6 is a diagram showing a refresh operation of the DRAM at the time of rollback in the first embodiment.
FIG. 7 is a diagram showing operations in each refresh operation of the first embodiment.
FIG. 8 is a diagram showing refresh timing of a general computer.
FIG. 9 is a diagram showing refresh timing of the fault tolerant computer to which the DRAM of the first embodiment is applied.
FIG. 10 is a flowchart for explaining a DRAM refresh method selection procedure according to the first embodiment.
FIG. 11 is a diagram showing a configuration of a DRAM having an error check bit according to the second embodiment.
FIG. 12 is a diagram showing a configuration of a DRAM having an update bit according to the third embodiment.
FIG. 13 is a diagram showing a configuration of a DRAM having a refresh completion flag according to the fifth embodiment.
FIG. 14 is a diagram showing a DRAM configuration and a normal DRAM refresh operation when two sets of sense amplifiers of the sixth embodiment are provided.
FIG. 15 is a diagram showing a refresh operation of the DRAM at a checkpoint in the sixth embodiment.
FIG. 16 is a diagram showing operations in each refresh operation of the sixth embodiment.
[Explanation of symbols]
1 ... CPU
2 ... Memory controller
3 ... System memory (DRAM)
10 ... Primary cell
20 ... Secondary cell
30 ... Sense amplifier
31a ... Primary sense amplifier
31b ... Secondary sense amplifier
100 ... Data body
110 ... Check bit
120 ... Update bit
130 ... Refresh completion flag

Claims

1. An electronic computer having a checkpoint / rollback high-reliability mechanism, comprising:
a data storage element having:
first and second memory cell groups constituting the same address space;
a buffer to which the first and second memory cell groups are connected;
first selection means for selecting, from the first and second memory cell groups, the memory cell group from which data is read to the buffer; and
second selection means for selecting, from the first and second memory cell groups, the memory cell group to which the data of the buffer is written;
data protection means for, during normal operation, writing the data of the first memory cell group back only to the first memory cell group via the buffer, and writing the data of the second memory cell group back only to the second memory cell group via the buffer;
data duplicating means for, at the time of a checkpoint, simultaneously writing the data of the first memory cell group back to the first memory cell group and the second memory cell group via the buffer; and
data restoring means for, at the time of rollback, simultaneously writing the data of the second memory cell group back to the second memory cell group and the first memory cell group via the buffer.

2. The electronic computer according to claim 1, wherein the buffer of the data storage element is a sense amplifier for writing back data of the first or second memory cell group for data protection.

3. The electronic computer according to claim 1, wherein the first memory cell group of the data storage element is a normal primary cell, and the second memory cell group of the data storage element is a backup secondary cell.

4. The electronic computer according to claim 1, wherein the data storage element is a DRAM having means for simultaneously writing back, at the time of refresh, the data of one of the first and second memory cell groups to both the first and second memory cell groups via the buffer.

5. The electronic computer according to claim 1, wherein at least one of the first memory cell group and the second memory cell group of the data storage element is a non-volatile element.

6. The electronic computer according to claim 1, wherein the data storage element includes redundant bits for error checking in the first and second memory cell groups, and means for performing error detection or error correction using the redundant bits.

7. The electronic computer according to claim 6, wherein the data storage element includes means for storing error information.

8. The electronic computer according to claim 6, wherein the data storage element includes means for designating whether or not to perform error detection or error correction.

9. The electronic computer according to claim 1, wherein the data storage element is provided with means for indicating a reference history of each memory cell, corresponding to at least one of the first and second memory cell groups.

10. The electronic computer according to claim 9, wherein the data duplicating means writes back only data having a reference history.

11. An electronic computer having a checkpoint / rollback high-reliability mechanism, comprising:
a data storage element having:
first and second memory cell groups constituting the same address space;
a first buffer, whose input is connected to the first memory cell group and whose output is connected to the first and second memory cell groups, for storing data read from the first memory cell group;
a second buffer, whose input is connected to the second memory cell group and whose output is connected to the first and second memory cell groups, for storing data read from the second memory cell group;
first selection means for selecting, from the first and second memory cell groups, the memory cell group to which the data of the first buffer is written; and
second selection means for selecting, from the first and second memory cell groups, the memory cell group to which the data of the second buffer is written;
data protection means for, during normal operation, writing the data of the first memory cell group back only to the first memory cell group via the first buffer, and writing the data of the second memory cell group back only to the second memory cell group via the second buffer;
data duplicating means for, at the time of a checkpoint, simultaneously writing the data of the first memory cell group back to the first memory cell group and the second memory cell group via the first buffer; and
data restoring means for, at the time of rollback, simultaneously writing the data of the second memory cell group back to the second memory cell group and the first memory cell group via the second buffer.

12. The electronic computer according to claim 11, wherein the first buffer of the data storage element is a sense amplifier for writing back data of the first memory cell group for data protection, and the second buffer of the data storage element is a sense amplifier for writing back data of the second memory cell group for data protection.

13. The electronic computer according to claim 11, wherein the first memory cell group of the data storage element is a normal primary cell, and the second memory cell group of the data storage element is a backup secondary cell.

14. The electronic computer according to claim 11, wherein the data storage element is a DRAM having means for simultaneously writing back, at the time of refresh, the data of one of the first and second memory cell groups to both the first and second memory cell groups via the first or second buffer.

15. The electronic computer according to claim 11, wherein at least one of the first and second memory cell groups of the data storage element is a non-volatile element.

16. The electronic computer according to claim 11, wherein the data storage element includes redundant bits for error checking in the first and second memory cell groups, and means for performing error detection or error correction using the redundant bits.

17. The electronic computer according to claim 16, wherein the data storage element includes means for storing error information.

18. The electronic computer according to claim 16, wherein the data storage element includes means for designating whether or not to perform error detection or error correction.

19. The electronic computer according to claim 11, wherein the data storage element is provided with means for indicating a reference history of each memory cell, corresponding to at least one of the first and second memory cell groups.

20. The electronic computer according to claim 19, wherein the data duplicating means writes back only data having a reference history.