JP2004062377A - Shared cache memory fault processing system

Info

Publication number: JP2004062377A
Application number: JP2002217788A
Authority: JP
Inventors: Takahiro Tanioka; 谷岡　隆浩
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-07-26
Filing date: 2002-07-26
Publication date: 2004-02-26
Anticipated expiration: 2022-07-26
Also published as: JP3931757B2

Abstract

PROBLEM TO BE SOLVED: To keep the range affected by a fault in a cache memory as narrow as possible by identifying the position of the fault and limiting the partitions it affects.

SOLUTION: A shared cache memory fault processing system comprises: an address comparison circuit 59 that suppresses the hit decision for any compartment not allocated to the requesting processor when the cache is indexed by a request from an arbitrary processor; a replacement entry generation circuit 55 that, when data is registered in the cache, selects a compartment allocated to the requesting processor in accordance with configuration information held in a compartment instruction circuit 52; a fault detection circuit 58 that reports the compartment number of the fault when an uncorrectable fault is detected during cache indexing; and a shared cache control unit 51 that, based on the report from the fault detection circuit 58, identifies the partition affected by the fault, updates the configuration information, and notifies the compartment instruction circuit 52 of the updated contents in order to process the fault.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は共有キャッシュメモリ障害処理方式に関し、特に複数のプロセッサから共有されているキャッシュメモリに発生した障害が影響する範囲を限定しながらシステム運用を継続させる共有キャッシュメモリ障害処理方式に関する。
【０００２】
【従来の技術】
図７は、共有キャッシュメモリを備えるコンピュータシステムの例を示す説明図である。同図において、共有キャッシュユニット２１及び２２は同等の構成からなる共有キャッシュであり、それぞれプロセッサ１１，１２及び１３，１４に共有されている。また他方で、共有キャッシュユニット２１，２２はシステムバス３１を介して主記憶４１に接続される。
【０００３】
昨今、このようなシステムにおいて、プロセッサや主記憶などのリソースを複数の区画に論理的に分割し、それぞれを独立したシステムとして運用する事が可能である。この場合、各区画に属するプロセッサからのメモリアクセスはプロセッサから送出される時点で区画に割り当てられたメモリ領域の先頭アドレスをオフセット値として加算されることにより、ソフトウェアは他の区画の存在を意識することなく命令を実行することが出来るようになっている。
【０００４】
しかしながら、各区画毎の信頼性を考えるとき、図を参照しても判るように複数の区画で共有されるユニット、例えば共有キャッシュユニット２１は区画１と区画２の単一故障点（ＳＰＯＦ：Ｓｉｎｇｌｅ　Ｐｏｉｎｔ　Ｏｆ　Ｆａｉｌｕｒｅ）であり、キャッシュメモリの故障によって２つの区画がシステムダウンに至るという問題がある。
【０００５】
【発明が解決しようとする課題】
上記のように、従来の共有キャッシュメモリを備えたコンピュータシステムでは、複数の区画から共用されているキャッシュメモリに障害が発生した場合に、それが訂正不可能な障害であれば、それぞれの区画がシステムダウンすることになるという欠点がある。
【０００６】
本発明の目的は、上記のような問題点を改善するために、キャッシュメモリに発生した障害位置を特定し、その障害が影響する区画を限定するようにし、その障害の波及範囲を極力狭くすることができる共有キャッシュメモリ障害処理方式を提供することにある。
【０００７】
【課題を解決するための手段】
本発明の共有キャッシュメモリ障害処理方式は、複数のコンパートメントを有し複数のプロセッサに共有された共有キャッシュメモリにおいて、各プロセッサごとに利用可能なコンパートメントをあらかじめ割り当て、任意の複数の区画に分かれたプロセッサから前記共有キャッシュメモリを共有する場合、キャッシュ索引時に訂正不可能な障害を検出したとき、前記障害が影響する区画を特定し、その障害を処理する手段を具備することを特徴とする。
【０００８】
また、本発明の共有キャッシュメモリ障害処理方式は、複数のコンパートメントを有し複数のプロセッサに共有された共有キャッシュメモリにおいて、それぞれのプロセッサにどのコンパートメントを割り当てるかを示す構成情報を保持するコンパートメント指示回路と、任意のプロセッサからの要求によりキャッシュを索引するとき要求元のプロセッサに割り当てられていないコンパートメントのヒット判定を抑止するアドレス比較回路と、キャッシュへデータを登録するとき前記コンパートメント指示回路が保持する構成情報に従って要求元のプロセッサに割り当てられたコンパートメントを選択するリプレースエントリ生成回路と、キャッシュ索引時に訂正不可能な障害を検出したときそのコンパートメント番号を通知する障害検出回路と、前記障害検出回路からの通知に基づいて前記障害が影響する区画を特定し前記構成情報を更新して前記コンパートメント指示回路へ通知し障害処理を行う共有キャッシュ制御部とを有することを特徴とする。
【０００９】
さらに、上記の共有キャッシュメモリ障害処理方式において、前記コンパートメント指示回路は、キャッシュのコンパートメント数に等しいビット幅および前記キャッシュに接続されるプロセッサ数に等しいワード数を含むレジスタ群を備え、前記共有キャッシュ制御部から前記構成情報を得てそれを保持し、さらに前記共有キャッシュ制御部からプロセッサＩＤ信号を得てコンパートメント割当情報を生成し、それを前記アドレス比較回路および前記リプレースエントリ生成回路へ通知することを特徴とする。
【００１０】
また、本発明の共有キャッシュメモリ障害処理方式は、複数のコンパートメントを有し複数のプロセッサに共有された共有キャッシュメモリにおいて、キャッシュに発生した障害の回数をコンパートメントごとに保持する障害履歴レジスタと、キャッシュへデータを登録するとき前記障害履歴レジスタの情報を参照してデータを登録するコンパートメントを決定しキャッシュに指示するリプレースエントリ指示回路と、キャッシュ索引時に訂正不可能な障害を検出したときそのコンパートメント番号を共有キャッシュ制御部に通知する障害検出回路と、任意のプロセッサからの要求によりキャッシュにアクセスした際にミスヒットによりデータを登録する場合には前記障害履歴レジスタの情報を参照し障害履歴を持たないコンパートメントを選択することを前記リプレースエントリ指示回路に指示する共有キャッシュ制御部とを有することを特徴とする。
【００１１】
さらに、上記の共有キャッシュメモリ障害処理方式において、前記障害履歴レジスタは、キャッシュのコンパートメント数に等しいワード数および管理すべき障害回数を保持可能なビット幅を含むレジスタ群を備え、前記共有キャッシュ制御部が送出する障害情報を障害履歴情報としてコンパートメントごとに保持し、それを前記リプレースエントリ生成回路および前記共有キャッシュ制御部へ送出することを特徴とする。
【００１２】
さらに、本発明の共有キャッシュメモリ障害処理方式において、前記共有キャッシュ制御部は、前記障害検出回路から受信した障害情報を診断プロセッサに通知し、前記診断プロセッサの診断結果に基づいてその障害の処理を実行することを特徴とする。
【００１３】
すなわち、本発明によるキャッシュメモリは、複数のコンパートメントからなるセットアソシアティブ方式の共有キャッシュであって、キャッシュを共有するプロセッサ毎に利用可能なコンパートメントを予め割り当てることができる。これにより共有キャッシュを複数の区画に分かれたプロセッサから共有する場合、キャッシュで訂正不能な障害が発生した際のシステムダウンに至る区画を特定し、障害の波及範囲を限定することができる。
【００１４】
また、本発明による共有キャッシュメモリは、コンパートメント毎に当該コンパートメントの障害履歴を保持する障害履歴レジスタを備えることにより、キャッシュコンパートメントの障害履歴を管理すると共に、キャッシュミスヒットに伴うリプレースエントリを決定する際に障害履歴をもつコンパートメントへの登録を抑止することができる。
【００１５】
さらに、障害の履歴を管理することにより、キャッシュメモリで障害を検出した際に障害履歴レジスタを参照し、当該コンパートメントで繰り返し障害が発生していることが判明した場合は、キャッシュメモリの動作を停止し、システムダウンに移行することが可能である。
【００１６】
本発明は、特にコンピュータシステムを複数の論理区画に分割して各々の区画で独立してオペレーティングシステムを稼動させうるシステムについて、複数の区画からキャッシュメモリを共有する場合において有効であり、キャッシュで訂正不能な障害が発生した区画を特定し、障害の波及範囲を限定すると共に、他の区画の動作には影響を与えず運用を継続することができる。
【００１７】
【発明の実施の形態】
以下、本発明について図面を参照しながら説明する。
【００１８】
図１は本発明の実施の第一の形態を示すブロック図である。同図において、本発明による共有キャッシュメモリ障害処理方式は、複数のコンパートメントを有し複数のプロセッサ１１および１２に共有された共有キャッシュユニット２１において、それぞれのプロセッサにどのコンパートメントを割り当てるかを示す構成情報を保持するコンパートメント指示回路５２と、任意のプロセッサからの要求によりキャッシュを索引するとき要求元のプロセッサに割り当てられていないコンパートメントのヒット判定を抑止するアドレス比較回路５９と、キャッシュへデータを登録するとき前記コンパートメント指示回路が保持する構成情報に従って要求元のプロセッサに割り当てられたコンパートメントを選択するリプレースエントリ生成回路５５と、キャッシュ索引時に訂正不可能な障害を検出したときそのコンパートメント番号を通知する障害検出回路５８と、前記障害検出回路からの通知に基づいて前記障害が影響する区画を特定し前記構成情報を更新して前記コンパートメント指示回路へ通知し障害処理を行う共有キャッシュ制御部５１とを有する。
【００１９】
すなわち、プロセッサ１１，１２の各々にどのキャッシュコンパートメントを割り当てるかを示す構成情報がコンパートメント指示回路５２に保持される。プロセッサ１１ないし１２からの要求によりキャッシュを索引する際に、アドレス比較回路５９は要求元のプロセッサに割り当てられていないキャッシュコンパートメントのヒットを判定を抑止する。
【００２０】
また、キャッシュへデータ登録する際には、リプレースエントリ生成回路５５は要求元のプロセッサに割り当てられたキャッシュコンパートメントの中から登録すべきコンパートメントを選択する。これにより、コンパートメント指示回路５２が保持する構成情報に従ってプロセッサ毎に利用可能なキャッシュコンパートメントを任意に割り当てることが可能となる。
【００２１】
また、キャッシュ索引時に障害検出回路５８により訂正不可能なエラーを検出すると、共有キャッシュ制御部５１は診断プロセッサへエラー報告と共に要求元のプロセッサ番号を通知する。これにより診断プロセッサは障害の影響を受ける区画を特定し、当該区画をシステムダウンさせる。
【００２２】
図１を参照すると、共有キャッシュ制御部５１は、プロセッサ１１，１２及びシステムバス３１に接続されるとともに、共有キャッシュユニット２１内の制御機能を有する。
【００２３】
コンパートメント指示回路５２は、共有キャッシュ制御部５１，リプレースエントリ回路５５，アドレス比較回路５９に接続され、キャッシュ索引の際に索引要求元に対してどのキャッシュコンパートメントが割り当てられており有効かを示す情報を出力する。
【００２４】
ＬＲＵメモリ５３は、キャッシュのＬＲＵ（Ｌｅａｓｔ　Ｒｅｃｅｎｔｌｙ　Ｕｓｅｄ）制御に用いるメモリで、キャッシュのライン毎に１ｂｉｔの情報を保持する。
【００２５】
アドレス生成部５４は、共有キャッシュ制御部５１からキャッシュ索引アドレスを入力して保持し、そのアドレスの一部をキャッシュ索引アドレス及びキーとして出力する。
【００２６】
リプレースエントリ生成回路５５は、共有キャッシュ制御部５１，コンパートメント指示回路５２，ＬＲＵメモリ５３に接続され、キャッシュへのデータ登録時に登録すべきキャッシュコンパートメントを決定し、キャッシュアドレスアレイ（ＡＡ）５６及びキャッシュデータアレイ（ＤＡ）５７に対して、決定したコンパートメントへの書き込み指示を出力する。
【００２７】
上記の共有キャッシュは、２ウェイセットアソシアティブ方式のストアインキャッシュあって、キャッシュのキーアドレスを格納するキャッシュＡＡ５６及びデータブロックを格納するキャッシュＤＡ５７は、それぞれ２つのコンパートメントからなるメモリアレイを構成する。
【００２８】
アドレス比較回路５９はキャッシュＡＡ５６，コンパートメント指示回路５２，アドレス生成部５４から信号を入力し、キャッシュヒット判定を行い、判定結果を出力する。
【００２９】
データセレクタ６０は、アドレス比較回路５９からの出力を受け取り、キャッシュＤＡ５７から読み出したデータのうちいずれかのコンパートメントからの出力を選択し、出力する。
【００３０】
障害検出回路５８は、キャッシュＡＡ５６及びキャッシュＤＡ５７から読み出した値のパリティチェックを行い、エラーを検出した場合にそのコンパートメントの番号を共有キャッシュ制御部５１へ出力する。
【００３１】
図２は、コンパートメント指示回路５２及びリプレースエントリ生成回路５５の例を示す構成図である。
【００３２】
図２において、コンパートメント指示回路５２は、２ビット幅のレジスタの２ワードからなるレジスタ群により構成される。ここでレジスタのビット幅はキャッシュのコンパートメント数に等しく、ワード数は共有キャッシュユニット２１に直接接続されるプロセッサ数に等しい。これらのレジスタ群は、共有キャッシュ制御部５１より入力されるキャッシュコンパートメント毎の割り当て構成情報を保持する。
【００３３】
また、コンパートメント指示回路５２は、共有キャッシュ制御部５１からレジスタの各ワードに対応する２本の要求元ＩＤ信号が入力され、レジスタ群の各ビット単位に要求元ＩＤ信号が”１”のワードに格納される値の論理和が出力され、コンパートメントの割り当て情報としてリプレースエントリ生成回路５５とアドレス比較回路５９へ分配される。
【００３４】
さらに、図２を参照すると、リプレースエントリ回路５５は、単純な組み合わせ回路により実現される。図３に、この組み合わせ回路の真理値を示す。
【００３５】
次に、上記の共有キャッシュメモリの動作について説明する。
【００３６】
（１−１）共有キャッシュメモリの初期化
システムの立ち上げ時、共有キャッシュメモリの初期化動作を行う場合には、通常キャッシュメモリの初期化時に行われるようなキャッシュＡＡ５６及びキャッシュＤＡ５７，ＬＲＵメモリ５３のクリア動作に加えて、コンパートメント指示回路５２への初期設定が行われる。
【００３７】
コンパートメント指示回路５２は、図２に既述したように、共有キャッシュメモリに接続される各プロセッサと共有キャッシュメモリの各コンパートメントの対応付けの情報を格納するレジスタから構成されており、この設定情報は外部の診断プロセッサからの制御により設定される。
【００３８】
図２では、設定の一例としてプロセッサ１１に共有キャッシュメモリのコンパートメント０，プロセッサ１２に共有キャッシュメモリのコンパートメント１が割り付けられている例を図示した。
【００３９】
（１−２）キャッシュメモリからの読み出し
続いて、プロセッサ１１及び１２から共有キャッシュメモリを読み出す場合の動作を説明すると、まずプロセッサからの読み出し要求は共有キャッシュ制御部５１へ送られ、共有キャッシュ制御部５１が共有キャッシュメモリの読み出しを開始する。
【００４０】
共有キャッシュ制御部５１は読み出し要求元のプロセッサ番号に該当するＩＤ信号をコンパートメント指示回路５２に送出し、当該プロセッサに割り当てられた共有キャッシュのコンパートメント割り当て情報を得る。
【００４１】
一方で、共有キャッシュ制御部５１から読み出しアドレスがアドレス生成部５４へ送られ、キャッシュＡＡ５６及びキャッシュＤＡ５７の索引アドレスを得て各々のメモリアレイを索引するとともに、索引キーをアドレス比較回路５９へ渡す。
【００４２】
アドレス比較回路５９では、キャッシュＡＡ５６とアドレス生成部５４から得た索引キーを比較し、キャッシュメモリのヒット判定を行うが、この時コンパートメント指示回路５２からのコンパートメント割り当て情報により、読み出し要求元のプロセッサに割り当てられていないコンパートメントのヒット判定をマスクする。データセレクタ６０では、このヒット判定結果を得てキャッシュＤＡ５７出力のデータのうち何れかを選択して共有キャッシュ制御部５１へ渡す。
【００４３】
この時、キャッシュヒットした場合は、共有キャッシュ制御部は取り出したデータをプロセッサへ送出して一連の動作が完了するが、キャッシュミスした場合はさらに主記憶４１からのデータの取り出しを伴う。
【００４４】
まず、アドレス比較回路５９からキャッシュミスヒットの通知を受け取った共有キャッシュ制御部は、システムバス３１へ要求アドレスの読み出し要求を送出する。
【００４５】
続いて、ＬＲＵメモリ５３は所定の論理に従ってリプレースすべきキャッシュコンパートメントを決定するが、この情報はリプレースエントリ生成回路５５においてコンパートメント指示回路５２から得られるコンパートメント割り当て情報によってマスクされ、要求元のプロセッサに割り当てられたコンパートメントの中から一つのコンパートメントを得る。
【００４６】
共有キャッシュ制御部は、キャッシュＡＡ５６及びキャッシュＤＡ５７から当該コンパートメントのデータを読み出し、当該データが更新されていれば（キャッシュ中に”ダーティ”ブロックとして登録されていれば）システムバス３１を経由して当該データを主記憶へ書き戻すことでリプレース動作が完了する。
【００４７】
続いてシステムバス３１への読み出し要求に応じて主記憶４１または他の共有キャッシュが応答を返してくるので、共有キャッシュ制御部５１は当該要求アドレス及びデータをキャッシュＡＡ５６及びキャッシュＤＡ５７の予め決定されたコンパートメントへ書き込むと同時に要求元のプロセッサへ送出する。
【００４８】
（１−３）キャッシュメモリへの書き込み
次に、プロセッサ１１及び１２から共有キャッシュメモリへの書き込み動作を説明すると、まずプロセッサからの書き込み要求では、読み出しの場合と同様に共有キャッシュメモリの索引が行われる。
【００４９】
キャッシュヒットした場合は、共有キャッシュ制御部５１からキャッシュＡＡ５６及びキャッシュＤＡ５７の当該エントリへ要求アドレスの書き込み、ＬＲＵメモリ５３の更新が行われて書き込み動作が完了する。
【００５０】
キャッシュミスした場合は、読み出し時のミスヒットと同様にまず主記憶ないし他の共有キャッシュメモリから当該共有キャッシュメモリへのデータの登録と、必要に応じてキャッシュのリプレース動作が行われた後、キャッシュヒットした場合と同様に共有キャッシュメモリへの書き込みが行われ、動作が完了する。
【００５１】
（１−４）バススヌープ動作
上記に説明したキャッシュの読み出し・書き込み動作の他、上記の共有キャッシュメモリの構成ではバススヌープによるキャッシュの索引及びデータの掃き出し・更新動作が必要となる。個々の索引動作・掃き出し動作・更新動作については、これまでの説明にて記述した動作と同様であるので、ここでは繰り返して説明は行わない。
【００５２】
（１−５）キャッシュ障害発生時の動作
プロセッサ１１ないし１２からの読み出し要求や、キャッシュミス時のリプレース動作に伴う読み出し動作において、障害検出回路５８が障害を検出した場合、共有キャッシュ制御部５１は共有キャッシュメモリの読み出し動作を中断して共有キャッシュユニット２１の動作を一次保留（ＨＯＬＤ）し、障害が発生したことを外部の診断プロセッサへ通知して以後の障害処理を委ねる。
【００５３】
診断プロセッサでは、キャッシュメモリの訂正不能障害の通知を受け取ると、共有キャッシュ制御部５１を介して障害コンパートメントの番号とプロセッサとキャッシュコンパートメントの割り当て情報を取り出し、障害が波及する論理区画を特定した後、当該論理区画をシステムダウンさせる処理を開始する。
【００５４】
一方で、当該の共有キャッシュメモリに障害が波及しない論理区画が含まれている場合は、コンパートメント指示回路５２の設定情報を更新して障害が発生した論理区画からのアクセスを抑止した後、共有キャッシュユニットの保留（ＨＯＬＤ）を解除し、運用を再開する。
【００５５】
図４は本発明の実施の第二の形態を示すブロック図である。同図において、本発明による共有キャッシュメモリ障害処理方式は、複数のコンパートメントを有し複数のプロセッサ１１および１２に共有された共有キャッシュユニット２１ａにおいて、キャッシュに発生した障害の回数をコンパートメントごとに保持する障害履歴レジスタ６２と、キャッシュへデータを登録するとき前記障害履歴レジスタの情報を参照してデータを登録するコンパートメントを決定しキャッシュに指示するリプレースエントリ指示回路６５と、キャッシュ索引時に訂正不可能な障害を検出したときそのコンパートメント番号を共有キャッシュ制御部に通知する障害検出回路６８と、任意のプロセッサからの要求によりキャッシュにアクセスした際にミスヒットによりデータを登録する場合には前記障害履歴レジスタの情報を参照し障害履歴を持たないコンパートメントを選択することを前記リプレースエントリ指示回路に指示する共有キャッシュ制御部６１とを有する。
【００５６】
すなわち、タグを格納するキャッシュアドレスアレイ（ＡＡ）６６およびデータを格納するキャッシュデータアレイ（ＤＡ）６７の障害が発生した回数を示す値をキャッシュコンパートメント毎に保持するレジスタが障害履歴レジスタ６２である。障害履歴レジスタ６２はキャッシュコンパートメント毎に複数ビットの情報を持ち、その各々をＯＲした情報が障害履歴として出力される。
【００５７】
プロセッサ１１または１２からの要求でキャッシュミスヒットによりキャッシュへデータ登録する際には、リプレースエントリ生成回路６５が障害が発生した履歴を持たないキャッシュコンパートメントの中から登録すべきコンパートメントを選択する。これにより、障害履歴を持つコンパートメントへの新規の登録を抑止し、実際に障害が発生したキャッシュラインが上書きされて再利用されることがないようにできる。
【００５８】
図４を参照すると、共有キャッシュ制御部６１はプロセッサ１１，１２及びシステムバス３１に接続されるとともに、共有キャッシュユニット２１ａ内の制御機能を有する。
【００５９】
障害履歴レジスタ６２は、共有キャッシュ制御部６１およびリプレースエントリ生成回路６５に接続され、タグを格納するキャッシュアドレスアレイ（ＡＡ）６６およびデータを格納するキャッシュデータアレイ（ＤＡ）６７の障害が発生した回数を示す値をキャッシュコンパートメント毎に保持する。
【００６０】
ＬＲＵメモリ６３は、キャッシュのＬＲＵ（Ｌｅａｓｔ　Ｒｅｃｅｎｔｌｙ　Ｕｓｅｄ）制御に用いるメモリで、キャッシュのライン毎に１ｂｉｔの情報を保持する。
【００６１】
アドレス生成部６４は、共有キャッシュ制御部６１からキャッシュ索引アドレスを入力して保持し、そのアドレスの一部をキャッシュ索引アドレス及びキーとして出力する。
【００６２】
リプレースエントリ生成回路６５は、共有キャッシュ制御部６１，障害履歴レジスタ６２，ＬＲＵメモリ６３に接続され、キャッシュへのデータ登録時に登録すべきキャッシュコンパートメントを決定し、キャッシュアドレスアレイ（ＡＡ）６６及びキャッシュデータアレイ（ＤＡ）６７に対して、決定したコンパートメントへの書き込み指示を出力する。
【００６３】
上記の共有キャッシュは、２ウェイセットアソシアティブ方式のストアインキャッシュであって、キャッシュＡＡ６６及びキャッシュＤＡ６７はそれぞれ２つのコンパートメントからなるメモリアレイを構成する。
【００６４】
アドレス比較回路６９は、キャッシュＡＡ６６及びアドレス生成部６４から信号を入力し、キャッシュヒット判定を行い、判定結果を出力する。この時、ヒット判定結果は障害履歴レジスタ６２が格納する障害履歴情報に影響されず、障害履歴が残っているコンパートメントであってもキーが一致した場合はキャッシュヒットとして判定結果を出力する。
【００６５】
データセレクタ７０は、アドレス比較回路６９からの出力を受け取り、キャッシュＤＡ６７から読み出したデータのうちいずれかのコンパートメントからの出力を選択し出力する。
【００６６】
障害検出回路６８は、キャッシュＡＡ６６及びキャッシュＤＡ６７から読み出した値のパリティチェックを行い、エラーを検出した場合にそのコンパートメントの番号を共有キャッシュ制御部６１へ出力する。
【００６７】
図５は、障害履歴レジスタ６２及びリプレースエントリ生成回路６５の例を示す構成図である。
【００６８】
同図において、障害履歴レジスタ６２は、２ビット幅のレジスタの２ワードからなるレジスタ群により構成される。ここでレジスタのワード数はキャッシュのコンパートメント数に等しく、ビット幅は管理すべき障害履歴の回数表示が可能な幅が必要である。これらのレジスタ群は、共有キャッシュ制御部６１よりセットされるキャッシュコンパートメント毎の障害履歴情報を保持する。保持される情報の全ては、共有キャッシュ制御部６１及びリプレースエントリ生成回路６５へ分配される。
【００６９】
さらに、リプレースエントリ回路６５は、同図に示すように単純な組み合わせ回路により実現される。図６に、この組み合わせ回路の真理値を示す。
【００７０】
次に、上記の共有キャッシュメモリの動作について説明する。
【００７１】
（２−１）共有キャッシュメモリの初期化
システムの立ち上げ時、共有キャッシュメモリの初期化動作を行う場合には、通常キャッシュメモリの初期化時に行われるようなキャッシュＡＡ６６及びキャッシュＤＡ６７，ＬＲＵメモリ６３のクリア動作に加えて、障害履歴レジスタ６２への初期設定が行われる。
【００７２】
障害履歴レジスタ６２は、図５に示したように、共有キャッシュメモリの各コンパートメント毎の障害履歴情報を格納するレジスタから構成されており、この履歴情報はシステム運用中に障害が発生した際に共有キャッシュ制御部６１により書き換えられる。ここでは、システム動作中の一状態を示すため、適当な値を選んで図中に表示してある。
【００７３】
（２−２）キャッシュメモリからの読み出し
続いて、プロセッサ１１及び１２から共有キャッシュメモリを読み出す場合の動作を説明すると、まずプロセッサからの読み出し要求は共有キャッシュ制御部６１へ送られ、共有キャッシュ制御部６１が共有キャッシュメモリの読み出しを開始する。
【００７４】
すなわち、共有キャッシュ制御部６１から読み出しアドレスがアドレス生成部６４へ送られ、キャッシュＡＡ６６及びキャッシュＤＡ６７の索引アドレスを得て各々のメモリアレイを索引するとともに、索引キーをアドレス比較回路６９へ渡す。
【００７５】
アドレス比較回路６９では、キャッシュＡＡ６６とアドレス生成部６４から得た索引キーを比較し、キャッシュメモリのヒット判定を行う。データセレクタ７０では、このヒット判定結果を得てキャッシュＤＡ６７出力のデータのうち何れかを選択して共有キャッシュ制御部６１へ渡す。
【００７６】
この時、キャッシュヒットした場合は、共有キャッシュ制御部は取り出したデータをプロセッサへ送出して一連の動作が完了するが、キャッシュミスした場合はさらに主記憶４１からのデータの取り出しを伴う。
【００７７】
まず、アドレス比較回路６９からキャッシュミスヒットの通知を受け取った共有キャッシュ制御部６１は、システムバス３１へ要求アドレスの読み出し要求を送出する。
【００７８】
続いて、ＬＲＵメモリ６３に格納されるＬＲＵ情報により所定の論理に従ってリプレースすべきキャッシュコンパートメントを決定するが、この情報はリプレースエントリ生成回路６５において障害履歴レジスタ６２から得られる障害履歴情報によってマスクされ、障害履歴を持たないコンパートメントの中から新規の登録を行うべきコンパートメント番号を得る。
【００７９】
共有キャッシュ制御部６１は、キャッシュＡＡ６６及びキャッシュＤＡ６７から当該コンパートメントのデータを読み出し、当該データが更新されていれば（キャッシュ中に”ダーティ”ブロックとして登録されていれば）システムバス３１を経由して当該データを主記憶４１へ書き戻すことでリプレース動作が完了する。
【００８０】
続いて、システムバス３１への読み出し要求に応じて主記憶４１または他の共有キャッシュメモリが応答を返してくるので、共有キャッシュ制御部６１は当該要求アドレス及びデータをキャッシュＡＡ６６及びキャッシュＤＡ６７の予め決定されたコンパートメントへ書き込むと同時に要求元のプロセッサへ送出して一連の動作を完了する。
【００８１】
（２−３）キャッシュメモリへの書き込み
次に、プロセッサ１１及び１２から共有キャッシュメモリへの書き込み動作を説明すると、まずプロセッサからの書き込み要求では、読み出しの場合と同様に共有キャッシュメモリの索引が行われる。
【００８２】
キャッシュヒットした場合は、共有キャッシュ制御部６１からキャッシュＡＡ６６及びキャッシュＤＡ６７の当該エントリへ要求アドレスの書き込み、ＬＲＵメモリ６３の更新が行われて書き込み動作が完了する。
【００８３】
キャッシュミスした場合は、読み出し時のミスヒットと同様に、まず主記憶４１もしくは他の共有キャッシュメモリから当該共有キャッシュメモリへのデータの登録と、必要に応じてキャッシュのリプレース動作が行われた後、キャッシュヒットした場合と同様に共有キャッシュメモリへの書き込みが行われ、動作が完了する。
【００８４】
（２−４）バススヌープ動作
上記に説明したキャッシュの読み出し・書き込み動作の他、上記の共有キャッシュメモリの構成ではバススヌープによるキャッシュの索引及びデータの掃き出し・更新動作が必要となる。個々の索引動作・掃き出し動作・更新動作については、これまでの説明にて記述した動作と同様であるので、ここでは繰り返して説明は行わない。
【００８５】
（２−５）キャッシュ障害発生時の動作
プロセッサ１１や１２からの読み出し要求において、障害検出回路６８が障害を検出した場合、共有キャッシュ制御部６１は読み出したデータを破壊された（ＰＯＩＳＯＮＥＤ）データとして要求元のプロセッサへ送出する。破壊されたデータを受け取ったプロセッサは、処理の継続が不可能と判断して当該プロセッサを含む区画をシステムダウンさせるなどの処理を開始するが、このプロセスについては本発明の説明の範囲を超えるので詳しくは言及しない。
【００８６】
続いて、共有キャッシュ制御部６１は障害履歴レジスタ６２の値を読み出し、障害が発生したコンパートメントに対応するワードの値を＋１加算して再び障害履歴レジスタ６２へ書き戻すことで、以後のキャッシュリプレース動作等で新規に当該コンパートメントにデータが登録されることがないようにする。
【００８７】
この時、読み出した障害履歴情報の値が規定値（ここでは”１１”とする）であった場合は、同一コンパートメントでの障害が規定回数以上繰り返されたと判断して、共有キャッシュユニット２１の動作を保留（ＨＯＬＤ）し、繰り返し障害が発生したことを外部の診断プロセッサ（図示せず）へ通知して以後の障害処理を委ねる。
【００８８】
診断プロセッサは通知を受け取ると、全ての区画を含むシステムをダウンさせるなどの障害処理を実行する。
【００８９】
読み出した障害履歴情報の値が規定値を超えていなかった場合は、共有キャッシュ制御部６１はさらにキャッシュＡＡ６６の障害が発生したラインをフラッシュし、破壊されたデータを破棄する。
【００９０】
また、キャッシュミス時のリプレース動作や書き込みに伴う読み出し動作時に障害を検出した場合は、上記の読み出しの場合と同様の手段でプロセッサに障害を通知することが出来ないので、共有キャッシュ制御部６１はキャッシュアクセスに使用されたアドレスから当該アドレスに対応する区画を特定しようと試みる。
【００９１】
上記の共有キャッシュメモリのように、複数の区画を同時に稼働させる場合では、主記憶アドレスを各区画毎に割り当てた上でプロセッサからの主記憶アクセスは各区画毎のオフセット値を加算することで行われる。このオフセット値の加算はプロセッサ内部で行われることを想定しており、これにより各区画のソフトウェアは他の区画の存在を意識することなく実行することが可能となっている。従って障害を検出したときのアドレスが分かれば、対応する区画を特定することは可能である。
【００９２】
区画の特定に成功した場合は、共有キャッシュ制御部は当該区画に属するプロセッサのいずれかへ割り込み等の手段を用いて障害を通知する。通知を受け取ったプロセッサは、処理の継続が不可能と判断して当該区画をシステムダウンさせるなどの処理を開始する。
【００９３】
万一区画の特定が出来なかった場合、共有キャッシュ制御部６１は全区画を含むシステム全体の運用継続が不可能と判断して共有キャッシュユニット２１の動作を保留（ＨＯＬＤ）し、致命的な障害が発生したことを外部の診断プロセッサへ通知して以後の障害処理を委ねる。
【００９４】
さらに区画の特定に成功した場合は、前述の読み出しの場合と同様、以後の障害履歴レジスタ６２の更新およびキャッシュラインのフラッシュ動作が行われて一連の障害時の動作を完了する。
【００９５】
上記のように、キャッシュメモリの障害が発生してプロセッサへの通知，障害履歴レジスタの更新，および当該キャッシュラインのフラッシュが完了した後は、共有キャッシュユニット２１は直ちに運用を継続して後続のメモリアクセス要求やバススヌープ動作を実行することが可能となる。従って障害が波及しない区画では、障害の影響を受けることなく運用を継続することが可能である。
【００９６】
また、障害履歴により障害が発生したコンパートメントへの登録が抑止された後も、当該コンパートメントの障害が発生したライン以外の既に登録済みのライン（健全なデータ）からの読み出し・書き込みを行うことは可能である。キャッシュの障害時に一般に行われているようなコンパートメント縮退方式に比べると、本発明では縮退操作に伴うキャッシュの掃き出し処理が不要であり、障害処理手順が簡略化されている。
【００９７】
なお、上記の実施の形態では、共有キャッシュメモリの構成を２ウェイセットアソシアティブ方式としたが、これをさらに多くのコンパートメントから構成することもできる。例えば、４ウェイセットアソシアティブ構成にする場合、ＬＲＵメモリにキャッシュの各ライン毎６ビットの情報を付与する必要があり、リプレースエントリ生成回路もより複雑な構成となるが、上記と同様の動作にて実現可能である。
【００９８】
さらに、上記においては、障害検出回路はパリティ検出回路であったが、今日ミッションクリティカル用途のコンピュータシステムで通常用いられているようにＥＣＣ（エラー訂正コード）による障害検出・訂正を行う事も可能である。
【００９９】
また、本発明は、共有キャッシュに接続されるプロセッサやバスの数に依存することなく実現可能である。
【０１００】
【発明の効果】
以上、詳細に説明したように、本発明によれば、共有キャッシュメモリにおいて訂正不可能な障害を検出した場合に、その障害が発生したキャッシュラインを利用している論理区画を容易に特定することができるので、障害を他の論理区画の動作に波及させることなく当該の区画のみに閉じた障害処理、例えばシステムダウンを行わせることが可能となる。
【０１０１】
さらに、キャッシュメモリは障害が発生した後も運用を継続することが可能であり、障害が波及する区画を除く他の区画に対しては影響を与えることなくシステムの運用を継続することができる。
【図面の簡単な説明】
【図１】本発明の実施の第一の形態を示すブロック図。
【図２】図１の主要部を示す構成図。
【図３】リプレースエントリ生成回路５５の真理値を示す説明図。
【図４】本発明の実施の第二の形態を示すブロック図。
【図５】図４の主要部を示す構成図。
【図６】リプレースエントリ生成回路６５の真理値を示す説明図。
【図７】本発明の従来例を示す説明図。
【符号の説明】
１１，１２，１３，１４　　プロセッサ
２１，２１ａ，２２　　共有キャッシュユニット
３１　　システムバス
４１　　主記憶
５１，６１　　共有キャッシュ制御部
５２　　コンパートメント指示回路
５３，６３　　ＬＲＵメモリ
５４，６４　　アドレス生成部
５５，６５　　リプレースエントリ生成回路
５６，６６　　キャッシュＡＡ
５７，６７　　キャッシュＤＡ
５８，６８　　障害検出回路
５９，６９　　アドレス比較回路
６０，７０　　データセレクタ
６２　　障害履歴レジスタ

[0001]
[Technical field of the invention]
The present invention relates to a shared cache memory failure handling method, and more particularly to a shared cache memory failure handling method for continuing system operation while limiting a range affected by a failure occurring in a cache memory shared by a plurality of processors.
[0002]
[Prior art]
FIG. 7 is an explanatory diagram illustrating an example of a computer system including a shared cache memory. In the figure, shared cache units 21 and 22 are shared caches of identical configuration; unit 21 is shared by processors 11 and 12, and unit 22 by processors 13 and 14. The shared cache units 21 and 22 are in turn connected to the main memory 41 via the system bus 31.
[0003]
Recently, in such a system, it has become possible to logically divide resources such as processors and main memory into a plurality of partitions and operate each partition as an independent system. In this case, the start address of the memory area allocated to a partition is added as an offset to every memory access from a processor belonging to that partition at the time the access is issued by the processor, so that software can execute instructions without being aware of the existence of other partitions.
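The offset addition described above can be sketched as follows. This is only an illustration: the partition names, the base addresses, and the function name are assumptions and do not appear in the specification.

```python
# Hypothetical partition table: partition name -> start address of its memory area.
# The names and base addresses are illustrative assumptions.
PARTITION_BASE = {"partition1": 0x0000_0000, "partition2": 0x4000_0000}


def to_system_address(partition: str, local_address: int) -> int:
    """Add the partition's start address as an offset to a partition-local address."""
    return PARTITION_BASE[partition] + local_address
```

Software in each partition works with local addresses only; the addition happens when the access leaves the processor.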
[0004]
However, when the reliability of each partition is considered, a unit shared by a plurality of partitions, for example the shared cache unit 21, is, as can be seen from the figure, a single point of failure (SPOF) for partition 1 and partition 2, and there is the problem that a failure of the cache memory brings both partitions down.
[0005]
[Problems to be solved by the invention]
As described above, a computer system with a conventional shared cache memory has the drawback that, when a failure occurs in a cache memory shared by a plurality of partitions and the failure is uncorrectable, every one of those partitions goes down.
[0006]
An object of the present invention is to remedy the above problem by providing a shared cache memory failure handling method that identifies the location of a failure that has occurred in a cache memory, limits the partitions affected by the failure, and thereby keeps the range over which the failure propagates as narrow as possible.
[0007]
[Means for Solving the Problems]
The shared cache memory failure handling method according to the present invention is characterized in that, in a shared cache memory that has a plurality of compartments and is shared by a plurality of processors, usable compartments are assigned to each processor in advance, and, where the shared cache memory is shared by processors divided into an arbitrary plurality of partitions, means is provided for identifying the partition affected by a failure and handling that failure when an uncorrectable failure is detected during cache indexing.
[0008]
Further, the shared cache memory failure handling method according to the present invention is characterized in that a shared cache memory having a plurality of compartments and shared by a plurality of processors comprises: a compartment instruction circuit that holds configuration information indicating which compartments are assigned to each processor; an address comparison circuit that, when the cache is indexed by a request from any processor, suppresses the hit determination for compartments not assigned to the requesting processor; a replacement entry generation circuit that, when data is registered in the cache, selects a compartment assigned to the requesting processor according to the configuration information held by the compartment instruction circuit; a failure detection circuit that reports the compartment number when an uncorrectable failure is detected during cache indexing; and a shared cache control unit that, based on the report from the failure detection circuit, identifies the partition affected by the failure, updates the configuration information, notifies the compartment instruction circuit, and performs failure handling.
[0009]
Further, in the above shared cache memory failure handling method, the compartment instruction circuit comprises a register group whose bit width equals the number of cache compartments and whose word count equals the number of processors connected to the cache. It obtains the configuration information from the shared cache control unit and holds it, further obtains a processor ID signal from the shared cache control unit to generate compartment allocation information, and sends that information to the address comparison circuit and the replacement entry generation circuit.
[0010]
Further, the shared cache memory failure handling method according to the present invention is characterized in that a shared cache memory having a plurality of compartments and shared by a plurality of processors comprises: a failure history register that holds, for each compartment, the number of failures that have occurred in the cache; a replacement entry instruction circuit that, when data is registered in the cache, refers to the failure history register to determine the compartment in which the data is registered and instructs the cache accordingly; a failure detection circuit that reports the compartment number to the shared cache control unit when an uncorrectable failure is detected during cache indexing; and a shared cache control unit that, when data is to be registered because of a miss on a cache access requested by any processor, refers to the failure history register and instructs the replacement entry instruction circuit to select a compartment having no failure history.
[0011]
Further, in the above shared cache memory failure handling method, the failure history register comprises a register group whose word count equals the number of cache compartments and whose bit width is sufficient to hold the failure count to be managed. It holds the failure information sent by the shared cache control unit as failure history information for each compartment and sends that information to the replacement entry generation circuit and the shared cache control unit.
[0012]
Further, in the shared cache memory failure handling method of the present invention, the shared cache control unit notifies a diagnostic processor of the failure information received from the failure detection circuit and handles the failure based on the diagnosis result of the diagnostic processor.
[0013]
That is, the cache memory according to the present invention is a set-associative shared cache consisting of a plurality of compartments, in which the usable compartments can be assigned in advance to each processor sharing the cache. As a result, when the shared cache is shared by processors divided into a plurality of partitions, the partition that goes down when an uncorrectable failure occurs in the cache can be identified and the range affected by the failure can be limited.
[0014]
Further, the shared cache memory according to the present invention has a failure history register that holds, for each compartment, the failure history of that compartment. This makes it possible to manage the failure history of the cache compartments and, when the replacement entry for a cache miss is determined, to suppress registration in a compartment that has a failure history.
[0015]
Furthermore, because the failure history is managed, the failure history register can be consulted when a failure is detected in the cache memory; if it is found that failures have occurred repeatedly in the compartment concerned, the operation of the cache memory can be stopped and the system brought down.
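The failure history register outlined above might be modeled as in the following sketch. It assumes one counter word per compartment and a 2-bit count whose limit is the binary value 11, which is the example value given in the description of the second embodiment; the class and method names are hypothetical.

```python
class FaultHistoryRegister:
    """Per-compartment failure counters (a sketch, not the patented circuit)."""

    def __init__(self, num_compartments: int, count_bits: int = 2):
        # Limit is the all-ones value of the counter, e.g. 0b11 for 2 bits.
        self.limit = (1 << count_bits) - 1
        self.counts = [0] * num_compartments

    def has_history(self, compartment: int) -> bool:
        # Equivalent to ORing the bits of the compartment's counter word.
        return self.counts[compartment] != 0

    def record_fault(self, compartment: int) -> bool:
        """Return True if the limit was already reached (repeated failure);
        otherwise add 1 to the counter and return False."""
        if self.counts[compartment] == self.limit:
            return True
        self.counts[compartment] += 1
        return False

    def usable_mask(self) -> int:
        """Bit n is set when compartment n has no failure history."""
        mask = 0
        for n, count in enumerate(self.counts):
            if count == 0:
                mask |= 1 << n
        return mask
```

A True result from `record_fault` corresponds to the case where the cache operation is stopped; `usable_mask` is what a replacement decision would consult to avoid compartments with a history.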
[0016]
The present invention is particularly effective for a system in which a computer system is divided into a plurality of logical partitions, each able to run an operating system independently, and in which a cache memory is shared by a plurality of partitions. It makes it possible to identify the partition in which an uncorrectable cache failure has occurred, to limit the range affected by the failure, and to continue operation without affecting the other partitions.
[0017]
[Embodiments of the invention]
Hereinafter, the present invention will be described with reference to the drawings.
[0018]
FIG. 1 is a block diagram showing a first embodiment of the present invention. In the figure, the shared cache memory failure handling method according to the present invention comprises, in a shared cache unit 21 having a plurality of compartments and shared by a plurality of processors 11 and 12: a compartment instruction circuit 52 that holds configuration information indicating which compartments are assigned to each processor; an address comparison circuit 59 that, when the cache is indexed by a request from any processor, suppresses the hit determination for compartments not assigned to the requesting processor; a replacement entry generation circuit 55 that, when data is registered in the cache, selects a compartment assigned to the requesting processor according to the configuration information held by the compartment instruction circuit; a failure detection circuit 58 that reports the compartment number when an uncorrectable failure is detected during cache indexing; and a shared cache control unit 51 that, based on the report from the failure detection circuit, identifies the partition affected by the failure, updates the configuration information, notifies the compartment instruction circuit, and performs failure handling.
[0019]
That is, configuration information indicating which cache compartment is to be assigned to each of the processors 11 and 12 is stored in the compartment instruction circuit 52. When the cache is indexed by a request from the processor 11 or 12, the address comparison circuit 59 suppresses the determination of a hit in a cache compartment that is not allocated to the requesting processor.
[0020]
When registering data in the cache, the replacement entry generation circuit 55 selects a compartment to be registered from the cache compartments assigned to the requesting processor. Thus, it is possible to arbitrarily assign an available cache compartment for each processor according to the configuration information held by the compartment instruction circuit 52.
[0021]
When the failure detection circuit 58 detects an uncorrectable error at the time of cache indexing, the shared cache control unit 51 notifies the diagnostic processor of an error report and the requesting processor number. As a result, the diagnostic processor specifies the partition affected by the failure, and brings the partition down.
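A minimal sketch of how the affected partition might be derived from the reported compartment number and the per-processor allocation words. The processor-to-partition table, the function name, and the choice to clear the allocation words of the failed partition are assumptions for illustration; the specification only states that the configuration information is updated so that access from the failed partition is suppressed.

```python
def handle_uncorrectable_fault(config_words, processor_partition, faulty_compartment):
    """Return (affected partitions, updated allocation words).

    config_words[p] is the compartment bit mask assigned to processor p
    (bit n = compartment n); processor_partition[p] names its partition.
    """
    bit = 1 << faulty_compartment
    # Partitions owning at least one processor that may use the faulty compartment.
    affected = {
        processor_partition[p]
        for p, word in enumerate(config_words)
        if word & bit
    }
    # Assumption: processors in an affected partition lose all their compartments.
    updated = [
        0 if processor_partition[p] in affected else word
        for p, word in enumerate(config_words)
    ]
    return affected, updated
```

With processor 11 on compartment 0 and processor 12 on compartment 1, a fault in compartment 1 affects only the partition containing processor 12, and the other allocation word is left unchanged.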
[0022]
Referring to FIG. 1, the shared cache control unit 51 is connected to the processors 11, 12 and the system bus 31, and has a control function in the shared cache unit 21.
[0023]
The compartment instruction circuit 52 is connected to the shared cache control unit 51, the replacement entry generation circuit 55, and the address comparison circuit 59. During cache indexing it outputs information indicating which cache compartments are assigned to, and therefore valid for, the index request source.
[0024]
The LRU memory 53 is a memory used for LRU (Least Recently Used) control of the cache, and holds 1-bit information for each line of the cache.
[0025]
The address generation unit 54 inputs and holds the cache index address from the shared cache control unit 51, and outputs a part of the address as a cache index address and a key.
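The split of an address into a cache index address and a key can be sketched as below. The field widths (64-byte lines, 256 sets) are assumptions chosen for illustration; the specification does not give them.

```python
LINE_OFFSET_BITS = 6  # assumed 64-byte cache line
INDEX_BITS = 8        # assumed 256 sets


def split_address(address: int):
    """Return (index, key): the index selects a line in AA/DA, the key is compared on lookup."""
    index = (address >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    key = address >> (LINE_OFFSET_BITS + INDEX_BITS)
    return index, key
```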
[0026]
The replacement entry generation circuit 55 is connected to the shared cache control unit 51, the compartment instruction circuit 52, and the LRU memory 53. When data is registered in the cache, it determines the cache compartment in which to register the data and outputs an instruction to write to that compartment to the cache address array (AA) 56 and the cache data array (DA) 57.
[0027]
The shared cache is a two-way set-associative store-in cache; the cache AA 56, which stores the cache key addresses, and the cache DA 57, which stores the data blocks, each form a memory array of two compartments.
[0028]
The address comparison circuit 59 receives signals from the cache AA 56, the compartment instruction circuit 52, and the address generation unit 54, performs a cache hit determination, and outputs a determination result.
[0029]
The data selector 60 receives the output of the address comparison circuit 59, selects the output of one compartment from among the data read from the cache DA 57, and outputs it.
[0030]
The failure detection circuit 58 performs a parity check on the values read from the cache AA 56 and the cache DA 57, and outputs the compartment number to the shared cache control unit 51 when an error is detected.
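As an illustration of the parity check, the sketch below recomputes a parity bit for the value read from each compartment and reports the compartments whose stored parity does not match. Even parity over the whole value, and the function names, are assumptions; the specification does not state the parity scheme.

```python
def parity_bit(value: int) -> int:
    """Even-parity bit: 1 when the value contains an odd number of 1 bits."""
    return bin(value).count("1") & 1


def detect_faulty_compartments(entries):
    """entries[n] is (value, stored_parity) read from compartment n.
    Return the numbers of the compartments whose parity check fails."""
    return [
        n
        for n, (value, stored) in enumerate(entries)
        if parity_bit(value) != stored
    ]
```

A single parity bit detects an error but cannot correct it, which is why a mismatch is treated here as an uncorrectable failure.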
[0031]
FIG. 2 is a configuration diagram showing an example of the compartment instruction circuit 52 and the replacement entry generation circuit 55.
[0032]
In FIG. 2, the compartment instruction circuit 52 consists of a register group of two 2-bit-wide words. The bit width of each register equals the number of cache compartments, and the number of words equals the number of processors directly connected to the shared cache unit 21. These registers hold the per-compartment allocation configuration information supplied by the shared cache control unit 51.
[0033]
In addition, the compartment instruction circuit 52 receives from the shared cache control unit 51 two requester ID signals, one for each word of the register group. For each bit position of the register group, it outputs the logical OR of the values stored in the words whose requester ID signal is "1". This output is distributed as compartment allocation information to the replacement entry generation circuit 55 and the address comparison circuit 59.
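The register group and its OR output can be modeled as in the following sketch. The class and method names are hypothetical, and the convention that bit n of a word stands for compartment n is an assumption.

```python
class CompartmentInstructionCircuit:
    """One register word per processor, one bit per compartment (a sketch)."""

    def __init__(self, num_processors: int, num_compartments: int):
        self.num_compartments = num_compartments
        self.words = [0] * num_processors

    def set_config(self, processor_index: int, compartment_mask: int) -> None:
        # Keep only as many bits as there are compartments.
        self.words[processor_index] = compartment_mask & ((1 << self.num_compartments) - 1)

    def allocation(self, requester_id_signals) -> int:
        """OR together the words whose requester ID signal is 1."""
        mask = 0
        for word, selected in zip(self.words, requester_id_signals):
            if selected:
                mask |= word
        return mask
```

For the setting shown in FIG. 2 (processor 11 on compartment 0, processor 12 on compartment 1), the words would be `0b01` and `0b10`, and a request from processor 11 yields the allocation mask `0b01`.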
[0034]
Further, referring to FIG. 2, the replacement entry generation circuit 55 is realized by a simple combinational circuit. FIG. 3 shows the truth table of this combinational circuit.
[0035]
Next, the operation of the above-described shared cache memory will be described.
[0036]
(1-1) Initialization of shared cache memory
When the shared cache memory is initialized at system startup, the compartment instruction circuit 52 is given its initial settings, in addition to the clearing of the cache AA 56, the cache DA 57, and the LRU memory 53 that is normally performed when a cache memory is initialized.
[0037]
As described above with reference to FIG. 2, the compartment instruction circuit 52 consists of registers that store the association between each processor connected to the shared cache memory and each compartment of the shared cache memory. This setting information is set under the control of an external diagnostic processor.
[0038]
FIG. 2 illustrates an example in which the processor 11 is assigned the compartment 0 of the shared cache memory and the processor 12 is assigned the compartment 1 of the shared cache memory as an example of the setting.
[0039]
(1-2) Reading from cache memory
Next, the operation of reading the shared cache memory from the processors 11 and 12 will be described. First, a read request from the processor is sent to the shared cache control unit 51, and the shared cache control unit 51 starts reading the shared cache memory.
[0040]
The shared cache control unit 51 sends an ID signal corresponding to the processor number of the read request source to the compartment instruction circuit 52, and obtains compartment allocation information of the shared cache allocated to the processor.
[0041]
On the other hand, the read address is sent from the shared cache control unit 51 to the address generation unit 54, the index addresses of the cache AA 56 and the cache DA 57 are obtained, each memory array is indexed, and the index key is passed to the address comparison circuit 59.
[0042]
The address comparison circuit 59 compares the output of the cache AA 56 with the index key obtained from the address generation unit 54 and makes a cache hit determination. At this time, the compartment allocation information from the compartment instruction circuit 52 is used to mask the hit determination for compartments that are not assigned to the processor that issued the read request. The data selector 60 obtains the hit determination result, selects one of the data items output from the cache DA 57, and transfers it to the shared cache control unit 51.
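The masked hit determination can be sketched as follows. This is only an illustrative model: the function name, the per-compartment valid flags, and the list-based interface are assumptions and do not appear in the specification.

```python
def hit_compartments(stored_keys, valid, key, allocation_mask):
    """Return the compartments that hit, ignoring compartments the requester may not use.

    stored_keys[i] is the tag read from compartment i of the address array,
    valid[i] says whether that entry holds a registered line (assumed flag),
    allocation_mask has bit i set when compartment i is assigned to the requester.
    """
    hits = []
    for comp, (stored, is_valid) in enumerate(zip(stored_keys, valid)):
        allowed = (allocation_mask >> comp) & 1
        if is_valid and stored == key and allowed:
            hits.append(comp)
    return hits


# Both compartments hold the same tag, but only compartment 0 is assigned
# to the requesting processor, so only compartment 0 can report a hit.
print(hit_compartments([0x1A, 0x1A], [True, True], 0x1A, 0b01))
```

Because a tag match in an unassigned compartment is suppressed, data registered by one partition is never returned to a processor of another partition.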
[0043]
At this time, if a cache hit occurs, the shared cache control unit sends the fetched data to the processor to complete a series of operations. However, if a cache miss occurs, data is further fetched from the main memory 41.
[0044]
First, the shared cache control unit that has received the notification of the cache mishit from the address comparison circuit 59 sends a request for reading the requested address to the system bus 31.
[0045]
Subsequently, the LRU memory 53 determines the cache compartment to be replaced according to a predetermined logic. In the replacement entry generation circuit 55, this information is masked by the compartment allocation information obtained from the compartment instruction circuit 52, so that one compartment is obtained from among the compartments assigned to the requesting processor.
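A minimal sketch of this masked selection is shown below. The truth table of FIG. 3 is not reproduced in the text, so the fallback rule used here (take the lowest-numbered assigned compartment when the LRU candidate is not assigned) and the function name are assumptions.

```python
def select_victim(lru_candidate, allocation_mask, num_compartments=2):
    """Pick a replacement compartment restricted to the requester's allocation.

    lru_candidate is the compartment suggested by the LRU memory;
    allocation_mask has bit i set when compartment i is assigned to the requester.
    Returns None when no compartment is assigned (assumed behaviour).
    """
    if (allocation_mask >> lru_candidate) & 1:
        return lru_candidate
    for comp in range(num_compartments):
        if (allocation_mask >> comp) & 1:
            return comp
    return None


# The LRU memory suggests compartment 1, but the requester owns only compartment 0.
print(select_victim(1, 0b01))
```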
[0046]
The shared cache control unit 51 reads the data of that compartment from the cache AA 56 and the cache DA 57 and, if the data has been updated (registered in the cache as a “dirty” block), completes the replacement operation by writing the data back to the main memory 41 via the system bus 31.
[0047]
Subsequently, the main memory 41 or another shared cache returns a response to the read request sent to the system bus 31. The shared cache control unit 51 then writes the requested address and data to the previously determined compartment of the cache AA 56 and the cache DA 57 and, at the same time, sends the data to the requesting processor.
[0048]
(1-3) Writing to cache memory
Next, the write operation from the processors 11 and 12 to the shared cache memory will be described. First, for a write request from the processor, the shared cache memory is indexed in the same way as for a read.
[0049]
If a cache hit occurs, the shared cache control unit 51 writes the requested address to the corresponding entry of the cache AA 56 and the cache DA 57, updates the LRU memory 53, and completes the write operation.
[0050]
When a cache miss occurs, data is first registered from the main memory or another shared cache memory to the shared cache memory, and a cache replacement operation is performed as necessary. Writing to the shared cache memory is performed as in the case of a hit, and the operation is completed.
[0051]
(1-4) Bus snoop operation
In addition to the cache read and write operations described above, the configuration of the shared cache memory described above requires cache indexing and data sweep-out / update operations triggered by bus snooping. The individual indexing, sweep-out, and update operations are the same as those already described, so their description is not repeated here.
[0052]
(1-5) Operation when a cache failure occurs
When the failure detection circuit 58 detects a failure during a read request from the processor 11 or 12, or during a read operation accompanying a replacement operation at the time of a cache miss, the shared cache control unit 51 interrupts the read operation of the shared cache memory, temporarily suspends the operation of the shared cache unit 21 (HOLD), notifies an external diagnostic processor that a failure has occurred, and entrusts the subsequent failure processing to it.
[0053]
Upon receiving the notification of the uncorrectable failure of the cache memory, the diagnostic processor extracts the failed compartment number and the allocation information of the processors and cache compartments through the shared cache control unit 51, identifies the logical partition to which the failure spreads, and starts the process of bringing that logical partition down.
[0054]
On the other hand, if the shared cache memory is also used by a logical partition to which the failure does not propagate, the setting information of the compartment instruction circuit 52 is updated to prevent access from the failed logical partition, after which the hold of the shared cache unit is released and operation resumes.
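The failure processing of paragraphs [0053] and [0054] can be sketched as follows. The specification does not state how the setting information is rewritten, so clearing the whole register word of each affected processor is an assumption, as are the function name and the use of a register word index to stand for a processor.

```python
def handle_compartment_failure(registers, failed_compartment):
    """Identify processors that use the failed compartment and cut off their access.

    registers[p] is the allocation word for processor index p (bit i = compartment i).
    Returns (affected processor indexes, updated register words).
    """
    failed_bit = 1 << failed_compartment
    affected = [p for p, word in enumerate(registers) if word & failed_bit]
    # Assumed update policy: an affected processor loses all of its compartments,
    # so requests from the failed partition can no longer hit or register data.
    updated = [0 if p in affected else word for p, word in enumerate(registers)]
    return affected, updated


# Compartment 1 fails; only the second processor (index 1) was using it.
affected, updated = handle_compartment_failure([0b01, 0b10], 1)
print(affected, [format(w, "02b") for w in updated])
```

In this sketch the first processor keeps compartment 0, which corresponds to the unaffected partition resuming operation once the hold is released.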
[0055]
FIG. 4 is a block diagram showing a second embodiment of the present invention. In the figure, the shared cache memory failure processing system according to the present invention is provided in a shared cache unit 21a that has a plurality of compartments and is shared by a plurality of processors 11 and 12, and comprises: a failure history register 62 that holds, for each compartment, the number of failures that have occurred in the cache; a replacement entry generation circuit 65 that, when data is registered in the cache, determines the compartment in which the data is to be registered by referring to the information in the failure history register and instructs the cache accordingly; a failure detection circuit 68 that notifies the shared cache control unit of the compartment number when an uncorrectable failure is detected during cache indexing; and a shared cache control unit 61 that, when data is registered because of a miss on a cache access requested by a processor, instructs the replacement entry generation circuit to refer to the information in the failure history register and select a compartment with no failure history.
[0056]
That is, the failure history register 62 is a register that stores, for each cache compartment, a value indicating the number of times a failure has occurred in the cache address array (AA) 66, which stores tags, and the cache data array (DA) 67, which stores data. The failure history register 62 holds a plurality of bits of information for each cache compartment, and the logical OR of those bits is output as the failure history of that compartment.
[0057]
When registering data in the cache due to a cache mishit in response to a request from the processor 11 or 12, the replacement entry generation circuit 65 selects a compartment to be registered from cache compartments having no history of occurrence of a failure. As a result, new registration in a compartment having a failure history can be suppressed, and a cache line in which a failure has actually occurred can be prevented from being overwritten and reused.
[0058]
Referring to FIG. 4, the shared cache control unit 61 is connected to the processors 11, 12 and the system bus 31, and has a control function in the shared cache unit 21a.
[0059]
The failure history register 62 is connected to the shared cache control unit 61 and the replacement entry generation circuit 65, and stores, for each cache compartment, the number of times a failure has occurred in the cache address array (AA) 66, which stores tags, and the cache data array (DA) 67, which stores data.
[0060]
The LRU memory 63 is a memory used for LRU (Least Recently Used) control of the cache, and holds 1-bit information for each line of the cache.
[0061]
The address generation unit 64 receives and holds the address used for cache indexing from the shared cache control unit 61, and outputs parts of that address as the cache index address and the key.
[0062]
The replacement entry generation circuit 65 is connected to the shared cache control unit 61, the failure history register 62, and the LRU memory 63. When data is registered in the cache, it determines the cache compartment in which the data is to be registered, and outputs to the cache address array (AA) 66 and the cache data array (DA) 67 an instruction to write to the determined compartment.
[0063]
The above-mentioned shared cache is a two-way set associative store-in cache, and the cache AA 66 and the cache DA 67 each constitute a memory array including two compartments.
[0064]
The address comparison circuit 69 receives a signal from the cache AA 66 and the address generation unit 64, performs a cache hit determination, and outputs a determination result. At this time, the result of the hit determination is not affected by the failure history information stored in the failure history register 62. Even in the compartment where the failure history remains, if the keys match, the determination result is output as a cache hit.
[0065]
The data selector 70 receives the output of the address comparison circuit 69, and selects and outputs one of the data items read from the cache DA 67.
[0066]
The failure detection circuit 68 performs a parity check on the values read from the cache AA 66 and the cache DA 67, and outputs the compartment number to the shared cache control unit 61 when an error is detected.
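The parity check can be illustrated with a short sketch. The specification does not say whether even or odd parity is used or how wide the protected words are, so even parity, the function names, and the (value, parity bit) pair representation are all assumptions.

```python
def even_parity_bit(value):
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(value).count("1") & 1


def failed_compartments(entries):
    """Return the compartment numbers whose stored parity bit does not match.

    entries[i] is a (value, stored_parity) pair read from compartment i.
    """
    return [
        comp
        for comp, (value, stored_parity) in enumerate(entries)
        if even_parity_bit(value) != stored_parity
    ]


# Compartment 0 is consistent; compartment 1 has a wrong parity bit.
print(failed_compartments([(0b1011, 1), (0b1011, 0)]))
```

A single parity bit only detects the error, which is why the detected failure is treated as uncorrectable in this embodiment.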
[0067]
FIG. 5 is a configuration diagram showing an example of the failure history register 62 and the replacement entry generation circuit 65.
[0068]
In the figure, the failure history register 62 is composed of a register group consisting of two words of 2-bit-wide registers. Here, the number of words is equal to the number of compartments in the cache, and the bit width must be large enough to represent the number of failures to be managed. These registers hold the failure history information for each cache compartment set by the shared cache control unit 61. All of the stored information is distributed to the shared cache control unit 61 and the replacement entry generation circuit 65.
[0069]
Further, the replacement entry generation circuit 65 is realized by a simple combinational circuit as shown in the figure. FIG. 6 shows the truth table of this combinational circuit.
[0070]
Next, the operation of the above-described shared cache memory will be described.
[0071]
(2-1) Initialization of shared cache memory
When the shared cache memory is initialized at system startup, the failure history register 62 is initialized in addition to the clearing of the cache AA 66, the cache DA 67, and the LRU memory 63 that is performed when an ordinary cache memory is initialized.
[0072]
As shown in FIG. 5, the failure history register 62 includes a register that stores failure history information for each compartment of the shared cache memory, and this history information is rewritten by the shared cache control unit 61 when a failure occurs during system operation. The values shown in the figure were chosen to illustrate one possible state during system operation.
[0073]
(2-2) Reading from the cache memory
Next, the operation of reading the shared cache memory from the processors 11 and 12 will be described. First, a read request from the processor is sent to the shared cache control unit 61, and the shared cache control unit 61 starts reading the shared cache memory.
[0074]
That is, the read address is sent from the shared cache control unit 61 to the address generation unit 64, the index addresses of the cache AA 66 and the cache DA 67 are obtained, each memory array is indexed, and the index key is passed to the address comparison circuit 69.
[0075]
The address comparison circuit 69 compares the cache AA 66 with the index key obtained from the address generation unit 64, and determines a hit in the cache memory. The data selector 70 obtains the hit determination result, selects one of the data output from the cache DA 67, and transfers it to the shared cache control unit 61.
[0076]
At this time, if a cache hit occurs, the shared cache control unit sends the fetched data to the processor to complete a series of operations. However, if a cache miss occurs, data is further fetched from the main memory 41.
[0077]
First, the shared cache control unit 61 that has received the notification of the cache mishit from the address comparison circuit 69 sends a request for reading the requested address to the system bus 31.
[0078]
Subsequently, the cache compartment to be replaced is determined according to a predetermined logic based on the LRU information stored in the LRU memory 63. In the replacement entry generation circuit 65, this information is masked by the failure history information obtained from the failure history register 62, so that the compartment number for the new registration is obtained from among the compartments having no failure history.
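The following sketch models this selection. The truth table of FIG. 6 is not reproduced in the text, so the fallback order and the handling of the case where every compartment has a failure history (returning None) are assumptions, as are the function names.

```python
def failure_flags(history_words):
    """OR the bits of each history word: 1 if the compartment has any failure history."""
    return [1 if word != 0 else 0 for word in history_words]


def select_victim_without_history(lru_candidate, history_words):
    """Pick a replacement compartment among those with no failure history.

    lru_candidate is the compartment suggested by the LRU memory;
    history_words[i] is the failure count held for compartment i.
    """
    flags = failure_flags(history_words)
    if not flags[lru_candidate]:
        return lru_candidate
    for comp, flagged in enumerate(flags):
        if not flagged:
            return comp
    return None  # assumed: no eligible compartment remains


# The LRU memory suggests compartment 0, but it has one recorded failure,
# so the new line is registered in compartment 1 instead.
print(select_victim_without_history(0, [0b01, 0b00]))
```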
[0079]
The shared cache control unit 61 reads the data of that compartment from the cache AA 66 and the cache DA 67 and, if the data has been updated (registered in the cache as a “dirty” block), completes the replacement operation by writing the data back to the main memory 41 via the system bus 31.
[0080]
Subsequently, the main memory 41 or another shared cache memory returns a response to the read request sent to the system bus 31. The shared cache control unit 61 then writes the requested address and data to the previously determined compartment of the cache AA 66 and the cache DA 67 and, at the same time, sends the data to the requesting processor, completing the series of operations.
[0081]
(2-3) Writing to cache memory
Next, the write operation from the processors 11 and 12 to the shared cache memory will be described. First, for a write request from the processor, the shared cache memory is indexed in the same way as for a read.
[0082]
If a cache hit occurs, the shared cache control unit 61 writes the requested address to the corresponding entry of the cache AA 66 and the cache DA 67, updates the LRU memory 63, and completes the write operation.
[0083]
In the case of a cache miss, as in the case of a read miss, first, data is registered from the main memory 41 or another shared cache memory to the shared cache memory, and a cache replacement operation is performed as necessary. As in the case of a cache hit, writing to the shared cache memory is performed, and the operation is completed.
[0084]
(2-4) Bus snoop operation
In addition to the cache read and write operations described above, the configuration of the shared cache memory described above requires cache indexing and data sweep-out / update operations triggered by bus snooping. The individual indexing, sweep-out, and update operations are the same as those already described, so their description is not repeated here.
[0085]
(2-5) Operation when a cache failure occurs
When the failure detection circuit 68 detects a failure during a read request from the processor 11 or 12, the shared cache control unit 61 sends the read data to the requesting processor marked as corrupted (POISONED) data. The processor that receives the corrupted data determines that processing cannot be continued and starts processing such as shutting down the partition that includes the processor. This processing is outside the scope of the present invention and is not described in detail here.
[0086]
Subsequently, the shared cache control unit 61 reads the value of the failure history register 62, adds 1 to the value of the word corresponding to the compartment in which the failure occurred, and writes it back to the failure history register 62, so that data is no longer newly registered in that compartment by subsequent cache replacement operations and the like.
[0087]
At this time, if the value of the failure history information that was read is a prescribed value (here, “11”), it is determined that failures in the same compartment have been repeated a prescribed number of times or more. The operation of the shared cache unit 21 is then suspended (HOLD), an external diagnostic processor (not shown) is notified that failures have occurred repeatedly, and the subsequent failure processing is entrusted to it.
[0088]
Upon receiving the notification, the diagnostic processor performs a failure process such as bringing down a system including all partitions.
[0089]
If the value of the read failure history information does not exceed the specified value, the shared cache control unit 61 further flushes the failed line of the cache AA 66 and discards the destroyed data.
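The update of the failure history described in paragraphs [0086] to [0089] can be sketched as follows. The action labels, the function name, and the choice not to write the counter back when the prescribed value has already been reached are assumptions made for illustration.

```python
HOLD_THRESHOLD = 0b11  # prescribed value "11" for a 2-bit history word


def record_failure(history_words, failed_compartment):
    """Update the failure history and decide how the failure is handled.

    Returns (action, updated history words), where action is
    "HOLD" when the prescribed value was already reached, or
    "FLUSH_LINE" when the counter is incremented and the failed line is flushed.
    """
    current = history_words[failed_compartment]
    if current == HOLD_THRESHOLD:
        # Repeated failures in the same compartment: hold the unit and
        # leave the rest to the diagnostic processor (counter left unchanged here).
        return "HOLD", list(history_words)
    updated = list(history_words)
    updated[failed_compartment] = current + 1
    return "FLUSH_LINE", updated


# Third failure in compartment 1 brings its counter to the prescribed value.
print(record_failure([0b00, 0b10], 1))
```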
[0090]
Further, when a failure is detected during a replacement operation at the time of a cache miss or during a read operation accompanying a write, the failure cannot be notified to the processor by the same means as in the read operation described above. In this case, an attempt is made to identify the corresponding partition from the address used for the cache access.
[0091]
When a plurality of partitions are operated simultaneously, as with the shared cache memory described above, a main memory address range is assigned to each partition, and main memory accesses from the processors are performed with an offset value added for each partition. The addition of the offset value is assumed to be performed inside the processor, which allows the software of each partition to run without being aware of the existence of other partitions. Therefore, if the address at the time the failure was detected is known, the corresponding partition can be identified.
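The identification of a partition from the failing address can be sketched as a range lookup. The partition names, base addresses, and sizes below are hypothetical values chosen for illustration; the specification only states that each partition is assigned its own main memory addresses.

```python
def find_partition(address, partition_ranges):
    """Return the partition whose main memory range contains the address.

    partition_ranges maps a partition name to (base address, size in bytes).
    Returns None when the address belongs to no known partition.
    """
    for name, (base, size) in partition_ranges.items():
        if base <= address < base + size:
            return name
    return None


# Hypothetical layout: two partitions of 1 GiB each.
PARTITIONS = {
    "partition0": (0x0000_0000, 0x4000_0000),
    "partition1": (0x4000_0000, 0x4000_0000),
}

print(find_partition(0x4000_1000, PARTITIONS))
print(find_partition(0x9000_0000, PARTITIONS))
```

The None result corresponds to the case in which the partition cannot be identified and the whole system has to be treated as affected.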
[0092]
When the partition is successfully specified, the shared cache control unit notifies one of the processors belonging to the partition of the failure using a means such as an interrupt. Upon receiving the notification, the processor determines that the continuation of the processing is impossible and starts processing such as bringing down the system of the partition.
[0093]
If the partition cannot be identified, the shared cache control unit 61 determines that the operation of the entire system including all partitions cannot be continued, suspends the operation of the shared cache unit 21 (HOLD), notifies an external diagnostic processor that a fatal failure has occurred, and entrusts the subsequent failure processing to it.
[0094]
Further, when the partition is successfully identified, the failure history register 62 is updated and the cache line is flushed in the same way as for the read described above, and the series of failure processing operations is completed.
[0095]
As described above, once the notification to the processor, the update of the failure history register, and the flushing of the cache line have been completed after a cache memory failure, the shared cache unit 21 immediately resumes operation and can execute subsequent memory access requests and bus snoop operations. Therefore, partitions to which the failure does not spread can continue operating without being affected by the failure.
[0096]
Even after registration in the failed compartment has been suppressed by the failure history, lines already registered in that compartment other than the failed line (healthy data) can still be read and written. Compared with the compartment degeneration method generally used when a cache failure occurs, the present invention does not require the cache flush that accompanies the degeneration operation, and thus simplifies the failure processing procedure.
[0097]
In the above-described embodiments, the shared cache memory is configured as a two-way set associative cache, but it can also be configured with more compartments. For example, in a 4-way set associative configuration, the LRU memory needs to hold 6 bits of information for each line of the cache, and the replacement entry generation circuit becomes more complicated, but such a configuration is feasible.
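The figures of 1 bit per line for two ways and 6 bits per line for four ways are consistent with an LRU encoding that keeps one ordering bit for every pair of compartments. The specification does not state which encoding is used, so the pairwise scheme and the function name below are assumptions.

```python
def pairwise_lru_bits(ways):
    """Number of LRU bits per line when one bit records the order of each pair of ways."""
    return ways * (ways - 1) // 2


for ways in (2, 4):
    print(ways, "ways ->", pairwise_lru_bits(ways), "bit(s) per line")
```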
[0098]
Further, in the above description the failure detection circuit is a parity check circuit, but failure detection and correction using an ECC (error correcting code), as is common today in computer systems for mission-critical applications, is also possible.
[0099]
Further, the present invention can be realized without depending on the number of processors and buses connected to the shared cache.
[0100]
[Effects of the Invention]
As described above in detail, according to the present invention, when an uncorrectable failure is detected in a shared cache memory, it is possible to easily specify a logical partition using the failed cache line. Therefore, it is possible to cause only the partition concerned to perform closed failure processing, for example, system down, without spreading the failure to the operation of another logical partition.
[0101]
Furthermore, the cache memory can continue to operate even after a failure has occurred, and can continue to operate the system without affecting other partitions except for the partition affected by the failure.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a configuration diagram showing a main part of FIG. 1.
FIG. 3 is an explanatory diagram showing the truth table of the replacement entry generation circuit 55.
FIG. 4 is a block diagram showing a second embodiment of the present invention.
FIG. 5 is a configuration diagram showing a main part of FIG. 4.
FIG. 6 is an explanatory diagram showing the truth table of the replacement entry generation circuit 65.
FIG. 7 is an explanatory diagram showing a conventional example.
[Explanation of symbols]
11, 12, 13, 14 Processor
21, 21a, 22 Shared cache unit
31 System bus
41 Main memory
51, 61 Shared cache control unit
52 Compartment instruction circuit
53, 63 LRU memory
54, 64 Address generation unit
55, 65 Replacement entry generation circuit
56, 66 Cache AA
57, 67 Cache DA
58, 68 Failure detection circuit
59, 69 Address comparison circuit
60, 70 Data selector
62 Failure history register

Claims

1. A shared cache memory failure processing method for a shared cache memory having a plurality of compartments and shared by a plurality of processors, wherein a usable compartment is allocated in advance to each processor and, when the shared cache memory is shared by processors belonging to any of a plurality of partitions and an uncorrectable failure is detected during cache indexing, the partition affected by the failure is identified and the failure is processed.

2. A shared cache memory failure processing method for a shared cache memory having a plurality of compartments and shared by a plurality of processors, comprising: a compartment instruction circuit that holds configuration information indicating which compartment is assigned to each processor; an address comparison circuit that, when the cache is indexed in response to a request from an arbitrary processor, suppresses the hit determination for compartments not assigned to the requesting processor; a replacement entry generation circuit that, when data is registered in the cache, selects a compartment assigned to the requesting processor according to the configuration information held by the compartment instruction circuit; a failure detection circuit that notifies the compartment number when an uncorrectable failure is detected during cache indexing; and a shared cache control unit that, based on the notification from the failure detection circuit, identifies the partition affected by the failure, updates the configuration information, and notifies the compartment instruction circuit of the updated information to perform failure processing.

3. The shared cache memory failure processing method according to claim 2, wherein the compartment instruction circuit includes a register group having a bit width equal to the number of compartments of the cache and a number of words equal to the number of processors connected to the cache, obtains the configuration information from the shared cache control unit and holds it, further obtains a processor ID signal from the shared cache control unit to generate compartment allocation information, and notifies the address comparison circuit and the replacement entry generation circuit of that information.

4. A shared cache memory failure processing method for a shared cache memory having a plurality of compartments and shared by a plurality of processors, comprising: a failure history register that holds, for each compartment, the number of failures that have occurred in the cache; a replacement entry instruction circuit that, when data is registered in the cache, determines the compartment in which the data is to be registered by referring to the information in the failure history register and instructs the cache accordingly; a failure detection circuit that notifies a shared cache control unit of the compartment number when an uncorrectable failure is detected during cache indexing; and the shared cache control unit, which, when data is registered because of a miss on a cache access requested by a processor, instructs the replacement entry instruction circuit to refer to the information in the failure history register and select a compartment having no failure history.

5. The shared cache memory failure processing method according to claim 4, wherein the failure history register includes a register group having a number of words equal to the number of compartments of the cache and a bit width capable of holding the number of failures to be managed, stores the failure information sent by the shared cache control unit as failure history information for each compartment, and sends that information to the replacement entry generation circuit and the shared cache control unit.

6. The shared cache memory failure processing method according to claim 2, wherein the shared cache control unit notifies a diagnostic processor of the failure information received from the failure detection circuit, and executes the failure processing based on the diagnosis result of the diagnostic processor.