JPH10105527A

JPH10105527A - Multiprocessor computer and fault recovery method therefor

Info

Publication number: JPH10105527A
Application number: JP8261642A
Authority: JP
Inventors: Toshihisa Kamemaru; 敏久亀丸
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-10-02
Filing date: 1996-10-02
Publication date: 1998-04-24
Anticipated expiration: 2016-10-02
Also published as: JP2968484B2

Abstract

PROBLEM TO BE SOLVED: To provide a multiprocessor computer that improves reliability and availability without multiplexing components such as processors. SOLUTION: A CPU bridge 12 has a management table 32 that holds, for each of processors 10 and 11, a data output flag and an other-processor reference flag. The data output flag is cleared at process switching and set when modified data owned by that processor in an external cache 14 is reflected outside the CPU board 2. The other-processor reference flag is cleared at process switching and set when modified data owned by that processor in the external cache 14 is accessed by another processor on the CPU board 2. When a fault occurs in a processor and both of its flags are clear, it is judged that the modified data owned by the failed processor has not been reflected elsewhere, and that processor is disconnected without bringing the system down.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

Technical Field of the Invention

The present invention relates to a multiprocessor computer, and more particularly to improving reliability and availability in a computer whose components are not multiplexed.

[0002]

Description of the Related Art

A fault-tolerant computer is a computer in which a hardware or software fault in one part of the system does not disrupt the operation of the system as a whole. Fault-tolerant computers are useful for building highly reliable, fault-resistant systems, and they provide a particularly desirable system form for the core parts of a computer system.

[0003] The high reliability and high availability of conventional fault-tolerant computers are achieved by multiplexing hardware components. Each component is provided as an online replaceable module, a unit that can be disconnected while the system is operating. When a fault occurs in the computer, the system detects the fault location and isolates it at the component level. The fault can then be repaired while the other components keep the system running. In theory, therefore, a fault can always be repaired without bringing the system down. The processors mounted in a fault-tolerant computer are naturally also multiplexed for high reliability. They use either a so-called pair-and-spare configuration, which has two sets of two processors each, or a configuration in which three or more processors are provided and a majority vote is taken. In both cases execution proceeds while the results output by the processors are compared.

[0004] A server system built from ordinary personal computers (PCs) may also have multiple processors, as a fault-tolerant computer does, but in a server system the multiple processors are provided to improve performance. In other words, the server system does not take a multiplexed form like a fault-tolerant computer, so if any fault occurs in the server system, the system has to be stopped regardless of whether the fault location can be detected.

[0005]

Problems to Be Solved by the Invention

To pursue high reliability and high availability in the PC server system described above, one could build it so that its processors are multiplexed in the same way as in a fault-tolerant computer.

[0006] However, multiplexing the processors requires many parts, not only for the processors themselves but also for their peripheral circuits, so the circuit scale grows. This not only enlarges the CPU board on which the processors are mounted but also raises the price and increases power consumption. In addition, multiplexing the processors requires comparison circuits for majority voting and the like, so the processor frequency cannot be raised and performance cannot be improved.

[0007] The present invention has been made to solve the above problems, and its object is to provide a multiprocessor computer that improves reliability and availability without multiplexing components such as processors.

[0008]

Means for Solving the Problems

To achieve the above object, a multiprocessor computer according to the present invention is a multiprocessor computer having components formed in units that can be disconnected from the computer, and comprises: fault detection means for detecting a fault that occurs during system operation and identifying the fault location; isolation determination means for determining, based on a predetermined disconnection condition, whether the fault location can be isolated without bringing the system down; isolation execution means for disconnecting the fault location during system operation; and restart means for stopping and restarting the system. When the isolation determination means determines that isolation is possible, the fault location is disconnected without bringing the system down; when it determines that isolation is not possible, the system is restarted. According to the present invention, reliability and availability can thereby be improved to some extent without multiplexing components.

[0009] Further, the invention applies to a multiprocessor computer that has a main memory and at least one CPU board having a self-fault-detection function, the CPU board carrying at least one processor having an internal cache memory, an external cache memory that includes the contents of the internal cache memory line by line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them, wherein the internal cache memory and the external cache memory hold data together with status information indicating whether the data has been modified. This multiprocessor computer comprises: isolation determination means for determining, based on a predetermined disconnection condition, whether the fault location can be isolated without bringing the system down; isolation execution means for disconnecting the fault location during system operation; and restart means for stopping and restarting the system. When the isolation determination means determines that isolation is possible, the fault location is disconnected without bringing the system down; when it determines that isolation is not possible, the system is restarted. According to the present invention, by configuring the processor as the component described above, a processor in which a fault occurs can be isolated on a per-processor basis.

[0010] Further, the external cache memory holds identification information of the processor that owns each line in which modified data is stored. The isolation determination means determines whether the influence of the processor in which the fault occurred has propagated elsewhere. When the isolation determination means determines that the influence has not propagated, the isolation execution means uses the processor identification information to find all the data modified by the failed processor that is held in the CPU board carrying the failed processor, invalidates all of that data, and then disconnects the failed processor.
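
The invalidation step in [0010] can be pictured with a minimal sketch. The `CacheLine` fields and the `invalidate_owned_by` helper are hypothetical names chosen for illustration and do not appear in the publication; the sketch only assumes that each external cache line records a modified status and an owner identifier, as the paragraph states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheLine:
    """One external cache line (hypothetical model, not the patent's circuit)."""
    address: int
    valid: bool = True
    modified: bool = False
    owner: Optional[int] = None  # id of the processor that modified the line


def invalidate_owned_by(lines: List[CacheLine], failed_id: int) -> int:
    """Invalidate every modified line owned by the failed processor.

    Returns the number of lines invalidated. Lines owned by other
    processors and unmodified lines are left untouched.
    """
    count = 0
    for line in lines:
        if line.valid and line.modified and line.owner == failed_id:
            line.valid = False
            line.modified = False
            line.owner = None
            count += 1
    return count
```

In this sketch the owner identifier is what lets the bridge discard only the failed processor's modified data while keeping the rest of the cache contents.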

[0011] Further, the processor bridge has means for accepting, for each processor, a request to start write-back processing of the modified data in that processor's internal cache memory, and means for issuing, to the processor targeted by the accepted start request, a write-back request that causes it to write the modified data back to the external cache memory. At a process switch, the modified data held in the internal cache memory of the targeted processor is written back to the main memory.

[0012] Further, the processor bridge holds external reference flag information assigned to each processor on the same CPU board. The external reference flag information for each processor is cleared at a process switch and set when modified data owned by that processor, held only inside the CPU board, is reflected outside that CPU board. Because modified data is written back at each process switch in this way, if a fault occurs during execution of the next process, the contents of the external cache memory and the internal cache memory can be restored to their state before that process started.

[0013] Further, when the external reference flag information corresponding to the failed processor is clear, the isolation determination means determines that the influence of the failed processor has not propagated elsewhere.

[0014] Further, the external cache memory holds identification information of the processor that owns each line in which modified data is stored, and the processor bridge holds other-processor reference flag information assigned to each processor on the same CPU board. The other-processor reference flag information for each processor is cleared at a process switch and set when a processor on the same CPU board other than the owner accesses a line of the external cache memory in which modified data owned by that processor is stored.

[0015] Further, when the other-processor reference flag information corresponding to the failed processor is clear, the isolation determination means determines that the influence of the failed processor has not propagated elsewhere.
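
The flag handling described in [0012] to [0015] can be sketched as a small table keyed by processor. This is an illustrative model only: the class and method names are hypothetical, and combining the two flags with a logical AND follows the abstract ("when both the flags are cleared") rather than any circuit detail given here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProcessorFlags:
    """Per-processor flags (hypothetical model of one table entry)."""
    external_reference: bool = False         # modified data left the CPU board
    other_processor_reference: bool = False  # another on-board CPU touched a modified line


@dataclass
class ManagementTable:
    flags: Dict[int, ProcessorFlags] = field(default_factory=dict)

    def _entry(self, cpu_id: int) -> ProcessorFlags:
        return self.flags.setdefault(cpu_id, ProcessorFlags())

    def on_process_switch(self, cpu_id: int) -> None:
        # Both flags are cleared at a process switch.
        self.flags[cpu_id] = ProcessorFlags()

    def on_external_reflect(self, owner_id: int) -> None:
        # Modified data owned by owner_id was reflected outside the board.
        self._entry(owner_id).external_reference = True

    def on_line_access(self, owner_id: int, accessor_id: int) -> None:
        # Only an access by a processor other than the owner sets the flag.
        if accessor_id != owner_id:
            self._entry(owner_id).other_processor_reference = True

    def can_isolate(self, failed_id: int) -> bool:
        entry = self._entry(failed_id)
        return not (entry.external_reference or entry.other_processor_reference)
```

If `can_isolate` returns `False`, the failed processor's results may already have been consumed elsewhere, which corresponds to the case where the system has to be stopped and restarted.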

[0016] Further, on receiving a fault detection signal issued from the failed processor by its self-fault-detection function, the isolation execution means disconnects the failed processor from the processor bridge.

[0017] Further, the isolation execution means holds failure flag information indicating that the failed processor has been disconnected from the processor bridge.

[0018] Further, the processor bridge includes the isolation determination means. This allows the processor bridge to determine whether a processor can be isolated.

[0019] Further, the processor bridge includes the isolation execution means. This allows the processor bridge to disconnect a processor.

[0020] Further, the external cache memory has a structure that can be divided into ways. The isolation determination means determines whether the influence of a way in which a fault has occurred has propagated outside the CPU board carrying the external cache memory. When the isolation determination means determines that the influence of the faulty way has not propagated outside the CPU board, the isolation execution means invalidates all the modified data stored in the external cache memory and then degrades the faulty way, that is, takes it out of use. According to the present invention, if the external cache memory can be disconnected way by way, configuring each way as the component described above allows a faulty way to be isolated on a per-way basis.

[0021] Further, the processor bridge holds external reference flag information allocated to each way on the same CPU board. The external reference flag information for each way is cleared at a process switch and set when the modified data held in that way is reflected outside the CPU board. The isolation determination means determines, from the content of the external reference flag information corresponding to the faulty way, whether the modified data stored in the faulty way has been reflected externally. That is, according to the present invention, when the external reference flag information is clear, for example, it is determined that the influence of the faulty way has not propagated elsewhere, and the faulty way can be degraded without bringing the system down. When the external reference flag information is set, on the other hand, it is determined that the influence of the faulty way has propagated elsewhere, and the system is stopped and restarted.

[0022] Further, a fault recovery method for a multiprocessor computer according to the present invention is a method for a multiprocessor computer having components formed in units that can be disconnected from the computer, and includes: a fault detection step of detecting a fault that occurs during system operation and identifying the fault location; an isolation determination step of determining, based on a predetermined disconnection condition, whether the fault location can be isolated without bringing the system down; an isolation execution step of disconnecting the fault location without bringing the system down when the isolation determination step determines that isolation is possible; and a restart step of restarting the system when the isolation determination step determines that isolation is not possible. According to the present invention, reliability and availability can thereby be improved to some extent without multiplexing components.

[0023] Further, the method applies to a multiprocessor computer that has a main memory and at least one CPU board having a self-fault-detection function, the CPU board carrying at least one processor having an internal cache memory, an external cache memory that includes the contents of the internal cache memory line by line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them, wherein the internal cache memory and the external cache memory hold data together with status information indicating whether the data has been modified. The method includes: an isolation determination step of determining whether the influence of a processor in which a fault occurred during system operation has propagated elsewhere; an isolation execution step of, when the isolation determination step determines that the influence has not propagated, invalidating all the data modified by the failed processor that is held in the CPU board carrying the failed processor, without bringing the system down, and then disconnecting the failed processor; and a restart step of stopping and restarting the system when the isolation determination step determines that the influence has propagated. According to the present invention, by configuring the processor as the component described above, a processor in which a fault occurs can be isolated on a per-processor basis, so the system can be recovered after the failed processor is disconnected.

[0024] Furthermore, the method applies to a multiprocessor computer that has a main memory and at least one CPU board having a self-fault-detection function, the CPU board carrying at least one processor having an internal cache memory, an external cache memory that has a plurality of ways and includes the contents of the internal cache memory line by line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them, wherein the internal cache memory and the external cache memory hold data together with status information indicating whether the data has been modified. The method includes: an isolation determination step of determining whether the modified data stored in a way in which a fault occurred during system operation has propagated outside the CPU board carrying the external cache memory; an isolation execution step of, when the isolation determination step determines that the data has not propagated outside the CPU board, invalidating all the modified data stored in the external cache memory, without bringing the system down, and then degrading the faulty way; and a restart step of stopping and restarting the system when the isolation determination step determines that the data has propagated outside the CPU board. According to the present invention, if the external cache memory can be disconnected way by way, configuring each way as the component described above allows a faulty way to be isolated on a per-way basis, so the system can be recovered after the faulty way is degraded.

[0025]

Description of the Embodiments

Preferred embodiments of the present invention are described below with reference to the drawings.

[0026] Embodiment 1. FIG. 1 shows an embodiment of a multiprocessor computer according to the present invention. In this embodiment, a PC server system is described as an example of the multiprocessor computer. The server system has a configuration in which two CPU boards 2 and 20, an I/O bridge 4, and a main storage device (main memory) 6 are connected to a system bus 1. The CPU boards have the same configuration, and each contains processors 10 and 11, a CPU bridge 12, and an external cache 14. An external storage device 8 such as a disk device is connected to the I/O bridge 4 via an I/O controller 7. The CPU board 2 in this embodiment can self-detect a fault that occurs inside it and identify the fault location. The server system also has a fault detection function that identifies an external storage device in which a fault has occurred. Some fault detection functions are implemented in hardware and others in software.

[0027] Next, the basic operation of this embodiment when a fault occurs is described with reference to the flowchart shown in FIG. 2.

[0028] When a fault occurs in the server system (step 100), the fault location is detected (step 200). If the fault location cannot be detected, or if it can be detected but the data handled internally has become inconsistent, operation cannot continue as is, so the system is stopped (step 500). Data becomes inconsistent when data integrity can no longer be maintained, for example when another processor has continued computing with a result produced by the failed processor. In such a case the processing must be rolled back and redone, so the system has to be brought down.

[0029] When the fault location has been detected, it is determined whether that location can be isolated (step 300). Isolatable means that the component containing the fault location can be disconnected while the system keeps running, without bringing it down and without causing inconsistency in the internal data. If isolation is possible, the fault location is disconnected from the system in units of disconnectable components (step 400). As a result, if the fault location is an input/output device, it can be made inaccessible to the rest of the system, and if the fault location is a processor, it can be prevented from performing any processing.

[0030] If the fault location can be detected but is determined not to be isolatable, the system cannot continue operating, so it is stopped explicitly. If the fault is a hardware fault and the component is removable, the component is replaced or otherwise serviced, and the system is then restarted (step 600).
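
The decision flow of FIG. 2 as described in [0028] to [0030] can be summarized in a short sketch. The function name, its boolean parameters, and the `Action` enum are hypothetical and introduced only for illustration; the step numbers are the ones used in the text.

```python
from enum import Enum


class Action(Enum):
    ISOLATE_AND_CONTINUE = "step 400"
    STOP_SYSTEM = "step 500"
    STOP_AND_RESTART = "step 600"


def handle_fault(location_found: bool, data_consistent: bool, isolatable: bool) -> Action:
    """Choose the recovery action after a fault has occurred (step 100)."""
    # Step 200: detection. An unknown location or inconsistent internal
    # data means operation cannot continue.
    if not location_found or not data_consistent:
        return Action.STOP_SYSTEM
    # Step 300: isolation determination.
    if isolatable:
        return Action.ISOLATE_AND_CONTINUE
    # Not isolatable: stop explicitly, service the component, restart.
    return Action.STOP_AND_RESTART
```

Only the first branch of step 300 keeps the system running; both other outcomes take the system down, which is why the determination is made before any isolation is attempted.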

[0031] As described above, according to this embodiment, when a fault occurs and the consistency of the system's internal state can be maintained, the fault location is isolated without bringing the system down and the system continues to operate. When the consistency of the internal state cannot be maintained, the system is brought down once, the fault location is isolated, and the system is then started again. Because the need to bring the system down is determined before the fault location is isolated, the system is never brought down unnecessarily. As a result, the reliability and availability of the system can be improved without multiplexing components.

[0032] A system can go down not only because of hardware failures but also because of bugs in software (the OS, middleware, application programs, and so on) or operator errors. System downs caused by software and similar factors cannot be avoided even in a fault-tolerant computer. If systems go down more often because of hardware failures than because of software and similar causes, using a fault-tolerant computer is very useful; but if the hardware failure rate is very small, a fault-tolerant computer cannot fully deliver its benefit. For example, let X be the detection rate of hardware faults, Y be the isolatable rate, that is, the rate at which detected hardware can be isolated, and λ be the failure rate of a hardware part. The detection rate X corresponds to the probability of moving from step 200 to step 300 in FIG. 2. The isolatable rate Y corresponds to the probability of moving from step 300 to step 400 in FIG. 2. The probability Z that processing starting at step 100 in FIG. 2 ends anywhere other than step 400, in other words that it moves to step 500 or step 600, is

Z = Σ((1 − X * Y) * λ)

This probability Z is the probability that the system cannot continue operating and is forced to go down. Therefore, for a system in which Z is very small and satisfies Z < (down rate due to software causes, for example (software bug rate) + (operation error rate)), this embodiment is very useful even without multiplexing components.
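
As a worked illustration of the probability Z in [0032], the sketch below evaluates the sum and the comparison against the software-related down rate. It assumes that the Σ runs over individual hardware parts, each with its own X, Y and λ; the text does not state the summation index explicitly. The function names and the numeric values are hypothetical.

```python
from typing import Iterable, Tuple


def forced_down_rate(parts: Iterable[Tuple[float, float, float]]) -> float:
    """Z = sum over parts of (1 - X * Y) * lam.

    Each part is a triple (X, Y, lam): detection rate, isolatable rate,
    and failure rate of that part (assumed per-part values).
    """
    return sum((1.0 - x * y) * lam for x, y, lam in parts)


def multiplexing_unnecessary(parts, software_bug_rate: float,
                             operation_error_rate: float) -> bool:
    """True when Z is below the down rate attributed to software causes."""
    return forced_down_rate(parts) < software_bug_rate + operation_error_rate


# Hypothetical numbers, chosen only to exercise the formula.
example_parts = [(0.9, 0.8, 1e-5), (1.0, 1.0, 2e-5)]
z = forced_down_rate(example_parts)
```

A part with X = Y = 1 contributes nothing to Z, which matches the reading that every fault that is both detected and isolatable ends at step 400.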

[0033] The server system configuration shown in FIG. 1 is an example; providing other components or three or more CPU boards is within the scope of the present invention. Although a server system has been used as the example above, the invention is of course also applicable to other systems that have a multiprocessor computer.

[0034] Embodiment 2. Embodiment 2 describes, as an example, the case in which the disconnectable component in the basic operation shown in Embodiment 1 is a processor.

[0035] Normally, in a non-multiplexed multiprocessor computer, if even one processor fails, the data consistency of the whole system is lost and the system has to go down. Because the failure rate of the system rises in proportion to the number of processors, a computer with many processors is less reliable for that reason alone. Multiplexing the processors is conceivable, but as described above this approach has significant drawbacks in cost and other respects.

[0036] This embodiment therefore shows a specific example in which the above object is achieved by disconnecting a processor on the CPU board in which a fault has occurred.

[0037] Configuration of this Embodiment

The system configuration of this embodiment is the same as in FIG. 1. FIG. 3 is a detailed configuration diagram of the CPU board 2 shown in FIG. 1 for this embodiment. The CPU board 2 carries processors 10 and 11, which have internal caches 18 and 19 respectively; a CPU bridge 12 to which all the processors 10 and 11 are connected; an external cache 14 that is connected to the CPU bridge 12 and includes the contents of the internal caches 18 and 19; and a CPU bus 16 that connects all the processors 10 and 11 to the CPU bridge 12.

【００３８】The processors 10 and 11 used in the present embodiment are assumed to be current commercial products such as the Pentium processor made by Intel. That is, each of the processors 10 and 11 has a built-in cache memory and a self-failure detection function with relatively high coverage. The CPU bus 16 has error detection and correction means such as parity/ECC on its data lines and means for detecting errors by protocol checking on its control lines, so failures occurring on the CPU bus 16 can be detected. The CPU board 2 thus has a self-failure detection function capable of detecting failures that occur in each of the components mounted on it.

【００３９】The CPU bridge 12 has a CPU bus request issuing circuit 24, which issues requests to the processors 10 and 11 via the CPU bus 16; a CPU bus receiving circuit 38, which receives data and the like from the processors 10 and 11; a system bus request issuing circuit 30, which issues requests to the system bus 1; a system bus receiving circuit 40, which receives data and the like from other components via the system bus 1; a cache control circuit 28, which manages the addresses of the external cache 14; and an error detection unit 36, which among other things detects errors occurring in the external cache 14. The CPU bridge 12 thereby controls and manages the processors 10 and 11 and the external cache 14. Its remaining components are described later.

【００４０】FIG. 4 shows an example of the settings of the management table 32 that the CPU bridge 12 holds in order to manage the processors 10 and 11 mounted on the same CPU board. For each processor, the management table 32 holds a processor number, which identifies the processor, together with a data output flag, an other processor reference flag, and a failure flag. The contents of each flag are described later.
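As a rough illustration only, the management table of FIG. 4 could be modeled in software as one record per processor. The class name, field names, and helper function below are hypothetical; the patent describes a hardware table inside the CPU bridge 12 and does not specify any such representation.

```python
from dataclasses import dataclass


@dataclass
class ProcessorEntry:
    """One row of the management table 32 (field names are illustrative)."""
    processor_number: int
    data_output: bool = False      # data output flag
    other_proc_ref: bool = False   # other processor reference flag
    failed: bool = False           # failure flag


def make_management_table(processor_numbers):
    """Build a table keyed by processor number with every flag cleared."""
    return {n: ProcessorEntry(n) for n in processor_numbers}


# Table for the two processors on CPU board 2.
table = make_management_table([10, 11])
```

Keying the table by processor number mirrors the per-processor lookup that the later failure handling relies on.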

【００４１】The external cache 14 consists of a plurality of ways, each containing a plurality of lines; it can be partitioned way by way, and each way can be set as usable or unusable. Data held in the internal caches 18 and 19 of the processors 10 and 11 is always also held in the external cache 14. The external cache 14 and the internal caches 18 and 19 hold each piece of data together with status information for that data. In the present embodiment the status information is held per line and takes the values Modify (modified), Exclusive, Shared, and Invalid. To provide the functions that characterize the present embodiment, however, it suffices that at least Modify exists. In the following description the status information is called the cache state, and a line of the internal caches 18 and 19 or of the external cache 14 that stores modified data is called a "Modify line".
FIG. 5 is a conceptual diagram of the external cache 14 in the present embodiment, showing only the structure of way 0 as a representative; the other ways have the same structure. For each cache line, each way stores, alongside the data, its index, tag address, cache state, and the Modify owner processor number, which is one of the features of the present embodiment. The Modify owner processor number holds the identification information, that is, the processor number, of the processor that owns a line whose cache state is Modify. In other words, the external cache 14 records which processor each Modify line belongs to.
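The per-line layout of FIG. 5 can be sketched as a small data structure. This is a minimal model under stated assumptions: the names `CacheState`, `CacheLine`, `modify_owner`, and `modify_lines_owned_by` are invented for illustration, and the sample tag and data values are arbitrary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CacheState(Enum):
    MODIFY = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


@dataclass
class CacheLine:
    """One line of a way: index, tag, state, owner, and data (names assumed)."""
    index: int
    tag: int = 0
    state: CacheState = CacheState.INVALID
    modify_owner: Optional[int] = None   # Modify owner processor number
    data: bytes = b""


def modify_lines_owned_by(way, processor_number):
    """Return the Modify lines in `way` whose owner is `processor_number`."""
    return [line for line in way
            if line.state is CacheState.MODIFY
            and line.modify_owner == processor_number]


# Way 0 with four lines; line 1 holds data modified by processor 10.
way0 = [CacheLine(index=i) for i in range(4)]
way0[1] = CacheLine(index=1, tag=0x1A, state=CacheState.MODIFY,
                    modify_owner=10, data=b"\x01")
way0[2] = CacheLine(index=2, tag=0x2B, state=CacheState.SHARED, data=b"\x02")
```

The owner field is only meaningful while the state is Modify, which is why the lookup checks both.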

【００４２】Depending on the number of processors, some CPU boards carry neither a CPU bridge nor an external cache. The present embodiment, however, uses a CPU board 2 equipped with the CPU bridge 12 and the external cache 14 so that multiprocessor computer systems of a certain scale can also be supported, and it assumes that the CPU board 2 can detect failures by itself without multiplexing the processors 10 and 11.

【００４３】Operation in normal times. Next, the specific operation of the present embodiment, which builds on the basic failure-time operation shown in Embodiment 1, is described. Before that, the processing performed during normal operation of the server system is explained, beginning with how the modified data of each line in the internal caches 18 and 19 is written back.

【００４４】In addition to the configuration described above, the CPU bridge 12 of the present embodiment has a port 22 for each processor as a means of accepting a request to start write-back of the modified data held in the internal caches 18 and 19. Unless otherwise noted, the following description takes the processing for the processor 10 as an example.

【００４５】At a process switch, if the internal cache 18 of the processor 10 to which a new process is about to be assigned holds data in lines whose cache state is Modify, the operating system (OS) writes that data back to the external cache 14. This is done as follows. At the process switch, the OS requests write-back by accessing the port 22 corresponding to the processor 10. In response to this access, the CPU bridge 12 issues an invalidation request to the processor 10 corresponding to that port 22 from the CPU bus request issuing circuit 24 via the write-back circuit 26. The invalidation request is a write-back request that causes the processor to write its modified data back to the external cache 14. As a result, the data of all Modify lines held by the processor 10 is written back to the external cache 14, and the cache state of each written-back line is updated from Modify to Invalid. Further, on an instruction from the OS, the CPU bridge 12 issues a write-back request from the system bus request issuing circuit 30 via the write-back circuit 26 and writes all the data that was written back to the external cache 14 back to the main memory 6. At this time the external cache control unit 28 updates the cache state of each written-back line from Modify to Invalid.
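The two-stage write-back at a process switch can be sketched as follows. This is a simplified software model, not the hardware mechanism: caches and memory are plain dictionaries keyed by address, the function name `flush_on_process_switch` is hypothetical, and resetting the owner field follows the initialization of the Modify owner processor number that the description gives for FIG. 5.

```python
def flush_on_process_switch(proc, internal, external, memory):
    """Write back all Modify data of processor `proc` at a process switch.

    internal: {addr: {"state", "data"}}          internal cache of `proc`
    external: {addr: {"state", "owner", "data"}} external cache 14
    memory:   {addr: data}                       main memory 6
    """
    # Stage 1: invalidation request, internal cache -> external cache.
    for addr, line in internal.items():
        if line["state"] == "M":
            external[addr] = {"state": "M", "owner": proc,
                              "data": line["data"]}
            line["state"] = "I"
    # Stage 2: write-back request, external cache -> main memory.
    for line_addr, line in external.items():
        if line["state"] == "M" and line["owner"] == proc:
            memory[line_addr] = line["data"]
            line["state"] = "I"
            line["owner"] = None   # owner number is reset with the state
```

Only lines owned by the switching processor are flushed in stage 2, so Modify lines of other processors on the board stay in the external cache.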

【００４６】In this way, at a process switch, all Modify lines disappear from the internal cache of the processor concerned, and the external cache 14 no longer holds any data owned by that processor. During normal operation, therefore, the data in each cache is written back at every process switch. In the write-back of Modify lines described above the cache state is changed from Modify to Invalid, but a suitable cache state other than Invalid may be provided and used instead.

【００４７】In the present embodiment, as shown in FIG. 5, the external cache 14 records which processor each Modify line belongs to. Accordingly, when the modified data belonging to a processor is written back at a process switch and all Modify lines owned by that processor are gone, the cache state and the Modify owner processor number are initialized. When the data in the internal cache 18 is subsequently updated by execution of the new process, the contents of the external cache 14 are updated as well; the external cache control unit 28 in the CPU bridge 12 then updates the cache state to Modify and sets the Modify owner processor number in the external cache 14. The external cache 14 is updated in this way during normal operation. In addition, in the present embodiment the management table 32 is updated as follows.

【００４８】As shown in FIG. 4, the CPU bridge 12 holds in the management table 32 a data output flag for each processor as external reference flag information. As described above, the external cache 14 stores all the data held in the internal caches 18 and 19 of the processors 10 and 11, and holds for each line the number of the processor that owns the Modify line. The data output flag indicates, with reference to the processor numbers in the external cache 14, whether each of the processors 10 and 11 has a Modify line in the external cache 14.

【００４９】In the present embodiment, a new process is assigned to the processor 10 at a process switch; because the modified data held by the processor 10 is written back to the external cache 14 and then to the main memory 6, no Modify line owned by the processor 10 remains in the external cache 14. The data output flag corresponding to the processor 10 is cleared accordingly. The flag is then set during execution of the process in the following cases: (1) when the processor 10 issues an I/O request to the CPU bus 16 and the system bus 1; (2) when the processor 10 issues a write-through write request to the CPU bus 16 and the system bus 1; (3) when a Modify line owned by the processor 10 in the external cache 14 is selected for eviction and written back to the main memory 6; and (4) when a Modify line owned by the processor 10 in the external cache 14 is hit by a request from another CPU board via the system bus 1 and is output to the system bus 1 and read. In other words, the data output flag is set when the data of a Modify line owned by the processor 10 in the external cache 14 (modified data) is accessed from a component other than the CPU board 2 carrying the processor 10 and the contents of that modified data are reflected there.
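The clear and set rules for the data output flag can be summarized as a small event handler. This is a sketch under stated assumptions: the event names are invented labels for cases (1) to (4) and for the process switch, and `INTERNAL_CACHE_WRITE` is a hypothetical extra event added only to show an access that stays on the board and leaves the flag unchanged.

```python
from enum import Enum, auto


class BusEvent(Enum):
    PROCESS_SWITCH = auto()             # flag is cleared
    IO_REQUEST = auto()                 # case (1)
    WRITE_THROUGH = auto()              # case (2)
    MODIFY_EVICTED_TO_MEMORY = auto()   # case (3)
    MODIFY_HIT_BY_OTHER_BOARD = auto()  # case (4)
    INTERNAL_CACHE_WRITE = auto()       # hypothetical: stays on the board


# Events that make modified data visible outside the CPU board.
SETS_FLAG = frozenset({
    BusEvent.IO_REQUEST,
    BusEvent.WRITE_THROUGH,
    BusEvent.MODIFY_EVICTED_TO_MEMORY,
    BusEvent.MODIFY_HIT_BY_OTHER_BOARD,
})


def update_data_output_flag(flags, proc, event):
    """Update and return the data output flag of `proc` for one event."""
    if event is BusEvent.PROCESS_SWITCH:
        flags[proc] = False
    elif event in SETS_FLAG:
        flags[proc] = True
    return flags[proc]
```

Once set, the flag stays set until the next process switch on that processor, matching the description that it is cleared only at that point.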

【００５０】The data output flag is thus set and cleared while the system is running. Providing the data output flag makes it possible to detect that a Modify line has been accessed from outside, for example by a processor on another CPU board 20.

【００５１】Further, as shown in FIG. 4, the CPU bridge 12 holds in the management table 32 an other processor reference flag for each processor. The other processor reference flag indicates whether a Modify line owned by a given processor in the external cache 14 has been accessed by another processor on the same CPU board as that processor.

【００５２】In the present embodiment, a new process is assigned to the processor 10 at a process switch; because the modified data held by the processor 10 is written back to the external cache 14 and then to the main memory 6, no Modify line owned by the processor 10 remains in the external cache 14. The other processor reference flag corresponding to the processor 10 is cleared accordingly. When the new process then runs on the processor 10 and updates data, Modify lines owned by the processor 10 come to be held in the external cache 14 as described above. At that time the Modify owner processor number is set in the external cache 14.

【００５３】When a processor other than the processor 10 on the same CPU board 2 (the processor 11 in the example of FIG. 3) accesses a Modify line owned by the processor 10 in the external cache 14, the CPU bridge 12 sets the other processor reference flag of the referenced processor 10. In other words, the other processor reference flag is set when the data of a Modify line owned by the processor 10 in the external cache 14 (modified data) is accessed from another component on the same CPU board 2 as the processor 10 and the contents of that modified data are reflected there. Whether the other processor 11 has accessed a Modify line owned by the processor 10 can easily be determined by referring to the Modify owner processor number.
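The owner comparison described above can be sketched as a check performed on each same-board access to the external cache. The function name and the dictionary layout are assumptions made for illustration; the patent performs this check in the CPU bridge hardware.

```python
def on_same_board_access(external, other_ref_flags, requester, addr):
    """Record that `requester` touched a Modify line owned by someone else.

    external:        {addr: {"state", "owner"}}  external cache 14
    other_ref_flags: {processor_number: bool}    other processor reference flags
    """
    line = external.get(addr)
    if line is None or line["state"] != "M":
        return  # miss or non-Modify line: nothing to record
    owner = line["owner"]
    if owner is not None and owner != requester:
        # The flag belongs to the owner (the referenced processor),
        # not to the processor that made the access.
        other_ref_flags[owner] = True
```

An access by the owner itself, or to a line that is not in the Modify state, leaves every flag unchanged.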

【００５４】Providing the other processor reference flag makes it possible to detect that a Modify line has been accessed by another processor on the same CPU board.

【００５５】Operation when a failure occurs. In the present embodiment, the external cache 14 and the management table 32 are updated as described above during normal operation of the server system. Next, the operation when a failure occurs in a processor of the server system is described with reference to the flowchart of FIG. 6. Here too, the case where a failure occurs in the processor 10 mounted on the CPU board 2 is taken as an example. Steps that correspond to those in the flowchart of the basic operation shown in FIG. 2 are given reference numerals with the same hundreds digit.

【００５６】When the processor 10 detects a failure during execution of a process by means of its self-failure detection function (step 110), it raises a failure interrupt and sends a failure detection signal to the CPU bridge 12 via an error line. On recognizing that the failure detection signal has been received from the processor 10, the disconnection determination circuit 34 in the CPU bridge 12 determines, on the basis of a predetermined disconnection condition, whether the effect of the failed processor 10 has propagated elsewhere (step 210). The CPU bridge 12 does this by referring to the settings of the data output flag and the other processor reference flag. If the data output flag is clear, it can be concluded that no Modify line owned by the processor 10 in the external cache 14 has been accessed from outside the CPU board 2. If the other processor reference flag is clear, it can be concluded that the processor 11 on the same CPU board 2 has not accessed any Modify line owned by the processor 10 in the external cache 14. Therefore, if the disconnection condition that both flags are clear is satisfied, it can be concluded that the data modified by the failed processor 10 has not been referenced by any other processor or the like and that the effect has not propagated to other processors. Conversely, if either the data output flag or the other processor reference flag corresponding to the processor 10 is set, it can be concluded that the effect has propagated to another processor.
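The disconnection condition of step 210 reduces to a two-input predicate. The sketch below uses a hypothetical function name, `can_isolate`, and simply enumerates the four flag combinations.

```python
def can_isolate(data_output_flag: bool, other_proc_ref_flag: bool) -> bool:
    """Disconnection condition: both flags of the failed processor are clear."""
    return not data_output_flag and not other_proc_ref_flag


# Enumerate every flag combination for the failed processor.
truth_table = {
    (d, o): can_isolate(d, o)
    for d in (False, True)
    for o in (False, True)
}
```

Only the combination in which neither flag is set allows the processor to be isolated; any single set flag is enough to conclude that modified data may have propagated.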

【００５７】If, as a result of the above determination, the predetermined disconnection condition is satisfied and the processor 10 is judged to be isolatable (step 310), the cache state of every Modify line owned by the processor 10 in the external cache 14 is changed from Modify to Invalid in order to invalidate the data of those lines (step 410). The Modify owner processor number is cleared at the same time. The disconnection determination circuit 34 then disconnects the processor 10 by sending it a reset signal via a reset line (step 411), and sets the failure flag corresponding to the processor 10 in the management table 32. A failure flag is provided for each processor; it is in its initial (cleared) state while the processor is normal, and the flag for a failed processor is set when that processor is disconnected from the CPU bridge 12 because of the failure. This allows the OS to avoid assigning processes to any processor whose failure flag is set.

【００５８】If, on the other hand, the predetermined disconnection condition is not satisfied and it is determined that the processor 10 cannot be isolated, it is judged that data consistency cannot be maintained; the system is stopped and, depending on the case, restarted (step 610). If necessary, the OS then assigns the process that was running on the processor 10 to another processor and runs it again.
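The flow of FIG. 6 from the determination (step 210) to either isolation (steps 410 and 411) or system stop (step 610) can be sketched as one function. This is an illustrative model only: the function names, the dictionary layout, and the string return values are assumptions, and the reset signal itself is not modeled.

```python
def handle_processor_fault(proc, table, external):
    """Return "isolated" or "system_down" for a fault on processor `proc`.

    table:    {processor_number: {"data_output", "other_proc_ref", "failed"}}
    external: {addr: {"state", "owner"}}  external cache 14
    """
    entry = table[proc]
    # Steps 210/310: either flag set means the effect may have propagated.
    if entry["data_output"] or entry["other_proc_ref"]:
        return "system_down"                  # step 610
    # Step 410: discard every Modify line owned by the failed processor.
    for line in external.values():
        if line["state"] == "M" and line["owner"] == proc:
            line["state"] = "I"
            line["owner"] = None
    # Step 411 (reset signal) is not modeled; record the failure flag.
    entry["failed"] = True
    return "isolated"


def schedulable(table):
    """Processors the OS may still assign processes to."""
    return [p for p, e in table.items() if not e["failed"]]
```

After an isolation, `schedulable` reflects how the OS can use the failure flag to keep processes away from the disconnected processor.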

【００５９】As described above, the present embodiment provides the data output flag and the other processor reference flag so that it is easy to detect whether a Modify line owned by a failed processor has been accessed by another processor. If there has been no such access, data consistency is not lost even when the failed processor 10 is disconnected, so the processor can be isolated without bringing the system down.

【００６０】The present embodiment has been described using a server system in which a plurality of CPU boards, each carrying a plurality of processors, are connected. Accordingly, in step 210 of FIG. 6, if the server system instead has a plurality of CPU boards each carrying a single processor, only the presence or absence of access from processors on other CPU boards needs to be checked as the disconnection condition; and if only one CPU board carrying a plurality of processors is connected, only the presence or absence of access from other processors on the same CPU board needs to be checked.
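The dependence of the disconnection condition on the system configuration can be sketched as follows. The function and parameter names are hypothetical; the sketch only encodes which of the two flags is relevant for a given number of CPU boards and processors per board.

```python
def disconnection_condition(data_output, other_proc_ref,
                            boards, procs_per_board):
    """Return True if the failed processor may be isolated.

    The data output flag matters only when other CPU boards exist, and the
    other processor reference flag only when the board has other processors.
    """
    check_external = boards > 1
    check_same_board = procs_per_board > 1
    blocked = ((check_external and data_output)
               or (check_same_board and other_proc_ref))
    return not blocked
```

With several boards and several processors per board, this reduces to the condition that both flags are clear.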

【００６１】In the present embodiment, the means for determining whether the failed processor can be disconnected and the means for disconnecting it are mounted on the CPU bridge 12, but they may be provided as separate components.

【００６２】Embodiment 3. Embodiment 2 above described the case where a failure occurs in a processor; the present embodiment shows a specific example that achieves the above object when a failure occurs in the external cache 14. The external cache 14 in the present embodiment has a self-failure detection function based on parity or ECC checking; it is assumed here that a failure can be localized to a way and that the cache can be degraded (a way placed in an unusable state) on a per-way basis.

【００６３】Next, the operation when a failure occurs in the external cache 14 of the server system is described with reference to the flowchart of FIG. 7. Steps that correspond to those in the flowchart of the basic operation shown in FIG. 2 are given reference numerals with the same hundreds digit. The processing during normal operation of the server system, such as write-back at a process switch and updating of the flags in the management table, is the same as in Embodiment 2 and is not described again.

【００６４】When the external cache 14 detects a failure in one of its ways by means of its self-failure detection function (step 150), it raises a failure interrupt and sends, for example, a data parity signal to the CPU bridge 12 via an error line. On recognizing that the data parity signal has been received, the error detection unit 36 in the CPU bridge 12 determines, on the basis of a predetermined disconnection condition, whether the effect of the failed way has propagated elsewhere, for example whether the data of a Modify line in the failed way has been accessed by a processor on another CPU board 20 (step 250). This is because, for a Modify line, data consistency could no longer be maintained. Lines whose cache state is other than Modify need not be considered, because their data exists in the main memory 6 and no problem arises once they are invalidated. In the present embodiment the determination is made by the CPU bridge 12 referring to the settings of the data output flags. If the predetermined disconnection condition is satisfied, namely that the data output flags of all the processors 10 and 11 with respect to the failed way are clear, it can be concluded that the modified data within the CPU board has not been accessed from outside and has not affected anything else. If, on the other hand, any of the data output flags is set for the failed way, it can be concluded that the modified data held in the failed way may have been accessed from outside.

【００６５】Accordingly, if the determination shows that the predetermined disconnection condition is satisfied and the failed way can be degraded (step 350), the cache state of every Modify line is changed to Invalid in order to invalidate all the data in the external cache 14 (step 450). This amounts to discarding all the data in the external cache 14. The reason is that a processor that owns a Modify line in the failed way may also own Modify lines in other, healthy ways, and discarding all the data is what allows data consistency to be maintained. The Modify owner processor number is cleared at the same time. The error detection unit 36 then degrades the failed way (step 451). The cache control unit 28 internally maintains, for each way, a degradation flag that controls whether data may be written to that way; by setting this flag it ensures that no data is loaded into the degraded way from then on. If necessary, the OS then assigns the process that was running to one of the processors and runs it again.
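The way-degradation flow of FIG. 7 can be sketched in the same style. This is a minimal model with assumed names and data layout. The "system_down" branch is an assumption made by analogy with the processor case of Embodiment 2, since this passage spells out only the case in which the condition is satisfied.

```python
def handle_way_fault(failed_way, data_output_flags, ways, degraded):
    """Return "degraded" or "system_down" for a fault in way `failed_way`.

    data_output_flags: {processor_number: bool}
    ways:              list of ways, each a list of {"state", "owner"} lines
    degraded:          list of per-way degradation flags (bool)
    """
    # Steps 250/350: any set data output flag blocks the degradation.
    if any(data_output_flags.values()):
        return "system_down"   # assumed, by analogy with Embodiment 2
    # Step 450: discard Modify data in every way, not only the failed one.
    for way in ways:
        for line in way:
            if line["state"] == "M":
                line["state"] = "I"
                line["owner"] = None
    # Step 451: mark the failed way so no data is loaded into it again.
    degraded[failed_way] = True
    return "degraded"


def usable_ways(degraded):
    """Indices of ways that may still receive data."""
    return [w for w, d in enumerate(degraded) if not d]
```

Invalidating Modify lines in the healthy ways as well reflects the stated reason that an owner of a Modify line in the failed way may also own Modify lines elsewhere.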

【００６６】In this way, even if a failure occurs in one of the ways of the external cache 14, degrading only that way without bringing the system down allows system operation to continue while data consistency is maintained.

【００６７】In the present embodiment, the presence or absence of external access is determined using the data output flag set up in Embodiment 2, but dedicated external reference flag information may instead be provided for each way for this determination. Like the data output flag, the dedicated external reference flag information would be cleared at a process switch and set when modified data held in the way is reflected outside the CPU board 2.

【００６８】The system configuration shown in FIG. 1 is an example; the number of CPU boards, the number of processors, the configuration of the devices connected to the system bus, and so on are not limited to it.

[0069]

[Effects of the Invention] According to the present invention, when a failure occurs, it is determined whether the fault location can be isolated without bringing down the system. If isolation is determined to be possible, the fault location is disconnected while the system continues to operate; if isolation is determined to be impossible, the system is brought down once, the fault location is disconnected, and the system is then started up again. In other words, because whether the system must be brought down to isolate the fault location is determined before the isolation, a failure can be recovered from without bringing the system down unnecessarily, and the reliability and availability of the system can be improved without multiplexing the components mounted on the computer.
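
The order of operations described above (judge first, then either disconnect online or go through a restart) can be sketched as follows. This is a minimal illustration: the function name, the string labels, and the log entries are all assumptions introduced here, and the isolation judgment is passed in as a callable because its conditions depend on the embodiment.

```python
from typing import Callable, List


def recover_from_fault(
    fault_location: str,
    can_isolate_online: Callable[[str], bool],
    log: List[str],
) -> str:
    """Decide before isolating whether the system has to be brought down."""
    if can_isolate_online(fault_location):
        # Isolation judged possible: disconnect while the system keeps running.
        log.append(f"disconnect {fault_location} (system running)")
        return "continued"
    # Isolation judged impossible: bring the system down once,
    # disconnect the fault location, then start the system up again.
    log.append("system down")
    log.append(f"disconnect {fault_location}")
    log.append("system start")
    return "restarted"
```

The point of the sketch is only that the judgment precedes any disconnection, so the restart path is taken solely when the judgment fails.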

[0070] Further, according to the present invention, the same effects as described above can be obtained when each processor mounted on the CPU board and each way of the external cache memory is treated as a component.

[0071] Further, providing the external reference flag information makes it possible to detect that a line of the external cache memory storing data modified by the faulty processor has been accessed from outside the CPU board on which the faulty processor is mounted. Accordingly, when the line has not been accessed from outside the CPU board and it is determined that data consistency can be maintained, the faulty processor can be isolated without bringing down the system.

[0072] Further, providing the other processor reference flag information makes it possible to detect that a line of the external cache memory storing modified data has been accessed by another processor on the same CPU board. Accordingly, when the line owned by the faulty processor has not been accessed by another processor on the same CPU board and it is determined that data consistency can be maintained, the faulty processor can be isolated without bringing down the system.
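
How the two per-processor flags combine into the isolation judgment can be sketched as follows. This is an illustrative model only: the names `ProcessorFlags` and `ManagementTableModel` and the event-handler methods are assumptions of this sketch, and the real management table is held in hardware in the CPU bridge.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ProcessorFlags:
    external_reference: bool = False         # modified data reflected outside the board
    other_processor_reference: bool = False  # owned line accessed by another processor


class ManagementTableModel:
    """Hypothetical software model of the per-processor flag table."""

    def __init__(self, processor_ids: Iterable[int]) -> None:
        self.flags: Dict[int, ProcessorFlags] = {
            pid: ProcessorFlags() for pid in processor_ids
        }

    def on_process_switch(self, pid: int) -> None:
        # Both flags of the switching processor are cleared.
        self.flags[pid] = ProcessorFlags()

    def on_reflected_outside_board(self, owner: int) -> None:
        self.flags[owner].external_reference = True

    def on_line_access(self, owner: int, accessor: int) -> None:
        # Only an access by a processor other than the owner sets the flag.
        if accessor != owner:
            self.flags[owner].other_processor_reference = True

    def can_isolate_without_shutdown(self, failed: int) -> bool:
        # Isolation is judged possible only when both flags are clear.
        f = self.flags[failed]
        return not (f.external_reference or f.other_processor_reference)
```

For example, if processor 11 reads a line owned by processor 10, a later failure of processor 10 in this model is no longer isolable online until processor 10 goes through a process switch.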

[0073] Further, providing the failure flag information makes it possible to prevent processes from being assigned to the disconnected faulty processor.
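
One way the failure flag might be consulted when assigning a process is sketched below. The function names and the policy of choosing the lowest-numbered healthy processor are assumptions made for illustration; the specification states only that no process is assigned to a disconnected processor.

```python
from typing import Dict, List


def schedulable_processors(failure_flags: Dict[int, bool]) -> List[int]:
    # A set failure flag marks a processor that has been disconnected.
    return [pid for pid in sorted(failure_flags) if not failure_flags[pid]]


def assign_process(failure_flags: Dict[int, bool]) -> int:
    # Pick the lowest-numbered healthy processor (assumed policy).
    candidates = schedulable_processors(failure_flags)
    if not candidates:
        raise RuntimeError("no processor available")
    return candidates[0]
```

With the failure flag of processor 10 set and that of processor 11 clear, this sketch always assigns to processor 11.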

[0074] Further, providing the external reference flag information makes it possible to detect that modified data held in a way has been accessed from outside the CPU board on which the way is mounted. Accordingly, when the modified data held by the faulty way has not been accessed from outside the CPU board and it is determined that data consistency can be maintained, the faulty way can be isolated without bringing down the system, and system operation can continue on the remaining ways.

[Brief description of the drawings]

FIG. 1 is a diagram showing an embodiment of a multiprocessor computer according to the present invention.

FIG. 2 is a flowchart showing the basic operation in the first embodiment.

FIG. 3 is a block diagram of a CPU board in the second embodiment.

FIG. 4 is a diagram showing an example of the contents of the management table held by the CPU bridge in the second embodiment.

FIG. 5 is a conceptual diagram of the external cache in the second embodiment.

FIG. 6 is a flowchart showing the operation when a failure occurs in a processor in the second embodiment.

FIG. 7 is a flowchart showing the operation when a failure occurs in the external cache in the third embodiment.

[Explanation of symbols]

1 system bus; 2, 20 CPU board; 4 I/O bridge; 6 main storage device (main memory); 8 external storage device; 10, 11 processor; 12 CPU bridge; 14 external cache; 16 CPU bus; 18, 19 built-in cache; 22 port; 24 CPU bus request issuing circuit; 26 write-back circuit; 28 cache control circuit; 30 system bus request issuing circuit; 32 management table; 34 disconnection determination circuit; 36 error detection unit; 38 CPU bus receiving circuit; 40 system bus receiving circuit.

Claims

[Claims]

1. A multiprocessor computer having components formed in units separable from the computer, comprising: fault detection means for detecting a fault occurring during system operation and identifying the fault location; isolation determination means for determining, based on a predetermined disconnection condition, whether the fault location can be isolated without bringing down the system; isolation execution means for disconnecting the fault location while the system is operating; and restart means for stopping and restarting the system, wherein the fault location is disconnected without bringing down the system when the isolation determination means determines that isolation is possible, and the system is restarted when the isolation determination means determines that isolation is not possible.

2. A fault recovery method in a multiprocessor computer having components formed in units separable from the computer, the method comprising: a fault detection step of detecting a fault occurring during system operation and identifying the fault location; an isolation determination step of determining, based on a predetermined disconnection condition, whether the fault location can be isolated without bringing down the system; an isolation execution step of disconnecting the fault location without bringing down the system when it is determined in the isolation determination step that isolation is possible; and a restart step of restarting the system when it is determined in the isolation determination step that isolation is not possible.

3. A multiprocessor computer comprising: at least one CPU board having a self-failure detection function and including at least one processor with a built-in cache memory, an external cache memory that contains the contents of the built-in cache memory for each line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them; and a main memory, the built-in cache memory and the external cache memory holding, together with the data, status information on whether the data has been modified, the multiprocessor computer further comprising: isolation determination means for determining, based on a predetermined disconnection condition, whether a fault location can be isolated without bringing down the system; isolation execution means for disconnecting the fault location while the system is operating; and restart means for stopping and restarting the system, wherein the fault location is disconnected without bringing down the system when the isolation determination means determines that isolation is possible, and the system is restarted when the isolation determination means determines that isolation is not possible.

4. The multiprocessor computer according to claim 3, wherein the external cache memory holds identification information of the processor that owns each line in which modified data is stored, and the isolation execution means, when the isolation determination means has determined that the influence is not transmitted to others, invalidates all data modified by the faulty processor that is held in the CPU board on which the faulty processor is mounted, the data being identified based on the identification information of the processor, and then disconnects the faulty processor.

5. The multiprocessor computer according to claim 4, wherein the processor bridge has: means for receiving, for each processor, a start request for write-back processing of the modified data in the built-in cache memory of that processor; and means for issuing a write-back processing request for writing the modified data back to the external cache memory, and wherein, at the time of a process switch, the modified data held in the built-in cache memory of the processor being processed is written back to the main memory.

6. The multiprocessor computer according to claim 3, wherein the processor bridge holds external reference flag information assigned to each processor on the same CPU board, clears the external reference flag information corresponding to each processor at the time of a process switch, and sets it when modified data owned by that processor and held only inside the same CPU board is reflected outside the same CPU board.

7. The multiprocessor computer according to claim 6, wherein the isolation determination means determines that the influence from the faulty processor is not transmitted to others when the external reference flag information corresponding to the faulty processor is clear.

8. The multiprocessor computer according to claim 4, wherein the external cache memory holds identification information of the processor that owns each line in which modified data is stored, and the processor bridge holds other processor reference flag information assigned to each processor on the same CPU board, clears the other processor reference flag information corresponding to each processor at the time of a process switch, and sets it when a processor on the same CPU board other than the owner accesses a line of the external cache memory in which modified data owned by that processor is stored.

9. The multiprocessor computer according to claim 8, wherein the isolation determination means determines that the influence from the faulty processor is not transmitted to others when the other processor reference flag information corresponding to the faulty processor is clear.

10. The multiprocessor computer according to claim 4, wherein the isolation execution means disconnects the faulty processor from the processor bridge upon receiving a fault detection signal issued by the faulty processor through its self-failure detection function.

11. The multiprocessor computer according to claim 10, wherein the isolation execution means holds failure flag information indicating that the faulty processor has been disconnected from the processor bridge.

12. The multiprocessor computer according to claim 3, wherein said processor bridge includes said isolation determination means.

13. The multiprocessor computer according to claim 3, wherein said processor bridge has said isolation execution means.

14. A failure recovery method in a multiprocessor computer that comprises: at least one CPU board having a self-failure detection function and including at least one processor with a built-in cache memory, an external cache memory that contains the contents of the built-in cache memory for each line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them; and a main memory, the built-in cache memory and the external cache memory holding, together with the data, status information on whether the data has been modified, the method comprising: an isolation determination step of determining whether the influence from a processor in which a failure has occurred during system operation is transmitted to others; an isolation execution step of, when it is determined in the isolation determination step that the influence is not transmitted, invalidating, without bringing down the system, all data modified by the faulty processor that is held in the CPU board on which the faulty processor is mounted, and then disconnecting the faulty processor; and a restart step of stopping and restarting the system when it is determined in the isolation determination step that the influence is transmitted.

15. The multiprocessor computer according to claim 3, wherein the external cache memory has a structure that can be divided in units of ways, the isolation determination means determines whether the influence from a way in which a failure has occurred is transmitted outside the CPU board on which the external cache memory is mounted, and the isolation execution means, when it is determined that the influence from the faulty way has not been transmitted outside the CPU board, invalidates all the modified data stored in the external cache memory and then degenerates the faulty way.

16. The multiprocessor computer according to claim 15, wherein the processor bridge holds external reference flag information assigned to each way on the same CPU board, clears the external reference flag information corresponding to each way at the time of a process switch, and sets it when the modified data held in that way is reflected outside the same CPU board, and the isolation determination means determines, from the content of the external reference flag information corresponding to the faulty way, whether the modified data stored in the faulty way has been reflected externally.

17. A failure recovery method in a multiprocessor computer that comprises: at least one CPU board having a self-failure detection function and including at least one processor with a built-in cache memory, an external cache memory that has a plurality of ways and contains the contents of the built-in cache memory for each line, and a processor bridge that connects the processor and the external cache memory and controls and manages each of them; and a main memory, the built-in cache memory and the external cache memory holding, together with the data, status information on whether the data has been modified, the method comprising: an isolation determination step of determining whether the modified data stored in a way in which a failure has occurred during system operation has been transmitted outside the CPU board having the external cache memory; an isolation execution step of, when it is determined in the isolation determination step that the data has not been transmitted outside the CPU board, invalidating all the modified data stored in the external cache memory without bringing down the system and then degenerating the faulty way; and a restart step of stopping and restarting the system when it is determined in the isolation determination step that the data has been transmitted outside the CPU board.