JP2968484B2 - Multiprocessor computer and fault recovery method in multiprocessor computer

Info

Publication number: JP2968484B2
Application number: JP8261642A
Authority: JP
Inventors: 敏久亀丸
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1996-10-02
Filing date: 1996-10-02
Publication date: 1999-10-25
Anticipated expiration: 2016-10-02
Also published as: JPH10105527A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to multiprocessor computers, and in particular to improving reliability and availability in computers whose components are not multiplexed.

[0002]

[Prior Art] A fault-tolerant computer is a computer in which a hardware or software failure in one part of the system does not disrupt the operation of the system as a whole. Fault-tolerant computers are useful for building highly reliable, failure-resistant systems, and they offer a particularly desirable system form for the core parts of a computer system.

[0003] The high reliability and high availability of conventional fault-tolerant computers are achieved by multiplexing hardware components. Each component is provided as an online-replaceable module, a unit that can be detached while the system is running. When a failure occurs in the computer, the system detects the failure location and isolates it at component granularity. The remaining components keep the system running while the failed part is repaired. In other words, the failed part can logically be repaired with no system downtime at all. The processors mounted in a fault-tolerant computer are naturally also multiplexed for reliability, either in a so-called pair-and-spare configuration (two sets of two paired processors) or in a configuration with three or more processors that takes a majority vote. In both cases the processors execute while the results they output are compared.

[0004] A server system built from ordinary personal computers (PCs) may, like a fault-tolerant computer, have multiple processors, but in a server system the multiple processors are there to improve performance. Because such a system does not take the multiplexed form of a fault-tolerant computer, the system has to be stopped whenever any failure occurs in it, whether or not the failure location can be detected.

[0005]

[Problems to Be Solved by the Invention] To pursue high reliability and high availability in the PC server system described above, one could multiplex its processors in the same way as in a fault-tolerant computer.

[0006] However, multiplexing the processors requires many additional parts, not only the processors themselves but also their peripheral circuits, and so increases the circuit scale. This not only enlarges the CPU board that carries the processors but also raises cost and power consumption. In addition, multiplexing requires comparison circuits for majority voting and the like, which prevents the processor clock frequency from being raised and so prevents performance from being improved.

[0007] The present invention was made to solve these problems. Its object is to provide a multiprocessor computer that improves reliability and availability without multiplexing components such as processors.

[0008]

[Means for Solving the Problems] To achieve the above object, a multiprocessor computer according to the present invention comprises: at least one CPU board that has a self-failure detection function and carries (a) at least one processor having an internal cache memory that holds data together with status information indicating whether the data has been modified, (b) an external cache memory that includes the contents of the internal cache memory line by line and holds, for each line, status information indicating whether the held data has been modified and identification information of the processor that owns the line in which the data is stored, and (c) a processor bridge that connects the processor and the external cache memory and controls and manages each of them; a main memory; isolation determination means for determining, based on a predetermined detachment condition, whether a failure location can be isolated without bringing the system down; isolation execution means for detaching the failure location while the system is operating when the isolation determination means determines that isolation is possible; and restart means for stopping and restarting the system when the isolation determination means determines that isolation is not possible.

The processor bridge clears external reference flag information, which is assigned to each processor on the same CPU board, at a process switch, and sets it when modified data that is owned by that processor and held only inside the same CPU board is reflected outside that CPU board. The isolation determination means refers to the external reference flag information to determine whether the influence of the failed processor has propagated to the outside. When the isolation determination means determines that the influence has not propagated to the outside, the isolation execution means invalidates all data modified by the failed processor that is held within the CPU board carrying that processor, the data being obtained on the basis of the processor identification information, and then detaches the failed processor.

[0009] The processor bridge also has means for accepting, for each processor, a request to start write-back processing of the modified data in that processor's internal cache memory, and means for issuing, to the processor targeted by an accepted start request, a write-back request that causes it to write the modified data back to the external cache memory. At a process switch, the modified data held in the internal cache memory of the targeted processor is written back to the main memory.

[0010] The isolation determination means determines that the influence of the failed processor has not propagated to the outside when the external reference flag information corresponding to the failed processor is clear.

[0011] The processor bridge also clears other-processor reference flag information, which is assigned to each processor on the same CPU board, at a process switch, and sets it when a processor on the same CPU board other than the owner accesses a line of the external cache memory that stores modified data owned by that processor. The isolation determination means refers to the other-processor reference flag information to determine whether the influence of the failed processor has propagated to other processors on the same CPU board. When the isolation determination means determines that the influence has propagated neither to the outside nor to other processors on the same CPU board, the isolation execution means invalidates all data modified by the failed processor that is held within the CPU board carrying that processor, the data being obtained on the basis of the processor identification information, and then detaches the failed processor.

[0012] Furthermore, the isolation determination means determines that the influence of the failed processor has not propagated to other processors on the same CPU board when the other-processor reference flag information corresponding to the failed processor is clear.
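The flag handling described in paragraphs [0008] and [0010] to [0012] can be illustrated with a small software model. This is only a sketch under stated assumptions: the patent describes hardware in the processor bridge, and the class, method, and field names below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProcessorFlags:
    """Per-processor flags kept by the processor bridge (hypothetical layout)."""
    external_reference: bool = False         # modified data was reflected off the board
    other_processor_reference: bool = False  # another on-board processor touched a Modify line


class ProcessorBridge:
    """Illustrative model of the flag handling only, not the actual circuit."""

    def __init__(self, processor_ids):
        self.flags = {pid: ProcessorFlags() for pid in processor_ids}

    def on_process_switch(self, pid):
        # Both flags are cleared at a process switch.
        self.flags[pid] = ProcessorFlags()

    def on_modified_data_sent_off_board(self, owner):
        # Modified data owned by `owner` was reflected outside the CPU board.
        self.flags[owner].external_reference = True

    def on_modify_line_access(self, owner, accessor):
        # Only an access by a processor other than the owner sets the flag.
        if accessor != owner:
            self.flags[owner].other_processor_reference = True

    def can_isolate(self, failed):
        # Isolation is possible only if both flags of the failed processor are clear.
        f = self.flags[failed]
        return not f.external_reference and not f.other_processor_reference
```

In this model, `can_isolate` returning `False` corresponds to the case where the system has to be stopped and restarted instead of detaching the processor while running.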

[0013] On receiving a failure detection signal issued by the failed processor through its self-failure detection function, the isolation determination means detaches the failed processor from the processor bridge.

[0014] Furthermore, the isolation execution means holds failure flag information indicating whether each processor has been detached from the processor bridge.

[0015] The processor bridge includes the isolation determination means.

[0016] The processor bridge includes the isolation execution means.

[0017] A failure recovery method in a multiprocessor computer according to another aspect of the invention applies to a multiprocessor computer having at least one CPU board that has a self-failure detection function and carries: at least one processor having an internal cache memory that holds data together with status information indicating whether the data has been modified; an external cache memory that includes the data held in the internal cache memory line by line and holds, for each line, status information indicating whether the held data has been modified and identification information of the processor that owns the line in which the data is stored; and a processor bridge that connects the processor and the external cache memory, controls and manages each of them, clears external reference flag information assigned to each processor on the same CPU board at a process switch, and sets it when modified data that is owned by that processor and held only inside the same CPU board is reflected outside that CPU board.

The method includes: an isolation determination step of referring to the external reference flag information to determine whether the influence of a processor that failed during system operation has propagated to the outside; an isolation execution step of, when the isolation determination step determines that the influence of the failed processor has not propagated to the outside, invalidating all data modified by the failed processor that is held within the CPU board carrying it and then detaching the failed processor, without bringing the system down; and a restart step of stopping and restarting the system when the isolation determination step determines that the influence of the failed processor has propagated to the outside.

[0018] Where the processor bridge clears other-processor reference flag information, assigned to each processor on the same CPU board, at a process switch and sets it when a processor on the same CPU board other than the owner accesses a line of the external cache memory that stores modified data owned by that processor, the isolation determination step refers to the other-processor reference flag information to determine whether the influence of the failed processor has propagated to other processors on the same CPU board. When the isolation determination step determines that the influence has propagated neither to the outside nor to other processors on the same CPU board, the isolation execution step invalidates all data modified by the failed processor that is held within the CPU board carrying it and then detaches the failed processor, without bringing the system down.

[0019] The external cache memory has a structure that can be divided into ways. The isolation determination means determines whether the influence of a failed way has propagated outside the CPU board carrying the external cache memory. When the isolation determination means determines that the influence of the failed way has not propagated outside the CPU board, the isolation execution means invalidates all modified data stored in the external cache memory and then takes the failed way out of service (degraded operation).

[0020] Furthermore, the isolation determination means determines whether the modified data stored in the failed way has been reflected to the outside by referring to the external reference flag information assigned to each processor identified, by the owner identification information, as the owner of a line contained in the failed way.
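The way-level check in paragraphs [0019] and [0020] can be sketched as follows. This is a minimal illustration, assuming a failed way is represented as a list of lines and the external reference flags as a dictionary keyed by processor number; the names `CacheLine`, `MODIFY`, and `way_is_isolable` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MODIFY = "M"  # cache state of a line that holds modified data


@dataclass
class CacheLine:
    state: str                   # e.g. "M", "E", "S" or "I"
    owner: Optional[int] = None  # processor number of the Modify owner, if any


def way_is_isolable(way_lines, external_reference_flag):
    """Return True if no Modify line in the failed way is owned by a processor
    whose external reference flag is set."""
    # Collect the owners of every Modify line in the failed way.
    owners = {line.owner for line in way_lines if line.state == MODIFY}
    # If any of those owners has already reflected modified data off the board,
    # the failed way may have influenced the outside.
    return not any(external_reference_flag[pid] for pid in owners)
```

A way with no Modify lines is treated as isolable here, because there is no modified data that could have been reflected to the outside.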

[0021] A failure recovery method in a multiprocessor computer according to yet another aspect of the invention applies to a multiprocessor computer having at least one CPU board that has a self-failure detection function and carries: at least one processor having an internal cache memory that holds data together with status information indicating whether the data has been modified; an external cache memory that has a structure divisible into ways, each containing a plurality of lines, includes the data held in the internal cache memory line by line, and holds, for each line, status information indicating whether the held data has been modified and identification information of the processor that owns the line in which the data is stored; and a processor bridge that connects the processor and the external cache memory, controls and manages each of them, clears external reference flag information assigned to each processor on the same CPU board at a process switch, and sets it when modified data that is owned by that processor and held only inside the same CPU board is reflected outside that CPU board.

The method includes: an isolation determination step of referring to the external reference flag information to determine whether the modified data stored in a way that failed during system operation has propagated outside the CPU board carrying the external cache memory; an isolation execution step of, when the isolation determination step determines that it has not propagated outside the CPU board, invalidating all modified data stored in the external cache memory and then taking the failed way out of service, without bringing the system down; and a restart step of stopping and restarting the system when the isolation determination step determines that it has propagated outside the CPU board.

[0022]

[0023]

[0024]

[0025]

[Embodiments of the Invention] Preferred embodiments of the present invention are described below with reference to the drawings.

[0026] Embodiment 1. FIG. 1 shows an embodiment of a multiprocessor computer according to the present invention. In this embodiment, a PC server system is used as an example of a multiprocessor computer. The server system has two CPU boards 2 and 20, an I/O bridge 4, and a main storage device (main memory) 6 connected to a system bus 1. The CPU boards have the same configuration, each containing processors 10 and 11, a CPU bridge 12, and an external cache 14. An external storage device 8, such as a disk device, is connected to the I/O bridge 4 through an I/O controller 7. The CPU board 2 in this embodiment can self-detect a failure that occurs inside it and identify the failure location. The server system also has a failure detection function that identifies a failed external storage device. Some failure detection functions are provided by hardware and others by software.

[0027] Next, the basic operation of this embodiment when a failure occurs is described with reference to the flowchart in FIG. 2.

[0028] When some failure occurs in the server system (step 100), the failure location is detected (step 200). If the failure location cannot be detected, or if it can be detected but the data being handled internally has become inconsistent, operation cannot continue and the system is stopped (step 500). Data becomes inconsistent when its integrity can no longer be maintained, for example when another processor has continued computing with results produced by the failed processor. In that case the processing must be rolled back and redone, so the system has to be brought down.

[0029] When the failure location has been detected, the system determines whether it can be isolated (step 300). Isolable means that the component containing the failure location can be detached while the system keeps running, without bringing the system down and without making internal data inconsistent. If isolation is possible, the failure location is detached from the system in units of detachable components (step 400). As a result, if the failed part is an I/O device, other components can be prevented from accessing it, and if it is a processor, it can be prevented from performing any processing.

[0030] If the failure location is detected but determined not to be isolable, the system cannot continue operating, so it is explicitly stopped. If the failure involves hardware and the component is removable, the component is replaced or otherwise serviced, and the system is then restarted (step 600).
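The decision flow of FIG. 2, as described in paragraphs [0028] to [0030], can be summarized in a short sketch. The function and enum names are hypothetical, and the three boolean inputs stand in for the detection and determination mechanisms, which the patent implements in hardware and software.

```python
from enum import Enum


class Outcome(Enum):
    CONTINUE_AFTER_ISOLATION = 400  # step 400: detach the component, keep running
    STOP = 500                      # step 500: stop the system
    STOP_AND_RESTART = 600          # step 600: stop, service the component, restart


def handle_failure(location_detected, data_consistent, isolable):
    """Decide the outcome for one failure (step 100), following FIG. 2."""
    # Step 200: an undetected location or inconsistent data forces a stop.
    if not location_detected or not data_consistent:
        return Outcome.STOP
    # Step 300: can the failed component be detached while the system runs?
    if isolable:
        return Outcome.CONTINUE_AFTER_ISOLATION
    # Not isolable: explicit stop, then component replacement and restart.
    return Outcome.STOP_AND_RESTART
```

Only the first outcome keeps the system running; the other two correspond to the cases in which downtime cannot be avoided.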

[0031] As described above, according to this embodiment, when a failure occurs and the consistency of the system's internal state can be maintained, the failure location is isolated without bringing the system down and operation simply continues. When the consistency of the internal state cannot be maintained, the system is brought down once, the failure location is isolated, and the system is then started again. Because the need to bring the system down is determined before the failure location is isolated, the system is never brought down unnecessarily. As a result, system reliability and availability can be improved without multiplexing components.

[0032] System downtime can be caused not only by hardware failures but also by bugs in software (OS, middleware, application programs, and so on) and by operator errors. Downtime caused by software and the like cannot be avoided even with a fault-tolerant computer. If hardware failures bring the system down more often than software and similar causes, a fault-tolerant computer is very useful; when the hardware failure rate is very small, however, the benefit of a fault-tolerant computer may not be fully realized. For example, let X be the detection rate of hardware failures, Y the isolable rate (the rate at which detected hardware can be isolated), and λ the failure rate of a hardware part. The detection rate X corresponds to the probability of moving from step 200 to step 300 in FIG. 2. The isolable rate Y corresponds to the probability of moving from step 300 to step 400 in FIG. 2. The probability Z of any outcome in FIG. 2 other than going from step 100 to step 400, in other words of ending at step 500 or step 600, is Z = Σ((1 − X*Y)*λ). Z is the probability that the system cannot continue operating and must be brought down. Therefore, in a system where Z is very small, such that Z < (downtime rate due to software causes, for example (software bug rate) + (operator error rate)), this embodiment is very useful even without multiplexed components.
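As a worked illustration of the formula for Z, the following sketch evaluates it for a list of hardware parts. It assumes that the sum runs over the hardware parts, each with its own X, Y, and λ, which the text does not state explicitly. The function names and all numeric values in the usage note are hypothetical.

```python
def forced_down_probability(parts):
    """Z = sum over parts of (1 - X*Y) * lam.

    `parts` is an iterable of (X, Y, lam) tuples:
    detection rate, isolable rate, and failure rate of one part.
    """
    return sum((1.0 - x * y) * lam for x, y, lam in parts)


def non_multiplexed_is_sufficient(parts, software_bug_rate, operator_error_rate):
    """True when Z is below the software-caused downtime rate."""
    z = forced_down_probability(parts)
    return z < software_bug_rate + operator_error_rate
```

For example, with two made-up parts `(0.9, 0.8, 0.001)` and `(1.0, 1.0, 0.002)`, the first contributes (1 − 0.72) × 0.001 = 0.00028 and the second contributes nothing, so Z is about 0.00028.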

[0033] The server system configuration shown in FIG. 1 is only an example; providing other components, or three or more CPU boards, is within the scope of the invention. Although a server system was used as the example above, the invention is of course also applicable to other systems that have a multiprocessor computer.

[0034] Embodiment 2. Embodiment 2 describes, as an example, the case where the detachable component in the basic operation of Embodiment 1 is a processor.

[0035] Normally, in a non-multiplexed multiprocessor computer, if even one processor fails, data consistency across the whole system is lost and the system has to be brought down. Because the failure rate of the system rises in proportion to the number of processors, a computer with many processors is less reliable for that reason alone. Multiplexing the processors is conceivable, but as described above it has significant drawbacks in cost and other respects.

[0036] This embodiment therefore presents a concrete example in which the above object is achieved by detaching a failed processor on a CPU board.

[0037] Configuration of this embodiment. The system configuration of this embodiment is the same as in FIG. 1. FIG. 3 is a detailed configuration diagram of the CPU board 2 of FIG. 1 in this embodiment. The CPU board 2 carries processors 10 and 11, which have internal caches 18 and 19 respectively; a CPU bridge 12 that connects all the processors 10 and 11; an external cache 14 that is connected to the CPU bridge 12 and includes the contents of the internal caches 18 and 19; and a CPU bus 16 that connects all the processors 10 and 11 to the CPU bridge 12.

[0038] The processors 10 and 11 used in this embodiment are assumed to be current commercial products such as the Intel Pentium processor. That is, the processors 10 and 11 contain a cache memory and have a self-failure detection function with relatively high coverage. The CPU bus 16 has error detection and correction means such as parity/ECC on its data lines, and means for detecting errors through protocol checking on its control lines, so failures occurring on the CPU bus 16 can be detected. The CPU board 2 thus has a self-failure detection function that can detect failures occurring in each component mounted on it.

[0039] The CPU bridge 12 has a CPU bus request issuing circuit 24, which issues requests to the processors 10 and 11 over the CPU bus 16; a CPU bus receiving circuit 38, which receives data and the like from the processors 10 and 11; a system bus request issuing circuit 30, which issues requests to the system bus 1; a system bus receiving circuit 40, which receives data and the like from other components over the system bus 1; a cache control circuit 28, which manages addresses for the external cache 14; and an error detection unit 36, which among other things detects errors occurring in the external cache 14. With these, the CPU bridge 12 controls and manages the processors 10 and 11 and the external cache 14. Its other components are described later.

[0040] FIG. 4 shows a setting example of the management table 32 that the CPU bridge 12 holds in order to manage the processors 10 and 11 mounted on the same CPU board. For each processor, the management table 32 holds a processor number, which is the identification information of the processor, together with a data output flag, an other-processor reference flag, and a failure flag. The content of each flag is described later.
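
The management table described above can be pictured as one small record per processor. The following is a minimal sketch only: the class and field names (`ProcessorEntry`, `data_output`, `other_proc_ref`, `failed`) are hypothetical and are not taken from the specification, which describes a hardware table rather than software.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorEntry:
    """One row of the management table (field names are illustrative assumptions)."""
    processor_number: int
    data_output: bool = False     # modified data was reflected outside the CPU board
    other_proc_ref: bool = False  # another processor on the same board accessed a Modify line
    failed: bool = False          # processor was disconnected after a failure


@dataclass
class ManagementTable:
    """Per-board table held by the CPU bridge, keyed by processor number."""
    entries: dict = field(default_factory=dict)

    def add(self, processor_number: int) -> ProcessorEntry:
        entry = ProcessorEntry(processor_number)
        self.entries[processor_number] = entry
        return entry


# A board carrying processors 10 and 11 starts with every flag cleared.
table = ManagementTable()
for number in (10, 11):
    table.add(number)
```

All three flags start cleared, which matches the initial state the description assumes for a healthy processor.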

[0041] The external cache 14 is composed of a plurality of ways, each containing a plurality of lines. It can be divided on a per-way basis, and each way can be set to be usable or unusable. The data in the internal caches 18 and 19 of the processors 10 and 11 is always also held in the external cache 14. The external cache 14 and the internal caches 18 and 19 hold each piece of data together with status information for that data. In the present embodiment, the status information is held for each line, and the states Modify, Exclusive, Shared, and Invalid are provided. However, only the Modify state is needed to provide the functions that characterize the present embodiment. In the following description, the status information is referred to as the cache state, and a line of the internal caches 18 and 19 or of the external cache 14 that stores modified data is referred to as a "Modify line".
FIG. 5 is a conceptual diagram of the external cache 14 in the present embodiment and shows the structure of way 0 only, as a representative example. The other ways have the same structure as way 0. For each cache line, each way stores the index, the tag address, the cache state, and the Modify owner processor number, which is one of the features of the present embodiment, in association with the data. The Modify owner processor number holds the identification information, that is, the processor number, of the processor that owns a line whose cache state is Modify. In other words, the external cache 14 records which processor each Modify line belongs to.
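
The per-line layout described for FIG. 5 can be sketched as follows. This is an illustrative model under stated assumptions: the names `State`, `CacheLine`, `make_way`, and `modify_lines_owned_by` are hypothetical, and the example tags and sizes are arbitrary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    MODIFY = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


@dataclass
class CacheLine:
    """One line of one way: index, tag, cache state, and Modify owner (hypothetical names)."""
    index: int
    tag: int = 0
    state: State = State.INVALID
    modify_owner: Optional[int] = None  # processor number; meaningful only in the MODIFY state


def make_way(num_lines: int) -> list:
    """Build one way whose lines all start out Invalid with no owner."""
    return [CacheLine(index=i) for i in range(num_lines)]


def modify_lines_owned_by(ways, processor_number):
    """Return (way number, line index) for every Modify line owned by the given processor."""
    return [(w, line.index)
            for w, way in enumerate(ways)
            for line in way
            if line.state is State.MODIFY and line.modify_owner == processor_number]


# Two ways of four lines each; processor 10 owns one Modify line, processor 11 another.
ways = [make_way(4) for _ in range(2)]
ways[0][1] = CacheLine(index=1, tag=0x1A, state=State.MODIFY, modify_owner=10)
ways[1][3] = CacheLine(index=3, tag=0x2B, state=State.MODIFY, modify_owner=11)
```

Keeping the owner next to the state is what later allows the lines of a single failed processor to be found and invalidated without touching the lines of other processors.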

[0042] Depending on the number of processors, some CPU boards are not equipped with a CPU bridge or an external cache. In the present embodiment, however, the CPU board 2 equipped with the CPU bridge 12 and the external cache 14 is used so that multiprocessor computer systems of a certain scale can also be supported, and the CPU board 2 is assumed to be capable of detecting its own failures without multiplexing the processors 10 and 11.

[0043] Operation during normal operation

The specific operation of the present embodiment, which is based on the basic operation at the time of a failure shown in Embodiment 1, is described next. Before that, however, the processing performed during normal operation of the server system is described. First, how the modified data of each line in the internal caches 18 and 19 is written back is described.

[0044] In addition to the components described above, the CPU bridge 12 of the present embodiment has a port 22 provided for each processor as means for accepting a request to start write-back processing of the modified data held in the internal caches 18 and 19. Unless otherwise noted, the following description takes the processing for the processor 10 as an example.

[0045] At a process switch, if the internal cache 18 of the processor 10 to which a new process is about to be allocated holds data in lines whose cache state is Modify, the operating system (OS) writes that data back to the external cache 14. This is done as follows. At the process switch, the OS requests write-back processing by accessing the port 22 corresponding to the processor 10. In response to this access, the CPU bridge 12 issues an invalidation request to the processor 10 corresponding to that port 22 from the CPU bus request issuing circuit 24 via the write-back circuit 26. The invalidation request is a write-back request that causes the processor to write its modified data back to the external cache 14. As a result, the data of all Modify lines held by the processor 10 is written back to the external cache 14, and the cache state of each written-back line is updated from Modify to Invalid. Furthermore, in response to a command from the OS, the CPU bridge 12 issues a write-back request from the system bus request issuing circuit 30 via the write-back circuit 26, and all of the data written back to the external cache 14 is written back to the main memory 6. At this time, the cache state of each written-back line is updated from Modify to Invalid by the external cache control unit 28.
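
The two-stage write-back at a process switch can be sketched as a small simulation. This is a simplified model, not the circuit itself: the function name and the dictionary layout are assumptions, and the caches are reduced to address-keyed dictionaries.

```python
def write_back_on_process_switch(proc, internal, external, memory):
    """Flush processor `proc` in two stages, as at a process switch.

    internal: {address: data} for the Modify lines in the processor's internal cache
    external: {address: {"data", "state", "owner"}} for the external cache
    memory:   {address: data} for main memory
    (All three structures are illustrative assumptions.)
    """
    # Stage 1: the invalidation request moves internal Modify lines to the external cache.
    for addr, data in list(internal.items()):
        external[addr] = {"data": data, "state": "M", "owner": proc}
        del internal[addr]  # the internal line becomes Invalid

    # Stage 2: the write-back request moves the processor's external Modify lines to memory.
    for addr, line in external.items():
        if line["state"] == "M" and line["owner"] == proc:
            memory[addr] = line["data"]
            line["state"] = "I"
            line["owner"] = None


internal_10 = {0x100: "a", 0x140: "b"}
external = {0x180: {"data": "c", "state": "M", "owner": 11}}
memory = {}
write_back_on_process_switch(10, internal_10, external, memory)
```

After the call, processor 10 owns no Modify line in either cache, while the line owned by processor 11 is left untouched.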

[0046] In this way, at a process switch, all Modify lines are eliminated from the internal cache of the processor being processed. Furthermore, the external cache 14 no longer holds any data owned by that processor. Thus, during normal operation, the data in each cache is written back at every process switch. In the write-back processing of the Modify lines described above, the cache state is changed from Modify to Invalid, but an appropriate cache state may instead be provided so that the line is updated to a cache state other than Invalid.

[0047] In the present embodiment, as shown in FIG. 5, the external cache 14 records which processor each Modify line belongs to. Accordingly, when the write-back processing of the modified data corresponding to a processor is performed at a process switch and all Modify lines owned by that processor are eliminated, the cache state and the Modify owner processor number are initialized. Thereafter, when the data in the internal cache 18 is updated by the execution of a new process, the content of the external cache 14 is also updated. At that time, the external cache control unit 28 in the CPU bridge 12 updates the cache state to Modify and sets the Modify owner processor number in the external cache 14. The external cache 14 is updated in this way during normal operation. In addition, in the present embodiment, the management table 32 is updated as described below.

[0048] As shown in FIG. 4, the CPU bridge 12 holds, in the management table 32, a data output flag for each processor as external reference flag information. As described above, the external cache 14 stores all of the data held in the internal caches 18 and 19 of the processors 10 and 11, and holds, for each line, the processor number of the owner of a Modify line. The data output flag indicates, with reference to the processor numbers in the external cache 14, whether each of the processors 10 and 11 has a Modify line in the external cache 14.

[0049] In the present embodiment, a new process is allocated to the processor 10 at a process switch. The modified data held by the processor 10 is written back to the external cache 14 and then to the main memory 6, so that no Modify line owned by the processor 10 remains in the external cache 14. Accordingly, the data output flag corresponding to the processor 10 is cleared. The data output flag is then set during execution of the process in the following cases: (1) when the processor 10 issues an I/O request to the CPU bus 16 and the system bus 1; (2) when the processor 10 issues a write-through write request to the CPU bus 16 and the system bus 1; (3) when a Modify line owned by the processor 10 in the external cache 14 is selected for eviction and is written back to the main memory 6; and (4) when a Modify line owned by the processor 10 in the external cache 14 hits a request issued from another CPU board via the system bus 1 and is output to the system bus 1 and read. In other words, the data output flag is set when the data of a Modify line owned by the processor 10 in the external cache 14 (the modified data) is accessed by a component outside the CPU board 2 on which the processor 10 is mounted and the content of the modified data is reflected there.
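
The set and clear rules for the data output flag can be summarized in a short sketch. The event names and the class below are hypothetical labels for cases (1) to (4) in the text; `LOCAL_CACHE_WRITE` is an added assumption standing for any activity that stays on the board and therefore leaves the flag alone.

```python
from enum import Enum, auto


class BusEvent(Enum):
    IO_REQUEST = auto()           # (1) I/O request issued to the CPU bus and system bus
    WRITE_THROUGH = auto()        # (2) write-through write request
    MODIFY_LINE_EVICTED = auto()  # (3) owned Modify line evicted to main memory
    REMOTE_HIT = auto()           # (4) owned Modify line hit by another CPU board's request
    LOCAL_CACHE_WRITE = auto()    # assumed example of an event that stays on the board


# Events that make the processor's modified data visible outside its CPU board.
SETS_DATA_OUTPUT = {
    BusEvent.IO_REQUEST,
    BusEvent.WRITE_THROUGH,
    BusEvent.MODIFY_LINE_EVICTED,
    BusEvent.REMOTE_HIT,
}


class DataOutputFlags:
    """Per-processor data output flags (a sketch, not the actual circuit)."""

    def __init__(self, processors):
        self.flags = {p: False for p in processors}

    def on_process_switch(self, proc):
        # The processor's modified data has been written back, so the flag is cleared.
        self.flags[proc] = False

    def on_event(self, proc, event):
        if event in SETS_DATA_OUTPUT:
            self.flags[proc] = True


flags = DataOutputFlags([10, 11])
flags.on_event(10, BusEvent.LOCAL_CACHE_WRITE)  # stays cleared
flags.on_event(10, BusEvent.REMOTE_HIT)         # now set for processor 10 only
```

The flag is sticky between process switches: once any of the four events occurs, it stays set until the next write-back for that processor.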

[0050] The data output flag is set and cleared in this way while the system is running. According to the present embodiment, providing the data output flag makes it possible to detect that a Modify line has been accessed from outside, for example by a processor on another CPU board 20.

[0051] Furthermore, as shown in FIG. 4, the CPU bridge 12 holds, in the management table 32, an other-processor reference flag for each processor. The other-processor reference flag indicates whether a Modify line owned by a given processor in the external cache 14 has been accessed by another processor on the same CPU board as that processor.

[0052] In the present embodiment, a new process is allocated to the processor 10 at a process switch. The modified data held by the processor 10 is written back to the external cache 14 and then to the main memory 6, so that no Modify line owned by the processor 10 remains in the external cache 14. Accordingly, the other-processor reference flag corresponding to the processor 10 is cleared. When a new process is then executed on the processor 10 and data is updated, a Modify line owned by the processor 10 is held in the external cache 14 as described above. At this time, the Modify owner processor number is set in the external cache 14.

[0053] When a processor other than the processor 10 on the same CPU board 2 (the processor 11 in the example of FIG. 3) accesses a Modify line owned by the processor 10 in the external cache 14, the CPU bridge 12 sets the other-processor reference flag of the referenced processor 10. In other words, the other-processor reference flag is set when the data of a Modify line owned by the processor 10 in the external cache 14 (the modified data) is accessed by another component on the same CPU board 2 as the processor 10 and the content of the modified data is reflected there. Whether the other processor 11 has accessed a Modify line owned by the processor 10 can easily be determined by referring to the Modify owner processor number.
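
The owner comparison described above can be sketched in a few lines. The function name and argument layout are assumptions made for illustration; the point is only that the requester is compared with the Modify owner processor number stored with the line.

```python
def record_local_access(requester, line_state, line_owner, other_ref_flags):
    """Set the owner's other-processor reference flag when a different processor
    on the same board accesses one of its Modify lines (hypothetical helper)."""
    if line_state == "M" and line_owner is not None and line_owner != requester:
        other_ref_flags[line_owner] = True


other_ref = {10: False, 11: False}
record_local_access(10, "M", 10, other_ref)    # owner accessing its own line: no change
record_local_access(11, "S", None, other_ref)  # non-Modify line: no change
record_local_access(11, "M", 10, other_ref)    # processor 11 reads processor 10's Modify line
```

Note that the flag that gets set belongs to the owner of the line (processor 10 here), not to the processor that made the access.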

[0054] According to the present embodiment, providing the other-processor reference flag makes it possible to detect that a Modify line has been accessed by another processor on the same CPU board.

[0055] Operation when a failure occurs

In the present embodiment, the external cache 14 and the management table 32 are updated as described above during normal operation of the server system. Next, the operation performed when a failure occurs in a processor in the server system is described with reference to the flowchart shown in FIG. 6. Here too, the case where a failure occurs in the processor 10 mounted on the CPU board 2 is taken as an example. Steps corresponding to those in the flowchart of the basic operation shown in FIG. 2 are given reference numerals with the same hundreds digit.

[0056] When the processor 10 detects the occurrence of a failure during execution of a process by means of its self-failure detection function (step 110), it generates a failure interrupt and sends a failure detection signal to the CPU bridge 12 via an error line. When the disconnection determination circuit 34 in the CPU bridge 12 recognizes that it has received the failure detection signal from the processor 10, it determines, on the basis of a predetermined disconnection condition, whether the influence of the failed processor 10 has propagated elsewhere (step 210). The CPU bridge 12 does this by referring to the settings of the data output flag and the other-processor reference flag. If the data output flag is cleared, it can be determined that the Modify lines owned by the processor 10 in the external cache 14 have not been accessed from outside the CPU board 2. If the other-processor reference flag is cleared, it can be determined that the processor 11 on the same CPU board 2 has not accessed the Modify lines owned by the processor 10 in the external cache 14. Therefore, if the disconnection condition that both flags are cleared is satisfied, it can be determined that the data modified by the failed processor 10 has not been referenced by any processor or other component and that the influence has not propagated to other processors. Conversely, if the data output flag or the other-processor reference flag corresponding to the processor 10 is set, it can be determined that the influence has propagated to another processor.
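
The disconnection condition of step 210 reduces to a check that both flags are cleared. The sketch below assumes the two flags are available as booleans; the function name is hypothetical.

```python
def can_disconnect(data_output_flag: bool, other_proc_ref_flag: bool) -> bool:
    """Step 210 as a predicate: the failed processor may be isolated only if its
    modified data was seen neither outside the board nor by a sibling processor."""
    return not data_output_flag and not other_proc_ref_flag


# Enumerate all four flag combinations for the failed processor.
decisions = {
    (data_out, other_ref): can_disconnect(data_out, other_ref)
    for data_out in (False, True)
    for other_ref in (False, True)
}
```

Only the combination in which both flags are cleared permits isolation while the system keeps running; the other three combinations mean the influence may have propagated.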

[0057] If, as a result of the above determination, the predetermined disconnection condition is satisfied and it is determined that the processor 10 can be isolated (step 310), the cache state of every Modify line owned by the processor 10 in the external cache 14 is changed from Modify to Invalid in order to invalidate the data of that line (step 410). At this time, the content of the Modify owner processor number is also cleared. The disconnection determination circuit 34 then disconnects the processor 10 by sending a reset signal to the processor 10 via a reset line (step 411), and sets the failure flag corresponding to the processor 10 in the management table 32. A failure flag is provided for each processor. It is in the initialized (cleared) state while the processor is normal, and the flag corresponding to a failed processor is set when that processor is disconnected from the CPU bridge 12 because of the failure. This allows the OS to operate so that it does not allocate processes to a processor whose failure flag is set.
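
Steps 410 and 411 and the failure flag can be sketched as follows. This is an illustrative model only: the external cache is reduced to a list of dictionaries, the reset line is represented by a callback, and the names `isolate_processor` and `schedulable` are hypothetical.

```python
def isolate_processor(proc, external_lines, failure_flags, reset):
    """Isolate a failed processor once the disconnection condition has been met."""
    # Step 410: invalidate every Modify line owned by the failed processor
    # and clear its Modify owner processor number.
    for line in external_lines:
        if line["state"] == "M" and line["owner"] == proc:
            line["state"] = "I"
            line["owner"] = None
    # Step 411: hold the processor in reset (modelled here as a callback).
    reset(proc)
    # Record the failure so that no process is allocated to this processor again.
    failure_flags[proc] = True


def schedulable(failure_flags):
    """Processors the OS may still allocate processes to."""
    return [p for p, failed in sorted(failure_flags.items()) if not failed]


lines = [
    {"state": "M", "owner": 10},
    {"state": "M", "owner": 11},
    {"state": "S", "owner": None},
]
failure_flags = {10: False, 11: False}
reset_log = []
isolate_processor(10, lines, failure_flags, reset_log.append)
```

Lines owned by the healthy processor 11 keep their Modify state, which is why the per-line owner number matters.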

[0058] On the other hand, if it is determined that the predetermined disconnection condition is not satisfied and the processor 10 cannot be isolated, it is judged that data consistency cannot be maintained; the system is stopped and, in some cases, restarted (step 610). Then, if necessary, the OS allocates the process that was being executed on the processor 10 to another processor and executes it again.

[0059] As described above, according to the present embodiment, the data output flag and the other-processor reference flag are provided so that it can easily be detected whether another processor has accessed a Modify line owned by the failed processor. If there has been no such access, data consistency is not lost even when the failed processor 10 is disconnected, so the processor can be isolated without bringing down the system.

[0060] The present embodiment has been described using the example of a server system in which a plurality of CPU boards, each carrying a plurality of processors, are connected. Accordingly, in the processing of step 210 shown in FIG. 6, if the server system has a plurality of CPU boards each carrying a single processor, only the presence or absence of access from processors on other CPU boards needs to be determined as the disconnection condition. Likewise, if only one CPU board carrying a plurality of processors is connected, only the presence or absence of access from other processors on the same CPU board needs to be determined as the disconnection condition.
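
The configuration-dependent variants of the disconnection condition can be sketched as one function. The parameter names `boards` and `processors_per_board` are assumptions introduced for illustration; the specification only distinguishes the three configurations in prose.

```python
def disconnection_condition(data_output, other_proc_ref, boards, processors_per_board):
    """Return True if the failed processor may be isolated, checking only the
    flags that are relevant to the given system configuration (a sketch)."""
    check_external = boards > 1              # other CPU boards could have seen the data
    check_local = processors_per_board > 1   # a sibling processor could have seen the data
    if check_external and data_output:
        return False
    if check_local and other_proc_ref:
        return False
    return True


# Several boards, one processor each: only the data output flag matters.
single_proc_boards = disconnection_condition(False, True, boards=2, processors_per_board=1)
# One board, several processors: only the other-processor reference flag matters.
single_board = disconnection_condition(True, False, boards=1, processors_per_board=2)
```

With several boards that each carry several processors, both checks apply, which is the case described for step 210 in the main embodiment.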

[0061] In the present embodiment, the means for determining whether a failed processor can be disconnected and the means for disconnecting the failed processor are mounted in the CPU bridge 12, but they may instead be provided as separate components.

[0062] Embodiment 3. Embodiment 2 above was described using the example of a failure occurring in a processor. The present embodiment shows a specific example that achieves the above object when a failure occurs in the external cache 14. The external cache 14 of the present embodiment has a self-failure detection function based on parity or ECC checking. Here it is assumed that the occurrence of a failure can be identified on a per-way basis and that the cache can be degenerated (placed in an unusable state) on a per-way basis.

[0063] Next, the operation performed when a failure occurs in the external cache 14 in the server system is described with reference to the flowchart shown in FIG. 7. Steps corresponding to those in the flowchart of the basic operation shown in FIG. 2 are given reference numerals with the same hundreds digit. The processing performed during normal operation of the server system, such as the write-back processing at a process switch and the updating of the flags in the management table, is the same as in Embodiment 2, and its description is therefore omitted.

[0064] When the external cache 14 detects the occurrence of a failure in one of its ways by means of its self-failure detection function (step 150), it generates a failure interrupt and sends, for example, a data parity signal to the CPU bridge 12 via an error line. When the error detection unit 36 in the CPU bridge 12 recognizes that it has received the data parity signal, it determines, on the basis of a predetermined disconnection condition, whether the influence of the failed way has propagated elsewhere, for example whether the data of a Modify line in the failed way has been accessed by a processor on another CPU board 20 (step 250). This is because data consistency can no longer be maintained for a Modify line. Lines whose cache state is other than Modify need not be considered in the determination, because their data also exists in the main memory 6 and no problem arises if such a line is simply made Invalid. In the present embodiment, the CPU bridge 12 performs the determination by referring to the settings of the data output flags. If the predetermined disconnection condition that the data output flags of all the processors 10 and 11 are cleared for the failed way is satisfied, it can be determined that the modified data in the CPU board has not been accessed from outside and has not affected anything else. Conversely, if any of the data output flags is set for the failed way, it can be determined that the modified data held in the failed way may have been accessed from outside.
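
The determination of step 250 can be sketched as a check over the data output flags of every processor on the board. The function name is hypothetical, and the flags are assumed to be available as a mapping from processor number to boolean.

```python
def can_degenerate_failed_way(data_output_flags):
    """Step 250 as a predicate: the failed way may be degenerated only if no
    processor on the board has its data output flag set."""
    return not any(data_output_flags.values())


all_clear = can_degenerate_failed_way({10: False, 11: False})
one_set = can_degenerate_failed_way({10: False, 11: True})
```

Unlike the processor case, a single flag set for any processor is enough to block degeneration, because the failed way may hold Modify lines of every processor on the board.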

[0065] Accordingly, if, as a result of the determination, the predetermined disconnection condition is satisfied and it is determined that the failed way can be degenerated (step 350), the cache state of every Modify line is changed to Invalid in order to invalidate all data in the external cache 14 (step 450). This processing means that all data in the external cache 14 is discarded. The reason is that a processor that owns a Modify line in the failed way may also have Modify lines in other, normal ways, and discarding all the data allows data consistency to be maintained. At this time, the content of the Modify owner processor number is also cleared. The error detection unit 36 then degenerates the failed way (step 451). The cache control unit 28 internally manages, for each way, a degeneration flag that controls whether data may be written to that way. By setting this degeneration flag, it ensures that no data is loaded into the degenerated way thereafter. Then, if necessary, the OS allocates the process that was being executed to one of the processors and executes it again.
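
Steps 450 and 451 can be sketched with a small model of the external cache. The class below is an illustrative assumption, not the described hardware: lines are dictionaries, and the per-way degeneration flag is a list of booleans.

```python
class ExternalCache:
    """Toy model of a multi-way external cache with per-way degeneration flags."""

    def __init__(self, num_ways, lines_per_way):
        self.ways = [
            [{"state": "I", "owner": None} for _ in range(lines_per_way)]
            for _ in range(num_ways)
        ]
        self.degenerated = [False] * num_ways  # True means no further loads into the way

    def degenerate(self, failed_way):
        # Step 450: discard the modified data in every way, not only the failed one,
        # since the same processor may own Modify lines in healthy ways too.
        for way in self.ways:
            for line in way:
                if line["state"] == "M":
                    line["state"] = "I"
                    line["owner"] = None
        # Step 451: mark the failed way so that it is never loaded again.
        self.degenerated[failed_way] = True

    def usable_ways(self):
        return [w for w, flag in enumerate(self.degenerated) if not flag]


cache = ExternalCache(num_ways=2, lines_per_way=2)
cache.ways[0][0] = {"state": "M", "owner": 10}
cache.ways[1][1] = {"state": "M", "owner": 10}
cache.degenerate(failed_way=0)
```

The sketch follows the literal wording of the text and changes only Modify lines; lines in other states already have a valid copy in main memory.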

[0066] In this way, even if a failure occurs in one of the ways of the external cache 14, degenerating only that way without bringing down the system allows the system to continue operating while data consistency is maintained.

[0067] In the present embodiment, the presence or absence of access from outside is determined by using the data output flag set in Embodiment 2. Alternatively, dedicated external reference flag information may be provided for each way for the determination processing of the present embodiment. Like the data output flag, the dedicated external reference flag information is cleared at a process switch and is set when the modified data held in the way is reflected outside the CPU board 2.

[0068] The system configuration shown in FIG. 1 is merely an example, and the number of CPU boards, the number of processors, the configuration of the devices connected to the system bus, and the like are not limited to those shown.

[0069]

[Effects of the Invention] According to the present invention, when a failure occurs, it is determined whether the faulty part can be isolated without bringing down the system. If isolation is determined to be possible, the faulty part is disconnected while the system continues to operate. If isolation is determined to be impossible, the system is first brought down, the faulty part is disconnected, and the system is then restarted. That is, whether the system needs to be brought down in order to isolate the faulty part is determined before the isolation, so that a failure can be recovered from without bringing down the system unnecessarily, and the reliability and availability of the system can be improved without multiplexing the processors. In particular, in the present invention, providing the external reference flag information makes it possible to detect that a line of the external cache memory storing data modified by the failed processor has been accessed from outside the CPU board on which the failed processor is mounted. Accordingly, when it is determined that the line has not been accessed from outside the CPU board and data consistency can be maintained, the failed processor can be isolated without bringing down the system.

[0070]

[0071]

[0072] In addition, providing the other-processor reference flag information makes it possible to detect that a line of the external cache memory storing modified data has been accessed by another processor on the same CPU board. Accordingly, when it is determined that the line owned by the failed processor has not been accessed by another processor on the same CPU board and data consistency can be maintained, the failed processor can be isolated without bringing down the system.

[0073] In addition, providing the failure flag information makes it possible to avoid allocating processes to a failed processor that has been disconnected.

[0074] In addition, providing the external reference flag information makes it possible to detect that the modified data held in each way has been accessed from outside the CPU board on which the way is mounted. Accordingly, when it is determined that the modified data held in the failed way has not been accessed from outside the CPU board and data consistency can be maintained, the failed way can be isolated without bringing down the system, and the system can continue to operate using the remaining ways.

[Brief description of the drawings]

FIG. 1 is a diagram showing an embodiment of a multiprocessor computer according to the present invention.

FIG. 2 is a flowchart showing the basic operation in the first embodiment.

FIG. 3 is a block diagram of a CPU board according to the second embodiment.

FIG. 4 is a diagram showing an example of the contents of the management table held by the CPU bridge in the second embodiment.

FIG. 5 is a conceptual diagram of the external cache in the second embodiment.

FIG. 6 is a flowchart showing the operation when a failure occurs in a processor in the second embodiment.

FIG. 7 is a flowchart showing the operation when a failure occurs in the external cache in the third embodiment.

[Explanation of symbols]

1 system bus, 2, 20 CPU boards, 4 I/O bridge, 6 main storage device (main memory), 8 external storage device, 10, 11 processors, 12 CPU bridge, 14 external cache, 16 CPU bus, 18, 19 built-in caches, 22 port, 24 CPU bus request issuing circuit, 26 write-back circuit, 28 cache control circuit, 30 system bus request issuing circuit, 32 management table, 34 disconnection determination circuit, 36 error detection unit, 38 CPU bus receiving circuit, 40 system bus receiving circuit.

Continuation of the front page (58) Fields searched (Int. Cl.⁶, DB name): G06F 15/177 678, G06F 15/177 672, G06F 11/20 310

Claims

(57) [Claims]

1. A multiprocessor computer comprising: at least one CPU board that has a self-fault detection function and on which are mounted at least one processor having a built-in cache memory that holds data together with status information on whether the data has been modified, an external cache memory that holds, for each line and including the contents of the built-in cache memory, status information on whether the held data has been modified and identification information of the processor that is the owner of each line in which data is stored, and a processor bridge that connects the processor and the external cache memory; a main memory; isolation determination means for determining, based on predetermined disconnection conditions, whether a fault location can be isolated without shutting down the system; isolation execution means for disconnecting the fault location during system operation when the isolation determination means determines that isolation is possible; and restart means for stopping and restarting the system when the isolation determination means determines that isolation is not possible; wherein the processor bridge clears external reference flag information assigned to each processor on the same CPU board at the time of a process switch, and sets it when modified data owned by the processor and held only inside the same CPU board is reflected outside the same CPU board; the isolation determination means determines, by referring to the external reference flag information, whether the influence from the faulty processor has been transmitted to the outside; and the isolation execution means, when the isolation determination means determines that the influence has not been transmitted to the outside, invalidates all data modified by the faulty processor, identified on the basis of the identification information, that is held in the CPU board on which the faulty processor is mounted, and then disconnects the faulty processor.

2. The multiprocessor computer according to claim 1, wherein the processor bridge has means for receiving, for each processor, a start request for write-back processing of the modified data in the built-in cache memory of that processor, and means for issuing a write-back processing request for performing write-back processing of the modified data to the external cache memory, and wherein, at the time of a process switch, the modified data held in the built-in cache memory of the processor that is the target of the processing is written back to the main memory.

3. The multiprocessor computer according to claim 1, wherein the isolation determination means determines that the influence from the faulty processor has not been transmitted to the outside when the external reference flag information corresponding to the faulty processor is clear.

4. The multiprocessor computer according to claim 1, wherein the processor bridge clears other-processor reference flag information assigned to each processor on the same CPU board at the time of a process switch, and sets it when a line of the external cache memory in which modified data owned by the processor is stored is accessed by another processor on the same CPU board; the isolation determination means determines, by referring to the other-processor reference flag information, whether the influence from the faulty processor has been transmitted to another processor on the same CPU board; and the isolation execution means, when it is determined that the influence has not been transmitted to another processor on the same CPU board, invalidates all data modified by the faulty processor, identified on the basis of the identification information of the processor, that is held in the CPU board on which the faulty processor is mounted, and then disconnects the faulty processor.

5. The multiprocessor computer according to claim 4, wherein the isolation determination means determines that the influence from the faulty processor has not been transmitted to another processor on the same CPU board when the other-processor reference flag information corresponding to the faulty processor is clear.

6. The multiprocessor computer according to claim 1, wherein the isolation execution means, upon receiving a failure detection signal issued by the faulty processor through self-fault detection, disconnects the faulty processor from the processor bridge.

7. The multiprocessor computer according to claim 6, wherein the isolation execution means holds failure flag information indicating, for each processor, whether the processor has been disconnected from the processor bridge.

8. The multiprocessor computer according to claim 1, wherein the processor bridge has the isolation determination means.

9. The multiprocessor computer according to claim 1, wherein the processor bridge has the isolation execution means.

10. A fault recovery method in a multiprocessor computer having at least one CPU board that has a self-fault detection function and on which are mounted: at least one processor having a built-in cache memory that holds data together with status information on whether the data has been modified; an external cache memory that holds, for each line and including the data held in the built-in cache memory, status information on whether the held data has been modified and identification information of the processor that is the owner of each line in which data is stored; and a processor bridge that connects the processor and the external cache memory, performs control and management of each, clears external reference flag information assigned to each processor on the same CPU board at the time of a process switch, and sets it when modified data owned by the processor and held only inside the same CPU board is reflected outside the same CPU board; the method comprising: an isolation determination step of determining, during system operation and by referring to the external reference flag information, whether the influence from the faulty processor has been transmitted to the outside; an isolation execution step of, when it is determined in the isolation determination step that the influence from the faulty processor has not been transmitted to the outside, invalidating, without shutting down the system, all data modified by the faulty processor that is held in the CPU board on which the faulty processor is mounted, and then disconnecting the faulty processor; and a restart step of stopping and restarting the system when it is determined in the isolation determination step that the influence from the faulty processor has been transmitted to the outside.

11. The fault recovery method in a multiprocessor computer according to claim 10, wherein the processor bridge clears other-processor reference flag information assigned to each processor on the same CPU board at the time of a process switch, and sets it when a line of the external cache memory in which modified data owned by the processor is stored is accessed by another processor on the same CPU board; the isolation determination step determines, by referring to the other-processor reference flag information, whether the influence from the faulty processor has been transmitted to another processor on the same CPU board; and the isolation execution step, when it is determined that the influence from the faulty processor has not been transmitted to another processor on the same CPU board either, invalidates all data modified by the faulty processor that is held in the CPU board on which the faulty processor is mounted, and then disconnects the faulty processor.

12. The multiprocessor computer according to claim 1, wherein the external cache memory has a structure that can be divided in units of ways; the isolation determination means determines whether the influence from a way in which a failure has occurred has been transmitted outside the CPU board on which the external cache memory is mounted; and the isolation execution means, when it is determined that the influence from the faulty way has not been transmitted outside the CPU board, invalidates all modified data stored in the external cache memory and then degenerates the faulty way.

13. The multiprocessor computer according to claim 12, wherein the isolation determination means determines whether the modified data stored in the faulty way has been reflected to the outside by referring to the external reference flag information assigned to the processor specified by the identification information of the processor that is the owner of each line included in the faulty way.

14. A fault recovery method in a multiprocessor computer having at least one CPU board that has a self-fault detection function and on which are mounted: at least one processor having a built-in cache memory that holds data together with status information on whether the data has been modified; an external cache memory that has a structure that can be divided into ways each containing a plurality of lines and that holds, for each line and including the data held in the built-in cache memory, status information on whether the held data has been modified and identification information of the processor that is the owner of each line in which data is stored; and a processor bridge that connects the processor and the external cache memory, performs control and management of each, clears external reference flag information assigned to each processor on the same CPU board at the time of a process switch, and sets it when modified data owned by the processor and held only inside the same CPU board is reflected outside the same CPU board; the method comprising: an isolation determination step of determining, during system operation and by referring to the external reference flag information, whether the modified data stored in the way in which a failure has occurred has been transmitted outside the CPU board on which the external cache memory is mounted; an isolation execution step of, when it is determined in the isolation determination step that the data has not been transmitted outside the CPU board, invalidating, without shutting down the system, all modified data stored in the external cache memory and then degenerating the faulty way; and a restart step of stopping and restarting the system when it is determined in the isolation determination step that the data has been transmitted outside the CPU board.