JP2566356B2

JP2566356B2 - Fault-tolerant multiprocessor computer system

Info

Publication number: JP2566356B2
Application number: JP4138895A
Authority: JP
Inventors: デヴィッド・エス・エドワーズ; ウィリアム・エイ・シェリー; ジウィー・チャン; ミノル・イノシタ; レナード・ジー・トルビスキ
Original assignee: Bull HN Information Systems Inc
Current assignee: Bull HN Information Systems Inc
Priority date: 1991-05-31
Filing date: 1992-05-29
Publication date: 1996-12-25
Anticipated expiration: 2011-12-25
Also published as: JPH0628251A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to computer systems, and more particularly to a multiprocessor computer system that is fault tolerant with respect to errors in processor siphon cache storage.

[0002]

[Prior Art] As personal computers and workstations become more and more powerful, one of the main problems facing traditional mainframe vendors is differentiating their medium-sized systems from the rapidly advancing, relatively small machines. One important area in which mainframe machines can be differentiated from smaller machines is fault tolerance.

[0003] Processor cache storage errors have been a problem throughout the history of cache memory use in mainframe systems. Like main memory errors, these errors can be caused by alpha particle strikes or by transient (or hard) storage element failures. In the example system in which the present invention finds use, single-bit errors in main memory are hidden from system visibility by dedicated hardware in the memory controller, which corrects the erroneous bit before the word containing the fault is sent to the requesting unit. Processor cache failures, however, are not corrected during cache read activity, because correction hardware was not designed into the processor for a number of reasons, including the limited integrated-circuit area available on the VLSI chips.

[0004] The benefits of processor cache memory greatly outweigh the complications that arise when it fails. Cache memory provides fast access to data and instructions that the processor would otherwise have to fetch from main memory on every reference. A cache memory access typically takes 10 to 25% of the time required for a main memory access, and cache memory has therefore earned a permanent place in the system's data storage hierarchy.

[0005] A computer design effort that incorporates cache memory into its central processing unit architecture must address the following, increasingly difficult problems.

[0006] 1. It is imperative that the processor detect cache error conditions; otherwise, data corruption results. The cheapest solution to this problem is to do nothing beyond hanging or crashing the system when this kind of error occurs, but in practice this approach is entirely unacceptable as a mainframe response.

[0007] 2. A well-equipped machine should support deconfiguration of failed cache storage elements. By simply deconfiguring the isolated failing element, the processor can continue executing without substantial performance loss. Cache memory is divided into blocks, each containing many cache storage elements. In the exemplary machine, the block size is 16 words (64 bytes). Cache memory can also be divided into coarser subdivisions such as levels, which in this context means an entire column of blocks. The exemplary machine is designed with logic that allows its cache blocks and cache levels to be deconfigured individually.
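The deconfiguration idea can be illustrated with a small sketch. It is only a toy model under stated assumptions: the `Cache` class, its method names, and the 256-row by 4-level geometry are hypothetical and do not come from the specification; only the 64-byte block size is taken from the text.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64  # 16 words per block, as stated in the text


@dataclass
class Cache:
    """Toy cache organised as rows x levels; each (row, level) holds one block."""
    rows: int
    levels: int
    disabled_blocks: set = field(default_factory=set)  # {(row, level)}
    disabled_levels: set = field(default_factory=set)  # {level}

    def deconfigure_block(self, row: int, level: int) -> None:
        """Take a single failing block out of service."""
        self.disabled_blocks.add((row, level))

    def deconfigure_level(self, level: int) -> None:
        """Take a whole column of blocks out of service."""
        self.disabled_levels.add(level)

    def usable(self, row: int, level: int) -> bool:
        return (level not in self.disabled_levels
                and (row, level) not in self.disabled_blocks)

    def usable_bytes(self) -> int:
        count = sum(self.usable(r, lvl)
                    for r in range(self.rows)
                    for lvl in range(self.levels))
        return count * BLOCK_BYTES


cache = Cache(rows=256, levels=4)          # hypothetical geometry (64K bytes)
cache.deconfigure_block(row=17, level=2)   # lose one 64-byte block
print(cache.usable_bytes())                # 65472
```

In this model, removing one block costs 64 bytes of capacity, while removing a level costs a full column, which is why block-level deconfiguration leaves performance almost unchanged.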

[0008] 3. A truly well-equipped machine should guarantee that, when a processor cache error occurs, either a recent copy of the block in error can be retrieved from main memory or the block in error can be corrected in some way. The exemplary machine incorporates an error correcting code that corrects single-bit errors in a cache block, but only during writes to main memory. The design of this machine, however, did not address the explicit "unnatural" correction of a particular cache block or the restart of the affected instruction. ("Unnatural" is used here to indicate that the block must be swapped out for correction even though the natural replacement algorithm would not dictate such an event at the time of the error.)

4. A store-in cache machine such as the exemplary machine can operate very efficiently, but the delayed write to main memory that makes it efficient becomes a liability when additional processors are added to the system to further improve throughput. The multiprocessor configuration poses the final challenge in cache error handling: an error in a cache operand block where the block in error resides in one processor's cache, is requested by one or more processors, and no "updated" copy of the block exists in system main memory. This problem, which the present invention addresses, is commonly called the siphon error condition. ("Siphon" is a term of art used to denote the transfer of a cache block from one processor of a multiprocessor system to another processor or to an input/output unit.) A similar problem encountered in single-processor systems is addressed by a related invention covered by the U.S. patent application "Fault Tolerant Computer System" by D. S. Edwards et al., filed on the same date as the present application.

[0009] One prior-art store-in cache system handles the processor cache error problem by attaching error correction hardware to each cache so that errors are corrected when erroneous data is read from the cache. This is an effective but expensive solution to the problem.

[0010] A second prior-art attempt to resolve the cache data correction and retry situation hides the problem by implementing a store-through cache. (In a store-through design, whenever a cache block is updated, it is immediately written to both the cache and main memory.) With this approach, whenever a fetch from the cache is in error, the processor forces a cache bypass and issues a memory read for the block, which is used both for instruction execution and for updating (restoring) the cache. The advantage of this solution is that the affected instruction is not impacted: the fetch from memory is equivalent to a cache miss condition, so all such errors are recoverable. This solution exploits the store-through design, which by definition provides the benefit of keeping main memory always up to date.
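The store-through recovery described above can be sketched in a few lines. This is an illustrative model only: the `StoreThroughCache` class, the per-word parity bit, and the dictionary used as main memory are assumptions made for the example, not details of any prior-art system.

```python
def parity(word: int) -> int:
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


class StoreThroughCache:
    def __init__(self, memory: dict):
        self.memory = memory   # main memory: address -> word
        self.lines = {}        # address -> (word, stored parity bit)

    def write(self, addr: int, word: int) -> None:
        # Store-through: cache and main memory are updated together.
        self.lines[addr] = (word, parity(word))
        self.memory[addr] = word

    def read(self, addr: int) -> int:
        entry = self.lines.get(addr)
        if entry is not None:
            word, stored = entry
            if parity(word) == stored:
                return word    # clean hit
        # A miss, or a parity error handled exactly like a miss:
        # bypass the cache, read main memory, and restore the line.
        word = self.memory[addr]
        self.lines[addr] = (word, parity(word))
        return word


mem = {}
c = StoreThroughCache(mem)
c.write(0x40, 0b1011)
c.lines[0x40] = (0b1010, c.lines[0x40][1])  # inject a single-bit flip
print(c.read(0x40) == 0b1011)               # True: the memory copy is used
```

The recovery works only because main memory always holds a current copy, which is exactly the property a store-in cache gives up.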

[0011] The store-in cache design (also known as a copy-back cache) is favored over the store-through design in performance-oriented systems because it produces less processor-to-memory write activity and therefore less main memory traffic, which in turn reduces bottlenecks on the system bus in certain bus designs. The store-in property that yields the improved performance necessarily results in caches that often contain the only valid copy of a particular block of data in the system. That is, when a cache block is modified, it is not written back to main memory. Instead, it is held until it is requested by a second active unit (a CPU or I/O unit) or until it is written back to main memory because it must be replaced to make room in the cache for a new block.
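A minimal sketch of the store-in behaviour shows why the cache can hold the only valid copy. The `StoreInCache` class, its first-in first-out replacement, and the `siphon` method are hypothetical simplifications chosen for the example; the specification does not describe the replacement algorithm.

```python
class StoreInCache:
    def __init__(self, memory: dict, capacity: int):
        self.memory = memory
        self.capacity = capacity
        self.lines = {}   # addr -> [word, modified]; dict order stands in for age

    def _evict_if_full(self) -> None:
        if len(self.lines) >= self.capacity:
            victim, (word, modified) = next(iter(self.lines.items()))
            if modified:
                self.memory[victim] = word   # write back only on replacement
            del self.lines[victim]

    def write(self, addr: int, word: int) -> None:
        if addr not in self.lines:
            self._evict_if_full()
        self.lines[addr] = [word, True]      # main memory is NOT updated here

    def siphon(self, addr: int) -> int:
        """Another unit requests the block: hand it over and drop our copy."""
        word, _modified = self.lines.pop(addr)
        return word


mem = {0: 1}
c = StoreInCache(mem, capacity=1)
c.write(0, 99)
print(mem[0])   # 1: main memory is stale, the cache holds the only valid copy
c.write(8, 5)   # replacement forces the write-back of address 0
print(mem[0])   # 99
```

If the cached word for address 0 were damaged before the write-back, no good copy would exist anywhere in this model, which is the situation the invention has to handle.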

[0012] It will be apparent to those skilled in the art that it would be highly desirable to achieve, by a different approach, the benefits of these prior-art solutions to processor cache error conditions without the cost and complexity associated with them.

[0013]

[Means for Solving the Problem] Accordingly, a broad object of the present invention is to provide a solution to processor cache error conditions that is simple and economical to implement.

[0014] A more specific object of the present invention is to provide a solution to processor cache error conditions in the problematic situation that arises in a system incorporating multiple processors that attempt to access one another's cache memories.

[0015] In summary, these and other objects of the invention are achieved by a fault-tolerant computer system that includes a plurality of central processing units (CPUs), each having a cache memory unit that comprises a cache memory and a parity error detector. The detector detects parity errors in information blocks read from or written to the cache memory unit and raises a read or write cache error flag when a parity error is detected. A system bus connects the CPUs to a system control unit (SCU) that has a parity error correction capability, and a memory bus connects the SCU to main memory. An error recovery control function distributed across the CPUs, a service processor, and the operating system software responds to the detection of a read parity error flag in a sending CPU and a write parity error flag in a receiving CPU in connection with a siphon operation: it transfers the faulty block from the sending CPU through the SCU (where the faulty block is corrected) to main memory and then, when a retry is performed, transfers the corrected memory block from main memory to the receiving CPU.
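The recovery sequence summarized in this paragraph can be pictured with a toy model. Everything named here is an assumption for illustration: the `Block` and `CPU` classes, the `siphon_with_recovery` function, and in particular the "locator" check value (the XOR of the positions of the 1 bits), which stands in for whatever error correcting code the SCU actually uses. The model also collapses into one function the control that the specification distributes across the CPUs, the service processor, and the operating system.

```python
from functools import reduce


def parity(bits):
    """Even parity over the data bits (what the CPU's detector checks)."""
    return sum(bits) & 1


def locator(bits):
    """XOR of the 1-based positions of the 1 bits (used here for correction)."""
    return reduce(lambda acc, i: acc ^ i,
                  (i + 1 for i, b in enumerate(bits) if b), 0)


class Block:
    def __init__(self, bits):
        self.bits = list(bits)
        self.parity = parity(bits)   # checked when the block is read out
        self.check = locator(bits)   # assumed intact; used only by the SCU model


def scu_correct(block):
    """SCU model: repair one flipped data bit using the stored check value."""
    pos = locator(block.bits) ^ block.check
    if pos:
        block.bits[pos - 1] ^= 1
    return block


class CPU:
    def __init__(self, name):
        self.name, self.cache = name, {}
        self.read_error_flag = False
        self.write_error_flag = False

    def read_out(self, addr):
        blk = self.cache[addr]
        if parity(blk.bits) != blk.parity:
            self.read_error_flag = True
        return blk, self.read_error_flag   # block plus error signal


def siphon_with_recovery(sender, receiver, addr, main_memory):
    _blk, error = sender.read_out(addr)
    if not error:
        receiver.cache[addr] = sender.cache.pop(addr)
        return "siphoned"
    receiver.write_error_flag = True       # receiver sees the error signal
    # Swap the faulty block to main memory through the SCU, which corrects it.
    main_memory[addr] = scu_correct(sender.cache.pop(addr)).bits
    # The retried request is then served from main memory.
    receiver.cache[addr] = Block(main_memory[addr])
    return "recovered"


a, b, memory = CPU("A"), CPU("B"), {}
good = [1, 0, 1, 1, 0, 0, 1, 0]
a.cache[0x100] = Block(good)
a.cache[0x100].bits[5] ^= 1                # single-bit fault in A's cache
result = siphon_with_recovery(a, b, 0x100, memory)
print(result, b.cache[0x100].bits == good)  # recovered True
```

The flags are deliberately left set after recovery so that the sequence of detections stays visible; a real system would clear them as part of the retry.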

[0016] The subject matter of the invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. The invention, however, both as to organization and method of operation, may best be understood by reference to the following description taken in conjunction with the appended claims and the accompanying drawings.

[0017]

[Embodiment] Refer first to FIG. 1, which shows an exemplary central subsystem structure (CSS) in which the present invention may be incorporated. The system control unit (SCU) 1 centralizes and controls the scheduling of the system bus 2 and the memory bus 3. In addition, the SCU 1: A) performs memory control, single-bit error correction, and double-bit error detection; B) controls the memory configuration, of which there is one per memory unit (MU) 4; C) manages 64-byte block transfers between the CPUs and MUs in conjunction with the store-in cache structure of the central processing units (CPUs) 5; D) corrects single-bit errors found in modified blocks of a CPU's cache or in data transfers from a CPU, MU, or input/output unit (IOU) 6; and E) contains the system calendar clock.

[0018] The system bus 2 interconnects one to four CPUs and one to four IOUs with each other and with the SCU. The system bus includes a 16-byte bidirectional data interface, a bidirectional address and command interface, an SCU status interface monitored by all CPUs and IOUs, and a small number of control lines between the SCU and each CPU and IOU. Data is exchanged on the system bus in 16-, 32-, or 64-byte groups, and data exchanges can occur between a CPU and an MU, between an IOU and an MU, between two CPUs, and between a CPU and an IOU. The operations over the system bus 2 are:

- Read: 16, 32, or 64 bytes
- Read with exclusivity: 64 bytes
- Write from IOU: 16, 32, or 64 bytes
- Write from CPU (swapping): 64 bytes
- Interrupts and connects
- Read/write registers

Each system bus operation consists of an address phase and a data phase, and an address phase can start every two machine cycles. Consecutive 16-byte data transfers within a group can occur on consecutive machine cycles. An IOU or CPU can wait for the data phase of up to two requests at the same time. Data blocks are transferred in the same order in which the requests were received.
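The transfer sizes listed for the system bus can be captured in a small table. The sketch below is illustrative: the operation keys and the `data_beats` helper are hypothetical names, and the model assumes that each data phase is simply the group size divided by the 16-byte bus width.

```python
BUS_WIDTH_BYTES = 16

# Operation name -> transfer sizes (bytes) permitted by the text.
SYSTEM_BUS_OPS = {
    "read": (16, 32, 64),
    "read_exclusive": (64,),
    "write_from_iou": (16, 32, 64),
    "write_from_cpu_swap": (64,),
}


def data_beats(op: str, size: int) -> int:
    """Number of 16-byte transfers needed for one data phase."""
    if size not in SYSTEM_BUS_OPS[op]:
        raise ValueError(f"{op} does not allow a {size}-byte transfer")
    return size // BUS_WIDTH_BYTES


print(data_beats("read", 64))   # 4
```

Under this assumption a full 64-byte cache block, the unit moved by a siphon or a swap, occupies four consecutive 16-byte transfers.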

[0019] The memory bus 3 connects one to eight MUs to the SCU. The memory bus includes a 16-byte bidirectional data interface, an address and command interface from the SCU to all MUs, and a small number of control lines between the SCU and each MU. Data is exchanged on the memory bus in 16-, 32-, or 64-byte groups. The operations over the memory bus 3 are:

- Read: 16, 32, or 64 bytes
- Write: 16, 32, or 64 bytes

Main memory consists of up to eight MUs. (A ninth slot, MU 4A, is provided to facilitate reconfiguration and repair in the event of a failure.) A single-bit-correcting, double-bit-detecting code is stored with every double word, that is, 8 code bits for every 72 data bits. The code is arranged so that a 4-bit error within one chip is corrected as four single-bit errors in four different words. Data in an MU is addressed from the SCU in 16-byte (4-word) increments. All bytes within any MU are addressed consecutively; that is, there is no interlacing between MUs operating in parallel. A memory cycle can start every machine cycle, and, as seen from a CPU, a memory cycle is ten machine cycles, assuming no contention with other units. An MU 4 contains 160 dynamic random access memory (DRAM) circuits, each of which has n × 4 bit storage elements, where n = 256, 1024, or 4096.
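A single-bit-correcting, double-bit-detecting code with 8 check bits per 72 data bits can be built in several ways. The sketch below uses an extended Hamming code, which is one standard construction with exactly those parameters; the specification does not say which code the SCU and MUs actually use, so the layout and the function names `encode` and `decode` are assumptions.

```python
DATA_BITS = 72
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
N = DATA_BITS + len(PARITY_POSITIONS)   # 79 Hamming positions, 1-based


def encode(data):
    """Return an 80-bit code word: index 0 is overall parity, 1..79 Hamming."""
    assert len(data) == DATA_BITS
    code = [0] * (N + 1)
    bits = iter(data)
    for pos in range(1, N + 1):
        if pos not in PARITY_POSITIONS:
            code[pos] = next(bits)
    for p in PARITY_POSITIONS:
        # Parity over every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, N + 1) if pos & p) & 1
    code[0] = sum(code) & 1              # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected', 'double' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, N + 1):
        if code[pos]:
            syndrome ^= pos
    if sum(code) & 1:                    # odd number of flips: treat as single
        if syndrome > N:
            return "uncorrectable", None
        code[syndrome] ^= 1              # syndrome 0 is the overall parity bit
        status = "corrected"
    elif syndrome:                       # even number of flips, non-zero syndrome
        return "double", None
    else:
        status = "ok"
    data = [code[pos] for pos in range(1, N + 1) if pos not in PARITY_POSITIONS]
    return status, data


data = [1 if i % 3 == 0 else 0 for i in range(DATA_BITS)]
word = encode(data)
word[37] ^= 1
print(decode(word)[0])   # corrected
word[5] ^= 1
print(decode(word)[0])   # double
```

Seven Hamming check bits locate a single error among 79 positions and the eighth, overall parity bit separates single errors from double errors, which accounts for the 8 code bits per 72 data bits.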

[0020] Each IOU 6 provides a connection between the system bus 2 and two input/output buses (IOBs) 7, such that each IOB interfaces with one IOU. The IOUs thus manage data transfers between the CSS and the I/O subsystems, which are not shown in FIG. 1.

[0021] The clock and maintenance unit (CMU) 8 generates, distributes, and tunes the clock signals for all units in the CSS; provides the interface between the service processor (SP) 9 and the central processing, input/output, and power subsystems; initializes the units of the CSS; and processes errors detected within the CSS units. The CSS employs a two-phase clock system and latched register elements, in which the trailing edge of clock 1 defines the end of phase 1 and the trailing edge of clock 2 defines the end of phase 2; each phase is thus one half of a machine cycle.

[0022] The SP 9 may be a commercial personal computer with an integrated modem to facilitate remote maintenance and operations, and large systems may include two SPs through which the system can be dynamically reconfigured for high availability. The SP performs four functions:

- monitors and controls the CSS during initialization, error logging, and diagnostic operations;
- serves as the primary operating system console during system boot or on operator command;
- serves as console and data server for the input/output subsystem maintenance channel adapter (MCA);
- provides a remote maintenance interface.

[0023] Refer next to FIG. 2, which is a general block diagram of one of the CPUs 5 of FIG. 1. The address and execution unit (AX unit) is a microprocessing engine that performs all address preparation and executes all instructions except decimal arithmetic, binary floating point, and multiply/divide instructions. Two identical AX chips 10, 10A perform duplicate operations in parallel, and the resulting AX chip outputs are constantly compared to detect errors. The main functions performed by the AX unit include:

- effective and virtual address generation
- memory access control
- integrity checking
- register change/use control
- execution of basic instructions, shift instructions, integrity instructions, character manipulation, and miscellaneous instructions

The cache unit 11 includes a data portion of 64K bytes (16K words) and a set of associated directory portions that define the main memory location of each 64-byte (16-word) block stored in the cache data portion. Physically, the cache unit is implemented as an array of ten DT chips, one cache directory (CD) chip 12, and one duplicate directory (DD) chip 13.

[0024] The specific functions performed by the cache unit 11 include:

- combined instruction and operand data storage
- instruction and operand buffering and alignment
- data interface with the system bus 7 (FIG. 1)
- CLIMB safe store file

The cache write strategy is "store-in" (store into). If a longitudinal parity error is detected when part of a modified block is read from the cache, the block is swapped out of the cache, corrected by the SCU, and written to main memory. The corrected block is then fetched again from main memory upon retry.

[0025] Two copies of the cache directory information are maintained, one in the CD chip and one in the DD chip, which perform different logic functions. The two directory copies allow cache contents to be interrogated from the system bus concurrently with, and without interfering with, instruction/operand accesses from the CPU, and they also provide for error recovery. The functions performed by the CD chip 12 include:

- cache directory for CPU accesses
- management of the instruction, operand, and store buffers
- virtual-to-real address translation paging buffer

The functions performed by the DD chip 13 include:

- cache directory for system accesses
- system bus control
- distributed connect/interface management
- cache directory error recovery

Effective scientific computing capability is implemented in the floating point (FP) chips 15, 15A. The FP chips execute all binary floating-point arithmetic in duplicate. Operating in conjunction with the duplicated AX chips 10, 10A, these chips perform scalar scientific processing.

[0026] The FP chip 15 (duplicated by the FP chip 15A):

- executes all binary and fixed- and floating-point multiply and divide operations;
- computes 12 × 72-bit partial products in one machine cycle;
- computes eight quotient bits per divide cycle;
- performs modulo-15 residue integrity checks.

The functions performed by the FP chips 15, 15A include:

- all floating-point mantissa arithmetic except multiply and divide
- execution of all exponent operations in either binary or hexadecimal format
- preprocessing of operands and postprocessing of results for multiply and divide instructions
- provision of identifiers and status control

Two special-purpose random access memories (FRAM 17 and XRAM 18) are incorporated in the CPU. The FRAM chip 17 is an adjunct to the FP chips 15, 15A and functions as an FP control store and decimal integer table lookup. The XRAM chip 18 is an adjunct to the AX chips 10, 10A and serves as a scratchpad as well as providing protected store and patch functions.
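The modulo-15 residue check mentioned in the list above relies on the fact that residues are preserved by multiplication. The sketch below shows the principle for an integer product; the function `checked_multiply` and its fault-injection switch are hypothetical, and the specification does not describe how the FP chip wires the check into its datapath.

```python
MOD = 15


def checked_multiply(a: int, b: int, inject_fault: bool = False) -> int:
    product = a * b
    if inject_fault:
        product ^= 1 << 7   # simulate a single flipped result bit
    # Independent and much cheaper computation on the residues only.
    expected = ((a % MOD) * (b % MOD)) % MOD
    if product % MOD != expected:
        raise ArithmeticError("residue check failed")
    return product


print(checked_multiply(1234, 5678))   # 7006652
```

Because the powers of two modulo 15 cycle through 1, 2, 4, and 8 and never reach 0, any single flipped bit changes the residue of the result and is caught by the comparison.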

[0027] The CPU also employs a clock distribution (CK) chip 16, whose functions include:

- clock distribution to the several chips that make up the CPU
- shift path control
- maintenance
- interface between the CMU and the CPU
- provision of clock stop logic for error detection and recovery

The DN chip 14 (in parallel with the DN chip 14A) executes the decimal extended instruction set (EIS) instructions. It also executes the decimal-to-binary (DTB) and binary-to-decimal (BTD) conversion EIS instructions and the move-numeric-edit (MVNE) EIS instructions in conjunction with the AX chip 10. The DN chip both receives operands from memory and sends results to memory via the cache unit 11.

[0028] The AX, DN, and FP chips are sometimes referred to collectively as the basic processing unit (BPU). As already noted, the AX, DN, and FP chips are each duplicated, with the duplicate units operating in parallel to produce duplicate results that can be used for integrity checking. Master and slave results are thus obtained in the normal operation of these chips. The master results are placed on the master result bus (MRB) 20, and the slave results are placed on the slave result bus (SRB) 21. Both the master and slave results are conveyed to the cache unit 11 on the MRB and SRB, respectively. In addition, the COMTO bus 22 and the COMFROM bus 23 couple the AX chip, the DN unit, and the FP unit together for certain interrelated operations.
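The use of master and slave results for checking can be sketched as a comparison of two independently computed values. The function `run_duplicated` and the callables passed to it are hypothetical; the text says only that both results reach the cache unit on the MRB and SRB, so where and how the comparison is made is an assumption of this example.

```python
def run_duplicated(master, slave, *operands):
    """Run the same operation on two units and compare the results."""
    mrb = master(*operands)   # value placed on the master result bus
    srb = slave(*operands)    # value placed on the slave result bus
    if mrb != srb:
        raise RuntimeError(f"master/slave mismatch: {mrb!r} != {srb!r}")
    return mrb


def add(x, y):
    return x + y


def faulty_add(x, y):
    return (x + y) ^ 1        # a unit with a stuck low-order result bit


print(run_duplicated(add, add, 2, 3))   # 5
```

Passing `faulty_add` as the slave raises an error instead of returning a value, which models how a disagreement between the duplicated chips exposes a fault before the result is stored.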

[0029] The following discussion concerns the events that occur when a cache storage error is detected in a multiprocessor system and the data flow is from the cache of a first CPU to the BPU/cache of a second CPU. This is the more complex of the two cache operand data error scenarios and is the problem addressed by the present invention.

[0030] The preconditions that must exist for this error to occur are as follows.

[0031] 1. A CPU must have read a data block into its cache in response to a BPU request.

2. A second (or third or fourth) CPU must request the same block while it is still owned by the first CPU, and one bit of the block must have changed unexpectedly while the block resided in the first CPU's cache. (Unlike the single-processor cache-to-BPU case, it does not matter whether the word in error is the target word; that is, an error anywhere in the cache block causes the siphon error condition.)

When the CPU holding the block in error (the sending CPU) enters its data transfer phase in response to the siphon request, the main process for handling the error is invoked. It includes the following steps:

1. The sending CPU detects the error as the requested block is read from cache storage. An error signal is also sent when the first quarter block is transferred to the requesting (receiving) CPU (data flow 28A, 28C in FIG. 3). The sending CPU sets a flag that specifically identifies this error type and reports the error to its cache control logic (DD chip). The sending CPU's DD sets the BPU stop command and saves the row and level information identifying the defective cache block in the siphon history entry of its cache history register bank. The BPU of the sending CPU is placed in a hard stop state.

[0032] 2. The receiving CPU receives the error signal from the sending CPU and enters the BPU hard stop state. It sets an alarm with an error status indicating, for evaluation by the SP, that it received an error signal from the sending CPU. It also saves, in its cache history register bank, the cache storage row and level for which the block was destined, for later reference by the SP.

[0033] 3. The SCU also notes the error signal and forces the faulty block into memory with bad parity (data flow 28A, 28B in FIG. 3). This results in an alarm to the SP, and the page address is written to a register reserved specifically for this error type. The SCU returns a bad status signal to the sending CPU, which is already in the hard stop state.

[0034] 4. The SP must analyze these events; that is, it must recognize that a siphon error has occurred because the SCU alarm was reported in association with a siphon/DT error from one CPU, together with an alarm from a second CPU that has a cache parity error indication set. By issuing a Read DD Error Report command to each, the SP must retrieve the row and level information for the cache block in error from both the sending and the receiving CPUs. The SP then uses this information to invalidate the block held by the receiving CPU. (The SP effectively ignores the SCU report, although the report must be read to ensure that the SCU is unlocked.)

5. The SP forces correction of the cache block in error held by the sending CPU by issuing a swap command that specifies the destination memory address of the block to be swapped. The swap causes the faulty cache block to be written to memory (data movement 29A, 29B in FIG. 3). It is during this write that the SCU corrects the single-bit fault. After the swap completes, the SP disables the cache storage element associated with the faulty block by invalidating its level.

【００３５】6. Once the swap is complete, the SP must extract a certain amount of information from the BPU of the sending CPU. The operating system software requires this information to improve the chance that its instruction retry routine can establish a retryable machine state for the failing instruction.

【００３６】7. The SP writes the symptoms of this fault and the receiving CPU's data to a dedicated area of main memory, where the operating system software can access them later.

【００３７】8. The SP issues a fault restart command instructing the sending CPU to restart from its stopped state with a parity fault. When this CPU restarts, it pushes its state (the protected store) from the XRAM 18 into the cache and enters the fault handling/instruction retry routine of the operating system software.

【００３８】9. The operating system software takes note of the parity fault and examines the information saved in dedicated memory to determine the type of fault. If it finds that this is a cache operand error, it evaluates the failing instruction to determine whether the instruction can be retried. In some cases the operating system software uses the sending CPU's register information obtained by the SP to set up the pre-execution state, which improves the chance of a successful retry.

【００３９】10. If the operating system software determines that the instruction associated with the fault can be retried, it adjusts the state that was forced onto the protected store stack and restarts the failing instruction by directing the sending CPU to pop the stack entry.

【００４０】11. While the operating system is carrying out steps 9 and 10, the SP begins the task of restarting the receiving CPU from its stopped state. The SP must extract a certain amount of information from the BPU of the receiving CPU. The operating system software requires this information to improve the chance that its instruction retry routine can establish a retryable machine state for the failing instruction.

【００４１】12. The SP writes the symptoms of the fault and the receiving CPU's register data to a dedicated area of main memory, where the operating system software can access them later.

【００４２】13. The SP issues a fault restart command instructing the receiving CPU to restart from its stopped state with a parity fault. When the receiving CPU restarts, it pushes its state (the protected store) from the XRAM 18 into the cache and enters the fault handling/instruction retry routine of the operating system software.

【００４３】14. The operating system software takes note of the parity fault and examines the information saved in dedicated memory to determine the type of fault. If it finds that this is a cache operand error, it evaluates the failing instruction to determine whether the instruction can be retried. In some cases the operating system software uses the receiving CPU's register information obtained by the SP to set up the pre-execution state, which improves the chance of a successful retry.

【００４４】15. If the operating system software determines that the instruction associated with the fault can be retried, it adjusts the state that was forced onto the protected store stack and restarts the failing instruction by directing the receiving CPU to pop the stack entry. This restart causes the corrected block to be fetched from main memory (data movement 30A, 30B in FIG. 3). The net result is that the affected process resumes and the error is completely transparent to it.

【００４５】If a second or third CPU requested the same block and received the error signal from the sending CPU, steps 11 through 15 above are repeated for each such receiving CPU.
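The ordering of the fifteen steps above can be summarized in a short sketch. This is only an illustration of the control flow as described, not the patented implementation: the `Cpu` class, the `recover_siphon_error` function, and the log strings are hypothetical names introduced here, and the hardware actions are reduced to state changes and log entries.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Hypothetical stand-in for one CPU as seen by the service processor."""
    name: str
    hard_stopped: bool = False
    history: dict = field(default_factory=dict)        # cache history register bank
    disabled_levels: set = field(default_factory=set)  # cache levels taken out of use


def recover_siphon_error(sender, receivers, row, level):
    """Return the ordered list of recovery actions for one siphon parity error."""
    log = []

    # Steps 1-3: hardware reaction. Every CPU involved saves row/level and stops.
    for cpu in [sender, *receivers]:
        cpu.history["siphon"] = (row, level)
        cpu.hard_stopped = True
    log.append("SCU: force block to memory with bad parity, alarm SP")

    # Step 4: SP reads the error reports and invalidates the receivers' copies.
    for cpu in receivers:
        log.append(f"SP: invalidate {cpu.history['siphon']} in {cpu.name}")

    # Step 5: swap the faulty block to memory (SCU corrects it), disable the level.
    log.append(f"SP: swap {sender.history['siphon']} from {sender.name}")
    sender.disabled_levels.add(level)

    # Steps 6-10 for the sender, then steps 11-15 once per receiver.
    for cpu in [sender, *receivers]:
        log.append(f"SP: save {cpu.name} registers to dedicated memory")
        cpu.hard_stopped = False                        # fault restart command
        log.append(f"OS: retry failing instruction on {cpu.name}")
    return log


sending, receiving = Cpu("CPU0"), Cpu("CPU1")
steps = recover_siphon_error(sending, [receiving], row=12, level=3)
```

Passing more than one receiver repeats the last loop body for each of them, which mirrors the repetition of steps 11 through 15 for every CPU that received the error signal.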

【００４６】The division of responsibilities among the CPU, the SP, and the operating system software was an important consideration in the exemplary implementation. The decisions were made first on the basis of absolute necessity and then on the strengths and weaknesses of each component.

【００４７】Moderate support providing the following functionality had to be built into the hardware of the exemplary system:
A) detection of the error;
B) provision of information about the error (including identification of the associated cache block);
C) freezing (stopping) the affected BPU in a predictable manner;
D) alarming the SP;
E) support commands for:
1) swapping a cache block,
2) disabling a cache block (or a larger cache subdivision, such as a level in the exemplary machine),
3) restarting the CPU; and
F) continuing to service system requests throughout error processing (to avoid halting the entire system).

Initially it seemed that the CPU hardware should be designed to handle all of these roles without SP intervention. Ideally, the CPU itself would automatically swap the block to memory for correction, fetch the corrected block, and restart the affected instruction. However, those skilled in the art will understand that such an approach is fraught with potential for design error and would consume much of the system design resources (that is, designer time and silicon space). Dividing the responsibility among the CPU, the SP, and the operating system software significantly reduced the overall design, development, and production cost of the exemplary system with respect to the commercial implementation and development effort of the hardware (the design/implementation responsibility is not concentrated in one key group). Furthermore, those skilled in the art will readily appreciate that if bugs are found during early system testing, it is easier to modify software than to produce a new version of a hardware VLSI component. This additional flexibility gives the partitioned approach an advantage over concentrating the process in silicon.

【００４８】The SP's responsibilities include:
A) alarm handling;
B) supervision of error processing and correction, including:
1) issuing commands to determine which block to swap,
2) issuing the command to swap the block in error,
3) handling exceptions that occur during the swap; that is, if the error is unrecoverable (for example, a double-bit failure), the SP is programmed to pass this information/status to the operating system;
C) extracting registers from the affected BPU for the instruction retry software; and
D) ensuring that the CPU is in a condition from which it can be restarted by issuing it the appropriate commands.

Note that the SP's responsibilities do not include determining whether the instruction associated with the fault can be retried. Several factors rule this out. First, the algorithms needed to determine whether some of the more complex instructions in the CPU assembly-language instruction set are retryable are very complicated (which translates into a very large program). Because of anticipated storage limitations, it was decided that no further storage demands should be placed on the SP. In addition, the SP is slow compared with the mainframe computer it supports, so the processing performs far better when the retry software resides on the mainframe.
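The distinction between a correctable single-bit failure and an unrecoverable double-bit failure can be illustrated with a generic single-error-correcting, double-error-detecting (SEC-DED) code. The text does not say which code the SCU uses, so the extended Hamming(8,4) code below, and the names `encode` and `decode`, are assumptions chosen only to show the principle on a 4-bit value.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions that hold a 1
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    else:
        return None, "uncorrectable"    # two bits flipped: detected, not correctable
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


word = encode(0b1011)
single = list(word)
single[5] ^= 1                          # one bit altered while the block sat in cache
double = list(single)
double[2] ^= 1                          # a second altered bit
```

Here `decode(single)` recovers the original value and reports a correction, while `decode(double)` reports an uncorrectable error, which is the case the SP would pass on to the operating system.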

【００４９】The responsibility of the operating system software is primarily to determine whether the affected instruction is retryable. This function comes into play after the affected CPU has been restarted following the fault. The operating system software interprets the fault type and, when it determines that this is the class of error the invention addresses, enters its parity fault processing procedure.

【００５０】The analysis that the operating system software must perform to determine whether an instruction can be retried depends on the type of instruction that failed. In essence, the assembly-language instruction set supported by the exemplary CPU consists of instructions that:
1) load registers from cache;
2) write to cache;
3) modify registers;
4) read, alter, and then rewrite the same cache word;
5) move cache data from one location to another; and/or
6) transfer control.

The retry component of the operating system software analyzes these classes of instructions to determine whether a given instruction can be retried in a given situation. On the surface this appears to be a fairly simple task. For example, if a simple "Load A-register" (LDA) instruction fails because the data received during a siphon had bad parity, one would expect that the LDA could be re-executed once the cache block has been corrected. But, merely as an example, consider what happens if the LDA has indirect-and-tally effective address modification associated with it. The operating system software must then detect this situation and restore the tally word to its pre-execution state. This LDA example is offered to illustrate the well-known fact that it is the exceptions in an instruction set that complicate the retry algorithm.
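The LDA case can be sketched as follows. This is a minimal illustration under stated assumptions: the `FailedInstruction` record, its field names, and the `prepare_retry` function are hypothetical, main memory is modeled as a dictionary, and the real retry routine would have to cover many more instruction classes and side effects than this single check.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FailedInstruction:
    """Hypothetical record the retry routine builds for the failing instruction."""
    mnemonic: str
    tally_address: Optional[int] = None        # set when indirect-and-tally was used
    pre_execution_tally: Optional[int] = None  # tally word as it was before execution


def prepare_retry(instr: FailedInstruction, memory: Dict[int, int]) -> bool:
    """Undo pre-fault side effects; return True if instr may be re-executed."""
    if instr.tally_address is None:
        return True                             # plain case: simply re-execute
    if instr.pre_execution_tally is None:
        return False                            # nothing to restore from
    memory[instr.tally_address] = instr.pre_execution_tally
    return True


# Usage: the tally word at address 0o100 was stepped from 5 to 6 before the fault.
memory = {0o100: 6}
lda = FailedInstruction("LDA", tally_address=0o100, pre_execution_tally=5)
retry_ok = prepare_retry(lda, memory)
```

After the call, the tally word holds its pre-execution value again, so re-executing the LDA steps it exactly once, as if the fault had never occurred.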

【００５１】In the exemplary system, the hardware provides some important support for recovering from these errors. It shadows certain registers so that the pre-execution values of those registers can be found and used for the retry. The truly complex cases (for example, double-precision operations) benefit most from this shadowing. In such cases the operating system software can determine where the pre-execution registers reside and use them for the retry. In effect, shadowing allows instructions that modify registers to be optimized to the point that the operation completes even when invalid data is read, because a pre-execution copy of the registers is available for the retry. Without this feature, these instructions would have to be considered non-retryable, or CPU execution would have to be slowed down to guarantee that the operation is cancelled when invalid data is detected.
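Register shadowing can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the hardware design: the `ShadowedRegisters` class and its method names are invented here, and the register names A and Q are used only as example labels.

```python
class ShadowedRegisters:
    """Toy register file that keeps a pre-execution copy of every register."""

    def __init__(self, **initial):
        self.live = dict(initial)
        self.shadow = dict(initial)

    def begin_instruction(self):
        # Snapshot the registers before the instruction is allowed to change them.
        self.shadow = dict(self.live)

    def write(self, name, value):
        self.live[name] = value

    def restore_for_retry(self):
        # Roll back to the pre-execution values so the instruction can be re-run.
        self.live = dict(self.shadow)


regs = ShadowedRegisters(A=1, Q=2)
regs.begin_instruction()
regs.write("A", 99)        # first half of a two-register update completes
regs.write("Q", -1)        # second half used data later found to be invalid
regs.restore_for_retry()
```

Because the snapshot is taken before the instruction writes anything, the instruction can run to completion on bad data without losing the state needed to retry it.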

【００５２】When the instruction is retryable, the operating system software returns control to the affected process, and the hardware error is transparent to that process. If the instruction is not retryable, or if the cache block failure is uncorrectable, the affected process is terminated.

【００５３】Attention is now directed to the flow diagram of FIG. 4. This flow diagram is an alternative disclosure of the invention that will be particularly useful to programmers practicing the invention in an environment similar to that of the exemplary system.

【００５４】While the principles of the invention have been made clear in the illustrative embodiment, many modifications of the structure, arrangement, proportions, elements, materials, and components used in practicing the invention, particularly adapted to specific environments and operating requirements, will be apparent to those skilled in the art without departing from those principles.

[Brief description of drawings]

【図１】FIG. 1 is a very high-level block diagram showing the central system structure of an information processing system in which the present invention finds application.

【図２】FIG. 2 is a general block diagram showing a central processing unit of the central system structure of FIG. 1.

【図３】FIG. 3 is a block diagram similar to FIG. 1 showing certain data movements that occur during practice of the invention.

【図４】FIG. 4 is a flow chart for carrying out another embodiment of the present invention.

[Explanation of symbols]

1 System control unit (SCU)
2 System bus
3 Memory bus
4 Memory unit (MU)
5 Central processing unit (CPU)
6 Input/output unit (IOU)
7 Input/output bus (IOB)
8 Clock and maintenance unit (CMU)
9 Service processor (SP)
10 AX chip
11 Cache unit
12 Cache directory (CD) chip
13 Duplicate directory (DD) chip
14 DN chip
15 Floating point (FP) chip
16 Clock distribution (CK) chip
17 FRAM chip
18 XRAM chip
20 Master result bus (MRB)
21 Slave result bus (SRB)
22 COMTO bus
23 COMFROM bus
33 Cache storage
34 Error detection unit

Continuation of front page

(72) Inventor: ウィリアム・エイ・シェリー, 4900 East Osborne Road, Phoenix, Arizona 85018, USA
(72) Inventor: ジウィー・チャン, 15620 North Thirty-First Drive, Phoenix, Arizona 85023, USA
(72) Inventor: ミノル・イノシタ, 5332 West Golden Lane, Glendale, Arizona 85302, USA
(72) Inventor: レナード・ジー・トルビスキ, 6725 East Horseshoe Lane, Scottsdale, Arizona 85253, USA
(56) References: JP-A-1-88676 (JP, A); JP-A-2-17550 (JP, A)

Claims

(57) [Claims]

1. A fault-tolerant multiprocessor computer system comprising:
A) a first central processing unit including a first cache memory device, the first cache memory device comprising (a) first cache storage means, and (b) first parity error detecting means for detecting a cache read parity error, which is a parity error in a block of information read from the first cache storage means, and a cache write parity error, which is a parity error in a block of information written to the first cache storage means, the first parity error detecting means issuing a first error flag when the cache write parity error is detected;
B) a second central processing unit including a second cache memory device, the second cache memory device comprising (a) second cache storage means, and (b) second parity error detecting means for detecting a cache read parity error, which is a parity error in a block of information read from the second cache storage means, and a cache write parity error, which is a parity error in a block of information written to the second cache storage means, the second parity error detecting means issuing a second error flag when the cache read parity error is detected;
C) transfer means for transferring, in response to a siphon request from the first central processing unit, a designated block of information from the second cache storage means to the first central processing unit via the first and second parity error detecting means;
D) a system control unit having a parity error correction device that performs parity error correction;
E) a system bus connecting the first and second central processing units to the system control unit;
F) a main memory unit;
G) a memory bus connecting the system control unit and the main memory unit; and
H) error recovery control means which, in response to detection of a cache read parity error by the second parity error detecting means and detection of a cache write parity error by the first parity error detecting means in a faulty block, the faulty block being a block that the first central processing unit requested from the second central processing unit during a siphon operation, transfers the faulty block from the second cache storage means to the main memory unit via the system control unit, where it is corrected, and transfers the corrected block from the main memory unit to the first central processing unit, the error recovery control means including a service processor that detects the first and second error flags and, in response, directs the transfer of the faulty block from the second cache storage means to the main memory unit via the system control unit for correction.

2. The fault-tolerant multiprocessor computer system of claim 1, wherein each of the first and second central processing units further comprises:
A) a random access memory; and
B) means for pushing protected store information to the random access memory, in response to a command from the service processor, prior to a retry of the operation that resulted in the issuance of the first and second error flags.

3. The fault-tolerant multiprocessor computer system of claim 1, wherein the error recovery control means further comprises:
A) operating system software including an instruction retry routine; and
B) means, in the instruction retry routine, for detecting the presence of the first and second error flags and, in response to the presence of the flags and to the faulty block having previously been transferred from the second cache storage means to the main memory unit, directing a retry of the operation that resulted in the issuance of the first and second error flags and directing the transfer of the corrected block from the main memory unit to the first central processing unit.

4. A fault-tolerant computer system comprising:
A) a central processing unit comprising a cache memory device, a basic processing unit, and transfer means, wherein the cache memory device comprises (a) cache storage means, and (b) first parity error detecting means for detecting a parity error in a block of information read from the cache storage means, including means for issuing a first error flag when the parity error is detected; the basic processing unit comprises second parity error detecting means for detecting a parity error in a block of information written from the cache storage means to the basic processing unit, including means for issuing a second error flag when the parity error is detected; and the transfer means is configured to transfer the block of information read from the cache storage means to the basic processing unit via the second parity error detecting means;
B) a system control unit including parity error correction means for performing parity error correction;
C) a system bus connecting the central processing unit and the system control unit;
D) a main memory unit;
E) a memory bus connecting the system control unit and the main memory unit; and
F) error recovery control means which, in response to detection of a parity error by both the first and second parity error detecting means in a faulty block, the faulty block being a block of information requested by the basic processing unit, transfers the faulty block from the cache storage means to the main memory unit via the system control unit, where it is corrected, and transfers the corrected block from the main memory unit to the central processing unit, the error recovery control means including a service processor that detects the first and second error flags and, in response, directs the transfer of the faulty block from the cache storage means to the main memory unit via the system control unit for error correction.

5. The fault-tolerant computer system of claim 4, wherein the central processing unit further comprises:
A) a random access memory; and
B) means for pushing protected store information to the random access memory, in response to a command from the service processor, prior to a retry of the operation that resulted in the issuance of the first and second error flags.

6. The fault-tolerant computer system of claim 4 or 5, wherein the error recovery control means further comprises:
A) operating system software including an instruction retry routine; and
B) means, in the instruction retry routine, for detecting the presence of the first and second error flags and, in response to the presence of the flags and to the faulty block having previously been transferred from the cache storage means to the main memory unit, directing a retry of the operation that resulted in the issuance of the first and second error flags and directing the transfer of the corrected block from the main memory unit to the central processing unit.

7. The fault-tolerant computer system of claim 4, 5, or 6, wherein the corrected block is returned to and stored at a block position in the cache storage means different from that of the faulty block.