JPH0683717A

JPH0683717A - Large fault-resistant nonvolatile plural port memories

Info

Publication number: JPH0683717A
Application number: JP5019446A
Authority: JP
Inventors: Thomas B. Smith III (トーマス・ベイシル・スミス三世)
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1992-02-14
Filing date: 1993-01-13
Publication date: 1994-03-25
Anticipated expiration: 2012-12-24
Also published as: JP2694099B2; US5502728A

Abstract

PURPOSE: To minimize the effect of errors that occur and to achieve long periods of error-free operation. CONSTITUTION: The device comprises a bulk semiconductor memory (DRAM) made up of many symbol planes, triplicated processing cores 13, 14, and 15, and I/O channel adapters that connect the processing cores to external devices. The bulk semiconductor memory is striped across a plurality of symbol planes, and each symbol plane comprises a fault containment region of the memory. The processing cores perform error checking and correction on all fetched data and generate the correction and detection code bits for data to be stored. Each processing core has an ECC/voter selection mechanism that continuously monitors three or more input links from the plurality of symbol planes.

Description

Detailed Description of the Invention

【０００１】[0001]

[Field of Industrial Application] This invention relates to large semiconductor-based, fault-tolerant, non-volatile memory systems on the order of 4 to 128 gigabytes for use in large database applications such as online transaction processing systems (OLTPS). It relates in particular to memory systems in which error-free operation and the sharing of data between applications and between multiple computers are significant design criteria in addition to size and speed.

【０００２】[0002]

[Prior Art] Cross reference. This invention is related to the invention disclosed in Application No. 07/698,685, entitled "Nested Frame Communication Protocol," filed May 10, 1991 by T. B. Smith and assigned to the same assignee as this application. The disclosure of Application No. 07/698,685 is incorporated herein by reference.

[0003] This invention is also related to the invention disclosed in U.S. Pat. No. 5,020,023, entitled "Automatic Vernier Synchronization of Skewed Data Streams," which is likewise assigned to the same assignee as this application. The disclosure of U.S. Pat. No. 5,020,023 is also incorporated herein by reference.

[0004] Background of the invention. Semiconductor storage has conventionally been used as the main storage component of computers and as the storage medium of caching disk controllers. Much of the main storage in online transaction processing (OLTP) systems is used to buffer blocks of disk data, an application similar in nature to the cache function of a disk controller.

[0005] Buffering and caching (hereinafter, caching) can minimize the number of physical disk accesses by intercepting read requests to disk and supplying the requested data from a main-memory buffer or from the disk controller's cache. Main-memory buffering considerably reduces the load on the computer's I/O subsystem and on the disk actuators, because read requests intercepted by main-memory buffers generate neither I/O activity nor disk activity.

[0006] A read request satisfied by the disk controller cache still requires the computer to initiate an I/O channel operation, but it avoids actual disk actuator activity. Because read latency is short when data is supplied from the cache, I/O channel occupancy is also greatly reduced.

[0007] Reducing physical disk activity is particularly important because each disk actuator can service only 20 to 40 random access requests per second (the exact number depends on the disk type and the particular access pattern). As processor speeds improve and transaction rates and complexity increase, it becomes important to reduce the number of physical disk accesses so that the workload can be met with an economical number of disk actuators.

[0008] For example, if a system processes 1,000 transactions per second and each transaction accesses (reads or writes) 40 data items, then without caching or buffering the disk subsystem would have to support 40,000 disk operations per second.

[0009] If these accesses could be spread without skew across all available disk actuators, on the order of 2,000 disk actuators would be required. In most systems the effects of skew make this figure a lower bound. If 90% of read requests can be intercepted in main-memory buffers or the disk cache and only 15% of access requests are writes, the load on the disk actuators is reduced by 75%.
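The arithmetic behind the figures in the two preceding paragraphs can be checked with a short sketch. It assumes the low end of the quoted 20 to 40 requests per second per actuator, and it assumes that every write still reaches the disk while only intercepted reads are saved; the function and constant names are illustrative only and do not come from the patent.

```python
# Back-of-the-envelope model of the disk load figures quoted in the text.
TRANSACTIONS_PER_SEC = 1_000     # from the text
ACCESSES_PER_TRANSACTION = 40    # from the text
OPS_PER_ACTUATOR = 20            # assumption: low end of the 20-40 range
READ_HIT_RATIO = 0.90            # reads intercepted by a buffer or cache
WRITE_FRACTION = 0.15            # share of accesses that are writes


def disk_ops_per_sec(tps: int, accesses: int) -> int:
    """Disk operations per second with no caching or buffering."""
    return tps * accesses


def actuators_needed(ops_per_sec: int, ops_per_actuator: int) -> int:
    """Actuators required if the load is spread perfectly (no skew)."""
    return -(-ops_per_sec // ops_per_actuator)  # ceiling division


def load_reduction(read_hit_ratio: float, write_fraction: float) -> float:
    """Fraction of disk accesses removed when only read hits are saved."""
    read_fraction = 1.0 - write_fraction
    return read_fraction * read_hit_ratio


total_ops = disk_ops_per_sec(TRANSACTIONS_PER_SEC, ACCESSES_PER_TRANSACTION)
actuators = actuators_needed(total_ops, OPS_PER_ACTUATOR)
reduction = load_reduction(READ_HIT_RATIO, WRITE_FRACTION)

print(total_ops, actuators, round(reduction, 3))
```

Under these assumptions the saved fraction is 0.85 x 0.90 = 0.765, which is consistent with the rounded 75% quoted in the text.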

[0010] In addition to reducing disk accesses, both schemes substantially reduce the latency until requested data becomes available. This further improves efficiency, reduces the likelihood of conflicts between transactions executing in parallel, and allows the multiprogramming level within the computer to be lowered accordingly.

[0011] It also reduces transaction response time. A disk controller cache has somewhat longer latency than a main-memory buffer, but it has a significant advantage over main-memory buffering when disk data is shared among several computers: it permits straightforward sharing of the data. Main-memory buffering is less straightforward, because some mechanism must be provided to invalidate the data buffered in one computer when the corresponding disk data is modified by another computer.

[0012] As buffers and caches grow, a larger fraction of read requests can be intercepted. With sufficiently large memories, read requests can be satisfied almost entirely from the buffer or cache. In such systems most disk activity is dominated by the need to write every update, modification, or addition to disk.

[0013] This requirement to reflect all writes to disk is driven by the fact that disk (magnetic) storage generally has better integrity characteristics than conventionally designed main storage or disk controller cache memory (semiconductor storage).

[0014] Because write or update activity constitutes a significant fraction of the disk requests in most OLTP workloads, the size and throughput of a conventional OLTP system are ultimately limited by the capacity and characteristics of the disk actuators that support it, even when large amounts of semiconductor memory are installed.

[0015] Accordingly, the primary motivation for the present invention is to improve the integrity characteristics of large semiconductor-based storage subsystems so that they can be used as data storage without reflecting modifications or updates to disk. Because disk data is often duplexed (duplicated) on another concurrently available system, the memory must be both fault tolerant and non-volatile if it is to compete effectively with the integrity profile of disk storage.

[0016] This makes it possible to store a database entirely in semiconductor memory, with no disk backing. It also makes write-back caching of disk-based databases possible. A write-back cache, unlike a write-through cache, can intercept both reads and writes, so the number of disk accesses continues to fall as memory is added. Ultimately, when the cache becomes large enough, disk accesses can effectively be eliminated altogether.

[0017] The preferred embodiment of the invention disclosed herein further seeks to place this fault-tolerant, non-volatile memory where it can be readily shared among multiple client computers, facilitating the construction of high-volume transaction processing systems. To achieve this goal, fault-tolerant processor components are incorporated into the fault-tolerant, non-volatile memory, providing an intelligent fault-tolerant non-volatile memory subsystem that can be shared among several client computers, both as a shared cache of data and as complete storage for the databases it holds.

[0018] Description of the prior art. Specific prior art is described next. The closest prior art is believed to be several fault-tolerant computer designs. In particular, Stratus Computer Inc. and Tandem Computer Inc. manufacture and sell fault-tolerant computers (for example, the Stratus XA2000 computer model and the Tandem Integrity computer model, respectively) that could be augmented with suitable software and I/O attachments to provide the desired integrity profile for shared data storage and caching.

[0019] Such augmentation would be a routine system integration task within the state of the art at the time of the invention. In addition, the FTCX computer, as described in "High Performance Fault Tolerant Real-Time Computer Architecture," presented in the 1986 Digest of the IEEE 16th International Symposium on Fault-Tolerant Computing (Vienna, Austria), is a good prior example of the basic triplicated processing core technology used as the fault-tolerant processor component of the present invention.

[0020] Each of these prior art examples differs most significantly from the present invention in the design of its main storage component. The basic technique used in the prior art to protect main storage is simple replication. In the Stratus and Tandem machines main storage is simply duplicated, and in the FTCX main storage is triplicated.

【００２１】[0021]

[Problems to Be Solved by the Invention] Replication of this kind is simple, but it has a number of drawbacks that the present invention does not share. The first is that it is not economical in practice. The present invention, by contrast, can use tightly synchronized parallel operation of multiple symbol planes with an error correcting code, which allows a low overhead to be chosen for protecting the storage against all failures, including in particular failures of support systems such as power or clock components and control or sequencing failures.

[0022] The overhead for providing this protection in the prior art systems is 100% for duplication and 200% for triplication. These figures compare with an overhead of 18% for equivalent protection in the preferred embodiment of the present invention, and the difference is clear. Since the cost of the prior art systems described above is dominated by the cost of semiconductor memory, the present invention offers substantial savings.
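The overhead comparison can be illustrated with a small calculation. The replication figures follow directly from the text. The coded figure rests on an assumption that is not stated in this passage: that the 19 symbol planes of the embodiment split into 16 data planes and 3 check planes, a split inferred from the 19-plane and 16-times-bandwidth figures given elsewhere in this description. The function names are hypothetical.

```python
def replication_overhead(copies: int) -> float:
    """Extra storage, as a fraction of useful data, when keeping full copies."""
    return float(copies - 1)


def coded_overhead(data_planes: int, check_planes: int) -> float:
    """Extra storage, as a fraction of useful data, for an ECC-striped array."""
    return check_planes / data_planes


duplication = replication_overhead(2)   # 1.0, i.e. 100%
triplication = replication_overhead(3)  # 2.0, i.e. 200%
# Assumed split (not stated in the text): 16 data planes + 3 check planes.
striped = coded_overhead(data_planes=16, check_planes=3)

print(f"{duplication:.0%} {triplication:.0%} {striped:.2%}")
```

Under the assumed 16 + 3 split the result is 18.75%, close to the 18% quoted in the text; the exact basis of the quoted figure is not given here.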

[0023] The second is that the tightly synchronized parallel operation of multiple symbol planes used in the present invention has inherently high memory bandwidth. The performance of the prior art memory systems is roughly equivalent to that of a single memory module of conventional design, whereas the main storage bandwidth of the present invention is many times that of a single module.

[0024] The effective bandwidth in the preferred embodiment is 16 times that of a single module. This combination of low cost and high performance makes the present invention better suited to the large shared-memory applications described above.

[0025] Significant components of the overall memory architecture of the present invention incorporate certain elements known in the prior art.

[0026] The basic triplicated processing core is essentially as described in "High Performance Fault Tolerant Real-Time Computer Architecture," presented in the Digest of the IEEE 16th International Symposium on Fault-Tolerant Computing (Vienna, Austria, June 1986), FTCS Digest of Papers, pp. 14-19.

[0027] The attachment through serial voting I/O channels is very similar to what is described in the above-referenced U.S. Pat. No. 5,020,023 ("Automatic Vernier Synchronization of Skewed Data Streams") and the above-referenced U.S. patent application Ser. No. 07/698,685 ("Nested Frame Communication Protocol").

[0028] Communication between the triplicated processing core and the symbol planes also uses the technique described in U.S. Pat. No. 5,020,023 to compensate for skew between the multiple symbol planes and the triplicated core.

[0029] The error correcting code used here is as described in U.S. patent application Ser. No. 07/318,983, entitled "Low Cost Symbol Error Correction Coding and Decoding," filed March 6, 1989 by L. Chen, which is an example of a special Reed-Solomon ECC (error checking and correction) code. Given the structure of the present invention, those skilled in the art could be expected to select other error correcting codes that may be better optimized for a particular size or application of the invention.

[0030] The fault-tolerant clock system used to provide a synchronized time reference to the system's independent symbol planes and processing rails is essentially as described in the paper by the present inventor entitled "High Performance Fault Tolerant Real-Time Computer Architecture," presented at the IEEE 16th Annual International Symposium on Fault-Tolerant Computing Systems (Vienna, Austria).

[0031] The present invention is distinguished from the known prior art in many respects. For example, U.S. Pat. No. 4,653,050 discloses means for correcting memory module failures, a memory mapping mechanism for replacing failed modules, and the like. Its use of ECC techniques resembles the present invention in that an error correcting code is used to recover data lost through the failure of a single memory module.

[0032] Error correction is not unique to either invention and is quite common in the industry. The present invention is further distinguished from U.S. Pat. No. 4,653,050 by its means for correcting or tolerating a broad class of control or sequencing failures. This is embodied in the ECC/voter selection circuit of the present invention, described in detail below.

[0033] The present invention also uses a different, more robust point-to-point interconnection topology than that of U.S. Pat. No. 4,653,050. In contrast to the shared-bus topology described in U.S. Pat. No. 4,653,050, which has numerous single points of failure in its interconnection mechanism, the interconnection mechanism of the present invention can tolerate any single point of failure.

[0034] Objects of the invention. Accordingly, a primary object of the present invention is to provide a very large, highly reliable, non-volatile semiconductor memory system particularly suited for use as a central storage facility for large online transaction processing systems and the like.

[0035] A further object of the invention is to provide such a memory system that inherently allows long periods of error-free operation.

[0036] Another object of the invention is to provide a system that achieves error-free operation through the use of a wide error correcting and detecting code in the bulk memory array itself, triple modular redundancy in much of the control and communication logic, and the disciplined use of fault containment regions, or partitions, to prevent errors caused by faults from propagating further than necessary.

[0037] Yet another object of the invention is to provide a memory system in which the application of the error correcting code not only ensures error-free data from the bulk memory array but also provides a means of detecting and correcting failures in the control circuitry closely associated with the bulk memory.

【００３８】[0038]

[Means for Solving the Problems] To solve the above problems and achieve its objects, the present invention is configured as described below, and is explained in detail later with reference to the embodiments and the drawings.

[0039] Broadly, the objects of the present invention are achieved by the design of a large, highly reliable semiconductor data storage system ideally composed of three distinct components. The first is a bulk semiconductor memory array (DRAM), the second is a processing core, optimally triplicated, and the third is a plurality of channel adapters that connect the memory to external devices.

[0040] Each of these components is partitioned into a plurality of fault containment regions, so that a fault is confined to the particular fault containment region in which it occurs. The bulk memory is striped across a plurality of symbol planes; each symbol plane comprises a bulk memory fault containment region, and each symbol plane stores at least one bit of any given memory word accessed by the system.
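The striping idea in the preceding paragraph can be sketched as follows. This is only an illustration under assumed parameters: the symbol width and the plane count used here are hypothetical, and the check symbols that an error correcting code would add on further planes are deliberately left out.

```python
def stripe(word: int, planes: int, bits_per_symbol: int = 1) -> list[int]:
    """Split a memory word into one symbol per plane (data symbols only)."""
    mask = (1 << bits_per_symbol) - 1
    return [(word >> (i * bits_per_symbol)) & mask for i in range(planes)]


def gather(symbols: list[int], bits_per_symbol: int = 1) -> int:
    """Reassemble a memory word from the symbols read back from each plane."""
    word = 0
    for i, symbol in enumerate(symbols):
        word |= symbol << (i * bits_per_symbol)
    return word


# Hypothetical example: a 16-bit word striped as sixteen 1-bit symbols,
# so that every plane holds at least one bit of the word.
symbols = stripe(0xBEEF, planes=16)
restored = gather(symbols)
```

Because each plane holds only its own symbol of every word, losing one plane costs one symbol per word, which is the kind of loss a symbol-oriented error correcting code is meant to repair.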

[0041] The processing core includes error detection and correction (ECC) means that error-check and correct all data fetched from memory and generate the correction and detection code bits for all data to be stored in memory. Each symbol plane includes means for generating, ahead of the data fetched from memory, a FETCH-RESPONSE control field that uniquely identifies that data as a response.

[0042] An ECC/voter selection mechanism is provided in each processing core. At its input it continuously monitors three or more input links from the plurality of symbol planes connected to the processing core, looking for the fetch-response command field. A majority voting mechanism is used to determine whether a majority of the monitored input links carry the fetch-response command field.

[0043] When they do, appropriate switch means are activated so that all data fields that subsequently follow from all active symbol planes of the memory array are processed through the error correction/detection circuitry. If, when the majority vote on the inputs is taken, the ECC/voter selection mechanism detects that one or more fetch-response fields do not contain the proper fetch-response command, this indicates an operating error in the control circuitry of the faulty symbol plane, which is flagged for subsequent diagnostic testing.
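The selection logic described in the two preceding paragraphs can be sketched in a few lines. This is a hedged illustration, not the circuit of the patent: the `FETCH_RESPONSE` encoding, the function name, and the return shape are all assumptions made for the example.

```python
FETCH_RESPONSE = 0xA5  # hypothetical encoding of the fetch-response command


def vote_fetch_response(command_fields: list[int]) -> tuple[bool, list[int]]:
    """Majority-vote the command fields seen on the monitored input links.

    Returns (route_to_ecc, flagged_planes): route_to_ecc is True when a
    majority of links carry FETCH_RESPONSE, and flagged_planes lists the
    links that disagreed with that majority and should be diagnosed later.
    """
    if len(command_fields) < 3:
        raise ValueError("the text calls for three or more monitored links")
    matches = [field == FETCH_RESPONSE for field in command_fields]
    majority = 2 * sum(matches) > len(command_fields)
    flagged = [i for i, ok in enumerate(matches) if not ok] if majority else []
    return majority, flagged


# One plane's control circuitry fails to announce the response.
route, flagged = vote_fetch_response([0xA5, 0xA5, 0x00])
```

In this sketch the data that follows would still be passed through error correction when `route` is true, while the planes listed in `flagged` are only marked for later diagnosis.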

[0044] According to a further aspect of the invention, the processing core achieves greater system reliability through triple modular redundancy (TMR). Correct operation of the processing core is ensured by TMR-checking all transmissions received from the processing core, either by the I/O channel adapters connected to the communication hardware or by the symbol planes of the bulk memory. In addition, the use of "vernier skew correction" throughout the system allows significantly better error-free operation at high data rates.

【００４５】[0045]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the accompanying drawings. First, an overview. The present invention is the key component of a large-capacity semiconductor storage device that is fault tolerant and non-volatile. Such a storage device can be shared by multiple computers, can be used in place of conventional disk storage for data files, catalogs, and/or other permanent data storage, and can be used as a write-back cache for disks to minimize the system performance demands placed on the disk subsystem.

[0046] Because of the significant performance difference between semiconductor memory and disk, the performance of a large computer complex using such a memory system can be dramatically better than that of an equivalent disk-based system. Neither storage performance nor data integrity is affected by any single point of failure in the storage or by many possible multiple failures. The reliability of the storage can be at least as good as that of fully duplexed disk storage.

[0047] The invention is now described with reference to FIG. 1, which shows the overall organization of a storage device made up of four major subsystems. These subsystems are: 1. an I/O channel adapter subsystem consisting of several independent channel adapters, typically one separate channel to each client CPU;

[0048] 2. a triplicated control or processing core consisting of three independent, identical processing rails operating in tight clock synchronization on identical data;

[0049] 3. a bulk memory subsystem consisting of a plurality of independent symbol planes operating in tight clock synchronization with one another; and

[0050] 4. a duplicated power system consisting of two independent AC-DC converters that supply roughly regulated DC power to two primary DC distribution buses.

[0051] Each of these subsystems is made up of several fault containment regions (FCRs). Briefly, an FCR is a defined block of circuitry designed so that the physical effects of an internal fault are confined to that block and so that it is not physically affected by faults in other FCRs. The granularity and partitioning of the system into FCRs is illustrated in FIG. 1, where an example of each type of FCR in the system is enclosed by a dotted line.

[0052] The four types of FCR correspond to the four major subsystems described above: 1. I/O channel adapter FCR. The system has one I/O channel adapter for each installed I/O channel.

[0053] 2. Processing rail FCR. In the preferred embodiment of the system there are three processing rails.

[0054] 3. Symbol plane FCR. In the embodiment of the invention there are 19 symbol planes, but this number can vary according to the capability of the error correcting code used in a particular embodiment.

[0055] 4. Primary power converter FCR. The system has two primary power converters.

[0056] All logic FCRs of the system (this does not apply to the power supplies) operate in tight clock synchronization with one another. Each FCR individually derives its local clock signal from the triplicated clock signals broadcast by a triplicated clock system, part of which is built into each rail of the triplicated processing core.

[0057] All communication between these individual FCRs takes place over dedicated point-to-point links, and vernier skew compensation is used to remove the effects of the inherent clock skew between elements. To avoid obscuring the invention, FIG. 1 shows only a few of the dedicated communication links, but every FCR in the system is connected to each rail of the three-rail processing core by a dedicated point-to-point link.

[0058] That is, there are dedicated links 10, 11, 12 between each rail 13, 14, 15 of the processing core and each symbol plane, dedicated links 16, 17, 18 between each rail 13, 14, 15 of the processing core and each channel adapter, and dedicated links (not shown) interconnecting all three rails of the processing core. There are no links interconnecting the symbol planes and no links interconnecting the channel adapters.

【００５９】その上、各ＦＣＲは、そのＦＣＲの部分で
ある専用ＤＣ−ＤＣ調整器から調整電力を取出し、個々
に電力供給される。このＤＣ−ＤＣ調整器は、２つの一
次ＤＣ電源バスの少くとも１つに電力供給が保持されて
いる限り、ＦＣＲに対して調整電力を供給することがで
きる。Moreover, each FCR draws regulated power from a dedicated DC-DC regulator that is part of that FCR and is individually powered. The DC-DC regulator can provide regulated power to the FCR as long as power is maintained on at least one of the two primary DC power buses.

【００６０】記憶装置は個々のＩ／Ｏチャンネル・アダ
プタ１９を介して接続されている客又はクライアント・
コンピュータにサービスを提供し、それら各コンピュー
タによって共用される。各チャンネル・アダプタ１９は
単信である。すなわち、重複要素がなく、個々独立に動
作する。記憶装置とクライアント・コンピュータとの間
に重複接続を希望する場合、コンピュータは、複数のチ
ャンネルを介して記憶装置に接続することができる。The storage device serves the client computers connected through the individual I/O channel adapters 19 and is shared among those computers. Each channel adapter 19 is simplex; that is, it has no redundant elements and operates independently. If a redundant connection between the storage device and a client computer is desired, the computer can connect to the storage device through multiple channels.

【００６１】チャンネル・アダプタは入力データ・スト
リームを複製して、３重複処理コアのレール１３，１
４，１５の各々に対し同一複製を配布するよう機能す
る。それは、又３重複処理コアからの３重複送信を票決
又は多数決してチャンネル・アダプタから送信する単一
出力データ・ストリームを作成するよう機能する。The channel adapter replicates the input data stream and distributes an identical copy to each of the rails 13, 14, 15 of the triplicated processing core. It also votes on (takes the majority of) the triplicated transmissions from the triplicated processing core to produce the single output data stream that the channel adapter transmits.

【００６２】この票決機能はシステムのプロセッサ・レ
ールの１つからのエラー送信をマスク（及び検出）す
る。この好ましい実施例においては、各チャンネルは前
述において照会した米国特許出願第０７／６９８，６８
５号に記述されているようなネストされたリンク・プロ
トコルを使用して、毎秒１００メガバイトで直列データ
・ストリームを処理する。This voting function masks (and detects) an erroneous transmission from one of the system's processor rails. In this preferred embodiment, each channel handles a serial data stream at 100 megabytes per second using a nested link protocol such as that described in U.S. patent application Ser. No. 07/698,685, referenced above.

【００６３】このチャンネル・プロトコルの特性は本発
明の中心ではなく、ＩＢＭ社のESCON 光ファイバ直列チ
ャンネル・プロトコルを使用することができるというよ
うに、他の多くのチャンネル・プロトコルと交換使用す
ることができる。性能及びデータ待ち時間の理由から、
及び半導体メモリーの利点を最も十分に活用するため
に、チャンネルは可能な限り高いデータ速度で動作し、
最少の待ち時間で最適化することが望ましい。The characteristics of this channel protocol are not central to the invention, and it can be exchanged for many other channel protocols; for example, IBM's ESCON fiber-optic serial channel protocol can be used. For reasons of performance and data latency, and to take the fullest advantage of semiconductor memory, it is desirable that the channel operate at the highest possible data rate and be optimized for minimum latency.

【００６４】データ記憶装置とクライアント・コンピュ
ータとの間のメッセージは、３重複処理コア１３，１
４，１５によって処理される。この処理コアの各レール
は他の２つのレールと堅いクロック同期で動作し、各レ
ール内の同一データについて同一機能を実行する。その
レールからエラー送信を発生する単一レール故障は票決
回路の受信ＦＣＲによってマスクすることができる。Messages between the data storage device and the client computers are processed by the triplicated processing core 13, 14, 15. Each rail of this processing core operates in tight clock synchronization with the other two rails and performs the same function on the same data within each rail. A single rail failure that produces an erroneous transmission from that rail can be masked by the voting circuit of the receiving FCR.

【００６５】３重複処理コアの設計及び動作は副尺スキ
ュー補償の増加及び改良と共にFTCXコンピュータについ
て照会した前述の刊行物に記載されたものと類似してい
る。副尺スキュー補償はコアのレール間、及びコアとそ
れを取囲むＦＣＲとの間の高帯域幅送信及びコアの高速
動作を可能にする。これは、そうでなければ、スキュー
が帯域幅及び動作速度を限定するためである。従って、
それを修正すると、帯域幅及び動作速度の両方を高くす
ることができる。The design and operation of the triplicated processing core is similar to that described in the publications cited above for the FTCX computer, with the addition and refinement of vernier skew compensation. Vernier skew compensation enables high-bandwidth transmission between the rails of the core and between the core and the surrounding FCRs, as well as high-speed operation of the core. This is because skew would otherwise limit bandwidth and operating speed. Compensating for the skew therefore allows both bandwidth and operating speed to be increased.

【００６６】この実施例におけるスキュー補償モジュー
ル（スキュー回路）は全リンクの受信端に配置される。
又、これらは各記号プレーンの各メモリー・ポートにお
ける受信回路に配置され、同様に処理コアの各メモリー
・ポート制御装置における各リンクに対する受信回路に
配置されるであろう。The skew compensation modules (skew circuits) in this embodiment are located at the receiving end of every link. Accordingly, they are located in the receive circuit at each memory port of each symbol plane, and likewise in the receive circuit for each link at each memory port controller of the processing core.

【００６７】又、各処理コア・レールにおける各チャン
ネル制御装置の受信回路に、及び各チャンネル・アダプ
タの受信回路にスキュー回路が配備される。Further, a skew circuit is provided in the receiving circuit of each channel controller in each processing core rail and in the receiving circuit of each channel adapter.

【００６８】更に、図１において、それら受信回路の一
部としてスキュー補償モジュールを含むリンクを、その
リンク上に付した拡大ドットでマークして示す。これら
スキュー・モジュールは、通常、物理的に、そのブロッ
クの受信回路の一部として作用する機能ブロック内に配
置される。各並列データ・リンクは単一線で示され、各
リンク１０，１１，１２は９ビット幅（８データ・ビッ
ト及び１制御ビット）、及び各リンク１６，１７，１８
は１２８ビット幅（８ビット・バイト×１６）である。Further, in FIG. 1, each link whose receive circuit includes a skew compensation module is marked with an enlarged dot on the link. These skew modules are normally physically located within the functional block, as part of that block's receive circuitry. Each parallel data link is shown as a single line; each link 10, 11, 12 is 9 bits wide (8 data bits and 1 control bit), and each link 16, 17, 18 is 128 bits wide (sixteen 8-bit bytes).

【００６９】これらの条件下において、本発明は、処理
コアのメモリー・ポート制御装置による複数の記号プレ
ーンＦＣＲの並列動作及び制御を設ける。データは、単
一記号プレーンの故障のために失われたデータが処理コ
アのメモリー・ポート制御回路内の票決手段及びエラー
修正回路との組合せ手段によって再構成することができ
るというように、複数の記号プレーンに亘りデータをス
トライプ（又は縞状に接続）することにより経済的に記
憶することができる。Under these conditions, the present invention provides parallel operation and control of multiple symbol plane FCRs by the memory port controller of the processing core. Data can be stored economically by striping it across multiple symbol planes, such that data lost through the failure of a single symbol plane can be reconstructed by the voting means in combination with the error correction circuit within the memory port control circuit of the processing core.

【００７０】複数の記号プレーンの並列動作は、又固有
に高いメモリー帯域幅を具備するものである。図２及び
図３は複数の記号プレーンに亘りデータをストライプす
る手段を例示する。この好ましい実施例では、データ記
号を記憶するために、１６記号プレーンを使用し、エラ
ー修正記号を記憶するために３記号プレーンが使用され
る。The parallel operation of multiple symbol planes also inherently provides high memory bandwidth. FIGS. 2 and 3 illustrate the means of striping data across multiple symbol planes. In the preferred embodiment, 16 symbol planes are used to store data symbols and 3 symbol planes are used to store error correction symbols.

【００７１】更に、本実施例では、各記号は８ビット・
データ・バイトである。実際に、各１６バイト（１２８
ビット）データ・ワードが１６データ・プレーンの各々
に１バイト宛記憶するように１６記号プレーンに亘りス
トライプされる。更に、メモリー・ポート制御装置は、
１６バイト・ワード当り３つのエラー修正記号（ＥＣ
Ｓ）を演算して、それらを３つのＥＣＳプレーンに亘り
ストライプし、データを記憶するときにそれらを記憶す
る。処理コアのレールにあるメモリー・ポート制御装置
と各記号プレーンとの間の接続は専用２地点間直列リン
クによって行われる。Further, in this embodiment, each symbol is an 8-bit data byte. In practice, each 16-byte (128-bit) data word is striped across the 16 data symbol planes so that one byte is stored in each of the 16 data planes. In addition, the memory port controller computes three error correction symbols (ECS) per 16-byte word, stripes them across the three ECS planes, and stores them when the data is stored. The connection between the memory port controller in a rail of the processing core and each symbol plane is made by a dedicated point-to-point serial link.
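
The striping described above, and the way it lets a lost plane be rebuilt, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and a single XOR check symbol is used as a deliberately simplified stand-in for the three error correction symbols of the embodiment, whose actual code is not reproduced here.

```python
from functools import reduce

DATA_PLANES = 16  # one byte of every 16-byte word goes to each data plane


def xor_check(word):
    """Stand-in check symbol: XOR of the data bytes (NOT the patent's code)."""
    return reduce(lambda a, b: a ^ b, word, 0)


def stripe(block):
    """Stripe a block of 16-byte words across the planes.

    Returns DATA_PLANES + 1 byte streams: stream i holds byte i of every
    word, and the last stream holds the stand-in check symbol of every word.
    """
    if len(block) % DATA_PLANES:
        raise ValueError("block length must be a multiple of 16 bytes")
    planes = [bytearray() for _ in range(DATA_PLANES + 1)]
    for offset in range(0, len(block), DATA_PLANES):
        word = block[offset:offset + DATA_PLANES]
        for i, byte in enumerate(word):
            planes[i].append(byte)
        planes[DATA_PLANES].append(xor_check(word))
    return planes


def rebuild_plane(planes, lost):
    """Reconstruct the stream of one failed plane from the surviving planes."""
    survivors = [p for i, p in enumerate(planes) if i != lost]
    return bytearray(reduce(lambda a, b: a ^ b, column, 0)
                     for column in zip(*survivors))
```

Because the XOR of all streams is zero for every word, the XOR of the survivors equals the missing stream. The embodiment's three-symbol code goes further than this stand-in: it can also locate an erroneous byte and detect two-byte failures.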

【００７２】図４は記号プレーン・コマンド及びデータ
・ストリームに対するポート制御装置用ホーマットを示
す。このデータ・ストリームは９ビット制御／データ記
号の直列ストリームから成るものと見做すことができ
る。その記号のビット０は、制御／データ記号が制御記
号か又はデータ記号のどちらかであることを示す。記号
のビット１−８はデータ・バイト（ビット０＝０）か又
は制御バイト（ビット０＝１）のどちらかである。FIG. 4 shows the format of the command and data stream from the port controller to a symbol plane. This data stream can be regarded as a serial stream of 9-bit control/data symbols. Bit 0 of a symbol indicates whether the symbol is a control symbol or a data symbol. Bits 1-8 of the symbol are either a data byte (bit 0 = 0) or a control byte (bit 0 = 1).
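
As a rough illustration of the 9-bit control/data symbol, the following Python sketch packs and unpacks such a symbol. The helper names are hypothetical, and the sketch assumes that "bit 0" is the most significant of the nine bits; the text does not state the bit ordering.

```python
CONTROL_FLAG = 0x100  # assumption: bit 0 is the most significant of 9 bits


def make_data(byte):
    """Build a data symbol: flag bit 0 = 0, bits 1-8 = data byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("data byte out of range")
    return byte


def make_control(code):
    """Build a control symbol: flag bit 0 = 1, bits 1-8 = control byte."""
    if not 0 <= code <= 0xFF:
        raise ValueError("control code out of range")
    return CONTROL_FLAG | code


def decode(symbol):
    """Return (is_control, byte) for a 9-bit symbol."""
    return bool(symbol & CONTROL_FLAG), symbol & 0xFF
```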

【００７３】遊休(IDLE)制御記号はフレーム間で送信さ
れる。簡単な記憶(STORE) 要求に対し、フレームの第１
記号は記憶制御コードを含み、データを記憶するべきワ
ードのアドレスを含む４データ記号がそれに続き、更に
記憶するべき実際のデータがその後に続く。記憶するべ
きデータ・ブロックの長さは可変であり、データ・ブロ
ックの終末は新フレームの開始を印す遊休(IDLE)記号又
は他の制御記号によって限界が定められる。IDLE control symbols are transmitted between frames. For a simple STORE request, the first symbol of the frame contains the store control code, followed by four data symbols containing the address of the word at which the data is to be stored, followed by the actual data to be stored. The length of the data block to be stored is variable, and the end of the data block is delimited by an IDLE symbol or by another control symbol marking the start of a new frame.
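
The STORE frame layout can be sketched as follows. The numeric control codes, the big-endian address byte order and the function names are assumptions made for illustration only; the text gives none of them. The splitter handles only frames that begin with an explicit control symbol.

```python
IDLE, STORE = 0x00, 0x01  # hypothetical control code values


def ctl(code):
    """9-bit control symbol (flag bit assumed to be the top bit)."""
    return 0x100 | code


def store_frame(address, data):
    """STORE control symbol, four address symbols, then the data symbols."""
    return [ctl(STORE)] + list(address.to_bytes(4, "big")) + list(data)


def split_frames(stream):
    """Split a symbol stream into frames.

    Any control symbol ends the frame in progress; IDLE starts no new frame.
    Data symbols that follow an IDLE directly (implicit frames) are ignored
    by this simplified sketch.
    """
    frames, current = [], None
    for symbol in stream:
        if symbol & 0x100:
            if current is not None:
                frames.append(current)
            current = None if symbol == ctl(IDLE) else [symbol]
        elif current is not None:
            current.append(symbol)
    if current is not None:
        frames.append(current)
    return frames
```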

【００７４】この好ましい実施例における直列データ・
ストリームの送信速度は毎秒２，５００万記号である。
制御記号の送信と各記号プレーンに対するアドレス・フ
ィールドの送信とは同一である。すなわち、各記号プレ
ーンに対する全１９送信のこれらフィールドは同一であ
る。The transmission rate of the serial data stream in this preferred embodiment is 25 million symbols per second. The control symbols and the address fields transmitted to every symbol plane are identical; that is, these fields are the same in all 19 transmissions, one to each symbol plane.

【００７５】そして、すべての記号プレーンは正確に相
互に同期して動作するので、それらは要求された動作に
関する限りにおいてすべて同一の要求を受信しなければ
ならない。従って、フレームのコマンド部の有効送信速
度は、１リンクにつき基本記号送信速度である。この実
施例において、それは毎秒２，５００万記号である。And since all symbol planes operate exactly in synchronization with each other, they must all receive the same request as far as the required operation is concerned. Therefore, the effective transmission rate of the command portion of the frame is the basic symbol transmission rate per link. In this example, it is 25 million symbols per second.

【００７６】各記号プレーンに異なるデータが記憶され
るので、データ・フィールドの１送信に対する有効デー
タ速度は、各記号プレーンがそのプレーンに記憶される
べきデータ個有の複製を受信するのみであるため、更に
高速である。１６データ・プレーンを有するこの実施例
においては、大容量記号プレーン・メモリー・サブシス
テムに対するデータの有効送信速度は毎秒４０，０００
万データ・バイト（＋３ＥＣＳプレーンに対する毎秒
７，５００万ＥＣＳバイト）である。Because different data is stored in each symbol plane, the effective data rate for one transmission of the data field is higher: each symbol plane receives only its own portion of the data, the portion to be stored in that plane. In this embodiment with 16 data planes, the effective rate of data transmission to the bulk symbol plane memory subsystem is 400 million data bytes per second (plus 75 million ECS bytes per second for the three ECS planes).
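
The aggregate rates follow directly from the per-link symbol rate and the plane counts given in the text. A short Python check of the arithmetic (the variable names are illustrative):

```python
SYMBOL_RATE = 25_000_000  # symbols per second on each serial link
DATA_PLANES = 16
ECS_PLANES = 3

# Each data symbol carries one byte, and all planes transfer in parallel.
data_bytes_per_second = SYMBOL_RATE * DATA_PLANES
ecs_bytes_per_second = SYMBOL_RATE * ECS_PLANES

print(data_bytes_per_second)  # 400000000
print(ecs_bytes_per_second)   # 75000000
```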

【００７７】個々の記号プレーンに対する送信帯域幅は
なお毎秒２，５００万データ・バイトのみである。３重
複処理コアを有するこの実施例においては、各記号プレ
ーンは各コマンド及びデータ・フレームの３重複複製を
受信する。それは、レールの１つからのエラー送信を検
出してマスクするため、その記号（コマンド又はデータ
のどちらか）に対して記号基準による３重複送信を票決
する。The transmission bandwidth to any individual symbol plane is still only 25 million data bytes per second. In this embodiment with a triplicated processing core, each symbol plane receives three redundant copies of each command and data frame. It votes on the triplicated transmissions symbol by symbol (whether command or data) in order to detect and mask an erroneous transmission from one of the rails.
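
A symbol-by-symbol majority vote over three rail transmissions can be sketched as below. This is a minimal software model under assumed names, not the voting circuit itself; it relies on the stated single-failure assumption, under which at least two rails always agree.

```python
def vote3(a, b, c):
    """Bitwise two-out-of-three majority of three symbols."""
    return (a & b) | (a & c) | (b & c)


def vote_rails(rail0, rail1, rail2):
    """Vote three symbol streams; return (voted stream, disagreeing rails)."""
    voted, suspects = [], set()
    for symbols in zip(rail0, rail1, rail2):
        winner = vote3(*symbols)
        voted.append(winner)
        # A rail whose symbol differs from the majority is detected here.
        suspects.update(i for i, s in enumerate(symbols) if s != winner)
    return voted, suspects
```

Masking comes from using the voted stream; detection comes from recording which rail disagreed with it.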

【００７８】この好ましい実施例における処理コアは設
置されたＩ／Ｏチャンネルの数及び文字により１乃至４
メモリー・ポート制御装置を配置することができる。各
記号プレーンはそれらのプロセッサ構成と一致させるた
め、１乃至４独立ポートを支援するよう構成することが
できる。The processing core in this preferred embodiment can be configured with one to four memory port controllers, depending on the number and nature of the installed I/O channels. Each symbol plane can be configured to support one to four independent ports to match the processor configuration.

【００７９】各ポートは類似しており、その各々は現に
記述した単一ポートの複製であって、３重複処理レール
内に３重複ポート制御装置を含み、各記号プレーンには
各ポート制御装置からメモリー・ポートに対する専用リ
ンクを含む。The ports are all alike. Each is a replica of the single port just described, comprising triplicated port controllers in the triplicated processing rails and, for each symbol plane, a dedicated link from each port controller to the memory port.

【００８０】図５はフェッチ(FETCH) 要求のホーマット
を示す。最初の４記号はデータ記号であり、フェッチさ
れるべき最初の１６バイト・ワードのアドレスを含み、
その後にブロック・サイズを含む２データ記号が続く。
このフェッチ要求は暗黙であり、遊休(IDLE)に続くデー
タ記号は暗黙指定フェッチの最初のアドレス・バイトで
ある。FIG. 5 shows the format of a FETCH request. The first four symbols are data symbols containing the address of the first 16-byte word to be fetched, followed by two data symbols containing the block size. This fetch request is implicit: a data symbol following an IDLE is the first address byte of an implied fetch.

【００８１】これは１サイクルだけフェッチの待ち時間
を減少する。フェッチが他のフレームの直後に続くべき
場合、フェッチ・アドレスの前に明示フェッチ制御記号
を挿入することによってそのフレームを区切るようにす
る。又、アドレスの送信はブロック・サイズの前であ
る。This reduces the fetch latency by one cycle. If a fetch must immediately follow another frame, an explicit fetch control symbol is inserted before the fetch address to delimit the frames. Note also that the address is transmitted before the block size.
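
The difference between an implicit and an explicit FETCH request can be sketched as follows. The control code value, the big-endian byte order and the function name are assumptions for illustration; the text specifies only the field order and sizes.

```python
FETCH = 0x02  # hypothetical control code value


def fetch_frame(address, block_size, explicit=False):
    """Build a FETCH request as a list of 9-bit symbols.

    After an IDLE the request is implicit: it starts directly with the four
    address symbols, which saves one symbol time. When it follows another
    frame back to back, an explicit FETCH control symbol delimits the frames.
    """
    body = list(address.to_bytes(4, "big")) + list(block_size.to_bytes(2, "big"))
    prefix = [0x100 | FETCH] if explicit else []
    return prefix + body
```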

【００８２】これは、メモリー・サイクルを開始するた
め（このとき、メモリーは遊休であるものと仮定す
る）、必要な情報が使用可能になるとすぐ、フェッチ・
オペレーションを開始することを可能にしてフェッチの
待ち時間を短縮することができる。記号プレーンからメ
モリー・ポート制御装置に対する応答ホーマットは図６
に示す。フェッチ・コマンド・ホーマット同様、応答ホ
ーマットは良いフェッチ待ち時間のために最適化され、
記号プレーンは単にデータ・ブロックの送信を開始す
る。This allows the fetch operation to begin as soon as the information needed to start a memory cycle is available (assuming the memory is idle at that time), which shortens the fetch latency. The response format from the symbol plane to the memory port controller is shown in FIG. 6. Like the fetch command format, the response format is optimized for low fetch latency: the symbol plane simply begins transmitting the data block.

【００８３】遊休に続く応答フレームの最初の記号がデ
ータ記号であると、これは暗黙フェッチ−応答(FETCH−
RESPONSE) フレームである。フェッチ要求フレーム同
様、この応答フレームは先行する応答送信の直後に続く
べき場合、明示フェッチ−応答制御コードによって２つ
に分離される。If the first symbol of a response frame following an IDLE is a data symbol, the frame is an implicit FETCH-RESPONSE frame. As with the fetch request frame, when this response frame must immediately follow a preceding response transmission, the two are delimited by an explicit FETCH-RESPONSE control code.

【００８４】図７は本発明の正に有意な機能である処理
コア・ポート制御装置の受信部の構造及びデータ・フロ
ーを例示する。これは、記号プレーンのデータと制御又
はシーケンス情報の故障の両方を修正する手段を実施す
るＥＣＣ／票決機能選択機構と称する機構である。すな
わち、データ故障はエラー修正コードを適用して修正さ
れ、制御又はシーケンス情報の故障は多数決票決ロジッ
クによって修正される。FIG. 7 illustrates the structure and data flow of the receive section of the processing core port controller, which is a particularly significant feature of the present invention. This mechanism, called the ECC/voting function selection mechanism, implements the means of correcting failures both in symbol plane data and in control or sequence information. That is, data failures are corrected by applying the error correction code, and failures in control or sequence information are corrected by majority voting logic.

【００８５】ＥＣＣ／票決機能選択機構は、データに対
するエラー修正コードの適用と、ポート制御装置と記号
プレーンのメモリー・ポートとを相互接続する１つのリ
ンクの故障又は記号プレーンの故障が免除されているシ
ーケンス情報に対する多数決票決との間を選択的に切換
える手段を提供する。この手段は後程詳細に説明する。
かかる機構はこの好ましい実施例の３重複された処理コ
アの各レールに配設される。The ECC/voting function selection mechanism provides means for selectively switching between applying the error correction code to data and applying majority voting to sequence information, so that the sequence information is immune to the failure of one symbol plane or of one link interconnecting a port controller and a memory port of a symbol plane. This means is described in detail later. Such a mechanism is provided in each rail of the triplicated processing core of this preferred embodiment.

【００８６】又、記号プレーンは複数ポート化すること
ができ、それは各メモリー・ポートに対する処理コア内
に突合せ専用メモリー・ポート制御装置を持つことによ
って達成されることに注目するべきである。この好まし
い実施例においては、４つのメモリー・ポートがある。
従って、各レールに４個宛配設されて、合計１２ポート
制御装置となる。It should also be noted that the symbol planes can be multi-ported, which is achieved by having a matching dedicated memory port controller in the processing core for each memory port. In the preferred embodiment, there are four memory ports. Accordingly, four port controllers are provided in each rail, for a total of 12 port controllers.

【００８７】その上、上記手段は個々の処理コアがＴＭ
Ｒであるか否かに関係なく、記号プレーン又はリンクの
故障に対して十分に作動可能及び有効である。Moreover, the above means is fully operable and effective against symbol plane or link failures regardless of whether the processing core itself is TMR.

【００８８】更に、詳細に説明すると、各記号プレーン
は相互に堅い同期で動作するので、ポート制御装置はフ
ェッチ要求に応答して記号プレーンのすべてから同時に
堅い同期の応答を受信する。それらはＥＣＣ／票決機能
選択回路を通して処理され、すべての記号プレーンの故
障はマスクされるか修正される。More specifically, since the symbol planes operate in tight synchronization with one another, the port controller receives tightly synchronized responses from all of the symbol planes simultaneously in response to a fetch request. These are processed through the ECC/voting function selection circuit, and any symbol plane failure is masked or corrected.

【００８９】図５はこのＥＣＣ／票決機能選択回路の構
造を例示する。応答の制御又はシーケンス情報記号は１
９並列データ・ストリームのすべてに亘り常に同一であ
るが、フェッチされているデータ記号は常に記号プレー
ンごとに異なる。記号プレーン送信のサブセット、すな
わち１ビットは、記号がデータか制御情報かの判別に使
用される。FIG. 5 illustrates the structure of this ECC/voting function selection circuit. The control or sequence information symbols of a response are always identical across all 19 parallel data streams, whereas the data symbols being fetched always differ from one symbol plane to the next. A subset of the symbol plane transmission, namely one bit, is used to determine whether a symbol is data or control information.

【００９０】この好ましい実施例においては、３ＥＣＳ
記号プレーンからの制御／データ記号は、票決機能５０
において個々に審査され、それらが制御か又はシーケン
ス情報記号かが判別される。これら３記号プレーン間の
票決は票決機能５０において単一ビットについて行わ
れ、記号がコマンドかデータ記号かの判別が行われる。
記号がかかるコマンド記号であると判別されると、票決
機能５１において次の票決が行われ、その記号を判別す
る。In this preferred embodiment, 3 ECS
Control / data symbols from the symbol plane are vote functions 50
Are individually examined in order to determine whether they are control or sequence information symbols. Voting between these three symbol planes is performed for a single bit in the voting function 50 to determine whether the symbol is a command or a data symbol.
When it is determined that the symbol is such a command symbol, the voting function 51 makes the next vote and determines the symbol.

【００９１】実際には、票決機能５１の機構は３ＥＣＳ
記号プレーン間の票決によってコマンド・ストリームを
構成し、このコマンド・ストリームはポート制御装置の
受信状態機５２の駆動に使用される。この可視構成の制
御ストリームはすべて受信したコマンド記号（遊休、フ
ェッチ−応答、等…）の他、受信したデータ・ストリー
ムの如何なるデータ記号にも代えられる単一の擬似制御
記号、データである。In effect, the mechanism of voting function 51 constructs a command stream by voting among the three ECS symbol planes, and this command stream is used to drive the receive state machine 52 of the port controller. The constructed control stream consists of all received command symbols (IDLE, FETCH-RESPONSE, and so on) plus a single pseudo control symbol, DATA, which is substituted for any data symbol in the received data stream.

【００９２】このコマンド・ストリームは３記号プレー
ン間の票決によって構成され、及び１度に１以上の記号
プレーンは失敗しえないものと仮定するので、この構成
のコマンド・ストリームは如何なる単一プレーン制御障
害による制御回路の故障にも拘らず修正される。故障し
た記号プレーンがＥＣＳプレーンの場合、票決（又は多
数決）で勝つことができる。如何なるデータ・プレーン
における故障も、それはその構成における役割りを持た
ないので、構成されたコマンド・ストリームが破壊され
ることはない。Because this command stream is constructed by voting among three symbol planes, and because it is assumed that no more than one symbol plane can fail at a time, the constructed command stream is correct despite any control circuit failure caused by a fault in a single plane. If the failed symbol plane is an ECS plane, it is outvoted by the majority. A failure in any data plane cannot corrupt the constructed command stream, because the data planes play no part in its construction.
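
The two-stage vote that builds the command stream can be sketched as follows. The function names and the string used for the pseudo control symbol are assumptions for illustration, and the control flag is again assumed to be the top bit of each 9-bit symbol.

```python
DATA = "DATA"  # pseudo control symbol standing in for any data symbol


def vote3(a, b, c):
    """Bitwise two-out-of-three majority."""
    return (a & b) | (a & c) | (b & c)


def command_symbol(ecs_a, ecs_b, ecs_c):
    """Derive one command-stream entry from the three ECS plane symbols."""
    # First vote (function 50 in the text): is this a control symbol?
    is_control = vote3(ecs_a >> 8 & 1, ecs_b >> 8 & 1, ecs_c >> 8 & 1)
    if not is_control:
        return DATA
    # Second vote (function 51 in the text): which control symbol is it?
    return vote3(ecs_a, ecs_b, ecs_c) & 0xFF


def command_stream(cycles):
    """Map an iterable of (ecs_a, ecs_b, ecs_c) triples to a command stream."""
    return [command_symbol(*triple) for triple in cycles]
```

With one ECS plane sending garbage, the other two still agree, so the constructed stream is unaffected.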

【００９３】１６データ・プレーンからの記号は、了解
のためデータ記号プレーンの故障のすべてを検出し報告
するため構成されたコマンド・ストリームと追加的に比
較することはできるが、これは本発明の主な機能にとっ
て中心的なものではなく、説明は除外される。どの３記
号プレーンが応答制御ストリームの構成（票決）に使用
されるかの選択は任意であり、３ＥＣＳプレーンの選択
はこの好ましい実施例では感覚的基準で行われる。The symbols from the 16 data planes could additionally be compared against the constructed command stream in order to detect and report every data symbol plane failure, for completeness, but this is not central to the main function of the invention and is not described further. The choice of which three symbol planes are used to construct (vote) the response control stream is arbitrary; in this preferred embodiment the three ECS planes were chosen on intuitive grounds.

【００９４】ポート制御装置の受信状態機５２は票決機
能５１及びマルチプレクサ（ＭＵＸ）５５の出力によっ
て指令され、エラー修正回路５３又はデータ選択回路５
４に対してデータを出力する責任がある。フェッチの場
合、ポート制御装置は応答データを待ち、遊休又はフェ
ッチ応答コードに続き第１のデータ記号から始まる１９
バイト幅のデータ・ストリームをエラー修正回路５３に
出力する。The receive state machine 52 of the port controller is directed by the outputs of voting function 51 and multiplexer (MUX) 55, and is responsible for routing data to the error correction circuit 53 or to the data selection circuit 54. In the case of a fetch, the port controller waits for the response data and outputs to the error correction circuit 53 the 19-byte-wide data stream, beginning with the first data symbol that follows an IDLE or a fetch-response code.

【００９５】このエラー修正回路は入力の１６データ・
バイト及び３ＥＣＳバイトから取出された１６バイト・
データ・ワードを各サイクル毎に出力する。この好まし
い実施例で実施したエラー修正コードは如何なる欠落又
は誤りバイト（単一データ・バイト又は単一ＥＣＳバイ
トのどちらでも）でも完全に修正することができ、更に
如何なる２バイト故障でも検出することができる。On every cycle, the error correction circuit outputs a 16-byte data word derived from the 16 data bytes and 3 ECS bytes at its input. The error correction code implemented in this preferred embodiment can fully correct any missing or erroneous byte (whether a single data byte or a single ECS byte), and can additionally detect any two-byte failure.

【００９６】エラー修正回路は選択的に使用可能化さ
れ、又は使用不能化される。それは入力データ・ストリ
ームの全記号位置が記号プレーンからストライプされた
データを含まないからである。エラー修正回路に対する
入力のこの選択的使用可能化及び使用不能化はＥＣＣ／
票決機能選択機構を含むポート制御装置の受信状態機及
び関連回路の責任である。The error correction circuit is selectively enabled or disabled, because not every symbol position in the input data stream contains data striped across the symbol planes. This selective enabling and disabling of the input to the error correction circuit is the responsibility of the receive state machine of the port controller and its associated circuitry, including the ECC/voting function selection mechanism.

【００９７】フェッチ−応答において、ＥＣＣ回路はフ
ェッチ−応答コマンドの受信サイクル間で使用不能又は
遊休であり、次の送信に対する遊休又は応答コマンド記
号でマークされた送信の終了まで続くデータ記号ストリ
ームに対しては使用可能又は活動状態である。ＥＣＣ回
路はこの遊休又はコマンド記号を受信しているときには
使用不能である。For a fetch-response, the ECC circuit is disabled, or idle, during the cycle in which the fetch-response command is received, and is enabled, or active, for the stream of data symbols that follows, up to the end of the transmission as marked by an IDLE or by the response command symbol of the next transmission. The ECC circuit is disabled while this IDLE or command symbol is being received.
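
The enable/disable rule can be modelled as a filter driven by the constructed command stream. This is a minimal sketch with assumed names; it treats the pseudo control symbol DATA as the only condition under which a 19-symbol row is passed to the error correction step.

```python
DATA = "DATA"  # pseudo control symbol: this cycle carries striped data


def gate_ecc(command_stream, rows):
    """Yield only the 19-symbol rows that should reach the ECC circuit.

    command_stream: one entry per cycle (DATA, or a command such as IDLE).
    rows: the 19 symbols received from the planes in the same cycle.
    """
    for command, row in zip(command_stream, rows):
        if command == DATA:
            yield row  # ECC enabled for this cycle
        # For IDLE, FETCH-RESPONSE and other commands the ECC stays disabled.
```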

【００９８】明らかなストア又は記憶(STORE) 及びフェ
ッチ(FETCH) 記号プレーン要求／応答オペレーションに
加え、ポート制御装置は、又記号プレーンからのある記
号プレーン状況情報をフェッチする機構、及び記号プレ
ーンに対する構成及び制御情報を記憶する機構を含む。In addition to the obvious STORE and FETCH symbol plane request/response operations, the port controller also includes a mechanism for fetching certain symbol plane status information from a symbol plane, and a mechanism for storing configuration and control information to a symbol plane.

【００９９】状況情報は屡プレーン間で異なる。例え
ば、この好ましい実施例は記号プレーン内部の従来のエ
ラー修正機構を使用して、記号プレーンのメモリー・ア
レイDRAMのソフト・エラー（最も頻繁にアルファ粒子の
放射によってひき起こされる）をマスクする。このエラ
ーは修正され、このエラー事象は記号プレーン内部の記
号プレーン状況アレイに記録される。Status information often differs from plane to plane. For example, this preferred embodiment uses a conventional error correction mechanism inside each symbol plane to mask soft errors in the DRAM memory array of the symbol plane (most often caused by alpha particle radiation). The error is corrected, and the error event is recorded in a symbol plane status array inside the symbol plane.

【０１００】記号プレーンから処理レールのメモリー・
ポート制御装置に送信されたデータは既に修正されてい
るため、このタイプのエラー事象は記号プレーン外部か
ら不可視である。又、それは、２つの記号プレーンが同
時に同一のソフト・エラーを持つということは考えられ
ないので、そのエラーが発生した記号プレーンに個有の
ものである。Because the data sent from the symbol plane to the memory port controller of the processing rail has already been corrected, this type of error event is invisible from outside the symbol plane. It is also unique to the symbol plane in which it occurred, since two symbol planes are unlikely to have the same soft error at the same time.
【０１０１】ポート制御装置は記号プレーンからのこの
内部状況アレイのフェッチを可能にする機能を実現す
る。ポート制御装置は下記のように記号プレーンからの
状況フェッチを実行する。それは状況−フェッチ (STAT
US−FETCH)コマンドを目標記号プレーンに送信し、ダミ
ー−状況−フェッチ(DUMMY−STATUS−FETCH)を他の記号
プレーンに対し並列に送信する。The port controller implements a function that allows this internal status array to be fetched from a symbol plane. The port controller performs a status fetch from a symbol plane as follows: it sends a STATUS-FETCH command to the target symbol plane and, in parallel, sends a DUMMY-STATUS-FETCH to the other symbol planes.

【０１０２】これは、そのダミー−状況−フェッチに対
し、全記号プレーン間の堅い同期を維持するため、目標
記号プレーンの実状況−フェッチ・オペレーションと並
列にダミー・オペレーションを実行することを可能にす
る。従って、全記号プレーンは状況−フェッチ−応答(S
TATUS-FETCH-REPLY)で応答するが、目標記号プレーンの
みが状況データを戻す。他の記号プレーンは同期オペレ
ーションを維持するため、データ・フィールドに０を戻
す。The DUMMY-STATUS-FETCH allows the other symbol planes to perform a dummy operation in parallel with the real STATUS-FETCH operation of the target symbol plane, so that tight synchronization is maintained among all symbol planes. All symbol planes therefore respond with a STATUS-FETCH-REPLY, but only the target symbol plane returns status data. The other symbol planes return zeros in the data field in order to maintain synchronous operation.

【０１０３】ポート制御装置受信状況機５２は全記号プ
レーンからの並列送信をデータ・セレクタ５４に向け、
目標記号プレーンからの送信を選択させるためにそのセ
レクタを制御しなければならない。目標記号プレーンの
同一性は、その記号プレーンが要求を発送し応答を待っ
ているので、ポート制御装置には知られている。The port controller receive state machine 52 must direct the parallel transmissions from all symbol planes to the data selector 54 and control that selector so that it selects the transmission from the target symbol plane. The identity of the target symbol plane is known to the port controller, since the port controller issued the request and is waiting for the response.
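
The status fetch sequence can be modelled in a few lines. The command names come from the text; the function names and the modelling of each plane's status as a byte string are assumptions made for illustration.

```python
N_PLANES = 19


def status_fetch_commands(target):
    """One command per plane: the real fetch for the target, dummies elsewhere."""
    return ["STATUS-FETCH" if plane == target else "DUMMY-STATUS-FETCH"
            for plane in range(N_PLANES)]


def plane_reply(command, status):
    """Data field a plane places in its STATUS-FETCH-REPLY."""
    # A dummy fetch returns zeros of the same length to stay in lockstep.
    return status if command == "STATUS-FETCH" else bytes(len(status))


def fetch_status(target, statuses):
    """Issue the commands and let the data selector pick the target's reply."""
    commands = status_fetch_commands(target)
    replies = [plane_reply(c, s) for c, s in zip(commands, statuses)]
    return replies[target], replies
```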

【０１０４】ポート制御装置は内部記号プレーン回路を
構成し、制御する制御−記憶(CONTROL−STORE)コマン
ド、及びダミー−制御−記憶(DUMMY-CONTROL-STORE) コ
マンドを実行する。これらは、記号プレーンのソフト・
ドラム（DRAM）エラーによって生成されたかもしれない
ような種々のエラー状況表示を、ルーチン再構成コマン
ドのためにリセットするよう使用することができる。The port controller also implements CONTROL-STORE and DUMMY-CONTROL-STORE commands, which configure and control the internal symbol plane circuitry. These can be used to reset various error status indicators, such as those that may have been set by soft DRAM errors in a symbol plane, and for routine reconfiguration commands.

【０１０５】その良い例は記号プレーン内に設置された
メモリーのサイズ及び構成を変更調節することである。
この好ましい実施例においては制御−記憶フレームは電
源立上り初期化中における全記号プレーンに亘るDRAMの
リフレッシュ活動の同期にも使用される。A good example is adjusting for the size and configuration of the memory installed in a symbol plane. In the preferred embodiment, the control-store frame is also used to synchronize DRAM refresh activity across all symbol planes during power-up initialization.

【０１０６】記号プレーンに対するすべての要求に対
し、ポート制御装置が実際の応答時間を予測することは
不可能である。そのため、応答コマンド票決機能を介し
て行われる応答コマンド記号の処理が本発明の中心とな
る。フェッチ応答のタイミングはドラム（DRAM）リフレ
ッシュのような多くの干渉の影響、又は他のポートから
の干渉によって異なる。ポート受信回路はこの票決機能
を信頼して必要なシーケンス情報を引き出し、その制御
状況機を駆動する。It is not possible for the port controller to predict the actual response time for every request for the symbol plane. Therefore, the processing of the response command symbol performed via the response command voting function is the center of the present invention. The timing of the fetch response depends on the impact of many interferences such as drum (DRAM) refresh, or interference from other ports. The port receiving circuit relies on this voting function to extract the necessary sequence information and drive the control status machine.

【０１０７】状況−フェッチ (STATUS−FETCH)の場合、
シーケンス情報は実際にダミー−状況−フェッチ動作を
実行している記号プレーンからの送信から最も頻繁に引
き出される。特に、この好ましい実施例において、状況
−フェッチがデータ・プレーンから出力されるとき、シ
ーケンス情報はそれのみダミー−状況−フェッチ動作を
実行する記号プレーンから引き出される。In the case of a STATUS-FETCH, the sequence information is most often derived from the transmissions of symbol planes that are actually performing the DUMMY-STATUS-FETCH operation. In particular, in this preferred embodiment, when the status fetch is directed to a data plane, the sequence information is derived exclusively from symbol planes performing the DUMMY-STATUS-FETCH operation.

【０１０８】基礎発明に対する拡張基礎となる本発明の脈絡に対し、特定の適用業務に対す
る実施を最適化するためのある最適化及び機能を追加す
ることができる。以下、３つの特定の発明拡張について
説明する。その１は記号プレーンの共用、２は交換エラ
ー修正コード、３は記号プレーン・メモリーの並列修理
及びグレードアップを容易にする記号プレーンのエラー
除去又はスクラブ(scrub) である。 Extensions to the Basic Invention. Certain optimizations and features can be added to the basic invention to optimize an implementation for a particular application. Three specific extensions are described below: (1) symbol plane sparing, (2) alternative error correction codes, and (3) symbol plane scrubbing, which facilitates concurrent repair and upgrade of the symbol plane memory.

【０１０９】記号プレーンの予備故障した記号プレーンを持つシステムの電子修理を行う
手段を具備することが真に可能である。かかる状況は、
記号プレーン・メモリー・システムが非常に大型である
とき、及びシステムの設置場所が遠いため、修理を後ま
わしにすることを希望するようなときに、最も頻繁に発
生する可能性がある。Symbol Plane Sparing. It is entirely possible to provide means for electronically repairing a system that has a failed symbol plane. The need is most likely to arise when the symbol plane memory system is very large, or when the system is installed at a remote site and it is desirable to defer the repair.

【０１１０】これは故障したプレーンを交換するため、
電子的に交換することができる１以上の追加の予備記号
プレーンを準備することによって容易にすることができ
る。この好ましい実施例においては、単一の予備プレー
ンを設けた場合、このプレーンは第２０番プレーン（予
備）を構成した。このプレーンは他の１９の記号プレー
ンのいずれとも交換することができる。This can be facilitated by providing one or more additional spare symbol planes that can be switched in electronically to replace a failed plane. In the preferred embodiment a single spare plane is provided, which constitutes plane number 20 (the spare). This plane can take the place of any of the other 19 symbol planes.

【０１１１】図８は本発明を拡張して電子修理を可能に
する好ましい実施例を例示する。記号プレーンに対する
送信のための単一１９：１マルチプレクサ６０は予備記
号プレーンに対する送信のため１９個の並列ポート制御
装置送信の１つを選択する。FIG. 8 illustrates a preferred embodiment that extends the present invention to allow electronic repair. For transmissions to the symbol planes, a single 19:1 multiplexer 60 selects one of the 19 parallel port controller transmissions for transmission to the spare symbol plane.

【０１１２】このマルチプレクサは処理コアにおいて走
行するメンテナンス処理（ソフトウェア）によって構成
され、交換される記号プレーンに対する送信の複製を予
備プレーンに対し追加送信するよう支持する。例えば、
プレーン７が交換されるべき場合、マルチプレクサはプ
レーン７に対する送信を切替えて予備プレーンに対する
送信を選択するよう構成する。This multiplexer is configured by a maintenance process (software) running in the processing core, and directs that a copy of the transmission to the symbol plane being replaced also be sent to the spare plane. For example, if plane 7 is to be replaced, the multiplexer is configured to select the transmission to plane 7 for transmission to the spare plane.

【０１１３】ポート制御装置は更に１９個の２：１マル
チプレクサ６２を含む。これらマルチプレクサは通常そ
れぞれの記号プレーンからポート制御装置に対して送信
するように構成される。電子修理が要求されたとき、予
備記号プレーンからの送信が故障記号プレーンからの送
信に切換えられる。例えば、プレーン７が交換されるべ
き場合、プレーン７に対する２：１マルチプレクサは予
備記号プレーンから受信した信号をプレーン７からの受
信信号に代えるよう構成される。The port controller further includes nineteen 2: 1 multiplexers 62. These multiplexers are typically configured to transmit from their respective symbol planes to the port controller. Transmission from the spare symbol plane is switched to transmission from the fault symbol plane when electronic repair is required. For example, if plane 7 is to be replaced, the 2: 1 multiplexer for plane 7 is configured to replace the signal received from the spare symbol plane with the received signal from plane 7.
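
The outbound 19:1 multiplexer and the inbound 2:1 multiplexers can be modelled as a small switch object. The class and attribute names are hypothetical, and planes are indexed from zero here purely for convenience.

```python
N_PLANES = 19


class SparingSwitch:
    """Software model of multiplexer 60 and the nineteen multiplexers 62."""

    def __init__(self):
        self.replaced = None  # index of the plane the spare stands in for

    def to_spare(self, to_planes):
        """19:1 multiplexer: choose which outbound transmission the spare gets."""
        if self.replaced is None:
            return None  # spare not in use
        return to_planes[self.replaced]

    def from_planes(self, received, from_spare):
        """2:1 multiplexers: substitute the spare's reply for the failed plane."""
        return [from_spare if plane == self.replaced else symbol
                for plane, symbol in enumerate(received)]
```

Setting `replaced = 7` corresponds to the plane 7 example in the text: the spare receives a copy of everything sent to plane 7, and its reply takes the place of plane 7's reply at the port controller.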

【０１１４】そこで、電子処理手段は単一の１９：１マ
ルチプレクサ６０と、故障した記号プレーンを予備プレ
ーンと電子交換するべき１９個の２：１マルチプレクサ
６２の１つとから構成されるべきである。予備記号プレ
ーンの内容は記号プレーン・アレイから大容量メモリー
の全内容を読取り、それを記号プレーン・メモリーに書
き戻すことによってロードすることができる。The electronic repair means thus consists of the single 19:1 multiplexer 60 and one of the nineteen 2:1 multiplexers 62, by which the failed symbol plane is electronically replaced with the spare plane. The contents of the spare symbol plane can be loaded by reading the entire contents of the bulk memory from the symbol plane array and writing them back to the symbol plane memory.

【０１１５】その読取動作中、故障した記号プレーンか
らの欠落情報はポート制御装置のエラー修正回路によっ
て再構成される。この再構成データは書込サイクルを通
して予備プレーンに書込まれる。このエラー除去動作の
終了において、予備プレーンは、原記号プレーンが故障
したときに失われたものの再構成データがロードされ
る。このエラー除去動作はメモリーのすべての読取り及
び書込みをループする簡単なプログラムによって実行す
ることができるか、又は下記のようにハードウェアの補
助によることができる。During the read operation, the information missing from the failed symbol plane is reconstructed by the error correction circuitry of the port controller. This reconstructed data is written to the spare plane during the write cycle. When this scrub operation completes, the spare plane holds the reconstructed data that was lost when the original symbol plane failed. The scrub operation can be carried out by a simple program that loops through reads and writes of all of memory, or it can be hardware assisted as described below.
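
The read, reconstruct and write-back loop can be sketched as below. As a loudly simplified assumption, a single XOR check plane stands in for the embodiment's three-symbol error correction code, and only the spare is written rather than every plane; the function name is hypothetical.

```python
from functools import reduce


def scrub_into_spare(planes, failed, spare):
    """Fill the spare with the data the failed plane should have held.

    planes: equal-length bytearrays, the last being an XOR check plane
            (a stand-in for the real error correction symbols).
    failed: index of the plane whose contents are no longer trusted.
    spare:  bytearray of the same length, overwritten in place.
    """
    for address in range(len(spare)):
        # Read every surviving plane at this address and reconstruct the
        # missing byte, then write it to the spare.
        survivors = (p[address] for i, p in enumerate(planes) if i != failed)
        spare[address] = reduce(lambda a, b: a ^ b, survivors, 0)
    return spare
```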

Alternative Error Correction Codes

[0116] This preferred embodiment uses a "single-byte error correcting / double-byte error detecting" error correction code such as that described in the above-referenced U.S. patent application Ser. No. 07/318,983 of S. L. Chen. Alternative error correction codes can be used to optimize for different applications.

[0117] For example, the double-error detection capability of this code may not be required for some applications in which the exposure to double errors is brief (the exposure lasts only until repair is performed). Only two error correction symbols are required for single-byte error correction without double-byte error detection. This reduces the redundancy overhead from 18% to 12%, which is more economical.
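The quoted percentages are consistent with simple ratio arithmetic, sketched below. The symbol counts are assumptions inferred from the figures rather than stated in this passage: 16 data symbols per word, three check symbols for the code with double-byte detection, and two without it.

```python
DATA_SYMBOLS = 16  # assumption: inferred from the 18% and 12% figures


def overhead(check_symbols, data_symbols=DATA_SYMBOLS):
    """Redundancy overhead as check symbols per data symbol."""
    return check_symbols / data_symbols


with_detection = overhead(3)     # single-byte correction plus double-byte detection
without_detection = overhead(2)  # single-byte correction only

print(f"{with_detection:.2%} -> {without_detection:.2%}")
```

Under these assumptions the ratios are 18.75% and 12.5%, which the text rounds to 18% and 12%.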

Hardware-Assisted Memory Error Removal

[0118] Many aspects of this configuration give rise to the requirement that all of the memory be error-removed (scrubbed), that is, that the entire contents of the memory be read from the symbol planes and rewritten. The procedure described above for switching in a spare symbol plane is only one example of the many situations in which the need for error removal can arise.

[0119] Concurrent repair of memory and concurrent addition of memory also require error removal operations. The time needed to error-remove the memory may be very long for a large memory if the operation is driven entirely by a program executing a fetch/store loop. Such a fetch/store loop can also interfere with the operation of other software on the machine, because it must lock each word for the duration of the fetch/store; otherwise an intervening normal store (STORE) could be overwritten by the store portion of the fetch/store operation.

[0120] The symbol planes and memory port controllers in this preferred embodiment include hardware assists that use otherwise unused memory bandwidth to perform fast, software-transparent error removal of the memory. Hardware error removal is initiated in parallel by a CONTROL-STORE command sent to all symbol planes. The starting address and length of the memory range to be error-removed are stored in a special error removal control register in each symbol plane.

[0121] The symbol planes then initiate and control the error removal. Each word is fetched from the symbol planes and sent to the memory port controller; FIG. 9 shows the format. When the memory port controller receives an error removal word or word stream, it passes the words through the error correction circuit and sends them back to the symbol planes to be rewritten.

[0122] FIG. 10 shows the format for the retransmission to the symbol planes. When a symbol plane receives the error-removed data, it overwrites the previous contents of that memory location, effectively reloading the word with corrected data. Error removal transmissions in either direction can be preempted at any time to allow normal memory traffic, and access to the memory array itself within the symbol plane is granted to normal memory traffic first, on a priority basis.

[0123] Therefore, the effect of error removal on normal use of the memory is minimal. Because there is some pipeline latency between the time a word is first read by the error removal mechanism and the time it is rewritten, the symbol plane monitors normal memory traffic to determine whether a store is attempted, during this latency period, to a memory location that is being error-removed.

[0124] If a write is attempted to a memory location that has been read by the error removal hardware but not yet rewritten, the normal write proceeds and the rewrite of the error-removed data for that location is canceled. That is, the error removal hardware backs up, resumes error removal from that point, and discards the data in the pipeline.
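The conflict rule can be illustrated with a small software model. This is a sketch under assumptions, not the hardware design: the class name `ScrubbingPlane`, the fixed pipeline depth, and the identity `correct` function standing in for the port controller's error correction circuit are hypothetical, and on a conflict the model discards the whole pipeline and resumes from the oldest in-flight address.

```python
from collections import deque


class ScrubbingPlane:
    """Toy model of one symbol plane running background error removal."""

    def __init__(self, memory, pipeline_depth=2, correct=lambda word: word):
        self.memory = list(memory)
        self.pipeline_depth = pipeline_depth
        self.correct = correct       # stand-in for the error correction circuit
        self.in_flight = deque()     # (address, corrected word) awaiting rewrite
        self.next_addr = 0
        self.cancellations = 0

    def step(self):
        """One cycle: rewrite the oldest word if due, then fetch the next word."""
        draining = self.next_addr >= len(self.memory)
        if self.in_flight and (draining or len(self.in_flight) == self.pipeline_depth):
            addr, word = self.in_flight.popleft()
            self.memory[addr] = word
        if not draining:
            fetched = self.correct(self.memory[self.next_addr])
            self.in_flight.append((self.next_addr, fetched))
            self.next_addr += 1

    def normal_store(self, addr, word):
        """A normal store always proceeds; a pending rewrite of addr is canceled."""
        self.memory[addr] = word
        if any(a == addr for a, _ in self.in_flight):
            self.next_addr = self.in_flight[0][0]  # back up to the oldest word
            self.in_flight.clear()                 # discard pipeline contents
            self.cancellations += 1

    def done(self):
        return self.next_addr >= len(self.memory) and not self.in_flight


plane = ScrubbingPlane([0, 1, 2, 3, 4, 5])
plane.step()
plane.step()                # words 0 and 1 are now in the pipeline
plane.normal_store(1, 99)   # conflicts with the in-flight copy of word 1
while not plane.done():
    plane.step()
print(plane.memory, plane.cancellations)
```

Without the cancellation, the stale in-flight copy of word 1 would overwrite the value written by the normal store.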

[0125] Note that the effect of pipelining the transmission to the memory port controller, through the error correction circuit, and back from the port controller to the symbol planes is that several error removal words can be in transit at any time. Because conflicts over a memory location between normal memory activity and error removal are extremely rare, the resulting cancellations do not affect the performance of error removal.

[0126] The hardware-assisted error removal in this preferred embodiment is fully pipelined, runs in the background, and can proceed at full port bandwidth when no other memory activity is present on the port. This allows the entire memory to be error-removed at nearly 400 megabytes per second with minimal impact on the performance of the normal software running on the system.
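For a sense of scale, the best-case time for a full pass at this rate can be estimated as below. The 4 and 128 gigabyte capacities are the range given in the statement of the field of the invention; decimal gigabytes and a sustained 400 megabytes per second with no competing traffic are assumptions of this estimate.

```python
RATE_BYTES_PER_SECOND = 400 * 10**6  # "nearly 400 megabytes per second"


def full_pass_seconds(capacity_bytes, rate=RATE_BYTES_PER_SECOND):
    """Best-case time to read and rewrite the whole memory once."""
    return capacity_bytes / rate


smallest = full_pass_seconds(4 * 10**9)    # 4 gigabyte system
largest = full_pass_seconds(128 * 10**9)   # 128 gigabyte system
print(smallest, largest)
```

Under these assumptions a full pass takes on the order of ten seconds for the smallest system and a few minutes for the largest; normal memory traffic preempts error removal, so actual times would be longer.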

[0127] The additions to the control and data path structure of the memory port controller that are required to support hardware-assisted error removal of the memory in the background can be readily understood.

[0128] Although embodiments of the present invention have been described in detail above, this is not intended to limit the invention to those embodiments alone; many changes and modifications are possible without departing from the spirit of the invention. The invention can also be extended to achieve greater reliability, speed, and versatility.

[0129]

[Effects of the Invention] With the configuration described above, the present invention replicates the processing core as three modules, reducing the probability that an error forces reprocessing; it confines errors that occur in any module or storage plane so that they do not affect other parts; and it uses a wide error correcting and detecting code across the memory array itself. As a result, very long periods of error-free operation can be achieved.

[Brief description of drawings]

FIG. 1 is a high-level organization diagram showing the main functional components according to the present invention.

FIG. 2 is an explanatory diagram showing the connections between the processing core memory port controllers (processing cores) and the symbol planes for store and fetch operations.

FIG. 3 is an explanatory diagram showing the connections between the processing core memory port controllers (processing cores) and the symbol planes for store and fetch operations.

FIG. 4 is an explanatory diagram showing the store (STORE) command transmission format of the port controller.

FIG. 5 is an explanatory diagram illustrating the fetch (FETCH) command transmission format of the port controller.

FIG. 6 is an explanatory diagram illustrating the fetch response (FETCH RESPONSE) transmission format of a symbol plane.

FIG. 7 is a connection diagram showing, in general form, the control and data path structure of the memory port controller, including the ECC/voting function selection circuit.

FIG. 8 is an explanatory diagram showing the additions to the control and data path structure of the memory port controller required to support replacement with a spare symbol plane.

FIG. 9 is an explanatory diagram showing the error removal transmission format from a symbol plane to the port controller.

FIG. 10 is an explanatory diagram showing the error removal transmission format from the port controller to a symbol plane.

[Explanation of symbols]

10-12, 16-18: Dedicated links
13-15: Processing core rails
19: I/O channel
50, 51: Voting functions
52: Port controller state machine
53: Error correction circuit
54, 55: Multiplexers
60: 19:1 multiplexer
62: 2:1 multiplexer

Claims

[Claims]

1. A large-scale fault-tolerant high-reliability semiconductor data storage (memory) system in which the memory function is striped across a plurality of symbol planes, each symbol plane comprising a memory fault containment area and storing at least one bit of each designated memory word accessed in the storage system, the storage system comprising: at least one processing core including a memory port controller for accessing the plurality of symbol planes in parallel, the memory port controller including an error correction/detection mechanism that checks all data fetched from the memory for errors and generates error correction/detection code bits for all data to be stored in the memory; each symbol plane including means for generating fetch-response sequence information that signals to the memory port controller, the requesting processor, and the link that data fetched from the memory has arrived, the sequence information being generated in each symbol plane; an ECC/voting function selection mechanism that continuously monitors a plurality of input links from the plurality of symbol planes to the processing core to identify fetch-response sequence information input to each processing core, the ECC/voting function selection mechanism including majority voting means for determining that a majority of the monitored input links contain the fetch-response sequence information; and means, responsive to that determination, for enabling the error detection/correction circuit means of the processing core to check for errors in all data appearing on the input links from the symbol planes after receipt of the fetch-response sequence information, the error detection circuit being kept disabled until the sequence information is detected.

2. The large-scale fault-tolerant high-reliability semiconductor data storage system according to claim 1, wherein the complete processing core is triplicated to form three processing core rails, each rail being connected on one side by data links to all of the symbol planes and on the other side by data links to the channel adapters through which the processing core communicates with other components attached to the memory system, and wherein each symbol plane and each channel adapter has a data path between itself and each of the three core rails and further includes voting means on the three data paths for selecting a "majority vote" output from the data and control information received from the three processing rails.

3. The large-scale fault-tolerant high-reliability semiconductor data storage system according to claim 1, wherein the processing core includes, directly connected to its inputs from all of the symbol planes, a selectable multiplexer switch by which only one symbol plane can be selected as a data input, and means for causing all unselected symbol planes to generate dummy data even though their symbol plane data is not selected.

4. The large-scale fault-tolerant high-reliability semiconductor data storage system according to claim 1, wherein the plurality of parallel-bit data links between each symbol plane and each processing core comprise vernier skew adjustment means, all of the vernier skew adjustment means being clocked by a single master system clock, so that all data transmitted between the processing cores and the symbol planes of the data storage system is synchronized and aligned with the master system clock to within a predetermined maximum skew tolerance, skew up to the maximum value provided by the vernier skew adjustment means being compensated.

5. The large-scale fault-tolerant high-reliability semiconductor data storage system according to claim 1, wherein the storage system comprises at least one spare symbol plane in the memory and a switch circuit operable under the control of a diagnostic circuit to form a data path between the spare symbol plane and a processing core input, so that the spare symbol plane receives all data destined for, and returns all data that would normally have come from, a symbol plane designated "unavailable" by the diagnostic circuit.

6. The large-scale fault-tolerant high-reliability semiconductor data storage system according to claim 1, wherein each symbol plane has a plurality of ports, as many processing core port controllers are provided as each symbol plane has ports, and each processing core port controller includes channel link means for connecting it to a predetermined port of each symbol plane, a complete ECC/voting function selection mechanism, an error detection/correction mechanism, and means for activating the error detection/correction mechanism after receipt of a fetch-response control signal.

7. In a large-scale fault-tolerant high-reliability semiconductor data storage system in which memory functions are striped across a plurality of symbol planes, each symbol plane storing at least one bit of each designated memory word accessed in the storage system; the storage system comprising at least one processing core module including a memory port control function and a channel port control function for selectively connecting the memory, via channel adapters, to high-speed communication links for communicating with other functional entities attached to the data storage system; the processing core including an error correction/detection mechanism that checks all data fetched from the memory for errors and generates error correction and detection code bits for all data to be stored in the memory; each symbol plane comprising means for generating fetch-response sequence information that must precede data fetched from the memory and returned to the processing core, the sequence information being generated in each symbol plane; each processing core further including an ECC/voting function selection mechanism that continuously monitors a plurality of input links from the plurality of symbol planes to identify fetch-response sequence information, the selection mechanism comprising majority voting means for determining that a majority of the monitored input links contain such a sequence information field; and the storage system further including switch means for enabling the error detection/correction mechanism of the processing core to check for errors in all data appearing on the input links from the symbol planes after receipt of the fetch-response sequence information; a method of guaranteeing corrective operation in the control circuits embedded in the symbol planes, comprising the steps of: (1) determining that the sequence information is present on a plurality of monitored input links from the symbol planes; (2) determining that the same sequence information has been detected on a majority of the input links; and (3) causing the error detection/correction circuit to process all subsequent data transmissions to the processing core.

8. The method of guaranteeing corrective operation according to claim 7, further including notifying a diagnostic module of an error in the sequence information if one of the monitored links does not match the others.

9. The method of guaranteeing corrective operation according to claim 8, comprising the steps of: supplying to all symbol planes a control-store command that provides a start address and the number of words to be error-removed; the symbol planes transmitting sequential words to the symbol plane port controller of the processing core; the symbol plane port controller recognizing the error removal command using the ECC/voting function selection mechanism and reconstructing missing data from an uninitialized or improperly loaded symbol plane by using the data and ECC symbols from the other symbol planes; the symbol plane port controller recalculating the ECC symbols and returning the corrected data and the recalculated ECC symbols to the symbol plane ports; and the symbol planes overwriting and storing the data and ECC symbols.