JP2004520643A - Method and apparatus for reducing latency in a memory system

Info

Publication number: JP2004520643A
Application number: JP2002547003A
Authority: JP
Inventors: ナギ・ナシーフ・メクヒール
Original assignee: Mosaid Technologies Inc
Current assignee: Mosaid Technologies Inc
Priority date: 2000-11-30
Filing date: 2001-11-28
Publication date: 2004-07-08
Also published as: CA2327134A1; EP1350168A2; CA2327134C

Abstract

The memory controller controls a buffer that stores the most recently used addresses and their associated data; the data stored in the buffer is only a portion of a row of data stored in main memory (the row head data). On a memory access executed by the CPU, the buffer and main memory are accessed simultaneously. If the buffer contains the requested address, the buffer immediately begins to provide the associated row head data to the cache memory. Meanwhile, the same row address is activated in the main memory bank corresponding to the requested address found in the buffer. After the buffer has provided the row head data, the remainder of the requested data is provided to the CPU by main memory.

Description

【技術分野】
【０００１】
この発明は、一般にコンピュータシステムにおける中央処理ユニット(ＣＰＵ)とメインメモリ間でデータを転送するための方法に関する。より詳しくは、この発明は、待ち時間を隠す機構を用いることにより、メインメモリへのアクセスで待ち時間を最小にするための種々の実行を開示する。
【背景技術】
【０００２】
マイクロプロセッサの速度および演算パワーは、技術の進歩に従って継続的に増しつつある。演算パワーでのその増大は、データの転送および、マイクロプロセッサおよびメインメモリ間のプロセッサ速度での命令に依存する。不幸にも、現在のメモリシステムは、そのデータをプロセッサに対して要求されるような速度で与えることはできない。
【０００３】
プロセッサは、速度の遅いメモリシステムに対して待ち状態にすることにより、待たなくてはならず、そのため、プロセッサは自身の規格速度よりもかなり遅い速度にてプロセッサを実行させている。この問題は、システムの全体的な特性を低下させる。この傾向は、プロセッサ速度とメモリ速度との間のギャップが大きくなっているために、悪化させている。プロセッサにおけるいかなる特性改善もシステム全体の重要な特性が得られなくなるポイントにまもなく到達しようとしている。メモリシステムはそのため、システムの特性を限定する要素となっている。
【０００４】
Amdahlの法則によれば、システムの特性改善は、改善できないシステムの部分によって制限される。この理由は次の例で示される。
プロセッサの時間の５０％がメモリアクセスに使用され、残りの５０％が内部の演算サイクルで使用されるなら、Amdahlの法則は、プロセッサの速度が１０倍に増しても、システムの特性は単に１．８２倍しか増大しないと述べている。Amdahlの法則は、コンピュータシステムの部分を強化することにより、得られた速度上昇は、次式により与えられる。
【数１】

【０００５】
強化された部分：強化が用いた時間の比率
強化された速度上昇：強化された部分を、元の部分の特性と比較した時の速度上昇
【０００６】
この例のように、プロセッサは、内部演算が時間の５０％しか占められていないので、プロセッサの強化された速度は、その時間の５０％について利点となる。
【０００７】
Amdahlの法則は、上記の数値を採用すると次式のようになる。
【数２】

【０００８】
強化されたプロセッサが元のプロセッサに比べて１０倍であっても、その強化は、時間の５０％についてのみ適用されないためである。速度上昇の計算は、元のシステムの特性と比較して１．８１８倍の全体的な特性強化が得られる。
【０００９】
もし、強化されたプロセッサが、元のプロセッサの速度の１００倍ならば、Amdahlの法則は、次式のようになる。
【数３】

【００１０】
このことは、システムの特性は、メモリへのおよびメモリからの５０％のデータアクセスにより制限されることを意味する。明白なように、メインメモリシステムの速度に対してプロセッサの速度が増大するにつれ、利点が減少する傾向がある。
【００１１】
この問題を解決するためにキャッシュメモリを使用しており、プロセッサによりアクセスされそうなデータを、プロセッサ速度に対応する高速のキャッシュメモリに移動している。第１レベルのキャッシュ(Ｌ１キャッシュ)および第２レベルのキャッシュ(Ｌ２キャッシュ)からなるキャッシュの階層を形成するために種々のアプローチが提案されてきた。理想的には、プロセッサにより最もアクセスされそうなデータは最速のキャッシュレベルに格納すべきである。レベル１(Ｌ１)およびレベル２(Ｌ２)の双方のキャッシュは、ダイナミックランダムアクセスメモリ(ＤＲＡＭ)を上回る利点の故に、スタティックランダムアクセスメモリ(ＳＲＡＭ)技術によって形成される。キャッシュの設計及びキャッシュが何を目標とするかの問題で最も重要なことは、プロセッサにより次に要求されるデータがキャッシュシステムに高い確率で格納されることである。キャッシュにてこの要求されたデータの選出の確率を高めるために、または“的中する”キャッシュを持つために、２つの主な法則：一次的な位置および空間的な位置が機能する。
【００１２】
一次的な位置とは、最も平均的なプロセッサ動作のために、プロセッサにより次に要求されるデータが高い確率で直ぐに要求されるという概念である。空間的な位置とは、プロセッサにより次に要求されるデータが、現在アクセスされているデータの次に高い確率でアクセスされるという概念である。
【００１３】
キャッシュの階層は、それゆえ、現在アクセスされているメインメモリデータから伝送すると共に、物理的に近接するデータから転送することにより、これらの２つの概念の利点を取るものである。
【００１４】
しかしながら、キャッシュメモリシステムは、高速のプロセッサをより低速なメインメモリから完全に切離すことはできない。プロセッサにより要求されたアドレスおよび関係したデータは、キャッシュ内には見つからず、キャシュ“ミス”とよばれる事態が発生する。このようなキャッシュミスにおいては、プロセッサは、データを得るためにより低速なメインキャッシュをアクセスする。これらのミスは、プロセッサの時間の一部を示しており、これは、システム全体の特性改善を制限する。
【００１５】
このキャッシュミス問題を対処するために、レベル２のキャッシュはしばしば全体的なキャッシュ階層を備える。レベル２のキャッシュの目的は、レベル１のキャッシュを用いずに、高速アクセスのためにプロセッサで利用できるデータ量を拡大することである。レベル２のキャッシュは、典型的にプロセッサ自身と同じチップ上に形成される。レベル２のキャッシュは、オフチップ(つまり、プロセッサおよびレベル１のキャッシュと同じダイ(die)上にない)なので、より大きくなり、レベル１のキャッシュとメインメモリの速度の間の速度で実行する。しかしながら、レベル１およびレベル２のキャッシュの使用を適正にし、そして、キャッシュメモリシステムとメインメモリシステムとの間でデータの一貫性を維持して、最新のデータがプロセッサで使用できるようにするために、キャッシュおよびメインメモリの双方は、常にアップデートされなくてはならない。もし、プロセッサメモリのアクセスがリードアクセスなら、このことは、プロセッサがメモリからデータまたはコードをリードする必要があることを意味する。もしこの要求されたデータまたはコードがキャッシュで見つからないならば、そのキャッシュコンテンツはアップデートされなくてはならず、一般的に同じキャッシュコンテンツを要求する処理は、メインメモリからのデータまたはコードで置き換えられなくてはならない。キャッシュコンテンツとメインメモリのコンテンツとの間で一貫性を確実にするために、２つの技術：ライトスルーおよびライトバックが用いられる。
【００１６】
The write-through technique writes data to both the cache and main memory when the data being written is found in the cache. This ensures that the same data is obtained whether it is read from the cache or from main memory. The write-back technique writes data only to the cache on a memory write access. To keep the cache and main memory consistent, the contents of a particular cache location are written to main memory when those contents are about to be overwritten. However, if the contents of that cache location have not been modified by a memory write access, they are not written to main memory. A flag bit is used to determine whether the contents of a particular cache location have been modified by a memory write access. If they have, the flag bit is set and the location is considered "dirty". Thus, if the flag bit of a particular cache location is "dirty", the contents of that location must be written back to main memory before being overwritten with new data.
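The difference between the two consistency techniques described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the class names, the dictionary standing in for main memory, the `capacity` parameter, and the oldest-first eviction policy are all assumptions made for illustration.

```python
class WriteThroughCache:
    """Every write goes to main memory; the cached copy is updated if present."""

    def __init__(self, memory):
        self.memory = memory          # dict: address -> value (stands in for main memory)
        self.lines = {}               # dict: address -> cached value

    def write(self, addr, value):
        if addr in self.lines:        # the data being written is found in the cache
            self.lines[addr] = value
        self.memory[addr] = value     # main memory is always updated


class WriteBackCache:
    """Writes touch only the cache; dirty lines are written back on eviction."""

    def __init__(self, memory, capacity):
        self.memory = memory
        self.capacity = capacity      # hypothetical number of cache locations
        self.lines = {}               # dict: address -> [value, dirty flag]

    def write(self, addr, value):
        if addr not in self.lines and len(self.lines) >= self.capacity:
            self._evict()
        self.lines[addr] = [value, True]   # mark the location dirty

    def _evict(self):
        # Assumed policy: evict the oldest inserted line (dicts keep insertion order).
        victim = next(iter(self.lines))
        value, dirty = self.lines.pop(victim)
        if dirty:                     # only modified contents are written back
            self.memory[victim] = value
```

With a one-location write-back cache, the first write stays in the cache and reaches main memory only when a second address forces the dirty location to be overwritten.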
【００１７】
キャッシュの的中率を増すための別のアプローチは、その結合率を増すことにある。結合率とは、キャッシュアクセスの間にサーチされた(つまり的中のためにチェックされた)キャッシュ内のライン数のことである。一般に、結合率がより高いと、キャッシュの的中率はより高くなる。直接にマップされるキャッシュシステムは、１：１のマッピングを持ち、それにより、キャッシュのアクセスの間に、唯一のラインが的中のためにチェックされる。スペクトラムの他方の端部にて、十分に関係したキャッシュが、典型的にコンテンツアドレス可能メモリ(ＣＡＭ)を用いて実行され、これにより、すべてのキャッシュライン(およびそれゆえ、すべてのキャッシュ位置)がサーチされ、そして、単一のキャッシュアクセスの間に同時に比較される。種々のレベルの結合率が実行される。
【００１８】
最終的にシステム全体の特性の改善を狙ったキャッシュ特性を改善するこれらのアプローチにも拘わらず、キャッシュ特性は、サイズ、結合率および速度のようなパラメータを変えることによる段階まで改善されるのみであることに気付くべきである。より低速なメインメモリを改善するための試みよりも、キャッシュシステムまたはシステムの高速メモリの改善を狙ったこのアプローチは、最後には飽和点に到達し、キャッシュの改善を通じたシステム全体の特性を改善する別のあらゆる試みも、システムの特性改善のレベルを低下を発生させる。おそらく、キャッシュがメインメモリと同程度に大きいなら、メインメモリの特性は、システム全体の特性での要因として排除され得るが、シリコンチップエリアの条件では、法外に高価となる。結果として、最小サイズのキャッシュでシステムの最大特性を得る方法が必要とされる。
【００１９】
プロセッサとメインメモリ間の速度の不適合は、近年、メインメモリの特性に重く依存するマルチメディアのような新しいアプリケーションのソフトウエアで問題になりつつある。不幸にも、メインメモリの特性は、このようなアプリケーションでの頻度の高いランダムなデータアクセスにより、制限される。キャッシュシステムはそれゆえ、このようなアプリケーションに使用された時、より効果が少なくなる。
【００２０】
プロセッサとメインメモリ間の速度の不適合を軽減するために、メインメモリの特性を改善する多数の試みが行われてきた。これらは、メインメモリの速度にいくらかの改善をもたらした。ＤＲＡＭへの初期の改善策は、ＤＲＡＭからアクセスサイクルにつき、複数ビットを得るものであり(一続きの(nibble)モード、又はより広いデータの固定出力)、内部的に種々のＤＲＡＭの動作をパイプライン処理するか、データを断片化し、これにより、いくつかのアクセス(ページモード、高速ページモード、拡張されたデータ出力(ＥＤＯ)モード)に対して、いくつかの動作を排除している。
【００２１】
ページモードは、ＤＲＡＭ内の列アドレスをラッチし、それをアクティブに維持することを含み、これにより、センスアンプに格納されるべきデータのページを有効的に排除している。高速ページモードでの行アドレスストロボ(ＣＡＳ)信号により、行アドレスがその後、ストロボ化されるページモードと違って、列アドレスストロボ(ＲＡＳ)信号がアクティブにされると同時に、行アドレスバッファがアクティブにされ、そして、明白なラッチとして作用し、行アドレスストロボの前に内部行データのフェッチを生じさせる。データ出力バッファのイネーブル化は、ＣＡＳがアクティブにされた時に達成される。新しい列をアクセスするために要求される列アドレスアクティブ化時間が同列上に留まることりより、排除されるので、これらの異なるページモードは、それゆえ、純粋なランダムアクセスモードに比べより高速である。
【００２２】
これに続く改善は、拡張されたデータ出力モードまたはＥＤＯモードを通じて、およびバーストＥＤＯモードにて実現されている。バーストＥＤＯモードは、各サイクルで新しいアドレスを与えることなく、連続的なデータのページがＤＲＡＭから復元されることを可能にする。しかしながら、バーストＥＤＯモードは連続的な情報のページを要求するグラフィックのアプリケーションでの使用に適するが、完全にサポートできるランダムアクセスを要求するメインメモリのアプリケーションに対してはより有用性が欠ける。
【００２３】
ＤＲＡＭ設計でのこのような改善は、より高い帯域幅でのアクセスを提供するが、それらは次の問題を呈する。
いくつかの散らばったメモリアクセスは、同じアクティブな列内でマッピングせず、それにより、高速ページモードの使用から利益を排除するので、プロセッサは新しいＤＲＡＭをより高い帯域幅で完全に用いることはできない。
新しいＤＲＡＭの設計はいくつかのバンクを持ってもよいが、高いページの的中率を持つためには、散らばったメモリアクセスを持つ典型的なプロセッサの環境に対しては十分な個数でない。
現在のプロセッサおよびシステムは、ＤＲＡＭへのメモリアクセスを遮り、これにより、これらのアクセスを局所的に低減する(第１および第２のレベルの)大きいキャッシュを使用し、このことが、更にアクセスを分散させ、そして、結果、ページの的中率を減じる。
【００２４】
システムの特性を改善するには、キャッシュシステムは無能であり、このことが、メインのＤＲＡＭメモリシステムの特性改善に更なる努力を必要とさせている。これらの努力の１つは、ＳＤＲＡＭ(同期したＤＲＡＭ)を用いることである。ＳＤＲＡＭは、高速のページモードを使用するアクセスに対する高い帯域幅を与えるために、多数のバンクおよび同期したバスを使用する。多数のＳＤＲＡＭバンクを備えることで、１つ以上のアクティブな列がプロセッサに、メモリの異なる部位からの高速なアクセスを与える。しかしながら、高速ページモードが使用されるためには、これらのアクセスは、バンクのアクティブな列内に存在しなくてはならない。さらには、メモリの帯域幅を増すために、単に複数のバンクへのアクセスすることへの依存は、バンク(この中にメモリが分割される)の数に基づき全体的に制限されることになる。
【００２５】
一般的に、制限されたバンク数、メインメモリ内の既にアクティブにされた列へのアクセスを遮る外部キャッシュシステム、および、アクセスされたデータの劣る空間的な位置、これらのすべてが、ＳＤＲＡＭからの特性の利益を制限する。
【００２６】
別の努力は、キャッシュＤＲＡＭ(ＣＤＲＡＭ)を用いることである。この設計は、ＤＲＡＭ内にＳＲＡＭに基づくキャッシュを組み込む。大きいブロックのデータは、その結果、単一のクロックサイクル内で、キャッシュからＤＲＡＭのアレーに、または、ＤＲＡＭからキャッシュに転送され得る。しかしながら、この設計は、外部の遮断するキャッシュ、およびデータ位置の低さにより引き起こされる、ＤＲＡＭ内のキャッシュの的中率の低さの問題に直面する。キャッシュのタグ、コンパレータおよびコントローラを要求することにより、内部キャッシュを制御し動作させるために、外部キャッシュに複雑さをも追加する。ＤＲＡＭに対して最適化された半導体製造プロセスにて、ＳＲＡＭをＤＲＡＭに統合させるために、多くのダイ(die)エリアの状況のためにかなりなコスト増となる。
【００２７】
より新しい設計では、プロセッサとＤＲＡＭをマージさせることであり、遮断するキャッシュ問題を排除し、プロセッサに完全なＤＲＡＭの帯域幅を与えている。このアプローチは、現在のプログラミングモデルにより使用された、散らばったメモリアクセスの性質のために、システムに複雑さを増し、低速と高速の技術を混合し、プロセッサのためのスペースを制限し、そして、高いＤＲＡＭの帯域幅を完全に用いることができない。
【００２８】
ＮＥＣの新規な仮想チャンネルＤＲＡＭ設計は、完全に結合した１６のチャンネルを用いており、これは、種々のソースによる使用のために、複数のコードおよびデータのストリームをトラックするように、高速のＳＲＡＭで形成される(非特許文献１)。本質的に、仮想チャンネルのＤＲＡＭは、ページモードの概念の拡張を示しており、１つのバンク／１つのページの限定が除去されている。結果として、多数のチャンネル(またはページ)が他のチャンネルから独立したバンク内に開放され得る。ＣＰＵは、例えば、仮想チャンネルのＤＲＡＭバンク内にランダムに配置された１６１ｋのチャンネルまでアクセスできる。結果として、ページ配分に対する繰返しのコンフリクトを生じさせることなく、複数のデバイス間のデータの移動が持続され得る。この仮想チャンネルのメモリは、各チャンネルに対応するメインメモリ位置がＣＰＵによりトラックされることを要求し、これにより、その制御機能を複雑化する。更に、そのＣＰＵは、そのチャンネルへのデータの有効的なプリフェッチのために、予測的なスキームを要求する。仮想チャンネルのＤＲＡＭはデータをチャンネルに転送するために高速ページモードを用いることを要求し、そして、ＶＣＤＲＡＭは、最終的に、キャッシュＤＲＡＭのように、関係するバッファにより消費される追加的なダイエリアのために高価となる。更に、備えられたキャッシュの量は、キャッシュ／ＤＲＡＭの比が通常固定されているので、いくつかのアプリケーションでは適切でないかもしれない。例えば、メインメモリがアップグレードされる時、追加的なキャッシュは必要とされず、そのため、システムのコストは無駄に高価となる。
【００２９】
近年、解決に向けたものとして、ＤＲＡＭの帯域幅を最大化するために、物理的なメモリアドレスを再マッピングするためのソフトウエアコンパイラを用いたようなソフトウエアが提案されている。これは、予知できる動作を有する特定のアプリケーションに対しては有利であるが、それはソフトウエアを交換することを要求し、それにより、複雑な問題を生じる。これらの努力は、高いレベルのアプローチを用い、そのため、ソフトウエアをハードウエアに合わせるために、アプリケーションのソースコードが修正される。このアプローチは、高価で時間を消費するだけでなく、すべてのソフトウエアに適用できない。
【００３０】
上述から、それゆえ何が要求されるのかというと、単純化されたメモリ制御機構に基づく解決であり、メインメモリに対し、簡単で、コストが有効で標準のＤＲＡＭを用い、広範囲なソフトウエアの書換えまたは複雑なアドレスのスキームを必要としないものである。このような解決は、理想的には、一時的および空間的な位置の双方の利点をとるべきである。最近アクセスされたデータは容易にアクセスできるのみならず、そのような最近にアクセスされたデータの接近した位置のデータも容易にアクセスできるべきである。
【非特許文献１】NEC Electronics Inc. Product Number Search「μPD4565161」
【発明の開示】
【課題を解決するための手段】
【００３１】
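A solution to the above problems is found in a method and apparatus that take advantage of both fast page mode and the concept of a high-speed buffer or cache. A memory controller controls a buffer that stores the most recently used addresses and their associated data, but the data stored in the buffer is only a portion of a row of data stored in main memory (referred to as row head data). On a memory access executed by the CPU, both the buffer and main memory are accessed simultaneously. If the buffer contains the requested address, the buffer immediately begins to provide the associated row head data to the cache memory in a burst. Meanwhile, the same row address is activated in the main memory bank corresponding to the requested address found in the buffer. After the buffer has provided the row head data, the remainder of the requested data is supplied to the CPU in a burst by main memory. In this way, a small amount of buffer memory can provide the function of a much larger L2 cache.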
【００３２】
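In a first aspect, the invention provides a method of retrieving data from a memory system, the method comprising:
(a) receiving a read request for the data contents of a memory location;
(b) searching a buffer portion of the memory system for a portion of the data contents;
(c) if the portion of the data contents is stored in the buffer, retrieving the portion from the buffer while simultaneously retrieving the remaining portion of the data contents from a main memory portion of the memory system; and
(d) if the portion of the data contents is not stored in the buffer, retrieving both the portion and the remaining portion of the data contents from main memory.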
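Steps (a) to (d) of the first aspect can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `read_line`, the dictionary used as the buffer, and the `bytes` object used as main memory are hypothetical, and the two retrievals that the method performs simultaneously are written sequentially here.

```python
def read_line(addr, line_bytes, buffer, main_memory):
    """Return (data, source) for a read of `line_bytes` bytes starting at `addr`.

    buffer      -- dict mapping an address to the first bytes stored for it
    main_memory -- bytes object standing in for main memory
    """
    head = buffer.get(addr)                      # (b) search the buffer portion
    if head is not None:
        # (c) hit: the head comes from the buffer, the remainder from main memory.
        rest = main_memory[addr + len(head):addr + line_bytes]
        return head + rest, "buffer+memory"
    # (d) miss: both the head and the remainder come from main memory.
    return main_memory[addr:addr + line_bytes], "memory"
```

For example, with `memory = bytes(range(256))` and `buffer = {64: memory[64:96]}`, a 64-byte read at address 64 is assembled from the buffer and main memory, while a read at address 128 is served by main memory alone.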
【００３３】
第２の態様では、この発明は、列ヘッドをラッチするために列ヘッドバッファ回路を提供し、列ヘッドは、メモリバンクに格納されたメモリの列の部分であり、前記ラッチする回路は、
各列ヘッド登録部は、メモリバンク内の列ヘッドに対応し、前記列ヘッド登録部を多数含む列ヘッドバッファと、
列ヘッドバッファ内に含まれる列ヘッド登録部の物理的アドレスをラッチする列アドレスラッチの多数と、
列ヘッド登録部を、到来する要求された列アドレスと比較するための列アドレスコンパレータとを備え、
到来する要求された列アドレスが前記多数のアドレスラッチの１つに適合する時、前記バッファ回路は、メモリコントローラによって要求される、到来の列アドレスを、前記多数の列アドレスラッチと比較し、アドレスラッチの適合に対応する列ヘッドデータ登録部は、前記メモリコントローラに送信される。
【００３４】
第３の態様では、この発明は、メモリバッファのサブシステムを提供し、これは、
多数のバッファ登録部をもつ少なくとも１つのバッファバンクと、
前記バッファのサブシステムを制御するバッファコントローラとを備え、
各バッファ登録部は、
メインメモリバンク内の位置に対応するメモリアドレスを含むアドレス領域と、
メインメモリバンクアドレスに位置する第１のｎバイトのデータを含むデータ領域とを含み、
前記メインメモリバンクアドレスに位置する前記データがＣＰＵにより要求された時、前記第１のｎバイトのデータは、前記バッファのサブシステムにより、前記ＣＰＵに与えられ、一方、前記データの残りは、メインメモリバンク内で前記メモリアドレスから回復される。
【００３５】
第４の態様では、この発明はメモリシステムを提供し、これは、
メインメモリの少なくとも１つのバンクと、
メモリコントローラと、
バッファと、および
バッファコントローラとを備え、
前記メモリコントローラは、メインメモリの少なくとも１つのバンクにて制御し、
前記バッファは、多数のバッファ登録部を含み、
各バッファ登録部は、アドレス部分とデータ部分を含み、
前記データ部分は、メインメモリの少なくとも１つのバンク内に第１のデータ部分を備え、前記アドレス部分は、メモリ位置を参照するアドレスを備える。
【発明を実施するための最良の形態】
【００３６】
添付した図面に関連して以下の詳細な記述を読むことにより、本発明をより理解することができるであろう。
【００３７】
図１を参照すると、この発明の論議をコンテクストに導入する目的のために、通常のＣＰＵ−メインメモリシステム１０が示されている。このシステムは、一般に、組み込みのレベル１のキャッシュ１７を有するＣＰＵ１５、キャッシュおよびメインメモリコントローラ２０、レベルＬ２のキャッシュ２５およびメインメモリ３０からなる。ホストデータバス１６は、ＣＰＵとメインメモリ３０とレベルＬ２のキャッシュ２５との間のデータを転送する。ホストアドレスバス１８は、メモリコントローラ２０およびレベルＬ２のキャッシュ２５にアドレス情報を与える。同様に、データバス２１およびアドレスバス２２は、コントローラバス２３を通じ、キャッシュおよびメモリコントローラ２０の制御に基づき、レベルＬ２のキャッシュをホストバデータス１６およびアドレスバス１８に接続する。メインメモリ３０は、メモリデータバス２６を通じホストデータバス１６に結合され、そして、アドレスバス２７および制御バス２８を通じコントローラ２０からアドレスおよび制御情報を受け取る。
【００３８】
典型的なリード／ライトデータの動作では、ＣＰＵ１５はリードデータ情報を例えばメモリコントローラ２０に出力し、そして、アドレス位置を与え、これをコントローラは列および行のアドレスおよびメモリ制御信号に変換する。コントローラ２０は、また、レベル２のキャッシュに対し、アドレスおよび制御情報を発生する。データがレベル１のキャッシュに見当たらない場合は、そのコントローラ２０は、所望のデータをメインメモリだけでなく、レベル２のキャッシュに対して探す。もし、レベル２のキャッシュにデータが見つかった場合、そのデータは、データバス２１を通じてホストデータバス１６に供給され、そのデータは次にＣＰＵ１５に戻す。そのデータは、再度要求されることを予測して、同時にレベル１のキャッシュに書き込まれる。もしそのデータがレベル１のキャッシュまたはレベル２のキャッシュに見当たらない場合(つまり、レベル１およびレベル２のキャッシュの双方でキャッシュミスが発生した時)、コントローラ２０は、ページモードアクセスを用い、メインメモリ３０からデータを直接にアクセスするように仕向ける。メモリデータバス２６を通じデータがＣＰＵ１５に転送されるのと同時に、ＣＰＵが再度そのデータを要求することを予測して、そのデータはレベル１のキャッシュ１７にもコピーされる。
【００３９】
上述したように、レベル１およびレベル２のキャッシュおよびメモリコントローラからなるこのような通常のシステムは、パフォーマンスを低下させる兆候を示す。今日のアプリケーションは、より高速およびランダム性を要求し、それにより、キャッシュミスおよびメインメモリアクセスを頻繁に発生させる。
【００４０】
図２Ａおよび２Ｂを参照すると、この発明の実施例に基づく待ち時間を隠すバッファ１００が示される。このバッファは、図１のＣＰＵ−メインメモリシステムと共に使用され得る。
【００４１】
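The buffer 100 consists of at least one buffer bank 110 and a buffer controller. Each buffer bank according to this embodiment of the invention is organized as an N-way set-associative cache memory containing a number of lines. Each buffer has a comparator 130 that compares the requested address with the addresses stored in the buffer bank. Each line contains a set address portion 150, a tag address portion 160, a most-recently-used (MRU) flag bit 180, and a data portion 170. The set portion 150 corresponds to the low-order bits of the main memory address stored in the buffer line. The tag portion 160 corresponds to the high-order bits of the main memory address stored in the buffer line. As in most set-associative cache systems, the buffer controller typically uses the set bits to address the high-order tag bits. The MRU flag bit 180 is used to determine which buffer entry should not be replaced when a new address entry is inserted. The data portion contains the data (the row head) associated with the memory address identified by the set and tag bits. In one embodiment, the row head contains only a portion of the desired number of data bits in a row of data in main memory. For example, the buffer bank 110 may store the first four data words of a typical 64-byte cache line as the row head, while the remainder of the data is stored in main memory. As a result, the buffer bank can store one quarter of a cache line, or some other fraction of a full cache line.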
【００４２】
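As for the MRU flag 180, a buffer bank entry whose MRU flag bit is set is the most recently used entry and should not be replaced. This is because temporal locality of reference suggests that this entry is likely to be the next one accessed. For the next requested address, the buffer is searched for an entry whose MRU flag bit is not set. The MRU flag bit is set for a buffer entry after that entry has been accessed; if an older buffer entry had its MRU flag bit set, that older entry then has its MRU flag bit reset, leaving the new entry as the only one with its MRU flag bit set. One MRU flag bit can be active for each associative set in the buffer.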
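The MRU flag handling described above can be sketched in Python for a single associative set. The class and method names are hypothetical, and the sketch assumes at least two ways per set so that an entry without the MRU flag always exists.

```python
class BufferSet:
    """One associative set of the buffer with a single active MRU flag."""

    def __init__(self, ways):
        if ways < 2:
            raise ValueError("sketch assumes at least two ways per set")
        self.mru = [False] * ways     # one MRU flag bit per entry

    def touch(self, way):
        """Set the MRU flag of the accessed entry and reset all others."""
        self.mru = [i == way for i in range(len(self.mru))]

    def choose_victim(self):
        """Return the first entry whose MRU flag is not set."""
        for way, flag in enumerate(self.mru):
            if not flag:
                return way
        raise RuntimeError("no replaceable entry")   # unreachable with >= 2 ways
```

Because `touch` clears every other flag, the entry accessed last is the only one protected from replacement, which matches the one-flag-per-set behaviour in the text.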
【００４３】
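An example illustrates the operation of a buffer bank. The buffer bank receives a decoded memory address from the main memory controller. The low-order bits of this memory address are used to determine which buffer bank, and which set within that bank, may hold a match. The high-order bits of the memory address are supplied to the comparator 130. The tag field of the selected buffer line is also supplied to the comparator 130. If the two match, the requested memory address is the one stored in the buffer line. The result is reported to the buffer controller, and the data is accessed in the buffer.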
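The lookup in this example can be sketched in Python: the low-order bits select a line, and the high-order bits are compared with the stored tag. The bit widths (`OFFSET_BITS`, `SET_BITS`), the class `BufferBank`, and its methods are assumptions for illustration; the patent does not specify them, and bank selection is omitted so that a single bank is modelled.

```python
OFFSET_BITS = 6   # assumption: 64-byte cache lines
SET_BITS = 4      # assumption: 16 sets in the bank


def split_address(addr):
    """Split an address into (set index, tag), ignoring the byte offset."""
    line = addr >> OFFSET_BITS
    set_index = line & ((1 << SET_BITS) - 1)   # low-order bits select the set
    tag = line >> SET_BITS                     # high-order bits form the tag
    return set_index, tag


class BufferBank:
    """A single buffer bank holding one tagged row head per set."""

    def __init__(self):
        self.tags = [None] * (1 << SET_BITS)
        self.heads = [None] * (1 << SET_BITS)

    def fill(self, addr, head):
        set_index, tag = split_address(addr)
        self.tags[set_index] = tag
        self.heads[set_index] = head

    def lookup(self, addr):
        """Return the stored row head on a tag match, else None (buffer miss)."""
        set_index, tag = split_address(addr)
        if self.tags[set_index] == tag:        # the comparator's job
            return self.heads[set_index]
        return None
```

An address that maps to the same set but carries a different tag misses, which is the case in which the line would have to be replaced.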
【００４４】
図２Ｂを参照すると、バッファコントローラ１０が図示されている。第１のグループの信号１９０は、バッファバンクから与えられ、選択されたバッファラインがそのＭＲＵの組みであるかそうでないかに拘わりなく、アドレスコンパレータの出力(アドレスの適合、不適合に拘わらない)を含むことができる。第２のグループの信号２００は、メインメモリコントローラから与えられる。これらは、メモリアクセスがリードまたはライトに拘わりなく、また、要求された列がアクティブまたはそうでないに拘わりなく、メモリアクセス要求の存在を示すような信号を含むことができる。
【００４５】
第３のグループの信号２１０は、バッファコントローラにより発生され、バッファバンクに供給される。これらは、バッファバンクへのリードまたはライト信号、およびＭＲＵビットのセッティングを含むことができる。第４のグループの信号２２０は、バッファコントローラにより発生され、メモリコントローラに供給される。これらは、メモリコントローラに対してメインメモリ内の特定の列をラッチするよう指令する信号、メインメモリ内の位置にデータを書き込む信号、または、指定されたオフセットでメインメモリ内の位置にアクセスする信号を含むことができる。
【００４６】
上述したバッファは、図１で示したＣＰＵメモリシステムの種々の部品で置き換え可能である。図３Ａ、３Ｂ、３Ｃおよび３Ｄを参照すると、待ち時間を隠すバッファの可能性ある配置が示されている。
【００４７】
図３Ａは、図１のすべての要素に、メモリコントローラ１２０の外部に位置する待ち時間を隠すバッファ１００を備えたものからなる。当業者には周知なごとく、図３Ａの各ブロックは、個別のチップまたはモジュール上に形成できる。例えば、メインメモリは、典型的に、メインメモリＤＩＭＭモジュール(Dual Inline Memory Module)を用いて作製され、そして、ＣＰＵおよびレベル１のキャツシュは典型的に、単一のモノシリックのマイクロプロセッサ内に形成される。典型的に個別のチップであるメモリコントローラは、通常、個別のチップとしてレベル２のキャッシュを含むチップセット内のマイクロプロセッサと一体に集積化される。図３Ａに示された実施例では、待ち時間を隠すバッファは、前記チップセットに対して統合化された別のチップ上に形成され、場合によっては、レベル２のキャッシュを置き替えるか、またはレベル２のキャッシュと関連して用いられる。図３Ｂは、ＤＲＡＭに基づくメインメモリと同一のチップ上にバッファが集積化されている別の可能性のある実施例を示す。図３Ｃは、レベル１のキャッシュおよびＣＰＵと同一のチップ上にバッファが集積化されている実施例を示す。最後に、図３Ｄは、メモリコントローラに集積化されたバッファを備え、レベル２のキャッシュを完全に置き換えた好ましい実施例を示す。これらの４つの実施例が示されているが、当業者は、ここで開示されたバッファの利点および概念を採用して、他の可能な結合を想到できるであろう。
【００４８】
図４は図３Ｄに対応するこの発明の好ましい実施例のより詳細な図面を示す。図４から理解されるように、複数のバッファバンク１１０は、メモリコントローラ２０に集積化されている。図４では単一のコンパレータ１３０が示されているが、各バッファバンク１１０はそれに関係したコンパレータを持つことに注目すべきである。
【００４９】
この発明の好ましい実施例にに基づくメモリコントローラ２０は、アドレスデコーダ２３０、メインメモリおよびキャッシュコントローラ２４０、バッファバンク１１０、コンパレータ１３０およびバッファコントローラ１２０を備える。アドレスデコーダ２３０は、要求されたアドレス(MemAddr)およびメモリアクセス信号(MemAcc)をＣＰＵから受け取る。そのアドレスデコーダ２３０は、要求されたメモリアドレスから次に、メインメモリ内の要求されたアドレスの列アドレスおよひ行アドレスを決定する。
【００５０】
要求されたメモリアドレスは、また、バッファ１１０に送出する。理解されるように、要求されたメモリアドレスの一部(SET)は、バッファバンク１１０を参照するために用いられる。要求された同じメモリアドレスの他の部分(TAG)は、コンパレータ１３０に送出される。コンパレータ１３０は、要求されたタグの分野を、バッファ１３０内のセット位置に格納されたタグと比較する。もし、要求されたアドレスのタグが、キャッシュ内のセット位置でのタグに適合するなら、バッファの的中が起きる。その位置が適合しないなら、バッファミスが起きる。セットの分野がバッファ１１０内のタグの分野をインデックスするために用いられる。バッファ１１０は、Ｎ通りの組みの関係するキャッシュメモリを用いて実行されるので、この検索および比較動作はすべてのＮ個のバッファで同時に起きる。行ったＮ回の比較がコンパレータ１３０からバッファ的中(BUFFER HIT)が生じる。比較結果によるバッファ的中はバッファ制御ブロック１２０に入力され、このブロックは制御信号バッファＯ／Ｅ、バッファＲ／ＷおよびＣＴＲＬを発生してメインメモリ及びキャッシュコントローラのブロック２４０に供給する。もし的中があれば、コンパレータ１３０は、バッファ的中ラインを経由してメインメモリおよびキャッシュコントローラ２４０へ的中を示す。
【００５１】
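The main memory and cache controller 240 receives the control signal (CTRL) from the buffer controller 120 and the MemAcc signal from the CPU. Based on the received control signals, the main memory and cache controller 240 generates the signals required to activate and access main memory. These signals include /RAS (row address strobe), /CAS (column address strobe) and /CS (chip select), which are well known to those skilled in the art.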
【００５２】
図５を参照すると、図４のメモリコントローラは、存在する２つ以上の信号：列ラッチ(ROW LATCH)および列的中(ROW HIT)を備えている。列ラッチは、メインメモリおよびキャッシュコントローラ２４０により発生され、アドレスデコーダ２３０に供給され、現在アクセスされている別の行が認識するまでそのアドレスデコーダ２３０をラッチし／能動化させるように指示する。アドレスデコーダ２３０により発生され、そして、メインメモリおよびキャッシュコントローラ２４０に供給される行的中信号は、メインメモリおよびキャッシュコントローラ２４０に対し、要求された行が既にラッチされたことを示す。図４および５のメモリコントローラは、共にメモリシステムとして使用できることに注目すべきであり、そのメモリシステムは、レベル２(Ｌ２)キャッシュを備えていてもいなくてもよい。
【００５３】
説明のために、バッファ登録部のデータは、要求されたメモリアドレスで格納された最初の数バイトであることに気付くべきである。そのため、ＣＰＵにはこのデータが与えられるが、要求されたメモリアドレス内のデータの残りは、メインメモリ／キャッシュから回復される。
【００５４】
これとは別に、バッファ登録部のデータが、メモリシステムのキャッシュ内のキャッシュラインを十分に満たしてもよい。それにより、バッファ的中時(要求されたメモリアドレスがバッファ内に見つかったとき)、そのバッファは、キャッシュライン全体をキャッシュに与える。このプロセスを援助するために、要求された列アドレス(要求されたアドレスからデコードされた)のラッチ動作は背後で行われてもよい。明確にするために、バッファ的中の有無に関係なく、列アドレスはメインメモリにラッチされてもよい。この構成により、次に要求されたアドレスがバッファになく、先に要求されたアドレスとして同じ列にあるなら、関係のある列はすでにアクティブになっており、これにより、通常、メインメモリアクセスに関係するセットアップおよびアクティブ化の時間を節約できる。この列のラッチ動作を用いる方法は、図５のメモリコントローラを使用するが、このラッチ動作を用いない方法は、図４のメモリコントローラを使用することに気付くべきである。理解できるように、図５のコントローラは２つの特別な信号、列的中および列ラッチを有する。その列的中は、メインメモリ／キャッシュコントローラ２４０に、(要求されたメモリアドレスを通じて)要求された列が既にラッチされていることを示す。列ラッチ信号は、メインメモリシステム内の特定の列をラッチする必要性をアドレスデコーダ２３０に、知らせることに役立つ。
【００５５】
図６を参照すると、図４のメモリシステムの動作を示したフローチャートが示されている。メモリアクセスに対する初期的なステップは、簡略化の観点でフローチャートから省略されている。要求されたメモリアドレスの受け取り、メモリアドレスをデコードしそしてメモリアクセス要求を受け取るステップは当業者には周知であり、困難を要しない。理解されるように、このプロセスは、要求されたアドレスがバッファ内に見つかったか否かを決定する判定３００でスタートする。
【００５６】
次に判定３１０が実行される。これは、メモリアクセスがリードまたはライトのアクセスなのかを決定する。そのアクセスがメモリライトなら、ステップＳ３２０に進む。ステップＳ３２０は、メインメモリへのライトを実行する。図示されるように、このステップにはバッファは含まれない。これとは別のように、メインメモリにライトされるべきデータをバッファ登録部に書き込むよう選択してもよい。これは、バッファをアクセスする際に要求される通常のステップを含み、そのステップについては後で詳しく述べる。
【００５７】
メモリアクセスがリードアクセスなら、そのバッファが利用され、一時的に上述したような並列処理が実行される。２つ以上の矢印が次の動作に供給される時、次の動作を開始する前に２つ以上の先行する動作を完了しなくてはならない。理解されるように、ステップＳ３３０、３４０および３５０は、ステップＳ３６０、３７０および３８０と共に並行に実行される。ステップＳ３３０、３４０および３５０は、メインメモリアクセスに関する。リード動作のために、周知でかつ確立された方法により、メインメモリがアクセスされ(ステップＳ３３０)、そのデータは要求されたメモリアドレスを用いて回復され(ステップＳ３４０)、そして、回復したデータはＣＰＵに送出される(ステップＳ３５０)。３つのすべてのステップは当業者には周知である。ステップＳ３６０、３７０および３８０は、リードデータのバッファへのコピーに関する。最初、ＭＲＵビットがセットされていないバッファ登録部を選出する(ステップＳ３６０)。そのＭＲＵビットの非アクティブな特性は、最後にアクセスされたバッファ登録部でないことを意味し、そのため、上書きしてもよい。このようなバッファが選出されると、当該データがバッファ登録部に書き込まれる(ステップＳ３７０)。このデータは、セットおよびタグ領域の適した位置にメモリアドレスを含み、メインメモリからデータが読み出される。このステップの後、この登録部に対するＭＲＵビットは、次のメモリアクセスでバッファ登録部が上書きされるのを防止するために、セットされる。
【００５８】
バッファ登録部のデータ部に書き込まれたデータは、要求された一部であることに注目すべきである。そのため、バッファがデータの最初の３２バイトのみをバッファリングするように構成されているなら、(ステップＳ３４０にて)メインメモリから読み出されたデータの全量および一部が、バッファ登録部に書き込まれる。もしバッファがキャッシュライン全体を格納できるように構成されているなら、情報の全量は、メインメモリからのデータから引き出され、バッファ登録部に格納される。
【００５９】
再度、図６を参照する。(ステップＳ３００の判定から)要求されたメモリアドレスがバッファ内にあるなら、ステップＳ３９０にて、メモリアクセスがリードかライトなのかが判定される。メモリリードなら、上述した一時的な並行処理が有利に採用される。ステップＳ４００、４１０および４２０は、バッファで実行される動作およびバッファにより実行される動作であり、一方、ステップＳ４３０、４４０および４５０は、そのバッファによる処理と同時に又は並行して、メモリにより実行されるステップである。
【００６０】
理解されるように、ステップＳ４００は、関係あるバッファ登録部のリードに関する。これは、バッファ登録部のデータ部に格納されたデータのリードを含む。この後、ステップＳ４１０にてバッファ登録部からリードされたデータをＣＰＵに送出する。最後に、前記バッファ登録部に対するＭＲＵビットがセットされる。
【００６１】
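As above, the corresponding address location in main memory is accessed using the requested memory address (step S430). The remainder of the data is then read from main memory using a preset offset. If the buffer is designed to store the first 32 bytes of the data, the main memory read begins 32 bytes after the point at which the memory read would normally start. Thus, if the memory read is from point X, the main memory read starts from X + 32 bytes, to account for the data sent from the buffer to the CPU. Normally, by the time the buffer has sent its data to the CPU, the setup time required to access main memory has elapsed.
【００６２】
Consequently, when the CPU has finished receiving the data from the buffer, the remainder of the requested data arriving from main memory is just reaching the CPU. Step S450 actually sends that data and is the last step performed for the main memory access.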
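The effect of the offset read can be sketched with a simple cycle count in Python. All figures here (bus width, setup cycles, function names) are illustrative assumptions rather than values from the patent; the model only captures the idea that the buffer transfer overlaps the main memory setup time.

```python
def main_memory_start(x, head_bytes):
    """Address at which the main memory read begins on a buffer hit."""
    return x + head_bytes


def read_cycles(line_bytes, head_bytes, bytes_per_cycle, setup_cycles, buffer_hit):
    """Cycles until the whole line has reached the CPU (simplified model)."""
    if not buffer_hit:
        # Main memory must be set up first, then delivers the full line.
        return setup_cycles + line_bytes // bytes_per_cycle
    head_cycles = head_bytes // bytes_per_cycle
    rest_cycles = (line_bytes - head_bytes) // bytes_per_cycle
    # The buffer transfer and the main memory setup proceed in parallel.
    return max(head_cycles, setup_cycles) + rest_cycles
```

With a 64-byte line, a 32-byte head, 8 bytes per cycle and an assumed 4-cycle setup, the miss path takes 12 cycles and the hit path 8, because the setup time is hidden behind the buffer transfer.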
【００６３】
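If, on the other hand, the memory access is a write access, steps S460, S470, S480 and S490 are executed. As can be seen from FIG. 6, steps S460 and S470 are executed in parallel with steps S480 and S490. In step S460, the data to be written is written to the relevant buffer entry. As a result, the buffer entry found to correspond to the requested address is overwritten with the data supplied by the CPU. The MRU bit of the buffer entry is then set to prevent the entry from being overwritten on the next memory access. In parallel with these steps, steps S480 and S490 concern main memory. In step S480, main memory is accessed; the relevant signals needed to access main memory are generated in this step. Note that, unlike when reading the same data in step S440, no offset is required when writing data to main memory, because the complete data is written to main memory. Writing to both main memory and the buffer avoids the output of stale data.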
【００６４】
上述したプロセスは、バッファが要求されたデータの始まりの部分のみをバッファリングするように設計された時に、最良の結果を生む。しかしながら、上記方法を避けるために、キャッシュライン全部を格納することを用いることはできないことは言うまでもない。キャッシュライン全部を格納するバッファは、上述した方法の利点をもつことができる。
【００６５】
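A refinement added to the concept of the above method is to keep the latched row active. A requested address refers to a row in main memory. If that row is already active when a second requested address arrives, and the second requested address refers to the same row, data retrieval is faster, because the setup time for accessing the requested row is eliminated: the row is already active. Combined with the buffer, the concept of keeping a row latched offers significant benefits in terms of faster memory access.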
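The benefit of keeping a row active can be sketched in Python with a toy model of one memory bank. The class name `DramBank` and the cycle counts are assumptions chosen for illustration only.

```python
class DramBank:
    """Toy model of one main memory bank that keeps its last row active."""

    def __init__(self, activate_cycles=4, column_cycles=1):
        self.activate_cycles = activate_cycles   # assumed row setup cost
        self.column_cycles = column_cycles       # assumed column access cost
        self.active_row = None

    def access(self, row):
        """Return (row_hit, cycles) and leave `row` active afterwards."""
        row_hit = row == self.active_row
        cycles = self.column_cycles
        if not row_hit:
            cycles += self.activate_cycles       # the row must be activated first
        self.active_row = row                    # keep the row latched
        return row_hit, cycles
```

A second access to the same row skips the activation cost, which is the saving the paragraph above describes.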
【００６６】
図７を参照すると、図５のメモリコントローラを用いて実行できるプロセスのステップを示したフローチャートが示される。アクセスのリードのために用いられるべきこのプロセスは、上述に関する、列ラッチの概念を用いる。ステップＳ５００で開始し、メモリアクセスが初期化される。このステップは、要求されたメモリアクセスを受け取り、そしてメモリアクセスがリードアクセスであると決定することを含む。次にステップＳ５１０が実行される。このステップは、要求されたメモリアクセスをデコードし、そして、要求されたアドレスがどの列にあるかを決定することを含む。この時点で、プロセスは、バッファが与える一時的な並行処理の利点をもつ。ステップＳ５２０および５３０は同時に実行される。これにより、要求された列が既にアクティブなのかどうか、および、要求されたアドレスがバッファ内にあるかどうかのチェックがなされる。
【００６７】
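If the buffer is configured to buffer only the beginning portion of the requested data, that is, the row head, the leftmost and rightmost branches of the flowchart of FIG. 7 are readily executed. If the decisions in steps S530 and S520 are both "Y", steps S540, S550, S560, S570, S580, S590 and S600 are executed in parallel. The first portion of the data is retrieved from the buffer entry (step S540) and sent to the CPU (step S550). Step S560, the main memory access, is accomplished faster because the row address is already active: the usual activation time associated with a main memory access is avoided. Ideally, this main memory access is performed using fast page mode (FPM). The remainder of the requested data is retrieved from main memory (step S570). This retrieval, however, is performed using an offset, in the same way as described above, to account for the data already sent to the CPU (step S580). Meanwhile, on the buffer side, the accessed buffer entry has its MRU bit set. On the main memory side, the active row is kept active for the next memory access. If the decision in step S530 is "Y" while the decision in step S520 is "N", steps S540, S550 and S590 are executed by the buffer, while steps S610, S620, S630 and S640 are executed by the main memory system, which operates in parallel with the buffer. For the main memory system, step S610 accesses main memory using well-known random access techniques, which involves sending the appropriate /CAS, /RAS and /CS signals at the appropriate times. Step S620 retrieves the remainder of the requested data from main memory using a column offset, to compensate for the data already supplied to the CPU by the buffer in step S550. Step S630 then sends this retrieved data to the CPU. In step S640, once the row address has been accessed and activated, it is kept active in anticipation of the next memory access.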
【００６８】
ステップＳ５２０の判定で“Ｙ”となり、ステップＳ５３０では“Ｎ”であれば、バッファはステップＳ６５０、６６０および６７０を実行し、一方、メインメモリシステムはステップＳ５６０、５７０、５８０および６００を実行する。そのため、要求されたデータがバッファ内に存在しないなら、その後、バッファ内に入れられなければならない。ステップＳ６５０では、取替えるためにバッファ登録部を選択する。これは、ＭＲＵがセットされていないバッファ登録部の選択を含む。これが実行されている間、メインメモリシステムは、(上述したステップＳ５６０および５７０を参照)要求されたデータを、オフセットなしでメインメモリから回復する。要求されたデータの第１の部分を送出していない時、そのオフセットは用いられない。そのため、その部分を補償する必要がない。
【００６９】
一旦、メインメモリからデータが回復されると、回復されたデータの第１の部分は、その後、選択されたバッファ登録部に格納される(ステップＳ６６０)。その後、ＭＲＵビットは、次のメモリアクセスで上書きされないように、このバッファ登録部にセットされる。
【００７０】
ステップＳ５２０および５３０の判定で共に“Ｎ”であったなら、メインメモリシステムは、ステップＳ６１０、６２０、６３０および６４０を実行し、一方、バッファはステップＳ６５０、６６０および６７０を実行する。バッファは、データを回復するためにアクセスされなず、そこに書き込まれたデータを持つだけなので、その後のステップＳ６２０では、補償するものがないので、メインメモリシステムはオフセットを用いない。
【００７１】
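Note that connectors A and B in FIG. 7 are used to show that, although most of the steps described above are executed in parallel, some steps are executed before others. For example, after step S550 has been executed, steps S590, S580 and S600 are executed in parallel (see connector B). If, on the other hand, the decision in step S520 is "N" and the decision in step S530 is "Y", steps S590, S630 and S640 are executed in parallel after step S550 (see connector B). Alternatively, if the decision in step S520 is "Y" and the decision in step S530 is "N", steps S580 and S600 are executed in parallel with steps S660 and S670, as connector A shows.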
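The four outcomes of the two decisions in FIG. 7 (row already active at step S520, address in the buffer at step S530) can be summarised as a small decision table. The Python sketch below only restates the step groupings given in the description; the function name `plan_read` and the dictionary layout are hypothetical.

```python
def plan_read(buffer_hit, row_active):
    """Return the step groups of FIG. 7 for one read access.

    buffer_hit -- outcome of decision S530 (requested address is in the buffer)
    row_active -- outcome of decision S520 (requested row is already active)
    """
    if buffer_hit:
        buffer_steps = ["S540", "S550", "S590"]    # read entry, send head, set MRU
    else:
        buffer_steps = ["S650", "S660", "S670"]    # pick victim, store head, set MRU
    if row_active:
        memory_steps = ["S560", "S570", "S580", "S600"]   # fast page mode path
    else:
        memory_steps = ["S610", "S620", "S630", "S640"]   # random access path
    return {
        "buffer": buffer_steps,
        "memory": memory_steps,
        # The offset is needed only when the buffer has already supplied the head.
        "use_offset": buffer_hit,
    }
```

The buffer and main memory step groups run in parallel in the described system; the sketch lists them side by side without modelling the ordering imposed by connectors A and B.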
【００７２】
図８を参照すると、書き込み動作のためのステップが示される。そのプロセスは、メモリアクセスの初期化で始まる(ステップＳ６８０)。上述から気付くように、これは、要求されたアドレスをデコードし、ＣＰＵからライト命令を受け取り、そして要求されたアドレスをメモリデコーダとバッファに送出することを含む。次に、メインメモリシステムは、ステップＳ６９０、７００および７１０を実行し、これと並列に、バッファはステップＳ７２０、７３０(もし要求あれば)、７４０、７５０および７６０を実行する。
【００７３】
メインメモリシステムのために、ステップＳ６９０にて、ＦＰＭを用いるが、あるいは用いずに、メインメモリをアクセスする。ステップＳ７００では、データがメインメモリに書き込まれ、そしてステップＳ７１０にて、次のメモリアクセスのために、アクセスされた列のアクティブ状態が維持される。（アクティブな列の個数は、システム設計者の判断にまかされることに気づくべきである。そのような設計者は、ＤＲＡＭバンクにつき、１列のみをアクティブにするか、バンクにつき、複数のアクティブの列を持つように希望してもよい。）バッファに対する最初のステップであるステップＳ１において、要求されたアドレスがバッファ内にあるかの判定がなされる。もし要求されたアドレスがバッファ内にあるなら、データがバッファ登録部に書き込まれる(ステップＳ７４０)。他方、要求されたアドレスがバッファ内に無い時は、バッファ登録部が取替えられる。そのため、ステップＳ７５０では、取替えられるべきバッファが選択される。これは、ＭＲＵビットがセットされていないバッファ登録部の選択を伴う。その後、取替えられるべきこのバッファ登録部が一旦、選択されると、データがそれに書き込まれる(ステップＳ７４０)。ステップＳ７４０にて書き込まれるバッファ登録部は、要求されたアドレスがバッファ内にあるかに依存する。もし存在するなら、データは選択されたバッファ登録部に書き込まれる。もし存在しなければ、取替えられる、または上書きされるバッファ登録部が選択される。その後、データがバッファ登録部に一旦書き込まれると、前記バッファ登録部に対してＭＲＵをセットする。そのデータは、バッファおよびメインメモリの双方に書き込まれ、両者でデータを一致させる。この例では、データの始めの部分(つまり列ヘッド)のみがバッファに書き込まれることに気付くべきであり、このことはこの例に対してバッファがいかにして形成されるかを示す。
【００７４】
図８に示したライトプロセスはまた、バッファが一杯のキャッシュラインをバッファリングするように形成された場合にも適用できる。この例と上述した例との唯一の差異は、プロセッサキャッシュライン全部がバッファ内に蓄えられる点である。
【００７５】
バッファがキャッシュライン全部をバッファリングする時のリードアクセスに対して、いくつかの可能性が存在する。上述で気付くように、もし、アクセス後に列アクティブを維持するプロセスが用いられるなら、図５の特別な列的中および列ラッチを有するメモリコントローラが使用される。図９および図１０は、図７に示したプロセスに似る２つの可能なプロセスを示す。図９および図１０の方法は、要求されたアドレスがバッファ内およびアクティブな列内に見つかったなら、デフォールト位置を持つ点で異なる。図９において、もし要求されたアドレスがアクティブな列内とバッファ内に見つかったなら、データはバッファから回復される。図１０において、同じことが事実なら、次にメインメモリがアクセスされる。
【００７６】
図９および図１０を参照すると、バッファがキャツシュライン全部をバッファリングするために形成されているなら、および、列ラッチの概念が用いられるなら、リード動作のための２つの似たプロセスが示されている。これらの２つのプロセスは、要求されたアドレスがバッファとアクティブな列の双方に存在する時のみ、異なる。
【００７７】
図９を参照すると、周知の方法および上述した他のプロセス内でのメモリアクセスの実行と同様な方法により、メモリアクセスがステップＳ７７０にて実行される。要求されたメモリアドレスは次にステップＳ７８０にてデコードされる。次のステップＳ７９０および８００にて並列に実行される。要求されたアドレスがバッファ内にあるかを見るために、バッファがチェックされ(ステップＳ７９０)、そして、要求されたアドレスがアクティブな列内にあるかを見るために、アクティブな列がチェックされる。これらのチェックに基づき、一連の判定がなされる。判定のステップＳ８１０は、要求されたアドレスがバッファおよびアクティブな列の双方にあるかがチェックされる。“Ｙ”と判定されたなら、その後、２つの分岐(一方がステップＳ８２０で他方がステップＳ８３０、８４０、８５０および８６０)が並列に実行される。ステップＳ８２０では、ステップＳ８００で見つかった列のアクティブ状態が維持されることに注目する。ステップＳ８３０、８４０、８５０および８６０は、バッファ内で並列に実行される。ステップＳ８３０では、バッファへのアクセスが実行される。ステップＳ８４０は、バッファから要求されたデータを、前記要求されたアドレスに対応するバッファ登録部から実際に回復する。その後、この回復されたデータは、ＣＰＵに送出される(ステップＳ８５０)。前記バッファ登録部が次のメモリアクセスで上書きされないように、そのバッファ登録部で見つかったＭＲＵビットは、ステップＳ８６０にてセットされる。
【００７８】
もしステップＳ８１０の判定で“Ｎ”となったなら、次にステップＳ８７０の判定がなされる。ステップＳ８７０は、要求されたアドレスがアクティブな列内にあり、バッファ内に無いかを判定する。もし、その場合、メインメモリのステップＳ８８０、８９０、９００および９１０の実行と並列に、バッファがステップＳ９２０、９３０、９４０および９４０を実行する。メインメモリシステムに対しては、ステップＳ８８０にて、高速ページモードを用いてメインメモリにアクセスする。これは、要求されたアドレスが既にアクティブにある列内にある時に実行される。次のステップＳ８９０は、メインメモリからデータを回復する。ステップＳ９００は、回復したデータをＣＰＵに送出する一方、ステップＳ９１０は列のアクティブ状態を維持する。バッファに対しては、そのプロセスの部分は、バッファ内に回復したデータを格納するために実行される。ステップＳ９２０は、取替えられるバッファ登録部を選択する。一旦、バッファ登録部が選択されると、ステップＳ８９０にて回復されたデータは、選択されたバッファ登録部に格納さ(ステップＳ９３０)、これにより、選択されたバッファ登録部の古いコンテンツに対して上書きされる。次のステップＳ９４０は、次のデータアクセスでこの特定のバッファ登録部が上書きされないように、ＭＲＵビットをセットする。しかしながら、接続ＣはステップＳ８９０の実行後のみ、ステップＳ９３０が実行されることに気付くべきである。データがメインメモリから回復された(ステップＳ８９０)後のみ、データがバッファ登録部に書き込まれる(ステップＳ９３０)。
【００７９】
もしステップＳ８７０にて“Ｎ”となったなら、判定のステップＳ９５０に進む。この判定は、要求されたアドレスがアクティブな列内にあるかを決定する。もし、その決定が真実なら、メインメモリシステムによるステップＳ１０００、１００２、１００４および１００６の実行と並列に、バッファは、ステップＳ９６０、９７０、９８０および９９０を実行する。バッファにおいて、ステップＳ９６０でバッファへのアクセスが行われる。ステップＳ９７０は、バッファから要求されたデータを実際に回復し、一方、ステップＳ９８０では要求され、回復されたデータをＣＰＵに送出する。バッファに対して先の分岐で実行されたように、ステップＳ９９０は、バッファ登録部が次のデータアクセスで上書きされないように、ＭＲＵビットがセットされる。ＭＲＵビットをセットするステップはまた、別のバッファ登録部に対して先に行ったＭＲＵビットのセットを解除することを含むことは明白である。このように、単一のバッファ登録部が、セットされたＭＲＵビットを持つ。同様に、メインメモリ内の列をアクティブにするステップ(ステップＳ１０００)も、先にアクティブであった列を非アクティブにすることを含む。このように、一度に最小の列がアクティブにされる。列がアクティブにされた後、ステップＳ１００２にあるように、メインメモリからデータがアクセスされる。このデータは、その後、ＣＰＵに送出され(ステップＳ１００４)、そして、列のアクティブ状態が維持される(ステップＳ１００６)。メインメモリシステムの形態に依存して、メインメモリシステム全体の中で１つの列のみがアクティブにされてもよく、または、(複数のバンクメインメモリシステムに対して)メインメモリバンクにつき１つの列がアクティブにされる。最終の末端ユーザーの要求に依存して、異なる形態が採用されてもよい。
【００８０】
再度、ステップＳ８７０の判定で“Ｎ”であれば、メインメモリシステムおよびバッファシステムは、一連のステップを並列に実行する。メインメモリシステムがステップＳ１０４０、１０５０、１０６０および１０７０を実行する一方、バッファに対しては、ステップＳ１０１０、１０２０および１０３０が実行される。バッファに対するステップＳ１０１０は、ＭＲＵビットがセットされていないバッファ登録部の選出を含む。このバッファ登録部のコンテンツは、回復されるべき新しいデータに置きかえられる。ステップＳ１０５０にてメインメモリシステムによりデータが回復され、ステップＳ１０２０は、その回復されたデータを選択されたバッファ登録部に書き込むことを含む。ステップＳ１０３０では、選択されたバッファ登録部に対し、ＭＲＵビットをセットする。
【００８１】
メインメモリシステムに対しては、ステップＳ１０４０にて、要求されたアドレスに格納されたメインメモリのデータをアクセスする。このメモリアクセスは、ＦＰＭが使用できず、要求された列がアクティブでない時、周知のランダムアクセス方法を用いて実行される。ステップＳ１０５０は、メインメモリがステップＳ１０４０にてアクセスされた後、メインメモリからデータを回復することを含む。この回復されたデータは、ステップＳ１０６０にてＣＰＵに送出される。このデータは、ステップＳ１０２０にて選択されたバッファ登録部に書き込まれたデータと同じか、それの一部である。次にステップＳ１０７０は、(ステップＳ１０４０にて)アクセスされた列を、アクティブとして設定し、これにより、次のメモリアクセスで、もし可能なら、ＦＰＭの使用が可能になる。
【００８２】
上記した接続Ｃと同様に、接続Ｄは、ステップＳ１０２０は、ステップＳ１０５０が実行された後にのみ実行され得ることを示す。これにより、ステップＳ１０５０が実行された後のみ、ステップＳ１０２０および分岐した他のその後のステップが実行される。(ステップＳ１０５０で)データが回復された後のみ、同じデータがバッファ登録部に書き込まれる(ステップＳ１０２０)。
【００８３】
図１０においては、第１の判定(ステップＳ８１０)で“Ｙ”となった場合に実行されるステップを除き、フローチャート中のすべてのステップは図９のものと同じである。もし、その場合、つまり、要求されたアドレスがバッファおよびアクティブな列の双方にある場合、バッファがステップＳ１１２０を実行する一方、メインメモリはステップＳ１０８０、１０９０、１１００および１１１０を実行する。
【００８４】
メインメモリシステムに対し、ステップＳ１０８０では、ＦＰＭを用い、メインメモリにアクセスされる。これは、ステップＳ８１０の判断にて、要求されたアドレスがアクティブな列にあると決定された時に実行される。データを実際に回復するステップＳ１０９０は、ステップＳ１０８０の後に実行される。ステップＳ１１００では、回復されたデータがＣＰＵに送出され、ステップＳ１１１０では、今アクセスされた列のアクティブ状態が維持される。バッファに対しては、ステップＳ１１２０にて、要求されたアドレスに対応するバッファ登録部に対してＭＲＵビットをセットする。このことを効果的に言うと、バッファ登録部は、そのコンテンツがリードされなくても、あるいは修正されなくても、アクセスされた最後のものである。
【００８５】
上述した装置およびプロセスに対する他の多くの形態が可能である。レベル２のキャッシュを使用でき、それへのアクセスは、上述した概要のプロセスに適用できる。
【００８６】
上述した発明を理解した人は、ここで述べた原理を用いて別の設計を計画することができる。添付した請求の範囲内に収まるこのようなすべての設計は、本発明の一部である。
【図面の簡単な説明】
【００８７】
【図１】従来技術によるＣＰＵメモリシステムの概略ブロック図
【図２Ａ】この発明に基づくバッファバンクの概略図
【図２Ｂ】図２Ａのバッファバンクを制御するバッファコントローラのブロック図
【図３Ａ】メモリコントローラから分離してバッファシステムを実行するメモリシステムのブロック図
【図３Ｂ】メインメモリの一部としてバッファシステムを実行するメモリシステムのブロック図
【図３Ｃ】ＣＰＵの一部としてバッファシステムを実行するメモリシステムのブロック図
【図３Ｄ】メモリコントローラの一部としてバッファシステムを実行するメモリシステムのブロック図
【図４】この発明の実施のための詳細ブロック図
【図５】図４に示した実施の変形例である詳細ブロック図
【図６】この発明の第１の態様に基づくメモリアクセスの方法におけるステップを示したフローチャート
【図７】この発明の第２の態様に基づくメモリアクセスの方法におけるステップを示したフローチャート
【図８】図７に示した方法で使用されるべきライトアクセス法のためのステップを示したフローチャート
【図９】この発明の第３の態様に基づくメモリアクセスの方法におけるステップを示したフローチャート
【図１０】図９に示した方法の変形例におけるステップを示したフローチャート
【符号の説明】
【００８８】
１０ＣＰＵメインメモリシステム
１５マイクロプロセッサ
１７Ｌ１キャッシュ
２０キャッシュ及びメインコントローラ
２５Ｌ２キャシュ
３０メインメモリ
１００待ち時間を隠すバッファ
１１０バッファバンク
[Technical Field]
[0001]
The present invention generally relates to a method for transferring data between a central processing unit (CPU) and a main memory in a computer system. More specifically, the present invention discloses various implementations for minimizing latency in accessing main memory by using a latency hiding mechanism.
[Background Art]
[0002]
The speed and computing power of microprocessors are continually increasing as technology advances. This increase in computing power depends on transferring data and instructions between the microprocessor and main memory at processor speed. Unfortunately, current memory systems cannot supply data to the processor at the required rate.
[0003]
The processor must wait for the slower memory system by inserting wait states, which causes it to run at a much slower speed than its rated speed. This problem degrades the overall performance of the system. The trend is worsening because the gap between processor speed and memory speed keeps widening. It will soon reach the point at which any performance improvement in the processor yields no significant gain in overall system performance. The memory system therefore becomes the factor that limits system performance.
[0004]
According to Amdahl's law, the performance improvement of a system is limited by the portion of the system that cannot be improved. The reason is illustrated in the following example.
If 50% of the processor's time is spent on memory accesses and the remaining 50% on internal computation cycles, Amdahl's law states that even if the processor speed is increased by a factor of 10, system performance increases by only a factor of 1.82. Amdahl's law states that the speedup obtained by enhancing a portion of a computer system is given by:
(Equation 1)
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

[0005]
$\text{Fraction}_{\text{enhanced}}$: the fraction of the time during which the enhancement is used
$\text{Speedup}_{\text{enhanced}}$: the speedup of the enhanced portion compared to the performance of the original portion
[0006]
In this example, the processor is occupied with internal operations only 50% of the time, so the enhanced processor speed yields a benefit for only that 50% of the time.
[0007]
With the above values, Amdahl's law becomes:
(Equation 2)
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{10}} = 1.818$$

[0008]
This is because, even though the enhanced processor is ten times as fast as the original processor, the enhancement applies for only 50% of the time. The speedup calculation yields an overall performance improvement of 1.818 times relative to the original system.
[0009]
If the enhanced processor is 100 times faster than the original processor, Amdahl's law is:
(Equation 3)
$$\text{Speedup} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{100}} = \frac{1}{0.505} \approx 1.980$$
[0010]
This means that the performance of the system is limited by the 50% of the time spent on data accesses to and from memory. Clearly, the benefit diminishes as the speed of the processor increases relative to the speed of the main memory system.
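The diminishing return described above can be checked numerically. The following is a minimal sketch of the Amdahl's law calculation, using the 50% compute fraction from the example; the function name `amdahl_speedup` and the extra 1000x data point are illustrative assumptions, not part of the disclosure.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# 50% of the time is internal computation (enhanced); the other 50% is memory access.
for factor in (10, 100, 1000):
    overall = amdahl_speedup(0.5, factor)
    print(f"processor {factor}x faster -> system {overall:.3f}x faster")
```

However fast the processor becomes, the overall speedup in this example stays below 2, because the memory access half of the run time is untouched.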
[0011]
To address this problem, cache memory is used: data likely to be accessed by the processor is moved to a fast cache memory that matches the processor speed. Various approaches have been proposed for building a cache hierarchy consisting of a first level cache (L1 cache) and a second level cache (L2 cache). Ideally, the data most likely to be accessed by the processor should be stored in the fastest cache level. Both level 1 (L1) and level 2 (L2) caches are implemented in static random access memory (SRAM) technology because of its speed advantage over dynamic random access memory (DRAM). The most important issue in cache design, and the goal of any cache, is to ensure that the data next required by the processor is, with high probability, stored in the cache system. Two main principles operate to increase the probability of finding the requested data in the cache, that is, of having a cache "hit": temporal locality and spatial locality.
[0012]
Temporal locality is the concept that, in most typical processor operation, data that has just been requested by the processor is highly likely to be requested again soon. Spatial locality is the concept that the next data requested by the processor is highly likely to be located next to the data currently being accessed.
[0013]
The cache hierarchy therefore exploits these two concepts by transferring from main memory both the data currently being accessed and the data physically adjacent to it.
[0014]
However, cache memory systems cannot completely isolate a fast processor from a slower main memory. When the address and associated data requested by the processor are not found in the cache, a cache "miss" occurs. On such a cache miss, the processor must access the slower main memory to get the data. These misses account for a fraction of the processor time that limits the overall system performance improvement.
[0015]
To address this cache miss problem, a level 2 cache is often included in the overall cache hierarchy. The purpose of the level 2 cache is to increase the amount of data available to the processor for high-speed access without enlarging the level 1 cache, which is typically formed on the same chip as the processor itself. Because the level 2 cache is off-chip (i.e., not on the same die as the processor and the level 1 cache), it can be larger, and it runs at a speed between that of the level 1 cache and that of main memory. However, to use the level 1 and level 2 caches properly, and to maintain data consistency between the cache memory system and the main memory system so that the latest data is available to the processor, both the caches and main memory must be kept up to date. If the processor's memory access is a read access, the processor needs to read data or code from memory. If the requested data or code is not found in the cache, the cache content must be updated, which generally requires that some cache content be replaced with data or code from main memory. Two techniques are used to ensure consistency between cache content and content in main memory: write-through and write-back.
[0016]
The write-through technique writes data to both the cache and main memory when the data being written is found in the cache. This technique ensures that the same data is obtained regardless of whether the cache or main memory is accessed. In the write-back technique, data is written only to the cache at the time of a memory write access. To ensure consistency between the data in the cache and the data in main memory, the cache contents at a particular cache location are written to main memory when those cache contents are about to be overwritten. However, if the cache contents at that location have not been changed by a memory write access, they are not written to main memory. A flag bit is used to determine whether the cache contents at a particular cache location have been changed by a memory write access. If the cache contents have been changed by a write access, the flag bit is set, and the location is considered "dirty". Therefore, if the flag bit at a particular cache location is "dirty", the cache contents at that location must be written back to main memory before being overwritten with new data.
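The write-back behaviour described above can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions: the cache holds a single line, main memory is a Python dict, a write to an uncached address first loads that address, and the class name `OneLineWriteBackCache` and its counter are hypothetical names chosen for illustration.

```python
class OneLineWriteBackCache:
    """Single-line write-back cache in front of a dict that stands in for main memory."""

    def __init__(self, main_memory: dict):
        self.memory = main_memory
        self.address = None      # address currently held in the cache line
        self.data = None
        self.dirty = False       # flag bit: set when the line is changed by a write
        self.memory_writes = 0   # counts write-backs to main memory

    def _evict(self):
        # Only a dirty line is written back before being overwritten.
        if self.address is not None and self.dirty:
            self.memory[self.address] = self.data
            self.memory_writes += 1
        self.dirty = False

    def _load(self, address):
        if self.address != address:
            self._evict()
            self.address = address
            self.data = self.memory.get(address, 0)

    def read(self, address):
        self._load(address)
        return self.data

    def write(self, address, value):
        self._load(address)
        self.data = value        # written to the cache only
        self.dirty = True


memory = {0x10: 1, 0x20: 2}
cache = OneLineWriteBackCache(memory)
cache.write(0x10, 99)            # main memory still holds the old value
stale_value = memory[0x10]
cache.read(0x20)                 # evicts the dirty line, forcing a write-back
cache.read(0x10)                 # evicts a clean line, no write-back needed
```

A write-through cache would instead update `memory` inside `write` itself, so no dirty flag would be needed.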
[0017]
Another approach to increasing the cache hit ratio is to increase its associativity. Associativity refers to the number of lines in the cache that are searched (i.e., checked for a hit) during a cache access. In general, the higher the associativity, the higher the cache hit ratio. A direct-mapped cache system has a 1:1 mapping, whereby only one line is checked for a hit during a cache access. At the other end of the spectrum, a fully associative cache is typically implemented using a content addressable memory (CAM), whereby all cache lines (and hence all cache locations) are searched and compared simultaneously during a single cache access. Various levels of associativity have been implemented.
[0018]
Despite these approaches to improving cache performance, all of which ultimately aim to improve the performance of the entire system, it should be noted that cache performance can be improved only up to a point by changing parameters such as size, associativity and speed. This approach of improving the cache system, that is, the fast memory of the system, rather than improving the slower main memory, eventually reaches a saturation point: any further attempt to improve overall system performance through cache improvements yields diminishing returns. Conceivably, if the cache were as large as main memory, main memory performance could be eliminated as a factor in overall system performance, but this would be prohibitively expensive in terms of silicon chip area. As a result, what is needed is a way to obtain maximum system performance with a minimum-sized cache.
[0019]
The speed mismatch between the processor and main memory has recently been made worse by new software applications, such as multimedia, which depend heavily on main memory performance. Unfortunately, main memory performance is limited by the frequent random data accesses in such applications. Cache systems are therefore less effective when used with such applications.
[0020]
Numerous attempts have been made to improve the performance of main memory in order to reduce the speed mismatch between the processor and main memory. These have led to some improvements in main memory speed. Early improvements to DRAMs involved getting multiple bits out of the DRAM per access cycle (nibble mode, or wider data outputs), internally pipelining various DRAM operations, or segmenting the data so that some operations are eliminated for some accesses (page mode, fast page mode, extended data out (EDO) mode).
[0021]
Page mode involves latching a row address in the DRAM and keeping it active, so that a page of data is effectively held in the sense amplifiers. In page mode, the column addresses are then strobed in by the column address strobe (CAS) signal. In fast page mode, by contrast, the column address buffers are activated as soon as the row address strobe (RAS) signal is activated and act as transparent latches, allowing the internal column data fetch to begin before the column address strobe. The data output buffer is then enabled when CAS is activated. These page modes are therefore faster than pure random access mode, because staying on the same row eliminates the row address activation time required to access a new row.
[0022]
Subsequent improvements were realized through the extended data out (EDO) mode and the burst EDO mode. Burst EDO mode allows a page of sequential data to be retrieved from the DRAM without supplying a new address on every cycle. However, while burst EDO mode is suitable for graphics applications that require pages of sequential information, it is less useful for main memory applications, which require true random access.
[0023]
While such improvements in DRAM design provide higher bandwidth access, they present the following problems.
Some scattered memory accesses do not map to the same active column, which eliminates the benefit of fast page mode, so the processor cannot fully use the higher bandwidth of the new DRAMs.
New DRAM designs may have several banks, but not enough of them for a typical processor environment with scattered memory accesses to achieve a high page hit rate.
Current processors and systems use large caches (first and second level) that intercept memory accesses to the DRAM, reducing the locality of the accesses that do reach it; this scatters the accesses further and, consequently, lowers the page hit rate further.
[0024]
Because cache systems alone cannot deliver the required improvement in system performance, further effort has gone into improving the performance of the main DRAM memory system. One such effort is the SDRAM (Synchronous DRAM). SDRAM uses multiple banks and a synchronous bus to provide high bandwidth for accesses that use the fast page mode. With multiple SDRAM banks, more than one active column can provide the processor with fast access from different parts of the memory. However, for the fast page mode to be used, these accesses must fall within the active column of a bank. Furthermore, relying solely on access to multiple banks to increase memory bandwidth is ultimately limited by the number of banks into which the memory can be divided.
[0025]
In general, the limited number of banks, the external cache systems that intercept accesses to already activated columns in main memory, and the poor spatial locality of the accessed data all limit the performance gain obtainable from SDRAM.
[0026]
Another effort is the cache DRAM (CDRAM). This design incorporates an SRAM-based cache inside the DRAM. Large blocks of data can then be transferred from the cache to the DRAM array, or from the DRAM to the cache, within a single clock cycle. However, this design suffers from a low cache hit rate inside the DRAM, caused by the external intercepting caches and by poor data locality. It also adds complexity to the external system, which needs a cache tag, a comparator and a controller to control and operate the internal cache. In addition, integrating the SRAM with the DRAM in a semiconductor manufacturing process optimized for DRAM adds significant cost in terms of die area.
[0027]
A newer design merges the processor and the DRAM, eliminating the intercepting cache problem and giving the processor the full DRAM bandwidth. This approach adds complexity to the system, mixes slow and fast technologies, limits the space available for the processor, and, because of the scattered memory accesses produced by the current programming model, cannot make full use of the high DRAM bandwidth.
[0028]
NEC's new virtual channel DRAM design uses 16 fully associative channels, implemented in high-speed SRAM, to track multiple code and data streams in use by various sources (Non-Patent Document 1). In essence, the virtual channel DRAM is an extension of the page mode concept that removes the one bank / one page limitation. As a result, multiple channels (or pages) can be opened in a bank independently of other channels. The CPU can access, for example, up to sixteen 1k channels randomly allocated within a virtual channel DRAM bank. As a result, data movement between multiple devices can be sustained without causing repeated page allocation conflicts. This virtual channel memory requires the CPU to track the main memory location corresponding to each channel, which complicates its control functions. Further, the CPU requires a predictive scheme for effective prefetching of data into the channels. Virtual channel DRAM uses the fast page mode to transfer data to the channels and, like the cache DRAM, is ultimately expensive because of the additional die area consumed by the associated buffers. In addition, the amount of cache provided may not be appropriate for some applications, because the cache/DRAM ratio is usually fixed. For example, when the main memory is upgraded, the additional cache that comes with it may not be needed, so the system cost is unnecessarily increased.
[0029]
In recent years, software-based solutions have also been proposed, such as using a software compiler to remap physical memory addresses in order to maximize DRAM bandwidth. While this is advantageous for certain applications that have predictable behavior, it requires changes to the software, which creates complications of its own. These efforts use a high-level approach in which the source code of the application is modified to adapt the software to the hardware. This approach is not only expensive and time consuming, but is also not applicable to all software.
[0030]
From the above, what is required is a solution based on a simplified memory control mechanism that uses simple, cost-effective, standard DRAM for the main memory and that does not require extensive software rewriting or complex addressing schemes. Such a solution should ideally exploit both temporal and spatial locality: not only should recently accessed data be readily accessible, but data located near such recently accessed data should be readily accessible as well.
[Non-Patent Document 1] NEC Electronics Inc. Product Number Search “μPD4565161”
DISCLOSURE OF THE INVENTION
[Means for Solving the Problems]
[0031]
A solution to the above-described problem is found in a method and apparatus that combine the advantages of the fast page mode with those of a fast buffer or cache. The memory controller controls a buffer that stores the most recently used addresses and their associated data, but the data stored in that buffer is only a portion of a column of data stored in main memory (column head data). In a memory access performed by the CPU, both the buffer and the main memory are accessed simultaneously. If the buffer contains the requested address, the buffer immediately begins providing the associated column head data to the cache memory. Meanwhile, the same column address is activated in the main memory bank corresponding to the requested address found in the buffer. After the buffer has provided the column head data, the rest of the requested data is immediately provided to the CPU by the main memory. In this way, a small buffer memory can provide the functionality of a much larger L2 cache.
[0032]
In a first aspect, the present invention provides a method of retrieving data from a memory system, the method comprising:
(a) receiving a read request for data content at a memory location;
(b) searching a buffer portion of the memory system for a portion of the data content;
(c) if the portion of the data content is stored in the buffer, retrieving the portion from the buffer while simultaneously retrieving the remaining portion of the data content from a main memory portion of the memory system; and
(d) if the portion of the data content is not stored in the buffer, retrieving both the portion and the remaining portion of the data content from main memory.
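Steps (a) to (d) of the first aspect can be sketched in software as follows. This is only an illustration under stated assumptions: the buffer and main memory are modelled as dicts keyed by address, the two retrievals run one after the other rather than simultaneously as in hardware, and the names `read_with_head_buffer`, `head_buffer` and `main_memory` are hypothetical.

```python
def read_with_head_buffer(address, head_buffer, main_memory):
    """Return (data, sources) for a read request, following steps (a) to (d)."""
    block = main_memory[address]          # full data content at the memory location
    head = head_buffer.get(address)       # (b) search the buffer for the head portion
    if head is not None:
        # (c) head portion from the buffer, remaining portion from main memory
        remainder = block[len(head):]
        return head + remainder, ("buffer", "main memory")
    # (d) buffer miss: both portions come from main memory
    return block, ("main memory",)


main_memory = {0x100: bytes(range(16)), 0x200: bytes(range(16, 32))}
head_buffer = {0x100: bytes(range(4))}    # only the first 4 bytes are buffered

hit_data, hit_sources = read_with_head_buffer(0x100, head_buffer, main_memory)
miss_data, miss_sources = read_with_head_buffer(0x200, head_buffer, main_memory)
```

In both cases the requester receives the same complete data; what differs is where the first bytes come from, which in hardware is what hides the main memory latency.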
[0033]
In a second aspect, the invention provides a column head buffer circuit for latching a column head, the column head being a portion of a column of memory stored in a memory bank, the circuit comprising:
A column head buffer containing a plurality of column head entries, each column head entry corresponding to a column head in a memory bank;
A plurality of column address latches for latching the physical addresses of the column head entries contained in the column head buffer;
A column address comparator for comparing the column head entries with an incoming requested column address;
The buffer circuit compares the incoming column address requested by a memory controller with the plurality of column address latches and, when the incoming requested column address matches one of the column address latches, the column head data entry corresponding to the matching latch is transmitted to the memory controller.
[0034]
In a third aspect, the invention provides a memory buffer subsystem, comprising:
At least one buffer bank having a plurality of buffer entries;
A buffer controller that controls the buffer subsystem,
Each buffer entry comprising:
An address field containing a memory address corresponding to a location in a main memory bank;
A data field containing the first n bytes of the data located at that memory address;
When the data located at the memory address is requested by a CPU, the first n bytes of the data are provided to the CPU by the buffer subsystem, while the remainder of the data is retrieved from that memory address in the main memory bank.
[0035]
In a fourth aspect, the present invention provides a memory system, comprising:
At least one bank of main memory;
A memory controller;
A buffer; and
A buffer controller,
The memory controller controls the at least one bank of main memory,
The buffer includes a plurality of buffer entries,
Each buffer entry includes an address portion and a data portion,
The data portion comprises a first portion of data held in the at least one bank of main memory, and the address portion comprises an address referring to a memory location.
BEST MODE FOR CARRYING OUT THE INVENTION
[0036]
The invention will be better understood from reading the following detailed description in conjunction with the accompanying drawings.
[0037]
Referring to FIG. 1, a conventional CPU-main memory system 10 is shown in order to put the discussion of the present invention into context. The system generally comprises a CPU 15 with a built-in level 1 cache 17, a cache and main memory controller 20, a level 2 cache 25 and a main memory 30. The host data bus 16 transfers data between the CPU, the main memory 30, and the level 2 cache 25. The host address bus 18 provides address information to the memory controller 20 and the level 2 cache 25. Similarly, the data bus 21 and the address bus 22 connect the level 2 cache to the host data bus 16 and the address bus 18 under the control of the cache and memory controller 20 through the controller bus 23. Main memory 30 is coupled to host data bus 16 via memory data bus 26 and receives address and control information from controller 20 via address bus 27 and control bus 28.
[0038]
In a typical read/write data operation, the CPU 15 issues, for example, a read data instruction to the memory controller 20 and provides an address location, which the controller translates into column and row addresses and memory control signals. Controller 20 also generates address and control information for the level 2 cache. If the data is not found in the level 1 cache, the controller 20 searches for the desired data in the level 2 cache as well as in main memory. If the data is found in the level 2 cache, it is supplied to the host data bus 16 via the data bus 21 and then returned to the CPU 15. The data is simultaneously written to the level 1 cache, in anticipation of being requested again. If the data is found in neither the level 1 cache nor the level 2 cache (i.e., a cache miss occurs in both the level 1 and level 2 caches), the controller 20 accesses the data directly in the main memory 30 using page mode access. At the same time that the data is transferred to the CPU 15 through the memory data bus 26, the data is also copied to the level 1 cache 17, in anticipation of the CPU requesting the data again.
[0039]
As noted above, such a conventional system consisting of level 1 and level 2 caches and a memory controller shows signs of degrading performance. Today's applications demand higher speed and access memory more randomly, which frequently causes cache misses and main memory accesses.
[0040]
Referring to FIGS. 2A and 2B, a latency hiding buffer 100 is shown in accordance with an embodiment of the present invention. This buffer can be used with the CPU-main memory system of FIG.
[0041]
The buffer 100 includes at least one buffer bank 110 and a buffer controller. Each buffer bank according to an embodiment of the present invention is implemented as an N-way set-associative cache memory containing a number of lines. Each buffer bank has a comparator 130, which compares the requested address with the addresses stored in the buffer bank. Each line includes a set address portion 150, a tag address portion 160, a most recently used (MRU) flag bit 180, and a data portion 170. The set portion 150 refers to the lower-order bits of the main memory address location stored in the buffer line. The tag portion 160 refers to the higher-order bits of that main memory address location. As in most set-associative cache systems, the buffer controller uses the set bits to address the higher-order tag bits. The MRU flag bit 180 is used to determine which buffer entry should not be replaced when a new address entry is inserted. The data portion contains the data (column head) associated with the memory address specified by the set and tag bits. In one embodiment, the column head contains only a specified number of the data bits of a column of data in main memory. For example, the buffer bank 110 may store the first four data words of a typical 64-byte cache line as the column head, with the rest of the data stored in main memory. As a result, the buffer bank may store 1/4 of a cache line or whatever fraction of a cache line is desired.
[0042]
Regarding the MRU flag 180, the buffer bank entry with its MRU flag bit set is the most recently used entry and should not be replaced. This is because temporal locality of reference suggests that this entry may be the next entry to be accessed. For the next requested address, the buffer is searched for an entry that does not have its MRU flag bit set. The MRU flag bit is set for a particular buffer entry after that entry has been accessed; if an older buffer entry has its MRU flag bit set, that older entry then resets its MRU flag bit, leaving the new buffer entry as the only entry with its MRU flag bit set. Only one MRU flag bit may be active for each associative set in the buffer.
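The MRU flag handling described above amounts to a "not most recently used" replacement rule within each set. A minimal sketch follows, assuming each entry of a set is a dict with an `"mru"` key; the helper names `touch` and `choose_victim` are hypothetical and chosen only for illustration.

```python
def touch(entries, index):
    """Set the MRU flag of the accessed entry and reset it on every other entry of the set."""
    for i, entry in enumerate(entries):
        entry["mru"] = (i == index)


def choose_victim(entries):
    """Return the index of the first entry whose MRU flag is not set."""
    for i, entry in enumerate(entries):
        if not entry["mru"]:
            return i
    raise ValueError("a set needs at least two entries to have a replaceable one")


two_way_set = [{"mru": False}, {"mru": False}]
touch(two_way_set, 0)
victim_after_first_access = choose_victim(two_way_set)
touch(two_way_set, 1)
victim_after_second_access = choose_victim(two_way_set)
```

With two entries per set this behaves like a least recently used policy; with more entries it only guarantees that the most recently used one survives.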
[0043]
An example illustrates the operation of the buffer bank. The buffer bank receives the decoded memory address from the main memory controller. The lower-order bits of this memory address are used to determine which buffer bank, and which set within that bank, may hold a match. The higher-order bits of the memory address are provided to comparator 130. The tag field of the selected buffer line is also provided to the comparator 130. If the two match, the requested memory address is the one stored in the buffer line. The result is reported to the buffer controller, and the data is accessed in the buffer.
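The lookup just described can be sketched as a split of the address into set and tag bits followed by a tag comparison. The sketch below assumes a single direct-mapped bank and a 4-bit set field; `SET_BITS`, `split_address` and `lookup` are hypothetical names and values, not figures taken from the disclosure.

```python
SET_BITS = 4  # assumed width of the set field (16 lines per bank)


def split_address(address):
    """Split an address into (set index, tag): lower-order bits select the line."""
    set_index = address & ((1 << SET_BITS) - 1)
    tag = address >> SET_BITS
    return set_index, tag


def lookup(buffer_bank, address):
    """Return the buffered data on a tag match, or None on a buffer miss."""
    set_index, tag = split_address(address)
    line = buffer_bank[set_index]
    if line is not None and line["tag"] == tag:   # the comparator's job
        return line["data"]
    return None


bank = [None] * (1 << SET_BITS)
set_index, tag = split_address(0x1234)
bank[set_index] = {"tag": tag, "data": b"head"}

hit = lookup(bank, 0x1234)    # same set, same tag
miss = lookup(bank, 0x5674)   # same set, different tag
```

An N-way version would keep N such lines per set and perform the N tag comparisons in parallel.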
[0044]
Referring to FIG. 2B, the buffer controller 120 is illustrated. A first group of signals 190 is provided from the buffer bank and can include the output of the address comparator (whether the address matches or not) and whether the selected buffer line has its MRU bit set or not. The second group of signals 200 is provided from the main memory controller. These can include signals that indicate the presence of a memory access request, whether the memory access is a read or a write, and whether the requested column is active or not.
[0045]
A third group of signals 210 is generated by the buffer controller and provided to the buffer bank. These can include read or write signals to the buffer bank and the setting of the MRU bit. A fourth group of signals 220 is generated by the buffer controller and provided to the memory controller. These can include signals that instruct the memory controller to latch a particular column in main memory, to write data to a location in main memory, or to access a location in main memory at a specified offset.
[0046]
The buffer described above can be placed at various locations in the CPU memory system shown in FIG. Referring to FIGS. 3A, 3B, 3C and 3D, possible placements of the latency hiding buffer are shown.
[0047]
FIG. 3A comprises all elements of FIG. 1 with a latency hiding buffer 100 located outside of the memory controller 20. As is well known to those skilled in the art, each block in FIG. 3A can be formed on a separate chip or module. For example, main memory is typically built from DIMM modules (Dual Inline Memory Modules), and the CPU and level 1 cache are typically formed in a single monolithic microprocessor. The memory controller is typically a separate chip and is usually integrated, together with the microprocessor, in a chipset that includes the level 2 cache as another separate chip. In the embodiment shown in FIG. 3A, the latency hiding buffer is formed on another chip integrated into the chipset, possibly replacing the level 2 cache or being used in conjunction with it. FIG. 3B shows another possible embodiment in which the buffer is integrated on the same chip as the DRAM-based main memory. FIG. 3C shows an embodiment where the buffer is integrated on the same chip as the level 1 cache and CPU. Finally, FIG. 3D shows a preferred embodiment in which the buffer is integrated in the memory controller and completely replaces the level 2 cache. While these four embodiments are shown, those skilled in the art will recognize other possible combinations that use the advantages and concepts of the buffers disclosed herein.
[0048]
FIG. 4 shows a more detailed drawing of the preferred embodiment of the present invention corresponding to FIG. 3D. As can be understood from FIG. 4, the plurality of buffer banks 110 are integrated in the memory controller 20. It should be noted that although FIG. 4 shows a single comparator 130, each buffer bank 110 has an associated comparator.
[0049]
The memory controller 20 according to the preferred embodiment of the present invention includes an address decoder 230, a main memory and cache controller 240, a buffer bank 110, a comparator 130, and a buffer controller 120. The address decoder 230 receives the requested address (MemAddr) and the memory access signal (MemAcc) from the CPU. The address decoder 230 then determines the column address and row address of the requested address in the main memory from the requested memory address.
[0050]
The requested memory address is also sent to the buffer 110. As will be appreciated, a portion of the requested memory address (SET) is used to reference the buffer bank 110. Another portion (TAG) of the same requested memory address is sent to the comparator 130. Comparator 130 compares the requested tag field with the tag stored at the set location in buffer 110. If the tag of the requested address matches the tag at the set location in the buffer, a buffer hit occurs. If the tags do not match, a buffer miss occurs. The set field is used to index the tag fields in buffer 110. Since buffer 110 is implemented as an N-way set-associative cache memory, this search and compare operation occurs simultaneously across all N ways, yielding N comparison results (BUFFER HIT) from the comparator 130. The BUFFER HIT comparison result is input to the buffer control block 120, which generates the control signals BUFFER O/E, BUFFER R/W and CTRL and supplies them to the main memory and cache controller block 240. If there is a match, the comparator 130 indicates the hit to the main memory and cache controller 240 via the BUFFER HIT line.
[0051]
The main memory and cache controller 240 receives a control signal (CTRL) from the buffer controller 120 and the MemAcc signal from the CPU. Based on the received control signals, the main memory and cache controller 240 generates the signals required to activate and access the main memory. These required signals include the /RAS (row address strobe), /CAS (column address strobe) and /CS (chip select) signals. These signals are well known to those skilled in the art.
[0052]
Referring to FIG. 5, the memory controller of FIG. 4 is shown with two additional signals: a column latch and a column hit. The column latch signal is generated by the main memory and cache controller 240 and provided to the address decoder 230, instructing the address decoder 230 to latch/activate the column currently being accessed until instructed otherwise. The column hit signal, generated by address decoder 230 and provided to main memory and cache controller 240, indicates to main memory and cache controller 240 that the requested column has already been latched. It should be noted that the memory controllers of both FIG. 4 and FIG. 5 can be used in memory systems that may or may not have a level 2 (L2) cache.
[0053]
For clarity, it should be noted that the data in a buffer entry may be only the first few bytes stored at the requested memory address. In that case, this data is provided to the CPU while the rest of the data at the requested memory address is retrieved from the main memory / cache.
[0054]
Alternatively, the data in a buffer entry may be sufficient to fill a cache line in the cache of the memory system. In that case, on a buffer hit (when the requested memory address is found in the buffer), the buffer provides the entire cache line to the cache. To assist in this process, the latching of the requested column address (decoded from the requested address) may be performed in the background. To be clear, the column address may be latched in main memory regardless of whether there is a buffer hit. With this arrangement, if the next requested address is not in the buffer but is in the same column as the previously requested address, the relevant column is already active, which saves the usual setup and activation time. It should be noted that the method using this column latching operation uses the memory controller of FIG. 5, while the method not using it uses the memory controller of FIG. 4. As can be seen, the controller of FIG. 5 has two additional signals, column hit and column latch. The column hit signal indicates to the main memory / cache controller 240 that the requested column (derived from the requested memory address) has already been latched. The column latch signal informs the address decoder 230 of the need to latch a particular column in the main memory system.
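The benefit of keeping the last column latched can be illustrated with a toy cost model. Everything numeric here is an assumption made for illustration: the cycle counts `ACTIVATE_COST` and `ACCESS_COST` are invented, one latched column is tracked per bank, and the class name `ColumnLatchModel` is hypothetical.

```python
class ColumnLatchModel:
    """Tracks the latched column per bank and charges activation only on a column miss."""

    ACTIVATE_COST = 3   # hypothetical cycles to set up and activate a new column
    ACCESS_COST = 1     # hypothetical cycles to access an already active column

    def __init__(self):
        self.latched = {}   # bank number -> currently latched column

    def access(self, bank, column):
        column_hit = self.latched.get(bank) == column        # "column hit" signal
        cost = self.ACCESS_COST if column_hit else self.ACTIVATE_COST + self.ACCESS_COST
        self.latched[bank] = column                           # "column latch" signal
        return column_hit, cost


model = ColumnLatchModel()
first = model.access(bank=0, column=7)       # column not yet active
second = model.access(bank=0, column=7)      # same column: activation time saved
third = model.access(bank=0, column=8)       # different column in the same bank
```

The second access is the case the text describes: a request that misses the buffer but falls in the column left active by the previous request.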
[0055]
Referring to FIG. 6, a flowchart illustrating the operation of the memory system of FIG. 4 is shown. The initial steps of a memory access have been omitted from the flowchart for simplicity. Receiving the requested memory address, decoding the memory address and receiving the memory access request are well known to those skilled in the art and need no further elaboration. As will be appreciated, the process begins with a decision 300 that determines whether the requested address is found in the buffer.
[0056]
If the requested address is not found in the buffer, decision 310 follows. This determines whether the memory access is a read or a write access. If the access is a memory write, the process proceeds to step S320, which executes the write to the main memory. As shown, the buffer is not involved in this step. Alternatively, one may choose to also write the data being written to main memory into a buffer entry. This involves the usual steps required to access the buffer, which are described in more detail below.
[0057]
If the memory access is a read access, the buffer is used and the above-described temporal parallelism is exploited. Where more than one arrow leads to an operation in the flowchart, the two or more preceding operations must be completed before that operation starts. As will be appreciated, steps S330, 340 and 350 are performed in parallel with steps S360, 370 and 380. Steps S330, 340 and 350 relate to main memory access. For a read operation, the main memory is accessed (step S330), the data is recovered using the requested memory address (step S340), and the recovered data is sent to the CPU in a known and established manner (step S350). All three steps are well known to those skilled in the art. Steps S360, 370 and 380 relate to copying the read data to the buffer. First, a buffer registration unit in which the MRU bit is not set is selected (step S360). An MRU bit that is not set means that the buffer registry was not the last one accessed and may therefore be overwritten. Once such a buffer registration unit is selected, the data is written to it (step S370). This data includes the memory address, placed at the appropriate positions in the set and tag areas, and the data read from the main memory. After this step, the MRU bit for this registry is set (step S380) to prevent the buffer registry from being overwritten by the next memory access.
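By way of illustration only, the replacement behaviour of steps S360, 370 and 380 can be sketched in Python as follows. The names `BufferEntry`, `select_victim` and `fill_entry` are hypothetical, and taking the first entry whose MRU bit is not set is an assumption of this sketch; the description only requires that the selected entry not be the most recently used one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BufferEntry:
    tag: Optional[int] = None   # address the entry was filled from
    data: bytes = b""           # buffered data
    mru: bool = False           # True only for the most recently used entry


def select_victim(entries: List[BufferEntry]) -> int:
    """Step S360: pick an entry whose MRU bit is not set (first match, by assumption)."""
    for index, entry in enumerate(entries):
        if not entry.mru:
            return index
    raise RuntimeError("no replaceable entry")


def fill_entry(entries: List[BufferEntry], addr: int, data: bytes) -> int:
    """Steps S360 to S380: choose a victim, write it, then mark it as MRU."""
    victim = select_victim(entries)
    entries[victim].tag = addr
    entries[victim].data = data
    for entry in entries:        # only one entry carries the MRU bit at a time
        entry.mru = False
    entries[victim].mru = True
    return victim
```

With two or more entries, at least one entry always has its MRU bit cleared, so a victim can always be found.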
[0058]
It should be noted that the data written in the data part of the buffer registration unit is only the portion required. Therefore, if the buffer is configured to buffer only the first 32 bytes of data, only that amount and portion of the data read from the main memory (at step S340) is written to the buffer registration unit. If the buffer is configured to store an entire cache line, that amount of information is extracted from the data read from main memory and stored in the buffer registry.
[0059]
FIG. 6 is referred to again. If the requested memory address is in the buffer (from the determination in step S300), it is determined in step S390 whether the memory access is a read or a write. For a memory read, the above-described temporal parallelism is advantageously exploited. Steps S400, 410 and 420 are performed by the buffer, while steps S430, 440 and 450 are performed by the main memory concurrently, or in parallel, with the steps performed by the buffer.
[0060]
As will be appreciated, step S400 involves reading the relevant buffer registry. This includes reading the data stored in the data part of the buffer registration unit. Thereafter, in step S410, the data read from the buffer registration unit is sent to the CPU. Finally, the MRU bit for the buffer registry is set (step S420).
[0061]
Concurrently, the corresponding address location in the main memory is accessed using the requested memory address (step S430). The rest of the data is then read from main memory using a preset offset (step S440). If the buffer is designed to store the first 32 bytes of data, the main memory read begins 32 bytes past the point where a memory read would usually start. Thus, if the memory read is from point X, the main memory read starts at X + 32 bytes, to account for the data sent from the buffer to the CPU. Usually, the time required for the buffer to send its data to the CPU exceeds the setup time required to access the main memory.
[0062]
This means that when the CPU has finished receiving data from the buffer, the rest of the requested data coming from main memory has just arrived at the CPU. In step S450, the data is actually transmitted, and this step is the last step executed for main memory access.
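As a minimal sketch of the offset read described above, the following Python function assembles a read from a buffered head and the remainder in main memory. The function name `read_with_head` and the use of byte strings to stand in for the buffer registry and the main memory are assumptions made purely for illustration; the 32-byte head size follows the example in the text.

```python
HEAD_BYTES = 32  # size of the buffered head, per the 32-byte example in the text


def read_with_head(memory: bytes, head: bytes, start: int, length: int) -> bytes:
    """Serve a read at `start` whose first HEAD_BYTES are already buffered."""
    # Steps S400 and S410: the buffer supplies the head of the data.
    from_buffer = head[:min(length, HEAD_BYTES)]
    # Steps S430 to S450: main memory is read starting at X + 32 bytes.
    offset_start = start + HEAD_BYTES
    from_memory = memory[offset_start:start + length]
    return from_buffer + from_memory
```

In hardware the two parts are fetched in parallel; the sequential code only shows which bytes come from which source.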
[0063]
On the other hand, if the memory access is a write access, steps S460, 470, 480 and 490 are executed. As can be seen from FIG. 6, steps S460 and S470 are performed in parallel with steps S480 and S490. In step S460, the data to be written is written to the relevant buffer registration unit. As a result, the buffer registration unit found to correspond to the requested address is overwritten with the data supplied by the CPU. Thereafter, the MRU bit of the buffer registration unit is set (step S470) to prevent the buffer registration unit from being overwritten by the next memory access. In parallel with these steps, steps S480 and 490 relate to main memory. In step S480, the main memory is accessed. In this step, the relevant and necessary signals are generated to access the main memory. It should be noted that, in contrast to reading data in step S440, no offset is required when writing data to the main memory. The reason for this is that the complete data is written to main memory. Writing to both the main memory and the buffer avoids the output of stale data.
[0064]
The process described above produces the best results when the buffer is designed to buffer only the beginning of the requested data. However, this does not mean that the above method cannot be used with a buffer that stores the entire cache line. A buffer that stores the entire cache line can also obtain the advantages of the method described above.
[0065]
A further refinement of the above method is to keep the latched column active. The requested address relates to a column in main memory. If that column is already active when a second requested address arrives, and the second requested address relates to the same column, data recovery is faster. This is because the setup time for accessing the requested column is eliminated: the column is already active. When combined with a buffer, the concept of maintaining latched columns offers many advantages in terms of accelerated memory access.
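The benefit of keeping a column active can be illustrated with a toy model. In the Python sketch below, the class `Bank` and the cycle counts `ACTIVATE_CYCLES` and `ACCESS_CYCLES` are hypothetical values chosen only for illustration; they are not taken from the description or from any particular DRAM device.

```python
class Bank:
    """Toy DRAM bank that remembers which column was last activated."""

    ACTIVATE_CYCLES = 3   # assumed cost of setup and activation
    ACCESS_CYCLES = 2     # assumed cost of the access itself

    def __init__(self) -> None:
        self.active_column = None

    def access(self, column: int) -> int:
        """Return the assumed cycle cost of accessing `column`."""
        cycles = self.ACCESS_CYCLES
        if column != self.active_column:
            # Column miss: pay the activation cost and latch the new column.
            cycles += self.ACTIVATE_CYCLES
            self.active_column = column
        return cycles
```

A second access to the same column costs only the access cycles, since the activation has already been paid for.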
[0066]
Referring to FIG. 7, there is shown a flowchart illustrating the steps of a process that can be performed using the memory controller of FIG. This process, which is used for read access, uses the concept of a column latch, as described above. Beginning in step S500, memory access is initialized. This step involves receiving the requested memory access and determining that the memory access is a read access. Next, step S510 is performed. This step involves decoding the requested memory address and determining in which column the requested address is located. At this point, the process takes advantage of the temporal parallelism provided by the buffer. Steps S520 and S530 are performed simultaneously. These steps check, respectively, whether the requested column is already active and whether the requested address is in the buffer.
[0067]
If the buffer buffers only the beginning of the requested data (the head of the column), the leftmost and rightmost branches of the flowchart of FIG. 7 are easily implemented. If the determinations in steps S530 and S520 are both “Y”, steps S540, 550, 560, 570, 580, 590 and 600 are executed in parallel. The first part of the data is recovered from the buffer registration unit (step S540) and is sent to the CPU (step S550). The main memory access (step S560) is accomplished faster because the column is already active: the normal activation time associated with accessing main memory is avoided. Ideally, this main memory access is performed using the fast page mode (FPM). The rest of the requested data is recovered from the main memory (step S570). This recovery is performed using an offset, in a manner similar to that described above, so as to supplement the data already sent to the CPU (step S580). For the buffer, the MRU bit of the accessed buffer registration unit is set (step S590). For main memory, the active column remains active for the next memory access (step S600). If “Y” is determined in step S530 and “N” is determined in step S520, steps S540, 550 and 590 are executed by the buffer, while steps S610, 620, 630 and 640 are executed in parallel by the main memory system. In step S610, the main memory is accessed using a well-known random access technique for the main memory system. This involves sending out the appropriate /CAS, /RAS and /CS signals at the appropriate times. In step S620, the remainder of the requested data is recovered from main memory using an offset to compensate for the data already supplied to the CPU by the buffer in step S550. In step S630, the recovered data is sent to the CPU. In step S640, the column that was accessed and activated is kept active in anticipation of the next memory access.
[0068]
If “Y” is determined in step S520 and “N” in step S530, the buffer executes steps S650, 660 and 670, while the main memory system executes steps S560, 570, 580 and 600. Thus, if the requested data is not in the buffer, it must be put into the buffer. In step S650, a buffer registration unit is selected for replacement. This involves selecting a buffer registry whose MRU bit is not set. While this is being performed, the main memory system recovers the requested data from main memory without an offset (see steps S560 and 570 above). The offset is not used because the buffer does not send the first part of the requested data, so there is nothing to compensate for.
[0069]
Once the data has been recovered from the main memory, the first portion of the recovered data is stored in the selected buffer registry (step S660). Thereafter, the MRU bit is set in this buffer registration unit (step S670) so that it is not overwritten by the next memory access.
[0070]
If the determinations in steps S520 and 530 are both "N", the main memory system executes steps S610, 620, 630 and 640, while the buffer executes steps S650, 660 and 670. Because the buffer is not read to recover data and only has data written into it, there is nothing to compensate for in step S620, so the main memory system does not use the offset.
[0071]
It should be noted that connections A and B in FIG. 7 are used to indicate that, although most of the steps described above are performed in parallel, some steps must be performed before others. As an example, after step S550 is performed, steps S590, 580 and 600 are performed in parallel (see connection B). On the other hand, if “N” is determined in step S520 and “Y” is determined in step S530, steps S590, 630 and 640 are executed in parallel after step S550 (see connection B). Separately, if “Y” is determined in step S520 and “N” is determined in step S530, steps S580 and 600 are executed in parallel with steps S660 and 670, as indicated by connection A.
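The four branches of FIG. 7 described above can be summarised as a small dispatch function. The following Python sketch is illustrative only: the function name `fig7_steps` and the return format are assumptions, while the step numbers are those given in the description.

```python
def fig7_steps(column_active: bool, buffer_hit: bool):
    """Return (buffer_steps, memory_steps, uses_offset) for a FIG. 7 read."""
    if buffer_hit:
        # Read the entry, send the head to the CPU, set the MRU bit.
        buffer_steps = ["S540", "S550", "S590"]
    else:
        # Select an entry to replace, store the head, set the MRU bit.
        buffer_steps = ["S650", "S660", "S670"]
    if column_active:
        # Column already active: the activation time is avoided.
        memory_steps = ["S560", "S570", "S580", "S600"]
    else:
        # Column not active: ordinary random access, then keep the column active.
        memory_steps = ["S610", "S620", "S630", "S640"]
    # The offset is needed only when the buffer has supplied the head.
    return buffer_steps, memory_steps, buffer_hit
```

The buffer branch depends only on the buffer check (step S530) and the main memory branch only on the column check (step S520), which is why the two checks can be made simultaneously.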
[0072]
Referring to FIG. 8, the steps for a write operation are shown. The process starts with memory access initialization (step S680). As noted above, this involves decoding the requested address, receiving a write instruction from the CPU, and sending the requested address to the memory decoder and the buffer. Next, the main memory system performs steps S690, 700 and 710, in parallel with which the buffer performs steps S720, 730 (if required), 740, 750 and 760.
[0073]
In step S690, the main memory is accessed, with or without using the FPM for the main memory system. In step S700, data is written to the main memory, and in step S710, the active state of the accessed column is maintained for the next memory access. (It should be noted that the number of active columns is left to the discretion of the system designer. Such a designer may wish to have only one active column per DRAM bank, or multiple active columns per bank.) In step S720, which is the first step for the buffer, a determination is made as to whether the requested address is in the buffer. If the requested address is in the buffer, the data is written to the buffer registry (step S740). On the other hand, if the requested address is not in the buffer, a buffer registry must be replaced. Therefore, in step S750, a buffer registry to be replaced is selected. This involves selecting a buffer registry whose MRU bit is not set. Once the buffer registry to be replaced is selected, the data is written to it (step S740). The buffer registry written in step S740 thus depends on whether the requested address is in the buffer. If it is, the data is written to the buffer registry that was found. If not, the data is written to the buffer registry selected for replacement. Once the data is written to the buffer registration unit, the MRU bit is set in that buffer registration unit. Because the data is written to both the buffer and the main memory, the data is kept consistent between the two. It should be noted that, in this example, only the beginning of the data (that is, the column head) is written to the buffer, which reflects how the buffer is configured for this example.
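A minimal Python sketch of the write process of FIG. 8 is given below. The function name `write_access`, the dictionary layout of a buffer registry and the 32-byte head size are assumptions made for illustration; the column handling of steps S690 and S710 is omitted, and the two branches that run in parallel in the description are shown sequentially.

```python
HEAD_BYTES = 32  # assumed head size, taken from the 32-byte example in the text


def write_access(memory: bytearray, entries: list, addr: int, data: bytes) -> None:
    """Write `data` to main memory and mirror its head into the buffer.

    `entries` is a list of dicts with keys 'tag', 'data' and 'mru'
    (a hypothetical layout used only for this sketch).
    """
    # Main memory branch: the complete data is written, with no offset (step S700).
    memory[addr:addr + len(data)] = data
    # Buffer branch: look for a registry holding the requested address.
    match = next((e for e in entries if e["tag"] == addr), None)
    if match is None:
        # Not buffered: replace a registry whose MRU bit is not set.
        match = next(e for e in entries if not e["mru"])
    match["tag"] = addr
    match["data"] = bytes(data[:HEAD_BYTES])   # only the column head (step S740)
    for e in entries:
        e["mru"] = False
    match["mru"] = True                        # mark as last accessed
```

Because every write updates both copies, a later read that hits in the buffer cannot return stale data.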
[0074]
The write process shown in FIG. 8 is also applicable when the buffer is configured to buffer a full cache line. The only difference between this example and the example described above is that the entire processor cache line is stored in the buffer.
[0075]
There are several possibilities for read access when the buffer buffers an entire cache line. As noted above, if the process of keeping a column active after access is used, the memory controller of FIG. 5, with its special column hit and column latch signals, is used. FIGS. 9 and 10 show two possible processes similar to the process shown in FIG. The processes of FIGS. 9 and 10 differ in which source is used by default if the requested address is found both in the buffer and in the active column. In FIG. 9, if the requested address is found in the active column and in the buffer, the data is recovered from the buffer. In FIG. 10, under the same condition, the main memory is accessed.
[0076]
Referring to FIGS. 9 and 10, two similar processes for a read operation are shown for the case in which the buffer is configured to buffer the entire cache line and the column latch concept is used. These two processes differ only when the requested address is present in both the buffer and the active column.
[0077]
Referring to FIG. 9, a memory access is initiated in step S770 by a well-known method, in a manner similar to the initiation of memory access in the other processes described above. The requested memory address is then decoded in step S780. The next steps, S790 and S800, are executed in parallel. The buffer is checked to see whether the requested address is in the buffer (step S790), and the active column is checked to see whether the requested address is in the active column (step S800). A series of determinations is made based on these checks. Decision step S810 checks whether the requested address is in both the buffer and the active column. If “Y” is determined, two branches (one for step S820 and the other for steps S830, 840, 850 and 860) are executed in parallel. In step S820, the active state of the column found in step S800 is maintained. Steps S830, 840, 850 and 860 are executed by the buffer. In step S830, the buffer is accessed. Step S840 recovers the requested data from the buffer registry corresponding to the requested address. Thereafter, the recovered data is sent to the CPU (step S850). The MRU bit of the buffer registry that was found is set in step S860 so that the buffer registry is not overwritten by the next memory access.
[0078]
If the determination at step S810 is "N", the determination at step S870 is made. Step S870 determines whether the requested address is in the active column and not in the buffer. If so, the buffer performs steps S920, 930 and 940 in parallel with the execution of steps S880, 890, 900 and 910 by the main memory. For the main memory system, in step S880, the main memory is accessed using the fast page mode. This is possible because the requested address is in a column that is already active. The next step, S890, recovers the data from the main memory. Step S900 sends the recovered data to the CPU, while step S910 maintains the active state of the column. For the buffer, this part of the process stores the recovered data in the buffer. Step S920 selects a buffer registration unit to be replaced. Once the buffer registration unit is selected, the data recovered in step S890 is stored in the selected buffer registration unit (step S930), whereby the old content of the selected buffer registration unit is overwritten. In the next step, S940, the MRU bit is set so that this particular buffer registration unit is not overwritten by the next data access. It should be noted that connection C indicates that step S930 is performed only after step S890 has been performed. Only after the data has been recovered from the main memory (step S890) can the data be written to the buffer registration unit (step S930).
[0079]
If “N” is obtained in step S870, the process proceeds to determination step S950. This step determines whether the requested address is in the buffer. If so, the buffer performs steps S960, 970, 980 and 990 in parallel with the execution of steps S1000, 1002, 1004 and 1006 by the main memory system. For the buffer, the buffer is accessed in step S960. Step S970 recovers the requested data from the buffer, while step S980 sends the recovered data to the CPU. In step S990, the MRU bit is set so that the buffer registry is not overwritten by the next data access, as in the buffer branches described earlier. Setting the MRU bit also involves clearing the MRU bit previously set for another buffer registry. Thus, only a single buffer registry has its MRU bit set. Similarly, the step of activating a column in the main memory (step S1000) also involves deactivating a previously active column. Thus, only one column is active at a time. After the column is activated, the data is accessed from main memory in step S1002. This data is then sent to the CPU (step S1004), and the active state of the column is maintained (step S1006). Depending on the configuration of the main memory system, only one column may be active in the entire main memory system, or one column may be active per main memory bank (for main memory systems with multiple banks). Different configurations may be employed depending on the needs of the end user.
[0080]
If the determination in step S950 is “N”, the main memory system and the buffer again execute a series of steps in parallel. The main memory system executes steps S1040, 1050, 1060 and 1070, while steps S1010, 1020 and 1030 are executed for the buffer. Step S1010 for the buffer involves selecting a buffer registry whose MRU bit is not set. The contents of this buffer registry are replaced with the new data to be recovered. The data is recovered by the main memory system in step S1050, and step S1020 involves writing the recovered data to the selected buffer registry. In step S1030, the MRU bit is set in the selected buffer registration unit.
[0081]
In step S1040, the main memory system accesses the data stored in the main memory at the requested address. This memory access is performed using well-known random access methods, since the FPM is unavailable when the requested column is not active. Step S1050 recovers the data from the main memory after the main memory has been accessed in step S1040. The recovered data is sent to the CPU in step S1060. This data is the same as, or includes, the data written in step S1020 to the selected buffer registration unit. Next, step S1070 keeps the column accessed in step S1040 active, thereby allowing the next memory access to use the FPM, if possible.
[0082]
Similar to connection C described above, connection D indicates that step S1020 can be performed only after step S1050 has been performed. Thus, only after step S1050 is executed, step S1020 and the other subsequent steps that are branched are executed. Only after the data has been recovered (at step S1050), the same data is written to the buffer registry (step S1020).
[0083]
In FIG. 10, all steps in the flowchart are the same as those in FIG. 9, except for the step executed when “Y” is determined in the first determination (step S810). If so, that is, if the requested address is in both the buffer and the active column, the buffer performs step S1120, while the main memory performs steps S1080, 1090, 1100 and 1110.
[0084]
In step S1080 for the main memory system, the main memory is accessed using the FPM. This is possible because it was determined in step S810 that the requested address is in the active column. Step S1090, which recovers the data, is executed after step S1080. In step S1100, the recovered data is sent to the CPU, and in step S1110, the active state of the currently accessed column is maintained. For the buffer, in step S1120, the MRU bit is set in the buffer registry corresponding to the requested address. In effect, that buffer registry is treated as the last one accessed, even though its content is neither read nor modified.
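The difference between the processes of FIGS. 9 and 10 can be summarised in a short Python sketch. The function name `plan_full_line_read`, the `variant` argument and the action strings are informal labels assumed for illustration; they paraphrase the branches described above and do not appear in the drawings.

```python
def plan_full_line_read(in_buffer: bool, in_active_column: bool, variant: str):
    """Return (buffer_action, memory_action) for the FIG. 9 or FIG. 10 read flow.

    `variant` is "fig9" or "fig10" (labels assumed for this sketch).
    """
    if in_buffer and in_active_column:
        if variant == "fig9":
            # FIG. 9: the buffer is the default source (steps S820 to S860).
            return "supply data, set MRU", "keep column active"
        # FIG. 10: main memory is the default source (steps S1080 to S1120).
        return "set MRU only", "fast page mode read"
    if in_active_column:
        # In the active column but not buffered (steps S880 to S940).
        return "replace an entry", "fast page mode read"
    if in_buffer:
        # Buffered but the column is not active (steps S960 to S1006).
        return "supply data, set MRU", "activate column and read"
    # Neither buffered nor in the active column (steps S1010 to S1070).
    return "replace an entry", "random access read, then keep column active"
```

The two variants return different plans only for the first case, which matches the statement that the processes differ only when the requested address is present in both the buffer and the active column.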
[0085]
Many other configurations for the devices and processes described above are possible. A level 2 cache can be used and access to it is applicable to the process outlined above.
[0086]
Those skilled in the art will appreciate that other designs may be designed using the principles set forth herein. All such designs that fall within the scope of the appended claims are part of the invention.
[Brief description of the drawings]
[0087]
FIG. 1 is a schematic block diagram of a conventional CPU memory system.
FIG. 2A is a schematic diagram of a buffer bank according to the present invention.
FIG. 2B is a block diagram of a buffer controller that controls the buffer bank of FIG. 2A.
FIG. 3A is a block diagram of a memory system that implements a buffer system separately from a memory controller.
FIG. 3B is a block diagram of a memory system that implements a buffer system as part of a main memory.
FIG. 3C is a block diagram of a memory system that implements a buffer system as part of a CPU.
FIG. 3D is a block diagram of a memory system that implements a buffer system as part of a memory controller.
FIG. 4 is a detailed block diagram for implementing the present invention.
FIG. 5 is a detailed block diagram showing a modification of the embodiment shown in FIG. 4.
FIG. 6 is a flowchart showing steps in a method of memory access according to the first embodiment of the present invention.
FIG. 7 is a flowchart showing steps in a memory access method according to a second embodiment of the present invention.
FIG. 8 is a flowchart showing steps for a write access method to be used in the method shown in FIG. 7.
FIG. 9 is a flowchart showing steps in a memory access method according to a third aspect of the present invention.
FIG. 10 is a flowchart showing steps in a modification of the method shown in FIG. 9.
[Explanation of symbols]
[0088]
10 CPU main memory system
15 Microprocessor
17 L1 cache
20 Cache and main controller
25 L2 cache
30 Main memory
100 Buffer to hide waiting time
110 buffer banks

Claims

1. A method for recovering data from a memory system, the method comprising:
(a) receiving a read request for data content at a memory location;
(b) searching a buffer portion of the memory system for a portion of the data content;
(c) when the portion of the data content is stored in the buffer, recovering the portion from the buffer while simultaneously recovering the remaining portion of the data content from a main memory portion of the memory system; and
(d) when the portion of the data content is not stored in the buffer, recovering the portion and the remaining portion of the data content from the main memory.

2. The method of claim 1, further comprising the following steps.
(e) if the portion of the content is not stored in the buffer, store the portion of the recovered data and the remaining portion in the buffer.

3. The method of claim 2, wherein said portion of the content of the recovered data replaces a registry in said buffer.

4. The method of claim 2, wherein if a request to write data to a memory location is received, the data is written to a memory location in main memory, and a portion of the data is written to the buffer.

5. The method of claim 4, wherein said portion replaces a registry in said buffer.

6. The method of claim 4, wherein the data is written to the memory location and the buffer simultaneously.

7. The method of claim 1, wherein each time a read request to the memory location is received, a memory column including the memory location in main memory is latched.

8. The method of claim 1, wherein the last accessed registry in the buffer is marked, whereby only one registry in the buffer is marked at any time.

9. The method according to claim 5, wherein the registry in the buffer that was last accessed is marked, whereby only one registry in the buffer is marked at any time, and the portion replaces a registry in the buffer other than the marked registry.

10. The method of claim 1, further comprising maintaining a particular active column in each memory bank in the main memory, wherein the particular column is the last accessed column in each memory bank.

11. A column head buffer circuit for latching a column head, wherein the column head is part of a memory column stored in a memory bank, the buffer circuit comprising:
a column head buffer including a plurality of column head registration units, each column head registration unit corresponding to a column head in a memory bank;
a plurality of column address latches for latching the physical addresses of the column head registration units contained in the column head buffer; and
a column address comparator for comparing the latched column addresses with an incoming requested column address,
wherein the buffer circuit compares the incoming column address requested by a memory controller with the plurality of column address latches and, when the incoming requested column address matches one of the plurality of column address latches, transmits the column head data registration unit corresponding to the matching latch to the memory controller.

12. The buffer circuit of claim 11, wherein the mechanism keeps at least one column address active per memory bank by sending at least one latched column address per memory bank to a memory controller.

13. The buffer circuit according to claim 11, wherein the buffer circuit keeps at least one memory bank active through the memory controller.

14. A memory buffer subsystem comprising:
at least one buffer bank having a plurality of buffer registration units; and
a buffer controller that controls the buffer subsystem,
each buffer registration unit comprising:
an address area including a memory address corresponding to a position in a main memory bank; and
a data area containing the first n bytes of data located at the memory address in the main memory bank,
wherein, when the data located at the memory address is requested by a CPU, the first n bytes of data are provided to the CPU by the buffer subsystem, while the remainder of the data is recovered from the memory address in the main memory bank.

15. A memory system comprising:
at least one bank of main memory;
a memory controller;
a buffer; and
a buffer controller,
wherein the memory controller controls the at least one bank of main memory,
the buffer includes a plurality of buffer registration units,
each buffer registration unit includes an address part and a data part, and
the data part comprises a first portion of data in the at least one bank of main memory, and the address part comprises an address referring to a memory location.