JP2008507028A

JP2008507028A - System and method for managing cache memory

Info

Publication number: JP2008507028A
Application number: JP2007521441A
Authority: JP
Inventors: クリストファーケンドラー、フレデリック
Original assignee: Silicon Optix Inc.
Priority date: 2004-07-14
Filing date: 2004-07-14
Publication date: 2008-03-06
Anticipated expiration: 2024-07-14
Also published as: EP1769360A4; EP1769360A1; KR20070038955A; CN100533403C; JP5071977B2; CN1961295A; WO2006019374A1; KR101158949B1

Abstract

A cache memory method and a corresponding system are disclosed for two-dimensional data processing, in particular two-dimensional image processing with simultaneous coordinate transformation. The method uses a wide, fast primary cache memory (PCM) and a deep secondary cache memory (SCM), each having multiple banks so that data can be accessed simultaneously. Upon receiving control parameters from an external processor system (PU1), dedicated prefetch logic obtains pixel data from an external memory and stores the data in the SCM based on a secondary control queue. The data are then prepared in a specific block size and format and stored in the PCM based on an optimally sized prefetch primary control queue. The prepared data are then read and processed by another external processor system (PU2). The cache control logic ensures coherency of the data and the control parameters at the input of PU2.

Description

The present invention relates to the structure and management of cache memory in digital data processing, and in particular in digital image data processing.

Ever since the invention of new computer systems, there has been a constant race for faster processing and faster systems. Faster processors with potentially higher clock speeds have been created, and the amount of data and instructions has naturally grown rapidly as well. Computer systems contain storage devices of ever larger capacity for data and instructions, such as ROM (read-only memory) and burst-based memories such as DRAM. Structurally, large memory spaces are deep, which slows the processor's access to the data and instructions held in memory. This problem has created the need for more efficient memory management and for cache memories and cache memory structures. A cache memory is generally a shallow, wide storage device located inside or close to the processor, which makes it easier for the processor to access the data and to modify its contents. The philosophy of cache memory management is to keep copies of the data and instructions that are used most frequently, that is, those most likely to be used by the processor in the near future, in the fastest accessible storage device. The processor can then access data and instructions many times faster than if they resided in external memory. However, care must be taken to keep the contents consistent whenever the contents of the cache memory or the external memory are modified. Because of such issues, which involve both hardware and software functions, a variety of cache memory structures and management techniques have been created.

As mentioned above, a cache memory holds copies of the data and address pointers that the processor is most likely to access next. External memory generally stores data in capacitors and requires refresh cycles that replenish the capacitor charge to prevent data loss. A typical cache memory, by contrast, uses eight transistors to represent one bit and therefore needs no refresh cycle. Consequently, a cache memory provides far less storage per unit of size than external memory and can hold far less data. As a result, data and instructions must be selected carefully to optimize cache operation.

Various policies and protocols are used to optimize cache memory operation. The best known of these are the direct-mapped, fully associative, and set-associative schemes. These protocols are well known to those skilled in the art and serve the general purposes of computing, including data processing, web-based applications, and the like. U.S. Pat. No. 4,295,193, issued to Pomerene, presents a computing machine that concurrently executes instructions that have been compiled into multi-instruction words. It is one of the earliest patents to mention cache memory, address generators, instruction registers, and pipelining. U.S. Pat. No. 4,796,175, issued to Matsuo, presents a microprocessor with an instruction queue that prefetches instructions from a main memory and an instruction cache. U.S. Pat. No. 6,067,616, issued to Stiles, presents a branch prediction cache (BPC) scheme with a hybrid cache structure consisting of a fully associative, wide and shallow first-level BPC and a deep and narrow, direct-mapped second-level BPC holding partial prediction information. U.S. Pat. No. 6,654,856, issued to Frank, presents a cache management system for a computer system, with emphasis on a cache memory whose address structure is circular.

U.S. Pat. No. 6,681,296, issued to Liao, presents a microprocessor with a control unit and a cache that can be selectively configured either as a single cache or as a cache partitioned into a locked portion and a normal portion. U.S. Pat. No. 6,721,856, issued to Arimilli, presents a cache in which each line carries coherency state and system controller information, with different subentries for different processors containing the processor access sequence. U.S. Pat. No. 6,629,188 discloses a cache memory having a first and a second plurality of storage spaces. U.S. Pat. No. 6,295,582 discloses a cache system that provides data coherency and avoids deadlock between substantially sequential read and write commands. U.S. Pat. No. 6,339,428 discloses a cache device for video graphics in which compressed texture information is received and decompressed for texture operations. U.S. Pat. No. 6,353,438 discloses a cache organization that holds multiple tiles of texture image data and maps the data directly into the cache.

Each of the above inventions offers certain advantages. An efficient cache structure and policy depend strongly on the specific application at hand. In digital video applications, processing digital images in real time and with high quality is one of the major challenges of the field. Specifically, detailed two-dimensional image processing must be carried out while nonlinear coordinate transformations are performed at the same time. A dedicated, specialized system with the inherent advantage of fast access with data coherency is therefore needed, and the cache structure and cache management policy must be optimized for this application.
U.S. Pat. No. 4,295,193; U.S. Pat. No. 4,796,175; U.S. Pat. No. 6,067,616; U.S. Pat. No. 6,654,856; U.S. Pat. No. 6,681,296; U.S. Pat. No. 6,721,856; U.S. Pat. No. 6,629,188; U.S. Pat. No. 6,295,582; U.S. Pat. No. 6,339,428; U.S. Pat. No. 6,353,438

In one aspect, the present invention provides a cache memory management method and a cache memory structure for digital data processing, in particular digital image processing, in a setting comprising:
(a) an external memory in which the data to be accessed and processed are stored;
(b) a plurality of processor units (PU1) that issue control commands and generate control parameters and the memory addresses of the data to be processed in the external memory; and
(c) a plurality of processor units (PU2) that process the data.
The method uses a cache structure comprising:
(i) a deeper secondary cache memory (SCM) with larger storage capacity, having a plurality of banks, each bank having a plurality of storage lines, to read data from the external memory;
(ii) a faster and wider primary cache memory (PCM) with smaller storage capacity, having a plurality of banks, each bank having a plurality of storage lines, from which PU2 reads data; and
(iii) control logic, including control stages and control queues, that provides prefetching and cache coherency.
Upon receiving an address sequence and control parameters from PU1, this cache structure accesses the data in the external memory and prepares the data so that PU2 can access and process them quickly.
The method achieves cache coherency and hides memory read latency by:
(a) identifying which data blocks in the external memory are to be processed, based on the topology and structure of the processing operation in PU2;
(b) generating a sufficiently large SCM control queue based on the result of step (a) to determine whether the data are present in the PCM, so that the SCM accesses the data in the external memory sufficiently earlier than they are needed for processing by PU2;
(c) reading blocks of input data from the plurality of banks of the SCM simultaneously in a preset number of clock cycles, and abstracting the external memory data organization from the cache data organization by decompressing and reformatting the data, thereby hiding the external memory data organization from PU2 and speeding up data processing in PU2;
(d) generating a sufficiently large PCM control queue based on the results of steps (a) and (b), so that the abstracted data are stored in the PCM before they are needed by PU2; and
(e) synchronizing the arrival of the data and the arrival of the control parameters at PU2 to achieve cache coherency.

In another aspect, the present invention provides a cache system based on the method described above.

Further details of the various aspects and advantages of embodiments of the present invention are described below with reference to the accompanying drawings.

The present invention will now be described in detail with reference to the accompanying drawings and exemplary embodiments. The invention relates to cache structure and management. The embodiment given in the following description is an example of image processing with simultaneous coordinate transformation. Those skilled in the art will appreciate, however, that the scope of the invention is not limited to this particular example. The invention is relevant to any type of digital data processing in which a plurality of processors attempt to fetch data and control parameters from an external memory and from other processors, in any format. In particular, the two-dimensional (2D) image transformation example described here could clearly be replaced by any other 2D transformation without departing from the scope of the invention. Accordingly, in the following description, data means image pixel data. The plurality of processors that issue control parameters relating to the structure and topology of the input data are referred to as the geometry engine. Likewise, the plurality of processors that process the data are referred to as the filter engine, and the corresponding operation is referred to as filtering.

FIG. 1 illustrates an example setting of a cache system 100, built according to the present invention, in a computing device designed for digital image data processing with simultaneous coordinate transformation. Cache system 100 interfaces with two sets of processors. In this embodiment, the first plurality of processors constitutes geometry engine 300 and the second plurality of processors constitutes filter engine 500. In addition to these two engines, cache system 100 interfaces with external memory 700, which can be any memory with access latency. Cache system 100 receives control parameters, including coordinate transformation parameters and filter footprint parameters, from geometry engine 300. At the same time, it receives pixel data from external memory 700. Cache system 100 supplies these data to filter engine 500 in a manner that optimizes the filtering process while keeping stalls of filter engine 500 to a minimum.

Two-dimensional (2D) data processing, and digital image data processing in particular, requires a comprehensive filtering or sampling function. In the following, 2D image processing is taken as the specific example, and the word “pixel” is therefore used as a particular case of arbitrary 2D data. In 2D digital image processing, each output pixel is formed from information from many input pixels. First, the output pixel coordinates are mapped onto input pixel coordinates. This is a coordinate transformation, normally performed electronically by image warping techniques. Once the center input pixel has been determined, a filtering or sampling function is needed to generate the output pixel specification, namely the intensities of the constituent colors and other information such as the sampling format and blending function. The region containing all the pixels around the center input pixel over which the sampling is performed is called the filter footprint. It is well known in the art that the size and shape of the filter footprint affect the quality of the output image.
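The relationship between a center input pixel and its filter footprint can be illustrated with a minimal sketch. It assumes a square footprint and a plain averaging filter, neither of which is prescribed by the description; the function name `sample_output_pixel` and the `radius` parameter are hypothetical.

```python
def sample_output_pixel(image, u, v, radius=1):
    """Average the input pixels in a square footprint centered on (u, v).

    `image` is a list of rows of intensities. The square shape and the
    averaging filter are assumptions made only for illustration.
    """
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for row in range(v - radius, v + radius + 1):
        for col in range(u - radius, u + radius + 1):
            # Footprint pixels that fall outside the image are skipped.
            if 0 <= row < height and 0 <= col < width:
                total += image[row][col]
                count += 1
    return total / count


# A 4x4 test image whose intensity equals row * 4 + col.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
center_value = sample_output_pixel(image, 1, 1)
corner_value = sample_output_pixel(image, 0, 0)
```

Even this small footprint touches nine input pixels for one output pixel, which is why the filter engine needs many neighboring pixels available at once.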

The function of cache system 100 is to use a dedicated architecture and prefetch logic to supply filter engine 500 with enough randomly accessed pixel data and control parameters that the engine has data to process at any given clock rate, with minimal stalling. With read request queues of optimized size, cache system 100 can hide most of the memory read latency inherent in external memory 700, from which the pixel data are fetched. Hiding this memory read latency is paramount to filter performance: if the latency is not properly hidden, filter engine 500 will not achieve maximum throughput. The amount of stalling that can be tolerated is a design parameter. The various parameters must be tuned to achieve the required throughput as a trade-off against hardware cost.
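How the depth of a read request queue trades against stalling can be explored with a toy timing model. The sketch below is an assumption for illustration only and is not the control logic of the described system: it supposes a fixed memory latency, one request issued per cycle, one item consumed per cycle, and a queue slot that frees when its item is consumed. The function name `stall_cycles` is hypothetical.

```python
def stall_cycles(num_items, latency, queue_depth):
    """Count consumer stall cycles in a toy prefetch model.

    Assumptions (illustrative only): each read takes `latency` cycles,
    at most `queue_depth` reads are outstanding, and the consumer wants
    one item per cycle once the first item has arrived.
    """
    consume = []  # cycle at which each item is consumed
    for k in range(num_items):
        if k < queue_depth:
            issue = k
        else:
            # A slot frees up only when item k - queue_depth is consumed.
            issue = max(k, consume[k - queue_depth])
        arrival = issue + latency
        earliest = consume[-1] + 1 if consume else 0
        consume.append(max(arrival, earliest))
    ideal_finish = consume[0] + num_items - 1
    return consume[-1] - ideal_finish


shallow = stall_cycles(10, latency=4, queue_depth=3)
deep = stall_cycles(10, latency=4, queue_depth=4)
```

In this model a queue at least as deep as the latency removes the stalls entirely, while a shallower queue leaves some latency exposed; a real design would weigh that depth against hardware cost.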

In addition, cache system 100 provides a control path for the coordinate transformation and filter footprint parameters read from geometry engine 300. Cache system 100 ensures that the pixel data from external memory 700, on the one hand, and the control parameters from geometry engine 300, on the other, are synchronized when they arrive at the input of filter engine 500.

In this specification, quantities are written in italics (for example, 64 bytes) to distinguish them from reference numerals (for example, filter engine 500).

FIG. 2 illustrates an example of the detailed structure of cache system 100. For each output pixel, cache system 100 receives certain control parameters from geometry engine 300. These include the coordinates of the mapped input pixel, U and V, and additional control parameters, including those that define the shape, rotation and size of the filter footprint. At the same time, cache system 100 receives from external memory 700 the pixel data for each of the pixels contained in the filter footprint. These data include the intensity levels of the constituent colors in a color space, for example RGB or YCrCb, the sampling format, for example 4:4:4 or 4:2:2, and the blending function, that is, with or without α.

The structure of cache system 100 is related to dividing the input image into blocks of m×n pixels. FIG. 3 shows an example of an input image pixel block structure with n = 8 and m = 4. Input image 330 comprises a certain number of pixels, for example 1024×1024, divided into blocks. Each input pixel block 332 contains m×n input pixels 334. The block structure is, in general, a function of the footprint shape and size in the various filtering schemes.
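The division of the input image into blocks can be sketched as simple integer arithmetic. The description does not state which of m and n is the horizontal dimension; the sketch assumes n = 8 is the block width and m = 4 the block height, and the names `locate_pixel` and `block_grid` are hypothetical.

```python
BLOCK_WIDTH = 8   # n, assumed here to be the horizontal block size
BLOCK_HEIGHT = 4  # m, assumed here to be the vertical block size


def locate_pixel(u, v, block_w=BLOCK_WIDTH, block_h=BLOCK_HEIGHT):
    """Return (block column, block row, x offset, y offset) of pixel (u, v)."""
    return u // block_w, v // block_h, u % block_w, v % block_h


def block_grid(width, height, block_w=BLOCK_WIDTH, block_h=BLOCK_HEIGHT):
    """Return the number of block columns and block rows covering the image."""
    cols = (width + block_w - 1) // block_w   # ceiling division
    rows = (height + block_h - 1) // block_h
    return cols, rows


position = locate_pixel(9, 5)
grid = block_grid(1024, 1024)
```

Under these assumptions a 1024×1024 image is covered by 128 block columns and 256 block rows.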

Cache system 100 fetches the data relating to m×n input pixel blocks 332 and generates data blocks usable by filter engine 500. The system must therefore determine which blocks fall inside the footprint and which pixels within those blocks are to be included in the filtering. The structure of cache system 100 is scalable to match the input block data structure. It should also be noted that, in general, the structure of cache system 100 is a function of the nature and structure of the operation of filter engine 500. In the particular case of image processing, the structure and topology of that operation are defined in part by the filter footprint.
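Determining which blocks fall inside a footprint can be sketched as follows. The sketch assumes that the footprint is approximated by an axis-aligned bounding rectangle and that blocks are 8 pixels wide and 4 pixels high; both are illustrative assumptions, and `blocks_in_footprint` is a hypothetical name rather than the block inclusion stage itself.

```python
def blocks_in_footprint(u, v, half_w, half_h, block_w=8, block_h=4):
    """List the (block column, block row) pairs touched by a footprint.

    The footprint is assumed to be the rectangle spanning
    [u - half_w, u + half_w] by [v - half_h, v + half_h]; coordinates
    below zero are clamped to the image edge.
    """
    first_col = max(0, u - half_w) // block_w
    last_col = max(0, u + half_w) // block_w
    first_row = max(0, v - half_h) // block_h
    last_row = max(0, v + half_h) // block_h
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 3x3 footprint centered on the corner shared by four blocks.
needed = blocks_in_footprint(8, 4, half_w=1, half_h=1)
```

A footprint that straddles a block corner needs four blocks at once, which illustrates why several blocks must be readable in the same cycle.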

Referring now to the example of FIG. 2, cache system 100 comprises a shallow, wide, low-capacity primary cache 110; a deep, high-capacity secondary cache 120; a block inclusion stage 150; a block data generation stage 130; a primary cache control stage 170; and a secondary cache control stage 190. There are also a number of queues, which are described later. Pixel data are first read from external memory 700 into secondary cache 120. These data are then reformatted and decompressed by block generation stage 130 for use by filter engine 500. The reformatted data are placed in a queue and, at the appropriate time, written into primary cache 110, where they are immediately accessible to filter engine 500. The data path and the control logic structure are described in turn below.

Referring now to the example of FIG. 5, secondary cache 120 is a higher-capacity storage device that reads raw data from external memory 700. The pixel data in external memory 700 are stored in an arbitrary format, generally one not well suited to processing in filter engine 500; in one particular example, the data are stored sequentially, in scan-line order. Secondary cache 120 is designed to read these data efficiently with minimal interruption.

Each line in the secondary cache is designed to hold one burst of b₂ bytes of data from external memory 700. For this reason, the size of each line in secondary cache 120 is determined by the structure and read requirements of external memory 700. The number of lines in secondary cache 120 in which such data are stored is also a design parameter, optimized to reduce the secondary cache miss count. Secondary cache 120 is, in addition, banked to provide read throughput sufficient to update primary cache 110 so that stalls of filter engine 500 are minimized. These design parameters are critical to storing enough data for pixel processing by filter engine 500, because many neighboring pixels are needed to sample around a center input pixel.

Secondary cache 120 is therefore designed with a number of banks that have independent access lines, so that data can be read from external memory 700 simultaneously. As shown in the example of FIG. 5, secondary cache 120 has a number of banks 122, each with a certain number of lines 124. Each secondary cache line holds the data of one data burst read from external memory 700. These data must eventually be read by filter engine 500, so the number of secondary cache banks is designed as a function of data throughput. For an m×n input block structure in which Nc clock cycles are required to read the data, n/Nc banks are needed in secondary cache 120. In one particular embodiment, a combination of the least significant bits (LSBs) of U and V is used to distribute the data among the secondary cache banks. This reduces the complexity of the decoding logic, which saves area and makes updates much faster. j LSBs are used to divide each bank into 2^j partitions. With 2^i lines per secondary cache bank, the secondary cache architecture is 2^i/2^j set-associative. This design, together with an appropriate replacement policy for secondary cache 120, described later along with the cache logic, makes the partitioning simple and efficient and distributes the data throughout secondary cache 120.
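The bank arithmetic for the secondary cache can be made concrete with a short sketch. The n/Nc bank count comes from the description; the exact way the LSBs of U and V are combined is not given, so concatenating them is an assumption, as are the names `bank_count`, `bank_index` and `ways_per_partition` and the premise that each bank holds more lines than partitions.

```python
def bank_count(n, nc):
    """Number of secondary cache banks for block dimension n read in nc cycles."""
    if n % nc != 0:
        raise ValueError("n is assumed to be a multiple of nc")
    return n // nc


def bank_index(u, v, u_bits, v_bits):
    """Select a bank by concatenating the low bits of V and U (assumed scheme)."""
    u_low = u & ((1 << u_bits) - 1)
    v_low = v & ((1 << v_bits) - 1)
    return (v_low << u_bits) | u_low


def ways_per_partition(line_bits, partition_bits):
    """Lines per partition when a bank of 2**line_bits lines is split
    into 2**partition_bits partitions."""
    return 1 << (line_bits - partition_bits)


banks = bank_count(8, 4)
selected = bank_index(3, 3, u_bits=1, v_bits=1)
ways = ways_per_partition(line_bits=5, partition_bits=2)
```

Because the bank is chosen with masks and shifts only, neighboring coordinates land in different banks without any arithmetic more costly than bit selection, which matches the stated goal of simple decoding logic.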

Once the data have been read from external memory 700 into secondary cache 120, they must be converted into a format usable by filter engine 500. Block generation stage 130 reads the data from secondary cache 120 and prepares them in blocks that contain all the data from an m×n block of input pixels. As described above, block generation stage 130 reads n/Nc lines of secondary cache 120 in each clock cycle. This ensures that all the data relating to one input pixel block are read out together every Nc clock cycles. Depending on the data packing format and the read throughput, multiple reads from secondary cache 120 may be required to generate an input pixel block. In addition to reading these data, block generation stage 130 reformats and decompresses them into a format that filter engine 500 can use readily. Block generation stage 130 thus hides the original pixel data format, which may be compressed with various compression schemes. This relieves filter engine 500 of the task of working out the format of the pixel data in external memory 700 and unpacking the originally formatted data into blocks usable for filtering. The block data are finally stored in primary cache 110, from which they are read by filter engine 500.
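The reformatting role of the block generation stage can be illustrated for the scan-line storage example mentioned earlier. The sketch assumes uncompressed pixels stored as one flat list in scan-line order and blocks 8 pixels wide and 4 pixels high; decompression is omitted, and `extract_block` is a hypothetical name.

```python
def extract_block(raster, image_width, block_col, block_row,
                  block_w=8, block_h=4):
    """Gather one block of pixels from a flat scan-line-ordered buffer.

    Each block row is a contiguous slice of a different scan line, so
    one block needs `block_h` separate reads of the source data.
    """
    rows = []
    for dy in range(block_h):
        line = block_row * block_h + dy
        start = line * image_width + block_col * block_w
        rows.append(raster[start:start + block_w])
    return rows


# A 16-pixel-wide, 8-line image whose pixel value equals its address.
raster = list(range(16 * 8))
block = extract_block(raster, image_width=16, block_col=1, block_row=1)
```

The consumer of `block` sees a compact two-dimensional tile and never needs to know that the source was stored line by line.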

Referring now to the example of FIG. 4, primary cache 110 is designed to optimize the speed of data access by filter engine 500. It therefore has a shallow but wide structure with multiple access lines. Primary cache 110 is divided into a number of banks, and each primary cache bank 112 is read by filter engine 500 independently and simultaneously. The number of primary cache banks is determined from empirical data and simulation so as to optimize filtering performance. Each primary cache bank 112 contains a certain number of primary cache lines. Each primary cache line 114 holds the data of one complete m×n block of input data. Thus, with b₁ primary cache banks, filter engine 500 reads the data of b₁ input blocks per cycle, in the proper format. This is crucial because sampling requires many input blocks around an input pixel, and if they are not supplied to filter engine 500, the engine stalls. The duration and frequency of these stalls determine the throughput performance.

The least significant bits (LSBs) of the input pixel coordinates U and V are used to distribute data among the primary cache banks. Each primary cache bank 112 within the primary cache 110 is further divided into a number of partitions. As noted, a certain number of LSBs distribute the data among the primary cache banks; further LSBs from the remaining bits of the input pixel U and V addresses then distribute the data within each primary cache bank. With 2^f lines per primary cache bank and g LSBs used to partition each bank, this division yields a 2^f/2^g set-associative architecture.
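The LSB-based distribution described above can be illustrated with a minimal sketch. The function name, the default bit widths, and the way the U and V bits are concatenated are assumptions made for illustration only; the specification does not fix them.

```python
def locate_block(u, v, bank_bits_u=1, bank_bits_v=1, set_bits_u=1, set_bits_v=1):
    """Return (bank, partition) for the block at block coordinates (u, v).

    Hypothetical layout: the lowest bits of U and V select the bank, and the
    next bits of U and V select the partition inside that bank.
    """
    # Lowest LSBs of U and V pick the primary cache bank.
    bank_u = u & ((1 << bank_bits_u) - 1)
    bank_v = v & ((1 << bank_bits_v) - 1)
    bank = (bank_v << bank_bits_u) | bank_u
    # The next LSBs of the remaining address bits pick the partition.
    part_u = (u >> bank_bits_u) & ((1 << set_bits_u) - 1)
    part_v = (v >> bank_bits_v) & ((1 << set_bits_v) - 1)
    partition = (part_v << set_bits_u) | part_u
    return bank, partition


def lines_per_partition(f, g):
    """With 2^f lines per bank and g partition bits, each partition holds 2^f / 2^g lines."""
    return 2 ** (f - g)
```

Under these assumptions, the four blocks of any 2 × 2 neighborhood fall into four different banks, so they can be read in the same cycle.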

As described below, this design is used together with a suitable replacement policy for the primary cache 110 to achieve optimal throughput. The architecture scales simply and naturally, because as the amount of input data grows, more bits become available in the U and V addresses.

Prefetch logic is designed to ensure that data in a usable format are present when the filter engine 500 needs them. FIG. 6 shows the cache control logic 400. This logic controls the reading of data from the external memory 700 by the secondary cache 120, the reading and reformatting of data in the block generation stage 130, and the storing of data blocks in the primary cache 110.

At step 402, the data blocks needed for sampling are determined from the control parameters received from the geometry engine 300. Once the data are identified, step 410 determines whether they are present in the primary cache. If so, an entry is written to the primary control queue at step 412, and the address of the data is sent to the filter engine at step 414. If the data are not present in the primary cache, step 415 determines which primary cache line to replace according to the adopted replacement policy, described below. The address of this primary cache line is then written to the primary control queue at step 416 and sent to the filter engine at step 418. Step 420 then determines whether the data are present in the secondary cache. If they are not present there either, step 422 determines which secondary cache lines to replace. A read request is then sent to the external memory to fetch the data, which are later read into the secondary cache at step 426. If the data are present in the secondary cache, an entry is written to the secondary cache control queue at step 428.

In either case, a secondary cache hit or a secondary cache miss followed by a fetch from the external memory, the secondary cache data are read for block generation at step 440. Here the data are read from multiple secondary cache banks and are reformatted and decompressed at step 442. At step 450, the block of input data, now in the appropriate format, is placed in a queue for storage in the primary cache. At step 452, the data are stored in the primary cache banks.
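The flow of steps 410 through 452 can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the hardware described here: the class and attribute names are hypothetical, the caches are unbounded (no replacement), external memory latency is not modeled, and the primary cache tag is assumed to be reserved at the moment the miss is registered.

```python
from collections import deque


class PrefetchModel:
    """Behavioral sketch of the prefetch control flow (hypothetical names)."""

    def __init__(self):
        self.primary_tags = set()               # blocks reserved in the primary cache
        self.secondary_tags = set()             # blocks present in the secondary cache
        self.primary_control_queue = deque()    # (block, hit) entries for the filter engine
        self.secondary_control_queue = deque()  # entries for the block generation stage
        self.block_queue = deque()              # formatted blocks awaiting the primary cache
        self.memory_reads = []                  # read requests sent to external memory

    def request(self, block):
        hit = block in self.primary_tags                   # step 410
        self.primary_control_queue.append((block, hit))    # step 412 or 416
        if hit:
            return
        self.primary_tags.add(block)            # assumption: tag reserved on the miss
        if block not in self.secondary_tags:               # step 420
            self.memory_reads.append(block)     # read request to external memory
            self.secondary_tags.add(block)                 # step 426, latency ignored
        self.secondary_control_queue.append(block)         # step 428

    def generate_blocks(self):
        # Steps 440 to 450: read, reformat, and queue each requested block.
        while self.secondary_control_queue:
            self.block_queue.append(self.secondary_control_queue.popleft())
```

In this model a repeated request for the same block produces one external memory read and one block queue entry, while the primary control queue still records every request in order.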

The primary cache 110 is updated when the associated control data are read from the primary control queue 212 and the pixel control queue 218. This ensures that cache coherency is maintained within the primary cache 110. At this point, at step 510, the data from the primary cache arrive at the filter engine input coherently with the control parameters.

The prefetch logic is designed to hide read latency from the filter engine 500. Without this control logic, data throughput is not optimized and the filter engine 500 stalls more often. With sufficiently sized queues, optimal storage size, data preparation, and an intelligent replacement policy, the cache system 100 hides most of the read latency by running ahead of the filter engine 500.

Referring again to FIG. 2, a hardware embodiment of the cache control logic 400 is now described. The block inclusion stage 150 is the starting point of this control logic. For each output pixel, it receives control parameters from the geometry engine 300, including the coordinates of the mapped input pixel and the shape of the filter footprint. From the input pixel coordinates U and V, the footprint shape, and the other control parameters, the block inclusion logic determines which input blocks are needed to process each output pixel and which pixels in each block are needed for sampling.

In one example of the invention, the block inclusion stage 150 compares the coordinate positions of neighboring blocks with the geometry of the footprint to include the blocks of pixels needed for sampling. Every clock cycle, the block inclusion logic generates k blocks whose block addresses differ from one another in at least one LSB of U or V. This guarantees that k distinct LSB combinations are present in each set of blocks the logic generates, a constraint that is used to distribute the blocks among the primary cache banks. The number k of blocks generated per clock cycle is a function of the footprint size, and the topology of the blocks is a function of the footprint shape. These parameters should be considered in the design of the cache system 100, through careful simulation and experiment, with regard to data processing in the filter engine 500. The pixel control queue 218 generated by the block inclusion stage 150 is sent to the filter engine 500 ahead of time, so that the filter engine 500 can generate scaling parameters before the actual pixel data arrive.
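As a rough illustration of block inclusion, the following sketch lists the blocks overlapped by the bounding box of a footprint centered on the mapped input pixel. The rectangular bounding box, the function name, and the parameter names are assumptions; the specification leaves the footprint shape and the comparison method open.

```python
def blocks_for_footprint(u, v, half_w, half_h, block_w, block_h):
    """Return the (bu, bv) block coordinates overlapped by a footprint bounding box.

    (u, v) is the center in input pixel coordinates; half_w and half_h are the
    half extents of the bounding box; block_w and block_h are the block
    dimensions in pixels. All names are hypothetical.
    """
    first_bu = int((u - half_w) // block_w)
    last_bu = int((u + half_w) // block_w)
    first_bv = int((v - half_h) // block_h)
    last_bv = int((v + half_h) // block_h)
    return [(bu, bv)
            for bv in range(first_bv, last_bv + 1)
            for bu in range(first_bu, last_bu + 1)]
```

For a footprint that spans a 2 × 2 group of blocks, the four block addresses have four distinct combinations of the lowest U and V bits, which is the property used to spread them over the primary cache banks.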

The primary cache control stage 170 provides the control logic for handling data in the primary cache 110. For each input block determined by the block inclusion stage 150, the primary cache control stage 170 checks whether the block is present in the primary cache 110. If it is, this is called a cache hit. Otherwise a cache miss is registered and a miss flag is sent to the secondary cache control stage 190. The primary cache control stage 170 writes an entry to the primary control queue 212 indicating the address of the data within the primary cache 110 and whether a primary cache hit or miss occurred. The filter engine 500 reads the primary control queue 212 in FIFO order. When the cache miss flag is raised in one of these entries, the filter engine 500 sends a read request to the block queue 214, which then updates the primary cache 110.

A primary cache miss occurs when a data block is not present in the primary cache 110, that is, when its U or V address matches none of the checked blocks or when the associated valid bit is not set. On receiving a primary cache miss flag, the control logic in the secondary cache control stage 190 determines what steps to take to generate the m × n block to be written to the primary cache. The secondary cache control stage 190 first determines whether the data are present in the secondary cache 120, which results in either a secondary cache hit or a secondary cache miss. On a secondary cache miss, the secondary cache control stage 190 sends a read request to the external memory 700 to fetch the missing data from the external memory 700 into the secondary cache 120, and writes an entry to the secondary control queue 216. On a secondary cache hit, the secondary cache control stage 190 sends no read request and simply writes an entry to the secondary control queue 216, where entries are read by the block generation stage 130 in FIFO order.

On receiving each queue entry, the block generation stage 130 reads the raw data associated with an entire input block from the secondary cache 120. The block generation stage 130 then reformats these data into a format that the filter engine 500 can use directly. Depending on the data packing mode, several secondary cache lines may be needed to generate one primary cache line 114. Once all the data associated with one input block have been obtained and reformatted, the block generation stage 130 writes an entry to the block queue 214. Each block queue entry therefore contains all the data of an entire input block in the appropriate format. The block queue entry is then received by the primary cache 110, where it is stored for ready access by the filter engine 500. The block queue 214 thus allows the secondary cache 120 to run ahead of the filter engine 500.

It should be noted that the operation of the cache system 100 depends on the coherency of pixel data and control parameters as well as on the dedicated prefetch logic. No data are read by the secondary cache 120 without a request from the secondary cache control stage 190. Once the data are in the secondary cache, only the entries in the secondary control queue 216 determine whether they are needed for block generation in the block generation stage 130. Once a block of data has been generated, it is queued and stored in the primary cache only on a read request from the filter engine 500, which is itself prompted by an entry in the primary control queue 212. Moreover, the filter engine waits until both the data and the control parameters have arrived from two independent queues before processing the data.
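The last point, that processing starts only when both data and control parameters have arrived, can be sketched as follows. The function and queue names are hypothetical, and the two queues are modeled as plain FIFOs.

```python
from collections import deque


def filter_engine_step(data_queue, control_queue):
    """Consume one (data, control) pair, or return None to model a stall.

    Nothing is removed from either queue unless both have an entry, so data
    and control parameters always reach the engine as a matched pair.
    """
    if not data_queue or not control_queue:
        return None  # stall until the slower queue catches up
    return data_queue.popleft(), control_queue.popleft()
```

Because entries leave both FIFOs in lockstep, the n-th data block is always paired with the n-th set of control parameters.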

Depending on the relative sizes of the filter footprint and the cache storage space, it may be necessary to divide the footprint into sub-footprints and process the data in each sub-footprint in turn. This measure is provided for in the design of the cache system 100 for dynamically sized footprints. Once the data associated with each sub-footprint are cached, the filter engine processes them sequentially.

To gauge the effect of data prefetching, which allows the cache system 100 to hide memory read latency, one example of the invention was benchmarked with a read latency on the order of 128 clock cycles. With sufficiently large queues, almost all of this latency is hidden. The queue sizes in the invention can be adjusted to match the memory read latency of the system, and are therefore scalable design parameters based on the system specification.
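A common rule of thumb, which is not stated in this specification and is offered only as an assumption, is that a queue must hold at least as many entries as are issued during one memory round trip. The helper below and its parameter names are hypothetical.

```python
import math


def min_queue_depth(read_latency_cycles, entries_per_cycle):
    """Estimate the queue depth needed to cover one memory read latency.

    Assumes a steady request rate; actual sizes would be settled by
    simulation, as the description notes.
    """
    return math.ceil(read_latency_cycles * entries_per_cycle)
```

For a latency of 128 cycles and one new entry every four cycles, this estimate gives a depth of 32 entries.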

Once the cache logic determines that a block of data should be read by the secondary cache 120 or prepared for storage in the primary cache 110, a replacement policy is needed: one existing primary cache line 114 or several secondary cache lines 124 must be replaced. In one example of the invention, the cache replacement policy is distance-based. Using the input block addresses U and V, the primary cache control stage 170 and the secondary cache control stage 190 compare the coordinates U and V of the central input pixel with the coordinates of the block data already held in the cache lines. The entry farthest from the central input pixel is then replaced. This policy follows from the fact that the closer a block is to the central pixel, the more likely it is to be needed for the sampling computation.
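A minimal sketch of the distance-based choice follows. The squared Euclidean metric and the function name are assumptions; the description says only that the entry with the largest distance from the central input pixel is replaced.

```python
def pick_victim_by_distance(lines, center_u, center_v):
    """Return the index of the cached block farthest from the central pixel.

    lines is a list of (u, v) coordinates of the blocks held in the candidate
    cache lines. Squared Euclidean distance is an assumed metric.
    """
    def dist2(block):
        du = block[0] - center_u
        dv = block[1] - center_v
        return du * du + dv * dv

    return max(range(len(lines)), key=lambda i: dist2(lines[i]))
```

With candidate blocks at (4, 4), (10, 2), and (5, 6) and a central pixel at (5, 5), the block at (10, 2) is selected for replacement.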

In another example of the invention, the cache replacement policy is a least-recently-used (LRU) policy. In this example, the primary cache control stage 170 and the secondary cache control stage 190 replace the least recently used cache line.
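The LRU alternative can be sketched in software with an ordered dictionary. The class name and interface are hypothetical, and a hardware implementation would track recency differently.

```python
from collections import OrderedDict


class LRUSet:
    """Fixed-capacity group of cache lines with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first, newest entry last

    def access(self, block):
        """Touch a block; return the evicted block, or None if none was evicted."""
        if block in self.lines:
            self.lines.move_to_end(block)  # hit: mark as most recently used
            return None
        evicted = None
        if len(self.lines) >= self.capacity:
            evicted, _ = self.lines.popitem(last=False)  # drop the oldest entry
        self.lines[block] = True
        return evicted
```

With a capacity of two, the access sequence A, B, A, C evicts B, because A was touched more recently than B when C arrives.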

The design of the cache system 100 includes several provisions that make the system scalable. The secondary cache line size can be scaled to the memory read size from the external memory 700, for example the burst size, and to the block generation rate. The number of secondary cache lines can be scaled according to the required cache efficiency. The number of secondary cache banks can be scaled according to the input block data structure and the number of clock cycles per access from the secondary cache. Scaling of the secondary cache 120 is based on size requirements and on cache system efficiency, that is, the amount of input digital data that is re-read.

The number of blocks generated per clock cycle in the block inclusion stage 150 can be scaled according to the filtering algorithm, the footprint size, and the required throughput. The partitioning of the primary cache 110 and the secondary cache 120 based on the LSBs of the input pixel coordinates U and V can also be adapted to the size of the cache, through the number of bits used for each partitioning purpose. The primary cache line size can be scaled according to the input block size. The number of primary cache banks can be scaled according to the filtering throughput. The sizes of the various queues are likewise scalable parameters that depend on the relationship between memory latency and required throughput. These sizes are determined from simulation and empirical data.

All of these design parameters must be weighed carefully as trade-offs between cost and performance. Careful simulation and experiment are therefore carried out for each particular implementation of the invention to optimize the cache solution for the case at hand.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is therefore to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

FIG. 1 is a diagram showing the overall scheme of a cache system constructed in accordance with the present invention.
FIG. 2 is a diagram showing the detailed structure of a cache system constructed in accordance with the present invention.
FIG. 3 is a diagram showing an example of the block structure of the input data to be cached.
FIG. 4 is a diagram showing the general structure of a primary cache system constructed in accordance with the present invention.
FIG. 5 is a diagram showing the general structure of a secondary cache system constructed in accordance with the present invention.
FIG. 6 is a diagram showing the flow logic of a cache system constructed in accordance with the present invention.

Claims

A cache structure and management method in data processing, in particular two-dimensional image processing with simultaneous coordinate transformation, in an arrangement comprising:
(A) an external memory for storing data to be accessed and processed;
(B) a plurality of processor units (PU1) that issue control commands and generate control parameters and memory addresses of the data to be processed in the external memory; and
(C) a plurality of processor units (PU2) for processing data;
the method using a cache system, the cache system comprising:
(i) a deeper secondary cache memory (SCM) with larger storage capacity, having a plurality of banks each having a plurality of storage lines, for reading data from the external memory;
(ii) a faster and wider primary cache memory (PCM) with smaller storage capacity, having a plurality of banks each having a plurality of storage lines, from which the PU2 reads data; and
(iii) control logic including control stages and control queues, which provides prefetch functionality and cache coherency;
the cache system accessing the data in the external memory on receiving address sequences and control parameters from the PU1, and preparing the data for fast access and processing by the PU2,
the method comprising:
(a) identifying which data blocks in the external memory are to be processed, based on the topology and structure of the processing operation in the PU2;
(b) generating a sufficiently large SCM control queue based on the result of step (a) and on determining whether the data are present in the PCM, so that the SCM accesses the data in the external memory sufficiently earlier than they are required for processing by the PU2;
(c) simultaneously reading blocks of input data from a plurality of banks of the SCM in a predetermined number of clock cycles, and decompressing and reformatting the data so that the external memory data organization is separated from the cache data organization, thereby concealing the external memory data organization from the PU2 to increase the speed of data processing in the PU2;
(d) generating a sufficiently large PCM control queue based on the results of steps (a) and (b), and storing the extracted data in the PCM before the data are needed by the PU2; and
(e) synchronizing the timing at which data arrive at the PU2 with the timing at which control parameters arrive;
thereby achieving cache coherency and concealing memory read latency.

The method of claim 1, further comprising the step of optimizing the SCM structure, including the number of SCM banks, the number of lines per SCM bank, and the SCM line size, based on the input block data structure, the read format from the external memory, and the required throughput.

The method of claim 2, further comprising optimizing the PCM structure, including the number of PCM banks, the number of lines per PCM bank, and the size of the PCM lines, based on the output data structure, format, and required throughput.

The method of claim 3, wherein the mapping to the cache system is a direct mapping based on an address sequence.

The method of claim 3, wherein the mapping to the cache system is performed in two stages:
(a) direct mapping based on address sequence; and
(b) a distance-based replacement policy in which data associated with an input block that is remote from the data block being processed is replaced.

The method of claim 3, wherein the mapping to the cache system is performed in two stages:
(a) direct mapping based on address sequence; and
(b) a least recently used replacement policy in which data associated with the least recently used input block is replaced.

The method of claim 3, further comprising scaling the PCM size based on the amount of data accessed.

The method of claim 3, further comprising scaling the SCM size based on the amount of data accessed.

The method of claim 3, further comprising scaling the PCM size based on cache update frequency.

The method of claim 3, further comprising scaling the SCM size based on a reread coefficient.

The method of claim 3, further comprising the step of dividing the input data block into sub-blocks and sequentially caching data from each sub-block for processing in the PU2.

The method of claim 3, further comprising scaling the depth of the control queue and data queue to optimize throughput.

The method of claim 3, further comprising scaling the PCM output width and the number of banks based on the PU2 throughput requirement.

The method of claim 3, further comprising scaling the PCM line size based on an input data block size.

The method of claim 3, further comprising scaling the SCM line size based on the external memory burst size.

The method of claim 3, further comprising scaling the number of SCM banks based on the required rate of the PCM update.

The method of claim 3, further comprising allocating data among the PCM and the SCM based on a least significant bit of a memory address of an input data block.

A cache system for data processing, in particular two-dimensional image processing with simultaneous coordinate transformation, in an arrangement comprising:
(A) an external memory for storing data to be accessed and processed;
(B) a plurality of processor units (PU1) that issue control commands and generate control parameters and memory addresses of the data to be processed in the external memory; and
(C) a plurality of processor units (PU2) for processing data;
the cache system comprising:
(i) a deeper secondary cache memory (SCM) with larger storage capacity, having a plurality of banks each having a plurality of storage lines, for reading data from the external memory;
(ii) a faster and wider primary cache memory (PCM) with smaller storage capacity, having a plurality of banks each having a plurality of storage lines, from which the PU2 reads data; and
(iii) control logic including control stages and control queues, which provides prefetch functionality and cache coherency;
the cache system accessing the data in the external memory on receiving address sequences and control parameters from the PU1, and preparing the data for fast access and processing by the PU2,
wherein the system is configured to:
(a) identify which data blocks in the external memory are to be processed, based on the topology and structure of the processing operation in the PU2;
(b) generate a sufficiently large SCM control queue based on the result of (a) and on determining whether the data are present in the PCM, so that the SCM accesses the data in the external memory sufficiently earlier than they are required for processing by the PU2;
(c) simultaneously read blocks of input data from a plurality of banks of the SCM in a predetermined number of clock cycles, and decompress and reformat the data so that the external memory data organization is separated from the cache data organization, thereby concealing the external memory data organization from the PU2 to increase the speed of data processing in the PU2;
(d) generate a sufficiently large PCM control queue based on the results of (a) and (b), and store the extracted data in the PCM before the data are needed by the PU2; and
(e) synchronize the timing at which data arrive at the PU2 with the timing at which control parameters arrive;
thereby achieving cache coherency and concealing memory read latency.

The system of claim 18, further adapted to optimize the SCM structure, including the number of SCM banks, the number of lines per SCM bank, and the SCM line size, based on the input block data structure, the read format from the external memory, and the required throughput.

The system of claim 19, further adapted to optimize the PCM structure, including the number of PCM banks, the number of lines per PCM bank, and the size of the PCM lines, based on the output data structure, format, and required throughput.

The system of claim 20, wherein the mapping to the cache system is a direct mapping based on an address sequence.

The system of claim 20, wherein the mapping to the cache system is performed in two stages:
(a) direct mapping based on address sequence; and
(b) a distance-based replacement policy in which data associated with an input block that is remote from the data block being processed is replaced.

The system of claim 20, wherein the mapping to the cache system is performed in two stages:
(a) direct mapping based on address sequence; and
(b) a least recently used replacement policy in which data associated with the least recently used input block is replaced.

The system of claim 20, further configured to scale the PCM size based on an amount of data accessed.

The system of claim 20, further configured to scale the SCM size based on the amount of data accessed.

The system of claim 20, further configured to scale the PCM line size based on cache update frequency.

The system of claim 20, further configured to scale the SCM size based on a reread factor.

The system of claim 20, further configured to divide an input data block into sub-blocks and to sequentially cache data from each sub-block for processing in the PU2.

The system of claim 20, further configured to scale the depth of the control queue and data queue to optimize throughput.

The system of claim 20, further configured to scale the PCM output width and the number of banks based on the PU2 throughput requirement.

The system of claim 20, wherein the system is adapted to scale the PCM line size based on an input data block size.

The system of claim 20, wherein the system is adapted to scale the SCM line size based on the external memory burst size.

The system of claim 20, wherein the system is adapted to scale the number of SCM banks based on a required rate of the PCM update.

The system of claim 20, wherein data is distributed among the PCM and the SCM based on a least significant bit of a memory address of an input data block.