JP2006520044A - Data processing system with cache optimized for processing data flow applications

Info

Publication number: JP2006520044A
Application number: JP2006506643A
Authority: JP
Inventors: Josephus T. J. van Eijndhoven; Martijn J. Rutten; Evert-Jan D. Pol
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2003-03-06
Filing date: 2004-02-25
Publication date: 2006-08-31
Also published as: ATE487182T1; US20070168615A1; KR20050116811A; EP1604286A2; EP1604286B1; WO2004079488A3; CN1757017A; WO2004079488A2; DE602004029870D1; CN100547567C

Abstract

Non-overlapping cache locations are reserved for each data stream. Stream information that is unique to each stream is therefore used to index the cache memory; here, the stream information is represented by a stream identifier. In particular, a data processing system is provided that is optimized for processing data flow applications comprising tasks and data streams, in which different streams compete for shared cache resources. An unambiguous stream identifier is associated with each of the data streams. The data processing system comprises at least one processor (12) for processing streaming data; at least one cache memory (200) having a plurality of cache blocks, one cache memory (200) being associated with each processor (12); and at least one cache controller (300) for controlling the cache memory (200), one cache controller (300) being associated with each cache memory (200). The cache controller (300) comprises selection means (350) for selecting, according to the stream identifier (stream_id), the location in the cache memory (200) at which elements of a data stream are stored.

Description

The present invention relates to a data processing system optimized for processing data flow applications comprising tasks and data streams, to a semiconductor device for use in a data processing environment optimized for processing such applications, and to a method for indexing a cache memory in a data processing environment optimized for processing such applications.

The design effort for data processing systems intended for data flow applications, such as high-definition digital TV, set-top boxes with time-shift functionality, 3D games, video conferencing, and MPEG-4 applications, has grown in recent years because of the increasing demand for such applications.

In stream processing, successive operations on a stream of data are performed by different processors. For example, a first stream may consist of the pixel values of an image, which a first processor processes to produce a second stream of blocks of DCT (Discrete Cosine Transformation) coefficients, one block for each 8×8 block of pixels. A second processor may then process the blocks of DCT coefficients to produce, for each block, a stream of blocks of selected and compressed coefficients.

To implement data stream processing, a number of processors are provided, each of which repeatedly performs a particular operation, each time using data from the next data object of a stream of data objects and/or producing the next data object of such a stream. The streams pass from one processor to another, so that the stream produced by a first processor can be processed by a second processor, and so on. One mechanism for passing data from a first processor to a second is to write the data blocks produced by the first processor into memory. The data streams in the network are buffered. Each buffer is implemented as a FIFO with exactly one writer and one or more readers. Because of this buffering, the writer and the readers do not need to synchronize their individual read and write operations on the channel with each other. A typical data processing system includes a mix of fully programmable processors and application-specific subsystems, each dedicated to a single application.
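The FIFO buffering described above can be illustrated with a minimal single-threaded sketch. It assumes a fixed-capacity ring buffer in which each reader keeps its own read position and a slot is reused only once every reader has consumed it. The class name `StreamFifo` and its methods are hypothetical and are not taken from the patent.

```python
class StreamFifo:
    """Bounded FIFO with exactly one writer and one or more readers.

    Assumption: every reader sees every item, so a slot can be reused
    only after the slowest reader has consumed it.
    """

    def __init__(self, capacity, num_readers):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.written = 0                     # total items written so far
        self.consumed = [0] * num_readers    # total items read, per reader

    def free_space(self):
        # The slowest reader determines how much room the writer has.
        return self.capacity - (self.written - min(self.consumed))

    def write(self, item):
        """Append one item; return False if the buffer is full."""
        if self.free_space() == 0:
            return False
        self.slots[self.written % self.capacity] = item
        self.written += 1
        return True

    def read(self, reader):
        """Return the next item for this reader, or None if nothing is pending."""
        if self.consumed[reader] == self.written:
            return None
        item = self.slots[self.consumed[reader] % self.capacity]
        self.consumed[reader] += 1
        return item
```

The writer only ever checks `free_space()` and a reader only ever compares its own position with `written`, which reflects why individual read and write operations do not have to be paired up with each other.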

An example of such an architecture is described in Rutten et al., “Eclipse: A Heterogeneous Multiprocessor Architecture for Flexible Media Processing”, IEEE Design and Test of Computers: Embedded Systems, pp. 39-50, July-August 2002. The required processing applications are specified as a Kahn process network, that is, a set of concurrently executing tasks that exchange data by means of unidirectional data streams. Each application task is mapped onto a particular programmable processor or onto one of the dedicated processors. The dedicated processors are implemented by coprocessors, which are only weakly programmable. Each coprocessor can execute, on a time-shared basis, multiple tasks from a single Kahn network or from multiple networks. The streaming nature of, for example, media processing applications results in a high locality of reference, that is, consecutive references to the memory addresses of neighboring data. Furthermore, distributed coprocessor shells are implemented between the coprocessors and the communication network, that is, the bus and the main memory. These shells are used to absorb many system-level problems such as multitasking, stream synchronization, and data transport. Because of its distributed nature, each shell can be implemented close to the coprocessor with which it is associated. In each shell, all data needed to handle the streams incident to the tasks mapped onto the associated coprocessor is stored in the shell's stream table.

The shells comprise caches in order to reduce the data access latency incurred when reading from or writing to memory. Data that is needed to perform future processing steps is held in a cache, that is, a smaller memory that is separate from the main memory and placed close to the processor that uses the stored data. In other words, the cache serves as an intermediate storage facility. Reducing the memory access latency increases the processing speed of the processor. If data words can be accessed by the processor from its own cache rather than from main memory, the average access time and the number of main memory accesses are reduced significantly.

Stream buffers implemented in shared memory compete for shared cache resources, such as the cache lines and the limited number of banks available for storing address tags. Because coprocessor tasks are input/output intensive, efficient cache behavior is required to avoid contention for cache resources, which can delay task execution.

It is therefore an object of the invention to reduce the occurrence of cache contention in an environment optimized for data flow applications, in which different streams compete for shared cache resources.

This object is achieved by a data processing system according to claim 1, by a semiconductor device according to claim 9 for use in a data processing environment optimized for processing data flow applications comprising tasks and data streams, and by a method according to claim 10 for indexing a cache memory in a data processing environment optimized for processing data flow applications.

The invention is based on the idea of reserving non-overlapping cache locations for each data stream. Stream information that is unique to each stream is therefore used to index the cache memory. Here, this stream information is represented by a stream identifier.

In particular, a data processing system is provided that is optimized for processing data flow applications comprising tasks and data streams, in which different streams compete for shared cache resources. An unambiguous stream identifier is associated with each of the data streams. The data processing system comprises at least one processor 12 for processing streaming data; at least one cache memory 200 having a plurality of cache blocks, one cache memory 200 being associated with each processor 12; and at least one cache controller 300 for controlling the cache memory 200, one cache controller 300 being associated with each cache memory 200. The cache controller 300 comprises selection means 350 for selecting, according to the stream identifier stream_id, the location in the cache memory 200 at which elements of a data stream are stored. The caching of data from different streams is thereby effectively separated.
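To make the effect of separating streams concrete, the following sketch compares a conventional address-indexed direct-mapped cache with one whose row is chosen by the stream identifier. All parameters (block size, cache size, buffer base addresses, strictly interleaved access order) are assumptions chosen so that the two stream buffers alias in the conventional cache; none of them come from the patent.

```python
BLOCK_SIZE = 16                       # bytes per cache block (assumed)
NUM_BLOCKS = 8                        # total cache blocks (assumed)
NUM_ROWS = 2                          # one row per stream in this toy setup
NUM_COLS = NUM_BLOCKS // NUM_ROWS     # cache blocks per row

def index_by_address(stream_id, addr):
    # Conventional direct-mapped index: block number modulo cache size.
    return (addr // BLOCK_SIZE) % NUM_BLOCKS

def index_by_stream(stream_id, addr):
    # Row from the stream identifier, column from the low block-number bits.
    row = stream_id % NUM_ROWS
    col = (addr // BLOCK_SIZE) % NUM_COLS
    return row * NUM_COLS + col

def count_misses(index_fn, accesses):
    tags = [None] * NUM_BLOCKS
    misses = 0
    for stream_id, addr in accesses:
        slot = index_fn(stream_id, addr)
        tag = (stream_id, addr // BLOCK_SIZE)
        if tags[slot] != tag:
            misses += 1
            tags[slot] = tag
    return misses

def interleaved_accesses(bases, num_words, word_size=4):
    # Each stream reads its buffer sequentially; the streams take turns.
    return [(sid, base + i * word_size)
            for i in range(num_words)
            for sid, base in enumerate(bases)]

accesses = interleaved_accesses(bases=[0x0000, 0x1000], num_words=64)
misses_address = count_misses(index_by_address, accesses)
misses_stream = count_misses(index_by_stream, accesses)
```

Under these assumptions the address-indexed cache misses on all 128 accesses, because the two streams keep evicting each other's block, while the stream-indexed cache misses only on the first access to each block (32 misses).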

According to a further aspect of the invention, the selection means 350 comprises subset determining means 352 for selecting a set of cache blocks from within the selected row of cache blocks in the cache memory 200, according to a subset of the input/output address of the stream.

According to an aspect of the invention, the selection means 350 comprises hashing function means 351 for applying a hash function to the stream identifier stream_id, mapping it onto a number that is smaller than the number of cache rows.

According to a further aspect of the invention, the hashing function means 351 is arranged to perform a modulo operation. By sharing the available cache rows across different tasks, the cache memory 200 can be made smaller, which limits the cost of cache memory in the overall system.

According to a further aspect of the invention, the selection means 350 selects the location for a data stream in the cache memory 200 according to a task identifier task_id and/or a port identifier port_id associated with the data stream.

The invention also relates to a semiconductor device for use in a data processing environment optimized for processing data flow applications comprising tasks and data streams, in which different tasks compete for shared cache resources and an unambiguous stream identifier stream_id is associated with each of the data streams. The device comprises a cache memory 200 having a plurality of cache blocks and a cache controller 300 for controlling the cache memory 200, the cache controller 300 being associated with the cache memory 200. The cache controller 300 comprises selection means 350 for selecting, according to the stream identifier stream_id, the location in the cache memory 200 at which elements of a data stream are stored.

The invention further relates to a method for indexing a cache memory 200 in a data processing environment optimized for processing data flow applications comprising tasks and data streams, in which different streams compete for shared cache resources. The cache memory 200 comprises a plurality of cache blocks. An unambiguous stream identifier stream_id is associated with each of the data streams. The location in the cache memory 200 at which elements of a data stream are stored is selected according to the stream identifier stream_id, so as to identify one of a number of subsets in the cache memory that is smaller than the number of possible distinct stream_id values.

Further aspects of the invention are set out in the dependent claims.

These and other aspects of the invention are described in more detail with reference to the drawings.

FIG. 1 shows a processing system for processing streams of data objects according to a preferred embodiment of the invention. The system can be divided into different layers, namely a computation layer 1, a communication support layer 2, and a communication network layer 3. The computation layer 1 includes a CPU 11 and two processors 12a and 12b. This is merely an example; more processors may obviously be included in the system. The communication support layer 2 comprises a shell 21 associated with the CPU 11 and shells 22a and 22b associated with the processors 12a and 12b, respectively. The communication network layer 3 comprises a communication network 31 and a memory 32.

The processors 12a and 12b are preferably dedicated processors, each specialized to perform a limited range of stream processing functions. Each processor is arranged to apply the same processing operation repeatedly to successive data objects of a stream. The processors 12a and 12b may each perform a different task or function, for example variable-length decoding, run-length decoding, motion compensation, image scaling, or a DCT transformation. In operation, each processor 12a, 12b executes operations on one or more data streams. The operations may involve, for example, receiving a stream and generating another stream, receiving a stream without generating a new stream, generating a stream without receiving a stream, or modifying a received stream. The processors 12a and 12b are able to process data streams generated by the other processor 12b or 12a or by the CPU 11, or even streams that they have generated themselves. A stream comprises a succession of data objects that are transferred from and to the processors 12a and 12b via the memory 32.

The shells 22a and 22b comprise a first interface towards the communication network layer, which serves as the communication layer. This layer is uniform and generic for all shells. The shells 22a and 22b furthermore comprise a second interface towards the processor 12a or 12b with which they are respectively associated. The second interface is a task-level interface and is customized for the associated processor 12a or 12b so that it can handle the specific needs of that processor. Accordingly, the shells 22a and 22b have a processor-specific interface as their second interface, but the overall architecture of the shells is generic and uniform for all processors, in order to facilitate reuse of the shells in the overall system architecture while allowing parameterization and adaptation for specific applications.

The shells 22a and 22b comprise a read/write unit for data transport, a synchronization unit, and a task switching unit. These three units communicate with the associated processor on a master/slave basis, with the processor acting as the master. Accordingly, each of the three units is initiated by a request from the processor. Preferably, the communication between the processor and the three units is implemented by a request-acknowledge handshake mechanism, in order to hand over argument values and wait for the requested values to be returned. The communication is therefore blocking, that is, the respective thread of control waits for its completion.

The shells 22a and 22b are distributed, so that each can be implemented close to the processor 12a or 12b with which it is associated. Each shell locally contains the configuration data for the streams incident to the tasks mapped onto its processor, and locally implements all the control logic needed to handle this data properly. Accordingly, a local stream table may be implemented in the shells 22a and 22b that contains a row of fields for each stream, or in other words, for each access point.

The shells 22 furthermore comprise a data cache for data transport, that is, read and write operations, between the processors 12 and the communication network 31 and memory 32. Implementing a data cache in the shells 22 provides a transparent translation of data bus widths, a resolution of alignment restrictions on the global interconnect, that is, the communication network 31, and a reduction in the number of I/O operations on the global interconnect.

Preferably, the shells 22 comprise caches in the read and write interfaces; these caches are, however, invisible from the point of view of application functionality. The caches play an important role in decoupling the processor read and write ports from the global interconnect of the communication network layer 3. These caches have a major influence on the system performance in terms of speed, power, and area.

For a more detailed description of the architecture according to FIG. 1, see Rutten et al., “Eclipse: A Heterogeneous Multiprocessor Architecture for Flexible Media Processing”, IEEE Design and Test of Computers: Embedded Systems, pp. 39-50, July-August 2002.

FIG. 2 shows part of the architecture of FIG. 1. In particular, a processor 12b, a shell 22b, a bus 31, and a memory 32 are shown. The shell 22b comprises a cache memory 200 and a cache controller 300 as part of its data transport unit. The cache controller 300 comprises a stream table 320 and selection means 350. The cache memory 200 may be divided into different cache blocks 210.

When a read or write operation, that is, an I/O access, is performed by a task on the coprocessor 12b, the access supplies task_id and port_id parameters alongside the address, indicating for which particular task and port the data is requested. The address indicates a location in a stream buffer in shared memory. The stream table 320 contains a row of fields for each stream and access point. In particular, the stream table is indexed with a stream identifier stream_id that is derived from the task identifier task_id, which indicates the task currently being processed, and the port identifier port_id, which indicates the port for which the data is received. The port_id has a local scope for each task.
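The text does not say how stream_id is derived from task_id and port_id. One possible reading, sketched below purely as an assumption, is a table filled at configuration time that assigns the next free stream table row to each (task_id, port_id) pair. The class `StreamTable`, its methods, and the `base` and `size` fields are hypothetical.

```python
class StreamTable:
    """Maps (task_id, port_id) pairs to rows of a stream table.

    Assumption: stream_id is simply the row number assigned when the
    stream is configured; the patent does not specify the derivation.
    """

    def __init__(self):
        self.stream_ids = {}   # (task_id, port_id) -> stream_id
        self.rows = []         # one row of fields per stream

    def add_stream(self, task_id, port_id, buffer_base, buffer_size):
        stream_id = len(self.rows)
        self.stream_ids[(task_id, port_id)] = stream_id
        # Hypothetical fields describing the stream buffer in shared memory.
        self.rows.append({"base": buffer_base, "size": buffer_size})
        return stream_id

    def lookup(self, task_id, port_id):
        stream_id = self.stream_ids[(task_id, port_id)]
        return stream_id, self.rows[stream_id]
```

Because port_id is local to a task, port 0 of task 0 and port 0 of task 1 are different access points; keying the table on the pair keeps their stream identifiers distinct.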

The first embodiment of the invention is directed to addressing by indexing that involves direct address decoding, in which the entry is determined directly from the decoding. The selection means 350 therefore uses the stream identifier stream_id to select a row of cache blocks in the cache memory 200. A particular cache block within the selected cache row is indexed by the lower bits of the address supplied by the coprocessor, that is, the I/O address. Alternatively, the upper bits of the address may be used for indexing. The organization of the cache memory 200 according to this embodiment is direct-mapped, that is, every combination of stream identifier and address can be mapped to only a single cache location. Consequently, the number of cache blocks in a row is restricted to a power of two. In other words, since a column is selected by decoding a number of address bits, the number of columns always comes out as a power of two.
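The index computation of this first embodiment can be sketched as follows. The function name `select_block` and the `offset_bits` parameter (address bits skipped because they address bytes within a block) are assumptions for illustration; the text itself only states that the row comes from stream_id and the column from the lower address bits.

```python
def select_block(stream_id, io_address, column_bits, offset_bits=0):
    """Return (row, column) of the single cache block for this access.

    row    : taken directly from the stream identifier
    column : decoded from `column_bits` low-order address bits, so a row
             always holds 2 ** column_bits cache blocks
    """
    row = stream_id
    column = (io_address >> offset_bits) & ((1 << column_bits) - 1)
    return row, column


def flat_index(row, column, column_bits):
    # Position of the block if the rows are laid out one after another.
    return (row << column_bits) | column
```

Decoding `column_bits` address bits can only yield 2 ** column_bits distinct columns, which is why the number of blocks per row is a power of two.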

FIG. 3 shows a conceptual view of a cache organization according to a second embodiment of the invention; this cache organization is direct-mapped. The selection means of FIG. 2 comprises hashing function means 351 and subset determining means 352. The stream_id is input to the hashing function means 351, while the I/O address is input to the subset determining means 352. Preferably, the hashing function means 351 performs a modulo operation over the number of cache rows in order to map the stream identifier stream_id onto the smaller number of cache rows of the cache memory. The subset determining means 352 determines a particular cache column of the cache memory from the lower bits of the address supplied by the coprocessor, that is, the I/O address. Alternatively, the upper bits of the address may be used for indexing. The cache row determined by the hashing function means 351 and the cache column determined by the subset determining means 352 together index a particular cache block. The actual data word can then be located by tag matching on the address.
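A minimal software model of this second embodiment is sketched below, assuming block-granular I/O addresses and using the remaining upper address bits as the tag. The class `StreamIndexedCache` and its `read` and `fill` methods are hypothetical names; only the modulo hash for the row, the low address bits for the column, and the tag match come from the text.

```python
class StreamIndexedCache:
    """Direct-mapped cache indexed by (stream_id mod rows, low address bits)."""

    def __init__(self, num_rows, column_bits):
        self.num_rows = num_rows
        self.column_bits = column_bits
        num_columns = 1 << column_bits
        # Each block holds a (tag, data) pair, or None while empty.
        self.blocks = [[None] * num_columns for _ in range(num_rows)]

    def _locate(self, stream_id, io_address):
        row = stream_id % self.num_rows                       # role of hashing function means 351
        column = io_address & ((1 << self.column_bits) - 1)   # role of subset determining means 352
        tag = io_address >> self.column_bits                  # remaining address bits
        return row, column, tag

    def read(self, stream_id, io_address):
        """Return cached data on a tag match, or None on a miss."""
        row, column, tag = self._locate(stream_id, io_address)
        entry = self.blocks[row][column]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def fill(self, stream_id, io_address, data):
        """Store data in the one block this access maps to."""
        row, column, tag = self._locate(stream_id, io_address)
        self.blocks[row][column] = (tag, data)
```

With four rows, streams 0 and 4 hash to the same row and can evict each other's blocks, whereas streams 0 and 1 never interfere. That is the trade-off of folding many stream identifiers onto fewer rows.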

Alternatively, the port identifier port_id may be used instead of the stream identifier stream_id as the input to the hashing function means 351. The hash function, that is, a modulo operation over the number of cache rows, is then applied to the port identifier port_id in order to map the port_id onto the smaller number of cache rows and so select a cache row. This has the advantage that, by sharing the available cache rows across different tasks, the cache memory 200 in the shell 22 can be made smaller, which limits the cost of cache memory in the overall system. A task may accordingly share a cache row among several of its task ports. This is beneficial and cost-effective in cases where all data is read from one task port while only some data is read sporadically from a second task port. The hardware cost of providing a cache row for each task port can thereby be reduced.

In a further alternative, the task identifier task_id is used as the input to the hashing function means 351 in order to select a cache row.

Although the principles of the invention have been described with reference to the architecture of FIG. 1, it is evident that the cache indexing scheme according to the invention can be extended to a more general set-associative cache organization, in which the stream_id selects a cache row, the lower bits of the address select a set of cache blocks, and the actual data is then located by tag matching on the address.
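The set-associative extension mentioned above might look like the following sketch. The number of ways and the least-recently-used replacement policy are assumptions, since the text does not name a replacement policy, and the class `StreamSetAssociativeCache` is a hypothetical name. I/O addresses are again treated as block-granular.

```python
from collections import OrderedDict


class StreamSetAssociativeCache:
    """stream_id picks the row, low address bits pick the set, tags pick the way."""

    def __init__(self, num_rows, set_bits, ways):
        self.num_rows = num_rows
        self.set_bits = set_bits
        self.ways = ways
        # sets[row][set_index] maps tag -> data, ordered from least to
        # most recently used (LRU replacement is an assumption).
        self.sets = [[OrderedDict() for _ in range(1 << set_bits)]
                     for _ in range(num_rows)]

    def _select_set(self, stream_id, io_address):
        row = stream_id % self.num_rows
        set_index = io_address & ((1 << self.set_bits) - 1)
        tag = io_address >> self.set_bits
        return self.sets[row][set_index], tag

    def read(self, stream_id, io_address):
        """Return cached data on a tag match within the set, else None."""
        cache_set, tag = self._select_set(stream_id, io_address)
        if tag not in cache_set:
            return None
        cache_set.move_to_end(tag)           # mark as most recently used
        return cache_set[tag]

    def fill(self, stream_id, io_address, data):
        cache_set, tag = self._select_set(stream_id, io_address)
        if tag not in cache_set and len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used way
        cache_set[tag] = data
        cache_set.move_to_end(tag)
```

Compared with the direct-mapped organization, several blocks of the same stream that share the same low address bits can now reside in the cache at once, at the cost of comparing tags across all ways of the selected set.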

FIG. 1 is a schematic block diagram of the architecture of a stream-based processing system according to the present invention.
FIG. 2 is a block diagram of a cache controller according to the present invention.
FIG. 3 is a conceptual diagram of a cache organization according to a second embodiment of the present invention.

Claims

1. A data processing system optimized for processing data flow applications comprising data streams and tasks, in which different streams compete for shared cache resources, wherein a distinct stream identifier is associated with each of the data streams, the data processing system comprising:
- at least one processor for processing streaming data;
- at least one cache memory having a plurality of cache blocks, wherein one of the cache memories is associated with each of the processors; and
- at least one cache controller for controlling the cache memory, wherein one of the cache controllers is associated with each of the cache memories;
wherein the cache controller comprises selection means for selecting a location for storing elements of a data stream in the cache memory according to the stream identifier.

2. The system of claim 1, wherein the selection means is arranged to select a subset of cache blocks in the cache memory according to the stream identifier.

3. The system of claim 2, further comprising subset determining means for selecting a set of cache blocks from among the subset of cache blocks in the cache memory according to a subset of the input/output address bits of the stream.

4. The system of claim 3, wherein the subset determining means is arranged to select a cache block according to the low-order bits of the input/output address of the stream.

5. The system of claim 3, wherein the subset determining means is arranged to select a cache block from within the set of cache blocks by tag matching on the subset of input/output address bits.

6. The system of claim 1, wherein the selection means comprises hash function means for performing a hash function on the stream identifier to map the stream identifier onto a number smaller than the number of cache lines.

7. The system of claim 6, wherein the hash function means is arranged to perform a modulo operation.

8. The system of claim 1, wherein the selection means is arranged to select a location for an element of a data stream in the cache memory according to a task identifier and/or a port identifier associated with the data stream.

9. A semiconductor device for use in a data processing environment optimized for processing data flow applications comprising data streams and tasks, in which different streams compete for shared cache resources, wherein a distinct stream identifier is associated with each of the data streams, the semiconductor device comprising:
- a cache memory having a plurality of cache blocks; and
- a cache controller associated with the cache memory for controlling the cache memory;
wherein the cache controller comprises selection means for selecting a location for storing elements of a data stream in the cache memory according to the stream identifier.

10. A method for indexing a cache memory in a data processing environment optimized for processing data flow applications comprising data streams and tasks, in which different streams compete for shared cache resources, wherein the cache memory has a plurality of cache blocks and a distinct stream identifier is associated with each of the data streams, the method comprising the step of:
- selecting a location for storing elements of a data stream in the cache memory according to the stream identifier.