JP4906734B2 - Video processing

Info

Publication number: JP4906734B2
Application number: JP2007541436A
Authority: JP
Inventors: シリッシュガドレ，; アシッシュカランディカー，; スティーヴンリュウ，; クリストファー，ティー．チェン，
Original assignee: エヌヴィディアコーポレイション
Priority date: 2004-11-15
Filing date: 2005-11-14
Publication date: 2012-03-28
Anticipated expiration: 2025-11-14
Also published as: KR20070063580A; KR20100093141A; WO2006055546A3; KR100880982B1; JP2008521097A; KR20080080419A; KR101002485B1; KR101030174B1; KR100917067B1; KR101084806B1; KR20090020715A; KR20110011758A; CA2585157A1; WO2006055546A9; WO2006055546A2; EP1812928A2; EP1812928A4

Abstract

A latency-tolerant system for executing video processing operations is described. Also described are stream processing in a video processor, a video processor having scalar and vector components, and multidimensional datapath processing in a video processor.

Description

Cross-reference of related applications

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 60/628,414 to Gadre et al., entitled "A METHOD AND SYSTEM FOR VIDEO PROCESSING" and filed on November 15, 2004, which is hereby incorporated by reference in its entirety.

Field of Invention

[001] The field of this description relates to digital electronic computer systems. More particularly, it relates to a system that efficiently processes video information in a computer system. In one aspect, a latency-tolerant system for executing video processing operations is described. In another aspect, stream processing in a video processor is described. Multidimensional datapath processing in a video processor is also described, as is a video processor having scalar and vector components.

Background

[002] The display of images and full-motion video is an area of the electronics industry that has improved considerably in recent years. The display and rendering of high-quality video, particularly high-definition digital video, is a primary goal of modern video technology applications and devices. Video technology is used in a wide variety of products, ranging from cellular phones, personal video recorders, and digital video projectors to high-definition televisions. The emergence and growth of devices capable of generating and displaying high-definition video is one area of the electronics industry undergoing substantial innovation and advancement.

[003] The video technology deployed in many consumer electronics and professional-level devices relies on one or more video processors to format and/or enhance video signals for display. This is especially true of digital video applications. For example, one or more video processors are incorporated into a typical set-top box and are used to convert HDTV broadcast signals into video signals usable by the display. Such conversion includes, for example, scaling, in which the video signal is converted from a non-16:9 video image for proper display on a 16:9 (e.g., widescreen) display. One or more video processors can also be used to perform scan conversion, in which a video signal is converted from an interlaced format, where the odd and even scan lines are displayed separately, into a progressive format, where an entire frame is drawn in a single sweep.

[004] Additional examples of video processor applications include signal decompression, in which a video signal is received in a compressed format (e.g., MPEG-2), decompressed, and formatted for display. Another example is re-interlacing scan conversion, which involves converting an incoming digital video signal from DVI (Digital Visual Interface) format to a composite video format compatible with the vast number of older television displays installed in the market.

[005] More sophisticated users require more sophisticated video processor functions, such as in-loop/out-of-loop deblocking filters, advanced motion-adaptive progressive scan conversion, input noise filtering for encoding operations, polyphase scaling/resampling, sub-picture compositing, and processor-amplifier operations such as color space conversion, adjustments, pixel point operations (e.g., sharpening, histogram adjustment), and various video surface format conversion support operations.

[006] The problem with providing such sophisticated video processor functionality is that a video processor with an architecture powerful enough to implement it can be too expensive to incorporate into many types of devices. The more sophisticated the video processing functions, the more expensive the integrated circuit device required to implement them becomes in terms of silicon die area, transistor count, memory speed requirements, and the like.

[007] Accordingly, prior art system designers have been forced to make trade-offs between video processor performance and cost. Prior art video processors that are widely considered to have an acceptable cost/performance ratio have often been barely sufficient in terms of latency constraints (e.g., to avoid stuttering video or otherwise stalling video processing applications) and compute density (e.g., the number of processor operations per square millimeter of die). Furthermore, prior art video processors are generally not suited to linear scaling performance requirements, such as when a video device is expected to handle multiple video streams (e.g., the simultaneous handling of multiple incoming streams and outgoing display streams).

[008] Thus, what is needed is a new video processor system that overcomes the limitations of the prior art. The new video processor system should be scalable and have a high compute density in order to handle the sophisticated video processor functions expected by increasingly sophisticated users.

Overview

[009] Embodiments of this description provide a new video processor system that supports sophisticated video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. The embodiments maintain high compute density and are readily scalable to handle multiple video streams.

[010] In one embodiment, a latency-tolerant system for executing video processing operations in a video processor is implemented. The system includes a host interface that implements communication between the video processor and a host CPU; a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations; and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included to enable the vector execution unit to operate on a demand-driven basis by accessing the memory command FIFO. A memory interface is included to implement communication between the video processor and a frame buffer memory. A DMA engine is built into the memory interface to implement DMA transfers between a plurality of different memory locations and to load the data store memory and the instruction cache with data and instructions for the vector execution unit.

[011] In one embodiment, the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO and operating on a demand-driven basis. The demand-driven basis can be configured to hide the latency of data transfers from different memory locations (e.g., frame buffer memory, system memory, cache memory) to the command FIFO of the vector execution unit. The command FIFO can be a pipelined FIFO to prevent stalls of the vector execution unit.
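The decoupling that paragraph [011] describes can be illustrated with a small tick-based simulation. This is only a sketch under stated assumptions: the class `CommandFIFO`, the function `run`, the FIFO depth, and the fixed per-command cost `vector_cycles` are all hypothetical names and values chosen for illustration, and the text does not specify how the hardware FIFO is actually organized.

```python
from collections import deque


class CommandFIFO:
    """Bounded queue of commands (hypothetical software model of the command FIFO)."""

    def __init__(self, depth):
        self.depth = depth
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def push(self, command):
        """Producer side: returns False when the FIFO is full."""
        if len(self._queue) >= self.depth:
            return False
        self._queue.append(command)
        return True

    def pop(self):
        """Consumer side: returns None when the FIFO is empty."""
        return self._queue.popleft() if self._queue else None


def run(commands, depth=4, vector_cycles=3):
    """Simulate a scalar producer and a demand-driven vector consumer.

    Assumptions: the scalar side tries to enqueue one command per tick, and
    the vector side needs `vector_cycles` ticks per command and pulls the
    next command only when it is idle.
    """
    fifo = CommandFIFO(depth)
    pending = deque(commands)
    executed = []
    current, remaining, ticks = None, 0, 0
    while pending or current is not None or len(fifo) > 0:
        ticks += 1
        # Scalar unit: queue work ahead of the vector unit when there is room.
        if pending and fifo.push(pending[0]):
            pending.popleft()
        # Vector unit: fetch a new command only when idle (demand driven).
        if current is None:
            current = fifo.pop()
            if current is not None:
                remaining = vector_cycles
        if current is not None:
            remaining -= 1
            if remaining == 0:
                executed.append(current)
                current = None
    return executed, ticks
```

In this model the vector side is never idle once the first command has been queued, because the scalar side runs ahead and keeps the FIFO non-empty; three commands of three ticks each finish in nine ticks.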

[012] In one embodiment, the present invention is implemented as a video processor for executing video processing operations. The video processor includes a host interface that implements communication between the video processor and a host CPU, and a memory interface that implements communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and the memory interface and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations. The video processor can be a stand-alone video processor integrated circuit or a component integrated into a GPU integrated circuit.

[013] In one embodiment, the scalar execution unit functions as the controller of the video processor and controls the operation of the vector execution unit. The scalar execution unit can be configured to execute the flow control algorithms of an application, and the vector execution unit can be configured to execute the pixel processing operations of the application. A vector interface unit can be included in the video processor to interface the scalar execution unit with the vector execution unit. In one embodiment, the scalar execution unit and the vector execution unit are configured to operate asynchronously. The scalar execution unit can execute at a first clock frequency and the vector execution unit can execute at a different clock frequency (e.g., faster or slower). The vector execution unit can operate on a demand-driven basis under the control of the scalar execution unit.

[014] In one embodiment, the present invention is implemented as a multidimensional datapath processing system for a video processor for executing video processing operations. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A data store memory is included for storing data for the vector execution unit. The data store memory includes a plurality of tiles comprising symmetrical bank data structures arranged in an array. The bank data structures are configured to support access to different tiles of each bank.

[015] Depending on the requirements of a particular configuration, each bank data structure can include a plurality of tiles (e.g., 4×4, 8×8, 8×16, 16×24). In one embodiment, the banks are configured to support access to different tiles of each bank. This allows a single access to retrieve a row or a column of tiles from two adjacent banks. In one embodiment, a crossbar is used to select the configuration (e.g., row, column, block) in which the tiles of the plurality of bank data structures are accessed. A collector can also be included to receive the tiles of the banks accessed by the crossbar and to provide the tiles to the front end of the vector datapath on a per-clock basis.
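The claim that one access can return a row or column of tiles spanning two adjacent banks can be made concrete with a small addressing sketch. The mapping below is an assumption made purely for illustration: the text does not give the addressing scheme, and the names `locate`, `row_access`, `col_access`, and `banks_touched`, as well as the 4×4 bank size, are hypothetical.

```python
BANK_W = 4  # tiles per bank in x (assumed 4x4 bank, one of the sizes mentioned)
BANK_H = 4  # tiles per bank in y


def locate(tile_x, tile_y):
    """Map global tile coordinates to ((bank_x, bank_y), (local_x, local_y))."""
    bank = (tile_x // BANK_W, tile_y // BANK_H)
    local = (tile_x % BANK_W, tile_y % BANK_H)
    return bank, local


def row_access(tile_x, tile_y, count=BANK_W):
    """Addresses for `count` horizontally consecutive tiles starting at (tile_x, tile_y)."""
    return [locate(tile_x + i, tile_y) for i in range(count)]


def col_access(tile_x, tile_y, count=BANK_H):
    """Addresses for `count` vertically consecutive tiles starting at (tile_x, tile_y)."""
    return [locate(tile_x, tile_y + i) for i in range(count)]


def banks_touched(access):
    """Sorted list of the distinct banks an access pattern reads from."""
    return sorted({bank for bank, _ in access})


# A row of four tiles that starts two tiles into bank (0, 0) spills into bank (1, 0).
row = row_access(2, 1)
# A column of four tiles that starts on the last tile row of bank (0, 0) spills into bank (0, 1).
col = col_access(1, 3)
```

Because a run of tiles no longer than one bank dimension can cross at most one bank boundary, each such access touches at most two adjacent banks, and within each bank it reads different local tiles. A crossbar in this picture would correspond to choosing between `row_access`, `col_access`, or a block pattern.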

[016] In one embodiment, the present invention is implemented as a stream-based memory access system for a video processor. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A frame buffer memory is included for storing data for the scalar execution unit and the vector execution unit. A memory interface is included for implementing communication between the scalar execution unit, the vector execution unit, and the frame buffer memory. The frame buffer memory comprises a plurality of tiles. The memory interface implements a first stream comprising a first sequential access of tiles for the scalar execution unit and a second stream comprising a second sequential access of tiles for the vector execution unit.

[017] In one embodiment, the first stream and the second stream comprise a series of sequentially prefetched tiles, prefetched so as to hide the access latency of the originating memory locations (e.g., frame buffer memory, system memory). In one embodiment, the memory interface is configured to manage a plurality of different streams, including streams from a plurality of different originating locations and streams to a plurality of different terminating locations. In one embodiment, a DMA engine built into the memory interface is used to implement a plurality of memory reads and a plurality of memory writes to support the multiple streams.
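How prefetching a run of sequential tiles hides memory latency can be shown with a simple timing model. This is a minimal sketch with assumptions that do not come from the text: the consumer needs one tile per cycle, every fetch takes a fixed `latency` in cycles, memory bandwidth is unlimited, and a new fetch is issued as soon as a buffer slot is freed. The function name `stall_cycles` is hypothetical.

```python
def stall_cycles(num_tiles, latency, prefetch_depth):
    """Cycles the consumer spends waiting for tiles of one sequential stream.

    Assumed model: up to `prefetch_depth` tiles (at least 1) may be in flight
    or buffered ahead of the consumer; the fetch for tile i is issued when
    tile i - prefetch_depth is consumed, or at cycle 0 for the first
    `prefetch_depth` tiles.
    """
    consumed_at = []  # cycle at which each tile is consumed
    stalls = 0
    ready = 0  # earliest cycle at which the consumer could take the next tile
    for i in range(num_tiles):
        issue = 0 if i < prefetch_depth else consumed_at[i - prefetch_depth]
        arrive = issue + latency
        taken = max(arrive, ready)
        stalls += taken - ready  # waiting time attributable to memory latency
        consumed_at.append(taken)
        ready = taken + 1
    return stalls


# One tile in flight at a time: the consumer waits on nearly every tile.
shallow = stall_cycles(num_tiles=3, latency=4, prefetch_depth=1)
# Prefetch depth matching the latency: only the initial warm-up wait remains.
deep = stall_cycles(num_tiles=8, latency=4, prefetch_depth=4)
```

In this model a prefetch depth at least equal to the latency leaves only the warm-up wait, which is consistent with the idea that a stream facing higher latency benefits from more buffers, and that the number of prefetched tiles is a tunable parameter.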

[018] In general, this document discloses at least the following four methods.
A) The methods broadly taught in this description include a method for a multidimensional datapath processing system in a video processor that executes video processing operations. The method includes executing scalar video processing operations by using a scalar execution unit; executing vector video processing operations by using a vector execution unit; and storing data for the vector execution unit by using a data store memory, wherein the data store memory includes a plurality of tiles comprising symmetrical bank data structures arranged in an array, and the bank data structures are configured to support access to different tiles of each bank. In method A, each bank data structure can include a plurality of tiles arranged in a 4×4 pattern, or in an 8×8, 8×16, or 16×24 pattern. In method A, the bank data structures are configured to support access to different tiles of each bank data structure, wherein at least one access is to two adjacent bank data structures and comprises a row of tiles of the two bank data structures. Likewise, at least one access can be to two adjacent bank data structures and comprise a column of tiles of the two adjacent bank data structures. Method A further includes selecting a configuration for accessing the tiles of the plurality of bank data structures by using a crossbar coupled to the data store. In this selecting step, the crossbar accesses the tiles of the plurality of bank data structures to supply data to the vector datapath on a per-clock basis. Method A also includes receiving, by using a collector, the tiles of the plurality of bank data structures accessed by the crossbar, and providing the tiles to the front end of the vector datapath on a per-clock basis.

B) The methods broadly taught in this description also include a method for executing video processing operations, implemented using a video processor of a computer system executing computer-readable code. The method includes establishing communication between the video processor and a host CPU by using a host interface; establishing communication between the video processor and a frame buffer memory by using a memory interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface and the memory interface; and executing vector video processing operations by using a vector execution unit coupled to the host interface and the memory interface. In method B, the scalar execution unit functions as the controller of the video processor and controls the operation of the vector execution unit. Method B also includes a vector interface unit for interfacing the scalar execution unit with the vector execution unit. In method B, the scalar execution unit and the vector execution unit are configured to operate asynchronously; the scalar execution unit executes at a first clock frequency and the vector execution unit executes at a second clock frequency. The scalar execution unit is configured to execute the flow control algorithms of an application, and the vector execution unit is configured to execute the pixel processing operations of the application. Further, the vector execution unit is configured to operate on a demand-driven basis under the control of the scalar execution unit. In addition, the scalar execution unit is configured to send function calls to the vector execution unit using a memory command FIFO, and the vector execution unit operates on a demand-driven basis by accessing the memory command FIFO. The asynchronous operation of the video processor is configured to support separate, independent updates of the vector subroutines or the scalar subroutines of the application. Finally, in method B, the scalar execution unit is configured to operate using VLIW (very long instruction word) code.

C) The methods described herein also broadly teach a method for stream-based memory access in a video processor that executes video processing operations. The method includes executing scalar video processing operations by using a scalar execution unit; executing vector video processing operations by using a vector execution unit; storing data for the scalar execution unit and the vector execution unit by using a frame buffer memory; and implementing communication between the scalar execution unit, the vector execution unit, and the frame buffer memory by using a memory interface, wherein the frame buffer memory comprises a plurality of tiles, and the memory interface implements, for the vector execution unit or the scalar execution unit, a first stream comprising a first sequential access of tiles and a second stream comprising a second sequential access of tiles. In method C, the first stream and the second stream include at least one prefetched tile. The first stream originates from a first location in the frame buffer memory, and the second stream originates from a second location in the frame buffer memory. The memory interface is configured to manage a plurality of streams, including streams from a plurality of different originating locations and streams to a plurality of different terminating locations. In this regard, at least one of the originating locations or at least one of the terminating locations is in system memory. Method C also includes implementing a plurality of memory reads to support the first stream and the second stream, and implementing the plurality of memory reads by using a DMA engine built into the memory interface. Further, in method C, the first stream experiences a higher latency than the second stream, and the first stream incorporates a larger number of buffers for storing tiles than the second stream. The memory interface is also configured to prefetch an adjustable number of tiles of the first stream or the second stream to compensate for the latency of that stream.

D) The methods described herein also broadly include a method for latency-tolerant video processing operations. The method includes implementing communication between a video processor and a host CPU by using a host interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface; executing vector video processing operations by using a vector execution unit coupled to the host interface; enabling the vector execution unit to operate on a demand-driven basis by accessing a memory command FIFO; implementing communication between the video processor and a frame buffer memory by using a memory interface; and implementing DMA transfers between a plurality of different memory locations by using a DMA engine that is built into the memory interface and configured to load a data store memory and an instruction cache with data and instructions for the vector execution unit. In method D, the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO and operating on a demand-driven basis. The demand-driven basis is configured to hide the latency of data transfers from different memory locations to the command FIFO of the vector execution unit. Further, in method D, the scalar execution unit is configured to implement the flow control processing of an algorithm, and the vector execution unit is configured to implement the majority of the video processing workload. Here, the scalar execution unit is configured to pre-compute the work parameters for the vector execution unit in order to hide data transfer latency. In method D, the vector execution unit is configured to schedule memory reads via the DMA engine to prefetch commands for the subsequent execution of a vector subroutine. The memory reads are scheduled to prefetch the commands for the execution of the vector subroutine before the scalar execution unit calls the vector subroutine.

[019] The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.

Detailed Description of the Invention

[026] Reference will now be made in detail to the preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications, and equivalents, which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be recognized by one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the present invention.

<Notation and terminology>
[027] Some portions of the detailed description that follows are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, or the like is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

[028] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities. Unless specifically stated otherwise, as is apparent from the following discussion, it is appreciated that throughout the present invention, discussions using terms such as “processing”, “accessing”, “executing”, “storing”, “rendering”, or the like refer to the actions and processes of a computer system (e.g., computer system 100 of FIG. 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.

<Computer system platform>
[029] FIG. 1 shows a computer system 100 in accordance with one embodiment of the present invention. Computer system 100 depicts the components of a basic computer system in accordance with embodiments of the present invention, providing the execution platform for certain hardware-based and software-based functionality. In general, computer system 100 comprises at least one CPU 101, a system memory 115, at least one graphics processing unit (GPU) 110, and a video processing unit (VPU) 111. The CPU 101 can be coupled to the system memory 115 via the bridge component 105, or can be directly coupled to the system memory 115 via a memory controller (not shown) internal to the CPU 101. The bridge component 105 (e.g., Northbridge) can support expansion buses that connect various I/O devices (e.g., one or more hard disk drives, Ethernet adapters, CD-ROM, DVD, etc.). The GPU 110 and the video processing unit 111 are coupled to a display 112. One or more additional GPUs can optionally be coupled to system 100 to further increase its computational power. The GPU 110 and the video processing unit 111 are coupled to the CPU 101 and the system memory 115 via the bridge component 105. System 100 can be implemented as, for example, a desktop computer system or server computer system having a powerful general-purpose CPU 101 coupled to a dedicated graphics rendering GPU 110. In such an embodiment, components can be included that add peripheral buses, specialized graphics memory and system memory, I/O devices, and the like. Similarly, system 100 can be implemented as a handheld device (e.g., a cell phone, etc.) or as a set-top video game console device such as, for example, the Xbox®, available from Microsoft Corporation of Redmond, Washington, or the PlayStation3®, available from Sony Computer Entertainment Corporation of Tokyo, Japan.

[030] It should be appreciated that the GPU 110 can be implemented as a discrete component, as a discrete graphics card designed to couple to the computer system 100 via a connector (e.g., AGP slot, PCI-Express slot, etc.), as a discrete integrated circuit die (e.g., mounted directly on the motherboard), or as an integrated GPU included within the integrated circuit die of a computer system chipset component (e.g., integrated within the bridge chip 105). Additionally, a local graphics memory can be included for the GPU 110 for high-bandwidth graphics data storage. It should further be appreciated that the GPU 110 and the video processing unit 111 can be integrated onto the same integrated circuit die (e.g., as component 120) or can be separate discrete integrated circuit components otherwise connected to, or mounted on, the motherboard of computer system 100.

Embodiment of the present invention

[031] FIG. 2 shows a diagram depicting the internal components of the video processing unit 111 in accordance with one embodiment of the present invention. As illustrated in FIG. 2, the video processing unit 111 includes a scalar execution unit 201, a vector execution unit 202, a memory interface 203, and a host interface 204.

[032] In the FIG. 2 embodiment, the video processing unit (hereafter simply video processor) 111 includes functional components for executing video processing operations. The video processor 111 uses the host interface 204 to establish communication between the video processor 111 and the host CPU 101 via the bridge 105. The video processor 111 uses the memory interface 203 to establish communication between the video processor 111 and a frame buffer memory 205 (e.g., for the coupled display 112 (not shown)). The scalar execution unit 201 is coupled to the host interface 204 and the memory interface 203 and is configured to execute scalar video processing operations. The vector execution unit 202 is coupled to the host interface 204 and the memory interface 203 and is configured to execute vector video processing operations.

[033] The FIG. 2 embodiment illustrates the manner in which the video processor 111 partitions its execution functionality into scalar operations and vector operations. The scalar operations are implemented by the scalar execution unit 201. The vector operations are implemented by the vector execution unit 202.

[034] In one embodiment, the vector execution unit 202 is configured to function as a slave coprocessor to the scalar execution unit 201. In such an embodiment, the scalar execution unit 201 manages the workload of the vector execution unit 202 by feeding control streams to the vector execution unit 202 and by managing the data input/output for the vector execution unit 202. The control streams typically comprise functional parameters, subroutine arguments, and the like. In a typical video processing application, the control flow of the application's processing algorithm is executed on the scalar execution unit 201, whereas the actual pixel/data processing operations are implemented on the vector execution unit 202.

[035] Referring still to FIG. 2, the scalar execution unit 201 can be implemented as a RISC-style scalar execution unit incorporating RISC-based execution technologies. The vector execution unit 202 can be implemented as, for example, a SIMD machine having one or more SIMD pipelines. In an embodiment with two SIMD pipelines, for example, each SIMD pipeline can be implemented with a datapath 16 pixels wide (or wider), giving the vector execution unit 202 the raw computing power to produce up to 32 pixels of output data per clock. In one embodiment, the scalar execution unit 201 includes hardware configured to operate using VLIW (very long instruction word) software code to optimize the parallel execution of scalar operations on a per-clock basis.

[036] In the FIG. 2 embodiment, the scalar execution unit 201 includes an instruction cache 211 and a data cache 212 coupled to a scalar processor 210. The caches 211-212 interface with the memory interface 203 for access to external memory, such as, for example, the frame buffer 205. The scalar execution unit 201 further includes a vector interface unit 213 to establish communication with the vector execution unit 202. In one embodiment, the vector interface unit 213 can include one or more synchronous mailboxes 214 configured to enable asynchronous communication between the scalar execution unit 201 and the vector execution unit 202.

[037] In the FIG. 2 embodiment, the vector execution unit 202 includes a vector control unit 220 configured to control the operation of a vector execution datapath, vector datapath 221. The vector control unit 220 includes a command FIFO 225 to receive instructions and data from the scalar execution unit 201. An instruction cache 222 is coupled to provide instructions to the vector control unit 220. A data store memory 223 is coupled to provide input data to the vector datapath 221 and to receive resulting data from the vector datapath 221. The data store 223 functions as an instruction cache and a data RAM for the vector datapath 221. The instruction cache 222 and the data store 223 are coupled to the memory interface 203 for accessing external memory, such as the frame buffer 205. The FIG. 2 embodiment also shows a second vector datapath 231 and a respective second data store 233 (e.g., dotted outlines). It should be understood that the second vector datapath 231 and the second data store 233 are shown to illustrate the case where the vector execution unit 202 has two vector execution pipelines (e.g., a dual SIMD pipeline configuration). Embodiments of the present invention are suited to vector execution units having a larger number of vector execution pipelines (e.g., four, eight, sixteen, etc.).

[038] The scalar execution unit 201 provides the data and command inputs for the vector execution unit 202. In one embodiment, the scalar execution unit 201 sends function calls to the vector execution unit 202 using a memory-mapped command FIFO 225. Commands for the vector execution unit 202 are queued in this command FIFO 225.

[039] The use of the command FIFO 225 effectively decouples the scalar execution unit 201 from the vector execution unit 202. The scalar execution unit 201 can run on its own clock, at a clock frequency that differs from, and is controlled independently of, the clock frequency of the vector execution unit 202.

[040] The command FIFO 225 enables the vector execution unit 202 to operate as a demand-driven unit. For example, work can be handed off from the scalar execution unit 201 to the command FIFO 225 and then accessed by the vector execution unit 202 for processing in a decoupled, asynchronous manner. The vector execution unit 202 thus processes its workload as needed, or as demanded, by the scalar execution unit 201. This allows the vector execution unit 202 to conserve power when maximum performance is not required (e.g., by reducing and/or stopping one or more internal clocks).

[041] The partitioning of video processing functions into a scalar portion (e.g., for execution by the scalar execution unit 201) and a vector portion (e.g., for execution by the vector execution unit 202) allows video processing programs built for the video processor 111 to be compiled into separate scalar software code and vector software code. The scalar software code and the vector software code can be compiled separately and subsequently linked together to form a coherent application.

[042] The partitioning allows vector software code functions to be written separately and distinctly from the scalar software code functions. For example, the vector functions can be written separately (e.g., at a different time, by a different team of engineers, etc.) and provided as one or more subroutines or library functions for use by, or with, the scalar functions (e.g., scalar threads, processes, etc.). This allows the scalar software code and/or the vector software code to be updated separately and independently. For example, a vector subroutine can be updated independently of a scalar subroutine (e.g., through an update of a previously distributed program, or through a new feature added to extend the functionality of the distributed program), and vice versa. The partitioning is facilitated by the separate respective caches of the scalar processor 210 (e.g., caches 211-212) and of the vector control unit 220 and vector datapath 221 (e.g., caches 222-223). As described above, the scalar execution unit 201 and the vector execution unit 202 communicate via the command FIFO 225.

[043] FIG. 3 shows a diagram of an exemplary software program 300 for the video processor 111 in accordance with one embodiment of the present invention. As depicted in FIG. 3, the software program 300 illustrates attributes of the programming model for the video processor 111, whereby a scalar control thread 301 is executed by the video processor 111 in conjunction with a vector data thread 302.

[044] The software program 300 example of the FIG. 3 embodiment illustrates a programming model for the video processor 111 in which a scalar control program (e.g., scalar control thread 301) on the scalar execution unit 201 executes subroutine calls (e.g., vector data thread 302) on the vector execution unit 202. The software program 300 example shows a case where a compiler or software programmer has decomposed a video processing application into a scalar portion (e.g., a first thread) and a vector portion (e.g., a second thread).

[045] As shown in FIG. 3, the scalar control thread 301 running on the scalar execution unit 201 computes work parameters ahead of time and feeds these parameters to the vector execution unit 202, which performs the majority of the processing work. As described above, the software code for the two threads 301 and 302 can be written and compiled separately.

[046] The scalar thread is responsible for the following:
1. Interfacing with the host unit 204 and implementing a class interface
2. Initialization, setup, and configuration of the vector execution unit 202
3. Execution of the algorithm in work units, chunks, or working sets in a loop, such that each iteration:
a. computes the parameters for the current working set
b. initiates the transfer of the input data into the vector execution unit
c. initiates the transfer of the output data from the vector execution unit
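The duties listed above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the implementation of the embodiments: `VectorUnitStub`, `compute_params`, `scalar_thread`, and the 16-pixel tile width are hypothetical names and values introduced here, and the stub merely records the requests it receives.

```python
from dataclasses import dataclass, field


@dataclass
class VectorUnitStub:
    """Hypothetical stand-in for vector execution unit 202; it only logs requests."""
    log: list = field(default_factory=list)

    def configure(self):
        self.log.append("configure")

    def start_input_transfer(self, params):
        self.log.append(("in", params["index"]))

    def start_output_transfer(self, params):
        self.log.append(("out", params["index"]))


def compute_params(index, tile_width=16):
    """Step a: derive the parameters of the current working set (illustrative)."""
    return {"index": index, "x_offset": index * tile_width}


def scalar_thread(vector_unit, num_work_sets):
    vector_unit.configure()                        # item 2: init, setup, configuration
    for index in range(num_work_sets):             # item 3: loop over working sets
        params = compute_params(index)             # a. parameters of the current set
        vector_unit.start_input_transfer(params)   # b. initiate the input transfer
        vector_unit.start_output_transfer(params)  # c. initiate the output transfer
    return vector_unit.log


log = scalar_thread(VectorUnitStub(), num_work_sets=2)
```

The scalar side only initiates transfers and never waits on their results, which keeps the control flow separate from the pixel processing.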

[047] The typical execution model of the scalar thread is “fire-and-forget”. The term fire-and-forget refers to the attribute whereby, in the typical model for video baseband processing, commands and data are sent from the scalar execution unit 201 to the vector execution unit 202 (e.g., via the command FIFO 225) and no data is returned from the vector execution unit 202 until the algorithm completes.

[048] In the program 300 example of FIG. 3, the scalar execution unit 201 keeps scheduling work for the vector execution unit 202 until there is no more space in the command FIFO 225 (e.g., !end_of_alg & !cmd_fifo_full). The work scheduled by the scalar execution unit 201 computes parameters, sends these parameters to the vector subroutine, and subsequently calls the vector subroutine to perform the work. The execution of the subroutine (e.g., vector_funcB) by the vector execution unit 202 is deferred in time, mainly to hide the latency of main memory (e.g., system memory 115). Thus, the architecture of the video processor 111 provides a latency compensation mechanism on the vector execution unit 202 side for both instruction and data traffic. These latency compensation mechanisms are described in greater detail below.
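The scheduling condition `!end_of_alg & !cmd_fifo_full` can be illustrated with a small Python model. It is a sketch under assumptions: the FIFO depth of four entries, the two entries per call, and the `schedule` helper are hypothetical, since the text gives no such figures.

```python
from collections import deque

CMD_FIFO_DEPTH = 4     # assumed capacity in entries; the text gives no figure
ENTRIES_PER_CALL = 2   # assumed: one parameter entry plus one call entry


def schedule(work_items, cmd_fifo, depth=CMD_FIFO_DEPTH):
    """Queue work until the algorithm ends or the command FIFO is full.

    Returns the number of calls queued. Nothing is read back from the
    vector side, which is the fire-and-forget property.
    """
    queued = 0
    # Mirrors the loop condition !end_of_alg & !cmd_fifo_full.
    while work_items and len(cmd_fifo) + ENTRIES_PER_CALL <= depth:
        item = work_items.popleft()
        cmd_fifo.append(("params", {"work_set": item}))  # send the parameters
        cmd_fifo.append(("call", "vector_funcB"))        # call the vector subroutine
        queued += 1
    return queued


work = deque(range(5))
fifo = deque()
first = schedule(work, fifo)    # queues until the FIFO is full
fifo.popleft()                  # the vector side drains one parameter entry
fifo.popleft()                  # and the matching call entry
second = schedule(work, fifo)   # room for exactly one more call
```

Because the vector side drains the FIFO later than the scalar side fills it, the depth of the FIFO bounds how far ahead the scalar thread can run.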

[049] It should be noted that the software program 300 example would be more complex in cases where there are two or more vector execution pipelines (e.g., vector datapath 221 and second vector datapath 231 of FIG. 2). Similarly, the software program 300 example would be more complex where the program 300 is written for a computer system having two vector execution pipelines, yet retains the ability to execute on a system having a single vector execution pipeline.

[050] Thus, as described above in the discussion of FIG. 2 and FIG. 3, the scalar execution unit 201 is responsible for initiating computation on the vector execution unit 202. In one embodiment, the commands passed from the scalar execution unit 201 to the vector execution unit 202 are of the following main types:
1. Read commands (e.g., memRd) initiated by the scalar execution unit 201 to transfer current working set data from memory to the data RAMs of the vector execution unit 202
2. Parameter passing from the scalar execution unit 201 to the vector execution unit 202
3. Execute commands in the form of the PC (e.g., program counter) of the vector subroutine to be executed
4. Write commands (e.g., memWr) initiated by the scalar execution unit 201 to copy the results of the vector computation into memory
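The four command types above could be modelled as plain data, for example as in the following Python sketch. The field names (`addr`, `size`, `pc`), the `Command` record, and the `commands_for_work_set` helper are assumptions made for illustration; the text names only the command types and the `memRd`/`memWr` mnemonics.

```python
from dataclasses import dataclass
from enum import Enum


class CmdType(Enum):
    MEM_RD = "memRd"     # 1. memory -> data RAM of the vector execution unit
    PARAMS = "params"    # 2. parameter passing
    EXECUTE = "execute"  # 3. PC of the vector subroutine to run
    MEM_WR = "memWr"     # 4. vector results -> memory


@dataclass(frozen=True)
class Command:
    kind: CmdType
    payload: dict        # hypothetical layout; not specified in the text


def commands_for_work_set(src_addr, dst_addr, size, subroutine_pc, params):
    """Build the command sequence for one working set, in queue order."""
    return [
        Command(CmdType.MEM_RD, {"addr": src_addr, "size": size}),
        Command(CmdType.PARAMS, dict(params)),
        Command(CmdType.EXECUTE, {"pc": subroutine_pc}),
        Command(CmdType.MEM_WR, {"addr": dst_addr, "size": size}),
    ]


cmds = commands_for_work_set(0x1000, 0x8000, 256, 0x40, {"alpha": 128})
```

Keeping the read, execute, and write steps as separate queue entries is what lets the receiving side act on a read before the matching execute command is reached.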

[051] In one embodiment, upon receiving these commands, the vector execution unit 202 immediately schedules the memRd commands to the memory interface 203 (e.g., to read the requested data from the frame buffer 205). The vector execution unit 202 also examines the execute commands and prefetches the vector subroutine to be executed (if it is not present in the cache 222).
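A minimal Python sketch of this look-ahead behaviour follows. It assumes commands arrive as `(kind, argument)` tuples and that the instruction cache can be modelled as a set of resident subroutine PCs; `on_commands_received` and both callbacks are hypothetical names.

```python
def on_commands_received(commands, icache, issue_read, prefetch):
    """Scan newly received commands ahead of their execution.

    commands: list of (kind, arg) tuples, e.g. ("memRd", address)
    icache:   set of subroutine PCs already resident (stands in for cache 222)
    """
    for kind, arg in commands:
        if kind == "memRd":
            issue_read(arg)        # schedule the read immediately
        elif kind == "execute" and arg not in icache:
            prefetch(arg)          # fetch the subroutine before it is needed
            icache.add(arg)


reads, fetched = [], []
icache = {0x40}                    # subroutine at PC 0x40 is already cached
on_commands_received(
    [("memRd", 0x1000), ("execute", 0x40), ("memRd", 0x2000), ("execute", 0x80)],
    icache,
    issue_read=reads.append,
    prefetch=fetched.append,
)
```

Only the subroutine that is missing from the cache triggers a prefetch, while both reads are issued as soon as they are seen.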

[052] The objective of the vector execution unit 202 in this situation is to schedule ahead the instruction and data streams of the next few executes while the vector execution unit 202 is working on the current execute. This schedule-ahead capability effectively hides the latency involved in fetching instructions/data from their memory locations. In order to make these read requests ahead of time, the vector execution unit 202, the data store (e.g., data store 223), and the instruction cache (e.g., cache 222) are implemented using high-speed optimized hardware.

[053] As described above, the data store (e.g., data store 223) functions as the working RAM of the vector execution unit 202. The scalar execution unit 201 perceives and interacts with the data store as if it were a collection of FIFOs. The FIFOs comprise the “streams” on which the video processor 111 operates. In one embodiment, streams are generally input/output FIFOs into which the scalar execution unit 201 initiates transfers (e.g., to the vector execution unit 202). As described above, the operation of the scalar execution unit 201 and the vector execution unit 202 is decoupled.

[054] Once the input/output streams are full, a DMA engine within the vector control unit 220 stops processing the command FIFO 225. This soon causes the command FIFO 225 to fill up. The scalar execution unit 201 stops issuing additional work to the vector execution unit 202 when the command FIFO 225 is full.
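The chain of stalls described here (full stream, then idle DMA engine, then full command FIFO, then stalled scalar unit) can be modelled in a few lines of Python. The class, its method names, and the depths of two are assumptions chosen to keep the example small.

```python
from collections import deque


class BackpressureModel:
    """Toy model: one stream of the data store plus the command FIFO."""

    def __init__(self, stream_depth, cmd_depth):
        self.stream = deque()      # tiles waiting in one stream
        self.cmd_fifo = deque()    # stands in for command FIFO 225
        self.stream_depth = stream_depth
        self.cmd_depth = cmd_depth

    def scalar_issue(self, cmd):
        """Scalar side: returns False (stall) when the command FIFO is full."""
        if len(self.cmd_fifo) >= self.cmd_depth:
            return False
        self.cmd_fifo.append(cmd)
        return True

    def dma_step(self):
        """DMA side: stops draining the command FIFO while the stream is full."""
        if not self.cmd_fifo or len(self.stream) >= self.stream_depth:
            return False
        self.stream.append(self.cmd_fifo.popleft())
        return True

    def vector_consume(self):
        """Vector side: consuming a tile frees one stream slot."""
        return self.stream.popleft() if self.stream else None


m = BackpressureModel(stream_depth=2, cmd_depth=2)
issued = [m.scalar_issue(t) for t in ("t0", "t1")]
moved = [m.dma_step(), m.dma_step()]          # stream now holds t0, t1
issued += [m.scalar_issue(t) for t in ("t2", "t3")]
stalled_dma = m.dma_step()                    # stream full, DMA cannot proceed
stalled_scalar = m.scalar_issue("t4")         # command FIFO full, scalar stalls
consumed = m.vector_consume()                 # vector side frees a slot
resumed = m.dma_step() and m.scalar_issue("t4")
```

No explicit handshake between the scalar and vector units is needed in this model: the two queue depths alone throttle the producer.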

[055] In one embodiment, the vector execution unit 202 may need intermediate streams in addition to the input and output streams. The entire data store 223 can therefore be viewed as a collection of streams with respect to its interaction with the scalar execution unit 201.

[056] FIG. 4 shows an example of sub-picture blending with video using a video processor in accordance with one embodiment of the present invention. FIG. 4 shows an exemplary case where a video surface is blended with a sub-picture and then converted to an ARGB surface. The data comprising the surfaces resides in the frame buffer memory 205 as luma parameters 412 and chroma parameters 413. The sub-picture pixel elements 414 also reside in the frame buffer memory 205, as shown. The vector subroutine instructions and parameters 411 are instantiated in the memory 205, as shown.

[057] In one embodiment, each stream comprises a FIFO of working 2D chunks of data called “tiles”. In such an embodiment, the vector execution unit 202 maintains a read tile pointer and a write tile pointer for each stream. For example, for an input stream, when a vector subroutine is executed, the vector subroutine can consume, or read from, the current (read) tile. In the background, data is transferred into the current (write) tile by memRd commands. The vector execution unit can also produce output tiles for output streams. These tiles are then moved to memory by the memWr() commands that follow the execute commands. This effectively prefetches tiles so that they are ready to be operated on, thereby hiding the latency.
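A stream of tiles with separate read and write tile pointers behaves like a ring buffer. The Python sketch below illustrates that idea only; the `TileStream` class and its methods are hypothetical, and a tile is represented by an arbitrary Python object rather than a 2D block of pixels.

```python
class TileStream:
    """Ring of tile slots with separate read and write tile pointers."""

    def __init__(self, num_tiles):
        self.slots = [None] * num_tiles
        self.read_ptr = 0     # tile the vector subroutine consumes next
        self.write_ptr = 0    # tile the next memRd transfer fills
        self.count = 0        # tiles filled but not yet consumed

    def can_write(self):
        return self.count < len(self.slots)

    def write_tile(self, tile):
        """Background transfer (memRd) into the current write tile."""
        if not self.can_write():
            raise BufferError("stream full")
        self.slots[self.write_ptr] = tile
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1

    def read_tile(self):
        """Vector subroutine consumes the current read tile."""
        if self.count == 0:
            raise BufferError("stream empty")
        tile = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return tile


s = TileStream(3)
s.write_tile("n")
s.write_tile("n+1")
first = s.read_tile()     # consume tile n while later tiles are still arriving
s.write_tile("n+2")
s.write_tile("n+3")       # reuses the slot freed by reading tile n
full = not s.can_write()
second = s.read_tile()
```

Because the write pointer runs ahead of the read pointer, a tile is normally already resident by the time the subroutine asks for it.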

[058] In the sub-picture blending example of FIG. 4, the vector datapath 221 is configured by the instantiated instance of the vector subroutine instructions and parameters 411 (e.g., &v_subp_blend). This is shown by the line 421. The scalar execution unit 201 reads in chunks (e.g., tiles) of the surfaces and loads them into the data store 223 using the DMA engine 401 (e.g., within the memory interface 203). The load operation is shown by line 422, line 423, and line 424.

[059] Referring still to FIG. 4, since there are multiple input surfaces, multiple input streams need to be maintained. Each stream has a corresponding FIFO. Each stream can have a different number of tiles. The FIG. 4 example shows a case where the sub-picture surface is in system memory 115 (e.g., sub-picture pixel elements 414) and hence has additional buffering (e.g., n, n+1, n+2, n+3, etc.), whereas the video stream (e.g., luma 412, chroma 413, etc.) can have a smaller number of tiles. The number of buffers/FIFOs used can be adjusted in accordance with the degree of latency experienced by the stream.
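One simple way to relate buffer depth to latency is sketched below in Python. The sizing rule (one tile being processed plus enough tiles in flight to cover the fetch latency) and all cycle counts are assumptions for illustration; the text states only that the number of buffers can be adjusted to the latency experienced by the stream.

```python
import math


def tiles_needed(fetch_latency_cycles, cycles_per_tile):
    """Tiles to keep per stream so that a fetch completes before its tile is needed.

    One tile is being processed while ceil(latency / cycles_per_tile)
    further tiles are still being fetched.
    """
    return 1 + math.ceil(fetch_latency_cycles / cycles_per_tile)


# Hypothetical figures: a nearby frame buffer versus more distant system memory.
video_tiles = tiles_needed(fetch_latency_cycles=100, cycles_per_tile=100)
subpicture_tiles = tiles_needed(fetch_latency_cycles=300, cycles_per_tile=100)
```

Under these made-up numbers the higher-latency stream needs four tiles, comparable to the n through n+3 buffering mentioned for the sub-picture surface, while the lower-latency stream needs two.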

[060] As described above, the data store 223 uses a look-ahead prefetch method to hide latency. Because of this, a stream can have its data in two or more tiles as the data is prefetched for the appropriate vector datapath execution hardware (e.g., depicted as FIFO n, n+1, n+2, etc.).

[061] Once the data store is loaded, the FIFOs are accessed by the vector datapath hardware 221 and operated upon by the vector subroutine (e.g., subroutine 430). The results of the vector datapath operation comprise an output stream 403. This output stream is copied by the scalar execution unit 201 via the DMA engine 401 back into the frame buffer memory 205 (e.g., ARGB_OUT 415). This is shown by the line 425.
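For orientation, the per-pixel work of such a blend-and-convert step might look like the Python sketch below. It is not the `v_subp_blend` subroutine of the embodiment: the integer blend formula, the full-range BT.601 (JFIF) colour matrix, the opaque output alpha, and every function name are assumptions, and one tile is modelled as a flat list of pixels.

```python
def blend(sub, vid, alpha):
    """Integer alpha blend of one 8-bit component, alpha in 0..255 (assumed formula)."""
    return (alpha * sub + (255 - alpha) * vid + 127) // 255


def clamp8(x):
    return max(0, min(255, int(round(x))))


def yuv_to_argb(y, cb, cr):
    """Full-range BT.601 (JFIF) conversion; the choice of matrix is an assumption."""
    r = clamp8(y + 1.402 * (cr - 128))
    g = clamp8(y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128))
    b = clamp8(y + 1.772 * (cb - 128))
    return (0xFF << 24) | (r << 16) | (g << 8) | b   # opaque output pixel


def blend_tile(video, subpic):
    """video: list of (y, cb, cr); subpic: list of (y, cb, cr, alpha)."""
    out = []
    for (vy, vcb, vcr), (sy, scb, scr, a) in zip(video, subpic):
        out.append(yuv_to_argb(blend(sy, vy, a), blend(scb, vcb, a), blend(scr, vcr, a)))
    return out


video = [(128, 128, 128), (50, 128, 128)]
subpic = [(200, 128, 128, 0), (200, 128, 128, 255)]  # transparent, then opaque
argb = blend_tile(video, subpic)
```

A transparent sub-picture pixel leaves the video pixel unchanged, and an opaque one replaces it, before either is packed into a 32-bit ARGB word.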

[062] Thus, embodiments of the present invention use an important aspect of stream processing: data storage and memory are abstracted as a plurality of memory tiles. A stream can therefore be viewed as a sequentially accessed collection of tiles. Streams are used to prefetch data, and this data takes the form of tiles. The tiles are prefetched to hide latency from the particular memory source the data originates from (e.g., system memory, frame buffer memory, or the like). Similarly, streams can be directed to different destinations (e.g., caches for the vector execution unit, caches for the scalar execution unit, frame buffer memory, system memory, etc.). Another characteristic of streams is that they generally access tiles in a look-ahead prefetching mode. As described above, the higher the latency, the deeper the prefetching and the more buffering used per stream (e.g., as depicted in FIG. 4).
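The relationship between latency and buffering depth can be illustrated with a minimal sketch. The latency figures, the `CYCLES_PER_TILE` value, and the `prefetch_depth` helper below are assumptions made for illustration only; the text states just that higher latency calls for deeper prefetching, not how the depth is computed.

```python
import math

# Hypothetical number of cycles the vector datapath spends on one tile.
CYCLES_PER_TILE = 256

def prefetch_depth(latency_cycles, cycles_per_tile=CYCLES_PER_TILE, min_depth=2):
    """Tiles to buffer so a fetch issued now arrives before the data is needed.

    One tile is being consumed while ceil(latency / cycles_per_tile) further
    tiles are in flight; min_depth keeps at least double buffering.
    """
    tiles_in_flight = math.ceil(latency_cycles / cycles_per_tile)
    return max(min_depth, tiles_in_flight + 1)

# Hypothetical fetch latencies: frame buffer memory is assumed closer than
# system memory, so the sub-picture stream needs more buffering.
stream_latency = {"luma": 200, "chroma": 200, "subpicture": 800}

depths = {name: prefetch_depth(lat) for name, lat in stream_latency.items()}
print(depths)
```

Under these assumed numbers the sub-picture stream ends up with more tiles than the luma and chroma streams, which matches the qualitative picture of FIG. 4.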

[063] FIG. 5 is a diagram depicting the internal components of a vector execution unit in accordance with one embodiment of the present invention. FIG. 5 shows the arrangement of the various functional units and register/SRAM resources of the vector execution unit 202 from a programming point of view.

[064] In the FIG. 5 embodiment, the vector execution unit 202 comprises a VLIW digital signal processor optimized for video baseband processing and for the execution of various codecs (compression-decompression algorithms). Accordingly, the vector execution unit 202 has a number of attributes directed towards increasing the efficiency of video processing/codec execution.

[065] In the FIG. 5 embodiment, the attributes include the following:
1. Scalable performance, by providing the option of incorporating multiple vector execution pipelines
2. Allocation of two data address generators (DAGs) per pipe
3. Memory/register operands
4. 2D (x, y) pointers/iterators
5. Deep pipeline (e.g., 11-12 stages)
6. Scalar (integer)/branch units
7. Variable instruction widths (long/short instructions)
8. Data aligners for operand extraction
9. 2D datapath (4×4) shape of typical operands and results
10. Vector execution unit as a slave to the scalar execution unit, executing remote procedure calls

[066] In general, from the programmer's point of view, the vector execution unit 202 is seen as a SIMD datapath with two DAGs 503. Instructions are issued in VLIW fashion (e.g., instructions are issued to the vector datapath 504 and the address generators 503 simultaneously), decoded by the instruction decoder 501, and dispatched to the appropriate execution unit. Instructions are variable length, with the most commonly used instructions encoded in short form. The full instruction set is available in long form, as VLIW-type instructions.

[067] Legend 502 shows three clock cycles with three such VLIW instructions. According to legend 510, the topmost of the VLIW instructions 502 comprises two address instructions (e.g., for the two DAGs 503) and one instruction for the vector datapath 504. The middle VLIW instruction comprises one integer instruction (e.g., for the integer unit 505), one address instruction, and one vector instruction. The bottom VLIW instruction comprises one branch instruction (e.g., for the branch unit 506), one address instruction, and one vector instruction.

[068] The vector execution unit can be configured to have a single data pipe or multiple data pipes. Each data pipe consists of local RAM (e.g., data store 511), a crossbar 516, two DAGs 503, and a SIMD execution unit (e.g., the vector datapath 504). FIG. 5 shows a basic configuration for explanatory purposes, in which only one data pipe is instantiated. When two data pipes are instantiated, they can run as independent threads or as cooperating threads.

[069] Six different ports (e.g., four read and two write) can be accessed via the address register file unit 515. These registers receive parameters from the scalar execution unit or from the results of the integer unit 505 or the address unit 503. The DAGs 503 also function as a collection controller, managing the distribution of the registers used to address the contents of the data store 511 (e.g., RA0, RA1, RA2, RA3, WA0, and WA1). A crossbar 516 is coupled to route the output data ports R0, R1, R2, and R3 into the vector datapath 504 in any order or combination, as required to implement a given instruction. The output of the vector datapath 504 can be fed back into the data store 511 as shown (e.g., W0). A constant RAM 517 is used to supply frequently used operands from the integer unit 505 to the vector datapath 504 and the data store 511.

[070] FIG. 6 is a diagram depicting a plurality of banks 601-604 of a memory 600 and the layout of a data store having a symmetrical array of tiles 610, in accordance with one embodiment of the present invention. For explanatory purposes, only a portion of the data store 610 is shown in FIG. 6. Logically, the data store 610 comprises an array (or arrays) of tiles. Each tile is an array of sub-tiles in a 4×4 shape. Physically, as shown by the memory 600, the data store 610 is stored in an array of "N" physical banks of memory (e.g., banks 601-604).

[071] Additionally, the data store 610 visually depicts a logical tile in a stream. In the FIG. 6 embodiment, this tile is 16 bytes high and 16 bytes wide. The tile is an array of sub-tiles (4×4 in this example). Each sub-tile is stored in a physical bank. This is shown in FIG. 6 by the number within each 4×4 sub-tile, for the case where there are 8 banks of physical memory (e.g., banks 0 through 7). The sub-tiles are organized across the banks so that no bank is shared within any 2×2 arrangement of sub-tiles. This makes unaligned accesses (e.g., in both the x and y directions) possible without bank collisions.
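The no-shared-bank property can be checked with a short sketch. The mapping `bank_of` below is a hypothetical assignment chosen only because it satisfies the stated property with 8 banks; it is not claimed to be the numbering shown in FIG. 6. The 4-byte sub-tile edge follows from a 16×16-byte tile split into 4×4 sub-tiles.

```python
NUM_BANKS = 8        # from the text: 8 banks of physical memory
SUBTILE_BYTES = 4    # 16-byte tile edge / 4 sub-tiles per edge

def bank_of(sx, sy):
    """Hypothetical bank assignment for the sub-tile at column sx, row sy."""
    return (sx + 2 * sy) % NUM_BANKS

def window_banks(sx, sy):
    """Banks used by the 2x2 group of sub-tiles whose top-left is (sx, sy)."""
    return [bank_of(sx + dx, sy + dy) for dy in (0, 1) for dx in (0, 1)]

def conflict_free(width=16, height=16):
    """True if every 2x2 group of sub-tiles in the region uses 4 distinct banks."""
    return all(len(set(window_banks(x, y))) == 4
               for y in range(height) for x in range(width))

def banks_touched(px, py, size=SUBTILE_BYTES):
    """Banks hit by a size x size byte access whose top-left byte is (px, py)."""
    xs = range(px // SUBTILE_BYTES, (px + size - 1) // SUBTILE_BYTES + 1)
    ys = range(py // SUBTILE_BYTES, (py + size - 1) // SUBTILE_BYTES + 1)
    return sorted(bank_of(x, y) for y in ys for x in xs)

print(conflict_free())        # property holds for this mapping
print(banks_touched(0, 0))    # aligned access: one sub-tile, one bank
print(banks_touched(2, 3))    # unaligned access: four sub-tiles, four banks
```

An unaligned 4×4-byte access straddles at most a 2×2 group of sub-tiles, so with such a mapping each of the four pieces can be read from a different bank in the same cycle.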

[072] The banks 601-604 are configured to support access to different tiles of each bank. For example, in one case the crossbar 516 can access a 2×4 set of tiles from the bank 601 (e.g., the first two rows of the bank 601). In another case, the crossbar 516 can access a 1×8 set of tiles from two adjacent banks. Similarly, in another case, the crossbar 516 can access an 8×1 set of tiles from two adjacent banks. In each case, the DAGs/collector 503 can receive the tiles as the banks are accessed by the crossbar 516 and supply those tiles to the front end of the vector datapath 504 on every clock.

[073] Thus, embodiments of the present invention provide a new video processor architecture that supports high-performance video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. Embodiments of the present invention maintain high compute density and can readily be scaled to handle multiple video streams. Embodiments of the present invention can provide a number of high-performance video processing operations such as, for example, MPEG-2/WMV9/H.264 encode assist (e.g., in-loop decoder), MPEG-2/WMV9/H.264 decode (e.g., post-entropy decoding), and in-loop/out-of-loop deblocking filters.

[074] Additional video processing operations provided by embodiments of the present invention include, for example, high-performance motion-adaptive progressive scan conversion (deinterlacing), input noise filtering for encoding, polyphase scaling/resampling, and sub-picture compositing. The video processor architecture of the present invention can also be used for certain video processor-amplifier (procamp) applications such as, for example, color space correction, color space adjustment, pixel point operations such as sharpening and histogram adjustment, and various video surface format conversions.

[075] Broadly and without limitation, this writing discloses the following. A latency-tolerant system for executing video processing operations is described. The system includes a host interface for implementing communication between the video processor and a host CPU, a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations, and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included for enabling the vector execution unit to operate on a demand-driven basis by accessing the memory command FIFO. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A DMA engine is built into the memory interface for implementing DMA transfers between a plurality of different memory locations and for loading the command FIFO with data and instructions for the vector execution unit.
A video processor for executing video processing operations is also described. The video processor includes a host interface for implementing communication between the video processor and a host CPU. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and the memory interface and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations.
A multidimensional datapath processing system for a video processor that executes video processing operations is also described. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A data store memory is included for storing data for the vector execution unit. The data store memory includes a plurality of tiles having symmetrical bank data structures arranged in an array. The bank data structures are configured to support accesses to different tiles of each bank.
A stream-based memory access system for a video processor that executes video processing operations is also described. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A frame buffer memory is included for storing data for the scalar execution unit and the vector execution unit. A memory interface is included for establishing communication between the scalar execution unit, the vector execution unit, and the frame buffer memory. The frame buffer memory comprises a plurality of tiles. The memory interface implements a first stream comprising a first sequential access of tiles and a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

[076] The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and thereby to enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

1 Overview
VP2 is a VLIW SIMD video DSP coupled to a scalar control processor. Its main focus is video codecs and video baseband processing.

1.1 The Essence of VP2.0
• Efficiency: VP2.0 is to be a computationally efficient machine for video applications, in terms of perf/mm² and perf/mW.
• Programmability: VP2.0 is to be highly programmable, easy to compile for, and a safer machine to program.
• Scalability: The VP2.0 design/architecture should be scalable to match the performance requirements of multiple application areas.

1.2 Design Goals
• Compute density
    Provide a significant perf/mm² advantage over VP1.0
    Efficient implementation of new application areas such as H.264
• Latency tolerance in HW, to reduce the burden on SW developers
    Hide data fetch latency by reordering memory accesses and computation
    Automatic prefetching of the instruction stream
• Hiding datapath latency
    Selective forwarding of intermediate results
    Streaming computation model
• Scalability
    Architecturally, the VP2 vector unit can scale its datapath up by 2x and down by 1/2x
    Frequency improvements can be achieved by selective re-pipelining

1.3 Application Targets
The design and instruction set of VP2.0 are optimized to perform the following applications very efficiently.
• MPEG-2/WMV9/H.264 encode assist (in-loop decoder)
• MPEG-2/WMV9/H.264 decode (post-entropy decoding)
• In-loop/out-of-loop deblocking filters
• High-performance motion-adaptive progressive scan conversion (deinterlacing)
• Input noise filtering for encoding
• Polyphase scaling/resampling
• Sub-picture compositing
• Pixel point operations such as procamp, color space conversion, color space adjustment, sharpening, and histogram adjustment
• Support for various video surface format conversions

Architecturally, VP2.0 can also be efficient in the following areas.
• 2D primitives, blits, rotations, etc.
• Refinement-based software motion estimation algorithms
• 16/32-bit MAC applications

2 Top-Level Architecture
The VP2.0 machine is partitioned into a scalar processor and a vector processor. The vector processor acts as a slave coprocessor to the scalar processor. The scalar processor is responsible for feeding the control stream (parameters, subroutine arguments) to the vector processor and for managing data input and output to and from the vector processor. All control flow of the algorithm executes on the scalar machine, while the actual pixel/data processing operations are performed on the vector processor.
The scalar processor is a typical RISC-style scalar, and the vector coprocessor is a SIMD machine with one or two SIMD pipes (each SIMD pipe has a 16-pixel datapath). The vector coprocessor can therefore produce results for up to 32 pixels as raw compute power.
The scalar processor sends function calls to the vector coprocessor using a memory-mapped command FIFO. Coprocessor commands are queued in this FIFO. The FIFO fully decouples the scalar processor from the vector processor. The scalar processor can run on its own clock. The vector processor operates as a demand-driven unit.
The top-level diagram of VP2.0 is shown below.

VP2.0 programs can be compiled into separate scalar and vector code and linked later. Separately, vector functions can be written individually and provided to the scalar thread as subroutines or library functions. The scalar processor has its own instruction and data caches. The vector unit also has an instruction cache and a data RAM (referred to as the data store). The two engines are decoupled and communicate through FIFOs.

3 Simple Programming Model
The simplest programming model for VP2.0 is a scalar control program executing subroutine calls on the vector slave coprocessor. There is an inherent assumption here that the programmer has decomposed the problem into these two threads. The thread executing on the scalar processor computes work parameters ahead of time and feeds them to the vector processor, which is the workhorse. The programs for these two threads are expected to be written and compiled separately.
The scalar thread is responsible for the following:
1. Interfacing with the host unit and implementing the class interface
2. Initialization, setup, and configuration of the vector unit
3. Executing the algorithm in work units, chunks, or working sets in a loop, such that each iteration does the following:
a. Computes the parameters for the current working set
b. Initiates the transfer of input data into the vector processor
c. Initiates the transfer of output data from the vector processor
The typical execution model for the scalar thread is fire-and-forget. This is expected to be the typical model for video baseband processing, where there is no return data from the vector coprocessor. The scalar processor keeps scheduling work for the vector processor as long as there is space in the command FIFO. Execution of a subroutine by the vector processor is delayed in time, mainly because of latency from main memory. It is therefore important to provide a latency compensation mechanism on the vector side. In VP2.0, the vector processor provides latency compensation for both instruction and data traffic. The mechanism is outlined in a separate section.
A typical VP program looks like the following.

A more complex programming model arises when there are two data pipes, or when code is written for two data pipes and run on a machine with one data pipe. That programming model is discussed in Section 6.
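The fire-and-forget loop described in this section can be sketched as follows. This is an illustrative model only, not the program listing referred to in the text: the `CommandFifo` class, the command tuples, the FIFO capacity, and the work-set names are all hypothetical, and the command names simply mirror the command types the text lists for the streaming model.

```python
from collections import deque

COMMANDS_PER_WORK_SET = 4  # memRd, params, execute, memWr (assumed grouping)

class CommandFifo:
    """Bounded queue standing in for the memory-mapped command FIFO."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def free_slots(self):
        return self.capacity - len(self.queue)

    def push(self, command):
        assert self.free_slots() > 0, "scalar must check for space first"
        self.queue.append(command)

    def pop(self):
        return self.queue.popleft() if self.queue else None

def scalar_schedule(fifo, work_sets, start):
    """Queue whole work sets while the FIFO has room; return the next index.

    The scalar side never waits for results (fire-and-forget); it only
    stops when the command FIFO has no space left.
    """
    index = start
    while index < len(work_sets) and fifo.free_slots() >= COMMANDS_PER_WORK_SET:
        work = work_sets[index]
        fifo.push(("memRd", work))    # start input transfer
        fifo.push(("params", work))   # parameters for this working set
        fifo.push(("execute", work))  # run the vector subroutine
        fifo.push(("memWr", work))    # start output transfer
        index += 1
    return index

def vector_drain(fifo, max_commands):
    """Demand-driven consumer: pop up to max_commands and return them."""
    done = []
    while len(done) < max_commands:
        command = fifo.pop()
        if command is None:
            break
        done.append(command)
    return done

fifo = CommandFifo(capacity=8)
work_sets = ["tile0", "tile1", "tile2", "tile3", "tile4"]
next_index = scalar_schedule(fifo, work_sets, 0)        # stops when FIFO is full
drained = vector_drain(fifo, COMMANDS_PER_WORK_SET)     # vector finishes one set
next_index_after = scalar_schedule(fifo, work_sets, next_index)
print(next_index, [c[0] for c in drained], next_index_after)
```

The only coupling between the two sides in this sketch is the amount of free space in the FIFO, which is what lets the scalar side run ahead of the vector side.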

4 Streaming Model
As outlined above, the scalar engine is responsible for initiating computation on the vector processor. Commands passed from the scalar engine to the vector engine are of the following main types:
1. Read commands (memRd) initiated by the scalar to transfer the current working set data from memory into the data RAM of the vector engine
2. Parameters passed from the scalar to the vector
3. Execute commands, in the form of the PC of the vector subroutine to be executed
4. Write commands (e.g., memWr) initiated by the scalar to copy the results of the vector computation into memory
On receiving these commands, the vector processor immediately schedules the memRd commands to the frame buffer (FB) interface. It also examines the execute commands and prefetches the vector subroutine to be executed (if it is not present in the cache). One objective is to schedule the instruction and data streams of the next few executes ahead of time, while the vector engine is busy with the current execute. To issue these read requests ahead of time, the vector engine manages the data store and the instruction cache in hardware.
The data store is the working RAM of the vector processor. The scalar processor sees this data store as a collection of FIFOs or streams. Streams are essentially input/output FIFOs into which the scalar initiates transfers. Once the input/output streams are full, the vector DMA engine stops processing the command FIFO from the scalar, and the command FIFO soon fills up as well. The scalar therefore stops issuing more work to the vector engine. In addition to input and output streams, the vector may need intermediate streams. The entire data store can thus be viewed, from the scalar side, as a collection of streams. Each stream is a FIFO of 2D working chunks called tiles. The vector processor maintains a read tile pointer and a write tile pointer for each stream. For input streams, when a vector subroutine executes, it can consume, or read from, the current (read) tile. In the background, data is transferred into the current (write) tile by memRd commands. The vector processor can also produce output tiles for output streams. These tiles are then moved to memory by the memWr() commands that follow the execute commands.
This model is illustrated by the example of blending a sub-picture with video. Consider a simplified example in which a video surface (e.g., in NV12 format) is blended with a sub-picture and then converted to an ARGB surface. These surfaces reside in memory. The scalar processor reads chunks (tiles) of these surfaces and loads them into the data store. Since there are multiple input surfaces, multiple input streams must be maintained. Each stream can have a different number of tiles: in this example, the sub-picture surface is assumed to be in system memory and should therefore be buffered more deeply, whereas the video stream can have a smaller number of tiles.
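The read and write tile pointers and the resulting back-pressure can be sketched as below. This is a simplified model under stated assumptions: the `Stream` class, the `dma_process` function, the tile counts, and the tile labels are hypothetical, and only memRd commands are modelled (parameter, execute, and memWr commands are omitted).

```python
from collections import deque

class Stream:
    """A FIFO of tiles with separate read and write tile pointers."""
    def __init__(self, num_tiles):
        self.tiles = [None] * num_tiles
        self.read_ptr = 0    # tile the vector subroutine consumes next
        self.write_ptr = 0   # tile the next memRd fills
        self.filled = 0

    def is_full(self):
        return self.filled == len(self.tiles)

    def write_tile(self, data):
        """Fill the current write tile; return False if the stream is full."""
        if self.is_full():
            return False
        self.tiles[self.write_ptr] = data
        self.write_ptr = (self.write_ptr + 1) % len(self.tiles)
        self.filled += 1
        return True

    def read_tile(self):
        """Consume the current read tile; return None if the stream is empty."""
        if self.filled == 0:
            return None
        data = self.tiles[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.tiles)
        self.filled -= 1
        return data

def dma_process(command_fifo, streams):
    """Process queued memRd commands in order until one targets a full stream.

    Stopping at the head of the queue is what makes the command FIFO back up
    and, in turn, throttles the scalar side.
    """
    processed = 0
    while command_fifo:
        op, stream_name, data = command_fifo[0]
        assert op == "memRd", "only memRd is modelled in this sketch"
        if not streams[stream_name].write_tile(data):
            break
        command_fifo.popleft()
        processed += 1
    return processed

# Hypothetical depths: the sub-picture stream gets more tiles than video.
streams = {"video": Stream(2), "subpicture": Stream(4)}
commands = deque([("memRd", "video", "v0"), ("memRd", "video", "v1"),
                  ("memRd", "video", "v2"), ("memRd", "subpicture", "s0")])

first_pass = dma_process(commands, streams)    # stalls on the third video tile
consumed = streams["video"].read_tile()        # vector subroutine frees a tile
second_pass = dma_process(commands, streams)   # remaining commands now proceed
print(first_pass, consumed, second_pass)
```

In this sketch the stall on a full video stream also holds up the sub-picture command behind it, which is one reason to size each stream according to the latency it experiences.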

5 Vector Coprocessor
The vector coprocessor of VP2 is a VLIW DSP designed for video baseband processing and codecs. Some of the important design attributes of this processor include the following:
1. Scalable performance, with one or two data pipes
2. Two data address generators (DAGs) per pipe
3. Memory/register operands
4. 2D (x, y) pointers/iterators
5. Deep pipeline (11-12 stages)
6. Scalar (integer)/branch units
7. Variable instruction widths (long/short instructions)
8. Data aligners for operand extraction
9. 2D datapath (4×4) shape of typical operands and results
10. Slave processor to the scalar processor, executing remote procedure calls
From the programmer's point of view, the vector coprocessor is, simply put, a SIMD datapath with two DAGs. Instructions are issued in VLIW fashion (i.e., instructions are issued to the vector datapath and the address generators simultaneously). Instructions are variable length, with the most commonly used instructions encoded in short form. The full instruction set is available in long form. From the programmer's point of view, the arrangement of the various functional units and register/SRAM resources is as shown below.

The vector unit instantiates either a single data pipe or dual data pipes. Each data pipe consists of a local RAM (the data store), two DAGs, and a SIMD execution unit. In the basic configuration there is only one data pipe. When there are two data pipes, they can run as independent threads or as cooperating threads. The complete pipeline diagram of the vector processor is shown below. It is the full configuration with two data pipes.

6 Advanced programming model
In Section 3, the RPC model was presented to illustrate the basic architecture. This section presents more advanced concepts.

6.1 Dual data pipe configuration
In the dual pipe configuration, the following resources of the processor are shared:
- Scalar controller
- Vector control unit of the vector coprocessor
- DMA engine for instruction / data fetch
- Instruction cache (may be dual-ported)
The following resources are replicated:
- Data pipe (address / branch / vector execution units)
- Data store
- Register file
Note the following items.
1. A program can be written for two pipes and run on an instance that has only one pipe. The vector control unit maps the execution of each pipe onto the same physical pipe. However, since the streams for both pipes then reside in a single data store, the data store allocation must be adjusted. A simple way is to halve either the tile size or the number of tiles in each stream. This is done by the scalar thread at setup time. Issues such as global register duplication and stream mapping need to be resolved in the macro architecture stage.
2. A program written for one pipe can be executed on an instance with two pipes. However, this code runs on only one pipe and does not use the other, so the machine is half idle.
3. A program can be written for two pipes, each running a completely different thread. This may be undesirable because there is only a single scalar controller, which is not multi-threaded and supports only one scalar execution thread. Nevertheless, this model can be supported.
4. A program can be written for two pipes, each executing the same thread. This is the standard model expected for parallelizable algorithms, such as most video baseband processing. It allows the same instruction stream to operate on, for example, two strips or two halves of the video. Each data pipe has its own execution unit and data store. The scalar controller must feed both data pipes. However, because their parameters and read and write commands are related to each other (by an offset), the scalar performance requirement does not quite double. An example of this model is shown below.
5. A program can be written with two cooperating threads. This is the model expected for codecs, where there is a single scalar control thread but multiple vector functional blocks need to be connected together. This is similar to the DirectShow pin model of functional blocks. An example of such an application is shown below. Since there are only two data pipes, this model is limited to two cooperating threads. Note also that the work must be balanced between the two threads; otherwise, performance suffers. Within these constraints, the model works with two data pipes and can also be scaled down to a single pipe.
6. The two data pipes can be synchronized with each other. The basic method of synchronization is data-driven: a vector function is executed when its data is available for processing. Streams are filled by reads from memory or by writes from the other data pipe. Once data is available, the vector control unit activates the execution and runs it. Streams can also be used as counting semaphores. Both the scalar controller and the vector data pipes can increment and decrement the tile pointers, so a stream descriptor can serve as a counting semaphore even when no data transfer takes place.
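The data-driven synchronization described in item 6 can be sketched as follows. This is only an illustrative model, not the VP2 hardware interface: the `StreamDescriptor` class, its fields, and the `ready` helper are hypothetical names, and the tile pointers are reduced to a single fill count.

```python
from dataclasses import dataclass


@dataclass
class StreamDescriptor:
    """Hypothetical stream descriptor: tile slots plus a fill count."""
    capacity: int      # number of tile slots reserved in the data store
    filled: int = 0    # tiles produced but not yet consumed

    def produce(self) -> bool:
        """Producer (memory read or the other pipe) advances the tile pointer."""
        if self.filled == self.capacity:
            return False           # no free slot: the producer has to wait
        self.filled += 1
        return True

    def consume(self) -> bool:
        """Consumer (a vector function) releases one tile."""
        if self.filled == 0:
            return False           # no data: the vector function is not activated
        self.filled -= 1
        return True


def ready(inputs, outputs) -> bool:
    """A vector function may run when every input has data and every output has room."""
    return (all(s.filled > 0 for s in inputs)
            and all(s.filled < s.capacity for s in outputs))
```

Because `produce` and `consume` only move a counter, the same descriptor behaves as a counting semaphore when no tile data is actually transferred.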

Supplemental overview:
In general, embodiments of the present invention do the following:
1. Decompose the media algorithm into scalar and vector parts. This allows an off-the-shelf scalar design and also provides the ability to run the scalar and vector parts at different clock speeds based on power and performance requirements.
2. Stream processing
3. 2D data path processing
4. Latency hiding (for both data and command fetches)

Application areas
Encryption:
Hiding the instruction code
The encryption program can simply be placed on the chip. The scalar / controller block simply requests that a specific operation be performed, and the encryption engine fetches the instructions and so on. This is very secure because the scalar cannot even see which algorithm is running. It provides a mechanism for hiding the encryption algorithm from the user.
2D
The VP2 instruction set architecture supports instructions for 2D processing. These include support for ROP3 and ROP4, which are used in many GUI / window systems. This allows 2D operations to be performed on the media processor itself. A unique advantage here is labor saving.
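As background for the ROP3 support mentioned above, a ternary raster operation is conventionally defined by an 8-bit code that acts as a truth table over the pattern, source, and destination bits. The sketch below shows that general convention in software; the function name and word width are illustrative assumptions, and it does not describe how VP2 encodes or executes the operation.

```python
def rop3(code: int, pattern: int, source: int, dest: int, width: int = 8) -> int:
    """Apply an 8-bit ROP3 code bitwise to pattern, source and destination words."""
    result = 0
    for bit in range(width):
        p = (pattern >> bit) & 1
        s = (source >> bit) & 1
        d = (dest >> bit) & 1
        index = (p << 2) | (s << 1) | d          # picks one of 8 truth-table entries
        result |= ((code >> index) & 1) << bit   # that entry is the output bit
    return result
```

For example, code 0xCC copies the source, 0x88 ANDs source with destination, and 0x66 XORs them.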

ISA
Condition code as instruction slot:
A separate issue slot (in the multiple-issue instruction bundle) is provided for condition code operations. The prior art uses SIMD instructions that can also affect the condition code / predicate registers. With the approach taken in VP2, however, data processing and predicate register processing can be scheduled independently, resulting in higher performance.

Memory I/O
Microcoded DMA engine:
The DMA engine can be programmed (or can have its own small microcode) to perform a variety of operations, such as prefetching stream data, formatting data, and padding edges. In general, it is a programmable DMA engine rather than a hardwired function. The combination of a memory input / output processor and a media processing core therefore enhances overall system-level performance. The media processor core is relieved of the need to perform data input / output processing.
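One of the operations listed above, edge padding, can be illustrated with a small software model. This is a sketch under assumptions: the function name is hypothetical, the frame is a plain list of rows, and clamping to the nearest edge pixel is just one common padding policy, not necessarily the one the DMA engine uses.

```python
def fetch_tile(frame, x0, y0, width=4, height=4):
    """Copy a width x height tile whose top-left corner is (x0, y0).

    Coordinates that fall outside the frame are clamped to the nearest
    edge pixel (assumed padding policy).
    """
    rows, cols = len(frame), len(frame[0])
    tile = []
    for y in range(y0, y0 + height):
        cy = min(max(y, 0), rows - 1)                      # clamp the row index
        tile.append([frame[cy][min(max(x, 0), cols - 1)]   # clamp the column index
                     for x in range(x0, x0 + width)])
    return tile
```

Doing this during the transfer means the vector code always sees full tiles and needs no special cases at picture boundaries.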

Storage hierarchy architecture:
In the VP2 architecture, the storage hierarchy is optimized to minimize memory bandwidth and to provide latency compensation. A variety of schemes are provided, as follows.
- A first-level streaming data store that is visible to the vector core as a scratch RAM. It is managed by hardware, which looks ahead in the request stream generated by the scalar processor. This data store is optionally backed by an L2 cache for data reuse. The L2 cache can be divided into individual sectors on a per-stream basis.
- An L1 cache backed by a streaming data store. The data store prefetches the next relevant data set.
- A cache that uses the stream pointer as a data tag.
- Use of scalar-generated stream addresses to prefetch / cache into the L1 data store and the L2 cache.

Optimized scalar-to-vector communication link:
MemRd / Wr format
Short commands from the scalar for reads and writes between system memory and local memory. These save the control flow bandwidth required to manage the DMA engine without limiting the types of transactions that are supported.

Scalar 2 vector guess for vector L2

Parameter compression with support for parameter modifiers and iterators to reduce communication bandwidth.

Pipeline cache:
Pipelined instruction cache. Various schemes are supported:
- Managing the life cycle of each cache line by tracking the executions in flight between the vector and scalar processors. This allows the instructions to be ready before the vector processor begins execution. If an instruction is not already in the cache, it is prefetched.
- For small-latency configurations, the instruction cache is minimized by turning it into a small FIFO. An execution already in the FIFO can be reused; otherwise it is fetched again.
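The FIFO variant of the instruction cache can be modelled in a few lines. The sketch below is hypothetical: the class name, the use of routine identifiers as entries, and the fetch counter are assumptions made for illustration, not details from the source.

```python
from collections import deque


class InstructionFifo:
    """Hypothetical small FIFO standing in for the instruction cache."""

    def __init__(self, depth):
        self.entries = deque(maxlen=depth)   # the oldest entry is dropped when full
        self.fetches = 0

    def request(self, routine_id):
        """Return True when the routine is reused, False when it had to be fetched."""
        if routine_id in self.entries:
            return True
        self.fetches += 1                    # models a fetch from memory
        self.entries.append(routine_id)
        return False
```

A routine that is called again while still in the FIFO costs no fetch; once it has been pushed out by newer routines it is fetched again.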

General architecture
Data stores can be shared among various processing elements. These elements communicate through streams and can feed one another. The architecture envisions a heterogeneous set of functional units, such as SIMD vector cores, DMA engines, and fixed-function units, connected through streams.

Computation / DP
Arbitrary / flexible shape / half pipe
The data path operates on a variety of shapes. The shape of the data path can be configured to match the problem set. Data paths are usually 1D. VP2 can process shapes of variable size, such as 4 × 4, 8 × 4, or 16 × 1, to match the algorithm.

Scalability
The VP2 data path architecture uses instruction transport techniques to execute wider SIMD instructions on a narrower data path over multiple cycles in order to save area. (Note: with a 16-way SIMD pipe, each operand is 1 byte wide. An 8-way SIMD pipe (bundling two pipes) can provide a wider SIMD data path with 2-byte operands, and likewise a 4-way SIMD pipe (bundling four pipes) can provide a wider SIMD data path with 4-byte operands.)
For example, VP2 can scale the data path from 16-way SIMD to 8-way SIMD.
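The trade-off between data path width and cycle count can be sketched as below. This is a simplified software model under assumptions: the function name is hypothetical, operands are 8-bit values with wrap-around, and each pass over `lanes` elements is counted as one cycle.

```python
def simd_add(a, b, lanes):
    """Add two equal-length vectors on a datapath `lanes` wide; return (result, cycles)."""
    assert len(a) == len(b)
    result, cycles = [], 0
    for start in range(0, len(a), lanes):
        # One pass through the physical datapath handles `lanes` elements.
        chunk = zip(a[start:start + lanes], b[start:start + lanes])
        result.extend((x + y) & 0xFF for x, y in chunk)
        cycles += 1
    return result, cycles
```

A 16-element add gives the same result on a 16-wide and an 8-wide data path; the narrower one simply takes two passes instead of one.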

Combining byte lanes
SIMD ways are combined to increase operand width. For example, the current design is 16-way SIMD with 8-bit operands. This can be widened to 16-bit operands with 8-way SIMD and to 32-bit operands with 4-way SIMD.
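The three lane configurations are different views of the same 16 bytes. The sketch below illustrates this with Python's `struct` module; the function name is hypothetical and little-endian byte order within a combined lane is an assumption, since the source does not state it.

```python
import struct


def view_lanes(vector: bytes, operand_bytes: int):
    """Interpret a byte vector as unsigned little-endian lanes of the given width."""
    fmt = {1: "B", 2: "H", 4: "I"}[operand_bytes]   # 8-, 16- or 32-bit lanes
    count = len(vector) // operand_bytes
    return list(struct.unpack("<" + str(count) + fmt, vector))
```

With a 16-byte vector this yields 16 lanes of 8 bits, 8 lanes of 16 bits, or 4 lanes of 32 bits.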

SIMD address generator
A separate stream address generator for each way of the SIMD pipe.
VP2 can use a SIMD address generator in which the requests are coalesced into a minimum number of accesses to the data store.
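Coalescing per-way requests can be pictured as grouping addresses by the aligned data store access that covers them. The sketch below is an assumption-laden illustration: the function name and the 16-byte access granularity are hypothetical and are not taken from the source.

```python
def coalesce(addresses, access_bytes=16):
    """Group per-way byte addresses by the aligned data store access that covers them."""
    accesses = {}
    for way, addr in enumerate(addresses):
        base = addr - addr % access_bytes        # start of the aligned access
        accesses.setdefault(base, []).append(way)
    return accesses
```

Eight ways whose addresses fall into three aligned blocks then need three data store accesses rather than eight.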

Data expansion using crossbars and collectors
The ability to create more data operands using the crossbar. This reduces read port pressure and saves power.

X2 instruction:
Not all instructions can use all of the HW elements (adders / multipliers) in the data path. Therefore, for simple instructions such as addition / subtraction, a wider data shape can be processed than for complex instructions. Instead of limiting performance to the smallest common size, VP2 uses a flexible instruction set that opportunistically attempts to operate on a wider shape as long as the read ports can sustain the operation bandwidth.

Multi-thread / multi-core media processing
The VP2 architecture supports various multi-threading options, including:
- A multi-threaded scalar processor that schedules procedure calls on multiple vector units connected through streams.
- Multiple threads executing on a single vector engine, with thread switching on a per-instruction or per-execution basis.

Power management using different vectors / scalars
Because the scalar and vector parts are decoupled, these two blocks can run at different speeds based on power and performance requirements.

Context switch:
This media processor is able to support very fast context switches because its architecture has few registers. HW support exists to track, save, and replay the scalar-to-vector command queue in order to achieve context switching. A context switch can also be initiated on a page fault.
This allows the media processor to sustain real-time processing tasks, such as input / output display processing, while also supporting non-real-time tasks, such as 2D acceleration or just-in-time video enhancement to feed the display pipeline.
This context switch capability, along with the instruction set, allows VP2 to provide integrated pixel / codec processing.

Data store organization:
VP2 uses a data store organization with the following characteristics:
- Up to 16 pixels in each direction can be accessed without bank conflicts. This is achieved while keeping stride requirements to a minimum.
- The data store organization enables efficient transposition of data shapes.
- 2D addressing is supported in the data store, which eliminates software computation of linear addresses in many media processing applications such as video.
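To make the idea of 2D addressing with conflict-free access concrete, the sketch below maps a pixel coordinate to a bank and an offset. Everything numeric here is an assumption chosen for illustration (4 × 4 pixel tiles, four banks, and a skewed bank assignment); the source does not give the actual mapping in this section.

```python
TILE = 4      # assumed tile edge in pixels
BANKS = 4     # assumed number of banks


def locate(x, y):
    """Map a pixel coordinate to (bank, tile_x, tile_y, offset_in_tile)."""
    tx, ty = x // TILE, y // TILE
    bank = (tx + ty) % BANKS                 # skew so rows and columns both spread over banks
    offset = (y % TILE) * TILE + (x % TILE)
    return bank, tx, ty, offset


def banks_touched(pixels):
    """Return the bank of each distinct tile covered by the given pixels."""
    tiles = {(x // TILE, y // TILE) for x, y in pixels}
    return sorted((tx + ty) % BANKS for tx, ty in tiles)
```

With this assumed skew, a tile-aligned run of 16 pixels touches four tiles in four different banks whether it runs horizontally or vertically, and the program works with (x, y) coordinates instead of computing linear addresses.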

Application

Architecture

High computational density
- Twice the perf / mm2 of the VP1 architecture
- Efficient for new applications (compared to fixed-function HW)
Programmability
- "Reasonably" compilable: C for scalar, intrinsics for vector
- HW-managed latency hiding
- Protection against hangs and illegal addresses
Scalability
- 2x and _x scaling options
- Scalable through clock frequency
Support for high-precision operations
- 10b + 20b integer data types

Block overview
- Scalar processor
  - Program flow control
  - Tile rasterization and address calculation
  - Feeds the loosely coupled vector unit
- Vector processor
  - 3-issue VLIW (V+I, A, B)
  - 4 × 4 SIMD data path
  - Vector operands directly from the data store (no register file transfers)
- Scalar / vector interface
  - Loosely coupled processors: the scalar feeds a command FIFO
  - HW-managed instruction and data streaming (latency hiding)
  - A vector command is issued when its instructions and data are ready

Inheritance from VP1
- Inner-product-oriented data path
- VP2 moves from 2-tap to 4-tap computation
- VLIW with A / I / B / V instructions (modified issue rules)
- Data path architecture: accumulators, data transfer, HW interlocks
- Free transposition, changed from 16 × 16 to 4 × 4
- Dedicated address units
- Complex instruction set (optimized for video)

Feature 1 of VP2
- Decoupled scalar processor
  - Ideal target for compilable outer-loop code in C
- Instruction streaming
  - HW-managed instruction prefetch (using a pipelined I$)
  - Applications can have a footprint larger than the instruction cache
- Data prefetch
  - HW-managed streams in the data store
  - The programmer does not need to control the outer loop
- Improved precision
  - 10b single precision
  - New 20b double precision
- Better ISA
- Better error management
  - Memory protection, instruction traps

Feature 2 of VP2
- 4 × 4 SIMD data path
  - VP1 was a 16 × 1 SIMD machine
  - Very efficient organization for advanced codecs (H.264, WMV9)
  - Reduces the bandwidth required for 2D processing
- Improved data store organization
  - Tiled 4 × 4 structure that matches the data path and provides a much cheaper transposition function than VP1
  - SPA-like collector structure that emulates a multi-port RAM
  - The vector pipe operates directly from the data store (no vector register file)
- Dedicated crossbar stage
  - Extracts unaligned operands
  - Creates multiple operands from a small number of read ports
- Constant RAM
  - Reduces data store read bandwidth pressure
- Very flexible condition code / predicate support

Target applications
- Codecs
  - MPEG-2 / WMV9 / H.264 encode assist (in-loop decoder)
  - MPEG-2 / WMV9 / H.264 decode (post-VLD decoding)
  - In-loop / out-of-loop deblocking filters
- Image processing / enhancement
  - High-performance motion-adaptive progressive scan conversion
  - Input noise filtering for encoding
  - Polyphase scaling / resampling
  - Sub-picture compositing
- Per-pixel transforms: procamp, color space correction, gamma, LCD overdrive, histogram adjustment, etc.
- Video surface format conversion

Potential target applications
- 2D primitives, blits, rotations, etc.
- Refinement-based software motion estimation algorithms
- 16 / 32-bit MAC applications (audio?)

Programming model

The drawings show, in order:
- A schematic diagram illustrating basic components of a computer system according to an embodiment of the present invention.
- A diagram showing internal components of a video processor unit according to an embodiment of the present invention.
- A diagram illustrating an exemplary software program for a video processor according to an embodiment of the present invention.
- A diagram illustrating an example of mixing a sub-picture with video using a video processor according to an embodiment of the present invention.
- A diagram illustrating internal components of a vector execution unit according to an embodiment of the present invention.
- A diagram showing the layout of a data store memory having a symmetrical arrangement of tiles according to an embodiment of the present invention.

Claims

A system comprising:
a scalar execution unit configured to perform scalar video processing operations;
a vector execution unit configured to perform vector video processing operations;
a data store memory for storing data of the vector execution unit; and
a memory interface for performing communication between the scalar execution unit, the vector execution unit, and the data store memory,
wherein the data store memory comprises a plurality of tiles having a symmetric bank data structure arranged in an array,
the bank data structure is configured to support access to different tiles in each bank,
a first stream including a first sequential access to the tiles and a second stream including a second sequential access to the tiles are carried out for the vector execution unit or the scalar execution unit, and
the memory interface, in order to compensate for the latency of the first stream and the second stream, adjusts the number of tiles to prefetch based on the latency of the first stream and the second stream and initiates prefetching of the data in the tiles for the first sequential access and the second sequential access.

The system of claim 1, wherein the system is a multi-dimensional data path processing system for a video processor for performing video processing operations.

A system for multidimensional data path processing that supports video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor having the system of claim 1, coupled to the motherboard and coupled to the CPU.

The system according to any one of claims 1 to 3, wherein:
each of the bank data structures comprises a plurality of tiles arranged in a 4x4, 8x8, 8x16, or 16x24 pattern;
the bank data structures are configured to support access to different tiles of each bank data structure and to support at least one access to two adjacent bank data structures, the access comprising a row of tiles of the two adjacent bank data structures;
the tiles are configured to support access to different tiles of each bank data structure, and at least one access to two adjacent bank data structures comprises a column of tiles of the two adjacent bank data structures;
the system further comprises a crossbar coupled to the data store memory for selecting a configuration for accessing the tiles of the plurality of bank data structures;
the crossbar accesses the tiles of the plurality of bank data structures to supply data to the vector data path on every clock; and
the system further comprises a collector for receiving the data in the tiles of the plurality of bank data structures accessed by the crossbar and supplying the data in the tiles to the front end of the vector data path on every clock.

A video processor for performing video processing operations, comprising:
a host interface for performing communication between the video processor and a host CPU;
a memory interface for performing communication between the video processor and a frame buffer memory;
a scalar execution unit coupled to the host interface and the memory interface and configured to perform scalar video processing operations; and
a vector execution unit coupled to the host interface and the memory interface and configured to perform vector video processing operations,
wherein the frame buffer memory comprises a plurality of tiles,
a first stream including a first sequential access to the tiles and a second stream including a second sequential access to the tiles are carried out for the vector execution unit or the scalar execution unit, and
the memory interface, in order to compensate for the latency of the first stream and the second stream, adjusts the number of tiles to prefetch based on the latency of the first stream and the second stream and initiates prefetching of the data in the tiles for the first sequential access and the second sequential access.

A system for performing video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
the video processor of claim 5, coupled to the motherboard and coupled to the CPU.

The scalar execution unit functions as a controller of the video processor and controls the operation of the vector execution unit;
The video processor further comprises a vector interface unit that interfaces the scalar execution unit and the vector execution unit;
The video processor of claim 5, wherein the scalar execution unit and the vector execution unit are configured to operate asynchronously.

The scalar execution unit executes at a first clock frequency, and the vector execution unit executes at a second clock frequency;
The scalar execution unit is configured to execute an application flow control algorithm, and the vector execution unit is configured to execute a pixel processing operation of the application;
The vector execution unit is configured to operate on a demand driven basis under the control of the scalar execution unit;
The scalar execution unit is configured to send function calls to the vector execution unit using a command FIFO, such that the vector execution unit operates on a demand driven basis by accessing the command FIFO; and
The video processor of claim 7 or the system of claim 6, wherein the asynchronous operation of the video processor is configured to support separate independent updates of the application's vector subroutine or scalar subroutine.

The video processor of claim 5, wherein the scalar execution unit is configured to operate using VLIW (very long instruction word) code.

A stream-based memory access system for a video processor that performs video processing operations, comprising:
a scalar execution unit configured to perform scalar video processing operations;
a vector execution unit configured to perform vector video processing operations;
a frame buffer memory for storing data for the scalar execution unit and the vector execution unit; and
a memory interface for performing communication between the scalar execution unit, the vector execution unit, and the frame buffer memory,
wherein the frame buffer memory comprises a plurality of tiles,
a first stream including a first sequential access to the tiles and a second stream including a second sequential access to the tiles are carried out for the vector execution unit or the scalar execution unit, and
the memory interface, in order to compensate for the latency of the first stream and the second stream, adjusts the number of tiles to prefetch based on the latency of the first stream and the second stream and initiates prefetching of the data in the tiles for the first sequential access and the second sequential access.

A system that performs stream-based memory access to support video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor coupled to the motherboard and coupled to the CPU, the video processor comprising:
a host interface establishing communication between the video processor and the host CPU;
a scalar execution unit coupled to the host interface and configured to perform scalar video processing operations;
a vector execution unit coupled to the host interface and configured to perform vector video processing operations; and
a memory interface coupled to the scalar execution unit and the vector execution unit to establish stream-based communication between the scalar execution unit, the vector execution unit, and a frame buffer memory,
wherein the frame buffer memory comprises a plurality of tiles,
a first stream including a first sequential access to the tiles and a second stream including a second sequential access to the tiles are carried out for the vector execution unit or the scalar execution unit, and
the memory interface, in order to compensate for the latency of the first stream and the second stream, adjusts the number of tiles to prefetch based on the latency of the first stream and the second stream and initiates prefetching of the data in the tiles for the first sequential access and the second sequential access.

The system of claim 10, wherein:
the first stream and the second stream include data in at least one prefetched tile;
the first stream originates from a first location of the frame buffer memory and the second stream originates from a second location of the frame buffer memory;
the memory interface is configured to manage a plurality of streams, including streams from a plurality of different origin locations and streams to a plurality of different end locations;
at least one of the origin locations or at least one of the end locations is in system memory;
the system further comprises a DMA engine embedded in the memory interface and configured to perform a plurality of memory reads to support the first stream and the second stream and a plurality of memory writes to support the first stream and the second stream; and
the memory interface is configured to prefetch data in an adjustable number of tiles of the first stream or the second stream to compensate for the latency of the first stream or the second stream.

A system comprising:
a host interface for performing communication between a video processor and a host CPU;
a scalar execution unit coupled to the host interface and configured to perform scalar video processing operations;
a vector execution unit coupled to the host interface and configured to perform vector video processing operations;
a command FIFO that allows the vector execution unit to operate on a demand driven basis by accessing the command FIFO;
a memory interface for performing communication between the video processor and a frame buffer memory; and
a DMA engine embedded in the memory interface for performing DMA transfers between a plurality of different storage areas and loading data and instructions for the vector execution unit into a data store memory and an instruction cache,
wherein the frame buffer memory comprises a plurality of tiles,
a first stream including a first sequential access to the tiles and a second stream including a second sequential access to the tiles are carried out for the vector execution unit or the scalar execution unit, and
the memory interface, in order to compensate for the latency of the first stream and the second stream, adjusts the number of tiles to prefetch based on the latency of the first stream and the second stream and initiates prefetching of the data in the tiles for the first sequential access and the second sequential access.

The system of claim 13, wherein the system is a latency-tolerant system that performs video processing operations.

The system of claim 14, further comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor coupled to the motherboard and coupled to the CPU.

The vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO to operate on the request driven basis;
The request driven base is configured to conceal the latency of data transfer from the different storage to the command FIFO of the vector execution unit;
The scalar execution unit is configured to perform algorithmic flow control processing, and the vector execution unit is configured to perform most of the video processing workload;
The scalar execution unit is configured to precalculate the working parameters of the vector execution unit to conceal data transfer latency;
The vector execution unit is configured to schedule memory reads via the DMA engine to prefetch commands for subsequent execution of vector subroutines;
The memory read is scheduled to prefetch commands for the execution of the vector subroutine prior to invoking the vector subroutine by the scalar execution unit;
The vector execution unit is configured to schedule a memory read via the DMA engine and prefetch commands for subsequent execution of a vector subroutine, wherein the memory read is performed by the scalar execution unit. Scheduled to prefetch commands for the execution of the vector subroutine prior to calling the vector subroutine;
The system according to any one of claims 13 to 15.