JPH06290262A

JPH06290262A - Processor for image codec

Info

Publication number: JPH06290262A
Application number: JP7476493A
Authority: JP
Inventors: Eiji Iwata; 英次岩田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1993-03-31
Filing date: 1993-03-31
Publication date: 1994-10-18
Anticipated expiration: 2018-04-28
Also published as: JP3401823B2

Abstract

PURPOSE: To provide an image codec processor of simple construction by treating a macroblock, consisting of a plurality of blocks each composed of m×n image data, as one processing unit, applying single instruction stream, multiple data stream (SIMD) control adaptively to field image processing and frame image processing, and improving the parallelism of the processors. CONSTITUTION: Element processors 11-18, in which mutually adjacent element processors are connected by data transfer lines 42-45, are provided corresponding to the macroblock, and a coefficient memory 51 supplies coefficients to these element processors in common. A common adding circuit 41 that adds the arithmetic results is provided at the outputs of the element processors 11-14 that perform the field processing. The image data to be processed is input to the element processors 11-18 in macroblock units through buses 31-34, and each element processor performs pipeline processing in three stages: a data input stage, a calculation stage, and a data output stage.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a central processing unit (processor) in a computer system used for, for example, numerical calculation, image processing, and graphics processing, and in particular to a signal processing device suited to video signal processing such as an image codec.

[0002]

[Prior Art] First, the concepts of a macroblock and a block of an image in image codec processing are described, taking a (4:2:2) signal based on the CCIR 601 format as an example. In the present invention, image codec processing means image encoding and decoding that take the macroblock as the processing unit and are based on motion compensation plus discrete cosine transform (DCT) processing of image data, as typified by image compression encoding / decompression decoding standards such as CCITT Recommendation H.261 and MPEG.

[0003] One image frame consists of a luminance component (Y component) of 720 × 480 pixels and two color difference components, each subsampled horizontally to 360 × 480 pixels: a first color difference component (Cr component) and a second color difference component (Cb component). The image frame is divided into square regions of 16 × 16 pixels for the luminance component and into rectangular regions of 8 × 16 pixels for each of the two color difference components. A 16 × 16 pixel square region of the luminance component, together with the 8 × 16 pixel rectangular regions of the two color difference components that correspond to it in position, is called a macroblock. An 8 × 8 pixel square region is called a block, regardless of whether it belongs to the luminance component or to a color difference component. Accordingly, as shown in FIG. 6, one macroblock consists of eight blocks in total: four luminance (Y) blocks and four color difference blocks (two Cr blocks and two Cb blocks).
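
The block counts in the paragraph above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the specification; the function and constant names are chosen here for illustration only.

```python
# Frame geometry taken from the text: 4:2:2 sampling, 720x480 luminance.
LUMA_W, LUMA_H = 720, 480
CHROMA_W, CHROMA_H = 360, 480   # each of Cr and Cb, subsampled horizontally
BLOCK = 8                       # a block is 8x8 pixels

def blocks_in(width, height):
    """Number of 8x8 blocks that tile a width x height region."""
    return (width // BLOCK) * (height // BLOCK)

def macroblock_layout():
    """Blocks per macroblock for each component."""
    return {
        "Y": blocks_in(16, 16),   # 16x16 luminance region
        "Cr": blocks_in(8, 16),   # 8x16 color difference region
        "Cb": blocks_in(8, 16),
    }

def macroblocks_per_frame():
    """Macroblocks needed to tile one frame, counted on the luminance grid."""
    return (LUMA_W // 16) * (LUMA_H // 16)

layout = macroblock_layout()
print(layout, sum(layout.values()), macroblocks_per_frame())
```

The same macroblock count follows from the color difference grid, since 360 / 8 = 45 columns and 480 / 16 = 30 rows.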

[0004] As shown in FIG. 7, a conventional image codec processor 1 that takes the macroblock as one processing unit consists of an input data memory 2, an arithmetic unit 3, and an output data memory 4. It performs pipeline processing in three stages: an "input stage" that inputs the macroblock to be processed, a "calculation stage" that performs image codec processing using the arithmetic unit 3, and an "output stage" that outputs the processed macroblock. Adjacent stages are coupled by the input data memory 2 and the output data memory 4, respectively, each of which is a data memory with a double-buffer configuration (two memory banks that are switched for each processing unit).

[0005] Pipeline processing in the conventional image codec processor 1, which takes the macroblock as one processing unit, is briefly described below. The number k used below is a positive integer.

A. When the (2k)-th macroblock is in the calculation stage

A1. Input stage: the (2k+1)-th macroblock is written into the first memory bank 5 (bank 0) of the input data memory 2 shown in FIG. 7.

A2. Calculation stage: the (2k)-th macroblock is read from the second memory bank 6 (bank 1) of the input data memory 2, image codec processing is applied in the arithmetic unit 3, and the result is written into the first memory bank 7 (bank 0) of the output data memory 4.

A3. Output stage: the (2k-1)-th macroblock, which has already undergone image codec processing, is output from the second memory bank 8 (bank 1) of the output data memory 4.

[0006] B. When the (2k+1)-th macroblock is in the calculation stage

B1. Input stage: the (2k+2)-th macroblock is written into the second memory bank 6 (bank 1) of the input data memory 2.

B2. Calculation stage: the (2k+1)-th macroblock is read from the first memory bank 5 (bank 0) of the input data memory 2, image codec processing is applied in the arithmetic unit 3, and the result is written into the second memory bank 8 (bank 1) of the output data memory 4.

B3. Output stage: the (2k)-th macroblock, which has already undergone image codec processing, is output from the first memory bank 7 (bank 0) of the output data memory 4.
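
Cases A and B above follow a single rule in which the bank roles swap with the parity of the macroblock number. The sketch below, in Python, encodes that rule under the assumption that step n means "macroblock n is in the calculation stage"; the function names and dictionary keys are illustrative and do not appear in the specification.

```python
def stage_assignment(n):
    """Banks touched by each stage while macroblock n is being calculated."""
    b = n % 2   # 0 for case A (n = 2k), 1 for case B (n = 2k + 1)
    return {
        "input": {"macroblock": n + 1, "writes_input_bank": b},
        "calc": {
            "macroblock": n,
            "reads_input_bank": 1 - b,
            "writes_output_bank": b,
        },
        "output": {"macroblock": n - 1, "reads_output_bank": 1 - b},
    }

def no_bank_conflict(n):
    """True if no two stages touch the same bank of the same memory."""
    s = stage_assignment(n)
    input_ok = s["input"]["writes_input_bank"] != s["calc"]["reads_input_bank"]
    output_ok = s["calc"]["writes_output_bank"] != s["output"]["reads_output_bank"]
    return input_ok and output_ok

for n in (2, 3):
    print(n, stage_assignment(n), no_bank_conflict(n))
```

Because the bank written by one stage is always the bank not being read by its neighbor, the check holds for every step, which is the property the double-buffer configuration is meant to provide.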

[0007] Because the conventional image codec processor has a double-buffered data memory between stages as described above, the stages may run at different operating frequencies without any problem. Moreover, since the stages never contend for access to a data memory, they can operate fully in parallel.

[0008]

[Problems to be Solved by the Invention] In the configuration of the image codec processor described above, however, any attempt to raise speed by parallelizing the arithmetic unit 3 increases the number of data memory ports and complicates the interconnection network between the arithmetic units and the data memory. For example, assume that each arithmetic unit takes two data inputs and consider a processor in which four arithmetic units operate in parallel. To supply arbitrary data of the macroblock to all four arithmetic units in every clock cycle, each memory bank of the input data memory must be an 8-port multiport memory. As for the interconnection network between the arithmetic units and the data memory, each of the eight ports must be able to supply data to any input terminal of the arithmetic units, so an 8 × 8 crossbar network, which is very complex and requires a large amount of hardware, must be built. This is because all the data of the macroblock is stored in a single memory bank of the data memory.

[0009] An approach that parallelizes the arithmetic units without increasing the number of data memory ports is therefore conceivable. In this approach, the data memory is divided so that there is one data memory per block of the macroblock, and an arithmetic unit is provided for each of these data memories. In addition, every arithmetic unit is prohibited from transferring data to or from any data memory other than its own. For example, for the (4:2:2) signal illustrated in FIG. 6, the macroblock consists of eight blocks, so the data memory is divided into eight and eight arithmetic units are provided, one per data memory. With this configuration, assuming two data inputs per arithmetic unit, a multiport memory with at most two ports per memory bank of the input data memory suffices, and the interconnection network between the arithmetic units and the data memories also becomes simple.
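
The port counts quoted for the two memory organizations can be reproduced as follows. This is a small Python sketch with illustrative function names; the crosspoint count is a common rough measure of crossbar cost and is an assumption added here, not a figure from the specification.

```python
def shared_memory_ports(num_units, inputs_per_unit):
    """Ports per bank when every unit may read any macroblock datum each cycle."""
    return num_units * inputs_per_unit

def crossbar_crosspoints(num_units, inputs_per_unit):
    """Crosspoints of a full crossbar from memory ports to unit input terminals."""
    ports = shared_memory_ports(num_units, inputs_per_unit)
    terminals = num_units * inputs_per_unit
    return ports * terminals

def per_block_memory_ports(inputs_per_unit):
    """Ports per bank when each unit reads only its own block memory."""
    return inputs_per_unit

print(shared_memory_ports(4, 2))    # single shared memory, four units
print(crossbar_crosspoints(4, 2))   # 8 x 8 crossbar
print(per_block_memory_ports(2))    # one memory per block
```

In the divided organization the port count no longer depends on the number of arithmetic units, which is why eight units can be used without a larger multiport memory.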

[0010] In image codec processing, however, some element processes have data dependencies that span blocks; that is, they need not only the data within a block but also the data of adjacent blocks. Two examples are described in turn: motion vector detection during encoding, and field/frame adaptive discrete cosine transform (DCT) processing in MPEG2.

[0011] First, motion vector detection on image data is described. When the full-search block matching algorithm is used for motion vector detection, the following operation is performed on the luminance component y_{i,j} (0 ≤ i < 16, 0 ≤ j < 16) of the macroblock.

[0012]

[Equation 1]

[0013] In the above equation, c_{i,j} denotes the data of a candidate block (motion vector detection is described in detail in the specification and drawings of Japanese Patent Application No. 4-311163, "Arithmetic Circuit", filed earlier by the present applicant on October 28, 1992). As Equation 1 shows, a single sum of absolute differences is obtained using all the data of the luminance component of the macroblock (four blocks). Motion vector detection is therefore an element process with data dependencies that span blocks. The various mode decision processes performed during encoding have similar cross-block data dependencies.
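
The image of Equation 1 is not reproduced in this text. Assuming it has the usual form of a sum of absolute differences, that is, the sum of |y_{i,j} - c_{i,j}| over 0 ≤ i < 16 and 0 ≤ j < 16, the following Python sketch shows why the operation spans blocks: each 8 × 8 luminance block yields only a partial sum, and the four partial sums must be added to obtain the value for the macroblock. The function names and the random test data are illustrative.

```python
import random

def sad(cur, cand, rows, cols):
    """Sum of absolute differences over the given row and column ranges."""
    return sum(abs(cur[i][j] - cand[i][j]) for i in rows for j in cols)

def macroblock_sad(cur, cand):
    """SAD over the whole 16x16 luminance region."""
    return sad(cur, cand, range(16), range(16))

def blockwise_sad(cur, cand):
    """Four 8x8 partial sums, one per luminance block, plus their total."""
    partial = [
        sad(cur, cand, range(r, r + 8), range(c, c + 8))
        for r in (0, 8)
        for c in (0, 8)
    ]
    return sum(partial), partial

rng = random.Random(0)
y = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]
c = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]
total, partial = blockwise_sad(y, c)
print(total == macroblock_sad(y, c), len(partial))
```

The final addition of the four partial sums is the only step that needs data from more than one block.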

[0014] Next, frame/field adaptive discrete cosine transform (DCT) processing in MPEG2 is described. In MPEG2 frame/field adaptive DCT, frame DCT and field DCT are switched adaptively according to the nature of the image data. As shown in FIG. 8, the block structure for frame DCT is the same as that shown in FIG. 6, whereas for field DCT, as shown in FIG. 9, the blocks are formed by extracting data from alternate lines in the vertical direction of the macroblock. In field DCT there is therefore a data dependency that spans two vertically adjacent blocks in FIG. 6 (for example, block 0 and block 1).
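
The difference between the two block structures can be illustrated on one 8-pixel-wide column of the macroblock, 16 lines tall. The Python sketch below identifies each line by its index only; FIGS. 8 and 9 are not reproduced here, so the assignment of even lines to the upper block and odd lines to the lower block is an assumption made for illustration.

```python
def frame_blocks(lines):
    """Frame DCT: the upper 8 lines and the lower 8 lines form the two blocks."""
    return lines[:8], lines[8:]

def field_blocks(lines):
    """Field DCT: alternate lines are gathered into separate blocks."""
    return lines[0::2], lines[1::2]

def lines_from_other_block(lines):
    """Lines of the first field block that lie in the second frame block."""
    first_field, _ = field_blocks(lines)
    _, lower_frame = frame_blocks(lines)
    return [line for line in first_field if line in lower_frame]

column = list(range(16))
print(frame_blocks(column))
print(field_blocks(column))
print(lines_from_other_block(column))
```

Half of the lines in each field block come from the vertically adjacent frame block, which is the cross-block dependency described above.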

[0015] As described above, image codec processing includes element processes with data dependencies that span blocks, so it has been difficult to divide the data memory by block and provide an arithmetic unit per block. In addition, because the conventional image codec processor described above uses separate data memories for input and output, a memory bank for arithmetic processing exists independently in each of them, which increases the data memory capacity. If the capacity of one memory bank is m, a total capacity of 4m is required.

[0016]

[Means for Solving the Problems] To solve the problems described above, the present invention provides, in a processor that performs image codec processing as a three-stage pipeline (input stage, calculation stage, output stage) with the macroblock as one processing unit, element processors corresponding to the blocks that make up the macroblock, each containing an arithmetic unit and a double-buffered data memory, together with a circuit that adds the arithmetic results of adjacent element processors and a circuit that enables data transfer between the data memories of adjacent element processors.

[0017] Thus, according to the present invention, there is provided a "single instruction stream, multiple data stream: SIMD" controlled image codec processor that takes as one processing unit a macroblock consisting of a plurality of blocks, each composed of m × n image data, and that adaptively performs signal processing spanning the image data of a plurality of blocks and signal processing on the image data within one block, controlling multiple data streams with a single instruction stream, characterized in that: a plurality of pairs of mutually adjacent element processors, each pair connected by a data transfer path between adjacent element processors, are provided corresponding to the macroblock; a common coefficient memory that supplies coefficients to these element processors in common is provided; a common adding circuit that adds the arithmetic results is provided at the outputs of the pairs of element processors that perform the signal processing spanning the image data of the plurality of blocks; and the image data to be processed is input to the element processors in macroblock units.

[0018] Preferably, each of the element processors is configured to pipeline three stages: a data input stage, a calculation stage, and a data output stage. Specifically, each of the element processors has an I/O buffer with three banks corresponding to the three stages (data input, calculation, and data output); a first data memory with at least two banks arranged in parallel that can operate alternately; a second data memory with at least two banks arranged in parallel that can operate alternately; an interconnection network that interconnects the I/O buffer and the first and second data memories; and an arithmetic unit connected to the interconnection network.

[0019] Preferably, the arithmetic unit has a 2-input, 2-output extended arithmetic logic unit that performs addition, subtraction, various logical operations, magnitude comparison, absolute difference operations, and butterfly operations; a multiplication unit that multiplies a coefficient from the coefficient memory by the output of the extended arithmetic logic unit; and an accumulation unit that accumulates the multiplication results. More preferably, a pipeline register is provided after each of the extended arithmetic logic unit, the multiplication unit, and the accumulation unit, and pipeline processing corresponding to the three stages (data input, calculation, and data output) is performed.
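
The chain of extended ALU, multiplication unit, and accumulation unit can be pictured with a functional model. The Python sketch below is only an illustration under stated assumptions: it ignores the pipeline registers and timing, it feeds a single ALU result to the multiplier (the text does not say which of the two ALU outputs is used), and all names are hypothetical.

```python
def butterfly(a, b):
    """Two-output ALU operation: sum and difference of the two inputs."""
    return a + b, a - b

def abs_diff(a, b):
    """Absolute difference, as used in block matching."""
    return abs(a - b)

def butterfly_sum(a, b):
    """Assumed selection of the sum output of the butterfly."""
    return butterfly(a, b)[0]

def multiply_accumulate(pairs, coefficients, alu_op):
    """Pass each input pair through alu_op, scale by a coefficient, accumulate."""
    acc = 0
    for (a, b), coeff in zip(pairs, coefficients):
        alu_out = alu_op(a, b)       # extended ALU
        product = alu_out * coeff    # multiplication unit
        acc += product               # accumulation unit
    return acc

pairs = [(10, 7), (3, 8), (5, 5)]
# With every coefficient equal to 1 the chain reduces to a sum of absolute differences.
print(multiply_accumulate(pairs, [1, 1, 1], abs_diff))
print(multiply_accumulate(pairs, [2, 2, 2], butterfly_sum))
```

With other coefficient sequences the same chain computes weighted sums of the kind a DCT requires, which is one reason a single datapath can serve several element processes.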

[0020] Specifically, the signal processing spanning the image data of the plurality of blocks is field image signal processing, and the signal processing on the image data within one block is frame image processing.

[0021] According to the present invention, there is also provided an image codec processor that takes the macroblock as one processing unit in image codec processing, characterized in that it has a plurality of element processors corresponding to the blocks that make up the macroblock, and performs the image codec processing of each macroblock in parallel on the plurality of element processors under "single instruction stream, multiple data stream: SIMD" control.

[0022] According to the present invention, there is also provided an image codec processor that performs image codec processing as a three-stage pipeline (input stage, calculation stage, output stage) with the macroblock as one processing unit, characterized in that each element processor, provided for one of the blocks that make up the macroblock, contains an arithmetic unit and a double-buffered data memory, and the image codec processing of each block is performed in parallel on the plurality of element processors under "single instruction stream, multiple data stream: SIMD" control.

[0023] According to the present invention, there is further provided an image codec processor that performs image codec processing as a three-stage pipeline (input stage, calculation stage, output stage) with the macroblock as one processing unit, characterized in that each element processor, provided for one of the blocks that make up the macroblock, contains an arithmetic unit and a double-buffered data memory, a circuit that adds the arithmetic results of adjacent element processors is also provided, and the image codec processing of each block is performed in parallel on the plurality of element processors under "single instruction stream, multiple data stream: SIMD" control.

[0024] According to the present invention, there is also provided an image codec processor that performs image codec processing as a three-stage pipeline (input stage, calculation stage, output stage) with the macroblock as one processing unit, characterized in that each element processor, provided for one of the blocks that make up the macroblock, contains an arithmetic unit and a double-buffered data memory, a circuit that adds the arithmetic results of adjacent element processors is also provided, data can be transferred between the data memories of adjacent element processors, and the image codec processing of each block is performed in parallel on the plurality of element processors under "single instruction stream, multiple data stream: SIMD" control.

[0025]

[Operation] Each element processor, provided for one of the blocks that make up the macroblock, contains an arithmetic unit and a double-buffered data memory; in addition, a circuit adds the arithmetic results of adjacent element processors, and data can be transferred between the data memories of adjacent element processors. As a result, the image codec processing of each block can be performed in parallel on the plurality of element processors under "single instruction stream, multiple data stream: SIMD" control.

[0026] Only one adding circuit and only one coefficient memory, each shared by the element processors, need be provided.

[0027]

[Embodiment] An embodiment of the image codec processor of the present invention is described in detail below with reference to the drawings. The image codec processor of this embodiment has a plurality of arithmetic units, each consisting of an arithmetic logic unit (ALU), a multiplier, an accumulator, and so on, and is based on a "single instruction stream, multiple data stream: SIMD (Single Instruction stream Multiple Data stream)" processor in which these arithmetic units process multiple data in parallel under a single instruction stream. For SIMD control, see Yamauchi et al., "Architecture and Implementation of a Highly Parallel Single-Chip Video DSP", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, June 1992, pp. 207-220.

[0028] Furthermore, the arithmetic units of this processor allow their operators to be connected as a pipeline, so pipelined arithmetic is also performed. As shown in FIG. 1, the image codec processor of the present invention has a macroblock input terminal 21, a macroblock output terminal 22, a frame memory macroblock input/output terminal 23, and a frame memory macroblock input terminal 24, as well as an input data bus 31, an output data bus 32, and data buses 33 and 34 connected to these terminals. The image codec processor further has a plurality of element processors (PEs), eight in this example (11 to 18), interconnected through the buses 31 to 34; one adding circuit 41 that adds the results of the four element processors 11 to 14; and one coefficient memory 51 that applies coefficients to the block input terminal 81 (FIG. 2) of each element processor. Besides being interconnected by the buses 31 to 34, adjacent element processors are connected to each other in pairs: PE0 and PE1, PE2 and PE3, PE4 and PE5, and PE6 and PE7.

[0029] The image codec processor of one embodiment of the present invention is described below: first the overall configuration, then the configurations of the arithmetic unit and the data memory.

1. Overall configuration

FIG. 1 shows the overall configuration of an image codec processor as one embodiment of the present invention. The processor is provided with eight element processors 11 to 18 (hereinafter also written generically as 1k, where k = 1 to 8), one for each block of the macroblock shown in FIG. 6. FIG. 2 shows the internal configuration of an element processor. The k-th element processor has an arithmetic unit 6k; a data memory 7k consisting of an input/output (I/O) buffer 91, a first frame buffer 0 (92), a second frame buffer 1 (93), and a working buffer 94; an interconnection network 95 that connects the buffers 91 to 94; and a selector 111.

[0030] The operation of the image codec processor of the present invention is described below. First, the image data of the macroblock to be processed is input one datum at a time from the macroblock input terminal 21 shown in FIG. 1. Each block of the macroblock is stored, via the input data bus 31, in the data memory 7k of the element processor whose number matches the block number: block 0 in the data memory 71 of the first element processor 11 (PE0), block 1 in the data memory 72 of the second element processor 12 (PE1), and so on up to block 7 in the data memory 78 of the eighth element processor 18 (PE7). At the same time, the macroblocks of past and future frames needed for motion compensation of the image data are stored in the data memories 7k in the same manner, from the frame memory macroblock input/output terminal 23 or input terminal 24 via the data buses 33 and 34. This input operation depends on the prediction mode of the macroblock, such as forward prediction or bidirectional prediction. If the macroblock being processed does not use motion compensation, the input operation is not performed. For motion compensation with forward or backward prediction, only the frame memory macroblock input terminal 24 is used, and only the macroblock of either the past or the future frame is input. For motion compensation with bidirectional prediction, both the frame memory macroblock input/output terminal 23 and the frame memory macroblock input terminal 24 are used, and the macroblocks of both frames are input.
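
The routing rules in this paragraph can be summarized as two lookups. The Python sketch below uses the reference numerals from the text; the mode names and function names are hypothetical labels introduced here for the cases the paragraph distinguishes.

```python
def block_destination(block_number):
    """Map a block number (0 to 7) to the reference numerals used in the text."""
    if not 0 <= block_number <= 7:
        raise ValueError("a 4:2:2 macroblock has blocks 0 to 7")
    return {
        "pe": f"PE{block_number}",
        "element_processor": 11 + block_number,
        "data_memory": 71 + block_number,
    }

# Hypothetical mode labels; the terminal numbers are those given in the text.
FRAME_MEMORY_TERMINALS = {
    "no_motion_compensation": (),
    "forward": (24,),
    "backward": (24,),
    "bidirectional": (23, 24),
}

def reference_terminals(mode):
    """Frame memory terminals used to load reference macroblocks for a mode."""
    return FRAME_MEMORY_TERMINALS[mode]

print(block_destination(0), block_destination(7))
print(reference_terminals("bidirectional"))
```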

[0031] In parallel with these input operations, the arithmetic unit 6k of the k-th element processor PE executes the element processes of the image codec, such as the discrete cosine transform (DCT) and quantization, in parallel under "single instruction stream / multiple data stream: SIMD" control. As described in the document cited above, SIMD control is a method in which a single instruction controls multiple data streams. Because all element processes of the image codec are performed under SIMD control, the coefficient memory 51 shown in FIG. 1 is shared by all the element processors 11 to 18, and no coefficient memory needs to be provided inside the individual element processors 11 to 18. Furthermore, in parallel with these input and calculation operations, the macroblock that has undergone image codec processing is output from the macroblock output terminal 22 one data item at a time. Each block of the macroblock is output from the data memory 7k of the element processor whose number equals the block's own number, via the output data bus 32, to the macroblock output terminal 22. That is, block 0 is output from the data memory 71 of element processor 11 (PE0), block 1 from the data memory 72 of element processor 12 (PE1), and so on, up to block 7, which is output from the data memory 78 of element processor 18 (PE7).
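As a rough illustration of why SIMD control lets one coefficient memory serve every element processor, the sketch below applies a single instruction and a single shared coefficient to the private data of each PE. The function name, the instruction mnemonics, and the list-based memories are assumptions made for this example only.

```python
def simd_step(instruction, pe_memories, coefficient):
    """Apply one instruction, with one shared coefficient, to every PE's data.

    pe_memories holds one list of values per element processor; the same
    coefficient (as if read once from a shared coefficient memory) is used by all.
    """
    operations = {
        "mul": lambda value, coeff: value * coeff,
        "add": lambda value, coeff: value + coeff,
    }
    operate = operations[instruction]
    return [[operate(value, coefficient) for value in memory]
            for memory in pe_memories]
```

Since every PE executes the same instruction in the same step, they all need the same coefficient at the same time, which is why a per-PE copy would be redundant.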

[0032] At the same time, if the macroblock that has undergone image codec processing is needed for motion compensation of the image data of other macroblocks, it is output, in the same manner as the output operation above, from the data memory 7k of the k-th element processor to the macroblock input/output terminal 23 of the frame memory via the data bus 33. In this image codec processing, the macroblock input/output terminal 23 of the frame memory never needs to input and output macroblocks simultaneously.

[0033] Next, the adder circuit 41 in FIG. 1 is described. Inter-block data dependencies such as those in motion vector detection and mode decision processing, described in the section "Problems to be solved by the invention" above, can be resolved if the operation results obtained for the individual blocks can all be added together. Taking motion vector detection as an example, the sum of absolute differences is computed for each block of the luminance component of the macroblock (four blocks), and these four sums of absolute differences are then added. For this purpose, the adder circuit 41 is provided at the outputs of the four element processors 11 to 14 that store the luminance component of the macroblock. The adder circuit 41 may have any configuration as long as it can add all four operation results. The addition result is written to a data register (not shown) of the control circuit.
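The motion vector example can be sketched as follows: each of the four luminance blocks produces its own sum of absolute differences, and a final addition plays the role of the adder circuit 41. The function names and the list-of-rows block layout are assumptions for illustration.

```python
def block_sad(current, reference):
    """Sum of absolute differences over one block, given as a list of rows."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(current, reference)
               for c, r in zip(cur_row, ref_row))


def macroblock_sad(current_blocks, reference_blocks):
    """Per-block SADs (one per luminance PE) followed by a single final addition."""
    partial_sums = [block_sad(cur, ref)
                    for cur, ref in zip(current_blocks, reference_blocks)]
    return sum(partial_sums)  # stands in for the adder circuit
```

Each partial sum depends only on data inside one block, so the only inter-block step is the last addition.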

[0034] Finally, the data transfer paths 42 to 45 between adjacent element processors in FIG. 1 are described. The data dependency spanning two vertically adjacent blocks in field DCT processing, described in the section "Problems to be solved by the invention" above, can be resolved by exchanging 32 data items, half of an 8 × 8 block, between two adjacent element processors, for example PE0 and PE1. For this purpose, transfer paths 42 to 45 are provided that allow data to be exchanged with the data memory of the adjacent element processor during field DCT / inverse DCT (IDCT) processing. For field DCT/IDCT processing, half of each block's data is exchanged in advance over these transfer paths 42 to 45, and the DCT/IDCT processing is then executed.
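The half-block exchange can be illustrated with a small sketch. The passage does not say which 32 data items are exchanged; the version below assumes the usual field split, in which the even lines of the two vertically adjacent blocks form one field block and the odd lines form the other. The function name and row layout are hypothetical.

```python
def exchange_for_field_dct(upper, lower):
    """Regroup two vertically adjacent 8x8 blocks into field blocks.

    upper and lower are lists of 8 rows held by two adjacent PEs.
    Assumption: even lines go to the upper PE, odd lines to the lower PE.
    Returns the two field blocks and the number of samples sent one way.
    """
    sent_to_lower = upper[1::2]   # odd lines leave the upper PE (4 rows)
    sent_to_upper = lower[0::2]   # even lines leave the lower PE (4 rows)
    top_field = upper[0::2] + sent_to_upper
    bottom_field = sent_to_lower + lower[1::2]
    samples_moved = sum(len(row) for row in sent_to_lower)
    return top_field, bottom_field, samples_moved
```

Under this assumption each PE sends four rows of eight samples, which matches the 32 data items mentioned in the text.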

[0035] Configuration of the arithmetic unit
FIG. 3 shows the internal configuration of the arithmetic unit according to one embodiment of the present invention. The arithmetic unit comprises a first selector 131, a second selector 132, an extended ALU (EALU) 121, two pipeline registers 141 and 142, a third selector 133, a multiplier 122, a pipeline register 143, a fourth selector 134, a fifth selector 135, an accumulator 123 with shift function, a pipeline register 144, a sixth selector 136, and a pipeline memory 124.

[0036] Configuration of the extended ALU 121
The extended ALU 121 comprises a sign inverter 301, an adder 302, a subtractor 303, a logic operation unit 304, a sign determiner 305, and data selectors 306 and 307, connected as shown in FIG. 4. The output selected by the data selector 131 shown in FIG. 3 is applied as X, and the output selected by the data selector 132 is applied as Y. The logic operation unit 304 performs logic operations on data X and data Y, such as negation, logical OR, logical AND, and exclusive OR. The sign inverter 301 inverts the sign of the input data X and applies it to the data selector 306. When the data selector 306 outputs the sign-inverted data −X, the adder 302 adds −X to the input data Y and outputs (Y−X). When the data selector 306 outputs the input data X, the adder 302 outputs the sum (Y+X) of the input data Y and the input data X. The subtractor 303 computes (X−Y). The sign determiner 305 determines whether its input data is positive or negative.

[0037] In addition to the functions of an ordinary ALU, namely addition, subtraction, and logic operations, the extended ALU 121 described above provides magnitude comparison, absolute-difference, and butterfly operations as extended functions. These functions are described below.
(A) Addition
The adder 302 adds the data X and Y input to the input terminals 311 and 312. In this case the data selector 306 is set so that the input data X is applied to the adder 302 without sign inversion. The addition result (X+Y) is output from the data selector 307. This addition result A is applied to the pipeline register 141 shown in FIG. 3 via the output terminal 313.
(B) Subtraction
The subtractor 303 subtracts Y from X, the data input to the input terminals 311 and 312. This subtraction result B is applied to the pipeline register 142 shown in FIG. 3 via the output terminal 314.
(C) Logic operations
The logic operation unit 304 performs logic operations such as negation, logical OR, logical AND, and exclusive OR on the data X and Y input to the input terminals 311 and 312. The result is output to the pipeline register 141 via the data selector 307 and the output terminal 313.
(D) Magnitude comparison: min(X, Y), max(X, Y)
The data X and Y input to the input terminals 311 and 312 are compared using the sign inverter 301, the adder 302, the subtractor 303, and the sign determiner 305. In this case the data selector 306 is set so that the data (−X), sign-inverted by the sign inverter 301, is input to the adder 302. The sign determiner 305 receives (Y−X) from the adder 302 and (X−Y) from the subtractor 303, and outputs the following via the data selector 307 and the output terminal 313:
(a) as the minimum value min(X, Y): X when (Y−X) ≥ 0, and Y when (X−Y) > 0;
(b) as the maximum value max(X, Y): Y when (Y−X) ≥ 0, and X when (X−Y) > 0.
The minimum and maximum values cannot be output at the same time.
(E) Absolute difference: |X−Y|
The absolute difference of the data X and Y input to the input terminals 311 and 312 is computed using the sign inverter 301, the adder 302, the subtractor 303, and the sign determiner 305. In this case the data selector 306 is set so that the data (−X), sign-inverted by the sign inverter 301, is input to the adder 302. The sign determiner 305 receives (Y−X) from the adder 302 and (X−Y) from the subtractor 303, and outputs (Y−X) when (Y−X) ≥ 0 and (X−Y) when (X−Y) > 0, via the data selector 307 and the output terminal 313.
(F) Butterfly operation
The butterfly operation on the data X and Y input to the input terminals 311 and 312 is performed using the adder 302 and the subtractor 303. In this case the data selector 306 is set so that the input data X is input to the adder 302. The data selector 307 outputs the addition result (X+Y) of the adder 302 to the output terminal 313, and the subtraction result (X−Y) of the subtractor 303 is output to the output terminal 314.
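A behavioural sketch of the extended ALU functions may help in following the selection rules above. It is not a description of the circuit: the function name, the operation mnemonics, and the use of `None` for an unused output are assumptions, and the logic operations are omitted for brevity.

```python
def ealu(op, x, y):
    """Return (output_313, output_314) for one extended-ALU operation.

    None marks an output terminal that the operation does not drive.
    """
    y_minus_x = y + (-x)  # adder fed with the sign-inverted X
    x_minus_y = x - y     # subtractor
    if op == "add":
        return x + y, None
    if op == "sub":
        return None, x_minus_y
    if op == "min":
        return (x if y_minus_x >= 0 else y), None
    if op == "max":
        return (y if y_minus_x >= 0 else x), None
    if op == "absdiff":
        return (y_minus_x if y_minus_x >= 0 else x_minus_y), None
    if op == "butterfly":
        return x + y, x_minus_y  # the only operation that drives both outputs
    raise ValueError("unknown operation: " + op)
```

The min, max, and absolute-difference cases all rely on the same two differences and a sign test, which is why they share the adder, the subtractor, and the sign determiner.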

[0038] Next, an outline of the pipeline processing in the arithmetic unit shown in FIG. 3 is given with reference to FIG. 5. As a simple example, suppose that the extended ALU 121 performs addition in the first step (stage), the multiplier 122 performs multiplication in the second step, and the accumulator 123 with shift function performs accumulation in the third step, and that each step completes within one clock cycle. Pipeline registers 141 and 142 follow the extended ALU 121, pipeline register 143 follows the multiplier 122, and pipeline register 144 follows the accumulator 123 with shift function. Therefore, in clock cycle (k−2), the extended ALU 121 performs an addition and stores the result in the pipeline register 141. In clock cycle (k−1), the multiplier 122 multiplies using the addition result held in the pipeline register 141 and stores its result in the pipeline register 143, while the extended ALU 121 performs a new addition and stores it in the pipeline register 141. In clock cycle k, the accumulator 123 with shift function accumulates using the multiplication result stored in the pipeline register 143 in clock cycle (k−1) and stores its result in the pipeline register 144; the multiplier 122 multiplies the addition result that the extended ALU 121 stored in the pipeline register 141 in clock cycle (k−1) and stores its result in the pipeline register 143; and the extended ALU 121 performs a further addition and stores it in the pipeline register 141. Thereafter, addition, multiplication, and accumulation proceed simultaneously in every clock cycle in the same way. In this manner, addition, multiplication, and accumulation are carried out in sequence and in parallel within the arithmetic unit.
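The three-stage overlap can be simulated cycle by cycle. The sketch below is a simplified model under stated assumptions: only pipeline registers 141, 143, and 144 are modelled, the multiplier takes one coefficient per input pair, and the function name and trace format are invented for the example.

```python
def run_pipeline(pairs, coeffs):
    """Simulate add -> multiply -> accumulate with one register after each stage.

    pairs:  list of (x, y) operands for the adder stage
    coeffs: one multiplier coefficient per pair
    Returns the final accumulator value and a per-cycle trace of (r141, r143, r144).
    """
    r141 = None  # register after the adder stage
    r143 = None  # register after the multiplier
    r144 = 0     # register after the accumulator
    trace = []
    n = len(pairs)
    for cycle in range(n + 2):  # two extra cycles drain the pipeline
        # Every stage reads the register contents left by the previous cycle.
        next_r144 = r144 + r143 if r143 is not None else r144
        next_r143 = r141 * coeffs[cycle - 1] if r141 is not None else None
        next_r141 = sum(pairs[cycle]) if cycle < n else None
        r141, r143, r144 = next_r141, next_r143, next_r144
        trace.append((r141, r143, r144))
    return r144, trace
```

With three input pairs, the third cycle is the first in which all three stages hold live data, corresponding to clock cycle k in the description.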

[0039] This arithmetic unit has the same configuration as the arithmetic unit described in "Configuration of a processing-adaptive arithmetic pipeline", filed by the present applicant at the same time as this application, except for the point below, and is likewise suited to the element processes of an image codec. The difference is that in the present invention a pipeline memory 124 is provided inside the arithmetic unit, so that the discrete cosine transform / inverse discrete cosine transform (DCT/IDCT) of one 8 × 8 block is performed using only one arithmetic unit. As a result, all element processes of the image codec can be realized under "single instruction stream / multiple data stream: SIMD" control.

[0040] According to the inventor's analysis, element processes that perform multiplications in succession, or that compute a sum of multiplication results, occur infrequently in an image codec. The coupling between arithmetic units was therefore eliminated to reduce the wiring between arithmetic units. Because in the present invention all element processes of the image codec can be realized under "single instruction stream / multiple data stream: SIMD" control, the coefficient memory 51 is shared by all element processors, as shown in FIG. 1.

[0041] Configuration of the data memory
FIG. 2 shows the configuration of the data memory according to one embodiment of the present invention. The data memory consists of an input/output (I/O) buffer 91, a first frame buffer 0 (92), a second frame buffer 1 (93), a working buffer 94, and an interconnection network 95. Each is described below.
(a) I/O buffer 91
The I/O buffer 91 is divided into three memory banks, called banks 0, 1, and 2 (101, 102, 103), each with enough capacity to store at least one 8 × 8 block. Some image codec processing may require more capacity. By serving as both the input data memory and the output data memory described in the "Prior art" section, the I/O buffer 91 reduces the number of memory banks by one. That is, the data-input and data-output buffers used by the calculation stage are merged into one.

[0042] The operation of the I/O buffer 91 is described below. In the I/O buffer 91, the memory bank for arithmetic processing, the memory bank for input, and the memory bank for output are switched for every macroblock, the unit of processing, as shown below. The memory bank for arithmetic processing is connected to the interconnection network 95, the memory bank for input is connected to the input data bus 31, and the memory bank for output is connected to the output data bus 32.

【００４３】[0043]

[Table 1]
1. While processing the 3k-th macroblock
   Memory bank for arithmetic processing: bank 0 (101)
   Memory bank for input: bank 1 (102)
   Memory bank for output: bank 2 (103)
2. While processing the (3k+1)-th macroblock
   Memory bank for arithmetic processing: bank 1 (102)
   Memory bank for input: bank 2 (103)
   Memory bank for output: bank 0 (101)
3. While processing the (3k+2)-th macroblock
   Memory bank for arithmetic processing: bank 2 (103)
   Memory bank for input: bank 0 (101)
   Memory bank for output: bank 1 (102)
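The bank rotation in Table 1 follows a simple modulo-3 pattern, sketched below. The function name and role labels are hypothetical, and the third entry of the table is read as the (3k+2)-th macroblock, which is the only reading consistent with the rotation.

```python
def io_bank_roles(macroblock_index):
    """Roles of the three I/O buffer banks while a given macroblock is processed."""
    return {
        "compute": macroblock_index % 3,        # connected to the interconnection network
        "input": (macroblock_index + 1) % 3,    # connected to the input data bus
        "output": (macroblock_index + 2) % 3,   # connected to the output data bus
    }
```

The bank filled as the input bank during one macroblock becomes the arithmetic bank for the next, and the arithmetic bank becomes the output bank, so no data needs to be copied between banks.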

[0044] Because the I/O buffer 91 operates as described above, it does not matter if the input stage, the calculation stage, and the output stage run at different operating frequencies. Moreover, since the stages never contend for access to the I/O buffer 91, they can operate fully in parallel.

[0045] (b) First frame buffer 0 (92) and second frame buffer 1 (93)
The first frame buffer 0 (92) and the second frame buffer 1 (93) are each divided into two memory banks, called banks 0 and 1 (104 and 105 for frame buffer 0; 106 and 107 for frame buffer 1), each with enough capacity to store at least one 8 × 8 block. Some image codec processing may require more capacity. The first frame buffer 0 (92) and the second frame buffer 1 (93) store the macroblocks of past and future frames that are needed for motion compensation of the image data. These macroblocks are input from the macroblock input/output terminal 23 or the macroblock input terminal 24 of the frame memory via the data buses 33 and 34. Because frame buffer 0 (92) and frame buffer 1 (93) are double-buffered, it does not matter if the input stage and the calculation stage run at different operating frequencies.

[0046] Moreover, since the stages never contend for access to the frame buffers, the input stage and the calculation stage can operate fully in parallel. In addition, when a macroblock that has undergone image codec processing is needed for motion compensation of the image data of other macroblocks, the first frame buffer 0 (92) stores that macroblock and outputs it to the macroblock input/output terminal 23 of the frame memory via the data bus 33. Because the first frame buffer 0 (92) is double-buffered, it does not matter if the calculation stage and the output stage run at different operating frequencies. And since the stages never contend for access to the frame buffer, the calculation stage and the output stage can operate fully in parallel.

[0047] (c) Working buffer 94
The working buffer 94 is a buffer for storing intermediate operation results, with enough capacity to store at least one 8 × 8 block. Some image codec processing may require the working buffer 94 to have more capacity.

[0048] (d) Interconnection network 95
The interconnection network 95 is the network that connects the four kinds of buffers described above, namely the I/O buffer 91, the first frame buffer 0 (92), the second frame buffer 1 (93), and the working buffer 94, to the arithmetic unit 6k. The interconnection network 95 may have any configuration, but at a minimum it must be able, in every clock cycle, to supply two data items from any of the buffers 91 to 94 to the arithmetic unit 6k while simultaneously storing one data item from the arithmetic unit 6k into any of the buffers 91 to 94. It must likewise be able, in every clock cycle, to supply one data item from any of the buffers 91 to 94 to the arithmetic unit 6k while simultaneously storing two data items from the arithmetic unit 6k into any of the buffers 91 to 94. In the element processes of the image codec, it never happens that, within one clock cycle, two data items are supplied from the same memory bank (bank 0 or bank 1) of the first frame buffer 0 (92) or the second frame buffer 1 (93), or that two data items are stored into the same memory bank of those frame buffers. All memory banks of the first frame buffer 0 (92) and the second frame buffer 1 (93) may therefore be single-ported. In contrast, each memory bank of the I/O buffer 91, and the working buffer 94, must be multi-ported with at least two ports.

[0049] An image codec processor has been described above as one embodiment of the present invention. The invention is not limited to the embodiment described, however, and may take other configurations that provide similar structure and processing.

【００５０】[0050]

[Effects of the Invention] According to the present invention, a circuit that adds the operation results of adjacent element processors is provided, and data can be transferred between the data memories of adjacent element processors PE, so that processing in which data dependencies exist between blocks of the image codec can be realized. This makes it possible to provide, for each block, an element processor having a double-buffered data memory. As a result, the degree of parallelism can be increased without making the number of data memory ports or the interconnection network more complex than before, and processing performance improves.

[0051] Furthermore, according to the present invention, the memory bank for arithmetic processing, which in the conventional configuration existed separately in the input data memory and in the output data memory, can be shared in the I/O buffer within the data memory, so that three memory banks' worth of data memory suffice. Hence, if the data memory capacity of one memory bank is m, the total data memory capacity is 3m, which is 3/4 of the capacity required by the conventional configuration.

[Brief description of drawings]

FIG. 1 is a configuration diagram of an image codec processor as an embodiment of the present invention.

FIG. 2 is a configuration diagram of the element processor shown in FIG. 1.

FIG. 3 is a configuration diagram of the arithmetic unit shown in FIG. 1.

FIG. 4 is a configuration diagram of the extended ALU shown in FIG. 3.

FIG. 5 is a graph showing the pipeline processing in the arithmetic unit shown in FIG. 3.

FIG. 6 is a diagram showing the concept of macroblocks and blocks in a (4:2:2) signal based on the CCIR 601 format.

FIG. 7 is a diagram showing the configuration of a conventional image codec processor.

FIG. 8 is a diagram showing a macroblock (luminance component only) during frame DCT processing.

FIG. 9 is a diagram showing a macroblock (luminance component only) during field DCT processing.

[Explanation of symbols]

1k (11 to 18): element processors (PE) of the embodiment of the present invention
21: macroblock input terminal
22: macroblock output terminal
23: macroblock input/output terminal of the frame memory
24: macroblock input terminal of the frame memory
31: input data bus
32: output data bus
33, 34: data buses
41: adder circuit
42 to 45: data transfer paths between adjacent element processors
51: coefficient memory
6k (61 to 68): arithmetic units
7k (71 to 78): data memories
81: block input terminal
82: block output terminal
83: block input/output terminal of the frame memory
84: block input terminal of the frame memory
91: I/O buffer
92: first frame buffer 0
93: second frame buffer 1
94: working buffer
95: interconnection network
101: bank 0 of the I/O buffer
102: bank 1 of the I/O buffer
103: bank 2 of the I/O buffer
104: bank 0 of frame buffer 0
105: bank 1 of frame buffer 0
106: bank 0 of frame buffer 1
107: bank 1 of frame buffer 1
111: data selector
121: extended ALU (EALU)
122: multiplier
123: accumulator with shift function
124: pipeline memory
131 to 136: data selectors
141 to 144: pipeline registers

Continuation of the front page: (51) Int.Cl.⁵ (identification symbol, office reference number, FI, technical indication location) H04N 7/133 Z

Claims

[Claims]

1. A processor for an image codec of the "single instruction stream / multiple data stream: SIMD" control type, which takes as one processing unit a macroblock composed of a plurality of blocks each consisting of m × n image data, and which adaptively controls, with a single instruction stream, multiple data streams for signal processing spanning the image data of a plurality of blocks and for signal processing of the image data within one block, wherein: a plurality of pairs of mutually adjacent element processors, each pair connected by a data transfer path between adjacent element processors, are provided corresponding to the macroblock; a common coefficient memory that supplies coefficients in common to the plurality of element processors is provided; a common adder circuit that adds the operation results is provided at the outputs of the plurality of pairs of element processors that perform the signal processing spanning the image data of the plurality of blocks; and the image data of the macroblock to be processed is input to the element processors.

2. The processor for an image codec according to claim 1, wherein each of the element processors is configured to pipeline three stages including a data input stage, a calculation stage, and a data output stage.

3. The image codec processor according to claim 3, wherein each of the element processors comprises: an I/O buffer having three banks corresponding to the three stages consisting of the data input stage, the calculation stage, and the data output stage; a first data memory having at least two banks arranged in parallel that can operate alternately; a second data memory having at least two banks arranged in parallel that can operate alternately; an interconnection network interconnecting the I/O buffer and the first and second data memories; and an arithmetic unit connected to the interconnection network.

4. The image codec processor according to claim 3, wherein the arithmetic unit comprises: a two-input, two-output extended arithmetic logic operation processing unit that performs addition, subtraction, various logic operations, magnitude comparison, absolute difference operation, and butterfly operation; a multiplication unit that multiplies a coefficient from the coefficient memory by the output of the extended arithmetic logic operation processing unit; and an accumulation unit that accumulates the multiplication results.

5. The image codec processor according to claim 4, wherein a pipeline register is provided after the extended arithmetic logic operation processing unit, a pipeline register is provided after the multiplication unit, and a pipeline register is provided after the accumulation unit, and wherein pipeline processing corresponding to the three stages consisting of the data input stage, the calculation stage, and the data output stage is performed.

6. The image codec processor according to claim 5, wherein the signal processing spanning the image data of the plurality of blocks is field image signal processing, and the signal processing on the image data within the one block is frame image processing.
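By way of illustration only, the arithmetic unit recited in claims 4 and 5 can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the claimed circuit: the operation names ("add", "absdiff", "butterfly", and so on), the class name PipelinedArithmeticUnit, the choice to feed only the first ALU output to the multiplier, and the use of Python integers are all hypothetical. The model keeps one register after each of the three units, so a result reaches the accumulator two clocks after its operands enter the extended ALU.

```python
from typing import List, Optional, Tuple


def extended_alu(a: int, b: int, op: str) -> Tuple[int, int]:
    """Two-input, two-output extended ALU (operation names are hypothetical)."""
    if op == "add":
        return a + b, 0
    if op == "sub":
        return a - b, 0
    if op == "absdiff":
        return abs(a - b), 0
    if op == "compare":
        # Magnitude comparison: larger value on output 0, smaller on output 1.
        return max(a, b), min(a, b)
    if op == "butterfly":
        # Butterfly: sum and difference produced together.
        return a + b, a - b
    raise ValueError("unknown operation: " + op)


class PipelinedArithmeticUnit:
    """Extended ALU -> multiplier -> accumulator, with a register after each."""

    def __init__(self) -> None:
        self.alu_reg: Optional[Tuple[int, int]] = None  # (ALU output, coefficient)
        self.mul_reg: Optional[int] = None              # product awaiting accumulation
        self.acc_reg: int = 0                           # running sum

    def clock(self, operands: Optional[Tuple[int, int, str, int]] = None) -> int:
        # Registers are updated back to front so that each unit consumes the
        # value its upstream register held during the previous clock.
        if self.mul_reg is not None:
            self.acc_reg += self.mul_reg
        if self.alu_reg is None:
            self.mul_reg = None
        else:
            value, coeff = self.alu_reg
            self.mul_reg = value * coeff
        if operands is None:
            self.alu_reg = None
        else:
            a, b, op, coeff = operands
            # Assumption: only the first ALU output is passed to the multiplier.
            self.alu_reg = (extended_alu(a, b, op)[0], coeff)
        return self.acc_reg


def run(pairs: List[Tuple[int, int]], op: str, coeffs: List[int]) -> int:
    """Feed one operand pair per clock, then clock twice more to drain."""
    unit = PipelinedArithmeticUnit()
    for (a, b), c in zip(pairs, coeffs):
        unit.clock((a, b, op, c))
    unit.clock()
    return unit.clock()
```

With the "absdiff" operation and unit coefficients, run accumulates a sum of absolute differences, the kind of measure used in block matching; with "add" and non-unit coefficients it accumulates a weighted sum, as in a transform.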

7. An image codec processor that takes a macroblock as one processing unit of image codec processing, wherein a plurality of element processors are provided in correspondence with the blocks constituting the macroblock, and the image codec processing within each macroblock is performed in parallel under "single instruction stream / multiple data stream (SIMD)" control using the plurality of element processors.

8. An image codec processor that takes a macroblock as one processing unit of image codec processing and performs pipeline processing in a three-stage configuration consisting of an input stage, a calculation stage, and an output stage, wherein each element processor provided in correspondence with a block constituting the macroblock has an arithmetic unit and a data memory of double-buffer configuration, and the image codec processing within each block is performed in parallel under "single instruction stream / multiple data stream (SIMD)" control using the plurality of element processors.

9. An image codec processor that takes a macroblock as one processing unit of image codec processing and performs pipeline processing in a three-stage configuration consisting of an input stage, a calculation stage, and an output stage, wherein each element processor provided in correspondence with a block constituting the macroblock has an arithmetic unit and a data memory of double-buffer configuration, a circuit for adding the operation results of adjacent element processors is further provided, and the image codec processing within each block is performed in parallel under "single instruction stream / multiple data stream (SIMD)" control using the plurality of element processors.

10. An image codec processor that takes a macroblock as one processing unit of image codec processing and performs pipeline processing in a three-stage configuration consisting of an input stage, a calculation stage, and an output stage, wherein each element processor provided in correspondence with a block constituting the macroblock has an arithmetic unit and a data memory of double-buffer configuration, a circuit for adding the operation results of adjacent element processors is further provided, data transfer is enabled between the data memories of adjacent element processors, and the image codec processing within each block is performed in parallel under "single instruction stream / multiple data stream (SIMD)" control using the plurality of element processors.

11. The image codec processor according to claim 7, wherein each of the element processors comprises: an I/O buffer having three banks corresponding to the three stages consisting of the data input stage, the calculation stage, and the data output stage; a first data memory having at least two banks arranged in parallel that can operate alternately; a second data memory having at least two banks arranged in parallel that can operate alternately; an interconnection network interconnecting the I/O buffer and the first and second data memories; and an arithmetic unit connected to the interconnection network.

12. The image codec processor according to claim 11, wherein the arithmetic unit comprises: a two-input, two-output extended arithmetic logic operation processing unit that performs addition, subtraction, various logic operations, magnitude comparison, absolute difference operation, and butterfly operation; a multiplication unit that multiplies a coefficient from the coefficient memory by the output of the extended arithmetic logic operation processing unit; and an accumulation unit that accumulates the multiplication results.

13. The image codec processor according to claim 12, wherein a pipeline register is provided after the extended arithmetic logic operation processing unit, a pipeline register is provided after the multiplication unit, and a pipeline register is provided after the accumulation unit, and wherein pipeline processing corresponding to the three stages consisting of the data input stage, the calculation stage, and the data output stage is performed.

14. The image codec processor according to claim 13, wherein the image codec processor takes as one processing unit a macroblock composed of a plurality of blocks each composed of m×n image data, and adaptively performs signal processing spanning the image data of a plurality of blocks and signal processing on the image data within one block under "single instruction stream / multiple data stream (SIMD)" control, in which multiple data streams are controlled with a single instruction stream.

15. The image codec processor according to claim 14, wherein the signal processing spanning the image data of the plurality of blocks is field image signal processing, and the signal processing on the image data within the one block is frame image processing.

16. The image codec processor according to claim 15, wherein: pairs of mutually adjacent element processors, each pair connected by a data transfer path between the adjacent element processors, are provided corresponding to the macroblock; a common coefficient memory that supplies coefficients in common to the plurality of element processors is provided; a common addition circuit that adds the operation results is provided at the outputs of the plurality of pairs of element processors that perform the signal processing spanning the image data of the plurality of blocks; and the image data to be processed is input to the element processors in macroblock units.
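
By way of illustration only, the overall arrangement recited in the claims above can be sketched in software: element processors that apply the same instruction to their own block, a coefficient memory shared by all of them, a common addition circuit that combines their results, and a three-stage pipeline that handles one macroblock per stage. This is a minimal sketch under stated assumptions, not the claimed hardware. The function names (simd_step, common_adder, run_pipeline), the use of four blocks per macroblock, the choice of a dot product as the shared instruction, and the rotation of three buffer banks by period index are all hypothetical. In hardware the three stages would operate concurrently; here they are evaluated one after another within each period.

```python
from typing import List, Optional, Tuple

Block = List[int]          # one block, flattened to m*n samples
Macroblock = List[Block]   # one macroblock = several blocks
Result = Tuple[List[int], int]


def simd_step(blocks: Macroblock, coeffs: List[int]) -> List[int]:
    """Every element processor runs the same instruction on its own block.

    The instruction assumed here is a dot product with the shared coefficients.
    """
    return [sum(x * c for x, c in zip(block, coeffs)) for block in blocks]


def common_adder(partials: List[int]) -> int:
    """Common addition circuit: adds the element processors' results."""
    return sum(partials)


def run_pipeline(macroblocks: List[Macroblock], coeffs: List[int]) -> List[Result]:
    """Three-stage pipeline (input, calculation, output) over macroblocks."""
    banks: List[Optional[object]] = [None, None, None]  # three rotating banks
    outputs: List[Result] = []
    n = len(macroblocks)
    for t in range(n + 2):  # two extra periods drain the pipeline
        in_bank, calc_bank, out_bank = t % 3, (t - 1) % 3, (t - 2) % 3

        # Output stage: emit the result calculated in the previous period.
        if banks[out_bank] is not None:
            outputs.append(banks[out_bank])
            banks[out_bank] = None

        # Calculation stage: process the macroblock loaded in the previous period.
        if banks[calc_bank] is not None:
            partials = simd_step(banks[calc_bank], coeffs)
            banks[calc_bank] = (partials, common_adder(partials))

        # Input stage: load the next macroblock, if any remain.
        if t < n:
            banks[in_bank] = macroblocks[t]
    return outputs
```

In this sketch the per-block partial results stand for processing confined to one block, while the value from common_adder stands for processing that spans several blocks. Each returned entry holds both, in input order.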