JP4516020B2

JP4516020B2 - Image processing device

Info

Publication number: JP4516020B2
Application number: JP2005508742A
Authority: JP
Inventors: 博 ▲高▼柳; 伸尋関; 修毛利; 昭寛牧野; 正裕三浦
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2003-08-28
Filing date: 2003-08-28
Publication date: 2010-08-04
Anticipated expiration: 2023-08-28
Also published as: WO2005025230A1; JPWO2005025230A1

Description

The present invention relates to an image processing apparatus, and more particularly to a technique effective when applied to MPEG image compression / decompression.

As a technique studied by the present inventor, for example, the following technique can be considered in MPEG (Moving Picture Experts Group) image compression / decompression.
In video compression standards such as ISO/IEC 14496-2 (MPEG-4), ISO/IEC 13818-2 (MPEG-2), and ISO/IEC 11172-2 (MPEG-1), a digitized image is divided into blocks; motion vector detection, discrete cosine transform (DCT), quantization, and AC/DC prediction are performed for each block, and the result is Huffman coded to compress the image data.
Of these steps, motion vector detection takes a 16 × 16 pixel macroblock (shift block) in an image frame and searches the macroblock range of a temporally preceding or following image frame for the 16 × 16 pixel position whose difference from the current frame is smallest. The subsequent DCT, quantization, AC/DC prediction, and Huffman coding are then performed using that position vector and the difference (frame difference), which enables a high video compression ratio.
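The search described above can be sketched as an exhaustive block-matching loop. This is only an illustration: the sum of absolute differences (SAD) as the difference measure, the function names `sad` and `full_search`, and the choice to skip candidate windows that fall outside the reference frame are assumptions, not details given in the text.

```python
def sad(cur, ref, x0, y0, size):
    """Sum of absolute differences between `cur` and the window of `ref` at (x0, y0)."""
    return sum(
        abs(cur[y][x] - ref[y0 + y][x0 + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size=16, search=15):
    """Return ((dx, dy), cost) of the best match for the block located at (bx, by).

    `cur` is the size x size block of the current frame, `ref` is the whole
    reference frame, and `search` is the +/- displacement range in pixels.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # candidate window falls outside the reference frame
            cost = sad(cur, ref, x0, y0, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best
```

The returned displacement plays the role of the position vector, and the per-pixel differences at that displacement are what the later DCT stage would receive.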
Decompression of the compressed data is realized by reversing the compression procedure, that is, by Huffman decoding, AC/DC prediction, inverse quantization, and inverse DCT, and by generating a compensation image from the motion vector information.
In addition, image processing including video compression often consists of many repetitions of relatively simple calculation algorithms, so there is a large amount of data parallelism for the same instruction. A SIMD (Single Instruction Multiple Data stream) parallel computing method is therefore often suitable for speeding up image processing.
One implementation example of a SIMD parallel computing architecture is the general-purpose neurocomputer. Its architecture comprises a plurality of processors and one control system, and the control system operates by broadcasting common instructions and data to all the processors. Each processor includes a local memory and arithmetic units (multiplier, ALU, shifter, and so on). The control system includes a rewritable program memory and a global memory. Data is exchanged between the control system and the processors over a broadcast data bus, which carries data from the control system to all the processors, and a common bus, over which any one processor designated by an address signal sends data to the control system via a tristate buffer.
The data that the control system sends to all the processors over the broadcast data bus can be selected from either data in the memory of the control system or data that the control system has received from any one processor over the common bus.
An instruction from the control system can also be executed by only the one processor designated by an address signal. This allows the control system to initialize the local memory of each processor, and enables one-to-N data transmission (N being an arbitrary natural number) between the control system and the processors.
In the general-purpose neurocomputer, each of the processors controlled by a single instruction broadcast from the control system serves as a unit that imitates a neuron (nerve cell). The operation of a neural network is imitated by having each processor operate on the input data broadcast to all the processors via the control system, using the weight data held in its own local memory. In other words, by rewriting the program of the control system, neural algorithms typified by backpropagation, other neural algorithms, and non-neural algorithms can all be computed in parallel in a general-purpose manner.
When the general-purpose neurocomputer is applied to motion vector detection in image processing, the processors can each serve as the calculation unit for one shift block. That is, the shift blocks (a plurality of 16 × 16 pixel blocks) in the motion vector detection range of the image one frame earlier are initially loaded into the local memories of the processors. The motion vector detection image of the current frame (a 16 × 16 pixel macroblock) is broadcast from the control system to all the processors. Each processor calculates the frame difference between the broadcast data and the data in its local memory. By comparing the frame differences of all the processors, the control system can detect, as the motion vector position, the image position of the shift block loaded into the processor whose frame difference is smallest.
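A small software model may make the broadcast scheme easier to follow. It is a sketch under stated assumptions: the `Processor` class, the `detect` function, the one-pixel-per-broadcast granularity, and the use of an absolute difference are all hypothetical, and the inner loop over processors stands in for what the hardware would do simultaneously.

```python
class Processor:
    """One processing element: a local memory plus a difference accumulator."""

    def __init__(self, shift_block):
        self.local = shift_block  # candidate (shift) block, flattened to a list
        self.acc = 0

    def step(self, index, value):
        # The same instruction runs on every processor:
        # accumulate |broadcast value - local value|.
        self.acc += abs(value - self.local[index])


def detect(current_block, shift_blocks):
    """Return (index of the best processor, its accumulated frame difference)."""
    processors = [Processor(b) for b in shift_blocks]  # initialise local memories
    for i, pixel in enumerate(current_block):           # broadcast one pixel at a time
        for p in processors:                            # conceptually simultaneous
            p.step(i, pixel)
    # The control system reads each accumulator in turn (common bus) and compares.
    results = [p.acc for p in processors]
    best = min(range(len(results)), key=results.__getitem__)
    return best, results[best]
```

For example, `detect([1, 2, 3, 4], [[0, 0, 0, 0], [1, 2, 3, 5], [4, 3, 2, 1]])` selects processor 1, whose shift block differs from the broadcast block by a total of 1.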

By the way, as a result of the study of the MPEG image compression / decompression technique as described above, the following has been clarified.
The amount of computation required for motion vector detection and for DCT / inverse DCT in MPEG image compression / decompression is extremely large. In the simplest form of motion vector detection, “motion-compensated interframe coding”, the detection range is ±15 pixels around the macroblock one frame earlier, so there are 961 shift blocks to compare against the macroblock of the current frame. The amount of computation for a VGA image (640 × 480 pixels) is therefore 278 mega-operations or more per frame, or about 8 giga-operations per second at 30 frames per second. DCT and inverse DCT likewise require 0.8 giga-operations per second for VGA at 30 frames per second.
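The order of magnitude of these figures can be checked with a short calculation. The sketch below assumes one difference operation per pixel per candidate position and a full set of 961 candidates for every macroblock; the text does not state its exact counting convention, so the result (about 295 million operations per frame and about 8.9 billion per second) is consistent with, but not identical to, the quoted "278 mega or more" and "8 giga". The function name and parameters are illustrative.

```python
def motion_search_ops(width=640, height=480, block=16, search=15, fps=30):
    """Count pixel-difference operations for an exhaustive motion search."""
    candidates = (2 * search + 1) ** 2                   # 31 * 31 = 961 shift blocks
    macroblocks = (width // block) * (height // block)   # 40 * 30 = 1200 for VGA
    per_frame = candidates * block * block * macroblocks
    return candidates, per_frame, per_frame * fps


print(motion_search_ops())  # (961, 295219200, 8856576000)
```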
If this processing were performed by a single arithmetic unit of a semiconductor integrated circuit device, an operating frequency on the order of 10 GHz would be required, which makes it very difficult to mount the device as-is in portable consumer electronics that run on low power, such as digital video cameras.
When motion detection is performed using the architecture of the general-purpose neurocomputer, parallel operation across the processors can be achieved by storing the comparison image information for a macroblock in the local memory of each processor, broadcasting the macroblock information, and computing the difference from the information in the local memory.
Similarly, in DCT / inverse DCT processing, the computation can be parallelized by storing block-unit difference information in the local memory of each processor and broadcasting the DCT / inverse DCT coefficients.
However, if both the motion detection and the DCT processing are implemented on a single SIMD parallel computer, the difference block information produced by the processor that detected the motion must be moved to other processors for DCT processing, which lowers overall processing performance. In addition, motion detection requires an absolute-difference calculator in each processor, whereas DCT requires a multiplier. Processing both on one SIMD parallel computer therefore requires every processor to contain both an absolute-difference calculator and a multiplier, which increases the overall gate count.
Accordingly, an object of the present invention is to provide an image processing apparatus that has a small circuit configuration and can operate at high speed in image processing such as MPEG image compression / decompression.
The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.
A brief outline of representative inventions disclosed in the present application is as follows.
That is, the image processing apparatus according to the present invention connects a first SIMD computer, in which a plurality of processors each including at least a difference calculator and a local memory operate on a single instruction from a first control unit and which has a broadcast data bus for transmitting data from the first control unit to all the processors, to one or more second SIMD computers, in each of which a plurality of processors each including at least a multiplier operate on a single instruction from a second control unit. The first SIMD computer performs motion detection in image processing, and the one or more second SIMD computers perform DCT, inverse DCT, quantization, or inverse quantization in image processing.
In this configuration, the calculation results of the first SIMD computer are transmitted in parallel, via buffers, to the processors in the second SIMD computer, and those processors each process the image in parallel in units of blocks.
Furthermore, header information indicating the attributes of a block (block displacement vector information, I/P/B frame information, and so on) is attached to each block-unit data transfer, so that each processor of the second SIMD computer can examine the header information and efficiently perform the processing suited to that block.
The effects obtained by representative inventions disclosed in the present application are briefly as follows.
(1) In MPEG video compression, motion detection and processing such as DCT and quantization are handled by separate SIMD computers. Each SIMD computer can therefore be built with only the minimum arithmetic units and local memory needed for its processing, and image compression can be realized with a number of processors matched to the performance each process requires.
(2) Motion detection, DCT / quantization, and so on can be carried out as pipeline stages, one per process, which improves performance.
(3) Because image processing such as video compression / decompression, as typified by MPEG, can be performed in real time with SIMD computers having a relatively small number of processors, an image processing function with a higher pixel density than before can be realized in a semiconductor integrated circuit (electronic component) that can be mounted in low-power portable consumer electronics such as digital video cameras.

FIG. 1 is a block diagram showing a configuration of an image processing apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an internal configuration of the SIMD computer 100 included in the image processing apparatus according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing a motion detection processing procedure in the SIMD computer 100 in the image processing apparatus according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram showing a DCT / quantization processing procedure in the SIMD computer 200 in the image processing apparatus according to the embodiment of the present invention.
FIG. 5 is an explanatory diagram showing an inverse DCT / inverse quantization process procedure in the SIMD computer 300 in the image processing apparatus according to the embodiment of the present invention.
FIG. 6 is a block diagram showing a configuration of a viewer system to which the image processing apparatus according to the embodiment of the present invention is applied.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. Note that in all the drawings for explaining the embodiments, the same members are denoted by the same reference numerals, and the repeated explanation thereof is omitted.
First, an example of the configuration of an image processing apparatus according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing a configuration of an image processing apparatus according to an embodiment of the present invention. The image processing apparatus according to the present embodiment is, for example, an image compression system, and includes a SIMD computer 100, a SIMD computer 200, a SIMD computer 300, buffers 401 to 406, 501 to 506, and 601 to 606, and the like.
The SIMD computer 100 includes a processor array 130 made up of a plurality of processors 101 to 116, each including an arithmetic unit (for example, a difference calculator) and a local memory, as well as a control unit 140 and the like.
The SIMD type computer 200 includes a processor array 230 including a plurality of processors 201 to 206 including an arithmetic unit (for example, a multiplier) and a local memory, a control unit 240, and the like.
The SIMD type computer 300 includes a processor array 330 including a plurality of processors 301 to 306 including an arithmetic unit (for example, a multiplier) and a local memory, a control unit 340, and the like.
The SIMD computer 100 and the SIMD computer 200 are electrically connected via a plurality of buffers 401 to 406: the output of the control unit 140 in the SIMD computer 100 is input in parallel to the buffers 401 to 406, and the outputs of the buffers 401 to 406 are input in parallel to the processors 201 to 206 in the SIMD computer 200. The SIMD computer 200 and the SIMD computer 300 are electrically connected via a plurality of buffers 501 to 506: the outputs of the processors 201 to 206 in the SIMD computer 200 are input in parallel to the buffers 501 to 506, and the outputs of the buffers 501 to 506 are input in parallel to the processors 301 to 306 in the SIMD computer 300. The outputs of the processors 301 to 306 in the SIMD computer 300 are input to the buffers 601 to 606. In addition, the outputs of the buffers 501 to 506 are passed to AC/DC prediction and Huffman processing, and the outputs of the buffers 601 to 606 are passed to compensation image generation processing.
In the SIMD type computer 100, the processors 101 to 116 in the processor array 130 and the control unit 140 are electrically connected by an instruction bus 150, a broadcast data bus 160, a processor data output common bus 170, and the like. In the SIMD computer 200, the processors 201 to 206 and the control unit 240 are electrically connected by an instruction bus 250 or the like. In the SIMD computer 300, the processors 301 to 306 and the control unit 340 are electrically connected by an instruction bus 350 or the like.
In FIG. 1, the SIMD computers 100 to 300 form three stages, but a configuration with two stages or with four or more stages is also possible. The SIMD computer 100 contains 16 processors 101 to 116, but any number may be used. Likewise, the processors 201 to 206 and 301 to 306 in the SIMD computers 200 and 300 and the buffers 401 to 406, 501 to 506, and 601 to 606 are each arranged six in parallel, but the degree of parallelism may be any number.
FIG. 2 shows a detailed configuration of the SIMD computer 100. In addition to the control unit 140 and the processor array 130, the SIMD computer 100 includes a plurality of memory units 121 to 129 and the like. The local memories in the processors 101 to 116 and the memory units 121 to 129 are constituted by RAM.
In the processor array 130, the processors 101 to 116 are arranged in a matrix, and the local memory in each of the processors 101 to 116 is connected to the neighboring processors above, below, to the left, and to the right, so that arithmetic data can be shifted in any of those four directions. The local memories of the processors 104, 108, 112, 113, 114, 115, and 116 located at the edges of the processor array 130 are connected to the memory units 121 to 129 arranged around the processor array 130, so that arithmetic data can also be shifted to and from the memory units 121 to 129. The control unit 140 and the arithmetic units in all the processors 101 to 116 are connected via the instruction bus 150 and the broadcast data bus 160, so that instructions and data are output from the control unit 140 to all the processors 101 to 116. The outputs of the arithmetic units in all the processors 101 to 116 are connected to the control unit 140 via tristate buffers and the processor data output common bus 170, so that the arithmetic data of the arithmetic unit in each of the processors 101 to 116 is output to the control unit 140.
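The neighbor-to-neighbor shifting can be pictured with a toy model in which each grid cell stands for the data held by one processor. The function below is a hypothetical illustration of a single leftward shift, with the values entering at the right edge standing in for data supplied by the surrounding memory units; the text does not describe the shift at this level of detail, so the direction, granularity, and names are assumptions.

```python
def shift_left(grid, incoming):
    """Shift every row one position to the left.

    `incoming[r]` is the value that enters row r at the right edge (here
    imagined as coming from an edge memory unit). Returns the new grid and
    the list of values that leave at the left edge.
    """
    shifted_out = [row[0] for row in grid]
    new_grid = [row[1:] + [incoming[r]] for r, row in enumerate(grid)]
    return new_grid, shifted_out
```

One plausible use of such a shift is to move candidate blocks between processors when the search window advances, so that only the newly exposed column has to be fetched from memory rather than the whole window.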
Each memory unit 121 to 129 is connected to another adjacent memory unit so that data can be shifted between the memory units. The memory units 121 to 129 and the control unit 140 are connected via a memory common bus 180.
Further, the control unit 140 is connected to an external control (main CPU) and an external memory (image data).
Next, the operation of the image processing apparatus according to the present embodiment will be described with reference to FIG. 1. First, the SIMD computer 100 performs motion detection processing in image processing. The SIMD computer 100 then outputs the difference information and motion vector information for each block, which are the results of the motion detection processing, to the buffers 401 to 406 in units of blocks. After outputting the difference information and the motion vector information to the buffers 401 to 406, the SIMD computer 100 performs motion detection processing for the next macroblock.
In the SIMD computer 200, the processors 201 to 206 perform the DCT operations of image processing in parallel. At this time, the processors 201 to 206 read the difference information for each block from the buffers 401 to 406 and perform the DCT operation. Next, based on the result of that operation, the SIMD computer 200 performs quantization in parallel on the processors 201 to 206. After the DCT operation and quantization of each block are complete, the processors 201 to 206 output the processing results, together with the motion vector information, to the buffers 501 to 506 in parallel. The motion vector information and quantized data of each block in the buffers 501 to 506 undergo AC/DC processing and Huffman processing and are output as compressed data.
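As a rough illustration of what one of the processors 201 to 206 computes for its block, the sketch below applies a direct-form 8 × 8 DCT followed by quantization. The orthonormal DCT-II scaling, the single uniform step size of 16, and the function names are assumptions made for the example; the text does not specify the DCT implementation or the quantization tables.

```python
import math


def dct_8x8(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (direct O(N^4) form)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        for u in range(8):
            total = 0.0
            for y in range(8):
                for x in range(8):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


def quantize(coeffs, step=16):
    """Uniform quantization with a single step size (an assumption)."""
    return [[int(round(value / step)) for value in row] for row in coeffs]
```

A flat block whose samples all equal 10 produces a DC coefficient of 80 and zero AC coefficients, so after quantization with step 16 only the DC level (5) is non-zero. Each inner product is a chain of multiply-accumulate operations, which is why this stage needs multipliers rather than difference calculators.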
The motion vector information and quantized data of each block in the buffers 501 to 506 are also output to the SIMD computer 300 for generation of the compensation image. In the SIMD computer 300, the processors 301 to 306 perform inverse quantization on the blocks in parallel. Next, based on the result of that processing, the processors 301 to 306 perform the inverse DCT operations in parallel. After the inverse quantization and inverse DCT operation of each block are complete, the processors 301 to 306 output the processing results, together with the motion vector information, to the buffers 601 to 606 in parallel. The motion vector information of each block in the buffers 601 to 606 and the data after the inverse DCT operation are used for compensation image generation processing.
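The reverse path performed per block by the processors 301 to 306 can be sketched in the same spirit. The uniform step size and the orthonormal inverse DCT below are assumptions chosen for the illustration, not details taken from the text.

```python
import math


def dequantize(levels, step=16):
    """Inverse of a uniform quantizer: scale each level back by the step size."""
    return [[level * step for level in row] for row in levels]


def idct_8x8(coeffs):
    """Orthonormal 2-D inverse DCT of an 8x8 coefficient block (direct form)."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            total = 0.0
            for v in range(8):
                for u in range(8):
                    total += (c(u) * c(v) * coeffs[v][u]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = 0.25 * total
    return out
```

A level block containing only a DC level of 5 dequantizes to a DC coefficient of 80 and reconstructs to a flat block of value 10. Adding such a reconstructed difference block to the motion-compensated prediction yields the compensation image used as the reference for the next frame.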
Because the SIMD computers 100, 200, and 300 carry out their respective processing (motion detection, DCT operation, quantization, inverse quantization, and inverse DCT operation) through the buffers 401 to 406, 501 to 506, and 601 to 606, pipelined parallel processing is possible and the overall processing is accelerated.
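The benefit of buffering between stages can be seen in a simple schedule model. The sketch assumes that every stage takes one time step per macroblock and that each inter-stage buffer holds exactly one macroblock; the stage names and the function are hypothetical.

```python
def pipeline_schedule(num_macroblocks, stages=("motion", "dct_q", "idct_iq")):
    """Return, per time step, which macroblock each stage is working on.

    A value of None means the stage is idle (pipeline fill or drain).
    """
    schedule = []
    total_steps = num_macroblocks + len(stages) - 1
    for t in range(total_steps):
        row = {}
        for s, name in enumerate(stages):
            mb = t - s  # stage s lags stage 0 by s time steps
            row[name] = mb if 0 <= mb < num_macroblocks else None
        schedule.append(row)
    return schedule
```

With four macroblocks and three stages the schedule has 6 steps, against 12 if each macroblock had to pass through all three stages before the next one started. At step 2, for instance, the motion stage is on macroblock 2 while the DCT/quantization stage is on macroblock 1 and the inverse stage is on macroblock 0.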
In this embodiment, the SIMD computer 200 performs the DCT operation and quantization, and the SIMD computer 300 performs the inverse quantization and inverse DCT operation. However, a single SIMD computer may perform the DCT operation, quantization, inverse quantization, and inverse DCT operation, or three or more SIMD computers may share these processes among themselves.
In addition to the calculation results and the vector information of each block, attributes of each block, for example whether or not difference processing against the comparison image has been performed, can be written to the buffers 401 to 406, 501 to 506, and 601 to 606. Each processor can then examine that information and carry out different arithmetic processing accordingly.
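One way to picture such a buffer entry is as a small record whose header fields travel with the block data. The layout below is purely hypothetical: the class name `BlockPacket`, its field names, and the two processing paths are invented for illustration and are not defined in the text, which only says that attributes such as vector information, I/P/B frame information, and whether differencing was performed can accompany each block.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockPacket:
    """Hypothetical buffer entry: header fields followed by block data."""
    frame_type: str                  # "I", "P" or "B"
    differenced: bool                # was a comparison image subtracted?
    motion_vector: Tuple[int, int]
    data: List[int]


def select_processing(packet: BlockPacket) -> str:
    """Choose a processing path from the header alone."""
    if packet.frame_type == "I" or not packet.differenced:
        return "intra"  # block carries pixel values
    return "inter"      # block carries a frame difference
```

Because the decision depends only on the header, each processor can branch locally without the control unit having to issue block-specific instructions.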
Next, the procedure of the motion detection process in the SIMD computer 100 will be described with reference to FIG. 3. FIG. 3A shows the order in which motion detection is performed on the entire image in units of macroblocks, and FIG. 3B shows the processing flow for each macroblock.
As shown in FIG. 3A, the entire image (current image) is divided into macroblocks (16 × 16 pixels) and processed for each macroblock.
As shown in FIG. 3B, the macroblock is composed of luminance (Y0, Y1, Y2, Y3) and color difference (U, V). Y0, Y1, Y2, Y3, U, and V are each composed of 8 × 8 color elements.
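The luminance part of this decomposition can be sketched as follows. The raster ordering of the sub-blocks (Y0 top-left, Y1 top-right, Y2 bottom-left, Y3 bottom-right) is an assumption, since the text does not state the order, and the function name is illustrative. A single 8 × 8 U block and a single 8 × 8 V block per 16 × 16 macroblock corresponds to chroma subsampled by two in both directions (4:2:0).

```python
def split_luma(macroblock):
    """Split a 16x16 luminance array into four 8x8 blocks Y0..Y3."""
    def sub(top, left):
        return [row[left:left + 8] for row in macroblock[top:top + 8]]

    return {"Y0": sub(0, 0), "Y1": sub(0, 8), "Y2": sub(8, 0), "Y3": sub(8, 8)}
```

Together with U and V this gives six 8 × 8 blocks per macroblock, which matches the six-way parallelism of the buffers and of the processors in the later stages.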
After dividing into macro blocks, motion detection processing is performed for each macro block. The motion detection process is performed by detecting a difference from the comparison image.
Therefore, the information after the motion detection process with the comparison image is the difference value information (Y0 ′, Y1 ′, Y2 ′, Y3 ′, U ′, V ′) for each block Y0, Y1, Y2, Y3, U, V. And motion vector information.
The block information is output to the buffers 401 to 406.
Next, referring to FIG. 4, the procedure of DCT calculation and quantization processing in the SIMD computer 200 will be described.
The difference value information (Y0′, Y1′, Y2′, Y3′, U′, V′) and the motion vector information of the blocks Y0, Y1, Y2, Y3, U, and V are input in parallel from the buffers 401 to 406 to the processors 201 to 206 and processed in parallel, and the processing results are output in parallel to the buffers 501 to 506.
Next, the procedure of inverse quantization processing and inverse DCT calculation in the SIMD computer 300 will be described with reference to FIG. 5.
The processing results and motion vector information of the blocks Y0, Y1, Y2, Y3, U, and V are input in parallel from the buffers 501 to 506 to the processors 301 to 306 and processed in parallel, and the processing results are output in parallel to the buffers 601 to 606.
Next, an application example of the image processing apparatus according to the present embodiment will be described with reference to FIG. FIG. 6 shows an example in which the image processing apparatus according to the present embodiment is applied to a viewer system. This system includes, for example, the image processing apparatus 700 of the present embodiment, an AC / DC prediction Huffman 701, an image memory 702, a display circuit 703, a monitor 704, a ROM 705, a RAM 706, a CPU 707, an IF (interface) circuit 708, and the like. ing.
The image processing apparatus 700 is connected to the image memory 702 and the AC/DC prediction Huffman 701, the image memory 702 is connected to the display circuit 703, and the display circuit 703 is connected to the monitor 704. In addition, the AC/DC prediction Huffman 701, the ROM 705, the RAM 706, the CPU 707, and the IF circuit 708 are connected via a bus. The IF circuit 708 is connected to a memory card 709.
This system displays MPEG images captured by a digital movie camera or the like on a monitor or TV.
Because this system performs only MPEG decompression, inverse quantization and inverse DCT are processed using only the SIMD computer 300 of the image processing apparatus of the embodiment shown in FIG. Because motion detection, DCT, quantization, inverse quantization, and inverse DCT are divided among separate SIMD computers, the image processing apparatus can be built from only the SIMD computers that are needed, which reduces its size and therefore its power consumption.
Therefore, according to the image processing apparatus of the above embodiment, processes such as motion detection and DCT/quantization in MPEG video compression are handled by separate SIMD computers, so that image compression can be realized with the number of processors that matches the performance required for each process. In addition, since motion detection, DCT/quantization, and the like can be performed as a pipeline operation across the processes, performance can be improved.
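The pipeline operation mentioned above can be illustrated with a small scheduling model in which every stage works on a different macroblock during the same time step. The model assumes that all stages take the same time and that one item moves between adjacent stages per step; the function `run_pipeline`, the stage labels, and the string payloads are hypothetical stand-ins rather than part of the embodiment.

```python
# Illustrative sketch only: three processing stages operating as a pipeline.

def run_pipeline(items, stages):
    """Return (results, trace); trace[t][i] is the index of the item in stage i at step t."""
    in_flight = [None] * len(stages)   # in_flight[i]: (index, value) held by stage i
    results, trace = [], []
    n_steps = len(items) + len(stages) - 1
    for t in range(n_steps):
        # Move every item one stage forward and feed the next input into stage 0.
        head = (t, items[t]) if t < len(items) else None
        in_flight = [head] + in_flight[:-1]
        # All stages work during the same step, each on a different item.
        for i, stage in enumerate(stages):
            if in_flight[i] is not None:
                idx, value = in_flight[i]
                in_flight[i] = (idx, stage(value))
        trace.append([slot[0] if slot is not None else None for slot in in_flight])
        if in_flight[-1] is not None:
            results.append(in_flight[-1][1])
    return results, trace

# Usage: stand-in stages that only tag the data they have handled.
stages = [
    lambda s: s + ">MD",       # motion detection
    lambda s: s + ">DCTQ",     # DCT and quantization
    lambda s: s + ">IQIDCT",   # inverse quantization and inverse DCT
]
results, trace = run_pipeline(["MB0", "MB1", "MB2", "MB3"], stages)
```

Under these assumptions four macroblocks pass through three stages in six steps instead of the twelve that strictly sequential processing would need, and from the third step onward all three stages are busy at once.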
In addition, because image processing such as moving image compression/decompression, as typified by MPEG, can be performed in real time with a SIMD computer configuration that has a relatively small number of processors, an image processing function with a higher pixel density than before can be realized in a semiconductor integrated circuit device (electronic component) that can be mounted in portable consumer equipment driven at low power, such as a digital video camera.
The invention made by the present inventor has been specifically described above based on the embodiment. However, it goes without saying that the invention is not limited to the embodiment and that various modifications can be made without departing from the scope of the invention.
For example, although the above embodiment has been described for MPEG video compression/decompression, the present invention is not limited to this and can also be applied to other image compression/decompression.
The above description has mainly dealt with the case where the invention made by the present inventor is applied to image processing, which is the technical field on which it is based. However, the present invention is not limited to this. For example, it can also be applied to electronic devices in general that process calculation algorithms including matrix operations, such as other image processing and audio processing.

As described above, the image processing apparatus according to the present invention is suitable for use in electronic equipment that performs moving picture compression/decompression, such as a digital video camera, a video deck, or an information terminal. It can also be applied to electronic devices in general that process calculation algorithms including matrix operations, such as other image processing and audio processing.

Claims

An image processing apparatus comprising:
a first SIMD type computer having a plurality of first processors each including a difference calculator and a local memory, a first control unit that controls the first processors, and a broadcast data bus that transmits data from the first control unit to all the first processors;
one or a plurality of second SIMD type computers each having a plurality of second processors including a multiplier, and a second control unit that controls the second processors; and
a plurality of buffers that connect the first SIMD type computer and the second SIMD type computer in a configuration in which the operation result of the first SIMD type computer is transferred in parallel to the plurality of second processors in the second SIMD type computer and each of the computers performs pipeline parallel processing,
wherein, in the first SIMD type computer, the plurality of first processors operate in parallel in response to a single instruction from the first control unit and perform motion detection processing in image processing, and, in the second SIMD type computer, the plurality of second processors operate in parallel in response to a single instruction from the second control unit and perform discrete cosine transform, inverse discrete cosine transform, quantization, or inverse quantization processing in image processing.

An image processing apparatus comprising:
a first SIMD type computer having a plurality of first processors each including a difference calculator and a local memory, a first control unit that controls the first processors, and a broadcast data bus that transmits data from the first control unit to all the first processors;
a plurality of second SIMD type computers each having a plurality of second processors including a multiplier, and a second control unit that controls the second processors; and
a plurality of buffers that connect the first SIMD type computer and the plurality of second SIMD type computers in a configuration in which the operation result of the second SIMD type computer at the preceding stage is transferred in parallel to the second processors in the second SIMD type computer at the subsequent stage, and the first SIMD type computer and the plurality of second SIMD type computers each perform pipeline parallel processing,
wherein, in the first SIMD type computer, the plurality of first processors operate in parallel in response to a single instruction from the first control unit and perform motion detection processing in image processing, and, in the second SIMD type computers, the plurality of second processors operate in parallel in response to a single instruction from the second control unit and perform discrete cosine transform, inverse discrete cosine transform, quantization, or inverse quantization processing in image processing.

The image processing apparatus according to claim 1 or 2,
wherein header information indicating an attribute of a block is added, for each block-based data transfer, to the data transferred between the first SIMD type computer and the one or the plurality of second SIMD type computers.