JP4351903B2 - Video encoding device

Info

Publication number: JP4351903B2
Application number: JP2003427718A
Authority: JP
Inventors: 淳松村; 知也児玉; 昇山口; 忠昭増田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-12-24
Filing date: 2003-12-24
Publication date: 2009-10-28
Anticipated expiration: 2023-12-24
Also published as: JP2005191689A

Description

The present invention relates to a video encoding apparatus that encodes a moving image using a plurality of processing units.

As computer technology has advanced, software-based techniques for encoding and decoding video, typified by MPEG-2 and MPEG-4, have become common. One known use of this technology is to encode and decode television broadcasts in software on a personal computer, so that the personal computer serves as a hard disk recorder.

Meanwhile, as a CPU for next-generation game consoles, home servers, digital televisions, and the like, a model has been proposed in which a plurality of high-speed SIMD (Single Instruction Multiple Data) processors are provided and execute processing in cooperation with one another.

For example, there is an apparatus in which eight additional processing units (APUs) having the same ISA (Instruction Set Architecture) communicate in real time and perform processing using a shared dynamic random access memory (DRAM) (see Patent Document 1).

According to that document, one processor element (PE) consists of a processing unit (PU), a direct memory access controller (DMAC), and eight APUs. The PU manages the processing schedule and the apparatus as a whole, and the APUs execute processing in parallel according to the schedule.

That apparatus acquires packetized MPEG data from a TCP/IP network and performs a series of processes to decode the data. Within this series, the extraction and the decoding of the MPEG data are performed by the APUs. Assigning the time-consuming processes to the APUs in this way speeds up the processing.

Patent Document 1: JP 2002-358289 A

As described above, techniques for speeding up processing through cooperation among a plurality of APUs are known, but the local memory of each APU has a very small capacity. Consequently, operations that consume a large amount of memory or use a large amount of data require information to be exchanged between the APU and the DRAM, and such accesses from the APU to the DRAM are a major cause of processing delay.

The present invention has been made in view of the above, and an object thereof is to provide a video encoding apparatus that suppresses such processing delays and can perform encoding at high speed.

In order to solve the above problems and achieve the object, the present invention is a video encoding apparatus that encodes a moving image, comprising: a first processor that instructs the encoding process and performs processing related to the encoding; a second processor that performs processing related to the encoding based on instructions from the first processor; a main memory that holds information related to the encoding; main memory control means for controlling the exchange of data between the main memory and the first processor and between the main memory and the second processor; and a local memory directly accessible from the second processor. The first processor refers to the information held in the main memory via the main memory control means and performs, as processing related to the encoding, a first process that is all or part of at least one of preprocessing, motion detection, motion compensation, DCT processing, inverse discrete cosine transform, quantization, variable-length coding, and syntax generation, and whose amount of computation is smaller than a predetermined amount. The second processor refers to the information held in the local memory and, based on instructions from the first processor, performs, as processing related to the encoding, a second process that is all or part of at least one of preprocessing, motion detection, motion compensation, DCT processing, inverse discrete cosine transform, quantization, variable-length coding, and syntax generation, and whose amount of computation is larger than the predetermined amount. The whole of the second process performed by the second processor is divided into a plurality of stages, and the main memory further holds the information used in each stage. When the second processor executes the second process in one of the stages, the local memory acquires the information used in that stage from the main memory via the main memory control means and holds that information.

The video encoding apparatus according to the present invention assigns the encoding processes with a relatively small amount of computation to the first processor, which can access the main memory directly, and assigns the encoding processes with a relatively large amount of computation to the second processor, which cannot access the main memory directly but can access the local memory directly. Dividing the processes involved in encoding between the first processor and the second processor in this way speeds up the processing. In addition, when the second processor executes one stage of the second process, which is divided into a plurality of stages, the local memory acquires the information used in that stage from the main memory via the main memory control means and holds it. This both speeds up the processing and reduces the amount of local memory required.

Embodiments of the video encoding apparatus according to the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

FIG. 1 is a block diagram showing the hardware configuration of the video encoding apparatus 10 of the present invention. The video encoding apparatus 10 includes a processor element 100 and a DRAM (Dynamic Random Access Memory) 110. The processor element 100 in turn includes a processing unit 102, a DMAC (Direct Memory Access Controller) 104, and a plurality of additional processing units, namely a first additional processing unit 106a, a second additional processing unit 106b, and so on.

Here, the processing unit 102 and the additional processing units 106 of the present embodiment constitute the first processor and the second processor of the present invention, respectively. Likewise, the DRAM 110 and the DMAC 104 of the present embodiment constitute the main memory and the main memory control means of the present invention, respectively.

The processing unit 102 controls the video encoding apparatus 10 as a whole. The DRAM 110 holds the moving image data to be processed, the programs for encoding that data, and the like. The DMAC 104 sends information acquired from the DRAM 110 to the processing unit 102 or to an additional processing unit 106, and sends information acquired from the processing unit 102 and the additional processing units 106 to the DRAM 110. The DMAC 104 thus functions as the interface between the DRAM 110 and the processor element 100. Each additional processing unit 106 performs processing related to encoding in accordance with instructions from the processing unit 102.

FIG. 2 is a block diagram showing the detailed configuration of the first additional processing unit 106a shown in FIG. 1. The first additional processing unit 106a includes a local memory 1060, a register 1062, a first floating-point arithmetic unit 1064a, a second floating-point arithmetic unit 1064b, and so on, and a first integer arithmetic unit 1066a, a second integer arithmetic unit 1066b, and so on.

The first additional processing unit 106a thus has a plurality of floating-point arithmetic units and a plurality of integer arithmetic units, and can perform high-speed computation by having them work together.

The local memory 1060 is a relatively small SRAM (Static Random Access Memory) of about 128 kilobytes. The additional processing unit 106 operates using the programs and data held in the local memory 1060. It also issues requests to the DMAC 104 for data transfers between the DRAM 110 and the additional processing unit 106. The additional processing unit 106 cannot directly access the DRAM 110, which is connected to the DMAC 104.

When an additional processing unit 106 executes a program, the processing unit 102 controls the DMAC 104 so that the object program and its associated stack frame are transferred from the DRAM 110 to the local memory 1060 of the additional processing unit 106. The processing unit 102 then issues a command instructing the additional processing unit 106 to execute the program, and the additional processing unit 106 starts executing the program in response to that command. The additional processing unit 106 also transfers the result of the program to the DRAM 110 via the DMAC 104. When the processing is complete, the additional processing unit 106 notifies the processing unit 102 by causing an interrupt that indicates completion.
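The launch sequence described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the class names `Dram`, `Dmac`, and `Apu`, the function `run_on_apu`, and the callback that stands in for the completion interrupt are all hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field


@dataclass
class Dram:
    """Main memory: holds programs, input data, and results by name."""
    store: dict = field(default_factory=dict)


@dataclass
class Apu:
    """Additional processing unit with its own small local memory."""
    local: dict = field(default_factory=dict)


class Dmac:
    """Every transfer between DRAM and a local memory goes through here."""

    def __init__(self, dram):
        self.dram = dram
        self.transfers = 0

    def load(self, apu, key):
        apu.local[key] = self.dram.store[key]
        self.transfers += 1

    def store(self, apu, key):
        self.dram.store[key] = apu.local[key]
        self.transfers += 1


def run_on_apu(dmac, apu, program_key, data_key, result_key, on_done):
    # 1. The PU has the DMAC copy the program and its data into local memory.
    dmac.load(apu, program_key)
    dmac.load(apu, data_key)
    # 2. The PU issues the execute command; the APU works from local memory only.
    program = apu.local[program_key]
    apu.local[result_key] = program(apu.local[data_key])
    # 3. The APU writes the result back to DRAM through the DMAC.
    dmac.store(apu, result_key)
    # 4. Completion is reported to the PU (a callback stands in for the interrupt).
    on_done(result_key)
```

Note that the program never touches `Dram` directly; it only sees `apu.local`, which mirrors the restriction that the additional processing unit cannot access the DRAM 110 itself.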

Although the detailed configuration of the first additional processing unit 106a has been described with reference to FIG. 2, the second additional processing unit 106b and the other additional processing units have the same detailed configuration as the first additional processing unit 106a.

FIG. 3 is a flowchart showing the encoding process in the processor element 100. First, the additional processing unit 106 acquires moving image data via the DMAC 104 and performs preprocessing to improve encoding efficiency (step S100). Next, forward motion detection (step S102) and backward motion detection (step S104) are performed in that order. Motion compensation is then performed (step S106), followed by a discrete cosine transform of the residual signal obtained by motion compensation (step S108). Quantization is performed next (step S110), followed by variable-length coding (step S112).

Meanwhile, so that motion detection and motion compensation can be performed, quantization (step S110) is followed by inverse quantization (step S120) and then by an inverse discrete cosine transform (step S122). All of the above processing is performed by the additional processing units 106.

As described above, it is desirable to assign the macroblock-level processing at and below the picture layer to the additional processing units 106. Processing at and below the picture layer requires a relatively small amount of memory, which makes it well suited to the additional processing units 106.

Motion detection (steps S102 and S104), motion compensation (step S106), and DCT processing (step S108) require a very large amount of computation. It is desirable to assign processes with such a large amount of computation, that is, a heavy computational load, to the additional processing units 106. As described with reference to FIG. 2, an additional processing unit 106 can process data at high speed using its plurality of floating-point arithmetic units 1064 and plurality of integer arithmetic units 1066, so it can perform these processes faster than the processing unit 102 could.

A threshold for judging whether a computational load is large or small may be determined in advance. In that case, the amount of computation is compared with the threshold to decide whether the process is assigned to the processing unit 102 or to an additional processing unit 106.

Using a predetermined amount of computation as a threshold in this way, and deciding on that basis which processes are assigned to the additional processing units 106 and which to the processing unit 102, speeds up the processing as a whole.
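The threshold rule can be sketched as follows. The cost figures and the threshold value are hypothetical placeholders chosen only for illustration; the source gives no numbers, and it does not say how a cost exactly equal to the threshold is handled (this sketch sends it to the processing unit).

```python
def assign_processor(estimated_cost, threshold):
    """Pick the processor for one process from its estimated computation cost.

    Costs above the threshold go to an additional processing unit (APU);
    everything else stays on the processing unit (PU).
    """
    return "APU" if estimated_cost > threshold else "PU"


# Hypothetical relative costs, not taken from the source.
COSTS = {
    "motion_detection": 5000,
    "motion_compensation": 1200,
    "dct": 900,
    "syntax_generation": 50,
}
THRESHOLD = 500  # hypothetical

assignment = {name: assign_processor(cost, THRESHOLD) for name, cost in COSTS.items()}
```

With these placeholder values the heavy kernels land on the APUs and syntax generation stays on the PU, which matches the division of labor described in the text.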

Processing at and below the picture layer also involves a great deal of repetition. Highly repetitive processing of this kind is likewise well suited to the additional processing units 106, which are capable of high-speed processing. Assigning each unit the processing that suits its capabilities in this way improves the efficiency of the processing as a whole.

Separately from the processing performed by the additional processing units 106 described above, syntax generation is performed (step S130). The syntax processing (step S130) is performed by the processing unit 102.

The computational load of syntax processing is small, but the processing uses a large number of tables. It is therefore desirable to make the processing efficient by carrying it out in cooperation with the DRAM 110, which has a large memory capacity. Specifically, the processing unit 102 accesses the DRAM 110 directly via the DMAC 104 and performs the syntax processing while referring to the tables held in the DRAM 110. It is thus desirable to assign syntax generation above the picture layer to the processing unit 102.

This completes the encoding process. As described above, dividing the encoding process between the processing unit 102 and the additional processing units 106 allows the processing to be performed efficiently and at high speed.

The encoding process in the processor element 100 is otherwise the same as a general MPEG encoding process.

The processing of the additional processing units 106 described with reference to FIG. 3 is divided into a plurality of stages. In the present embodiment it is divided into five stages: a first stage that performs preprocessing (step S100); a second stage that performs forward motion detection (step S102); a third stage that performs backward motion detection (step S104); a fourth stage that performs motion compensation (step S106) and DCT processing (step S108); and a fifth stage that performs quantization (step S110), variable-length coding (step S112), inverse quantization (step S120), and the inverse discrete cosine transform (step S122).
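For reference, the five-stage division can be written down as a lookup table. The step numbers and groupings come from the text; the names `STAGES`, `STEP_NAMES`, and `stage_of` are illustrative only.

```python
# Stage number -> steps executed by the additional processing units in that stage.
STAGES = {
    1: ["S100"],
    2: ["S102"],
    3: ["S104"],
    4: ["S106", "S108"],
    5: ["S110", "S112", "S120", "S122"],
}

STEP_NAMES = {
    "S100": "preprocessing",
    "S102": "forward motion detection",
    "S104": "backward motion detection",
    "S106": "motion compensation",
    "S108": "DCT",
    "S110": "quantization",
    "S112": "variable-length coding",
    "S120": "inverse quantization",
    "S122": "inverse discrete cosine transform",
}


def stage_of(step):
    """Return the stage that contains the given step, e.g. stage_of("S108")."""
    for stage, steps in STAGES.items():
        if step in steps:
            return stage
    raise KeyError(step)
```

Syntax generation (step S130) is deliberately absent from the table, since it runs on the processing unit 102 rather than in any of the five stages.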

While an additional processing unit 106 is executing the first stage, its local memory 1060 holds the program executed in the first stage, the moving image data to be processed in the first stage, the data referred to during that processing, and so on. When the processing of the first stage is complete, the program executed in the first stage and the related data are saved from the additional processing unit 106 to the DRAM 110 via the DMAC 104.

The program and other data to be used in the second stage are then written from the DRAM 110 to the additional processing unit 106 via the DMAC 104. Access to the DRAM 110 thus takes place only when the stage changes, for example on the switch from the first stage to the second stage or from the second stage to the third stage.

The processes included in each stage are determined based on the size of the local memory 1060 of the additional processing unit 106, the computation speed of the additional processing unit 106, and so on. That is, each stage comprises as much processing as the data that fits within the capacity of the local memory 1060 allows.
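One simple way to picture this sizing rule is a greedy grouping of consecutive processes under a capacity limit. The sketch below is an assumption, not the method of the source: it treats each process as having a single additive memory footprint (program plus data), which ignores the buffer reuse that the text later describes for the fourth stage, and the footprints in the usage example are hypothetical.

```python
LOCAL_MEMORY_BYTES = 128 * 1024  # "about 128 kilobytes" per the text


def pack_into_stages(footprints, capacity=LOCAL_MEMORY_BYTES):
    """Group consecutive processes into stages that each fit in local memory.

    footprints: list of (name, bytes_needed) in pipeline order.
    Returns a list of stages, each a list of process names.
    """
    stages, current, used = [], [], 0
    for name, size in footprints:
        if size > capacity:
            raise ValueError(f"{name} alone exceeds local memory")
        if current and used + size > capacity:
            # The next process would overflow: close this stage, start a new one.
            stages.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        stages.append(current)
    return stages


# Hypothetical footprints, for illustration only.
example = pack_into_stages(
    [("a", 100 * 1024), ("b", 40 * 1024), ("c", 60 * 1024), ("d", 90 * 1024)]
)
```

Here `example` groups `b` and `c` together because their combined footprint still fits, while `a` and `d` each need a stage of their own.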

The encoding process includes many complicated processes, so the programs that implement them in software are large. Motion detection and motion compensation also require a large amount of memory, because they operate on large data such as the reference image, the target image, and the motion-compensated image. Variable-length coding likewise requires a large amount of memory, because it keeps a variable-length coding table while it runs.

The capacity of the local memory 1060, on the other hand, is small, so the programs and data needed for the encoding process cannot all be held in the local memory 1060 at once. The additional processing unit 106 therefore has to fetch programs and data from the DRAM 110 dynamically as needed, and write them back to the DRAM 110.

However, processing that involves access to the DRAM 110 takes a long time and is a major cause of delay in the processing as a whole. Keeping the frequency of access to the DRAM 110 to a minimum, as described above, avoids the delay that such access would otherwise cause.

The exchange of data in each stage is described below with reference to FIGS. 4 to 7.

FIG. 4 is a diagram showing the flow of moving image data in the first stage. In the first stage, the DMAC 104 first acquires moving image data from the DRAM 110. Then, based on instructions from the processing unit 102, the DMAC 104 distributes the moving image data to the additional processing units 106, with a slice as the smallest unit of distribution. Here, a slice is one horizontal line of a still image constituting the moving image. The distribution of the moving image data is described later.

Each additional processing unit 106 performs preprocessing (step S100) on the moving image data received from the DMAC 104. The preprocessed moving image data is then written back to the DRAM 110 via the DMAC 104.

Various kinds of processing can serve as the preprocessing (step S100). Specific examples include 4:2:2 to 4:2:0 conversion, 3:2 pulldown detection, and noise removal. The program size and data size differ from one of these processes to another, so preprocessing is treated as a stage of its own.

As described above, the first stage consists of preprocessing alone. Therefore, even when the preprocessing needs a large amount of memory, no data has to be exchanged with the DRAM 110 partway through the first stage, and the processing can be performed efficiently.

FIG. 5 is a diagram showing the flow of moving image data in the second stage. In the second stage, each additional processing unit 106 acquires the locally decoded image of the moving image data from the DRAM 110 via the DMAC 104 and performs forward motion detection (step S102). The motion vectors obtained by the forward motion detection (step S102) are written back to the DRAM 110 via the DMAC 104.

The processing in the third stage follows the same flow of moving image data as the second stage described with reference to FIG. 5.

FIG. 6 is a diagram showing the flow of moving image data in the fourth stage. In the fourth stage, each additional processing unit 106 acquires the moving image data, the locally decoded image, and the motion vectors from the DRAM 110 via the DMAC 104. The additional processing unit 106 performs motion compensation (step S106) on the acquired moving image data, and the local memory 1060 holds the residual signal obtained by the motion compensation. DCT processing (step S108) is then performed on the residual signal held in the local memory 1060, and the result, namely the DCT coefficients, is written back to the DRAM 110.

Motion compensation (step S106) handles the moving image data, the locally decoded image, the motion vectors, and the residual signal. This data is large and occupies most of the capacity of the local memory 1060. The DCT processing (step S108), on the other hand, can reuse the data area reserved for motion compensation (step S106) as the place to store the DCT coefficients it produces. In addition, the program for the DCT processing (step S108) is itself small. These processes are therefore combined into a single stage.

FIG. 7 is a diagram showing the flow of moving image data in the fifth stage. In the fifth stage, each additional processing unit 106 acquires the DCT coefficients from the DRAM 110 via the DMAC 104 and quantizes them (step S110). The local memory 1060 holds the quantized DCT coefficients.

Each additional processing unit 106 also acquires the motion vectors from the DRAM 110 via the DMAC 104. The additional processing unit 106 then applies variable-length coding (step S112) to both the motion vectors and the quantized DCT coefficients held in the local memory 1060.

The additional processing unit 106 also reads the quantized DCT coefficients from the local memory 1060 and performs inverse quantization (step S120), and the local memory 1060 holds the result. The additional processing unit 106 then applies the inverse discrete cosine transform (step S122) to the inverse-quantized DCT coefficients held in the local memory 1060 to create a locally decoded image. Finally, the locally decoded image and the variable-length coded data are written to the DRAM 110.
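The point of the fifth stage is that the quantized coefficients stay in local memory and feed two paths: the entropy-coding path and the local-decode path. A minimal sketch of that fan-out is given below. It is a simplification under stated assumptions: it uses a plain uniform quantizer with truncation rather than MPEG quantization matrices, it stops at (run, level) pairs instead of looking them up in variable-length code tables, and it omits the inverse DCT. All function names are illustrative.

```python
def quantize(coeffs, qstep):
    """Uniform quantizer: divide by the step size and truncate toward zero."""
    return [int(c / qstep) for c in coeffs]


def dequantize(levels, qstep):
    """Inverse quantization: scale the levels back by the step size."""
    return [lv * qstep for lv in levels]


def run_level_pairs(levels):
    """Collapse zero runs into (run, level) pairs, the usual input to VLC tables.

    Trailing zeros produce no pair (an encoder would emit end-of-block instead).
    """
    pairs, run = [], 0
    for lv in levels:
        if lv == 0:
            run += 1
        else:
            pairs.append((run, lv))
            run = 0
    return pairs


def stage5(coeffs, qstep):
    levels = quantize(coeffs, qstep)            # S110, result kept in local memory
    symbols = run_level_pairs(levels)           # input to S112 (tables omitted)
    reconstructed = dequantize(levels, qstep)   # S120, would feed S122 (inverse DCT)
    return symbols, reconstructed
```

Both outputs are derived from the same `levels` list without any intermediate round trip to main memory, which is the property the stage grouping is meant to preserve.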

Quantization (step S110), variable-length coding (step S112), inverse quantization (step S120), and the inverse discrete cosine transform (step S122) all have small programs and small data. The tables used in variable-length coding (step S112), however, are relatively large. These processes are therefore combined into a single stage.

FIG. 8 is a diagram for explaining how processing is assigned to the additional processing units 106. As shown in FIG. 8, a time budget that controls timing is set for the additional processing units 106. Within the same time budget, each additional processing unit 106 applies the same processing to a different slice. Here, a slice is one horizontal line of a still image constituting the moving image.

For example, the processing for slice 1 is assigned to the first additional processing unit 106a, and the processing for slice 2 is assigned to the second additional processing unit 106b. In this way, the plurality of additional processing units 106 share the processing of one set of moving image data and thereby process it in parallel.

In the first stage, for example, the first additional processing unit 106a preprocesses slice 1 and writes the preprocessed slice 1 back to the DRAM 110. The second additional processing unit 106b preprocesses slice 2 and writes the preprocessed slice 2 back to the DRAM 110. Each of the other additional processing units 106 likewise performs preprocessing and writes its result back to the DRAM 110.

As described above, having the additional processing units 106 share the processing slice by slice increases the processing speed.

When there are more slices than additional processing units 106, a plurality of slices may be assigned to a single additional processing unit. For example, slices 1 to 3 are assigned to the first additional processing unit 106a, and slices 4 to 6 are assigned to the second additional processing unit 106b.
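The assignment of several slices to one unit can be sketched as follows. This is an illustration only: the function name `assign_slices` and the even, contiguous split are assumptions chosen to reproduce the example of slices 1 to 3 and slices 4 to 6, and the text does not prescribe a particular assignment rule.

```python
def assign_slices(num_slices, num_units):
    """Split slice numbers 1..num_slices into contiguous blocks, one per unit."""
    base, extra = divmod(num_slices, num_units)
    blocks, start = [], 1
    for unit in range(num_units):
        # The first `extra` units take one additional slice each.
        size = base + (1 if unit < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


print(assign_slices(6, 2))  # [[1, 2, 3], [4, 5, 6]]
```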

Furthermore, when the first additional processing unit 106a completes its first-stage processing, the second additional processing unit 106b may not yet have completed its own. For example, the second additional processing unit 106b may still be processing slice 4 when the first additional processing unit 106a finishes. This can occur, for example, when the amount of computation required for processing differs even for slices that are otherwise the same.

In this case, while the second additional processing unit 106b processes slice 4, the first additional processing unit 106a processes slice 6. This is faster than having the second additional processing unit 106b process all of slices 4 to 6. Thus, when the processing speeds of the additional processing units 106 differ, redistributing the slices speeds up processing further.
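The effect of redistribution can be checked with a small simulation. This is a sketch under stated assumptions: the per-slice costs are hypothetical numbers, and the rule that an idle unit takes the last not-yet-started slice from the unit with the most remaining slices is one possible policy that reproduces the slice 6 example above, not necessarily the one used by the device.

```python
from collections import deque


def simulate(blocks, cost, redistribute):
    """Return (finish_time, done), where done[u] lists the slices unit u processed."""
    queues = [deque(b) for b in blocks]   # slices not yet started, per unit
    clock = [0] * len(blocks)             # time at which each unit becomes free
    done = [[] for _ in blocks]
    while any(queues):
        # Units that can start a slice now: those with their own work,
        # plus idle units when redistribution is allowed.
        candidates = [u for u in range(len(queues)) if queues[u] or redistribute]
        # Serve the unit that is free earliest; on a tie, prefer own work.
        u = min(candidates, key=lambda i: (clock[i], not queues[i]))
        if queues[u]:
            s = queues[u].popleft()
        else:
            donor = max(range(len(queues)), key=lambda i: len(queues[i]))
            s = queues[donor].pop()       # take the donor's last waiting slice
        clock[u] += cost[s]
        done[u].append(s)
    return max(clock), done


cost = {1: 1, 2: 1, 3: 1, 4: 4, 5: 1, 6: 1}   # hypothetical: slice 4 is expensive
blocks = [[1, 2, 3], [4, 5, 6]]
print(simulate(blocks, cost, redistribute=False))  # (6, [[1, 2, 3], [4, 5, 6]])
print(simulate(blocks, cost, redistribute=True))   # (5, [[1, 2, 3, 6], [4, 5]])
```

With these costs, unit 106a finishes slices 1 to 3 while unit 106b is still busy with slice 4, takes over slice 6, and the stage completes one time unit earlier.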

In the present embodiment, moving image data is assigned to each additional processing unit 106 in units of slices, but the unit of assignment is not limited to this, and other units may be used. For example, moving image data may be assigned to each additional processing unit 106 in units of macroblocks, which are the finer units that make up a slice.

FIG. 9 is a diagram for explaining the processing of each additional processing unit 106 when the additional processing units 106 are each in charge of a different process. If, as shown in FIG. 9, each additional processing unit 106 is in charge of a different process, then while the first additional processing unit 106a is performing the first-stage processing on slice 1, the second additional processing unit 106b cannot perform the second-stage processing on slice 1 and remains in a standby state. When the additional processing units 106 perform different processes in this way, the other additional processing units 106 may be unable to start until the processing in one additional processing unit 106 is complete, which is inefficient. Consequently, even though the additional processing units 106 operate in parallel, sufficient processing efficiency cannot be achieved.

Therefore, as described with reference to FIG. 8, each additional processing unit 106 performs the same processing on a different slice. This avoids the frequent standby states in the additional processing units 106 described with reference to FIG. 9 and improves processing efficiency.
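The difference between the two arrangements can be illustrated with a simple timing model. This is a rough sketch under stated assumptions, not a measurement: the per-stage costs are hypothetical, every slice is assumed to cost the same in a given stage, the FIG. 9 arrangement is modeled as one unit per stage with each slice passing through the stages in order, and the FIG. 8 arrangement is modeled as all units working on the same stage and starting the next stage only after every slice has finished the current one.

```python
import math


def per_stage_units_time(stage_cost, num_slices):
    """FIG. 9 style: unit s runs only stage s; slices pass the stages in order."""
    ready = [0] * num_slices              # when each slice leaves the previous stage
    for cost in stage_cost:
        unit_free = 0                     # when this stage's unit is next free
        finished = []
        for j in range(num_slices):
            start = max(unit_free, ready[j])   # unit idles until the slice arrives
            unit_free = start + cost
            finished.append(unit_free)
        ready = finished
    return ready[-1]


def per_slice_units_time(stage_cost, num_slices, num_units):
    """FIG. 8 style: all units run the same stage on different slices."""
    rounds = math.ceil(num_slices / num_units)  # slices handled per unit per stage
    return sum(rounds * cost for cost in stage_cost)


stage_cost = [1, 3, 3, 2, 1]              # hypothetical cost of the five stages
print(per_stage_units_time(stage_cost, num_slices=10))                # 37
print(per_slice_units_time(stage_cost, num_slices=10, num_units=5))   # 20
```

In this model the per-stage arrangement is limited by its slowest stage, while the per-slice arrangement keeps all five units busy in every stage.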

As described above, the moving image encoding apparatus and encoding method according to the present invention are useful for encoding moving image data, and are particularly useful for encoding moving image data using a plurality of processing units.

FIG. 1 is a block diagram showing the hardware configuration of the moving image encoding device 10 of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the first additional processing unit 106a shown in FIG. 1.
FIG. 3 is a flowchart showing the encoding process in the processor element 100.
FIG. 4 is a diagram showing the flow of moving image data in the first stage.
FIG. 5 is a diagram showing the flow of moving image data in the second stage.
FIG. 6 is a diagram showing the flow of moving image data in the fourth stage.
FIG. 7 is a diagram showing the flow of moving image data in the fifth stage.
FIG. 8 is a diagram for explaining how processing is assigned to each additional processing unit 106.
FIG. 9 is a diagram for explaining the processing of each additional processing unit 106 when the additional processing units 106 are each in charge of a different process.

Explanation of symbols

10 Moving image encoding device
100 Processor element
102 Processing unit
106 Additional processing unit
1060 Local memory
1062 Register
1064 Floating-point arithmetic unit
1066 Integer arithmetic unit

Claims

A moving image encoding device for encoding a moving image, comprising:
A first processor that issues instructions for the encoding process and performs the encoding process;
A second processor that performs processing related to the encoding based on an instruction from the first processor;
A main memory that holds information about the encoding;
Main memory control means for controlling data exchange between the main memory and the first processor and between the main memory and the second processor; and
A local memory directly accessible from the second processor, wherein
The first processor refers to information held in the main memory via the main memory control means and performs, as processing related to the encoding, a first process that is all or part of at least one of preprocessing, motion detection processing, motion compensation processing, DCT processing, inverse discrete cosine transform processing, quantization processing, variable-length encoding processing, and syntax generation processing, and whose amount of calculation is smaller than a predetermined amount of calculation,
The second processor refers to information held in the local memory and, based on an instruction from the first processor, performs, as processing related to the encoding, a second process that is all or part of at least one of preprocessing, motion detection processing, motion compensation processing, DCT processing, inverse discrete cosine transform processing, quantization processing, variable-length encoding processing, and syntax generation processing, and whose amount of calculation is larger than the predetermined amount of calculation,
The entire second process performed by the second processor is divided into a plurality of stages,
The main memory further holds information used in each stage, and
When the second processor executes the second process in one of the stages, the local memory acquires the information used in that stage from the main memory via the main memory control means and holds the acquired information.

The moving image encoding device according to claim 1, wherein the local memory saves the information used in the stage to the main memory when the second processor completes the second process in the one stage.

The moving image encoding device according to claim 1 or 2, wherein, of the second process, the preprocessing is included in a first stage, the forward motion detection processing of the motion detection processing is included in a second stage, the backward motion detection processing of the motion detection processing is included in a third stage, the motion compensation processing and DCT processing are included in a fourth stage, and the inverse discrete cosine transform processing, quantization processing, and variable-length encoding processing are included in a fifth stage.

The moving image encoding device according to claim 1, comprising a plurality of the second processors, wherein
The first processor divides the moving image into a plurality of pieces of partial data and distributes the moving image to the plurality of second processors in units of the partial data, and
The plurality of second processors perform the second process on the partial data received from the first processor.

The moving image encoding device according to claim 4, wherein the first processor divides a still image constituting the moving image into a plurality of slices and distributes the moving image to the plurality of second processors in units of the slices.