JP3977003B2

JP3977003B2 - Discrete cosine transform / inverse discrete cosine transform method and apparatus

Info

Publication number: JP3977003B2
Application number: JP2000297063A
Authority: JP
Inventors: 誠石川; 正博海永
Original assignee: Renesas Technology Corp
Current assignee: Renesas Technology Corp
Priority date: 2000-09-26
Filing date: 2000-09-26
Publication date: 2007-09-19
Anticipated expiration: 2020-09-26
Also published as: JP2002108843A

Description

【０００１】
【発明の属する技術分野】
本発明は、マイクロプロセッサやマイクロコンピュータ等のデータ変換処理装置に関わり、特に離散コサイン変換や逆離散コサイン変換などを行う画像処理及び音声処理応用データ処理装置に係わる。
【０００２】
【従来の技術】
ディジタル化された画像及び音声はデータ量が巨大であり、その蓄積や伝送の際に問題となる。従って、格納前に圧縮しておき、使用する際に伸長する、または、送信前に圧縮し、受信後に伸長するなどの対策が取られる。
【０００３】
以下、画像圧縮・伸長にを例に説明する。
【０００４】
圧縮
１．２次元離散コサイン変換
２．量子化
３．ハフマン符号化
伸長
４．ハフマン復号化
５．逆量子化
６．２次元逆離散コサイン変換
１の２次元離散コサイン変換及び６の２次元逆離散コサイン変換は８＊８画素の２次元ブロックを対象に行われ、変換結果も８＊８要素の２次元ブロック値群となる。
【０００５】
１の２次元離散コサイン変換変換により、８＊８ブロックの高周波成分に相当する要素は通常０に近い値が多くなり、重み付けされた量子化操作によってそれらの大多数は０となる。３のハフマン符号化は、８＊８のブロック要素群をビットストリームに変換する。この際、要素中に０が多いことを利用して変換するため、変換後のビットストリームの所用バイト数は１／１０程度になるといわれている。
【０００６】
伸長におけるハフマン複合化４は３と逆の操作を行い、ビットストリームから８＊８のブロック要素群を生成する。５の逆量子化は、２の量子化の際につけた重みの逆数を乗じることで、量子化前の要素群を復元する。６の２次元逆離散コサイン変換は１の逆の操作をすることで、８＊８画素を復元する。
【０００７】
ここで、上記の圧縮・伸長操作に必要な手間を示す。通常の汎用プロセッサを用いた場合、８＊８ブロックあたり、圧縮・伸長ともにそれぞれ２０００〜３０００命令必要である。６４０＊４８０のフルカラー（２４ビット／画素）画像を対象とする場合には１４４００倍となり、１静止画像あたり２８．８Ｍ〜４３．２Ｍ命令の実行を要する。１命令の処理を１クロック、１００ＭＨｚで動作するプロセッサを用いた場合、２〜４フレーム／秒の圧縮・伸長速度しか得られず、実時間で画像取り込み・画像再生を行うことで動画的な効果を得ることが難しいといえる。
【０００８】
そこで、画像及び音声圧縮・伸長を補助するための特殊な専用演算器を搭載するなどした処理装置を別途用意する方法が多く取られてきた。
【０００９】
また、もう一つの解決手段として汎用のプロセッサにも搭載されるようになった、行列・ベクトル演算命令や、複数の積和演算などを同時に処理するＳＩＭＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎＭｕｌｔｉｐｌｅＤａｔａ）方式の命令を利用して変換処理装置を構成する方法もあげられる。離散コサイン変換は線形変換であるため、Ｎ点の変換はＮ＊Ｎの行列で表現でき、上記の行列・ベクトル演算命令やＳＩＭＤ命令で行列の積を求めることで変換処理が完了する。
【００１０】
【発明が解決しようとする課題】
上記で示した２つ目の処理装置で使われる変換行列は、Ｎ点の離散コサイン変換にではＮ＊Ｎの大きさを持つことを述べた。例えば８＊８の２次元離散コサイン変換処理を繰り返して処理する場合には、図７に示す６４個の係数と８個の入力データを用いて８個の出力を得る行列演算を、縦方向８回、横方向８回繰り返すことになる。
【００１１】
このとき、６４個の係数は定数であるためプロセッサ内部に保持したまま処理することが理想的である。しかし、現実的にはレジスタ数の制約から主記憶に配置した６４個の一部を逐次読み込みながら処理することが要求される。
【００１２】
また、上記８点の変換処理を効率良く行なうためには、８＊８の行列演算器や、並列度８のＳＩＭＤ演算器を用意することが望ましいが、例えば３次元グラフィクス等の分野での利便性や回路面積を考慮すると、４＊４の行列演算や並列度４のＳＩＭＤ演算が合理的であるといえる。さらに、音声圧縮・伸長で利用される３２点の離散コサイン変換の場合は係数、回路の規模からより実現が難しくなる。
【００１３】
特開平９−２１２４８４の実施例において、図４に示すように８＊８の離散コサイン変換を偶関数部分と奇関数部分数に分解してすることで、８＊８行列の右上の４＊４成分、左下の４＊４成分が０になることが、結論に至る途中式に示されている。４＊４の行列と、４次元ベクトルの積を求める演算装置を備えたプロセッサを用いたとき、若干の前処理、後処理を追加することで２回の行列演算によって変換処理が可能であることが暗に示唆されている。
【００１４】
しかしながら、異なる４＊４の係数行列を２種類使用するため、離散コサイン変換処理で行列演算を行なう度に４＊４の係数データを入れ替える処理が必要となる。例えば、行列演算と係数入れ替えのサイクル数の比を１：１とした場合、行列演算器の利用効率は５０％以下になると概算できる。
【００１５】
本発明は、行列演算、ＳＩＭＤ演算による離散コサイン変換・逆離散コサイン変換の高速化を達成するために、上記変換係数行列の入れ替えにかかるオーバーヘッドを削減して演算器の利用効率を高めることを目的とする。
【００１６】
【課題を解決するための手段】
本発明では、上記の問題を解決するために、離散コサイン変換の定義式の性質を利用する。前処理のバタフライ演算と、後処理の加算処理を追加することで、Ｎ点の離散コサイン変換が、２つのＮ／２点の離散コサイン変換に分割できる。このとき、係数行列としてはただ１つの（Ｎ／２）＊（Ｎ／２）行列を使用するため、行列の入れ替えが不要となり、高速な変換処理が可能となる。
【００１７】
８点の変換処理における係数の数は、図７と比較すると１／４、図４と比較すると１／２となり、プロセッサ内に係数を保持するレジスタの数を低減できる。
【００１８】
この分割は任意の回数繰り返すことで、前後処理が若干増加するが、行列の係数、演算の規模が１／４、１／１６、．．．と縮小できるため、使用するプロセッサに最適な分割数を選択することができる。
【００１９】
逆離散コサイン変換に関しても上記の分割は成り立つため、同様に行列入れ替えを不要とした変換処理が可能となる。
【００２０】
【発明の実施の形態】
（１）離散コサイン変換の定義
１次元離散コサイン変換の定義式は図５のように示される。係数Ａの定義は式（３）であるが、画像圧縮のように離散コサイン変換後に量子化を行う場合、もしくは画像伸長のように逆離散コサイン変換前に逆量子化を行う場合には、その量子化時の重み係数にＡを乗じたものを利用すれば、離散コサイン変換の定義を式（３’）のように単純化できる。以後、１次元離散コサイン変換の定義式として式（２）、（３’）、（４）を利用することにする。
【００２１】
離散コサイン変換、逆離散コサイン変換の代表的な応用である静止画や動画の圧縮・伸長には、Ｎ＝８が利用される。そこで、図５の離散コサイン変換でＮ＝８とすれば、図７に示す行列演算で示すことができる。行列の要素は、図６に示すｃ（ｎ，Ｎ）の性質のうち、式（５）を利用して正規化している。逆離散コサイン変換は、この行列を転置した計算に相当するため、ここでは離散コサイン変換のみを扱う。
【００２２】
Ｎ＊Ｎ点の２次元離散コサイン変換は上記Ｎ点１次元離散コサイン変換を縦方向にＮ回、その後に横方向にＮ回（もしくは横方向にＮ回した後に縦方向にＮ回）行ったものとして定義できる。つまり、Ｎ＊Ｎ点の２次元離散コサイン変換は、Ｎ点の１次元離散コサイン変換２Ｎ回に分解できる。以後、１次元離散コサイン変換を中心に扱う。
【００２３】
（２）８点離散コサイン変換の分割
ここで、式（２）をｋ＝２Ｋ、ｋ＝２Ｋ＋１とおき、偶数出力、奇数出力に分離する。式（６）、式（７）の性質を利用すると、偶数側は式（９）のように、奇数側は式（１０）のように変形できる。また、奇数側でＫ＝Ｎ／２−１のとき、つまり、Ｘ［Ｎ−１］は式（６）の性質を利用すると、式（１１）のように導かれる。式（１０）、式（９）、式（１１）、にＮ＝８を代入してまとめると式（１）の行列表現が得られる。この行列の左上と右下の４＊４成分に注目すると、４点離散コサイン変換と同一の行列になっており、以上の式変形によって、８点離散コサイン変換はある前処理と後処理を追加することによって、２つの４点離散コサイン変換に分割できることを示すことができた。図２はそのバタフライ図を示したものである。
【００２４】
逆離散コサイン変換を行う場合には、演算の手順を出力側から入力側へ、逆順で行なう。その際、係数行列の逆行列が必要となるが、離散コサイン変換の係数行列の性質から、転置行列を用いればよい。以下で示す実施例では、離散コサイン変換のみを扱うが、ここで述べた逆順に処理する方法を用いて、逆離散コサイン変換も実現できることを述べておく。
【００２５】
（３）実施例１
本実施例の変換装置の構成を図１に示す。変換装置１０１はプロセッサ１０２と記憶部１０３によって構成され、外部に接続された入力装置１１１からデータを入力し、画像の圧縮・伸長等の変換処理を行ない、出力装置１１２から変換結果を出力する。プロセッサ１０２と記憶部１０３はアドレスバス１０４とデータバス１０５によって接続され、プロセッサ内部のアドレス生成器１０６で計算されたアドレスによって記憶装置の番地を指定し、データバスを通じてプロセッサ内のレジスタファイル１０７とのデータ転送を行なう。記憶部１０３はプログラム記憶装置１０９とデータ記憶装置１１０から構成される。演算器１０８はレジスタファイルの内容を読み出して演算処理を行ない、再びレジスタファイルに結果を書き戻す。レジスタファイル１０７の構成を図１０に示す。Ｒ０〜Ｒ１５で構成されるレジスタファイル１、ＸＲ０〜ＸＲ１５で構成されるレジスタファイル２から構成され、レジスタファイル０のレジスタＲｎ、Ｒｎ＋１、Ｒｎ＋２、Ｒｎ＋３を組みにして、ＶＲｎと呼ぶことにする。
【００２６】
ここで、プロセッサ１０２の命令とその動作を定義する。まず、４＊４行列と要素数４のベクトルとの行列積を行うＴＲＶ命令を、以下のように記述するものとする。
【００２７】
ＴＲＶＶＲｓ，ＶＲｄ（ｓ，ｄ＝４ｎ）
ＴＲＶ命令は、レジスタファイル１の１６本のレジスタを４＊４行列、レジスタファイル０内レジスタ群、ＶＲｓを４次元ベクトルとみなし、その行列とベクトルの乗算結果をＶＲｄへ格納する。図（１１）はＴＲＶ命令の動作内容を示したものである。
【００２８】
次に、加算、減算、乗算を行う命令として、ＡＤＤ、ＳＵＢ、ＭＵＬ、ＡＤＤ４、ＳＵＢ４、ＭＵＬ４を以下のように定義する。
【００２９】
ＡＤＤＲｓ，Ｒｔ，Ｒｄ
ＳＵＢＲｓ，Ｒｔ，Ｒｄ
ＭＵＬＲｓ，Ｒｔ，Ｒｄ
ＡＤＤ４ＶＲｓ，ＶＲｔ，ＶＲｄ（ｓ，ｔ，ｄ＝４ｎ）
ＳＵＢ４ＶＲｓ，ＶＲｔ，ＶＲｄ（ｓ，ｔ，ｄ＝４ｎ）
ＭＵＬ４ＶＲｓ，ＶＲｔ，ＶＲｄ（ｓ，ｔ，ｄ＝４ｎ）
ＡＤＤ、ＳＵＢ、ＭＵＬ命令は、レジスタＲｓと、Ｒｔについて、加算、減算、乗算をし、結果をレジスタＲｄへ格納する。ＡＤＤ４、ＳＵＢ４、ＭＵＬ４命令は、レジスタ群ＶＲｓとのレジスタ群ＶＲｔの対応する要素について、加算、減算、乗算をし、結果をＶＲｄへ格納する。図（１２）、図（１３）、図（１４）はＡＤＤ４命令、ＳＵＢ４命令、ＭＵＬ４命令、の動作内容を示したものである。さらに、主メモリからレジスタにデータをロード、逆にレジスタから主メモリにデータをストアする命令として、以下の４命令を定義する。
【００３０】
ＬＤｂ，ｄｉｓｐ，Ｒｄ
ＳＴｂ，ｄｉｓｐ，Ｒｓ
ＬＤ４ｂ，ｄｉｓｐ，ｓｔｅｐ，ＶＲｄ（ｄ＝４ｎ）
ＳＴ４ｂ，ｄｉｓｐ，ｓｔｅｐ，ＶＲｓ（ｓ＝４ｎ）
ＬＤ命令は、主メモリのアドレス（ｂ＋ｄｉｓｐ）番地に格納されているデータをレジスタＲｄにロードする。ＳＴ命令は、レジスタＲｓの値を主メモリのアドレス（ｂ＋ｄｉｓｐ）番地にストアする。ＬＤ４命令は、主メモリのアドレス（ｂ＋ｄｉｓｐ）、（ｂ＋ｄｉｓｐ＋ｓｔｅｐ）、（ｂ＋ｄｉｓｐ＋２＊ｓｔｅｐ）、（ｂ＋ｄｉｓｐ＋３＊ｓｔｅｐ）番地、に格納されているデータをレジスタファイル１内のレジスタＲｄ、Ｒｄ＋１、Ｒｄ＋２、Ｒｄ＋３にロードする。ＳＴ４命令は、レジスタＲｓ、Ｒｓ＋１、Ｒｓ＋２、Ｒｓ＋３の値を主メモリのアドレス（ｂ＋ｄｉｓｐ）、（ｂ＋ｄｉｓｐ＋ｓｔｅｐ）、（ｂ＋ｄｉｓｐ＋２＊ｓｔｅｐ）、（ｂ＋ｄｉｓｐ＋３＊ｓｔｅｐ）番地にストアする。
【００３１】
最後に、レジスタファイル０、レジスタファイル１の内容を入れ替えるＥＸＣＨＧ命令を定義する。
【００３２】
ＥＸＣＨＧ
以下に、上記の装置で８＊８の２次元離散コサイン変換を用いた画像変換処理を行なう例を示す。
【００３３】
図１の記憶装置１０３には、図１５に示すように、変換プログラム、４＊４の行列データ、４つの係数データ、８＊８画素＊Ｂブロックの画像データが主記憶上に格納されているとする。
【００３４】
まず、係数行列データとバタフライ演算係数データをレジスタにロードする。このロード作業は、Ｂブロックの離散コサイン変換処理の最初にただ１回だけ行えばよい。この操作により、レジスタファイルには図１０に示す係数がロードされ、これらの係数はＢブロック変換作業中に変更されない。
【００３５】
＃行列、係数ロード
ＬＤ４ＭＡＴＲＩＸ，０，１，Ｒ０
ＬＤ４ＭＡＴＲＩＸ，４，１，Ｒ４
ＬＤ４ＭＡＴＲＩＸ，８，１，Ｒ８
ＬＤ４ＭＡＴＲＩＸ，１２，１，Ｒ１２
ＥＸＣＨＧ
ＬＤ４ＣＯＥＦＦ，０，１，１２
次に８点１次元離散コサイン変換を行う命令列を示す。ＯＦＦはＩＭＧからのオフセットを示し第１回目の処理では０とする。
【００３６】
＃８点離散コサイン変換（横）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，１，ＶＲ８
ＬＤ４ＩＭＧ，７＋ＯＦＦ，−１，ＶＲ４
ＡＤＤ４ＶＲ８，ＶＲ４，ＶＲ０
ＳＵＢ４ＶＲ８，ＶＲ４，ＶＲ４
ＭＵＬ４ＶＲ４，ＶＲ１２，ＶＲ４
ＴＲＶＶＲ０，ＶＲ０
ＴＲＶＶＲ４，ＶＲ４
ＡＤＤＲ４，Ｒ５，Ｒ４
ＡＤＤＲ５，Ｒ６，Ｒ５
ＡＤＤＲ６，Ｒ７，Ｒ６
ＳＴ４ＩＭＧ，０＋ＯＦＦ，２，ＶＲ０
ＳＴ４ＩＭＧ，１＋ＯＦＦ，２，ＶＲ４
ＯＦＦを８ずつ増加させながらこの１２命令で構成される変換処理を８回行なうことで、８＊８画素に対して横方向の１次元離散コサイン変換が完了する。その後、縦方向の変換を行うために、ＯＦＦを０、１、．．．、７と変化させながら以下の命令列を８回行なう。
【００３７】
＃８点離散コサイン変換（縦）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，８，ＶＲ８
ＬＤ４ＩＭＧ，５６＋ＯＦＦ，−８，ＶＲ４
ＡＤＤ４ＶＲ８，ＶＲ４，ＶＲ０
ＳＵＢ４ＶＲ８，ＶＲ４，ＶＲ４
ＭＵＬ４ＶＲ４，ＶＲ１２，ＶＲ４
ＴＲＶＶＲ０，ＶＲ０
ＴＲＶＶＲ４，ＶＲ４
ＡＤＤＲ４，Ｒ５，Ｒ４
ＡＤＤＲ５，Ｒ６，Ｒ５
ＡＤＤＲ６，Ｒ７，Ｒ６
ＳＴ４ＩＭＧ，０＋ＯＦＦ，１６，ＶＲ０
ＳＴ４ＩＭＧ，５６＋ＯＦＦ，−１６，ＶＲ４
以上の操作により、８＊８の２次元離散コサイン変換を完了する。変換対象となるブロック数Ｂが十分大きいとすると、行列、係数ロードに必要な６命令を無視することができる。そのため、１ブロックあたり１９２命令で処理できるといえる。
【００３８】
従来例ではＴＲＶ命令毎に６命令の係数ロードが必要であり、１ブロックの変換中に３２回のＴＲＶ命令を使用することから、さらに６命令＊３２回＝１９２命令の追加となる。本発明により、命令数を半分に削減できたと言える。
【００３９】
（４）実施例２
本実施例は、行列演算命令ではなくベクトル内積演算命令を持つプロセッサを用いた場合の実装を示す。係数、データは実施例１で示した図１５のように主記憶上に配置されているとする。
【００４０】
本実施例で使用するプロセッサは、実施例１と以下の相違点を持つとする。
１．図１６に示す、３２本のレジスタから構成されるレジスタファイルだけを１つだけ持ち、そのためＥＸＣＨＧ命令は持たない
２．ＴＲＶ命令の代わりに、ＩＰＲ命令を持つ
ＩＰＲＶＲｓ，ＶＲｔ，Ｒｄ（ｓ，ｔ＝４ｎ）
ＩＰＲ命令は、レジスタ群ＶＲｓと、ＶＲｔをそれぞれ４要素のベクトルとみなし、そのの内積をレジスタＲｄに格納する。図１７にその演算内容を示す。
【００４１】
以上のプロセッサを利用して、８＊８の離散コサイン変換、逆離散コサイン変換を行なう手順を以下に示す。
【００４２】
まず、行列データと係数データをレジスタにロードする。このロード作業は、Ｂブロックの離散コサイン変換処理の最初にただ１回だけ行えばよい。
【００４３】
＃行列、係数ロード
ＬＤ４ＣＯＥＦＦ，０，１，ＶＲ１２
ＬＤ４ＭＡＴＲＩＸ，０，４，ＶＲ１６
ＬＤ４ＭＡＴＲＩＸ，１，４，ＶＲ２０
ＬＤ４ＭＡＴＲＩＸ，２，４，ＶＲ２４
ＬＤ４ＭＡＴＲＩＸ，３，４，ＶＲ２８
次に８点１次元離散コサイン変換を行う命令列を示す。ＯＦＦはＩＭＧからのオフセットを示し第１回目の処理では０とする。
【００４４】
＃８点離散コサイン変換（横）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，１，８
ＬＤ４ＩＭＧ，７＋ＯＦＦ，−１，４
ＡＤＤ４ＶＲ８，ＶＲ４，ＶＲ０
ＳＵＢ４ＶＲ８，ＶＲ４，ＶＲ４
ＭＵＬ４ＶＲ４，ＶＲ１２，ＶＲ４
ＩＰＲＶＲ０，ＶＲ１６，Ｒ８
ＩＰＲＶＲ０，ＶＲ２０，Ｒ９
ＩＰＲＶＲ０，ＶＲ２４，Ｒ１０
ＩＰＲＶＲ０，ＶＲ２８，Ｒ１１
ＩＰＲＶＲ４，ＶＲ１６，Ｒ０
ＩＰＲＶＲ４，ＶＲ２０，Ｒ１
ＩＰＲＶＲ４，ＶＲ２４，Ｒ２
ＩＰＲＶＲ４，ＶＲ２８，Ｒ３
ＡＤＤＲ１，Ｒ９，Ｒ１
ＡＤＤＲ２，Ｒ１０，Ｒ２
ＡＤＤＲ３，Ｒ１１，Ｒ３
ＳＴ４ＩＭＧ，０＋ＯＦＦ，２，ＶＲ８
ＳＴ４ＩＭＧ，１＋ＯＦＦ，２，ＶＲ０
ＯＦＦを８ずつ増加させながらこの１８命令で構成される変換処理を８回行なうこでで、８＊８画素に対して横方向の１次元離散コサイン変換が完了する。その後、縦方向の変換を行うために、ＯＦＦを０、１、．．．、７と変化させながら以下の命令列を８回行なう。
【００４５】
＃８点離散コサイン変換（縦）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，８，８
ＬＤ４ＩＭＧ，５６＋ＯＦＦ，−８，４
ＡＤＤ４ＶＲ８，ＶＲ４，Ｒ０
ＳＵＢ４ＶＲ８，ＶＲ４，Ｒ４
ＭＵＬ４ＶＲ４，ＶＲ１２，Ｒ４
ＩＰＲＶＲ０，ＶＲ１６，Ｒ８
ＩＰＲＶＲ０，ＶＲ２０，Ｒ９
ＩＰＲＶＲ０，ＶＲ２４，Ｒ１０
ＩＰＲＶＲ０，ＶＲ２８，Ｒ１１
ＩＰＲＶＲ４，ＶＲ１６，Ｒ０
ＩＰＲＶＲ４，ＶＲ２０，Ｒ１
ＩＰＲＶＲ４，ＶＲ２４，Ｒ２
ＩＰＲＶＲ４，ＶＲ２８，Ｒ３
ＡＤＤＲ１，Ｒ９，Ｒ１
ＡＤＤＲ２，Ｒ１０，Ｒ２
ＡＤＤＲ３，Ｒ１１，Ｒ３
ＳＴ４ＩＭＧ，０＋ＯＦＦ，１６，ＶＲ８
ＳＴ４ＩＭＧ，５６＋ＯＦＦ，−１６，ＶＲ０
以上の操作により、８＊８の２次元離散コサイン変換を完了する。変換対象となるブロック数Ｂは十分大きいとすると、行列、係数ロードに必要な５命令を無視することができるため、１ブロックあたり２８８命令で２次元離散コサイン変換を処理できる。
【００４６】
（５）実施例３
本実施例は、行列演算命令ではなくＳＩＭＤ命令を持つプロセッサを用いた場合の実装を示す。係数、データは実施例１で示した図１５のように主記憶上に配置されているとする。
【００４７】
本実施例で使用するプロセッサは、実施例２と以下の相違点を持つとする。
１．ＩＰＲ命令の代わりに、ＭＡＣ４命令を持つ
２．ＭＵＬ４、ＭＡＣ４命令をブロードキャスト拡張した、ＭＵＬ４Ｂ、ＭＡＣ４Ｂ命令を持つ
ＭＡＣ４ＶＲｓ，ＶＲｔ，ＶＲｄ（ｓ，ｔ，ｄ＝４ｎ）
ＭＡＣ４命令は、レジスタＲｓ、Ｒｓ＋１、Ｒｓ＋２、Ｒｓ＋３と、レジスタＲｔ、Ｒｔ＋１、Ｒｔ＋２、Ｒｔ＋３のそれぞれの積を、レジスタＲｄ、Ｒｄ＋１、Ｒｄ＋２、Ｒｄ＋３に足し込む。
【００４８】
ＭＵＬ４ＢＶＲｓ，ＶＲｔ，ＶＲｄ，ｂ（ｓ，ｔ，ｄ＝４ｎ、ｂ＝０〜３）
ＭＡＣ４ＢＶＲｓ，ＶＲｔ，ＶＲｄ，ｂ（ｓ，ｔ，ｄ＝４ｎ、ｂ＝０〜３）
ＭＵＬ４Ｂ命令は、レジスタＲｓ＋ｂ、Ｒｓ＋ｂ、Ｒｓ＋ｂ、Ｒｓ＋ｂと、レジスタＲｔ、Ｒｔ＋１、Ｒｔ＋２、Ｒｔ＋３のそれぞれの積を、レジスタＲｄ、Ｒｄ＋１、Ｒｄ＋２、Ｒｄ＋３に格納する。ＭＡＣ４Ｂ命令は、レジスタＲｓ＋Ｒｂ、Ｒｓ＋ｂ、Ｒｓ＋ｂ、Ｒｓ＋ｂと、レジスタＲｔ、Ｒｔ＋１、Ｒｔ＋２、Ｒｔ＋３のそれぞれの積を、レジスタＲｄ、Ｒｄ＋１、Ｒｄ＋２、Ｒｄ＋３に足し込む。
【００４９】
以上のプロセッサを利用して、８＊８の離散コサイン変換、逆離散コサイン変換を行なう手順を以下に示す。
【００５０】
まず、行列データと係数データをレジスタにロードする。このロード作業は、Ｂブロックの離散コサイン変換処理の最初にただ１回だけ行えばよい。
【００５１】
＃行列、係数ロード
ＬＤ４ＣＯＥＦＦ，０，１，ＶＲ１２
ＬＤ４ＭＡＴＲＩＸ，０，４，ＶＲ１６
ＬＤ４ＭＡＴＲＩＸ，１，４，ＶＲ２０
ＬＤ４ＭＡＴＲＩＸ，２，４，ＶＲ２４
ＬＤ４ＭＡＴＲＩＸ，３，４，ＶＲ２８
次に８点１次元離散コサイン変換を行う命令列を示す。ＯＦＦはＩＭＧからのオフセットを示し第１回目の処理では０とする。
【００５２】
＃８点離散コサイン変換（横）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，１，ＶＲ８
ＬＤ４ＩＭＧ，７＋ＯＦＦ，−１，ＶＲ４
ＡＤＤ４ＶＲ８，ＶＲ４，ＶＲ０
ＳＵＢ４ＶＲ８，ＶＲ４，ＶＲ４
ＭＵＬ４ＶＲ４，ＶＲ１２，ＶＲ４
ＭＵＬ４ＢＶＲ０，ＶＲ１６，ＶＲ８，０
ＭＡＣ４ＢＶＲ０，ＶＲ２０，ＶＲ８，１
ＭＡＣ４ＢＶＲ０，ＶＲ２４，ＶＲ８，２
ＭＡＣ４ＢＶＲ０，ＶＲ２８，ＶＲ８，３
ＭＵＬ４ＢＶＲ４，ＶＲ１６，ＶＲ０，０
ＭＡＣ４ＢＶＲ４，ＶＲ２０，ＶＲ０，１
ＭＡＣ４ＢＶＲ４，ＶＲ２４，ＶＲ０，２
ＭＡＣ４ＢＶＲ４，ＶＲ２８，ＶＲ０，３
ＡＤＤＲ１，Ｒ９，Ｒ１
ＡＤＤＲ２，Ｒ１０，Ｒ２
ＡＤＤＲ３，Ｒ１１，Ｒ３
ＳＴ４ＩＭＧ，０＋ＯＦＦ，２，ＶＲ８
ＳＴ４ＩＭＧ，１＋ＯＦＦ，２，ＶＲ０
ＯＦＦを８ずつ増加させながらこの１８命令で構成される変換処理を８回行なうことで、８＊８画素に対して横方向の１次元離散コサイン変換が完了する。その後、縦方向の変換を行うために、ＯＦＦを０、１、．．．、７と変化させながら以下の命令列を８回行なう。
【００５３】
＃８点離散コサイン変換（縦）
ＬＤ４ＩＭＧ，０＋ＯＦＦ，８，ＶＲ８
ＬＤ４ＩＭＧ，５６＋ＯＦＦ，−８，ＶＲ４
ＡＤＤ４ＶＲ８，ＶＲ４，ＶＲ０
ＳＵＢ４ＶＲ８，ＶＲ４，ＶＲ４
ＭＵＬ４ＶＲ４，ＶＲ１２，ＶＲ４
ＭＵＬ４ＢＶＲ０，ＶＲ１６，ＶＲ８，０
ＭＡＣ４ＢＶＲ０，ＶＲ２０，ＶＲ８，１
ＭＡＣ４ＢＶＲ０，ＶＲ２４，ＶＲ８，２
ＭＡＣ４ＢＶＲ０，ＶＲ２８，ＶＲ８，３
ＭＵＬ４ＢＶＲ４，ＶＲ１６，ＶＲ０，０
ＭＡＣ４ＢＶＲ４，ＶＲ２０，ＶＲ０，１
ＭＡＣ４ＢＶＲ４，ＶＲ２４，ＶＲ０，２
ＭＡＣ４ＢＶＲ４，ＶＲ２８，ＶＲ０，３
ＡＤＤＲ１，Ｒ９，Ｒ１
ＡＤＤＲ２，Ｒ１０，Ｒ２
ＡＤＤＲ３，Ｒ１１，Ｒ３
ＳＴ４ＩＭＧ，０＋ＯＦＦ，１６，ＶＲ８
ＳＴ４ＩＭＧ，５６＋ＯＦＦ，−１６，ＶＲ０
以上の操作により、８＊８の２次元離散コサイン変換を完了する。変換対象となるブロック数Ｂは十分大きいとすると、行列、係数ロードに必要な５命令を無視することができるため、１ブロックあたり２８８命令で２次元離散コサイン変換を処理できる。
【００５４】
（６）実施例４
本実施例は、本発明の処理手順をＣ言語で記述して利用する。係数、データは実施例１で示した図１５のように主記憶上に配置されているとする。
【００５５】
まず、図１８、図１９に示すように離散コサイン変換、逆離散コサイン変換に用いるデータタイプをＤＣＴＴＹＰＥとして定義し、同タイプの４＊４の２次元大域配列定数としてＭ［４］［４］、Ｃ４［４］を宣言する。Ｍには４＊４行列の値が、Ｃ４には４つの係数が設定されている。これは実施例１のレジスタＸＲ０〜ＸＲ１５、Ｒ１２〜Ｒ１５に相当する。次に、下位関数としてｌｄ４（）、ｓｔ４（）、ａｄｄ４（）、ｓｕｂ４（）、ｍｕｌ４（）、ｔｒｖ（）、を定義する。ｌｄ４（）はポインタａｄｒの示すメモリアドレスから４つのＤＣＴＴＹＰＥの値を取りだし、長さ４の１次元配列ＶＲに代入する。逆に、ｓｔ４はＶＲの値をポインタａｄｒの指し示すアドレスに書き込む。ａｄｄ４（）、ｓｕｂ４（）、ｍｕｌ４（）は、長さ４の１次元配列ＶＲ１、ＶＲ２の各要素について、それぞれ加算、減算、乗算を行ない、その結果をＶＲ３に代入する。ｔｒｖ（）は、４＊４の２次元配列Ｍで示される行列と、長さ４の１次元配列ＶＲ１の積を計算し、ＶＲ２に結果を代入する。これらの関数は、実施例１のＬＤ４、ＳＴ４、ＡＤＤ４、ＳＵＢ４、ＭＵＬ４、ＴＲＶに相当する処理を行なう。
【００５６】
以上の変数、下位間数をもとに、８点の離散コサイン変換、８＊８の２次元離散コサイン変換行なう関数は図２０のｄｃｔ８（）、ｄｃｔ８＿８（）ように書くことができる。このプログラムは使用するプロセッサごとに用意されたＣコンパイラが最適化を行なうため、その実行性能はコンパイラ性能にも依存しているといえる。しかし、本発明のアルゴリズムは、８＊８の離散コサイン変換に４＊４の係数行列しか利用しないため、図７に示す８＊８行列演算をそのまま実装した、６４の係数を利用するプログラムと比べて、主記憶とプロセッサとのロード／ストアの回数を減らすことができ、コンパイラがより高速な機械語命令列を出力することが期待できる。実施例１、２のように、行列演算命令や、ＳＩＭＤ命令を搭載したプロセッサの場合には、コンパイラのビルトイン機能を用いて機械語で記述した場合により近い結果を得らる。また下位関数のみ機械語記述をすることも可能である。
【００５７】
（７）実施例５
本実施例は、実施例４の８点離散コサイン変換を利用して、１６点の離散コサイン変換を行う事例を示す。本発明の分割手法を用いて、１６点の変換を８点に分割し、８点の離散コサイン変換を実施例４の関数で処理することで、４＊４の係数行列を固定したまま１６点の離散コサイン変換を行う。
【００５８】
先ほど導いた式（９）、式（１０）、式（１１）、にＮ＝１６を代入してまとめると図２１に示すように、２つの８点離散コサイン変換に分割できることが分かる。これは、１６点離散コサイン変換がある前処理と後処理を追加することによって、２つの８点離散コサイン変換に分割できることを示すものである。
【００５９】
この特徴を利用して、１６点離散コサイン変換を図１８、図１９に示す関数と、図２０に示すｄｃｔ８（）関数を利用して記述すると図２２、図２３、図２４のようになる。
【００６０】
同様にして、３２点、６４点、．．．、２＾ｎ点の離散コサイン変換も４＊４係数行列を用いて求めることができる。
【００６１】
（８）実施例６
本実施例の変換装置は図１の変換装置のプログラム記憶装置１０９をプロセッサ１０２と同一チップに集積する点で実施例１と異なる。実行可能な命令、処理内容は全て実施例１と同一である。
【００６２】
（９）実施例７
本実施例の変換装置は図１の変換装置のデータ記憶装置１１０をプロセッサ１０２と同一チップに集積する点で実施例１と異なる。実行可能な命令、処理内容は全て実施例１と同一である。
【００６３】
（１０）実施例８
本実施例の変換装置の構成を図２５に示す。変換装置２５０１はプロセッサ２５０２によって構成される。プロセッサ２５０２はアドレス生成器１０６、レジスタファイル１０７、演算器１０８に加えてプログラム記憶装置１０９およびデータ記憶装置１１０が内蔵されている点で実施例１と異なる。実行可能な命令、処理内容は全て実施例１と同一である。
【００６４】
【発明の効果】
本発明は、Ｎ点の離散コサイン変換、逆離散コサイン変換を、ただ１つの固定された（Ｎ／２＾ｋ）＊（Ｎ／２＾ｋ）の係数行列を利用して計算することを可能とする。これにより、従来は行列演算ごとに必要であった係数行列の入れ替え操作を不要とし、そのオーバーヘッドを取り除くことができるため、演算器の利用効率が向上する。
【００６５】
また、演算の規模や係数の格納に必要なレジスタ数も減少し、回路面積を縮小することができる。これは、汎用のプロセッサに行列演算器やそれに類する演算装置を追加し、回路面積の増加を抑えながら高速な画像や音声の圧縮・伸長処理装置を構成することを可能とする。
【図面の簡単な説明】
【図１】本発明を適用したデータ変換装置。
【図２】本発明を適用した８点離散コサイン変換の分割（バタフライ図）。
【図３】本発明を適用した８点離散コサイン変換の分割（行列表現）。
【図４】従来技術の８点離散コサイン変換（行列表現）。
【図５】離散コサイン変換の定義式。
【図６】離散コサイン変換、逆離散コサイン変換で使用する係数ｃ（ｋ，Ｎ）の性質。
【図７】８点１次元離散コサイン変換の定義式に準じた行列表現。
【図８】本発明による離散コサイン変換の式変形。
【図９】本発明による８点離散コサイン変換の式変形（Ｘ［Ｎ−１］）。
【図１０】実施例１のレジスタの構成。
【図１１】実施例１におけるＴＲＶ命令の処理内容。
【図１２】実施例１におけるＡＤＤ４命令の処理内容。
【図１３】実施例１におけるＳＵＢ４命令の処理内容。
【図１４】実施例１におけるＭＵＬ４命令の処理内容。
【図１５】８＊８点２次元離散コサイン変換で使用するデータの主記憶配置。
【図１６】実施例２のレジスタの構成。
【図１７】実施例２におけるＩＰＲ命令の処理内容。
【図１８】実施例４の４＊４行列を利用した８点離散コサイン変換のＣ言語記述（下位関数１）。
【図１９】実施例４の４＊４行列を利用した８点離散コサイン変換のＣ言語記述（下位関数２）。
【図２０】実施例４の８点１次元離散コサイン変換、８＊８点２次元離散コサイン変換のＣ言語記述（上位関数）。
【図２１】本発明による１６点離散コサイン変換の分割（バタフライ図）。
【図２２】実施例５の４＊４行列を利用した１６点離散コサイン変換のＣ言語記述（下位関数）。
【図２３】実施例５の１６点１次元離散コサイン変換のＣ言語記述（上位関数）。
【図２４】実施例５の１６＊１６点２次元離散コサイン変換のＣ言語記述（上位関数）。
【図２５】実施例８の装置の構成。
【符号の説明】
１０１：変換装置
１０２：プロセッサ
１０３：記憶部
１０４：アドレスバス
１０５：データバス
１０６：アドレス生成器
１０７：レジスタファイル
１０８：演算器
１０９：プログラム記憶装置
１１０：データ記憶装置
１１１：入力装置
１１２：出力装置
２５０１：変換装置
２５０２：プロセッサ。
[0001]
[Technical field of the invention]
The present invention relates to data conversion processing devices such as microprocessors and microcomputers, and in particular to data processing devices for image and audio processing applications that perform the discrete cosine transform, the inverse discrete cosine transform, and the like.
[0002]
[Prior art]
Digitized images and audio involve huge amounts of data, which is a problem for storage and transmission. Countermeasures are therefore taken, such as compressing data before storage and decompressing it when used, or compressing before transmission and decompressing after reception.
[0003]
Image compression and decompression are described below as an example.
[0004]
Compression
1. Two-dimensional discrete cosine transform
2. Quantization
3. Huffman coding
Decompression
4. Huffman decoding
5. Inverse quantization
6. Two-dimensional inverse discrete cosine transform
The two-dimensional discrete cosine transform of step 1 and the two-dimensional inverse discrete cosine transform of step 6 operate on a two-dimensional block of 8 * 8 pixels, and the result is likewise a two-dimensional block of 8 * 8 element values.
[0005]
The two-dimensional discrete cosine transform of step 1 usually leaves many of the elements corresponding to the high-frequency components of the 8 * 8 block close to 0, and the weighted quantization operation turns the majority of them into 0. The Huffman coding of step 3 converts the 8 * 8 block elements into a bitstream. Because this conversion exploits the large number of zeros among the elements, the number of bytes required for the resulting bitstream is said to be about 1/10.
[0006]
In decompression, the Huffman decoding of step 4 performs the reverse of step 3 and generates an 8 * 8 block of elements from the bitstream. The inverse quantization of step 5 restores the elements as they were before quantization by multiplying by the reciprocal of the weights applied in the quantization of step 2. The two-dimensional inverse discrete cosine transform of step 6 performs the reverse of step 1 and restores the 8 * 8 pixels.
[0007]
The effort required for the above compression and decompression operations is as follows. On an ordinary general-purpose processor, compression and decompression each require 2000 to 3000 instructions per 8 * 8 block. For a 640 * 480 full-color (24 bits/pixel) image this is multiplied by 14400, so 28.8M to 43.2M instructions must be executed per still image. On a processor that executes one instruction per clock and runs at 100 MHz, a compression or decompression rate of only 2 to 4 frames per second is obtained, so it is difficult to capture or play back images in real time to obtain a video-like effect.
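The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that "24 bits/pixel" means three 8-bit components, each split into its own 8 * 8 blocks, which is what makes the factor come out at 14400; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the instruction counts quoted in the text.
WIDTH, HEIGHT = 640, 480
COMPONENTS = 3             # assumption: 24 bits/pixel = three 8-bit components
BLOCK = 8                  # 8 * 8 blocks
CLOCK_HZ = 100_000_000     # 100 MHz, one instruction per clock

blocks = (WIDTH // BLOCK) * (HEIGHT // BLOCK) * COMPONENTS

for per_block in (2000, 3000):
    total = blocks * per_block          # instructions per still image
    fps = CLOCK_HZ / total              # images per second at one instruction per clock
    print(f"{per_block} instr/block: {total / 1e6:.1f}M instr/image, {fps:.2f} frames/s")
```

Under these assumptions the rate lands between roughly 2.3 and 3.5 frames per second, consistent with the "2 to 4 frames per second" stated above.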
[0008]
For this reason, a common approach has been to provide a separate processing device equipped with special dedicated arithmetic units to assist image and audio compression and decompression.
[0009]
Another solution is to build the conversion processing device using instructions that general-purpose processors have also come to include: matrix/vector operation instructions and SIMD (Single Instruction Multiple Data) instructions that process several multiply-accumulate operations at once. Because the discrete cosine transform is a linear transform, an N-point transform can be expressed as an N * N matrix, and the transform is completed by computing a matrix product with these matrix/vector or SIMD instructions.
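As a minimal sketch of the statement that an N-point transform is an N * N matrix product, the following builds the matrix for the unnormalized transform X[k] = sum over n of x[n] * cos(pi * (2n + 1) * k / (2N)). The exact scale factors of the patent's equations appear only in its figures, so omitting them here is an assumption, and the function names are hypothetical.

```python
import math

def dct_matrix(n_points):
    """N * N matrix with entries cos(pi*(2n+1)*k / (2N)); scale factors omitted (assumption)."""
    return [[math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
             for n in range(n_points)]
            for k in range(n_points)]

def mat_vec(matrix, vector):
    """Plain matrix-vector product: one multiply-accumulate chain per output."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

C8 = dct_matrix(8)                       # 64 coefficients for the 8-point case
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
X = mat_vec(C8, x)                       # X[0] is the plain sum of x, since row k = 0 is all ones
```

The 8-point case needs all 64 entries of `C8`, which is the register pressure discussed in the following paragraphs.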
[0010]
[Problems to be solved by the invention]
As stated above, the transformation matrix used in the second type of processing device has size N * N for an N-point discrete cosine transform. For example, when the 8 * 8 two-dimensional discrete cosine transform is processed repeatedly, the matrix operation shown in FIG. 7, which obtains 8 outputs from 64 coefficients and 8 input data, is repeated 8 times in the vertical direction and 8 times in the horizontal direction.
[0011]
Since the 64 coefficients are constants, it would be ideal to keep them inside the processor during processing. In practice, however, the limited number of registers means that the 64 coefficients are placed in main memory and processing proceeds while reading parts of them in sequentially.
[0012]
Also, to perform the above 8-point transform efficiently it would be desirable to provide an 8 * 8 matrix arithmetic unit or a SIMD arithmetic unit with a parallelism of 8. Considering usefulness in fields such as three-dimensional graphics and circuit area, however, 4 * 4 matrix operations and SIMD operations with a parallelism of 4 are the reasonable choice. For the 32-point discrete cosine transform used in audio compression and decompression, the number of coefficients and the circuit scale make this even harder to realize.
[0013]
In the embodiment of Japanese Patent Laid-Open No. 9-212484, as shown in FIG. 4, decomposing the 8 * 8 discrete cosine transform into an even-function part and an odd-function part makes the upper-right and lower-left 4 * 4 components of the 8 * 8 matrix zero; this appears in an intermediate equation on the way to that document's conclusion. It implies that, on a processor equipped with an arithmetic unit that computes the product of a 4 * 4 matrix and a four-dimensional vector, the transform can be performed with two matrix operations plus a small amount of pre-processing and post-processing.
[0014]
However, because two different 4 * 4 coefficient matrices are used, the 4 * 4 coefficient data must be swapped every time a matrix operation is performed in the discrete cosine transform. For example, if the ratio of cycles spent on matrix operations to cycles spent on swapping coefficients is 1:1, the utilization of the matrix arithmetic unit can be estimated at 50% or less.
[0015]
An object of the present invention is to reduce the overhead of swapping the transform coefficient matrix and thereby raise the utilization of the arithmetic unit, in order to speed up the discrete cosine transform and inverse discrete cosine transform performed by matrix operations and SIMD operations.
[0016]
[Means for Solving the Problems]
To solve the above problem, the present invention exploits a property of the defining equation of the discrete cosine transform. By adding a butterfly operation as pre-processing and an addition step as post-processing, an N-point discrete cosine transform can be divided into two N/2-point discrete cosine transforms. Because only a single (N/2) * (N/2) matrix is then used as the coefficient matrix, no matrix swapping is needed and high-speed transform processing becomes possible.
[0017]
The number of coefficients in the 8-point transform is 1/4 of that in FIG. 7 and 1/2 of that in FIG. 4, so the number of registers needed to hold coefficients in the processor can be reduced.
[0018]
This division can be repeated any number of times. Each repetition slightly increases the pre-processing and post-processing, but the number of matrix coefficients and the scale of the operation shrink to 1/4, 1/16, ..., so the number of divisions best suited to the processor in use can be selected.
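The repeated division can be sketched recursively. The code below is an illustration under stated assumptions, not the patent's own listing: it uses the unnormalized transform X[k] = sum over n of x[n] * cos(pi * (2n + 1) * k / (2N)), and the pre-multiplication factor 1 / (2 * cos(pi * (2n + 1) / (2N))) is derived here from a product-to-sum identity, since the patent's Equations (9) to (11) appear only in its figures. The names `dct_direct` and `dct_split` are hypothetical.

```python
import math

def dct_direct(x):
    """Unnormalized DCT computed from the definition (the only place a matrix is needed)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
            for k in range(n)]

def dct_split(x, base=4):
    """Halve the transform until `base` points remain, then use the base-size matrix."""
    n = len(x)
    if n <= base:
        return dct_direct(x)
    half = n // 2
    # Pre-processing: butterfly, then scale the difference branch.
    u = [x[i] + x[n - 1 - i] for i in range(half)]
    v = [(x[i] - x[n - 1 - i]) / (2 * math.cos(math.pi * (2 * i + 1) / (2 * n)))
         for i in range(half)]
    even = dct_split(u, base)          # gives X[0], X[2], X[4], ...
    w = dct_split(v, base)
    # Post-processing: adjacent sums give X[1], X[3], ...; the last odd output is w[-1].
    odd = [w[k] + w[k + 1] for k in range(half - 1)] + [w[half - 1]]
    out = [0.0] * n
    out[0::2] = even
    out[1::2] = odd
    return out
```

With `base=4`, a 32-point call only ever evaluates 4-point transforms, so a single 4 * 4 coefficient set suffices at the cost of three levels of butterflies and additions.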
[0019]
The above division also holds for the inverse discrete cosine transform, so transform processing that requires no matrix swapping is likewise possible.
[0020]
DETAILED DESCRIPTION OF THE INVENTION
(1) Definition of discrete cosine transform
The defining equation of the one-dimensional discrete cosine transform is shown in FIG. 5. The coefficient A is defined by Equation (3). However, when quantization follows the discrete cosine transform, as in image compression, or when inverse quantization precedes the inverse discrete cosine transform, as in image decompression, the definition can be simplified as in Equation (3') by using quantization weights that have been multiplied by A. From here on, Equations (2), (3'), and (4) are used as the defining equations of the one-dimensional discrete cosine transform.
[0021]
N = 8 is used for the compression and decompression of still images and video, which are typical applications of the discrete cosine transform and inverse discrete cosine transform. Setting N = 8 in the discrete cosine transform of FIG. 5 gives the matrix operation shown in FIG. 7. The matrix elements are normalized using Equation (5), one of the properties of c(n, N) shown in FIG. 6. Since the inverse discrete cosine transform corresponds to a calculation with the transpose of this matrix, only the discrete cosine transform is treated here.
[0022]
The N * N-point two-dimensional discrete cosine transform can be defined as the above N-point one-dimensional discrete cosine transform performed N times in the vertical direction and then N times in the horizontal direction (or N times horizontally and then N times vertically). In other words, an N * N-point two-dimensional discrete cosine transform can be decomposed into 2N N-point one-dimensional discrete cosine transforms. The following discussion therefore focuses on the one-dimensional discrete cosine transform.
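The row-column decomposition described here can be illustrated as follows. This is a sketch using the unnormalized one-dimensional transform (scale factors omitted as an assumption); `dct_2d` and `dct_2d_rows_first` are hypothetical names for the two orderings mentioned in the text.

```python
import math

def dct_1d(x):
    """Unnormalized N-point DCT: X[k] = sum_n x[n] * cos(pi*(2n+1)*k / (2N))."""
    n_points = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
                for n in range(n_points))
            for k in range(n_points)]

def transpose(block):
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    """N vertical transforms followed by N horizontal ones (2N one-dimensional DCTs)."""
    columns_done = transpose([dct_1d(col) for col in transpose(block)])
    return [dct_1d(row) for row in columns_done]

def dct_2d_rows_first(block):
    """The same transform with the horizontal pass performed first."""
    rows_done = [dct_1d(row) for row in block]
    return transpose([dct_1d(col) for col in transpose(rows_done)])
```

Both orderings give the same result up to rounding, because the two passes act on different indices of the block.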
[0023]
(2) Division of 8-point discrete cosine transform
Here, Equation (2) is separated into even outputs and odd outputs by setting k = 2K and k = 2K + 1. Using the properties in Equations (6) and (7), the even side can be rewritten as Equation (9) and the odd side as Equation (10). For the odd side with K = N/2 - 1, that is, for X[N-1], the property in Equation (6) leads to Equation (11). Substituting N = 8 into Equations (9), (10), and (11) and collecting the results gives the matrix representation of Equation (1). The upper-left and lower-right 4 * 4 components of this matrix are identical to the matrix of the 4-point discrete cosine transform. The above manipulation thus shows that the 8-point discrete cosine transform can be divided into two 4-point discrete cosine transforms by adding certain pre-processing and post-processing. FIG. 2 shows the corresponding butterfly diagram.
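A concrete 8-point version of this split may help. The sketch below is written under assumptions: the transform is taken as unnormalized, and the four pre-multiplication coefficients `COEF[n] = 1 / (2 * cos(pi * (2n + 1) / 16))` are derived from the identity cos(2K*t) + cos((2K + 2)*t) = 2 * cos((2K + 1)*t) * cos(t) rather than copied from the patent's figures, which are not reproduced in the text. `C4`, `COEF`, and `dct8_split` are illustrative names.

```python
import math

# Single 4 * 4 coefficient matrix shared by both halves (4-point DCT matrix).
C4 = [[math.cos(math.pi * (2 * n + 1) * k / 8) for n in range(4)] for k in range(4)]
# Four butterfly coefficients applied to the difference branch (derived, see lead-in).
COEF = [1.0 / (2.0 * math.cos(math.pi * (2 * n + 1) / 16)) for n in range(4)]

def mat_vec4(matrix, vec):
    return [sum(matrix[k][n] * vec[n] for n in range(4)) for k in range(4)]

def dct8_split(x):
    """8-point DCT from two products with the same 4 * 4 matrix."""
    u = [x[n] + x[7 - n] for n in range(4)]               # pre-processing: sums
    v = [(x[n] - x[7 - n]) * COEF[n] for n in range(4)]   # pre-processing: scaled differences
    even = mat_vec4(C4, u)                                # X[0], X[2], X[4], X[6]
    w = mat_vec4(C4, v)
    odd = [w[0] + w[1], w[1] + w[2], w[2] + w[3], w[3]]   # post-processing: three additions
    return [even[0], odd[0], even[1], odd[1], even[2], odd[2], even[3], odd[3]]
```

The last odd output needs no addition because the term that would be added, the 4-point sum evaluated at index K = 4, is identically zero; this corresponds to the separate treatment of X[N-1] in the text.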
[0024]
To perform the inverse discrete cosine transform, the steps are carried out in reverse order, from the output side to the input side. The inverse of the coefficient matrix is then needed, but owing to the properties of the discrete cosine transform coefficient matrix the transposed matrix can be used. The embodiments below treat only the discrete cosine transform; note that the inverse discrete cosine transform can also be realized with the reverse-order processing described here.
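The reverse-order procedure can be sketched for the 8-point case as follows. This is an illustration under assumptions: the forward transform is taken as unnormalized, so the transposed 4 * 4 matrix is used together with an explicit scale (1/2 overall, with half weight on the first coefficient); with the normalization of the patent's figures that scale would presumably be absorbed elsewhere. `C4`, `COEF`, `inv_dct4`, and `idct8_split` are hypothetical names.

```python
import math

C4 = [[math.cos(math.pi * (2 * n + 1) * k / 8) for n in range(4)] for k in range(4)]
COEF = [1.0 / (2.0 * math.cos(math.pi * (2 * n + 1) / 16)) for n in range(4)]

def inv_dct4(Y):
    """Inverse of the unnormalized 4-point DCT: transposed C4 plus scaling (assumption)."""
    return [0.5 * (0.5 * Y[0] * C4[0][n] + sum(Y[k] * C4[k][n] for k in range(1, 4)))
            for n in range(4)]

def idct8_split(X):
    """Undo post-additions, apply the transposed matrix twice, undo the butterfly."""
    even = X[0::2]
    w = [0.0] * 4
    w[3] = X[7]                 # last odd output was passed through unchanged
    w[2] = X[5] - w[3]
    w[1] = X[3] - w[2]
    w[0] = X[1] - w[1]
    u = inv_dct4(even)
    v = [value / c for value, c in zip(inv_dct4(w), COEF)]
    x = [0.0] * 8
    for n in range(4):
        x[n] = (u[n] + v[n]) / 2
        x[7 - n] = (u[n] - v[n]) / 2
    return x
```

As in the forward direction, only one 4 * 4 coefficient set is involved, so no matrix swap is needed between the two products.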
[0025]
(3) Example 1
The configuration of the conversion device of this embodiment is shown in FIG. 1. The conversion device 101 consists of a processor 102 and a storage unit 103. It receives data from an externally connected input device 111, performs conversion processing such as image compression and decompression, and outputs the result through an output device 112. The processor 102 and the storage unit 103 are connected by an address bus 104 and a data bus 105; an address computed by the address generator 106 inside the processor selects a location in the storage device, and data is transferred to and from the register file 107 in the processor over the data bus. The storage unit 103 consists of a program storage device 109 and a data storage device 110. The arithmetic unit 108 reads the contents of the register file, performs arithmetic processing, and writes the result back to the register file. The configuration of the register file 107 is shown in FIG. 10. It consists of register file 1, made up of R0 to R15, and register file 2, made up of XR0 to XR15; the registers Rn, Rn+1, Rn+2, and Rn+3 of register file 0 are grouped together and called VRn.
[0026]
Here, instructions of the processor 102 and their operations are defined. First, a TRV instruction that performs a matrix product of a 4 * 4 matrix and a vector having 4 elements is described as follows.
[0027]
TRV VRs, VRd (s, d = 4n)
The TRV instruction treats the 16 registers of register file 1 as a 4 * 4 matrix and the register group VRs in register file 0 as a four-dimensional vector, and stores the product of the matrix and the vector in VRd. FIG. 11 shows the operation of the TRV instruction.
[0028]
Next, ADD, SUB, MUL, ADD4, SUB4, and MUL4 are defined as follows as instructions for addition, subtraction, and multiplication.
[0029]
ADD Rs, Rt, Rd
SUB Rs, Rt, Rd
MUL Rs, Rt, Rd
ADD4 VRs, VRt, VRd (s, t, d = 4n)
SUB4 VRs, VRt, VRd (s, t, d = 4n)
MUL4 VRs, VRt, VRd (s, t, d = 4n)
The ADD, SUB, and MUL instructions add, subtract, and multiply the registers Rs and Rt and store the result in the register Rd. The ADD4, SUB4, and MUL4 instructions add, subtract, and multiply the corresponding elements of the register groups VRs and VRt and store the result in VRd. FIGS. 12, 13, and 14 show the operation of the ADD4, SUB4, and MUL4 instructions. In addition, the following four instructions are defined for loading data from main memory into registers and for storing data from registers into main memory.
[0030]
LD b, disp, Rd
ST b, disp, Rs
LD4 b, disp, step, VRd (d = 4n)
ST4 b, disp, step, VRs (s = 4n)
The LD instruction loads the data stored at main-memory address (b + disp) into the register Rd. The ST instruction stores the value of the register Rs at main-memory address (b + disp). The LD4 instruction loads the data stored at main-memory addresses (b + disp), (b + disp + step), (b + disp + 2 * step), and (b + disp + 3 * step) into the registers Rd, Rd+1, Rd+2, and Rd+3 in register file 1. The ST4 instruction stores the values of the registers Rs, Rs+1, Rs+2, and Rs+3 at main-memory addresses (b + disp), (b + disp + step), (b + disp + 2 * step), and (b + disp + 3 * step).
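The strided addressing of LD4 and ST4 is what lets the butterfly inputs be gathered and the even and odd outputs be interleaved without extra shuffling. The following sketch models main memory as a Python list; `ld4` and `st4` are hypothetical helpers that follow the address arithmetic described above, not an actual instruction-set simulator.

```python
def ld4(memory, b, disp, step):
    """Return the values at b+disp, b+disp+step, b+disp+2*step, b+disp+3*step."""
    return [memory[b + disp + i * step] for i in range(4)]

def st4(memory, b, disp, step, values):
    """Store four values at the same strided addresses."""
    for i, value in enumerate(values):
        memory[b + disp + i * step] = value

row = [10, 11, 12, 13, 14, 15, 16, 17]
front = ld4(row, 0, 0, 1)       # x0, x1, x2, x3
back = ld4(row, 0, 7, -1)       # x7, x6, x5, x4 (negative step walks backwards)

out = [None] * 8
st4(out, 0, 0, 2, ["e0", "e1", "e2", "e3"])   # even positions 0, 2, 4, 6
st4(out, 0, 1, 2, ["o0", "o1", "o2", "o3"])   # odd positions 1, 3, 5, 7
```

A step of -1 delivers the second half of a row already reversed, so element n of `front` lines up with element n of `back` for the sum and difference.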
[0031]
Finally, an EXCHG instruction for exchanging the contents of register file 0 and register file 1 is defined.
[0032]
EXCHG
An example of performing image conversion processing using 8 * 8 two-dimensional discrete cosine transform in the above apparatus will be described below.
[0033]
As shown in FIG. 15, the storage device 103 of FIG. 1 is assumed to hold, in main memory, the conversion program, the 4 * 4 matrix data, the four coefficient data, and image data of 8 * 8 pixels * B blocks.
[0034]
First, the coefficient matrix data and the butterfly coefficient data are loaded into registers. This load needs to be performed only once, at the start of the discrete cosine transform of the B blocks. It loads the coefficients shown in FIG. 10 into the register files, and these coefficients are not changed while the B blocks are transformed.
[0035]
#Matrix, coefficient load
LD4 MATRIX, 0, 1, VR0
LD4 MATRIX, 4, 1, VR4
LD4 MATRIX, 8, 1, VR8
LD4 MATRIX, 12, 1, VR12
EXCHG
LD4 COEFF, 0, 1, VR12
Next, an instruction sequence for performing 8-point one-dimensional discrete cosine transform is shown. OFF indicates an offset from the IMG and is set to 0 in the first processing.
[0036]
# 8-point discrete cosine transform (horizontal)
LD4 IMG, 0 + OFF, 1, VR8
LD4 IMG, 7 + OFF, -1, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
TRV VR0, VR0
TRV VR4, VR4
ADD R4, R5, R4
ADD R5, R6, R5
ADD R6, R7, R6
ST4 IMG, 0 + OFF, 2, VR0
ST4 IMG, 1 + OFF, 2, VR4
Repeating this 12-instruction transform 8 times while increasing OFF by 8 each time completes the horizontal one-dimensional discrete cosine transform of the 8 * 8 pixels. Then, to perform the vertical transform, the following instruction sequence is executed 8 times while OFF takes the values 0, 1, ..., 7.
[0037]
# 8-point discrete cosine transform (vertical)
LD4 IMG, 0 + OFF, 8, VR8
LD4 IMG, 56 + OFF, -8, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
TRV VR0, VR0
TRV VR4, VR4
ADD R4, R5, R4
ADD R5, R6, R5
ADD R6, R7, R6
ST4 IMG, 0 + OFF, 16, VR0
ST4 IMG, 8 + OFF, 16, VR4
With the above operations, the 8 * 8 two-dimensional discrete cosine transform is complete. If the number of blocks B to be transformed is sufficiently large, the 6 instructions needed to load the matrix and coefficients can be ignored, so each block can be processed in 192 instructions.
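To make the two instruction sequences of Example 1 concrete, the sketch below emulates them on a Python list standing in for registers R0 to R15. Several things are assumptions: the coefficient values in `MATRIX` and `COEFF` are derived for an unnormalized transform rather than taken from FIG. 10, TRV is modelled as a plain matrix-vector product, and in the vertical pass the odd outputs are stored to rows 1, 3, 5, 7 (offset 8 + OFF, step 16) by analogy with the horizontal pass. All function names are hypothetical.

```python
import math

MATRIX = [[math.cos(math.pi * (2 * n + 1) * k / 8) for n in range(4)] for k in range(4)]
COEFF = [1.0 / (2.0 * math.cos(math.pi * (2 * n + 1) / 16)) for n in range(4)]

R = [0.0] * 16          # general registers R0..R15
R[12:16] = COEFF        # VR12 holds the four butterfly coefficients

def ld4(mem, disp, step, d):
    R[d:d + 4] = [mem[disp + i * step] for i in range(4)]

def st4(mem, disp, step, s):
    for i in range(4):
        mem[disp + i * step] = R[s + i]

def add4(s, t, d):
    R[d:d + 4] = [R[s + i] + R[t + i] for i in range(4)]

def sub4(s, t, d):
    R[d:d + 4] = [R[s + i] - R[t + i] for i in range(4)]

def mul4(s, t, d):
    R[d:d + 4] = [R[s + i] * R[t + i] for i in range(4)]

def trv(s, d):
    vec = R[s:s + 4]
    R[d:d + 4] = [sum(MATRIX[k][n] * vec[n] for n in range(4)) for k in range(4)]

def add(s, t, d):
    R[d] = R[s] + R[t]

def dct8_pass(img, first, last, in_step, out_step):
    """The 12-instruction sequence: 2 loads, 3 vector ops, 2 TRV, 3 ADD, 2 stores."""
    ld4(img, first, in_step, 8)
    ld4(img, last, -in_step, 4)
    add4(8, 4, 0)
    sub4(8, 4, 4)
    mul4(4, 12, 4)
    trv(0, 0)
    trv(4, 4)
    add(4, 5, 4)
    add(5, 6, 5)
    add(6, 7, 6)
    st4(img, first, out_step, 0)             # even outputs
    st4(img, first + in_step, out_step, 4)   # odd outputs (placement assumed, see lead-in)

def dct8x8(img):
    for off in range(0, 64, 8):              # horizontal pass, OFF = 0, 8, ..., 56
        dct8_pass(img, off, off + 7, 1, 2)
    for off in range(8):                     # vertical pass, OFF = 0, 1, ..., 7
        dct8_pass(img, off, off + 56, 8, 16)
```

Each call to `dct8_pass` issues 12 emulated instructions and `dct8x8` makes 16 calls, which matches the 192 instructions per block stated in the text.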
[0038]
In the conventional example, a 6-instruction coefficient load is needed for every TRV instruction, and 32 TRV instructions are used to transform one block, so a further 6 instructions * 32 = 192 instructions are added. The present invention can therefore be said to halve the number of instructions.
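The instruction counts compared here follow from simple arithmetic, spelled out below as a quick check. The constant names are illustrative; the numbers themselves are the ones given in the text.

```python
PASSES_PER_BLOCK = 16      # 8 horizontal + 8 vertical 8-point transforms
INSTR_PER_PASS = 12        # instruction sequence of Example 1
TRV_PER_PASS = 2           # two matrix-vector products per 8-point transform
RELOAD_INSTR = 6           # coefficient load needed before each TRV in the conventional case

fixed_matrix = PASSES_PER_BLOCK * INSTR_PER_PASS                 # single fixed 4 * 4 matrix
trv_count = PASSES_PER_BLOCK * TRV_PER_PASS                      # TRV instructions per block
conventional = fixed_matrix + trv_count * RELOAD_INSTR           # with a reload before every TRV

print(fixed_matrix, trv_count, conventional, fixed_matrix / conventional)
```

The ratio of the two totals is 0.5, which is the halving claimed in the paragraph above.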
[0039]
(4) Example 2
This embodiment shows an implementation on a processor that has a vector inner product instruction instead of a matrix operation instruction. The coefficients and data are assumed to be placed in main memory as shown in FIG. 15 for Example 1.
[0040]
Assume that the processor used in the present embodiment has the following differences from the first embodiment.
1. As shown in FIG. 16, it has only one register file composed of 32 registers, and therefore does not have an EXCHG instruction.
2. It has an IPR instruction instead of the TRV instruction.
IPR VRs, VRt, Rd (s, t = 4n)
The IPR instruction treats the register groups VRs and VRt as 4-element vectors and stores their inner product in the register Rd. FIG. 17 shows the operation.
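Replacing one TRV by four IPR instructions amounts to computing a matrix-vector product one output element at a time. A minimal sketch, with `ipr` and `mat_vec_by_ipr` as hypothetical helper names and a simple integer matrix chosen only so the result is easy to check by hand:

```python
def ipr(vs, vt):
    """Inner product of two 4-element vectors, as the IPR instruction is described."""
    return sum(a * b for a, b in zip(vs, vt))

def mat_vec_by_ipr(matrix_rows, vec):
    # One IPR per output element; matrix_rows plays the role of VR16, VR20, VR24, VR28.
    return [ipr(vec, row) for row in matrix_rows]

rows = [[1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1]]
result = mat_vec_by_ipr(rows, [1, 2, 3, 4])
print(result)  # [1, 3, 6, 10]
```

Because each 8-point transform needs two such products, the two TRV instructions of Example 1 become eight IPR instructions in this embodiment.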
[0041]
A procedure for performing 8 * 8 discrete cosine transform and inverse discrete cosine transform using the above processor will be described below.
[0042]
First, matrix data and coefficient data are loaded into a register. This loading operation needs to be performed only once at the beginning of the discrete cosine transform processing of the B block.
[0043]
#Matrix, coefficient load
LD4 COEFF, 0, 1, VR12
LD4 MATRIX, 0, 4, VR16
LD4 MATRIX, 1, 4, VR20
LD4 MATRIX, 2, 4, VR24
LD4 MATRIX, 3, 4, VR28
Next, an instruction sequence for performing 8-point one-dimensional discrete cosine transform is shown. OFF indicates an offset from the IMG and is set to 0 in the first processing.
[0044]
# 8-point discrete cosine transform (horizontal)
LD4 IMG, 0 + OFF, 1, VR8
LD4 IMG, 7 + OFF, -1, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
IPR VR0, VR16, R8
IPR VR0, VR20, R9
IPR VR0, VR24, R10
IPR VR0, VR28, R11
IPR VR4, VR16, R0
IPR VR4, VR20, R1
IPR VR4, VR24, R2
IPR VR4, VR28, R3
ADD R1, R9, R1
ADD R2, R10, R2
ADD R3, R11, R3
ST4 IMG, 0 + OFF, 2, VR8
ST4 IMG, 1 + OFF, 2, VR0
Executing this 18-instruction transform sequence eight times, increasing OFF by eight each time, completes the horizontal one-dimensional discrete cosine transform for the 8 * 8 pixels. Thereafter, the following instruction sequence is executed eight times while OFF is changed through 0, 1, ..., 7.
[0045]
# 8-point discrete cosine transform (vertical)
LD4 IMG, 0 + OFF, 8, VR8
LD4 IMG, 56 + OFF, -8, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
IPR VR0, VR16, R8
IPR VR0, VR20, R9
IPR VR0, VR24, R10
IPR VR0, VR28, R11
IPR VR4, VR16, R0
IPR VR4, VR20, R1
IPR VR4, VR24, R2
IPR VR4, VR28, R3
ADD R1, R9, R1
ADD R2, R10, R2
ADD R3, R11, R3
ST4 IMG, 0 + OFF, 16, VR8
ST4 IMG, 56 + OFF, -16, VR0
With the above operations, the 8 * 8 two-dimensional discrete cosine transform is completed. If the number B of blocks to be transformed is sufficiently large, the 5 instructions needed to load the matrix and coefficients can be ignored, so the two-dimensional discrete cosine transform can be processed with 18 instructions * 8 times * 2 directions = 288 instructions per block.
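The horizontal and vertical passes differ only in the start offset and stride given to the loads. The following C sketch illustrates that addressing. It assumes that the operands of LD4 are a base address, a start offset, a stride, and a destination register group, which is inferred from the listings; `ld4_strided` and `load_halves` are hypothetical names.

```c
#include <stddef.h>

/* Assumed meaning of "LD4 base, start, stride, VRd":
 * VRd[i] = base[start + i * stride] for i = 0..3. */
static void ld4_strided(const double *base, ptrdiff_t start,
                        ptrdiff_t stride, double vr[4])
{
    for (ptrdiff_t i = 0; i < 4; i++)
        vr[i] = base[start + i * stride];
}

/* Gather the two four-element halves that ADD4 and SUB4 combine.
 * Horizontal pass: off = 0, 8, ..., 56 with step = 1.
 * Vertical pass:   off = 0, 1, ..., 7  with step = 8. */
static void load_halves(const double img[64], ptrdiff_t off, ptrdiff_t step,
                        double front[4], double back[4])
{
    ld4_strided(img, off, step, front);             /* x0, x1, x2, x3 */
    ld4_strided(img, off + 7 * step, -step, back);  /* x7, x6, x5, x4 */
}
```

With step = 8 the second load starts at offset 56 + OFF and walks backward by 8, matching the operands of the vertical listing.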
[0046]
(5) Example 3
This embodiment shows an implementation on a processor that has SIMD instructions instead of a matrix operation instruction. It is assumed that the coefficients and data are arranged on the main memory as shown in FIG.
[0047]
The processor used in this embodiment is assumed to differ from that of the second embodiment as follows.
1. It has a MAC4 instruction instead of the IPR instruction.
2. It has MUL4B and MAC4B instructions, which are broadcast extensions of the MUL4 and MAC4 instructions.
MAC4 VRs, VRt, VRd (s, t, d = 4n)
The MAC4 instruction adds the element-wise products of the registers Rs, Rs + 1, Rs + 2, Rs + 3 and the registers Rt, Rt + 1, Rt + 2, Rt + 3 to the registers Rd, Rd + 1, Rd + 2, Rd + 3.
[0048]
MUL4B VRs, VRt, VRd, b (s, t, d = 4n, b = 0-3)
MAC4B VRs, VRt, VRd, b (s, t, d = 4n, b = 0-3)
The MUL4B instruction multiplies the single register Rs + b, used as the first operand in all four positions, by each of the registers Rt, Rt + 1, Rt + 2, and Rt + 3, and stores the products in the registers Rd, Rd + 1, Rd + 2, and Rd + 3. The MAC4B instruction forms the same four products of the register Rs + b and the registers Rt, Rt + 1, Rt + 2, and Rt + 3, and adds them to the registers Rd, Rd + 1, Rd + 2, and Rd + 3.
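The broadcast instructions can be modeled in C as follows. This is a minimal sketch, assuming a flat array `reg` as the register file and `double` elements; the function names, including `matvec_broadcast`, are hypothetical, and the sketch says nothing about how the matrix is laid out in main memory.

```c
#include <stddef.h>

static double reg[32];   /* hypothetical model of the register file */

/* "MUL4B VRs, VRt, VRd, b": R[d+i] = R[s+b] * R[t+i] for i = 0..3. */
static void mul4b(size_t s, size_t t, size_t d, size_t b)
{
    double scalar = reg[s + b];   /* read the broadcast element first */
    for (size_t i = 0; i < 4; i++)
        reg[d + i] = scalar * reg[t + i];
}

/* "MAC4B VRs, VRt, VRd, b": R[d+i] += R[s+b] * R[t+i] for i = 0..3. */
static void mac4b(size_t s, size_t t, size_t d, size_t b)
{
    double scalar = reg[s + b];
    for (size_t i = 0; i < 4; i++)
        reg[d + i] += scalar * reg[t + i];
}

/* One MUL4B followed by three MAC4Bs accumulates
 *   R[d+i] = sum over b of R[s+b] * R[16 + 4*b + i],
 * mirroring the four-instruction groups in the listings.
 * The destination group d must not overlap the source group s. */
static void matvec_broadcast(size_t s, size_t d)
{
    mul4b(s, 16, d, 0);
    mac4b(s, 20, d, 1);
    mac4b(s, 24, d, 2);
    mac4b(s, 28, d, 3);
}
```

Starting each group with MUL4B instead of MAC4B means the destination registers do not have to be cleared beforehand.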
[0049]
A procedure for performing 8 * 8 discrete cosine transform and inverse discrete cosine transform using the above processor will be described below.
[0050]
First, the matrix data and coefficient data are loaded into registers. This load needs to be performed only once, at the beginning of the discrete cosine transform processing of the B blocks.
[0051]
# Matrix, coefficient load
LD4 COEFF, 0, 1, VR12
LD4 MATRIX, 0, 4, VR16
LD4 MATRIX, 1, 4, VR20
LD4 MATRIX, 2, 4, VR24
LD4 MATRIX, 3, 4, VR28
Next, an instruction sequence for performing 8-point one-dimensional discrete cosine transform is shown. OFF indicates an offset from the IMG and is set to 0 in the first processing.
[0052]
# 8-point discrete cosine transform (horizontal)
LD4 IMG, 0 + OFF, 1, VR8
LD4 IMG, 7 + OFF, -1, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
MUL4B VR0, VR16, VR8, 0
MAC4B VR0, VR20, VR8, 1
MAC4B VR0, VR24, VR8, 2
MAC4B VR0, VR28, VR8, 3
MUL4B VR4, VR16, VR0, 0
MAC4B VR4, VR20, VR0, 1
MAC4B VR4, VR24, VR0, 2
MAC4B VR4, VR28, VR0, 3
ADD R1, R9, R1
ADD R2, R10, R2
ADD R3, R11, R3
ST4 IMG, 0 + OFF, 2, VR8
ST4 IMG, 1 + OFF, 2, VR0
Executing this 18-instruction transform sequence eight times, increasing OFF by eight each time, completes the horizontal one-dimensional discrete cosine transform for the 8 * 8 pixels. Thereafter, the following instruction sequence is executed eight times while OFF is changed through 0, 1, ..., 7.
[0053]
# 8-point discrete cosine transform (vertical)
LD4 IMG, 0 + OFF, 8, VR8
LD4 IMG, 56 + OFF, -8, VR4
ADD4 VR8, VR4, VR0
SUB4 VR8, VR4, VR4
MUL4 VR4, VR12, VR4
MUL4B VR0, VR16, VR8, 0
MAC4B VR0, VR20, VR8, 1
MAC4B VR0, VR24, VR8, 2
MAC4B VR0, VR28, VR8, 3
MUL4B VR4, VR16, VR0, 0
MAC4B VR4, VR20, VR0, 1
MAC4B VR4, VR24, VR0, 2
MAC4B VR4, VR28, VR0, 3
ADD R1, R9, R1
ADD R2, R10, R2
ADD R3, R11, R3
ST4 IMG, 0 + OFF, 16, VR8
ST4 IMG, 56 + OFF, -16, VR0
With the above operations, the 8 * 8 two-dimensional discrete cosine transform is completed. If the number B of blocks to be transformed is sufficiently large, the 5 instructions needed to load the matrix and coefficients can be ignored, so the two-dimensional discrete cosine transform can be processed with 18 instructions * 8 times * 2 directions = 288 instructions per block.
[0054]
(6) Example 4
In this embodiment, the processing procedure of the present invention is written in the C language. It is assumed that the coefficients and data are arranged on the main memory as shown in FIG.
[0055]
First, as shown in FIGS. 18 and 19, the data type used for the discrete cosine transform and inverse discrete cosine transform is defined as DCTYPE, and M[4][4], a 4 * 4 two-dimensional global constant array of that type, and C4[4] are declared. The 4 * 4 matrix values are set in M and the four coefficients in C4. These correspond to the registers XR0 to XR15 and R12 to R15 of the first embodiment. Next, ld4(), st4(), add4(), sub4(), mul4(), and trv() are defined as lower-level functions. ld4() reads four DCTYPE values from the memory address indicated by the pointer adr and assigns them to the one-dimensional array VR of length 4. Conversely, st4() writes the values of VR to the address indicated by the pointer adr. add4(), sub4(), and mul4() perform element-wise addition, subtraction, and multiplication on the one-dimensional arrays VR1 and VR2, each of length 4, and assign the result to VR3. trv() calculates the product of the matrix given by the 4 * 4 two-dimensional array M and the one-dimensional array VR1 of length 4, and assigns the result to VR2. These functions perform processing corresponding to LD4, ST4, ADD4, SUB4, MUL4, and TRV of the first embodiment.
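The lower-level functions described above might look like the following in C. FIGS. 18 and 19 are not reproduced here, so this is only a sketch: the signatures, the choice of `double` for DCTYPE, and passing M as a parameter instead of using the global are all assumptions.

```c
#include <stddef.h>

typedef double DCTYPE;   /* assumed element type; the text only names DCTYPE */

/* Read four DCTYPE values starting at adr into VR. */
static void ld4(const DCTYPE *adr, DCTYPE VR[4])
{
    for (size_t i = 0; i < 4; i++) VR[i] = adr[i];
}

/* Write the four values of VR starting at adr. */
static void st4(DCTYPE *adr, const DCTYPE VR[4])
{
    for (size_t i = 0; i < 4; i++) adr[i] = VR[i];
}

/* Element-wise VR3 = VR1 + VR2. */
static void add4(const DCTYPE VR1[4], const DCTYPE VR2[4], DCTYPE VR3[4])
{
    for (size_t i = 0; i < 4; i++) VR3[i] = VR1[i] + VR2[i];
}

/* Element-wise VR3 = VR1 - VR2. */
static void sub4(const DCTYPE VR1[4], const DCTYPE VR2[4], DCTYPE VR3[4])
{
    for (size_t i = 0; i < 4; i++) VR3[i] = VR1[i] - VR2[i];
}

/* Element-wise VR3 = VR1 * VR2. */
static void mul4(const DCTYPE VR1[4], const DCTYPE VR2[4], DCTYPE VR3[4])
{
    for (size_t i = 0; i < 4; i++) VR3[i] = VR1[i] * VR2[i];
}

/* VR2 = M * VR1 (4 * 4 matrix times 4-element vector).
 * A temporary is used so that VR1 and VR2 may be the same array. */
static void trv(const DCTYPE M[4][4], const DCTYPE VR1[4], DCTYPE VR2[4])
{
    DCTYPE tmp[4];
    for (size_t i = 0; i < 4; i++) {
        tmp[i] = 0;
        for (size_t j = 0; j < 4; j++)
            tmp[i] += M[i][j] * VR1[j];
    }
    for (size_t i = 0; i < 4; i++) VR2[i] = tmp[i];
}
```

Keeping each helper to a fixed trip count of four gives a compiler a reasonable chance of mapping it onto a single vector or matrix instruction where the target has one.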
[0056]
Based on the above variables and lower-level functions, functions that perform the 8-point discrete cosine transform and the 8 * 8 two-dimensional discrete cosine transform can be written as dct8() and dct8_8() in FIG. Since this program is optimized by the C compiler provided for each processor, its execution performance depends on the compiler. However, because the algorithm of the present invention uses only a 4 * 4 coefficient matrix for the 8 * 8 discrete cosine transform, it requires fewer loads and stores between the main memory and the processor than a program that uses 64 coefficients to implement directly the 8 * 8 matrix operation shown in FIG., and the compiler can be expected to output a faster machine-language instruction sequence. As in the first and second embodiments, on a processor equipped with a matrix operation instruction or SIMD instructions, using the compiler's built-in functions yields a result closer to one written in machine language. It is also possible to write only the lower-level functions in machine language.
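To illustrate how an 8-point transform can be built from a single 4 * 4 coefficient matrix, the following C sketch uses the textbook even/odd split of the unnormalized DCT-II. It is not the dct8() of FIG. 20: the values placed in `M` and `C4`, the normalization, and the form of the post-processing are assumptions chosen so that the sketch is self-consistent, and they may differ from those used in the figures.

```c
#include <stddef.h>

/* Assumed constants: cos(pi/8), cos(pi/4), cos(3*pi/8). */
#define A1 0.9238795325112867
#define A2 0.7071067811865476
#define A3 0.3826834323650898

/* Assumed 4 * 4 matrix: M[k][n] = cos((2n+1) * k * pi / 8). */
static const double M[4][4] = {
    { 1.0,  1.0,  1.0,  1.0 },
    {  A1,   A3,  -A3,  -A1 },
    {  A2,  -A2,  -A2,   A2 },
    {  A3,  -A1,   A1,  -A3 },
};

/* Assumed pre-processing coefficients: C4[n] = 2 * cos((2n+1) * pi / 16). */
static const double C4[4] = {
    2.0 * 0.9807852804032304, 2.0 * 0.8314696123025452,
    2.0 * 0.5555702330196022, 2.0 * 0.1950903220161283,
};

/* y = M * x; this plays the role of TRV. x and y must not alias. */
static void trv(const double x[4], double y[4])
{
    for (size_t k = 0; k < 4; k++) {
        y[k] = 0.0;
        for (size_t n = 0; n < 4; n++)
            y[k] += M[k][n] * x[n];
    }
}

/* Unnormalized 8-point DCT-II,
 *   X[k] = sum over n of x[n] * cos((2n+1) * k * pi / 16),
 * computed with the 4 * 4 matrix M and the four coefficients C4 only. */
static void dct8(const double x[8], double X[8])
{
    double s[4], d[4], E[4], G[4];

    for (size_t n = 0; n < 4; n++) {          /* pre-processing */
        s[n] = x[n] + x[7 - n];
        d[n] = (x[n] - x[7 - n]) * C4[n];
    }
    trv(s, E);                                 /* even outputs */
    trv(d, G);                                 /* odd half, same matrix */

    for (size_t k = 0; k < 4; k++)
        X[2 * k] = E[k];
    X[1] = 0.5 * G[0];                         /* post-processing */
    for (size_t k = 1; k < 4; k++)
        X[2 * k + 1] = G[k] - X[2 * k - 1];
}
```

Both calls to trv() read the same 16 matrix entries, so the matrix can stay resident while only the eight data values change, which is the property the embodiments rely on.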
[0057]
(7) Example 5
This embodiment shows an example in which a 16-point discrete cosine transform is performed using the 8-point discrete cosine transform of the fourth embodiment. Using the division method of the present invention, the 16-point transform is divided into 8-point transforms, and each 8-point discrete cosine transform is processed by the function of the fourth embodiment, so that the 16-point discrete cosine transform is performed with the 4 * 4 coefficient matrix kept fixed.
[0058]
Substituting N = 16 into the equations (9), (10), and (11) derived earlier shows that the transform can be divided into two 8-point discrete cosine transforms, as shown in FIG. That is, a 16-point discrete cosine transform can be divided into two 8-point discrete cosine transforms by adding pre-processing and post-processing.
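For reference, the textbook form of this even/odd split for an unnormalized DCT-II is written out below. Equations (9) to (11) are not reproduced in this section, so the scaling and the exact form of the post-processing shown here are assumptions and may differ from them; the symbols s, g, E, and G are introduced only for this illustration.

```latex
\begin{aligned}
X[k] &= \sum_{n=0}^{N-1} x[n]\,\cos\frac{(2n+1)k\pi}{2N},
  \qquad k = 0,\dots,N-1, \\[4pt]
s[n] &= x[n] + x[N-1-n], \qquad
g[n] = \bigl(x[n] - x[N-1-n]\bigr)\cdot 2\cos\frac{(2n+1)\pi}{2N},
  \qquad n = 0,\dots,\tfrac{N}{2}-1, \\[4pt]
E[k] &= \sum_{n=0}^{N/2-1} s[n]\,\cos\frac{(2n+1)k\pi}{N}, \qquad
G[k] = \sum_{n=0}^{N/2-1} g[n]\,\cos\frac{(2n+1)k\pi}{N}, \\[4pt]
X[2k] &= E[k], \qquad
X[1] = \tfrac{1}{2}\,G[0], \qquad
X[2k+1] = G[k] - X[2k-1] \quad (k \ge 1).
\end{aligned}
```

E and G are both N/2-point transforms with the same cosine kernel. With N = 16 they are two 8-point transforms, so one 8-point routine, and therefore one 4 * 4 coefficient matrix, can serve both halves.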
[0059]
Using this property, the 16-point discrete cosine transform is written as shown in FIGS. 22, 23, and 24, using the functions shown in FIGS. 18 and 19 and the dct8() function shown in FIG. 20.
[0060]
Similarly, 32-point, 64-point, ..., 2^n-point discrete cosine transforms can also be computed using a 4 * 4 coefficient matrix.
[0061]
(8) Example 6
The conversion apparatus of the present embodiment is different from that of the first embodiment in that the program storage device 109 of the conversion apparatus of FIG. Executable instructions and processing contents are all the same as in the first embodiment.
[0062]
(9) Example 7
The conversion device of this embodiment is different from that of the first embodiment in that the data storage device 110 of the conversion device of FIG. Executable instructions and processing contents are all the same as in the first embodiment.
[0063]
(10) Example 8
The configuration of the conversion device of this embodiment is shown in FIG. The conversion device 2501 consists of a processor 2502. The processor 2502 differs from that of the first embodiment in that it incorporates a program storage device 109 and a data storage device 110 in addition to the address generator 106, the register file 107, and the arithmetic unit 108. The executable instructions and processing contents are all the same as in the first embodiment.
[0064]
[Effect of the invention]
The present invention can calculate the N-point discrete cosine transform and inverse discrete cosine transform using only one fixed (N / 2 ^ k) * (N / 2 ^ k) coefficient matrix. As a result, the operation of replacing the coefficient matrix, which was conventionally necessary for each matrix operation, becomes unnecessary and its overhead is eliminated, so that the utilization efficiency of the arithmetic unit is improved.
[0065]
In addition, the scale of operation and the number of registers necessary for storing coefficients can be reduced, and the circuit area can be reduced. This makes it possible to configure a high-speed image / sound compression / decompression processing device while suppressing an increase in circuit area by adding a matrix computing unit or similar arithmetic device to a general-purpose processor.
[Brief description of the drawings]
FIG. 1 shows a data conversion apparatus to which the present invention is applied.
FIG. 2 is an 8-point discrete cosine transform division (butterfly diagram) to which the present invention is applied.
FIG. 3 shows division (matrix representation) of 8-point discrete cosine transform to which the present invention is applied.
FIG. 4 is a conventional 8-point discrete cosine transform (matrix representation).
FIG. 5 is a definition formula of discrete cosine transform.
FIG. 6 shows the properties of coefficients c (k, N) used in discrete cosine transform and inverse discrete cosine transform.
FIG. 7 is a matrix representation according to the definition formula of 8-point one-dimensional discrete cosine transform.
FIG. 8 is a formula modification of the discrete cosine transform according to the present invention.
FIG. 9 is an equation modification (X [N−1]) of the 8-point discrete cosine transform according to the present invention.
FIG. 10 shows a configuration of a register according to the first embodiment.
FIG. 11 shows processing contents of a TRV instruction in the first embodiment.
FIG. 12 shows processing contents of an ADD4 instruction in the first embodiment.
FIG. 13 shows processing contents of a SUB4 instruction in the first embodiment.
FIG. 14 shows processing contents of a MUL4 instruction in the first embodiment.
FIG. 15 is a main memory arrangement of data used in an 8 * 8-point two-dimensional discrete cosine transform.
FIG. 16 shows a configuration of a register according to the second embodiment.
FIG. 17 shows processing contents of an IPR instruction in the second embodiment.
FIG. 18 is a C language description (lower function 1) of 8-point discrete cosine transform using the 4 * 4 matrix of the fourth embodiment.
FIG. 19 is a C language description (lower function 2) of 8-point discrete cosine transform using the 4 * 4 matrix of the fourth embodiment.
FIG. 20 is a C language description (upper function) of 8-point 1-dimensional discrete cosine transform and 8 * 8-point 2-dimensional discrete cosine transform of the fourth embodiment.
FIG. 21 is a division (butterfly diagram) of a 16-point discrete cosine transform according to the present invention.
FIG. 22 is a C language description (lower function) of 16-point discrete cosine transform using the 4 * 4 matrix of the fifth embodiment.
FIG. 23 is a C language description (upper function) of 16-point one-dimensional discrete cosine transform according to the fifth embodiment.
FIG. 24 is a C language description (upper function) of 16 * 16 point two-dimensional discrete cosine transform according to the fifth embodiment.
FIG. 25 shows a configuration of an apparatus according to an eighth embodiment.
[Explanation of symbols]
101: Conversion device
102: Processor
103: Storage unit
104: Address bus
105: Data bus
106: Address generator
107: Register file
108: Arithmetic unit
109: Program storage device
110: Data storage device
111: Input device
112: Output device
2501: Conversion device
2502: Processor.

Claims

A data processing device including a processor and a storage device that stores a program including N-point (N = 2^n, where n is a natural number) discrete cosine transform processing, data to be subjected to the discrete cosine transform processing, and matrix data and coefficient data for the discrete cosine transform processing, wherein
the discrete cosine transform processing is divided, by adding pre-processing and post-processing, into 2^k (k is a natural number) N/2^k-point discrete cosine transform processes according to the processor to be used,
the pre-processing and post-processing are performed such that at least two of the discrete cosine transform coefficient matrices for the 2^k (k is a natural number) N/2^k-point discrete cosine transform processes are the same, and
the processor has a register file, reads the data to be subjected to the discrete cosine transform processing from the storage device, and, based on the program stored in the storage device, performs the pre-processing, the N/2^k-point discrete cosine transform processing 2^k (k is a natural number) times, and the post-processing, the matrix data and the coefficient data for the N/2^k-point discrete cosine transform processing being first loaded from the storage device into the register file only once to serve as the coefficient matrix, and the N/2^k-point discrete cosine transform processing being repeated 2^k (k is a natural number) times while the coefficient matrix is held in the register file.

A data processing device including a processor and a storage device that stores a program including N-point (N = 2^n, where n is a natural number) inverse discrete cosine transform processing, data to be subjected to the inverse discrete cosine transform processing, and matrix data and coefficient data for the inverse discrete cosine transform processing, wherein
the inverse discrete cosine transform processing is divided, by adding pre-processing and post-processing, into 2^k (k is a natural number) N/2^k-point inverse discrete cosine transform processes according to the processor to be used,
the pre-processing and post-processing are performed such that at least two of the inverse discrete cosine transform coefficient matrices for the 2^k (k is a natural number) N/2^k-point inverse discrete cosine transform processes are the same, and
the processor has a register file, reads the data to be subjected to the inverse discrete cosine transform processing from the storage device, and, based on the program stored in the storage device, performs the pre-processing, the N/2^k-point inverse discrete cosine transform processing 2^k (k is a natural number) times, and the post-processing, the matrix data and the coefficient data for the N/2^k-point inverse discrete cosine transform processing being first loaded from the storage device into the register file only once to serve as the coefficient matrix, and the N/2^k-point inverse discrete cosine transform processing being repeated 2^k (k is a natural number) times while the coefficient matrix is held in the register file.