JP4101645B2 - Motion vector detection device, motion vector detection method, program, and recording medium

Info

Publication number: JP4101645B2
Application number: JP2002378510A
Authority: JP
Inventors: 省造藤井; 勝彦吉田; 雅夫岡部
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2002-12-26
Filing date: 2002-12-26
Publication date: 2008-06-18
Anticipated expiration: 2022-12-26
Also published as: JP2004214733A

Description

【０００１】
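【Technical Field of the Invention】
The present invention relates to a motion vector detection device, a motion vector detection method, a program, and a recording medium that detect, by the block matching method, the motion vectors used in motion-compensated coding of moving images.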
【０００２】
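【Prior Art】
In recent years, coding schemes that exploit inter-frame correlation, such as MPEG-2 (ITU-T H.262), have come into use for moving-image coding. These schemes use motion-compensated coding: the picture to be coded is divided into small rectangular regions called coding blocks, a prediction block is obtained for each coding block from a motion vector detected in a reference picture, and the difference between the coding block and the prediction block is compressed and coded.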
【０００３】
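Block matching is a representative motion vector detection method. A prediction block candidate of the same size as the coding block is taken from the motion vector search area of that coding block, and its suitability as the prediction block is evaluated by computing the amount of error between the coding block and the candidate. Every prediction block candidate in the search area is evaluated; the candidate with the best evaluation result is adopted as the prediction block, and the difference between the position coordinates of the adopted prediction block and those of the coding block is the motion vector. As the amount of error, the total sum of absolute differences AE shown in (Equation 1) is often used: the absolute value of the difference between the pixel data r_{m+i,n+j} of the prediction block candidate and the pixel data t_{m,n} of the coding block is summed over all pixels of the block. In (Equation 1), (j, i) is the vector of the prediction block candidate currently being evaluated (the first component is horizontal, the second vertical), and AE_{i,j} is the total sum of absolute differences obtained for that candidate. M and N are the numbers of pixels of the block in the vertical and horizontal directions respectively, -K to K-1 is the horizontal range of the search area relative to the position of the coding block, and -L to L-1 is its vertical range.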
【０００４】
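【数１】 (Equation 1)

$$AE_{i,j} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} \left| r_{m+i,\,n+j} - t_{m,n} \right|, \qquad -L \le i \le L-1,\quad -K \le j \le K-1$$
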
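Block matching is the most reliable motion vector detection method, but implementing it poses many problems, such as large circuit scale and heavy computation, and many architectures have been devised to address them. Among these, an architecture in which multiple processing units form a pipeline is known to be particularly efficient (see, for example, Patent Document 1 below).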
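The full-search block matching described in paragraph [0003] can be sketched in software as follows. This is a minimal illustration, not the circuit of any embodiment: the function names `sad` and `full_search`, the list-of-rows picture layout, and the tie-breaking rule are assumptions made for the example. Indices follow the text: `i` is the vertical and `j` the horizontal displacement, and the returned vector is `(j, i)`.

```python
def sad(ref, cur, top, left, i, j, M, N):
    """AE_{i,j}: total sum of absolute differences for displacement (j, i)."""
    return sum(
        abs(ref[top + i + m][left + j + n] - cur[top + m][left + n])
        for m in range(M)
        for n in range(N)
    )


def full_search(ref, cur, top, left, M, N, K, L):
    """Evaluate every candidate with -L <= i < L and -K <= j < K.

    Returns ((j, i), AE) for the candidate with the smallest error. Ties go
    to the first candidate visited, which is an arbitrary choice here.
    """
    best = None
    for i in range(-L, L):
        for j in range(-K, K):
            ae = sad(ref, cur, top, left, i, j, M, N)
            if best is None or ae < best[1]:
                best = ((j, i), ae)
    return best
```

With M = 4, N = 3, K = 3, L = 4 and the block's top-left pixel at row 4, column 3, the search reads reference rows 0 to 10 and columns 0 to 7, which is the 11 x 8 window mentioned for the conventional example.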
【０００５】
以下に上述した従来の動きベクトル検出装置として従来例１と従来例２の２つの構成を順に説明する。
【０００６】
（従来例１）
まず，従来例１の動きベクトル検出装置について説明する。これは特許文献１の請求項１，請求項２に該当する装置である。図３５は従来例１の構成を示すブロック図である。図３５の動きベクトル検出装置は，直列に接続したレジスタ８０１，８０２，８０３，８０４と，直列に接続した演算ユニット（以下ＰＥと略す）８０５，８０６，８０７と，ＰＥ８０５，８０６，８０７の間に配置した演算データ遅延器８１１，８１２と，同様に直列に接続したＰＥ８０８，８０９，８１０と，ＰＥ８０８，８０９，８１０の間に配置した演算データ遅延器８１３，８１４とからなる。各ＰＥ８０５〜８０７と各ＰＥ８０８〜８１０にはレジスタ８０１〜８０４の出力が入力されるとともに，端子８１５から符号化画素データがそれぞれ入力される。
【０００７】
図３６はＰＥ８０５の構成を示すブロック図である。このＰＥ８０５は，直列に配置され順次符号化画素データを格納するレジスタ８１９〜８２２と，各レジスタ８１９〜８２２の格納値と各レジスタ８０１〜８０４の格納値が入力される差分絶対値演算器８２３〜８２６と，各差分絶対値演算器８２３〜８２６の演算結果を加算する加算器８２７と，加算器８２７の演算値と他のＰＥからの演算値とを加算する加算器８２８とからなる。図３５のＰＥ８０６〜８１０の構造はすべて図３６のＰＥ８０５の構造と同一であるから説明を省略する。図３７は演算データ遅延器８１１の構成を示すブロック図である。この演算データ遅延器８１１はＰＥ８０５の演算結果８個を記憶する直列接続されたレジスタ８２９〜８３６と，その動作タイミングを制御するタイミング制御部８３７とから構成される。図３５の演算データ遅延器８１１〜８１４の構造はすべて図３７の演算データ遅延器８１１の構造と同一であるから説明を省略する。なお，図３５，図３６と図３７においてレジスタ８０１〜８０４とレジスタ８１９〜８２２はそれぞれ１つの画素の値を記憶する複数ビットからなるレジスタ，レジスタ８２９〜８３６はＰＥの演算結果データを記憶する複数ビットからなるレジスタである。レジスタ８０１〜８０４の出力には個々の信号名称ａ０〜ａ３とし，ａ０〜ａ３をまとめた信号をＡとして図中に接続関係を記載している。
【０００８】
以下，従来例１の動作について説明する。図３８は本従来例の符号化画像と参照画像で各ブロックと画素および検索領域の位置関係を示す領域関係図，図３９，図４１は動作の詳細を示すタイミングチャートである。
【０００９】
まず，図３８でＰＥ８０５〜ＰＥ８１０と符号化ブロックとの関係と動作の概要を説明する。
【００１０】
従来例１では予測ブロック候補と符号化ブロックとの誤差量として（数１）の差分絶対値総和ＡＥを採用しており，ブロックの大きさは水平Ｎ＝３，垂直Ｍ＝４，ベクトルの検索領域は水平方向に−３〜２（Ｋ＝３），垂直方向に−４〜３（Ｌ＝４）としている。図３８で符号化ブロックＴ０は参照画像上で図中Ｃの太枠に示す検索範囲をもち，符号化ブロックＴ１はＤの太枠波線に示す検索範囲を，符号化ブロックＴ２はＥの一点鎖線に示す検索範囲をそれぞれ持つ。なお，図３８の検索範囲はブロックマッチングに使用する参照画素の範囲を示しているので，たとえば図中におけるＣの太枠は横８画素（２Ｋ＋Ｎ−１），縦１１画素（２Ｌ＋Ｍ−１）の大きさとなっている。従来例１は符号化ブロックＴ０と符号化ブロックＴ１の動きベクトル検出動作を行うが，各符号化ブロックは縦４画素からなる小ブロック３つに分割してそれぞれを図３８に示した対応関係のＰＥに格納し，各ＰＥでは予測ブロック候補の縦４画素からなる小ブロックとの誤差量演算を一括して実行しつつ，３つのＰＥの演算結果を総合して各予測ブロック候補に対するＡＥを算出するという方式をとっている。
【００１１】
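This scheme rests on the following relationship. The total sum of absolute differences AE_{i,j} in (Equation 1) is a two-dimensional sum over m and n. It can be split into two stages: first the sum over m of the absolute differences of the M pixels in one column, AE_{i,j,n}, is obtained as in (Equation 2), and then these column sums are added over the N columns, that is, over n, as in (Equation 3). The result is the same AE_{i,j} as in (Equation 1). In Fig. 36, the output of adder 827 in each PE corresponds to the four-pixel column sum AE_{i,j,n} of (Equation 2); the output of adder 828 in PE 807 corresponds to the total sum of absolute differences AE_{i,j} for coding block T0, and the output of adder 828 in PE 810 corresponds to AE_{i,j} for coding block T1.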
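The two-stage decomposition used by this scheme can be checked numerically. The sketch below assumes the same list-of-rows picture layout as before; `column_sad`, `sad_two_stage`, and `sad_direct` are illustrative names, not terms from the patent.

```python
def column_sad(ref, cur, top, left, i, j, n, M):
    """AE_{i,j,n}: absolute differences summed over the M pixels of column n."""
    return sum(
        abs(ref[top + i + m][left + j + n] - cur[top + m][left + n])
        for m in range(M)
    )


def sad_two_stage(ref, cur, top, left, i, j, M, N):
    """AE_{i,j} obtained by adding the N column sums AE_{i,j,n}."""
    return sum(column_sad(ref, cur, top, left, i, j, n, M) for n in range(N))


def sad_direct(ref, cur, top, left, i, j, M, N):
    """AE_{i,j} obtained directly as a double sum over m and n."""
    return sum(
        abs(ref[top + i + m][left + j + n] - cur[top + m][left + n])
        for m in range(M)
        for n in range(N)
    )
```

Because addition is associative, both functions return the same value for every displacement; the hardware exploits this by assigning one column sum to each PE.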
【００１２】
【数２】 (Equation 2)

$$AE_{i,j,n} = \sum_{m=0}^{M-1} \left| r_{m+i,\,n+j} - t_{m,n} \right|$$

【００１３】
【数３】 (Equation 3)

$$AE_{i,j} = \sum_{n=0}^{N-1} AE_{i,j,n}$$

以下に図３９，図４１を用いて動作の詳細を説明する。
【００１４】
まず図３９のタイミングＦ０で符号化ブロックＴ０の動きベクトル検出動作が開始されると，符号化ブロックＴ０の左端の列の４つの画素データｔ_4,3，ｔ_5,3，ｔ_6,3，ｔ_7,3が図３５の符号化画素データ入力端子８１５から順に入力され，ＰＥ８０５のレジスタ８１９〜８２２にシフトしながら順次格納され，図３９のタイミングＦ４の時点でＰＥ８０５の準備が完了する。このときＰＥ８０５のレジスタの出力は（ｂ０，ｂ１，ｂ２，ｂ３）＝（ｔ_4,3，ｔ_5,3，ｔ_6,3，ｔ_7,3）となり，この値は符号化ブロックＴ０の動きベクトル検索が終了するまで保持される。一方，参照候補画素データは，図３８に示すように符号化ブロックＴ０の検索範囲の左上から縦にｒ_0,0，ｒ_1,0からｒ_10,0まで１１個の画素データが順に図３５の参照画素データ入力端子８１６から入力され，レジスタ８０１から８０４に順にシフトしながら一時記憶されていく。いま，図３９のタイミングＦ４のときレジスタ８０１〜８０４の出力は（ａ０，ａ１，ａ２，ａ３）＝（ｒ_0,0，ｒ_1,0，ｒ_2,0，ｒ_3,0）となっている。図４０はレジスタ８０１〜８０４が格納する参照画素の検索領域内での位置を示すものであり，タイミングＦ４のとき図４０の小ブロックＦ４に示す４画素を記憶している。ＰＥ８０５は常に（ａ０，ａ１，ａ２，ａ３）と（ｂ０，ｂ１，ｂ２，ｂ３）の差分絶対値和を算出するから，タイミングＦ４のサイクルでは図中Ｇに示すように｜ｒ_0,0−ｔ_4,3｜＋｜ｒ_1,0−ｔ_5,3｜＋｜ｒ_2,0−ｔ_6,3｜＋｜ｒ_3,0−ｔ_7,3｜が求められることになる。これは符号化ブロックＴ０の左上座標（３，４）を基準に添え字を相対表記すれば，（数２）の定義式より，ＡＥ_-4,-3,0が求められたことが分かる。つまり図３８に示したベクトル（−３，−４）に対応する差分絶対値総和の３分の１が求められたことになる。以下，図３９に示した有効期間８サイクルの間，（ａ０，ａ１，ａ２，ａ３）は，図４０のＦ４からＦ１１に至る矢印のように順次検索範囲左端を１画素ずつ下がりながら参照画素データ４個を出力することとなるから，ＰＥ８０５はベクトル（−３，−４），（−３，−３），（−３，−２）から（−３，３）までのそれぞれ対応する差分絶対値和，すなわちＡＥ_-4,-3,0，ＡＥ_-3,-3,0，ＡＥ_-2,-3,0からＡＥ_3,-3,0まで８種類の差分絶対値和を出力するものである。
【００１５】
レジスタ８０１〜８０４の状態が図４０の小ブロックＦ１１に示す状態，つまり図３８で検索範囲の左下の参照画素データｒ_10,0を入力完了すると，図３９のタイミングＦ１１の１個のダミーデータを挟んでｒ_0,1から始まる参照画素データ１１個が引き続き入力される。ここで１個のダミーデータを挟んだために図３９で次の有効期間８サイクルが開始する前に４サイクルの無効期間が生じている。この４サイクルの無効期間を利用して符号化ブロックＴ０のｔ_4,4，ｔ_5,4，ｔ_6,4，ｔ_7,4の４個の符号化画素データが符号化画素データ入力端子８１５から順に入力され，ＰＥ８０６のレジスタ８１９〜８２２に格納され，タイミングＨ４の時点でＰＥ８０６の準備が完了する。また，参照データはこのタイミングＨ４のサイクルにおいて，レジスタ８０１〜８０４が図４０のＨ４の状態となるから，ＰＥ８０６の差分絶対値演算は｜ｒ_0,1−ｔ_4,4｜＋｜ｒ_1,1−ｔ_5,4｜＋｜ｒ_2,1−ｔ_6,4｜＋｜ｒ_3,1−ｔ_7,4｜を求めることとなる。これは符号化ブロックＴ０の左上座標を基準に相対表記すれば，ＡＥ_-4,-3,1が求められたことを意味する。以下，タイミングＨ４から始まる８サイクルの有効期間においてＰＥ８０６はＡＥ_-4,-3,1，ＡＥ_-3,-3,1，からＡＥ_3,-3,1まで８種類の差分絶対値和を順次算出することになる。一方，この間もＰＥ８０５のレジスタはｔ_4,3，ｔ_5,3，ｔ_6,3，ｔ_7,3を保持しているから，ＰＥ８０５はＡＥ_-4,-2,0，ＡＥ_-3,-2,0，からＡＥ_3,-2,0まで８種類の差分絶対値和を算出する。図３９のタイミングＩ４から開始される有効範囲８サイクルにおいても同様にＰＥ８０５，ＰＥ８０６，ＰＥ８０７がそれぞれ差分絶対値和ＡＥ_i,j,0，ＡＥ_i,j,1，ＡＥ_i,j,2を順次算出することとなる。
【００１６】
演算データ遅延器８１１はＰＥ８０５の有効期間８サイクルの演算結果であるＡＥ_-4,-3,0からＡＥ_3,-3,0をレジスタ８２９〜８３６に格納し，タイミング制御部８３７はそれに続く４サイクルの期間８個の演算データをレジスタ８２９〜８３６に保持させ，次の有効期間が開始すると保持していた８個の演算データを順次ＰＥ８０６に供給すると同時にＰＥ８０５の新たな演算結果を格納する。従って演算データ遅延器８１１は８個の有効な演算結果を１２サイクル遅延させる先入れ先出しバッファとして機能している。いま図３９のタイミングＦ４でＰＥ８０５から出力された差分絶対値和ＡＥ_-4,-3,0は演算データ遅延器８１１で１２サイクル遅延され，タイミングＨ４でＰＥ８０６に供給される。ＰＥ８０６はＰＥ８０６がタイミングＨ４で算出したＡＥ_-4,-3,1と演算データ遅延器８１１から供給されたＡＥ_-4,-3,0とを加算器８２８で加算し出力する。出力されたＡＥ_-4,-3,0＋ＡＥ_-4,-3,1は演算データ遅延器８１２で１２サイクル遅延され，タイミングＩ４に至るとＰＥ８０７の加算器８２８でＡＥ_-4,-3,2と加算され，ＰＥ８０７の演算結果として出力端子８１７に出力される。従って，この出力はＡＥ_-4,-3,0＋ＡＥ_-4,-3,1＋ＡＥ_-4,-3,2となるが，これは（数３）よりＡＥ_-4,-3が，すなわちベクトル（−３，−４）に対応する予測ブロック候補と符号化ブロックＴ０の差分絶対値総和が求められたこととなる。以下，同様に出力端子８１７には検索範囲の全ての予測ブロック候補と符号化ブロックＴ０との差分絶対値総和ＡＥ_i,jが，無効４サイクルを挟みながら有効８サイクルの期間に順次出力されるから，この値を比較し最も誤差量の小さなベクトルを動きベクトルとして採用することにより符号化ブロックＴ０に対する動きベクトル検出の機能を果たすことが出来る。
【００１７】
以上のように符号化ブロックＴ０の処理に着目すれば，この従来例１ではＰＥ８０５〜ＰＥ８０７からなる３つの演算ユニットを１２サイクル遅延の演算データ遅延器８１１と８１２で結ぶことにより３段のパイプラインを構成し，差分絶対値総和の演算を実現するものである。
【００１８】
次に処理する符号化ブロックの移行について説明する。
【００１９】
図４１は符号化ブロックＴ０の動きベクトル検出動作の後半のタイミングチャートである。ＰＥ８０５〜ＰＥ８０７で算出された符号化ブロックＴ０に対する差分絶対値総和はＰＥ８０７から順次出力されるが，ＰＥ８０５はタイミングＪ１１でＡＥ_3,2,0の算出を終えると符号化ブロックＴ０に対する演算を終了する。４サイクル後のタイミングＯ４の時点でレジスタ８０１〜８０４の出力は（ａ０，ａ１，ａ２，ａ３）＝（ｒ_0,6，ｒ_1,6，ｒ_2,6，ｒ_3,6）となっているが，これは図３８のＥに示す符号化ブロックＴ２の検索領域の左上端に位置する参照画素である。そこでタイミングＪ１１からタイミングＯ４に至る無効４サイクルの期間を用いてＰＥ８０５に符号化ブロックＴ２の左端列４画素データを格納することにより，タイミングＯ４の時点で（ｂ０，ｂ１，ｂ２，ｂ３）＝（ｔ_4,9，ｔ_5,9，ｔ_6,9，ｔ_7,9）となり，ＰＥ８０５は符号化ブロックＴ２のベクトル（−３，−４）に対応する差分絶対値和ＡＥ_-4,-3,0の算出を開始することが出来る。この間もＰＥ８０６とＰＥ８０７は符号化ブロックＴ０のために差分絶対値和演算を継続中である。さらに１２サイクル後，ＰＥ８０６がＴ０の演算を終了するとＴ２の画素データ４個が格納され，Ｔ２の差分絶対値和演算が開始される。すなわちＰＥ８０５〜ＰＥ８０７は符号化ブロックＴ０の演算を終了すると順に符号化ブロックＴ２の演算を開始することができ，ＰＥ８０７からは符号化ブロックＴ０の最後の差分絶対値総和ＡＥ_3,2が出力されると，無効４クロックを挟んで次の有効８クロックから符号化ブロックＴ２の差分絶対値総和ＡＥ_i,jを順次出力することとなる。
【００２０】
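As described above, conventional example 1 uses PE 805 to PE 807 to process coding block T0 and then T2, T4, and so on, handling the even-numbered coding blocks as one series in order. Each PE that finishes its computation for T0 is loaded, in turn, with the pixel data of the next coding block T2, so the transition between coding blocks is made with as little pipeline stalling as possible.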
【００２１】
次に並列処理について説明する。
【００２２】
奇数番号の符号化ブロックの系列の処理はＰＥ８０５〜ＰＥ８０７を用いることが出来ないため（後述の，従来例１の回路規模と処理速度についての説明も参照せよ），これとは別にＰＥ８０８〜ＰＥ８１０を設けて並列処理を実現している。図４１で今タイミングＰ４のサイクルでレジスタ８０１〜８０４の出力は（ａ０，ａ１，ａ２，ａ３）＝（ｒ_0,3，ｒ_1,3，ｒ_2,3，ｒ_3,3）となっているが，これは図３８のＤに示す符号化ブロックＴ１の検索領域の左上端に位置する参照画素である。そこでタイミングＰ４の直前の無効４サイクルの期間を用いてＰＥ８０８に符号化ブロックＴ１の左端列４画素データを格納することにより，タイミングＰ４の時点で（ｂ０，ｂ１，ｂ２，ｂ３）＝（ｔ_4,6，ｔ_5,6，ｔ_6,6，ｔ_7,6）となり，ＰＥ８０８は符号化ブロックＴ１のベクトル（−３，−４）に対応する差分絶対値和ＡＥ_-4,-3,0の算出を開始することが出来る。以下は偶数番号系列の場合のＰＥ８０５〜ＰＥ８０７の動作と同様にＰＥ８０８〜ＰＥ８１０が奇数番号系列の符号化ブロックの動きベクトル検出を実行していくこととなる。
【００２３】
以上のように従来例１では４画素分の参照画素データをレジスタ８０１〜８０４に格納して共通データとし，これを参照範囲に含む複数の符号化ブロックを個別のＰＥに格納することにより並列処理を可能としている。図３５の構成例では符号化ブロックの偶数番と奇数番の２系統の並列処理が実現されているわけである。
【００２４】
ここで，従来例１の回路規模と処理速度について説明する。
【００２５】
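Table 1 shows the circuit scale of conventional example 1. In Table 1, M is the number of pixels of the coding block in the vertical direction, N is the number in the horizontal direction, 2K is the width of the search range, 2L is its height, and Q is the number of series that can be processed in parallel. Q is the smallest value satisfying N×Q ≥ 2K. In the configuration of Fig. 35, M = 4, N = 3, K = 3, and L = 4, so 3×Q ≥ 2×3 gives exactly Q = 2. From Table 1, the registers that store pixel data hold 28 pixels, and the registers of the operation data delay units, which store computation results, hold 32 values. Assuming 8 bits per pixel register and 10 bits per data register, the total is 544 bits, that is, 544 flip-flops. If Fig. 35 is reduced to a single series using only PE 805 to PE 807, with the odd-numbered coding blocks processed after motion vector detection has finished for all even-numbered blocks, then Q = 1 and 288 flip-flops suffice. As a realistic MPEG-2 specification, take 720×480-pixel interlaced video (hereinafter 480i) as input with M = 16, N = 16, K = 64, and L = 32. Then exactly Q = 8, giving eight-series parallel processing. The pixel registers hold 2064 pixels, the result registers hold 7680 values, and about 93,000 flip-flops are needed. As a high-definition example, take 1920×1080-pixel interlaced video (hereinafter 1080i) as input with M = 16, N = 16, K = 128, and L = 64. Sixteen series can then be processed in parallel, and about 340,000 flip-flops are needed.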
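The figures quoted in this paragraph can be reproduced with a short calculation. Table 1 itself is not reproduced in this text, so the two register-count expressions below are assumptions inferred from the quoted numbers (one M-pixel reference register plus an M-pixel register in each of the Q×N PEs, and a delay unit of 2L values between neighbouring PEs in each series); the function name is likewise illustrative.

```python
def conventional1_size(M, N, K, L, Q=None, pixel_bits=8, data_bits=10):
    """Return (Q, pixel registers, data registers, flip-flops).

    The expressions are inferred from the numbers stated in the text,
    not copied from Table 1.
    """
    if Q is None:
        Q = -(-2 * K // N)               # smallest Q with N*Q >= 2K
    pixel_regs = M + Q * N * M           # reference register + one register per PE
    data_regs = Q * (N - 1) * 2 * L      # 2L values per delay unit
    flip_flops = pixel_regs * pixel_bits + data_regs * data_bits
    return Q, pixel_regs, data_regs, flip_flops


print(conventional1_size(4, 3, 3, 4))        # configuration of Fig. 35
print(conventional1_size(4, 3, 3, 4, Q=1))   # single series
print(conventional1_size(16, 16, 64, 32))    # 480i example
print(conventional1_size(16, 16, 128, 64))   # 1080i example
```

Under these assumptions the four calls give 544, 288, 93,312, and 340,096 flip-flops, in line with the values stated above.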
【００２６】
【表１】

表２は従来例１の演算速度を示すものである。表２で有効サイクル数とは，ある１つの符号化ブロックの演算結果が出力されるのに必要な有効サイクル数であり，ロスサイクル数とはその間に入る無効サイクルのことである。従って，１つの符号化ブロック当たりの平均サイクル数とは，有効サイクル数とロスサイクル数の合計を並列処理する系列数で割った値となる。また，ＱがＮ×Ｑ＝２Ｋを満たす場合をのぞき，符号化ブロックの切り替え時にロスサイクルが発生することになるが，表２ではいずれもＮ×Ｑ＝２Ｋの場合について算出している。表２から，図３５の構成，すなわち２系列並列処理では平均３６サイクルで１つの符号化ブロックの動きベクトル検出完了すると考えることができ，１系列の場合は７２クロックで１ブロック完了する。表１の場合と同じ条件の４８０ｉの場合８系列並列処理となるので平均１２８０サイクルで１ブロックの処理が完了すると考えることができる。これは約５２ＭＨｚのクロックで動作させることを意味する。１０８０ｉの場合は２３０４サイクルに１ブロックの演算速度となり，これは５６０ＭＨｚのクロックで動作させることを意味する。
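The cycle counts and clock frequencies in this paragraph can be checked as follows. Table 2 is not reproduced in this text, so the cycle expression is an assumption inferred from the stated values (2L valid cycles plus M loss cycles per reference column, 2K columns, divided by Q series). The frame rate of 30 frames per second and the rounding of the picture height up to whole 16-pixel blocks are also assumptions of this sketch.

```python
def conventional1_cycles_per_block(M, K, L, Q):
    """Average cycles per coding block, inferred from the text's figures."""
    return (2 * L + M) * (2 * K) // Q


def required_clock_hz(width, height, block, frames_per_second, cycles_per_block):
    """Clock needed to process every block of every frame in real time."""
    blocks_x = -(-width // block)    # ceiling division
    blocks_y = -(-height // block)
    return blocks_x * blocks_y * frames_per_second * cycles_per_block


cycles_480i = conventional1_cycles_per_block(16, 64, 32, 8)
cycles_1080i = conventional1_cycles_per_block(16, 128, 64, 16)
print(cycles_480i, required_clock_hz(720, 480, 16, 30, cycles_480i))
print(cycles_1080i, required_clock_hz(1920, 1080, 16, 30, cycles_1080i))
```

With these assumptions the 480i case needs about 51.8 MHz and the 1080i case about 564 MHz, consistent with the approximate 52 MHz and 560 MHz given in the text.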
【００２７】
【表２】

（従来例２）
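Next, a second example of a conventional motion vector detection device is described. This device corresponds to claims 3 and 4 of Patent Document 1. Fig. 42 is a block diagram showing the configuration of conventional example 2, and Fig. 43 shows the structure of PEs 847 to 849 of this device. In Figs. 42 and 43, parts identical to those in Figs. 35 and 36 are given the same reference numerals and their description is omitted.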
【００２８】
図４２で従来例２の構成では，レジスタ８０１〜８０４に対し直列に参照画素データを記憶するレジスタ８３８〜８４１を設け，レジスタ８４１とレジスタ８０１の間に画素データ遅延器８４２を挿入し，レジスタ８３８〜８４１の出力がレジスタ８０１〜８０４に比べて更に４サイクル遅延するように構成している。レジスタ８０１〜８０４とレジスタ８３８〜８４１の出力はセレクタ８４３〜８４６で選択され，４個の参照画素データ出力のみがＰＥ８４７〜８４９に供給される。図中ではセレクタ８４３〜８４６で選択された４個の出力にａ０，ａ１，ａ２，ａ３と信号名を付している。また，ＰＥ間を接続する演算データ遅延器８５０，８５１は有効サイクルの８個のデータを１６サイクル遅延する遅延器に変更されている。
【００２９】
図４３に示すＰＥ８４７の構造は図３６に示す従来例１のＰＥ８０５の構造と比較して，符号化画素データを記憶するレジスタ８１９〜８２２に対して並列にレジスタ８５３〜８５６が設けられ，レジスタ８１９〜８２２とレジスタ８５３〜８５６の出力はセレクタ８５７〜８６０で選択されるという構造に変更されている。図中ではセレクタ８５７〜８６０で選択された４個の出力にｂ０，ｂ１，ｂ２，ｂ３と信号名を付して，選択された参照画素出力ａ０，ａ１，ａ２，ａ３との対応関係を示している。
【００３０】
以下，従来例２の動きベクトル検出装置の動作について説明する。従来例２は偶数番の符号化ブロックの系列と奇数番の符号化ブロックの系列との２つの系列をそれぞれＰＥ内部のレジスタ８１９〜８２２とレジスタ８５３〜８５６に記憶しておき，それぞれの動きベクトル検出動作を進めるものであるが，従来例１では無効サイクル４サイクルを挟みながら２つの系列の符号化ブロックに対する処理が並列処理で実行される構成であったことに対し，従来例２では無効サイクルを８サイクルに拡大して，２つの系列の符号化ブロックの処理が互いの無効サイクルで実行される時分割処理となっている点が異なる。
【００３１】
図４４は従来例２の動作を示すタイミングチャートである。図４４は既に動きベクトル検出動作が開始され，定常状態に入っている時点を示している。参照画素は図３８の参照画像検索範囲の縦１列に相当する１１画素が連続して入力されるが，それに引き続いて５サイクル期間の無効データが入力される。入力された参照画素は図４４に示すように順次レジスタ８０１〜８０４に積み上げられ，レジスタ８０１の出力は画素データ遅延器８４２で４サイクル遅延された後レジスタ８３８〜８４１に順次積み上げられる。その結果，レジスタ８０１〜８０４を１つの組，レジスタ８３８〜８４１を１つの組としたとき，それぞれの組は８サイクルの有効期間を持ち，互いの有効期間は重ならず交互に有効になるという関係になっている。セレクタ８４３〜８４６は有効である方の組のレジスタを選択することで，その出力（ａ０，ａ１，ａ２，ａ３）には常に有効な参照画素データが８サイクル期間ずつ２回繰り返して出力され，ＰＥ８４７〜８４９に供給することができる。図４４に表記した動作の範囲では既にＰＥ８４７〜８４９のそれぞれについてレジスタ８１９〜８２２には符号化ブロックＴ０の画素データが，レジスタ８５３〜８５６には符号化ブロックＴ１の画素データが格納されている。いま，タイミングＶ４のサイクルから始まる有効８サイクル期間ではＰＥ８４７〜８４９のセレクタ８５７〜８６０はレジスタ８１９〜８２２を，すなわち符号化ブロックＴ０の画素データを選択するので，符号化ブロックＴ０と参照画素データ（ａ０，ａ１，ａ２，ａ３）との差分絶対値総和が演算されることになる。タイミングＷ４のサイクルから始まる有効８サイクル期間ではセレクタ８５７〜８６０はレジスタ８５３〜８５６を，すなわち符号化ブロックＴ１の画素データを選択するので，符号化ブロックＴ１と参照画素データ（ａ０，ａ１，ａ２，ａ３）との差分絶対値総和が演算されることになる。タイミングＶ４から開始される８サイクルで演算された符号化ブロックＴ０に関する演算結果は，演算データ遅延器８５０，８５１で１６サイクル遅延させたのち隣接するＰＥに伝達される。従って，タイミングＷ４から開始される８サイクルの符号化ブロックＴ１に関する演算期間を越えて，タイミングＹ４から開始される符号化ブロックＴ０の演算に引き渡される。
【００３２】
このように，１６サイクル遅延の演算データ遅延器８５０，８５１でＰＥ８４７〜８４９を接続することにより符号化ブロックＴ０と符号化ブロックＴ１に関する差分絶対値総和の演算を，それぞれ独立のパイプラインとして実行することができるのである。また，タイミングＹ４から開始される有効８サイクルではＰＥ８４７のレジスタ８１９〜８２２が符号化ブロックＴ２の画素データに切り替わっており，ＰＥ８４８，８４９では符号化ブロックＴ０の演算が継続しながら，ＰＥ８４７では符号化ブロックＴ２の演算が開始されることは上述した従来例１の場合と同じである。
【００３３】
以上のように従来例２では参照画素を記憶するレジスタ（図４２参照）と，符号化ブロックの画素データを記憶するレジスタ（図４３参照）とをそれぞれ２重構造とすることで，符号化ブロックの偶数番の系列と奇数番の系列の２つの系列をロスサイクル無く時分割処理しているのである。
【００３４】
以上のように従来の動きベクトル検出装置では，複数の演算ユニットＰＥをもうけ，隣接するＰＥ間を演算データ遅延器で接続することによりパイプラインを構成して差分絶対値総和の演算を実現し，また，符号化ブロックの演算を終了した演算ユニットから順に次に処理すべき符号化ブロックの画素データを格納することにより，符号化ブロックの移行時にもパイプラインの停滞を最低限に押さえるものである。更に，従来例１では演算ユニット毎に符号化ブロックの画素データを格納するレジスタを設け，複数の系列の演算ユニットを有することで並列処理を可能とし，従来例２では参照画素のレジスタと符号化ブロックのレジスタをそれぞれ２重構造とすることで時分割処理を可能としている。
【００３５】
【特許文献１】
特開平１０−１３６３７７号公報
【００３６】
【発明が解決しようとする課題】
しかしながら，上述した従来の動きベクトル検出装置には，回路規模が増大してしまうという課題があった。
【００３７】
本発明者は，演算データ遅延器の回路規模が検索範囲の大きさに比例してしまうという弊害と，並列処理あるいは時分割処理で符号化ブロックを記憶するレジスタが系列毎に独立に必要であるなど系列間で共有できる回路が極めて少ないため，回路規模が処理系列数にほぼ比例して増大してしまうという弊害とが相乗的に回路規模の増大をもたらしてしまうものであると，分析している。
【００３８】
なお，従来の動きベクトル検出装置の構成には，検索範囲の大きさが比較的小さい場合には配線効率が小さくて済むという長所があるが，検索範囲の大きさが大きい場合には演算データ遅延器の回路規模が爆発的に大きくなってしまうという決定的な短所がある。
【００３９】
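This growth in circuit scale cannot be eliminated by implementation-level improvements, and it makes a practical search range for real video signals extremely difficult to achieve. For example, the 480i case described for Table 1 requires about 93,000 flip-flops and so cannot be realized easily, and the 1080i case requires about 340,000 flip-flops even when operated with a 560 MHz clock, so it is extremely difficult to realize.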
【００４０】
（１）なお，上記従来技術では，並列処理の装置構成に自由度がない。複数系列からなる並列処理において１つの系列に着目すると，符号化ブロックの処理順がＴ０の次にＴ２が処理されるなど符号化画像上で離れた位置に飛んだ順に処理されており，またその位置間隔は符号化ブロックの大きさと検索範囲の大きさの比率で一義的に決まってしまうという特徴がある。処理順が飛び飛びであれば動きベクトル検出処理に続く符号化処理が実現困難となってしまうから，並列処理で装置を構成せざるをえない。しかも並列処理の系列数もやはり前記符号化ブロックの大きさと検索範囲の大きさの比率で一義的に決まってしまうから，処理速度を要求されない装置であっても，極めて高速処理を要求される装置であっても一義的に決まる系列数だけ並列処理回路を持たなければならないことになり，使用目的に応じた最小の回路規模で実現することができない。
【００４１】
（２）また，上記従来技術では，フレームベクトルと２種類のフィールドベクトルとの３種類のベクトルの差分絶対値総和を同時に求めることができないから，独立に算出するしかなく，更なる回路規模が必要となる。ＭＰＥＧ２規格ではフレーム構造のピクチャーの場合にフレームベクトルか又は２種類のフィールドベクトルかいずれか有利なものを符号化ブロック毎に選択することができるが，そのためにはフレームベクトルとフィールドベクトルの検索が必要である。これを最低限の回路増加で同時に求めるという方式が求められているが，上記従来技術では実現することができない。
【００４２】
（３）また，従来例２の回路規模と処理速度について表３と表４にまとめる。表３は従来例２の回路規模を示すものであり，表４は従来例２の処理速度を示すものである。表３，表４の算出条件は表１，表２の従来例１の場合と同じであるが，Ｑは時分割多重できる系列数を意味する。従来例２の場合もＱはＮ×Ｑ≧２Ｋを満たす最小のＱとして求められる。特許文献１の動きベクトル検出装置では２系統時分割処理に限定した技術として記載されており，Ｑ＝２以外のものは本発明者が独自に算出したものである。また，Ｑ＝１の場合は上述した従来例１の構成でＱ＝１の場合に他ならないので省略した。
【００４３】
表３，表４と表１，表２とを比較すると回路規模は同程度であるが，時分割処理となっているため処理速度は従来例２が劣る（したがって，従来例２の場合には，実用的な検索範囲を実現することが従来例１の場合以上に困難となる）。従来例２は後述する本発明の構成との対比においてその差異をより明確にするために説明した。
【００４４】
【表３】

【００４５】
【表４】

本発明は，上記従来のこのような課題を考慮し，回路規模をより小さく抑えることができる動きベクトル検出装置，動きベクトル検出方法，プログラム，および記録媒体を提供することを目的とするものである。
【００４６】
【課題を解決するための手段】
第１の本発明は、符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２、図３参照）を構成する画素データを記憶し，前記符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロック出力ステップと，
参照画像のＭ個の画素を一時記憶し，これを１つの組の参照データとして出力する参照データ出力ステップであって，（１）前記参照データが前記参照画像上縦に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御，及び（２）前記参照データが前記参照画像上横に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御のうち少なくともいずれかの制御を行う参照データ出力ステップと，
１組の前記参照データと１組の前記符号化データとの誤差量を演算する演算ユニット（７〜９，図１参照）を１×Ｎ個利用して，１組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの誤差量を算出する演算ステップと，
前記符号化ブロック内で最も端に位置する符号化データの組の誤差量を１サイクル遅延させて隣接する符号化データの組の誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する誤差量に加算していく累積加算構造により前記Ｎ個の誤差量の総和を求める累積加算ステップとを備えた動きベクトル検出方法である。
【００４７】
第２の本発明は、符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタ（２，図１参照）と，
参照画像のＭ個の画素を一時記憶し，これを１つの組の参照データとして出力する第１の参照レジスタ（１，図１参照）であって，（１）前記参照データが前記参照画像上縦に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御機能のうち少なくともいずれかの制御機能を有する第１の参照レジスタ（１，図１参照）と，
１組の前記参照データと１組の前記符号化データとの誤差量を演算する演算ユニット（７〜９，図１参照）であって，１組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの誤差量を算出する１×Ｎ個の演算ユニット（７〜９，図１参照）と，
前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で最も端に位置する符号化データの組の誤差量を１サイクル遅延させて隣接する符号化データの組の誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する誤差量に加算していく累積加算構造により前記Ｎ個の誤差量の総和を求める累積加算アレイ（１０，図１参照）とを備えた動きベクトル検出装置である。
【００４８】
第３の本発明は、符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，前記符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロック出力ステップと，
参照画像のＭ＋Ｑ−１個の画素を一時記憶し，連続するＭ個の画素を１組の参照データとしてＱ組の前記参照データを出力する参照データ出力ステップであって，（１）前記参照データが前記参照画像上縦に配置されたＭ＋Ｑ−１個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御，及び（２）前記参照データが前記参照画像上横に配置されたＭ＋Ｑ−１個のデータである場合は，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御のうち少なくもいずれかの制御を行う参照データ出力ステップと，
１組の前記参照データと１組の前記符号化データの誤差量を演算する演算ユニット（７〜９，図１４参照）をＱ×Ｎ個利用して，Ｑ組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの前記誤差量を算出する演算ステップと，
前記符号化ブロック内で最も端に位置する前記符号化データの組の誤差量を１サイクル遅延させて隣接する前記符号化データの組の前記誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する前記誤差量に加算していく累積加算構造によりＮ個の前記誤差量の総和をＱ個の累積加算アレイ（１０，２０３，図１４参照）を利用して求める累積加算ステップとを備えた動きベクトル検出方法である。
【００４９】
第４の本発明は、符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタ（２，図１４参照）と，
参照画像のＭ＋Ｑ−１個の画素を一時記憶し，連続するＭ個の画素を１組の参照データとしてＱ組の前記参照データを出力する第１の参照レジスタ（２０１，図１４参照）であって，（１）前記参照データが前記参照画像上縦に配置されたＭ＋Ｑ−１個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ＋Ｑ−１個のデータである場合に，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御機能のうち少なくともいずれかの制御機能を有する第１の参照レジスタ（２０１，図１４参照）と，
１組の前記参照データと１組の前記符号化データの誤差量を演算する演算ユニット（７〜９，図１４参照）であって，Ｑ組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの前記誤差量を算出するＱ×Ｎ個の演算ユニット（７〜９，図１４参照）と，
前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で最も端に位置する前記符号化データの組の誤差量を１サイクル遅延させて隣接する前記符号化データの組の前記誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する前記誤差量に加算していく累積加算構造によりＮ個の前記誤差量の総和を求めるＱ個の累積加算アレイ（１０，２０３，図１４参照）とを備えた動きベクトル検出装置である。
【００５０】
第５の本発明は、前記第１の参照レジスタ（１，図２０参照）とは相異なる第２の参照レジスタ（４０１，図２０参照）と，
前記第１の参照レジスタ（１，図２０参照）から供給される参照データか前記第２の参照レジスタ（４０１，図２０参照）から供給される参照データかいずれかを選択する参照データ切り替えスイッチ（４０７〜４０９，図２０参照）と，
前記第１の参照レジスタ（１，図２０参照）が順次前記参照データを更新し，前記演算ユニット（７〜９，図２０参照）に参照データを供給する第１のモードと，前記第２の参照レジスタ（４０１，図２０参照）が順次前記参照データを更新し前記演算ユニット（７〜９，図２０参照）に前記参照データを供給する第２のモードとの移行時には，移行前の有効な演算が終了した前記演算ユニット（７〜９，図２０参照）から順に前記参照データ切り替えスイッチ（４０７〜４０９，図２０参照）を切り替えるモード制御手段（４１０，図２０参照）とをさらに備えた第２または第４の本発明の動きベクトル検出装置である。
【００５１】
第６の本発明は、前記モード制御手段（４１０，図２０参照）は，新たな前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）のデータを前記符号化ブロックレジスタ（４０２，図２０参照）に記憶させる場合，前記参照データ切り替えスイッチ（４０７〜４０９，図２０参照）の切り替え動作に同期して新たな前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）のデータを１組ずつ順に前記符号化ブロックレジスタ（４０２，図２０参照）に記憶させる第５の本発明の動きベクトル検出装置である。
【００５２】
第７の本発明は、前記累積加算アレイ（５０１，図２３参照）は，（ａ）個々の前記演算ユニット（１０８〜１１１，図２３参照）の前記誤差量の加算結果を１回遅延して隣接する符号化データの組の誤差量に加算することで，Ｎ個の前記誤差量を累積加算するフレーム加算アレイ（５０２，図２３参照）と，（ｂ）偶数または奇数番目であるＮ／２個の演算ユニット（１０８〜１１１，図２３参照）に対して２サイクル遅延しながら前記誤差量を前記累積加算構造で加算するフィールド加算アレイ（５０３，図２３参照）と，（ｃ）前記フレーム加算アレイ（５０２，図２３参照）と前記フィールド加算アレイ（５０３，図２３参照）との結果の差を求める演算手段（５０６，図２３参照）とを有する第２，４，５，６の本発明の何れかの動きベクトル検出装置である。
【００５３】
第８の本発明は、前記演算ユニット（６０２〜６０４，図２５参照）は，入力された前記参照データの組と前記符号化データの組とに対して，それぞれの偶数位置の画素に対する誤差量と，それぞれの奇数位置の画素に対する誤差量との２種類の誤差量を求め，
前記累積加算アレイ（６０５，図２５参照）は，（ａ）前記２種類の誤差量を独立にそれぞれ累積加算構造で加算する第１のフィールド加算アレイ（６０６，図２５参照）と，（ｂ）第２のフィールド加算アレイ（６０７，図２５参照）とを有する第２，４，５，６の本発明の何れかの動きベクトル検出装置である。
【００５４】
第９の本発明は、前記演算ユニット（６０２，図２８参照）は，入力された前記参照データの組と前記符号化データの組とに対して，それぞれの偶数位置または奇数位置の画素に対する第１の誤差量と，全ての前記画素に対する第２の誤差量との２種類の誤差量を求め，
前記累積加算アレイ（６０５，図２５参照）は，（ａ）前記第１の誤差量を独立に累積加算するフィールド加算アレイ（６０６，図２５参照）と，（ｂ）前記第２の誤差量を独立に累積加算するフレーム加算アレイ（６０７，図２５参照）と，（ｃ）前記フィールド加算アレイと前記フレーム加算アレイとの結果の差を求める演算手段（６０８，図２５参照）とを有する第２，４，５，６の本発明の何れかの動きベクトル検出装置である。
【００５５】
第１０の本発明は、符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，同一フィールドにおけるＭ個の前記画素データを１つの組として，第１フィールドの符号化データＮ／２組と第２フィールドの符号化データＮ／２組とを出力する符号化ブロックレジスタ（１０２，図２９参照）と，
参照画像の同一フィールドにおけるＭ個の画素データを記憶し，これを１つの組の参照データとして出力する第１フィールドおよび第２フィールドに対応する参照レジスタ（７０１〜７０３，図３参照）と，
前記参照データ１組と前記符号化データＮ／２組とを入力とし，フィールド誤差量を求めることができるフィールド評価手段（７０４〜７０７，図２９参照）と，
前記第１フィールドの参照データと前記第１フィールドの符号化データとに対するフィールド誤差量と，前記第２フィールドの参照データと前記第２フィールドの符号化データとに対するフィールド誤差量とを加算する第１の加算器（７２０，図２９参照）と，
前記第１フィールドの参照データと前記第２フィールドの符号化データとに対するフィールド誤差量と，前記第２フィールドの参照データと前記第１フィールドの符号化データとに対するフィールド誤差量とを加算する第２の加算器（７２１，図２９参照）とを備え，
前記参照レジスタ（７０１〜７０３，図３参照）は，（１）前記参照データが前記参照画像上縦に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納する制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納する制御機能のうち少なくともいずれかの制御機能を有し，
前記フィールド評価手段（７０４〜７０７，図２９参照）は，１組の前記参照データとＮ／２組の前記符号化データとの全ての組み合わせの誤差量を算出するＮ／２個の演算ユニット（７０８〜７１５，図２９参照）を有し，前記Ｎ／２個の誤差量から累積加算構造で総和を求め前記フィールド誤差量として出力する動きベクトル検出装置である。
【００５７】
第１１の本発明は、第２の本発明の動きベクトル検出装置の，符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタ（２，図１参照）と，参照画像のＭ個の画素を一時記憶し，これを１つの組の参照データとして出力する第１の参照レジスタ（１，図１参照）であって，（１）前記参照データが前記参照画像上縦に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ個のデータである場合は，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御機能のうち少なくともいずれかの制御機能を有する第１の参照レジスタ（１，図１参照）と，１組の前記参照データと１組の前記符号化データとの誤差量を演算する演算ユニット（７〜９，図１参照）であって，１組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの誤差量を算出する１×Ｎ個の演算ユニット（７〜９，図１参照）と，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で最も端に位置する符号化データの組の誤差量を１サイクル遅延させて隣接する符号化データの組の誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する誤差量に加算していく累積加算構造により前記Ｎ個の誤差量の総和を求める累積加算アレイ（１０，図１参照）としてコンピュータを機能させるためのプログラムである。
【００５９】
第１２の本発明は、第４の本発明の動きベクトル検出装置の，符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタ（２，図１４参照）と，参照画像のＭ＋Ｑ−１個の画素を一時記憶し，連続するＭ個の画素を１組の参照データとしてＱ組の前記参照データを出力する第１の参照レジスタ（２０１，図１４参照）であって，（１）前記参照データが前記参照画像上縦に配置されたＭ＋Ｑ−１個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納するための制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ＋Ｑ−１個のデータである場合に，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納するための制御機能のうち少なくともいずれかの制御機能を有する第１の参照レジスタ（２０１，図１４参照）と，１組の前記参照データと１組の前記符号化データの誤差量を演算する演算ユニット（７〜９，図１４参照）であって，Ｑ組の前記参照データとＮ組の前記符号化データとの全ての組み合わせの前記誤差量を算出するＱ×Ｎ個の演算ユニット（７〜９，図１４参照）と，前記符号化ブロック（Ｔ０〜Ｔ２，図３参照）内で最も端に位置する前記符号化データの組の誤差量を１サイクル遅延させて隣接する前記符号化データの組の前記誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する前記誤差量に加算していく累積加算構造によりＮ個の前記誤差量の総和を求めるＱ個の累積加算アレイ（１０，２０３，図１４参照）としてコンピュータを機能させるためのプログラムである。
【００６０】
第１３の本発明は、第１０の本発明の動きベクトル検出装置の，符号化画像上の矩形領域である符号化ブロック（Ｔ０〜Ｔ２，図３参照）を構成する画素データを記憶し，同一フィールドにおけるＭ個の前記画素データを１つの組として，第１フィールドの符号化データＮ／２組と第２フィールドの符号化データＮ／２組とを出力する符号化ブロックレジスタ（１０２，図２９参照）と，参照画像の同一フィールドにおけるＭ個の画素データを記憶し，これを１つの組の参照データとして出力する第１フィールドおよび第２フィールドに対応する参照レジスタ（７０１〜７０３，図３参照）と，前記参照データ１組と前記符号化データＮ／２組とを入力とし，フィールド誤差量を求めることができるフィールド評価手段（７０４〜７０７，図２９参照）と，前記第１フィールドの参照データと前記第１フィールドの符号化データとに対するフィールド誤差量と，前記第２フィールドの参照データと前記第２フィールドの符号化データとに対するフィールド誤差量とを加算する第１の加算器（７２０，図２９参照）と，前記第１フィールドの参照データと前記第２フィールドの符号化データとに対するフィールド誤差量と，前記第２フィールドの参照データと前記第１フィールドの符号化データとに対するフィールド誤差量とを加算する第２の加算器（７２１，図２９参照）としてコンピュータを機能させるためのプログラムであって，
前記参照レジスタは，（１）前記参照データが前記参照画像上縦に配置されたＭ個のデータである場合に，前記参照データを前記参照画像上水平方向に順次ずらしながら取り出して格納する制御機能，及び（２）前記参照データが前記参照画像上横に配置されたＭ個のデータである場合は，前記参照データを前記参照画像上垂直方向に順次ずらしながら取り出して格納する制御機能のうち少なくともいずれかの制御機能を有し，
前記フィールド評価手段は，１組の前記参照データとＮ／２組の前記符号化データとの全ての組み合わせの誤差量を算出するＮ／２個の演算ユニットを有し，前記Ｎ／２個の誤差量から累積加算構造で総和を求め前記フィールド誤差量として出力する，プログラムである。
【００６１】
第１４の本発明は、第１１から１３の本発明の何れかのプログラムを担持した記録媒体であって，コンピュータにより処理可能な記録媒体である。
【００６２】
【発明の実施の形態】
以下，本発明の実施の形態について，図面を用いて説明する。
【００６３】
（実施の形態１）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【００６４】
図１は本実施の形態の動きベクトル検出装置を示すブロック図である。
【００６５】
実施の形態１は，上述した第１，第２の本発明に関するものであり，第１，第２の本発明の縦１列のＭ個の画素データを１つの組とする場合に相当するものである。この実施の形態１では予測ブロック候補と符号化ブロックとの誤差量として，（数１）および（数２），（数３）に示した差分絶対値総和を採用し，符号化ブロックの大きさは水平Ｎ＝３，垂直Ｍ＝４，検索領域は水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【００６６】
本明細書においては，符号化ブロックを分解した小ブロックの画素数を記号Ｍで表記し，小ブロックの個数を記号Ｎで表記するようにしている。したがって，（１）符号化ブロックを列方向に分解する場合には，縦方向に関する量をＭで表記し，横方向に関する量をＮで表記し，（２）符号化ブロックを行方向に分解する場合には，縦方向に関する量をＮで表記し，横方向に関する量をＭで表記している（符号化ブロックを列方向に分解するのか行方向に分解するのかにかかわらず，水平検索範囲をＫで表記し，垂直検索範囲をＬで表記している）。なお，Ｎが奇数である場合には，Ｎ／２は（Ｎ＋１）／２と解釈される。もちろん，これらに関しては，以下の実施の形態においても同様である。
【００６７】
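In Fig. 1, reference register 1 temporarily stores four pixels a0 to a3 of the reference picture and outputs them as one set of reference data A. Coding block register 2 divides the 12-pixel coding block into three sets, each a small block of four pixels arranged in one vertical column. It consists of coding small-block register 3, which stores b00 to b30 (column 0), coding small-block register 4, which stores b01 to b31 (column 1), and coding small-block register 5, which stores b02 to b32 (column 2), and it outputs their contents as three sets of coded data B0, B1, and B2. Operation block 6 consists of operation units 7 to 9. Operation unit 7 takes reference data A and coded data B0 as inputs, operation unit 8 takes A and B1, and operation unit 9 takes A and B2; each outputs the sum of absolute differences AE_{i,j,n} of (Equation 2). The outputs of operation units 7 to 9 are connected to cumulative addition array 10, which outputs the total sum of absolute differences AE_{i,j} of (Equation 3). In cumulative addition array 10, the output of operation unit 7 passes through delay unit 11 and is added to the output of operation unit 8 by adder 12; that result passes through delay unit 13 and is added to the output of operation unit 9 by adder 14. Fig. 2 is a block diagram showing the internal configuration of operation unit 7. In operation unit 7, the corresponding elements of the input reference data A and coded data B0 are connected to absolute-difference calculators 15 to 18, whose outputs are summed by adders 19 to 21. Operation units 8 and 9 have the same configuration as operation unit 7 in Fig. 2, except that the coded data is B1 or B2 instead of B0, so their description is omitted.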
【００６８】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。なお，本実施の形態の動きベクトル検出装置の動作について説明しながら，本発明の動きベクトル検出方法の一実施の形態についても説明する（以下の実施の形態についても同様である）。
【００６９】
図３は本実施の形態の符号化画像と参照画像で各ブロックと画素および検索領域の位置関係を示す領域関係図，図４は動作の詳細を示すタイミングチャートである。
【００７０】
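First, the motion vector detection operation for coding block T0 in Embodiment 1 is described. In Fig. 3, the search range of coding block T0 is the range C on the reference picture, and error evaluation starts from vector (-3, -4), that is, from the top-left corner of the search range. Computation for T0 starts at the cycle of timing D0 in Fig. 4. It is assumed that, in the immediately preceding cycle, coding small-block register 3 has been loaded with the left column of T0 in Fig. 3, (b00, b10, b20, b30) = (t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3}), coding small-block register 4 with the centre column, (b01, b11, b21, b31) = (t_{4,4}, t_{5,4}, t_{6,4}, t_{7,4}), and coding small-block register 5 with the right column, (b02, b12, b22, b32) = (t_{4,5}, t_{5,5}, t_{6,5}, t_{7,5}), and that these are already being output as the three sets B0, B1, and B2. When operation starts at the cycle of timing D0 in Fig. 4, reference register 1 is loaded with the reference pixel data (a0, a1, a2, a3) = (r_{0,0}, r_{1,0}, r_{2,0}, r_{3,0}) of Fig. 3 and outputs it as one set A. In cycle D0, operation unit 7 computes the sum of absolute differences between reference data A and coded data B0. The result is |r_{0,0} - t_{4,3}| + |r_{1,0} - t_{5,3}| + |r_{2,0} - t_{6,3}| + |r_{3,0} - t_{7,3}|, which by (Equation 2) is AE_{-4,-3,0}. Next, in the cycle of timing D1, reference register 1 is loaded with (a0, a1, a2, a3) = (r_{0,1}, r_{1,1}, r_{2,1}, r_{3,1}). Fig. 5 is a reference data area diagram showing the small-block areas that reference register 1 holds within the search area of the reference picture. At timing D0 the reference data held in reference register 1 was the four vertical pixels at the top-left of the search area, shown as small block D0 in the figure; at timing D1 it has moved to the horizontally adjacent four-pixel small block D1. This corresponds to the case in the first and second inventions where the reference data is four vertically arranged pixels and is moved horizontally, and this movement control differs greatly from that of the conventional example shown in Fig. 40. Throughout, coding block register 2 continues to hold the pixels of coding block T0. At timing D1 in Fig. 4, operation unit 7 computes |r_{0,1} - t_{4,3}| + |r_{1,1} - t_{5,3}| + |r_{2,1} - t_{6,3}| + |r_{3,1} - t_{7,3}|, that is, AE_{-4,-2,0}, and operation unit 8, which receives the same reference data A, computes |r_{0,1} - t_{4,4}| + |r_{1,1} - t_{5,4}| + |r_{2,1} - t_{6,4}| + |r_{3,1} - t_{7,4}|, which by (Equation 2) is AE_{-4,-3,1}. Likewise, in the cycle of timing D2, operation units 7 to 9 compute AE_{-4,-1,0}, AE_{-4,-2,1}, and AE_{-4,-3,2} respectively. Thereafter the values AE_{i,j,n} for coding block T0 are computed in order.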
【００７１】
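The outputs of operation units 7 to 9 are added in cumulative addition array 10 to obtain the total sum of absolute differences. AE_{-4,-3,0}, obtained by operation unit 7 in the cycle of timing D0 in Fig. 4, is delayed one cycle by delay unit 11 of cumulative addition array 10 and, in the cycle of timing D1, is added by adder 12 to AE_{-4,-3,1} obtained by operation unit 8, giving AE_{-4,-3,0} + AE_{-4,-3,1}. This result is delayed one cycle by delay unit 13 and, in the cycle of timing D2, is added by adder 14 to AE_{-4,-3,2} obtained by operation unit 9, giving AE_{-4,-3,0} + AE_{-4,-3,1} + AE_{-4,-3,2}. Comparison with (Equation 3) shows that this is AE_{-4,-3}. That is, the total sum of absolute differences AE_{-4,-3}, the error between coding block T0 and the prediction block candidate for vector (-3, -4), has been obtained. Connecting the three operation units 7 to 9 through delay units 11 and 13 forms a pipeline, so that AE_{-4,-3}, AE_{-4,-2}, and so on up to AE_{-4,2} are obtained one per cycle.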
【００７２】
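Through this series of pipeline operations, the band of height four pixels from small block D0 to small block D7 within the search range in Fig. 5 has been evaluated from left to right. When the computation of AE_{-4,2} finishes at timing D7 in Fig. 4, operation moves to the cycle of timing E0: reference register 1 is loaded with the reference pixel data (a0, a1, a2, a3) = (r_{1,0}, r_{2,0}, r_{3,0}, r_{4,0}), operation unit 7 computes AE_{-3,-3,0}, and a new series of pipeline operations begins. Coding block register 2 still holds the pixels of coding block T0, so the pipeline operations from timing E0 to timing E7 evaluate the error between T0 and the prediction block candidates over the four-pixel-high band from small block E0 to small block E7 in Fig. 6, from left to right. Comparing the bands evaluated in Figs. 5 and 6 shows that the band in Fig. 6 is one pixel lower than in Fig. 5. This is because, after the computation from D0 to D7, the pixels in the top row of the search area have been fully evaluated. In the same way, after the computation from E0 to E7 the pixels in the top two rows have been fully evaluated. Thereafter the band moves down one row each time a new pipeline pass starts, and when the eighth pass, F0 to F7 in Fig. 7, finishes, all error computations for coding block T0 are complete. The motion vector of T0 is determined by finding the minimum of the total sums of absolute differences obtained during this period. The vector obtained is a frame vector when the coded picture and reference picture have frame structure, and a field vector when they have field structure.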
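The data flow of Embodiment 1 can be modelled cycle by cycle to confirm that one-cycle delays between neighbouring operation units are enough to assemble every AE_{i,j}. The sketch below is a software model under simplifying assumptions, not the circuit itself: the search window is passed in as a list of rows whose top-left element corresponds to r_{0,0}, the function names are illustrative, and register loading is treated as free.

```python
def column_pipeline_search(window, block, M, N, K, L):
    """Cycle-level model of the Embodiment 1 data flow (requires N >= 2).

    window: (2L+M-1) rows x (2K+N-1) columns of reference pixels.
    block:  M rows x N columns of coding-block pixels.
    Returns ({(i, j): AE_{i,j}}, number_of_cycles).
    """
    cols = [[block[m][n] for m in range(M)] for n in range(N)]  # B0..B(N-1)
    delay = [0] * (N - 1)        # one delay register between neighbouring units
    result, cycles = {}, 0
    for i in range(-L, L):                   # one pipeline pass per row offset
        for c in range(2 * K + N - 1):       # reference column moves right
            a = [window[i + L + m][c] for m in range(M)]   # reference register
            s = [sum(abs(x - y) for x, y in zip(a, col)) for col in cols]
            out = delay[-1] + s[-1]          # last adder of the array
            for k in range(N - 2, 0, -1):    # update using the old values
                delay[k] = delay[k - 1] + s[k]
            delay[0] = s[0]
            if c >= N - 1:                   # pipeline filled: output is valid
                result[(i, c - (N - 1) - K)] = out
            cycles += 1
    return result, cycles


def direct_sad(window, block, M, N, K, L, i, j):
    """Reference value of AE_{i,j} computed as a plain double sum."""
    return sum(abs(window[i + L + m][j + K + n] - block[m][n])
               for m in range(M) for n in range(N))
```

For M = 4, N = 3, K = 3, L = 4 the model runs 8 passes of 8 cycles and produces all 48 candidate errors, each equal to the directly computed sum.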
【００７３】
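When the computation for small block F7 in Fig. 7 finishes, the 12 pixels of coding block T1 are newly stored in coding block register 2 according to the relationship in Fig. 3, and at the same time the four pixels at the top-left of the search area corresponding to coding block T1 are stored in reference register 1; the pipeline operation then starts exactly as for coding block T0. Unlike the conventional examples, the reference data held in reference register 1 is not shared by multiple coding blocks, and new reference data can be loaded at the same time as a new coding block is started. Processing of T1 can therefore start right after T0, followed by T2, so that blocks are completed in order without skipping positions on the coded picture.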
【００７４】
実施の形態１の回路規模と処理速度について表５と表６にまとめる。表５は実施の形態１の回路規模を示すものであり，表６は実施の形態１の処理速度を示すものである。表５，表６の算出条件は表１，表２の従来例１の算出条件と同じである。本実施の形態１は並列演算していないので，表５，表６の図１の場合と，従来例１の表１，表２でＱ＝１の場合（すなわち，１系列の場合）とを比較すると，本実施の形態では必要とされるフリップフロップの数は約半減し，１ブロック当たりに必要なサイクル数も少なくなっている。これはパイプライン演算を構成するために演算ユニット７〜９をつなぐ遅延器１１，１３が検索範囲の大きさに関わりなく常に１画素分で構成できているためにレジスタ数が半減でき，また演算ユニットが有効な演算を実行していないロスサイクルが減少できているために演算速度が改善されたのである。また，４８０ｉや１０８０ｉの例では従来例１と比較して，並列演算していないので処理サイクルは約８．５倍，約１５倍とそれぞれ増加しているが，フリップフロップ数は約２．５％，約０．７％と激減している。
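The flip-flop comparison in this paragraph can be reproduced with a few lines. Tables 5 and 6 are not reproduced in this text, so the expressions below are assumptions inferred from the description of Fig. 1 (one M-pixel reference register, N small-block registers of M pixels, and N-1 delay registers) and from the 8 passes of 8 cycles described for Fig. 4 to Fig. 7. The 8-bit pixel and 10-bit data widths are those assumed for conventional example 1, and the function names are illustrative.

```python
def embodiment1_size(M, N, pixel_bits=8, data_bits=10):
    """Return (pixel registers, data registers, flip-flops)."""
    pixel_regs = M + N * M     # reference register + coding block register
    data_regs = N - 1          # one delay register between neighbouring units
    return pixel_regs, data_regs, pixel_regs * pixel_bits + data_regs * data_bits


def embodiment1_cycles(N, K, L):
    """Cycles per coding block: 2L passes of (2K + N - 1) cycles each."""
    return (2 * K + N - 1) * 2 * L


small = embodiment1_size(4, 3)      # configuration of Fig. 1
large = embodiment1_size(16, 16)    # 16 x 16 block of the 480i and 1080i examples
print(small, embodiment1_cycles(3, 3, 4))
print(large, large[2] / 93312, large[2] / 340096)
```

Under these assumptions the Fig. 1 configuration needs 148 flip-flops and 64 cycles, against 288 flip-flops and 72 cycles for the single-series conventional example, and the 16 x 16 case needs 2,326 flip-flops, roughly 2.5% and 0.7% of the 480i and 1080i figures quoted for conventional example 1.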
【００７５】
【表５】

【００７６】
【表６】

以上のように実施の形態１によれば，予測ブロック候補の縦１列を小ブロックに設定し，それを参照画面上水平方向に順次ずらしながらパイプライン演算を行う構成としたことにより，演算ユニットが有効な演算を実行しないロスサイクルを減少させ高速な処理が実現でき，検索範囲の大小にかかわらず演算ユニット間を１画素分のみの遅延器で接続してパイプライン演算を構成するからレジスタ数が最小となり，小さな回路規模で実現できているものである。また，処理する符号化ブロックがＴ０，Ｔ１，Ｔ２と画像上連続した位置で順に処理完了することができるから，動きベクトル検出に引き続き実行される符号化処理の構成が容易であり，多重化構成を前提とせずとも装置を構成でき，用途に応じて高速処理が必要ではない場合など極めて小さな回路で動きベクトル検出装置を実現できるものである。
【００７７】
（実施の形態２）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【００７８】
図８は本実施の形態の動きベクトル検出装置を示すブロック図である。
【００７９】
実施の形態２も，上述した第１，第２の本発明に関するものであり，第１，第２の本発明の横１行のＭ個の画素データを１つの組とする場合に相当するものである。また，この実施の形態２でも予測ブロック候補と符号化ブロックとの誤差量として，差分絶対値総和を採用する。（数４）に実施の形態２での差分絶対値和の定義式を示すが，（数１）に示した定義式とはＭ，Ｎの扱いを逆にしている（小ブロックの分解において行を単位とするか列を単位とするかが異なっているからである）。すなわち，Ｍは符号化ブロックの水平の大きさ，Ｎは垂直の大きさである。（数４）の差分絶対値総和ＡＥ_i,jはｍとｎについての２重総和となっているが，これを（数５）に示すｍに関する総和，すなわち同一行内の差分絶対値和ＡＥ_i,j,nと，（数６）に示すｎに関する総和の２段階に分解しても同じく差分絶対値総和ＡＥ_i,jを求めることが出来る。実施の形態２ではこの（数５）と（数６）の関係を用いて符号化ブロックと予測ブロック候補との差分絶対値総和を演算するものである。また，実施の形態２のブロックの大きさは水平Ｍ＝３，垂直Ｎ＝４とし，検索領域については，水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【００８０】
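【数４】 (Equation 4)

$$AE_{i,j} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} \left| r_{n+i,\,m+j} - t_{n,m} \right|, \qquad -L \le i \le L-1,\quad -K \le j \le K-1$$
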
【００８１】
【数５】 (Equation 5)

$$AE_{i,j,n} = \sum_{m=0}^{M-1} \left| r_{n+i,\,m+j} - t_{n,m} \right|$$

【００８２】
【数６】 (Equation 6)

$$AE_{i,j} = \sum_{n=0}^{N-1} AE_{i,j,n}$$

以下，図８で本実施の形態２の構成を説明するが，上述した図１の実施の形態１と同一部分には同一番号を付し説明を省略する。
【００８３】
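In Fig. 8, reference register 101 temporarily stores three pixels a0 to a2 of the reference picture and outputs them as one set of reference data A. Coding block register 102 divides the 12-pixel coding block into four sets, each consisting of three pixels arranged in one horizontal row. It consists of coding small-block register 103, which stores b00 to b02 (the small block of row 0), coding small-block register 104, which stores b10 to b12 (row 1), coding small-block register 105, which stores b20 to b22 (row 2), and coding small-block register 106, which stores b30 to b32 (row 3), and it outputs their contents as four sets of coded data B0, B1, B2, and B3. Operation block 107 consists of operation units 108 to 111, which take reference data A and coded data B0 to B3 respectively as inputs and output the sums of absolute differences of (Equation 5). The outputs of operation units 108 to 111 are connected to cumulative addition array 112, which outputs the total sum of absolute differences of (Equation 6). Cumulative addition array 112 is cumulative addition array 10 of Fig. 1 in Embodiment 1 with delay unit 113 and adder 114 added. Fig. 9 is a block diagram showing the internal configuration of operation unit 108. Its internal structure is that of operation unit 7 in Fig. 2 of Embodiment 1 with absolute-difference calculator 18 and adder 20 removed. Operation units 109 to 111 have the same configuration as operation unit 108 in Fig. 9, except that the coded data is B1, B2, or B3 instead of B0, so their description is omitted.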
【００８４】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【００８５】
図１０は動作の詳細を示すタイミングチャート，図１１〜図１３は参照レジスタ１０１が保持するデータが参照画像の検索範囲内に占める位置関係を示す領域関係図である。
【００８６】
実施の形態２は符号化ブロックＴ０の動きベクトル検出について，ベクトル（−３，−４）すなわち検索範囲の左上端から誤差量の演算を開始する。図１０のタイミングＧ０のサイクルで参照レジスタ１０１には参照画素データ（ａ０，ａ１，ａ２）＝（ｒ_0,0，ｒ_0,1，ｒ_0,2）が読み込まれており，１つの出力の組Ａとして出力され，符号化小ブロックレジスタ１０３には（ｂ００，ｂ０１，ｂ０２）＝（ｔ_4,3，ｔ_4,4，ｔ_4,5）が，符号化小ブロックレジスタ１０４には（ｂ１０，ｂ１１，ｂ１２）＝（ｔ_5,3，ｔ_5,4，ｔ_5,5）が，符号化小ブロックレジスタ１０５には（ｂ２０，ｂ２１，ｂ２２）＝（ｔ_6,3，ｔ_6,4，ｔ_6,5）が，符号化小ブロックレジスタ１０６には（ｂ３０，ｂ３１，ｂ３２）＝（ｔ_7,3，ｔ_7,4，ｔ_7,5）がそれぞれ読み込まれており，４つの出力の組Ｂ０，Ｂ１，Ｂ２，Ｂ３として出力されている。実施の形態１と異なるのは参照レジスタ１０１は予測ブロック候補の行を単位とする小ブロックを１つ記憶する点と，符号化小ブロックレジスタ１０３〜１０６は符号化ブロックＴ０の行を単位とする小ブロック４つを記憶し，その出力は４つの出力の組として取り扱われる点である。誤差量演算が開始されると，タイミングＧ０のサイクルで（数５）に基づき演算ユニット１０８がＡＥ_-4,-3,0を算出し，タイミングＧ１では演算ユニット１０８〜１０９がＡＥ_-3,-3,0とＡＥ_-4,-3,1を求め，以下演算ユニット１０８〜１１１が符号化ブロックＴ０と参照ブロック候補の差分絶対値和ＡＥ_i,j,nを順次演算する。一方累積加算アレイ１１２は演算ユニット１０８〜１１１の演算結果を１サイクル遅延させながら加算することでパイプライン演算を構成し，差分絶対値総和ＡＥ_i,jを算出していく。
【００８７】
以上は上述した実施の形態１の図４に示した動作と類似した動作となっているが，実施の形態２では参照レジスタ１０１に格納されるデータの更新方法が実施の形態１と異なっている。タイミングＧ０のとき参照レジスタ１０１は図１１で検索領域左上端の横３画素からなる小ブロックＧ０を格納して誤差量演算を開始するが，タイミングＧ１に進むとそれを垂直方向に隣接する３画素である小ブロックＧ１に移動させ，以下タイミングＧ１０で検索領域の下端の小ブロックＧ１０に至るまで幅３画素の帯状の領域を上から下へと移動させながら一連のパイプライン動作で誤差量演算を実行するものである。タイミングＧ１０で一連のパイプライン演算を終了すると，タイミングＨ０のサイクルに移り，参照レジスタ１０１には参照画素データ（ａ０，ａ１，ａ２）＝（ｒ_0,1，ｒ_0,2，ｒ_0,3）が読み込まれ，新たに一連のパイプライン動作が開始される。この間も符号化ブロックレジスタ１０２は符号化ブロックＴ０の画素を保持している。タイミングＨ０からタイミングＨ１０に至る一連のパイプライン動作では，図１２の小ブロックＨ０〜小ブロックＨ１０の幅３画素の帯状の領域を上から下へ順に演算して，符号化ブロックＴ０と予測ブロック候補との誤差量評価を実行することになる。以下，一連のパイプライン動作が完了する毎に上記帯状の演算領域が１列ずつ右に移動し，パイプライン動作６巡目である図１３の小ブロックＩ０〜小ブロックＩ１０の演算を完了すると符号化ブロックＴ０に関する誤差量演算が完了する。この間の差分絶対値総和の最小値を調べることにより符号化ブロックＴ０の動きベクトルを決定することができるのである。
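The row-oriented variant of Embodiment 2 can be modelled in the same way. In this sketch M is the block width and N the block height, following the convention of this embodiment; the reference register holds M horizontally adjacent pixels and moves down one row per cycle. The function names and the list-of-rows window layout are assumptions made for illustration.

```python
def row_pipeline_search(window, block, M, N, K, L):
    """Cycle-level model of the Embodiment 2 data flow (requires N >= 2).

    window: (2L+N-1) rows x (2K+M-1) columns of reference pixels.
    block:  N rows x M columns of coding-block pixels.
    Returns ({(i, j): AE_{i,j}}, number_of_cycles).
    """
    rows = [list(block[n]) for n in range(N)]   # small-block registers B0..B(N-1)
    delay = [0] * (N - 1)
    result, cycles = {}, 0
    for j in range(-K, K):                      # one pass per column offset
        for c in range(2 * L + N - 1):          # reference row moves down
            a = window[c][j + K:j + K + M]      # reference register (one row)
            s = [sum(abs(x - y) for x, y in zip(a, row)) for row in rows]
            out = delay[-1] + s[-1]
            for k in range(N - 2, 0, -1):       # update using the old values
                delay[k] = delay[k - 1] + s[k]
            delay[0] = s[0]
            if c >= N - 1:                      # pipeline filled: output is valid
                result[(c - (N - 1) - L, j)] = out
            cycles += 1
    return result, cycles


def direct_sad_rows(window, block, M, N, K, L, i, j):
    """Reference value of AE_{i,j} computed as a plain double sum."""
    return sum(abs(window[i + L + n][j + K + m] - block[n][m])
               for n in range(N) for m in range(M))
```

For M = 3, N = 4, K = 3, L = 4 the model runs 6 passes of 11 cycles, matching the six passes G0 to G10 through I0 to I10 described above.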
【００８８】
実施の形態２の回路規模と演算速度については，上述した実施の形態１の場合とほぼ同じであるから説明を省略する。
【００８９】
以上のように実施の形態２によれば，予測ブロック候補の横１行を小ブロックに設定し，それを参照画面上垂直方向に順次ずらしながらパイプライン演算を行う構成としたことにより，演算ユニットが有効な演算を実行しないロスサイクルを減少させ高速な処理が実現でき，検索範囲の大小にかかわらず演算ユニット間を１画素分のみの遅延器で接続してパイプライン演算を構成するから，レジスタ数が最小となり，小さな回路規模で実現できるものである。また，処理する符号化ブロックがＴ０，Ｔ１，Ｔ２と画像上連続した位置で順に処理完了することができるから，多重化構成を前提とせずとも構成でき，用途に応じて高速処理が要求されない場合など極めて小さな回路で動きベクトル検出装置を実現できるものである。
【００９０】
実施の形態１と実施の形態２は全く同じ効果を持ちつつ，参照画像に対する読み込み方法が異なるものであるから，動きベクトル検出装置に接続して使用するメモリなどの参照画像記憶媒体の特性，動作条件に応じて，実施の形態１あるいは実施の形態２からより有利である形態を選択して具現化することが出来る。
【００９１】
（実施の形態３）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【００９２】
図１４は本実施の形態の動きベクトル検出装置を示すブロック図である。
【００９３】
実施の形態３は，上述した第３，第４の本発明に関するものであり，第３，第４の本発明の参照データが縦１列のＭ＋Ｑ−１個のデータである場合に相当するものである。また，この実施の形態３では予測ブロック候補と符号化ブロックとの誤差量として，（数１）および（数２），（数３）に示した差分絶対値総和を採用し，符号化ブロックの大きさは水平Ｎ＝３，垂直Ｍ＝４，検索領域は水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【００９４】
図１４で本実施の形態３の構成を説明するが，上述した図１の実施の形態１と同一部分には同一番号を付して説明を省略する。
【００９５】
参照レジスタ２０１は参照画像のａ０〜ａ４から成る５個の画素を一時記憶して，連続する４個の画素を組とし，参照データＡａ＝（ａ０，ａ１，ａ２，ａ３）を１つの出力の組，参照データＡｂ＝（ａ１，ａ２，ａ３，ａ４）を１つの出力の組とするレジスタである。演算ブロック６は前記参照データＡａと符号化データＢ０〜Ｂ２を入力とし，演算ブロック２０２は演算ブロック６と全く同一の構成を採り，前記参照データＡｂと符号化データＢ０〜Ｂ２を入力とする。累積加算アレイ２０３は累積加算アレイ１０と全く同一の構成を採るものであり，演算ブロック２０２の３つの出力を入力として累積加算する。
【００９６】
本実施の形態３の構成では，参照レジスタ２０１のａ０〜ａ３と，符号化ブロックレジスタ２と演算ブロック６と累積加算アレイ１０から構成される部分は上述した実施の形態１の構成（図１参照）と同じものであり，参照レジスタ２０１のａ１〜ａ４と，符号化ブロックレジスタ２と演算ブロック２０２と累積加算アレイ２０３から構成される部分もまた上述した実施の形態１の構成（図１参照）と同じものである。即ち，参照レジスタ２０１を図１の参照レジスタ１に対して１画素増設したことにより，参照レジスタ２０１のａ１〜ａ３と符号化ブロックレジスタ２を共通化しながら２つの動きベクトル検出装置を合体した構造となっているものである。
【００９７】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【００９８】
図１５は動作の詳細を示すタイミングチャート，図１６〜図１８は参照レジスタ２０１が保持するデータが参照画像の検索範囲内に占める位置関係を示す領域関係図である。
【００９９】
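Embodiment 3 starts motion vector detection for coded block T0 at the cycle of timing J0 in Fig. 15. During timings J0 to J7, the series of pipeline operations performed by operation block 6 and cumulative adder array 10 on reference data Aa and coded data B0 to B2 is identical to the pipeline operation of cycles D0 to D7 in Fig. 4 of Embodiment 1, and the positions of reference data Aa within the search area are the same as small blocks D0 to D7 in Fig. 5. In the same period, the series of pipeline operations performed by operation block 202 and cumulative adder array 203 on reference data Ab and coded data B0 to B2 is identical to the pipeline operation of cycles E0 to E7 in Fig. 4, and the positions of reference data Ab are the same as small blocks E0 to E7 in Fig. 6. In other words, the sum-of-absolute-differences computation that took two pipeline passes, or 16 cycles, in Embodiment 1 completes in the 8 cycles J0 to J7 of Fig. 15 because two pipelines run in parallel. During this period the reference data held in reference register 201 covers the band five pixels high shown as small blocks J0 to J7 in Fig. 16: starting from small block J0, the five vertical pixels at the left edge, the register shifts its contents one position to the right each cycle and thereby drives both pipelines at once. Once small block J7 of Fig. 16 has been processed, the reference pixels in the top two rows of the search area are no longer needed for coded block T0. In the following cycle, timing P0, reference register 201 therefore holds (a0, a1, a2, a3, a4) = (r_2,0, r_3,0, r_4,0, r_5,0, r_6,0), and a new two-pipeline pass begins. As small blocks P0 to P7 in Fig. 17 show, this pass processes, from left to right, a band five pixels high located two pixels below the top edge. Processing continues in this way, shifting down two rows after each pass. The fourth pass reaches the state shown in Fig. 18, and completing small block S7 finishes the error computation for every prediction block candidate of coded block T0. The motion vector of coded block T0 is determined by finding the minimum among the sums of absolute differences output by cumulative adder arrays 10 and 203 during this period.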
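The sharing of one reference register between Q pipelines can be sketched as below. This is an illustrative model under stated assumptions, not the patented circuit: `col_sad`, `search_colwise_parallel`, and the returned cycle counter are names introduced here. The register holds M+Q-1 vertical pixels, each of the Q pipelines reads an M-pixel window offset by one pixel, and all pipelines share the same coded columns B0 to B(N-1).

```python
def col_sad(ref_col, coded_col):
    """Sum of absolute differences between two equally long pixel columns."""
    return sum(abs(r - t) for r, t in zip(ref_col, coded_col))


def search_colwise_parallel(ref, block, q=2):
    """Column-based pipeline with q parallel series (hypothetical helper).

    Returns ({(top, left): sad}, cycles) where cycles counts register shifts.
    """
    m, n = len(block), len(block[0])
    cols = [[block[y][c] for y in range(m)] for c in range(n)]   # B0..B(n-1)
    v_positions = len(ref) - m + 1
    results, cycles = {}, 0
    for top in range(0, v_positions, q):            # shift down q rows per pass
        lanes = min(q, v_positions - top)
        stage = [[0] * n for _ in range(lanes)]     # one adder array per series
        for x in range(len(ref[0])):                # band moves right one pixel per cycle
            reg = [ref[top + k][x] for k in range(m + lanes - 1)]  # shared register
            cycles += 1
            for s in range(lanes):
                window = reg[s:s + m]               # Aa for s=0, Ab for s=1
                pe = [col_sad(window, cols[c]) for c in range(n)]
                stage[s] = [pe[0]] + [stage[s][c - 1] + pe[c] for c in range(1, n)]
                left = x - (n - 1)
                if left >= 0:
                    results[(top + s, left)] = stage[s][-1]
    return results, cycles
```

For the 11 x 8 search area and 4 x 3 coded block of this embodiment, `q=1` takes 64 cycles and `q=2` takes 32, while the set of computed sums is identical.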
【０１００】
図１８の小ブロックＳ７の演算を完了すると，符号化ブロックレジスタ２には新たに符号化ブロックＴ１の１２個の画素が格納され，符号化ブロックＴ１に対応する検索領域の左上端の５画素が参照レジスタ２０１に格納され，符号化ブロックＴ０の場合と全く同様に符号化ブロックＴ１のパイプライン動作が開始されるのである。すなわち，実施の形態１，実施の形態２と同じく符号化ブロックＴ０の処理からＴ１，Ｔ２へと画像上の位置をとばすことなく順番に処理を完了することが出来ている。
【０１０１】
【表７】

【０１０２】
【表８】

実施の形態３の回路規模と処理速度について表７と表８にまとめる。表７は実施の形態３の回路規模を示すものであり，表８は実施の形態３の処理速度を示すものである。表７，表８の算出条件は表１，表２の従来例１の場合，表５，表６の実施の形態１の場合と同じである。まず，表５，表６に示した本発明実施の形態１の図１の場合と，表７，表８の本発明実施の形態３の図１４の場合を比較すると，実施の形態３では多重化系列数Ｑ＝２に並列処理することにより演算速度はちょうど２倍に改善できているが，一方レジスタは僅かに画素値レジスタ個数Ｓが１画素分とデータレジスタ個数Ｕが２データ分増加するのみであり，極めて効率的に並列処理化が実現できていることがわかる。また，表１，表２の従来例１の図３５の場合と比較すると，いずれも多重化系列数Ｑ＝２であり演算速度は互いに遜色ない程度となっているが，本実施の形態３ではフリップフロップ数がわずか３分の１で構成できており，著しい効果があることがわかる。現実的な映像のブロックサイズ，検索範囲とすればこの差はより顕著なものとなり，従来例１と比較して４８０ｉの例では処理速度は同程度であるがフリップフロップ数は約３．７％で構成でき，１０８０ｉの例では処理速度を２倍以上に高速化しつつもフリップフロップ数は約２．１％で構成できるという，劇的な効果を示している。これは以下の２点によるものである。第１に，従来方式では並列構成にするためには大量のフリップフロップを要する符号化ブロックレジスタを多重に持つ必要があったが，本発明では多重化系列すべてが同じ符号化ブロックを演算するのであるから，ただ１つの符号化ブロックを記憶するレジスタで構成できていること。第２に，演算データレジスタは本発明でも従来技術でも系列数に比例して増設する必要があるが，従来は検索範囲に比例する演算データレジスタが必要であったため，実用的な検索範囲では並列化と検索範囲の相乗効果で膨大な規模が要求されたことに対して，本発明では演算データレジスタが検索範囲の大きさにかかわらず常に１データ分のみで構成できるようになったことによるものである。
【０１０３】
以上のように本発明の実施の形態３では，上述した第３，第４の本発明の参照画像のＭ＋Ｑ−１個の画素を一時記憶し，連続するＭ個の画素を１組の参照データとしてＱ組の参照データを出力する参照レジスタ２０１が前記参照データを参照画面上水平方向に順次ずらしながら取り出して格納する制御機能をもち，１つの符号化ブロックと複数の予測ブロック候補との誤差量を同時に複数のパイプライン演算で求める構成としたことにより，符号化ブロックレジスタと参照レジスタを共用化して多重化並列処理回路を構成することができ，また，系列数に比例して増加する演算データ遅延器が常に最小の１データ分で構成できるから，極めて小さな回路規模で並列演算できる動きベクトル検出装置を構成できるものである。また，実施の形態１，実施の形態２と同じくすべての演算ユニットのロスサイクルが少なく，Ｑ系列が全て全く同時に動作するから符号化ブロック１つ当たりの演算速度は正確にＱ倍に高速化できる。さらに，本発明では符号化ブロックを１つずつ処理していくものであり，表７，表８では多重化系列数Ｑ＝２の場合，８の場合，３２の場合を例示しているが，本発明では多重化系列数の設定は符号化ブロックの大きさや検索範囲の大きさなどには一切の影響を受けず，系列数Ｑは１を含んで任意に設定することが出来る。そのため，回路規模の要求と処理速度の要求から適切な多重化系列数を任意に選択し，用途，条件に適合した動きベクトル検出装置を構成することが出来るものである。
【０１０４】
（実施の形態４）
はじめに，本実施の形態の動きベクトル検出装置の構成および動作について説明する。
【０１０５】
図１９は本実施の形態の動きベクトル検出装置を示すブロック図である。
【０１０６】
実施の形態３では第３，第４の本発明の参照データが縦１列のＭ＋Ｑ−１個のデータである場合に相当するものであったが，実施の形態４は，第３，第４の本発明の参照データが横１行のＭ＋Ｑ−１個のデータである場合に相当するものである。
【０１０７】
図１９で参照レジスタ３０１は図８の参照レジスタ１０１を１画素拡張し，連続する３画素を１つの組として参照データＡａと参照データＡｂを出力するレジスタであり，演算ブロック３０２は演算ブロック１０７と，累積加算アレイ３０３は累積加算アレイ１１２と同じ構成のものである。図１９では図８の実施の形態２と同一部分には同一番号を付している。
【０１０８】
実施の形態４の動作は，実施の形態１に対して実施の形態３が並列処理を実現したことと全く同様に実施の形態２に対して並列処理を実現するものであるから，詳細な説明を省略する。
【０１０９】
実施の形態４と実施の形態３は全く同じ効果を持ちつつ，参照画像に対する読み込み方法が異なるものである。従って，動きベクトル検出装置に接続して使用するメモリなど参照画像記憶媒体の特性，動作条件に応じて，実施の形態４あるいは実施の形態３からより適した形態を選択することが出来るものである。
【０１１０】
（実施の形態５）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【０１１１】
図２０は本実施の形態の動きベクトル検出装置を示すブロック図である。
【０１１２】
実施の形態５は，実施の形態１を基本として，第５，第６の本発明の技術を適用したものである。従って，この実施の形態５では予測ブロック候補と符号化ブロックとの誤差量として，（数１）および（数２），（数３）に示した差分絶対値総和を採用し，符号化ブロックの大きさは水平Ｎ＝３，垂直Ｍ＝４，検索領域は水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【０１１３】
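The configuration of Embodiment 5 is described with reference to Fig. 20; parts identical to those of Embodiment 1 in Fig. 1 carry the same reference numbers and are not described again. With reference register 1 serving as the first reference register, reference register 401 is a newly added second reference register whose output forms one set of reference data C. Coded block register 402 consists of coded small-block registers 403 to 405. Unlike Embodiment 1, the timing at which the pixel data of a new coded block is loaded can be controlled independently for each of the three coded small-block registers 403 to 405. Operation block 406 receives reference data C in addition to coded data B0, B1, B2 and reference data A. Either reference data A or reference data C is selected by switch 407, a reference data selector switch, and fed to operation unit 7; likewise switch 408 selects the input to operation unit 8, and switch 409 the input to operation unit 9. Switches 407, 408, and 409 are controlled independently. Mode control unit 410 controls the load timing of coded small-block registers 403 to 405 and the switching of switches 407 to 409.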
【０１１４】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【０１１５】
図２１と図２２は本実施の形態５の動作を示すタイミングチャートである。図中の記号Ｄ，Ｅ，Ｆは図５，図６，図７の記号に対応させている。
【０１１６】
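Embodiment 5 starts with switches 407 to 409 all selecting reference data A. At the cycle of timing D0 in Fig. 21, coded block register 402 has finished storing coded block T0 and outputs it column by column as three sets B0, B1, B2, while reference register 1 has finished storing small block D0, the four vertical pixels at the upper-left corner of the search range in Fig. 5, and outputs it as reference data A. Because switches 407 to 409 all select reference data A, operation unit 7 first computes the partial sum of absolute differences AE_-4,-3,0 at timing D0. Reference data A then shifts from left to right through the band shown in Fig. 5, and operation units 7 to 9 together with cumulative adder array 10 form a pipeline that computes the sums of absolute differences AE_i,j one after another, just as in Embodiment 1. Meanwhile, at cycle E0 of Fig. 21, reference register 401 stores small block E0, the four vertical pixels at the left end of the band shown in Fig. 6, and starts outputting it as one set of reference data C. Reference register 401 then outputs reference data while shifting from left to right through the band of Fig. 6. Note that in Fig. 21, D6 to D7 of reference register 1 and E0 to E1 of reference register 401 overlap in time: for two cycles, reference data A and reference data C are output simultaneously.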
【０１１７】
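Consider operation unit 7. Within the band of Fig. 5 it computes valid partial sums AE_-4,-3,0 to AE_-4,2,0 for small blocks D0 to D5. Small blocks D6 and D7, however, would yield AE_-4,3,0 and AE_-4,4,0, which correspond to vectors (3, −4) and (4, −4); these lie outside the search range and need not be computed. When mode control unit 410 detects that operation unit 7 has finished its valid computations at cycle D5, it sets switch 407 to select reference data C in the next cycle. In that cycle, E0, reference data C carries small block E0 of Fig. 6, so operation unit 7 computes the sum of absolute differences between C = (r_1,0, r_2,0, r_3,0, r_4,0) and B0, namely AE_-3,-3,0. Mode control unit 410 keeps switches 408 and 409 on reference data A during this cycle. Operation unit 7 therefore processes small block E0 of Fig. 6 while operation units 8 and 9 simultaneously process small block D6 of Fig. 5. In the following cycle, D7, mode control unit 410 also switches switch 408 to reference data C, so operation units 7 and 8 process small block E1 of Fig. 6 while operation unit 9 processes small block D7 of Fig. 5. As a result, the two loss cycles that each of the three operation units incurred at every pipeline changeover in Embodiment 1 (Fig. 4) do not occur in Embodiment 5 (Fig. 21). Once the computation for a coded block has started, every operation unit performs valid work in every cycle, and the pipeline runs without gaps.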
【０１１８】
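As described above, Embodiment 5 follows the fifth invention. It provides a first mode, in which reference register 1 successively updates its data from small block D0 to small block D7 of Fig. 5 and supplies reference data A to operation block 406, and a second mode, in which reference register 401 successively updates its data from small block E0 to small block E7 of Fig. 6 and supplies reference data C. It also provides mode control unit 410, which at a mode transition switches the reference data from A to C starting with operation unit 7, the first unit to finish its valid computations. This keeps loss cycles out of the pipeline and lets computation continue at maximum efficiency. In the example of Fig. 21, one pipeline pass that took 8 cycles in Fig. 4 is shortened to 6 cycles, a further speed-up.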
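The cycle counts quoted here can be checked with a small calculation. The sketch below is an assumption-laden model, not a formula given in the source: it takes a pass over a band to cost 2K + N − 1 cycles when the pipeline drains between passes and 2K cycles when the changeover is seamless, and it ignores the initial pipeline fill. The function name `cycles_per_block` and its parameters are hypothetical.

```python
def cycles_per_block(k, l, n, q=1, seamless=False):
    """Steady-state cycles to evaluate one coded block (illustrative model).

    k, l     -- horizontal range is -k..k-1, vertical range is -l..l-1
    n        -- number of operation units per pipeline (coded columns)
    q        -- number of parallel series sharing the reference register
    seamless -- True when per-unit switching removes the loss cycles
    """
    h_positions = 2 * k
    v_positions = 2 * l
    passes = -(-v_positions // q)          # ceiling division
    per_pass = h_positions if seamless else h_positions + n - 1
    return passes * per_pass


base = cycles_per_block(3, 4, 3)                  # 8 passes of 8 cycles
fast = cycles_per_block(3, 4, 3, seamless=True)   # 8 passes of 6 cycles
speedup = base / fast
```

For K = 3, L = 4, N = 3 this gives 64 and 48 cycles, a ratio of about 1.33. With hypothetical values N = 16 and K = 64, which are not taken from the tables, the same model gives 143/128, or about 1.12, showing how the gain shrinks as the search range grows relative to the pipeline depth.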
【０１１９】
実施の形態５は上述したように符号化ブロック，例えばＴ０の演算を開始すると，それ以降ロスサイクルなく，最大効率で演算続行できるものである。次に，ある符号化ブロックの演算が完了し，次の符号化ブロックの演算を開始する場合の動作について説明する。
【０１２０】
図２２のタイミングチャートで，Ｆ０からＦ７のサイクルが符号化ブロックＴ０の最後の演算部分であり，図２２の開始時点では符号化ブロックレジスタ４０２には符号化ブロックＴ０が格納されている。参照レジスタ４０１が図７の小ブロックＦ６，Ｆ７すなわち符号化ブロックＴ０の検索範囲の最後の２サイクル分の参照データをＣに出力している間に，参照レジスタ１は符号化ブロックＴ１に関する検索範囲の左上端である図５の小ブロックＤ０，Ｄ１を参照データＡに出力開始する。モード制御部４１０は図２２のサイクルＦ５で演算ユニット７の有効な演算が終了したと判断すると，それに続くサイクルでスイッチ４０７を制御して参照データＡを選択するように切り替える。これは図２１で説明した第５の本発明に従う動作である。この切り替えサイクルＤ０で，モード制御部４１０はスイッチ４０７の切り替えに同期して符号化小ブロックレジスタ４０３を制御して符号化ブロックＴ１の左端小ブロックを読み込み格納させ，Ｂ０に出力させる。すなわち，Ｂ０＝（ｔ_4,6，ｔ_5,6，ｔ_6,6，ｔ_7,6）とする。一方符号化小ブロックレジスタ４０４，４０５には格納データを保持させるから，図２２のサイクルＤ０ではＢ０が符号化ブロックＴ１，Ｂ１とＢ２が符号化ブロックＴ０となっている。その結果，演算ユニット７は符号化ブロックＴ１の符号化データＢ０と符号化ブロックＴ１検索範囲の左上端の参照データである小ブロックＤ０が出力Ａから供給されるから，その結果符号化ブロックＴ１に対するＡＥ_-4,-3,0を算出し，一方の演算ユニット８，９は符号化ブロックＴ０のＡＥ_3,2,1とＡＥ_3,1,2とを算出している。引き続き，サイクルＤ１では演算ユニット７，８が符号化ブロックＴ１のために図５の小ブロックＤ１を，演算ユニット９が符号化ブロックＴ０のために図７の小ブロックＦ７をそれぞれ演算し，その結果，符号化ブロックＴ０の差分絶対値総和演算を全て終了する。さらにＤ２のサイクルで符号化ブロックレジスタ４０２は全て符号化ブロックＴ１に切り替わり，移行を完了する。このように，符号化ブロックＴ０からＴ１への移行においても，全ての演算ユニットに常に有効な演算を連続させ，一切のロスサイクルを生じず，パイプライン演算をすき間無く実行することができるものである。
【０１２１】
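As described above, Embodiment 5 follows the sixth invention. When mode control unit 410 has the data of a new coded block stored in coded block register 402, while reference register 1 successively updates its data from small block D0 to small block D7 of Fig. 5, it stores that data one set at a time, in order, in synchronization with the switching of reference data switches 407 to 409. Loss cycles are therefore avoided even at the transition between coded blocks, computation continues at maximum efficiency, and a further speed-up is obtained, particularly when there are many coded blocks.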
【０１２２】
なお，本実施の形態５は実施の形態１のように符号化ブロックを列を単位に分解する場合であって，かつ多重化処理しない場合に対して第５，第６の本発明の技術を適用させたが，実施の形態２のように符号化ブロックを行を単位に分解し多重化処理しない場合，実施の形態３のように列を単位に分解し，多重化構成とする場合，実施の形態４のように行を単位に分解し多重化構成とする場合，いずれに対しても第５，第６の本発明の技術を，実施の形態５と全く同様に適用することができる。また，いずれの場合にもその効果は，符号化ブロックの処理中も符号化ブロックの移行時も一切のロスサイクルを発生せず，パイプライン演算に隙間が生じず最大効率で演算実行でき，高速演算が実現できることである。
【０１２３】
【表９】

【０１２４】
【表１０】

以上の効果を具体的に数値で表９と表１０にまとめる。表９は実施の形態５の回路規模を示すものであり，表１０は実施の形態５の処理速度を示すものである。表９，表１０の算出条件は表１，表２の従来例１の場合，表５，表６の実施の形態１の場合，表７，表８の実施の形態３の場合と同じである。表１，表２の従来例１の場合と比較すれば回路規模，処理速度とも劇的な改善となっているが，その理由に関しては既に実施の形態１および実施の形態３で述べた通りであるので説明を省略し，本発明の実施の形態１と実施の形態５の比較で第５，第６の本発明の技術の効果を確認する。
【０１２５】
まず，表５，表６に示した本発明実施の形態１の図１の場合と，表９，表１０の本発明実施の形態５の図２０の場合を比較して，実施の形態５では実施の形態１に対して回路規模で約２２％増加するが処理速度では約１．３３倍高速化が実現できていることがわかる。次に実施の形態５でＱ＝２の例とは図１４の実施の形態３に対して第５，第６の本発明の技術を適応した場合を意味する。実施の形態５のＱ＝２の例は表７，表８の実施の形態３図１４の例に対してやはり回路規模で約２３％増加するが処理速度は約１．３３倍高速化できている。４８０ｉの例では実施の形態５の場合は表７，表８の実施の形態３の場合に対して回路規模で約５％増加するが処理速度では約１．１２倍高速化でき，１０８０ｉの例では実施の形態５の場合は表７，表８の実施の形態３の場合に対して回路規模で約５％増加するが処理速度では約１．０６倍高速化できている。
【０１２６】
これら実施の形態５の場合，すなわち第５，第６の本発明の技術を用いた場合は，全てのサイクルで有効な差分絶対値総和が隙間無く，重複もなく求まり，しかも並列演算との組み合わせでも全く無駄が発生しないので，パイプライン演算の原理的な最高速度を実現しているものである。いずれも若干の回路増加を伴うが，特に高速動作を要求される用途では効果大なるものである。
【０１２７】
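Finally, Embodiment 5, which illustrates the fifth and sixth inventions, superficially resembles Conventional Example 2 in its configuration, so the differences are explained below.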
【０１２８】
まず，第５の本発明に関する見かけ上の類似点について説明する。従来例２では図４２でレジスタ８０１〜８０４とレジスタ８３８〜８４２の２組の参照レジスタをもち，それをセレクタ８４３〜８４６で選択するという構成であるが，本発明では図２０で参照レジスタ１と参照レジスタ４０１の２組の参照レジスタをもち，それをスイッチ４０７〜４０９で選択するという構成である。しかしながら，従来例２ではセレクタ８４３〜８４６は全ての演算ユニットに共通のセレクタであり，選択した参照データは全ての演算ユニットに同じものが供給されることに対して，本発明では各演算ユニットに固有のスイッチであって，各演算ユニットに供給される参照データは個別に選択されるという構成上の実質相違がある。
【０１２９】
これを技術思想の点から相違を詳しく説明する。従来例２は２つの符号化ブロックＴ０とＴ１を時分割演算するという目的のために，有効期間８サイクルの参照データを２回ずつ繰り返す必要があり，その繰り返し実現のために２組の参照レジスタとセレクタを設けているのである。一方本発明では，１つの符号化ブロックＴ０の演算であって時分割とは無関係である。本発明では，パイプライン演算の移行時に演算ユニットに無駄なサイクルが発生することを防ぎ，最大速度を実現するという目的のために，移行時にも常に全ての演算ユニットに独立に有効な参照データを供給する必要があるから，２組の参照レジスタと演算ユニット毎に個別のスイッチを設けたものである。これは，全く異質の技術思想であって，本発明の技術思想を従来例に適応しようとしても無意味であるし，また従来例の技術思想を本発明に応用しようとしても無意味なものである。
【０１３０】
次に，第６の本発明に関する見かけ上の類似点について説明する。従来例２では図４４のＹ４のサイクルでＰＥ８４７のレジスタ８１９〜８２２に符号化ブロックＴ２を格納するが，他のＰＥ８４８，ＰＥ８４９のレジスタ８１９〜８２２には符号化ブロックＴ０を保持するから，符号化ブロック移行時にＴ０とＴ２の演算を同時に行っている。本発明では図２２のＦ６あるいはＤ０で符号化小ブロックレジスタ４０３に符号化ブロックＴ１を格納するが，他の符号化小ブロックレジスタ４０４，４０５には符号化ブロックＴ０を保持して，符号化ブロックの移行時にＴ０とＴ１の演算を同時に行っている。しかしながら，本発明では参照レジスタを２組設け，それぞれにＴ０用，Ｔ１用の参照データを格納し，演算ユニット毎に切り替えながら，その切り替えと同期して対応する符号化小ブロックレジスタにＴ１を格納させるという制御手段を必然の要素としている。従来例にはその制御要素が無く，実質相違がある。
【０１３１】
この相違を技術思想の点から詳しく説明する。従来例の技術思想ではＰＥ内部に記憶した符号化ブロックのデータについて演算完了すると，参照データが自然に次に処理する符号化ブロックの検索範囲に入ってくるのを待ち，しかる後，新しい符号化ブロックのデータを格納することで符号化ブロックの移行時のロスを最小限度に押さえようとするものである。従来例１では並列処理，従来例２では時分割処理で常に複数の符号化ブロックを処理しているから，参照データはいずれかの符号化ブロックの処理に使用されているから，ある符号化ブロックのＰＥが演算完了したといってもそれに合わせて参照データを入れ替えることはできない。従来例１，従来例２では符号化ブロックＴ０の演算が完了すると，その直後に参照データが次の符号化ブロックＴ２の検索範囲に入り，直ちにＴ２の演算が開始できているが，これは符号化ブロックの大きさと検索範囲を調整した特別な動作例の場合だけであって，一般的な用途では符号化ブロックの移行時に大きなロス時間が発生する。本発明では，ある符号化小ブロックレジスタのデータについて演算完了すると，その符号化小ブロックレジスタに次の符号化ブロックの該当する符号化データを読み込ませるだけではなく，それと同時に参照レジスタにも該当する検索範囲の最初の参照データを読み込ませている。新しい符号化ブロックのデータの読み込みとその演算に組み合わせる参照データの読み込みを同期させる制御手段を持つことが第６の本発明の本質であって，それにより符号化ブロックの大きさや検索範囲を任意に設定しても全くロスサイクルを発生させないものである。第１〜第４の本発明の方法および装置に従来例の技術思想を適応したものであれば，従来例と同じ問題点を生じる。つまり符号化ブロックの大きさと検索範囲が特別な値になっていなければ大きなロス時間を発生する。すなわち第６の本発明は従来例と異質の技術思想によるものであって，容易に類推できるものではない。
【０１３２】
（実施の形態６）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【０１３３】
図２３は本実施の形態の動きベクトル検出装置を示すブロック図である。
【０１３４】
実施の形態６は実施の形態２を基本として第７の本発明の技術を適用したものである。ＭＰＥＧ２規格ではインターレース映像でフレーム構造ピクチャーの場合に符号化ブロックに対してフレームベクトルばかりでなくフィールドベクトルを選択付与することが出来る。フレームベクトルは符号化画像と参照画像をともにフレーム構成の１枚の画像として取り扱い，符号化ブロックのフレーム成分を予測する１つのベクトルで構成されるものである。一方，フィールドベクトルは符号化映像と参照映像をそれぞれ第１フィールドと第２フィールドの２枚の画像に分解して取り扱い，参照画像のいずれかのフィールドから符号化ブロックの第１フィールド成分を予測する第１のフィールドベクトルと，参照画像のいずれかのフィールドから符号化ブロックの第２フィールド成分を予測する第２のフィールドベクトルとの２つのフィールドベクトルから構成される。符号化ブロックサイズは何れの場合も１６画素×１６画素であるから，フィールドベクトルの場合符号化ブロックの第１フィールド成分は横１６画素×縦８画素，第２フィールド成分も横１６画素×縦８画素となる。
【０１３５】
実施の形態６はフレームベクトルと２つのフィールドベクトルに関してそれぞれの誤差量を同時に算出するものである。フレームベクトルの誤差量として，（数４）および（数５），（数６）に示した差分絶対値総和ＡＥ_i,jを採用し，第１フィールドのフィールドベクトルの誤差量として（数５），（数７）の差分絶対値総和ＴＦＡＥ_i,jを，第２フィールドのフィールドベクトルの誤差量として（数５），（数８）の差分絶対値総和ＢＦＡＥ_i,jを採用するものとする。また，（数５），（数７），（数８）より，ＡＥ_i,jとＴＦＡＥ_i,j，ＢＦＡＥ_i,jの間には（数９）の関係が成り立っている。
【０１３６】
【数７】
$$\mathrm{TFAE}_{i,j} = \sum_{k=0}^{N/2-1} \mathrm{AE}_{i,j,2k}$$
【０１３７】
【数８】
$$\mathrm{BFAE}_{i,j} = \sum_{k=0}^{N/2-1} \mathrm{AE}_{i,j,2k+1}$$
【０１３８】
【数９】
$$\mathrm{AE}_{i,j} = \mathrm{TFAE}_{i,j} + \mathrm{BFAE}_{i,j}$$
また，実施の形態６では実施の形態２と同じく符号化ブロックの大きさは水平Ｍ＝３，垂直Ｎ＝４，検索領域は水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【０１３９】
図２３を用いて本実施の形態６の構成を説明する。図２３において，図８の実施の形態２の構成と同一部分には同一番号を付して説明を省略する。累積加算アレイ５０１はフレーム加算アレイ５０２とフィールド加算アレイ５０３からなる構成を有している。フレーム加算アレイ５０２は図８の累積加算アレイ１１２と全く同じ構成であるが，フィールド加算アレイ５０３は演算ユニット１０９の出力ＡＥ_i,j,1を入力とし，２サイクル遅延器５０４を経由して演算ユニット１１１の出力ＡＥ_i,j,3と加算器５０５で加算する構造である。更にフレーム加算アレイ５０２の出力ＡＥ_i,jからフィールド加算アレイ５０３の出力ＢＦＡＥ_i,jを減算しＴＦＡＥ_i,jを出力する減算器５０６が設けられている。
【０１４０】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【０１４１】
図２４は本実施の形態６の動作を示すタイミングチャートである。図２４で記号Ｇ，Ｈは実施の形態２の図１０，図１１，図１２に対応させている。
【０１４２】
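In Embodiment 6, reference register 101, coded block register 102, operation block 107, and frame adder array 502 operate exactly as in Embodiment 2, and the pipeline yields the sum of absolute differences AE_i,j as the error amount. This is the frame-vector sum of absolute differences given by (Equation 6). Field adder array 503, for its part, delays output AE_i,j,1 of operation unit 109 by two cycles and adds it to output AE_i,j,3 of operation unit 111. In Fig. 24, output AE_-4,-3,1 of operation unit 109 at cycle G1 is delayed until cycle G3 and added to output AE_-4,-3,3 of operation unit 111, so field adder array 503 outputs AE_-4,-3,1 + AE_-4,-3,3. By (Equation 8) this is BFAE_-4,-3, the sum of absolute differences for the field vector of the second field. In terms of field correspondence, it is the block-matching error between the second field of the reference block and the second field of the coded block. Field adder array 503 goes on to compute BFAE_i,j, the block-matching sum of absolute differences for the second field of the coded block, one candidate after another. Subtractor 506 subtracts output BFAE_i,j of field adder array 503 from output AE_i,j of frame adder array 502. At cycle G3 it computes AE_-4,-3 − BFAE_-4,-3, which by (Equation 9) is TFAE_-4,-3. In terms of field correspondence, this is the block-matching error between the first field of the reference block and the first field of the coded block. At the next cycle, G4, AE_-3,-3, TFAE_-3,-3, and BFAE_-3,-3 are obtained in the same way. Here the prediction block candidate has moved one line down in the reference picture compared with cycle G3, so the roles of the first and second fields are swapped: TFAE_-3,-3 is the block-matching error between the first field of the coded block and the second field of the prediction block candidate, and BFAE_-3,-3 is that between the second field of the coded block and the first field of the prediction block candidate. The three kinds of error amount are obtained in this way as the prediction block candidate moves.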
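The relationship between the frame adder array, the field adder array, and the subtractor can be illustrated with a functional model of one pass. This is a sketch under assumptions: `pass_frame_and_field` is a hypothetical name, and the hardware delays are represented by indexing into a history of per-cycle unit outputs rather than by explicit delay elements.

```python
def row_sad(ref_row, coded_row):
    """Sum of absolute differences between two equally long pixel rows."""
    return sum(abs(r - t) for r, t in zip(ref_row, coded_row))


def pass_frame_and_field(band, block):
    """One top-to-bottom pass; returns {top: (AE, TFAE, BFAE)}.

    band  -- reference rows the register holds in successive cycles
    block -- coded block rows (an even number of rows is assumed)
    """
    n_rows = len(block)
    history = []                          # operation-unit outputs per cycle
    out = {}
    for y, small in enumerate(band):
        history.append([row_sad(small, block[n]) for n in range(n_rows)])
        top = y - (n_rows - 1)
        if top < 0:
            continue                      # pipeline not yet full
        # Frame adder array: unit n contributes n cycles after the first row.
        ae = sum(history[top + n][n] for n in range(n_rows))
        # Field adder array: odd-numbered units only, two cycles apart.
        bfae = sum(history[top + n][n] for n in range(1, n_rows, 2))
        tfae = ae - bfae                  # subtractor
        out[top] = (ae, tfae, bfae)
    return out
```

The subtraction recovers exactly the sum over the even-numbered rows, so only one field adder array is needed in addition to the frame adder array.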
【０１４３】
即ち，実施の形態６では，一連のパイプライン動作により，フレームベクトルはもちろんのこと，フィールドベクトルの第１フィールド，第２フィールド全ての組み合わせの誤差量を漏れなく求めることになるから，フレームベクトルの誤差量ＡＥ_i,jと，第１フィールド誤差量ＴＦＡＥ_i,jと，第２フィールドの誤差量ＢＦＡＥ_i,jの３種類についての誤差量をそれぞれ最小値を調べることにより，それぞれの最適なベクトル検出を実現することができるものである。
【０１４４】
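As described above, Embodiment 6 follows the seventh invention. Cumulative adder array 501 comprises frame adder array 502, which accumulates N error amounts by delaying each running sum by one cycle and adding it to the error amount of the adjacent set of coded data; field adder array 503, which accumulates the error amounts of the N/2 odd-numbered operation units through two-cycle delays; and subtractor 506, which takes the difference between the results of frame adder array 502 and field adder array 503. The error amount for the frame vector and the two error amounts for the field vectors are thus obtained simultaneously and without omission, so frame vector detection and field vector detection are achieved in one pass. Comparing Fig. 23 with Fig. 8, the only added circuits are field adder array 503 and subtractor 506, so the circuit increase stays at about 30% even for a practical 480i device. Because the three error amounts are computed simultaneously, a comparison of Fig. 24 with Fig. 10 shows that the number of processing cycles per coded block is unchanged, and the advantage of high-speed processing is preserved.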
【０１４５】
なお，本実施の形態６はフィールド加算アレイ５０３を演算ユニットの奇数番目を処理する加算アレイとしたが，偶数番目を処理する加算アレイとしても良い。この場合はフィールド加算アレイ５０３の出力が第１フィールドの差分絶対値総和ＴＦＡＥ_i,jとなり，減算器５０６の出力が第２フィールドの差分絶対値総和ＢＦＡＥ_i,jとなる。
【０１４６】
また，本実施の形態６は実施の形態２に対して第７の本発明の技術を適用したものとして構成したが，実施の形態４に対して第７の本発明の技術を適応しても構成することができ，また，更に第５，第６の本発明の技術と組み合わせて構成することも可能である。それらの場合，パイプライン演算に一切の隙間が生じない最高速度の実現，並列処理による更なる高速化など個々の効果を損うことなく，フレーム，フィールド両ベクトルの同時検出を実現することが出来るものである。
【０１４７】
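In this embodiment TFAE_i,j and BFAE_i,j are defined by (Equation 7) and (Equation 8). In this definition the subscript j is based on the line number of the frame picture, which differs from the MPEG2 definition described above. When the device is used for MPEG2 motion vector detection, the field vectors are obtained from TFAE_i,j and BFAE_i,j by the following conversion. For TFAE_i,j: if j is even, the first field of the coded block refers to the first field of the reference picture and the field vector is (j/2, i); if j is odd, it refers to the second field of the reference picture and the field vector is ((j−1)/2, i). For BFAE_i,j: if j is even, the second field of the coded block refers to the second field of the reference picture and the field vector is (j/2, i); if j is odd, it refers to the first field of the reference picture and the field vector is ((j+1)/2, i).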
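The conversion rule stated above can be written down directly. The sketch follows the text literally; the function name `to_mpeg2_field_vector` and the `'TF'`/`'BF'` tags are assumptions introduced for illustration, and the returned reference field is numbered 1 or 2.

```python
def to_mpeg2_field_vector(kind, i, j):
    """Map a frame-line-based index pair to (reference field, field vector).

    kind -- 'TF' for TFAE_i,j (first field of the coded block),
            'BF' for BFAE_i,j (second field of the coded block)
    i, j -- subscripts as used in the text; j is the frame-line-based one
    """
    if kind not in ("TF", "BF"):
        raise ValueError("kind must be 'TF' or 'BF'")
    if j % 2 == 0:
        # Same-parity case: each coded field refers to the same reference field.
        ref_field = 1 if kind == "TF" else 2
        return ref_field, (j // 2, i)
    if kind == "TF":
        return 2, ((j - 1) // 2, i)       # first field refers to second field
    return 1, ((j + 1) // 2, i)           # second field refers to first field
```

Because j − 1 and j + 1 are even whenever j is odd, the integer divisions are exact, including for negative j.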
【０１４８】
（実施の形態７）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【０１４９】
図２５は本実施の形態の動きベクトル検出装置を示すブロック図である。
【０１５０】
実施の形態７は実施の形態１を基本として第８の本発明の技術を適用したものであって，フレームベクトルと２つのフィールドベクトルに関してそれぞれの誤差量を同時に算出するものである。本実施の形態７でもフレームベクトルの誤差量として，（数１）に示した差分絶対値総和ＡＥ_i,jを採用する。また，（数１）の２重総和の演算は（数２）に示す同一列内の総和であるＡＥ_i,j,nを求めた後，その総和を（数３）のように求めても良いことは既に示したとおりである。ここで，（数２）に再度着目し，（数２）の総和に関して偶数成分と奇数成分に分け，偶数成分を（数１０）のＴＦＡＥ_i,j,nとし，奇数成分を（数１１）のＢＦＡＥ_i,j,nと表記したとき，ＴＦＡＥ_i,j,nは符号化ブロックの第ｎ列の第１フィールドに関する差分絶対値和となっており，ＢＦＡＥ_i,j,nは第ｎ列の第２フィールドに関する差分絶対値和になっている。そこで，ＴＦＡＥ_i,j,nとＢＦＡＥ_i,j,nのそれぞれを独立にｎについて総和を求めれば，（数１２），（数１３）に示すようにＴＦＡＥ_i,jとＢＦＡＥ_i,jが求められる。これはそれぞれ第１フィールドに関する差分絶対値総和と第２フィールドに関する差分絶対値総和になっている。また，ＴＦＡＥ_i,jとＢＦＡＥ_i,jを加えたものは（数１）のフレームとしての差分絶対値総和ＡＥ_i,jであるから，（数９）も成り立っている。
【０１５１】
【数１０】
$$\mathrm{TFAE}_{i,j,n} = \sum_{k=0}^{M/2-1} \left| r_{2k+i,\,n+j} - t_{2k,\,n} \right|$$
【０１５２】
【数１１】
$$\mathrm{BFAE}_{i,j,n} = \sum_{k=0}^{M/2-1} \left| r_{2k+1+i,\,n+j} - t_{2k+1,\,n} \right|$$
【０１５３】
【数１２】
$$\mathrm{TFAE}_{i,j} = \sum_{n=0}^{N-1} \mathrm{TFAE}_{i,j,n}$$
【０１５４】
【数１３】
$$\mathrm{BFAE}_{i,j} = \sum_{n=0}^{N-1} \mathrm{BFAE}_{i,j,n}$$
本実施の形態７では第１フィールドのフィールドベクトルの誤差量として（数１０），（数１２）の差分絶対値総和ＴＦＡＥ_i,jを，第２フィールドのフィールドベクトルの誤差量として（数１１），（数１３）の差分絶対値総和ＢＦＡＥ_i,jを採用するものである。また，フレームベクトルの誤差量は（数１）を直接算出するのではなく，（数１２），（数１３）で求められたフィールドベクトルの誤差量を加算して，すなわち（数９）の関係を用いて求めるものである。
【０１５５】
図２５，図２６を用いて本実施の形態７の構成を説明する。図２５，図２６において，図１，図２の実施の形態１の構成と同一部分には同一番号を付して説明を省略する。図２５で演算ブロック６０１は３つの演算ユニット６０２〜６０４で構成され，累積加算アレイ６０５は２つの独立なフィールド加算アレイ６０６，６０７からなり，フィールド加算アレイ６０６の出力とフィールド加算アレイ６０７の出力は加算器６０８で加算される構成である。フィールド加算アレイ６０６，６０７それぞれの内部構造は図１の累積加算アレイ１０と同じ構造である。図２６で演算ユニット６０２の内部構成は差分絶対値演算器１５と１７の出力を加算器６０９で加算し，その結果をＴＦＡＥ_i,j,nとして出力する一方，差分絶対値演算器１６と１８の出力を加算器６１０で加算し，その結果をＢＦＡＥ_i,j,nとして出力するように構成されている。演算ユニット６０３，６０４の内部構造は図２６に示した演算ユニット６０２の内部構造と同一である。
【０１５６】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【０１５７】
図２７は本実施の形態７の動作を示すタイミングチャートである。図２７で記号Ｄ，Ｅは実施の形態１の図４，図５，図６，図７に対応させており，参照レジスタ１が参照画像の画素を順次読み込み格納する動作は図４〜図７の実施の形態１の参照レジスタ１の動作と全く同じである。
【０１５８】
いま，図２７のタイミングＤ０のサイクルではＢ０＝（ｔ_4,3，ｔ_5,3，ｔ_6,3，ｔ_7,3），Ａ＝（ｒ_0,0，ｒ_1,0，ｒ_2,0，ｒ_3,0）であるから演算ユニット６０２の加算器６０９は｜ｒ_0,0−ｔ_4,3｜＋｜ｒ_2,0−ｔ_6,3｜を求めている。これは符号化ブロックＴ０の左上端画素ｔ_4,3の座標を原点として相対表記に書き直せば｜ｒ_-4,-3−ｔ_0,0｜＋｜ｒ_-2,-3−ｔ_2,0｜となり，これは（数１０）のＴＦＡＥ_-4,-3,0を算出したことに他ならない。同様に演算ユニット６０２の加算器６１０は（数１１）のＢＦＡＥ_-4,-3,0を算出している。次いで，タイミングＤ１のサイクルで演算ユニット６０２はＴＦＡＥ_-4,-2,0とＢＦＡＥ_-4,-2,0を算出し，演算ユニット６０３はＴＦＡＥ_-4,-3,1とＢＦＡＥ_-4,-3,1を算出する。以下順次，演算ユニット６０２〜６０４はそれぞれＴＦＡＥ_i,j,nとＢＦＡＥ_i,j,nを算出していくこととなる。
【０１５９】
演算ユニット６０２〜６０４の出力であるＴＦＡＥ_i,j,0〜ＴＦＡＥ_i,j,2はフィールド加算アレイ６０６で加算され，またＢＦＡＥ_i,j,0〜ＢＦＡＥ_i,j,2はフィールド加算アレイ６０７で加算され，それぞれの差分絶対値総和が求められる。フィールド加算アレイ６０６に着目すると，いま，図２７タイミングＤ０のサイクルで演算ユニット６０２で求められたＴＦＡＥ_-4,-3,0はフィールド加算アレイ６０６の遅延器１１で１サイクル遅延され，タイミングＤ１のサイクルで演算ユニット６０３で求められたＴＦＡＥ_-4,-3,1と加算器１２で加算され，ＴＦＡＥ_-4,-3,0＋ＴＦＡＥ_-4,-3,1が演算される。更にその結果は遅延器１３で１サイクル遅延され，タイミングＤ２のサイクルで演算ユニット６０４で求められたＴＦＡＥ_-4,-3,2と加算器１４で加算され，ＴＦＡＥ_-4,-3,0＋ＴＦＡＥ_-4,-3,1＋ＴＦＡＥ_-4,-3,2が求められる。これは（数１２）よりＴＦＡＥ_-4,-3，すなわちベクトル（−３，−４）の予測ブロック候補の第１フィールドと符号化ブロックＴ０の第１フィールドとのブロックマッチングによる差分絶対値総和ＴＦＡＥ_-4,-3が求められたわけである。同じくフィールド加算アレイ６０７では（数１３）により同じ予測ブロック候補の第２フィールドと符号化ブロックＴ０の第２フィールドとのブロックマッチングによる差分絶対値総和ＢＦＡＥ_-4,-3が求められている。また，加算器６０８はＴＦＡＥ_-4,-3＋ＢＦＡＥ_-4,-3を算出するから，（数９）よりＡＥ_-4,-3，つまり同じ予測ブロック候補と符号化ブロックＴ０のフレームとしての差分絶対値総和を求めている。
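The operation described in this paragraph, where each operation unit splits its column sum into even-row and odd-row parts and two field adder arrays accumulate them independently, can be modelled as below. It is a minimal sketch: the names `pe_split` and `pass_split` are assumptions, and lists stand in for the delay elements and adders.

```python
def pe_split(ref_col, coded_col):
    """Return (even-row sum, odd-row sum) of absolute differences for one column."""
    d = [abs(r - t) for r, t in zip(ref_col, coded_col)]
    return sum(d[0::2]), sum(d[1::2])


def pass_split(band, block):
    """One left-to-right pass over a band as high as the block.

    band  -- M reference rows (list of rows), any width
    block -- coded block as M rows of N pixels
    Returns {left: (TFAE, BFAE, AE)}.
    """
    m, n = len(block), len(block[0])
    cols = [[block[y][c] for y in range(m)] for c in range(n)]   # B0..B(n-1)
    tf = [0] * n                      # first field adder array, previous cycle
    bf = [0] * n                      # second field adder array, previous cycle
    out = {}
    for x in range(len(band[0])):     # band moves right one pixel per cycle
        small = [band[y][x] for y in range(m)]        # reference register
        pe = [pe_split(small, cols[c]) for c in range(n)]
        tf = [pe[0][0]] + [tf[c - 1] + pe[c][0] for c in range(1, n)]
        bf = [pe[0][1]] + [bf[c - 1] + pe[c][1] for c in range(1, n)]
        left = x - (n - 1)
        if left >= 0:
            out[left] = (tf[-1], bf[-1], tf[-1] + bf[-1])   # final adder gives AE
    return out
```

The frame error amount comes from a single extra addition, so no separate frame adder array is required in this arrangement.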
【０１６０】
即ち，Ｄ０からＤ７に至る一連のパイプライン演算で，フィールド加算アレイ６０６，６０７と加算器６０８の出力はそれぞれ，符号化ブロック第１フィールドと参照ブロック第１フィールドのブロックマッチング誤差量ＴＦＡＥ_i,jと，符号化ブロック第２フィールドと参照ブロック第２フィールドのブロックマッチング誤差量ＢＦＡＥ_i,jと，フレーム予測の場合の差分絶対値総和ＡＥ_i,jとの３つの誤差量が同時に求められているのである。
【０１６１】
図２７のＥ０から開始される一連のパイプライン演算では参照レジスタ１は図６に示すようにフレーム構造で１画素下にずれた領域を演算するが，これは誤差量演算で参照フィールドの第１，第２の関係が逆転したことを意味する。従って，Ｅ０からＥ７に至る一連のパイプライン演算では，フレーム予測の場合の差分絶対値総和ＡＥ_i,jと，符号化ブロック第１フィールドと参照ブロック第２フィールドのブロックマッチング誤差量ＴＦＡＥ_i,jと，符号化ブロック第２フィールドと参照ブロック第１フィールドのブロックマッチング誤差量ＢＦＡＥ_i,jとの３つの誤差量が同時に求められることとなる。以上のように実施の形態７ではフレームベクトルと，２つのフィールドのフィールドベクトルとの誤差量を漏れなく，同時に演算するものであるから，それぞれの最小値を調べることにより，それぞれの最適なベクトルを検出することが出来るものである。
【０１６２】
以上のように本実施の形態７によれば，第８の本発明に従って，演算ユニット６０２〜６０４は入力された参照データの組と符号化データの組に対してそれぞれ偶数位置の画素に対する誤差量と奇数位置の画素に対する誤差量の２種類の誤差量をもとめ，累積加算アレイ６０５は上記２種類の誤差量を独立に累積加算構造で加算する第１のフィールド加算アレイ６０６と第２のフィールド加算アレイ６０７とを設ける構成としたことにより，フレームベクトルに対する誤差量と２種類フィールドベクトルに対する誤差量とを同時に，漏れなく求めることが出来るから，フレームベクトル検出とフィールドベクトル検出の両方を実現できるものである。また，図２５と図１を比較すれば，増設する回路はフィールド加算アレイ６０７と加算器６０８のみであり，４８０ｉの実用仕様を想定しても３０％程度の回路増加に押さえることが出来る。また，３種類の誤差量を同時に算出するのであるから，図２７と図４を比較しても明らかに，１つの符号化ブロックの処理サイクル数は変わらず，高速処理できるという利点は損なわないものである。
【０１６３】
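Embodiment 7 has been described as applying the eighth invention to Embodiment 1, but the ninth invention may be applied to Embodiment 1 instead. In that case operation units 602 to 604 in Fig. 25 are replaced by the operation unit shown in Fig. 28, which adds adder 611 to output the sum of TFAE_i,j,n and BFAE_i,j,n, that is, AE_i,j,n. In terms of the ninth invention, this yields two kinds of error amount: TFAE_i,j,n as the first error amount, over the pixels at even positions of the input, and AE_i,j,n as the second error amount, over all pixels. Field adder array 607 of Fig. 25 is used, with its structure unchanged, as a frame adder array: it accumulates the outputs AE_i,j,n of the replaced operation units to obtain the frame error amount, the sum of absolute differences AE_i,j. The first-field error amount TFAE_i,j output by field adder array 606 is then subtracted from AE_i,j to obtain the second-field error amount BFAE_i,j. Circuit scale and effect in this case are the same as with the eighth invention described above.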
【０１６４】
また，本実施の形態７は実施の形態１に対して第８の本発明の技術を適用したものとして構成したが，実施の形態３に対して第８の本発明の技術を適応しても構成することができ，また，更に第５，第６の本発明の技術と組み合わせて構成することも可能である。それらの場合，パイプライン演算に一切の隙間が生じない最高速度の実現，並列処理による更なる高速化など個々の効果を損うことなく導入して，フレーム，フィールド両ベクトルの同時検出を実現することが出来るものである。
【０１６５】
実施の形態６と実施の形態７は全く同じ効果を持ちつつ，参照画像に対する読み込み方法が異なるものであるから，動きベクトル検出装置に接続して使用するメモリなどの参照画像記憶媒体の特性，動作条件に応じて，実施の形態６あるいは実施の形態７からより適した形態を選択することが出来るものである。
【０１６６】
また，本実施の形態においてもＭＰＥＧ２のための動きベクトル検出装置として用いる場合はＴＦＡＥ_i,j，ＢＦＡＥ_i,jの添え字ｊに対して，実施の形態６で説明した変換を行うことにより適切なベクトルを得ることができる。
【０１６７】
（実施の形態８）
はじめに，本実施の形態の動きベクトル検出装置の構成について説明する。
【０１６８】
図２９は本実施の形態の動きベクトル検出装置を示すブロック図である。
【０１６９】
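Embodiment 8 applies the tenth invention and corresponds to the case in which the reference registers of the tenth invention store M data items arranged horizontally on the reference picture. Like the preceding embodiments, Embodiment 8 computes the error amounts for the frame vector and the two field vectors simultaneously. The frame-vector error amount is again the sum of absolute differences AE_i,j given in (Equation 4). As in Embodiment 6, (Equation 5) and (Equation 7) are used for the first-field sum of absolute differences TFAE_i,j, (Equation 5) and (Equation 8) for the second-field sum of absolute differences BFAE_i,j, and (Equation 9) for the frame sum of absolute differences AE_i,j.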
【０１７０】
また，実施の形態８は実施の形態６と同じく符号化ブロックの大きさは水平Ｍ＝３，垂直Ｎ＝４，検索領域は水平方向にＫ＝３すなわち−３〜２の範囲，垂直方向にＬ＝４すなわち−４〜３の範囲としている。
【０１７１】
図２９を用いて本実施の形態８の構成を説明する。図２９で参照レジスタ７０１〜７０３はそれぞれ参照画像の行方向に連続する３画素を記憶するレジスタであって，３つのレジスタで列方向に連続する３行分を記憶し，それぞれ参照データＡ０，Ａ１，Ａ２を出力するレジスタである。演算ブロック７０４は演算ユニット７０８，７０９から構成され，符号化ブロックレジスタ１０２の４つの出力の組から第１フィールドのデータであるＢ０，Ｂ２と，参照レジスタ７０１の参照データＡ０とを入力し，その差分絶対値和ＡＥ_i,j,n（但しｎは偶数）を（数５）に基づいて算出する演算ブロックであり，演算ブロック７０５は演算ユニット７１０，７１１から構成され，符号化ブロックレジスタ１０２の第２フィールドのデータであるＢ１，Ｂ３と，参照レジスタ７０２の参照データＡ１とを入力し，その差分絶対値和ＡＥ_i,j,n（但しｎは奇数）を（数５）に基づいて算出する演算ブロックであり，演算ブロック７０６は演算ユニット７１２，７１３から構成され，符号化ブロックレジスタ１０２の第１フィールドのデータであるＢ０，Ｂ２と，参照レジスタ７０２の参照データＡ１とを入力し，その差分絶対値和ＡＥ_i,j,n（但しｎは偶数）を算出する演算ブロックであり，演算ブロック７０７は演算ユニット７１４，７１５から構成され，符号化ブロックレジスタ１０２の第２フィールドのデータであるＢ１，Ｂ３と，参照レジスタ７０３の参照データＡ２とを入力し，その差分絶対値和ＡＥ_i,j,n（但しｎは奇数）を算出する演算ブロックである。累積加算アレイ７１６〜７１９はそれぞれ演算ブロック７０４〜７０７の出力を受け，累積加算することでフィールド誤差量を算出するフィールド加算アレイである。累積加算アレイ７１６と７１７の出力は加算器７２０で加算されフレーム誤差量となり，累積加算アレイ７１８と７１９の出力は加算器７２１で加算されフレーム誤差量が求められる構成である。演算ユニット７０８〜７１５の内部構造は既に実施の形態２の図９に示したものと同一である。その他，実施の形態２の図８の構成と同一部分には同一番号を付している。
【０１７２】
つぎに，本実施の形態の動きベクトル検出装置の動作について説明する。
【０１７３】
図３０は本実施の形態８の動作を示すタイミングチャートである。
【０１７４】
本実施の形態８の動作は，演算ブロック７０４と累積加算アレイ７１６による第１のフィールド評価手段と，演算ブロック７０５と累積加算アレイ７１７による第２のフィールド評価手段と，演算ブロック７０６と累積加算アレイ７１８による第３のフィールド評価手段と，演算ブロック７０７と累積加算アレイ７１９による第４のフィールド評価手段とに分けられ，４つのパイプライン演算から構成されている。
【０１７５】
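First, the first pipeline in Fig. 30. At cycle G0, operation block 704 receives B0 = (t_4,3, t_4,4, t_4,5) and B2 = (t_6,3, t_6,4, t_6,5) as coded data and A0 = (r_0,0, r_0,1, r_0,2) as reference data, and operation unit 708 computes |r_0,0 − t_4,3| + |r_0,1 − t_4,4| + |r_0,2 − t_4,5|. Normalized to the upper-left coordinate of the coded block, this is AE_-4,-3,0 by (Equation 5). The position of reference data A0 within the search range of the reference picture at this time is small block G0 in Fig. 31. At cycle G2, which follows G0 in the first pipeline, reference register 701 stores small block G2 of Fig. 32, that is, A0 = (r_2,0, r_2,1, r_2,2) two rows further down, and outputs it as reference data A0. Operation unit 708 then computes AE_-2,-3,0 and operation unit 709 computes AE_-4,-3,2. Reference data A0 continues to move down the search range two rows at a time, so that operation units 708 and 709 compute AE_i,j,0 and AE_i,j,2. Cumulative adder array 716 delays the output of operation unit 708 by one cycle and adds it to the output of operation unit 709. Thus AE_-4,-3,0, output by operation unit 708 at cycle G0, is added at cycle G2 to output AE_-4,-3,2 of operation unit 709, giving AE_-4,-3,0 + AE_-4,-3,2. By (Equation 7) this is TFAE_-4,-3. TFAE_-2,-3, TFAE_0,-3, and so on follow in turn. In short, the first pipeline takes the first-field data B0, B2 of the coded block and the first-field data A0 of the reference picture as inputs. Its result is the error amount for block matching between the first field of the coded block and the first field of the reference block candidate, so it produces TFAE_i,j (i even) in sequence.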
【０１７６】
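Next, the second pipeline in Fig. 30. At cycle G1, operation block 705 receives B1 = (t_5,3, t_5,4, t_5,5) and B3 = (t_7,3, t_7,4, t_7,5) as coded data and A1 = (r_1,0, r_1,1, r_1,2) as reference data, and operation unit 710 computes AE_-4,-3,1 by (Equation 5). The position of reference data A1 within the search range at this time is small block G1 in Fig. 31. At cycle G3, which follows G1, reference register 702 outputs small block G3 of Fig. 32, that is, A1 = (r_3,0, r_3,1, r_3,2) two rows further down. Operation unit 710 then computes AE_-2,-3,1 and operation unit 711 computes AE_-4,-3,3. Cumulative adder array 717 delays the output of operation unit 710 by one cycle and adds it to the output of operation unit 711, so BFAE_-4,-3, BFAE_-2,-3, and so on are obtained in turn. In short, the second pipeline takes the second-field data B1, B3 of the coded block and the second-field data A1 of the reference picture as inputs and computes the error amount for block matching between the second field of the coded block and the second field of the reference block candidate, producing BFAE_i,j (i even) in sequence.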
【０１７７】
第１のパイプラインで求められた第１フィールドの誤差量と第２のパイプラインで求められた第２フィールドの誤差量は加算器７２０で加算されるが，これは（数９）の演算に相当し，ＡＥ_i,j（ただしｉは偶数）すなわちフレーム誤差量が求められることとなる。
【０１７８】
以上のように，符号化ブロックの第１フィールドと予測ブロック候補の第１フィールドのブロックマッチングによるフィールド誤差量を求める第１のパイプラインと符号化ブロックの第２フィールドと予測ブロック候補の第２フィールドのブロックマッチングによるフィールド誤差量を求める第２のパイプラインと，それぞれのフィールド誤差量を加算してフレーム誤差量を求める加算器を設けたことにより，動きベクトルのＹ成分が偶数である予測ブロック候補に対する３種類の誤差量を漏らすことなく求めることができているものである。
【０１７９】
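The third pipeline in Fig. 30, formed by operation block 706 and cumulative adder array 718, takes coded data B0, B2 (first-field data) and reference data A1 (second-field data) as inputs, so it computes the field error amount TFAE_i,j (i odd) for block matching between the first field of the coded block and the second field of the prediction block candidate. Finally, the fourth pipeline, formed by operation block 707 and cumulative adder array 719, takes coded data B1, B3 (second-field data) and reference data A2 (first-field data) as inputs, so it computes the field error amount BFAE_i,j (i odd) for block matching between the second field of the coded block and the first field of the prediction block candidate. Adder 721 adds these first-field and second-field error amounts to obtain the frame error amount AE_i,j (i odd). The third and fourth pipelines thus yield, without omission, the three kinds of error amount for prediction block candidates whose motion vector has an odd Y component.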
【０１８０】
ここで，第４のパイプラインでは参照データＡ２は図３０の動作開始のＧ２サイクルでＡ２＝（ｒ_2,0，ｒ_2,1，ｒ_2,2）となっており，これは図３１で検索範囲内の小ブロックＧ２に位置する参照データである。図３０で第１のパイプラインのＧ０サイクルと第２，第３のパイプラインのＧ１サイクルと第４のパイプラインのＧ２サイクルはいずれも同時刻であり，動作開始のサイクルである。この動作開始のサイクルで参照レジスタ７０１〜７０３の３つのレジスタが出力するＡ０，Ａ１，Ａ２は図３１の小ブロックＧ０，Ｇ１，Ｇ２に示す領域のデータである。第２サイクルでは図３２に示すようにＡ０，Ａ１，Ａ２は小ブロックＧ２，Ｇ３，Ｇ４と，２行ずつ下方にシフトさせている。以下順次２行ずつシフトさせ，図３３に示すＧ８，Ｇ９，Ｇ１０の状態で最初のパイプラインを終了し，引き続き図３４のＨ０，Ｈ１，Ｈ２に示すように１画素右にずれた領域の上部３つの小ブロックから新たなパイプラインを開始する。以上のように参照レジスタ７０１〜７０３は検索範囲で幅３画素の帯状の領域を上から下へ２行ずつずらしながら参照画素を格納し，下端に達すると１画素右にずれた幅３画素の帯状の領域を同様に上から下へ格納していくことで，第１のパイプラインと第２のパイプラインの組み合わせによる誤差量演算と，第３のパイプラインと第４のパイプラインの組み合わせによる誤差量評価が互いにＹ成分の奇数，偶数の関係となり，重複せず，かつ漏れずに全てのベクトルを評価することができているのである。
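How the four pipelines cover even and odd vertical positions without overlap can be checked with a small behavioural model of one pass. This is a sketch under assumptions: `pass_four_pipelines` is a hypothetical name, the coded block is fixed at four rows as in this embodiment, and the one-cycle delay in each cumulative adder array is represented by the variable `prev`.

```python
def row_sad(ref_row, coded_row):
    """Sum of absolute differences between two equally long pixel rows."""
    return sum(abs(r - t) for r, t in zip(ref_row, coded_row))


def pass_four_pipelines(ref, block, left):
    """One top-to-bottom pass at column offset `left`.

    Returns {top: (TFAE, BFAE, AE)} for every vertical position.
    """
    assert len(block) == 4, "model assumes a four-row coded block"
    width = len(block[0])
    b0, b1, b2, b3 = block
    prev, out = None, {}
    for y in range(0, len(ref) - 2, 2):            # registers step down two rows
        a0, a1, a2 = (ref[y + k][left:left + width] for k in range(3))
        # First units of pipelines 1..4 (their outputs are delayed one cycle).
        cur = (row_sad(a0, b0), row_sad(a1, b1), row_sad(a1, b0), row_sad(a2, b1))
        if prev is not None:
            tf_even = prev[0] + row_sad(a0, b2)    # pipeline 1
            bf_even = prev[1] + row_sad(a1, b3)    # pipeline 2
            tf_odd = prev[2] + row_sad(a1, b2)     # pipeline 3
            bf_odd = prev[3] + row_sad(a2, b3)     # pipeline 4
            out[y - 2] = (tf_even, bf_even, tf_even + bf_even)   # first series
            out[y - 1] = (tf_odd, bf_odd, tf_odd + bf_odd)       # second series
        prev = cur
    return out
```

With an 11-row search area the registers take five steps (G0/G1/G2 through G8/G9/G10) and all eight vertical positions are produced exactly once, compared with eleven steps for the single-pipeline row scan.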
【０１８１】
【表１１】

【０１８２】
【表１２】

実施の形態８の効果を具体的に数値で表１１と表１２にまとめる。表１１は実施の形態８の回路規模を示すものであり，表１２は実施の形態８の処理速度を示すものである。表１１，表１２の４８０ｉの場合，１０８０ｉの場合の算出条件は表１，表２の従来例１の場合などと同じである。表１，表２の従来例１の場合との比較では劇的な改善となっているが，既に実施の形態１および実施の形態２で述べた通りであるので説明を省略し，実施の形態８と本発明の実施の形態１，実施の形態２との比較で第１０の本発明の技術の効果を確認する。
【０１８３】
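First, comparing Embodiment 1 (Fig. 1; Tables 5 and 6) with Embodiment 8 (Fig. 29; Tables 11 and 12), Embodiment 8 increases the circuit scale by about 40% but more than doubles the processing speed. As the timing chart of Fig. 30 shows, Embodiment 8 processes two series in parallel, a first series consisting of the first and second pipelines and a second series consisting of the third and fourth pipelines, which is why it runs about twice as fast. This parallel structure, specific to Embodiment 8, is independent of the parallel processing technique of the third and fourth inventions shown in Embodiments 3 and 4, and the two parallelization techniques can be combined in one device. The number of multiplexed series Q in Tables 11 and 12 refers to multiplexing by that technique; the example of Fig. 29 does not use it, so Q = 1. When it is applied to Embodiment 8 with Q = 2, Table 12 shows that the processing speed exactly doubles. Compared with the 480i and 1080i examples of Embodiment 3 in Tables 7 and 8, which use the same specifications, Tables 11 and 12 benefit from the parallelism inherent in the tenth invention, so Q is set to half the value used in Tables 7 and 8 and nearly the same processing speed is achieved with nearly the same circuit scale.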
【０１８４】
以上のように本実施の形態８によれば，第１０の本発明に従って，符号化ブロックの同一フィールドの３個の画素データを１つの組として，第１フィールドの符号化データを２組と第２フィールドの符号化データ２組とを出力する符号化ブロックレジスタ１０２と，参照画像の同一フィールドの３個の画素を記憶しこれを１つの組の参照データとして出力する，３つの参照レジスタ７０１〜７０３と，参照データ１組と符号化データ２組とを入力としフィールド誤差量を求める４個のフィールド評価手段である演算ブロック７０４〜７０７と，第１フィールドの参照データＡ０と第１フィールドの符号化データＢ０，Ｂ２に対するフィールド誤差量と第２フィールドの参照データＡ１と第２フィールドの符号化データＢ１，Ｂ３に対するフィールド誤差量とを加算する加算器７２０と，第１フィールドの参照データＡ２と第２フィールドの符号化データＢ１，Ｂ３に対するフィールド誤差量と，第２フィールドの参照データＡ１と第１フィールドの符号化データＢ０，Ｂ２に対するフィールド誤差量とを加算する加算器７２１と，を備え，参照レジスタ７０１〜７０３は前記参照データを参照画面上垂直方向に順次ずらしながら取り出して格納する制御機能を具備し，演算ブロック７０４〜７０７は，２個の演算ユニットと，２個の誤差量から累積加算構造で総和をもとめフィールド誤差量として出力する構成としたことにより，フレームマッチングによる誤差量と，符号化ブロックの第１フィールドに対するフィールドマッチングによる誤差量と第２フィールドに対するフィールドマッチングによる誤差量の３種類の誤差量を同時に求めることができ，しかも実用的な映像信号を処理する場合に回路規模を増加させることなく実施の形態１や２の処理速度の高速性を保持することが出来るものである。
【０１８５】
また，本実施の形態においてもＭＰＥＧ２のための動きベクトル検出装置として用いる場合はＴＦＡＥ_i,j，ＢＦＡＥ_i,jの添え字ｊに対して，実施の形態６で説明した変換を行うことにより適切なベクトルを得ることができる。
【０１８６】
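It has already been noted that the multiplexing technique of the third and fourth inventions may be applied to the tenth invention; the techniques of the fifth and sixth inventions may be applied as well. In that case, as in Fig. 20, three reference registers are added to the configuration of Fig. 29, switches are provided at the inputs of operation units 708 to 715 to select between the new outputs C0, C1, C2 and outputs A0, A1, A2 of reference registers 701 to 703, and a mode control unit is provided to control the switching timing of those switches and the load timing of coded small-block registers 103 to 106. Each pipeline in the timing chart of Fig. 30 then runs back to back without gaps, achieving the highest speed at maximum efficiency.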
【０１８７】
また，本実施の形態８では参照レジスタを７０１〜７０３の３個設ける構成としたが，参照レジスタ７０３を省略して参照レジスタ２個とし，参照データＡ２に変えて参照データＡ０を供給する構成とすることも出来る。この場合図３０のタイミングチャートで第４のパイプラインの動作が１サイクル遅れることとなるため，第３のパイプラインで累積加算アレイ７１８の出力を１サイクル遅延させることで，第４のパイプラインと同期させ，３種類の誤差量を求める機能を果たすことが出来る。
【０１８８】
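Furthermore, Embodiment 8 is an example of the tenth invention in which the reference data consists of M data items arranged horizontally on the reference picture. The reference data may instead consist of M data items arranged vertically: the reference registers and coded small-block registers then all store M vertical pixels of the same field, and the reference registers shift through the search area one pixel at a time in the horizontal direction while the pipeline operation is executed.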
【０１８９】
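In Embodiments 1 to 8 of the present invention the block-matching error amount is always the sum of absolute differences, but another measure, such as the sum of squared differences or the variance, may be used instead. Any error amount that can be expressed in the form of (Equation 3) or (Equation 6) allows all of the techniques of the present invention to be applied.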
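The point that the pipeline only needs an error measure that decomposes into per-column (or per-row) partial sums can be illustrated as follows. The sketch assumes that (Equation 3) is the sum over n of the per-column partial errors, as the description of Embodiment 7 indicates; `block_error` and its `cost` parameter are hypothetical names.

```python
def block_error(ref, block, top, left, cost=lambda r, t: abs(r - t)):
    """Error between `block` and the candidate at (top, left) in `ref`.

    The per-pixel `cost` is accumulated per column first and the column
    totals are then summed, mirroring the operation-unit / adder-array split.
    """
    m, n = len(block), len(block[0])
    per_column = [
        sum(cost(ref[top + y][left + c], block[y][c]) for y in range(m))
        for c in range(n)
    ]
    return sum(per_column)


def squared(r, t):
    """Squared-difference cost as an alternative to the absolute difference."""
    return (r - t) ** 2
```

Calling `block_error(ref, block, top, left)` gives the sum of absolute differences, and passing `cost=squared` gives the sum of squared differences with no change to the accumulation structure.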
【０１９０】
以上においては，本実施の形態１〜８について詳細に説明した。
【０１９１】
第１の本発明の動きベクトル検出方式は，符号化ブロックの画素データを記憶し，符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロック出力ステップと，参照画像のＭ個の画素を一時記憶し，これを１つの組の参照データとして出力する参照データ出力ステップであって，前記参照データが参照画像上縦に配置されたＭ個のデータである場合は前記参照データを参照画面上水平方向に順次ずらしながら取り出して記憶するための第１の制御か，前記参照データが参照画像上横に配置されたＭ個のデータである場合は前記参照データを参照画面上垂直方向に順次ずらしながら取り出して記憶するための第２の制御のうち，少なくともいずれかの制御を行う参照データ出力ステップと，１組の参照データと１組の符号化データの誤差量を演算する演算ユニットを１×Ｎ個利用する演算ステップと，前記誤差量を１サイクル遅延させて隣接する符号化データの組の誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する誤差量に加算していく累積加算構造により前記Ｎ個の誤差量の総和を求める累積加算ステップとを備えた動きベクトル検出方法である。
【０１９２】
第１の本発明の動きベクトル検出方式を採用した第２の本発明の動きベクトル検出装置では，符号化ブロックの画素データを記憶し，符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタと，参照画像のＭ個の画素を一時記憶し，これを１つの組の参照データとして出力するレジスタであって，前記参照データが参照画像上縦に配置されたＭ個のデータである場合は前記参照データを参照画面上水平方向に順次ずらしながら取り出して格納する第１の制御機能か，前記参照データが参照画像上横に配置されたＭ個のデータである場合は前記参照データを参照画面上垂直方向に順次ずらしながら取り出して格納する第２の制御機能か，少なくともいずれかの制御機能を具備した参照レジスタと，１組の参照データと１組の符号化データの誤差量を演算するＮ個の演算ユニットと，前記誤差量を１サイクル遅延させて隣接する符号化データの組の誤差量に加算し，以下順次その加算結果を１サイクル遅延させ隣接する誤差量に加算していく累積加算構造により前記Ｎ個の誤差量の総和を求める累積加算アレイとから構成したものである。
【０１９３】
この構成により，画素データを格納するレジスタは符号化ブロック１個分の符号化ブロックレジスタと１行又は１列分の参照レジスタのみで構成されるので，参照レジスタがＮ−１行分またはＮ−１列分削減することができ，また，パイプラインの間をつなぐ演算データ遅延も検索範囲の大きさにかかわらず常に１サイクル分のみで構成できるから，実用的な映像信号を実用的な検索範囲で動きベクトル検出する場合も極めて小さな回路規模で実現できるものである。また，演算速度は１サイクルで１予測ブロック候補の誤差量演算が確定し，ロスサイクルも少なく，短いサイクル数で機能を果たす高速な動きベクトル検出動作を得るものである。また符号化ブロックを１つづつ処理完了させることが出来るから符号化ブロックの処理順番に制約が無く，動きベクトル検出装置に続く符号化装置も容易に構成でき，高速処理が要求されない用途では多重化しない極めて小さい回路を提供して実用になる極めて有効なものである。
【０１９４】
第３の本発明の動きベクトル検出方式は，符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロック出力ステップと，参照画像のＭ＋Ｑ−１個の画素を一時記憶し，連続するＭ個の画素を１組の参照データとしてＱ組の参照データを出力する参照データ出力ステップであって，前記参照データを参照画面上水平方向に順次ずらしながら取り出して記憶するための第１の制御か，前記参照データを参照画面上垂直方向に順次ずらしながら取り出して記憶するための第２の制御か，少なくともいずれかの制御を行う参照データ出力ステップと，１組の前記参照データと１組の前記符号化データの誤差量を演算する演算ユニットをＱ×Ｎ個利用して誤差量を算出する演算ステップと，累積加算構造によりＮ個の前記誤差量の総和をＱ個の累積加算アレイを利用して求める累積加算ステップとを備えた動きベクトル検出方法である。
【０１９５】
第３の本発明の動きベクトル検出方式を採用した第４の本発明の動きベクトル検出装置は，符号化ブロック内で縦１列または横１行に配置されたＭ個の画素データを１つの組としてＮ組の符号化データを出力する符号化ブロックレジスタと，参照画像のＭ＋Ｑ−１個の画素を格納し，連続するＭ個の画素を１組の参照データとしてＱ組の参照データを出力するレジスタであって，前記参照データを参照画面上水平方向に順次ずらしながら取り出して格納する第１の制御機能か，前記参照データを参照画面上垂直方向に順次ずらしながら取り出して格納する第２の制御機能か，少なくともいずれかの制御機能を具備した参照レジスタと，Ｑ×Ｎ個の演算ユニットと，累積加算構造により演算ユニットのＮ個の誤差量の総和を求めるＱ個の累積加算アレイとから構成するものである。
【０１９６】
この構成により，系列数をＱとする並列処理によるＱ倍の高速処理が実現され，しかも，画素データを格納するレジスタはＱ系列分必要となるのではなく，僅かに参照レジスタＱ−１画素分が増加するのみで構成できる。また，パイプラインをつなぐ演算データ遅延器はＱ系列分必要になるが，そのそれぞれが１サイクル遅延ですみ，極めて小さな回路で実現できるから回路増加を最低限に押さえることが出来る。その結果，全体としての回路規模は従来技術に比して劇的に減少させることができ，ことに実用映像信号を実用的な検索範囲で動きベクトル検出する場合に極めて顕著であって，その効果は絶大なるものである。また多重化数Ｑはブロックの大きさ，検索範囲の大きさ，その他何らの条件にも拘束されず，完全に任意に設定できるものであるから，要求される処理速度，使用目的に応じて最小の回路規模の装置を提供することが出来るものである。
【０１９７】
第５，６の本発明の動きベクトル検出装置は，第２の参照レジスタを備え，演算ユニットは第１の参照レジスタから供給される参照データか第２の参照レジスタから供給される参照データかいずれかを選択する参照データ切り替えスイッチを具備し，参照データを供給する参照レジスタを切り替えるモード移行時には，移行前のモードの有効な演算が終了した演算ユニットから順に参照データ切り替えスイッチを切り替え，また，新たな符号化ブロックの誤差量演算を開始する場合には前記参照データ切り替えスイッチの切り替え動作に同期して新たな符号化ブロックのデータを１組ずつ順に符号化ブロックレジスタに記憶させるモード制御手段とを備えたことにより，２つの参照レジスタを演算ユニット毎に使い分け，各演算ユニットに常に有効な参照データを供給することができ，また新たな符号化ブロックの演算開始に当たっても演算を開始できる状態になった演算ユニットから直ちに新たな符号化データとそれに対応した参照データを供給することができるから，全ての演算ユニットに常に有効な演算を実行させることができ，しかも同じ演算を重複することがないから最大効率を実現することとなって，処理速度を向上させることができるものである。
【０１９８】
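In the motion vector detection device of the seventh invention, the cumulative adder array comprises a frame adder array, which accumulates the N error amounts of the operation units, and a field adder array, which accumulates the N/2 even-numbered or odd-numbered error amounts through two-cycle delays. The frame adder array computes the frame error amount of a prediction block candidate, while the field adder array computes the error amount between the first-field or second-field component of the coded block and that prediction block candidate; subtracting this field error amount from the frame error amount gives the remaining field error amount. Because only the field adder array is added, the frame error amount and the two field error amounts for the same prediction block candidate are computed simultaneously with only a small circuit increase.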
【０１９９】
第８，９の本発明の動きベクトル検出装置は，演算ユニットが入力された参照データの組と符号化データの組に対してそれぞれの偶数位置の画素に対する誤差量か，それぞれの奇数位置の画素に対する誤差量か，或いは全ての画素に対する誤差量かの３種類のうちいずれかの２種類の誤差量をもとめ，累積加算アレイは上記２種類の誤差量を独立に累積加算構造で加算する構造としたことにより，同一の予測ブロック候補に対するフレーム誤差量と２種類のフィールド誤差量を同時に算出することができるものである。
【０２００】
第１０の本発明の動きベクトル検出装置は，同一フィールドのＭ個の画素データを１つの組として第１フィールドの符号化データをＮ／２組と第２フィールドの符号化データＮ／２組とを出力する符号化ブロックレジスタと，参照画像の同一フィールドのＭ個の画素を記憶しこれを１つの組の参照データとして出力する少なくとも第１フィールドと第２フィールドの２つの参照レジスタと，参照データ１組と符号化データＮ／２組とからフィールド誤差量を求める４個のフィールド評価手段と，第１フィールド参照データと第１フィールド符号化データによるフィールド誤差量と第２フィールド参照データと第２フィールド符号化データによるフィールド誤差量とを加算する第１の加算器と，第１フィールド参照データと第２フィールド符号化データによるフィールド誤差量と第２フィールドの参照データと第１フィールド符号化データによるフィールド誤差量とを加算する第２の加算器とを備えたことにより，符号化ブロックと予測ブロック候補の全てのフィールド組み合わせのフィールド誤差量を並列構造で求めるから，２種類のフィールド誤差量とその加算によるフレーム誤差量を求めることができるものである。しかも，４つのフィールド評価手段が必要であるがこれらはいずれもフレーム評価の場合の略２分の１程度の規模で構成されるから全体としての回路規模の増加は僅かであるにもかかわらず，処理速度は並列処理となっているから２倍の高速処理が実現でき，実用的な映像信号を実用的な検索範囲で動きベクトル検出する場合に適応して極めて効果の大きなものである。
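[0201]
The program of the present invention is a program for causing a computer to execute the functions of all or some of the means (or devices, elements, etc.) of the motion vector detection device of the present invention described above, and is a program that operates in cooperation with the computer.
[0202]
The program of the present invention is also a program for causing a computer to execute the operations of all or some of the steps (or processes, operations, actions, etc.) of the motion vector detection method of the present invention described above, and is a program that operates in cooperation with the computer.
[0203]
The recording medium of the present invention is a recording medium carrying a program for causing a computer to execute all or some of the functions of all or some of the means (or devices, elements, etc.) of the motion vector detection device of the present invention described above; it is readable by a computer, and the program thus read executes the functions in cooperation with the computer.
[0204]
The recording medium of the present invention is also a recording medium carrying a program for causing a computer to execute all or some of the operations of all or some of the steps (or processes, operations, actions, etc.) of the motion vector detection method of the present invention described above; it is readable by a computer, and the program thus read executes the operations in cooperation with the computer.
[0205]
In the present invention, "some of the means (or devices, elements, etc.)" means one or several of those means, and "some of the steps (or processes, operations, actions, etc.)" means one or several of those steps.
[0206]
In the present invention, "the function of the means (or devices, elements, etc.)" means all or part of the function of the means, and "the operation of the steps (or processes, operations, actions, etc.)" means all or part of the operation of the steps.
[0207]
One form of use of the program of the present invention may be that it is recorded on a computer-readable recording medium and operates in cooperation with a computer.
[0208]
Another form of use of the program of the present invention may be that it is transmitted through a transmission medium, read by a computer, and operates in cooperation with the computer.
[0209]
Recording media include ROM and the like, and transmission media include transmission media such as the Internet, as well as light, radio waves, sound waves, and the like.
[0210]
The computer of the present invention described above is not limited to pure hardware such as a CPU and may include firmware, an OS, and peripheral devices.
[0211]
As described above, the configuration of the present invention may be realized in software or in hardware.
[0212]
[Effects of the invention]
The present invention has the advantage that, for example, the circuit scale can be kept smaller.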
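[Brief description of the drawings]
[FIG. 1] Block diagram of the motion vector detection device in Embodiment 1
[FIG. 2] Block diagram showing the configuration of arithmetic unit 7 of Embodiment 1
[FIG. 3] Region relationship diagram of the image in Embodiment 1
[FIG. 4] Timing chart of the operation of Embodiment 1
[FIG. 5] Reference data region diagram of Embodiment 1
[FIG. 6] Reference data region diagram of Embodiment 1
[FIG. 7] Reference data region diagram of Embodiment 1
[FIG. 8] Block diagram of the motion vector detection device in Embodiment 2
[FIG. 9] Block diagram showing the configuration of arithmetic unit 108 of Embodiment 2
[FIG. 10] Timing chart of the operation of Embodiment 2
[FIG. 11] Reference data region diagram of Embodiment 2
[FIG. 12] Reference data region diagram of Embodiment 2
[FIG. 13] Reference data region diagram of Embodiment 2
[FIG. 14] Block diagram of the motion vector detection device in Embodiment 3
[FIG. 15] Timing chart of the operation of Embodiment 3
[FIG. 16] Reference data region diagram of Embodiment 2
[FIG. 17] Reference data region diagram of Embodiment 2
[FIG. 18] Reference data region diagram of Embodiment 2
[FIG. 19] Block diagram of the motion vector detection device in Embodiment 4
[FIG. 20] Block diagram of the motion vector detection device in Embodiment 5
[FIG. 21] Timing chart of the operation of Embodiment 5
[FIG. 22] Timing chart of the operation of Embodiment 5
[FIG. 23] Block diagram of the motion vector detection device in Embodiment 6
[FIG. 24] Timing chart of the operation of Embodiment 6
[FIG. 25] Block diagram of the motion vector detection device in Embodiment 7
[FIG. 26] Block diagram showing the configuration of arithmetic unit 602 of Embodiment 7
[FIG. 27] Timing chart of the operation of Embodiment 7
[FIG. 28] Block diagram showing an alternative configuration of arithmetic unit 602 of Embodiment 7
[FIG. 29] Block diagram of the motion vector detection device in Embodiment 8
[FIG. 30] Timing chart of the operation of Embodiment 8
[FIG. 31] Reference data region diagram of Embodiment 8
[FIG. 32] Reference data region diagram of Embodiment 8
[FIG. 33] Reference data region diagram of Embodiment 8
[FIG. 34] Reference data region diagram of Embodiment 8
[FIG. 35] Block diagram of the motion vector detection device in Conventional Example 1
[FIG. 36] Block diagram showing the configuration of PE 805 of Conventional Example 1
[FIG. 37] Block diagram showing the configuration of operation data delay unit 811 of Conventional Example 1
[FIG. 38] Region relationship diagram of the image in Conventional Example 1
[FIG. 39] Timing chart of the operation of Conventional Example 1
[FIG. 40] Timing chart of the operation of Conventional Example 1
[FIG. 41] Timing chart of the operation of Conventional Example 1
[FIG. 42] Block diagram of the motion vector detection device in Conventional Example 2
[FIG. 43] Block diagram showing the configuration of PE 847 of Conventional Example 2
[FIG. 44] Timing chart of the operation of Conventional Example 2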
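[Explanation of reference signs]
1, 101, 201, 301, 401, 701, 702, 703 Reference register
2, 102, 402, 601 Coding block register
3, 4, 5, 103, 104, 105, 106, 403, 404, 405 Coding small-block register
6, 107, 202, 302, 406, 704, 705, 706, 707 Arithmetic block
7, 8, 9, 108, 109, 110, 111, 602, 603, 604, 708, 709, 710, 711, 712, 713, 714, 715 Arithmetic unit
10, 112, 203, 303, 501, 605, 716, 717, 718, 719 Cumulative addition array
11, 13, 113 Delay unit
12, 14, 19, 20, 21, 114, 505, 608, 609, 610, 611, 720, 721 Adder
15, 16, 17, 18 Absolute difference calculator
407, 408, 409 Switch
410 Mode control unit
502 Frame addition array
503, 606, 607 Field addition array
504 2-cycle delay unit
506 Subtractor
801, 802, 803, 804, 819, 820, 821, 822, 829, 830, 831, 832, 833, 834, 835, 836, 838, 839, 840, 841, 853, 854, 855, 856 Register
805, 806, 807, 808, 809, 810, 847, 848, 849 Arithmetic unit
811, 812, 813, 814, 850, 851 Operation data delay unit
815, 816, 817, 818, 852 Terminal
823, 824, 825, 826 Absolute difference calculator
827, 828 Adder
837 Timing control unit
842 Pixel data delay unit
843, 844, 845, 846, 857, 858, 859, 860 Selector
[0001]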
BACKGROUND OF THE INVENTION
The present invention relates to a motion vector detection device, a motion vector detection method, a program, and a recording medium that detect a motion vector used in motion compensation coding of a moving image by a block matching method.
[0002]
[Prior art]
In recent years, coding methods that use inter-frame correlation, such as MPEG-2 (ITU-T H.262), have come into use as moving image coding methods. These methods use motion-compensated coding: the image to be coded is divided into coding blocks, which are small rectangular areas; for each coding block a prediction block is obtained using a motion vector detected from the reference image; and the difference between the coding block and the prediction block is compression-coded.
[0003]
Block matching is a typical motion vector detection method. In block matching, prediction block candidates of the same size as the coding block are extracted from the motion vector search region of a coding block, and whether each extracted candidate is suitable for use as the prediction block is evaluated by calculating the error amount between the coding block and the candidate. All prediction block candidates in the motion vector search region are evaluated, the one with the best evaluation result is adopted as the prediction block, and the difference between the position coordinates of the adopted prediction block and the coding block becomes the motion vector. The error amount most often used is the sum of absolute differences AE: as shown in (Expression 1), the absolute value of the difference between the pixel data r_{m+i,n+j} of the prediction block candidate and the pixel data t_{m,n} of the coding block is added over all pixels of the block. In (Expression 1), (j, i) denotes the vector of the prediction block candidate currently being evaluated (the first component of the vector is horizontal and the second component is vertical), and AE_{i,j} is the sum of absolute differences obtained as the evaluation result for that candidate. M and N are the pixel dimensions of the block, −K to K−1 is the horizontal range of the search region based on the position of the encoding vector, and −L to L−1 is the vertical range.
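The exhaustive evaluation described above can be sketched in a few lines. This is a minimal software illustration, not the hardware of the patent; the function names `sad` and `full_search` are hypothetical, images are plain lists of rows, and the search window is assumed to be passed in as a (2L+M−1) by (2K+N−1) array whose top-left pixel corresponds to the vector (−K, −L).

```python
def sad(ref, blk, i, j):
    """Sum of absolute differences between blk and the candidate whose
    top-left pixel sits at row i, column j of the search window ref."""
    return sum(abs(ref[m + i][n + j] - blk[m][n])
               for m in range(len(blk)) for n in range(len(blk[0])))

def full_search(ref, blk, K, L):
    """Evaluate every candidate and return ((j, i), AE) for the best one.

    The vector (j, i) is relative to the block position:
    j in [-K, K-1] (horizontal), i in [-L, L-1] (vertical).
    """
    best = None
    for i in range(-L, L):
        for j in range(-K, K):
            ae = sad(ref, blk, i + L, j + K)  # shift to zero-based window indices
            if best is None or ae < best[1]:
                best = ((j, i), ae)
    return best
```

For the small configuration used later in the text (M = 4, N = 3, K = 3, L = 4) this evaluates 8 × 6 = 48 candidates per coding block.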
[0004]
[Expression 1]
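The equation image is not reproduced in this text. The following is a reconstruction from the definitions given in paragraph [0003], assuming zero-based indices within the block and taking M as the vertical and N as the horizontal pixel count, as in Conventional Example 1:

```latex
AE_{i,j} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1}
           \left| r_{m+i,\,n+j} - t_{m,n} \right|,
\qquad -L \le i \le L-1, \quad -K \le j \le K-1
```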

Block matching is the most reliable motion vector detection method, but implementing it raises many problems, such as a large circuit scale and a large amount of processing, and many configurations have been devised to solve them. Among these, a method that forms a pipeline from a plurality of arithmetic units is known to be particularly efficient (see, for example, Patent Document 1 below).
[0005]
Hereinafter, two configurations of the conventional example 1 and the conventional example 2 will be described in order as the conventional motion vector detecting device described above.
[0006]
(Conventional example 1)
First, the motion vector detection device of Conventional Example 1 will be described. This is an apparatus corresponding to claims 1 and 2 of Patent Document 1. FIG. 35 is a block diagram showing the configuration of Conventional Example 1. The motion vector detection device of FIG. 35 consists of registers 801, 802, 803, and 804 connected in series; arithmetic units (hereinafter abbreviated as PE) 805, 806, and 807 connected in series; operation data delay units 811 and 812 arranged between PEs 805, 806, and 807; PEs 808, 809, and 810, likewise connected in series; and operation data delay units 813 and 814 arranged between PEs 808, 809, and 810. The outputs of registers 801 to 804 are input to each of PEs 805 to 807 and PEs 808 to 810, and coded pixel data is input to each of them from terminal 815.
[0007]
FIG. 36 is a block diagram showing the configuration of PE 805. PE 805 includes registers 819 to 822, which are arranged in series and store coded pixel data in sequence; absolute difference calculators 823 to 826, which receive the stored values of registers 819 to 822 and the stored values of registers 801 to 804; an adder 827 that adds the results of absolute difference calculators 823 to 826; and an adder 828 that adds the result of adder 827 to the result from another PE. PEs 806 to 810 in FIG. 35 all have the same structure as PE 805 in FIG. 36. FIG. 37 is a block diagram showing the configuration of operation data delay unit 811. Operation data delay unit 811 includes registers 829 to 836, connected in series to store eight results from PE 805, and a timing control unit 837 that controls the operation timing. Operation data delay units 811 to 814 in FIG. 35 all have the same structure as operation data delay unit 811 in FIG. 37. In FIGS. 35, 36, and 37, registers 801 to 804 and registers 819 to 822 are multi-bit registers that each store the value of one pixel, and registers 829 to 836 are multi-bit registers that store PE result data. The outputs of registers 801 to 804 have the individual signal names a0 to a3, and A denotes the signal that combines a0 to a3.
[0008]
The operation of Conventional Example 1 will be described below. FIG. 38 is a region relationship diagram showing the positional relationship between each block, pixel, and search region in the encoded image and reference image of this conventional example, and FIGS. 39 and 41 are timing charts showing details of the operation.
[0009]
First, the relationship between PE805 to PE810 and the coding block and the outline of the operation will be described with reference to FIG.
[0010]
In Conventional Example 1, the sum of absolute differences AE of (Expression 1) is adopted as the error amount between the prediction block candidate and the coding block; the block size is N = 3 horizontally and M = 4 vertically, and the vector search region is −3 to 2 (K = 3) horizontally and −4 to 3 (L = 4) vertically. In FIG. 38, on the reference image, coding block T0 has the search range indicated by the thick frame C, coding block T1 has the search range indicated by the thick wavy line D, and coding block T2 has the search range indicated by the alternate long and short dash line E. The search range in FIG. 38 indicates the range of reference pixels used for block matching; for example, the thick frame C is 8 pixels wide (2K + N − 1) and 11 pixels high (2L + M − 1). Conventional Example 1 performs motion vector detection for coding block T0 and coding block T1. Each coding block is divided into three small blocks of four vertical pixels each, and each small block is stored in a PE according to the correspondence shown in the figure. Each PE computes, in one operation, the error amount against a small block of four vertical pixels of the prediction block candidate, and the results of the three PEs are combined to obtain the AE of each prediction block candidate.
[0011]
This method rests on the fact that the sum of absolute differences AE_{i,j} of (Expression 1) is a two-dimensional sum over m and n. It can be split into a two-stage operation: first the sum of absolute differences AE_{i,j,n} of the M pixels in the column direction is obtained for each n, as in (Expression 2), and then the N columns are added over n, as in (Expression 3). The result is the same AE_{i,j} as in (Expression 1). In FIG. 36, the output of adder 827 of each PE corresponds to the sum of absolute differences AE_{i,j,n} of the four pixels in one column given by (Expression 2), the output of adder 828 of PE 807 corresponds to the sum of absolute differences AE_{i,j} for coding block T0, and the output of adder 828 of PE 810 corresponds to the sum of absolute differences AE_{i,j} for coding block T1.
[0012]
[Expression 2]
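The equation image is not reproduced in this text. The following is a reconstruction from the description in paragraph [0011] (the sum of absolute differences of the M pixels in column n), using the same zero-based indexing assumed for (Expression 1):

```latex
AE_{i,j,n} = \sum_{m=0}^{M-1} \left| r_{m+i,\,n+j} - t_{m,n} \right|
```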

[0013]
[Equation 3]
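The equation image is not reproduced in this text. The following is a reconstruction from the description in paragraph [0011] (adding the N column sums over n), again assuming zero-based column indices:

```latex
AE_{i,j} = \sum_{n=0}^{N-1} AE_{i,j,n}
```

Substituting the column sum AE_{i,j,n} into this expression gives back the double sum over m and n of (Expression 1), which is why the two-stage operation produces the same result.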

Details of the operation will be described below with reference to FIGS.
[0014]
First, when the motion vector detection operation for coding block T0 starts at timing F0 in FIG. 39, the four pixel data t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3} in the leftmost column of coding block T0 are input in sequence from the coded pixel data input terminal 815 of FIG. 35 and stored in sequence in registers 819 to 822 of PE 805, so that PE 805 is ready at timing F4. At this point the register outputs of PE 805 are (b0, b1, b2, b3) = (t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3}), and this value is held until the motion vector search for coding block T0 is finished. Meanwhile, as shown in FIG. 39, the reference candidate pixel data are input in sequence from the reference pixel data input terminal 816 in FIG. 35, running vertically from the upper left of the search range of coding block T0: the 11 pixel data r_{0,0}, r_{1,0}, ..., r_{10,0}. They are stored temporarily in registers 801 to 804 while being shifted in sequence. At timing F4 in FIG. 39, the outputs of registers 801 to 804 are (a0, a1, a2, a3) = (r_{0,0}, r_{1,0}, r_{2,0}, r_{3,0}). FIG. 40 shows the positions in the search region of the reference pixels stored in registers 801 to 804; at timing F4, the four pixels of the small block labeled F4 in FIG. 40 are stored. Since PE 805 always calculates the sum of absolute differences between (a0, a1, a2, a3) and (b0, b1, b2, b3), in the cycle of timing F4 it obtains |r_{0,0} − t_{4,3}| + |r_{1,0} − t_{5,3}| + |r_{2,0} − t_{6,3}| + |r_{3,0} − t_{7,3}|, as indicated by G in the figure. Expressing the subscripts relative to the upper-left coordinate (3, 4) of coding block T0 shows that this is AE_{-4,-3,0}. In other words, one of the three column partial sums that make up the sum of absolute differences for the vector (−3, −4) shown in FIG. 38 has been obtained. Thereafter, during the effective period of 8 cycles shown in FIG. 39, (a0, a1, a2, a3) outputs four reference pixel data at a time while moving down the left edge of the search range one pixel per cycle, as indicated by the arrows from F4 to F11 in FIG. 40. PE 805 therefore outputs the sums of absolute differences corresponding to the vectors (−3, −4), (−3, −3), (−3, −2), ..., (−3, 3), that is, the eight values AE_{-4,-3,0}, AE_{-3,-3,0}, AE_{-2,-3,0}, ..., AE_{3,-3,0}.
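The sweep of one PE down one reference column can be sketched as follows. This is a minimal software model under stated assumptions: the shift registers 801 to 804 are modeled with a bounded `deque`, one pixel arrives per cycle, and the function name `pe_column_sweep` is hypothetical.

```python
from collections import deque

def pe_column_sweep(ref_column, blk_column):
    """Feed one reference column, one pixel per cycle, through a shift
    register of len(blk_column) pixels and return the column partial sum
    produced in every cycle in which the register holds valid data."""
    m = len(blk_column)
    regs = deque(maxlen=m)            # models registers 801 to 804
    outputs = []
    for pixel in ref_column:
        regs.append(pixel)            # shift in the next reference pixel
        if len(regs) == m:            # valid once m pixels are loaded
            outputs.append(sum(abs(a - b) for a, b in zip(regs, blk_column)))
    return outputs
```

With an 11-pixel reference column and a 4-pixel block column, the first three cycles only fill the register and the remaining 8 cycles each yield one value, matching the 8-cycle effective period described above.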
[0015]
When registers 801 to 804 reach the state shown in small block F11 in FIG. 40, that is, when input of the reference pixel data r_{10,0} at the lower left of the search range in FIG. 38 is complete, one dummy datum is inserted after timing F11 and then the 11 reference pixel data starting from r_{0,1} are input in succession. Because one dummy datum is inserted here, an invalid period of 4 cycles occurs before the next effective period of 8 cycles starts. Using this invalid period of 4 cycles, t_{4,4}, t_{5,4}, t_{6,4}, t_{7,4} of coding block T0 are input in sequence from the coded pixel data input terminal 815 and stored in registers 819 to 822 of PE 806, so that PE 806 is ready at timing H4. In the cycle of timing H4, registers 801 to 804 are in the state H4 of FIG. 40, so the sum of absolute differences calculated by PE 806 is |r_{0,1} − t_{4,4}| + |r_{1,1} − t_{5,4}| + |r_{2,1} − t_{6,4}| + |r_{3,1} − t_{7,4}|. Expressed relative to the upper-left coordinate of coding block T0, this means that AE_{-4,-3,1} has been obtained. Thereafter, in the 8-cycle effective period starting at timing H4, PE 806 calculates in sequence the eight sums of absolute differences AE_{-4,-3,1}, AE_{-3,-3,1}, ..., AE_{3,-3,1}. Meanwhile, since the registers of PE 805 still hold t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3}, PE 805 calculates the eight sums of absolute differences AE_{-4,-2,0}, AE_{-3,-2,0}, ..., AE_{3,-2,0}. Similarly, in the 8-cycle effective period starting at timing I4, PE 805, PE 806, and PE 807 calculate AE_{i,j,0}, AE_{i,j,1}, and AE_{i,j,2} in sequence.
[0016]
Operation data delay unit 811 stores AE_{-4,-3,0} to AE_{3,-3,0}, the results of the 8-cycle effective period of PE 805, in registers 829 to 836. Timing control unit 837 makes registers 829 to 836 hold these 8 operation data for the following 4 cycles, and when the next effective period starts, the 8 held data are supplied in sequence to PE 806 while the new results of PE 805 are stored. Operation data delay unit 811 therefore functions as a first-in first-out buffer that delays eight valid results by 12 cycles. The sum of absolute differences AE_{-4,-3,0} output from PE 805 at timing F4 is delayed 12 cycles by operation data delay unit 811 and supplied to PE 806 at timing H4. In PE 806, adder 828 adds AE_{-4,-3,1}, calculated by PE 806 at timing H4, to AE_{-4,-3,0} supplied from operation data delay unit 811 and outputs the sum. The output AE_{-4,-3,0} + AE_{-4,-3,1} is delayed 12 cycles by operation data delay unit 812, and at timing I4 adder 828 of PE 807 adds AE_{-4,-3,2} to it; the result of PE 807 is output to output terminal 817. This output is AE_{-4,-3,0} + AE_{-4,-3,1} + AE_{-4,-3,2}, which by (Expression 3) is AE_{-4,-3}, the sum of absolute differences between the prediction block candidate corresponding to the vector (−3, −4) and coding block T0. In the same way, output terminal 817 outputs in sequence the sums of absolute differences AE_{i,j} between all prediction block candidates in the search range and coding block T0, in periods of 8 valid cycles separated by 4 invalid cycles. Comparing these values and adopting the vector with the smallest error amount as the motion vector accomplishes motion vector detection for coding block T0.
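The pipeline described above can be checked with a small cycle-level sketch for a single coding block. It is a software model under stated assumptions, not the circuit itself: each operation data delay unit is modeled as a 12-entry first-in first-out buffer, one reference pixel plus one dummy per column gives a 12-cycle column period, vectors are expressed as zero-based offsets (i, j) into the search window, and the names `simulate_pipeline` and `direct_sad` are hypothetical.

```python
from collections import deque

M, N, K, L = 4, 3, 3, 4                          # sizes used in Conventional Example 1
ROWS, COLS = 2 * L + M - 1, 2 * K + N - 1        # 11 x 8 search window
PERIOD = ROWS + 1                                # 11 pixels + 1 dummy = 12 cycles

def direct_sad(ref, blk, i, j):
    """Reference result: full two-dimensional sum of absolute differences."""
    return sum(abs(ref[m + i][n + j] - blk[m][n])
               for m in range(M) for n in range(N))

def simulate_pipeline(ref, blk):
    """Stream the window column by column through N PEs linked by 12-cycle
    delay FIFOs; return {(i, j): AE} as seen at the last PE's output."""
    shift = deque([0] * M, maxlen=M)             # shared reference registers
    fifos = [deque([0] * PERIOD, maxlen=PERIOD) for _ in range(N - 1)]
    results = {}
    for x in range(COLS):                        # reference column being streamed
        for c in range(PERIOD):                  # cycle within the column period
            shift.append(ref[c][x] if c < ROWS else 0)   # last cycle is the dummy
            valid = M - 1 <= c < ROWS            # registers hold M pixels of column x
            i = c - (M - 1)                      # vertical offset of this window
            carry = 0                            # the first PE has no upstream input
            for n in range(N):                   # PE n holds block column n
                j = x - n                        # horizontal offset handled by PE n
                active = valid and 0 <= j < 2 * K
                part = (sum(abs(a - blk[m][n]) for m, a in enumerate(shift))
                        if active else 0)
                total = part + carry
                if n < N - 1:
                    carry = fifos[n][0]          # value written 12 cycles earlier
                    fifos[n].append(total)
                elif active:
                    results[(i, j)] = total
    return results
```

Because the delay equals one column period, the partial sum for a given vector reaches the next PE exactly when that PE is processing the neighbouring reference column at the same vertical offset, so the last PE emits the complete AE for each of the 48 candidate vectors.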
[0017]
Focusing on the processing of coding block T0 as described above, Conventional Example 1 connects the three arithmetic units PE 805 to PE 807 by operation data delay units 811 and 812, each with a 12-cycle delay, to form a three-stage pipeline that calculates the sum of absolute differences.
[0018]
Next, the transition of the coding block to be processed will be described.
[0019]
FIG. 41 is a timing chart of the latter half of the motion vector detection operation for coding block T0. The sums of absolute differences for coding block T0 calculated in PE 805 to PE 807 are output in sequence from PE 807, but PE 805 finishes its calculation for coding block T0 when it completes AE_{3,2,0} at timing J11. At timing O4, 4 cycles later, the outputs of registers 801 to 804 are (a0, a1, a2, a3) = (r_{0,6}, r_{1,6}, r_{2,6}, r_{3,6}), the reference pixels at the upper-left corner of the search region of coding block T2 indicated by E in FIG. 38. Therefore, by storing the four pixel data of the leftmost column of coding block T2 in PE 805 during the 4 invalid cycles from timing J11 to timing O4, so that (b0, b1, b2, b3) = (t_{4,9}, t_{5,9}, t_{6,9}, t_{7,9}), PE 805 can start calculating the sum of absolute differences AE_{-4,-3,0} corresponding to the vector (−3, −4) of coding block T2. During this time, PE 806 and PE 807 continue calculating sums of absolute differences for coding block T0. Twelve cycles later, when PE 806 finishes its calculation for T0, it stores four pixel data of T2 and starts calculating sums of absolute differences for T2. In other words, PE 805 to PE 807 can each start on coding block T2 in turn as they finish coding block T0, and once PE 807 has output the final sum of absolute differences AE_{3,2} of coding block T0, it outputs the sums of absolute differences AE_{i,j} of coding block T2 in sequence from the next 8 valid clocks, after 4 invalid clocks.
[0020]
As described above, in Conventional Example 1, PE 805 to PE 807 calculate coding block T0 and then perform motion vector detection in sequence on T2, T4, and the other even-numbered coding blocks as one series. The pixel data of the coding block T2 to be processed next are stored in order, starting from the PE that has finished its calculation for T0, so that the transition between coding blocks takes place while stalling the pipeline as little as possible.
[0021]
Next, parallel processing will be described.
[0022]
The series of odd-numbered coding blocks cannot be processed with PE 805 to PE 807 (see also the description of the circuit scale and processing speed of Conventional Example 1 below), so PE 808 to PE 810 are provided separately and the two series are processed in parallel. In FIG. 41, the outputs of registers 801 to 804 become (a0, a1, a2, a3) = (r_{0,3}, r_{1,3}, r_{2,3}, r_{3,3}), the reference pixels at the upper-left corner of the search region of coding block T1 indicated by D in FIG. 38. Therefore, by storing the four pixel data of the leftmost column of coding block T1 in PE 808 during the 4 invalid cycles immediately before timing P4, so that (b0, b1, b2, b3) = (t_{4,6}, t_{5,6}, t_{6,6}, t_{7,6}), PE 808 can start calculating the sum of absolute differences AE_{-4,-3,0} corresponding to the vector (−3, −4) of coding block T1. Thereafter, PE 808 to PE 810 perform motion vector detection for the odd-numbered series of coding blocks in the same way as PE 805 to PE 807 do for the even-numbered series.
[0023]
As described above, in Conventional Example 1, four pixels of reference pixel data are stored in registers 801 to 804 as common data, and a plurality of coding blocks whose search ranges include those pixels are stored in separate PEs, which makes parallel processing possible. The configuration example of FIG. 35 realizes two-series parallel processing of the even-numbered and odd-numbered coding blocks.
[0024]
Here, the circuit scale and processing speed of Conventional Example 1 will be described.
[0025]
Table 1 shows the circuit scale of Conventional Example 1. In Table 1, M is the number of vertical pixels of the coding block, N is the number of horizontal pixels, 2K is the horizontal width of the search range, 2L is the vertical width, and Q is the number of series that can be processed in parallel. Q is the smallest value that satisfies N × Q ≧ 2K. In the configuration of Conventional Example 1 in FIG. 35, M = 4, N = 3, K = 3, L = 4, and 3 × Q ≧ 2 × 3, so Q = 2. According to Table 1, the registers storing pixel data then amount to 28 pixels, and the registers of the operation data delay units, which store result data, amount to 32. Assuming 8 bits per pixel for the pixel data registers and 10 bits per datum for the data registers, the total is 544 bits, that is, 544 flip-flops. If only the single series using PE 805 to PE 807 in FIG. 35 is implemented, and motion vector detection for the odd-numbered coding blocks is performed after all the even-numbered coding blocks have been processed, then Q = 1 and the device can be built with 288 flip-flops. As a practical MPEG-2 specification, when interlaced video of 720 × 480 pixels (hereinafter abbreviated as 480i) is input with M = 16, N = 16, K = 64, L = 32, Q is exactly 8 and 8-series parallel processing is performed. In this case the registers storing pixel data number 2064 pixels, the registers storing result data number 7680, and the number of flip-flops is about 93,000. As an example of high-definition video, when interlaced video of 1920 × 1080 pixels (hereinafter abbreviated as 1080i) is input with M = 16, N = 16, K = 128, L = 64, 16-series parallel processing results and about 340,000 flip-flops are needed.
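The figures quoted above can be reproduced with a short calculation. Table 1 itself is not reproduced in this text, so the two register-count formulas below are inferred from the structure of FIG. 35 to FIG. 37 and from the quoted numbers; they are assumptions, and the function name `circuit_scale` is hypothetical.

```python
import math

def circuit_scale(M, N, K, L, Q=None, pixel_bits=8, data_bits=10):
    """Return (Q, pixel_registers, delay_registers, flip_flops).

    Inferred formulas (Table 1 is not available in this text):
      pixel registers = M shared reference registers + M per PE, N PEs per series
      delay registers = 2L results per delay unit, N - 1 delay units per series
    """
    if Q is None:
        Q = math.ceil(2 * K / N)              # smallest Q with N * Q >= 2K
    pixel_regs = M + M * N * Q
    delay_regs = 2 * L * (N - 1) * Q
    flip_flops = pixel_regs * pixel_bits + delay_regs * data_bits
    return Q, pixel_regs, delay_regs, flip_flops

print(circuit_scale(4, 3, 3, 4))              # configuration of FIG. 35
print(circuit_scale(4, 3, 3, 4, Q=1))         # single series only
print(circuit_scale(16, 16, 64, 32))          # 480i example
print(circuit_scale(16, 16, 128, 64))         # 1080i example
```

Under these assumptions the delay registers account for most of the flip-flops in the 480i and 1080i cases, because their count scales with both the search range height 2L and the number of series Q.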
[0026]
[Table 1]

Table 2 shows the calculation speed of Conventional Example 1. In Table 2, the number of valid cycles is the number of valid cycles needed to output the results for one coding block, and the number of loss cycles is the number of invalid cycles interposed between them. The average number of cycles per coding block is therefore the sum of the valid cycles and the loss cycles divided by the number of series processed in parallel. Unless Q satisfies N × Q = 2K, a loss cycle occurs when the coding block is switched; all figures in Table 2 are calculated for N × Q = 2K. From Table 2, motion vector detection for one coding block can be regarded as finishing in an average of 36 cycles in the configuration of FIG. 35, that is, with two-series parallel processing, and in 72 cycles with one series. For 480i under the same conditions as in Table 1, 8-series parallel processing is performed, so one block can be regarded as finishing in an average of 1280 cycles, which means operating with a clock of about 52 MHz. For 1080i, the calculation speed is one block per 2304 cycles, which means operating with a clock of 560 MHz.
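The cycle counts and clock rates quoted above can be reproduced as follows. Table 2 is not reproduced in this text, so the formulas are inferred from the timing description (2L valid cycles and M invalid cycles for each of the 2K reference columns); the frame rate of 30 frames per second is an assumption that is not stated in the text, and the function names are hypothetical.

```python
def cycles_per_block(M, K, L, Q):
    """Average cycles per coding block, inferred from the timing description."""
    valid = 2 * K * 2 * L          # one result per candidate vector
    loss = 2 * K * M               # M invalid cycles per reference column
    return (valid + loss) / Q      # Q series are processed in parallel

def required_clock_hz(width, height, M, N, K, L, Q, fps=30):
    """Clock needed to process every block of every frame (fps is assumed)."""
    blocks_per_frame = width * height / (M * N)
    return blocks_per_frame * fps * cycles_per_block(M, K, L, Q)

print(cycles_per_block(4, 3, 4, 2))                            # FIG. 35, two series
print(required_clock_hz(720, 480, 16, 16, 64, 32, 8) / 1e6)    # 480i, in MHz
print(required_clock_hz(1920, 1080, 16, 16, 128, 64, 16) / 1e6)  # 1080i, in MHz
```

With these assumptions the results agree with the quoted 36, 72, 1280, and 2304 cycles and with clocks of roughly 52 MHz and 560 MHz.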
[0027]
[Table 2]

(Conventional example 2)
Next, Conventional Example 2 of the motion vector detection device will be described. This is an apparatus corresponding to claims 3 and 4 of Patent Document 1. FIG. 42 is a block diagram showing the configuration of Conventional Example 2, and FIG. 43 is a diagram showing the structure of PEs 847 to 849 of this motion vector detection device. In FIGS. 42 and 43, parts that are the same as in FIGS. 35 and 36 are given the same reference numerals, and their description is omitted.
[0028]
In FIG. 42, the configuration of Conventional Example 2 adds registers 838 to 841 for storing reference pixel data in series with registers 801 to 804, and inserts a pixel data delay unit 842 between register 841 and register 801, so that the outputs of registers 838 to 841 are delayed by a further 4 cycles relative to registers 801 to 804. The outputs of registers 801 to 804 and registers 838 to 841 are selected by selectors 843 to 846, and only four reference pixel data outputs are supplied to PEs 847 to 849. In the figure, the four outputs selected by selectors 843 to 846 carry the signal names a0, a1, a2, and a3. In addition, the operation data delay units 850 and 851 that connect the PEs are changed to delay units that delay the eight data of the valid cycles by 16 cycles.
[0029]
Compared with the structure of PE 805 of Conventional Example 1 shown in FIG. 36, the structure of PE 847 shown in FIG. 43 adds registers 853 to 856 in parallel with registers 819 to 822, which store coded pixel data, and selects between the outputs of registers 819 to 822 and the outputs of registers 853 to 856 with selectors 857 to 860. In the figure, the four outputs selected by selectors 857 to 860 carry the signal names b0, b1, b2, and b3, indicating their correspondence with the selected reference pixel outputs a0, a1, a2, and a3.
[0030]
The operation of the motion vector detection device of Conventional Example 2 is described below. Conventional Example 2 stores the two series, the even-numbered coding blocks and the odd-numbered coding blocks, in registers 819 to 822 and registers 853 to 856 in each PE, respectively, and carries out motion vector detection on both. In Conventional Example 1 the two series of coding blocks are processed in parallel, each with 4 invalid cycles interposed. Conventional Example 2 differs in that the invalid period is extended to 8 cycles and the two series are processed by time division, each running in the other's invalid cycles.
[0031]
FIG. 44 is a timing chart showing the operation of Conventional Example 2. FIG. 44 shows a point at which motion vector detection has already started and reached a steady state. As reference pixels, the 11 pixels of one vertical column of the reference image search range in FIG. 38 are input in succession, followed by invalid data for a 5-cycle period. The input reference pixels are stacked in sequence in registers 801 to 804 as shown in FIG. 44, and the output of register 801 is delayed four cycles by pixel data delay unit 842 and then stacked in sequence in registers 838 to 841. As a result, taking registers 801 to 804 as one set and registers 838 to 841 as another, each set has an effective period of 8 cycles, and the effective periods of the two sets alternate without overlapping. Selectors 843 to 846 select whichever set is valid, so that valid reference pixel data always appear at the outputs (a0, a1, a2, a3), in two 8-cycle periods, and can be supplied to PEs 847 to 849. In FIG. 44, the pixel data of coding block T0 are already stored in registers 819 to 822, and the pixel data of coding block T1 in registers 853 to 856, of each of PEs 847 to 849. In the effective 8-cycle period starting at timing V4, selectors 857 to 860 of PEs 847 to 849 select registers 819 to 822, that is, the pixel data of coding block T0, so the sum of absolute differences between coding block T0 and the reference pixel data (a0, a1, a2, a3) is calculated. In the effective 8-cycle period starting at timing W4, selectors 857 to 860 select registers 853 to 856, that is, the pixel data of coding block T1, so the sum of absolute differences between coding block T1 and the reference pixel data (a0, a1, a2, a3) is calculated. The results for coding block T0 calculated in the 8 cycles starting at timing V4 are delayed 16 cycles by operation data delay units 850 and 851 and then passed to the adjacent PE. They therefore skip the 8-cycle calculation period for coding block T1 starting at timing W4 and are handed over to the calculation for coding block T0 starting at timing Y4.
[0032]
As described above, by connecting PEs 847 to 849 with operation data delay units 850 and 851, each with a 16-cycle delay, the sums of absolute differences for coding block T0 and coding block T1 can be calculated as independent pipelines. In the effective 8 cycles starting at timing Y4, registers 819 to 822 of PE 847 are switched to the pixel data of coding block T2; PEs 848 and 849 continue the calculation for coding block T0 while PE 847 starts the calculation for coding block T2, in the same way as in Conventional Example 1 described above.
[0033]
As described above, in Conventional Example 2, the registers storing the reference pixels (see FIG. 42) and the registers storing the pixel data of the coding block (see FIG. 43) are each given a double structure, so that the two series of coding blocks, even-numbered and odd-numbered, are processed by time division without loss cycles.
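The time-division scheme can be illustrated with a schedule-level sketch. It deliberately abstracts away the doubled reference registers and the selectors: the model simply assumes that in each 16-cycle column period the first 8 cycles serve series 0 (for example T0) and the next 8 cycles serve series 1 (for example T1), and that the delay units are 16-entry first-in first-out buffers. The function name `simulate_time_division` and the parameter `first_cols` (leftmost reference column of each block's search window) are hypothetical.

```python
from collections import deque

M, N, K, L = 4, 3, 3, 4
VALID = 2 * L                  # 8 valid cycles per series in each column period
PERIOD = 2 * VALID             # 16 cycles per reference column

def simulate_time_division(ref, blocks, first_cols):
    """Run two coding blocks through one chain of N PEs by time division.

    blocks[s] is the M x N coding block of series s and first_cols[s] the
    leftmost reference column of its search window (assumed parameters).
    Returns one {(i, j): AE} dictionary per series.
    """
    fifos = [deque([0] * PERIOD, maxlen=PERIOD) for _ in range(N - 1)]
    results = [{} for _ in blocks]
    for x in range(len(ref[0])):               # reference column being streamed
        for c in range(PERIOD):
            s, i = divmod(c, VALID)            # series selected, vertical offset
            carry = 0
            for n in range(N):                 # PE n holds column n of both blocks
                j = x - first_cols[s] - n      # horizontal offset for this series
                active = 0 <= j < 2 * K
                part = (sum(abs(ref[i + m][x] - blocks[s][m][n]) for m in range(M))
                        if active else 0)
                total = part + carry
                if n < N - 1:
                    carry = fifos[n][0]        # written 16 cycles ago: same series
                    fifos[n].append(total)
                elif active:
                    results[s][(i, j)] = total
    return results
```

Because the 16-cycle delay equals the full column period, a partial sum always re-enters the chain in the half-period of its own series, so the two pipelines never mix even though they share the same PEs and delay units.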
[0034]
As described above, the conventional motion vector detection devices have a plurality of arithmetic units (PEs), connect adjacent PEs by operation data delay units to form a pipeline, and calculate the sum of absolute differences. By storing the pixel data of the coding block to be processed next in order, starting from the arithmetic unit that has finished its calculation for the current coding block, pipeline stalls are kept to a minimum. In Conventional Example 1, a register storing the pixel data of the coding block is provided in each arithmetic unit, and parallel processing is made possible by having a plurality of groups of arithmetic units. In Conventional Example 2, time-division processing is made possible by giving the reference pixel registers and the coding block registers each a double structure.
[0035]
[Patent Document 1]
JP-A-10-136377
[0036]
[Problems to be solved by the invention]
However, the above-described conventional motion vector detection device has a problem that the circuit scale increases.
[0037]
The present inventor has analyzed the cause as follows. The circuit scale of the operation data delay units is proportional to the size of the search range. In addition, in parallel processing or time-division processing, a register for storing the coding block is required independently for each series, so the number of circuits that can be shared between series is extremely small and the circuit scale grows in proportion to the number of processing series. These two effects combine to increase the circuit scale synergistically.
[0038]
The configuration of the conventional motion vector detection device has an advantage in wiring efficiency when the search range is relatively small. When the search range is large, however, it has the decisive disadvantage that the circuit scale of the operation data delay units grows explosively.
[0039]
Such an increase in circuit scale cannot be eliminated by improvements in implementation, and it is extremely difficult to realize a practical search range for an actual video signal. For example, in the case of the 480i video shown in Table 1, 96,000 or more flip-flops are required, so this cannot be realized easily. For 1080i, about 340,000 flip-flops are required even when the device is operated with a 560 MHz clock, which is extremely difficult to implement.
[0040]
(1) The above prior art has no flexibility in the configuration of the parallel processing device. Looking at one series in parallel processing that consists of a plurality of series, the coding blocks are processed in an order that jumps to distant positions on the encoded image (for example, T2 is processed next after T0), and the interval between those positions is uniquely determined by the ratio between the size of the coding block and the size of the search range. Because the processing order skips in this way, the encoding process that follows motion vector detection becomes difficult to implement, and the device has no choice but to be configured for parallel processing. Moreover, since the number of parallel processing series is also uniquely determined by the ratio of the size of the coding block to the size of the search range, a device that does not require much processing speed and a device that requires very high-speed processing alike must have as many parallel processing circuits as that uniquely determined number of series, so a device cannot be realized with the minimum circuit scale suited to its purpose.
[0041]
(2) In the above prior art, the sums of absolute differences for three types of vectors, namely a frame vector and two types of field vectors, cannot be obtained at the same time, so they must be calculated independently, which requires additional circuit scale. According to the MPEG-2 standard, for a picture with a frame structure, either a frame vector or two types of field vectors can be selected for each coding block, and this requires a search for both the frame vector and the field vectors. A method for obtaining them simultaneously with a minimum increase in circuitry is desired, but this cannot be realized by the conventional technology described above.
[0042]
(3) Table 3 and Table 4 summarize the circuit scale and processing speed of Conventional Example 2. Table 3 shows the circuit scale of Conventional Example 2, and Table 4 shows the processing speed of Conventional Example 2. The calculation conditions in Tables 3 and 4 are the same as those in Conventional Example 1 in Tables 1 and 2, but Q means the number of sequences that can be time-division multiplexed. Also in the case of Conventional Example 2, Q is obtained as the minimum Q that satisfies N × Q ≧ 2K. The motion vector detection device of Patent Document 1 is described as a technique limited to two-system time-sharing processing, and those other than Q = 2 are calculated by the inventor independently. The case of Q = 1 is omitted because it is the same as the case of Q = 1 in the configuration of the conventional example 1 described above.
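The condition N × Q ≧ 2K above can be checked with a short calculation. The following Python sketch is illustrative only: the function name `min_series` and the sample values of N and K are assumptions made for this example and are not taken from Tables 3 and 4.

```python
def min_series(n: int, k: int) -> int:
    """Return the smallest integer Q that satisfies n * Q >= 2 * k."""
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    # Ceiling division of 2k by n using integer arithmetic only.
    return (2 * k + n - 1) // n


# Hypothetical sample values (not from the tables of this document).
q_a = min_series(16, 16)   # block width 16, horizontal range -16 to 15
q_b = min_series(16, 32)   # block width 16, horizontal range -32 to 31
```

Under these assumed values the search width 2K is covered by two and four series respectively, which shows how Q grows as the search range widens relative to the block width.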
[0043]
Comparing Tables 3 and 4 with Tables 1 and 2, the circuit scale is almost the same, but the processing speed of Conventional Example 2 is lower because of the time-division processing (with Conventional Example 2 it is therefore even more difficult to realize a practical search range than with Conventional Example 1). Conventional Example 2 has been described in order to clarify the differences from the configuration of the present invention described later.
[0044]
[Table 3]

[0045]
[Table 4]

SUMMARY OF THE INVENTION In view of the conventional problems described above, an object of the present invention is to provide a motion vector detection device, a motion vector detection method, a program, and a recording medium that can reduce the circuit scale.
[0046]
[Means for Solving the Problems]
  The first aspect of the present invention is a motion vector detection method comprising:
an encoding block output step of storing pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputting N sets of encoded data, each set being M pixel data arranged in one vertical column or one horizontal row in the encoding block;
a reference data output step of temporarily storing M pixels of a reference image and outputting them as one set of reference data, the step performing at least one of (1) control that, when the reference data is M pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) control that, when the reference data is M pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image;
a calculation step of calculating, using 1 × N arithmetic units (7 to 9, see FIG. 1) each of which calculates an error amount between one set of the reference data and one set of the encoded data, the error amounts of all combinations of one set of the reference data and the N sets of the encoded data; and
a cumulative addition step of obtaining the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
[0047]
  The second aspect of the present invention is a motion vector detection device comprising:
  an encoding block register (2, see FIG. 1) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N sets of encoded data, each set being M pixel data arranged in one vertical column or one horizontal row in the encoding block (T0 to T2, see FIG. 3);
  a first reference register (1, see FIG. 1) that temporarily stores M pixels of a reference image and outputs them as one set of reference data, the first reference register having at least one of (1) a control function that, when the reference data is M pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image;
  1 × N arithmetic units (7 to 9, see FIG. 1), each of which calculates an error amount between one set of the reference data and one set of the encoded data, for calculating the error amounts of all combinations of one set of the reference data and the N sets of the encoded data; and
  a cumulative addition array (10, see FIG. 1) that obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block (T0 to T2, see FIG. 3) is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
[0048]
  The third aspect of the present invention is a motion vector detection method comprising:
an encoding block output step of storing pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputting N sets of encoded data, each set being M pixel data arranged in one vertical column or one horizontal row in the encoding block;
a reference data output step of temporarily storing M + Q − 1 pixels of a reference image and outputting Q sets of reference data, each set being M consecutive pixels, the step performing at least one of (1) control that, when the reference data is M + Q − 1 pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) control that, when the reference data is M + Q − 1 pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image;
a calculation step of calculating, using Q × N arithmetic units (7 to 9, see FIG. 14) each of which calculates an error amount between one set of the reference data and one set of the encoded data, the error amounts of all combinations of the Q sets of the reference data and the N sets of the encoded data; and
a cumulative addition step of obtaining the sum of the N error amounts, using cumulative addition arrays (10, 203, see FIG. 14), by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
[0049]
  The fourth aspect of the present invention is a motion vector detection device comprising:
  an encoding block register (2, see FIG. 14) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N sets of encoded data, each set being M pixel data arranged in one vertical column or one horizontal row in the encoding block (T0 to T2, see FIG. 3);
  a first reference register (201, see FIG. 14) that temporarily stores M + Q − 1 pixels of a reference image and outputs Q sets of reference data, each set being M consecutive pixels, the first reference register having at least one of (1) a control function that, when the reference data is M + Q − 1 pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M + Q − 1 pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image;
  Q × N arithmetic units (7 to 9, see FIG. 14), each of which calculates an error amount between one set of the reference data and one set of the encoded data, for calculating the error amounts of all combinations of the Q sets of the reference data and the N sets of the encoded data; and
  Q cumulative addition arrays (10, 203, see FIG. 14), each of which obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block (T0 to T2, see FIG. 3) is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
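To illustrate how M + Q − 1 stored pixels can supply Q sets of reference data of M consecutive pixels each, the following Python sketch models the reference register as a plain list. It is a software analogy under stated assumptions, not the register circuit itself: the function name `reference_sets` and the pixel values are hypothetical.

```python
def reference_sets(pixels, m, q):
    """Split M + Q - 1 stored pixels into Q overlapping sets of M consecutive pixels."""
    if len(pixels) != m + q - 1:
        raise ValueError("expected exactly M + Q - 1 pixels")
    # Set s starts s pixels further along the stored run than set 0.
    return [pixels[s:s + m] for s in range(q)]


# Hypothetical example: M = 4, Q = 2, so 5 pixels are stored.
sets = reference_sets([10, 11, 12, 13, 14], m=4, q=2)
units_needed = len(sets) * 3   # Q x N arithmetic units for N = 3 encoded sets
```

Because neighbouring sets share M − 1 pixels, only Q − 1 additional pixels need to be stored to serve Q times as many arithmetic units.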
[0050]
A fifth aspect of the present invention is the motion vector detection device according to the second or fourth aspect of the present invention, further comprising:
a second reference register (401, see FIG. 20) different from the first reference register (1, see FIG. 20);
reference data changeover switches (407 to 409, see FIG. 20) for selecting either the reference data supplied from the first reference register (1, see FIG. 20) or the reference data supplied from the second reference register (401, see FIG. 20); and
mode control means (410, see FIG. 20) that, at the time of transition from a first mode, in which the first reference register (1, see FIG. 20) sequentially updates the reference data and supplies it to the arithmetic units (7 to 9, see FIG. 20), to a second mode, in which the second reference register (401, see FIG. 20) sequentially updates the reference data and supplies it to the arithmetic units (7 to 9, see FIG. 20), switches the reference data changeover switches (407 to 409, see FIG. 20) in order, starting from the arithmetic unit (7 to 9, see FIG. 20) that has finished its computation.
[0051]
According to a sixth aspect of the present invention, in the motion vector detection device according to the fifth aspect of the present invention, when data of a new encoding block (T0 to T2, see FIG. 3) is to be stored in the encoding block register (402, see FIG. 20), the mode control means (410, see FIG. 20) stores the data of the new encoding block (T0 to T2, see FIG. 3) in the encoding block register (402, see FIG. 20) sequentially, in synchronization with the switching operation of the reference data changeover switches (407 to 409, see FIG. 20).
[0052]
According to a seventh aspect of the present invention, in the motion vector detection device according to any one of the second, fourth, fifth, and sixth aspects of the present invention, the cumulative addition array (501, see FIG. 23) has (a) a frame addition array (502, see FIG. 23) that cumulatively adds the N error amounts by delaying the addition result of the error amounts of the arithmetic units (108 to 111, see FIG. 23) by one cycle at a time and adding it to the error amount of the adjacent encoded data set, (b) a field addition array (503, see FIG. 23) that adds, by a cumulative addition structure with delays of two cycles at a time, the error amounts of the N / 2 even-numbered or odd-numbered arithmetic units (108 to 111, see FIG. 23), and (c) arithmetic means (506, see FIG. 23) for obtaining the difference between the results of the frame addition array (502, see FIG. 23) and the field addition array (503, see FIG. 23).
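The arithmetic relationship among the three results named in the seventh aspect can be sketched as follows. This Python fragment ignores all cycle delays and only shows the sums involved; the function name `frame_and_field_sums`, the `parity` parameter, and the sample error amounts are assumptions made for illustration. Reading the difference as the sum over the remaining arithmetic units is simple arithmetic; how the device uses that value is not asserted here.

```python
def frame_and_field_sums(unit_errors, parity=0):
    """Return (frame sum, field sum, difference) for per-unit error amounts.

    unit_errors: one error amount per arithmetic unit, in unit order.
    parity: 0 to take the even-numbered units for the field sum, 1 for odd.
    """
    frame = sum(unit_errors)               # all N error amounts
    field = sum(unit_errors[parity::2])    # every second unit only
    difference = frame - field             # equals the sum over the other units
    return frame, field, difference


# Hypothetical error amounts from four arithmetic units.
frame, field, difference = frame_and_field_sums([5, 1, 7, 2])
```

The point of the structure is that only one extra addition array and one subtraction are needed to obtain all three sums from the same arithmetic unit outputs.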
[0053]
According to an eighth aspect of the present invention, in the motion vector detection device according to any one of the second, fourth, fifth, and sixth aspects of the present invention, the arithmetic units (602 to 604, see FIG. 25) obtain, for one set of the reference data and one set of the encoded data, two types of error amounts, namely the error amount for the pixels at even positions and the error amount for the pixels at odd positions, and
the cumulative addition array (605, see FIG. 25) has (a) a first field addition array (606, see FIG. 25) and (b) a second field addition array (607, see FIG. 25) that independently add the two types of error amounts by cumulative addition structures.
[0054]
  According to a ninth aspect of the present invention, in the motion vector detection device according to any one of the second, fourth, fifth, and sixth aspects of the present invention, the arithmetic unit (602, see FIG. 28) obtains, for one set of the reference data and one set of the encoded data, two types of error amounts, namely a first error amount for the pixels at even positions or odd positions and a second error amount for all the pixels, and
  the cumulative addition array (605, see FIG. 25) has (a) a field addition array (606, see FIG. 25) for independently accumulating the first error amounts, (b) a frame addition array (607, see FIG. 25) for independently accumulating the second error amounts, and (c) arithmetic means (608, see FIG. 25) for obtaining the difference between the results of the field addition array and the frame addition array.
[0055]
  According to a tenth aspect of the present invention, there is provided a motion vector detection device comprising:
  an encoding block register (102, see FIG. 29) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N / 2 sets of encoded data of a first field and N / 2 sets of encoded data of a second field, each set being M pixel data in the same field;
  reference registers (701 to 703, see FIG. 3), corresponding to the first field and the second field, each of which stores M pixel data in the same field of a reference image and outputs them as one set of reference data;
  field evaluation means (704 to 707, see FIG. 29), each of which receives one set of the reference data and N / 2 sets of the encoded data and obtains a field error amount;
  a first adder (720, see FIG. 29) that adds the field error amount for the reference data of the first field and the encoded data of the first field to the field error amount for the reference data of the second field and the encoded data of the second field; and
  a second adder (721, see FIG. 29) that adds the field error amount for the reference data of the first field and the encoded data of the second field to the field error amount for the reference data of the second field and the encoded data of the first field,
  wherein the reference registers (701 to 703, see FIG. 3) have at least one of (1) a control function that, when the reference data is M pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image, and
  each of the field evaluation means (704 to 707, see FIG. 29) has N / 2 arithmetic units (708 to 715, see FIG. 29) that calculate the error amounts of all combinations of one set of the reference data and the N / 2 sets of the encoded data, obtains the sum of the N / 2 error amounts by a cumulative addition structure, and outputs the sum as the field error amount.
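The way the four field error amounts feed the two adders in the tenth aspect can be sketched in Python as follows. The sketch omits all pipeline timing and assumes that, for one candidate position, each field is already available as N / 2 aligned sets of pixels; the function names and the pixel values are hypothetical.

```python
def field_error(ref_sets, coded_sets):
    """Sum of absolute differences over N/2 aligned sets of M pixels each."""
    return sum(abs(r - t)
               for ref, coded in zip(ref_sets, coded_sets)
               for r, t in zip(ref, coded))


def combine_fields(ref1, ref2, coded1, coded2):
    """Return (first adder output, second adder output).

    ref1/ref2: reference data of the first/second field.
    coded1/coded2: encoded data of the first/second field.
    """
    same = field_error(ref1, coded1) + field_error(ref2, coded2)    # first adder
    cross = field_error(ref1, coded2) + field_error(ref2, coded1)   # second adder
    return same, cross


# Hypothetical data with N / 2 = 2 sets per field and M = 2 pixels per set.
same, cross = combine_fields(
    ref1=[[1, 2], [3, 4]], ref2=[[5, 6], [7, 8]],
    coded1=[[1, 2], [3, 5]], coded2=[[5, 5], [7, 8]],
)
```

In this made-up data the same-field pairing gives the much smaller total, which is the kind of comparison the two adder outputs make possible.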
[0057]
  The eleventh aspect of the present invention is a program for causing a computer to function as the following components of the motion vector detection device of the second aspect of the present invention: the encoding block register (2, see FIG. 1) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N sets of encoded data, each set being M pixel data arranged in one column or one row in the encoding block (T0 to T2, see FIG. 3); the first reference register (1, see FIG. 1) that temporarily stores M pixels of a reference image and outputs them as one set of reference data, and that has at least one of (1) a control function that, when the reference data is M pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image; the 1 × N arithmetic units (7 to 9, see FIG. 1), each of which calculates an error amount between one set of the reference data and one set of the encoded data, for calculating the error amounts of all combinations of one set of the reference data and the N sets of the encoded data; and the cumulative addition array (10, see FIG. 1) that obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block (T0 to T2, see FIG. 3) is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
[0059]
  The twelfth aspect of the present invention is a program for causing a computer to function as the following components of the motion vector detection device of the fourth aspect of the present invention: the encoding block register (2, see FIG. 14) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N sets of encoded data, each set being M pixel data arranged in one column or one row in the encoding block (T0 to T2, see FIG. 3); the first reference register (201, see FIG. 14) that temporarily stores M + Q − 1 pixels of a reference image and outputs Q sets of reference data, each set being M consecutive pixels, and that has at least one of (1) a control function that, when the reference data is M + Q − 1 pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M + Q − 1 pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image; the Q × N arithmetic units (7 to 9, see FIG. 14), each of which calculates an error amount between one set of the reference data and one set of the encoded data, for calculating the error amounts of all combinations of the Q sets of the reference data and the N sets of the encoded data; and the Q cumulative addition arrays (10, 203, see FIG. 14), each of which obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the encoding block (T0 to T2, see FIG. 3) is delayed by one cycle and added to the error amount of the adjacent encoded data set, and each addition result is in turn delayed by one cycle and added to the adjacent error amount.
[0060]
  The thirteenth aspect of the present invention is a program for causing a computer to function as the following components of the motion vector detection device of the tenth aspect of the present invention: the encoding block register (102, see FIG. 29) that stores pixel data constituting an encoding block (T0 to T2, see FIG. 3), which is a rectangular area on an encoded image, and outputs N / 2 sets of encoded data of the first field and N / 2 sets of encoded data of the second field, each set being M pixel data in the same field; the reference registers (701 to 703, see FIG. 3), corresponding to the first field and the second field, each of which stores M pixel data in the same field of a reference image and outputs them as one set of reference data; the field evaluation means (704 to 707, see FIG. 29), each of which receives one set of the reference data and N / 2 sets of the encoded data and obtains a field error amount; the first adder (720, see FIG. 29) that adds the field error amount for the reference data of the first field and the encoded data of the first field to the field error amount for the reference data of the second field and the encoded data of the second field; and the second adder (721, see FIG. 29) that adds the field error amount for the reference data of the first field and the encoded data of the second field to the field error amount for the reference data of the second field and the encoded data of the first field,
wherein the reference registers have at least one of (1) a control function that, when the reference data is M pieces of data arranged vertically on the reference image, extracts and stores the reference data while sequentially shifting it in the horizontal direction on the reference image, and (2) a control function that, when the reference data is M pieces of data arranged horizontally on the reference image, extracts and stores the reference data while sequentially shifting it in the vertical direction on the reference image, and
each of the field evaluation means has N / 2 arithmetic units that calculate the error amounts of all combinations of one set of the reference data and the N / 2 sets of the encoded data, obtains the sum of the N / 2 error amounts by a cumulative addition structure, and outputs the sum as the field error amount.
[0061]
  The fourteenth aspect of the present invention is a computer-processable recording medium carrying the program of any one of the eleventh to thirteenth aspects of the present invention.
[0062]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0063]
(Embodiment 1)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0064]
FIG. 1 is a block diagram showing a motion vector detection apparatus according to the present embodiment.
[0065]
  The first embodiment corresponds to the case, in the first and second aspects of the present invention described above, in which M pixel data in one vertical column form one set. In the first embodiment, the sum of absolute differences shown in (Equation 1), (Equation 2), and (Equation 3) is adopted as the error amount between the prediction block candidate and the coding block; the coding block has a size of N = 3 horizontally and M = 4 vertically; and the search region has a horizontal range of K = 3, that is, −3 to 2, and a vertical range of L = 4, that is, −4 to 3.
[0066]
In the present specification, the number of pixels in each small block obtained by decomposing a coding block is represented by the symbol M, and the number of small blocks by the symbol N. Therefore, (1) when the coding block is decomposed in the column direction, the quantity related to the vertical direction is denoted M and the quantity related to the horizontal direction is denoted N, and (2) when the coding block is decomposed in the row direction, the quantity related to the vertical direction is denoted N and the quantity related to the horizontal direction is denoted M. Regardless of whether the coding block is decomposed in the column direction or the row direction, the horizontal search range is denoted K and the vertical search range is denoted L. When N is an odd number, N / 2 is interpreted as (N + 1) / 2. The same applies to the following embodiments.
[0067]
In FIG. 1, the reference register 1 temporarily stores four pixels a0 to a3 of the reference image and outputs them as one set of reference data A. The encoding block register 2 divides the 12 pixels of the coding block into three small blocks, each consisting of four data arranged in one vertical column, and consists of an encoded small block register 3 that stores b00 to b30 corresponding to the zeroth column, an encoded small block register 4 that stores b01 to b31 corresponding to the first column, and an encoded small block register 5 that stores b02 to b32 corresponding to the second column; their outputs are output as three sets of encoded data B0, B1, and B2. The arithmetic block 6 is composed of arithmetic units 7 to 9: the arithmetic unit 7 receives the reference data A and the encoded data B0, the arithmetic unit 8 receives the reference data A and the encoded data B1, and the arithmetic unit 9 receives the reference data A and the encoded data B2, and each outputs the sum of absolute differences AE_{i,j,n} corresponding to (Equation 2). The outputs of the arithmetic units 7 to 9 are connected to the cumulative addition array 10, which outputs the sum of absolute differences AE_{i,j} corresponding to (Equation 3). In the cumulative addition array 10, the output of the arithmetic unit 7 passes through the delay unit 11 and is added to the output of the arithmetic unit 8 by the adder 12, and that result passes through the delay unit 13 and is added to the output of the arithmetic unit 9 by the adder 14. FIG. 2 is a block diagram showing the internal configuration of the arithmetic unit 7. In the arithmetic unit 7, the corresponding elements of the input reference data A and encoded data B0 are connected to the difference absolute value calculators 15 to 18, and their outputs are added by the adders 19 to 21. The arithmetic units 8 and 9 have the same configuration as the arithmetic unit 7 in FIG. 2, differing only in that the encoded data is B1 and B2 instead of B0, so their description is omitted.
[0068]
Next, the operation of the motion vector detection device of this embodiment will be described. Note that while describing the operation of the motion vector detection device of the present embodiment, an embodiment of the motion vector detection method of the present invention will also be described (the same applies to the following embodiments).
[0069]
FIG. 3 is a region relationship diagram showing the positional relationship between each block, pixel, and search region in the encoded image and reference image of this embodiment, and FIG. 4 is a timing chart showing details of the operation.
[0070]
  First, the motion vector detection operation for the coding block T0 in the first embodiment will be described. In FIG. 3, the search range of the coding block T0 is the range indicated by C on the reference image, and the evaluation of the error amount starts from the vector (−3, −4), that is, the upper left corner of the search range. The calculation for the coding block T0 starts from the cycle of timing D0 in FIG. 4. In the immediately preceding cycle, the encoded small block register 3 has read the small block (b00, b10, b20, b30) = (t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3}) in the leftmost column of the coding block T0 in FIG. 3, the encoded small block register 4 has read the small block (b01, b11, b21, b31) = (t_{4,4}, t_{5,4}, t_{6,4}, t_{7,4}), and the encoded small block register 5 has read the small block (b02, b12, b22, b32) = (t_{4,5}, t_{5,5}, t_{6,5}, t_{7,5}), and these are already being output as the three sets of outputs B0, B1, and B2. When the operation starts in the cycle of timing D0 in FIG. 4, the reference register 1 stores the reference pixel data (a0, a1, a2, a3) = (r_{0,0}, r_{1,0}, r_{2,0}, r_{3,0}) and outputs it as one set of output A. In this cycle D0, the arithmetic unit 7 calculates the sum of absolute differences of the input reference data A and encoded data B0. The result is |r_{0,0} − t_{4,3}| + |r_{1,0} − t_{5,3}| + |r_{2,0} − t_{6,3}| + |r_{3,0} − t_{7,3}|, which by (Equation 2) is the sum of absolute differences AE_{-4,-3,0}. Subsequently, in the cycle of timing D1, (a0, a1, a2, a3) = (r_{0,1}, r_{1,1}, r_{2,1}, r_{3,1}) is read into the reference register 1. FIG. 5 is a reference data area diagram showing the small block data areas stored in the reference register 1 within the search area of the reference image. In FIG. 5, the reference data stored in the reference register 1 at timing D0 is the four vertical pixels at the upper left corner of the search area, indicated by the small block D0 in the figure, and at timing D1 it has moved to the small block D1, the four pixels adjacent in the horizontal direction. That is, this corresponds to the case, in the first and second aspects of the present invention, in which the reference data is four vertically arranged pixels and is moved in the horizontal direction; this movement control differs significantly from the movement method of the conventional example shown in the earlier figures. During this time, the encoding block register 2 stores and holds the pixels of the coding block T0. At timing D1 shown in FIG. 4, the arithmetic unit 7 calculates |r_{0,1} − t_{4,3}| + |r_{1,1} − t_{5,3}| + |r_{2,1} − t_{6,3}| + |r_{3,1} − t_{7,3}|, that is, AE_{-4,-2,0}, and the arithmetic unit 8 calculates |r_{0,1} − t_{4,4}| + |r_{1,1} − t_{5,4}| + |r_{2,1} − t_{6,4}| + |r_{3,1} − t_{7,4}|, that is, AE_{-4,-3,1} from (Equation 2). Similarly, in the cycle of timing D2, the arithmetic units 7 to 9 calculate AE_{-4,-1,0}, AE_{-4,-2,1}, and AE_{-4,-3,2}, respectively. Thereafter, the values AE_{i,j,n} for the coding block T0 are calculated in order.
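The assignment of partial sums to arithmetic units described above can be tabulated with a short script. The Python sketch below is a reading aid built on assumptions: arithmetic units 7 to 9 are numbered n = 0 to 2, cycles D0 to D7 are numbered c = 0 to 7, and a unit is listed only when its result belongs to a vector inside the horizontal search range −K to K − 1. The function name `schedule` is hypothetical.

```python
K, N = 3, 3                  # horizontal search range and block width of Embodiment 1
CYCLES = N + 2 * K - 1       # 8 reference columns per band (D0 to D7)


def schedule(i):
    """List (cycle, unit n, (i, j, n)) for every partial sum AE_{i,j,n} in one band."""
    out = []
    for c in range(CYCLES):
        for n in range(N):
            # Unit n compares reference column c with encoded column n,
            # so the horizontal vector component is j = c - n - K.
            j = c - n - K
            if -K <= j <= K - 1:
                out.append((c, n, (i, j, n)))
    return out


band = schedule(-4)                          # top band of the search range
at_d2 = [e for e in band if e[0] == 2]       # what the three units do at D2
```

Filtering the list by cycle reproduces the statements in the text: only unit n = 0 contributes at D0, and at D2 the three units handle j = −1, −2, and −3 respectively.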
[0071]
The outputs of the arithmetic units 7 to 9 are added by the cumulative addition array 10 to obtain the sum of absolute differences. AE_{-4,-3,0}, obtained by the arithmetic unit 7 in the cycle of timing D0 in FIG. 4, is delayed by one cycle by the delay unit 11 of the cumulative addition array 10 and added by the adder 12 to AE_{-4,-3,1}, obtained by the arithmetic unit 8 in the cycle of timing D1, giving AE_{-4,-3,0} + AE_{-4,-3,1}. This result is delayed by one cycle by the delay unit 13 and added by the adder 14 to AE_{-4,-3,2}, obtained by the arithmetic unit 9 in the cycle of timing D2, giving AE_{-4,-3,0} + AE_{-4,-3,1} + AE_{-4,-3,2}. Comparison with (Equation 3) shows that this is AE_{-4,-3}. That is, the sum of absolute differences AE_{-4,-3}, which is the error amount between the prediction block candidate of the vector (−3, −4) and the coding block T0, has been obtained. The delay units 11 and 13 connect the three arithmetic units 7 to 9 so as to form a pipeline, so that AE_{-4,-3}, AE_{-4,-2}, ..., AE_{-4,2} are obtained one per cycle.
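The delay-and-add behaviour of the cumulative addition array can be illustrated with a small behavioural model. The sketch below is Python, not a description of the actual circuit: the function names (`column_sad`, `scan_band`, `brute_force_sad`) and the window layout (row 0, column 0 at the upper left corner of the search range) are assumptions made for illustration only.

```python
M, N = 4, 3   # coding block: M rows per column, N columns (Embodiment 1 sizes)
K = 3         # horizontal search offsets are -K .. K-1


def column_sad(ref_col, cur_col):
    """One arithmetic unit: SAD of a reference column against a coding-block column."""
    return sum(abs(r - t) for r, t in zip(ref_col, cur_col))


def scan_band(ref, cur, top):
    """Run one pipeline round over the band whose top row is `top`.

    ref: search window as a list of rows; cur: M x N coding block.
    Returns AE for the 2K horizontal offsets, left to right.
    """
    cur_cols = [[cur[m][n] for m in range(M)] for n in range(N)]   # B0 .. B(N-1)
    delays = [0] * (N - 1)          # one-data delay units (11 and 13 when N = 3)
    results = []
    for cycle in range(2 * K + N - 1):                  # timings D0 .. D7
        a = [ref[top + m][cycle] for m in range(M)]     # reference register contents
        partial = [column_sad(a, col) for col in cur_cols]
        acc = partial[0]
        next_delays = []
        for n in range(1, N):
            next_delays.append(acc)                     # latch for the next cycle
            acc = delays[n - 1] + partial[n]            # adder: delayed sum + this unit
        delays = next_delays
        if cycle >= N - 1:          # the first N-1 cycles only fill the pipeline
            results.append(acc)
    return results


def brute_force_sad(ref, cur, top, left):
    """Direct double sum, used only to check the pipeline result."""
    return sum(abs(ref[top + m][left + n] - cur[m][n])
               for m in range(M) for n in range(N))
```

In this model each delay holds a single value regardless of the search width, which is the property the text attributes to the delay units 11 and 13.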
[0072]
By this series of pipeline operations, the band-like area with a height of 4 pixels from the small block D0 to the small block D7 in the search range in FIG. 5 is evaluated in order from left to right. When the calculation of AE_{-4,2} is completed at timing D7 in FIG. 4, the process proceeds to the cycle of timing E0, in which the reference pixel data (a0, a1, a2, a3) = (r_{1,0}, r_{2,0}, r_{3,0}, r_{4,0}) are read and, at the same time, the arithmetic unit 7 calculates AE_{-3,-3,0}, starting a new series of pipeline operations. During this time, the coding block register 2 continues to hold the pixels of the coding block T0. Therefore, in the series of pipeline operations from timing E0 to timing E7, the error amount between the coding block T0 and the prediction block candidates is evaluated in order from left to right in the band-like region with a height of 4 pixels from the small block E0 to the small block E7 in FIG. 6. Comparing the band-like regions evaluated in FIG. 5 and FIG. 6, the region in FIG. 6 is one pixel lower than that in FIG. 5. This is because the calculations for D0 to D7 complete every evaluation that uses the top row of pixels of the search area. Similarly, the calculations for E0 to E7 complete every evaluation that uses the top two rows. Thereafter, each time a new pipeline operation is started, the band-like computation area is lowered by one row, and when the computation of F0 to F7 in FIG. 7, which is the eighth round of pipeline operation, is completed, all error amount computations for the coding block T0 are complete. The motion vector of the coding block T0 is determined by finding the minimum value of the sum of absolute differences obtained during this period.
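The band-by-band search order described above can be sketched as follows. This is a behavioural sketch in Python rather than the circuit itself: `sad`, `full_search`, and the window layout (row 0, column 0 at the upper left corner of the search range) are illustrative assumptions, and the cycle count simply charges 2K + N − 1 cycles per band, matching timings D0 to D7.

```python
M, N = 4, 3      # coding block: M rows, N columns
K, L = 3, 4      # search offsets: horizontal -K..K-1, vertical -L..L-1


def sad(ref, cur, top, left):
    """Sum of absolute differences for the candidate at (top, left) in the window."""
    return sum(abs(ref[top + m][left + n] - cur[m][n])
               for m in range(M) for n in range(N))


def full_search(ref, cur):
    """Visit bands top to bottom and candidates left to right within each band.

    Returns (vector, minimum AE, cycle count); vector is (horizontal, vertical).
    """
    best = None
    cycles = 0
    for i in range(-L, L):                 # one pipeline round per vertical offset
        cycles += 2 * K + N - 1            # columns streamed per round (8 here)
        for j in range(-K, K):
            ae = sad(ref, cur, i + L, j + K)
            if best is None or ae < best[0]:
                best = (ae, (j, i))
    return best[1], best[0], cycles
```

With the sizes used in this embodiment the count comes to 8 rounds of 8 cycles; ties are resolved here in favour of the first candidate visited, which is a choice of this sketch and is not stated in the text.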
Note that the obtained vector is a frame vector when the encoded image and the reference image have a frame structure, and a field vector when the encoded image and the reference image have a field structure.
[0073]
When the calculation of the small block F7 in FIG. 7 is completed, the 12 pixels of the coding block T1 are newly stored in the coding block register 2 according to the relationship of FIG. 3, and at the same time the four pixels at the upper left corner of the search area corresponding to the coding block T1 are stored in the reference register 1, and the pipeline operation is started in the same manner as for the coding block T0. That is, unlike the conventional example, the reference data stored in the reference register 1 is not shared among a plurality of coding blocks; when a new coding block is started, its reference data is newly read at the same time. Therefore, the coding block T1 can be processed immediately after the coding block T0, followed by T2, and the blocks are completed in order without skipping positions on the encoded image.
[0074]
Tables 5 and 6 summarize the circuit scale and processing speed of the first embodiment. Table 5 shows the circuit scale, and Table 6 shows the processing speed. The calculation conditions of Tables 5 and 6 are the same as those of Conventional Example 1 in Tables 1 and 2. Since the first embodiment does not perform parallel operation, the case of FIG. 1 in Tables 5 and 6 is compared with the case of Q = 1 (that is, one series) of Conventional Example 1 in Tables 1 and 2. In this comparison, the present embodiment halves the number of flip-flops required and also reduces the number of cycles required per block. The number of registers is halved because the delay units 11 and 13, which connect the arithmetic units 7 to 9 to form the pipeline, are always configured with one pixel regardless of the size of the search range. The calculation speed is improved because the loss cycles in which the arithmetic units are not executing valid calculations are reduced. In the 480i and 1080i examples, since parallel processing is not performed, the number of processing cycles increases by factors of about 8.5 and about 15, respectively, compared with Conventional Example 1, whereas the number of flip-flops is only about 2.5% and about 0.7%, respectively.
[0075]
[Table 5]

[0076]
[Table 6]

As described above, according to the first embodiment, one column of a prediction block candidate is treated as a small block, and the arithmetic units perform pipeline computation while the small block is shifted sequentially in the horizontal direction on the reference screen. This reduces the loss cycles in which no valid computation is executed and realizes high-speed processing. In addition, because the arithmetic units are connected by delay units of only one pixel regardless of the size of the search range, the number of registers is minimized and the device can be realized with a small circuit scale. Furthermore, since the coding blocks to be processed are completed in order at the consecutive positions T0, T1, and T2 on the image, the encoding processing executed after motion vector detection is easy to configure, and a multiplexed configuration is not presupposed. Therefore, in cases such as applications that do not require high-speed processing, the motion vector detection device can be realized with an extremely small circuit.
[0077]
(Embodiment 2)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0078]
FIG. 8 is a block diagram showing the motion vector detection device of the present embodiment.
[0079]
The second embodiment also relates to the first and second aspects of the present invention described above, and corresponds to the case where the M pixel data forming one set in the first and second aspects are arranged in one horizontal row. Also in the second embodiment, the sum of absolute differences is adopted as the error amount between the prediction block candidate and the coding block. (Equation 4) shows the definition of the sum of absolute differences in the second embodiment; the handling of M and N is reversed from the definition shown in (Equation 1) because the unit of the small block is different. That is, M is the horizontal size of the coding block, and N is the vertical size. The sum of absolute differences AE_{i,j} of (Equation 4) is a double sum over m and n. It can also be obtained in two stages: the sum over m shown in (Equation 5), that is, the sum of absolute differences AE_{i,j,n} within one row, followed by the sum over n shown in (Equation 6). In the second embodiment, the sum of absolute differences between the coding block and the prediction block candidate is calculated using the relationship of (Equation 5) and (Equation 6). In the second embodiment, the block size is horizontal M = 3 and vertical N = 4, and the search area is K = 3 in the horizontal direction, that is, a range of −3 to 2, and L = 4 in the vertical direction, that is, a range of −4 to 3.
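The equation images for (Equation 4) to (Equation 6) are not reproduced in this text. A reconstruction consistent with the description above might read as follows. The index order (first subscript vertical, second subscript horizontal, relative to the block position) and the summation limits are assumptions inferred from (Equation 1) and from the pixel indices used in this embodiment, not a copy of the original formulas.

```latex
AE_{i,j} = \sum_{n=0}^{N-1} \sum_{m=0}^{M-1} \left| r_{n+i,\,m+j} - t_{n,m} \right|
\qquad (-L \le i \le L-1,\; -K \le j \le K-1) \tag{4}

AE_{i,j,n} = \sum_{m=0}^{M-1} \left| r_{n+i,\,m+j} - t_{n,m} \right| \tag{5}

AE_{i,j} = \sum_{n=0}^{N-1} AE_{i,j,n} \tag{6}
```

Under this reading, (5) is the per-row partial sum produced by one arithmetic unit and (6) is the accumulation performed across the N rows.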
[0080]
[Equation 4]

[0081]
[Equation 5]

[0082]
[Equation 6]

Hereinafter, the configuration of the second embodiment will be described with reference to FIG. 8. Parts identical to those of the first embodiment in FIG. 1 are denoted by the same reference numerals, and their description is omitted.
[0083]
In FIG. 8, the reference register 101 temporarily stores three pixels a0 to a2 of the reference image and outputs them as one set of reference data A. The coding block register 102 divides a coding block of 12 pixels into four sets of three data arranged in one horizontal row. It consists of a coding small block register 103 that stores b00 to b02, corresponding to the small block in the zeroth row, a coding small block register 104 that stores b10 to b12, corresponding to the small block in the first row, a coding small block register 105 that stores b20 to b22, corresponding to the small block in the second row, and a coding small block register 106 that stores b30 to b32, corresponding to the small block in the third row, and outputs their contents as the four sets of encoded data B0, B1, B2, and B3. The arithmetic block 107 is composed of arithmetic units 108 to 111, which receive the reference data A and the encoded data B0 to B3, respectively, and output the sums of absolute differences corresponding to (Equation 5). The outputs of the arithmetic units 108 to 111 are connected to the cumulative addition array 112, which outputs the sum of absolute differences corresponding to (Equation 6). The cumulative addition array 112 is obtained by adding a delay unit 113 and an adder 114 to the cumulative addition array 10 shown in FIG. 1 of the first embodiment. FIG. 9 is a block diagram showing the internal configuration of the arithmetic unit 108. The arithmetic unit 108 is obtained by deleting the absolute difference calculator 18 and the adder 20 from the arithmetic unit 7 of the first embodiment. The arithmetic units 109 to 111 have the same configuration as the arithmetic unit 108 in FIG. 9, differing only in that the encoded data input changes from B0 to B1, B2, and B3, respectively, so their description is omitted.
[0084]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0085]
FIG. 10 is a timing chart showing details of the operation, and FIGS. 11 to 13 are region relationship diagrams showing the positional relationship that the data held in the reference register 101 occupies in the search range of the reference image.
[0086]
In the second embodiment, for the motion vector detection of the coding block T0, calculation of the error amount starts from the vector (−3, −4), that is, from the upper left corner of the search range. In the cycle of timing G0 in FIG. 10, the reference pixel data (a0, a1, a2) = (r_{0,0}, r_{0,1}, r_{0,2}) are read into the reference register 101 and output as one output set A. The coding small block register 103 holds (b00, b01, b02) = (t_{4,3}, t_{4,4}, t_{4,5}), the coding small block register 104 holds (b10, b11, b12) = (t_{5,3}, t_{5,4}, t_{5,5}), the coding small block register 105 holds (b20, b21, b22) = (t_{6,3}, t_{6,4}, t_{6,5}), and the coding small block register 106 holds (b30, b31, b32) = (t_{7,3}, t_{7,4}, t_{7,5}), and these are output as the four output sets B0, B1, B2, and B3. The difference from the first embodiment is that the reference register 101 stores one small block whose unit is a row of a prediction block candidate, and the coding small block registers 103 to 106 store four small blocks whose unit is a row of the coding block T0 and output them as four output sets. When the error amount calculation starts, the arithmetic unit 108 calculates AE_{-4,-3,0} based on (Equation 5) in the cycle of timing G0, and at timing G1 the arithmetic units 108 and 109 calculate AE_{-3,-3,0} and AE_{-4,-3,1}, respectively. Thereafter, the arithmetic units 108 to 111 sequentially calculate the sums of absolute differences AE_{i,j,n} between the coding block T0 and the prediction block candidates. Meanwhile, the cumulative addition array 112 forms a pipeline by adding the results of the arithmetic units 108 to 111 while delaying them by one cycle each, and calculates the sum of absolute differences AE_{i,j}.
[0087]
The above is similar to the operation shown in FIG. 4 of the first embodiment, but in the second embodiment the method of updating the data stored in the reference register 101 differs from that of the first embodiment. At timing G0, the reference register 101 stores the small block G0, consisting of the three horizontal pixels at the upper left corner of the search area in FIG. 11, and the error amount calculation starts. At timing G1 the register moves to the small block G1, and the error amount calculation is executed as a series of pipeline operations while moving down the band-like region with a width of 3 pixels from top to bottom, until the small block G10 at the lower end of the search region is reached at timing G10. When this series of pipeline operations is completed at timing G10, the process proceeds to the cycle of timing H0, in which the reference pixel data (a0, a1, a2) = (r_{0,1}, r_{0,2}, r_{0,3}) are read and a new series of pipeline operations is started. During this time, the coding block register 102 continues to hold the pixels of the coding block T0. In the series of pipeline operations from timing H0 to timing H10, the band-like region with a width of 3 pixels from the small block H0 to the small block H10 in FIG. 12 is processed in order from top to bottom, and the error amount between the coding block T0 and the prediction block candidates is evaluated. Thereafter, each time a series of pipeline operations is completed, the band-like computation area moves to the right by one column, and when the computation of the small blocks I0 to I10 in FIG. 13 is completed, the error amount calculation for the coding block T0 is complete. The motion vector of the coding block T0 is determined by finding the minimum value of the sum of absolute differences obtained during this period.
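The row-wise variant can be modelled in the same spirit as the column-wise one. The following Python sketch is an assumption-laden illustration, not the circuit: `row_sad`, `scan_column_band`, and `search` are hypothetical names, the window is indexed from the upper left corner of the search range, and M and N follow the convention of this embodiment (M horizontal, N vertical).

```python
M, N = 3, 4   # Embodiment 2 convention: M = block width, N = block height
K, L = 3, 4   # search offsets: horizontal -K..K-1, vertical -L..L-1


def row_sad(ref_row, cur_row):
    """One arithmetic unit: SAD of M reference pixels against one coding-block row."""
    return sum(abs(r - t) for r, t in zip(ref_row, cur_row))


def scan_column_band(ref, cur, left):
    """One pipeline round: move the M-pixel row register down the band at `left`.

    Returns AE for the 2L vertical offsets, top to bottom.
    """
    delays = [0] * (N - 1)                       # one-data delays between adders
    results = []
    for cycle in range(2 * L + N - 1):           # timings G0 .. G10
        a = ref[cycle][left:left + M]            # reference register contents
        partial = [row_sad(a, cur[n]) for n in range(N)]
        acc = partial[0]
        next_delays = []
        for n in range(1, N):
            next_delays.append(acc)              # latch for the next cycle
            acc = delays[n - 1] + partial[n]     # delayed sum + this unit's output
        delays = next_delays
        if cycle >= N - 1:                       # first N-1 cycles fill the pipeline
            results.append(acc)
    return results


def search(ref, cur):
    """Shift the band one column to the right per round and keep the minimum AE."""
    best = None
    for left in range(2 * K):
        for top, ae in enumerate(scan_column_band(ref, cur, left)):
            if best is None or ae < best[0]:
                best = (ae, (left - K, top - L))  # vector is (horizontal, vertical)
    return best[1], best[0]
```

Only the direction in which the reference register walks through the window changes relative to the first embodiment, which is why the choice between the two can depend on how the reference memory is most cheaply read.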
[0088]
Since the circuit scale and calculation speed of the second embodiment are almost the same as those of the first embodiment described above, description thereof is omitted.
[0089]
As described above, according to the second embodiment, one horizontal row of a prediction block candidate is treated as a small block, and the arithmetic units perform pipeline computation while the small block is shifted sequentially in the vertical direction on the reference screen. This reduces the loss cycles in which no valid computation is executed and realizes high-speed processing. In addition, because the arithmetic units are connected by delay units of only one pixel regardless of the size of the search range to form the pipeline, the number of registers is minimized and the device can be realized with a small circuit scale. Furthermore, since the coding blocks to be processed are completed in order at the consecutive positions T0, T1, and T2 on the image, a multiplexed configuration is not presupposed, and in cases such as applications that do not require high-speed processing, the motion vector detection device can be realized with an extremely small circuit.
[0090]
The first embodiment and the second embodiment have exactly the same effect but differ in the method of reading the reference image. Therefore, whichever of the two is more advantageous can be selected and implemented according to the characteristics and operating conditions of the reference image storage medium, such as a memory, connected to the motion vector detection device.
[0091]
(Embodiment 3)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0092]
FIG. 14 is a block diagram showing the motion vector detection device of the present embodiment.
[0093]
Embodiment 3 relates to the third and fourth aspects of the present invention described above, and corresponds to the case where the reference data of the third and fourth aspects is M + Q − 1 data in one vertical column. In the third embodiment, the sum of absolute differences shown in (Equation 1), (Equation 2), and (Equation 3) is adopted as the error amount between the prediction block candidate and the coding block. The size of the coding block is horizontal N = 3 and vertical M = 4, and the search region is K = 3 in the horizontal direction, that is, a range of −3 to 2, and L = 4 in the vertical direction, that is, a range of −4 to 3.
[0094]
The configuration of the third embodiment will be described with reference to FIG. 14. Parts identical to those of the first embodiment shown in FIG. 1 are denoted by the same reference numerals, and their description is omitted.
[0095]
The reference register 201 temporarily stores five pixels a0 to a4 of the reference image and, taking four consecutive pixels as one set, outputs the reference data Aa = (a0, a1, a2, a3) as one output set and the reference data Ab = (a1, a2, a3, a4) as another output set. The arithmetic block 6 receives the reference data Aa and the encoded data B0 to B2. The arithmetic block 202 has the same configuration as the arithmetic block 6 and receives the reference data Ab and the encoded data B0 to B2. The cumulative addition array 203 has the same configuration as the cumulative addition array 10 and cumulatively adds the three outputs of the arithmetic block 202.
[0096]
In the configuration of the third embodiment, the portion composed of a0 to a3 of the reference register 201, the coding block register 2, the arithmetic block 6, and the cumulative addition array 10 is the same as the configuration of the first embodiment (see FIG. 1), and the portion composed of a1 to a4 of the reference register 201, the coding block register 2, the arithmetic block 202, and the cumulative addition array 203 is also the same as the configuration of the first embodiment (see FIG. 1). That is, by adding one pixel to the reference register 1 of FIG. 1 to form the reference register 201, two motion vector detection devices are combined while sharing a1 to a3 of the reference register 201 and the coding block register 2.
[0097]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0098]
FIG. 15 is a timing chart showing details of the operation, and FIGS. 16 to 18 are region relationship diagrams showing the positional relationship that the data held in the reference register 201 occupies in the search range of the reference image.
[0099]
In the third embodiment, the motion vector detection operation for the coding block T0 starts in the cycle of timing J0 in FIG. 15. In the period from timing J0 to J7 in FIG. 15, the series of pipeline operations by the arithmetic block 6 and the cumulative addition array 10, which receive the reference data Aa and the encoded data B0 to B2, is exactly the same as the pipeline operation in cycles D0 to D7 shown in FIG. 4 of the first embodiment, and the position of the reference data Aa in the search area during this period is the same as the small blocks D0 to D7 in FIG. 5 of the first embodiment. In the same period, the series of pipeline operations by the arithmetic block 202 and the cumulative addition array 203, which receive the reference data Ab and the encoded data B0 to B2, is exactly the same as the pipeline operation in cycles E0 to E7 shown in FIG. 4 of the first embodiment, and the position of the reference data Ab in the search area is the same as the small blocks E0 to E7 in FIG. 6 of the first embodiment. That is, the sum-of-absolute-differences computation that took two rounds of pipeline operation, or 16 cycles, in the first embodiment is completed in the 8 cycles J0 to J7 in FIG. 15 by the parallel processing of two pipeline systems. During this period, the reference data stored in the reference register 201 covers the band-like region with a height of 5 pixels indicated by the small blocks J0 to J7 in FIG. 16. Starting from the state in which the reference register 201 stores the five-pixel small block J0 at the left end, the stored data is shifted sequentially to the right, and the two pipeline systems operate in parallel. When the calculation of the small block J7 in FIG. 16 is completed, the reference pixels in the top two rows of the search area are no longer needed for the remaining error amount calculations for the coding block T0. The reference pixels stored in the reference register 201 therefore become (a0, a1, a2, a3, a4) = (r_{2,0}, r_{3,0}, r_{4,0}, r_{5,0}, r_{6,0}), and a new pipeline operation with two-system parallel processing is started. As shown by the small blocks P0 to P7 in FIG. 17, this pipeline operation processes, from left to right, the band-like region with a height of 5 pixels located 2 pixels below the upper end. In this way, each time a pipeline operation is completed, processing proceeds downward by two rows. The state shown in FIG. 18 is reached in the fourth round of pipeline operation, and when the calculation of the small block S7 is completed, the error amount calculations for all prediction block candidates of the coding block T0 are complete. The motion vector of the coding block T0 is determined by finding the minimum value of the sums of absolute differences output by the cumulative addition array 10 and the cumulative addition array 203 during this period.
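The sharing of one (M + Q − 1)-pixel reference register between Q pipeline systems can be illustrated with a behavioural sketch. The Python below is an illustration under stated assumptions, not the circuit: the names `scan_band_parallel` and `search_parallel` are hypothetical, the window is indexed from the upper left corner of the search range, and the sketch assumes that the number of systems q divides the number of vertical offsets 2L.

```python
M, N = 4, 3   # coding block: M rows, N columns
K, L = 3, 4   # search offsets: horizontal -K..K-1, vertical -L..L-1


def column_sad(ref_col, cur_col):
    """One arithmetic unit: SAD of M reference pixels against one coding-block column."""
    return sum(abs(r - t) for r, t in zip(ref_col, cur_col))


def scan_band_parallel(ref, cur, top, q=2):
    """One pipeline round with q systems fed from one (M+q-1)-pixel register.

    Returns q lists; list s holds AE for the band whose top row is top + s.
    """
    cur_cols = [[cur[m][n] for m in range(M)] for n in range(N)]   # shared B0..B2
    delays = [[0] * (N - 1) for _ in range(q)]   # one delay chain per system
    results = [[] for _ in range(q)]
    for cycle in range(2 * K + N - 1):           # timings J0 .. J7
        reg = [ref[top + p][cycle] for p in range(M + q - 1)]   # a0..a4 when q = 2
        for s in range(q):
            a = reg[s:s + M]                     # Aa for s = 0, Ab for s = 1
            partial = [column_sad(a, col) for col in cur_cols]
            acc = partial[0]
            next_delays = []
            for n in range(1, N):
                next_delays.append(acc)
                acc = delays[s][n - 1] + partial[n]
            delays[s] = next_delays
            if cycle >= N - 1:                   # skip the pipeline-fill cycles
                results[s].append(acc)
    return results


def search_parallel(ref, cur, q=2):
    """Full search; assumes q divides 2L. Returns (vector, minimum AE, cycles)."""
    best, cycles = None, 0
    for top in range(0, 2 * L, q):               # bands advance q rows per round
        cycles += 2 * K + N - 1
        for s, row in enumerate(scan_band_parallel(ref, cur, top, q)):
            for left, ae in enumerate(row):
                if best is None or ae < best[0]:
                    best = (ae, (left - K, top + s - L))
    return best[1], best[0], cycles
```

In this model the coding-block columns and all but one pixel per extra system of the reference register are shared, so doubling the number of systems halves the number of rounds while adding only one register pixel and one delay chain.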
[0100]
When the calculation of the small block S7 in FIG. 18 is completed, the 12 pixels of the coding block T1 are newly stored in the coding block register 2, the five pixels at the upper left of the search area corresponding to the coding block T1 are stored in the reference register 201, and the pipeline operation for the coding block T1 is started in exactly the same way as for the coding block T0. That is, as in the first and second embodiments, the coding blocks T0, T1, and T2 are completed in order without skipping positions on the image.
[0101]
[Table 7]

[0102]
[Table 8]

Table 7 and Table 8 summarize the circuit scale and processing speed of the third embodiment. Table 7 shows the circuit scale, and Table 8 shows the processing speed. The calculation conditions in Tables 7 and 8 are the same as those of Conventional Example 1 in Tables 1 and 2 and of the first embodiment in Tables 5 and 6. First, the case of FIG. 1 of the first embodiment shown in Tables 5 and 6 is compared with the case of FIG. 14 of the third embodiment shown in Tables 7 and 8. In the case of FIG. 14, parallel processing with the number of multiplexed sequences Q = 2 doubles the calculation speed, while the registers increase only slightly: the pixel value register number S by one pixel and the data register number U by two data. Parallel processing is thus realized extremely efficiently. Next, compared with the case of FIG. 35 of Conventional Example 1 in Tables 1 and 2 with the number of multiplexed sequences Q = 2, the calculation speed is comparable while the number of flip-flops is only one third, which is a remarkable effect. This difference becomes more conspicuous with actual video block sizes and search ranges. In the 480i example, the processing speed is the same as in Conventional Example 1, but the number of flip-flops is about 3.7%. In the 1080i example, the processing speed is more than doubled while the number of flip-flops is about 2.1%, which is a dramatic effect. This is due to the following two points. First, in the conventional method, the coding block registers, which require a large number of flip-flops, must be multiplexed to achieve a parallel configuration, whereas in the present invention all multiplexed sequences operate on the same coding block, so a register storing only one coding block suffices. Second, the operation data registers must increase in proportion to the number of series in both the present invention and the prior art. The prior art, however, requires operation data registers proportional to the search range, so that a huge scale results from the combined effect of the number of series and the search range, whereas in the present invention the operation data register always consists of only one data regardless of the size of the search range.
[0103]
As described above, in the third embodiment of the present invention, in accordance with the third and fourth aspects of the present invention, the reference register 201 temporarily stores M + Q − 1 pixels of the reference image and outputs Q sets of reference data, each set consisting of M consecutive pixels, and has a control function that extracts and stores the reference data while shifting it sequentially in the horizontal direction on the reference screen. The error amounts between one coding block and a plurality of prediction block candidates are obtained simultaneously by a plurality of pipeline operations, so that the coding block register and the reference register can be shared and a multiplexed parallel processing circuit can be configured. Moreover, the arithmetic data delay units, which increase in proportion to the number of series, can always consist of the minimum of one data each. A motion vector detection device capable of parallel computation can thus be configured. Further, as in the first and second embodiments, all the arithmetic units have few loss cycles, and all Q sequences operate at the same time, so the calculation speed per coding block is increased exactly Q times. Furthermore, in the present invention, the coding blocks are processed one at a time. Tables 7 and 8 illustrate the cases of the number of multiplexed sequences Q = 2, 8, and 32. In the present invention, the setting of the number of multiplexed sequences is not affected at all by the size of the coding block, the size of the search range, and so on, and the number of sequences Q can be set arbitrarily, including 1. Therefore, an appropriate number of multiplexed sequences can be selected freely according to the requirements on circuit scale and processing speed, and a motion vector detection device suited to the application and conditions can be configured.
[0104]
(Embodiment 4)
First, the configuration and operation of the motion vector detection device of the present embodiment will be described.
[0105]
FIG. 19 is a block diagram showing a motion vector detection apparatus according to the present embodiment.
[0106]
Whereas the third embodiment corresponds to the case where the reference data of the third and fourth aspects of the present invention is M + Q − 1 data in one vertical column, the fourth embodiment corresponds to the case where the reference data of the third and fourth aspects of the present invention is M + Q − 1 data in one horizontal row.
[0107]
In FIG. 19, the reference register 301 is a register that extends the reference register 101 of FIG. 8 by one pixel and outputs the reference data Aa and the reference data Ab, each a set of three consecutive pixels. The cumulative addition array 303 has the same configuration as the cumulative addition array 112. In FIG. 19, parts identical to those of the second embodiment shown in FIG. 8 are denoted by the same reference numerals, and their description is omitted.
[0108]
The fourth embodiment realizes parallel processing with respect to the second embodiment in exactly the same way as the third embodiment realizes parallel processing with respect to the first embodiment, so a description of its operation is omitted.
[0109]
The fourth embodiment and the third embodiment have exactly the same effect but differ in the method of reading the reference image. Therefore, the more suitable of the two can be selected according to the characteristics and operating conditions of the reference image storage medium, such as a memory, connected to the motion vector detection device.
[0110]
(Embodiment 5)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0111]
FIG. 20 is a block diagram showing the motion vector detection device of the present embodiment.
[0112]
In the fifth embodiment, the techniques of the fifth and sixth aspects of the present invention are applied on the basis of the first embodiment. Therefore, as in the first embodiment, the sum of absolute differences shown in (Equation 1), (Equation 2), and (Equation 3) is adopted as the error amount between the prediction block candidate and the coding block. The size of the coding block is horizontal N = 3 and vertical M = 4, and the search area is K = 3 in the horizontal direction, that is, a range of −3 to 2, and L = 4 in the vertical direction, that is, a range of −4 to 3.
[0113]
The configuration of the fifth embodiment will be described with reference to FIG. 20. Parts identical to those of the first embodiment in FIG. 1 are denoted by the same reference numerals, and their description is omitted. The reference register 401 is a second reference register, newly added alongside the reference register 1, which serves as the first reference register; its output is one set of reference data C. The coding block register 402 is composed of coding small block registers 403 to 405. It differs from the configuration of the first embodiment in that the timing for reading the pixel data of a new coding block can be controlled independently for each of the three coding small block registers 403 to 405. The arithmetic block 406 receives the reference data C in addition to the encoded data B0, B1, B2 and the reference data A. Either the reference data A or the reference data C is selected by the switch 407, which is a reference data changeover switch, and input to the arithmetic unit 7; likewise, the data selected by the switch 408 is input to the arithmetic unit 8, and the data selected by the switch 409 is input to the arithmetic unit 9. The switches 407, 408, and 409 are controlled independently. The mode control unit 410 is a control unit that controls the read timing of the coding small block registers 403 to 405 and the switching of the switches 407 to 409.
[0114]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0115]
FIGS. 21 and 22 are timing charts showing the operation of the fifth embodiment. The symbols D, E, and F in the figures correspond to the same symbols used in the description of the first embodiment.
[0116]
In the fifth embodiment, the operation starts from a state in which the switches 407 to 409 all select the reference data A. In the cycle of timing D0 in FIG. 21, the coding block register 402 has completed storage of the coding block T0 and outputs the three output sets B0, B1, and B2 in units of columns, and the reference register 1 has completed storage of the small block D0, which is the four pixels at the upper left of the search range in FIG. 5, and outputs it as the reference data A. Since all the switches 407 to 409 select the reference data A, the arithmetic unit 7 first calculates the sum of absolute differences AE_{-4,-3,0} at timing D0. Thereafter, the reference data A shifts sequentially from left to right through the band-like area shown in FIG. 5, and the arithmetic units 7 to 9 and the cumulative addition array 10 form a pipeline that calculates the sums of absolute differences AE_{i,j} in the same manner as in the first embodiment. Meanwhile, the reference register 401 stores the small block E0, the four vertical pixels at the left end of the band-like area shown in FIG. 6, in the cycle E0 in FIG. 21 and starts outputting it as one set of reference data C. Thereafter, the reference register 401 outputs the reference data while shifting sequentially from left to right through the band-like region of FIG. 6. In FIG. 21, the period D6 to D7 of the reference register 1 overlaps in time with the start of the period E0 to E7 of the reference register 401, so that for two cycles the reference data A and the reference data C are output simultaneously.
[0117]
Here, attention is paid to the arithmetic unit 7. For the small blocks D0 to D5 in the band-like area of FIG. 5, the arithmetic unit 7 calculates the effective sums of absolute differences AE_{-4,-3,0} to AE_{-4,2,0}. For the small blocks D6 and D7, however, the results would be AE_{-4,3,0} and AE_{-4,4,0}; since these correspond to the vector (3, -4) and the vector (4, -4), they are outside the search range and need not be calculated. When the mode control unit 410 detects that the effective calculation is completed in the cycle D5, it controls the switch 407 to select the reference data C in the following cycle. Since the small block E0 of FIG. 6 is output as the reference data C in the cycle of the switch, that is, E0, the arithmetic unit 7 calculates the sum of absolute differences of C = (r_{1,0}, r_{2,0}, r_{3,0}, r_{4,0}) and B0, that is, AE_{-3,-3,0}. During this cycle, the mode control unit 410 keeps the switch 408 and the switch 409 selecting the reference data A. That is, the arithmetic unit 7 operates on the small block E0 in FIG. 6 while the arithmetic units 8 and 9 simultaneously operate on the small block D6 in FIG. 5. In the subsequent cycle D7, the mode control unit 410 switches the switch 408 to the reference data C as well, so the arithmetic units 7 and 8 operate on the small block E1 in FIG. 6 while the arithmetic unit 9 simultaneously operates on the small block D7 in FIG. 5. As a result, whereas the operation of the first embodiment in FIG. 4 incurs two loss cycles in the three arithmetic units each time the pipeline is switched, the operation of the fifth embodiment in FIG. 21 incurs no loss. Once the calculation of a coding block is started, all the arithmetic units perform only valid calculations thereafter, so the pipeline calculation can be executed without any gap.
[0118]
As described above, the fifth embodiment, in accordance with the fifth aspect of the present invention, has a first mode in which the reference register 1 sequentially updates data from the small block D0 to the small block D7 in FIG. 5 and supplies the reference data A to the arithmetic block 406, and a second mode in which the reference register 401 sequentially updates data from the small block E0 to the small block E7 in FIG. 6 and supplies the reference data C. By providing the mode control unit 410, which at the time of mode transition switches the reference data from A to C in order, starting from the arithmetic unit 7, as each unit completes its effective calculation, the calculation can continue with maximum efficiency without generating a loss cycle in the pipeline calculation. In the example of FIG. 21, one round of pipeline operation, which required 8 cycles in the case of FIG. 4, is shortened to 6 cycles, and a further speedup can be realized.
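The loss cycles and their removal can be illustrated with a small cycle-level model. The sketch below is not taken from the patent. It assumes the small dimensions used in the text (M = 3 columns, horizontal search range -K to K-1 with K = 3), numbers the arithmetic units 7, 8, and 9 as n = 0, 1, 2, and assumes that successive bands alternate between reference data A and C. All function and variable names are hypothetical.

```python
def loss_table_shared_reference(K=3, M=3):
    """Validity of each unit's result when every unit receives the same
    reference stream (first-embodiment style). True = effective cycle."""
    band_cycles = 2 * K + M - 1              # D0..D7 when K = 3, M = 3
    table = []
    for n in range(M):                       # unit n handles coded-block column n
        row = []
        for c in range(band_cycles):         # cycle c delivers small block Dc
            j = c - n - K                    # horizontal vector component
            row.append(-K <= j <= K - 1)     # inside the search range?
        table.append(row)
    return table


def selected_reference(cycle, n, K=3):
    """Reference data chosen by the switch of unit n when each unit moves to
    the next band as soon as its own 2K effective cycles are done
    (fifth-embodiment style). Assumption: bands alternate A, C, A, C, ..."""
    if cycle < n:
        return None                          # pipeline start-up
    band = (cycle - n) // (2 * K)
    return "A" if band % 2 == 0 else "C"


table = loss_table_shared_reference()
losses_per_unit = [row.count(False) for row in table]
speedup = (2 * 3 + 3 - 1) / (2 * 3)          # 8 cycles per band reduced to 6
```

Under these assumptions each unit idles for two of the eight cycles when the reference stream is shared, and at cycles 6 and 7 (D6/E0 and D7/E1) the switches select C, A, A and then C, C, A, which matches the behaviour described for FIG. 21. The ratio 8/6 corresponds to the shortening from 8 cycles to 6 cycles stated above.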
[0119]
In the fifth embodiment, as described above, when calculation of a coding block, for example, T0 is started, calculation can be continued with maximum efficiency without any loss cycle thereafter. Next, the operation when the calculation of a certain coding block is completed and the calculation of the next coding block is started will be described.
[0120]
In the timing chart of FIG. 22, the cycles from F0 to F7 are the last operation part for the coding block T0, and at the start of FIG. 22 the encoding block register 402 stores the coding block T0. While the reference register 401 outputs the small blocks F6 and F7 in FIG. 7, that is, the reference data for the last two cycles of the search range of the coding block T0, the reference register 1 outputs the small blocks D0 and D1 in FIG. 5 within the search range of the coding block T1. When the mode control unit 410 determines in the cycle F5 of FIG. 22 that the effective calculation of the arithmetic unit 7 has been completed, it controls the switch 407 in the following cycle so that the reference data A is selected. This is the operation according to the fifth aspect of the present invention explained with reference to FIG. 21. In this switching cycle D0, the mode control unit 410 controls the encoded small block register 403 in synchronization with the switching of the switch 407, so that the register reads and stores the leftmost small block of the coding block T1 and outputs it to B0. That is, B0 = (t_{4,6}, t_{5,6}, t_{6,6}, t_{7,6}). On the other hand, since the encoded small block registers 404 and 405 hold their stored data, in the cycle D0 of FIG. 22 B0 belongs to the coding block T1 while B1 and B2 belong to the coding block T0. The arithmetic unit 7 is therefore supplied with the encoded data B0 of the coding block T1 and, from the output A, the small block D0, which is the reference data at the upper left corner of the search range of the coding block T1. Consequently, the arithmetic unit 7 calculates AE_{-4,-3,0} for the coding block T1, while the arithmetic units 8 and 9 calculate AE_{3,2,1} and AE_{3,1,2} of the coding block T0, respectively. Subsequently, in the cycle D1, the arithmetic units 7 and 8 operate on the small block D1 of FIG. 5 for the coding block T1, and the arithmetic unit 9 operates on the small block F7 of FIG. 7 for the coding block T0, whereupon all the sum-of-absolute-differences calculations for the coding block T0 are completed. Further, in the cycle D2, the entire encoding block register 402 has been switched to the coding block T1, and the transition is finished. As described above, even in the transition from the coding block T0 to T1, effective operations continue in all the arithmetic units at all times, no loss cycle occurs, and pipeline operations can be executed without gaps.
[0121]
As described above, according to the fifth embodiment and in accordance with the sixth aspect of the present invention, while the reference register 1 sequentially updates the data from the small block D0 to the small block D7 in FIG. 5, the data of a new coding block is stored in the encoding block register 402 one small block at a time, in synchronization with the switching operation of the reference data switches 407 to 409. As a result, the pipeline operation incurs no loss cycle even at the transition between coding blocks, the computation can continue with maximum efficiency, and a further speedup can be realized, for example, when there are many coding blocks.
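The staggered transition between coding blocks can be sketched in the same style. This is an illustrative model only, assuming the dimensions used in the text (K = 3, L = 4, three arithmetic units numbered n = 0, 1, 2 for units 7, 8, 9), gap-free operation, and alternating use of the two reference registers. The function name and the tuple layout are hypothetical.

```python
def unit_assignment(cycle, n, K=3, L=4):
    """Return (block_index, band_index, reference) handled by unit n at the
    given cycle, assuming 2K effective cycles per band and 2L bands per
    coding block. Returns None during pipeline start-up."""
    if cycle < n:
        return None
    step = cycle - n                         # effective cycles completed by unit n
    band_total = step // (2 * K)             # bands completed since the start
    block = band_total // (2 * L)            # 0 = T0, 1 = T1, ...
    band = band_total % (2 * L)              # band within the coding block
    # Assumption: the two reference registers are used alternately; 2L is
    # even, so every coding block starts on reference data A.
    reference = "A" if band_total % 2 == 0 else "C"
    return block, band, reference


first_t1_cycle = (2 * 3) * (2 * 4)           # unit 0 finishes T0 after 48 cycles
transition = [[unit_assignment(first_t1_cycle + d, n) for n in range(3)]
              for d in range(3)]
```

In this model the three rows of `transition` correspond to the cycles D0, D1, and D2 of FIG. 22: first only unit 0 works on T1, then units 0 and 1, and finally all three, while the remaining units finish the last band of T0 on reference data C.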
[0122]
The fifth embodiment applies the techniques of the fifth and sixth aspects of the present invention to the case where the coding block is decomposed in units of columns as in the first embodiment and no multiplexing is performed. However, the techniques can be applied in exactly the same manner as in the fifth embodiment to any of the following: the case where the coding block is decomposed in units of rows without multiplexing, as in the second embodiment; the case where it is decomposed in units of columns in a multiplexed configuration, as in the third embodiment; and the case where it is decomposed in units of rows in a multiplexed configuration, as in the fourth embodiment. In every case, no loss cycle occurs either during the processing of a coding block or at the transition between coding blocks, and the pipeline computation has no gaps, so the computation can be executed with maximum efficiency and high-speed operation can be realized.
[0123]
[Table 9]

[0124]
[Table 10]

The above effects are summarized numerically in Tables 9 and 10. Table 9 shows the circuit scale of the fifth embodiment, and Table 10 shows its processing speed. The calculation conditions in Tables 9 and 10 are the same as those for Conventional Example 1 in Tables 1 and 2, for Embodiment 1 in Tables 5 and 6, and for Embodiment 3 in Tables 7 and 8. Compared with Conventional Example 1 in Tables 1 and 2, both the circuit scale and the processing speed are dramatically improved. The reasons have already been described for the first and third embodiments, so the description is omitted here; instead, the effects of the techniques of the fifth and sixth aspects of the present invention are confirmed by comparing the first and fifth embodiments.
[0125]
First, the case of FIG. 1 of Embodiment 1 shown in Tables 5 and 6 is compared with the case of FIG. 20 of Embodiment 5 shown in Tables 9 and 10. The circuit scale of Embodiment 5 is about 22% larger than that of Embodiment 1, but the processing speed is about 1.33 times higher. Next, the example of Q = 2 in the fifth embodiment is the case where the techniques of the fifth and sixth aspects of the present invention are applied to the third embodiment in FIG. 14. In the example of Q = 2, the circuit scale is again about 23% larger than that of the third embodiment (the example of FIG. 14) in Tables 7 and 8, but the processing speed is about 1.33 times higher. In the example of 480i, the circuit scale of the fifth embodiment is about 5% larger than that of the third embodiment in Tables 7 and 8, but the processing speed is about 1.12 times higher. In the example of 1080i, the circuit scale is about 5% larger than that of the third embodiment in Tables 7 and 8, but the processing speed is about 1.06 times higher.
[0126]
In the case of the fifth embodiment, that is, when the techniques of the fifth and sixth aspects of the present invention are used, effective sums of absolute differences are obtained in every cycle, without gaps and without duplication, and nothing is wasted even in combination with parallel computation, so the theoretical maximum speed of pipeline operation is realized. These techniques entail a slight increase in circuit scale, but they are particularly effective in applications that require high-speed operation.
[0127]
Finally, the fifth and sixth aspects of the present invention, as embodied in the fifth embodiment, appear similar in configuration to the conventional example 2, so the differences are described below.
[0128]
First, apparent similarities relating to the fifth aspect of the present invention will be described. The conventional example 2 has two sets of reference registers, the registers 801 to 804 and the registers 838 to 842 in FIG. 42, which are selected by the selectors 843 to 846. The present invention likewise has two sets of reference registers, the reference register 1 and the reference register 401 in FIG. 20, which are selected by the switches 407 to 409. However, in the conventional example 2 the selectors 843 to 846 are common to all the arithmetic units, and the selected reference data is supplied to all the arithmetic units, whereas in the present invention each arithmetic unit has its own switch, and the reference data supplied to each arithmetic unit is selected individually. This is a substantial difference in configuration.
[0129]
This will be described in detail from the standpoint of the technical idea. The conventional example 2 must repeat the reference data, whose effective period is 8 cycles, twice in order to compute the two coding blocks T0 and T1 by time division, and the two sets of reference registers and the selectors are needed for that repetition. In the present invention, by contrast, the operation concerns a single coding block T0 and has nothing to do with time division. The aim of the present invention is to prevent wasted cycles from arising in the arithmetic units at pipeline transitions and thereby realize the maximum speed. For this purpose, valid reference data must be supplied to every arithmetic unit independently at all times, including during transitions, and so two sets of reference registers and an individual switch for each arithmetic unit are provided. These are completely different technical ideas: applying the technical idea of the present invention to the conventional example would be meaningless, and applying the technical idea of the conventional example to the present invention would be equally meaningless.
[0130]
Next, apparent similarities relating to the sixth aspect of the present invention will be described. In the conventional example 2, the coding block T2 is stored in the registers 819 to 822 of the PE 847 in the cycle Y4 in FIG. 44, while the coding block T0 is still held in the registers 819 to 822 of the other PEs 848 and 849, so T0 and T2 are calculated simultaneously at the block transition. In the present invention, the coding block T1 is stored in the encoded small block register 403 at F6, that is, D0, in FIG. 22, while the coding block T0 is still held in the other encoded small block registers 404 and 405, so T0 and T1 are calculated simultaneously at the transition. However, the present invention provides two sets of reference registers that hold the reference data for T0 and T1 respectively and, while switching between them for each arithmetic unit, stores T1 in the corresponding encoded small block register in synchronization with that switching. The control means that performs this is an essential element. The conventional example has no such control element and is substantially different.
[0131]
This difference will be described in detail from the standpoint of the technical idea. In the technical idea of the conventional example, when the calculation on the coding block data stored in a PE is completed, the PE waits for the reference data to enter, in its natural course, the search range of the coding block to be processed next and then stores the data of the new coding block, thereby minimizing the loss at the transition between coding blocks. Because the conventional example 1 always processes a plurality of coding blocks by parallel processing and the conventional example 2 does so by time-division processing, the reference data is always in use for some coding block, and even when a PE has completed its calculation the reference data cannot be changed to suit that PE. In the conventional examples 1 and 2, the reference data enters the search range of the next coding block T2 as soon as the calculation of the coding block T0 is completed, so the calculation of T2 can start immediately. However, this holds only for a special operation in which the size of the coding block and the search range are matched; in a general application, a large loss time occurs at the transition between coding blocks. In the present invention, when the operation on the data in a given encoded small block register is completed, not only does that register read the corresponding data of the next coding block, but at the same time the first reference data of the corresponding search range is read into the reference register. The essence of the sixth aspect of the present invention is to have a control means that synchronizes the reading of the data of the new coding block with the reading of the reference data to be combined with it in the calculation, so that no loss cycle is generated however the size of the coding block and the search range are set. If the technical idea of the conventional example were applied to the methods and devices of the first to fourth aspects of the present invention, the same problem as in the conventional example would occur; that is, a large loss time would be generated unless the size of the coding block and the search range had special values. In other words, the sixth aspect of the present invention is based on a technical idea different from that of the conventional example and cannot easily be derived from it by analogy.
[0132]
(Embodiment 6)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0133]
FIG. 23 is a block diagram showing the motion vector detection device of the present embodiment.
[0134]
In the sixth embodiment, the technique of the seventh aspect of the present invention is applied on the basis of the second embodiment. In the MPEG2 standard, for interlaced video with frame-structure pictures, not only a frame vector but also field vectors can be selectively assigned to a coding block. A frame vector is a single vector that treats both the coded image and the reference image as one frame-structure image and predicts the frame component of the coding block. Field vectors, on the other hand, treat the coded image and the reference image as each being divided into two images, a first field and a second field. They consist of two vectors: a first field vector, which predicts the first field component of the coding block from either field of the reference image, and a second field vector, which predicts the second field component of the coding block from either field of the reference image. In either case the coding block size is 16 pixels × 16 pixels, so for field vectors the first field component of the coding block is 16 pixels wide × 8 pixels high, and the second field component is likewise 16 pixels wide × 8 pixels high.
[0135]
In the sixth embodiment, the error amounts of the frame vector and the two field vectors are calculated simultaneously. The sum of absolute differences AE_{i,j} shown in (Equation 4), (Equation 5), and (Equation 6) is adopted as the error amount of the frame vector, the sum of absolute differences TFAE_{i,j} of (Equation 5) and (Equation 7) as the error amount of the field vector of the first field, and the sum of absolute differences BFAE_{i,j} of (Equation 5) and (Equation 8) as the error amount of the field vector of the second field. From (Equation 5), (Equation 7), and (Equation 8), the relationship of (Equation 9) holds among AE_{i,j}, TFAE_{i,j}, and BFAE_{i,j}.
[0136]
[Equation 7]

[0137]
[Equation 8]

[0138]
[Equation 9]

In the sixth embodiment, as in the second embodiment, the size of the coding block is M = 3 horizontally and N = 4 vertically, and the search area is K = 3 in the horizontal direction, that is, the range of -3 to 2, and L = 4 in the vertical direction, that is, the range of -4 to 3.
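The bodies of (Equation 7) to (Equation 9) are not reproduced in this text. Based on the surrounding description, and assuming that AE_{i,j,m} denotes the row-wise partial sum of (Equation 5) for row m of the coding block, they presumably take a form like the following. This is a reconstruction under that assumption, not the original notation.

```latex
\begin{align}
TFAE_{i,j} &= \sum_{\substack{0 \le m < N \\ m\ \text{even}}} AE_{i,j,m} \tag{7} \\
BFAE_{i,j} &= \sum_{\substack{0 \le m < N \\ m\ \text{odd}}} AE_{i,j,m} \tag{8} \\
AE_{i,j} &= TFAE_{i,j} + BFAE_{i,j} \tag{9}
\end{align}
```

For N = 4 this gives BFAE_{i,j} = AE_{i,j,1} + AE_{i,j,3}, which is consistent with a field addition array that adds the outputs of the arithmetic units 109 and 111.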
[0139]
The configuration of the sixth embodiment will be described with reference to FIG. 23. In FIG. 23, the same components as those of the second embodiment in FIG. 8 are denoted by the same reference numerals. The cumulative addition array 501 includes a frame addition array 502 and a field addition array 503. The frame addition array 502 has exactly the same configuration as the cumulative addition array 112 of FIG. 8, whereas the field addition array 503 consists of a two-cycle delay unit 504, which delays the output AE_{i,j,1} of the arithmetic unit 109, and an adder 505, which adds the delayed value to the output AE_{i,j,3} of the arithmetic unit 111. In addition, a subtractor 506 is provided that subtracts the output BFAE_{i,j} of the field addition array 503 from the output AE_{i,j} of the frame addition array 502 to obtain TFAE_{i,j}.
[0140]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0141]
FIG. 24 is a timing chart showing the operation of the sixth embodiment. In FIG. 24, symbols G and H correspond to FIGS. 10, 11, and 12 of the second embodiment.
[0142]
In the sixth embodiment, the operations of the reference register 101, the encoding block register 102, the arithmetic block 107, and the frame addition array 502 are exactly the same as those of the second embodiment described above, and the sum of absolute differences AE_{i,j} is obtained. This is nothing other than obtaining the sum of absolute differences of the frame vector shown in (Equation 6). Meanwhile, the field addition array 503 delays the output AE_{i,j,1} of the arithmetic unit 109 by two cycles and adds it to the output AE_{i,j,3} of the arithmetic unit 111. In FIG. 24, the output AE_{-3,-4,1} of the arithmetic unit 109 in the cycle G1 is delayed until the cycle G3 and added to the output AE_{-3,-4,3} of the arithmetic unit 111, so the output of the field addition array 503 becomes AE_{-3,-4,1} + AE_{-3,-4,3}. By (Equation 8), this means that the sum of absolute differences BFAE_{-3,-4} for the field vector of the second field has been obtained. In terms of field correspondence, this is the block matching error amount between the second field of the reference block and the second field of the coding block. In the same manner, the field addition array 503 successively calculates the sum of absolute differences BFAE_{i,j} of block matching for the second field of the coding block. The subtractor 506, for its part, subtracts the output BFAE_{i,j} of the field addition array 503 from the output AE_{i,j} of the frame addition array 502. In the cycle G3 it computes AE_{-4,-3} - BFAE_{-4,-3}, which by (Equation 9) yields TFAE_{-4,-3}. In terms of field correspondence, this is the block matching error amount between the first field of the reference block and the first field of the coding block. Likewise, in the cycle G4, AE_{-3,-3}, TFAE_{-3,-3}, and BFAE_{-3,-3} are obtained. Regarding the field correspondence in this case, the prediction block candidate has moved one line down on the reference screen compared with the cycle G3, so the relationship between the first field and the second field is reversed: TFAE_{-3,-3} is the block matching error amount between the first field of the coding block and the second field of the prediction block candidate, and BFAE_{-3,-3} is the block matching error amount between the second field of the coding block and the first field of the prediction block candidate. Thereafter, the three types of error amounts are obtained while the prediction block candidate is moved sequentially.
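The arithmetic performed by the frame addition array, the field addition array, and the subtractor can be checked with a few lines of code. The following is a minimal sketch, not the patent's circuit: it ignores pipeline timing, uses arbitrary sample pixel values, and expresses the candidate position as array offsets `dy` (lines) and `dx` (pixels). All names are hypothetical.

```python
def row_sad(ref, cur, dy, dx, m):
    """Sum of absolute differences for row m of the coding block against the
    candidate displaced by dy lines and dx pixels in the reference frame."""
    return sum(abs(ref[m + dy][n + dx] - cur[m][n]) for n in range(len(cur[0])))


def frame_and_field_errors(ref, cur, dy, dx):
    rows = [row_sad(ref, cur, dy, dx, m) for m in range(len(cur))]
    ae = sum(rows)                 # frame addition array: all rows
    bfae = sum(rows[1::2])         # field addition array: odd rows only
    tfae = ae - bfae               # subtractor: what remains is the even rows
    return ae, tfae, bfae


def reference_field_of(dy, coded_field):
    """Field of the reference frame matched against the given coded field.
    Fields are numbered 1 (even frame lines) and 2 (odd frame lines)."""
    same_parity = (dy % 2 == 0)
    return coded_field if same_parity else 3 - coded_field


# Arbitrary sample data: a 6 x 6 reference area and a 4-line x 3-pixel block.
ref = [[(7 * y + 3 * x * x) % 11 for x in range(6)] for y in range(6)]
cur = [[(5 * m + n) % 7 for n in range(3)] for m in range(4)]
ae, tfae, bfae = frame_and_field_errors(ref, cur, dy=1, dx=2)
```

The subtraction recovers the even-row sum without a second adder chain, and `reference_field_of` reproduces the parity reversal described above: an odd vertical displacement pairs the first field of the coding block with the second field of the candidate, and vice versa.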
[0143]
That is, in the sixth embodiment, the error amounts for all combinations of the first and second fields of the field vectors, as well as for the frame vector, are obtained through a single series of pipeline operations. By examining the minimum value of each of the three types of error amounts, namely the frame error amount AE_{i,j}, the first-field error amount TFAE_{i,j}, and the second-field error amount BFAE_{i,j}, optimal vector detection can be realized.
[0144]
As described above, according to the sixth embodiment and in accordance with the seventh aspect of the present invention, the cumulative addition array 501 is provided with the following: the frame addition array 502, which cumulatively adds the N error amounts by a cumulative addition structure that delays the running sum of the error amounts of the individual arithmetic units by one cycle and adds it to the next error amount; the field addition array 503, which adds the error amounts of the N/2 odd-numbered arithmetic units by a cumulative addition structure with a delay of two cycles; and the subtractor 506, which is a calculation means for obtaining the difference between the results of the frame addition array 502 and the field addition array 503. As a result, the error amount for the frame vector and the two types of error amounts for the field vectors can be obtained simultaneously and without omission, so both frame vector detection and field vector detection can be realized at once. Further, comparing FIG. 23 with FIG. 8, the only added circuits are the field addition array 503 and the subtractor 506, and the circuit increase can be held to about 30% even when a practical 480i device is assumed. Moreover, although three types of error amounts are calculated at the same time, a comparison of FIG. 24 with FIG. 10 shows that the number of processing cycles for one coding block is unchanged, so the advantage of high-speed processing is not impaired.
[0145]
In the sixth embodiment, the field addition array 503 is an adder array that processes the odd-numbered arithmetic units, but it may instead be an adder array that processes the even-numbered arithmetic units. In that case, the output of the field addition array 503 is the sum of absolute differences TFAE_{i,j} of the first field, and the output of the subtractor 506 is the sum of absolute differences BFAE_{i,j} of the second field.
[0146]
Further, although the sixth embodiment is configured by applying the technique of the seventh aspect of the present invention to the second embodiment, the technique may also be applied to the fourth embodiment. It can also be combined with the techniques of the fifth and sixth aspects of the present invention. In these cases, simultaneous detection of both frame and field vectors can be achieved without losing the individual effects, such as realization of the maximum speed with no gaps in the pipeline operations and further acceleration by parallel processing.
[0147]
In this embodiment, TFAE_{i,j} and BFAE_{i,j} are defined by (Equation 7) and (Equation 8). In this definition, however, the subscript j is based on the line number of the frame image, which differs from the MPEG2 definition described above. When the device is used as a motion vector detection device for MPEG2, the field vectors are obtained from TFAE_{i,j} and BFAE_{i,j} by the following conversion. For TFAE_{i,j}, if j is even, the first field of the coding block refers to the first field of the reference image, and the field vector is (j/2, i); if j is odd, it refers to the second field of the reference image, and the field vector is ((j-1)/2, i). For BFAE_{i,j}, if j is even, the second field of the coding block refers to the second field of the reference image, and the field vector is (j/2, i); if j is odd, it refers to the first field of the reference image, and the field vector is ((j+1)/2, i).
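The conversion rule above is simple enough to state as a function. The sketch below transcribes the four cases exactly as given in the text; the function name, the string labels, and the return layout are hypothetical choices made for illustration.

```python
def to_mpeg2_field_vector(kind, i, j):
    """Convert the frame-line-based subscripts (i, j) of TFAE or BFAE into
    (reference field, field vector), following the rule stated in the text.
    kind is "TFAE" (first field of the coding block) or "BFAE" (second field)."""
    even = (j % 2 == 0)            # Python's % returns 0 or 1 for negative j too
    if kind == "TFAE":
        if even:
            return "first", (j // 2, i)
        return "second", ((j - 1) // 2, i)
    if kind == "BFAE":
        if even:
            return "second", (j // 2, i)
        return "first", ((j + 1) // 2, i)
    raise ValueError("kind must be 'TFAE' or 'BFAE'")


examples = [
    to_mpeg2_field_vector("TFAE", 2, -4),
    to_mpeg2_field_vector("TFAE", 2, -3),
    to_mpeg2_field_vector("BFAE", 2, -3),
    to_mpeg2_field_vector("BFAE", 0, 2),
]
```

Every division here is exact because the numerator is always even, so integer floor division gives the intended result for negative displacements as well.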
[0148]
(Embodiment 7)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0149]
FIG. 25 is a block diagram showing a motion vector detection apparatus according to the present embodiment.
[0150]
In the seventh embodiment, the technique of the eighth aspect of the present invention is applied on the basis of the first embodiment, and the error amounts of the frame vector and the two field vectors are calculated simultaneously. The sum of absolute differences AE_{i,j} shown in (Equation 1) is adopted as the error amount of the frame vector. As described above, the double summation of (Equation 1) may be computed by first obtaining AE_{i,j,n}, the sum within one column shown in (Equation 2), and then summing these as in (Equation 3). Looking again at (Equation 2), its sum is divided into an even component and an odd component; the even component is defined as TFAE_{i,j,n} of (Equation 10) and the odd component as BFAE_{i,j,n} of (Equation 11). TFAE_{i,j,n} is the sum of absolute differences for the first field of the nth column of the coding block, and BFAE_{i,j,n} is the sum of absolute differences for the second field of the nth column. Therefore, if TFAE_{i,j,n} and BFAE_{i,j,n} are each summed independently over n, TFAE_{i,j} and BFAE_{i,j} are obtained as shown in (Equation 12) and (Equation 13). These are the sum of absolute differences for the first field and the sum of absolute differences for the second field, respectively. Further, since the sum of TFAE_{i,j} and BFAE_{i,j} is the sum of absolute differences AE_{i,j} as a frame in (Equation 1), (Equation 9) also holds.
[0151]
[Equation 10]

[0152]
[Equation 11]

[0153]
[Equation 12]

[0154]
[Equation 13]

In the seventh embodiment, the sum of absolute differences TFAE_{i,j} of (Equation 10) and (Equation 12) is adopted as the error amount of the field vector of the first field, and the sum of absolute differences BFAE_{i,j} of (Equation 11) and (Equation 13) as the error amount of the field vector of the second field. The error amount of the frame vector is not calculated directly from (Equation 1); instead, it is obtained by adding the field-vector error amounts given by (Equation 12) and (Equation 13), that is, by using the relationship of (Equation 9).
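The bodies of (Equation 10) to (Equation 13) are not reproduced in this text. Assuming the pixel notation of the earlier equations, with r_{m+i,n+j} the reference pixel, t_{m,n} the coding block pixel, m the row index, and n the column index, the description implies a form like the following. This is a reconstruction, not the original typesetting.

```latex
\begin{align}
TFAE_{i,j,n} &= \sum_{\substack{0 \le m < N \\ m\ \text{even}}} \left| r_{m+i,\,n+j} - t_{m,n} \right| \tag{10} \\
BFAE_{i,j,n} &= \sum_{\substack{0 \le m < N \\ m\ \text{odd}}} \left| r_{m+i,\,n+j} - t_{m,n} \right| \tag{11} \\
TFAE_{i,j} &= \sum_{n=0}^{M-1} TFAE_{i,j,n} \tag{12} \\
BFAE_{i,j} &= \sum_{n=0}^{M-1} BFAE_{i,j,n} \tag{13}
\end{align}
```

With N = 4, the even-row terms are m = 0 and m = 2, so TFAE_{-4,-3,0} = |r_{-4,-3} - t_{0,0}| + |r_{-2,-3} - t_{2,0}|, which agrees with the value computed by the adder 609 at timing D0.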
[0155]
The configuration of the seventh embodiment will be described with reference to FIGS. 25 and 26. In FIGS. 25 and 26, the same parts as those of the first embodiment are denoted by the same reference numerals. In FIG. 25, the arithmetic block 601 is composed of three arithmetic units 602 to 604, and the cumulative addition array 605 is composed of two independent field addition arrays 606 and 607. The output of the field addition array 606 and the output of the field addition array 607 are added by the adder 608. The internal structure of each of the field addition arrays 606 and 607 is the same as that of the cumulative addition array 10 of the first embodiment. In FIG. 26, the arithmetic unit 602 is configured so that the outputs of the absolute difference calculators 15 and 17 are added by an adder 609 and the result is output as TFAE_{i,j,n}, while the outputs of the absolute difference calculators 16 and 18 are added by the adder 610 and the result is output as BFAE_{i,j,n}. The internal structure of the arithmetic units 603 and 604 is the same as that of the arithmetic unit 602 shown in FIG. 26.
[0156]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0157]
FIG. 27 is a timing chart showing the operation of the seventh embodiment. In FIG. 27, symbols D and E correspond to FIGS. 4, 5, 6, and 7 of the first embodiment, and the operation in which the reference register 1 sequentially reads and stores pixels of the reference image is exactly the same as that of the reference register 1 of the first embodiment shown in those figures.
[0158]
Now, in the cycle of timing D0 in FIG. 27, B0 = (t_{4,3}, t_{5,3}, t_{6,3}, t_{7,3}) and A = (r_{0,0}, r_{1,0}, r_{2,0}, r_{3,0}), so the adder 609 of the arithmetic unit 602 outputs |r_{0,0} - t_{4,3}| + |r_{2,0} - t_{6,3}|. Rewritten in coordinates relative to the upper left pixel t_{4,3} of the coding block T0, this is |r_{-4,-3} - t_{0,0}| + |r_{-2,-3} - t_{2,0}|, which is nothing other than TFAE_{-4,-3,0} of (Equation 10). Similarly, the adder 610 of the arithmetic unit 602 calculates BFAE_{-4,-3,0} of (Equation 11). Next, in the cycle of timing D1, the arithmetic unit 602 calculates TFAE_{-4,-2,0} and BFAE_{-4,-2,0}, and the arithmetic unit 603 calculates TFAE_{-4,-3,1} and BFAE_{-4,-3,1}. Thereafter, the arithmetic units 602 to 604 each calculate TFAE_{i,j,n} and BFAE_{i,j,n}.
[0159]
The outputs TFAE_{i,j,0} to TFAE_{i,j,2} of the arithmetic units 602 to 604 are added in the field addition array 606, and BFAE_{i,j,0} to BFAE_{i,j,2} are added in the field addition array 607, to obtain the respective sums of absolute differences. Focusing on the field addition array 606, TFAE_{-4,-3,0}, obtained by the arithmetic unit 602 in the cycle of timing D0 in FIG. 27, is delayed by one cycle by the delay unit 11 of the field addition array 606 and added by the adder 12 to TFAE_{-4,-3,1}, obtained by the arithmetic unit 603 in the cycle of timing D1, giving TFAE_{-4,-3,0} + TFAE_{-4,-3,1}. This result is further delayed by one cycle by the delay unit 13 and added by the adder 14 to TFAE_{-4,-3,2}, obtained by the arithmetic unit 604 in the cycle of timing D2, giving TFAE_{-4,-3,0} + TFAE_{-4,-3,1} + TFAE_{-4,-3,2}. By (Equation 12) this is TFAE_{-4,-3}; that is, the sum of absolute differences TFAE_{-4,-3} by block matching between the first field of the prediction block candidate of the vector (-3, -4) and the first field of the coding block T0 has been obtained. Similarly, in the field addition array 607, the sum of absolute differences BFAE_{-4,-3} by block matching between the second field of the same prediction block candidate and the second field of the coding block T0 is obtained by (Equation 13). The adder 608 calculates TFAE_{-4,-3} + BFAE_{-4,-3}, so by (Equation 9) AE_{-4,-3}, that is, the sum of absolute differences between the same prediction block candidate and the coding block T0 as frames, is obtained.
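The data flow through the arithmetic units, the two field addition arrays, and the adder can be simulated at cycle level. The sketch below is an illustration under stated assumptions, not the patent's circuit: it models each delay unit as a one-cycle shift that is initially zero, uses one band of eight small blocks with arbitrary sample pixel values, and treats the coding block as 4 rows by 3 columns. All function and variable names are hypothetical.

```python
def column_field_sads(ref_col, cur_col):
    """One arithmetic unit: split the column-wise sum of absolute differences
    into even-row (first field) and odd-row (second field) parts."""
    diffs = [abs(r - t) for r, t in zip(ref_col, cur_col)]
    return sum(diffs[0::2]), sum(diffs[1::2])


def cumulative_addition(streams):
    """Cumulative addition array: streams[n][c] is the output of unit n at
    cycle c. Each stage delays the running sum by one cycle before adding the
    next unit's output, so entry c of the result combines
    streams[0][c - (M - 1)], ..., streams[M - 1][c]."""
    acc = list(streams[0])
    for stream in streams[1:]:
        delayed = [0] + acc[:-1]               # one-cycle delay element
        acc = [d + s for d, s in zip(delayed, stream)]
    return acc


def column(img, x):
    return [row[x] for row in img]


N, M, CYCLES = 4, 3, 8                         # rows, columns, small blocks D0..D7
ref = [[(5 * y + 3 * x * x) % 13 for x in range(CYCLES)] for y in range(N)]
cur = [[(7 * m + 2 * n) % 11 for n in range(M)] for m in range(N)]

# Unit n compares reference column c (small block Dc) with coded column n.
pairs = [[column_field_sads(column(ref, c), column(cur, n)) for c in range(CYCLES)]
         for n in range(M)]
tf_streams = [[p[0] for p in unit] for unit in pairs]
bf_streams = [[p[1] for p in unit] for unit in pairs]

tfae = cumulative_addition(tf_streams)         # role of field addition array 606
bfae = cumulative_addition(bf_streams)         # role of field addition array 607
ae = [a + b for a, b in zip(tfae, bfae)]       # role of adder 608
```

In this model the result for the candidate whose left column is small block Dk appears at cycle k + 2, after the two delay stages, and the frame error is obtained with a single extra addition rather than a third adder chain.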
[0160]
That is, in the series of pipeline operations from D0 to D7, the field addition arrays 606 and 607 and the adder 608 respectively output the block matching error TFAE_{i,j} between the first field of the coding block and the first field of the reference block, the block matching error BFAE_{i,j} between the second field of the coding block and the second field of the reference block, and the difference absolute value sum AE_{i,j} for frame prediction, so that the three error amounts are obtained simultaneously.
[0161]
In the series of pipeline operations starting from E0 in FIG. 27, the reference register 1 processes a region shifted down by one pixel in the frame structure, as shown in the figure, which means that the relationship between the first and second fields of the reference has been reversed. Therefore, in the series of pipeline operations from E0 to E7, three error amounts are obtained simultaneously: the difference absolute value sum AE_{i,j} for frame prediction, the block matching error TFAE_{i,j} between the first field of the coding block and the second field of the reference block, and the block matching error BFAE_{i,j} between the second field of the coding block and the first field of the reference block. As described above, according to the seventh embodiment, the error amounts for the frame vector and for the field vectors of the two fields are calculated simultaneously and without omission, so the optimum vector can be detected by examining the minimum value of each.
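The field reversal for an odd vertical shift follows from row parity. The helper below is a hypothetical illustration: the name `reference_field`, the 0/1 field numbering (0 for the first field, 1 for the second), and the assumption that the coding block starts on a first-field row are all introduced for this example only. A block row of parity p, shifted vertically by i rows, is compared with a reference row of parity (p + i) mod 2.

```python
def reference_field(block_row, vertical_offset):
    """Field (0 = first, 1 = second) of the reference row a block row is compared with.

    Assumes rows are numbered from 0, even rows belong to the first field,
    and the coding block starts on a first-field row.
    """
    return (block_row + vertical_offset) % 2


# Even offset: first-field rows of the block meet first-field rows of the reference.
same = [reference_field(row, -4) for row in range(4)]
# Odd offset: the pairing is reversed.
swapped = [reference_field(row, -3) for row in range(4)]
```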
[0162]
As described above, the seventh embodiment follows the eighth aspect of the present invention: the arithmetic units 602 to 604 each obtain, for the input reference data set and encoded data set, two types of error amounts, one for the pixels at even positions and one for the pixels at odd positions, and the cumulative addition array 605 comprises the first field addition array 606 and the second field addition array 607, which add the two types of error amounts independently by a cumulative addition structure. As a result, the error amount for the frame vector and the error amounts for the two types of field vectors can be obtained simultaneously and without omission, so that both frame vector detection and field vector detection can be realized. Further, comparing FIG. 25 with FIG. 1, the only circuits added are the field addition array 607 and the adder 608, and even if a practical specification of 480i is assumed, the circuit increase can be suppressed to about 30%. In addition, since the three types of error amounts are calculated at the same time, a comparison of FIG. 27 with FIG. 4 shows that the number of processing cycles for one coding block does not change, and the advantage of high-speed processing is not impaired.
[0163]
The seventh embodiment is configured by applying the technique of the eighth invention to the first embodiment, but it may instead be configured by applying the technique of the ninth invention to the first embodiment. In this case, the arithmetic units 602 to 604 in the configuration of the seventh embodiment are replaced with the arithmetic unit shown in FIG. 28. The arithmetic unit in FIG. 28 has an adder 611 added and outputs AE_{i,j,n}, the sum of TFAE_{i,j,n} and BFAE_{i,j,n}. In terms of the ninth aspect of the present invention, this is equivalent to obtaining two types of error amounts: TFAE_{i,j,n} as the first error amount, for the input pixels at even positions, and AE_{i,j,n} as the second error amount, for all the pixels. In FIG. 25, the field addition array 607 is used, with the same structure, as a frame addition array; the outputs AE_{i,j,n} of the replaced arithmetic units are cumulatively added by the frame addition array 607 to obtain the difference absolute value sum AE_{i,j}, which is the frame error amount. Furthermore, the first field error amount TFAE_{i,j}, which is the output of the field addition array 606, is subtracted from AE_{i,j} to obtain the second field error amount BFAE_{i,j}. Even in this case, the circuit scale and the effects are exactly the same as when the eighth aspect of the present invention is used.
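The subtraction used in this variant relies on AE_{i,j} = TFAE_{i,j} + BFAE_{i,j} from (Equation 9). A minimal Python check is sketched below; the helper names `abs_error` and `field_errors_by_subtraction` and the list-of-rows layout are assumptions made for illustration. The frame error and the first-field error are computed directly, and the second-field error is recovered by subtraction.

```python
def abs_error(ref, block, top, left, rows):
    """Sum of absolute differences over the given block rows (hypothetical helper)."""
    return sum(
        abs(ref[top + r][left + c] - block[r][c])
        for r in rows
        for c in range(len(block[0]))
    )


def field_errors_by_subtraction(ref, block, top, left):
    n_rows = len(block)
    ae = abs_error(ref, block, top, left, range(n_rows))          # frame error (all rows)
    tfae = abs_error(ref, block, top, left, range(0, n_rows, 2))  # first field (even rows)
    bfae = ae - tfae                                              # second field, by subtraction
    return tfae, bfae, ae
```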
[0164]
Although the seventh embodiment is configured by applying the technique of the eighth invention to the first embodiment, the technique of the eighth invention may also be applied to the third embodiment. It can further be combined with the techniques of the fifth and sixth aspects of the present invention. In these cases, simultaneous detection of both frame and field vectors is realized without impairing the individual effects, such as maximum speed with no gaps in the pipeline operations and further speedup by parallel processing.
[0165]
The sixth embodiment and the seventh embodiment have exactly the same effect but differ in the method of reading the reference image. The more suitable of the two can therefore be selected according to the characteristics and operating conditions of the reference image storage medium, such as a memory, connected to the motion vector detection device.
[0166]
Also in this embodiment, when the device is used as a motion vector detection apparatus for MPEG2, an appropriate vector can be obtained by applying the conversion described in the sixth embodiment to the subscript j of TFAE_{i,j} and BFAE_{i,j}.
[0167]
(Embodiment 8)
First, the configuration of the motion vector detection device of the present embodiment will be described.
[0168]
FIG. 29 is a block diagram showing the motion vector detection device of the present embodiment.
[0169]
The eighth embodiment applies the technique of the tenth aspect of the present invention, and corresponds to the case where the reference register of the tenth aspect stores M pieces of data arranged horizontally on the reference image. In the eighth embodiment, the error amounts of the frame vector and the two field vectors are calculated simultaneously. Also in the eighth embodiment, the difference absolute value sum AE_{i,j} shown in (Equation 4) is adopted as the error amount of the frame vector. The same formulas as in the sixth embodiment are used: (Equation 5) and (Equation 7) for the difference absolute value sum TFAE_{i,j} of the first field, (Equation 5) and (Equation 8) for the difference absolute value sum BFAE_{i,j} of the second field, and (Equation 9) for the difference absolute value sum AE_{i,j} of the frame.
[0170]
In the eighth embodiment, as in the sixth embodiment, the size of the coding block is M = 3 horizontally and N = 4 vertically, and the search area is K = 3 in the horizontal direction, that is, a range of -3 to 2, and L = 4 in the vertical direction, that is, a range of -4 to 3.
[0171]
The configuration of the eighth embodiment will be described with reference to FIG. 29. In FIG. 29, the reference registers 701 to 703 are registers that each store three consecutive pixels in the row direction of the reference image. The three registers store three rows that are consecutive in the column direction, and output the reference data A0, A1, and A2, respectively.

The operation block 704 is composed of the arithmetic units 708 and 709. It receives B0 and B2, which are the first-field data among the four sets output by the coding block register 102, and the reference data A0 of the reference register 701, and calculates the difference absolute value sum AE_{i,j,n} (where n is an even number) based on (Equation 5). The operation block 705 is composed of the arithmetic units 710 and 711. It receives B1 and B3, which are the second-field data of the coding block register 102, and the reference data A1 of the reference register 702, and calculates the difference absolute value sum AE_{i,j,n} (where n is an odd number) based on (Equation 5). The operation block 706 is composed of the arithmetic units 712 and 713. It receives B0 and B2, which are the first-field data of the coding block register 102, and the reference data A1 of the reference register 702, and calculates the difference absolute value sum AE_{i,j,n} (where n is an even number). The operation block 707 is composed of the arithmetic units 714 and 715. It receives B1 and B3, which are the second-field data of the coding block register 102, and the reference data A2 of the reference register 703, and calculates the difference absolute value sum AE_{i,j,n} (where n is an odd number).

The cumulative addition arrays 716 to 719 are field addition arrays that receive the outputs of the operation blocks 704 to 707 and calculate the field error amounts by cumulative addition. The outputs of the cumulative addition arrays 716 and 717 are added by the adder 720 to obtain a frame error amount, and the outputs of the cumulative addition arrays 718 and 719 are added by the adder 721 to obtain a frame error amount. The internal structure of the arithmetic units 708 to 715 is the same as that shown in FIG. 9 of the second embodiment. Parts that are the same as in the earlier configuration are given the same reference numerals.
[0172]
Next, the operation of the motion vector detection device of this embodiment will be described.
[0173]
FIG. 30 is a timing chart showing the operation of the eighth embodiment.
[0174]
The operation of the eighth embodiment is divided among the first field evaluation means formed by the operation block 704 and the cumulative addition array 716, the second field evaluation means formed by the operation block 705 and the cumulative addition array 717, the third field evaluation means formed by the operation block 706 and the cumulative addition array 718, and the fourth field evaluation means formed by the operation block 707 and the cumulative addition array 719, and consists of four pipeline operations.
[0175]
First, as the first pipeline operation of FIG. 30, the operation block 704 receives the encoded data B0 = (t_{4,3}, t_{4,4}, t_{4,5}) and B2 = (t_{6,3}, t_{6,4}, t_{6,5}) and the reference data A0 = (r_{0,0}, r_{0,1}, r_{0,2}) as inputs, and the arithmetic unit 708 calculates |r_{0,0} - t_{4,3}| + |r_{0,1} - t_{4,4}| + |r_{0,2} - t_{4,5}|. Normalized with reference to the upper left coordinates of the coding block, this is AE_{-4,-3,0} of (Equation 5). The position occupied by the reference data A0 in the reference image search range at this time is the small block G0. In the first pipeline of FIG. 30, in the cycle G2 following G0, the reference register 701 moves to the small block G2 shown in FIG. 32, that is, it stores (r_{2,0}, r_{2,1}, r_{2,2}) and outputs it as the reference data A0. At this time, the arithmetic unit 708 calculates AE_{-2,-3,0} and the arithmetic unit 709 calculates AE_{-4,-3,2}. In the same way, the reference data A0 moves down the search range two rows at a time, and the arithmetic units 708 and 709 calculate AE_{i,j,0} and AE_{i,j,2}. Since the cumulative addition array 716 delays the output of the arithmetic unit 708 by one cycle and adds it to the output of the arithmetic unit 709, AE_{-4,-3,0}, output by the arithmetic unit 708 in the cycle G0 of the first pipeline of FIG. 30, is added to AE_{-4,-3,2}, output by the arithmetic unit 709 in the cycle G2, and AE_{-4,-3,0} + AE_{-4,-3,2} is obtained. By (Equation 7), this is TFAE_{-4,-3}. TFAE_{-2,-3}, TFAE_{0,-3}, ... are obtained in the same way thereafter. That is, the first pipeline is a pipeline operation that receives the data B0 and B2 of the first field of the coding block and the data A0 of the first field of the reference image, and its result is TFAE_{i,j} (where i is an even number), the error amount when block matching is performed between the first field of the coding block and the first field of the reference block candidate.
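The first pipeline can be modelled cycle by cycle as follows. This is a sketch under stated assumptions, not the circuit itself: `row_error` and `first_pipeline` are hypothetical names, `ref_strip` is taken to be the 3-pixel-wide strip of the search range for one horizontal position (a list of rows), and a Python variable plays the role of the one-cycle delay in the cumulative addition array 716.

```python
def row_error(ref_row, block_row):
    """Sum of absolute differences of one reference row and one block row."""
    return sum(abs(r - t) for r, t in zip(ref_row, block_row))


def first_pipeline(ref_strip, b0, b2):
    """Return first-field errors for the even vertical positions, top to bottom."""
    delayed_708 = None          # output of unit 708, held for one cycle
    outputs = []
    # A0 visits rows 0, 2, 4, ...; the last two rows are only reached by A1 and A2.
    for top in range(0, len(ref_strip) - 2, 2):
        a0 = ref_strip[top]
        out_708 = row_error(a0, b0)   # block row 0 of the candidate starting at `top`
        out_709 = row_error(a0, b2)   # block row 2 of the candidate starting at `top - 2`
        if delayed_708 is not None:
            outputs.append(delayed_708 + out_709)
        delayed_708 = out_708
    return outputs
```

For an 11-row strip the model runs five cycles and returns four first-field errors, for candidates starting at rows 0, 2, 4, and 6 of the strip.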
[0176]
Next, the second pipeline in FIG. 30 will be described. In the G1 cycle, the operation block 705 receives B1 = (t_{5,3}, t_{5,4}, t_{5,5}) and B3 = (t_{7,3}, t_{7,4}, t_{7,5}) as encoded data and A1 = (r_{1,0}, r_{1,1}, r_{1,2}) as reference data, and the arithmetic unit 710 calculates AE_{-4,-3,1} of (Equation 5). The position occupied by the reference data A1 in the reference image search range at this time is the small block G1. In the cycle G3 following G1, the reference register 702 moves to the small block G3 shown in FIG. 32, that is, A1 = (r_{3,0}, r_{3,1}, r_{3,2}) is output. At this time, the arithmetic unit 710 calculates AE_{-2,-3,1} and the arithmetic unit 711 calculates AE_{-4,-3,3}. The cumulative addition array 717 delays the output of the arithmetic unit 710 by one cycle and adds it to the output of the arithmetic unit 711, so that BFAE_{-4,-3}, BFAE_{-2,-3}, ... are obtained. That is, the second pipeline receives the data B1 and B3 of the second field of the coding block and the data A1 of the second field of the reference image as inputs, and obtains BFAE_{i,j} (where i is an even number), the error amount when block matching is performed between the second field of the coding block and the second field of the reference block candidate.
[0177]
The error amount of the first field obtained by the first pipeline and the error amount of the second field obtained by the second pipeline are added by the adder 720. This is equivalent to the calculation of (Equation 9), so AE_{i,j} (where i is an even number), that is, the frame error amount, is obtained.
[0178]
As described above, three elements are provided: the first pipeline, which obtains the field error amount by block matching between the first field of the coding block and the first field of the prediction block candidate; the second pipeline, which obtains the field error amount by block matching between the second field of the coding block and the second field of the prediction block candidate; and an adder, which adds the two field error amounts to obtain the frame error amount. With these, the three types of error amounts can be obtained without omission for the prediction block candidates whose motion vector has an even Y component.
[0179]
In the third pipeline of FIG. 30, which consists of the operation block 706 and the cumulative addition array 718, the encoded data B0 and B2, that is, the first-field data, and the reference data A1, that is, the second-field data, are input, and the field error amount TFAE_{i,j} (where i is an odd number) for block matching between the first field of the coding block and the second field of the prediction block candidate is calculated. Finally, in the fourth pipeline, which consists of the operation block 707 and the cumulative addition array 719, the encoded data B1 and B3, that is, the second-field data, and the reference data A2, that is, the first-field data, are input, and the field error amount BFAE_{i,j} (where i is an odd number) for block matching between the second field of the coding block and the first field of the prediction block candidate is calculated. The adder 721 adds these two field error amounts to calculate the frame error amount AE_{i,j} (where i is an odd number). As described above, the third pipeline and the fourth pipeline obtain the three types of error amounts without omission for the prediction block candidates whose motion vector has an odd Y component.
[0180]
Here, in the fourth pipeline, the reference data A2 in the G2 cycle at the start of operation is A2 = (r_{2,0}, r_{2,1}, r_{2,2}), which is the reference data located in the small block G2 within the search range. In FIG. 30, the G0 cycle of the first pipeline, the G1 cycle of the second and third pipelines, and the G2 cycle of the fourth pipeline all occur at the same time and form the operation start cycle. A0, A1, and A2 output from the three registers 701 to 703 in this operation start cycle are the data in the regions of the small blocks G0, G1, and G2. In the second cycle, as shown in FIG. 32, A0, A1, and A2 shift downward by two rows to the small blocks G2, G3, and G4. Thereafter, the first pipeline pass finishes in the state of G8, G9, and G10 shown in FIG. 33, and a new pipeline pass starts from the three small blocks H0, H1, and H2 at the top of the region shifted to the right by one pixel. In this way, the reference registers 701 to 703 store the reference pixels while shifting a band-like region 3 pixels wide from top to bottom, two rows at a time, within the search range. Because the band-like region is stored from top to bottom in this manner, the error amount evaluation by the combination of the first and second pipelines and the error amount evaluation by the combination of the third and fourth pipelines cover the even and the odd Y components, respectively, and all vectors can be evaluated without overlap and without omission.
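The way the four pipelines together cover every vector exactly once can be checked with a small functional model. The Python sketch below is an illustration under stated assumptions rather than the circuit of FIG. 29: the function names, the list-of-rows image layout, and the tuple used for the delay stages of the cumulative addition arrays are hypothetical, and candidates are keyed by their upper left position inside the search area rather than by the signed vector.

```python
def row_error(ref_row, block_row):
    return sum(abs(r - t) for r, t in zip(ref_row, block_row))


def scan_search_area(ref, block):
    """Model of the four pipelines for a 4-row coding block.

    Returns {(top, left): (TFAE, BFAE, AE)}, keyed by the candidate's
    upper left position inside the search area `ref` (a list of rows).
    """
    b0, b1, b2, b3 = block
    width = len(block[0])
    results = {}
    for left in range(len(ref[0]) - width + 1):        # next band: one pixel to the right
        strip = [row[left:left + width] for row in ref]
        delayed = None                                  # delay stages of arrays 716 to 719
        for top in range(0, len(strip) - 2, 2):         # band moves down two rows per cycle
            a0, a1, a2 = strip[top], strip[top + 1], strip[top + 2]
            if delayed is not None:
                d716, d717, d718, d719 = delayed
                t_even = d716 + row_error(a0, b2)       # pipeline 1
                b_even = d717 + row_error(a1, b3)       # pipeline 2
                t_odd = d718 + row_error(a1, b2)        # pipeline 3
                b_odd = d719 + row_error(a2, b3)        # pipeline 4
                results[(top - 2, left)] = (t_even, b_even, t_even + b_even)  # adder 720
                results[(top - 1, left)] = (t_odd, b_odd, t_odd + b_odd)      # adder 721
            delayed = (row_error(a0, b0), row_error(a1, b1),
                       row_error(a1, b0), row_error(a2, b1))
    return results


def brute_force(ref, block, top, left):
    """Direct field and frame errors for one candidate, for comparison."""
    def err(rows):
        return sum(abs(ref[top + r][left + c] - block[r][c])
                   for r in rows for c in range(len(block[0])))
    t, b = err((0, 2)), err((1, 3))
    return t, b, t + b
```

For an 11-row by 8-column search area and a 4-row by 3-column block, this simplified model visits 30 band positions (5 per column, 6 columns) and yields all 48 candidates once each.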
[0181]
[Table 11]

[0182]
[Table 12]

The effects of the eighth embodiment are summarized numerically in Tables 11 and 12. Table 11 shows the circuit scale of the eighth embodiment, and Table 12 shows its processing speed. The calculation conditions for the 480i and 1080i cases in Tables 11 and 12 are the same as those for Conventional Example 1 in Tables 1 and 2. Compared with Conventional Example 1 in Tables 1 and 2, the improvement is dramatic, but since this has already been described in the first and second embodiments, that description is omitted here. Instead, the eighth embodiment is compared with the first and second embodiments of the present invention to confirm the effect of the tenth technique of the present invention.
[0183]
First, FIG. 1 of the first embodiment, shown in Tables 5 and 6, is compared with FIG. 29 of the eighth embodiment, shown in Tables 11 and 12. The circuit scale increases by about 40% compared with the first embodiment, but the processing speed is at least doubled. In the eighth embodiment, as the time chart of FIG. 30 shows, the first series, formed by the first and second pipelines, and the second series, formed by the third and fourth pipelines, are processed in parallel, so the processing speed is about double. This parallel processing structure unique to the eighth embodiment is independent of the parallel processing technique of the fifth invention shown in the third and fourth embodiments, and an apparatus can be realized that combines the parallel processing technique of the tenth invention with that of the fifth invention. The number of multiplexed sequences Q in Tables 11 and 12 means the number of multiplexed sequences according to the fifth technique of the present invention. In the example of FIG. 29, the technique of the fifth invention is not used, so Q = 1. When the fifth aspect of the present invention is applied to the eighth embodiment with Q = 2, Table 12 shows that the processing speed simply doubles. Compared with the 480i and 1080i examples of the same specifications for the third embodiment in Tables 7 and 8, Tables 11 and 12 show that, because of the parallel processing effect unique to the tenth invention, substantially the same processing speed is achieved with substantially the same circuit scale when the number of multiplexed sequences Q of the fifth invention is set to half that of Tables 7 and 8.
[0184]
As described above, the eighth embodiment, following the tenth aspect of the present invention, comprises: a coding block register 102 that, taking three pixel data in the same field of the coding block as one set, outputs two sets of encoded data of the first field and two sets of encoded data of the second field; three reference registers 701 to 703 that each store three pixels in the same field of the reference image and output them as one set of reference data; four operation blocks 704 to 707 serving as field evaluation means that each receive one set of reference data and two sets of encoded data and obtain a field error amount; an adder 720 that adds the field error amount for the reference data A0 of the first field and the encoded data B0, B2 of the first field to the field error amount for the reference data A1 of the second field and the encoded data B1, B3 of the second field; and an adder 721 that adds the field error amount for the reference data A2 of the first field and the encoded data B1, B3 of the second field to the field error amount for the reference data A1 of the second field and the encoded data B0, B2 of the first field. The reference registers 701 to 703 have a control function for taking out and storing the reference data while sequentially shifting it in the vertical direction on the reference screen, and each of the operation blocks 704 to 707 consists of two arithmetic units whose two error amounts are summed by a cumulative addition structure to obtain the field error amount. As a result, the error amount by frame matching, the error amount by field matching for the first field of the coding block, and the error amount by field matching for the second field can be obtained simultaneously, and when a practical video signal is processed, the processing speed of the first and second embodiments can be maintained without increasing the circuit scale.
[0185]
Also in this embodiment, when the device is used as a motion vector detection apparatus for MPEG2, an appropriate vector can be obtained by applying the conversion described in the sixth embodiment to the subscript j of TFAE_{i,j} and BFAE_{i,j}.
[0186]
It has already been described that the technique of the fifth invention may be applied to the tenth invention; the techniques of the sixth and seventh inventions may be applied in addition. In that case, as in FIG. 20, three reference registers are added to the configuration of FIG. 29, switches for switching between their outputs C0, C1, C2 and the outputs A0, A1, A2 of the reference registers 701 to 703 are provided at the input positions of the arithmetic units 708 to 715, and a mode control unit is provided for controlling the switching timing of the switches and the reading timing of the coding small block registers 103 to 106. The pipelines in the timing chart of FIG. 30 are then operated without any gaps, and the maximum speed can be realized with the maximum efficiency.
[0187]
In the eighth embodiment, three reference registers 701 to 703 are provided. However, the reference register 703 may be omitted, leaving two reference registers, with the reference data A0 supplied in place of the reference data A2. In this case, the operation of the fourth pipeline is delayed by one cycle in the timing chart of FIG. 30, so the output of the cumulative addition array 718 in the third pipeline is also delayed by one cycle to synchronize the two, and the function of obtaining the three types of error amounts is retained.
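The reason the reference data A0 can stand in for A2 is that, with the band moving down two rows per cycle, the row held by A2 in one cycle is the row held by A0 in the next. The short Python sketch below illustrates this; the helper name `register_rows` is hypothetical, and the row schedule (A0, A1, A2 on rows 2c, 2c + 1, 2c + 2 in cycle c) is an assumption read off the description of FIG. 30, not a quoted specification.

```python
def register_rows(cycle):
    """Reference rows held by A0, A1, A2 in a given cycle (assumed schedule)."""
    top = 2 * cycle
    return {"A0": top, "A1": top + 1, "A2": top + 2}


# The row seen by A2 in cycle c is seen by A0 one cycle later, which is why
# the fourth pipeline runs one cycle late when A0 replaces A2.
one_cycle_late = all(
    register_rows(c)["A2"] == register_rows(c + 1)["A0"] for c in range(4)
)
```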
[0188]
Further, the eighth embodiment is an example of the tenth aspect of the present invention in which the reference data is M pieces of data arranged horizontally on the reference image, but the reference data may instead be M pieces of data arranged vertically on the reference image. In that case, the reference registers and the coding small block registers are configured to store M vertical pixels in the same field, and the pipeline operation is executed while the reference registers shift through the search area one pixel at a time in the horizontal direction.
[0189]
In Embodiments 1 to 8 of the present invention, the block matching error amount is in every case the sum of absolute differences, but an error amount other than the sum of absolute differences may be used, such as the sum of squared differences or the variance. As long as the error amount can be expressed by (Equation 3) or (Equation 6), all the techniques of the present invention can be applied.
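As an illustration of swapping the error measure, the Python sketch below passes the per-pixel error as a parameter. It assumes that the error amount is a sum over pixels of a per-pixel term, which is a reading of the text and not a quotation of (Equation 3) or (Equation 6); the names `block_error`, `absolute_difference`, and `squared_difference` are hypothetical.

```python
def block_error(ref, block, top, left, pixel_error):
    """Sum a per-pixel error over the block placed at (top, left) in ref."""
    return sum(
        pixel_error(ref[top + r][left + c], block[r][c])
        for r in range(len(block))
        for c in range(len(block[0]))
    )


def absolute_difference(r, t):
    return abs(r - t)


def squared_difference(r, t):
    return (r - t) ** 2
```

Only the per-pixel function changes between the two measures; the accumulation, and hence the pipeline structure that performs it, stays the same.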
[0190]
In the above, Embodiments 1 to 8 have been described in detail.
[0191]
The motion vector detection method of the first aspect of the present invention comprises: a coding block output step of storing the pixel data of the coding block and outputting N sets of encoded data, with M pixel data arranged in one vertical column or one horizontal row in the coding block as one set; a reference data output step of temporarily storing M pixels of a reference image and outputting them as one set of reference data, the step using at least one of a first control, which, when the reference data is M pieces of data arranged vertically on the reference image, takes out and stores the reference data while sequentially shifting it in the horizontal direction on the reference screen, and a second control, which, when the reference data is M pieces of data arranged horizontally on the reference image, takes out and stores the reference data while sequentially shifting it in the vertical direction on the reference screen; an arithmetic step using 1 × N arithmetic units that each calculate the error amount between one set of reference data and one set of encoded data; and a cumulative addition step of delaying an error amount by one cycle and adding it to the error amount of the adjacent set of encoded data, and thereafter obtaining the sum of the N error amounts by a cumulative addition structure in which the addition result is successively delayed by one cycle and added to the adjacent error amount.
[0192]
The motion vector detection device of the second aspect of the present invention, which employs the motion vector detection method of the first aspect, comprises: a coding block register that stores the pixel data of the coding block and outputs N sets of encoded data, with M pixel data arranged in one vertical column or one horizontal row in the coding block as one set; a reference register that temporarily stores M pixels of the reference image and outputs them as one set of reference data, the reference register having at least one of a first control function, which, when the reference data is M pieces of data arranged vertically on the reference image, takes out and stores the reference data while sequentially shifting it in the horizontal direction on the reference screen, and a second control function, which, when the reference data is M pieces of data arranged horizontally on the reference image, takes out and stores the reference data while sequentially shifting it in the vertical direction on the reference screen; N arithmetic units that each calculate the error amount between one set of reference data and one set of encoded data; and a cumulative addition array that delays an error amount by one cycle and adds it to the error amount of the adjacent set, and thereafter successively delays the addition result by one cycle and adds it to the adjacent error amount to obtain the sum of the N error amounts.
[0193]
With this configuration, the registers for storing pixel data consist only of the coding block register for one coding block and the reference register for one row or one column, so the reference registers can be reduced by N-1 rows or N-1 columns, and the operation data delay connecting the pipeline stages is always just one cycle regardless of the size of the search range. Motion vector detection can thus be realized with an extremely small circuit scale. As for the calculation speed, the error amount of one prediction block candidate is determined in each cycle and the number of loss cycles is small, so a high-speed motion vector detection operation that completes in a small number of cycles is obtained. In addition, since processing is completed one coding block at a time, there is no restriction on the processing order of the coding blocks, and the encoder following the motion vector detection device can be configured easily. For applications that do not require high-speed processing, a very small, non-multiplexed circuit can be provided, which is extremely effective.
[0194]
The motion vector detection method of the third aspect of the present invention comprises: a coding block output step of outputting N sets of encoded data, with M pixel data arranged in one vertical column or one horizontal row in the coding block as one set; a reference data output step of temporarily storing M+Q-1 pixels of a reference image and outputting Q sets of reference data, each set being M consecutive pixels, the step using at least one of a first control for taking out and storing the reference data while sequentially shifting it in the horizontal direction on the reference screen and a second control for taking out and storing the reference data while sequentially shifting it in the vertical direction on the reference screen; a calculation step of calculating error amounts using Q × N arithmetic units that each calculate the error amount between one set of the reference data and one set of the encoded data; and a cumulative addition step of obtaining the sum of the N error amounts using Q cumulative addition arrays having a cumulative addition structure.
[0195]
The motion vector detection device of the fourth aspect of the present invention, which employs the motion vector detection method of the third aspect, comprises: a coding block register that outputs N sets of encoded data, with M pixel data arranged in one vertical column or one horizontal row in the coding block as one set; a reference register that stores M+Q-1 pixels of a reference image and outputs Q sets of reference data, each set being M consecutive pixels, the reference register having at least one of a first control function for taking out and storing the reference data while sequentially shifting it in the horizontal direction on the reference screen and a second control function for taking out and storing the reference data while sequentially shifting it in the vertical direction on the reference screen; Q × N arithmetic units; and Q cumulative addition arrays that obtain the sum of the N error amounts of the arithmetic units by a cumulative addition structure.
[0196]
With this configuration, Q-times high-speed processing is realized by parallel processing with Q series, while the registers for storing pixel data do not need to be provided for each of the Q series; it suffices to enlarge the reference register slightly, by Q-1 pixels. The operation data delay units connecting the pipeline stages are required for each of the Q series, but each needs only a one-cycle delay and can be realized with an extremely small circuit, so the circuit increase is kept to a minimum. As a result, the overall circuit scale can be drastically reduced compared with the prior art, and the effect is particularly large when motion vectors are detected within a practical search range. The multiplexing number Q is not restricted by the size of the block, the size of the search range, or any other condition, and can be set arbitrarily. A device with the minimum circuit scale for the required processing speed and purpose of use can therefore be provided.
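The saving in reference registers comes from the overlap between neighbouring reference data sets: M+Q-1 stored pixels already contain Q windows of M consecutive pixels. A minimal Python sketch, with the hypothetical name `reference_sets` and made-up pixel values, is:

```python
def reference_sets(register, m):
    """Split a register of M+Q-1 pixels into Q overlapping sets of M pixels."""
    q = len(register) - m + 1
    return [register[k:k + m] for k in range(q)]


# M = 3 and Q = 2: four stored pixels yield two sets of three pixels.
sets = reference_sets([10, 11, 12, 13], 3)
```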
[0197]
The motion vector detection apparatuses of the fifth and sixth aspects of the present invention include a second reference register; each arithmetic unit includes a reference data changeover switch for selecting either the reference data supplied from the first reference register or the reference data supplied from the second reference register; and mode control means is provided that, when the reference register supplying the reference data is switched, switches the reference data changeover switches in order, starting from the arithmetic unit that has completed its valid computation in the mode before the transition, and that, when the error amount computation for a new coding block is started, stores the data of the new coding block in the coding block register one set at a time in synchronization with the switching operation of the reference data changeover switches. This makes it possible to supply valid reference data to every arithmetic unit. In addition, even when a new coding block is started, new encoded data and the corresponding reference data can be supplied immediately, starting from the arithmetic units that are ready to begin. All arithmetic units therefore always execute valid computations, no computation is duplicated, the maximum efficiency is realized, and the processing speed is improved.
[0198]
In the seventh motion vector detection apparatus of the present invention, the cumulative addition array includes a frame addition array that cumulatively adds the N error amounts of the arithmetic units, and a field addition array that adds the N/2 even-numbered or odd-numbered error amounts by a cumulative addition structure while delaying them by two cycles. The frame addition array thus obtains the frame error amount of the prediction block candidate, while the field addition array obtains the error amount between the first field component or the second field component of the coding block and the prediction block candidate. The remaining field error amount is obtained by subtracting that field error amount from the frame error amount. Since only the field addition array is added, the frame error amount and the two types of field error amounts for the same prediction block candidate can be calculated simultaneously with only a slight increase in circuitry.
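The subtraction described above can be sketched in software as follows. This is a minimal model under stated assumptions: each of the N sets is taken to be one horizontal line of the block (so even-indexed arithmetic units see the lines of one field), the error amount is the sum of absolute differences, and the one-cycle and two-cycle delays of the addition arrays are not modeled. The names `line_error` and `frame_and_field_errors` are hypothetical.

```python
def line_error(ref_line, enc_line):
    """Sum of absolute differences for one set (the output of one arithmetic unit)."""
    return sum(abs(r - t) for r, t in zip(ref_line, enc_line))


def frame_and_field_errors(ref_block, enc_block):
    """Return (frame error, first-field error, second-field error) for one candidate."""
    errors = [line_error(r, t) for r, t in zip(ref_block, enc_block)]
    frame = sum(errors)              # role of the frame addition array: all N amounts
    first_field = sum(errors[0::2])  # role of the field addition array: N/2 amounts
    second_field = frame - first_field  # remaining field obtained by subtraction
    return frame, first_field, second_field
```

Only one extra accumulation and one subtraction are needed to obtain both field error amounts alongside the frame error amount, which mirrors the small circuit increase claimed in the text.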
[0199]
According to the eighth and ninth motion vector detection apparatuses of the present invention, the arithmetic unit obtains, for the input set of reference data and set of encoded data, any two of three types of error amounts: the error amount for the pixels at even-numbered positions, the error amount for the pixels at odd-numbered positions, and the error amount for all pixels. The cumulative addition array adds these two types of error amounts independently, each by a cumulative addition structure. As a result, the frame error amount and the two types of field error amounts for the same prediction block candidate can be calculated simultaneously.
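A small sketch of the three per-set error amounts may help here. It assumes the error amount is a sum of absolute differences and that positions are counted from zero; the name `parity_errors` is hypothetical. Because the all-pixel amount equals the sum of the other two, any two of the three are enough to recover the third.

```python
def parity_errors(ref_set, enc_set):
    """Return (even-position error, odd-position error, all-pixel error) for one set."""
    diffs = [abs(r - t) for r, t in zip(ref_set, enc_set)]
    even = sum(diffs[0::2])  # pixels at positions 0, 2, 4, ...
    odd = sum(diffs[1::2])   # pixels at positions 1, 3, 5, ...
    return even, odd, even + odd
```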
[0200]
The tenth motion vector detection apparatus of the present invention comprises: a coding block register that, taking M pixel data of the same field as one set, outputs N/2 sets of encoded data of the first field and N/2 sets of encoded data of the second field; at least two reference registers, one for the first field and one for the second field, each of which stores M pixels in the same field of the reference image and outputs them as one set of reference data; four field evaluation means, each of which obtains a field error amount from one set of reference data and N/2 sets of encoded data; a first adder that adds the field error amount obtained from the first-field reference data and the first-field encoded data to the field error amount obtained from the second-field reference data and the second-field encoded data; and a second adder that adds the field error amount obtained from the first-field reference data and the second-field encoded data to the field error amount obtained from the second-field reference data and the first-field encoded data. Because the field error amounts of all field combinations of the coding block and the prediction block candidate are obtained by a parallel structure, two types of field error amounts and the frame error amount given by their addition can be obtained. Although four field evaluation means are required, each is only about half the size needed for frame evaluation, so the increase in overall circuit scale is slight. Because the processing is performed in parallel, twice the processing speed is achieved, which is extremely effective when motion vectors are detected within a practical search range for a practical video signal.
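The arrangement of four field evaluations feeding two adders can be modeled as below. This is only a sketch: `field_error` collapses the N/2 arithmetic units and their cumulative addition into a plain sum of absolute differences, no timing is modeled, and the names `field_error` and `evaluate_fields` are hypothetical.

```python
def field_error(ref_lines, enc_lines):
    """Field error amount: sum of absolute differences over all lines of one field."""
    return sum(
        abs(r - t)
        for ref_line, enc_line in zip(ref_lines, enc_lines)
        for r, t in zip(ref_line, enc_line)
    )


def evaluate_fields(ref_f1, ref_f2, enc_f1, enc_f2):
    """Evaluate all four reference/encoded field pairings and the two adder outputs."""
    e11 = field_error(ref_f1, enc_f1)  # first-field reference, first-field encoded
    e22 = field_error(ref_f2, enc_f2)  # second-field reference, second-field encoded
    e12 = field_error(ref_f1, enc_f2)  # first-field reference, second-field encoded
    e21 = field_error(ref_f2, enc_f1)  # second-field reference, first-field encoded
    first_sum = e11 + e22   # role of the first adder
    second_sum = e12 + e21  # role of the second adder
    return (e11, e22, e12, e21), first_sum, second_sum
```

In hardware the four evaluations run side by side; the sequential calls here only reproduce the arithmetic, not the parallelism.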
[0201]
The program of the present invention is a program for causing a computer to execute the functions of all or part of the means (or devices, elements, etc.) of the motion vector detection device of the present invention described above, and is a program that operates in cooperation with the computer.
[0202]
The program of the present invention is also a program for causing a computer to execute the operations of all or some of the steps (or processes, operations, actions, etc.) of the motion vector detection method of the present invention described above, and is a program that operates in cooperation with the computer.
[0203]
The recording medium of the present invention is a recording medium carrying a program for causing a computer to execute all or part of the functions of all or part of the means (or devices, elements, etc.) of the motion vector detection device of the present invention described above. The recording medium is computer-readable, and the program read from it executes the functions in cooperation with the computer.
[0204]
In addition, the recording medium of the present invention is a recording medium carrying a program for causing a computer to execute all or part of the operations of all or some of the steps (or processes, operations, actions, etc.) of the motion vector detection method of the present invention described above. The recording medium is computer-readable, and the program read from it executes the operations in cooperation with the computer.
[0205]
The “part of the means (or devices, elements, etc.)” of the present invention means one or several of the plurality of means, and the “part of the steps (or processes, operations, actions, etc.)” of the present invention means one or several of the plurality of steps.
[0206]
Further, the “functions of the means (or devices, elements, etc.)” of the present invention means all or part of the functions of the means, and the “operations of the steps (or processes, operations, actions, etc.)” of the present invention means all or part of the operations of the steps.
[0207]
Further, one usage form of the program of the present invention may be an aspect in which the program is recorded on a computer-readable recording medium and operates in cooperation with the computer.
[0208]
Further, one use form of the program of the present invention may be an aspect in which the program is transmitted through a transmission medium, read by a computer, and operated in cooperation with the computer.
[0209]
Recording media include ROM and the like; transmission media include the Internet and other transmission media, as well as light, radio waves, sound waves, and the like.
[0210]
The computer of the present invention described above is not limited to pure hardware such as a CPU, but may include firmware, an OS, and peripheral devices.
[0211]
As described above, the configuration of the present invention may be realized by software or hardware.
[0212]
[Effect of the invention]
The present invention has an advantage that, for example, the circuit scale can be further reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram of a motion vector detection apparatus according to a first embodiment.
FIG. 2 is a block diagram showing a configuration of an arithmetic unit 7 according to Embodiment 1.
FIG. 3 is an area relation diagram of an image according to the first embodiment.
FIG. 4 is an operation timing chart of Embodiment 1.
FIG. 5 is a reference data area diagram according to the first embodiment.
FIG. 6 is a reference data area diagram according to the first embodiment.
FIG. 7 is a reference data area diagram according to the first embodiment.
FIG. 8 is a block diagram of a motion vector detection device according to the second embodiment.
FIG. 9 is a block diagram showing a configuration of an arithmetic unit 108 according to the second embodiment.
FIG. 10 is a timing chart of the operation in the second embodiment.
FIG. 11 is a reference data area diagram according to the second embodiment.
FIG. 12 is a reference data area diagram according to the second embodiment.
FIG. 13 is a reference data area diagram according to the second embodiment.
FIG. 14 is a block diagram of a motion vector detection device according to Embodiment 3.
FIG. 15 is a timing chart of the operation in the third embodiment.
FIG. 16 is a reference data area diagram according to the second embodiment.
FIG. 17 is a reference data area diagram according to the second embodiment.
FIG. 18 is a reference data area diagram according to the second embodiment.
FIG. 19 is a block diagram of a motion vector detection device according to the fourth embodiment.
FIG. 20 is a block diagram of a motion vector detection device according to the fifth embodiment.
FIG. 21 is a timing chart of the operation in the fifth embodiment.
FIG. 22 is a timing chart of the operation in the fifth embodiment.
FIG. 23 is a block diagram of a motion vector detection device according to the sixth embodiment.
FIG. 24 is a timing chart of the operation in the sixth embodiment.
FIG. 25 is a block diagram of a motion vector detection device according to the seventh embodiment.
FIG. 26 is a block diagram illustrating a configuration of an arithmetic unit 602 according to Embodiment 7.
FIG. 27 is a timing chart of the operation in the seventh embodiment.
FIG. 28 is a block diagram illustrating another configuration of the arithmetic unit 602 according to the seventh embodiment.
FIG. 29 is a block diagram of a motion vector detection device according to Embodiment 8.
FIG. 30 is a timing chart of the operation in the eighth embodiment.
FIG. 31 is a reference data area diagram according to the eighth embodiment.
FIG. 32 is a reference data area diagram according to the eighth embodiment.
FIG. 33 is a reference data area diagram according to the eighth embodiment.
FIG. 34 is a reference data area diagram according to the eighth embodiment.
FIG. 35 is a block diagram of a motion vector detection device in Conventional Example 1.
FIG. 36 is a block diagram showing a configuration of PE 805 of Conventional Example 1.
FIG. 37 is a block diagram showing a configuration of a calculation data delay unit 811 of Conventional Example 1.
FIG. 38 is an image area relationship diagram of Conventional Example 1.
FIG. 39 is an operation timing chart of Conventional Example 1.
FIG. 40 is an operation timing chart of Conventional Example 1.
FIG. 41 is an operation timing chart of Conventional Example 1.
FIG. 42 is a block diagram of a motion vector detection device in Conventional Example 2.
FIG. 43 is a block diagram showing the configuration of PE 847 of Conventional Example 2.
FIG. 44 is a timing chart of the operation of Conventional Example 2.
[Explanation of reference signs]
1, 101, 201, 301, 401, 701, 702, 703 Reference register
2, 102, 402, 601 Coding block register
3, 4, 5, 103, 104, 105, 106, 403, 404, 405 Encoded small block register
6, 107, 202, 302, 406, 704, 705, 706, 707
7, 8, 9, 108, 109, 110, 111, 602, 603, 604, 708, 709, 710, 711, 712, 713, 714, 715 Arithmetic unit
10, 112, 203, 303, 501, 605, 716, 717, 718, 719 Cumulative addition array
11, 13, 113 Delay device
12, 14, 19, 20, 21, 114, 505, 608, 609, 610, 611, 720, 721 Adder
15, 16, 17, 18 Difference absolute value calculator
407, 408, 409 Switch
410 Mode controller
502 Frame addition array
503, 606, 607 Field addition array
504 Two-cycle delay device
506 Subtractor
801, 802, 803, 804, 819, 820, 821, 822, 829, 830, 831, 832, 833, 834, 835, 836, 838, 839, 840, 841, 853, 854, 855, 856 Register
805, 806, 807, 808, 809, 810, 847, 848, 849 Arithmetic unit
811, 812, 813, 814, 850, 851 Arithmetic data delay unit
815, 816, 817, 818, 852 Terminal
823, 824, 825, 826 Difference absolute value calculator
827, 828 Adder
837 Timing control unit
842 Pixel data delay unit
843, 844, 845, 846, 857, 858, 859, 860 Selector

Claims

A motion vector detection method comprising:
an encoding block output step of storing pixel data constituting a coding block, which is a rectangular area on the coded image, and outputting N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a reference data output step of temporarily storing M pixels of a reference image and outputting them as one set of reference data, the step performing at least one of (1) control for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M pieces of data arranged vertically on the reference image, and (2) control for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M pieces of data arranged horizontally on the reference image;
a calculation step of calculating, using 1 × N arithmetic units each of which calculates an error amount between one set of the reference data and one set of the encoded data, the error amounts of all combinations of the one set of reference data and the N sets of encoded data; and
a cumulative addition step of obtaining the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

A motion vector detection apparatus comprising:
an encoding block register that stores pixel data constituting a coding block, which is a rectangular area on the coded image, and outputs N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a first reference register that temporarily stores M pixels of a reference image and outputs them as one set of reference data, the first reference register having at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M pieces of data arranged horizontally on the reference image;
1 × N arithmetic units, each of which calculates an error amount between one set of the reference data and one set of the encoded data, and which calculate the error amounts of all combinations of the one set of reference data and the N sets of encoded data; and
a cumulative addition array that obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

A motion vector detection method comprising:
an encoding block output step of storing pixel data constituting a coding block, which is a rectangular area on the coded image, and outputting N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a reference data output step of temporarily storing M + Q - 1 pixels of a reference image and outputting Q sets of reference data, each set consisting of M consecutive pixels, the step performing at least one of (1) control for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M + Q - 1 pieces of data arranged vertically on the reference image, and (2) control for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M + Q - 1 pieces of data arranged horizontally on the reference image;
a calculation step of calculating, using Q × N arithmetic units each of which calculates an error amount between one set of the reference data and one set of the encoded data, the error amounts of all combinations of the Q sets of reference data and the N sets of encoded data; and
a cumulative addition step of obtaining, using Q cumulative addition arrays, the sum of N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

A motion vector detection apparatus comprising:
an encoding block register that stores pixel data constituting a coding block, which is a rectangular area on the coded image, and outputs N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a first reference register that temporarily stores M + Q - 1 pixels of a reference image and outputs Q sets of reference data, each set consisting of M consecutive pixels, the first reference register having at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M + Q - 1 pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M + Q - 1 pieces of data arranged horizontally on the reference image;
Q × N arithmetic units, each of which calculates an error amount between one set of the reference data and one set of the encoded data, and which calculate the error amounts of all combinations of the Q sets of reference data and the N sets of encoded data; and
Q cumulative addition arrays, each of which obtains the sum of N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

The motion vector detection device according to claim 2, further comprising:
a second reference register different from the first reference register;
a reference data changeover switch for selecting either the reference data supplied from the first reference register or the reference data supplied from the second reference register; and
mode control means for switching the reference data changeover switches in order, starting from the arithmetic unit for which valid computation before the transition has been completed, at the time of transition from a first mode, in which the first reference register sequentially updates the reference data and supplies the reference data to the arithmetic units, to a second mode, in which the second reference register sequentially updates the reference data and supplies the reference data to the arithmetic units.

The motion vector detection apparatus according to claim 5, wherein the mode control means, when storing the data of a new encoding block in the encoding block register, stores the data of the new encoding block in the encoding block register one set at a time in synchronization with the switching operation of the reference data changeover switch.

The motion vector detection device according to claim 2, wherein the cumulative addition array comprises: (a) a frame addition array that cumulatively adds the N error amounts by delaying the addition result of the error amounts of the arithmetic units by one cycle and adding it to the error amount of the adjacent set of encoded data; (b) a field addition array that adds, by the cumulative addition structure, the error amounts of the N/2 even-numbered or odd-numbered arithmetic units while delaying by two cycles; and arithmetic means for obtaining the difference between the results of the frame addition array and the field addition array.

The arithmetic unit obtains, for the input set of reference data and set of encoded data, two types of error amounts: an error amount for the pixels at even-numbered positions and an error amount for the pixels at odd-numbered positions,
5. The cumulative addition array includes: (a) a first field addition array that adds the two types of error amounts independently by a cumulative addition structure; and (b) a second field addition array. , 5 and 6.

The motion vector detection device according to claim 2, wherein the arithmetic unit obtains, for the input set of reference data and set of encoded data, two types of error amounts: a first error amount for the pixels at even-numbered positions or the pixels at odd-numbered positions, and a second error amount for all the pixels; and
the cumulative addition array comprises: (a) a field addition array that independently cumulatively adds the first error amount; (b) a frame addition array that independently cumulatively adds the second error amount; and calculation means for obtaining the difference between the results of the field addition array and the frame addition array.

A motion vector detection device comprising:
an encoding block register that stores pixel data constituting an encoding block, which is a rectangular area on the encoded image, and that, taking M pieces of pixel data in the same field as one set, outputs N/2 sets of encoded data of the first field and N/2 sets of encoded data of the second field;
reference registers, corresponding to the first field and the second field, each of which stores M pixel data in the same field of the reference image and outputs the data as one set of reference data;
field evaluation means capable of obtaining a field error amount from one set of the reference data and N/2 sets of the encoded data;
a first adder that adds the field error amount for the reference data of the first field and the encoded data of the first field and the field error amount for the reference data of the second field and the encoded data of the second field; and
a second adder that adds the field error amount for the reference data of the first field and the encoded data of the second field and the field error amount for the reference data of the second field and the encoded data of the first field,
wherein the reference register has at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M pieces of data arranged horizontally on the reference image, and
the field evaluation means has N/2 arithmetic units for calculating the error amounts of all combinations of one set of the reference data and N/2 sets of the encoded data, obtains the sum of the N/2 error amounts by a cumulative addition structure, and outputs the sum as the field error amount.

A program for causing a computer to function as the following elements of the motion vector detecting device according to claim 2:
an encoding block register that stores pixel data constituting a coding block, which is a rectangular area on a coded image, and outputs N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a first reference register that temporarily stores M pixels of a reference image and outputs them as one set of reference data, the first reference register having at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M pieces of data arranged horizontally on the reference image;
1 × N arithmetic units, each of which calculates an error amount between one set of the reference data and one set of the encoded data, and which calculate the error amounts of all combinations of the one set of reference data and the N sets of encoded data; and
a cumulative addition array that obtains the sum of the N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

A program for causing a computer to function as the following elements of the motion vector detecting device according to claim 4:
an encoding block register that stores pixel data constituting a coding block, which is a rectangular area on the coded image, and outputs N sets of encoded data, each set consisting of M pixel data arranged in one vertical column or one horizontal row in the coding block;
a first reference register that temporarily stores M + Q - 1 pixels of a reference image and outputs Q sets of reference data, each set consisting of M consecutive pixels, the first reference register having at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M + Q - 1 pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M + Q - 1 pieces of data arranged horizontally on the reference image;
Q × N arithmetic units, each of which calculates an error amount between one set of the reference data and one set of the encoded data, and which calculate the error amounts of all combinations of the Q sets of reference data and the N sets of encoded data; and
Q cumulative addition arrays, each of which obtains the sum of N error amounts by a cumulative addition structure in which the error amount of the encoded data set located at the end of the coding block is delayed by one cycle and added to the error amount of the adjacent encoded data set, and the addition result is then sequentially delayed by one cycle and added to the adjacent error amount.

A program for causing a computer to function as the following elements of the motion vector detecting device according to claim 10:
an encoding block register that stores pixel data constituting a coding block, which is a rectangular area on a coded image, and that, taking M pieces of pixel data in the same field as one set, outputs N/2 sets of encoded data of the first field and N/2 sets of encoded data of the second field;
reference registers, corresponding to the first field and the second field, each of which stores M pixel data in the same field of the reference image and outputs the data as one set of reference data;
field evaluation means capable of obtaining a field error amount from one set of the reference data and N/2 sets of the encoded data;
a first adder that adds the field error amount for the reference data of the first field and the encoded data of the first field and the field error amount for the reference data of the second field and the encoded data of the second field; and
a second adder that adds the field error amount for the reference data of the first field and the encoded data of the second field and the field error amount for the reference data of the second field and the encoded data of the first field,
wherein the reference register has at least one of (1) a control function for taking out and storing the reference data while sequentially shifting in the horizontal direction on the reference image when the reference data is M pieces of data arranged vertically on the reference image, and (2) a control function for taking out and storing the reference data while sequentially shifting in the vertical direction on the reference image when the reference data is M pieces of data arranged horizontally on the reference image, and
the field evaluation means has N/2 arithmetic units for calculating the error amounts of all combinations of one set of the reference data and N/2 sets of the encoded data, obtains the sum of the N/2 error amounts by a cumulative addition structure, and outputs the sum as the field error amount.

A computer-processable recording medium carrying the program according to any one of claims 11 to 13.